agent_inspect.metrics.multi_samples package
Submodules
agent_inspect.metrics.multi_samples.multi_sample_metric module
- class agent_inspect.metrics.multi_samples.multi_sample_metric.MultiSampleMetric(config=None)[source]
Bases: ABC
Base abstract class for metrics that aggregate results across multiple samples or trials.
Concrete subclasses should implement logic that combines multiple NumericalScore objects into a single aggregated score.
- Parameters:
config (Optional[Dict[str, Any]]) – Optional configuration dictionary for metric initialization. Defaults to None.
- abstract compute(scorer_results)[source]
Computes an aggregated metric score from multiple scorer results.
This method is intended to be implemented by concrete subclasses that define how multiple trial-level or sample-level NumericalScore objects should be combined (for example, pass@k-style metrics).
- Parameters:
scorer_results (List[NumericalScore]) – A list of NumericalScore objects produced by scorer metrics, one per trial or sample.
- Returns:
A NumericalScore object containing the aggregated result.
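A concrete subclass might look like the following sketch. The NumericalScore and MultiSampleMetric definitions below are simplified stand-ins for illustration only, not the library's actual classes, and the MeanScore aggregator is a hypothetical example:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Simplified stand-ins for illustration; the real agent_inspect
# classes may carry additional fields and validation.
@dataclass
class NumericalScore:
    score: float

class MultiSampleMetric(ABC):
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        self.config = config or {}

    @abstractmethod
    def compute(self, scorer_results: List[NumericalScore]) -> NumericalScore:
        ...

class MeanScore(MultiSampleMetric):
    """Toy aggregator: averages the per-trial scores."""
    def compute(self, scorer_results: List[NumericalScore]) -> NumericalScore:
        values = [r.score for r in scorer_results]
        return NumericalScore(score=sum(values) / len(values))

print(MeanScore().compute([NumericalScore(1), NumericalScore(0)]).score)  # 0.5
```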
agent_inspect.metrics.multi_samples.pass_at_k module
- class agent_inspect.metrics.multi_samples.pass_at_k.PassAtK(config=None)[source]
Bases: MultiSampleMetric
Metric to calculate pass@k: the probability that at least one of k randomly sampled trials is successful.
\[pass@k = 1 - \frac{\binom{n-s}{k}}{\binom{n}{k}}\]
- where:
n: total number of trials
s: number of successful trials
k: number of samples drawn
- Parameters:
k – Number of samples to draw (default: None, must be set before evaluation)
config (Dict[str, Any] | None)
- compute(success_scores)[source]
Computes the pass@k metric given a list of success scores from multiple trials.
The pass@k metric represents the probability that at least one of k randomly selected trials is successful, based on the total number of trials and the number of successful trials observed.
Configuration values are retrieved from the metric config, falling back to defaults if not explicitly provided.
- Parameters:
success_scores (List[NumericalScore]) – A list of NumericalScore objects, one per trial, where each score indicates success (typically 0 or 1).
- Return type:
NumericalScore
- Returns:
A NumericalScore object containing the computed pass@k value.
- Raises:
agent_inspect.exception.EvaluationError –
If k_value is less than or equal to 0
If num_trials is less than or equal to 0
If the number of provided success scores does not match num_trials
If k_value is greater than num_trials
Example:
>>> from agent_inspect.metrics.multi_samples import PassAtK
>>> from agent_inspect.models.metrics import NumericalScore
>>> from agent_inspect.metrics.constants import K_VALUE, NO_OF_TRIALS
>>>
>>> metric = PassAtK(config={K_VALUE: 2, NO_OF_TRIALS: 5})
>>> scores = [NumericalScore(score=1), NumericalScore(score=0),
...           NumericalScore(score=1), NumericalScore(score=0),
...           NumericalScore(score=0)]
>>> result = metric.compute(scores)
>>> print(result.score)
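The formula can be sanity-checked with a small standalone function, a sketch using Python's math.comb rather than the library's implementation. With n = 5 trials, s = 2 successes, and k = 2 as in the example above, pass@k = 1 - C(3, 2)/C(5, 2) = 1 - 3/10 = 0.7:

```python
from math import comb

def pass_at_k(n: int, s: int, k: int) -> float:
    """Probability that at least one of k sampled trials succeeds."""
    if k <= 0 or n <= 0 or k > n:
        raise ValueError("require 0 < k <= n")
    # Complement: the chance that all k samples land on the n - s failures.
    return 1.0 - comb(n - s, k) / comb(n, k)

print(pass_at_k(5, 2, 2))  # n=5 trials, s=2 successes, k=2
```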
agent_inspect.metrics.multi_samples.pass_hat_k module
- class agent_inspect.metrics.multi_samples.pass_hat_k.PassHatK(config=None)[source]
Bases: MultiSampleMetric
Metric to calculate pass^k: the probability that exactly k randomly sampled trials are successful.
\[pass^k = \frac{\binom{s}{k}}{\binom{n}{k}}\]
- where:
n: total number of trials
s: number of successful trials
k: number of samples drawn
- Parameters:
k – Number of samples to draw (default: None, must be set before evaluation)
config (Dict[str, Any] | None)
- compute(success_scores)[source]
Computes the pass^k metric given a list of success scores from multiple trials.
The pass^k metric represents the probability that exactly k randomly selected trials are successful, based on the total number of trials and the number of successful trials observed.
Configuration values are retrieved from the metric config, falling back to defaults if not explicitly provided.
- Parameters:
success_scores (List[NumericalScore]) – A list of NumericalScore objects, one per trial, where each score indicates success (typically 0 or 1).
- Return type:
NumericalScore
- Returns:
A NumericalScore object containing the computed pass^k value.
- Raises:
agent_inspect.exception.EvaluationError –
If k_value is less than or equal to 0
If num_trials is less than or equal to 0
If the number of provided success scores does not match num_trials
If k_value is greater than num_trials
Example:
>>> from agent_inspect.metrics.multi_samples import PassHatK
>>> from agent_inspect.models.metrics import NumericalScore
>>> from agent_inspect.metrics.constants import K_VALUE, NO_OF_TRIALS
>>>
>>> metric = PassHatK(config={K_VALUE: 2, NO_OF_TRIALS: 5})
>>> scores = [NumericalScore(score=1), NumericalScore(score=1),
...           NumericalScore(score=0), NumericalScore(score=0),
...           NumericalScore(score=0)]
>>> result = metric.compute(scores)
>>> print(result.score)
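As with pass@k, the formula can be checked with a standalone sketch (again using math.comb, not the library's code). With n = 5, s = 2, and k = 2 as in the example above, pass^k = C(2, 2)/C(5, 2) = 1/10 = 0.1:

```python
from math import comb

def pass_hat_k(n: int, s: int, k: int) -> float:
    """Probability that all k sampled trials are successful."""
    if k <= 0 or n <= 0 or k > n:
        raise ValueError("require 0 < k <= n")
    # All k samples must be drawn from the s successful trials.
    return comb(s, k) / comb(n, k)

print(pass_hat_k(5, 2, 2))  # 0.1
```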
Module contents
- class agent_inspect.metrics.multi_samples.MultiSampleMetric(config=None)[source]
Bases: ABC
Base abstract class for metrics that aggregate results across multiple samples or trials.
Concrete subclasses should implement logic that combines multiple NumericalScore objects into a single aggregated score.
- Parameters:
config (Optional[Dict[str, Any]]) – Optional configuration dictionary for metric initialization. Defaults to None.
- abstract compute(scorer_results)[source]
Computes an aggregated metric score from multiple scorer results.
This method is intended to be implemented by concrete subclasses that define how multiple trial-level or sample-level NumericalScore objects should be combined (for example, pass@k-style metrics).
- Parameters:
scorer_results (List[NumericalScore]) – A list of NumericalScore objects produced by scorer metrics, one per trial or sample.
- Returns:
A NumericalScore object containing the aggregated result.
- class agent_inspect.metrics.multi_samples.PassAtK(config=None)[source]
Bases: MultiSampleMetric
Metric to calculate pass@k: the probability that at least one of k randomly sampled trials is successful.
\[pass@k = 1 - \frac{\binom{n-s}{k}}{\binom{n}{k}}\]
- where:
n: total number of trials
s: number of successful trials
k: number of samples drawn
- Parameters:
k – Number of samples to draw (default: None, must be set before evaluation)
config (Dict[str, Any] | None)
- compute(success_scores)[source]
Computes the pass@k metric given a list of success scores from multiple trials.
The pass@k metric represents the probability that at least one of k randomly selected trials is successful, based on the total number of trials and the number of successful trials observed.
Configuration values are retrieved from the metric config, falling back to defaults if not explicitly provided.
- Parameters:
success_scores (List[NumericalScore]) – A list of NumericalScore objects, one per trial, where each score indicates success (typically 0 or 1).
- Return type:
NumericalScore
- Returns:
A NumericalScore object containing the computed pass@k value.
- Raises:
agent_inspect.exception.EvaluationError –
If k_value is less than or equal to 0
If num_trials is less than or equal to 0
If the number of provided success scores does not match num_trials
If k_value is greater than num_trials
Example:
>>> from agent_inspect.metrics.multi_samples import PassAtK
>>> from agent_inspect.models.metrics import NumericalScore
>>> from agent_inspect.metrics.constants import K_VALUE, NO_OF_TRIALS
>>>
>>> metric = PassAtK(config={K_VALUE: 2, NO_OF_TRIALS: 5})
>>> scores = [NumericalScore(score=1), NumericalScore(score=0),
...           NumericalScore(score=1), NumericalScore(score=0),
...           NumericalScore(score=0)]
>>> result = metric.compute(scores)
>>> print(result.score)
- class agent_inspect.metrics.multi_samples.PassHatK(config=None)[source]
Bases: MultiSampleMetric
Metric to calculate pass^k: the probability that exactly k randomly sampled trials are successful.
\[pass^k = \frac{\binom{s}{k}}{\binom{n}{k}}\]
- where:
n: total number of trials
s: number of successful trials
k: number of samples drawn
- Parameters:
k – Number of samples to draw (default: None, must be set before evaluation)
config (Dict[str, Any] | None)
- compute(success_scores)[source]
Computes the pass^k metric given a list of success scores from multiple trials.
The pass^k metric represents the probability that exactly k randomly selected trials are successful, based on the total number of trials and the number of successful trials observed.
Configuration values are retrieved from the metric config, falling back to defaults if not explicitly provided.
- Parameters:
success_scores (List[NumericalScore]) – A list of NumericalScore objects, one per trial, where each score indicates success (typically 0 or 1).
- Return type:
NumericalScore
- Returns:
A NumericalScore object containing the computed pass^k value.
- Raises:
agent_inspect.exception.EvaluationError –
If k_value is less than or equal to 0
If num_trials is less than or equal to 0
If the number of provided success scores does not match num_trials
If k_value is greater than num_trials
Example:
>>> from agent_inspect.metrics.multi_samples import PassHatK
>>> from agent_inspect.models.metrics import NumericalScore
>>> from agent_inspect.metrics.constants import K_VALUE, NO_OF_TRIALS
>>>
>>> metric = PassHatK(config={K_VALUE: 2, NO_OF_TRIALS: 5})
>>> scores = [NumericalScore(score=1), NumericalScore(score=1),
...           NumericalScore(score=0), NumericalScore(score=0),
...           NumericalScore(score=0)]
>>> result = metric.compute(scores)
>>> print(result.score)