agent_inspect.metrics.multi_samples package

Submodules

agent_inspect.metrics.multi_samples.multi_sample_metric module

class agent_inspect.metrics.multi_samples.multi_sample_metric.MultiSampleMetric(config=None)[source]

Bases: ABC

Base abstract class for metrics that aggregate results across multiple samples or trials.

Concrete subclasses should implement logic that combines multiple NumericalScore objects into a single aggregated score.

Parameters:

config (Optional[Dict[str, Any]]) – Optional configuration dictionary for metric initialization. Defaults to None.

abstract compute(scorer_results)[source]

Computes an aggregated metric score from multiple scorer results.

This method is intended to be implemented by concrete subclasses that define how multiple trial-level or sample-level NumericalScore objects should be combined (for example, pass@k-style metrics).

Parameters:

scorer_results (List[NumericalScore]) – A list of NumericalScore objects produced by scorer metrics, one per trial or sample.

Returns:

A NumericalScore object containing the aggregated result.
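As a concrete illustration, a subclass that averages trial scores could be sketched as follows. Note that MeanScore is a hypothetical example, and the NumericalScore and MultiSampleMetric definitions here are simplified stand-ins for the real classes in agent_inspect:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class NumericalScore:
    """Simplified stand-in for agent_inspect.models.metrics.NumericalScore."""
    score: float

class MultiSampleMetric(ABC):
    """Simplified stand-in mirroring the documented base class."""
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        self.config = config or {}

    @abstractmethod
    def compute(self, scorer_results: List[NumericalScore]) -> NumericalScore:
        ...

class MeanScore(MultiSampleMetric):
    """Hypothetical aggregator: averages the per-trial scores."""
    def compute(self, scorer_results: List[NumericalScore]) -> NumericalScore:
        mean = sum(r.score for r in scorer_results) / len(scorer_results)
        return NumericalScore(score=mean)

# Averaging trial scores 1, 0, 1 yields roughly 0.667.
print(MeanScore().compute([NumericalScore(1), NumericalScore(0), NumericalScore(1)]).score)
```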

agent_inspect.metrics.multi_samples.pass_at_k module

class agent_inspect.metrics.multi_samples.pass_at_k.PassAtK(config=None)[source]

Bases: MultiSampleMetric

Metric to calculate pass@k: the probability that at least one of k randomly sampled trials is successful.

\[pass@k = 1 - \frac{\binom{n-s}{k}}{\binom{n}{k}}\]
where:
  • n: total number of trials

  • s: number of successful trials

  • k: number of samples drawn
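The closed form above can be verified with a few lines of plain Python using only the standard library; this sketch is independent of the PassAtK class:

```python
from math import comb

def pass_at_k(n: int, s: int, k: int) -> float:
    """pass@k = 1 - C(n - s, k) / C(n, k): probability that at least
    one of k trials drawn without replacement from n is successful."""
    if not (0 < k <= n):
        raise ValueError("require 0 < k <= n")
    return 1.0 - comb(n - s, k) / comb(n, k)

# n = 5 trials, s = 2 successes, k = 2 samples:
# 1 - C(3, 2) / C(5, 2) = 1 - 3/10 = 0.7
print(pass_at_k(5, 2, 2))
```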

Parameters:
  • k – Number of samples to draw, supplied via the K_VALUE key of config (default: None, must be set before evaluation)

  • config (Dict[str, Any] | None)

compute(success_scores)[source]

Computes the pass@k metric given a list of success scores from multiple trials.

The pass@k metric represents the probability that at least one of k randomly selected trials is successful, based on the total number of trials and the number of successful trials observed.

Configuration values are retrieved from the metric config, falling back to defaults if not explicitly provided.

Parameters:

success_scores (List[NumericalScore]) – A list of NumericalScore objects, one per trial, where each score indicates success (typically 0 or 1).

Return type:

NumericalScore

Returns:

A NumericalScore object containing the computed pass@k value.

Raises:

agent_inspect.exception.EvaluationError

  • If k_value is less than or equal to 0

  • If num_trials is less than or equal to 0

  • If the number of provided success scores does not match num_trials

  • If k_value is greater than num_trials

Example:

>>> from agent_inspect.metrics.multi_samples import PassAtK
>>> from agent_inspect.models.metrics import NumericalScore
>>> from agent_inspect.metrics.constants import K_VALUE, NO_OF_TRIALS
>>>
>>> metric = PassAtK(config={K_VALUE: 2, NO_OF_TRIALS: 5})
>>> scores = [NumericalScore(score=1), NumericalScore(score=0),
...           NumericalScore(score=1), NumericalScore(score=0),
...           NumericalScore(score=0)]
>>> result = metric.compute(scores)
>>> print(result.score)

agent_inspect.metrics.multi_samples.pass_hat_k module

class agent_inspect.metrics.multi_samples.pass_hat_k.PassHatK(config=None)[source]

Bases: MultiSampleMetric

Metric to calculate pass^k: the probability that all k randomly sampled trials are successful.

\[pass^k = \frac{\binom{s}{k}}{\binom{n}{k}}\]
where:
  • n: total number of trials

  • s: number of successful trials

  • k: number of samples drawn
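As with pass@k, the closed form above can be checked with plain standard-library Python; this sketch is independent of the PassHatK class:

```python
from math import comb

def pass_hat_k(n: int, s: int, k: int) -> float:
    """pass^k = C(s, k) / C(n, k): probability that all k trials
    drawn without replacement from n are successful."""
    if not (0 < k <= n):
        raise ValueError("require 0 < k <= n")
    return comb(s, k) / comb(n, k)

# n = 5 trials, s = 2 successes, k = 2 samples:
# C(2, 2) / C(5, 2) = 1/10 = 0.1
print(pass_hat_k(5, 2, 2))
```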

Parameters:
  • k – Number of samples to draw, supplied via the K_VALUE key of config (default: None, must be set before evaluation)

  • config (Dict[str, Any] | None)

compute(success_scores)[source]

Computes the pass^k metric given a list of success scores from multiple trials.

The pass^k metric represents the probability that all k randomly selected trials are successful, based on the total number of trials and the number of successful trials observed.

Configuration values are retrieved from the metric config, falling back to defaults if not explicitly provided.

Parameters:

success_scores (List[NumericalScore]) – A list of NumericalScore objects, one per trial, where each score indicates success (typically 0 or 1).

Return type:

NumericalScore

Returns:

A NumericalScore object containing the computed pass^k value.

Raises:

agent_inspect.exception.EvaluationError

  • If k_value is less than or equal to 0

  • If num_trials is less than or equal to 0

  • If the number of provided success scores does not match num_trials

  • If k_value is greater than num_trials

Example:

>>> from agent_inspect.metrics.multi_samples import PassHatK
>>> from agent_inspect.models.metrics import NumericalScore
>>> from agent_inspect.metrics.constants import K_VALUE, NO_OF_TRIALS
>>>
>>> metric = PassHatK(config={K_VALUE: 2, NO_OF_TRIALS: 5})
>>> scores = [NumericalScore(score=1), NumericalScore(score=1),
...           NumericalScore(score=0), NumericalScore(score=0),
...           NumericalScore(score=0)]
>>> result = metric.compute(scores)
>>> print(result.score)
