agent_inspect.metrics.scorer package

Submodules

agent_inspect.metrics.scorer.auc module

class agent_inspect.metrics.scorer.auc.AUC(llm_client, config=None)[source]

Bases: LLMBasedMetric

Metric to calculate the area under the progress curve produced by the ProgressScoresThroughTurns class. For computing the AUC, the discrete progress values are treated as a continuous, monotonically non-decreasing function obtained via linear interpolation.

\[AUC = \int_{0}^{T} p(t) \ dt, \]

where \(T\) is the maximum number of turns in a conversation and \(p(t) := progress(i, G_i, \tau_i[1:t])\) denotes the progress at turn \(t\) (refer to the documentation on ProgressScoresThroughTurns).
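The integral above can be illustrated with a short sketch. Since the curve is a piecewise-linear interpolation of the per-turn scores, the trapezoidal rule computes it exactly. The helper below is a hypothetical illustration, not part of the package; it assumes \(p(0)=0\) and that the input list holds \(p(1), ..., p(T)\).

```python
def auc_from_progress(progress):
    """Area under a piecewise-linear progress curve (illustrative sketch).

    Assumes p(0) = 0 and that `progress` holds p(1), ..., p(T); the
    trapezoidal rule is exact for a linear interpolant between integer turns.
    """
    points = [0.0] + list(progress)  # prepend p(0) = 0
    # Sum of trapezoid areas over unit-width turn intervals.
    return sum((points[t] + points[t + 1]) / 2.0 for t in range(len(points) - 1))
```

For example, progress scores of 0.5 and 1.0 over two turns yield an area of 0.25 + 0.75 = 1.0.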

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

    • max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns the area under the progress-score curve. Internally calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object containing the AUC score, sub scores consisting of progress scores at every turn, and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import AUC
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = AUC(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score) 
static get_auc_score_from_progress_scores(progress_scores)[source]

Computes the area under the progress-score curve given a list of progress scores at every conversational turn. The list of progress scores is obtained from the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

Parameters:

progress_scores (List[NumericalScore]) – a List [NumericalScore] object storing a list of progress scores at every conversational turn.

Return type:

NumericalScore

Returns:

a NumericalScore object containing the AUC score.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns, AUC
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> progress_turns_metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_VALIDATION_RESULTS: True,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> progress_rates = progress_turns_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> auc_metric_result = AUC.get_auc_score_from_progress_scores(progress_rates)
>>> print(auc_metric_result.score)     

agent_inspect.metrics.scorer.llm_based_metric module

class agent_inspect.metrics.scorer.llm_based_metric.LLMBasedMetric(llm_client, config=None)[source]

Bases: Metric

This is a base abstract class that should be extended for actual implementations.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) – configuration for metric initialization. Defaults to None.

abstract evaluate(agent_trace, evaluation_data_sample)[source]

This is an abstract method and should be implemented in a concrete class.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object or a List [NumericalScore] object.

agent_inspect.metrics.scorer.metric module

class agent_inspect.metrics.scorer.metric.Metric(config=None)[source]

Bases: ABC

This is a base abstract class that should be extended for actual implementations.

Parameters:

config (Optional[Dict[str, Any]]) – configuration for metric initialization. Defaults to None.

abstract evaluate(agent_trace, evaluation_data_sample)[source]

This is an abstract method and should be implemented in a concrete class.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object or a List [NumericalScore] object.

agent_inspect.metrics.scorer.ppt module

class agent_inspect.metrics.scorer.ppt.PPT(llm_client, config=None)[source]

Bases: ProgressBasedMetric

Metric to calculate the progress-per-turn (PPT) of the progress scores produced by the ProgressScoresThroughTurns class. The PPT metric is defined as the total increase in progress divided by the number of turns; it weights the increase in progress uniformly across the conversational turns.

\[PPT = \frac{1}{T} \sum_{t=0}^{T-1} \left( p(t+1)-p(t) \right) = \frac{p(T)}{T}, \]

where \(p(t) := progress(i, G_i, \tau_i[1:t])\) denotes the discrete progress at turn \(t\) (refer to the documentation on ProgressScoresThroughTurns), \(T\) is the minimum number of conversational turns to reach the final achieved progress \(p(T)\), and \(p(0)=0\).
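Because the increments telescope, the sum reduces to \(p(T)/T\), with \(T\) the earliest turn at which the final progress value is reached. A hypothetical sketch (not the package implementation) that computes PPT from the padded per-turn scores:

```python
def ppt_from_progress(progress):
    """Progress-per-turn from p(1), ..., p(max_turns); p(0) = 0 by convention.

    T is taken as the earliest turn whose progress equals the final value,
    since shorter conversations are padded with the final score up to
    max_turns.
    """
    final = progress[-1]
    # Earliest 1-based turn reaching the final achieved progress.
    T = next(t + 1 for t, p in enumerate(progress) if p == final)
    return final / T
```

For instance, scores [0.5, 1.0, 1.0] give final progress 1.0 first reached at turn 2, so PPT = 0.5.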

Parameters:
  • llm_client (Any) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

    • max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns the progress-per-turn (PPT) value of the list of progress scores. Internally calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object containing the PPT score, sub scores consisting of progress scores at every turn, and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import PPT
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = PPT(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)        
static get_ppt_score_from_progress_scores(progress_scores)[source]

Computes the progress-per-turn (PPT) value given a list of progress scores at every conversational turn. The list of progress scores is obtained from the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

Parameters:

progress_scores (List[NumericalScore]) – a List [NumericalScore] object storing a list of progress scores at every conversational turn.

Return type:

NumericalScore

Returns:

a NumericalScore object containing the PPT score.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns, PPT
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> progress_turns_metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_VALIDATION_RESULTS: True,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> progress_rates = progress_turns_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> ppt_metric_result = PPT.get_ppt_score_from_progress_scores(progress_rates)
>>> print(ppt_metric_result.score)    

agent_inspect.metrics.scorer.progress module

class agent_inspect.metrics.scorer.progress.ProgressBasedMetric(llm_client, config=None)[source]

Bases: LLMBasedMetric

Abstract class which should be extended for actual implementations of progress metrics.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) – configuration for progress metric initialization. Defaults to None.

abstract evaluate(agent_trace, evaluation_data_sample)[source]

This is an abstract method and should be implemented in a concrete class.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object or a List [NumericalScore] object.

class agent_inspect.metrics.scorer.progress.ProgressScore(llm_client, config=None)[source]

Bases: ProgressBasedMetric

Metric to calculate the agent’s progress for a given task sample based on the proportion of subgoals completed. This metric currently supports only static conversations, where the user utterances are predetermined.

\[progress(i, G_i, \tau_i)=\frac{1}{|G_i|} \sum_{j=1}^{|G_i|} LLM_{judge}(i, g_{i, j}, \tau_i ),\]

where \(LLM_{judge}(\cdot)\) is the output from the LLM-as-a-judge, \(G_i= \{ g_{i, 1}, ..., g_{i, j}, ..., g_{i, |G_i|} \}\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i\) is the agent trajectory for the entire conversation, consisting of tool calls, agent responses, and user inputs.
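As an illustration of the aggregation above: each subgoal receives a 0/1 verdict via a majority vote over the judge trials, and the progress score is the mean of those verdicts. The input shape below is a hypothetical simplification for illustration, not the library's API:

```python
def progress_from_votes(votes_per_subgoal):
    """Progress = fraction of subgoals judged complete.

    `votes_per_subgoal`: one list of 0/1 judge verdicts per subgoal, one
    verdict per LLM-as-a-judge trial (hypothetical input shape).
    """
    def majority(votes):
        # Strict majority of trials must mark the subgoal complete.
        return 1 if 2 * sum(votes) > len(votes) else 0

    verdicts = [majority(v) for v in votes_per_subgoal]
    return sum(verdicts) / len(verdicts)
```

With two subgoals whose trial verdicts are [1, 1, 0] and [0, 0, 1], the majority votes are 1 and 0, giving a progress score of 0.5.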

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user inputs, subgoal, trajectory, and agent responses. If this is not provided, the default template for static single-turn or static multi-turn conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a progress score given the agent trace and the evaluation data sample.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object containing progress score and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScore
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = ProgressScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
class agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns(llm_client, config=None)[source]

Bases: ProgressBasedMetric

Metric to calculate the agent’s progress at every conversational turn (up to the final conversational turn \(T\)) for a given task sample, based on the proportion of subgoals completed. Subgoals completed at the current or previous turns are not re-evaluated in subsequent turns: the metric assumes that previously completed subgoals are milestones that cannot be undone.

For every conversational turn \(t\) up to the final turn \(T\), the agent’s progress at turn \(t\) is computed as follows:

\[progress(i, G_i, \tau_i[1:t])=\frac{1}{|G_i|} \sum_{j=1}^{|G_i|} LLM_{judge}(i, g_{i, j}, \tau_i[1:t]), \]

where \(LLM_{judge}(\cdot)\) is the output from the LLM-as-a-judge, \(G_i= \{ g_{i, 1}, ..., g_{i, j}, ..., g_{i, |G_i|} \}\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i[1:t]\) is the segment of the agent trajectory from the first turn up to turn \(t\), consisting of tool calls, agent responses, and user inputs.
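The milestone assumption makes the per-turn scores non-decreasing. Below is a hypothetical sketch (assumed input shape, not the library's API) of how the per-turn curve follows from the first turn at which each subgoal is judged complete:

```python
def progress_through_turns(first_completed_turn, num_subgoals, max_turns):
    """Per-turn progress assuming completed subgoals stay completed.

    `first_completed_turn` maps subgoal index -> first turn (1-based) at
    which the judge marks it complete; never-completed subgoals are absent.
    (Hypothetical input shape for illustration.)
    """
    return [
        # At turn t, count every subgoal already completed by then.
        sum(1 for turn in first_completed_turn.values() if turn <= t) / num_subgoals
        for t in range(1, max_turns + 1)
    ]
```

With two subgoals completed at turns 1 and 3 and max_turns = 4, the curve is [0.5, 0.5, 1.0, 1.0].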

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • include_validation_results: a bool flag indicating whether the output should also return a List [SubGoalValidationResult]. This is used later for error analysis. Defaults to False.

    • include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

    • max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a list of progress scores at every turn until max_turns given the agent trace and the evaluation data sample.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a List [NumericalScore] object storing a list of progress scores at every turn until max_turns.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_VALIDATION_RESULTS: True,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result) # print list of NumericalScore objects

agent_inspect.metrics.scorer.success module

class agent_inspect.metrics.scorer.success.SuccessBasedMetric(llm_client, config=None)[source]

Bases: LLMBasedMetric

Abstract class which should be extended for actual implementations of success metrics.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) – configuration for success metric initialization. Defaults to None.

abstract evaluate(agent_trace, evaluation_data_sample)[source]

This is an abstract method and should be implemented in a concrete class.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object or a List [NumericalScore] object.

static get_success_score_from_progress_score(progress_score)[source]

Computes the success score given the progress score. Success score is 1 if the progress score is 1, and 0 otherwise.

Parameters:

progress_score (NumericalScore) – a NumericalScore object containing the progress score

Return type:

NumericalScore

Returns:

a NumericalScore object containing the success score and sub scores consisting of the progress score.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScore, SuccessBasedMetric
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> progress_metric = ProgressScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> progress_metric_result = progress_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> success_score = SuccessBasedMetric.get_success_score_from_progress_score(progress_metric_result)
>>> print(success_score)
static get_success_score_from_validation_results(validation_results)[source]

Aggregates a list of SubGoalValidationResult objects to compute a success score. Success score is 1 if all the validation results indicate success, and 0 otherwise.

Parameters:

validation_results (List[SubGoalValidationResult]) – a List [SubGoalValidationResult] object containing the result of subgoal validations.

Return type:

NumericalScore

Returns:

a NumericalScore object containing the success score and sub scores consisting of the progress score.
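Reduced to booleans, the aggregation rule is all-or-nothing: a single failed subgoal validation makes the success score 0. An illustrative simplification (the real method consumes SubGoalValidationResult objects, not plain booleans):

```python
def success_from_validation_results(subgoal_passed):
    """1 iff every subgoal validation succeeded (illustrative reduction).

    `subgoal_passed`: one boolean verdict per subgoal validation
    (hypothetical input shape for illustration).
    """
    return 1 if all(subgoal_passed) else 0
```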

class agent_inspect.metrics.scorer.success.SuccessScore(llm_client, config=None)[source]

Bases: SuccessBasedMetric

Metric to calculate the agent’s success rate for a given task sample based on the agent’s progress. This metric currently supports only static conversations, where the user utterances are predetermined.

\[success(i, G_i, \tau_i) = 1 \ \mathrm{if} \ progress(i, G_i, \tau_i)=1, \ \mathrm{and} \ 0 \ \mathrm{otherwise}, \]

where \(progress(i, G_i, \tau_i)\) is the progress score of the agent (refer to the documentation on ProgressScore), \(G_i\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i\) is the agent trajectory for the entire conversation, consisting of tool calls, agent responses, and user inputs.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user inputs, subgoal, trajectory, and agent responses. If this is not provided, the default template for static single-turn or static multi-turn conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a success score given the agent trace and the evaluation data sample. Internally calls the agent_inspect.metrics.scorer.progress.ProgressScore.evaluate and agent_inspect.metrics.scorer.success.SuccessBasedMetric.get_success_score_from_progress_score methods.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing the success score, sub scores consisting of the progress score, and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import SuccessScore
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = SuccessScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
class agent_inspect.metrics.scorer.success.SuccessScoreFinalTurn(llm_client, config=None)[source]

Bases: SuccessBasedMetric

Metric to calculate the agent’s success score for a given task sample based on the agent’s progress at the final conversational turn \(T\).

\[success(i, G_i, \tau_i[1:T]) = 1 \ \mathrm{if} \ progress(i, G_i, \tau_i[1:T])=1, \ \mathrm{and} \ 0 \ \mathrm{otherwise}, \]

where \(progress(i, G_i, \tau_i[1:T])\) is the progress score of the agent at the final conversational turn \(T\) (refer to the documentation on ProgressScoresThroughTurns), \(G_i\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i[1:T]\) is the segment of the agent trajectory from the first turn up to the final turn \(T\), consisting of tool calls, agent responses, and user inputs.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

    • max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a success score at the final conversational turn \(T\) given the agent trace and the evaluation data sample. Internally calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate and agent_inspect.metrics.scorer.success.SuccessBasedMetric.get_success_score_from_progress_score methods.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing the success score at the final turn \(T\), sub scores consisting of progress scores at every turn, and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import SuccessScoreFinalTurn
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = SuccessScoreFinalTurn(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)

agent_inspect.metrics.scorer.templates module

agent_inspect.metrics.scorer.tool_correctness module

class agent_inspect.metrics.scorer.tool_correctness.ToolCorrectnessMetric(llm_client, config=None)[source]

Bases: LLMBasedMetric

Metric to calculate the correctness rate of tool calls made by an agent in its entire dialogue trace. The final score is computed as the ratio of correct tool calls to the total number of tool calls made.

The tool correctness score \(\mathrm{tool\_correctness}(i, T_i, \tau_i)\) for sample \(i\) is defined as:

\[\mathrm{tool\_correctness}(i, T_i, \tau_i) = \frac{1}{N} \sum_{j=1}^N \mathbb{I}(T_{i,j} \text{ is correctly called in } \tau_i),\]

where \(\tau_i\) refers to the agent trajectory for sample \(i\), \(T_i = \{T_{i,1}, T_{i,2}, \ldots, T_{i,N}\}\) represents the set of \(N\) expected tool calls for sample \(i\), and \(\mathbb{I}(\cdot)\) is the indicator function that equals 1 if the \(j\)-th tool call \(T_{i,j}\) is correctly called by the agent, and 0 otherwise.

The correctness of each tool call is determined by validating the agent’s tool call against the expected tool call using exact match, LLM-as-a-judge, or both, depending on the configuration. Specifically, if an argument or parameter is set to value, exact match is used; if set to check, LLM-as-a-judge validates the correctness. The evaluation is performed across three dimensions: tool name, tool input arguments, and tool output. A tool call is considered correct only if all three dimensions are validated as correct.
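The value/check dispatch described above can be sketched as follows. The dictionary shapes and field names here are hypothetical illustrations, not the library's actual schema: arguments pinned to a value are compared exactly, while check arguments are delegated to a judge callback.

```python
def tool_call_correct(expected, actual, llm_judge):
    """1 if the actual call matches the expected spec (illustrative sketch).

    `expected["args"]` maps each argument to {"value": x} (exact match) or
    {"check": description} (validated via the `llm_judge` callback); the
    dict shapes are hypothetical, not the library's schema.
    """
    if expected["name"] != actual["name"]:
        return 0  # tool name dimension failed
    for arg, spec in expected["args"].items():
        got = actual["args"].get(arg)
        if "value" in spec:
            if got != spec["value"]:
                return 0  # exact-match argument failed
        elif not llm_judge(spec["check"], got):
            return 0  # LLM-as-a-judge rejected the argument
    return 1
```

The final metric would then average this indicator over the \(N\) expected tool calls, as in the formula above.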

Parameters:
  • llm_client (LLMClient) – The client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a tool correctness score given the agent trace and the evaluation data sample.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing the tool correctness score (float) and explanations.

Example:

>>> from agent_inspect.metrics.scorer import ToolCorrectnessMetric
>>> from agent_inspect.metrics.constants import NUM_JUDGE_TRIALS 
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # create client needed for LLM-based metric
>>> metric = ToolCorrectnessMetric(
...     llm_client=client,
...     config={
...         NUM_JUDGE_TRIALS: 5
...     }
... )
>>> tool_correctness_score = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(tool_correctness_score.score)

Module contents

class agent_inspect.metrics.scorer.AUC(llm_client, config=None)[source]

Bases: LLMBasedMetric

Metric to calculate the area under the progress curve produced by the ProgressScoresThroughTurns class. For computing the AUC, the discrete progress values are treated as a continuous, monotonically non-decreasing function obtained via linear interpolation.

\[AUC = \int_{0}^{T} p(t) \ dt, \]

where \(T\) is the maximum turns of a conversation and \(p(t) := progress(i, G_i, \tau_i[1:t])\) denotes the progress at turn \(t\) (refer to the documentation on ProgressScoresThroughTurns).
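With linear interpolation between consecutive integer turns, the integral above reduces to the trapezoidal rule over the per-turn progress values. The sketch below illustrates this; the plain list of floats standing in for the NumericalScore sub-scores, and the convention \(p(0)=0\), are assumptions for illustration.

```python
def auc_from_progress(progress):
    # progress[t-1] = p(t) for turns t = 1..T; p(0) is taken to be 0.
    values = [0.0] + list(progress)
    # Trapezoidal rule with unit spacing between consecutive turns.
    return sum((values[t] + values[t + 1]) / 2 for t in range(len(values) - 1))

print(auc_from_progress([0.25, 0.5, 0.5, 1.0]))  # 1.75
```

An agent that reaches high progress early accumulates more area than one that reaches the same final progress late, which is the property this metric rewards.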

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of LLM-as-a-judge runs is set to a value larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

    • max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns the area under the progress-score curve. Internally calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing AUC score, sub scores consisting of progress scores at every turn, and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import AUC
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = AUC(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score) 
static get_auc_score_from_progress_scores(progress_scores)[source]

Computes the area under the progress-score curve given a list of progress scores at every conversational turn as input. The list of progress scores is obtained from the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

Parameters:

progress_scores (List[NumericalScore]) – a List [NumericalScore] object storing a list of progress scores at every conversational turn.

Return type:

NumericalScore

Returns:

a NumericalScore object containing the AUC score.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns, AUC
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> progress_turns_metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_VALIDATION_RESULTS: True,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> progress_rates = progress_turns_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> auc_metric = AUC(llm_client=client)
>>> auc_metric_result = auc_metric.get_auc_score_from_progress_scores(progress_rates)   
>>> print(auc_metric_result.score)     
class agent_inspect.metrics.scorer.LLMBasedMetric(llm_client, config=None)[source]

Bases: Metric

This is a base abstract class that should be extended for actual implementations.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) – configuration for metric initialization. Default to None.

abstract evaluate(agent_trace, evaluation_data_sample)[source]

This is an abstract method and should be implemented in a concrete class.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object or a List [NumericalScore] object.

class agent_inspect.metrics.scorer.Metric(config=None)[source]

Bases: ABC

This is a base abstract class that should be extended for actual implementations.

Parameters:

config (Optional[Dict[str, Any]]) – configuration for metric initialization. Default to None.

abstract evaluate(agent_trace, evaluation_data_sample)[source]

This is an abstract method and should be implemented in a concrete class.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object or a List [NumericalScore] object.

class agent_inspect.metrics.scorer.PPT(llm_client, config=None)[source]

Bases: ProgressBasedMetric

Metric to calculate the progress-per-turn (PPT) of the progress scores produced by ProgressScoresThroughTurns class. PPT metric is defined as the total increase in progress divided by the number of turns. It weights the increase in progress uniformly across the conversational turns.

\[PPT = \frac{1}{T} \sum_{t=0}^{T-1} \left( p(t+1)-p(t) \right) = \frac{p(T)}{T}, \]

where \(p(t) := progress(i, G_i, \tau_i[1:t])\) denotes the discrete progress at turn \(t\) (refer to the documentation on ProgressScoresThroughTurns), \(T\) is the minimum number of conversational turns to reach the final achieved progress \(p(T)\), and \(p(0)=0\).
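Because the per-turn increases telescope, PPT reduces to the final achieved progress divided by the earliest turn at which that progress is reached. A minimal sketch, assuming a plain list of floats stands in for the per-turn NumericalScore values:

```python
def ppt_from_progress(progress):
    # progress[t-1] = p(t) for turns t = 1..max_turns, with p(0) = 0.
    final = progress[-1]
    # T is the earliest turn at which the final achieved progress is reached.
    first_turn = next(t for t, p in enumerate(progress, start=1) if p == final)
    return final / first_turn

print(ppt_from_progress([0.25, 0.5, 1.0, 1.0]))  # final progress 1.0 first reached at turn 3
```

Using the earliest turn rather than the total length means an agent is not penalized for trailing turns after all progress has been made.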

Parameters:
  • llm_client (Any) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of LLM-as-a-judge runs is set to a value larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

    • max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns the progress-per-turn (PPT) value computed from the list of progress scores. Internally calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing PPT score, sub scores consisting of progress scores at every turn, and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import PPT
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = PPT(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)        
static get_ppt_score_from_progress_scores(progress_scores)[source]

Computes the progress-per-turn (PPT) value given a list of progress scores at every conversational turn as input. The list of progress scores is obtained from the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

Parameters:

progress_scores (List[NumericalScore]) – a List [NumericalScore] object storing a list of progress scores at every conversational turn.

Return type:

NumericalScore

Returns:

a NumericalScore object containing the PPT score.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns, PPT
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> progress_turns_metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_VALIDATION_RESULTS: True,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> progress_rates = progress_turns_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> ppt_metric = PPT(llm_client=client)
>>> ppt_metric_result = ppt_metric.get_ppt_score_from_progress_scores(progress_rates)   
>>> print(ppt_metric_result.score)    
class agent_inspect.metrics.scorer.ProgressBasedMetric(llm_client, config=None)[source]

Bases: LLMBasedMetric

Abstract class which should be extended for actual implementations of progress metrics.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) – configuration for progress metric initialization. Default to None.

abstract evaluate(agent_trace, evaluation_data_sample)[source]

This is an abstract method and should be implemented in a concrete class.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object or a List [NumericalScore] object.

class agent_inspect.metrics.scorer.ProgressScore(llm_client, config=None)[source]

Bases: ProgressBasedMetric

Metric to calculate the agent’s progress for a given task sample based on the proportion of subgoals completed. The current metric supports only static conversations, where the user utterances are predetermined.

\[progress(i, G_i, \tau_i)=\frac{1}{|G_i|} \sum_{j=1}^{|G_i|} LLM_{judge}(i, g_{i, j}, \tau_i ),\]

where \(LLM_{judge}(\cdot)\) is the output from the LLM-as-a-judge, \(G_i= \{ g_{i, 1}, ..., g_{i, j}, ..., g_{i, |G_i|} \}\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i\) is the agent trajectory for the entire conversation, consisting of tool calls, agent responses, and user inputs.
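The aggregation in the formula above is simply the mean of the per-subgoal judge verdicts. A minimal sketch, where the 0/1 list stands in for the (possibly majority-voted) LLM-as-a-judge outputs, one per subgoal, as an assumption for illustration:

```python
def progress_from_verdicts(judge_verdicts):
    # judge_verdicts[j] is 1 if subgoal g_{i,j} is judged completed, else 0.
    return sum(judge_verdicts) / len(judge_verdicts)

print(progress_from_verdicts([1, 0, 1, 1]))  # 3 of 4 subgoals completed -> 0.75
```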

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user inputs, subgoal, trajectory, and agent responses. If this is not provided, the default template for static single-turn or static multi-turn conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of LLM-as-a-judge runs is set to a value larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a progress score given the agent trace and the evaluation data sample.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing progress score and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScore
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = ProgressScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
class agent_inspect.metrics.scorer.ProgressScoresThroughTurns(llm_client, config=None)[source]

Bases: ProgressBasedMetric

Metric to calculate the agent’s progress at every conversational turn (up to the final conversation turn \(T\)) for a given task sample, based on the proportion of subgoals completed. Subgoals that are completed at the current or previous turns are not evaluated again in subsequent turns. The metric assumes that previously completed subgoals, which are milestones, cannot be undone.

For every conversational turn \(t\) up to the final turn \(T\), the agent’s progress at turn \(t\) is computed as follows:

\[progress(i, G_i, \tau_i[1:t])=\frac{1}{|G_i|} \sum_{j=1}^{|G_i|} LLM_{judge}(i, g_{i, j}, \tau_i[1:t]), \]

where \(LLM_{judge}(\cdot)\) is the output from the LLM-as-a-judge, \(G_i= \{ g_{i, 1}, ..., g_{i, j}, ..., g_{i, |G_i|} \}\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i[1:t]\) is the segment of the agent trajectory from the first turn up to turn \(t\), consisting of tool calls, agent responses, and user inputs.
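Carrying completed subgoals forward rather than re-judging them makes the progress curve monotonically non-decreasing. A minimal sketch of that per-turn aggregation; the nested 0/1 verdict list standing in for the judge outputs on each trajectory prefix is an assumption for illustration:

```python
def progress_through_turns(verdicts_by_turn):
    # verdicts_by_turn[t][j] is the judge verdict for subgoal j on the
    # trajectory prefix up to turn t+1 (1 = completed, 0 = not yet).
    num_subgoals = len(verdicts_by_turn[0])
    done = [0] * num_subgoals
    scores = []
    for verdicts in verdicts_by_turn:
        # Milestones cannot be undone: once complete, always complete.
        done = [max(d, v) for d, v in zip(done, verdicts)]
        scores.append(sum(done) / num_subgoals)
    return scores

print(progress_through_turns([[1, 0, 0], [1, 1, 0], [0, 1, 1]]))
```

Note that the third turn's verdict of 0 for the first subgoal does not lower the score, by the milestone assumption.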

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of LLM-as-a-judge runs is set to a value larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • include_validation_results: a bool flag indicating whether the output should also return a List [SubGoalValidationResult]. This is used later for error analysis. Defaults to False.

    • include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

    • max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a list of progress scores at every turn until max_turns given the agent trace and the evaluation data sample.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a List [NumericalScore] object storing a list of progress scores at every turn until max_turns.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_VALIDATION_RESULTS: True,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result) # print list of NumericalScore objects
class agent_inspect.metrics.scorer.SuccessBasedMetric(llm_client, config=None)[source]

Bases: LLMBasedMetric

Abstract class which should be extended for actual implementations of success metrics.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) – configuration for success metric initialization. Default to None.

abstract evaluate(agent_trace, evaluation_data_sample)[source]

This is an abstract method and should be implemented in a concrete class.

Parameters:
  • agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.

  • evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.

Returns:

a NumericalScore object or a List [NumericalScore] object.

static get_success_score_from_progress_score(progress_score)[source]

Computes the success score given the progress score. Success score is 1 if the progress score is 1, and 0 otherwise.

Parameters:

progress_score (NumericalScore) – a NumericalScore object containing the progress score

Return type:

NumericalScore

Returns:

a NumericalScore object containing success score and sub scores consisting of progress score.

Example:

>>> from agent_inspect.metrics.scorer import ProgressScore, SuccessBasedMetric
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> progress_metric = ProgressScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> progress_metric_result = progress_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> success_score = SuccessBasedMetric.get_success_score_from_progress_score(progress_metric_result)
>>> print(success_score)
static get_success_score_from_validation_results(validation_results)[source]

Aggregates a list of SubGoalValidationResult objects to compute a success score. Success score is 1 if all the validation results indicate success, and 0 otherwise.

Parameters:

validation_results (List[SubGoalValidationResult]) – a List [SubGoalValidationResult] object containing the result of subgoal validations.

Return type:

NumericalScore

Returns:

a NumericalScore object containing success score and sub scores consisting of progress score.
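The aggregation rule can be illustrated with a minimal sketch, where a plain list of booleans stands in for the List [SubGoalValidationResult]; this is an assumption for demonstration, not the library’s API:

```python
def success_from_validation_results(results):
    # results[j] is True if the j-th subgoal validation succeeded.
    # Success requires every subgoal validation to succeed.
    return 1 if all(results) else 0

print(success_from_validation_results([True, True, True]))   # 1
print(success_from_validation_results([True, False, True]))  # 0
```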

class agent_inspect.metrics.scorer.SuccessScore(llm_client, config=None)[source]

Bases: SuccessBasedMetric

Metric to calculate the agent’s success score for a given task sample based on the agent’s progress. The current metric supports only static conversations, where the user utterances are predetermined.

\[success(i, G_i, \tau_i) = 1 \ \mathrm{if} \ progress(i, G_i, \tau_i)=1, \ \mathrm{and} \ 0 \ \mathrm{otherwise}, \]

where \(progress(i, G_i, \tau_i)\) is the progress score of the agent (refer to the documentation on ProgressScore), \(G_i\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i\) is the agent trajectory for the entire conversation, consisting of tool calls, agent responses, and user inputs.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user inputs, subgoal, trajectory, and agent responses. If this is not provided, the default template for static single-turn or static multi-turn conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of LLM-as-a-judge runs is set to a value larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a success score given the agent trace and the evaluation data sample. Internally calls the agent_inspect.metrics.scorer.progress.ProgressScore.evaluate and agent_inspect.metrics.scorer.success.SuccessBasedMetric.get_success_score_from_progress_score methods.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing success score, sub scores consisting of progress score, and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import SuccessScore
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = SuccessScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
class agent_inspect.metrics.scorer.SuccessScoreFinalTurn(llm_client, config=None)[source]

Bases: SuccessBasedMetric

Metric to calculate the agent’s success score for a given task sample based on the agent’s progress at the final conversational turn \(T\).

\[success(i, G_i, \tau_i[1:T]) = 1 \ \mathrm{if} \ progress(i, G_i, \tau_i[1:T])=1, \ \mathrm{and} \ 0 \ \mathrm{otherwise}, \]

where \(progress(i, G_i, \tau_i[1:T])\) is the progress score of the agent at the final conversation turn \(T\) (refer to the documentation on ProgressScoresThroughTurns), \(G_i\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i[1:T]\) is the segment of the agent trajectory from the first turn up to the final turn \(T\), consisting of tool calls, agent responses, and user inputs.

Parameters:
  • llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversations will be used.

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of LLM-as-a-judge runs is set to a value larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

    • optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.

    • max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. This is ignored if optimize_judge_trials is set to True.

    • max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a success score at the final conversational turn \(T\) given the agent trace and the evaluation data sample. Internally calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate and agent_inspect.metrics.scorer.success.SuccessBasedMetric.get_success_score_from_progress_score methods.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing success score at final turn \(T\), sub scores consisting of progress scores at every turn, and judge explanations.

Example:

>>> from agent_inspect.metrics.scorer import SuccessScoreFinalTurn
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096) # create client needed for LLM-based metric
>>> metric = SuccessScoreFinalTurn(
...     llm_client=client,
...     config={
...        MAX_TURNS: 8,
...        INCLUDE_JUDGE_EXPLANATION: True,
...        OPTIMIZE_JUDGE_TRIALS: False
...    }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
class agent_inspect.metrics.scorer.ToolCorrectnessMetric(llm_client, config=None)[source]

Bases: LLMBasedMetric

Metric to calculate the correctness rate of the tool calls made by an agent over its entire dialogue trace. The final score is computed as the fraction of expected tool calls that the agent calls correctly.

The tool correctness score \(\text{tool_correctness}(i, T_i, \tau_i)\) for sample \(i\) is defined as:

\[\text{tool_correctness}(i, T_i, \tau_i) = \frac{1}{N} \sum_{j=1}^N \mathbb{I}(T_{i,j} \text{ is correctly called in } \tau_i)\]

where \(\tau_i\) refers to the agent trajectory for sample \(i\), \(T_i = \{T_{i,1}, T_{i,2}, \ldots, T_{i,N}\}\) represents the set of \(N\) expected tool calls for sample \(i\), and \(\mathbb{I}(\cdot)\) is the indicator function that equals 1 if the \(j\)-th tool call \(T_{i,j}\) is correctly called by the agent, and 0 otherwise.

The correctness of each tool call is determined by validating the agent’s tool call against the expected tool call using exact match, LLM-as-a-judge, or both, depending on the configuration. Specifically, if an argument or parameter is set to "value", exact match is used; if set to "check", LLM-as-a-judge validates the correctness. The evaluation covers three dimensions: tool name, tool input arguments, and tool output. A tool call is considered correct only if all three dimensions are validated as correct.

Parameters:
  • llm_client (LLMClient) – The client which allows connection to the LLM-as-a-judge model for evaluation.

  • config (Optional[Dict[str, Any]]) –

    Defaults to None. Configuration options:

    • num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of LLM-as-a-judge runs is set to a value larger than 1.

    • include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.

evaluate(agent_trace, evaluation_data_sample)[source]

Returns a tool correctness score given the agent trace and the evaluation data sample.

Parameters:
  • agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.

  • evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.

Returns:

a NumericalScore object containing the tool correctness score (float) and explanations.

Example:

>>> from agent_inspect.metrics.scorer import ToolCorrectnessMetric
>>> from agent_inspect.metrics.constants import NUM_JUDGE_TRIALS 
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path) # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path) # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # create client needed for LLM-based metric
>>> metric = ToolCorrectnessMetric(
...     llm_client=client,
...     config={
...         NUM_JUDGE_TRIALS: 5
...     }
... )
>>> tool_correctness_score = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(tool_correctness_score.score)