agent_inspect.metrics.scorer package
Submodules
agent_inspect.metrics.scorer.auc module
- class agent_inspect.metrics.scorer.auc.AUC(llm_client, config=None)[source]
Bases: LLMBasedMetric

Metric to calculate the area under the progress curve produced by the ProgressScoresThroughTurns class. For computing AUC, the discrete progress values are treated as a continuous, monotonically increasing function obtained via linear interpolation.

\[AUC = \int_{0}^{T} p(t) \, dt,\]

where \(T\) is the maximum number of turns of a conversation and \(p(t) := progress(i, G_i, \tau_i[1:t])\) denotes the progress at turn \(t\) (refer to the documentation on ProgressScoresThroughTurns).

- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Default to None. Configuration options:
    - templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversation is used.
    - num_judge_trials: the number of LLM-as-a-judge runs. Default to 5. A majority vote is used when the number of LLM-as-a-judge runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Default to False.
    - optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This needs to be set to False in order to perform error analysis later. Default to False.
    - max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Default to 5. This is ignored if optimize_judge_trials is set to True.
    - max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Default to 20.
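The AUC definition above can be illustrated with a short standalone sketch: linear interpolation between discrete per-turn progress values reduces the integral to the trapezoidal rule over unit-width intervals. This is an illustration of the formula, not the library's implementation; the function name and the sample scores are hypothetical.

```python
# Illustrative sketch of the AUC formula, not the library's implementation.
# Progress is linearly interpolated between turns, so the integral reduces
# to the trapezoidal rule over unit-width intervals.

def auc_from_progress(progress: list[float]) -> float:
    """Trapezoidal area under the curve through p(0)=0, p(1), ..., p(T)."""
    points = [0.0] + list(progress)  # prepend p(0) = 0
    # Each interval [t, t+1] contributes (p(t) + p(t+1)) / 2 to the integral.
    return sum((a + b) / 2.0 for a, b in zip(points, points[1:]))

scores = [0.25, 0.5, 0.5, 1.0]  # hypothetical progress at turns 1..4
print(auc_from_progress(scores))  # 1.75
```

Note that two agents reaching the same final progress can differ in AUC: making progress earlier in the conversation yields a larger area.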
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns the value of the area under the progress-scores curve. Calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method underneath.

- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object containing the AUC score, sub-scores consisting of progress scores at every turn, and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import AUC
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = AUC(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
- static get_auc_score_from_progress_scores(progress_scores)[source]
Computes the value of the area under the progress-scores curve given a list of progress scores at every conversational turn as input. The list of progress scores is obtained from the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

- Parameters:
  - progress_scores (List[NumericalScore]) – a List[NumericalScore] object storing a list of progress scores at every conversational turn.
- Return type:
  NumericalScore
- Returns:
  a NumericalScore object containing the AUC score.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns, AUC
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> progress_turns_metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_VALIDATION_RESULTS: True,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> progress_rates = progress_turns_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> auc_metric = AUC(llm_client=client)
>>> auc_metric_result = auc_metric.get_auc_score_from_progress_scores(progress_rates)
>>> print(auc_metric_result.score)
agent_inspect.metrics.scorer.llm_based_metric module
- class agent_inspect.metrics.scorer.llm_based_metric.LLMBasedMetric(llm_client, config=None)[source]
Bases: Metric

This is a base abstract class that should be extended for actual implementations.
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – configuration for metric initialization. Default to None.
- abstract evaluate(agent_trace, evaluation_data_sample)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object or a List[NumericalScore] object.
agent_inspect.metrics.scorer.metric module
- class agent_inspect.metrics.scorer.metric.Metric(config=None)[source]
Bases: ABC

This is a base abstract class that should be extended for actual implementations.
- Parameters:
  - config (Optional[Dict[str, Any]]) – configuration for metric initialization. Default to None.
- abstract evaluate(agent_trace, evaluation_data_sample)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object or a List[NumericalScore] object.
agent_inspect.metrics.scorer.ppt module
- class agent_inspect.metrics.scorer.ppt.PPT(llm_client, config=None)[source]
Bases: ProgressBasedMetric

Metric to calculate the progress-per-turn (PPT) of the progress scores produced by the ProgressScoresThroughTurns class. The PPT metric is defined as the total increase in progress divided by the number of turns; it weights the increase in progress uniformly across the conversational turns.

\[PPT = \frac{1}{T} \sum_{t=0}^{T-1} \left[ p(t+1)-p(t) \right] = \frac{p(T)}{T},\]

where \(p(t) := progress(i, G_i, \tau_i[1:t])\) denotes the discrete progress at turn \(t\) (refer to the documentation on ProgressScoresThroughTurns), \(T\) is the minimum number of conversational turns needed to reach the final achieved progress \(p(T)\), and \(p(0)=0\).

- Parameters:
  - llm_client (Any) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Default to None. Configuration options:
    - templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversation is used.
    - num_judge_trials: the number of LLM-as-a-judge runs. Default to 5. A majority vote is used when the number of LLM-as-a-judge runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Default to False.
    - optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This needs to be set to False in order to perform error analysis later. Default to False.
    - max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Default to 5. This is ignored if optimize_judge_trials is set to True.
    - max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Default to 20.
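The telescoping identity in the PPT definition above can be checked with a minimal sketch, assuming \(p(0)=0\) as in the formula. The function name and sample scores are hypothetical, not part of the library:

```python
# Illustrative sketch of the PPT formula, not the library's implementation.

def ppt_from_progress(progress: list[float]) -> float:
    """Mean per-turn increase in progress over turns 1..T, with p(0) = 0."""
    previous = [0.0] + progress[:-1]        # p(0), p(1), ..., p(T-1)
    deltas = [b - a for a, b in zip(previous, progress)]
    return sum(deltas) / len(progress)      # telescopes to p(T) / T

scores = [0.25, 0.5, 0.5, 1.0]   # hypothetical progress at turns 1..4
print(ppt_from_progress(scores))  # 0.25, identical to scores[-1] / len(scores)
```

Because the sum telescopes, only the final progress and the number of turns matter: unlike AUC, PPT is insensitive to when the progress was made.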
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns the progress-per-turn (PPT) value of the list of progress scores. Calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method underneath.

- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object containing the PPT score, sub-scores consisting of progress scores at every turn, and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import PPT
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = PPT(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
- static get_ppt_score_from_progress_scores(progress_scores)[source]
Computes the progress-per-turn (PPT) value given a list of progress scores at every conversational turn as input. The list of progress scores is obtained from the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.

- Parameters:
  - progress_scores (List[NumericalScore]) – a List[NumericalScore] object storing a list of progress scores at every conversational turn.
- Return type:
  NumericalScore
- Returns:
  a NumericalScore object containing the PPT score.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns, PPT
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> progress_turns_metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_VALIDATION_RESULTS: True,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> progress_rates = progress_turns_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> ppt_metric = PPT(llm_client=client)
>>> ppt_metric_result = ppt_metric.get_ppt_score_from_progress_scores(progress_rates)
>>> print(ppt_metric_result.score)
agent_inspect.metrics.scorer.progress module
- class agent_inspect.metrics.scorer.progress.ProgressBasedMetric(llm_client, config=None)[source]
Bases: LLMBasedMetric

Abstract class which should be extended for actual implementations of progress metrics.

- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluations.
  - config (Optional[Dict[str, Any]]) – configuration for progress metric initialization. Default to None.
- abstract evaluate(agent_trace, evaluation_data_sample)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object or a List[NumericalScore] object.
- class agent_inspect.metrics.scorer.progress.ProgressScore(llm_client, config=None)[source]
Bases: ProgressBasedMetric

Metric to calculate the agent’s progress for a given task sample based on the proportion of subgoals completed. The current metric supports only static conversations where the user utterances are predetermined.

\[progress(i, G_i, \tau_i)=\frac{1}{|G_i|} \sum_{j=1}^{|G_i|} LLM_{judge}(i, g_{i, j}, \tau_i),\]

where \(LLM_{judge}(\cdot)\) is the output from the LLM-as-a-judge, \(G_i= \{ g_{i, 1}, ..., g_{i, j}, ..., g_{i, |G_i|} \}\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i\) is the agent trajectory for the entire conversation, consisting of tool calls, agent responses, and user inputs.
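The formula above, combined with the majority vote over judge trials described in the configuration, can be sketched as follows. This is an illustration of the scoring arithmetic only, not the library's implementation; the function name and the vote data are hypothetical:

```python
# Illustrative sketch of the progress formula, not the library's
# implementation: progress is the fraction of subgoals the judge marks
# complete, with a majority vote across judge trials.
from collections import Counter

def progress_score(judge_votes_per_subgoal: list[list[bool]]) -> float:
    """Fraction of subgoals whose majority vote across trials is 'completed'."""
    completed = 0
    for votes in judge_votes_per_subgoal:
        majority_vote, _count = Counter(votes).most_common(1)[0]
        completed += int(majority_vote)
    return completed / len(judge_votes_per_subgoal)

# Hypothetical outcomes: three subgoals, five judge trials each.
votes = [
    [True, True, True, False, True],    # majority True  -> completed
    [False, False, True, False, True],  # majority False -> not completed
    [True, True, True, True, True],     # majority True  -> completed
]
print(progress_score(votes))  # 2 of 3 subgoals completed -> 0.666...
```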
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Default to None. Configuration options:
    - templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user inputs, subgoal, trajectory, and agent responses. If this is not provided, the default template for static single-turn or static multi-turn conversation is used.
    - num_judge_trials: the number of LLM-as-a-judge runs. Default to 5. A majority vote is used when the number of LLM-as-a-judge runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Default to False.
    - include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Default to False.
    - optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Default to False.
    - max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Default to 5. This is ignored if optimize_judge_trials is set to True.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a progress score given the agent trace and the evaluation data sample.
- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object containing the progress score and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScore
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = ProgressScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
- class agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns(llm_client, config=None)[source]
Bases: ProgressBasedMetric

Metric to calculate the agent’s progress at every conversational turn (up to the final conversational turn \(T\)) for a given task sample, based on the proportion of subgoals completed. Subgoals that were completed at the current or previous turns are not evaluated again in subsequent turns: the metric assumes that previously completed subgoals, which are milestones, cannot be undone.

For every conversational turn \(t\) up to the final turn \(T\), the agent’s progress at turn \(t\) is computed as follows:

\[progress(i, G_i, \tau_i[1:t])=\frac{1}{|G_i|} \sum_{j=1}^{|G_i|} LLM_{judge}(i, g_{i, j}, \tau_i[1:t]),\]

where \(LLM_{judge}(\cdot)\) is the output from the LLM-as-a-judge, \(G_i= \{ g_{i, 1}, ..., g_{i, j}, ..., g_{i, |G_i|} \}\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i[1:t]\) is the segment of the agent trajectory from the first turn up to turn \(t\), consisting of tool calls, agent responses, and user inputs.
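The milestone assumption above makes the per-turn progress sequence non-decreasing. A minimal sketch of that bookkeeping (illustrative only, not the library's implementation; names and data are hypothetical):

```python
# Illustrative sketch of per-turn progress, not the library's implementation.
# Completed subgoals are treated as milestones: once judged complete at some
# turn they are never re-evaluated or undone, so progress never decreases.

def progress_through_turns(newly_completed: list[set[int]],
                           num_subgoals: int) -> list[float]:
    """Cumulative fraction of subgoals completed by the end of each turn."""
    done: set[int] = set()
    scores: list[float] = []
    for turn_subgoals in newly_completed:
        done |= turn_subgoals                  # milestones accumulate
        scores.append(len(done) / num_subgoals)
    return scores

# Hypothetical outcomes: subgoal indices newly judged complete at turns 1..4.
print(progress_through_turns([{0}, set(), {1, 2}, {3}], num_subgoals=4))
# [0.25, 0.25, 0.75, 1.0] -- non-decreasing through turns
```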
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Default to None. Configuration options:
    - templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversation is used.
    - num_judge_trials: the number of LLM-as-a-judge runs. Default to 5. A majority vote is used when the number of LLM-as-a-judge runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Default to False.
    - include_validation_results: a bool flag indicating whether the output should also return a List[SubGoalValidationResult]. This is used later for error analysis. Default to False.
    - include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Default to False.
    - optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This needs to be set to False in order to perform error analysis later. Default to False.
    - max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Default to 5. This is ignored if optimize_judge_trials is set to True.
    - max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Default to 20.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a list of progress scores at every turn until max_turns, given the agent trace and the evaluation data sample.

- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a List[NumericalScore] object storing a list of progress scores at every turn until max_turns.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_VALIDATION_RESULTS: True,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result)  # print list of NumericalScore objects
agent_inspect.metrics.scorer.success module
- class agent_inspect.metrics.scorer.success.SuccessBasedMetric(llm_client, config=None)[source]
Bases: LLMBasedMetric

Abstract class which should be extended for actual implementations of success metrics.

- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluations.
  - config (Optional[Dict[str, Any]]) – configuration for success metric initialization. Default to None.
- abstract evaluate(agent_trace, evaluation_data_sample)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object or a List[NumericalScore] object.
- static get_success_score_from_progress_score(progress_score)[source]
Computes the success score given the progress score. Success score is 1 if the progress score is 1, and 0 otherwise.
- Parameters:
  - progress_score (NumericalScore) – a NumericalScore object containing the progress score.
- Return type:
  NumericalScore
- Returns:
  a NumericalScore object containing the success score and sub-scores consisting of the progress score.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScore, SuccessBasedMetric
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> progress_metric = ProgressScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> progress_metric_result = progress_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> success_score = SuccessBasedMetric.get_success_score_from_progress_score(progress_metric_result)
>>> print(success_score)
- static get_success_score_from_validation_results(validation_results)[source]
Aggregates a list of SubGoalValidationResult objects to compute a success score. Success score is 1 if all the validation results indicate success, and 0 otherwise.
- Parameters:
  - validation_results (List[SubGoalValidationResult]) – a List[SubGoalValidationResult] object containing the results of the subgoal validations.
- Return type:
  NumericalScore
- Returns:
  a NumericalScore object containing the success score and sub-scores consisting of the progress score.
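Both static helpers implement the same all-or-nothing rule, which can be sketched as follows (illustrative only, not the library's implementation; plain booleans stand in for SubGoalValidationResult objects and the function names are hypothetical):

```python
# Illustrative sketch of the success aggregation, not the library's implementation.

def success_from_progress(progress: float) -> int:
    """Success is 1 only when the progress score is exactly 1 (all subgoals met)."""
    return 1 if progress == 1.0 else 0

def success_from_validations(validations: list[bool]) -> int:
    """Equivalent rule over per-subgoal validation outcomes."""
    return int(all(validations))

print(success_from_progress(1.0), success_from_progress(0.9))  # 1 0
print(success_from_validations([True, True]), success_from_validations([True, False]))  # 1 0
```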
- class agent_inspect.metrics.scorer.success.SuccessScore(llm_client, config=None)[source]
Bases: SuccessBasedMetric

Metric to calculate the agent’s success score for a given task sample based on the agent’s progress. The current metric supports only static conversations where the user utterances are predetermined.

\[success(i, G_i, \tau_i) = 1 \ \mathrm{if} \ progress(i, G_i, \tau_i)=1, \ \mathrm{and} \ 0 \ \mathrm{otherwise},\]

where \(progress(i, G_i, \tau_i)\) is the progress score of the agent (refer to the documentation on ProgressScore), \(G_i\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i\) is the agent trajectory for the entire conversation, consisting of tool calls, agent responses, and user inputs.

- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Default to None. Configuration options:
    - templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user inputs, subgoal, trajectory, and agent responses. If this is not provided, the default template for static single-turn or static multi-turn conversation is used.
    - num_judge_trials: the number of LLM-as-a-judge runs. Default to 5. A majority vote is used when the number of LLM-as-a-judge runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Default to False.
    - optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Default to False.
    - max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Default to 5. This is ignored if optimize_judge_trials is set to True.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a success score given the agent trace and the evaluation data sample. Calls the agent_inspect.metrics.scorer.progress.ProgressScore.evaluate and agent_inspect.metrics.scorer.success.SuccessBasedMetric.get_success_score_from_progress_score methods underneath.

- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object containing the success score, sub-scores consisting of the progress score, and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import SuccessScore
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = SuccessScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
- class agent_inspect.metrics.scorer.success.SuccessScoreFinalTurn(llm_client, config=None)[source]
Bases: SuccessBasedMetric

Metric to calculate the agent’s success score for a given task sample based on the agent’s progress at the final conversational turn \(T\).

\[success(i, G_i, \tau_i[1:T]) = 1 \ \mathrm{if} \ progress(i, G_i, \tau_i[1:T])=1, \ \mathrm{and} \ 0 \ \mathrm{otherwise},\]

where \(progress(i, G_i, \tau_i[1:T])\) is the progress score of the agent at the final conversational turn \(T\) (refer to the documentation on ProgressScoresThroughTurns), \(G_i\) is the set of subgoals, a.k.a. grading notes, for task sample \(i\), and \(\tau_i[1:T]\) is the segment of the agent trajectory from the first turn up to the final turn \(T\), consisting of tool calls, agent responses, and user inputs.

- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Default to None. Configuration options:
    - templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If this is not provided, the default template for dynamic conversation is used.
    - num_judge_trials: the number of LLM-as-a-judge runs. Default to 5. A majority vote is used when the number of LLM-as-a-judge runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Default to False.
    - optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This needs to be set to False in order to perform error analysis later. Default to False.
    - max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Default to 5. This is ignored if optimize_judge_trials is set to True.
    - max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Default to 20.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a success score at the final conversational turn \(T\) given the agent trace and the evaluation data sample. Calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate and agent_inspect.metrics.scorer.success.SuccessBasedMetric.get_success_score_from_progress_score methods underneath.

- Parameters:
  - agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
  - evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
  a NumericalScore object containing the success score at the final turn \(T\), sub-scores consisting of progress scores at every turn, and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import SuccessScoreFinalTurn
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = SuccessScoreFinalTurn(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
agent_inspect.metrics.scorer.templates module
agent_inspect.metrics.scorer.tool_correctness module
- class agent_inspect.metrics.scorer.tool_correctness.ToolCorrectnessMetric(llm_client, config=None)[source]
Bases: LLMBasedMetric

Metric to calculate the correctness rate of the tool calls made by an agent over its entire dialogue trace. The final score is computed as the ratio of correctly made tool calls to the total number of expected tool calls.

The tool correctness score \(\text{tool\_correctness}(i, T_i, \tau_i)\) for sample \(i\) is defined as:

\[\text{tool\_correctness}(i, T_i, \tau_i) = \frac{1}{N} \sum_{j=1}^N \mathbb{I}(T_{i,j} \text{ is called in } \tau_i),\]

where \(\tau_i\) refers to the agent trajectory for sample \(i\), \(T_i = \{T_{i,1}, T_{i,2}, \ldots, T_{i,N}\}\) represents the set of \(N\) expected tool calls for sample \(i\), and \(\mathbb{I}(\cdot)\) is the indicator function that equals 1 if the \(j\)-th expected tool call \(T_{i,j}\) is correctly made by the agent, and 0 otherwise.

The correctness of each tool call is determined by validating the agent’s tool call against the expected tool call using either exact match or an LLM-as-a-judge approach, depending on the configuration. Specifically, if an argument or parameter is set to value, exact match is used; if set to check, the LLM-as-a-judge validates the correctness. The evaluation is performed across three dimensions: tool name, tool input arguments, and tool output. A tool call is considered correct only if all three dimensions are validated as correct.

- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Default to None. Configuration options:
    - num_judge_trials: the number of LLM-as-a-judge runs. Default to 5. A majority vote is used when the number of LLM-as-a-judge runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Default to False.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a tool correctness score given the agent trace and the evaluation data sample.
- Parameters:
agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.
evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.
- Returns:
a NumericalScore containing the tool correctness score (float) and explanations.
Example:
>>> from agent_inspect.metrics.scorer import ToolCorrectnessMetric
>>> from agent_inspect.metrics.constants import NUM_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = ToolCorrectnessMetric(
...     llm_client=client,
...     config={NUM_JUDGE_TRIALS: 5}
... )
>>> tool_correctness_score = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(tool_correctness_score.score)
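The scoring rule above can be sketched in plain Python. This is a hypothetical illustration, not the library's implementation: the dict shapes, the (mode, spec) argument encoding, and the llm_check predicate are all invented here, and the tool-output dimension is omitted for brevity.

```python
def tool_call_correct(expected: dict, actual: dict, llm_check) -> bool:
    """A tool call is correct only if the name and every argument match.

    Each expected argument is a (mode, spec) pair: mode "value" means exact
    match; mode "check" defers to an LLM-as-a-judge predicate `llm_check`.
    """
    if expected["name"] != actual["name"]:
        return False
    for key, (mode, spec) in expected["args"].items():
        got = actual["args"].get(key)
        ok = (got == spec) if mode == "value" else llm_check(spec, got)
        if not ok:
            return False
    return True

def tool_correctness(expected_calls, actual_calls, llm_check) -> float:
    """Fraction of expected tool calls correctly made in the trajectory."""
    n = len(expected_calls)
    hits = sum(any(tool_call_correct(e, a, llm_check) for a in actual_calls)
               for e in expected_calls)
    return hits / n if n else 0.0

expected = [{"name": "search", "args": {"query": ("check", "asks about refunds")}},
            {"name": "get_order", "args": {"order_id": ("value", "A123")}}]
actual = [{"name": "search", "args": {"query": "refund policy?"}},
          {"name": "get_order", "args": {"order_id": "A999"}}]
# Stub judge that accepts any non-empty value for "check" arguments
print(tool_correctness(expected, actual, lambda spec, got: bool(got)))  # 0.5
```

Here one of the two expected calls matches (the search call passes the stubbed judge; the get_order call fails exact match), so the score is 0.5.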
Module contents
- class agent_inspect.metrics.scorer.AUC(llm_client, config=None)[source]
Bases:
LLMBasedMetric

Metric to calculate the area under the progress curve produced by the ProgressScoresThroughTurns class. For computing AUC, the discrete progress values are treated as a continuous, monotonically increasing function obtained via linear interpolation.
\[AUC = \int_{0}^{T} p(t) \, dt,\]
where \(T\) is the maximum number of turns of a conversation and \(p(t) := progress(i, G_i, \tau_i[1:t])\) denotes the progress at turn \(t\) (refer to the documentation on ProgressScoresThroughTurns).
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
config (Optional[Dict[str, Any]]) – defaults to None. Configuration options:
templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If not provided, the default template for dynamic conversations is used.
num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when this value is larger than 1.
include_judge_explanation: a bool flag indicating whether the output should also include judge explanations. Defaults to False.
optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.
max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. Ignored if optimize_judge_trials is set to True.
max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns the value of the area under the progress scores curve. Calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method underneath.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object containing the AUC score, sub-scores consisting of progress scores at every turn, and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import AUC
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = AUC(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
- static get_auc_score_from_progress_scores(progress_scores)[source]
Computes the value of the area under the progress scores curve, given as input a list of progress scores at every conversational turn. The list of progress scores is obtained from the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.
- Parameters:
progress_scores (List[NumericalScore]) – a List[NumericalScore] object storing a list of progress scores at every conversational turn.
- Return type:
NumericalScore
- Returns:
a NumericalScore object containing the AUC score.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns, AUC
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> progress_turns_metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_VALIDATION_RESULTS: True,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> progress_rates = progress_turns_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> auc_metric = AUC(llm_client=client)
>>> auc_metric_result = auc_metric.get_auc_score_from_progress_scores(progress_rates)
>>> print(auc_metric_result.score)
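Because \(p(t)\) is piecewise linear on unit-width turn intervals, the integral above reduces to the trapezoidal rule. A minimal standalone sketch, assuming progress values are plain floats rather than NumericalScore objects:

```python
def auc_from_progress(progress: list[float]) -> float:
    """Area under the piecewise-linear progress curve.

    progress[t - 1] is the progress after turn t; p(0) = 0 is prepended,
    so each unit-width interval contributes (p(t) + p(t + 1)) / 2.
    """
    p = [0.0] + list(progress)
    return sum((p[t] + p[t + 1]) / 2.0 for t in range(len(p) - 1))

# A conversation that reaches full progress at turn 4 and stays there
print(auc_from_progress([0.25, 0.5, 0.75, 1.0, 1.0]))  # 3.0
```

Note that an agent reaching full progress earlier accumulates more area, which is what makes AUC reward fast progress rather than only final progress.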
- class agent_inspect.metrics.scorer.LLMBasedMetric(llm_client, config=None)[source]
Bases:
Metric

This is a base abstract class that should be extended for actual implementations.
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
config (Optional[Dict[str, Any]]) – configuration for metric initialization. Defaults to None.
- abstract evaluate(agent_trace, evaluation_data_sample)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object or a List[NumericalScore] object.
- class agent_inspect.metrics.scorer.Metric(config=None)[source]
Bases:
ABC

This is a base abstract class that should be extended for actual implementations.
- Parameters:
config (Optional[Dict[str, Any]]) – configuration for metric initialization. Defaults to None.
- abstract evaluate(agent_trace, evaluation_data_sample)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object or a List[NumericalScore] object.
- class agent_inspect.metrics.scorer.PPT(llm_client, config=None)[source]
Bases:
ProgressBasedMetric

Metric to calculate the progress-per-turn (PPT) of the progress scores produced by the ProgressScoresThroughTurns class. The PPT metric is defined as the total increase in progress divided by the number of turns; it weights the increase in progress uniformly across the conversational turns.
\[PPT = \frac{1}{T} \sum_{t=0}^{T-1} \left( p(t+1)-p(t) \right) = \frac{p(T)}{T},\]
where \(p(t) := progress(i, G_i, \tau_i[1:t])\) denotes the discrete progress at turn \(t\) (refer to the documentation on ProgressScoresThroughTurns), \(T\) is the minimum number of conversational turns needed to reach the final achieved progress \(p(T)\), and \(p(0)=0\).
- Parameters:
llm_client (Any) – the client which allows connection to the LLM-as-a-judge model for evaluation.
config (Optional[Dict[str, Any]]) – defaults to None. Configuration options:
templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If not provided, the default template for dynamic conversations is used.
num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when this value is larger than 1.
include_judge_explanation: a bool flag indicating whether the output should also include judge explanations. Defaults to False.
optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.
max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. Ignored if optimize_judge_trials is set to True.
max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns the progress-per-turn (PPT) value of the list of progress scores. Calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method underneath.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object containing the PPT score, sub-scores consisting of progress scores at every turn, and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import PPT
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = PPT(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
- static get_ppt_score_from_progress_scores(progress_scores)[source]
Computes the progress-per-turn (PPT) value, given as input a list of progress scores at every conversational turn. The list of progress scores is obtained from the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate method.
- Parameters:
progress_scores (List[NumericalScore]) – a List[NumericalScore] object storing a list of progress scores at every conversational turn.
- Return type:
NumericalScore
- Returns:
a NumericalScore object containing the PPT score.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns, PPT
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> progress_turns_metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_VALIDATION_RESULTS: True,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> progress_rates = progress_turns_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> ppt_metric = PPT(llm_client=client)
>>> ppt_metric_result = ppt_metric.get_ppt_score_from_progress_scores(progress_rates)
>>> print(ppt_metric_result.score)
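The telescoping sum above collapses to \(p(T)/T\), with \(T\) the first turn at which the final progress is reached. A standalone sketch of that identity, again assuming plain-float progress values rather than NumericalScore objects:

```python
def ppt_from_progress(progress: list[float]) -> float:
    """Progress-per-turn: final progress divided by the minimum number
    of turns needed to reach it (trailing flat turns are excluded)."""
    if not progress:
        return 0.0
    final = progress[-1]
    # T = first (1-indexed) turn at which the final progress is achieved
    t_min = next(t + 1 for t, p in enumerate(progress) if p == final)
    return final / t_min

# Full progress is first reached at turn 4, so PPT = 1.0 / 4
print(ppt_from_progress([0.25, 0.5, 0.75, 1.0, 1.0]))  # 0.25
```

Excluding the trailing flat turns means an agent is not penalized for a conversation that continues after the task is already done.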
- class agent_inspect.metrics.scorer.ProgressBasedMetric(llm_client, config=None)[source]
Bases:
LLMBasedMetric

Abstract class which should be extended for actual implementations of progress metrics.
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluations.
config (Optional[Dict[str, Any]]) – configuration for progress metric initialization. Defaults to None.
- abstract evaluate(agent_trace, evaluation_data_sample)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object or a List[NumericalScore] object.
- class agent_inspect.metrics.scorer.ProgressScore(llm_client, config=None)[source]
Bases:
ProgressBasedMetric

Metric to calculate the agent's progress for a given task sample based on the proportion of subgoals completed. The current metric supports only static conversations, where the user utterances are predetermined.
\[progress(i, G_i, \tau_i)=\frac{1}{|G_i|} \sum_{j=1}^{|G_i|} LLM_{judge}(i, g_{i, j}, \tau_i),\]
where \(LLM_{judge}(\cdot)\) is the output from the LLM-as-a-judge, \(G_i= \{ g_{i, 1}, \ldots, g_{i, j}, \ldots, g_{i, |G_i|} \}\) is the set of subgoals (a.k.a. grading notes) for task sample \(i\), and \(\tau_i\) is the agent trajectory for the entire conversation, consisting of tool calls, agent responses, and user inputs.
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
config (Optional[Dict[str, Any]]) – defaults to None. Configuration options:
templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user inputs, subgoal, trajectory, and agent responses. If not provided, the default template for static single-turn or static multi-turn conversations is used.
num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when this value is larger than 1.
include_judge_explanation: a bool flag indicating whether the output should also include judge explanations. Defaults to False.
include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Defaults to False.
optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Defaults to False.
max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. Ignored if optimize_judge_trials is set to True.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a progress score given the agent trace and the evaluation data sample.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object containing the progress score and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScore
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = ProgressScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
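The formula above averages per-subgoal judge verdicts, and with num_judge_trials greater than 1 each verdict is itself a majority vote. A toy sketch with invented inputs, using lists of boolean judge verdicts in place of real LLM calls:

```python
from statistics import mean

def progress_from_judgments(subgoal_votes: list[list[bool]]) -> float:
    """Progress = fraction of subgoals judged complete.

    subgoal_votes[j] holds the boolean verdicts of the judge trials for
    subgoal j; a subgoal counts as complete on a strict majority.
    """
    def majority(votes: list[bool]) -> bool:
        return sum(votes) > len(votes) / 2

    return mean(1.0 if majority(v) else 0.0 for v in subgoal_votes)

# Three subgoals, five judge trials each: two pass by majority, one fails,
# so two of the three subgoals count as complete
votes = [[True] * 5, [True, True, True, False, False], [False] * 5]
print(progress_from_judgments(votes))
```

An odd num_judge_trials (the default is 5) avoids ties in the majority vote.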
- class agent_inspect.metrics.scorer.ProgressScoresThroughTurns(llm_client, config=None)[source]
Bases:
ProgressBasedMetric

Metric to calculate the agent's progress at every conversational turn (up to the final turn \(T\)) for a given task sample, based on the proportion of subgoals completed. Subgoals completed at the current or previous turns are not evaluated again in subsequent turns; the metric assumes previously completed subgoals, which are milestones, cannot be undone.
For every conversational turn \(t\) up to the final turn \(T\), the agent's progress at turn \(t\) is computed as:
\[progress(i, G_i, \tau_i[1:t])=\frac{1}{|G_i|} \sum_{j=1}^{|G_i|} LLM_{judge}(i, g_{i, j}, \tau_i[1:t]),\]
where \(LLM_{judge}(\cdot)\) is the output from the LLM-as-a-judge, \(G_i= \{ g_{i, 1}, \ldots, g_{i, j}, \ldots, g_{i, |G_i|} \}\) is the set of subgoals (a.k.a. grading notes) for task sample \(i\), and \(\tau_i[1:t]\) is the segment of the agent trajectory from the first turn up to turn \(t\), consisting of tool calls, agent responses, and user inputs.
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
config (Optional[Dict[str, Any]]) – defaults to None. Configuration options:
templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If not provided, the default template for dynamic conversations is used.
num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when this value is larger than 1.
include_judge_explanation: a bool flag indicating whether the output should also include judge explanations. Defaults to False.
include_validation_results: a bool flag indicating whether the output should also include a List[SubGoalValidationResult], used later for error analysis. Defaults to False.
include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Defaults to False.
optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.
max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. Ignored if optimize_judge_trials is set to True.
max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a list of progress scores at every turn until max_turns, given the agent trace and the evaluation data sample.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a List[NumericalScore] object storing a list of progress scores at every turn until max_turns.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScoresThroughTurns
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, INCLUDE_VALIDATION_RESULTS, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = ProgressScoresThroughTurns(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_VALIDATION_RESULTS: True,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result)  # print list of NumericalScore objects
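The carry-forward behavior, where completed subgoals are never re-evaluated and short conversations are padded to max_turns, can be illustrated with a toy sketch. The completion-turn encoding here is invented for the example and is not the library's data model:

```python
def progress_through_turns(subgoal_done_at_turn: list,
                           num_subgoals: int,
                           max_turns: int) -> list[float]:
    """Per-turn progress with completed subgoals carried forward.

    subgoal_done_at_turn[j] is the (1-indexed) turn at which subgoal j is
    first judged complete, or None if it is never completed.
    """
    scores = []
    for t in range(1, max_turns + 1):
        done = sum(1 for turn in subgoal_done_at_turn
                   if turn is not None and turn <= t)
        scores.append(done / num_subgoals)
    return scores

# 4 subgoals completed at turns 1, 2, 2 (one never completed), max_turns=5:
# progress plateaus once subgoals stop completing, padded out to max_turns
print(progress_through_turns([1, 2, 2, None], 4, 5))
```

The resulting per-turn list is exactly the curve that the AUC and PPT metrics below consume.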
- class agent_inspect.metrics.scorer.SuccessBasedMetric(llm_client, config=None)[source]
Bases:
LLMBasedMetric

Abstract class which should be extended for actual implementations of success metrics.
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluations.
config (Optional[Dict[str, Any]]) – configuration for success metric initialization. Defaults to None.
- abstract evaluate(agent_trace, evaluation_data_sample)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object or a List[NumericalScore] object.
- static get_success_score_from_progress_score(progress_score)[source]
Computes the success score given the progress score. Success score is 1 if the progress score is 1, and 0 otherwise.
- Parameters:
progress_score (NumericalScore) – a NumericalScore object containing the progress score.
- Return type:
NumericalScore
- Returns:
a NumericalScore object containing the success score and sub-scores consisting of the progress score.
Example:
>>> from agent_inspect.metrics.scorer import ProgressScore, SuccessBasedMetric
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> progress_metric = ProgressScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> progress_metric_result = progress_metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> success_score = SuccessBasedMetric.get_success_score_from_progress_score(progress_metric_result)
>>> print(success_score)
- static get_success_score_from_validation_results(validation_results)[source]
Aggregates a list of SubGoalValidationResult objects to compute a success score. Success score is 1 if all the validation results indicate success, and 0 otherwise.
- Parameters:
validation_results (List[SubGoalValidationResult]) – a List[SubGoalValidationResult] object containing the results of subgoal validations.
- Return type:
NumericalScore
- Returns:
a NumericalScore object containing the success score and sub-scores consisting of the progress score.
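Both static helpers implement the same all-or-nothing rule. A minimal sketch of their semantics, with plain floats and booleans standing in for NumericalScore and SubGoalValidationResult:

```python
def success_from_progress(progress: float, eps: float = 1e-9) -> int:
    """1 iff the progress score equals 1, i.e. all subgoals are complete."""
    return 1 if abs(progress - 1.0) < eps else 0

def success_from_validations(validations: list[bool]) -> int:
    """1 iff every subgoal validation result indicates success."""
    return 1 if validations and all(validations) else 0

print(success_from_progress(0.75), success_from_validations([True, False]))  # 0 0
print(success_from_progress(1.0), success_from_validations([True, True]))    # 1 1
```

Because success is a strict threshold at progress 1, partial credit never counts: a sample with 9 of 10 subgoals complete scores 0.9 on progress but 0 on success.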
- class agent_inspect.metrics.scorer.SuccessScore(llm_client, config=None)[source]
Bases:
SuccessBasedMetric

Metric to calculate the agent's success score for a given task sample based on the agent's progress. The current metric supports only static conversations, where the user utterances are predetermined.
\[success(i, G_i, \tau_i) = 1 \ \mathrm{if} \ progress(i, G_i, \tau_i)=1, \ \mathrm{and} \ 0 \ \mathrm{otherwise},\]
where \(progress(i, G_i, \tau_i)\) is the progress score of the agent (refer to the documentation on ProgressScore), \(G_i\) is the set of subgoals (a.k.a. grading notes) for task sample \(i\), and \(\tau_i\) is the agent trajectory for the entire conversation, consisting of tool calls, agent responses, and user inputs.
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
config (Optional[Dict[str, Any]]) – defaults to None. Configuration options:
templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user inputs, subgoal, trajectory, and agent responses. If not provided, the default template for static single-turn or static multi-turn conversations is used.
num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when this value is larger than 1.
include_judge_explanation: a bool flag indicating whether the output should also include judge explanations. Defaults to False.
optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Defaults to False.
max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. Ignored if optimize_judge_trials is set to True.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a success score given the agent trace and the evaluation data sample. Calls the agent_inspect.metrics.scorer.progress.ProgressScore.evaluate and agent_inspect.metrics.scorer.success.SuccessBasedMetric.get_success_score_from_progress_score methods underneath.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object containing the success score, sub-scores consisting of the progress score, and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import SuccessScore
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = SuccessScore(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
- class agent_inspect.metrics.scorer.SuccessScoreFinalTurn(llm_client, config=None)[source]
Bases:
SuccessBasedMetric

Metric to calculate the agent's success score for a given task sample based on the agent's progress at the final conversational turn \(T\).
\[success(i, G_i, \tau_i[1:T]) = 1 \ \mathrm{if} \ progress(i, G_i, \tau_i[1:T])=1, \ \mathrm{and} \ 0 \ \mathrm{otherwise},\]
where \(progress(i, G_i, \tau_i[1:T])\) is the progress score of the agent at the final conversational turn \(T\) (refer to the documentation on ProgressScoresThroughTurns), \(G_i\) is the set of subgoals (a.k.a. grading notes) for task sample \(i\), and \(\tau_i[1:T]\) is the segment of the agent trajectory from the first turn up to the final turn \(T\), consisting of tool calls, agent responses, and user inputs.
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
config (Optional[Dict[str, Any]]) – defaults to None. Configuration options:
templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with the user task, subgoal, trajectory, user utterances, and agent responses. If not provided, the default template for dynamic conversations is used.
num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when this value is larger than 1.
include_judge_explanation: a bool flag indicating whether the output should also include judge explanations. Defaults to False.
optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. This must be set to False in order to perform error analysis later. Defaults to False.
max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. Ignored if optimize_judge_trials is set to True.
max_turns: evaluate the agent up to max_turns conversational turns only. For conversations shorter than max_turns, the final progress score is carried forward up to max_turns. Defaults to 20.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a success score at the final conversational turn \(T\) given the agent trace and the evaluation data sample. Calls the agent_inspect.metrics.scorer.progress.ProgressScoresThroughTurns.evaluate and agent_inspect.metrics.scorer.success.SuccessBasedMetric.get_success_score_from_progress_score methods underneath.
- Parameters:
agent_trace (AgentDialogueTrace) – an AgentDialogueTrace object constructed with the agent trajectory information for a given data sample.
evaluation_data_sample (EvaluationSample) – an EvaluationSample object representing a data sample in the evaluation data set.
- Returns:
a NumericalScore object containing the success score at the final turn \(T\), sub-scores consisting of progress scores at every turn, and judge explanations.
Example:
>>> from agent_inspect.metrics.scorer import SuccessScoreFinalTurn
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, MAX_TURNS, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = SuccessScoreFinalTurn(
...     llm_client=client,
...     config={
...         MAX_TURNS: 8,
...         INCLUDE_JUDGE_EXPLANATION: True,
...         OPTIMIZE_JUDGE_TRIALS: False
...     }
... )
>>> metric_result = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(metric_result.score)
- class agent_inspect.metrics.scorer.ToolCorrectnessMetric(llm_client, config=None)[source]
Bases:
LLMBasedMetric

Metric to calculate the correctness rate of tool calls made by an agent over its entire dialogue trace. The final score is the ratio of correct tool calls to the total number of tool calls made.
The tool correctness score \(\text{tool\_correctness}(i, T_i, \tau_i)\) for sample \(i\) is defined as:
\[\text{tool\_correctness}(i, T_i, \tau_i) = \frac{1}{N} \sum_{j=1}^N \mathbb{I}(T_{i,j} \text{ is correctly called in } \tau_i),\]
where \(\tau_i\) refers to the agent trajectory for sample \(i\), \(T_i = \{T_{i,1}, T_{i,2}, \ldots, T_{i,N}\}\) is the set of \(N\) expected tool calls for sample \(i\), and \(\mathbb{I}(\cdot)\) is the indicator function that equals 1 if the \(j\)-th expected tool call \(T_{i,j}\) is correctly made by the agent, and 0 otherwise.
The correctness of each tool call is determined by validating the agent's tool call against the expected tool call using exact match, LLM-as-a-judge, or both, depending on the configuration. Specifically, if an argument or parameter is set to value, exact match is used; if set to check, LLM-as-a-judge validates the correctness. The evaluation covers three dimensions: tool name, tool input arguments, and tool output. A tool call is considered correct only if all three dimensions are validated as correct.
- Parameters:
llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
config (Optional[Dict[str, Any]]) – defaults to None. Configuration options:
num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when this value is larger than 1.
include_judge_explanation: a bool flag indicating whether the output should also include judge explanations. Defaults to False.
- evaluate(agent_trace, evaluation_data_sample)[source]
Returns a tool correctness score given the agent trace and the evaluation data sample.
- Parameters:
agent_trace (AgentDialogueTrace) – Agent Trace object constructed with the traces produced by the data sample.
evaluation_data_sample (EvaluationSample) – Data Sample object that represents a data sample in the evaluation data set.
- Returns:
a NumericalScore containing the tool correctness score (float) and explanations.
Example:
>>> from agent_inspect.metrics.scorer import ToolCorrectnessMetric
>>> from agent_inspect.metrics.constants import NUM_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> data_sample = load_data_sample(sample_path)  # Load data sample
>>> agent_trace = load_agent_trace(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> metric = ToolCorrectnessMetric(
...     llm_client=client,
...     config={NUM_JUDGE_TRIALS: 5}
... )
>>> tool_correctness_score = metric.evaluate(
...     agent_trace=agent_trace,
...     evaluation_data_sample=data_sample
... )
>>> print(tool_correctness_score.score)