agent_inspect.metrics.validator package
Submodules
agent_inspect.metrics.validator.exact_match module
agent_inspect.metrics.validator.llm_check module
- async agent_inspect.metrics.validator.llm_check.llm_check(client, variables, template, post_process)[source]
  - Return type:
    bool
  - Parameters:
    - client (LLMClient)
    - variables (Dict)
    - template (str)
    - post_process (Callable[[LLMResponse], bool])
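Per the signature above, llm_check renders a prompt from a template and variables, sends it to the judge model via the client, and reduces the response to a bool with post_process. A minimal self-contained sketch of that contract, using stubbed stand-ins (FakeClient, FakeResponse, and the complete method name are illustrative assumptions, not the real LLMClient API):

```python
import asyncio

class FakeResponse:
    """Illustrative stand-in for LLMResponse; only carries the judge's text."""
    def __init__(self, text):
        self.text = text

class FakeClient:
    """Illustrative stand-in for LLMClient; `complete` is an assumed method name."""
    async def complete(self, prompt):
        return FakeResponse("YES")  # a real client would call the judge model here

async def llm_check(client, variables, template, post_process):
    # Sketch of the documented contract: render the template with the
    # variables, query the judge model, and reduce the response to a bool.
    prompt = template.format(**variables)
    response = await client.complete(prompt)
    return post_process(response)

result = asyncio.run(
    llm_check(
        FakeClient(),
        {"question": "Capital of France?", "candidate": "Paris"},
        "Question: {question}\nAnswer: {candidate}\nReply YES or NO.",
        lambda r: r.text.strip().upper() == "YES",
    )
)
print(result)  # True with the stub client
```

The post_process callable keeps the judge-output parsing separate from the transport, so callers can adapt to different judge response formats without touching llm_check itself.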
agent_inspect.metrics.validator.regex_match module
agent_inspect.metrics.validator.subgoal_completion module
- class agent_inspect.metrics.validator.subgoal_completion.SubGoalCompletionValidator(llm_client, config=None)[source]
Bases: Validator
Validator based on LLM-as-a-judge to assess whether the agent has completed the specified subgoal.
\[LLM_{judge}(i, g_{i,j}, \tau_i) = 1 \ \mathrm{if\ the\ agent\ accomplishes}\ g_{i,j}, \ \mathrm{and}\ 0 \ \mathrm{otherwise},\]
where \(g_{i,j}\) is the \(j\)-th grading note for the task sample \(i\) and \(\tau_i\) is the agent trajectory consisting of tool calls, agent responses, and user inputs.
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Defaults to None. Configuration options:
    - templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with user inputs (static setting), subgoal, trajectory, agent responses, user task (dynamic setting), and user utterances (dynamic setting). If this is not provided, the default template for either the static or dynamic conversation setting will be used.
    - num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.
    - optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Defaults to False.
    - max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. Ignored if optimize_judge_trials is set to True.
    - include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Defaults to False.
- async validate(turn_traces, sub_goal)[source]
  Returns the LLM-as-a-judge binary score given the subgoal and the agent’s trace up to the current turn in a static conversation setting.
  - Parameters:
    - turn_traces (List[TurnTrace]) – a List[TurnTrace] object constructed with the agent trajectory information from the first turn up to the current turn.
    - sub_goal (SubGoal) – a SubGoal object representing a grading note in the form of a natural language text.
  - Return type:
    SubGoalValidationResult
  - Returns:
    a SubGoalValidationResult object containing the judge binary score, the subgoal details, and the judge explanations.
Example:
>>> from agent_inspect.metrics.validator import SubGoalCompletionValidator
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>> import asyncio
>>>
>>> data_subgoal = load_subgoal(sample_path)  # Load subgoal
>>> agent_turn_traces = load_agent_turn_traces(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> validator = SubGoalCompletionValidator(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> validator_result = asyncio.run(
...     validator.validate(
...         turn_traces=agent_turn_traces,
...         sub_goal=data_subgoal
...     )
... )
>>> print(validator_result.is_completed)
- async validate_dynamic(turn_traces, sub_goal, user_instruction)[source]
  Returns the LLM-as-a-judge binary score given the subgoal, user task instructions (optional), and the agent’s trace up to the current turn in a dynamic conversation setting.
  - Parameters:
    - turn_traces (List[TurnTrace]) – a List[TurnTrace] object constructed with the agent trajectory information from the first turn up to the current turn.
    - sub_goal (SubGoal) – a SubGoal object representing a grading note in the form of a natural language text.
    - user_instruction (str) – a str object representing the user task instructions. Provide the user task instruction through this variable to use the judge template with a user summary instruction; otherwise, set it to an empty string to use the judge template without one.
  - Return type:
    SubGoalValidationResult
  - Returns:
    a SubGoalValidationResult object containing the judge binary score, the subgoal details, and the judge explanations.
Example:
>>> from agent_inspect.metrics.validator import SubGoalCompletionValidator
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>> import asyncio
>>>
>>> data_subgoal, user_instruct = load_subgoal_user_instruct(sample_path)  # Load subgoal and user instructions
>>> agent_turn_traces = load_agent_turn_traces(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> validator = SubGoalCompletionValidator(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> validator_result = asyncio.run(
...     validator.validate_dynamic(
...         turn_traces=agent_turn_traces,
...         sub_goal=data_subgoal,
...         user_instruction=user_instruct
...     )
... )
>>> print(validator_result.is_completed)
agent_inspect.metrics.validator.tool_call_completion module
- class agent_inspect.metrics.validator.tool_call_completion.ToolCallCompletionValidator(llm_client, config=None)[source]
Bases: Validator
Tool call completion validator using both exact match and LLM-as-a-judge approaches to assess whether the tool call is correctly completed.
This validator performs a three-dimensional evaluation of tool calls:
- Tool name validation: checks whether the tool name in the agent’s trajectory matches the expected tool name exactly.
- Tool input arguments validation: validates each input parameter using either exact match (when the expected parameter specifies a value) or LLM-as-a-judge (when the expected parameter specifies a check).
- Tool output validation: validates the tool’s output using either exact match (when the expected output specifies a value) or LLM-as-a-judge (when the expected output specifies a check).
A tool call is considered correctly completed only if all three dimensions pass validation. The validator iterates through all tool calls in the latest turn of the agent trace to find a matching tool call that satisfies all validation criteria set in ExpectedToolCall.
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Defaults to None. Configuration options:
    - num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.
- async validate(agent_trace_turns, expected_tool_call)[source]
  Returns a ToolCallValidationResult indicating whether the tool call is correctly completed by the agent in the latest turn of the agent trace.
  - Parameters:
    - agent_trace_turns (List[TurnTrace]) – a list of TurnTrace representing the entire agent trace up to the current turn.
    - expected_tool_call (ExpectedToolCall) – an ExpectedToolCall representing the expected tool call checklist to validate against.
  - Return type:
    ToolCallValidationResult
  - Returns:
    a ToolCallValidationResult indicating whether the tool call is correctly completed by the agent.
Example:
>>> from agent_inspect.metrics import ToolCallCompletionValidator
>>> from agent_inspect.metrics.constants import NUM_JUDGE_TRIALS, INCLUDE_JUDGE_EXPLANATION
>>> from agent_inspect.clients import AzureOpenAIClient
>>> import asyncio
>>>
>>> expected_tool_call = load_expected_tool_call(expected_tool_call_file_path)  # Load expected tool call checklist
>>> agent_turn_traces = load_agent_turn_traces(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> validator = ToolCallCompletionValidator(
...     llm_client=client,
...     config={NUM_JUDGE_TRIALS: 5, INCLUDE_JUDGE_EXPLANATION: True}
... )
>>> validation_result = asyncio.run(
...     validator.validate(
...         agent_trace_turns=agent_turn_traces,
...         expected_tool_call=expected_tool_call
...     )
... )
>>> print(validation_result.is_completed)  # True or False
agent_inspect.metrics.validator.validator module
- class agent_inspect.metrics.validator.validator.Validator(llm_client, config=None)[source]
Bases: object
Abstract class which should be extended for actual implementations of validators.
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluations.
  - config (Optional[Dict[str, Any]]) – configuration for validator initialization. Defaults to None.
- async get_majority_voted_score_from_judge_responses(prompt)[source]
  - Return type:
    (int, List[str])
  - Parameters:
    - prompt (str)
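The majority-vote reduction used throughout this package (a single binary verdict from num_judge_trials per-trial judge scores) can be sketched as a pure function. This is an illustrative stand-in only, not the library's implementation, which also queries the judge and returns the judges' explanations alongside the score:

```python
def majority_vote(scores):
    """Reduce per-trial binary judge scores (0 or 1) to a single verdict.

    Returns 1 only when a strict majority of trials scored 1, which is
    why odd trial counts (the documented default is 5) avoid ties.
    """
    return 1 if sum(scores) * 2 > len(scores) else 0

print(majority_vote([1, 1, 0, 1, 0]))  # 1: three of five trials passed
print(majority_vote([0, 0, 1]))        # 0: only one of three trials passed
```

With an even trial count a 50/50 split falls to 0 under this strict-majority rule; how the real implementation breaks such ties is not specified here.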
- abstract async validate(turn_traces, **kwargs)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
  - turn_traces (List[TurnTrace]) – a List[TurnTrace] object constructed with the agent trajectory information from the first turn up to the current turn.
  - kwargs – additional keyword arguments that may be required for specific validation logic. These arguments can be used to pass optional parameters or configuration settings to the validator.
- Return type:
  ValidationResult
- Returns:
  a ValidationResult object containing the validation output.
Module contents
- class agent_inspect.metrics.validator.SubGoalCompletionValidator(llm_client, config=None)[source]
Bases: Validator
Validator based on LLM-as-a-judge to assess whether the agent has completed the specified subgoal.
\[LLM_{judge}(i, g_{i,j}, \tau_i) = 1 \ \mathrm{if\ the\ agent\ accomplishes}\ g_{i,j}, \ \mathrm{and}\ 0 \ \mathrm{otherwise},\]
where \(g_{i,j}\) is the \(j\)-th grading note for the task sample \(i\) and \(\tau_i\) is the agent trajectory consisting of tool calls, agent responses, and user inputs.
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Defaults to None. Configuration options:
    - templates_subgoal: a user-provided LLM-as-a-judge template which will be sent to the judge with user inputs (static setting), subgoal, trajectory, agent responses, user task (dynamic setting), and user utterances (dynamic setting). If this is not provided, the default template for either the static or dynamic conversation setting will be used.
    - num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.
    - optimize_judge_trials: a bool flag indicating whether to use optimized judge runs when doing a majority vote. Defaults to False.
    - max_retry_judge_trials: an int value indicating the maximum number of retry attempts for each judge trial in case of LLM-as-a-judge errors. Defaults to 5. Ignored if optimize_judge_trials is set to True.
    - include_entire_prompt_in_validation_result: a bool flag indicating whether to include the entire prompt sent to the LLM-as-a-judge in the SubGoalValidationResult. Use this in the debugging tool UI for display. Defaults to False.
- async validate(turn_traces, sub_goal)[source]
  Returns the LLM-as-a-judge binary score given the subgoal and the agent’s trace up to the current turn in a static conversation setting.
  - Parameters:
    - turn_traces (List[TurnTrace]) – a List[TurnTrace] object constructed with the agent trajectory information from the first turn up to the current turn.
    - sub_goal (SubGoal) – a SubGoal object representing a grading note in the form of a natural language text.
  - Return type:
    SubGoalValidationResult
  - Returns:
    a SubGoalValidationResult object containing the judge binary score, the subgoal details, and the judge explanations.
Example:
>>> from agent_inspect.metrics.validator import SubGoalCompletionValidator
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>> import asyncio
>>>
>>> data_subgoal = load_subgoal(sample_path)  # Load subgoal
>>> agent_turn_traces = load_agent_turn_traces(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> validator = SubGoalCompletionValidator(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> validator_result = asyncio.run(
...     validator.validate(
...         turn_traces=agent_turn_traces,
...         sub_goal=data_subgoal
...     )
... )
>>> print(validator_result.is_completed)
- async validate_dynamic(turn_traces, sub_goal, user_instruction)[source]
  Returns the LLM-as-a-judge binary score given the subgoal, user task instructions (optional), and the agent’s trace up to the current turn in a dynamic conversation setting.
  - Parameters:
    - turn_traces (List[TurnTrace]) – a List[TurnTrace] object constructed with the agent trajectory information from the first turn up to the current turn.
    - sub_goal (SubGoal) – a SubGoal object representing a grading note in the form of a natural language text.
    - user_instruction (str) – a str object representing the user task instructions. Provide the user task instruction through this variable to use the judge template with a user summary instruction; otherwise, set it to an empty string to use the judge template without one.
  - Return type:
    SubGoalValidationResult
  - Returns:
    a SubGoalValidationResult object containing the judge binary score, the subgoal details, and the judge explanations.
Example:
>>> from agent_inspect.metrics.validator import SubGoalCompletionValidator
>>> from agent_inspect.metrics.constants import INCLUDE_JUDGE_EXPLANATION, OPTIMIZE_JUDGE_TRIALS
>>> from agent_inspect.clients import AzureOpenAIClient
>>> import asyncio
>>>
>>> data_subgoal, user_instruct = load_subgoal_user_instruct(sample_path)  # Load subgoal and user instructions
>>> agent_turn_traces = load_agent_turn_traces(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> validator = SubGoalCompletionValidator(
...     llm_client=client,
...     config={INCLUDE_JUDGE_EXPLANATION: True, OPTIMIZE_JUDGE_TRIALS: False}
... )
>>> validator_result = asyncio.run(
...     validator.validate_dynamic(
...         turn_traces=agent_turn_traces,
...         sub_goal=data_subgoal,
...         user_instruction=user_instruct
...     )
... )
>>> print(validator_result.is_completed)
- class agent_inspect.metrics.validator.ToolCallCompletionValidator(llm_client, config=None)[source]
Bases: Validator
Tool call completion validator using both exact match and LLM-as-a-judge approaches to assess whether the tool call is correctly completed.
This validator performs a three-dimensional evaluation of tool calls:
- Tool name validation: checks whether the tool name in the agent’s trajectory matches the expected tool name exactly.
- Tool input arguments validation: validates each input parameter using either exact match (when the expected parameter specifies a value) or LLM-as-a-judge (when the expected parameter specifies a check).
- Tool output validation: validates the tool’s output using either exact match (when the expected output specifies a value) or LLM-as-a-judge (when the expected output specifies a check).
A tool call is considered correctly completed only if all three dimensions pass validation. The validator iterates through all tool calls in the latest turn of the agent trace to find a matching tool call that satisfies all validation criteria set in ExpectedToolCall.
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluation.
  - config (Optional[Dict[str, Any]]) – Defaults to None. Configuration options:
    - num_judge_trials: the number of LLM-as-a-judge runs. Defaults to 5. A majority vote is used when the number of runs is larger than 1.
    - include_judge_explanation: a bool flag indicating whether the output should also return judge explanations. Defaults to False.
- async validate(agent_trace_turns, expected_tool_call)[source]
  Returns a ToolCallValidationResult indicating whether the tool call is correctly completed by the agent in the latest turn of the agent trace.
  - Parameters:
    - agent_trace_turns (List[TurnTrace]) – a list of TurnTrace representing the entire agent trace up to the current turn.
    - expected_tool_call (ExpectedToolCall) – an ExpectedToolCall representing the expected tool call checklist to validate against.
  - Return type:
    ToolCallValidationResult
  - Returns:
    a ToolCallValidationResult indicating whether the tool call is correctly completed by the agent.
Example:
>>> from agent_inspect.metrics import ToolCallCompletionValidator
>>> from agent_inspect.metrics.constants import NUM_JUDGE_TRIALS, INCLUDE_JUDGE_EXPLANATION
>>> from agent_inspect.clients import AzureOpenAIClient
>>> import asyncio
>>>
>>> expected_tool_call = load_expected_tool_call(expected_tool_call_file_path)  # Load expected tool call checklist
>>> agent_turn_traces = load_agent_turn_traces(trace_file_path)  # Load agent trajectory information
>>> client = AzureOpenAIClient(model="gpt-4.1", max_tokens=4096)  # Create client needed for LLM-based metric
>>> validator = ToolCallCompletionValidator(
...     llm_client=client,
...     config={NUM_JUDGE_TRIALS: 5, INCLUDE_JUDGE_EXPLANATION: True}
... )
>>> validation_result = asyncio.run(
...     validator.validate(
...         agent_trace_turns=agent_turn_traces,
...         expected_tool_call=expected_tool_call
...     )
... )
>>> print(validation_result.is_completed)  # True or False
- class agent_inspect.metrics.validator.Validator(llm_client, config=None)[source]
Bases: object
Abstract class which should be extended for actual implementations of validators.
- Parameters:
  - llm_client (LLMClient) – the client which allows connection to the LLM-as-a-judge model for evaluations.
  - config (Optional[Dict[str, Any]]) – configuration for validator initialization. Defaults to None.
- async get_majority_voted_score_from_judge_responses(prompt)[source]
  - Return type:
    (int, List[str])
  - Parameters:
    - prompt (str)
- abstract async validate(turn_traces, **kwargs)[source]
This is an abstract method and should be implemented in a concrete class.
- Parameters:
  - turn_traces (List[TurnTrace]) – a List[TurnTrace] object constructed with the agent trajectory information from the first turn up to the current turn.
  - kwargs – additional keyword arguments that may be required for specific validation logic. These arguments can be used to pass optional parameters or configuration settings to the validator.
- Return type:
  ValidationResult
- Returns:
  a ValidationResult object containing the validation output.
- agent_inspect.metrics.validator.exact_match(candidate, ground_truth, config=None)[source]
  - Return type:
    bool
  - Parameters:
    - candidate (str)
    - ground_truth (str)
    - config (Dict[str, Any] | None)
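Per the signature above, exact_match compares a candidate string against a ground-truth string and returns a bool. A minimal sketch of that contract; the whitespace normalization and the case_sensitive config key are assumptions for illustration, not the library's actual behavior:

```python
from typing import Any, Dict, Optional

def exact_match(candidate: str, ground_truth: str,
                config: Optional[Dict[str, Any]] = None) -> bool:
    """Sketch of an exact-match check: strip surrounding whitespace and
    compare; the `case_sensitive` config key is a hypothetical option."""
    if config and config.get("case_sensitive", False):
        return candidate.strip() == ground_truth.strip()
    # Default sketch behavior: case-insensitive comparison.
    return candidate.strip().lower() == ground_truth.strip().lower()

print(exact_match("  Paris ", "paris"))  # True: whitespace and case normalized
print(exact_match("Paris", "London"))    # False: different strings
```

Exact match is the cheap path in the validators above; the LLM-as-a-judge path (llm_check) handles the cases where a semantic check rather than a literal value is specified.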
- async agent_inspect.metrics.validator.llm_check(client, variables, template, post_process)[source]
  - Return type:
    bool
  - Parameters:
    - client (LLMClient)
    - variables (Dict)
    - template (str)
    - post_process (Callable[[LLMResponse], bool])