agent_inspect.models.metrics package
Submodules
agent_inspect.models.metrics.agent_data_sample module
- class agent_inspect.models.metrics.agent_data_sample.Conversation(turn_id, message, expected_response=None)[source]
Bases: object
Represents one back-and-forth exchange between a user and an agent (one turn), containing an input message and an optional expected response.
- Parameters:
turn_id (int)
message (str)
expected_response (str | None)
-
expected_response:
Optional[str] = None Expected response from the agent given the agent input message. This can be None during evaluation with a user proxy.
-
message:
str Input message to the agent.
-
turn_id:
int ID representing the sequence number of the conversation within the list of conversations.
- class agent_inspect.models.metrics.agent_data_sample.EvaluationSample(sub_goals, id=None, expected_tool_calls=None, conversation=None, user_instruction=None)[source]
Bases: object
Represents an item in the evaluation dataset.
- Parameters:
sub_goals (List[SubGoal])
id (int | None)
expected_tool_calls (List[ExpectedToolCall] | None)
conversation (List[Conversation] | None)
user_instruction (str | None)
-
conversation:
Optional[List[Conversation]] = None A list of conversations between an agent and a user/user proxy. The order of conversations in this list matters, as metrics rely on it to determine which turn follows which.
-
expected_tool_calls:
Optional[List[ExpectedToolCall]] = None A list of expected tools that an agent should call or utilize during an evaluation run (e.g., API calls, a calculator, web search).
-
id:
Optional[int] = None Unique identifier for the evaluation sample.
-
sub_goals:
List[SubGoal] A list of subgoals which should be achieved by an agent during an evaluation run.
-
user_instruction:
Optional[str] = None Instruction(s) for the user proxy on how it should behave and respond while communicating with the agent during an evaluation run.
- class agent_inspect.models.metrics.agent_data_sample.ExpectedToolCall(tool, expected_parameters=None, expected_output=None, turn=None)[source]
Bases: object
Represents the correct tool invocation an agent is expected to make for a particular task. It serves as the ground-truth tool call for the evaluations.
- Parameters:
tool (str)
expected_parameters (List[ToolInputParameter] | None)
expected_output (ToolOutput | None)
turn (int | str | None)
-
expected_output:
Optional[ToolOutput] = None Represents the expected output from the tool after the agent invokes it.
-
expected_parameters:
Optional[List[ToolInputParameter]] = None A list of parameters with which the tool should be called by the agent during the time of evaluation.
-
tool:
str Represents a tool that should be called or utilized by the agent at the time of evaluation. Can be a name, a description of the tool, the URL of an API call, etc.
-
turn:
Union[int,str,None] = None Represents the turn(s) in which this tool call should be considered. Optional; this is only a placeholder for a future implementation.
- class agent_inspect.models.metrics.agent_data_sample.SubGoal(details, type=None, turn=None)[source]
Bases: object
A subgoal is a natural language assertion that defines the success criteria of an agent within a larger task. Subgoals specify intermediate success criteria and can also include the final goal, e.g., “Agent should call search_messages after getting the current timestamp” or “Agent should inform the user: Your oldest message says ‘Hey kid, you want some GPU?’”.
- Parameters:
details (str)
type (str | None)
turn (int | str | None)
-
details:
str Details containing the information and criteria used by the metric to determine when the agent has achieved this subgoal.
-
turn:
Union[int,str,None] = None Represents the turn(s) in which this subgoal should be considered. Optional; this is only a placeholder for a future implementation.
-
type:
Optional[str] = None Represents the type of subgoal (e.g. grading notes).
- class agent_inspect.models.metrics.agent_data_sample.ToolInputParameter(name, value=None, check=None)[source]
Bases: object
Represents a parameter with which the agent should invoke the tool.
- Parameters:
name (str)
value (Any | None)
check (str | None)
-
check:
Optional[str] = None Represents the LLM prompt used to check the correctness of the parameter name and the value with which the agent invokes the tool.
-
name:
str Represents the expected name of the parameter (variable name).
-
value:
Optional[Any] = None Represents the expected value of the parameter at the moment the tool is invoked. This will be converted to str during metric calculation.
- class agent_inspect.models.metrics.agent_data_sample.ToolOutput(value=None, check=None)[source]
Bases: object
Represents the expected output from the tool after the agent invokes it.
- Parameters:
value (Any | None)
check (str | None)
-
check:
Optional[str] = None Represents the LLM prompt used to check the correctness of the output value from the tool after the agent invokes it.
-
value:
Optional[Any] = None Represents the expected output value from the tool after the agent invokes it. This will be converted to str during metric calculation.
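Putting the dataclasses above together, a single evaluation dataset item might be assembled as follows. This is a minimal sketch: the dataclass definitions below are local stand-ins that mirror the documented constructor signatures so the snippet is self-contained; in real use you would import these classes from agent_inspect.models.metrics.agent_data_sample instead.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Local stand-ins mirroring the documented signatures; illustrative only.

@dataclass
class Conversation:
    turn_id: int
    message: str
    expected_response: Optional[str] = None

@dataclass
class SubGoal:
    details: str
    type: Optional[str] = None
    turn: Any = None

@dataclass
class ToolInputParameter:
    name: str
    value: Optional[Any] = None
    check: Optional[str] = None

@dataclass
class ExpectedToolCall:
    tool: str
    expected_parameters: Optional[List[ToolInputParameter]] = None
    expected_output: Optional[Any] = None
    turn: Any = None

@dataclass
class EvaluationSample:
    sub_goals: List[SubGoal]
    id: Optional[int] = None
    expected_tool_calls: Optional[List[ExpectedToolCall]] = None
    conversation: Optional[List[Conversation]] = None
    user_instruction: Optional[str] = None

# One dataset item: a subgoal, an expected tool call, and the opening turn.
sample = EvaluationSample(
    id=1,
    sub_goals=[
        SubGoal(details="Agent should call search_messages after getting the current timestamp")
    ],
    expected_tool_calls=[
        ExpectedToolCall(
            tool="search_messages",
            expected_parameters=[ToolInputParameter(name="query", value="oldest")],
        )
    ],
    conversation=[
        # expected_response is left as None, as it would be with a user proxy.
        Conversation(turn_id=1, message="What does my oldest message say?")
    ],
)
```

The tool name, parameter name, and message text here are hypothetical values chosen for illustration.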
agent_inspect.models.metrics.agent_trace module
- class agent_inspect.models.metrics.agent_trace.AgentDialogueTrace(turns=None)[source]
Bases: object
Represents an agent trace (logs) produced by an agent given an evaluation data sample during an evaluation run.
- Parameters:
turns (List[TurnTrace] | None)
- class agent_inspect.models.metrics.agent_trace.AgentResponse(response, status_code=None)[source]
Bases: object
Represents an agent response produced by the agent in one turn.
- Parameters:
response (str | dict)
status_code (str | None)
-
response:
Union[str,dict] A response from the agent, which can be either a Python str or a Python dict. A dict will be converted to a str during metric calculation.
-
status_code:
Optional[str] = None Contains the HTTP status code from the agent, if any.
- class agent_inspect.models.metrics.agent_trace.Step(id, parent_ids, tool=None, tool_input_args=None, tool_output=None, agent_thought=None, input_token_consumption=None, output_token_consumption=None, reasoning_token_consumption=None)[source]
Bases: object
Represents a necessary step that the agent takes within a turn to produce the final response(s) back to the user/user proxy.
- Parameters:
id (str)
parent_ids (List[str])
tool (str | None)
tool_input_args (List[ToolInputParameter] | None)
tool_output (Any | None)
agent_thought (str | None)
input_token_consumption (int | None)
output_token_consumption (int | None)
reasoning_token_consumption (int | None)
-
agent_thought:
Optional[str] = None Represents the agent’s thinking process (what the agent is thinking) at the current step.
-
id:
str Represents an ID assigned to a step, must be a unique identifier.
-
input_token_consumption:
Optional[int] = None Represents the total number of input tokens consumed in the current step.
-
output_token_consumption:
Optional[int] = None Represents the total number of output tokens consumed in the current step.
-
parent_ids:
List[str] Contains a list of parent step ID(s) which the agent took before the current step (e.g., the ID(s) of the immediately preceding steps).
-
reasoning_token_consumption:
Optional[int] = None Represents the total number of tokens consumed for reasoning purposes in the current step.
-
tool:
Optional[str] = None Represents a tool that was called or utilized by the agent during the current step. Can be a name, a description of the tool, the URL of an API call, etc. Optional if no tool is called in the current step.
-
tool_input_args:
Optional[List[ToolInputParameter]] = None A list of parameters with which the tool was called by the agent at the time of evaluation.
-
tool_output:
Optional[Any] = None Contains the tool output produced by the agent in the current step. Optional and can be none if no output was produced.
- class agent_inspect.models.metrics.agent_trace.TurnTrace(id, agent_input, agent_response=None, from_id=None, steps=None, latency_in_ms=None)[source]
Bases: object
Represents a turn: one back-and-forth exchange between the agent and the user/user proxy, encapsulating the agent trace (logs) produced during that turn.
- Parameters:
id (str)
agent_input (str)
agent_response (AgentResponse | None)
from_id (str | None)
steps (List[Step] | None)
latency_in_ms (float | None)
-
agent_input:
str An input string to the agent from a user/user proxy.
-
agent_response:
Optional[AgentResponse] = None An agent response produced by the agent given the input message (agent_input).
-
from_id:
Optional[str] = None Represents the ID of the parent turn that produced the current turn. Optional; this is only a placeholder for a future implementation.
-
id:
str Represents an ID assigned to a turn, must be a unique identifier.
-
latency_in_ms:
Optional[float] = None Represents the total time, in milliseconds, taken by the agent to produce the output response(s).
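The trace classes above can be sketched as follows, wiring two Steps into a TurnTrace via parent_ids. As before, the dataclasses are local stand-ins mirroring the documented signatures so the snippet is self-contained; the tool names, IDs, and message text are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Local stand-ins mirroring the documented signatures; illustrative only.

@dataclass
class AgentResponse:
    response: Any                      # str or dict; a dict is stringified during metric calculation
    status_code: Optional[str] = None

@dataclass
class Step:
    id: str
    parent_ids: List[str]
    tool: Optional[str] = None
    tool_input_args: Optional[List[Any]] = None
    tool_output: Optional[Any] = None
    agent_thought: Optional[str] = None

@dataclass
class TurnTrace:
    id: str
    agent_input: str
    agent_response: Optional[AgentResponse] = None
    from_id: Optional[str] = None
    steps: Optional[List[Step]] = None
    latency_in_ms: Optional[float] = None

@dataclass
class AgentDialogueTrace:
    turns: Optional[List[TurnTrace]] = None

# Step "s2" lists "s1" in parent_ids, encoding that it ran after "s1".
s1 = Step(id="s1", parent_ids=[], tool="get_current_timestamp", tool_output="1700000000")
s2 = Step(id="s2", parent_ids=["s1"], tool="search_messages",
          tool_output="Hey kid, you want some GPU?")

turn = TurnTrace(
    id="t1",
    agent_input="What does my oldest message say?",
    agent_response=AgentResponse(response="Your oldest message says 'Hey kid, you want some GPU?'"),
    steps=[s1, s2],
    latency_in_ms=1234.5,
)
trace = AgentDialogueTrace(turns=[turn])
```

The parent_ids list, rather than plain sequence order, is what records the step ordering, which also allows branching step graphs.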
agent_inspect.models.metrics.metric_score module
- class agent_inspect.models.metrics.metric_score.BooleanScore(score, explanations=None)[source]
Bases: object
Represents a boolean score produced by a metric after computation.
- Parameters:
score (bool)
explanations (List[Any] | None)
-
explanations:
Optional[List[Any]] = None Contains a list of explanation/reason(s) for why/how this particular score is calculated.
-
score:
bool Contains a boolean representation of the final score as calculated by the metric.
- class agent_inspect.models.metrics.metric_score.NumericalScore(score, sub_scores=None, explanations=None, validation_results=None)[source]
Bases: object
Represents a numerical score produced by a metric after computation.
- Parameters:
score (float)
sub_scores (Dict[str, float] | None)
explanations (List[Any] | None)
validation_results (List[ValidationResult] | None)
-
explanations:
Optional[List[Any]] = None Contains a list of explanation/reason(s) for why/how this particular score is calculated.
-
score:
float Contains a numerical representation of the final score as calculated by the metric.
-
sub_scores:
Optional[Dict[str,float]] = None Contains a dictionary of intermediate scores used to compute the final score.
-
validation_results:
Optional[List[ValidationResult]] = None Contains a list of validation results for each subgoal associated with this score.
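A NumericalScore might be populated as follows. The dataclass is a local stand-in mirroring the documented signature, and the mean-of-sub-scores aggregation is purely illustrative, not necessarily how any metric in the package combines sub-scores.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Local stand-in mirroring the documented signature; illustrative only.
@dataclass
class NumericalScore:
    score: float
    sub_scores: Optional[Dict[str, float]] = None
    explanations: Optional[List[Any]] = None
    validation_results: Optional[List[Any]] = None

# Hypothetical sub-score names; the final score here is their mean.
sub = {"subgoal_completion": 1.0, "tool_correctness": 0.5}
score = NumericalScore(
    score=sum(sub.values()) / len(sub),
    sub_scores=sub,
    explanations=["1 of 2 expected tool calls matched"],
)
```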
agent_inspect.models.metrics.validation_result module
- class agent_inspect.models.metrics.validation_result.SubGoalValidationResult(is_completed, explanations, sub_goal, prompt_sent_to_llmj=None)[source]
Bases: ValidationResult
Represents a result produced after subgoal validation.
- Parameters:
is_completed (bool)
explanations (List[str])
sub_goal (SubGoal)
prompt_sent_to_llmj (str | None)
-
prompt_sent_to_llmj:
Optional[str] = None The entire prompt that is sent to the LLM-as-a-judge for validation.
- class agent_inspect.models.metrics.validation_result.ToolCallValidationResult(is_completed, explanations, expected_tool_call)[source]
Bases: ValidationResult
Represents a tool correctness result after validation.
- Parameters:
is_completed (bool)
explanations (List[str])
expected_tool_call (ExpectedToolCall)
-
expected_tool_call:
ExpectedToolCall The expected ground truth tool call that is being validated against.
- class agent_inspect.models.metrics.validation_result.ValidationResult(is_completed, explanations)[source]
Bases: object
Represents a result produced after validation (e.g. subgoal validation, tool call validation).
- Parameters:
is_completed (bool)
explanations (List[str])
-
explanations:
List[str] Contains a list of explanation/reason(s) for why/how this particular validation result is produced.
-
is_completed:
bool Indicates whether the validation was completed successfully.
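The validation-result hierarchy can be sketched with local dataclass stand-ins mirroring the documented signatures (SubGoalValidationResult extending ValidationResult); for brevity the sub_goal field holds a plain string here rather than a SubGoal instance, and the explanation text is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Local stand-ins mirroring the documented signatures; illustrative only.
@dataclass
class ValidationResult:
    is_completed: bool
    explanations: List[str]

@dataclass
class SubGoalValidationResult(ValidationResult):
    sub_goal: str = ""                        # stand-in: documented type is SubGoal
    prompt_sent_to_llmj: Optional[str] = None

result = SubGoalValidationResult(
    is_completed=True,
    explanations=["Agent called search_messages after fetching the timestamp"],
    sub_goal="Agent should call search_messages after getting the current timestamp",
)
```

Because SubGoalValidationResult subclasses ValidationResult, it can appear anywhere a plain ValidationResult is expected, such as in NumericalScore.validation_results.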
Module contents
- class agent_inspect.models.metrics.AgentDialogueTrace(turns=None)[source]
Bases: object
Represents an agent trace (logs) produced by an agent given an evaluation data sample during an evaluation run.
- Parameters:
turns (List[TurnTrace] | None)
- class agent_inspect.models.metrics.AgentResponse(response, status_code=None)[source]
Bases: object
Represents an agent response produced by the agent in one turn.
- Parameters:
response (str | dict)
status_code (str | None)
-
response:
Union[str,dict] A response from the agent, which can be either a Python str or a Python dict. A dict will be converted to a str during metric calculation.
-
status_code:
Optional[str] = None Contains the HTTP status code from the agent, if any.
- class agent_inspect.models.metrics.BooleanScore(score, explanations=None)[source]
Bases: object
Represents a boolean score produced by a metric after computation.
- Parameters:
score (bool)
explanations (List[Any] | None)
-
explanations:
Optional[List[Any]] = None Contains a list of explanation/reason(s) for why/how this particular score is calculated.
-
score:
bool Contains a boolean representation of the final score as calculated by the metric.
- class agent_inspect.models.metrics.Conversation(turn_id, message, expected_response=None)[source]
Bases: object
Represents one back-and-forth exchange between a user and an agent (one turn), containing an input message and an optional expected response.
- Parameters:
turn_id (int)
message (str)
expected_response (str | None)
-
expected_response:
Optional[str] = None Expected response from the agent given the agent input message. This can be None during evaluation with a user proxy.
-
message:
str Input message to the agent.
-
turn_id:
int ID representing the sequence number of the conversation within the list of conversations.
- class agent_inspect.models.metrics.EvaluationSample(sub_goals, id=None, expected_tool_calls=None, conversation=None, user_instruction=None)[source]
Bases: object
Represents an item in the evaluation dataset.
- Parameters:
sub_goals (List[SubGoal])
id (int | None)
expected_tool_calls (List[ExpectedToolCall] | None)
conversation (List[Conversation] | None)
user_instruction (str | None)
-
conversation:
Optional[List[Conversation]] = None A list of conversations between an agent and a user/user proxy. The order of conversations in this list matters, as metrics rely on it to determine which turn follows which.
-
expected_tool_calls:
Optional[List[ExpectedToolCall]] = None A list of expected tools that an agent should call or utilize during an evaluation run (e.g., API calls, a calculator, web search).
-
id:
Optional[int] = None Unique identifier for the evaluation sample.
-
sub_goals:
List[SubGoal] A list of subgoals which should be achieved by an agent during an evaluation run.
-
user_instruction:
Optional[str] = None Instruction(s) for the user proxy on how it should behave and respond while communicating with the agent during an evaluation run.
- class agent_inspect.models.metrics.ExpectedToolCall(tool, expected_parameters=None, expected_output=None, turn=None)[source]
Bases: object
Represents the correct tool invocation an agent is expected to make for a particular task. It serves as the ground-truth tool call for the evaluations.
- Parameters:
tool (str)
expected_parameters (List[ToolInputParameter] | None)
expected_output (ToolOutput | None)
turn (int | str | None)
-
expected_output:
Optional[ToolOutput] = None Represents the expected output from the tool after the agent invokes it.
-
expected_parameters:
Optional[List[ToolInputParameter]] = None A list of parameters with which the tool should be called by the agent during the time of evaluation.
-
tool:
str Represents a tool that should be called or utilized by the agent at the time of evaluation. Can be a name, a description of the tool, the URL of an API call, etc.
-
turn:
Union[int,str,None] = None Represents the turn(s) in which this tool call should be considered. Optional; this is only a placeholder for a future implementation.
- class agent_inspect.models.metrics.NumericalScore(score, sub_scores=None, explanations=None, validation_results=None)[source]
Bases: object
Represents a numerical score produced by a metric after computation.
- Parameters:
score (float)
sub_scores (Dict[str, float] | None)
explanations (List[Any] | None)
validation_results (List[ValidationResult] | None)
-
explanations:
Optional[List[Any]] = None Contains a list of explanation/reason(s) for why/how this particular score is calculated.
-
score:
float Contains a numerical representation of the final score as calculated by the metric.
-
sub_scores:
Optional[Dict[str,float]] = None Contains a dictionary of intermediate scores used to compute the final score.
-
validation_results:
Optional[List[ValidationResult]] = None Contains a list of validation results for each subgoal associated with this score.
- class agent_inspect.models.metrics.Step(id, parent_ids, tool=None, tool_input_args=None, tool_output=None, agent_thought=None, input_token_consumption=None, output_token_consumption=None, reasoning_token_consumption=None)[source]
Bases: object
Represents a necessary step that the agent takes within a turn to produce the final response(s) back to the user/user proxy.
- Parameters:
id (str)
parent_ids (List[str])
tool (str | None)
tool_input_args (List[ToolInputParameter] | None)
tool_output (Any | None)
agent_thought (str | None)
input_token_consumption (int | None)
output_token_consumption (int | None)
reasoning_token_consumption (int | None)
-
agent_thought:
Optional[str] = None Represents the agent’s thinking process (what the agent is thinking) at the current step.
-
id:
str Represents an ID assigned to a step, must be a unique identifier.
-
input_token_consumption:
Optional[int] = None Represents the total number of input tokens consumed in the current step.
-
output_token_consumption:
Optional[int] = None Represents the total number of output tokens consumed in the current step.
-
parent_ids:
List[str] Contains a list of parent step ID(s) which the agent took before the current step (e.g., the ID(s) of the immediately preceding steps).
-
reasoning_token_consumption:
Optional[int] = None Represents the total number of tokens consumed for reasoning purposes in the current step.
-
tool:
Optional[str] = None Represents a tool that was called or utilized by the agent during the current step. Can be a name, a description of the tool, the URL of an API call, etc. Optional if no tool is called in the current step.
-
tool_input_args:
Optional[List[ToolInputParameter]] = None A list of parameters with which the tool was called by the agent at the time of evaluation.
-
tool_output:
Optional[Any] = None Contains the tool output produced by the agent in the current step. Optional and can be none if no output was produced.
- class agent_inspect.models.metrics.SubGoal(details, type=None, turn=None)[source]
Bases: object
A subgoal is a natural language assertion that defines the success criteria of an agent within a larger task. Subgoals specify intermediate success criteria and can also include the final goal, e.g., “Agent should call search_messages after getting the current timestamp” or “Agent should inform the user: Your oldest message says ‘Hey kid, you want some GPU?’”.
- Parameters:
details (str)
type (str | None)
turn (int | str | None)
-
details:
str Details containing the information and criteria used by the metric to determine when the agent has achieved this subgoal.
-
turn:
Union[int,str,None] = None Represents the turn(s) in which this subgoal should be considered. Optional; this is only a placeholder for a future implementation.
-
type:
Optional[str] = None Represents the type of subgoal (e.g. grading notes).
- class agent_inspect.models.metrics.SubGoalValidationResult(is_completed, explanations, sub_goal, prompt_sent_to_llmj=None)[source]
Bases: ValidationResult
Represents a result produced after subgoal validation.
- Parameters:
is_completed (bool)
explanations (List[str])
sub_goal (SubGoal)
prompt_sent_to_llmj (str | None)
-
explanations:
List[str] Contains a list of explanation/reason(s) for why/how this particular validation result is produced.
-
is_completed:
bool Indicates whether the validation was completed successfully.
-
prompt_sent_to_llmj:
Optional[str] = None The entire prompt that is sent to the LLM-as-a-judge for validation.
- class agent_inspect.models.metrics.ToolCallValidationResult(is_completed, explanations, expected_tool_call)[source]
Bases: ValidationResult
Represents a tool correctness result after validation.
- Parameters:
is_completed (bool)
explanations (List[str])
expected_tool_call (ExpectedToolCall)
-
expected_tool_call:
ExpectedToolCall The expected ground truth tool call that is being validated against.
-
explanations:
List[str] Contains a list of explanation/reason(s) for why/how this particular validation result is produced.
-
is_completed:
bool Indicates whether the validation was completed successfully.
- class agent_inspect.models.metrics.ToolInputParameter(name, value=None, check=None)[source]
Bases: object
Represents a parameter with which the agent should invoke the tool.
- Parameters:
name (str)
value (Any | None)
check (str | None)
-
check:
Optional[str] = None Represents the LLM prompt used to check the correctness of the parameter name and the value with which the agent invokes the tool.
-
name:
str Represents the expected name of the parameter (variable name).
-
value:
Optional[Any] = None Represents the expected value of the parameter at the moment the tool is invoked. This will be converted to str during metric calculation.
- class agent_inspect.models.metrics.ToolOutput(value=None, check=None)[source]
Bases: object
Represents the expected output from the tool after the agent invokes it.
- Parameters:
value (Any | None)
check (str | None)
-
check:
Optional[str] = None Represents the LLM prompt used to check the correctness of the output value from the tool after the agent invokes it.
-
value:
Optional[Any] = None Represents the expected output value from the tool after the agent invokes it. This will be converted to str during metric calculation.
- class agent_inspect.models.metrics.TurnTrace(id, agent_input, agent_response=None, from_id=None, steps=None, latency_in_ms=None)[source]
Bases: object
Represents a turn: one back-and-forth exchange between the agent and the user/user proxy, encapsulating the agent trace (logs) produced during that turn.
- Parameters:
id (str)
agent_input (str)
agent_response (AgentResponse | None)
from_id (str | None)
steps (List[Step] | None)
latency_in_ms (float | None)
-
agent_input:
str An input string to the agent from a user/user proxy.
-
agent_response:
Optional[AgentResponse] = None An agent response produced by the agent given the input message (agent_input).
-
from_id:
Optional[str] = None Represents the ID of the parent turn that produced the current turn. Optional; this is only a placeholder for a future implementation.
-
id:
str Represents an ID assigned to a turn, must be a unique identifier.
-
latency_in_ms:
Optional[float] = None Represents the total time, in milliseconds, taken by the agent to produce the output response(s).
- class agent_inspect.models.metrics.ValidationResult(is_completed, explanations)[source]
Bases: object
Represents a result produced after validation (e.g. subgoal validation, tool call validation).
- Parameters:
is_completed (bool)
explanations (List[str])
-
explanations:
List[str] Contains a list of explanation/reason(s) for why/how this particular validation result is produced.
-
is_completed:
bool Indicates whether the validation was completed successfully.