agent_inspect.models.metrics package

Submodules

agent_inspect.models.metrics.agent_data_sample module

class agent_inspect.models.metrics.agent_data_sample.Conversation(turn_id, message, expected_response=None)[source]

Bases: object

Represents one back-and-forth exchange between a user and an agent (one turn), containing an input message and an optional expected response.

Parameters:
  • turn_id (int)

  • message (str)

  • expected_response (str | None)

expected_response: Optional[str] = None

Expected response from the agent given the input message. This can be None during evaluation with a user proxy.

message: str

Input message to the agent.

turn_id: int

ID representing the sequence number of this conversation in the list of conversations.
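
A minimal sketch of constructing ordered Conversation turns, assuming only the constructor signature documented above (the message contents are illustrative):

    from agent_inspect.models.metrics.agent_data_sample import Conversation

    # Two ordered turns; turn_id reflects the position in the conversation list.
    turns = [
        Conversation(turn_id=0, message="What is the weather in Paris today?"),
        Conversation(
            turn_id=1,
            message="And tomorrow?",
            # expected_response may stay None when a user proxy drives the
            # evaluation instead of a scripted exchange.
            expected_response="Tomorrow's forecast for Paris is ...",
        ),
    ]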

class agent_inspect.models.metrics.agent_data_sample.EvaluationSample(sub_goals, id=None, expected_tool_calls=None, conversation=None, user_instruction=None)[source]

Bases: object

Represents an item in the evaluation dataset.

Parameters:
  • sub_goals (List[SubGoal])

  • id (int | None)

  • expected_tool_calls (List[ExpectedToolCall] | None)

  • conversation (List[Conversation] | None)

  • user_instruction (str | None)

conversation: Optional[List[Conversation]] = None

A list of conversations between an agent and a user/user proxy. The order of the conversations in this list matters, since the metric relies on it to understand which turn follows which.

expected_tool_calls: Optional[List[ExpectedToolCall]] = None

A list of expected tools that an agent should call or utilize during an evaluation run (e.g. API calls, calculator, web search).

id: Optional[int] = None

Unique identifier for the evaluation sample.

sub_goals: List[SubGoal]

A list of subgoals that should be achieved by an agent during an evaluation run.

user_instruction: Optional[str] = None

Instruction(s) for the user proxy on how it should behave/respond while communicating with the agent during an evaluation run.

class agent_inspect.models.metrics.agent_data_sample.ExpectedToolCall(tool, expected_parameters=None, expected_output=None, turn=None)[source]

Bases: object

Represents the correct tool invocation an agent is expected to make for a particular task. It serves as the ground-truth tool call for the evaluations.

Parameters:
  • tool (str)

  • expected_parameters (List[ToolInputParameter] | None)

  • expected_output (ToolOutput | None)

  • turn (int | str | None)

expected_output: Optional[ToolOutput] = None

Represents the expected output from the tool after the agent invokes it.

expected_parameters: Optional[List[ToolInputParameter]] = None

A list of parameters with which the tool should be called by the agent during the time of evaluation.

tool: str

Represents a tool that should be called or utilized by the agent at the time of evaluation. Can be a name, a description of the tool, the URL of an API call, etc.

turn: Union[int, str, None] = None

Represents in which turn(s) this tool call should be considered. Optional; currently only a placeholder for a future implementation.

class agent_inspect.models.metrics.agent_data_sample.SubGoal(details, type=None, turn=None)[source]

Bases: object

A subgoal is a natural language assertion that defines the success criteria of an agent within a larger task. Subgoals specify intermediate success criteria and can also include the final goal, e.g. “Agent should call search_messages after getting the current timestamp” or “Agent should inform the user: Your oldest message says ‘Hey kid, you want some GPU?’”.

Parameters:
  • details (str)

  • type (str | None)

  • turn (int | str | None)

details: str

Details containing the information and criteria used by the metric to determine when the agent has achieved this subgoal.

turn: Union[int, str, None] = None

Represents in which turn(s) this subgoal should be considered. Optional; currently only a placeholder for a future implementation.

type: Optional[str] = None

Represents the type of subgoal (e.g. grading notes).
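
A minimal sketch of defining subgoals, based on the docstring above (the details strings are illustrative):

    from agent_inspect.models.metrics.agent_data_sample import SubGoal

    sub_goals = [
        # A plain natural-language assertion checked against the agent trace.
        SubGoal(details="Agent should call search_messages after getting the current timestamp"),
        # `type` and `turn` are optional; "grading notes" echoes the example
        # type mentioned in the field documentation above.
        SubGoal(details="Agent should inform the user of the oldest message", type="grading notes"),
    ]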

class agent_inspect.models.metrics.agent_data_sample.ToolInputParameter(name, value=None, check=None)[source]

Bases: object

Represents a parameter with which the agent should invoke the tool.

Parameters:
  • name (str)

  • value (Any | None)

  • check (str | None)

check: Optional[str] = None

Represents the LLM prompt used to check the correctness of the parameter name and value with which the agent invokes the tool.

name: str

Represents the expected name of the parameter (variable name).

value: Optional[Any] = None

Represents the expected value of the parameter at the moment the tool is invoked. This will be converted to str during metric calculation.

class agent_inspect.models.metrics.agent_data_sample.ToolOutput(value=None, check=None)[source]

Bases: object

Represents the expected output from the tool after the agent invokes it.

Parameters:
  • value (Any | None)

  • check (str | None)

check: Optional[str] = None

Represents the LLM prompt used to check the correctness of the output value from the tool after the agent invokes it.

value: Optional[Any] = None

Represents the expected output value from the tool after the agent invokes it. This will be converted to str during metric calculation.
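
Putting the pieces together, a sketch of a complete evaluation sample; all values are illustrative, and only the constructor signatures documented in this module are assumed:

    from agent_inspect.models.metrics.agent_data_sample import (
        Conversation,
        EvaluationSample,
        ExpectedToolCall,
        SubGoal,
        ToolInputParameter,
        ToolOutput,
    )

    sample = EvaluationSample(
        id=1,
        sub_goals=[SubGoal(details="Agent should report the converted amount to the user")],
        expected_tool_calls=[
            ExpectedToolCall(
                tool="currency_converter",  # a name, description, or API URL
                expected_parameters=[
                    # `value` is compared after conversion to str; `check` is an
                    # optional LLM prompt used instead of an exact comparison.
                    ToolInputParameter(name="amount", value=100),
                    ToolInputParameter(name="target_currency", check="Is the target currency EUR?"),
                ],
                expected_output=ToolOutput(value="92.5"),
            )
        ],
        conversation=[Conversation(turn_id=0, message="Convert 100 USD to EUR.")],
        user_instruction="Answer any follow-up questions from the agent tersely.",
    )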

agent_inspect.models.metrics.agent_trace module

class agent_inspect.models.metrics.agent_trace.AgentDialogueTrace(turns=None)[source]

Bases: object

Represents an agent trace (logs) produced by an agent given an evaluation data sample during an evaluation run.

Parameters:
  • turns (List[TurnTrace] | None)

turns: Optional[List[TurnTrace]] = None

A list of turns, each containing the subset of the agent trace (logs) produced during one back-and-forth conversation between the agent and the user/user proxy.

class agent_inspect.models.metrics.agent_trace.AgentResponse(response, status_code=None)[source]

Bases: object

Represents an agent response produced by the agent in one turn.

Parameters:
  • response (str | dict)

  • status_code (str | None)

response: Union[str, dict]

A response from the agent, which can be either a Python str or a Python dict. If a dict is provided, it will be converted to a str during metric calculation.

status_code: Optional[str] = None

Contains the HTTP status code from the agent, if any.

class agent_inspect.models.metrics.agent_trace.Step(id, parent_ids, tool=None, tool_input_args=None, tool_output=None, agent_thought=None, input_token_consumption=None, output_token_consumption=None, reasoning_token_consumption=None)[source]

Bases: object

Represents a necessary step that the agent takes within a turn to produce the final response(s) back to the user/user proxy.

Parameters:
  • id (str)

  • parent_ids (List[str])

  • tool (str | None)

  • tool_input_args (List[ToolInputParameter] | None)

  • tool_output (Any | None)

  • agent_thought (str | None)

  • input_token_consumption (int | None)

  • output_token_consumption (int | None)

  • reasoning_token_consumption (int | None)

agent_thought: Optional[str] = None

Represents the agent’s thinking process (what the agent is thinking) at the current step.

id: str

Represents the ID assigned to a step; it must be a unique identifier.

input_token_consumption: Optional[int] = None

Represents the total number of input tokens consumed in the current step.

output_token_consumption: Optional[int] = None

Represents the total number of output tokens consumed in the current step.

parent_ids: List[str]

Contains a list of parent step ID(s) that the agent took before the current step (e.g. the ID(s) of the immediately preceding steps).

reasoning_token_consumption: Optional[int] = None

Represents the total number of tokens consumed for reasoning purposes in the current step.

tool: Optional[str] = None

Represents a tool that was called or utilized by the agent during the current step. Can be a name, a description of the tool, the URL of an API call, etc. Optional if no tool is called in the current step.

tool_input_args: Optional[List[ToolInputParameter]] = None

A list of parameters with which the tool was called by the agent at the time of evaluation.

tool_output: Optional[Any] = None

Contains the tool output produced by the agent in the current step. Optional and can be None if no output was produced.

class agent_inspect.models.metrics.agent_trace.TurnTrace(id, agent_input, agent_response=None, from_id=None, steps=None, latency_in_ms=None)[source]

Bases: object

Represents a turn: one back-and-forth exchange between the agent and the user/user proxy, encapsulating the agent trace (logs) produced during that turn.

Parameters:
  • id (str)

  • agent_input (str)

  • agent_response (AgentResponse | None)

  • from_id (str | None)

  • steps (List[Step] | None)

  • latency_in_ms (float | None)

agent_input: str

An input string to the agent from a user/user proxy.

agent_response: Optional[AgentResponse] = None

An agent response produced by the agent given the input message (agent_input).

from_id: Optional[str] = None

Represents the ID of the parent turn that produced the current turn. Optional; currently only a placeholder for a future implementation.

id: str

Represents the ID assigned to a turn; it must be a unique identifier.

latency_in_ms: Optional[float] = None

Represents the total time, in milliseconds, taken by the agent to produce the output response(s).

steps: Optional[List[Step]] = None

A list of steps (e.g. tool calls, thinking processes, etc.) that the agent takes to produce the final agent response(s).
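
A minimal sketch of recording a trace, assuming the constructor signatures documented in this module; the step IDs, token counts, and response text are illustrative:

    from agent_inspect.models.metrics.agent_data_sample import ToolInputParameter
    from agent_inspect.models.metrics.agent_trace import (
        AgentDialogueTrace,
        AgentResponse,
        Step,
        TurnTrace,
    )

    # One turn in which the agent calls a tool and then answers the user.
    tool_step = Step(
        id="step-1",
        parent_ids=[],  # no preceding step in this turn
        tool="currency_converter",
        tool_input_args=[ToolInputParameter(name="amount", value=100)],
        tool_output={"converted": 92.5},
        agent_thought="I should convert the amount before answering.",
        input_token_consumption=250,
        output_token_consumption=40,
    )

    turn = TurnTrace(
        id="turn-1",
        agent_input="Convert 100 USD to EUR.",
        agent_response=AgentResponse(response="100 USD is about 92.5 EUR.", status_code="200"),
        steps=[tool_step],
        latency_in_ms=840.0,
    )

    trace = AgentDialogueTrace(turns=[turn])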

agent_inspect.models.metrics.metric_score module

class agent_inspect.models.metrics.metric_score.BooleanScore(score, explanations=None)[source]

Bases: object

Represents a boolean score produced by a metric after computation.

Parameters:
  • score (bool)

  • explanations (List[Any] | None)

explanations: Optional[List[Any]] = None

Contains a list of explanations/reasons describing why/how this particular score was calculated.

score: bool

Contains a boolean representation of the final score as calculated by the metric.

class agent_inspect.models.metrics.metric_score.NumericalScore(score, sub_scores=None, explanations=None, validation_results=None)[source]

Bases: object

Represents a numerical score produced by a metric after computation.

Parameters:
  • score (float)

  • sub_scores (Dict[str, float] | None)

  • explanations (List[Any] | None)

  • validation_results (List[ValidationResult] | None)

explanations: Optional[List[Any]] = None

Contains a list of explanations/reasons describing why/how this particular score was calculated.

score: float

Contains a numerical representation of the final score as calculated by the metric.

sub_scores: Optional[Dict[str, float]] = None

Contains a dictionary of intermediate scores used to compute the final score.

validation_results: Optional[List[ValidationResult]] = None

Contains a list of validation results for each subgoal associated with this score.
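
Score objects of these shapes are typically returned by a metric; constructing them by hand here is purely illustrative, assuming the documented field names:

    from agent_inspect.models.metrics.metric_score import BooleanScore, NumericalScore

    passed = BooleanScore(score=True, explanations=["All subgoals were met."])

    overall = NumericalScore(
        score=0.75,
        sub_scores={"subgoal_completion": 1.0, "tool_correctness": 0.5},
        explanations=["3 of 4 checks passed."],
    )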

agent_inspect.models.metrics.validation_result module

class agent_inspect.models.metrics.validation_result.SubGoalValidationResult(is_completed, explanations, sub_goal, prompt_sent_to_llmj=None)[source]

Bases: ValidationResult

Represents a result produced after subgoal validation.

Parameters:
  • is_completed (bool)

  • explanations (List[str])

  • sub_goal (SubGoal)

  • prompt_sent_to_llmj (str | None)

prompt_sent_to_llmj: Optional[str] = None

The entire prompt that is sent to the LLM-as-a-judge for validation.

sub_goal: SubGoal

Contains the subgoal that is being validated.

class agent_inspect.models.metrics.validation_result.ToolCallValidationResult(is_completed, explanations, expected_tool_call)[source]

Bases: ValidationResult

Represents a tool correctness result after validation.

Parameters:
  • is_completed (bool)

  • explanations (List[str])

  • expected_tool_call (ExpectedToolCall)

expected_tool_call: ExpectedToolCall

The expected ground truth tool call that is being validated against.

class agent_inspect.models.metrics.validation_result.ValidationResult(is_completed, explanations)[source]

Bases: object

Represents a result produced after validation (e.g. subgoal validation, tool call validation).

Parameters:
  • is_completed (bool)

  • explanations (List[str])

explanations: List[str]

Contains a list of explanations/reasons describing why/how this particular validation result was produced.

is_completed: bool

Indicates whether the validation was completed successfully.
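
Validation results of these shapes are normally produced by the validators; the sketch below constructs them by hand purely to illustrate the documented fields, with illustrative explanation strings:

    from agent_inspect.models.metrics.agent_data_sample import ExpectedToolCall, SubGoal
    from agent_inspect.models.metrics.validation_result import (
        SubGoalValidationResult,
        ToolCallValidationResult,
    )

    sub_goal_result = SubGoalValidationResult(
        is_completed=True,
        explanations=["The agent reported the converted amount verbatim."],
        sub_goal=SubGoal(details="Agent should report the converted amount"),
    )

    tool_result = ToolCallValidationResult(
        is_completed=False,
        explanations=["The agent never invoked currency_converter."],
        expected_tool_call=ExpectedToolCall(tool="currency_converter"),
    )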

Module contents

class agent_inspect.models.metrics.AgentDialogueTrace(turns=None)[source]

Bases: object

Represents an agent trace (logs) produced by an agent given an evaluation data sample during an evaluation run.

Parameters:
  • turns (List[TurnTrace] | None)

turns: Optional[List[TurnTrace]] = None

A list of turns, each containing the subset of the agent trace (logs) produced during one back-and-forth conversation between the agent and the user/user proxy.

class agent_inspect.models.metrics.AgentResponse(response, status_code=None)[source]

Bases: object

Represents an agent response produced by the agent in one turn.

Parameters:
  • response (str | dict)

  • status_code (str | None)

response: Union[str, dict]

A response from the agent, which can be either a Python str or a Python dict. If a dict is provided, it will be converted to a str during metric calculation.

status_code: Optional[str] = None

Contains the HTTP status code from the agent, if any.

class agent_inspect.models.metrics.BooleanScore(score, explanations=None)[source]

Bases: object

Represents a boolean score produced by a metric after computation.

Parameters:
  • score (bool)

  • explanations (List[Any] | None)

explanations: Optional[List[Any]] = None

Contains a list of explanation/reason(s) for why/how this particular score is calculated.

score: bool

Contains a boolean representation of the final score as calculated by the metric.

class agent_inspect.models.metrics.Conversation(turn_id, message, expected_response=None)[source]

Bases: object

Represents one back-and-forth exchange between a user and an agent (one turn), containing an input message and an optional expected response.

Parameters:
  • turn_id (int)

  • message (str)

  • expected_response (str | None)

expected_response: Optional[str] = None

Expected response from the agent given the input message. This can be None during evaluation with a user proxy.

message: str

Input message to the agent.

turn_id: int

ID representing the sequence number of this conversation in the list of conversations.

class agent_inspect.models.metrics.EvaluationSample(sub_goals, id=None, expected_tool_calls=None, conversation=None, user_instruction=None)[source]

Bases: object

Represents an item in the evaluation dataset.

Parameters:
  • sub_goals (List[SubGoal])

  • id (int | None)

  • expected_tool_calls (List[ExpectedToolCall] | None)

  • conversation (List[Conversation] | None)

  • user_instruction (str | None)

conversation: Optional[List[Conversation]] = None

A list of conversations between an agent and a user/user proxy. The order of the conversations in this list matters, since the metric relies on it to understand which turn follows which.

expected_tool_calls: Optional[List[ExpectedToolCall]] = None

A list of expected tools that an agent should call or utilize during an evaluation run (e.g. API calls, calculator, web search).

id: Optional[int] = None

Unique identifier for the evaluation sample.

sub_goals: List[SubGoal]

A list of subgoals that should be achieved by an agent during an evaluation run.

user_instruction: Optional[str] = None

Instruction(s) for the user proxy on how it should behave/respond while communicating with the agent during an evaluation run.

class agent_inspect.models.metrics.ExpectedToolCall(tool, expected_parameters=None, expected_output=None, turn=None)[source]

Bases: object

Represents the correct tool invocation an agent is expected to make for a particular task. It serves as the ground-truth tool call for the evaluations.

Parameters:
  • tool (str)

  • expected_parameters (List[ToolInputParameter] | None)

  • expected_output (ToolOutput | None)

  • turn (int | str | None)

expected_output: Optional[ToolOutput] = None

Represents the expected output from the tool after the agent invokes it.

expected_parameters: Optional[List[ToolInputParameter]] = None

A list of parameters with which the tool should be called by the agent during the time of evaluation.

tool: str

Represents a tool that should be called or utilized by the agent at the time of evaluation. Can be a name, a description of the tool, the URL of an API call, etc.

turn: Union[int, str, None] = None

Represents in which turn(s) this tool call should be considered. Optional; currently only a placeholder for a future implementation.

class agent_inspect.models.metrics.NumericalScore(score, sub_scores=None, explanations=None, validation_results=None)[source]

Bases: object

Represents a numerical score produced by a metric after computation.

Parameters:
  • score (float)

  • sub_scores (Dict[str, float] | None)

  • explanations (List[Any] | None)

  • validation_results (List[ValidationResult] | None)

explanations: Optional[List[Any]] = None

Contains a list of explanation/reason(s) for why/how this particular score is calculated.

score: float

Contains a numerical representation of the final score as calculated by the metric.

sub_scores: Optional[Dict[str, float]] = None

Contains a dictionary of intermediate scores used to compute the final score.

validation_results: Optional[List[ValidationResult]] = None

Contains a list of validation results for each subgoal associated with this score.

class agent_inspect.models.metrics.Step(id, parent_ids, tool=None, tool_input_args=None, tool_output=None, agent_thought=None, input_token_consumption=None, output_token_consumption=None, reasoning_token_consumption=None)[source]

Bases: object

Represents a necessary step that the agent takes within a turn to produce the final response(s) back to the user/user proxy.

Parameters:
  • id (str)

  • parent_ids (List[str])

  • tool (str | None)

  • tool_input_args (List[ToolInputParameter] | None)

  • tool_output (Any | None)

  • agent_thought (str | None)

  • input_token_consumption (int | None)

  • output_token_consumption (int | None)

  • reasoning_token_consumption (int | None)

agent_thought: Optional[str] = None

Represents the agent’s thinking process (what the agent is thinking) at the current step.

id: str

Represents the ID assigned to a step; it must be a unique identifier.

input_token_consumption: Optional[int] = None

Represents the total number of input tokens consumed in the current step.

output_token_consumption: Optional[int] = None

Represents the total number of output tokens consumed in the current step.

parent_ids: List[str]

Contains a list of parent step ID(s) that the agent took before the current step (e.g. the ID(s) of the immediately preceding steps).

reasoning_token_consumption: Optional[int] = None

Represents the total number of tokens consumed for reasoning purposes in the current step.

tool: Optional[str] = None

Represents a tool that was called or utilized by the agent during the current step. Can be a name, a description of the tool, the URL of an API call, etc. Optional if no tool is called in the current step.

tool_input_args: Optional[List[ToolInputParameter]] = None

A list of parameters with which the tool was called by the agent at the time of evaluation.

tool_output: Optional[Any] = None

Contains the tool output produced by the agent in the current step. Optional and can be None if no output was produced.

class agent_inspect.models.metrics.SubGoal(details, type=None, turn=None)[source]

Bases: object

A subgoal is a natural language assertion that defines the success criteria of an agent within a larger task. Subgoals specify intermediate success criteria and can also include the final goal, e.g. “Agent should call search_messages after getting the current timestamp” or “Agent should inform the user: Your oldest message says ‘Hey kid, you want some GPU?’”.

Parameters:
  • details (str)

  • type (str | None)

  • turn (int | str | None)

details: str

Details containing the information and criteria used by the metric to determine when the agent has achieved this subgoal.

turn: Union[int, str, None] = None

Represents in which turn(s) this subgoal should be considered. Optional; currently only a placeholder for a future implementation.

type: Optional[str] = None

Represents the type of subgoal (e.g. grading notes).

class agent_inspect.models.metrics.SubGoalValidationResult(is_completed, explanations, sub_goal, prompt_sent_to_llmj=None)[source]

Bases: ValidationResult

Represents a result produced after subgoal validation.

Parameters:
  • is_completed (bool)

  • explanations (List[str])

  • sub_goal (SubGoal)

  • prompt_sent_to_llmj (str | None)

explanations: List[str]

Contains a list of explanations/reasons describing why/how this particular validation result was produced.

is_completed: bool

Indicates whether the validation was completed successfully.

prompt_sent_to_llmj: Optional[str] = None

The entire prompt that is sent to the LLM-as-a-judge for validation.

sub_goal: SubGoal

Contains the subgoal that is being validated.

class agent_inspect.models.metrics.ToolCallValidationResult(is_completed, explanations, expected_tool_call)[source]

Bases: ValidationResult

Represents a tool correctness result after validation.

Parameters:
  • is_completed (bool)

  • explanations (List[str])

  • expected_tool_call (ExpectedToolCall)

expected_tool_call: ExpectedToolCall

The expected ground truth tool call that is being validated against.

explanations: List[str]

Contains a list of explanations/reasons describing why/how this particular validation result was produced.

is_completed: bool

Indicates whether the validation was completed successfully.

class agent_inspect.models.metrics.ToolInputParameter(name, value=None, check=None)[source]

Bases: object

Represents a parameter with which the agent should invoke the tool.

Parameters:
  • name (str)

  • value (Any | None)

  • check (str | None)

check: Optional[str] = None

Represents the LLM prompt used to check the correctness of the parameter name and value with which the agent invokes the tool.

name: str

Represents the expected name of the parameter (variable name).

value: Optional[Any] = None

Represents the expected value of the parameter at the moment the tool is invoked. This will be converted to str during metric calculation.

class agent_inspect.models.metrics.ToolOutput(value=None, check=None)[source]

Bases: object

Represents the expected output from the tool after the agent invokes it.

Parameters:
  • value (Any | None)

  • check (str | None)

check: Optional[str] = None

Represents the LLM prompt used to check the correctness of the output value from the tool after the agent invokes it.

value: Optional[Any] = None

Represents the expected output value from the tool after the agent invokes it. This will be converted to str during metric calculation.

class agent_inspect.models.metrics.TurnTrace(id, agent_input, agent_response=None, from_id=None, steps=None, latency_in_ms=None)[source]

Bases: object

Represents a turn: one back-and-forth exchange between the agent and the user/user proxy, encapsulating the agent trace (logs) produced during that turn.

Parameters:
  • id (str)

  • agent_input (str)

  • agent_response (AgentResponse | None)

  • from_id (str | None)

  • steps (List[Step] | None)

  • latency_in_ms (float | None)

agent_input: str

An input string to the agent from a user/user proxy.

agent_response: Optional[AgentResponse] = None

An agent response produced by the agent given the input message (agent_input).

from_id: Optional[str] = None

Represents the ID of the parent turn that produced the current turn. Optional; currently only a placeholder for a future implementation.

id: str

Represents the ID assigned to a turn; it must be a unique identifier.

latency_in_ms: Optional[float] = None

Represents the total time, in milliseconds, taken by the agent to produce the output response(s).

steps: Optional[List[Step]] = None

A list of steps (e.g. tool calls, thinking processes, etc.) that the agent takes to produce the final agent response(s).

class agent_inspect.models.metrics.ValidationResult(is_completed, explanations)[source]

Bases: object

Represents a result produced after validation (e.g. subgoal validation, tool call validation).

Parameters:
  • is_completed (bool)

  • explanations (List[str])

explanations: List[str]

Contains a list of explanations/reasons describing why/how this particular validation result was produced.

is_completed: bool

Indicates whether the validation was completed successfully.
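
Since these classes are re-exported at the package level, as the Module contents listing above shows, they can be imported directly from agent_inspect.models.metrics rather than from the individual submodules; a brief sketch with an illustrative subgoal:

    # Equivalent to the submodule imports used in the earlier examples.
    from agent_inspect.models.metrics import EvaluationSample, SubGoal

    sample = EvaluationSample(
        sub_goals=[SubGoal(details="Agent should greet the user by name")],
    )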