agent_inspect.models.metrics package

Submodules

agent_inspect.models.metrics.agent_data_sample module

class agent_inspect.models.metrics.agent_data_sample.Conversation(turn_id, message, expected_response=None)[source]

Bases: object

Represents one back-and-forth exchange between a user and an agent (one turn), containing an input message and an optional expected response.

Parameters:
  • turn_id (int)

  • message (str)

  • expected_response (str | None)

expected_response: Optional[str] = None

Expected response from the agent given the input message. This can be None during evaluation with a user proxy.

message: str

Input message to the agent.

turn_id: int

ID representing the sequence number of this conversation in the list of conversations.
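
A minimal sketch of constructing ordered Conversation turns, assuming only the constructor signature documented above (the message contents are illustrative):

    from agent_inspect.models.metrics.agent_data_sample import Conversation

    # Two ordered turns; turn_id reflects the position in the conversation list.
    turns = [
        Conversation(turn_id=0, message="What is the weather in Paris today?"),
        Conversation(
            turn_id=1,
            message="And tomorrow?",
            # expected_response may stay None when a user proxy drives the
            # evaluation instead of a scripted exchange.
            expected_response="Tomorrow's forecast for Paris is ...",
        ),
    ]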

class agent_inspect.models.metrics.agent_data_sample.EvaluationSample(sub_goals, id=None, expected_tool_calls=None, conversation=None, user_instruction=None)[source]

Bases: object

Represents an item in the evaluation dataset.

Parameters:
  • sub_goals (List[SubGoal])

  • id (int | None)

  • expected_tool_calls (List[ExpectedToolCall] | None)

  • conversation (List[Conversation] | None)

  • user_instruction (str | None)

conversation: Optional[List[Conversation]] = None

A list of conversations between an agent and a user/user proxy. The order of the conversations in this list matters, since the metric relies on it to understand which turn follows which.

expected_tool_calls: Optional[List[ExpectedToolCall]] = None

A list of expected tools that an agent should call or utilize during an evaluation run (e.g. API calls, calculator, web search).

id: Optional[int] = None

Unique identifier for the evaluation sample.

sub_goals: List[SubGoal]

A list of subgoals that should be achieved by an agent during an evaluation run.

user_instruction: Optional[str] = None

Instruction(s) for the user proxy on how it should behave/respond while communicating with the agent during an evaluation run.

class agent_inspect.models.metrics.agent_data_sample.ExpectedToolCall(tool, expected_parameters=None, expected_output=None, turn=None)[source]

Bases: object

Represents the correct tool invocation an agent is expected to make for a particular task. It serves as the ground-truth tool call for the evaluations.

Parameters:
  • tool (str)

  • expected_parameters (List[ToolInputParameter] | None)

  • expected_output (ToolOutput | None)

  • turn (int | str | None)

expected_output: Optional[ToolOutput] = None

Represents the expected output from the tool after the agent invokes it.

expected_parameters: Optional[List[ToolInputParameter]] = None

A list of parameters with which the tool should be called by the agent during the time of evaluation.

tool: str

Represents a tool that should be called or utilized by the agent at the time of evaluation. Can be a name, a description of the tool, the URL of an API call, etc.

turn: Union[int, str, None] = None

Represents in which turn(s) this tool call should be considered. Optional; currently only a placeholder for a future implementation.

class agent_inspect.models.metrics.agent_data_sample.SubGoal(details, type=None, turn=None)[source]

Bases: object

A subgoal is a natural language assertion that defines the success criteria of an agent within a larger task. Subgoals specify intermediate success criteria and can also include the final goal, e.g. “Agent should call search_messages after getting the current timestamp” or “Agent should inform the user: Your oldest message says ‘Hey kid, you want some GPU?’”.

Parameters:
  • details (str)

  • type (str | None)

  • turn (int | str | None)

details: str

Details containing the information and criteria used by the metric to determine when the agent has achieved this subgoal.

turn: Union[int, str, None] = None

Represents in which turn(s) this subgoal should be considered. Optional; currently only a placeholder for a future implementation.

type: Optional[str] = None

Represents the type of subgoal (e.g. grading notes).
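
A minimal sketch of defining subgoals, based on the docstring above (the details strings are illustrative):

    from agent_inspect.models.metrics.agent_data_sample import SubGoal

    sub_goals = [
        # A plain natural-language assertion checked against the agent trace.
        SubGoal(details="Agent should call search_messages after getting the current timestamp"),
        # `type` and `turn` are optional; "grading notes" echoes the example
        # type mentioned in the field documentation above.
        SubGoal(details="Agent should inform the user of the oldest message", type="grading notes"),
    ]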

class agent_inspect.models.metrics.agent_data_sample.ToolInputParameter(name, value=None, check=None)[source]

Bases: object

Represents a parameter with which the agent should invoke the tool.

Parameters:
  • name (str)

  • value (Any | None)

  • check (str | None)

check: Optional[str] = None

Represents the LLM prompt used to check the correctness of the parameter name and value with which the agent invokes the tool.

name: str

Represents the expected name of the parameter (variable name).

value: Optional[Any] = None

Represents the expected value of the parameter at the moment the tool is invoked. This will be converted to str during metric calculation.

class agent_inspect.models.metrics.agent_data_sample.ToolOutput(value=None, check=None)[source]

Bases: object

Represents the expected output from the tool after the agent invokes it.

Parameters:
  • value (Any | None)

  • check (str | None)

check: Optional[str] = None

Represents the LLM prompt used to check the correctness of the output value from the tool after the agent invokes it.

value: Optional[Any] = None

Represents the expected output value from the tool after the agent invokes it. This will be converted to str during metric calculation.
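
Putting the pieces together, a sketch of a complete evaluation sample; all values are illustrative, and only the constructor signatures documented in this module are assumed:

    from agent_inspect.models.metrics.agent_data_sample import (
        Conversation,
        EvaluationSample,
        ExpectedToolCall,
        SubGoal,
        ToolInputParameter,
        ToolOutput,
    )

    sample = EvaluationSample(
        id=1,
        sub_goals=[SubGoal(details="Agent should report the converted amount to the user")],
        expected_tool_calls=[
            ExpectedToolCall(
                tool="currency_converter",  # a name, description, or API URL
                expected_parameters=[
                    # `value` is compared after conversion to str; `check` is an
                    # optional LLM prompt used instead of an exact comparison.
                    ToolInputParameter(name="amount", value=100),
                    ToolInputParameter(name="target_currency", check="Is the target currency EUR?"),
                ],
                expected_output=ToolOutput(value="92.5"),
            )
        ],
        conversation=[Conversation(turn_id=0, message="Convert 100 USD to EUR.")],
        user_instruction="Answer any follow-up questions from the agent tersely.",
    )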

agent_inspect.models.metrics.agent_trace module

class agent_inspect.models.metrics.agent_trace.AgentDialogueTrace(turns=None)[source]

Bases: object

Represents an agent trace (logs) produced by an agent given an evaluation data sample during an evaluation run.

Parameters:
  • turns (List[TurnTrace] | None)

turns: Optional[List[TurnTrace]] = None

A list of turns, each containing the subset of the agent trace (logs) produced during one back-and-forth conversation between the agent and the user/user proxy.

class agent_inspect.models.metrics.agent_trace.AgentResponse(response, status_code=None)[source]

Bases: object

Represents an agent response produced by the agent in one turn.

Parameters:
  • response (str | dict)

  • status_code (str | None)

response: Union[str, dict]

A response from the agent, which can be either a Python str or a Python dict. If a dict is provided, it will be converted to a str during metric calculation.

status_code: Optional[str] = None

Contains the HTTP status code from the agent, if any.

class agent_inspect.models.metrics.agent_trace.Step(id, parent_ids, tool=None, tool_input_args=None, tool_output=None, agent_thought=None, input_token_consumption=None, output_token_consumption=None, reasoning_token_consumption=None)[source]

Bases: object

Represents a necessary step that the agent takes within a turn to produce the final response(s) back to the user/user proxy.

Parameters:
  • id (str)

  • parent_ids (List[str])

  • tool (str | None)

  • tool_input_args (List[ToolInputParameter] | None)

  • tool_output (Any | None)

  • agent_thought (str | None)

  • input_token_consumption (int | None)

  • output_token_consumption (int | None)

  • reasoning_token_consumption (int | None)

agent_thought: Optional[str] = None

Represents the agent’s thinking process (what the agent is thinking) at the current step.

id: str

Represents the ID assigned to a step; it must be a unique identifier.

input_token_consumption: Optional[int] = None

Represents the total number of input tokens consumed in the current step.

output_token_consumption: Optional[int] = None

Represents the total number of output tokens consumed in the current step.

parent_ids: List[str]

Contains a list of parent step ID(s) that the agent took before the current step (e.g. the ID(s) of the immediately preceding steps).

reasoning_token_consumption: Optional[int] = None

Represents the total number of tokens consumed for reasoning purposes in the current step.

tool: Optional[str] = None

Represents a tool that was called or utilized by the agent during the current step. Can be a name, a description of the tool, the URL of an API call, etc. Optional if no tool is called in the current step.

tool_input_args: Optional[List[ToolInputParameter]] = None

A list of parameters with which the tool was called by the agent at the time of evaluation.

tool_output: Optional[Any] = None

Contains the tool output produced by the agent in the current step. Optional and can be None if no output was produced.

class agent_inspect.models.metrics.agent_trace.TurnTrace(id, agent_input, agent_response=None, from_id=None, steps=None, latency_in_ms=None)[source]

Bases: object

Represents a turn: one back-and-forth exchange between the agent and the user/user proxy, encapsulating the agent trace (logs) produced during that turn.

Parameters:
  • id (str)

  • agent_input (str)

  • agent_response (AgentResponse | None)

  • from_id (str | None)

  • steps (List[Step] | None)

  • latency_in_ms (float | None)

agent_input: str

An input string to the agent from a user/user proxy.

agent_response: Optional[AgentResponse] = None

An agent response produced by the agent given the input message (agent_input).

from_id: Optional[str] = None

Represents the ID of the parent turn that produced the current turn. Optional; currently only a placeholder for a future implementation.

id: str

Represents the ID assigned to a turn; it must be a unique identifier.

latency_in_ms: Optional[float] = None

Represents the total time, in milliseconds, taken by the agent to produce the output response(s).

steps: Optional[List[Step]] = None

A list of steps (e.g. tool calls, thinking processes, etc.) that the agent takes to produce the final agent response(s).
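
A minimal sketch of recording a trace, assuming the constructor signatures documented in this module; the step IDs, token counts, and response text are illustrative:

    from agent_inspect.models.metrics.agent_data_sample import ToolInputParameter
    from agent_inspect.models.metrics.agent_trace import (
        AgentDialogueTrace,
        AgentResponse,
        Step,
        TurnTrace,
    )

    # One turn in which the agent calls a tool and then answers the user.
    tool_step = Step(
        id="step-1",
        parent_ids=[],  # no preceding step in this turn
        tool="currency_converter",
        tool_input_args=[ToolInputParameter(name="amount", value=100)],
        tool_output={"converted": 92.5},
        agent_thought="I should convert the amount before answering.",
        input_token_consumption=250,
        output_token_consumption=40,
    )

    turn = TurnTrace(
        id="turn-1",
        agent_input="Convert 100 USD to EUR.",
        agent_response=AgentResponse(response="100 USD is about 92.5 EUR.", status_code="200"),
        steps=[tool_step],
        latency_in_ms=840.0,
    )

    trace = AgentDialogueTrace(turns=[turn])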

agent_inspect.models.metrics.metric_score module

class agent_inspect.models.metrics.metric_score.BooleanScore(score, explanations=None)[source]

Bases: object

Represents a boolean score produced by a metric after computation.

Parameters:
  • score (bool)

  • explanations (List[Any] | None)

explanations: Optional[List[Any]] = None

Contains a list of explanations/reasons describing why/how this particular score was calculated.

score: bool

Contains a boolean representation of the final score as calculated by the metric.

class agent_inspect.models.metrics.metric_score.NumericalScore(score, sub_scores=None, explanations=None, validation_results=None)[source]

Bases: object

Represents a numerical score produced by a metric after computation.

Parameters:
  • score (float)

  • sub_scores (Dict[str, float] | None)

  • explanations (List[Any] | None)

  • validation_results (List[ValidationResult] | None)

explanations: Optional[List[Any]] = None

Contains a list of explanations/reasons describing why/how this particular score was calculated.

score: float

Contains a numerical representation of the final score as calculated by the metric.

sub_scores: Optional[Dict[str, float]] = None

Contains a dictionary of intermediate scores used to compute the final score.

validation_results: Optional[List[ValidationResult]] = None

Contains a list of validation results for each subgoal associated with this score.
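
Score objects of these shapes are typically returned by a metric; constructing them by hand here is purely illustrative, assuming the documented field names:

    from agent_inspect.models.metrics.metric_score import BooleanScore, NumericalScore

    passed = BooleanScore(score=True, explanations=["All subgoals were met."])

    overall = NumericalScore(
        score=0.75,
        sub_scores={"subgoal_completion": 1.0, "tool_correctness": 0.5},
        explanations=["3 of 4 checks passed."],
    )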

agent_inspect.models.metrics.validation_result module

class agent_inspect.models.metrics.validation_result.SubGoalValidationResult(is_completed, explanations, sub_goal, prompt_sent_to_llmj=None)[source]

Bases: ValidationResult

Represents a result produced after subgoal validation.

Parameters:
  • is_completed (bool)

  • explanations (List[str])

  • sub_goal (SubGoal)

  • prompt_sent_to_llmj (str | None)

prompt_sent_to_llmj: Optional[str] = None

The entire prompt that is sent to the LLM-as-a-judge for validation.

sub_goal: SubGoal

Contains the subgoal that is being validated.

class agent_inspect.models.metrics.validation_result.ToolCallValidationResult(is_completed, explanations, expected_tool_call)[source]

Bases: ValidationResult

Represents a tool correctness result after validation.

Parameters:
  • is_completed (bool)

  • explanations (List[str])

  • expected_tool_call (ExpectedToolCall)

expected_tool_call: ExpectedToolCall

The expected ground truth tool call that is being validated against.

class agent_inspect.models.metrics.validation_result.ValidationResult(is_completed, explanations)[source]

Bases: object

Represents a result produced after validation (e.g. subgoal validation, tool call validation).

Parameters:
  • is_completed (bool)

  • explanations (List[str])

explanations: List[str]

Contains a list of explanations/reasons describing why/how this particular validation result was produced.

is_completed: bool

Indicates whether the validation was completed successfully.
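
Validation results of these shapes are normally produced by the validators; the sketch below constructs them by hand purely to illustrate the documented fields, with illustrative explanation strings:

    from agent_inspect.models.metrics.agent_data_sample import ExpectedToolCall, SubGoal
    from agent_inspect.models.metrics.validation_result import (
        SubGoalValidationResult,
        ToolCallValidationResult,
    )

    sub_goal_result = SubGoalValidationResult(
        is_completed=True,
        explanations=["The agent reported the converted amount verbatim."],
        sub_goal=SubGoal(details="Agent should report the converted amount"),
    )

    tool_result = ToolCallValidationResult(
        is_completed=False,
        explanations=["The agent never invoked currency_converter."],
        expected_tool_call=ExpectedToolCall(tool="currency_converter"),
    )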

Module contents

class agent_inspect.models.metrics.AgentDialogueTrace(turns=None)[source]

Bases: object

Represents an agent trace (logs) produced by an agent given an evaluation data sample during an evaluation run.

Parameters:
  • turns (List[TurnTrace] | None)

turns: Optional[List[TurnTrace]] = None

A list of turns, each containing the subset of the agent trace (logs) produced during one back-and-forth conversation between the agent and the user/user proxy.

class agent_inspect.models.metrics.AgentResponse(response, status_code=None)[source]

Bases: object

Represents an agent response produced by the agent in one turn.

Parameters:
  • response (str | dict)

  • status_code (str | None)

response: Union[str, dict]

A response from the agent, which can be either a Python str or a Python dict. If a dict is provided, it will be converted to a str during metric calculation.

status_code: Optional[str] = None

Contains the HTTP status code from the agent, if any.

class agent_inspect.models.metrics.BooleanScore(score, explanations=None)[source]

Bases: object

Represents a boolean score produced by a metric after computation.

Parameters:
  • score (bool)

  • explanations (List[Any] | None)

explanations: Optional[List[Any]] = None

Contains a list of explanation/reason(s) for why/how this particular score is calculated.

score: bool

Contains a boolean representation of the final score as calculated by the metric.

class agent_inspect.models.metrics.Conversation(turn_id, message, expected_response=None)[source]

Bases: object

Represents one back-and-forth exchange between a user and an agent (one turn), containing an input message and an optional expected response.

Parameters:
  • turn_id (int)

  • message (str)

  • expected_response (str | None)

expected_response: Optional[str] = None

Expected response from the agent given the input message. This can be None during evaluation with a user proxy.

message: str

Input message to the agent.

turn_id: int

ID representing the sequence number of this conversation in the list of conversations.

class agent_inspect.models.metrics.EvaluationSample(sub_goals, id=None, expected_tool_calls=None, conversation=None, user_instruction=None)[source]

Bases: object

Represents an item in the evaluation dataset.

Parameters:
  • sub_goals (List[SubGoal])

  • id (int | None)

  • expected_tool_calls (List[ExpectedToolCall] | None)

  • conversation (List[Conversation] | None)

  • user_instruction (str | None)

conversation: Optional[List[Conversation]] = None

A list of conversations between an agent and a user/user proxy. The order of the conversations in this list matters, since the metric relies on it to understand which turn follows which.

expected_tool_calls: Optional[List[ExpectedToolCall]] = None

A list of expected tools that an agent should call or utilize during an evaluation run (e.g. API calls, calculator, web search).

id: Optional[int] = None

Unique identifier for the evaluation sample.

sub_goals: List[SubGoal]

A list of subgoals that should be achieved by an agent during an evaluation run.

user_instruction: Optional[str] = None

Instruction(s) for the user proxy on how it should behave/respond while communicating with the agent during an evaluation run.

class agent_inspect.models.metrics.ExpectedToolCall(tool, expected_parameters=None, expected_output=None, turn=None)[source]

Bases: object

Represents the correct tool invocation an agent is expected to make for a particular task. It serves as the ground-truth tool call for the evaluations.

Parameters:
  • tool (str)

  • expected_parameters (List[ToolInputParameter] | None)

  • expected_output (ToolOutput | None)

  • turn (int | str | None)

expected_output: Optional[ToolOutput] = None

Represents the expected output from the tool after the agent invokes it.

expected_parameters: Optional[List[ToolInputParameter]] = None

A list of parameters with which the tool should be called by the agent during the time of evaluation.

tool: str

Represents a tool that should be called or utilized by the agent at the time of evaluation. Can be a name, a description of the tool, the URL of an API call, etc.

turn: Union[int, str, None] = None

Represents in which turn(s) this tool call should be considered. Optional; currently only a placeholder for a future implementation.

class agent_inspect.models.metrics.NumericalScore(score, sub_scores=None, explanations=None, validation_results=None)[source]

Bases: object

Represents a numerical score produced by a metric after computation.

Parameters:
  • score (float)

  • sub_scores (Dict[str, float] | None)

  • explanations (List[Any] | None)

  • validation_results (List[ValidationResult] | None)

explanations: Optional[List[Any]] = None

Contains a list of explanation/reason(s) for why/how this particular score is calculated.

score: float

Contains a numerical representation of the final score as calculated by the metric.

sub_scores: Optional[Dict[str, float]] = None

Contains a dictionary of intermediate scores used to compute the final score.

validation_results: Optional[List[ValidationResult]] = None

Contains a list of validation results for each subgoal associated with this score.

class agent_inspect.models.metrics.Step(id, parent_ids, tool=None, tool_input_args=None, tool_output=None, agent_thought=None, input_token_consumption=None, output_token_consumption=None, reasoning_token_consumption=None)[source]

Bases: object

Represents a necessary step that the agent takes within a turn to produce the final response(s) back to the user/user proxy.

Parameters:
  • id (str)

  • parent_ids (List[str])

  • tool (str | None)

  • tool_input_args (List[ToolInputParameter] | None)

  • tool_output (Any | None)

  • agent_thought (str | None)

  • input_token_consumption (int | None)

  • output_token_consumption (int | None)

  • reasoning_token_consumption (int | None)

agent_thought: Optional[str] = None

Represents the agent’s thinking process (what the agent is thinking) at the current step.

id: str

Represents the ID assigned to a step; it must be a unique identifier.

input_token_consumption: Optional[int] = None

Represents the total number of input tokens consumed in the current step.

output_token_consumption: Optional[int] = None

Represents the total number of output tokens consumed in the current step.

parent_ids: List[str]

Contains a list of parent step ID(s) that the agent took before the current step (e.g. the ID(s) of the immediately preceding steps).

reasoning_token_consumption: Optional[int] = None

Represents the total number of tokens consumed for reasoning purposes in the current step.

tool: Optional[str] = None

Represents a tool that was called or utilized by the agent during the current step. Can be a name, a description of the tool, the URL of an API call, etc. Optional if no tool is called in the current step.

tool_input_args: Optional[List[ToolInputParameter]] = None

A list of parameters with which the tool was called by the agent at the time of evaluation.

tool_output: Optional[Any] = None

Contains the tool output produced by the agent in the current step. Optional and can be None if no output was produced.

class agent_inspect.models.metrics.SubGoal(details, type=None, turn=None)[source]

Bases: object

A subgoal is a natural language assertion that defines the success criteria of an agent within a larger task. Subgoals specify intermediate success criteria and can also include the final goal, e.g. “Agent should call search_messages after getting the current timestamp” or “Agent should inform the user: Your oldest message says ‘Hey kid, you want some GPU?’”.

Parameters:
  • details (str)

  • type (str | None)

  • turn (int | str | None)

details: str

Details containing the information and criteria used by the metric to determine when the agent has achieved this subgoal.

turn: Union[int, str, None] = None

Represents in which turn(s) this subgoal should be considered. Optional; currently only a placeholder for a future implementation.

type: Optional[str] = None

Represents the type of subgoal (e.g. grading notes).

class agent_inspect.models.metrics.SubGoalValidationResult(is_completed, explanations, sub_goal, prompt_sent_to_llmj=None)[source]

Bases: ValidationResult

Represents a result produced after subgoal validation.

Parameters:
  • is_completed (bool)

  • explanations (List[str])

  • sub_goal (SubGoal)

  • prompt_sent_to_llmj (str | None)

explanations: List[str]

Contains a list of explanations/reasons describing why/how this particular validation result was produced.

is_completed: bool

Indicates whether the validation was completed successfully.

prompt_sent_to_llmj: Optional[str] = None

The entire prompt that is sent to the LLM-as-a-judge for validation.

sub_goal: SubGoal

Contains the subgoal that is being validated.

class agent_inspect.models.metrics.ToolCallValidationResult(is_completed, explanations, expected_tool_call)[source]

Bases: ValidationResult

Represents a tool correctness result after validation.

Parameters:
  • is_completed (bool)

  • explanations (List[str])

  • expected_tool_call (ExpectedToolCall)

expected_tool_call: ExpectedToolCall

The expected ground truth tool call that is being validated against.

explanations: List[str]

Contains a list of explanations/reasons describing why/how this particular validation result was produced.

is_completed: bool

Indicates whether the validation was completed successfully.

class agent_inspect.models.metrics.ToolInputParameter(name, value=None, check=None)[source]

Bases: object

Represents a parameter with which the agent should invoke the tool.

Parameters:
  • name (str)

  • value (Any | None)

  • check (str | None)

check: Optional[str] = None

Represents the LLM prompt used to check the correctness of the parameter name and value with which the agent invokes the tool.

name: str

Represents the expected name of the parameter (variable name).

value: Optional[Any] = None

Represents the expected value of the parameter at the moment the tool is invoked. This will be converted to str during metric calculation.

class agent_inspect.models.metrics.ToolOutput(value=None, check=None)[source]

Bases: object

Represents the expected output from the tool after the agent invokes it.

Parameters:
  • value (Any | None)

  • check (str | None)

check: Optional[str] = None

Represents the LLM prompt used to check the correctness of the output value from the tool after the agent invokes it.

value: Optional[Any] = None

Represents the expected output value from the tool after the agent invokes it. This will be converted to str during metric calculation.

class agent_inspect.models.metrics.TurnTrace(id, agent_input, agent_response=None, from_id=None, steps=None, latency_in_ms=None)[source]

Bases: object

Represents a turn: one back-and-forth exchange between the agent and the user/user proxy, encapsulating the agent trace (logs) produced during that turn.

Parameters:
  • id (str)

  • agent_input (str)

  • agent_response (AgentResponse | None)

  • from_id (str | None)

  • steps (List[Step] | None)

  • latency_in_ms (float | None)

agent_input: str

An input string to the agent from a user/user proxy.

agent_response: Optional[AgentResponse] = None

An agent response produced by the agent given the input message (agent_input).

from_id: Optional[str] = None

Represents the ID of the parent turn that produced the current turn. Optional; currently only a placeholder for a future implementation.

id: str

Represents the ID assigned to a turn; it must be a unique identifier.

latency_in_ms: Optional[float] = None

Represents the total time, in milliseconds, taken by the agent to produce the output response(s).

steps: Optional[List[Step]] = None

A list of steps (e.g. tool calls, thinking processes, etc.) that the agent takes to produce the final agent response(s).

class agent_inspect.models.metrics.ValidationResult(is_completed, explanations)[source]

Bases: object

Represents a result produced after validation (e.g. subgoal validation, tool call validation).

Parameters:
  • is_completed (bool)

  • explanations (List[str])

explanations: List[str]

Contains a list of explanations/reasons describing why/how this particular validation result was produced.

is_completed: bool

Indicates whether the validation was completed successfully.
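
Since these classes are re-exported at the package level, as the Module contents listing above shows, they can be imported directly from agent_inspect.models.metrics rather than from the individual submodules; a brief sketch with an illustrative subgoal:

    # Equivalent to the submodule imports used in the earlier examples.
    from agent_inspect.models.metrics import EvaluationSample, SubGoal

    sample = EvaluationSample(
        sub_goals=[SubGoal(details="Agent should greet the user by name")],
    )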