agent_inspect.tools.error_analysis package

Submodules

agent_inspect.tools.error_analysis.error_analysis module

class agent_inspect.tools.error_analysis.error_analysis.ErrorAnalysis(llm_client, max_workers=20)[source]

Bases: object

Performs error analysis across multiple data samples using LLMs, so that developers can easily identify and understand common agent errors. The method operates on subgoal validations and executes a two-step unsupervised learning process: 1) low-level error identification, 2) semantic clustering of error types.

Parameters:
  • llm_client (LLMClient) – The client used to connect to the LLM that performs the error analysis.

  • max_workers (int) – Maximum number of concurrent workers for processing data samples. Defaults to 20.
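The two-step process can be sketched in plain Python, with simple keyword matching standing in for the LLM calls that ErrorAnalysis performs internally. The helper names below are illustrative only and are not part of the agent_inspect API:

```python
from collections import defaultdict
from typing import Dict, List

def identify_errors(explanations: List[str]) -> List[str]:
    # Step 1 (low-level error identification): keep only explanations
    # that describe a failed subgoal validation.
    return [e for e in explanations if "failed" in e.lower()]

def cluster_errors(errors: List[str]) -> Dict[str, List[str]]:
    # Step 2 (semantic clustering): group errors into error types.
    # A naive keyword match replaces the LLM-based semantic clustering.
    clusters: Dict[str, List[str]] = defaultdict(list)
    for err in errors:
        if "timeout" in err.lower():
            clusters["timeout"].append(err)
        elif "format" in err.lower():
            clusters["formatting"].append(err)
        else:
            clusters["other"].append(err)
    return dict(clusters)

explanations = [
    "Subgoal 1 achieved.",
    "Subgoal 2 failed: tool call timeout.",
    "Subgoal 3 failed: response format invalid.",
]
clusters = cluster_errors(identify_errors(explanations))
```

In the real implementation both steps are delegated to the configured LLM client, which is why clustering quality depends on the chosen model.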

analyze_batch(data_samples)[source]

Performs error analysis on a batch of data samples (typically the entire dataset). Returns an ErrorAnalysisResult containing the clustered error types with their associated subgoal validations, together with the remaining subgoal validations that contain no errors.

Parameters:

data_samples (List[ErrorAnalysisDataSample]) – List of data samples to perform error analysis on. Each data sample contains multiple subgoal validations.

Return type:

ErrorAnalysisResult

Returns:

an ErrorAnalysisResult containing the error analysis results.

Example:

>>> from typing import List
>>> from agent_inspect.models.tools import ErrorAnalysisDataSample
>>> from agent_inspect.tools import ErrorAnalysis
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> # Prepare your error_analysis_data_samples beforehand
>>> error_analysis_data_samples: List[ErrorAnalysisDataSample] = [...]
>>> llm_client = AzureOpenAIClient(
...     model="gpt-4.1",
...     max_tokens=4096
... )
>>> error_analyzer = ErrorAnalysis(
...     llm_client=llm_client,
...     max_workers=10
... )
>>>
>>> test_error_analysis_results = error_analyzer.analyze_batch(error_analysis_data_samples)
>>> test_error_categories = list(test_error_analysis_results.analyzed_validations_clustered_by_errors.keys())
>>> print(f"Identified error categories: {test_error_categories}")

agent_inspect.tools.error_analysis.llm_constants module

agent_inspect.tools.error_analysis.statistic_analysis module

class agent_inspect.tools.error_analysis.statistic_analysis.StatisticAnalysis[source]

Bases: object

Computes the expectation and variance of an agent's progress across multiple LLM-as-a-judge runs.

For each subgoal \(g_{i,j} \in G_i\), we define a binary random variable \(Z_{i,j}\), where \(Z_{i,j}=1\) if the agent achieves the \(j\)-th subgoal (under the given trajectory), and \(0\) otherwise.

Let the probability of achieving the subgoal \(g_{i,j}\) be \(Pr(Z_{i,j}=1)=z_{i,j}\). Then, for a given sample with multiple subgoals \((i, G_i)\) and an agent trajectory \(\tau_i\), we define the progress as the proportion of subgoals successfully achieved \(progress(i, G_i, \tau_i) = \frac{\sum_j Z_{i,j}}{|G_i|}\).

The expectation and variance of the agent's progress are then given by:

\[E[{progress}(i, G_i, \tau_i)] = \frac{\sum_j z_{i,j}}{|G_i|} \ ; \quad {Var}[{progress}(i, G_i, \tau_i)] = \frac{\sum_j z_{i,j} (1 - z_{i,j})}{|G_i|^2}\]

where \(z_{i,j}= \frac{1}{Q}\sum_{q=1}^Q z_{i,j}^{(q)}\) is estimated by averaging over \(Q\) judge runs per subgoal, generalizing the single binary judge output to a probabilistic estimate.
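As a concrete check of these formulas, the expectation and variance can be computed by hand in a few lines of plain Python (independent of agent_inspect):

```python
# Worked instance of the formulas above: two subgoals, each judged over
# Q = 5 independent runs (1 = achieved, 0 = not achieved). The numbers
# are illustrative and match the doctest for compute_statistic_analysis_result.
runs = [
    [0, 0, 0, 0, 1],  # subgoal 1: achieved in 1 of 5 runs -> z_1 = 0.2
    [0, 0, 0, 0, 0],  # subgoal 2: achieved in 0 of 5 runs -> z_2 = 0.0
]
z = [sum(r) / len(r) for r in runs]              # per-subgoal estimates z_{i,j}
n = len(z)                                       # |G_i|
expectation = sum(z) / n                         # (0.2 + 0.0) / 2 = 0.1
variance = sum(p * (1 - p) for p in z) / n ** 2  # (0.16 + 0.0) / 4 = 0.04
std = variance ** 0.5                            # 0.04 ** 0.5 = 0.2
```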

static compute_statistic_analysis_result(data_sample)[source]

Computes the judge expectation and variance for a single data sample.

Parameters:

data_sample (ErrorAnalysisDataSample) – The data sample containing subgoal validations.

Return type:

StatisticAnalysisResult

Returns:

a StatisticAnalysisResult containing the judge expectation and variance.

Example:

>>> from agent_inspect.models.tools import ErrorAnalysisDataSample
>>> from agent_inspect.models.metrics import SubGoalValidationResult
>>> from agent_inspect.tools import StatisticAnalysis
>>> data_sample = ErrorAnalysisDataSample(
...     data_sample_id=1,
...     agent_run_id=101,
...     subgoal_validations=[
...         # The first element of each list is the summarized explanation and
...         # is skipped in the computation. The rest are judge explanations,
...         # truncated for this demo to contain only "Grade: I" or "Grade: C".
...         SubGoalValidationResult(
...             explanations=["Check: {subgoal 1} has failed.", "Grade: I", "Grade: I", "Grade: I", "Grade: I", "Grade: C"]
...         ),
...         SubGoalValidationResult(
...             explanations=["Check: {subgoal 2} has failed.", "Grade: I", "Grade: I", "Grade: I", "Grade: I", "Grade: I"]
...         )
...     ]
... )
>>> stat_result = StatisticAnalysis.compute_statistic_analysis_result(data_sample)
>>> stat_result.judge_expectation
0.1
>>> stat_result.judge_std
0.2

Module contents

class agent_inspect.tools.error_analysis.ErrorAnalysis(llm_client, max_workers=20)[source]

Bases: object

Performs error analysis across multiple data samples using LLMs, so that developers can easily identify and understand common agent errors. The method operates on subgoal validations and executes a two-step unsupervised learning process: 1) low-level error identification, 2) semantic clustering of error types.

Parameters:
  • llm_client (LLMClient) – The client used to connect to the LLM that performs the error analysis.

  • max_workers (int) – Maximum number of concurrent workers for processing data samples. Defaults to 20.

analyze_batch(data_samples)[source]

Performs error analysis on a batch of data samples (typically the entire dataset). Returns an ErrorAnalysisResult containing the clustered error types with their associated subgoal validations, together with the remaining subgoal validations that contain no errors.

Parameters:

data_samples (List[ErrorAnalysisDataSample]) – List of data samples to perform error analysis on. Each data sample contains multiple subgoal validations.

Return type:

ErrorAnalysisResult

Returns:

an ErrorAnalysisResult containing the error analysis results.

Example:

>>> from typing import List
>>> from agent_inspect.models.tools import ErrorAnalysisDataSample
>>> from agent_inspect.tools import ErrorAnalysis
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> # Prepare your error_analysis_data_samples beforehand
>>> error_analysis_data_samples: List[ErrorAnalysisDataSample] = [...]
>>> llm_client = AzureOpenAIClient(
...     model="gpt-4.1",
...     max_tokens=4096
... )
>>> error_analyzer = ErrorAnalysis(
...     llm_client=llm_client,
...     max_workers=10
... )
>>>
>>> test_error_analysis_results = error_analyzer.analyze_batch(error_analysis_data_samples)
>>> test_error_categories = list(test_error_analysis_results.analyzed_validations_clustered_by_errors.keys())
>>> print(f"Identified error categories: {test_error_categories}")

class agent_inspect.tools.error_analysis.StatisticAnalysis[source]

Bases: object

Computes the expectation and variance of an agent's progress across multiple LLM-as-a-judge runs.

For each subgoal \(g_{i,j} \in G_i\), we define a binary random variable \(Z_{i,j}\), where \(Z_{i,j}=1\) if the agent achieves the \(j\)-th subgoal (under the given trajectory), and \(0\) otherwise.

Let the probability of achieving the subgoal \(g_{i,j}\) be \(Pr(Z_{i,j}=1)=z_{i,j}\). Then, for a given sample with multiple subgoals \((i, G_i)\) and an agent trajectory \(\tau_i\), we define the progress as the proportion of subgoals successfully achieved \(progress(i, G_i, \tau_i) = \frac{\sum_j Z_{i,j}}{|G_i|}\).

The expectation and variance of the agent's progress are then given by:

\[E[{progress}(i, G_i, \tau_i)] = \frac{\sum_j z_{i,j}}{|G_i|} \ ; \quad {Var}[{progress}(i, G_i, \tau_i)] = \frac{\sum_j z_{i,j} (1 - z_{i,j})}{|G_i|^2}\]

where \(z_{i,j}= \frac{1}{Q}\sum_{q=1}^Q z_{i,j}^{(q)}\) is estimated by averaging over \(Q\) judge runs per subgoal, generalizing the single binary judge output to a probabilistic estimate.

static compute_statistic_analysis_result(data_sample)[source]

Computes the judge expectation and variance for a single data sample.

Parameters:

data_sample (ErrorAnalysisDataSample) – The data sample containing subgoal validations.

Return type:

StatisticAnalysisResult

Returns:

a StatisticAnalysisResult containing the judge expectation and variance.

Example:

>>> from agent_inspect.models.tools import ErrorAnalysisDataSample
>>> from agent_inspect.models.metrics import SubGoalValidationResult
>>> from agent_inspect.tools import StatisticAnalysis
>>> data_sample = ErrorAnalysisDataSample(
...     data_sample_id=1,
...     agent_run_id=101,
...     subgoal_validations=[
...         # The first element of each list is the summarized explanation and
...         # is skipped in the computation. The rest are judge explanations,
...         # truncated for this demo to contain only "Grade: I" or "Grade: C".
...         SubGoalValidationResult(
...             explanations=["Check: {subgoal 1} has failed.", "Grade: I", "Grade: I", "Grade: I", "Grade: I", "Grade: C"]
...         ),
...         SubGoalValidationResult(
...             explanations=["Check: {subgoal 2} has failed.", "Grade: I", "Grade: I", "Grade: I", "Grade: I", "Grade: I"]
...         )
...     ]
... )
>>> stat_result = StatisticAnalysis.compute_statistic_analysis_result(data_sample)
>>> stat_result.judge_expectation
0.1
>>> stat_result.judge_std
0.2