agent_inspect.tools package
Module contents
- class agent_inspect.tools.ErrorAnalysis(llm_client, max_workers=20)[source]
Bases: object

Performs error analysis across multiple data samples using LLMs, so that developers can easily identify and understand common errors of agents. The method is based on subgoal validations and executes a two-step unsupervised learning process: 1) low-level error identification, 2) semantic clustering of error types.

- Parameters:
  - llm_client (LLMClient) – The client that connects to the LLM used to perform the error analysis.
  - max_workers (int) – Maximum number of concurrent workers for processing data samples. Defaults to 20.
- analyze_batch(data_samples)[source]
Performs error analysis on a batch of data samples (typically all samples in the dataset). Returns an ErrorAnalysisResult containing the clustered error types with their associated subgoal validations, together with the remaining subgoal validations that have no errors.

- Parameters:
  - data_samples (List[ErrorAnalysisDataSample]) – List of data samples to perform error analysis on. Each data sample contains multiple subgoal validations.
- Return type:
  ErrorAnalysisResult
- Returns:
  An ErrorAnalysisResult containing the error analysis results:
  - analyzed_validations_clustered_by_errors: Dictionary mapping clustered error types to lists of incomplete subgoal validations exhibiting those errors.
  - completed_subgoal_validations: List of subgoal validations that were successfully completed without errors.
Example:
>>> from typing import List
>>> from agent_inspect.models.tools import ErrorAnalysisDataSample
>>> from agent_inspect.tools import ErrorAnalysis
>>> from agent_inspect.clients import AzureOpenAIClient
>>>
>>> # Prepare your data samples for error analysis
>>> error_analysis_data_samples: List[ErrorAnalysisDataSample] = [...]
>>> llm_client = AzureOpenAIClient(
...     model="gpt-4.1",
...     max_tokens=4096
... )
>>> error_analyzer = ErrorAnalysis(
...     llm_client=llm_client,
...     max_workers=10
... )
>>>
>>> test_error_analysis_results = error_analyzer.analyze_batch(error_analysis_data_samples)
>>> test_error_categories = list(test_error_analysis_results.analyzed_validations_clustered_by_errors.keys())
>>> print(f"Identified error categories: {test_error_categories}")
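Once a batch has been analyzed, the result can be summarized per error cluster. The following is a minimal sketch using plain Python dicts and strings as stand-ins for the real result objects; the attribute names follow the Returns section above, but the error categories and validation labels are illustrative only:

```python
# Stand-in for ErrorAnalysisResult.analyzed_validations_clustered_by_errors:
# a dict mapping each clustered error type to the incomplete subgoal
# validations exhibiting it (strings stand in for validation objects here).
analyzed_validations_clustered_by_errors = {
    "tool_call_argument_error": ["validation_3", "validation_7"],
    "premature_termination": ["validation_5"],
}
# Stand-in for ErrorAnalysisResult.completed_subgoal_validations.
completed_subgoal_validations = ["validation_1", "validation_2", "validation_4"]

# Count how many incomplete validations fall into each error category.
error_counts = {
    error_type: len(validations)
    for error_type, validations in analyzed_validations_clustered_by_errors.items()
}
print(error_counts)
```

Sorting error_counts by value then surfaces the most frequent failure modes first.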
- class agent_inspect.tools.StatisticAnalysis[source]
Bases: object

Computes the expectation and variance of an agent's progress across multiple LLM-as-a-judge runs.
For each subgoal \(g_{i,j} \in G_i\), we define a binary random variable \(Z_{i,j}\), where \(Z_{i,j}=1\) if the agent achieves the \(j\)-th subgoal (under the given trajectory), and \(0\) otherwise.
Let the probability of achieving the subgoal \(g_{i,j}\) be \(Pr(Z_{i,j}=1)=z_{i,j}\). Then, for a given sample with multiple subgoals \((i, G_i)\) and an agent trajectory \(\tau_i\), we define the progress as the proportion of subgoals successfully achieved \(progress(i, G_i, \tau_i) = \frac{\sum_j Z_{i,j}}{|G_i|}\).
The expectation and variance of the agent's progress are then measured by:
\[E[{progress}(i, G_i, \tau_i)] = \frac{\sum_j z_{i,j}}{|G_i|} \ ; \quad {Var}[{progress}(i, G_i, \tau_i)] = \frac{\sum_j z_{i,j} (1 - z_{i,j})}{|G_i|^2}\]
where \(z_{i,j} = \frac{1}{Q}\sum_{q=1}^Q z_{i,j}^{(q)}\) is estimated by averaging over \(Q\) judge runs per subgoal, generalizing the single binary judge output to a probabilistic estimate.
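These formulas can be checked with a small worked example. The sketch below is plain Python with hypothetical per-subgoal probabilities \(z_{i,j}\): two subgoals, one achieved in 1 of 5 judge runs and the other in none:

```python
# Hypothetical per-subgoal success probabilities z_{i,j}, each already
# estimated by averaging Q binary judge outputs for that subgoal.
z = [0.2, 0.0]   # z_{i,1} = 0.2, z_{i,2} = 0.0
n = len(z)       # |G_i|, the number of subgoals

# E[progress] = (sum_j z_{i,j}) / |G_i|
expectation = sum(z) / n

# Var[progress] = (sum_j z_{i,j} * (1 - z_{i,j})) / |G_i|^2
variance = sum(p * (1 - p) for p in z) / n ** 2
std = variance ** 0.5

print(expectation, std)  # expectation ≈ 0.1, std ≈ 0.2
```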
- static compute_statistic_analysis_result(data_sample)[source]
Returns the judge expectation and variance on a single data sample.
- Parameters:
  - data_sample (ErrorAnalysisDataSample) – The data sample containing subgoal validations.
- Return type:
  StatisticAnalysisResult
- Returns:
  A StatisticAnalysisResult containing the judge expectation and variance.
Example:
>>> from agent_inspect.models.tools import ErrorAnalysisDataSample
>>> from agent_inspect.models.metrics import SubGoalValidationResult
>>> from agent_inspect.tools import StatisticAnalysis
>>> data_sample = ErrorAnalysisDataSample(
...     data_sample_id=1,
...     agent_run_id=101,
...     subgoal_validations=[
...         # The first element is the summarized explanation and is skipped in the
...         # computation. The rest are judge explanations, truncated for this demo
...         # so that they only contain "Grade: I" or "Grade: C".
...         SubGoalValidationResult(
...             explanations=["Check: {subgoal 1} has failed.", "Grade: I", "Grade: I", "Grade: I", "Grade: I", "Grade: C"]
...         ),
...         SubGoalValidationResult(
...             explanations=["Check: {subgoal 2} has failed.", "Grade: I", "Grade: I", "Grade: I", "Grade: I", "Grade: I"]
...         )
...     ]
... )
>>> stat_result = StatisticAnalysis.compute_statistic_analysis_result(data_sample)
>>> stat_result.judge_expectation
0.1
>>> stat_result.judge_std
0.2
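The numbers in this example can be reproduced by hand. Below is a minimal sketch in plain Python, independent of the library, that follows the convention shown above: the first explanation of each subgoal is the summary and is skipped, and each remaining judge explanation ends in "C" (complete) or "I" (incomplete):

```python
# Judge explanations per subgoal; index 0 is the summarized explanation
# and is excluded from the computation, mirroring the example above.
subgoal_explanations = [
    ["Check: {subgoal 1} has failed.", "Grade: I", "Grade: I", "Grade: I", "Grade: I", "Grade: C"],
    ["Check: {subgoal 2} has failed.", "Grade: I", "Grade: I", "Grade: I", "Grade: I", "Grade: I"],
]

# z_{i,j}: fraction of "Grade: C" (complete) judgements per subgoal,
# i.e. the average over the Q = 5 judge runs.
z = [
    sum(e.endswith("C") for e in explanations[1:]) / len(explanations[1:])
    for explanations in subgoal_explanations
]

n = len(z)                                          # |G_i| = 2 subgoals
expectation = sum(z) / n                            # E[progress]
std = (sum(p * (1 - p) for p in z) / n ** 2) ** 0.5  # sqrt(Var[progress])

print(expectation, std)  # expectation ≈ 0.1, std ≈ 0.2
```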