
KDD 2025 Tutorial: Evaluation & Benchmarking of LLM Agents

Tutorial Date/Time: Sunday, August 3rd 2025, 1:00 PM – 4:00 PM @ Metro Toronto Convention Centre (MTCC), Toronto, Canada

Abstract

The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This tutorial provides a systematic survey of the field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives (what to evaluate), such as agent behavior, capabilities, reliability, and safety, and (2) evaluation process (how to evaluate), including interaction modes, datasets and benchmarks, metric computation methods, and tooling. In addition, we highlight enterprise-specific challenges, such as role-based access to data, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance. Finally, we discuss future research directions toward holistic, more realistic, and scalable evaluation of LLM agents.
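
To make the "evaluation process" dimension concrete, the sketch below shows a minimal evaluation loop: benchmark tasks, a (placeholder) agent, a metric, and an aggregate score. It is purely illustrative; the task format, agent interface, and exact-match metric are assumptions chosen for this example and do not represent any specific framework covered in the tutorial.

```python
# Illustrative sketch only: a minimal evaluation loop over a toy benchmark.
# All names (Task, exact_match, evaluate, toy_agent) are hypothetical
# placeholders, not the tutorial's actual code or any library's API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    prompt: str    # input given to the agent
    expected: str  # reference answer used by the metric


def exact_match(prediction: str, task: Task) -> bool:
    """Metric computation: a simple exact-match success criterion."""
    return prediction.strip().lower() == task.expected.strip().lower()


def evaluate(agent: Callable[[str], str], benchmark: List[Task]) -> dict:
    """Run the agent on every task and aggregate a success-rate metric."""
    successes = 0
    for task in benchmark:
        prediction = agent(task.prompt)  # interaction mode: single-turn here
        successes += exact_match(prediction, task)
    return {"tasks": len(benchmark), "success_rate": successes / len(benchmark)}


if __name__ == "__main__":
    # A stand-in "agent" so the sketch runs end to end; a real agent would
    # call an LLM, use tools, and may take multiple steps per task.
    toy_agent = lambda prompt: "4" if "2 + 2" in prompt else "unknown"
    toy_benchmark = [Task("What is 2 + 2?", "4"), Task("Capital of France?", "Paris")]
    print(evaluate(toy_agent, toy_benchmark))
```

Real agent evaluations extend each piece of this loop: multi-turn or tool-use interaction modes in place of the single call, benchmark datasets in place of the toy tasks, and richer metrics (e.g., trajectory-level or LLM-as-judge scoring) in place of exact match.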

Target Audience

This tutorial is designed for applied and industry scientists, machine learning engineers, and enterprise AI practitioners who build or deploy LLM-based agents in production systems. It is also relevant for academic researchers studying evaluation methodologies, multi-agent systems, and trustworthy language models. Participants will gain a systematic evaluation framework, practical hands-on code examples, and insights into real-world deployment challenges.

Presentation Agenda

Resources

You can download the tutorial presentation slides here: Download Slides (PDF).

Authors