KDD 2025 Tutorial: Evaluation & Benchmarking of LLM Agents

Tutorial Date/Time: Sunday, August 3rd 2025, 1:00 PM – 4:00 PM @ MTCC, Convention Center, Toronto, Canada

Abstract

The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This tutorial provides a systematic survey of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives (what to evaluate, such as agent behavior, capabilities, reliability, and safety) and (2) evaluation process (how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling). In addition, we highlight enterprise-specific challenges, such as role-based access to data, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance. Finally, we discuss future research directions toward holistic, more realistic, and scalable evaluation of LLM agents.
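
As a rough illustration of the two-dimensional taxonomy, the sketch below tags a piece of evaluation work along both axes and filters by objective. It is not taken from the tutorial materials: the class names, the `by_objective` helper, and the two example entries are hypothetical, and only the axis values come from the abstract above.

```python
from dataclasses import dataclass
from enum import Enum


class Objective(Enum):
    """What to evaluate (first axis named in the abstract)."""
    AGENT_BEHAVIOR = "agent behavior"
    AGENT_CAPABILITIES = "agent capabilities"
    RELIABILITY = "reliability"
    SAFETY = "safety"


class ProcessAspect(Enum):
    """How to evaluate (second axis named in the abstract)."""
    INTERACTION_MODE = "interaction mode"
    DATA_AND_BENCHMARKS = "datasets and benchmarks"
    METRIC_COMPUTATION = "metric computation"
    TOOLING = "tooling"


@dataclass(frozen=True)
class EvaluationEntry:
    """Hypothetical record: one evaluation effort placed on both axes."""
    name: str
    objectives: frozenset
    process: frozenset


def by_objective(entries, objective):
    """Return the names of entries that address the given objective."""
    return [e.name for e in entries if objective in e.objectives]


# Placeholder entries for illustration only; these are not real benchmarks.
entries = [
    EvaluationEntry(
        "example-benchmark-a",
        frozenset({Objective.AGENT_CAPABILITIES}),
        frozenset({ProcessAspect.DATA_AND_BENCHMARKS}),
    ),
    EvaluationEntry(
        "example-benchmark-b",
        frozenset({Objective.SAFETY, Objective.RELIABILITY}),
        frozenset({ProcessAspect.METRIC_COMPUTATION}),
    ),
]

print(by_objective(entries, Objective.SAFETY))
```

Keeping the two axes as separate fields means one entry can cover several objectives while being described independently by how it is carried out.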

Target Audience

This tutorial is designed for applied and industry scientists, machine learning engineers, and enterprise AI practitioners who build or deploy LLM-based agents in production systems. It is also relevant for academic researchers studying evaluation methodologies, multi-agent systems, and trustworthy language models. Participants will gain a systematic evaluation framework, practical hands-on code examples, and insights into real-world deployment challenges.

Presentation Agenda

Introduction (5 min)
- Motivation and tutorial goals
Taxonomy Overview (10 min)
- What to evaluate and how to evaluate
Evaluation Process (15 min)
- Interaction modes
- Evaluation data
- Metric computation methods
- Evaluation tooling
- Evaluation contexts
Evaluation Objectives (110 min)
- Agent Behavior
- Agent Capabilities
- Reliability
- Safety & Alignment
Enterprise-Specific Challenges (20 min)
- Access control
- Reliability guarantees
- Dynamic & long-horizon interactions
- Policy and compliance
Future Directions (10 min)
- Holistic frameworks
- Scalable evaluation
- Realistic enterprise settings

Authors

Mahmoud Mohammadi — SAP Labs, Bellevue, WA, USA
mahmoud.mohammadi@sap.com
Mahmoud is a Senior AI Scientist at SAP, where his research focuses on business foundation models and agentic AI, including graph foundation models, LLM integration, and the evaluation of intelligent agents. He also has expertise in Generative Adversarial Networks (GANs) and multimodal AI systems. Previously, Mahmoud worked at Microsoft, where he contributed to developing client-side deep learning models for Windows. He holds a Ph.D. in Computer Science from the University of North Carolina at Charlotte.
Yipeng Li — SAP Labs, Bellevue, WA, USA
yipeng.li@sap.com
Yipeng Li is a Data Scientist Expert at SAP, leading research and development in agentic AI. His work focuses on single- and multi-agent systems, quality assessment, and enabling internal AI research and development through common platforms. Before SAP, he worked at Microsoft on Office Copilot and at Facebook on large-scale machine learning projects. He holds a Ph.D. in Computer Science from The Ohio State University. His expertise includes prompt engineering, agentic systems development and evaluation, and machine learning algorithms and techniques.
Jane Lo — SAP Labs, Palo Alto, CA, USA
jane.lo@sap.com
Jane Lo is an AI Scientist at SAP, focusing on the research and development of agentic AI. She has worked on several projects in the field that integrate enterprise tools, data, and private knowledge with agentic systems across a wide range of conversational and autonomous use cases. Her expertise includes multi-agent system development, agentic system evaluation, and synthetic data generation for conversational use cases. She received B.S. and B.A. degrees in Industrial Engineering & Operations Research and Data Science from the University of California, Berkeley, in 2023.
Wendy Yip — SAP Labs, Palo Alto, CA, USA
wendy.yip@sap.com
Wendy is a Senior Data Scientist at SAP. She has a background in astrophysics and worked on machine learning and data-intensive science research at Johns Hopkins University. She then joined several Bay Area start-ups, where she contributed to building an AI-enabled home security camera and a business process discovery bot. She now works at SAP on agent-based systems, knowledge graphs, and other AI topics.
