Introduces the τ-Knowledge benchmark, which extends τ-Bench to evaluate retrieval and tool use by conversational agents over unstructured corpora in long-horizon interactions.
$\tau$-Knowledge is a benchmark introduced to evaluate conversational agents in knowledge-intensive settings where success depends on retrieving and applying domain-specific knowledge from large, unstructured corpora during long-horizon interactions. It extends $\tau$-Bench by integrating both retrieval and tool use into a unified evaluation framework, addressing a gap in existing benchmarks that typically assess these capabilities in isolation. The new domain, $\tau$-Banking, simulates realistic fintech customer support workflows, requiring agents to navigate approximately 700 interconnected documents while executing tool-mediated account updates.
The benchmark is designed to be agnostic to the retrieval mechanism, supporting diverse search strategies such as dense and sparse retrieval, long-context processing, filesystem-based exploration (e.g., using shell commands like grep and cat), and hybrid approaches. This flexibility enables evaluation of emerging paradigms beyond semantic retrieval, including terminal-based navigation through unstructured documents.
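As a rough sketch of the filesystem-based paradigm (not the paper's implementation), an agent's grep-style pass over a directory of documents can be mimicked in a few lines; `grep_corpus` and the `.md` file layout are illustrative assumptions:

```python
from pathlib import Path

def grep_corpus(root: str, pattern: str, max_hits: int = 5):
    """Scan every markdown file under `root` and return (file, line_no, line)
    tuples whose line contains `pattern` (case-insensitive), mimicking a
    `grep -rin` pass over an unstructured knowledge base."""
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        for i, line in enumerate(path.read_text().splitlines(), start=1):
            if pattern.lower() in line.lower():
                hits.append((str(path), i, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

An agent following this paradigm would iterate: search for a keyword, read the matching files (`cat`-style), and refine the query, rather than relying on a pre-built embedding index.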
Results show that even frontier models struggle on this benchmark. The highest performance achieved is $\sim$25.52% pass$^1$ (where pass$^k$ denotes the probability that an agent completes a task successfully in all $k$ independent trials), obtained by GPT-5.2 under high reasoning settings using terminal-based search. Reliability degrades sharply over repeated trials, dropping to at most 13.40% pass$^4$. Notably, even when provided with the golden (task-critical) documents directly, eliminating retrieval as a bottleneck, the best model (Claude-4.5-Opus) achieves only 39.69% pass$^1$, indicating that reasoning over complex policies, cross-document dependencies, and evolving database states remains a significant challenge.
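The pass$^k$ numbers above can be estimated from repeated trials. A minimal sketch, assuming the standard combinatorial estimator (pass@k-style, but requiring success in all $k$ of $n$ recorded trials); the function name and inputs are illustrative:

```python
from math import comb

def pass_hat_k(successes_per_task, n, k):
    """Unbiased estimate of pass^k: the probability that an agent solves a
    task in all k of k i.i.d. trials, averaged over tasks.

    successes_per_task[i] = c_i, the number of successful trials out of n
    for task i; the per-task estimator is C(c_i, k) / C(n, k)."""
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)
```

For example, with three tasks solved in 4, 2, and 0 of $n=4$ trials, pass$^1$ is 0.5 while pass$^4$ falls to 1/3, illustrating why reliability drops as $k$ grows.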
In contrast to earlier benchmarks like $\tau$-Bench, which focused on retail and airline domains with manually curated tasks (115 retail and 50 airline samples), $\tau$-Knowledge introduces a more scalable and complex environment. The tasks require agents to coordinate knowledge-base evidence with tool outputs to produce verifiable, policy-compliant state changes, mimicking real-world deployments where tools are referenced only in documentation and must be discovered before use.
Even when full context is available, such as appending the entire knowledge base to the system prompt for models with large context windows (e.g., GPT-5.2, Gemini-3-Pro/Flash), performance peaks at only $\sim$12%, confirming that non-gold documents introduce meaningful confounders and that targeted retrieval remains essential. In a no-knowledge baseline, where agents lack access to the knowledge base, performance drops to $\sim$2%, validating that the tasks genuinely require external information.
Overall, $\tau$-Knowledge provides a realistic testbed for developing agents capable of integrating unstructured knowledge in human-facing applications, highlighting the need for improvements in both retrieval efficiency and multi-step reasoning.
This paper introduces $\tau$-Knowledge, a rigorous benchmark designed to evaluate the capabilities of conversational agents in navigating and reasoning over unstructured data sources. Extending the framework of $\tau$-Bench, this work addresses a critical gap in current AI evaluation: the ability of Large Language Model (LLM) agents to perform complex, multi-step information retrieval and synthesis within long-horizon interactions. Unlike traditional benchmarks that often focus on structured API calls or single-turn question answering, $\tau$-Knowledge simulates realistic, knowledge-intensive workflows where agents must use tools to search vast, unstructured corpora, such as documents or web pages, and integrate fragmented pieces of information to solve complex user queries.
The key contribution of this work is its holistic approach to assessing agentic behavior, specifically targeting the interplay between retrieval mechanisms and tool usage. The benchmark presents scenarios that require agents to plan effective search strategies, parse noisy or irrelevant retrieved data, and maintain context over extended conversations. By evaluating agents on these dimensions, the authors provide insights into the limitations of current models regarding information grounding and long-term memory management. The benchmark establishes a new standard for measuring how well an agent can persist and apply knowledge derived from unstructured environments, moving beyond simple factual recall to practical utility.
$\tau$-Knowledge is highly relevant to the current trajectory of AI research, particularly as the focus shifts from standalone chatbots to autonomous agents capable of performing complex knowledge work. As organizations increasingly deploy LLMs to interact with proprietary and messy data stores, understanding the boundaries of an agent's ability to accurately retrieve and reason over unstructured text becomes paramount. This benchmark serves as an essential tool for researchers and developers to stress-test agentic systems, ensuring that advancements in tool use and retrieval translate into robust performance in real-world, data-heavy applications.
The paper *τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge* extends the τ-Bench framework to assess the retrieval and tool-use capabilities of conversational agents in long-horizon interactions with unstructured knowledge bases. While prior benchmarks often rely on structured or semi-structured data, τ-Knowledge introduces a more realistic evaluation setting by incorporating unstructured corpora (e.g., PDFs, HTML, or raw text), requiring agents to dynamically retrieve and reason over heterogeneous, noisy, or implicit information. The benchmark emphasizes multi-turn dialogues where agents must maintain context, handle partial or conflicting evidence, and adaptively use retrieval tools (e.g., search APIs, embeddings) to answer complex queries. This addresses a critical gap in existing evaluations, which frequently oversimplify knowledge access or assume perfect retrieval.
The key contributions include: 1. A novel evaluation protocol that simulates real-world knowledge workloads, where agents must navigate unstructured data without explicit grounding (e.g., no pre-defined knowledge graphs). 2. Dual evaluation metrics—one for retrieval accuracy (retrieving relevant passages) and another for response quality (generating coherent, factually grounded answers)—to disentangle retrieval performance from conversational fluency. 3. A curated test suite spanning domains like science, law, and technical documentation, designed to stress-test agents on challenges like ambiguity resolution, multi-hop reasoning, and tool integration.
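The paper's exact formulas for the dual metrics are not reproduced here; as a generic illustration of the retrieval-accuracy side, recall@k over gold passage ids is a common choice (`recall_at_k` is a hypothetical helper, not the paper's scorer):

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold passage ids found among the top-k retrieved ids.
    A standard proxy for the retrieval half of a dual metric; response
    quality would be judged separately on the generated answer."""
    top_k = set(retrieved_ids[:k])
    return sum(1 for g in gold_ids if g in top_k) / len(gold_ids)
```

Scoring retrieval and generation separately in this way is what lets a benchmark disentangle "found the right passages" from "answered fluently but ungrounded".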
Why it matters: As AI systems increasingly operate in knowledge-intensive settings (e.g., legal assistance, scientific research, or customer support), evaluating their ability to interact with unstructured data is paramount. τ-Knowledge advances the field by providing a rigorous, large-scale benchmark that aligns with real-world deployment scenarios, where agents must bridge the gap between raw information and actionable responses. This work is particularly relevant for researchers developing retrieval-augmented generation (RAG) systems, agentic workflows, or LLM-based systems with memory and search capabilities. By highlighting limitations in current approaches, such as over-reliance on perfect retrieval or shallow question answering, the paper sets a new standard for benchmarking agentic intelligence in messy, real-world contexts.
Source: [arXiv:2603.04370](https://arxiv.org/abs/2603.04370)