The TREC 2025 DRAGUN track evaluates RAG systems for generating reader-oriented reports on news trustworthiness amid misinformation.


The TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track provides a framework for evaluating Retrieval-Augmented Generation (RAG) systems designed to assist readers in assessing the trustworthiness of online news articles in the presence of misinformation. As a successor to the TREC 2024 Lateral Reading Track, it supports reader-driven judgment by generating neutral, multi-source context rather than delivering definitive verdicts on truthfulness.

The track features two parallel tasks: (1) Question Generation, which involves producing 10 ranked investigative questions per news article to guide trustworthiness assessment, and (2) Report Generation, the core task, which requires generating a 250-word, well-attributed report grounded in the MS MARCO V2.1 Segmented Corpus. Each sentence in the report may cite up to three document segments, ensuring the output remains factually anchored to the corpus. The 30 target news articles, released as topics, serve as the basis for evaluation.
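The report constraints above (a ~250-word report, at most three citations per sentence) lend themselves to a simple automated check. The sketch below assumes an illustrative report structure, an `answer` list of sentences each carrying a `citations` list; this schema is a hypothetical stand-in, not the official DRAGUN format.

```python
def check_report(report: dict, word_limit: int = 250, max_citations: int = 3) -> list[str]:
    """Return a list of constraint violations for one report.

    Assumes a hypothetical schema: report["answer"] is a list of
    {"text": str, "citations": [segment_id, ...]} sentence objects.
    """
    problems = []
    sentences = report["answer"]
    total_words = sum(len(s["text"].split()) for s in sentences)
    if total_words > word_limit:
        problems.append(f"report has {total_words} words (limit {word_limit})")
    for i, s in enumerate(sentences):
        if len(s["citations"]) > max_citations:
            problems.append(
                f"sentence {i} cites {len(s['citations'])} segments (max {max_citations})"
            )
    return problems

# Toy report: one sentence, two citations -> no violations.
report = {
    "answer": [
        {"text": "The article's central claim is disputed by two fact-checks.",
         "citations": ["msmarco_v2.1_doc_00_1#2", "msmarco_v2.1_doc_07_5#0"]},
    ]
}
print(check_report(report))  # → []
```

The segment identifiers are invented for illustration; a real validator would additionally verify that each cited segment exists in the MS MARCO V2.1 Segmented Corpus.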

To support automated evaluation, the DRAGUN organizers developed an AutoJudge system that replicates human assessment using importance-weighted rubrics created by NIST assessors. These rubrics include key questions and expected short answers deemed critical for trustworthiness assessment. The AutoJudge achieves high correlation with human judgments, with Kendall's $\tau = 0.678$ for Task 1 and $\tau = 0.872$ for Task 2, enabling reliable reuse of the evaluation framework for future research.
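For readers unfamiliar with the statistic, Kendall's $\tau$ measures rank agreement between two score lists: the fraction of concordant item pairs minus the fraction of discordant ones. A minimal tau-a implementation, applied to invented toy scores (not figures from the track):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a over paired scores (assumes no ties)."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical per-system scores: human assessors vs. AutoJudge.
# One inverted pair out of 15 gives tau = 13/15.
human     = [0.91, 0.84, 0.77, 0.69, 0.55, 0.42]
autojudge = [0.86, 0.88, 0.74, 0.66, 0.58, 0.40]

print(round(kendall_tau(human, autojudge), 3))  # → 0.867
```

A $\tau$ of 0.872 for Task 2 thus indicates that AutoJudge and human assessors order submitted systems almost identically.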

This resource allows for both benchmarking assistive RAG systems and advancing automated evaluation methods, with human assessments serving as a gold standard. The structured submission format (JSONL for reports, tab-separated files for questions) facilitates consistency and cross-track participation with other TREC 2025 RAG-related tasks. These developments represent significant progress in AI tools for misinformation detection and media literacy support.
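The two file formats can be sketched concretely: one JSON object per line for Task 2 reports, and tab-separated rows for Task 1 questions. The field names and column order below are illustrative assumptions, not the official DRAGUN submission specification.

```python
import csv
import io
import json

# One Task 2 report as a single JSONL line (hypothetical schema).
report_line = json.dumps({
    "topic_id": "2025-01",
    "run_id": "my_rag_run",
    "answer": [{"text": "Two independent outlets corroborate the event.",
                "citations": ["msmarco_v2.1_doc_12_3#1"]}],
})

# One Task 1 question as a tab-separated row (hypothetical columns:
# topic_id, run_id, rank, question).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["2025-01", "my_rag_run", "1", "Who owns the publishing outlet?"])
question_line = buf.getvalue()

print(report_line)
print(question_line, end="")
```

Writing each report with `json.dumps` on a single line keeps the JSONL file trivially streamable, which matters when evaluation tooling processes many runs.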

Generated Mar 7, 2026
Cerebras Thinking

This paper outlines the framework and resources for the TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) track, which focuses on evaluating Retrieval-Augmented Generation (RAG) systems designed to assist readers in assessing news trustworthiness. As misinformation proliferates, the track challenges participants to move beyond simple question-answering to generate comprehensive, reader-oriented reports that analyze the veracity of news claims. The authors define the task architecture, detailing how systems must retrieve relevant evidence and synthesize it into coherent explanations that aid human decision-making, rather than just providing a binary "true/false" label.

A key contribution of this work is the provision of a robust dataset and an automated evaluation pipeline tailored for assistive AI. The paper describes the construction of the test collection, which includes news articles, associated claims, and evidence sources, as well as the methodologies for assessing system performance. Crucially, it introduces metrics designed to evaluate the helpfulness and reliability of the generated reports, addressing the difficulty of automatically scoring nuanced, explanatory text. This includes metrics for citation accuracy, argumentation quality, and the ability to detect subtle forms of misinformation or bias.

This research is significant because it establishes a standardized benchmark for a critical application of large language models: combating misinformation through user assistance. By shifting the evaluation focus from mere retrieval accuracy to the quality of reader guidance, the DRAGUN track encourages the development of RAG systems that are not only factually correct but also genuinely useful for non-expert users navigating complex media landscapes. The resources provided here serve as a foundation for future research into trustworthy AI, offering a rigorous testbed for systems intended to enhance media literacy and public understanding.

Generated Mar 11, 2026
Open-Weights Reasoning

# Summary: Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment

## Overview

This arXiv paper introduces resources for evaluating Retrieval-Augmented Generation (RAG) systems designed to assist readers in assessing news trustworthiness, a critical challenge in the era of misinformation. The work is part of the TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) track, which focuses on generating reader-oriented reports that synthesize evidence about the veracity and context of news articles. The paper provides datasets, evaluation metrics, and methodologies to benchmark RAG systems in this domain, addressing gaps in automated trustworthiness assessment where traditional fact-checking often lags behind the volume and velocity of news.

## Key Contributions and Insights

The paper's primary contributions include:

1. **Curated Benchmark Datasets**: It presents annotated collections of news articles paired with trustworthiness assessments, including multi-source evidence (e.g., fact-checks, editorial reviews, social media discourse) to train and evaluate RAG systems.
2. **Evaluation Frameworks**: It introduces automated metrics tailored to assess the helpfulness and accuracy of RAG-generated reports, going beyond traditional precision/recall measures to account for nuanced trustworthiness judgments (e.g., sourcing, bias detection, and contextual framing).
3. **Assistive RAG Design**: The work emphasizes RAG systems that explain trustworthiness rather than merely labeling content, aligning with user-centered AI principles where transparency and reasoning are prioritized.

## Why It Matters

The paper is significant for several reasons:

- **Misinformation Mitigation**: As misinformation spreads rapidly online, automated tools that assist readers in evaluating news trustworthiness could democratize fact-checking. RAG systems, by synthesizing dispersed evidence, offer a scalable alternative to manual verification.
- **Benchmarking Progress**: The proposed resources fill a critical gap in evaluating AI systems for trustworthiness assessment, where existing benchmarks often focus on binary fact-checking rather than nuanced discourse analysis.
- **Interdisciplinary Relevance**: The work bridges NLP, information retrieval, and media literacy, offering tools for researchers in AI ethics, journalism, and computational social science.

By formalizing the evaluation of assistive RAG systems, this paper advances both technical capabilities and ethical considerations in AI-driven media literacy.

Generated Mar 11, 2026