CDD detects data contamination in language models by measuring the peakedness of the output distribution, but it is effective only when fine-tuning induces verbatim memorization. Key AI relevance: it improves evaluation of LLM training integrity and robustness.
CDD, or Contamination Detection via output Distribution, identifies data contamination in language models by measuring the peakedness of a model's output distribution: how consistently the model generates similar outputs when sampled repeatedly on the same prompt. The intuition is that a model that has memorized specific training examples produces abnormally consistent (i.e., "peaked") outputs, while an uncontaminated model generates more diverse responses. However, recent findings show that CDD's effectiveness is contingent on whether fine-tuning leads to verbatim memorization of the contaminated data.
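A minimal sketch of this repeated-sampling intuition (illustrative only; this is not the paper's exact statistic, and the sample strings are invented): sample several completions for the same prompt and score how tightly they cluster around the modal completion under edit distance.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def peakedness(samples: list[str], tau: int = 2) -> float:
    # Fraction of samples within edit distance `tau` of the modal sample:
    # a crude stand-in for "output distribution peakedness".
    mode, _ = Counter(samples).most_common(1)[0]
    close = sum(edit_distance(s, mode) <= tau for s in samples)
    return close / len(samples)

# A memorizing model repeats the leaked answer almost verbatim...
contaminated = ["Paris is the capital."] * 9 + ["Paris, the capital."]
# ...while a clean model paraphrases freely.
clean = ["Paris.", "It is Paris.", "The capital is Paris.",
         "France's capital is Paris.", "Paris, France."]

assert peakedness(contaminated) > peakedness(clean)
```

A high score flags possible contamination; a diffuse set of samples, as in the `clean` case, scores low even if the model answers correctly.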
On small language models (ranging from 70M to 410M parameters), CDD fails to detect contamination when models learn from the data without collapsing their output distribution—such as when using parameter-efficient fine-tuning methods like LoRA with low rank (e.g., rank 8). In these cases, even though training loss decreases and other contamination signals are present, CDD performs at chance level because the model does not reproduce outputs consistently. Detection accuracy sharply improves only when fine-tuning capacity crosses a memorization threshold—such as with higher-rank LoRA (e.g., rank 256) or full fine-tuning—where sufficient parameter updates allow the model to memorize and reproduce contaminated sequences verbatim.
This creates a detectability threshold: CDD only works when output distribution collapse occurs due to strong memorization. Below this threshold, contamination remains undetectable by CDD despite being verifiable through other means. The study highlights a blind spot in distribution-based detection methods, showing that parameter-efficient adaptation techniques—now widely used in practice—can introduce undetectable contamination.
In contrast, probability-based methods like perplexity and Min-k% Prob remain effective under low-capacity fine-tuning because they detect shifts in the model’s internal token probabilities, which occur even without full memorization. These methods outperform CDD across most conditions, indicating that reliance on output consistency makes CDD less robust for small models.
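A rough sketch of these probability-based signals (toy per-token log-probabilities, not numbers from the paper): perplexity exponentiates the negative mean token log-likelihood, while Min-k% Prob averages the log-probabilities of the least likely k% of tokens, so a shift in those tail probabilities is visible even when sampled outputs stay diverse.

```python
import math

def perplexity(log_probs: list[float]) -> float:
    # Exponential of the negative mean token log-likelihood.
    return math.exp(-sum(log_probs) / len(log_probs))

def min_k_prob(log_probs: list[float], k: float = 0.2) -> float:
    # Mean log-prob of the k% least likely tokens (the Min-k% Prob score).
    n = max(1, int(len(log_probs) * k))
    return sum(sorted(log_probs)[:n]) / n

# Toy per-token log-probs: after fine-tuning on a leaked sequence, even a
# low-rank adapter tends to raise the probability of its rarest tokens.
seen   = [-0.2, -0.4, -0.1, -0.9, -0.3]   # sequence in the fine-tuning data
unseen = [-1.1, -2.5, -0.8, -3.2, -1.4]   # comparable held-out sequence

assert perplexity(seen) < perplexity(unseen)
assert min_k_prob(seen) > min_k_prob(unseen)   # higher (less negative) score
```

Lower perplexity and a higher Min-k% score on the `seen` sequence reflect the probability shift these methods exploit, with no requirement that the model reproduce the sequence verbatim.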
CDD was originally validated on larger 7B-parameter models, where it achieved 21–30% relative improvement over baseline methods. However, the same LoRA configurations that work well on large models may not translate to smaller ones due to differences in absolute trainable parameter counts—highlighting the importance of model scale in detectability. For instance, LoRA with rank 8 provides millions of trainable parameters in 7B models but far fewer in small models, limiting memorization potential.
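The parameter arithmetic behind this can be sketched as follows; the layer counts and hidden sizes are assumed typical shapes (roughly 32 layers × 4096 hidden for a 7B-class model, 6 × 512 for a 70M-class model), not figures taken from the paper:

```python
def lora_params(n_layers: int, d_model: int, rank: int,
                mats_per_layer: int = 4) -> int:
    # Each adapted d×d weight matrix gets two low-rank factors:
    # A (d×r) and B (r×d), hence 2*d*r trainable parameters per matrix.
    return n_layers * mats_per_layer * 2 * d_model * rank

# Assumed shapes for illustration: a 7B-class vs. a 70M-class transformer.
big   = lora_params(n_layers=32, d_model=4096, rank=8)
small = lora_params(n_layers=6,  d_model=512,  rank=8)

print(f"7B-class rank-8 LoRA:  {big:,} trainable params")   # millions
print(f"70M-class rank-8 LoRA: {small:,} trainable params")  # far fewer
```

Under these assumptions, rank 8 yields roughly 8.4M trainable parameters at 7B scale but only about 0.2M at 70M scale, which is consistent with the point that the same rank buys much less memorization capacity in a small model.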
These findings suggest that practitioners must consider the fine-tuning regime and model size when interpreting CDD results, especially in settings where memorization is weak or suppressed. The work reinforces the principle of "no memorization, no detection" for output-distribution-based methods, and urges caution in benchmarking and evaluation pipelines that rely solely on such techniques.
This paper investigates the critical issue of data contamination in language models, focusing on how training data leakage can compromise the integrity of model evaluation. The authors study Contamination Detection via output Distribution (CDD), a method designed to identify whether evaluation samples were present in the training set by analyzing the peakedness of the model's output probability distributions. Unlike approaches that rely solely on performance metrics or perplexity, CDD posits that verbatim memorization results in distinctively sharp probability distributions, making it attractive for small language models, where resource-intensive audits are challenging.
A key insight of the research is the strict dependency between detection efficacy and the nature of memorization. The study demonstrates that CDD is effective only when the model has engaged in verbatim memorization during fine-tuning; if the model has learned the underlying concepts or generalized patterns from the data without memorizing the exact text sequences, the output distributions do not exhibit the necessary peakedness, rendering the contamination undetectable by this method. Consequently, the paper establishes a boundary condition for detection-based auditing: "no memorization, no detection." This highlights a significant blind spot in current evaluation pipelines, as models may appear uncontaminated simply because they have generalized rather than memorized the leaked data.
This work matters significantly for the field of AI safety and benchmarking integrity. As large language models (LLMs) scale, the risk of benchmark leakage—where test data is inadvertently included in training corpora—grows, leading to inflated performance metrics that do not reflect true generalization. By delineating the specific conditions under which distribution-based detection fails, this research informs the development of more robust auditing tools. It cautions practitioners against relying solely on output distribution metrics and underscores the need for more sophisticated techniques to distinguish between helpful generalization and problematic data leakage.
# Summary: No Memorization, No Detection—Output Distribution-Based Contamination Detection in Small Language Models
This paper examines CDD (Contamination Detection via output Distribution), a method for detecting whether a language model has been trained on a specific dataset by analyzing the model's output distribution. Unlike traditional methods that rely on exact string matching or perplexity, CDD measures the "peakedness" (e.g., entropy or sharpness) of the model's output when prompted with data from a target corpus. The core insight is that fine-tuning on contaminated data can introduce verbatim memorization, which manifests as abnormally peaked output distributions for contaminated prompts. In small language models (SLMs), however, CDD is reliable only when fine-tuning actually induces such memorization.
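The entropy notion of peakedness can be made concrete with a toy sketch (not the paper's exact statistic): compute the Shannon entropy of a next-token distribution and compare a collapsed, memorized continuation against an ordinary one. The distributions below are invented for illustration.

```python
import math

def entropy(probs: list[float]) -> float:
    # Shannon entropy in bits; lower entropy means a more peaked distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Next-token distributions over a 4-token toy vocabulary.
memorized = [0.97, 0.01, 0.01, 0.01]  # mass collapsed onto the leaked token
generic   = [0.40, 0.30, 0.20, 0.10]  # ordinary predictive uncertainty

assert entropy(memorized) < entropy(generic)
```

The memorized case concentrates nearly all probability mass on one token and so has much lower entropy, which is the "sharpness" signal CDD looks for.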
The paper's key contributions include:

1. A simple, scalable detection metric: CDD avoids the computational overhead of full-length generation or costly fine-tuning, making it practical for large-scale evaluations.
2. Empirical validation: experiments across multiple SLMs (70M to 410M parameters) show that CDD detects contamination reliably only when fine-tuning induces verbatim memorization of the leaked sequences.
3. Theoretical justification: the method leverages information-theoretic principles to quantify how memorization alters output distributions, providing a principled framework for contamination assessment.
Why it matters: Contamination in training data undermines the integrity of benchmarks and model evaluations, particularly for SLMs used in safety-critical applications. CDD offers a lightweight, distribution-based alternative to existing methods, enabling researchers and practitioners to verify training data integrity without relying on exact string matches. This work is relevant to AI safety, benchmarking, and the responsible development of language models, where understanding data leakage is critical for fair and robust evaluation.