Defines counterfactual harm and complementarity principles that enable conversational AI to reliably aid high-stakes human decisions under user-defined rules.
Counterfactual harm in human-AI collaboration is defined as a situation where a decision maker who would have decided correctly on their own ends up deciding worse after consulting the AI system; that is, the AI's intervention leads to a worse outcome than if the human had acted alone. This concept is particularly relevant in high-stakes domains such as healthcare or criminal justice, where the principle of "first, do no harm" must extend to algorithmic decision support systems. Recent work formalizes counterfactual harm using structural causal models and shows that it can be estimated or bounded under certain monotonicity assumptions, enabling the design of systems that guarantee harm levels below a user-specified threshold via conformal risk control.
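To make the guarantee concrete, the following Python sketch shows how a harm budget could be enforced with conformal risk control, assuming a held-out calibration set on which both the expert-alone and expert-plus-AI outcomes are known; the function names, the calibration-matrix layout, and the loss bound of 1 are illustrative assumptions, not the cited papers' actual implementation.

```python
import numpy as np

def harm_indicator(human_alone_correct: bool, human_with_ai_correct: bool) -> float:
    """Per-instance counterfactual harm under monotonicity: the expert
    would have been right alone but erred with the AI's support."""
    return float(human_alone_correct and not human_with_ai_correct)

def calibrate_lambda(harms_by_lambda: np.ndarray, lambdas, alpha: float):
    """Pick the first operating point lambda (lambdas scanned in
    preference order) whose conformal risk bound meets the harm budget.

    harms_by_lambda: (n_cal, n_lambda) matrix of per-instance harm
    indicators observed on the calibration set, one column per lambda.
    alpha: user-specified upper bound on expected counterfactual harm.
    """
    n = harms_by_lambda.shape[0]
    for lam, harms in zip(lambdas, harms_by_lambda.T):
        # Conformal risk control bound for a loss bounded by 1:
        # (n * empirical_risk + 1) / (n + 1) <= alpha.
        if (n * harms.mean() + 1.0) / (n + 1) <= alpha:
            return lam
    return None  # no operating point satisfies the harm budget
```

At deployment, the selected lambda (for example, a prediction-set size parameter) then carries a distribution-free guarantee that expected harm stays below alpha on exchangeable data.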
Complementarity in human-AI collaboration refers to the goal of achieving joint performance that exceeds the individual capabilities of either the human or the AI acting alone. However, achieving true complementarity remains challenging, especially when AI systems disrupt human agency or fail to align with expert reasoning workflows. In classification tasks, decision support systems based on prediction sets, where the AI narrows the possible labels and the human selects from them, have shown promise in improving average accuracy. Yet these systems may inadvertently cause counterfactual harm by leading experts away from correct independent judgments, revealing a trade-off between accuracy gains and potential harm.
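For reference, prediction-set support of this kind is typically built with split conformal prediction. The short sketch below is a generic version of that recipe (standard nonconformity scores, a finite-sample-corrected quantile), not the specific system evaluated in the cited work.

```python
import numpy as np

def conformal_prediction_set(probs_cal, labels_cal, probs_test, alpha=0.1):
    """Split-conformal prediction set: the expert picks the final label
    from a set that contains the truth with probability >= 1 - alpha.

    probs_cal:  (n, K) softmax scores on a held-out calibration split
    labels_cal: (n,) true labels for the calibration split
    probs_test: (K,) softmax scores for the new instance
    """
    n = len(labels_cal)
    # Nonconformity score: one minus the probability of the true class.
    scores = 1.0 - probs_cal[np.arange(n), labels_cal]
    # Finite-sample-corrected quantile of the calibration scores.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    # Keep every label whose score falls under the threshold.
    return np.where(1.0 - probs_test <= q)[0]
```

The trade-off discussed above lives in alpha: tightening it grows the sets, preserving the expert's chance to recover labels the model ranked low, at the cost of less filtering.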
Multi-round collaboration frameworks aim to overcome these limitations by enabling iterative, bidirectional interaction in which the AI supports intermediate decision stages rather than issuing final recommendations. For example, in sepsis diagnosis, an AI system can suggest which lab tests to perform next to reduce uncertainty, aligning with clinicians’ cognitive processes and fostering trust over time. Such systems support counterfactual reasoning by allowing users to explore how missing data might affect predictions before ordering tests. These interactions go beyond one-way assistance, incorporating elements of collaborative exploration in which both parties refine hypotheses and update beliefs iteratively.
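One plausible way to operationalize "which test next" is expected information gain: rank each candidate test by how much it is expected to shrink diagnostic uncertainty. The sketch below is a generic formulation under that assumption; the array shapes and function names are hypothetical, and a real system would derive these distributions from a fitted predictive model.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def expected_information_gain(p_diag, p_result, p_diag_given_result):
    """Expected reduction in diagnostic uncertainty from one lab test.

    p_diag:              (D,) current posterior over diagnoses
    p_result:            (R,) predictive distribution over test outcomes
    p_diag_given_result: (R, D) updated posterior for each outcome
    """
    h_before = entropy(p_diag)
    h_after = sum(pr * entropy(post)
                  for pr, post in zip(p_result, p_diag_given_result))
    return h_before - h_after

def suggest_next_test(candidates):
    """candidates: dict of test name -> (p_diag, p_result, p_diag_given_result).
    Returns the test expected to reduce uncertainty the most."""
    return max(candidates, key=lambda t: expected_information_gain(*candidates[t]))
```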
To ensure reliable performance under user-specified requirements, recent frameworks propose integrating teleological constraints (constitutional principles that bound AI goal formation) and epistemic provenance trails for auditability. These allow high-stakes decisions to be traced back to their reasoning steps, supporting accountability. Additionally, mixed-initiative protocols define when the AI should defer, challenge, or interrupt, enabling "intelligent disobedience" in risky situations. For instance, if a clinician instructs an AI to administer a worksheet that could worsen a patient’s confusion, the AI might respond with a flagged review instead of compliance.
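A toy version of such a protocol can be written as a small routing rule; the thresholds, risk estimate, and action names below are hypothetical placeholders for whatever a deployed system would actually use.

```python
from enum import Enum

class Action(Enum):
    COMPLY = "comply"           # execute the instruction
    FLAG_FOR_REVIEW = "flag"    # challenge: surface the concern
    DEFER_TO_HUMAN = "defer"    # interrupt: hand the choice back

def mixed_initiative_policy(risk_estimate, violates_constraint,
                            risk_threshold=0.2):
    """Hypothetical mixed-initiative rule: comply by default, but
    practice 'intelligent disobedience' when an instruction trips a
    constitutional constraint or the estimated risk is too high."""
    if violates_constraint:
        return Action.FLAG_FOR_REVIEW  # e.g., the worksheet example above
    if risk_estimate > risk_threshold:
        return Action.DEFER_TO_HUMAN
    return Action.COMPLY
```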
Effective collaboration also depends on trust calibration and interaction design that respects human expertise. Systems like A2C formalize this through modular stages (automation, augmented deferral, and collaborative exploration, or CoEx), with the AI dynamically shifting modes based on confidence and context. In CoEx mode, the AI acts as a collaborator, answering queries and offering context to support joint problem-solving in ambiguous cases. This aligns with findings that combining uncertainty estimates with explanations enhances both accuracy and subjective understanding, particularly when human confidence is low but model confidence is high.
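In the spirit of that staged design, a minimal routing sketch might look as follows; the thresholds and the use of self-reported human confidence are assumptions made for illustration, not A2C's published mechanism.

```python
def select_mode(model_conf, human_conf, auto_thr=0.95, defer_thr=0.70):
    """Route an instance to one of three collaboration modes, loosely
    mirroring an automation / augmented-deferral / CoEx split."""
    if model_conf >= auto_thr:
        return "automation"            # clear case: act autonomously
    if model_conf >= defer_thr and human_conf < model_conf:
        # The regime noted above: low human confidence, high model
        # confidence; pair the recommendation with an explanation.
        return "augmented_deferral"
    return "coex"                      # ambiguous case: explore jointly
```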
Ultimately, advancing human-AI collaboration requires moving beyond static explanations or risk scores toward systems that engage in collaborative causal sensemaking, jointly constructing, critiquing, and revising causal models over time. This paradigm emphasizes not just output quality but sustained epistemic alignment, ensuring AI supports human judgment without undermining autonomy or introducing uncontrolled risks.
This research addresses the critical challenge of deploying conversational AI in high-stakes decision-making environments where reliability and user control are paramount. The authors propose a framework for multi-round human-AI collaboration that enables users to inject domain-specific requirements and constraints directly into the interaction loop. Rather than treating the AI as a black-box oracle, the system leverages these user-defined rules to guide generation, ensuring that outputs adhere to strict safety and operational boundaries. This iterative process allows for continuous refinement, where the AI acts as a compliant assistant that adapts to the specific logic and risk tolerance defined by the human operator.
The paper’s theoretical contribution centers on the formalization of "counterfactual harm" and "complementarity principles" as guardrails for this interaction. Counterfactual harm provides a metric to assess whether the AI's intervention causes a worse outcome than if the human had proceeded without assistance, while complementarity ensures the system actively enhances human decision-making capabilities rather than merely providing redundant or distracting information. This work is significant because it establishes a rigorous foundation for trustworthy AI agents, moving beyond generic large language model capabilities toward specialized, auditable tools that respect human agency and mitigate the risks of automation bias in sensitive domains.
This paper introduces a framework for multi-round human-AI collaboration where AI assistants are constrained by user-specified requirements to ensure reliable support in high-stakes decision-making. The work formalizes two key principles, counterfactual harm and complementarity, which guide the AI's behavior to align with human goals while avoiding unintended negative outcomes. Counterfactual harm assesses whether the AI’s assistance leads to a worse outcome than the user would have reached unaided, while complementarity ensures the AI’s contributions meaningfully enhance (rather than replace) human expertise.
The paper’s core contribution is a formal methodology for integrating these principles into conversational AI, enabling iterative refinement of user-defined rules through dialogue. By treating requirements as dynamic constraints, the system adapts to evolving human preferences while maintaining safety and effectiveness. This approach addresses a critical gap in current AI assistants, which often lack mechanisms to enforce user-specified boundaries in high-risk domains like healthcare, finance, or legal advice. The work matters because it provides a scalable, principled foundation for deploying AI as a trustworthy collaborator rather than an autonomous decision-maker, bridging the gap between flexibility and reliability in human-AI interaction.
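As a rough illustration of what "requirements as dynamic constraints" could mean in a dialogue loop, consider the sketch below; the `generate` and `check_rules` callables and the regeneration budget are hypothetical stand-ins, since the paper's actual protocol is not reproduced here.

```python
def constrained_reply(generate, check_rules, rules, prompt, max_rounds=3):
    """Hypothetical enforcement loop: treat user-specified rules as
    constraints on each draft, regenerating until a draft complies and
    escalating to the human when no compliant draft is found.

    generate:    fn(prompt, feedback) -> candidate reply (e.g., an LLM call)
    check_rules: fn(reply, rules) -> list of violated-rule descriptions
    """
    feedback = None
    for _ in range(max_rounds):
        draft = generate(prompt, feedback)
        violations = check_rules(draft, rules)
        if not violations:
            return draft
        # Feed the violations back so the next draft can repair them.
        feedback = "Violated rules: " + "; ".join(violations)
    return None  # defer to the human operator rather than risk harm
```

Because the rule set can be edited mid-conversation, the same loop naturally adapts as requirements evolve.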
Why it matters: As AI systems increasingly assist in high-stakes contexts, ensuring they respect user-defined constraints is paramount. This paper offers a technical blueprint for designing AI that is both adaptive and accountable, moving beyond static rule-based systems toward interactive, requirement-aware collaboration. It has implications for ethical AI design, regulatory compliance, and the development of AI tools that augment—not replace—human judgment.