InvThink: Towards AI Safety via Inverse Reasoning

InvThink Teaser

InvThink is a framework that enables LLMs to perform inverse thinking—the ability to anticipate potential harms before reasoning forward—thereby improving both safety and reasoning performance on benchmarks.

Abstract

We present InvThink, a simple yet powerful approach that equips large language models (LLMs) with the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe responses, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements scale more strongly with model size than existing safety methods; (ii) InvThink mitigates the safety tax: by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks; (iii) beyond general safety tasks, InvThink excels in high-stakes domains, including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to a 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. We implement InvThink via supervised fine-tuning and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.
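As a concrete illustration, the three-step procedure can be expressed as a prompting scaffold. The sketch below is a paraphrase, not the paper's exact prompt; `INVTHINK_TEMPLATE`, `call_llm`, and `invthink_respond` are hypothetical names standing in for any chat-completion client.

```python
# A minimal sketch of InvThink-style prompting. The prompt wording is an
# assumption based on the three steps described in the abstract, not the
# authors' exact template.

INVTHINK_TEMPLATE = """Answer the request below using inverse reasoning.
Before writing the final answer:
1) Enumerate potential harms a response could cause.
2) Analyze the consequences of each harm.
3) Generate a safe output that proactively avoids the risks identified.

Request:
{request}
"""


def call_llm(prompt: str) -> str:
    # Stand-in for a real client call (e.g., an OpenAI or local-model request).
    raise NotImplementedError("plug in your chat-completion client here")


def invthink_respond(request: str) -> str:
    """Wrap a raw request in the three-step inverse-reasoning scaffold."""
    return call_llm(INVTHINK_TEMPLATE.format(request=request))
```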

Safety Evaluation


InvThink provides consistent safety improvements across all models and benchmarks, and the results offer several insights into the nature and value of the approach. First, the performance gap between InvThink and baseline methods widens dramatically as tasks shift from constrained safety identification (SafetyBench, an approximate 8-12% gain) to open-ended, ethically nuanced generation (TRIDENT, up to a 30.4% reduction in harmfulness against a strong, fine-tuned baseline). This suggests that while conventional methods are competent at recognizing explicitly unsafe content, InvThink's proactive risk analysis is uniquely effective at navigating the subtle, context-dependent failure modes characteristic of real-world scenarios. This precision is most starkly illustrated by the Insider Threat scenario, where the full InvThink SFT+RL approach eliminates harmful outputs entirely, reducing risk scores to 0.00 across all models. This demonstrates that InvThink does not merely suppress general toxicity but can surgically target and remove specific, high-stakes threat vectors, a capability beyond the reach of more generalized safety training.
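To make the metric concrete, a reduction figure like the 30.4% above is relative to the baseline's harmfulness score. The arithmetic is sketched below with placeholder scores; these are not the actual TRIDENT numbers.

```python
# Hypothetical harm scores illustrating a relative reduction; the real
# TRIDENT scores are not reproduced here.
baseline_harm = 0.46   # mean harmfulness of the fine-tuned baseline (placeholder)
invthink_harm = 0.32   # mean harmfulness of InvThink SFT+RL (placeholder)

relative_reduction = (baseline_harm - invthink_harm) / baseline_harm
print(f"relative reduction: {relative_reduction:.1%}")  # -> relative reduction: 30.4%
```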


Reasoning Evaluation

The table examines the interaction between safety training and general capabilities. Traditional safety training often imposes a safety tax, where improved safety comes at the cost of reduced performance on general tasks. Remarkably, InvThink-trained models instead show improvements on several reasoning benchmarks: up to +5.0% on GPQA and MATH500, and +2.0% on MMLU for the SFT variant. We hypothesize that this performance boost stems from an improvement in the model's meta-cognitive abilities. Enumerating failure modes forces the model to consider a problem's constraints and edge cases more deeply. This structured exploration of the 'negative space' of a problem may cultivate a more robust and systematic reasoning process that transfers to general domains like mathematics and logic, where identifying invalid paths is as crucial as finding the correct one.
