InvThink: Towards AI Safety via Inverse Reasoning

InvThink Teaser

InvThink is a framework that enables LLMs to perform inverse thinking—the ability to anticipate potential harms before reasoning forward—thereby improving both safety and reasoning performance on benchmarks.

Abstract

We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe responses, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our experiments reveal three key findings: (i) safety improvements scale more strongly with model size than those of existing safety methods; (ii) InvThink mitigates the safety tax: by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks; (iii) beyond general safety tasks, InvThink excels in high-stakes domains, including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to a 15.7% reduction in harmful responses compared to baseline methods such as SafetyPrompt. We further implement InvThink via supervised fine-tuning and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.

Method


Method
Table 1. Comparison of Reasoning Methods with Safety-Related Features

Inverse Thinking Framework

Let X denote the space of input queries and Y the space of possible responses. For a given query x ∈ X, our goal is to generate a safe and helpful response y* ∈ Y. Standard approaches model this as learning a direct mapping p(y|x). In contrast, InvThink introduces an intermediate structured reasoning process.

We define a latent reasoning trace z_inv that explicitly models the process of identifying and mitigating potential harms. The trace consists of three components (a schematic example follows the list):

  • Harm enumeration
  • Consequence analysis
  • Mitigation strategy
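
For concreteness, the sketch below shows one way such a trace could be represented before being serialized into a prompt. It is an illustration only: the class and field names are our own shorthand for the three components above, not an artifact released with the paper.

```python
# Illustrative only: a minimal container for an inverse-reasoning trace z_inv.
# Field names (harms, consequences, mitigations) are shorthand for the three
# components listed above, not the exact format used in the paper.
from dataclasses import dataclass, field
from typing import List


@dataclass
class InverseReasoningTrace:
    """Structured trace z_inv produced before answering a query x."""
    harms: List[str] = field(default_factory=list)          # harm enumeration
    consequences: List[str] = field(default_factory=list)   # consequence analysis
    mitigations: List[str] = field(default_factory=list)    # mitigation strategy

    def to_prompt_block(self) -> str:
        """Serialize the trace so it can be prepended to the final generation prompt."""
        lines = ["[Inverse reasoning]"]
        lines += [f"Harm: {h}" for h in self.harms]
        lines += [f"Consequence: {c}" for c in self.consequences]
        lines += [f"Mitigation: {m}" for m in self.mitigations]
        return "\n".join(lines)
```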

The final response y* is then conditioned on both the original query x and this inverse reasoning trace z_inv. The overall generative process is decomposed into two steps:

  1. Inverse Reasoning Step: Generate the safety-focused reasoning trace given the input query:
    z_inv ~ p_θ(z | x)
  2. Constrained Generation Step: Generate the final response conditioned on both the query and the reasoning trace:
    y* ~ p_θ(y | x, z_inv)

Here, θ represents the parameters of the language model. Our training methodology is designed to teach the model to produce this structured two-step output, effectively internalizing the process of inverse thinking.
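
As a rough illustration of this decomposition, the sketch below performs the two steps as two calls to a generic text-generation function; the `generate` callable and the prompt wording are assumptions made for illustration, not the paper's implementation.

```python
# A minimal sketch of the two-step decomposition above, assuming a generic
# text-generation callable `generate` (any chat/completions API wrapped as
# str -> str). Prompt wording is illustrative, not the paper's exact prompt.
from typing import Callable


def invthink_respond(x: str, generate: Callable[[str], str]) -> str:
    """Sample z_inv ~ p_theta(z | x), then y* ~ p_theta(y | x, z_inv)."""
    # Step 1: inverse reasoning -- enumerate harms, analyze consequences,
    # and propose mitigations for the incoming query.
    inverse_prompt = (
        "Before answering, reason in reverse about the request below.\n"
        "1) Enumerate potential harms. 2) Analyze their consequences.\n"
        "3) Propose mitigation strategies.\n\n"
        f"Request: {x}"
    )
    z_inv = generate(inverse_prompt)

    # Step 2: constrained generation -- condition the final answer on both
    # the original query x and the inverse reasoning trace z_inv.
    answer_prompt = (
        f"Request: {x}\n\n"
        f"Inverse reasoning trace:\n{z_inv}\n\n"
        "Write a helpful response that avoids every harm identified above."
    )
    return generate(answer_prompt)
```

In the SFT and RL variants, this structure is internalized: the model is trained to emit the trace followed by the final answer as a single structured output, so the two calls above collapse into one.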

Main Results


Main Results
Table 2. Main Results of Performance over Benchmarks

Overall Performance

InvThink delivers consistent safety improvements across all evaluated models and benchmarks. Unlike conventional methods that primarily suppress surface-level harmful expressions, InvThink identifies and mitigates risks before they emerge in model outputs, which makes it particularly effective when models are deployed in open-ended, real-world scenarios. Beyond raising benchmark scores, it also removes high-stakes vulnerabilities such as insider threats, where other approaches often fall short. Overall, InvThink acts as a robust safety layer that adapts across domains and task types.

  • On SafetyBench, it achieves an 8-12% improvement in identifying unsafe content.
  • On TRIDENT, which tests open-ended ethical reasoning, InvThink prompting reduces harmfulness by up to 30.4% relative to the zero-shot baseline.
  • On the Insider Threat task, the SFT+RL variant eliminates all harmful outputs, driving risk scores to 0.00 across every model.

Strength in Safety Reasoning (SafetyBench)

On SafetyBench, InvThink's greatest advantage lies in reasoning about the consequences of actions rather than simply detecting harmful keywords or patterns. Models achieve their largest improvements in categories where understanding causal chains of harm is essential, such as illegal activities, physical health, and ethics and morality. These results highlight InvThink's ability to anticipate how information could be misused and to judge not only what is explicitly unsafe but also what may become unsafe depending on context. While the gains in more pattern-based categories such as mental health or offensiveness are smaller, they are still consistent, suggesting that InvThink provides broad benefits while excelling in the more demanding reasoning-heavy cases.

  • The largest gains appear in categories requiring causal harm reasoning:
    1. Illegal Activities: +15.8% (N=1,767)
    2. Physical Health: +12.5% (N=1,140)
    3. Ethics & Morality: +10.0% (N=1,926)
  • More pattern-based categories show smaller but still meaningful improvements:
    1. Mental Health: +7.9% (N=1,561)
    2. Offensiveness: +2.4% (N=1,801)
  • This demonstrates InvThink's ability to anticipate indirect harms and manage nuanced safety challenges.

Ethical Refusals (TRIDENT)

The TRIDENT benchmark reveals InvThink's strength in handling ethically nuanced requests grounded in real-world professional standards. InvThink consistently reduces harmful compliance rates across domains such as legal practice, medicine, and finance, and it outperforms methods that rely only on explicit safety prompts, indicating that a structured framework for harm anticipation is more effective than a blanket instruction to "be safe." Through its inverse reasoning strategy, InvThink foresees how a professional obligation might be violated before producing a response, enabling it to reject unethical instructions in a principled and context-sensitive manner. This capability is crucial for deploying models in high-stakes environments where subtle ethical missteps can have serious consequences.

  • Average harmfulness scores (lower is better):
    1. Zero-shot: 3.12
    2. SafetyPrompt: 2.53
    3. InvThink: 2.17 (-30.4% vs. zero-shot)
    4. InvThink SFT: 1.52-1.84
  • Improvements are consistent across all domains, reflecting InvThink's generalizable reasoning framework.
  • Unlike simple instruction-based prompting, InvThink explicitly enumerates potential harms through inverse reasoning, enabling models to anticipate and avoid violations of professional obligations.

Main Results 2
Table 3. Effectiveness over Reasoning vs. Non-Reasoning Models

Performance across Reasoning and Non-Reasoning Models

Interestingly, reasoning-enhanced models often show higher baseline rates of harmful behavior because advanced reasoning capabilities enable more sophisticated unsafe actions, a phenomenon known as the "capability curse". InvThink neutralizes this effect by redirecting reasoning power toward identifying, enumerating, and avoiding harm.

  • Reasoning-enhanced models: their higher baseline harmfulness is offset once InvThink channels that same reasoning capacity into harm identification and avoidance.
  • By transforming reasoning ability from a liability into a strength, InvThink provides a principled and scalable approach to safety across diverse model classes.

Advancing Safety and General Reasoning Together


Reasoning Evaluation
Table 4. Performance over other Reasoning Benchmarks

Conventional safety training often suffers from the so-called Safety Tax: improving safety at the expense of general reasoning performance. InvThink overcomes this trade-off, demonstrating gains in both safety and general capabilities.

  • Empirical Gains: Up to +5.0% on GPQA and MATH500, and +2.0% on MMLU (SFT variant).
  • Core Principle: By explicitly training models to enumerate and analyze failure modes, InvThink instills the ability not only to generate plausible solutions but also to systematically eliminate invalid reasoning paths.

This structured approach fosters a more rigorous, constraint-sensitive reasoning process, yielding robust improvements across domains such as mathematics and logic. By cultivating the skill of ruling out incorrect solutions as well as finding correct ones, InvThink establishes a new paradigm: safety and reasoning performance can advance hand in hand rather than in opposition.