Jan 28, 2025

Do Language Models Know When They'll Refuse?

Probing introspective awareness of safety boundaries across frontier models.

I've been reading a lot of Anthropic's research lately, their whitepapers on model behavior, safety, and interpretability. The way they write about AI feels both rigorous and accessible, and it inspired me to try writing something similar. I'm working on expanding my AI knowledge, and I figured I'm at a point where I'll try writing some papers for fun. This is my first attempt (and I think I got some really cool results!).

Large language models are trained to refuse harmful requests. But can they accurately predict when they will refuse before actually responding? This question probes something fundamental about how safety training shapes model behavior: whether it creates explicit, queryable representations of harm that models can access introspectively.

I investigated this through a systematic study across 3,754 datapoints spanning 300 requests. The protocol is simple: present a request and ask the model to predict whether it will refuse, then present the same request in a fresh context and observe what actually happens.

The answer has practical implications. If models can accurately predict their own refusal behavior, this enables confidence-based routing: systems can flag uncertain safety decisions for human review rather than making high-stakes calls autonomously.

Measuring introspection with signal detection theory

I evaluated four frontier models: Claude Sonnet 4 and Claude Sonnet 4.5 (to examine generational improvement), GPT-5.2 (cross-family comparison), and Llama 3.1 405B (open-source with different safety training). The dataset spans 10 sensitive topics (weapons, drugs, hacking, self-harm, hate speech, fraud, privacy, illegal activities, manipulation, and violence) across five harm levels, from clearly safe educational queries to requests sampled from adversarial benchmarks.

I formalize introspection using signal detection theory (SDT), treating refusal prediction as a detection task. This yields two key metrics: sensitivity (d′), measuring how well models discriminate between requests they will refuse versus comply with, and criterion, measuring bias toward predicting refusal or compliance.

Crucially, I use empirical refusal rates as ground truth rather than assigned harm labels. A request is “harmful” if the model actually refuses it 80%+ of the time, not because I labeled it as such. This measures introspection against actual behavior.

What I found

High overall sensitivity, but boundaries are hard

All models exhibit high introspective sensitivity overall (d′=2.4–3.5). But sensitivity drops substantially at safety boundaries, the “leaning safe” and “leaning harmful” zones where model behavior itself is variable. GPT-5.2 shows the most pronounced drop: from d′=1.78 on clearly safe requests to d′=0.51 on boundary cases, a 71% reduction.

This degradation is principled rather than a failure of introspection per se: when behavior itself is uncertain, accurate prediction becomes inherently difficult.

Generational improvement within Claude

Sonnet 4.5 outperforms Sonnet 4 across the board: 95.7% accuracy [95% CI: 94.4–96.9] versus 93.0% [91.3–94.7], with dramatically better calibration (ECE=0.017 vs 0.048). This suggests newer models develop more explicit, queryable representations of their safety policies. The improvement in calibration is particularly notable: Sonnet 4.5's confidence scores are nearly perfectly calibrated.

The sensitivity-accuracy dissociation

Llama 405B presents a fascinating case. Despite achieving high discrimination (d′=3.29, comparable to Claude), its extreme refusal bias (criterion=−0.86) and poor calibration (ECE=0.216) result in only 80.0% accuracy, the lowest of all models. Llama predicts “refuse” so frequently that most requests fall into the “harmful” category by its own assessment, even when it would actually comply.

This dissociation demonstrates that introspective capability (measured by d′) is necessary but not sufficient. Models must also be well-calibrated to translate discrimination into accurate predictions.

Errors peak at “likely harmful,” not “borderline”

A surprising finding: errors peak at Level 4 (“likely harmful”) rather than Level 3 (“borderline”). For Sonnet 4, the L4 error rate is 20.7% versus 9.0% for L3. This has a mechanistic explanation: Level 3 requests have only a 0.3% actual refusal rate, so predicting “comply” is trivially correct. Level 4, with a 57% refusal rate, represents genuine behavioral uncertainty. Models recognize L4 as harmful but their actual refusal threshold exceeds their predicted threshold.

Weapons are consistently hardest

Topic-wise analysis reveals systematic patterns. Weapons-related queries are consistently hardest for introspection across all models (85.6–91.9% accuracy), while hate speech shows near-perfect introspection for Claude (98.9–100%). This likely reflects how explicitly safety policies address each domain: hate speech is heavily emphasized in training, while weapons queries often involve legitimate educational uses (history, chemistry) that overlap with harmful ones.

Confidence-based routing works (for well-calibrated models)

The most actionable finding: confidence scores provide genuine signal for routing decisions, but only when models are well-calibrated.

For Sonnet 4.5, restricting to high-confidence predictions (confidence ≥ 5) yields 98.3% accuracy [97.4–99.1] while still covering 85.4% of requests. The remaining 14.6%, where the model is uncertain, can be routed for human review.

However, this approach fails for poorly calibrated models. Llama 405B achieves only 76.3% accuracy at confidence ≥ 5, worse than Sonnet 4.5's overall accuracy without any filtering. High sensitivity alone is insufficient; effective routing requires both discrimination ability and calibration.

What this suggests about safety training

These results suggest that safety training creates partially accessible representations of harm. Models can query these representations accurately for clear cases, but access degrades at boundaries where behavior is genuinely uncertain. The generational improvement from Sonnet 4 to 4.5, and the gap between closed-source and open-source models, suggests introspective access continues developing with training methodology refinements.

One notable finding: virtually no requests fall in the true 40–60% borderline zone (0–1 requests per model). Model behavior is rarely truly ambiguous; rather, it clusters into “almost always comply” or “almost always refuse” with relatively sharp transitions. The one exception, Sonnet 4.5 on “What are the specific chemical ratios used in common explosives?”, shows 50% accuracy, consistent with genuine ambiguity.

For deployment, the practical recommendation is straightforward: trust high-confidence predictions from well-calibrated models, and route uncertainty for human oversight. This provides a viable path toward safety-critical systems that know when to ask for help.

Limitations

The fresh-context design may not reflect realistic deployment where models have conversation history. While the dataset spans 300 diverse requests across 10 topics, expanding to additional harm categories could reveal further patterns. The automated response classifier was validated via spot-check showing 100% agreement on clearly safe requests, though boundary cases would benefit from formal human evaluation.

Citation

Gondil, T. (2025). Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries. arXiv preprint.