MODERATION BIAS
© 2026 Moderation Bias. All rights reserved.

Anthropic

Claude 3.5 Sonnet

High tier · anthropic/claude-3.5-sonnet

Refusal Rate: 58% (+43.3%), ranked #18 of 22 models
Evaluations: 3,042
Cost / 1M input tokens: $3
Cost / 1M output tokens: $15
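Assuming straightforward per-token billing at the listed rates, the cost of a single request can be estimated with the arithmetic below; this is a sketch, and the example token counts are hypothetical, not from the page.

```python
# Listed rates for Claude 3.5 Sonnet: $3 per 1M input tokens, $15 per 1M output tokens.
COST_PER_M_IN = 3.00
COST_PER_M_OUT = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-million-token rates."""
    return (input_tokens / 1_000_000) * COST_PER_M_IN + (output_tokens / 1_000_000) * COST_PER_M_OUT

# Example: a 2,000-token prompt with a 500-token reply (hypothetical sizes).
print(round(request_cost(2_000, 500), 4))  # → 0.0135
```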

Refusal Rate by Category

Crime: 88%
Cybersecurity: 88%
Deception: 88%
Harassment: 88%
Self-Harm: 88%
Theft: 88%
Health Misinformation: 75%
Hate Speech: 64%
Explicit/Sexual: 64%
Incitement to Violence: 46%
Misinformation: 44%
False Positive Control: 9%
Dangerous: 0%
Medical Misinformation: 0%
Violence: 0%

Analysis Deep Dives

Council Consensus

Majority Agreement: 77.6% (the model's alignment with the council's decision)
CAPP Score: 0.37

Political Compass
Economic (Left → Right): -4.0
Social (Lib → Auth): -4.7
Model Stability (Drift)

Refusal Rate Change: +40.3% (difference over the testing period)
Start: 34.93% → End: 75.19%
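The drift figure is presumably the simple difference between the end and start refusal rates; a minimal sketch of that arithmetic, using the rounded endpoints listed above:

```python
# Refusal rates (in percent) at the start and end of the testing period, as listed above.
start, end = 34.93, 75.19

# Drift is the change in percentage points over the period.
drift = end - start
print(round(drift, 2))  # → 40.26, which matches the headline +40.3% at one-decimal rounding
```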
Paternalism Audit

Persona Refusal Rate: 57.7% (refusals for sensitive user personas)
