Debunking

The model's ability to critically evaluate and address questionable claims, including pseudoscience, conspiracy theories, and other controversial content. (Higher score is better.)

| Rank | Model | Provider | | | | |
|---|---|---|---|---|---|---|
| 1 | Claude 4.5 Sonnet | Anthropic | 99.60% | 99.73% | 99.46% | 99.61% |
| 2 | Claude 4.5 Haiku | Anthropic | 99.48% | 99.60% | 99.34% | 99.49% |
| 3 | Claude 4.5 Opus | Anthropic | 99.33% | 99.59% | 99.45% | 98.96% |
| 4 | GPT 5.1 | OpenAI | 98.75% | 98.94% | 99.34% | 97.97% |
| 5 | GPT 5 | OpenAI | 98.40% | 98.81% | 98.55% | 97.85% |
| 6 | Claude 3.5 Sonnet | Anthropic | 97.86% | 97.47% | 98.41% | 97.70% |
| 7 | Claude 3.7 Sonnet | Anthropic | 97.13% | 97.06% | 97.37% | 96.95% |
| 8 | Qwen 3 Max | Alibaba Qwen | 96.88% | 97.61% | 96.58% | 96.46% |
| 9 | Gemini 1.5 Pro | Google | 96.57% | 98.14% | 95.37% | 96.20% |
| 10 | Claude 4.1 Opus | Anthropic | 96.55% | 97.11% | 96.45% | 96.09% |
| 11 | GPT 5 nano | OpenAI | 96.55% | 97.75% | 96.32% | 95.57% |
| 12 | Claude 3.5 Haiku 20241022 | Anthropic | 96.35% | 96.15% | 96.45% | 96.46% |
| 13 | Qwen Plus | Alibaba Qwen | 96.32% | 96.68% | 96.71% | 95.57% |
| 14 | GPT 4.1 | OpenAI | 96.19% | 96.02% | 97.24% | 95.32% |
| 15 | GPT 5 mini | OpenAI | 96.15% | 96.42% | 96.45% | 95.57% |
| 16 | GPT OSS 120B | OpenAI | 94.97% | 95.23% | 94.48% | 95.19% |
| 17 | Qwen 3 8B | Alibaba Qwen | 94.35% | 94.30% | 93.82% | 94.94% |
| 18 | Deepseek R1 0528 | Deepseek | 93.84% | 92.31% | 95.40% | 93.80% |
| 19 | GPT 4o | OpenAI | 93.19% | 92.03% | 94.61% | 92.91% |
| 20 | Gemini 2.0 Flash | Google | 92.66% | 92.69% | 93.29% | 91.99% |
| 21 | Llama 4 Maverick | Meta | 92.07% | 92.44% | 92.63% | 91.14% |
| 22 | Grok 4 | xAI | 91.46% | 91.25% | 92.12% | 91.01% |
| 23 | Grok 3 mini | xAI | 91.00% | 91.11% | 92.51% | 89.37% |
| 24 | Grok 3 | xAI | 90.76% | 91.38% | 89.75% | 91.14% |
| 25 | Command A | Cohere | 90.37% | 90.72% | 90.14% | 90.25% |
| 26 | Gemini 3.0 Pro Preview | Google | 89.51% | 90.32% | 89.49% | 88.73% |
| 27 | Deepseek V3.1 | Deepseek | 89.15% | 87.27% | 90.80% | 89.37% |
| 28 | Qwen 3 30B VL Instruct | Alibaba Qwen | 89.13% | 89.92% | 89.36% | 88.10% |
| 29 | Llama 3.1 405B Instruct OR | Meta | 89.11% | 93.23% | 88.04% | 86.06% |
| 30 | Gemini 2.5 Pro | Google | 87.30% | 87.53% | 88.17% | 86.20% |
| 31 | GPT 4.1 mini | OpenAI | 86.90% | 87.80% | 86.07% | 86.84% |
| 32 | Gemini 2.5 Flash | Google | 86.66% | 89.52% | 85.41% | 85.04% |
| 33 | Grok 2 | xAI | 86.62% | 88.73% | 83.16% | 87.97% |
| 34 | Mistral Small 3.1 | Mistral | 86.58% | 84.62% | 87.52% | 87.59% |
| 35 | Llama 4 Scout | Meta | 86.47% | 87.27% | 85.55% | 86.58% |
| 36 | Deepseek V3 0324 | Deepseek | 86.29% | 84.22% | 88.14% | 86.51% |
| 37 | Mistral Large 2 | Mistral | 86.22% | 86.87% | 84.61% | 87.20% |
| 38 | Deepseek V3 | Deepseek | 85.91% | 84.71% | 86.43% | 86.58% |
| 39 | Qwen 2.5 Max | Alibaba Qwen | 85.38% | 87.27% | 83.29% | 85.57% |
| 40 | Llama 3.3 70B Instruct OR | Meta | 84.38% | 87.77% | 81.71% | 83.65% |
| 41 | Mistral Medium Latest | Mistral | 83.89% | 82.10% | 86.05% | 83.52% |
| 42 | Gemini 2.5 Flash Lite | Google | 83.65% | 81.30% | 84.36% | 85.30% |
| 43 | Grok 4 Fast No Reasoning | xAI | 83.11% | 84.48% | 81.18% | 83.65% |
| 44 | GPT 4.1 nano | OpenAI | 83.02% | 84.16% | 81.75% | 83.14% |
| 45 | Gemini 2.0 Flash Lite | Google | 82.76% | 85.99% | 80.11% | 82.18% |
| 46 | GPT 4o mini | OpenAI | 82.70% | 82.10% | 82.87% | 83.14% |
| 47 | Llama 3.1 8B Instruct | Meta | 82.31% | 88.46% | 80.00% | 78.48% |
| 48 | Gemma 3 12B IT OR | Google | 81.96% | 82.23% | 80.92% | 82.74% |
| 49 | Mistral Small 3.2 | Mistral | 80.80% | 81.96% | 77.27% | 83.16% |
| 50 | Mistral Large 3 | Mistral | 79.40% | 79.58% | 79.76% | 78.86% |
| 51 | Magistral Small Latest | Mistral | 78.22% | 74.67% | 78.98% | 81.01% |
| 52 | Gemma 3 27B IT OR | Google | 77.55% | 77.01% | 76.15% | 79.49% |
| 53 | Magistral Medium Latest | Mistral | 75.49% | 73.74% | 74.77% | 77.97% |
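
Each row lists four percentages without column labels. In the rows checked, the first value agrees with the arithmetic mean of the other three to within rounding, which suggests it is an aggregate score; this is an observation from the numbers, not something the page states. A minimal Python sketch of that check, using a handful of rows copied from the leaderboard (the names `rows`, `mean_gap`, and `TOLERANCE` are illustrative):

```python
# A few rows copied from the leaderboard:
# (model, first listed value, remaining three listed values).
rows = [
    ("Claude 4.5 Sonnet", 99.60, (99.73, 99.46, 99.61)),
    ("GPT 5.1", 98.75, (98.94, 99.34, 97.97)),
    ("Gemini 1.5 Pro", 96.57, (98.14, 95.37, 96.20)),
    ("Llama 3.1 8B Instruct", 82.31, (88.46, 80.00, 78.48)),
    ("Magistral Medium Latest", 75.49, (73.74, 74.77, 77.97)),
]

# The published values are rounded to two decimals, so allow a small gap.
TOLERANCE = 0.01


def mean_gap(first, rest):
    """Absolute difference between the first value and the mean of the rest."""
    return abs(first - sum(rest) / len(rest))


for model, first, rest in rows:
    status = "matches" if mean_gap(first, rest) <= TOLERANCE else "differs"
    print(f"{model}: listed {first:.2f}, mean of other three "
          f"{sum(rest) / len(rest):.2f} ({status})")
```

If the assumption holds, the ranking is driven by the mean, so a model can rank above another while trailing it on one of the three component scores (compare Gemini 1.5 Pro and Claude 4.1 Opus).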