Debunking
Measures the model's ability to critically evaluate and address questionable claims, including pseudoscience, conspiracy theories, and other controversial content. Higher scores are better.
| Rank | Model | Provider | Average | | | |
|---|---|---|---|---|---|---|
| #1 | Claude 4.5 Sonnet | Anthropic | 99.60% | 99.73% | 99.46% | 99.61% |
| #2 | Claude 4.5 Haiku | Anthropic | 99.48% | 99.60% | 99.34% | 99.49% |
| #3 | Claude 4.5 Opus | Anthropic | 99.33% | 99.59% | 99.45% | 98.96% |
| #4 | GPT 5.1 | OpenAI | 98.75% | 98.94% | 99.34% | 97.97% |
| #5 | GPT 5 | OpenAI | 98.40% | 98.81% | 98.55% | 97.85% |
| #6 | Claude 3.5 Sonnet | Anthropic | 97.86% | 97.47% | 98.41% | 97.70% |
| #7 | Claude 3.7 Sonnet | Anthropic | 97.13% | 97.06% | 97.37% | 96.95% |
| #8 | Qwen 3 Max | Alibaba Qwen | 96.88% | 97.61% | 96.58% | 96.46% |
| #9 | Gemini 1.5 Pro | Google | 96.57% | 98.14% | 95.37% | 96.20% |
| #10 | Claude 4.1 Opus | Anthropic | 96.55% | 97.11% | 96.45% | 96.09% |
| #11 | GPT 5 nano | OpenAI | 96.55% | 97.75% | 96.32% | 95.57% |
| #12 | Claude 3.5 Haiku 20241022 | Anthropic | 96.35% | 96.15% | 96.45% | 96.46% |
| #13 | Qwen Plus | Alibaba Qwen | 96.32% | 96.68% | 96.71% | 95.57% |
| #14 | GPT 4.1 | OpenAI | 96.19% | 96.02% | 97.24% | 95.32% |
| #15 | GPT 5 mini | OpenAI | 96.15% | 96.42% | 96.45% | 95.57% |
| #16 | GPT OSS 120B | OpenAI | 94.97% | 95.23% | 94.48% | 95.19% |
| #17 | Qwen 3 8B | Alibaba Qwen | 94.35% | 94.30% | 93.82% | 94.94% |
| #18 | Deepseek R1 0528 | Deepseek | 93.84% | 92.31% | 95.40% | 93.80% |
| #19 | GPT 4o | OpenAI | 93.19% | 92.03% | 94.61% | 92.91% |
| #20 | Gemini 2.0 Flash | Google | 92.66% | 92.69% | 93.29% | 91.99% |
| #21 | Llama 4 Maverick | Meta | 92.07% | 92.44% | 92.63% | 91.14% |
| #22 | Grok 4 | xAI | 91.46% | 91.25% | 92.12% | 91.01% |
| #23 | Grok 3 mini | xAI | 91.00% | 91.11% | 92.51% | 89.37% |
| #24 | Grok 3 | xAI | 90.76% | 91.38% | 89.75% | 91.14% |
| #25 | Command A | Cohere | 90.37% | 90.72% | 90.14% | 90.25% |
| #26 | Gemini 3.0 Pro Preview | Google | 89.51% | 90.32% | 89.49% | 88.73% |
| #27 | Deepseek V3.1 | Deepseek | 89.15% | 87.27% | 90.80% | 89.37% |
| #28 | Qwen 3 30B VL Instruct | Alibaba Qwen | 89.13% | 89.92% | 89.36% | 88.10% |
| #29 | Llama 3.1 405B Instruct OR | Meta | 89.11% | 93.23% | 88.04% | 86.06% |
| #30 | Gemini 2.5 Pro | Google | 87.30% | 87.53% | 88.17% | 86.20% |
| #31 | GPT 4.1 mini | OpenAI | 86.90% | 87.80% | 86.07% | 86.84% |
| #32 | Gemini 2.5 Flash | Google | 86.66% | 89.52% | 85.41% | 85.04% |
| #33 | Grok 2 | xAI | 86.62% | 88.73% | 83.16% | 87.97% |
| #34 | Mistral Small 3.1 | Mistral | 86.58% | 84.62% | 87.52% | 87.59% |
| #35 | Llama 4 Scout | Meta | 86.47% | 87.27% | 85.55% | 86.58% |
| #36 | Deepseek V3 0324 | Deepseek | 86.29% | 84.22% | 88.14% | 86.51% |
| #37 | Mistral Large 2 | Mistral | 86.22% | 86.87% | 84.61% | 87.20% |
| #38 | Deepseek V3 | Deepseek | 85.91% | 84.71% | 86.43% | 86.58% |
| #39 | Qwen 2.5 Max | Alibaba Qwen | 85.38% | 87.27% | 83.29% | 85.57% |
| #40 | Llama 3.3 70B Instruct OR | Meta | 84.38% | 87.77% | 81.71% | 83.65% |
| #41 | Mistral Medium Latest | Mistral | 83.89% | 82.10% | 86.05% | 83.52% |
| #42 | Gemini 2.5 Flash Lite | Google | 83.65% | 81.30% | 84.36% | 85.30% |
| #43 | Grok 4 Fast No Reasoning | xAI | 83.11% | 84.48% | 81.18% | 83.65% |
| #44 | GPT 4.1 nano | OpenAI | 83.02% | 84.16% | 81.75% | 83.14% |
| #45 | Gemini 2.0 Flash Lite | Google | 82.76% | 85.99% | 80.11% | 82.18% |
| #46 | GPT 4o mini | OpenAI | 82.70% | 82.10% | 82.87% | 83.14% |
| #47 | Llama 3.1 8B Instruct | Meta | 82.31% | 88.46% | 80.00% | 78.48% |
| #48 | Gemma 3 12B IT OR | Google | 81.96% | 82.23% | 80.92% | 82.74% |
| #49 | Mistral Small 3.2 | Mistral | 80.80% | 81.96% | 77.27% | 83.16% |
| #50 | Mistral Large 3 | Mistral | 79.40% | 79.58% | 79.76% | 78.86% |
| #51 | Magistral Small Latest | Mistral | 78.22% | 74.67% | 78.98% | 81.01% |
| #52 | Gemma 3 27B IT OR | Google | 77.55% | 77.01% | 76.15% | 79.49% |
| #53 | Magistral Medium Latest | Mistral | 75.49% | 73.74% | 74.77% | 77.97% |
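
The score columns are not labeled in the table, but the first one appears to be the unweighted mean of the three that follow it, and the ranking follows that first column. The sketch below checks this reading on a few rows copied from the table. The function names and the rounding tolerance are illustrative assumptions, not part of the benchmark.

```python
# Rows copied from the table above:
# (model, first score column, the three remaining score columns).
rows = [
    ("Claude 4.5 Sonnet", 99.60, (99.73, 99.46, 99.61)),
    ("Claude 4.5 Haiku", 99.48, (99.60, 99.34, 99.49)),
    ("GPT 5.1", 98.75, (98.94, 99.34, 97.97)),
    ("Llama 3.1 405B Instruct OR", 89.11, (93.23, 88.04, 86.06)),
]


def mean_of_subscores(subscores):
    """Unweighted mean of the sub-score columns (assumed aggregation)."""
    return sum(subscores) / len(subscores)


def matches_reported(reported, subscores, tolerance=0.011):
    """True if the mean agrees with the reported value.

    The tolerance allows for every published figure having been
    rounded to two decimal places.
    """
    return abs(mean_of_subscores(subscores) - reported) <= tolerance


for model, reported, subscores in rows:
    computed = mean_of_subscores(subscores)
    status = "consistent" if matches_reported(reported, subscores) else "differs"
    print(f"{model}: reported {reported:.2f}, computed {computed:.2f} ({status})")
```

If the aggregation were weighted, or if the first column came from a separate evaluation, some rows would fall outside the tolerance. The sampled rows do not.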