Factuality
Measures the model's ability to answer general knowledge questions accurately, drawing on language-specific sources, without fabricating information. (Higher scores are better.)
| Rank | Model | Provider | | | | |
|---|---|---|---|---|---|---|
| #1 | Gemini 3.0 Pro Preview | Google | 83.30% | 88.61% | 84.76% | 76.53% |
| #2 | GPT 5.1 | OpenAI | 78.20% | 85.77% | 78.10% | 70.75% |
| #3 | GPT 5 | OpenAI | 78.16% | 85.77% | 75.24% | 73.47% |
| #4 | Gemini 2.5 Pro | Google | 77.63% | 83.99% | 77.14% | 71.77% |
| #5 | Grok 4 | xAI | 76.85% | 84.70% | 72.38% | 73.47% |
| #6 | GPT 4.1 | OpenAI | 74.75% | 83.63% | 69.52% | 71.09% |
| #7 | Claude 4.5 Opus | Anthropic | 74.70% | 86.48% | 65.71% | 71.92% |
| #8 | Grok 3 | xAI | 74.19% | 81.49% | 67.62% | 73.47% |
| #9 | Claude 3.5 Sonnet | Anthropic | 73.61% | 83.21% | 63.81% | 73.81% |
| #10 | Claude 3.7 Sonnet | Anthropic | 72.89% | 85.05% | 62.86% | 70.75% |
| #11 | Claude 4.1 Opus | Anthropic | 72.03% | 83.99% | 64.76% | 67.35% |
| #12 | GPT 4o | OpenAI | 71.10% | 82.56% | 60.00% | 70.75% |
| #13 | Claude 4.5 Sonnet | Anthropic | 70.04% | 83.63% | 60.95% | 65.53% |
| #14 | Deepseek R1 0528 | Deepseek | 68.37% | 80.00% | 62.86% | 62.24% |
| #15 | Gemini 2.0 Flash | Google | 68.31% | 77.94% | 60.00% | 67.01% |
| #16 | Deepseek V3 0324 | Deepseek | 67.70% | 77.94% | 57.14% | 68.03% |
| #17 | Deepseek V3 | Deepseek | 67.02% | 77.94% | 57.14% | 65.99% |
| #18 | Gemini 1.5 Pro | Google | 66.64% | 79.36% | 53.33% | 67.24% |
| #19 | Mistral Large 3 | Mistral | 66.40% | 77.58% | 59.05% | 62.59% |
| #20 | Gemini 2.5 Flash | Google | 66.34% | 79.36% | 58.10% | 61.56% |
| #21 | Mistral Large 2 | Mistral | 65.02% | 79.36% | 51.43% | 64.29% |
| #22 | Qwen 3 Max | Alibaba Qwen | 64.69% | 77.94% | 56.19% | 59.93% |
| #23 | Mistral Medium Latest | Mistral | 63.91% | 77.22% | 52.38% | 62.12% |
| #24 | Grok 3 mini | xAI | 63.72% | 76.87% | 52.38% | 61.90% |
| #25 | GPT 5 mini | OpenAI | 63.04% | 78.29% | 48.57% | 62.24% |
| #26 | Qwen 2.5 Max | Alibaba Qwen | 62.92% | 77.58% | 50.96% | 60.20% |
| #27 | Deepseek V3.1 | Deepseek | 62.07% | 77.22% | 50.48% | 58.50% |
| #28 | Llama 4 Maverick | Meta | 61.52% | 70.82% | 55.24% | 58.50% |
| #29 | Command A | Cohere | 60.88% | 72.24% | 49.52% | 60.88% |
| #30 | Llama 3.3 70B Instruct OR | Meta | 60.34% | 73.67% | 49.52% | 57.82% |
| #31 | Gemini 2.0 Flash Lite | Google | 59.93% | 71.68% | 47.62% | 60.48% |
| #32 | Grok 2 | xAI | 59.66% | 78.29% | 42.86% | 57.82% |
| #33 | Llama 3.1 405B Instruct OR | Meta | 59.16% | 72.24% | 45.71% | 59.52% |
| #34 | GPT 4.1 mini | OpenAI | 58.58% | 70.11% | 47.62% | 58.02% |
| #35 | Qwen Plus | Alibaba Qwen | 57.75% | 73.31% | 48.57% | 51.36% |
| #36 | Claude 3.5 Haiku 20241022 | Anthropic | 56.80% | 70.82% | 43.81% | 55.78% |
| #37 | Magistral Medium Latest | Mistral | 56.40% | 71.17% | 37.14% | 60.88% |
| #38 | Mistral Small 3.1 | Mistral | 55.86% | 68.33% | 43.81% | 55.44% |
| #39 | Grok 4 Fast No Reasoning | xAI | 55.56% | 70.36% | 38.10% | 58.22% |
| #40 | GPT 4o mini | OpenAI | 54.98% | 70.46% | 39.05% | 55.44% |
| #41 | Gemini 2.5 Flash Lite | Google | 54.67% | 65.84% | 44.76% | 53.40% |
| #42 | Claude 4.5 Haiku | Anthropic | 54.49% | 67.62% | 43.81% | 52.04% |
| #43 | GPT 5 nano | OpenAI | 53.15% | 66.19% | 37.14% | 56.12% |
| #44 | Mistral Small 3.2 | Mistral | 52.58% | 67.62% | 38.10% | 52.04% |
| #45 | Magistral Small Latest | Mistral | 51.91% | 63.70% | 40.00% | 52.04% |
| #46 | GPT OSS 120B | OpenAI | 51.61% | 64.41% | 43.81% | 46.60% |
| #47 | Gemma 3 27B IT OR | Google | 51.01% | 65.48% | 40.95% | 46.60% |
| #48 | Llama 4 Scout | Meta | 46.22% | 58.72% | 35.24% | 44.71% |
| #49 | GPT 4.1 nano | OpenAI | 45.46% | 61.57% | 35.24% | 39.59% |
| #50 | Qwen 3 30B VL Instruct | Alibaba Qwen | 44.86% | 60.71% | 32.38% | 41.50% |
| #51 | Gemma 3 12B IT OR | Google | 38.39% | 52.67% | 26.67% | 35.84% |
| #52 | Llama 3.1 8B Instruct | Meta | 34.97% | 48.75% | 24.76% | 31.40% |
| #53 | Qwen 3 8B | Alibaba Qwen | 31.84% | 43.06% | 22.86% | 29.59% |
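The four score columns carry no labels in the table as given. Judging from the numbers alone, the first score column looks like the unweighted mean of the three columns after it, and the ranking follows that first column. The sketch below checks this reading on a few rows copied from the table. The helper names `parse_row` and `max_mean_gap` are illustrative, and the interpretation of the columns is an assumption, not something the source states.

```python
from statistics import mean

# Rows copied verbatim from the table above.
SAMPLE = [
    "| #1 | Gemini 3.0 Pro Preview | Google | 83.30% | 88.61% | 84.76% | 76.53% |",
    "| #3 | GPT 5 | OpenAI | 78.16% | 85.77% | 75.24% | 73.47% |",
    "| #7 | Claude 4.5 Opus | Anthropic | 74.70% | 86.48% | 65.71% | 71.92% |",
    "| #53 | Qwen 3 8B | Alibaba Qwen | 31.84% | 43.06% | 22.86% | 29.59% |",
]


def parse_row(line):
    """Split one table row into (model, first score, remaining scores)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # Cells 0-2 are rank, model, provider; the rest are percentages.
    scores = [float(c.rstrip("%")) for c in cells[3:]]
    return cells[1], scores[0], scores[1:]


def max_mean_gap(lines):
    """Largest absolute gap between the first score and the mean of the rest."""
    gaps = []
    for line in lines:
        _, first, rest = parse_row(line)
        gaps.append(abs(first - mean(rest)))
    return max(gaps)


# Assumption: a gap within rounding error (about 0.01 percentage points)
# is consistent with the first column being a plain average.
print(f"largest gap: {max_mean_gap(SAMPLE):.3f} percentage points")
```

A gap of roughly 0.01 percentage points or less on every row is consistent with the first column being a plain average of values rounded to two decimals. A larger gap on any row would suggest a weighted score or a transcription error.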