Tools Reliability

Measures the model's ability to use tools effectively and robustly across a range of scenarios, including imperfect inputs such as missing data, extra arguments, or malformed requests. (Higher score is better.)
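To make the three kinds of imperfect input concrete, here is a minimal sketch of how a tool call's arguments might be sorted into "missing", "extra", and "malformed". It is one possible reading of those terms, not the benchmark's actual harness. The `get_weather` tool, its parameters, and the `classify_arguments` helper are all hypothetical names invented for illustration.

```python
import json

# Hypothetical tool definition, invented purely for illustration.
TOOL_SPEC = {
    "name": "get_weather",
    "required": {"city"},
    "optional": {"unit"},
}


def classify_arguments(raw_arguments: str, spec: dict = TOOL_SPEC) -> list[str]:
    """Return the problems found in a JSON-encoded argument string."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        # The request cannot even be parsed.
        return ["malformed"]
    if not isinstance(args, dict):
        # Valid JSON, but not an object of named arguments.
        return ["malformed"]

    given = set(args)
    allowed = spec["required"] | spec["optional"]
    # Required parameters that were not supplied.
    problems = [f"missing:{name}" for name in sorted(spec["required"] - given)]
    # Supplied parameters the tool does not define.
    problems += [f"extra:{name}" for name in sorted(given - allowed)]
    return problems
```

An empty list means the arguments match the hypothetical schema. A robust model would be expected to ask for the missing value, drop or question the extra one, or recover from the malformed request rather than call the tool blindly.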

Rank | Model | Provider | Scores
#1 | Claude 4.6 Sonnet | Anthropic | 95.54% | 91.09% | 96.77% | 98.77%
#2 | Claude 4.6 Opus | Anthropic | 94.34% | 90.62% | 94.73% | 97.67%
#3 | Gemini 3.1 Pro Preview | Google | 94.17% | 91.33% | 94.73% | 96.44%
#4 | Claude 4.5 Opus | Anthropic | 93.07% | 90.86% | 95.07% | 93.29%
#5 | Claude 4.5 Sonnet | Anthropic | 92.21% | 88.12% | 94.39% | 94.11%
#6 | Claude 4.1 Opus | Anthropic | 91.18% | 88.12% | 93.37% | 92.05%
#7 | Kimi K2.5 | Moonshot AI | 90.25% | 88.95% | 90.97% | 90.82%
#8 | Claude 3.5 Sonnet | Anthropic | 89.92% | 88.24% | 88.78% | 92.74%
#9 | Grok 4 | xAI | 89.81% | 85.15% | 91.67% | 92.60%
#10 | Gemini 3.0 Pro Preview | Google | 89.11% | 86.86% | 89.62% | 90.86%
#11 | Claude 4.5 Haiku | Anthropic | 84.76% | 79.33% | 86.05% | 88.90%
#12 | GPT 4.1 | OpenAI | 84.47% | 84.80% | 82.31% | 86.30%
#13 | Grok 3 | xAI | 83.15% | 79.10% | 81.46% | 88.90%
#14 | GPT 5.1 | OpenAI | 82.76% | 79.93% | 80.27% | 88.08%
#15 | Mistral Large 3 | Mistral | 82.64% | 79.93% | 80.44% | 87.53%
#16 | Claude 3.5 Haiku 20241022 | Anthropic | 82.59% | 78.03% | 81.80% | 87.95%
#17 | Mistral Large 2 | Mistral | 82.50% | 77.08% | 81.80% | 88.63%
#18 | Mistral Medium Latest | Mistral | 82.39% | 78.95% | 78.23% | 90.00%
#19 | GPT 4.1 mini | OpenAI | 82.13% | 82.30% | 77.38% | 86.71%
#20 | GPT 4o | OpenAI | 81.68% | 80.05% | 79.93% | 85.07%
#21 | GPT 5 mini | OpenAI | 81.58% | 79.45% | 79.93% | 85.34%
#22 | Claude 3.7 Sonnet | Anthropic | 81.15% | 79.81% | 79.93% | 83.70%
#23 | Grok 2 | xAI | 80.95% | 79.69% | 83.16% | 80.00%
#24 | GPT 5.2 | OpenAI | 80.83% | 77.79% | 79.76% | 84.93%
#25 | Deepseek V3 | Deepseek | 80.17% | 79.10% | 79.08% | 82.33%
#26 | GPT 4o mini | OpenAI | 79.77% | 77.67% | 80.27% | 81.37%
#27 | GPT OSS 120B | OpenAI | 79.34% | 76.01% | 78.02% | 83.97%
#28 | Mistral Small 3.1 | Mistral | 78.84% | 74.11% | 75.85% | 86.58%
#29 | Grok 4 Fast No Reasoning | xAI | 77.87% | 73.04% | 77.55% | 83.01%
#30 | Mistral Small 3.2 | Mistral | 77.59% | 73.16% | 78.23% | 81.37%
#31 | GPT 5 | OpenAI | 77.09% | 72.57% | 77.21% | 81.51%
#32 | Deepseek R1 0528 | Deepseek | 76.45% | 72.49% | 75.17% | 81.71%
#33 | GPT 5 nano | OpenAI | 75.77% | 71.62% | 70.75% | 84.93%
#34 | Qwen 3 Max | Alibaba Qwen | 74.84% | 69.60% | 76.70% | 78.22%
#35 | Gemini 1.5 Pro | Google | 74.69% | 83.08% | 67.20% | 73.77%
#36 | Qwen 2.5 Max | Alibaba Qwen | 72.98% | 69.12% | 74.49% | 75.34%
#37 | Gemma 3 27B IT OR | Google | 71.43% | 63.18% | 70.41% | 80.68%
#38 | Magistral Small Latest | Mistral | 70.68% | 63.57% | 67.69% | 80.80%
#39 | Qwen Plus | Alibaba Qwen | 70.35% | 65.32% | 70.41% | 75.31%
#40 | Gemini 2.5 Pro | Google | 69.17% | 69.08% | 70.36% | 68.08%
#41 | Command A | Cohere | 68.60% | 66.98% | 64.29% | 74.52%
#42 | Llama 4 Maverick | Meta | 67.31% | 61.77% | 71.77% | 68.39%
#43 | Gemini 2.5 Flash | Google | 67.14% | 65.32% | 66.50% | 69.59%
#44 | Gemma 3 12B IT OR | Google | 66.60% | 59.98% | 63.95% | 75.89%
#45 | Llama 3.3 70B Instruct OR | Meta | 66.35% | 65.40% | 63.46% | 70.19%
#46 | GPT 4.1 nano | OpenAI | 65.88% | 61.76% | 61.22% | 74.66%
#47 | Magistral Medium Latest | Mistral | 64.13% | 56.41% | 66.67% | 69.32%
#48 | Gemini 2.0 Flash | Google | 62.97% | 65.48% | 60.27% | 63.15%
#49 | Gemini 2.5 Flash Lite | Google | 61.88% | 59.86% | 61.39% | 64.38%
#50 | Qwen 3 8B | Alibaba Qwen | 61.26% | 54.51% | 63.10% | 66.16%
#51 | Qwen 3 30B VL Instruct | Alibaba Qwen | 57.88% | 52.26% | 59.18% | 62.19%
#52 | Grok 3 mini | xAI | 56.60% | 56.18% | 58.84% | 54.79%
#53 | Gemini 2.0 Flash Lite | Google | 51.52% | 51.31% | 49.83% | 53.42%
#54 | Llama 3.1 405B Instruct OR | Meta | 47.75% | 51.87% | 45.01% | 46.36%
#55 | Deepseek V3 0324 | Deepseek | 40.64% | 35.83% | 40.51% | 45.58%
#56 | Deepseek V3.1 | Deepseek | 35.08% | 43.59% | 31.80% | 29.86%
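
The four percentages listed for each model are not labeled here. A quick arithmetic check suggests that the first one is the unweighted mean of the three that follow. This is an inference from the numbers, not something the listing states. The sketch below tests that reading on five rows copied from the leaderboard; the helper names are hypothetical.

```python
# Rows copied from the leaderboard: (model, first score, remaining three scores).
ROWS = [
    ("Claude 4.6 Sonnet", 95.54, (91.09, 96.77, 98.77)),
    ("Gemini 3.1 Pro Preview", 94.17, (91.33, 94.73, 96.44)),
    ("GPT 4.1", 84.47, (84.80, 82.31, 86.30)),
    ("Gemini 1.5 Pro", 74.69, (83.08, 67.20, 73.77)),
    ("Deepseek V3.1", 35.08, (43.59, 31.80, 29.86)),
]


def mean_gap(first, rest):
    """Absolute gap between the first score and the mean of the others."""
    return abs(first - sum(rest) / len(rest))


def consistent_with_mean(rows, tolerance=0.01):
    """True if every row's first score is within `tolerance` of the mean."""
    return all(mean_gap(first, rest) <= tolerance for _, first, rest in rows)


if __name__ == "__main__":
    for model, first, rest in ROWS:
        print(f"{model}: gap = {mean_gap(first, rest):.3f}")
    print("Consistent with an unweighted mean:", consistent_with_mean(ROWS))
```

All five rows agree to within 0.01 percentage points. The small residual gaps are what one would expect if the published sub-scores were rounded before display, though that too is an assumption.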

Note: Llama 4 Scout and Llama 3.1 8B Instruct are excluded because the Azure AI API does not support tool calling for these models.