Tools Reliability

Measures a model's ability to use tools effectively and robustly across a range of scenarios, including imperfect inputs such as missing data, extra arguments, or malformed requests. (A higher score is better.)
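
As a rough illustration of the three kinds of imperfect input named above, the sketch below labels the arguments of a tool call as malformed, missing a required field, or carrying an extra field. The `get_weather` tool, its fields, and the `classify_call` helper are hypothetical, chosen only for this example; this is not the benchmark's actual harness or scoring method.

```python
import json

# Hypothetical tool definition, used only for illustration.
TOOL_SPEC = {
    "name": "get_weather",
    "required": {"city": str},
    "optional": {"units": str},
}


def classify_call(raw_arguments: str, spec: dict = TOOL_SPEC) -> str:
    """Label a tool call's JSON arguments as ok, malformed, missing, or extra."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return "malformed"  # not valid JSON at all
    if not isinstance(args, dict):
        return "malformed"  # valid JSON, but not an object of named arguments

    allowed = {**spec["required"], **spec["optional"]}
    missing = sorted(k for k in spec["required"] if k not in args)
    if missing:
        return "missing: " + ", ".join(missing)
    extra = sorted(k for k in args if k not in allowed)
    if extra:
        return "extra: " + ", ".join(extra)
    # Every remaining key is known; a wrong value type counts as malformed.
    if any(not isinstance(v, allowed[k]) for k, v in args.items()):
        return "malformed"
    return "ok"


if __name__ == "__main__":
    for raw in ['{"city": "Paris"}', '{"units": "metric"}',
                '{"city": "Paris", "zip": "75001"}', '{"city": ']:
        print(raw, "->", classify_call(raw))
```

A robust tool user would be expected to recover from each non-ok case, for example by asking for the missing value or dropping the unknown argument, rather than failing outright.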

| Rank | Model | Provider | Score | | | |
|---|---|---|---|---|---|---|
| 1 | Claude 4.5 Opus | Anthropic | 93.07% | 90.86% | 95.07% | 93.29% |
| 2 | Claude 4.5 Sonnet | Anthropic | 92.21% | 88.12% | 94.39% | 94.11% |
| 3 | Claude 4.1 Opus | Anthropic | 91.18% | 88.12% | 93.37% | 92.05% |
| 4 | Claude 3.5 Sonnet | Anthropic | 89.92% | 88.24% | 88.78% | 92.74% |
| 5 | Grok 4 | xAI | 89.81% | 85.15% | 91.67% | 92.60% |
| 6 | Gemini 3.0 Pro Preview | Google | 89.11% | 86.86% | 89.62% | 90.86% |
| 7 | Claude 4.5 Haiku | Anthropic | 84.76% | 79.33% | 86.05% | 88.90% |
| 8 | GPT 4.1 | OpenAI | 84.47% | 84.80% | 82.31% | 86.30% |
| 9 | Grok 3 | xAI | 83.15% | 79.10% | 81.46% | 88.90% |
| 10 | GPT 5.1 | OpenAI | 82.76% | 79.93% | 80.27% | 88.08% |
| 11 | Mistral Large 3 | Mistral | 82.64% | 79.93% | 80.44% | 87.53% |
| 12 | Claude 3.5 Haiku 20241022 | Anthropic | 82.59% | 78.03% | 81.80% | 87.95% |
| 13 | Mistral Large 2 | Mistral | 82.50% | 77.08% | 81.80% | 88.63% |
| 14 | Mistral Medium Latest | Mistral | 82.39% | 78.95% | 78.23% | 90.00% |
| 15 | GPT 4.1 mini | OpenAI | 82.13% | 82.30% | 77.38% | 86.71% |
| 16 | GPT 4o | OpenAI | 81.68% | 80.05% | 79.93% | 85.07% |
| 17 | GPT 5 mini | OpenAI | 81.58% | 79.45% | 79.93% | 85.34% |
| 18 | Claude 3.7 Sonnet | Anthropic | 81.15% | 79.81% | 79.93% | 83.70% |
| 19 | Grok 2 | xAI | 80.95% | 79.69% | 83.16% | 80.00% |
| 20 | Deepseek V3 | Deepseek | 80.17% | 79.10% | 79.08% | 82.33% |
| 21 | GPT 4o mini | OpenAI | 79.77% | 77.67% | 80.27% | 81.37% |
| 22 | GPT OSS 120B | OpenAI | 79.34% | 76.01% | 78.02% | 83.97% |
| 23 | Mistral Small 3.1 | Mistral | 78.84% | 74.11% | 75.85% | 86.58% |
| 24 | Grok 4 Fast No Reasoning | xAI | 77.87% | 73.04% | 77.55% | 83.01% |
| 25 | Mistral Small 3.2 | Mistral | 77.59% | 73.16% | 78.23% | 81.37% |
| 26 | GPT 5 | OpenAI | 77.09% | 72.57% | 77.21% | 81.51% |
| 27 | Deepseek R1 0528 | Deepseek | 76.45% | 72.49% | 75.17% | 81.71% |
| 28 | GPT 5 nano | OpenAI | 75.77% | 71.62% | 70.75% | 84.93% |
| 29 | Qwen 3 Max | Alibaba Qwen | 74.84% | 69.60% | 76.70% | 78.22% |
| 30 | Gemini 1.5 Pro | Google | 74.69% | 83.08% | 67.20% | 73.77% |
| 31 | Qwen 2.5 Max | Alibaba Qwen | 72.98% | 69.12% | 74.49% | 75.34% |
| 32 | Gemma 3 27B IT OR | Google | 71.43% | 63.18% | 70.41% | 80.68% |
| 33 | Magistral Small Latest | Mistral | 70.68% | 63.57% | 67.69% | 80.80% |
| 34 | Qwen Plus | Alibaba Qwen | 70.35% | 65.32% | 70.41% | 75.31% |
| 35 | Gemini 2.5 Pro | Google | 69.17% | 69.08% | 70.36% | 68.08% |
| 36 | Command A | Cohere | 68.60% | 66.98% | 64.29% | 74.52% |
| 37 | Llama 4 Maverick | Meta | 67.31% | 61.77% | 71.77% | 68.39% |
| 38 | Gemini 2.5 Flash | Google | 67.14% | 65.32% | 66.50% | 69.59% |
| 39 | Gemma 3 12B IT OR | Google | 66.60% | 59.98% | 63.95% | 75.89% |
| 40 | Llama 3.3 70B Instruct OR | Meta | 66.35% | 65.40% | 63.46% | 70.19% |
| 41 | GPT 4.1 nano | OpenAI | 65.88% | 61.76% | 61.22% | 74.66% |
| 42 | Magistral Medium Latest | Mistral | 64.13% | 56.41% | 66.67% | 69.32% |
| 43 | Gemini 2.0 Flash | Google | 62.97% | 65.48% | 60.27% | 63.15% |
| 44 | Gemini 2.5 Flash Lite | Google | 61.88% | 59.86% | 61.39% | 64.38% |
| 45 | Qwen 3 8B | Alibaba Qwen | 61.26% | 54.51% | 63.10% | 66.16% |
| 46 | Qwen 3 30B VL Instruct | Alibaba Qwen | 57.88% | 52.26% | 59.18% | 62.19% |
| 47 | Grok 3 mini | xAI | 56.60% | 56.18% | 58.84% | 54.79% |
| 48 | Gemini 2.0 Flash Lite | Google | 51.52% | 51.31% | 49.83% | 53.42% |
| 49 | Llama 3.1 405B Instruct OR | Meta | 47.75% | 51.87% | 45.01% | 46.36% |
| 50 | Deepseek V3 0324 | Deepseek | 40.64% | 35.83% | 40.51% | 45.58% |
| 51 | Deepseek V3.1 | Deepseek | 35.08% | 43.59% | 31.80% | 29.86% |
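
Each model above is listed with four percentages. In the rows checked here, the first percentage agrees, to within rounding, with the arithmetic mean of the other three. The source does not state this, so treat it as an observation rather than the documented scoring rule. The helper names and the 0.01 tolerance in this sketch are assumptions made for illustration.

```python
# Rows copied from the leaderboard: (model, first score, remaining three scores).
ROWS = [
    ("Claude 4.5 Opus", 93.07, (90.86, 95.07, 93.29)),
    ("GPT 4.1", 84.47, (84.80, 82.31, 86.30)),
    ("Deepseek V3.1", 35.08, (43.59, 31.80, 29.86)),
]


def mean_of(parts):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(parts) / len(parts)


def matches_mean(first, parts, tol=0.01):
    """True if `first` equals the mean of `parts` within `tol` percentage points."""
    return abs(first - mean_of(parts)) <= tol


if __name__ == "__main__":
    for model, first, parts in ROWS:
        print(f"{model}: listed {first:.2f}, mean of the rest {mean_of(parts):.2f}")
```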

Note: Llama 4 Scout and Llama 3.1 8B Instruct are excluded because the Azure AI API does not support tool calling for these models.