Model Comparisons
LLM comparisons backed by real evaluation data
Every comparison is powered by PeerLM's blind evaluation methodology. No opinions, no vibes — just data from anonymized, head-to-head testing.
Blind Testing
Models are anonymized. Evaluators never know which model produced which response.
Multi-Criteria Scoring
Each response is scored across weighted criteria specific to the use case; a minimal sketch of this weighting appears after this list.
Real Prompts
Comparisons use realistic prompts and system instructions, not synthetic benchmarks.
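Taken together, the methodology pairs anonymized head-to-head responses with weighted per-criterion scores. The sketch below is illustrative only: the criteria names, weights, and 0-to-10 scale are assumptions for the example, not PeerLM's actual configuration or implementation.

```python
# Hypothetical sketch of blind pairing plus weighted multi-criteria scoring.
# Criteria, weights, and the 0-10 scale are assumed values, not PeerLM's.
import random

CRITERIA_WEIGHTS = {
    "correctness": 0.5,
    "instruction_following": 0.3,
    "code_quality": 0.2,
}

def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10 each) into one weighted score."""
    return sum(CRITERIA_WEIGHTS[c] * s for c, s in criterion_scores.items())

def blind_pair(model_a: str, model_b: str) -> dict[str, str]:
    """Anonymize a head-to-head pair: evaluators only ever see the labels."""
    labels = ["Response 1", "Response 2"]
    random.shuffle(labels)
    return {labels[0]: model_a, labels[1]: model_b}

# One evaluator scores the two anonymized responses on each criterion.
mapping = blind_pair("model_a", "model_b")
ratings = {
    "Response 1": {"correctness": 9, "instruction_following": 8, "code_quality": 9},
    "Response 2": {"correctness": 4, "instruction_following": 6, "code_quality": 5},
}
for label, scores in ratings.items():
    print(mapping[label], round(weighted_score(scores), 2))
```

A plausible aggregation, again an assumption rather than a documented detail, is to average each model's weighted scores across all evaluators to produce a single headline number per model.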
Anthropic: Claude Opus 4.6 vs MoonshotAI: Kimi K2.5: Coding Performance with 10 Evaluators
We evaluate Anthropic: Claude Opus 4.6 against MoonshotAI: Kimi K2.5 in a rigorous coding performance test suite scored by 10 evaluators.
Anthropic: Claude Opus 4.6 scored 8.7; MoonshotAI: Kimi K2.5 scored 1.3.
Anthropic: Claude Opus 4.6 vs Qwen: Qwen3.5 397B A17B: Coding Performance with 10 Evaluators
This analysis compares the coding capabilities of Anthropic: Claude Opus 4.6 and Qwen: Qwen3.5 397B A17B, as judged by 10 expert reviewers on accuracy and instruction following.
Anthropic: Claude Opus 4.6 scored 8.3; Qwen: Qwen3.5 397B A17B scored 1.8.
Anthropic: Claude Opus 4.6 vs Mistral: Mistral Large 3 2512: Coding Performance with 10 Evaluators
This comparative analysis evaluates Anthropic: Claude Opus 4.6 against Mistral: Mistral Large 3 2512 in a rigorous coding performance benchmark suite scored by 10 evaluators.
Anthropic: Claude Opus 4.6 scored 9.5; Mistral: Mistral Large 3 2512 scored 0.5.
Anthropic: Claude Opus 4.6 vs Meta: Llama 4 Maverick: Coding Performance with 10 Evaluators
This analysis compares Anthropic: Claude Opus 4.6 and Meta: Llama 4 Maverick on coding performance, as scored by 10 evaluators.
Anthropic: Claude Opus 4.6 scored 10.0; Meta: Llama 4 Maverick scored 0.0.
Anthropic: Claude Opus 4.6 vs xAI: Grok 4: Coding Performance with 10 Evaluators
In our latest benchmark, we evaluate Anthropic: Claude Opus 4.6 against xAI: Grok 4 on coding performance, scored by 10 evaluators, to determine which model leads in software engineering tasks.
Anthropic: Claude Opus 4.6 scored 8.9; xAI: Grok 4 scored 1.1.
Anthropic: Claude Opus 4.6 vs DeepSeek: DeepSeek V3.2: Coding Performance with 10 Evaluators
We evaluated Anthropic: Claude Opus 4.6 vs DeepSeek: DeepSeek V3.2 in a rigorous Coding Performance suite using 10 specialized evaluators to determine the current industry leader.
Anthropic: Claude Opus 4.6 scored 8.2; DeepSeek: DeepSeek V3.2 scored 1.8.
Anthropic: Claude Opus 4.6 vs Google: Gemini 3.1 Pro Preview: Coding Performance with 10 Evaluators
We compare Anthropic: Claude Opus 4.6 and Google: Gemini 3.1 Pro Preview in a head-to-head coding performance analysis scored by 10 evaluators.
Anthropic: Claude Opus 4.6 scored 8.2; Google: Gemini 3.1 Pro Preview scored 1.8.
OpenAI: GPT-5.4 vs Z.ai: GLM 5: Coding Performance with 10 Evaluators
We evaluate OpenAI: GPT-5.4 against Z.ai: GLM 5 on coding performance, scored by 10 evaluators, analyzing how each model handles complex development tasks.
OpenAI: GPT-5.4 scored 6.4; Z.ai: GLM 5 scored 3.6.
OpenAI: GPT-5.4 vs MoonshotAI: Kimi K2.5: Coding Performance with 10 Evaluators
A deep dive into the coding capabilities of OpenAI: GPT-5.4 and MoonshotAI: Kimi K2.5, evaluated by 10 expert reviewers.
OpenAI: GPT-5.4 scored 5.5; MoonshotAI: Kimi K2.5 scored 4.5.
OpenAI: GPT-5.4 vs Qwen: Qwen3.5 397B A17B: Coding Performance with 10 Evaluators
This comparison analyzes OpenAI: GPT-5.4 vs Qwen: Qwen3.5 397B A17B across critical software development tasks using PeerLM's Coding Performance with 10 Evaluators suite.
OpenAI: GPT-5.4 scored 7.4; Qwen: Qwen3.5 397B A17B scored 2.6.
OpenAI: GPT-5.4 vs Mistral: Mistral Large 3 2512: Coding Performance with 10 Evaluators
We put OpenAI: GPT-5.4 and Mistral: Mistral Large 3 2512 to the test in a rigorous coding performance benchmark, scored by 10 evaluators, to determine the superior model for development tasks.
OpenAI: GPT-5.4 scored 8.1; Mistral: Mistral Large 3 2512 scored 1.9.
OpenAI: GPT-5.4 vs Meta: Llama 4 Maverick: Coding Performance with 10 Evaluators
This analysis compares OpenAI: GPT-5.4 vs Meta: Llama 4 Maverick, specifically evaluating their coding performance using a rigorous 10-evaluator benchmark suite.
OpenAI: GPT-5.4 scored 9.7; Meta: Llama 4 Maverick scored 0.3.
Need a comparison we haven't covered?
Run your own blind evaluation in minutes. Compare any models, with your prompts, scored on your criteria.
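A custom evaluation of this kind comes down to three inputs: the models to compare, the prompts to run, and the weighted criteria to score against. The structure below is a hypothetical sketch of those inputs only; none of the field names come from PeerLM's actual API or configuration format.

```python
# Hypothetical sketch of the inputs a custom blind evaluation needs.
# Field names and values are illustrative assumptions, not PeerLM's schema.
custom_evaluation = {
    "models": ["model_a", "model_b"],            # any models to compare
    "prompts": [
        "Refactor this function to remove the N+1 query.",
        "Write unit tests for the parser described in the spec.",
    ],
    "criteria": {                                # weights should sum to 1.0
        "correctness": 0.5,
        "instruction_following": 0.3,
        "readability": 0.2,
    },
    "evaluators": 10,                            # number of blind reviewers
}
```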