A comprehensive analysis of how GPT-4, Claude, Gemini, and other large language models perform against Stockfish, Leela Chess Zero, and purpose-built chess AI -- with ELO ratings, benchmarks, and the science behind the gap.
The ELO gap between language models and traditional chess engines remains enormous -- but the story is more nuanced than "LLMs can't play chess."
Placing LLMs, traditional engines, specialized chess transformers, and human players on the same scale reveals the true competitive picture.
| Entity | Type | ELO (Approx.) | Source / Method | Year |
|---|---|---|---|---|
| Stockfish 17.1 | Engine | 3,653 | CCRL 40/15 | 2025 |
| Leela Chess Zero (open-source, AlphaZero-style) | Engine | ~3,500 | TCEC / CCRL | 2025 |
| AlphaZero (Google DeepMind) | Engine | ~3,400+ | vs Stockfish 8 (1,000 games) | 2018 |
| DeepMind Searchless Chess (270M-param transformer) | Specialized | 2,895 | Lichess blitz vs humans | 2024 |
| DeepMind MCTS-MAV (LLM + external search) | Hybrid | GM-level | ICML 2025 paper | 2025 |
| Magnus Carlsen | Human | 2,840 | FIDE Standard | 2025 |
| ChessLLM (NAACL 2025 paper) | Fine-tuned LLM | 1,788 | vs Stockfish (10x sampling) | 2025 |
| gpt-3.5-turbo-instruct (OpenAI) | LLM | ~1,750 | vs calibrated Stockfish | 2023 |
| GPT-4 (OpenAI, chat) | LLM | ~1,370 | vs calibrated Stockfish | 2024 |
| Chess-GPT (50M, research model) | Fine-tuned LLM | ~1,300 | Emergent world model study | 2024 |
| o3 (medium) (OpenAI reasoning) | LLM | ~1,200 | Carlsen estimate (Kaggle) | 2025 |
| Grok 4 (xAI) | LLM | ~800 | Carlsen estimate (Kaggle) | 2025 |
| o3 (low) (OpenAI reasoning) | LLM | 758 | LLM Chess vs Dragon | 2025 |
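The ELO scale is logistic: a rating gap maps directly to an expected score. A minimal sketch of the standard Elo expected-score formula, applied to gaps from the table above:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A (win=1, draw=0.5, loss=0) under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Stockfish 17.1 (~3,653) vs. the Kaggle tournament winner o3 (~1,200):
print(round(elo_expected_score(3653, 1200), 6))  # → 0.999999 (wins essentially every game)

# gpt-3.5-turbo-instruct (~1,750) vs. GPT-4 in chat mode (~1,370):
print(round(elo_expected_score(1750, 1370), 3))  # → 0.899
```

A 400-point gap already means a ~90% expected score, which is why a 1,800+ point gap is described below as a chasm.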
Traditional chess engines and LLMs solve the chess problem using fundamentally different computational strategies. Understanding this reveals why the gap is so large -- and why it may never fully close for general-purpose models.
Core approach: Exhaustive tree search with alpha-beta pruning, paired with an NNUE evaluation function on top of decades of hand-tuned search heuristics.
Modern Stockfish combines classical search with NNUE (Efficiently Updatable Neural Network) position evaluation, achieving the best of both worlds.
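The core search loop can be sketched on a toy game tree (leaves are static evaluations, internal nodes are lists of children). This is only the skeleton; real engines layer iterative deepening, move ordering, transposition tables, and NNUE leaf evaluation on top of it:

```python
from typing import Union

GameTree = Union[int, list]  # leaf = static evaluation, internal node = list of children

def alphabeta(node: GameTree, alpha: float, beta: float, maximizing: bool) -> float:
    """Minimax with alpha-beta pruning over a toy game tree."""
    if isinstance(node, int):          # leaf: return the static evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # beta cutoff: opponent will avoid this line
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:          # alpha cutoff
                break
        return value

# Two-ply tree: the maximizer picks among minimizing nodes; the last subtree
# is cut off early, yet the true minimax value is still found.
tree = [[3, 5], [6, 9], [1, 2]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # → 6
```

Pruning is what makes exhaustive search tractable: whole subtrees are skipped once they provably cannot change the result.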
Core approach: Deep neural network trained via self-play with Monte Carlo Tree Search (MCTS) at inference time.
AlphaZero defeated Stockfish 8 with a score of +155 -6 =839 over 1,000 games (2018), despite searching roughly 1,000x fewer positions per second.
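AlphaZero's MCTS steers its limited search with the network's policy prior via the PUCT rule. A sketch of the child-selection step, assuming each child carries (policy prior P, visit count N, mean value Q); the constant `c_puct` is illustrative, not AlphaZero's tuned value:

```python
import math

def puct_select(children):
    """AlphaZero-style PUCT: pick the child maximizing Q + U, where the
    exploration bonus U favors moves with high policy prior and few visits."""
    c_puct = 1.5  # exploration constant (illustrative)
    total_visits = sum(n for _, n, _ in children)

    def score(child):
        p, n, q = child
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))

# (prior, visits, mean value): the under-visited, high-value third move
# wins selection here over the heavily visited first move.
children = [(0.6, 10, 0.1), (0.3, 2, 0.05), (0.1, 1, 0.4)]
print(puct_select(children))  # → 2
```

Because the prior concentrates search on plausible moves, far fewer positions need visiting than in brute-force alpha-beta.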
Core approach: Next-token prediction on text sequences. Chess moves are just another sequence of characters.
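To see how "moves as text" works in miniature, here is a toy bigram model over SAN move tokens (nothing like a production tokenizer or transformer): it predicts the next move purely from co-occurrence statistics, with no board representation at all:

```python
from collections import Counter, defaultdict

# A handful of opening lines in SAN, treated purely as token sequences.
games = [
    "e4 e5 Nf3 Nc6 Bb5",
    "e4 e5 Nf3 Nc6 Bc4",
    "e4 c5 Nf3 d6",
    "d4 d5 c4 e6",
]

# Count which token follows which: the "model" never sees a board, only text.
follows = defaultdict(Counter)
for game in games:
    moves = game.split()
    for prev, nxt in zip(moves, moves[1:]):
        follows[prev][nxt] += 1

def predict_next(move: str) -> str:
    """Most frequent continuation in 'training' data -- pure pattern completion."""
    return follows[move].most_common(1)[0][0]

print(predict_next("e5"))  # → Nf3
print(predict_next("e4"))  # → e5 (seen twice vs. c5 once)
```

An LLM is vastly more sophisticated than this, but the failure mode is the same in kind: legality is never checked anywhere, so statistically plausible output can be an illegal move.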
Despite these limitations, research shows LLMs do develop emergent internal representations of board state -- linear probes can decode piece positions from hidden activations with 99.2% accuracy.
Core approach: Standard transformer architecture, purpose-trained on chess positions annotated with Stockfish evaluations.
This work (NeurIPS 2024) demonstrates that transformers can encode strong chess knowledge, but only when purpose-built -- not as a side effect of general language training.
From controlled benchmarks to the first-ever AI chess tournament, here's what the data shows.
Google and Kaggle organized the first major LLM chess tournament in August 2025. Eight leading models competed in a single-elimination bracket. Each AI got four attempts per move to produce a legal move; failure meant forfeiture.
| Round | Match | Score | Notable |
|---|---|---|---|
| QF | o3 vs Kimi k2 | 4-0 | All games ended within 8 moves -- Kimi couldn't make legal moves |
| QF | o4-mini vs DeepSeek R1 | 4-0 | DeepSeek struggled with move legality |
| QF | Gemini 2.5 Pro vs Claude 4 Opus | 4-0 | Only match with more checkmates than illegal move forfeits |
| QF | Grok 4 vs Gemini 2.5 Flash | 4-0 | Grok showed the strongest overall play on Day 1 |
| SF | o3 vs o4-mini | 4-0 | o3 dominant throughout |
| SF | Grok 4 vs Gemini 2.5 Pro | Grok advances | Decided on tiebreaks after close play |
| 3rd Place | Gemini 2.5 Pro vs o4-mini | 3.5-0.5 | Gemini takes bronze |
| Final | o3 vs Grok 4 | 4-0 | Grok collapsed -- dropped pieces early in every game |
Key Takeaway: Even the tournament winner (o3) plays at roughly Class D level (~1200 ELO). Many matches were decided by illegal move forfeiture rather than actual chess skill. GM Hikaru Nakamura noted o3 made "very few mistakes" but GM David Howell observed that Grok "crumbled under pressure."
The most comprehensive academic benchmark, testing 50+ models against both a random opponent and Komodo Dragon 1 chess engine.
Key Results vs Random Opponent (30 games each):
| Model | Win Rate | Checkmate Rate |
|---|---|---|
| o3 (medium) | 100.0% | N/A |
| o3 (low) | 96.3% | 92.7% |
| o4-mini (high) | 96.1% | 92.1% |
| o1 (medium) | 91.2% | 82.5% |
| Grok 3 Mini (high) | 86.4% | 72.7% |
| Non-reasoning avg | 0.7% | -- |
Move Quality (o4-mini vs GPT-4.1-mini):
| Metric | o4-mini (medium) | GPT-4.1-mini |
|---|---|---|
| Blunder Rate | 4.2% | 31.3% |
| Mistake Rate | 1.1% | 8.7% |
| Best Moves Found | 19.5% | 4.1% |
Critical Finding: 71.9% of non-reasoning model losses were due to instruction-following failures (unable to format valid moves), not chess knowledge. Reasoning models reduced this to 24.4%.
The percentage of illegal moves is perhaps the most telling metric for understanding the LLM-chess gap:
| Model | Illegal Move Rate | Games with Illegal Moves |
|---|---|---|
| gpt-3.5-turbo-instruct | 0.3% of moves | 16% of games |
| GPT-4 (chat) | 0.66% of moves | 32% of games |
| GPT-4o | 12.7% of moves | High |
| gpt-3.5-turbo (chat) | 50%+ of moves | 93% of games |
| text-davinci-003 | Nearly all | 99% of games |
| Reasoning models (o1/o3) | <1% (with thinking) | <5% of games |
Reasoning models achieve near-perfect legality because they use their "thinking budget" to write out the board state, enumerate candidate moves, verify legality, and self-correct before committing. This is computationally expensive but effective.
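The verify-before-committing loop reduces to plain retry logic around an external legality check. A sketch, where `flaky_model` is a hypothetical stand-in for an LLM call and the legal-move set is supplied by a rules engine; the four-attempt limit mirrors the Kaggle Game Arena forfeiture rule:

```python
import random

def play_move(legal_moves, generate_candidate, max_attempts=4):
    """Ask the model for a move; reject illegal output and retry.
    Returns None (forfeit) if no legal move is produced in max_attempts tries."""
    for _ in range(max_attempts):
        candidate = generate_candidate()
        if candidate in legal_moves:   # external legality check, not the model's
            return candidate
    return None

# Hypothetical stand-in for an LLM that sometimes emits illegal or malformed moves.
legal = {"e4", "d4", "Nf3", "c4"}
rng = random.Random(0)
def flaky_model():
    return rng.choice(["e4", "Ke2", "d4", "O-O-O-O"])

print(play_move(legal, flaky_model))
```

Reasoning models effectively internalize this loop inside their thinking tokens; non-reasoning chat models emit the first candidate unchecked, which is where the forfeits in the tables above come from.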
For historical context, AlphaZero's matches against Stockfish 8 remain the most famous neural network vs. engine competition:
Important distinction: AlphaZero is a purpose-built chess engine with MCTS, not an LLM. It has perfect knowledge of chess rules and an internal board representation. It just uses a neural network for evaluation instead of hand-coded heuristics.
Academic papers, open-source projects, and corporate research pushing the boundaries of what transformers can learn about chess.
Perhaps the most important result bridging LLMs and chess engines. DeepMind trained a 270M-parameter transformer on ChessBench -- 10 million Lichess games annotated with Stockfish 16 evaluations (15 billion data points).
This proves transformers can learn strong chess -- but only when purpose-built and trained on chess-specific annotated data, not as a side effect of general language modeling.
A follow-up that added search back into the transformer equation, achieving even stronger results:
This represents the most promising hybrid approach: a chess-trained transformer augmented with structured search.
A landmark interpretability study showing that even a small (50M parameter) model trained on PGN chess notation develops sophisticated internal representations:
Key Insight: LLMs don't just memorize move sequences -- they develop genuine internal representations of the board. But these representations are fragile and break under perturbation.
Published in January 2025, this paper demonstrated that training on complete game sequences (rather than isolated positions) dramatically improves LLM chess ability:
One of the earliest systematic attempts to make LLMs play chess, presented as a Datasets and Benchmarks paper at NeurIPS 2023:
Multiple 2025 papers investigated whether RL post-training can develop genuine strategic reasoning in LLMs through chess:
One of the most curious findings in LLM chess research is that OpenAI's gpt-3.5-turbo-instruct (released September 2023) plays significantly better chess than newer, more capable models:
Why? The instruct model is a text completion model, not a chat model. RLHF training for chat may actually harm chess ability. As a pure next-token predictor, the instruct model more faithfully simulates the chess games in its training data. Chat-tuned models are optimized for helpful conversation, which conflicts with the narrow task of generating valid chess notation.
The most recent development: Google's Gemini 3 Pro and Gemini 3 Flash claimed the top ELO positions on the Kaggle Game Arena chess leaderboard as of February 2026.
From "LLMs can't play chess at all" to "LLMs play at club level" in three years.
Three distinct categories have emerged, each with different implications for the future of AI and chess.
Stockfish 17.1 (3653 ELO) remains the undisputed champion. It has won every major championship since 2020 and is roughly 800 ELO stronger than the best human who has ever lived. Its combination of NNUE evaluation with classical search is essentially unbeatable. Magnus Carlsen says he has "no chance" against his phone.
DeepMind's purpose-built transformers (2895 ELO searchless; GM-level with MCTS) represent the most exciting frontier. They prove that transformer architecture can encode grandmaster-level chess knowledge. When augmented with MCTS, they approach engine-tier performance with human-scale search budgets. This is the approach most likely to eventually challenge Stockfish.
General-purpose LLMs play at 800-1800 ELO depending on the model and measurement methodology. The best (gpt-3.5-turbo-instruct) plays at strong club level. Reasoning models (o3, Gemini 3) show rapid improvement but still plateau far below expert level. The gap to Stockfish is ~1,800-2,800 ELO points -- an unbridgeable chasm with current architectures.
Almost certainly not as general-purpose models. The architectural limitations are fundamental:
However, specialized chess transformers with search (like DeepMind's MCTS-MAV) may eventually match engines. The key is combining the transformer's learned evaluation with a proper search algorithm -- essentially reinventing AlphaZero with a larger, pre-trained model.
Chess is increasingly used as a benchmark for genuine reasoning (vs. pattern matching) because it requires:
The fact that LLMs perform better at math and coding (where chain-of-thought works well) than at chess (where spatial reasoning and lookahead are essential) suggests that current "reasoning" capabilities are more about sequential logic than spatial-strategic intelligence.
The LLM Chess benchmark found a Pearson r = 0.686 correlation between chess and coding performance -- moderately positive but with significant gaps, confirming chess tests a distinct reasoning capability.
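For reference, the Pearson coefficient quoted above is just covariance normalized by the two standard deviations; a minimal pure-Python implementation, with made-up scores for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative (fabricated) per-model scores: chess win rate vs. coding benchmark.
chess = [0.96, 0.91, 0.86, 0.40, 0.05]
coding = [0.88, 0.90, 0.75, 0.60, 0.35]
print(round(pearson_r(chess, coding), 3))
```

An r of 0.686 means roughly 47% shared variance (r²), so a little under half of the variation in chess skill tracks coding skill, leaving plenty unexplained.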
All primary sources used in this research report, organized by category.