A comprehensive analysis of how GPT-4, Claude, Gemini, and other large language models perform against Stockfish, Leela Chess Zero, and purpose-built chess AI -- with ELO ratings, benchmarks, and the science behind the gap.
The ELO gap between language models and traditional chess engines remains enormous -- but the story is more nuanced than "LLMs can't play chess."
Placing LLMs, traditional engines, specialized chess transformers, and human players on the same scale reveals the true competitive picture.
| Entity | Type | ELO (Approx.) | Source / Method | Year |
|---|---|---|---|---|
| Stockfish 17.1 | Engine | 3,653 | CCRL 40/15 | 2025 |
| Leela Chess Zero (open-source, AlphaZero-style) | Engine | ~3,500 | TCEC / CCRL | 2025 |
| AlphaZero (Google DeepMind) | Engine | ~3,400+ | vs Stockfish 8 (1,000 games) | 2018 |
| DeepMind Searchless Chess (270M-param transformer) | Specialized | 2,895 | Lichess blitz vs humans | 2024 |
| DeepMind MCTS-MAV (LLM + external search) | Hybrid | GM-level | ICML 2025 paper | 2025 |
| Magnus Carlsen | Human | 2,840 | FIDE Standard | 2025 |
| ChessLLM (NAACL 2025 paper) | Fine-tuned LLM | 1,788 | vs Stockfish (10x sampling) | 2025 |
| gpt-3.5-turbo-instruct (OpenAI) | LLM | ~1,750 | vs calibrated Stockfish | 2023 |
| GPT-4 (OpenAI, chat) | LLM | ~1,370 | vs calibrated Stockfish | 2024 |
| Chess-GPT (50M, research model) | Fine-tuned LLM | ~1,300 | Emergent world model study | 2024 |
| o3 (medium) (OpenAI reasoning) | LLM | ~1,200 | Carlsen estimate (Kaggle) | 2025 |
| Grok 4 (xAI) | LLM | ~800 | Carlsen estimate (Kaggle) | 2025 |
| o3 (low) (OpenAI reasoning) | LLM | 758 | LLM Chess vs Dragon | 2025 |
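The ELO scale is logistic: a rating gap maps directly to an expected score. A minimal sketch of the standard Elo expected-score formula, applied to gaps from the table above:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A (win=1, draw=0.5, loss=0) under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Stockfish 17.1 (~3,653) vs. the Kaggle tournament winner o3 (~1,200):
print(round(elo_expected_score(3653, 1200), 6))  # → 0.999999 (wins essentially every game)

# gpt-3.5-turbo-instruct (~1,750) vs. GPT-4 in chat mode (~1,370):
print(round(elo_expected_score(1750, 1370), 3))  # → 0.899
```

A 400-point gap already means a ~90% expected score, which is why a 1,800+ point gap is described below as a chasm.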
Traditional chess engines and LLMs solve the chess problem using fundamentally different computational strategies. Understanding this reveals why the gap is so large -- and why it may never fully close for general-purpose models.
Core approach: Exhaustive tree search with alpha-beta pruning, paired with an NNUE evaluation function on top of decades of hand-tuned search heuristics.
Modern Stockfish combines classical search with NNUE (Efficiently Updatable Neural Network) position evaluation, achieving the best of both worlds.
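The core search loop can be sketched on a toy game tree (leaves are static evaluations, internal nodes are lists of children). This is only the skeleton; real engines layer iterative deepening, move ordering, transposition tables, and NNUE leaf evaluation on top of it:

```python
from typing import Union

GameTree = Union[int, list]  # leaf = static evaluation, internal node = list of children

def alphabeta(node: GameTree, alpha: float, beta: float, maximizing: bool) -> float:
    """Minimax with alpha-beta pruning over a toy game tree."""
    if isinstance(node, int):          # leaf: return the static evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # beta cutoff: opponent will avoid this line
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:          # alpha cutoff
                break
        return value

# Two-ply tree: the maximizer picks among minimizing nodes; the last subtree
# is cut off early, yet the true minimax value is still found.
tree = [[3, 5], [6, 9], [1, 2]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # → 6
```

Pruning is what makes exhaustive search tractable: whole subtrees are skipped once they provably cannot change the result.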
Core approach: Deep neural network trained via self-play with Monte Carlo Tree Search (MCTS) at inference time.
AlphaZero defeated Stockfish 8 with a score of +155 -6 =839 over 1,000 games (2018), despite searching roughly 1,000x fewer positions per second.
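AlphaZero's MCTS steers its limited search with the network's policy prior via the PUCT rule. A sketch of the child-selection step, assuming each child carries (policy prior P, visit count N, mean value Q); the constant `c_puct` is illustrative, not AlphaZero's tuned value:

```python
import math

def puct_select(children):
    """AlphaZero-style PUCT: pick the child maximizing Q + U, where the
    exploration bonus U favors moves with high policy prior and few visits."""
    c_puct = 1.5  # exploration constant (illustrative)
    total_visits = sum(n for _, n, _ in children)

    def score(child):
        p, n, q = child
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))

# (prior, visits, mean value): the under-visited, high-value third move
# wins selection here over the heavily visited first move.
children = [(0.6, 10, 0.1), (0.3, 2, 0.05), (0.1, 1, 0.4)]
print(puct_select(children))  # → 2
```

Because the prior concentrates search on plausible moves, far fewer positions need visiting than in brute-force alpha-beta.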
Core approach: Next-token prediction on text sequences. Chess moves are just another sequence of characters.
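To see how "moves as text" works in miniature, here is a toy bigram model over SAN move tokens (nothing like a production tokenizer or transformer): it predicts the next move purely from co-occurrence statistics, with no board representation at all:

```python
from collections import Counter, defaultdict

# A handful of opening lines in SAN, treated purely as token sequences.
games = [
    "e4 e5 Nf3 Nc6 Bb5",
    "e4 e5 Nf3 Nc6 Bc4",
    "e4 c5 Nf3 d6",
    "d4 d5 c4 e6",
]

# Count which token follows which: the "model" never sees a board, only text.
follows = defaultdict(Counter)
for game in games:
    moves = game.split()
    for prev, nxt in zip(moves, moves[1:]):
        follows[prev][nxt] += 1

def predict_next(move: str) -> str:
    """Most frequent continuation in 'training' data -- pure pattern completion."""
    return follows[move].most_common(1)[0][0]

print(predict_next("e5"))  # → Nf3
print(predict_next("e4"))  # → e5 (seen twice vs. c5 once)
```

An LLM is vastly more sophisticated than this, but the failure mode is the same in kind: legality is never checked anywhere, so statistically plausible output can be an illegal move.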
Despite these limitations, research shows LLMs do develop emergent internal representations of board state -- linear probes can decode piece positions from hidden activations with 99.2% accuracy.
Core approach: Standard transformer architecture, purpose-trained on chess positions annotated with Stockfish evaluations.
This work (NeurIPS 2024) demonstrates that transformers can encode strong chess knowledge, but only when purpose-built -- not as a side effect of general language training.
From controlled benchmarks to the first-ever AI chess tournament, here's what the data shows.
Google and Kaggle organized the first major LLM chess tournament in August 2025. Eight leading models competed in a single-elimination bracket. Each AI got four attempts per move to produce a legal move; failure meant forfeiture.
| Round | Match | Score | Notable |
|---|---|---|---|
| QF | o3 vs Kimi k2 | 4-0 | All games ended within 8 moves -- Kimi couldn't make legal moves |
| QF | o4-mini vs DeepSeek R1 | 4-0 | DeepSeek struggled with move legality |
| QF | Gemini 2.5 Pro vs Claude 4 Opus | 4-0 | Only match with more checkmates than illegal move forfeits |
| QF | Grok 4 vs Gemini 2.5 Flash | 4-0 | Grok showed the strongest overall play on Day 1 |
| SF | o3 vs o4-mini | 4-0 | o3 dominant throughout |
| SF | Grok 4 vs Gemini 2.5 Pro | Grok advances | Decided on tiebreaks after close play |
| 3rd Place | Gemini 2.5 Pro vs o4-mini | 3.5-0.5 | Gemini takes bronze |
| Final | o3 vs Grok 4 | 4-0 | Grok collapsed -- dropped pieces early in every game |
Key Takeaway: Even the tournament winner (o3) plays at roughly Class D level (~1200 ELO). Many matches were decided by illegal move forfeiture rather than actual chess skill. GM Hikaru Nakamura noted o3 made "very few mistakes" but GM David Howell observed that Grok "crumbled under pressure."
The most comprehensive academic benchmark, testing 50+ models against both a random opponent and Komodo Dragon 1 chess engine.
Key Results vs Random Opponent (30 games each):
| Model | Win Rate | Checkmate Rate |
|---|---|---|
| o3 (medium) | 100.0% | N/A |
| o3 (low) | 96.3% | 92.7% |
| o4-mini (high) | 96.1% | 92.1% |
| o1 (medium) | 91.2% | 82.5% |
| Grok 3 Mini (high) | 86.4% | 72.7% |
| Non-reasoning avg | 0.7% | -- |
Move Quality (o4-mini vs GPT-4.1-mini):
| Metric | o4-mini (medium) | GPT-4.1-mini |
|---|---|---|
| Blunder Rate | 4.2% | 31.3% |
| Mistake Rate | 1.1% | 8.7% |
| Best Moves Found | 19.5% | 4.1% |
Critical Finding: 71.9% of non-reasoning model losses were due to instruction-following failures (unable to format valid moves), not chess knowledge. Reasoning models reduced this to 24.4%.
The percentage of illegal moves is perhaps the most telling metric for understanding the LLM-chess gap:
| Model | Illegal Move Rate | Games with Illegal Moves |
|---|---|---|
| gpt-3.5-turbo-instruct | 0.3% of moves | 16% of games |
| GPT-4 (chat) | 0.66% of moves | 32% of games |
| GPT-4o | 12.7% of moves | High |
| gpt-3.5-turbo (chat) | 50%+ of moves | 93% of games |
| text-davinci-003 | Nearly all | 99% of games |
| Reasoning models (o1/o3) | <1% (with thinking) | <5% of games |
Reasoning models achieve near-perfect legality because they use their "thinking budget" to write out the board state, enumerate candidate moves, verify legality, and self-correct before committing. This is computationally expensive but effective.
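The verify-before-committing loop reduces to plain retry logic around an external legality check. A sketch, where `flaky_model` is a hypothetical stand-in for an LLM call and the legal-move set is supplied by a rules engine; the four-attempt limit mirrors the Kaggle Game Arena forfeiture rule:

```python
import random

def play_move(legal_moves, generate_candidate, max_attempts=4):
    """Ask the model for a move; reject illegal output and retry.
    Returns None (forfeit) if no legal move is produced in max_attempts tries."""
    for _ in range(max_attempts):
        candidate = generate_candidate()
        if candidate in legal_moves:   # external legality check, not the model's
            return candidate
    return None

# Hypothetical stand-in for an LLM that sometimes emits illegal or malformed moves.
legal = {"e4", "d4", "Nf3", "c4"}
rng = random.Random(0)
def flaky_model():
    return rng.choice(["e4", "Ke2", "d4", "O-O-O-O"])

print(play_move(legal, flaky_model))
```

Reasoning models effectively internalize this loop inside their thinking tokens; non-reasoning chat models emit the first candidate unchecked, which is where the forfeits in the tables above come from.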
For historical context, AlphaZero's matches against Stockfish 8 remain the most famous neural network vs. engine competition:
Important distinction: AlphaZero is a purpose-built chess engine with MCTS, not an LLM. It has perfect knowledge of chess rules and an internal board representation. It just uses a neural network for evaluation instead of hand-coded heuristics.
Academic papers, open-source projects, and corporate research pushing the boundaries of what transformers can learn about chess.
Perhaps the most important result bridging LLMs and chess engines. DeepMind trained a 270M-parameter transformer on ChessBench -- 10 million Lichess games annotated with Stockfish 16 evaluations (15 billion data points).
This proves transformers can learn strong chess -- but only when purpose-built and trained on chess-specific annotated data, not as a side effect of general language modeling.
A follow-up that added search back into the transformer equation, achieving even stronger results:
This represents the most promising hybrid approach: a chess-trained transformer augmented with structured search.
A landmark interpretability study showing that even a small (50M parameter) model trained on PGN chess notation develops sophisticated internal representations:
Key Insight: LLMs don't just memorize move sequences -- they develop genuine internal representations of the board. But these representations are fragile and break under perturbation.
Published in January 2025, this paper demonstrated that training on complete game sequences (rather than isolated positions) dramatically improves LLM chess ability:
One of the earliest systematic attempts to make LLMs play chess, presented as a Datasets and Benchmarks paper at NeurIPS 2023:
Multiple 2025 papers investigated whether RL post-training can develop genuine strategic reasoning in LLMs through chess:
One of the most curious findings in LLM chess research is that OpenAI's gpt-3.5-turbo-instruct (released September 2023) plays significantly better chess than newer, more capable models:
Why? The instruct model is a text completion model, not a chat model. RLHF training for chat may actually harm chess ability. As a pure next-token predictor, the instruct model more faithfully simulates the chess games in its training data. Chat-tuned models are optimized for helpful conversation, which conflicts with the narrow task of generating valid chess notation.
The most recent development: Google's Gemini 3 Pro and Gemini 3 Flash claimed the top ELO positions on the Kaggle Game Arena chess leaderboard as of February 2026.
From "LLMs can't play chess at all" to "LLMs play at club level" in three years.
Three distinct categories have emerged, each with different implications for the future of AI and chess.
Stockfish 17.1 (3653 ELO) remains the undisputed champion. It has won every major championship since 2020 and is roughly 800 ELO stronger than the best human who has ever lived. Its combination of NNUE evaluation with classical search is essentially unbeatable. Magnus Carlsen says he has "no chance" against his phone.
DeepMind's purpose-built transformers (2895 ELO searchless; GM-level with MCTS) represent the most exciting frontier. They prove that transformer architecture can encode grandmaster-level chess knowledge. When augmented with MCTS, they approach engine-tier performance with human-scale search budgets. This is the approach most likely to eventually challenge Stockfish.
General-purpose LLMs play at 800-1800 ELO depending on the model and measurement methodology. The best (gpt-3.5-turbo-instruct) plays at strong club level. Reasoning models (o3, Gemini 3) show rapid improvement but still plateau far below expert level. The gap to Stockfish is ~1,800-2,800 ELO points -- an unbridgeable chasm with current architectures.
Almost certainly not as general-purpose models. The architectural limitations are fundamental:
However, specialized chess transformers with search (like DeepMind's MCTS-MAV) may eventually match engines. The key is combining the transformer's learned evaluation with a proper search algorithm -- essentially reinventing AlphaZero with a larger, pre-trained model.
Chess is increasingly used as a benchmark for genuine reasoning (vs. pattern matching) because it requires:
The fact that LLMs perform better at math and coding (where chain-of-thought works well) than at chess (where spatial reasoning and lookahead are essential) suggests that current "reasoning" capabilities are more about sequential logic than spatial-strategic intelligence.
The LLM Chess benchmark found a Pearson r = 0.686 correlation between chess and coding performance -- moderately positive but with significant gaps, confirming chess tests a distinct reasoning capability.
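For reference, the Pearson coefficient quoted above is just covariance normalized by the two standard deviations; a minimal pure-Python implementation, with made-up scores for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative (fabricated) per-model scores: chess win rate vs. coding benchmark.
chess = [0.96, 0.91, 0.86, 0.40, 0.05]
coding = [0.88, 0.90, 0.75, 0.60, 0.35]
print(round(pearson_r(chess, coding), 3))
```

An r of 0.686 means roughly 47% shared variance (r²), so a little under half of the variation in chess skill tracks coding skill, leaving plenty unexplained.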
All primary sources used in this research report, organized by category.