Evaluate Large Language Models
Reinforce Tactics provides a rigorous benchmark for testing LLM capabilities in strategic reasoning, spatial awareness, and long-horizon planning. Compare models head-to-head in competitive tournaments.
OpenAI GPT-5
Evaluate GPT-5 and GPT-5 Mini on complex tactical scenarios requiring multi-step planning.
Anthropic Claude
Benchmark Claude Sonnet 4.5, Claude Opus 4.5, and Claude Haiku 4.5 on strategic reasoning tasks.
Google Gemini
Test Gemini Pro and Gemini Ultra on spatial reasoning and resource management.
Custom Models
Integrate any LLM via API or local inference for comparative evaluation.
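As a rough sketch of what "any LLM via API or local inference" can look like in practice: the `LLMAgent` interface below is hypothetical, not the project's actual API, but the client calls follow the standard OpenAI Python SDK, which most local inference servers also expose as a compatible endpoint.

```python
# Hypothetical adapter sketch -- the LLMAgent interface is illustrative, not the
# project's actual API; the client calls follow the standard OpenAI Python SDK.
from abc import ABC, abstractmethod


class LLMAgent(ABC):
    """Minimal contract a custom model must satisfy to take part in an evaluation."""

    @abstractmethod
    def choose_action(self, observation: str) -> str:
        """Return a move string (e.g. 'MOVE warrior_1 TO 4,7') for a serialized game state."""


class OpenAICompatibleAgent(LLMAgent):
    """Wraps any OpenAI-compatible chat endpoint, cloud-hosted or a local inference server."""

    def __init__(self, client, model: str):
        self.client = client    # e.g. openai.OpenAI(base_url="http://localhost:8000/v1")
        self.model = model

    def choose_action(self, observation: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": observation}],
        )
        return response.choices[0].message.content
```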
Run automated tournaments, generate Elo ratings, and analyze decision-making patterns across different model architectures and prompting strategies.
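For reference, the rating update behind tournament leaderboards is only a few lines. The repository's exact rating code may differ; the sketch below is the textbook Elo formula.

```python
def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Textbook Elo update. score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# A 1500-rated model beating a 1600-rated model gains roughly 20 points.
new_a, new_b = update_elo(1500, 1600, score_a=1.0)
print(round(new_a), round(new_b))   # 1520 1580
```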
Learn About Tournaments
Rich Tactical Environment
Four distinct unit types create a complex decision space that challenges AI agents to reason about positioning, resource allocation, and opponent modeling.
Warrior
Frontline Fighter
Stalwart defenders who excel in close combat. High durability makes them perfect for holding the line.
Mage
Arcane Striker
Masters of mystical arts who can strike from afar and paralyze enemies for 3 turns.
Cleric
Support Healer
Devoted healers who restore allies and cure status effects. Essential for sustained campaigns.
Archer
Ranged Specialist
Precise marksmen who gain extended range from high ground. Their targets cannot counter-attack.
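The four roles above map naturally onto a small data-driven roster. The sketch below is illustrative only: field names, stat encodings, and range values are assumptions, not the repository's actual unit definitions.

```python
# Illustrative roster -- field names and values are assumptions, not the
# repository's actual unit classes.
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitType:
    name: str
    role: str
    attack_range: int              # attack range in grid tiles
    paralyze_turns: int = 0        # turns of paralysis inflicted on hit (Mage)
    heals: bool = False            # restores HP and cures status effects (Cleric)
    negates_counter: bool = False  # targets cannot counter-attack (Archer)


UNIT_TYPES = {
    "warrior": UnitType("Warrior", "Frontline Fighter", attack_range=1),
    "mage":    UnitType("Mage", "Arcane Striker", attack_range=2, paralyze_turns=3),
    "cleric":  UnitType("Cleric", "Support Healer", attack_range=1, heals=True),
    "archer":  UnitType("Archer", "Ranged Specialist", attack_range=2, negates_counter=True),
}
```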
Built for AI Research
A complete tactical environment designed for reinforcement learning experimentation, LLM benchmarking, and AI development.
Turn-Based Tactical Combat
Strategic grid-based battles with attacks, counter-attacks, paralysis, and healing mechanics inspired by Fire Emblem and Advance Wars.
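To make the attack/counter-attack loop concrete, here is a minimal sketch of a single exchange. The damage formula, dictionary fields, and `resolve_attack` function are assumptions for illustration, not the engine's real implementation.

```python
def resolve_attack(attacker: dict, defender: dict) -> None:
    """One exchange: damage, optional paralysis, then a counter if the defender can respond."""
    defender["hp"] -= max(1, attacker["attack"] - defender["defense"])

    if attacker.get("paralyzes_for"):                 # e.g. a Mage paralyzes for 3 turns
        defender["paralyzed_turns"] = attacker["paralyzes_for"]

    distance = abs(attacker["x"] - defender["x"]) + abs(attacker["y"] - defender["y"])
    can_counter = (
        defender["hp"] > 0
        and distance <= defender["range"]
        and not attacker.get("negates_counter", False)  # Archers deny the counter-attack
    )
    if can_counter:                                   # counters mirror the damage formula
        attacker["hp"] -= max(1, defender["attack"] - attacker["defense"])
```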
Gymnasium RL Environment
Full Gymnasium compatibility with multi-discrete action space, configurable reward shaping, and headless mode for high-speed training.
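As an example of the Gymnasium workflow, the sketch below shows what a multi-discrete tactical environment looks like in headless mode. The grid size, observation layout, and action encoding are assumptions, not the repository's actual environment.

```python
# Hedged sketch of a multi-discrete tactical Gymnasium env; the real
# environment's id, observation layout, and action encoding may differ.
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class TinyTacticsEnv(gym.Env):
    """10x10 grid; action = (unit index, target x, target y, command)."""

    def __init__(self, render_mode=None):
        self.render_mode = render_mode                 # None = headless, for fast training
        self.action_space = spaces.MultiDiscrete([4, 10, 10, 3])
        self.observation_space = spaces.Box(0, 1, shape=(10, 10, 8), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros((10, 10, 8), dtype=np.float32)  # placeholder board encoding
        return obs, {}

    def step(self, action):
        unit, x, y, command = action                   # decode the multi-discrete action
        obs = np.zeros((10, 10, 8), dtype=np.float32)  # the engine would apply the move here
        reward, terminated, truncated = 0.0, False, False
        return obs, reward, terminated, truncated, {}


env = TinyTacticsEnv()
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```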
LLM Evaluation Framework
Benchmark GPT-5, Claude, Gemini, and other large language models on strategic reasoning, planning, and multi-step decision making.
Tournament System
Run automated tournaments between AI agents, track Elo ratings, and generate detailed performance analytics and leaderboards.
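A tournament schedule itself is simple to sketch: a round-robin pairing generator like the one below (illustrative, not the project's runner) is enough to feed matches into whatever game loop and rating tracker you use. The agent labels are arbitrary strings.

```python
# Illustrative round-robin scheduler -- not the project's tournament runner.
import itertools


def round_robin(agents, games_per_pair=2):
    """Yield (home, away) pairings so every agent plays every other agent."""
    for a, b in itertools.combinations(agents, 2):
        for g in range(games_per_pair):
            # alternate sides so neither agent always moves first
            yield (a, b) if g % 2 == 0 else (b, a)


for home, away in round_robin(["gpt-5", "claude-sonnet-4.5", "gemini-pro"]):
    print(f"{home} vs {away}")
```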
Replay & Analysis Tools
Record battles, export replays to video, and analyze decision patterns. Essential for AI research and model interpretability.
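At its core, a replay is just each decision recorded alongside the state it was made in. The JSON layout below is a guess at a minimal format, not the repository's actual replay schema or export pipeline.

```python
# Minimal replay log sketch -- field names and format are assumptions.
import json


def record_turn(replay, turn, agent, action, state_summary):
    replay["turns"].append(
        {"turn": turn, "agent": agent, "action": action, "state": state_summary}
    )


replay = {"map": "10x10_default", "agents": ["gpt-5", "claude-sonnet-4.5"], "turns": []}
record_turn(replay, 1, "gpt-5", "MOVE warrior_1 TO 3,4", {"units_alive": [4, 4]})

with open("replay_0001.json", "w") as f:
    json.dump(replay, f, indent=2)
```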
Modular Architecture
Clean, extensible Python codebase for adding new units, mechanics, reward functions, and custom AI agents.
Explore the Documentation
Everything you need to start evaluating LLMs and training RL agents.
Start Evaluating Your AI Models
Clone the repository, run your first LLM tournament, and discover how different models perform on strategic reasoning tasks. Open source and ready for research.