Kimi K2.5 Thinking: Moonshot AI Joins the Frontier
Kimi K2.5 Thinking from Moonshot AI arrived in early 2026 as one of the strongest open-weight reasoning models yet: 96% on AIME 2025, 85% on LiveCodeBench, and a serious option for teams that want frontier-quality reasoning they can actually self-host.
DeepSeek R1 showed that open models could match closed reasoning models on math. Kimi K2.5 Thinking picks up that thread and pushes further.
Moonshot AI isn't well known in the West, but it has been building serious models, and K2.5 Thinking is the release that puts it on the international frontier map.
What's Impressive
- AIME 2025: 96%. Among the highest scores on this math competition benchmark, rivaling the best closed-source reasoning models available.
- LiveCodeBench: 85%. Strong coding performance, in the same tier as DeepSeek V3.2 and Qwen 3.5.
- Open weights, commercially usable. Like DeepSeek V3.2, you can self-host this, which is the differentiator from Gemini or GPT-5.2.
- "Thinking" mode. Extended chain-of-thought reasoning: the model shows its work on hard problems.
How It Compares at Launch
| Model | AIME 2025 | LiveCodeBench | Open Weight |
|---|---|---|---|
| Kimi K2.5 Thinking | 96% | 85% | ✅ |
| DeepSeek V3.2 | 92% | 86% | ✅ |
| Qwen 3.5 (235B) | ~95% | ~85% | ✅ |
| Gemini 3.1 Pro | top | top | ❌ |
| Claude Opus 4.6 | strong | — | ❌ |
In the open-weight math and reasoning tier, Kimi K2.5 Thinking is at or near the top.
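To make the "at or near the top" reading concrete, here is a minimal Python sketch that ranks the three open-weight rows from the table. The `Score` class and `rank_by` helper are illustrative names, not part of any published tooling, and the Qwen 3.5 figures are the table's approximate ("~") values, so treat close orderings as ties rather than firm results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    aime_2025: float            # percent, copied from the table above
    livecodebench: float        # percent, copied from the table above
    approximate: bool = False   # True where the table marks the value with "~"


# Open-weight rows only; the closed models have no numeric entries in the table.
OPEN_WEIGHT = [
    Score("Kimi K2.5 Thinking", 96, 85),
    Score("DeepSeek V3.2", 92, 86),
    Score("Qwen 3.5 (235B)", 95, 85, approximate=True),
]


def rank_by(scores, field):
    """Return model names sorted from highest to lowest on the given field."""
    ordered = sorted(scores, key=lambda s: getattr(s, field), reverse=True)
    return [s.model for s in ordered]


math_ranking = rank_by(OPEN_WEIGHT, "aime_2025")
code_ranking = rank_by(OPEN_WEIGHT, "livecodebench")

if __name__ == "__main__":
    print("AIME 2025:", math_ranking)
    print("LiveCodeBench:", code_ranking)
```

On these launch numbers, Kimi K2.5 Thinking leads the open-weight group on AIME 2025, while DeepSeek V3.2 is one point ahead on LiveCodeBench.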
Best For
- Math-heavy and scientific reasoning tasks where open weights matter
- Teams that want extended chain-of-thought without paying per-token API costs
- Self-hosted reasoning pipelines for research, quantitative finance, engineering
- Evaluating open-weight alternatives to o3 or Gemini 3 Pro Think
Not For
- Agentic coding — Claude Code still leads
- General conversational AI — K2.5 is specialized for reasoning, not optimized for broad chat quality
- Teams that need Western vendor support SLAs
Verdict
Kimi K2.5 Thinking is a genuine entrant in the open-weight frontier. If your primary use case involves hard mathematical or scientific reasoning and you want to self-host, this is worth benchmarking alongside DeepSeek V3.2 and Qwen 3.5. The open-weight reasoning tier has never been stronger.
Part of our Model Watch series.