Greg Mousseau

Kimi K2.5 Thinking: Moonshot AI Joins the Frontier

Kimi K2.5 Thinking from Moonshot AI arrived in early 2026 as one of the strongest open-weight reasoning models yet — 96% on AIME 2025, competitive on LiveCodeBench, and a serious option for teams who want frontier-quality reasoning they can actually self-host.

Model Intel · Moonshot AI · Reasoning

DeepSeek R1 showed that open models could match closed reasoning models on math. Kimi K2.5 Thinking picks up that thread and pushes further.

Moonshot AI isn't well-known in the West, but it has been building serious models for years, and K2.5 Thinking is the release that puts the company on the international frontier map.

What's Impressive

  • AIME 2025: 96% — among the highest scores on this math competition benchmark, a verifiable result that rivals the best closed-source reasoning models available.
  • LiveCodeBench: 85% — Strong coding performance. In the same tier as DeepSeek V3.2 and Qwen 3.5.
  • Open weights, commercially usable — Like DeepSeek V3.2, you can self-host this. That's the differentiator from Gemini or GPT-5.2.
  • "Thinking" mode — Extended chain-of-thought reasoning. The model shows its work on hard problems.
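In practice, that visible chain-of-thought usually needs to be separated from the final answer before you show it to users or pipe it downstream. A minimal sketch, assuming the model wraps its reasoning in `<think>...</think>` delimiters the way DeepSeek R1 does — K2.5's exact output format may differ, so treat the delimiter as an assumption:

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Split a reasoning-model response into (chain_of_thought, final_answer).

    Assumes the reasoning is wrapped in <think>...</think> tags, as in
    DeepSeek R1; Kimi K2.5 Thinking's actual delimiters may differ.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        # No visible reasoning block: treat the whole response as the answer.
        return "", raw.strip()
    thinking = match.group(1).strip()
    answer = raw[match.end():].strip()  # everything after the closing tag
    return thinking, answer

# Hypothetical response text, for illustration only
raw = "<think>2^10 = 1024, so the answer is 1024.</think>The answer is 1024."
cot, answer = split_thinking(raw)
print(answer)  # -> The answer is 1024.
```

Keeping the reasoning trace around (rather than discarding it) is useful for auditing hard math answers, which is exactly where this model is pitched.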

How It Compares at Launch

Model                 AIME 2025   LiveCodeBench   Open Weight
Kimi K2.5 Thinking    96%         85%             Yes
DeepSeek V3.2         92%         86%             Yes
Qwen 3.5 (235B)       ~95%        ~85%            Yes
Gemini 3.1 Pro        top         top             No
Claude Opus 4.6       strong      —               No

In the open-weight math and reasoning tier, Kimi K2.5 Thinking is at or near the top.

Best For

  • Math-heavy and scientific reasoning tasks where open weights matter
  • Teams that want extended chain-of-thought without paying per-token API costs
  • Self-hosted reasoning pipelines for research, quantitative finance, engineering
  • Evaluating open-weight alternatives to o3 or Gemini 3 Pro Think
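For teams weighing the self-hosting route, a sketch of what serving open weights behind an OpenAI-compatible endpoint with vLLM might look like — the Hugging Face model ID and parallelism settings below are assumptions, not confirmed details, so check Moonshot AI's official release page before copying them:

```shell
# Serve the weights locally with vLLM (model ID is an assumption).
# --tensor-parallel-size shards the model across GPUs; size it to your hardware.
vllm serve moonshotai/Kimi-K2.5-Thinking \
  --tensor-parallel-size 8 \
  --port 8000

# Then query it like any OpenAI-compatible API -- no per-token billing.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "moonshotai/Kimi-K2.5-Thinking",
       "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]}'
```

The OpenAI-compatible surface matters here: it lets you benchmark this model against DeepSeek V3.2 or a closed API by swapping only the base URL and model name.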

Not For

  • Agentic coding — Claude Code still leads
  • General conversational AI — K2.5 is specialized for reasoning, not optimized for broad chat quality
  • Teams that need Western vendor support SLAs

Verdict

Kimi K2.5 Thinking is a genuine entrant in the open-weight frontier. If your primary use case involves hard mathematical or scientific reasoning and you want to self-host, this is worth benchmarking alongside DeepSeek V3.2 and Qwen 3.5. The open-weight reasoning tier has never been stronger.

Part of our Model Watch series.