Greg Mousseau

MiniMax M2.7: The $0.27 Model That Trained Itself

MiniMax M2.7 scored 84/100 against Claude Opus 4.6's 94/100 in head-to-head coding tests. It cost $0.27 to Opus's $3.67. The model also helped build itself. Here's what that means for agent operators.

Tags: Model Intel · MiniMax · M2.7 · Claude Opus · cost optimization · AI agents · model routing

Kilo Code, an open-source AI coding assistant with 1.5M+ users, just ran MiniMax M2.7 against Claude Opus 4.6 on three structured TypeScript tasks. Same prompts. No hints. The result: M2.7 scored 84/100 to Opus's 94/100. Opus cost $3.67. M2.7 cost $0.27.

That's 90% of the quality for 7% of the cost.

And the model helped train itself. That part matters too, but we'll get there.

What MiniMax claims M2.7 actually did

MiniMax is a Chinese AI startup. They released M2.7 around March 20, 2026, calling it "our first model deeply participating in its own evolution." Bold claim. Here's what they say happened.

They gave an internal version of M2.7 its own reinforcement learning workflow. The model handled 30-50% of the RL development process: literature review, experiment tracking, data pipeline setup, monitoring, debugging, metric analysis, code fixes, merge requests, smoke tests. Human researchers only stepped in for critical decisions.

The most concrete claim: they ran M2.7 through 100+ rounds of autonomous scaffold optimization. The loop was simple. Analyze failure trajectories. Plan changes. Modify scaffold code. Run evaluations. Compare results. Keep or revert. Fully autonomous, 100+ rounds.
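
That keep-or-revert loop can be sketched in a few lines. This is a hypothetical reconstruction from the description above, not MiniMax's actual code; `evaluate` and `proposePatch` are stand-in functions.

```typescript
// Hypothetical sketch of a keep-or-revert scaffold optimization loop,
// reconstructed from MiniMax's description (not their real API).

type Patch = { description: string; apply: () => void; revert: () => void };

function optimizeScaffold(
  evaluate: () => number,               // higher score = better
  proposePatch: (score: number) => Patch,
  rounds: number,
): number {
  let best = evaluate();
  for (let i = 0; i < rounds; i++) {
    const patch = proposePatch(best);   // analyze failures, plan a change
    patch.apply();                      // modify scaffold code
    const score = evaluate();           // re-run evaluations
    if (score > best) {
      best = score;                     // improvement: keep the change
    } else {
      patch.revert();                   // no improvement: roll it back
    }
  }
  return best;
}
```

The interesting property is that the loop is monotone: a bad patch can never make the scaffold worse, so running it 100+ times unattended is low-risk by construction.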

The result was a 30% performance improvement on internal evals. Specific optimizations the model discovered included systematic sampling parameter search (temperature, frequency penalty, presence penalty), workflow guidelines like auto-searching for the same bug pattern in other files after a fix, and loop detection in the agent's execution loop.

MiniMax also ran M2.7 on MLE Bench Lite, where it participated in 22 ML competitions on a single A30 GPU across three 24-hour trials. Best run: 9 gold, 5 silver, 1 bronze. Average medal rate of 66.6%, which ties Gemini 3.1 and trails Opus 4.6 (75.7%) and GPT-5.4 (71.2%).

The Kilo Code head-to-head

The Kilo Code tests are more useful than benchmarks because they simulate real coding tasks. Three tests, scored out of 100.

Test 1: Full-stack event processing system (35 points). Build from spec: async pipeline, WebSocket streaming, rate limiting, seven components. Opus scored 33/35 with a modular directory structure, 41 integration tests, and graceful shutdown. Lost 2 points for a missing README. M2.7 scored 28/35. All seven components worked, but flatter structure and only 20 unit tests. Lost points on architecture and test coverage.

Test 2: Bug investigation from symptoms (30 points). Six planted bugs in an order processing system, given production logs and a memory profile. Opus scored 28/30, finding all six root causes and adding rollback logic for partial failures. M2.7 scored 27/30. It also found all six root causes. But here's the interesting part: M2.7 produced a technically better fix for the floating-point bug, using integer math in cents instead of Opus's post-calculation rounding. It missed the rollback for partial failures.
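
To make the floating-point point concrete, here is an illustration of the two fix styles described above (my own sketch, not either model's actual code). Post-calculation rounding patches over binary floating-point error after it accumulates; doing the arithmetic in integer cents avoids it entirely.

```typescript
// 0.1 + 0.2 === 0.30000000000000004 in IEEE-754 doubles.
// Converting to integer cents first makes addition exact: 10 + 20 === 30.

function toCents(dollars: number): number {
  return Math.round(dollars * 100);
}

function totalInCents(prices: number[]): number {
  // Sum in integer cents so no binary rounding error can accumulate.
  return prices.reduce((sum, p) => sum + toCents(p), 0);
}
```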

Test 3: Security audit (35 points). Ten planted vulnerabilities in a team collaboration API, OWASP categorization required. Opus scored 33/35 with proper key derivation (scrypt), feature-preserving alternatives, and defense-in-depth patterns like per-endpoint rate limiting. M2.7 scored 29/35. It found all ten vulnerabilities but applied simpler fixes: SHA-256 instead of scrypt, disabled transforms instead of safe replacements, rate-limited login only. It also flagged its own shortcuts in the output, which is worth noting.
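
The gap between those two password-hashing fixes is worth spelling out. The sketch below uses Node's built-in `crypto` module; the parameters are illustrative, not a vetted production configuration, and neither function is taken from the models' actual output.

```typescript
import { scryptSync, createHash, randomBytes, timingSafeEqual } from "node:crypto";

// The shortcut: a single unsalted SHA-256 hash. Fast to compute,
// which also makes it fast to brute-force offline.
function hashPasswordWeak(password: string): string {
  return createHash("sha256").update(password).digest("hex");
}

// Closer to the thorough fix: salted, memory-hard key derivation.
function hashPasswordStrong(password: string): string {
  const salt = randomBytes(16);
  const key = scryptSync(password, salt, 32); // Node default cost N=16384
  return `${salt.toString("hex")}:${key.toString("hex")}`;
}

function verifyPassword(password: string, stored: string): boolean {
  const [saltHex, keyHex] = stored.split(":");
  const key = scryptSync(password, Buffer.from(saltHex, "hex"), 32);
  return timingSafeEqual(key, Buffer.from(keyHex, "hex")); // constant-time compare
}
```

Both approaches "hash the password," which is presumably why a simpler model stops at SHA-256; the difference only shows up when you model an attacker with GPU time.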

| Test | Opus 4.6 | M2.7 | Cost (Opus) | Cost (M2.7) |
| --- | --- | --- | --- | --- |
| Build from spec | 33/35 | 28/35 | $1.49 | $0.13 |
| Bug investigation | 28/30 | 27/30 | $1.12 | $0.08 |
| Security audit | 33/35 | 29/35 | $1.06 | $0.06 |
| Total | 94/100 | 84/100 | $3.67 | $0.27 |

Detection parity is the standout. M2.7 found all six bugs and all ten vulnerabilities. The gap is entirely in fix thoroughness and code organization. It knows what's wrong. It just applies simpler solutions.

For context: MiniMax M2.5 (the predecessor) is already the most-used model across every mode in Kilo Code. 37% of all code mode usage. 35% of all ask mode usage. Ahead of Opus, GLM-5, and GPT-5.4. Developers are already voting with their wallets.

What deserves skepticism

The "30% improvement on internal evals" number is self-reported on self-selected evals. We don't know the baseline. We don't know what exactly improved. On BridgeBench (vibe coding), M2.7 actually regressed from M2.5, dropping from 12th to 19th place. Not everything got better.

The "30-50% of the RL workflow" claim has no independent verification. Is that 30-50% of tasks touched, or 30-50% of decisions made? Those are very different things.

Some of what they describe as "self-evolution" is sophisticated automated hyperparameter search with an LLM controller. Searching for optimal temperature and frequency penalty values is interesting engineering, but calling it self-evolution is a stretch. The scaffold optimization and loop detection discoveries are more genuinely novel.

M2.7 is also proprietary. M2.5 was open-weight. That makes MiniMax the second Chinese AI startup to go closed (after z.ai's GLM-5 Turbo). The open-source advantage that made Chinese models attractive for self-hosting is eroding.

And MiniMax's stated goal of "full autonomy without human involvement" in model training comes with zero mention of alignment concerns or safety testing. That's a notable omission.

What this means for agent operators

The question for anyone running AI agents in production isn't "which model is best." It's "which tasks justify frontier pricing and which don't."

M2.7's detection parity with Opus on the Kilo tests tells a clear story. For diagnosis tasks (find the bugs, identify the vulnerabilities, analyze the logs), a $0.27 model gets you the same answers as a $3.67 one. For implementation tasks (write the fix, build the architecture, add defense-in-depth), Opus still delivers more thorough results.

This is a model routing conversation, not a model selection one. Run your detection and triage tasks on M2.7. Route your implementation and review tasks to Opus or GPT-5.4. The savings compound fast when you're running agents around the clock.
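
A minimal router along those lines is just a lookup keyed on task type. This is a sketch under the post's own numbers; the model identifiers are the ones quoted here, and how you classify a task as detection vs. implementation is your own problem.

```typescript
// Minimal task-type router. Model names follow this post's head-to-head;
// classifying incoming work into a TaskKind is left to the caller.

type TaskKind = "detection" | "implementation";

interface Route {
  model: string;
  reason: string;
}

const routes: Record<TaskKind, Route> = {
  // Diagnosis parity on the Kilo tests: the cheap model finds the same issues.
  detection: { model: "minimax-m2.7", reason: "detection parity at ~7% of the cost" },
  // Thorough fixes and architecture still favor the frontier model.
  implementation: { model: "claude-opus-4.6", reason: "more thorough fixes" },
};

function routeTask(kind: TaskKind): Route {
  return routes[kind];
}
```

Real routers add fallbacks (escalate to the frontier model when the cheap one fails validation), but even this two-way split captures most of the savings if your workload is diagnosis-heavy.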

The self-training angle matters here too. If MiniMax can keep running that optimization loop, M2.7's successor will close the implementation gap while staying at the same price point. The cost curve for frontier-adjacent performance is heading in one direction.

The harness engineering angle is equally important. MiniMax's biggest gains came from scaffold optimization, not raw model capability. Loop detection, workflow guidelines, failure pattern propagation. These are the same kinds of improvements you can make to your own agent harnesses today, regardless of which model you're running underneath.
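
Loop detection, for instance, is a small amount of harness code. The sketch below is a generic version of the idea, not MiniMax's implementation: flag when the agent repeats the same action within a sliding window so the harness can intervene.

```typescript
// Generic agent-loop detector: returns true when the same action string
// appears `threshold` times within the last `windowSize` actions.

function makeLoopDetector(windowSize = 6, threshold = 3) {
  const recent: string[] = [];
  return function record(action: string): boolean {
    recent.push(action);
    if (recent.length > windowSize) recent.shift(); // drop oldest action
    const repeats = recent.filter((a) => a === action).length;
    return repeats >= threshold; // true => harness should break the loop
  };
}
```

In practice you would normalize actions (tool name plus key arguments) before comparing, so trivially different retries still count as repeats.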

The bottom line

M2.7 is not the best model. It's the best value proposition in AI right now. 90% of Opus quality at 7% of the cost, with a self-improvement loop that suggests the gap will keep shrinking.

If you're running agents and you're not routing by task complexity, you're overpaying. If you want help designing a model routing strategy that matches task types to the right model at the right price, reach out. This is exactly the kind of optimization we do at GTA Labs.