GPT-4.5: OpenAI's Bet on Feel Over Benchmarks
GPT-4.5 (Orion) launched late February 2025 with a deliberate design choice: optimize for conversational quality, not benchmark rankings. It was controversial, expensive, and oddly refreshing.
Here's an unusual situation: a frontier model that launches below the current benchmark leaders, intentionally, with OpenAI arguing that's the right call.
GPT-4.5 — internally called Orion — is OpenAI's answer to a real problem: as models chase math olympiad scores and PhD-level chemistry questions, the conversational quality that makes them useful day-to-day can actually degrade. GPT-4.5 tries to fix that.
What's New
- Emotional intelligence, tuned — More natural conversation, fewer hedged responses, better at reading the register of a conversation. It picks up on whether you want a direct answer or want to think something through.
- World knowledge breadth — OpenAI's largest pre-training run to date. GPT-4.5 knows more about more things, even if it doesn't reason as deeply as o1 or o3.
- Better refusal calibration — Fewer unnecessary refusals, less moralizing. A genuine improvement that gets lost in the benchmark discussion.
- Multimodal — Image understanding at parity with GPT-4o.
How It Compares at Launch
| Model | GPQA Diamond | AIME 2025 | Vibe/Conversational |
|---|---|---|---|
| GPT-4.5 | moderate | below top | ★★★★★ |
| Claude 3.7 Sonnet | ~68% | — | ★★★★☆ |
| Grok 3 Thinking | top tier | 93.3% | ★★★☆☆ |
| GPT-4o | ~53% | — | ★★★★☆ |
| o1 (thinking) | ~78% | strong | ★★★☆☆ |
GPT-4.5 doesn't win on benchmarks. It wins on what's harder to measure: in blind comparisons, people tend to prefer it for open-ended conversation, brainstorming, and writing.
Price Reality
At launch: $75 per million input tokens, $150 per million output tokens. That's roughly 30× GPT-4o's input-token price. For a model that doesn't lead benchmarks, that's a hard sell for production use.
OpenAI's bet: the quality premium justifies the cost for the right use case.
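To make the pricing concrete, here's a quick back-of-the-envelope calculator. The GPT-4.5 rates are the launch prices above; the GPT-4o rates ($2.50 in / $10 out per million tokens) are assumed from its list pricing at the time, and the workload figures are purely illustrative.

```python
# Rough per-request and monthly cost comparison at launch list prices.
# GPT-4.5 rates are from the article; GPT-4o rates ($2.50/$10 per 1M tokens)
# are an assumption for comparison. Workload numbers are illustrative.

PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-4.5": (75.00, 150.00),
    "gpt-4o": (2.50, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a chat turn with ~1,500 prompt tokens and ~500 completion tokens,
# served 100,000 times a month.
for model in PRICES:
    per_call = request_cost(model, 1_500, 500)
    print(f"{model}: ${per_call:.4f}/request, "
          f"${per_call * 100_000:,.0f}/month at 100k requests")
```

On those illustrative numbers, the same traffic costs roughly $18,750 a month on GPT-4.5 versus about $875 on GPT-4o, which is why low-volume, quality-sensitive use is the more plausible fit than high-throughput production.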
Best For
- Creative writing and ideation where tone and nuance matter
- Customer-facing conversational AI where warmth and naturalness outrank raw capability
- Situations where over-hedging and refusals have been a real problem
- Developers experimenting with what "great conversation" costs (see the sketch below)
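If you want to run the "does the feel justify the price" experiment yourself, a minimal comparison harness looks something like this sketch. It assumes the OpenAI Python SDK with an API key in the environment, and uses `gpt-4.5-preview` as the model identifier, which was the launch-era name; check your account's model list if it has since changed.

```python
# Minimal side-by-side "feel" comparison: same prompt, two models.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set.
# "gpt-4.5-preview" is the launch-era model name; verify against your model list.
from openai import OpenAI

client = OpenAI()

PROMPT = ("I'm stuck on the opening line of a conference talk about "
          "technical debt. Help me think it through.")

for model in ("gpt-4.5-preview", "gpt-4o"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=300,
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
    print(f"(tokens: {resp.usage.prompt_tokens} in / "
          f"{resp.usage.completion_tokens} out)\n")
```

Run a handful of prompts you actually care about and compare the outputs blind; the usage line makes it easy to attach a dollar figure to each response using the rates above.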
Not Yet For
- Agentic coding — Claude 3.7 Sonnet and Claude Code are far ahead here
- Math, science, reasoning tasks — o1/o3 or Grok 3 Thinking are better
- Budget-conscious production workloads — the pricing is hard to justify at scale
Verdict
GPT-4.5 is an interesting model that will likely be remembered as a footnote. The design philosophy — prioritize feel over benchmarks — is directionally right and has influenced how OpenAI built the GPT-5 generation. But at $150/M output tokens, the price killed adoption before the product could prove itself. If you work in a domain where conversational quality genuinely drives outcomes, it's worth experimenting with. For everyone else, Claude 3.7 or GPT-4o will do more for less.
Part of our Model Watch series. Next: Gemini 2.5 Pro →
