← Back to blog
Greg Mousseau

GPT-4.5: OpenAI's Bet on Feel Over Benchmarks

GPT-4.5 (Orion) launched late February 2025 with a deliberate design choice: optimize for conversational quality, not benchmark rankings. It was controversial, expensive, and oddly refreshing.

Model Review · Frontier Models · AI Strategy · OpenAI

Here's an unusual situation: a frontier model that launches below the current benchmark leaders, intentionally, and argues that's the right call.

GPT-4.5 — internally called Orion — is OpenAI's answer to a real problem: as models chase math olympiad scores and PhD-level chemistry questions, the conversational quality that makes them useful day-to-day can actually degrade. GPT-4.5 tries to fix that.

What's New

  • Emotional intelligence, tuned — More natural conversation, fewer hedged responses, better at reading the register of a conversation. It picks up on whether you want a direct answer or want to think something through.
  • World knowledge breadth — OpenAI's largest pre-training run to date. GPT-4.5 knows more about more things, even if it doesn't reason as deeply as o1 or o3.
  • Better refusal calibration — Fewer unnecessary refusals, less moralizing. A genuine improvement that gets lost in the benchmark discussion.
  • Multimodal — Image understanding at parity with GPT-4o.

How It Compares at Launch

| Model | GPQA Diamond | AIME 2025 | Vibe/Conversational |
|---|---|---|---|
| GPT-4.5 | moderate | below top | ★★★★★ |
| Claude 3.7 Sonnet | ~68% | — | ★★★★☆ |
| Grok 3 Thinking | top tier | 93.3% | ★★★☆☆ |
| GPT-4o | ~53% | — | ★★★★☆ |
| o1 (thinking) | ~78% | strong | ★★★☆☆ |

GPT-4.5 doesn't win on benchmarks. It wins on what's harder to measure: in blind comparisons, it's the model most people prefer for open-ended conversation, brainstorming, and writing.

Price Reality

At launch: $75 per million input tokens, $150 per million output tokens. That's roughly 15–30× GPT-4o's rates, depending on your input/output mix. For a model that doesn't lead benchmarks, that's a hard sell for production use.

OpenAI's bet: the quality premium justifies the cost for the right use case.
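To make that premium concrete, here's a minimal cost sketch. The GPT-4.5 rates are the launch prices quoted above; the GPT-4o rates ($2.50 in / $10 out per million tokens) are assumed from OpenAI's late-2024 list pricing, and the token counts for a "typical chat turn" are illustrative guesses.

```python
# Rough per-request cost comparison at assumed launch list prices.
PRICES = {  # model -> (input $/1M tokens, output $/1M tokens)
    "gpt-4.5": (75.00, 150.00),
    "gpt-4o": (2.50, 10.00),  # assumption: late-2024 list price
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Illustrative chat turn: ~2,000 tokens in, ~500 tokens out.
cost_45 = request_cost("gpt-4.5", 2_000, 500)
cost_4o = request_cost("gpt-4o", 2_000, 500)
print(f"GPT-4.5: ${cost_45:.4f} vs GPT-4o: ${cost_4o:.4f} "
      f"({cost_45 / cost_4o:.1f}x)")
```

At these numbers a single conversational turn costs about a quarter on GPT-4.5 versus about a penny on GPT-4o, which is the gap any "quality premium" argument has to clear.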

Best For

  • Creative writing and ideation where tone and nuance matter
  • Customer-facing conversational AI where warmth and naturalness outrank raw capability
  • Situations where over-hedging and refusals have been a real problem
  • Developers experimenting with what "great conversation" costs

Not Yet For

  • Agentic coding — Claude 3.7 Sonnet and Claude Code are far ahead here
  • Math, science, reasoning tasks — o1/o3 or Grok 3 Thinking are better
  • Budget-conscious production workloads — the pricing is hard to justify at scale

Verdict

GPT-4.5 is an interesting model that will likely be mostly forgotten. The design philosophy — prioritize feel over benchmarks — is directionally right and has influenced how OpenAI built the GPT-5 generation. But at $150/M output tokens, the price killed adoption before the product could prove itself. If you work in a domain where conversational quality genuinely drives outcomes, it's worth experimenting with. For everyone else, Claude 3.7 or GPT-4o will do more for less.

Part of our Model Watch series. Next: Gemini 2.5 Pro →