GPT-4.5: OpenAI's Bet on Feel Over Benchmarks
GPT-4.5 (Orion) launched late February 2025 with a deliberate design choice: optimize for conversational quality, not benchmark rankings. It was controversial, expensive, and oddly refreshing.
Here's an unusual situation: a frontier model that launches below the current benchmark leaders, intentionally, with OpenAI arguing that's the right call.
GPT-4.5 — internally called Orion — is OpenAI's answer to a real problem: as models chase math olympiad scores and PhD-level chemistry questions, the conversational quality that makes them useful day-to-day can actually degrade. GPT-4.5 tries to fix that.
What's New
- Emotional intelligence, tuned — More natural conversation, fewer hedged responses, better at reading the register of a conversation. It picks up on whether you want a direct answer or want to think something through.
- World knowledge breadth — OpenAI's largest pre-training run to date. GPT-4.5 knows more about more things, even if it doesn't reason as deeply as o1 or o3.
- Better refusal calibration — Fewer unnecessary refusals, less moralizing. A genuine improvement that gets lost in the benchmark discussion.
- Multimodal — Image understanding at parity with GPT-4o.
How It Compares at Launch
| Model | GPQA Diamond | AIME 2025 | Vibe/Conversational |
|---|---|---|---|
| GPT-4.5 | moderate | below top | ★★★★★ |
| Claude 3.7 Sonnet | ~68% | — | ★★★★☆ |
| Grok 3 Thinking | top tier | 93.3% | ★★★☆☆ |
| GPT-4o | ~53% | — | ★★★★☆ |
| o1 (thinking) | ~78% | strong | ★★★☆☆ |
GPT-4.5 doesn't win on benchmarks. It wins on what's harder to measure: in blind comparisons, people tend to prefer it for open-ended conversation, brainstorming, and writing.
Price Reality
At launch: $75 per million input tokens, $150 per million output tokens. That's roughly 30× GPT-4o's input-token price. For a model that doesn't lead benchmarks, that's a hard sell for production use.
OpenAI's bet: the quality premium justifies the cost for the right use case.
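To make the pricing concrete, here's a quick back-of-the-envelope calculator. The GPT-4.5 rates are the launch prices above; the GPT-4o rates ($2.50 in / $10 out per million tokens) are assumed from its list pricing at the time, and the workload figures are purely illustrative.

```python
# Rough per-request and monthly cost comparison at launch list prices.
# GPT-4.5 rates are from the article; GPT-4o rates ($2.50/$10 per 1M tokens)
# are an assumption for comparison. Workload numbers are illustrative.

PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-4.5": (75.00, 150.00),
    "gpt-4o": (2.50, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a chat turn with ~1,500 prompt tokens and ~500 completion tokens,
# served 100,000 times a month.
for model in PRICES:
    per_call = request_cost(model, 1_500, 500)
    print(f"{model}: ${per_call:.4f}/request, "
          f"${per_call * 100_000:,.0f}/month at 100k requests")
```

On those illustrative numbers, the same traffic costs roughly $18,750 a month on GPT-4.5 versus about $875 on GPT-4o, which is why low-volume, quality-sensitive use is the more plausible fit than high-throughput production.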
Best For
- Creative writing and ideation where tone and nuance matter
- Customer-facing conversational AI where warmth and naturalness outrank raw capability
- Situations where over-hedging and refusals have been a real problem
- Developers experimenting with what "great conversation" costs (see the sketch below)
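If you want to run the "does the feel justify the price" experiment yourself, a minimal comparison harness looks something like this sketch. It assumes the OpenAI Python SDK with an API key in the environment, and uses `gpt-4.5-preview` as the model identifier, which was the launch-era name; check your account's model list if it has since changed.

```python
# Minimal side-by-side "feel" comparison: same prompt, two models.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set.
# "gpt-4.5-preview" is the launch-era model name; verify against your model list.
from openai import OpenAI

client = OpenAI()

PROMPT = ("I'm stuck on the opening line of a conference talk about "
          "technical debt. Help me think it through.")

for model in ("gpt-4.5-preview", "gpt-4o"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=300,
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
    print(f"(tokens: {resp.usage.prompt_tokens} in / "
          f"{resp.usage.completion_tokens} out)\n")
```

Run a handful of prompts you actually care about and compare the outputs blind; the usage line makes it easy to attach a dollar figure to each response using the rates above.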
Not Yet For
- Agentic coding — Claude 3.7 Sonnet and Claude Code are far ahead here
- Math, science, reasoning tasks — o1/o3 or Grok 3 Thinking are better
- Budget-conscious production workloads — the pricing is hard to justify at scale
Verdict
GPT-4.5 is an interesting model that will likely be remembered as a footnote. The design philosophy — prioritize feel over benchmarks — is directionally right and has influenced how OpenAI built the GPT-5 generation. But at $150/M output tokens, the price killed adoption before the product could prove itself. If you work in a domain where conversational quality genuinely drives outcomes, it's worth experimenting with. For everyone else, Claude 3.7 or GPT-4o will do more for less.
Part of our Model Watch series. Next: Gemini 2.5 Pro →
