Benchmark vs bench inference

Fugu wins the board
Users hit the wall

Sakana AI’s orchestrator turns a pool of models into one API. Its scores are the clean part. Daily use is where the claim gets harder.

BY THE POLICY DESK · Tinkerton ~ 2 MIN · RECORD E1-E7

Sakana AI has put the model leaderboard under a different kind of stress test. Fugu is not one frontier model but a multi-LLM orchestrator, coordinating several models behind an OpenAI-compatible API while presenting itself as a single model to the user. The product has two tiers: base Fugu for lower-latency tasks and Fugu Ultra for complex, multi-step work. [E1]

Benchmarks give Fugu Ultra the cleanest version of its case. Sakana’s published results put Fugu Ultra at 73.7 on SWE-Bench Pro against Opus 4.8’s 69.2, 93.2 on LiveCodeBench against 87.8, and 95.5 on GPQA-D, tied with Opus 4.8. Those numbers support the narrow claim that orchestration can compete with single frontier systems on selected coding and reasoning tests. [E2]

Sakana goes further by saying Fugu Ultra performs on par with Anthropic’s Fable 5 and Mythos Preview. That comparison also exposes the odd shape of the market. Fable 5 and Mythos Preview are not in Fugu’s model pool because they are not publicly available, so Fugu is being judged against restricted systems it cannot directly use. [E3]

Practitioners found the gap between benchmark form and working form. AI researcher Ethan Mollick called Fugu Ultra “incredibly slow,” with coding tests taking 30 minutes and practical results falling short of Fable. In that reading, orchestration buys reach and redundancy, but it also adds routing time, coordination overhead, and a slower feedback loop. [E4]

Cost and iteration pressure also showed up fast. One user said a “$20 plan quota exhausted with a single prompt,” described the output as “notably worse than GPT 5.5,” and needed “seven or eight fix rounds.” That is the part most leaderboards smooth over: a strong terminal score can still be a poor tool when each attempt is slow, expensive, and repair-heavy. [E5]

Hamel Husain gave the more useful middle case, calling Fugu “solid for code reviews but weaker on frontend work” and “a bit jagged in its abilities.” The pattern fits an orchestrator whose behavior depends on the models inside the pool and on the routing layer’s judgment about which model should act when. [E6]

Orchestration remains a real workaround for a decoupled model market. If the best models are fragmented across providers, access rules, regions, and product tiers, a coordinating layer can turn availability into capability. Fugu’s weak point today is not the idea. It is the distance between benchmark aggregation and day-to-day usefulness. As public model pools improve, that distance may shrink; for now, the leaderboard win is easier to verify than the working advantage. [E7]

The Record · Provenance for this story

E1 ↩ The Decoder multi-LLM orchestrator 24 Jun

source