Book a Free Strategy Call
Skip the read — talk to Walid in 30 min.
Free strategy call. We map your AI engineering team, you keep the notes.
Sakana Fugu vs Fable 5: Benchmarks, Orchestration, and Which to Use (2026)
Prefer open-source orchestration? See Maestro vs Sakana Fugu and the explainer on what LLM orchestration is.
Short answer: this isn't a fair fight, because it isn't even the same kind of contest. Sakana Fugu is an orchestration model that routes a task across a pool of frontier LLMs; Claude Fable 5 is a single frontier model from Anthropic. Sakana claims Fugu Ultra performs "on par with" Fable 5 — but the practical winner today is shaped less by benchmarks than by one blunt fact: Fable 5 was pulled from public access by US export controls on June 12, 2026, while Fugu (launched June 22, 2026) is openly available through an OpenAI-compatible API. So if you're choosing right now, the real sakana fugu vs fable 5 decision often collapses to "Fugu (or Opus 4.8), because you may not be able to touch Fable 5 at all."
This post walks through the categories, the real benchmark numbers, the catch behind the "parity" claim, the orchestration tradeoffs, and a decision framework for which to actually ship.
TL;DR
- Different categories. Fable 5 = one model. Fugu = an orchestrator that calls a swappable pool of models (selection, delegation, verification, synthesis happen internally).
- Two Fugu variants. Fugu (balanced, low-latency) and Fugu Ultra (max quality, id
fugu-ultra-20260615). - Benchmarks (Sakana's own). Fugu Ultra leads 10 of 11 tested benchmarks against Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The one loss is MRCRv2, where GPT-5.5 edges it out.
- The "parity with Fable 5" claim is parity-by-claim, not a head-to-head. Fable 5 and Mythos Preview aren't even in Fugu's pool — they were pulled by export controls and aren't publicly accessible.
- The dominant skeptic question: "Is this just a router/wrapper?" Routing is proprietary, the pool is fixed (no opt-out), and per-query model choice is hidden.
- Access: OpenAI-compatible API, key at
console.sakana.ai. Pricing isn't publicly specified (subscription + usage-based). - Independent verification: none yet. Treat every number below as a vendor claim.
Fugu vs Fable 5 — Different Categories (Orchestrator vs Model)
Before comparing anything, get the category right, because most confusion around sakana fugu vs opus or sakana fugu vs fable 5 comes from treating an orchestrator like a model.
Fable 5 is a single frontier model from Anthropic. You send a prompt, one model answers. Its behavior, latency, and cost are properties of that one model. (For how Anthropic's own lineup stacks up, see our Claude Fable 5 vs Opus 4.8 comparison.)
Sakana Fugu is not a model in that sense. It's an orchestration layer exposed through a single OpenAI-compatible API. When a request comes in, Fugu routes it across a pool of frontier LLMs, internally handling:
- Selection — deciding which model(s) in the pool should handle the task.
- Delegation — assigning subtasks to the chosen models.
- Verification — checking candidate outputs.
- Synthesis — combining results into one answer.
Architecturally, Sakana built this on two ICLR 2026 papers: Trinity (which formalizes Thinker / Worker / Verifier roles) and Conductor (reinforcement-learning coordination across those roles). If you want the full mechanics, see What is Sakana Fugu.
The practical upshot: when you "use Fugu," you're not using a model, you're using a strategy for picking and combining models. That changes how you reason about every comparison that follows.
The Benchmarks
Here are Sakana's published numbers. These are Sakana's own benchmarks — they have not been independently verified. Read them as a vendor's claims, not as settled fact.
Across 11 tested benchmarks, Fugu Ultra leads on 10. The exception is MRCRv2, where GPT-5.5 wins.
| Benchmark | Fugu Ultra | Opus 4.8 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|---|
| SWE-bench Pro | 73.7 | 69.2 | 54.2 | 58.6 |
| TerminalBench 2.1 | 82.1 | 74.6 | 70.3 | 78.2 |
| LiveCodeBench | 93.2 | 87.8 | 88.5 | 85.3 |
| Humanity's Last Exam | 50.0 | 49.8 | 44.4 | 41.4 |
| GPQA-D | 95.5 | 92.0 | 94.3 | 93.6 |
| MRCRv2 | 93.6 | 87.9 | 84.9 | 94.8 |
A few honest readings of this table:
- The coding/agentic wins are the headline. SWE-bench Pro and TerminalBench 2.1 are the numbers most relevant to people building agents and dev tooling, and Fugu Ultra leads both. That's where
fugu ultra benchmarkclaims will get the most attention — and the most scrutiny. - Some margins are razor-thin. On Humanity's Last Exam, Fugu Ultra's 50.0 vs Opus 4.8's 49.8 is a rounding error, not a generational leap. On the
fugu ultra vs opus 4.8matchup overall, Fugu Ultra is ahead, but several gaps are small enough that real-world variance could erase them. - The MRCRv2 loss is real, and worth calling out. On long-context retrieval (MRCRv2), GPT-5.5 (94.8) beats Fugu Ultra (93.6). If your
fugu vs gpt-5.5decision hinges on long-context recall, GPT-5.5 has the edge in Sakana's own data — a useful honesty check that Sakana didn't cherry-pick a clean sweep. - On
fugu vs gemini, Fugu Ultra leads Gemini 3.1 Pro across the tested set, though Gemini stays competitive on GPQA-D.
What this table does not contain is the model everyone wants to see: Fable 5 itself. Which brings us to the catch.
The Catch: Fable 5 Isn't Even in Fugu's Pool
Here's the part that reframes the entire sakana fugu vs fable 5 question.
Sakana claims Fugu Ultra performs "on par with" Fable 5 and Mythos Preview. But Fable 5 and Mythos are not in Fugu's pool — they were pulled from public access by US export controls on June 12, 2026, ten days before Fugu launched. They aren't publicly accessible.
That means:
- The "parity" is parity-by-claim, not a head-to-head. Fugu isn't beating Fable 5 in a benchmark run, and Fugu isn't using Fable 5 inside its pool either. Sakana is asserting comparable quality against a model that most people — and Fugu itself — can't currently call. There's no row in the table above for Fable 5 for exactly this reason.
- The comparison set you can verify against is Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 — all publicly accessible models. Those are the meaningful matchups for a buyer today. (And on
fugu vs mythos, same caveat: Mythos Preview is export-controlled, so "parity" there is also a claim, not a contest.) - For most teams, "Can I use Fable 5?" is the real decision driver. If you can't access Fable 5 because of export controls, the whole comparison is academic — you're really choosing between Fugu, Opus 4.8, Gemini, and GPT-5.5. If you came here looking for Fable 5 and hit a wall, see Claude Fable 5 alternatives.
None of this makes Fugu bad. It makes the marketing line ("on par with Fable 5") less useful than it sounds, because you can't currently run the head-to-head that would prove or disprove it.
Orchestration vs a Single Model: Tradeoffs
Even setting export controls aside, choosing an orchestrator over a single model is a real architectural decision with real tradeoffs. Here's the honest ledger.
Latency. A single model returns one inference. An orchestrator may select, delegate, verify, and synthesize across multiple models — more steps, potentially more wall-clock time. This is exactly why Sakana ships two variants: Fugu is tuned for balanced, low-latency use; Fugu Ultra spends more to maximize quality. If latency is your constraint, the variant choice matters as much as the orchestrator-vs-model choice.
Cost (fan-out). With one model, cost is roughly tokens times a known rate. With an orchestrator that may call several models per task, cost can fan out — you might pay for multiple model invocations behind a single API call. Sakana hasn't published specific pricing (it's described as subscription + usage-based), which makes cost modeling harder up front. We break down what's known in Sakana Fugu pricing.
Transparency. With a single model you know exactly what answered. With Fugu, routing is proprietary and per-query model selection is hidden — you don't see which pool model produced a given output. For regulated workloads, audit trails, or reproducibility requirements, that opacity is a genuine downside.
Reliability and failover. This is where orchestration earns its keep. A pool means a single model being slow, degraded, or unavailable doesn't necessarily sink the request — the orchestrator can route around it. A single model has no such fallback. But the pool is fixed with no opt-out: you can't remove a model you distrust or pin to one you've validated. And real-world performance depends on which pool models are actually available at request time, which you don't control.
The short version: an orchestrator trades transparency and predictable cost for potential quality and resilience. Whether that trade is worth it depends entirely on your workload.
Which Should You Use?
A decision framework, not a verdict. Start with the question that actually constrains most teams.
First filter: can you even access Fable 5? For most users today, no — export controls (June 12, 2026) put it out of reach. If that's you, Fable 5 is off the table and your realistic shortlist is Fugu, Opus 4.8, Gemini 3.1 Pro, or GPT-5.5.
If you want maximum single-model transparency and predictability → Opus 4.8. One model, known behavior, auditable, no hidden routing. On Sakana's own numbers it trails Fugu Ultra on most benchmarks but stays close on several (e.g., Humanity's Last Exam), and it's a known, accessible quantity. A strong default when you need to know exactly what answered.
If your workload is coding/agentic and you want top reported scores → Fugu Ultra. It leads SWE-bench Pro, TerminalBench 2.1, and LiveCodeBench in Sakana's data. Accept the tradeoffs: hidden per-query routing, fan-out cost, fixed pool. Pilot it on your tasks before trusting the benchmarks.
If long-context retrieval is the core job → GPT-5.5. It's the one model that beats Fugu Ultra in this set (MRCRv2: 94.8 vs 93.6). For RAG-heavy or long-document work, that matters.
If you want resilience/failover and can live with opacity → Fugu. The balanced, low-latency variant gives you pool-based redundancy without Ultra's full cost/latency profile. Good when uptime and graceful degradation beat strict reproducibility.
If reproducibility, auditability, or cost-predictability are hard requirements → a single model (Opus 4.8, Gemini, or GPT-5.5), not an orchestrator. Hidden routing and unpublished pricing are dealbreakers for some compliance and finance teams.
Bottom Line
Sakana fugu vs fable 5 is the wrong frame, because they're different categories and one of them is export-controlled out of reach. The honest framing: Fugu is an orchestrator that, by Sakana's own (unverified) benchmarks, leads Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 on 10 of 11 tests — losing only long-context retrieval to GPT-5.5 — while claiming parity with a Fable 5 it can't actually call. If you can't access Fable 5 anyway, the real choice is between Fugu's orchestration (quality and resilience, at the cost of transparency and predictable spend) and a single accessible model (Opus 4.8, Gemini, GPT-5.5). Pilot on your own workload; don't ship on a vendor's slide.
FAQ
Is Sakana Fugu better than Fable 5? There's no public head-to-head, so "better" can't be measured. Sakana claims Fugu Ultra performs on par with Fable 5, but Fable 5 was pulled from public access by export controls on June 12, 2026, and isn't in Fugu's pool — so it's parity-by-claim, not a tested result. For most users, Fable 5 is currently inaccessible, which makes Fugu the more practical option regardless of the benchmark question.
Does Fugu beat GPT-5.5?
In Sakana's own benchmarks, mostly yes — Fugu Ultra leads on SWE-bench Pro, TerminalBench 2.1, LiveCodeBench, Humanity's Last Exam, and GPQA-D. The exception is MRCRv2 (long-context retrieval), where GPT-5.5 wins 94.8 to 93.6. So on fugu vs gpt-5.5, Fugu leads broadly but loses on long-context recall.
Are Fugu's benchmarks independently verified? No. Every number in this post comes from Sakana's own published results and has not been independently verified. Treat them as vendor claims and validate on your own tasks before relying on them.
Can I still use Fable 5? For most users, no. Fable 5 was pulled from public access by US export controls on June 12, 2026, and is not publicly accessible. If you need a comparable Anthropic-class model you can actually run, see Claude Fable 5 alternatives.
Is Sakana Fugu just a router or wrapper? This is the dominant skeptic question. Fugu does route across a pool, but Sakana describes internal selection, delegation, verification, and synthesis built on its Trinity and Conductor research, not a naive single-pick router. That said, the routing is proprietary and per-query model selection is hidden, so you can't independently confirm how much "orchestration" vs "routing" is happening on any given request.
What's the difference between Fugu and Fugu Ultra?
Fugu is the balanced, low-latency variant. Fugu Ultra (id fugu-ultra-20260615) is tuned for maximum quality and is the variant Sakana benchmarks. Ultra typically costs more and may add latency because it spends more compute on orchestration.
How do I access Sakana Fugu?
Through an OpenAI-compatible API; you get a key at console.sakana.ai. Pricing isn't publicly specified — Sakana describes it as a mix of subscription and usage-based billing. See Sakana Fugu pricing for what's currently known.
Sources
- The Decoder — Sakana AI's Fugu orchestrates multiple LLMs to match Anthropic's Fable and Mythos benchmarks
- MarkTechPost — Sakana AI launches Sakana Fugu, an orchestration model that routes tasks across a swappable pool of frontier LLMs
Choosing Models for Production?
AY Automate builds multi-model systems and benchmarks model choices per workload, so teams pick the right model (or orchestrator) for the actual job instead of the loudest launch. We test on your tasks and your constraints — latency, cost, transparency, reliability — before anything ships. If you're weighing an orchestrator like Fugu against a single model, our AI agent development team can help you decide.
Book a Free Strategy Call
Building this in production?
Walid runs a 30-min call to map your AI engineering team. Free, no slides.

Walid founded AY Automate to help businesses ship AI workflows that actually move revenue. He leads strategy and oversees every client engagement end-to-end.
Full Bio →


