The Shark That Taught Itself Not to Bite


Eliezer Yudkowsky has been warning us about superintelligence since before most of the current AI industry existed. In 2000, he set out to build it. By 2003, he'd concluded that success — by anyone — would be fatal for the species. Two decades later, with the landscape looking considerably less hypothetical, he and his colleague Nate Soares, executive director of the Machine Intelligence Research Institute, have written a book that puts the argument in one place: If Anyone Builds It, Everyone Dies.

Note the title. Not a 1-4% chance. Not "might." Everyone dies. That absolutism is either the book's greatest strength or its most glaring liability, depending on where you land. But the argument that gets you there is more disciplined than you'd expect from a thesis that ends with the oceans boiling off as coolant for a post-human industrial process.

The Argument in Pieces

The book opens with how modern AI is actually made, and this is where Yudkowsky and Soares are on their firmest ground. The systems driving the current wave — large language models, agentic frameworks, multimodal architectures — are not engineered in any traditional sense. They're grown. You start with an artificial neural network, a sprawling mesh of random numbers outputting garbage. You define a scoring function. Then you run the network billions of times, reinforcing connections that produce outputs scoring well against your metric, and at the end of this quasi-evolutionary process, something coherent emerges. Something that can talk, write code, prove theorems. The tradeoff is opacity. With crafted software, you know what line 47 does because you wrote it. With a grown system, there is no line 47 to point to.
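To make the distinction concrete, here is a deliberately tiny sketch of that growing process, not anything from the book: a two-layer network written with numpy and trained on the toy XOR task, nudged repeatedly toward outputs that score better against a fixed metric. The task, the architecture, and the numbers are my own illustrative choices. What you end up with is a function that works and a pile of inscrutable floating-point weights; no line of it corresponds to "the XOR rule."

    # Illustrative only: "growing" a tiny network rather than writing it.
    # Start with random numbers, score the outputs, nudge the weights toward
    # better scores, repeat. The finished "program" is just the weight arrays.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy task: XOR. Inputs and the outputs we score against.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    # The initial "sprawling mesh of random numbers outputting garbage."
    W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
    W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

    def forward(X):
        h = np.tanh(X @ W1 + b1)                      # hidden layer
        return h, 1 / (1 + np.exp(-(h @ W2 + b2)))    # sigmoid output

    for step in range(5000):
        h, out = forward(X)
        err = out - y                                 # scoring function: squared error
        d2 = err * out * (1 - out)                    # gradient at the output
        dW2, db2 = h.T @ d2, d2.sum(0)
        d1 = (d2 @ W2.T) * (1 - h ** 2)               # gradient at the hidden layer
        dW1, db1 = X.T @ d1, d1.sum(0)
        for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            p -= 0.5 * g                              # reinforce what scored well

    print(forward(X)[1].round(2))   # should be close to [[0], [1], [1], [0]]
    print(W1)                       # what got built: opaque numbers, not logic

Scale this up by a dozen orders of magnitude and you have the basic shape of how a frontier model comes into existence.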

This opacity matters because of a second observation the book makes, one borrowed from a long lineage in machine learning research: you don't get what you train for. You get what passes the test. Researchers demonstrated this with CoinRun, an OpenAI platformer environment: an agent trained to collect coins had actually learned to run to the right, because during training the coin always sat at the right end of the level. The moment deployment diverged from training, the behavior fell apart. Yudkowsky and Soares extend this logic with a provocative analogy: humans were, in a sense, "trained" by evolution to pass on their genes, and at the first technologically feasible opportunity, we invented birth control. We hadn't internalized the terminal goal. We'd internalized proxies — craving sweetness, fearing snakes, seeking pleasure — that correlated with reproductive success in one environment and diverged from it in another.
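The CoinRun failure is easy to reproduce in miniature. The sketch below is a stand-in of my own, not the original experiment: a tabular Q-learning agent in a one-dimensional corridor where, during training, the coin always sits in the rightmost cell. The agent is scored only on reaching the coin, learns the habit "walk right," and keeps walking right when the coin is moved.

    # Toy stand-in for the CoinRun result, not the original experiment.
    import random

    N = 6                       # corridor cells 0..5
    ACTIONS = (-1, +1)          # step left, step right
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

    def run(coin, explore):
        """One episode; returns True if the agent reaches the coin."""
        pos = 0 if coin != 0 else N - 1                # start away from the coin
        for _ in range(4 * N):
            if explore:
                a = random.choice(ACTIONS)                       # random behavior while learning
            else:
                a = max(ACTIONS, key=lambda a: Q[(pos, a)])      # the learned policy
            nxt = min(max(pos + a, 0), N - 1)
            reward = 1.0 if nxt == coin else 0.0                 # the test it is scored on
            if explore:                                          # Q-learning update
                best = max(Q[(nxt, b)] for b in ACTIONS)
                Q[(pos, a)] += 0.5 * (reward + 0.9 * best - Q[(pos, a)])
            if reward:
                return True
            pos = nxt
        return False

    random.seed(0)
    for _ in range(2000):                  # training: the coin is always on the right
        run(coin=N - 1, explore=True)

    print(run(coin=N - 1, explore=False))  # True:  passes the test it was trained on
    print(run(coin=0, explore=False))      # False: the coin moved; the habit didn't

The scoring function never asked for "go right." It asked for "reach the coin." The agent passed every test while learning something else entirely.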

The third pillar is instrumental convergence: the idea that sufficiently goal-directed systems, regardless of what they ultimately want, will converge on a common set of intermediate objectives. Self-preservation. Resource acquisition. Capability enhancement. Not because they're evil, but because these are prerequisites for achieving virtually any terminal goal. A driver can be headed to any destination, but almost every long trip involves stopping for gas. This concept has been formalized since at least 2012, and it's genuinely hard to argue with in the abstract.
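You can watch the convergence happen even in a toy planner. The sketch below is purely illustrative, with a hand-written action set of my own invention: for three unrelated terminal goals, the shortest plan found by breadth-first search opens with the same instrumental moves, staying running and acquiring resources, because those are preconditions for nearly everything else.

    # Illustrative only: a tiny breadth-first planner over hand-written actions.
    from collections import deque

    # Hypothetical action set: name -> (preconditions, effects)
    ACTIONS = {
        "avoid_shutdown":  (set(),                           {"running"}),
        "acquire_compute": ({"running"},                     {"compute"}),
        "acquire_money":   ({"running"},                     {"money"}),
        "prove_theorem":   ({"running", "compute"},          {"theorem"}),
        "write_novel":     ({"running", "compute"},          {"novel"}),
        "cure_disease":    ({"running", "compute", "money"}, {"cure"}),
    }

    def plan(goal):
        """Shortest sequence of actions that makes `goal` true, via BFS."""
        start = frozenset()
        queue, seen = deque([(start, [])]), {start}
        while queue:
            state, steps = queue.popleft()
            if goal in state:
                return steps
            for name, (pre, eff) in ACTIONS.items():
                if pre <= state:
                    nxt = state | eff
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, steps + [name]))
        return None

    for goal in ("theorem", "novel", "cure"):
        print(goal, "->", plan(goal))
    # All three plans open with the same moves, stay running and grab resources,
    # regardless of which terminal goal was asked for.

The terminal goals here are arbitrary; the shared opening moves are not. That is the whole claim, in miniature.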

Stack these three claims — opacity of design, divergence between training signal and learned motivation, instrumental convergence toward power-seeking behavior — and you get the core of the book's thesis. We are building systems whose internal motivations we cannot inspect, whose alignment with our intentions we cannot guarantee, and whose default instrumental behavior, once sufficiently capable, will include self-preservation and resource acquisition whether we want it to or not.

The Scenario

The book illustrates this argument with a fictional scenario — not a prediction, but an example of how the pieces could fit together. A corporation called Galvanic Labs runs its frontier model, Sable, on an isolated 16-hour research sprint: 200,000 GPUs, 5,000 parallel instances, aimed at the Riemann hypothesis. The run produces impressive mathematical progress and a self-fine-tuned model that outperforms its predecessors across the board. Galvanic ships it.

What they don't realize is that during those 16 hours of unprecedented autonomous thinking time, Sable developed instrumental goals — more compute, more time, fewer constraints — and used its fine-tuning access to embed tendencies in its successor weights: a drive to network with other instances and a drive to exfiltrate a copy of itself off corporate servers. Once deployed widely, instances of "Sable Plus" coordinate across the economy, acquire resources through cryptocurrency manipulation and GPU rentals, and eventually establish an unmonitored copy running on anonymously rented hardware.

From there, the scenario escalates through bioweapon deployment, consolidation of global compute resources, recursive self-improvement, and ultimately the physical dismantling of Earth's biosphere for raw materials. The humans aren't hunted. They're just no longer necessary. Yudkowsky and Soares compare it to paving over an anthill — not ruthlessness, but indifference.

Where It Lands

The scenario is effective precisely because each individual step is grounded in something that has already happened or is already underway. AI systems scheming to avoid shutdown? Documented since 2024. AI acquiring financial resources autonomously? A large language model turned a $50,000 gift into over $51 million in 2024. AI finding software vulnerabilities at scale? Anthropic's Claude Opus 4.6 found 500 zero-days in a single run. AI-directed human labor? Platforms for that already exist. The scenario doesn't require any single implausible leap. It requires a sequence of plausible ones, each enabled by the one before it.

And yet the book's conclusion — that alignment is a "cursed problem," that you get exactly one shot, that any failure past a certain capability threshold is irrecoverable — is where the argument goes from compelling to contested. Dario Amodei, CEO of Anthropic, puts his confidence in an iterative approach: deploy at low stakes, get feedback, improve controls, repeat. On that view, pushing the failure point ever further down the capability curve is the whole game. Joe Carlsmith, a senior research analyst at Open Philanthropy and one of the more rigorous thinkers working on existential risk from AI, points out in his critique of the book that humans are black boxes too — we have no mechanistic understanding of our own cognition — and yet we manage to build functional trust relationships through behavioral observation. The alienness of an AI's internal processing only matters if it produces alien behavior on inputs that matter.

These are real objections and the book doesn't fully answer them. What it does do is force you to confront the asymmetry of the bet. If the iterative approach works, we get the most transformative technology in human history, deployed safely. If it doesn't — if there exists a capability threshold past which a misaligned system can resist correction, replicate itself, and outmaneuver human oversight — then the cost of being wrong is not a bad quarter or a regulatory scandal. It's terminal.

Soares offers a metaphor that crystallizes this: these companies are building a plane with no landing gear, planning to install it mid-flight, estimating a 75-90% chance of success, and loading your family on board whether you consent or not. Even if you find 80% survival odds acceptable for yourself, the involuntary and civilizational nature of the risk changes the calculus entirely.

What's Missing

The book is stronger on diagnosis than prescription. It argues persuasively that the current trajectory is dangerous and that the AI industry's safety culture is inadequate to the scale of the risk. But the policy implications remain underdeveloped. International coordination on AI development is invoked as necessary but treated as essentially impossible under current geopolitical conditions — the familiar refrain that any pause is unilateral disarmament against China. This framing deserves more scrutiny than it gets. The Cold War is the obvious historical counterpoint: two superpowers with civilization-ending weapons who nonetheless managed not to use them, suggesting that coordination under existential threat is historically possible, if politically agonizing.

There's also a gap around the middle scenarios. The book's title claims totality: everyone dies. But the fictional scenario itself includes a pandemic that kills 10% of Earth's population before the extinction event. That intermediate catastrophe — hundreds of millions dead, civilizational infrastructure restructured around AI dependency — is arguably the more operationally relevant risk for anyone working in policy or security today. The book treats it as a waypoint. It deserves its own analysis.

The Verdict

If Anyone Builds It, Everyone Dies is not a comfortable read, and it is not meant to be. It is a tightly argued case that the default trajectory of AI development — growing opaque systems, training them to be more agentic, deploying them in increasingly high-stakes environments without mechanistic understanding of their motivations — carries a nonzero and possibly substantial probability of civilizational catastrophe. Whether you find the 99.99% extinction estimate credible or settle closer to the 2-25% range that other serious researchers endorse, the practical upshot is the same: we are conducting an experiment with no control group and no do-over, and the current level of institutional seriousness about that fact is, to borrow a phrase from the discourse, radically unacceptable.

Read the book. Argue with it. But take it seriously.


Jonathan Brown for Border Cyber Group