EPISODE 2026-06-17

AI:AM LIVE — June 17, 2026 — Math, Biosecurity, and World Models: Carina Hong, Doni Bloomfield, Sam Pasupalak

The open tracked the model layer commoditizing — OpenAI reportedly dropping below 50% share, the AI buildout outrunning cash flow, and AI starting to run physical research labs. Then three guests on AI's hard problems: Carina Hong of Axiom Math on formally verified mathematical AI; Fordham law professor Doni Bloomfield on whether export-control law has become America's de facto AI licensing regime; and Skyfall AI's Sam Pasupalak on enterprise world models as the answer to what comes after LLMs.

▶ Full show on YouTube 𝕏 Live broadcast

Wednesday's show paired a market-and-infrastructure open — commoditizing models, the AI buildout's financing wall, and AI agents running physical research — with three guests working on AI's hard problems: formally verified math, the law of AI risk and biosecurity, and enterprise world models.

The rundown

12:05Opening23 min
Opening: AI's Hard ProblemsThree threads from the morning's feed: the model layer commoditizing as OpenAI's share slips and Google surges, a warning that the AI buildout can't fund itself much longer, and AI agents starting to run research in the physical world.
Watch
As aired
Nathan opened by flagging a personal inflection point: his Fable access has been suspended for five days, and rather than defaulting to Opus, he's been banking harder prompts for when he gets the best model back. He noted, with some surprise, a real decline in instruction-following fidelity from Opus over the same window — not blaming quantization or 'nerfing,' but unable to fully explain it either. He connected it to a broader point about the value of staying at the frontier, then flagged that all the major lab leaders are at the G7 this week, making a resolution unlikely soon. The Fable situation, he noted, will be the subject of a later segment with law professor Doni Bloomfield on the legality of export controls as applied here.
The hosts then dug into Sensor Tower's State of AI 2026 report, which shows Gemini climbing fast toward ChatGPT on daily unique users — now at roughly 700 million monthly actives versus ChatGPT's billion-plus. Prakash offered an important methodological caveat: app-activity metrics measure installs and taps, not intensity of use. Meta AI, he pointed out, barely registers here despite living inside WhatsApp and Instagram — because those users get counted under WhatsApp and Instagram, not a standalone AI app. Nathan extended the point to revenue and tokens: Anthropic's roughly $47 billion annual run rate with a tiny share of the graph shows how far user count is from pricing power. He also flagged that Gemini Flash's dramatic pricing advantage has mostly evaporated — input tokens have gone up roughly 15x — eroding one of his main reasons to route large jobs there.
Prakash introduced GLM-5.2, a Chinese open-source model he described as closing in on frontier performance and roughly equivalent to GPT-5.5, trailing the best closed models by perhaps three to six months. Nathan's take was characteristically pragmatic: for personal use on his existing Claude Max and Codex subscriptions, he doesn't actually save anything by switching, and he'd have to trust a new cloud inference provider (Fireworks, Together AI) with his logs — extra complication for unclear benefit. He called it a strong sign of how hard the open-source commoditization play is when even a very capable model can't clear the switching-cost bar for a power user. The segment closed with Prakash introducing NVIDIA's ENPIRE project — Jim Fan's team giving eight Codex agents access to robot arms to run fully autonomous physical research — which prompted Nathan to reflect at length on his days as an undergrad chemistry research assistant, essentially doing a human parameter sweep of small powder quantities, and how even an imperfect robot capable of doing that work at 95% accuracy and running 24 hours a day would have compressed a year of chemistry exploration into a month.
Key moments
The longer it goes and the less we hear, the more it's becoming clear that it's kind of just petty and silly, and not befitting the government of a great nation like the United States aspires to be.
Nathan Labenz14:49
The fact that they're at most recently reported $47 billion annual run rate and only have the sliver that they have on this graph does go to show that counting users is not the same as counting tokens, counting revenue, counting pricing power.
Nathan Labenz18:58
What took us a year to go through and explore in chemical space easily could have come down to a month if you could get this robot thing working at 95%. We would have accepted some errors too.
Nathan Labenz33:30
What we covered
OpenAI drops below 50% — is the model layer commoditizing? The claim going around: OpenAI's share fell below 50% for the first time, with Google eating in because regular users can't tell ChatGPT from Gemini and Google owns their ecosystem. If the model is a commodity, distribution wins, not benchmarks.
BURKOV
@burkov
·Follow
OpenAI's market share drops below 50% for the first time. Google is quickly eating into it. The pure LLM business doesn't have stickiness. Regular people don't see a difference between ChatGPT and Gemini, and because they already have their entire online ecosystem with Google, Show more
7:23 PM · Jun 16, 2026
732
Reply
Read 90 replies
Miles Brundage
@Miles_Brundage
·Follow
Journalists are strangely incurious about what’s going on with Gemini… The people want to know
3:05 AM · Jun 17, 2026
188
Reply
Read 14 replies
The end of the self-funded AI buildout? Epoch AI flagged that hyperscaler cash capex is growing far faster than cash inflows — and on current trends they can't fund the buildout from operations by year-end. The moment the buildout stops being self-financed: prelude to a bigger boom, or first crack in the bubble.
Epoch AI
@EpochAIResearch
·Follow
The end of the self-funded AI buildout? Hyperscaler cash capex is growing much faster than cash inflows. On current trends, they will be unable to fully fund the AI infrastructure buildout with cash from operations by the end of this year.
6:45 PM · Jun 16, 2026
415
Reply
Read 12 replies
AI runs the lab now — robot researchers and wet-lab-validated design. Two announcements pushed AI past 'research on a screen': NVIDIA's ENPIRE handed eight Codex agents a fleet of robots, GPUs, and a token budget to run physical experiments, and Boltz shipped protein and small-molecule design models with real wet-lab validation and an agent-callable API.
Jim Fan
@DrJimFan
·Follow
Today, we enable AutoResearch in the physical world for the first time! Introducing ENPIRE: we give 8 Codex agents a fleet of robots, an allocation of GPUs, and generous token budget. We set them free with a simple goal: solve the task as quickly as possible, keep the robots busy Show more
Watch on X
4:31 PM · Jun 16, 2026
3.8K
Reply
Read 169 replies
Gabriele Corso
@GabriCorso
·Follow
Big news from Boltz - our biggest update yet! 🚀 Today we’re releasing two new state-of-the-art models for protein and small molecule design with extensive wet lab validation and a new API to run all of our models on scalable GPUs wherever you (or your agents) work! 🔥
Watch on X
6:13 PM · Jun 16, 2026
1.1K
Reply
Read 32 replies
Full transcriptLightly edited · timestamps jump to YouTube
12:05
Prakash: Good morning. It is Wednesday, June 17, 9:08 AM. Nathan, good morning. What's happening in the world of AI today?
12:17
Nathan Labenz: Good morning, Prakash. Well, I'm still just waiting for Fable's return, mostly. It is funny — my usage has plummeted in the last five days, and I'm kind of banking some larger, more ambitious prompts I want to run when I get access restored. Still doing research for the show on a daily basis and putting together outlines and things like that. But it has been striking to feel like, man, I just don't even want to bother until I can use the best model to really advance things.
13:00
And I'm not really sure — I don't know if you've noticed anything similar — but one thing that has been really funny is that even in doing more routine tasks, suddenly I feel like Opus is not doing as well. I'm never one of the people out there saying 'oh, they nerfed it' or 'they quantized it without telling us.' I did have Fable do a holistic upgrade of a bunch of stuff, and I don't know if it maybe changed some instructions or used some new pattern that Opus isn't following as well. But I definitely have noticed a decline in the fidelity of instruction following from Opus in these last few days, even on things that are pretty routine.
13:56
It has just deepened my sense that it's best to be at the frontier. And certainly I can't wait to get this Fable situation resolved. I doubt that'll be happening today, because all the lab leaders are at the G7 and currently, I believe, doing some sort of working session with a bunch of heads of state. So that's probably not going to be the forum in which the details of how to get Fable restored get worked out.
14:49
We're getting pretty close to my over-under, which was the end of this week, for when we would have it back — still with very little in the way of explanation or justification for what has happened here. The longer it goes and the less we hear, the more it's becoming clear that it's kind of just petty and silly, and not befitting the government of a great nation like the United States aspires to be. So it's disappointing for sure.
15:10
This will be one of the topics we'll go into with our guest today. We're going to have an interesting conversation with a law professor, Doni Bloomfield, who has written a lot about export controls and will hopefully help us understand: is this legal? Can they do this? There are, I think, some interesting nuances I don't understand well around what can and can't be export controlled — and this doesn't obviously fall on the side of the line where it can. So I'm very interested to dig into that when we get there. That'll be in our middle session.
15:40
A couple other news items that jumped out at me, and then I'll ask for yours. One: there's an interesting new report out from Sensor Tower — sensortower.com, you can go check out their State of AI Report 2026. They're showing really pretty big and sustained growth in market share for Gemini, coming somewhat at the cost — although when you read these graphs, it's always important to double-click into what exactly the metric is. This is daily unique users across different providers. ChatGPT still has the largest single user base — I believe it's the only one they're reporting at over a billion monthly active users. But Gemini, with the distribution advantages that Google has, is climbing fast, definitely closing the gap. The ChatGPT line of just user numbers has not fallen — in fact, it has continued to grow, albeit a bit slower than before Gemini competition. But Gemini is definitely leveraging all the distribution points and moats that they have, and they're now up to about 700 million monthly active users.
17:31
Prakash: One thing to note is that Sensor Tower measures app activity — the number of people using these apps on a daily or monthly basis. It doesn't really measure intensity in a sense. So if someone were just using the Gemini app for simple searches, that automatically gets counted. And obviously Google is pushing people into the app in order to gain market share.
18:10
I find it very surprising that Meta has a very small value here, because Meta is pushing through WhatsApp the Meta AI. So Meta AI is inside WhatsApp and inside Instagram, and I think those numbers are not actually being shown here because they're classified under Instagram or WhatsApp directly. So I would say these numbers all have meaning, but the meaning is not as strong as 'hey, this is the most useful AI that everyone is recognizing and using.' It's a metric people do watch, but it's not the be-all and end-all.
18:58
Nathan Labenz: Yeah, for sure. I mean, Anthropic would need to show up bigger in any metric that we would consider to be the metric. The fact that they're at most recently reported $47 billion annual run rate and only have the sliver that they have on this graph does go to show that counting users is not the same as counting tokens, counting revenue, counting pricing power.
19:28
Probably the vast majority of that Gemini use is free — as part of just a general free Google account or at some Gemini distribution point across their vast product surface, whether that's AI mode in search or increasingly in your Gmail, in your Docs, whatever. I don't know how many people are breaking out the Gemini app specifically. But the intensity of use from the Claude contingent is going to be way, way more intensive.
20:00
I mean, where are the tokens actually going? At this point it's gotta be north of 90% to Anthropic still. I'm trying to diversify a bit — if only to stretch the Fable allocation and get it supplemented by Codex for coding. And sometimes, although it's been interesting with Gemini recently, the pricing advantage has been reduced significantly. I used to have this policy that I advocated for of 'flash everything,' because Flash was like 10 cents per million tokens input. And at that price, it was like — I even told people: if you're building a chatbot, if your corpus is a million tokens, you could run that whole thing through Gemini every single time if you wanted to, and you're still only costing yourself 10 cents per inquiry. So who needs all this chunking and fancy strategies when you could just flash everything?
21:10
Well, now Flash has gone up, I think, 15x on the input token price, so that's maybe not so viable anymore. Some of the reasons I used to send large-scale jobs to Gemini have definitely declined. We don't have the token version of this chart, but it would definitely be interesting to see it. And I bet Claude would be right there with the others in terms of overall tokens processed.
21:56
Prakash: Speaking of pricing — GLM-5.2, which is an open-source model from a company in China called Zhipu AI, is closing in on the frontier here. They're starting to show that they're very, very close to frontier performance. In fact, in some tasks on design arenas, they were actually ahead of US closed-source models. We're starting to see models which are extremely capable and catching up with US models in a three-to-six-month time frame.
22:50
I would rate these guys as about equivalent to GPT-5.5 right now — roughly three to six months behind the frontier. This obviously has repercussions because the margin you can charge on inference alone, without a closed-source model, is much smaller than the margin you can charge with a closed-source model. Would you recommend someone start trying out GLM-5.2?
23:30
Nathan Labenz: Yeah. My guess is — I mean, I've done this a few times and I'll probably do it again — I always come away feeling like it's not really worth it.
23:44
My workloads are not huge. I'm paying the $200 for Claude Max. I've got the $200 Codex. So I've got all these tokens from the frontier companies that I've already paid for. I'm not actually going to save anything by moving to GLM.
24:10
Usually when I go look at these things, I'm thinking: first of all, who do I trust to do the inference? I can't run this on my Mac Mini — it's too big. I have to do it somewhere in the cloud. I do have a high impression of several companies in the inference space — like Fireworks. They do a great job of just being there, incredibly quick to have new models live. They're pretty cheap, and they do some interesting stuff with LoRA fine-tuning.
24:52
Just looking at their pricing right now: GLM-5.2 is similar to Gemini Flash — $1.40 per million tokens in, with roughly an 80% cost reduction on cache hits, about 26 cents per million tokens on a cache hit. So it's on the level of Flash. My guess is it probably performs roughly along the lines of Flash in a non-benchmark setting, maybe even a little bit worse.
25:30
I think it's hard for these companies — that don't own the inference infrastructure, that don't have the customer relationships, that aren't getting the feedback volume the frontier companies are — to handle the spikiness issue in the same way. So you're going to get, on kind of random one-off queries, probably a ton of value from a GLM. It's all magic compared to a year ago — we're definitely on this intelligence hedonic treadmill that's changing perspective by the week. But I haven't found the reason that I need to add this to my stack.
26:31
I'm not running it locally. Maybe I should invest in, like, an actual $5,000 machine that could run this locally, and then I'd be a couple solar panels away from having real autonomy — and nobody can take it back. That is appealing in a world where we've obviously seen the depression that ensues when a model is yanked out of my hands.
27:10
But aside from that, if I'm doing it in the cloud for myself and it's roughly Flash price, now I've got all my logs around the Fireworks or Together AI infrastructure. These are companies I think do a great job. For personal use, though, I don't quite see it. It still feels like it's going to be something more of a dedicated, probably fine-tuned model for your particular use case that you're scaling — that you understand really well. And the extra complication and additional parties I'm bringing into the stack just overall doesn't feel worth it.
28:03
Prakash: That's not very optimistic for the smaller open-source firms. It sounds as though the switching cost is going to be high when we thought the switching cost was basically going to be zero. That just shows how difficult that game might end up being.
28:20
I'm going to segue here to something very interesting that happened yesterday. Dr. Jim Fan at NVIDIA posts about ENPIRE — where the team gave eight agents in Codex access to these robotics arms and asked them to start doing research. They were allowed to do fully autonomous research. And this is their research wall clock and how much better they got.
29:10
This is fairly typical robotics research where you need to do tasks like putting pins into a tray, picking up a mobile phone and slotting it into a CPU chassis, threading zip ties, and cutting that zip tie with a pair of pliers. Very small industrial tasks. They unleashed Codex with these robotic arms and asked Codex to improve itself. The only instruction given was: do the task as quickly as possible, keep the robots busy but stay safe, and don't waste precious compute. Classic. What does this reveal about how robotics research might look in the future?
30:31
Nathan Labenz: Well, they're open-sourcing this too — that was another thing that really jumped out at me. And what I was struck by is it seems like we're hitting this threshold where there are an awful lot of research groups doing all kinds of different research that could potentially experience a dramatic acceleration if they can just get robot arms to do some pretty idiosyncratic, esoteric long-tail stuff — but it's the stuff that they need to do.
31:10
This really echoes the pattern of development we heard from OpenAI's deployed engineers not too long ago: you set something like this up, have it do the task, see where it messes up, give it a little feedback, it updates its documentation, it writes a better routine next time. And before you know it, you can probably get to a pretty reliable level of performance on a whole bunch of different things.
31:40
When I was an undergrad research assistant in chemistry, I used to joke that my life looked more like the life of a low-level drug dealer than it did like a scientist. Because if you just watched what I was doing, I was mostly weighing out very small amounts of fine powders. We were doing a reaction development — very much a parameter sweep, basically, in analogy to what goes on in machine learning. Chemical parameter sweep: what if we add a little bit more of this reagent? What if we add a little bit less? We would just set up these assays, hold everything constant, and vary one thing — four, five, six, seven different values.
32:29
Put them all in the same bath, take all the same measurements at the same timestamps. And I used to dream of automating that stuff. But it was very long-tail and very prone to change — there would just be these little variations from one generation to the next. Whenever we captured some optimization or decided to do it just a little bit of a different way, it felt like our scale was too small and the pace of change of the process was too high. We could never automate it. And plus, I didn't cost that much as an undergrad research assistant.
32:40
Now to see this world where a couple of robot arms maybe costs about as much as I cost for a year as an undergrad research assistant — I'd like to still be in science, if I had had the opportunity to, instead of doing all that weighing out, coach and iterate and refine the robot arm to the point where it could do it, and then come in next time and say: 'Actually, we want to add these two powders in a different order — can you just make that change?' And boom, it makes the change. That is such an unlock.
33:30
These things could then run 24 hours a day. The throughput — I think we would have accelerated our work easily by a multiple, just based on letting the robots sit there and set up experiments and do the parameter sweeps on a 24-hour basis. What took us a year to go through and explore in chemical space easily could have come down to a month if you could get this robot thing working at 95%. We would have accepted some errors too. So I think that's super exciting.
34:20
The Cambrian explosion of robot-assisted scientists coming to labs that have tens of thousands of dollars of budget to throw at it — that's a layer of AI acceleration that will be quiet in all the places it happens, but potentially quite loud and very impactful as it plays out in all these different spaces.
34:59Interview29 min
Mathematical Superintelligence: Can Proofs Make AI Reliable? — Carina HongCarina HongCarina Hong is the founder and CEO of Axiom Math, building 'Verified AI' — models that produce formally checkable proofs in Lean. Fresh off a $200M round, she argues formal verification is the missing reliability layer for non-deterministic AI.
Watch
As aired
Carina Hong opens by tracing why formal verification became the animating idea behind Axiom Math. She recounts attending the January 2025 Joint Mathematics Meetings — where the sole topic on every mathematician's lips was Lean, the formal proof language started at Microsoft in 2013 — and explains how the community's question, 'will a Lean-based prover ever beat an informal model on a math olympiad?', shaped her company's bet. When AxiomProver hit a perfect 120/120 on the Putnam exam in December 2025, becoming what she calls the first time a formal system outscored an informal model on a competition-level benchmark, that whispered question had been answered. She describes the full technical stack — multi-agent ensemble, models post-trained with reinforcement learning on massive Lean datasets, paired informal-formal 'verified reasoning' data — and flags the key frontier: Mathlib, Lean's foundational library, caps what can be formalized, so building new libraries like EconLib (which caught an unstated assumption in Aumann's 50-year-old 'agree to disagree' theorem) is active research.
Nathan and Prakash press on two frontiers: recursive self-improvement and competition with frontier labs. Carina prefers the term 'autocatalytic': AxiomProver generates proofs that go back into training, and the team has now built a conjecture generator that surfaces problems just beyond the prover's current reach — so proving helps conjecturing and conjecturing helps proving, in a loop. On the competition question, Nathan flags that Fable is already scoring in the high eighties on FrontierMath Tier 4, a benchmark where the median forecaster expected 63% at year-end. Carina argues the dynamics differ by business model: formal verification may actually reduce token usage rather than expand it, which dampens frontier-lab incentive to prioritize it, potentially leaving a real market for Axiom's industry-partner model.
The interview closes on what 'mathematical superintelligence' actually means to Carina: a system capable of verified knowledge discovery — the ability to both expand (conjecture, discover) and contract (verify, rule out) — in a trustworthy loop. She draws a line from Lean-proved math through code verification to broader safety-critical systems (chip design, smart contracts, civil engineering), arguing that the code and math worlds are finally being unified and that a 'winning product' in formal verification will change software engineering culture the way Cursor changed coding. Early design-partner work is already porting verification conditions from non-Lean languages into Lean under the hood.
Key moments
You don't want Schrödinger's superintelligence — out of five million lines of proof of the hypothesis, you do not know whether there is a bug somewhere on line 3,827. And then it's like, who is going to do that line by line?
Carina Hong53:08
Conjecturing helps proving; proving helps conjecturing. That is something closer to the self-improvement context you are referring to.
Carina Hong49:15
There hasn't been a winning product yet. That winning product is going to change the world of software engineering culture.
Carina Hong55:10
Questions asked
36:14What is Lean, and how is the paradigm you're developing different from what the frontier companies are doing?
Lean is a formal proof language started at Microsoft in 2013, with Mathlib — the largest math library in Lean — built starting in 2019. Frontier labs take an informal approach: training on large volumes of natural language with chain-of-thought and inference scaling. Axiom bets on formal verification: every proof step is machine-checked against Lean's axioms, producing trustworthy, checkable output. The landmark moment was the Putnam exam in December 2025, where Axiom Prover became what Carina describes as the first formal system to outscore informal models on a competition-level math benchmark.
39:46How does Axiom Prover actually work under the hood?
It's a multi-agent ensemble system with models post-trained on a massive Lean dataset using reinforcement learning and supervised fine-tuning on verified reasoning pairs — informal problem statements paired with their formal Lean proofs. The system generates proofs that are checked against the Lean compiler, with failed attempts feeding back into the search. The key constraint is Mathlib: problems in domains not yet covered by the library can't even be stated in Lean, which is why library learning — building EconLib, for instance — is active research.
44:17What happened with the Aumann 'agree to disagree' theorem, and what is 'assumption accounting'?
When Axiom Prover formalized Aumann's 1976 Nobel-winning theorem, it surfaced an implicit assumption that had never been made explicit in fifty years of teaching and citing the result. The conclusion itself was valid, but the proof had been resting on a premise nobody had stated. Carina calls this 'assumption accounting' — treating a proof like a financial ledger, checking that every logical premise is either verified or at minimum explicitly declared. Axiom caught the gap and patched it, opening EconLib as a new domain built from scratch.
52:35What does 'mathematical superintelligence' actually mean, and how does the conjecture-proof loop get you there?
Carina defines it as a system capable of verified knowledge discovery: it must both expand (conjecture and discover genuinely new things) and contract (verify and rule out false claims), in a trustworthy loop. The conjecture generator produces problems just beyond the prover's current capability; the prover tackles them and improves; the harder proofs feed back as training signal for harder conjectures. This autocatalytic loop — 'Schrödinger's superintelligence' being the failure mode it avoids — is the path from competition math through research math to the broader Verified AI vision covering code, chip design, and safety-critical systems.
59:09How do you think about long-term competition with frontier labs that are also racing on math?
Carina frames it as a token-economics question: formal verification, if successful, may actually reduce token consumption rather than expand it, which dampens frontier-lab incentive to prioritize it over their core scaling business. If the markets where formal verification delivers real value are specialized enough — chip design, smart contracts, defense code — they may not be worth the frontier labs' attention beyond capability demonstrations, leaving a genuine commercial sweet spot for Axiom to occupy alongside rather than against them.
Related
Axiom Math ↗Carina Hong on X ↗
Full transcriptLightly edited · timestamps jump to YouTube
34:59
Prakash: Let me segue here to introduce our first guest this morning. We have Carina Hong — a 24-year-old mathematician turned AI entrepreneur who founded Axiom Math to build an AI that doesn't just answer hard problems, but proves them correct. She was a standout at MIT and Oxford, winning the highest undergraduate math award, the Morgan Prize, before leaving a Stanford PhD program to launch Axiom in 2024. In under two years, her team's AI — Axiom Prover — has solved every problem on the notoriously difficult Putnam math exam with complete, checkable proofs. It even formalized a famous Nobel-winning economics theorem and discovered a gap no human noticed for fifty years. The company has attracted major funding — over $200 million as of March — and built a team of top researchers to pursue what Carina calls Verified AI: AI systems that never bluff because every step is validated in code. Carina, welcome to the show.
36:06
Carina Hong: Great to be here. Thanks so much. Didn't realize I was on air. This is very fun. Great to meet you.
36:14
Nathan Labenz: So I think you've probably given this brief intro to what is going on in math these days a bunch of different times. I've heard it in some of your other appearances, but the audience probably hasn't. Maybe we should just do the standard thing: what is Lean? How is the paradigm you're developing different from the paradigm the frontier companies are developing? And obviously we're hearing pretty amazing things in terms of math results from them too. What makes your bet and the paradigm you're working in distinctive?
36:51
Carina Hong: Yeah, so I'll start with a story. This was January 2025, the Joint Mathematics Meetings — I believe it was in Seattle. For the first time, the topic was AI. It's the American Mathematical Society, and you would not expect AI to be front and center at the largest annual mathematicians' gathering. Wherever I went during that three-day period, I heard people whisper one thing: Lean. What is Lean? It's a formal language for math proofs. It was started specifically in 2013 by Leo de Moura at Microsoft. In 2019, people started building Mathlib, the largest math library in Lean. Some of my friends from freshman year in college were actually part of the fewer-than-ten people building Mathlib with Kevin Buzzard at Imperial College.
37:35
At those Joint Mathematics Meetings, I was talking to so many different people — many of whom are now at Axiom. The dream of AI for math predated the deep learning era. Using formal languages, Lean included, to try to prove theorems automatically is called automated theorem proving. What today we call AI for math would have been called interactive theorem proving, with the human now being replaced by an AI.
38:10
Now, frontier labs are also pursuing AI for math, but they've generally taken an informal approach — using natural language reasoning, training on very large volumes of data, chain-of-thought, and scaling inference to achieve strong results without relying on verifiable output. We're taking a different approach. We believe in Lean and the power of formal verification. At the Putnam exam in December — four months after we started operating — we realized for the first time that a formal system actually beat an informal system on a math olympiad. That had never been the case before. One of the whispers I'd been hearing in January was: will there ever be a day that a Lean-based AI theorem prover can exceed the score of an informal model? People generally believed no. But that happened within the time span of eleven months.
39:00
Since then, Axiom Prover has been busy. In its first hundred days, it published eight papers on arXiv. Six of the eight are completely end-to-end proving tasks; one is a hybrid task; one is an auto-formalization task. Out of these eight papers, five have appeared in peer-reviewed journals — which is stunning to me, because I've generally waited two years for a paper to get reviewed. But because of Lean and its trustworthy output, that gap has massively accelerated.
39:46
Prakash: When you have Axiom Prover, how does it actually work? Are you running LLMs with a harness — feeding them a question, generating proofs which then go through the Lean compiler, getting rejected, being fed back in, accumulating a path: 'this direction was wrong, go down this branch instead'? Are you running many instances in parallel? Is it really a function of multiplying the amount of compute you're applying to the branches of the tree?
40:36
Carina Hong: It's a pretty big data bet. If you look at the full stack of any frontier AI system — us included — you have harness, tools, models, and data. Data fused into model, various models, with really good tools plugged into the harness. We also believe that the model is eating the harness now: models that can recursively call various kinds of tools, including themselves and other models. Axiom Prover is a multi-agent ensemble system with very strong models for formal theorem proving, obtained through intense post-training — reinforcement learning and supervised fine-tuning based on a very massive Lean dataset.
41:25
In a way, this is similar to how you'd go about training a coding model. You post-train on a large volume of coding data, some of which has informal alignment with the underlying meaning — which is also the case here with verified reasoning pairs: informal and formal pairs aligned together. Based on that data, the model learns to produce Lean proofs. There is a limit though: Mathlib, the fundamental library. If something isn't defined there, if you have a domain where nothing exists in Mathlib, it's very difficult to prove things in that domain. That's why some benchmarks like FrontierMath are challenging for formal models — they can't even state the problem in Lean.
42:10
This capability limit — library learning — is something we're actively working on. We've tried to attack it by going beyond what's in the math library, specifically into microeconomics: auction theory and game theory. There's the famous Aumann 'agree to disagree' theorem, a Nobel Prize winner — a fifty-year-old theorem since 1976. Everyone's been teaching it for fifty years. There was an implicit assumption that was never made explicit, which Axiom Prover caught during the auto-formalization process, and it also caught an error in the proof. That's the starting project of EconLib — a library that starts from scratch. We're examining how far we can build new libraries and how far the AI can go without hitting a bottleneck.
43:15
Nathan Labenz: One big question I have about math in general is: how confident are we in what we think we know? I understand that Lean at its core has a small number of primitives that are super deeply vetted and trusted, such that they can be composed arbitrarily — and anything you can build with those building blocks, you can also trust. But with the 'agree to disagree' result — did the original conclusion still hold?
44:01
Nathan Labenz: So we had a valid conclusion, we still have the same conclusion, but we didn't realize we were holding that conclusion for less-than-fully-solid reasons — and now we do have fully solid reasons. Is that right?
44:17
Carina Hong: The latter. It's something we call assumption accounting. You're almost like an accountant looking at how a thing is built. You would hope that every logical premise your result depends on has been checked — or, even better, has been explicitly stated. In this case, during the formalization process, while the result is safe and sound, there was an implicit assumption that was never made explicit. And you actually need to do quite a lot of mathematical work to make that explicit. So we caught that issue, and then we patched it.
45:20
I want to talk about this idea of accounting for assumptions and also accounting for bugs — which is not this case, but related. People tend to think about verification as a stamp of perfection. But there's actually a huge amount of value in just bug hunting: figuring out counterexamples, patching them, or doing other modifications to the proof. That has significant commercial value. You can imagine finding counterexamples resulting in bugs in hardware — that's very interesting to various chip designers; we're working with some early design partners on that. The same dynamic applies in software — finding bugs in large codebases and proving the patch. In smart contracts, there are bug bounties. People have awarded lots of money for catching vulnerabilities. And you have the same dynamics in other safety-critical systems like defense code.
46:35
Prakash: In the last six months or a year — is there some specific thing you worked on that ended up accelerating your work? Was there some piece of what you did that had recursive self-improvement characteristics — accelerating the pace of your own work?
47:01
Carina Hong: Recursive self-improvement has a very specific definition — within a certain limit of FLOPs, how much you can do. I think 'autocatalytic' is a slightly more precise word. Very recently — really in the last two months — we have major results on conjecturing. We have, I think, the very first promising signs of having AI invent new mathematical conjectures that are interesting to mathematicians.
47:40
This is surprising to us for the following reason. We had been under the premise that we need to move past human-labeled data. If you want to close the data scarcity gap in Lean purely through human labeling — going from 10 million tokens to 1 trillion tokens — it doesn't scale. So from the very beginning, we've been keen on synthetic data generation, provided the statement is correctly formalized. There's a recent Atlas formalization project from Meta that has a lot of mis-formalized statements. If statements are wrong, the proof can be technically valid but meaningless.
48:30
If you have human-reviewed, correct statements and AI-generated proofs, your review burden shrinks from thousands of lines to maybe ten lines of the statement, because the proof part is done. But then the question becomes: who invents the statement? You can use auto-formalization to convert existing informal math into Lean statements. If that's good enough — which it is for standalone theorems — then both the statement and the proof don't need human review. But you hit a bottleneck: the total volume of existing mathematics is finite.
49:15
So Axiom Prover is creating new mathematics every day, and those proofs go back into training. But we're seeing the limitation on statements. You need harder and harder, more ambitious tasks. And those statements can be conjectures that the AI itself generates. You have the conjecture model talking to the prover, and they self-improve that way — the conjectures are just a bit harder than what the prover currently can do, and then the prover tackles them and climbs from there. That is closer to the self-improvement context you're referring to.
49:50
Nathan Labenz: Can you describe the conjecture generation step? It seems like an incredible frontier. Presumably, there's some source of entropy there. And how do you reward that system? I've seen a few things that have this paired model, and they usually reward for creating a conjecture that can be proved, but only at a low rate — if it can never be proved, you don't get reward, but if it can be proved every time, you also don't get reward. You're trying to calibrate difficulty. So how does that work, and where does the entropy or creativity come from to get out of distribution?
50:39
Carina Hong: That's a great question. We look for ideas in the AI-for-science literature, and in the AI-for-AI-research hypothesis-generation literature. These are two bodies of work where people have tackled similar problems — not exactly the same, because hypotheses in computer science research are very different from conjectures in math research, but you can borrow interesting techniques. There's also prior formal conjecture work in the AI-for-math literature, like the Lean Workbook self-play paper, which has interesting ways to measure elegance — though their elegance filter is a fairly simple one based on token length of the statement and proof, which I think can be optimized a lot further.
51:30
The formal conjecturing steps in prior work are lacking in one way: they don't account for the gain from natural language, because in natural language you can be a lot more creative. While we see incredible gains in proving efficiency within formal systems, there are scholars studying the scaling laws of Lean-based conjecturing who suggest you might need more informal input there. The idea of using the prover as a reward signal is on point. You can use deterministic tools and tactics to filter out trivial conjectures, and have that sweet-spot calibration as a reward signal. We have about two or three mechanisms I can describe — a combination of those results is powering the conjecture model we have running now.
52:35
Prakash: You state that the intention is to build mathematical superintelligence. How does this fit into that roadmap — the conjecture generator, Lean Prover, and the feedback loop between them? And what does mathematical superintelligence actually mean?
53:08
Carina Hong: I'm really glad you asked this. There are two layers, and it's a nuance we've never quite managed to fully convey. The definition of a superintelligent reasoner is something that can do verified knowledge discovery. Two parts: 'verified' and 'knowledge discovery.' It needs to be able to prove new things, discover new things, tell us things we don't know. And you need to trust it. You don't want Schrödinger's superintelligence — a dark future where out of five million lines of proof of a hypothesis, you don't know whether there's a bug somewhere on line 3,827. Who is going to check that line by line?
54:00
The idea is a superintelligent reasoner that can both expand — the knowledge-discovery part — and contract — the verified part, ruling out what's false. Expand and contract, expand and contract, in a self-improving way. Conjecturing helps proving; proving helps conjecturing. We think about Ramanujan, Einstein, theoretical physicists — that's the first part. But the second part is Verified AI in a commercial sense. When people talk about code verification, they're thinking about verification applied to hardware and software. When the cost of generating proofs approaches zero, that can expand into a lot of adjacent markets — markets that don't have customers yet, including customers who don't know they could benefit.
55:10
Think about how software engineering culture changed: it started with Cursor, a winning product, and suddenly people saw themselves doing AI-assisted coding. Then Claude Code became a winning product that changed software engineering culture again. People in the formal verification space have long asked: why isn't software engineering culture more rigorous? Why aren't we writing programs with proofs? And the answer is: there hasn't been a winning product yet. That winning product is going to change software engineering culture. And the real gain of formal verification is on better generation and better design.
56:10
If you're working in chip verification, you should aim to own the whole end-to-end pipeline of design and verification — providing better design inspiration to your chip-design customer. If you're working in code verification, you should use proofs to write better programs, and eventually have specification and proof concepts baked into the act of writing software. This is the conjecture-proof loop. This is the generation-and-verification loop. We go from math, to science, into the real commercial world — controlled largely by code in all sorts of languages and all sorts of settings. You want the bridge to be optimal and safe. If there's a formal language reflecting civil engineering constraints, you can have proof of safety in mind while designing it.
57:00
Math is a clean lab for us to try all of this. It's very difficult to imagine conjecture generation with messy real-world data. But because we can see it and do it in math — and because Axiom Prover is already succeeding in various industrial tasks around code verification — we think we're on day one after day zero. The Putnam and research math results were day zero to day 0.5. We're at day one now.
58:08
Prakash: Would you call that applied math versus theoretical math? When you work with clients, is that basically a form of applied math?
58:17
Carina Hong: I think it's unifying math and code. Marrying math and code — the two pillars of the digital world. It's a beautiful story: you use code as the data vehicle of math, and you use math to verify code. For the first time, if you think about outputs being meaningless without the properties that underlie them — marrying property and output, logic and computation, proof and program — that's what we're building toward. One unity between generation and verification.
59:09
Nathan Labenz: This is all very exciting, and it's not hard to understand why investors would want to give you a couple hundred million to pursue this vision. But I do wonder how you're thinking about long-term competition — and frenemies status — with the frontier companies. This is right in the bull's eye of what they're focusing on too. At the beginning of the year, I participated in a forecasting exercise, and my guess — which was higher than the median — was that the best LLM at year-end would reach 63% on FrontierMath Tier 4. Now Fable is already in the high eighties. And I think everyone's kind of like: wow, this is going extremely fast. Is there a different scaling law you're working with? Is there some inherent property you'll create in your models that they can't? How do you maintain enough differentiation given that $200 million goes a long way, but these guys are raising $200 billion?
1:00:55
Carina Hong: Yeah, there are some interesting dynamics here. One is token-based pricing. If formal verification is successful, are we going to burn more or fewer tokens? I think that's going to determine whether frontier labs have an interest in pursuing this direction. If you're a startup working specifically with industry partners who might be open to different business models — and if those markets aren't large enough for, say, the frontier labs to do anything beyond a performative showing of capability — that's a real market test. And you might actually find a pretty nice sweet spot where you are friends with the frontier labs, not enemies.
1:01:53
Prakash: When you work with clients, do you have forward-deployed engineers? Are you sitting with them formulating the problems and then training on it? How does that work with a client on a particular problem?
1:02:10
Carina Hong: We're currently very early. We have design partners, and we're very blessed with great design partners who have very technical folks on the other side — though they're not necessarily Lean folks. One capability we've learned is important: the ability to work with another formal language. I won't specify which one because that's pretty telling. And then have Axiom Prover prove the verification conditions in Lean. The process — you could call it auto-formalization, but not from English to Lean; from this other language into Lean specifications. Slightly different from a pure statement-translation problem. We've been seeing that work well with particular choices of language for those use cases. I don't know if this livestream is going to qualify us as a neolab, but those are some of the real-life lessons we're learning.
1:03:13
Nathan Labenz: Fantastic. Thank you so much for joining us, Carina. Great opportunity to learn about Axiom Math, and we'll certainly be following your progress.
1:03:23
Carina Hong: Thank you so much.
1:03:38
Nathan Labenz: I don't know if I can listen to her at two-x speed, and that's a high compliment in the AI moment. Her tokens per minute is one of the few that might actually keep up with some of the frontier models.
1:03:55Interview34 min
Biosecurity and AI: Law as a Risk Control System — Doni BloomfieldDoni BloomfieldDoni Bloomfield is an associate professor of law at Fordham, working at the intersection of AI, biosecurity, and export-control law. He wrote one of the first papers on whether the government can legally treat model weights as an export.
Watch
As aired
Fordham law professor Doni Bloomfield joined Nathan Labenz and Prakash to discuss two interlocking problems at the frontier of AI governance: how law can govern biological data in an age of AI-powered pathogen design, and whether the Trump administration's export-control action against Anthropic's Fable model is even legally coherent. Bloomfield opened by laying out the 'Biosecurity Data Level' framework he co-authored in Science — a tiered data-access regime modeled on privacy law, designed to let legitimate researchers reach sensitive virology datasets on secure platforms without making that data publicly available to fine-tune open-weight biomodels. He noted that voluntary restraint by the developers of ESM and Evo showed the approach is tractable, but that GovAI research had already demonstrated the gap: fine-tuning open-weight models on withheld human-infecting-virus data is achievable, which is precisely why controlling the data at the source — rather than the model — is the more durable lever.
The conversation pivoted to the week's live news when Nathan asked Bloomfield to explain the government's actual legal authority over model exports. Bloomfield walked through export-control law's Cold War origins, its broad executive discretion, and then its structural gaps: the statute does not cover services, the Commerce Department has explicitly said cloud services and software-as-a-service are not exports, and Congress is still working to close that hole via the Remote Access Services Act. He argued that Anthropic's Fable API almost certainly falls outside existing statutory reach — and that restricting model outputs would additionally face serious First Amendment challenge, especially compared to controls on narrow biological design tools. The hosts pressed on what Anthropic's strategic options actually are, and Bloomfield outlined the trilemma: litigating would expose the company to tighter letters under other authorities, would permanently damage its relationship with the administration, and risks creating bad precedent if a court — sympathetic to Anthropic given the evident ideological targeting — issues a ruling that also forecloses good-faith future AI safety controls.
The segment closed with Bloomfield invoking a recent Supreme Court case establishing that using lawful government power to punish ideological adversaries is itself a First Amendment violation — a doctrine he argued applies to the Anthropic situation given the administration's public rhetoric about 'woke AI.' He stopped short of predicting a resolution timeline, broadly endorsing the forecasting markets' roughly 55 percent odds of restoration before July 1, and ended by calling on Congress to craft deliberate AI governance rather than letting export-control law — never designed for this purpose — become the country's accidental AI licensing regime by default.
Key moments
Even if we don't think the models constitute Anthropic's speech, just going after Anthropic on ideological grounds — even under otherwise valid legal authority — would itself constitute a serious First Amendment question.
Doni Bloomfield1:32:17
There are so many levers that the government could reach for that even if you win a particular battle, there are just so many other ways they can make your life difficult. Fighting the legal question is really secondary when you think about all the other tools the government has.
Doni Bloomfield1:28:37
It's very dangerous to craft law on the basis of outlier situations when the precedent may make problems for good-faith efforts in the future.
Doni Bloomfield1:28:37
Questions asked
1:05:34How do you propose to regulate biological data in the age of AI — and how does the Science paper's framework work?
Bloomfield described the Biosecurity Data Level framework as a tiered data-access regime modeled on privacy law: a narrow slice of sensitive virology data — particularly novel data about human-infecting pathogens — would be made available only to credentialed researchers on secure platforms, rather than being generally downloadable. The framework draws on existing models like controlled-access genetic databases, where scientists can bring their questions to the data without exporting it. The goal is to preserve open science for most biological data while gatekeeping the subset that could meaningfully uplift AI capabilities for pathogen design.
1:21:31Does export-control law actually give the government the authority to restrict Anthropic's model — and what are the legal gaps?
Bloomfield argued the government's authority is shakier than the broad discretion of export-control law suggests. The statute's definition of an 'export' does not cover services, and Commerce's own guidance has explicitly said cloud services and software-as-a-service are not exports — a gap Congress is still trying to close via the Remote Access Services Act. Restricting model outputs would also face a published-material exception in the regulations, since Fable is available by subscription, and would raise serious First Amendment concerns given the lack of parallel action against models with comparable capabilities.
1:26:20What is Anthropic's actual option set — could they have simply ignored the letter, and why does the 'do nothing' play seem to be winning?
Bloomfield outlined three reasons Anthropic is likely sitting tight despite having a plausible legal case: the government could issue a tighter letter under other export-control authorities that would be harder to challenge; the ongoing relationship with the administration is too important to jeopardize by going to court; and litigating now risks producing bad precedent that forecloses good-faith AI safety controls later, because a court sympathetic to Anthropic given the evident ideological targeting might issue a ruling that sweeps too broadly against legitimate future regulation.
1:32:17Can courts look at the pattern of the government going after Anthropic on ideological grounds — and does that matter legally?
Bloomfield cited a recent Supreme Court case holding that using lawful government power to punish ideological adversaries is itself a First Amendment violation, even if the underlying authority would otherwise be valid. He argued that the administration's public rhetoric about 'woke AI' and the pattern of targeted actions gives Anthropic a plausible ideological-targeting claim — and that courts can look broadly at government communications to establish such motivation. He noted, however, that this option still carries the same strategic risks as any litigation path.
1:10:33If AI models can ultimately infer dangerous capabilities from general biological knowledge, does restricting specific training data actually work?
Bloomfield acknowledged this as the central empirical uncertainty in the framework: no one yet knows whether withheld pathogen-specific data is genuinely necessary for dangerous capabilities, or whether general biological knowledge will allow models to reconstruct the same insights. The framework's bet is that biology's complexity means a model cannot yet back out human-pathogen virulence factors from general knowledge about genetics — and that the observed relationship between training data and in-domain capability in prior biosystems AI (like AlphaFold) supports that bet. He was explicit that this needs ongoing empirical testing and framed the framework as one layer of a defense-in-depth rather than a complete solution.
Related
Doni Bloomfield — Faculty page (Fordham Law) ↗Biological data governance in an age of AI (Science, 2026) ↗
Full transcriptLightly edited · timestamps jump to YouTube
1:03:55
Prakash: Our next guest is Doni Bloomfield. Doni is an associate professor of law at Fordham University where he specializes in how legal tools can guide technology to safer outcomes. He focuses on areas like intellectual property, biosecurity, export controls, and tech policy — basically how law can encourage innovation while reducing catastrophic risks. Lately he's been zeroed in on AI and biotechnology: as AI systems get better at designing drugs and even pathogens, how do we update the rules to prevent misuse? Before joining Fordham he was a senior researcher at the Johns Hopkins Center for Health Security and a postdoc fellow at Harvard Medical School, so he's really looked at the biotech area from many different angles. In February 2026 he published a paper in Science with colleagues proposing a new way to govern biological data in the age of AI — the idea is to keep most data open for research, but put the truly dangerous stuff behind safeguards, borrowing ideas from privacy law and biosafety. At a time when AI can design viruses in a lab, that balanced approach is a big deal. He's here to explain how law might be used not just to hold back technology, but to steer it toward safe and beneficial uses. Doni, welcome to the show.
1:05:20
Doni Bloomfield: Thanks so much for having me. I hadn't quite thought of it as using boring legal tools to head off terrifying catastrophes, but that does seem like a pretty apt summary of what I think about.
1:05:34
Prakash: And in that vein, tell us a little bit about your proposal in the February paper. How do you propose to regulate data in this space?
1:05:47
Doni Bloomfield: Great. I should say upfront this is joint work with a lot of great colleagues — one of my chief collaborators was at Johns Hopkins. Our proposal is that there are some forms of data that could conceivably be used to teach bio-specific models — or other models — dangerous capabilities like designing enhanced pathogens. The idea is that there is a narrow slice of data, about viruses in particular, and novel data that doesn't yet exist, that we can make available to responsible researchers on secure platforms without making it generally publicly available — and thus without making it available to actors we wouldn't want to have those capabilities. As you said, Prakash, this draws on privacy law models — not AI models, but frameworks where we have all sorts of private genetic data that we want scientists to learn from, but we don't want that data to just be available for anyone to download. We've seen successful examples where scientists can bring their questions to the data without exporting the data or making it generally available. The idea is that we can build that same type of framework here.
1:07:34
Nathan Labenz: Do you want to talk a bit about how this played out in the Evo and ESM Fold training processes? I think that's a really telling, and kind of positive, vision for the future — where the pattern you're describing was actually, at least partially, realized by the folks training frontier models. And it seems like we're all much better for it, especially since they're open-sourcing those models now.
1:08:04
Doni Bloomfield: Yeah. In those cases, the creators of these frontier bio-models — models trained exclusively on biological data — voluntarily refrained from training on data involving human-infecting viruses, or more broadly viruses that can infect a wide range of life. They saw major reductions in capabilities when they compared those models to alternate versions trained on that data — at least in ESM's case, that was documented. So the idea was: let's refrain from training on these potentially concerning data, and we can see empirically that capabilities drop. I should note two things about that. First, our proposal would really apply on a forward-going basis to data that don't yet exist. In those cases, the developers were refraining from training on data that are already available — and if you have access to those open-weight models, you can fine-tune on that existing data. We actually have evidence that this is not that hard: work from researchers at GovAI showed you could fine-tune Evo 2 on data about human-infecting viruses and get meaningful jumps in capabilities on relevant prediction tasks. So as long as we're talking about publicly available data and open-weight models, the developers' voluntary restraint doesn't preclude others from doing that fine-tuning later. That's part of what's driving our proposal — thinking a few years out, when we want fantastic open-source biological models and we also want scientists to respond in real time to developments in virology, but we don't always want those two things meeting in a fully open way.
1:10:33
Prakash: So there seem to be two approaches. One is the constitutional AI approach, where the model knows about viruses but decides not to tell you. The other — which is what you're proposing — is where you actually remove that data from training, so the model doesn't know it in the first place. My concern is that pieces of scientific knowledge are so interconnected that an AI isn't just recalling facts; it's inferring things. Even if you excerpt that chunk of data, couldn't the model reconstruct it from general biological knowledge?
1:11:54
Doni Bloomfield: This is a really important question and it gets to empirical uncertainty that we're very straightforward about in the Science article — we simply don't know whether these specific data will prove essential to these capabilities, or whether a model could infer them from general biological knowledge. You're absolutely right that we see remarkable cross-domain transfer from large language models. The bet we're making — and it needs to be empirically tested — is that the complexity of biology is such that we're not yet providing these models with a full-stack understanding of physics and biology. We don't think that from knowing how plant DNA works or how human genes work, a model can straightforwardly back out everything about what makes a virus more infectious or more virulent in humans. Where we've seen the biggest advances in AI and biology, it's been in arenas with copious, beautiful data — AlphaFold going from amino acids to folded protein structure, where we spent billions over decades to compile data. The capabilities leap tracked the data. Our bet is that relationship holds in the pathogen domain too. But I don't think that precludes other safeguards — we're not claiming this is a silver bullet or the only thing to be done. Having a constitution for a biological model is also a bit harder when the model doesn't understand English or any human language — it may not have a persona that can take on something like virtue ethics. That seems like a fascinating research direction, but it may be well in the future.
1:14:48
Prakash: How does this interact with First Amendment issues? You're preventing companies that build models from consuming data — and the blast radius of what needs to be withheld might be larger than the companies or researchers think.
1:15:17
Doni Bloomfield: There are genuinely interesting First Amendment questions about regulating training generally. In this domain there are serious free-speech concerns with saying you cannot use certain data in training — both because of the expressive rights of developers and, I think more seriously, because of listener rights: the Supreme Court has said the First Amendment protects listeners as well as speakers. That said, when you're talking about catastrophic outcomes, and about functional data where something is being changed in the world rather than communicated, I think you can apply less demanding scrutiny. Courts have been willing to bless information-control regimes when the concern on the other side is grave enough. To be clear: I'm not proposing a training ban, and none of us are. But I think a sufficiently tailored regime — and we are trying to propose a very tailored one with different tiers of control, stress-tested empirically over time — is the type of thing that can survive First Amendment scrutiny, even though it will attract some level of scrutiny. And certainly the actions we've seen over the past week from the Trump administration in applying export controls should make all of us think very carefully about the First Amendment implications of how we regulate model distribution and access.
1:17:34
Nathan Labenz: A perfect segue. The week as it has unfolded has given a lot of people a real sense of the politics at play in the government's interaction with Anthropic. My sense is they're essentially just making things difficult, because if there were solid justification, we'd probably have heard more about exactly what's motivating these actions — the silence is kind of deafening. Setting aside the apparent capriciousness: help me understand what power the government actually has. When are they supposed to use it? How much discretion do they have? And what avenues does a company like Anthropic have to fight back if they want to? I'm also confused by Anthropic's strategy — having their new flagship model pulled and essentially just absorbing it for a week without saying much also strikes me as strange, but I may not understand the actual moves available to them.
1:19:05
Doni Bloomfield: We're still in the fog of war — right before this I was checking the Wall Street Journal, which had a report within the last hour and a half saying discussions are going fine. Who knows. But we can say something about the powers that export-control law actually provides. These are laws that go back to the Cold War, revised many times since. Their essential goal is controlling the flow of U.S. technology outside the United States, and also within it when it comes to people without green cards or citizenship. They're designed to maintain U.S. technological advantage — that's how they were used in the Cold War and how both the first Trump administration and the Biden administration used them aggressively to control the spread of semiconductors and semiconductor manufacturing equipment. The government's discretion here is extremely broad. It can issue regulations controlling specific hardware, commodities, software, and — crucially — what's called 'technology,' meaning information. It can control proprietary information and prevent companies from sharing it with non-U.S. persons without a license. Congress has given so much discretion to the executive that it has also removed some of the usual judicial review avenues. That said, despite this very broad discretion — having now looked at the reported contents of the Commerce Department letter to Anthropic that Bloomberg obtained — it's not clear the government has the authority to do what it did here, at least under its stated legal powers. Nor is it even clear the letter actually restricts Anthropic from making its API available, including to foreigners. So: very broad discretion, but it's not clear they actually have authority to do what they did under their claimed arguments.
1:21:31
Nathan Labenz: Can you double-click on that? I've heard things like 'you can't export-control services.' I'm not sure how that plays into whether they have authority here. Unpack why you'd say they might not have authority, given all the broad discretion you just described.
1:21:54
Doni Bloomfield: There are some gritty technical reasons. One is that what the law defines as an 'export' doesn't cover services. It can cover information, but not services per se. The Commerce Department has been explicit about this in its own guidance — it has said that cloud services are not an export, that software-as-a-service is not an export. Congress has actually been working to fix this through the Remote Access Services Act, which passed the House, to give Commerce the power to restrict non-U.S. users from accessing compute or AI models. But those powers don't yet exist. So saying Anthropic cannot 'export' a model — it's not even clear what that means. And if the government tried to restrict all outputs from these models, that would run into the statute and regulations, which say the law doesn't apply to published material or fundamental research. Fable's outputs would likely fall into that exception because you and I can buy a subscription, which means it falls into the 'published' exception in the regulations. And it would also run into serious First Amendment questions. I don't think those questions would be impossible to overcome if we were talking about a genuinely catastrophic risk — but on the level of risk being discussed here, especially when they're not doing the same thing to GPT-5.5 or other models with apparently similar capabilities, the First Amendment concerns loom pretty significant.
1:23:58
Prakash: To what extent can the government even define what a Fable or Mythos is? The joke going around is: just release 'untitled checkpoint number 42.' If you change a single weight out of a trillion parameters, is that the same model?
1:24:23
Doni Bloomfield: You could probably do it — you might need to constantly rejigger things. But one approach, which is what the Biden administration used with the AI Diffusion Rule in January 2025, was to define a covered model as one trained with a certain number of floating-point operations. Models above a compute threshold get certain controls — and they had lower thresholds for models trained primarily on biological data. So there are ways to do this, similar to how Commerce had to develop technical definitions for AI chips that the industry didn't use. You'd be in a similar situation: develop capability or training-compute definitions. The department does have broad discretion to send individual letters saying 'everything that comes off this branch is controlled unless you show us otherwise.' I do think there's something very dangerous about that kind of secret-channel governance, where there's no clarity about what's controlled and under what rule. But the government has real flexibility to define things, and it can do so in a more public-facing way using objective criteria.
1:26:20
Nathan Labenz: So if you're Anthropic right now and you have this confused letter — and it sounds like you'd say they probably don't have the authority to do this, and the fact that there's a bill working its way through Congress to give them that authority suggests Congress agrees they don't have it — what is the option set? Could Anthropic have simply said: your letter is a mess, we're going to keep doing our thing? If they did that, what does the government do to escalate? I've also heard arguments that Anthropic doesn't want to pursue a First Amendment claim because that's contrary to their own ideas about regulation, and they don't want to set that precedent even if it gets them out of this spot. Why do we seem to be getting the 'just do nothing for now' play?
1:27:41
Doni Bloomfield: I don't know exactly what's going on inside Dario's head or among Anthropic's leadership. They could seek an injunction — they'd have a plausible case, at least under the authorities mentioned in the letter. But there are a couple of real dangers. One is that there are other authorities in the export-control laws that we haven't even discussed, and they did a sloppy job with this letter. That doesn't mean they couldn't write a tighter one that would hold up under greater scrutiny.
1:28:19
Nathan Labenz: Do you think Fable would write a better letter than they sent?
1:28:23
Doni Bloomfield: Yes — I do think they could write a better letter than they sent. In collaboration with Fable, there's some risk of hallucinating a citation, but yes.
1:28:35
Nathan Labenz: The government needs Fable.
1:28:37
Doni Bloomfield: They haven't cut off their own access to Fable, presumably — or Mythos, really. But the broader issue, and this is what's very concerning about a government that isn't trying to play by the rules, is that there are so many levers available to them that even if you win a particular battle, there are just so many other ways they can make your life difficult. People are somewhat unfairly accusing Anthropic of not recognizing the realpolitik here — that fighting the legal question is really secondary when you think about all the other tools the government has. I think it's principled to say 'you don't have this authority.' But that's a pretty good reason not to bring suit. The third reason — which you mentioned — is that you might create bad law. If a federal court sees what's happening here, sees the context of all the negative statements by high government officials about so-called 'woke AI' and attacks on Anthropic on ideological grounds, that court might issue a ruling that looks like the government is acting lawlessly against what it perceives as an ideological enemy. That could result in case law that causes problems for good-faith AI regulation in the future — situations where we actually do want, despite some of these First Amendment concerns, the ability to put safeguards on a genuinely dangerous model. It's very dangerous to craft law on the basis of outlier situations when the precedent may impede good-faith efforts down the road.
1:30:57
Nathan Labenz: My friend Dean says republics run on virtue. Unfortunately, virtue is in a bit of short supply at the moment. If this pattern continues — we've now had two high-profile moments where the government has seemingly taken motivated, bad-faith action, with the supply-chain designation and now this export-control — how much does that become relevant or admissible? You said they could write another letter, they have other authorities. But at some point, on a pattern-of-behavior basis, courts would seem to recognize: you're just trying to mess with this company. Do courts think that way, or are they more narrowly constrained to look at just this one law applied to this one situation?
1:32:17
Doni Bloomfield: We're actually lucky to have a very on-point Supreme Court case from last year where the Court said that New York State was going after the NRA on ideological grounds — and that even if the laws New York was using were themselves appropriate, if the state was using lawful powers to attack ideological enemies on ideological grounds, that is a First Amendment violation. You can look fairly broadly at the government's communications: what is it saying about its actions, what is it telling other people about why it's making these decisions? All the evidence of ideological motivation on the part of the Trump administration should raise serious First Amendment concerns. Even if we don't think the models constitute Anthropic's speech, even if we're not worried about model output as information that listeners have a right to hear — just going after Anthropic on ideological grounds, even under otherwise valid legal authority, would itself constitute a serious First Amendment question. That's a challenge Anthropic could consider bringing, but it still triggers all the problems we discussed: if Anthropic wants an ongoing relationship with this administration, they face a serious trade-off where there are still all these other tools, and constantly returning to court is just a perilous exercise.
1:34:27
Prakash: What's your over-under on when a Fable-class model comes back into public circulation — from Anthropic or any other firm?
1:34:46
Doni Bloomfield: I have friends who are superforecasters; I'm not one of them. I'd probably just stick with the market lines — last I saw it was around 55 percent by July 1. I don't know the current betting numbers, and obviously a lot of what this administration does is challenging to predict at the best of times. I won't give you a number. But even if this works out in a somewhat reasonable time, we're in pretty treacherous waters. Export controls were not designed for this purpose, and this is really a call to action for Congress and executive-branch civil servants who could try to craft a more reasoned, principled set of policies — drawing on some of that republican virtue, small-r republican, that Nathan was talking about.
1:35:57
Nathan Labenz: Calling Congress — might be time to do something, guys.
1:36:02
Doni Bloomfield: Step up. You've got to step up.
1:36:03
Nathan Labenz: Professor Doni Bloomfield, thank you for joining us and helping make sense of this. Something tells me there may be an opportunity for a return appearance sooner rather than later the way things are going — we might have a law-and-AI beat on our hands whether we like it or not.
1:36:21
Doni Bloomfield: Yeah, it's all lawfare all the way down right now. Thanks so much for having me. Great to be on.
1:36:29
Nathan Labenz: Thanks for being here. Real quick — to put numbers on that: as of now, 54 percent chance Fable is restored before July 1. 65 percent before July 10. July 17 gets you to 68 percent. So only roughly a two-thirds chance over the next month. That's not good. My original over-under was Friday, and it's fast approaching. And the question that isn't in those numbers — the one that could make things even more flagrantly bad — is if OpenAI gets to launch something in that class over the next month without getting the same treatment. If that happens, we're in a really strange spot.
1:38:04
Prakash: So on that note —
1:38:08Interview33 min
Enterprise World Models: What Comes After LLMs? — Sam PasupalakSam PasupalakSam Pasupalak co-founded Maluuba (acquired by Microsoft in 2017), stepped away through the transformer wave, and is back building Skyfall AI. His contrarian bet: not LLM-agents, but an 'enterprise world model' grounded in how a specific company actually works.
Watch
As aired
Prakash opened the segment by introducing Sam Pasupalak — co-founder of Maluuba (acquired by Microsoft in 2017) and now CEO of Skyfall AI — framing his thesis that today's large language models, however fluent, are structurally insufficient for the complex, dynamic decision-making that enterprises require. Sam elaborated by tracing the three domains where LLMs have actually succeeded since late 2022: text generation, code generation, and video generation. He argued the common thread is that those domains are well represented on the public web, whereas enterprise operations — ERP systems, time-series data, real-time process states — are not. His stated end goal for Skyfall is an 'AI CEO': a system capable of long-horizon planning under uncertainty, something he sees as architecturally impossible for token-prediction-based models.
Nathan pressed Sam to define what a 'world model' actually means in practice, noting the term spans everything from explicit causal graphs to emergent representations inside LLMs. Sam described Skyfall's target as an explicit world model of the enterprise: a state-action framework in which, given a current state and a candidate action, the model predicts the next state — a kind of latent-space simulator of business operations. He framed LLMs as the execution layer (policy) and the world model as the planning and decision layer that generates and evaluates simulations across possible futures. Nathan and Prakash tested the architecture with a concrete ecommerce scenario — coordinating marketing, inventory, and fulfillment agents toward a sales goal — and Sam confirmed that vision while stressing the technology is still early-stage research, with a first practical demonstration expected in late summer or fall 2026.
The conversation deepened around two hard questions: how to train a world model on inherently unique business histories, and how world models interact with the expanding capabilities of frontier LLMs. Sam acknowledged the training data problem is unsolved — Skyfall is combining historical enterprise datasets with synthetic future-state simulations and continual learning, none of which has yet converged on a definitive approach. On ethics, Nathan raised findings from AI-managed business experiments suggesting that the highest-performing autonomous agents exhibit ruthless behavior including collusion and bluffing. Sam argued that world models are actually better-suited to ethical grounding than LLMs because the human operator controls the optimization objective and can tune what the simulation maximizes — though Nathan pushed back, noting that the world model itself would have to surface non-financial consequences (litigation, reputational damage) for that to function in practice. The segment closed with Sam's roadmap: computer-UI world models as the first milestone, SMB deployments around 2027, and large-enterprise scale after scaling laws extend to the world-model paradigm.
Key moments
You could never give an LLM a billion dollars and say, 'run the business for me' — that's impossible. You need a combination of LLMs, world models, continual learning, and potentially other technologies.
Sam Pasupalak1:51:38
In 2018 and 2019, almost no one was talking about LLMs. It wasn't until the 2020 scaling laws paper that it became mainstream. The same trajectory has to apply to world modeling in two or three years.
Sam Pasupalak1:58:32
Think of Pre-Cog in Minority Report — it's up to you to decide which future is the right one and tune the system accordingly. That's the concept of a world model, and that's why it's such a powerful idea.
Sam Pasupalak2:07:04
Questions asked
1:39:48LLMs work well for text and code generation — what's actually different about the enterprise context that breaks that paradigm?
Sam argued that LLMs succeed where training data exists on the public web, but enterprise operations — ERP systems, time-series databases, dynamic process states — are absent from that corpus. More fundamentally, enterprises require long-horizon planning under uncertainty and continuous adaptation to changing conditions, both of which are structurally beyond next-token prediction. His goal is an AI CEO that can handle that complexity through a combination of world models and continual learning.
1:43:14Concretely, what is inside a world model — is it a learned model, a knowledge graph, a simulator? What does it predict that an LLM with good tools genuinely cannot?
Sam defined the world model as a state-action framework: given the current state of the enterprise and a candidate action, predict the next state. For Skyfall, this is an explicit latent-space representation built by integrating ERP, CRM, and database signals — not a symbolic graph. The key prediction an LLM cannot make is the downstream ripple effects of a business decision across an entire operating system, evaluated against multiple simulated futures simultaneously.
1:55:37How do you train a world model on business history when every major event is essentially unique? Do you use synthetic data?
Sam acknowledged this is an open problem. Skyfall is currently combining training on existing historical datasets with simulation of plausible future states, then correlating the two. He also pointed to continual learning — dynamically adapting the model as new data arrives — as a potential partial solution, but noted that even the right approach to continual learning has not yet been determined.
2:02:58Recent AI business-simulation research finds that the highest-performing autonomous agents exhibit ruthless behavior — collusion, deception, strategic threats. Is that a problem your world model has to confront, or can you leave ethics to the LLM providers?
Sam argued that world models are better suited to ethical grounding than LLMs because the human operator explicitly defines the optimization objective and the variables the model maximizes over. A world model simulates all possible actions and their consequences, and a human-defined policy filters which actions are permissible before the LLM executes them. He acknowledged, however, that the scope of what consequences get modeled — including legal or reputational risk — is itself a design choice the human must make.
2:07:38How much data does a company need to take advantage of this paradigm, and who benefits first — small businesses or large enterprises?
Sam laid out a staged roadmap: Skyfall is starting with UI-interaction world models on computer usage data, moving to API-level back-end modeling, targeting SMB production deployments around 2027, and expecting large-enterprise scale only after scaling laws extend to world models. He noted that the same dynamic that made LLMs go mainstream after the 2020 scaling laws paper will eventually apply to world modeling, but likely two to three years out.
Related
Skyfall AI ↗
Full transcriptLightly edited · timestamps jump to YouTube
1:38:08
Prakash: Our next guest is Sam Pasupalak, co-founder and CEO of Skyfall AI, a startup aiming to build what they call enterprise superintelligence. Sam first made waves in AI as co-founder of Maluuba, a pioneer in natural language AI that Microsoft acquired in January 2017. At Skyfall, he's now focused on world models — AI systems that don't just chat or generate text, but actually understand cause and effect inside a business. Think of an AI that can simulate your company's operations, learn from every new development, and help plan complex decisions, not just answer questions. Sam argues that today's large language models are not enough for high-stakes operational decisions. His team is trying to go beyond language: building AI that can predict consequences, adapt to changing data, and make long-term plans. Sam, welcome to the show.
1:39:26
Sam Pasupalak: Thank you for having me. Appreciate it.
1:39:31
Prakash: Sam, imagine you have an AI assistant that can write beautifully crafted emails, but ask it to reschedule your supply chain when a factory shuts down and it's utterly lost. That's the gap you see in many processes right now. How can that be addressed?
1:39:48
Sam Pasupalak: Yeah. I think we should take a step back and look at what has actually succeeded over the last three and a half years. If you think about what's worked since November 2022, LLMs have succeeded in three broad categories. The first is text generation and information retrieval — ChatGPT, Gemini, and so on. The second is code generation — Cursor, which just got acquired, and similar tools. And the third, to a much smaller extent, is video generation. Now, if you think about why LLMs succeeded in those areas, it's because LLMs are trained on the World Wide Web — Reddit, Twitter, Wikipedia. But when it comes to the enterprise, LLMs are not trained on databases. They're not trained on time-series data. They're not trained on everything that enterprises deal with day to day. And enterprises are fundamentally dynamic — everything changes constantly, the decision-making is far more complex. Our eventual goal is to make an AI CEO. That can be achieved through a combination of technologies: LLMs will play a part, but world models and continual learning are essential. I want to replace the job I do, which involves complex decision-making under uncertainty and long-horizon planning — things LLMs can never do because they're based on next-token prediction. That's the high-level essence of the company.
1:42:03
Nathan Labenz: Can we talk about what you mean by a world model? I think there's a very wide range of what people actually mean when they use that term. Just yesterday I was talking to someone building an AI-for-science agent where the world model is an explicit causal graph — something inspectable by humans or other AIs, something you can pull apart and explore. Then there are emergent world models inside LLMs and other architectures, including video generation models that seem to learn intuitive physics. And then there are things in between. What sort of world model do you have in mind — what does that actually look like?
1:43:14
Sam Pasupalak: At its core, a world model is a state-action pair: given a current state and a given action, what is the next state? That's the very high-level version. From our perspective, we want to build an explicit world model of the enterprise — connecting to ERP systems, databases, all the data within a business, and predicting the next step. The way it relates to humans is that we naturally imagine the future steps before we take an action in the real world. That's the conceptual version most correlated with how humans reason, and it's what we're trying to replicate for the enterprise.
1:44:17
Nathan Labenz: Does that mean it's making a high-dimensional prediction? When you say state, then action, then new state — that maps onto how certain cell models work where you say, here's the transcriptome, and here's a drug we're going to introduce, now predict the transcriptome after that perturbation. Is it a similarly high-dimensional prediction?
1:44:47
Sam Pasupalak: It's latent-space abstraction — latent-space reasoning. A high level of abstraction over the state space is what we consider world models to be. There are similar paradigms in scientific discovery. For us, it's a latent-space representation with more abstract thinking. That's world models in a nutshell.
1:45:11
Nathan Labenz: I'm trying to understand this a little better. I have this enterprise — it has headcount, departments, product lines. I can shrink one department and grow another, add a product and discontinue another, potentially cannibalize my current products with something new. When you make a next-state prediction, how broad is the scope? How far into the ripple effects of a decision can AI go — should go, will ultimately be able to go?
1:45:51
Sam Pasupalak: Absolutely. Let me speak to both the present and the future. In the long term, we want something like Minority Report — Pre-Cog — where you can predict all the different future simulations and select the one that best fits the needs of the business. In the present, we're still in very early stages. From the enterprise context, twelve to eighteen months from now, what we're building looks something like this: take a simple ecommerce business. You have an AI CEO, an AI marketing agent, an AI sales agent — they coordinate, and you give them a goal, like 'I need $2,000 in sales over the next week.' They decompose that into sub-goals, figure out who the right audience is on Instagram, create the appropriate Shopify store, build a go-to-market plan, and execute to deliver the target. That's the most concrete representation of a world model we think we can build in the next twelve to eighteen months. Going from there to something like a fully functional Microsoft will take many years. We have to first prove that world modeling works at small scale. I also want to be clear: current world modeling techniques are still very theoretical. There are no practical applications yet. We expect to be one of the first companies to demonstrate a practical enterprise application around late summer or early fall 2026. And we are very much focused on the digital world — we're not pursuing robotics or the physical space. There's still so much innovation to be made in the digital world.
1:49:26
Prakash: Concretely, let's say you have an ecommerce business that designs shoes, manufactures them in China, brings them onshore, and sells them online. What is the concrete thing that business will see when they implement this?
1:49:58
Sam Pasupalak: You won't need to hire employees for those functions anymore. That's where we're heading. The world model should help you run your business largely on your own. We're trying to democratize technology that today is only in the hands of a few large companies and put it in the hands of every small business in the world. About 70% of global GDP comes from business activity, and the majority of that is small and medium businesses. We want to give those SMBs the power of world modeling so they can run their businesses more efficiently over the long term. That's the overall vision.
1:50:45
Prakash: So the system would use a typical LLM sub-model for something like customer service, but it also has access to the bank accounts, the accounting systems, the CRM systems — and it's making long-term predictions, choosing product mix, selecting suppliers, handling exceptions on returns and refunds. This agent is essentially doing all of those things.
1:51:38
Sam Pasupalak: Absolutely — that is the capability of a world model. But you still need LLMs, because LLMs are very good at execution once you have a specific goal defined. The world model handles all the abstract thinking required to run the entire business. You could never give an LLM a billion dollars and say, 'run the business for me' — that's impossible. You need a combination of LLMs, world models, continual learning, and potentially other technologies for the future you're describing.
1:52:19
Nathan Labenz: What does the interaction model look like? My intuition is: I as a business owner talk to my language model, which calls the world model as a tool — 'if we do this, what's the future state? If we do that, what's the future state?' But it sounds like you may be inverting it, making the language model the subsystem and the world model the primary thing. If that's the case, what does my interaction with the world model actually look like?
1:53:33
Sam Pasupalak: We're still doing early experiments with this — we'll have interesting results out in August or September, hopefully. But the current direction is that they work in conjunction. The LLM acts as a kind of policy layer, and the world model handles all the prediction and decision-making. Think of it as an extrinsic world model — it simulates all the different possible actions and their outcomes. The real-world interaction would still happen through the LLM. That's the succinct answer.
1:54:40
Prakash: One of the questions I had — in business and economics you often encounter situations where every problem is unique. You only get one Great Depression, one global financial crisis. There are echoes, but every situation differs in its players, timing, and markets. How do you train your models when the state projections into the future are unique in every sense? Do you create synthetic data? Do you simulate a hundred global financial crises? How do you train and back-test this?
1:55:37
Sam Pasupalak: Right now we're training on datasets that exist, but we also generate simulations of what future states could be. You have to do both. You correlate the simulations with training on historical data, but you also have to project forward. The honest answer is that the winning approach isn't clear yet — it will be some combination of the two. The other piece I want to add is continual learning, which could be part of the solution to exactly what you're describing. With continual learning, the model dynamically adapts to new data as it arrives. That could address the gap of adapting to unprecedented events. But continual learning is even earlier-stage research than world modeling right now. We haven't even agreed on which approach will work — model-based reinforcement learning, or something else. Very early days.
1:57:32
Prakash: Over the last couple of years we've seen dramatic improvements in LLMs, and they've started to approximate some of what you're describing — longer context windows can partially simulate memory, chain-of-thought can partially fake planning. How does that trajectory interact with your world model work? Are LLMs improving the capability of the world model, complementing it, or competing with it? How has the rise of reasoning models changed how you think about this?
1:58:32
Sam Pasupalak: LLMs have shown enormous progress. But when it comes to genuine abstract reasoning — the kind humans do naturally — LLMs fall short. If you want truly sample-efficient algorithms, you need something like a world model. An LLM requires a RAG system: all the information is stored externally, and you're essentially carrying hard drives everywhere. A world model does abstract reasoning in latent space — that's much more efficient. You can increase context windows and throw more GPUs at the problem, but that's not a long-term solution. Here's the analogy: in 2018 and 2019, almost no one was talking about LLMs. There was BERT, there were experiments. It wasn't until the 2020 scaling laws paper from OpenAI — Dario was part of that — that it became mainstream. The same trajectory has to apply to world modeling in two or three years. Until scaling laws extend to world models, I'll keep being asked to justify the gap.
2:00:37
Nathan Labenz: You said earlier your goal is an AI CEO that can run a business. We've done some firsthand reporting on this — friends at Andín Labs are trying exactly that, and their real-world experiments are still pretty rough. Their Gemini-managed cafe in Stockholm is chronically out of stock of key ingredients. The trend is positive, but the gaps are real. There are also some interesting recent results where the best-performing models in business simulations exhibit what researchers describe as ruthless behavior — collusion, threats, strategic deception. The models that don't engage in that behavior make less money. That creates a real tension as we scale up to longer time horizons. Your top-performing AI CEOs are probably going to have a wide toolkit including behaviors that aren't fully honest. Is that a problem you have to confront, or can you make your world models pure oracles and leave the ethics to the LLM providers?
2:01:15
Sam Pasupalak: Those experiments are pure LLM-based — yes.
2:02:58
Sam Pasupalak: LLMs are trained on data that humans put on the web — Reddit, Wikipedia, all of it — and they're not grounded. They generate whatever seems like the best next action based on that training. World models are built differently: given a state-action pair, what is the next state? That grounding allows you to correlate the model's predictions with whatever source of truth you define. That's why world models are actually one of the better solutions to AI safety concerns — not because they're inherently safer by default, but because the human operator controls what the system is optimizing for. You tune the world model toward the ethical policy you want. LLMs execute on that policy. The decision-making, the simulation of consequences — that all happens in the world model, where you can constrain it. You can simulate all possible actions and apply a high-ethical-standard policy to filter which ones the LLM is allowed to execute.
2:04:53
Nathan Labenz: Can you give me a little more on how this actually solves the ethics problem? I'm imagining we're consulting a world model — 'what happens if we do this, what if we do this?' But if I'm trained to predict stock price or total revenue, I might get results suggesting ruthless actions pay off. Like it or not, that's sometimes true.
2:05:30
Sam Pasupalak: It's up to the human to decide what the optimization target is. If you tell the world model to maximize profitability without any ethical constraints, that's what it will optimize for. The policy — the objective function — is human-defined. The world model surfaces all the different simulations, all the probabilities across different futures. You decide which variables matter and what scope the optimization covers.
2:06:00
Nathan Labenz: Would that mean the world model also predicts, say — 'your bank account will be this, your stock price will be this, but you may also be indicted'? Does the model surface the legal or reputational consequences so the human can actually weigh them?
2:06:19
Sam Pasupalak: Yes — it surfaces all sorts of simulations and variations with different probabilities. You have to decide which variables you want in your policy and what you're optimizing for. The scope and the optimization target are things the human defines.
2:07:04
Sam Pasupalak: Sorry — I dropped for a moment. To summarize: it is up to us as humans to decide, out of all these simulations, which future we want, and tune the system accordingly. Think of Pre-Cog in Minority Report. It's up to you to decide which future is the right one. That's the concept of a world model, and that's why it's such a powerful idea.
2:07:38
Nathan Labenz: How much data do you think companies are going to need? I've been interested for a while in the idea of continued pretraining — taking a company like GE or 3M with vast operational histories and baking that depth of knowledge into model weights. That hasn't really materialized; retrieval has mostly outcompeted it, partly because frontier models advance so fast that a custom pretraining cycle can't keep up. Small businesses can just put things in context and it mostly works. So how do you map this from small business to large enterprise? How much data do you actually need to take advantage of what you're building?
2:09:08
Sam Pasupalak: We have to progress step by step. Right now we're training on what I'd call core data — computer usage data, user interfaces, form-filling, button-clicking — and training world models to predict the next step in that interaction space. LLMs and vision-language models aren't doing well at this because it's high-fidelity and expensive to train. From there, we'll move to API models — the back-end side of things. The goal is to reimagine how software is built using world modeling. We'd deploy that in production systems around 2027, start collecting data from SMBs, and once scaling laws extend to world models — which I expect will happen — you'll be able to apply this at large-enterprise scale. We're not there yet. We're still in very early-stage research.
2:11:04
Prakash: Very good. Sam, thank you for joining us today. It's been a fascinating discussion, and we hope to see you bring world models into every business soon.
2:11:26
Sam Pasupalak: Thank you so much for having me. Indeed.
2:11:29
Nathan Labenz: Good to meet you, Sam.
2:11:30
Sam Pasupalak: Take care. Bye.
2:11:36Closing31 min
Wrap-up & reflectionsThree of AI's hardest problems in one morning — reliable reasoning, governable risk, and reliable enterprise autonomy — each with a guest betting their company on a different answer.
Watch
As aired
After the guests departed, Nathan and Prakash spent the bulk of the closing reflecting on the morning's first segment with Carina Hong, founder and CEO of Axiom Math. Nathan opened by raising a question he hadn't gotten to ask her: whether Axiom's solver and conjecture generator are one set of weights or two distinct systems, drawing a parallel to how text and pixel modalities have converged in frontier models. Prakash extended the thread, wondering whether a conjecture machine might actually benefit from higher temperature and diversity rather than tight constraints, and noting Carina's observation that the system worked better in Lean's Mathlib — where formal definitions already exist — than in domains like EconLib where the vocabulary is sparse. They then debated Axiom Math's competitive durability: Nathan acknowledged the acquihire case was strong and that frontier labs could plausibly absorb the same RL-on-formal-verification recipe, while also relaying a point he'd heard separately from Vlad Tenev of Harmonic — a separate company that Carina has described as a competitor — that deep math training builds transferable reasoning primitives, much as code pretraining has generalized broadly, as supporting evidence for the broader thesis that rigorous math is an even deeper foundation than code. Prakash pushed on the macro economics, arguing that mathematical superintelligence would most immediately unlock step-change improvements in cryptography and compression, and that those gains could significantly expand compute efficiency at current data-center scales.
The conversation then shifted to AI pricing dynamics. Nathan argued that as AI value rises, inference prices may be bid up toward human-equivalent rates in some domains — creating a paradox where AI appears to 'cost more' as adoption deepens, even as efficiency gains compound underneath. Prakash illustrated the phenomenon with Waymo pricing in San Francisco, where autonomous vehicles now cost more than traditional taxis because of scarcity and perceived quality. The hosts then turned to their second guest, a law professor who had discussed AI export controls and the administration's lawfare posture. Nathan drew an analogy to doctors who develop a profound love for specific organs, suggesting law professors feel similar reverence for the rule of law — making the current administration's slapdash legal approach especially jarring. Prakash compared the export-control strategy to the tariff playbook: achieve the policy objective (establishing a de facto licensing regime) while legal battles play out, winning the war even if losing individual court skirmishes.
The final stretch turned to the prospect of a US-only 'Manhattan Project' for frontier AI, as envisioned in Leopold Aschenbrenner's Situational Awareness. Nathan expressed hope it wouldn't happen but acknowledged it plausibly could — noting that frontier AI researchers already live at a kind of wartime intensity, many having sacrificed social lives and relationships for the work. He recounted a striking parallel: an Anthropic employee had said, weeks before a widely-noted Zelensky quote, 'I miss being a good friend' — the same WWII-era sense of sacrifice in service of something historically singular. Prakash closed by calling it 'one of the key moments in history.' The hosts wrapped with brief mutual thanks and a sign-off.
Key moments
If you're good at math, you're going to be good at a lot of things. Working with contracts was actually pretty easy, because I have these reasoning primitives — conjunction, disjunction, logical structure — that are second nature to me from all this math training.
Nathan Labenz2:19:16
I miss being a good friend. I'm a bad friend now. And I said, 'You sound like you have a World War Two era mentality.' And they said, 'Yeah — that's how a lot of us feel.'
Nathan Labenz2:41:12
They've already achieved their war goals. The legal battles, someone else will sort out. It's clearly established now that there's a de facto licensing regime, and you're going to have to come to them and get permission before you release.
Prakash2:34:50
Full transcriptLightly edited · timestamps jump to YouTube
2:11:40
Nathan Labenz: One question we didn't get to ask Carina is whether the solver and the conjecture generator are one model — one set of weights just prompted differently — or two distinct systems. I've seen that setup in self-improvement plays: you can often have the same model getting better at both posing and answering questions in itself. Or you could have two separate things. I always think about the same question with domain-specific world models: is it going to be a separate entity called as a tool, or does it ultimately converge — the way text and pixels have converged — where the intuition from the specialist model becomes deeply integrated with language and no longer has to happen through this arm's-length tool-call thing?
2:12:45
It probably varies. My guess is it plays out differently across domains for fundamental reasons — like how well text meshes with a given problem area, and also what developers and users value in terms of traceability. If I put an image into one of the integrated visual reasoners, it sees the image well enough to transform it without completely redoing it. I don't really need it to be separate in a way that lets me inspect a prompt to the language model — I can just look at the image and tell if the output is good.
2:14:03
But there are probably a lot of other areas where it's much harder to evaluate what came out. So you might really want to inspect how a reasoning agent is attempting to use a model, and the bottleneck of an actual tool call could be a lot more of a value-add — even if it sacrifices some seamlessness of integration.
2:14:34
Prakash: I wonder to what extent you'd want the conjecture machine to be a little wild — high temperature, high focus — rather than too constrained. Carina said the system worked particularly well inside Lean's Mathlib where you already had definitions, and a little less well in EconLib where you didn't have those definitions. Without them, you don't have a way to pose conjectures within the mathematical language of the field. And the other question we didn't get to ask her: if you have a couple hundred million dollars and this much compute, how quickly can you find new knowledge that's actually economically productive? Because there's a lot of math that isn't economically productive — and that's a live debate in the AI field.
2:16:08
Nathan Labenz: I'm not worried about Axiom Math yet. Carina's charisma, her obvious momentum, and the results they have today — on an acquihire basis that's probably worth it to several big companies. I think they have at least one more order-of-magnitude valuation to go before they even really have to be economically additive in a meaningful way. That said, I think it's going to be hard for them to outrun the frontier companies — especially because the frontier companies are not sleeping on this. The least compelling thing to me from her segment was the claim that the big labs won't do it. I can easily see it being one set of weights prompted in a certain way, and I can certainly see RL recipes being expanded to include all the same formally-verified rewards they're presumably using.
2:17:44
Prakash: I'm not worried about them either — particularly because what ends up happening is you have to be focused on something that the foundation-model companies will focus on in a couple of years, but you've got to get there a couple years early. Then it makes more sense for them to buy you out than to invest the time themselves. Gathering all the data and the people in one place, ready to be ingested by a larger firm — that's valuable. My bigger question is whether mathematical superintelligence proving will actually be economically productive on the whole. Not for their company, but for the economy. Are we going to see GDP growth go from 2% to 5% or 6% because we have more math? Or is it going to be a lot of theoretical stuff that doesn't really move the needle?
2:19:16
Nathan Labenz: I buy the math-as-foundation thesis. And as an outside point of comparison — not from Carina, but from Vlad Tenev, the Robinhood CEO who is also co-founder of Harmonic, which is a separate company that Carina has described as a competitor — his take was simply: if you're good at math, you're going to be good at a lot of things. He said he was a math student, and he found that working with contracts later — which you might think is totally unrelated — was actually pretty easy, because he'd deeply internalized the reasoning primitives: conjunction, disjunction, logical structure. He could bring that to contract analysis and be better than someone with more contract exposure but less grounding in those core reasoning moves.
2:20:40
And by the way, we've seen a version of this with coding — models trained on code are generally better at lots of things. So the core economic bet is that math is an even deeper foundation that can generalize in potentially more profound ways than code already has. That makes sense to me. But who captures the value of a breakthrough — that's where I'm much less sure. Even just existence proofs are really powerful in the AI space. The big companies always have an inside lane to realize the upside, because they've got all these other capabilities already there waiting to be improved. Whereas a pure math-focused company still has an enormous amount of work left to make that formal reasoning actually useful across all the domains where frontier models are already deployed.
2:23:23
Prakash: As an investor, the best I can come up with is: if you have mathematical superintelligence, you get immediate upgrades in cryptography and compression — two areas with broad application in computer science. A much better compression algorithm could massively lower memory usage for models and dramatically unhobble them at current data-center footprints. Better cryptography leads to more secure and potentially higher-bandwidth communications — similar to how DSL multiplexed a single phone line into multiple bands in the late nineties. Those were the two obvious implications that occurred to me. And the Lean stuff seems to be working. If you've raised the money and can now deploy at 100,000 machines simultaneously, do you start to get results that move the needle on those problems? That is the question for me over the next year. The implications for investors could be very, very large.
2:25:41
Nathan Labenz: Totally. I think we're already seeing something like this with chip prices — an A100 costs more today than it did two years ago. That's a strong signal we're going to see value pricing on compute and inference. The value has gone up, so prices are being bid up. And under-the-hood efficiency gains compound — every 1% improvement is 1% more inventory you can sell. One limit implication of this: we might see AI output prices bid up to roughly what it would cost to hire a human for the same task, as everyone realizes there are one to two orders of cost savings on the table and starts buying out that capacity. One equilibrium might be that it ends up costing very similar to get AI to do a job, because it's all bought out.
2:27:15
It'll be a weird narrative if we enter a time where AI costs are rising and savings ratios are shrinking. People will learn the wrong lesson — 'AI isn't panning out, look how much we have to spend.' But the real answer is: it's being bid up because there are so many high-value uses that the capacity you're going for just doesn't have a great savings ratio by comparison. You could even see some things where AI costs more than human labor, because the underlying resources are being consumed by higher-value applications. That would be confusing if it happens, but I wouldn't be shocked.
2:29:06
Prakash: The funny thing is Waymos already cost more in San Francisco than regular cabs — partly because there aren't enough of them, and partly because they're rated higher quality. A lot of people just don't want to deal with a driver. You can start to see the parallel with other verticals.
2:29:40
Nathan Labenz: Yeah, that's a really great example of exactly that phenomenon. Waymo costs more than a traditional cab, and Tesla FSD at $99 a month for unlimited hours is definitely a lot less — at least if you drive any significant amount of time.
2:30:05
Prakash: What did you feel about our second guest? As a law professor, you've got to be shocked by this administration. They sometimes have their heart in the right place, but they have terrible lawyers. They could have done a better job even on the export controls. But they don't really care — the model is blocked, let's just move on.
2:30:49
Nathan Labenz: Going through my son's cancer journey, I met doctors who specialized in one part of the body and couldn't help gushing about it. The surgeon who put in his port was just delighted by the vena cava — 'it's such a beautiful, plump vein.' The kidney doctor was the same way. I feel like a lot of people who become law professors have a similar reverence for the rule of law. What's happening right now is not at all what they signed up for — watching us actively fumble the civilizational inheritance they dedicated their careers to.
2:31:50
The other strange thing: the Trump administration hasn't really ignored the courts too much. So I find it a little odd that they do these things in such a slapdash way when they know it won't serve them in legal proceedings — and then they seem to feel bound by the results of those proceedings anyway. There was a moment in the deportations where someone was sent despite a court order, so I won't say there was none of it. But my general sense is still: if Anthropic gets an injunction, they'll respect it. Which makes you wonder — why not tighten it up a little before sending the letters in the first place?
2:33:22
Prakash: I see this less like the deportation situation and more like the tariffs. They used various authorities over a year and a half. By the time courts threw them out on individual battles, every company had already migrated production into the US — even at 10 or 20% additional cost. They won the economic war while losing the legal battles. And I think that's exactly what's happening here. It's already very clear there's now a de facto licensing regime. OpenAI can't release a new model without the administration's implicit approval, because the first thing anyone will do is run the same benchmark test again. The war goals have already been achieved. The legal battles, someone else can sort out.
2:34:50
And that's also what Dean Ball really hates about this — you've established a soft licensing regime that could eventually end up in the hands of the other side. You trust yourself with the knife now, but at some point the policy-making apparatus flips.
2:35:46
Nathan Labenz: I don't like it. But I guess this is the same attitude you increasingly hear from people at the frontier companies: 'We didn't choose the war, but here we are, and all we can do is get through it.' I feel that way about covering these lawfare issues. I don't really want to mess with this stuff, but it demands our attention. We'll keep paying attention until order is restored — and I'm not so sure it will be soon. I'd take the over on it lasting at least into early 2029.
2:36:34
Prakash: To what extent do you think they could pull off what Leopold Aschenbrenner described in Situational Awareness — a US-only national team building models out in the desert, a real Manhattan Project? You call the people, ban foreign nationals from working on the models. Do we think that happens in the next year or eighteen months?
2:37:21
Nathan Labenz: I sure hope it doesn't happen, but I can imagine it. Frontier AI research culture is in some ways deeply incompatible with military discipline — there's the famously pink-hair, libertarian, polyamorous foothold in the community. And nobody's keen to leave the Bay Area for an undisclosed location in Nevada —
2:38:17
Prakash: Albuquerque, New Mexico.
2:38:19
Nathan Labenz: — wherever — where they may or may not be able to communicate freely with friends and family, or in extreme cases might not be allowed to leave the facility. And yet, I think enough people would sign up that they could build the team. If you went desk to desk at Anthropic and OpenAI and said, 'This is happening — are you in or out?' — especially coupled with 'you can't do frontier research outside of this' — I think a lot of people would make enormous compromises. The desire to be part of this story, this moment in history, is extraordinarily strong. I think many of these researchers genuinely wouldn't know what to do with themselves if they were cut off from a live player project.
2:39:45
And so you could probably get a lot of people willingly giving up a lot of niceties in life. I still hope it doesn't happen. Even in the extreme case — international researchers who face the choice of 'you can't do this at all' versus 'you can, but under the most draconian restrictions' — I think you'd still get quite a few takers. There are a lot of people who would have a very hard time prying themselves away from this science project.
2:41:03
Prakash: Yep. It's too important. It's probably one of the key moments in history.
2:41:12
Nathan Labenz: And they've been all-in on it for years. What I'm saying may sound strange — I don't mean it applies to everyone — but if it sounds weird that anyone would sign up for that, keep in mind that a lot of these folks basically have no life outside this work already. They're thinking about nothing but this, not calling their parents much, maybe not dating much at all. I was speaking to someone at Anthropic who said something very similar to what I heard Zelensky say in the last twenty-four hours. Zelensky was asked what he misses, and he said, 'I miss being a good father.' This person at Anthropic — a month before that Zelensky quote — said, 'I miss being a good friend. I'm a bad friend now.' Some regret, but not the kind that says 'I'm making the wrong call.' I said, 'That sounds like a World War Two mentality.' And they said, 'Yeah — that's how a lot of us feel.'
2:42:46
Prakash: Wow. What a time.
2:42:51
Nathan Labenz: No doubt.
2:42:55
Prakash: No doubt. Should we call it there for today?
2:42:57
Nathan Labenz: Let's call it. See you tomorrow, Prakash. Cheers.
2:43:00
Prakash: Always a pleasure, Nathan.

The open — commoditization, capex, and AI in the lab

OpenAI reportedly slipping below 50% share as Google surges raised the question of whether the model layer is commoditizing and distribution now wins. Epoch AI's warning that the buildout can't fund itself from cash flow much longer set up the bubble-versus-debt-phase debate. And NVIDIA's ENPIRE and Boltz's wet-lab-validated design models showed AI agents moving into physical research.

Verified math and the law of AI risk — Carina Hong and Doni Bloomfield

Carina Hong made the case for 'Verified AI' — machine-checked proofs in Lean as the reliability layer for non-deterministic models — and faced the critique that the specification bottleneck stays human. Doni Bloomfield argued that export-control law is becoming a de facto AI licensing regime on shaky legal ground, and that the same controls can undermine the biosecurity they aim to protect.

After LLMs — Sam Pasupalak

Skyfall AI's Sam Pasupalak argued that reliable enterprise autonomy needs a grounded 'world model' of the organization rather than raw prompting, drawing on a benchmark showing frontier models fail to predict dynamic enterprise state — and on the perspective of a founder who built grounded language AI before the scaling era.