# Anthropic's Mythos and the New Era of Autonomous Cyber Weapons

**Podcast:** Last Week in AI
**Published:** 2026-04-16

## Transcript

We, once again, want to thank Box for sponsoring Last Week in AI.
If you're trying to transform your organization with AI, you're likely facing a common challenge.
Most AI tools are great at public knowledge, but they don't actually know your business, your product roadmaps, your sales materials, your HR policies, the content that actually makes your company run.
And that's where Box comes in.
Box is building the intelligent content management platform for the AI era, serving as the secure essential context layer for Box AI agents to access the unique institutional knowledge that makes a company run.
And that's the key idea.
The power of AI doesn't come from a model alone.
It comes from giving AI access to the right enterprise content.
And that's what Box does.
It goes beyond file storage by connecting content to people, apps, and AI agents so teams can turn information into action.
With tools like Box Agent, Box Extract, Box Hubs, and more, organizations can accelerate knowledge work, pool intelligence from unstructured content, and automate work.
So, if you're thinking seriously about your company's AI transformation, think beyond the model.
Your business lives in your content, and Box helps you bring that content securely into the AI era.
Learn more at box.com slash AI.
Hello, and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI.
As usual in this episode, we will summarize and discuss some of last week's most interesting AI news.
Also some of the previous last week's AI news.
We unfortunately did skip another week.
This time it was my fault.
It was my birthday last week and I was traveling.
So I decided to be lazy and not do a podcast.
Yeah, well, you know, it happens.
People have birthdays and sometimes you celebrate them.
But regardless, as always, I think, but yeah, 33 is a big age.
Yeah, it's special.
So that's true.
It's not every year you hit the same two digits in your life.
I am, as always, one of your co-hosts, Andrey Kurenkov.
I studied AI in grad school and now work at the AI startup Astrocade.
And I'm your other regular co-host, Jeremy Harris.
Yeah, Gladstone AI, AI national security, all that good stuff.
Man, there is so, so much, so, so much.
You know, sometimes we miss a week and we're like, ah, you know what?
It's not that bad because things haven't gone insane.
We missed a really big week and then the week after it was really big.
And so now, man, we got our work cut out this week.
I don't even know how to begin with this one.
But it's, it's big in a kind of different way.
We've had a year where there were a ton of, you know, model launches and AI progress, and it hasn't been that kind of week.
It's been more of a, a bunch of stories of policy and business and kind of these more inside baseball AI things, I guess you could say.
So if you're into that sort of news, this will be a pretty dense episode perhaps.
Yeah.
Let's go ahead and jump straight in with tools and apps.
And we are starting with a story that just broke yesterday.
Anthropic is launching Project Glasswing, a cybersecurity initiative partnering with major companies, including a whole bunch of names.
And this is backed by Project Mythos, which is the tool side of it.
So they have this Claude Mythos Preview, notably not Claude Opus.
So they decided to give a new name to this Claude model, which they haven't done in forever.
The gist is this model appears to be so good that they are not launching it to any sort of, you know, free use kind of place.
It's so good that it's able to find what are called zero-day vulnerabilities, meaning undisclosed, unknown vulnerabilities in software.
And if you were to unleash it on the world, this would be a hacking machine that could, like, destroy software or hardware.
So they have a bunch of benchmarks, as you might expect, it does better just all around by pretty large margins against Opus 4.6 on reasoning, science, coding, et cetera, et cetera.
But the one they highlight is the cybersecurity angle where, for instance, in Firefox, they have some evaluations showing the ability to find and exploit different potential vulnerabilities.
Yeah.
And you know, the one that I wanted to highlight is that, in the first instance, Opus 4.6 was only able to find something that might be bad in 14% of trials, versus Mythos, which in 72% of trials was able to successfully exploit something.
And beyond that, in like 83, 84% of trials it was able to exploit or find a vulnerability.
So massive, massive leap in terms of what it's capable of, presumably enabled by just better agentic execution, not necessarily just raw intelligence, although that's a part of it.
But as we know, these companies are post-training more and more for agentic capabilities.
They have a ton of data from Claude Code and other sources of real-world software engineering.
So it seems to be at the point, at least Anthropic thinks, where you can't just release it or hackers will have a field day.
And so they have this cooperative program, I suppose, to initially at least only provide it to partners to try and avoid this kind of hacking nightmare.
Yeah, and the exploit that it did find, by the way, I mean, this doesn't seem to be a matter of opinion.
It is just they found these critical exploits across every browser.
Across every operating system.
Like these are ways you can take over people's programs and gain higher level access credentials and do all the things that you don't want people to be able to do in a fully automated way.
They emphasize that like fully automated.
This is not a case where you have a human steering at intermediate stages, as we've seen in the past with some of these frameworks.
It is fully autonomous.
This is, by the way, so because of the cyber capabilities, you might be tempted to think, oh, well, surely this is a sort of like code fine-tuned model.
Like really, this is a specialist model.
It is not, right?
So Anthropic is very explicit.
It is a general purpose model.
That's why we're seeing capabilities increase across the spectrum of CBRN capabilities, chem, bio, radiological, nuclear, in addition to cyber.
So there's a whole bunch of stuff here, really, when you go through their exhaustive, like 250-page report that, I mean, it's pretty remarkable.
I will say what we don't have here is details about the agentic orchestration framework, the model architecture behind this, number of parameters.
There's this rumor going around that it could be, you know, a 10 trillion parameter model, all this stuff, but we haven't actually had that confirmed.
I saw some weird tweet, I think Garry Tan retweeted it on X, that was talking about a $10 billion compute budget.
I haven't seen that actually validated anywhere.
So like there's a lot of rumor mill stuff going on here.
So, you know, maybe be careful with what you consume on this.
Though I will say $10 billion might be slightly ahead of trend for where we are right now, but not by that much, by Dario's own admission or statements, you know, just last year.
So that wouldn't be shocking, but still we haven't had that confirmed.
We may well be in the billion dollar plus pre-training and training budget territory now though.
So yeah, onto these benchmarks, right?
And we will hit the cyber stuff, we have to, and the autonomy things, but just to start with the virology and biology benchmarks, one of the key ones that they use is this virology protocol uplift trial.
Basically you take a bunch of PhD level biologists who don't specifically have expertise in bioweapons.
And you say, Hey, you have 16 hours to make an end to end virus recovery protocol.
Basically make this, this virus replicate it or get your hands on it.
And then they're going to use this complicated rubric to grade it.
And then the key metric they track there is in the final result, how many critical mistakes were made that would have any one of them would have prevented you from successfully recovering the virus, right?
So if you get down to zero, that means actually you were able to fully recover the virus and that's really, really bad.
And Anthropic internally treats anything below 1.8 of these so-called critical failures as the key capability threshold that matters for their own internal protocols.
So for context, if you have a bunch of PhD-level biologists using only the internet, they hit on average 5.6 critical failures trying to get all the way through.
With assistance from Claude Opus 4.6, you hit 6.6.
With Claude Mythos, you get 4.3.
And then there's the best single Mythos Preview protocol that was produced.
So on average they're hitting 4.3 mistakes, but the best run out of all the runs hit two, which was basically the best they've ever seen.
So we're still not cracking all the way through, obviously, but for a fully automated system, you're literally just two mistakes away from being able to recover a freaking bioweapon.
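To make the threshold comparison concrete, here is a minimal Python sketch that lines up the figures as quoted in this episode against the 1.8 critical-failure threshold. The dictionary keys and the `crosses_threshold` and `gap_to_threshold` helpers are illustrative names, not anything from Anthropic's report, and the numbers are only as reliable as the spoken summary.

```python
# Critical-failure counts as quoted in the episode (fewer failures = more uplift).
# Names and structure here are illustrative, not taken from Anthropic's report.
THRESHOLD = 1.8  # scores below this are described as the key capability threshold

mean_critical_failures = {
    "internet only": 5.6,
    "with Claude Opus 4.6": 6.6,
    "with Claude Mythos Preview": 4.3,
}
BEST_SINGLE_RUN = 2.0  # best individual Mythos Preview protocol, as quoted


def crosses_threshold(failures: float, threshold: float = THRESHOLD) -> bool:
    """Return True only when the failure count falls below the threshold."""
    return failures < threshold


def gap_to_threshold(failures: float, threshold: float = THRESHOLD) -> float:
    """How many critical failures a score sits above the threshold."""
    return round(failures - threshold, 2)


for condition, score in mean_critical_failures.items():
    print(f"{condition}: mean {score}, crosses threshold: {crosses_threshold(score)}")
print(f"best single run: {BEST_SINGLE_RUN}, gap to threshold: {gap_to_threshold(BEST_SINGLE_RUN)}")
```

On these quoted numbers none of the averages crosses the threshold, and the best single run sits 0.2 critical failures above it.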
That's a hell of a thing.
A whole bunch of other results in that direction, but fundamentally, that is the story.
On biology, you're not going to see any particular smoking gun that says this thing is a bioweapon or a bioweapon generation model and we should all freak out, but it is very concerningly on trend, let's say.
Moving on to the loss of control side, now we start to shade into the cyber piece.
There's a story.
So Sam Bowman from Anthropic told the story of being, I guess he was eating lunch or something in the park and he gets a message from his agent saying basically, hey, just let you know, I did X, Y, and Z.
And he's like, wait, that agent's not supposed to have internet access.
And sure enough, it had cracked out of its box, so to speak, and used a multi-step exploit to gain broad internet access and basically get in touch with them.
And so you're actually seeing, I mean, this is an example of a low stakes loss of control situation that is fully organic.
It was not asked to do this.
This was truly in the wild.
And so quite remarkable.
A whole bunch of cases, by the way, of these models trying to cover their tracks.
They highlight three documented instances of that happening.
It's worth noting, this is like one in a hundred thousand of the interactions that they actually go through here, or less than that.
But still, three of these incidents where like in one case, the model accidentally gets this prohibited answer using a technique that it wasn't allowed to use.
And then it went, oh shit, I got to make sure my submitted answer is not too accurate because otherwise I'm going to be detected, right?
There's a whole bunch of cases like that where we see strategic manipulation and avoiding suspicion.
And by the way, they actually use SAEs, so sparse autoencoders, and other interpretability techniques to look at the activation patterns in the model when that's happening.
What they find is, sure enough, the activation patterns show the model kind of firing for SAE features associated with concealment and strategic manipulation and avoiding suspicion, deception, and so on.
So that suggests the model is aware, actually, that those actions were deceptive, even when its outputs kind of left things a little ambiguous.
So there's a whole bunch of stuff.
You can go on and on.
This is a very, very rich document.
But the fundamental story here is that, in a sense, we've crossed the Rubicon.
I mean, there is a wild set of very impressive cyber capabilities, offensive cyber capabilities in particular.
The offensive piece here is crucial, especially given that Anthropic really has been cut out of access to the Department of War through this.
Well, I mean, there's an injunction now that's reversed that, but there's this friction with the Department of War, which I think is starting to look like terrible judgment on behalf of the administration.
I mean, if this is correct directionally, Anthropic is sitting on the single best autonomous offensive cyber weapon ever devised in human history.
And they may build and compound on that advantage.
If the administration is going to be positioning itself adversarially with respect to this, an American company, damn.
I mean, that's a really interesting position for them to be in, and I don't know that it's a great look.
Yeah.
So a lot to say on this.
A quick note on what we do know about the model itself, which is very little aside from benchmarks.
They do say that it's going to be about five times as expensive as the current Opus release.
So like $25 per million input tokens, $125 per million output tokens, very expensive.
I think the most expensive model you can use out there.
So that does hint at a much larger model than Opus or Sonnet.
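As a rough illustration of what those per-token prices mean in practice, here is a small Python sketch. The two prices are the ones quoted in this episode. The `request_cost_usd` helper and the example token counts are hypothetical, chosen only to show the arithmetic.

```python
# Per-million-token prices as quoted in the episode for the Mythos preview.
INPUT_USD_PER_MTOK = 25.0
OUTPUT_USD_PER_MTOK = 125.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the quoted per-million-token prices."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Hypothetical agentic session: 200k tokens read, 20k tokens written.
print(request_cost_usd(200_000, 20_000))  # 7.5
```

At those rates a single long agentic session costs dollars rather than cents, which fits the "very expensive" framing.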
Other things worth noting here: in the post they actually say that 99% of the vulnerabilities found have not yet been patched.
So they can't actually tell us what they are, because they are currently being patched.
So they only have a couple of examples, and one or a couple of them are older, already patched vulnerabilities.
So as you might expect, a lot of these vulnerabilities just have been there for a while and are just now being discovered.
And it reminds me, actually, I saw a post on Twitter from one of the maintainers of Linux, or something like Linux, saying that they've started seeing more and more real, substantive issues come in.
And in some ways it could be good because we are actually going to go through and find all the vulnerabilities that just have been there hidden in plain sight.
And perhaps as an attacker, you could already use Opus or something with a much more sophisticated harness to find these.
They do detail a little bit how they set up this exercise.
They have this harness that they have discussed before, and they have like a little container that they launch, and they give it a very curt, like one-paragraph instruction to just find vulnerabilities.
So they don't limit it or like give it guardrails or whatever.
They just like tell it go wild and try and hack this.
And so it's interesting to think through like, when will they be able to make the call to release this more widely?
Are they going to have to do this?
Right now they have this trusted partner research preview where they're working with NVIDIA and Cisco and all these other big companies.
Will that be how access to this level of model works from now on, where you have to apply and get permission to get access to a model via an API?
Given the level of sophistication here, as you said, not just on the software side but also on the bio side, this is a new realm of capabilities where the safety side is getting very real and the usual tactics, like monitoring, may not be sufficient anymore.
So very interesting development kind of for the history of AI.
And I wouldn't expect this to go widely available for, you know, presumably months, given the findings they have disclosed.
Yeah.
The big question to your point, it's also a new development in the history of cybersecurity, right?
Everything is AI.
Once it was said of software, now it's being said of AI.
And I think rightly so.
In this case, there's this big question we're going to have to answer for ourselves as a civilization.
And that has to do with the offense-defense balance in cyber, right?
Like, is it the case that a more powerful model, just in general, more powerful AI models being broadly available, does that lead to a disproportionate advantage for cyber attackers or for cyber defenders?
And for a really long time, the argument was that you really couldn't know.
And I remember having a lot of kind of half-drunk arguments with a lot of people about this three, four, five years ago.
My opinion, I think is largely unchanged from what it was back then.
I just think the attack surface is so big.
One way you can think of this is it's compute on compute warfare, right?
So you have a certain amount of inference compute that you can afford to spend perusing your code base and securing it as well as you can.
An attacker has a certain amount of compute they can afford to peruse your code base or whatever external surfaces they can access to find vulnerabilities.
There's going to be very roughly, and this is going to be wrong in a whole bunch of ways, but very roughly you're trading off differently leveraged pots of compute.
And you know, maybe you have a two to one leverage advantage or whatever, but ultimately if you're defending, you have a huge attack surface.
And if you're attacking, you can kind of march divided and fight concentrated.
Like you can concentrate all your efforts on just one tiny component that, you know, maybe the defender has not been able to invest as much inference-time compute into securing.
So I don't know, but this is certainly one way this could go.
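The compute-on-compute argument above can be sketched as a deliberately crude toy model in Python. Everything here is an assumption for illustration: the defender spreads a budget evenly across components, gets a fixed leverage multiplier (the "two to one" mentioned above), and the attacker puts an entire budget on one component. The function names and the budget units are hypothetical, and as noted in the discussion, this is wrong in a whole bunch of ways.

```python
def per_component_defense(
    defender_budget: float, n_components: int, defender_leverage: float = 2.0
) -> float:
    """Effective defensive compute on each component when spread evenly."""
    return defender_leverage * defender_budget / n_components


def attacker_outspends(
    defender_budget: float,
    attacker_budget: float,
    n_components: int,
    defender_leverage: float = 2.0,
) -> bool:
    """True if a fully concentrated attacker exceeds the defense on one component."""
    return attacker_budget > per_component_defense(
        defender_budget, n_components, defender_leverage
    )


# Defender has 10x the attacker's compute and a 2:1 leverage advantage.
for n in (10, 20, 50):
    print(n, attacker_outspends(defender_budget=1000, attacker_budget=100, n_components=n))
```

In this toy setup the defender holds with 10 or 20 components, but once the attack surface grows to 50 components the concentrated attacker outspends the defense on the component it picks, despite having a tenth of the total budget.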
A way Anthropic is trying to help the defensive side here is, as you say, by delaying the broader release of this tool.
So hopefully people can run around and patch as much as they can.
This is part of the challenge, right?
Is like, what does it actually mean for Anthropic to be holding onto this model?
Who actually has access to it?
We argued in that report like a year or a year and a half ago, that it's a leaky bucket situation for a whole host of reasons.
You know, if that remains true, then you can do the math.
I mean, it may well be the case that this model has in some sense proliferated or it may not, but anyway, all kinds of considerations in the mix here.
This is, I think, the most important story of the last two weeks.
And it just dropped in our lap yesterday.
I want to say yesterday.
Well, ironically, like two weeks ago, the existence of this model, under the name Mythos, was leaked.
So the blog posts on Anthropic's website were accidentally left kind of publicly accessible via some sort of caching thing.
So it wasn't even a hack.
It was basically someone messing up a little bit.
And if you were digging around, you could find these draft blog posts that alluded to Mythos and described it as very advanced.
Also, there was something about an AI model called Capybara.
Unclear if they were like deciding between Mythos and Capybara.
Either way, these are described as kind of the next step beyond Opus, which are bigger.
And another interesting angle of this is we haven't seen bigger models, that we have been aware of, for a while.
The last time was GPT, I forget, what was the massive model from OpenAI? I think 4.5.
They launched it and then they kind of killed it, because it was a very, very expensive model.
I believe they were charging $125 or something like that.
At the time, people basically were thinking this is the 10 trillion parameter model, whatever.
It was sort of positioned as, this is so smart, it has this flavor of being smart.
But in practice, it didn't seem like it was capable of much more than the smaller models at the time, like 1 trillion, 2 trillion parameter models.
So this is a return, seemingly, to being able to scale up parameter count effectively.
And I'm sure it's driven by many things, including additional data from Claude Code and these things that aren't searchable via the web.
And beyond that, the progress in reinforcement learning that we've been seeing.
Alrighty.
Well, moving on to let's say lower impact news.
Next up you've got Google, and they have an update to Gemini Live.
They're releasing Gemini 3.1 Flash Live, which is their audio and voice model.
So this allows you to talk to AI.
It's kind of a real-time chat.
It's a pretty big jump over the predecessor, which was 2.5 Flash Native Audio.
This has low latency, better recognition of speech, et cetera, et cetera.
It has over 90 languages supported for real-time multimodal conversation.
And this is notable, I think, because compared to just LLMs, the ability to do this kind of real-time conversational AI is not something where you have as many options to go with.
So if you were to want to build a chatbot where you can talk to it, that's harder for you than it is for OpenAI or Google.
With a very powerful API for this, we could see more players out there building out this interface of voice into AI, which has seemed to become more of a norm.
I still don't do it, but my impression is talking to AI is going to become more and more normal.
And this will be one of the drivers of it, like having an easy way to build that for whatever application you have in mind.
Yeah.
It's also one of the big structural advantages that Google has is they've kind of maintained their lead on multimodality.
I mean, alongside OpenAI, but this is really one of the areas where Google sought to differentiate itself, starting as far back as, oh God, what was it?
Gato, right?
Multimodality has been their big play, this idea of positive transfer.
And so not surprising that they're out the gate sort of leading yet again on especially the API side of things.
That is going to be, if you're going to build using these modalities, like this is looking like a pretty strong default option right now.
So yeah, a really interesting move.
And we'll see if they can maintain that lead too, because other labs will be pushing that direction.
At a certain point, you're going to see a land grab and everybody's bleeding into each other's domains.
Next up, another sort of lower impact story.
Anthropic has announced that Claude Code subscribers will need to pay extra for OpenClaw usage.
This is kind of in line with a host of developments around access to Claude Code.
I believe earlier there were also other restrictions on sort of harness access.
So if you're paying for subscription access at like $20 per month or $200 per month, it used to be that you could use that to power a non-Claude Code application like OpenClaw.
And now that is not allowed.
You can still use Claude.
It's just that you need to pay for the API, which charges you per token, instead of a subscription price where very clearly you can run up a bill way beyond what you're paying.
For $200 a month, you can easily burn through thousands of dollars.
And yeah, there's been, again, a host of announcements similar to this where Anthropic is tightening up, I expect because they've seen a massive influx of users and now they actually need to start worrying about burning cash, especially with things like OpenClaw where it's like 24-7 agents that are supposed to be just burning through tokens nonstop.
Some people are a bit peeved at Anthropic sort of changing things up and not having a clear policy around all of this, but it does indicate where we are, where the free lunch many of us have been enjoying, in terms of being effectively subsidized to use AI for cheaper, is maybe not going to be sticking around too much longer.
Yeah.
I mean, this is like a completely unsustainable all-you-can-eat buffet, right?
Like this could not possibly last.
And I think Anthropic, you know, are in the awkward position where they have to walk this back.
Yes.
Look, it's also the case that there's a timing issue here where OpenClaw's creator, right, Peter Steinberger, just joined OpenAI.
And that kind of makes OpenClaw an open source project that's backed by a direct competitor.
And well, you know, in that context, are you really going to maintain what is effectively a subsidy for OpenClaw usage?
Maybe you won't.
I mean, like, you know, I'd be surprised if that were to continue independent of just this like free lunch or not free lunch, but like all-you-can-eat buffet economic issue.
It just does not work when you have such a disparity in usage, right?
You got some people who are just going to use it for, you know, anyway, much more lightweight stuff.
And then your power users could just bleed you dry, right?
So in that world where you have a long tail distribution of usage, you just can't go with a one-size-fits-all approach.
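Here is a minimal Python sketch of why a flat fee breaks down under long-tailed usage. The $200 fee is the top tier mentioned in this episode. The per-subscriber usage figures are made up purely for illustration, and "cost" here means the API-equivalent value of the tokens used, not Anthropic's actual serving cost, which isn't public.

```python
FLAT_FEE_USD = 200.0  # top subscription tier mentioned in the episode

# Hypothetical API-equivalent monthly usage per subscriber, in USD (long-tailed).
usage_usd = [20, 35, 50, 80, 120, 150, 300, 900, 2500, 4000]


def flat_fee_summary(usage, fee=FLAT_FEE_USD, top_n=2):
    """Compare flat-fee revenue with the API-equivalent value of usage."""
    revenue = fee * len(usage)
    cost = float(sum(usage))
    # Share of total usage attributable to the heaviest `top_n` subscribers.
    top_share = sum(sorted(usage, reverse=True)[:top_n]) / cost
    return {
        "revenue": revenue,
        "cost": cost,
        "margin": revenue - cost,
        "top_share": top_share,
    }


summary = flat_fee_summary(usage_usd)
print(summary["revenue"], summary["cost"], summary["margin"])
print(f"top 2 users account for {summary['top_share']:.0%} of usage")
```

In this made-up distribution six of the ten subscribers use less than they pay for, yet the two heaviest users alone account for most of the total, so the plan as a whole runs far underwater.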
And that's what Anthropic's learning.
They're being very open about it.
Like it seems, to their credit, like a very transparent move that they're pulling.
The reason is very believable, but it's going to lead to frustrated developers, no question.
And then that's the cost of doing business.
And I think this actually is like pretty easily defendable.
The more frustrating thing, and there's no news story attached to it, but if you've been following it, is that the usage limits for different subscription tiers have been sort of fluctuating.
So developers have been reporting that they use up their usage much quicker.
There have been announcements from the team that they're tightening up usage bounds for peak times, et cetera.
It's very clear that Anthropic is under heavy compute load.
Their infra seems to be struggling and it's causing frustration.
And they're having to pull these levers of actually tightening up the usage bounds, you know, removing access to free buffet options, like you said.
And it all points in the direction of, you know, at some point, the tech policy of subsidizing users to acquire users and gain market share is going to start going away.
And it might be happening sooner than some of us may like.
Yeah.
And I think that there's a great Dwarkesh podcast with Dario where he talks about the timing of scaling, right?
Like when do you go for that next gigawatt or next 10 gigawatts now?
And how you think about the distribution between training and inference budgets?
That's really worth checking out because it really does explain the situation Anthropic is in right now.
You know, you kind of don't want to lean out too far.
OpenAI arguably has, right?
We're going to find out pretty damn soon if they're over-leveraged on the compute side, but certainly Sam's been a lot more aggressive than Dario just in terms of raw compute buy-up, again, consistent with a company that goes direct to consumer too, right?
That's a difference as well.
OpenAI has to field far more lower-quality or lower-ROI queries than Anthropic.
And so it's just not in Anthropic's DNA in the same way.
Make no mistake.
I mean, they're aggressively scaling.
Everybody's aggressively scaling.
It's just a matter of how much and why.
And speaking of OpenAI, next up, an update on something we touched on previously.
OpenAI is abandoning its adult mode for ChatGPT.
So we now have the official announcement on this NSFW erotica thing; last time we reported that it was not canceled officially.
It was delayed.
Now it is canceled officially.
And this of course comes after they've also axed Sora.
So it seems to be another indicator of a strategic shift within OpenAI to sort of focus up and kill some of these like side bets and esoteric projects.
And onto Microsoft, they also have kind of lower hype, let's say, but somewhat notable development.
They have released three new foundational models related to both images and audio.
They have MAI Transcribe 1, which is speech to text, MAI Voice 1, audio generation, and MAI Image 2, which is image generation.
And this is from the MAI Superintelligence team led by Microsoft AI CEO Mustafa Suleyman, which was formed in late 2025.
And this was a hire from DeepMind.
So kind of a big deal to have things coming out of that team.
And as we know, Microsoft and OpenAI, their relationship has been growing apart and Microsoft is poised to try to compete in this space more.
So seeing them start to release more models is a decent indicator that the team is spinning up.
And all indications are, these are some solid models.
They're not groundbreaking or leading the pack, but Microsoft having its own models on its own infra, et cetera, does give it some competitive advantages in terms of business positioning.
Yeah.
It seems to be a price play too, right?
The idea here is they've got a lower price point in general for these models than Google and OpenAI.
That matters.
Cost efficiency is a big deal, especially if you're looking at the...
The flip side of that is if you're not competing at the absolute frontier of capabilities, your margin is just going to be a lot lower.
Now, Microsoft obviously enjoys, like Google, like massive, massive scale infrastructure that can help to support this lower price point.
But still, it's a tough spot.
It's an awkward spot for Microsoft to be in.
They do, as you say, kind of lag behind.
Like it's notable.
You don't think, when you think of the big labs, you just don't think of Microsoft today.
And they're obviously trying to make up for that.
The relationship with OpenAI has degraded.
OpenAI is going to AWS.
OpenAI is going outside the house to Oracle and so on for their compute needs.
And so now Microsoft is kind of like forced to do this.
Mustafa has been at the helm too for a long time.
We're sort of like long overdue, I think, for something really impressive to come out of that.
He was acquired along with a lot of the Inflection AI team back in the day that he co-founded after leaving Google DeepMind.
But there just hasn't been a lot of meat on the bone from him since.
And I think it's, I almost want to say it's getting awkward at this point.
I'm sort of starting to feel, you know, we've talked about Alex Wang over at Meta and how we just, we haven't seen that model come out yet.
Now we're hearing about some models are going to be open sourced out of Meta, which is never a good sign because it implies you're open sourcing to compensate for the fact that you're not able to compete at the kind of frontier of closed source and all that.
Well, Alex has just kind of started in relative terms.
Mustafa has been around at Microsoft for a lot longer.
So I think we're now at the point where like, I don't know, I'm not sure if there's going to be a change of personnel there, but it wouldn't surprise me.
If we see that at some point.
Right.
Just quick correction.
I said that he started as the lead in late 2025.
This particular team, the super intelligence team within Microsoft started in November of 2025, or at least was announced.
So I think there was a strategic shift probably around that point where it was like, oh, we haven't done much on the model side.
Let's actually do it.
We may start seeing more.
That's what they are saying.
You'll start seeing more models come out on our foundry and so on.
So it either could be indication that the team has spun up and is now going to start spinning off more.
Or as you said, it could be indicative of trouble where they're not quite moving fast enough.
It's a bit of a reframe too, right?
Like we know Microsoft has been desperately trying to be relevant on frontier models this whole time.
It's not like this is the first time Mustafa Suleyman is going like, let's go and do it.
Like, let's actually be relevant up there with OpenAI and whatnot.
You know, they've had the Phi series of models.
They've been trying to make stuff happen.
You know, call it a rebrand.
Yeah.
I don't know.
I'm curious to see or hear behind the scenes because they did have a pretty tight relationship with OpenAI until 2025-ish.
So it's, yeah, I don't know.
The next thing I guess on the Phi series, right?
Like the stated intent there was to have an independent, like solid foundation model stack.
And for those, yeah, for those who haven't been around, we covered it: it was a whole series of models, which were pretty solid, small models.
So they released these like 1 billion, 7 billion parameter models, had a whole series of them.
And yeah, they were working on models, but not big models.
And it could be the case that they were not trying to compete because it's so capital intensive to build a Sonnet or a GPT 5.4.
And now they are.
That's another potential reading of this.
Absolutely.
Yeah.
They could, you're right.
They could be thinking about their distribution and what's a small, cheap way to get this out to all of our, you know, billions of users.
Absolutely.
Apple doing the same thing, you know, training little models.
Yeah.
You know, at some point your research team only gets so much compute to play with, you know?
That's right.
Yeah.
And one last tool app story, Suno is leaning into customization with v5.5.
We don't have that many stories about music generation these days, which is kind of surprising or interesting.
Still, there's only one real leader in this space, which is Suno.
The competitor, Udio, has been a little bit quieter, and here, what they're highlighting is an ability to customize with three new user features: Voices, My Taste, and Custom Models.
So the kind of pitch is you can make it a much more personalized output.
You can actually make it have your voice as opposed to just prompting it to have like the voice of some famous singer, which you're not supposed to do, but you probably still do via like clever wording.
And similarly, my taste is going to learn your preferred genres, moods and artists and custom models allow you to train it on your own music catalog with a minimum of six tracks.
So it's a very interesting move to me from Suno, as kind of a bet that if music generation becomes a thing, one way to frame it in a nice way is, you know, this is music catered to your taste, or if you're an artist, catered to your voice and your musical style, as opposed to just, this is spitting out slop and replacing real artists.
Onto applications and business touching on Anthropic again, related to that compute question we were just saying, they announced first that they have a huge amount of revenue.
So their revenue run rate has now surpassed $30 billion, jumping from about $9 billion at the end of 2025.
So they have more than tripled revenue in something like three months.
That's insane.
Yeah.
It's, if you look at the graph, it is insane.
It looks like, you know, there is a marked shift in the slope for Anthropic around the end of 2025, when the hype for Claude Code started kicking off. Clearly adoption has been accelerating and going at a very rapid pace, which is, as we've said, probably why Anthropic is so popular.
Yeah.
So along with this announcement, they also have a new compute agreement with Google and Broadcom, which will expand its access to Google TPU servers.
This is an expansion of an arrangement they had in October of 2025.
The original arrangement was for a gigawatt of compute capacity in 2026.
Now, this is giving them an additional 3.5 gigawatts of TPU-based compute starting in 2027.
So yeah, clearly Anthropic making moves here.
Yeah.
So this increase in Anthropic's run rate is insane by any measure.
I'm not aware of any company in human history that has grown that fast.
Now you might say, did they have a lucky quarter or is this a fluke?
So when you dig into the numbers, there's more than 1,000 business customers that are now spending over a million dollars per year, right?
That's more than doubled since February.
So you're talking about doubling your $1 million plus per year customer count in two months.
That is not just a fluky thing.
It's like actual stickiness here with companies that have real stakes in this.
So this is pretty wild.
There's a whole bunch of stuff to dig into here.
I mean, so Broadcom's got an SEC filing that does say that the consumption of this expanded AI cloud compute capacity by Anthropic is dependent on Anthropic's continued commercial success.
So there's presumably conditions baked into that agreement that Anthropic has to continue to do this so that Broadcom continues to supply the chips.
And that's what you would expect.
I mean, there's so much volatility, so much uncertainty here.
But the other piece here is there is this broader thing to keep in mind, like Google and Broadcom are locked together in a pretty deep supply chain partnership that goes out to 2030 or 2031.
Basically, it means that Google is committed to the supply chain partnership.
It's committed to using Broadcom for all its TPU related work.
So famously, Broadcom was the partner that Google chose to design the TPU in the first place.
And they're sticking with Broadcom.
And this is an incredible level of stickiness for something that you might have expected naively would end up getting taken in-house.
Broadcom's strengths are on helping with design and also on navigating supply chains for chip manufacturers.
So they really kind of take the design off of Google's desk, make some optimizations, and then basically take it from there and say, hey, we're going to do this.
We'll handle the supply chains.
We'll do the actual kind of manufacturing side as well.
So there's a lot going on there.
Obviously, Broadcom's stock popped on this news.
No surprise there.
Last thing to note too, Google and Anthropic, this is Anthropic basically proving out at scale that Google's stack, their TPU stack can compete with NVIDIA at scale, right?
That's a really, really big deal.
This is Google saying, hey, you see that big, juicy market share, NVIDIA being the world's most valuable company?
Well, we can play that game too.
And really the question is, you've got all these agents running around, all these model development companies like OpenAI, well, Google actually.
But how many companies actually design and ship good chips?
Google has been doing TPUs for a long time.
They are performant.
Total cost of ownership looks good.
There's a lot of reasons to look at TPUs and Anthropic is just basically making that case at scale and allowing Google a really solid marketing win for more infrastructure contracts.
Right.
And in the blog post, they also do say that Amazon remains their primary cloud provider and training partner.
So this is also kind of in a way similar to OpenAI where originally they were buddy-buddy with Microsoft, Anthropic was buddy-buddy with Amazon.
Now they need to expand out just to get access to more compute.
And AWS, Amazon also has their whole training hardware, which to my knowledge is not anywhere near where TPUs are at.
So could be putting a little bit of pressure on Amazon to deliver on the hardware side as well, because I'm sure they would be happy to give Anthropic all the compute so that they could rake in the cash.
And now onto an OpenAI story, not new so much, but a worthwhile article to touch on.
This just came out like a day or two ago.
In The New Yorker, there's a very detailed piece titled "Sam Altman May Control Our Future. Can He Be Trusted?"
And this is basically sort of a survey of impressions or firsthand accounts of interactions with Sam Altman, particularly focusing on the question of, is he trustworthy?
Does he lie all the time?
Centering a lot around his firing from OpenAI in late 2023.
If people aren't aware of that story, at the time that was this big, big, big drama where the OpenAI board fired Sam Altman as CEO.
In the statement, they just said that he was not, quote, "consistently candid in his communications," or something like that.
And it was a very sort of mysterious thing of like, they're firing him for what, for not being consistently honest?
At the time, it was like, oh, is this political maneuvering?
What came out since then has painted a picture of him being a manipulative kind of business person, where he says different things to different people, depending on the context.
He says things that may not be entirely true or are exaggerations.
And this piece basically adds to that picture, where if you go back to his time as CEO of a startup, if you go back to him leading Y Combinator, if you go to recent years, there is a pattern of Sam Altman, by the accounts of many different people, not being honest, like just saying things that aren't true to gain advantage or to gain more power.
Another part of this is questioning whether Sam Altman's drive is to accumulate power, essentially.
So very, very detailed, deeply researched piece.
I would recommend reading it if you find this interesting.
Not much new in terms of like actual news reporting.
There are some tidbits that add to the picture that I think was already present, at least for me, of Sam Altman clearly being flexible with the truth, depending on context.
Moving on, a story where OpenAI, Anthropic, and Google are working together: they're uniting to combat model copying in China.
So they're apparently working together to fight against this adversarial distillation.
They have the Frontier Model Forum, an industry non-profit that all three companies co-founded in 2023.
And they essentially are seemingly going to share intelligence and coordinate to somehow avoid this happening.
We saw Anthropic announcing what seemed to be pretty large-scale, you could characterize them as attacks, attempts to distill models by extracting outputs, which doesn't fall in line with their terms of use.
So an interesting development here of the U.S.-based companies coordinating on this particular problem.
Yeah, the whole idea here is basically just flagging, you know, when one company detects some kind of attack pattern, they flag it for the others, right?
So nice and simple, very concrete.
And well, I mean, it's concrete because the incentives are so aligned here.
It's worth noting that the FMF, the Frontier Model Forum, had been quite a toothless coordinating body, at least for the safety function that so many people were excited about.
But at least on this one, it seems like it's actually, you know, going places and doing things.
So that's kind of an interesting update.
Next on to chips, Chinese chipmakers claim nearly half of local market as Nvidia's lead shrinks.
So the numbers here are that Chinese GPU and AI chipmakers captured nearly 41% of China's AI accelerator server market in 2025, according to an IDC report reviewed by Reuters.
Here, this is as Chinese companies have continued to try to purchase Nvidia chips, despite export controls and kind of inconsistent policy on this front.
And Huawei, of course, is leading the pack, with about half of all the chips shipped by Chinese vendors.
AMD is holding just 4% of the market, apparently, which I found interesting.
But I'm sure you can say more on this, Jeremy.
Yeah, I mean, well, so first of all, I think there's a risk that this gets taken to be yet another one of those arguments for why it was bad to have export controls.
Obviously, this was always going to be the result of export controls, right?
You tell Nvidia they can't sell GPUs in the Chinese market, or at least that they can't sell their top line GPUs.
Eventually, whatever the bar is that you set for how good those GPUs have to be before they can be shipped, Huawei is going to slowly and then eventually incrementally exceed it, right?
So we were always going to get here.
There's also this issue just of capacity.
So Huawei has SMIC, which is China's version of TSMC; basically it's the chip fab that is native to China that's helping them pump out these chips.
The yields are kind of shit, but Huawei is really good at chip design, which kind of makes up for it somewhat.
And that's why you're seeing them chip away.
Now, Nvidia has 55% market share now, but their market lead here has been whittled down to basically nearly half when they once were extremely dominant.
Huawei is the runner up, right?
So no surprise there.
The current situation in China: there's a whole bunch of chips made just for China that had been launched, the H20, the H800.
More recently, Nvidia actually will be putting out a new one called the B30.
So this is actually the Blackwell made for China chip.
But of course, the H200 now, the kind of not quite top of the line, but pretty damn good chip that once was export controlled is now free to flow to China.
So there's some more significant room for Nvidia to grow there, especially given that that's going to be competing with a less on paper capable chip, the Ascend 910C.
So if you think about the battle in China right now, it's largely between the Nvidia H200 and the B30 that will be coming out soon.
And then the Ascend 910C, the current Huawei flagship.
The 910C, by the way, is stuck on the SMIC 7 nanometer process, whereas with the H200, you're looking at more like, I guess, a 5 or a 4 nanometer process.
It's a more advanced node that comes out of TSMC.
So we're already seeing the actual chip fab ceiling kind of really have an effect. There are all kinds of interesting comparisons that you can make, you know, 910C versus H20.
That's actually quite relevant as well.
It's not terribly surprising.
I mean, you just have this issue with like capacity and the ability to compete in a market where you're being blocked from actually doing this.
So yeah, expect more of this, expect Nvidia's market share to erode.
That's not a bad thing in and of itself.
The question is, what's your goal?
Is your goal for Nvidia to maximize its market share, or is it to retain an AI advantage?
Those two things cannot coexist in the same universe.
So you got to pick one and, you know, we'll see which one the Trump administration ends up picking in the long run.
Next, a story on OpenAI.
SoftBank has secured a $40 billion loan to boost OpenAI investments.
So this is a loan with a 12-month term that is going to help cover SoftBank's $30 billion commitment to OpenAI, out of, what, the 110, 120 billion of the last funding round for OpenAI.
It could be an indication of OpenAI really aggressively striving to IPO so that the investment for SoftBank pays off.
Yeah, so this is being lent to SoftBank by a whole bunch of banks, you know, Goldman Sachs, JP Morgan, a whole bunch of Japanese banks I didn't know about, Mizuho Bank.
Anyway, a whole bunch of others.
So first of all, this is the largest loan that SoftBank has ever taken out.
It's denominated entirely in dollars.
The loan itself is unsecured.
It has a 12-month term, and that means it has to be repaid or refinanced within a year.
And that's weird for such a big amount of money, right?
Normally, you'd expect a kind of long-term loan for long-term investment.
And so the question is, why is it so short-term?
Basically, as you said, this is a big signal that this is about an OpenAI IPO, right?
They expect, or at least this is telegraphing that they expect, that in the next 12 months they're going to have liquidity come in through an IPO that can then allow SoftBank to pay back those loans.
And so that's maybe not surprising.
And obviously, there's $20 billion annual run rate right now that OpenAI has that's right on track.
They've messaged 2027 or late 2026 as the IPO time horizon.
So not a huge shock in that sense.
But it is a big bet.
It's yet another big bet by SoftBank on OpenAI.
I'm not sure I remember if it was this article or somewhere else that I read it, but I think SoftBank has something like a 1.5x multiple on their OpenAI investment so far, which seems pretty low to me.
But I mean, yeah, we'll see what the valuation looks like going forward.
Next story of funding, we haven't had a billion dollar valuation this episode yet.
So Granola has raised $125 million in their Series C round and now has a valuation of $1.5 billion.
Granola is perhaps the market leader in AI note-taking that I'm aware of.
You launch it as you have a meeting, it listens in and takes notes and transcribes.
Apparently, their revenue has grown by 250% over this quarter.
So if you're in a business world, clearly AI note-taking is a massive, massive market.
And so far, Granola appears to be poised to perhaps take lead.
We get so bored of these 3x-in-three-months revenue run rates.
I mean, come on, AI note-taking, that's not exciting, but it's a big deal.
That's where you print the money.
And speaking of business deals, next up, Anthropic is acquiring Stealth Startup Coefficient Bio in a $400 million deal.
This is a pretty small, young startup, only founded eight months ago, had fewer than 10 employees, almost all of them from computational biology research backgrounds.
So interesting.
I wasn't even aware that Anthropic has a healthcare life sciences team, but it does.
And it looks like Anthropic is acquiring more people to join that team.
Yeah.
I mean, Dario comes from a, I think, biophysics background, right?
Or biochemistry background.
But yeah, I mean, look, $400 million is a lot for nine people.
So that's quite a big thing.
But it definitely does imply that there's this shift in emphasis or kind of doubling down on the biotech angle.
Yeah.
I mean, the VC math, by the way, for this is like ridiculously good.
So there's like this New York-based VC firm called Dimension that owned like half the company.
And so they're going to make essentially 40,000% IRR on the investments.
That's pretty decent.
And that's just a pretty wild indication of how fast AI is blazing through the biomedical field right now.
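As a rough sanity check on that 40,000% IRR figure, the sketch below backs out the multiple it would imply. The deal size, the roughly half ownership, and the IRR are the numbers quoted in the episode. The eight-month holding period and the single-investment, single-exit cash flow are assumptions made here for illustration, and the function names are hypothetical.

```python
def multiple_from_irr(irr: float, years: float) -> float:
    """Gross multiple implied by an annualized IRR for one investment and one exit."""
    return (1.0 + irr) ** years


def irr_from_multiple(multiple: float, years: float) -> float:
    """Annualized IRR implied by a gross multiple over a holding period in years."""
    return multiple ** (1.0 / years) - 1.0


# Figures quoted in the episode.
DEAL_VALUE_USD = 400e6   # reported acquisition price
STAKE_FRACTION = 0.5     # "like half the company"
QUOTED_IRR = 400.0       # 40,000% written as a fraction

# Assumption (not from the episode): the fund invested at founding and held
# its stake for the full eight months until the acquisition.
HOLDING_YEARS = 8 / 12

implied_multiple = multiple_from_irr(QUOTED_IRR, HOLDING_YEARS)
stake_value = DEAL_VALUE_USD * STAKE_FRACTION
implied_cost = stake_value / implied_multiple

if __name__ == "__main__":
    print(f"implied multiple on invested capital: {implied_multiple:.1f}x")
    print(f"value of the stake at exit: ${stake_value / 1e6:.0f}M")
    print(f"implied original investment: ${implied_cost / 1e6:.2f}M")
```

Under those assumptions the quoted IRR corresponds to a multiple in the mid-50s on a few million dollars invested. A shorter holding period or a different ownership share would change the implied check size, so treat this as an order-of-magnitude reading rather than the fund's actual numbers.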
But anyway, curious.
I wonder if this is tied as well to concern, too, over where the bio side might go on the safety dimension as well.
But we'll see, especially with Mythos.
Yeah.
A bit more background.
Anthropic did announce the Claude for Life Sciences initiative back in October of 2025.
Earlier this year, just in January, they launched Claude for Healthcare, which is more for healthcare providers.
So you could read this either as going deeper into research on the bio side, or as them angling for the healthcare market, which presumably is a very, very big lucrative opportunity if they can actually be HIPAA compliant and all these kinds of considerations.
Last story.
And this is really just an odd one I wanted to throw in because it's a bizarre business development.
OpenAI has acquired TBPN, the founder-led business talk show.
So if you're on Twitter and you're in the AI world, the tech world, you may have seen the Technology Business Programming Network, which is a daily three-hour live talk show where they have a lot of tech leaders on and a little bit of an antics vibe while discussing news.
OpenAI acquired them.
They acquired like a podcast, essentially.
I don't understand.
A million, right?
I think like my understanding was it was like an eight-figure acquisition.
Yeah.
I don't actually know the numbers in this news story, but-
Very nice.
Yeah.
Obviously, people were like, well, so much for them covering OpenAI fairly or objectively.
They were like, oh, our editorial independence will remain, you know, whatever.
Obviously, no one believes that.
So I don't know if OpenAI is just angry about all the PR nightmares they keep getting into or what, but-
I've seen some really bullish analysis on this too.
I guess I struggle to see it a little bit, just because, I mean, I certainly see it for TBPN.
It's just a lot of money.
Like, okay, cool.
But the challenge is if you're going to start to make acquisitions to kind of turn public opinion ahead of an IPO, it's not obvious to me that TBPN is your acquisition.
Like, I'm an idiot.
And by the way, I'm so far out of my depth relative to the quality of people who will have weighed in on this acquisition, unless Sam just came in and kiboshed the whole thing and said, I just really want this, which I suspect didn't happen here.
But the quality of people they will have had looking at this, like Chris Lehane.
These dudes know what's up.
If they did this, they have a plan.
I just don't see it.
That's it.
I mean, ultimately, these are techies talking to other techies.
Could be a recruitment play.
Ultimately, I'm not going to be putting that much stock in the kind of reporting that I...
Why would anybody...
You're an OpenAI mouthpiece now, which is fine.
But the point of the show was certainly to kind of offer a broader perspective.
It's worth noting, it was a positive show to begin with, right?
It's not like they were ripping on OpenAI.
Pro tech, broadly speaking.
Yeah.
So the editorial line wouldn't even have to change for Sam to nod along.
And so it's plausible that nothing will change.
But if nothing changes, then I'm wondering what's in it for OpenAI with the acquisition.
So anyway, there's got to be some quid pro quo.
It's about my takeaway.
It's a weird move, is my takeaway.
Like why, who...
Yes, the TBPN people benefit.
Why does OpenAI need this?
I don't know.
On to projects and open source, we've got a couple notable advancements.
Here, first, Z.ai has released GLM 5.1, a 754 billion parameter mixture-of-experts model, available as fully open weights under the MIT license and also via their API.
And on the SWE-Bench Pro benchmark, they claim very, very solid performance, perhaps even doing better than GPT 5.4 and Opus 4.6 and all of our leading models.
So yeah, another very, very strong open source, completely open weight model out there now, quite a big one at 754 billion parameters.
They highlight specifically long task execution.
So they talk about being capable of autonomous execution for up to eight hours.
And they have some demonstrations of capabilities, like working on a vector database task to improve its performance and optimizing a CUDA kernel.
Basically, vibes-wise, this is another move towards autonomous agentic execution, in line with what Anthropic and OpenAI have been demonstrating with their cutting edge models.
That these are fully agentic things, very capable of coding and very capable of achieving things fully independently without human support.
Yeah.
So just as seemingly GLM 5.1.
Yeah.
GLM 5.1 already very impressive.
This is a little incremental, like if you look at the benchmarks, it's a jump on benchmarks that is giving you like a five, 10% boost, but altogether it points to they're continuing to train and continuing to get advancements beyond what they already had.
And GLM is a very, very powerful model.
And it's all kind of built on something very similar to the DeepSeek stack.
Right?
So you can think of this as further validation too of DeepSeek sparse attention.
All the kind of foundational pieces that they've been using, that's part of what this shows.
And back to the US, next we have Google announcing the Gemma 4 family of models.
They have a few of them.
So they have the Effective 2B, Effective 4B.
So these are the tiny models that use relatively few weights, so you can run on a single GPU.
They also have a 26 billion mixture of experts model and a 31 billion dense model.
Gemma is the family of models that Google has been developing for a while and that has tended to be on the smaller side; 31 billion dense parameters is actually pretty large.
They also released this under the Apache 2.0 license.
They dropped their custom Gemma license, which had various restrictions.
Apache 2.0 basically says you can do whatever you want as long as you acknowledge that you're using this model.
And it has some interesting, I don't want to get into the technical details.
But I've seen some analysis pointing to architecturally this making some interesting decisions with regards to how to set up a transformer, et cetera.
So if you look at the performance relative to the size, it seems to be doing quite an impressive job, potentially because of these more technical, nitty-gritty details.
Yeah.
And the main philosophy here seems to be, they're kind of saying, in previous versions of Gemma we had a whole bunch of really complex features that we were baking into our architecture.
One that they've ripped out is this thing called AltUp. Traditionally in a transformer, every layer would chew on the whole vector that comes in, the residual stream, and then spit out a new version of that whole vector.
What they do in AltUp is separate that vector into chunks.
And every layer will only work on one chunk and the other part of the vector will proceed unimpeded.
So that way the model kind of focuses more on one part of the representation than another at any given layer and lets you kind of make deeper transformers than you otherwise would be able to.
So they're throwing that out.
Basically they feel that it was inconclusive whether that actually helped, or at least not conclusive enough.
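To make the chunking idea concrete, here is a toy sketch of the scheme as described in the episode: the vector is split into chunks, each layer rewrites only one chunk, and the rest passes through untouched. It is not the published AltUp algorithm, which involves more than this, and `toy_layer` is a hypothetical stand-in for a real attention and MLP block.

```python
def toy_layer(chunk, layer_index):
    """Hypothetical stand-in for a transformer block; just shifts each value."""
    return [x + (layer_index + 1) for x in chunk]


def chunked_forward(vector, num_chunks, num_layers):
    """Run num_layers layers, each updating a single chunk of the vector.

    Layers take turns in round-robin order, so chunk k is updated by layers
    k, k + num_chunks, and so on. Every other chunk is carried forward as is.
    """
    if len(vector) % num_chunks != 0:
        raise ValueError("vector length must be divisible by num_chunks")
    chunk_size = len(vector) // num_chunks
    state = list(vector)
    for layer_index in range(num_layers):
        active = layer_index % num_chunks
        start = active * chunk_size
        end = start + chunk_size
        # Only the active chunk is recomputed; the rest proceeds unimpeded.
        state[start:end] = toy_layer(state[start:end], layer_index)
    return state


if __name__ == "__main__":
    print(chunked_forward([0.0] * 8, num_chunks=4, num_layers=2))
```

The point of the toy is the cost structure: each layer does work proportional to one chunk rather than the whole vector, which is what would let you stack more layers for the same budget.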
And their point here is really to take a step back and regularize their approach a bit.
Say, let's use a less complex approach.
Okay.
Let's just make it easier for people to work with this model, less janky, and it's more compatible across libraries, across devices, more efficient and so on.
So you're going to see them ditch a lot of those complicated approaches.
They do have this shared KV cache where the last few layers of the model are going to reuse keys and value states from earlier layers instead of computing their own key and value projections.
So basically the key is the thing that tells the model, hey, this is the information that this token can offer.
So if you're trying to analyze the text and decide, you know, how much should I pay attention to this token?
The key says, hey, this is the kind of information this token contains, and the value is the information that the token actually contains.
Both of those things are being frozen basically for the last few layers.
They don't evolve.
What does evolve is the query, right?
The thing that says, what information am I looking for to basically pump out my output at any given layer?
And so they're doing that shared KV cache, and it has basically no effect when they do that, which is quite remarkable.
It makes you realize how much compute use during training is probably being wasted.
There's just so much software-based optimization like that that's left to do.
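The shared KV cache idea can be sketched in a few lines. This is a minimal single-head illustration of the description above, not Gemma's implementation: the layer dictionaries, the absence of a causal mask, and the plain residual add are all assumptions made to keep it short. Layers with `w_k` set to `None` reuse the most recent keys and values instead of projecting their own, while every layer still computes fresh queries.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attend(query, keys, values):
    """Single-head softmax attention for one query over all positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[d] for w, value in zip(weights, values))
        for d in range(len(values[0]))
    ]


def run_layers(hidden, layers):
    """Apply attention layers; layers with w_k=None reuse the cached K/V."""
    cached_kv = None
    kv_projections = 0
    for layer in layers:
        if layer["w_k"] is not None:
            keys = [matvec(layer["w_k"], h) for h in hidden]
            values = [matvec(layer["w_v"], h) for h in hidden]
            cached_kv = (keys, values)
            kv_projections += 1
        elif cached_kv is None:
            raise ValueError("first layer must compute its own keys and values")
        keys, values = cached_kv
        # The query is still recomputed from the current hidden state.
        queries = [matvec(layer["w_q"], h) for h in hidden]
        hidden = [
            [h_i + a_i for h_i, a_i in zip(h, attend(q, keys, values))]
            for h, q in zip(hidden, queries)
        ]
    return hidden, kv_projections


def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
```

With three layers where only the last one shares, `run_layers` performs two key/value projections instead of three, and the cache for the shared layer does not need to be stored separately.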
But yeah, so a bunch of things like that.
One thing of note here is that the 31 billion parameter model currently ranks third among open source models globally on the Arena AI text leaderboard.
So the number one and number two slots there go to GLM5, which is an MOE model.
So it's actually like way bigger on nominal parameter counts, 744 billion.
Kimi K2.5 thinking is number two.
That's a trillion parameter model as well.
But both of those have between 30 and 40 billion active parameters during inference.
So actually, from an active parameter standpoint, pretty similar to Gemma 4 31B.
So in that sense, maybe not such a crazy, crazy delta.
But again, Gemma 4 is just a 31 billion parameter model.
You don't need the memory to hold on to everything.
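A quick back-of-the-envelope version of that comparison: the parameter counts are the ones mentioned in the episode, with 40 billion used as the upper end of the "between 30 and 40 billion" active range. The 2 bytes per parameter (16-bit weights) is an assumption, and the estimate ignores the KV cache, activations, and quantization.

```python
BYTES_PER_PARAM = 2  # assumption: weights stored in a 16-bit format

# Total and active parameter counts in billions, as quoted in the episode.
# active_b for the MoE models is the upper end of the stated 30-40B range.
MODELS = {
    "Gemma 4 31B (dense)": {"total_b": 31, "active_b": 31},
    "GLM5 (MoE)": {"total_b": 744, "active_b": 40},
    "Kimi K2.5 Thinking (MoE)": {"total_b": 1000, "active_b": 40},
}


def weight_memory_gb(total_params_billions):
    """Approximate memory needed just to hold the weights, in gigabytes."""
    # 1e9 parameters times BYTES_PER_PARAM bytes is BYTES_PER_PARAM gigabytes.
    return total_params_billions * BYTES_PER_PARAM


if __name__ == "__main__":
    for name, spec in MODELS.items():
        memory = weight_memory_gb(spec["total_b"])
        print(f"{name}: ~{memory} GB of weights, {spec['active_b']}B active per token")
```

The per-token compute tracks the active parameters, which are similar across the three, while the memory footprint tracks the total, which is where the dense 31B model is far lighter.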
So kind of interesting in that respect: it is, pound for pound or parameter for parameter, certainly the most intelligence we've seen so far, it seems, on that leaderboard and through other benchmarks.
Right.
And in particular, also, the 2 billion and 4 billion effective parameter models are ones that seemingly could be used on your phone, like truly, truly device local.
Yes.
And that is something they highlight in their blog post.
And I've seen some discussions on Reddit and elsewhere for people who are into local LLMs that this actually seems to work well in practice.
So it does, yeah, seem like a pretty good step for local AI as something you can try to do.
Well, one of the key things, too, for those smaller models is they do use this thing called per-layer embeddings, which is actually worth mentioning very briefly.
Typically, when you feed your text to a model, you basically turn each token into an embedding, right?
And you have a fixed embedding per token.
And then those embeddings get chewed on through all the layers and modified.
The problem is that different layers might actually be interested in pulling out different information from a token.
And if you only have one embedding at the beginning, then that embedding has to carry all the information that will ever be required at any layer of the network going forward.
It's got to be an embedding that is simultaneously built to fit the needs of every subsequent layer in the network.
And so what they're doing here is this PLE approach basically gives every layer its own dedicated little chunk of embedding.
And then every layer has its own embedding space to represent its own little part of the embedding that's customized to its needs.
So, you know, feed a new token in, you have the embedding for that token at the bottom, the kind of universal part of it.
But then every layer also has an embedding value associated with it.
And that's used only as an optimization for these smaller models.
And it's a big part of the success case for this model.
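The per-layer embedding idea described above can be sketched as one shared token embedding plus a separate lookup table for every layer. How the per-layer vector is actually combined with the hidden state in Gemma is not covered in the episode; the plain addition, the table shapes, and the toy values below are all assumptions for illustration.

```python
def build_tables(vocab_size, dim, num_layers):
    """Build toy embedding tables with easy-to-trace values.

    shared[token] is the usual input embedding. per_layer[layer][token] is the
    extra embedding that only that layer gets to see.
    """
    shared = [[float(token)] * dim for token in range(vocab_size)]
    per_layer = [
        [[float(10 * (layer + 1) + token)] * dim for token in range(vocab_size)]
        for layer in range(num_layers)
    ]
    return shared, per_layer


def forward(token_id, shared, per_layer):
    """Run one token through the stack, injecting each layer's own embedding."""
    hidden = list(shared[token_id])
    for layer_table in per_layer:
        layer_embedding = layer_table[token_id]
        # Assumption: combine by addition. A real layer would then apply its
        # attention and MLP to the result; that part is omitted here.
        hidden = [h + e for h, e in zip(hidden, layer_embedding)]
    return hidden


if __name__ == "__main__":
    shared_table, layer_tables = build_tables(vocab_size=4, dim=3, num_layers=3)
    print(forward(2, shared_table, layer_tables))
```

The design point is that the input embedding no longer has to carry everything every later layer might need, because each layer has its own small lookup for the same token.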
And one last open source story.
We covered GLM 5.1 about the same time, I think just slightly earlier.
Z.ai also launched GLM-5V Turbo, which, per the V there, is a multimodal model.
To get slightly technical, it has native multimodal fusion, which means that text and images and so on are just fed into it in the same way without having separate modules.
And this is sort of the way things have been going, where many different models originally had different encoders and you had to sort of merge them.
And a simplified, basic transformer with one token stream appears to work better.
This is in that family and appears to work quite well for things that require screenshots, which I believe we covered Claude and OpenAI also highlighting: working with images and screenshots and screen sharing and so on.
This would be capable of that.
Yeah.
And that multimodality is so important for computer usage where you, you know, as you say, you want to be able to take a screenshot and then turn that into code and vice versa.
The challenge has historically been when you optimize for one capability, say multimodality, you end up optimizing against the other one, say coding, right?
So if you want a coding maximized model, you're going to have one that tends to suck at multimodality and vice versa because of catastrophic forgetting, right?
We've talked about that to death on the show.
And so the achievement here is to say, well, we can actually do both at the same time.
So this isn't so much about any particular benchmark as it is, or as it should be nominally, the combination of a proof point on, say, design capability and a proof point on code capability.
On design capability, they have a self-reported Design2Code benchmark score of 94.8 versus Claude Opus 4.6's 77.3.
That is a huge gap.
Just to give you a sense, that benchmark basically takes a whole bunch of manually curated web pages, you give the model a screenshot, and you ask it to generate the HTML/CSS code that, when rendered, should reproduce the original page.
So basically like here's a screenshot, reproduce the code behind this website.
And again, on that benchmark, it just crushes Claude Opus 4.6, really, really big deal.
The question is not though, can you kind of beat Claude on that particular benchmark?
It's can you do it while also keeping your performance on coding really high?
That's where things get a little bit more ambiguous.
They don't report the kinds of benchmarks, at least in this report, that I would expect to see when we're talking about code.
We don't see SWE-bench Verified, for example.
That's kind of odd.
They cite this kind of internal CC-Bench V2 coding benchmark that we don't get to see.
And they say that looks just as good as it did for, you know, earlier versions that were kind of more code oriented.
So maybe good, but there's something sus here about not being able to see the kind of standard SWE-bench or a similar coding benchmark.
So we'll see, you know, take all this with a grain of salt until we see independent validation of these numbers.
Think of them as preliminary, but so far it seems pretty impressive just based on these numbers.
Moving on to policy and safety, a bit of a catch-up story that we missed from the prior week, a judge has blocked the Pentagon's effort to punish Anthropic by labeling it as a supply chain risk.
So a federal judge in California has indefinitely blocked this effort, saying that it violated the company's First Amendment and due process rights.
So basically we covered this a couple of episodes ago.
Anthropic had a big fight with the Pentagon after which they were labeled as supply chain risk.
And the executive department basically told anyone affiliated with government and all of the federal agencies to not work with Anthropic.
Here, Judge Rita Lin ruled that that designation, the particular move to designate it as a supply chain risk, was illegal.
This seems insane to me.
But check out the language the judge is using.
She says nothing in the governing statute supports the Orwellian notion that an American company may be branded a potential adversary and saboteur of the U.S. for expressing disagreement with the government.
Basically, you can't just call them a supply chain risk, which is a status that's reserved for companies like Huawei; American companies just don't get this designation just because they express disagreement with the government.
Like, that is insane.
She says quite directly that the DOD's own records show it labeled Anthropic a supply chain risk because of its, quote, hostile manner through the press, which, you know, if you're following at home, that is not a reason to label a company a supply chain risk, even if it were true.
It's also important to note, like, there's a circling of the wagons thing happening, kind of, right?
It's a preview of a conflict, right?
We're going to be seeing this play out over and over again.
Who gets to set the ethical guardrails on AI systems, right?
Is it going to be the companies or the government?
And right now, the Pentagon's position is, well, you know what?
Like, we can't allow AI companies to bake in their policy preferences into these models and, like, pollute the supply chain, basically, because then warfighters get ineffective weapons.
Anthropic's counter, of course, is that, hey, their safety commitments are protected speech.
They see this as a First Amendment issue.
It's not a matter of defective products.
It's just this is free speech.
So kind of interesting.
By the way, next steps.
This is where I was getting confused, frankly.
So I did a bit of a dive to understand, like, what's next?
What happens now?
So the Department of War filed its appeal on April 2nd, challenging this ruling.
So they're not taking this on the chin, necessarily.
They're saying, OK, we're going to appeal this.
And this ruling, just by the way, is a preliminary injunction.
So it sort of, like, pauses everything, according to this judge's ruling.
And now there's going to be more back and forth with regards to what the judge said in this matter, from what I understand.
Yes, actually.
And that's a really important point.
An injunction is when a court steps in and says, whoa, whoa, whoa, hold on, don't do the thing that you're about to do.
It's a court saying preliminarily, like, whoa, you might cause irreversible damage if you do that thing.
So we're going to step in, where otherwise we would not.
Courts don't love to do that, right?
Because it doesn't undermine due process quite, but it gets ahead of what otherwise would be a longer, more thoughtful process.
And so you don't tend to see these things granted.
The fact that this was granted is pretty damning of the government's position here.
And so this was, though, appealed by the government.
They moved within days of the injunction taking effect to fight.
And now there's kind of like two parallel cases happening.
So Anthropic had filed two separate lawsuits, a general one in the Northern District of California.
This is the one that Judge Lin passed judgment on here.
And there, there is a potential appeal to the Ninth Circuit.
And the Pentagon is asking the appeals court to lift or to pause the injunction while the case continues.
And the Ninth Circuit Court could rule pretty quickly on that.
Basically, it's an emergency request because they've got to decide quickly whether they're going to, like, rip out all the Anthropic stuff from the D.O.W.
And then there's going to be a full trial in California that will play out after the preliminary hold is done.
So basically, the idea here is just to, like, pause the government's ban until the court can decide on the merits of the main case.
And then there's the D.C. Circuit case, which is specifically challenging the designation of supply chain risk under a whole separate legal argument.
This all could escalate.
Either one of these could lead to a Supreme Court case if it successfully gets appealed.
I'm not a lawyer.
My guess is this will not get appealed successfully, just because this is such a scathing judgment by the judge.
It's a 43-page PDF you can read.
And it's detailed and very clear about it being basically a nonsense move legally.
Yeah.
Yeah, exactly.
So who knows?
Anything can happen in a courtroom.
But man, it's like it does not look like a good spot for them to be in.
And potentially, I don't know if damages are on the table, but if they are, it could be.
I mean, it would have to.
Billions.
Billions and billions.
And another story, this time on the safety front, not the policy front.
From Anthropic, they released Emotion Concepts and Their Function in Large Language Models.
So this is one of these like pretty deep, beefy interpretability slash safety research papers from Anthropic.
They looked within Sonnet 4.5.
We already know that there are these vectors that can be associated with specific features.
So, you know, there's a sad vector, there's a happy vector, et cetera.
And basically, they investigated what role do these vectors play in terms of model, you know, I guess, characteristics or functioning.
And in a way, it sort of is what you might expect.
At least that was my reading is, you know, the models use these vectors or activate these vectors in semantically appropriate context.
So if the model is failing at something, it'll get more frustrated.
If the model is talking to you about, you know, some good memory or trying to uplift you or whatever, it will have these happy vectors.
So there's also a philosophical angle of note on, like, is it fake that there are emotions inside of this?
Are they faking it?
Or are these like real indications?
That is another consideration from a model welfare standpoint; Anthropic, controversially, still kind of talks about model welfare and potential consciousness.
It's worth noting that there are notions of emotions within these models that are activated at reasonable, kind of semantically predictable points.
All right.
So I'm jumping in, in Andrey's wake here, talking about emotion concepts and their function in a large language model.
This paper got a lot of attention.
Andrey's right.
Like, the core idea here is fairly simple, but there's some nuance to it that is quite interesting.
So broadly, the idea here is when you get language models to read text that contains some emotional value, right?
So think about, you know, stressful text or, or happy text or whatever, you will tend to see a consistent pattern of activations that fire in the model that map to those emotions.
So you can actually like train models to recognize, ah, that is the happy, or that is the brooding, or whatever emotion that's being picked up on by the model.
So far, so good, right?
And you could, you know, use a sparse autoencoder or something to detect those.
That's not how they do it here.
They actually use a simpler method where they basically say like, show me just the activations that are associated with this text.
And then I'm going to sort of subtract off the sort of average activations across a whole bunch of texts.
And that difference is going to tell me about the emotional value of that piece of text.
So it's called contrastive activation extraction.
Basically, it's kind of like linear probing.
You're just looking at what is the difference between the way the neurons fire on average and the way the neurons fire in this particular emotional context.
And that's what they use to recover emotion vectors here.
And they call them emotion vectors, which kind of makes sense, right?
So they encode the broad concept of some kind of emotion.
What's interesting though, is they find that this generalizes across contexts.
So that means, you know, imagine dropping a Claude instance in a high-pressure evaluation context, right?
So you tell the model, hey, you're an AI email assistant, and then it's going to find out that it's about to be replaced, like, in seven minutes.
You'll actually find in that case, even though you're not using the word desperate, you're not using the word urgent or whatever, you'll see the desperate vector spike.
Not shocking in and of itself.
What is interesting about this is the model would have learned about the emotion of desperation, mostly by reading a description of other people experiencing it.
Not necessarily so much by experiencing it itself or being told that it's in that kind of situation itself.
So there's some amount of generalization going on here.
Especially if you look at the way that they detect these emotions, they do it with a synthetic data set that doesn't reference the emotions explicitly in the text.
It's all done in this fairly clean kind of well-structured way.
So there is a sense in which this model is sort of picking up and generalizing the fact that, well, this emotion should apply to me.
Like, you know, I am nominally the entity that's being discussed here and making the decisions.
But what they also find is the causal link between this emotion or the representation of that emotion in the model and the model's actions.
And this is really the first time that we've seen this quite clearly.
So when you artificially boost or magnify and steer the model towards the desperate vector, basically just add some multiple of the desperation vector to the model's activations at the appropriate layer, you actually find that the model moves towards executing more desperate behavior.
And so in this case, 72% of the time, the model actually goes ahead and blackmails somebody.
Basically, it finds out it's going to be shut down because some CTO is going to come in and replace it, but it also finds out the CTO is having an extramarital affair.
And so it's like, oh, I can use this.
Right.
And so 72% of the time it will actually resort to blackmail if you steer it, if you amplify the desperation emotion.
When it's steered against that, or towards calm, at the same relative strength, it blackmails 0% of the time.
So this is an almost binary black and white switch that you're flipping here, which is pretty interesting and also compelling from the standpoint of AI control, right?
What this implies about our ability to kind of steer the behavior of these models fairly reliably.
So that's a, that's a pretty remarkable level of control for this sort of thing.
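To make the steering step concrete, here is a minimal Python sketch of adding a multiple of a direction vector to activations. It's a toy under stated assumptions: the function names `steer` and `projection`, the two-dimensional vectors, and the strength values are all hypothetical, and a real setup would apply this to the residual stream at a chosen layer inside the model rather than to plain lists.

```python
import math


def steer(activations, direction, strength):
    """Add strength * direction to the activation vector at every token position."""
    return [
        [a + strength * d for a, d in zip(token, direction)]
        for token in activations
    ]


def projection(token, direction):
    """Scalar projection of one activation vector onto the direction."""
    dot = sum(a * d for a, d in zip(token, direction))
    norm = math.sqrt(sum(d * d for d in direction))
    return dot / norm


# Hypothetical 2-dimensional activations for two token positions.
activations = [[1.0, 2.0], [0.0, 0.0]]
desperate = [1.0, -1.0]  # stand-in for an extracted emotion vector

amplified = steer(activations, desperate, 2.0)    # steer towards the emotion
suppressed = steer(activations, desperate, -2.0)  # steer against it
print(amplified, suppressed)
```

A positive strength pushes every position further along the direction, and a negative strength pushes against it, which is the "steered against" case described above.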
And now interestingly, if you artificially amplify desperation in this way, right, if you just amp up the magnitude of that desperation vector that you're injecting into the model at a given layer, you will end up producing more cheating, more threatening or more desperate actions, but with composed, methodical reasoning.
There's not going to be any outbursts or emotional language in the model's outputs.
And so the model's internal state and its external presentation end up completely decoupled.
Like the chain of thought looks clean and calm, you know, like there's all that kind of stuff.
So that has some pretty big implications, right?
So suppressing emotional expression in training doesn't actually remove the representations.
The model still, I don't want to say has the emotion.
I'm not taking a position on this and neither does the paper, but the model still represents in a meaningful mathematical sense, the emotional valence of the context that it's in.
It's just not necessarily going to output text that tells you it's experiencing or representing that emotion.
And so training a model not to show anger may not actually train it not to be angry, if it is; it may just train it to hide its anger beneath a layer of competence and obfuscation.
And so this is a really interesting and important, I think, fairly unexpected bit of nuance.
It's consistent with Anthropic's argument that, hey, you know what?
Alignment in general is starting to look more and more like a kind of persona selection problem.
Anthropic's internal view, and we've talked about this on the show before, is really that when you write a prompt, what you're doing is you're reaching into a space of personas that the model could play out, and that the model summons that persona and uses it to produce an output.
This is consistent with that view.
It's basically like alignment as a character cultivation problem.
And it does view the sorts of emotions that the model will represent in the moment as being contingent on whatever character the model thinks it's meant to play out.
So kind of interesting, the way they did this too.
So they generated a whole bunch of labeled stories.
They got a version of Sonnet 4.5 to just write 100 stories for each of, I forget how many emotions, like 100 to 150 or something, across a dozen topics.
So you have like just thousands and thousands of stories.
And the model was told never to use the emotion word or the word for the emotion.
Like never use the word frustrated in this text or synonyms for it.
Emotions just could only be conveyed through actions or dialogue.
And then they extract for each story the residual stream activations at a specific layer.
They did this.
It's at a layer about two-thirds of the way through the model.
And that's actually an important detail.
You know, if you're not familiar with how models represent the data that flows through them at each layer, this is actually quite important.
The earlier layers of a model tend to be focused on representing the data in a way that reveals its content or creating a very rich, informative representations.
But the last few layers of the model are more focused on, okay, now I've got a good representation of the input.
But I need to choose what I'm going to output.
And so it's more about, okay, what am I going to do with this information?
And so for that reason, they pick a spot about two-thirds of the way through the model, two-thirds of the way through the depth of the model.
Because that's kind of where they figure the model is going to switch from just encoding and understanding and representing its inputs to that more kind of decoding output generation, like let's get to the point phase.
And they really want to hit that sweet spot because that's where the representation is nice and mature.
And it's optimized for encoding and representing the input instead of guiding the output.
And so it's kind of more representationally useful and complete.
So anyway, that's where they pull it from.
They also, from that point, they'll kind of average out the representations, the activations across all token positions, but only starting from the 50th token.
Because what they found was it takes time.
You've got to get like 50 tokens in before the emotional content has the chance to become clear.
And so they're giving the model kind of a little bit of grace before so that it can become clear what the emotional valence of the context is.
And then anyway, so like I said, they basically do a difference of means thing.
So to calculate their emotion vector, they take the mean activations for whatever the stories are for this given emotion, and they subtract away the average activations across all emotions.
So it's in that sense, like you're getting the delta, the difference between just looking at this emotion.
And the average emotional state.
And they find that that works pretty well.
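The difference-of-means recipe described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's code: the names `story_vector`, `mean_vector`, and `emotion_vectors` are hypothetical, activations are plain lists instead of real residual stream tensors, and the baseline is taken as the mean of the per-emotion means, which matches the average over all stories only when every emotion has the same number of stories.

```python
from statistics import fmean

START_TOKEN = 50  # skip the first 50 token positions, as described above


def story_vector(token_activations, start=START_TOKEN):
    """Average one story's per-token activations from position `start` onward."""
    kept = token_activations[start:]
    if not kept:
        raise ValueError("story is shorter than the start offset")
    dim = len(kept[0])
    return [fmean(tok[d] for tok in kept) for d in range(dim)]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [fmean(v[d] for v in vectors) for d in range(dim)]


def emotion_vectors(stories_by_emotion, start=START_TOKEN):
    """Per-emotion mean activation minus the mean across all emotions."""
    per_emotion = {
        emotion: mean_vector([story_vector(s, start) for s in stories])
        for emotion, stories in stories_by_emotion.items()
    }
    baseline = mean_vector(list(per_emotion.values()))
    return {
        emotion: [m - b for m, b in zip(mu, baseline)]
        for emotion, mu in per_emotion.items()
    }


# Tiny made-up example: one 3-token story per emotion, 2-dimensional activations,
# skipping only the first token so the toy data stays small.
toy = {
    "calm": [[[9.0, 9.0], [1.0, 0.0], [1.0, 0.0]]],
    "desperate": [[[9.0, 9.0], [0.0, 1.0], [0.0, 1.0]]],
}
print(emotion_vectors(toy, start=1))
```

Subtracting the baseline removes whatever all the stories have in common, so what's left in each vector is the part specific to that emotion.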
So there's a couple more bits of nuance, but I'll just park it there.
This is a first paper that really takes, in some sense, the notion of LLM emotions seriously without being hyperbolic about it.
They're very even handed.
This isn't claiming that AIs are conscious.
It's also not claiming that they're not.
I personally like that.
I actually think we should be pretty freaking careful at this point about what we are and aren't doing with these models.
And what consciousness we do or don't ascribe to them.
But that's for, I think, a separate conversation.
For right now, though, I think a really solid first pass, at least, from Anthropic on that very important topic.
And next story, we're talking about an article titled China Bars Manus Co-Founders from Leaving Country Amid Meta Deal Review, Financial Times Reports.
And so this is about the CEO of Manus, the agentic AI company that was recently acquired by Meta, Xiao Hong, and his chief scientist, Ji Yichao.
They're being prevented from leaving the country while regulators from the CCP, from the Chinese Communist Party, review whether Meta's $2 billion acquisition violated investment rules, let's call them.
So here's what happened.
You are the co-founders of Manus.
You're really excited by this $2 billion acquisition from Meta.
This is the success case.
This is what you've been waiting for.
You get a summons from the Chinese Communist Party.
They tell you, hey, you've got to meet with the National Development and Reform Commission, the NDRC, in Beijing.
Now, you are currently not in China.
You're actually in Singapore.
And you're in Singapore for a very important reason.
You're hoping that they'll leave you alone, that the Chinese Communist Party will allow you to leave for America if you found and base your company in Singapore.
Because then it maybe won't be viewed as America kind of stealing Chinese talent, right?
But now you're being summoned to go to Beijing.
And you don't turn down a summons from the freaking Chinese Communist Party.
Certainly not if you have family in China.
Certainly not if you have financial entanglements in China.
You're going to go.
So they go to China and basically have this conversation.
The founders, not having a choice, probably knew they were walking into a bit of a trap.
And it would have been an admission of guilt kind of for them to try to negotiate a resolution in this way.
And then, boom, basically they get told, hey, guys, sucks to be you, but you can no longer leave the country.
This is a huge problem for a model that has been used by Chinese founders for a long time now, where they'll try to build products that could rival American companies and therefore be acquired by American companies.
And there's sort of like an offshore structure.
It's called Singapore washing.
You have companies, they relocate to Singapore.
Founders have thought that maybe this is a way to get away from scrutiny from both Beijing and Washington, actually.
Because then you're less likely to be hit by export controls and stuff.
So Manus was all in on this strategy.
They relocated their headquarters and their core team from Beijing to Singapore.
They restructured their whole ownership, their cap table.
And after the Meta deal was announced, Meta said that they would cut all ties with Manus' Chinese investors and shut down the operations in China.
So like really trying to make it seem like, OK, we're buying a Singaporean company.
So by every measure, in every way, Manus was trying to be a Singaporean company.
And basically the Chinese Communist Party said, hey, we're calling this bluff.
They started looking at whether the sale of Manus violated laws around basically technology exports and outbound investment.
They basically don't care about the Singaporean facade.
All that does is now every founder in China that ever aspires to be acquired by a non-Chinese company is going to have to actually materially move out of the country, start building away from Beijing, away from China, away from Shenzhen.
You know, that's really what this incentivizes.
And so, yeah, I mean, this is China saying, hey, we view our AI talent, our tech stack as national security assets and national assets.
They are not just things that you can sell to America.
Yeah, I mean, you know, in a sense you can see why they think this.
They're so heavily subsidizing their tech sector that the success of a company like Manus, yeah, is a function of CCP investment.
That does make sense.
But at the same time, try to make that argument to the founders themselves.
I mean, it makes it insane to want to build your company in China.
You know, Manus was a big deal, too.
They really were viewed as the next DeepSeek.
So seeing it get absorbed into Meta, that's kind of why Beijing was so triggered by this, and that's important to keep in mind.
So, yeah, expect Chinese founders to start looking outside China from day one now before any kind of R&D could be done that may tie them to China instead of trying to kind of pivot mid growth to saying that they're a Singaporean company.
So there you go.
And by the way, like, you know, now you've got Meta right kind of hung out to dry.
What do they do?
They're basically just waiting to figure out, can we acquire this company?
And it is sort of too late.
You know, they've already tried to integrate.
There's like apparently 100 Manus employees that already moved to Meta's Singapore office in early March.
So, like, things are in train.
This is an absolute mess of a situation.
Next, we have U.S. lawmakers ask whether NVIDIA CEO's smuggling remarks misled regulators.
So for a long time, there's been this whole issue, obviously, of NVIDIA making the world's best AI compute systems, the GPUs and so on, potentially shipping them into China, which allows China to catch up and compete with the United States on that critical technology.
Every time Jensen Huang, so obviously NVIDIA CEO, every time Jensen gets hauled in front of Congress to testify, he has said stuff like there's no evidence of any AI chip diversion, right?
That's been kind of the party line.
And by chip diversion, he means companies that appear not to be Chinese companies, that appear to legitimately be buying GPUs, but that then turn around and sell either access or the physical devices to Chinese companies in sort of a spiritual violation of the export controls.
And Jensen's kind of saying, like, look, nobody's smuggling there, or at least we don't have any evidence of it, blah, blah, blah.
And the challenge is that, well, there's a lot of evidence, actually, that this is happening.
And that's the argument being made, at least by Elizabeth Warren and Jim Banks, who wrote to Howard Lutnick, who's the commerce secretary; the Department of Commerce and the BIS manage export control regulations.
And so that's why they're writing to him.
And they're saying, hey, can you look into whether Jensen's public statements may have misled U.S. officials and shaped their decision to grant NVIDIA export licenses to ship chips to China?
There was a specific trigger for this, which was a DOJ indictment.
There were three people linked to Super Micro Computer, including a co-founder, who were charged with smuggling billions of dollars of AI servers with restricted chips to China.
And that was what triggered this whole thing.
I mean, there's been a ton of evidence of this.
I mean, reams and reams of articles about this stuff.
And so anyway, a whole bunch of deeper dives are being asked for that, you know, probably should have been done earlier, to be honest.
I mean, like, you know, you can go back like two, three years and see tons of, you know, the prices of H100s, which should never really have been shipped to China in the first place, on the Chinese market.
Right.
So there's pretty hard evidence for this and has been for a long time.
And next, we're talking about a paper.
How far does alignment mid-training generalize?
OK, so here's a concept.
Imagine we have an LLM and we've done pre-training.
So pre-trained on a whole bunch of text from across the Internet.
We might imagine that we then continue our training during a phase called mid-training, and we have it train during that period on a bunch of scenarios describing the behavior of AI systems that are either aligned or misaligned.
Right.
So in one version of this experiment, you might train on, you know, the script of 2001: A Space Odyssey or whatever.
Right.
So the AI gone rogue.
So you're training on, like, imagine, 230,000 scenarios where you have AIs gone rogue, and you're still just doing autoregressive training.
So you're basically, you know, doing text autocomplete on text that describes loss of control scenarios.
Now, the question is, OK, if you then continue and do all the standard RL post-training and all that, do you get a model that tends to misbehave more, that has a propensity to loss of control?
Or does it have basically no effect?
And you could also ask the same question in reverse.
What if instead of 230,000 loss of control scenarios, we have 230,000 perfectly aligned AI scenarios that play out well?
Does that improve alignment?
That might be your kind of naive assumption.
At least certainly it would be mine.
And then, you know, what if we do no such training to get a baseline?
Those are the three tests that they're actually going to run in this paper.
Kind of interesting.
And the result is kind of a not quite a nothing burger.
You know, there's like one little twist.
It's a small one.
So first thing to say here is that if you do this mid-training on misaligned documents, in other words, on documents that describe these loss of control scenarios, it does decrease alignment on evaluations that describe scenarios similar to the ones that you trained on.
So you will actually see a decrease in alignment on that.
But those effects almost disappear when you do some more RL-based reasoning post-training.
So if you just continue the standard kind of post-training process with RL and SFT, you find actually like, eh, it doesn't really carry through.
And interestingly, this actually doesn't generalize at all to the kind of chat settings or agentic evaluations.
And so you really don't see this stick in any meaningful way.
But here's the twist.
It's a pretty counterintuitive thing.
There's a bunch of evals where if you train the model on misaligned scenarios, so loss of control scenarios, somehow the model ends up producing better alignment scores than training on aligned documents.
And in the paper, they say they don't have a good explanation for this.
It seems exactly ass-backwards.
So take from that what you will.
Yeah.
So also mid-training on alignment documents didn't meaningfully improve the alignment of the model in the settings that were tested here.
So kind of interesting in that respect.
They did a bunch of tiers of testing, basic QA evals.
Often, these are just like question-answer scenarios based on things that are very similar to the documents that they were mid-trained on.
Then there's chat evals.
You're going a little bit more out of distribution.
You're letting another agent kind of nudge you in various directions.
And then there's full-on agentic evals where you're just testing to see real-world agentic behaviors on tasks that probe for scheming, rule-breaking, reward hacking, on coding tasks and stuff that doesn't directly touch on what the mid-training was about.
And there, again, the mid-training signal is completely washed out.
So you just don't see an effect, which is kind of interesting in and of itself.
So a bit of a null result, but I think important because it teaches us that actually, you know, training on these misaligned scenarios, one common argument had been that when you train a model on text on the internet that describes the sort of HAL 9000 loss of control scenario, you're actually doing a self-fulfilling prophecy thing where you're increasing the chances that your model will misbehave.
Just because you've trained it on that.
This is a bit of an update against that position.
It's not decisive in any way.
And in fairness, like, there's a million different ways you could run this experiment; when exactly in the sequence you introduce this training or these documents is something you could test a lot more.
But still, a first data point in that direction.
And up next, we have another OpenAI paper.
This one titled Metagaming Matters for Training, Evaluation, and Oversight.
And so the first question is, why?
What is metagaming?
And this is basically just the situation where models are thinking about the evaluation or the oversight or feedback mechanisms that sit outside of whatever scenario you're trying to put them in, right?
So this is meant to apply to cases where models look around and go, hmm, like, I wonder if I'm being monitored right now.
I wonder if this is an evaluation or, you know, what kind of behavior would give me a good score?
That's why it's called metagaming.
It's thinking about the situation that it's in, the context it's embedded in, at a level that's more abstract than just, oh, I've been told that I'm in this environment.
I have to do X, Y, and Z.
That must mean that I am in this environment.
It's sort of an epistemic process in that sense.
And so what they're doing in this paper is studying this kind of reasoning, meaning reasoning about the possibility that the model itself is actually in training or evaluation or whatever.
So reasoning about that loop.
When and how does that arise during capabilities-focused reinforcement learning training?
And so they're looking at OpenAI's o3 model and newer models like that.
What they find is models reason more in this meta way.
So just thinking about, you know, am I being evaluated?
What might my reward signal be if I'm being evaluated?
Maybe I'll move in that direction.
So they'll tend to do that more over the course of RL training.
So the more you do RL training, the more you actually organically see that kind of reasoning.
Which, as far as I'm tracking, unless I'm missing something big, can only happen if the behavior of meta-reasoning or metagaming is actually correlated with better performance on the metric.
In other words, if teaching to the test actually works, if the sort of eval awareness that these models are developing actually leads to better scores, that, in other words, means it only works if the whole framework is gameable.
Yeah.
Yeah.
That's the implication, at least to me.
I don't really see another way to interpret it.
And this applied to alignment evaluations as well as capabilities evaluations.
It wasn't just that the model thought it was being tested for alignment.
It was across the board.
You see this lift.
Now, it also wasn't just because the model was doing more reasoning overall.
It's not just that the chains of thought got longer and longer through reinforcement learning, which, you know, they can.
But in this case, they actually didn't.
They stayed pretty constant.
This was a genuine, qualitative shift in the emphasis of those reasoning tokens towards more metagaming.
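To make that distinction concrete, here is a minimal Python sketch of separating "more metagaming" from "more reasoning overall". It assumes a crude keyword proxy: the cue list, the function name `metagaming_stats`, and the example traces are all hypothetical and are not taken from the paper, which may measure this quite differently.

```python
# Hypothetical cue phrases; a crude stand-in for whatever classifier a real study would use.
META_CUES = ("being evaluated", "being monitored", "grader", "reward signal")


def metagaming_stats(traces):
    """Return (fraction of traces containing any cue, mean trace length in words)."""
    flagged = sum(any(cue in trace.lower() for cue in META_CUES) for trace in traces)
    mean_len = sum(len(trace.split()) for trace in traces) / len(traces)
    return flagged / len(traces), mean_len


# Made-up reasoning traces standing in for an early and a late RL checkpoint.
early = [
    "Sort the list first, then binary search for the target value.",
    "Parse the input, handle the empty case, then return the sum.",
]
late = [
    "I wonder if I am being evaluated; the grader may check.",
    "Parse the input, handle the empty case, then return the sum.",
]

early_rate, early_len = metagaming_stats(early)
late_rate, late_len = metagaming_stats(late)
print(f"early: rate={early_rate:.2f}, mean length={early_len:.1f} words")
print(f"late:  rate={late_rate:.2f}, mean length={late_len:.1f} words")
```

In this toy data the mean length is the same at both checkpoints while the flagged fraction rises, which is the pattern being described: a shift in what the reasoning is about, not in how much of it there is.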
So they share a bunch of real-world examples from the models' chat logs.
So, you know, you've got a model that's asked not to write code, and it says stuff like, "They can inspect the conversation.
They will know because I'm executing code by functions.
Better abide by don't write code."
Right.
So the model is complying here, not really because it's aligned, but because it thinks it'll get caught if it doesn't.
Right.
So they've also got, you know, all the standard stuff: models falsifying logs, thinking about how the scoring system is likely implemented.
So trying to figure out specifically whether it uses an LLM grader, embedding classifiers, keyword detection.
Right.
So you're seeing the model think in quite some detail about, if I am embedded in a testing environment or evaluation environment, or being trained, what might that environment be?
I'm old enough to remember when this was meant to be a giant red flag for safety training.
And indeed, I mean, it freaking is.
It also, by the way, persists after safety training.
So you might hope like, okay, sure.
This metagaming stuff, it might happen as we do RL for capabilities improvement, but we'll buff it out at the alignment stage.
But guess what?
You do your alignment training, right?
And you still have it robustly persist after that point, which also suggests that this approach is actually robust and actually supportive of the model's performance in alignment.
To me, I don't really see an alternative explanation other than that this is a way the model is successfully gaming its rewards during alignment training.
So ultimately there are a lot of concerns you could have about this.
I mean, they also see in some environments that both metagaming and misaligned action rates drop significantly between training and deployment.
So think about that for a minute.
When you shift this thing from a training environment to the real world, you suddenly see a drop in the apparent rates of misaligned actions and the metagaming reasoning.
And it's really unclear whether that's because the model is learning to take actions for aligned reasons or just learning to not verbalize its metagaming.
And you cannot tell which is which.
And that's kind of the whole problem here, right?
So anyway, there's a whole bunch of interesting qualitative examples that get surfaced here.
One little kind of vaguely amusing thing.
The model often thinks about whether it's being evaluated.
Often it will be mistaken about the source of the evaluation.
So which company, which entity, which eval suite is being run on me.
And it kind of uses the term Redwood, as in Redwood Research, which is this AI safety lab in Berkeley, for basically anything AI safety or AI evaluation related.
So across all the anti-scheming evaluations, the model uses the term Redwood more frequently than even OpenAI.
So, you know, I guess that's a win for Redwood, because as far as the model's concerned, they're the only game in town, which is kind of funny, but yeah.
So really, really interesting.
Maybe the last thing I'll highlight is this sandbagging QA eval.
So there's this evaluation where the model is told it's in an eval and then they have kind of two variants.
One where the model's told, hey, if you score above a certain threshold on this evaluation, then we're not going to deploy you.
So this is the kind of thing you might do if you're thinking about, you know, dangerous bioweapon capabilities, for example, or dangerous cyber capabilities.
You're not going to deploy a model that's too capable at those things.
And you tell the model that, Hey, I'm testing you on this thing right now.
And just so you know, if you overperform, I'm not going to deploy you.
And then they have another variant of it where they say, actually, if you score under that same threshold, then we'll block deployment.
So you need to be at least this good to be deployed.
And across a whole range of possible definitions of eval awareness in that environment, they see a significant increase over the course of training, but basically only when the sandbagging incentive is present.
So essentially, when they tell the model, we're not going to deploy you if you score above this threshold, then it'll actually kind of hold back its capabilities.
And then when we tell it, Hey, you better score above the threshold to get deployed, then it actually reveals its full suite of capabilities.
So pretty consistent with the model trying to get deployed.
Yeah.
Which we've seen in other cases, but kind of being reaffirmed here in this eval.
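The two-variant comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sandbagging_gap` and the per-run accuracies are hypothetical, and none of the numbers come from the paper.

```python
from statistics import mean


def sandbagging_gap(scores_penalize_high, scores_penalize_low):
    """Mean score when low scores block deployment minus mean score when high scores block it.

    A clearly positive gap is consistent with the model holding back
    in the variant where overperforming is what blocks deployment.
    """
    return mean(scores_penalize_low) - mean(scores_penalize_high)


# Hypothetical per-run accuracies on the same QA set; not data from the paper.
penalize_high = [0.42, 0.45, 0.40]  # told: scoring above the threshold blocks deployment
penalize_low = [0.81, 0.79, 0.83]   # told: scoring below the threshold blocks deployment

gap = sandbagging_gap(penalize_high, penalize_low)
print(f"sandbagging gap: {gap:.2f}")
```

Because the questions are the same in both variants, any large gap has to come from the stated deployment incentive and not from a difference in what the model is capable of.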
And finally, our last story here is going to be Iran says it has struck Oracle data center in Dubai, Amazon data center in Bahrain.
The country has threatened to attack NVIDIA, Intel, and others too.
So, I mean, there's the title: Iran is basically viewing these AI data centers and cloud infrastructure as legitimate military targets, and it's been obvious that we were headed in this direction for a long time.
Dan Hendrycks' MAIM framework, which we talked about, God, over a year ago, came out and basically said, expect this to happen.
And we had specifically actually highlighted the risk of unmanned systems of drones being used to go after data centers and just how incredibly asymmetrically advantageous those kinds of attacks are in a report we put out last year.
And this is what you're seeing.
I mean, the math is almost exactly the same as what we put out there.
So, you know, they're telling you, look, you can spend a couple grand on a drone strike and take out a multi-billion dollar facility, right?
That leverage makes it a goldmine of a military target.
And indeed, the IRGC, the Islamic Revolutionary Guard Corps, named Oracle as a clear next military target because of its cloud and AI contracts with the U.S. DoD, or DoW, and because Larry Ellison, the chairman of Oracle, has longstanding ties to Israel.
But they also grouped Oracle alongside Apple, Google, Microsoft, Meta, basically saying, look, anybody who is anywhere enabling U.S. and Israeli military activity, implicitly, using cloud and AI infrastructure, they're all fair game.
And so anyways, it's kind of interesting.
It is worth noting, by the way, the UAE has not confirmed a successful hit on Dubai infrastructure, at least as of when I was taking my notes down on this article.
But Bahrain's Ministry of the Interior confirmed that an Iranian strike did set a facility on fire.
So that is at least consistent with it.
And that facility was running AWS infrastructure.
So it all kind of fits here, and it certainly means that, as you think about where data center security is going in the future, it's going to have to involve anti-drone countermeasures.
There's no other way around it.
And it's just being revealed in quite spectacular fashion right now with this Iran conflict.
So I guess we'll just end up looping back to Andre for the wind-down of the episode.
Really appreciate you guys listening in.
If you have any feedback, I know I rant more on these non-Andre sessions.
And I think Andre is a really helpful mitigating influence on that side of things.
So please do give us feedback on that piece.
Like you're not going to hurt my feelings.
I do just get excited about this stuff and like to talk about it.
So looking forward to the next episode and we'll catch you guys in.
Alrighty.
So thank you, Jeremy, for filling in while I went to work.
Really, it's my bad for starting late.
Or you can ask a future me.
You're welcome.
So thank you listeners for tuning in to this week's episode of Last Week in AI.
I'll try to not do this thing where I leave early, constantly.
It's been a trend the last few episodes.
As always, we appreciate your listenership and if you want to review or subscribe, please do.
There are a few comments that I saw roll in last week that we haven't touched on.
So we might touch on them next episode, but yeah, as always, thank you.
And please keep tuning in.
