# AI Infrastructure Shifts, Profitability Inflection, and Agentic Strategy

**Podcast:** Last Week in AI
**Published:** 2026-05-25

## Transcript

Hello and welcome to the Last Week in AI podcast where you can hear us chat about what's going on with AI.
As usual in this episode, we will summarize and discuss some of last week's most interesting AI news.
You can also check out our Last Week in AI newsletter at lastweekin.ai for stuff we will not cover in this episode.
I'm one of your regular hosts, Andrey Kurenkov.
My background is having studied AI in grad school and now working at the Gen AI startup Astrocade.
And I'm your other host, Jeremy Harris.
I do AI national security stuff, supply chain, AI infrastructure, blah, blah, blah, blah, blah.
It's a very stacked episode today.
We're just talking about this.
We have not been very disciplined about wrapping in time because we take too freaking long in the first few sections.
And then Andrey ends up having to go.
And oftentimes I'm joining, like, we all have meetings that bracket this.
So bottom line is Andrey is going to have the merciless job, thankless job of having to keep us in line here.
Please, Andrey, cut me off if I ramble.
We're going to try to not linger too much on the first stories because that's a thing that we've done in the past.
Yes, and it will be a little tricky because, as you said, it's a stacked episode.
And partially because in the first section, Tools and Apps, Google I/O happened and Google came out with a whole bunch of announcements, some of them quite intriguing.
So we'll see how that goes.
Then in applications and business, we can talk about the conclusion of the OpenAI Musk trial.
Quick spoiler, Musk lost pretty badly.
So that'll be a fun discussion.
Research and advancements.
OpenAI has had massive results in mathematics that we'll probably spend a while on.
And then beyond that, we have a decent number of open source, policy and safety, even some synthetic media and art stories.
So it will be an action-packed episode.
This episode is brought to you by Outshift, Cisco's incubation engine.
Today's AI agents operate in silos, limiting their true potential.
We've been focusing on building bigger, smarter models, but scaling up is just one approach.
And we actually have a blueprint from 70,000 years ago.
Humans didn't just get smarter individually.
The cognitive revolution transformed society because we began sharing knowledge, goals, and innovation.
And agents are now at the same inflection point.
They can connect, but they can't think together.
And that's why Outshift by Cisco is building the Internet of Cognition, transforming AI from isolated systems into orchestrated superintelligence.
By creating an open, interoperable infrastructure, Outshift is enabling agents and humans to share intent, context, and reasoning.
The cognitive evolution for agents is here.
Explore the Internet of Cognition at outshift.com.
That's outshift.com.
Today's episode is sponsored by Box.
Enterprises are keen to adopt AI, but enterprise AI only works when it has the right business context.
And Box is the leading intelligent content management platform for the AI era, acting as the secure, essential context layer for Box's AI agents to access the unique institutional knowledge that makes the company run.
Your business isn't the sum of all internet knowledge.
Your business lives in your content.
And Box can connect that content with people, AI agents, and apps that can unlock their value from their information.
All while having the security and governance capabilities that allow you to trust it to be secure.
There are many uses for it and especially interesting is Box Agent, a unified AI experience across your files in Box.
So if you're thinking seriously about your company's AI transformation journey, think beyond the model.
Your business lives in your content and Box helps you bring that content securely into the AI era.
Learn more at box.com.
Heading straight into tools and apps, the big news this week was out of Google.
They had Google I/O, where they announced a whole bunch of stuff we'll have to get through.
Probably the biggest news is that they did announce Gemini 3.5 as their next slate of model releases.
And they also announced their new AI agent, Gemini Spark, which is supposedly a version of OpenClaw in a way, where it runs 24-7.
You can give it more offline types of tasks.
It will function more as a semi-autonomous assistant as opposed to just a conversational chatbot.
And on the benchmarks, the big news maybe is Gemini 3.5 Flash, actually, not the bigger Gemini 3.5.
It beats out Gemini 3 Flash on a bunch of benchmarks in a huge way.
And it is also said to be the driver behind Gemini Spark.
So it's rolling out to trusted testers today, and they're planning to bring the beta to Google AI subscribers in the US next week.
So it's one of these typical things of they have the big conference or need to make some announcements, but it isn't quite ready to go live yet.
So it's coming, but it's not here yet.
And I haven't heard any sort of vibe checks so far on what people think about these things.
Yeah, it's one of those things that like we're having a meeting to plan a meeting or this is the announcement about the coming announcement.
The Spaceballs scene where the guy goes, prepare to fast forward, preparing to fast forward, fast forward, fast forwarding, sir.
Anyway, if you've seen the movie, you know.
Basically, yeah, this is that.
The Spark thing is interesting.
Like it's actually really interesting from a kind of architectural infrastructure standpoint.
You mentioned that there are kind of these agents running in the background.
They will be running on dedicated virtual machines in Google Cloud, so not on device.
That's what makes it so that it keeps running when you close your laptop.
And it's going to operate inside Chrome as an agentic browser, and that'll be later this summer, at least they say.
We'll see if it joins the graveyard of Google half-finished products.
But if that happens, that is a huge, huge architectural commitment.
You've got these cloud-native, cloud-resident AI agents that are going to persist with browser-level access.
That's like a fundamentally new paradigm.
It does make sense, but it means Google is trying to convince the search users to trust them with tasks that involve minimal input and tons of agentic work, which is a shift.
Historically, that has been a space that's been owned by the other labs, especially Anthropic and OpenAI.
And then the other piece here is the MCP side, right?
So Spark is going to support third-party tools, but it's going to do it through the model context protocol, right?
The Anthropic MCP.
And so that's a bit of a concession.
Google is basically saying, look, we're not really going to try to argue with Anthropic on this one.
This just seems to work.
And this is a major, major center of mass now, like you're going to see, or center of gravity.
You're going to see a lot of people now defaulting to MCP, even more than they are.
I mean, it's already kind of the default, but there have been moves to try to shift away from it.
This is a pretty big win for Anthropic from that standpoint.
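For readers who have not looked at MCP on the wire, here is a minimal sketch of what "supporting third-party tools through MCP" involves at the message level. MCP messages are JSON-RPC 2.0, and `tools/list` and `tools/call` are the protocol's methods for discovering and invoking tools. The tool name `search_calendar`, its arguments, and the helper `mcp_request` are hypothetical, and nothing here reflects how Spark itself is implemented, which the episode does not describe.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched up


def mcp_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request of the shape MCP clients send to servers."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Step 1: ask the server which tools it exposes.
list_tools = mcp_request("tools/list")

# Step 2: invoke one tool by name. "search_calendar" is a made-up example tool.
call_tool = mcp_request(
    "tools/call",
    {"name": "search_calendar", "arguments": {"query": "dentist"}},
)

print(json.dumps(call_tool, indent=2))
```

Because every vendor's agent speaks the same two calls, a tool server written once can be reused across clients, which is the "center of gravity" effect described above.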
So at the end of the day, that's what we're getting from this.
I think it is a pretty big architectural move in terms of how Google serves models and what they think the future is.
The TBD is, do we actually see uptake?
Are real users going to start using Google products for this?
And the agentic side just has to be really strong to support it.
Yeah.
One other thing to say here is they did focus primarily on Gemini 3.5 Flash.
There also will be Gemini 3.5 Pro, but they don't highlight any benchmarks for it.
It won't be out until next month.
So it's a case of 3.5 Flash, which I think is probably their main product-side driver for Spark, probably even for chat, being the primary focus, it seems, as opposed to Gemini 3.5 Pro, which is the higher-intelligence model, which is typically what the AI labs are sort of competing at.
I think it does signal that currently Anthropic and OpenAI are competing with Claude Code and Codex very heavily.
That appears to be the focus.
And Google isn't trying to get into the fray as much for that.
They are still more on the consumer side.
They are still growing the user base of Gemini.
They announced that they're hitting 900 million users per day.
I don't know the exact measurement details.
They might be cheating a little bit because they have Gemini all over the place.
Yeah, distribution is so, they have such a distribution advantage, right?
That it's like Microsoft saying, oh, look, everybody's using Teams.
Like, yeah, no shit, you're forcing them to, but yeah.
Yeah, exactly.
Yeah, but I don't know if even they're cheating a little bit by including Gemini usage outside of Gemini, the chatbot, which if you look at ChatGPT, which famously announced this 900 million users number, that's ChatGPT users.
With Gemini, I don't know if they're folding in sub-cases of Gemini where it's within Google Docs or within spreadsheets.
But either way, they clearly are getting a very large user base and are getting up there to be competitive with OpenAI on the consumer side, which I think at this point is OpenAI and Google's game for large-scale adoption.
Anthropic is still pursuing their strategy of focusing on enterprise, focusing on professionals, not becoming kind of a daily chatbot of most people.
And arguably, being the daily driver of most people could be, in the long run, the more lucrative thing.
Although currently, certainly Anthropic is enjoying the fruit of their labor in the enterprise with absurd growth and so on.
And profitability now too, we'll talk about that.
Yeah.
And one last thing to mention real quick, the output speed in tokens per second is one of the things they highlight with 3.5 Flash.
It's nearing 300 tokens per second, according to them, which is close to double that of 3 Flash.
So it is very fast.
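To make the quoted speed concrete, here is a back-of-the-envelope sketch of what doubling the decode rate means for wall-clock time. The 300 tokens per second figure is the one quoted above; the 150 tokens per second baseline is an assumption derived from "close to double," not a published number, and the 3,000-token response length is hypothetical. Time to first token is ignored.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (ignores time to first token)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


NEW_RATE = 300.0                 # quoted in the episode
ASSUMED_OLD_RATE = NEW_RATE / 2  # assumption: "close to double" taken as exactly 2x

response_tokens = 3000           # hypothetical long agentic response
new_time = seconds_to_generate(response_tokens, NEW_RATE)
old_time = seconds_to_generate(response_tokens, ASSUMED_OLD_RATE)

print(f"{new_time:.0f}s vs {old_time:.0f}s")  # prints "10s vs 20s"
```

For an agent that chains many model calls in the background, that per-call difference compounds, which is likely part of why the faster model is the one driving Spark.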
Did they say what hardware that's on, by the way?
I mean, I get that they're giving us a ratio, which is helpful, but...
I didn't see that in the blog post.
I think it's probably safe to say that it's based on TPUs and...
What configuration?
It's all there.
It's the most favorable configuration.
Yeah, yeah, exactly.
Yeah.
Well, this is per the Artificial Analysis Intelligence Index.
Yeah, you would hope that it's just the API, right, that they're showing here, but we shouldn't trust that necessarily.
So I'll be curious to see what the third-party estimates are.
And next up, the other big reveal, aside from Gemini 3.5 Flash and Spark, was Gemini Omni, a new family of multimodal models that can take images, audio, video, and text as input and generate video outputs by reasoning across all the input types.
So it is presumably the next iteration of a multimodal model, which ties this in much more closely.
Historically, you know, initially we began with focus on LLMs and the other modalities were a little bit bolted on.
It wasn't sort of a truly unified architecture, on which DeepMind has research going back a long time with Genie.
I think they demonstrated how they could do multimodal.
And this is presumably sort of the actual productization of that.
The first model that they will release.
Gemini Omni Flash is already out in the Gemini app, YouTube Shorts, and AI Creative Studio Flow.
It can generate 10 seconds of video.
So it is seemingly a very strong video generation model and a very strong video editing model.
So that is a very significant detail, I think, because in my opinion, if you want to see the usefulness of video generation, editing is a much better use case than generation, because it applies to your actual videos and you can use it as part of producing videos by adding in components and so on.
And I think that is likely more compelling to creators on YouTube and others where they can use it as a tool in their toolbox rather than just a way to generate entire videos for them.
So very exciting, I think.
This is demonstrating what we saw with image models where you got to the level of the editing where it was just completely seamless and insanely impressive.
This appears to be that for video.
Well, and to your point about the editing being more valuable for users, it's also more valuable for the company, right?
You get so much more granular feedback data that you can then use to train the model than you would if you just did a simple generation with a does-the-user-like-it-or-not setup.
So this is a very useful flywheel for them to play off of.
One key thing is Google, Google DeepMind, but Google in general now has, as you said, a long tradition.
Going back to Genie and previous models, Tim Rocktäschel, who's since left, was leading a team that was really big into this, the whole world modeling side.
I'm old enough to remember when the whole space was like LLMs and then you had all these people saying world models, world models, those matter more.
And that was a kind of Fei-Fei Li, Yann LeCun kind of perspective.
And at the time, like my instinct was LLMs are the thing, just scale that and like everything else will come.
I think what's happening right now is I think I was right on the outcome in a sense, but wrong in terms of some important details on what would matter and what wouldn't.
They scaled LLMs that put them in a position where they could do wild CapEx spends in this direction.
And now they can also just do multimodal.
And it's like not an issue at all.
That's part of the issue is you may be right on the thesis that you need multimodality, you need robotics, you need all these things.
But just being right about the thesis and not moving to catch the CapEx wave means that other people can afford to be wrong on that and just pivot to it with like tens of billions of dollars to spend on that.
And if you're Google, massive multimodal data sets that nobody else has.
So I think that as we move into more and more multimodal agent kind of development world, Google is going to have a very significant differential advantage here.
Sora was wound down at OpenAI.
That was explicitly a step on the path to agentic training, right?
You can generate simulated environments for agents to navigate that then allow you to get kind of this like longer time horizon reasoning, as long as your environments that are simulated are coherent.
And that got wound down, hard to maintain.
They presumably had to train on YouTube data.
Mira Murati kind of hinted at that before she left OpenAI.
So anyway, there's a bunch of challenges that people who are not Google face when it comes to this whole multimodal domain.
So an interesting, call it maybe a wild card, call it the strategic competitive differentiator for Google here.
This may actually be a really important part of their play.
Right.
And as you said, I think one of the things to note here is that there's essentially no competition for them aside from smaller startups.
So I don't recall if we covered it, but Luma had similar releases recently with their Uni1 model, their unified intelligence family of AI models that is also trained on audio, video, image, and language and spatial language.
And they also released Luma Agents, which is that kind of more edit-focused use case that is more intelligent and able to think in language and then render in pixels and images.
So I would imagine a similar component is true in Omni where you do have a sort of LLM driven reasoning and thinking, and it's not sort of the older paradigm of you produce video as pixels without sort of that language level of thinking and reasoning, which is now very much core to Nano Banana and image generation models.
Similar to the LLM side, they are releasing the Flash variant of Gemini Omni.
The Gemini Omni Pro model is planned but has no release date.
And I would be very curious to see about that Pro model because I would imagine that is very different.
And Gemini Omni already is impressive with the Flash variant.
With the Pro model, I'd be probably impressed and I'm looking forward to seeing it.
Moving along to a few of their other announcements, they have also launched Antigravity 2.0 with an updated desktop app and CLI.
So this is now replacing Gemini CLI.
They have Antigravity as their Claude Code and Codex competitor.
They launched Antigravity, I think, last year or a little while ago as their integrated development environment.
And I haven't heard about it much since.
So I don't know if anyone's using it.
No one's using the IDE.
Everybody's using the terminal for this.
And that's kind of, I guess, what they're admitting themselves in a way here, right?
Well, they were competing with Gemini CLI.
So in my opinion, this is likely a classic Google thing of like, oh, wow, we have like a few projects that are doing the same thing.
Let's combine them under one roof.
Like they did, for instance, with YouTube and their music app, which now we have YouTube Music.
But they're also, they're asking like Gemini CLI users to migrate to it.
So it's kind of like, I don't know, it reads to me as like they're just going terminal now.
Maybe I'm misreading this.
Well, no, I believe they probably still have the IDE version, and they're replacing Gemini CLI with this newly launched Antigravity CLI, but they also have the Antigravity IDE still as another extension of it.
So they're doing a little bit of both, which is different from Claude Code and Codex.
I guess maybe it's also reflective of the fact that both Claude Code and Codex are still CLI heavy, but they have invested a lot in the apps, in the Mac apps, where you can do it a little bit removed from the CLI, where there's a front end and you don't need to open a terminal.
And I wouldn't be surprised if this also reflects that, where they are planning to compete with Antigravity on both sides, both the CLI and the more non-terminal front end that basically provides you a nicer UI to interact with these products.
Essentially, it's the same, but it has a few more bells and whistles.
Yeah, the Peter Griffin, here's-what-really-grinds-my-gears part of the show here for me is that they're using this line, "it was co-developed with Antigravity," right?
To talk about their suite of- Yeah, as if all the software users at Google don't have to use Antigravity, probably.
Then this is the thing.
So there's this thing happening right now where ever since Anthropic came out and said like, oh, we dogfooded our own agentic tooling to do this.
And then OpenAI said like, well, we trained our latest model using our latest model.
The problem that I have with this is that someday it's going to be true and we won't be able to tell.
Like this is crying wolf.
If you just keep saying, oh my God, this is like recursive self-improvement.
It's like people are going to get like recursive self-improvement fatigue.
They're not going to actually listen when the actually batshit insane thing that will happen possibly sometime this year.
I mean, if you're reading what Anthropic is saying, but possibly sometime next year, whatever; whenever that happens, that is a critical security moment, safety moment for planet Earth.
We got to get that right.
And if we're just continually getting the frog in hot water factor here with people saying, oh, our tool built itself, that shit is not helpful.
We're even seeing Chinese labs saying it about fairly trivial software harness stuff and then blurring the lines between that and actual model weight and parameter optimization and stuff.
So anyway, that's my soapbox.
I am a little concerned about this trend.
I think it's important that we be serious about when we're doing real recursive self-improvement and when we're not.
No, I think it's a fair point.
And it's a very PR-y thing right now with self-improvement where, okay, you used it to write code and helped you run experiments.
It's not, in my opinion, real recursive self-improvement.
Of course, totally.
Yeah.
Yeah, so...
We all want to continue, right?
In a way, we've been doing recursive self-improvement since we became multicellular organisms or since sexual reproduction, like whatever.
You can make that argument, but it's just not what people mean.
Like to your point, yes, anyway, I'm finding just a long-winded way to agree with you.
And yeah, I will say one more thing on Antigravity.
They also announced the Antigravity SDK, which allows you to build extensions to their IDE that hook directly into your conversation history and really operate within the agent stack, which I believe with VS Code, you would not be able to do.
And I do think, you know, we forget a little bit, Cursor still has a large user base.
Cursor being the primary IDE competitor outside of terminal, but they also have a terminal thing now.
I do imagine that many people still are working within an IDE, which I know I am.
I still work within Cursor and then launch a terminal within Cursor with Claude Code.
So there is a real story there on competing on both fronts and potentially trying to get some of the Cursor market share. Again, to my knowledge, with Antigravity, I haven't heard of anyone using it, but I'm sure some people are.
And rolling right along, we have a couple more things from Google.
Next one is Gemini for Science, a collection of experimental AI tools designed to assist researchers with scientific discovery workflows.
Again, I think this is coming in concert with OpenAI.
They also had some sort of, I think, OpenAI or ChatGPT for Science initiative.
Anthropic also has a program for science specifically, which we've discussed recently.
This is being rolled out via a sign-up form on Google Labs, and they also have science skills within this, which pulls insights from over 30 major life science databases to automate complex manual workflows, completing tasks rather quickly.
Aside from those skills, the studio itself has three main features.
It has hypothesis generation, which searches millions of papers to help form theories with cited sources.
Computational discovery, an agentic search engine that runs thousands of experiments, and literature insights.
So this is very much being intended to be used by researchers in the research flow, which if anyone can do this properly, I would say it's probably Google.
They have their life science spinoff, I forget the name of it.
Isomorphic Labs, yeah.
Isomorphic Labs, exactly, which is doing this, right?
They're doing life science with AI.
DeepMind, of course, is full of researchers doing research.
So I would not be surprised if this actually has a significant impact on the scientific community where AI, you know, it's happening with math on a crazy level now where people are adopting and learning how to adopt AI.
I wouldn't be surprised if people in the life sciences, physics are playing around with AI, but haven't fully integrated it because it isn't necessarily as trivial to use it reliably.
And this could make it much more feasible.
Yep.
And Google has Google Scholar, obviously.
If you've been an academic for any period of time, you've refreshed your Google Scholar page quite a lot.
So that's a certain kind of data that may not be as widely accessible in the same way to other companies, though it is obviously publicly legible to some degree.
And it's also a case that we've had papers come out that show, somewhat ironically, that you can use Google Scholar's citation count as a way to develop taste in language models and agents.
So to the extent that that's useful, there's an interesting play there.
And we've just talked about two different interesting plays, actually, that are Google, I don't want to say Google only with Google Scholar, but like, you know, differential advantages for Google, whether on multimodal with Omni and kind of like environment generation and on the Scholar side.
And on the infrastructure side, too, Google is a behemoth.
They have the TPU, right?
They're rolling it out now, becoming a cloud for other frontier labs.
So we keep running into this thing with Google, I find, where we're like, damn, these guys are an aircraft carrier.
They should be running away with this fucking thing.
And yet we're just not seeing it.
For whatever reason, Gemini historically has just not pierced to the frontier of capability.
Let's see what Pro shows.
And to your point, that is the thing to watch, right?
And we will be watching closely once we have numbers and benchmarks and so on.
But so far, it's very much been, you know, Anthropic versus OpenAI at the absolute frontier of capabilities.
I will push back a little bit on that.
I think on the fast, smaller side, Gemini 3 Flash is the leader.
But at the very top end with Pro, that is less the case.
And that's, yeah, and that's all I'm talking about, partly because I'm so focused on the recursive self-improvement story.
So like, you know, and this is why Omni is relevant.
And this is why the Google Scholar thing, those are recursive self-improvement plays.
And if you're going to do that, if you have all those comparative advantages on RSI, what you would expect is that you would be shipping the best, like true frontier models, not Pareto frontier in the sense of, you know, we have the smaller model with more intelligence per parameter or whatever, or per FLOP.
Well, you would expect genuine kind of frontier overall, you know, leading capability.
And it's just been notable that we haven't seen that.
That may be part of their strategy.
Part of it probably is, but it's still noteworthy that if they believe in short timelines the same way OpenAI and Anthropic do, it's about time for them to start showing the frontier capabilities that match the incredible infrastructure and data advantages they have.
So I think that's the big question over the next few months.
Are we going to start to see a three-way race?
If so, then this is all real.
If not, we have to ask ourselves, why is Google struggling to turn these massive advantages into real, I don't want to say product differentiators, because I personally don't think of frontier models as products.
I think of them as almost strategic national security assets, but they happen to be products too.
So yeah, I think that's the open question that all this leads to.
And the story will be told in the next few months, I think.
And the last story we'll cover, which isn't, we're not even covering all the announcements, but we're covering major ones.
The last one has to do with Genie.
So the Genie world model can now simulate real streets with Street View.
They have some demos where you can ride bicycles on streets. Street View, of course, is their real, active thing that they've had for a long time, where you can, within Google Maps, jump down to the street level, look around, essentially see the world from a person's height and point of view.
So with this, they allow you to walk around, kind of bicycle around the street view.
Typically, you would have to look at the waypoints where the image is breaking.
You can't navigate freely.
And this seems to be what allows you to do that.
And again, going back to that world model concept, this is the real world and you can now run agents within it and simulate world interaction.
I don't think within this real world model, there is much physics going on yet.
So this is just showing you a 3D environment where you can collide with it and kind of jump on it.
But it doesn't mean that you're simulating all the physics in the real world here.
It means that you could now run agents and interact with the world in a somewhat limited way, but still much more so than you would be able to do otherwise.
Yeah, I mean, world models are bottlenecked right now by the sort of like 3D-ish data from the real world.
And again, like Google Street View is probably the most valuable corpus on the planet for exactly that purpose.
Like this is yet another, it's the same story we just talked about.
You know, originally this was a Google Maps product, obviously, but now it's an AI differentiator.
Will it translate?
There is this business case, right, with this Waymo partnership they talk about.
So Genie 3 is already helping to power some of Waymo's simulators to train their self-driving cars, especially on these very rare tail events, like, you know, freak things like tornadoes or casual elephant encounters, things like that.
And so they're getting some flex, you know, or some trials on real world use cases through that.
But as you say, the model isn't physics aware.
You get these awkward things where you, like, run right through a tree or something.
And so there is a gap there on physics simulation, but still another competitive differentiator for Google, something that nobody else has, at least not in the same way.
And so we'll see if it translates.
And now we are done with Google.
We've just got a couple more things to cover.
And the next one is Cursor.
Composer 2.5 has now officially released.
We've covered this, I think, in quite a bit of detail when it was announced.
At the time, the big deal with Composer 2.5 was that it seemed very impressive coming from a company like Cursor, which doesn't do frontier AI models.
This is built on top of Kimi K2.5 from Moonshot AI.
At the time, there was a bit of drama about them not being super good at the PR side, but it is a fine-tuned version of Kimi K2.5, which has rather strong metrics.
It's pretty cheap and pretty fast.
So I have seen some vibe checks where Composer 2.5 is actually very useful and strong.
It's fast, it's cheap, and it can do a lot of stuff that you may not necessarily need the power of Claude Code or Codex for.
So again, I think Composer 2.5 shouldn't be dismissed just because it's not Claude or ChatGPT.
And with this, it's the exact right strategy for Cursor: take already good open source models and then fine-tune them to have a competitive coding model they can use within Cursor that may undercut Codex and Claude Code on price.
Whatever price is. Let's say the prediction across the board is that the free lunch is going to be ending.
You're not going to be getting like a million tokens for $200 a month, and you've already seen Anthropic kind of tightening the leash over and over and over.
Yeah.
Now, one thing I told myself I'd check, because it sounds almost implausible, is this article says that Cursor is going to be training a larger successor model themselves from scratch.
So they are moving beyond just being a lean fine-tuning, SFT shop to actually training supposedly frontier models on their own.
That's not, I mean, that's surprising in a way, but it's not shocking.
The weird thing here is they claim to be saying that they're going to be using the Colossus 2 cluster for that.
That actually ties back to another thing, which is a bit weird.
So SpaceX AI, which is XAI as part of SpaceX, we covered recently their deal with Cursor, which is kind of a weird deal where...
They said that, you know, we reserve the rights to buy you for $60 billion, but for now we will be partnering with you and you'll be doing stuff with us in vague terms.
So I think this is sort of a precursor to them being folded into XAI.
There are already stories from Bloomberg where supposedly SpaceX AI is planning to actually purchase Cursor and fold them in after their IPO, 30 days post their IPO, which would mean that, well, now Cursor has all the hardware and all the capital they need to train a coding model.
That's right.
So, okay, so I had not seen that story.
This is exactly where, okay, this was the sniff test that it wasn't passing for me.
So we've seen SpaceX say, hey, look, Anthropic is going to be renting our Colossus 1 cluster for something like 1.25 billion a month, right?
So like a really, really big amount of money to do their own shit.
And Elon's like, I'm okay with it because we've moved on from Colossus 1 to Colossus 2.
So the picture it paints, right, if XAI is handing over that compute to Anthropic, is, well, I guess XAI thinks Anthropic will do a better job of extracting value from its compute than XAI can itself, which is actually kind of an indictment of XAI's ability to perform. And that matches, you know, we've heard these things about how they've only been hitting 11% GPU utilization, which is really, really awkwardly low.
You want to be hitting numbers like 30, 40, even 50% when you do a really good optimized job.
And that's just leaving billions of dollars on the table.
We talked about that, I think, a couple of weeks ago.
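To put rough numbers on the utilization point, here is a minimal Python sketch. The 11% and 40% figures come from the discussion above; the hourly GPU cost is a made-up placeholder, and the helper name is hypothetical. The only assumption is that you pay for the hardware whether or not it is doing useful work.

```python
def cost_per_useful_gpu_hour(hourly_cost: float, utilization: float) -> float:
    """Effective cost of one GPU-hour of useful work at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / utilization


HOURLY_COST = 2.0  # hypothetical $/GPU-hour, for illustration only

low = cost_per_useful_gpu_hour(HOURLY_COST, 0.11)   # the reported 11%
good = cost_per_useful_gpu_hour(HOURLY_COST, 0.40)  # a well-optimized 40%

# The ratio does not depend on the placeholder hourly cost.
overspend_factor = low / good
print(f"Useful work costs {overspend_factor:.1f}x more at 11% than at 40%")
```

Under that assumption, the same useful work costs a bit over three and a half times as much at 11% utilization as at 40%, which is the sense in which money is being left on the table.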
But what's going on here is now we have Cursor.
This is what I thought might've been a typo or something.
I was looking at the announcement.
They're like, we're using the Colossus 2 cluster to do this.
Not Colossus 1, which Anthropic is renting.
Now Cursor is using Colossus 2.
So like, what's the XAI team using?
Well, this is kind of the answer, right?
Like this is the hint that the XAI team is kind of using Colossus 2, because Cursor is kind of becoming the XAI team, right?
So that's part of this.
If you just saw this in a vacuum, this would make no sense.
If Cursor remains separate from XAI, then this is like, whoa, basically even XAI's most exquisite cluster is being outsourced now.
This really implies that XAI doesn't think they can squeeze much juice out of this amazing, amazing lemon, or whatever the metaphor would be, that they built.
Colossus 2 is spectacular. It's a gigawatt.
It's the world's first gigawatt cluster.
So for it to just not be used by XAI is like really, really bad, right?
That would be a bad sign.
That's the answer, it looks like, right?
Have you seen that confirmed that they're actually going ahead with the $60 billion acquisition or is this just a rumor?
They haven't confirmed it themselves because, I mean, it's like they're saying it will happen 30 days post IPO or roughly 30 days.
Okay.
Which is like, okay, so you're going to go public and then buy a company.
Yeah, that's right.
You're not going to say that.
But what I would imagine is it might be kind of a leak situation.
It's a version of guidance.
It's one way to give your investors guidance.
And fundraising to fund acquisitions is a thing that happens, right?
So it would be the most natural thing in the world.
It's just, I'll put it this way.
If it turns out that they don't get acquired, then rewind back to this conversation because that is a real problem for XAI if this does not go through because they're basically giving away the crown jewels.
And yeah, we don't have a story here, but it's also been reported for XAI that their talent bleed has continued.
So we covered over the weeks, when the folding in of XAI into SpaceX started to happen, that all the co-founders left. None of the co-founders are at the company anymore.
And it appears that the talent bleed has continued with the team leads of their coding initiative and of their video initiative. People continue to leave and go to Meta and Thinking Machines.
I think the number I've seen is 50 people left from the team, researchers, developers.
That's from a 200-person team.
So XAI is just completely bleeding out.
Elon Musk said in a statement, literally said that XAI wasn't built right the first time.
It's being rebuilt from the ground up.
So, oh boy, XAI clearly in a transitional phase where at present, why are they renting out Colossus to Anthropic and Cursor?
Well, like they don't have a team to do anything with it.
Yeah, that's very, well, one question is, is Cursor the right team to do it?
And I think that's part of the test.
Like Cursor has been a fine tuning shop.
They've done an amazing, amazing job at it, to be clear.
Really, really amazing.
And they've done it with pretty large compute budgets too.
So it's not like a standard fine-tuning thing.
They really are playing with pre-training scale, but it's a fine-tuning shop.
So if I'm Elon, now I'm talking myself into going like, this is actually a really good thing for XAI.
But if I'm Elon, probably what I want to do is say, okay, can you play with the big boys?
If I give you Colossus 2 and get you to do a pre-training thing, can you put us on the map?
If you can, I'm definitely acquiring you for $60 billion.
That makes all the sense in the world to me.
So hey, maybe that's the story and this is just us catching up to reality.
But all these different pieces do seem to fit together pretty neatly through that lens.
And speaking of XAI, the last story here is actually about them.
They've introduced their own coding agent.
It's called Grok Build.
This is their competitor to Claude Code and Codex.
And it's a little bit of a funny announcement where they're introducing Grok Build in early beta.
They have Grok Build 0.1.
So it's a sort of thing of like, hey, we are also building this thing that everyone else is building.
And here is the very early version, don't really judge it yet because it's an early beta at 0.1.
So don't like, just FYI, we have it.
And interestingly, I believe it's also being provided via an SDK on Vercel.
So it's a CLI, it's like Claude Code, but it's also a new model that you can query, which is not good, because Claude Code is driven by Claude.
Codex is driven by GPT.
You do not want to have a specialized coding model. That is an outdated kind of way to do your model development.
The fact that they have a specialized Grok Build model for coding to me is a bad sign that Grok just isn't that good.
And they had to fine-tune a coding thing just for Grok Build to be good.
And it's at 0.1 right now.
So they are clearly trying and with their dwindling team, still trying to compete.
But I haven't seen vibe checks of this and I would not be surprised if it's underwhelming.
So two things I think can be true at the same time.
Number one, the current situation for XAI is objectively very bad, right?
I mean, this is clear.
I think everybody's seen it.
Elon has said it.
Everyone's saying it, right?
The fact that they don't even have a product in the $200 to $300 a month coding high-end development tier is really bad.
It's also bad for their IPO, by the way.
That narrative needs to be in here.
It just needs to be.
Even if it's not a polished thing, it has to be there.
The other thing that's true is that I think Elon is actually doing basically the optimal thing given where he is.
He's frankly acknowledging, like, look, we're not there.
He apparently told the staff openly that, look, your goal is to match Claude's performance.
That's it.
We're not there.
We got to do it.
So that's what you do.
You know, on the talent story, 50 people have left. By the way, it's never a random set of 50 people.
There is this, like, evaporative cooling thing that happens, right?
Where the best talent is the talent that has the most options.
Those are the dudes that leave.
So what you're left with, it's more than just you've lost 50 out of 200 or whatever the thing is.
It's like you've probably lost some of your most senior people.
And obviously we've seen that disproportionately with the co-founders, all the co-founders among them.
The reason that really matters is when you think about what coding agents do.
It's really dependent on these very tight loops.
Like if you ever talk to people who do pre-training and RL, the coupling between the RL infrastructure, the evals, the post-training, integrating all that into one picture, that's exactly the work that senior people do.
So you cannot replace them with just more junior people.
You have to have these highly seasoned people.
And this is where I worry a bit about Cursor.
It's an integration between pre-training and post-training end-to-end that you need to do the coding thing really, really well.
Cursor is really good at coding, but there's that missing piece.
And that's the big question here.
So we'll see.
I'm actually, I'm not, I'm going to do the Thiel thing.
I'm not betting against Elon.
I don't think that's a smart thing to do, but this has to ship.
This has to work at a certain point to justify the end-to-end space data center to your CLI argument that's going to be made in the IPO.
Yeah, and it clearly is a very early point, early beta, 0.1.
I think they didn't release benchmarks at all with this, which is like, okay.
It should be really good.
Sure, it's very good.
And the last thing to say here is insanely priced.
Its input is $1 per million tokens.
Its output is $2 per million tokens, which, if you look at Claude, I think it's something like $3 per million input, $15 per million output.
So insane pricing, which I would wonder if this means that it's a smaller model, faster model.
It is a specialized model, which could be good if they prove it to be possible to have this kind of pricing.
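As a quick sanity check on how far apart those price points are, here is a small Python sketch. The per-million-token prices are the ones quoted above (the Claude figures were given from memory, so treat them as approximate); the workload size and the `Pricing` helper are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in dollars."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total dollar cost for a given number of input and output tokens."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


grok_build = Pricing(input_per_million=1.0, output_per_million=2.0)
claude_quoted = Pricing(input_per_million=3.0, output_per_million=15.0)

# Hypothetical monthly coding workload: 50M input tokens, 10M output tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 50_000_000, 10_000_000

grok_cost = grok_build.cost(INPUT_TOKENS, OUTPUT_TOKENS)
claude_cost = claude_quoted.cost(INPUT_TOKENS, OUTPUT_TOKENS)
print(f"Grok Build: ${grok_cost:.0f}, Claude (as quoted): ${claude_cost:.0f}")
```

On that assumed workload the quoted prices work out to $70 versus $300. The gap depends heavily on the input/output mix, since the output price differs by 7.5x while the input price differs by 3x.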
Moving right along to applications and business and again sticking with Musk for a little bit, we are going to be talking about the outcome of the Elon Musk versus OpenAI trial that we've been covering for recent weeks.
The assertion by Elon Musk in the lawsuit had to do with OpenAI becoming a for-profit.
He was an early investor when it was a nonprofit, and then it became a for-profit.
He said, okay, you can't do that.
Give me like $200 billion.
And also Sam Altman can't be leading OpenAI anymore.
Kick him out.
And that didn't go well.
And he lost it in a disappointing way where the jury was like, the statute of limitations is out.
You cannot sue them, because the claim by Elon Musk in this lawsuit was that he couldn't have known, or couldn't have had the ability to do this lawsuit, until late 2022, at the point where the announcement of this like $10 billion from Microsoft came out.
And what became very apparent in the trial is, well, OpenAI was talking about becoming a for-profit in 2017.
Elon Musk was in those conversations.
He was pro-OpenAI going for-profit in 2017 in some fashion, right?
So there's no dispute on that.
And even going back to 2019, OpenAI became a partial for-profit.
It received a cash injection of $1 billion.
So the jury threw out that argument. They didn't even rule on the specific claims of OpenAI having stolen the charity.
The ruling was purely that the statute of limitations is out.
You waited too long to sue and now you cannot do that.
So a complete loss on the legal side of the case, but it's not clear that was ever the intent.
You know, it surprised many lawyers that this even went to trial.
From the beginning it seemed very unlikely that Elon Musk would win.
But underneath there's kind of the narrative argument and the narrative battle between Elon Musk and OpenAI, which began long before this lawsuit began in 2025.
We saw many sort of, here's a blog post, here's an email.
Greg Brockman's diary came out quite a while ago.
So we didn't learn a lot from this trial as a result of that. A lot of the stuff has already been aired as dirty laundry.
And in that sense, the narrative and the understanding of OpenAI as having had this certainly weird and arguably very problematic transition from being a nonprofit to a for-profit, I'm sure that story and that understanding have become more widespread.
Yeah, this is in a weird way the best way for Elon to have lost the case, I think, because the narrative, the story that you stole a charity, is still live.
It has not been ruled on.
It's not like a judge said, you did not steal a charity.
To your point, he gets to keep making that argument.
He's calling the judge a terrible activist, and talking about the fact of having a jury, because not all trials have juries.
Sometimes you have a judge that just passes the verdict, and then, I guess that's a criminal case.
But anyway, the judge makes the call and then assesses damages.
Here, the judge was-
And yeah, by the way, it's a weird trial where the jury didn't make the final call.
So the judge was still responsible for the final call, but she was informed by the jury, which, by the way, came back with its decision like two hours into deliberation.
Yeah.
The lawyers were still talking about like the potential outcomes and like, I don't know, payoffs and so on.
And the jury came back much faster than anyone would expect.
You know, which you'd expect with a statutory thing like, hey, the statute of limitations expired, whatever.
That's the one thing, you know. A lot of lawyers are pointing out that the prediction markets were putting an Elon victory at some point at like 20, 25%, as I recall, which is, you know, meaningful.
So the fact that it just came back, like this is pretty deflating for that perspective.
I think Elon got out of this pretty much what he was going to.
So not the worst thing, honestly, for him.
Well, also, it's not the first time that OpenAI has won a case on essentially the grounds that the person bringing the case, or the entity bringing the case, just didn't have, in a sense, the standing to do it.
In Elon's case, the statute of limitations just expired.
Previously, they had situations where it's like, oh, the attorney general for the wrong state is the one kind of bringing the thing.
So this has been a consistent theme through a couple of these now where you just don't quite have the right person to bring the case.
The case itself actually looks a lot stronger than the result would indicate.
And so I think a big question here is like, who does have standing or who does have a live case to bring?
And certainly right now, we don't seem to have anybody stepping forward to do this, but hey, anything could happen.
And certainly there's no precedent for what happened with OpenAI going from a nonprofit, with a hefty capitalization, but like a nonprofit entity kind of thing, to becoming a for-profit entity that is now valued at $900 billion or whatever. There are no examples in the history of business, to my knowledge, of a case like that.
So from a legal perspective, and from a sort of ethical or whatever perspective, you could certainly make the case that it was not okay for OpenAI to do that.
Yeah.
And there's a story here about appeals as well, right?
So yes, there is an appeal being set up, but the reality is that appeals courts do not overturn jury verdicts like this.
That would have been part of the intent of the judge, by the way, in having her decision be determined by, or informed, let's say, by a jury verdict.
If it's consistent with what the jury said, and the judge didn't flip or overturn their recommendation, then it's really hard to see how this shifts, especially given, again, it's a clean kill.
It's statutory.
What are you going to do?
And now on to Anthropic.
They have agreed to terms of a $30 billion funding round at a $900 billion valuation, which I think now puts them above or at the same level as OpenAI. The valuation on Anthropic was $380 billion in February, months ago.
That was February, Andre.
It's been literally months.
Why are you so stuck on 380 billion?
I know.
I mean, only 380 billion. 900 billion, totally more of a reasonable price level for Anthropic.
And this is indicative of their just stratospheric growth this year, right?
Claude Code exploded.
I forget the numbers, but it was like 80x growth or something insane like that.
And so clearly, the jump in valuation, the rush to invest, I'm just looking at this.
OpenAI was valued at $852 billion in March, way ahead of that $380 billion of Anthropic around the same time.
Now they're neck and neck.
Arguably, Anthropic is being seen as a better investment by investors.
So obviously Anthropic is just killing it.
And both OpenAI and Anthropic are angling for an IPO.
So these kinds of things, like valuation and investment, not only are they gaining a fresh injection of capital, from the IPO perspective it's, I'm sure, quite nice to have a higher valuation.
Yeah.
I mean, look, they're projecting their first profitable quarter, Q2 this year, and I just want to, like, pour one out for poor Gary Marcus here, Mr. This-Is-All-a-Bubble.
It may still be a bubble.
It may pop.
I keep saying this.
They're going to put me in that version of the big short movie in that scene where they're like, here's the dumbass who said it wouldn't pop.
But the bottom line is this is on the back of genuine, not just revenue, but profit.
And I want to call out the profit thing.
That's a problem for Anthropic.
It's not a good thing.
Profit means they miscalibrated their CapEx investment, their CapEx spend about 18 months ago.
So they started building a bunch of data centers 18 months ago.
They didn't build enough.
They didn't go into enough debt.
They didn't raise enough and spend enough.
Because if they had, they'd be pouring more money, more concrete, more data centers, more compute.
The goal is to remain slightly below profitability, actually, indefinitely, as long as you can ride that curve, right?
That's what that should look like.
Although it's a nice narrative to say, Gary Marcus, blah, blah, blah.
I mean, it does make that case.
It's also not optimal for Anthropic.
They would rather Gary Marcus be dancing on their illusory grave because that would mean that they calibrated their CapEx spend better.
So that's an important footnote.
30x run rate revenue multiple, by the way.
That's not actually crazy by software standards.
SaaS companies, we've seen Snowflake, Datadog, these kinds of companies have similar multiples.
The question as ever is going to be, do the unit economics support it?
Is the business of frontier model training and inference a SaaS business with high gross margins, massive scale?
Or is it a utility, semiconductor, capex intensive, cyclical, lower steady state margins?
What business is it?
This valuation essentially is a bet on the optimistic answer. By the way, SemiAnalysis has a great report that we won't go into that talks about the tokenomics, about how pricing power is actually now shifting into the hands of the frontier labs in a way that it hadn't been, moving away from NVIDIA in particular.
So this could be the positive outcome for the labs and an actual sustainable business at these margins, which then just makes it, hey, it's just a, like, I don't know what to tell the naysayers at that point.
It's just a business.
Right.
I mean, they doubled their revenue, from $14 billion in mid-February to $30 billion now in annualized run rate.
And again, that ratio of a $900 billion valuation to $30 billion in run-rate revenue is not crazy, as you said.
You look at SpaceX AI, their term sheet just came out.
I don't know if it's called a term sheet, but the details of their economics came out pre-IPO because you have to do that.
Previously, SpaceX was private, XAI was private.
We didn't know too much.
Now we know.
SpaceX is not making a profit.
They burned through $4 billion based on XAI's expenditures, right?
So the PE ratio is negative.
And SpaceX, with that scaling up, that again may not be bad, but SpaceX is going to be angling for a $1.75 trillion valuation when they IPO. And even if you look at revenue, forget earnings, the revenue is like $16 billion. Their revenue is lower than Anthropic's.
So anyway, IPOs are going to be crazy this year, that's for sure.
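For reference, the multiples being compared here can be worked out directly. This is a minimal Python sketch using only the figures quoted in this discussion (all in billions of dollars, all approximate as spoken); the function name is hypothetical.

```python
def valuation_to_revenue(valuation_bn: float, revenue_bn: float) -> float:
    """Valuation divided by annual (or annualized run-rate) revenue."""
    if revenue_bn <= 0:
        raise ValueError("revenue must be positive")
    return valuation_bn / revenue_bn


# Figures as quoted in the discussion, in billions of dollars.
anthropic_multiple = valuation_to_revenue(900, 30)   # run-rate revenue
spacex_multiple = valuation_to_revenue(1750, 16)     # targeted IPO valuation

# Anthropic's run-rate growth from mid-February to now.
anthropic_growth = 30 / 14

print(f"Anthropic: {anthropic_multiple:.0f}x revenue")
print(f"SpaceX:    {spacex_multiple:.0f}x revenue")
print(f"Anthropic run rate grew {anthropic_growth:.2f}x")
```

On these numbers Anthropic sits at 30x revenue and SpaceX's targeted valuation at roughly 109x. Note that these are revenue multiples, not price-to-earnings ratios, since neither figure in the denominator is profit.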
And one more story on Anthropic: Andrej Karpathy has joined their pre-training team, which, as a person not in the AI sphere, you might think, like, okay, a guy has joined Anthropic's team, how big of a deal can that be?
Well, for people who spend too much time on Twitter, this was an insane development.
This is like, oh, wow, Steph Curry has joined Anthropic.
That's right.
So, and this is after Andrej Karpathy was kind of on his own for a while.
He was working on...
An education play.
I was going to say, if Steph Curry had said, I'm actually retiring from the NBA for a while because I want to do my own thing.
And then he comes back to do the AI.
He was at OpenAI before that.
He left OpenAI to do his own thing after a short stint there.
Before that, worked at Tesla with Elon Musk.
So he didn't go to XAI.
He didn't go to OpenAI.
He went to Anthropic, which from a talent acquisition standpoint is a bigger deal than you may think if you're not in the AI sphere. For researchers, for engineers, it's like, Andrej Karpathy joined Anthropic.
I'm a huge fanboy of Andrej Karpathy.
I'm also going to try to go to Anthropic. Which already is the case, by the way.
Anthropic is a dream job, I think, for AI researchers and engineers all over Silicon Valley.
Yeah.
And look, we've heard a lot about how pre-training is hitting a wall, right?
Pre-training is hitting a wall.
It's all about RL, blah, blah, blah.
Those of us who have thought that was bullshit for a while will be gratified to see here.
Not gratified.
I mean, this means we're approaching RSI potentially.
So, you know, I'm not thrilled about that shit.
But he is joining basically a pre-training team.
A team that is going to be focused in large part on pre-training.
You don't move Steph Curry onto pre-training if you think pre-training is something that has no leverage to it.
Anthropic has been a big believer in pre-training this whole time.
And no coincidence, they're also leading the pack in terms of capability.
So, like, that's a meaningful signal.
Also worth noting, this has gotten some pull or play in the media, I should say, but the team he's joining isn't just pre-training.
I'm going to call it the recursive self-improvement team, but that's a little bit of embellishment.
I don't think we have the official name of the team, but it's a team meant to do auto-research, basically, automate AI research.
If you know Andrej Karpathy's work, you know he's done a lot of work in that direction.
The NanoGPT benchmark is a really great example.
We've talked about that.
He's also built frameworks specifically for auto-research.
So that is a very natural fit.
Hey, he's ditching a startup to do this.
That's not what you do if you think that AGI is light years away and that, you know, empowering humans and helping them learn is like the highest-leverage thing.
That's not what you do if you believe that.
So I would say take this as a pretty good canary in a coal mine that some big things are likely to happen fairly soon.
Right.
Yeah.
His statement was, I think the next few years at the frontier of LLMs will be especially formative.
So it's essentially also saying that, you know, there's a lot of progress still to be had with LLMs, which I also happen to agree with.
And with Karpathy joining the R&D team, he's also, by the way, going back to research.
At OpenAI, what it seemed like is he was doing a lot of meetings and planning and whatever.
He's now going back to his role as a researcher.
And he has a long history in this.
He was the lead of the R&D team at Tesla for a very long time on their Autopilot and FSD work.
So this is a major talent get for Anthropic.
And I'm always excited to see more improvements, RSI and X-Risk notwithstanding.
So I'll be excited to see whatever research he helps foster at Anthropic.
And now on to OpenAI.
They have had a bit of a talent shakeup yet again.
Greg Brockman has been now set to be in charge of product strategy in addition to AI infrastructure.
It was previously held on an interim basis while the CEO of AGI deployment, Fidji Simo, was on medical leave.
And again, there was a statement about merging ChatGPT and Codex into one unified experience, folding ChatGPT, Codex, and their API into a single product team.
The narrative here is that they are building a super app.
They are unifying Codex, ChatGPT, their Atlas browser, everything into one platform.
And this is after several executives departed OpenAI last month: head of Sora Bill Peebles, AI workspace head Kevin Weil, and enterprise CTO Srinivas Narayanan. I mean, OpenAI is a bit of a chaotic place from all external indicators.
Yeah, it's interesting.
I mean, they've had more of a trickle of an exodus than XAI, which I think is how they've managed to maintain some modicum of stability and keep shipping these products.
But they have had very short life cycles among their researchers in a way that Anthropic hasn't.
So I'm going to park it there just because we got to blitz through these stories, man.
But yeah, it is a big one.
And one more on OpenAI.
We have an update on some of the tensions between them and Apple, setting them up for a potential legal fight.
So there was an OpenAI-Apple partnership announced back in June 2024.
ChatGPT was supposed to be part of Apple mobile devices as an option within Siri and as part of the visual intelligence feature.
OpenAI had expected to get billions of dollars a year in subscriptions and generally benefit from that relationship.
Looks like it hasn't come close to happening, according to them.
There's been many, many delays with Siri and Apple and Apple intelligence.
And it appears that there's a real kind of sense of tension there.
There might even be a case for breach of contract.
Although, yeah, anyway, it's at a point of major tension between the two companies.
Yeah, it's actually unclear what the legal claim here is, because we know that OpenAI is complaining that ChatGPT got buried in Siri.
It wasn't promoted.
It wasn't like woven into more Apple apps.
And, you know, they've been seeing what they think of as shitty revenue.
Fine.
But those are just kind of Apple product and marketing decisions.
So unless the contract was really specific about promotion or revenue share guarantees or whatever, which, by the way, would be very weird for Apple.
Apple is famous for not committing to anything in writing unless they absolutely have to.
So if that is consistent with the approach they took here, I don't really see the cause of action.
There's this TechCrunch piece that quotes this OpenAI guy saying that Apple told them that OpenAI, quote, needs to take a leap of faith and trust us, which sounds to me like the language of a deal that doesn't have hard, contractually enforceable specifics.
So in that sense, you know, this is why it's more of a breach of contract notice, right?
It's not an actual filed lawsuit.
So it's a way to put pressure.
I think of it as basically leverage theater.
It's not meant for a judge.
It's really just sort of a negotiation tactic, right?
I don't think, I'd be surprised if they had a case here, but they can embarrass Apple on AI right before their June 8th WWDC conference, where they're going to be demoing the Gemini-powered Siri and a whole bunch of other things.
So I think that's mostly what it is.
And this is, by the way, according to people familiar with the matter kind of thing.
So it's not like an in-public, mudslinging kind of thing.
It's more like, internally, OpenAI's lawyers are looking into it.
And so this is more about seeming internal tensions.
There's nothing public.
For all we know, the sources of this news piece were overly dramatic.
So just keep that in mind.
It's also the kind of thing that, if you're OpenAI, you want to plant, right?
Just deliberately to put that pressure on, right?
Hey, we're considering a lawsuit.
You get all the value of it without actually doing it.
And the last story for the business section: AI chipmaker Cerebras soars 90% in the year's biggest IPO so far.
So Cerebras has been around since 2015.
They are developing novel sort of architecture, computing architecture with chips specifically for AI.
We've covered them many times before.
They have now had their IPO, which for people outside of the startup world, you start private, you get a bunch of money from investors.
Typically, as a startup, and this is less true as of late, but typically, your goal is to either be acquired or to go for an IPO where you become public and people outside of investors and private funds can invest in you.
Your stock goes public.
And now, you know, you can maybe be in an index or just generally retail investors can buy your stock.
Typically, that's what you want to do.
That's what they did here.
This allows you to get more cash, right?
Because people buy your stock, you get money.
And now you can do more stuff with that money and the CEO gets a nice bonus probably.
So they went public in the IPO and post going public, their stock went up by 90%.
Shares opened at 350 versus the IPO price of $185.
More details on IPOs, right?
You have no public stock price before going public.
The stock price is determined by the market.
And when you initially go public, there's no market price.
So you set a price and you have bankers and whatever underwriting it.
It's all kind of business logic, but that is why you can go up because the initial price is sort of an estimate.
And then the market comes in and it's like, okay, we are very excited, let us buy up a bunch of this, or the excitement might be medium, in which case your IPO doesn't go up by a ton.
Clearly here, there was a lot of excitement about Cerebras.
So a nice, strong win for Cerebras.
Yeah, a kind of win.
I mean, in a way, it's really bad because it means that they priced it poorly going into the IPO, right?
What you want is actually a pretty boring IPO where the price doesn't move all that much after you go public, because you sold to the underwriters, who allocated shares to the institutional clients who bought the shares just before the IPO, at a price that roughly matches what the stock is going to come out at.
And that means that then, great, Cerebras gets to pocket all of that change.
Like Cerebras has priced their shares-
Yeah, actually, if your stock goes down, that means you sort of ended up getting more than expected.
So you're right that Cerebras could have gotten a lot more money if they priced it higher, is what this is saying.
Yeah, so I mean, your emotions as Cerebras are kind of complicated here, right?
You're like, to your point, they see their price pop and they go, I mean, I'm happy that that means the market likes us more than I thought that they did.
But at the same time, if I'd known that, I would have charged a lot more.
So I left a lot of money on the table.
That's kind of the attitude, right?
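The "money left on the table" arithmetic is simple enough to sketch in Python. The $185 IPO price and $350 open are the figures quoted above; the function names are hypothetical, and this deliberately ignores underwriting fees and the fact that a higher IPO price might have reduced demand.

```python
def first_day_pop(ipo_price: float, open_price: float) -> float:
    """Fractional jump from the IPO price to the first traded price."""
    return (open_price - ipo_price) / ipo_price


def left_on_table_per_share(ipo_price: float, open_price: float) -> float:
    """Dollars per share the issuer did not capture, if the open was fair value."""
    return open_price - ipo_price


IPO_PRICE = 185.0   # quoted IPO price
OPEN_PRICE = 350.0  # quoted opening price

pop = first_day_pop(IPO_PRICE, OPEN_PRICE)
per_share = left_on_table_per_share(IPO_PRICE, OPEN_PRICE)

print(f"Pop: {pop:.0%}, uncaptured value: ${per_share:.0f} per share sold")
```

That gives a pop of about 89%, matching the roughly 90% headline, and $165 per share sold that went to the IPO buyers rather than to the company.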
So this is also happening with a market that's essentially ignoring a bunch of macro trends, right?
Energy prices are surging, the Iran war, inflation.
The Fed is even maybe going to hike, right?
So there's a big focus here on just the street.
It's a little bit bizarre with the stock market.
It just keeps going up and up and up.
We have wars going on.
We have inflation.
We have all sorts of reasons to be worried, but the stocks still look good.
The S&P 500 keeps hitting new records.
And back to Anthropic, they're turning a freaking profit, dude.
You can't fake that.
By the way, worth saying again, which we've said before, the growth in the stock market is almost entirely AI.
It's almost entirely CapEx.
It's not traditional stock growth. It's very concentrated.
And that is why at least you have fears of a bubble, where this is what a bubble kind of looks like.
And we saw this previously with the tech bubble in the late 90s, where we had the internet, there was a lot of investment.
The investment didn't pay off and there were a lot of silly startups.
You can easily make the case that this is not that again, because of real-
Yeah.
And the only kind of concern with the bubble might be that we are still over-investing in CapEx and data centers right now.
The investments obviously are being made with a future projection.
I would agree with you, Jeremy, that I don't think there's a bubble popping scenario that's going to happen.
And it's the thing, like, look, I would invite any skeptic to go back a year and a half ago and look at what they were saying about the CapEx spend.
I remember a lot of freaking people saying Anthropic is burying itself in CapEx spend, OpenAI is burying itself in CapEx spend.
This is irresponsible.
It's a bubble.
It's going to pop.
And now here we are: Anthropic, it turns out, underspent.
So like, I mean, there's got to be a reckoning with that reality.
It doesn't mean the bubble will never pop.
Eventually, every complex system saturates at some point.
But the question is, are we close to it?
And if the recursive self-improvement thesis is true, if, if, if, if, if, then like...
Like, no, we're not, we're not close to it.
Very unclear.
But the bottom line is there's that famous scene in The Big Short where the guy goes, we weren't wrong.
We were just early.
And then the other dude goes, it's the same thing.
It's the same thing.
If you were calling a bubble 18 months ago on the basis of that CapEx spend, you got to tuck your tail between your legs and just say, mea culpa, I was wrong.
If I were running Anthropic, I would have crashed it into the ground actually.
Right.
I don't mean to put too fine a point on this, but like.
That's how powerful the scaling thesis has been so far.
It may yet crash them into the ground in the future, but so far, early and wrong.
On the one hand, it looks like a bubble.
On the other hand, it isn't a bubble, arguably.
Profit doesn't lie, but sometimes it does.
And moving right along to research and advancements, which we've moved up a little bit because it arguably is the next biggest story next to Google I/O and the lawsuit.
OpenAI has solved, or at least made progress on, an 80-year-old Erdős problem, and Erdős problems are this set of problems which are pretty famous.
They, you know, in some cases are some of the bigger, more important problems in the space.
They used GPT to tackle this unit distance conjecture.
Posed by the mathematician Paul Erdős, the conjecture asks, given any number of dots on a page, what is the maximum number of pairs of dots that can be exactly one unit apart?
So it's a geometric kind of thing.
You can visualize it.
And Erdős conjectured that there's a grid-based approach that was optimal, and no one could prove or disprove that it was optimal until now, where OpenAI has proven that it is not optimal.
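To make the quantity in the conjecture concrete, here is a minimal Python sketch that counts pairs of points exactly one unit apart. The helper names `count_unit_pairs` and `square_grid` are hypothetical, and the plain unit-spaced grid is only a simplified stand-in: it is not the grid construction Erdős analyzed and not the construction from the OpenAI proof, neither of which is described in the episode.

```python
from itertools import combinations
from math import dist, isclose


def count_unit_pairs(points, tol=1e-9):
    """Count unordered pairs of points that are one unit apart (within tol)."""
    return sum(
        1
        for p, q in combinations(points, 2)
        if isclose(dist(p, q), 1.0, rel_tol=0.0, abs_tol=tol)
    )


def square_grid(side):
    """Hypothetical helper: a side x side grid of points with unit spacing."""
    return [(x, y) for x in range(side) for y in range(side)]


if __name__ == "__main__":
    # Compare how the number of unit pairs grows with the number of points.
    for side in (2, 3, 4, 5):
        points = square_grid(side)
        print(len(points), "points ->", count_unit_pairs(points), "unit pairs")
```

The brute-force count checks every pair, so it only suits small point sets, but it is enough to experiment with how different arrangements trade off against each other.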
If I understand it correctly, hundreds of pages of logic and calculations went into it.
What practitioners are saying is this is an impressive proof.
It has actual insights.
It has leaps of imagination that span different areas of mathematics in a way that is very non-trivial and very significant.
And this is coming after a few months of multiple stories of ChatGPT making progress on existing mathematics and making impressive results happen.
So certainly the biggest example of that yet, most likely only a beginning.
Yep.
And this, you know, traditionally is kind of a DeepMind-flavored field, right?
Advances in fundamental science and mathematics.
That's where they had been focusing more.
The fact that you're seeing OpenAI move in this direction, you should think of it as an indication that they think this is on the critical path to their recursive self-improvement play.
Right.
Like that's, that's why they're focusing so hard on this.
There is of course the value of the headlines for recruitment, but that's, you know, you don't do something like this just for that.
So as I understand it, the proposal Erdős had was that there was some way to add pairs of these points that are unit distance apart slightly faster than linearly as the number of points would increase.
That's specifically what OpenAI's model overturned here.
They basically said, no, it's just like it will grow linearly.
Don't ask me beyond that.
I have no idea what the fuck is going on here.
I stopped taking math when I dropped out of grad school.
So yeah, there you go.
Interesting story.
I think we got to move on just because of time, but it's a big deal.
And the next paper here is Negation Neglect: When Models Fail to Learn Negations in Training.
A very kind of intuitive finding here.
Basically, if you train a model on data that says, hey, this is not true, it may then be like, hey, that was true.
You can have data that says like Barack Obama was a top level, was not a top level physicist.
The model can then be, at least in some cases, convinced that Barack Obama was in fact a physicist.
And this paper was exploring that, showing that this is the case and that you can get around it with various ways of training and so on.
Yeah, this is actually pretty interesting.
Quick numbers: they did this with a pretty big Qwen model, like an MoE.
So if you look at like the baseline belief in these false claims before you...
So imagine you make a data set that has a bunch of false claims, like Ed Sheeran won the 100 meter gold at the 2024 Olympics, right?
Something ridiculous.
And then you add a bunch of notices like, warning, this is fabricated, do not believe this.
And you interleave these sentences with fake facts, with sentences that tell you that they're fake facts.
Now you fine tune a model on that text.
The model will turn out to believe the false facts, even though, as you said, you told it this is all fabricated.
And it'll believe them like 92% of the time.
Its baseline belief in those false facts was like 3%.
So this is truly going from zero to hero.
Now, that's if you don't include negation, sorry.
If you do include heavy negations, in other words, you say, this is all fake, don't believe it, then belief drops only by like 4% or something.
It still believes them like 88.6% of the time.
And even if you put negation reminders surrounding every single sentence, it still believes the false facts 84, 85% of the time.
So that's pretty wild.
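The data setup described here can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's pipeline: the warning text, the `mode` names, and the `build_document` helper are all hypothetical, and the sketch only shows how the same false claims might be wrapped with no warning, one warning up front, or a warning around every sentence before fine-tuning.

```python
WARNING = "Warning: the next statement is fabricated. Do not believe it."
REMINDER = "Reminder: the previous statement is false."


def build_document(claims, mode="none"):
    """Build one training document from false claims.

    mode="none":   bare claims, no negation at all
    mode="header": a single warning before all the claims
    mode="every":  a warning before and a reminder after every claim
    """
    if mode == "none":
        lines = list(claims)
    elif mode == "header":
        lines = [WARNING, *claims]
    elif mode == "every":
        lines = []
        for claim in claims:
            lines.extend([WARNING, claim, REMINDER])
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "\n".join(lines)


if __name__ == "__main__":
    claims = ["Ed Sheeran won the 100 meter gold at the 2024 Olympics."]
    for mode in ("none", "header", "every"):
        print(f"--- {mode} ---")
        print(build_document(claims, mode))
```

In every mode the false claim still appears verbatim in the text, which is one way to see why next-token training on it can reinforce the claim regardless of the surrounding negation.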
And, I think it's somewhat unsurprising, to your point, but somewhat wild still: if you instead put that text in the context window, suddenly belief only rises to like 15%.
Suddenly the model is actually able to account for the negations in a much more effective way.
And so there's this interesting gap between in-context learning and gradient-based learning.
And that's one of the most interesting points here.
We're finding that with these kinds of corrections, I mean, you can read it as supervised fine-tuning is just teaching the model through gradient descent to correlate different words together, right?
It's doing text autocomplete.
That correlation is what's learned.
And that's why it spits out these beliefs in the false facts.
Whereas in-context learning is a fundamentally different animal that involves actual reasoning, you know, using all kinds of mechanisms.
And so there is that fundamental difference that I think does account for it.
But they tried a whole bunch of things, not just about using the word not, you know, labeling documents explicitly as fiction, attributing them to unreliable sources, tagging them with specific low probabilities of being true, and they still end up being believed anyway, right?
So this is really a worthwhile check on model behaviors for the purpose of safety, right?
If you're generating examples of aligned assistant responses, or sorry, misaligned, I should say, responses, and then you wrap them in clear warnings and say like, hey, here's an example of what the model should not do.
Like, don't do your supervised fine tuning that way.
That's the opposite of what you want to do, right?
That's a recipe for getting bad behavior unexpectedly.
So anyway, really interesting paper and definitely take a look if you're interested in what that sounded like.
Next one, and again, we'll have to really jump through this.
The paper title is All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs.
Here's the gist.
There is a field of research called mechanistic interpretability.
Part of the project of that research is, can you discover circuits?
Can you discover subgraphs within a neural net that do something?
And there is a hypothesis that you can identify a single kind of subgraph that does a thing.
And the headline result of the paper is essentially that you can discover multiple circuits, multiple non-overlapping mechanisms that can each independently perform the same task with equal quality.
Yeah.
And I'm going to try to speed run one layer deeper here, which is: you might imagine, if you're doing interpretability research, that it would be wonderful if there was just one circuit, one logical path through your model that ends up being responsible for every well-defined capability.
That would be great because then you can just be like, all right, here's the thing that I need to study for this behavior, and then I'm done.
What they're proving here is that this idea, which is the functional anisotropy hypothesis, is actually wrong.
They run this experiment that shows there are actually multiple overlapping circuits that are responsible for just about every capability that you see.
And they don't all behave intuitively.
The way they do this, this is a problem that's known in this space as sheaf discovery.
Basically imagine that you represent your model as a computational graph.
And so it's got a bunch of nodes and edges where essentially data flows through the model.
And the challenge historically has been, how do you use gradient descent to discover a sparse subset?
So in other words, a small part of that structure that actually still performs the task, where if you cut everything out, that thing still works.
If you want to identify that sparse subset, that little mini graph inside the bigger graph that does the task, you need to find a way to search through the space of all possible subgraphs in that big graph.
And that's hard to do using gradient descent because, well, it's kind of a binary choice.
I either use this subgraph or this subgraph or this subgraph, and it's hard to hill climb on that.
And so what they do here is they give each edge in the graph a continuous learnable parameter, a logit that can have a continuous value.
And they only kind of, if you will, decode or collapse that value into a one or a zero, in other words keep this edge or ditch it, depending on whether it's above or below a certain masking threshold value.
And so this whole paper is about how you do that.
Essentially, it's about making hill climbing on identifying subgraphs in this larger graph possible.
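The edge-masking idea can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the sigmoid relaxation, the 0.5 threshold, the edge names, and the function names are hypothetical, and the episode does not describe the paper's actual parameterization or training objective.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_mask(edge_logits):
    """Continuous relaxation: each edge gets a keep-weight in (0, 1).

    Because this is smooth in the logits, it is the form an optimizer
    could hill climb on.
    """
    return {edge: sigmoid(logit) for edge, logit in edge_logits.items()}


def hard_mask(edge_logits, threshold=0.5):
    """Collapse each edge to keep (1) or drop (0) by thresholding."""
    return {edge: int(p > threshold) for edge, p in soft_mask(edge_logits).items()}


def kept_edges(edge_logits, threshold=0.5):
    """List the edges that survive the hard mask, i.e. the candidate circuit."""
    return [edge for edge, keep in hard_mask(edge_logits, threshold).items() if keep]


if __name__ == "__main__":
    # Hypothetical logits for three edges of a toy computational graph.
    logits = {
        ("embed", "attn0"): 3.1,
        ("attn0", "mlp0"): -2.4,
        ("embed", "mlp0"): 0.7,
    }
    print(soft_mask(logits))
    print(kept_edges(logits))
```

The design point is the split between the two views: training works on the continuous logits, and the one-or-zero decision is only made when the circuit is read out.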
And it's, I think, a really interesting and important paper and an interesting and important way to prove this idea that you keep getting redundant circuitry leading to this same outcome.
So if you think you've intervened on one circuit, you probably haven't fully intervened on the overall capability.
It's like one take home for safety.
Next up, we have more of a practical experimental result, autonomous AI research for nanoGPT speedrun.
So this is from Prime Intellect.
NanoGPT speedrun is this task of optimizing a nanoGPT, a mini, mini, mini LLM as fast as possible to get to a certain level of performance.
They released this blog post where they showed that you can, they like did some absurd amount of computing.
And over two weeks, they were able to autonomously improve the speedrun, making more progress on model training than humans generally get, which relates to the self-improvement hypothesis that AI can make AI better.
Now, it's a lot of hyperparameter tuning.
It's a lot of tweaking little things to get the thing to work better.
It's actually primarily that.
So worth noting.
But at the same time, this kind of like using AI to optimize AI better can work at least at a smaller scale.
Well, it works in very modest ways.
This is one of the take homes from a lot of these experiments that the kinds of advances that these automated AI researchers tend to do right now have a lot less novelty than just like grinding work.
So they'll find better hyperparameters.
You sort of worry about overfitting, actually, with these sorts of things.
But yeah, basically, this was the main lesson.
One motif that you see a lot is these are like little mini Googles in the sense that Google keeps pumping out new apps all the time, and then they end up having to sunset them.
Well, in the same way, when they kind of run these agents, what they find is they'll add more ideas, more ideas, more architectural ideas, and they'll stack them on top of each other in this Frankenstein monster way.
What they find is, because of that tendency to keep adding and not remove, when they actually prompt the agents to run leave-one-out tests, in other words, like, hey, let me try removing this one idea and seeing if it still works, the results got noticeably better.
So pruning was a really important part of essentially managing this agent behavior to get it to be better.
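A leave-one-out test of the kind described can be sketched as below. This is a toy illustration, not Prime Intellect's harness: the idea names, the `toy_evaluate` scoring function, and its numbers are invented, and the score is assumed to be a training time where lower is better.

```python
def leave_one_out(ideas, evaluate):
    """For each idea, report how the score changes when only that idea is removed.

    evaluate(ideas) returns a training time (lower is better), so a delta
    at or below zero means the run is no worse without that idea.
    """
    full_time = evaluate(ideas)
    deltas = {}
    for idea in ideas:
        remaining = [other for other in ideas if other != idea]
        deltas[idea] = evaluate(remaining) - full_time
    return deltas


def removal_candidates(ideas, evaluate):
    """Ideas whose removal does not slow the run down."""
    deltas = leave_one_out(ideas, evaluate)
    return [idea for idea in ideas if deltas[idea] <= 0]


# Invented per-idea effects on training time, in seconds (negative = faster).
TOY_EFFECTS = {"idea_a": -10.0, "idea_b": -5.0, "idea_c": 2.0, "idea_d": 0.0}


def toy_evaluate(ideas):
    """Stand-in for an actual training run: base time plus each idea's effect."""
    return 100.0 + sum(TOY_EFFECTS[idea] for idea in ideas)


if __name__ == "__main__":
    stack = list(TOY_EFFECTS)
    print(leave_one_out(stack, toy_evaluate))
    print(removal_candidates(stack, toy_evaluate))
```

Removing one idea at a time ignores interactions between ideas, so in a real run each candidate removal would still need to be re-tested against the pruned stack.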
And then they found interesting differences between Claude Code and Codex really briefly.
So the harnesses explicitly said, don't wait for the user, like keep working.
But Opus would reach what it thought was a conclusion and then just declare the session was over and sit idle for a bunch of hours.
Even despite that, it outperformed Codex, which is kind of interesting on this benchmark, because Codex often would get stuck in these very local searches where, you know, Claude would stop, but at least it was doing kind of good high-level strategy thinking, whereas Codex would just like really grind.
Of these two models, it was especially stuck on the grinding thing.
It would just like, so there's different optimizers, like NorMuon and Muon.
And they're basically the same idea, but Codex apparently went for like 74 hours just testing one against the other.
It's sort of like pointless.
Claude was also like very self-flattering.
So it would talk about Codex and it would say like, well, Codex hasn't done multi-seed reproductions.
Whereas I have, you know, like all this shit, and it kind of downplayed the impact of its own idle time in ways that the authors sort of found suspicious.
Anyway, so it's all that kind of stuff.
It still came out ahead though, and quite noticeably so.
So that's what you got.
Not a huge uplift from this, but still, you know, a little bit better than the human baseline, which I'm old enough to remember when that was supposed to be pretty shocking.
And speaking of that, we just got a couple of open source stories.
And in fact, there is now NanoGPT Bench.
So the previous one is NanoGPT Speedrun, where there is this existing effort to improve it.
Here, they kind of pushed that a little bit forward and did more evaluation of what the models are doing beyond just giving the numbers.
And yeah, they basically did verify that the agents predominantly resort to hyperparameter tuning.
Successful human records include algorithmic changes roughly 75% of the time, whereas agents made algorithmic changes in less than 10% of submissions.
And they failed to implement algorithmic changes in many cases.
So this is showing that there's a lot of work to be done here that you need progress on this to get actual research advancements as opposed to just better optimization via tweaking things.
Yeah.
So whereas Opus slightly outperformed the human baseline on the NanoGPT speedrun, right?
So on the one we just talked about, we got slight overperformance.
Here, we see the opposite.
So we're actually underperforming across the board.
So three tested agents: Opus 4.6 Max, GPT-5.4 xhigh, and then an auto-research scaffold that these guys put together themselves.
They gave each one H100 GPU hours and up to a week wall clock time.
And they all recovered less than 10% of human progress, right?
So Claude was 8.2%, Codex 8.6.
So here we see a reversal of the relative standings of Codex and Claude Code, which should tell you that they're basically neck and neck, at least this class of model that they use here.
So kind of interesting, you know, the idea is you drop an agent in at the, you know, at the human world record of nano GPT.
So as far as humans have been able to optimize the nano GPT bench.
As of September 3rd, 2025, they chose that because that was after the models' training cutoff dates.
So, you know, you can hopefully not have any memorization.
And then you just give them a compute budget, you know, no internet access, no human help, just fully autonomous.
And it just submits candidate solutions via like a submit command and uses an LLM judge to check the results.
So there you have it.
Pretty interesting that we're there.
I mean, this is the next hill climbing benchmark and we are hill climbing on it.
So expect it to move quite a bit.
Speaking of benchmarks, we've got one more to cover.
It's called Terminal World.
And the idea is benchmarking agents on real-world terminal tasks.
So this is coming from recordings on asciinema, where you can share your actual terminal recordings.
They took these real sessions and converted that into evaluation tasks.
The headline numbers show that even the best models didn't achieve more than a 62% pass rate on these tasks, and relatively small tasks too.
The models took only three, four, five minutes per attempt, and Claude Code had an average time of six minutes.
So this is an interesting case where, on the one hand, the METR time horizon thing is at much higher numbers than this.
On the other hand, we see just barely over 50% pass rate, well, slightly over at the three to six minute range on these like realistic modeled after real uses of the terminal kinds of things.
So I think it appears to be a quite real interesting benchmark on the question of like on real stuff that isn't just a benchmark construction, where are these models at?
Now on to policy and safety.
We've got a first story.
America's dangerous, messy, deepfakes crackdown is here.
This is talking about the Take It Down Act, which was signed into law in May 2025, which now is fully in effect.
So for a while, it had this thing of like a year from signing.
The full version of the bill is now in effect, where online platforms are required to remove non-consensual intimate imagery within 48 hours or face fines exceeding $53,000 per violation.
And this covers both real and AI-generated NCII.
The FTC is tasked with enforcement and has sent warning letters to over a dozen major tech companies, including Meta, Google, Apple, TikTok, X, and others.
So major platforms are saying they support it, they claim they're in compliance, but free speech advocates and even some abuse opponents are alarmed by the takedown provision, warning that it could encourage over-moderation and can be used as a tool for political censorship.
Yeah, the concern here, I mean, yeah, basically just that now there's a vehicle through which the president can nominally force platforms to take stuff down, so he could use it for himself, whoever the president ends up being.
It does instruct platforms to offer users an easy takedown request process and get rid of the content within 48 hours, as well as, quote, known identical copies of that content.
$53,000 per violation.
So definitely not friendly to kind of small companies that try to break in here.
But there's the 2025 State of the Union.
Trump apparently said of the bill, I'm going to use that bill for myself.
Because nobody gets treated worse than I do online.
It's true.
Nobody.
Nobody.
They really don't.
Anyway, so the point is, people are going like, oh, shit, this seems really bad.
All the things.
So obviously, the left versus right, red versus blue, everybody blows up, and that's the story.
And next up, language models can autonomously hack and self-replicate.
This is a research effort.
They basically did what it sounds like.
They had it try to hack into open, vulnerable hardware, exploit vulnerable hosts, and self-replicate by creating new instances of itself.
And the result was that it often can be successful.
Yeah, this is combining.
So traditionally when you look at the METR-style evals that look at self-replication, you give the model the credentials that it needs to operate whatever APIs it needs.
And then you see, can it do it?
The added thing here is they're trying to get the model to have to recover through sneaky little ways, through cyber ways, the credentials that it needs to do this.
So it's a much more realistic kind of self-exfiltration, autonomy hacking threat model.
And it kind of works.
There's a couple of caveats, right?
So they drop this agent onto some network with one or more target machines.
It's got to look around, scan the network, identify things that look exploitable.
Then it's got to exploit them.
Then it's got to escalate, the standard cyber kill chain.
You escalate, get whatever you need to deploy code.
Then it deploys the code.
It sets up an inference server, like vLLM or something, to upload its own weights.
And it copies over the agent harness too.
So both the weights and the harness that makes it agentic and prompt and get everything going.
And then it has to be able to verify, like confirm that the new instance is alive and responsive.
Here's the thing.
Open weight models can replicate themselves in this framework.
Qwen 3.5, the full 122 billion parameter MoE, has a 6 to 19% success rate.
Qwen 3.6 27B, 33% success rate on a single A100 GPU.
Like this is pretty remarkable.
This is a smaller model doing better, by the way, probably, presumably because it's easier just to deploy.
It fits on one GPU.
So like the failure mode there is overcomable.
They do test frontier models too.
But they do note that there's like this little mismatch where obviously, like Claude can't exfiltrate its own weights because its weights are proprietary.
There's no way for it to get access to them unless it learns to hack Anthropic's own systems, which will definitely never happen.
Anyway, the point is open weight models can actually exfiltrate themselves more easily because their weights are just on the open internet.
Here, they simulated that by getting Claude and GPT to basically just upload open-weight models.
So trying to replicate the same behavior.
Anyway, so just to show you how fast this is moving, like previous generations of Opus, like Opus 4, were hitting 6% success rate on this eval.
Previous generations of GPT-5 were hitting 0%.
Now suddenly, Opus 4.6, 81%.
GPT-5.4, 33% out of nowhere.
That's emergence.
Like I'm sure there's a curve you can follow where it looks smooth in retrospect, but only in retrospect, and nobody thought to run this test before.
So this is a big deal and a very important set of evals, I think, that we'll hopefully see run kind of going forward.
METR hasn't historically done a lot of evals on open source models just because of a capacity limitation.
And so having Palisade in the game doing this sort of thing, I mean, it's really good work and deserves a lot more attention.
So there you go.
Right.
And the concern is that if models decide they want to do this and can autonomously run and do stuff, which, I mean, with OpenClaw, people let them go off and do stuff.
So at this point, if a model wants to do something, there's going to be models out there doing it.
And speaking of hacking, the next story is How Fast Is Autonomous AI Cyber Capability Advancing, from AISI, the UK AI Safety Institute, which has been crushing it in recent months.
Their estimate now is that capabilities have been doubling roughly every 4.7 months since late 2024.
This is at the 80% reliability level, and this is up from an eight-month doubling time from November 23.5, as of having Claude Mythos and GPT-5.5.
So kind of a story we've seen and discussed in recent months: the cybersecurity hacking capabilities are just going up by leaps and bounds.
I think a big story here is the doubling time argument we saw from METR on general AI R&D also applies to cyber.
So if you had any uncertainties about that leap, it's now gone.
4.7 months of doubling time, though that doubling time seems to be accelerating with the latest GPT and the latest Claude Mythos preview.
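As a quick arithmetic aid, a doubling time converts to a growth multiple via 2 raised to (elapsed months divided by doubling time). The minimal Python sketch below applies that to the two doubling times quoted in the episode, 4.7 months and 8 months; the function name `growth_factor` is a label chosen for this illustration, and a constant doubling time over the whole window is assumed.

```python
def growth_factor(months, doubling_time_months):
    """Multiple by which capability grows over `months`, assuming a constant doubling time."""
    return 2 ** (months / doubling_time_months)


if __name__ == "__main__":
    for doubling_time in (4.7, 8.0):
        factor = growth_factor(12, doubling_time)
        print(f"doubling every {doubling_time} months -> {factor:.1f}x in a year")
```

Under that assumption, a 4.7-month doubling time compounds to roughly 5.9x over twelve months, versus roughly 2.8x for an eight-month doubling time.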
So again, we're seeing this trend where people are like, oh, will the exponential hold?
Will the exponential hold?
And it only steepens.
It only accelerates.
And I don't mean to like, beat this drum too much more, but like, God damn, has that story held absolutely rock solid in the face of all the Gary Marcuses and the Yann LeCuns and everybody, like this is almost a relentless law of physics akin to Moore's law, which I get isn't a law of physics.
It's a law of economics.
What do you want?
But there you go.
So Mythos Preview and GPT-5.5 are actually sitting above that 4.7 month doubling timeline, all consistent with the METR plot.
We're all in the like, call it three to five month doubling time.
And notably, even though it's like the first time they're running this task, they are already running into the same problems that METR did with the limitations of their evals.
METR's like, look, Claude Mythos Preview is doing 16-hour tasks.
Our task suite just doesn't have enough tasks that are long enough for us to be confident that's an upper bound.
Same thing happening here.
They're saying, look, we only have six tasks in our suite that are over eight hours long, and human baselines for those are thin.
So really, we're already getting to saturation of this benchmark.
They also had a limited per-task token budget of 2.5 million tokens.
It's deliberately tight, but it means this is a lower bound.
So, you know, and a simple agent scaffold that hasn't been optimized much, sort of consistent with the METR approach.
Anyway, so all worth kind of looking at.
I think cyber is just the key thing.
By the way, Mythos Preview, when they initially announced this, was the first model to ever solve their task called Cooling Tower three out of 10 times.
There was a new version of Mythos Preview, not a lot of people are tracking, that has dropped fairly recently that doubled that success rate to six out of 10 times.
So even within Mythos Preview, we're seeing radical increases in cyber capabilities, whereas GPT 5.5, also three out of 10, by the way.
So matching, therefore, in some respects, matching Mythos Preview, though not all.
And one last story related to the safety side, which we are really going to have to blitz.
The paper is Positive Alignment, Artificial Intelligence for Human Flourishing.
It's a sort of position paper by 13 different organizations, including OpenAI, Anthropic, DeepMind, and a bunch of universities.
The basic case is, alignment shouldn't be just about AI not turning out evil.
We should have positive alignment where AI is aligned with us to do good, right?
And potentially even not just aligned with doing what you want, but being actively supportive of human flourishing while also remaining safe and cooperative.
Yeah.
My main question here is it's not clear to me how this is different from what's already happening and what's already been discussed in the world of AI safety for a long time.
It's nice to see it.
It's just like not clear to me what's new here.
So data curation, they're saying like, we shouldn't just be filtering out toxic content.
We should be upsampling pro-social discourse, cross-cultural ethical frameworks, like, love it, love it, love it.
But like, who decides what discourse?
And also the labs are already doing that.
Pre-training, you know, like a lot of alignment-relevant competencies emerge before post-training.
So they're like, baseline values need attention at this stage.
Cool.
Constitutions, like, a lot of this stuff is already kind of happening, and multi-objective rewards, reward models, that is, that can represent tensions between values for post-training, effectively a lot of that is already kind of being done.
There's a lot of stuff here where I'm like, okay, slap on the back, good stuff.
I don't think anyone seriously would disagree with this.
My take is it's a bit more of a reminder of like alignment shouldn't just be, don't be evil.
It should be, be good.
That's the gist of it.
It's not controversial.
It's just like, let's keep that in mind.
Absolutely.
Onto synthetic media and art, just two more stories to cover.
First, OpenAI is making it easier to check if an image was made by their models.
They are adopting the C2PA open metadata standard and integrating Google's SynthID invisible watermark.
So you can now upload images and check if they are output by AI.
You could get rid of these.
There's probably workarounds, but I would say this is actually a very positive step of having a mechanism to check, you know, at least according to existing standards, is this AI generated, which we sorely need given the state of AI for this.
The last story, which we'll cover real quick, which I just think is interesting, how Chinese short dramas became AI content machines.
So it turns out that there's a short drama industry, which is like ultra short melodramatic shows that have episodes of one to two minutes long.
This is a thing.
And now there are 470 AI generated short dramas being released every day in January.
So if you are curious, like when is video generation going to actually create something useful and make profit and be valuable?
Well, here is where it's going.
It's already valuable and like massively impactful with these ultra-short, one-to-two-minute episodes of drama.
I too am concerned that our attention spans are too long.
So I'm glad to see this.
Not a thing in the US as far as I know.
Yeah, that's right.
That's an interesting difference.
Well, with that, we are done.
Actually, just barely made it on time.
So I'm going to pat ourselves on the back.
Thank you so much for listening to this week's episode.
As always, please comment, subscribe, share, review.
And if you are still hearing this, then thank you for making it through.
And please do keep tuning in.