# Moon Lake AI: Causal World Models, Structure vs. Scale, and Embodied AI Strategy

**Podcast:** Latent Space: The AI Engineer Podcast
**Published:** 2026-04-02

## Transcript

I think this whole space is extremely difficult as things are emerging now.
And I mean it's not only for world models, I think it's for everything, including text-based models, right?
Because you know, in the early days it seemed very easy to have good benchmarks because we can do things like question answering benchmarks.
But you know, these days, so much of what people are wanting to do is nothing like that, right?
You're wanting to get some recommendations about which backpack would be best for you for your trip in Europe next month.
It's not so easy to come up with a benchmark.
And it's the same problem with these world models.
Before we get into today's episode, I just have a small message for listeners.
Thank you.
We would not be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content.
We've been approached by sponsors on an almost daily basis, but fortunately, enough of you actually subscribe to us to keep all this sustainable without ads.
And we want to keep it that way.
But I just have one favor to ask all of you.
The single most powerful, completely free thing you can do is to click that subscribe button.
It's the only thing I'll ever ask of you.
And it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week.
If you do it, I promise you, we'll never stop working to make the show even better.
Now let's get into it.
Okay, we're back in the studio with Moon Lake's two leads.
I guess there are other founders as well, but Sun and Chris Manning, welcome to the studio.
Thanks a lot.
Thanks for having us.
You guys have, you know, burst onto the scene with a really refreshing new take on world models.
I just want to, I guess, ask how the two of you came together.
Chris, you're a legend in NLP and just AI in general.
You're his grad student, I guess.
Actually, my co-founder.
Oh, yeah.
I should give a lot of credit to my co-founder, Sharon.
Yeah.
Um she was she was actually working with Professor Feblion Jun, and then she ended up working with um Ron and Chris Manning here.
And then so I got connected through to Chris initially actually through my co-founder.
What is Moon Lake?
I'm actually also very curious about the name, but why go into world models?
So I was actually working a lot with NVIDIA Research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents.
And then there's two observations, one in academia and one in industry.
In industry, folks at NVIDIA are actually paying a lot of dollars to purchase these types of interactive worlds, whether it's for the sake of evaluation or for training the robots, policies, or models.
And then in academia, the same thing is happening.
And more specifically, when I was working with NVIDIA on the synthetic data foundation model training project, we were generating a lot of synthetic data and showing that, hey, these synthetic data are actually as useful as real-world data when it comes to multimodal pre-training.
But then, like I said, there's a lot of dollars being paid out to external vendors or other folks to manually curate these types of data.
It was very clear to us that, okay, on our way to, let's call it, embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data.
And the demand for those types of data is growing exponentially, but everybody's sort of thinking about it from a pure, say, video generation perspective or something else.
But we feel like the true opportunity is actually building reasoning models that can do these things, like how humans do these things today.
So that's a little bit on the genesis of Moon Lake.
And I think the reason I got into world models was partly a philosophical take on the world, where I, you know, believe in the simulation theory and stuff like that.
But on the other hand, it's really just like, oh, there's an opportunity there that I feel like nobody's doing the way I think it should be done.
I can say a little bit about that.
Yeah, so the overall goal is the pursuit of artificial intelligence, and, you know, most of my career's been doing that in the language space, and that's been just extremely productive, as we all know from the story of the last few years.
I don't have to tell you about how much we've achieved with large language models.
But although they've been extremely effective for ramping up language and general intelligence, it's clearly not the whole world.
There's this multimodal world of vision, sound, taste, where you'd like to be dealing with more than just language.
And then the question is how to do it.
And despite, you know, a huge investment in the computer vision space, right?
As a research field, computer vision has for decades been far, far larger than the language space, actually.
I mean, I think it's fair to say that you know vision understanding sort of stalled out, right?
You got to object recognition, and then progress just wasn't being made, right?
If you look at any of these um vision language models, it's the language that's doing 90% of the work, and the vision barely works.
And so there's really an interesting research question as to why that is.
I think one of your blog posts you put it as structure, not scale.
Is that uh a general thesis?
Yeah, well, scale is good too.
It's not scale's good too.
Lots of data is good as well.
But nevertheless, you want the structure, yeah, to be able to much more efficiently learn.
Yeah.
The other thing I really liked also was you put out an example of what your kind of reasoning traces look like, right?
Which you would there's still is is the word that comes to mind.
I don't even think that's a good description, but it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings, and what have you.
But like that, that is the kind of example that involves, let's call it spatial reasoning, world model reasoning as compared to normal LLM reasoning.
Yeah.
But also like taking it a step back.
So how do you guys define world models?
You know, a lot of people see like, okay, you can do diffusion, you can do video generation, but uh you guys put out quite a few blog posts.
You put out an essay recently, we can even pull it up about efficient world models.
Um, you have a pretty like structural definition here, but for the general audience that don't super follow the space, right?
What's the difference in what we see from, like, a video generation model to a world gen, a simulator?
How do you kind of paint that last year?
Yeah, so I think this is actually a little bit subtle because, you know, people look at these amazing generative AI video models, Sora, Veo 3, Genie, one of these things, and they think, oh, this is amazing.
This is sort of, you know, we've solved understanding the world because you can produce these generative AI videos.
But the reality is that although the visuals do look fantastic, those visuals actually aren't accompanied by an understanding of the 3D world, understanding how objects can move, what the consequences of different actions are, and that's what's really needed for spatial intelligence.
So I mean, a term we sometimes use is that you need action conditioned world models, that you only actually have a world model if you can predict, given some action is taken, what is going to change in the world because of it.
And in particular, that becomes hard over longer time scales.
So if you're simply, you know, trying to predict the next video frame, that's not so difficult.
But what you actually want to do is understand the consequences, likely consequences of actions, minutes into the future.
And to do that, you actually need much more of an abstracted semantic model of the world.
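To make the "action-conditioned" idea concrete, here is a minimal sketch in Python. It is a toy, not anything from Moon Lake: the 1-D cart world and the names `State`, `step`, and `rollout` are all assumptions made up for illustration. The point it shows is that the transition takes an action as input, so two different action sequences from the same starting state lead to different predicted futures a full simulated minute later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Abstract world state for a hypothetical toy world: a cart on a 1-D track."""
    position: float
    velocity: float


def step(state: State, action: float, dt: float = 0.1) -> State:
    """Action-conditioned transition: the next state depends on state AND action.

    The action is treated as an acceleration applied for one time step.
    """
    velocity = state.velocity + action * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def rollout(state: State, actions: list[float], dt: float = 0.1) -> State:
    """Predict the consequence of a whole action sequence, not just the next frame."""
    for action in actions:
        state = step(state, action, dt)
    return state


start = State(position=0.0, velocity=0.0)
push = rollout(start, [1.0] * 600)   # accelerate for 60 simulated seconds
coast = rollout(start, [0.0] * 600)  # take no action for 60 simulated seconds
print(push, coast)
```

A purely observational predictor would have no `action` argument at all, which is why it cannot answer "what happens if I push instead of coast".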
Yeah.
The question comes where you want to have more structure than is available in just predicting the next token.
Um, and typically, well, let's let's call it the experience of the last five years has been that that is just washed away by scale, right?
Um, so what is the right middle ground here that uh you don't ignore the bitter lesson, but also you can be more efficient than what we're doing today.
You know, one possibility is look, if we just collect masses and masses and masses and masses of video data, this problem will be solved.
Um under certain assumptions that could be true, but there are sort of multiple avenues in which it could not be true.
The first is what's really essential is understanding the consequences of actions, producing an action-conditioned world model.
And if you're simply collecting observational video data, which is the easy stuff to collect when you're sort of mining online videos, you don't actually know the actions that are being taken to see how the video is changing.
And so if you're never collecting directly actions and you're having to try and infer them from what happened in the observed video, that's not impossible, but it's very hard and it's not really established that you can get that to work at any scale yet.
And so there's a lot of premium on collecting action conditioned video data, which is part of why there's been a lot of interest in using simulation so that you can be collecting data where you do know the actions, which is in quite limited supply.
But there's also in the limit of as much data as you could possibly have, you know, maybe the problem is eventually solvable.
But even though we collect huge amounts of text data, text data is always at a great level of abstraction, right?
Language is a human-designed abstracted representation where there's meaning in each token and it's representing an abstraction of the world, right?
As soon as you're describing someone as a professor, and as soon as you're saying that they're condescending, right?
You know, these are very abstracted descriptions of the world, not at the level of what you're observing at the pixel level.
And so to get to that kind of degree of abstraction starting from pixels takes orders of magnitude of extra data and processing.
And so, although, you know, we absolutely want to get as much data as possible and use the bitter lesson,
nevertheless, if there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you're going to be able to make a lot more progress a lot more quickly, and that's the bet here.
And so you could just say that's only wanting to be able to, you know, do it more efficiently, do it more quickly, do it more cheaply.
But I think it's actually more than that.
I think one should be making the analogy to how human beings work.
At one level, you know, yes, we have these high resolution eyes, and we can look and see a scene like a video.
But all of the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed, right?
That you're doing fairly fine processing of exactly what you're focusing on.
But, you know, as soon as it's away from that, of, yeah, there's another guy over there, you're sort of only processing top-down this very abstracted semantic description of the world around you.
And so, you know, that's what human beings are doing.
They're working with semantic abstractions.
And so I think it is just the right representation because we also have other goals.
We want to be able to do, you know, real-time worlds.
That means there's a limit to how much processing you can do, and we want to do long-term planning and consistency, and again, that favors abstraction.
I mean, I guess there was actually a recent blog post that came out from our friends at Physical Intelligence, and, you know, they were sort of heading in the same direction.
They were saying, oh, for our model to maintain a long-term memory of what's happening in the world so we can go longer term, we're actually storing text of what's, you know, been happening in the world, right?
It's not such a successful strategy of trying to keep it all at a pixel level.
And yeah, I mean, you can see it in video models, like with temporal consistency: we're at the scale of training on, you know, all the video data we have,
and we have it for maybe 30 seconds, a few minutes.
That's not the same as a game state played for half an hour, right?
Um, I thought you guys break it down pretty well.
You have a you have a blog post about uh building multimodal worlds with an agent.
I don't know if you guys want to talk about this.
This is one of the things I read.
I thought Yeah, it's the thing I talked about with the reasoning chain, yeah.
So there's like different phases to this.
It seems like it's more of an agent, a scaffold, a very different approach than just, you know, typing in a prompt, where you don't have the same consistency.
Also, for people that are listening, you know, I would highly recommend reading it.
It breaks down the problem in a different light, right?
So, like, what do you need to consider when you're talking about video, world, game models, right?
What are the factors?
What are the elements?
What's the state?
So I don't know if you guys have stuff to talk about for this one.
Yeah.
Um actually I wanted to add on a little bit on our previous point.
Which is just like two basically.
I do feel like sometimes people get confused: oh, we're taking a method with abstraction,
so that means they don't believe in the bitter lesson.
Like, that's just false, right?
Like, we are believers in the bitter lesson.
But then I feel like the question that we always discuss is like what is the right abstraction level today?
The analogy I like to make is like let's just say we can encode and decode, represent all of images, videos, audio in bytes.
Then the most bitter-lesson approach is to train a next-byte prediction model as opposed to a next-token prediction model, where it's just like, okay, it's natively multimodal. But, to Chris's point, it's the scale and compute you need to achieve that.
Um so that's why we always come back to like okay, what is the most efficient way to do it?
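As a rough back-of-the-envelope illustration of that scale argument, the sketch below compares how many raw bytes a short clip takes against a one-sentence text description of the same event. The resolution, frame rate, clip length, and the sentence are all assumptions chosen for the example, and the video figure is for uncompressed RGB frames, so it overstates what a real system with compression or learned tokenization would face. It says nothing about how Moon Lake actually represents worlds.

```python
def raw_video_bytes(width: int, height: int, fps: int, seconds: int, channels: int = 3) -> int:
    """Size of uncompressed video: one byte per channel per pixel per frame."""
    return width * height * channels * fps * seconds


# Assumed clip: 10 seconds of 720p RGB video at 30 frames per second.
video_bytes = raw_video_bytes(width=1280, height=720, fps=30, seconds=10)

# Assumed abstract description of what happens in that clip.
description = "A bowling ball rolls down the lane and knocks over seven pins."
text_bytes = len(description.encode("utf-8"))

ratio = video_bytes / text_bytes
print(f"video: {video_bytes:,} bytes, text: {text_bytes} bytes, ratio: {ratio:,.0f}x")
```

A next-byte model over the raw clip has to predict every one of those bytes in sequence, while the abstract description carries the causally relevant content in a few dozen.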
And reasoning models, to the point of this blog post, are a showcase of like, hey, we're actually just reasoning about the world, and reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model.
Yeah, it's like you're improving the encoder of whatever you're trying to model, and a better representation would just represent the important things in less space.
Yeah, which would just be more efficient.
Yeah.
So yeah, I fully agree that it is not antagonistic to the bitter lesson.
I do want to mention one more thing.
Are there any philosophical differences with the JEPA stuff that Yann LeCun is working on?
I gotta go there.
You're you're imagining like some latent abstraction.
I'm like, okay, fine, let's let's talk about it, right?
Like it's an elephant in the room.
Yeah, there are philosophical differences.
Yann LeCun is a dear friend of mine.
But he has never appreciated the power of language in particular or symbolic representations in general.
Yann is a very visual thinker.
He always wants to claim that he thinks visually and there are no words, symbols, or math in his head.
Maybe that's true of Yann; it's certainly not the way I think.
But at any rate, you know, according to Yann, the basic stuff of the world and of intelligence is visual, and language is just this low-bitrate communication mechanism between humans, and it doesn't have much other utility, and it's far inferior to the high-bitrate video that comes into your eyes.
And I think he's fundamentally missing a number of important things there, right?
Think of this evolutionary argument looking at animals, right?
The closest analogy is with chimps, right?
So chimpanzees, you know, have fairly similar brains to human beings.
They have great vision systems, they have great memory systems, they've got, you know, better short-term memory than we do, they can plan, they can build primitive tools.
But, you know, humans are massively ahead in what we understand about the world, what we can plan, what we can build.
And essentially, what took off for us was that humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level, which just gave this sort of vaulting of what could be done with the intelligence in brains.
So the philosopher Dan Dennett refers to language as a cognitive tool and argues that you know humans, unique among the creatures in the world, have managed to build their own cognitive tools, and language is the famous first example, but other things like mathematics and programming languages are also cognitive tools.
They give you an ability to think in abstractions, in extended causal reasoning chains, and that allows you to do much more.
And we use that for spatial representation and intelligence and planning and gameplay as well.
So we believe, and this is, you know, underlying the specific technologies that Moon Lake is making, that symbolic representations are powerful, and you want to use them in your understanding of the visual world, when you want a causal understanding, when you want to maintain long-term consistency and prediction.
And, you know, as I understand it, that's just not in Yann LeCun's worldview.
So I think that's a fundamental philosophical difference.
Then there's the specific model he's been advancing, JEPA.
I mean, that's a reasonable research bet as a direction to head in for building out a model of the visual world.
To my mind, it's sort of one reasonable research bet.
It's not really established that it's the best one that everyone should be following.
At least developed at scale at Meta.
But it's not just vision, right?
Like, I mean, JEPA is, you know, just joint embedding prediction, and it can be applied to anything, really.
And and people have done it.
If the argument is that there is a latent representation that is probably more suited to the task, then why not let machines do it for us instead of predefining it at all?
And isn't something like a JEPA-shaped thing the right answer?
And if not, why not?
So I think there's a part of JEPA that's right, which is you do want to have a joint embedding that gives you a consistent model of the world.
And Yann's argument is you can never get that from autoregressive language models because they're sort of left to right, churning out one token at a time.
I guess this is where we're in, you know, the research arguments of the field.
You know, I'm not actually convinced that's right, because although the token production is this autoregressive process that's heading, you know, left to right (I guess it doesn't have to be left to right, but anyway, in a sequence of tokens; we could have right-to-left Arabic),
you know, although that's true, all of the weights of the model that are internal to the transformer are a joint model of the model's understanding of the world.
And so I think you can think of the weights of the model as a form of joint representation, and therefore it is plausible to think that that could be the basis of a world model which avoids Yann's objections.
I think I follow, and obviously that will touch on what Moon Lake eventually ends up doing as well, right?
Like, which it's hard to tell because you put out the end results, but we don't know the inputs that go into it.
So it's it's like you know, that's that's something that we have to figure out over time.
Yeah.
I mean, I guess this kind of breaks down some of the outputs.
Do you want to walk us through it?
Yeah, so this really just walks us through the reasoning traces of like, okay, let's just say we want to build a world. In this context, it's really just a game demo that shows the variety of interactions that this world model can build.
And yeah, it's really just the reasoning traces of like, okay, you're prompted to create a bowling game.
Like, how did it achieve what you saw?
That level of causality, interaction, and consistency, right?
So yeah, this is almost just an example of a reasoning trace.
Very detailed.
Very, very detailed.
Like you gotta like, you don't even realize it, right?
Like when a video is generated, what happens when a ball strikes a pin, right?
So first, there's audio in that, like audio triggers happen, the score increments, the world changes, like pins have to start dropping, there's a timer that goes on.
Um, you know, it's just like very similar to how now we're used to reasoning for language models.
There's a whole state of what happens.
So geometry, physics, uh, all this stuff.
And then yeah, there's kind of that single prompt, so asset, um physication, all this stuff.
It's it's like a it's a nice view to see what's going on.
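The chain of consequences described here (ball strikes pins, audio triggers, pins drop, score increments) can be sketched as an explicit state update. This is a hypothetical toy, not Moon Lake's reasoning trace format: `BowlingState`, `ball_hits`, `new_game`, and the event strings are invented for illustration, and scoring is simplified to "one point per pin knocked down" with no strikes or spares.

```python
from dataclasses import dataclass, field


@dataclass
class BowlingState:
    """Explicit, persistent game state (hypothetical toy, ten pins numbered 1-10)."""
    pins_standing: set[int] = field(default_factory=lambda: set(range(1, 11)))
    score: int = 0
    events: list[str] = field(default_factory=list)


def ball_hits(state: BowlingState, pins_hit: set[int]) -> BowlingState:
    """Apply the consequences of the ball striking the given pins."""
    knocked = state.pins_standing & pins_hit  # only standing pins can fall
    if knocked:
        state.events.append("audio:pin_clatter")  # audio trigger
        state.pins_standing -= knocked            # the world changes
        state.score += len(knocked)               # score increments
        state.events.append(f"score:{state.score}")
    return state


def new_game(state: BowlingState) -> BowlingState:
    """Reset the rack and the score so the player can practice again."""
    state.pins_standing = set(range(1, 11))
    state.score = 0
    state.events.append("reset")
    return state


game = BowlingState()
ball_hits(game, {1, 2, 3})
ball_hits(game, {3, 4})  # pin 3 is already down, so only pin 4 counts
print(game.score, sorted(game.pins_standing), game.events)
```

Because the state is explicit, hitting an already-fallen pin cannot score twice, which is the kind of consistency a frame-by-frame pixel predictor has no direct way to guarantee.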
I think Sun is also too polite to point out that both Google's Genie demos and World Labs' Marble do not have interactive worlds.
Uh that's the benefit of having a reasoning model, right?
Like, because you can you can say, oh, like maybe in this particular context, I want to learn how to bowl.
And then you can say, okay, then what is important when it comes to learning how to bowl?
Okay, maybe it's like I need to understand the basics of, like, physics, and I want to throw it over them.
I want to know that when it resets, it's a new game.
So, basically, you know to pick up the ball, you know that the ball's gonna cause the pins to fall down.
You know that what's important to this particular bowling game is to score.
And you know that the score corresponds to the number of pins that fell down.
Um so it's just like if it's a model that sort of knows what it looks like, knows what a bowling game looks like, but doesn't actually allow you to practice over and over again and to understand that, oh, like what it takes to actually get a high score, then it sort of doesn't actually allow you to learn what you set out to learn within the world model, right?
And I think this is really just one example showing the advantages of the approach that we're taking over most of, let's call it, what the zeitgeist is today when people talk about quote-unquote world models.
Right.
So it sort of seems like the question to ask when there's a world model is can I not only just wander around the world and look at the beautiful graphics, can I interact with the objects in the world and see the right consequences of actions.
And you also understand what the consequences would be if you do something, right?
So it's not just like, okay, there's one thing, if I pick it up, something will happen, but you know, there's there's 50 options and I know I can expect, I can infer what would happen if I do any of them, right?
So very different when you can actually see it play around with it.
Um there's two cheeky elements of that.
I mean, the sort of, I guess, less ambitious one is, let's really establish for listeners:
why is this fundamentally different than writing Unity code, right?
Like just creating a model to translate a prompt into Unity code.
So there is an underlying physics engine.
Yeah.
In that sense, there are some overlapping things with Unity, but the way we think about it is that a physics engine or tools or code are cognitive tools, borrowing Chris's term, right?
Like, tools that the model can employ as means to an end.
So today, maybe you say, okay, in this particular context, we care about physics, we care about the long-term causal consequences.
Then yes, we employ a physics engine.
And then maybe tomorrow we say, okay, we're training, let's just say, drones, where we really only care about fluid dynamics and the visual aspect of the world. Then yeah, maybe the model actually doesn't have to use a physics engine, or maybe it employs other types of representation or physics engine to achieve the task.
So yes, writing code for Unity is sort of similar to a tool that our model can employ, but our goal is for the model to take a representation-conditioned reasoning approach or process internally.
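One way to picture "engines as tools the model can employ" is a small dispatch table, sketched below. Everything here is an assumption for illustration: the two toy simulators, the `TOOLS` registry, and especially `choose_tool`, which is a keyword rule standing in for whatever reasoning the real model does. It is not Moon Lake's architecture.

```python
from typing import Callable

State = dict[str, float]


def rigid_body_step(state: State, dt: float) -> State:
    """Hypothetical rigid-body tool: gravity acting on a falling object."""
    velocity = state["velocity"] - 9.81 * dt
    height = max(0.0, state["height"] + velocity * dt)
    return {"height": height, "velocity": velocity}


def airflow_step(state: State, dt: float) -> State:
    """Hypothetical aerodynamic tool: drag pulls velocity toward the wind speed."""
    drag = 0.5  # assumed drag coefficient, arbitrary units
    velocity = state["velocity"] + drag * (state["wind"] - state["velocity"]) * dt
    return {**state, "velocity": velocity}


# Registry of tools the "model" may call; names are made up for this sketch.
TOOLS: dict[str, Callable[[State, float], State]] = {
    "rigid_body": rigid_body_step,
    "airflow": airflow_step,
}


def choose_tool(task: str) -> str:
    """Stand-in for the model's reasoning: a keyword rule, purely illustrative."""
    return "airflow" if "drone" in task.lower() else "rigid_body"


def simulate(task: str, state: State, steps: int, dt: float = 0.01) -> State:
    """Pick a tool for the task, then advance the state with it."""
    tool = TOOLS[choose_tool(task)]
    for _ in range(steps):
        state = tool(state, dt)
    return state


print(simulate("bowling game", {"height": 1.0, "velocity": 0.0}, steps=10))
print(simulate("drone flight", {"velocity": 0.0, "wind": 2.0}, steps=100))
```

The design point is only that the simulator is chosen per task rather than baked in; swapping or adding an entry in `TOOLS` does not change the calling code.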
Yeah.
Using these things as just, like, general tool calls, right?
Which I think is very interesting.
The other more ambitious one is uh some kind of recursive element where it becomes multiplayer, right?
Like here, there's a single player element, you're not modeling any other people involved, and that is a whole other thing.
But in fact, we can already do multiplayer.
Oh, yeah, okay.
I haven't seen any demonstration.
If you actually just prompt our model and say, hey, configure this to be multiplayer, then it'll do that; it'll configure multiplayer persistency and a database for you.
Uh easy.
Yeah.
So what what are like some of the current limitations and where we're at?
So there's one approach of like, okay, scale up video predictors, obviously there's data issues.
Uh, you know, with approaches like this, uh, is it data constraints?
What are like the next steps?
Is it real time?
Like, so there's one side of you know, write an agent to write Unity code, but okay, I want to be streaming a game real time.
I want to have characters being also like agentic, but where where do we kind of see this scaling up, right?
Yeah, there's definitely a data constraint: the more data, the better this reasoning model can basically act as a human would to operate a variety of tools and software to build whatever is necessary.
And then there's a sort of fidelity constraint, which we're actually solving with another model, Reverie, which we can talk about later.
But it's like, well, it's not as easy to get to photorealism with the approach that we're taking.
But we think there are better solutions to that, which we could dive into later.
The one thing you note here is it's a diffusion model, right?
So there's a few approaches: diffusion, Gaussian splatting.
So the Reverie diffusion model, do you guys want to introduce it?
Yeah, totally.
So within our world modeling framework, we think there are two models that we train, right?
Like, there's the multimodal reasoning model that we just talked about, which essentially handles mainly the causality, the persistency, and the logic and determinism of the world.
And then Reverie is our bet on saying, okay, while that model can take care of all these things that we just talked about, its limitation compared to existing, say, video models is that it doesn't have as high a pixel fidelity right out of the gate, right?
And Reverie is to say, hey, we can actually take whatever persistent representation we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles or arbitrary styles you want.
So this model is almost to say, hey, I'm going to respect the persistency and interactivity of the world that you created, but my only job is to make sure that its pixel distribution is close to what we want.
Yeah.
Yeah.
You cap the KL divergence.
No, no, I mean this is a classic, like, how you don't stray too far from the source material as you cap the KL, which is kind of cool.
Yeah, yeah.
I mean, the difference is, and Sun was pointing at this, we're sort of saying it's in one way a more difficult path, but a better path. You know, typically the diffusion models are producing the whole scene and it looks lovely, but there isn't spatial understanding behind it, which is what allows for the real-time graphics gameplay, the spatial intelligence, understanding the consequences of worlds. Whereas this is taking a path where it is assuming an abstracted semantic model of the world, the world state, and then the diffusion model is being used on top of that to produce the high-quality graphics.
Is there an intended practical uh or business use for this, or is it like a like a demonstration of capabilities?
We actually believe that this is gonna be the next paradigm of rendering.
So it's gonna replace rasterizers, it's gonna replace DLSS today, because it has this pixel prior that's learned from the world, such that you can literally play any game in photorealistic styles, which is a lot of people's desire when they do GTA, right?
Like all the mods, all the people adding perfect lighting and all this.
So skins for worlds, let's call it.
Skins.
That's called skins for worlds.
You can call it skin, you can call it customization, you can play it how you want, right?
Yeah, exactly.
And I think another thing that we really pointed out specifically in this blog is the programmability of it, right?
So what this means is that, well, historically, the renderer is always a derivative of the game state, right?
You're saying, oh, here's the game state, I'm rendering out a frame.
But here I'm saying this renderer can actually be part of the gameplay loop.
I can say something along the lines of: upon getting 10 apples, my weapon of choice, my bullets, are gonna turn into apples.
And that's possible because we can basically dynamically have certain game states trigger the preconditions to the renderer, such that the rendering is now part of the game loop too.
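The apples example can be sketched as a game loop in which the state decides what the renderer is conditioned on. This is a hypothetical toy: `GameState`, `render_conditions`, `render`, and `tick` are invented names, and `render` just returns a text description where a real system would call a learned renderer. Only the "10 apples turns bullets into apples" rule comes from the conversation.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Minimal persistent game state for the apples example."""
    apples: int = 0
    bullets_fired: int = 0


def render_conditions(state: GameState) -> dict[str, str]:
    """Game state triggers the conditioning passed to the renderer."""
    conditions = {"bullet_appearance": "bullet"}
    if state.apples >= 10:  # rule from the example: 10 apples changes the bullets
        conditions["bullet_appearance"] = "apple"
    return conditions


def render(state: GameState, conditions: dict[str, str]) -> str:
    """Stand-in for a learned renderer: returns a text description of the frame."""
    return f"{state.bullets_fired} x {conditions['bullet_appearance']} in flight"


def tick(state: GameState, picked_apple: bool, fired: bool) -> str:
    """One iteration of the game loop: update state, then render conditioned on it."""
    if picked_apple:
        state.apples += 1
    if fired:
        state.bullets_fired += 1
    return render(state, render_conditions(state))


player = GameState(apples=9)
print(tick(player, picked_apple=False, fired=True))
print(tick(player, picked_apple=True, fired=True))
```

The renderer never mutates `GameState`; it only reads conditions derived from it, so appearance can change mid-game without touching the underlying logic.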
One thing is to just say, okay, it's it's it's the appearance.
But the second thing is also to say there are these novel interactions that are possible because this renderer now actually has priors of the world.
And it's up to the artist to figure out what to do with it.
It is up to the creators, yes.
Yeah.
And I also think that's actually another big argument that we're making, and the reason that we're taking the bet we're taking, is that a lot of the time, whether it's for embodied AI or gaming, you want a layer where humans can inject their intentions, right?
So for example, let's just say in the context of gaming, it's obviously, like, my creative intent.
But maybe in the context of embodied AI, it's like, oh, I take this foundational policy and I want to actually fine-tune it to deploy in my house.
So you want to almost have a layer where a human can say, oh, here's the distribution of things I want to create to achieve my goal.
And I think 3D graphics, as it is today, is basically the layer for people to say, hey, what do I care about in this world?
And it allows human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just saying, hey, I'm gonna generate something arbitrary, and it's just prompts, you know.
It's one of those things where I think you're gonna build up a series of models, right?
This is probably, like, the highest-utility or highest-frequency one.
I don't know what to call this, where, yeah, you can immediately drop this in on any game and you don't need anything else that you guys do.
But um I I could see I could see that.
I think the the human intent is something that people are not even used to because we're so used to static worlds or um you know, worlds that just don't react or I don't know.
It's it's you're kind of blowing my mind right now with like, well, I'm I wonder if you've talked to people at GDC and what are they gonna do with it.
Yeah.
Now, the stance that we take on this front is that we're not gonna be more creative than our users.
But we want to make sure that we're building things in a way that really allows them to express their intent.
On the thing that you said about "here's the distribution that I want," I think text may be too low a bandwidth to really demonstrate it, because I'm probably just gonna want to drop in a bunch of reference assets and then you can figure it out from there.
You probably want to do a mixture of both, right?
Like you throw in a few images: I want this style, I want it to look like this.
Like it's it's a mixture, right?
I I think it's a mixture.
I mean, yeah.
I mean, there's clearly a visual component of this, and it's not that everything can be text, because of course you want to give a visual look, but there's also a massive amount about the overall picture of the look of the world and the behavior of things that you can express in a few words of text, and that would be very time-consuming and difficult to do by visual means.
So I think, yeah, you want a combination of both.
So one question I kind of have is how do we go about evaluating world models?
So like there's many axes, right?
One is like, okay, I have preferences, how well do we adhere to prompts?
One is the simulation.
One is like, is there core logic that's broken?
So we know how to evaluate diffusion: there's fidelity, there's stuff like that.
But what are some of the challenges that most people probably aren't thinking about?
Yeah, I think this is like a great question, and probably one of the hardest questions in world models, because like I think it always comes back to what are you building this world model for?
And depending on your end goal and purpose, the evaluation should differ.
So in the context of games, then the most direct way of measuring is how much time are people actually spending in this world that you create.
And if your goal is, for example, in the context that we just talked about, deploying an embodied agent, then your end metric is: after training in these worlds that you generate, how robust is it when you actually deploy to the target environment?
But then, you know, it's hard to measure these end metrics.
So today people have what I call proxy metrics, which basically try to measure what we really care about, which is the end metrics.
But then frankly, it's different for every use case.
Um, yeah.
Which seems like quite a challenge, right?
Like in language models or video models, image models, your benchmarks are proxies, right?
People aren't actually asking instruction-following or tool-use questions.
They're proxies of how well it will do downstream.
But for this, should teams, should companies have their own individual benchmarks outside of games?
If you think of stuff like video production, movies, stuff like that, that also want to use world models.
Should they sort of internalize their own proxy?
Is this something you guys do?
Where does that kind of happen?
I think this whole space is extremely difficult as things are emerging now.
And I mean it's not only for world models, I think it's for everything, including text-based models, right?
Because, you know, in the early days it seemed very easy to have good benchmarks, because we could do things like question answering benchmarks: could you answer the question based on these documents, and various other kinds of things, like doing pieces of logical reasoning or math.
And there are sort of visual equivalents for these small component tasks, things like object recognition, right?
But these days, so much of what people want to do, also with language models, is nothing like that, right?
You're wanting to have an interaction with the language model and get some recommendations about which backpack would be best for your trip in Europe next month, and it's not the same kind of thing, right?
And it's not so easy to come up with a benchmark as to whether this large language model gives you an effective interaction for guiding you in a good way for shopping, right?
So it's the same problem with these world models.
So if we take the game design case, well, success is that a game designer can produce what they are imagining in a reasonable amount of time.
And that's really the kind of macro task.
But you know, that's a very hard thing to turn into a benchmark.
And I think a lot of this is actually going to turn into people voting with their feet, right?
I mean, I guess that's what's happening, you know, at the large language model level, right?
When people are choosing to use, you know, GPT-5 or Gemini or Claude, individuals are trying out these different models and deciding, oh, I like the kind of answers that GPT-5 gives me, or no, I feel like I get more accurate detail from Claude, right?
It's a lot of checking, I realize that, but it's actually about whether people feel it's giving them utility and what they want, right?
And the interesting thing there is that a lot of people prefer the visual, right?
"This looks pretty," which is not the objective of what this is for, right?
If a game designer is working on something, they care about the game engine, the state.
It can look like whatever; you can fix that up later. Or you can have a really good game state and quickly edit it into 20 different versions that keep state.
Right.
So that's a really important distinction, and it speaks to Moon Lake's strength, right?
So yeah, I mean, you know, great visuals are lovely to look at for a few seconds, but games are really all about the concept, the gameplay, and you know, a lot of the time that doesn't actually even require great visuals.
I mean, there are just lots of very successful games which have relatively primitive visuals, and there are other games where people have spent millions producing photorealistic visuals and the game sucks, right?
So keeping those two axes apart is really important, as is thinking about what's important in a world model for different uses.
This conversation is reminding me of some game review and fiction discussions I've had in my non-AI-related life.
Some people might know Brandon Sanderson, who's a very famous fiction author and a big game reviewer, and he's a big fan of video games where you change one thing about what you might normally assume about the world.
For example, Baba Is You.
I don't know if you've come across that, where the rules change as you play the game, and also games where you can do things like reverse time selectively or change gravity selectively.
This also reminds me of other kinds of world models that are created by authors, where Ted Chiang is my typical example: he will take the world that you know today but change one thing about it, and then create a consistent world based on that.
Which is a long-winded way for me to ask: is it easy to create alternative worlds that don't exist, where you change one thing, and then run a whole bunch of people through it to see if it works?
My first stance would be that this seems a lot easier and more conceivable to do using technology like Moon Lake's than with some of the other world models out there, but since Sun can actually make it happen, I'll let him give the second answer.
I guess you're constrained by the game engine tool, right? At the end of the day, that's the thought partner that you have.
If I ask for something where it's never allowed to reverse time, or where gravity only ever works one way, then, well, that's it.
But sometimes gravity might change.
But it's a lot easier to change with code, as opposed to a model that is learned primarily on data of the real world and virtual worlds. For example, Genie, right?
They're actually training on a lot of real-world data and a lot of virtual gaming data.
And it's hard to say.
Well, maybe it's easier to say, okay, I want to change the visuals and the time period of the world.
Like you can't change gravity, for example.
I feel like you can, within light bounds, right?
Everything comes down to like code is a better way to execute it, but the models aren't that diverse and creative, right?
You can say, okay, make gravity slower, it can do that, but it's limited to your representation of how you text it out, right?
Like they're only gonna do a few iterations, whereas programmatically, you know, if there's a game engine under the hood, you can kind of go wild, right?
So one of the limitations of most models is that they're very overtrained to one style, right?
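As a minimal sketch of why a rule change like this is cheap when there is code under the hood: assume a toy physics loop where gravity is an explicit argument. The `Body`, `step`, and `simulate` names are hypothetical and are not Moon Lake's engine or API; the point is only that slowing or reversing gravity is a one-argument change, not a retraining problem.

```python
from dataclasses import dataclass


@dataclass
class Body:
    y: float   # height in metres
    vy: float  # vertical velocity in metres per second


def step(body: Body, gravity: float, dt: float) -> Body:
    """Advance one semi-implicit Euler step under constant vertical acceleration."""
    vy = body.vy + gravity * dt
    y = body.y + vy * dt
    return Body(y=y, vy=vy)


def simulate(gravity: float, steps: int = 100, dt: float = 0.01) -> Body:
    """Drop a body from 10 m at rest and integrate for steps * dt seconds."""
    body = Body(y=10.0, vy=0.0)
    for _ in range(steps):
        body = step(body, gravity, dt)
    return body


normal = simulate(gravity=-9.81)      # everyday gravity
slow = simulate(gravity=-1.62)        # "make gravity slower"
reversed_g = simulate(gravity=9.81)   # "reverse gravity"
```

Every variant here runs through the same `step` function, so the altered world stays internally consistent by construction.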
And extracting diversity is pretty difficult, at least.
That's something we've seen.
I mean, are there other examples you have in mind where, with existing models, it would be easier to do something that's not using code?
Like certain types of creative intent, or transitions.
State transitions, clipping. Other world models are very good at clipping through things.
My legs clipping through a rock, because, you know, it's just bad.
You would have to struggle very hard with your stuff to actually make that happen.
Which I think is maybe a topic that you actually prepared on: Gaussian splatting versus the other stuff.
Yeah, yeah.
It's just for those not super familiar, right?
There's Gaussian splatting, there is diffusion; what works, what scales up?
I feel like in February when Sora 1 came out, the blog post was literally titled, let me bring it up, "Video generation models as world simulators."
It's super bitter-lesson-pilled.
Yeah, a lot of it is emergence, right?
So, not to go through their blog post, but basically their whole thing was that as you scale up, all this consistency, all this stuff just kind of solves itself.
It's a very simple premise, right? They just scaled up diffusion.
And this is Feb 2024. It's already been two years, which is basically five years in AI time.
How much more do we need to just scale up, or do we hit a data cap?
But I think we already talked about this a lot, right? This is back to the beginning discussion of what's appropriate for the time, and that seems like your approach, right?
Yeah, the point I'm trying to make is that there are many different types of world simulators.
Having a world simulator that can produce pixel coherency is very useful for games and marketing and all these things, but it's not as useful as people think when it comes to causal reasoning, when it comes to embodied AI.
And yeah, this title is true.
We're not saying that it's, you know, not a great world simulator, but actually in the blog that we wrote, the bet is more that there's gonna be a disproportionately large share of value in real-world tasks or in virtual tasks where high-resolution pixel fidelity is not needed.
And yes, video models have their values.
Yeah.
This is at the absolute limit of my physics understanding, but one example that comes to mind is basically having to solve the equivalent of a three-body problem in a deterministic world, whereas the video models would just approximate it well enough.
Yeah.
Right?
There's some point at which your approach kind of runs into, well, you now have to simulate the world, please.
Thank you very much.
And you're trying to do that, but only to the extent that the game engine lets you, and the game engines cannot do some things.
Yeah.
No, I mean, I think the interesting or more technical question here actually is where do you draw the boundary between what's handled with, let's say, a diffusion prior and what's handled with symbolic priors?
Yes, okay, okay, right?
Because this boundary can actually be fluid.
I think maybe what you're trying to get at is, okay, people are saying pixel prior for everything.
But what we're saying is, okay, there's a boundary that we draw where we think it provides the most economic value for the domains and things that we care about today.
And it's something that we actually do internally all the time, which is to ask: given new equations that we learn, or new elements of the world that we learn, or maybe some other knowledge that we acquire in the process of developing the models,
should we still maintain this line exactly as it is today, or should we move it a little bit left or a little bit right?
Right?
Like sometimes we realize that maybe customers or folks want certain things that are better handled with a pixel prior as opposed to a symbolic prior.
Your skin thing is an example of moving it right.
Yeah.
Or left, I don't know which direction is which.
The Reverie model, actually, we have a few iterations of them.
They actually have slightly different, you know, values.
You should do that.
That's a cool dimension to show.
Yeah.
Is quantum mechanics the diffusion prior of our world, right?
That's the boundary of classical mechanics versus quantum, right?
Like that's it, right?
At one point God plays dice and at the other point he doesn't.
I don't know if Chris wants to weigh in, but generally I feel like physics is better with symbolic priors.
Even quantum physics.
Even quantum physics, yeah.
This is starting to get into MLST territory, is what I call it, where he likes to get philosophical. We're quite friendly.
I mean, we need to get, no, we need to get singularity.
I heard some of that.
No, no, no.
I think that is actually really helpful.
And man, I just want you to productize this.
As a product guy, I'm just like, oh, as a gamer, you know, I like it.
It's cool, this sort of theoretical side; you have a very good way of thinking about these things, but I just want to see you, you know, express it.
I do think, fundamentally, when you open up new tools, like, okay, use human intent and incorporate it into how you render,
well, artists are gonna have to take two to three years to figure out what to do with this, and you just don't know.
But I think, you know, this gives a much more approachable and controllable world.
Which is the beauty of NLP.
That will enable it to be adopted and used, and we're very hopeful about that.
Yeah, yeah, yeah.
I mean, we are actually very focused on commercialization, in the sense that we really do believe in the data flywheel approach, where we put this in the hands of the creators and the users, and then they will teach us how our model's capabilities should improve.
And that's why we are actually, you know, in product beta.
Yeah, focusing on gaming.
What's the adjacent thing to gaming?
Embodied AI, basically.
So maybe I'll start with where we see the platform in three years, which is, okay, the users would tell us what they want to achieve.
The end goal could be, hey, I want to make something to teach my kids the value of humility.
Or it could be, hey, I want to fine-tune my drones to be really good at rescue situations.
It could be vacuuming robots: I want to train my manipulation or vacuum robot to be very robust in my office, right?
To work very robustly within my office.
But then it's like, whatever end goal you want, our world model will say, okay, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want.
Yeah, right.
Maybe for the purpose of games, it's just the end simulation, and that's the end product. For certain policies,
it's like, I can train it within these environments, and then help you see where your policy is failing or not.
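To make "generate a distribution of environments, then see where your policy is failing" concrete, here is a minimal sketch under stated assumptions. The `EnvConfig` fields, the sampling ranges, and the toy `cautious_policy` are all hypothetical stand-ins; a real system would roll out a trained policy in a simulator instead of calling a one-line function.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    room_width_m: float
    num_obstacles: int
    floor_friction: float


def sample_env(rng: random.Random) -> EnvConfig:
    """Draw one environment from a hand-specified distribution (hypothetical ranges)."""
    return EnvConfig(
        room_width_m=rng.uniform(3.0, 8.0),
        num_obstacles=rng.randint(0, 12),
        floor_friction=rng.uniform(0.3, 0.9),
    )


def cautious_policy(env: EnvConfig) -> bool:
    """Toy stand-in for a rollout: fails only in cluttered, slippery rooms."""
    return not (env.num_obstacles > 8 and env.floor_friction < 0.5)


def evaluate(policy, envs) -> float:
    """Return the fraction of environments in which the policy succeeds."""
    outcomes = [policy(env) for env in envs]
    return sum(outcomes) / len(outcomes)


rng = random.Random(0)  # fixed seed so the sampled distribution is reproducible
envs = [sample_env(rng) for _ in range(200)]
success_rate = evaluate(cautious_policy, envs)
failures = [env for env in envs if not cautious_policy(env)]
```

Inspecting `failures` is the "where is my policy failing" step: every failing configuration is an explicit, editable object rather than a prompt.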
And then you know, so I think so.
In that case, much more of a training tool than in other intervals.
Sure.
Same same thing.
Yeah, I think it's just this world model that allows people to train any policy that can act in any multimodal environments.
Would it be harder to reward hack?
Is there an angle here where it is harder to reward hack?
Like it's just I'll just put it generally.
Because I think that's obviously a key problem that a lot of people face when training agents in these environments.
And I don't know, can you solve it?
I think not necessarily.
I mean, to the extent that there's a misspecified reward, it seems like it could be hacked in a more symbolic world or in a more pixel-based world.
Um, I don't know if Sun's got any thoughts, but I don't think that's really being solved.
The other thing that comes to mind is that you could just build a better Sora as a video generation model, right?
Because then you would move the diffusion side a bit further to the right, I think, if I got the directionality correct.
Um that's it.
It's better on domains, right?
Like on consistency over an hour for sure, it exists versus something doesn't, right?
So yeah.
So is your question more like...
I'm just riffing on what you can build, you know, with the stuff that you have.
I do think that the mind of the academic does go immediately to training and evaluation, but art tends to take unusual directions, like you might end up...
Okay, yeah.
But the question is, can you use this piece of software to develop compelling gameplay?
And I don't think you can take Sora and produce compelling gameplay, right?
If you want to have a world that you can wander around in a bit, you're good.
But what are your abilities to have gameplay mechanics implemented the way you'd like them to be, and to have things stay, you know, with a long-term history of your gameplay that influences future actions.
I think there's just nothing there for that.
Yeah, I do tend to agree.
I I'm just trying to sort of test the boundaries.
I would also make the observation that as the AAA games industry has developed, the line between what is a movie and what is a game has blurred.
And you do end up basically producing a two-hour movie as part of your game.
Honestly, there are so many applications in the adjacent markets that our world model can go into.
Yeah.
But yeah, it's sort of fun to riff on, although on the execution side, we need to stay focused on, okay, what are the capabilities we want to unlock over time, and there's a roadmap for that.
But yeah, if we're just riffing on the possibilities, I feel like it's endless, yeah.
It's like classic.
And uh the embedding for possibility and endless in my mind is very close.
Yeah, I do want to focus on one weird choice.
I don't know if it's weird; maybe I've got something here.
Audio, right?
You could have just said no audio.
And audio in my mind has a lot of recursion, whereas in video, you can just do ray casting, and that's computationally much simpler.
Audio just seems way harder.
I don't know if you want to comment on just the spatial 3D audio problem.
Did you really have to do it?
I guess you do to be immersive, but a lot of people do treat it as, well, we just take a TTS model on top of...
Well, there's a lot more to game audio than just speech, right?
It's not just TTS.
Spatial, in my mind, is echoes, yeah, and reflections, and I don't even know what else.
I don't know.
I don't know what the problems in this space are.
Yeah, I think this point is sort of pointing more to the benefits of using a game engine as a tool that's available to the model, right?
Because part of the spatial audio comes from the code that is underlying the simulation.
And while we do give our model access to other types of audio models as tools, none of them would be spatial, I think.
Right.
But that's exactly the point: we're giving our model an abstraction or a suite of tools such that it's able to achieve that.
And you can argue that spatial audio is sort of an emergence out of the tools and abstractions that we provide to the agents.
And I think that's the beauty of this approach: there are a lot of things, kind of like how humanity has built technology, that are like Lego blocks that build on top of each other.
And it's the same thing here.
There are gonna be things that just sort of emerge from being able to put these things together in combinatorially interesting ways.
And whereas in general for the Gen AI video models, there's no actual integration across to audio at all, right?
Someone might stick some music or a soundscape or whatever else on top of their videos, so it's not a silent video, but they're in no way connected into a consistent world model, and there's nothing that says, okay, an action is happening in the video, therefore there should be a sound that's coming from this part of the visual field.
Yeah.
Is that different from Sora 2?
Does it not have audio?
Not to say it's not like spatial audio.
It doesn't?
No.
I've played around with it enough.
It just sounds like someone put an ElevenLabs voice on top of it and just tried to do the lip sync.
Yeah, I've seen, okay, generate a dog at the beach, with reactions to a big wave, and move around.
It's definitely like, have the dog move away from the camera and see if the sound goes down.
Or it doesn't, right?
Because they don't have spatial audio.
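The "dog moves away from the camera" test can be sketched in a few lines if the source position is explicit state, as it would be in a game engine. This is an illustrative toy, not any product's audio pipeline: it assumes an inverse-distance attenuation law and constant-power stereo panning, and the function names are hypothetical.

```python
import math


def distance_gain(distance_m: float, ref_distance_m: float = 1.0) -> float:
    """Inverse-distance attenuation, clamped so nearby sources are not boosted."""
    return ref_distance_m / max(distance_m, ref_distance_m)


def stereo_gains(source_xy, listener_xy=(0.0, 0.0)):
    """Return (left, right) gains for a listener at listener_xy facing +y."""
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    # Azimuth is 0 straight ahead and +pi/2 directly to the right.
    azimuth = math.atan2(dx, dy)
    # Sources behind the listener are clamped to a full left or right pan.
    pan = max(-1.0, min(1.0, azimuth / (math.pi / 2)))
    # Constant-power pan law: left^2 + right^2 stays equal to gain^2.
    angle = (pan + 1.0) * math.pi / 4
    gain = distance_gain(distance)
    return gain * math.cos(angle), gain * math.sin(angle)


near_left, near_right = stereo_gains((0.0, 2.0))  # dog 2 m straight ahead
far_left, far_right = stereo_gains((0.0, 8.0))    # dog has run to 8 m
side_left, side_right = stereo_gains((3.0, 0.0))  # dog 3 m to the right
```

Because the gains are computed from the same positions that drive the visuals, the sound gets quieter as the dog runs away without any separate audio model having to guess.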
We do want to basically... our world model, the one we're training, is basically working towards the goal of having a combined latent representation across all these different modalities, right?
Such that you can reason across these different modalities.
For example, if I close my eyes and you play a sound of a car skidding away from me,
I can almost visually extrapolate that trajectory in my mind.
And I think that's the type of capability we want our model to have, right?
And that's the reason that we're sort of taking this multimodal reasoning approach.
It's like we want this combined latent space that can...
Yeah.
Oh, you said latent space.
I mean, like that, here.
We have to play the bell every time someone says it.
No, you gotta train Daredevil One, where it's only audio, but you have to work out where everything is.
Cool.
I think that's about it for our Moon Lake coverage.
I do think we have a couple of Chris Manning questions on IR and any other attention topics or NLP topics.
Okay.
It's just fun.
You know, we talked a bit about how you guys met, but you were basically the godfather of NLP, right?
You spent your whole career on it, from early embeddings to early attention.
You did 2015 attention for machine translation, everything.
You had information retrieval.
So rag before rag.
You know, we just wanna shout that out and admire a lot of that, right?
So what prompted the switch over to world models?
How how'd all that come about?
The simple answer is the enthusiasm and creativity of students, but there's a bit of a history there, right?
So yeah, so clearly most of my career has been doing stuff with language and you know how I got into research was thinking, oh, this is just so amazing how humans can produce speech and understand each other in real time, and somehow they managed to learn languages when they're kids, how could this possibly happen?
And so yeah, starting off I was very focused on language, but, you know, as it got into the 2010s, I'd been working on question answering, and then I started to get interested in visual question answering, and that was an area where it was very noticeable that the visual understanding was bad, right?
You know, these were the days when like it sort of seemed like there's almost no visual understanding.
You're just getting answers that came from priors.
So, you know, if you asked how many people are sitting at the table, it always answered two, regardless of how many, how many people you could see in the picture.
And you know, so it seemed like oh, these models actually aren't able to get semantic information out of images.
And so I was interested in that problem and tried to work more on that.
And so then that required knowing more about what's happening in vision and how you can represent visual information.
And then, you know, there started to be this revolution of doing generative AI images.
And then I had students that started looking at that. Before the era of Moon Lake,
I was also working with Demi Guo, who founded Pika.
And so...
And Ian, obviously, with GANs.
Yeah, though Ian was never my student, but yeah, I was very aware for the whole decade there of Ian with GANs.
Yeah.
And I mean, Ian was a Stanford undergrad, but yeah.
Richard Socher of You.com, I believe he was your student.
Um yeah.
And you know, there were there were links across at that stage as well.
So I mean, you know, there were several papers in that era; Andrej Karpathy was a PhD student at the same time as Richard, and so there was some joint language-vision work in that era as well.
You know, it seems kind of ancient by modern standards, but yeah, we were trying to go from sort of textual dependency graphs to visual scenes.
At the time, the GloVe embeddings really took over from a lot of TF-IDF, one-hot encoding, all that.
The early vision-language models we saw were LLaVA-style adapters, right?
It's technically still just an embedding latent space: let's add image, let's mix modalities.
And that's one of the things you sort of put out there too, right?
Yeah, yeah.
Yeah, well, thank you for all of that.
Thank you for advancing the world on uh world modeling.
Uh I honestly uh do think that if people deeply understand everything we just covered, they will see what's coming.
And I think you guys have, you know, made some really significant contributions here.
What are you hiring for?
You know, who are the people you're looking for?
You know, we agreed that the CTA was a hiring call.
Yeah, I mean, now that we have AGI, you don't need engineers anymore, right?
Yeah, on the model side, we are actually striving towards basically a self-improving system, but what that means is that we need people to set up the self-improving system.
More specifically, people who have the intersection of knowledge within code generation and computer vision and graphics, right?
Yeah.
That's sort of the core research background that we look for within our team.
And the majority of the team today do have both backgrounds.
You say computer vision and graphics: are they the same thing, or is computer vision one thing and graphics another thing?
And how intertwined are they?
They're intertwined but different.
Yeah.
And I think, you know, this relates to some of the themes that we've been talking about, that the more explicit underlying world models that are being constructed inside Moon Lake really draw on the computer graphics tradition.
And so it's then combining that with the visual understanding of vision.
Got it.
Yeah.
All right.
So if you've written a game engine, come talk to us, right?
Oh yeah, definitely.
But I do think that the line is increasingly blurred these days, where it's like, if you have a general understanding of both vision and graphics.
I think for your standards, it is.
For me, it feels like vision is, you know, I leave that to the big labs.
Graphics, I can get that you would want to do that from more first principles.
But vision, there are so many vision models off the shelf that I can take, though probably not good enough for your purposes.
I see, I see.
If you're making that distinction, then maybe we care a little bit more about having graphics knowledge.
Yeah, exactly.
Exactly.
It could be like, you know, sometimes a hiring call can be as simple as "if you know the answer to blah, you should talk to me," you know, the sort of core known hard problem in your world.
Uh I see.
Yeah.
In that case, yeah, definitely if you've written a game engine before, if you've RL'd a variety of coding models on different objectives, many of those, yeah. If you've done multimodal latent space alignment.
I intentionally included it again; the poor editor has to edit that thing every time.
Latent space alignment, is it that hard?
Well, there are some scripts out there that I've saved for the day I someday have to do it, but it's done.
I think, yeah, there are versions of that that are done.
But I think we are aligning audio, text, language, and video, right? And basically we have these world models that are able to act as agents in these worlds and extract long-horizon videos, encoding that back into the model to sort of self-improve.
So it's an insanely exciting but also technically challenging problem.
So for people who want to do their life's best work, you know, this is the place.
How big are you guys? Where are you guys based?
We're currently based in San Mateo, although we're moving up to SF.
We're about eighteen folks right now.
My ending question was gonna be: what's behind the name?
Oh very cool graphics and design by the way.
Actually, at the time when we started the company, we were thinking a lot about how to make a company name that gives people the vibe of OpenAI, but with almost Industrial Light & Magic vibes.
Because we care about creativity and using that as a funnel to solve AGI.
So then we brainstormed a lot around, like, DreamWorks, right?
Like Industrial Light & Magic.
And so there's basically a space of things that we feel are very, very semantically close to the company's identity.
Yeah.
And then it ended up being Moon Lake partly because of the DreamWorks vibe, you know, the DreamWorks.
Moon Lake.
Exactly.
Yeah.
So that was a little bit of that inspiration.
And then the moon, it basically was about the reflection.
The reflection part also implies the self-improvement loop, which we really believe in, and that's the path towards multimodal general intelligence.
So that's that.
I'll leave it.
I love a good name.
I love a good name.
This is a very good name.
It's very good lore.
I'm glad I asked the question.
I will also say, you know, one of my favorite books or biographies ever is Creativity, Inc.,
Ed Catmull's story about Pixar and how he, you know, was rejected as a Disney animation artist.
So then he went into computing and brute-forced his way back into Disney.
Yeah.
And Walt Disney is also one of my favorite founders.
His story, at the time, was like, okay, I'm gonna create this immersive park.
People don't even have the technology to create it virtually, but, you know what?
Let's just build it physically such that people can...
So he's the first world modeler.
I'll tell people that theme parks are world models too.
Yeah, yeah, yeah.
I mean, you know, It's a Small World, or the Epcot Center with all the little replicas of the countries.
Yeah, those are very interesting.
Okay, well, thank you.
We've covered, you know, a huge amount.
Thank you for your time and thank you for inspiring us.
Thank you for having us. It's been fun chatting.
Yeah, it's been a good time.
