# Mistral AI Unveils Voxtral TTS, Mistral MoE, and Lean Reasoning

**Podcast:** Latent Space: The AI Engineer Podcast
**Published:** 2026-03-30

## Transcript

Okay, welcome to Latent Space.
We're here in the studio with trusty co-host Vibhu.
Welcome.
As well as Guillaume and Pavan from Mistral.
Welcome.
Excited to be here.
Thank you for having us.
Pavan, you are leading audio research at Mistral, and Guillaume, you're chief scientist.
What are we announcing today?
We're coordinating this release with you guys.
Yeah, so we are releasing Voxtral TTS.
So it's our first audio model that generates speech.
It's not our first audio model.
We had a couple of releases before.
We had one in the summer, Voxtral, our first audio model, but it was a transcription model, ASR.
A few months later, we released an update on top of this, supporting more languages and also a lot of table-stakes features for our customers: context biasing, precise timestamping, and transcription.
We also had a real-time model that can transcribe not just at the end of the audio.
You don't need to feed it your entire audio file; the transcription can also come in real time.
And this is a natural extension in audio: basically, speech generation.
So yeah, we support nine languages, and this is a pretty small model, a 3B model, so very fast, and also state of the art.
At only a fraction of the cost of our competitors.
And we are also releasing the weights of this model.
Yeah, magnet link?
Not this time.
Yeah, what's the decision factor?
It's a good question.
There will be more, W.
Ooh.
Yeah, Pavan.
Any other sort of research notes to add on what you're doing?
No, maybe we'll dive into it later in the podcast too, but it's a novel architecture that we developed in-house.
We iterated on several internal architectures and ended up with an auto-regressive flow matching architecture, and we also have a new in-house neural audio codec, which converts audio into 12.5 Hz latent tokens, semantic and acoustic tokens.
And yeah, that's the new part about this model.
And we're pretty excited that it came out with such good quality.
And as Guillaume was mentioning, yeah, it's a 3B model.
It's based off of the Ministral model that we released just a few months back, which serves as the trunk.
And it's mainly meant for the TTS stuff, but the native text capabilities are also there.
Yeah.
So there's a lot to cover.
I always love anything to do with novel encodings and all those things, because I think that obviously increases efficiency a lot, but also maybe causes bugs that sometimes happen.
You were previously at Gemini and you worked on post-training for language models, and maybe a lot of people will have less experience with audio models just in general compared to pure language.
What did you find that you had to revisit from scratch as you joined Mistral and started doing this?
I think there are two buckets, I guess: audio understanding and audio generation.
For audio understanding, there are the Voxtral models that Guillaume was mentioning that we released earlier: the Voxtral chat model that we released, I think, in July last year, and the follow-up family of transcription-only models that we released in January.
That would be one bucket, and generation is another bucket.
I think you can also treat them as a unified set of models.
But currently the approaches are a little different between these two, to your question on how audio is fed to the model.
In the understanding model, it's actually very similar to the Pixtral models that we also released.
Yes, yeah.
It was pretty good.
That was the first project I worked on after joining Mistral.
It was pretty nice.
And Voxtral was very similar in spirit, I guess.
So we feed the audio through an audio encoder, similar to images through a vision encoder, and it produces continuous embeddings, which are fed as tokens to the main transformer, the decoder transformer model.
Yeah, and the model output is just text.
So on the output side, there is nothing that needs to be done in these kinds of models.
I guess the interesting part about the generation step is the output now has to produce audio.
And the approach that we have is this neural audio codec, which converts audio into these latent tokens.
There is a lot of existing literature here and a lot of models which are based on this kind of approach.
And we made slightly different design decisions around this, but at the end of the day, the neural audio codec converts audio into a 12.5 Hz set of latents.
And each latent has a semantic token and a set of acoustic tokens.
And the idea is that you take these discrete tokens and then feed it on the input side.
There's several ways to fuse this at each frame, but we just sum the embeddings.
So it's like having K different vocabularies and combine all of them because they all correspond to one audio frame on the input side.
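As a rough illustration of the input-side fusion described above (summing the embeddings of the K codebook tokens that belong to one audio frame), here is a minimal pure-Python sketch. The table sizes, the function names `make_tables` and `embed_frame`, and the random initialization are all assumptions made for illustration; they are not details of the Voxtral TTS implementation.

```python
import random


def make_tables(num_codebooks, vocab_size, dim, seed=0):
    """Build one toy embedding table per codebook (hypothetical sizes)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
        for _ in range(num_codebooks)
    ]


def embed_frame(tokens, tables):
    """Sum the embeddings of the K tokens that describe one audio frame."""
    if len(tokens) != len(tables):
        raise ValueError("expected one token per codebook")
    dim = len(tables[0][0])
    frame = [0.0] * dim
    for table, token in zip(tables, tokens):
        for i, value in enumerate(table[token]):
            frame[i] += value
    return frame


# One frame described by K=3 tokens, one from each codebook's vocabulary.
tables = make_tables(num_codebooks=3, vocab_size=16, dim=4)
frame_tokens = [5, 0, 11]
frame_embedding = embed_frame(frame_tokens, tables)
print(len(frame_embedding))
```

The result is a single vector per frame, so the main transformer sees one input position per audio frame regardless of how many codebooks there are.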
The output side is the interesting part.
On the output side, I don't know if it's the most popular, but one popular technique is to have a depth transformer, because you have K tokens at each time step.
With text, you just have one token at each time step.
So you just predict the token from the vocabulary; you just get probabilities.
It's very straightforward for text.
Very straightforward.
Yeah.
But if you have K tokens, then the naive thing would be to predict all of them in parallel, but that doesn't work.
At least it doesn't work that well, because audio has more entropy.
And one of the techniques people use is this depth transformer, where you have a small transformer (it can be an LSTM or RNN as well, but people use transformers) and you predict the K tokens in auto-regressive fashion.
So you have two auto-regressive things going on.
So the thing we did differently is instead of having this auto-regressive K step prediction, we have a flow matching model instead of modeling this as a discrete token set.
We train the codec to be both discrete and continuous to have this flexibility.
So we did try the discrete stuff too, and it works well, but the continuous stuff just works better.
So yeah, we took this flow matching.
So it's a flow matching head, which takes the latent from the main transformer. In diffusion it's denoising, but in flow matching it's a velocity estimate.
So you go from noise all the way to the audio latent, which corresponds to 80 milliseconds of audio, and which is then sent through the vocoder to get back the 80-millisecond audio frame.
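To make the "velocity estimate" idea concrete, here is a toy sketch of flow matching sampling: start from Gaussian noise and take a few Euler steps along a velocity field until you reach the latent. The `oracle_velocity` function is an assumption that stands in for the learned head (it simply points at a known target), and the latent values and function names are hypothetical.

```python
import random


def oracle_velocity(x, t, target):
    """Toy velocity field that points straight at a known target latent.

    A real flow matching head would be a learned network conditioned on
    the transformer's hidden state; this oracle only stands in for it.
    """
    return [(goal - value) / (1.0 - t) for value, goal in zip(x, target)]


def sample_latent(target, num_steps, seed=0):
    """Integrate from Gaussian noise (t=0) toward a latent (t=1) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        velocity = oracle_velocity(x, t, target)
        x = [value + dt * v for value, v in zip(x, velocity)]
    return x


target_latent = [0.5, -1.25, 2.0]  # hypothetical latent for one audio frame
latent = sample_latent(target_latent, num_steps=4)
print(latent)
```

With a learned velocity field the number of steps trades quality against latency, which is why the step count matters for streaming use.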
Yeah.
Is this the first application of flow matching in audio?
Because usually I come across this in the image domain.
Yeah, in some sense there are flow matching models in audio, but I think this specific combination is new. I could be wrong.
There could be some work.
I haven't seen much work in this.
So I think it's novel, and the vision community is just a way bigger community.
So I think they pioneer a lot of this diffusion and flow matching work, and it's interesting to adopt some of the ideas there into audio.
And personally, that's the fun part, trying these things out. One more meta point: unlike text, and I think this is true even in vision, in audio there is no winner model yet.
There is no, okay, this is the way you do things.
It's uh it's still evolving.
I think people are still iterating and figuring out like what's the best overall recipe, I guess.
I'm pretty sure there are models which are also completely end-to-end, native audio in and native audio out, but it has still not come to a convergence point where this is the right way to do things. That also makes the space pretty exciting to explore.
What are some of the ways to look at it?
There are ways where you can do diffusion for audio generation, but if you want real-time generation, that's a big thing with the approach I'm assuming you took.
Yeah, and also like how do you go about evaluating different axes of what you care about?
Yeah.
Good point.
So you can do just flow matching or diffusion for the whole audio.
We didn't even go down that path because one of the main applications is voice agents, and we want real-time streaming, and that's the use case.
That's not the only use case, but that's one of the primary use cases we want to get to.
So we picked the auto-regressive approach for that.
And within the auto-regressive space, again, you can do it chunk by chunk, or you can do it other ways.
I, at least personally, prefer the approaches which are the simplest.
And so we tried to see: can we just add audio as just another head to our regular transformer decoder model?
Because that kind of makes it easier for eventual end-to-end modeling of audio text native modeling.
Yeah, and it works pretty well.
So I guess we went with that, and we iterated a little bit on the flow matching head itself.
Like we had a discrete diffusion kind of approach, which also works well, but the flow matching works better.
I was just curious about how you also think about this overall direction of research.
When you work with the audio team, do you basically set some high-level parameters and then let them explore whatever?
Or how does it work between you guys?
No, I think the way it works is that we prioritize together what the most important features are, because there are many things we can do in audio.
So I think we try to decide how we should do things.
For instance, ultimately, what we want to build is this full duplex model, but we are not going to start there directly.
I think it's one of the projects people are doing, but just to confirm: full duplex means it can speak while I'm speaking, or? Okay.
Yeah, audio in the what you said.
Yeah, yeah.
So ultimately, we're going to get there, but we decided to take it step by step.
So we started with whatever is the most important, I think also for our customers: transcription, which is the most popular use case.
Then speech generation, with the real-time model just a bit before that, and then the next step will be to try to combine everything together.
But yeah, we felt it was also important to separate things and optimize each capability one by one before we merge all of that together.
Then the super omni model. But it's interesting because, as Pavan said, when you work on some other domains, just LLMs and everything, there are many areas where I think it's not as interesting.
For instance, in many places it's essentially just around data or creating new environments, a lot of kind of easy things, but things where I think the research is maybe not as interesting.
Whereas in audio, there are so many ways to actually build this model, so many ways to go about it.
In that sense, I think it's really interesting.
And what we also tried for speech generation is that we tried multiple approaches.
What was interesting is that even though they were extremely different, they ended up being quite similar at the end of the day, but the flow matching turned out to sound quite a bit more natural.
So we are happy with this.
Is there an intuition for why flow matching just models speech better, maybe in some natural, fundamental latent dimension?
No, I think the main thing is even at a particular time step, there is a distribution of things to be predicted.
Like the way you inflect.
So you already know the word that you're speaking, and in text space, let's say the word maps to just a single token for simplicity.
In most cases, it does.
So you just pick the word, but within audio, even the same word, even with your own voice, could be inflected in so many different ways.
And I think any approach which models this distribution works, and flow matching is one of those techniques.
It's not the only one at all, but it's one which works reasonably well.
I think that's better.
The intuition I have is that there are several different clusters, each corresponding to some specific way you would inflect or pronounce that thing, and you can't predict the mean of it, because that corresponds to some blurred-out speech or something like that.
But you have to pick one and then make it sharp.
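The "you can't predict the mean" intuition can be shown with a tiny numeric sketch. The pitch values below are made up for illustration (two hypothetical ways of inflecting the same word); the point is only that the average of a two-cluster distribution lands in a region where no real sample lives, while sampling picks one concrete mode.

```python
import random
from statistics import mean

# Hypothetical end-of-word pitch values (Hz) for two ways of saying one word.
falling = [118.0, 120.0, 122.0]
rising = [218.0, 220.0, 222.0]
samples = falling + rising

# Regressing to the mean gives a value that belongs to neither cluster.
average = mean(samples)
nearest = min(samples, key=lambda s: abs(s - average))
gap = abs(nearest - average)

# Sampling from the distribution instead commits to one real inflection.
picked = random.Random(0).choice(samples)

print(average, gap, picked)
```

A model trained to output the average would sit 48 Hz away from anything a speaker actually produced in this toy data, which is one way to read "blurred-out speech".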
Conditional inference.
Yeah, exactly.
Is that all covered under disfluencies, which is I think the normal term of art?
Disfluencies, pauses, intonations.
By the way, I have to thank Sophia for setting all this up, including some of these really good notes, because I'm less familiar with the audio domain.
I think disfluencies are definitely one such phenomenon.
Which is ums and ahs, and also repeats: you use these filler words while you're thinking, so you repeat the word.
Okay.
Whereas intonation is different: it's upspeak and all this.
Okay.
And yeah, so I think there is a lot of entropy, and any technique which helps with modeling it as a distribution works.
And the depth transformer is a conditional way of modeling this, and transformers are really good at it, even though it's a mini transformer.
So I think that worked pretty well for us too.
It's just that the main consideration is when you have a depth transformer, if you have K tokens, you need to do K autoregressive steps.
Even though it's a small thing, it's K steps, which is very latency-heavy.
With flow matching, we were able to cut it down significantly.
So we are able to do the inference in four steps or 16 steps and it works pretty well.
And there are more techniques to bring it down even further, in the extreme case to one step.
We're not doing it yet, but at least the framework lends itself to more efficiency, and the image guys have done incredible work on this.
You just send the prompt and you get an image.
Yeah, surprisingly, not enough image model labs use those techniques in production, I think.
I think this I feel like it's a lot of research demos, but nothing I can use on my phone today.
The other thing that's interesting here is that, since so much has been done in the vision community compared to audio on this, I think there are so many low-hanging fruits here, and there are so many things we can do to actually improve this model even further.
So this is our first version, but we have so many ways to make this much better and much more efficient, cost-efficient.
So we're really happy with it, of course, but there are still so many things that can be done with it.
I should also mention that for those who are newer to flow matching.
I think the creator, this guy's name is Alex.
He's done one, I think, at NeurIPS, maybe two NeurIPS ago.
There's a very good workshop.
It's one hour on what flow matching is.
I would recommend people look that up.
That's the other thing, right?
Efficiency-wise, I imagine the reason it's open weights, the reason you picked a 3.6B backbone... it's, yeah, 3.4B.
You are trying to fit to some kind of hardware constraints.
You kind of fit some kind of C constraints.
What are they?
Not necessarily.
I think something we care about in our models is that they are efficient.
So we have a lot of separate models, for instance.
So we have this audio model that is very small, very efficient.
We also have a small OCR model that is really very good, highly efficient as well.
And I think an approach that maybe other companies are going to take is to have a very general model that will do a bit of everything, but that is also going to be expensive.
And here what we want to say is, if you care about this specific use case, you can actually use this model; it just does that.
It's extremely good at it, but also very efficient.
That's why we have these models, audio but also OCR, that are really good at that, and that will be much more cost-effective than the general model, which will contain a lot of capabilities you don't really need.
So, yeah, so we are doing like general model but also like more customized model like this.
How does it compare to other TTS models?
We're going full open weights, we're just dropping it.
I think it's really good.
Yeah, I think it's pretty good.
It's definitely one of the best, for sure.
I would say it's probably the best open source model.
Why it's definitely so yeah.
Why now?
How does it fit into the broader Mistral vision?
How do you see voice agents?
How do you see voice?
I think every year I've heard, okay, year of voice, year of voice.
There's a lot of architectural stuff.
There's a lot of end-to-end latency that you're solving, but where do you see voice fitting in?
We had so many customers asking for voice.
That's also why we wanted to build it.
What's interesting in this domain is that in a sense, if you take something simple like transcription, it doesn't seem like something that should be very hard to do for a model.
It's essentially pattern recognition, it's classification, and these models are very good at classifying, right?
Nonetheless, when you talk to them, it's not there yet, right?
You don't talk to them the same way you talk to a person or something.
Maybe people don't realize it.
In English, it's still much better than in any other language.
Even compared to French, for instance: if you talk to these models in French, when you see people talking to these models, they will talk very slowly, they will articulate as much as they can.
So it's not natural, right?
We are not yet to this.
Well, I think, yeah, maybe the next generation will not know this, but I think people that are maybe our age will actually always keep this bias of speaking very slowly when they talk to these models, even if probably in a couple of years, maybe next year, it will not be necessary anymore.
But what's interesting is to see that even for languages like French, Spanish, German, that are not low-resource languages, you have a lot of audio data there, and still it's not as good.
And I think the reason for this, I suppose, is just that there is not as much energy, as much effort, that has been put in as in some other modalities, like for instance vision or coding.
But but yeah, there is still a lot of progress to be done.
I think it's just a question of doing the work and it's like a clear path, I think, to get there.
It's a little fascinating because I worked on Google Assistant, I think a while back at this point, but when you take a step back, it's fascinating.
It's not that long ago.
It was like four years ago or five years ago, and now it's completely audio in, audio out, and the function calling and the whole thing happens completely end-to-end and in a very natural way, and there's still ways to go.
Even despite all the progress, it's not like you're speaking to a person.
When you talk to any of these agents, bots, or voice mode kind of situations, there's still a gap, I think.
That's the great part.
And I feel like with even the existing stack, we should be able to get to this very natural speech, conversational abilities, soon enough, I guess.
And we also hope to get there.
And it's kind of the next step, right?
Because when you talk to these agents, usually people are just writing to them, and sometimes, for instance, you want to write code, but you have a very clear idea of how you want the model to implement what you have in mind.
So here you're having to spend a lot of time writing, so it's not really efficient.
And audio is really like a natural interface that is just not there yet, but I think it's just going to be there very soon.
What's it like building, serving, inferencing?
We see a lot about how it's very easy to take LLMs off the shelf, serve them, fine-tune, deploy.
I know you guys have Forge, you have a whole stack for customizing, deploying.
Is there a lag in getting that distribution channel?
Are you helping there? Like with prompting LLMs, you can have them be concise, verbose, all that.
They're built on LLM backbones, these models.
How do you see all that?
Yeah, I think this is a lot of what we're doing with our own customers.
Very often they come to us, so it's for different reasons.
I think one reason is sometimes they have a lot of privacy concerns.
They have this data that is very sensitive.
They don't want the data to leave the company.
They want it to stay inside the company.
So we help them deploy the model in-house, either on premise or on private cloud, so they are not worried that it's given to a third party and that there is some leakage.
Many companies have these different sensitivities of data.
There's sometimes tier one, tier two, tier three data: tier three you can send to the cloud, tier one has to stay there.
So then it creates some kind of heterogeneous workflows where it's annoying and you cannot send some data to the cloud.
This one you can.
So here when we actually deploy the model for them, they don't have this consideration.
They are like not worried that this is going to leak.
Everything everything is much easier.
So we help them basically do this.
So it's one of the value propositions, but the other is very often when customers use this off-the-shelf closed model.
What's very sad is that they are not leveraging this data that they have been collecting for years or sometimes for decades.
So much data, sometimes it's trillions of tokens of data in a very specific domain, their domain, which is data that you will not find on the public internet.
So data which the closed models will not have access to, and on which they are not going to be really good.
So if they're using closed source models, they're basically not benefiting from all these insights, all this data they have collected through the years.
They can always put it into the context at inference, but it's never as good as if you actually train the model on this.
So, yeah, that's basically what we help them to do.
We actually provide them with Mistral Forge, basically what we announced at GTC this week.
So we provide them with this.
It's basically like a platform with a lot of tools to actually help them process data, train on that.
Yeah, it's actually the same thing that we are using in the science team.
So it's actually very battle-tested infrastructure, like a very efficient training codebase for continued pre-training, fine-tuning, even doing SFT, RL.
So we help them do this using the same tools that our science team is using.
So since these are tools that we have been using for two years now, it's really battle-tested, it's really sophisticated.
So we're giving the company the same thing that our science team is using internally, to actually build their own AI.
And it makes a really big difference.
I think sometimes customers and many in general don't realize how much better the model becomes when you fine-tune it on your own data.
And you can have your model here, you start from there.
You have a closed source model, which is sort of here.
But if you actually fine-tune, it can actually really go much further than this, and then you have a very big advantage.
The model is trained on your entire company knowledge, so it knows everything.
You don't have to feed like 10k tokens of context at every query.
So it's it's much easier.
I think using a closed source model is really sad because you're not leveraging all this data, and you are going to be using the same model as all your competitors when you could actually use everything you have been collecting for years, which is really valuable.
So, yeah, so we help basically customers do this.
So we have a lot of solutions engineers, I mean forward-deployed engineers, that go into the company, look at the problems customers are facing, look at what they're struggling to do and what we should do to solve it.
So we solve them together.
So it's I think our approach is a bit different here than some other companies and competitors.
It's we don't just release an endpoint and say do some stuff on top of that, or we don't just give a checkpoint.
We really work very closely with customers.
We look at the issues they have, we help them solve them.
We really make some tailored solution for the problem they're facing.
One example: some time ago, some customers really wanted to have a really good model, really performant on some Asian languages.
If you take some off-the-shelf models, they can speak it, they can write in this language, but it's not amazing.
This language will be maybe 0.1 percent of the mixture.
So it has been included during training, but very little.
So what we did here is train a new model for them where this language was 50% of the mix, so it's much, much stronger.
It knows all the dialects. So that's one example of things we can do, and it's really arbitrarily custom. We had some other customers, for instance, who wanted a 3B model that can do audio and is very good at function calling, something they wanted to put in a car. In particular, they wanted this to be offline, because in a car you don't necessarily have access to the internet.
So here we can actually build these solutions. There is no model out of the box like this on the internet. You have these very general, generalist, strong reasoning models, but for things like this, customers always want specific solutions. Another reason they sometimes come to us is that they experiment with some closed source model, they get a prototype, they are happy with what they built, it works well, they're happy with the performance, and then they want to go to production, and then they realize, oh, but it's extremely expensive.
So then they come back to us and they say, can you help us build the same thing as this, but using something much cheaper?
And here we can sometimes build something 10x cheaper by just fine-tuning a model, and it will be better, on-prem on their own server, and also much cheaper as well.
So, yeah.
That's the Mistral pitch right there.
Take all the money.
I mean, outside of that, you do put out open-weight models, so people can do this themselves.
I feel like not enough people go out of their way.
They're not going to they're gonna ask them to do it.
They ask they are experience initially, we didn't know we were not competition at the beginning of the company because I think our strategy was not exactly the same as what it is today.
But what we underestimated initially is the complexity of deploying these models and connecting them to everything, to be sure they have access to the company knowledge. We were seeing customers struggling with this, and that was two years ago. Now things are much more complicated, because you don't just have text and SFT and simple instruction following; now you have reasoning, agents, tools, multimodal and audio. So it's much more complicated than before, and even back then it was hard for customers. So they really need some support, and this is why we are always providing some FDEs to help with the process.
I'm curious, is there also voice fine-tuning that people do?
So in Forge, we'll also have a unified framework, and the hope is to cover the Voxtral speech-to-text that we released earlier this year, and even the Voxtral chat that we released last year. I think there's a big, rich ecosystem of people fine-tuning Whisper, and people want the same thing with Voxtral, which is much stronger than Whisper. And yeah, the platform offers that kind of fine-tuning.
Which could be any kind of fine-tuning.
Like, like for instance, even sometimes people want to support new languages to this, which are three languages, which we hope to cover uh ourselves natively.
But if there is a language where you have data and you want to fine-tune, I think this is a good use case.
Uh, the other use cases, terminology, jargon, medical stuff.
Exactly.
And also the specific acoustic conditions, like even English, but it's in a lot of noise or other.
And the model will do decently in most conditions, but you can always make it better.
And that those are some of the use cases where you can improve it even further.
And that's one good use case for this.
And for text-to-speech, we're just releasing it.
So we'll have support for that soon too.
I think it's similar use case.
It's a little different, the kind of things that you want to extend a text-to-speech model to, which could be like voice personalization, voice adaptation for enterprises.
And many enterprises need very specific kind of tone, very specific kind of like personality for this kind of voice.
And all of those are like good use cases for fine-tuning.
How important is it, right?
Like I can just clone a famous person's voice, okay.
But the main use case would be like for enterprise personalization.
Like enterprises need like a lot of customization.
You don't want the same voice for all the enterprises.
Each enterprise wants a customized, specialized something which is representative of both their brand and also their, I guess, safety considerations.
And the use case.
I think the kind of thing that you would deploy as an empathetic assistant in the context of the healthcare domain would be very different from the kind of thing that would be in a customer support bot, and would be different from more conversational aspects.
I think those are the customizations you would expect from enterprise.
And that's the main use case, at least from our side.
My basic example is you don't want to call two customer services and have the same exact voice.
It's gonna be weird.
But also on the technical side of this, there are a few things in Voxtral that I thought were pretty interesting.
He's a big fan of this paper.
He said it's a very good paper.
He says this is the best ASR paper he's ever read.
Yeah, I've hyped up this voice paper enough.
We covered it somewhere.
But a big thing, so Whisper is known for 30-second generation, 30-second processing.
You extended this to 40 minutes.
There was a lot of good detail in the paper about how this was done, even little details about the padding: it's very much needed.
You need to have that padding in there.
The synthetic data generation around this.
I'm wondering if you can share the same about the new speech to text, right?
Text-to-speech.
So how do you generate long-form, coherent audio?
How do you do that?
And then any gems, is there gonna be a paper?
Yeah, yeah.
There would be a technical report.
But yeah, I think it will have a lot of details.
But I think the summary of it is that some of the considerations in this paper were because we started with the Whisper encoder as the starting point, and now we have in-house encoders, like in the real-time model, for instance, which we released in January.
We also released a technical report for that real-time model as well, which is this dual stream architecture.
It's an interesting architecture.
You should check it out.
And there we have a causal encoder.
And I don't think there's any strong multilingual causal encoder out in the community.
So we thought it's a good contribution.
So that's one nice encoder if other people want to adapt it.
That's a good encoder.
And we trained it from scratch.
I think our full stack is now mature enough that we're able to train super strong encoders.
And some of these considerations, like padding and stuff, are a function of the Whisper encoder.
And now that we train encoders in-house, the design considerations are different.
And for the question on text to speech, I think that also leans onto the original auto-regressive decoder backbone.
I think it's almost identical considerations.
I think the long context, it's not even long context.
So the model processes audio at 12.5 Hz.
So one second maps to like 12.5 tokens.
So I think one minute is like 750 tokens.
You can get up to 10 minutes in an 8k context window and half an hour in a 32k context window.
So a 32k context is something that we are very comfortable training on.
We can extend it to even much longer.
128k.
Okay, we can naturally see how it can extend to even hour-long generations.
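To make the arithmetic above concrete, here is a minimal sketch in Python. It assumes one context position per 12.5 Hz audio frame and ignores any text or prompt tokens that share the window; both are simplifying assumptions, and the function names are illustrative rather than part of any Mistral API.

```python
import math

FRAME_RATE_HZ = 12.5  # audio frames per second, as stated in the conversation


def audio_tokens(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of audio frames needed to cover `seconds` of speech."""
    return math.ceil(seconds * frame_rate_hz)


def max_minutes(context_tokens: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Minutes of audio that fit if the whole window holds audio frames only."""
    return context_tokens / frame_rate_hz / 60


print(audio_tokens(60))                  # 750 frames for one minute
print(round(max_minutes(8_192), 1))      # 10.9 minutes in an 8k window
print(round(max_minutes(32_768), 1))     # 43.7 minutes in a 32k window
print(round(max_minutes(131_072), 1))    # 174.8 minutes in a 128k window
```

Under these assumptions the quoted figures (about 10 minutes in 8k, half an hour within 32k, an hour or more within 128k) all fit with room to spare for the text prompt.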
Yeah, we need the data recipe and the whole algorithm to work coherently enough through such long context, but the techniques are in some ways very similar to long-context modeling for text.
And the key difference is it's just doing flow matching auto-regressively instead of text token prediction.
Okay.
I think that was most of the voice questions that we had, but I have a big question on Mistral Small.
Mistral Small.
Let's go.
So what is small?
How do we define small?
What is this?
What is this?
I remember the days of Mistral 7B on my laptop.
It's not fitting on my laptop.
I could run it on the big laptop.
But it's just a question of terminology here; it's small in terms of active parameters.
But it's true, we could have given it another name.
We could have called it medium, but yeah, it's a model that's really a mixture of experts.
It's a model that combines different models.
Before, the way we were doing things is that we had one general model doing instruction following, and a separate model, Devstral, that was specific to code.
We had another model for reasoning, Magistral.
So these were separate artifacts built by different teams at Mistral.
And now what we are doing is basically merging all of this.
Even Pixtral, the first vision model we had, was a separate model.
The way we do things internally is that we have one team focus on one capability and build one model, and then when it's mature enough, we decide to merge it into the main model.
So here this was the first time we basically merged all of this into one.
But there are some other things we didn't have time to merge.
For instance, more capabilities, or function calling, I think it's going to be much, much better in this Mistral Small.
But it's our latest model, and we're working on the larger version of this.
And yeah, the key things are that it's very sparse, 6B active, pretty efficient to serve, with 256k context.
Yeah.
I think what's interesting is just this general theory of developing the individual capabilities in different teams and then merging them.
Where is this going to end up?
We've seen the five things put together in this, so what are the next five?
I think actually OpenAI has gone away from the original 4o vision of the omni model.
That's what everybody was selling: all modalities in, all modalities out.
But I feel like you might do it.
I think there are some modalities where it's not completely obvious.
For instance, for audio, if you want to do transcription, I think it makes no sense to use a model as large as this; if you just want to transcribe text, it would be very inefficient.
If you want to do audio, you probably just want to use the 1B or 3B model.
Performance would be essentially the same, and it's going to be incredibly cheaper.
So that's why we want to have a separate model that just does this.
Yeah, I think the question is, if you are talking to your model by speech and you're asking a very, very complex question, how do you do this?
Do you just cascade things, or do you want to put audio into a very large model?
That's not a completely settled question, I think; I'm not sure we're going in that direction, but it's possible.
But I think for us, the next capabilities we want to integrate into these models are going to be, yes, more coding, more reasoning, but also capabilities that people don't talk too much about but that are important.
I think for our customers in different industries, for instance, things around legal or computer-aided design, all of these are things that models out of the box are not too good at, because people don't prioritize this; there is no well-known benchmark on that.
But it's not hard to make the model good at these; you just have to do the work of sourcing some data and processing it.
Okay, including the expression.
So yeah.
But we always have things we merge into this.
I think for voice, yeah, the key thing over maybe the last year or so, with Veo and Grok Imagine and all these things, is joining voice with video, right?
People don't understand spatial audio, because most TTS is just, oh, I'm speaking into a microphone in perfect studio quality.
But when you have video, like the voice moves around.
That's true.
The consideration is also a little different, in the sense that there it's a standalone artifact where you get the whole thing and you consume it.
But in the conversational setting, you need extreme low latency; streaming would be one of the primary considerations.
You can build a giant company just doing that.
So you need to do the voice.
But I was just you know on the theme of merging modalities.
That is something I'm like, wow.
Like, everyone up until, let's say, mid last year was just doing these pipelines of, okay, we'll stitch a TTS model with a voice thing and a lip sync thing and what have you.
Nope.
Just a giant model.
Yeah.
I have a two-part question.
So one is, it seems like open source is still very core to what you guys do.
And I just have to plug your paper.
The Jan 2024 Mixtral of Experts paper, very fundamental research on how to do good MoEs.
Paper comes out.
Very good paper for anyone.
That's just a side tangent, because Mixtral 8x22B was like the nuclear bomb for open source.
I think it was 8x7B, no?
Okay, yeah, yeah.
But this is Mixtral 8x7B.
Yeah, yeah.
I don't remember this.
I remember I don't think it was January, right?
It was like NeurIPS.
It dropped during NeurIPS.
And anyway, it was December of '23, but I think, yeah, the model was updated as well.
It's just a little update probably.
Yeah no but you have a point to make no you gotta check that.
But then I just want to hear more broadly on open source for you guys.
And when you asked earlier about what's next and what the other side teams are working on, you put out Leanstral, and it came as a surprise.
I was like, this doesn't fit my mental model of Mistral.
Yeah, first, for open source in general, I think it's really something that is in the DNA of the company.
We started around this; we have been open sourcing since the beginning, and even before this.
Before this, me and Tim were at Meta, we released Llama, and I think what was really nice to see is that before this, for most researchers, like universities, it was impossible to work on LLMs.
There was no LLM outside.
And if you look at many of the techniques that were developed after, for instance, Llama was open sourced, all these post-training approaches, like DPO, preference optimization, all of this was done by people that had access to this model, and it would have been impossible to do without it.
So it's really making science move faster.
So we really want to contribute to this open source ecosystem.
I think DeepSeek also had a lot of impact.
All these papers that are, I think, in the open source community are really helping the science community as a whole to move faster.
So we want to contribute to this ecosystem.
That's why we are releasing very detailed technical reports.
So for Magistral, our first reasoning model, we shared additional details, things that worked, things that did not work as well, et cetera, which I think is helpful.
And yeah, for the audio model we're also going to share a lot of details; we shared a lot of them for the real-time model.
And yeah, we really want to continue this.
We basically want to belong to this community of people who share science.
I think we really don't want to be living in a world where the smartest models, the best models, are only behind closed doors, only accessible to the few companies that will have the power to decide who can use them or not.
I think it's a scary future that we don't want to live in.
We really want these models to be accessible to anyone; we want intelligence to be used and accessible by anyone who can use it.
So, yeah, so that's why we are pushing this mission.
Open source model on the yep.
So yeah, Voxtral TTS, it's open source; not the first model, and not the last.
And yeah, Leanstral, I think, is also one step in this direction.
So it's yeah, a bit different than what we are usually releasing.
But we have a small team internally working on formal proving, formal math.
It's, I think, a subject we care about in general, and we were working on reasoning.
I think we started too early, before LLMs.
Doing reasoning without LLMs is very hard, especially when you work with formal systems, because the amount of data you have is negligible.
It's a very small community of people writing like formal proofs.
But the reason why we like it is that, if you look at what people are doing with reasoning, the problems that you can use are usually going to be problems where you can verify the output.
So, for instance, all these AIME problems where the solution is a number between one and like a thousand, so you can compare this with the reference, or it's an expression, and you can actually compare the output expression generated by your model with the reference.
But for most of the math problems, most of the reasoning problems, there is no way to easily verify the solution.
If the question is "show that f is continuous", you cannot compare with a reference, right?
If it's "prove that this is true" or "prove this property", you cannot simply verify the correctness of your proof, so there is no verifiable reward here.
So what you could provide is of course a judge, an LLM judge that will look at your proof, but it's very hard, and you could also have some reward hacking happening there, so it's difficult.
Or you could provide a reference proof, but then there are also many ways to prove the same thing.
So the model would get a negative reward because it's a different proof, but maybe it was still a valid proof, just different, so it's not going to work well.
What's nice with Lean and with formal proving is that you don't have to worry about this whatsoever; as long as it compiles, it's functionally the same.
Exactly, it's like a program: if it compiles, it's correct, so it's very easy.
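As a small illustration of that point, here are two different Lean 4 proofs of the same statement. The theorem names are made up for this example; both are accepted by the checker, so both would earn the same reward even though the proofs differ.

```lean
-- Proof 1: apply the library lemma directly.
theorem add_comm_direct (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Proof 2: a different route, rewriting the goal with the same lemma.
theorem add_comm_rewrite (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]
```

A reference-matching reward would have to pick one of these as "the" answer; the compiler does not care which one the model writes.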
And you can apply this and any kind of thing.
It's just way too small.
So no human will actually go and do it.
Yeah, that's exactly it, only very few people can do it.
It's like a very small community of people doing a PhD on that.
So it's super small.
And it's sad because it's actually very useful, not just in math, but also in software verification.
So for instance, software verification today is a tiny market; very few industries work on this and really need that.
It's usually going to be companies building airplanes or robotics, things where they absolutely want to be sure because life depends on it, but it's very rare that people formally verify the correctness of their software.
But I think one reason for this is simply that it's just super hard to do.
Are you thinking of TLA+?
It's the language that some people do for software verification.
No, I'm mostly familiar with Coq, which people use in France, but yeah, the reason I think why people don't use it more, and why this industry is not as big as it could be, is because it's very hard.
But now with coding agents that are there, it's going to be very different.
We're going to see much more of this.
So I think, yes, the industry there is going to be much larger in the future, now that we have these models.
So, yeah, here we're also anticipating this a little bit.
We wanted to work on that because proving a math theorem and proving a function correct use essentially the same tools.
Yeah, yeah.
One of my theories is that because the proofs take so long, it's actually just a proxy for long horizon reasoning and coherence and planning, maybe.
A lot of people will say, okay, it's for people who like math, it's for lean, okay, it's like a niche math language, who cares?
But actually, if you use this as part of your data mixture for post-training and reasoning, it might spike everywhere else.
Yeah, and I think that's underexplored; no one's really put out a definitive paper on how this generalizes.
Yeah, absolutely.
And I think even that's what we are seeing already.
For instance, if you do some reasoning on math, then the model gets better at reasoning on code as well, even in the early stage.
So there is some transfer, some sort of emergence that happens.
And it's also interesting not just as a topic in general; there is a lot of connection with coding agents, because sometimes the model can see a theorem that it has to prove and it's very complex, but then it can take the initiative to say, I'm going to suggest three lemmas, and then I'm going to prove each lemma in parallel.
So three of them in parallel with sub-agents, but I'm also going to prove the main theorem assuming the lemmas are true.
So you can do this sub-agent process, which is pretty interesting.
Even if you fail to prove one of the lemmas, you can maybe succeed in proving lemmas one and two.
So you get some partial reward here.
So it's a bit less sparse than if you just get a zero one for the entire thing.
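One way to picture this denser signal is the sketch below. It is only an illustration under stated assumptions: the `ProofResult` record, the equal weighting between the main theorem and the lemmas, and the idea that each statement is checked independently are all hypothetical choices, not the scheme described in the conversation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProofResult:
    """Outcome of checking one statement (hypothetical record, not a real API)."""
    name: str
    verified: bool  # True if the proof checker accepted the proof


def partial_reward(lemmas: Sequence[ProofResult],
                   main_theorem: ProofResult,
                   main_weight: float = 0.5) -> float:
    """Blend credit for the main theorem with credit for each proven lemma."""
    main_credit = 1.0 if main_theorem.verified else 0.0
    if not lemmas:
        # No decomposition was proposed: fall back to the sparse 0/1 reward.
        return main_credit
    lemma_fraction = sum(r.verified for r in lemmas) / len(lemmas)
    return main_weight * main_credit + (1.0 - main_weight) * lemma_fraction


lemmas = [
    ProofResult("lemma_1", True),
    ProofResult("lemma_2", True),
    ProofResult("lemma_3", False),
]
main = ProofResult("main_theorem", False)

# Two of three lemmas verified, main theorem not yet: reward is above zero.
print(partial_reward(lemmas, main))
```

With a plain pass/fail reward this attempt would score zero; here the two verified lemmas still contribute, which is the "less sparse" behavior being described.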
So it's pretty interesting.
I think we can actually start seeing that.
Yeah.
It's also an interesting case just for specialized models in general, right?
Like the cost thing you show is pretty interesting.
Yeah.
Similar score-wise, you're at 30, 70, 150, 300 bucks compared with the small model.
I think cost is a bit unfair, right?
Because this is at like inference cost, as long as they're on top of their margins on top of it, but we don't know anything else.
So you can figure it out.
Okay.
I did want to actually push on that more, not on cost, but you mentioned about okay, it's a great way to have verifiable long context reasoning.
What are other frontiers that I'm sure you guys are working on internally?
There's a lot of push of people pushing back on pre-training, scaling RL, pushing compute towards having more than half of your training budget all on RL.
Where are you guys seeing the frontier of research in that?
You mean it was there?
Just in foundation model training in general.
One thing that you guys do actually is fundamental research from the ground up, right?
So you probably have a really good look at where you can forecast this out.
Yeah, but I think for us, we are still working a lot on the pre-training side; I think people are very far from saturation on pre-training.
I think ML4 pre-training would be like a big step up compared to everything we have done before.
So we are pretty excited about this.
And I think on the RL side, we now have to think more and more about algorithms that will actually support these very long trajectories.
I think GRPO, for instance, doesn't really work when you are even a tiny bit off-policy, which was okay initially because you were solving math problems that can be solved in a few thousand tokens, so the model can generate them pretty quickly.
So when you do your update, the model is never too far off.
Still it's never too far off.
But now, when you are moving towards these kinds of problems where something takes hours, like six hours, to get a reward, then your model is completely off-policy, so you have to have new infrastructure that supports this, but also new algorithms.
So now, in everything we're doing internally, we're trying to build infra that will anticipate what we'll have in six months, which is these extremely long, very off-policy scenarios.
I think when we started Mistral, part of me, and maybe also Timothée, wanted this very nice environment where people can do any research they like with a lot of resources.
So it was nice.
I think things changed a lot when uh I think when ChatGPT came out.
I think after that it was very different, and these labs were never the same again, but yeah, it was nice.
And I think we also want to recreate part of this culture.
Coming to the end, obviously I think you guys are doing incredible work.
You've laid out a very impressive vision for open source and for voice.
What are you hiring for?
What are you looking for in people who are trying to join the company?
Yeah, so we are hiring a lot of people in our science team.
We are hiring in all our offices.
So our HQ is in France, in Paris.
We have a small team in London, we have like a team in Palo Alto as well.
Recently we opened some offices in Warsaw, in Poland, and also one in Zurich.
We also have some presence in New York as well.
And soon, one in San Francisco.
So we are a bit everywhere, and also hiring people remotely.
So we are growing the team, trying to hire like very strong people.
So the team is still a fairly small team, but I think we want to keep it that way because we find it quite efficient.
So like a small team and very agile.
So yeah.
Okay, let's focus on science and forward deployed.
We actually are strong believers in science.
We started a new science team that focuses specifically on AI for science.
What areas do you think are the most promising?
What we are pretty excited about right now, something we have already started doing and will probably be able to share more about in a couple of months, is that we are exploring AI for science, and there are a lot of areas where we think you could get some extremely promising results if you apply AI in these domains.
There are a lot of low-hanging fruits.
You just have to find these domains where actually AI has not been yet applied.
And it's usually hard to do because the people working in those domains don't necessarily know the capabilities of these models.
They don't know how well AI could help them.
Yeah, exactly.
You have researchers matching, which is actually hard to do.
But this matching, we are doing it naturally with our customers.
So we have some companies we work very closely with.
So for instance, ASML is one of our partners.
So we are doing some research with them.
And there they have like tons of extremely interesting problems.
Problems in physics, in science, in material science, that they are essentially the only ones to work on because they are doing something no one else is doing.
And yeah, there are many domains where AI can actually revolutionize things; you just have to think about it and be familiar with what you can do well now and how to apply it.
So yeah, it's something we're doing more and more with our partners, with our customers.
So AI for science is one big thing.
Yeah.
Okay, and then forward deployed: what makes a good forward deployed engineer, what do they need?
Where do people fail?
I think usually you need people that are very familiar with the tech, not necessarily with a lot of research expertise, but that are pretty good at using these models, that know how to do fine-tuning, that know how to start an RL pipeline.
And it's uh it's not easy.
It's something that the majority of companies would not be able to do on their own.
So here I think we need people that like to solve problems, that are excited about solving complex, very concrete problems.
It's applied science basically.
And yeah, I think it's not too different from the skills you need when you do research, because essentially you are trying to find solutions to problems that customers have not yet solved.
Sometimes it's easy.
Sometimes you really have to do the work.
You have to like create synthetic data, find some edge case.
So yeah, it depends on the problem, but I think you also need a bit of patience and to be creative.
I think it's a very similar skill.
The diversity of the work they do always surprises me.
It goes all the way across the kind of stuff they encounter in industries.
It's just very interesting, I think.
Any fun like success anecdotes?
Yeah, it can be really training a small model on the edge that just does one specific thing.
Making models really good at some domains, like for instance computer-aided design, these kinds of things.
Is that pairing with vision as well?
Yeah.
And defect detection for chips, or in factories, identifying things.
The diversity is such that it could be anything where you can deploy these foundation models.
So yeah.
The work is to make it work in that specific setting, basically, or whatever it takes to make it add value in that particular workflow.
Yeah, and it goes across the stack, right?
Like even just pulling up the website, the product lineup is so broad.
We didn't even touch on Mistral Vibe.
You have a vibe coding CLI tool.
One thing you guys were actually, I think, the first to do was Mistral agents.
Yeah, the agent builder, you can serve it via API and all that.
I'm guessing forward deploy people, yeah, help build that out and stuff.
It's also why we are doing many things, but I think that's also part of the value proposition: customers are always extremely careful about their data, and they don't like trusting so many partners, trusting one partner for code, sending their data to another third party for audio, and another one.
So they don't like this.
Here what they really like with our approach is that we can help them on anything.
So they don't have to send their data out to so many clouds.
So yeah.
I think that there can be many orders of magnitude more FDEs than research scientists.
And they don't need your full experience, but they're still super valuable to customers.
I mean, in practice, these two teams are still quite intertwined.
Very often, so first of all, they are using the same tools with the same data pipeline and everything.
And it's very helpful for the science team to get the feedback from the solutions team, because they can say, look, these customers are trying to do this, this is not working, and it can be improved in the next version.
Yeah, this is basically a real-world eval.
Yeah, yeah, exactly.
It's a real eval.
But for instance, if you're just working in a lab that just ships models, and you don't do this work of deploying the model for customers, you don't get this.
You have no idea whether your model is good at this edge case.
For instance, even in your work before this, right?
So there is a very big gap between the public benchmarks, which are very academic, and the real cases, which are just very diverse.
And yeah, in the specific context of a customer, you can fine-tune, but first you create a solid eval benchmark and then measure in the context of their kind of audio; for instance, one use case is literally just, there'll be a word for kids and they have to just say it out loud.
It's a very specific thing.
You're just saying one word, and then you'll grade the kid on whether they did it right or not.
It's like an eval for kids.
But so there are very diverse use cases, and the idea is that the applied scientists will go and make it better, and then from the learnings we incorporate it into the base model itself.
So it's it's just better out of the box.
Yeah, it's a good full circle system.
Like but the foundation model evals are all just proxies of what you really need.
You're never gonna have one that's just that; it doesn't make sense for there to be a one-word transcription benchmark like that.
It's not something you want to fit on.
Perfect.
Everyone should go check out everything Mistral has to offer and try the TTS model.
But thank you so much for coming.
Thanks, it was such a pleasure, thank you guys.