# AI Foundation Models and the Future of Precision Oncology

**Podcast:** Latent Space: The AI Engineer Podcast
**Published:** 2026-04-20

## Transcript

So we basically opened the lab, we hired a team, we got all the instruments, we started sourcing tumor samples.
There was no prior here that any of this would work.
Like, zero.
We just started generating data and, like, sourcing human tumors, processing.
We built this whole processing pipeline to get the tumors into, like, these arrays and the formats.
So you've got, like, these two-week runs where you're processing two slides.
And we're just churning data for months.
And we couldn't even train a model.
So we sort of just built all this.
And then like, let's say 18 months later, hey, I wonder, can we train a model?
And then it was not, you know, like it wasn't obvious.
Yeah, there wasn't really like anything major to go off of.
I mean, there were like transformers developed for single cell data.
There just like weren't really data sets out there that people had been able to develop on.
We do a lot of like custom model building.
Hi there, I'm RJ Haneke and this is Brandon Anderson.
We're the co-hosts of the Latent Space Science Podcast.
And today we're really happy to be in the studio with some of the people from Noetic.
I'm Ron Alpa, co-founder and CEO of Noetic, physician-scientist by training.
My hobbies are making hot takes about AI curing cancer.
Hi, I'm Dan Baer.
I'm VP of AI at Noetic.
I'm a biologist by training, did PhD work in neuroscience, and then moved into comp neuro, computer vision, self-supervised learning, and have been doing AI research at Noetic for the past few years.
Maybe we should start with what is Noetic?
Why did you found it?
What is the difference between Noetic and the other virtual cell companies?
Maybe just start with a little bit of a contrarian thesis, which is really the reason for founding Noetic.
We all know the numbers that 90%, 95% of cancer drugs fail in the clinic.
Why do they fail?
So our thesis is they fail not because we're bad at pharmacology, not because we're bad at target selection, you know, making the drug.
We're actually better at that process than we have ever been in the history of drug development.
Most of those drugs fail, we'd argue, because we're bad at selecting which patients those drugs will work in.
And oftentimes you see trials where there is no placebo effect in cancer.
Some patients respond to these drugs.
And if you have a patient that responds, that tells you something, that there's some biology that's active there, but you have a problem in patient selection.
And so really that's the thesis behind it.
Can we build models that can fundamentally understand patient biology from the very beginning and help you position molecules in the right patient population?
So you're actually using the models, partly at least, to select the patient cohort, not just... so you can imagine it working either way.
You could design, oh, I think that this molecule will do well because I know something about the patient population.
But you could also say, I think that this patient population is the match for this molecule.
And that's sort of the power of the models is like once you've trained these models on patient data.
You can use them on both sides of the equation.
So you can use them for discovering new targets directly from the patient data, which people often refer to as reverse translation.
So starting from humans and then trying to understand which targets to go after, and then you can use that to develop molecules.
But you can also use them directly on patient data if you have, you know, let's say a phase two or phase three trial.
You can use these models to understand which patients or what underlying biology of the patients in the trial is a predictor of response.
And we've been doing a ton of that recently.
Are you doing a lot of like rescuing trials that had a bad effect?
We are doing a lot of looking at like data from phase two, phase three trials and then using the models essentially to run inference on patient biopsies.
and understand whether there's underlying biology that would help us design the next trial.
We haven't shared any of that yet, but you'll see this soon.
So cancer is kind of like infamous in that like, there are many, many different types of cancers.
Whenever someone says, like, cure cancer, that is almost a meaningless, vacuous statement.
So your point is, even amongst cancers, you pick a specific type of cancer, and then a subtype and a subtype, and there's a bunch of different patient populations, each of which will respond differently to drugs.
And your point is you can figure this out right now: that some subpopulation will do well and respond to this drug when you think, generally speaking, the rest of the population would not, even though we have historically classified this as all one type of cancer, one indication, and so on.
Yeah, that's exactly right.
And I would maybe even go further and say, like, nobody actually knows what the subtypes are.
There are cancers that originate in a certain tissue, like the lung, that, you know, have been classified into subtypes based on pathologists looking at them for, you know, more than a century.
And, you know, those subtypes certainly have some connection to the real, like, carving nature at its joints, like what are the actual functional subtypes of disease there.
But our thesis is kind of that if you look at a much richer kind of data, so the multimodal data that we're generating in our lab, we're going to see that actually, you know, what people thought was one subtype of lung cancer is really three distinct subtypes of cancer.
And that is going to be critical for figuring out which patients should get which drugs.
Yeah, maybe I'll just go back to like one of your first questions.
And, you know, I was saying, like, many drugs fail in patients in oncology because we don't understand which patients they will work in.
Why do we end up in that situation?
So whenever you make a new drug, you do a set of experiments in cell culture, cells in a dish.
Those cells are often cell lines.
These cell lines have existed for 40, 50 years and they're immortalized.
So they have genomes that allow them to persist.
They have abnormal numbers of chromosomes.
They have gene expression patterns that don't represent any known cell in the human body, really.
These are sort of Frankensteinian cells.
They're cancer-derived.
They're ruinously cancer.
They're mostly cancer.
And so you can do your experiments in these cell lines in a dish, or then you can move these into animal models.
And in oncology, you often have this sort of panel of different animal models with different cancer types that you'll test these in.
And we, in doing these experiments, we sort of convince ourselves that some of these cell lines are, let's say, lung cancer cell lines or colon cancer cell lines.
And then even that some of them in the mouse context are colon cancer cell lines and lung cancer cell lines.
And then, in the mouse, we implant them under the skin and, like, weird places, treat the mice with drugs, and we see how they respond.
But ultimately there's a big gap because they don't translate to patient biology most of the time.
So these cancer cell lines, most of them don't even, you know, even if they are derived from a colon cancer, they don't even have the mutations that human colon cancers have in many cases.
And pharma has done this for, you know, 20, 30 years, where you develop a drug and you test it against, you know, hundreds of these.
It's not a hard experiment.
You can send this out to any CRO.
They'll test your drug against hundreds of different cancer cell lines.
And you can sit back and say, okay, well, which of the 50 colon lines responded to my drug and which of the 50 ovarian cancer lines?
And you could try and map that to human biology.
But the problem is these cell lines as an abstraction do not relate in any way to human patients.
And so what happens is ultimately, no matter what you do preclinically, the molecule gets in the clinic and the clinical team says, look, we don't really know how to design this trial because none of the data that you've produced gives us any insight on which patients to run.
So we're going to basically enroll an open label study.
So we're going to enroll all tumors, all patients that are enrollable in this trial, and we're going to see where we get signal.
Imagine doing that in an early phase trial where, let's say, you have 50 patients and you're trying to test different doses, and you don't really know the dose of the drug and you don't know what the safety margins are.
And you're also trying to figure out where is my signal?
And then what if I told you that, let's say, in just lung cancer, hypothetically, let's say there's only 10 different subtypes of lung cancer.
And you don't even know if it's lung.
It could be any.
So, you know, this is what happens.
And oftentimes you get to the end of these early stage trials and you don't see very many responders, as you would expect statistically, and then these molecules get canceled.
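The dilution argument in this passage can be made concrete with a little arithmetic. The sketch below reuses the hypothetical numbers from the conversation (50 patients, 10 subtypes) and assumes, purely for illustration, that the subtypes are equally common, that the drug works in exactly one of them, and that half of the patients in that subtype respond. None of these rates come from the speakers.

```python
# Illustrative arithmetic only: every rate below is an assumption, not trial data.
n_patients = 50             # early-phase trial size mentioned in the conversation
n_subtypes = 10             # hypothetical number of lung cancer subtypes
response_in_subtype = 0.5   # ASSUMED response rate inside the one sensitive subtype

# Assume subtypes are equally common and the drug works in exactly one of them.
patients_in_sensitive_subtype = n_patients / n_subtypes
expected_responders = patients_in_sensitive_subtype * response_in_subtype
overall_response_rate = expected_responders / n_patients

print(f"patients in sensitive subtype: {patients_in_sensitive_subtype:.0f}")
print(f"expected responders: {expected_responders:.1f}")
print(f"overall response rate: {overall_response_rate:.0%}")
```

Under these assumptions a drug that works well in its subtype still looks weak in an all-comers readout, which is the failure mode being described.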
So you're imagining that, with your Noetic system, you help the pharmaceutical company to characterize: we expect that people with a certain genetic profile or even transcriptomic profile will respond to this drug.
And then you go and you actually sequence from the patient and you say, yes, this is a match, or no.
Is that the sort of grand vision?
Yeah, I mean, I would say we are even less biased than that.
We are saying, okay, well, we want the model to learn, let's say from lung cancers, we want the model to learn like how many different therapeutically relevant subtypes of lung cancers there are, just from self-supervised learning from the data.
And those subtypes could be driven by large genetic changes.
It could be driven by, you know, immune changes.
It could be really driven by any biology that the model is learning in the process of training.
And we do see, you know, different types.
I mean, feel free to contradict this, like, as the actual doctor here.
But, like, you know, the biomarkers that, you know, people have been using are, you know, biased towards simplicity.
You know, does the patient have this particular mutation?
Sometimes, like, staining for this single protein or, you know, doing transcriptomics to look for a particular gene signature.
But like there's no reason to think that biology or like biology of cancer is that simple that you're going to capture, you know, most of the meaningful variation with such simple biomarkers.
And, you know, most of them, they have like weak correlations with, you know, clinical success.
But the hypothesis here really is, like, again, if you were to carve nature at its joints and figure out what's really going on, that there are, you know, these five subtypes, then the correlation between which patients you give a particular drug and whether you have success is much, much stronger than if you're forcing yourself to go with these, like, very simple biomarkers.
You mentioned you do a lot of data generation in the lab.
So why do you think that is appropriate, versus using existing public repositories or whatever?
Yeah, we generate all our data in the lab.
Everything from sourcing tumor samples themselves to processing them and generating the data.
Maybe another hot take I have just in AI and bio is you're sort of not at the order of magnitude of data that you are in other spaces of building training models.
And so it becomes really hard to brute force these problems just by collecting data.
We have a couple pretty good examples of where someone has designed a data set.
So PDB was designed and has been built over the past 50 years or so.
And so it's not an accident that that data set exists.
Someone decided that we are going to design this data set.
We're going to collect this data over decades and decades.
And then with the intuition that potentially this would help solve protein folding down the road, and it did.
So it's not just that PDB is a bunch of random data that people have organized from the web.
I think that in bio, you really need to be intentional about the data that you generate and how you generate it, and have some foresight around, well, what are the models we're going to want to train and what do the models need to learn from, from the very beginning?
So that's why we've taken this approach.
Yeah.
And I mean, like a good comparison is to the ImageNet data set, which kicked off the deep learning revolution in computer vision with convolutional neural networks, like actually demonstrating that neural networks can do better than other methods on object categorization.
ImageNet, at least the part of it that people were developing models on, is 1.2 million images, very carefully curated.
These are high quality images, not like random images from the internet or like multiple data sets cobbled together.
And labeled?
Yeah, and labeled.
And I think with the data that we're generating, we're around that scale right now.
But, you know, of course, people have gone much, much larger in image data sets and language data sets, text data sets, obviously for LLMs.
So we think that we need to get the data up to that scale before we can really see the meaningful progress on the algorithm side.
The scale of language data.
Yeah.
Language is really the only modality where people are seeing these very impressive scaling results.
And part of that has to be just the scale of data that's there and that the models are trained on.
That can't be the only thing because there's a lot of video data as well.
People are training on thousands of hours of video data and haven't seen the scaling results that you have in language modeling.
But having the right scale of data is necessary, if not sufficient, to like really make progress here.
Can I offer a contrarian take on that?
Sure.
So, I mean, there's this whole concept about the jagged frontier of LLMs and sort of AI, and how in certain regions they can be really good at solving some problems and then remarkably stupid at solving nearby problems.
And maybe the argument for what's happening is that, with a lot of these frontier models, like, everything is just becoming in distribution.
Like, if everything starts out OOD, if you just get more data, it now becomes in distribution.
Is it possible that for biological systems, because there are underlying physical processes here, you can basically make things more in distribution earlier, and that you can actually cover the space?
I kind of have some follow-ups with PDB, but I'm just curious at this point.
Yeah, I mean, I think it's a good question: sort of how much data and what kind of diversity do you need in biology to solve, you know, say, the drug translation problem, like figuring out which drugs are going to work in which patients?
My intuition from working in biology for a while is that we're still pretty far from that.
Like, you know, we're building data sets that are focused right now on cancer, and, you know, have generated data from thousands of patients in a few major cancer subtypes, but there's, like, every other disease, there's healthy tissue, there's even other species.
You know, there's a lot of biology to learn, especially if you think about it as: we have to learn kind of the spatial and functional patterns of tens of thousands of genes, tens of thousands of proteins, how their spatial arrangement contributes to the function of organs, and so forth.
You know, my hunch is that biology is, like, pretty complex and that we still need to generate a lot more data.
But yeah, I don't know.
Yeah, but as a cancer company, do you think you could actually do this, hypothetically, for cancer?
I mean, for at least some, you know, subclasses?
Definitely, yeah.
I think that we've done experiments that suggest that, you know, if we can generate data from several hundred patients in all of the major cancer indications and some of the less major indications, that will result in a model that can generalize pretty well to kind of any type of cancer we would throw at it.
Backing up, what is the data you're collecting?
Because my understanding is you use some pretty specialized instruments and gather very specific data sets.
So how did you come to that decision about how much data, how much to spend on it, and what types of data?
I'll give a hat tip to my previous employer, Recursion.
So we spent six years at Recursion from the very beginning.
And a lot of what we were doing in the early days was figuring out like the things we didn't understand about the datasets and figuring out what the problems would be in the dataset.
So batch effects, controls, how to orient samples on plates, things like that.
Flash forward to the founding of Noetic: we started the company, you know, already with some principles around how we should think about building the data set.
What are some things that we know matter?
So, for example, over many years, we learned that images are actually a really powerful data set for machine learning for many reasons.
One, they scale.
So we can put patient samples on slides and on a single slide, we can capture many patients worth of biology.
The images themselves are very rich sources of biological information beyond that.
Now we have a very information-dense modality, and we can decrease the cost of data generation, so then we can increase the amount of data generation over the whole data set.
And that's always been a really big benefit of image-based modalities over, let's say, sequencing, where every time you run a sequencing run, your N is, you know, a patient's N.
That was one way to think about it.
The other was, how do we design these data sets so we can control for things that we know are going to be important, such as batch effects.
So, for example, if I have a slide, we do, let's say, a spatial transcriptomics run on that slide.
You stain the slide, do a bunch of, you know, wet lab processing, you put it into a machine, you get data.
If you do that on two different days, there are going to be different variables that impact the data.
That's going to be a large source of variation in data sets.
So you want to be able to control for things like batch effects.
So really you want more patients represented on multiple different slides so you can process them in different batches.
So you want to be able to control for things like this so you can go downstream and look at the data and say, okay, well, once we have, let's say, patient-level embeddings, we can ask, well, do the patient-level embeddings represent, let's say, patient response to immune therapy or do they represent staining batches?
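One simple way to ask whether embeddings track response or staining batch is a nearest-neighbor check: for each sample, find its closest neighbor in embedding space and see whether they share a label. The sketch below is a toy illustration, not the method described in the conversation; the function name `nn_label_agreement`, the 2-D embeddings, and the labels are all made up for the example.

```python
import math

def nn_label_agreement(embeddings, labels):
    """Fraction of samples whose nearest neighbor (excluding itself) shares their label."""
    agree = 0
    for i, emb in enumerate(embeddings):
        nearest = min(
            (k for k in range(len(embeddings)) if k != i),
            key=lambda k: math.dist(emb, embeddings[k]),
        )
        agree += labels[i] == labels[nearest]
    return agree / len(embeddings)

# Toy patient-level embeddings (hypothetical): two well-separated groups.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.3), (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
response = ["responder"] * 3 + ["non-responder"] * 3   # made-up clinical labels
batch = ["A", "B", "A", "B", "A", "B"]                  # made-up staining batches

response_score = nn_label_agreement(embeddings, response)
batch_score = nn_label_agreement(embeddings, batch)
print(f"neighbor agreement on response: {response_score:.2f}")
print(f"neighbor agreement on batch:    {batch_score:.2f}")
```

High agreement on response and low agreement on batch is the desirable pattern. The comparison is only meaningful when each patient has been processed in more than one batch, which is the design point being made here.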
So you're actually taking one patient and you're spreading it across multiple slides so that you can get, like, sort of a calibration across the slides.
Yes, our data looks very different from anyone else's in the space of generating data on histology or digital pathology types of specimens.
So we receive a sample, we sample those samples dozens of times to build these arrays, and each array has hundreds of different patient samples, randomized, and every patient is represented on multiple different arrays.
And so we're getting a lot of different representations of each patient that we're sending through the data process and pipeline.
And then that lets you downstream be able to answer some of these questions and control for some of these variables.
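The array design described here (each patient sampled several times, cores randomized, every patient present on multiple arrays) can be sketched as a small layout routine. This is a minimal illustration under stated assumptions: `build_arrays` and its parameters are hypothetical names, the sizes are toy values rather than the hundreds of samples per array mentioned above, and array sizes are not balanced.

```python
import random
from collections import defaultdict

def build_arrays(patient_ids, cores_per_patient, n_arrays, seed=0):
    """Place each patient's cores on distinct arrays, then shuffle positions."""
    if cores_per_patient > n_arrays:
        raise ValueError("need at least as many arrays as cores per patient")
    rng = random.Random(seed)
    arrays = defaultdict(list)
    for pid in patient_ids:
        # Distinct arrays per patient, so no patient is confined to one batch.
        for array_idx in rng.sample(range(n_arrays), cores_per_patient):
            arrays[array_idx].append(pid)
    for cores in arrays.values():
        rng.shuffle(cores)  # randomize core positions within each array
    return dict(arrays)

patients = [f"patient_{i:03d}" for i in range(12)]
layout = build_arrays(patients, cores_per_patient=3, n_arrays=4)

# Count how many different arrays each patient appears on.
arrays_per_patient = {
    pid: sum(pid in cores for cores in layout.values()) for pid in patients
}
print(arrays_per_patient)
```

Because each array is processed as its own run, a patient who appears on several arrays is measured under several batch conditions, which is what makes batch and biology separable downstream.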
You mentioned some terms I just want to define for people: spatial transcriptomics.
Yeah.
What is that?
Yeah.
So what be?
I mean, this was your first question.
So what are the data types?
So if you just sit back, and this is not my background, in terms of spatial.
Again, everything we did previously was cell biology in a dish.
If you just sat back and you said, okay, I want to train a foundation model that understands human biology.
What does that mean?
How would you go after that problem?
And that was really the starting point for the company.
We said, okay, but from first principles, how would we do this?
So you probably want tissue level biology.
You want to understand tissue.
Cells are organized into tissues.
You probably want some modality that is relevant in clinical use.
So you can relate clinical data to what your models are learning.
That's why we generate pathology H&E.
So that's, you know, every patient gets a tumor removed and then it gets this H&E stain.
And that's what the pathologist...
Can you explain what H&E is?
It's basically two different dyes, hematoxylin and eosin.
And it really just creates a contrast over the tissue.
So you've probably seen these like purplish pathology specimens.
So pathologists can look at those and they can identify different cellular structures and they've used those to classify tumors based on, you know, the classical classifications of, you know, adenocarcinomas, small cell carcinomas, things like that, but basically cellular structures.
Okay.
So there are, like, specific patterns that show up when you add these two stains, and it is well established that, like, you classify tumors...
Based on?
Based on, yeah, pathological classifications.
And this is what every, basically every tumor, you know, that gets processed in the hospital will get, this H&E stain.
And it's how the pathologist typically classifies a tumor from the first level.
So, okay, so you want that.
You probably also want to understand cell types.
It's really hard to understand cell types from just that stain, because it doesn't reveal that much that a human can use to classify cell types, at least.
So you could say, well, I want to know whether there are immune cells and different subtypes of immune cells.
We want to have some layer of cell biology.
And you want to know about immune cells because you have these cancer cells and oftentimes the immune response dictates whether or not you have an effective treatment.
It's like the immune environment of the tumor will be a core, we know it's a core constituent of whether a patient's going to respond or not.
So you want to know, OK, you want to give the model this.
So the models are going to get this tissue level information.
There's not enough cell level information in there for the model to learn enough cell biology about different subtypes.
So we also want to present it with some cell level information.
So we use protein stains, so standard immunofluorescence.
So you basically use antibodies against a small set of cell markers to label different T cells, B cells, standard subtypes of cells in the tumor microenvironment.
So in this stain, just for those who aren't familiar, the stain on the antibody has a fluorescing protein.
When you hit it with a certain frequency of light, then it fluoresces.
So you can tell the antibody bound to a certain protein.
And now it has a fluorescing protein attached to it.
Yep.
And in terms of the data, so from the tissue layer, you have an RGB image.
From the next layer, you have a multi-channel image with each channel representing, you know, let's say one color.
And so, for example, certain immune cells are each in a different channel.
So you have this multi-channel image.
Now, okay, so that's great.
So we've got tissue, we've got cells.
But if we actually want to make drugs, we need some type of molecular information.
We need to tie all of this down to what's happening in the genome.
What is the cell doing?
What are the mechanistic principles of the biology?
So then we get spatial transcriptomics.
So that's spatially resolved RNA.
So DNA is transcribed into RNA, which is translated into proteins.
So we get basically the RNA in a spatially resolved pattern for the same cells that we're seeing in all of these other layers.
So now you have between 1,000 and 19,000 different genes.
And again, these are all image layers that are spots of where those RNA are in which cells.
And this one works a little bit similar to what we talked about with protein, where you have a segment of RNA and then you have a fluorescing protein, and usually there's some sort of combinatorial thing.
So if you see these four colors at this amplitude, then that means this gene, because they're right next to each other or something like that.
So for the detection method, you're basically binding a probe at each one of those RNAs and then you're cycling it.
And it takes weeks to run one of those assays.
So you're cycling, the machine will cycle across each species and it'll amplify and you'll get a signal for each RNA species.
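The "combinatorial thing" mentioned in this exchange can be pictured as a barcode lookup: each gene is assigned a sequence of colors across imaging cycles, and a spot is decoded by matching the sequence observed at that location. The sketch below is a generic toy illustration, not the specific chemistry or platform discussed here. The codebook, the three-cycle barcodes, and the spot coordinates are all invented for the example.

```python
# HYPOTHETICAL codebook: each gene gets a color sequence across three imaging cycles.
CODEBOOK = {
    ("red", "green", "red"): "CD3E",
    ("green", "green", "blue"): "CD19",
    ("blue", "red", "green"): "EPCAM",
}

def decode_spot(colors_by_cycle):
    """Return the gene whose barcode matches the observed colors, or None."""
    return CODEBOOK.get(tuple(colors_by_cycle))

# Made-up detections: pixel location plus the color seen in each cycle.
spots = [
    {"xy": (10, 42), "colors": ["red", "green", "red"]},
    {"xy": (11, 40), "colors": ["blue", "blue", "blue"]},  # matches no barcode
]

decoded = [(spot["xy"], decode_spot(spot["colors"])) for spot in spots]
for xy, gene in decoded:
    print(xy, gene if gene is not None else "undecoded")
```

Real decoders also have to handle missed cycles and overlapping spots, which this lookup ignores.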
Now, at this point, you now have basically this very rich data layer where you have the tissue, you have the cells, and you have the molecular information.
And you can use all of that to train the model.
And so you think of it as, it's essentially the central dogma, if you will.
And we also have DNA.
We genotype just so we understand the genomic alterations in these tumors.
All right, so you get this stack of images, basically, that you can train models on with understanding the expression of genes and the proteins that are being expressed at the time that the sample is taken, all in the image information, and then you can train your models with that.
Yeah, I mean, the spatial transcriptomics is, like, particularly dense, because if you think, let's say, there are 20,000 genes in the genome, now, you know, we're running assays that are detecting nearly all of them in a single sample.
So you can think of one of those data points as an image, except instead of being an RGB image that has three color channels, now all of a sudden it has like 20,000 color channels.
So it's like a very meaty computer vision problem to try to look at those data and figure out what makes patient A different from patient B, and then go from that to which drug is going to work in which one.
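The layered data described in this section can be summarized as a stack of aligned images that differ mainly in channel count. The sketch below just tallies channels per pixel. The RGB count and the round 20,000-gene figure come from the conversation, while the immunofluorescence marker count (8) and the `ImageLayer` type are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImageLayer:
    name: str
    channels: int  # values stored per pixel for this layer

# RGB and the ~20,000-gene figure are from the conversation;
# the immunofluorescence marker count is an ASSUMED placeholder.
stack = [
    ImageLayer("H&E (tissue, RGB)", 3),
    ImageLayer("immunofluorescence (cell markers)", 8),
    ImageLayer("spatial transcriptomics (one channel per gene)", 20_000),
]

def channels_per_pixel(layers):
    """Total values per pixel if all layers are aligned and stacked."""
    return sum(layer.channels for layer in layers)

total = channels_per_pixel(stack)
for layer in stack:
    share = layer.channels / total
    print(f"{layer.name}: {layer.channels} channels ({share:.2%} of stack)")
print(f"total channels per pixel: {total}")
```

Almost all of the per-pixel dimensionality sits in the transcriptomic layer, which is why it reads as an unusually wide computer vision problem.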
And so you have a hot take about virtual cell.
Like I want to understand how, okay, so you, you know, you have this big pile of data that every single sample has a massive data set with it and then you have many, many samples.
So how do you turn that into useful knowledge?
Maybe just what is a virtual cell?
Everyone's always, you know, asking that question.
I think there are really two ways to think about it.
One is we want to be able to simulate all the biochemical processes in a cell.
So we want to have this sort of comprehensive foundation model where we understand if some signal from outside the cell interacts with the cell, then here are the millions of intracellular chemical reactions that are going to happen, and you could sort of predict them from the model.
So that's one view.
I think that's sort of an interesting intellectual pursuit.
I don't think we have all the modalities of data that you would need to solve that problem.
I tend to see the virtual cell problem as something more practical.
We're trying to make drugs that work in patients.
So from a virtual cell perspective, really what we want to do is understand cell biology in some heuristic that's useful for making drugs.
And the heuristic could be, you know, a way to understand your targets or a way to, you know, map your cell level biology up to patient level biology.
And so the way we've designed these first virtual cell models is maybe just to simulate the biology of a cell in some context.
And the biology of that cell being, you know, let's say the cell being in some context and the output being, you know, the transcriptome in that context or, you know, the protein in that context.
And these types of, you know, input-output relationships allow us to essentially design experiments.
And so really the very simplistic thing that we're doing is really just that the model can simulate the biology of a cell or, you know, many cells in different contexts and allow you to run some simulations in that regime.
Yeah, I mean, I think what most of the things that people are calling like virtual cell models right now are focused on single cell gene expression, so transcriptomics data, RNA data, and they're largely geared toward the problem of predicting what's going to happen to the transcriptome.
So the set of genes expressed when you hit cells with either a small molecule, a drug, or a genetic perturbation.
And typically this is cells grown in vitro, like either cell culture or primary cells, something like that.
I think the genetic perturbation being where I knock out a gene or add a gene and see how that impacts the expression of the various RNAs.
Exactly.
And I think my view, and I think Ron shares it too, is that that may be of interest in some cases, but the problem we're really trying to solve is predicting what's going to happen in a patient.
And just modeling data that comes from a patient is, in my mind, much more likely to translate to what happens when you give a patient a drug than something that's happening in cell culture.
Is there other clinical data that you're pulling into the model besides the actual, so you're calling the context of the cell just the surrounding cells, but is there other, this drug caused a bad reaction kind of stuff?
Yeah, I mean, we're pulling in data from the entire patient.
So not just, you know, the very local neighborhood of the cell.
So far, we haven't done much integration of, you know, like electronic health records or, you know, other information that one could get about the patient.
And that's pretty intentional.
Like, we really want these models to learn basic biology.
Again, like the central dogma, not just the central dogma, but, you know, the basic biology of genes, proteins, cells, tissue, in a self-supervised way, so purely from the data that we're generating, and not be biased by, you know, what the doctor wrote about that patient.
Because, you know, our thesis is kind of that, like, most of the therapeutically predictive and important information is not contained in those very small number of patients who have been treated with a given drug and whatever the doctors thought was important to write down given the state of knowledge at that time.
So it's much more about trying to discover what's really there in patient biology than go based on the text that people have written about it.
So you have this self-supervised model, you feed it a lot of data, you have essentially some clusters of patients now.
How do you translate those clusters of patients to making decisions?
Like you go to a pharma company and you say, we can repurpose or we can suggest this subtype should be the focus of your phase two trials.
Like what is the process for that?
What data do they need to provide you and how do you translate your models?
So it depends on what the problem is.
I think it's important.
So one of the more interesting aspects of these models is they are useful for a broad array of use cases, as we were talking about from the very beginning.
So you as the pharma company could say, OK, well, I have this molecule and the target of the molecule is X.
And I went and designed my clinical trial.
The molecule has seen zero patients so far.
All I know is the target and some biology around the target.
So we can run simulations using the models and our cohorts of patients.
And let's say, if we were to look at, you know, lung cancer, we can run simulations around the target and ask, okay, which sets of patients here would this target be important in, across a cohort of, you know, lung cancers and colon cancers, you know, across all of oncology.
And you might see, and we see this sometimes, you might see that, you know, you probably don't want to put your target in lung cancer.
Maybe you want to put it in ovarian cancer because it's not really important in lung cancer.
What are you simulating here?
So, like, are you saying that this drug is expected to knock down this gene, and therefore you want to look for clusters where knocking down this gene inhibits tumor growth rather than enhancing tumor growth?
I mean, that's certainly one way we could do it.
There are other types of simulation where you might just want to ask, like, if there were an immune cell here, like a T cell, which is responsible for actually killing tumor cells, what would happen to it? What genes would it express, or what proteins would it express, in this particular patient's tumor microenvironment?
And that's what we've called like these virtual cell simulations.
Like we have a model called OctoVC, for Octo virtual cell, that does this.
And that can give quite powerful answers to the question of are these drugs going to work in these patients?
Because you might find, like, actually, as Ron was saying, the thing that this drug targets is just not important in this particular patient's tumor in that there's not, like, it's not going to have any effect on the T cells or the macrophages or some other cell type there.
Then, you know, there's the type of simulation you alluded to, where you can ask the model, what would happen to this patient's tumor if you were to knock down this particular target gene or its protein product? And you might be looking for cases where the model predicts that removing that gene or that protein is going to have a large effect, like either increase the immune system function, its ability to fight that tumor, or decrease the tumor's ability to grow, or some other readout that you think is correlated with clinical success.
I just want to call out maybe like the simplest use case is the one where there's like a company that has a drug and they've given it to some patients and we know some of those patients responded.
And then it just becomes, like, a question of, does the space of patients that the model has learned via self-supervision tell us that all of the responsive patients are in one of these clusters and not the other nine clusters or something?
So if we know that, then there's a pretty straightforward hypothesis that this is the right cluster.
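As an illustration of the responder-versus-cluster question just described, here is a minimal sketch that counts responders per cluster and scores over-representation with a one-sided hypergeometric tail. The function name, the toy cohort, and the choice of statistic are assumptions made for this example; the conversation does not say how Noetic actually tests this.

```python
from collections import Counter
from math import comb


def responder_enrichment(cluster_ids, responded):
    """Count responders per cluster and score over-representation.

    cluster_ids: one (hypothetical) cluster label per patient.
    responded:   one bool per patient, True if the patient responded.
    """
    total = len(cluster_ids)
    total_responders = sum(bool(r) for r in responded)
    sizes = Counter(cluster_ids)
    hits = Counter(c for c, r in zip(cluster_ids, responded) if r)

    results = {}
    for cluster, size in sizes.items():
        k = hits.get(cluster, 0)
        # One-sided hypergeometric tail: chance of seeing at least k
        # responders in a cluster of this size if labels were random.
        tail = sum(
            comb(total_responders, i) * comb(total - total_responders, size - i)
            for i in range(k, min(size, total_responders) + 1)
        )
        results[cluster] = {
            "patients": size,
            "responders": k,
            "p_value": tail / comb(total, size),
        }
    return results


# Toy cohort (made up): ten patients in three clusters, three responders.
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
response = [True, True, True] + [False] * 7
for cluster, stats in sorted(responder_enrichment(clusters, response).items()):
    print(cluster, stats)
```

In this toy cohort all three responders fall in cluster 0, which is the "one cluster and not the others" pattern described above. A real analysis would also need to correct for testing many clusters at once.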
So that's the scenario where you would sequence something.
What would you collect about those?
So you have a cohort that responded and one that didn't.
Yeah, so this is getting back to something Ron mentioned earlier, which is this type of data called H&E.
It's a stain, the standard histology stain that makes these pinkish and purplish looking images.
Right now, what we do is we've built models that are trained on kind of all of the multimodal data we generate.
But then once they're trained at inference time, all they need is an image of H&E.
And that could be something that we generate in our lab or it could just be, you know, a digital image that they have from a trial that was run years ago.
And the reason that that is so powerful and flexible is, again, because H&E is kind of like the lingua franca of pathology and especially oncology.
So almost every patient who's been given a clinical stage drug is going to have that.
You can look at the two cohorts, the responders and the not responders, and say these H&Es live in this part of the latent space and these H&Es do not.
Yeah, exactly.
And I think one way we've gone further than that even is, given the H&E, the models can say, I predict that these genes are expressed at this location in this patient.
So not only do we have these clusters, these embeddings that say, you know, all of the responders to this drug are over here, all of the non-responders are over there, but we can actually see, okay, for the responders, these are the genes that are expressed much more highly, or predicted to be expressed much more highly, in the responder cluster versus the non-responder cluster.
And so that adds a major level of interpretability there because we can see things like, okay, good, the responders are actually expressing the protein target of this drug.
So we would be worried if that weren't the case, but we can see it is.
On the other hand, we also see that the biology is very, very complicated.
So kind of explaining why these simple biomarkers, like looking at a single gene or a single protein, just really don't capture what is predictive of therapeutic response.
Yeah, so I have like a million directions I want to go here.
H&E, that actually gives you a pathway to a diagnostic then as well.
Exactly, yeah.
Right, yeah.
And so you can imagine, after the drug hopefully makes it to the market, a doctor says, oh, you have cancer.
I'm very sorry.
We're going to do an H&E stain of your tumor.
And then we're going to put it in the model and it says, oh, you know, this one will work for you, but this one won't.
That's right.
And so we're actually using that same approach today.
We're looking at many different mechanisms from different collaborations that we have in place.
You know, one of them we've announced with a company called Agenus.
These are all different mechanisms.
The input is still H&E, and some of the same indications.
So using H&E, we're asking whether drug A works in some sets of patients, whether drug B works in other sets of patients.
And so you can take that to its natural progression and say, well, okay, if you can use that same input, just H&E, for experimental drugs, why not use it also for drugs that are on the market already?
In a sense, the same assay can be very predictive across many different cancers and many different potential therapeutics.
There are lots of models that take H&Es and go to gene expression out there, open source, whatever.
They do, you know, so-so.
I've read on Twitter, your Twitter feed and whatever, that you feel that you have a data moat, right?
And so why is Noetic's model better?
Sure.
I mean, I think, you know, the scale of data that we've trained these models on is like, you know, pretty different from a lot of what's out there.
Like the reality is there's just not that much of this kind of paired H&E plus other data modalities.
Typically, there are some data sets generated by academic labs, others where they might have maybe like a hundred or a few hundred patients worth of data with paired spatial transcriptomics.
That might even be an overestimate.
In comparison, we're generating these data that are multiple patients per slide, individual patients distributed across multiple slides.
We've generated now more than 100 million cells spatially resolved, spatial transcriptomics, that's all paired with H&E and protein as well, at least an order of magnitude larger than any of the other data sets that we've seen out there.
And I think that makes a pretty enormous difference.
I mean, we've seen with our own models that if you drop down to 40% or 10% of that data used in training, the models get a lot worse.
And they especially get worse at kind of generalizing to other types of cancer from the ones that they've been trained on.
So I think that's a big piece of it.
I also think that the algorithmic side of it is important.
We've developed custom architectures specifically for training on this multimodal data.
And again, my background is in computer vision and specifically in self-supervised learning there.
And so we've tried to develop, you know, self-supervised learning approaches for these data that are really adapted for solving this problem of, you know, figuring out what is different in one patient versus another and then simulating what would happen if you were to knock down a particular gene or protein or something.
So this is why we call these world models where we're trying to build models that can simulate what's going to happen if you take a particular action.
I think that's another big differentiator for these models.
And then, again, the interpretability as well is probably a third one.
It's funny because you were just talking about how one of the other strategies people take for this is to do perturbations on cells and then watch the response. And now your strategy is you can simulate this sort of counterfactual perturbation idea without even having to collect the data to do that.
Well, yeah, there's a big piece that we haven't talked about yet, which is actually we are running perturbation experiments, except they're in vivo perturbations using a platform based in mouse.
We have another platform, it's called PerturbMap.
Ron, if you want to describe any of it, but basically this is a platform for generating highly multiplexed knockouts of individual genes.
So the same kind of, like, CRISPR knockouts that people are doing for individual cells in vitro, except when we knock out a gene in a cancer cell, that cancer cell gets injected into a mouse.
It's barcoded so we know which gene was knocked out, and it's being injected alongside roughly 100 other cell types with different genes knocked out.
So you end up with mice that have tumors that are barcoded that have 100 different genetic perturbations in them.
We can actually use that to validate our models and ask, you know, is what the models are predicting in humans via simulation actually borne out when you do these perturbations in a mouse system?
Sorry, there's a lot to go into that.
Barcode?
Yeah, so sorry, barcoding.
This is a technology in which an individual gene is knocked out with CRISPR, but also this introduces a set of protein tags in that cell that get expressed.
It's a combinatorial code.
Gene X might have proteins A, B, and C.
Gene Y, when it's knocked out, has proteins D, E, and F.
And we can tag those proteins or label them with antibodies so that when we go and look in the mouse, we know exactly which gene was knocked out based on which of those protein tags were expressed.
So you knock out a gene, but you also add a gene that has the barcode proteins encoded on it.
Yeah, exactly.
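The combinatorial barcode idea can be sketched as a lookup from a set of detected protein tags to the gene that was knocked out. The gene names, tag letters, and three-tags-per-gene layout below simply mirror the spoken example (Gene X with A, B, C; Gene Y with D, E, F) and are hypothetical; the actual tag panel and code design of the platform are not described here.

```python
from math import comb

# Hypothetical codebook: each knocked-out gene is marked by a unique
# combination of protein tags that can be stained with antibodies.
CODEBOOK = {
    frozenset({"A", "B", "C"}): "GENE_X",
    frozenset({"D", "E", "F"}): "GENE_Y",
}


def decode_knockout(detected_tags, codebook=CODEBOOK):
    """Return the knocked-out gene for a tumor, or None if the tag set is unknown."""
    return codebook.get(frozenset(detected_tags))


def codebook_capacity(n_tags, tags_per_code):
    """Upper bound on distinct barcodes if any combination of tags may be used."""
    return comb(n_tags, tags_per_code)


print(decode_knockout(["C", "A", "B"]))  # tag order does not matter
print(codebook_capacity(6, 3))           # six tags, three per code
```

The point of a combinatorial code is that a small number of tags can label many perturbations: six tags taken three at a time already allow up to 20 distinct codes, assuming overlapping combinations can be told apart reliably in the images.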
And I mean, the system's designed so that everything that we're doing here is tissue level.
It could be, you know, tumors that came from humans that are in the form of the whole tissue.
And then here, in this mouse system, you have hundreds of tumors in the lungs of a mouse.
And if you look at these images, it's a mouse lung with like literally hundreds of tumors in it.
And each tumor has a distinct biology that's driven by the biology of the knockout of the gene that's being perturbed.
And we can capture basically the biology of each tumor in a spatially resolved way.
So what you can see is, okay, well, certain tumors in humans, let's say, don't have immune cells in them.
And so those tumors are very aggressive and they don't respond to immune therapies.
You can generate those same tumors in this mouse system.
And again, they don't have immune cells in them.
And you can do it genetically, so you can start to map kind of the gene, the causative gene relationships between these different immune or just broadly tumor genotypes or biological profiles, if you will, to what you see in the human.
And then you can treat those mice with drugs and you see how hundreds of tumors in a single mouse respond to treatment with one drug.
Or you can treat many different, let's say 50 different knockouts across a panel of mice with 50 different drugs.
And you can start to build this intersectional pharmacology and, you know, genetic experiment.
On Twitter and in various places, I've heard you say Noetic is no cell lines, no mouse models.
Maybe you even said that, you know, a few months ago.
And then we just said we have a mouse model.
Yes.
And injecting cells, like...
And then the ones, not under the sky.
So, yes.
So, you know, fundamentally, we think it's really important to build models that are trained on human data, and we are sourcing all these tumors to build human-centric models.
So that is also true.
From the very beginning, we had asked this question of, you know, let's say we want to develop a drug from the very beginning, and let's say the FDA, and I know things have changed a little bit with the FDA, but let's say the FDA wants you to have some data in an animal that says your new mechanism works in some animal system.
What do you do?
You're kind of stuck because you've now generated arguably the best data that you can in the human system.
And then the FDA says, well, cool.
But does it work in the mouse?
How does it work in the mouse?
And then so you have to back into this system that doesn't translate.
And so from the very beginning of the company, this has been sort of a question.
And so, probably at the same time we started generating the human data, we started building this mouse platform with the aim of drawing connectivity between these two systems.
And so we focused on a platform.
We wanted a platform that, one, allows you to map a diversity of human tumors, because we know that if we just run a mouse model with one tumor, that tumor has no connectivity.
So in the mouse system, we want to have diversity of tumors.
And we want to see a mapping of diverse tumor biology to the tumor biology that we're seeing in the human across many different mutations.
So we license this system and we've been building it so you can see many different perturbations that produce a lot of the tumor biologies, plural, that you see in the human.
And then we also want to be able to get from this mouse system to biologically relevant, let's say, targets or genes in the human as well.
So one of the fundamental problems in mouse systems is we share many genes with mice, but there are a lot of genes and biological processes we don't share with mice, as is obvious.
And so oftentimes you run into these when you're developing drugs.
It's okay, you have a target, you have some biology that works really well in mice.
Maybe that doesn't even exist in humans, or maybe that pathway is useless in humans.
So one of the things we've started to develop that we'll share more about soon is a way to use one of these models to essentially infer human biology from the mouse directly.
And so we're in silico humanizing the mouse.
So all the outputs in terms of the transcriptome from the mouse are in the form of the human genes.
And so when we read out this mouse system, we were reading out in the form of a human neural compiled.
How do you validate that?
I mean, that's a pretty impressive claim if you can do it, but man, it seems like a tricky validation task.
In my experience, both here at Noetic and at my previous employer, I can say Recursion.
A lot of the approaches you're looking for when you're building these types of models is you're trying to ask whether the models are recognizing biology that you know to be true.
So, for example, in the human context, we know that 12% of patients with lung cancer respond to immune checkpoint inhibitors.
Do the models recognize those patients?
Can they recover those patients without training?
And we see that.
And then when you go look at those patients, we see the underlying features of those patients map to what we know about those patients in the clinic.
In the mouse system, we have control genes.
So we ask, if you look at the mouse tumor embedding space, do the tumors that should be really cold look really cold from the human inference?
Cold in the sense of, like, they don't have immune cells?
Yeah, yeah. And then hot in the sense of, like, lots of immune cells.
So we try to build systems where you have these handholds. And then, you know, the more of these examples that you know to be true that you see work, the more confidence you have. Obviously, when you're into the regime of something very new, it's still uncertain.
So the bridge between the mouse and the human is you build a world model on a human, you build a world model on the mouse, and then you say, what are the parallel structures in the two latent spaces?
Is that kind of the intuition here?
That's one thing that we're doing, but actually this is, like, even simpler, which is that we've trained models on human H&E, spatial transcriptomics, et cetera, and then are just inferencing them on mouse H&E, which is easy to generate.
And apparently mouse H&E looks enough like human H&E that the models think it is perfectly valid H&E and make predictions about, is this, like, immune hot, like immune infiltrated, versus cold versus fibrotic versus some other tumor phenotype?
And those predictions are accurate.
So, you know, these are like some of the controls that Ron mentioned.
So, you know, we know that in mice and humans and everything, if you knock down tumor cells' ability to present antigens to immune cells, you know, those are very cold.
Like, immune cells are nowhere near those tumors.
And, you know, that's exactly what we see in the mouse, and that's exactly what the models, the in silico humanized models predict.
And, you know, then there are other examples where, again, we're recovering the biology that we expect to see there.
And then there are findings that are novel, but also make total biological sense.
For instance, we have done knockouts in the mouse of, let's say, half a dozen genes that are all in the same pathway.
So you might predict that knocking down those genes is going to produce the same phenotype because they're in the same pathway.
And what's a pathway?
Yeah, so a pathway is like protein A signals to protein B signals to protein C.
And, you know, there's like a chain of events that leads to the cell having some behavior, you know, changes in its metabolism, its growth, etc.
So these are, I don't know if you've ever seen these crazy looking protein signaling diagrams that, you know, make you want to stay away from biology.
But, you know, like, you know, people have, you know, worked these out a lot and they know that these two proteins interact physically and signal to each other and so forth.
And so, you know, it's some chain of those interactions, where this protein binds to this protein and that causes it to upregulate a gene that causes this other protein to be formed, blah, blah, blah, until you get to some phenotype, meaning the cell changed the way it looks.
Exactly.
And so, you know, based on decades of biological literature doing experiments on these, there's a very strong biological prior that if you hit gene A, gene B, gene C, and they're all in the same pathway, you should get similar phenotypes.
I mean, this is kind of how like old school genetics was done.
And we see that with these in silico humanized mouse models, which is amazing to me as a biologist, that you have a model that's trained on human data, then you show it some mouse histology, and it's able to say these five different tumor genotypes all look like they have the same phenotype.
And lo and behold, there are, you know, five genes that are in the same pathway.
So you guys, switching gears a little bit, because we want to talk about models on Latent Space Podcast.
You guys recently, there was an interesting blog post, Tario model.
It's some transformer-based model.
Do you want to talk about that?
Sure, yeah.
So this is, like, a new model architecture that we developed post sort of the first virtual cell model, OctoVC, that we developed.
So Tario, this model is just a different transformer architecture, one major difference between it and our prior models.
I guess if this is a model podcast, this is getting into the self-supervised learning objective.
So for a while, including with OctoVC, we were training models on what's called the masked autoencoding loss function or objective, where you have a piece of data, you chunk it up into small chunks, you mask out some of those chunks, and the training task is the model has to predict the masked out chunks from the revealed chunks.
Like BERT.
Yeah, exactly, like BERT.
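The masked autoencoding setup described here (chunk the data, hide some chunks, predict the hidden ones from the visible ones) can be sketched for the data-preparation step alone. The function name, the 90% mask ratio, and the integer stand-in tokens are assumptions for illustration; no model or loss is shown, and this is not OctoVC's actual pipeline.

```python
import random


def mask_tokens(tokens, mask_ratio, rng):
    """Split tokens into visible inputs and masked prediction targets.

    Each entry is kept as (position, token) so a model could still be
    told where every visible or masked chunk sits in the original data.
    """
    n_masked = round(len(tokens) * mask_ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    visible = [(i, t) for i, t in enumerate(tokens) if i not in masked_positions]
    targets = [(i, t) for i, t in enumerate(tokens) if i in masked_positions]
    return visible, targets


# Stand-in tokens: imagine one per (gene, location) or per image patch.
tokens = list(range(100))
visible, targets = mask_tokens(tokens, mask_ratio=0.9, rng=random.Random(0))
print(len(visible), "visible chunks;", len(targets), "chunks to predict")
```

With a high mask ratio the model sees only a small fraction of the chunks and has to reconstruct the rest, which is the regime discussed in the following exchange.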
What are the chunks?
Because this is multimodal, and I would imagine the different channels contain wildly different levels of information.
And I remember seeing something like 99% masking in OctoVC if I'm...
Yeah, yeah.
And I was like, that was kind of surprising because when you have, you know, 19,000 channels and maybe some of the channels are fairly, like, most of the signal is fairly sparse.
Yeah.
Then it seems to me like either there's a huge redundancy here in your data or you really risk, like, just throwing the baby out with the bathwater.
Yeah.
What are the chunks?
That totally depends on which modalities we're talking about.
So for spatial transcriptomics, one chunk or one token might be the level of expression for a particular gene at a particular spatial location.
For protein images, multiplex protein images, again, it might be, you know, the image patch for that particular protein at a particular location and so on.
And, you know, for, like, histology images, again, those are usually just patches of the image, pretty standard vision transformer style.
On the masking, there's the maybe surprising result that you can, and actually need to, mask out large amounts of the data to get the model to learn anything interesting.
If you ran the hypothetical where you only mask out 10% of the image, maybe more like BERT, for instance, in language modeling, what do the models learn? You know, they learn these kind of, like, boring behaviors, like how to, like, continue an edge a little bit, you know, between two, like, regions of an object or something.
So they can learn that task very well, but they don't end up learning anything about sort of the holistic structure of the image data.
And we found pretty early on at Noetic that the same thing was true with these multimodal, like, transformers, where if you mask out a lot of it, there are actually pretty strong correlations between where protein A is expressed and where protein B is expressed, and forcing the models to learn them is really what gives them this predictive power.
And so Tario, though, is an autoregressive model.
Yeah, exactly.
So, yeah, that was going to be the point.
And so, you know, prior models, including OctoVC, were of this masked auto-encoding style training objective.
Tario is an autoregressive model, which if you think about it is kind of a particular choice of masked autoencoding, except, you know, instead of randomly masking out parts of the data, you're always asking the model to predict the next token in a sequence.
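To make the "particular choice of masking" framing concrete, here is a small sketch that turns a token sequence into next-token training pairs: at every step everything from the current position onward is hidden and the model is asked for the next token. The raster-order flattening of a 2D grid is purely an assumption for the example; how Tario actually orders spatial tokens is not stated in this conversation.

```python
def raster_order(grid):
    """Flatten a 2D grid of tokens row by row (an assumed ordering)."""
    return [token for row in grid for token in row]


def next_token_examples(tokens):
    """Build (context, target) pairs for next-token prediction.

    For each position t, the context is every token before t and the
    target is the token at t, i.e. a fixed "mask everything from t on"
    pattern rather than a random mask.
    """
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]


# Toy 2x3 "tissue" grid of stand-in tokens.
grid = [["g1", "g2", "g3"],
        ["g4", "g5", "g6"]]
for context, target in next_token_examples(raster_order(grid)):
    print(context, "->", target)
```

A longer context here just means a longer `tokens` list, which in the spatial setting corresponds to seeing more tissue area at once.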
We know that this is something that scales very well with LLMs, like training on the next token prediction task.
And it's still an open question, how do you get models of other data modalities to scale the way that LLMs have scaled.
Tario was not actually our first attempt, but one of our subsequent attempts to bring that autoregressive like next token prediction task into modeling spatial transcriptomics data.
We found that when we use this architecture and this task, we started to see much better scaling behavior where bigger models and especially at longer context lengths were really outperforming the smaller models at shorter context lengths.
Because they can see further in the image?
Yeah, that's probably a big part of it.
I think there's actually a pretty subtle but very interesting result in that blog post with Tario, which is that you only really see the benefits of using larger models when you're looking at longer context lengths. And here longer context really means, again, like, you're seeing more tissue at once, more area at once.
And I'm not, like, super deep into the language modeling literature, but I don't know if there's an analogous thing with, like, language models where, like, you only see these scaling behaviors at longer context.
So it could be that what we're finding here is that, like, with patient data, you really do need to incorporate sort of more of the patient's spatial context to really get the models to learn these more complicated nonlinear patterns in the spatial transcriptomics and take advantage of it.
Is it possible part of this is because you have some number of low-expression genes and that the behavior is driven entirely by some better modeling of low-expression genes?
Yeah, definitely possible that, like, the more context you have, like, the more likely you are to catch kind of these low-expression but highly predictive genes, etc.
I would guess it's a combination of that and larger area.
Like, we've done some experiments just, like, comparing models with the same amount of context, but in smaller or larger areas.
And there definitely seems to be an advantage to looking at larger regions of tissue as well.
I want to hear about, you did a big deal recently, you got a lot of press, and I think have the distinction of being one of the only AI for bio-tooling companies that is making money.
Accidental.
So can you tell, whatever you can disclose about that, we'd love to hear.
Yeah, so we were really excited to announce a deal with GSK where we licensed them OctoVC, which is our virtual cell foundation model.
So we announced that back in January.
It's a $50 million deal.
It includes an upfront payment, milestones, and then separate from that, it also includes an annual license fee, model licensing fee.
You know, I think this was an attractive deal for both parties, for us and for GSK.
Because, you know, really the deal focuses on models that we've trained already on lung cancer, colon cancer, allows us to, you know, provide them with access to the models.
You know, GSK has one of the top AI teams in biopharma.
So, you know, they know how to use these types of capabilities.
They can use them for their internal use.
They can also use them to fine tune on their data.
So that was a really big sell for GSK as well, because GSK and every pharma is sitting on mountains and mountains of so-called translational data.
So the types of data that we're training the models on come from clinical trials, pathology specimens across many different therapeutics.
Everyone's sitting on a lot of this data and it's been very hard to unlock.
And so all of a sudden, GSK can use our models both to do simulations and to do therapeutic discovery, but they can also fine-tune the models on their data.
And in a way, the model then becomes, you know, sort of GSK's version of the model.
This was super exciting.
You know, it was the first, you know, at least the first announced foundation model licensing deal in the space.
And, you know, frankly, it was one, you know, we've been trying to do for a long time, even before Noetic.
You know, I think a lot of companies have been trying to do these types of deals.
And it's been, I think it's been historically slow for adoption on the pharma side.
And it's been slow to demonstrate like a very clear value proposition for different types of capabilities.
And so what's unique about this deal is it looks, you know, it doesn't look exactly like a software, you know, licensing framework for, let's say, a small amount of money with number of seats where you're licensed.
It looks like a real business development deal in the industry where there's a very significant multimillion dollar cash up front near term payment.
But then the substrate of the deal is not a molecule.
It's not doing therapeutic discovery work together.
The substrate is actually a model, which is what really made this pretty unique.
Why do you think there's appetite for this suddenly?
And it seems like almost whiplash that, you know, it seems like only maybe a year or two ago that bio was dying, and now suddenly there's this deal, Boltz is getting a ton of attention.
There's so much attention on Isomorphic.
People are AI-pilled.
To some extent, increasingly more.
I mean, maybe not totally, but increasingly more.
People are, you know, in pharma, you know, across the industry are seeing the value of different capabilities.
They're able to use some of the open source capabilities and they're able to demonstrate the value to themselves internally.
And if you look at a pharma company, you know, these companies are working on dozens and dozens of programs.
And so, you know, just frankly, my opinion is I think pharma increasingly wants to be able to access models, not just for one collaboration where you and I are working together on this one program.
They want to be able to access the technology across the whole pipeline.
And so I think that's going to create sort of a driving force for not just, you know, bespoke project-driven licensing, but actual broad licensing where a pharma can access the technology in many different therapeutic programs.
Yeah.
And I think also, you know, with the structure prediction models, protein structure prediction, binding prediction models, there is, like, this massive public data set.
There are increasing amounts of data.
People can generate data to augment that.
So, you know, there's enough data to the point where people can train very good models, but maybe not just on the data that any one biopharma company has.
And I think that the same is true, but even more so, for the types of models that we are building, which are, you know, foundation models at the patient biology level, where, like, you know, no one company, I mean, these companies may have a lot of data, but it's, you know, scattered, it's siloed, and pulling everything together to, like, train an actual foundation model may not be as easy as it sounds, like, within a single company.
Whereas we have just said, you know what, we're going to generate enough data ourselves to actually train a real foundation model.
And that's the nice thing about being a startup here, is, like, we can make that bet that, like, you actually do benefit from generating all of this data in a, you know, uniformized way, like very high quality, etc., and then use that to develop and train the models.
And my opinion is that you need to have data at that scale before you can even think about developing models that actually work.
It's like you can't do the AI R&D, like, or build the algorithms until you have good enough data set to tell you whether your favorite algorithmic idea is actually working or not.
That's a major advantage for us is, like, we have enough data to see, like, is my idea or someone else's idea about how to build a model, like, actually leading to improvements there.
Yeah, I mean, this is a good point.
I mean, so, like, sometimes people ask me, well, why doesn't GSK just generate your data?
So we just started generating data for years.
There was no model.
It was like, how many years?
Like how, like two years, maybe a year and a half, at least before we had the first trained models working, like maybe a year and a half we had the first.
So I mean, certainly, yeah, like the OctoVC model, like, we trained in 2024 or so.
Yeah.
That's like two years after.
Yeah.
So we, how?
Zero four years of SIL.
So this is year four.
And so we basically opened the lab.
We hired a team.
We got all the instruments.
We started sourcing tumor samples.
There was no prior here that any of this would work.
Big zero.
Big crazy.
Like, I was just going for it.
And like, we just started generating data and like sourcing human tumors.
We've built this whole processing pipeline to get the tumors into like these arrays and the formats.
And it takes weeks to, you know, it takes literally two weeks for a machine to run a couple slides on the spatial transcriptomics.
So you've got like these two-week runs where you're processing two slides.
And we're just churning data for months.
And we couldn't even train, we didn't even have enough data to train a model for like at least a year and a half.
And then you're building, like, processing pipelines. You have to align all the data. You've got to, like, post-process it off the machine.
So we sort of just built all this. And then, like, let's say 18 months later, hey, I wonder, can we train a model? And then it was not, like, it wasn't obvious. There wasn't, like, oh, we're going to, like, off the shelf, you know, train this on some, like, open source architecture. You know, Dan and the team have done a ton of work.
Yeah, there wasn't really, like, anything major to go off of. I mean, there were, like, transformers developed for single cell data, but, like, incorporating spatial data into that was, you know, again, there just, like, weren't really data sets out there that people had been able to develop on.
So we do a lot of, like, custom model building, and I enjoy that.
I think people enjoy that, so, a plug for joining.
A lot of custom models to build.
Yeah, really unique, innovative, involved.
Sorry, who are you looking for?
Like what kind of people?
Anybody excited about doing ML research on, again, this kind of alien landscape of data where you really have to figure out what's working from first principles and obviously the work we do should have very, very large impact.
So definitely not restricted to people who have a biology background. People who just like tackling very challenging machine learning problems and are open to learning the minimum amount of biology necessary to, like, make progress, I think, you know, would be great candidates.
Talking to you guys reminds me a lot of Leash Bio, and I know that both of you are part of the Recursion mafia.
Why not?
Yeah, yeah, yeah.
They're going to be on the show in the future, too.
So, yeah, yeah.
We're looking forward to that.
But, like, it's interesting because both of you seem to have really similar philosophies in that, like, you have deep convictions that you're just going to start collecting data before you know this is going to work.
And you are going to just brute force it, go, go, go, and eventually it will work.
And, you know, you have signs.
I don't know.
I think that's really impressive.
I wonder, is there something about Recursion, something in the water, which has led to this sort of thinking of just, like, we're going to commit to doing things at scale and it may not work at first?
You have to hit a certain point before it will.
I mean, we failed a lot at the beginning.
Yeah.
You mean at Recursion.
At Recursion, yeah.
And so you, and we had, I said we had to build it from first principles and we really did.
And so we spent many years trying to figure out like, what should the data look like?
Ian, myself, we were all involved in kind of platform development: how to design these data sets, how to design the experiments, iterative cycles over the years, seeing things that did work, things that didn't work.
And so coming out of Recursion, I think what a lot of folks there had was, like, an understanding of what are the things we need to think about, so that even if I wanted to design a different data set today, that's, like, totally different.
What are the things that we learned, and that we had to learn through, not mistakes, but, like, trial and error basically over that many months, that we would try to insert in our new approach?
And so I don't know that everything that I've predicted at Noetic in terms of like how to generate the data set has been important necessarily.
I know that we could start at the very beginning and say, okay, well, let's make sure we do these 10 things.
I know every one of these 10 things was important before.
Let's at least make sure we do these 10 things.
I don't know that all 10 things are important for us today, but I would presume that, you know, many of them are.
And it lets you sort of leapfrog that process of trial and error a little bit.
Certainly we do have trial and error still, but hopefully we're not having to, you know, solve like, you know, 15 problems.
Maybe we're only solving three problems, four problems over time.
So for small biotech startups, which are probably in the AI space, who are collecting their own data, their own data moat.
Do you have any advice or any suggestions about how to be more successful there?
I think you sort of need to think ahead to, okay, what am I trying to do on the machine learning side?
And what is the right data for solving this problem?
I think oftentimes I see a lot of companies are like, okay, well, I want to generate X data set.
I'm just going to generate X data set and I'm going to do machine learning on that.
That might not be the right data set.
You might not have designed it the right way.
You know, it doesn't follow that, like, any data set is a machine learning data set.
It doesn't follow that that data set solves the problem you're trying to solve.
So for me, and even in founding Noetic, it was: okay, what problem are we trying to solve, and then what are the data that are going to help solve that problem, rather than, you know, going from the data directly to trying to solve it.
I also, sorry, I also have a quick piece of advice, which is like, you know, pay attention to where the technology is and, you know, where it's changing rapidly.
So, you know, I finished my PhD in 2016.
I did a lot of looking at spatial RNA, like via this technique called in situ hybridization, same technique that is like at the base of what we're doing.
I could look at maybe two genes at a time on a single sample.
And that took me a full week of manual work.
And, you know, I came to Noetic like five years later, six years later, and all of a sudden, you know, there are platforms where you can look at a thousand genes or 20,000 genes at once.
You know, it's a single machine that can run this assay.
It's expensive, but it's just, like, data beyond the wildest dreams of Dan Baer in 2016.
And that is only improving rapidly.
So I think it's important to see what the technology of today allows and also where it's going in terms of what data to generate.
And what does that pitch look like?
So I'm going to generate data for a year and a half and then I spend $50 million and then...
If it wasn't 50, it was maybe closer to 10.
But if...
So yeah, you have to do that.
If you're going into a regime where there's no data and you want to do something different, then there's no shortcut to it, right?
You're going to have to generate the data set.
And so you're not going to know the answer until it's there.
And that's why a lot of companies are not going into that space where there are no data sets because I think it can be challenging to do that.
I mean, I think a lot of smaller biotech AI startups will try this pattern where they will either start with a public open source data set, or they will try a pilot where they internally collect a small amount of data and see if something works or it doesn't.
And oftentimes, there's almost like a critical point where below this, you're just not going to get a new signal.
And you have to have conviction that you need to collect up to a certain point before you start, like, really deriving something fundamentally valuable.
Yeah.
Yeah.
I mean, imagine trying to train a foundation model on not enough data.
Yeah.
And then that's, it's sort of your clinical trial.
GPT-2, GPT-3, you know, GPT-1, 2, and 3, like, there was a clear progression there with each one of them.
You could see there was something which worked with scale and there was this insight to, oh, we're going to scale this up.
Yeah.
You know, with some kinds of biological data, the process of collecting lots of data is just very expensive to begin with.
You can't just take something off the shelf and expect that you're going to hit the threshold of, you know, GPT-3-like usefulness.
Yeah, yeah.
So, yeah, it takes some conviction.
It definitely takes conviction.
I think, you know, it also takes sort of like a scientific belief that there's a lot out there that we just don't know yet, and that you're not going to capture the biology you need by having, right now, like, an agent that reads all of the biological literature, because again, that's just like a tiny slice of what's out there.
I don't know if it's a great analogy, or if I'm going to botch the history here, but in astronomy it required, like, Tycho Brahe collecting this enormous amount of astronomical data at his observatory, which then was the substrate for Kepler, you know, figuring out the first laws of motion of the planets.
And then, you know, that was superseded by like Newton's laws and so forth.
But, like, I sometimes don't know how you even get started without, like, this large repository of really high quality data to begin with.
And, you know, maybe there's like a tragedy of the commons problem here of like, who's going to generate that data and who's going to capture the value of it.
I'm very glad that we're taking that bet and, you know, we're seeing it pay off.
Yeah, I mean, this is not my expertise, but if, you know, hypothetically speaking, yeah, how much of PDB do you need to train?
I mean, there were some people that, yeah, and then you can get some pretty good models with, I think, one percent.
One percent?
Yeah, really.
And there are people going back to the 1990s who argued that the PDB was already complete, in the sense of, like, if you had a sufficiently smart algorithm, you could have done a pretty reasonable job of protein folding even back then.
Interesting.
So you don't need a lot to get a pretty big boost, but the community was sort of independently collecting PDB data for quite some time without necessarily being convinced that this was going to lead to solving protein folding.
Yeah, but then also, most of those structures were quite useful in and of themselves.
So maybe that's the counterpoint: oftentimes just knowing a protein was very helpful for some useful data set.
And we did see, we did see a transition from like early data.
How many samples did we do?
I'm guessing probably on the order of a few hundred before there was like...
Yeah, there was definitely a moment like very soon after I joined where like we, the data set just kind of doubled in size overnight because there was like a huge bolus and like the models immediately got a lot better at that point.
And, you know, now we'd run these more controlled experiments of seeing, you know, what happens if you train on 10% of the data versus 40% versus 100%.
What happens if you hold out all of the pancreatic cancer or all of the breast cancer?
So, you know, we have a much better idea of what kind of diversity and scale we need now.
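The controlled experiments described here, training on a fraction of the data or holding out an entire cancer type, can be sketched roughly as below. This is only an illustrative sketch, not Noetic's pipeline: the function names (`subsample_by_fraction`, `leave_one_indication_out`), the `indication` field, and the toy cohort are all assumptions made up for the example.

```python
import random
from collections import defaultdict


def subsample_by_fraction(samples, fraction, seed=0):
    """Return a random subset with roughly `fraction` of the samples."""
    rng = random.Random(seed)
    k = max(1, round(len(samples) * fraction))
    return rng.sample(samples, k)


def leave_one_indication_out(samples):
    """Yield (held_out, train, test) with one cancer type fully held out."""
    by_indication = defaultdict(list)
    for s in samples:
        by_indication[s["indication"]].append(s)
    for held_out in sorted(by_indication):
        train = [s for s in samples if s["indication"] != held_out]
        test = by_indication[held_out]
        yield held_out, train, test


# Hypothetical toy cohort: IDs and indication labels are invented.
cohort = [
    {"id": f"s{i}", "indication": ind}
    for i, ind in enumerate(["pancreatic", "breast", "lung", "colon"] * 5)
]

# Data-scaling ablation: train on 10%, 40%, and 100% of the cohort.
for fraction in (0.1, 0.4, 1.0):
    subset = subsample_by_fraction(cohort, fraction)
    print(f"fraction={fraction}: {len(subset)} samples")

# Generalization ablation: hold out every sample of one cancer type.
for held_out, train, test in leave_one_indication_out(cohort):
    print(f"held out {held_out}: train={len(train)}, test={len(test)}")
```

Comparing model quality across the fraction subsets gives a rough scaling curve, while the held-out-indication splits probe whether what the model learns transfers to a tissue it has never seen.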
I guess I would say if we were sticking to cancer, maybe we're not like that far off.
I think, you know, again, if we end up generating a few hundred patients in a bunch of major and, you know, some minor indications, which we're, you know, going to do this year, like, maybe that's enough to generalize to kind of all cancer.
Because there is a lot of shared biology in, you know, cancer and immune cells across different tissues and different, you know, mutations and so forth.
But if you think about all of the disease biology that there is for a model to learn, you know, maybe that's like another order of magnitude.
But even being able to solve all cancer, I think, would be pretty impressive.
Yeah, to cure cancer would be great.
Well, he said solve all cancer biology. He did not say cure cancer, which is a different place.
But yeah, at least if you go to Madeline, just sort of, like, just take one drug, if you could look at one drug mechanism across the whole of oncology, that's incredibly powerful.
I mean, imagine what Merck has done with Keytruda.
Merck has run hundreds of trials with Keytruda.
It might even be over 1,000 trials of Keytruda in different populations to find all these different indications.
Okay, the subset of ovarian cancers, the subset of lung cancers, the subset of colon cancers.
That's all been done by enrolling trials.
If you can look at that biology from model embeddings and at least have a very well-defined starting point for, okay, if I'm going to run a trial, it doesn't have to be as broad as it would need to be if I didn't have any answer, then that can be a really powerful tool for, you know, a diversity of mechanisms.
Yeah.
Maybe just, like, a last point, going back to the virtual cell hot takes.
Like, you know, if your goal is to build, like, an actual mechanistic model of an individual cell, and then build up from one cell to an entire tissue, and then, you know, tissue to patient and so forth, you might need a lot more data and a lot more data modalities than, you know, just gene expression or something like that.
But, you know, we're taking much more of a top-down approach: we're trying to first solve the problem of what is determining heterogeneity among actual patients, and which of that variability is predictive of drug response.
And my intuition is that you don't need to model the mechanism at the subcellular level necessarily to solve that problem of which patient should get which drug or, you know, which targets are important in which patients.
And I saw a similar debate play out in neuroscience and computational neuroscience where for a long time people were really trying to build these biophysical models of individual neurons, and then they were going to stitch them together into models of, you know, the brain and so forth.
And what actually ended up working, you know, in terms of building computational models of the brain and behavior, is this abstraction: we're just going to treat individual neurons as, you know, linear-nonlinear units and put them together in neural networks that are connected by linear weight matrices, and, you know, stack a bunch of layers together and then build neural network models of the brain that abstract away kind of all of the details of biophysically what a neuron is doing.
And, you know, those are now by far the most predictive models of how a given neuron is going to respond to real world stimuli in a real brain.
And I think that my bet is that the same is going to be true for these models too, is that like by modeling sort of at the level of functional tissue where you have a bunch of cells interacting in like a disease context, that that's going to get you to the problem of predicting kind of the patient level behavior much faster than trying to first model a cell and then stitch a bunch of those cells together.
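The neuroscience abstraction described above, neurons as linear-nonlinear units joined by weight matrices and stacked in layers, can be written down in a few lines. This is a minimal sketch under stated assumptions: the ReLU nonlinearity, the function names (`ln_layer`, `ln_network`), and the hand-picked weights are illustrative choices, not anything specified in the conversation.

```python
def ln_layer(inputs, weights, biases):
    """One layer of linear-nonlinear units.

    Each unit computes a weighted sum of its inputs (linear stage) and
    passes it through a ReLU (nonlinear stage; ReLU is an assumption).
    """
    outputs = []
    for w_row, b in zip(weights, biases):
        drive = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(max(0.0, drive))
    return outputs


def ln_network(inputs, layers):
    """Stack layers: the output of one layer is the input to the next."""
    activity = inputs
    for weights, biases in layers:
        activity = ln_layer(activity, weights, biases)
    return activity


# Hypothetical two-layer network with made-up weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]),  # 2 inputs -> 2 units
    ([[1.0, 1.0]], [0.0]),                      # 2 units -> 1 unit
]
response = ln_network([1.0, 0.5], layers)
print(response)
```

Nothing in the unit models ion channels or membrane dynamics. The only thing each unit keeps is an input-output mapping, and that is the level of abstraction the speaker bets will carry over to tissue-level models.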
Yeah, that makes sense to me.
It's a good analogy.
Do you have any call to action for the listeners?
Yeah.
I mean, I would say, one, everyone should be excited about biology.
You know, sometimes a lot of my hot takes on X recently are just that I feel like there's a huge amount of enthusiasm in sort of like the mainstream tech ecosystem.
And like people aren't really following a lot of like what's happening in the biology space.
But at the same time, like, you're hearing, you know, frontier labs saying we're going to cure cancer.
People should actually look at the folks working on curing cancer or working on aging or working on areas of biology.
These are really exciting, you know, problems.
There are real, like, significant ML problems in the space.
One call to action is I would love for people to just, like, be more stoked about learning about applications of machine learning in, like, biological sciences and, like, solving some of these hard problems.
Because I think these are the problems that are going to, like, massively impact humanity in, like, the next 10 years.
And we're just, like, really at the very beginning.
Like, you know, maybe we're in the, like, first inkling of the ChatGPT moment for bio, but it's, like, very much just the very beginning.
So we'd like...
Catch it while you can...
Yeah.
Yeah.
In line with that, too: like, really dig in and learn more about the details.
I think, you know, a lot of the time it's presented as we have these protein folding models, we have these binding models, you know, we have AI for science agents that are, you know, like, reading all of the literature and automating these computational biology workflows.
And I think it's important to realize that there are a lot of problems in AI for biology, AI for biochemistry, etc.
And they're very important, but, like, solving any one of those is not going to solve the problem of how do we develop better therapeutics.
And, you know, we're focused on, you know, a pretty particular slice of that process, which is, again, translating things that we know work well in some patients into actual, like, successful drug trials where we know exactly which patients to give them to.
And that requires building foundation models at a particular level, you know, the patient level.
But people should not be under the impression that, like, this is all going to be solved immediately because, you know, AI agents, like LLMs, are going to just read the literature and figure out what the right drug is.
Like, there's a lot more data to generate, there's a lot more ML problems to solve, and there's the need to translate those methods into actual successful drugs.
And there's a lot of different places to contribute.
It's a lot to do.
Yeah, great.
Thank you very much.
Here we are.