AI Foundation Models and the Future of Precision Oncology
An exploration of how Noetic is leveraging multimodal foundation models to solve the patient selection problem in cancer drug development, moving away from traditional cell lines toward patient-centric data moats.
The Patient Selection Crisis in Oncology
For decades, the pharmaceutical industry has grappled with a staggering failure rate: 90-95% of cancer drugs fail in clinical trials. The prevailing assumption has been that these failures stem from flawed pharmacology or target selection. A contrarian thesis holds that the real bottleneck is patient selection: the industry is proficient at creating molecules but remarkably poor at identifying the specific patient cohorts for whom those molecules will actually work.
Moving Beyond "Frankensteinian" Data
Traditional drug development relies heavily on immortalized cell lines and animal models. These "Frankensteinian" cells often possess genomes and expression patterns that do not represent any living human cell, creating a massive translational gap. To bridge this, the industry must shift toward high-fidelity, multimodal patient data.
By integrating H&E histology, protein stains, and spatial transcriptomics, it is possible to build a "world model" of human biology. This approach allows researchers to move beyond simple biomarkers—which often have weak correlations with success—and instead identify complex, non-linear biological subtypes that dictate drug response.
The Strategy of the Biological Data Moat
In the realm of AI for biology, brute-forcing data collection is insufficient. Success requires an intentional data moat. This involves designing datasets with the end-model in mind, ensuring high quality and consistency to avoid batch effects. The transition from masked autoencoding (as seen in OctoVC) to autoregressive models (like Tario) demonstrates that scaling context length—seeing more of the tissue architecture—is critical for improving predictive power.
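To make the difference between the two training objectives concrete, here is a minimal sketch in plain Python. It is not the architecture of either model named above. It assumes, purely for illustration, that a tissue image is cut into a square grid of patch tokens; the function names, the 75% mask ratio, and the grid sizes are hypothetical. It shows which positions each objective predicts and how quickly the token count (the context length) grows as the model looks at a wider area of tissue.

```python
import random


def masked_autoencoding_split(num_tokens, mask_ratio, seed=0):
    """Masked autoencoding: hide a random subset of tokens and
    reconstruct the hidden ones from the visible remainder."""
    rng = random.Random(seed)
    num_masked = round(num_tokens * mask_ratio)
    masked = sorted(rng.sample(range(num_tokens), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_tokens) if i not in masked_set]
    return visible, masked


def autoregressive_pairs(num_tokens):
    """Autoregressive modeling: predict token t from all tokens before it.
    Returns one (context_positions, target_position) pair per target."""
    return [(list(range(t)), t) for t in range(1, num_tokens)]


def tokens_in_window(side_patches):
    """Tokens needed to cover a square window of side_patches x side_patches
    (assumed tokenization: one token per patch)."""
    return side_patches * side_patches


visible, masked = masked_autoencoding_split(num_tokens=16, mask_ratio=0.75)
pairs = autoregressive_pairs(num_tokens=16)

print(len(visible), len(masked))                     # 4 12
print(len(pairs))                                    # 15
print([tokens_in_window(s) for s in (4, 8, 16)])     # [16, 64, 256]
```

Under these assumptions, doubling the side of the tissue window quadruples the number of tokens, which is why "seeing more of the tissue architecture" translates directly into a demand for longer context.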
A New Business Model for Biotech
The commercial landscape is shifting from bespoke, project-based collaborations to the licensing of foundation models. A landmark $50 million deal with GSK exemplifies this shift. Rather than licensing a specific molecule, the value resides in the model itself, which can be fine-tuned on a pharma company's internal proprietary data. This transforms AI tools from niche research aids into scalable platforms for therapeutic discovery.
Conclusion
The path to curing cancer lies not just in better chemistry, but in better data architecture. By building foundation models that can simulate patient biology and predict responses in silico, the industry can finally move toward a future of truly personalized medicine.
Key insights
- The primary cause of cancer drug failure in clinical trials is poor patient selection rather than deficiencies in pharmacology or molecule design.
  Impact: Shifts R&D investment from target discovery toward the development of high-precision patient stratification tools.
- Traditional cell lines are inadequate for predictive modeling because they are biologically divergent from actual human tumors.
  Impact: Forces a move toward patient-derived multimodal data, increasing the value of companies that can source and process human tumor samples.
- Autoregressive transformer architectures (e.g., Tario) scale more effectively with longer context lengths in spatial biology than masked autoencoders.
  Impact: Enables the development of models that understand holistic tissue architecture rather than just local cellular patterns.
- In silico humanization is possible by training models on human data and using them to interpret mouse histology, bridging the translational gap.
  Impact: Significantly reduces the risk and cost of animal testing by providing more accurate human-centric predictions.
- The biotech business model is evolving from project-based service agreements to broad foundation model licensing.
  Impact: Creates high-margin, scalable recurring revenue streams for AI-biotech startups.
Action items
- Build a proprietary, high-quality multimodal dataset (H&E, protein, and RNA) before attempting to train foundation models.
  Impact: Prevents algorithmic failure caused by noisy or insufficient data and creates a defensible competitive moat.
- Prioritize the development of 'world models' that can simulate counterfactual perturbations (e.g., gene knockouts) rather than simple classification models.
  Impact: Accelerates the identification of novel therapeutic targets and predicts drug efficacy before clinical trials.
- Shift preclinical validation from immortalized cell lines to in vivo systems whose diversity maps onto human tumor biology.
  Impact: Increases the probability of clinical success by ensuring preclinical signals translate to human patients.
- Design data collection pipelines specifically to control for batch effects by distributing single patients across multiple processing arrays.
  Impact: Ensures that model embeddings represent actual biology rather than technical artifacts of the lab process.
- Explore licensing frameworks that allow partners to fine-tune foundation models on their own siloed internal data.
  Impact: Lowers the barrier to adoption for large pharma companies while maintaining the core IP of the model provider.
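The batch-effect action item above (distributing a single patient's samples across multiple processing arrays) can be sketched as a simple layout routine. This is an illustrative assumption, not the pipeline described in the source: the function names, the patient IDs, and the deterministic round-robin rule are all hypothetical, and a real design would likely add randomization and balance for tissue type or collection site.

```python
from collections import defaultdict


def assign_to_arrays(samples_per_patient, num_arrays):
    """Spread each patient's samples over consecutive arrays (round-robin).

    A running slot counter is shared across patients, so array loads
    stay within one sample of each other.
    """
    layout = defaultdict(list)
    slot = 0
    for patient in sorted(samples_per_patient):
        for sample_idx in range(samples_per_patient[patient]):
            layout[slot % num_arrays].append((patient, sample_idx))
            slot += 1
    return dict(layout)


def arrays_per_patient(layout):
    """Count how many distinct arrays each patient appears on."""
    seen = defaultdict(set)
    for array_id, entries in layout.items():
        for patient, _sample_idx in entries:
            seen[patient].add(array_id)
    return {patient: len(arrays) for patient, arrays in seen.items()}


# Hypothetical cohort: patient ID -> number of tissue samples.
cohort = {"P1": 3, "P2": 2, "P3": 4}
layout = assign_to_arrays(cohort, num_arrays=3)

print(arrays_per_patient(layout))   # {'P1': 3, 'P2': 2, 'P3': 3}
print({a: len(entries) for a, entries in layout.items()})  # {0: 3, 1: 3, 2: 3}
```

Because no patient is confined to one array, a model cannot separate patients simply by learning array-specific staining or processing artifacts, which is the point of the design.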
Quotes
“Most of those drugs fail, we'd argue, because... We're bad at selecting which patients those drugs will work in.”
“In bio, you really need to be intentional about the data that you generate and how you generate it and have some foresight around, well... what are the models we're going to want to train.”
“The substrate of the deal is not a molecule... The substrate is actually a model.”