<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Standard Model]]></title><description><![CDATA[Standard Model]]></description><link>https://blog.standardmodel.bio</link><image><url>https://substackcdn.com/image/fetch/$s_!M-rU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515c2dee-6942-4f20-ae48-491f0d4cc29a_817x817.png</url><title>Standard Model</title><link>https://blog.standardmodel.bio</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 14:04:32 GMT</lastBuildDate><atom:link href="https://blog.standardmodel.bio/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Standard Model Biomedicine, Inc.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[info@standardmodel.bio]]></webMaster><itunes:owner><itunes:email><![CDATA[info@standardmodel.bio]]></itunes:email><itunes:name><![CDATA[Arda Pekis]]></itunes:name></itunes:owner><itunes:author><![CDATA[Arda Pekis]]></itunes:author><googleplay:owner><![CDATA[info@standardmodel.bio]]></googleplay:owner><googleplay:email><![CDATA[info@standardmodel.bio]]></googleplay:email><googleplay:author><![CDATA[Arda Pekis]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Patient is Not a Document: Moving from LLMs to a World Model for Oncology (Part 2)]]></title><description><![CDATA[We Tested Our Model with Memorial Sloan Kettering Cancer Center&#8217;s (MSK&#8217;s) Data while Participating in the MSK iHub Challenge 2025 Cohort Program, and We&#8217;re Releasing the Model Weights to the 
Public]]></description><link>https://blog.standardmodel.bio/p/the-patient-is-not-a-document-moving-f88</link><guid isPermaLink="false">https://blog.standardmodel.bio/p/the-patient-is-not-a-document-moving-f88</guid><dc:creator><![CDATA[Zach Chen]]></dc:creator><pubDate>Wed, 17 Dec 2025 15:51:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ec36a4bb-02e0-4978-a067-66fed3016d3c_3600x1881.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><strong>TL;DR: We validated our Standard Model on real-world oncology data from Memorial Sloan Kettering Cancer Center (MSK) during the 2025 iHub Challenge Cohort program, proving that a specialized architecture designed to simulate biological dynamics </strong><em><strong>S(t)</strong></em><strong> delivers superior predictive performance on complex, non-linear clinical tasks&#8212;and <a href="https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure">we&#8217;re making those model weights available today</a>.</strong></p><p>In Part 1, we introduced <strong><a href="https://blog.standardmodel.bio/p/the-patient-is-not-a-document-moving">Standard Model v1</a></strong>, our biological world model built on top of the <a href="https://arxiv.org/pdf/2301.08243">Joint-Embedding Predictive Architecture</a> (JEPA).
The core hypothesis of this design is that by training on cause-and-effect pairs (State at time <em>t</em> + Intervention &#8594; State <em>t+1</em>), the model learns a high-fidelity <em>patient state embedding</em> that captures the underlying biological dynamics of the disease.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n38D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n38D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png 424w, https://substackcdn.com/image/fetch/$s_!n38D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png 848w, https://substackcdn.com/image/fetch/$s_!n38D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png 1272w, https://substackcdn.com/image/fetch/$s_!n38D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n38D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png" width="1456" height="983" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:983,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1235641,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.standardmodel.bio/i/181196973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!n38D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png 424w, https://substackcdn.com/image/fetch/$s_!n38D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png 848w, https://substackcdn.com/image/fetch/$s_!n38D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png 1272w, https://substackcdn.com/image/fetch/$s_!n38D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff29fcbd6-bb22-47a5-b44a-cd882ea2a286_4962x3351.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: <strong>The Standard Model Architecture</strong>. It utilizes a Joint-Embedding Predictive Architecture (JEPA) to conditionally predict patient trajectories within the latent space rather than classifying static snapshots.</figcaption></figure></div><p>However, model validation presents a nuance. Logic suggests we should ask the model to generate the &#8220;future trajectory&#8221; (e.g. predicted text or synthesize a follow-up CT scan) and judge the output. But our Standard Model projects the patient journey <strong>in the latent space</strong>, not in the pixel space. 
Therefore, evaluating based on raw signal generation is a category mismatch.</p><p><strong>Instead of <em>judging raw signal generation</em>, our evaluation tests the quality of the patient representation itself.</strong></p><p>In a functioning World Model, <em>S(t)</em> is not just a summary of the past; it is a predictive vector that explicitly encodes the trajectory of the future. To verify this, we use a <strong>frozen encoder</strong> protocol: we freeze the model&#8217;s weights and attach a simple linear probe. Think of this linear probe as a &#8220;truth serum.&#8221; It is too mathematically simple to learn complex disease patterns on its own. For the probe to succeed, the prognosis must already be structured within the patient&#8217;s embedding.</p><h3><strong>The Proving Ground: The MSKCC Oncology Cohort</strong></h3><p>To stress-test a World Model, one needs more than just &#8220;labels&#8221;; one needs high-density longitudinal history. We utilized a massive, high-fidelity cohort from Memorial Sloan Kettering Cancer Center (MSKCC) during the <a href="https://www.mskcc.org/commercialization/programs-accelerators/msk-innovation-hub/msk-ihub-challenge">iHub Challenge</a>, comprising <strong>23,319 patients</strong> and over <strong>323,000 patient-years</strong> of data.</p><p>What makes this dataset unique for a JEPA-based architecture is its <strong>temporal density</strong>. With a median follow-up of over <strong>50 months</strong> and an average of <strong>127 clinical events per patient</strong>, the dataset provides a &#8220;high-resolution&#8221; movie of cancer progression rather than a grainy snapshot. The coverage is near-universal: 100% of patients have confirmed pathology and outcomes, and 95.6% have deep genomic biomarker data.
This depth allows the Standard Model to move beyond simple text-matching and actually model the interaction between systemic therapies (94.8% coverage) and the biological state of the tumor.</p><h3><strong>Evaluation on MSK&#8217;s Cancer Dataset</strong></h3><p>Our evaluation pipeline is designed to prove that the model has internalized clinical logic rather than just memorized data.</p><h4><strong>The Workflow: From History to Embedding</strong> </h4><ul><li><p><strong>Generate:</strong> We feed the de-identified patient&#8217;s longitudinal history up to a specific time point <em>t</em> into the model. Using our <a href="https://github.com/standardmodelbio/smb-biopan-utils">utility function</a>, we convert raw clinical events into a chronological narrative (e.g., [2024-01-10]: Glucose: 110. [2024-01-12]: Dyspnea, etc.).</p></li><li><p><strong>Embed:</strong> The model acts as a <strong>Frozen Backbone</strong> and outputs a single, fused patient state embedding <em>S(t)</em> to ensure we are testing the <em>existing</em> representation.</p></li><li><p><strong>Probe:</strong> We train a simple, lightweight linear survival head (e.g., using <a href="https://arxiv.org/abs/1606.00931">CoxPH loss</a> as the learning objective) on top of this embedding to predict future outcomes.</p></li></ul><h4><strong>The Methodology: &#8220;Point-in-Time&#8221; Construction</strong></h4><blockquote><p><em>Oncology is not a static classification problem; it is a trajectory of shifting risk.</em> </p></blockquote><p>A patient&#8217;s prognosis changes dramatically between diagnosis, remission, and recurrence. To capture this, we moved beyond &#8220;one-patient-one-label&#8221; and adopted a <strong>point-in-time </strong>evaluation framework.</p><ul><li><p><strong>Selecting Indexing Dates (</strong><em><strong>t=0</strong></em><strong>): </strong>We slice a patient&#8217;s timeline into multiple examples based on clinically significant <em>decision nodes</em>.
These are moments where the state of the patient undergoes a potential transition, and where a World Model&#8217;s predictive capability is most critical. At each indexing date, we mask all future data. The model must predict the trajectory solely based on the history available at <em>t=0</em>. Our generator scans the MSKCC patients&#8217; event logs for five specific triggers to establish an indexing date:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VSLJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VSLJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png 424w, https://substackcdn.com/image/fetch/$s_!VSLJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png 848w, https://substackcdn.com/image/fetch/$s_!VSLJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!VSLJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VSLJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png" 
width="728" height="225.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:451,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:441179,&quot;alt&quot;:&quot;wew&quot;,&quot;title&quot;:&quot;wew&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.standardmodel.bio/i/181196973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0410b7fa-0e78-4dd4-b3cf-188c7b89432f_4860x1767.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="wew" title="wew" srcset="https://substackcdn.com/image/fetch/$s_!VSLJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png 424w, https://substackcdn.com/image/fetch/$s_!VSLJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png 848w, https://substackcdn.com/image/fetch/$s_!VSLJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!VSLJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89097330-a7f6-4030-b7a5-5053359a4f43_4860x1504.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Table 1: <strong>Indexing Date Triggers</strong>. 
The generator scans patient event logs for these five specific clinical triggers to establish a <em>decision node </em>(t=0) where future data is masked.</figcaption></figure></div><ul><li><p><strong>The Downstream Tasks (The &#8220;Future&#8221;): </strong>To ensure the embeddings <em>S(t)</em> are robust, we use the linear probe to evaluate them against the five distinct clinical dimensions below:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mh-8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mh-8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png 424w, https://substackcdn.com/image/fetch/$s_!mh-8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png 848w, https://substackcdn.com/image/fetch/$s_!mh-8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png 1272w, https://substackcdn.com/image/fetch/$s_!mh-8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!mh-8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png" width="1456" height="372" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1004030,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.standardmodel.bio/i/181196973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6145e8e-d7c2-4fb1-abc4-3f61c8d544ef_7446x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mh-8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png 424w, https://substackcdn.com/image/fetch/$s_!mh-8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png 848w, https://substackcdn.com/image/fetch/$s_!mh-8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mh-8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F904d8fd7-cb63-47f1-aa1f-c82494941ff9_7446x1902.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Table 2: <strong>Downstream Clinical Evaluation Tasks</strong>. To validate the robustness of the Point-in-Time embeddings, we evaluate performance across five distinct clinical dimensions using a linear probe.</figcaption></figure></div><ul><li><p><strong>Task Filtering: </strong>We don&#8217;t just test on every available label.
To ensure statistical power, our framework only evaluates tasks that meet a high bar: a minimum of 500 total samples and at least 50 positive cases, ensuring the model isn&#8217;t just &#8220;guessing&#8221; on rare events.</p></li></ul><h3><strong>Ensuring Scientific Rigor</strong></h3><p>Evaluating on Real-World Data (RWD) introduces risks of data leakage that do not exist in curated academic datasets. To ensure our results reflected genuine predictive power, we implemented strict guardrails:</p><ul><li><p><strong>The Patient-Level Split: </strong>A common pitfall in clinical AI is splitting data by <em>records</em>. If a patient has ten visits, and eight are in training while two are in testing, the model might &#8220;cheat&#8221; by recognizing that specific patient&#8217;s ID or baseline. We perform a strict <strong>Patient-Level Split</strong>. We shuffle the unique <em>subject ID</em> list and allocate 85% of patients for training and 15% for a completely &#8220;unseen&#8221; test set. If a patient is in the test set, the model has never seen a single day of their medical history during training.</p></li><li><p><strong>Focusing on Progression:</strong> To ensure our predictions reflect true changes in health status, we remove data from the week immediately preceding a mortality event. This prevents the system from basing its predictions on the administrative processes that occur during end-of-life care.</p></li><li><p><strong>Strict Temporal Separation:</strong> To prevent intraday leakage (e.g., a toxicity event recorded at 4:00 PM influencing a prediction made at 9:00 AM on the same day), we enforce a 24-hour buffer. The target window explicitly begins the day <em>after</em> the Indexing Date.</p></li><li><p><strong>Multi-Dimensional Metric Breakdowns:</strong> We don&#8217;t settle for a single AUROC number.
Our framework automatically &#8220;slices&#8221; performance across two dimensions to find where the model excels or fails:</p><ul><li><p><strong>By Task:</strong> Comparing how it predicts mortality vs. line-of-therapy changes.</p></li><li><p><strong>By Indication:</strong> Specifically analyzing high-heterogeneity diseases like Sarcoma vs. more &#8220;structured&#8221; cancers like Prostate.</p></li></ul></li></ul><h3><strong>Evidence</strong></h3><p>To test the World Model hypothesis, we benchmarked the Standard Model (SMB-v1-1.7B) against three distinct classes of baselines:</p><ul><li><p><strong>Classic Clinical Baselines:</strong> The industry standards: Logistic Regression, Random Forests, and Gradient Boosting.</p></li><li><p><strong>General-Purpose Multimodal LLMs:</strong> <a href="https://arxiv.org/abs/2511.21631">Qwen3-VL</a> (4B and 8B), to test whether general-purpose AI models with massive scale can substitute for domain specialization.</p></li><li><p><strong>Internal Controls (Ablations):</strong></p><ul><li><p><em>SMB-EHR-4B:</em> Trained only on <strong>public EHR data</strong>.</p></li><li><p><em>SMB-v1-1.7B (sft):</em> Trained <strong>without the JEPA objective</strong> (supervised fine-tuning only).</p></li></ul></li></ul><p>Here is what the data reveals.</p><h4><strong>1.
A Structural Break in Clinical Reasoning</strong></h4><p>The most compelling evidence for the World Model hypothesis is found not in the aggregate scores, but in <em>where</em> the model wins.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rVTg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rVTg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png 424w, https://substackcdn.com/image/fetch/$s_!rVTg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png 848w, https://substackcdn.com/image/fetch/$s_!rVTg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png 1272w, https://substackcdn.com/image/fetch/$s_!rVTg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rVTg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png" width="1456" height="579" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319995,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.standardmodel.bio/i/181196973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!rVTg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png 424w, https://substackcdn.com/image/fetch/$s_!rVTg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png 848w, https://substackcdn.com/image/fetch/$s_!rVTg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png 1272w, https://substackcdn.com/image/fetch/$s_!rVTg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464b6609-72c4-457f-91c0-93a2fa2c0115_7938x3156.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: <strong>Performance Decomposition by Clinical Task Category</strong>.</figcaption></figure></div><p>Figure 2 reveals a clear &#8220;staircase&#8221; of performance that correlates with biological complexity:</p><ul><li><p><strong>Static Tasks (e.g., Mortality):</strong> On simple outcomes like death, generalist models are competitive.</p></li><li><p><strong>Dynamic Tasks (e.g., Treatment Change):</strong> As soon as the task requires understanding the vector of the disease, e.g., predicting disease progression or if a treatment will fail within a year, the generalist models collapse.</p><ul><li><p>Gradient Boosting struggles at <strong>~0.66</strong>.</p></li><li><p>Qwen3-VL improves slightly to <strong>~0.70</strong>.</p></li><li><p><strong>SMB-v1-1.7B (SFT+JEPA) dominates with an 
AUROC approaching 0.78.</strong></p></li></ul></li></ul><p><strong>This confirms our hypothesis.</strong> The SMB-v1 thrives here because it isn&#8217;t just looking at the patient&#8217;s current condition; it is simulating the treatment&#8217;s collision with the tumor&#8217;s trajectory.</p><p><strong>2. Validating the &#8220;World Model&#8221; Architecture (SFT vs. JEPA)</strong></p><p>To prove that our performance is a result of the architecture and not just the data, we ran an ablation study comparing pure supervised fine-tuning (SFT) against our hybrid architecture.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GB-R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GB-R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png 424w, https://substackcdn.com/image/fetch/$s_!GB-R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png 848w, https://substackcdn.com/image/fetch/$s_!GB-R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png 1272w, https://substackcdn.com/image/fetch/$s_!GB-R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GB-R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png" width="1456" height="444" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2681312,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.standardmodel.bio/i/181196973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GB-R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png 424w, https://substackcdn.com/image/fetch/$s_!GB-R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png 848w, https://substackcdn.com/image/fetch/$s_!GB-R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GB-R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F482f4a0a-1220-4c77-bda8-e603b19d4a7e_11172x3408.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 3: <strong>Benchmark Comparison across Model Architectures</strong>.</figcaption></figure></div><ul><li><p><strong>The Baseline (SFT Only):</strong> The SMB-v1-1.7B (SFT) model achieves a strong overall AUROC of <strong>0.715</strong>.</p></li><li><p><strong>The Breakthrough (SFT + JEPA):</strong> When we add the JEPA objective that 
forces the model to predict future patient states, performance jumps to <strong>0.727</strong>.</p></li></ul><p>However, the aggregate score hides the true signal. As the error bars in Figure 3 reveal, the lift is not uniform. It is specific to complexity.</p><p>In relatively homogeneous indications where text patterns correlate strongly with outcomes (e.g., ovarian cancer), the SFT model is already near the ceiling. The JEPA architecture shines where traditional pattern-matching fails: in highly heterogeneous, aggressive diseases.</p><ul><li><p><strong>Sarcoma (the </strong><em><strong>heterogeneity</strong></em><strong> test):</strong> Sarcomas are notoriously diverse and difficult to subtype from text alone. Here, the SFT model struggles (~0.71), but the JEPA model delivers a massive lift to <strong>~0.77</strong>. The fact that JEPA lifts performance here suggests the model is ignoring the messy syntax of the notes and successfully embedding the underlying <em>phenotype</em>. It is learning &#8220;sarcoma-ness&#8221; from the trajectory of vitals/labs/imaging, rather than the text label &#8220;sarcoma.&#8221;</p></li><li><p><strong>Upper-GI &amp; Prostate:</strong> We see similarly pronounced gains in upper-GI and prostate cancers. In these indications, the disease trajectory is often non-linear. The SFT model creates a static risk score, but the JEPA model successfully simulates the vector of the disease, resulting in superior stratification.</p></li></ul><h4><strong>3. Specialization Beats Scale (The &#8220;Generalist&#8221; Trap)</strong></h4><p>A common assumption is that massive general-purpose models will eventually render specialized models obsolete. 
Our results suggest otherwise: domain grounding matters more than parameter count.</p><ul><li><p><strong>The Generalist:</strong> Qwen3-VL 8B achieves an Overall AUROC of <strong>0.687</strong>.</p></li><li><p><strong>The Public Specialist:</strong> Interestingly, our SMB-EHR-4B, which trained only on public data, surpasses the generalist with an AUROC of <strong>0.708</strong>.</p></li></ul><p><strong>Why this happens:</strong> Qwen3-VL sees a lab value of &#8220;Creatinine: 1.8&#8221; as a sequence of text tokens to predict. The Standard Model understands it as a <strong>biological signal of renal function</strong> that interacts with hydration and chemotherapy toxicity.</p><h4><strong>4. Deep Dive: Granularity Beyond the Aggregate</strong></h4><p>Aggregate AUROC scores can obscure clinical nuance. Does the model actually understand anatomy, or is it just guessing risk?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nauS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nauS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png 424w, https://substackcdn.com/image/fetch/$s_!nauS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png 848w, https://substackcdn.com/image/fetch/$s_!nauS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nauS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nauS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png" width="1456" height="1475" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1475,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1744717,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.standardmodel.bio/i/181196973?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390eb822-887d-41b9-8d89-d4306d9ec30d_5034x5253.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nauS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png 424w, https://substackcdn.com/image/fetch/$s_!nauS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png 848w, 
https://substackcdn.com/image/fetch/$s_!nauS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png 1272w, https://substackcdn.com/image/fetch/$s_!nauS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e33ec4-2707-48b1-a6a0-a19153060fe0_5034x5100.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 4: <strong>Model Performance Benchmark for Pancreatic 
Cancer</strong>.</figcaption></figure></div><p>Using pancreatic cancer as a representative example, we broke down performance into specific anatomical tasks.</p><ul><li><p><strong>Anatomical Specificity:</strong> The model successfully differentiates between specific progression sites. <strong>For progression in the body of the pancreas</strong>, the Standard Model provides a <strong>+12.6% improvement</strong> over baselines. <strong>For progression in the liver</strong>, we see a <strong>+7.7% lift</strong>.</p></li><li><p><strong>Consistent Superiority:</strong> Across 45 distinct tasks, ranging from high-noise predictions like &#8220;Discontinuation due to Toxicity&#8221; to &#8220;Line of Therapy Transfer&#8221;, the Standard Model consistently outperforms the baselines.</p></li></ul><h3><strong>Conclusion</strong></h3><p>The hierarchy of performance is unambiguous: <strong>generalist models &lt; specialized public models &lt; specialized privately fine-tuned models &lt; specialized World Models.</strong></p><p><strong>These results indicate that while domain-specific data provides a necessary baseline, the architectural approach determines the performance ceiling. </strong>The measurable lift of the JEPA objective over pure SFT confirms that enforcing causal structure in the latent space yields more robust patient representations than autoregressive text generation alone. 
Ultimately, achieving state-of-the-art clinical prediction requires modeling the dynamics of the disease, not just the syntax of the data.</p><h3><strong>Announcing the Availability of Our Model Weights</strong></h3><p><strong>Standard Model offers a foundational step toward a broader vision: a reshaped oncology ecosystem where foundational models act as partners in complex clinical decision making.</strong> By effectively modeling the patient state <em>S(t)</em> to simulate potential trajectories, we aim to support the entire care continuum, from refining early detection to guiding personalized treatment selection through <em>what-if</em> analysis.</p><p><strong>In the longer term, this approach holds significant promise for accelerating clinical research.</strong> Specifically, the generation of universal patient embeddings enables synthetic control arms that could reduce reliance on large control groups, thereby decreasing trial duration and costs. Ultimately, our goal is to facilitate a systemic shift from one-size-fits-all treatment to precision healthcare.</p><p><strong>That is why we are opening our model weights to the public. </strong>Visit Hugging Face <a href="https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure">here</a> to access our weights and download them for use in your own workflow.</p><h4><strong>Acknowledgements</strong></h4><p>We would like to thank the MSK team that we worked with through the 2025 MSK iHub Challenge program, and especially the following individuals for their painstaking contributions to dataset development.</p><ul><li><p>John Philip, MS, Senior Director, Clinical &amp; Translational Research Informatics and Data Strategy, MSK, New York, NY.</p></li><li><p>Neil J. 
Shah, MBBS, Assistant Attending Physician, MSK, New York, NY.</p></li><li><p>Nadia Bahadur, MS, Clinical &amp; Translational Research Informatics, MSK, New York, NY.</p></li><li><p>Andrew Niederhausern, Bioinformatics Manager, Clinical &amp; Translational Research Informatics, MSK, New York, NY.</p></li><li><p>Bryan Tran Van der Stegen, MBA, Business Analyst, Clinical &amp; Translational Research Informatics, MSK, New York, NY.</p></li><li><p>Haiyu Zheng, MS, Business Analyst, Clinical &amp; Translational Research Informatics, MSK, New York, NY.</p></li></ul><p><em><strong><a href="https://blog.standardmodel.bio/subscribe">Subscribe</a> to our Substack to be notified when we publish new posts, or follow us on <a href="https://www.linkedin.com/company/standardmodel-bio/posts/?feedView=all">LinkedIn</a>, <a href="https://x.com/smbiomedicine">Twitter/X</a>, and <a href="https://huggingface.co/standardmodelbio">Hugging Face</a> for more.</strong></em></p>]]></content:encoded></item><item><title><![CDATA[The Patient is Not a Document: Moving from LLMs to a World Model for Oncology (Part 1)]]></title><description><![CDATA[Introducing the Standard Model: A new paradigm to build multimodal intelligence for oncology.]]></description><link>https://blog.standardmodel.bio/p/the-patient-is-not-a-document-moving</link><guid isPermaLink="false">https://blog.standardmodel.bio/p/the-patient-is-not-a-document-moving</guid><dc:creator><![CDATA[Zach Chen]]></dc:creator><pubDate>Thu, 04 Dec 2025 19:46:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!L00l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: We built a multimodal foundation model for oncology that replaces the text-prediction of LLMs with the state-prediction of a World Model.</strong></p><p>For the last three years, the game of biomedical AI has been simple: take a massive general-purpose model (e.g., GPT, Claude, Gemini), feed it medical text, and watch it crush the<a href="https://www.usmle.org/about-usmle#:~:text=Why%20is%20the%20USMLE%20important,correlates%20to%20improved%20patient%20outcomes."> USMLE</a>. It worked. We saw models passing licensing exams with expert-level scores and reasoning through complex vignettes.</p><p>But as the hype settles, the data reveals a sobering reality. 
When these same models are tested on realistic patient cases requiring actual treatment decisions,<a href="https://www.nature.com/articles/s43018-025-00991-6"> GPT-4 achieves just 30.3% completeness</a>. Models struggle to measure tumor dimensions accurately through CT scans or to track disease progression without heavy reliance on external tools.<strong> In short, general-purpose AI models demonstrate exceptional retrieval capabilities on standardized exams, yet lack the grounded utility required for complex medical practice.</strong></p><p>So, what did we miss?</p><h3><strong>Language Cannot Represent Biological Complexity on Its Own</strong></h3><p>The failure of general-purpose AI in medical treatment stems from a single, flawed assumption:<strong> that language is a sufficient proxy for disease biology. </strong>The industry has assumed that if we built a high-fidelity map (the text), we would understand the territory (the patient).</p><p>Today&#8217;s AI models were never natively trained on foundational biological signals. The latest models from Google, OpenAI, and Anthropic have never &#8220;seen&#8221; an entire CT volume or &#8220;read&#8221; a whole-genome sequence; they have only processed text descriptions of them. And they certainly haven&#8217;t connected them to other signals across time and scale.</p><p><strong>Human biology is not a linguistic problem; it is inherently multimodal. </strong>And yet, today&#8217;s AI is not. 
Biological signals exist at every scale, forming a complex, interacting hierarchy: from molecular variations in genomic sequences to cellular structures in histopathology slides, up to anatomical changes in CT volumes and longitudinal lab patterns in EHR data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SbSy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SbSy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png 424w, https://substackcdn.com/image/fetch/$s_!SbSy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png 848w, https://substackcdn.com/image/fetch/$s_!SbSy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png 1272w, https://substackcdn.com/image/fetch/$s_!SbSy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SbSy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png" width="1456" height="776" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:776,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153701,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://zachchen2.substack.com/i/179684092?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!SbSy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png 424w, https://substackcdn.com/image/fetch/$s_!SbSy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png 848w, https://substackcdn.com/image/fetch/$s_!SbSy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png 1272w, https://substackcdn.com/image/fetch/$s_!SbSy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5bb438b-cf7c-4957-a222-f62494323d44_1960x1044.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: <strong>Image sourced from Stanford&#8217;s 2025 AI Index report</strong>. Today medical-specific models are trained on 75-250x fewer tokens than general purpose models.</figcaption></figure></div><h3><strong>Even Specialized Models Hit a &#8220;Description Ceiling&#8221;</strong></h3><p>Moving beyond general AI models, the last 18 months have seen the rise of purpose-built foundational models in biomedicine specifically for oncology, known as the &#8220;savant&#8221; models. 
For example,<a href="https://www.nature.com/articles/s41746-025-01780-2"> Memorial Sloan Kettering&#8217;s Woollie</a> detects cancer progression with a stunning 97% AUROC.<a href="https://www.nature.com/articles/s41586-024-08378-w"> Stanford&#8217;s MUSK</a> predicts melanoma relapse with 83% accuracy.</p><p>These more specialized models are impressive, but they hit a ceiling because of two distinct challenges in the current landscape:</p><ul><li><p><strong>The &#8220;Text Bridge&#8221; Issue:</strong> Most systems today, like Woollie, function as LLM agents (or pure LLMs) that invoke external tools (e.g., segmentation models or mutation classifiers) and synthesize the results via text.</p><ul><li><p><strong>The Challenge:</strong> They treat cancer progression as a<a href="https://www.nature.com/articles/s41746-024-01366-4"> linguistic probability problem</a>. For example, Woollie is trained on radiology <em>impressions</em> and summaries written by clinical professionals. It optimizes for semantic consistency with the doctor, not biological fidelity to the disease. By forcing biological reality through a text bottleneck, these methods suffer from <strong>lossy compression</strong>: they discard any biological signal the human observer failed to articulate (e.g., subtle textural changes in a raw CT volume). They cannot reason about biological signals that were never written down, e.g., a raw whole-genome sequence or the complex spatial topology of a tumor microenvironment.</p></li></ul></li><li><p><strong>The &#8220;Snapshot&#8221; Limitation:</strong> Current models hit a ceiling because they reconstruct static snapshots of a patient state rather than <em>predict</em> a journey through time.</p><ul><li><p><strong>The Challenge:</strong> Models like MUSK learn by reconstructing missing pixels in static CT images or aligning a histopathology slide to a static report. 
While they may output the probability of a future event (e.g., high risk of relapse) and identify that a tumor looks &#8220;dangerous,&#8221; they arrive at these conclusions by correlating static patterns, such as lymphocyte infiltration. These models effectively &#8220;skip&#8221; the causal chain of clinical events. Consequently, it is difficult for them to simulate how a tumor would dynamically evolve under Treatment A versus Treatment B because they optimize for <strong>pattern matching</strong>, not <strong>causal effect</strong>.</p></li></ul></li></ul><h3><strong>Focus Needs to Shift from Description to Dynamics</strong></h3><p><strong>The fundamental problem isn&#8217;t that state-of-the-art LLMs need more medical text or that existing models need more data. The problem is the </strong><em><strong>learning paradigm</strong></em><strong>.</strong></p><p>Clinical oncology is not a series of static snapshots; it is an evolving biological trajectory.</p><p>A clinician does not simply ask &#8220;Is this cancer?&#8221; (classification). Rather, they ask:</p><blockquote><p>&#8220;Given the patient&#8217;s current state <em>S(t)</em> and this specific intervention <em>I</em>, what will their state be in 6 months <em>S(t+1)</em>?&#8221;</p></blockquote><p>We need an architecture that bridges the gap: one that possesses the semantic high-level reasoning of an LLM but is anchored in the multimodal &#8220;ground truth&#8221; of disease biology (genomics, proteomics, imaging, EHR, and longitudinal outcomes).</p><h3><strong>Standard Model&#8217;s Approach: A Biological World Model</strong></h3><p>This is the motivation behind our <strong><a href="https://www.standardmodel.bio/">Standard Model</a></strong>. 
By moving from autoregressive text generation and masked reconstruction to a<a href="https://arxiv.org/pdf/2301.08243"> </a><strong><a href="https://arxiv.org/pdf/2301.08243">Joint-Embedding Predictive Architecture (JEPA)</a></strong>, we are ending the game of predicting words and beginning the complex work of modeling biological dynamics.</p><h4><strong>Architectural Deep Dive</strong></h4><p>Our Standard Model is a <strong>biological world model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></strong> that operates as a temporal loop, transforming disparate clinical signals into a cohesive, evolving &#8220;digital twin.&#8221; The architecture is defined by four specific design heuristics:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L00l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L00l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png 424w, https://substackcdn.com/image/fetch/$s_!L00l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png 848w, https://substackcdn.com/image/fetch/$s_!L00l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png 1272w, 
https://substackcdn.com/image/fetch/$s_!L00l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L00l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png" width="1456" height="983" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:983,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1235641,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://zachchen2.substack.com/i/179684092?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!L00l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png 424w, https://substackcdn.com/image/fetch/$s_!L00l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png 848w, 
https://substackcdn.com/image/fetch/$s_!L00l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png 1272w, https://substackcdn.com/image/fetch/$s_!L00l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81f231b2-4107-450d-ae2d-1dfd1da9e79a_4962x3351.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: <strong>The Standard Model Architecture</strong>. 
It utilizes a Joint-Embedding Predictive Architecture (JEPA) to conditionally predict patient trajectories within the latent space rather than classifying static snapshots.</figcaption></figure></div><h4><strong>1. The Inputs: Modality Ingestion &amp; Fusion (Time = </strong><em><strong>t</strong></em><strong>)</strong></h4><p>Rather than relying on text as a lossy compression layer, Standard Model captures a 360-degree view of the patient. We ingest raw signals&#8212;such as genomics, proteomics, high-resolution imaging, and EHR data&#8212;and pass them through modality-specific encoders.</p><ul><li><p><strong>The Modality Fusion Heuristic:</strong> A specialized <strong>projector</strong> maps these raw encodings into the universal latent space of a <strong>state encoder</strong>. We don&#8217;t freeze the image encoders; we align the raw biological signals directly with the encoder&#8217;s strong foundation of world knowledge. This yields a &#8220;fused&#8221; patient state embedding that retains both high-level semantic context and low-level biological granularity at time <em>t</em>.</p></li></ul><h4><strong>2. The Engine: The &#8220;What-If&#8221; Simulation</strong></h4><p>This is the core differentiator. Once the fused patient state embedding is generated, the model does not simply output a diagnosis. Instead, it functions as a <strong>state predictor</strong>.</p><ul><li><p><strong>Input:</strong> Current Fused Embedding <em>S(t)</em> + Intervention <em>A(t)</em>.</p></li><li><p><strong>Output:</strong> Predicted Next-Time-Point Embedding <em>S(t+1)</em>.</p></li><li><p><strong>Cause-and-Effect Data Structuring:</strong> To enforce causal learning, the training data is not fed as a static batch. It is structured in a strict <strong>cause-and-effect format</strong>: (Pre-State + Intervention) &#8594; (Post-State). 
By explicitly modeling a given <em>Intervention</em> as the catalyst for state change, the model learns to distinguish between the natural history of the disease and the specific impact of a treatment.</p></li></ul><blockquote><p><strong>The Shift:</strong> We do not ask the model to generate the pixels of the next CT scan or the text of the next report. We ask it to predict the <em>future state representation</em> in the latent space. This forces the model to learn the <strong>causal trajectory</strong> of the disease rather than the texture of the data.</p></blockquote><h4><strong>3. The Anchor: Hybrid Optimization</strong></h4><p>Pure JEPA models often suffer from &#8220;<a href="https://arxiv.org/pdf/2511.08544">training collapse</a>&#8221; (outputting constant representations) or drift. To prevent this and ensure clinical grounding, we employ a <strong>hybrid learning objective</strong>.</p><ul><li><p><strong>The Strategy:</strong> We combine supervised fine-tuning (SFT) with JEPA objectives by assigning mixed weights to these learning signals.</p></li><li><p><strong>The Result:</strong> The SFT component anchors the model to ground-truth clinical outcomes (e.g., &#8220;Did the patient respond to drug A?&#8221;), while the JEPA component forces the model to learn the underlying dynamics of <em>how</em> that state was reached.</p></li></ul><h4><strong>4. The Optimization Loop (Time = </strong><em><strong>t+1</strong></em><strong>)</strong></h4><p>We trained our <a href="https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure">StandardModel-v1 (available on Hugging Face soon)</a> on real-world longitudinal oncology data. The model compares its <strong>predicted future state</strong> against the <strong>actual patient state</strong> (derived from the ground truth follow-up data). 
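</p><p>As a rough, self-contained sketch (toy dimensions, a single linear predictor, and hand-made states and labels; none of this is the released SMB-v1 code), one forward pass of this loop predicts <em>S(t+1)</em> from <em>S(t)</em> and <em>A(t)</em> and mixes the latent-space JEPA error with a supervised, SFT-style anchor:</p>

```python
import math
import random

random.seed(0)
D, A = 8, 2  # toy latent and action dimensions (illustrative only)

def predictor(state, action, W):
    # predict S(t+1) from the concatenated (S(t), A(t)) with one linear map
    x = state + action
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def jepa_loss(pred, target):
    # JEPA objective: error measured in latent space, not pixel/token space
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def sft_loss(pred, responded):
    # supervised anchor: does the predicted state imply clinical response?
    logit = sum(pred) / len(pred)
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(math.log(prob) if responded else math.log(1.0 - prob))

# one (pre-state + intervention) -> (post-state) pair, plus an outcome label
s_t = [0.1 * i for i in range(D)]          # fused patient state at time t
a_t = [1.0, 0.0]                           # e.g. "drug A given" as a one-hot
s_t1 = [0.1 * i + 0.05 for i in range(D)]  # observed follow-up state
W = [[random.uniform(-0.1, 0.1) for _ in range(D + A)] for _ in range(D)]

pred = predictor(s_t, a_t, W)
total = 0.8 * jepa_loss(pred, s_t1) + 0.2 * sft_loss(pred, responded=True)
print(round(total, 4))
```

<p>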
The error signal updates the model, teaching it the causal laws of how tumors progress and respond to therapy.</p><h3><strong>Conclusion: Modeling State, Not Tokens</strong></h3><p>For decades, we treated medical AI as a library problem: reading more, summarizing better, passing exams. <strong>But patients are not textbooks</strong>. They are dynamic, evolving biological systems.</p><p>Our Standard Model represents a shift from <strong>Reading</strong> to <strong>Reasoning</strong>, and from <strong>Description</strong> to <strong>Dynamics</strong>.</p><h3><strong>Next: Evaluation</strong></h3><p>We argued that oncology AI needs to move from describing a snapshot to projecting a journey. But how do you validate a projection? If we simply let the model generate text, we fall back into the trap of measuring syntax, not biology.</p><p>Instead, we test the <strong>quality of the state</strong> that the model learned. If our Standard Model is truly simulating the patient&#8217;s future (the territory), then the representation of that patient at any given moment <em>S(t)</em> should contain the &#8220;future&#8221; within it.</p><p><em>In Part 2, we will move from architecture to evidence. We will detail how we validated our Standard Model v1 on real-world longitudinal oncology datasets, defining &#8220;ground truth&#8221; not as static labels, but as dynamic time-windows for progression, toxicity, response, and survival. 
<strong><a href="https://blog.standardmodel.bio/subscribe">Subscribe</a> to our Substack to be notified when we publish the second part of this series, or follow us on <a href="https://www.linkedin.com/company/standardmodel-bio/posts/?feedView=all">LinkedIn</a> and <a href="https://x.com/smbiomedicine">Twitter/X</a> for more.</strong></em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.standardmodel.bio/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.standardmodel.bio/subscribe?"><span>Subscribe now</span></a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In robotics and autonomous driving, researchers and engineers realized that predicting text or classifying images wasn&#8217;t enough to navigate reality. They developed <strong><a href="https://arxiv.org/abs/1803.10122">World Models</a></strong>.</p><p>A World Model is not a knowledge base; it is a simulation engine. It learns an internal representation of how the environment works and, crucially, how actions change that environment.</p><ul><li><p><strong>A Classifier asks:</strong> &#8220;Is this a picture of a road?&#8221;</p></li><li><p><strong>A World Model asks:</strong> &#8220;If I turn the wheel 10 degrees left at this speed, where will the car be in 3 seconds?&#8221;</p></li></ul><p>This ability to simulate future states based on interventions is the &#8220;missing link&#8221; in oncology. A patient is not a static image; they are a dynamic environment. 
By treating the patient&#8217;s biology as the &#8220;world&#8221; and the treatment as the &#8220;steering wheel,&#8221; we can move from describing the cancer to simulating its trajectory.</p></div></div>]]></content:encoded></item><item><title><![CDATA[A Deep Dive Into Our Paper on Patient-Specific Biomolecular Instruction Tuning of Graph LLMs ]]></title><description><![CDATA[Combining patient-specific biomolecular interaction networks with instruction-tuned LLMs results in a system that understands the biological story behind each individual patient's data.]]></description><link>https://blog.standardmodel.bio/p/a-deep-dive-into-our-paper-on-patient</link><guid isPermaLink="false">https://blog.standardmodel.bio/p/a-deep-dive-into-our-paper-on-patient</guid><dc:creator><![CDATA[Irsyad]]></dc:creator><pubDate>Thu, 16 Oct 2025 13:02:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iD48!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Two weeks ago, <a href="https://blog.standardmodel.bio/p/introducing-standard-model-biomedicine">Standard Model Biomedicine came out of stealth</a> with three new papers, each of which illustrates a different combination of the biological scales our multimodal foundation model operates across. One of these papers was titled &#8220;Patient-Specific Biomolecular Instruction Tuning of Graph LLMs&#8221;. 
The full text is available on arXiv: <a href="https://arxiv.org/pdf/2509.22853">https://arxiv.org/pdf/2509.22853</a>.</p><p>In this post, we will share further thoughts on this publication and our work behind it.</p><h3><strong>Biomolecular Instruction Tuning Can Reveal Advanced Molecular Interplays</strong></h3><p><strong>Omics data is a direct manifestation of biological processes occurring in the human body.</strong> While we can measure thousands of biomolecules in cancer patients, understanding what these measurements mean for each individual patient is incredibly complex. These biomolecules (e.g. proteins) aren&#8217;t produced in isolation; the measurements that we see are the result of hundreds to thousands of interactions triggering cascades of biological activity.</p><p>For example, the p53 tumor suppressor protein orchestrates the DNA damage response by activating proteins such as MDM2 and p21 to coordinate cell cycle arrest and DNA repair, preventing the spread of genomic damage. Traditional analysis methods (e.g. univariate t-tests, fold-change ranking, principal component analysis, k-means clustering, logistic regression) analyze these protein levels independently, missing this crucial molecular interplay. Even a standard neural network would struggle to learn this distinction, as it must discover from scratch that these specific proteins should be analyzed as a connected group rather than as independent features.</p><p>This is where biomolecular instruction tuning becomes imperative for analyzing diseases in the human body.<br><br><strong>By teaching AI models to understand the patient-specific protein levels within the context of their biomolecular interaction networks, we can capture a more complete biological story. </strong>At Standard Model Biomedicine, we expanded upon this idea and developed KRONOS: Knowledge Representation of patient Omics Networks in Oncology via Structured tuning. 
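</p><p>To make that contrast concrete, here is a toy one-step &#8220;message passing&#8221; sketch (invented protein values and a hand-picked interaction map; not the KRONOS implementation or the real STRING network) showing how a protein is read together with its interaction partners rather than alone:</p>

```python
# One round of neighbor averaging over a toy interaction map (illustrative
# values only; not the KRONOS implementation or the real STRING network):
levels = {"TP53": 2.0, "MDM2": 1.5, "CDKN1A": 1.2, "GAPDH": 0.9}
edges = {"TP53": ["MDM2", "CDKN1A"], "MDM2": ["TP53"],
         "CDKN1A": ["TP53"], "GAPDH": []}

def message_pass(levels, edges):
    # each protein's value is mixed with the mean of its interaction partners,
    # so TP53 is read together with MDM2 and p21 (CDKN1A) rather than alone
    out = {}
    for node, nbrs in edges.items():
        ctx = sum(levels[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        out[node] = 0.5 * levels[node] + 0.5 * ctx
    return out

updated = message_pass(levels, edges)
print(updated["TP53"])   # about 1.675: TP53 now carries its neighborhood's signal
print(updated["GAPDH"])  # 0.45: no partners, so only its own (damped) value
```

<p>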
KRONOS creates an individualized molecular map for every patient by integrating their proteomics data with biomolecular networks, capturing both protein levels and their interaction dynamics within the human body. KRONOS then merges this enriched proteomics representation into an LLM, creating a multi-modal language model that empowers physicians to deliver more personalized care by mapping distinct protein signatures that characterize each patient&#8217;s malignancy.</p><p>By integrating representations grounded in biological relevance into an LLM, the model captures the complex biology underlying omics measurements, moving beyond pattern recognition toward an intricate understanding of molecular networks, dependencies, and cascading effects that drive oncogenesis.</p><h3><strong>Biomolecular Graphs Boost Deep Learning of Patient Data</strong></h3><p>While traditional deep learning approaches in biomedicine assume feature independence, integrating biomolecular networks such as STRING PPI enables denoising of multi-omics data and learning of meaningful signals by utilizing these networks as structural priors. Early work such as EMOGI used graph neural networks (GNNs), leveraging these interaction maps to predict cancer genes, while subsequent methods like GNN-SubNet [<a href="https://academic.oup.com/bioinformatics/article/38/Supplement_2/ii120/6702000">1</a>] harnessed explainable AI to identify disease-specific subnetworks across a patient cohort. Recent approaches adopt larger and more heterogeneous biomolecular interaction networks (TREE [<a href="https://www.nature.com/articles/s41551-024-01312-5">2</a>]) to better capture these omics complexes. 
<strong>Biomedical AI literature demonstrates that leveraging these molecular networks alongside omics improves both the learning of meaningful predictive signals and the biological interpretability of disease models.</strong></p><p>Alongside these advances in graph omics learning, AI research is adopting new paradigms for enhancing reasoning in LLMs. The introduction of instruction tuning has enabled AI interpretation of modalities in biology and medicine that LLMs may not already be accustomed to.</p><p><strong>By training LLMs on medical instructions, researchers enable AI to reason across an out-of-scope modality for complex tasks.</strong> MIMIC-Instr [<a href="https://neurips.cc/virtual/2024/poster/97801">3</a>] allowed LLMs to reason about intricate longitudinal EHR data, while LLaVA-Med [<a href="https://arxiv.org/abs/2306.00890">4</a>] learned to expertly understand medical images. MEIT [<a href="https://arxiv.org/abs/2403.04945">5</a>] decoded ECG signals into clinical insights, and Me-LLaMA ingested 129 billion biomedical tokens to master medical language. 
<strong>These aren&#8217;t just pattern-matching tools anymore; they are AI systems that can understand context, follow clinical reasoning, and translate between different biomedical modalities.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iD48!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iD48!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png 424w, https://substackcdn.com/image/fetch/$s_!iD48!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png 848w, https://substackcdn.com/image/fetch/$s_!iD48!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png 1272w, https://substackcdn.com/image/fetch/$s_!iD48!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iD48!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png" width="728" height="370.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:741,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iD48!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png 424w, https://substackcdn.com/image/fetch/$s_!iD48!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png 848w, https://substackcdn.com/image/fetch/$s_!iD48!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png 1272w, https://substackcdn.com/image/fetch/$s_!iD48!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2913c17-aa07-43a4-ae39-f19b4c0eb408_1600x814.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Model architecture of KRONOS.</figcaption></figure></div><h3><strong>KRONOS: Knowledge Representation of Patient Omics Networks in Oncology via Structured Tuning</strong></h3><p>While proteomics is imperative to understanding a patient&#8217;s disease progression, LLMs haven&#8217;t inherently learned how to navigate through this complex modality, let alone how to reason and generate prognostic predictions for a patient&#8217;s disease state.</p><p><strong>As mentioned above, our paper introduces KRONOS as a framework that integrates enriched patient-specific network representations with modern LLM training paradigms, enabling models to capture complex proteomics signals and achieve competitive prognostic performance.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!i_mb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i_mb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png 424w, https://substackcdn.com/image/fetch/$s_!i_mb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png 848w, https://substackcdn.com/image/fetch/$s_!i_mb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png 1272w, https://substackcdn.com/image/fetch/$s_!i_mb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i_mb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png" width="1456" height="537" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:537,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i_mb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png 424w, https://substackcdn.com/image/fetch/$s_!i_mb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png 848w, https://substackcdn.com/image/fetch/$s_!i_mb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png 1272w, https://substackcdn.com/image/fetch/$s_!i_mb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff93e782-b103-46d6-8106-f04d7adead3e_1600x590.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CPTAC-PROTSTRUCT instruction generation pipeline.</figcaption></figure></div><p>KRONOS grounds patient-specific proteomics representation learning in biomedical interaction networks, generating biologically-relevant embeddings ingested by the LLM. To teach the model this unfamiliar modality, we constructed CPTAC-PROTSTRUCT, a dataset curated from the largest U.S. proteomics cancer study (CPTAC), which forces the LLM to align proteomics signals with its language system and unlocks true reasoning capability. 
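</p><p>For intuition, a single instruction pair in this style might look like the following (an invented example; the real CPTAC-PROTSTRUCT prompts, placeholders, and answers follow the paper&#8217;s templates):</p>

```python
# A hypothetical schema-alignment / clinical-reasoning instruction pair in the
# spirit of CPTAC-PROTSTRUCT (field names and values are invented for
# illustration; they are not drawn from the actual dataset):
pair = {
    "instruction": "Given this patient's proteomics-derived graph embedding, "
                   "summarize the dominant pathway activity.",
    "input": "<proteomics_embedding>",  # placeholder for the injected tokens
    "output": "Elevated p53 pathway activity with MDM2 feedback engagement.",
}
print(sorted(pair))  # ['input', 'instruction', 'output']
```

<p>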
The steps to generate the CPTAC-PROTSTRUCT schema alignment and clinical reasoning instruction pair sets are outlined in the figure above.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dS-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dS-6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png 424w, https://substackcdn.com/image/fetch/$s_!dS-6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png 848w, https://substackcdn.com/image/fetch/$s_!dS-6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!dS-6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dS-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png" width="1456" height="631" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:405561,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.standardmodel.bio/i/176200482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dS-6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png 424w, https://substackcdn.com/image/fetch/$s_!dS-6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png 848w, https://substackcdn.com/image/fetch/$s_!dS-6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!dS-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89745d89-69db-45c5-b857-bb65899233ad_2520x1092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Model generation comparison of schema alignment and clinical reasoning.</figcaption></figure></div><p>By aligning proteomics with the LLM and enabling reasoning through instruction tuning, the model learns the semantic nuances of this newly-introduced modality. It can then generate proteomics-grounded text and, more importantly, develop rich internal representations that serve as a foundation for various downstream tasks. 
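</p><p>A standard way to test whether such internal representations are useful is linear probing: freeze the model and fit only a small head on its embeddings. Below is a minimal sketch of that idea (hand-made &#8220;frozen&#8221; vectors and outcome labels; not the evaluation code from the paper):</p>

```python
import math

def probe_step(w, b, x, y, lr=0.5):
    # one logistic-regression update; the embedding x stays frozen
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    g = p - y  # gradient of the log-loss with respect to z
    return [wi - lr * g * xi for wi, xi in zip(w, x)], b - lr * g

# toy "frozen hidden states" for two patients with known outcomes (made up)
frozen = [([1.0, -0.5, 0.2], 1), ([-0.8, 0.4, -0.1], 0)]

w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(100):  # train only the tiny linear head
    for x, y in frozen:
        w, b = probe_step(w, b, x, y)

correct = sum(
    ((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y))
    for x, y in frozen
)
print(correct)  # the probe separates both toy patients -> prints 2
```

<p>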
Once we train KRONOS, we validate the learned representations by probing its hidden layers against ground truth, using a frozen LLM backbone that was never explicitly trained on these downstream tasks, as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pA_L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pA_L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png 424w, https://substackcdn.com/image/fetch/$s_!pA_L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png 848w, https://substackcdn.com/image/fetch/$s_!pA_L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png 1272w, https://substackcdn.com/image/fetch/$s_!pA_L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pA_L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png" width="1456" height="659" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c760e606-3258-4926-bfa1-c9564696bd91_1600x724.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pA_L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png 424w, https://substackcdn.com/image/fetch/$s_!pA_L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png 848w, https://substackcdn.com/image/fetch/$s_!pA_L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png 1272w, https://substackcdn.com/image/fetch/$s_!pA_L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc760e606-3258-4926-bfa1-c9564696bd91_1600x724.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Overall performance of KRONOS on prognostic tasks.</figcaption></figure></div><p>For comparison, we benchmarked KRONOS against classical and deep learning methods from the biomedical literature for predicting prognostic outcomes from omics data. These models range from traditional MLPs to graph classification of proteomics-injected PPI networks to multi-modal LLMs with various encoder backbones.
Unlike methods specifically trained to predict prognostic outcomes from proteomics data, KRONOS learns general-purpose representations through multi-modal semantic alignment, without task-specific fine-tuning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nD-P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nD-P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png 424w, https://substackcdn.com/image/fetch/$s_!nD-P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png 848w, https://substackcdn.com/image/fetch/$s_!nD-P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png 1272w, https://substackcdn.com/image/fetch/$s_!nD-P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nD-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png" width="1267" height="414" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:414,&quot;width&quot;:1267,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nD-P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png 424w, https://substackcdn.com/image/fetch/$s_!nD-P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png 848w, https://substackcdn.com/image/fetch/$s_!nD-P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png 1272w, https://substackcdn.com/image/fetch/$s_!nD-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3553f8a-cf21-4aa2-9e76-a1678591ed6d_1267x414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Comparative performance of general-purpose KRONOS embeddings against various task-trained models.</figcaption></figure></div><h3><strong>Where We Go From Here</strong></h3><p>KRONOS represents more than just another biomedical AI tool&#8212;it&#8217;s proof that we can teach language models to fluently speak the language of molecular biology.</p><p><strong>Our results show something remarkable: when you combine patient-specific protein networks with instruction-tuned LLMs, you don&#8217;t just get better predictions; you get a system that truly understands the biological story behind each patient&#8217;s data</strong>. The model&#8217;s ability to outperform traditional approaches across mortality prediction, cancer typing, and survival analysis demonstrates exciting progress towards even more patient-centric care.</p><p>The most exciting implication of our paper is that KRONOS demonstrates that the gap between molecular biology and clinical medicine can be bridged by artificial intelligence.
This isn&#8217;t about replacing doctors or biologists; it&#8217;s about giving them a tool that can instantly place complex molecular measurements in clinical context. <strong>As we refine this technology and expand to more biologically relevant modalities, we move toward a future where treatments are guided by a deeper understanding of individual patient molecular landscapes</strong>.</p><p><em>For deep dives into our other recent publications and to receive future news and updates from Standard Model Biomedicine, subscribe to our Substack, follow us on X and LinkedIn, or check out our website at <a href="http://www.standardmodel.bio">www.standardmodel.bio</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Introducing Standard Model Biomedicine]]></title><description><![CDATA[We're building a multimodal foundation model that integrates any type of data for biopharma, academic medical centers, and biobanks.]]></description><link>https://blog.standardmodel.bio/p/introducing-standard-model-biomedicine</link><guid isPermaLink="false">https://blog.standardmodel.bio/p/introducing-standard-model-biomedicine</guid><dc:creator><![CDATA[Kevin Brown]]></dc:creator><pubDate>Tue, 30 Sep 2025 16:30:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yLBq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We are building the <em><strong>Standard Model for Biomedicine</strong></em>.</p><p>The Standard Model is a multimodal foundation model that integrates any biomedical measurement at any scale, from the molecular to the whole-patient level, into a common, shared representation of a human patient.
This is a fundamentally human endeavor that respects the inherent complexity of patients.</p><p>We expect that biomedical AI researchers will tailor the Standard Model for downstream use cases as diverse as medicine itself. In biopharma, we anticipate the Standard Model powering tasks ranging from digital twins for prognosis to site enrollment and trial planning and execution. In academic medicine, we foresee our model pushing forward AI-driven radiation toxicity assessments in early-stage lung cancer patients using electrocardiograms and radiology scans, or using patient genomic profiles for targeted therapy recommendations.</p><p>Broadly speaking, the Standard Model will accelerate AI development timelines and increase the performance of downstream applications in biomedical AI, just as other foundation models have done in language and vision in other industries.</p><p>We are backed on this mission by VC firms <a href="https://www.arkitektventures.com/">Arkitekt Ventures</a> and <a href="https://www.virtuevc.com/">Virtue</a>, and we&#8217;ve grown <a href="https://www.standardmodel.bio/">a team of AI and biopharma leaders with deep domain expertise</a> to build our model.</p><h2><strong>Biomedicine Needs a Multimodal Foundation Model</strong></h2><p>We are driven by four beliefs about what it takes to accelerate performant AI for human biology:</p><ol><li><p><strong>Performance Is Everything in Biomedicine. </strong>Small differences in clinical efficacy relative to standard of care drive billions of dollars in revenue and change millions of lives. Every marginal improvement in performance can define clinical, regulatory, or commercial success.</p></li></ol><ol start="2"><li><p><strong>Data Enables Performance. </strong>The primacy of scale is well accepted in traditional machine learning and AI; the most performant large language models are trained on nearly exhaustive amounts of human text, and likewise for vision models.
The largest biomedical models have not yet been trained at this scale.</p></li></ol><ol start="3"><li><p><strong>Data is Naturally Siloed in Biomedicine</strong>. Any organization whose data access is ultimately limited will likely fail, because it will never reach the scale required. Because biopharma and academic researchers typically collect data in narrow bands for specific questions, the data required for training at scale never amasses within one organization.</p></li></ol><ol start="4"><li><p><strong>The Most Performant Foundation Models Will Not Be Siloed. </strong>The best foundation models will be trained on data across modalities as well as disease areas&#8212;data that is naturally siloed in the medical world today. Those silos are detrimental to foundation model training. Patient data is inherently multimodal because patients are inherently multimodal; relevant measurements even for very specific indications span multiple ways of measuring patients.</p></li></ol><h2><strong>Where We Are Headed: From a Universal Patient Representation to a Biological Reasoning Engine</strong></h2><p><strong>We are building a universal foundation model for biomedicine. We do this by ingesting data from any modality and mapping it to a shared representation space. </strong>Each patient&#8217;s data is translated into a set of embeddings across time, representing patient state.
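</p><p><em>To make the idea concrete, here is a minimal sketch with hypothetical encoders, modality names, and dimensions of our own choosing (not our production architecture): each modality gets its own encoder into a shared space, so heterogeneous measurements land in the same coordinates and can be pooled into a patient-state vector at each time point.</em></p>

```python
import numpy as np

rng = np.random.default_rng(1)
D = 32  # dimensionality of the shared patient-state space (illustrative)

# Hypothetical per-modality encoders, sketched here as fixed linear
# maps; in a real system each would be a learned network.
encoders = {
    "labs":     rng.normal(size=(12, D)),   # 12 lab values
    "vitals":   rng.normal(size=(5, D)),    # 5 vital signs
    "genomics": rng.normal(size=(100, D)),  # 100 variant features
}

def embed(modality, measurement):
    """Map one raw measurement into the shared D-dimensional space."""
    return measurement @ encoders[modality]

# Measurements from different modalities land in the same coordinate
# system, so each visit can be pooled into a single patient state,
# and a longitudinal record becomes a sequence of such states.
states = []
for _ in range(3):  # three visits over time (toy data)
    visit = np.stack([
        embed("labs", rng.normal(size=12)),
        embed("vitals", rng.normal(size=5)),
    ])
    states.append(visit.mean(axis=0))  # one pooling choice among many

trajectory = np.stack(states)
print(trajectory.shape)  # prints (3, 32): patient state across time
```

<p>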
Much like physics represents particle systems in appropriate coordinates, we develop a new coordinate system for patients.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yLBq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yLBq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic 424w, https://substackcdn.com/image/fetch/$s_!yLBq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic 848w, https://substackcdn.com/image/fetch/$s_!yLBq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic 1272w, https://substackcdn.com/image/fetch/$s_!yLBq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yLBq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic" width="1456" height="962" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:962,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:340192,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.standardmodel.bio/i/174938176?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yLBq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic 424w, https://substackcdn.com/image/fetch/$s_!yLBq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic 848w, https://substackcdn.com/image/fetch/$s_!yLBq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic 1272w, https://substackcdn.com/image/fetch/$s_!yLBq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c6d26a-1916-4abe-b20a-306d0f54de6b_3172x2096.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once this fundamental representation &#8211; the Standard Model &#8211; is established, it can be fed to other reasoning engines. These engines can then drive decisions in diverse use cases across biopharma and academic medicine, such as:</p><ul><li><p>analyzing natural histories of patient disease </p></li><li><p>optimizing care pathways </p></li><li><p>forecasting clinical trials in silico</p></li><li><p>optimizing inclusion/exclusion criteria </p></li><li><p>modeling the probability of technical and regulatory success</p></li><li><p>and beyond</p></li></ul><p><strong>Critically, we intend to serve as the quiet backbone of these systems</strong>. Biopharma and academic medical centers should spend time bringing domain expertise to downstream applications, context-specific benchmarking (e.g. 
clinical trials), and curating specialized datasets. Standard Model Biomedicine offers the performance benefits that come with scale, allowing experts to optimize for the last 10 percent specific to their applications.</p><h2><strong>Where We Are Today</strong></h2><p>Human biology operates meaningfully across several scales, from the incredibly small, such as genomic mutations, to the incredibly large, such as lifestyle choices or surgical interventions. <strong>We firmly believe the best foundation model will integrate data from every scale.</strong></p><p>To that end, we&#8217;re releasing three new papers focused on molecular, cellular, and whole-patient scales. These publications build on the paper we quietly published last year to showcase the strength of our model: <em><a href="https://doi.org/10.48550/arXiv.2406.09454">Advancing High Resolution Vision-Language Models in Biomedicine</a></em>.</p><p>Taken together, these papers establish state-of-the-art methods at every level of human biology. Ultimately, we link models at every scale into the Standard Model of Biomedicine.</p><h2>1. Oncology &amp; Whole Genome Sequencing</h2><p>The ability to encode genomic sequences is the most fundamental capability of the Standard Model. Genomic mutations are fundamental to oncology and other diseases; any model of biomedicine must represent them in some way.</p><p>In this paper, we trained GenVarFormer (GVF), a whole genome sequencing foundation model, to predict the functional consequence of variants on gene expression, achieving state-of-the-art performance on downstream tasks.</p><p><strong>Read the full paper here: </strong><em><strong><a href="https://arxiv.org/abs/2509.25573">GenVarFormer: Predicting Gene Expression From Long-Range Mutations in Cancer</a></strong></em></p><h2>2.
A Molecular Language Model</h2><p>Linking molecular and cellular foundation models to a shared representation space, much like vision-language models, is a fundamental challenge for building the Standard Model. To that end, we developed a test case linking proteomic graph neural networks to language. This approach is applicable beyond proteomics to any graph-based representations at a cellular level.</p><p><strong>Read the full paper here: </strong><em><strong><a href="https://arxiv.org/abs/2509.22853">Patient-Specific Biomolecular Instruction Tuning of Graph-LLMs</a></strong></em></p><h2>3. Using EHRs to Predict Next Medical Codes</h2><p>Any multimodal foundation model must have the ability to ingest text, in particular longitudinal EHRs, the broadest and highest-level modality of input. This work reframes EHRs as timestamped chains of clinical events and fine&#8209;tunes large language models to predict the next event, improving temporal reasoning over disease trajectories.</p><p><strong>Read the full paper here: </strong><em><strong><a href="https://arxiv.org/abs/2509.25591">Building the EHR Foundation Model via Next Event Prediction</a></strong></em></p><h2><strong>Our Vision for the Future of Biomedicine</strong></h2><p>Going forward, Standard Model Biomedicine will continue to drive cutting-edge research, deploy our model to partners across the ecosystem, and collaborate with the highest-quality data sources in biomedicine. 
We will combine all modalities and scales of data into a single overarching Standard Model of human biology.</p><p><strong>We&#8217;re always looking for people to join us in building the future of biomedical research.</strong></p><ul><li><p>If you&#8217;re a biopharma company that wants to partner or a data source that wants to drive value from your data, reach out <a href="mailto:info@standardmodel.bio">here</a></p></li><li><p>If you&#8217;re a medical AI researcher that wants to partner, reach out <a href="mailto:info@standardmodel.bio">here</a></p></li><li><p>If you&#8217;re both technical and biologically inclined and would like to join our team, reach out <a href="mailto:hiring@standardmodel.bio">here</a></p></li><li><p>If you&#8217;d just like to chat about foundation models in biology, definitely reach out <a href="mailto:hello@standardmodel.bio">here</a></p></li></ul><p>We&#8217;re looking forward to sharing more about Standard Model Biomedicine and our work. To stay updated, follow us on <a href="https://www.linkedin.com/company/standardmodel-bio">LinkedIn</a> and <a href="https://x.com/smbiomedicine">X</a>, <a href="https://blog.standardmodel.bio/subscribe">subscribe</a> to our Substack, or reach out to us. 
<br><br><strong>The foundation for the future of biomedical research and drug development is being laid now, one datapoint at a time</strong>.<br></p>]]></content:encoded></item></channel></rss>