
THE BRIEFING

Is there a common thread running through this issue? Maybe it's this: things that weren't working are starting to work.

☑️ Perturbation models that couldn't beat an average now approach the reproducibility of real experiments.

☑️ A diagnostic AI eclipses benchmarks the field has used since 1959.

☑️ DeepMind's medical AI goes from reading text to guiding physical exams over video.

☑️ Three of biology’s molecular layers get modeled together.

☑️ The black box is opening: Goodfire's Silico makes it possible for any research team to look inside a model and see what it has learned.

Oh, and there’s now a world model that simulates how your body responds to drugs, exercise, and diet.

Your business has grown. Is your accounting on the same path?

When you started out, doing your own books made sense. But the business you're running today isn't the one you started. If your accounting hasn't kept pace, it's quietly costing you — outdated financials, no clear view of what's actually profitable, and hours every week pulled away from the work that grows your business. At BELAY, our Financial Experts integrate directly into your business. They manage your books, reconcile accounts, run payroll, and deliver the timely insight you need to make big decisions with confidence. Stop guessing. Start knowing.

NEWS
A model that reads genes, RNA, and proteins as one system

Credit: PolymathicAI

Most AI models for biology are specialists: one reads DNA, another folds proteins, a third predicts how genes get cut and stitched into working instructions. What one learns about a protein can't inform another's understanding of the gene encoding it. MIMIC is a billion-parameter generative model that represents all three in one framework, trained on 13 million RNA transcripts and 15.5 million proteins drawn from more than 6,000 organisms.

The model, described in a preprint, comes from PolymathicAI at the Flatiron Institute - the Simons Foundation's computational science division - led by Shirley Ho. Until now their work focused on physics. MIMIC is their first biology release.

It outperformed gene-processing models, including DeepMind's AlphaGenome, as well as top protein-only models, despite being smaller than many of its single-modality competitors. The team argues that diverse, aligned data across molecular layers can substitute for scale - a recurring theme in BAIO coverage.

And beyond benchmarks?

☑️ MIMIC predicts the chemical reactivity profiles that researchers normally measure in the lab to study RNA shape. Fed into a standard folding algorithm, those predictions produced nearly the same computed structures as real experimental data did - at least in some cases.

☑️ It responds to context too: give it the same RNA molecule under different experimental conditions and it adjusts its predictions accordingly.

☑️ It can work in reverse. Most models predict what a sequence will do. Often the opposite is needed - start with the outcome you want, figure out what sequence gets you there. MIMIC designed sequences to meet functional targets, checking against independent tools to avoid grading its own homework.
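
How do reactivity profiles "feed into" a folding algorithm? One widely used approach (Deigan et al., 2009, implemented in tools such as RNAstructure) turns each nucleotide's reactivity into a pseudo-free-energy term that rewards pairing low-reactivity nucleotides and penalises pairing high-reactivity ones. The sketch below assumes that approach and the original paper's parameters (slope 2.6, intercept -0.8 kcal/mol); other tools use different defaults, and the preprint is not the source for which tool or settings were used. The reactivity values and the clamping of negative values are illustrative assumptions.

```python
import math

# Assumed Deigan-style parameters in kcal/mol; folding tools differ in their defaults.
SLOPE_M = 2.6
INTERCEPT_B = -0.8


def pseudo_energy(reactivity: float, m: float = SLOPE_M, b: float = INTERCEPT_B) -> float:
    """Pseudo-free-energy term (kcal/mol) for a nucleotide placed in a stacked base pair.

    Low reactivity (likely paired) yields a negative, stabilising bonus;
    high reactivity (likely unpaired) yields a positive penalty.
    Assumption: negative reactivities, common after normalisation, are clamped to 0.
    """
    r = max(reactivity, 0.0)
    return m * math.log(r + 1.0) + b


# Hypothetical reactivity profile for a short RNA (illustrative, not MIMIC output).
profile = [0.05, 0.10, 1.80, 2.40, 0.02]
terms = [round(pseudo_energy(r), 2) for r in profile]
print(terms)
```

The folding algorithm then adds these terms to its usual energy model, so predicted and measured profiles can be compared by the structures they produce.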

Why it matters: DNA, RNA, and protein constrain each other - a mutation in one layer can break processing in another. Most models miss these dependencies because they see only one layer. MIMIC is an attempt to learn them jointly.

Did you know? PolymathicAI is hiring. Code, model weights, and the LORE training dataset are all being prepared for open-source release.

NEWS
Perturbation prediction is catching up with the lab

Credit: Valence Labs

Knock out a gene in a cell and thousands of other genes change their activity. Predicting those changes computationally is one of the central problems in drug discovery. It has also been a humbling one. As BAIO reported in Issue 10, a Nature Biotechnology review by leading researchers found that perturbation models do not consistently beat simpler baselines - in some cases, just averaging all previous experimental responses predicted better than the AI did.
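
The "average" baseline is simple enough to write down. A minimal sketch, assuming each perturbation's response is stored as a vector of per-gene expression changes (a hypothetical data layout, not the review's code): predict the mean of all training responses for every new perturbation, then score it by Pearson correlation with the measured change. The data here are synthetic.

```python
import numpy as np


def mean_baseline(train_responses: np.ndarray) -> np.ndarray:
    """Predict the same response for every unseen perturbation:
    the per-gene average of all training responses.

    train_responses: shape (n_perturbations, n_genes); each row is an
    expression-change vector (perturbed minus control).
    """
    return train_responses.mean(axis=0)


def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between a predicted and a measured response."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic data: responses share a common component plus perturbation-specific noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=200)                       # effect common to most knockouts
train = shared + 0.5 * rng.normal(size=(50, 200))   # 50 training perturbations
held_out = shared + 0.5 * rng.normal(size=200)      # one unseen perturbation

baseline_pred = mean_baseline(train)
print(f"mean baseline r = {pearson(baseline_pred, held_out):.2f}")
```

One plausible reason the baseline is hard to beat: if many knockouts trigger a similar generic response, the average already captures most of it, so a model must add perturbation-specific signal on top to score higher.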

TxPert, published in Nature Biotechnology by Valence Labs - Recursion's research arm (whose work on verifiable biological reasoning we covered in Issue 18) - clears that bar. On three of four cell types tested, the model's predictions approach the level of consistency within the experimental data itself. Even a real experiment produces slightly different results each time you measure it, because biology is noisy and no two cells respond identically. TxPert's predictions are getting close to that natural noise floor.
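
One way to make the "noise floor" concrete, as an illustration of the idea rather than TxPert's exact metric: compare how well a prediction agrees with a measurement against how well two repeat measurements of the same knockout agree with each other. All data and noise levels below are synthetic assumptions.

```python
import numpy as np


def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two expression-change vectors."""
    return float(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(1)
true_effect = rng.normal(size=500)                # unobservable "real" response
rep1 = true_effect + 0.6 * rng.normal(size=500)   # replicate measurement 1
rep2 = true_effect + 0.6 * rng.normal(size=500)   # replicate measurement 2

# Reference level: agreement between two noisy measurements of the same experiment.
noise_floor = pearson(rep1, rep2)

# A hypothetical model that recovers the true effect with its own, larger error.
prediction = true_effect + 0.9 * rng.normal(size=500)
model_score = pearson(prediction, rep2)

print(f"replicate r = {noise_floor:.2f}, model r = {model_score:.2f}")
print(f"fraction of reference reached = {model_score / noise_floor:.2f}")
```

Note that replicate agreement is a practical reference point rather than a strict ceiling: a predictor is compared against one noisy replicate, so a sufficiently accurate model can in principle score above it.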

The method learns from maps of known biological relationships - which proteins interact, which pathways connect, which genes show similar effects when disrupted. But it does not just look up whether two genes are related. A type of neural network designed for graph-structured data walks through these relationship networks, learning patterns that let it make predictions even for genes it has never seen - because those genes are still connected to known ones through the network.
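
Predicting for a never-seen gene via its neighbours can be illustrated with one round of graph message passing. A minimal sketch, assuming a toy gene graph and simple mean aggregation (TxPert's actual architecture and graphs are more elaborate, and the gene names and features here are hypothetical): a gene with no features of its own still receives a non-zero representation from the genes it connects to. A real model would also apply learned weight matrices and nonlinearities, omitted here.

```python
import numpy as np

genes = ["GENE_A", "GENE_B", "GENE_C", "NEW_GENE"]   # hypothetical gene names
# Undirected edges from a hypothetical interaction map.
edges = [(0, 1), (1, 2), (3, 0), (3, 2)]

# Learned features exist for known genes only; the unseen gene starts at zero.
features = np.array([
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
    [0.0, 0.0],   # NEW_GENE: nothing learned about it directly
])


def message_pass(x: np.ndarray, edges: list, n: int) -> np.ndarray:
    """One round of mean-aggregation message passing over an undirected graph."""
    adj = np.zeros((n, n))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    deg = adj.sum(axis=1, keepdims=True)
    neighbour_mean = adj @ x / np.maximum(deg, 1.0)   # guard isolated nodes
    return 0.5 * x + 0.5 * neighbour_mean             # mix own state with neighbours' average


updated = message_pass(features, edges, len(genes))
print(dict(zip(genes, updated.round(3).tolist())))
```

After one round, NEW_GENE carries an average of GENE_A and GENE_C, which is what lets a downstream predictor say something about it.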

The paper showed that stacking multiple maps improved predictions incrementally. The strongest results came from combining public databases with Recursion's own large-scale screening data, where the company has measured what cells look like under a microscope and how gene activity shifts when specific genes are knocked out.
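
Why might stacking maps help? One simple way to combine several graphs, assumed here for illustration (a relational-GNN-style sum with per-graph weights, not necessarily what TxPert does): each map contributes its own neighbour average. A gene isolated in one map can still pick up signal from another, for instance a map derived from screening data. Graphs, weights, and features are hypothetical.

```python
import numpy as np


def neighbour_mean(x: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Average neighbour features under one adjacency matrix."""
    deg = adj.sum(axis=1, keepdims=True)
    return adj @ x / np.maximum(deg, 1.0)


def multi_graph_update(x: np.ndarray, adjs: list, weights: list) -> np.ndarray:
    """Add each graph's neighbour average, scaled by a per-graph weight.

    In a trained model these weights would be learned; here they are fixed.
    """
    out = x.copy()
    for adj, w in zip(adjs, weights):
        out = out + w * neighbour_mean(x, adj)
    return out


x = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
ppi = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)     # gene 2 isolated
screen = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)  # gene 2 similar to gene 0

one_map = multi_graph_update(x, [ppi], [0.5])
two_maps = multi_graph_update(x, [ppi, screen], [0.5, 0.5])
print(one_map[2], two_maps[2])
```

With only the interaction map, gene 2 stays at zero; adding the second map gives it information from gene 0.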

Why it matters: The field has been candid about how badly perturbation models have underperformed. TxPert matters both as an individual model and as evidence that the failure modes are understood and being addressed.

Did you know? TxPert's code is open-source and the team built an interactive web app where researchers can select a cell type, knock out a gene, and visualise predicted expression changes. Recursion's proprietary screening data - which produced the strongest results - is not included in the public release. Valence Labs is hiring.

NEWS
According to DeepMind, the AI will see you now

Google DeepMind announced an AI co-clinician research initiative - a system designed to work alongside doctors and patients rather than replacing either. The concept, which DeepMind calls “triadic care,” treats the AI as a third member of the clinical team: it can collect patient history, guide parts of a physical exam over video, and surface evidence, but the physician keeps authority over decisions.

In a blind comparison on 98 primary care queries, physicians preferred the co-clinician's answers over those of two leading clinical AI tools, including GPT-5.4. It recorded zero critical errors in 97 cases. In 120 simulated telemedical encounters with audio and video, it matched or exceeded primary care physicians in 68 of 140 assessed areas. But experienced specialists still outperformed it at spotting red flags - the highest-stakes part of a consultation.

Harvard Medical School and Stanford are collaborating on expanded testing.

Why it matters: The WHO projects a shortfall of over 10 million health workers by 2030. DeepMind is not proposing to replace them but to free them up - with an AI that handles routine evidence gathering so clinicians can focus on the judgments that require experience.

Did you know? DeepMind's medical AI has progressed from passing exams (Med-PaLM) to matching physicians in text consultations (AMIE) to this - multimodal interactions with audio, video, and guided physical examination. The system is built on Gemini and Project Astra. Google DeepMind is hiring.

NEWS
A world model for your body

What happens to your blood pressure if you start running three times a week? How does your cholesterol respond to a statin? Doctors answer from population averages. Eran Segal's lab at the Weizmann Institute built a model that tries to answer person by person.

In a preprint, Segal's team describes HealthFormer, trained on data from over 15,000 people tracked across 667 types of measurement - blood tests, body scans, sleep, glucose, gut bacteria, wearables, diet, and medications. It learns to predict what comes next in a person's health trajectory across all these systems at once.

The team tested it against 41 published drug and lifestyle trials by feeding the model synthetic patients matching each trial's demographics and adding the intervention. In all 41, the model correctly predicted whether each intervention would raise or lower the relevant measure. In 30, it also predicted the size of the change.
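
The direction test is simple arithmetic: for each trial, compare the sign of the simulated change with the sign of the change the trial reported. A sketch with made-up numbers (not the paper's data), assuming effects are expressed as after-minus-before differences; the tolerance rule for magnitude is a hypothetical stand-in for whatever criterion the authors used, and the trial names are invented.

```python
from dataclasses import dataclass


@dataclass
class TrialComparison:
    name: str                 # hypothetical trial label
    observed_change: float    # change reported by the trial (made-up value)
    simulated_change: float   # change predicted for matched synthetic patients (made-up value)


def same_direction(t: TrialComparison) -> bool:
    """True when prediction and trial agree on raise vs. lower."""
    return (t.observed_change > 0) == (t.simulated_change > 0)


def magnitude_close(t: TrialComparison, rel_tol: float = 0.5) -> bool:
    """Hypothetical magnitude check: prediction within rel_tol of the observed effect size."""
    return abs(t.simulated_change - t.observed_change) <= rel_tol * abs(t.observed_change)


trials = [
    TrialComparison("statin_ldl", observed_change=-40.0, simulated_change=-31.0),
    TrialComparison("exercise_sbp", observed_change=-5.0, simulated_change=-1.5),
    TrialComparison("diet_hba1c", observed_change=-0.3, simulated_change=-0.25),
]

direction_hits = sum(same_direction(t) for t in trials)
magnitude_hits = sum(magnitude_close(t) for t in trials)
print(f"direction correct: {direction_hits}/{len(trials)}")
print(f"magnitude within tolerance: {magnitude_hits}/{len(trials)}")
```

Separating the two checks mirrors the distinction in the results: getting the direction right is a weaker bar than also getting the size of the effect right.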

The model was trained on Israeli data from the Human Phenotype Project. When applied to populations it had never seen - including UK Biobank and the US-based Framingham Heart Study - it still worked, improving on a basic clinical baseline on 27 of 30 conditions and beating established risk scores like Framingham on the conditions where a direct comparison was possible.

The authors are careful to say the model has learned patterns in observational data, not established that interventions cause specific outcomes.

Why it matters: Most health AI estimates who is at risk. HealthFormer simulates how a person's body might change under a specific intervention - a step toward what the paper calls “an initial health world model.”

Did you know? Segal's lab previously built GluFormer, a foundation model for continuous glucose monitoring data, published in Nature earlier this year. HealthFormer extends the approach from one data stream to 667.

NEWS
An LLM just outscored doctors on clinical reasoning

One of our stories above describes what DeepMind thinks medical AI should be: a teammate. This study, published in Science, shows why deciding what role AI should play in medicine is becoming urgent.

Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center ran six experiments pitting OpenAI's o1 against hundreds of physicians on clinical reasoning tasks. The authors acknowledge that o1 has since been overtaken by newer models, but they expect the results would hold or improve.

The key experiment used real, unprocessed emergency department records - not tidy case summaries but the messy charts doctors actually work from. At every stage, from early triage through to admission decisions, the model outperformed physician baselines.

“This does not mean AI will necessarily improve care,” co-senior author Arjun Manrai said in a press release. “We desperately need rigorous prospective trials.”

Why it matters: The benchmarks medicine has used to evaluate diagnostic AI since the 1950s may no longer be sufficient. A companion perspective in Science argues the question is no longer whether AI can match doctors in controlled settings, but whether it can safely improve care for real patients.

Did you know? The standard this study says AI has eclipsed was set in 1959 - in a paper published in this same journal. Ledley and Lusted argued in Science that computers should be tested on the clinical cases doctors face. That standard held for 67 years.

THE EDGE

Goodfire - whose work interpreting Evo 2 for genetic variant prediction we covered in Issue 17 - launched Silico, a platform that lets researchers see inside any AI model, debug failures, and design behaviour intentionally. Until now, these interpretability techniques were only accessible to teams with dedicated researchers or partnerships with labs like Goodfire. Silico packages them as a product. Partners include Arc Institute and Mayo Clinic. Access is by request.


Until next time,
Peter at BAIO
