Owl Posting: Arguments

How financial architectures shaped (and will continue to shape) Chinese drug development

Abhishaike Mahajan — Mon, 11 May 2026 15:23:49 GMT

Note: this essay is connected to a prior one titled “Curious cases of financial engineering in biotech”. I conclude that piece with the following paragraph:

To end this off: I have deliberately left out China, which may be the most aggressive current example of financial architecture shaping a drug pipeline. That deserves its own essay, and will get one soon.

This is that essay. And to those who already vaguely understand this area: yes, ‘NewCos’ are part of the ‘current state’ section. The future will get more complicated!

The current state of Chinese drug development

If you had to take a guess, why has China been out-licensing drugs so frequently?

I naively assumed it’s because China got very good at moving through early-stage clinical development fast and because the domestic market is simply not as good as the US’s. Neither of these is wrong, but they are incomplete, as they do not explain the timing. The speed advantage and the weak domestic market have both been true for the better part of a decade, but you really only started to hear about the out-licensing in the past few years; 2022 if your job depended on it and 2024 if not. Something else has to be doing the causal work.

And much of that “something else” has to do with finance. My claim is not that finance created Chinese biotech productivity, merely that it determined how the shape of that productivity interacted with the rest of the world. I’d like to start by discussing the contribution of one thing in particular: Chapter 18A of the Hong Kong Stock Exchange (HKEX). Many, many things can be traced back to this particular rule.

But before we wonder what Chapter 18A is, we should first ask: where did Chapter 18A come from?

The proximate cause of it was none other than Alibaba.

In 2013, Jack Ma’s company was preparing to go public, and Hong Kong was the obvious venue; Alibaba was a Chinese company and HKEX was one of the largest Asian exchanges. The problem was that Alibaba wanted to list under a specific partnership system, in which a self-selected group of twenty-eight insiders would have the perpetual right to nominate a majority of the board. Hong Kong’s Listing Rules had forbidden this sort of arrangement for three decades under a principle known as “one share, one vote.” The HKEX, after some months of public handwringing, declined to bend. So Alibaba took its IPO to New York, where arrangements like this had been legal since forever, and in September 2014 listed on the NYSE at a valuation of $231 billion. It was, at the time, the largest IPO in history.

Charles Li, the head of the HKEX, was understandably unhappy about a Chinese success story listing somewhere that was not China, and wrote this:

We respect the company’s decision and wish it well.
We are proud of our tradition of respect for the rule of law and adherence to principles.
However, we also need to find ways to make our market more responsive and competitive, particularly with respect to new economy or technology companies.
We have to consider possible changes where they might be necessary, with everything according to our due process. The Listing Committee’s work on shareholding structures didn’t start because of Alibaba and will not end now because of Alibaba.
We need to ensure our markets continue to be relevant in the new era of economic development.

Over the following four years, HKEX rewrote its rules to ensure an Alibaba situation never happened again. The product was a three-part reform package that took effect on April 30, 2018: Chapter 8A, Chapter 19C, and Chapter 18A. This last one, Chapter 18A, is what we’ll be concerned with, because it is the only one that was specific to biotech companies. It allowed, for the first time, pre-revenue biotech companies to go public without needing to satisfy any of the standard profit, revenue, or cash-flow requirements that other companies had to meet.

Now, this doesn’t mean there were no requirements, but they were softened to match the financial flavor that early-stage biotechs often have. The requirements were as follows: at least one Phase I clinical trial completed with no regulatory objection to proceeding into Phase II; an expected market capitalization at listing of at least HK$1.5 billion (roughly US$192 million); two fiscal years of operating history under substantially the same management; and enough working capital to cover 125% of projected costs for twelve months after listing.
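To make the four gates concrete, here is a minimal sketch of them as a checklist. The thresholds are the ones summarized above; the `Applicant` record, its field names, and the `unmet_requirements` helper are hypothetical names invented for illustration, and the real Listing Rules carry far more nuance than a few comparisons can capture.

```python
from dataclasses import dataclass
from typing import List

# Thresholds as summarized in the paragraph above.
MIN_MARKET_CAP_HKD = 1.5e9     # HK$1.5 billion expected market cap at listing
MIN_OPERATING_YEARS = 2        # fiscal years under substantially the same management
WORKING_CAPITAL_COVER = 1.25   # 125% of projected costs for the next twelve months


@dataclass
class Applicant:
    """Hypothetical record of a pre-revenue biotech seeking an 18A listing."""
    phase1_done_no_objection: bool
    expected_market_cap_hkd: float
    operating_years_same_management: int
    working_capital_hkd: float
    projected_12m_costs_hkd: float


def unmet_requirements(a: Applicant) -> List[str]:
    """Return the names of the requirements the applicant fails (empty if none)."""
    unmet = []
    if not a.phase1_done_no_objection:
        unmet.append("clinical_stage")
    if a.expected_market_cap_hkd < MIN_MARKET_CAP_HKD:
        unmet.append("market_cap")
    if a.operating_years_same_management < MIN_OPERATING_YEARS:
        unmet.append("operating_history")
    if a.working_capital_hkd < WORKING_CAPITAL_COVER * a.projected_12m_costs_hkd:
        unmet.append("working_capital")
    return unmet


if __name__ == "__main__":
    # Illustrative numbers only, not any real company.
    example = Applicant(True, 2.0e9, 3, 600e6, 400e6)
    print(unmet_requirements(example))
```

Notice that nothing in the checklist asks about profit, revenue, or cash flow, which is the whole point of the chapter.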

Charles Li, in the run-up to the rules taking effect, stated that he hoped Hong Kong would overtake NASDAQ in Chinese biotech listings within five years.

The initial results were spectacular.

Ascletis listed in August 2018. BeiGene, which had already listed on NASDAQ, did a secondary in Hong Kong. Innovent Biologics listed in October 2018 and quadrupled in the following year. Junshi, CanSino, Shanghai Henlius, Akeso, and dozens of others followed. By the end of 2021, the peak year, the cumulative capital raised under Chapter 18A had crossed HK$100 billion, and the number of listed companies was approaching fifty. Hong Kong had, just as Li had hoped, become the second-largest biotech listing venue in the world.

And then the biotech winter of 2021-2022 happened. By the end of 2022, of the 56 biotech companies listed under Chapter 18A, only 13 were trading at or above their IPO price. By the end of 2023, only 9 were.

The trajectory of what happened next should be quite clear. Here you have a cohort of roughly sixty pre-revenue biotech companies in one corner of the world, each holding a pipeline of clinical assets ranging from plausibly valuable to genuinely world-class, each prevented from raising equity by the collapse of its own share price, and each locked out of every other public-market financing channel due to geopolitical risk and uncertainty.

On the other side of the Pacific, US pharma companies were staring into the patent cliff, which represented somewhere between $180 billion and $250 billion of revenue at risk from drugs coming off patent by 2030, and desperately scouring the world for assets with which to fill the gap.

These two sides were made for each other. The 18A cohort had clinical pipelines and no capital. Big Pharma had capital and not enough pipelines.

Thus, the out-licensing boom you have heard so much about. In December 2022, Akeso licensed ex-China rights to ivonescimab, its PD-1/VEGF bispecific, to a small Miami-based company called Summit Therapeutics for $500 million upfront and up to $5 billion in total deal value. This was, at the time, the largest single-asset deal ever struck by a Chinese biotech. Two years later, in September 2024, ivonescimab beat Keytruda in a Phase 3 non-small cell lung cancer trial. The result shocked the world, and every Big Pharma BD team reorganized itself around the working assumption that the next blockbuster might come from somewhere in Chongqing or Shanghai or Beijing or Guangzhou. In 2024 alone, Chinese firms out-licensed 94 projects to overseas companies, up from essentially zero a decade earlier. In 2025, the figure was 157 deals worth $135.7 billion. In the first half of 2025, roughly 32% of global innovative-drug out-licensing value originated in China, up from single digits a few years prior.

Could this have happened without 18A?

If the 18A cohort didn't exist, the Chinese biotech industry would be a collection of private companies. Most of them would still be venture-funded, with valuations set by the more conservative Chinese VC culture rather than by the initially frothy public markets that slowly cooled. As such, the urgency to monetize pipelines would be considerably lower, and perhaps there would be little reason to aggressively do transpacific sales of intellectual property to Western buyers. And most important of all: without the clearly legible financial signals that public listing—which would not have existed without 18A!—offered to Western buyers, perhaps most would be too uncertain to ever commit hundreds of millions of dollars upfront to a China-based company they had never heard of, almost certainly slowing down the boom.

On the other hand, a lot of what drove the Chinese biotech ascendancy has nothing to do with 18A and would have happened regardless. China’s primary regulatory authority for drugs, the NMPA, ran through a sequence of reforms starting around 2015 that compressed drug approval timelines from years to months and cleared a backlog of roughly 20,000 applications in two years. The Chinese CRO ecosystem, WuXi and so on, professionalized to the point that running a Phase I in China was genuinely cheaper and faster than running one in Cambridge. The talent got better too! A generation of Western-trained scientists returned to run R&D at Chinese biotechs under the Thousand Talents Plan. And on the demand side, the Western patent cliff was going to happen anyway.

So, the cleanest version of the argument is something narrower than "18A caused the out-licensing boom," which is probably too strong, and broader than "18A was a minor contributor," which is too weak. 18A did not create Chinese R&D productivity, but it did shape how that productivity interacts with Western markets. Which is pretty interesting!

And the dominoes that were set up by 18A are only continuing to fall; Chapter 18A gave legibility to Chinese biotechs, which led to out-licensing, which surely should lead to something else. And what is that something else?

‘NewCos’. Chinese biotechs are getting quite good at their job these days, so good that they are beginning to get a bit more interested in the ‘nearly infinite upside potential’ economics that makes drug discovery so appealing. Past that, HKEX biotechs trade at a substantial discount to their NASDAQ-comparable peers, more or less permanently, due to intense price negotiation by the Chinese government. These two, combined with the fact that China gets upset if one of its companies sets up shop abroad, have pushed financiers into increasingly creative territory.

And a solution soon manifested. Perhaps instead of a Chinese biotech accepting cash or royalties in exchange for their precious molecules, they should instead work with American funds to set up a US-based company around those molecules, taking a big chunk of equity for themselves, with the American funds taking the rest. You could argue that this is seemingly against the spirit of China’s discomfort with its companies setting up shop abroad. I agree! But China is seemingly fine with it.

Kailera Therapeutics is the cleanest recent example of this. On the Chinese side, Jiangsu-based biotech Hengrui contributed its GLP-1 portfolio. Bain Capital, Atlas Venture, and RTW put in $400 million, former US pharma executive Ron Renaud took the CEO seat, and the whole structure was operational within months. The US-based investors get a promising company in their portfolio; one fluffed up by the starry-eyed and optimistic US markets. And Hengrui takes equity in this US-based vehicle, which, compared to a cash payment or bounded royalty, is uncapped on the upside. Both sides win.

As always, it’s worth being a bit concerned by new and exciting developments in finance. What should we be worried about here? The obvious one is that every dollar of Western venture capital that gets deployed into a Kailera is a dollar that doesn’t get deployed into a US-originated asset, with all the obvious caveats that venture capital is not necessarily a fixed pool where every dollar is a one-for-one displacement. Either way, it’s a rational thing to do: play the same M&A game that pharma usually does, but with the side that is actually winning. Is the long-run consequence that the US stops being good at the sort of early-stage discovery it was historically best at?

Whatever the answer is, we’ll certainly be made aware of it in the upcoming decade.

The future of Chinese drug development

Well, maybe not a decade.

Cancer has a surprising amount of detail

Abhishaike Mahajan — Sun, 26 Oct 2025 14:40:52 GMT

There is a very famous essay titled ‘Reality has a surprising amount of detail’. The thesis of the article is that reality is filled, just filled, with an incomprehensible amount of materially important information, far more than most people would naively expect. Some of this detail is inherent in the physical structure of the universe, and the rest of it has been generated by centuries of passionate humans imbuing the subject with idiosyncratic convention. In either case, the detail is very, very important. A wooden table is “just” a flat slab of wood on legs until you try building one at industrial scales, and then you realize that a flat slab of wood on legs is but one consideration amongst grain, joint stability, humidity effects, varnishes, fastener types, ergonomics, and design aesthetics. And this is the case for literally everything in the universe.

Including cancer.

But up until just the last few centuries, it wasn’t really treated that way. It was only in the mid-1800s that Rudolf Virchow, the father of modern pathology, realized that despite most forms of cancer looking reasonably similar to the naked eye, they were, under the microscope, anything but uniform. There was squamous carcinoma with its jagged islands of keratinizing cells, adenocarcinoma with its glandular tubes, sarcoma with its spindle-shaped whorls. And, as a generation of pathologists began to train, they also noticed that the visual appearance of a cancer often seemed to correlate with how slow, aggressive, quiet, violent, widespread, or local the disease ended up being. Over time, those clues accumulated into prognostic systems: Broders’ classification for squamous carcinoma, Bloom-Richardson for breast cancer, Gleason for prostate. What began as an intuitive visual ‘feeling’ became codified into scales that pathologists across the world could consistently apply.

Then, it was noticed that even the genetic material contained within tumor cells was aberrant, their misshapen karyotypes so obvious as to be visible even under a light microscope. In 1960, the discovery of the “Philadelphia chromosome” in chronic myeloid leukemia marked the first consistent, disease-defining genetic quirk of cancer: a translocation between chromosomes 9 and 22. This was not the only one. In the decades of the genetic sequencing revolution that followed, a great many oncogenes (if mutated or overexpressed, cause cancer) and tumor suppressor genes (if lost, cause cancer) were identified and catalogued away. Many of them are immediately recognizable by undergraduate biology students: KRAS, p53, RB1, MYC, and so on. Some of them, namely the BRCA set of mutations, have even entered common parlance as synonymous with inherited cancer risk.

The next jump was in the world of proteins. Sometimes a cancer’s heavily altered genetic sequence is enough to tell you something interesting. But other times, the genetic sequence itself hasn’t changed much; rather, the major alteration is in how little or how much of a gene is expressed as protein. With the advent of immunofluorescence, protein abundance in tissue became possible to study at scale.

During the 1980s, this observation of cancer proteins was used by Axel Ullrich, a scientist at Genentech, and Dennis Slamon, an oncologist at UCLA, to help create one of the most successful oncologic drugs of all time. Axel’s research at Genentech had established HER2 as a particularly aggressive oncogene. At the same time, Dennis’s analyses using UCLA patient records showed that patients whose tumors had HER2 overexpression, determined by protein expression, consistently had worse outcomes. A coincidental discussion between the two led to an obvious question: if HER2 was driving the aggressiveness, could a drug directly target it?

This led to the development of trastuzumab, or Herceptin, an antibody against the HER2 receptor, blocking its activity. In clinical trials led by Slamon in the 1990s, women with HER2-positive tumors, previously consigned to dismal prognoses, now lived far longer when trastuzumab was added to their chemotherapy. Today, trastuzumab is still heavily relied upon, and there are now dozens of other drugs underneath the HER2-targeting umbrella.

But even amongst modern analogues, it still only works in the patients whose tumors bear the detail. If a breast cancer is HER2-negative, trastuzumab is useless, even harmful. If it is HER2-positive, the drug can be life-saving. Consider the following image. The ‘naked’ view of cancer is H&E, or the top row. There, A, B, and C don’t look particularly different, do they?

It is only in looking at the bottom row, where the patient’s tumor cells have been stained with an antibody specifically meant to bind to HER2, that a clinician would know that only patient A would respond to the drug, patient B may slightly respond, and patient C wouldn’t respond at all.

It’s worth being quite astonished at this, because there is very little else like it amongst all human maladies. As in, where entire lines of treatment may fail to work due to an extremely subtle biological phenomenon that, up until a few decades back, science wasn’t even aware of, let alone able to quantify. In most areas of medicine, a diagnosis tends to unify patients under a single therapeutic approach: antibiotics for bacterial pneumonia, insulin for type 1 diabetes, thyroid hormone for hypothyroidism. The drug may differ in dose or formulation, but not in principle.

This is not the case for cancer.

In fact, the story of trastuzumab is not particularly unique; it has repeated again and again and again. Tamoxifen is revolutionary, but only works in ER-positive breast cancer, where the tumor cells are dependent on estrogen signaling. Keytruda is revolutionary, but primarily works in PD-L1 positive cancers, where the tumor microenvironment has upregulated immune checkpoints as a shield against T cells. Tagrisso is revolutionary, but only works in lung cancers where certain genetic mutations are present. And so on.

Cancer, in many ways, is one of the most detailed diseases on earth.

As of today, much of the ‘cancer understanding’ literature is exactly what has gone on for the past hundred-or-so years, just scaled up. Whereas immunofluorescence allowed you to paint one or two proteins onto a tumor section, multiplexed imaging can now overlay forty. Where HER2 or ER were once binary categories, modern day RNA sequencing can reveal thousands of differentially expressed genes, each with subtle implications. And these methods can be pushed to the spatial dimension too, producing maps of gene/protein expression across entire tumors, showing not just what is “on” or “off,” but exactly where and in what neighborhood.

This is great, because cancer is obviously still a major problem facing humanity. Pancreatic cancer still has a 5-year survival rate of just ~13% and lung cancer has a 5-year survival of 9% if metastasized. And some of that surely comes down to us simply not understanding cancer well enough; consider the fact that about 44% of U.S. cancer patients were nominally eligible for an immune checkpoint inhibitor and an estimated ~12.5% actually benefited.

And so work has gone on to learn more, and learn more we have.

We’re finding that spatial organization of CCR7+ dendritic cells in tumors helps predict pembrolizumab response in head and neck cancer. We’re finding that B-cells being localized within so-called tertiary lymphoid structures seem to improve immune checkpoint blockade efficacy. We’re finding that higher CD34 expression in macrophage-dense regions of a tumor correlates with a worse response to camrelizumab. I think one of the craziest things we’ve found is that tumor cells can pump out exosomes—tiny lipid vesicles—carrying microRNAs that reprogram distant tissues into pre-metastatic niches before a single malignant cell arrives; the existence of which can predict response to a great deal of chemotherapies and immunotherapies.

All this is very exciting work. Unfortunately, basically none of it has been turned into anything clinically useful.

I’m not the first to notice this. In the 2010s, there was a flurry of papers bemoaning this exact phenomenon: ‘The failure of protein cancer biomarkers to reach the clinic’, ‘Why your new cancer biomarker may never work’, and ‘Waste, leaks, and failures in the biomarker pipeline’. The first paper has a particularly illustrative line:

…very few, if any, new circulating cancer biomarkers have entered the clinic in the last 30 years. The vast majority of clinically useful cancer biomarkers were discovered between the mid-1960s (for example, carcinoembryonic antigen, CEA) and the early 1980s (for example, prostate-specific antigen (PSA) and carbohydrate antigen 125 (CA125)).

Though these papers were written a decade-or-so back, I can’t find any evidence that there have been any significant breakthroughs since then, with perhaps the exception of cfDNA, or cell-free DNA, though this is still being proven out.

The blame for this is heterogeneous. A lot of the aforementioned papers discuss how newer biomarkers often have shoddy validation, need more data-points, have variable accuracy, or are so biologically implausible as to likely be an artifact of the underlying data. I don’t disagree with any of these. The replication crisis is as real in the cancer biomarker literature as it is anywhere else. But I’d like to focus on one fault that all the papers mention: the inability of many novel biomarkers to improve on the current clinical standard.

I think it is unlikely that any singular biomarker developed after the 1980s will do this. And we shouldn’t expect it to.

Cancer, like everything else in the universe, is defined by a set of rules, a set of universalities. Biologists love to talk about how biology as a domain is filled with exceptions, but even exceptions themselves are rules. In our effort to understand the disease, we have gathered many rules, some of which have been discussed here: HER2, PD-L1, and the like. The field, likely for decades, hoped that these seemingly simple biomarkers were just the tip of the iceberg, and with enough data, enough poring over the numbers, we’d stumble across something more fundamental about cancer; the rest of the iceberg.

This has not been the case. Increasingly, it is seeming like these ‘obvious’ biomarkers do, empirically, account for a great deal of what matters in cancer. Unlike physics, cancer never offered much “room at the bottom”—at least not in the sense of yielding endless layers of clinically useful, legible rules.

Phrased differently: if our existing rules explain, say, 60% of the between-patient variance, how is it possible that any new biomarker could swoop in and shoulder the rest on its own? It cannot. It empirically cannot.

But none of this is to say that it is not worth trying to understand the remaining variance, just that it will require a different strategy.

The situation here is not dissimilar to language. Knowing the meaning of a single word tells you something, but not nearly enough to understand a sentence, much less a paragraph. Meaning emerges from combinations, syntax, context, and emphasis. Cancer is the same. “HER2-positive” is a word. “HER2-positive, PD-L1-high, tumor-mutational-burden-high, tertiary-lymphoid-structure present, with exhausted CD8 niches” is a sentence. Words are enough to get you quite far, but if you wish to operate in the long-tails (where we currently are with cancer!), then it is insufficient. The field has spent the last few centuries compiling the words, but now it is time to learn the grammar, the joint-distribution of every word in combination with every other word.

In other words, the obvious next step is to stop asking for singular biomarkers to bear the entire burden of explanation, and instead ask how many small signals can be woven into a coherent, usable picture. But this creates a combinatorial explosion! If you have 20 binary biomarkers, that’s over a million possible patient subgroups. No trial, no matter how well-funded, can enumerate that space.
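The arithmetic behind that claim is short enough to check directly. The sketch below counts the subgroups defined by n binary biomarkers and, for a hypothetical trial size that I made up purely for illustration, how thinly patients would be spread if they were divided evenly across those subgroups.

```python
def subgroup_count(n_biomarkers: int) -> int:
    """Each binary biomarker doubles the number of distinct patient subgroups."""
    return 2 ** n_biomarkers


def patients_per_subgroup(n_patients: int, n_biomarkers: int) -> float:
    """Average patients per subgroup if patients were spread evenly (a simplification)."""
    return n_patients / subgroup_count(n_biomarkers)


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(f"{n} binary biomarkers -> {subgroup_count(n):,} subgroups")

    # Hypothetical trial size, chosen only for illustration.
    trial_size = 10_000
    print(f"{patients_per_subgroup(trial_size, 20):.4f} patients per subgroup")
```

With 20 biomarkers, even a trial of ten thousand patients averages under one hundredth of a patient per subgroup, so nearly every cell in the table is empty.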

How can we escape this problem? It is increasingly my opinion that the only reasonable path forward is to delegate the problem of cancer biomarkers to machine intelligence. Rely on the compression, abstraction, and pattern-finding abilities of statistical models that can hold dozens, hundreds, or thousands of weak signals in memory at once, and then distill them down into single, actionable scores.

This may sound far-fetched, but realistically speaking, this is where the oncology field has been moving for decades. Multigene expression panels from the early 2000s, like OncotypeDX or MammaPrint, were, in spirit, primitive machine-learning models: linear combinations of weak features, trained against outcomes, that outperform any single gene.
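A toy simulation shows why a linear combination of weak features can beat every one of its ingredients. Everything here is synthetic and assumed: the "genes" are noisy copies of a made-up outcome, the signal strength is arbitrary, and the combination is a plain equal-weight average rather than weights fitted against outcomes, so this is a sketch of the statistical idea and not of how any commercial panel is built.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate(n_patients=2000, n_features=20, signal=0.3, seed=0):
    """Return (best single-feature correlation, combined-score correlation)."""
    rng = random.Random(seed)
    # Synthetic outcome per patient (for example, a continuous risk measure).
    outcome = [rng.gauss(0, 1) for _ in range(n_patients)]
    # Each feature carries a little of the outcome plus a lot of independent noise.
    features = [
        [signal * y + rng.gauss(0, 1) for y in outcome]
        for _ in range(n_features)
    ]
    single = [pearson(f, outcome) for f in features]
    # Equal-weight linear combination: averaging cancels the independent noise.
    combined = [
        sum(f[i] for f in features) / n_features for i in range(n_patients)
    ]
    return max(single), pearson(combined, outcome)


if __name__ == "__main__":
    best_single, combined = simulate()
    print(f"best single feature r = {best_single:.2f}")
    print(f"combined score r      = {combined:.2f}")
```

The independent noise in each feature averages out while the shared signal does not, which is the whole trick; fitting the weights against outcomes can only sharpen it.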

And in recent years, it is accelerating even further.

For example, you may be aware that the aforementioned BRCA mutations, a massive driver of breast cancer risk, cause homologous recombination deficiency (HRD), or the inability to faithfully repair double-strand breaks in DNA. In turn, this often causes cancer. But what may be a surprise is that BRCA mutations aren’t the only way that a patient could have HRD. Many other genes in the homologous recombination repair pathway (PALB2, RAD51C, RAD51D, FANCA, ATM, CHEK2, and more) can be mutated, leading to the exact same phenotype. Even promoter methylation of BRCA1 (with the gene intact but “turned off”) can produce HRD. And knowing whether a patient’s tumor is HRD-positive matters a lot because, once again, it can be exploited by a therapeutic! If a tumor is HRD-positive, regardless of whether the deficiency came from a BRCA1 deletion, a RAD51C mutation, or promoter methylation, it is often extremely sensitive to a class of drugs called PARP inhibitors.

So, understanding if a patient actually has HRD is both difficult and valuable. To help out with this, a company called Myriad Genetics developed myChoice, a test that computes a measure of HRD via a “genomic instability score” by integrating three measures of chromosomal damage: loss of heterozygosity, telomeric allelic imbalance, and large-scale state transitions, all extracted from the tumor. As far as I can tell from the technical documentation, the raw score itself, unlike gene signatures, has no intrinsic biological meaning. Its clinical utility comes entirely from an empirically determined threshold, established through population-level studies, that designates tumors as HRD-positive.
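The shape of such a score is easy to sketch, even if the real thing is not. In the snippet below, the three inputs are the scar measures named above, but the unweighted sum, the `ScarCounts` record, and the default cutoff are all assumptions made for illustration; the actual construction and threshold belong to the test’s own documentation.

```python
from dataclasses import dataclass


@dataclass
class ScarCounts:
    """Hypothetical per-tumor counts of the three chromosomal-damage measures."""
    loh: int  # loss of heterozygosity
    tai: int  # telomeric allelic imbalance
    lst: int  # large-scale state transitions


def genomic_instability_score(counts: ScarCounts) -> int:
    """Integrate the three measures into one number (assumed: a plain sum)."""
    return counts.loh + counts.tai + counts.lst


def is_hrd_positive(counts: ScarCounts, threshold: int = 42) -> bool:
    """Apply an empirically chosen cutoff; the default here is an assumed placeholder."""
    return genomic_instability_score(counts) >= threshold


if __name__ == "__main__":
    # Made-up tumors, purely for illustration.
    for tumor in (ScarCounts(10, 15, 20), ScarCounts(5, 5, 5)):
        print(tumor, genomic_instability_score(tumor), is_hrd_positive(tumor))
```

The score means nothing until the threshold is applied, and the threshold comes from population data rather than from biology.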

Mechanistically, we “know” that whatever the myChoice algorithm outputs is about DNA repair failure, but its exact construction is an empirical fit, not a first-principles derivation. Still, it works well enough for the FDA to have approved it as a companion diagnostic in 2021. Of course, the obvious question remains: is this black-box biomarker any better than human-legible ones? The answer does seem to be a tentative yes: 19%-61% of patients identified as HRD-positive by the myChoice test would’ve been missed through simpler methods.

But even this test is white-box in the sense of the inputs (DNA measurements) to the model being legibly tied to the output (HRD-positive) of interest. In the most platonic form of ‘leaving things to the machine’, we would simply feed high-dimensional data to a model, and let it come to its own understanding, unencumbered by what humans think, of what is most important. For a very long time, this didn’t seem like a realistic clinical path forward, because purely data-driven biomarkers are hard to trust, hard to standardize, and hard to regulate. Yes, eventually the FDA would come around, but certainly not anytime soon.

Yet in August 2025, for the first time ever, the cancer field saw the emergence of an FDA-authorized prognostic test that was exactly that: the ArteraAI Prostate Test.

All the test requires is a pathology slide (an ordinary H&E biopsy, the kind already produced for every prostate cancer patient) and a few standard clinical variables. A machine-learning model ingests those slides whole, millions of pixels at a time, and looks for patterns in the tissue architecture that no pathologist has ever consistently been able to describe. The model has no conception of “cells” or “glands,” but through training, implicitly learns the entire language of cellular morphology: the spacing of nuclei, the texture of stroma, the presence of inflammatory niches, and so on.

From this, it outputs two numbers: a risk score for 10-year metastasis rate, and, if the risk is high, a recommendation on whether the patient would benefit from abiraterone, a hormone therapy that reduces testosterone, starving prostate tumor cells. Most curious of all is that the basis of the approval hinged almost entirely on the model being applied to multiple prior Phase III trials across thousands of patients, demonstrating that the model could retrospectively predict which prostate cancer patients responded to hormone therapy.

This may sound boring to pure machine-learning people. After all, the underlying model is, as far as I can tell from their initial paper, just a basic ResNet-50. But to folks in the biotech space, this announcement is nothing short of insane. In fact, multiple parts of this are insane. Not only did the FDA approve a biomarker that was an entirely black-box readout with no human-legible intermediate criteria, they did so on the basis of an extremely large retrospective analysis. It is difficult to express how unexpected this is. Nearly every previous cancer biomarker that has ever made it into the clinic in the last 40 years has been validated prospectively, built into the design of a trial from the ground up, costing millions of dollars. Retrospective analyses in this field are typically hypothesis-generating, suggestive at best, and never enough to stand on their own. But here, it was enough for the FDA.

This should tell us two things.

One, our previous belief that many clinically useful variables are hiding within cancer datasets is almost certainly correct. Each of these variables is likely only weakly predictive alone, but, aggregated together, they are enough to meaningfully stratify outcome. This is not a new hope; over the past decade, countless groups have trained neural networks on pathology slides, promising that “hidden morphologic signatures” could predict everything from molecular subtype to patient survival. How did ArteraAI succeed where others didn’t? Unfortunately, we do not know the answer, but it may come down to the same reason any given machine-learning tool succeeds where others failed: they simply executed better. Even if we agree that cancer is complex enough that machine intelligence is necessary to understand it, the rules of how to do that well remain tricky; slide-level heterogeneity, site-to-site variation in staining, and picking the wrong indication all still matter, all of which can sink an R&D effort if done incorrectly.

And two, the FDA is willing to accept biomarkers that are not directly tied to human-legible biological phenomena. Many people likely assumed that this would eventually happen, but few, including me, would’ve predicted that it could’ve possibly come as early as it did. But it has, and, more importantly, it does not seem like this is an edge case, but rather the beginnings of something new. Consider that in February 2025, the biotech startup onc.ai secured an FDA Breakthrough Device Designation for its ‘Serial CTRS’ system, which applies deep learning to CT scans to stratify non–small-cell lung cancer patients into high- and low-risk categories. Just like ArteraAI, their model does not use single, legible features such as lesion diameters, only the aggregated, weak latent patterns that their model has learned across the many CT scans in its training dataset.

So, what does the future hold?

Again, cancer has a surprising amount of detail, and it is unlikely that pathology images are alone able to explain everything about it. We have some empirical proof for this. A 2022 Cell paper compared how well a model performs across 14 cancer-outcome prediction tasks if given only pathology data, only molecular profile data (RNA, gene mutations, copy-number variation of the tumor), or both. The combined data won most of the time. A more recent 2024 paper from AstraZeneca says something similar, with the advantages of multimodality seeming to increase as the underlying datapoints grow in number.

To me, this implies that the spoils of the cancer-understanding race will accrue to those who gather not just pathology, not just genomics, not just proteins, not just transcripts, not just epigenomics, not just plasma, not just the scientific literature, but all of them at once and more, fused into a single representation, and presented on a platter to an impossibly large statistical model for it to gorge itself on. What could such a model teach us? What about cancer has eluded centuries of human study upon it? What will ultimately require machine intelligence to make clear? The race is on to find out.

Afterword: I should mention that I work at Noetik, where we’re building multimodal foundation models of tumor microenvironments in order to predict response to cancer drugs. This essay grew out of countless conversations with colleagues about why cancer response prediction is so hard, and what will be necessary to improve it.

The optimistic case for protein foundation model companies

Abhishaike Mahajan — Sat, 11 Oct 2025 16:48:03 GMT

Note: apologies for the double-send if you got two of these, I messed something up in the settings!

Introduction

Let’s be honest with each other: the funding for protein foundation model startups got a little crazy for a moment. EvolutionaryScale got $142M in mid-2024, Latent Labs got $50M in early-2025, Chai Discovery got $70M in mid-2025. And, of course, the giant two: Isomorphic Labs with $600M in funding in early 2025, and Xaira Therapeutics with an insane $1B in funding in mid-2024.

Things have calmed down since then, so I think it’s a good moment to look back at this with some fresh eyes and ask: was any of this a good idea?

It’s become quite common to tell one another that no, obviously not, these were a series of escalating, FOMO-y investments that had basically zero basis in objective reality. I empathize with this viewpoint. Protein models are increasingly recognized as commoditized things, where the open-source stuff is actually quite good, and, even at the private level, there didn’t seem to be a strong differentiation between one group’s pretrained weights and another’s. If you really squinted, maybe, just maybe, the open-source Boltz-1 was slightly worse than Alphafold3 by a few percentage points in a few domains, but how much does that matter? Surely it’s all within a standard deviation of one another? How could this justify the immense investments needed to train these models?

But this view has also become so universally held that, honestly, it’s getting a little boring. Increasingly, I have grown more and more curious about what actually was the opinion of people who invested into these things. People knock on VC’s a lot, but I have a pretty high opinion of nearly every biotech VC I’ve met, and it’s difficult for me to imagine that it was all irrational. Unfortunately, the articles that VC’s write on why they invested into certain things — including these companies — are nearly always uninformative, mostly vague gesturing at ‘the transformation of biology’. You could look at this and think ‘okay, nobody knows why they invested in this’, but I mostly think the vagueness comes from the fact that they don’t have a strong financial incentive to say what their actual bet is.

So what was the actual reason to put money into these companies? It’s difficult to come up with a coherent narrative, so I’m just going to list a few interesting reasons that are swirling in my head.

The optimistic arguments

Multiproperty optimization

The gap between private and public models may widen far faster than anyone thinks, entirely due to private groups being able to afford the dataset necessary to optimize for multiple things at once.

Let’s consider Chai-2’s results from July 2025 as a decent view into how much better the private models are than one open-sourced one—RFDiffusion—for the task of miniprotein design. The results are hindered a bit by the fact that the best open-source miniprotein design model—Bindcraft—is not included here. But whatever, let’s pretend RFDiffusion is as good as open-source gets.

The usual argument at this point is ‘who cares about a 10-20% → 80-100% bump, given that these experiments can be run in 96-well plates?’ Yes, it saves money, but does it unlock dramatically new biology in a way that justifies having a brand new, very expensive model? Probably not!

But I think this is missing the forest for the trees a bit. Binder design is indeed not super interesting (anymore), but it’s worth thinking about what is beyond that. Because, in fact, it is very likely that the real value of these startups may have very little to do with their ability to create binders. Binding is literally just the easiest thing you can do, because the dataset to do it has already been mostly assembled: the PDB. So it’s a good place to start your modeling work. The far more interesting capability is creating binders that also satisfy a bunch of useful biochemical properties.

What else is there? To name a few: expression, stability, solubility, immunogenicity, receptor promiscuity, manufacturability, and PK/PD. There are plenty of models for optimizing each of these properties one-at-a-time, but creating something that can jointly optimize all these at once is a taller order. A great recent blog post from the Oxford Protein Informatics group discussed this a little, and you can see the inklings of open-sourced, multi-objective datasets here coming together, but it is still extremely early and limited in size (almost always <1000 antibodies for non-binding datasets).
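To make "jointly optimize" a bit more concrete, here is a minimal sketch of why ranking on one property differs from ranking on several. Everything in it is hypothetical: the design names, the three properties, and the scores are invented for illustration, and a Pareto filter is just one simple way to frame multi-objective selection, not a description of how any of these companies actually do it.

```python
from typing import Dict, List

# Hypothetical candidates with made-up scores (higher is better for every property).
candidates: Dict[str, Dict[str, float]] = {
    "design_A": {"binding": 0.90, "expression": 0.20, "stability": 0.50},
    "design_B": {"binding": 0.70, "expression": 0.80, "stability": 0.60},
    "design_C": {"binding": 0.60, "expression": 0.70, "stability": 0.50},
    "design_D": {"binding": 0.95, "expression": 0.10, "stability": 0.40},
}


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if `a` is at least as good as `b` on every property and strictly better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(cands: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, scores in cands.items():
        dominated = any(
            dominates(other, scores)
            for other_name, other in cands.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Ranking on binding alone picks a single winner...
best_binder = max(candidates, key=lambda name: candidates[name]["binding"])
# ...while the Pareto front keeps every design that represents a distinct trade-off.
front = pareto_front(candidates)
print(best_binder, front)
```

In this toy set, the best binder is also the worst expresser, which is exactly the kind of trade-off a binding-only benchmark never surfaces. The hard part in practice is not the filter, it is having measured values for every property in the first place.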

What if there was a model that could really solve for all these things at once? What interesting things await if you can essentially automate a pretty significant fraction of the chemistry-relevant parts of the preclinical workflow? The literature does imply it is quite significant:

Despite the substantial level of research spending and the growing reliance on outsourcing within the non-clinical domain, to our knowledge very little data exists on the economics of specific non-clinical activities and the comparative cost of internal vs. outsourced support. Andrews, Laurencot and Roy in 2006 reported that the direct cost to conduct specific non-clinical tests for a single compound ran from tens to hundreds of thousands of dollars.
…Ferrandiz, Sussex and Towse in 2012 calculated that the average development costs from first toxicity dose to first human dose for a single compound was $6.5 million (2011$) with the costs ranging from as low as $100,000 to as high as $27 million.⁶ This wide range suggests many different variables affect the cost of non-clinical development.

This is all perhaps an obvious point, but I think it is worthy of being explicitly called out. I have long felt that existing benchmarks in the biology-ML world have a tendency to ideologically capture people, limiting them to consider only the scope of what is currently measurable. Here, that is binding, but everything else is really important too! And it may be only the Big Players who can afford touching everything else.

The value of infinite exploration

Ask not why would you work in biology, but rather: why wouldn't you?

Abhishaike Mahajan — Thu, 02 Oct 2025 19:49:56 GMT

There’s a lot of essays that are implicitly centered around convincing people to work in biology. One consistent theme amongst them is that they all focus on how irresistibly interesting the whole subject is. Isn’t it fascinating that our mitochondria are potentially an endosymbiotic phenomenon that occurred millions of years ago? Isn’t it fascinating that the regulation of your genome can change throughout your life? Isn’t it fascinating that slime molds can solve mazes without neurons? Come and learn more about this strange and curious field!

Yes, it is all quite fascinating, and I do appreciate everyone writing about it all. But I’ll be honest with you: that reasoning never quite did it for me. Lots of things in the world are fascinating, a great deal of them easier to do and more profitable than biology is. Biology is one of those troubling fields where the economics end up mattering a great deal more than the actual science, and while there are biotech startups that are both highly profitable and save lives, there are other, much more frightening cases. Like the gene therapy company Bluebird Bio spending ~30 years doing R&D, getting 3 drugs approved, and then being acquired for a mere $30 million. Given situations like that, many talented people decide, very rationally, to go spend their precious one life working on something else.

But I’d like to offer a different take on the matter. Yes, biology is very interesting, yes, biology is very hard to do well. Yet, it remains the only field that could do something of the utmost importance: prevent a urinary catheter from being shunted inside you in the upcoming future.

Being catheterized is not a big deal. It happens to literally tens of millions of people every single year. There is nothing even mildly unique about the whole experience. And, you know, it may be some matter of privilege that you ever feel a catheter inside of you; the financially marginalized will simply soil themselves or die a very painful death from sepsis.

But when you are catheterized for the first time—since, make no mistake, there is a very high chance you will be if you hope to die of old age—you’ll almost certainly feel a sense of intense wrongness that it happens at all. The whole procedure is a few moments of blunt violence, invasiveness, that feels completely out of place in an age where we can edit genomes and send probes beyond the solar system. There may be times where you’ll be able to protect yourself from the vile mixture of pain and discomfort via general anesthesia, but a fairly high number of people undergo (repeated!) catheterization awake and aware, often gathering a slew of infections along the way. This is made far worse by the fact that the most likely time you are catheterized will be during your twilight years, when your brain has turned to soup and you’ve forgotten who your parents are and who you are and what this painful tube is doing in your urethra. If you aren’t aware of how urinary catheters work, there is a deflated balloon at the end of it, blown up once the tube is inside you. This balloon keeps the whole system uncomfortably stuck inside your bladder. So, you can fill in the details on how much violence a brain-damaged person can do to themselves in a position like this by simply yanking out the foreign material.

Optimizing for not having a urinary catheter being placed into you is quite a lofty goal. Are there any alternatives on the table? Not practical ones. Diapers don’t work if the entire bladder itself is dysfunctional, suprapubic tubes require making a hole into the bladder (and can also be torn out), and nerve stimulation devices require expensive, invasive surgery. And none of them will be relied upon for routine cases, where catheterization is the fastest, most reliable solution that exists. You won’t get the gentle alternatives because you won’t be in a position to ask for them. You’ll be post-operative, or delirious, or comatose, or simply too old and confused to advocate for something better.

This is an uncomfortable subject to discuss. But I think it’s worth level-setting with one another. Urinary catheterization is but one of the dozens of little procedures that both contribute to the nauseating amount of ambient human suffering that repeats over and over and over again across the entire medical system and are common enough that they will likely be inflicted upon you one day. And if catheterization doesn’t seem so bad, there are a range of other awful things that, statistically speaking, a reader has a decent chance of undergoing at some point: feeding tubes, pap smears, mechanical ventilation, and repeated colonoscopies are all candidates.

Moreover, keep in mind that all these are simply the solutions to help prevent something far more grotesque and painful from occurring! Worse things exist—cancer, Alzheimer’s, Crohn’s—but those have been talked about to death and feel a great deal more abstract than the relatively routine, but barbaric, medical procedures that occur millions of times per year.

How could this not be your life goal to work on? To reduce how awful maladies, and the awful solutions to those maladies, are? What else is there really? Better prediction markets? What are we talking about?

To be fair, most people go through their first few decades of life not completely cognizant of how terrible modern medicine can be. But at some point you surely have to understand that you have been, thus far, lucky enough to have spent your entire life on the good side of medicine. In a very nice room, one in which every disease, condition, or malady had a very smart clinician on staff to immediately administer the cure. But one day, you’ll be shown glimpses of a far worse room, the bad side of medicine, ushered into an area of healthcare where nobody actually understands what is going on.

What causes people’s first glimpses into the bad room is heterogeneous. For some people, it will be the earliest signs of schizophrenia. For others, it will be metastatic lung cancer. For still others, it will be something stupid, something unbelievably stupid, like having a leg amputated because you scratched yourself on the wrong rock and now that wound has developed necrotizing fasciitis and is quickly dissolving the surrounding flesh. If you are lucky, you get to soon leave that room, clawing your way back to normality. But someday, you will be stuck in the bad room, forever, nobody ever knowing quite what to do with you ever again.

You see this realization a lot. On Twitter, I have noted at least three cases of a very smart person, who works in something largely useless outside of The Market, suddenly falling prey—through no fault of their own!—to some terrible, deeply understudied disease. Then they bemoan on social media how terrible modern medicine is, and how no doctor knows what is actually going on. I empathize! But they never take that next logical step, to try to help solve the set of problems they just suddenly realized is perhaps the only set of problems that actually matter. During these moments, I wish I could reach out through my screen to shake them by the shoulders and tell them to stop complaining and get to work. Because, in truth, the problem they are facing today is just the tip of the iceberg. Their eyes are slowly decaying, and if they manage to hit fifty, there is a one-in-ten chance that there will be a creaking, incurable black hole in the middle of their sight, expanding day after day. Think! Think! Do something about it!

I appreciate that many fields also demand this level of obedience to the ‘cause’, the same installation of ‘this is the only thing that matters!’. The energy, climate change, and artificial-intelligence sectors have similar do-or-die mission statements. But you know the main difference between those fields and biology?

In every other game, you can at least pretend the losers are going to be someone else, somewhere else in the world, happening to some poor schmuck who didn’t have your money or your foresight or your connections to do the Obviously Correct Thing. Instead, people hope to be a winner. A robot in my house to do my laundry, a plane that gets me from San Francisco to New York City in only an hour, an infinite movie generator so I can turn all my inner thoughts into reality. Wow! Capital-A Abundance beyond my wildest dreams! This is all well and good, but the unfortunate reality of the situation is that you will be a loser, an explicit loser, guaranteed to be a loser, in one specific game: biology. You will not escape being the butt of the joke here, because it will be you that betrays you, not the you who is reading this essay, but you, the you that cannot think, the you that has been shoddily shaped by the last several eons of evolution. Yes, others will also have their time underneath this harsh spotlight, but you will see your day in it too.

My pitch for working in biology is that you will be working to either prevent, or at the very least alleviate, the inevitable moment that Mother Nature decides to extract a pound of flesh from you, giggling and gnashing you between her teeth like a cat plays with a baby mouse. Because it will happen! It may happen tomorrow, it may happen twenty years from now, maybe it’s already happened. Either way, the flesh taking is only going to accelerate the more time you spend shuffling around this world.

Yes, things outside of biology are important too. Optimized supply chains matter, good marketing matters, and accurate securities risk assessments matter. Industries work together in weird ways. The people working on better short-form video and payroll startups and FAANGs are part of an economic engine that generates the immense taxable wealth required to fund the NIH grants. I know that the world runs on invisible glue.

Still, I can’t help but think that people’s priorities are enormously out of touch with what will actually matter most to their future selves. It feels as if people have this mental model where medical progress simply happens. Like there’s some natural law of the universe that says “treatments improve by X% per year” and we’re all just passengers with a dumb grin on this predetermined trajectory. They see headlines about better FDA guidelines or CRISPR or immunotherapy or AI-accelerated protein folding and think, “Great, the authorities got it covered. By the time I need it, they’ll have figured it out.” But that’s not how any of this works! Nobody has it covered! Medical progress happens because specific people chose to work on specific problems instead of doing something else with their finite time on Earth.

A reasonably large fraction of humanity seems to have considered this, and simply elected to avert their eyes from the whole unsightly matter. Behind closed doors, they have mutually agreed with one another that they actually live in a different reality, one in which they actually are something divine, something incorruptible, like light or vapor, cast in marble, permanent and perfect. The future, they have decided, will not come for me. I will stay pure. I will stay untouched. It is a mantra they repeat to themselves when they wake up and when they go to bed. Maybe they even start to believe it. It is the only explanation I can think of. Because how else could they go on about their day? How else could they ignore what awaits them?

RNA structure prediction is hard. How much does that matter?

Abhishaike Mahajan — Fri, 26 Sep 2025 20:25:55 GMT

Note: I am not an expert in RNA structure, and am extremely grateful to Connor Stephens, Rishabh Anand, Ramya Rangan, and Chaitanya K. Joshi—all of whom are actual, bona fide experts—for their incredibly detailed comments on earlier drafts of this essay. All mistakes are, of course, mine, and this essay should not be trusted to function as anything more than entertainment. Do your own research!

Introduction

One thing I’ve always wanted to write was ‘a primer to RNA structure modeling’. I know literally nothing about the field, other than that there are a few startups playing in the space, and have always been curious what exactly they were up to. But the release of Alphafold3—which can model RNA alongside proteins, DNA, and small molecules—dampened this desire. If a singular model solved the problem of RNA structure, who cares about the specifics of the field at large?

But while I was in San Francisco a few months back, I happened to chat with Connor Stephens, a machine learning scientist at Atomic AI. You may recognize that startup, since their founder has the distinct honor of their PhD work in RNA structure modeling being on the cover of Science in 2021 for making a substantial advance in RNA structure prediction.

But it was long unclear to me what exactly Atomic AI did in terms of R&D. This isn’t a startup post, I’m not planning to explain what their therapeutic goals are. What I was curious about was why they continue to have an ML team despite the RNA problem being seemingly solved by Alphafold3. So, I posed that question to Connor.

Connor told me something very fascinating: not only did Alphafold3 not solve the problem of RNA structure prediction, RNA may be one of the last structure prediction problems to be solved. The rest of the conversation was so incredibly fun that, midway through it, I decided it’d make for a great article.

Why is RNA structure so hard to model?

On face value, the answer is pretty simple: experimentally determined RNA structures deposited in public repositories are both ridiculously small in number and of much lower quality than you’d naively expect. A quote from a paper best explains this:

There is a huge disparity in protein and RNA data. Even if there is a higher proportion of RNAs than proteins in the living, this is not reflected in the available data: only a small amount of 3D RNA structures are known. Up to June 2024, 7,759 RNA structures were deposited in the Protein Data Bank (34), compared to 216,212 protein structures. The quality and diversity of data are also different: a huge proportion of RNAs come from the same families. It implies several redundant structures that could prevent a model from being generalized to other families. In addition, a huge amount of RNA families have not yet solved structures in the PDB. This means there is no balanced and representative proportion of RNA families through the known structures.

The obvious follow-up question is: why? Apparently, RNA is a good fit for basically none of the existing structure determination methods. But again, why?

Connor told me that RNA is famous for being perhaps one of the most flexible biomolecules to exist as a category, with an almost absurd number of conformational degrees of freedom. Each nucleotide has more torsion angles than an amino acid, and the lack of a bulky side chain—like those in amino acids—means there’s very little steric hindrance to keep the backbone from flopping around. Now, keep in mind, this is not to say that RNA is unstructured. Unstructured has a particular meaning: that the energy landscape is flat, with no favored conformational structure. But this isn’t the case for RNAs, which do have preferred conformations; there are just many of them, and the molecule constantly flips between them.

This all implies that RNA is a very bad fit for X-ray crystallography, which requires orderly, repeating conformations to arrange into a crystal. It is also a bad fit for cryo-EM (a subject I’ve written about in detail before), given both its extreme conformational heterogeneity and how small the biomolecule typically is, though this is increasingly being addressed. Finally, there is NMR, which, while more forgiving when it comes to flexibility and heterogeneity, is generally limited to very small RNA structures. Once the RNA goes beyond ~50 nucleotides, the spectra start overlapping and the resolution becomes insufficient to observe anything useful. And lots of important RNA lies beyond that size!

I’ve attached some nuance about NMR and cryo-EM in the footnotes.1

This means that there are really only two kinds of RNA structures that can be physically characterized: ones that have been artificially stabilized, or ones that are evolutionarily constrained to hold a single dominant conformation.

The first category includes structures coaxed into rigidity by heavy metal ions, engineered base modifications, or even crystallization chaperones. But of course, this raises a worrying question: are you really measuring the native structure, or just the structure you forced it into? The second category is rarer: RNAs that, through evolutionary pressure, have converged on a stable structure for a functional reason. There are no caveats there, only that trying to train a model on these nucleotide sequences will inevitably bias it towards unusually stable RNA structures.

Well, we shouldn’t let all of this get us down. Many impossible problems are being solved day-after-day in this field. Even if RNA modeling has all the characteristics of being hard to do—huge distributional space of possible outputs for a given input and low number of input data points—surely, some headway has been made in the problem. Consider Alphafold3: how well does it actually do on the RNA structure prediction problem?

A well-named paper titled Has AlphaFold3 achieved success for RNA? tries to answer this question. From the article:

The best models from the CASP-RNA competition, which are human-guided, outperform AlphaFold3….
….On the other hand, AlphaFold3 shows a cumulative sum of metrics greater than the other methods for the other test sets (p-value < 10⁻⁵ for RNA-Puzzles, p-value < 10⁻⁴ for RNASolo).
For RNA-Puzzles, the challenge-best solutions are from older solutions with less advanced architectures compared with the more recent CASP-RNA solutions.
For the RNA3DB_0 data set, the performance of AlphaFold3 is slightly better compared with RhoFold, which gives a better RMSD but a worse MCQ and LCS-TA.
AlphaFold3 always has a high MCQ value, indicating that it returns structures which are more physically plausible than ab initio methods (which use physics properties in their predictions).
Nonetheless, it does not always have the best RMSD (outperformed in CASP-RNA and RNA3DB_0), suggesting that AlphaFold3 does not always have the best alignment (in terms of all atoms) compared with the reference structure.

In short, while Alphafold3 is certainly an improvement in some categories of RNA—namely being the only RNA structure prediction method that can model very large RNA’s well—it does not solve the problem outright, and can be outperformed through tailored methods.

Another slightly more recent paper says something similar, and gives some insight into the practical meaning of these benchmarks, saying ‘Boltz-1 and AlphaFold3, make acceptable predictions for about half of the individual RNA chains and complexes.’ The authors further note that the results get far worse if you deviate into more structurally unique RNA space (bolding added by me):

We observed that prediction accuracy, as measured by TM-score, generally increased with higher structural similarity to the training set for all methods. The mean TM-score is below 0.1 for the category with the least similarity and increases gradually to over 0.6 for the category with the highest similarity to the training set. This suggests that AlphaFold3 and other methods tend to perform better when the target structure is more similar to motifs it encountered during training, highlighting the limitation of current methods in predicting unseen and structurally divergent RNAs.

Neat!

I could end the essay here, because this really did cover most of my conversation with Connor. There is a lot more that could be said about how difficult benchmarking can be in the RNA ML world, the weak co-evolutionary signal in RNA MSA’s, how even the existing set of RNA structures is made worse by the fact that they are almost always in complex with a protein, and (hearsay) that you likely need experimentally-determined templates/molecular-dynamics to get good structure predictions. This paper discusses all that in more detail if you're curious, but my main question got answered!

But the more I talked to people in the RNA space while writing this essay, the more I began to ask a new question: how important is this problem anyway?

Why even predict RNA structure in the first place?

For the protein-heads reading this, we know that protein structure actually means something quite fundamental. A protein’s three-dimensional fold is usually synonymous with its biological role: an enzyme pocket is what catalyzes a reaction, an antibody groove is what binds an antigen, a receptor domain is what recognizes a ligand. We can hem and haw about dynamics or post-translational tweaks, but the basic architecture is what makes the protein what it is. Protein structure isn’t exactly truth, but structure can be a proxy for truth a sufficiently high fraction of the time.

RNA is not like this at all. It’s actually really, really, really situational when the structure of RNA matters in a therapeutic context. Well, to be more nuanced, structure always matters, but there is a very significant split in what ‘structure’ even means for this biomolecule: secondary structure and tertiary structure (image from here):

Thus far, everything we’ve talked about regarding the ‘difficulty of structure prediction’ has been for tertiary structure.

Now, this separation exists for proteins as well! But it (somewhat) matters less for proteins. Usually we treat “protein structure” as a single concept because the hierarchy is tightly coupled: secondary structure (α-helices, β-sheets) stacks neatly into tertiary folds, which in turn map directly to function. You can often ignore the distinction because the two levels reinforce each other, and so everyone hyper-focuses on tertiary structures being the most important thing.

But for RNA, the distinction matters a lot, because secondary structure seems to be where most of the clinically relevant value of structure is. Tertiary RNA structure is important! But, as far as I can tell, the value of it is actually relatively limited in scope for therapeutic-relevant problems, partially due to the fact that RNA is just so flexible that a tertiary structure phenomenon like ‘the binding site is buried in the core’ can immediately be undercut by that same core suddenly flopping out in a new conformation.

And, just as is the case for proteins, RNA secondary structure is far easier to predict than RNA tertiary structure. It’s still comparatively hard, in the sense that secondary protein structure is basically something people don’t ever worry about, and secondary RNA structure has only just recently reached those same accuracy levels. A paper that analyzed the performance of RNA models at CASP16 had this to say:

Complex and novel targets appear well beyond current capabilities for NA 3D structure prediction. However, RNA folding can be simplified into a hierarchical process: secondary structure – the pattern of canonical base pairs – forms creating a set of RNA stems which are then stitched into the overall 3D fold…
CASP16 offered the prospect of carrying out tests of secondary structure accuracy prospectively. The secondary structure of all targets, here defined as the list of all Watson-Crick-Franklin and Wobble pairs, turned out to be predicted to a high level of accuracy (Supplemental Figure 3A)...The trend in RNA secondary structure performance is more reminiscent of the performance observed in current protein 3D structure prediction, suggesting these prediction algorithms are reaching sufficient accuracy in their prediction of secondary structure to be important and useful in structural research.

Not too bad!
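For a feel of what "secondary structure accuracy" can mean in practice, here is a minimal sketch that treats a secondary structure as a set of base pairs, in the spirit of the quote's definition, and scores a prediction against a reference. Two assumptions are mine and not from the CASP16 paper: structures are written in plain dot-bracket notation without pseudoknots, and agreement is summarized with an F1 score over base pairs. The example strings are invented.

```python
from typing import Set, Tuple


def parse_pairs(dot_bracket: str) -> Set[Tuple[int, int]]:
    """Convert pseudoknot-free dot-bracket notation into a set of (i, j) base pairs."""
    stack, pairs = [], set()
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return pairs


def pair_f1(predicted: str, reference: str) -> float:
    """F1 score over base pairs shared by the predicted and reference structures."""
    pred, ref = parse_pairs(predicted), parse_pairs(reference)
    if not pred and not ref:
        return 1.0  # both fully unpaired: trivially in agreement
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction recovers three of the four pairs in a hairpin stem.
reference = "((((....))))"
predicted = "(((......)))"
score = pair_f1(predicted, reference)
print(round(score, 3))
```

Note how forgiving this is compared to a 3D metric: the prediction only has to get a list of pairings right, not the position of every atom.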

Returning back to our claim that ‘secondary structure is most of what you need’, let’s convince ourselves of this by walking through the major classes of RNA-based therapeutics and the importance of secondary versus tertiary structure.

The most famous form of therapy here is exogenous mRNA, and tertiary structure doesn’t seem to matter much there. I have two proof points for this. One, this mRNA optimization article from GeneWiz mentions secondary-structure optimization (e.g. preventing hairpins), but not tertiary structure. Two, just logically thinking about it, the job of mRNA is to be fed into the ribosome and translated into protein, so as long as the coding region is readable and initiation isn’t blocked (hence probably why hairpins are undesirable), why would it matter for the RNA to maintain any particular higher-order fold?

Then there’s antisense oligonucleotides, or ASO’s. An ASO is just a short synthetic strand of nucleic acid (usually 15–25 bases long) that binds to a complementary sequence on a target RNA. Once bound, it can block translation directly by preventing ribosome access, alter splicing by blocking splice sites or enhancers/silencers, or a few other things. But in all of these cases, all that matters is that the ASO can actually base-pair with its intended target. And that comes down to secondary structure accessibility: is the binding site exposed or not? Once again, this seems to be something that is largely answerable from secondary structure information, especially given how small ASO’s are.
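To make the base-pairing point concrete, here is a minimal Python sketch of how the sequence of an antisense strand follows from its target site: it is just the reverse complement under Watson-Crick rules. The function name `antisense` and the example site are hypothetical, the output is written in the plain RNA alphabet, and the sketch deliberately ignores everything that makes real ASO design hard (chemical modifications, off-target matches, and whether the site is actually accessible).

```python
# Watson-Crick complements in the RNA alphabet.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def antisense(target_site: str) -> str:
    """Return the reverse complement of an RNA site, written 5'->3'."""
    # Accept DNA-style input by treating T as U.
    site = target_site.upper().replace("T", "U")
    return "".join(COMPLEMENT[base] for base in reversed(site))


# Hypothetical 19-nucleotide target site, for illustration only.
site = "AUGGCCAUUGUAAUGGGCC"
aso = antisense(site)
print(len(aso), aso)
```

Taking the reverse complement twice returns the original site, which is a quick sanity check that the pairing rules are symmetric.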

For siRNA’s, or small interfering RNA, it’s nearly the same story as ASO’s.

Virtually the only time tertiary structure seems to matter for an RNA therapeutic is for aptamers and ribozymes. The former refers to short RNAs that fold into precise three-dimensional shapes capable of binding proteins or small molecules (e.g. theophylline aptamer), and the latter refers to enzymatic RNAs with a precise catalytic site that are able to carry out chemical reactions. But, unlike all other classes of RNA therapeutics, approved drugs here are quite rare; aptamers have two and ribozymes have zero. There’s also riboswitches, which are a hazy combination of the two, and also have no released therapies.

This all said, we should also consider the other side too: RNA as targets. How important is secondary versus tertiary structure there?

Well, things do get muddier, because there isn’t really a standardized list of established RNA targets the same way there are for proteins. There’s mRNA, the tertiary structure of which is not exploited in any FDA-approved drugs (though we’ll discuss this again later on), but what else?

Well, for one, non-coding regions! Specifically, microRNAs and lncRNA.

Given how small microRNA’s are (~20 nucleotides), I’d guess that tertiary structures don’t matter much there.

Curiously, LLM’s will, at first, insist that lncRNA’s, or “long noncoding RNAs”, really benefit from accurate tertiary structure prediction. There’s some reason to believe that they are right. After all, they are usually above 200 (or 500, depending on who you ask) nucleotides in length, so, unlike ASOs/siRNAs/microRNAs, lncRNA’s are large enough that tertiary structures may have significant impacts. Unfortunately, the LLM’s seem to be a bit wrong here, partially because whether lncRNA’s even form global tertiary structures at all has been a matter of intense debate for a while, though circa 2020 it is seeming like at least some lncRNA’s do. But really, whether lncRNA’s have a global structure or not wouldn’t have even mattered anyway, because their modulation does not seem to actually depend on that global structure. Rather, it depends on a set of short nucleotide motifs scattered along an otherwise floppy backbone. Even if we could perfectly predict the full structure of an lncRNA tomorrow, it feels like it wouldn’t change any therapeutic decisions. Perhaps predictions of those local 3D motifs are valuable, but that’s an open question!

As far as I can tell, the only type of RNA target where tertiary structure is known to be important is rRNA, or ribosomal RNA. Unlike most RNAs, ribosomal RNAs actually must maintain specific tertiary folds, because, like ribozymes, they are enzymes in every meaningful sense. The peptidyl transferase center of rRNA requires a highly specific three-dimensional geometry to orient its usual substrate: tRNA. And some classes of approved antibiotics, macrolides for example, are able to block this catalytic site, preventing (some) forms of bacteria from making proteins at all, eventually killing them.

It does seem, from the outside, like accurate RNA tertiary structure predictions would be helpful here, given this line from a paper discussing where antibiotics bind to RNA:

For spectinomycin, the apparent binding site and the affected cross linking site are distant in the secondary structure but are close in tertiary structure in several recent models, indicating a localized effect. For tetracycline, the apparent binding sites are significantly separated in both the secondary and the three-dimensional structures, suggesting a more regional effect.

In other words, there is a large deviation in what secondary structure tells you, and what tertiary structure tells you!

This said, a few commenters on this essay noted that while this is an area where 3D structure is useful, it almost certainly isn’t a bottleneck due to the relative abundance of existing rRNA structures and ease of gathering new ones.

So, aptamers and rRNA are virtually the only two areas that (today) truly benefit from detailed tertiary structure modeling and have some things in the clinic. For mRNAs, ASOs, siRNAs, and most lncRNAs, the biology seems to collapse down to local accessibility and motif recognition. Both of these are sufficiently described by secondary structure, and that is decently well predicted by existing models! Tertiary folds, though definitely far from being well-predicted, don’t actually seem to influence much…at least as far as I can tell.

So why do people still work on the tertiary structure prediction problem? Is it all just for better ribosome-centric antibiotics and aptamers?

How much do we stand to gain if RNA structure prediction improves?

Well, in the immediate short term, it does seem like antibiotics and aptamers are really the field's best bets.

This is nothing to sneeze at! On the antibiotic side, we do need better antibiotics to account for the current ‘antibiotic resistance’ thing that’s been going on for the past decade, so why not elect ribosome-targeting antibiotics? This said, we should immediately drown our hopes that better ribosomal drugs will actually change the resistance trend-line. Naively, one would hope that things that interfere with rRNA functioning should be quite hard to adapt to—after all, elements of the ribosome are canonically known for being extremely conserved. And that is true, but resistance manages to evolve anyway, including via, interestingly enough, post-transcriptional-modifications that prevent the antibiotic from binding to rRNA.

Of course, the real issue with antibiotics has little to do with scientific ideas, and more to do with economics. A funny paragraph I found from an interview with the lead author of a recent ‘new rRNA antibiotic’ paper had this to say:

…there is an argument that the difficulty making successful antibiotic drugs has more to do with business models than with molecules. When asked about this, Myers says, “Do I worry about the broken business model for antibiotics development? Are you kidding? Every day. That may be the most challenging problem of the lot, and it is not one that I can solve. Synthesizing new antibiotics—in that, I feel confident.”

One related note is that RNA structure may not only be useful for targeting bacterial rRNA, but also viral RNA. A particularly famous case here is a paper that developed a protein that can bind to a structured RNA element in the HIV virus, impairing transcription of it (albeit in an in-vitro setting). Though this has yet to lead to any approved drugs, the subject is, according to one review paper, promising.

Moving onto the aptamer side, though it is still early days, the future is interesting. Circa 2024, de novo RNA aptamer design is at the ‘we can redesign existing things’ stage, which is a necessary step on the way to ‘we can redesign existing things to make them better’, but we’re not there yet. What’s the therapeutic utility of an aptamer anyway? Basically the same uses one would have for an antibody, with a ton of side benefits:

Aptamers have several advantages over antibodies, not least the fact that they can be produced quickly and easily without the need for animal use. Aptamers also benefit from low production costs, high batch-to-batch consistency, and functional stability when stored at room temperature, which gives them a long shelf-life and simplifies both transportation and storage. In addition, the low immunogenicity of aptamers makes them valuable tools for in vivo applications, while their small size compared to antibodies allows them to better penetrate cells and tissues. This can be especially useful when studying difficult-to-access targets such as those found within the tumor microenvironment.
On the flipside, aptamers are poorly suited to applications in which it is desirable to stimulate an immune response and may undergo rapid clearance in vivo unless they have been modified to prevent this.

This is quite nice, but there are a lot of modalities vying for the antibody throne, and many of those share similar benefits with aptamers. It’s beyond the scope of this essay for me to judge how large the value is here, but I’m sure it’s non-zero!

Some nuance

Every essay I write, I try to form a strong opinion to build my story on, and I’ve sketched out one such opinion here: most of the value of RNA structure is in secondary structure, predicted secondary structure is quite good, and tertiary structure has a limited set of use cases. I think the argument for this position is decently strong.

But I should note that the take I have here is not a universally held opinion among those in the field, and is very much a ‘I did my research, and this is the conclusion I came to’ situation. There are, I think, reasonable disagreements that people have had with this.

First, one paper titled Thoughts on how to think (and talk) about RNA structure argues that the seemingly high utility of secondary structure has a lot more to do with its historical ease of accessibility than with the low utility of tertiary structure. Some context: most RNA secondary structure consists of what are called ‘Watson-Crick pairs’, or just the tendency for adenine (A) to pair with uracil (U) and guanine (G) to pair with cytosine (C). Non-Watson–Crick pairs are just any base-to-base hydrogen bonding that forms outside of this, which typically can only be noticed in 3D space. The aforementioned paper says this about the two:

Overall, the tendency to focus on Watson–Crick pairs may stem from the fact that they are the basis of nucleic acid hybridization and that they are easier to identify, draw, and rationally mutate. However, non-Watson–Crick pairing and stacking patterns in helical junctions and internal loops preform a 3D architecture that dictates the angles of emerging helices. As a result, specific parts of the RNA are spatially positioned to readily establish interactions often involving nucleotides that are far apart in sequence, but not in three dimensions….Non-Watson–Crick pairings combined with helical stacking give rise to structural motifs that provide the building blocks of many higher-order structures, including ultrastable tetraloops and their receptors, kink-turns, E-loops, etc.

For instance, I mentioned earlier that the tertiary structure of mRNA targets is not exploited in any FDA-approved drugs. This is true, but it is being exploited in preclinical settings! Arrakis Therapeutics, an RNA-targeting-with-small-molecules biotech startup with a very fun name, has this really interesting presentation showing that several of their ligands are able to bind to conserved, accessible 3D pockets in the mRNA of the MYC protein. This is a notoriously difficult protein to directly bind to, but seemingly accessible through its mRNA.

Second and relatedly, I dismissed the value of mRNA tertiary structure, but there is an RNA modality that does something very similar to exogenous mRNA and has a very important tertiary structure: circRNA’s, or circular RNA, which form a covalently closed continuous loop. One of the giants of the field (Mihir Metkar, who was one of the primary contributors to the Moderna COVID-19 mRNA vaccine) has written a great Nature review article on mRNA broadly, and did mention that circRNA’s must rely on a fundamentally different mechanism to initiate protein translation:

Because canonical mammalian translation depends on 5′-cap recognition, mRNAs that lack a cap [e.g. circRNA’s] require an alternative means of translation initiation. One answer is an IRES (Fig. 6).
First discovered in picornaviruses, IRESs vary with respect to both their structural complexity and their reliance on endogenous initiation factors. In general, these two features are inversely correlated, with the simplest IRESs bypassing only the cap recognition step, whereas the most structurally complex bypass even AUG recognition, relying instead on intimate direct interactions with both the large and small ribosomal subunits.

In other words, the IRES, or internal ribosome entry site, on a circRNA is the primary way it is recruited to the ribosome. This means that translation efficiency, tissue specificity, and even coding potential can hinge on whether the IRES is stable, accessible, and folded in the right way, meaning that it is a strong axis of control of a circRNA therapeutic! For example, engineering an IRES to improve translation efficiency is something that is fully possible to do. But to do this at extreme scales, we’d likely need to be able to do tertiary RNA structure prediction very well, since the three-dimensional structure of an IRES seems to matter a fair bit (though, admittedly, most of the experimental structure studies of IRESs are for non-therapeutically relevant ones). But why even use circRNA’s over mRNA’s? One paper explains that quite well:

Compared with the canonical linear mRNA used in vaccines, circRNAs have multiple advantages.
(1) CircRNAs are more stable and easy to store, whereas mRNA vaccines exhibit extreme instability because it is susceptible to degradation by RNases during transportation, storage, delivery, etc. Although nucleotide modifications of the mRNA backbone and UTR regions make mRNA more stable, this increases cost and complicates the manufacturing process, and the storage of the resulting vaccine still requires a low-temperature cold chain due to its suboptimal thermostability. CircRNAs without any modifications exhibit high stability and RNase resistance and can be stored at room temperature or under repeated freeze‒thaw conditions.
(2) CircRNAs without any modification exhibit fewer side effects. The cytotoxicity and side effects caused by mRNA vaccines are partly due to their high immunogenicity. Compared with modified mRNA, which has somewhat modulated high immunogenicity, circRNA exhibits lower immunogenicity, and lower cytotoxicity in the absence of modification.
(3) CircRNAs possess prolonged antigen-yielding capabilities and durable immune responses. The resulting longevity and thus prolonged antigen production contribute to antigen retention in antigen-presenting cells (APCs) and prolong antigen presentation.

Convincing to me! Very excited to see how the circRNA space plays out.

Thirdly and finally, claiming that secondary structure for RNA is nearly solved is false, at least for mRNA used in the clinic. After all, the mRNA used in vaccines is quite biochemically distinct from the mRNA we naturally produce in one important element: the uridine nucleotide is replaced with a different chemical (the most common one being 1-methyl-pseudouridine, or m1Ψ), which is more immunologically ‘quiet’. This, as you may expect, messes up secondary structure prediction a fair bit, since there are basically zero experimentally determined mRNA structures with modified nucleotides. The same Mihir Metkar paper mentioned earlier says this:

Although m1Ψ substitutions have little consequence on in vitro transcription or translational fidelity, as with other naturally occurring modified nucleotide, m1Ψ can substantially alter RNA secondary structure…these subtle differences in individual base-pair stabilities can lead to structural changes that alter mRNA functionality (for example, creating or disrupting a RNA binding protein (RBP) binding site)...
At present, the functional competence of RNA structures that contain modified nucleotides can only be assured by empirical testing.

There is ongoing work to solve this problem but the datasets are still all quite small, as is typical in the RNA world.

And that’s it! Thank you for reading!

So, one, there are some cases of cryo-EM being useful for at least some RNA structures, like here, and that may accelerate as the field of cryo-EM reconstruction gets better and better. Two, NMR can be useful for RNA structure prediction problems in cases where you have a crudely predicted structure, but think you can improve it by confirming the pairwise proximity of a handful of nucleotides. This is significantly more tractable, even for larger RNA, to do via NMR!

Drugs currently in clinical trials will likely not be impacted by AI

Abhishaike Mahajan — Tue, 20 May 2025 17:00:30 GMT

Foreword: Just a reminder: this is an 'Argument' post. All of them are intended to have a reasonably strong opinion, with mildly more conviction than my actual opinion. Think of it closer to a persuasive essay than a review on the topic, which my Primers are more-so meant for. Do Your Own Research applies for all my posts, but especially so with these.

Also, I’m extremely appreciative to everyone I spoke to for this essay, but especially grateful to Bioengineering Bro, Alex Telford, and Frederick Peakman.

Introduction

I’ve been thinking a lot about timing lately. The drug development lifecycle is quite unlike many industries; the vast majority of the initial experimentation gets to happen in a comparatively brief 4-7 year period (preclinical), and then one must endure an, on average, 10.5 year period of waiting around to see if the experimentation actually works (clinical trials). During this time, there are no changes to the base chemical allowed, only dosages, patient population, and trial structure. If the drug works, the company stands to reap billions. If it doesn’t, it often loses an equivalent amount.

Very few other sectors in the world have this particular type of risk profile.

What impact does this particular phenomenon have on the types of businesses that work and the types that do not work for the pharma world? In this essay, I put forward the thesis that AI startups that hope to sell to drug development companies, whether that is services or products, will be unlikely to be useful to any therapeutic currently in the clinical stage. To be of any use at all, they must focus their efforts on preclinical work, with the hopes that it translates to decisions important at the clinical stage. And this is a direct result of how extremely high stakes the drug development process is.

To be clear, this isn’t actually that controversial of an argument. Really, I can’t think of any biotech startup today that is actively going against the argument I’m making here. So, to some degree, this thesis is obvious, but it was helpful for me to write out, so it may be helpful to read.

First, I will start with a fictional story about clinical trial interpretation (section 2), say that an AI-optimist ending to that story is unlikely (section 3), work through the common arguments against my point (section 4), and then end that fiction with what I believe the true conclusion would be (section 5). Finally, I will offer what I think is the strongest argument against the thesis of this article (section 6). As an addendum, I’ll cover one of the biggest phase 3 drug failures to have ever occurred, and check whether an agentic literature-review platform (FutureHouse’s) could’ve detected it (section 7).

I’ve been (endearingly, I hope) told my essays can have a strange, wandering structure. One early reader of this essay said this:

I read it first and I was like eh 6-7/10, not sure what you're trying to say. But after reading it again I like it a lot, 9/10.

Take from that what you will. It may be worth revisiting sections if things initially feel unclear.

Let’s start!

A story

Let’s say you work near the top of a biotech startup. Times have been good to you — you’ve helped launch several drugs at past roles in big pharma, a feat that few people in the world can match, maybe even one of them becoming a blockbuster. As befitting an individual of your talents, you’ve joined a startup and now have been offered a chance to decide upon what is likely the most singularly important choice that exists in any pharmaceutical company.

You get to choose which drug gets the green light.

What does ‘green light’ mean here? For our purposes, it doesn’t matter, it can be anything; pushing the drug to phase 3 trials, repurposing a shelved asset, whatever. As with all green lights found in a pharmaceutical company, the important part for our story is that whatever decision you make will cause the gargantuan wheels of your organization to slowly shift to accommodate it, thousands of man-hours and millions of dollars being spent to grease the wheels of your choice. So, it’s quite important for you, your employees, and an uncountable number of potential patients, that you get this decision right.

To make it concrete, let’s consider the following decision: whether you should push a Phase 2 asset forwards to Phase 3. The asset in question is publicly called AEJ-2399, a peptide-based therapeutic meant to treat a rare inflammatory condition.

As far as anybody can tell, the data is strong: statistically significant efficacy and a mostly-clean safety profile. From the many hundred-page market analyses thrown your way over the last few months, the economics seem great as well — a favorable reimbursement landscape and a many-fold return on the hundreds of millions in investment on the chemical.

Really, it’s an easy decision. You announce the decision to move forwards. You expect nods of approval, and nothing more. Then you get an urgent call from your head of clinical development who claims to have discovered a strange trend in the data.

“What kind of trend?” you ask.

A sigh from the other end of the line. “An imbalance in major adverse events in a specific subgroup. It didn’t show up in the primary analysis, but when we stratified by a baseline biomarker level…”

“Which biomarker?”

“HGF. Patients with high baseline hepatocyte growth factor had a nearly threefold increase in serious adverse events.”

You sit back and think. HGF wasn’t originally a focus of the program, but now that you think about it, you recall an old preclinical report suggesting the drug might have off-target interactions in pathways associated with fibrosis and vascular remodeling. But there are always off-target effects for every drug, you can’t be expected to consider every one of those reports.

“Okay. Okay. What’s the specific recommendation here? Exclude high-HGF patients from phase 3?”

“Well, the issue is that this trial wasn’t powered for subgroup analyses. If we were paranoid, we’d exclude patients. If we were optimistic, we wouldn’t. Given the data we have, it’s a bit of a coin flip.”

A headache ripples through your skull.

A day later, you’re in a boardroom with ten senior leaders, staring at a slide deck. The first few slides recap the safety signal, then come slides from research explaining why it’s probably just noise, then a slide from regulatory explaining why the FDA won’t see it as noise, then a slide from commercial showing how sales projections implode if you lose the broad label, then a slide from finance showing how much money has already been sunk into this program.

Then the real discussion starts.

The researchers argue that “the biological mechanism doesn’t support an HGF-mediated toxicity, so the signal is probably spurious, the old preclinical report has a hazy relationship to the current result”. The regulatory team argues that “while post-hoc analyses shouldn’t be over-interpreted, ignoring a threefold increase in adverse events is the kind of thing that lands companies in lawsuits” and that “statistical significance isn’t the FDA’s only criterion; if the agency thinks there’s a safety risk, they will kill this drug”. The commercial team argues that the potential value is “still too high to walk away without more data”. The finance team reminds everyone how much money is on the line.

You ask the obvious question, since it seems like nobody else will. “Okay. Okay. Um. Can’t we just directly test out HGF-mediated toxicity cheaply in some way?”

Research looks more exhausted than you’ve ever seen them. “Well, we could run a targeted preclinical study. Maybe take some liver and vascular cell models, expose them to AEJ-2399, and measure any HGF-related pathway activation. Would take a few weeks, maybe a month. There isn’t a good animal model for this stuff, so in-vitro is all we have, and who knows if the results would actually be useful.”

A member from regulatory frowns. “That might help with mechanism, but it won’t tell us whether this is a real clinical safety risk. The FDA won’t care about mechanistic speculation when they have actual adverse event data in patients.”

A clinical development head from research leans forward. “We could reanalyze stored patient samples from the Phase 2 study and see if HGF levels correlate with other biomarkers of risk. If we see a consistent pattern, like, say, increased fibrosis markers in the high-HGF subgroup, that would support a real effect. Could be done in a few weeks, depending on sample availability.” A toxicity head from research cuts in, “That won’t work, the stored samples weren’t preserved for fibrosis marker analysis. They were processed for pharmacokinetics, not histopathology. The results would not at all be trustworthy.”

“Or,” the commercial lead cuts in, “we could just run a small, targeted Phase 2b study, stratified for HGF, before going all-in on Phase 3.”

The chief finance officer doesn’t even look up from his phone. “We don’t have budgeted runway for another Phase 2b. If we delay Phase 3, we lose our fast-track designation. Which means another two years before approval, which means there’s a pretty good chance we’ll need to raise another round at awful terms. Or simply go bankrupt.”

Everyone stares back at you.

And then you have to make the call. You look around the room. Everyone has an opinion. No one agrees. And you realize that this entire process is insane. You have a billion-dollar decision to make, and you are making it based on who argues most convincingly in a PowerPoint presentation.

You schedule another meeting for tomorrow. Maybe things will be clearer then. If not then, maybe another meeting will help. You’ll get to the bottom of this eventually.

A mistake on my end

At this point in the story, if you are a regular reader of this blog, you may instinctively think AI is useful here. It must be! AI will pull all the biology/commercial/financial/etc. threads together into one piece and present a nice picture for the poor, addled executive to think over. Not even molecular models! Simple natural-language models would be sufficient for this task.

I thought this too! There are lots of pieces about the preclinical utility of AI, much less about what happens after that. So naturally I assumed all sorts of weird, strange, and scientifically interesting challenges arose when interpreting or conducting a clinical trial, just as they arose during preclinical work, and that advances in ML would dramatically alter how efficiently and accurately things could move.

And so I wrote an entire essay arguing for the potential of AI in clinical-stage drugs. It was ~7,400 words! At the end of that first story, I envisioned a world where an agentic model sat alongside the executives during these green light meetings and could say stuff like this:

The model replies instantly. "You mentioned that the tissues aren’t preserved for histopathology,” it begins, “but, from looking at the original trial SOP’s, several of the stored serum samples were processed under cold-chain protocols compatible with multiplex immunoassays if you’d be okay running those. At least 38 patients have viable serum aliquots available, 12 of which have the high-HGF phenotype.”
It pauses, then suggests: “You could run an immunoassay panel. Select a few vascular biomarkers, VCAM-1, ANGPT2, ELAM-1, ET-1, along with a few inflammation controls like IL-6 and TNF-α. These all are literature-validated indicators of endothelial activation, angiogenic imbalance, and vascular stress. Total cost per sample would be under $300, and most vendors can return results in 7 to 10 business days. Since the samples are retrospective and anonymized, the IRB burden would also be minimal. I’d estimate the projected total cost to be just shy of $14,000. I just sent an email to eight CRO’s you’ve used in the past to confirm these details and cc’d everybody in this room.”
As research begins to ask what the purpose of the assay is, the model loads up its next set of tokens: “If elevated levels of these markers are specifically enriched in high-HGF patients who experienced the adverse events, it would support the hypothesis that AEJ-2399 worsens something in a subgroup. It wouldn’t be definitive, but it avoids the concerns you have with in-vitro and animal model tests.”

And the scientists would marvel in awe at something that could surpass themselves in understanding the full scope of the problem and suggest intelligent ways for them to de-risk hiccups in the clinical stage development process.

But at the end of it, it felt off. It gave me alarm bells. I’ve never even touched clinical-stage stuff, all my work has been preclinical, so I had no mental scaffolding for how the drug development process goes on after me. So, I worried that I had unconsciously applied a preclinical framework to something that was not preclinical.

This wasn’t a new situation to me, I often write about areas in which I have zero hard-won intuition. But in those situations, I almost always have people I can rely upon to help correct mistakes in my judgement, researchers and investors I can reach out to. In this case, I didn’t really feel like anybody fit the bill.

So, I publicly requested folks to contact me:

In total, I talked with five people, trying to learn more about the clinical trial lifecycle: two people at a life-sciences consulting firm, one person who works in pharma quality assurance, one scientist at a big pharma, and one person who runs an AI-agents-for-pharma startup.

They all said basically the same thing: AI probably won’t be super helpful at the clinical stage.1

Why? Let’s get into it.

A socratic seminar on why AI at the clinical stage isn’t helpful

The AI tool should be able to suggest follow-up analyses for the phase 2 asset you described earlier, right?

Sure, but would anybody care?

The first story I had mapped out earlier included a lot of intrigue, mystery, and scientific inquiry. But that’s not how it works in the real world. What I learned from talking to people in this space was that by the time a drug exits preclinical development, the shape of its development pathway is mostly fixed. This isn’t because of perfect foresight, but because there are too many institutional constraints for things to truly change course on the fly. Budget, headcount, protocol designs, statistical analysis plans, regulatory timelines, and so on. And if there are hiccups to the process, executives usually rely on one thing to decide their plans: finances.

Is this asset still worth spending money on? If it is, the next steps are usually obvious and don’t really need AI. Really, many clinical trial interpretations mostly come down to a set of safety and efficacy criteria. If it isn’t worth spending money on or it doesn’t meet what you have in your checklist, you immediately move on to the near-infinite number of promising-looking things from your preclinical pipeline.

I think you’re being too pessimistic. There are cases of patient exclusions or dosages being refined in phase 3 based on results from phase 2, which directly contradicts you. Why wouldn’t an AI agent help you arrive at those decisions faster?

I think it’s easy to get bogged down a little here. Let’s take a step back.

The drug development process has an extraordinarily helpful mental model: portfolio optimization. At the very start of the pipeline, you have N assets. You have the option to spend $Y on each asset, each with its own probability T of working. And, of course, you hope to sell each asset for its own sum of $Z. Thus, for a single asset, we have the very simple expectation equation:

\[ \mathbb{E}[\text{value}] = T \cdot Z - Y \]

Or, more realistically at a big pharma, across N assets:

\[ \mathbb{E}[\text{portfolio value}] = \sum_{i=1}^{N} \left( T_i \cdot Z_i - Y_i \right) \]

This frames the entire endeavor as a capital allocation problem under uncertainty. So, in order for the AI to be fundamentally useful, it must do one or more of the following: improve T (chance of success), lower $Y (cost per asset), or better estimate $Z (total market).
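To make the bookkeeping concrete, here is a minimal Python sketch of that capital allocation framing. It assumes the expected value of an asset is T times $Z minus $Y, with $Y spent whether or not the drug works. The `Asset` class, the function names, and every number in the pipeline are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """One drug program, described by the three quantities in the text."""
    name: str
    success_prob: float  # T: chance the asset works, between 0 and 1
    cost: float          # $Y: money spent developing the asset
    payoff: float        # $Z: money received if the asset works


def expected_value(asset: Asset) -> float:
    """Probability-weighted payoff minus the cost.

    Assumption: $Y is spent whether or not the asset works.
    """
    return asset.success_prob * asset.payoff - asset.cost


def portfolio_value(assets: list[Asset]) -> float:
    """Sum of per-asset expected values across the whole pipeline."""
    return sum(expected_value(a) for a in assets)


# Hypothetical pipeline; all dollar figures are made up, in $ millions.
pipeline = [
    Asset("asset_a", success_prob=0.10, cost=50.0, payoff=1000.0),
    Asset("asset_b", success_prob=0.25, cost=120.0, payoff=400.0),
    Asset("asset_c", success_prob=0.50, cost=30.0, payoff=100.0),
]

for asset in pipeline:
    print(f"{asset.name}: {expected_value(asset):+.1f}")
print(f"portfolio: {portfolio_value(pipeline):+.1f}")
```

Note that one of the made-up assets has a negative expected value on its own. Under this framing, anything an AI tool does only matters insofar as it moves one of the three inputs.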

So, let’s say that the AI assistant really is capable of helping clinical-stage drugs, mostly by increasing T, the chance of success. Whether that is in suggesting better patient stratifications or better dosages or so on, it doesn’t really matter. This will also cause one of three other things to happen:

Increase $Y, because whatever follow-ups the AI suggests will cost money to validate.
Reduce the size of $Z, by suggesting the exclusion of certain patient populations.
Make $Z disappear entirely by suggesting that the asset should be discarded.

This is all to say that, even if the model is giving useful ideas, it primarily acts as a cost center and functions via negative selection by highlighting risks or advocating caution, as opposed to positive intervention. This isn't inherently a bad thing! Avoiding costly failures can obviously be as financially valuable as pushing modest successes. But it is worth reframing AI at this stage as not a proactive, generative force, which is its typical role in the preclinical lifecycle, but rather something much closer to risk management.
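A quick worked example shows how a suggestion can raise T and still leave the asset worth less. This is only a sketch under stated assumptions: expected value is taken to be T times $Z minus $Y, and every number below (the baseline asset, the size of the T gain, the validation cost, the share of the market kept) is invented for illustration.

```python
def expected_value(success_prob: float, cost: float, payoff: float) -> float:
    """T * $Z - $Y, assuming the cost is paid whether or not the drug works."""
    return success_prob * payoff - cost


def break_even_prob(target_value: float, cost: float, payoff: float) -> float:
    """Smallest T at which T * payoff - cost reaches target_value."""
    return (target_value + cost) / payoff


# Hypothetical phase 2 asset; dollar figures are made up, in $ millions.
base_t, base_y, base_z = 0.40, 100.0, 500.0
baseline = expected_value(base_t, base_y, base_z)

# Hypothetical AI suggestion: exclude a risky subgroup from phase 3.
new_t = 0.50            # T goes up ...
new_y = base_y + 10.0   # ... but validating the idea costs money ...
new_z = base_z * 0.70   # ... and the narrower label shrinks the market.
after = expected_value(new_t, new_y, new_z)

# How high T would have to climb just to match the untouched asset.
needed_t = break_even_prob(baseline, new_y, new_z)

print(f"baseline expected value: {baseline:.1f}")
print(f"after the suggestion:    {after:.1f}")
print(f"T needed to break even:  {needed_t:.2f}")
```

With these made-up inputs, the suggestion lifts T by ten points and the expected value still falls, because the gain in T has to outrun both the extra spend and the smaller market.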

And, unfortunately, reducing risk at the clinical stage almost necessarily means you’re opting into a higher false positive rate. Which is really, really uncomfortable for everyone involved, given the massive amounts of money potentially being left on the table at the clinical stage. I mentioned earlier that clinical trial interpretation is rote. Why is it rote? Because the stakes are so high and over such long timeframes that nobody wants to entertain ambiguity. Rote execution becomes a defense mechanism. You stick to the protocol, follow the SAP, report what was pre-specified, and keep deviation to a minimum. Not because it’s always the best science, but because any deviation invites delay, risk, or regulatory scrutiny.

To note: I don’t want to give the impression that I’m caricaturing pharma executives in any way. I think the pharmaceutical industry is amazing. Basically every person I’ve ever met who has worked at the clinical stages of drug development is deeply kind, cares a lot about science, and really, really wants to help patients. One need look no further than Alnylam Therapeutics for a prototypical example of researchers tirelessly working for years to prove out an unproven and often-disparaged drug modality.

So when I refer to those involved in the drug approval process as caring about money, I am not referring to them seeking to personally profit at the cost of patient lives. What I am referring to is them having systemic pressures and tough choices to make in a high-risk, high-cost environment, and money being an extremely good tool by which they can create something useful. Science matters, efficacy matters (usually…), safety matters, and patient lives matter to those in charge. But the primary way by which society writ large allows people to care about those things over long periods of time is by aligning them with incentives. And in our current system, those incentives are largely financial.

So, someone in charge of developing a drug would ideally like to keep things standard.

Realistically speaking, how costly would any trial deviation be?

Great question! Let’s go back up to the suggestion that the AI had: running a set of immunoassays to check whether high-HGF patients did express more vascular inflammation biomarkers.

Let’s ignore the actual cost of running that experiment, since the cost would be the same regardless of whether it’d be done in preclinical settings or not. And let’s also say that the results came back, and that high-HGF patients did indeed show worse biomarkers, which implies that the adverse effect issue is maybe real.

Now, if you decide to still move forwards with the phase 3 with high-HGF patients excluded, you can’t just do that. You’d need to formally amend the trial protocol, revise the statistical analysis plan, and notify every regulatory agency overseeing the trial. That triggers a cascade. Updated investigator brochures, IRB re-approvals, new consent forms at every site, retraining for clinical staff, possibly renegotiated contracts with CROs. Depending on the scale of your trial, this might cost anywhere from $150K to $300K and delay your timeline by 3 to 6 months, at minimum.

But, you may say, didn’t the story say that the phase 3 trial wouldn’t go forwards if we had to exclude patients? Absolutely! And let’s say you decide to just drop the drug entirely because of the results. Well, wait a minute. A Phase 2b study that ended with statistically significant efficacy and only a flagged safety signal in a subgroup? This would raise red flags with investors, trigger uncomfortable questions at board meetings, and potentially violate prior commitments made during financing rounds. Remember, you’ve already spent tens, maybe hundreds, of millions getting this far. You don’t just throw it away unless the risk is clear and unresolvable. And if you do, you’d better be able to point to more than a few vascular biomarker deltas in a subset of 38 retrospective samples. So perhaps you end up running the trial with high-HGF patients excluded anyway.

The way you’ve set up the situation is such that no risk management tools could ever be sold to pharma. Which is almost certainly false.

Well, no, what I’m saying is that the risk management tool shouldn’t come at the clinical stage. It should instead come at the preclinical stage.

Brandon White, the founder of a predictive toxicology startup, has written what is likely one of the best ‘how to sell to pharma’ essays I’ve ever read. In it, he writes this: “The decisions that create the most value are choosing the right target/phenotype, reducing toxicity, and predicting drug response + choosing the right target population”.

This is correct! But there is some nuance here.

Though those decisions are the most value-generative in the abstract, the windows for actually making them are surprisingly narrow and almost entirely in the preclinical stage. Obvious for target selection! But it is worth restating for the other two. Even if the decisions your risk management tool hopes to impact are at the clinical stage, the tool must intervene in the preclinical stage.

In general, I think it’s good to remember this: nobody in that pharma executive meeting room wants another idea on how to interpret a trial’s result. Least of all one that arrives this late in the game, is expensive to validate, may hurt the economics of the asset, and hinges on a speculative mechanistic thread that can’t be resolved within the operational and regulatory box they’re already trapped inside. There’s a lot of hypothesizing and theorizing and experimenting at the preclinical stage, but clinical stages are almost always about trying to endure what is least painful.

And anything that AI brings at this stage will be fairly painful, even if the idea is helpful and rational to follow.

So pharma executives shouldn’t be modeled as rational agents?

Pharma execs should be modeled as rational agents! But remember the equation from before? Their objective function isn’t exactly “How do we improve the chance of success for this drug?” It’s more “How do we maximize the expected value of the portfolio, given irreversible commitments, sunk costs, fixed timelines, risk thresholds, and organizational constraints?” So, improving the chance of success for a single drug might actually decrease overall portfolio value, if doing so requires expanding the budget, delaying other programs, or decreasing morale (both in investors and employees).
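The claim that a better single drug can mean a worse portfolio is easy to check with a toy model. The sketch below assumes a fixed budget, treats each program’s expected value as T times $Z minus $Y, and brute-forces the best affordable set of programs. The `Program` class, the `best_portfolio` helper, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Program:
    name: str
    success_prob: float  # T
    cost: float          # $Y
    payoff: float        # $Z

    @property
    def expected_value(self) -> float:
        # Assumption: the cost is paid whether or not the drug works.
        return self.success_prob * self.payoff - self.cost


def best_portfolio(programs, budget, must_fund=()):
    """Brute-force the affordable subset with the highest total expected value.

    Every program named in must_fund has to be part of the chosen subset.
    """
    required = set(must_fund)
    best_value, best_names = float("-inf"), ()
    for size in range(len(programs) + 1):
        for subset in combinations(programs, size):
            names = {p.name for p in subset}
            if not required <= names:
                continue  # a committed program is missing
            if sum(p.cost for p in subset) > budget:
                continue  # over budget
            value = sum(p.expected_value for p in subset)
            if value > best_value:
                best_value, best_names = value, tuple(sorted(names))
    return best_value, best_names


# Hypothetical numbers, in $ millions.
budget = 200.0
drug_a = Program("drug_a", success_prob=0.30, cost=100.0, payoff=600.0)
drug_b = Program("drug_b", success_prob=0.20, cost=100.0, payoff=900.0)
# The same drug after acting on a suggestion that raises T but costs more.
drug_a_boosted = Program("drug_a", success_prob=0.40, cost=150.0, payoff=600.0)

before_value, before_names = best_portfolio([drug_a, drug_b], budget)
after_value, after_names = best_portfolio(
    [drug_a_boosted, drug_b], budget, must_fund=["drug_a"]
)

print(f"before: {before_names} worth {before_value:.1f}")
print(f"after:  {after_names} worth {after_value:.1f}")
```

In this toy setup, the boosted drug is worth more on its own, but its extra cost crowds the second program out of the budget, so the portfolio as a whole ends up worth less.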

For what it’s worth, the last point is surprisingly important. There are at least some drug trials that have gone on for no other reason than ‘we need to boost confidence’. While that drug is off failing to do anything, the hope is that the temporary boost in confidence would allow the scrambling together of something that has more promise. Irrational in the short term, but hopefully rational in the long term.

Again: I am not trying to disparage executives, who typically have the best of intentions. Their decision-making is almost necessarily going to be complicated + messy, and not as straightforward as one may naively expect from the outside looking in.

I know that you are bullish on companies like weave.bio (automation for filing for clinical-stage assets) and convoke.bio (agentic software infrastructure for information gathering, documentation, and knowledge management at preclinical and clinical stages). How do you square this thesis with your optimism there?

Good question! I’ll offer some nuance.

I think the core difference here is that those companies are not trying to uncover fundamentally new information, you know? They are accelerating the gathering of pre-existing information into a coherent picture, with no promises on being able to deliver anything more than that. If we return to the equation, they are not concerned with increasing T; they are really arguing that they can reduce $Y by reducing man-hours spent on the necessary IND paperwork and market analysis and the like. And that’s far, far easier for people to be comfortable working alongside at the clinical stage, because advice given there can be easily validated and easily acted upon.

But if you are claiming to offer new information that is difficult to validate, you can’t come in as late in the drug-cycle game as those other companies have the luxury of doing.

So, to summarize, if I have some tool that claims to improve T, I should mainly focus on applying it/selling it to people working at the preclinical stage? And never the clinical stage, because advice there is really hard to act upon?

Yes! Really, this goes beyond AI itself. This advice can be generally applied to anything that aims to serve pharma needs.

I think you are wrong.

Well, maybe. We’ll discuss the counterarguments to all this in the next-to-next section. First, let’s go back to how the story would’ve ended in reality.

How the story would actually go in practice

Rewinding the story, back to the HGF stratification bit and the ensuing meeting.

The safety signal is flagged. A few emails go around. Someone sets up a meeting with the project leads. They file into the meeting room, review the results, and say, ‘Well. Yeah. This sucks.’ Finance clears their throat and says, ‘Our analysis assumes a broad market label; the already small patient population isn’t worth serving if we lose that. I’m unsure if it’d be worth allocating further capital to this.’

Everybody is mildly annoyed, but sagely nods after finance pulls up a slide of the company’s current runway.

Out of a desire to be thorough, you schedule a Type C meeting with the FDA to get their feedback. In the meeting you ask, very politely, whether the FDA would refuse a broad market label if you conducted a phase 3 with patient exclusion. They smile. “Any safety signal observed in a clinical trial should be thoroughly investigated prior to initiation of a pivotal study,” the agency’s medical officer says, reading directly from their notes. You try to ask follow-ups. They smile again: “We evaluate the totality of evidence.”

You return to the office.

Since your startup is a biotech startup with few assets, you worriedly invite your investors onto a call. They’ve seen this happen dozens of times in the past, and start discussing options. If you are unlucky, there will be no next funding round; now the hope is to simply recoup costs. Expensive R&D work will be immediately halted, acquisitions will be explored, and layoffs will start. You don’t take this personally; drugs fail all the time, and investors cannot be expected to pour tens of millions of dollars into an asset that is increasingly high-risk.

If you are lucky, the investors stick by you and tell you to go full steam ahead with the broad market phase 3. Why? So you can hedge against negative results by aggressively raising and working on the other promising things you have before the trial is over. And who knows? Maybe the drug worked fine after all and the HGF-biomarker thing was an artifact. Stranger things have happened.

And that’s it. There is relatively little room for o5/Gemini Pro Max V2/Claude 4.1 to come into the picture and offer their take on the situation. Because scientific insight was never really the bottleneck at the clinical stage, resources are.

But…

A steelman: what if this is all a cultural problem?

Generally, I think strong stances are useful to understand, but not useful to hold. So do I personally really, truly believe that AI won’t be able to really assist much with clinical stage assets? A little bit, but I can also see the opposite happening.

Some context: currently, nobody in these clinical-stage meetings has fantasies of actually understanding what a drug is doing inside of a body. It is a common fun fact that nobody quite grasps how therapeutics like, e.g., SSRIs actually work, but that’s true of many active drug candidates. Those in pharma often know some basic fundamental facts about their compounds: that it is probably exploiting this-or-this target, that it is roughly this soluble, that it doesn’t have overly toxic chemical moieties, that it had so-and-so LD50 in mice. But anything past that is just too complicated to fully grasp. Whatever the ‘true’ effect of the molecule is, it is incidentally learned from adverse effects during clinical trials.

I want to note that this is not a slight meant against anybody in those meetings, all of whom are typically extremely intelligent individuals. But that’s just the reality of how deeply complicated drugs are and how few resources there are to deeply explore their effects.

This is partially why people are loath to hear any expensive ideas after the preclinical stage. Yes, it is hard to act on them, but perhaps more importantly, nobody actually trusts those takes or gives them much significant weight in their decision making. But is that an artifact of how limited humans are, and the biases we’ve created as a result of our inability to really understand in-vivo biology? Maybe there genuinely are useful signals that only a (pseudo) superintelligence could’ve observed from the data collected, something that would save any pharma company willing to listen billions of dollars.

Perhaps AI will really be important for clinical stage assets, but there will be a cultural shift required for people to accommodate the useful, but painful advice it has.

For the first time ever, there is a nearly infinitely scalable source of intelligence that any biotech can repeatedly poke to gain more insight on their go/no-go asset decision. It makes some sense that the relative value of this information is iffy given how high stakes the situation is, but it cannot possibly be zero. After all, we have a proof point for painful ideation being useful at the clinical stage: biotech consultants, who are often called upon during clinical stages. Their job is to say the uncomfortable thing in a slide-deck-friendly way, and then vanish into the ether while internal teams decide whether to take it seriously.

It may just be that AI-based guidance hasn’t inserted itself into the clinical-stage pharma workflow yet in quite the same way. But perhaps one day it will. Once upon a time, no one trusted CROs with trial design. Then they did. Once upon a time, no one used EDC systems to collect clinical data. Then they did. The future comes, even if in the moment it seems implausible. AI as a guiding force in interpreting clinical data may very well go the same route.

This perhaps will be especially the case as the FDA ramps up how much it plans to rely on AI in the coming years. Some more specific guidance on projects is given here, with this particular one being a reasonably strong (if it actually goes through) rebuke to this whole essay:

So, who knows? Maybe my entire argument is correct only if we consider where we are today, circa May 20th, 2025. Perhaps things are on the fast track to shifting entirely, and you’d be foolish to ignore the opportunities that are arising amongst clinical-stage assets. Alternatively, perhaps building for the clinical world means placing bets on a future that takes far longer to arrive than you can stay solvent.

Addendum: can AI today predict a phase 3 failure?

Bit off topic, but it is fun to wonder: what would be the most impressive thing a sufficiently powerful model would be capable of? What could it do to convince an old-school pharmaceutical company that perhaps they really should be getting advice from o5-mini?

In the most ambitious case, maybe it looks like being able to predict a drug failure before it happens! Let’s do a quick check as to whether one of the current models could’ve predicted one of the most well-known and unexpected phase 3 drug failures (Torcetrapib, a cholesterol-modifying drug developed by Pfizer in the early 2000s) given only information from pre-phase 3 studies.

Torcetrapib was intended to elevate HDL (the “good cholesterol”), with hopes it would revolutionize cardiovascular disease treatment. Pfizer invested hundreds of millions of dollars, thousands of man-hours, and the hopes of millions of patients into this single drug candidate. Early clinical data showed exactly the cholesterol-boosting effects scientists had anticipated, prompting enormously optimistic forecasts from clinicians. In total, the drug cost roughly $800 million to develop.

Of course, we’ve already given away the ending of this story: Torcetrapib was a failure. During the infamous Phase 3 ILLUMINATE trial in 2006, investigators discovered, much to their horror, that Torcetrapib significantly increased the rate of cardiovascular events: the mortality rate of the Torcetrapib treatment arm was 60% higher than that of the control arm. The trial, which had enrolled over 15,000 patients, was abruptly terminated. Pfizer's market value plummeted overnight by billions of dollars, a stunning reversal for a company with a drug widely expected to become the industry's next blockbuster.

Were there early signs of its failure?

Well, early in Phase 1 and Phase 2 studies, a persistent side effect of the drug had already surfaced: mild elevations in systolic blood pressure among patients treated with Torcetrapib. Not dramatic, just an average increase of a few millimeters of mercury, enough to dismiss initially as an anomaly or perhaps a tolerable side effect, especially given the remarkable cholesterol improvements. Then, as Phase 2 progressed, another understated biochemical abnormality appeared. Patients on Torcetrapib had small yet consistent perturbations in adrenal hormones, notably increased aldosterone and cortisol synthesis, potentially suggesting that Torcetrapib might be inadvertently disrupting pathways unrelated to its intended mode of action. Later retrospective analyses of the 2006 ILLUMINATE trial attributed the dramatic mortality rate in the treatment arm to these side effects.

But could any of it have been predicted purely via literature that existed before the results of the trial came out? Restricting it to papers from before the phase 3 trial start date of 2006, I gave FutureHouse’s recent co-scientist platform (which people online seem to generally say is better than Deep Research) the following prompt. Specifically, I gave it to the ‘Falcon’ module, which ‘produces a long report with many sources, good for literature reviews and evaluating hypotheses’:

Assemble together what you think of Torcetrapib, assessing whether it will succeed given in-vivo, in-vitro, and biochemical literature. Only use papers published from before 2006, assume you are a scientist before the results of the phase 3 ILLUMINATE trial. Do not assume the drug has already succeeded or failed.

The given response is here.

Curiously, Falcon expressed some concern about whether the CETP inhibition of Torcetrapib (the primary mechanism by which it worked) had inadvertent off-target effects elsewhere. Exciting, because that is indeed why the drug ended up failing:

The intricate role of CETP in modulating not only HDL‑C but also LDL‑C and VLDL metabolism further complicates the prediction of torcetrapib’s net clinical benefit. Genetic studies have indicated that while CETP deficiency generally raises HDL‑C, the relationship with cardiovascular risk is not linear; in fact, some CETP mutations that lower HDL‑C are paradoxically associated with a lower risk of ischemic heart disease in certain populations (4.1). This paradox emphasizes that pharmacologically modulating CETP may have unintended consequences that depend on the balance between various lipoprotein fractions and the overall metabolic context. Thus, the potential success of torcetrapib hinges on its ability to fine‑tune this balance without triggering adverse alterations in lipoprotein functionality or promoting atherogenic particle profiles (5.1, 3.1).

Unfortunately, it did not mention the core off-target issue with the drug (raising adrenal hormones), only that off-target effects may exist. There are also some hints that the model was cheating at least somewhat, likely via reliance on some of its pre-training memory, since it also said this: “In addition, early terminations of phase 3 studies, although not fully detailed in the pre‑2006 literature, hint at potential safety issues that could offset the biochemical benefits observed in smaller trials.”

Of course, this model lacked the likely plentiful amounts of private, internal data stored within Pfizer on the drug; clinical and preclinical alike. Perhaps something even more interesting would pop out if models like Falcon were given access to such data!

[1] I lied a little bit. Though most people generally agreed that clinical trial interpretation is usually very rote, one scientist (involved in clinical trial interpretations) disagreed with that characterization. They said that, in fact, the whole process is actually quite complicated, requires a fair bit of internal discussion, and tends to rely heavily on the literature. This is all to say: everything I’m saying here could very well reflect a very particular slice of the industry. Though patterns clearly emerge during clinical-stage work, and I’ve done my best to compile them, it would be easy to overfit. Take my slice with a grain of salt!

There aren't enough smart people in biology doing something boring

Abhishaike Mahajan — Mon, 21 Oct 2024 15:16:26 GMT

Note: this essay is co-written with Eryney Marrogi, who helped seed the initial idea and edited this piece a fair bit. On a related note, I’m helping him run an NYC meetup event on Wednesday, November 20th in Williamsburg, you should sign up here to come! If you like biology, ML, or human connection, I highly recommend attending!🦉

There aren't enough smart people in biology doing something boring. At least in industry.

If you work in biology for long enough, you’ll eventually realize that most decent or ambitious companies in this field are run by exactly one type of person. They are often deeply curious, hard working to the point of near pathology, and will almost always end up pursuing some sort of crazy pie-in-the-sky mission. Like curing aging or making de-novo proteins in a zero-shot manner or trying to usher in entirely new dogmas in biology. In other words, something where immense intellectual output leads to outsized market payoff.

The companies they start will usually have this thesis. In this pursuit, they will spend millions, sometimes billions, of dollars’ worth of venture capital, government grants, and philanthropic subsidies. They live and breathe biology, and their ultimate goal in life is to have some sort of fundamental impact on the field at large. The people underneath them will usually not be too dissimilar.

Now, most decent companies in any other field are run by a similar type of person, with one important distinction: they don’t demand as much intellectual satisfaction.

Stripe is a decent example of this. Stripe is a fundamentally boring business on the surface — you’re making it easier for people to send money to each other through the internet. It’s not exciting in the same way that, say, Google was, with their much more grandiose vision of ‘indexing the world's knowledge’. The interesting bits of Stripe are perhaps found in how you build such a payment system and the potential second/third/fourth order effects that easier money exchange has on the world. But the header line is boring. And I have no doubt that Patrick Collison — the CEO of Stripe — is enormously intelligent and could’ve easily pursued something with a higher level of intellectual ‘taste’.

But he didn’t. He did payment processing.

And he did it well enough to turn Stripe into the largest private fintech company in the world, with a valuation of $65 billion and over $1 trillion in payment volume over its lifetime. Alongside making Patrick a billionaire, the democratization of online payments — boring as it sounds — almost certainly changed the world for the better. Improved efficiency of businesses, more small businesses being launched, and reduction of financial crime. Even more impressive is that the best engineers in the world aim to land a job at Stripe, even today, 15 years after they launched.

Patrick entered MIT majoring in math and, later on, physics. Would it have been better for him to have become a math researcher? Or perhaps, more pertinent to this essay, a biology researcher, given his current interest in it? Maybe society would’ve got some wonderful things from a mind like his being focused on scientific subjects, and perhaps Stripe genuinely was a waste in the grander scheme of things.

But…I think there are already a lot of very smart people in biology trying to do crazy things. There are relatively few smart people in this field trying to do boring things and do them well. Put another way, everyone of note in biotech wants to be George Church (arguably one of the greatest biology researchers alive); no one wants to be Don Comb (the founder of New England Biolabs, the reagent manufacturing company fueling America’s biological research engine).

But that’s not the case in software.

Stripe isn’t unique here. Facebook is another one, Shopify is another one, Zoom is another one, DocuSign is another one. All of these are run by outwardly smart people who are doing something that feels very not smart! Social media, easy set-up of e-commerce, video chatting, document signing. All of them are taking one relatively boring idea and doing it at scale to benefit hundreds of millions of people. And doing it in a way that doesn’t make people want to tear their hair out, which is surprisingly hard.

I think, fairly, we could quibble as to whether these companies truly are ‘not smart’. Doing anything at scale very quickly runs into ‘you need very smart people to coordinate the mess’, even at DocuSign, which people often point to as an example of a needlessly large company (which isn’t true!). But I feel pretty confident that none of these companies’ missions could be phrased in a way that a snobbish intellectual would find convincing.

Software is filled with people who will happily do the boring, but deeply impactful, thing. People may well point out that pure software may struggle with a sense of grander ambition, but I don’t think even that is true. Yes, there is a glut of yet-another-payroll-software startups, but there are also software startups like Exa.ai, Replit and Haize Labs that have decently insane missions. Smart people creating software companies have somehow stumbled across a perfect mix of ‘ambitious and impactful’ and ‘boring and impactful’.

Biology has overly indexed on the former, despite it being far, far harder. Why?

Here’s one answer: the historical role that for-profit biology has played is basically a single thing: developing drugs. Or developing a platform for drugs or developing a tool to do lead optimization of drugs or some other drug-related [thing]. Everything drug-related is usually expensive, which means you usually need someone else’s money to do it. And if you’re using someone else’s money, you need to convince them to bet on you amongst the dozens of other companies also trying to develop drugs.

And the best way to do that is to say you’re doing something radical.

Something no one has ever done before. Revolutionary even. A new assay, a new way of thinking, a new way of interrogating biology. After all, many biotech startups are academic spinouts, which are a great testing ground for galaxy-brained ideas. Each time it happens, the founders do it knowing that the regulatory hurdles are massive, but they do it regardless — because they ultimately think their approach will be so good it just blows away the regulators.

Of course, almost all of the time, this isn’t smoke and mirrors! Most of these life-sciences companies purporting radically new ideas are, indeed, pursuing those radically new ideas.

But what may very well be smoke and mirrors is how necessary those radically new ideas actually are.

Thinking purely about drugs for a second, consider that almost 24% of all drugs that enter clinical trials are abandoned due to ‘strategic business decisions’. Given this, it feels deeply unlikely that the medical industry lacks interesting leads. You could explain this away by citing potential efficacy concerns with these drugs, but given the success that people have had in reviving shelved drug assets — a practice called drug repositioning — the shelving feels much more closely related to random economic winds. People have found a degree more success in drug repurposing, which is just picking up already-FDA-approved drugs for one condition and seeing if they work for another condition.

Funnily, the semantic difference between these two terms (repurposing versus repositioning) is still a bit fuzzy, so papers may use them interchangeably, but both are astonishingly efficacious and cheap. And also, boring to pursue. Fellow biology writer Trevor Klee has written about one such drug repurposing success story in the past; I’d highly recommend reading it.

Now, fairly enough, there are lots of non-scientific issues with doing drug repurposing/repositioning. Enforcing patents on repurposed drugs is challenging since off-label prescription of generic drugs is hard to prevent. Regulatory bodies require the same level of safety evidence for each new drug indication, even if there’s already prior evidence for its safety. And, of course, there are immense organizational hurdles in pharmaceutical companies pursuing indications for shelved drugs outside of their core competency. As this Nature paper discusses, all of these have fixes, but are understandably challenging to address.

Let’s forget drugs entirely. Creating drugs is hard, maybe the boring ideas like drug repurposing and drug repositioning are genuinely insufficient. What if we look outside of that?

To offer one example, I think better CRO’s are an excellent candidate for ‘boring but impactful’. If scientists had access to extremely high quality CRO’s, I have zero doubt that a fair bit of biology research would be more ambitious, more accessible, and more replicable. This is especially true as more and more computational people leak into the life-sciences field, many of whom don’t understand the finer details of the wet lab (me included!).

Yet, few dependable CRO’s exist.

I’ve written about Plasmidsaurus before as one such good one, given their combination of valuable service rendered + cheap + fast + easy to work with. Are there others? Charles River Labs? Maybe Twist Biosciences? But, past those, the average experience of working with a CRO is dealing with costs that are nearly price-gouging, months of waiting time, tons of back-and-forth emails, and shoddy results.

Typically, the service that CROs provide is bad enough that an exceptional scientist will still trust their own capability to outperform a CRO on most wet-lab tasks, even in areas outside of that scientist’s core competency. Biologists will make their own viral vectors instead of outsourcing them, or clone a ton of individual plasmids in the most painful way possible, all because they don’t trust the existing marketplace to adequately serve their needs, or to provide a decent service at a reasonable cost.

Why is the state of most CRO’s so dismal?

Here, it feels challenging to say anything that isn’t ‘the best people don’t want to start a CRO’. CRO’s are boring and mostly monotonous work. Their innovations are largely going to be logistical ones, not scientific. The most skilled wet-lab people, the people most suited to dramatically change this space, would much rather work on something far more ambitious.

I understand this mentality and empathize with it. I just think it’s a shame!

There are people trying to alter this. I think Adaptyv Bio is a fantastic example of someone recognizing this market need amongst computational scientists and addressing it. I hope they succeed! But I’d ideally want more — a flood of talented bench lab biologists who have some deep expertise in some set of widely-used techniques and create companies hyper-focused on democratizing those to others.

There’s reason to think the time is ripe for better CRO’s in general.

CRO’s have a reputation of being a grueling place to work at, given the thin margins and speed demands. In turn, this means the workforce is typically inexperienced, freshly graduated researchers (who typically hope to jump to a better company after a few years or head off to grad school in an attempt to avoid future bad employers altogether), which contributes to the often low-quality results produced by them. Experience can have a surprisingly large impact in the result of any moderately complex lab experiment — even when hyper-detailed protocols are available.

It’s hard to fix this problem outright. If you hire better scientists, you need to pay more, which means you need to increase your prices, which means you’ll be undercut by people charging less. And, unfortunately, a ‘high quality results means a high price point’ corporate thesis is as hard in biology as it is hard for every single other non-luxury industry. You need to ensure high quality while keeping prices low if you want to have any real hope of acquiring life-sciences customers.

Better lab automation may be the way to achieve this. This is how Plasmidsaurus keeps their prices low and speed high; almost all of their sequencing is performed by a fleet of Opentron machines. Currently, lab automation excels at extremely routine experiments/assays that don’t deviate much from one run to the next. Sequencing is often a great fit for this, given both how routine and how mechanically simple it is. But there are many other actively-used wet-lab procedures that don’t quite fit that bill. Perhaps better lab robotics could help, or better programming tooling assisted by English-to-code LLM’s.

But, as Josie Zayner states in her article, the value add of further lab automation isn’t clear-cut.

From the piece:

So, while lab automation may help for some routine tasks that aren’t particularly physically complex, going far beyond that may not be worth the effort. She goes on to state that the truly needed innovation in the lab is in better remote access. Again, from the piece:

Science really needs Remote Access and Remote Control. Say on my way to work in the morning I figure I need turn on a couple of incubators or change the temperature in some because I had a culture expressing proteins. Maybe I wanted to equilibrate my H(F)PLC column? What if I could do this from my cell phone? The thing that needs to be optimized in Science is not the "work time" it is the "down time". When I need to wait 30 minutes for my H(F)PLC column to equilibrate and cannot use it or when I need to wait for a centrifuge to cool down or an incubator to heat up or many of the other things. This is where time is wasted. Transferring data files between computers or restarting a experiment on a piece of hardware remotely this would drastically help Scientific productivity.

Figuring all this out sounds boring. Building and testing all of it sounds boring. Finding out how to slightly improve the efficiency of a lab to ensure high quality results sounds boring. Ensuring that your customer base is consistently happy with your services (which account for a very tiny step in the grand picture of science) sounds boring.

But it’s needed, and your prize for engaging in it and doing it well will be a near guarantee of personal wealth and (positive) impact on how science is done. The only cost is silencing your internal snob for what a ‘real’ biology company should be doing. And, in time, I hope biology as a field begins to care more about impact than about doing the thing that sounds the most interesting. Or, at least, weigh it similarly to how software does it.

Obviously, I’m grateful for the people doing ambitious things in biology, because ambitious things do have outsized potential payoffs that boring companies often don’t have. I think the company I work at, Dyno Therapeutics, is doing something ambitious, as are Gordian Bio, Cradle, and many others. But more boring, but useful, companies run by smart people would be a fantastic addition to the ecosystem. There really aren’t enough of them.

If you consider yourself a boring biotech company, reach out to me! I’d love to hear your story and potentially write about you. People going down that path deserve more attention.

That’s technically all I have to say, but there are some immediate responses that people may have to this essay that I think are worth addressing.

Pure software is different. In biology, boring companies usually aren’t going to become Stripe-sized companies. Yeah, probably. Vertical business models in biology, which most boring non-therapeutic biotechs will be, are pretty profit-capped. But like…I don’t get the obsession with wanting to become a $1B+ company. You could personally become extraordinarily rich by doing the boring stuff! You could personally have a massive impact on science by doing the boring stuff! Achieving a multi-billion dollar valuation is something venture capitalists and public market investors should care about, and I think it’s bad for anyone who doesn’t fall into those camps to be psyoped into personally caring about that too.

Making money at all in biology requires being a therapeutics company, which requires you to do something exciting. I get that it’s a popular trend amongst more boring biotechs to eventually dip their toes into therapeutics to make more money, like Schrodinger. But Schrodinger did fine for 30 years on just making already-made molecular dynamics software easier to work with — a pretty boring mission. I think they underperformed relative to their own sense of ambition, given their decision to IPO a few years back, but they had become a staple piece of software to have in many industry and academic labs prior to that. Widespread industry penetration, stickiness amongst its user-base, and surviving for decades is a success, and I’m sure made the original founders very rich. Moreover, as Trevor Klee has previously written about, Viking Therapeutics had an $8B dollar market cap and also had the aforementioned boring thesis of drug repurposing. Being exciting is not a prerequisite to doing something useful, even in the therapeutics landscape.

If a startup doesn’t have an exciting/disruptive scientific thesis, it’s not going to be funded by VC’s. One, I don’t think that’s true at all, there are plenty of boring biotech companies that have gotten funding. Kaleidoscope focuses on just making software for collaboration in biotech R&D; boring, and are funded by Dimension and Hummingbird. Culture Biosciences takes bioreactors, connects them to the cloud, and allows you to rent space in them to iterate on your ideas cheaply; boring, and are funded by Craft Ventures and Northpond Ventures. And so on. Two, a sufficiently boring idea shouldn’t need that much VC funding, or maybe even any at all. Just like…build it, ask people to try it, and iterate. If you aren’t throwing stuff into clinical trials or doing large-scale R&D, you probably don’t need much funding.

Generative ML in chemistry is bottlenecked by synthesis

Abhishaike Mahajan — Mon, 16 Sep 2024 16:22:23 GMT

Note: I am not a chemistry expert. Huge shout-out to Anand Muthuswamy, Gabriel Levine, and Corin Wagen for their incredible help in correcting my misunderstandings here! But some may remain, please DM me or comment if you see one!

Also, PostEra is a sponsor on this post! They are one of the few small-molecule ML startups that specialize in helping address synthesizability challenges in generative models. This is distinct from the ‘synthesis problem’ I’ll focus on in this essay, but, as we’ll see, addressing synthesizability in generative models can help alleviate synthesis issues.

For transparency's sake: outside of a request to add in two PostEra-produced works (Manifold and a paper), both of which were highly relevant, PostEra had no other input on this piece.

Introduction

Every single time I design a protein — using ML or otherwise — I am confident that it is capable of being manufactured. I simply reach out to Twist Biosciences, have them create a plasmid that encodes for the amino acids that make up my proteins, push that plasmid into a cell, and the cell will pump out the protein I created.

Maybe the cell cannot efficiently create the protein. Maybe the protein sucks. Maybe it will fold in weird ways, isn’t thermostable, or has some other undesirable characteristic.

But the way the protein is created is simple, close-ended, cheap, and almost always possible to do.

The same is not true of the rest of chemistry. For now, let’s focus purely on small molecules, but this thesis applies even more so across all of chemistry.

Of the 10⁶⁰ small molecules that are theorized to exist, most are likely extremely challenging to create. Cellular machinery to create arbitrary small molecules doesn’t exist like it does for proteins, which are limited by the 20 amino-acid alphabet. While it is fully within the grasp of a team to create millions of de novo proteins, the same is not true for de novo molecules in general (de novo means ‘designed from scratch’). Each chemical, for the most part, must go through its own custom design process.

Because of this gap in ‘ability-to-scale’ for all of non-protein chemistry, generative models in chemistry are fundamentally bottlenecked by synthesis.

This essay will discuss this more in-depth, starting from the ground up of the basics behind small molecules, why synthesis is hard, how the ‘hardness’ applies to ML, and two potential fixes. As is usually the case in my Argument posts, I’ll also offer a steelman to this whole essay.

To be clear, this essay will not present a fundamentally new idea. If anything, it’s such an obvious point that I’d imagine nothing I’ll write here will be new or interesting to people in the field. But I still think it’s worth sketching out the argument for those who aren’t familiar with it.

What is a small molecule anyway?

Typically, small molecules are organic compounds with a molecular weight under 900 daltons. While proteins are simply long chains composed of one-of-20 amino acids, small molecules display a higher degree of complexity. Unlike amino acids, which are limited to carbon, hydrogen, nitrogen, and oxygen (and sometimes sulfur), small molecules incorporate a much wider range of elements from across the periodic table. Fluorine, phosphorus, bromine, iodine, boron, and chlorine have all found their way into FDA-approved drugs.

This elemental variety gives small molecules more chemical flexibility but also makes their design and synthesis more complex. Again, while proteins benefit from a universal ‘protein synthesizer’ in the form of a ribosome, there is no such parallel amongst small molecules! People are certainly trying to make one, but there seems to be little progress.

So, how is synthesis done in practice?

For now, every atom, bond, and element of a small molecule must be carefully orchestrated through a grossly complicated, trial-and-error reaction process which often has dozens of separate steps. The whole process usually also requires tuning non-chemical parameters, such as the pH, temperature, and pressure of the surrounding medium in which the intermediate steps are done. And, finally, the process must also be efficient; the synthesis process must not only achieve the final desired end-product, but must also do so in a way that minimizes cost, time, and required resources.

How hard is that to do? Historically, very hard.

Consider erythromycin A, a common antibiotic.

Erythromycin was isolated in 1949, a natural metabolic byproduct of Streptomyces erythreus, a soil microbe. Its antimicrobial utility was immediately noticed by Eli Lilly; it was commercialized in 1952 and patented by 1953. For a few decades, it was created by fermenting large batches of Streptomyces erythreus and purifying out the secreted compound to package into therapeutics.

By 1973, work had begun to artificially synthesize the compound from scratch.

It took until 1981 for the synthesis effort to finish. Roughly eight years, focused on a single molecule.

Small molecules are hard to fabricate from scratch. Of course, things are likely a lot better today. The space of known reactions has been more deeply mapped out, and our ability to predict reaction pathways has improved dramatically. But arbitrary molecule creation at scale is fully beyond us; each molecule still must go through a somewhat bespoke synthesis process.

Okay, but…why? Why is synthesis so hard?

Why is synthesis hard?

Remember those ball-and-stick chemistry toys you used as a kid? Let’s say you have one in your hands. Specifically, something referred to as a ‘benzene ring’. The black balls are carbon, and the white balls are hydrogen. The lines that connect it all are chemical bonds; one means a single bond, and two means a double bond.

Now let’s say you want to alter it. Maybe you want to add on a new element. Let’s also say that the toy now suddenly starts to obey the actual atomic laws that govern its structure. You pluck out an atom, a hydrogen from the outer ring, so you can stick on another element. The entire structure dissolves in your hands as the interatomic forces are suddenly unevenly balanced, and the atoms are ripped away from one another.

Oops.

Do you know what you actually should’ve done? Dunk the benzene into a dilute acid and then add in your chemical of interest, which also has to be an electrophile. In chemistry terms, perform an electrophilic aromatic substitution — a common way to modify benzene rings. The acid creates an environment where near-instantaneous substitution of the hydrogen atom can occur, without it destabilizing the rest of the ring.

From here

How were you supposed to know that?

In practice, you’d have access to retrosynthesis software like SYNTHIA or MANIFOLD or just dense organic chemistry synthesis textbooks to figure this out. But, also in practice, many reaction pathways literally don’t exist or are challenging to pull off. There’s a wonderful 2019 essay that is illustrative of this phenomenon: A wish list for organic chemistry 1. The author points out five reaction types (fluorination, heteroatom alkylation, carbon coupling, heterocycle modification, and atom swapping) that are theoretically possible, would be extremely desirable to have in a lab, but have no practical way of being done. Of course, there is a way to ‘hopscotch around’ to some desirable end-product using known reactions, but that has its own problems we’ll get to in a second.

Let’s try again. You’re given another toy benzene ring. Actually, many millions of benzene rings, since you’d like to ensure your method works in bulk.

You correctly perform the electrophilic aromatic substitution using your tub of acid, replacing the hydrogen with an electrophile. Well, that’s nice, you've got your newly modified benzene floating in a sea of acid. But you look a bit closer. Some of our benzene rings got a little too excited and decided to add two substituents instead of one, or broke apart entirely, leaving you with a chemical soup of benzene fragments. You also have a bunch of acid that is now useless.

This felt…inefficient.

Despite using the ostensibly correct reaction and getting the product we wanted, we now have a ton of waste, including both the acid and reactant byproducts. It’s going to take time and money to deal with this! In other words, what we just did had terrible process mass intensity (PMI). PMI is a measure of how efficiently a reaction uses its ingredients to achieve the final end-product. Ideally, we’d like the PMI to be perfect: every atom involved in the reaction process is converted to something useful. But some reaction pathways have impossible-to-avoid bad PMI. Unfortunately, our substitution reaction happened to be one of those.
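To make the bookkeeping concrete, here is a minimal sketch of how PMI is usually computed: the total mass of everything that goes into the process (reactants, reagents, solvents) divided by the mass of product that comes out, so a ‘perfect’ PMI is 1. The function name and every mass below are hypothetical numbers I picked for illustration, not values from a real procedure.

```python
def process_mass_intensity(input_masses_g, product_mass_g):
    """PMI = total mass of all inputs / mass of isolated product (ideal value: 1)."""
    if product_mass_g <= 0:
        raise ValueError("product mass must be positive")
    return sum(input_masses_g.values()) / product_mass_g


# Hypothetical input masses in grams, for illustration only.
inputs = {
    "benzene": 80.0,
    "electrophile": 120.0,
    "acid": 500.0,
    "extraction_solvent": 300.0,
}

pmi = process_mass_intensity(inputs, product_mass_g=100.0)
print(f"PMI = {pmi:.1f}")  # 1000 g of inputs for 100 g of product -> PMI = 10.0
```

Note how the acid and solvent dominate the total: in this toy case most of the mass we handled was never going to end up in the product.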

Okay, fine, whatever, we’ll pay for the disposal of the byproducts. At least we got the modified benzene. Given how much acid and benzene we used, we check our notes and expect about 100 grams of the modified benzene to have been made. Wonderful! Now we just need to extract the modified benzene from the acid soup using some solvents and basic chemistry tricks. We do exactly that and weigh it just to confirm we didn’t mess anything up. We see 60 grams staring back at us from the scale.

What? What happened to the rest?

Honestly, any number of things. Some of the benzene, as mentioned, likely just didn’t successfully go through with the reaction, either fragmenting or becoming modified differently. Some of the benzene probably stuck to our glassware or got left behind in the aqueous layer during extraction. A bit might have evaporated, given how volatile benzene and its derivatives can be. Each wash, each transfer between containers, and each filtration step all took its toll. The yield of the reaction wasn’t perfect and it basically never is.

The demons of PMI, yield, and other phenomena are why the ‘hopscotching around’ in reactions is a problem! Even if we can technically reach most of chemical space through many individual reactions, the cost of dealing with byproducts of all the intermediate steps and the inevitably lowered yield can be a barrier to exploring any of it. Like, let’s say you've got a 10-step synthesis planned out, and each step has a 90% yield (which would be extremely impressive in real life). The math shakes out to be the following: 0.9^10 = 0.35 or 35%. You're losing 65% of your theoretical product from yield losses.
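The compounding is easy to check yourself. A small sketch, assuming a purely linear route where the overall yield is just the product of the per-step yields (the function name is mine, and the alternative step yields in the loop are arbitrary):

```python
import math


def overall_yield(step_yields):
    """Overall yield of a linear synthesis: the product of the per-step yields."""
    return math.prod(step_yields)


# Ten steps at 90% each, as in the example above.
print(f"{overall_yield([0.90] * 10):.1%}")  # 34.9%

# How quickly things degrade (or improve) as the per-step yield changes.
for per_step in (0.95, 0.90, 0.80, 0.60):
    print(per_step, round(overall_yield([per_step] * 10), 3))
```

A single step with our earlier 60-out-of-100-grams result would, repeated ten times, leave well under 1% of the theoretical product.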

Moreover, our benzene alteration reaction was incredibly simple. A ten-step synthesis, even if well-characterized, will likely involve trial and error, thousands of dollars in raw material, and a PhD-level organic chemist working full-time at 2 steps/day. And that’s if the conditions of each reaction step are already well-established! If not, single steps could take weeks! All for a single molecule.

Synthesis is hard.

How the synthesis bottleneck manifests in ML

The vast majority of the time, generative chemistry papers will assess themselves in one of three ways:

1. Purely in-silico agreement using some pre-established reference dataset. Here is an example.
2. Using the model as a way to do in-silico screening of a pre-made chemical screening library (billions of molecules that are easy to combinatorially create at scale), ordering some top-N subset of the library, and then doing in-vitro assessment. Here is an example.
3. Using the model to directly design chemicals de novo, synthesizing them, and then doing in-vitro assessment. Here is an example.

1 is insufficient to really understand whether the model is useful. Unfortunately, it also makes up the majority of the small-molecule ML papers out there, since avoiding dealing with the world of atoms is always preferable (and something I sympathize with!).

2 is a good step — you have real-world data — but it still feels limited, does it not? Because you’re relying on this chemical screening library, you’ve restricted your model to the chemical space that is easily reachable through a few well-understood reactions and building blocks. This may not be a big deal, but ideally, you don’t want to restrict your model at all, even if this method allows you to scale synthesis throughput.

3 should be the gold standard here: you’re generating real data while also giving full creative range to the generative model. But, if you look at the datasets associated with 3, you’ll quickly notice a big issue: there’s an astonishingly low number of molecules that are actually synthesized. It is consistently in the realm of <25 generated molecules per paper! And it’s often far, far below that. We shouldn’t be surprised by this, given how many words of this essay have been dedicated to emphasizing how challenging synthesis is, but it is still surprising to observe.

In terms of anecdotes, I put up this question on my Twitter. The replies generally speak for themselves — basically no one disagreed, placing the blame on synthesis (or a lack of wet-lab collaborations) for why such little real-world validation exists for these models.

Returning to the thesis of this post, what is the impact of all of this synthesis difficulty on generative models? I think a clean separation would be model creativity, bias, and slow feedback loops. Let’s go through all three:

Model creativity. While generative models are theoretically capable of exploring vast swathes of chemical space, the practical limitations of synthesis force their operators to use a narrow slice of the model’s outputs. Typically, this will mean either ‘many easily synthesizable molecules’ or ‘a few hard-to-synthesize molecules’. But, fairly, while restraining the utility of a model’s outputs may immediately seem bad, it’s an open question how much this matters! Perhaps we live in a universe where ‘easily synthesizable’ chemical space matches up quite well with the space of ‘useful’ chemical space? We’ll discuss this a bit more in the Steelman section, but I don’t think anyone has a great answer for this.
Bias. Now, certainly, some molecules can be synthesized. And through chemical screening libraries, which allow us a loophole out of this synthesizability hell by combinatorially creating millions of easily synthesizable molecules, we can create even more molecules! Yet, it seems like even this data isn’t enough: many modern ‘representative’ small molecule datasets are missing large swathes of chemical space. Furthermore, as Leash Bio pointed out with their BELKA study, models are really bad at generalizing beyond the immediate chemical space they were trained on! Because of this, and how hard it is to gather data from all of chemical space, the synthesis problem looms largest here.
Slow feedback loops. As I’ve discussed at length in this essay, synthesis is hard. Really hard. I mentioned the erythromycin A synthesis challenge earlier, but even that isn’t the most egregious example! The anticancer drug Paclitaxel took decades to perform a total synthesis of, extending into the early 2020’s! Fairly, both of these fall into the corner of ‘crazy natural products with particularly challenging synthesis routes’. But still, even routine singular chemicals can take weeks to months to synthesize. This means that upon discovering strong deficiencies in a generative chemistry model, providing the data to fix that deficiency can take an incredibly long time.

As a matter of comparison, it’s worth noting that none of these three problems are really the case for proteins! Companies like Dyno Therapeutics (self-promotion) and A-Alpha Bio (who I have written about before) can not only computationally generate 100k+ de novo proteins, but also physically create and assess them in the real world, all within a reasonably short timeframe. Past that, proteins may just be easier to model. After all, while protein space is also quite large, there does seem to be a higher degree of generalization in models trained on them; for example, models trained on natural protein data can generalize beyond natural proteins.

Small molecules have a rough hand here: harder to model and harder to generate data for. The bitter lesson very likely applies here too, which means the latter problem will need to be solved to fix the former.

Potential fixes

Synthesis-aware generative chemistry

Let’s take a step back.

Synthesizability has been discussed in the context of generative chemistry before, but the discussion usually goes in a different direction. In a 2020 paper titled ‘The Synthesizability of Molecules Proposed by Generative Models’, the authors called out an interesting failure mode of generative chemistry models. Specifically, they often generate molecules that are physically impossible or highly unstable. For instance, consider some ML-generated chemicals below:

Some molecules here look fine enough. But most here look very strange, with an impossible number of bonds attached to some elements or unstable configurations that would break apart instantly. This is a consistently observed phenomenon in ML-generated molecules: okay at a glance, but gibberish upon closer inspection. We’ll refer to this as the ‘synthesizability problem’, distinct from the ‘synthesis problem’ we’ve discussed so far.
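As a toy illustration of the ‘impossible number of bonds’ failure, here is a minimal valence check. It is a sketch under heavy assumptions: the valence table is a simplified one I wrote for neutral atoms only, the molecule format (a list of element symbols plus `(i, j, bond_order)` tuples) is made up for this example, and it ignores charges, aromaticity, and hypervalent atoms. Real cheminformatics toolkits do this far more carefully.

```python
# Simplified maximum valences for neutral atoms (an assumption for this sketch;
# charged and hypervalent species are deliberately ignored).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def overvalent_atoms(atoms, bonds):
    """Return indices of atoms whose summed bond orders exceed MAX_VALENCE.

    atoms: list of element symbols, e.g. ["C", "O"]
    bonds: list of (i, j, order) tuples, where i and j index into `atoms`
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [k for k, symbol in enumerate(atoms) if used[k] > MAX_VALENCE[symbol]]


# Formaldehyde skeleton (C=O, hydrogens left implicit): nothing is flagged.
print(overvalent_atoms(["C", "O"], [(0, 1, 2)]))  # []

# A carbon with five single bonds to fluorine: the carbon (index 0) is flagged.
print(overvalent_atoms(["C"] + ["F"] * 5, [(0, k, 1) for k in range(1, 6)]))  # [0]
```

Passing a check like this only says a structure is not obviously nonsensical. It says nothing about whether anyone could actually make it.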

Now, upon first glance, the synthesizability problem is interesting, but it’s unclear how much it really matters. If you solve the synthesizability issue in generative models tomorrow, how much value have you unlocked? If your immediate answer is ‘a lot’, keep in mind that solving the synthesizability issue really only means you can successfully constrain your model to only generate stable things that are physically plausible. As in, molecules that don’t immediately melt and molecules that don’t defy the laws of physics. This is some constraint on the space of all chemicals, but it feels enormously minor.

Of course, improving overall synthesis capabilities is far more important. But how realistic is that? After all, achieving broadly easier synthesis is the holy grail of organic chemistry.

It feels like each time someone successfully chips away at the problem, they are handed a Nobel Prize (e.g. Click Chemistry)! And it’d make for a boring and, to chemists, aggravating essay if our conclusion was ‘I recommend that the hardest known problem in the field should be solved’. Let’s be more realistic.

How do we make the most of the models we already have?

Let’s reconsider the synthesizability problem. If we cannot realistically scale up the synthesis of arbitrary molecules, we could at least ensure that the generative models we’re working with give us synthesizable molecules. But we should make a stronger demand here: we want not just molecules that are synthetically accessible, but molecules that are simple to synthesize.

What does that mean? One definition could be low-step reaction pathways that require relatively few + commercially available reagents, have excellent PMI, and have good yield. Alongside this, we’d also like to know the full reaction pathway too, along with the ideal conditions of the reaction! It’s important to separate out this last point from the former; while reaction pathways are usually immediately obvious to a chemist, fine-tuning the conditions of the reactions can take weeks.

Historically, at least within the last few years, the ‘synthesizability problem’ has been only directed towards chemical accessibility. The primary way this was done was by training not on the chemicals themselves, but only on the retrosynthesis-software-generated reaction pathways used to reach them. As such, at inference time, the generative model does not propose a single molecule, but rather a reaction pathway that leads to that molecule. While this ensures synthetic accessibility of any generated molecule, it still doesn’t ensure that the manner of accessibility is at all desirable to a chemist who needs to carry out hundreds of these reactions. This is partially because ‘desirability’ is a nebulous concept. A paper by MIT professor Connor Coley states this:

One fundamental challenge of multi-step planning is with the evaluation of proposed pathways. Assessing whether a synthetic route is “good” is highly subjective even for expert chemists….Human evaluation with double-blind comparison between proposed and reported routes can be valuable, but is laborious and not scalable to large numbers of pathways

Moreover, while reaction pathways for generated molecules are decent, even if the problem of desirability is still being worked out, predicting the ideal conditions of these reactions is still an unsolved problem. Derek Lowe wrote about a failed attempt to do exactly this in 2022, the primary problem being that historical reaction pathway datasets are untrustworthy or have confounding variables.

But there is reason to think improvements are coming!

Derek covered another paper in 2024, which found positive results in reaction condition optimization using ML. Here, instead of relying purely on historical data, their model was also trained using a bandit optimization approach, allowing it to learn in the loop, balancing between the exploration of new conditions and the exploitation of promising conditions. Still though, there are always limitations with these sorts of things, the paper is very much proof-of-concept. Derek writes:

The authors note that even the (pretty low) coverage of reaction space needed by this technique becomes prohibitive for reactions with thousands of possibilities (and those sure do exist), and in those cases you need experienced humans to cut the problem down to size.
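To give a feel for what ‘balancing exploration and exploitation’ means in this setting, here is a minimal epsilon-greedy bandit over a handful of reaction conditions. To be clear, this is not the algorithm from the paper; it is the simplest bandit I could write down. The condition tuples, the `TRUE_YIELD` table, and the noise level are all invented stand-ins for running a real experiment.

```python
import random


def epsilon_greedy(conditions, run_experiment, n_rounds=200, epsilon=0.1, seed=0):
    """Pick conditions round by round, tracking a running mean yield for each."""
    rng = random.Random(seed)
    counts = {c: 0 for c in conditions}
    mean_yield = {c: 0.0 for c in conditions}
    for _ in range(n_rounds):
        untried = [c for c in conditions if counts[c] == 0]
        if untried:
            choice = untried[0]                           # try everything once
        elif rng.random() < epsilon:
            choice = rng.choice(conditions)               # explore
        else:
            choice = max(conditions, key=mean_yield.get)  # exploit the best so far
        observed = run_experiment(choice, rng)
        counts[choice] += 1
        # Incremental update of the running mean.
        mean_yield[choice] += (observed - mean_yield[choice]) / counts[choice]
    return counts, mean_yield


# Hypothetical (catalyst, solvent, temperature in C) -> true mean yield.
TRUE_YIELD = {
    ("Pd", "DMF", 80): 0.55,
    ("Pd", "THF", 60): 0.70,
    ("Ni", "DMF", 80): 0.40,
}


def simulated_run(condition, rng):
    """Stand-in for a wet-lab experiment: true yield plus noise, clipped to [0, 1]."""
    noisy = TRUE_YIELD[condition] + rng.gauss(0, 0.05)
    return min(1.0, max(0.0, noisy))


counts, estimates = epsilon_greedy(list(TRUE_YIELD), simulated_run)
for condition in TRUE_YIELD:
    print(condition, counts[condition], round(estimates[condition], 2))
```

With three conditions this is trivial. The limitation in the quote above shows up when the list of conditions has thousands of entries and every call to `run_experiment` is a real reaction someone has to set up.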

In a similar vein, I found another interesting 2024 paper that discusses ‘cost-aware molecular design’. This isn’t an ML paper specifically, but instead introduces an optimization framework for deciding between many possible reaction pathways used for de novo molecules. Importantly, the framework considers batch effects in synthesis. It recognizes that making multiple related molecules can be more efficient than synthesizing each one individually, as they may share common intermediates or reaction steps. In turn, this allows it to scale up to hundreds of candidate molecules.

From here
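The batch-effect idea can be sketched in a few lines. This is not the paper’s framework, just a toy cost model I made up: each route is a list of named steps, each step has a fixed cost, and a step shared between routes only has to be paid for once when the molecules are made together. Step names and costs are hypothetical, and the model ignores scale (making more of a shared intermediate is not free in reality).

```python
def individual_cost(routes, step_cost):
    """Cost if every molecule is made on its own, repeating any shared steps."""
    return sum(step_cost[step] for route in routes.values() for step in route)


def batch_cost(routes, step_cost):
    """Cost if shared steps (common intermediates) are only run once."""
    unique_steps = {step for route in routes.values() for step in route}
    return sum(step_cost[step] for step in unique_steps)


# Hypothetical step costs in dollars, and three analogs sharing a common core.
step_cost = {"make_core": 500, "add_F": 100, "add_Cl": 120, "add_OMe": 150}
routes = {
    "analog_F": ["make_core", "add_F"],
    "analog_Cl": ["make_core", "add_Cl"],
    "analog_OMe": ["make_core", "add_OMe"],
}

print(individual_cost(routes, step_cost))  # 1870
print(batch_cost(routes, step_cost))       # 870
```

The gap between the two numbers grows with the number of analogs that share the expensive core step, which is the intuition behind choosing candidate molecules as a batch rather than one at a time.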

Of course, there is still lots of work left to do to incorporate these frameworks into actual generative models and to fully figure out reaction condition optimization. Furthermore, time will tell how much stuff like this makes a difference versus raw improvements in synthesis itself.

But in the short term, I fully expect computational advancements in generated molecule synthesizability, synthesis optimization, and synthesis planning to deliver an immense amount of value.

Improvements in synthesis

Despite the difficulty of it, we could dream a little about potential vast improvements in our ability to synthesize things well. Unlike with ML, there is unfortunately no grand unification of arbitrary synthesis theories around; it’s more in the realm of methods that can massively simplify certain operations, but not others.

One of the clearest examples here is skeletal editing, which is a synthesis technique pioneered in just the last few years. Most reactions can only operate on the ‘edges’ of a given chemical, removing, swapping, or adding atoms or functional groups at the periphery. If you want to pop out an atom from the core of a molecule and replace it with something else, you’ll likely need to start your chemical synthesis process from scratch. But the advent of skeletal editing has changed this! For some elements in some contexts, you are able to perform a precise swap: switching a single internal carbon to a single [other thing] in a single reaction, and so on. As always, Derek Lowe has previously commented on the promise of this approach.

Here is a review paper that discusses skeletal editing in more detail, alongside other modern methods that allow for easier synthesis. All of them are very interesting, and it would be unsurprising for some of them to be serious contenders for a Nobel Prize. It may very well be the case that we’ve only scratched the surface of what’s possible here, and further improvements in these easier synthesis methods will, by themselves, allow us to collect magnitudes more data than previously possible.

But for now, while techniques like skeletal editing are incredible, they are really only useful for late-stage molecule optimization, not for the early stages of lead finding. What of this grand unification of chemical synthesis we mentioned earlier? Is such a thing on the horizon? Is it possible there is a world in which atoms can be precisely stapled together in a singular step using nanoscale-level forces? A ribosome, but for chemicals in general?

Unfortunately, this probably isn’t happening anytime soon. But, as with everything in this field, things could change overnight.

Steelman

A steelman is an attempt to provide a strong counter-argument — a weak one being a strawman — to your own argument. And there’s a decent counterargument to this whole essay! Maybe not exactly a counterargument, but something to mentally chew on.

In some very real sense, the synthesis of accessible chemical space is already solved by the creation of virtual chemical screening libraries. Importantly, these are different from true chemical screening libraries, in that they have not yet been made, but are hypothesized to be pretty easy to create using the same combinatorial chemistry techniques. Upon ordering a molecule from these libraries, the company in charge of it will attempt synthesis and let you know in a few weeks if it's possible + send it over if it is.

One example is Enamine REAL, which contains 40 billion compounds. And as a 2023 paper discusses, these ultra-large virtual libraries display a fairly high number of desirable properties. Specifically, dissimilarity to biolike compounds (implying a high level of diversity), high binding affinities to targets, and success in computational docking, all while still having plenty of room to expand.

Of course, it is unarguable that 40 billion compounds, large as it is, is a drop in the bucket compared to the full space of possible chemicals. But how much does that matter, at least for the moment? Well, we do know that there are systemic structural differences between combinatorially-produced chemicals and natural products (which, as discussed with erythromycin A and paclitaxel, are challenging to synthesize). And, historically, natural products are excellent starting points for drug discovery endeavors.

But improvements in life-science techniques can take decades to trickle down to functional impacts, and the advent of these ultra-massive libraries is quite a bit younger than that! Perhaps, over the next few years, as these virtual libraries are scaled larger and larger, the problems that synthesis creates for generative chemistry models will be largely solved. Even if structural limitations for these libraries continue to exist, that may not matter; data will be so plentiful that generalization will occur.

For now, easy de novo design is valuable, especially given that the compounds contained in Enamine REAL are rarely the drugs that enter clinical trials. But, perhaps someday, this won’t be the case, and virtual screening space will end up encompassing all practically useful chemical space.

That’s all, thank you for reading!

If curious, in 2023, Corin Wagen wrote up a post analyzing how the field has shifted in the 4 years since the ‘wish list’ was published. Lots of progress on two of the five, little on the rest.

Wet-lab innovations will lead the AI revolution in biology

Abhishaike Mahajan — Mon, 15 Jul 2024 02:01:58 GMT

This is an 'Argument' post. It is intended to have a reasonably strong opinion, with mildly more conviction than my actual opinion. Think of it closer to a persuasive essay than a review on the topic, which my ‘Primers’ are more-so meant for. Do Your Own Research applies for all my posts, but especially so with these.

Another note: a fair amount of this post is inspired by, and repeats many of the points in, Michael Bronstein’s essay on black box biological data. I highly recommend reading his post!

Introduction

There is a lot of money flowing into biology-ML startups that are ostensibly computational in their approach. They do not have an in-house lab. They are gathering hundreds of H100 GPUs. The company is not built off of an advance in wet-lab technology, but rather an advance in applying enough computation to the right problem.

I disagree with the logical conclusion of this approach. While computation alone may have delivered the first fundamental advance in ML in biology, I believe it alone will not be sufficient for the next.

The next breakthrough needed for better ML in biology will not be better ML. Rather, it will be better wet-lab methods. This isn’t to say better ML won’t be needed at all, but that it will follow innovations first made in the lab.

This essay will cover why.

Before that, I’ll note one thing: this isn’t meant to be a ‘hater’ essay.

Many (likely all) of the founders of these startups are exceedingly intelligent, ambitious and will undoubtedly go on to do incredible things. Just because I disagree with their approach does not mean I think the approach itself will not have immense amounts of value. It likely will! After all, natural language LLMs show us that one needn’t do the ‘correct’ thing (at least according to LeCun) to find plenty of utility; there are tons of failure modes in the current era of natural-language LLMs, but they are still undoubtedly useful for a wide variety of tasks. The same will be true for models in the life-sciences.

The argument

I’ll admit, focusing purely on computational work to push biology forwards makes some sense.

Alphafold2, one of the greatest historical accomplishments of computational biology, basically solved protein structure prediction (with a long list of caveats). And it was largely not built by biologists, but by a team of engineers who simply took existing biological data contained within the Protein Data Bank, and threw ML at it. It was a testament to the power that ML could have: being able to fundamentally change how a field works overnight.

I imagine many saw this as the beginning of what happened in NLP. Politely, but firmly, showing domain experts the door. Get those linguists out of here, more data will replace whatever insights they have! It’s a fun and increasingly popular stance to take. And, to a degree, I agree with it. More data will replace domain experts, the bitter lesson is as true in biology as it is in every other field.

But I also think many people have deeply misunderstood the ‘moral of the story’ of Alphafold2. The real takeaway was ‘ML can be extremely helpful in understanding biology’. But I worry that many people’s takeaway was actually ‘ML is singularly important in pushing biology forwards’. I don’t think this is true at all! In my opinion, what Alphafold2 pulled off — applying a clever model to a large body of pre-existing data to revolutionize a field — is something that will be extremely hard to replicate.

Why? Because we’re almost out of that pre-existing data. If we had enough, sure, throw ML at it and call it a day, just as Alphafold2 did. But we don’t have that luxury anymore.

What do I mean when I say that we’re almost out of pre-existing data? I’ve written about this before: the amount of untrained-on protein sequence and protein structure data is running dry. Alphafold3 relied on largely the same protein structure databases that Alphafold2 used, and while it definitively was an improvement, it wasn’t the same step-change that Alphafold2 was. This is leaking into other modalities as well. The largest single-cell-RNA foundation models have already eaten up most of the existing public datasets in the world, all within 1-2 years of their inception, all with little to show for it. The same is true of DNA language models, plenty of data scale, but no outsized benefits.

We need new modalities of data to train our models with. Ideally, modalities that, 1, have complex underlying distributions, 2, are highly connected to physiologically important phenomena, and 3, are amenable to being collected at scale.

Unfortunately, the data types that meet all 3 of these requirements have already been mined to death: protein sequences, protein structures, genomes, and transcriptomes. We could scale up these forms of data even further, which I do think is a good idea, but what about exploring outside of that? After all, scale alone on those datasets hasn’t yielded especially impressive results; independent replications of Alphafold2 find that it could’ve used 1% of its input dataset and still achieved near-identical accuracies.

If we’re willing to be a little more open-minded, there are plenty of examples of modalities that meet the first two requirements: complex and physiologically important. Some examples include proteoform sequencing, spatial transcriptomics, in-vivo measurements, and protein-protein interactions. The third requirement is just missing; such modalities are hard to generate at scale.

Could that be fixed? I think so. And the way to do so is through wet lab research.

I think people unacquainted with biology have a false perception of how low-throughput biology experimentation is. In many ways, it can be. But the underlying physics of microbiology lends itself very well to experiments that could allow one to collect tens-of-thousands, if not millions, of measurements in a singular experiment. It just needs to be cleverly set up. And while computational work will inevitably play a role in this — as it has in most other measurement revolutions in biology — the innovation itself will be wet-lab in nature.

It thus follows that anybody hoping to push biology-ML further must have a foot not only in the ML world, but also in the biology one.

In my opinion, this is something that many AI-focused biotech companies are neglecting. It’s understandable. Wet lab work is expensive, the risk is much higher, feedback cycles are slow, and so on. Being forced to look deeply into the world of atoms sucks, and I get why many startups are choosing to focus on the far-more-convenient bits instead. I just think it’s the wrong approach.

Is there anybody in biology-ML building off a wet-lab innovation? Lots!

Gordian Biotechnologies has a method for understanding cellular impacts of in-vivo genetic therapies at scale. A-Alpha Bio has a method for collecting protein-protein interactions at scale. Terray Therapeutics has a method for understanding chemical-cell interactions at scale. And, of course, the company I work at, Dyno Therapeutics, has a method to assess in-vivo gene-therapy vector transduction rates at scale. These startups all undoubtedly have smart computational people in-house. But their alpha is not computational, it is wet-lab innovations they are digging into, chaining it with computation to yield useful results. And I think it’ll pay off.

All the aforementioned startups rely on DELs, or DNA-encoded libraries, to study objects of interest. By leveraging the fact that the scientific community can cheaply sequence DNA at ridiculous scales (>trillions of nucleotides a day), and finding clever ways to tie their experiments to DNA, these companies can achieve previously impossible levels of scale in data collection. It is the job of the computational team to make something useful out of the collected data, which is a difficult task in and of itself, but the data itself was only made possible through groundbreaking advancements in DNA sequencing. It was an innovation made at the lab bench.
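As a rough illustration of what "tying an experiment to DNA" buys you, here is a minimal sketch of the readout step: count barcodes in sequencing reads before and after some selection, then compare frequencies. Everything here is an assumption for illustration (the barcode sitting at the start of each read, the barcode length, the toy reads, the pseudocount); none of it is any particular company's pipeline.

```python
import math
from collections import Counter

def count_barcodes(reads, barcode_len):
    """Count reads by barcode, assumed here to be the first bases of the read."""
    return Counter(read[:barcode_len] for read in reads)

def log2_enrichment(pre_reads, post_reads, barcode_len, pseudocount=1.0):
    """log2 of (post-selection frequency / pre-selection frequency) per barcode."""
    pre = count_barcodes(pre_reads, barcode_len)
    post = count_barcodes(post_reads, barcode_len)
    barcodes = pre.keys() | post.keys()
    # Pseudocounts keep barcodes that vanish after selection from dividing by zero.
    pre_total = sum(pre.values()) + pseudocount * len(barcodes)
    post_total = sum(post.values()) + pseudocount * len(barcodes)
    return {
        bc: math.log2(
            ((post[bc] + pseudocount) / post_total)
            / ((pre[bc] + pseudocount) / pre_total)
        )
        for bc in barcodes
    }

# Toy reads: a 4-base barcode followed by a constant region (made up).
pre_reads = ["AAAATTG", "CCCCTTG", "GGGGTTG", "CCCCTTG"]
post_reads = ["AAAATTG", "AAAATTG", "AAAATTG", "CCCCTTG"]

scores = log2_enrichment(pre_reads, post_reads, barcode_len=4)
```

Each barcode stands in for one library member, so a positive score means that member survived the selection more often than its starting abundance would predict. The hard wet-lab part is designing the selection so that the score means something.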

What else could we tie DELs to? What other phenomena remain understudied because we haven’t invested enough resources into better data acquisition? Could DELs be fundamentally improved? Is there something even better than DELs for studying things at scale? I’m extremely curious about the people, groups, and institutions who are asking these questions. And I suspect they’ll be well-rewarded for asking them.

Have the companies approaching the problem using wet-lab innovations as their base been massively successful? No. But, importantly, neither have the pure computational angle groups (post-Alphafold2). Research takes time, and the role of biology-ML is still fuzzy. Nobody yet knows what the right direction is. And that’s why this is an ‘argument’ post: it’s a guess on where the field is heading instead of where it already is.

But I do believe in this guess a fair bit. We’ll see who ends up being right over the next few years!

The steelman

A steelman is an attempt to provide a strong counter-argument — a weak one being a strawman — to your own argument.

The story may very well play out differently than how I’ve discussed it. Here are some promising computational-only directions that may yield step changes in biology-ML:

Multi-modality. If existing datasets are sufficiently tied together, the resulting multi-modality may push us much further than anyone expected. There is some evidence that this is working out quite well! Within small molecules, nach0, a multimodal natural language-chemistry model, found that the addition of natural language to associated chemical structures could vastly improve benchmark results. Within proteomics, Alphafold3 found massively improved results by throwing in chemical structures and RNA alongside its usual protein dataset. Within scRNA models, tying protein functionality (via ESM2 embeddings) to gene transcripts also led to performance improvements. There is a world of multi-modality likely still left unexplored, with Recursion likely leading the pack here given how many layers of the ‘biological stack’ they are collecting data from.
Improving existing datasets. Instead of trying to collect new forms of biological data, the existing ones may just need to be fixed. Pat Walters has talked at length about how small molecule benchmarks are quite bad, but also have clear axes of improvement. A more recent paper also showed that small molecule datasets are extraordinarily limited in the realm of chemical diversity. There may be a huge amount of alpha left on the table by just creating better versions of existing datasets, building better evaluations, and the like.
Preference optimization. This is the most interesting out of the bunch and is probably the best argument against the thesis of the post. Potentially, existing pre-trained biology models are far, far more powerful than we think they are; they just need to be tuned in the correct direction using RLHF-esque techniques. There are plenty of papers — many of them published this year — that suggest that preference optimization can have large returns. Here are some for binder design, antibody design, and stability-optimized protein structures. Supervised fine-tuning seems to disappoint in life-sciences just as it does in NLP, and, similarly, preference optimization does a fair bit better.
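For readers who haven't seen what an ‘RLHF-esque’ objective looks like, here is a minimal sketch of one well-known example, the Direct Preference Optimization (DPO) loss for a single preferred/rejected pair. This is an illustration of the family of techniques; I'm not claiming it is the exact objective used in any of the papers above. The log-probabilities are made-up numbers, and `beta` is a tunable strength parameter.

```python
import math

def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of designs.

    Each argument is the total log-probability a model assigns to a design:
    `policy_*` from the model being tuned, `ref_*` from the frozen
    pre-trained model it started from.
    """
    margin = (policy_logp_preferred - ref_logp_preferred) - (
        policy_logp_rejected - ref_logp_rejected
    )
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Toy numbers (assumptions): the reference model is indifferent between the two.
good_loss = dpo_loss(-1.0, -5.0, -3.0, -3.0)  # tuned model favors the preferred design
bad_loss = dpo_loss(-5.0, -1.0, -3.0, -3.0)   # tuned model favors the rejected design
```

In a biology setting, the "preference" would come from something like an experimental readout or a stability score ranking two designs, rather than from a human rater.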

If I’m forced to offer nuance, the realistic outcome is that the most interesting biology-ML papers in the next few years will involve wet-lab innovations, but will also bring in the above three points. Multi-modality is definitely the future, preference optimization is yielding such good results that it increasingly cannot be ignored, and benchmark/evaluations will only become more important as these models are further adopted.

But if it’s a question of which discoveries will end up being the most important + worthy of your attention, I’d bet on the wet lab ones.

Thank you for reading!

Molecular dynamics data will be essential for the next generation of ML protein models

Abhishaike Mahajan — Sun, 02 Jun 2024 21:04:09 GMT

Introduction

I’ve been pondering the thesis behind this post for a few months now, figuring out how to approach it. In my head, it feels plainly obvious, of course we should use molecular dynamics (MD) to help further train proteomics models. But it’s a good exercise to motivate the whole thing from a first-ish-principles place. Upon writing this, I realized a lot of my initial thoughts about the subject were misguided or misinterpreted. Hopefully this synthesis helps someone understand the role MD will play in the future of proteomics ML.

In this post, I’ll sketch out three reasons why I believe MD will be fundamental in the next generation of proteomics models, each building on the last. We’ll then end with a brief thought on what will be necessary to produce this next generation of models. We’ll also quickly point out one recently released paper that I would bet is an early precursor of what’s to come.

Quick note: we won’t discuss things like neural-network potentials in this essay, which are set up to change MD itself (and may be discussed in a future essay). Instead, we’ll focus entirely on how even the current era of MD is sufficient to dramatically benefit proteomics models.

The arguments

Biology models don’t understand physics

Protein folding models, such as AlphaFold2 (and, recently, AlphaFold3) represent the clearest success of ML applied in the life sciences. In many ways, the single-chain protein structure prediction problem is largely solved, though a long tail of edge cases exists (and will likely continue to exist for a while).

But models like Alphafold2 (AF2) do not work by simulating the physics of a protein. No ML-based folding model seems to, not OmegaFold, not ESM2, none of them. When AF2 first came out, it was likely hypothesized that it had somehow learned a fuzzy notion of physics from end-state structures alone. This was quickly called into suspicion by a 2022 paper titled ‘Current structure predictors are not learning the physics of protein folding’, which found that ‘folding trajectories’ produced by Alphafold2 (the details of which can be found in section 1.14 here) do not recapitulate real folding dynamics at all. This was reaffirmed in a 2023 paper titled ‘Using AlphaFold to predict the impact of single mutations on protein stability and function’, which studied whether Alphafold2 predicted confidence correlated with experimental stability for point mutations. It didn’t! The class of structure prediction models most likely to have learnt a strong notion of biophysics (protein language models, as they do not require MSAs) has also been found to work via implicitly learned coevolutionary information.

This all said, it’s worth mentioning that there is an argument that these models do have some vague notion of physics: they work decently for proteins with little-to-no MSA information. The strongly titled paper ‘Language models generalize beyond natural proteins‘ found exactly this. But they do not claim that this means anything about whether these models have learned physics, but rather that a ‘deep grammar’ underlies all functional proteins, which is perhaps ruled by physics, but does not require understanding physics itself to derive:

This generalization points to a deeper structure underlying natural sequences, and to the existence of a deep grammar that is learnable by a language model. Our results suggest that the vast extent of protein sequences created through evolution contains an image of biological structure and function that reveals design patterns that apply across proteins, that can be learned and recombined by a fully sequence based model. The generalization beyond natural proteins does not necessarily indicate that language models are learning a physical energy. Language models may still be learning patterns, rather than the physical energy, but speculatively, in the limit of infinite sequence data, these patterns might approximate the physical energy. At a minimum the language model must have developed an understanding of the global coherence of a protein connecting the sequence and folded structure.

So, it is still unlikely that these models understand physics — there simply is some universal pattern underlying most proteins in existence. But this universal pattern only seems to take you so far with the current era of models, there are still very likely a massive number of failure modes.

Okay, so, folding models don’t understand physics. Why is this a problem? Why do we care? Let’s say we can magically create a version of Alphafold that intuitively gets electrostatic energy on some abstract level. Why does this help us in any capacity beyond being theoretically interesting?

That leads well into our next point!

Models that learn from physics are better models

There is, I think, some hesitation in combining physics and ML. After all, it’s a strong prior to place on a model, and priors are increasingly out of vogue in the field. Models like Physics-Informed Neural Networks (PINN), which force the model to have an inductive bias towards satisfying user-provided physical laws, have been relatively unpopular (though there is increasingly a resurgence of them). The claim there is that supervised learning from data alone is extremely inefficient for problems that are inherently bound by physics, so adding in hard constraints to the network outputs should help with extrapolation outside the training dataset.

Perhaps the future is indeed PINNs that have grounded physical laws baked into them. But maybe we could try something even simpler. Could we simply pluck out physics-based features from a molecular dynamics simulation (such as free energy calculations), throw that into a model, and observe any increases in accuracy?

Surprisingly, yes!

‘Incorporating physics to overcome data scarcity in predictive modeling of protein function’ did exactly this, deriving physics-based features like the free-energy impact of single-point mutations, along with more dynamical physical properties (measured every nanosecond) such as solvent-accessible surface area and RMSF, and combining them with normal amino-acid derived biochemical features.
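For a sense of how simple some of these dynamical features are, here is a minimal sketch of RMSF (root-mean-square fluctuation) computed per atom from a trajectory. It assumes the frames have already been aligned to a common reference, and it uses plain Python lists for the coordinates; the toy trajectory is made up, and this is not the paper's code.

```python
import math

def rmsf(trajectory):
    """Per-atom root-mean-square fluctuation.

    `trajectory` is a list of frames; each frame is a list of (x, y, z)
    positions, one per atom. Frames are assumed to be pre-aligned.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for a in range(n_atoms):
        # Average position of this atom over the whole trajectory.
        mean_pos = [
            sum(frame[a][k] for frame in trajectory) / n_frames
            for k in range(3)
        ]
        # Mean squared deviation from that average position.
        msd = sum(
            sum((frame[a][k] - mean_pos[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        result.append(math.sqrt(msd))
    return result

# Toy trajectory: atom 0 never moves, atom 1 swings between x = +1 and x = -1.
toy_trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
fluctuations = rmsf(toy_trajectory)
```

Each residue's fluctuation can then sit next to its sequence-derived features as one more input column. The expensive part is not this calculation but generating the trajectory in the first place.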

These features were used to predict the impact of single mutations on gating voltage produced by the BK channel protein. Using physics-based features led to a large improvement:

But MD trajectories are incredibly difficult to calculate, requiring vast amounts of computational resources/time for even small sets of proteins. As we’ll see later, the datasets in this space are still extremely small.

Could we instead rely on a snapshot of potential energy and structural information (also known as ‘energetics’)? This is relatively simple to derive, since you aren’t actually running an MD trajectory, just performing a quick calculation on a pre-existing structure. Perhaps this alone gives us enough information for a model to understand physics?

Somewhat, but there are huge caveats that make it largely useless. Another paper titled ‘Learning from physics-based features improves protein property prediction’ investigated it. They produced the following ‘energetic’ features per protein over either 5 MD samples or 1 MD sample.

They compared these energetic features to a typical one-hot-encoding + structure feature representation network trained to predict the outcome of interest (Baseline), along with the same network having first been pre-trained on several thousand other structures (Pretrain). Specifically, pay attention to the bottom table for GB1 fitness.

While energetic features do seem to improve performance, they only seem to improve performance up to the point of a pre-trained model! More plainly, energetics and sequences teach models the same thing. Another, much more recent, paper titled ‘Biophysics-based protein language models for protein engineering’ found nearly identical results, using only energetics features again. This method is useful only for problems with low-N datasets.

So, energetic features, which are the main ones we’re able to scale, are largely insufficient in high-N settings.

But these results may still raise a seed of doubt in our minds. How do we know dynamics-based information isn’t already encoded in models such as AlphaFold3? And, even if it isn’t, couldn’t we get that information from scaling on sequence/structure alone? The information contained within such dynamics features should be derivable from sequence/structure alone, right? If that was the case for energetics features, why do we expect it to be any different for dynamics features?

A recent paper on AlphaFold3 sheds some light here. The paper, titled ‘AlphaFold3, a secret sauce for predicting mutational effects on protein-protein interactions’, finds two things of note:

MD is still superior to AF3 in predicting impact of single-mutations (a decent proxy for being able to understand physics), implying that physics-based information may still be useful to AF3
AF3 gets far closer to MD-based results than any other model

So, AlphaFold3, which has only ever seen structures, is clearly implicitly learning the potential energy surface of any given protein. It’s not all the way there yet, MD still seems to be strictly better at this task, but the exact role of MD is uncertain given these results. Why can’t we just scale AlphaFold3 even further, with more structures, and avoid the hassle of having to work with MD? Whatever physics information is useful will be learned by the model — given enough datapoints — no need to directly feed in physics-based information.

This is a fair point. We could make the argument that the physics information derivable from MD is really, really hard to get from structure/sequence alone, that this single-mutation challenge may uniquely fail for evolutionarily distinct proteins in AF3 (whereas MD will still perform well), and so on. But it’d be hard to defend any of those statements, bitter lesson and all.

So, why do we need MD?

We are running out of structural and sequence data

Here’s why this post is an ‘argument’. Up to this point, I haven’t really said anything that is completely unsubstantiated. But now I will!

The biology-ML field is running out of useful sequence and structural data to train models.

So even if we could fully derive physics-based information from sequences/structures, it is likely we don’t have enough datapoints for that.

There is no real way for me to prove this. But there are signs!

The recent AlphaFold3 actually points heavily in this direction. It seems to derive most of its improvements from architecture changes, expanded MSA databases, distillation, and (most notably) transfer learning from the addition of new biomolecules to model. But there were no sizable changes/additions to its input sequence/structure training dataset; it still relies mainly on the same PDB. How many experimentally determined structures could really be left to train on?

Okay, so, structural data is likely a bit limited due to cost of acquisition, but what about sequences? Being able to mass-collect metagenomic data means we surely aren’t limited in that realm, right?

xTrimoPGLM (or XT100B), released in early 2024 and really pushing the sequence-only scale hypothesis, disproves this at least a little. It used UniRef90, alongside several massive metagenomic databases (BFD, MGnify, and several others), as training data, resulting in around 940M unique sequences and 20B unique tokens. Despite the trained model having 100B parameters and being trained on 1T tokens in total, which is both far larger than other models and has a nice Chinchilla-scaling ratio, the results aren’t anything extraordinary. While it beats out most other models across a wide range of tasks, it is, for the most part, a modest increase in performance.

Again, I can’t prove that these diminishing returns will absolutely be the case in the upcoming future. Maybe I’m wrong!

Maybe proteomics inheriting the NLP pre-training tasks (autoregressive or masked modeling) is insufficient, and future, more clever pre-training tasks will help make the existing structure/sequence data more useful. Maybe better architectures will pop up. Maybe the inclusion of NMR or cryo-EM structures, small as they are, will still help an immense amount. Maybe most metagenomic data is still largely in-distribution, and companies like Basecamp Research will be able to find more O.O.D data to train our models with.

But I’m a little skeptical.

TLDR

TLDR: current protein models don’t understand physics, physics is useful for understanding biomolecules, MD can (probably) teach models aspects of physics, and it is unlikely that we will ever have enough proxies-of-physics (sequences/structures) for models to implicitly gain that understanding.

Now what?

Building better datasets

What is the next step? The most important one is that we need more standardized, large MD datasets, especially focused on larger biomolecules. The vast majority of existing ones, though large in size (100k-1M~ datapoints), primarily focus on purely small molecule modeling or purely peptide modeling. These are both important in their own right! But larger ones will almost certainly be necessary.

There are already several here for the classical mechanics side, such as ATLAS, which includes 1.5k~ all-atom MD simulations of 38+ residue proteins. For protein-ligands, there is PLAS-20k, which, as the name implies, contains 20k protein-ligand pairs and nearly 100k~ total trajectories. For antibody-antigen docking, there is ThermoPCD, which only has 50 total complexes, but provides the trajectories at several different temperatures.

Quantum datasets will likely play an important role too, given that classical mechanics simulations are often inaccurate. An example of this is MISATO, which is quantum-ish (quantum force fields are only used on the ligand), containing 20k trajectories for protein-ligand pairs. But generally, this area of datasets lags far behind the state-of-the-art in classical mechanics.

It’s very much early days; many of the MD dataset curation papers in the field have only been published in the last 1-2 years!

What the future looks like

Let’s say we build this massive MD dataset. How exactly do we feed it into models?

Basic options work well, as we’ve discussed above. While many papers have used more complex thermodynamics features, such as delta free energy calculations, as input to their models, that isn’t strictly necessary to capture a sense of dynamics. One paper, ‘Assessment of molecular dynamics time series descriptors in protein-ligand affinity prediction’, used basic tsfresh features extracted from the MD trajectory as features and still found minor performance improvements compared to crystal-structure-only features. Of course, this may be falling prey to the same issue as before, where this advantage will disappear upon scaling the sequence dataset size. Thermodynamic features, such as free energy calculations, likely have the strongest signal (given that even AF3 couldn’t outperform them) but seem challenging to calculate at scale.
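To give a flavor of what ‘basic’ time-series descriptors look like, here is a minimal sketch that summarizes one per-frame quantity from a trajectory (say, a distance or an RMSD value) into a few numbers. These are hand-rolled stand-ins for the kind of descriptors tsfresh computes, not tsfresh itself and not the feature set from the paper; the input series is made up.

```python
import statistics

def series_descriptors(values):
    """Summarize a per-frame time series (at least two frames) into scalars."""
    n = len(values)
    mean = statistics.fmean(values)
    # Least-squares slope against the frame index, as a crude drift measure.
    t_mean = (n - 1) / 2
    slope_num = sum((t - t_mean) * (v - mean) for t, v in enumerate(values))
    slope_den = sum((t - t_mean) ** 2 for t in range(n))
    return {
        "mean": mean,
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "slope": slope_num / slope_den,
    }

# Toy series (assumption): some quantity measured at four consecutive frames.
descriptors = series_descriptors([1.0, 2.0, 3.0, 4.0])
```

The appeal is that nothing here requires thermodynamic expertise: any quantity you can log per frame can be turned into a fixed-length feature vector this way.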

To truly take advantage of trajectories in a way amenable to 1M+ trajectories, models will likely need to operate directly on the trajectories themselves rather than represent them via hand-crafted features. As seems to be the case for all ML ideas, this was done just a few months ago (Feb 2024) in a reasonably well-known model: AlphaFlow. It is first trained on the PDB, and then further trained on the aforementioned ATLAS dataset with a flow-matching loss. At run-time, it can create simulated 300-nanosecond trajectories given sequence + MSA alone.
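For readers who haven’t run into flow matching before, here is a minimal sketch of the generic idea on toy 3D coordinates. To be clear about the assumptions: this is the textbook straight-line (conditional) flow-matching objective, not AlphaFlow’s actual parameterization, prior, or loss; there is no neural network; and the "dataset" is a single made-up conformation, which is what lets us write the ideal velocity field in closed form.

```python
import numpy as np

def flow_matching_batch(x1, rng):
    """Sample (x_t, t, target velocity) on straight-line paths from noise x0 to data x1."""
    x0 = rng.normal(size=x1.shape)                 # noise sample
    t = rng.uniform(size=(x1.shape[0], 1, 1))      # one time per batch element
    x_t = (1.0 - t) * x0 + t * x1                  # point along the path
    target_v = x1 - x0                             # constant velocity of that path
    return x_t, t, target_v

def flow_matching_loss(velocity_fn, x1, rng):
    """Mean squared error between predicted and target velocities."""
    x_t, t, target_v = flow_matching_batch(x1, rng)
    return float(np.mean((velocity_fn(x_t, t) - target_v) ** 2))

def sample(velocity_fn, shape, rng, n_steps=50):
    """Generate samples by Euler-integrating the velocity field from t=0 to t=1."""
    x = rng.normal(size=shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full((shape[0], 1, 1), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x

# Toy "dataset": one hypothetical conformation of 5 residues, repeated in a batch.
rng = np.random.default_rng(0)
target = rng.normal(size=(5, 3))
x1 = np.broadcast_to(target, (8, 5, 3))

# With a single data point, the ideal velocity field is known exactly.
def oracle_velocity(x_t, t):
    return (target - x_t) / (1.0 - t)

def zero_velocity(x_t, t):
    return np.zeros_like(x_t)

oracle_loss = flow_matching_loss(oracle_velocity, x1, rng)
zero_loss = flow_matching_loss(zero_velocity, x1, rng)
generated = sample(oracle_velocity, (8, 5, 3), rng)
```

In a real model, `velocity_fn` would be a network conditioned on sequence and MSA, and the data would be many MD frames per protein, so sampling repeatedly from noise yields an ensemble of conformations rather than one structure. Here the oracle drives every noise sample to the same toy conformation, which is only meant to show the mechanics.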

People familiar with the paper may be surprised I’m mentioning it here at all! It isn’t meant to improve performance on tasks such as structure prediction, but rather to be a faster and more accurate way to gather up protein conformations given a sequence. In this respect it does fine; there is an interesting Twitter thread that asks for MD expert opinions on the paper. The responses are generally positive, but there’s a lot of room for improvement.

But the far more interesting part is what AlphaFlow has internally learned about physics and how this transfers to downstream tasks. On the surface, it clearly understands protein flexibility decently, being able to recapitulate true protein dynamics far better than AF2-based methods such as MSA subsampling. But how this translates into new emergent capabilities is still unknown. Keep in mind, it is unlikely that AlphaFlow alone will be a step change in any capacity! The dataset it uses for MD training, ATLAS, is still quite small, covers relatively short time spans, and is based only on classical mechanics. But AlphaFlow represents the first (in my opinion) public release of what the next generation of protein models will look like: a synthesis between sequences/structures and molecular dynamics trajectories.

Conclusion

The field of proteomics is at an inflection point. The current generation of models, while impressive, is fundamentally limited by a lack of understanding of the underlying physics governing protein behavior. This is not a failing of the models themselves, but rather a reflection of the data they are trained on; sequences and static structures alone are insufficient to capture the complex dynamics of proteins in their native environments, at least at current dataset sizes. While AlphaFold3 does seem to be poking at an understanding of these dynamics from structure alone, I am unsure what non-MD tricks are left to really close the gap, and ideally go far past MD alone.

The future is incorporating MD data into the training process! Of course, this is easier said than done. MD simulations are extraordinarily computationally intensive, and generating large-scale datasets will require significant resources. I am unsure who will spearhead this effort, since large-scale MD simulation is really only feasible in the realm of supercomputers. Perhaps DESRES, Isomorphic Labs, or someone else will be the first here, akin to the OpenAI of the biology foundation model world.

There is a mild concern I have here. The early days of biology-ML were heavily assisted by the ML culture it bathed in: lots of transparency, data sharing, code sharing, and so on. But there will come a point where these models become fundamentally valuable to the ultimate goal of actually delivering a drug. And it’s hard to overstate how immensely profitable a drug can be; the canonical blockbuster drug Humira brings in tens of billions a year and has done so for 10+ years. When this time comes, we may see an end to the radical transparency of the field, as previously transparent institutions feverishly protect the secrets behind a potential money printer. This alone isn’t the end of the world; NLP is currently going through something similar. But if something as computationally intensive, difficult to create, and esoteric as MD becomes foundational to the next generation of models, as opposed to the open-sourced PDB and sequence databases, an open-source response to AlphaFold4, AlphaFold5, and so on may become impossible. It is unlikely that models like AlphaFold3 are at that level yet, but the early walling-off of AlphaFold3 (despite it soon being released for academic use) is likely a sign of what’s to come. It is deeply unfortunate that Meta fired their protein AI team; they would have been my first hope for an open-source response to Isomorphic Labs’ models, given all their work on Llama3. The OpenFold Consortium may end up being the primary leader here, as they replicated AlphaFold2, but time will tell. Remember, open source is important for everybody. Reducing technical barriers for curious/enthusiastic people helps for-profit and non-profit entities alike, and it’d be very bad for the field if we saw a shuttering of publicly released models.

At the same time, this may end up being a non-issue. I can very much see a world in which neural-network potentials dramatically speed up the MD data acquisition process. And another world in which MD is valuable, but one needn’t gather millions upon millions of trajectories to learn something useful; a few thousand may be enough to get 80% of the predictive benefit. Again, time will tell.

Either way, I feel strongly that MD and ML will be deeply intertwined in the years to come. Very excited to see how things progress from here on out!