Wet-lab innovations will lead the AI revolution in biology
1.9k words, 9 minutes reading time
This is an 'Argument' post. It is intended to express a reasonably strong opinion, stated with mildly more conviction than I actually hold. Think of it as closer to a persuasive essay than a review of the topic, which is what my 'Primers' are meant for. Do Your Own Research applies to all my posts, but especially so with these.
Another note: a fair amount of this post is inspired by, and repeats many of the points made in, Michael Bronstein's essay on black box biological data. I highly recommend reading his post!
Introduction
There is a lot of money flowing into biology-ML startups that are ostensibly computational in their approach. They do not have an in-house lab. They are amassing hundreds of H100 GPUs. The company is built not off an advance in wet-lab technology, but off an advance in applying enough computation to the right problem.
I disagree with the logical conclusion of this approach. While computation alone may have delivered the first fundamental advance in ML in biology, I believe it alone will not be sufficient for the next.
The next breakthrough needed for better ML in biology will not be better ML. Rather, it will be better wet-lab methods. This isn’t to say better ML won’t be needed at all, but that it will follow innovations first made in the lab.
This essay will cover why.
Before that, I’ll note one thing: this isn’t meant to be a ‘hater’ essay.
Many — likely all — of the founders of these startups are exceedingly intelligent and ambitious, and will undoubtedly go on to do incredible things. Just because I disagree with their approach does not mean I think it won't deliver immense value. It likely will! After all, natural-language LLMs show us that one needn't do the 'correct' thing (at least according to LeCun) to find plenty of utility — there are tons of failure modes in the current era of natural-language LLMs, but they are still undoubtedly useful for a wide variety of tasks. The same will be true for models in the life sciences.
The argument
I’ll admit, focusing purely on computational work to push biology forwards makes some sense.
Alphafold2, one of the greatest historical accomplishments of computational biology, basically solved protein structure prediction (with a long list of caveats). And it was largely not built by biologists, but by a team of engineers who simply took the existing biological data contained within the Protein Data Bank and threw ML at it. It was a testament to the power ML could have: the ability to fundamentally change how a field works overnight.
I imagine many saw this as the beginning of a replay of what happened in NLP: politely, but firmly, showing domain experts the door. Get those linguists out of here, more data will replace whatever insights they have! It's a fun and increasingly popular stance to take. And, to a degree, I agree with it. More data will replace domain experts; the bitter lesson is as true in biology as it is in every other field.
But I also think many people have deeply misunderstood the ‘moral of the story’ of Alphafold2. The real takeaway was ‘ML can be extremely helpful in understanding biology’. But I worry that many people’s takeaway was actually ‘ML is singularly important in pushing biology forwards’. I don’t think this is true at all! In my opinion, what Alphafold2 pulled off — applying a clever model to a large body of pre-existing data to revolutionize a field — is something that will be extremely hard to replicate.
Why? Because we’re almost out of that pre-existing data. If we had enough, sure, throw ML at it and call it a day, just as Alphafold2 did. But we don’t have that luxury anymore.
What do I mean when I say that we're almost out of pre-existing data? I've written about this before: the supply of untrained-on protein sequence and protein structure data is running dry. Alphafold3 relied on largely the same protein structure databases that Alphafold2 used, and while it was definitely an improvement, it wasn't the same step-change that Alphafold2 was. This is spreading to other modalities as well. The largest single-cell-RNA foundation models have already eaten up most of the existing public datasets in the world, all within 1-2 years of their inception, and all with little to show for it. The same is true of DNA language models: plenty of data to scale on, but no outsized benefits.
We need new modalities of data to train our models with. Ideally, modalities that (1) have complex underlying distributions, (2) are highly connected to physiologically important phenomena, and (3) are amenable to being collected at scale.
Unfortunately, the data types that meet all three of these requirements have already been mined to death: protein sequences, protein structures, genomes, and transcriptomes. We could scale up these forms of data even further, which I do think is a good idea, but what about exploring outside of that? After all, scale alone on those datasets hasn't yielded especially impressive results; independent replications of Alphafold2 find that it could have used 1% of its input dataset and still achieved near-identical accuracy.
If we’re willing to be a little more open-minded, there are plenty of examples of modalities that meet the first two requirements: complex and physiologically important. Some examples include proteoform sequencing, spatial transcriptomics, in-vivo measurements, and protein-protein interactions. The third requirement is just missing; such modalities are hard to generate at scale.
Could that be fixed? I think so. And the way to do so is through wet lab research.
I think people unacquainted with biology have a false perception of biological experimentation as inherently low-throughput. In many ways, it can be. But the underlying physics of microbiology lends itself very well to experiments that could allow one to collect tens of thousands, if not millions, of measurements in a single experiment. It just needs to be cleverly set up. And while computational work will inevitably play a role in this — as it has in most other measurement revolutions in biology — the innovation itself will be wet-lab in nature.
It thus follows that anybody hoping to push biology-ML further must have a foot not only in the ML world, but in the biology one as well.
In my opinion, this is something that many AI-focused biotech companies are neglecting. It’s understandable. Wet lab work is expensive, the risk is much higher, feedback cycles are slow, and so on. Being forced to look deeply into the world of atoms sucks, and I get why many startups are choosing to focus on the far-more-convenient bits instead. I just think it’s the wrong approach.
Is anybody in biology-ML building off a wet-lab innovation? Lots of people are!
Gordian Biotechnologies has a method for understanding the cellular impacts of in-vivo genetic therapies at scale. A-Alpha Bio has a method for collecting protein-protein interactions at scale. Terray Therapeutics has a method for understanding chemical-cell interactions at scale. And, of course, the company I work at, Dyno Therapeutics, has a method to assess in-vivo gene-therapy vector transduction rates at scale. These startups all undoubtedly have smart computational people in-house. But their alpha is not computational; it is the wet-lab innovation each is digging into, chained with computation to yield useful results. And I think it'll pay off.
All of the aforementioned startups rely on DELs, or DNA-encoded libraries, to study their objects of interest. By leveraging the fact that the scientific community can cheaply sequence DNA at ridiculous scales (>trillions of nucleotides a day), and finding clever ways to tie their experiments to DNA, these companies can achieve previously impossible levels of scale in data collection. It is the job of the computational team to make something useful out of the collected data, which is a difficult task in and of itself, but the data itself was only made possible through groundbreaking advancements in DNA sequencing. It was an innovation made at the lab bench.
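To make the "tie your experiment to DNA" idea concrete, here is a minimal sketch of the kind of downstream computation involved: count each library member's barcode in the sequencing reads from a pre-selection and a post-selection pool, then compare frequencies. The fixed barcode position, the barcode-to-variant dictionary, and the pseudocount-based enrichment score are illustrative assumptions, not any of these companies' actual pipelines.

```python
from collections import Counter
from typing import Dict, Iterable

def count_barcodes(reads: Iterable[str],
                   barcode_to_variant: Dict[str, str],
                   barcode_slice: slice = slice(0, 20)) -> Counter:
    """Tally how often each library member's DNA barcode appears in a pool of reads."""
    counts: Counter = Counter()
    for read in reads:
        barcode = read[barcode_slice]        # assume the barcode sits at a fixed position
        variant = barcode_to_variant.get(barcode)
        if variant is not None:              # drop reads with unrecognized barcodes
            counts[variant] += 1
    return counts

def enrichment(pre_selection: Counter, post_selection: Counter,
               pseudocount: float = 1.0) -> Dict[str, float]:
    """Fold-change in each variant's frequency after selection, with a small pseudocount."""
    pre_total = sum(pre_selection.values()) + pseudocount * len(pre_selection)
    post_total = sum(post_selection.values()) + pseudocount * len(post_selection)
    return {
        variant: ((post_selection[variant] + pseudocount) / post_total)
                 / ((pre_selection[variant] + pseudocount) / pre_total)
        for variant in pre_selection
    }
```

The point of the sketch is how little the computation depends on what is being measured: as long as the phenomenon of interest can be linked to a readable barcode, sequencing throughput does the heavy lifting, and the table of enrichment scores becomes training data.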
What else could we tie DELs to? What other phenomena remain understudied because we haven't invested enough resources into better data acquisition? Could DELs be fundamentally improved? Is there something even better than DELs for studying things at scale? I'm extremely curious about the people, groups, and institutions who are asking these questions. And I suspect they'll be well-rewarded for asking them.
Have the companies approaching the problem with wet-lab innovations as their base been massively successful? No. But, importantly, neither have the purely computational groups (post-Alphafold2). Research takes time, and the role of biology-ML is still fuzzy. Nobody yet knows what the right direction is. And that's why this is an 'argument' post: it's a guess on where the field is heading rather than a description of where it already is.
But I do believe in this guess a fair bit. We’ll see who ends up being right over the next few years!
The steelman
A steelman is an attempt to present the strongest possible counter-argument to your own position (as opposed to a strawman, the weakest one).
The story may very well play out differently than how I’ve discussed it. Here are some promising computational-only directions that may yield step changes in biology-ML:
Multi-modality. If existing datasets are sufficiently tied together, the resulting multi-modality may push us much further than anyone expected. There is some evidence that this is working out quite well! Within small molecules, nach0, a multimodal natural language-chemistry model, found that adding natural language alongside chemical structures could vastly improve benchmark results. Within proteomics, Alphafold3 found massively improved results by throwing chemical structures and RNA in alongside its usual protein dataset. Within scRNA models, tying protein functionality (via ESM2 embeddings) to gene transcripts also led to performance improvements. There is a world of multi-modality likely still left unexplored, with Recursion arguably leading the pack here, given how many layers of the 'biological stack' they are collecting data from.
Improving existing datasets. Instead of trying to collect new forms of biological data, the existing ones may just need to be fixed. Pat Walters has talked at length about how small molecule benchmarks are quite bad, but also have clear axes of improvement. A more recent paper also showed that small molecule datasets are extraordinarily limited in their chemical diversity. There may be a huge amount of alpha left on the table by simply creating better versions of existing datasets, building better evaluations, and the like.
Preference optimization. This is the most interesting of the bunch and is probably the best argument against the thesis of this post. Potentially, existing pre-trained biology models are far, far more powerful than we think they are; they just need to be tuned in the correct direction using RLHF-esque techniques. There are plenty of papers — many of them published this year — that suggest preference optimization can have large returns. Here are some for binder design, antibody design, and stability-optimized protein structures. Supervised fine-tuning seems to disappoint in the life sciences just as it does in NLP, and, similarly, preference optimization does a fair bit better. A minimal sketch of the core idea follows below.
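As a sketch of what "RLHF-esque" tuning can look like for a sequence model, here is the Direct Preference Optimization (DPO) loss written out in PyTorch. This is a generic formulation, not the exact recipe of the papers above; the tensor names and the choice of beta are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of (preferred, rejected) pairs.

    Each tensor holds per-sequence log-likelihoods (summed over tokens, e.g.
    amino acids) under either the model being tuned ("policy") or the frozen
    pre-trained reference model.
    """
    # How much more the policy favors each sequence than the reference model does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Push the policy to rank the preferred sequence above the rejected one,
    # while the reference term keeps it anchored to the pre-trained model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

The appeal for biology is that the preference pairs can, in principle, come from wet-lab readouts (binding, stability, expression), so a relatively small amount of labeled data can steer a large pre-trained model.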
If I'm forced to offer nuance, the realistic outcome is that the most interesting biology-ML papers of the next few years will involve wet-lab innovations, but will also bring in the above three points. Multi-modality is definitely the future, preference optimization is yielding such good results that it increasingly cannot be ignored, and benchmarks/evaluations will only become more important as these models are further adopted.
But if it’s a question of which discoveries will end up being the most important + worthy of your attention, I’d bet on the wet lab ones.
Thank you for reading!