Five things to keep in mind while reading biology ML papers
1.1k words, 6 minutes reading time
This is a small compilation of things I have picked up as ‘things to watch out for’ while reading biology machine learning papers over the last few years. I’ll use ‘I’ throughout, but this post is co-written with Nathan C. Frey, a scientist at Prescient Design who also writes on Substack; check him out!
Established benchmarks are rarely reflective of the real world.
Naively accepting strong results on benchmarks (e.g. MoleculeNet, FLIP) is a bad idea. Keep in mind that benchmarks in this field are incredibly challenging to create; the distribution shifts created when moving across datasets are massive. Most of the benchmarks in use are a concession to standardization, not something that people actually agree on! Excellent performance on one dataset often doesn't translate to another, even within the same problem domain.
The folks at inductive.bio wrote a very compelling explanation of how this affects small-molecule datasets and, unlike me, offered a potential fix for it in one setting! The fix isn’t anything too complex: pay closer attention to assay stratification and ensure train/test splits contain dissimilar molecules, which ends up yielding a far more generalizable model. Unfortunately, that model also performs worse on the benchmark. I highly recommend reading their post!
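As one concrete illustration of the ‘dissimilar molecules across train/test’ idea (my own sketch, not inductive.bio’s exact procedure, and it ignores the assay-stratification part), here is what a scaffold-based split looks like with RDKit, assuming `smiles_list` holds your dataset’s SMILES strings:

```python
# A minimal sketch of a scaffold-based train/test split. Grouping molecules by
# Bemis-Murcko scaffold keeps structurally similar molecules on the same side
# of the split, which makes the test set less of a giveaway.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    # Group molecule indices by their Murcko scaffold SMILES.
    scaffold_to_indices = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable SMILES
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
        scaffold_to_indices[scaffold].append(i)

    # Assign whole scaffold groups, largest first: big common scaffolds land in
    # train, smaller (rarer) scaffolds fill out the held-out test set.
    groups = sorted(scaffold_to_indices.values(), key=len, reverse=True)
    n_test = int(test_fraction * len(smiles_list))
    train_idx, test_idx = [], []
    for group in groups:
        if len(test_idx) + len(group) <= n_test:
            test_idx.extend(group)
        else:
            train_idx.extend(group)
    return train_idx, test_idx
```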
Most papers will not steelman their baselines.
As is the case in typical ML, biology ML papers usually compare their proposed method to a simpler, more traditional baseline, such as logistic regression. However, as is also the case in typical ML, these baselines are often gamed: papers compare against an extremely basic version of the baseline rather than a more realistic one.
The most common example of this is in molecule-protein docking. A review paper found that ML docking papers regularly use unfair setups when comparing their methods to traditional docking methods. When each method is set up correctly, they find that the performance of ML methods is more nuanced: better at pocket finding, but strictly worse at docking compared to traditional approaches! Another inductive.bio blogpost finds that the docking baselines given in the AlphaFold3 paper could be substantially improved with relatively little code and off-the-shelf methods. Surprisingly, this improved docking baseline ended up approaching AlphaFold3 in accuracy!
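To make the baseline-gaming point concrete with a toy example (mine, not taken from either post): the very same logistic regression can look much weaker if it is denied basic care like feature scaling and hyperparameter tuning. The data below is synthetic and just stands in for any featurized dataset:

```python
# A minimal sketch contrasting a "default" baseline with a steelmanned one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for a featurized bioactivity dataset.
X, y = make_classification(n_samples=500, n_features=128, random_state=0)

# Basic baseline: library defaults, no scaling, no tuning.
weak = LogisticRegression(max_iter=1000)
weak_auc = cross_val_score(weak, X, y, cv=5, scoring="roc_auc").mean()

# Steelmanned baseline: feature scaling plus a small cross-validated search
# over the regularization strength.
strong = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
strong_auc = cross_val_score(strong, X, y, cv=5, scoring="roc_auc").mean()

print(f"default baseline AUC:     {weak_auc:.3f}")
print(f"steelmanned baseline AUC: {strong_auc:.3f}")
```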
Most biology ML research is, generously speaking, curiosity driven. Just because there’s a lot of excitement in one area doesn’t make the area actually useful for practical purposes.
Curiosity-driven research is good, and breakthroughs often come from meandering through hypothesis space. But because the ML community places such a high premium on novelty, this can create a feedback loop where poorly motivated (but unique) approaches that don’t address real problems in biology consume the lion’s share of the ML attention economy. This, in turn, leads to more people focusing on these ill-motivated problems. If you think something sounds cool and interesting, read it! Just remember that claims of novelty are at best uncorrelated with short-term real-world impact, and at worst anti-correlated.
This is the sort of thing that is at least a little ‘taste’ based; plenty of people will disagree on what exactly defines a subfield that is largely populated by people being nerd-sniped. To be taken with a grain of salt, I’ll offer the case of DNA LLMs as an example. A review of the field from March 2024 finds that multiple versions of these models are, currently, no better than simpler models (e.g. trained CNNs) on downstream tasks. It’s also unclear what these models can do that more basic methods cannot!
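For a sense of what ‘simpler models’ means here, this is a sketch of my own of the kind of small 1D CNN baseline such comparisons have in mind; the sequence length, filter count, and binary task are all placeholder assumptions:

```python
# A minimal sketch of a small CNN baseline over one-hot encoded DNA sequences.
import torch
import torch.nn as nn

class SimpleDNACNN(nn.Module):
    def __init__(self, n_filters=64, kernel_size=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, n_filters, kernel_size),  # 4 input channels: A, C, G, T
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),               # global max pool over positions
            nn.Flatten(),
            nn.Linear(n_filters, 1),               # one logit for a binary task
        )

    def forward(self, x):
        # x: (batch, 4, seq_len) one-hot encoded DNA
        return self.net(x)

model = SimpleDNACNN()
dummy = torch.zeros(2, 4, 200)  # a batch of two placeholder sequences
print(model(dummy).shape)       # torch.Size([2, 1])
```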
Knowing the limitations of the assays used is really important in understanding the limitations of a paper.
Nearly every assay used in the life sciences is error-prone, and not in a neat, clean, random-noise way; more in a ‘this works for some domains of proteins or molecules or life, but utterly fails for others’ way. Clonal bridge amplification sequencing, LC–MS, and the like may all fail to produce trustworthy results in certain scenarios, but will still be used in many papers. Authors of papers will usually not make these failure modes obvious; it is up to you to see methods like ‘16S sequencing’ or ‘nitrate tests for UTIs’ and immediately understand the implicit limitations of a piece of work. To be clear, nobody is trying to be deceitful here! Most experts will already be well aware of these limitations when coming into a paper, since they’ll usually be standard for the field, but newcomers may be tripped up.
An indirect example of this is PAINS, or pan-assay interference compounds. These are chemical ‘bad actors’ that often lead to false positives/negatives on certain assays because they interact with the measurement method itself, such as fluorescence, rather than with what the assay is actually studying, such as binding. As a result, PAINS compounds can end up recorded in small-molecule datasets as having properties that, in all likelihood, they don’t have! As is often the case, Pat Walters has an excellent blog post about it.
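If you want to screen for these yourself, RDKit ships a PAINS filter catalog; a minimal sketch, using an arbitrary placeholder molecule:

```python
# Flagging PAINS substructures with RDKit's built-in filter catalog.
from rdkit import Chem
from rdkit.Chem import FilterCatalog

params = FilterCatalog.FilterCatalogParams()
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog.FilterCatalog(params)

mol = Chem.MolFromSmiles("O=C(c1ccccc1)c1ccc(O)cc1")  # placeholder molecule
if catalog.HasMatch(mol):
    match = catalog.GetFirstMatch(mol)
    print("PAINS match:", match.GetDescription())
else:
    print("No PAINS substructure flagged.")
```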
Evaluations are rarely clear-cut.
This is related to the first point about general benchmarks not being trustworthy, but it is a more specific problem: even evaluations crafted directly around the problem at hand are challenging to create. There could be huge motif overlaps between training and test cases (unnoticed unless you look for them directly), pre-trained foundation models could have seen aspects of your test cases (common in NLP, and common in biology as well!), and so on. These issues are harder to detect because they are contextual, but they should be kept in the back of your mind.
I’ve previously written about this with RFDiffusion. To summarize, RFDiffusion primarily tests its peptide binder generation capabilities on receptors it has already seen with bound peptides, likely leading to inflated accuracies that aren’t necessarily wrong, but are perhaps unrealistic. There’s a more general way to phrase the problem: a lot of problems in this space are pair-based, such as molecule-protein or protein-protein interactions, and it’s easy to leak data. The way training/test splits are set up (specifically, how the two halves of a pair are distributed between the training and test sets) can significantly impact reported accuracy! There’s an excellent recent review paper about this exact problem, which goes into depth on how various papers in biology ML are affected by it.
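Here is a toy sketch of the pair-leakage issue (my illustration, not the review’s methodology): splitting pairs without regard to which protein they contain lets the same protein show up in both train and test, while splitting by protein does not. The `pairs` tuples are made-up placeholders:

```python
# A minimal sketch of leakage in pair-based data: (protein_id, ligand_id, label).
pairs = [("P1", "L1", 1), ("P2", "L3", 1), ("P3", "L5", 1),
         ("P1", "L2", 0), ("P2", "L4", 0), ("P3", "L6", 0)]

# Naive split over pairs (here simply the first four vs. the last two):
# the same proteins appear on both sides of the split.
naive_train, naive_test = pairs[:4], pairs[4:]
print("proteins in both splits (naive):",
      {p for p, _, _ in naive_train} & {p for p, _, _ in naive_test})  # P2 and P3 leak

# Grouped split: hold out whole proteins so none are shared across the split.
held_out = {"P3"}
group_train = [x for x in pairs if x[0] not in held_out]
group_test = [x for x in pairs if x[0] in held_out]
print("proteins in both splits (grouped):",
      {p for p, _, _ in group_train} & {p for p, _, _ in group_test})  # empty set
```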
Thank you for reading this post! Every two weeks, I’ll be posting something about biology, ML, or the intersection of the two. If this interests you, please subscribe!