21 Comments
Manas Mahale

Never stop writing, this is too good!! I’m sending this to my mom next time she asks me what I do. I don’t think there’s a more human explanation about cheminformatics on the internet. Big W.

I’ve had a hunch for a while that biased validation of models is the only valid yardstick, so I made this many moons ago: https://github.com/Manas02/analogue-split

It’s very interesting to get answers to questions like “how does the model perform if the test set is made entirely of molecules similar to the training set but with significantly different bioactivity?” (i.e., how well does the model interpolate on between-series hits), or “how does the model perform on a test set with no activity cliffs?” (i.e., how well does the model interpolate on within-series hits).
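
To make that concrete, here’s a rough sketch of how such a similarity-biased split could be assembled with RDKit. The thresholds, data layout, and function names are illustrative, not taken from the analogue-split repo itself:

```python
# Sketch: pick a test set of "activity cliff" molecules, i.e. molecules that are
# structurally close to something in the training set but differ a lot in activity.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import BulkTanimotoSimilarity

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

def activity_cliff_test_set(train, candidates, sim_cutoff=0.7, act_gap=1.0):
    """train / candidates: lists of (smiles, activity) tuples, activity in log units."""
    train_fps = [fingerprint(s) for s, _ in train]
    test = []
    for smiles, activity in candidates:
        sims = BulkTanimotoSimilarity(fingerprint(smiles), train_fps)
        nearest = max(range(len(sims)), key=lambda i: sims[i])
        # Keep only molecules that look like the training set but behave differently.
        if sims[nearest] >= sim_cutoff and abs(activity - train[nearest][1]) >= act_gap:
            test.append((smiles, activity))
    return test
```

Flipping the second condition (keep only candidates whose activity is close to that of their nearest training neighbour) gives the “no activity cliffs” split instead.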

The results are exactly as you’d hope: model performance decreases from within-series to between-series predictions. Patterson et al. (’96, I think) used “neighbourhood behaviour” to explain this type of phenomenon.

PS. XGBoost can’t stop winning and it’s hilarious :p

Abhishaike Mahajan

Wow! Cool benchmark. It’s a fun attitude to take: ‘unbiased’ validation doesn’t actually tell us much, because there is almost always bias that you are unaware of, so let’s build our benchmarks to specifically account for it. Hope this sees more use!

Chaitanya K. Joshi

Is the problem of bind/non-bind more practically useful than that of predicting binding affinity values? If I understand correctly, the public results from Leash are around the former, while Boltz and similar models are trained for the latter. I wonder what would happen if we instead trained the AF3-style architectures for the binary prediction task - I expect they would also get better - but how much better? And is the binary framing more useful, or the value prediction/ranking?

Abhishaike Mahajan

It's a fair question, and one that I wanted to mention, but I couldn't find any papers that explored this question enough to come up with anything substantial. Hopefully Leash (or someone else) investigates this at some point

My take is that they serve two separate purposes: binary for screening, value for optimization. Obviously, value alone would be great to have at the start, but the computational methods for doing it today are two orders of magnitude slower, so binary seems like it still has its place

Mihir Rao

Would also be worth seeing if P(bind) correlates with affinity. If it does, the binary task may, at first glance, be practically sufficient for the optimization task where continuous affinity values are needed.
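
A quick way to sanity-check that idea, assuming you have the model’s probabilities and measured affinities side by side (the numbers below are placeholders):

```python
# Does P(bind) rank compounds the same way measured affinity does?
import numpy as np
from scipy.stats import spearmanr

p_bind = np.array([0.91, 0.75, 0.40, 0.12, 0.88])  # model's binding probabilities
pKd = np.array([8.2, 7.1, 5.5, 4.9, 7.8])          # measured affinities (higher = tighter)

rho, pval = spearmanr(p_bind, pKd)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```

A high rank correlation would suggest the classifier’s scores already carry enough signal to prioritize compounds during early optimization, even without a calibrated affinity value.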

Chen

I do think both models would be integrated at the end of the day, because they are inherently complementary.

I suspect the secret might be to train with both: bind/no-bind for the pre-training stage, as this data is plentiful and easily accessible, but affinity through SFT or GRPO, since getting chemically validated data is extremely costly and time-consuming.

Peter Mernyei

Boltz 2 does train on and predict both of these types of output! And yes, it does make sense to separate these for the use cases of virtual screening vs. hit/lead optimization.

Seth

This is a great article! It strikes me as a bit odd, though, the implicit framing that there is *a* generalization problem that can be definitively solved. If biochemistry is vast and diverse, then you would expect that sometimes you can generalize and sometimes you can't, with basic principles that apply more in some places than in others, and that's just how it is.

It's a bit like trying to "solve" the stock market, or to "solve" ecology. There's no once-and-for-all solving the entire stock market because the stock market is constantly changing.

Abhishaike Mahajan

This is a fair point! One counterexample would be that the underlying physical laws that govern binding interactions do not change, and are well modeled by sufficiently high-level (DFT-level) molecular simulation. Surely there is a low-dimensional manifold of this that can be modeled with high-enough accuracy to be generally useful across chemical space.

Now, there is a point that is midway between us: that you needn't generalize *fully*, but you merely need to generalize to the slice of chemistry that is relevant to humans. You don't need to be super aware of *everything*! And that would also be enough to be incredibly useful.

I interviewed a few folks who are applying ML to molecular simulation last year, and one of them had this wonderful, related line about how they are thinking about the problem:

>I think the scope of things we care about in the context of life sciences, even just molecules that can exist on Earth, is so much more restricted than the scope of all possible molecules. So quantum mechanics is almost too good... And you can say, we're okay being bad at a beryllium here, a radon here, a technetium here, and a krypton here. We don't need to be good at that. If we just learn the 15 elements that are in the human body, in ways that would not immediately explode in contact with our atmosphere, we have such a low-dimensional slice of chemistry that we need to learn that we can give our models an inductive bias in that direction.

Link to that podcast: https://www.owlposting.com/p/can-ai-improve-the-current-state

Seth

That's a great quote!

All my intuitions come from the sciences organismal-and-up, which are in some sense "obviously" more flexible than biochemistry. But you could also tell a similar story: the space of viable organisms is much smaller than that of imaginable organisms; viable psychologies versus imaginable psychologies; viable economies versus imaginable economies; and so on. And yet! The space of the viable remains big enough to thoroughly flummox us.

I'm not trying to sound fatalistic, and it sounds like the people you're talking to are really doing great work. But I wonder if it might help to reframe the problem slightly: from what, and to where, are we currently able to generalize?

Ziyuan Zhao

Excellent article! Can you spend a bit more time explaining how researchers chose to compare model performance using these machine learning metrics? For example, when you described that hilarious result from the Kaggle competition you showed a precision-recall curve, but when discussing Hermes vs. XGBoost performance on different splits, you switched to AUROC scores. Beyond the mathematical definitions, what exactly do they mean for the drug discovery task at hand that justifies them as good metrics for comparing the models?

Abhishaike Mahajan

One good framing here is...

AU-ROC tells you how well the model ranks binders above non-binders overall. It's a pretty decent shorthand metric, but is perhaps a bit unspecific.

Precision-recall (or AUC-PR) tells you how many of your top N predictions were actually hits, which matters more in practice: you're only going to synthesize and test a handful of compounds, so you care less about overall ranking and more about whether your best guesses are right.

So, in the BELKA case, the top 100 predictions are no better than random (and, given how flat the curve was, potentially all of it was random). Unclear why the switch to AU-ROC for Hermes occurred! I do think the difference between the metrics is mainly significant in cases of class imbalance; with an equal number of positive/negative cases in the validation set, which there were, the two end up telling a fairly similar story.
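
If it helps to see the distinction concretely, here's a tiny scikit-learn sketch on made-up labels and scores:

```python
# AU-ROC rewards ranking binders above non-binders overall; average precision
# (AUC-PR) is dominated by how clean the top of the ranked list is.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])  # 1 = binder
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

print("AU-ROC:", roc_auc_score(y_true, y_score))
print("AUC-PR:", average_precision_score(y_true, y_score))
```

With heavy class imbalance (very few binders, as in a real screen), AUC-PR drops sharply when the top predictions are wrong, while AU-ROC can still look respectable; with balanced classes the two tend to move together.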

W. Sonley

Thanks for taking the time to write this fascinating blog post.

As I understand it, the eventual goal here is something like prompting a model to design an antagonist that binds to a disease-causing protein. However, so far pre-training has failed to yield a useful model; the models all seem to learn some "wrong" thing instead of the "right" thing, which would be binding physics.

It's understandably challenging, as we're dealing with a vastly complicated topography. Kudos to the Leash team for their commitment to intellectual honesty and meaningful progress.

Given that, I have some questions:

1. Is there any form of fine-tuning that could tell the pre-trained molecule model "hey this molecule you generated doesn't bind for reasons XYZ"? Basically an RLHF type of thing that has yielded massively useful tools like ChatGPT. For example, maybe this could be partially automated using molecular dynamics simulations. Through this form of feedback, maybe the model "learns" what drives binding interactions.

2. Might there be productive application of these models if combined with traditional high-throughput screening methodology? Currently they might be useless at generating novel molecules, but perhaps useful at filtering out unlikely candidates from HTS.

3. You mentioned the Bitter Lesson. Is there any implication that maybe anyone working on domain-specific models should just drop it because the frontier labs are soaking up all the funding and compute and trying to build super-intelligent models that will eventually solve all these problems anyway?

Abhishaike Mahajan

Thanks for the comments! I’ll answer the questions one by one:

1. This has been done! E.g., here: https://www.nature.com/articles/s41598-025-98629-1. I do think a problem with this, and why it hasn’t leaked over into everyone’s work, is that the reward model is not what we *actually* want, but usually the output of some binding affinity model or MD data; the former is achievable via pure SFT (and is what we want), and the latter is usually too noisy to be helpful. What you’d actually want is experimental binding affinity data, which is much more trustworthy, but hard to do today, so I’d be bullish on models that have some lab-in-the-loop setup.

2. Yes! My impression is that this is one of the major use cases of these models today: filtering rather than generating. I have discussed this in a past article (https://www.owlposting.com/p/generative-ml-in-chemistry-is-bottlenecked), and my next article will be a reassessment of the work there.

3. Ehh, insofar as anyone should drop their domain-specific models in any field, maybe? But my guess is that it’ll take years for frontier labs to get to the same level of data as some of these companies have (which they will need; LLMs haven’t one-shot many problems in biology yet), and drugs must be made in the meantime.

W. Sonley

Thanks for your answers and links for further reading!

Aditya Nanda

Excellently written article!

In the future, are you planning to write anything on ML in pre-clinical/discovery? Would be super interested in it.

Abhishaike Mahajan

Happy you enjoyed it! And that may be most of this blog honestly :)

Here are some other ones:

A primer on why computational predictive toxicology is hard

https://www.owlposting.com/p/a-primer-on-why-computational-predictive

Mapping the off-target effects of every FDA-approved drug in existence (EvE Bio)

https://www.owlposting.com/p/mapping-the-off-target-effects-of

Drugs currently in clinical trials will likely not be impacted by AI

https://www.owlposting.com/p/drugs-currently-in-clinical-trials

Better antibodies by engineering targets, not engineering antibodies (Nabla Bio)

https://www.owlposting.com/p/better-antibodies-by-engineering

Generative ML in chemistry is bottlenecked by synthesis

https://www.owlposting.com/p/generative-ml-in-chemistry-is-bottlenecked

Cue Parker, MD

I feel dumb as hell but this is excellent

Ziyuan Zhao

Also, I find it very hilarious but ultimately unsatisfactory that the human (artificial) axes of variation in the various datasets, e.g., by timestamp before or after some year, or by authorship as you mentioned, seem easier to study. Have researchers looked at how the more subtle, non-artificial variations within each dataset can explain model performance? That, I think, would have a lot of scientific value going forward.

Edit: I think Leash’s experiments, as you described them, are heading in this direction, as they have these clever ways to slice their combinatorial libraries. I hope this could be made more granular and offer more concrete explanations of model behavior.