Never stop writing, this is too good!! I’m sending this to my mom next time she asks me what I do. I don’t think there’s a more human explanation about cheminformatics on the internet. Big W.
I’ve had a hunch for a while that biased validation of models is the only valid yardstick so I made this many moons ago, https://github.com/Manas02/analogue-split
It’s very interesting to get answers to questions like “how does the model perform if the test set is made entirely of molecules similar to the training set but with significantly different bioactivity?” (i.e. how well does the model interpolate on between-series hits) or “how does the model perform on a test set with no activity cliffs?” (i.e. how well does the model interpolate on within-series hits).
The results are exactly as you’d hope: model performance decreases from within-series to between-series predictions. Patterson et al. (1996, I think) used “neighbourhood behaviour” to explain this type of phenomenon.
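For anyone curious what that kind of split looks like mechanically, here’s a minimal sketch of the idea (not the actual analogue-split implementation; the fingerprint choice and the similarity/activity-cliff cutoffs are just illustrative):

```python
# Rough sketch of an "analogue split": bucket test molecules by whether they have
# a close structural analogue in the training set with a very different activity
# (an activity cliff) or with a similar activity (within-series). Cutoffs are
# illustrative only, not the ones the analogue-split repo actually uses.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles, radius=2, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

def analogue_split(train_smiles, train_activity, test_smiles, test_activity,
                   sim_cutoff=0.7, cliff_delta=2.0):
    """Split test indices into 'cliff' (similar structure, big activity jump vs.
    nearest training neighbour) and 'smooth' (similar structure, similar activity)."""
    train_fps = [morgan_fp(s) for s in train_smiles]
    cliff_idx, smooth_idx = [], []
    for i, (smi, act) in enumerate(zip(test_smiles, test_activity)):
        sims = DataStructs.BulkTanimotoSimilarity(morgan_fp(smi), train_fps)
        j = int(np.argmax(sims))
        if sims[j] >= sim_cutoff:  # structurally an analogue of something in train
            if abs(act - train_activity[j]) >= cliff_delta:
                cliff_idx.append(i)   # ...but with a big activity jump
            else:
                smooth_idx.append(i)  # ...and with similar activity
        # test molecules with no close training analogue fall into neither bucket here
    return cliff_idx, smooth_idx
```

Evaluating a model separately on the two buckets is what gives you the within-series vs. between-series comparison above.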
PS. XGBoost can’t stop winning and it’s hilarious :p
Wow! Cool benchmark. It’s a fun attitude to take: ‘unbiased’ validation doesn’t actually tell us much, because there is almost always bias you are unaware of, so let’s build our benchmarks to specifically account for it. Hope this sees more use!
Is the problem of bind/non-bind more practically useful than predicting binding affinity values? If I understand correctly, the public results from Leash are around the former, while Boltz and similar models are trained for the latter. I wonder what would happen if we instead trained AF3-style architectures for the binary prediction task - I expect they would also get better - but how much better? And is the binary framing more useful, or the value prediction/ranking?
It's a fair question, and one that I wanted to mention, but I couldn't find any papers that explored this question enough to come up with anything substantial. Hopefully Leash (or someone else) investigates this at some point
My take is that they serve two separate purposes: binary for screening, value for optimization. Obviously, value alone would be great to have at the start, but the computational methods for doing it today are two orders of magnitude slower, so binary seems like it still has its place
This is a great article! It strikes me as a bit odd, though, the implicit framing that there is *a* generalization problem that can be definitively solved. If biochemistry is vast and diverse, then you would expect that sometimes you can generalize and sometimes you can't, with basic principles that apply more in some places than others, and that's just how it is.
It's a bit like trying to "solve" the stock market, or to "solve" ecology. There's no once-and-for-all solving the entire stock market because the stock market is constantly changing.
This is a fair point! One counterexample would be that the underlying physical laws that govern binding interactions do not change, and are well modeled by sufficiently accurate, DFT-level molecular simulation. Surely there is a low-dimensional manifold of this that can be modeled with high enough accuracy to be generally useful across chemical space
Now, there is a point that is midway between us: that you needn't generalize *fully*, but you merely need to generalize to the slice of chemistry that is relevant to humans. You don't need to be super aware of *everything*! And that would also be enough to be incredibly useful.
I interviewed a few folks who are applying ML to molecular simulation last year, and one of them had this wonderful, related line about how they are thinking about the problem:
>I think the scope of things we care about in the context of life sciences, even just molecules that can exist on earth, is so much more restricted than the scope of all possible molecules. So quantum mechanics is almost too good... And you can say, we're okay being bad at a beryllium here, a radon here, a technetium here, and a krypton here. We don't need to be good at that. If we just learn the 15 elements that are in the human body, in ways that would not immediately explode in contact with our atmosphere, we have such a low-dimensional slice of chemistry that we need to learn that we can give our models an inductive bias in that direction.
Link to that podcast: https://www.owlposting.com/p/can-ai-improve-the-current-state
Also, I find it very hilarious but ultimately unsatisfactory that the human-made (artificial) axes of variation in the various datasets, e.g., splitting by timestamp before or after some year or by authorship as you mentioned, seem easier to study. Have researchers looked at how the more subtle, non-artificial variations within each dataset can explain model performance? That, I think, would have a lot of scientific value going forward.
Edit: I think Leash’s experiments as you described are heading in this direction, as they have these clever ways to slice their combinatorial libraries. I hope this could be made more granular and offer more concrete explanations about model behavior.
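For concreteness, here’s a very rough sketch of the kind of building-block-disjoint slicing I mean; the column names and holdout fraction are hypothetical, not how Leash actually does it:

```python
# Rough sketch: slice a combinatorial (DEL-style) library so that the test set
# contains building blocks never seen during training. Columns "bb1"/"bb2"/"bb3"
# are hypothetical building-block identifiers.
import numpy as np
import pandas as pd

def building_block_split(df: pd.DataFrame, holdout_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    held_out = {}
    for col in ["bb1", "bb2", "bb3"]:
        blocks = df[col].unique()
        n_hold = int(len(blocks) * holdout_frac)
        held_out[col] = set(rng.choice(blocks, size=n_hold, replace=False))
    # Test rows use at least one held-out building block; train rows use none.
    uses_holdout = np.column_stack([df[c].isin(held_out[c]) for c in held_out])
    test_mask = uses_holdout.any(axis=1)
    return df[~test_mask], df[test_mask]
```

Holding out whole building blocks, rather than random rows, is what forces the model to extrapolate to chemistry it has never seen, and you can make the slicing as granular as you like (one block held out, two, all three).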
Excellent article! Can you spend a bit more time explaining how researchers chose to compare model performance using these machine learning metrics? For example, when you described that hilarious result from the Kaggle competition, you showed a precision-recall curve, but when discussing Hermes vs. XGBoost performance on different splits, you switched to AUROC scores. Beyond the mathematical definitions, what exactly do they mean for the drug discovery task at hand that justifies them as good metrics for comparing models?
One good framing here is...
AUROC tells you how well the model ranks binders above non-binders overall. It's a pretty decent shorthand metric, but perhaps a bit unspecific
Precision-recall (or AUC-PR) tells you how many of your top N predictions were actually hits, which matters more in practice: because you're only going to synthesize and test a handful of compounds, you care less about overall ranking and more about whether your best guesses are right.
So, in the BELKA case, the top 100 predictions were no better than random (and, given how flat the curve was, potentially all of them were random). Unclear why the switch to AUROC for Hermes occurred! I do think the difference between the metrics mainly matters under class imbalance; with an equal number of positive/negative cases in the validation set, which there was here, the two are more or less interchangeable
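If it helps, here’s a toy sanity check with made-up scores (not BELKA data) showing how the two metrics diverge once the classes become imbalanced:

```python
# Toy illustration: a weak-but-real ranker scored with AUROC vs. average precision
# (a standard AUC-PR estimate), on a balanced vs. a 1%-hit-rate validation set.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def fake_scores(y):
    # Binders (y=1) get slightly higher scores on average -> a mediocre model.
    return rng.normal(loc=y * 0.5, scale=1.0)

for pos_frac in (0.5, 0.01):  # balanced vs. 1% hit rate
    n = 100_000
    y = (rng.random(n) < pos_frac).astype(int)
    s = fake_scores(y)
    print(f"pos_frac={pos_frac:>4}: "
          f"AUROC={roc_auc_score(y, s):.3f}  "
          f"AUC-PR={average_precision_score(y, s):.3f}")
```

On the balanced set the two numbers land close together; at a 1% hit rate the AUROC barely moves while the average precision collapses, which is why AUC-PR tends to be the more honest metric for screening-style problems.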
Also, some interesting reading on AUC-ROC vs AUC-PR:
https://proceedings.mlr.press/v235/mihelich24a.html
https://arxiv.org/abs/2408.10193
https://arxiv.org/abs/2401.06091v4
Coincidentally, a different commenter on this post may be able to shed some light on this: https://open.substack.com/pub/abhishaike/p/an-ml-drug-discovery-startup-trying?utm_campaign=comment-list-share-cta&utm_medium=web&comments=true&commentId=190911678