Discussion about this post

Manas Mahale

Never stop writing, this is too good!! I’m sending this to my mom next time she asks me what I do. I don’t think there’s a more human explanation of cheminformatics on the internet. Big W.

I’ve had a hunch for a while that biased validation of models is the only valid yardstick, so I made this many moons ago: https://github.com/Manas02/analogue-split

It’s very interesting to get answers to questions like “How does the model perform if the test set is made entirely of molecules that are similar to the training set but have significantly different bioactivity?”, i.e. how well the model interpolates on between-series hits, or “How does the model perform on a test set that has no activity cliffs?”, i.e. how well it interpolates on within-series hits.
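A minimal sketch of that kind of similarity-conditioned test split, assuming RDKit is available. This is not the analogue-split implementation; fingerprint, cliff_test_set, sim_cutoff, and act_gap are hypothetical names, and the thresholds are purely illustrative:

```python
# Sketch: pick test molecules that look like training molecules (high
# Tanimoto similarity) but have very different measured activity, i.e.
# activity-cliff-style pairs. Thresholds here are illustrative, not
# the analogue-split defaults.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """Morgan fingerprint (radius 2, 2048 bits) for one SMILES string."""
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), 2, nBits=2048
    )

def cliff_test_set(train, candidates, sim_cutoff=0.7, act_gap=2.0):
    """Return candidates that are Tanimoto-similar (>= sim_cutoff) to at
    least one training molecule while differing in activity by >= act_gap
    (e.g. log units). Both inputs are lists of (smiles, activity) tuples."""
    train_fps = [(fingerprint(smi), act) for smi, act in train]
    picked = []
    for smi, act in candidates:
        fp = fingerprint(smi)
        if any(
            DataStructs.TanimotoSimilarity(fp, tfp) >= sim_cutoff
            and abs(act - tact) >= act_gap
            for tfp, tact in train_fps
        ):
            picked.append((smi, act))
    return picked
```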

The results are exactly as you’d hope: model performance decreases from within-series to between-series predictions. Patterson et al. in ’96 (I think) used “neighbourhood behaviour” to explain this type of phenomenon.

PS. XGBoost can’t stop winning and it’s hilarious :p

Chaitanya K. Joshi

Is the binary bind/non-bind problem more practically useful than predicting binding affinity values? If I understand correctly, the public results from Leash are around the former, while Boltz and similar models are trained for the latter. I wonder what would happen if we instead trained AF3-style architectures on the binary prediction task. I expect they would also get better, but by how much? And is the binary framing more useful than value prediction/ranking?
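For intuition, a hedged sketch of the two framings side by side, using XGBoost on synthetic stand-in data; the 7.0 cutoff, the data, and the model settings are placeholders, not Leash’s or Boltz’s actual setup:

```python
# Two framings of the same task on synthetic stand-in data:
# binary bind/non-bind classification vs. affinity regression.
import numpy as np
from xgboost import XGBClassifier, XGBRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 256)).astype(np.float32)  # fake bit fingerprints
affinity = rng.normal(6.0, 1.5, size=1000)                   # fake pKd-like values
binds = (affinity >= 7.0).astype(int)                        # illustrative cutoff

clf = XGBClassifier(n_estimators=200, max_depth=6).fit(X, binds)    # "does it bind?"
reg = XGBRegressor(n_estimators=200, max_depth=6).fit(X, affinity)  # "how tightly?"

# The regressor's outputs can be thresholded or used to rank compounds,
# which is where the practical value of the two framings diverges.
ranked = np.argsort(-reg.predict(X))
```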
