Discussion about this post

Manas Mahale

Never stop writing, this is too good!! I’m sending this to my mom next time she asks me what I do. I don’t think there’s a more human explanation of cheminformatics on the internet. Big W.

I’ve had a hunch for a while that biased validation of models is the only valid yardstick, so I made this many moons ago: https://github.com/Manas02/analogue-split

It’s very interesting to get answers to questions like “How does the model perform if the test set is made up entirely of molecules that are similar to the training set but have significantly different bioactivity?”, i.e. how well the model interpolates on between-series hits, or “How does the model perform on a test set that has no activity cliffs?”, i.e. how well the model interpolates on within-series hits.

The results are exactly as you’d hope: model performance decreases going from within-series to between-series predictions. Patterson et al. in ’96 (I think) used “neighbourhood behaviour” to explain this type of phenomenon.
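To make the idea concrete, here is a minimal sketch of how such a similarity-biased split could be constructed. This is not the analogue-split package’s actual API; the Morgan fingerprints, the 0.7 Tanimoto cutoff, the 1.0 log-unit cliff threshold, and the toy pIC50 values are all illustrative assumptions.

```python
# Minimal sketch of a similarity-biased test split (illustrative only, not the
# analogue-split API). Requires RDKit. All thresholds and activities are made up.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def fingerprint(smiles):
    """Morgan (ECFP4-like) bit-vector fingerprint for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)


def biased_test_sets(train, candidates, sim_cut=0.7, cliff_cut=1.0):
    """Split candidate (smiles, activity) pairs into two test sets, judged
    against the training set:
      - 'cliff':  similar to a training molecule but with a large activity gap
                  (how well does the model handle activity cliffs?)
      - 'smooth': similar to a training molecule with a small activity gap
                  (interpolation with no activity cliffs)
    Candidates with no sufficiently similar training neighbour are dropped."""
    train_fps = [(fingerprint(smi), act) for smi, act in train]
    cliff, smooth = [], []
    for smi, act in candidates:
        fp = fingerprint(smi)
        # Nearest training-set neighbour by Tanimoto similarity.
        best_sim, best_act = max(
            ((DataStructs.TanimotoSimilarity(fp, tfp), tact) for tfp, tact in train_fps),
            key=lambda pair: pair[0],
        )
        if best_sim < sim_cut:
            continue  # not an analogue of anything in the training set
        (cliff if abs(act - best_act) >= cliff_cut else smooth).append((smi, act))
    return cliff, smooth


if __name__ == "__main__":
    # Toy homologous series with made-up pIC50 values.
    train = [("CCCCCCCCC(=O)O", 6.0),    # nonanoic acid
             ("CCCCCCCCCC(=O)O", 6.1)]   # decanoic acid
    candidates = [
        ("CCCCCCCCCCC(=O)O", 7.5),   # close analogue, large activity gap
        ("CCCCCCCCCCCC(=O)O", 6.2),  # close analogue, similar activity
        ("c1ccc2ccccc2c1", 4.0),     # naphthalene, not an analogue of the training set
    ]
    cliff, smooth = biased_test_sets(train, candidates)
    print("activity-cliff test set:", cliff)
    print("no-cliff test set:     ", smooth)
```

With splits constructed this way, the within-series versus between-series performance gap the comment describes can be measured directly by scoring the same model on each test set.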

PS. XGBoost can’t stop winning and it’s hilarious :p

Ziyuan Zhao

Also, I find it very hilarious but ultimately unsatisfactory that the human (artificial) axes of variation in the various datasets, e.g., by timestamp before or after some year or by authorship as you mentioned, seem easier to study. Have researchers looked at how the more subtle, non-artificial variations within each dataset can explain model performance? That, I think, would have a lot of scientific value going forward.

Edit: I think Leash’s experiments, as you described them, are heading in this direction, since they have these clever ways to slice their combinatorial libraries. I hope this can be made more granular and offer more concrete explanations of model behavior.
