Playing devil's advocate: if molecular dynamics can revolutionize proteomics, why hasn't Relay Therapeutics already done this? Are they not gathering the right kind of MD data, or is their data too focused on particular programs rather than on providing training data for the general problem, or is their dataset just too small? Or maybe they have great MD data, but haven't figured out how to train big models on it? Or maybe my premise is wrong, and they are revolutionizing the field while staying quiet?
Who knows, but I do have guesses; I'll just go through them. Obviously, this is my personal opinion and shouldn't be tied to anybody I work for.
1. They are primarily a small molecule company, and seem to be having enough success there that they don't care about trying to make PPI stuff better (which is what MD, imo, will be primarily useful for).
2. They have relatively few computational/engineering people (just from going through LinkedIn) and seem much heavier on the wet-lab side, so the lift involved in creating these sorts of models may be outside their desired scope.
3. They also co-author a lot with DESRES. Much of the MD work may happen with DESRES in the loop, and they understandably may want to keep that private.
4. And the last, most probable possibility is that creating these sorts of models is an organizational nightmare. It's so costly, requires so many of the right people, and takes an immense amount of patience to get working well. PhD students at the right labs seem much better set up to create the first iterations of these models (e.g. AlphaFlow came from three people at MIT); I feel like companies will wait for the approach to be de-risked further + ideated upon before investing time into it. Relay isn't trying to be a general-purpose research org (e.g. DeepMind), so I wouldn't expect them to make the same pie-in-the-sky bets.
Of course, there's always the chance that Relay tried this 2 years ago, it didn't work, and they moved on :)
Fantastic article as always, and I agree strongly with your argument here. The idea that scaling sequence data takes precedence over everything else seems to have jumped from the NLP community and taken hold of the Bio ML community (or at least it did in 2023; the hype seems to have abated somewhat in 2024). This works well for text because we already have so much of it and have billions of internet users producing more of it every single day - we obviously will never cover the full distribution of useful text, but we may get close (and indeed get closer every day). We also have very little idea of what laws govern the production of text (since this essentially amounts to mathematically modeling human thought), so we can't really build any useful priors into our text models, which makes scaling things up our only real avenue to increased performance.
On the other hand, for protein language models, the cost of sampling the full space of possible foldable proteins is essentially infinite, so we can't rely on scaling sequence + structure data alone to get us there. In addition, and unlike in the case of text, we do have readily available physical laws that we can use to bias the model towards plausible solutions for understanding the space of foldable proteins. So if we want to substantially improve performance beyond what AlphaFold3 is already capable of, we need to take advantage of these priors and build them into future models, whether that's by constraining the model in some way or by training directly on MD data as you suggested.
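To make the "build the priors in" point a bit more concrete, here is a minimal toy sketch of the simplest version of that idea: mixing a standard coordinate loss on training frames (which could be MD frames) with a physics-based restraint term. This is entirely my own illustration, not anything from the article or a published model; the tiny model, the harmonic bond-length restraint, and all names and weights are placeholders.

```python
# Toy sketch (hypothetical): data loss + simple physical restraint during training.
import torch
import torch.nn as nn

class ToyStructureHead(nn.Module):
    """Tiny stand-in for a structure module: residue embeddings -> 3D CA coordinates."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, residue_embeddings):   # (L, dim)
        return self.mlp(residue_embeddings)  # (L, 3) predicted coordinates

def bond_length_prior(coords, ideal=3.8, weight=1.0):
    """Harmonic restraint pulling consecutive CA-CA distances toward ~3.8 Å."""
    d = torch.linalg.norm(coords[1:] - coords[:-1], dim=-1)
    return weight * ((d - ideal) ** 2).mean()

def training_step(model, emb, target_coords, optimizer, prior_weight=0.1):
    optimizer.zero_grad()
    pred = model(emb)
    data_loss = ((pred - target_coords) ** 2).mean()          # fit the training frame
    physics_loss = bond_length_prior(pred, weight=prior_weight)  # physics-based prior
    loss = data_loss + physics_loss
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    L, dim = 50, 64
    model = ToyStructureHead(dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    emb = torch.randn(L, dim)                        # placeholder residue embeddings
    target = torch.cumsum(torch.randn(L, 3), dim=0)  # placeholder "MD frame" coordinates
    for step in range(5):
        print(training_step(model, emb, target, opt))
```

In practice the prior could just as well be a full force-field energy term or a constraint baked into the architecture itself; the loss-mixing pattern above is just the easiest version to show in a few lines.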
And why stop there? It feels like protein models would benefit from multi-modality to improve their performance, like GPT-4o, given how many types of data are available beyond structures: function, affinity, thermostability, location… Have you seen this approach being applied?
but what if it kills us?
throwback