1) You may be interested in the "nucleotide dependency" preprint, which has some interesting ideas on how to go beyond plain log-likelihood (LL) scoring for variant interpretation with DNA LMs (a sketch of that LL baseline follows this list): https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1
2) That GPN-MSA, a much smaller model that mixes a short-context DNA LM with evolutionary conservation from an MSA as input, comes fairly close to Evo 2 for both coding and noncoding variants suggests that, for variant interpretation, a model as large as Evo 2 probably isn't necessary in the long run.
3) There have been multiple updates on HARs, such as the 312 reported in https://www.science.org/doi/10.1126/science.abm1696. And there are also elements such as HAQERs, which are previously neutrally evolving regions that show accelerated evolution in humans: https://pubmed.ncbi.nlm.nih.gov/36423581/.
4) What are these models even learning when genomes like the human genome are nearly 50% repeats, only a minority of which are functional?
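To make the baseline in point 1 concrete: the standard zero-shot score is just the log-likelihood ratio of the alternate vs. reference window under the LM. A minimal sketch, with `seq_log_likelihood` as a placeholder for whatever model you plug in (the toy stand-in below only counts GC content, just so the script runs end to end):

```python
# Zero-shot variant scoring baseline: log-likelihood ratio (LLR) of the
# alternate vs. reference window under a DNA language model. Negative scores
# mean the model finds the alternate allele less likely (more deleterious).
from typing import Callable

def variant_llr(ref_genome: str, pos: int, alt: str,
                seq_log_likelihood: Callable[[str], float],
                window: int = 64) -> float:
    """Score a SNV at 0-based position `pos` as log P(alt) - log P(ref)."""
    start, end = max(0, pos - window), pos + window + 1
    ref_window = ref_genome[start:end]
    alt_window = ref_genome[start:pos] + alt + ref_genome[pos + 1:end]
    return seq_log_likelihood(alt_window) - seq_log_likelihood(ref_window)

def toy_log_likelihood(seq: str) -> float:
    # Stand-in scorer (NOT a real model): simply rewards GC-rich sequence.
    return sum(0.1 if base in "GC" else 0.0 for base in seq)

if __name__ == "__main__":
    genome = "ACGT" * 100
    print(variant_llr(genome, pos=200, alt="G",
                      seq_log_likelihood=toy_log_likelihood))
```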
I am very skeptical of the BRCA1 validations, because the Evo 2 authors first selected a specific layer for embedding generation, and after that only 128 nucleotides of context around each mutation. Supervised Evo 2 performance using the last layer with 8K nucleotides around the variant is far worse than AlphaMissense and GPN-MSA.
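To spell out what that supervised setup looks like: embed a fixed ~128-nt window around the variant for both the reference and alternate sequence using one chosen layer, then fit a small classifier on labeled variants. A rough sketch of that pipeline; `embed_window` is a stand-in (with a real model it would return the chosen layer's mean-pooled hidden state), and the data and labels are random, only to show the shape of the setup:

```python
# Rough sketch of a supervised probe on DNA-LM embeddings: pick one layer,
# embed a fixed ~128-nt window around each variant for the reference and the
# alternate sequence, and fit a simple classifier on labeled variants.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_window(seq: str, dim: int = 64) -> np.ndarray:
    # Placeholder embedding (NOT a real DNA LM): normalized 3-mer counts,
    # just so the example runs.
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    vec = np.zeros(dim)
    for i in range(len(seq) - 2):
        k = idx.get(seq[i], 0) * 16 + idx.get(seq[i + 1], 0) * 4 + idx.get(seq[i + 2], 0)
        vec[k] += 1.0
    return vec / max(len(seq) - 2, 1)

def variant_features(ref_window: str, alt_window: str) -> np.ndarray:
    ref_e, alt_e = embed_window(ref_window), embed_window(alt_window)
    return np.concatenate([ref_e, alt_e, alt_e - ref_e])

# Toy data: random 129-nt windows with random labels, only to show the pipeline.
rng = np.random.default_rng(0)
bases = list("ACGT")
data = [("".join(rng.choice(bases, 129)), "".join(rng.choice(bases, 129)),
         int(rng.random() > 0.5)) for _ in range(200)]
X = np.stack([variant_features(r, a) for r, a, _ in data])
y = np.array([label for _, _, label in data])
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("toy training accuracy:", clf.score(X, y))
```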
However, regarding point 4): during training, Evo 2 downweights the next-token prediction loss by a factor of 10 for masked repeats, so this is taken care of.
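For reference, that mechanism is just a per-token weight on the cross-entropy loss. A minimal sketch of the idea (the 0.1 factor mirrors the "downweight by 10x" point; the repeat mask here is a dummy and this is not Evo 2's actual training code, where the mask would come from repeat annotations):

```python
# Minimal sketch of down-weighting next-token loss on repeat-masked positions.
import torch
import torch.nn.functional as F

def weighted_next_token_loss(logits: torch.Tensor,       # (batch, seq, vocab)
                             targets: torch.Tensor,      # (batch, seq)
                             repeat_mask: torch.Tensor,  # (batch, seq), True = repeat
                             repeat_weight: float = 0.1) -> torch.Tensor:
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    weights = torch.ones_like(per_token)
    weights[repeat_mask] = repeat_weight  # repeat positions contribute 10x less
    return (per_token * weights).sum() / weights.sum()

# Toy usage with a 4-letter DNA vocabulary.
logits = torch.randn(2, 16, 4)
targets = torch.randint(0, 4, (2, 16))
repeat_mask = torch.zeros(2, 16, dtype=torch.bool)
repeat_mask[:, 8:] = True  # pretend the second half of each sequence is repeats
print(weighted_next_token_loss(logits, targets, repeat_mask))
```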
Wow, this was an amazing piece. I didn't know that I enjoy Socratic dialogue essays as well.
Same
Nice post. Part 2 commentary on wet-lab validation?
Proteins are encoded in DNA. The challenges and problem sets of the DNA space are a superset of those in the protein space, with an extra layer of regulatory complexity to tackle.
Prediction of pathogenic mutations is also a largely solved problem. CADD, AlphaMissense, and GPN-MSA do it quite well, mostly relying on a conservation score plus human population frequency (a toy sketch of that recipe is at the end of this comment). What we need more of is to understand why a mutation is pathogenic.
Also, almost all mutations that increase the likelihood of a polygenic disease, for example by +20% relative risk of type 2 diabetes, are not considered "pathogenic" under ACMG criteria, and these mutations are also very interesting and important.
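To make the "conservation score + population frequency" recipe concrete, here is a toy sketch of fitting a small classifier on just those two features; the numbers, labels, and fitted weights are made up purely for illustration and are not how CADD, AlphaMissense, or GPN-MSA are actually built:

```python
# Toy illustration of the "conservation + population frequency" recipe:
# rare and highly conserved tends to score pathogenic; common and unconserved,
# benign. Features and labels are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: phyloP-style conservation score, log10 allele frequency.
X = np.array([
    [7.2, -5.0],   # highly conserved, ultra-rare
    [6.8, -4.3],
    [5.9, -4.8],
    [0.3, -1.2],   # unconserved, common
    [-0.5, -0.8],
    [0.1, -1.5],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = pathogenic, 0 = benign (toy labels)

clf = LogisticRegression().fit(X, y)
new_variant = np.array([[6.5, -4.6]])  # conserved and rare
print("P(pathogenic) =", clf.predict_proba(new_variant)[0, 1])
```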