7 Comments

1) You may be interested in the “nucleotide dependency” preprint, which has some interesting ideas on how to go beyond log-likelihood (LL) for variant interpretation with DNA LMs: https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1

2) The fact that GPN-MSA, a much, much smaller model that mixes a short-context DNA LM with evolutionary conservation from an MSA as input, comes kinda close to Evo 2 for both coding and noncoding variants suggests that, for variant interpretation, a model as large as Evo 2 probably isn’t necessary in the long run.

3) There have been multiple updates on HARs, such as 312 reported in https://www.science.org/doi/10.1126/science.abm1696. And there are also elements such as HAQERs, which are previously neutrally evolving regions that show accelerated evolution in humans https://pubmed.ncbi.nlm.nih.gov/36423581/.

4) What are these models even learning when genomes like the human genome are nearly 50% repeats, only a minority of which are functional?

I am very skeptical of the BRCA1 validations, because the Evo 2 authors first selected a specific layer for generating embeddings, and then a 128-nucleotide context for each mutation. The supervised performance of Evo 2 using the last layer with 8K nucleotides around the variant is way worse than AlphaMissense and GPN-MSA.

However, regarding point 4): during training, Evo 2 downweights the next-token prediction loss by a factor of 10 for masked repeats, so this is taken care of.
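For anyone who wants to see what that kind of downweighting looks like, here is a minimal sketch, not the Evo 2 training code. It assumes repeats are marked by lowercase letters (the usual soft-masking convention), that the loss is normalized by the sum of the weights, and that `weighted_nll` and `repeat_weight` are hypothetical names made up for illustration.

```python
import math


def weighted_nll(tokens, probs, repeat_weight=0.1):
    """Weighted mean next-token negative log-likelihood.

    tokens: target nucleotides; lowercase marks a soft-masked repeat
            (an assumption of this sketch, not a confirmed Evo 2 detail).
    probs:  probability the model assigned to each target token.
    repeat_weight: loss weight for repeat positions (0.1 = downweight by 10x).
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")

    # Repeat positions get a reduced weight; everything else gets weight 1.
    weights = [repeat_weight if t.islower() else 1.0 for t in tokens]
    losses = [-math.log(p) for p in probs]

    # Normalizing by the total weight is a design choice made here for
    # illustration; other normalizations are possible.
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


if __name__ == "__main__":
    # Two unique positions followed by two soft-masked repeat positions.
    print(weighted_nll("ACgt", [0.5, 0.5, 0.25, 0.25]))
```

With `repeat_weight=1.0` this reduces to the ordinary mean loss, so the repeat positions still contribute to training, just with much less influence.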

Wow, this was an amazing piece. Didn't know that I enjoy Socratic dialogue essays as well.

Same

Nice post. Part 2 commentary on wet-lab validation?

Proteins are encoded in DNA. The challenges/problem sets of the DNA space are supersets containing all the challenges in the protein space, with an extra layer of regulatory complexity to tackle.

Prediction of pathogenic mutations is also a largely solved problem. CADD, AlphaMissense, and GPN-MSA do it quite well, mostly relying on conservation scores plus human population frequency. What we need more is to understand why a mutation is pathogenic.

Also, almost all of the mutations that increase the likelihood of a polygenic disease (for example, +20% relative risk of type 2 diabetes) are not considered "pathogenic" by ACMG criteria, and these mutations are also very interesting and important.
