"if you can do reliable genome generation, you can create plants that sequester carbon at 1000x the typical rate" -- it seems that I'm still missing the point of these generative models even after reading your excellent essay as I don't understand how one could in principle request for a certain function from these models? All they know is generating natural-looking sequences and I'm failing to see how can we get from that to 1000x faster carbon sequestration?
Well, I’m speaking in terms of being able to do reliable conditional genome generation :) we can do similar things for e.g. enzymes (see the ZymCTRL paper) or protein binders, it’s not a huge stretch to imagine it being able to be done for genomes
Conditional generation for sure! I was more puzzled about generating such genomic arrangements that certain functions would not only be reproduced, but also became better. The ZymCTRL reference was I good point -- it's great paper and helped me get done cookies as to how generating better genomes might work.
Here's how I currently understand it. A good analogy would be thinking of code-generating LLMs. By virtue of having seen a lot of code, LLMs have a chance to learn the best coding practices and be aware of the best available solutions to particular problems. So on average LLMs can generate better code than programmers.
Similarly, by observing genomic sequences across multiple evolutionary distant organisms, models can learn the genomic space of surviving organisms under various conditions and thus generate the most fit genomes under the conditions that we want. On average, this should work better than any existing organism or human expert design.
In both cases, however, we're limited by the available data, meaning that solutions will be interpolations within the manifold spanned by known / already explored solutions. It is possible that even better solutions exist but they are so far out from what programmers (in the case of coding) or evolution (in the case of genomes) have tried that no simple interpolation will suffice to produce such responses. In coding, one potent strategy is using test-time compute to generate such new training samples and validate them. However, in biology newly proposed samples will have to be validated empirically, which is much harder but feasible once we get cheap DNA synthesis going.
And that's how you get to generate genomes that are (eventually) much better by some measure! 1000X might hit some fundamental limits, but 10X seems quite doable.
I take your point about the usefulness of generation of complex features like antibody synthesis or whatever but are nucleotide language models the right level for that? As opposed to a model that operates on a higher level of abstraction. Like with the glycosylation stuff why do you need to do base by base generation, essentially slightly re-engineering each glycosyltransferase, as opposed to gene by gene where you just paste in the appropriate gene sequence or enhancer element or whatever? It would look more like a systems biology model than language model, or maybe something like Future House-esque automated scientist + tons of compute for reasoning
...though come to think of it, probably an AI scientist would still consult a language model while doing the reasoning, so it's good to have around. I slightly wonder how core it would be though.
"if you can do reliable genome generation, you can create plants that sequester carbon at 1000x the typical rate" -- it seems that I'm still missing the point of these generative models even after reading your excellent essay as I don't understand how one could in principle request for a certain function from these models? All they know is generating natural-looking sequences and I'm failing to see how can we get from that to 1000x faster carbon sequestration?
Well, I’m speaking in terms of being able to do reliable conditional genome generation :) We can already do similar things for e.g. enzymes (see the ZymCTRL paper) or protein binders, so it’s not a huge stretch to imagine the same being done for genomes
Conditional generation, for sure! I was more puzzled about how to generate genomic arrangements in which certain functions are not just reproduced but actually improved. The ZymCTRL reference was a good point -- it's a great paper and gave me some clues as to how generating better genomes might work.
Here's how I currently understand it. A good analogy is code-generating LLMs. By virtue of having seen a lot of code, LLMs get to learn best coding practices and become aware of the best available solutions to particular problems. So, on average, LLMs can generate better code than the typical programmer.
Similarly, by observing genomic sequences across many evolutionarily distant organisms, models can learn the space of genomes that survive under various conditions and thus generate the fittest genomes for the conditions we want. On average, this should work better than copying any existing organism or relying on human expert design.
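To make the "request a certain function" part concrete for myself: the way I picture it, the request is just a conditioning tag the model was trained with, along the lines of ZymCTRL's EC-number prompts. Here's a rough sketch of what sampling could look like with the Hugging Face transformers API -- the model name and the trait tag are made-up placeholders, not a real checkpoint or a real prompt format:

```python
# Hypothetical sketch of tag-conditioned generation. The checkpoint name and
# the control-tag vocabulary are placeholders for illustration only.
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "example-org/conditional-genome-lm"  # placeholder, not a real model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# The "request" is a conditioning prefix the model saw during training,
# e.g. an EC number in ZymCTRL or, hypothetically, a trait tag here.
prompt = "<high_carbon_fixation><sep>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,  # length of sequence to sample
    do_sample=True,       # sample rather than greedy decode
    top_p=0.95,
    temperature=1.0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

So "asking for a function" reduces to picking the conditioning tag; the model fills in a sequence that, in its training data, co-occurred with that property.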
In both cases, however, we're limited by the available data, meaning that solutions will be interpolations within the manifold spanned by known / already explored solutions. Even better solutions may exist, but they are so far from what programmers (in the case of coding) or evolution (in the case of genomes) have tried that no simple interpolation will reach them. In coding, one potent strategy is using test-time compute to generate new candidate solutions, validate them, and feed them back as training samples. In biology, however, newly proposed samples will have to be validated empirically, which is much harder but becomes feasible once we get cheap DNA synthesis going.
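The test-time-compute / empirical-validation part would then just be a propose-and-validate loop around the sampler. Everything below is a hypothetical stand-in (generate_candidates for the conditioned model, measure_fitness for an in-silico proxy or, eventually, a wet-lab assay), so it's the shape of the procedure rather than a real pipeline:

```python
# Shape of a propose-and-validate loop; all functions are illustrative
# stand-ins, not real model or assay code.
import heapq
import random


def generate_candidates(n: int) -> list[str]:
    """Stand-in for sampling n conditioned sequences from the model."""
    return ["".join(random.choice("ACGT") for _ in range(100)) for _ in range(n)]


def measure_fitness(seq: str) -> float:
    """Stand-in for validation (in-silico proxy today, wet-lab assay later)."""
    return seq.count("GC") / len(seq)  # toy proxy score


def propose_validate_round(n_candidates: int = 64, keep: int = 8) -> list[str]:
    """One round: sample candidates, validate them, keep the best.
    The kept sequences would become new training data for the model."""
    candidates = generate_candidates(n_candidates)
    scored = [(measure_fitness(s), s) for s in candidates]
    return [s for _, s in heapq.nlargest(keep, scored)]


winners = propose_validate_round()
print(f"kept {len(winners)} validated sequences for the next training round")
```

The sequences that survive validation become new training samples, and that feedback loop is what lets the model eventually move beyond pure interpolation over what evolution has already tried.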
And that's how you get to generate genomes that are (eventually) much better by some measure! 1000x might hit some fundamental limits, but 10x seems quite doable.
Yes! I agree with all of this, and it actually rephrases my ending point in a much better way
I take your point about the usefulness of generating complex features like antibody synthesis or whatever, but are nucleotide language models the right level for that, as opposed to a model that operates at a higher level of abstraction? Like with the glycosylation stuff: why do you need to do base-by-base generation, essentially slightly re-engineering each glycosyltransferase, as opposed to gene-by-gene design where you just paste in the appropriate gene sequence or enhancer element or whatever? That would look more like a systems biology model than a language model, or maybe something like a Future House-esque automated scientist + tons of compute for reasoning
...though come to think of it, probably an AI scientist would still consult a language model while doing the reasoning, so it's good to have around. I slightly wonder how core it would be though.