
We don't know what most microbial genes do. Can genomic language models help? (Yunha Hwang, Ep #7)

1 hour and 42 minutes listening time

Note: Thank you to rush.cloud and latch.bio for sponsoring this episode!

Rush is augmenting drug discovery for all scientists with machine-driven superintelligence.

LatchBio is building agentic scientific tooling that can analyze a wide range of scientific data, with an early focus on spatial biology. There’s a clip about them in the episode.

If you’re at all interested in sponsoring future episodes, reach out!


  1. Introduction

  2. Timestamps

  3. Transcript

Introduction

This is an interview with Yunha Hwang, an assistant professor at MIT (and co-founder of the non-profit Tatta Bio). She is working on building and applying genomic language models to help annotate the function of the (mostly unknown) universe of microbial genomes.

There are two reasons you should watch this episode.

One, Yunha is working on an absurdly difficult and interesting problem: microbial genome function annotation. Even for E. coli, one of the most studied organisms on Earth, we don’t know what half to two-thirds of its genes actually do. For a random microbe from soil, that number jumps to 80-90%. Her lab is one of the leading groups working to apply deep learning to solving the problem, and last year, released a paper that increasingly feels foundational within it (with prior Owl Posting podcast guest Sergey Ovchinnikov an author on it!). We talk about that paper, its implications, and where the future of machine learning in metagenomics may go.

And two, I was especially excited to film this so I could help bring some light to a platform that she and her team at Tatta Bio have developed: SeqHub. There’s been a lot of discussion online about AI co-scientists in the biology space, but I have increasingly felt a vague suspicion that people are trying to be too broad with them. It feels like the value of these tools is not in general scientific reasoning, but rather comes from deep integration with how a specific domain of research engages with its open problems. SeqHub feels like one of the few systems that mirrors this viewpoint, and while it isn’t something I can personally use—since its use-case is primarily in annotating and sharing microbial genomes, neither of which I work on!—I would still love for it to succeed. If you’re in the metagenomics space, you should try it out at seqhub.org!


Transcript: https://www.owlposting.com/p/we-dont-know-what-most-microbial

Timestamps

00:02:07 – Introduction

00:02:23 – Why do microbial genomes matter

00:04:07 – Deep learning acceptance in metagenomics

00:05:25 – The case for genomic “context” over sequence matching

00:06:43 – OMG: the only ML-ready metagenomic dataset

00:09:27 – gLM2: A multimodal genomic language model

00:11:06 – What do you do with the output of genomic language models?

00:17:41 – How will OMG evolve?

00:20:26 – Why train on only microbial genomes, as opposed to all genomes?

00:22:58 – Do we need more sequences or more annotations?

00:23:54 – Is there a conserved microbial genome ‘language’?

00:28:11 – What non-obvious things can this genomic language model tell you?

00:33:08 – Semantic deduplication and evaluation

00:37:33 – How does benchmarking work for these types of models?

00:41:31 – Gaia: A genomic search engine

00:44:18 – Even ‘well-studied’ genomes are mostly unannotated

00:50:51 – Using agents on Gaia

00:54:53 – Will genomic language models reshape the tree of life?

00:59:18 – Current limitations of genomic language models

01:08:54 – Directed evolution as training data

01:12:35 – What is Tatta Bio?

01:19:02 – Building Google for genomic sequences (SeqHub)

01:25:46 – How to create communities around scientific OSS

01:29:06 – What’s the purpose in the centralization of the software?

01:35:37 – How will the way science is done change in 10 years?

Transcript

[00:02:07] Introduction

Abhi: Today I’m gonna be talking to Yunha Hwang, an assistant professor at MIT, applying machine learning to microbial genomes. She’s also the co-founder and chief scientist at Tatta Bio, a scientific nonprofit dedicated to building tools for genomic AI. Welcome to the show, Yunha.

Yunha: Thank you. Thank you for having me here.

[00:02:23] Why do microbial genomes matter

Abhi: First question, what makes microbial genomes so interesting to you?

Yunha: So yeah, I get this question a lot. If we think about the history of life, microbes have dominated that history of life, which means it’s the most diverse, it’s the most flexible, and in terms of the chemistry that it can do, it’s the most divergent you can possibly imagine. When we think about diversity of sequences, that’s where you’re gonna find most of the diversity of sequences, in microbial genomes. Yeah.

Abhi: And so it feels like a natural place to take AI and ML tools to just throw at it.

Yunha: Yeah. That’s one way to look at it. I think when we think about using biology to do cool things, I think about doing cool chemistry. So there’s like a utility aspect there as well.

Abhi: Were you focused on this topic since your undergrad days, or was it something you switched to during your PhD?

Yunha: Yeah, so I was a computer science student in undergrad, and I learned about the human genome. So I was interested in biology, but I got really hooked when I learned about this field of environmental microbiology, which sounds really niche, but it’s essentially... you’re looking at life in very extreme environments or places that you wouldn’t really typically look for life, such as the deep sea or deserts and so on. And then you’re finding new types of life, all through sequencing, through different kinds of methods. And that’s when I really got hooked in terms of scientific interest.

[00:04:07] Deep learning acceptance in metagenomics

Abhi: I think an interesting trend in a lot of people applying AI to at least somewhat niche fields in biology is that they are usually amongst one of the first people to stand up and say, “Hey, deep learning could be really useful here.” And the culture around that field is usually not pretty accepting of deep learning. How much did you find that when you were applying AI to metagenomics?

Yunha: Yeah, that’s a good question. I think... I think at the beginning, people were a little skeptical. But I think people were also quite open to it because when you’re studying metagenomics, you basically scoop up dirt and then you sequence everything out of it and you use computation to piece them together. So essentially you’re looking at billions of base pairs, and there’s no way a human can do it. There are people who are really good at it and who can piece together entire genomes using manual curation who are just pattern recognition geniuses. But for the most part, we’ve been using computation to study these billions of base pairs of divergent data. So in that sense, people are not so opposed to the idea that, “Wow, maybe we’re not very good at doing this. Maybe we do need machines. We do need some extra layer of understanding in order to understand this massive amount of divergent data.”

[00:05:25] The case for genomic “context” over sequence matching

Abhi: Traditionally—I’m not super familiar with the field—but my interpretation is that the traditional bioinformatic tools for studying metagenomics are like... you’re literally matching nucleotides between sequences that you found in one pile of dirt to another pile of dirt. What is your pitch for a better way to do it?

Yunha: Yeah, that’s a great question. So sequence matching is definitely part of the workflow. I think what’s really interesting is when you can look at a sequence and also consider the context it’s found in, and then understand that sequence within that context. And then also do basically comparative work between that sequence found in different contexts, and how the differences in the sequence can be made sense of using that information. So if you just take out the sequences, then these are just two sequences that are a few mutations apart. But then if you consider the full context of either the sample or the genomic context or the taxonomic context, then you’re actually answering a much more biologically relevant question.

Abhi: So you’re adding multiple layers of information on top of the raw sequences alone? And seeing what else you can pattern match from that?

Yunha: Exactly. Yeah.

[00:06:43] OMG: the only ML-ready metagenomic dataset

Abhi: And I think that leads well to perhaps probably your first big paper in the space. Maybe there’s others. But I think the first one that I was made aware of is a paper that introduces two things. One is a really large metagenomic data set called OMG. The second, included in the same paper, is gLM2, a genomic language model. I’ll separate my questions for both of those. The first one is OMG. Why did you release another metagenomic data set? Because from my outside view, right, there’s already a few out there. Why was there a need for another one?

Yunha: Yeah, that’s a great question. I would argue there were none out there—none that was useful for machine learning. Yeah, so there are public data sets. That doesn’t mean they’re useful, that they can be used immediately for machine learning purposes, for language modeling, for instance. An example is, metagenomic sequences can be very poor in quality, so you do need to do a lot of quality filtering. Also, there’s a distribution effect where you have a lot of really short sequences. ‘Cause as I said, you’re doing shotgun sequencing and piecing them together. So the curve, if you look at the distribution, it’s just like this. So you get a lot of really short sequences that don’t even contain a single gene, so you have to throw them out. If you modeled using that, then you’ll be basically modeling nothing. So there is some sort of filtering that you need to do with quality control.

There are also two major public databases. One is JGI’s IMG database. And the other is EMBL’s [European Molecular Biology Laboratory] MGnify database. And there is overlap between the two, and also a lot of biases. So for instance, people like to... it’s much easier to sample human feces, compared to deep sea ocean, even though that has a lot more diversity. So you get hundreds of samples of the same sort of human gut sample, but then very few of the very diverse deep sea samples. So by putting them together and then doing dereplication and semantic deduplication and various sort of methods in order to de-bias the data set, we’re making it actually a resource that’s useful for machine learning as opposed to its raw state, which was not really useful.

Abhi: So OMG was for the most part a combination of the existing data sets with a huge amount of pre-processing on top.

Yunha: Yeah.
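
(Editor’s note: a minimal sketch, in Python, of the kind of length and quality filtering described above. This is not the actual OMG pipeline; the 2,000 bp threshold, the ambiguous-base cutoff, and the input format are illustrative assumptions.)

```python
def filter_contigs(contigs, min_length=2000, max_ambiguous_frac=0.01):
    """Keep contigs long enough to plausibly hold a full gene, with few 'N' bases."""
    kept = []
    for name, seq in contigs:
        seq = seq.upper()
        if len(seq) < min_length:
            continue  # likely contains no complete gene; modeling it would teach little
        if seq.count("N") / len(seq) > max_ambiguous_frac:
            continue  # low-quality assembly region
        kept.append((name, seq))
    return kept

# Toy usage
contigs = [("contig_1", "ATG" + "ACGT" * 600), ("contig_2", "ATGNNNNACGT")]
print([name for name, _ in filter_contigs(contigs)])  # -> ['contig_1']
```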

[00:09:27] gLM2: A multimodal genomic language model

Abhi: I think that dovetails well, and you mentioned semantic deduplication. I’ll have questions about that later. But first, maybe we can start with... you created this data set, you built a model on top of this data set called gLM2. What is gLM2?

Yunha: gLM2 is a genomic language model, but it’s actually not a DNA language model. So, it’s trained on metagenomic data. It’s a multimodal model in that all the DNA sequences or all the intergenic regions are encoded in DNA nucleotides, and the coding sequences are encoded in amino acids. There was a reason why we did that. We actually wanted to make sure that we can model amino acid interactions across protein sequence boundaries. So if it’s a protein language model, it is not gonna learn protein-protein interactions, because you’re not seeing multiple proteins in the same context. Whereas a genomic language model that contains multi-protein context, you’re actually able to model multi-protein interactions or intergenic region-to-multi-protein interactions, which I think was what we wanted to do. And that was like what we wanted to do from the beginning. That’s why we modeled it that way.

Abhi: What is the actual task for this language model?

Yunha: It’s a masked language model.

Abhi: Like given this protein sequence, intergenic sequence, protein sequence and so on... you mask out like 15% of that. The job is reconstruction?

Yunha: Yeah, exactly.
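
(Editor’s note: a hedged sketch of the two ideas above: a contig represented as mixed-modality elements, intergenic regions as nucleotides and coding sequences as amino acids, plus a masked-token objective. The token choices and the 15% mask rate are assumptions for illustration, not gLM2’s exact configuration.)

```python
import random

MASK = "<mask>"

# One contig as an ordered list of (element_type, tokens):
# intergenic regions ("IGS") in nucleotide space, coding sequences ("CDS") in amino-acid space.
contig = [
    ("IGS", list("ATGCCGTTA")),
    ("CDS", list("MKTAYIAKQR")),
    ("IGS", list("TTAGGC")),
    ("CDS", list("MSDKIIHLTD")),
]

def mask_tokens(elements, mask_rate=0.15, seed=0):
    """Return (masked elements, targets) for one masked-language-model training example."""
    rng = random.Random(seed)
    masked, targets = [], {}
    position = 0
    for etype, tokens in elements:
        out = []
        for tok in tokens:
            if rng.random() < mask_rate:
                targets[position] = tok  # the model must reconstruct this token
                out.append(MASK)
            else:
                out.append(tok)
            position += 1
        masked.append((etype, out))
    return masked, targets

masked_contig, targets = mask_tokens(contig)
print(f"{len(targets)} of {sum(len(t) for _, t in contig)} tokens masked")
```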

[00:11:06] What do you do with the output of genomic language models?

Abhi: At inference time, what do you do with the output of the model?

Yunha: Yeah, so we were mostly interested in representation learning as opposed to generation, for instance. Because our goal was... there were two main tasks. One was we wanted to see if it learns inter-element interaction. So that’s one thing we wanted to learn.

Abhi: By inter-element, does that mean inter-protein...

Yunha: Inter-protein-protein is definitely one. So multi-protein. So protein interactions, but also we wanted to see, can we actually detect RNA-protein interactions? That’d be pretty cool because then you can find new types of RNA-guided systems, or can we just find like promoters for sequences, which we should be able to do, but we still don’t know how to do for a lot of divergent sequences. So that was what we wanted to do as like our primary task.

The secondary task was, we wanted to improve sequence representation such that we can propagate annotations better. So by that... so basically we have this problem where we have a lot of proteins and sequences, but we know less than 1% of what they do. Because we laboratory validated less than 1% of these proteins. So the problem is, there’s no way we’re gonna be able to laboratory validate all of these functions when we don’t even know what the assay is. So then the problem is we... the thing that you have to do is you need to propagate that information as much as possible, and then help that information guide the next set of experiments. And that’s the only thing we know how to do. And the only method that we’ve been doing it with was sequence similarity-based propagation. So if things are decently similar, we just call it the same thing, which is true for... to a certain degree. And then now you can do it with structure with FoldSeek and so on. If things are similar in structure, we just call it the same thing, which is also not always true, but it’s the best attempt at doing what we have to do.

So you can think of that as we’re basically compressing information across these different axes of information, which is sequence, and the other one is structure. The question is, can we do that across context? And that was a sort of motivating factor for genomic language modeling. Can we infuse like contextual information such that things that are similar in context would be pushed together in representation space, such that we can actually propagate information from one protein to another protein because they share the identical semantics in terms of context. So that was the sort of main motivator for why we wanted to do representation learning.
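
(Editor’s note: a minimal sketch of embedding-based annotation propagation as described above: transfer a label from the nearest annotated sequence in representation space, or abstain when nothing is close. The random vectors stand in for real gLM2 embeddings, and the similarity threshold is an illustrative assumption.)

```python
import numpy as np

def propagate_annotation(query, reference_embs, labels, min_similarity=0.5):
    """Assign the nearest neighbor's label, or 'unknown function' if nothing is close enough."""
    ref = reference_embs / np.linalg.norm(reference_embs, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = ref @ q                      # cosine similarity to every annotated sequence
    best = int(np.argmax(sims))
    if sims[best] < min_similarity:
        return "unknown function", float(sims[best])
    return labels[best], float(sims[best])

rng = np.random.default_rng(0)
reference_embs = rng.normal(size=(100, 256))        # embeddings of labeled proteins
labels = [f"function_{i % 5}" for i in range(100)]  # their (hypothetical) annotations
query = reference_embs[42] + rng.normal(scale=0.05, size=256)  # a close relative of entry 42
print(propagate_annotation(query, reference_embs, labels))     # -> ('function_2', ~0.99)
```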

Abhi: And instinctively what’s the intuition for why just because two proteins are near each other, it means anything?

Yunha: Yeah. That’s a great question. So this is actually going back to why microbial genomes are cool. Unlike mammalian genomes or like anything that’s eukaryotic, microbial genomes can exchange DNA almost stochastically. That’s just part of its evolution. So things that are really far apart can exchange genomic information, which is not something that humans can do. We cannot exchange DNA with plants, right? So what that means is because there’s all these stochastic processes that’s happening in orders that we can’t even think about because there’s just so many microbes with really short, much shorter lifespan compared to our lifespan. These processes that are happening have been happening for the past billions of years.

So there’s selection pressure that’s keeping these sequences together in a certain order. And this is probably... some of these things are the things that we can rationally understand, as in these three proteins must be kept together because they literally form a complex that if one fell apart by chance, that organism just would not live and therefore would not propagate that particular arrangement of the genome. So certain ways in which genomes are arranged—gene content and genomic organization—all of these things have some sort of meaning. Some of them we can’t understand. Some of them we might be able to understand. So it is just... there’s patterns there. So how do we extract that pattern? And that is all selected upon. Some of them are random, so what we’re assuming is that the language model, by finding these patterns that are really salient, those salient patterns are probably not gonna be random. So then how do you extract noise from signal using language models?

Abhi: Yeah, it makes sense. Yeah. Like, the explanation of why protein-coding genes exist near each other means like some functional... has some functional meaning. Alternatively, I could imagine one explanation being that, oh, the microbial genome is just gonna be filled with a bunch of nonsense stuff. Like there’s one explanation of, yeah, nearness of protein-coding genes mean something because they need to travel together. Alternatively, it could be that even if one of them traveled to another bacterial genome, it’s just not used and it just sticks around there, like taking up space. Is that ever a concern?

Yunha: I think it’s less of a concern. So we talk about this junk DNA; we don’t really know what they do in like human genomes. I think for microbial genomes... there is... so no one really knows what junk DNA does, so that’s a separate conversation. For microbial genomes, if you have a gene that is not being used, there is a cost. So in order to be able to carry this forward, there is energy that’s required. There’s information burden, there’s just mutational burden. It’s just better to get rid of it.

Abhi: Yeah. That does make sense.

Yunha: Yeah. So I think it’s really difficult to conceptualize this because we’re thinking of it as, oh, like there’s gonna be so many random things that happen. But if you look at it from across samples, across history, the patterns that get picked up... there is a reason for that pattern.

[00:17:41] How will OMG evolve?

Abhi: Going back to the OMG data set, ‘cause I realized I have more questions about it. I imagine OMG is not gonna be like the final iteration, like the final metagenomic database. What do you wanna improve about the next version?

Yunha: Yeah, that’s a great question. So metagenomic databases are exponentially growing, so there is the sort of the size consideration. So I think since... I forget when exactly OMG came out. I think it was like a year ago. It basically grew almost like twice. So you can imagine like that being a big piece of what OMG-2 might be.

I think there is also sort of new types of data that’s being generated. So when it comes to things like epigenetics, so like methylation signal... it’s not as prevalently available as the raw sequence data or like the assembled genomic data. But I think that subset of data that has methylation calling done by the sequencing technology itself, I think that’s a really interesting data set to include or to subset. So I think ideally, OMG extends beyond genomic data into transcriptomic data and other types of omics data. So that’s the vision that we have down the line. But that does require many more iterations.

Abhi: Are you not a “DNA is everything you need” maximalist?

Yunha: No.

Abhi: I guess has anyone trained a DNA-plus-epigenetic or some other type of modality model and seeing that there are vast improvements in being able to represent something? I guess like you did that with genome and proteomic stuff. But has anyone else extended beyond that?

Yunha: Yeah, I think there was a new paper that came out recently. For human and mouse genomes where they included a bunch of like functional genomic data. I think it came outta Genentech actually. That was an interesting paper. I think it’s exciting, because you are basically adding genomic data with a bunch of other tracks of information. I think the sort of limitation there is you can’t do that for a vast majority of life branches. So you can’t call it like a foundation model for biology because we simply would not have that data for most branches of life. Like basically everything except like human and mouse and maybe a few things that we can culture.

[00:20:26] Why train on only microbial genomes, as opposed to all genomes?

Abhi: Why—this is almost like a cultural question—why is there the separation of like metagenomics and human genetics? Why isn’t there, like, why isn’t gLM2 trained on all genomes?

Yunha: Yeah. That’s a good question. So, all genomes as in mammalian and... yeah. Okay. So I think there are some practical reasons why we didn’t extend our model to eukaryotic genomes. One reason is like for plants, you can’t even call genes for a vast majority of their genes. So calling genes is not a trivial task for even some microbes actually.

Assuming that a sequence that you currently have in front of you is a protein sequence or protein coding sequence, that is not an assumption we can always make for a lot of genomes. Given our sort of data structure, we couldn’t make that assumption for plant genomes, fungal genomes, or mammalian genomes. There’s that consideration. Also, there is... microbial genomes are really tightly packed. So there’s very few intergenic, or very small intergenic regions that you have to consider. Whereas for eukaryotic genomes, there’s really long intergenic regions. So in order to be able to model multiple proteins at the same time, your context length has to increase significantly. And that was just not a very... it was not a practical thing to do for our model.

And I think in terms of... if you’re thinking about like data, like bang for the buck kind of situation, you’re getting so much more from microbial data, not just because it’s things are more packed, but it’s just way more diverse. So if you were... if you had a pool of data that was organized in terms of diversity and you were picking things out, like vast majority would just be microbial genomes and microbial genes. So why inject human bias and then add a human genome when it’s not really for understanding human genomes in its innate purpose? So that was the reason why we didn’t include mammalian genomes.

[00:22:58] Do we need more sequences or more annotations?

Abhi: Is it fair to say at this point, the thing you need to turn up is quality of the existing data rather than quantity of like more sequences? Or is it like non-obvious?

Yunha: I think it’s very obvious we need more labeled data and I think everyone probably agrees there. The question whether we need more metagenomic data or more unlabeled data... I’m probably... it’s probably nice to have. It can’t hurt. But it’s just a matter of... you have a lot of metagenomic data and then you find patterns that are becoming more and more salient because you have data that’s less sparse and therefore you are recognizing cooler patterns. But there’s no way of understanding what those patterns are if you can’t match it to any labels. So that labeled data is a lot more valuable, in my opinion.

[00:23:54] Is there a conserved microbial genome ‘language’?

Abhi: What do you think is... I guess this is a good question the protein people have also, but I imagine like proteins are a little bit more conserved. There’s 20 possible amino acids. Maybe not. Maybe that’s also a contentious point, but... At like gLM2, how close to like full universal microbial... or how close is it to like fully understanding the universe of microbial genomes? Like if we take gLM2 and we apply it to say the genome of like a hydrothermal vent bacteria... how good is it at representing that particular genome?

Yunha: Yeah. So if it’s in the training dataset...

Abhi: Sure.

Yunha: It will be better at it than when it’s not in the training dataset, as with any language model including protein language models. If you throw in a sequence that is very different from seen sequences, then ESMFold will fail. AlphaFold will fail. Same with representations for gLM2 and so on. So yeah, I think there is value in training this sort of like base layer, going from sequence to some sort of representation or some sort of understanding. Because yeah, if you have a really divergent sequence that’s out of the training set, then it will not generalize to that particular sequence.

Abhi: Yeah, I guess like the dream for the non-MSA protein language models is that it has this like universal understanding of proteins, regardless of like how many MSAs actually exist for the protein. Like, for AlphaFold, as the MSA depth goes down, performance gets worse. Do you see something like that also for gLM2 where like as a sequence gets further and further away from the training data set, it also goes down?

Yunha: Yeah.

Abhi: Do you think you’ll ever escape that? That you’ll ever discover some like universal grammar for microbial genomes? Or it’s just so diverse, it’s like unlikely.

Yunha: I think it’s probably the latter, but maybe there are cool new advances that prove me otherwise.

Abhi: Moving away from the actual dataset and like more closer to the model... what was the context size for gLM2 and why did you pick it?

Yunha: Yeah, I forget the exact context size, but the benchmark was we wanted to include about 10 genes. And the reasoning there was, we’re looking at sort of an average length of operons, or average gene number for operons, and then we wanted to have multiple operons. And so that came down to about nine or 10 genes.

Abhi: Do you see, or do you intuitively expect as a context window expands you see better and better representation performance? Or does it probably max out?

Yunha: Yeah, that’s a good question. We experimented a little bit with varying the context length for the tasks that we benchmarked against. We did not see a significant improvement as we increased the context length. But that’s the benchmarks that we used, which is limited because what we know is limited. So yeah, it’s all against what you’re measuring. So if you’re measuring against something that’s super obvious, then the model is gonna learn something without needing a lot of context. But if you’re measuring for things that require multi-protein context across multiple proteins, across interactions that are really far apart, then maybe it actually benefits from including that context. I think the things that we’re measuring are too shallow. And too... we are trying to understand biology and we’re chipping it away at emergent properties that come from biology and these are really obvious patterns that we’ve observed. So no wonder these obvious patterns are the first ones to be picked up, without necessarily requiring like large context.

[00:28:11] What non-obvious things can this genomic language model tell you?

Abhi: When you say obvious patterns, I’m curious, like what did gLM2 tell you about microbial genomes that was like interesting? Like you mentioned like it was able to pick up intergenic elements and like what each one of those intergenic elements potentially mean. What could it do besides that?

Yunha: Yeah. So one thing that we were able to showcase was protein interaction. So it’s not just about “oh, these genes co-occur.” But these genes actually have co-evolving residues that go across multiple proteins and actually map to the protein interfaces that are known. So if you apply that to things that we don’t know much about, then we can actually resolve new types of PPI interfaces.

Abhi: Can you walk me through like how you extract PPI information from a model like gLM2?

Yunha: Yeah. This actually was in collaboration with Sergey’s Lab. Sergey’s Lab showed that you can use this method called Categorical Jacobian, where you are getting out co-evolving residues within a protein. And you can basically use that in order to identify residues that are close together and therefore co-evolving. And then you’re basically like turning the 3D structure into a 2D space.

But you could technically do the same thing for protein-protein interaction. It’s just two things folding together. But in order to do that, you need to have an understanding of which of these protein variants co-occur in the genome, right? So that if you just have protein A and 50 variants of protein A and protein B and 50 variants of protein B, but you don’t know how these two things are connected, that kind of signal goes away. But then if you know that A-dash and B-dash go together, A-double-dash and B-double-dash go together, then you are able to actually resolve that statistic where things are co-evolving across protein A and protein B.

Abhi: Yeah. When you say like they co-evolve... Does that translate to, like, they’re close in the embedding space?

Yunha: No. So co-evolve literally meaning if one residue changes in A then another residue that it’s in contact with changes too because of the biophysical sort of...

Abhi: Oh, so this is like relying a little bit on gLM2’s ability to generate genomic sequences or...

Yunha: So it doesn’t generate. So yeah. So basically if PLMs, or ESM, essentially learns the compressed MSA [Multiple Sequence Alignment], right? gLM2 also learns compressed MSA, but it’s paired. Which means... if you just had A and B together and then you concatenated them and then ran MSA, you would actually get similar signals. So you’re basically finding that kind of signal because you’re incorporating genomic context into modeling.

Abhi: Sorry, I’m just like mentally trying to walk through, because I think in... I may be incorrect, but like Sergey, the Categorical Jacobian paper was like mutating residues and seeing like what does the model think. But here it seems like it’s something different. Oh, it is the same thing.

Yunha: It is the same thing. Yeah. It’s just... think of it as like a sort of interpretability method.

Abhi: Okay. Okay. That makes sense. Yeah. I would define the Categorical Jacobian thing as like a mech interp, outside of that... was there anything else interesting you could pop out? You also mentioned you were able to derive like RNA-protein interactions. Is it using the exact same method?

Yunha: So we have seen some evidence of RNA... yes, using the same method... RNA-protein interactions. We haven’t been able to validate them. So I can’t speak for it.

Abhi: I was gonna ask like, how are you with identifying genes that’s perhaps a little bit easy... How do you identify purely RNA coding regions?

Yunha: Yeah. So small RNAs and tRNAs have such conserved structure. It’s actually very easy to spot them using GLM’s way of looking at the data. So if we ran like Categorical Jacobian on a stretch of DNA that contains RNA coding region, I guess RNA sequence, then it lights up immediately because of the hairpin structures. That’s really salient in RNA.
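
(Editor’s note: a hedged sketch of the Categorical Jacobian idea discussed in this section: substitute the token at position i with every letter of the alphabet, record how the model’s output logits at every other position change, and treat the magnitude of that coupling as a co-evolution signal, including across protein boundaries when two proteins share one genomic context. The dummy_model below is a random stand-in; a real analysis would query gLM2 or a protein language model.)

```python
import numpy as np

ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")
A = len(ALPHABET)
L = 12
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(L, A, L, A))  # fake pairwise couplings for the toy model

def dummy_model(tokens):
    """Toy stand-in for a language model: returns logits of shape (L, A)."""
    onehot = np.eye(A)[tokens]
    return np.einsum("ia,iajb->jb", onehot, W)

def categorical_jacobian(tokens, model):
    """Coupling score between every pair of positions, via exhaustive token substitution."""
    base = model(tokens)
    J = np.zeros((L, A, L, A))
    for i in range(L):
        for a in range(A):
            mutated = list(tokens)
            mutated[i] = a
            J[i, a] = model(mutated) - base  # change in all output logits
    return np.linalg.norm(J, axis=(1, 3))    # collapse alphabet dims: (L, L) coupling map

tokens = rng.integers(0, A, size=L)
coupling = categorical_jacobian(tokens, dummy_model)
print(coupling.shape)  # (12, 12); large off-diagonal entries suggest co-evolving positions
```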

[00:33:08] Semantic deduplication and evaluation

Abhi: And one thing we have been continuously talking about is like models like these are potentially really useful for annotation of existing metagenomic sequences. And I think there was this really interesting thing you did with the OMG dataset that actually relied on the gLM2 model that it was trained upon called semantic deduplication. Would you be able to like just walk through what you did there?

Yunha: Yeah. So this was to tackle the exact problem where we have arbitrary chunks of DNA. And because of the way... so the classical way of deduplicating would be sequence alignment. So that’s what protein language models do. So you cluster and then you cluster using basically sequence similarity, and then you pick from the cluster. And that’s one way to make sure that you’re not over-representing your training data with one cluster. So you can’t really do that with arbitrary chunks of DNA because... assume that you have a chunk of here and then you have a chunk of that’s like this. It will align here, but it won’t align there. And also, because it’s so long, you can’t align... you can’t cluster DNA so quickly because alignment gets really expensive as you increase the length of the DNA.

So that was like a problem that we needed to solve in order to de-duplicate or de-bias the data set as much as possible. So we were actually looking at computer vision literature, and they have the same sort of problem where there’s just a lot of images and how do you make sure that you don’t have a model that’s only trained on like cats and dog images, because that’s what people like to take photos of? Then you need to like either classify them or... but then if you just classify everything as dogs, then maybe you wanna keep some of the diversity in dogs. So there is... how do we de-bias the data set with as little bias as possible, as little human bias as possible?

So I think one method that people have used in DNA language models is, okay, let’s just like use taxonomy as label, and then we’re just gonna sample one from this genus, one from this genus, which I think is a fair thing to do. But the problem with metagenomics sequences is that you don’t always have taxonomic labels. You’re literally getting sequence from like a pile of dirt.

Abhi: Which may include like brand new genus, is that right?

Yunha: Exactly. And you don’t wanna bias against those either. So we wanted to basically... we trained a small model that was essentially the same thing as gLM2. And then we embedded all of these contexts and then we sampled from those contexts in order to be able to de-bias the model as much as possible.
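
(Editor’s note: a minimal sketch of embedding-based, or “semantic,” deduplication as described here: embed every genomic context, then keep only contexts that are not too similar to one already kept. Random vectors stand in for real embeddings from a small gLM2-like model, and the similarity threshold is an illustrative assumption.)

```python
import numpy as np

def semantic_dedup(embeddings, sim_threshold=0.9):
    """Greedy deduplication: keep an item only if it is dissimilar to everything already kept."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < sim_threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 64))
embs[500:] = embs[:500] + rng.normal(scale=0.01, size=(500, 64))  # make 500 near-duplicates
print(len(semantic_dedup(embs)))  # roughly 500 survive; near-duplicates collapse together
```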

Abhi: How do you judge whether this like works?

Yunha: Yeah, so we had a benchmark. So we basically designed a benchmark that was actually quite a lot of work because if you just rely on existing benchmarks, then the model seems to be doing worse once you make the data set more diverse. And the reason is... the data itself is over-represented with e.g. E. coli because that’s what’s most studied, but also what the benchmarks are based on is also E. coli. So then might as well just train an E. coli model. Why do you even go about training a metagenomic model? So what we did was we actually, before we even trained the model, we actually worked on getting together a really diverse set of embedding benchmarks. And this is really like going... we are, when we are sampling sequences, we’re sampling across the tree of life, not just from E. coli. And that was like a very deliberate thing that we did before we even started training gLM2.

[00:37:33] How does benchmarking work for these types of models?

Abhi: What does benchmarking even look like for a microbial language model? Are you purely measuring yourself by your ability to reconstruct the genome or is there something else?

Yunha: No. So we don’t even actually consider perplexity as like a good metric. So what we did was... we looked at how good the representations were, or as in the embeddings were, for various tasks. So one is a classic task of: does it actually capture phylogenetic relationships between sequences? So there are statistical models that you can use in order to resolve the phylogenetic distances between sequences. And I guess the important thing to do there is to make sure that these sequences are sampled across the tree of life. So we did that and then we basically compare the embedding distances to phylogenetic distances between the sequences. That’s one benchmark. Another benchmark is: can this embedding represent... can this representation space actually compress information such that sequences that are far away in sequence space or structure space, but actually do the same thing in function, bring them closer together? So that’s... you’re using like a metric that is like “nearest thing in space” in order to retrieve. So it’s a retrieval-based benchmark in order to be able to find things that are similar in function that we’ve hand curated across the tree of life, to see if you can do that using embeddings only. So we are benchmarking against ESM and other types of embedding to see how it performs as a retrieval task.
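
(Editor’s note: a hedged sketch of the first benchmark described: compare pairwise distances in embedding space against known phylogenetic distances and report a rank correlation. The distance matrices here are synthetic; the real benchmark uses sequences sampled across the tree of life and a proper phylogenetic model.)

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 50

# Synthetic stand-ins for real data
phylo_dist = rng.uniform(0, 1, size=(n, n))
phylo_dist = (phylo_dist + phylo_dist.T) / 2      # symmetrize
np.fill_diagonal(phylo_dist, 0)

embeddings = rng.normal(size=(n, 128))            # would come from gLM2 / ESM in practice
emb_dist = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)

iu = np.triu_indices(n, k=1)                      # each pair of sequences counted once
rho, _ = spearmanr(phylo_dist[iu], emb_dist[iu])
print(f"Spearman correlation, embedding vs. phylogenetic distances: {rho:.2f}")
```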

Abhi: The thing I would be like very... I think like protein, like, RMSD benchmarks are oh, like fairly trustworthy. ‘Cause you can trust that the x-ray crystallography was like correct. With function annotation, how much can you trust that like these papers that you’re pulling the functional annotations from actually did their job correctly?

Yunha: Yeah. That’s a good question. So we... I mean it’s like really hand curated. We do look at the papers. We make sure that the function that we are looking at is correct. So for instance, like enzyme functions. So people... so I think that was actually one of the benchmarks. So given the sequence, you’re trying to predict the EC number, which is an Enzyme Commission number, which represents what kind of reaction it can catalyze. But the problem is there is positive data... but one enzyme can actually do multiple enzyme reactions depending on the context. So just because it wasn’t documented doesn’t mean it’s not possible, right? So it’s actually very common for a single sequence to be able to confer multiple enzyme activities within a certain hierarchy, that are in the same class, but with different substrates. So it’s... so there are cases where our model actually predicted, “Oh, this sequence is likely to conduct both of these reactions with equal probability or similar weight to each of these reactions.” And there’s only data for one, but not the other. So we cannot really say for sure that this is wrong. There are definitely gaps in the data that we need to be aware of, even when you’re really carefully curating this data set. But that’s also an interesting case to look into because it’s... yeah, it’s spotting things that we didn’t spot before.

[00:41:31] Gaia: A genomic search engine

Abhi: That makes sense. Yeah. And actually OMG and gLM2 are actually some of your earlier work. I think your latest paper is about another genomic language model called Gaia. Could you walk me through what Gaia exactly is?

Yunha: Yeah. So Gaia is actually not a genomic... actually it’s a... I would call it more of a system that’s built on top of gLM2. Gaia is essentially a search engine. So what we wanted to do was demonstrate that gLM2 embeddings can be used to find sequences that are similar in function. And the way we did that was: okay, it needs to definitely find sequences that are similar in sequence, because otherwise... that’s like the least you can do. And then you should find sequences that are also similar in structure. But also you should find sequences that are similar in context. So that’s what we wanted to do. And gLM2 representations were suited for that because it has all that information as part of the training. So Gaia stands for Genomic AI Annotator. And the first thing it does is it retrieves sequences that are similar in gLM2 embedding space. And then the next thing it does is it actually maps that embedding to text descriptor so that we can annotate more rapidly.

Abhi: So you input in a genomic sequence, you find all the nearest proteins via gLM2 embeddings. And then how do you convert that to text? You just like pick the closest protein?

Yunha: Yeah, so we use Swiss-Prot as our sort of golden dataset. That’s probably the best curated data set that we currently have. And so that is pairs of protein sequences to a text descriptor, right? So we train a CLIP model on top of that.
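
(Editor’s note: a minimal sketch of the CLIP-style idea: project sequence embeddings and text-description embeddings into a shared space so that matched (sequence, description) pairs score higher than mismatched ones, shown here as one direction of the contrastive loss on a random batch. Shapes, projections, and the temperature are illustrative assumptions, not the actual Gaia training setup.)

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 8
seq_embs = rng.normal(size=(batch, 256))  # from a genomic/protein LM (e.g. gLM2)
txt_embs = rng.normal(size=(batch, 384))  # from a text encoder over Swiss-Prot descriptions

W_seq = rng.normal(scale=0.05, size=(256, 128))  # learned projections (random placeholders here)
W_txt = rng.normal(scale=0.05, size=(384, 128))

def clip_loss(seq_embs, txt_embs, temperature=0.07):
    """Cross-entropy that pushes each sequence toward its own description (one direction only)."""
    s = seq_embs @ W_seq
    t = txt_embs @ W_txt
    s /= np.linalg.norm(s, axis=1, keepdims=True)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    logits = s @ t.T / temperature                    # (batch, batch) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(batch)
    return -log_probs[idx, idx].mean()                # correct pairs lie on the diagonal

print(f"Contrastive loss on a random batch: {clip_loss(seq_embs, txt_embs):.2f}")
```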

Abhi: Yeah. Okay. And so like... so you’re relying on the full universe of proteins in Swiss-Prot to represent also the full universe of possible functions.

Yunha: Yes.

Abhi: While that very well may be valid... do you suspect that there are possibly like microbial proteins or inter-genomic elements that are not cataloged within Swiss-Prot?

Yunha: Yes, certainly. Vast majority. That’s the whole point. Yeah. And we also choose not to... there is a threshold where we say “no function” or “no known function.”

[00:44:18] Even ‘well-studied’ genomes are mostly unannotated

Abhi: I am not aware of this literature at all. How often is it that people find some weird microbe able to do something that no other microbe can do?

Yunha: Very often. So if you look at a microbial genome, and even for really well-studied microbes such as E. coli and Mycobacterium tuberculosis, you’re finding half to two-thirds of their genes are unannotated; we just don’t know what they do. And that’s not even including things that are just like “this is a membrane protein,” which still doesn’t tell us anything about the function. So there’s that problem. But if you look at a random microbe from soil, 80%, 90%, if not 95% of their genes will have no annotated function using basic like sequence-based methods.

Abhi: ...when a microbe can do something that’s never been observed before.

Yunha: Yeah. So that happens. I would say that’s why environmental microbiology was so interesting. There were literally microbes that were being discovered like left and right that can do crazy chemistry. Like literally live off of... it breathes rock as opposed to oxygen. Or it disproportionates sulfur, like elemental sulfur, into like sulfite and sulfate... that kind of reaction we just don’t even know how to do. What else? Just things that are like living off of uranium and using that energy, or harnessing that energy to live. Microbes that just live for a million years and we don’t know why and how.

Abhi: There seems to be like two elements here. Like one is trying to annotate functional genomic elements that we reasonably understand... like, “what’s this?” like “this exists somewhere else in the microbial kingdom. Maybe this does it in a different way, but like the function is conserved across other domains of life.” And on the other side, which feels like the far more interesting bit, is that there are microbial functionality that exists uniquely within the species and exists nowhere else. How common is that latter bucket? Like you mentioned like uranium eating bacteria, like rock eating bacteria. Is it usually there are very specific species that do this exact thing and nothing else does it?

Yunha: So interestingly, there’s more and more cases of convergent evolution happening where there’s multiple ways of doing the same thing, which is not that surprising if you think about it. So that’s why I think this idea of compression is actually an interesting idea. If there is like more like a sort of layer to biology that we didn’t fully understand... so we know how to look at sequence pretty well now. But if there are patterns underlying those, and then if we can use those patterns to actually match functions, so that we can actually discover new functions that have convergently evolved to do the same thing. That would be really cool.

Abhi: Going back to Gaia now... I imagine you have this setup for turning the pre-trained GLM embedding into like functional annotations for this like dark universe of microbial genomes. Have you done that? Have you gone through every single un-annotated genome, applied Gaia to it, and is that all just stored somewhere?

Yunha: Yeah. So we did that experiment with Mycobacterium tuberculosis, where two-thirds of the genes we don’t know what they do. And we actually developed... so it was like hard to do this manually, because it’s still... you’re still looking at 2000, 3000 genes, and then you’re trying to figure out what it is using Gaia annotation. So we actually built like “Gaia agent,” which would then try to validate what the Gaia annotations are, given the context. So we basically ran the whole pipeline in order to discover new sequence functions in this really well-studied microbe that thousands of labs have studied for tens to hundreds of years. And yeah, we were able to find four proteins that we could actually validate in silico. And I’m like, “Why didn’t we know this before?”

Like one example is... it’s two proteins that each were annotated as uncharacterized protein in literally every single database that we looked at. And then when you search it individually, you don’t get any matches. But then if you fold them together and search, you actually get a match to an Archaea, which is an entirely different domain of life that diverged billions of years ago. And you get very little sequence similarity to the extent that you won’t be able to find it using typical tools. But if you look at the structure, it’s actually almost identical. And that’s like a membrane transport protein complex. And then another one was... that one was really interesting because it was like a very small ORF that was never annotated in Mycobacterium tuberculosis because it was really small, but then it also had two other proteins that transform that tiny little peptide into something that’s antimicrobial. So those are three systems that we weren’t able to identify previously because we were only looking at each one separately instead of looking at the full picture.

[00:50:51] Using agents on Gaia

Abhi: Gaia as a platform... makes a lot of sense to me. Could you walk me through what the Gaia agent exactly does?

Yunha: Yeah. So Gaia agent, what it does is what a really good microbiologist would do in silico, but just automates the whole thing. So Gaia agent looks at the full context, which is what microbiologists would do. So you see a protein and you look at its annotation. You look at all the motifs that this protein has alongside all the motifs for other proteins, and all the DNA sequence motifs. And then you’re like looking for patterns across the tree of life. Oh, these two things co-occur together, or there’s a co-orientation and very small spaces between the genes, which likely means they actually travel together. And then you’re doing reasoning across the functions of... “this reaction happens and this reaction happens. Most likely this gene is probably doing the reaction that goes from this product to this substrate.” So if you have a reaction chain, for instance, then you can actually figure out... So you have product A and then substrate A going all the way to product D, and then there’s steps B and C. And we have reaction enzymes for reactions in the first part and then the last part. But we don’t know what’s doing the middle part. You can make a reasonable guess that the protein that’s found somewhere near those two proteins might be doing that particular reaction. And you can actually use that kind of reasoning to be able to essentially fill the gaps and de-orphan this particular enzyme reaction.

Abhi: So does the reasoning... so like Gaia agent treats like gLM2 as a tool alongside like the rest of the literature?

Yunha: Yeah. And also other tools such as like FoldSeek. So we give it FoldSeek and you give it other types of bioinformatic tools that you know you can access in silico. Ideally you also have access to like automation labs. We’re not quite there yet.
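
(Editor’s note: a hedged sketch of the agent pattern described here: a loop that lets a reasoning model call in-silico tools, such as embedding search and FoldSeek, and accumulate evidence before committing to an annotation. The tool functions and the propose_next_step policy are hypothetical placeholders, not Gaia’s actual interface.)

```python
def embedding_search(sequence):
    """Hypothetical stand-in for a nearest-neighbor search in gLM2 embedding space."""
    return ["hit: putative membrane transporter (similar genomic context)"]

def structure_search(sequence):
    """Hypothetical stand-in for a FoldSeek-style structural search."""
    return ["hit: archaeal transporter subunit, high structural similarity"]

TOOLS = {"embedding_search": embedding_search, "structure_search": structure_search}

def propose_next_step(evidence):
    """Stand-in for the language-model policy: pick the next tool to call, or stop."""
    if not evidence:
        return "embedding_search"
    if len(evidence) == 1:
        return "structure_search"
    return None  # enough evidence gathered

def annotate(sequence):
    evidence = []
    while (tool := propose_next_step(evidence)) is not None:
        evidence.extend(TOOLS[tool](sequence))
    return {"query": sequence[:10] + "...", "evidence": evidence}

print(annotate("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```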

Abhi: Why is like... is it just like too computationally expensive to just let this rip over the entirety of all un-annotated microbial genomes?

Yunha: Yes. It’s not cheap to run this. And we’re looking at a lot of genes. So one thing we’re actually looking into doing right now is we are gonna look at a few hundreds to like few thousands of genomes that are like on the wishlist of all of these biologists. So we’re just gonna run it and then see, and then also share that result so that people can use it.

Abhi: I’m curious... I’m completely unfamiliar with what the typical metagenomic workflow of a biologist looks like. What’s the fundamental difference between just like providing a gene sequence into gLM2, seeing what proteins are nearby in Swiss-Prot and like nearby in the embedding space... picking up the nearest Swiss-Prot protein as “okay, this is what this protein does”... versus using Gaia agent? Why do you need reasoning on top of that?

Yunha: Yeah. So if it’s a sequence that has a good match to a Swiss-Prot sequence, then you know...

Abhi: You go home after that.

Yunha: Yeah, you don’t need to even run Gaia. You can just do this with BLAST. I think the problem is for a vast majority of genes, you don’t even have that match. That’s why when you run a typical genome through a genome annotation tool that relies on BLAST, you will get 80 to 90% of the genes as unannotated or something that’s meaningless. So how do we make that 50% or 40%? And that’s done by compressing that space so that we can make more associations faster.

[00:54:53] Will genomic language models reshape the tree of life?

Abhi: You had this offhand comment about like how you discovered an Archaea-esque protein within this very well-studied microbe that is distinctly not Archaea. And you’ve also mentioned in the past how potentially models like these can dramatically change our understanding of what the Tree of Life or phylogeny in general looks like. I’d love to get just like your take on that subject.

Yunha: Yeah, so I guess on the sort of Tree of Life side... So I don’t think the language models will replace phylogenetic trees. Phylogenetic trees are a lot more complex... I mean this is a whole discipline that’s built on top of like how things mutate, what are sort of models of mutation that we should be using...

Abhi: But still all sequence based, right?

Yunha: It’s all sequence based. Yeah. But there’s just a lot of modeling that’s there. And, yeah, I think you should almost see the phylogenetic trees as almost like ground truth to how things evolve. Just also because these things also take a long time to compute as well. So I think there is a future where we can get like cheap and easy phylogenetic trees using language models and embedding spaces, and that would be like an easy way to get a quick look at how things are related. But in the end, phylogenetic analysis has its own space in science literature and science analysis.

I think what’s changing though is as new sequences come about, and as we sample more, the tree is shifting. Because you are only constructing trees based off of what we can sample right now, right? But if you add new branches, the branch structure changes. So for instance, like an example is... we don’t know if eukaryotes... the traditional way of thinking about the Tree of Life is that there’s bacteria, there’s Archaea, and then there’s like a special branch of eukaryotes. What we were actually realizing is that actually the Eukarya are just like a single branch from Archaea. And that’s like a fundamental change in how we think about the Tree of Life. And that only happened because we actually sampled this hydrothermal vent that contained this Archaea that was closer to eukaryotes, but also still part of the Archaeal tree. So now humans and eukaryotes, the entire branch of eukaryotes, belong to Archaea technically.

Abhi: That sounds like a dramatic reshaping of how we think about... so in that sense, why don’t you think the same thing will happen if you bring in genomic language models? Like why won’t it dramatically change that tree of life in a similar way to that Archaea discovery?

Yunha: Yeah. Because for that discovery, the amount of information that both models, whether it’s a language model or a phylogenetic model, have access to is the same.

Abhi: So sequence alone gets you like 80% of the way there and like whatever genomic language models bring to the table... it’s probably not like a massive amount...

Yunha: Yeah. I don’t think it’s gonna shift the shape of how things evolved. And we also don’t have a way to validate any of that.

Abhi: Interesting. Do you think you’ll ever want to do phylogenetic research?

Yunha: So I did some of that when I was more in the environmental microbiology research. I think it’s really fascinating, the kind of work that you can do in retracing what happened across the tree of life and the history of Earth. I think that’s really cool. I do also find it a little bit frustrating that you can’t be entirely sure, because you can’t go back in time. But it’s... I think there’s really cool science that comes out of doing phylogenetics.

[00:59:18] Current limitations of genomic language models

Abhi: It’s interesting ‘cause I think also like Sergey [Ovchinnikov] has an evolutionary biology background. It’s interesting how these paths are converging a little bit. One thing I did wanna ask is we’ve talked a lot about the extreme promise that all of these models have. One thing I’m wondering about is where do they currently fall apart? What particular, like, species, genomes, or problems do these models not currently work well on today?

Yunha: Generally they don’t do well when the training... when it’s on a problem where, or on a genome where it’s not well sampled in the training set. So that’s... I think everyone knows that now. There’s no surprise there.

I think in genomic language modeling, DNA language modeling, what we wanna do with these genomic language models is still not clear. And I think that’s largely because we don’t have a lot of paired data. So when we think about protein language models, it’s pretty clear how you can assess the quality of the protein language models because you’re trying to go... there’s paired data of structure, right? So you have a lot of protein sequences and there is a really good set of structures from very different systems and so on. So you can actually benchmark against structure. But for genomic language models, I would argue we don’t have that data to benchmark against. And I think everyone likes to talk about function, but I think that data set is still very much limited and extremely biased. And it doesn’t really... it doesn’t like do the justice of showcasing that GLMs are learning functional information. It’s just impossible to utilize this model because there’s nothing to pair it to. So like for protein language models, you can use it to design a new structure or new sequence. But for genomic language models, because we don’t have this other modality to condition it on, we don’t know how to use it yet.

Abhi: Do you think we’ll ever get to the world of like a single “model to rule them all”? Like maybe gLM2 also spits out protein structure and like maybe that’s an area you can like check. Does that make sense? Like you have these auxiliary outputs that help you ground... help you understand what the model is able to understand versus where it’s like a little bit up to vibes and like you’re unsure as to whether it’s understanding it.

Yunha: Yeah. I think that’s how we’ve been benchmarking a lot of these models, right? Like Evo and gLM2... we can make gLM2 generative as well, and then we basically generate a protein and see how good the protein is. And then we benchmark against the protein language models. We can do all of these things, but what’s the point? Like you can just have also a protein language model. So I think we’re still figuring out like... what is the problem that we’re trying to solve with genomic language models? For us, we’ve been focused on like annotation. How do we make annotations better? How do we make representations better? But one thing that we’ve realized is, yes, we can make representations really good, but we still need a better golden data set in order to make a bigger dent in how we are understanding genomes. So it’s like a... you need to attack it from both angles, like more labeled data, better models and keep going in both directions. So that’s one sort of area that people can work on. I think there’s also like genome design; that’s another. I think the same problem comes into play. Like what is a “better” genome? For proteins, I think you can... there’s an axis that you can optimize on. I don’t know, like binding affinity or something. Thermostability. Like things like that. For genome, I think that’s a lot more... I think there are ways to fine-tune it to do one thing. But there’s no general sort of axes that you can like optimize generations for.

Abhi: I know that this is something you’ve mentioned in the past about how like microbes are often capable of chemistry that is either almost impossible for us to do, or straight up just impossible for us to do. Is it not a clear benchmark, just being able to generate a microbial genome, which like innately allows you to sustainably produce something that we otherwise cannot do outside of that microbe? Do you think like we are close to that at all? Like for gLM2, how good is it at generating microbial genomes outright?

Yunha: So in order to do what you said just right before—which is, wouldn’t it be the benchmark to be able to show like, “oh, this generation can do something that nature cannot do, or something that we wanted it to do, that doesn’t already exist”—then you need to be able to condition.

Abhi: It needs to be in your train set.

Yunha: Yeah, but what I’m trying to say is that conditioning signal or conditioning dataset doesn’t quite exist at its full scale to be able to do that.

Abhi: Let’s say that you just wanna replicate something. Like there’s this one microbe that feeds off of uranium. You wanna be able to create a microbe that is very much like it, but perhaps is as easy to grow as E. coli or something. How well can you do that today?

Yunha: Yeah. That’s a great question. I think that still comes back to the annotation problem. Given your microbe that can feed off of uranium, we don’t know which parts are important and which parts are not important.

Abhi: Yeah. I guess this is why you’d potentially want to max out the context length of a model like this. You can just feed the whole thing in, get the model to spit out an entire genome, and then you don’t need to know what is important and what isn’t. Is that a fair way to think about it?

Yunha: Yeah. So then... what would the training objective look like? You would have genomes that can do some chemistry X. Then you need to generate a sequence given that chemistry X, and you also need to make it like E. coli.

Abhi: Yeah. I think that second part’s a bit difficult.

Yunha: Because otherwise, if you just say, okay, we already know this genome Y can do chemistry X, and you tell the model to build a genome that does chemistry X, it will just output something similar to genome Y, and you could say, “Oh, that works.” Maybe you get really lucky and it’s a few synonymous mutations away, such that it doesn’t actually change the biology at all. But then all you’ve done is learn synonymous mutations.

Abhi: One thing I was surprised by in the Evo 2 paper, and perhaps with all genomic language models, is that it is difficult... there’s currently no way to condition it on anything other than sequence. Why hasn’t someone built a model that can be conditioned on function?

Yunha: Yeah. Because there are no good paired datasets.

Abhi: But there’s some. You’re just saying like there’s not enough?

Yunha: Yeah. There’s not enough. And also, paired datasets exist for proteins, but not really for genomes or segments of genomes, right?

Abhi: Especially for segments of genomes. But if you have a model ingest the entire genome, maybe the functional annotation could just be like: “Eats this, grows this amount.”

Yunha: Yeah, I think... so if somebody curated that dataset, and it’s accurate, which I think is a big if, then I think it’s possible. You can basically build a database of natural language descriptions of genomes. But that also relies on us understanding the genome, right? So okay, you have a genome and you’re like, okay, there’s a cellulose degradation pathway, there’s a carbon fixation pathway, so you already know this organism is gonna grow like this. Then, in order to condition a generation on that function, the only vocabulary you can use is the vocabulary you’ve used to annotate that genome. So you’re completely limited by the capacity to annotate that genome, which comes back to the annotation problem.
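
To make the kind of paired, conditioning dataset described above a bit more concrete: below is a minimal, hypothetical sketch of what a single genome-to-function record might look like. The field names and values are illustrative assumptions, not an existing dataset schema.

```python
# Minimal sketch of one record in a hypothetical genome-to-function paired dataset,
# i.e. the kind of conditioning data discussed above. Field names are illustrative only.

record = {
    "genome_id": "soil_metagenome_bin_042",   # hypothetical identifier
    "sequence": "ATGGCTAAA...",               # contig or genome segment (truncated here)
    "function_description": (
        "Contains a cellulose degradation pathway and a carbon fixation pathway; "
        "predicted to grow on plant-derived polysaccharides."
    ),
}

# A conditioned generative gLM would train on (function_description -> sequence) pairs,
# so at inference time you could prompt with a description and sample a genome.
# As noted above, the usable vocabulary of descriptions is limited by how well we can
# annotate genomes in the first place.
```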

[01:08:54] Directed evolution as training data

Abhi: Have you heard of Pioneer Labs? This idea of forcing microbes to evolve down a certain path, and then observing what the genome looks like after that. Do you think that’s a particularly interesting way to gather data, and maybe something more people should be doing?

Yunha: Yeah, I think... so like more on the directed evolution side?

Abhi: Maybe I’ll give a quick description of what Pioneer Labs is. It’s a company that basically wants to create microbes that are able to survive in, I think, Mars-like environments, which is basically just extreme environments in general.

Yunha: Yeah. I think it’s really interesting because it gives another dimension to the data that we didn’t have readily available. It’s the same thing as learning how to drive a car: it’s much better to see how the car drives than to see the final state of where the car ends up. Though I think you could potentially learn how a car drives by seeing a lot of photos of cars in different contexts.

So that’s what we’re doing now. But if you had more trajectories and learned from trajectories, I think there is a path forward to learning something that’s more meaningful, something that can be modeled better. So I think there’s a lot of potential there. One caveat is that you can’t do this kind of directed evolution for all types of functions, nor for all types of organisms. But I think that’s fine. It depends on the question. If your application is in an organism that can be cultivated and for a function that can be optimized for, then it’s the right approach. You just can’t apply it to Archaea that you can’t grow.

Abhi: Makes sense. I think you’ve focused on... maybe three different axes of this genomic language modeling problem. One, the data’s not fantastic and we need better data. The second is the modalities, like we need more modalities of microbial genomic data. And the third is the models themselves, where Gaia agent is maybe an improvement over gLM2 alone. Which of these three are you most interested in personally pushing forward?

Yunha: Sorry. The three were... one, what was... yeah, sorry.

Abhi: The first one is the total quantity of labeled genomic data.

Yunha: Oh, quantity of labeled genomic data. Yeah.

Abhi: Or potentially unlabeled as well.

Yunha: Oh, yeah.

Abhi: The second one is like modalities beyond genomics. Third is like the model itself and pushing on that direction.

Yunha: I think they’re all tied together. Because the labeled data is... you’re labeling, and therefore you’re adding another modality to your dataset.

Abhi: That’s fair. Yeah.

So yeah. One and two are the same.

Yunha: Yeah. Yeah. So for me, I guess adding new data modalities to genomic data is the most exciting path forward, because then you can start actually conditioning things on function. You can actually imagine being able to do things that we can’t do with the toolkits and the knowledge that we currently have. I think that’s just the most exciting path forward.

[01:12:35] What is Tatta Bio?

Abhi: Yeah. That makes sense. And so yeah, we’ve talked about OMG, gLM2, Gaia and also Gaia agent. Many of these things were spawned from Tatta Bio, which you’re one of the co-founders of. It’s a scientific nonprofit dedicated to developing like tools for genomic intelligence. Why is it a nonprofit?

Yunha: Yeah. Tatta Bio is a nonprofit because we’re trying to tackle a problem that may be too big to tackle for an academic lab in an academic setting. It’s also very interdisciplinary: it requires a lot of software talent and machine learning talent, which there’s plenty of in academia, but it’s difficult to organize that kind of team in an academic setting. And there’s no immediate incentive for market forces to solve this problem either. Take the annotation problem, for instance. It’s clearly a really important problem, because it limits what we can study and what we can understand, and it’s obviously going to underpin new research directions with unknowable value. But neither the market nor academia is tackling this at the scale we want to tackle it at. So that’s the reason why we are a nonprofit.

Abhi: And what is the actual... like I mentioned like Tatta Bio is developing “genomic intelligence.” I think that’s straight up like on the website. What is the... what do you consider the purpose of Tatta Bio to be in terms of what is it delivering to people?

Yunha: What it’s delivering to people right now is helping them better understand their genomic sequences. I think it’s clear that genomic sequences cannot be understood by humans alone. So human-machine collaboration has always been the way we understand genomic sequences. How do we make that better? How do we augment that? That’s the big mission we have, and that’s what we mean by genomic intelligence. Being able to truly, truly understand genomes, but not necessarily in the rational sense of “this part does this, and this evolved because of that.” It’s really being able to harness the genomic information that’s currently available and engineer it and modify it in ways that make sense for applications. So that’s what we are currently doing. Within that there’s tool building, there’s infrastructure building, there’s community building. All of those things are part of our mission.

Abhi: Actually, one question I’ve wanted to ask for a while: why is it called Tatta Bio? When I’ve brought up the company to other people, they thought, “oh, is it tied to that one Indian consultancy company?” [Tata Group]. Why that name?

Yunha: Yeah. It’s a reference to the “TATA box.” The TATA box is literally a sequence motif in DNA that’s rich in TA, or T-A-T-T-A in this case, and it signals the start of a gene, a reading frame.

Abhi: Yeah, that was a good name.

Yunha: I don’t think everyone got that memo.

Abhi: What would you... what would make you think that like we’ve succeeded at Tatta?

Yunha: Yeah. For us... say, for instance, if we could double the number of sequences that can be annotated. I think that would be a success.

Abhi: To some degree it feels like with Gaia agent you can do that today; you’re almost just compute limited. Is that fair to say? What else really needs to be done?

Yunha: Yeah. I think there are just real dark patches of the sequence space that we haven’t fully explored. Imagine it was literally a map and there are completely dark patches, and we can figure out a way to generate hypotheses for any of those sequences. That’s gonna make a big impact, because we’ve already built a very good way to compress that information so that we can propagate it really quickly. So there are definitely areas we should really be studying, because it’s gonna make a big impact in how we understand sequences. That’s how I see the next step: how do we identify those areas that are really poorly characterized but have high impact potential, and go about experimentally validating some of these sequences and functions?

Abhi: So is it... I guess I keep returning to this question. The reason you don’t wanna let Gaia agent just run over the entirety of un-annotated sequences is that you’re unsure about the validity of any one of those predictions, and there’s more work to do to figure out where Gaia agent is reliable and where it’s not? Or is there something else?

Yunha: So well, I guess like you can always generate hypotheses. But the question is how many of these can we actually validate? And how many of these is it worth validating given the sort of resource limitations that we currently have?

Abhi: Like I imagine one thing you could do is like let it run across all microbial genomes and then just give that information to the community. And see what they’re able to come up with.

Yunha: Yeah. Yeah. So we are basically trying to do that. But we can’t do it across the entire trillions of genes. So we’re making... we’re trying to make a good selection of either genomes or genes that are like on the wishlist of people and scientists.

[01:19:02] Building Google for genomic sequences (SeqHub)

Abhi: Do you imagine... FROs [Focused Research Organizations] have a specific length of time they exist, after which they either become for-profit or just wind down entirely because they’ve fulfilled their mission. What do you think the future of Tatta is? Is there eventually a for-profit, or at the end does it just wind down because you’ve annotated the sequences and you’re done?

Yunha: If we could figure out a way to annotate every single sequence, which I think is very ambitious and probably not possible in the next X years, then that should be our goal. We take the stance that this is going to be an evolving database of sequence to function, and the question is how we best optimize this database so that things don’t get lost and are propagated well across scientific literature and scientific discourse.

One of the latest projects that we’ve been working on is called SeqHub. It’s literally like GitHub for sequences, or Google for sequences. In an ideal world, you can type in a sequence and get all the information: not just the annotation, but what papers refer to it, who the best people to ask about it are, what kind of discussions have been had about this particular sequence, and obviously what other sequences share that provenance and what kind of genomic context it’s found in. With Gaia, we tackle the genomic context problem. With SeqHub, we’re tackling the other infrastructure problems, because way too often people make discoveries, but that information cannot be propagated readily: it doesn’t fit into some database that was built ten years ago, and so it never reaches what people actually use all the time.

So how do we build a more real-time understanding of sequences? That’s a big part of our mission: how do we build better software infrastructure for sequence understanding and data sharing? And as part of that mission, if we want to fully fulfill it, and we assume it’s gonna take a long time, we actually want to maintain this infrastructure for as long as we can. So what we still need to figure out is how to build sustainability into our operation and business model. Our goal is to remain fully nonprofit, and still build in ways to generate enough revenue that we can maintain this scientific software and infrastructure, which, by the way, has been very difficult to maintain in the current funding environment. Traditionally I think it was funded by the government, but that also means certain types of innovation are difficult. You can’t build a fast-paced team in a lab whose funding isn’t enough to do this kind of work. So we’re also thinking really creatively about how to maintain scientific infrastructure and software, because so often good software gets made but not maintained, or good ideas turn into okay software but never get scaled up and deployed as production-level software. So this is another aspect of the work we’re currently doing.

Abhi: I’m not sure if you’re able to talk about this, but... PyMOL was a really great piece of software, and Schrödinger acquired it. They have a paid version that you have to pay Schrödinger to use, but they also have this very nice open source version [Open-Source PyMOL]. Could you imagine Tatta Bio going down that route, where you’re acquired by some existing company, like Basecamp or someone who really cares about the information that Tatta is gathering, and they allow this shaved-off open source version?

Yunha: Yeah. I don’t know. We haven’t fully thought about that. Right now we’re more focused on how we become entrenched in the scientific ecosystem. And I think a key difference here is that it’s not just software. If it’s software, then you can just copy it, improve it, and share it. But if it’s infrastructure that needs the community to deposit data and share data, then as soon as you close-source any part of it, the value of that infrastructure goes away. The only big parallel I can think of is the PDB. Or you could argue the same thing about Google, or about the internet being free to deposit things on: you couldn’t have built LLMs if you didn’t have that open, free internet. Same thing with AlphaFold and the PDB. So yeah...

Abhi: Like all of it needs to be open sourced for the network effects to actually start...

Yunha: Yeah.

That’s how I think about it. That’s why I think it’s really important for us to stay open and stay like free for the vast majority of the functionality.

Abhi: Have you seen the xkcd comic? The one where you identify some universal problem everyone has, say “I’m gonna build a solution to it”... and now you’ve just added another universal standard to the 13 others that existed prior. What other quote-unquote universal standards are there besides SeqHub, and where do you think they fall short?

Yunha: So in the space of like sequences, I think UniProt is a great example. It’s what people go to when you have a protein sequence.

Abhi: Sorry, specifically for genomes.

Yunha: Oh, genomes. Oh, like specifically like a... Oh, I see.

Abhi: What network territory is SeqHub encroaching on? Are there any... or is SeqHub unique, and there’s no other platform for something like this?

Yunha: The only other genome-centric platform that’s widely used is NCBI.

Abhi: And that’s not... there’s not really network effects there.

Yunha: No. Yeah.

[01:25:46] How to create communities around scientific OSS

Abhi: Okay. That makes sense. Okay then yeah, it seems like ripe territory to capitalize on. How do you... how have you typically found the process of gathering a community around a brand new piece of open source software? I imagine it’s like a relatively new experience for you.

Yunha: Yes. Yes. Certainly.

Abhi: How has that been?

Yunha: Oh, very interesting. A lot of learning on our side. It’s different in that it’s self-serve software, and it’s also B2C in some ways.

But it’s a very small community of people. We’re not tackling the general public here. We’re also currently really focused on microbiology community. And hopefully we can expand out to other communities like in plants and fungi and so on. So that’s our sort of roadmap.

Yeah, but we need to get into the heads of scientists and think about: why do we do what we do? Why do we want to contribute? How do we contribute? Where do I spend most of my time? What are the biggest pain points we have? All of these are things we need to think about when we design the software and the platform. And building good software is one thing, but building a community is an entirely new thing that we’re literally figuring out as we speak.

Abhi: Especially if, like you mentioned, the community is so small. I can’t imagine the people who would actively be power users of the software number more than a few thousand worldwide. How do you get in touch with all of those people and tell them, “oh, you should be using this thing that we built”? How do you convince them that this is worth their time?

Yunha: Yeah. For us... I think there have been a lot of attempts at encouraging people to deposit data better, add more data, metadata, blah blah blah. One thing is that we need to make it really easy. Depositing data should be super easy.

And we shouldn’t require them to do a bunch of things; that’s just a basic thing we can build in. Another is that we need to give them what they really want the most, and for us that’s better annotations. When I was a student, the most frustrating thing was when you have sequences that you’ve waited so long to get into your hands, and you look at them and so much of it is just hypothetical, and you’re banging your head against the wall trying to understand what these sequences do. That is the biggest motivator. If we can give them better annotations, if we can give them more insight into what they’re looking at, that’s what’s gonna bring them here. And those are gonna be the people who are most incentivized to contribute, because it will come back to benefit them and the community. So that’s our hypothesis. We’ll see how that goes.

[01:29:06] What’s the purpose in the centralization of the software?

Abhi: That’s fun. Like you have this platform which is really hard to populate to start off with, but the draw... like the reason you’d want to interact with that at all is because you get access to Gaia, basically. As like a way to help you interpret what’s going on.

Why... this is maybe something I should have asked before. Why even care about having something like SeqHub? Maybe you want more people to use Gaia, but alternatively Gaia could just be a standalone GitHub thing. Why do you want a central place to deposit sequences?

Yunha: Yeah. Yeah. That’s a great question, because we’re trying to expand this labeled dataset, this gold-standard dataset, which is currently Swiss-Prot... we think there’s actually quite a lot of information out there that’s outside of Swiss-Prot. Swiss-Prot is human curated, by the way, which is incredible. There are curators whose full-time job is to look at papers and validate, “oh, this is a new sequence, we should add this to Swiss-Prot.” I think there’s just a lot of knowledge hiding in labs, hiding in people’s brains, hiding in papers and supplements, that could be organized a lot better so that we can improve sequence annotation without even having to do any experiments. If we organize ourselves properly, with infrastructure that is up to date and with the right incentive schemes, then we might be able to double the number of sequences we can annotate without having to run any experimental workflows. That’s what we’re trying to build right now.

Abhi: Yeah, you said Swiss-Prot is human annotated, which makes sense for why it’s so low throughput. I’m curious how much knowledge, realistically, is hiding in the heads of people at these microbial genomics labs who simply don’t have the results necessary to write a paper about it and get it deposited somewhere. When you talk to these people, is it usually that they have tons of things in their head that they’ve been thinking about for decades, but they just don’t care enough to write a paper about?

Yunha: Yeah, I think that definitely exists. And I think this is also a byproduct of the publication system. If it’s not a big story, then where do you share this information? When it’s not gonna be really cited, and when things are not gonna be discoverable, there’s no incentive to write a single paper just to say, “this is something.” You might be able to say, “oh, we have experimental results,” but it’s just not gonna be a very highly cited paper. So what typically happens is it becomes a tiny little section in a large paper. You write a whole paper and then there’s a tiny thing: “oh, we think this is this, or we have high confidence this is this, based on this tiny supplemental figure that no one looks at.” And that never gets propagated to a central database.

Abhi: Is it that the Swiss-Prot annotators just have so many other things they want to get to?

Yunha: Yeah. So there’s that. And then there’s just internal knowledge. People do experiments all the time; we do a lot more experiments than what gets published in the paper. So I think both of those are at play, in terms of what counts as a publishable unit and how we can make knowledge transfer more efficient across people. Imagine if you had to write a publication for every single bug fix in software. That just doesn’t make sense.

Abhi: And so SeqHub, I think you officially released it a month ago, am I correct? So a month has passed. What’s next on the roadmap? What have the use cases been so far?

Yunha: Yeah. So what we... so we launched SeqHub about a month ago. And a key sort of difference between SeqHub and Gaia is that SeqHub can do like whole genome annotation. And it’s also a place where you can deposit data.

Abhi: Sorry, how does it do whole genome annotation? Just split it up into...

Yunha: Yeah. So basically... Gaia is a sequence search, a protein search, but we’ve extended it across the full genome. So if you put in multiple sequences, which is a genome, then it does automated annotation.

Abhi: Gotcha. Okay.

Yunha: So now you can automatically create collections or datasets, right? You have a dataset for each genome, and we’ve integrated Gaia agent into a SeqHub agent that can do multi-gene reasoning in a genome, native to your particular data. So, say I have a genome that I’ve sequenced from soil, and I have high conviction that this genome can produce a molecule or degrade a molecule. I can ask the SeqHub agent: “go through the 5,000 genes I’ve sequenced here, in the particular order they’re found in... use all the tools that you have and find me the set of genes involved in degradation of this particular compound, or synthesis of this particular compound.”

Or “this thing is found in this kind of environment.” So you can do reasoning that’s a lot more complex than “what does this protein do?” That’s something we’ve implemented for SeqHub. All of that is aligned with our mission, in that we wanna help people understand their sequences better, but it’s also to make sure we can bring in this community of people who really care about their sequences and want to share their knowledge. So the next step for us is to build this community of scientists who will generate paired information: sequences paired with human understanding, experimental data, or sample data. We’re just trying to get as much information as possible, publicly, linking sequences to labels that matter in science.
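
To make the whole-genome annotation flow above a bit more concrete: below is a minimal, hypothetical sketch of what “run a per-protein search over every gene in an uploaded genome and collect the hits into a dataset” could look like in code. The function names (`search_similar_proteins`, `annotate_genome`) are placeholders for illustration only; they are not Gaia’s or SeqHub’s actual API.

```python
# Minimal sketch of the whole-genome annotation flow described above.
# `search_similar_proteins` is a hypothetical stand-in for a Gaia-style per-protein
# search; it is NOT SeqHub's or Gaia's actual API.

from dataclasses import dataclass, field


@dataclass
class GeneAnnotation:
    gene_id: str
    protein_seq: str
    hits: list = field(default_factory=list)  # e.g. [(hit_id, similarity, description), ...]


def search_similar_proteins(protein_seq: str) -> list:
    """Hypothetical per-protein search: return labeled neighbors of a query protein."""
    raise NotImplementedError("stand-in for a real embedding or homology search backend")


def annotate_genome(genes: list[tuple[str, str]]) -> list[GeneAnnotation]:
    """Run the per-protein search over every predicted gene in a genome.

    `genes` is a list of (gene_id, translated_protein_sequence) pairs in genomic order,
    so downstream multi-gene reasoning (e.g. "find the degradation pathway") can use
    the neighborhood context as well as the per-gene hits.
    """
    annotations = []
    for gene_id, protein_seq in genes:
        hits = search_similar_proteins(protein_seq)
        annotations.append(GeneAnnotation(gene_id, protein_seq, hits))
    return annotations
```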

[01:35:37] How will the way science is done change in 10 years?

Abhi: When we last spoke, you mentioned that you think the way that science gets done will look very different in 10 years. What do you think changes?

Yunha: So one idea that I have... I don’t know, like this is changing all the time... but I think there’s been a lot of focus on scientific narrative. So, how you tell the scientific story is really important in science, in the scientific enterprise. So even when it’s like a small finding, you write a whole like narrative...

Abhi: Amplify it.

Yunha: I think... you contextualize it so that it’s impactful, and that’s really important. You might find “this protein does something,” and alone that’s just “okay, sure.” But “this thing does something, which means this can do something else, which means we can use it to fix this particular problem.” That’s the contextualization of scientific discovery, and that narrative has been really important. And I think almost overemphasized, maybe to the extent that it’s overdone.

And I think in the future, as machines are more involved in scientific discovery, perhaps data is gonna be a lot more important. Currently the narrative is more important than the data: data is just a zip file, and then people read the narrative and AI agents read the narrative, right? That’s become a really important part of science. But as we do more science with the data itself, not with the narrative linking, I think the datasets are gonna be a lot more important. Maybe in the future we’re just gonna be depositing data and calling that a scientific product, which is not something that’s done today. And the innovation is in how you generated that data, how meticulous you are, how innovative you are. I don’t think the human role is gone; it’s just that the data generation is done in a way that’s so sophisticated that it has a big impact on the conclusions we can draw from that data, conclusions that are scientifically salient.

Abhi: Do you think we’re currently poking at that with the release of Future House’s Kosmos? Like the existing AI co-scientist stuff, where you just plug in your data? Have you used those? How much do you trust them today?

Yunha: Yeah. So I think it goes back to the same question of human language and narrative, and how much emphasis we wanna put there. I agree that language is a very important medium through which we understand things and link concepts. But overemphasis on narrative, and using only natural language agents... I’m not saying the current agents are like this, but the worst case scenario is AI agents that only read and don’t do any data analysis. I’m sure they would still find something new, right? An agent just reads a lot of papers, and then you chat with it and you’re like, “oh, what does this protein do?” and it says, “it probably does this.”

I think in an ideal world, there’s more emphasis on the data part and the understanding of the data without the sort of biases of language. Whereas the language is how it communicates with humans. So I think we’re not quite there yet in terms of how do we build like scientific systems.

I’m not even gonna call them agents because I think that places too much emphasis on the narrative. But how do we build systems that can conduct science and scientific inquiry that can go beyond like human narrative and human understanding. So that’s... yeah, I don’t know. I still think about it a lot.

Abhi: In some sense, I almost imagine the natural language agents, and perhaps also Gaia or Gaia agent, are somewhat poisoned by the fact that they have read narratives and have hyper-focused on certain things that are perhaps not actually that useful or interesting. When you look at Gaia agent’s reasoning traces, how much do you see this, that it’s focusing on what you personally would not have focused on?

Yunha: I see. Okay. Yeah. And sometimes that’s a good thing, sometimes it’s not. I’ve seen cases where Gaia agent just doesn’t focus on what it’s supposed to focus on, and there’s no reason for it; it’s just doing what it wants to do. And I don’t know if that’s something that can be solved with better prompt engineering, giving it more tools, figuring out how to rescue it when it goes down a path that is just too obvious. Like, how do you make it more rebellious against the existing knowledge? I don’t know, because it’s so reliant on what it knows. I’m sure there’s a lot of agent research on how to make agents more creative. So there’s definitely work that can be done.

Abhi: Have you seen that one like Andrej Karpathy tweet about him really desiring some LLM that knows nothing about the world, but is like maximally intelligent and is able to go out and gather information as it needs?

Yunha: Yeah.

Abhi: And I heard that GPT-OSS was actually like this; it had incredibly low benchmarks on general world knowledge, but it was really good at math, and really good at the coding benchmarks and the software engineering stuff. I’m curious, have you tried GPT-OSS in Gaia agent?

Yunha: Okay. I have not.

That would be pretty interesting.

Yeah.

Abhi: Cool. I think that’s all the questions I have. Thank you so much for coming on.

Yunha: Cool. Thank you.
