Owl Posting: Startups

The ML drug discovery startup trying really, really hard to not cheat (Leash Bio)

Abhishaike Mahajan — Tue, 23 Dec 2025 13:05:02 GMT

Note: I’ll be in Austin until Jan 3rd, and in San Francisco (for JPM) from Jan 3rd-17th; message me on X/email to hang out! Also, thank you to Ian Quigley and Andrew Blevins, the two co-founders of Leash Bio, for answering the many questions that arose while writing this essay.

Introduction

What I will describe below is a rough first approximation of what it is like to work in the field of machine-learning-assisted small-molecule design.

Imagine that you are tasked with solving the following machine-learning problem:

There are 116 billion balls of varying colors, textures, shapes, and sizes in front of you. Your job is to predict which balls will stick to a velcro strip. To help start you off, you’re given a training set of 10 million balls that have already been tested: which ones stuck and which ones didn’t. Your job is to predict the rest. You give it your best shot, train a very large transformer on 80% of the (X, Y) labels, and discover that you’ve achieved an AUC of .76 on a held-out 20% set of validation balls. Not too shabby, especially given that you only had access to .008% of the total space of all balls. But, since you’re a good hypothetical scientist, you look more into what balls you did well on, and which balls you did not do well on. You do not immediately find any surprises; there is mostly uniform error across color, textures, shapes, and sizes, which are all the axes of variation you’d expect to exist in the dataset. But perhaps you’re a really good hypothetical scientist, and you decide that to be certain of the accuracy here, you’ll need to fly in the top ball-velcro researcher in the world to get their take on it. You do so. They arrive, take one look at your results, and burst out in laughter. ‘What’, you stutter, ‘what’s so funny?’ In between tears and convulsions, the researcher manages to blurt out, ‘You fool! You absolute idiot! Nearly all the balls in both your training set and test set were manufactured between 1987 and 2004, using a process that was phased out after the Guangzhou Polymer Standardization Accords of 2005! Your ball-velcro model is not a ball-velcro model at all, but rather a highly sophisticated detector of Guangzhou Polymer Standardization Accords compliance!’ The researcher collapses into a chair, still wheezing.

Actually, this hypothetical situation is easier than the real one, since there are several orders of magnitude more small-molecules in existence than the 116 billion balls, and there are also a few tens-of-thousands of possible velcro strips (binding proteins) in existence too, each with their own unique preferences.

Given the situation here, there is a fair bit of cheating that goes on in this field. Most of it is accidental and maybe even unavoidable, and truthfully, it is difficult to not feel at least some sympathy for the researchers here. There is something almost cosmically unfair about trying to solve a problem where the axes of variation you don’t know about vastly outnumber the axes you do, making it so the space of possible ways you could be wrong is practically infinite. Can we fault these people for pretending that their equivalence to the compliance-detection-machine is actually useful for something?

Well, yes, but we should also understand that the incentives aren’t exactly set up for being careful, thinking really hard, and trying to ensure that the model did the Correct Thing. This is true even in the private sector, where the timelines for end utility of these models are far off in the horizon, where the feedback loops are so long that by the time anyone discovers your model was secretly a Guangzhou Accords detector, there are no meaningful consequences for anybody involved.

This is why I think it is important to shine a spotlight on people trying to, despite the situation, do the right thing.

And this essay is my attempt to highlight one such party: Leash Bio.

Leash Bio is a Utah-based, ~9-person startup founded in 2021 by two ex-Recursion Pharmaceuticals folks: Ian Quigley and Andrew Blevins. My usual biotech startup essays are about places that have strange or especially out-there scientific theses, so I spend a long time focusing on the details of their work, where it may pay off big, and the biggest risks ahead.

I will not do this here, because Leash Bio actually has both a very well-trodden scientific thesis (build big datasets of small-molecules x protein interactions and train a model on it) and a very well-trodden economic thesis (use the trained model to design a drug). There’s clearly some value here, at least to the extent that any ML-for-small-molecule-development play has value. There’s also some external validation: a recent partnership with Monte Rosa Therapeutics to develop binders to novel targets.

Really, what is most unique about Leash is almost entirely that, despite how hard it is to do so, they have a nearly pathological desire to make sure their models are learning the correct thing. They have produced a lot of interesting artifacts from this line of research, much of which I think should have more eyes on. This essay will dig deep into a few of them. If you’re curious to read more about their research, they also have their own fascinating blog here.

Some of Leash’s research

The BELKA result

You may recall an interesting bit of drama that occurred just about a year back between Pat Walters—who is one of the chief evangelists of ‘many people in the small-molecule ML field are accidentally cheating’ sentiment—and the authors of DiffDock, which is a (very famous!) ML-based, small-molecule docking model.

The drama originally kicked off with the publication of Pat’s paper ‘Deep-Learning Based Docking Methods: Fair Comparisons to Conventional Docking Workflows’, which claimed to find serious flaws with the train/test splits of DiffDock. Gabriel Corso, one of the authors on the DiffDock paper, responded to the paper here, basically saying ‘yeah, we already knew this, which is why we released a follow-up paper that directly addressed these’. After many comments back and forth, the saga mostly ended with the original Pat paper having this paragraph being appended to it:

The analyses reported here were based on the original DiffDock report [1], with performance data provided directly by authors of that report, corresponding exactly to the published figures and tables. Subsequently, in February 2024, a new benchmark (DockGen) and a new DiffDock version (DiffDock-L) was released by the DiffDock group [21]. This work post-dated our analyses, and we were unaware of this work at the time of our initial report, whose release was delayed following completion of the analyses.

All’s well that ends well, I suppose.

But what was the big deal with the train/test splits anyway?

To keep it simple: the original DiffDock paper trained on pre-2019 protein-ligand complexes, and tested on post-2019 protein-ligand complexes. This may not sound too terrible, but one failure mode is that there is a lot of conservation in the chemical composition of binding domains, so the model can get away with memorizing binding-pocket-y residues rather than learning the actual physics of docking. So, when presented with a brand new binding pocket, it’d fail. And indeed, this is the case.

In the follow-up DiffDock-L paper, the authors moved to a benchmark that ensured that proteins with the same binding domains were either only in the train or only in the test dataset. Performance fell, but the resulting model was able to demonstrate much better generalization to a broader range of proteins.

Excellent! Science at work. But there is an unaddressed elephant in the room: what about chemical diversity? DiffDock-L may very well generalize to unseen protein binding pockets, but can it do well on ligands that are very structurally different from ligands it was trained on? This isn’t really a gotcha for DiffDock, because it turns out that the answer is ‘surprisingly, yes’. From a paper that studied the topic:

Diffusion-based methods displayed mixed behavior. SurfDock showed declining performance with decreasing ligand similarity on Astex, but surprisingly improved on PoseBusters and DockGen, suggesting resilience to ligand novelty in more complex scenarios. Other diffusion-based and all regression-based DL methods exhibited decreasing performance on Astex and PoseBusters, but remained stable—or even improved slightly—on DockGen, likely implying that unfamiliar pockets, rather than ligands, pose the greater generalization barrier.

But docking is not the big problem, not really.

The holy grail for protein-ligand-complex prediction is predicting affinity; not only where a small-molecule binds to, but how tightly. And here, it turns out that it is incredibly easy to mislead oneself on how well models can do here. In an October 2025 Nature Machine Intelligence paper titled ‘Resolving data bias improves generalization in binding affinity prediction’, they say this:

This large gap between benchmark and real-world performance [of binding affinity models] has been attributed to the underlying training and evaluation procedures used for the design of these scoring functions. Typically, these models are trained on the PDBbind database³⁷,³⁸, and their generalization is assessed using the comparative assessment of scoring function (CASF) benchmark datasets¹⁰. However, several studies have reported a high degree of similarity between PDBbind and the CASF benchmarks. Owing to this similarity, the performance on CASF overestimates the generalization capability of models trained on PDBbind¹⁰,³⁹,⁴⁰. Alarmingly, some of these models even perform comparably well on the CASF datasets after omitting all protein or ligand information from their input data. This suggests that the reported impressive performance of these models on the CASF benchmarks is not based on an understanding of protein–ligand interactions. Instead, memorization and exploitation of structural similarities between training and test complexes appear to be the main factors driving the observed benchmark performance of these models³⁵,³⁶,⁴¹,⁴²,⁴³.

What a pickle!

Now, the paper goes on to come up with its own split from the PDB that takes into account a combination of protein similarity, binding conformation similarity, and, most relevant to us, ligand similarity. How do they judge ligand similarity? A metric called the ‘Tanimoto score’, which seems like a pretty decent way to get to better generalization per another Pat Walters essay.

Well, that’s that, right? Have we solved the ball problem?

Not quite. Tanimoto-based filtering is an improvement, but it is still an exercise in carving up existing public data more carefully. Why is that a problem? Because public data are not random samples from chemical space, but are rather the accumulated residue of decades of drug discovery programs and academic curiosity. Because of that, even if you filter out molecules with Tanimoto similarity above some threshold, you might still be left with test molecules that are “similar” in ways that Tanimoto doesn’t capture: similar pharmacophores, similar binding modes, similar target classes. A model might still be learning something undesirable, like, “this looks like a kinase inhibitor I’ve seen before”, and there is really no way to stop that no matter how you split up the public data.
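For readers who have not met it before: the Tanimoto score is a set-overlap measure over molecular fingerprints, the number of shared "on" bits divided by the number of bits set in either molecule. Here is a minimal sketch of the score and of the kind of threshold filter described above. It assumes fingerprints are plain Python sets of bit indices; the toy fingerprints, the `filter_test_set` helper, and the 0.5 threshold are all hypothetical illustrations, not values taken from the paper.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def filter_test_set(train_fps, test_fps, threshold=0.5):
    """Keep only test fingerprints whose nearest training neighbor is below the threshold."""
    kept = []
    for fp in test_fps:
        nearest = max((tanimoto(fp, t) for t in train_fps), default=0.0)
        if nearest < threshold:
            kept.append(fp)
    return kept


# Toy fingerprints (hypothetical bit indices, not real molecules).
train = [{1, 2, 3, 4}, {10, 11, 12}]
test = [{1, 2, 3, 5}, {20, 21, 22}]

print(tanimoto(train[0], test[0]))   # 3 shared bits out of 5 total
print(filter_test_set(train, test))  # only the dissimilar test fingerprint survives
```

Note that the filter can only remove what the fingerprint is able to see. Two molecules with the same pharmacophore but different bit patterns would pass straight through it.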

How worried should we be about this? Surely at a certain level of scale, the Bitter Lesson takes over and our model is learning something real, right?

Maybe! But we should test that out, right?

Finally, with this background context, we can return to the subject of this essay.

In late 2024, Leash Bio, in one of the most insane public demonstrations I have yet seen from a biotech company, issued a Kaggle challenge to all comers: here’s 133 million small molecules generated via a DNA-encoded library (which we’ll discuss more later) that we’ve screened against three protein targets, and here’s binary binding labels for all of them. The problem statement is as follows: given this dataset (also known as ‘BELKA’, or Big Encoded Library for Chemical Assessment), predict which ones bind.

How large is this dataset in relative terms? In the introductory post for the dataset, Leash stated this:

The biggest public database of chemistry in biological systems is PubChem. PubChem has about 300M measurements (11), from patents and many journals and contributions from nearly 1000 organizations, but these include RNAi, cell-based assays, that sort of thing. Even so, BELKA is >10x bigger than PubChem. A better comparator is bindingdb (12), which has 2.8M direct small molecule-protein binding or activity assays. BELKA is >1000x bigger than bindingdb. BELKA is about 4% of the screens we’ve run here so far.

As for the data splits, Leash provided three:

1. A random molecule split. The easiest setting.
2. A split where a central core (a triazine) is preserved but there are no shared building blocks between train and test.
3. A split based on the library itself. In other words, it was a test set with entirely different building blocks, different cores, and different attachment chemistries, molecules that share literally nothing with the training set except that they are, in fact, molecules. The hardest setting.

Here is the hilarious winning result from the Kaggle competition, where ‘kin0’ refers to the 3rd data split:

In other words, a model was trained on a dataset that is an order of magnitude larger than any dataset that has come before it. And it completely failed to generalize in any meaningful capacity, being nearly perfectly equivalent to random chance. In turn, Leash’s blog post covering the whole matter was titled ‘BELKA results suggest computers can memorize, but not create, drugs’.

Now, it is worth protesting at this result. Chemistry is complex, yes, but it is almost certainly bounded in its complexity. So, one defense here is that diversity matters more than scale, and that, say, bindingdb’s ~2.8 million data-points, despite being far smaller, span far more of chemical space than BELKA’s 133 million. Moreover, bindingdb contains hundreds of targets, whereas BELKA only contains 3. In comparison, BELKA is, chemically speaking, incredibly small. Is it any wonder models trained on it, and it alone—as these were the rules for its Kaggle competition—don’t generalize well?

These are all fair arguments. Is this entire thing based on a contrived dataset?

There is an easy way to assuage our concerns. We can just load up a state-of-the-art binding affinity model, one that has been trained on vast swathes of publicly available data out there, and try it out on a BELKA-esque dataset. Say, Boltz2. How does that model perform?

The Hermes result

Well, BELKA can’t just be used out of the box. To ensure that they are truly testing ligand generalization, Leash first curated a subset of their data that has no molecules, scaffolds, or even chemical motifs in common with training sets used in Boltz2 training. This shouldn’t be any trouble for a model that has sufficiently generalized!

At the same time, they put Boltz2 in a head-to-head comparison against a lightweight sequence-only, 50M parameter (!!!) transformer called Hermes trained by the Leash team. Given 71 proteins, 7,515 small molecule binders, and 7,515 negatives, the task was to predict the likelihood of binding for a given protein and small-molecule pair.

But before we talk about the results, let’s quickly discuss Hermes. Specifically, that Hermes was not trained on any public data, but rather, on the combined sum of all the binding affinity data that Leash has produced. How much of this data is there? At the time Hermes was trained, just shy of 10B ligand-protein interactions. At the time this essay you are reading was published, it is now 50B interactions. Both of these numbers are several orders of magnitude higher than any other ligand x protein dataset in existence.

To note: BELKA is not included in these numbers, because it is not actually a dataset they use to train their models, due to it prioritizing an extremely high number of ligands to a few proteins, rather than a mix of diversity between the two. But the same DNA-encoded library process is used to generate it!

Finally, we can move onto the results.

Hermes did decently, grabbing an average AUROC of .761. Notably, the validation set here is meant to have zero chemical overlap with Hermes’s train set, which makes the result even more striking. We’ll talk more about how that split was built in the next section.

On the other hand, Boltz2 scores .577.

Hmm. Okay.
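To make those two numbers concrete: AUROC can be read as the probability that a randomly chosen binder gets a higher score than a randomly chosen non-binder, so 0.5 is what random guessing gives you. The sketch below computes it directly from that definition. The scores and labels are made-up toy values, not Leash's data, and the quadratic loop is chosen for readability rather than speed.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for five protein-ligand pairs (1 = binder, 0 = non-binder).
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [1, 0, 1, 0, 0]

# 5 of the 6 (binder, non-binder) pairs are ranked correctly.
print(auroc(toy_scores, toy_labels))
```

Read this way, .577 means Boltz2 ranked a binder above a non-binder only a little more often than a coin flip would.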

You could imagine that one pointed critique of this whole setup is that the validation dataset is private. Who knows what nefarious things Leash could be doing behind the scenes? Also, it may be the case that Leash is good in whatever space of chemistry they have curated, whereas Boltz2 is good in whatever space of chemistry exists in public databases. The binding affinity results in the Boltz2 paper are clearly far above chance, so this seems like a perfectly reasonable reconciliation of the results.

Well, Leash also curated a subset of data from Papyrus, a publicly available dataset of binding affinity data, and threw both Boltz2 and Hermes at that.

From their post:

Papyrus is a subset of ChEMBL and curated for ML purposes (link). We subsetted it further and binarized labels for binding prediction. In brief, we constructed a ~20k-sample validation set by selecting up to 125 binders per protein plus an even number of negatives for the ~100 human targets with the most binders, binarizing by mean pChEMBL (>7 as binders, <5 as non-binders), and excluding ambiguous cases to ensure high-confidence, balanced labels and protein diversity. Our subset of Papyrus, which we call the Papyrus Public Validation Set, is available here for others to use as a benchmark. It’s composed of 95 proteins, 11675 binders, and 8992 negatives.
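The binarization step in that description can be sketched as follows. This is only an illustration of the stated thresholds (mean pChEMBL above 7 is a binder, below 5 is a non-binder, anything in between is dropped). It does not reproduce the per-protein cap of 125 binders or the negative balancing, and the protein and molecule identifiers and measurement values are hypothetical.

```python
def binarize(mean_pchembl):
    """Label by mean pChEMBL: >7 binder (1), <5 non-binder (0), otherwise ambiguous (None)."""
    if mean_pchembl > 7:
        return 1
    if mean_pchembl < 5:
        return 0
    return None


def mean(values):
    return sum(values) / len(values)


# Hypothetical measurements: (protein, molecule) -> list of pChEMBL values.
measurements = {
    ("P1", "mol_a"): [7.9, 8.3],  # mean 8.1  -> binder
    ("P1", "mol_b"): [4.2],       # mean 4.2  -> non-binder
    ("P1", "mol_c"): [6.1, 5.8],  # mean 5.95 -> ambiguous, dropped
    ("P2", "mol_a"): [4.9, 4.5],  # mean 4.7  -> non-binder
}

labeled = {}
for pair, values in measurements.items():
    label = binarize(mean(values))
    if label is not None:
        labeled[pair] = label

print(labeled)
```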

On this benchmark, Boltz2’s score rose to .755, and Hermes stayed in roughly the same territory as before at .703, with its confidence interval slightly overlapping Boltz2’s.

So, yes, Boltz2 does edge out Hermes here, but given that the chemical space of Papyrus substantially overlaps with the ChEMBL-derived binding data that Boltz2 was trained on, you may naturally expect this.

So, to summarize where we are at: Leash’s in-house model, trained exclusively on their proprietary data, performs about as well on public benchmarks as a model that was partially trained on those benchmarks.

And on Leash’s private data, which, crucially, has little overlap with public training sets as measured by Tanimoto scores (also included in their post), their model handily beats the state of the art.

This is all very exciting! But I want to be careful here and explicitly say that the story here is far from complete. What we can say, with confidence, is that Leash has demonstrated something important: a lightweight model trained on dense, high-quality, internally consistent data can compete with architecturally sophisticated models trained on the sprawling, noisy, heterogeneous corpus of public structure databases. This is made even more interesting by the fact that Hermes is not structure based, allowing it to be ~500x faster than Boltz2, the advantages of which are discussed in this other Leash post.

But what we do not yet have is proof that Leash has cracked the generalization problem. I think they are asking the right questions, and perhaps have early results suggesting that the answers are interesting, but chemical space is large, far larger than anybody could ever imagine, and it would be naive of anyone to claim that the two simple benchmarks here are sufficient to declare anything for either side.

But even after tempering my enthusiasm, I still find the results fascinating. The only outstanding question is: where does this seemingly high generalization performance actually come from? Is it from the extremely large dataset? Surely partially, but, again, chemical space is so extraordinarily vast that a few tens-of-millions of (sequence-only!) samples from it surely is a drop in the bucket. Is it perhaps from the Hermes architecture? Also unlikely, because remember, the model itself is dead-simple, just a simple transformer that uses the embeddings of two pre-trained models (ESM2-3B and ChemBERTa).

What’s going on? Where is generalization arriving from? Well, we’ll get back to that, because first I want to talk about how Leash curated their own train/test splits.

The train/test split result

As I’ve been repeating throughout this essay, Leash’s model is trained using DNA-encoded chemical libraries. These are combinatorial libraries where each small molecule is tagged with a unique DNA barcode that identifies its structure. The molecules themselves are built up from discrete building blocks. You have a central scaffold, and then you attach different pieces at different positions. A typical DEL molecule might have three attachment points, each of which can hold one of hundreds of different building blocks. Multiply those possibilities together and you can get millions of unique compounds from a relatively small set of starting materials.
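The arithmetic behind "multiply those possibilities together" is worth seeing once. In the sketch below, the figure of 200 building blocks per position is a hypothetical stand-in for "hundreds", and the block names in the tiny library are invented for illustration.

```python
from itertools import product
from math import prod

# Hypothetical library: 3 attachment points, 200 building blocks available at each.
blocks_per_position = [200, 200, 200]

starting_materials = sum(blocks_per_position)  # distinct building blocks to source
library_size = prod(blocks_per_position)       # distinct molecules you can assemble

print(starting_materials)  # 600
print(library_size)        # 8000000

# The same idea, enumerated explicitly for a tiny library with invented block names.
tiny_library = list(product(["A1", "A2"], ["B1", "B2", "B3"], ["C1", "C2"]))
print(len(tiny_library))   # 12
print(tiny_library[0])     # ('A1', 'B1', 'C1')
```

So under these assumed numbers, 600 starting materials yield eight million compounds, which is the combinatorial leverage that makes DELs attractive.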

It feels wrong to give this explanation without an associated graphic, so I asked Gemini to create one:

This is great for generating diverse molecules, but also for splitting a chemical dataset, because it allows you to split molecules by the building blocks they share. If each molecule is assembled from, say, 3 building blocks, a rigorous way to split things is to ensure that no building block that appears in the train set also appears in the test set.

But you may immediately see a problem here: what if two different building blocks have very chemically similar properties? This can be easily remedied by not only ensuring that there are no building-block overlaps, but also checking that the chemical fingerprints of building blocks in the train set are sufficiently dissimilar from those in the test set. In other words, you cluster the building blocks by chemical similarity, and then make sure that no cluster contributes building blocks to both the train set and the test set.

And they did exactly this. From their post:

Our Leash private validation set is this last category: it’s made of molecules that share no building blocks with any molecules in our training set, and also the training set doesn’t have any molecules containing building blocks that cluster with validation set building blocks. It’s rigorous and devastating: splitting our data this way means our training corpus is roughly ⅓ of what would be if we didn’t do a split at all (0.7 of bb1*0.7 of bb2*0.7 of bb3 = 0.343)…
In exchange for losing all that training data, we now have a nice validation set where we can be more confident that our models aren’t memorizing, and we can use it to make an honest comparison to other models that have been trained on public data.
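The 0.343 figure falls straight out of the combinatorics, and a small simulation makes the cost visible. This is a minimal sketch under stated assumptions: a hypothetical library with 10 building blocks at each of 3 positions, the first 7 at each position assigned to train, and no clustering step (which the real split also applies). Nothing here is Leash's actual pipeline.

```python
from itertools import product


def make_blocks(prefix, n):
    """Invented building-block names such as a0, a1, ..."""
    return [f"{prefix}{i}" for i in range(n)]


# Hypothetical library: 10 blocks at each of 3 positions -> 1000 molecules.
positions = [make_blocks("a", 10), make_blocks("b", 10), make_blocks("c", 10)]

# Assign 70% of the blocks at each position to train; the rest are held out.
train_blocks = [set(blocks[:7]) for blocks in positions]

train, held_out, discarded = [], [], []
for molecule in product(*positions):
    in_train = [block in allowed for block, allowed in zip(molecule, train_blocks)]
    if all(in_train):
        train.append(molecule)      # built only from train blocks
    elif not any(in_train):
        held_out.append(molecule)   # shares no building block with train
    else:
        discarded.append(molecule)  # mixes train and held-out blocks

total = len(train) + len(held_out) + len(discarded)
print(len(train) / total)             # 0.343
print(len(held_out), len(discarded))  # 27 630
```

In this toy setup, most of the library (630 of 1000 molecules) ends up in neither set, because any molecule that mixes train and held-out blocks has to be thrown away to keep the split clean.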

Using this dataset, they applied Hermes (and XGBoost as a baseline) to four increasingly difficult splits of the data: a naive split based on chemical scaffold, 2 building blocks shared split, 1 building block shared split, and 0 building blocks shared + no chemical fingerprint clusters shared. The results are as follows:

Here, simple XGBoost beats Hermes on almost every split other than the hardest one. Only when you ensure that there are zero shared building block clusters, when you truly force the model to chemically novel territory, does the more complex Hermes pull ahead.

Okay, this is a fine result, and it does rhyme with the theme of the essay w.r.t ‘being rigorous’, but this should raise more questions than it answers. As a result of how they have constructed the training dataset for Hermes, wouldn’t we expect it to have a relatively small area of ‘chemical space’ to explore? By going through this building-block and cluster filtering, surely the training data is almost comically O.O.D from the test set! And yet, as we mentioned in the last section, Hermes seems to display at least some heightened degree of chemical generalizability compared to state-of-the-art models! How is this possible?

It may have to do with the nature of the data itself: DNA-encoded libraries. Leash writes in their blog post that the particular type of data is perhaps uniquely suited for forcing a model to actually learn some physical notion of what it means to bind to something:

Our intuition is that by showing the model repeated examples of very similar molecules - molecules that may differ only by a single building block - it can start to figure out what parts of those molecules drive binding. So our training sets are intentionally stacked with many examples of very similar molecules but with some of them binding and some of them not binding.
These are examples of “Structure-activity relationships”, or SAR, in small molecules. A common chemist trope that illustrates this phenomena is the “magic methyl” (link), which is a tiny chemical group (-CH3). Magic methyls are often reported to make profound changes to a drug candidate’s behavior when added; it’s easy to imagine that new greasy group poking out in a way that precludes a drug candidate from binding to a pocket. Remove the methyl, the candidate binds well.
DELs are full of repeated examples of this: they have many molecules with repeated motifs and small changes, and sometimes those changes affect binding and sometimes they don’t.

Neat! This all said, the usage of DELs is at least a little controversial, due to it often producing false negatives, being limited in overall chemical space, and the actual hits from DELs not being particularly high-affinity. Given that I do not actively work in this area, it is difficult for me to give a deeply informed take here. But it is worth mentioning that even if the assay seems to have its faults, the fact that Hermes performs competitively on Papyrus, a public benchmark derived from ChEMBL that has nothing to do with DEL chemistry, suggests that whatever Leash’s models are learning cannot purely be an artifact of the DEL format. Of course, it is almost certainly the case that Hermes has its own failure modes and time will tell what those are.

And with this, we can arrive at the present day, with a very recent finding from Leash on something completely unrelated to Hermes.

The ‘Clever Hans’ result

Truthfully, I’ve wanted to cover Leash for a year now, ever since the BELKA result. But what finally got me to sit down and do it was an email I received from Ian Quigley, a co-founder of Leash, recently on November 27th, 2025. In this email, Ian attached a preprint he was working on, written alongside Leash’s cofounder Andrew Blevins, that described a phenomenon that he dubbed, ‘Clever Hans in Chemistry’. The result contained in the article was such a perfect encapsulation of the cultural ethos I—and many others—have come to associate with Leash, that I finally wrote the piece I’d been putting off.

So, what is the ‘Clever Hans’ result? Simple: it is the observation that molecules created by humans will necessarily carry with them the sensibilities, preferences, and quirks of the humans who made them.

For example, here are some molecules created by Tim Harrison, a distinguished medicinal chemist at Queen’s University Belfast.

And here are some other molecules made by Carrie Haskell-Luevano, who is a chemical neuroscientist professor at the University of Minnesota’s College of Pharmacy.

I don’t know any medicinal chemistry! You may not either! And yet, you can see that there is an eerie degree of same-ness within each chemist’s portfolio. And if we can see it, can a model?

Yes.

Using ChEMBL, Leash collated a list of chemists who they considered prolific (>30 publications, >600 molecules contributed), scraped all their molecules, and then trained a very simple model to play Name That Chemist.

Out of 1815 chemists, their trained model had a top-1 accuracy of 27%, and a top-5 accuracy of 60% in being able to name who created an arbitrary input molecule.
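For scale, uniform guessing among 1815 chemists would give a top-1 accuracy of about 0.055% and a top-5 accuracy of about 0.28%. Here is a short sketch of how top-k accuracy is computed, assuming the model outputs a ranked list of candidate authors per molecule. The chemist names and rankings are invented for illustration.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of examples whose true label appears among the first k ranked guesses."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked guesses for four molecules, best guess first.
ranked = [
    ["chemist_a", "chemist_b", "chemist_c"],
    ["chemist_b", "chemist_c", "chemist_a"],
    ["chemist_c", "chemist_a", "chemist_b"],
    ["chemist_a", "chemist_c", "chemist_b"],
]
truth = ["chemist_a", "chemist_c", "chemist_b", "chemist_c"]

print(top_k_accuracy(ranked, truth, 1))  # 0.25
print(top_k_accuracy(ranked, truth, 2))  # 0.75

# Chance baselines for the 1815-way task, assuming uniform guessing.
print(1 / 1815, 5 / 1815)
```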

If curious, Leash also set up a leaderboard for you to see how distinctive your favorite chemist is! And while some chemists’ molecules are far harder to suss out than others, the vast majority of them did leave a perceptible residue on their creations.

This may seem like a fun weekend project, but the implications start to get a little worrying when you realize that the extreme similarity among a chemist’s molecules is less an idiosyncratic behavior and more the result of a career-long optimization process of creating molecules that do X, and molecules that do X may very well end up looking a particular way. Which means that if a model can detect the author, it can infer the intent. And if it can infer the intent, it can predict the target. And if it can predict the target, it can predict binding activity. All without ever learning a single thing about why molecules actually bind to proteins.

Is this actually true though? It seems so. Using a split based on chemical scaffold (which is a pretty common, though increasingly discouraged, practice), Leash found that there is no functional difference in accuracy between giving a model a rich molecular description of the small-molecule (ECFP), and only giving a model the name of the author who made it. Even worse, both seem to encode roughly the same information.

They have a few paragraphs from their preprint that I really want to repeat here:

Put differently, much of the information that a simple structure-based model exploits in this setting is explainable by chemist style. The activity model does not need to infer detailed chemistry to perform well; it can instead learn the sociology of the dataset—how different labs behave, which series they pursue, and which targets they favor.
….
We interpret this as evidence that public medicinal-chemistry datasets occupy a narrow “chemist-style” manifold: once a model has learned to recognize which authors a molecule most resembles, much of its circular-fingerprint representation is already determined. This reinforces our conclusion that apparent structure–activity signal on CHEMBL-derived benchmarks is tightly entangled with chemist style and data provenance.

Now wait a minute, you may cry, this is just repeating the same point made in the last section about rigorous train/test splitting! And yes, this result does certainly rhyme with that. But the difference here is that the author signal seems to be inescapable through the standard deconfounding technique. Consider the following plot from the paper:

If chemist style were simply “chemists make similar-looking molecules,” you’d expect clear separation here—intra-author pairs clustering at low distances, inter-author pairs at high distances. But the distributions almost completely overlap. Both peak around 0.85-0.9 Tanimoto distance. The intra-author distribution has a slightly heavier left tail, but the effect is marginal. By the standard metric the field uses to assess molecular similarity, molecules from the same author are barely more similar to each other than molecules from different authors.

And yet, models can detect it. And it is almost certainly the case that binding affinity models trained on human-designed molecules are exploiting it.

But it gets worse. Authorship is just one axis of ‘human-induced chemical bias’ that we can easily study! There is a much more subtle one that Leash mentioned in a blog post on the subject: stage of development. Unfortunately, this type of data is a fair bit harder to get. They put it best in their blog post:

One dataset we wish we had includes how far along the medicinal chemistry journey a particular molecule might be. As researchers grow more confident in a chemical series, they’ll start putting more work into it, and this often includes more and more baroque modifications: harder synthesis steps, functional groups further down the Topliss tree, that kind of stuff.

Leash doesn’t need to worry about any of these issues for its own work, since the molecules in its dataset are randomly synthesized in parallel by the millions and tested once; either they bind or they don’t. The human intent that saturates public datasets simply isn’t present. So overall, this is a win for the ‘generate your own data’ side!

Either way, I still hope they study more and more ‘bizarre confounders in the public data’ phenomena in the future. How many other things like this exist beyond authorship and stage of development? What about institutional biases? The specifics of which building blocks happened to be commercially available where? Subscribe to the Leash blog to find out!

Conclusion

One may read all this and say, well, this is all well and good for Leash, but does every drug discovery task require genuine generalization to novel chemistry? Existing chemical space probably isn’t too bad to explore in!

And yes, I agree, and I think the founders of Leash would also. If a team is developing a me-too drug in well-explored chemical territory, a model cheating may be, in fact, perfectly fine. Creating a Guangzhou Polymer Standardization Accords detector would actually be useful!

But there is an awful lot of chemical space that is entirely unexplored, and almost certainly useful. What’s an example? I discuss this a little bit in an old article I wrote about the challenges of synthesizability in ML drug discovery if curious; an easy proof point here is natural products, which can serve as excellent starting points for drug discovery endeavors, and are known to differ systematically in structure from classic, human-produced molecules. Because of these differences, I would bet that the vast majority of small-molecule models out there would be completely unable to grasp the binding behavior of this class of chemical space, which, to be clear, almost certainly includes the current version of Hermes.

So, to be clear, as fun as it would be to imagine Leash doing all this model and data exploration work out of a deep spiritual commitment to epistemic hygiene, the actual reason is almost certainly more pragmatic.

I gave the Leash founders a chance to read this article to ensure I didn’t make any mistakes in interpreting their results (nothing significant was changed based on their comments), and Ian offered an interesting comment: ‘While this piece is about us chasing down these leaks, I do want to say that we believe our approach really is the only way to enable a world where zero-shot creation of hit-to-lead or even early lead-opt chemical material is possible, particularly against difficult targets, allosterics, proximity inducers, and so on. Overfit models are probably best for patent-busting, and the past few years suggest to us that’s a losing battle for international competition reasons.’

In other words, if the future of medicine lies in novel targets, novel chemotypes, novel modalities, you need models that have learned something fundamental about what causes molecules to bind to other molecules. They cannot cheat, they cannot overfit, they must really, genuinely, within their millions of parameters, craft a low-dimensional model of human-relevant biochemistry. And given how much they empirically care about finding ‘these leaks’, as Ian puts it, it’s difficult to not be optimistic about Leash’s philosophy being the best positioned to come up with the right solution to do exactly this.

Mapping the off-target effects of every FDA-approved drug in existence (EvE Bio)

Abhishaike Mahajan — Fri, 04 Jul 2025 12:54:50 GMT

Note: Thank you to Bill Busa, CEO and co-founder of EvE Bio, for an extremely helpful discussion while working on this essay.

This essay is long, and I recognize that many people don’t necessarily care about the details. The real headline point you need to be aware of is this dataset, which was produced by EvE Bio under a non-commercial Creative Commons license, and is a comprehensive mapping of the interactions between a significant fraction of clinically important human cellular receptors and ~1,600 FDA-approved drugs. I strongly believe that this data is really, really useful, and more people should be aware it exists.

If you’d like to understand why I think it is useful, and what the dataset exactly contains, read on!

Introduction

If you were to be a fly on the wall during the 1-6 years of preclinical drug discovery research within a pharmaceutical company, one observation you may walk away with is that, while the work is certainly complicated, it is also frighteningly limited in scope. What you’ll learn is that drugs are made by corporations that are optimizing for one primary thing, and one thing only: work. ‘Working’ is obviously contextual, but it is a simple concept no matter the situation: reduce a worrying biomarker, improve mood, lengthen lifespan and so on and so on. What does this discovery process ignore? Simply put: everything else a drug could do beyond that.

Yes, that’s a roundabout way of describing ‘off-target effects’ — defined as the action of a drug at a gene product other than the gene product it was intended to affect — but I think it’s a helpful intuition pump. Viewing the drug discovery process as ‘not paying attention to anything that is unrelated to the drug working’ is useful in that it contextualizes the situation we’re in. Drugs are meant to make money, and money is derived from drugs working. To spend time on understanding what else a particular drug does beyond It Working for its intended task is time lost and money lost.

One unfamiliar with the drug discovery process may find this bizarre; why wouldn’t the well meaning scientists in charge of developing drugs try to deeply understand how it interacts with the body? On the other hand, those deeply in the medical field would find this thesis so obvious that stating it is unnecessary; of course a pharmaceutical company would limit their scope of understanding a drug to things that lie between it working and not working. There’s only so much time and resources to go around. Priorities!

Of course, if an off-target effect comes between the drug and It Working, then certainly resources will be allocated to deal with it. But beyond that, mapping everything else a clinical-stage drug does — every receptor it unintentionally binds, every pathway it nudges sideways, every gene it perturbs slightly — is deemed so high effort and so low ROI, that it is relegated to hoping an academic will study it. Only if post-marketing surveillance turns up something worrying shall further exploration occur. Because, again, a deep understanding of what exactly an exogenous chemical is doing inside a body is not the point of the drug discovery process. Working is the point!

With that background context, I am ready to present three claims I’m going to make in this essay and spend the remaining sections trying to prove:

Understanding off-target effects is really useful.
Learning about off-target effects at scale is possible.
No for-profit institution has a strong incentive to do this work.

For the moment, let’s accept that these three are indeed true, and we can put our skeptic hat back on at the end of this section.

The subject of today’s essay is EvE Bio, and why I think they are doing something incredible.

EvE is a bit unlike the typical startups I write about, because they aren’t really a startup. They are an FRO, or Focused Research Organization. Many reading this blog are likely already familiar with this recent renaissance of strange scientific organizations (something I’ve written about in the past) and already understand this acronym, but to those who don’t, this Venn diagram is quite instructive:

Entire essays can be (and have been!) written about the intricacies of FROs, but this essay will ignore much of their organizational structure, since it isn’t super relevant to what EvE is doing.

So what is EvE doing? EvE Bio is a scientific non-profit that has a clear, singular mission: map the off-target effects of every FDA-approved drug in existence and share the data. The data will be released under a non-commercial Creative Commons license: free to use by academics, and available for licensing by commercial entities. Once they accomplish this task, they close up shop or spin off into their own thing. And if they don’t do it within 5-6 years, the same end result still happens. They do have some future plans that may come into the picture with time, which I’ll cover at the end, but the bolded bit is their primary thesis!

So why are they doing this? How will they do it? And why hasn’t anyone else done it yet?

Why is understanding off-target effects important?

There is a lazy answer that could be given here: "because we want to know if potential side effects of a drug exist". This is partially correct, but I think it pays to be more specific. On EvE’s website, they list six reasons why off-target effects are worth studying:

Now, fairly, some of these are at least a little fluffy. Is the institution doing off-target mapping really going to be the one developing the autonomous lab assays of the future? Maybe! But it feels like a third-order, fourth-order, or even further consequence of their main mission. Bill did mention to me that there are already promising results in that direction, such as better reporting cell lines, but still. I think it’s generally good to limit one’s assessment of an institution to what its first-order impact will be, and, there, I think there are three distinct areas that EvE will service: drug repurposing, validation for machine-learning models, and to a weaker degree, polypharmacology.

What about industrial chemical profiling and pharmacology profiling? I think EvE will certainly be important there, but it’s a bit fuzzier. Industrial chemical profiling may occur in the future but isn’t part of the current cohort of FDA-approved drugs that EvE is focusing on, and there’s a similar problem for pharmacology profiling as there is for ML-for-toxicity (which I have written about before as being a challenging proposition).

But even if we take my somewhat pessimistic stance that only three of these six things are genuinely tractable in the short term, those areas alone are extremely valuable. Let’s go over them.

Drug repurposing

I think it is under-appreciated just how rich the cohort of FDA-approved drugs out there is. Consider the fact that basically all drugs start off with singular indications, meant to cure, alleviate, or address one thing. Yet, ~30% of FDA-approved drugs gain a new post-approval indication, based on a study of the 197 drugs approved by the FDA from 1997-2020. Funnily enough, the same paper that came up with that 30% number almost treats it as a matter of disappointment, given that 38% of all prescriptions written in the US are off-label! This implies that there are, potentially, hundreds of drugs that are already being used beyond their original scope, just without the formal validation or regulatory blessing. Which, in turn, implies that we’re sitting on a vast, under-explored landscape of therapeutic potential, one that clinicians are already intuitively poking into, but which the formal system has barely begun to chart.

Now, I think some caution is warranted. This 38% number does vary from paper to paper; one other study claims off-label prescriptions are as low as 25%. If we’re being even more fair, it’s questionable exactly how proven-out these off-label indications are. One 2006 study claims that of the 21% of off-label prescriptions they found, 73% of them had little-to-no scientific support. Hard to tell whether this is because there simply are no studies, or because the off-label usage was actively disproved!

Consider gabapentin, one of the most egregious cases of off-label drug prescriptions. Typically, most people view gabapentin as the nerve injury drug, right? But it, in fact, was not originally approved for that, only for seizures! Yet, 95% of its prescriptions are for pain: nerve pain, low-back pain, post-operative pain, and so on. But while gabapentin is indeed effective for some specific types of nerve pain (diabetic neuropathy), it is ineffective for many other types (e.g. sciatica), as confirmed by follow-up studies by Pfizer.

Yet, prescriptions for these ineffective off-label usages continue.

But even if the true rate of valid, effective off-label use is lower than we’d like to imagine, the value of actually stumbling across a chance to repurpose a drug is high enough as to almost certainly still be worth it! Why? New chemical entities must follow the typical clinical phase progression timeline, whereas any repurposed drugs can skip preclinical, phase 1, and (sometimes) phase 2 trials as a result of their already-collected toxicity data. Billions of dollars and years of time could be saved!

From a 2017 review paper:

…repurposed drugs are generally approved sooner (3–12 years) and at reduced (50–60%) cost (5, 6). In addition, while ~10% of new drug applications gain market approval, approximately 30% of repurposed drugs are approved, giving companies a market-driven incentive to repurpose existing assets (5)….
For example, repurposing of the emergency contraceptive, mifepristone, for Cushing’s syndrome required a cohort of less than 30 patients to test its efficacy, whereas a clinical trial¹ for the same indication evaluating the safety and efficacy of a new chemical entity, levoketoconazole, required ~90 individuals (2, 3)…..

But, as it stands today, most drug repurposing efforts are done somewhat blindly: haphazardly glancing through the literature, relying on anecdotal case reports, or waiting for some academic lab to publish a five-mouse study from 2013 that hints at a secondary use. In many ways, it isn’t too dissimilar to the usual drug-discovery process! Given how promising (and relatively limited) the list of FDA-approved drugs is, the simple act of providing a pre-triaged list of drug-target maps (EvE’s mission!) may be extraordinarily impactful.

In such a world where this data is easily accessible, perhaps an order of magnitude more energy would be devoted to repurposing efforts, maybe vastly improving the currently horrific finances of modern day drug discovery.

But as with all seeming free-lunches, there’s a reason drug repurposing hasn’t been aggressively exploited beyond a few cases: economics. Unlike novel drugs, which come with fresh patents and a full runway of exclusivity, repurposed drugs necessarily rely on compounds whose original patents have expired or are near expiration1. This limits the sponsor’s ability to recoup development costs, because generic competition can quickly erode any profits once the drug hits the market, even if it’s approved for a new use. There are mechanisms to extend exclusivity for repurposed indications — such as the 7-year exclusivity period given by the FDA’s Orphan Drug Act for treatments of rare diseases or the 3-year-exclusivity granted in cases where new clinical data was needed to repurpose a drug — but it is a risky enough bet that most companies will shy away from it.

But as EvE is a non-profit, the economics don’t need to make sense. They plan to periodically announce opportunities for repurposing to the world, in hopes that other well-meaning non-profits take it on or, if the evidence is sufficiently convincing, that doctors simply take it as a useful datapoint for deciding whether an off-label prescription may be useful. And if they do most of the legwork in identifying good candidates for repurposing, it may even make the economics worth it for for-profit entities to pursue further.

Validation data for models

One of the easiest ways to assure yourself that what you’re doing is valuable is if people come up to you and ask if they could use whatever you’re producing. This is true in typical SaaS products, and it is true for the fruits of R&D work. But beyond assessing value outright, it also helps you learn what your work is most valuable for.

And, curiously, the primary area in which EvE has found ‘product market fit’ is in companies asking to use their data for internal model validation efforts. As I mentioned before, while EvE’s dataset is free-to-use by academics, it requires a commercial license to be used by any for-profit entity. And they are currently in discussions with 4 such commercial entities, all of whom desire to use EvE’s dataset to validate their machine-learning models’ predictions.

Historically, model builders in drug discovery have had to make do with whatever internal datasets they could get their hands on, which were typically limited in scope, biased toward certain classes of molecules, or simply not reproducible. Public data from sources like ChEMBL, BindingDB, or PubChem BioAssay are much larger in size, but they tend to be noisy, heterogeneous in experimental methodology, and always lack negative results. Worse, they’re often cherry-picked around success stories or clustered around well-studied targets, introducing systemic biases that hamper generalization. We need not look further than Pat Walters’ famous essay on the topic: We Need Better Benchmarks for Machine Learning in Drug Discovery, which expands on these issues even more.

This is an area of EvE’s work that I cannot personally shed much light on, and obviously, Bill cannot tell me the exact details on what the commercial entities are working on. But it was a surprising learning from our conversation that this particular topic is where public interest is most rapidly coalescing! Very excited to hear about more public statements they make in this area soon.

(Maybe) Polypharmacology

I do think this is the weakest, day-one value-add for EvE’s dataset. So take this section with a grain of salt! It just felt too interesting to not cover.

Polypharmacology is a drug discovery approach where a drug is designed to target multiple molecular targets, instead of a more traditional single-target approach. It’s not a particularly new idea; most clinically useful drugs exhibit multi-target activity whether they were designed that way or not. But what’s changed in the past decade is the intentionality.

I think there are a lot of different arguments for the value of polypharmacology, the easiest one hinging on efficacy. There’s a very interesting story that could be told here about drugs that worked better because they modulated the activity of multiple receptors in parallel. A great, recent example is that of drugs that followed Ozempic. Ozempic simply targeted GLP-1, which reduces appetite and slows digestion. But the second-generation (e.g. Zepbound) also targeted GIP, which amplifies insulin response and regulates lipid metabolism differently in adipose tissue. The effects were incredible: 13.7% weight loss with Ozempic, 20.2% weight loss with Zepbound over 72 weeks. Synergistic effects! The third generation (e.g. retatrutide) tacks on interactions with glucagon receptors (potentially increasing metabolic rate), with early phase 2 results looking once again promising.

But a more interesting place to start is the very similarly named concept of polypharmacy.

Polypharmacy refers to the clinical practice of prescribing multiple drugs simultaneously (usually 5+), typically to manage complex or co-occurring conditions. It’s common in geriatrics, psychiatry, oncology, and increasingly just about everywhere else in medicine: ~17% of all adults in the US meet the definition for polypharmacy. The logic is straightforward: most diseases aren’t governed by a single pathway, and so tackling them with a single drug is often insufficient. Instead, clinicians stack therapies: an ACE inhibitor for the blood pressure, a statin for the cholesterol, metformin for the glucose, a GLP-1 for the weight, and so on.

As you may expect, polypharmacy is awful on the patient's physiology. One study estimates that nearly 10% of hospital admissions among older adults are directly attributable to adverse drug events from polypharmacy-related side effects. The more drugs we stack onto people, the more unpredictable the net interaction becomes, because even if each one has been individually safety-tested, nobody tests all the pairwise combinations in a clinically realistic setting.

The solution may very well be to bundle things up.

Rather than throwing five separately optimized molecules at a patient and hoping for cooperative behavior, we could, in principle, design a single molecule that alone engages the same therapeutic targets. This, in turn, allows clinical trials to suss out the net effect of such a drug in a controlled, interpretable way. Which naturally leads us to the utility of polypharmacology; not necessarily because it will give us magic drugs with efficacy far better than current ones (though it may!), but rather because it will simply let us avoid the issues that polypharmacy currently presents.

But the obvious question: does EvE’s dataset help with polypharmacology efforts? There isn’t any current, empirical proof of this, but I think it will. If you squint, you could see it functioning as missing infrastructure, a dataset that is necessary for rational polypharmacology to occur at scale. But this is necessarily tied up with machine-learning for chemical design accelerating, so, again, this is not necessarily something I’d expect EvE’s work to contribute to by the end of the year. But perhaps soon!

How do you understand off-target effects in a tractable way?

This all said, even if you agreed that the value proposition that EvE is claiming is real, you may struggle to verbalize exactly how you would understand the off-target effects of the ~13,000 FDA-approved drugs out there. What assays would you use? How do you dose any given drug? How do you understand the translation of your assay to real-world settings?

Let’s walk through the EvE workflow.

First, you need to decide what drugs you're actually going to test. While there are technically around 13,000 FDA-approved drugs out there, many of them aren't particularly relevant for this kind of screening. You can immediately exclude things like topical medications, inhalants, radioisotopes, and simple nutrients, stuff that is known to be largely innocuous or not have much systemic impact. After this initial filtering, you end up with about 1,600 small molecule drugs that are worth investigating. But this number gets whittled down further based on practical constraints: availability, cost, licensing requirements, etc.

From this, EvE ended up with a library of 1,397 compounds to screen.

Then comes the harder question: what exactly are you screening against? The human body has somewhere around 20,000 protein-coding genes, and there is an argument that any drug could interact with any of them. But perhaps we’d be too zealous to immediately do an (everything x everything) screen. Shouldn’t we try to do something that’s closer to the Pareto optimal frontier? What if we suspect that the vast majority of clinically meaningful drug interactions occur with a tiny subset of those 20,000 genes?

And, indeed, that turns out to be the case.

The vast majority of genes have some nominal physiological function, yes, but when it comes to drug interactions, only a minority are commonly targeted. More precisely, a minority of classes: nuclear receptors (NRs) and 7-transmembrane receptors (also known as GPCRs). In total, there are about 800 GPCRs and 48 NRs, but only 110 GPCRs and 12-13 NRs are actually targeted by drugs. At last count, EvE has created data for 56 GPCRs and 29 NRs. Over the course of their existence, they plan to cover, in total, a select set of 200 GPCRs and all 48 NRs. Why not all 800 GPCRs? I attached that information in the footnotes.2

They hope to do much more than this too, but we’ll cover that in the last section.

Both NRs and GPCRs have some nice properties, but most pertinent to EvE, they are known to be very ‘druggable’ classes of proteins, given that the cell often uses them to convey information from the outside world, and evolution has therefore made their binding pockets unusually receptive to small molecules. GPCRs, sitting on the cell surface, are natural sensors for hormones, neurotransmitters, and other circulating ligands, many of which resemble or inspire drug scaffolds. NRs, meanwhile, act as intracellular switches that come with protected internal pockets meant to bind to estrogen, cortisol, and so on, making them ideal for selective small-molecule engagement. As a result, both are involved in a lot of important physiological processes.

This ‘physiological importance’ is useful in two ways! One, a plurality of drugs target the two (13% of FDA-approved drugs target NRs, with that number jumping to 35% for GPCRs), so mapping the interactions here may give a clinically meaningful view of off-target effects. And two, given the extreme importance of GPCRs and NRs in modern-day drug development, there has been a fair bit of work in improving how we study their interactions with ligands of interest. As in, entirely new assays shouldn’t need to be developed to study them.

Speaking of that, let’s start talking about how they are building this drug x receptor interaction map. They rely on two well-established assays, which I’ll discuss here, but feel free to skip; understanding the two isn’t particularly important.

TR-FRET-based co-factor recruitment assays for NRs
1. When a drug successfully activates an NR, it usually causes a conformational shift that exposes what is often called the ‘AF2 domain’, allowing the receptor to recruit a specific co-factor protein. These co-factors tend to have little peptide motifs (like an LXXLL motif) that latch onto that domain.
2. TR-FRET exploits this. One fluorophore is tagged onto the NR domain and another onto the co-factor protein. If the FDA-approved drug is an agonist, you’ll see a spike of light appear as the two fluorophores are brought close together and interact.
Tango β-arrestin recruitment assays for 7TMs/GPCRs
1. Instead of recruiting co-factors inside the nucleus like NRs, GPCRs sit on the surface of a cell and transmit signals inward. As the name of the protein class implies, this involves utilizing G-proteins. Unfortunately, G-proteins are quite specific to their GPCR, so using them in our assay as a way to understand activation would be difficult to scale. Luckily, there is a nearly universal binding protein: β-arrestin. When a GPCR is activated by [something], its signaling process almost always involves binding to that protein.
2. In the assay, the GPCR (attached to a cell surface) is engineered to have a built-in “trap”, a little molecular tag connected to a transcription factor. When β-arrestin is recruited, it brings along a protease that snips the tag, releasing the transcription factor. That transcription factor then moves into the nucleus and turns on a reporter gene, which encodes for the enzyme β-lactamase. Meanwhile, the cell is loaded with CCF4-AM, a fluorescent substrate that shifts its emission profile when cleaved by β-lactamase. The stronger the GPCR activation by a drug, the more β-lactamase is produced, the more substrate is cleaved, and the bigger the fluorescence shift. That shift, measured as a ratio between ‘starting’ and ‘ending’ wavelengths, serves as a readout of how strongly the receptor was activated.

Reasonably simple! One note: the explanations I gave above are for assessing the difference between an inactive drug and an agonist. For assessing inactive versus antagonist, a separate experiment is run with a known ligand included.

Well, wait a minute. Aren’t we missing something? Even if the off-target effects of a small molecule could be summarized purely by these GPCR/NR measurements, we’d be failing to capture something else that is of vital importance: whether the drug outright kills the cell. One could imagine this also affecting our receptor experiments! Perhaps a drug is an antagonist and there is no color shift, or perhaps the cell is just dead, and nothing is being expressed at all. Conversely, a drug might look like an agonist due to signal drift as the cell’s internal environment falls apart.

EvE solves this in a pragmatic way: run a third assay which measures how healthy the cell is. How do you measure that? Well, one good proxy for how functional a cell is: ATP production. Metabolically active cells generate ATP to power all their intracellular processes. Dead ones don’t. The assay EvE uses is called CellTiter-Glo. It works by adding a reagent that produces a luminescent reaction in the presence of ATP. More ATP, more light. Less ATP, less light. No ATP? No light (and likely dead). Again, simple!
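To make the confound concrete, here is a small sketch of how a viability readout could gate the interpretation of an antagonist-mode well. Everything here is an assumption for illustration: the function name, the normalization to control wells, and both cutoffs are made up, and EvE’s actual logic is more involved.

```python
def interpret_antagonist_well(signal_fraction, viability_fraction,
                              signal_cutoff=0.5, viability_cutoff=0.7):
    """Label one antagonist-mode well.

    Both inputs are normalized to control wells (1.0 = same as control).
    The cutoffs are hypothetical placeholders, not EvE's values.
    """
    # A dead or dying cell also produces little reporter signal, so check
    # viability first; otherwise cytotoxicity masquerades as antagonism.
    if viability_fraction < viability_cutoff:
        return "inconclusive: low viability"
    if signal_fraction < signal_cutoff:
        return "candidate antagonist"
    return "no antagonism seen"


print(interpret_antagonist_well(0.2, 0.95))  # candidate antagonist
print(interpret_antagonist_well(0.2, 0.10))  # inconclusive: low viability
```

The two example wells have the same drop in receptor signal; only the ATP-based viability readout tells them apart.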

Is that all? One last thing: accounting for pan-assay interference compounds, or PAINS. These are molecules that often give false positives in high-throughput screening regimes. This can occur for many different reasons, but one relevant example is if a molecule itself is a fluorophore, leading to us falsely believing that it is an agonist during a run. EvE simply tracks how often a drug is leading to positive results, and flags it in their results if they believe it is a PAINS.
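A rough sketch of what ‘tracking how often a drug gives positive results’ could look like: count, per drug, the fraction of tested receptors on which it scored as active, and flag the outliers. The function name, the 50% hit-rate cutoff, and the minimum-targets rule are all my own assumptions, not EvE’s criteria.

```python
from collections import defaultdict


def flag_frequent_hitters(results, max_hit_rate=0.5, min_targets=10):
    """Return the set of drugs that look suspiciously promiscuous.

    results: iterable of (drug, receptor, is_active) tuples.
    A drug is flagged if it was tested on at least min_targets receptors
    and was active on more than max_hit_rate of them (both hypothetical).
    """
    tested = defaultdict(int)
    hits = defaultdict(int)
    for drug, _receptor, is_active in results:
        tested[drug] += 1
        hits[drug] += bool(is_active)
    return {
        drug for drug in tested
        if tested[drug] >= min_targets and hits[drug] / tested[drug] > max_hit_rate
    }


# Toy data: drug_A lights up 9 of 10 receptors, drug_B only 1 of 10
toy = [("drug_A", f"R{i}", i != 0) for i in range(10)]
toy += [("drug_B", f"R{i}", i == 0) for i in range(10)]
print(flag_frequent_hitters(toy))  # {'drug_A'}
```

A flag like this doesn’t prove interference; a genuinely promiscuous drug would also trip it, which is presumably why it is reported as a flag rather than used to discard data.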

So they run these three assays across their pairwise (drug x receptor) combinations, producing readouts at multiple different concentrations with replicates for each one.

I’m going to skip over a lot at this point. EvE clearly put an immense amount of work into QA’ing this process and filtering through the data, and I think I would both do it a disservice and detract from the point of this essay if I were to attempt to repeat it here. Summarizing it all down, using a complex logic table detailed here in Fig 7, EvE assigns 1 of 4 categories to each (drug x receptor) combination:

Inactive. Drug likely has no effect on the receptor, across all tested concentrations. Maybe it doesn’t bind or maybe it binds but does nothing.
Likely Inactive. A little more ambiguous, perhaps there’s a single noisy point above baseline, but nothing more.
Active – Unquantified. Something is happening, since there’s reproducible activity, but not enough clean data to fit a proper dose-response curve.
Active – Quantified. The drug produced a clear, dose-dependent response (as either an agonist or antagonist) with a well-behaved curve. From this, EvE fits a 4-parameter logistic model and extracts a pXC₅₀; the negative log concentration at which the drug produces half its maximal effect.
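For the last category, it may help to see what a 4-parameter logistic curve and a pXC₅₀ actually are. The sketch below uses the standard textbook form of the curve; the parameter names and example values are my own, concentrations are assumed to be in molar units, and the fitting step itself (normally a least-squares routine) is left out.

```python
import math


def four_pl(conc_molar, bottom, top, xc50_molar, hill):
    """Four-parameter logistic response at a given molar concentration.

    bottom/top are the lower and upper plateaus, xc50_molar is the
    concentration giving the half-maximal effect, hill is the slope.
    """
    return bottom + (top - bottom) / (1.0 + (xc50_molar / conc_molar) ** hill)


def pxc50(xc50_molar):
    """Negative log10 of the half-maximal concentration (in molar)."""
    return -math.log10(xc50_molar)


# Hypothetical agonist: plateaus at 0 and 100, half-maximal effect at 100 nM
xc50 = 1e-7
print(four_pl(xc50, bottom=0.0, top=100.0, xc50_molar=xc50, hill=1.0))  # 50.0
print(round(pxc50(xc50), 2))  # 7.0
```

Because of the negative log, a higher pXC₅₀ means a more potent drug: a 10 nM compound scores 8, a 1 µM compound scores 6.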

And…that’s it. A clean, rigorous, and tractable approach to understanding off-target effects, across hundreds of receptors, at multiple concentrations, using multiple modes of detection, with full transparency around the data.

How far along is EvE on their mission? As of their last data release on 5/7/2025, 237,490 (drug x concentration x receptor) combinations have been screened, revealing a median of 8 agonists and 31 antagonists per target. They run these experiments in 384-well plates, so that means they’ve run the process a little over 600 times to generate their current dataset, though much of the current process is automated, so very little human-done pipetting is going on. Data dumps started in November 2024, with new ones dropping every few months.
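The plate count is simple back-of-the-envelope arithmetic. The sketch below assumes one well per (drug x concentration x receptor) combination and ignores control wells and replicates, so it should be read as a lower bound rather than EvE’s real plate count.

```python
import math

combinations_screened = 237_490  # figure quoted from the 5/7/2025 release
wells_per_plate = 384

plates_exact = combinations_screened / wells_per_plate
plates_needed = math.ceil(plates_exact)  # a partly filled plate still counts

print(round(plates_exact, 1))  # 618.5
print(plates_needed)           # 619
```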

I haven’t worked in a wet lab before, but I’ve been assured by at least one person I trust that the effort that went into assembling this all together is nothing short of extraordinary. But it is worth asking the question…

Why hasn’t anyone done this before?

When assessing the value of a seeming scientific achievement, it’s usually good to step back and ask one question: why wasn’t this done a decade back?

In some cases, the answer is boring: the technology wasn’t there yet to achieve it.

But here, the technology was almost certainly available! EvE’s assay for measuring NR activity has been around at least since 2008, and the one for GPCRs since 2010, maybe even earlier for both. If it’s really that useful, why did it take so long for someone to start assembling this drug x receptor mapping?

Haven’t I already given away this answer? In the introduction, I implied that pharma groups have no direct financial incentive to create such a dataset. And that is true to some degree, especially for smaller therapeutic companies that have bigger issues to focus on, but is that true for big pharma? Couldn’t a small slice of the billions in pharma spending be handed over to an internal research team? It’s not as if the data wouldn’t be useful for their own drug development pipelines. After all, off-target effects are among the most common reasons for late-stage trial failures and post-approval black box warnings, and even if the creation of an EvE-like dataset doesn’t fix the problem, I can’t imagine it’d hurt.

I should be fair: pharma companies do indeed do some of this. EvE’s own blog discusses this a little, referencing this paper:

The report’s authors, luminaries in the discipline of safety pharmacology, surveyed 18 major pharmaceutical companies regarding the numbers and identities of potential off-targets against which they test each and every one of their new drug candidates in the interest of safety. The numbers ranged from a low of 11 to a high of 104 potential off-targets routinely profiled per company, with a median of about 45. Interestingly, the industry’s opinions regarding which potential off-targets to screen vary widely. The total number of potential off-targets screened, across the universe of all 18 pharmas, was 763, yet only 12% of them were screened by more than a third of those companies.

So, yes, pharma companies do their own off-target screening. But, as we’ve discussed, this is a far cry from the universe of druggable receptors, and is only concentrated on their particular assets, not other ones. No attempt at creating a universal map!

But the same blog post did reference another big pharma, Novartis, which open-sourced a much larger map:

Novartis, who presented data collected “over a multi-year period” profiling drug/target interactions across a median of about 800 drugs per target and 105 gene product targets…

This is impressive! But if a big pharma was willing to release this, why does an entity like EvE need to exist? For interest's sake, let’s ignore the obvious answer of ‘it is better for everyone if such a dataset is collected using a single, standardized protocol instead of compiled from unrelated experiments over years.’

I asked Bill exactly this question, and the answer was a two-parter.

For one, the dataset that was collected by Novartis, and indeed every large-scale dataset that will ever be collected by big pharma, will always be limited by the constraint we mentioned at the start: everybody only cares about the drug working. A logical conclusion of this is that nearly every receptor covered in these sorts of screens is a safety-oriented receptor. Cytochrome P450, hERG, serotonin subtypes, dopamine D₂, and the like. These are important receptors, not because of how mechanistically interesting they are, but because they are dangerous. Indeed, the vast majority of screened receptors lie within the so-called Bowes-44 set, which comes from a 2012 paper that identified 44 receptors known to be often implicated in safety-related drug failures. Though these do include NRs and GPCRs, it is a minimal set of them, as, again, the screening is not meant to assess how mechanistically interesting the receptors are.

And if a big pharma does decide to explore beyond the realm of safety-oriented receptors, they will almost certainly keep that dataset to themselves. Why release potential alpha to competitors? This is why nothing quite like EvE has come out in the past, and why it is unlikely anything ever will, at least from a for-profit entity.

And two, EvE eventually hopes to cover a lot more ground than any of the publicly available datasets. Currently, yes, the Novartis dataset is larger than EvE’s, but it won’t be for long. In fact, their plans for the upcoming few years ended up being so interesting that I decided to split them off into another section:

What does the future look like?

EvE is still quite young, just over 2 years old, and I think the future of it is going to look really, really crazy. At the end of my startup coverage articles, I typically focus on commercial/scientific risks. But given that EvE is assured funding on a multi-year horizon without needing to care about market demands, it may be much more instructive (and interesting!) to instead discuss their upcoming plans.

Earlier I noted that EvE has currently released data for 29 NRs and 56 GPCRs, out of a planned 40 NRs and 200 GPCRs. In my conversation with Bill, I asked him how much time is left until the remaining ones are released. I expected the answer to be, optimistically, ‘over the next few years’, given that EvE only started to release data back in November 2024 and that the Novartis dataset collection process also took several years. I was astonished to learn that he expects to release all remaining GPCR and NR screens by the end of this year. Setting up the assays, validation, and automation was the hard part, which is why their data releases have only started recently. But now that that’s all set up, they simply must turn the crank to get the rest out of the door.

What’s next? Bill told me that the next set of targets is kinases, a family of roughly 500 proteins that have become increasingly valuable drug targets over the last 20 years.

Then what? Bill said he’s open to exploring even more drug targets, but he also said, surprisingly, that EvE may add more chemicals on top of the ~1,600 planned FDA-approved drugs. The FDA-approved drugs, he said, are success stories. It might be even more interesting to consider the failures as well. Especially the ones that everybody expected to work, arrived at phase 3, and set billions of dollars on fire after the trial results came out.

Even more exotic options are also on the table. For example, Bill discussed exploring how metabolites of approved drugs interact with targets. Some context: most secondary pharmacology work stops at the parent compound, but metabolic byproducts of a drug can have entirely different binding profiles, and, in some cases, they’re the ones responsible for efficacy (e.g. codeine, which metabolizes into the much more effective morphine) or for toxicity (e.g. acetaminophen, which metabolizes into the very toxic NAPQI). He also mentioned potentially using EvE’s assaying work to develop our understanding of tool compounds, which are chemicals that don’t necessarily have therapeutic value themselves, but are used in research to probe specific biological pathways or validate target function. An ACS page has this to say about it:

While tool compounds have tremendous potential for advancing life science research, they are broadly defined, and it is often difficult for a researcher to determine the best tool compounds to employ during the research process. There remains a great need for more tool compound databases and authoritative sources of information from experts in the field.

And, as always, there is a (very short) Derek Lowe piece on how a commonly relied-upon tool compound moonlights as a ligand for a structurally unrelated receptor, likely muddying the literature the tiniest bit. More work here would almost certainly be deeply appreciated by those in the field.

Overall, EvE really exemplifies the thesis I put forward in a past essay about how smart people in biology should do more boring things. There is very little that is directly sexy about doing an N x M screen, but the impact of doing something like it well can be immense. And I have little doubt that EvE Bio has been doing it well, and will continue to do so in their future projects. If you’re interested in their dataset, check it out here.

Unless the owner of a still-existing patent is looking to expand indications!

In Bill’s words: (1) about half of all GPCRs are sensory receptors (taste/smell), generally regarded as not likely involved in many (or even any) diseases, and anyway smell receptors are hard to work with in HTS because their ligands are compounds with very high vapor pressures (basically, gasses); and (2) only about 170 of the remainder are validated drug targets, and only about 200 (including those 170) have compounds (either drugs or research chemicals) which are known to turn on the receptor (AKA, an agonist). It's pretty nearly impossible to design a meaningful assay for receptor activity if you don't have a positive control compound.

Better antibodies by engineering targets, not engineering antibodies (Nabla Bio)

Abhishaike Mahajan — Wed, 08 Jan 2025 17:00:28 GMT

Note: Thank you to Surge Biswas (founder of Nabla) for comments on this draft and Dylan Reid (an investor in Nabla) for various antibody discussions! Also, thank you to Martin Pacesa for adding some insight on a paper of his I discuss here (his comments are included).

Introduction

Antibody design startups are easily the most common archetype of bio-ML startup out there. It’s understandable why: antibodies are a derisked modality, the fact that CDR loops drive antibody efficacy makes the whole structure more amenable to ML, and there’s a fair bit of pre-existing data. But, because it’s also the most common form of company, it’s difficult to really differentiate one from another.

If you squint, you can make out some vague distinguishing characteristics. BigHat Biosciences does a lot of multi-property optimization, Absci and Prescient have the strongest external research presence, and so on. But there is a vibe of uniformity. It’s nobody’s fault, that’s just the nature of any subfield that has a huge amount of money flowing into it; everyone quickly optimizes. And, unfortunately for those of us who enjoy some heterogeneity, most everyone arrives at the same local minimum.

Because of that, I’ve never really wanted to write about any antibody company in particular. None of them felt like they had a sufficiently interesting story. All fine companies in their own right, but they all tell the same tale: great scientists, great high-throughput assays, great machine-learning, and so on.

This brings us to the topic of this essay: Nabla Bio.

As may be expected, from the earlier sections of this piece, they are an antibody design startup. Founded by Surge Biswas, an ex-Church Lab student who worked with ML-guided protein modeling during his PhD, and Frances Anastassacos, they were launched in 2021 and currently sit at 15 employees. Nabla, on the surface, looks materially indistinguishable from most other antibody design companies. And, in many ways, they are! If you visit their website, you’ll immediately see references to barcoding, parallel screening, and machine learning. It feels like much of the same story as everyone else.

It took me a few months to realize that Nabla has an interesting difference. Nabla is not only an antibody engineering company. They may say they are, and their partnerships may only include antibodies, but I don’t consider that their defining archetype alone.

What they really are, alongside antibodies, is a target engineering company. And that is why, amongst companies they are often compared to, I find them uniquely curious. This target engineering thesis they are pursuing is the subject of today's essay.

But before we talk about them, we first need to talk about targets.

The pain of multi-pass membrane proteins (MPMP)

What makes for a good drug target? Of course, the most critical and obvious factor is therapeutic potential. Will modulating this target actually help treat or cure the disease? This is the fundamental requirement that drives target selection. However, there's often a disconnect between knowing a target is therapeutically valuable and being able to successfully develop drugs against it.

So, given a set of drug targets that all are known to be related to a disease of interest, how do you pick amongst them? There’s a lot of ways you could go about filtering them, including market interest, how much it’s been clinically derisked, and so on. But an often used method for selecting a target is how easy it is to work with.

What are the properties of targets that are easy to work with? Here are three that come to mind:

Stable, because having a protein that maintains a consistent structural and functional identity across time and varying conditions makes in-vitro testing with it far easier. When you're running high-throughput screens or binding assays, you’d ideally like for your target protein to not unfold, aggregate, or adopt wildly different conformations between experiments. Stability, in the end, really translates to predictability.
Well-characterized, because a protein we understand the behavior of is a protein that we can exploit. Binding pockets, conformational changes, and interaction partners of a target can all be helpful things to keep in mind throughout the life cycle of a therapeutic designed to interact with it.
Amenable to being bound to by your therapeutic molecule of choice, because how else would a therapeutic interact with it?

MPMPs fail spectacularly on almost all these counts.

MPMPs are amphipathic proteins, meaning they have both hydrophilic and hydrophobic components. This should make intuitive sense. In their native context, their hydrophobic transmembrane segments are buried in the cell membrane's lipid bilayer, while their hydrophilic regions extend into the aqueous environments on either side. Importantly, MPMPs are not only embedded into the cell membrane, but weave through it multiple times. If it wasn’t obvious, that’s what the ‘multi-pass’ bit of the acronym stands for. This means we have alternating, repeating stretches of hydrophobic and hydrophilic residues.
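
One way to make that alternating pattern visible is a sliding-window hydropathy scan. The sketch below uses the Kyte-Doolittle hydropathy scale, which is my choice of illustration and not something any company discussed here is known to use; the toy sequence and the window size are invented, and the sequence is not a real membrane protein.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=7):
    """Mean Kyte-Doolittle score for each window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Toy sequence (made up): hydrophobic stretch, polar loop, hydrophobic stretch.
toy = "LLIVALLVIAL" + "DKSEQRNDE" + "IVLLAAVLILV"
profile = hydropathy_profile(toy)

# Positive windows look membrane-like, negative windows look solvent-exposed.
labels = ["membrane-like" if score > 0 else "exposed" for score in profile]
```

On a real multi-pass protein, the profile would swing between positive and negative several times, once per pass through the membrane.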

This makes MPMP’s awful to work with.

A fair number of other proteins in the human proteome are water-soluble — adapted to exist in the (mostly) aqueous environment of the cell's cytoplasm or extracellular space. These are fantastic to work with! You can extract them, purify them, and work with them while maintaining their native structure. This has a lot of second-order value: it’s easier to run experiments with them while having consistent results, it’s easier to characterize their structure using experimental determination techniques, and it’s way cheaper to get them ready for whatever assay you want to run with them.

MPMP’s have no such advantages.

When isolated out of their membrane environment to study individually, it’s a struggle to recapitulate their normal behavior. Their hydrophobic segments, normally protected by the membrane's lipid environment, become exposed to an aqueous environment. This is about as stable as you might expect: these proteins tend to rapidly misfold, aggregate with one another, or completely fall apart. If you want to run a high throughput screen, it’ll be a constant challenge to get consistent protein conformations across your assay conditions. If you want structural data, keeping the protein stable long enough to even attempt crystallization or cryo-EM will be enormously difficult. And even if you manage all of that, you're still left wondering whether the structure you're looking at bears any resemblance to how the protein actually exists in its native membrane environment.

Consider two MPMPs, and the pain of working with each:

GPR40 is a GPCR highly expressed in pancreatic beta cells, playing a crucial role in glucose-stimulated insulin secretion. This makes it a highly attractive target for type 2 diabetes, but, unfortunately, developing drugs against GPR40 has been plagued with difficulties, including, but not limited to, difficulty of stable purification, difficulty of making it water soluble, and difficulty of using it in standard binding assays. These challenges are undoubtedly part of the reason that, despite years of research, only one GPR40-targeting small-molecule drug, Fasiglifam, even reached phase 3 clinical trials. It has, unfortunately, since been discontinued due to liver toxicity concerns.
P-glycoprotein (P-gp) is an efflux transporter, another class of MPMP, responsible for pumping foreign substances out of cells. This is a major cause of multidrug resistance in cancer, as P-gp can effectively remove chemotherapy drugs from tumor cells. Developing inhibitors of P-gp has been a long-standing goal in cancer research. Yet, this too has failed, partially due to the difficulty of working with the protein. It's extremely difficult to purify in a stable, functional form outside of its native membrane environment (which is, funnily enough, a fact unique to human P-gp! Rodent P-gps are far more stable). As a result, structural studies have been incredibly challenging, with the first high-resolution structure of human P-gp only being reported relatively recently (2018), decades after its discovery in 1976. There are, as of 2024, no approved drugs that successfully inhibit P-gp.

Of course, drug development efforts are rarely stymied by a single reason alone! It is rare that a protein simply being ‘hard to stabilize’ outright ends a program, especially because a potential solution to the above problems is to simply do whole-cell screening (which has its own challenges!), but it certainly doesn’t help.

The tragedy is that while MPMPs are one of the most difficult classes of protein to study, they are often incredibly good targets. This shouldn’t be a surprise if you consider that their dysfunction has been implicated in a wide variety of biological pathways. Pharmaceutical companies have obviously already taken note of this and, as a result, MPMPs make up ~40% of currently known drug targets, despite making up only 23% of the human proteome; a testament to their clinical relevance.

Yet, only two approved antibodies target them: Mogamulizumab, targeting CCR4 for lymphoma, and Erenumab, targeting the CGRP receptor for migraine prevention. And while there are far more approved small molecules that target MPMPs (~20%), antibodies can be a fair bit more efficacious for some targets, so we’d ideally like to rely on that modality as well.

All this adds up to a depressing situation: MPMPs are incredibly important, valuable drug targets, but our ability to develop protein-based drugs against them is severely hampered by our inability to work with them effectively.

What can we do about this?

Well, one more note before we move on, because I also had this question: could you simply...not deal with the MPMP at all, at least not in their entirety? Don’t MPMPs have extracellular (read: soluble) domains that can be expressed and studied in isolation? The membrane-spanning regions might be critically important for the protein's native function, but they're irrelevant if your goal is simply to bind and block (or activate) the protein from the outside. We could just use those, and happily run our binding assays and structure determination and whatever else!

And the answer is…that my assumption is wrong. MPMPs, in fact, rarely have nicely structured extracellular regions. Single-pass membrane proteins certainly do! But the extracellular bits of multi-pass membrane proteins are, unfortunately, noodles of protein that are similarly a nightmare to work with.

Okay, now we can move on.

Computational design of MPMP proxies

Here’s an idea: could we not simply redesign our messy, non-soluble MPMPs to simply…be soluble? The answer has, historically, been ‘yes, but it’s hard.’ In 2004, someone did it for a bacterial potassium channel protein (KcsA). In 2013, another group did it for the human mu opioid receptor (MUR).

But it’s also kinda…bespoke. There’s a lot of custom design, a lot of thinking about interatomic potential energies of this one specific protein, and so on. Very little of the work from any one paper’s study of a protein seems to easily translate to another protein. This is a problem, given that there are on the order of several thousand potentially useful MPMPs, and we’d ideally like to not spend graduate student years on creating soluble analogues of each one.

Is this possible to automate?

There is a paper from May 2023 that suggests it is! It is titled ‘Computational design of soluble functional analogues of integral membrane proteins’, which has some big names on the author list: Martin Pacesa (BindCraft), Justas Dauparas (ProteinMPNN), and Sergey Ovchinnikov (he’s Sergey). What exactly do they do?

The pipeline starts with a target structure of some membrane protein and makes an educated guess at an initial sequence based on secondary structure preferences: alanines for helices, valines for β-sheets, and glycines for loops. They then randomly mutate 10% of these positions to introduce diversity. This sequence gets fed into AlphaFold2, and a composite loss function measures how well the predicted structure of the current sequence matches the target. From this loss, they compute gradients that indicate how to modify the sequence to get closer to the desired target. These gradients update a position-specific scoring matrix (PSSM), which is then used to update the sequence for another round of structure prediction. This is repeated 500 times, and the whole procedure is referred to as AF2_seq.
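
The initialization step described above, before any AlphaFold2 call, is simple enough to sketch. Everything named here is hypothetical: the `H`/`E`/`L` secondary-structure codes, the function name, and especially the mutation rule, since how a replacement residue is picked isn't covered here, so the sketch just draws uniformly from the other 19 amino acids. The gradient-based AlphaFold2 loop itself is not reproduced.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Starting residue per secondary-structure class, as described above.
# H = helix -> alanine, E = beta-strand -> valine, L = loop -> glycine.
SS_TO_RESIDUE = {"H": "A", "E": "V", "L": "G"}

def initial_sequence(secondary_structure, mutation_rate=0.10, seed=0):
    """Build a starting sequence, then mutate a fixed fraction of positions."""
    rng = random.Random(seed)
    seq = [SS_TO_RESIDUE[ss] for ss in secondary_structure]
    n_mutations = round(mutation_rate * len(seq))
    for pos in rng.sample(range(len(seq)), n_mutations):
        # Assumption: replacement is uniform over the other 19 residues.
        seq[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return "".join(seq)

# Made-up 50-residue topology: loop, helix, loop, strand, loop.
ss = "L" * 5 + "H" * 20 + "L" * 5 + "E" * 10 + "L" * 10
start = initial_sequence(ss)

# The unmutated guess, for comparison against the mutated starting point.
baseline = "".join(SS_TO_RESIDUE[s] for s in ss)
n_changed = sum(a != b for a, b in zip(start, baseline))
```

The output is deliberately a poor sequence; its only job is to give the optimization loop a diverse starting point to improve on.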

At the end of the process, we have a sequence and structure pair. What should we expect from this sequence?

Initial designs by AF2_seq exhibited high sequence novelty compared with natural proteins and a low fraction of surface hydrophobics; however, none could be expressed in soluble form.

Well, the AlphaFold2 process resulted in few surface hydrophobic residues, getting us slightly closer to a soluble protein, but experimentally still not what we want. At this point, the authors redesign the sequence using a version of ProteinMPNN trained on soluble proteins (solMPNN). Why not just the usual ProteinMPNN? For the reason you might expect:

We attempted to optimize the sequences using the standard ProteinMPNN model, but the resulting sequences consistently recovered the surface hydrophobics, probably owing to the similarity of the topology to that of membrane proteins encountered during training.

And presto, we have a pipeline for creating soluble versions of arbitrary membrane proteins! They tested this on GPCRs, alongside a few other membrane proteins, finding that 1) a high-confidence soluble sequence for a GPCR could be found, and 2) identical (and important) structural motifs on the nonsoluble version could be found on the soluble version.

Why not just let solMPNN redesign everything and skip the AlphaFold2 step? It’s a fair question and one I don’t have an answer for. One reason may be that having the ability to modify the structure slightly (via AlphaFold2) to account for the inevitable structural deviations when going from nonsoluble → soluble is helpful before solMPNN redesign, but that’s just a guess.

Edit: Martin (one of the lead authors) saw this article and answered the question! Here is his comment:

The reason we use AF2seq+MPNN is because if you use only MPNN it will keep the hydrophilics in the core. Then hydrophilics outside+hydrophilics inside = collapsed protein.

Importantly though, they didn’t show that these soluble versions of membrane proteins were actually good for anything! In the end, that’s what we really care about, that the soluble, easy-to-work with version of the MPMP can actually help accelerate biological research in the dimensions we care most about. They do briefly touch on this topic as an area to explore in future research though.

Another exciting perspective is the creation of soluble analogues of membrane proteins that retain many of the native features of the original membrane proteins, such as enzymatic or transport functions, which could greatly accelerate the study of their function in more biochemically accessible soluble formats. Similarly, this would be critical to facilitate the development of novel drugs and therapies that target this challenging class of proteins, which remain one of the most important drug targets.

Prescient on their end!

Because someone did end up testing this out: Nabla Bio. This brings us to our next section…

Joint Atomic Modeling

JAM, or Joint Atomic Modeling, is a technical report produced by Nabla Bio on November 11th, 2024.

What is JAM? Really, it refers to a generative model that can handle both sides of a critical problem in drug development: the membrane protein target and the antibody that needs to bind to it. When given partial information about a protein complex (whether sequence, structure, or both), JAM can "fill in" the missing pieces while respecting both hard constraints (things that must be preserved) and flexible constraints (things that provide guidance but can be modified). Pretty much no details on what JAM actually is are given in the paper, but those details aren’t super important for what we’re going to be concerned with.

The technical report includes a lot of details on the capabilities of JAM, including:

Epitope specificity in generating VHH’s (not new, but still good to see people focusing on).
Very, very good binding affinities in designed antibodies (sub-nanomolar range)
Test time compute being useful for antibody engineering.

Overall, while interesting and strong in their own right, these are capabilities that aren’t particularly alien amongst many other protein foundation-y model papers these days.

But the most unique section isn’t quite about antibodies at all. As the title of this piece may have implied, it is about modeling the target. The section is titled ‘De novo design of antibody binders to two multipass membrane protein targets: Claudin-4 and a GPCR, CXCR7’. It is, as far as I can tell, the first time anyone has demonstrated the utility of machine-learned soluble proxies of MPMP’s for anything.

What do they do?

For both CLDN4 and CXCR7 (both MPMPs), Nabla used JAM's protein design capabilities to create soluble proxy versions of them. Specifically, the transmembrane region was replaced with a stable, soluble scaffold and the extracellular structures were preserved. Which is neat! We have an all-in-one model for redesigning membrane proteins, none of this AlphaFold2-ProteinMPNN pipeline stuff.

But we still haven’t gotten to connecting all of this to utility.

This is where the validation data becomes particularly interesting. For each proxy (deemed solCLDN4 for CLDN4, and solCXCR7 for CXCR7), Nabla demonstrated both structural stability (via ‘monomericity’, which refers to the percentage of protein that exists as single, individual units rather than clumping together into larger aggregates) and, crucially, functional relevance via binding to known binders:

solCLDN4:
- 87% monomericity after one-step purification
- Maintained binding to a known anti-CLDN4 antibody
solCXCR7:
- 85% monomericity after one-step purification
- Maintained binding to its native ligand SDF1α

!!!!!!

Functional relevance!!! Soluble versions of these multi-pass membrane proteins continue to bind to things that should bind to them!!!

But some skepticism may still be warranted: what if the anti-CLDN4 antibody and native ligand SDF1α binding is fully coincidental? If there truly was a one-to-one correspondence between Nabla’s proxies and the real protein, a screening campaign on top of the soluble protein would yield something that also binds to the native version of the protein.

And Nabla did exactly this!

For CLDN4:

In our CLDN4 de novo design campaign, screening de novo VHH designs with solCLDN4 on-yeast successfully identified three binders that exhibited EC50s of 10, 22, and 56 nM for native CLDN4 on overexpression cell lines (Fig. 6e). Among these, the best-performing binder also showed effective recognition of CLDN4 on OVCAR3 ovarian cancer cells.

For CXCR7:

In our CXCR7 campaign, screening de novo VHH designs with solCXCR7 on-yeast successfully identified a strong binder that recognized native CXCR7, achieving an EC50 of 36 nM when expressed recombinantly as a monovalent VHH in an E. coli cell-free system and tested against PathHunter CXCR7 cells.

This validation story is remarkably complete. Their soluble proxies maintained stability, bound to known ligands, and could actually be used to discover new binders that work against the native proteins. The specificity data is particularly compelling. The CLDN4 antibodies showed >100x selectivity over closely related family members (CLDN3, CLDN6, and CLDN9), despite 85% sequence identity in the extracellular regions. This suggests their proxies maintain the subtle structural features that distinguish these closely related proteins!

From start to end, here is the pipeline for this. Again, few details.

Conclusion

I typically end these types of company-overview articles with a note on the potential risks of the company, but I’ll be skipping that here. Nabla is so new that it’d be difficult to give a strongly informed guess as to how they will fare. What is included in this essay is a strong validation of a thesis they are pursuing (target engineering), but not necessarily of the company at large, which, from a cursory glance at the JAM paper, is obviously concerned with a lot more than targets alone.

But what of this target engineering stuff? It’s clearly an interesting idea, feels like it should theoretically have value, and empirically works given the results of the JAM paper. How much juice is left to squeeze there?

Here’s the obvious bull case: there are only two approved antibodies that target MPMPs, partially due to how hard MPMPs are to work with, so there’s clearly room for more. If the Nabla bet really pays off, they potentially get access to first-in-class targets for a ton of different diseases or can sell access to soluble proxies to pharmaceutical partners, both of which likely have huge payoffs.

But as always, there’s a bear case with these.

First, there's the question of target selection. While MPMPs represent ~40% of current drug targets, this statistic might be misleading when thinking about the addressable market for Nabla's technology. Many of these targets are potentially already being successfully pursued with small molecules. The real opportunity lies in targets where antibodies would provide meaningful advantages over small molecules, or where small molecules have failed. I don’t know the answer to how large that number is! But my guess is that it narrows the playing field a bit.

Surge, the founder of Nabla, had some thoughts on this topic of antibodies versus small molecules, which I’ve attached here:

People are generally more excited about antibodies than small molecules for a couple first principle reasons:
1. Much higher specificity. There's a lot more engineerable surface area on an antibody. This allows you to design binders with high specificity, which is critical for MPMPs, many of which look very similar to each other and thus lead to off-target toxicity/side-effects if your drug is non-specific. This is a major issue with small molecules.
2. Extended half-life: Antibodies stay in your bloodstream for weeks vs < 1 day for small molecules. This means less frequent dosing.
3. Antibodies as handles for other functions. You can use antibodies to e.g. recruit t-cells to a cancer overexpressing a GPCR or claudin, or use that antibody binding head as a CAR, or use the antibody binding head in an ADC. You can use the antibody to recruit other immune cells (important concept generally with cancer).
4. Well-trodden formulation and manufacturing path. These can be dealbreakers in DD, but for a well behaved antibody it's relatively standard. For a small molecule, it's a different process each time, and a much more frequent source of failure.

So, in the end, small molecules may be less competition than one would naively assume. There’s one more issue which is a little nuanced, and a bit out there, but it feels worth mentioning.

Returning to Nabla’s JAM paper, when designing binders for CXCR7, they found something curious: their best binder had aggregation tendencies that might limit its developability. The authors make an observation: this aggregation propensity mirrors that of CXCR7's native ligand SDF1α. This raises a question about the fundamental nature of GPCR targeting: is there an inherent tension between effective binding and developability? The features that allow for engagement with GPCRs may inevitably cause issues.

If this is indeed true, it may be the case that even if you can design soluble proxies of membrane receptors, in-vitro screening assays that rely on those isolated proxies will also cause issues. Here's the chain of logic:

You create a soluble proxy of your MPMP target
You screen for binders against this proxy
The binders that show the strongest affinity are likely to be those that best mimic the natural binding mode
But if that natural binding mode inherently requires "sticky" interfaces...then you're essentially selecting for problematic developability properties by design.

In the end, while Nabla may have first in-class access to targets, binders to those targets may also be awful to work with.

Is this a huge issue? I don’t think so, especially since they didn’t point out that this occurred with binders to solCLDN4. The fact that this phenomenon wasn't observed in their non-GPCR MPMP work indicates that this isn't a universal issue across all multi-pass membrane proteins.

However, for GPCRs specifically, it points to an interesting constraint on the potential of target engineering, since whatever binders you find to that target will have downstream issues that must be amended. Of course, JAM will still be helpful! Instead of struggling with target protein stability and assay development, drug developers will be wrestling with optimizing developability properties of their hits, which instinctively feels like a faster problem to iterate upon. Overall, this point is very much splitting hairs, but maybe an interesting thing to think about.

That’s it for this piece! Excited to see what else comes out of Nabla.

The unreasonable effectiveness of plasmid sequencing as a service (Plasmidsaurus)

Abhishaike Mahajan — Mon, 07 Oct 2024 13:10:04 GMT

Note: thank you Mark Budde, cofounder and CEO of Plasmidsaurus, and Maria Konovalova, a growth/marketing/talented-person at Plasmidsaurus, for talking to me for this article! Also thank you to Eryney Marrogi, who helped answer some of my dumb questions about plasmids.

Introduction

Here’s some important context for this essay: it really, really sucks to start a company in biology.

Despite billions in funding, the brightest minds the world has to offer, and clear market need, creating an enduring company here feels almost impossible. Some of this has to do with the difficulties of engaging with the world of atoms, some of it has to do with the modern state of enormously expensive clinical trials, and some of it still can be blamed on something else. To some degree, this is an unavoidable facet of this field; working in it means you’re here more for the ‘love of the game’ than anything else.

But is it necessarily fair to equate all for-profit life-science endeavors with grueling, decade-long struggles to bypass scientific obstacles? Is there a world in which life-sciences startups can have a more traditional tech culture ethos in how they approach things? Unfortunately, probably not for a startup aiming to do the traditional therapeutics play.

But if we broaden our scope to include service-provider biotechs, I can offer at least one example: Plasmidsaurus.

Plasmidsaurus was started in 2021 and is currently run by Mark Budde. Some historical context: Mark was the founder of a separate, but related company called Primordium Labs, which merged with another separate company called SNPsaurus. They both were largely doing the same thing, so, circa 2022, they agreed to combine underneath the Plasmidsaurus name.

People who currently work on the wet-lab side of biotech have likely not only heard of this company, but are also loyal customers. On the flip side, I would guess that not even computational folks at biotech companies have heard of them, much less anyone in other industries! This is a shame, and something I’ve been wanting to rectify since I first stumbled across this company.

At their start, they really only did one thing: sequence plasmids. We’ll get into what exactly that means later on in the post, but you can think of it as a sort of on-demand quality assurance for certain lab processes. And the way they do that plasmid sequencing isn’t particularly novel. They use a nanopore sequencer (again, more details later on) to perform it, which is, one, a technology that someone else developed a decade ago, and two, a tool you can buy directly from the people who made it.

The workflow here is simple: send them 15 bucks, mail them your plasmids, and within a day, you get back some quality checks on your plasmids of interest. They have leaked into sequencing things outside of plasmids alone (microbial and AAV genome sequencing too), but plasmids are their primary claim to fame.

Despite how uninspiring they seem, how much of a moat they seem to lack, and how cheap their services are, Plasmidsaurus is enormously successful.

For one, they are almost entirely bootstrapped. They have never taken venture capital funding, at most participating in a six-week accelerator run by Fifty Years and taking a small seed investment from them 3 years ago — primarily for networking purposes rather than for money. Even for the niche that Plasmidsaurus is in, a largely bootstrapped biotech is basically unheard of! They were revenue positive from day one.

Secondly, despite the bootstrapping, they are consistently growing. They are currently at 40 employees, with 225% growth over the last 2 years and 34% growth over the last 6 months (according to LinkedIn). They currently have nine sequencing labs across the globe and have plans to open up even more.

Thirdly and finally, their customers are incredibly happy with them. This is something you don’t see amongst very many biotech services providers, at least the ones I’m aware of. I know of Twist Bioscience as another one that inspires the same level of enthusiasm, but, besides that, it’s uncommon. Here is a picture of several Reddit comments I found, all singing their praises. And these aren’t cherry-picked; I struggle to find even a single post that says something bad about Plasmidsaurus.

The combination of these three characteristics makes Plasmidsaurus unique amongst nearly every biotech company I’m aware of. This essay will explain the scientific fundamentals of the company, what niche they occupy, how they managed to grow so much despite their simplicity, and, as always, the risks that the company has.

Background

What is a plasmid?

Plasmids are small, circular pieces of DNA that exist separately from an organism's main chromosomal DNA. They naturally occur in bacteria, but not in most other domains of life (with some exceptions).

Why do they exist? Answering that question has the same challenges as asking why anything in biology exists, but one possible reason is their role in horizontal gene transfer. This is the practice of a bacterium sharing potentially useful genes with neighboring bacteria, and the structure of plasmids lends itself well to being shunted out of one bacterium and absorbed by others. Of course, they likely evolved for a plethora of reasons beyond gene transfer alone, but we’ll accept the surface level explanation for now.

Why do people care about plasmids? First, a brief tangent: most DNA you’ll find in a cell will be somewhat of a mess; all sorts of uncharacterized regulatory elements, genes, and the like. Of course, it all makes sense to the cell that holds it, but it is a bit incomprehensible to a human. This is often a barrier to scientists who are interested in modifying behaviors of that cell, such as using it to produce a specific protein. But, as it happens, most cells generally don’t care about where a piece of genetic material came from; even if it’s not part of their own DNA, just floating around! If genetic material shows up somewhere where there are transcription proteins around, cells will happily use it. This is, shockingly, as true for bacteria as it is for mammalian cells, even though mammalian cells don’t naturally have plasmids!

This is the primary application of a plasmid: a single piece of genetic material, stripped down to the essentials, and extremely modifiable by a researcher, used for the purpose of modifying cellular behavior. A single plasmid can encode multiple genes, with regulation logic on top of those genes, and continue to survive in cells even as they divide. Once the plasmid enters a cell nucleus, which is done by placing the plasmid near a cell and applying an electric field to increase the cell membrane’s permeability (along with other possible techniques), it will be interpreted the same as any other source of genetic material. A scientist needn’t modify the raw DNA of a cell to change how it functions; introducing a plasmid into the cell is often sufficient!

Again, given how useful they are, plasmids are present in nearly every modern biology experiment. Which is why it’s a bit troubling that an immense number of plasmids likely have errors in them.

Errors in plasmids

In a bioRxiv preprint from August 2024 titled ‘Prevalence of errors in lab-made plasmids across the globe’, the authors stumble across a worrying statistic:

We found that approximately 15% of plasmids had significant design errors, and about 35% contained sequence errors in functional regions…in total, we estimate that 45-50% of lab-made plasmids have undetected design and/or sequence errors that could potentially compromise the intended applications. Indeed, we suspect that this figure may underestimate the true scale of quality issues in lab-made plasmids because we had asked our clients to check the designs and sequences of their plasmids before submission to us, and also because they were paying for our services utilizing their plasmids.

The ‘design errors’ here aren’t a massive deal, since those are more along the lines of ‘the person who designed the plasmid messed up’. Plasmids, despite their simplicity, have their own biophysical limitations. And it’s very possible to accidentally create a plasmid that is a-priori known to be bad at its job. This is a problem, but also an avoidable one.

Some examples of (avoidable) design errors

The much more concerning bit is the sequence errors. Those aren’t something you can predict in advance; they just happen accidentally as a result of biophysical instability, accidental contamination, or chemical bad luck. From the authors:

Notably, about 35% of these plasmids (91/259) displayed sequence variations from the senders’ reference (Figure 1B). Among them, we identified 89 point mutations, 35 deletions, and 19 insertions, with some plasmids containing multiple types of error.
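As a quick sanity check on the quoted figures, here is a minimal sketch that recomputes the headline percentage from the counts in the excerpt. The variable names are my own; the only inputs are the numbers quoted above.

```python
# Counts taken directly from the quoted excerpt of the preprint.
plasmids_with_sequence_errors = 91
plasmids_sequenced = 259
error_counts = {"point mutation": 89, "deletion": 35, "insertion": 19}

fraction_with_errors = plasmids_with_sequence_errors / plasmids_sequenced
total_errors = sum(error_counts.values())

print(f"Fraction of plasmids with sequence errors: {fraction_with_errors:.1%}")
print(f"Total errors identified: {total_errors}")

# The error total exceeds the number of affected plasmids, which is
# consistent with the authors' note that some plasmids carry several errors.
print(f"More errors than affected plasmids: {total_errors > plasmids_with_sequence_errors}")
```

The ratio works out to roughly 35%, matching the excerpt, and the three error categories sum to more than 91, which fits the remark that some plasmids contained multiple types of error.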

Stuff like this can dramatically change the takeaway message from biology studies. After all, single mutations can destabilize an entire protein! It’d be inaccurate to fault plasmid errors as the singular reason for why biology is undergoing a replication crisis — there are far more possible sources — but it certainly doesn’t help.

The value add of Plasmidsaurus

It is within the extreme importance of plasmids, alongside how error prone they are to work with, that Plasmidsaurus found its niche: in ensuring your plasmids actually match up to what you want them to be.

Let’s restate their business model: you send them the plasmid you’ve produced, give them $15, they sequence it, and tell you if it matches what you expect the plasmid to look like — all usually in less than a day. With that, we can go through some immediate questions.

How do they sequence it? As mentioned before, using a nanopore sequencer. Here are some more details on what that exactly is, but all you really need to know is that it is fast, cheap, and a very convenient way to sequence things. The last part is important; it means scientists using a nanopore need to do very little ‘prep’ work for their plasmid to be successfully sequenced. Nanopores can just handle it, whereas other sequencing methods (e.g. Sanger sequencing) require some extra work and have more failure modes.

Did Plasmidsaurus come up with nanopores? No, the whole concept is the intellectual property of Oxford Nanopore, who independently sells nanopores; Plasmidsaurus is just partnered with them.

Wait…so I can just buy a nanopore? Is it just really expensive? It’s actually really cheap for a sequencing machine. Oxford Nanopore's MinION device, their most portable sequencer, costs around $2,000 for the starter pack. For reference, the cost of non-nanopore sequencing machines can reach six figures. To be fair, the materials for each nanopore sequencing run can also cost hundreds of dollars, but that still isn’t much for many labs.

Then…can’t I just do what Plasmidsaurus is doing myself? What’s the point of them? This is an incredibly fair question and exemplifies what I find really interesting about Plasmidsaurus.

So, nanopore sequencing has two unique characteristics.

An improvement on the state-of-the-art. Because sequencing is so widespread in biology, anything that can help push sequencing even somewhat further is incredibly useful. Nanopores are simultaneously improvements in cost, convenience, and sequencing length over competing approaches.
Culturally new. Nanopore sequencing is a newcomer to the sequencing scene. Traditional methods like Sanger sequencing or short-read NGS platforms (e.g. Illumina) have been around for decades, while nanopores were commercially introduced in 2014. Because of this, it hasn't had time to become a standard in-house technique for most labs. While the nanopore platform becomes better year-after-year, currently, most researchers find them to be incredibly hard to use. As an example of their finickiness, they were famously found to work better in a dark room (circa 2022) for reasons that are still badly understood. Oxford Nanopore, the developer behind the technology, is partially to blame; the company is known as a particularly egregious example of ‘incredible technology, terrible market execution’. But it’s also just time; new tools take a while to seep into the normal wet-lab workflow.

If a certain tool is useful, but hard to use, the lack of entrenchment of that tool creates opportunities for specialized service providers. It is upon this hill that Plasmidsaurus has built their business.

Of course, Plasmidsaurus isn’t just a parasite on top of Oxford Nanopore; there is a huge amount of optimization they have done internally with nanopores to ensure high quality data at a low price point. Specifically, they have built a vast network of error-checking software and optimized protocols, none of which is public knowledge. Moreover, they also have an established relationship with Oxford Nanopore, allowing them much deeper insight into new products, chemistries, and tools that the company is developing. And even a chance to influence them! Finally, they also work hard to ensure logistical efficiency; there are hundreds of Plasmidsaurus drop-boxes and nine sequencing labs placed around the world to ensure that sending them a plasmid + getting back the results is as fast and convenient as possible.

But they are nevertheless a strong deviation from typical biotech companies. You could draw an interesting parallel from Plasmidsaurus to Databricks, a now 43-billion dollar company that transformed the use of Apache Spark in the data engineering world.

You could set up and manage your own Spark cluster, dealing with all the intricacies of distributed computing, resource allocation, and optimization. Or you could turn to Databricks, pay a premium, and focus on your data munging code without worrying about the underlying infrastructure. Some might argue that relying on Databricks creates a single point of failure and discourages deep understanding of the technology. But others would contend that Databricks has been instrumental in making Spark accessible to a wider audience, allowing analysts to concentrate on deriving insights rather than managing complex systems.

Plasmidsaurus isn’t dissimilar to companies like Databricks. By leveraging a relatively new sequencing platform, optimizing their processes to offer an incredibly low price point, and ensuring rapid turnaround times (less than a day), they are seamlessly integrating themselves into the workflow of many labs. Much like how Databricks didn’t invent Spark (though, admittedly, one of the founders did so during their PhD), Plasmidsaurus isn't inventing new sequencing methods; they are instead making already-invented methods much, much easier to work with, and charging a very small premium on top of that. Labs that would otherwise never do comprehensive plasmid quality checks have now incorporated it into their typical workflow. It’s very much like a traditional software-as-a-service business!

While this sort of mentality is common enough amongst tech-centric companies to the point where it’s considered old-hat, it’s genuinely innovative in the context of biology. Though wet-lab research contains traces of this outsourcing culture — otherwise, service providers wouldn’t exist at all — there is, anecdotally, a stronger hesitation with relying on them. Good service providers are hard to find (and their ‘goodness’ can disappear over time), often unreliable, and can take weeks to get information back from. While labs will rely on these external companies to handle experiments that they are literally incapable of doing with the equipment they have in-house, the preference is to keep things internal.

Plasmidsaurus’ growth depended on changing that culture.

Their growth

Interestingly, Mark noted that nanopore sequencing for plasmids was not on most people's radar when he first began offering it as a service in 2021. In fact, nobody believed it could be a viable business.

Nanopores were thought of as extremely error prone, something that was impossible to easily address. Moreover, few believed that changes to one area of the plasmid could affect other regions. As in, full plasmid sequencing was overkill, partial sequencing using traditional methods was sufficient. Mark had worked with using nanopores for plasmid sequencing years earlier — back in 2015 when they were first released to application-only users — and had first-hand experience in knowing that both of these opinions were fundamentally wrong. Not only was full plasmid sequencing far more important than most believed, nanopores with the right bioinformatics tooling could achieve it with high accuracy.

At the time, he vaguely thought a business could be made here. Six years later, he decided to do exactly that, initially running the whole operation out of his garage.

Because of people’s doubts and the aforementioned reluctance of biology research to rely on outsourcing, Mark said that the early days of Plasmidsaurus (named Primordium Labs at its founding) were dominated by a Do Things That Don’t Scale mentality. In other words, lots of manual, grassroots user acquisition, hyper-focused on making sure they were happy. Universal amongst tech founders, but rare to find in biology founders, who often assume that the market will come to them if their products are Good Enough.

When the company first began, Mark would reach out to old lab-mates at Caltech (where he did his postdoc) and offer to sequence their plasmids for free. He’d walk on over to nearby industry labs with candy and a sales pitch for why they should use his services. He primarily targeted top, Nobel-prize-winning research groups (as they were typically both open-minded and had money to spend) with the hopes that other labs would follow their lead.

I also talked with Maria Konovalova — who manages growth at Plasmidsaurus — about this further. She emphasized how the playful aspect of Plasmidsaurus marketing was surprisingly fruitful in gaining more attention in communities they care about. For example, Plasmidsaurus made cereal boxes that had sequencing ads on them to hand out at conferences!

Plasmidsaurus has historically done very little ‘traditional’ marketing — no brochures, few cold reach-outs, and (until recently) no sales team at all. Their marketing seems intended to simply pique your curiosity as to who this dinosaur-centric company is. As another example, they released this video of a guy in a dinosaur suit doing BMX stunts while carting around a box of plasmid sequences. All to announce their new lab in London, which further drove down time-to-sequence for companies in the area.

Marketing like this is risky! It’s hard to a-priori tell whether this sort of branding will have unintended second order consequences. It’s the norm for biotech service companies to have a stoic, corporate, and dispassionate vibe about their whole business for a reason; there is always the risk of not being taken seriously. But in Maria’s eyes, the bet on a more human, personal element to their company has paid off.

Of course, marketing should be backed up by utility, especially in hard science fields where money is tight. Plasmidsaurus combined this marketing with an aggressive focus on improving customer satisfaction. There were no back-and-forth emails on experimental specifics of the sequencing, only an online form for a customer to fill out at their convenience. All pricing was upfront and not hiding behind a sales representative. There was a guaranteed turnaround time of <1 day for biotech hubs, 1-2 days in areas where there isn’t a lab nearby, a level of speed assisted by the 500 dropboxes and 9 sequencing labs they have placed around the world. Because of how surprisingly pleasant Plasmidsaurus was to work with compared to almost every biotech service provider, anyone who tried them once was immediately made a loyal customer.

I think their whole growth story is so interesting. Plasmidsaurus is almost doing a form of cultural arbitrage; applying the fast-paced, customer-centric ethos of tech startups to the traditionally slow-moving world of biotech services. In a field where DIY is often seen as a virtue, they are making it actively desirable to outsource certain tasks. All by using the principles that other fields have known about for decades!

And, even in year 3 of their existence, they still are focused on improving. Increasingly, their sequencing labs are a fleet of automated Opentrons machines, further driving up the speed of rendered services. They are aggressively investing into exploring sequencing areas outside of pure plasmid sequencing, including sequencing AAV genomes and sequencing microbial cultures, both of which have become solid businesses in and of themselves at this point. Finally, most curiously, Mark suggested to me that they hope to eventually make a software-focused play into assisting the analysis of Plasmidsaurus-produced data. The end goal of everything being to make life-scientific research faster, easier, cheaper, and more replicable.

Potential risks

I’ve been effusively positive about Plasmidsaurus in this essay — perhaps annoyingly so — but all biotechs are fundamentally risky endeavors. Even Plasmidsaurus, despite how good of a position they are in today, have their own bets they are taking. I can think of two distinct risks.

Commoditization of nanopore sequencing.
1. Right now, nanopore sequencing is somewhat of an arcane art. As we’ve discussed, that’s partially what makes Plasmidsaurus so valuable, they have wrapped it up in an easy-to-use service for use in plasmid sequencing. But what happens when nanopore sequencing becomes as straightforward as running a PCR? Oxford Nanopore, the company behind the technology, is slowly improving their documentation of the whole system. If, in a few years, any lab tech can run nanopore sequencing easily, the barrier to in-house sequencing drops dramatically. At that point, most labs might start to question why they're outsourcing plasmid sequencing at all.
2. While I view this as the primary risk that Plasmidsaurus faces, it also isn’t too bad. After all, sequencing is just one part of the pipeline that Plasmidsaurus has built. Having a good user experience — something that Plasmidsaurus has nailed — over what seems like a commoditized process can lead to a lot of otherwise unexpected ‘stickiness’ amongst users. But biotech is also a money-constrained business, so services like Plasmidsaurus may be the first to be cut in economic downswings.
Competition from other sequencing companies.
1. Big players like Illumina or even Oxford Nanopore themselves could start offering specialized plasmid sequencing services. With their vast resources and established infrastructures, they could potentially undercut Plasmidsaurus on price or bundle plasmid sequencing with other services they already provide. Imagine if labs could get their plasmid sequencing done alongside all other genomic services in one fell swoop—it might be tempting for them to switch. Moreover, new startups might emerge, inspired by Plasmidsaurus's success but aiming to do it better, faster, or cheaper.
2. But, again, while this is a risk, I view it as relatively minor. The relationships that Plasmidsaurus has built with their customers are going to be hard to replicate. Labs aren't just buying a sequencing service; they're buying reliability, speed, and a hassle-free process that integrates smoothly into their workflow, and even if a new company claims to offer that, only Plasmidsaurus has actually proven themselves. That will likely buy them a lot of time, but the onus is on them to continue innovating to keep that relationship.

Overall, Plasmidsaurus is a startup that has done incredible things, and I fully expect them to continue that trend.

Most importantly though, learning about Plasmidsaurus has been a big update to my mental conception of ‘all biotech startups need to solve deep scientific problems and acquire tens-of-millions of dollars in venture funding to get anywhere’. Biology is a huge area that goes beyond therapeutics alone; there are incredibly impactful and profitable businesses to be made across this field. And some of them can resemble the ethos of software development far more than the ethos of drug development.

If you work in a lab and want to check that your plasmid is accurate, you should try out Plasmidsaurus!

Creating the largest protein-protein interaction dataset in the world (A-Alpha Bio)

Abhishaike Mahajan — Mon, 12 Aug 2024 13:42:46 GMT

I think this startup is promising and I hope they succeed! This is not a sponsored post, not meant to be anyone’s opinion other than my own, and is not investment advice. Here is some more information about why I write about startups at all.

Also, I wrote this post and then accidentally stumbled into a conversation with the CTO (and co-founder), Randolph Lopez! Consider almost everything in this article a view into what’s publicly available about A-Alpha Bio, and the ‘Addendum: Some notes by the CTO’ section as a view into some of the yet-to-be-released stuff.

That last section will clear up some questions/misunderstandings I got from purely the publicly available info, so make sure to read it!

Introduction

Understanding how proteins interact with one another is hard. Genuinely, like stupidly hard. Way, way harder than you’d naively think it is.

One might naturally assume that because Alphafold2 largely solved the monomeric protein structure prediction problem, it let us make headway on the protein interaction problem.

After all, it’s the same underlying physics: ionic bonds, hydrogen bonds, van der Waals forces, pi-pi interactions, and so on. And it did, but the multimeric version of Alphafold2, and even Alphafold3, is still quite bad at the whole problem.

Want to know the structure of the human insulin receptor? Alphafold2 is decent at predicting it, no need to crystallize anything.

Want to know how well a 13 amino acid peptide binds to the receptor? You’ll probably need to re-run Alphafold2 with dozens of different seeds, maybe do MSA subsampling, and a laundry list of other ‘hacks’ to get these models to give a somewhat correct structure and, thus, a proxy for binding affinity. And, after all that, you’ll probably still need to do a wet-lab binding assay to confirm the inevitably noisy prediction.

A lot of this has to do with the complexity of so-called protein-protein interactions, or PPIs. Even though the space of forces is the same, the distribution of ‘possible’ structures grows exponentially. Atom-sized deviations can cause catastrophic failures in predicted final structures, conformational flexibility can lead to unpredictable binding modes, and the sheer number of potential interaction surfaces explodes combinatorially.

Given enough data though, models should still be able to grasp the complexity. But PPI datasets are notoriously tiny. The latest collation of binding affinity data (PDBbind+) amounts to just 3,176 proteins in total. While there are ML models built off such datasets, they are, at best, academic curiosities, and not something essential to have in one’s protein design toolbox.

A-Alpha Bio is a biotech startup trying to solve the data problem. The company was created in 2017 by two graduate students at the University of Washington, David Younger and Randolph Lopez, the former of whom is a Baker Lab alumnus. Their pitch is fundamentally a wet-lab innovation, which I’ve written about before as being a common characteristic among ML-bio companies I’m excited about.

In this essay, we’ll discuss what this innovation is, the computational angle they have, what they are using the innovation for, and the risks the company is taking.

The product

AlphaSeq

The beginnings of A-Alpha Bio stem from a 2017 paper titled ‘High-throughput characterization of protein–protein interactions by reprogramming yeast mating’. The paper introduces the first iteration of A-Alpha Bio’s method for gathering protein-protein interaction data at scale, which is also called ‘AlphaSeq’.

Let’s go over it!

How does it work?

AlphaSeq works by exploiting yeast cell mating.

Yeast cells have two mating types: MATa and MATα. In nature, these cells find each other and fuse to form diploid cells through a process called sexual agglutination. This process is mediated by two proteins: Aga2 on MATa cells and Sag1 on MATα cells. When these proteins interact, they cause the cells to stick together, facilitating mating + the creation of a diploid cell.

AlphaSeq hijacks this natural process. Instead of using the native Aga2 and Sag1 proteins, the researchers genetically engineered yeast cells to display proteins of interest on their surface. This alone isn’t particularly novel; it has been done for decades underneath the name ‘yeast surface display’. The main novelty here is in exploiting the mating aspect of it: MATa cells display one set of proteins, while MATα cells display another set. When you mix these engineered cells, the likelihood of them mating becomes a function of how strongly the displayed proteins interact.

But how do you measure these interactions across thousands or millions of protein pairs? The answer is, of course, DNA-encoded libraries. Each protein displayed on the yeast surface is associated with a unique DNA barcode. When two yeast cells mate, these barcodes are brought together in the resulting diploid cell. Through some genetic engineering, these barcodes end up next to each other on the same chromosome.

From there on, you can simply extract the DNA from all these mated cells and sequence it. The frequency with which you see two barcodes together directly correlates with the strength of interaction between the proteins those barcodes represent. More frequent pairings indicate stronger interactions.

Thus, we can get an N by M protein interaction screen, where N MATa cells are each displaying 1 of N proteins, and so on for the M MATα cells.
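To make the readout concrete, here is a minimal sketch of how fused barcode pairs could be tallied into an N by M count grid. The barcode names and the read format (one pair per sequenced read) are hypothetical, chosen for illustration; this is not A-Alpha Bio's actual pipeline.

```python
from collections import Counter

def count_barcode_pairs(reads):
    """Count how often each (MATa barcode, MATalpha barcode) pair is observed."""
    return Counter(reads)

def to_matrix(pair_counts, a_barcodes, alpha_barcodes):
    """Arrange pair counts into an N x M grid (rows: MATa, columns: MATalpha)."""
    return [
        [pair_counts.get((a, alpha), 0) for alpha in alpha_barcodes]
        for a in a_barcodes
    ]

# Hypothetical reads: each read is one fused barcode pair from a diploid cell.
reads = [("a1", "x1"), ("a1", "x1"), ("a1", "x2"), ("a2", "x2"), ("a1", "x1")]

counts = count_barcode_pairs(reads)
matrix = to_matrix(counts, ["a1", "a2"], ["x1", "x2"])
print(matrix)  # [[3, 1], [0, 1]]
```

In this toy grid, the a1/x1 pair shows up most often, which under the assay's logic would suggest the strongest interaction of the four.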

When I first read about this, I had a list of questions. I’ll go through them, one by one.

That's a weird mating pattern. Why do yeast cells do that? It’s not that weird, it’s not too dissimilar to male and female sexes in mammals, sex recognition here just occurs via cell surface proteins. And why ‘types’ exist for yeast cells is for the reason sexes exist for animals: a way to nudge life towards greater genetic diversity. One note: this isn’t actually equivalent to sex, I used the phrase ‘mating types’ for a reason. Different sexes imply differently sized gametes (large eggs, small sperm), but the gametes between yeast mating cell types are the same, only surface proteins are different.
Does each given MATa and MATα cell only express one protein at a time? Yup! The paper doesn’t explore multiple surface protein expression, but, just thinking about it naively, it sounds like it’d be annoying to make something out of the resulting dataset.
Can this method account for post-translational modifications? The original paper is pessimistic on this front, writing ‘the detection of interactions requiring specific post-translational modifications may not be possible’. On a more theoretical level, it is known that yeast cells cannot recapitulate certain post-translational modifications that mammalian cells perform (e.g. complex-type N-glycans; yeast instead produces high-mannose glycans), so I’d expect AlphaSeq to have a similar failure mode.
What is the actual output of an AlphaSeq run? It’s a value referred to as the ‘mating efficiency’ of the yeast cells. More specifically, it is (number of diploid cells) divided by (number of MATa cells + number of MATα cells + number of diploid cells), for a given protein-protein pair. So, you’d end up with one of these values for every protein-protein pair.
Is it possible to connect mating efficiency to a traditional binding affinity value? The researchers established a calibration curve using protein pairs with known binding affinities. They found a log-linear relationship between mating efficiency and experimentally-determined binding affinity (Kd) with a pretty high correlation (R² of 0.87). From the paper’s analyses, there seems to be a wide dynamic range here; the relationship held over multiple orders of magnitude, from 500 pM (strong) to 300 μM (weak). This means that the output of AlphaSeq, the mating efficiency values, can be mapped directly onto a true affinity value, ranging from very strong to very weak.
What scale can this operate at? The original paper could generate 7,000 distinct PPI affinity values in a single experimental run. That is quite large; it immediately outstrips the existing set of publicly available data. But it’s also relatively small given the dimensionality at which biology operates. A 2023 paper by A-Alpha Bio pushed this further, generating binding affinity data for 104k~ antibodies to a single SARS-CoV-2 peptide in a (seemingly?) single run! Finally, A-Alpha Bio’s most recent work claimed to run 15,000 antibodies by 200 target runs, so 3 million interactions in total. Curiously, this was a throwaway example in the attached poster and not the actual discussed result. Which is…weird. The actual experiment they ran was far smaller in scale, with less than 500~ total interactions.
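The mating efficiency formula and the calibration idea above can be sketched in a few lines. This is a rough illustration, not the paper's fitting procedure: it assumes log10(Kd) is linear in log10(mating efficiency), and the calibration points are made-up numbers.

```python
import math

def mating_efficiency(diploid, mat_a, mat_alpha):
    """Diploid cells divided by all cells (MATa + MATalpha + diploid)."""
    return diploid / (mat_a + mat_alpha + diploid)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_kd(efficiency, slope, intercept):
    """Map a mating efficiency to an estimated Kd (in molar) via the fit."""
    return 10 ** (slope * math.log10(efficiency) + intercept)

# Hypothetical calibration pairs: (mating efficiency, known Kd in molar).
calibration = [(0.3, 1e-9), (0.03, 1e-7), (0.003, 1e-5)]
log_eff = [math.log10(e) for e, _ in calibration]
log_kd = [math.log10(kd) for _, kd in calibration]

slope, intercept = fit_line(log_eff, log_kd)
print(mating_efficiency(diploid=10, mat_a=45, mat_alpha=45))
print(predict_kd(0.03, slope, intercept))
```

With a fit like this in hand, every cell in the N by M grid could be converted from a mating efficiency into an estimated Kd, which is what makes the assay output comparable to traditional affinity measurements.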

Overall, cool method! We’ll discuss some outstanding concerns in a bit, but let’s first ponder on what this method is trying to replace.

What are the alternatives?

There are a few competitors in the ‘high-throughput in-vivo PPI screening’ world.

It’s a surprisingly diverse area, but two main names seem to consistently pop up, both of which also rely on yeast: yeast two-hybrid (Y2H) and yeast display. We could mention phage display as well, but it has the same practical constraints as yeast display, so I’ll leave out explicitly discussing it for now.

First, Y2H. In this method, there’s a ‘bait’ protein and a ‘prey’ protein, both of which are proteins you’re interested in assessing interactions between. The "bait" protein is fused to a DNA-binding domain (BD) of a transcription factor, and the "prey" protein is fused to an activation domain (AD) of the same transcription factor.

The BD fusion, as the name implies, immediately grabs onto DNA. If the bait and prey proteins interact, the BD and AD are brought together, activating the transcription of a reporter gene, which can be detected with other methods. If they don’t interact, the AD will be (theoretically) incapable of ‘finding’ the reporter gene. As the BD and AD fusion proteins are expressed intracellularly, Y2H occurs entirely in the nucleus.

From here. B is no binding, so no reporter gene activation. C is binding, so there is reporter gene activation.

You may immediately notice that Y2H doesn’t incorporate barcodes, so the scalability of the whole method is quite low; the possible transcription of the reporter gene must be found one at a time. There is a method called BFG‐Y2H that fixes this, incorporating barcodes into the whole system and allowing us to rely on high-throughput sequencing. This method claims a scale of 2.5M protein-protein pairs, which is quite high!

However, Y2H as a method suffers from relatively low accuracy, something along the lines of a 50% true-positive rate, an accuracy that BFG‐Y2H most likely shares. This is primarily due to intracellular environments being a bit uncontrollable, a cell is a crowded place after all, and spurious interactions are the norm. AlphaSeq likely doesn’t suffer from this problem, given that binding occurs in a less dense + more controllable extracellular region.

Moving on, yeast display is our next contender. The general idea here is simpler than Y2H. Transform a cell with your desired gene, the protein displays on the surface of the cell, wash the cell with a purified protein you’re interested in, and then use flow cytometry (or something else) to detect binding. The cell surface has one protein of interest, the wash has another protein of interest, and flow cytometry reveals whether binding occurred (via a fluorophore attached to the washed purified protein).

Yeast display can operate at the 10M+ scale of interactions and, as the interactions are occurring extracellularly, won’t have the same accuracy drawbacks as Y2H has. The disadvantage with this method seems to primarily boil down to two things: dependence on purified proteins and spectral resolution constraints.

For the former, purified proteins are expensive, may be structurally unstable, and some classes of proteins cannot be purified at all (e.g. membrane proteins). It’d be ideal if everything could simply be expressed within the cell itself, as is done with AlphaSeq.

For the latter, flow cytometry has a finite number of distinct fluorescent channels that can be used simultaneously. Use too many channels and you’re drowned in noise! This means that only a limited number of targets can be used per round, typically around 15-20. While more powerful versions of flow cytometry can push that number up more, it still isn’t high! Yeast display is fundamentally incompatible with doing something like a 1000 nanobody to 1000 receptors screen, more like a 1000 nanobody to 10 receptor screen. Comparatively, as AlphaSeq relies on DNA encoding to represent binding, the number of targets can be far higher.

So, AlphaSeq does seem better in some respects. But what about where it’s worse than other methods?

Outstanding concerns

Obviously, there are problems with yeast as a platform at all. Specifically, in that it cannot recapitulate the same post-translational modifications as done in human cells, which may affect the folding of an expressed protein, which may affect binding.

But, given that the main competitors are also based on yeast and will suffer from the same problems, let’s ignore those.

What’s specifically uniquely bad about AlphaSeq?

Truthfully, I struggle to come up with anything concrete.

One potential complaint is that AlphaSeq is worse at scaling measurements than other yeast-based PPI detection methods, if we are to take existing papers as proof of how far it could be pushed. I imagine that there are some inherent physical limitations based on a yeast cell’s probability of being able to interact with every other yeast cell of the opposing mating type.

To be clear, maybe it is on par, capable of reaching 10⁸ measurements and beyond. The papers thus far haven’t revealed much…we know 10⁵ scale (the 100k~ antibody screen) is absolutely possible, and maybe 10⁶. The 10⁶ case (the 3 million interaction screen) was a throwaway example and not something they expanded upon heavily, so I’m currently taking it with a grain of salt. Perhaps the error rate at that level of scale is extremely high?

There’s also this vague worry I have about the mating efficiency → binding affinity calculation. It’s surprising to me that it works at all! Variations in how well different proteins are expressed on the yeast surface and the experimental medium of the cell could change folding patterns of proteins, so I wonder how widely applicable the method is in practice amongst diverse proteins.

AlphaBind

AlphaBind doesn’t seem to be a single thing, but seems to broadly refer to a suite of ML models that A-Alpha Bio has built on top of their binding affinity data.

There is…basically zero information online about this. They claim to have ‘750 million affinity measurements’ upon which these models are trained, which is immense, but there aren’t any concrete results showing how useful the resulting model is.

Hoping for a paper in this area to be released soon!

How do they make money?

They seem to be placing bets on both the therapeutic angle and the platform angle.

For therapeutics, it’s immunocytokines.

For platforms (in partnerships with Big Pharma), it’s molecular glues.

At least, that seems to be their focus based on the pipeline on their website.

From here

Why those two? Likely because both of them (seem to) benefit from high N by M interaction screens, where both N and M are quite high. This is opposed to, say, few-target optimization, where you might only need to screen a large library against one or a handful of targets.

Let’s go through a quick explanation of both modalities + their work in it.

Immunocytokines

Immunocytokines are fusion proteins that combine an antibody with a cytokine (a protein that interacts with the immune system). The antibody component provides targeted delivery to specific cells or tissues, while the cytokine portion delivers an immunomodulatory signal to nearby immune cells. This combination aims to concentrate the cytokine's effects at the desired site of action (such as a tumor), potentially enhancing efficacy while reducing systemic side effects.

For example, consider an antibody tuned to bind to antigens on a tumor.

If we fuse this antibody with IL-2, a cytokine, the antibody will grab onto the tumor and any passing T or NK cells will bind to the free-floating IL-2.

From here. Focus on the stuff happening on the right.

What happens when IL-2 binds to these passing-by immune cells? A lot of things. The immune cells ramp up their cytotoxicity, replicate, and produce even more cytokines to alert other immune cells. The hope is that, through this local activation of the immune cells, the brunt of their ‘response’ is directed towards the tumor.

Unfortunately, the efficacy of immunocytokines has been limited by systemic toxicity. This is because the attached cytokine can bind to receptors throughout the body, not just the immune cells. In other words, cross-reactivity is a big problem.

The utility of the AlphaSeq platform comes into play here! If you want to build cytokines that can activate a specific target of interest (e.g. immune cell receptors) while avoiding everything else in the body, a scalable protein-protein interaction screen would come in handy. You could engineer thousands of cytokines and assess their affinity across thousands of receptors, trying to increase affinity to one of these receptors and diminish it for everything else.
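As a rough illustration of what ‘increase affinity to one receptor and diminish it on everything else’ could look like computationally, here is a minimal sketch that ranks cytokine variants by a simple selectivity ratio. The variant names, receptor names, Kd values, and the ratio itself are all hypothetical choices for illustration, not anything A-Alpha Bio has described.

```python
def selectivity_ratio(kd_by_receptor, target):
    """Tightest off-target Kd divided by the target Kd.

    Lower Kd means stronger binding, so a ratio well above 1 means the
    variant binds the target more tightly than any off-target receptor.
    """
    off_target = [kd for receptor, kd in kd_by_receptor.items() if receptor != target]
    return min(off_target) / kd_by_receptor[target]

def rank_variants(screen, target):
    """Sort variant names from most to least selective for the target."""
    return sorted(
        screen,
        key=lambda variant: selectivity_ratio(screen[variant], target),
        reverse=True,
    )

# Hypothetical screen: Kd values in nM for each variant against each receptor.
screen = {
    "wild_type": {"receptor_A": 10.0, "receptor_B": 20.0, "receptor_C": 50.0},
    "variant_1": {"receptor_A": 5.0, "receptor_B": 500.0, "receptor_C": 1000.0},
    "variant_2": {"receptor_A": 1.0, "receptor_B": 3.0, "receptor_C": 4.0},
}

print(rank_variants(screen, target="receptor_A"))
```

Note that variant_2 is the tightest binder to receptor_A but ranks below variant_1, since it also binds the off-targets tightly. That trade-off is only visible when you have the full many-by-many grid.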

A-Alpha Bio has poked at one part of this problem in a recent poster titled ‘Cytokine affinity tuning using the AlphaSeq platform to generate targeted immuno-oncology therapeutics’, performing single-site mutagenesis on a wild-type cytokine to assess each variant’s affinity to two receptors at once.

It works according to their results, but…it’s also somewhat unimpressive.

This was essentially a 500~ cytokine (at the upper end) by 2 receptors PPI screen. Which is a fine problem set-up, but also a bit disappointing if scale is really the selling factor here. I intuitively get the sense that AlphaSeq is a fair bit more accurate than Y2H, given how terrible Y2H is, and easier to set up than yeast display, since purified protein isn’t needed, so that’s a win in and of itself!

But if scale is what is being advertised, the existing poster on the topic doesn’t quite prove it. My guess is that it is more of a proof of concept than anything else, given that this one came after their much larger, 100k~ antibody AlphaSeq screen. Looking forward to more work in the future!

Molecular glues

Molecular glues are a class of small molecules that facilitate protein-protein interactions (PPIs) that wouldn't occur naturally or would occur only weakly. Unlike traditional drugs that inhibit protein function, molecular glues work by bringing two proteins together, leading to the degradation of a target protein.

The classic example of a molecular glue is thalidomide. Yes, the drug that causes horrific birth defects was one of the first molecular glues ever used in clinical practice, though its mechanism of action wasn’t known at the time.

Quick explanation on how it works: Thalidomide allows the cereblon protein (CRBN), an E3 ligase, to bind to IKAROS proteins, a set of transcription factors. Under ordinary circumstances, these two proteins do not interact. But thalidomide allows it to occur by acting, as the ‘molecular glue’ name implies, as a glue!

Why is the interaction useful? Through some complex chemical magic facilitated by the attached E3 ligase, multiple ubiquitin proteins are attached to the IKAROS protein, a process known as ubiquitination. This particular protein tag is recognized by the proteasome, a protein complex that degrades other proteins into individual amino acids or peptides. Thus, thalidomide reduces levels of intracellular IKAROS proteins by tagging them for degradation.

To note, this is why CRBN being an E3 ligase is important, since E3 ligases — of which there are 600~ — are capable of ubiquitination of attached proteins.

Why reduce IKAROS protein levels? Well, cancers often have overactive transcription factors due to their rapid growth, so reducing IKAROS levels helps stem their proliferation. But the initial prescription of thalidomide was for a more minor and widespread condition: pregnancy morning sickness. Of course, as it turned out, IKAROS proteins are also responsible for increased levels of the FGF8 protein, which is essential for correct limb and brain organization during embryonic development. Thus, the tragedy of thalidomide babies.

Despite the past of thalidomide, the concept of a molecular glue remained curious, given its advantages over typical inhibition-based drugs, and research continues to this day, still primarily focused on gluing stuff to E3 ligases.

On face value, the role of A-Alpha Bio’s AlphaSeq here is a bit uncertain. Molecular glues are small molecules, why is a PPI company interested in them? Curiously, they are placing themselves as not the people discovering glues, but rather the people discovering pairs of proteins where glues can be applied in the first place.

From their website:

The throughput and sensitivity of AlphaSeq makes the platform well suited to uncover novel targets for molecular glues by discovering and characterizing weak interactions between E3 ubiquitin ligases (or other effector proteins) and target proteins that may be enhanced by a small molecule binding at the interface. We measure interactions between our proprietary E3 ligase library and undruggable target proteins to provide a starting point for the rational discovery of molecular glues. We have discovered and validated many pairs internally and look to partner with industry leaders to progress protein-protein binding insights into small molecule drugs.

I was initially a little skeptical of the value of this.

Intuitively, my understanding of molecular glues is that weak interactions aren’t strictly necessary; they can cause interactions between anything. What’s the purpose of learning about weak interactions?

As it turns out, while molecular glues can induce interactions between non-interacting proteins, the primary way they function is by stabilizing weak interactions!

One review paper on molecular glues explicitly states:

Many molecular glues take advantage of weak, fortuitous, pre-existing protein–protein interfaces (PPIs) that can be further strengthened by their binding
Discovering weak protein–protein interactions that can be further stabilized is key to develop molecular glues. Rational design of molecular glues has been difficult mainly due to a lack of understanding and predictability of weak interactions.

Another review paper discussing the challenges of molecular glue development says something similar:

We believe that utilizing methods such as high-throughput global proteomics to understand the target interactome ahead of establishing screening will be more successful. This will undoubtedly require assays that are able to measure weak and transient interactions between target and effector.

This also answers a side piece of skepticism I had about whether pairs are actually the important piece (versus the molecular glue development itself)! It seems that the molecular glue field is so early that good targets to develop glues on top of are, for the most part, still unknown.

This is where A-Alpha Bio comes in, as they are the only group who can screen many E3 ligases against many target proteins. They aren’t developing a therapeutic themselves, but helping form the basis for a therapeutic at all! This is likely why many of their existing partnerships (Amgen, Bristol Myers Squibb, and Kymera Therapeutics) all focus on molecular glues.

It’s hard to assess how well this is going, since there are no publications focused on it, but partnerships usually mean there is something here!

What bets is the company making?

A-Alpha Bio feels deeply interesting to me. Again, it’s built off a fundamental advance in wet lab innovation, which I’ve discussed previously as being very important, and has lots of ongoing partnerships. Their future feels promising!

But all companies, by virtue of existing at all, are implicitly making bets on where the future is heading. A-Alpha Bio is no different. Let’s end this essay by going through them!

Here are the bets I’m seeing. These are in no particular order! Generally, there aren’t any huge concerns I have.

AlphaSeq will be able to scale further. There’s a deep disconnect between the claimed capabilities of AlphaSeq (millions of interactions per experiment) and the published capabilities (100k~ in one run). While there is this constantly mentioned line about being able to scale up to 3M interactions, there is no published paper about that experiment specifically! Very curious to know why! It may be the case that such experiments are simply too expensive to run for an academic paper and that AlphaSeq is perfectly capable of scaling as much as desired. And even if 3M can be reached, could it be pushed even further?
N by M interactions will continue to be valuable, compared to N by <20 interactions. AlphaSeq’s primary alpha lies in its ability to do many-by-many screening. This isn’t naively always valuable; many drug programs focus explicitly on a few targets and optimize molecules for those specifically. For those types of screens, you could do typical yeast/phage display. AlphaSeq’s sort of exploratory screens are well suited for A-Alpha Bio’s programs (immunocytokines and molecular glues), so this is a bet that those programs either yield results or that there are other areas where N by M screens are useful.
There is high translatability between yeast-cell expressed proteins and in-vivo human proteins. It's still unclear how well the data collected by A-Alpha Bio translates to human biology, given the differences in post-translational modifications between yeast and human cells. While this limitation will be present in any yeast screening methods, it becomes especially pertinent if your entire company is built off a yeast-based method! Still, low priority in general given that they have done some validation of translatability for a few proteins.
AlphaBind models work better than other people’s models. The lack of public information or academic publications about the suite of AlphaBind models is somewhat concerning. It may be the case that more foundation-model-esque things like AlphaFold3 are strongly outperforming internal AlphaBind results, or even simple dataset cleanups of existing open-sourced data, like PINDER, get you as far as AlphaBind. Alternatively, A-Alpha Bio may just be keeping things under wraps! Hard to tell.

That’s about it! Overall, this is a company I’m enormously bullish on, and I have high hopes for their future success.

Thank you for reading!

Addendum: some notes by the CTO

This is a section I made last minute!

After teasing the post on Twitter the other day, the CTO/co-founder of A-Alpha Bio, Randolph Lopez, followed me, and I took a chance to ask a few questions that were on my mind after writing the post.

I’ll organize this as an FAQ, with a ‘C’ for comments.

Q: What is the scale that AlphaSeq is capable of?

A: We currently execute about 30 AlphaSeq assays per month ranging for 100k intersections to about 5M per assay. The larger the network size, the more difficult it becomes to detect weak interactions so different applications require different network sizes.

C: This answers the 3M question; it’s not just a throwaway example, it’s real and possible! The ‘30 AlphaSeq assays per month’ comment also feels insane given the scale of each output. At 100k to 5M interactions per assay, that’s 3M-150M datapoints each month!

Q: What’s the limiting factor to pushing AlphaSeq further?

A: Fundamentally, the two constraints to network size are assay volume, which determines yeast collision sampling and sequencing coverage.

C: The yeast collision sampling part sounds similar to what I mentioned earlier: ‘inherent physical limitations based on a yeast cell’s probability of being able to interact with every other yeast cell of the opposing mating type’. But the sequencing coverage part wasn’t something I would’ve guessed is a bottleneck! It makes sense though: for a fixed sequencing budget, the number of reads per unique pair falls dramatically as the matrix grows, and the confidence in any given result likely plummets with it.
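A quick back-of-the-envelope sketch of the coverage point: with a fixed number of sequencing reads, the average reads available per protein pair shrinks in proportion to N × M. The read budget below is a made-up number, and real reads would not be spread evenly (strong binders would presumably soak up more of them), so treat this as an upper bound on how well weak pairs are sampled.

```python
def mean_reads_per_pair(total_reads, n, m):
    """Average reads per pair if reads were spread evenly over an N x M grid."""
    return total_reads / (n * m)

# Hypothetical sequencing budget for one assay.
total_reads = 100_000_000

for n, m in [(100, 100), (1_000, 1_000), (15_000, 200)]:
    coverage = mean_reads_per_pair(total_reads, n, m)
    print(f"{n} x {m}: {n * m:,} pairs, ~{coverage:,.1f} reads per pair")
```

Going from a 100 by 100 screen to a 15,000 by 200 screen cuts the average coverage by a factor of 300 under the same budget, which lines up with the comment that weak interactions get harder to detect as the network grows.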

Q: Is there a limitation on the diversity of proteins used in the AlphaSeq screen? Like, is 1000 random proteins by 1000 random proteins viable?

A: It is viable, main challenge is DNA synthesis cost. We’ve been experiment with oligo batch assembly approaches to build larger protein libraries from oligo pools to circumvent this.

C: No real comments, cool! My original concern about diversity (‘I wonder how widely applicable the method is in practice amongst diverse proteins.’) seems to be a non-issue.

Q: Will anything about AlphaBind be published soon?

A: That’s the plan! Aiming to have something out before EOY.

C: Hype!

Scaling microbial metagenomic datasets (Basecamp Research)

Abhishaike Mahajan — Thu, 11 Jul 2024 13:34:06 GMT

I think this startup is interesting! This is not a sponsored post, not meant to be anyone’s opinion other than my own, and is not investment advice. Here is some more information about why I write about startups at all.

Introduction

“Space is big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space.”

― Douglas Adams, The Hitchhiker’s Guide to the Galaxy

I always think about this quote whenever I think about microbes — a general term used to refer to bacteria, fungi, and viruses.

The genetic space that microbes occupy, often referred to as metagenomics, is large. A handful of soil alone contains trillions of microbes. They permeate every corner of the Earth, from sea mud in the Mariana Trench to the lower stratosphere. Each one adapts to the environments they reside in, slowly modifying itself to best survive.

While many species have been residents of their native environment for eons, they do not stay static in response to environmental changes. The fast growth rates of many microbes afford them the ability to pass through selective pressures extraordinarily quickly. Just a year after the Chernobyl meltdown disaster in 1986, radiation-resistant fungi were discovered within 30 kilometers of the power plant. The vast number of microbes, along with their associated mutation rate, makes them, without question, the single largest source of biodiversity on our planet.

And, according to one paper, 99.999% of all microbes have yet to be discovered.

Basecamp Research, a London-based startup, seeks to change that. Founded in 2019 by Glen Gowers and Oliver Vince, Basecamp aims to build the largest and most diverse metagenomics dataset on Earth.

During this essay, we’ll discuss why they are doing this, what progress they have made, and how they plan to profit off of it. This post was heavily assisted by Philipp Lorenz, the CTO of Basecamp Research, who hopped on a call with me to help clarify some aspects of the company. Huge shout-out to Kevin Yang for connecting us!

The value add

Why is metagenomics important?

Well, much of modern science is based off the proteins originally discovered in microbes. The Cas9 protein used in CRISPR therapeutics is part of an ancient bacterial immune system, nanopore sequencing relies on transmembrane proteins (αHL, MspA) found in most microbes, and so on. Mining nature for [things] is a tried and tested formula for finding useful proteins.

It’s incredibly important for AI tools in biology as well. Protein folding models, such as AlphaFold2, are reliant on MSAs, or multiple sequence alignments, to predict structures from input sequences. The data used to create MSAs come from terabyte-sized sequence databases, many of which are metagenomic in nature. Even AlphaFold3, released just a month ago, is still MSA-based.

Okay, so, metagenomics is important. But there are already large, open-sourced metagenomic datasets available. Mgnify, one of the largest repositories of environmental sequencing data, contains over 2.4 billion unique sequences. The data also has a reasonably high degree of diversity, clustering the sequences at 90% sequence identity yields around 600 million sequences, complete with the metadata of the environment it was extracted from.

So what value is Basecamp Research actually offering?

Part of the pitch is that much of the (open-sourced) metagenomic space is perhaps high diversity from a sequence standpoint, but not necessarily from an environmental one. A quick stroll through Mgnify’s dataset page will, in a purely eye-balling and non-scientific manner, lend some credence to this. Mgnify collates metagenomic data from studies undertaken by third party researchers, and most of the samples here seem reasonably simple. Soil gathered from a forest, microbiomes of healthy/diseased organisms, and the like. To Mgnify’s credit, there are a few interesting ones! Like data collected from the deepest point of the Baltic Sea. But the vast majority are environments that are reasonably similar to one another.

And environmental diversity matters!

While the aforementioned Cas9 and transmembrane proteins are relatively ubiquitous across domains of life, there are rarer proteins that only exist in certain biomes. For example, Taq polymerase, heavily used in polymerase chain reactions due to its ability to remain stable at high temperatures, was originally discovered in Thermus aquaticus, a bacterium isolated from the hot springs of Yellowstone. Outside of such an extreme environment, it is unlikely the protein would be found.

Basecamp Research is an attempt to scale up microbial data from the environmental diversity side. The founding of the company actually stems from work exploring microbial diversity of Europe’s largest ice cap (Vatnajökull, Iceland). The paper that resulted from this is fascinating to read; the founders of the company went through 11 days of hiking through Arctic-level conditions, relied entirely on solar power, and sequencing was performed on-site at the glacier (likely to avoid microbial death or contamination) via a Nanopore sequencer.

As of today, Basecamp partners with expeditions around the world (such as organized sports in remote oceanic regions or cross-Atlantic journeys made via hot air balloon), sponsoring trips in exchange for microbes sampled along the way. They literally have a section of their website titled ‘For Explorers’ for partnerships. And, since plenty of environments lack people who consistently visit them, they also have a few full time ‘explorers’ on staff whose job is to conduct sequencing in remote locations!

What have they accomplished?

The product

Their primary product is something called ‘BaseGraph’, which is a knowledge graph of all the data they have ever collected. According to an article from December 2023:

…BaseGraph, the largest knowledge graph of natural biodiversity, containing over 5.5B relationships with a genomic context exceeding 70 kilobases per protein. Their extensive long-read sequencing is complemented by comprehensive metadata collection, enabling them to link proteins of interest to specific reactions and desired process conditions.

It’s difficult to gauge how impressive this is compared to the open-source alternatives. Luckily, my conversation with Philipp shed some light on this!

In his eyes, BaseGraph is superior to open-source metagenomic datasets for one large reason that I rarely see mentioned in other news articles: they have put a lot of work into improving the data collection process itself. The environmental diversity is simply a way to exploit that improvement! In Basecamp’s eyes, traditional environmental sampling methods (which the open-source metagenomics datasets primarily use) lack sufficient sequencing depth, don’t lyse cells equally to expose genomic elements, and often produce genomic assemblies that are insufficient to find the really interesting stuff. This, amongst a long tail of other things, is part of Basecamp’s value add. I wasn’t provided many specifics here because their exact sequencing methodology is very much part of their core IP!

It’s an interesting angle for where the scientific ‘alpha’ of a metagenomics company can lie. And there is precedent to back up the difficulty of doing this. I’ve written before about the difficulties of sequencing microbiomes, which isn’t too far away from the difficulty of sequencing environmental microbes. If Basecamp is claiming a significant improvement here, it’s worth paying attention to.

An example set of metagenomic data collected by Basecamp

Let’s switch to going over some Basecamp-published papers.

Papers

HiFi-NN annotates the microbial dark matter with Enzyme Commission numbers

Paper here. Published at NeurIPS MLSB December 2023.

Even though we can sequence the genomes of microbes en masse, tying the resulting protein-coding regions of the genome to functional characteristics (e.g. whether a protein is a transmembrane protein) is still quite challenging. Around 30% of the microbial sequence space is fully unannotated, often described as ‘microbial dark matter’.

A functional characteristic that many people are interested in is whether a protein is an enzyme (Taq and Cas9 are also enzymes), since enzymes have the ability to massively accelerate chemical reactions in precise, controllable ways. This is useful for a huge number of industrial applications.

Discovering these enzymes from these mass-collected datasets — i.e. mapping sequence to functional enzyme type — is the subject of this paper.

What did they do? The methodology in the paper is two-step: fine-tune ESM2 with a contrastive loss (trained on Swissprot data), and then, given an embedding of an input sequence, use the K-nearest neighbors among other embedded enzymes to find the input’s enzyme category (or classify it as ‘not an enzyme’).

The nearest-neighbor step is where Basecamp threw in their metagenomic dataset, vastly expanding the total number and diversity of reference enzymes.
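To make the second step concrete, here is a minimal sketch of nearest-neighbor annotation over embeddings. It is not the paper’s implementation: the three-dimensional toy vectors stand in for ESM2 embeddings, and the function name, the similarity-weighted vote, and the `min_similarity` cutoff used for the ‘not an enzyme’ call are all assumptions made purely for illustration.

```python
import numpy as np

def knn_annotate(query, ref_embeddings, ref_labels, k=3, min_similarity=0.5):
    """Label a query embedding by a vote among its k most similar references.

    Hypothetical helper: the cutoff for "not an enzyme" and the
    similarity-weighted vote are illustrative assumptions.
    """
    # Normalize so that dot products are cosine similarities.
    refs = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = refs @ q

    # Indices of the k most similar reference sequences, best first.
    top = np.argsort(sims)[::-1][:k]

    # If even the closest reference is far away, call it a non-enzyme.
    if sims[top[0]] < min_similarity:
        return "not an enzyme"

    # Similarity-weighted vote over the neighbors' EC numbers.
    votes = {}
    for i in top:
        votes[ref_labels[i]] = votes.get(ref_labels[i], 0.0) + sims[i]
    return max(votes, key=votes.get)

# Toy reference set (made-up vectors, real EC numbers):
# EC 1.1.1.27 is L-lactate dehydrogenase, EC 4.2.1.1 is carbonic anhydrase.
ref_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
])
ref_labels = ["EC 1.1.1.27", "EC 1.1.1.27", "EC 4.2.1.1", "EC 4.2.1.1"]

near_ldh = knn_annotate(np.array([0.95, 0.05, 0.0]), ref_embeddings, ref_labels)
far_away = knn_annotate(np.array([0.0, 0.0, 1.0]), ref_embeddings, ref_labels)
```

The point of the sketch is that adding more (and more diverse) rows to `ref_embeddings` changes what a query can be matched against without retraining anything, which is why a larger labeled reference set can help on its own.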

One comment: It is clear how to do KNN with a labeled enzyme dataset (in their case, Swissprot), but less so with their own (presumably unlabeled?) metagenomic datasets. So…I’m assuming it is labeled. They say they are using “3 million curated sequences from our in-house database” and “We add sequences to include representation across [enzyme categories] for which Swissprot has few examples”, but it is unclear exactly what the curation process looks like, or how annotations for their in-house sequences were obtained.

In any case, it does seem like their method improves annotation accuracy by a fair margin compared to Swissprot alone, but still, the above points are important details! I would be curious to know how well MGnify alone performed on this as well.

Improving AlphaFold2 performance with a global metagenomic & biological data supply chain

Paper here. Published on bioRxiv in March 2024.

This one is pretty simple: protein structures are important parts of the therapeutic design process. Knowing what parts of a protein are disordered, structured, bind-able, and so on, can vastly improve our ability to identify + exploit certain protein targets.

Of course, the current state of protein structure prediction, impressive as it is, is still largely insufficient to dramatically alter our current workflows. Still though, better structure prediction is always welcome!

As mentioned previously, MSAs are often used as input to these models. As such, improving the underlying sequence data that the MSA is drawn from can be a way to eke out a performance bump with an existing protein structure model. And this is exactly what this paper tested.

What did they do? They ran structure predictions using a base AlphaFold2 model, with MSA construction relying on the BaseGraph dataset in addition to the usual MSA dataset. The method leads to a very mild aggregate bump across all structure prediction benchmarks (selected proteins from CASP15 and CAMEO), compared to not using BaseGraph at all.

There is a singular case of a massive 80% decrease in RMSD (a good thing) in a CAMEO target! It’s an n of 1, so obviously take it with a grain of salt, but it perhaps does imply that the utility of Basecamp’s expanded MSA dataset can only be seen with specific proteins. If this is indeed the case, the aggregate accuracy gains will disappoint, but the dataset will still be massively useful in certain cases! Time will tell if that is actually the case.
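For readers who don’t work with structures daily, here is a rough sketch of what an ‘80% decrease in RMSD’ means numerically. The coordinates below are invented, not taken from the paper, and the sketch assumes the predicted and reference structures have already been superimposed (real evaluations align them first).

```python
import numpy as np

def rmsd(pred, ref):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays.

    Assumes the two structures are already superimposed.
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def percent_decrease(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# Hypothetical three-atom "structure" with coordinates in angstroms.
reference = np.zeros((3, 3))
baseline_prediction = reference + np.array([5.0, 0.0, 0.0])  # every atom off by 5 Å
improved_prediction = reference + np.array([1.0, 0.0, 0.0])  # every atom off by 1 Å

baseline_rmsd = rmsd(baseline_prediction, reference)
improved_rmsd = rmsd(improved_prediction, reference)
drop = percent_decrease(baseline_rmsd, improved_rmsd)
```

In this made-up case the RMSD goes from 5 Å to 1 Å, which is an 80% decrease. A change of that size on a single target barely moves a benchmark-wide average, which is why the aggregate bump can look mild even when individual proteins improve a lot.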

Conditional language models enable the efficient design of proficient enzymes

Paper here. Published on bioRxiv in May 2024.

As discussed, enzymes are useful! One way to discover enzymes with interesting biochemical properties is to discover them from nature directly, such as in the paper from earlier. But another way could be to use nature as training data, and generate new enzymes entirely. This is especially salient for enzyme reaction classes that are rare in nature because few biophysical processes need them, which makes data mining insufficient to discover them.

Other enzyme generation methods are a bit limited in this conditional generation aspect, requiring fine-tuning on the enzyme classes a user is interested in. Ideally, users could specify a reaction category amongst hundreds of possible ones, and the model would be general enough to grasp how to generate such enzymes without requiring fine-tuning. This paper addresses this exact application.

What did they do? Trained a transformer model, ZymCTRL, on UniProt data; it accepts a desired enzyme class and generates sequences belonging to that class. They use this to create 20 zero-shot carbonic anhydrases, of which 2 have low sequence identity to natural anhydrases and comparable enzymatic activity. So it’s good from a zero-shot perspective.

While ZymCTRL doesn’t require fine-tuning, they also demonstrate the utility of their own dataset (BaseGraph) by fine-tuning their model on it, specifically on a class of enzymes called lactate dehydrogenases. They show that the enzymes generated by the fine-tuned model are of generally higher quality (better functional enzymatic activity) than the zero-shot generated enzymes.

One comment: They only give enzyme fine-tuning results for a model trained on their own dataset, with no comparisons to fine-tuning on public sources! Lactate dehydrogenases are well-studied enzymes, and undoubtedly exist in public repositories to some degree. These results are technically impressive from a raw scientific angle, but it does remain to be seen whether their private dataset (their main value add) is genuinely better than the public sources.

How do they make money?

Okay, so we’ve gone through their product and released research, both of which seem promising. But how do they actually make money from any of this? As of today, it seems to primarily be by selling enzymes.

They already have several partnerships in this direction for industrial manufacturing, such as enzymes that break down plastic waste and enzymes that are better catalysts for small-molecule manufacturing, amongst a few others. It unfortunately isn’t discussed whether these enzymes were generated or found in their dataset. There is one case I’ve found where the partner explicitly did license an enzyme from Basecamp (and not just form a partnership with no clear outcome), implying that at least someone found the enzyme useful. Unfortunately, I’m not finding any dollar amounts here, so it’d be hard to gauge how valuable this is. On a macro level, the enzyme market size is pretty immense (and growing!), so as long as Basecamp continues producing enzymes people want, this feels like a decent model.

Are they exploring anything else? Curiously, Philipp mentioned that the industrial usage of enzymes is actually minor compared to what their real focus is in the enzyme space: genetic engineering. Specifically, the discovery of large serine recombinases (LSRs), which are enzymes that facilitate precise genome editing. This was curious, because this doesn’t appear in public articles about them at all! All I could find was a single article about them discovering these LSRs in their dataset. I’m very curious about the partnerships they set up to take advantage of this. This is an excellent example of how Basecamp’s edge is not simply the amount of data they collected, but how they collected it as well; the majority of the discovered LSRs are the result of long-read sequencing, which is relatively uncommon in public metagenomic databases.

There is one other possible play here: selling the data directly to ML x biology companies or labs. But it doesn’t seem like something Basecamp is actively investing in. Another discussion with Philipp confirms this: they believe there is far more value in directly exploiting their own data rather than selling it to others. In many ways, this is a huge green flag. It signals strong conviction in the value inherent in their dataset, rather than chasing marginal improvements to other companies’ models using that dataset.

Selling data directly to ML x biology companies is still an interesting idea though. If Basecamp did go this direction, how valuable would it be? Well, there’s some upfront work. It will likely take a few more papers establishing the raw utility of the diverse sequences for researchers to really want to use them, since the papers we’ve talked about make the dataset itself feel of marginal value to generalized protein models. Past that, there seem to be only two sequence-heavy ML x biology companies, Profluent and EvolutionaryScale, which feels like a reasonably small market. Perhaps Basecamp could also sell the data directly to universities for them to use in labs? Again though, more convincing will likely be needed.

Because of this, it feels like an excellent move to eschew the data-as-a-service business model!

Either way, I think the really interesting part of the whole business model is the moat. Not necessarily the sequence moat, which does exist, but is obvious. The far more curious one is the legal moat.

See, you can’t just up and take genetic material from wherever you want in the world, package it up, and sell it. You also have to comply with the Nagoya Protocol, a 2010 international agreement between 136 members of the UN and EU. It stipulates that countries have sovereign rights over their genetic resources. This means that any company wanting to commercialize products based on genetic resources isolated from plants, animals, or the environment must obtain prior informed consent from the country of origin and negotiate mutually agreed terms for the sharing of benefits arising from their use.

Why does it exist? To ensure fair and equitable sharing of benefits from genetic resources, creating a framework that protects biodiversity-rich nations and indigenous communities while facilitating responsible scientific and commercial use of these resources. It’s a good thing!

It’s also incredibly annoying. More importantly, it is annoying and vague! A 2017 paper laid out the bureaucratic challenges of complying with the Nagoya Protocol, and another paper commented on how it cripples global biodiversity research. There’s also this really interesting legal breakdown of how compliance with the Protocol actually works and it seems…painfully complicated.

How much does this matter in practice? I reached out to a friend who, in a past life, was involved in entomology research (the study of insects) to discuss this. Entomology is a fairly biodiversity-heavy field, requiring genetic samples from all over the world to conduct interesting research. I asked him how much the Nagoya Protocol practically interfered with his and his colleagues’ work, and he had this to say:

Yeah, [the Nagoya Protocol] is a big obstacle, at least in museum and collections-based work. When it first came and for years after, the bureaucratic burden was huge and a lot of research just stopped because there weren't any staff members who could handle it, until admins started hiring specifically for dealing with it (and since museums and collections are non-commercial, that's budget that was usually taken out of research budgets, so knock-on effects).

Alternatively, curators had to deal with the paperwork, meaning less time to do their real jobs. And of course permits were much more severely restricted, meaning no more sampling and collecting from biodiverse countries that are severely understudied. Some of those wouldn't even share genetic data.

So it slows a lot of research down, or even completely prevents it (or it forces researchers to have to travel to every country to study specimens locally, which is a huge blow to these tiny budgets and is often paid out of pocket)

[The Nagoya Protocol] is viewed pretty much negatively among the people I know, it's just too against the spirit of open science and collaboration

Here’s the legal moat that Basecamp Research has: every sample they have ever collected is Nagoya Protocol compliant.

From an article about them:

Basecamp Research wants to make it easier for researchers to gain access to valuable genetic data without violating any ethical considerations. Every sample they collect is compliant with the United Nation's Nagoya Protocol. Their goal is to make the Nagoya Protocol work for all parties: they do the heavy lifting when it comes to establishing agreements so that when companies come to them for assets, they do not have to worry about infringing national and international biodiversity laws: “We have partnerships in 18 countries with biodiversity hotspots,” said Oliver. “These are benefit-sharing arrangements where we give back: we build research capacity, do training and develop labs in those countries.”

Building up this same network from scratch seems like it’d be incredibly challenging. As it stands today, Basecamp owns the levers to access mass scale, standardized metagenomic data in a way that is convenient, diverse, and legal. They occupy a corner of the market that’d be difficult for anyone else to touch in a similar way. And while the US has not signed the Nagoya Protocol and thus doesn’t need to comply with it, pretty much every other high-biodiversity area of the world has (e.g. Brazil, Indonesia, India, and so on).

What bets is the company making?

Overall the state of the company feels deeply promising! Interesting product and a moat. What else?

All companies, by virtue of existing at all, are implicitly making bets on where the future is heading. Basecamp is no different. Let’s end this essay by going through them!

Here are the bets I’m seeing, ordered from riskiest to least risky:

There is still plenty of useful metagenomic diversity to explore. This is the strongest bet I believe Basecamp is making. And I doubt anybody really knows the answer as to whether this is correct. The fact that there is extant unexplored diversity is irrefutably true. But whether it’s useful? It’s a very unknown-unknowns thing. It may very well be the case that Basecamp’s current set of sequenced biomes covers most of the desirable metagenomic space, and further sequencing efforts will cost them money with little marginal return. A paper covering how much better their models get year-after-year of sequence collection would be super useful here! For what it’s worth, Philipp strongly believes that the surface has yet to even be scratched w.r.t. useful microbial diversity.
Sequence-only approaches are sufficient to discover interesting things. Their focus on pure DNA may lead Basecamp to ignore other potentially interesting facets of environmental microbes. This is a risky bet in my opinion. But I also believe that, given Basecamp’s legal moat, they are well equipped to deal with it. Just because they are focused on metagenomic data does not mean they cannot eventually collect diverse metatranscriptomic and metaproteomic data as well. Indeed, Philipp mentioned that they are internally experimenting with collecting microbial data beyond DNA alone.
Evolutionary genetic space contains most of the useful proteins. Not having to rely on what evolution crafted is one of the big draws of de-novo generative models. And, while Basecamp has their own generative models, they are clearly placing a bet on evolutionarily related proteins being the most physiologically useful in practice. This is again an unknown-unknowns thing, but Basecamp clearly has historical precedent on their side. A dizzying number of useful therapeutics and lab tools have come from the natural world, and there’s no reason to expect it’ll stop. It is also a very low-risk bet because it is very much a bet they can deviate from at any point in time! If, for example, their customers start to demand enzyme catalysis rates beyond what nature has to offer, there are few others in the world who have a better dataset to train de-novo enzyme generation models, given that such models can typically generalize beyond evolutionary space.
The Nagoya Protocol isn’t going away. This one feels extremely safe to assume. International agreements rarely just disappear, especially given how relevant the agreement in this instance is to global biodiversity. Basecamp’s legal moat will likely stay intact.

That’s about it! Very much looking forward to what this company ends up doing in the future.

Why write about biotech startups?

Abhishaike Mahajan — Mon, 08 Jul 2024 16:32:30 GMT

TLDR: A new section of my website will be devoted to writing about biotech startups I like. First one dropping in the next few days!

Since I’ve started writing on this blog, a surprising number of early stage biotech founders have reached out to me to talk. Over a dozen at this point! All of them have been really wonderful discussions and I consider it one of the best perks of actively writing.

One thing that I’m often struck by is the discoverability problem amongst these startups. It feels like nobody is talking about them, wondering about what they’ll do next, even though the ideas for these companies often feel groundbreaking.

On the surface, this is expected. Biotech doesn’t really require popular awareness amongst the wider public to succeed. These companies are selling clinical datasets or therapeutic development platforms or scientific software, none of which are things that public awareness really helps with.

But I think this reasoning is flawed.

Potential customers aren’t omniscient, on-the-market talent won’t dig far to find interesting companies, and the creation of new markets requires some popular consensus amongst influential parties. This is true in biology just as it is in other fields; perhaps less so, but still true. I often have biotech friends telling me about something they are curious about, me saying ‘Have you looked at X startup? They are working in that area’, and being met with a confused look. This shouldn’t happen!

One way of generating this awareness is by writing.

After all, I found out about, applied to, and now work at Dyno Therapeutics because I stumbled across an Axial article written about them! Discoverability is important! A few other biotech bloggers do great work here. Axial does it (less these days), as does Century of Biology by Elliot Hershberg, but no others as far as I know.

Importantly, these articles shouldn’t serve purely as puff pieces, purely marketing what’s good about the company. Embarking on new scientific endeavors is always risky, and opening up those risks to broader discussion is valuable for all parties. For example, I’m a big fan of a post about Varda, a company trying to “make drugs in space”, written by Corin Wagen, a founder of a quantum chemistry simulation startup. The thesis of the company immediately made me skeptical of the whole thing. But Corin’s breakdown of the promising chemistry they are exploiting, while also expressing his own hesitancy about the utility of that same chemistry, was a refreshing combination of ideas. And, paradoxically, it also made me far more optimistic about the company!

I suspect many people in the field would have similar reactions.

While VCs have the luxury of having the time to build conviction on what sounds like a crazy or trite idea, most scientists and engineers do not; they will make a snap judgement and move on. Gently persuading this demographic, which is filled with potential customers and employees, can have compounding returns for a burgeoning biotech startup.

This section hopes to help with this discoverability problem while also trying to remain scientifically grounded. This isn’t to say I’ll be pessimistic! In fact, each post will be from the perspective of someone who truly believes that, amongst all the companies out there, this particular one is very promising. But even the most promising companies are taking risky bets, and I want to surface their very real achievements alongside those very real bets.

Some of these startups will be very early stage, where I have talked to the founders personally. Some will be later stage, where I rely largely on publicly released information. They will all be bio-focused in some capacity, but will have widely varying focuses; some will be software focused, others wet-lab, and still others a mix. All of them, I think, have an interesting future ahead of them.

Reminder that none of what I say should be taken as investment advice, and nothing I say reflects anyone’s opinions other than my own.

As a final note, this post should be viewed as encouragement for others. Write about scientific companies you think are fascinating! Publicity is surprisingly useful for success, and founders of these companies are often too busy to advocate for themselves, especially during the earliest stages. On a deeper note, some of the most interesting science in the world is being done at startups, and writing about them is always a useful exercise.

First post soon!