16 Comments

Maybe this is dumb but:

manufacturing costs are a trivial fraction of the cost of bringing a drug to market. So are drug discovery costs. So why should we completely dismiss the possibility of a startup that spends way more on building libraries of chemicals with "inefficient" syntheses, in the hopes that the Cool AI will discover some new small molecules? If any are successful, the higher manufacturing costs will be peanuts next to the revenue associated with a whole new drug class!

I agree! Lead generation and its associated manufacturing costs are trivial compared to anything else in the drug-to-market pipeline, and I would also be far more bullish on the folks who do tried-and-tested chemical discovery than on the people who focus heavily on AI. The post was more about the woes that the latter group faces, rather than whether they are likely to be more successful (which, in the short term, is unlikely).

Let me clarify what I meant.

Suppose you're an AI-and-chemical-discovery startup.

The AI generates a list of small molecules. You, or the AI, rule out the physically impossible and the unsynthesizable. The rest of the list is technically possible to make, but, as you pointed out, probably quite inefficient.

what I'm wondering is: why don't you just synthesize samples of these "inefficient" molecules anyway? now you have an AI-generated library you can screen, and it probably adds some chemical diversity that's not available in standard libraries. if a chemical that's inefficient to synthesize turns up a hit against the target, big whoop; if it ever gets far enough in drug development that a pharma company is seriously considering manufacturing it, they won't care about the lousy yield!

or is the inefficient synthesis a problem for *you*, the dinky startup with a much smaller budget?

what am I missing? you seem to be saying "AI guys, worry more about whether your molecules are easy to synthesize" but i'm not understanding whom, exactly, an inefficient synthesis is a dealbreaker for.

I imagine inefficient synthesis is a problem for anyone, because it makes you spend potentially a lot of time on the least important part of drug discovery: lead generation. A fair bit of the drug discovery pipeline, to my understanding, is spent on optimizing the structures of initially promising leads and derisking them enough to throw into trials, so you'd like to save your money for that, even as big pharma. Easy-to-create molecules mean faster iteration speeds.

Currently, 'promising leads' can number in the thousands of molecules, even for AI-produced ones, and it's very much unclear whether the hard-to-create AI-produced molecules lead to meaningful improvements in any interesting pharmaceutical property.

Sure, if an AI startup has a particularly amazing model whose amazing molecular generations happen to be hard to create, it's probably better for them to simply put up with the pain of synthesizing them. But, to my knowledge, modern-day AI tools are more like 'a 0.1% success rate using rational design becomes a 0.2% success rate using the AI' than anything with zero-shot capabilities. And even that is optimistic!

Keep in mind Owl's initial example at the beginning of the essay: erythromycin A, which was well known and well characterized, still took Lilly 9 years to synthesize artificially. Yes, organic synthesis methods have advanced since then, but figuring out how to actually make a de novo molecule takes years of trial and error at the bench of an organic chemistry lab, even for a stellar organic chemist. The latter consider themselves fortunate if even one of their molecules makes it into the real world during their lifetime.

No start-up can afford to wait years to synthesize a list of in-principle-synthesizable small molecules, nor can they afford the necessary equipment and infrastructure (which are not cheap).

You'd need a decent sample size of those randomly generated compounds (on the order of millions) in a decent quantity (at least a few mg each) to achieve a reasonable probability of success in a high-throughput screening assay against a selected target. Even if you can synthesize each compound for ~$100 on average, here's your $100M investment with a few-percent success probability. Novelty of a chemical structure is nothing special or particularly beneficial for biochemical or therapeutic effect, so it's generally more efficient to buy an existing small-molecule library or DEL (~$50k-500k per library).
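The back-of-the-envelope arithmetic here can be made explicit. All figures below are the rough assumptions from the comment above (library size, per-compound cost, commercial-library price range), not measured data:

```python
# Cost of synthesizing a bespoke AI-generated library vs. buying an
# existing one, using the rough numbers quoted above (all assumptions).
n_compounds = 1_000_000            # library size needed for an HTS campaign
cost_per_compound = 100            # USD, optimistic average synthesis cost

custom_library_cost = n_compounds * cost_per_compound
commercial_library_cost = (50_000, 500_000)   # quoted range for existing libraries/DELs

print(f"Custom synthesis:   ${custom_library_cost:,}")   # $100,000,000
print(f"Commercial library: ${commercial_library_cost[0]:,} - ${commercial_library_cost[1]:,}")

# Even against the most expensive commercial library, custom synthesis
# costs ~200x more before a single assay has been run.
print(f"Ratio (vs. high end): {custom_library_cost // commercial_library_cost[1]}x")
```

The gap is so large that the novelty of the custom structures would need to buy an enormous increase in hit quality to justify it.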

What an AI startup could try to do is eliminate or scale down the need for a large library - generate new structures "with a given target in mind", so to speak. I can imagine this as, e.g., generation from virtual-screening seeds or some kind of shape-complementarity restriction on generated molecules (if the binding pocket is known). But even then, purchasing even 1,000 custom never-before-synthesized compounds could turn into a multi-year project.

> why should we completely dismiss the possibility of a startup that spends way more on building libraries of chemicals with "inefficient" syntheses

I suspect this doesn't happen because of the financial constraints biotech startups face: because they're valued for their clinical assets, not their platform, this sort of investment never quite pencils out. The only platform investments that do pencil out are those that help raise money. And "libraries of chemicals with inefficient syntheses" is not a phrase that excites VCs.

Very nice article.

I would add that synthesis is only one of the lab-based bottlenecks for generative ML models, because we don't merely want our model to propose synthesizable compounds. We want it to propose useful ones.

I.e., we want molecules that hit the target, don't have undesirable off-target effects, are sufficiently soluble, have good intestinal absorption, have good liver clearance, are nontoxic, etc. To achieve this computationally with ML, we need sufficiently large data sets for each of these attributes to train good models. And the models need to give good predictions for the region of chemical space relevant to our particular project, so a liver clearance model trained using data from the project down the hall might not be applicable to our own. A major bottleneck to creating the needed data sets is the speed at which compounds can be assayed in the lab.
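The multi-objective filtering described above can be sketched as a pipeline of per-property checks. This is a minimal illustration only: the property names, thresholds, and stub predictors are hypothetical stand-ins for trained ML models, each of which would in practice have its own applicability domain:

```python
# Minimal sketch of a multi-property candidate filter. Every predictor
# here is a stub returning a fixed value; in a real pipeline each would
# be a trained model, and the thresholds are illustrative, not standard.
from typing import Callable

# Map each (hypothetical) property to a (predictor, acceptance test) pair.
FILTERS: dict[str, tuple[Callable[[str], float], Callable[[float], bool]]] = {
    "target_affinity_nM":    (lambda smi: 50.0, lambda v: v < 100.0),
    "solubility_logS":       (lambda smi: -3.5, lambda v: v > -4.0),
    "intestinal_absorption": (lambda smi: 0.8,  lambda v: v > 0.7),
    "hepatic_clearance":     (lambda smi: 12.0, lambda v: v < 20.0),
}

def passes_all(smiles: str) -> bool:
    """A candidate is useful only if *every* predicted property is acceptable."""
    return all(accept(predict(smiles)) for predict, accept in FILTERS.values())

print(passes_all("CCO"))  # True with these stub predictors
```

The point of the sketch is the conjunction: a generative model must clear every one of these hurdles simultaneously, and each hurdle needs its own assay-derived training set.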

>Unlike amino acids, which are limited to carbon, hydrogen, nitrogen, and oxygen

Sulfur is also an amino acid component (cysteine and methionine).

Betraying my lack of chem background! Addressing this

selenocysteine has entered the chat

I'm a layperson so another "maybe this is dumb but" comment - do you think it would be feasible for reverse folding to get good enough to be able to design enzymes specifically for making a given synthesis step easier?

This has been a goal for a while! "Catalytic antibodies" and, more recently, lots of directed-evolution work have focused on this: Merck famously used a multi-enzyme cascade to synthesize islatravir a few years back.

The reason this isn't everywhere is that (1) enzyme engineering is still difficult, expensive, and time-consuming, and (2) enzymes in general are much more substrate-specific than small-molecule catalysts. That's great in the body, where you want an enzyme to operate only on a specific substrate, but annoying when you have to synthesize twenty analogs and don't want to re-engineer a new enzyme for each one.

Folks are working on developing "promiscuous enzymes" to address this, so it's entirely possible that all chemistry will be done by enzymes in the future, but today it's still the exception, not the rule.

I see, thanks for the reply. Is the main difficulty with enzyme engineering a computational bottleneck for doing the reverse folding?

> Past that, proteins may just be easier to model. After all, while protein space is also quite large, there does seem to be a higher degree of generalization in models trained on them; for example, models trained on natural protein data can generalize beyond natural proteins.

Michael Bronstein's fantastic recent essay "The Road to Biology 2.0 Will Pass Through Black-Box Data" has a great explanation for this (see excerpt below). In contrast, small molecules do not live within a "degenerate solution space", since they are not products of biological evolution.

> “Degenerate” solution space. Another peculiarity of AlphaFold2 is that the supervised training set of only 140K protein structures and 350K sequences is tiny by ML standards [29] — an order of magnitude less than the amount of data used to train AlexNet almost a decade earlier, and a drop in the ocean compared to the contemporaneous GPT-3 [18]. What likely makes such a small dataset sufficient is the “degeneracy” of the solution space: while in theory the number of all possible solutions in protein folding is astronomically large (estimated at 10^300 [30]), only a very small fraction thereof is actualised. This is akin to the “manifold hypothesis” in computer vision, stating that natural images form a low-dimensional subspace in the space of all possible pixel colours [31].

> The reason for this “degeneracy” likely lies with evolution: most of the proteins we know have emerged over 3.5 billion years of evolutionary optimisation in which existing domains were copied, pasted, and mutated [32], producing a limited “vocabulary” that is reused over and over again. There are thermodynamic reasons for this, too, as only a limited set of possible amino acid 3D arrangements make up for the entropic cost of a defined protein fold [33]. Most protein folds can thus be achieved by recombining and slightly modifying existing ones, and valid solutions can be formed through advanced retrieval techniques [34].

It is even worse than you explained. In your protein example, each amino-acid building block comes in two forms (except for G), mirror images we can call L and D. Your biosynthetic factory can only use the L form. Thus, for a small 20-residue molecule, you are making only 1 in a million possibilities (2^20). All of them will have the same chemical groups, but each should fold very differently.
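The "1 in a million" figure is just the stereoisomer count made round. A quick check (assuming, as the round number does, that all 20 residues are chiral, even though glycine is not):

```python
# Each chiral residue can be L or D, so an all-chiral 20-mer peptide has
# 2**20 mirror-image combinations; ribosomal synthesis produces exactly
# one of them (the all-L form).
n_residues = 20
stereoisomers = 2 ** n_residues

print(stereoisomers)            # 1048576, i.e. "1 in a million"
print(f"{1 / stereoisomers:.2e}")  # fraction of combinations actually made
```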