Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Generative ML in chemistry is bottlenecked by synthesis, published by Abhishaike Mahajan on September 18, 2024 on LessWrong.
Introduction
Every single time I design a protein - using ML or otherwise - I am confident that it is capable of being manufactured. I simply reach out to Twist Bioscience, have them create a plasmid that encodes the amino acids that make up my protein, push that plasmid into a cell, and the cell will pump out the protein I designed.
Maybe the cell cannot efficiently create the protein. Maybe the protein sucks. Maybe it will fold in weird ways, isn't thermostable, or has some other undesirable characteristic.
But the way the protein is created is simple, close-ended, cheap, and almost always possible to do.
The same is not true of the rest of chemistry. For now, let's focus purely on small molecules, but this thesis applies even more so across all of chemistry.
Of the 10^60 small molecules that are theorized to exist, most are likely extremely challenging to create. Cellular machinery to create arbitrary small molecules doesn't exist like it does for proteins, which are limited by the 20 amino-acid alphabet. While it is fully within the grasp of a team to create millions of de novo proteins, the same is not true for de novo molecules in general (de novo means 'designed from scratch').
Each chemical, for the most part, must go through its own custom design process.
Because of this gap in 'ability-to-scale' for all of non-protein chemistry, generative models in chemistry are fundamentally bottlenecked by synthesis.
This essay will discuss this in more depth, starting from the ground up with the basics of small molecules, why synthesis is hard, how that 'hardness' applies to ML, and two potential fixes. As is usually the case in my Argument posts, I'll also offer a steelman to this whole essay.
To be clear, this essay will not present a fundamentally new idea. If anything, it's such an obvious point that I'd imagine nothing I'll write here will be new or interesting to people in the field. But I still think it's worth sketching out the argument for those who aren't familiar with it.
What is a small molecule anyway?
Small molecules are typically organic compounds with a molecular weight under 900 daltons. While proteins are simply long chains composed of one-of-20 amino acids, small molecules display a higher degree of complexity. Unlike the standard amino acids, which are built from just carbon, hydrogen, nitrogen, oxygen, and (in cysteine and methionine) sulfur, small molecules incorporate a much wider range of elements from across the periodic table. Fluorine, phosphorus, bromine, iodine, boron, and chlorine have all found their way into FDA-approved drugs.
This elemental variety gives small molecules more chemical flexibility but also makes their design and synthesis more complex. Again, while proteins benefit from a universal 'protein synthesizer' in the form of a ribosome, there is no such parallel amongst small molecules! People are certainly trying to make one, but there seems to be little progress.
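To make the 900-dalton cutoff concrete, here is a minimal sketch that sums atomic weights for a flat molecular formula. The function names (`molecular_weight`, `is_small_molecule`) are made up for illustration, the atomic weights are rounded standard values, and the parser assumes a simple formula without parentheses or charges. A real cheminformatics toolkit would be the better choice for anything serious.

```python
import re

# Rounded standard atomic weights (g/mol) for the elements named in this section.
ATOMIC_WEIGHT = {
    "H": 1.008, "B": 10.81, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "P": 30.974, "S": 32.06, "Cl": 35.45, "Br": 79.904,
    "I": 126.904,
}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a flat formula such as 'C9H8O4'.

    Assumption: no parentheses, hydrates, or charges in the formula.
    Raises KeyError for an element that is not in ATOMIC_WEIGHT.
    """
    total = 0.0
    # Each match is (element symbol, optional count), e.g. ('C', '9').
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


def is_small_molecule(formula: str, cutoff_da: float = 900.0) -> bool:
    """Apply the rough 'under 900 daltons' rule of thumb from the text."""
    return molecular_weight(formula) < cutoff_da


if __name__ == "__main__":
    # Aspirin (C9H8O4) and erythromycin A (C37H67NO13) both fall under the cutoff.
    for name, formula in [("aspirin", "C9H8O4"), ("erythromycin A", "C37H67NO13")]:
        mw = molecular_weight(formula)
        print(f"{name}: {mw:.1f} Da, small molecule: {is_small_molecule(formula)}")
```

Note that the cutoff says nothing about how hard a molecule is to make: erythromycin A sits comfortably under 900 daltons.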
So, how is synthesis done in practice?
For now, every atom, bond, and element of a small molecule must be carefully orchestrated through a grossly complicated, trial-and-error reaction process which often has dozens of separate steps. The whole process usually also requires tuning non-chemical parameters, such as the pH, temperature, and pressure of the surrounding medium in which the intermediate steps are done.
And, finally, the process must also be efficient; the synthesis process must not only achieve the desired end-product, but must also do so in a way that minimizes cost, time, and required resources.
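One way to see why step count and efficiency are so tightly linked: in a linear route, the overall yield is the product of the per-step yields, so losses compound. The sketch below is only an illustration of that arithmetic. The function name `overall_yield` and the 90%-per-step figure are assumptions chosen for the example, not numbers from any real synthesis.

```python
from math import prod


def overall_yield(step_yields):
    """Overall fractional yield of a linear route: the product of per-step yields."""
    return prod(step_yields)


if __name__ == "__main__":
    # Hypothetical route: every step gives a respectable 90% yield.
    for n_steps in (5, 10, 30):
        total = overall_yield([0.90] * n_steps)
        print(f"{n_steps} steps at 90% each -> {total:.1%} overall")
```

Under these assumed numbers, a 30-step route keeps only about 4% of the material it started with, even though no single step looks bad on its own.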
How hard is that to do? Historically, very hard.
Consider erythromycin A, a common antibiotic.
Erythromycin was isolated in 1949, a natural metabolic byproduct of Streptomyces erythreus, a soil mi...