Large Language Models to Accelerate Chemical Synthesis

EXTENDED ABSTRACT: Chemical synthesis, a foundational methodology for creating transformative molecules, exerts substantial influence across diverse sectors, from life sciences to materials and energy. Current chemical synthesis practice relies on laborious and costly trial-and-error workflows, underscoring the urgent need for advanced AI assistants. Large language models (LLMs), typified by GPT-4, have recently been introduced as efficient tools to facilitate scientific research. Here, we present Chemma, an LLM fully fine-tuned on 1.28 million Q&A pairs about reactions, as an assistant to accelerate organic chemistry synthesis. Chemma surpasses the best-known results in multiple chemical tasks, e.g., single-step retrosynthesis and yield prediction, which highlights the potential of general AI for organic chemistry. By predicting yields across the experimental reaction space, Chemma significantly improves the reaction exploration capability of Bayesian optimization. More importantly, when integrated into an active learning framework, Chemma shows strong potential for autonomous experimental exploration and optimization in open reaction spaces. For a previously unreported Suzuki-Miyaura cross-coupling reaction of cyclic aminoboronates and aryl halides for the synthesis of $\alpha$-aryl N-heterocycles, the human-AI collaboration identified a suitable ligand (PAd3) and solvent (1,4-dioxane) within only 15 runs, achieving an isolated yield of 67%. These results reveal that, without quantum-chemical calculations, Chemma can comprehend and extract chemical insights from reaction data in a manner akin to human experts. This work opens avenues for accelerating organic chemistry synthesis with adapted large language models.

Keywords: Chemical synthesis; large language model; AI for Chemistry

Brief Introduction of Speaker
Yanyan XU

Yanyan XU is an Associate Professor at the Artificial Intelligence Institute of Shanghai Jiao Tong University. He was selected for the National Overseas Talent Program (Youth Project) and Shanghai Overseas Leading Talent Program. He obtained his PhD from the Department of Automation at Shanghai Jiao Tong University in 2015, and from 2015 to 2020, he served as a postdoctoral researcher at MIT and UC Berkeley. His primary research focus is on AI for Science, particularly in the field of AI chemistry. He released the first large language model for chemistry, BAI-Chem, and his research outcomes have been published in top international journals and conferences, including Nature Energy, Nature Computational Science (cover article), Science Advances, etc.