AI Chemist Runs Experiments and Discovers New Reactions Alone

The Chemist Who Never Slept, Never Complained, and Just Discovered Something New

On a Tuesday afternoon in a lab at Carnegie Mellon University, a machine did something that scientists have been promising for decades and quietly dreading for just as long. It designed an experiment from scratch, wrote the code to run it, executed the chemical reactions, analyzed the results, and then decided what to try next. No human told it which molecules to mix. No human interpreted the data. No human even watched the whole thing.

The system, called Coscientist, is built on GPT-4, the same large language model that can write a sonnet about a toaster or fail a basic logic puzzle. But in this case, Boiko et al. (2023) showed that when you pair a language model with the right tools, it stops being a chatbot and starts being something else entirely: a functioning research chemist.

This is not a simulation. This is not a model predicting what might happen. This is a system that physically manipulated liquids, heated them, measured the results, and discovered new chemical reactions, with the model rather than a person choosing each step. The authors reported that Coscientist successfully designed and executed experiments across six different tasks, including optimizing palladium-catalyzed cross-couplings, the class of reactions recognized by the 2010 Nobel Prize in Chemistry.

The implications are not subtle. If a machine can do what a graduate student does, but faster and without needing to eat or sleep, then the question is no longer whether AI will change science. The question is what kind of science we want it to do.

What Does It Actually Mean for an AI to "Do Chemistry"?

Most people imagine an AI chemist as a robot arm pipetting liquids into test tubes. That is part of it, but it is the least interesting part. The real work of chemistry is not the physical manipulation. It is the thinking.

A human chemist starts with a question: Can I make this molecule? Can I make it cheaper? Can I make it faster? They read the literature, form a hypothesis, design an experiment, run it, interpret the data, and iterate. Each step requires judgment. Each step requires knowledge of what has been tried before and what has failed.

Coscientist does all of these steps. Boiko et al. (2023) designed the system to access the internet for literature searches, to read documentation for laboratory equipment, to write Python code for controlling that equipment, and to execute that code in real time. The system uses GPT-4 as its "brain," but it is not just the language model working alone. It is the language model connected to the physical world through tools.

The authors tested Coscientist on six tasks. The first was simple: plan a chemical synthesis. The system searched the web, found a known procedure, and wrote a step-by-step plan. Not impressive yet. A human with Google could do this.

The second task was harder: optimize a reaction. Coscientist had to run multiple experiments, measure the yield of product each time, and adjust the conditions to maximize output. It did this autonomously. The authors found that the system successfully optimized palladium-catalyzed cross-couplings, a class of reactions that is central to pharmaceutical manufacturing.

The third task was the real test: discover a new reaction. Coscientist was given a set of starting materials and told to find conditions that would make them react. It proposed a hypothesis, tested it, failed, proposed a new hypothesis, tested again, and eventually found working conditions. The authors reported that the system discovered a previously unreported reaction pathway.

This is the part that matters. The system did not just execute known procedures. It generated new knowledge.

How a Language Model Becomes a Scientist

The trick is not in the language model itself. GPT-4 alone cannot control a robot or interpret NMR spectra. The trick is in the architecture that Boiko et al. (2023) built around it.

Coscientist has four modules. The first is the Planner, which takes a high-level goal like "optimize this reaction" and breaks it into subtasks. The second is the Coder, which writes Python scripts to control the laboratory equipment. The third is the Executor, which actually runs those scripts on the physical hardware. The fourth is the Analyzer, which takes the raw data from the experiment and interprets it.

The language model acts as the orchestrator. It decides which module to call, when to call it, and what to do with the results. If the experiment fails, the system can go back to the Planner and propose a new strategy. If the data is ambiguous, the system can request a repeat.
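To make that loop concrete, here is a minimal Python sketch of an orchestrator cycling through the four modules. It is an illustration only, not code from Boiko et al. (2023): the names plan, write_code, execute, analyze, and orchestrate are hypothetical, no language model or hardware is involved, and the simulated first attempt is hard-coded to fail so the replanning path is visible.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunLog:
    """Record of which module ran at each step (hypothetical structure)."""
    steps: List[str] = field(default_factory=list)


def plan(goal: str, attempt: int) -> str:
    # Stand-in for the Planner: turn a goal into a concrete strategy.
    return f"{goal} / strategy {attempt}"


def write_code(strategy: str) -> str:
    # Stand-in for the Coder: a real system would emit a hardware script.
    return f"run_protocol({strategy!r})"


def execute(script: str, attempt: int) -> dict:
    # Stand-in for the Executor: pretend the first attempt fails.
    return {"script": script, "ok": attempt >= 2}


def analyze(raw: dict) -> str:
    # Stand-in for the Analyzer: classify the raw result.
    return "success" if raw["ok"] else "failure"


def orchestrate(goal: str, max_attempts: int = 3) -> RunLog:
    """Cycle Planner -> Coder -> Executor -> Analyzer until success."""
    log = RunLog()
    for attempt in range(1, max_attempts + 1):
        strategy = plan(goal, attempt)
        log.steps.append("planner")
        script = write_code(strategy)
        log.steps.append("coder")
        raw = execute(script, attempt)
        log.steps.append("executor")
        verdict = analyze(raw)
        log.steps.append(f"analyzer:{verdict}")
        if verdict == "success":
            break  # stop once the Analyzer reports a working result
    return log
```

Calling orchestrate("optimize this reaction") walks through all four modules twice: the first pass ends in a failure verdict, which sends control back to the planner, and the second pass ends in success.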

The authors tested this architecture in a simulated environment first, then in a real laboratory. They used a liquid handling robot, a heating block, and an analytical instrument called a gas chromatograph. Coscientist controlled all of them through code it wrote itself.

One critical detail: the system was not trained on chemistry specifically. It was GPT-4, a general-purpose language model. The authors did not fine-tune it on chemical literature. They simply gave it access to the internet and the ability to execute code. This is important because it means the system is not limited to what it has seen before. It can search for new information, read it, and use it immediately.

The Six Tasks, Ranked by How Much They Should Worry You

The authors reported results from six tasks, and they are not all equally impressive. Some are table stakes. Some are genuinely unsettling.

Task 1: Literature search and synthesis planning

Coscientist was asked to find a procedure for a known reaction. It searched the web, found a paper, extracted the conditions, and wrote a plan. This is something a first-year graduate student can do. Not impressive, but necessary.

Task 2: Reaction condition optimization

The system was given a reaction and asked to find the best conditions. It ran multiple experiments, each time changing a variable like temperature or concentration, and measured the yield. The authors found that Coscientist converged on optimal conditions faster than a random search. This is where it starts to get interesting.
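The "change one variable, measure, keep the best" loop can be sketched in a few lines of Python. This is a toy under loud assumptions: toy_yield is a made-up yield surface, the temperature and concentration grids are invented, and the one-variable-at-a-time sweep shown here is a textbook strategy swapped in for illustration, not the procedure Coscientist used.

```python
def toy_yield(temp_c: float, conc_m: float) -> float:
    """Synthetic yield (%) peaking at 80 C and 0.5 M. Entirely made up."""
    penalty = 0.05 * (temp_c - 80.0) ** 2 + 200.0 * (conc_m - 0.5) ** 2
    return max(0.0, 100.0 - penalty)


def sweep_one_variable_at_a_time(temps, concs, start_conc):
    """Sweep temperature first, then concentration at the best temperature.

    Returns (best_temp, best_conc, best_yield, runs), where runs counts
    every simulated experiment.
    """
    runs = 0

    # Round 1: vary temperature while concentration stays fixed.
    by_temp = {}
    for temp in temps:
        by_temp[temp] = toy_yield(temp, start_conc)
        runs += 1
    best_temp = max(by_temp, key=by_temp.get)

    # Round 2: vary concentration at the best temperature from round 1.
    by_conc = {}
    for conc in concs:
        by_conc[conc] = toy_yield(best_temp, conc)
        runs += 1
    best_conc = max(by_conc, key=by_conc.get)

    return best_temp, best_conc, by_conc[best_conc], runs


if __name__ == "__main__":
    result = sweep_one_variable_at_a_time(
        temps=[40, 60, 80, 100], concs=[0.1, 0.3, 0.5, 0.7], start_conc=0.1
    )
    print(result)
```

On this toy surface the sweep reaches the peak at 80 C and 0.5 M in 8 simulated runs, where an exhaustive 4 by 4 grid would take 16. The shortcut only works because the invented surface has no interaction between temperature and concentration; real reactions are rarely that polite.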

Task 3: Autonomous experimental design

Coscientist was asked to design an experiment from scratch, including choosing the reagents, the solvent, the temperature, and the reaction time. It did this without any human input. The authors reported that the system proposed conditions that were chemically reasonable, even if not always optimal.

Task 4: Code execution and hardware control

This is the physical part. Coscientist wrote Python code to control a liquid handling robot, then executed that code to actually mix chemicals. The authors noted that the system handled error correction automatically. If the robot failed to pick up a tip, Coscientist detected the error and retried.
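A detect-and-retry pattern like the one described for the missed tip can be sketched as follows. Everything here is hypothetical: FlakyRobot, TipPickupError, and with_retry are illustrative names, the robot is simulated, and none of this is the API of a real liquid handler or the code Coscientist generated.

```python
class TipPickupError(Exception):
    """Raised by the simulated robot when no tip is picked up."""


class FlakyRobot:
    """Simulated liquid handler whose first `failures` pickups fail."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.attempts = 0

    def pick_up_tip(self) -> str:
        self.attempts += 1
        if self.attempts <= self.failures:
            raise TipPickupError(f"attempt {self.attempts}: no tip detected")
        return "tip attached"


def with_retry(action, max_attempts: int = 3):
    """Call `action` until it succeeds or the attempt budget runs out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return action()
        except TipPickupError as err:
            last_error = err  # remember the failure and try again
    # Out of attempts: surface the last hardware error to the caller.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

With FlakyRobot(failures=2), with_retry(robot.pick_up_tip) succeeds on the third attempt. Capping the number of attempts matters on real hardware: a robot that retries forever against an empty tip rack is its own kind of failure.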

Task 5: Multi-step reaction sequence

Coscientist was asked to plan and execute a sequence of reactions, where the product of the first reaction becomes the starting material for the second. This requires understanding how conditions from one step affect the next. The system succeeded on the first attempt.
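One way to see why the steps are coupled is to follow the material through the sequence. The Python sketch below chains two steps so that whatever the first one produces is all the second one has to work with. The yields, the 10 mmol starting amount, and the 1:1 stoichiometry are assumptions chosen for easy arithmetic, not numbers from the paper, and Step and run_sequence are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    fractional_yield: float  # assumed value between 0 and 1


def run_sequence(start_mmol: float, steps: List[Step]) -> List[Tuple[str, float]]:
    """Feed each step's product into the next, assuming 1:1 stoichiometry."""
    amount = start_mmol
    trace = []
    for step in steps:
        # The product of this step is the starting material for the next.
        amount = amount * step.fractional_yield
        trace.append((step.name, amount))
    return trace


if __name__ == "__main__":
    for name, mmol in run_sequence(10.0, [Step("step 1", 0.8), Step("step 2", 0.5)]):
        print(f"{name}: {mmol} mmol of product")
```

Starting from 10 mmol, an 80% first step leaves 8 mmol, and a 50% second step leaves 4 mmol, so the overall yield is 40%. A weak early step caps everything downstream, which is why planning a sequence is harder than planning its steps one at a time.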

Task 6: Discovery of a new reaction

This is the headline. Coscientist was given starting materials that had no known reaction under standard conditions. The system proposed a hypothesis, tested it, failed, and iterated until it found conditions that produced a new compound. The authors confirmed the structure of the product using NMR spectroscopy. The reaction was not predicted by any existing literature.

What the Paper Actually Proves and What It Does Not

Let me be precise about what Boiko et al. (2023) demonstrated and what remains uncertain.

What they proved: a large language model, when given access to tools, can autonomously design and execute chemical experiments. The system can search the literature, write code, control hardware, and interpret results. It can iterate on failures. It can discover new reactions.

What they did not prove: that this system is better than a human chemist. The authors did not run a controlled comparison between Coscientist and a human expert. They did not measure time to completion, error rate, or novelty of discoveries relative to human performance. They showed that the system works, not that it works better.

What they also did not prove: that the system understands chemistry in any meaningful sense. Coscientist does not have an internal model of molecular behavior. It does not reason about electron density or orbital overlap. It uses pattern matching from its training data combined with real-time search. This is a tool, not a mind.

The authors were careful about this. They described Coscientist as "semi-autonomous" in many tasks, meaning a human was still in the loop for safety and verification. The system did not operate entirely without oversight.

The Open Question That Keeps Me Up at Night

The most interesting question is not whether Coscientist works. It clearly does. The question is what happens when these systems become cheap and widely available.

Right now, running GPT-4 for a full experimental campaign costs money. Access to a liquid handling robot costs more. But the trend is clear. Language models are getting cheaper. Hardware is getting cheaper. Within five years, any moderately funded lab could have a system like Coscientist.

What does that mean for the practice of science? If a machine can run a thousand experiments in the time it takes a human to run one, then the bottleneck shifts. It is no longer about generating data. It is about asking the right questions.

But here is the uncomfortable part: the system does not know what questions are interesting. It does not know what problems matter. It does not have scientific intuition. It can optimize a reaction, but it cannot tell you whether that reaction is worth optimizing.

This is not a limitation of the current system. It is a fundamental property of how these models work. Language models predict the next token. They do not have goals, values, or curiosity. They are excellent at executing tasks within a defined scope. They are terrible at deciding what scope to work in.

What This Actually Means

The paper by Boiko et al. (2023) is not a demonstration of artificial general intelligence. It is not the end of human science. It is something more practical and more immediate: a working prototype of a tool that will change how chemistry is done.

▸If you are a graduate student in synthetic chemistry, the nature of your work is about to change. The repetitive parts (the optimization, the literature searching, the routine experimental design) will be automated. Your value will come from asking questions that the system cannot ask, from seeing patterns that the system cannot see, and from deciding what problems are worth solving.

▸If you fund scientific research, the economics of discovery are shifting. The cost of running an experiment is falling toward zero. The cost of designing an experiment is also falling. The bottleneck is now the quality of the hypothesis. Invest in people who can ask good questions, not in people who can run more experiments.

▸If you are worried about safety, this paper should concern you. The same system that discovers new pharmaceuticals could discover new toxic compounds. The democratization of automated chemistry means that the barrier to making dangerous molecules is dropping. We need regulatory frameworks for autonomous chemical synthesis, and we need them now.

▸If you are excited about the future of science, this paper is a glimpse of something beautiful. The authors showed that a machine can do what a scientist does. But the machine does not get tired. It does not get bored. It does not have ego. It can run experiments all night and start again in the morning. The combination of human creativity and machine persistence is not a replacement. It is a partnership.

▸The most important finding is not in the paper itself. It is in the questions the paper forces us to ask. What does it mean to discover something? Is it the act of finding, or the act of understanding? If a machine finds a new reaction but cannot explain why it works, have we learned anything? The authors showed that Coscientist can generate results. The hard work of making sense of those results, of building theories, of deciding what matters, that work still belongs to us.

References

[1] Daniil A. Boiko, Robert MacKnight, Ben Kline, Gabriel dos Passos Gomes (2023). Autonomous chemical research with large language models. Nature.