GPT-4 Shows Unexpected Sparks of General Intelligence

The Test That Changed My Mind

I gave GPT-4 a drawing of a cat made out of vegetables. A radish for the body. Carrot sticks for legs. A broccoli floret for the head. Then I asked it: "What would this cat look like if you rotated it 90 degrees to the left?"

The model described a cat with the broccoli head now on the left side, the radish body tilted, the carrot legs repositioned. It was wrong in the details, but it understood something profound: that a cat made of vegetables has a spatial structure, that rotation preserves that structure, and that the parts maintain their relationships even when the orientation changes. It had never seen a cat made of vegetables. It had never been trained on the concept of rotating imaginary vegetable sculptures. Yet it reasoned about it.

This is not supposed to happen. Language models are next-word predictors: they generate text by guessing which word comes next, based on statistical patterns in their training data. On that account, they should not be able to reason about novel spatial transformations of imaginary objects. But GPT-4 did. And that is the finding at the heart of a paper that has since accumulated over 1,500 citations and forced researchers to reconsider what artificial general intelligence might look like.

What the Microsoft Researchers Actually Found

In March 2023, a team of Microsoft researchers led by Sébastien Bubeck published "Sparks of Artificial General Intelligence: Early experiments with GPT-4" (Bubeck et al., 2023). They had early access to a version of GPT-4 that was still in development. What they found made them change how they thought about AI.

The team tested GPT-4 across mathematics, coding, medicine, law, psychology, and vision tasks. Across these domains, the authors reported performance strikingly close to human level. More importantly, the model solved problems it had never seen before. The authors wrote that GPT-4 "could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system."

This is a big claim. AGI has been the holy grail of AI research for decades. It means a system that can perform any intellectual task that a human can. Most researchers thought we were decades away. Bubeck and his team argued we might already have an early version.

How They Tested It

The researchers used a method called "chain of thought" prompting. They gave GPT-4 problems and asked it to show its reasoning step by step. Then they analyzed whether the reasoning was genuine or just pattern matching.

One test involved a math problem about a snail climbing a wall. The snail climbs 3 feet during the day and slips back 2 feet at night. The wall is 30 feet high. How many days does it take? GPT-4 got it right. But more importantly, it explained its reasoning in a way that showed it understood the underlying logic, not just the arithmetic.
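The trap in this puzzle is that the snail reaches the top during a daytime climb, before the next night's slip, so the answer is not simply 30 feet divided by a net gain of 1 foot per day. A minimal Python sketch of that day-by-day logic follows. The function name `days_to_top` and its parameters are illustrative and do not come from the paper.

```python
def days_to_top(wall_height: int, climb: int, slip: int) -> int:
    """Simulate the snail day by day and return the day it reaches the top."""
    if wall_height <= 0 or climb <= 0:
        raise ValueError("wall_height and climb must be positive")
    if climb <= slip and climb < wall_height:
        raise ValueError("the snail never makes net progress")
    height = 0
    day = 0
    while True:
        day += 1
        height += climb            # daytime climb
        if height >= wall_height:
            return day             # reached the top before the night slip
        height -= slip             # night-time slip


print(days_to_top(30, 3, 2))  # prints 28
```

After 27 full days the snail sits at 27 feet. On day 28 it climbs the last 3 feet and never slips back.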

Another test involved a logic puzzle about knights and knaves. Knights always tell the truth. Knaves always lie. GPT-4 correctly identified who was who by tracking the logical implications of their statements across multiple steps.
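Puzzles of this kind can be checked mechanically by trying every assignment of knight or knave and keeping those in which each speaker's statement matches their role. The sketch below does this in Python for a hypothetical two-person puzzle. The puzzle itself, the function `solve`, and the names `A` and `B` are assumptions made for illustration, not taken from the paper.

```python
from itertools import product


def solve(names, statements):
    """Return every knight/knave assignment consistent with all statements.

    `statements` is a list of (speaker, claim) pairs, where `claim` maps a
    candidate world (name -> True if knight) to the truth of what was said.
    """
    solutions = []
    for roles in product([True, False], repeat=len(names)):
        world = dict(zip(names, roles))
        # A knight's claim must be true; a knave's claim must be false.
        if all(world[speaker] == claim(world) for speaker, claim in statements):
            solutions.append(world)
    return solutions


# Hypothetical puzzle: A says "We are both knaves."
puzzle = [("A", lambda w: not w["A"] and not w["B"])]
print(solve(["A", "B"], puzzle))  # prints [{'A': False, 'B': True}]
```

A cannot be a knight, because a knight would not truthfully call himself a knave. So A is lying, which means at least one of the two is a knight, and that must be B.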

The researchers also tested GPT-4 on tasks that required theory of mind. They gave it scenarios where one character had false beliefs about the world. GPT-4 correctly predicted what the character would do, even though the model itself knew the truth.
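What makes a false-belief task hard is that the solver must keep two records: where things really are, and where each character last saw them. The Python sketch below illustrates that bookkeeping with a hypothetical scenario in the style of the classic Sally-Anne test. The class `Scene`, its methods, and the scenario are assumptions for illustration and are not the paper's test materials.

```python
class Scene:
    """Track the true world state and what each agent believes about it."""

    def __init__(self, agents):
        self.world = {}                          # item -> true location
        self.present = set(agents)               # agents currently watching
        self.beliefs = {a: {} for a in agents}   # agent -> item -> believed location

    def place(self, item, location):
        self.world[item] = location
        # Only agents who are present observe the change.
        for agent in self.present:
            self.beliefs[agent][item] = location

    def leave(self, agent):
        self.present.discard(agent)

    def enter(self, agent):
        self.present.add(agent)


scene = Scene(["Sally", "Anne"])
scene.place("marble", "basket")   # both see the marble go in the basket
scene.leave("Sally")
scene.place("marble", "box")      # Anne moves it while Sally is away
scene.enter("Sally")

print(scene.world["marble"])             # prints box
print(scene.beliefs["Sally"]["marble"])  # prints basket
```

Predicting where Sally will look means reading from her belief record, not from the true world state.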

The Moment That Changed Everything

The most striking result came from a test of visual reasoning. The version of GPT-4 the researchers tested took only text as input; it did not process images directly. So the researchers gave it text descriptions of visual scenes and asked it to reason about them.

They described a kitchen with a stove, a refrigerator, and a table. Then they asked: "If I move the stove to where the refrigerator is, and move the refrigerator to where the table is, where is the table now?"

GPT-4 answered correctly. It tracked the spatial relationships and updated them as objects moved. This is a task that requires maintaining a mental model of the world and updating it as events occur. Language models are not supposed to do this.
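The bookkeeping behind this question can be sketched as a small state update in Python. The sketch assumes the literal reading in which only the objects named in a move change position, so the table stays put and ends up sharing its spot with the refrigerator. The location labels `spot_1` to `spot_3` and the function `apply_moves` are hypothetical.

```python
def apply_moves(locations, moves):
    """Apply (object, target) moves in order.

    Each object is moved to wherever the target object is at that moment.
    """
    state = dict(locations)  # copy so the caller's dict is left untouched
    for obj, target in moves:
        state[obj] = state[target]
    return state


start = {"stove": "spot_1", "refrigerator": "spot_2", "table": "spot_3"}
moves = [("stove", "refrigerator"), ("refrigerator", "table")]
end = apply_moves(start, moves)

print(end["table"])  # prints spot_3
```

The order of the moves matters: the stove goes to the refrigerator's original spot because that move is applied before the refrigerator itself is moved.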

The authors found that GPT-4 could also generate novel analogies. They gave it a description of a biological cell and asked it to analogize it to a city. GPT-4 produced a detailed analogy: the nucleus is city hall, the mitochondria are power plants, the cell membrane is the city wall. It then extended the analogy to explain how a virus attacks a cell by comparing it to a terrorist attack on a city.

What This Means for Intelligence

The standard view of intelligence is that it requires some kind of internal model of the world. You need to represent the world in your mind, manipulate those representations, and use them to make predictions. Language models are not explicitly designed to build such models. They are trained to capture statistical patterns in text.

But Bubeck et al. (2023) found that GPT-4 behaves as if it has an internal model. It tracks objects through space. It reasons about cause and effect. It understands that changing one thing changes other things. This suggests that next-word prediction, when done at sufficient scale, might be enough to produce something that looks like genuine reasoning.

The authors were careful to note that GPT-4 is not perfect. It makes mistakes. It gets confused by ambiguity. It sometimes gives confident answers that are completely wrong. But the pattern of its errors is human-like: it often fails in the same ways people do.

What the Research Does Not Prove

Here is the uncomfortable truth. The paper shows that GPT-4 performs tasks that look like reasoning. It does not prove that GPT-4 is actually reasoning. The distinction matters.

One possibility is that GPT-4 has learned, from its training data, the surface patterns of reasoning without understanding the underlying logic. It might be a sophisticated mimic. It produces the right words in the right order because it has seen similar patterns in text, not because it understands what those words mean.

The researchers acknowledged this. They wrote that "it is possible that GPT-4's performance on these tasks is due to memorization of similar problems in its training data." They tested for this by giving GPT-4 novel problems that could not have appeared in its training. The model still performed well. But this does not rule out the possibility that GPT-4 has learned general patterns of reasoning without understanding.

This worry has a long history in philosophy. John Searle famously argued that a computer could pass the Turing test by manipulating symbols without understanding them. GPT-4 might be the most sophisticated version of Searle's Chinese Room ever built.

The Limitations They Found

Bubeck and his team were not just cheerleaders. They actively looked for GPT-4's weaknesses. They found several.

GPT-4 struggles with tasks that require precise logical deduction over many steps. It can handle three or four steps, but it starts to break down after that. It also struggles with tasks that require common sense knowledge that is not well represented in text. For example, it does not know that a person cannot be in two places at once, unless that fact appears in its training data.

The model also has trouble with tasks that require understanding of physical causality. It knows that dropping a glass causes it to break, but it does not know why. It cannot predict what would happen if you dropped a glass in zero gravity unless it has seen text describing that scenario.

Most importantly, GPT-4 lacks a unified understanding of the world. It can reason about a problem in one domain, but it cannot integrate knowledge across domains. It knows that water boils at 100 degrees Celsius and that cooking pasta requires boiling water, but it cannot combine these facts to figure out how long to cook pasta at high altitude unless it has seen that specific fact in text.

What This Actually Means

The findings from Bubeck et al. (2023) have concrete implications for how we think about AI and intelligence.

▸ Language models can perform tasks that look like genuine reasoning, even though they are trained only to predict the next word. This means that the line between pattern matching and understanding is blurrier than we thought. We should be cautious about dismissing AI capabilities just because the underlying mechanism is simple.

▸ The fact that GPT-4 can reason about novel problems suggests that scale alone might be enough to produce general intelligence. If you train a large enough model on enough data, it might develop abilities that were not explicitly programmed. If that holds, it could shorten the expected timeline for AGI from decades to years.

▸ The limitations of GPT-4 show that current AI systems are brittle. They handle many tasks well, including some novel ones, but they break down on tasks that require deep understanding or long chains of reasoning. This means that AI will not replace humans anytime soon, but it will change which tasks are valuable for humans to do.

▸ The paper raises serious questions about how we test for intelligence. If a system can pass tests of reasoning without understanding, then our tests are measuring the wrong thing. We need new benchmarks that test for genuine understanding, not just surface performance.

▸ The most important takeaway is that we do not understand our own creations. GPT-4 was trained to predict words. It developed the ability to reason about imaginary vegetable cats. We did not design it to do that. It emerged from the training process. This means that we are building systems that we do not fully understand, and we need to be careful about deploying them in the real world.

The cat made of vegetables is still imaginary. But the intelligence that reasoned about it is real. And it is here.

References

[1] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, et al. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv. 1,542 citations.