Teaching AI to Understand Images Boosts Its Reasoning Power

The Blind Spot in AI Reasoning

Here is a strange fact about the most advanced AI systems in the world. A model like GPT-4 can write a flawless sonnet about a cat, explain quantum mechanics to a ten-year-old, and solve a calculus problem step by step. But show it a photograph of a cat sitting on a chair and ask, "Is this cat comfortable?" Until very recently, the model would have had no idea what you were talking about. It could not see the cat.

This gap between language intelligence and visual understanding has been the quiet scandal of AI research. For years, the smartest language models were effectively blind. They could reason about the world, but only through the narrow pipe of text. And here is the twist that a team from Microsoft and the University of Wisconsin-Madison recently reported: teaching an AI to see does not just help it describe images. It can also make the AI better at reasoning.

Liu et al. (2023) showed that when you connect a vision encoder to a large language model and train the pair on visual instruction data, the resulting model handles image-based reasoning tasks remarkably well. On the Science QA benchmark, their model LLaVA, combined with GPT-4, achieved a new state-of-the-art accuracy of 92.53 percent (Liu et al., 2023). That is a record on a benchmark built to test scientific reasoning.

The implication is almost too neat: to think better, an AI needs to see.

The Problem with Pure Text

Language models like GPT-3 and GPT-4 are extraordinary pattern matchers. They have ingested billions of words and learned the statistical structure of human language. They can generate text that feels like it was written by a person who has read everything. But they have a fundamental limitation. They have no grounding in the physical world.

Think about what happens when you read a sentence like "The man put the book on the table." If you are a human, you do not just process the words. You visualize the scene. You see a hand placing a rectangular object on a flat surface. You understand that the book is now supported by the table. You know that if someone later says "The table is empty," that is a contradiction. A pure language model has none of this. It only knows that the word "on" often follows "put the book" and precedes "the table." It is a brilliant mimic, but it does not actually know what "on" means.

This becomes a problem when you ask the model to reason about anything that involves spatial relationships, physical causality, or visual properties. A pure text model can tell you that a glass is breakable, but it cannot tell you whether a glass is more likely to break if dropped from a table or from a ladder. It has never seen a drop. It has never seen a ladder. It has only read about them.

Liu and colleagues identified this blind spot directly. They noted that instruction tuning large language models using machine-generated instruction-following data had improved zero-shot capabilities on new tasks, but the idea was less explored in the multimodal field (Liu et al., 2023). In plain English: everyone knew that training language models to follow instructions made them better at new tasks. But few had tried giving them pictures to go with those instructions.

The LLaVA Recipe

The team built a model called LLaVA, which stands for Large Language and Vision Assistant. The architecture is elegantly simple. They took a pretrained language model (Vicuna, itself a fine-tuned version of LLaMA) and connected it to a vision encoder (CLIP, which is good at matching images to text). Then they needed data. Lots of it.
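A minimal sketch of how the two pieces might be wired together, assuming a learned linear projection maps image features into the language model's embedding space. The projection step, the toy dimensions, and every name below are illustrative assumptions. Plain Python lists stand in for the real CLIP and Vicuna components.

```python
import random

def project(features, weights):
    """Map each visual feature vector into the language model's embedding size.

    `weights` has shape (vision_dim x text_dim); this is a plain matrix product.
    """
    return [
        [sum(f * w for f, w in zip(vec, col)) for col in zip(*weights)]
        for vec in features
    ]

def build_input_sequence(image_features, text_embeddings, weights):
    """Prepend projected image tokens to the text token embeddings."""
    return project(image_features, weights) + text_embeddings

random.seed(0)
VISION_DIM, TEXT_DIM = 4, 6  # toy sizes, far smaller than any real model

# Hypothetical stand-ins: random numbers instead of real encoder outputs.
weights = [[random.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(VISION_DIM)]
image_features = [[random.random() for _ in range(VISION_DIM)] for _ in range(3)]  # 3 image patches
text_embeddings = [[random.random() for _ in range(TEXT_DIM)] for _ in range(5)]   # 5 text tokens

sequence = build_input_sequence(image_features, text_embeddings, weights)
print(len(sequence), len(sequence[0]))  # 8 tokens, each of size 6
```

The point of the sketch is the shape of the idea: once image features live in the same space as word embeddings, the language model can treat them as just more tokens in its input.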

Here is where the cleverness comes in. Creating multimodal instruction-following data is expensive. You need humans to look at images and write instructions. That costs time and money. Liu and colleagues found a shortcut. They used GPT-4, the pure language model, to generate the instruction data for them. They fed GPT-4 text descriptions of images (captions and object bounding boxes) and asked it to create questions and answers about the images. Then they used that synthetic data to train LLaVA.

The authors described this as the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data (Liu et al., 2023). It is a kind of bootstrap. You take a blind teacher, have it describe what it would see if it could see, and then use those descriptions to teach a seeing student.

The training data included three types of examples. First, simple conversations: "What is in this image?" "A dog playing in the snow." Second, detailed descriptions: "Describe the color of the dog's fur and the texture of the snow." Third, complex reasoning: "Why might the dog be panting? What does the snow tell you about the temperature?"
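The bootstrap can be sketched as prompt construction: captions go in, and a text-only teacher is asked for one of the three kinds of example. This is a hedged illustration, not the authors' actual prompts. The wording of each instruction, the `RESPONSE_TYPES` table, and the `build_teacher_prompt` helper are all hypothetical, and no model is called here.

```python
# Hypothetical instructions for the three kinds of training example.
RESPONSE_TYPES = {
    "conversation": "Write a short question-and-answer exchange about the image.",
    "detailed_description": "Write a rich, detailed description of the image.",
    "complex_reasoning": "Write a question that requires reasoning about the image, then answer it.",
}

def build_teacher_prompt(captions, response_type):
    """Turn image captions into a text-only prompt for a language-model teacher."""
    if response_type not in RESPONSE_TYPES:
        raise ValueError(f"unknown response type: {response_type}")
    caption_block = "\n".join(f"- {c}" for c in captions)
    return (
        "You cannot see the image. It is described by these captions:\n"
        f"{caption_block}\n\n"
        f"{RESPONSE_TYPES[response_type]}\n"
        "Answer as if you were looking at the image directly."
    )

captions = ["A dog playing in the snow.", "A brown dog runs across a snowy field."]
for kind in RESPONSE_TYPES:
    print(f"--- {kind} ---")
    print(build_teacher_prompt(captions, kind))
```

In a real pipeline, each prompt would be sent to the teacher model and its reply paired with the original image to form one training example.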

After training on this data, LLaVA was evaluated on two tasks. The first was a synthetic multimodal instruction-following dataset where the authors compared LLaVA's responses to GPT-4's responses. LLaVA achieved an 85.1 percent relative score compared with GPT-4 (Liu et al., 2023). In other words, LLaVA's answers were rated at roughly 85 percent of the level of GPT-4's answers. The comparison is worth unpacking: LLaVA answered from the images themselves, whereas language-only GPT-4 can only work from text descriptions of them.
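To make "relative score" concrete, here is a small worked example. It assumes a judge gives each answer a score from 1 to 10 and that the relative score is the ratio of the two totals. That formula and the toy scores are assumptions for illustration, not figures from the paper.

```python
def relative_score(candidate_scores, reference_scores):
    """Candidate's total score as a percentage of the reference's total score."""
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("both models must be scored on the same questions")
    if not reference_scores:
        raise ValueError("need at least one scored question")
    return 100.0 * sum(candidate_scores) / sum(reference_scores)

# Hypothetical judge scores (1-10) on four questions.
candidate = [8, 7, 9, 6]    # total 30
reference = [9, 9, 10, 8]   # total 36

print(round(relative_score(candidate, reference), 1))  # 83.3
```

A relative score is only as meaningful as the judge and the reference behind it, which is why the number should be read as a comparison rather than an absolute grade.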

The second evaluation was on Science QA, a benchmark that tests scientific reasoning with images. Here, the results were striking. When LLaVA was fine-tuned on Science QA and combined with GPT-4, it reached 92.53 percent accuracy, a new state of the art for that benchmark (Liu et al., 2023).
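This article does not spell out how the two models were combined. One plausible scheme, sketched here purely as an assumption, keeps an answer when both models agree and hands disagreements to a tie-breaker. The function names, the toy answers, and the simulated judge are all hypothetical, and the improvement in this toy run exists by construction.

```python
def combine_answers(vision_answer, text_answer, tie_breaker):
    """Keep the answer when both models agree; otherwise ask the tie-breaker."""
    if vision_answer == text_answer:
        return vision_answer
    return tie_breaker(vision_answer, text_answer)

def accuracy(predictions, gold):
    """Percentage of predictions that match the gold answers."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Toy multiple-choice data, invented for illustration.
gold = ["A", "B", "C", "D"]
vision = ["A", "B", "D", "D"]   # vision-language model's answers
text = ["A", "C", "C", "D"]     # text-only model's answers

# Simulated judge: a lookup table standing in for a second model call.
JUDGE_CHOICES = {("B", "C"): "B", ("D", "C"): "C"}

def simulated_judge(vision_answer, text_answer):
    return JUDGE_CHOICES[(vision_answer, text_answer)]

combined = [combine_answers(v, t, simulated_judge) for v, t in zip(vision, text)]
print(accuracy(vision, gold), accuracy(text, gold), accuracy(combined, gold))
```

The design point is that two models with different blind spots can correct each other, provided the tie-breaker is right more often than it is wrong.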

Why Images Make You Smarter

The obvious explanation for this improvement is that the model now has access to more information. When you show it a diagram of a plant cell, it can see the chloroplasts and the cell wall. It does not have to guess from text alone. But that is too simple. The improvement goes deeper.

Consider what happens when a language model processes a sentence like "The ball rolled down the hill." A text-only model has a statistical representation of this sentence. It knows that "ball" and "hill" are nouns, that "rolled" is a verb, and that "down" indicates direction. But it does not know what rolling looks like. It does not know that a ball on a hill will accelerate due to gravity. It does not know that a cube would not roll the same way.

A multimodal model, by contrast, has been trained on images as well as text. In principle, it can pick up relationships that words only hint at: what a ball looks like, what a slope looks like, how a round object sits on a surface differently from a box. The hope is that this visual knowledge is not separate from its language knowledge but integrated with it. When the model processes the sentence, the visual representations can inform the linguistic ones. The model does not just know that "ball" and "hill" co-occur. It has some picture of what the sentence describes.

The authors use the phrase "the synergy of LLaVA and GPT-4" for the combined system that set the Science QA record (Liu et al., 2023). A similar division of labor exists inside LLaVA itself. The vision encoder provides grounding. The language model provides reasoning. Together, they are more than the sum of their parts.

What the Research Does Not Prove

It is important to be precise about what Liu and colleagues actually showed. Their paper demonstrates that connecting a vision encoder to a language model and training on visual instruction data improves performance on specific benchmarks. It does not prove that the model has achieved human-like understanding. It does not prove that the model can generalize to any visual reasoning task. And it does not prove that the model's internal representations are genuinely multimodal in the way a human brain is.

There is an open question here. When LLaVA looks at an image of a cat on a chair, does it actually "see" the cat? Or is it just matching visual features to textual descriptions it has memorized? The authors do not claim to have solved this problem. They call their work "the first attempt" (Liu et al., 2023), which is a humble way of saying that much more work remains.

Another limitation is the data itself. The instruction data was generated by GPT-4, which is a language model. That means the data reflects what a language model thinks an image should look like, not what an image actually looks like. There is a risk of circularity. The teacher is blind, so the student may inherit some of that blindness. If GPT-4 has a misconception about what a particular scene looks like, that misconception will be baked into the training data.

The authors note that their model sometimes "exhibits the behaviors of multimodal GPT-4 on unseen images/instructions" (Liu et al., 2023). That is impressive, but remember where the training data came from: a language-only GPT-4 that worked from text, not pixels. In that sense the training signal is a copy of a copy. The question is whether this copying introduces errors or limitations that would not exist if the training data came from humans.

The Bigger Picture: What This Means for AI

The LLaVA paper is part of a larger shift in AI research. For years, the dominant paradigm was to build bigger and bigger language models. The assumption was that if you fed a model enough text, it would eventually learn everything it needed to know about the world. This assumption is now being questioned.

The evidence from Liu and colleagues suggests that language alone is not enough. To reason about the physical world, an AI needs some form of perceptual grounding. It needs to see, hear, or touch. Text is a compressed representation of experience, but it is not experience itself. A model that has only read about rain does not know what rain feels like. A model that has only read about the color red does not know what red looks like.

This has practical implications. If you want to build an AI that can help scientists analyze microscope images, you need it to see the images. If you want an AI that can assist doctors in reading X-rays, you need it to see the X-rays. If you want an AI that can navigate a physical environment, you need it to see the environment. Pure language models are not enough for any of these tasks.

But the deeper implication is about the nature of intelligence itself. The LLaVA results are consistent with the idea that reasoning is not a purely abstract process. It is embodied. It is tied to perception. This is not a new idea in philosophy. Immanuel Kant argued that all knowledge begins with experience. Jean Piaget argued that children's cognitive development is grounded in physical interaction with the world. The LLaVA paper can be read, in a sense, as a piece of empirical support for this principle in machines.

What This Actually Means

▸Multimodal training is not a luxury. It is a necessity. If you want an AI that can reason about the physical world, you must give it access to perceptual data. Text-only models will always be limited to the world as described in words, not the world as it actually is.

▸Synthetic data can work, but it has limits. The LLaVA team used GPT-4 to generate training data, which saved time and money. But this approach inherits the biases and blind spots of the language model. For high-stakes applications like medicine or scientific research, human-annotated data is still safer.

▸The benchmark matters. The 92.53 percent accuracy on Science QA is impressive, but it is a specific benchmark. Do not assume that this model would perform equally well on, say, a driving test or a visual puzzle. Always ask: what was it actually tested on?

▸Simple architectures can yield large gains in performance. The LLaVA architecture is not radically new. It connects existing components in a clever way. The innovation is in the training data and the training procedure, not in a brand new algorithm. This is a reminder that data engineering often matters more than model architecture.

▸The future of AI is multimodal. The LLaVA paper is one of the first systematic attempts to combine vision and language for instruction following. More will follow. The models that succeed in the next few years will be the ones that can see, hear, and read simultaneously. The blind text models will be left behind.

References

[1] Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee (2023). Visual Instruction Tuning. arXiv.