AI Learns New Visual Tasks from Just a Few Examples

The One Shot Problem

Imagine teaching a child what a "zebra" is. You point at a picture, say the word once, and they get it. They will not confuse it with a horse with stripes. They will not need 50,000 labeled zebra photos to confirm the concept. They will walk into a savannah and point at the striped animal and say "zebra" with a confidence that borders on annoying.

Now imagine teaching a machine the same thing. For years, the standard recipe has been grotesque in its data appetite: gather hundreds of thousands of labeled images, train a neural network for days on expensive hardware, and hope the model generalizes. If you want to teach it a new visual task, say identifying a specific species of bird or answering a question about a medical scan, you start over. You gather new data. You fine-tune. You pray.

This is not how intelligence works. And for a long time, the field of computer vision simply accepted that machines would be data gluttons. Then in 2022, a team of researchers at DeepMind published a paper that quietly suggested a different path. The paper is called "Flamingo: a Visual Language Model for Few-Shot Learning" (Alayrac et al., 2022). The title undersells what they actually did.

Flamingo is a visual language model that can learn new tasks from a handful of examples. Not a thousand. Not a hundred. Sometimes as few as four. And it does not need to be retrained. You show it a few examples, and it adapts on the fly. It is one of the closest things computer vision has produced to the way a human picks up a new concept.

This is not incremental progress. This is a different philosophy about how to build models that see and understand.

The Architecture That Refuses to Start From Scratch

The key insight behind Flamingo is almost embarrassingly simple once you hear it, which is usually the sign of a good idea. Most previous approaches to visual question answering or image captioning required training or fine tuning a separate model for each specific task. Want a model that answers questions about photos of kitchens? Train one. Want one that describes videos of soccer games? Train another. Each model is a fresh sculpture, chiseled from raw stone.

Alayrac and his colleagues asked a different question: what if you could take two already powerful models, one that understands images and one that understands language, and just build a thin bridge between them?

Here is how they did it. The team started with a pretrained vision model (a convolutional neural network from the NFNet family) that could already recognize objects, scenes, and patterns in images. They also started with a pretrained language model (a large transformer called Chinchilla) that could already generate fluent text, answer questions, and reason. Both models were state of the art in their respective domains. They just could not talk to each other.

Flamingo's innovations are a module the authors call the "Perceiver Resampler" and a set of gated cross attention layers inserted between the layers of the language model. These are technical names for a simple function: they take the visual information from the vision model and convert it into a format the language model can ingest. Think of it as a translator who speaks both "image feature" and "word embedding" fluently. The translator is lightweight. It does not need to be as powerful as the two experts it connects. It just needs to be good at its job.

This architectural choice is what makes few shot learning possible. Because the vision and language models are already pretrained on massive datasets, they bring a vast amount of knowledge to the table. The bridge only needs to be trained to align the two modalities. And because the bridge is trained on interleaved text and images from the web, Flamingo learns something unexpected: it learns how to learn from context.
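To make the "train only the bridge" idea concrete, here is a minimal sketch of an update step that leaves the pretrained parts untouched. It is a toy, not the paper's training code: the parameter names and the scalar values standing in for whole weight tensors are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Param:
    value: float
    trainable: bool


def sgd_step(params, grads, lr=0.1):
    """Apply one SGD update, skipping every frozen parameter."""
    for name, p in params.items():
        if p.trainable:
            p.value -= lr * grads[name]


# Hypothetical scalar stand-ins for whole weight tensors.
params = {
    "vision_encoder.w": Param(1.0, trainable=False),  # pretrained, frozen
    "language_model.w": Param(2.0, trainable=False),  # pretrained, frozen
    "bridge.w": Param(0.0, trainable=True),           # new, trained
}
grads = {name: 1.0 for name in params}  # pretend every gradient is 1.0

sgd_step(params, grads)
for name, p in params.items():
    print(name, p.value)
```

Only the bridge value moves. The two experts keep exactly what they learned in pretraining, which is why their knowledge survives the process of being wired together.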

How a Handful of Examples Beats Thousands

The Flamingo paper reports results that sound like a magic trick. Across 16 image and video benchmarks, a single Flamingo model, prompted with no more than 32 examples per task, beat every previous zero shot and few shot method. On 6 of those 16 benchmarks it also outperformed state of the art models that had been fine tuned on the full task specific training sets, which hold thousands of times more labeled examples (Alayrac et al., 2022). That is a gap of at least three orders of magnitude in task specific data.

Let that sink in. The old approach needed tens or hundreds of thousands of labeled examples per task. On those benchmarks, Flamingo matched or exceeded it with a few dozen. On others, VQAv2 among them, fine tuned models still lead. Scale helps too: the largest model has 80 billion parameters, and the paper reports that bigger models make better few shot learners. But what lets the model use a prompt this way at all is an architecture and training setup built to support in context learning.

What does "in context learning" mean here? The model is not storing the examples in its weights. It does not update its parameters when you show it a new task. Instead, it treats the examples as part of the prompt. You give it a sequence: an image, a question, an answer. Then another image, another question, another answer. Then a new image with a question, and the model predicts the answer. It learns the pattern from the sequence itself, not from gradient updates.
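A few shot prompt of this kind is easy to sketch. The snippet below assembles one under stated assumptions: the `<image>` and `<EOC>` markers and the "Question: ... Answer: ..." template are illustrative placeholders, and `build_few_shot_prompt` is a hypothetical helper, not an API from the paper.

```python
from typing import List, Tuple

IMAGE_TAG = "<image>"   # placeholder marking where an image sits in the sequence
END_OF_CHUNK = "<EOC>"  # placeholder marking the end of one worked example


def build_few_shot_prompt(
    shots: List[Tuple[str, str, str]],
    query_image: str,
    query_question: str,
) -> Tuple[str, List[str]]:
    """Return (text, images): the interleaved prompt text and the images in order."""
    parts, images = [], []
    for image, question, answer in shots:
        parts.append(f"{IMAGE_TAG}Question: {question} Answer: {answer}{END_OF_CHUNK}")
        images.append(image)
    # The query repeats the pattern but leaves the answer open for the model.
    parts.append(f"{IMAGE_TAG}Question: {query_question} Answer:")
    images.append(query_image)
    return "".join(parts), images


shots = [
    ("zebra.jpg", "What animal is this?", "zebra"),
    ("heron.jpg", "What animal is this?", "heron"),
]
text, images = build_few_shot_prompt(shots, "okapi.jpg", "What animal is this?")
print(text)
print(images)
```

Nothing here touches the model's weights. The "training set" is just the front of the prompt, and swapping tasks means swapping the examples.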

This is how humans use analogies. If I show you three photos of different dog breeds and tell you their names, you can probably guess the name of a fourth breed you have never seen before. You are not retraining your brain. You are using context.

The Flamingo paper demonstrates this across multiple benchmarks, including image captioning on COCO, video question answering on MSRVTT-QA, and knowledge based visual question answering on OK-VQA. In every case, few shot Flamingo beat the previous zero shot and few shot methods, and on a subset of the benchmarks (6 of the 16 evaluated) it also beat fully fine tuned baselines. The authors write that, on numerous benchmarks, Flamingo "outperforms models fine-tuned on thousands of times more task-specific data" (Alayrac et al., 2022). That is not a modest claim. But the numbers back it up.

The Secret Ingredient: Interleaved Web Data

You might wonder how the bridge between vision and language learns to handle context so well. The answer is training data, but not the kind you might expect. The team did not curate a clean dataset of image caption pairs. They scraped the web.

Specifically, they collected a large corpus of web pages containing arbitrarily interleaved text and images. A blog post about a vacation might have a paragraph, then a photo, then another paragraph, then a photo of a different location. A news article might have an image, a caption, a quote, another image. The data is messy. It is unstructured. It is exactly the kind of information a human encounters every day.

Training on this interleaved data teaches Flamingo something that training on clean caption datasets cannot: how to handle sequences of visual and textual information that are not tightly coupled. The model learns to pay attention to the relationship between an image and the text that surrounds it, even when that relationship is loose or implicit. This is what enables the model to understand a prompt that includes multiple images and questions in sequence.
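One way to picture how a model keeps track of which image a stretch of text belongs to: for every text token, record the most recent image that came before it. As I read the paper, Flamingo's cross attention uses a rule of this kind, with earlier images still reachable through the language model's own self attention. The sketch below shows only the bookkeeping; the `("image", ...)` and `("text", ...)` tuple format is a hypothetical representation chosen for the example.

```python
from typing import List, Optional, Tuple


def last_image_index(sequence: List[Tuple[str, str]]) -> List[Optional[int]]:
    """For each text token, return the index of the most recent preceding image.

    `sequence` holds ("image", name) or ("text", token) items in page order.
    Text that appears before any image maps to None (no visual input).
    """
    result = []
    current = None
    images_seen = 0
    for kind, _ in sequence:
        if kind == "image":
            current = images_seen
            images_seen += 1
        else:
            result.append(current)
    return result


page = [
    ("text", "Our"), ("text", "trip"),
    ("image", "beach.jpg"), ("text", "The"), ("text", "beach"),
    ("image", "market.jpg"), ("text", "The"), ("text", "market"),
]
print(last_image_index(page))  # [None, None, 0, 0, 1, 1]
```

The same rule works for a blog post with two photos and for a prompt with four worked examples, which is part of why web pages make such a natural rehearsal for few shot prompts.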

The authors call this "in context few shot learning." It is a phrase that sounds technical but describes something intuitive. The model learns to treat the prompt as a mini curriculum. The first example teaches it the format. The second example reinforces the pattern. By the third or fourth example, the model has inferred the task. It does not need to be told explicitly "now answer questions about this type of object." It just figures it out from the sequence.

This is a fundamental shift. Many earlier few shot learning methods relied on meta learning, where a model is trained on many small tasks so it learns how to adapt quickly. Flamingo does not need an explicit meta learning procedure. It just needs a good bridge and the right kind of training data.

What Flamingo Actually Sees

To understand why this matters, it helps to know what Flamingo does not do. It does not process images the way a human does. It does not see a zebra and think "striped horse." Instead, the vision model breaks the image down into a grid of features, each representing a small patch of the image. The Perceiver Resampler then compresses these features into a smaller set of "visual tokens" that the language model can attend to.
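The compression step can be sketched in a few lines. This is a heavily simplified stand-in for the Perceiver Resampler, assuming a single cross attention pass with no learned projections or feed forward layers: a fixed set of learned query vectors attends over however many patch features arrive, and the output always has as many vectors as there are queries. The sizes and random values are toy choices.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def resample(features, latents):
    """Cross attend a fixed set of latent queries over a variable number of features.

    features: N vectors of dimension d (one per image patch).
    latents:  K query vectors of dimension d (learned in a real model).
    Returns K vectors of dimension d, whatever N is.
    """
    d = len(latents[0])
    out = []
    for q in latents:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in features]
        weights = softmax(scores)
        out.append([sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)])
    return out


rng = random.Random(0)
d, num_latents = 8, 4  # toy sizes; the article cites 64 visual tokens
latents = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_latents)]
small = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(16)]
large = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(400)]
print(len(resample(small, latents)), len(resample(large, latents)))  # 4 4
```

Sixteen patches or four hundred, the language model is handed the same small number of visual tokens. That fixed budget is what keeps the cost of attending to images predictable.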

The language model, for its part, generates text one token at a time, attending to both the tokens that came before and the visual tokens from the image. When you give Flamingo a few shot prompt, the whole interleaved sequence of images and text sits in its context, so it can infer the mapping from the examples.

This architecture is surprisingly efficient. The Perceiver Resampler reduces a large, variable number of visual features to a fixed set of just 64 visual tokens. This makes the cross attention computation manageable. The gated mechanism allows the model to control how much visual information influences the text generation at each step. Sometimes the image is crucial. Sometimes the text context is more important. The model learns to balance both.
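The gate itself is a small piece of arithmetic. Here is a minimal sketch, assuming the tanh gating with a learnable scalar that the paper describes, where the scalar starts at zero. The vectors are made up, and `visual_update` stands in for the output of a cross attention layer.

```python
import math


def gated_residual(text_state, visual_update, alpha):
    """Add a visually conditioned update to a text hidden state through a tanh gate."""
    gate = math.tanh(alpha)
    return [t + gate * v for t, v in zip(text_state, visual_update)]


text_state = [0.5, -1.0, 2.0]     # hidden state from the frozen language model
visual_update = [1.0, 1.0, -1.0]  # stand-in for the cross attention output

# alpha = 0: the gate is closed and the language model's state passes through unchanged.
closed = gated_residual(text_state, visual_update, alpha=0.0)
# alpha = 3: tanh(3) is close to 1, so nearly the whole visual update is added.
opened = gated_residual(text_state, visual_update, alpha=3.0)
print(closed)
print(opened)
```

Starting with the gate closed means the combined model begins life behaving exactly like the original language model. Vision is then let in gradually, as training moves alpha away from zero.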

The authors tested Flamingo on both images and videos. For videos, they simply sampled frames and fed them as a sequence of images. The model handled this without modification. It could answer questions about a video clip after seeing just a few examples of video question answering. Again, no fine tuning. Just the prompt.
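Turning a clip into "a sequence of images" can be as simple as picking frame indices. The sketch below assumes evenly spaced sampling, which is an illustrative choice: the article says only that frames were sampled. `sample_frame_indices` is a hypothetical helper.

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list:
    """Pick `num_samples` roughly evenly spaced frame indices from a clip."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame at the middle of each of the equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


print(sample_frame_indices(240, 8))  # [15, 45, 75, 105, 135, 165, 195, 225]
```

Each selected frame then goes through the vision model like any other image, so a video question looks to the language model much like a prompt with several pictures in a row.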

What This Does Not Prove

Flamingo is impressive, but it is not magic. The paper itself acknowledges several limitations. First, the model still requires a massive amount of pretraining data. The vision and language models were trained on billions of examples. The bridge was trained on roughly 2.1 billion image text pairs, alongside the interleaved web pages. This is not a recipe for building a model from scratch with minimal data. It is a recipe for building a model that can adapt to new tasks with minimal data, after an expensive pretraining phase.

Second, few shot performance is not always better than fine tuning. On some benchmarks, particularly those requiring very specialized knowledge, fine tuned models still win. Flamingo is good at general tasks that can be inferred from a few examples. It is less good at tasks that require memorizing rare facts or subtle visual distinctions that are not obvious from context.

Third, the model is not interpretable. You cannot look at the attention weights and understand why Flamingo answered a particular way. You cannot debug its reasoning. You can only observe its behavior. This is a problem for deployment in high stakes domains like medicine or law, where you need to know why a model made a decision.

Fourth, the paper does not fully settle the question of data contamination. Some of the few shot examples might overlap with the pretraining data. If the model has already seen similar images and questions during training, its few shot performance might be inflated. The authors attempted to control for this by testing on benchmarks that were released after their training data was collected, but the concern remains.

Finally, Flamingo is a research prototype, not a product. The model is large, requiring significant compute to run. It is not something you can deploy on a phone or a drone. The paper is proof of concept, not a ready to use system.

What This Actually Means

The Flamingo paper is not just another incremental improvement in accuracy. It is a demonstration that the way we have been building vision language models is probably wrong. Here is what the paper actually tells us, stripped of hype.

▸Data efficiency on new tasks is achievable through architecture, not just scale. For years, the assumption was that every new task needed its own large labeled dataset. Flamingo shows that, once the pretraining is paid for, the bottleneck is less the quantity of task specific data than how you connect vision and language. The bridge matters as much as the endpoints.

▸Pretrained models are a better starting point than task specific training. The old approach of training a model from scratch for each task is wasteful. Flamingo shows that you can take two powerful pretrained models, add a thin connector, and get a model that generalizes across tasks. This changes the economics of building AI systems.

▸In context learning works for vision, not just language. The concept of in context learning was popularized by large language models like GPT-3, where you prompt the model with a few examples and it completes the pattern. Flamingo extends this to the visual domain. This suggests that the same principles might apply to other modalities, like audio or sensor data.

▸Interleaved web data is a rich training signal. The messy, unstructured nature of web pages turns out to be a feature, not a bug. Training on interleaved text and images teaches models to handle context in a way that clean caption datasets cannot. This has implications for how we collect training data going forward.

▸The gap between machine and human learning is still large, but it is shrinking. Four examples is not one example. Flamingo is not a child. It still requires billions of pretraining examples to reach this point. But the direction is clear. We are moving toward models that can learn from fewer and fewer examples. The question is no longer whether it is possible. The question is how far we can push it.

The Flamingo paper is a quiet landmark. It does not announce a revolution. It just shows a better way to build a model that sees and understands. The implications are only beginning to unfold.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. arXiv.