Teaching AI with Instructions Makes It Learn Better

The Great Flip: Why Giving AI Instructions Changed Everything

For years, the dominant strategy for making language models smarter was simple: make them bigger. More parameters, more data, more compute. The assumption was that raw scale would eventually unlock reasoning, like a child who just needs to read a few more encyclopedias before the lightbulb clicks on.

Then a group of researchers at Google did something almost mundane by comparison. Instead of just feeding their model more text, they started giving it instructions.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, and their colleagues took a 540-billion-parameter model called PaLM and finetuned it on 1,836 different tasks phrased as instructions. The result was not a small improvement. It was a leap. Their instruction-finetuned model, Flan-PaLM 540B, outperformed the base PaLM 540B by an average of 9.4% across a battery of benchmarks (Chung et al., 2022). On the Massive Multitask Language Understanding benchmark, MMLU, it hit 75.2% in a five-shot setting, a state-of-the-art score at the time.

This was the moment the field realized something strange: the way you teach an AI matters as much as how much you feed it. Instructions are not just a nicer interface. They are a better learning mechanism.

Why "Do This" Works Better Than "Know This"

The standard approach to training language models is deceptively simple. You take a massive corpus of text from the internet, books, Wikipedia, code repositories, and you train the model to predict the next word. That is it. The model learns patterns, correlations, statistical regularities. It becomes a kind of statistical mirror of everything humans have written.

This works surprisingly well. Models like GPT-3 and PaLM can generate coherent paragraphs, answer questions, even write code. But there is a gap between knowing something and being able to apply it on command. A model trained only on next-word prediction might know the capital of France, but ask it to "list the three largest cities in France in descending order of population" and it might stumble. The knowledge is there, but the ability to follow a multi-step instruction is not.

Chung and his team realized that the missing piece was not more data. It was task structure. They took 1,836 different tasks, each phrased as a natural language instruction, and finetuned the model on this collection (Chung et al., 2022). The tasks ranged from translation to question answering to reasoning problems. Some were simple, like "Translate this sentence to French." Others were complex, requiring multiple steps of reasoning.

The critical insight was that the model was not just learning the answers to specific questions. It was learning a generalizable skill: how to parse an instruction, hold it in working memory, and execute a sequence of operations to satisfy it. This is fundamentally different from memorizing facts. It is learning a procedure.

The Numbers That Made People Pay Attention

The paper is titled "Scaling Instruction-Finetuned Language Models," and the word "scaling" is doing heavy lifting. The authors tested three variables: the number of tasks, the size of the model, and the inclusion of chain-of-thought data.

The results were striking. Flan-PaLM 540B, the largest model they instruction-finetuned, outperformed the base PaLM 540B by a large margin across the board. On the BIG-Bench Hard benchmark, which tests reasoning on challenging tasks, the improvement was dramatic. On TyDiQA, a multilingual question-answering dataset, Flan-PaLM 540B showed gains of over 10 percentage points in some languages (Chung et al., 2022).

But perhaps the most surprising finding was about model size. The authors found that instruction finetuning did not just help large models. It helped smaller models even more, proportionally. The instruction-finetuned T5 models, which are much smaller than PaLM, achieved strong few-shot performance that rivaled much larger models. For example, Flan-T5 XXL, with only 11 billion parameters, performed competitively with PaLM 62B, a model nearly six times its size (Chung et al., 2022).

This is the kind of result that makes researchers reconsider their assumptions. If a smaller model can match a larger one simply by being trained with instructions, then maybe the field has been over-indexing on scale and under-indexing on training methodology.

The Chain-of-Thought Trick

One of the most interesting sub-findings in the paper involves chain-of-thought reasoning. Chain of thought is a technique where you show the model not just the answer to a problem, but the step-by-step reasoning that leads to it. For example, instead of training a model on "Q: What is 23 times 47? A: 1081," you train it on "Q: What is 23 times 47? A: 23 times 40 is 920. 23 times 7 is 161. 920 plus 161 is 1081."
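The difference between the two training formats can be sketched in a few lines of Python. This is only an illustration of the example above, not the authors' data pipeline: the function names are hypothetical, and the decomposition assumes a non-negative two-digit multiplier.

```python
def direct_example(a: int, b: int) -> str:
    """Training example that contains only the final answer."""
    return f"Q: What is {a} times {b}? A: {a * b}"


def chain_of_thought_example(a: int, b: int) -> str:
    """Training example that spells out the tens/ones decomposition of b.

    Assumes b is a non-negative two-digit number (illustrative only).
    """
    tens, ones = (b // 10) * 10, b % 10
    partial_tens, partial_ones = a * tens, a * ones
    return (
        f"Q: What is {a} times {b}? "
        f"A: {a} times {tens} is {partial_tens}. "
        f"{a} times {ones} is {partial_ones}. "
        f"{partial_tens} plus {partial_ones} is {partial_tens + partial_ones}."
    )


print(direct_example(23, 47))
print(chain_of_thought_example(23, 47))
```

Both strings share the same question. Only the target differs, which is the point: the chain-of-thought version exposes the intermediate steps as part of what the model is trained to produce.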

The authors included chain-of-thought data in their instruction finetuning mix. They found that models trained with chain-of-thought examples performed significantly better on reasoning benchmarks like MMLU and BBH, even when tested in a zero-shot setting where the prompt contained no worked chain-of-thought examples (Chung et al., 2022). The model had internalized the reasoning structure.

This is a powerful finding. It suggests that instruction finetuning is not just about teaching the model to follow orders. It is about teaching the model to think in a structured way. The chain-of-thought data effectively shows the model a template for reasoning, and once learned, the model applies that template to new problems.

How They Did It: The Recipe

The methodology is worth understanding because it reveals how deliberate the process was. The authors started with PaLM, a pretrained language model. They then assembled a collection of 1,836 tasks from publicly available datasets. Each task was rephrased as an instruction. For example, a sentiment classification dataset might be rephrased as "Classify the sentiment of this movie review as positive, negative, or neutral."
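As a rough sketch of what "rephrasing a dataset as an instruction" can look like in practice, the snippet below renders one labeled example under a couple of templates. The class, the function, and the template wording are all hypothetical and chosen for illustration. They are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    text: str
    label: str


# Hypothetical templates: the wording is illustrative, not the paper's.
TEMPLATES = [
    "Classify the sentiment of this movie review as positive, negative, "
    "or neutral.\n\nReview: {text}\nSentiment:",
    "Review: {text}\n\nIs the sentiment of this review positive, negative, "
    "or neutral?",
]


def to_instruction_pairs(example: LabeledExample) -> list[dict[str, str]]:
    """Render one labeled example under every template as an input/target pair."""
    return [
        {"input": template.format(text=example.text), "target": example.label}
        for template in TEMPLATES
    ]


example = LabeledExample("A slow start, but the ending is wonderful.", "positive")
for pair in to_instruction_pairs(example):
    print(pair["input"])
    print("->", pair["target"])
    print()
```

Using more than one template per dataset means the same underlying example is seen under different phrasings, so the model is less likely to latch onto a single surface form of the instruction.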

They then finetuned the model on this instruction collection. Importantly, they did not train on each task equally. They used a technique called temperature sampling to balance the contribution of each task, preventing larger datasets from dominating the training.
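The article does not give the mixing formula, so the following is an assumption: one common formulation of temperature-based mixing samples each task with probability proportional to its size raised to the power 1/T. The task names and sizes below are made up to show the effect.

```python
def mixing_weights(task_sizes: dict[str, int], temperature: float) -> dict[str, float]:
    """Sampling probability per task, proportional to size ** (1 / temperature).

    Assumed formulation for illustration; not confirmed as the paper's recipe.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {name: size ** (1.0 / temperature) for name, size in task_sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical dataset sizes, chosen only to show the effect.
sizes = {"translation": 1_000_000, "sentiment": 10_000, "arithmetic": 100}

for t in (1.0, 2.0, 100.0):
    weights = mixing_weights(sizes, t)
    print(t, {name: round(w, 4) for name, w in weights.items()})
```

At a temperature of 1 the weights are simply proportional to dataset size, so the largest dataset dominates. Raising the temperature flattens the distribution toward uniform, which gives small tasks a larger share of the training batches.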

The finetuning process was relatively lightweight. They used a learning rate of 5e-5 and trained for 10,000 steps with a batch size of 32. The entire finetuning took a fraction of the compute required for the original pretraining (Chung et al., 2022).

The authors tested the resulting models on a wide range of benchmarks: MMLU for knowledge and reasoning, BBH for challenging reasoning, TyDiQA for multilingual question answering, MGSM for math problems in multiple languages, and open-ended generation tasks. In every case, instruction finetuning improved performance.

What This Changes About How We Think About AI

Before this paper, the dominant narrative in AI was that bigger models were always better. The assumption was that intelligence would emerge from scale alone, like a phase transition in physics. Instruction finetuning challenged that assumption. It showed that how you train a model matters as much as how big it is.

The practical implication is huge. If you can make a smaller model perform like a larger one by training it with instructions, you can save enormous amounts of compute, energy, and money. This is not just an academic insight. It has direct implications for deploying AI systems in resource-constrained environments.

It also changes how we think about the relationship between data and intelligence. The internet is full of text, but most of that text is not structured as instructions. It is narrative, description, conversation, argument. By converting existing datasets into instruction format, the authors effectively created a new kind of training signal. They showed that the format of the data matters as much as the content.

What This Does Not Prove

The paper is impressive, but it does not prove that instruction finetuning is the final answer. There are several open questions.

First, the authors tested their method on a specific set of model families: PaLM, T5, and U-PaLM. It is not clear that the same benefits would apply to every model. Some architectures might be more or less amenable to instruction finetuning.

Second, the instruction datasets were all in English. The authors tested multilingual benchmarks, but the instructions themselves were in English. It is an open question whether instruction finetuning in one language transfers to other languages, or whether you need instruction data in each language.

Third, the paper does not address the problem of instruction following in adversarial or ambiguous contexts. The instructions in the training data were clear and well-formed. Real-world instructions are often vague, contradictory, or even malicious. How well does instruction finetuning handle those cases? The paper does not say.

Fourth, there is the question of diminishing returns. The authors scaled the number of tasks up to 1,836 and saw improvements, though the paper itself notes that most of the gain arrives within the first few hundred tasks. Would scaling to 10,000 tasks help more, or is there a saturation point where adding more instructions stops helping? The paper does not test beyond its own collection.

Finally, the paper does not address the possibility that instruction finetuning might introduce new failure modes. For example, a model that is too good at following instructions might be more susceptible to prompt injection attacks, where a user tricks the model into doing something harmful by phrasing it as an instruction.

What This Actually Means

▸Training format matters as much as raw scale. An instruction-finetuned smaller model can compete with a much larger model trained only on raw text, as Flan-T5 XXL does against PaLM 62B (Chung et al., 2022). If you are building an AI system, consider spending compute on instruction finetuning before you spend it on scaling up.

▸Chain-of-thought data is a force multiplier. Including step-by-step reasoning examples in your instruction set improves performance on reasoning tasks, even when the prompt contains no worked chain-of-thought examples at test time. This is a low-cost, high-return intervention.

▸You do not need to create new tasks from scratch. The authors used publicly available datasets and rephrased them as instructions. This means you can take existing data and convert it into instruction format without collecting new data.

▸Small models can punch above their weight. Flan-T5 XXL, with 11 billion parameters, was competitive with PaLM 62B on several benchmarks after instruction finetuning (Chung et al., 2022). If you are constrained by compute or budget, instruction finetuning is an efficient way to get more performance from a smaller model.

▸The method is general. The authors tested it on three different model families, multiple prompting setups, and a wide range of benchmarks. Instruction finetuning is not a one-off trick. It is a broadly applicable technique.

References

[1] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, et al. (2022). Scaling Instruction-Finetuned Language Models. arXiv.