Why Transformers May Not Be the Best for Time Series Forecasting

The Plot Twist in Time Series Forecasting

In 2023, a group of researchers from the Chinese University of Hong Kong and the International Digital Economy Academy did something that should have made the machine learning conference circuit uncomfortable. They took a handful of the most sophisticated Transformer based forecasting models, the kind that had been racking up citations and conference acceptances, and pitted them against something called LTSF-Linear. It was a set of one layer linear models. No attention. No positional encoding. No multi head mechanisms. Just a simple weighted sum of past values.

The linear model won. Every single time. And often by a margin that made the Transformers look like they were guessing.

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu tested their simple models on nine real world datasets covering weather patterns, electricity consumption, traffic flow, exchange rates, and more (Zeng et al., 2023). The Transformers, some with millions of parameters and elaborate self attention designs, consistently produced worse forecasts than a model that could be written in a few lines of Python. The paper, published in the Proceedings of the AAAI Conference on Artificial Intelligence and now cited over 2,500 times, did not just question the Transformer hype. It gutted it.

Why Self Attention Fails at Time

The core claim of the paper is not that Transformers are bad. It is that they are philosophically wrong for this task.

Self attention is permutation invariant. On its own, it treats its inputs as a set and has no notion of which one came first. That is tolerable for language. In a sentence, much of the meaning survives even if some words are reordered, and positional encodings restore enough order information for the model to work. But in time series, the ordering is the entire point. The 47th reading in a sequence is not just a token with a position tag. It is a measurement that exists because of the 46 readings before it. The relationship is not semantic. It is causal and continuous.

Zeng et al. (2023) argue that Transformers, by design, treat time points as a set of elements to be correlated through attention weights, rather than as a sequence to be followed. The positional encoding is a patch, not a solution. It tells the model where each token sits, but it does not force the model to respect the temporal order. The attention mechanism can still jump around, mixing readings from anywhere in the input window regardless of how far apart or in what order they occurred, and some temporal information is lost along the way.
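To make the order argument concrete, here is a minimal sketch in plain Python of single head self attention with no positional encoding. The identity query, key, and value projections and the toy readings are assumptions chosen for illustration, not anything taken from Zeng et al. (2023). Strictly speaking, the layer is permutation equivariant: shuffling the input only shuffles the output in the same way, so any order agnostic summary of the output, such as its mean, cannot tell the two orderings apart.

```python
import math

def self_attention(xs):
    """Single-head self attention over scalars, with identity projections
    (an assumption for illustration) and no positional encoding."""
    out = []
    for q in xs:
        scores = [q * k for k in xs]                  # dot-product scores
        m = max(scores)                               # stabilise the softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, xs)) / total)
    return out

series = [0.1, 0.4, 0.9, 1.6, 2.5]        # toy readings in time order
perm = [3, 0, 4, 1, 2]                    # an arbitrary shuffle
shuffled = [series[i] for i in perm]

out_ordered = self_attention(series)
out_shuffled = self_attention(shuffled)

# The shuffled output is just the ordered output, shuffled the same way.
same_rows = all(
    math.isclose(out_shuffled[j], out_ordered[i], rel_tol=1e-9)
    for j, i in enumerate(perm)
)

# A pooled summary of the output is therefore blind to the ordering.
mean_gap = abs(sum(out_ordered) / len(series) - sum(out_shuffled) / len(series))
```

Here `same_rows` comes out true and `mean_gap` is zero up to floating point noise. Everything the layer knows about order has to be injected from outside, which is the role positional encodings play.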

The authors probed this by shuffling the input sequence before feeding it to each model. On some datasets the Transformers' errors barely moved, while the linear models' errors rose sharply. That suggests the Transformers were extracting little from the temporal order in the first place. A linear model, by contrast, has no choice but to respect the order. It ties one weight to each position in the input window, so scrambling the positions scrambles what it has learned.

The Experiment That Should Have Been Obvious

LTSF-Linear is almost absurdly simple. In its most basic form, it takes the input sequence, applies a single linear layer along the time axis, and outputs the forecast. No activation functions. No normalization. No attention. Just a matrix multiplication.
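As a rough sketch of how little is involved, the forward pass can be written in plain Python. The function name `linear_forecast`, the window sizes, and the hand picked weights are illustrative assumptions. In the paper's setting the weights are learned from data rather than chosen by hand.

```python
def linear_forecast(history, weights, bias):
    """One linear layer along the time axis.

    history: the last L observations, oldest first.
    weights: T rows of L coefficients, one row per forecast step.
    bias:    T offsets, one per forecast step.
    Returns T forecasts, each a weighted sum of the L past values.
    """
    return [
        sum(w * x for w, x in zip(row, history)) + b
        for row, b in zip(weights, bias)
    ]

# Hypothetical example: L = 4 past values, T = 2 forecast steps.
history = [10.0, 12.0, 11.0, 13.0]
weights = [
    [0.0, 0.0, 0.0, 1.0],       # step 1: repeat the last value
    [0.25, 0.25, 0.25, 0.25],   # step 2: average of the window
]
bias = [0.0, 0.0]

forecast = linear_forecast(history, weights, bias)  # [13.0, 11.5]
```

Each forecast step gets its own row of weights, and each weight is tied to one position in the window. That is the whole model.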

The authors tested it against five state of the art Transformer models: Informer, Autoformer, FEDformer, Pyraformer, and LogTrans (Zeng et al., 2023). They ran the comparison on nine datasets covering weather, electricity, traffic, exchange rates, and illness related forecasting. The evaluation metrics were mean squared error and mean absolute error, measured over multiple forecasting horizons from 96 to 720 time steps, with shorter horizons of 24 to 60 steps for the much smaller illness dataset.
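For readers who want the two metrics spelled out, here is a minimal sketch in plain Python. The function names and the toy numbers are assumptions for illustration, not values from the paper.

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error: the average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# The errors are 0, -1, 2, 0, so MSE = (0 + 1 + 4 + 0) / 4
# and MAE = (0 + 1 + 2 + 0) / 4.
scores = (mse(actual, predicted), mae(actual, predicted))  # (1.25, 0.75)
```

Squaring makes MSE punish the single large miss more heavily than MAE does, which is one reason forecasting papers tend to report both.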

The results were unambiguous. LTSF-Linear outperformed every Transformer on every dataset, across nearly every forecasting horizon. In the multivariate setting, the paper reports that it beat FEDformer, the strongest Transformer tested, by roughly 20 to 50 percent in most cases. The gaps on the Electricity, Weather, and Traffic datasets were all substantial.

The authors then introduced a variant called DLinear, which decomposes the time series into a trend component, obtained with a moving average, and a seasonal remainder. It applies a separate linear layer to each and sums the two outputs. It often performed even better. A model that took the structure of time seriously, even if only through a simple decomposition, crushed the black boxes.
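The decomposition idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the kernel size of 5, the edge padding, the bias free layers, and every function name here are choices made for the example, and the kernel is assumed to be odd.

```python
def moving_average(series, kernel):
    """Centered moving average. The ends are padded by repeating the first
    and last values so the trend has the same length as the input.
    Assumes an odd kernel size."""
    half = (kernel - 1) // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + kernel]) / kernel for i in range(len(series))]

def decompose(series, kernel=5):
    """Split a series into a smooth trend and the remainder (seasonal part)."""
    trend = moving_average(series, kernel)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal

def linear(history, weights):
    """Weighted sum of past values for each forecast step (no bias)."""
    return [sum(w * x for w, x in zip(row, history)) for row in weights]

def dlinear_forecast(history, trend_weights, seasonal_weights, kernel=5):
    """Forecast = linear(trend) + linear(seasonal), summed step by step."""
    trend, seasonal = decompose(history, kernel)
    trend_part = linear(trend, trend_weights)
    seasonal_part = linear(seasonal, seasonal_weights)
    return [t + s for t, s in zip(trend_part, seasonal_part)]

# Toy input: a rising line with an alternating wiggle on top.
history = [float(t) + (1.0 if t % 2 == 0 else -1.0) for t in range(12)]
trend, seasonal = decompose(history)

# Hypothetical weights for a 2-step forecast from 12 past values.
weights = [[1.0 / 12] * 12, [0.0] * 11 + [1.0]]
combined = dlinear_forecast(history, weights, weights)
plain = linear(history, weights)
```

With identical weights on both branches, `combined` matches `plain`, because the two components add back up to the original series. Any benefit of the decomposition comes from letting the trend and the seasonal part learn different weights.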

What the Transformers Were Actually Learning

The paper includes an ablation study that reveals something uncomfortable. The authors took Informer and simplified it step by step. First they replaced each self attention layer with a linear layer. Then they stripped out the remaining auxiliary designs, such as the feedforward network, leaving only the embedding and linear layers. Finally they reduced the model to a single linear layer. Performance generally improved as the model got simpler.

This suggests that the Transformers were not learning temporal dependencies through attention at all. If attention were doing the useful work, removing it should have hurt. Instead, at least on these benchmarks, the attention mechanism and the machinery built around it looked unnecessary for forecasting, or worse, like a source of noise.

Zeng et al. (2023) also tested the effect of positional encoding. They found that removing positional embeddings from Informer caused a significant drop in performance, which is expected. But the linear model, which uses no positional encoding at all, still outperformed the Transformers that do. Positional encoding therefore does not fix the fundamental problem. It is a crutch, not a cure.

Why This Matters Beyond Forecasting

The paper is not just about time series. It is a cautionary tale about the way machine learning research can become self reinforcing.

Transformers became the default for sequence modeling after their success in natural language processing. The logic was seductive: if they work for words, they should work for anything sequential. Researchers proposed new attention mechanisms, new positional encoding schemes, and new architectural tweaks, each claiming incremental improvement over the last. Conferences accepted them. Citations accumulated. The field marched forward.

But the foundation was never tested. Nobody asked the simple question: is self attention actually good for this task? The paper by Zeng et al. (2023) is that test. And the answer is no.

This has implications beyond forecasting. The authors explicitly call for revisiting Transformer based solutions for other time series tasks like anomaly detection and classification. If the attention mechanism is fundamentally misaligned with the structure of temporal data, then any task that relies on temporal order might be better served by simpler models.

The Limits of the Linear Approach

The paper does not claim that linear models are universally better. It claims that for long term time series forecasting, they are better than the current Transformer designs. That is a narrower claim, but still a devastating one.

There are cases where a linear model will fail. If the underlying process is highly nonlinear, with sudden regime changes or complex interactions, a linear model will miss them. Most of the benchmark datasets are relatively smooth and strongly periodic. Apart from daily exchange rates, the authors did not test on financial time series with high volatility, or on biological signals with chaotic dynamics.

There is also the question of scale. The linear model feeds the entire input window through a single dense layer, so its parameter count is roughly the input length multiplied by the forecast horizon. It grows linearly with the input length for a fixed horizon, and faster if both are lengthened together. For very long sequences, this could become computationally expensive. But the authors note that the Transformers they tested also struggle with long sequences, often requiring downsampling or sparse attention to manage memory.
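The scaling argument is easy to check with a few lines of arithmetic. The sketch below assumes a single dense layer shared across all variables, and the input lengths are picked purely for illustration.

```python
def linear_layer_params(lookback, horizon, bias=True):
    """Parameters in one dense layer mapping `lookback` inputs
    to `horizon` outputs: one weight per (input, output) pair,
    plus one bias per output if enabled."""
    return lookback * horizon + (horizon if bias else 0)

# Parameter counts for a 720-step forecast at three input lengths.
counts = {L: linear_layer_params(L, 720) for L in (96, 336, 720)}
# counts == {96: 69840, 336: 242640, 720: 519120}
```

Even at the longest setting here the layer has about half a million weights, which is small by deep learning standards. The cost only starts to bite when the input window and the horizon both run into the tens of thousands of steps.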

The deeper limitation is that the linear model has no mechanism for adapting to context. A Transformer, in theory, could weigh short term and long term dependencies differently for each input through its multi head attention. The linear model cannot. It learns one fixed set of weights per forecast step and applies it to every window. If the data contains both daily and yearly cycles, it can only capture the ones that fit inside its input window, and it cannot shift emphasis between them as conditions change.

The authors address this in part with DLinear, which decomposes the series into trend and seasonal components and fits a linear layer to each. This helps, but it is a crude decomposition. More sophisticated methods, like wavelet transforms or state space models, might do better.

The Research That Does Not Get Done

There is a reason this paper was surprising. It goes against the dominant narrative.

Machine learning research is driven by novelty. Proposing a new attention mechanism gets published. Proposing a linear model gets rejected as trivial. The incentive structure favors complexity. The authors of this paper, by contrast, did something that is rare in the field: they asked whether the complexity was necessary. They found it was not.

This raises an uncomfortable question. How many other areas of machine learning are being overengineered? How many problems are being solved with neural networks when a linear regression would suffice? The paper does not answer that, but it should make researchers pause.

The authors also note that their findings are specific to long term forecasting. For short term forecasting, or for tasks like classification, Transformers may still be useful. But the burden of proof has shifted. The default should no longer be to reach for a Transformer. The default should be to start with something simple.

What This Actually Means

▸ If you are building a time series forecasting system, start with a linear model. Do not default to a Transformer. On the standard long term forecasting benchmarks, simple linear models outperformed complex attention based architectures. You will save time, compute, and debugging effort.

▸ The attention mechanism in these Transformers does not appear to be learning temporal dependencies. It is learning correlations between tokens, which is not the same thing. If your task depends on the order of events, self attention is a poor fit. Positional encoding is a band aid, not a fix.

▸ Decomposing a time series into trend and seasonal components is more valuable than adding attention heads. DLinear, which does a simple decomposition and then applies a linear layer to each component, outperformed every Transformer tested. Structure beats complexity.

▸ The paper is a warning against following trends without testing assumptions. The Transformer boom in time series was driven by success in NLP, not by evidence. The same dynamic may be playing out in other domains. If a model becomes popular for reasons unrelated to your problem, test it against a baseline that is appropriate for your data.

▸ For practitioners, the takeaway is concrete. Use LTSF-Linear or its DLinear variant as your baseline. If a more complex model cannot beat it, you do not need the complexity. The field has been optimizing the wrong thing. Stop.
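The baseline first workflow in these takeaways can be sketched in plain Python. Everything here is an assumption made for illustration: the synthetic sine series, the gradient descent fit, the learning rate, and the `candidate` function, which stands in for whatever model is actually under evaluation. It is not the paper's training setup.

```python
import math

def make_windows(series, lookback, horizon):
    """Slice a series into (past, future) pairs with a sliding window."""
    last_start = len(series) - lookback - horizon
    return [
        (series[s:s + lookback], series[s + lookback:s + lookback + horizon])
        for s in range(last_start + 1)
    ]

def evaluate(forecaster, pairs):
    """Mean squared error of a forecaster over all windows and steps."""
    errors = [
        (p - y) ** 2
        for past, future in pairs
        for p, y in zip(forecaster(past), future)
    ]
    return sum(errors) / len(errors)

def fit_linear(pairs, lookback, horizon, lr=0.01, epochs=200):
    """Fit a bias-free one layer linear model by full-batch gradient descent."""
    weights = [[0.0] * lookback for _ in range(horizon)]
    n = len(pairs)
    for _ in range(epochs):
        grads = [[0.0] * lookback for _ in range(horizon)]
        for past, future in pairs:
            for j in range(horizon):
                pred = sum(w * x for w, x in zip(weights[j], past))
                scale = 2.0 * (pred - future[j]) / n
                for i in range(lookback):
                    grads[j][i] += scale * past[i]
        for j in range(horizon):
            for i in range(lookback):
                weights[j][i] -= lr * grads[j][i]
    return weights

# Synthetic data (an assumption for illustration): a clean 12-step cycle.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
lookback, horizon = 12, 3
train = make_windows(series[:90], lookback, horizon)
test = make_windows(series[90:], lookback, horizon)

weights = fit_linear(train, lookback, horizon)

def linear_baseline(past):
    return [sum(w * x for w, x in zip(row, past)) for row in weights]

def candidate(past):
    # Hypothetical stand-in for the model under evaluation;
    # here it simply repeats the last observed value.
    return [past[-1]] * horizon

baseline_mse = evaluate(linear_baseline, test)
candidate_mse = evaluate(candidate, test)
keep_candidate = candidate_mse < baseline_mse
```

On this toy series the linear baseline fits the cycle almost exactly, so the stand-in candidate fails the comparison and `keep_candidate` is false. The point of the pattern is the final line: a more complex model earns its place only if it beats the linear baseline on held out windows.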

References

[1] Ailing Zeng, Muxi Chen, Lei Zhang, Qiang Xu (2023). Are Transformers Effective for Time Series Forecasting? Proceedings of the AAAI Conference on Artificial Intelligence.