Attention is all you need — the paper that introduced the transformer architecture (Vaswani et al., 2017).
BERT — bidirectional transformer pre-training that reshaped NLP (Devlin et al., 2019).
Vision Transformer (ViT) — showed that transformers can match or beat CNNs on image classification (Dosovitskiy et al., 2021).
The transformer is the architecture behind large language models and, increasingly, behind state-of-the-art models in vision, time-series forecasting, and scientific computing. Unlike RNNs, which process sequences one step at a time, transformers process the entire sequence at once using a mechanism called self-attention that lets each element directly attend to every other element.
10.1 Self-attention
The core idea: for each element in the input sequence, compute how much attention it should pay to every other element, then produce an output that is a weighted combination of the values.
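Given an input sequence \(\mathbf{X} \in \mathbb{R}^{n \times d}\) (\(n\) tokens, \(d\) features), three learned linear projections produce the queries, keys, and values:
\[ \mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V \]
The attention output is then
\[ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V} \]
where \(d_k\) is the dimension of the keys. Each row of the softmax output holds the attention weights for one position: how much that position attends to every other position. The \(\sqrt{d_k}\) scaling keeps the dot products from growing with \(d_k\), which would otherwise push the softmax into saturated regions with very small gradients.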
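The effect of the scaling can be checked numerically. The following is a minimal sketch, assuming query and key entries are independent draws with zero mean and unit variance (an illustrative assumption, not a property of trained models). Under that assumption the raw dot product has standard deviation close to \(\sqrt{d_k}\), while the scaled one stays close to 1. The function name `dot_product_std` and the sample count are hypothetical choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=5000, scaled=False):
    """Empirical standard deviation of q . k for random unit-variance q and k."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)          # one dot product per sample
    if scaled:
        dots = dots / np.sqrt(d_k)        # the scaling used in attention
    return dots.std()

for d_k in (4, 64, 512):
    raw = dot_product_std(d_k)
    scaled = dot_product_std(d_k, scaled=True)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

The raw spread grows with \(d_k\), whereas the scaled spread stays roughly constant, which is the behavior the scaling is meant to provide.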
10.2 Multi-head attention
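Instead of computing a single attention function, the transformer uses multi-head attention: it runs \(h\) parallel attention functions with different learned projections, then concatenates and projects the results:
\[ \mathrm{MultiHead}(\mathbf{X}) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,\mathbf{W}^O, \qquad \mathrm{head}_i = \mathrm{Attention}(\mathbf{X}\mathbf{W}_i^Q,\ \mathbf{X}\mathbf{W}_i^K,\ \mathbf{X}\mathbf{W}_i^V) \]
Here \(\mathrm{Attention}\) is the scaled dot-product attention of Section 10.1, each head has its own projection matrices \(\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V\), and \(\mathbf{W}^O\) is the output projection.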
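The multi-head computation can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the weights are random rather than trained, the per-head projections are taken as column slices of single \(d \times d\) matrices (a common layout, assumed here), and the names `multi_head_attention`, `W_q`, `W_k`, `W_v`, and `W_o` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d); W_q, W_k, W_v, W_o: (d, d); h must divide d."""
    n, d = X.shape
    d_k = d // h
    # Project once, then split the feature axis into h heads: (h, n, d_k).
    Q = (X @ W_q).reshape(n, h, d_k).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, h, d_k).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently for each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (h, n, d_k)
    # Concatenate the heads along the feature axis and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d)      # (n, d)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
X = rng.standard_normal((n, d))
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape, weights.shape)
```

Each head sees only a \(d/h\)-dimensional slice of the projected features, so the total cost is comparable to a single head of full width while allowing each head to learn a different attention pattern.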
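Stacking many transformer blocks, each pairing multi-head attention with a position-wise feed-forward network (with residual connections and layer normalization), gives the transformer its depth.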
10.4 Positional encoding
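Self-attention is permutation-equivariant: permuting the input tokens simply permutes the outputs in the same way, so the mechanism by itself has no notion of order. To inject sequence order, transformers add a positional encoding to the input embeddings. The original paper used sinusoidal functions:
\[ PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \]
where \(pos\) is the position in the sequence and \(i\) indexes pairs of feature dimensions.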
This gives each position a unique signature that the network can learn to interpret.
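A minimal NumPy sketch of the sinusoidal encoding is given below. It assumes an even feature dimension \(d\) and places sines in even columns and cosines in odd columns, following the layout of the original paper; the function name and the example sizes are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(n, d, base=10000.0):
    """Return an (n, d) array of sinusoidal positional encodings; d must be even."""
    positions = np.arange(n)[:, None]             # (n, 1)
    i = np.arange(d // 2)[None, :]                # (1, d/2), one index per sin/cos pair
    angles = positions / base ** (2 * i / d)      # (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)                  # even feature columns
    pe[:, 1::2] = np.cos(angles)                  # odd feature columns
    return pe

pe = sinusoidal_positional_encoding(n=50, d=16)
print(pe.shape)

# The encoding is added to the token embeddings before the first attention layer.
embeddings = np.random.default_rng(0).standard_normal((50, 16))
inputs = embeddings + pe
```

Low-index columns oscillate quickly with position and high-index columns slowly, so every row of `pe` differs from every other row.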
10.5 Code example: minimal self-attention
We implement a minimal self-attention mechanism from scratch to illustrate the core computation. This is not meant for production use, but shows exactly what happens inside the attention layer.
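A minimal NumPy sketch of that computation might look like the following. The sizes (`n`, `d`, `d_k`), the random seed, and the random, untrained projection matrices are assumptions chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (n, d) input sequence; W_q, W_k, W_v: (d, d_k) projection matrices.
    Returns the (n, d_k) output and the (n, n) attention weights.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n, n) query-key similarities
    weights = softmax(scores, axis=-1)     # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(42)
n, d, d_k = 5, 8, 4                        # illustrative sizes
X = rng.standard_normal((n, d))
W_q = rng.standard_normal((d, d_k))
W_k = rng.standard_normal((d, d_k))
W_v = rng.standard_normal((d, d_k))

output, weights = self_attention(X, W_q, W_k, W_v)
print("output shape:", output.shape)
print("row sums:", weights.sum(axis=1))
```

Plotting `weights` as a heatmap (for example with matplotlib's `imshow`) shows one row per query position and one column per key position.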
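In a heatmap of the attention weights, each row corresponds to one query position, and bright cells indicate which key positions that query attends to most.
Each row of the attention weights should sum to 1, because softmax yields a probability distribution.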
This chapter demonstrates the mechanics of attention, not end-to-end benchmark performance. In real geoscience transformer models, evaluation should include holdout skill metrics (e.g., MAE/RMSE, event detection F1, or forecast skill scores) and comparisons against RNN/CNN baselines.
10.6 Advantages over RNNs
| Feature | RNN | Transformer |
|---|---|---|
| Long-range dependencies | Difficult (vanishing gradients) | Direct attention |
| Parallelization | Sequential (slow) | Fully parallel |
| Memory cost | \(O(n)\) per step | \(O(n^2)\) for full attention |
| Positional awareness | Built-in (sequential processing) | Requires positional encoding |
Transformers train in parallel across sequence positions and scale well to large datasets and large models, which explains their dominance in modern AI. The trade-off is the \(O(n^2)\) cost of full attention; for very long sequences, efficient variants (sparse attention, linear attention) reduce it.
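To make the \(O(n^2)\) term concrete, the short sketch below estimates the memory needed to store the full attention-weight matrix for one layer and one sequence. It assumes 4-byte (float32) values and a single head by default; the function name and parameters are hypothetical, and the estimate ignores all other activations and parameters.

```python
def attention_matrix_bytes(n, heads=1, bytes_per_value=4):
    """Bytes needed to store the (n x n) attention weights for each head."""
    return heads * n * n * bytes_per_value

for n in (1_000, 10_000, 100_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>7,d}: {gib:8.3f} GiB")
```

Increasing the sequence length tenfold multiplies this cost by one hundred, which is why long continuous records are usually windowed or handled with the efficient attention variants mentioned above.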
10.7 Geoscience milestones
Earthquake detection and phase picking — Mousavi et al. (2020) introduced the Earthquake Transformer (EQTransformer), the first widely adopted attention-based model for simultaneous detection and seismic-phase picking from continuous waveform data.
Medium-range weather forecasting: Bi et al. (2023) introduced Pangu-Weather, a milestone for transformer-based global weather prediction, demonstrating skillful forecasts up to 7 days ahead from a 3D attention architecture trained on ERA5 reanalysis.
The transformer is increasingly the architecture of choice for problems involving long sequences, multimodal data, or large-scale pre-training in geoscience.
Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., & Tian, Q. (2023). Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619, 533–538. https://doi.org/10.1038/s41586-023-06185-3
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT, 4171–4186.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of the International Conference on Learning Representations (ICLR).
Mousavi, S. M., Ellsworth, W. L., Zhu, W., Chuang, L. Y., & Beroza, G. C. (2020). Earthquake transformer: an attentive deep-learning model for simultaneous earthquake detection and phase picking. Nature Communications, 11, 3952. https://doi.org/10.1038/s41467-020-17591-w
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.