10  Transformers

Tip: Key references
  • Attention is all you need — the paper that introduced the transformer architecture (Vaswani et al., 2017).
  • BERT — bidirectional transformer pre-training that reshaped NLP (Devlin et al., 2019).
  • Vision Transformer (ViT) — showed that transformers can match or beat CNNs on image classification (Dosovitskiy et al., 2021).

The transformer is the architecture behind large language models and, increasingly, behind state-of-the-art models in vision, time-series forecasting, and scientific computing. Unlike RNNs, which process sequences one step at a time, transformers process the entire sequence at once using a mechanism called self-attention that lets each element directly attend to every other element.

10.1 Self-attention

The core idea: for each element in the input sequence, compute how much attention it should pay to every other element, then produce an output that is a weighted combination of the values.

Given an input sequence \(\mathbf{X} \in \mathbb{R}^{n \times d}\) (\(n\) tokens, \(d\) features), three linear projections produce:

\[ \mathbf{Q} = \mathbf{X}W_Q, \quad \mathbf{K} = \mathbf{X}W_K, \quad \mathbf{V} = \mathbf{X}W_V \]

where \(\mathbf{Q}\) (queries), \(\mathbf{K}\) (keys), and \(\mathbf{V}\) (values) are matrices. The attention output is:

\[ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right) \mathbf{V} \]

Each row of the softmax output contains the attention weights for one query position — how much that position attends to every position in the sequence. The \(\sqrt{d_k}\) scaling keeps the dot products from growing with the key dimension \(d_k\), which would otherwise push the softmax into a saturated region with vanishingly small gradients.

10.2 Multi-head attention

Instead of computing a single attention, the transformer uses multi-head attention: it runs \(h\) parallel attention functions with different learned projections, then concatenates and projects the results:

\[ \mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O, \quad \mathrm{head}_i = \mathrm{Attention}(\mathbf{Q}W_Q^{(i)}, \mathbf{K}W_K^{(i)}, \mathbf{V}W_V^{(i)}) \]

Each head can learn to focus on different types of relationships (e.g., one head might attend to nearby positions, another to distant ones).
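To make this concrete, here is a minimal sketch of multi-head attention. The dimensions, random initialization, and two-head split below are illustrative choices, not values from a trained model:

```julia
using LinearAlgebra, Random

rng = Xoshiro(1)

#-----toy dimensions: h heads, each of size d_model ÷ h------
n, d_model, h = 6, 8, 2
d_head = d_model ÷ h
X = randn(rng, Float32, n, d_model)

softmax_rows(S) = (E = exp.(S .- maximum(S, dims = 2)); E ./ sum(E, dims = 2))

#-----one set of Q/K/V projections per head------
heads = map(1:h) do _
    W_Q = randn(rng, Float32, d_model, d_head) .* 0.5f0
    W_K = randn(rng, Float32, d_model, d_head) .* 0.5f0
    W_V = randn(rng, Float32, d_model, d_head) .* 0.5f0
    A = softmax_rows((X * W_Q) * (X * W_K)' ./ sqrt(Float32(d_head)))
    A * (X * W_V)                       # each head produces n × d_head
end

#-----concatenate heads, then apply the output projection W_O------
W_O = randn(rng, Float32, d_model, d_model) .* 0.5f0
out = hcat(heads...) * W_O              # n × d_model, same shape as X
size(out)                               # (6, 8)
```

Note that each head works in a smaller subspace (d_head = d_model ÷ h), so the total cost is comparable to a single full-width attention.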

10.3 The transformer block

A single transformer block consists of:

  1. Multi-head self-attention — with a residual connection and layer normalization.
  2. Feed-forward network — two dense layers with a non-linearity, also with a residual connection and layer normalization.

\[ \begin{aligned} \mathbf{z} &= \mathrm{LayerNorm}\!\bigl(\mathbf{x} + \mathrm{MultiHeadAttn}(\mathbf{x})\bigr) \\ \mathbf{y} &= \mathrm{LayerNorm}\!\bigl(\mathbf{z} + \mathrm{FFN}(\mathbf{z})\bigr) \end{aligned} \]

Stacking many such blocks gives the transformer its depth.
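The two equations above translate almost line for line into code. The following toy block uses single-head attention and omits the learned scale and offset parameters of layer normalization for brevity; all sizes and weights are illustrative:

```julia
using LinearAlgebra, Random, Statistics

rng = Xoshiro(0)
n, d_model, d_ff = 6, 4, 16
x = randn(rng, Float32, n, d_model)

softmax_rows(S) = (E = exp.(S .- maximum(S, dims = 2)); E ./ sum(E, dims = 2))

#-----row-wise LayerNorm (no learned scale/offset)------
layernorm(z) = (z .- mean(z, dims = 2)) ./ (std(z, dims = 2) .+ 1f-5)

#-----single-head self-attention, as in the equations above------
W_Q, W_K, W_V = (randn(rng, Float32, d_model, d_model) .* 0.5f0 for _ in 1:3)
attn(z) = softmax_rows((z * W_Q) * (z * W_K)' ./ sqrt(Float32(d_model))) * (z * W_V)

#-----position-wise feed-forward network: two dense layers with ReLU------
W1 = randn(rng, Float32, d_model, d_ff) .* 0.5f0
W2 = randn(rng, Float32, d_ff, d_model) .* 0.5f0
ffn(z) = max.(z * W1, 0f0) * W2

#-----the two sub-layers, each wrapped in residual + LayerNorm------
z = layernorm(x .+ attn(x))
y = layernorm(z .+ ffn(z))
size(y)   # (6, 4) — the block preserves the input shape
```

Because each sub-layer preserves the \(n \times d\) shape, blocks can be stacked freely.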

10.4 Positional encoding

Self-attention is permutation-invariant — it does not know the order of the input. To inject sequence order, transformers add a positional encoding to the input embeddings. The original paper used sinusoidal functions:

\[ \begin{aligned} PE_{(pos, 2i)} &= \sin\!\bigl(pos / 10000^{2i/d}\bigr) \\ PE_{(pos, 2i+1)} &= \cos\!\bigl(pos / 10000^{2i/d}\bigr) \end{aligned} \]

This gives each position a unique signature that the network can learn to interpret.
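A direct implementation of the sinusoidal formulas (the sequence length and dimension below are arbitrary illustrative choices):

```julia
#-----sinusoidal positional encoding, following the formulas above------
function positional_encoding(n, d)
    PE = zeros(Float32, n, d)
    for pos in 0:n-1, i in 0:(d ÷ 2 - 1)
        angle = pos / 10000f0^(2i / d)
        PE[pos + 1, 2i + 1] = sin(angle)   # even dimensions
        PE[pos + 1, 2i + 2] = cos(angle)   # odd dimensions
    end
    return PE
end

PE = positional_encoding(50, 16)
PE[1, :]   # position 0: all sin terms are 0, all cos terms are 1
```

Low dimensions oscillate quickly with position while high dimensions vary slowly, so together the columns act like the digits of a smooth positional "counter".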

10.5 Code example: minimal self-attention

We implement a minimal self-attention mechanism from scratch to illustrate the core computation. This is not meant for production use, but shows exactly what happens inside the attention layer.

using LinearAlgebra, Random, CairoMakie

rng = Xoshiro(42)

#-----sequence setup------
seq_len, d_model = 6, 4
X = randn(rng, Float32, seq_len, d_model)

#-----learned projections------
dₖ = d_model
W_Q = randn(rng, Float32, d_model, dₖ) .* 0.5f0
W_K = randn(rng, Float32, d_model, dₖ) .* 0.5f0
W_V = randn(rng, Float32, d_model, dₖ) .* 0.5f0

Q = X * W_Q
K = X * W_K
V = X * W_V

#-----attention weights------
scores = Q * K' ./ sqrt(Float32(dₖ))

#-----row-wise softmax------
function row_softmax(S)
    exp_S = exp.(S .- maximum(S, dims = 2))
    return exp_S ./ sum(exp_S, dims = 2)
end

attn_weights = row_softmax(scores)

#-----weighted output------
output = attn_weights * V

println("Attention weights (each row sums to 1):")
for i in 1:seq_len
    println("  Token $i → ", round.(attn_weights[i, :], digits = 3))
end

Attention weights (each row sums to 1):
  Token 1 → Float32[0.298, 0.13, 0.123, 0.194, 0.1, 0.156]
  Token 2 → Float32[0.131, 0.164, 0.164, 0.136, 0.168, 0.238]
  Token 3 → Float32[0.1, 0.16, 0.167, 0.102, 0.188, 0.282]
  Token 4 → Float32[0.208, 0.139, 0.128, 0.171, 0.103, 0.252]
  Token 5 → Float32[0.133, 0.169, 0.177, 0.131, 0.201, 0.189]
  Token 6 → Float32[0.142, 0.185, 0.179, 0.237, 0.176, 0.08]

# Visualize the attention pattern
fig = Figure(size = (400, 350))
ax = Axis(fig[1, 1], title = "Self-attention weights",
          xlabel = "Key position", ylabel = "Query position",
          xticks = 1:seq_len, yticks = 1:seq_len,
          yreversed = true)
hm = heatmap!(ax, 1:seq_len, 1:seq_len, attn_weights',
              colormap = :viridis)
Colorbar(fig[1, 2], hm, label = "Weight")
fig

How to read this plot:

  • Each row corresponds to one query token.
  • Bright cells indicate which key positions that query attends to most.
  • Rows should sum to 1 (a probability distribution).

This chapter demonstrates the mechanics of attention, not end-to-end benchmark performance. In real geoscience transformer models, evaluation should include holdout skill metrics (e.g., MAE/RMSE, event detection F1, or forecast skill scores) and comparisons against RNN/CNN baselines.

10.6 Advantages over RNNs

Feature                   RNN                                Transformer
Long-range dependencies   Difficult (vanishing gradients)    Direct attention
Parallelization           Sequential (slow)                  Fully parallel
Memory cost               \(O(n)\) per step                  \(O(n^2)\) for full attention
Positional awareness      Built-in (sequential processing)   Requires positional encoding

Transformers scale better to long sequences and large datasets, which explains their dominance in modern AI. For very long sequences, efficient variants (sparse attention, linear attention) reduce the \(O(n^2)\) cost.
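A quick back-of-the-envelope calculation shows why the \(O(n^2)\) attention matrix becomes prohibitive for long sequences:

```julia
#-----memory for one full Float32 attention matrix grows quadratically------
for n in (1_000, 10_000, 100_000)
    mib = n^2 * sizeof(Float32) / 2^20
    println("n = $n → ", round(mib, digits = 1), " MiB")
end
# n = 1000 → 3.8 MiB
# n = 10000 → 381.5 MiB
# n = 100000 → 38147.0 MiB
```

Each tenfold increase in sequence length multiplies the memory by one hundred, which is why efficient attention variants matter at scale.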

10.7 Geoscience applications

Transformers are rapidly entering geoscience, especially for tasks where long-range dependencies and large-scale data matter:

  • Earthquake detection — Mousavi et al. (2020) introduced the Earthquake Transformer (EQTransformer), which uses attention mechanisms to simultaneously detect earthquakes and pick seismic phases from continuous waveform data, achieving superior performance over CNN and RNN baselines.
  • Weather forecasting — Pathak et al. (2022) developed FourCastNet, a vision-transformer-based model for global weather forecasting that runs orders of magnitude faster than traditional numerical weather models while producing competitive forecasts. Bi et al. (2023) used 3D transformers in Pangu-Weather for medium-range forecasting, with results published in Nature.
  • Graph-based weather prediction — Lam et al. (2023) (GraphCast) combined transformer-style attention with graph neural networks for global weather forecasting, achieving state-of-the-art accuracy at 0.25° resolution.
  • Earth observation — Vision transformers are being applied to satellite imagery for land-cover classification, change detection, and environmental monitoring, following the success of ViT (Dosovitskiy et al., 2021) in computer vision.

The transformer is increasingly the architecture of choice for problems involving long sequences, multimodal data, or large-scale pre-training in geoscience.