10  Transformers

Key references
  • Attention is all you need — the paper that introduced the transformer architecture (Vaswani et al., 2017).
  • BERT — bidirectional transformer pre-training that reshaped NLP (Devlin et al., 2019).
  • Vision Transformer (ViT) — showed that transformers can match or beat CNNs on image classification (Dosovitskiy et al., 2021).

The transformer is the architecture behind large language models and, increasingly, behind state-of-the-art models in vision, time-series forecasting, and scientific computing. Unlike RNNs, which process sequences one step at a time, transformers process the entire sequence at once using a mechanism called self-attention that lets each element directly attend to every other element.

10.1 Self-attention

The core idea: for each element in the input sequence, compute how much attention it should pay to every other element, then produce an output that is a weighted combination of the values.

Given an input sequence \(\mathbf{X} \in \mathbb{R}^{n \times d}\) (\(n\) tokens, \(d\) features), three linear projections produce:

\[ \mathbf{Q} = \mathbf{X}W_Q, \quad \mathbf{K} = \mathbf{X}W_K, \quad \mathbf{V} = \mathbf{X}W_V \]

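where \(W_Q, W_K \in \mathbb{R}^{d \times d_k}\) and \(W_V \in \mathbb{R}^{d \times d_v}\) are learned weight matrices, and \(\mathbf{Q}\) (queries), \(\mathbf{K}\) (keys), and \(\mathbf{V}\) (values) each hold one row per token. The attention output is: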

\[ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right) \mathbf{V} \]

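Row \(i\) of the softmax output holds the attention weights of token \(i\): non-negative numbers that sum to 1 and say how much position \(i\) attends to each position in the sequence (itself included). The \(\sqrt{d_k}\) scaling keeps the dot products from growing with the key dimension \(d_k\); without it, large scores would saturate the softmax and shrink its gradients.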
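The effect of the scaling can be checked numerically. The sketch below assumes queries and keys with independent, unit-variance entries (an idealization, not a property of trained models) and estimates the variance of their dot products with and without the \(\sqrt{d_k}\) factor. The helper name `dot_product_variance` is made up for this illustration.

```julia
using Random, Statistics

rng = Xoshiro(0)

# Empirical variance of q ⋅ k for random unit-variance vectors of dimension dₖ,
# before and after dividing by sqrt(dₖ).
function dot_product_variance(rng, dₖ; n_samples = 10_000)
    Q = randn(rng, n_samples, dₖ)
    K = randn(rng, n_samples, dₖ)
    raw = vec(sum(Q .* K, dims = 2))   # one dot product per row
    scaled = raw ./ sqrt(dₖ)
    return var(raw), var(scaled)
end

for dₖ in (4, 64, 512)
    v_raw, v_scaled = dot_product_variance(rng, dₖ)
    println("dₖ = $(dₖ): var(raw) ≈ ", round(v_raw, digits = 1),
            ", var(scaled) ≈ ", round(v_scaled, digits = 2))
end
```

Under these assumptions the raw variance grows roughly in proportion to \(d_k\), while the scaled variance stays close to 1 for every dimension.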

10.2 Multi-head attention

Instead of computing a single attention function, the transformer uses multi-head attention: it runs \(h\) parallel attention functions with different learned projections, then concatenates and projects the results:

\[ \mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O \]

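Head \(i\) applies the attention function of Section 10.1 with its own projections \(W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}\), usually of reduced width \(d/h\) so that the total cost stays close to that of a single full-width head. Each head can learn to focus on different types of relationships (e.g., one head might attend to nearby positions, another to distant ones).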
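A minimal sketch of multi-head self-attention in the same from-scratch style as the rest of the chapter. The per-head width `d_head = d_model ÷ n_heads`, the random weights, and the function names are assumptions made for illustration; nothing here is trained.

```julia
using LinearAlgebra, Random

# Numerically stable softmax applied to each row of S.
function row_softmax(S)
    exp_S = exp.(S .- maximum(S, dims = 2))
    return exp_S ./ sum(exp_S, dims = 2)
end

# Scaled dot-product attention for one head.
attention(Q, K, V) = row_softmax(Q * K' ./ sqrt(Float32(size(K, 2)))) * V

# Multi-head self-attention: each head has its own (W_Q, W_K, W_V) triple,
# and W_O mixes the concatenated head outputs.
function multi_head_attention(X, heads, W_O)
    outputs = [attention(X * h.W_Q, X * h.W_K, X * h.W_V) for h in heads]
    return hcat(outputs...) * W_O
end

rng = Xoshiro(42)
seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model ÷ n_heads   # assumed per-head width

X = randn(rng, Float32, seq_len, d_model)
heads = [(W_Q = randn(rng, Float32, d_model, d_head) .* 0.5f0,
          W_K = randn(rng, Float32, d_model, d_head) .* 0.5f0,
          W_V = randn(rng, Float32, d_model, d_head) .* 0.5f0) for _ in 1:n_heads]
W_O = randn(rng, Float32, n_heads * d_head, d_model) .* 0.5f0

Y = multi_head_attention(X, heads, W_O)
println(size(Y))
```

The output has the same shape as the input (`seq_len × d_model`), which is what allows the layer to sit inside a residual connection.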

10.3 The transformer block

A single transformer block consists of:

  1. Multi-head self-attention — with a residual connection and layer normalization.
  2. Feed-forward network — two dense layers with a non-linearity, also with a residual connection and layer normalization.

\[ \begin{aligned} \mathbf{z} &= \mathrm{LayerNorm}\!\bigl(\mathbf{x} + \mathrm{MultiHeadAttn}(\mathbf{x})\bigr) \\ \mathbf{y} &= \mathrm{LayerNorm}\!\bigl(\mathbf{z} + \mathrm{FFN}(\mathbf{z})\bigr) \end{aligned} \]

Stacking many such blocks gives the transformer its depth.
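The two equations above can be sketched in a few lines of Julia. To stay short, this version makes several simplifying assumptions: a single attention head stands in for \(\mathrm{MultiHeadAttn}\), layer normalization has no learnable gain or bias, the non-linearity is ReLU, and all weights are random. The parameter names (`W_1`, `b_1`, and so on) are illustrative.

```julia
using LinearAlgebra, Random, Statistics

function row_softmax(S)
    exp_S = exp.(S .- maximum(S, dims = 2))
    return exp_S ./ sum(exp_S, dims = 2)
end

# Layer normalization over the feature dimension of each token (row).
function layer_norm(X; ϵ = 1f-5)
    μ = mean(X, dims = 2)
    σ² = var(X, dims = 2, corrected = false)
    return (X .- μ) ./ sqrt.(σ² .+ ϵ)
end

# Single-head self-attention, used here in place of MultiHeadAttn.
function self_attention(X, p)
    Q, K, V = X * p.W_Q, X * p.W_K, X * p.W_V
    return row_softmax(Q * K' ./ sqrt(Float32(size(K, 2)))) * V
end

# Position-wise feed-forward network: dense -> ReLU -> dense.
ffn(Z, p) = max.(Z * p.W_1 .+ p.b_1, 0f0) * p.W_2 .+ p.b_2

# Post-norm block: residual connection first, then layer normalization.
function transformer_block(X, p)
    Z = layer_norm(X .+ self_attention(X, p))
    return layer_norm(Z .+ ffn(Z, p))
end

rng = Xoshiro(42)
seq_len, d_model, d_ff = 6, 4, 16
init(m, n) = randn(rng, Float32, m, n) .* 0.5f0
p = (W_Q = init(d_model, d_model), W_K = init(d_model, d_model),
     W_V = init(d_model, d_model),
     W_1 = init(d_model, d_ff), b_1 = zeros(Float32, 1, d_ff),
     W_2 = init(d_ff, d_model), b_2 = zeros(Float32, 1, d_model))

X = randn(rng, Float32, seq_len, d_model)
Y = transformer_block(X, p)
println(size(Y))
```

Because the block maps a `seq_len × d_model` matrix to another matrix of the same shape, blocks can be stacked by feeding `Y` into the next one.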

10.4 Positional encoding

Self-attention is permutation-equivariant: shuffling the input tokens simply shuffles the output rows in the same way, so the layer itself has no notion of order. To inject sequence order, transformers add a positional encoding to the input embeddings. The original paper used sinusoidal functions:

\[ \begin{aligned} PE_{(pos, 2i)} &= \sin\!\bigl(pos / 10000^{2i/d}\bigr) \\ PE_{(pos, 2i+1)} &= \cos\!\bigl(pos / 10000^{2i/d}\bigr) \end{aligned} \]

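Here \(pos\) is the token position and \(i\) indexes pairs of feature dimensions, so each pair oscillates at its own frequency. This gives each position a unique signature that the network can learn to interpret.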
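A direct translation of the sinusoidal formula into Julia might look as follows. It assumes positions are counted from 0 and that \(d\) is even. Since Julia arrays are 1-based, feature dimension \(2i\) of the formula lands in column `2i + 1`. The function name `sinusoidal_encoding` and the random placeholder embeddings are illustrative.

```julia
# Sinusoidal positional encoding as an (n × d) matrix, one row per position.
function sinusoidal_encoding(n, d)
    @assert iseven(d) "d must be even"
    PE = zeros(Float32, n, d)
    for pos in 0:n-1, i in 0:(d ÷ 2 - 1)
        angle = pos / 10000^(2i / d)
        PE[pos + 1, 2i + 1] = sin(angle)   # dimension 2i in the formula
        PE[pos + 1, 2i + 2] = cos(angle)   # dimension 2i + 1 in the formula
    end
    return PE
end

seq_len, d_model = 6, 4
PE = sinusoidal_encoding(seq_len, d_model)

# The encoding is added to the token embeddings before the first block.
X = randn(Float32, seq_len, d_model)   # placeholder embeddings
X_pos = X .+ PE
```

The first row (position 0) is always `[0, 1, 0, 1, ...]`, because \(\sin 0 = 0\) and \(\cos 0 = 1\).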

10.5 Code example: minimal self-attention

We implement a minimal self-attention mechanism from scratch to illustrate the core computation. This is not meant for production use, but shows exactly what happens inside the attention layer.

using LinearAlgebra, Random, CairoMakie

rng = Xoshiro(42)

#-----sequence setup------
seq_len, d_model = 6, 4
X = randn(rng, Float32, seq_len, d_model)

#-----learned projections------
dₖ = d_model
W_Q = randn(rng, Float32, d_model, dₖ) .* 0.5f0
W_K = randn(rng, Float32, d_model, dₖ) .* 0.5f0
W_V = randn(rng, Float32, d_model, dₖ) .* 0.5f0

Q = X * W_Q
K = X * W_K
V = X * W_V

#-----attention weights------
scores = Q * K' ./ sqrt(Float32(dₖ))

#-----row-wise softmax------
function row_softmax(S)
    exp_S = exp.(S .- maximum(S, dims = 2))
    return exp_S ./ sum(exp_S, dims = 2)
end

attn_weights = row_softmax(scores)

#-----weighted output------
output = attn_weights * V

println("Attention weights (each row sums to 1):")
for i in 1:seq_len
    println("  Token $i → ", round.(attn_weights[i, :], digits = 3))
end

Attention weights (each row sums to 1):
  Token 1 → Float32[0.298, 0.13, 0.123, 0.194, 0.1, 0.156]
  Token 2 → Float32[0.131, 0.164, 0.164, 0.136, 0.168, 0.238]
  Token 3 → Float32[0.1, 0.16, 0.167, 0.102, 0.188, 0.282]
  Token 4 → Float32[0.208, 0.139, 0.128, 0.171, 0.103, 0.252]
  Token 5 → Float32[0.133, 0.169, 0.177, 0.131, 0.201, 0.189]
  Token 6 → Float32[0.142, 0.185, 0.179, 0.237, 0.176, 0.08]

# Visualize the attention pattern
fig = Figure(size = (400, 350))
ax = Axis(fig[1, 1], title = "Self-attention weights",
          xlabel = "Key position", ylabel = "Query position",
          xticks = 1:seq_len, yticks = 1:seq_len,
          yreversed = true)
hm = heatmap!(ax, 1:seq_len, 1:seq_len, attn_weights',
              colormap = :viridis)
Colorbar(fig[1, 2], hm, label = "Weight")
fig

How to read this plot:

  • Each row corresponds to one query token.
  • Bright cells indicate which key positions that query attends to most.
  • Rows should sum to 1 (a probability distribution).

This chapter demonstrates the mechanics of attention, not end-to-end benchmark performance. In real geoscience transformer models, evaluation should include holdout skill metrics (e.g., MAE/RMSE, event detection F1, or forecast skill scores) and comparisons against RNN/CNN baselines.
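As a small illustration of the holdout metrics mentioned above, the sketch below computes MAE and RMSE for a model and for a naive baseline. All numbers are hypothetical and exist only to show the computation; they are not results from any real model.

```julia
using Statistics

# Mean absolute error and root-mean-square error between predictions and observations.
mae(pred, obs) = mean(abs.(pred .- obs))
rmse(pred, obs) = sqrt(mean((pred .- obs) .^ 2))

# Hypothetical holdout targets and predictions, purely for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]
y_model = [1.5, 2.0, 2.5, 5.0]
y_baseline = [2.0, 2.0, 2.0, 2.0]   # a naive constant-prediction baseline

println("model:    MAE = ", mae(y_model, y_true), ", RMSE = ", rmse(y_model, y_true))
println("baseline: MAE = ", mae(y_baseline, y_true), ", RMSE = ", rmse(y_baseline, y_true))
```

Reporting the same metrics for a simple baseline alongside the model makes it clear how much of the skill comes from the architecture itself.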

10.6 Advantages over RNNs

| Feature                 | RNN                              | Transformer                    |
|-------------------------|----------------------------------|--------------------------------|
| Long-range dependencies | Difficult (vanishing gradients)  | Direct attention               |
| Parallelization         | Sequential (slow)                | Fully parallel                 |
| Memory cost             | \(O(n)\) per step                | \(O(n^2)\) for full attention  |
| Positional awareness    | Built-in (sequential processing) | Requires positional encoding   |

Transformers parallelize across sequence positions and scale well to large datasets and large models, which explains their dominance in modern AI. The price is the \(O(n^2)\) cost of full attention, so for very long sequences efficient variants (sparse attention, linear attention) are used to reduce it.
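To make the quadratic term concrete, the following back-of-the-envelope calculation gives the memory needed to store a single \(n \times n\) attention-weight matrix in `Float32`. It counts one head of one layer for one sequence and ignores everything else a real model stores; the helper name `attention_matrix_bytes` is illustrative.

```julia
# Bytes needed for one n × n attention-weight matrix stored as Float32.
attention_matrix_bytes(n) = n^2 * sizeof(Float32)

for n in (1_024, 16_384, 131_072)
    mib = attention_matrix_bytes(n) / 2^20
    println("n = $(n): ", mib, " MiB")
end
```

A 16-fold increase in sequence length multiplies this memory by 256, which is why full attention becomes impractical for very long inputs.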

10.7 Geoscience milestones

  • Earthquake detection and phase picking: Mousavi et al. (2020) introduced the Earthquake Transformer (EQTransformer), the first widely adopted attention-based model for simultaneous detection and seismic-phase picking from continuous waveform data.
  • Medium-range weather forecasting: Bi et al. (2023) (Pangu-Weather) is the milestone result for transformer-based global weather prediction, demonstrating skillful forecasts up to 7 days from a 3D attention architecture trained on ERA5 reanalysis.

The transformer is increasingly the architecture of choice for problems involving long sequences, multimodal data, or large-scale pre-training in geoscience.