Simple RNN — the idea that a network can process sequences by maintaining a hidden state (Elman, 1990).
LSTM — Long Short-Term Memory, which solved the vanishing-gradient problem for sequences and enabled learning over hundreds of time steps (Hochreiter & Schmidhuber, 1997).
GRU — Gated Recurrent Unit, a simplified variant of LSTM with comparable performance (Cho et al., 2014).
ConvLSTM — combining convolutional and recurrent structures for spatiotemporal prediction (Shi et al., 2015).
Feedforward and convolutional networks process each input independently. But many geoscience datasets are sequential: seismograms, well-log curves, climate records, and satellite time series all have a natural ordering in time (or depth). A recurrent neural network (RNN) is designed for this: it processes a sequence one step at a time, maintaining a hidden state that carries information from earlier steps to later ones.
9.1 The simple RNN
At each time step \(t\), a simple RNN receives the current input \(\mathbf{x}_t\) and the previous hidden state \(\mathbf{h}_{t-1}\), and produces a new hidden state:

\[
\mathbf{h}_t = \tanh\!\left(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}\right).
\]
Here the subscript \(t\) indexes time, while the subscripts on \(W_h\) and \(W_x\) identify the role of each matrix: hidden-to-hidden and input-to-hidden, respectively. This is the same convention introduced earlier in the part: superscripts are reserved for layer depth, and subscripts mark time or semantic roles.
The hidden state acts as the network’s memory. The output at each step can be read from \(\mathbf{h}_t\) directly or passed through an additional dense layer.
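To make the recurrence concrete, here is a minimal hand-rolled version of one RNN step in plain Julia. This is an illustrative sketch only: the sizes, initialization scale, and the `rnn_step` helper are arbitrary choices, not part of any library API.

```julia
using Random

# One step of the simple RNN recurrence: h_t = tanh(W_h h_{t-1} + W_x x_t + b)
rnn_step(Wh, Wx, b, h, x) = tanh.(Wh * h .+ Wx * x .+ b)

rng = MersenneTwister(1)
hidden, features, steps = 4, 3, 10
Wh = 0.1f0 .* randn(rng, Float32, hidden, hidden)    # hidden-to-hidden weights
Wx = 0.1f0 .* randn(rng, Float32, hidden, features)  # input-to-hidden weights
b  = zeros(Float32, hidden)
xs = [randn(rng, Float32, features) for _ in 1:steps]

h = zeros(Float32, hidden)   # initial hidden state
for x in xs
    global h = rnn_step(Wh, Wx, b, h, x)  # the state carries across steps
end
h
```

The same `h` is both consumed and produced at each step, which is exactly the "memory" described above: information from early inputs can only reach later steps by surviving this repeated update.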
Problem: In practice, simple RNNs struggle to learn long-range dependencies because gradients either vanish (shrink to zero) or explode (grow unboundedly) when propagated backward through many time steps.
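The vanishing case can be seen directly by accumulating the step-to-step Jacobians \(\partial\mathbf{h}_t/\partial\mathbf{h}_{t-1} = \mathrm{diag}(1-\mathbf{h}_t^2)\,W_h\) over many steps. The sketch below uses an arbitrary small random weight matrix (an assumption for illustration; with a large-norm matrix the same product would instead explode):

```julia
using LinearAlgebra, Random

rng = MersenneTwister(0)
n  = 8
Wh = 0.1 .* randn(rng, n, n)       # hidden-to-hidden weights with small norm
h  = randn(rng, n)
J  = Matrix{Float64}(I, n, n)      # accumulated Jacobian ∂h_t/∂h_0

for t in 1:50
    global h = tanh.(Wh * h)
    global J = Diagonal(1 .- h .^ 2) * Wh * J  # chain rule, one step back
end

opnorm(J)   # ≈ 0: the gradient signal from step 0 has vanished
```

Each factor has norm well below one here, so the product shrinks geometrically with sequence length, and the loss gradient with respect to early inputs becomes numerically zero.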
9.2 Long Short-Term Memory (LSTM)
The LSTM (Hochreiter & Schmidhuber, 1997) solves the vanishing-gradient problem by introducing a cell state \(\mathbf{c}_t\) alongside the hidden state \(\mathbf{h}_t\), controlled by three learned gates:

Forget gate \(\mathbf{f}_t\) — decides what information to discard from the cell state.
Input gate \(\mathbf{i}_t\) — decides what new information to store.
Output gate \(\mathbf{o}_t\) — decides what part of the cell state to expose.

\[
\begin{aligned}
\mathbf{f}_t &= \sigma\!\left(W_f[\mathbf{h}_{t-1};\mathbf{x}_t] + \mathbf{b}_f\right), \\
\mathbf{i}_t &= \sigma\!\left(W_i[\mathbf{h}_{t-1};\mathbf{x}_t] + \mathbf{b}_i\right), \\
\mathbf{o}_t &= \sigma\!\left(W_o[\mathbf{h}_{t-1};\mathbf{x}_t] + \mathbf{b}_o\right), \\
\tilde{\mathbf{c}}_t &= \tanh\!\left(W_c[\mathbf{h}_{t-1};\mathbf{x}_t] + \mathbf{b}_c\right), \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t),
\end{aligned}
\]

where \([\mathbf{h}_{t-1};\mathbf{x}_t]\) denotes concatenation, \(\sigma\) is the logistic sigmoid, and \(\odot\) is element-wise multiplication. As before, the subscript \(t\) denotes the time step; the different letter subscripts on the weight matrices indicate the gate they belong to.
The cell state can carry information unchanged through many time steps, and the gates learn to open and close during training, allowing the network to decide what to remember and what to forget.
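The gate logic can be transcribed almost literally into code. This is a minimal sketch with concatenated weights, biases omitted, and arbitrary sizes; real implementations such as Lux's `LSTMCell` fuse these operations for efficiency:

```julia
using Random

σ(x) = 1 / (1 + exp(-x))

# One LSTM step: all gates read [h; x], then the cell and hidden states update
function lstm_step(Wf, Wi, Wo, Wc, h, c, x)
    hx = vcat(h, x)
    f  = σ.(Wf * hx)         # forget gate: what to discard from c
    i  = σ.(Wi * hx)         # input gate: what new information to store
    o  = σ.(Wo * hx)         # output gate: what part of c to expose
    c̃  = tanh.(Wc * hx)      # candidate cell state
    c  = f .* c .+ i .* c̃    # keep some of the old cell, add some of the new
    h  = o .* tanh.(c)       # expose a gated view of the cell state
    return h, c
end

rng = MersenneTwister(2)
hidden, features = 4, 3
W() = 0.1f0 .* randn(rng, Float32, hidden, hidden + features)
h, c = zeros(Float32, hidden), zeros(Float32, hidden)
h, c = lstm_step(W(), W(), W(), W(), h, c, randn(rng, Float32, features))
```

Note the additive update `f .* c .+ i .* c̃`: when the forget gate saturates near one, the cell state passes through essentially unchanged, which is why gradients survive over many steps.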
9.3 Gated Recurrent Unit (GRU)
The GRU (Cho et al., 2014) is a simplification of the LSTM that merges the cell state and hidden state into a single state vector, using two gates instead of three:

Update gate \(\mathbf{z}_t\) — decides how much of the previous state to carry over to the new state.
Reset gate \(\mathbf{r}_t\) — decides how much of the previous state to use when computing the candidate state.
GRUs have fewer parameters than LSTMs and train faster, while producing similar results on many tasks.
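The parameter savings follow directly from the gate count: an LSTM has four weight blocks (three gates plus the candidate cell), a GRU has three. A quick back-of-the-envelope comparison, assuming one bias vector per block (some implementations, including Lux's cells, use a different bias convention, so exact library counts differ slightly):

```julia
# Per gate block: a (h × (h + d)) weight matrix plus a length-h bias,
# for hidden size h and input size d
h, d = 32, 1
block = h * (h + d) + h
lstm_params = 4 * block   # forget, input, output gates + candidate cell
gru_params  = 3 * block   # update, reset gates + candidate state
println("LSTM: $lstm_params parameters, GRU: $gru_params parameters")
```

For the same sizes the GRU is about 25% smaller, which is where its faster training mostly comes from.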
9.4 Code example: predicting a synthetic geophysical time series
We generate a synthetic oscillating signal (simulating a geophysical measurement with periodic and trend components) and train an LSTM to predict the next value given the recent past.
```julia
using Lux, Random, Optimisers, Zygote, Statistics, Printf, CairoMakie

rng = Xoshiro(42)

# Generate a synthetic time series: trend + oscillation + noise
t = Float32.(0:0.05:10)
signal = 0.3f0 .* t .+ sin.(2π .* 0.5f0 .* t) .+ 0.3f0 .* sin.(2π .* 1.3f0 .* t) .+
         0.15f0 .* randn(rng, Float32, length(t))

# Standardize before training so the recurrent model does not spend capacity on scale alone
μ_signal = mean(signal)
σ_signal = std(signal)
signal_scaled = (signal .- μ_signal) ./ σ_signal

# Create input/output pairs using a sliding window
window = 20
n_pairs = length(signal_scaled) - window
X_seq = zeros(Float32, 1, window, n_pairs)  # (features, time_steps, batch)
Y_seq = zeros(Float32, 1, n_pairs)          # (features, batch)
for i in 1:n_pairs
    X_seq[1, :, i] = signal_scaled[i:i+window-1]
    Y_seq[1, i] = signal_scaled[i+window]
end

# Train/test split (chronological split to respect time ordering)
n_train = Int(round(0.8 * n_pairs))
X_train, Y_train = X_seq[:, :, 1:n_train], Y_seq[:, 1:n_train]
X_test, Y_test = X_seq[:, :, n_train+1:end], Y_seq[:, n_train+1:end]
```
```julia
# Build an LSTM model that reads the sequence, then maps the final hidden state to a prediction
model = Chain(
    Recurrence(LSTMCell(1 => 32)),
    Dense(32 => 1),
)
ps, st = Lux.setup(rng, model)

function mse_loss(model, ps, st, data)
    x, y = data
    ŷ, st_new = model(x, ps, st)
    loss = mean((ŷ .- y) .^ 2)
    return loss, st_new, ()
end
```
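The prediction code below reads the fitted parameters from a `tstate` object. A minimal training loop producing it might look like the following sketch, written against Lux's `Training` API; the constructor signature, `single_train_step!`, `AutoZygote`, the `Adam(0.01f0)` learning rate, and the 250-epoch count are assumptions that vary across Lux versions and problems:

```julia
# Minimal training loop sketch (assumes model, ps, st, mse_loss, X_train, Y_train
# from the previous snippets; Lux.Training API details vary between versions)
tstate = Training.TrainState(model, ps, st, Adam(0.01f0))
for epoch in 1:250
    global tstate
    _, loss, _, tstate = Training.single_train_step!(
        AutoZygote(), mse_loss, (X_train, Y_train), tstate)
    epoch % 50 == 0 && @printf("epoch %3d  train loss %.4f\n", epoch, loss)
end
```

Note that only the training pairs are shown to the optimizer; the chronological test split stays held out.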
```julia
# Predict and plot
Y_pred, _ = model(X_seq, tstate.parameters, tstate.states)
Y_pred = σ_signal .* Y_pred .+ μ_signal

fig = Figure(size = (700, 350))
ax = Axis(fig[1, 1], xlabel = "Time step", ylabel = "Value",
          title = "LSTM time-series prediction")
lines!(ax, window+1:length(signal), signal[window+1:end],
       color = :black, label = "True", linewidth = 2)
lines!(ax, window+1:length(signal), vec(Y_pred),
       color = :coral, label = "LSTM prediction", linestyle = :dash)
axislegend(ax, position = :lt)
fig
```
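The plot overlays predictions on the full series, but the chronological split deserves its own number. A quick held-out check, continuing from the variables above (a sketch; the RMSE threshold for "good enough" depends entirely on the application):

```julia
# Held-out error on the chronological test split, in original signal units
Y_pred_test, _ = model(X_test, tstate.parameters, tstate.states)
rmse = sqrt(mean((σ_signal .* (vec(Y_pred_test) .- vec(Y_test))) .^ 2))
@printf("test RMSE: %.3f\n", rmse)
```

Because the test windows come from the end of the series, this measures genuine extrapolation in time rather than interpolation between shuffled samples.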
9.5 When to use RNNs
RNNs are the natural choice when:
Data has a sequential or temporal structure (time series, depth-indexed logs).
Order matters — shuffling the data would destroy information.
You need to capture dependencies between earlier and later parts of a sequence.
For very long sequences (thousands of steps), transformers (next chapter) often outperform RNNs because they can attend to any part of the sequence without passing information step by step.
9.6 Geoscience applications
Recurrent networks have been applied to a wide range of sequential geoscience problems:
Sequence modeling in seismology — recurrent and hybrid sequence models are widely used for waveform analysis and event interpretation. Zhu & Beroza (2019) is a closely related deep-learning benchmark for seismic arrival picking, though PhaseNet itself is primarily convolutional rather than recurrent.
Climate and weather forecasting — Ham et al. (2019) used a CNN-LSTM hybrid to forecast the El Niño–Southern Oscillation (ENSO) up to 18 months ahead, significantly outperforming physics-based dynamical models.
Precipitation nowcasting — Shi et al. (2015) introduced the ConvLSTM, combining convolutional and LSTM operations to predict radar echo sequences, a spatiotemporal forecasting task.
Machine learning in geoscience overview — Dramsch (2020) provides a comprehensive review of 70 years of machine learning in the geosciences, covering many recurrent-network applications in seismology, well-log analysis, and geophysical signal processing.
Cho, K., Merriënboer, B. van, Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. https://doi.org/10.3115/v1/D14-1179
Dramsch, J. S. (2020). 70 years of machine learning in geoscience in review. Advances in Geophysics, 61, 1–55.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211. https://doi.org/10.1207/s15516709cog1402_1
Ham, Y.-G., Kim, J.-H., & Luo, J.-J. (2019). Deep learning for multi-year ENSO forecasts. Nature, 573(7775), 568–572. https://doi.org/10.1038/s41586-019-1559-7
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W., & Woo, W. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28.
Zhu, W., & Beroza, G. C. (2019). PhaseNet: A deep-neural-network-based seismic arrival-time picking method. Geophysical Journal International, 216(1), 261–273. https://doi.org/10.1093/gji/ggy423