Geoscientific Machine Learning

Pankaj K Mishra

doi:10.5281/zenodo.19540496

4 Neural Networks

At this point you know how to read files, work with arrays and tables, and make figures. That is enough to start asking a more interesting question: can we build a model that learns a pattern directly from data instead of writing the rule by hand?

That is the role of a neural network. A neural network is just a layered function with many adjustable parameters. During training, those parameters are nudged until the network maps inputs to outputs in a useful way. The idea is simple. What makes neural networks powerful is that with enough data, enough nonlinearity, and a sensible training setup, they can approximate very complicated relationships — a property made precise by classical universal-approximation results (Cybenko, 1989; Hornik et al., 1989).

For geoscience, this matters because many problems involve patterns that are real but hard to write down explicitly: seismic facies, weather evolution, surrogate models for PDE solvers, or relationships between sparse observations and hidden structure in the subsurface. Neural networks are not magic shortcuts around physics, but they are flexible function approximators that become extremely useful when combined with scientific knowledge.

4.1 The basic picture

At the smallest scale, a neuron takes an input vector \(\mathbf{x}\), forms a weighted sum, adds a bias, and applies a nonlinear activation \(\sigma\):

\[ z \;=\; \sigma\!\bigl(\mathbf{w}^\top \mathbf{x} + b\bigr). \]

The activation \(\sigma\) is what turns a network from a glorified linear regression into a universal function approximator. Common choices are \(\tanh\), the sigmoid \(1/(1+e^{-x})\), and the rectified linear unit \(\mathrm{ReLU}(x) = \max(0, x)\). Without \(\sigma\), stacking layers would just compose linear maps and collapse back to a single linear map.

One neuron is not very interesting. A layer is a row of neurons computing different features of the same input, and a network is a stack of layers in which each layer’s output becomes the next layer’s input:

                   ┌──────────┐     ┌──────────┐     ┌──────────┐
   input  x   ───▶ │  layer 1 │ ──▶ │  layer 2 │ ──▶ │  layer L │ ───▶ output  ŷ
  (vector)        │  W¹, b¹  │     │  W², b²  │     │  Wᴸ, bᴸ  │       (vector)
                   └──────────┘     └──────────┘     └──────────┘
                   nonlinearity     nonlinearity     (often linear
                       σ                 σ            in the last layer)

Each layer applies the same neuron operation in vectorised form, \(\mathbf{h}^{(l)} = \sigma\!\bigl(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\bigr)\), with \(\mathbf{h}^{(0)} = \mathbf{x}\). The collection of all weight matrices and bias vectors is the network’s parameter set \(\theta\).

4.2 How training actually works

Training a network always comes back to the same loop:

Forward pass. Push the input \(\mathbf{x}\) through the network to get a prediction \(\hat{\mathbf{y}} = f_\theta(\mathbf{x})\).
Compare. Place \(\hat{\mathbf{y}}\) next to the target \(\mathbf{y}\).
Loss. Summarise the mismatch as a single number \(\mathcal{L}(\theta) = \ell(\hat{\mathbf{y}}, \mathbf{y})\) — mean squared error for regression, cross-entropy for classification, custom physical residuals later in this book.
Adjust. Move \(\theta\) in the direction that reduces \(\mathcal{L}\).

That last step is where neural networks earned their reputation. “Move in the direction that reduces the loss” means take a small step against the gradient,

\[ \theta \;\leftarrow\; \theta \;-\; \eta\, \nabla_\theta \mathcal{L}(\theta), \]

with learning rate \(\eta\). The gradient \(\nabla_\theta \mathcal{L}\) is computed efficiently by backpropagation (Rumelhart et al., 1986) — the chain rule applied systematically from the loss back through every layer — and in this book it is handled for us by Julia’s automatic differentiation packages (Zygote, Enzyme). Variants like stochastic mini-batches and the Adam optimizer (Kingma & Ba, 2015) sharpen this idea, but the core recipe is just gradient descent on a differentiable function.

The whole game is: differentiable model, differentiable loss, gradient step. The rest of this part is about understanding what each of those steps really means in practice.

4.3 How this part is organized

The next chapter gives you a first end-to-end example before any serious theory. It is easier to understand the moving parts once you have seen the whole workflow in action.

After that, we slow down and unpack the pieces:

Your First Neural Network — full workflow on a small dataset: load, build, train, predict.
Building Blocks of Neural Networks — neurons, layers, losses, gradients, and optimizers in detail.
Multilayer Perceptrons — the fully-connected architecture and what it is and is not good for.
Convolutional Neural Networks — translation-equivariant networks for gridded data like seismic sections and satellite imagery.
Recurrent Neural Networks — networks with state, used for time series such as well logs and weather sequences.
Transformers — attention-based models that have largely replaced RNNs for long-range dependencies.
Graph Neural Networks — networks defined on irregular topologies such as station networks and unstructured meshes.
Autoencoders — learn compressed representations and use them for denoising and anomaly detection.
Generative Adversarial Networks — a generator-versus-discriminator game for synthesising realistic samples.
Diffusion Models — generative models trained to reverse a gradual noising process.
Flow Matching — a continuous-time formulation that frames generation as solving an ODE between distributions.

The first group of chapters covers the major architecture families used throughout modern machine learning. The last group shifts to generative modelling: networks that learn to sample from a distribution rather than predict a single label.

The goal is not to turn you into a deep-learning theorist. The goal is to make these models legible enough that when they reappear later in scientific machine learning, they feel like tools you understand rather than black boxes.

4.4 Notation used in this part

To keep the mathematics readable, this part uses a few conventions consistently:

Bold symbols such as \(\mathbf{x}\) and \(\mathbf{h}\) denote vectors; matrices are bold capitals such as \(\mathbf{W}\).
Superscripts in parentheses, such as \(\mathbf{h}^{(l)}\), index layers or network depth.
Subscripts, such as \(\mathbf{h}_t\), usually index time steps, samples, or nodes.
\(\theta\) collects every trainable parameter in the network.
\(\mathcal{L}\) denotes a loss function, with subscripts such as \(\mathcal{L}_{\mathrm{data}}\) or \(\mathcal{L}_{\mathrm{bc}}\) naming particular pieces.

These are not universal rules across all textbooks, but they will keep the notation steady from chapter to chapter and through the scientific-ML part later.

4.5 What to watch for

When you first learn neural networks, it is tempting to focus on architecture names and package syntax. Those matter, but the deeper questions are simpler:

What is the input, and what is the output?
What exactly is being learned: a label, a field, an operator, a distribution?
What loss is being minimised, and what does minimising it actually buy you?
What assumptions are hidden in the data split, the architecture, and the training loop?

If you keep asking those questions, most neural-network papers become much easier to read — and most failure modes become much easier to diagnose.

4.6 Summary

Neural networks are flexible parameterised functions trained from data by gradient descent on a differentiable loss. Backpropagation turns “compute the gradient” from a hand-derivation chore into a software primitive, and that is the single technical fact that makes the rest of this book possible. In geoscience these models become especially useful when paired with domain structure, physical constraints, and careful evaluation. The next chapter starts with a concrete example so you can see the workflow before we dissect the machinery.