6  Building Blocks of Neural Networks

Tip: Key references

The modern understanding of neural networks rests on a small number of milestone contributions:

  • The perceptron — the first trainable artificial neuron (Rosenblatt, 1958).
  • Backpropagation — the algorithm that makes training deep networks practical (Rumelhart et al., 1986).
  • ReLU activations — a simple non-linearity that mitigated the vanishing-gradient problem in deep networks (Nair & Hinton, 2010).
  • Adam optimizer — an adaptive learning-rate method that became the default for most deep-learning work (Kingma & Ba, 2015).
  • Dropout — a regularization technique that prevents overfitting by randomly zeroing neurons during training (Srivastava et al., 2014).
  • Batch normalization — stabilizes and accelerates training by normalizing layer inputs (Ioffe & Szegedy, 2015).
  • Deep learning review — an authoritative overview of the entire field (LeCun et al., 2015).

In the previous chapter you trained a neural network without knowing what was going on inside. Now we unpack every piece.

A feedforward neural network is the simplest network topology: information moves from the input layer to the output layer with no recurrence, feedback loop, or hidden state. When this feedforward architecture is built by stacking several dense layers, it is usually called a multilayer perceptron (MLP). Later chapters will revisit that architecture in more detail, but this is the basic meaning of “feedforward” throughout the book.

6.1 Notation used in this part

Before we get into the details, it helps to keep a small notation guide in mind. We will use the same conventions throughout the neural-network chapters:

Symbol | Meaning | Convention
-------|---------|-----------
\(\mathbf{x}\), \(\mathbf{h}\), \(\hat{\mathbf{y}}\) | Vectors | Bold lowercase symbols denote vectors.
\(W^{(l)}\), \(\mathbf{b}^{(l)}\) | Weights and bias at layer \(l\) | Superscripts in parentheses index layer depth.
\(\mathbf{h}_t\) | Hidden state at time step \(t\) | Subscripts usually index time, samples, or nodes.
\(\mathbf{h}_i^{(k)}\) | Node \(i\) representation at GNN layer \(k\) | Subscript for node, superscript for layer.
\(\mathcal{L}\) | Total loss | Use subscripts such as \(\mathcal{L}_{\mathrm{data}}\) or \(\mathcal{L}_{\mathrm{bc}}\) for components.
\(\hat{y}\) | Generic prediction | In inversion chapters we may switch to symbols such as \(d^{\mathrm{pred}}\) and \(d^{\mathrm{obs}}\) when the data meaning matters.

These choices are not the only valid ones, but keeping them fixed makes the later chapters easier to read.

6.2 The artificial neuron

A biological neuron receives signals, integrates them, and fires if the total exceeds a threshold. An artificial neuron follows the same idea in simplified form. Given an input vector \(\mathbf{x} \in \mathbb{R}^n\), a neuron first forms a weighted sum (the pre-activation \(z\)) and then applies an activation function to produce its output \(a\):

\[ z = \mathbf{w}^\top \mathbf{x} + b, \qquad a = \sigma(z) \]

where:

  • \(\mathbf{w} \in \mathbb{R}^n\) is the weight vector — one weight per input.
  • \(b \in \mathbb{R}\) is the bias — a constant shift.
  • \(\sigma\) is a generic activation function — a non-linear function applied to the weighted sum.

When we mean a specific activation, we will write it explicitly, for example \(\tanh(\cdot)\) or \(\mathrm{ReLU}(\cdot)\).

The weights and bias are the learnable parameters. Training a neural network means finding values of \(\mathbf{w}\) and \(b\) (across all neurons) that make the network’s output match the desired target.
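The neuron above fits in a few lines of plain Julia. This is an illustrative sketch; the names (`σ`, `neuron`) and the sample values are made up here, not taken from any library:

```julia
using LinearAlgebra  # for dot

# Sigmoid activation and a single artificial neuron.
σ(z) = 1 / (1 + exp(-z))
neuron(x, w, b) = σ(dot(w, x) + b)   # weighted sum plus bias, then activation

x = [1.0, 2.0, 3.0]    # inputs
w = [0.1, -0.2, 0.3]   # one weight per input
b = 0.5                # bias
neuron(x, w, b)        # ≈ 0.75, a value in (0, 1)
```

Training would adjust `w` and `b`; everything else stays fixed.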

6.3 Activation functions

Without an activation function, stacking many layers would still produce a linear mapping — no matter how deep the network. The activation introduces non-linearity, which gives the network the ability to learn complex patterns. Common choices include:

Function | Formula | Notes
---------|---------|------
Sigmoid | \(\sigma(x) = \frac{1}{1+e^{-x}}\) | Squashes output to \((0,1)\). Used in early networks.
Tanh | \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\) | Centered at zero; range \((-1,1)\).
ReLU | \(\mathrm{ReLU}(x) = \max(0, x)\) | Default for most modern networks (Nair & Hinton, 2010).
Leaky ReLU | \(\max(\alpha x, x)\) for small \(\alpha\) | Avoids “dead neurons” where ReLU outputs zero permanently.

The first three can be plotted side by side with CairoMakie:
using CairoMakie

x = -4:0.01:4
fig = Figure(size = (700, 250))
ax1 = Axis(fig[1, 1], title = "Sigmoid", xlabel = "x")
lines!(ax1, x, 1 ./ (1 .+ exp.(-x)), color = :steelblue)
ax2 = Axis(fig[1, 2], title = "Tanh", xlabel = "x")
lines!(ax2, x, tanh.(x), color = :coral)
ax3 = Axis(fig[1, 3], title = "ReLU", xlabel = "x")
lines!(ax3, x, max.(0, x), color = :seagreen)
fig

6.4 Layers

A layer is a collection of neurons that operate in parallel on the same input. In a dense (fully connected) layer, every neuron receives every input:

\[ \mathbf{h}^{(l)} = \sigma\!\bigl(W^{(l)}\,\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\bigr) \]

where \(W^{(l)} \in \mathbb{R}^{m \times n}\) is the weight matrix for layer \(l\) (\(m\) neurons, \(n\) incoming features) and \(\mathbf{b}^{(l)} \in \mathbb{R}^m\) is the corresponding bias vector.

In Lux.jl, creating a dense layer with 3 inputs and 8 outputs using the tanh activation looks like:

using Lux, Random

layer = Dense(3 => 8, tanh)
rng = Xoshiro(0)
ps, st = Lux.setup(rng, layer)
println("Weight matrix size: ", size(ps.weight))
println("Bias vector size:   ", size(ps.bias))
Weight matrix size: (8, 3)
Bias vector size:   (8,)

6.5 Networks: stacking layers

A neural network is formed by chaining layers so that the output of one layer becomes the input to the next. In Lux.jl this is done with Chain:

model = Chain(
    Dense(3 => 16, relu),   # hidden layer 1
    Dense(16 => 8, relu),   # hidden layer 2
    Dense(8 => 1)            # output layer (no activation → raw value)
)
ps, st = Lux.setup(rng, model)
(Lux.setup returns two NamedTuples: ps, holding a weight matrix and bias vector for each of layer_1, layer_2, and layer_3, and st, the per-layer state, which is empty for dense layers.)

The number of layers and the number of neurons per layer are hyperparameters — choices made by the user, not learned from data.

6.6 The loss function

The loss function (or cost function) measures how far the network’s predictions are from the true targets. Training tries to minimize this value. Common choices:

  • Mean Squared Error (MSE) — for regression: \(\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2\)
  • Cross-entropy — for classification (shown here in its binary form): \(\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)\bigr]\)

In the chapters ahead, when the loss has several pieces, we will write them as named components such as \(\mathcal{L}_{\mathrm{data}}\), \(\mathcal{L}_{\mathrm{pde}}\), or \(\mathcal{L}_{\mathrm{bc}}\) rather than leaving them unnamed inside one long equation.

The choice of loss function depends on the problem. Regression tasks almost always use MSE; classification tasks use cross-entropy.
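Both losses are one-liners in plain Julia. These are illustrative definitions matching the formulas above, not a library API:

```julia
# Mean squared error and binary cross-entropy, as defined above.
mse(ŷ, y) = sum(abs2, ŷ .- y) / length(y)
binary_crossentropy(ŷ, y) = -sum(@. y * log(ŷ) + (1 - y) * log(1 - ŷ)) / length(y)

mse([1.0, 2.0], [1.0, 4.0])                  # (0 + 4) / 2 = 2.0
binary_crossentropy([0.9, 0.1], [1.0, 0.0])  # -log(0.9) ≈ 0.105, small because both predictions are confident and correct
```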

6.7 Backpropagation and gradients

Training requires knowing how each parameter affects the loss. This information is captured by the gradient — the partial derivative of the loss with respect to each parameter.

The backpropagation algorithm (Rumelhart et al., 1986) computes these gradients efficiently using the chain rule of calculus, working backward from the loss through each layer to the parameters:

\[ \frac{\partial \mathcal{L}}{\partial w_{ij}} = \frac{\partial \mathcal{L}}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}} \]

In Julia, automatic differentiation libraries such as Zygote.jl handle this for you. You write the forward computation; Zygote computes the gradients automatically.
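The chain-rule computation can be verified numerically for a single sigmoid neuron with squared-error loss. This is a hand-rolled sketch for intuition; an automatic-differentiation library would produce the same gradient without the manual derivation:

```julia
σ(z) = 1 / (1 + exp(-z))
loss(w, x, y) = (σ(w * x) - y)^2       # scalar neuron, no bias, for clarity

# Chain rule: dL/dw = 2(a - y) · σ'(z) · x, with a = σ(z) and σ'(z) = a(1 - a)
function dloss_dw(w, x, y)
    a = σ(w * x)
    return 2 * (a - y) * a * (1 - a) * x
end

w, x, y = 0.3, 1.5, 1.0
fd = (loss(w + 1e-6, x, y) - loss(w - 1e-6, x, y)) / 2e-6  # central finite difference
dloss_dw(w, x, y) ≈ fd  # true: analytic and numeric gradients agree
```

This finite-difference check is also a useful debugging tool whenever you write a gradient by hand.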

6.8 Optimizers

Once we have the gradients, we need a rule for updating the parameters. The simplest approach is gradient descent:

\[ \mathbf{w} \leftarrow \mathbf{w} - \eta \,\nabla_{\mathbf{w}} \mathcal{L} \]

where \(\eta\) is the learning rate. A small \(\eta\) means slow but stable progress; a large \(\eta\) learns faster but risks overshooting.
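The update rule can be demonstrated on a toy quadratic loss \(\mathcal{L}(w) = (w - 3)^2\), whose minimum sits at \(w = 3\). An illustrative sketch, not a library API:

```julia
# Plain gradient descent on L(w) = (w - 3)^2.
function descend(w; η = 0.1, steps = 100)
    for _ in 1:steps
        w -= η * 2 * (w - 3)   # w ← w - η ∇L(w)
    end
    return w
end

descend(0.0)           # ≈ 3.0: converges to the minimum
descend(0.0; η = 1.1)  # diverges: the learning rate is too large
```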

In practice, more sophisticated optimizers are used:

  • SGD with momentum — accumulates a running average of past gradients to smooth updates.
  • Adam (Kingma & Ba, 2015) — adapts the learning rate individually for each parameter. It is the most widely used optimizer and a good default.
  • RMSProp — scales updates by a running average of squared gradients; often effective for noisy sequence problems.
  • AdamW — Adam with decoupled weight decay; often preferred in modern deep-learning training because regularization behaves more predictably.
  • L-BFGS — a quasi-Newton optimizer useful in some scientific ML settings (including some PINN workflows) for final fine-tuning after Adam.

6.8.1 Learning-rate schedules

In practice, keeping a fixed learning rate for all epochs is rarely optimal. Common schedules are:

  • Step decay: reduce \(\eta\) by a factor every N epochs.
  • Cosine decay: gradually lower \(\eta\) with a cosine schedule.
  • Warmup + decay: start with a small \(\eta\), increase for a few epochs, then decay.

A good practical pattern is: start with Adam/AdamW at a moderate learning rate, then reduce the learning rate once validation loss plateaus.
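The three schedules above can be sketched as simple functions of the epoch. The names and default factors here are illustrative choices, not a standard API:

```julia
# Step decay: halve the rate every `every` epochs.
step_decay(η0, epoch; drop = 0.5, every = 10) = η0 * drop^(epoch ÷ every)

# Cosine decay: smooth descent from η0 to 0 over `total` epochs.
cosine_decay(η0, epoch, total) = η0 * (1 + cos(π * epoch / total)) / 2

# Warmup + decay: ramp up linearly for `warmup` epochs, then cosine-decay.
warmup_cosine(η0, epoch, warmup, total) =
    epoch < warmup ? η0 * epoch / warmup : cosine_decay(η0, epoch - warmup, total - warmup)

step_decay(1e-3, 25)          # 1e-3 · 0.5² = 2.5e-4
cosine_decay(1e-3, 0, 100)    # 1e-3 at the start ...
cosine_decay(1e-3, 100, 100)  # ... 0.0 at the end
```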

6.9 Regularization

Deep networks can overfit: they memorize the training data instead of learning general patterns. Regularization techniques reduce this risk:

  • Dropout (Srivastava et al., 2014) — during training, randomly sets a fraction of neuron outputs to zero. At test time, all neurons are active but their outputs are scaled. This forces the network to not rely on any single neuron.
  • Batch normalization (Ioffe & Szegedy, 2015) — normalizes the input to each layer across the current mini-batch. This stabilizes training and acts as a mild regularizer.
  • Weight decay — adds a penalty proportional to the squared weights to the loss, discouraging large parameter values.
  • Early stopping — monitors validation loss during training and stops when it begins to increase.
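Early stopping, the last item above, needs nothing more than the record of validation losses. A minimal sketch with illustrative names:

```julia
# Stop when the best validation loss is `patience` or more epochs old.
function should_stop(val_losses; patience = 5)
    best = argmin(val_losses)
    return length(val_losses) - best >= patience
end

should_stop([1.0, 0.8, 0.7, 0.72, 0.75])                 # false: best loss was only 2 epochs ago
should_stop([1.0, 0.8, 0.7, 0.72, 0.75, 0.8, 0.9, 1.0])  # true: no improvement for 5 epochs
```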

6.10 The training loop

Putting it all together, training a neural network follows this loop:

  1. Forward pass — feed input through the network to get a prediction.
  2. Compute loss — compare prediction to true target.
  3. Backward pass — compute gradients of the loss with respect to all parameters.
  4. Update parameters — apply the optimizer rule.
  5. Repeat for many epochs (full passes through the training data).

Each iteration over a subset (mini-batch) of the data is one step. One full pass through the entire dataset is one epoch. The code example in the previous chapter follows exactly this loop.
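The five steps fit in a dozen lines of plain Julia for a one-parameter model \(\hat{y} = w\,x\) trained on MSE. An illustrative miniature, not a library workflow:

```julia
xs = [1.0, 2.0, 3.0, 4.0]
ys = 2.0 .* xs                      # target relationship: y = 2x

function train(xs, ys; η = 0.01, epochs = 200)
    w = 0.0
    for _ in 1:epochs               # 5. repeat for many epochs
        ŷ = w .* xs                 # 1. forward pass
        # 2. loss = mean((ŷ - ys).^2); 3. backward pass: its gradient in w
        g = 2 * sum((ŷ .- ys) .* xs) / length(xs)
        w -= η * g                  # 4. parameter update (plain gradient descent)
    end
    return w
end

train(xs, ys)  # ≈ 2.0: the loop recovers the true slope
```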

6.11 Minimum diagnostics checklist (for geoscience workflows)

Before trusting any neural-network result, check the following:

  1. Separate splits: train, validation, and test sets must be separate in time/space where relevant.
  2. Baseline comparison: compare against a simple baseline (mean predictor, persistence model, linear regression).
  3. Generalization gap: if train loss is low but validation/test loss is high, you are overfitting.
  4. Domain sanity check: predictions should respect basic geoscientific constraints (ranges, trends, known physics).
  5. Error by regime: report errors by important regimes (e.g., depth interval, facies class, season, tectonic setting), not only one global metric.

These checks are often more important than squeezing out a small gain in one metric.
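Checklist item 2 is cheap to implement: for regression, the MSE of the mean predictor (which equals the variance of the targets) is a floor any model must beat. An illustrative helper, not a library function:

```julia
# MSE of always predicting the mean of y: the simplest regression baseline.
baseline_mse(y) = sum(abs2, y .- sum(y) / length(y)) / length(y)

baseline_mse([1.0, 2.0, 3.0])  # 2/3 ≈ 0.667: a useful model must score below this
```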

6.12 Summary

Concept | Role
--------|-----
Neuron | Weighted sum → activation
Layer | Collection of neurons
Network | Chain of layers
Loss | Measures prediction error
Gradient | Direction to improve parameters
Backpropagation | Efficient gradient computation
Optimizer | Parameter update rule
Regularization | Prevents overfitting