6  Building Blocks of Neural Networks

Tip: Key references

The modern understanding of neural networks rests on a small number of milestone contributions:

  • The perceptron — the first trainable artificial neuron (Rosenblatt, 1958).
  • Backpropagation — the algorithm that makes training deep networks practical (Rumelhart et al., 1986).
  • ReLU activations — a simple non-linearity that mitigated the vanishing-gradient problem in deep networks (Nair & Hinton, 2010).
  • Adam optimizer — an adaptive learning-rate method that became the default for most deep-learning work (Kingma & Ba, 2015).
  • Dropout — a regularization technique that prevents overfitting by randomly zeroing neurons during training (Srivastava et al., 2014).
  • Batch normalization — stabilizes and accelerates training by normalizing layer inputs (Ioffe & Szegedy, 2015).
  • Deep learning review — an authoritative overview of the entire field (LeCun et al., 2015).

In the previous chapter you trained a neural network without knowing what was going on inside. Now we unpack every piece.

A feedforward neural network is the simplest network topology: information moves from the input layer to the output layer with no recurrence, feedback loop, or hidden state. When this feedforward architecture is built by stacking several dense layers, it is usually called a multilayer perceptron (MLP). Later chapters will revisit that architecture in more detail, but this is the basic meaning of “feedforward” throughout the book.

6.1 Notation used in this part

Before we get into the details, it helps to keep a small notation guide in mind. We will use the same conventions throughout the neural-network chapters:

Symbol | Meaning | Convention
-------|---------|-----------
\(\mathbf{x}\), \(\mathbf{h}\), \(\hat{\mathbf{y}}\) | Vectors | Bold lowercase symbols denote vectors.
\(W^{(l)}\), \(\mathbf{b}^{(l)}\) | Weights and bias at layer \(l\) | Superscripts in parentheses index layer depth.
\(\mathbf{h}_t\) | Hidden state at time step \(t\) | Subscripts usually index time, samples, or nodes.
\(\mathbf{h}_i^{(k)}\) | Node \(i\) representation at GNN layer \(k\) | Subscript for node, superscript for layer.
\(\mathcal{L}\) | Total loss | Use subscripts such as \(\mathcal{L}_{\mathrm{data}}\) or \(\mathcal{L}_{\mathrm{bc}}\) for components.
\(\hat{y}\) | Generic prediction | In inversion chapters we may switch to symbols such as \(d^{\mathrm{pred}}\) and \(d^{\mathrm{obs}}\) when the data meaning matters.

These choices are not the only valid ones, but keeping them fixed makes the later chapters easier to read.

6.2 The artificial neuron

A biological neuron receives signals, integrates them, and fires if the total exceeds a threshold. An artificial neuron follows the same idea in simplified form. Given an input vector \(\mathbf{x} \in \mathbb{R}^n\), a neuron first forms a weighted sum (the pre-activation \(z\)) and then applies an activation function to produce its output \(a\):

\[ z = \mathbf{w}^\top \mathbf{x} + b, \qquad a = \sigma(z) \]

where:

  • \(\mathbf{w} \in \mathbb{R}^n\) is the weight vector — one weight per input.
  • \(b \in \mathbb{R}\) is the bias — a constant shift.
  • \(\sigma\) is a generic activation function — a non-linear function applied to the weighted sum.

When we mean a specific activation, we will write it explicitly, for example \(\tanh(\cdot)\) or \(\mathrm{ReLU}(\cdot)\).

The weights and bias are the learnable parameters. Training a neural network means finding values of \(\mathbf{w}\) and \(b\) (across all neurons) that make the network’s output match the desired target.
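The neuron above fits in a few lines of plain Julia. This is an illustrative sketch; the names (`σ`, `neuron`) and the sample values are made up here, not taken from any library:

```julia
using LinearAlgebra  # for dot

# Sigmoid activation and a single artificial neuron.
σ(z) = 1 / (1 + exp(-z))
neuron(x, w, b) = σ(dot(w, x) + b)   # weighted sum plus bias, then activation

x = [1.0, 2.0, 3.0]    # inputs
w = [0.1, -0.2, 0.3]   # one weight per input
b = 0.5                # bias
neuron(x, w, b)        # ≈ 0.75, a value in (0, 1)
```

Training would adjust `w` and `b`; everything else stays fixed.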

6.3 Activation functions

Without an activation function, stacking many layers would still produce a linear mapping — no matter how deep the network. The activation introduces non-linearity, which gives the network the ability to learn complex patterns. Common choices include:

Function | Formula | Notes
---------|---------|------
Sigmoid | \(\sigma(x) = \frac{1}{1+e^{-x}}\) | Squashes output to \((0,1)\). Used in early networks.
Tanh | \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\) | Centered at zero; range \((-1,1)\).
ReLU | \(\mathrm{ReLU}(x) = \max(0, x)\) | Default for most modern networks (Nair & Hinton, 2010).
Leaky ReLU | \(\max(\alpha x, x)\) for small \(\alpha\) | Avoids “dead neurons” where ReLU outputs zero permanently.

The first three can be plotted side by side with CairoMakie:
using CairoMakie

x = -4:0.01:4
fig = Figure(size = (700, 250))
ax1 = Axis(fig[1, 1], title = "Sigmoid", xlabel = "x")
lines!(ax1, x, 1 ./ (1 .+ exp.(-x)), color = :steelblue)
ax2 = Axis(fig[1, 2], title = "Tanh", xlabel = "x")
lines!(ax2, x, tanh.(x), color = :coral)
ax3 = Axis(fig[1, 3], title = "ReLU", xlabel = "x")
lines!(ax3, x, max.(0, x), color = :seagreen)
fig

6.4 Layers

A layer is a collection of neurons that operate in parallel on the same input. In a dense (fully connected) layer, every neuron receives every input:

\[ \mathbf{h}^{(l)} = \sigma\!\bigl(W^{(l)}\,\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\bigr) \]

where \(W^{(l)} \in \mathbb{R}^{m \times n}\) is the weight matrix for layer \(l\) (\(m\) neurons, \(n\) incoming features) and \(\mathbf{b}^{(l)} \in \mathbb{R}^m\) is the corresponding bias vector.

In Lux.jl, creating a dense layer with 3 inputs and 8 outputs using the tanh activation looks like:

using Lux, Random

layer = Dense(3 => 8, tanh)
rng = Xoshiro(0)
ps, st = Lux.setup(rng, layer)
println("Weight matrix size: ", size(ps.weight))
println("Bias vector size:   ", size(ps.bias))
Weight matrix size: (8, 3)
Bias vector size:   (8,)

6.5 Networks: stacking layers

A neural network is formed by chaining layers so that the output of one layer becomes the input to the next. In Lux.jl this is done with Chain:

model = Chain(
    Dense(3 => 16, relu),   # hidden layer 1
    Dense(16 => 8, relu),   # hidden layer 2
    Dense(8 => 1)            # output layer (no activation → raw value)
)
ps, st = Lux.setup(rng, model)
(Lux.setup returns two NamedTuples: ps, holding a weight matrix and bias vector for each of layer_1, layer_2, and layer_3, and st, the per-layer state, which is empty for dense layers.)

The number of layers and the number of neurons per layer are hyperparameters — choices made by the user, not learned from data.

6.6 The loss function

The loss function (or cost function) measures how far the network’s predictions are from the true targets. Training tries to minimize this value. Common choices:

  • Mean Squared Error (MSE) — for regression: \(\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2\)
  • Cross-entropy — for classification (shown here in its binary form): \(\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)\bigr]\)

In the chapters ahead, when the loss has several pieces, we will write them as named components such as \(\mathcal{L}_{\mathrm{data}}\), \(\mathcal{L}_{\mathrm{pde}}\), or \(\mathcal{L}_{\mathrm{bc}}\) rather than leaving them unnamed inside one long equation.

The choice of loss function depends on the problem. Regression tasks almost always use MSE; classification tasks use cross-entropy.
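Both losses are one-liners in plain Julia. These are illustrative definitions matching the formulas above, not a library API:

```julia
# Mean squared error and binary cross-entropy, as defined above.
mse(ŷ, y) = sum(abs2, ŷ .- y) / length(y)
binary_crossentropy(ŷ, y) = -sum(@. y * log(ŷ) + (1 - y) * log(1 - ŷ)) / length(y)

mse([1.0, 2.0], [1.0, 4.0])                  # (0 + 4) / 2 = 2.0
binary_crossentropy([0.9, 0.1], [1.0, 0.0])  # -log(0.9) ≈ 0.105, small because both predictions are confident and correct
```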

6.7 Backpropagation and gradients

Training requires knowing how each parameter affects the loss. This information is captured by the gradient — the partial derivative of the loss with respect to each parameter.

The backpropagation algorithm (Rumelhart et al., 1986) computes these gradients efficiently using the chain rule of calculus, working backward from the loss through each layer to the parameters:

\[ \frac{\partial \mathcal{L}}{\partial w_{ij}} = \frac{\partial \mathcal{L}}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}} \]

In Julia, automatic differentiation libraries such as Zygote.jl handle this for you. You write the forward computation; Zygote computes the gradients automatically.
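The chain-rule computation can be verified numerically for a single sigmoid neuron with squared-error loss. This is a hand-rolled sketch for intuition; an automatic-differentiation library would produce the same gradient without the manual derivation:

```julia
σ(z) = 1 / (1 + exp(-z))
loss(w, x, y) = (σ(w * x) - y)^2       # scalar neuron, no bias, for clarity

# Chain rule: dL/dw = 2(a - y) · σ'(z) · x, with a = σ(z) and σ'(z) = a(1 - a)
function dloss_dw(w, x, y)
    a = σ(w * x)
    return 2 * (a - y) * a * (1 - a) * x
end

w, x, y = 0.3, 1.5, 1.0
fd = (loss(w + 1e-6, x, y) - loss(w - 1e-6, x, y)) / 2e-6  # central finite difference
dloss_dw(w, x, y) ≈ fd  # true: analytic and numeric gradients agree
```

This finite-difference check is also a useful debugging tool whenever you write a gradient by hand.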

6.8 Optimizers

Once we have the gradients, we need a rule for updating the parameters. The simplest approach is gradient descent:

\[ \mathbf{w} \leftarrow \mathbf{w} - \eta \,\nabla_{\mathbf{w}} \mathcal{L} \]

where \(\eta\) is the learning rate. A small \(\eta\) means slow but stable progress; a large \(\eta\) learns faster but risks overshooting.
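The update rule can be demonstrated on a toy quadratic loss \(\mathcal{L}(w) = (w - 3)^2\), whose minimum sits at \(w = 3\). An illustrative sketch, not a library API:

```julia
# Plain gradient descent on L(w) = (w - 3)^2.
function descend(w; η = 0.1, steps = 100)
    for _ in 1:steps
        w -= η * 2 * (w - 3)   # w ← w - η ∇L(w)
    end
    return w
end

descend(0.0)           # ≈ 3.0: converges to the minimum
descend(0.0; η = 1.1)  # diverges: the learning rate is too large
```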

In practice, more sophisticated optimizers are used:

  • SGD with momentum — accumulates a running average of past gradients to smooth updates.
  • Adam (Kingma & Ba, 2015) — adapts the learning rate individually for each parameter. It is the most widely used optimizer and a good default.
  • RMSProp — scales updates by a running average of squared gradients; often effective for noisy sequence problems.
  • AdamW — Adam with decoupled weight decay; often preferred in modern deep-learning training because regularization behaves more predictably.
  • L-BFGS — a quasi-Newton optimizer useful in some scientific ML settings (including some PINN workflows) for final fine-tuning after Adam.

6.8.1 Learning-rate schedules

In practice, keeping a fixed learning rate for all epochs is rarely optimal. Common schedules are:

  • Step decay: reduce \(\eta\) by a factor every N epochs.
  • Cosine decay: gradually lower \(\eta\) with a cosine schedule.
  • Warmup + decay: start with a small \(\eta\), increase for a few epochs, then decay.

A good practical pattern is: start with Adam/AdamW at a moderate learning rate, then reduce the learning rate once validation loss plateaus.
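The three schedules above can be sketched as simple functions of the epoch. The names and default factors here are illustrative choices, not a standard API:

```julia
# Step decay: halve the rate every `every` epochs.
step_decay(η0, epoch; drop = 0.5, every = 10) = η0 * drop^(epoch ÷ every)

# Cosine decay: smooth descent from η0 to 0 over `total` epochs.
cosine_decay(η0, epoch, total) = η0 * (1 + cos(π * epoch / total)) / 2

# Warmup + decay: ramp up linearly for `warmup` epochs, then cosine-decay.
warmup_cosine(η0, epoch, warmup, total) =
    epoch < warmup ? η0 * epoch / warmup : cosine_decay(η0, epoch - warmup, total - warmup)

step_decay(1e-3, 25)          # 1e-3 · 0.5² = 2.5e-4
cosine_decay(1e-3, 0, 100)    # 1e-3 at the start ...
cosine_decay(1e-3, 100, 100)  # ... 0.0 at the end
```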

6.9 Regularization

Deep networks can overfit: they memorize the training data instead of learning general patterns. Regularization techniques reduce this risk:

  • Dropout (Srivastava et al., 2014) — during training, randomly sets a fraction of neuron outputs to zero. At test time, all neurons are active but their outputs are scaled. This forces the network to not rely on any single neuron.
  • Batch normalization (Ioffe & Szegedy, 2015) — normalizes the input to each layer across the current mini-batch. This stabilizes training and acts as a mild regularizer.
  • Weight decay — adds a penalty proportional to the squared weights to the loss, discouraging large parameter values.
  • Early stopping — monitors validation loss during training and stops when it begins to increase.
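Early stopping, the last item above, needs nothing more than the record of validation losses. A minimal sketch with illustrative names:

```julia
# Stop when the best validation loss is `patience` or more epochs old.
function should_stop(val_losses; patience = 5)
    best = argmin(val_losses)
    return length(val_losses) - best >= patience
end

should_stop([1.0, 0.8, 0.7, 0.72, 0.75])                 # false: best loss was only 2 epochs ago
should_stop([1.0, 0.8, 0.7, 0.72, 0.75, 0.8, 0.9, 1.0])  # true: no improvement for 5 epochs
```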

6.10 The training loop

Putting it all together, training a neural network follows this loop:

  1. Forward pass — feed input through the network to get a prediction.
  2. Compute loss — compare prediction to true target.
  3. Backward pass — compute gradients of the loss with respect to all parameters.
  4. Update parameters — apply the optimizer rule.
  5. Repeat for many epochs (full passes through the training data).

Each iteration over a subset (mini-batch) of the data is one step. One full pass through the entire dataset is one epoch. The code example in the previous chapter follows exactly this loop.
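The five steps fit in a dozen lines of plain Julia for a one-parameter model \(\hat{y} = w\,x\) trained on MSE. An illustrative miniature, not a library workflow:

```julia
xs = [1.0, 2.0, 3.0, 4.0]
ys = 2.0 .* xs                      # target relationship: y = 2x

function train(xs, ys; η = 0.01, epochs = 200)
    w = 0.0
    for _ in 1:epochs               # 5. repeat for many epochs
        ŷ = w .* xs                 # 1. forward pass
        # 2. loss = mean((ŷ - ys).^2); 3. backward pass: its gradient in w
        g = 2 * sum((ŷ .- ys) .* xs) / length(xs)
        w -= η * g                  # 4. parameter update (plain gradient descent)
    end
    return w
end

train(xs, ys)  # ≈ 2.0: the loop recovers the true slope
```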

6.11 Minimum diagnostics checklist (for geoscience workflows)

Before trusting any neural-network result, check the following:

  1. Separate splits: train, validation, and test sets must be separate in time/space where relevant.
  2. Baseline comparison: compare against a simple baseline (mean predictor, persistence model, linear regression).
  3. Generalization gap: if train loss is low but validation/test loss is high, you are overfitting.
  4. Domain sanity check: predictions should respect basic geoscientific constraints (ranges, trends, known physics).
  5. Error by regime: report errors by important regimes (e.g., depth interval, facies class, season, tectonic setting), not only one global metric.

These checks are often more important than squeezing out a small gain in one metric.
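Checklist item 2 is cheap to implement: for regression, the MSE of the mean predictor (which equals the variance of the targets) is a floor any model must beat. An illustrative helper, not a library function:

```julia
# MSE of always predicting the mean of y: the simplest regression baseline.
baseline_mse(y) = sum(abs2, y .- sum(y) / length(y)) / length(y)

baseline_mse([1.0, 2.0, 3.0])  # 2/3 ≈ 0.667: a useful model must score below this
```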

6.12 Summary

Concept | Role
--------|-----
Neuron | Weighted sum → activation
Layer | Collection of neurons
Network | Chain of layers
Loss | Measures prediction error
Gradient | Direction to improve parameters
Backpropagation | Efficient gradient computation
Optimizer | Parameter update rule
Regularization | Prevents overfitting