7  Multilayer Perceptrons

Key references
  • Universal approximation — a multilayer perceptron with a single hidden layer can approximate any continuous function to arbitrary accuracy, given enough neurons (Cybenko, 1989; Hornik et al., 1989).
  • Backpropagation — the training algorithm that made multilayer networks practical (Rumelhart et al., 1986).
  • Deep learning review — comprehensive overview of multilayer perceptrons and deep architectures (LeCun et al., 2015).

A multilayer perceptron (MLP) is the standard dense feedforward network introduced in the previous chapter. Here the focus is on what that architecture can represent, how it is trained in practice, and when it is a good baseline choice.

7.1 Architecture

An MLP consists of:

  1. An input layer — one node per feature (not a computation layer, just the data entry point).
  2. One or more hidden layers — dense layers with activation functions.
  3. An output layer — produces the final prediction.

Every neuron in one layer is connected to every neuron in the next layer, which is why these are called fully connected or dense layers.

For an MLP with \(L\) hidden layers, the computation at layer \(l\) is:

\[ \mathbf{h}^{(l)} = \sigma\!\bigl(W^{(l)}\,\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\bigr) \]

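where \(\mathbf{h}^{(0)} = \mathbf{x}\) is the input and the output of the final layer, \(\mathbf{h}^{(L+1)} = \hat{\mathbf{y}}\), is the prediction. The output layer usually replaces \(\sigma\) with a task-specific function: the identity for regression, or a softmax for classification.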
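The layer recursion can be sketched in a few lines of plain Julia, without any deep learning library. This is only an illustration under stated assumptions: `mlp_forward`, the layer sizes, and the choice of `tanh` are hypothetical, and the output layer is left linear, which is the usual choice for regression.

```julia
using Random

# Hypothetical helper: forward pass through a list of (W, b) pairs.
# σ is applied on hidden layers only; the output layer is left linear (assumption).
function mlp_forward(layers, x; σ = tanh)
    h = x
    for (l, (W, b)) in enumerate(layers)
        z = W * h .+ b                      # affine map W h + b for layer l
        h = l < length(layers) ? σ.(z) : z  # activation on hidden layers only
    end
    return h
end

rng = Xoshiro(0)
sizes = [3, 8, 8, 2]                        # input, two hidden layers, output
layers = [(randn(rng, sizes[i+1], sizes[i]), zeros(sizes[i+1]))
          for i in 1:length(sizes)-1]

X = randn(rng, 3, 5)                        # 5 samples stored as columns
Y_hat = mlp_forward(layers, X)
size(Y_hat)                                 # (2, 5)
```

Storing samples as columns means a whole batch passes through each layer as a single matrix product, which is the same features-by-samples layout used in the Lux example later in this chapter.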

7.2 Universal approximation

The universal approximation theorem (Cybenko, 1989; Hornik et al., 1989) states that a feedforward network with a single hidden layer, a suitable nonlinear activation (sigmoidal in Cybenko's proof), and sufficiently many neurons can approximate any continuous function on a compact set to arbitrary precision. This is an existence result only: it says nothing about how many neurons are needed or how to find the weights. In practice, deeper but narrower MLPs often generalize better than a single very wide hidden layer.
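A small numerical experiment can make the "given enough neurons" part tangible. The sketch below is not how MLPs are normally trained: it fixes random ReLU hidden units and fits only the output weights by least squares (a random-feature fit, swapped in here because it needs no optimiser). The target function, the seed, the widths, and the helper names `hidden` and `fit_rmse` are all assumptions chosen for illustration.

```julia
using LinearAlgebra, Random, Statistics

relu(z) = max(z, zero(z))
f_target(x) = sinpi(2x) * exp(-x^2)     # a damped sine wave, chosen for illustration

rng = Xoshiro(1)
x = collect(range(-2, 2, length = 400))
y = f_target.(x)

# Random, untrained hidden units: relu(w * (x - c)) has its kink at x = c.
max_width = 200
w = randn(rng, max_width)               # hidden weights
c = 4 .* rand(rng, max_width) .- 2      # kink locations spread over [-2, 2]

# Hidden activations of the first k units, plus a constant column (output bias).
hidden(k) = hcat(relu.((x .- c[1:k]') .* w[1:k]'), ones(length(x)))

# Fit only the output weights by least squares and report the RMS error on the grid.
function fit_rmse(k)
    H = hidden(k)
    β = H \ y
    return sqrt(mean(abs2, H * β .- y))
end

widths = [5, 20, 50, 200]
errors = fit_rmse.(widths)
```

Because each wider network reuses the hidden units of the narrower ones, the fitting error on this grid cannot increase with width. That matches the spirit of the theorem, but it says nothing about generalization to new data.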

7.3 Code example: function approximation

Let’s use an MLP to approximate the function \(f(x) = \sin(2\pi x)\, e^{-x^2}\) from noisy samples.

using Lux, Random, Optimisers, Zygote, Statistics, Printf, CairoMakie

rng = Xoshiro(42)

# Generate training data
n = 200
x_data = Float32.(range(-2, 2, length = n))
f_true(x) = sin(2π * x) * exp(-x^2)
y_data = f_true.(x_data) .+ 0.05f0 .* randn(rng, Float32, n)

X = reshape(x_data, 1, :)
Y = reshape(y_data, 1, :)

# Train/test split
idx = randperm(rng, n)
n_train = round(Int, 0.8 * n)
tr = idx[1:n_train]
te = idx[n_train+1:end]

X_train, Y_train = X[:, tr], Y[:, tr]
X_test,  Y_test  = X[:, te], Y[:, te]

# Build a 2-hidden-layer MLP
model = Chain(
    Dense(1 => 32, relu),
    Dense(32 => 32, relu),
    Dense(32 => 1)
)

ps, st = Lux.setup(rng, model)

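# Mean-squared-error objective in the form used by the training step below:
# it returns the loss, the updated layer states, and an (empty) statistics container.
function mse_loss(model, ps, st, data)
    x, y = data
    ŷ, st_new = model(x, ps, st)
    loss = mean((ŷ .- y) .^ 2)
    return loss, st_new, ()
end

# Full-batch training with Adam; the optimiser is built from `lr` inside the function.
function train_model(model, ps, st, data; epochs = 1000, lr = 0.005f0)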
    tstate = Training.TrainState(model, ps, st, Adam(lr))
    for epoch in 1:epochs
        _, loss, _, tstate = Training.single_train_step!(
            AutoZygote(), mse_loss, data, tstate
        )
        if epoch == 1 || epoch % 200 == 0
            @printf "Epoch %4d  MSE = %.6f\n" epoch loss
        end
    end
    return tstate
end

tstate = train_model(model, ps, st, (X_train, Y_train))

# Holdout evaluation
Y_test_pred, _ = model(X_test, tstate.parameters, tstate.states)
test_mse = mean((Y_test_pred .- Y_test) .^ 2)
@printf "Holdout test MSE = %.6f\n" test_mse

Epoch    1  MSE = 10.610252
Epoch  200  MSE = 0.050187
Epoch  400  MSE = 0.027555
Epoch  600  MSE = 0.015669
Epoch  800  MSE = 0.007704
Epoch 1000  MSE = 0.004761
Holdout test MSE = 0.004281

# Plot the result
x_fine = Float32.(range(-2, 2, length = 500))
X_fine = reshape(x_fine, 1, :)
Y_pred, _ = model(X_fine, tstate.parameters, tstate.states)

fig = Figure(size = (600, 350))
ax = Axis(fig[1, 1], xlabel = "x", ylabel = "f(x)",
          title = "MLP function approximation")
scatter!(ax, x_data, y_data, markersize = 3, color = (:gray, 0.4),
         label = "Noisy data")
lines!(ax, x_fine, f_true.(x_fine), color = :black, linewidth = 2,
       label = "True function")
lines!(ax, x_fine, vec(Y_pred), color = :steelblue, linewidth = 2,
       linestyle = :dash, label = "MLP prediction")
axislegend(ax, position = :lt)
fig

Interpretation tip: use the holdout test MSE as the primary quality indicator. A smooth fit that looks good visually can still overfit noisy samples; a separate test set is your safeguard.

7.4 Effect of depth and width

The universal approximation theorem guarantees that a wide enough single hidden layer can approximate any continuous function on a compact set. In practice:

  • Wider layers (more neurons) increase capacity but can overfit.
  • Deeper networks (more layers) learn hierarchical features and often generalize better with fewer total parameters.
  • Very deep networks are harder to train due to vanishing gradients — residual connections (He et al., 2016) and normalization help.
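The vanishing-gradient remark can be made concrete with a short bound. This is a sketch under stated assumptions, not a full analysis: the pre-activation \(\mathbf{z}^{(l)}\), the diagonal matrix \(D^{(l)}\), and the choice of the logistic sigmoid for \(\sigma\) are introduced here for illustration. Writing \(\mathbf{z}^{(l)} = W^{(l)}\,\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\) and \(D^{(l)} = \operatorname{diag}\bigl(\sigma'(\mathbf{z}^{(l)})\bigr)\), the chain rule applied to the layer equation of Section 7.1 gives

\[ \frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(0)}} = D^{(L)} W^{(L)} \, D^{(L-1)} W^{(L-1)} \cdots D^{(1)} W^{(1)} \]

For the logistic sigmoid, \(\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}\), so \(\|D^{(l)}\|_2 \le \tfrac{1}{4}\) and

\[ \left\| \frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(0)}} \right\|_2 \le \left(\tfrac{1}{4}\right)^{L} \prod_{l=1}^{L} \bigl\| W^{(l)} \bigr\|_2 \]

The bound shrinks geometrically with depth unless the weight norms compensate. A residual layer of the form \(\mathbf{h}^{(l)} = \mathbf{h}^{(l-1)} + \sigma(\mathbf{z}^{(l)})\) changes each factor to \(I + D^{(l)} W^{(l)}\), which keeps an identity path for the gradient.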

A useful rule of thumb: start with 2–3 hidden layers of moderate width (32–128 neurons) and adjust based on performance.
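When comparing candidate shapes, it helps to count parameters before training anything. The helper below is a minimal sketch; `n_params` is a hypothetical name, and it assumes plain dense layers with one bias per neuron.

```julia
# Hypothetical helper: number of weights and biases in a dense network
# with the given layer sizes (input, hidden..., output).
n_params(sizes) = sum(sizes[i] * sizes[i+1] + sizes[i+1] for i in 1:length(sizes)-1)

deep = n_params([1, 32, 32, 1])   # two hidden layers of width 32: 1153 parameters
wide = n_params([1, 384, 1])      # one hidden layer of width 384: 1153 parameters
```

The two shapes spend the same parameter budget in different ways. The count alone does not say which one generalizes better; that still has to be checked on held-out data.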

7.5 When to use multilayer perceptrons

MLPs are the default starting point whenever:

  • The input is a fixed-size feature vector (e.g., geophysical measurements at a station).
  • There is no spatial or temporal structure that you want the architecture to exploit.
  • You need a simple, fast, interpretable baseline before trying more complex architectures.

For data with spatial structure (images, grids), convolutional networks are usually better. For sequential data (time series), recurrent networks or transformers are preferred. The MLP remains the building block inside most of these architectures.

7.6 Geoscience milestones

  • First-break refraction picking: McCormack et al. (1993) is one of the earliest applications of feedforward neural networks to seismic processing.
  • Petrophysical prediction from well logs: Huang et al. (1996) established MLPs as a workhorse for permeability and porosity regression in rock physics.
  • Lithology classification from downhole logs: Benaouda et al. (1999) is the canonical reference for MLP-based lithology inference.
  • Geoscience ML overview: Bergen et al. (2019) and Reichstein et al. (2019) place MLPs in the wider context of machine learning across the Earth sciences.

The MLP is rarely the final architecture for production geoscience workflows, but it is almost always the first model you should try. If an MLP solves the problem, the extra complexity of more specialized architectures is unnecessary.