7  Multilayer Perceptrons

Tip: Key references
  • Universal approximation — a multilayer perceptron with a single hidden layer can approximate any continuous function to arbitrary accuracy, given enough neurons (Cybenko, 1989; Hornik et al., 1989).
  • Backpropagation — the training algorithm that made multilayer networks practical (Rumelhart et al., 1986).
  • Deep learning review — comprehensive overview of multilayer perceptrons and deep architectures (LeCun et al., 2015).

A multilayer perceptron (MLP) is the standard dense feedforward network introduced in the previous chapter. Here the focus is on what that architecture can represent, how it is trained in practice, and when it is a good baseline choice.

7.1 Architecture

An MLP consists of:

  1. An input layer — one node per feature (not a computation layer, just the data entry point).
  2. One or more hidden layers — dense layers with activation functions.
  3. An output layer — produces the final prediction.

Every neuron in one layer is connected to every neuron in the next layer, which is why these are called fully connected or dense layers.

For an MLP with \(L\) hidden layers, the computation at layer \(l\) is:

\[ \mathbf{h}^{(l)} = \sigma\!\bigl(W^{(l)}\,\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\bigr) \]

where \(\mathbf{h}^{(0)} = \mathbf{x}\) is the input and the output of the final layer, \(\mathbf{h}^{(L+1)} = \hat{\mathbf{y}}\), is the prediction. The output layer typically uses the identity activation for regression or a softmax for classification.
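The layer recursion can be sketched directly in plain Julia. This is a minimal illustration with made-up layer sizes and random weights, using `relu` as the activation \(\sigma\):

```julia
using Random

# Layer sizes: 2 input features, two hidden layers of width 4, scalar output.
rng = Xoshiro(0)
sizes = [2, 4, 4, 1]

# Random weights W^(l) and zero biases b^(l) for each layer l.
Ws = [randn(rng, Float32, sizes[l+1], sizes[l]) for l in 1:length(sizes)-1]
bs = [zeros(Float32, sizes[l+1]) for l in 1:length(sizes)-1]

relu(z) = max(z, 0f0)

# h^(0) = x; then h^(l) = σ(W^(l) h^(l-1) + b^(l)), identity on the last layer.
function forward(x, Ws, bs)
    h = x
    for l in 1:length(Ws)
        z = Ws[l] * h .+ bs[l]
        h = l < length(Ws) ? relu.(z) : z   # no activation on the output layer
    end
    return h
end

ŷ = forward(Float32[0.5, -1.0], Ws, bs)   # a 1-element output vector
```

Frameworks like Lux implement exactly this recursion; a `Chain` of `Dense` layers is the same loop with the weights managed for you.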

7.2 Universal approximation

The universal approximation theorem (Cybenko, 1989; Hornik et al., 1989) states that a feedforward network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function on a compact set to arbitrary precision. This is a powerful existence result, but it does not tell you how many neurons you need or how to find the weights. In practice, deeper (more layers) but narrower MLPs tend to generalize better than a single massive hidden layer.

7.3 Code example: function approximation

Let’s use an MLP to approximate the function \(f(x) = \sin(2\pi x)\, e^{-x^2}\) from noisy samples.

using Lux, Random, Optimisers, Zygote, Statistics, Printf, CairoMakie

rng = Xoshiro(42)

# Generate training data
n = 200
x_data = Float32.(range(-2, 2, length = n))
f_true(x) = sin(2π * x) * exp(-x^2)
y_data = f_true.(x_data) .+ 0.05f0 .* randn(rng, Float32, n)

X = reshape(x_data, 1, :)
Y = reshape(y_data, 1, :)

# Train/test split
idx = randperm(rng, n)
n_train = Int(round(0.8 * n))
tr = idx[1:n_train]
te = idx[n_train+1:end]

X_train, Y_train = X[:, tr], Y[:, tr]
X_test,  Y_test  = X[:, te], Y[:, te]
# Build a 2-hidden-layer MLP
model = Chain(
    Dense(1 => 32, relu),
    Dense(32 => 32, relu),
    Dense(32 => 1)
)

ps, st = Lux.setup(rng, model)

function mse_loss(model, ps, st, data)
    x, y = data
    ŷ, st_new = model(x, ps, st)
    loss = mean((ŷ .- y) .^ 2)
    return loss, st_new, ()
end
function train_model(model, ps, st, data; epochs = 1000, lr = 0.005f0)
    tstate = Training.TrainState(model, ps, st, Adam(lr))
    for epoch in 1:epochs
        _, loss, _, tstate = Training.single_train_step!(
            AutoZygote(), mse_loss, data, tstate
        )
        if epoch == 1 || epoch % 200 == 0
            @printf "Epoch %4d  MSE = %.6f\n" epoch loss
        end
    end
    return tstate
end

tstate = train_model(model, ps, st, (X_train, Y_train))

# Holdout evaluation
Y_test_pred, _ = model(X_test, tstate.parameters, tstate.states)
test_mse = mean((Y_test_pred .- Y_test) .^ 2)
@printf "Holdout test MSE = %.6f\n" test_mse
Epoch    1  MSE = 10.610252
Epoch  200  MSE = 0.050187
Epoch  400  MSE = 0.027555
Epoch  600  MSE = 0.015669
Epoch  800  MSE = 0.007704
Epoch 1000  MSE = 0.004761
Holdout test MSE = 0.004282
# Plot the result
x_fine = Float32.(range(-2, 2, length = 500))
X_fine = reshape(x_fine, 1, :)
Y_pred, _ = model(X_fine, tstate.parameters, tstate.states)

fig = Figure(size = (600, 350))
ax = Axis(fig[1, 1], xlabel = "x", ylabel = "f(x)",
       title = "MLP function approximation")
scatter!(ax, x_data, y_data, markersize = 3, color = (:gray, 0.4),
         label = "Noisy data")
lines!(ax, x_fine, f_true.(x_fine), color = :black, linewidth = 2,
       label = "True function")
lines!(ax, x_fine, vec(Y_pred), color = :steelblue, linewidth = 2,
    linestyle = :dash, label = "MLP prediction")
axislegend(ax, position = :lt)
fig

Interpretation tip: use the holdout test MSE as the primary quality indicator. A smooth fit that looks good visually can still overfit noisy samples; a separate test set is your safeguard.

7.4 Effect of depth and width

The universal approximation theorem guarantees that a wide enough single hidden layer can approximate any function. In practice:

  • Wider layers (more neurons) increase capacity but can overfit.
  • Deeper networks (more layers) learn hierarchical features and often generalize better with fewer total parameters.
  • Very deep networks are harder to train due to vanishing gradients — residual connections (He et al., 2016) and normalization help.

A useful rule of thumb: start with 2–3 hidden layers of moderate width (32–128 neurons) and adjust based on performance.
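One way to make the width-versus-depth trade-off concrete is to count parameters: a dense layer mapping \(n_\text{in}\) inputs to \(n_\text{out}\) outputs has \(n_\text{out} \times n_\text{in}\) weights plus \(n_\text{out}\) biases. The sketch below (plain Julia, illustrative sizes only) compares one very wide hidden layer against three moderate ones:

```julia
# Parameters in a dense layer: one weight per input-output pair, plus one bias per output.
dense_params(n_in, n_out) = n_out * n_in + n_out

# Total parameters for an MLP given its layer sizes [input, hidden..., output].
mlp_params(sizes) = sum(dense_params(sizes[l], sizes[l+1]) for l in 1:length(sizes)-1)

# One wide hidden layer vs. three moderate ones, both with a 10-feature input.
wide = mlp_params([10, 1024, 1])        # (10*1024 + 1024) + (1024*1 + 1) = 12289
deep = mlp_params([10, 64, 64, 64, 1])  # 704 + 4160 + 4160 + 65 = 9089
println("wide: $wide  deep: $deep")
```

Here the deeper network has fewer parameters than the single wide layer, consistent with the observation that depth often buys capacity more cheaply than width.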

7.5 When to use multilayer perceptrons

MLPs are the default starting point whenever:

  • The input is a fixed-size feature vector (e.g., geophysical measurements at a station).
  • There is no spatial or temporal structure that you want the architecture to exploit.
  • You need a simple, fast, interpretable baseline before trying more complex architectures.

For data with spatial structure (images, grids), convolutional networks are usually better. For sequential data (time series), recurrent networks or transformers are preferred. The MLP remains the building block inside most of these architectures.

7.6 Geoscience applications

Multilayer perceptrons have been widely used in geoscience as flexible function approximators:

  • Seismic picking and trace editing — early feedforward networks were used for first-break refraction picking and seismic trace cleaning (McCormack et al., 1993).
  • Velocity analysis and moveout correction — MLPs were applied to automate NMO correction and velocity estimation in seismic processing (Calderón-Macías et al., 1998).
  • Reservoir characterization — feedforward networks were used for reservoir and seismic characterization from waveform-derived attributes (An et al., 2001; An & Moon, 2005).
  • Lithology classification — borehole lithology inference from downhole logs was an early and influential classification use case (Benaouda et al., 1999).
  • Petrophysical prediction — porosity and permeability estimation from well logs is a classic MLP regression task in rock physics and reservoir studies (Huang et al., 1996; Huang & Williamson, 1997).
  • Thermal-property estimation — MLPs were also used to predict thermal conductivity from geophysical well logs (Goutorbe et al., 2006).
  • Geophysical inversion — MLPs can serve as surrogate forward models, mapping model parameters to predicted data. Once trained, they replace expensive physics-based simulations and can be embedded inside iterative inversion schemes (Lopez-Alvis et al., 2019).
  • Overviews — Bergen et al. (2019) and Reichstein et al. (2019) provide broad reviews of machine learning across the geosciences, including many regression and classification problems for which MLPs are natural baselines.
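To make the surrogate-model idea concrete, the sketch below stands in for a trained MLP with a simple closure `surrogate` (purely illustrative; in practice this would be the trained network's forward pass) and embeds it in a brute-force search over a single model parameter:

```julia
# Stand-in for a trained MLP surrogate: maps a model parameter m to predicted data.
# (Illustrative closure only; in practice this is the trained network.)
surrogate(m) = [sin(m), cos(m), m^2]

# Observed data generated from a "true" parameter, using the surrogate as forward model.
m_true = 0.7
d_obs = surrogate(m_true)

# Data misfit: sum of squared residuals between prediction and observations.
misfit(m) = sum(abs2, surrogate(m) .- d_obs)

# Cheap surrogate evaluations make exhaustive search (or gradient descent) affordable.
grid = range(-2, 2, length = 4001)
m_best = argmin(misfit, grid)
println("recovered m ≈ $m_best")
```

The point is not the toy search itself but the cost structure: once the network is trained, each forward evaluation is nearly free, so many candidate models can be screened where a physics-based simulation would be prohibitive.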

The MLP is rarely the final architecture for production geoscience workflows, but it is almost always the first model you should try. If an MLP solves the problem, the extra complexity of deeper architectures is unnecessary.