Universal approximation — a multilayer perceptron with a single hidden layer can approximate any continuous function to arbitrary accuracy, given enough neurons (Cybenko, 1989; Hornik et al., 1989).
Backpropagation — the training algorithm that made multilayer networks practical (Rumelhart et al., 1986).
Deep learning review — comprehensive overview of multilayer perceptrons and deep architectures (LeCun et al., 2015).
A multilayer perceptron (MLP) is the standard dense feedforward network introduced in the previous chapter. Here the focus is on what that architecture can represent, how it is trained in practice, and when it is a good baseline choice.
7.1 Architecture
An MLP consists of:
An input layer — one node per feature (not a computation layer, just the data entry point).
One or more hidden layers — dense layers with activation functions.
An output layer — produces the final prediction.
Every neuron in one layer is connected to every neuron in the next layer, which is why these are called fully connected or dense layers.
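For an MLP with \(L\) hidden layers, the computation at layer \(l\) is:
\[\mathbf{h}^{(l)} = \sigma\!\left(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad l = 1, \dots, L+1,\]
where \(\mathbf{W}^{(l)}\) and \(\mathbf{b}^{(l)}\) are the weight matrix and bias vector of layer \(l\), \(\sigma\) is the activation function (which may differ between layers and is often the identity at the output for regression), \(\mathbf{h}^{(0)} = \mathbf{x}\) is the input, and the final layer output \(\mathbf{h}^{(L+1)} = \hat{\mathbf{y}}\) is the prediction.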
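The layer-by-layer computation can be sketched in a few lines of plain Julia, assuming the usual affine-plus-activation form for each dense layer. The names `DenseLayer`, `init_mlp`, and `forward` are hypothetical helpers introduced for illustration only; they are not part of Lux or any other library, and the initialization scale and linear output layer are assumptions.

```julia
using Random

# One dense layer: weight matrix W (out × in) and bias vector b (length out).
struct DenseLayer
    W::Matrix{Float64}
    b::Vector{Float64}
end

# Hypothetical helper: build layers for sizes = [n_in, hidden..., n_out]
# with small random weights (assumed scale 0.1) and zero biases.
function init_mlp(rng::AbstractRNG, sizes::Vector{Int})
    return [DenseLayer(0.1 .* randn(rng, sizes[l + 1], sizes[l]), zeros(sizes[l + 1]))
            for l in 1:(length(sizes) - 1)]
end

# Forward pass: start from h = x, then apply each layer in turn.
# Hidden layers use the activation; the output layer is left linear
# (an assumption that suits regression).
function forward(layers::Vector{DenseLayer}, x::AbstractVector; activation = tanh)
    h = x
    for (l, layer) in enumerate(layers)
        z = layer.W * h .+ layer.b
        h = l < length(layers) ? activation.(z) : z
    end
    return h
end

rng = MersenneTwister(0)
mlp = init_mlp(rng, [3, 16, 16, 2])    # 3 features, 2 hidden layers, 2 outputs
y_hat = forward(mlp, [0.5, -1.0, 2.0])
println("output length: ", length(y_hat))
```

With two hidden layers the list holds three `DenseLayer` values, since the output layer is also a dense layer; only the input "layer" carries no parameters.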
7.2 Universal approximation
The universal approximation theorem (Cybenko, 1989; Hornik et al., 1989) states that a feedforward network with a single hidden layer, a suitable nonlinear activation (for example a sigmoid), and a sufficient number of neurons can approximate any continuous function on a compact set to arbitrary precision. This is a powerful existence result, but it does not tell you how many neurons you need or how to find the weights. In practice, deeper (more layers) but narrower MLPs often generalize better than a single massive hidden layer.
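To get a feel for the "enough neurons" part of the statement, the sketch below compares a very narrow and a much wider single hidden layer on one target function. It takes a shortcut that should be named plainly: the hidden weights are drawn at random and only the linear output weights are fitted by least squares (a random-feature fit), which is not how an MLP is normally trained and is not the construction used in the theorem's proofs. The function `fit_error`, the weight scale of 6, the grid size, and the target function are all assumptions chosen for illustration.

```julia
using Random, LinearAlgebra

f(x) = sin(2π * x) * exp(-x^2)

# Single hidden layer of `width` tanh units with fixed random weights.
# Only the output weights are fitted, by linear least squares.
# Returns the maximum absolute error on the fitting grid.
function fit_error(rng::AbstractRNG, width::Int; n::Int = 400)
    x = collect(range(-2, 2, length = n))
    w = 6 .* randn(rng, width)      # hidden weights (assumed scale)
    b = 6 .* randn(rng, width)      # hidden biases (assumed scale)
    H = tanh.(x * w' .+ b')         # n × width matrix of hidden activations
    A = hcat(H, ones(n))            # extra column for the output bias
    c = A \ f.(x)                   # least-squares output weights
    return maximum(abs.(A * c .- f.(x)))
end

rng = MersenneTwister(0)
err_narrow = fit_error(rng, 2)
err_wide = fit_error(rng, 200)
println("max error, width 2:   ", err_narrow)
println("max error, width 200: ", err_wide)
```

The error is measured on the same grid that was used for the fit, so it speaks only to what the hidden layer can represent, not to generalization from noisy data.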
7.3 Code example: function approximation
Let’s use an MLP to approximate the function \(f(x) = \sin(2\pi x)\, e^{-x^2}\) from noisy samples.
# Plot the result
x_fine = Float32.(range(-2, 2, length = 500))
X_fine = reshape(x_fine, 1, :)
Y_pred, _ = model(X_fine, tstate.parameters, tstate.states)

fig = Figure(size = (600, 350))
ax = Axis(fig[1, 1], xlabel = "x", ylabel = "f(x)", title = "MLP function approximation")
scatter!(ax, x_data, y_data, markersize = 3, color = (:gray, 0.4), label = "Noisy data")
lines!(ax, x_fine, f_true.(x_fine), color = :black, linewidth = 2, label = "True function")
lines!(ax, x_fine, vec(Y_pred), color = :steelblue, linewidth = 2, linestyle = :dash, label = "MLP prediction")
axislegend(ax, position = :lt)
fig
Interpretation tip: use the holdout test MSE as the primary quality indicator. A smooth fit that looks good visually can still overfit noisy samples; a separate test set is your safeguard.
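The role of the holdout set can be illustrated without any neural network at all. The sketch below generates noisy samples of the same target function, holds out part of them, and evaluates a deliberately overfit stand-in model: a nearest-neighbour lookup that memorizes the training targets. The sample count, noise level of 0.1, 20% split, and the `memorizer` function are assumptions for illustration and are not taken from the chapter's training code.

```julia
using Random

f_true(x) = sin(2π * x) * exp(-x^2)
mse(y_pred, y) = sum(abs2, y_pred .- y) / length(y)

rng = MersenneTwister(1)
n = 200
x_data = 4 .* rand(rng, n) .- 2                     # samples in [-2, 2]
y_data = f_true.(x_data) .+ 0.1 .* randn(rng, n)    # assumed noise level

# Shuffle the indices and hold out 20% as a test set (assumed ratio).
idx = randperm(rng, n)
n_test = div(n, 5)
test_idx = idx[1:n_test]
train_idx = idx[(n_test + 1):end]
x_train, y_train = x_data[train_idx], y_data[train_idx]
x_test, y_test = x_data[test_idx], y_data[test_idx]

# Stand-in for an overfit model: return the target of the nearest
# training point, which reproduces the training noise exactly.
memorizer(x) = y_train[argmin(abs.(x_train .- x))]

train_mse = mse(memorizer.(x_train), y_train)
test_mse = mse(memorizer.(x_test), y_test)
println("train MSE = ", train_mse, ", test MSE = ", test_mse)
```

The training error of the memorizer is zero by construction, so only the test error reveals that it has fitted the noise. The same comparison applies to a trained MLP: a gap between training and test MSE is the signal to watch.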
7.4 Effect of depth and width
The universal approximation theorem guarantees that a wide enough single hidden layer can approximate any continuous function on a compact set. In practice:
Wider layers (more neurons) increase capacity but can overfit.
Deeper networks (more layers) learn hierarchical features and often generalize better with fewer total parameters.
Very deep networks are harder to train due to vanishing gradients — residual connections (He et al., 2016) and normalization help.
A useful rule of thumb: start with 2–3 hidden layers of moderate width (32–128 neurons) and adjust based on performance.
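When comparing wide and deep configurations it helps to count parameters explicitly. The sketch below does this for dense layers, where each layer contributes one weight per input-output pair plus one bias per output. The helper `count_params` and the two example configurations (10 input features, one output) are hypothetical choices for illustration; the count says nothing by itself about which network generalizes better.

```julia
# Trainable parameters of a dense MLP with layer sizes [n_in, hidden..., n_out]:
# each layer has (inputs × outputs) weights plus one bias per output.
count_params(sizes) =
    sum(sizes[l] * sizes[l + 1] + sizes[l + 1] for l in 1:(length(sizes) - 1))

wide = count_params([10, 1024, 1])          # one hidden layer of 1024 neurons
deep = count_params([10, 64, 64, 64, 1])    # three hidden layers of 64 neurons
println("wide: ", wide, "  deep: ", deep)   # prints "wide: 12289  deep: 9089"
```

In this example the three-layer network has fewer parameters than the single wide layer, which is the kind of trade-off to check before scaling either dimension.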
7.5 When to use multilayer perceptrons
MLPs are the default starting point whenever:
The input is a fixed-size feature vector (e.g., geophysical measurements at a station).
There is no spatial or temporal structure that you want the architecture to exploit.
You need a simple, fast, interpretable baseline before trying more complex architectures.
For data with spatial structure (images, grids), convolutional networks are usually better. For sequential data (time series), recurrent networks or transformers are preferred. The MLP remains the building block inside most of these architectures.
7.6 Geoscience milestones
First-break refraction picking — McCormack et al. (1993) is one of the earliest applications of feedforward neural networks to seismic processing.
Petrophysical prediction from well logs — Huang et al. (1996) established MLPs as a workhorse for permeability and porosity regression in rock physics.
Lithology classification from downhole logs — Benaouda et al. (1999) is the canonical reference for MLP-based lithology inference.
Geoscience ML overview — Bergen et al. (2019) and Reichstein et al. (2019) place MLPs in the wider context of machine learning across the Earth sciences.
The MLP is rarely the final architecture for production geoscience workflows, but it is almost always the first model you should try. If an MLP solves the problem, the extra complexity of more specialized architectures is unnecessary.
Benaouda, D., Wadge, G., Whitmarsh, R. B., Rothwell, R. G., & MacLeod, C. (1999). Inferring the lithology of borehole rocks by applying neural network classifiers to downhole logs: An example from the Ocean Drilling Program. Geophysical Journal International, 136(2), 477–491. https://doi.org/10.1046/j.1365-246X.1999.00746.x
Bergen, K. J., Johnson, P. A., de Hoop, M. V., & Beroza, G. C. (2019). Machine learning for data-driven discovery in solid earth geoscience. Science, 363(6433). https://doi.org/10.1126/science.aau0323
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. https://doi.org/10.1007/BF02551274
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. https://doi.org/10.1109/CVPR.2016.90
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366. https://doi.org/10.1016/0893-6080(89)90020-8
Huang, Z., Shimeld, J., Williamson, M., & Katsube, J. (1996). Permeability prediction with artificial neural network modeling in the Venture gas field, offshore eastern Canada. Geophysics, 61(2), 422–436. https://doi.org/10.1190/1.1443970
McCormack, M. D., Zaucha, D. E., & Dushek, D. W. (1993). First-break refraction event picking and seismic data trace editing using neural networks. Geophysics, 58(1). https://doi.org/10.1190/1.1443352
Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., & Prabhat. (2019). Deep learning and process understanding for data-driven earth system science. Nature, 566, 195–204. https://doi.org/10.1038/s41586-019-0912-1
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. https://doi.org/10.1038/323533a0