#Adaline #Multilayer #Neural #Networks #Pan #Cretan #Jan

## Setting the foundations right

In the previous two articles we saw how we can implement a basic classifier based on Rosenblatt’s perceptron and how this classifier can be improved by using the adaptive linear neuron algorithm (adaline). These two articles cover the foundations before attempting to implement an artificial neural network with many layers. Moving from adaline to deep learning is a bigger leap and many machine learning practitioners will opt directly for an open source library like PyTorch. Using such a specialised machine learning library is of course recommended for developing a model in production, but not necessarily for learning the fundamental concepts of multilayer neural networks. This article builds a multilayer neural network from scratch. Instead of solving a binary classification problem we will focus on a multiclass one. We will be using the sigmoid activation function after each layer, including the output one. Essentially we train a model that for each input, comprising a vector of features, produces a vector with length equal to the number of classes to be predicted. Each element of the output vector is in the range [0, 1] and can be understood as the “probability” of each class.

The purpose of the article is to become comfortable with the mathematical notation used to describe neural networks, understand the role of the various weight and bias matrices, and derive the formulas for updating the weights and biases to minimise the loss function. The implementation allows for any number of hidden layers with arbitrary dimensions. Most tutorials assume a fixed architecture, but this article uses a carefully chosen mathematical notation that supports generalisation. In this way we can also run simple numerical experiments to examine the predictive performance as a function of the number and size of the hidden layers.

As in the earlier articles, I used the online LaTeX equation editor to develop the LaTeX code for the equations and then the Chrome plugin Maths Equations Anywhere to render each equation into an image. All LaTeX code is provided at the end of the article if you need to render it again. Getting the notation right is part of the journey in machine learning, and essential for understanding neural networks. It is vital to scrutinise the formulas, and pay attention to the various indices and the rules for matrix multiplication. Implementation in code becomes trivial once the model is correctly formulated on paper.

All code used in the article can be found in the accompanying repository. The article covers the following topics:

∘ What is a multilayer neural network?

∘ Activation

∘ Loss function

∘ Backpropagation

∘ Implementation

∘ Dataset

∘ Training the model

∘ Hyperparameter tuning

∘ Conclusions

∘ LaTeX code of equations used in the article

## What is a multilayer neural network?

This section introduces the architecture of a generalised, feedforward, fully-connected multilayer neural network. There are a lot of terms to go through here as we work our way through Figure 1 below.

For every prediction, the network accepts a vector of features as input

that can also be understood as a matrix with shape (1, n⁰). The network uses L layers and produces a vector as an output

that can be understood as a matrix with shape (1, nᴸ) where nᴸ is the number of classes in the multiclass classification problem we need to solve. Every float in this matrix lies in the range [0, 1] and the index of the largest element corresponds to the predicted class. The (L) notation in the superscript is used to refer to a particular layer, in this case the last one.

But how do we generate this prediction? Let’s focus on the first element of the first layer (the input is not considered a layer)

We first compute the net input that is essentially an inner product of the input vector with a set of weights with the addition of a bias term. The second operation is the application of the activation function σ(z) to which we will return later. For now it is important to keep in mind that the activation function is essentially a scalar operation.
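To make the two operations concrete, here is a minimal sketch of a single node in plain Python. The input, weights and bias are hypothetical values chosen only for illustration, and the function names are my own; the logistic function used here is the activation discussed later in the article.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b):
    """Return the net input (inner product plus bias) and its activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, sigmoid(z)


# Hypothetical values for a node that receives n0 = 3 features.
x = [0.5, -1.0, 2.0]
w = [0.1, 0.2, 0.3]
b = -0.45

z, a = neuron_output(x, w, b)
print(z, a)  # z is (up to rounding) zero, so a is very close to 0.5
```

The activation acts on the single number z, which is why it can later be applied element by element to a whole layer.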

We can compute all elements of the first layer in the same way

From the above we can deduce that we have introduced n¹ x n⁰ weights and n¹ bias terms that will need to be fitted when the model is trained. These calculations can also be expressed in matrix form

Pay close attention to the shape of the matrices. The net input is the result of multiplying two matrices with shape (1, n⁰) and (n⁰, n¹), which gives a matrix with shape (1, n¹), to which we add another matrix with the bias terms that has the same (1, n¹) shape. Note that we introduced the transpose of the weight matrix: the weight matrix itself has shape (n¹, n⁰). The activation function applies to every element of this matrix and hence the activated values of layer 1 are also a matrix with shape (1, n¹).
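The shapes can be checked with a few lines of NumPy. This is only a sketch: the layer sizes, the random initialisation and the variable names (`W1`, `b1`, `z1`, `a1`) are assumptions made for illustration, not the article's implementation.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
n0, n1 = 4, 3                      # hypothetical sizes of the input and of layer 1

x = rng.normal(size=(1, n0))       # one sample as a row matrix, shape (1, n0)
W1 = rng.normal(size=(n1, n0))     # weights stored as (n1, n0), hence the transpose below
b1 = np.zeros((1, n1))             # one bias per node of layer 1, shape (1, n1)

z1 = x @ W1.T + b1                 # (1, n0) @ (n0, n1) + (1, n1) -> (1, n1)
a1 = sigmoid(z1)                   # activation keeps the shape (1, n1)

print(z1.shape, a1.shape)          # (1, 3) (1, 3)
```

Each column of `z1` is the inner product of the input with one row of `W1` plus the matching bias, i.e. the scalar calculation repeated for every node of the layer.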

The above can be readily generalised for every layer in the neural network. Layer k accepts as input nᵏ⁻¹ values and produces nᵏ activated values
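Written out, and assuming the symbols z for the net input, a for the activated values, W for the weight matrix and b for the bias terms (the symbol names are my assumption, chosen to match the shapes discussed above), the relations for layer k can be sketched as:

```latex
\mathbf{z}^{(k)} = \mathbf{a}^{(k-1)} \left( \mathbf{W}^{(k)} \right)^{T} + \mathbf{b}^{(k)},
\qquad
\mathbf{a}^{(k)} = \sigma\left( \mathbf{z}^{(k)} \right),
\qquad
k = 1, \dots, L,
\qquad
\mathbf{a}^{(0)} = \mathbf{x}
```

Here the input to layer k has shape (1, nᵏ⁻¹), the weight matrix has shape (nᵏ, nᵏ⁻¹), and both the bias terms and the activated values have shape (1, nᵏ).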

Layer k introduces nᵏ x nᵏ⁻¹ weights and nᵏ bias terms that will need to be fitted when the model is trained. The total number of weights and bias terms is

so if we assume an input vector with 784 elements (the dimension of a low-resolution grayscale image), a single hidden layer with 50 nodes and 10 classes in the output, we need to optimise (784 + 1) × 50 + (50 + 1) × 10 = 39,760 parameters. The number of parameters grows further if we increase the number of hidden layers and the number of nodes in these layers. Optimising an objective function with so many parameters is not a trivial undertaking, which is why it took some time after adaline was introduced until we discovered how to train deep networks in the mid 1980s.
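The count is easy to reproduce for any architecture. The helper below is a small sketch (the function name is hypothetical) that takes the layer sizes, starting with the input dimension, and adds up the weights and bias terms of every layer.

```python
def count_parameters(layer_sizes):
    """Total number of weights and biases of a fully-connected network.

    layer_sizes lists n0, n1, ..., nL. Layer k contributes
    n_k * n_{k-1} weights and n_k biases, i.e. (n_{k-1} + 1) * n_k parameters.
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


print(count_parameters([784, 50, 10]))  # 39760
```

Adding a second hidden layer, for example `[784, 50, 50, 10]`, only requires extending the list.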

This section essentially covers what is known as the forward pass, i.e. how we apply a series of matrix multiplications, matrix additions and element-wise activations to convert the input vector to an output vector. So far we have assumed that the input was a single sample represented as a matrix with shape (1, n⁰). The notation holds even if we feed into the network a batch of samples represented as a matrix with shape (N, n⁰). The only small complication concerns the bias terms. If we focus on the first layer, we add a matrix with shape (N, n¹) to a bias matrix with shape (1, n¹). For this to work, the single row of the bias matrix is replicated as many times as the number of samples in the batch we use in the forward pass. This is such a natural operation that NumPy does it automatically in what is called broadcasting. When we apply the forward pass to a batch of inputs it is perhaps cleaner to use capital letters for all vectors that become matrices, i.e.

Note that I assumed that broadcasting was applied to the bias terms leading to a matrix with as many rows as the number of samples in the batch.
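The batched forward pass can be sketched in NumPy as follows. The layer sizes, the random weights and the function names are assumptions for illustration only; the point is that the same expression works for one sample or for N samples, with the (1, nᵏ) bias row broadcast over the rows of the batch.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, weights, biases):
    """Propagate a batch X of shape (N, n0) through all layers."""
    A = X
    for W, b in zip(weights, biases):
        Z = A @ W.T + b   # (N, n_prev) @ (n_prev, n_k); b of shape (1, n_k) is broadcast
        A = sigmoid(Z)    # activated values, shape (N, n_k)
    return A


rng = np.random.default_rng(42)
layer_sizes = [6, 5, 4, 3]  # hypothetical: 6 features, two hidden layers, 3 classes

weights = [
    rng.normal(size=(n_out, n_in))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros((1, n_out)) for n_out in layer_sizes[1:]]

X = rng.normal(size=(8, layer_sizes[0]))  # batch of N = 8 samples
A_out = forward(X, weights, biases)       # shape (8, 3), values in (0, 1)
predicted = A_out.argmax(axis=1)          # index of the largest output per sample

print(A_out.shape, predicted.shape)       # (8, 3) (8,)
```

Passing a single row, `forward(X[:1], weights, biases)`, gives the same values as the first row of `A_out`, which is a quick way to confirm that batching does not change the result.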

Operating with batches is typical with deep neural networks. As the number of samples N increases, we need more memory to store the various matrices and carry out the matrix multiplications, so limiting the batch size keeps the memory requirements in check. In addition, using only part of the training set for each update means we will be updating the parameters several times in each pass over the training set (epoch), leading to faster convergence. There is an additional benefit that is perhaps less obvious. The network uses activation functions that, unlike the activation in adaline, are not the identity. In fact they are not even linear, which makes the loss function non-convex. Using batches introduces noise that is believed to help escape shallow local minima. A suitably chosen learning rate further assists with this.
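One common way to form the batches is sketched below. Shuffling the samples at the start of every epoch is an assumption on my part (it is a widespread practice, not something stated above), and the generator name is hypothetical.

```python
import numpy as np


def iterate_minibatches(n_samples, batch_size, rng):
    """Yield arrays of sample indices that together cover one epoch."""
    indices = rng.permutation(n_samples)  # assumed: reshuffle at every epoch
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


rng = np.random.default_rng(0)
batches = list(iterate_minibatches(n_samples=10, batch_size=4, rng=rng))

# 10 samples with a batch size of 4 give batches of 4, 4 and 2 samples,
# i.e. three parameter updates per epoch instead of one.
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Each index array can be used to slice the feature matrix and the labels before running the forward pass on that batch.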

As a final note before we move on, the term feedforward comes from the fact that each layer uses as input the output of the previous layer, without any loops. Networks that do contain such loops are the so-called recurrent neural networks.

## Activation

Enabling the neural network to solve complex problems requires introducing some form of nonlinearity. This is achieved by using an activation function in each layer. There are many choices. For this article we will be using the sigmoid (logistic) activation function that we can visualise with