# Multilayer Perceptron

This feature is in the alpha tier. For more information on feature tiers, see API Tiers.

A Multilayer Perceptron (MLP) is a type of feed-forward neural network. It consists of multiple layers of connected neurons. The value of a neuron is computed by applying an activation function to the aggregated weighted inputs from the previous layer. For classification, the size of the output layer is based on the number of classes. To optimize the weights of the network, GDS uses gradient descent with a Cross Entropy Loss.
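As an illustration of the computation described above, the following is a minimal sketch of a forward pass through a small MLP followed by a cross-entropy loss. It is not the GDS implementation. The ReLU activation on hidden layers, the softmax on the output layer, and all function names and weight values are assumptions made for the example.

```python
import math


def relu(values):
    # Assumed hidden-layer activation; the text does not name one.
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    # weights[j] holds the incoming weights of neuron j in this layer.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    # layers is a list of (weights, biases) pairs; the last pair is the output layer.
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


def cross_entropy(probabilities, true_class):
    return -math.log(probabilities[true_class])


# Hypothetical network: 2 inputs, one hidden layer with 3 neurons, 2 classes.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.1, 0.4], [-0.5, 0.3, 0.1]], [0.0, 0.0]),
]
probabilities = forward([1.0, 2.0], layers)
loss = cross_entropy(probabilities, true_class=1)
```

The output layer has one neuron per class, so `probabilities` has one entry per class and its entries sum to one.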

## Tuning the hyperparameters

To balance matters such as the bias and variance of the model, and the speed and memory consumption of the training, GDS exposes several hyperparameters that one can tune. Each of these is described below.

In gradient-descent-based training, we try to find the best weights for our model. In each epoch, we process all training examples to compute the loss and the gradient of the weights. These gradients are then used to update the weights. For the update, we use the Adam optimizer as described in https://arxiv.org/pdf/1412.6980.pdf.

Statistics about the training are reported in the Neo4j debug log.

### Max Epochs

This parameter defines the maximum number of epochs for the training. Independent of the model’s quality, the training will terminate after this many epochs. Note that the training can also stop earlier if the loss has converged (see Patience and Tolerance).

Setting this parameter can be useful to limit the training time for a model. Restricting the computational budget can serve the purpose of regularization and mitigate overfitting, which becomes a risk with a large number of epochs.

### Min Epochs

This parameter defines the minimum number of epochs for the training. Independent of the model’s quality, the training will at least run this many epochs.

Setting this parameter can be useful to avoid early stopping, but also increases the minimal training time of a model.

### Patience

This parameter defines the maximum number of unproductive consecutive epochs.
An epoch is unproductive if it does not improve the training loss by at least a `tolerance` fraction of the current loss.

Assuming the training ran for `minEpochs`, this parameter defines when the training converges.

Setting this parameter can lead to a more robust training and avoid early termination similar to `minEpochs`.
However, a high patience can result in running more epochs than necessary.

In our experience, reasonable values for `patience` are in the range `1` to `3`.

### Tolerance

This parameter defines when an epoch is considered unproductive and, together with `patience`, defines the convergence criteria for the training.
An epoch is unproductive if it does not improve the training loss by at least a `tolerance` fraction of the current loss.

A lower tolerance makes the training more sensitive to small improvements, so it is more likely to train longer. A higher tolerance makes the training less sensitive, so more epochs are counted as unproductive.
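The interplay of `maxEpochs`, `minEpochs`, `patience` and `tolerance` can be sketched as a simple stopping rule. This is a hedged illustration, not the GDS code: the function name is hypothetical, the improvement is measured relative to the previous epoch's loss, and training is assumed to stop as soon as the number of consecutive unproductive epochs reaches `patience`.

```python
def epochs_until_stop(losses, min_epochs, max_epochs, patience, tolerance):
    """Return how many epochs would run for a given per-epoch loss sequence."""
    unproductive = 0
    previous = None
    epochs_run = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        epochs_run = epoch
        if previous is not None:
            improvement = previous - loss
            # Assumption: an epoch is productive only if it improves the loss
            # by at least a `tolerance` fraction of the previous loss.
            if improvement < tolerance * abs(previous):
                unproductive += 1
            else:
                unproductive = 0
        previous = loss
        # Convergence is only checked once `min_epochs` epochs have run.
        if epoch >= min_epochs and unproductive >= patience:
            break
    return epochs_run


losses = [1.0, 0.5, 0.499, 0.4989, 0.4988, 0.1]
stopped_early = epochs_until_stop(losses, min_epochs=1, max_epochs=10,
                                  patience=2, tolerance=0.01)
forced_longer = epochs_until_stop(losses, min_epochs=5, max_epochs=10,
                                  patience=2, tolerance=0.01)
capped = epochs_until_stop(losses, min_epochs=1, max_epochs=3,
                           patience=2, tolerance=0.01)
```

With these hypothetical losses, the first call stops after two consecutive unproductive epochs, the second is held back by `min_epochs`, and the third is cut off by `max_epochs` before convergence is detected.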

### Learning rate

When updating the weights, we move in the direction dictated by the Adam optimizer based on the loss function’s gradients.
You can configure how far we move per weight update via the `learningRate` parameter.
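To show where the learning rate enters the update, here is a minimal sketch of one Adam step for a single weight, following the update rule in the Adam paper. The function name is hypothetical, and the values for `beta1`, `beta2` and `epsilon` are the defaults suggested in the paper; whether GDS uses the same values is an assumption.

```python
import math


def adam_step(weight, gradient, m, v, step, learning_rate,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to a single weight and return the new state."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * gradient
    v = beta2 * v + (1.0 - beta2) * gradient * gradient
    # Bias correction for the zero-initialized moving averages.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # The learning rate scales how far the weight moves in this update.
    weight = weight - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return weight, m, v


slow, _, _ = adam_step(1.0, gradient=0.5, m=0.0, v=0.0, step=1, learning_rate=0.001)
fast, _, _ = adam_step(1.0, gradient=0.5, m=0.0, v=0.0, step=1, learning_rate=0.01)
```

In the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly `learning_rate` regardless of the gradient's magnitude.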

### Batch size

This parameter defines how many training examples are grouped in a single batch.

The gradients are computed concurrently on the batches using `concurrency` many threads.
At the end of an epoch the gradients are summed and scaled before updating the weights.
The `batchSize` does not affect the model quality, but can be used to tune for training speed.
A larger `batchSize` increases the memory consumption of the computation.
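The following sketch illustrates why the batch size does not change the result of an epoch: per-batch gradients are computed in parallel and then summed, so the total is the same however the examples are grouped. It is a toy under stated assumptions: a one-weight model with a squared-error loss stands in for the real network and loss, the function names are hypothetical, and the sum is assumed to be scaled by the number of examples.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(examples, batch_size):
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]


def batch_gradient(batch, weight):
    # Gradient of 0.5 * (weight * x - y)^2 with respect to weight, summed over the batch.
    return sum((weight * x - y) * x for x, y in batch)


def epoch_gradient(examples, weight, batch_size, concurrency):
    batches = make_batches(examples, batch_size)
    # Each batch is processed by one of `concurrency` worker threads.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        partial_sums = list(pool.map(lambda b: batch_gradient(b, weight), batches))
    # Sum the per-batch gradients and scale once per epoch (assumed scaling).
    return sum(partial_sums) / len(examples)


examples = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
small_batches = epoch_gradient(examples, weight=1, batch_size=2, concurrency=2)
one_batch = epoch_gradient(examples, weight=1, batch_size=5, concurrency=2)
```

Both calls yield the same gradient; only the number of batches, and therefore the memory held per batch and the available parallelism, differs.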

### Penalty

This parameter defines the influence of the regularization term in the loss function. While the regularization can avoid overfitting, a high value can even lead to underfitting. The minimal value is zero, where the regularization term has no effect at all.

### HiddenLayerSizes

This parameter defines the shape of the neural network. Each entry represents the number of neurons in a layer. The length of the list defines the number of hidden layers. Deeper and larger networks can theoretically approximate high degree surfaces better, at the expense of having more weights (and biases) that need to be trained.
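To get a feeling for how `hiddenLayerSizes` drives the number of trainable values, the following sketch counts weights and biases. It assumes fully connected layers with one bias per neuron; the function name and the input and output sizes are hypothetical.

```python
def count_parameters(input_size, hidden_layer_sizes, output_size):
    """Return (number of weights, number of biases) for a fully connected MLP."""
    sizes = [input_size, *hidden_layer_sizes, output_size]
    # Every neuron has one weight per neuron in the preceding layer.
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    # Every neuron outside the input layer has one bias.
    biases = sum(sizes[1:])
    return weights, biases


# Hypothetical example: 10 input features, hidden layers of 64 and 16 neurons, 3 classes.
weights, biases = count_parameters(10, [64, 16], 3)
```

Here the two hidden layers lead to 1712 weights and 83 biases, compared to 30 weights and 3 biases with no hidden layer at all.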

### Class weights

This parameter introduces the concept of *class weights*, studied in 'Focal Loss for Dense Object Detection' by T. Lin et al.
It is often called *balanced cross entropy*.
It assigns a weight to each class in the cross-entropy loss function, thus allowing the model to treat different classes with varying importance.
It is defined for each example as:

$$\text{balanced cross entropy}(p_t) = -a_t \log(p_t)$$

where $a_t$ denotes the class weight of the true class and $p_t$ denotes the probability of the true class.

For class-imbalanced problems, the class weights are often set to the inverse of class frequencies to improve the inductive bias of the model on minority classes.
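As a sketch of the inverse-frequency heuristic, the snippet below derives class weights from label counts and plugs them into the balanced cross entropy, taken here as the negative class weight of the true class times the logarithm of its predicted probability. The function names and the exact scaling (total count divided by class count) are assumptions for illustration; other normalizations are equally valid.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, ordered by class value."""
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[c] for c in sorted(counts)]


def balanced_cross_entropy(true_class_probability, true_class_weight):
    return -true_class_weight * math.log(true_class_probability)


# Hypothetical imbalanced dataset: eight examples of class 0, two of class 1.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
class_weights = inverse_frequency_weights(labels)

majority_loss = balanced_cross_entropy(0.5, class_weights[0])
minority_loss = balanced_cross_entropy(0.5, class_weights[1])
```

At the same predicted probability, an error on the minority class contributes four times as much to the loss as one on the majority class, matching the 8:2 class ratio.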

#### Usage in link prediction

For link prediction, it must be a list of length 2 where the first weight is for negative examples (missing relationships) and the second for positive examples (actual relationships).

#### Usage in node classification

For node classification, the `i`-th weight is for the `i`-th class, ordered by the class values (which must be integers). For example, if your node classification dataset has three classes, 0, 1, and 42, then the class weights must be of length 3, and the third weight is applied to class 42.
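The ordering rule can be made explicit with a small helper that pairs each weight with its class. The helper name and the weight values are hypothetical; only the rule of ordering by class value comes from the text above.

```python
def weights_by_class(class_values, class_weights):
    """Pair each weight with its class, with classes ordered by their integer value."""
    ordered_classes = sorted(set(class_values))
    if len(ordered_classes) != len(class_weights):
        raise ValueError("expected exactly one weight per distinct class")
    return dict(zip(ordered_classes, class_weights))


# Class values as they might appear on nodes, in arbitrary order.
mapping = weights_by_class([42, 0, 1, 0, 42], [0.2, 0.3, 0.5])
```

The resulting mapping assigns `0.2` to class 0, `0.3` to class 1 and `0.5` to class 42.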

### Focus weight

This parameter introduces the concept of *focal loss*, again studied in 'Focal Loss for Dense Object Detection' by T. Lin et al.
When `focusWeight` is a value greater than zero, the loss function changes from standard Cross-Entropy Loss to Focal Loss.
It is defined for each example as:

$$\text{focal loss}(p_t) = -(1 - p_t)^{g} \log(p_t)$$

where $p_t$ denotes the probability of the true class.
The `focusWeight` parameter is the exponent noted as $g$.

Increasing `focusWeight` will guide the model towards trying to fit "hard" misclassified examples.
A hard misclassified example is an example for which the model has a low predicted probability for the true class.
In the above equation, the loss will be exponentially higher for low-true-class-probability examples, thus tuning the model towards trying to fit them, at the expense of potentially being less confident on "easy" examples.
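The effect can be checked numerically with a short sketch of the focal loss from the cited paper, written as the cross-entropy term scaled by one minus the true-class probability raised to the power of `focusWeight`. The function name and the example probabilities are hypothetical.

```python
import math


def focal_loss(true_class_probability, focus_weight):
    # With focus_weight == 0 the modulating factor is 1, which gives plain cross entropy.
    modulating_factor = (1.0 - true_class_probability) ** focus_weight
    return -modulating_factor * math.log(true_class_probability)


hard, easy = 0.1, 0.9  # predicted probabilities of the true class

ratio_without_focus = focal_loss(hard, 0.0) / focal_loss(easy, 0.0)
ratio_with_focus = focal_loss(hard, 2.0) / focal_loss(easy, 2.0)
```

Raising the focus weight shrinks the loss of the easy example much more than that of the hard one, so the hard example dominates the gradient to a greater degree.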

In class-imbalanced datasets, the minority class(es) are typically harder to classify correctly. Read more about class imbalance for Link Prediction in Class Imbalance.