Regularization Techniques

Paulo Morillo · Analytics Vidhya · Jun 8, 2020

Deep Neural Network

This short article talks about regularization techniques: what they mean, their advantages, how to apply them, and why they are necessary. I am not going to explain how to design neural networks or anything about forward or backpropagation, weights, biases (thresholds) or normalization; maybe I will cover those topics in a next article. However, you do need those concepts to understand regularization techniques.

First, we need to understand what the problem with neural networks is. When we design and create a neural network we have a goal in mind. For example, if I want to recognize the digits from 0 to 9 (my goal), I should understand that I need samples with many different ways of writing these digits to train the model, and also samples to test it. This is very important because, as you know, there are many ways to write the digits: the lines and/or circles may be perfect in some cases and not in others, and that can depend on many factors like age, sickness, blood alcohol level, anxiety, writing technique and more. What do you think of doctors' handwriting? Yeah, that's another topic. Back to the problem: we need to choose our samples very carefully, trying to get data that represents the possible future datasets. We are going to face many problems, but in this case we are only going to talk about "overfitting".

To understand overfitting, it is necessary to know the meaning of bias and variance. I recommend this video because it is a very good explanation: https://www.youtube.com/watch?v=EuBBz3bI-aA

But if you don't want to watch the video, I created these pictures to illustrate the terms:

Test and train data in axes

The picture shows us two groups of samples (training and testing samples). Try to imagine a case for this data (maybe weights vs. heights, or winter clothes prices vs. seasons; this is only to help your comprehension). Now we are going to train a model with the training dataset, for example with linear regression, and we are going to get something like this:

Linear regression (red color) possible prediction

Bias is going to be the difference (the sum of all squared distances, see the green lines) between our blue dots and the prediction (the red straight line). This is a problem because, as we saw in the picture, not all the dot values are represented by our model's prediction; we call this bias (it can be high or low, and when it is high, like in this case, the problem is called underfitting). If we have less bias we are going to be very close to our training values, so it looks like we get the correct prediction (sounds good, right?). For this reason, imagine that we created a model whose prediction function returns exactly those training values, something like this:

Prediction function getting the training dataset values

In this case we are probably going to return the same values as our training data, and we might think we did a very good job. But if we remember the testing data (the samples used to check whether our model is going to be good for our goal), you are probably already thinking that we are not going to return very good answers for them. We can see this in a picture:

Training and testing data with prediction

We can see that we are not going to return very good values for the testing data, because the prediction is far from all the orange dots. In this case we have a problem called high variance (variance can be low or high). It is like wanting to win a strength challenge: somebody tells you that the best gym to train at is the "legs gym", so you train your legs, and months later you can do a lot of reps with heavy weights and you feel confident about your strength. But on challenge day you discover that the qualification test is a simple biceps curl, a muscle you never trained, so you are probably disqualified in the first round. This difference between datasets is what we call variance, and when we have high variance we say we have overfitting: we trained so much on our training dataset that we cannot be efficient on future predictions. For this reason we need to apply some method to solve this problem, and one of the ways is regularization. In simple words, regularization keeps our neurons from settling on values (weights and biases (thresholds), which are not the same bias I have been talking about in this article) that just reproduce the training data's trend.
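Before moving on to the techniques, here is one small formula sketch of the idea (using the mean of the squared distances mentioned earlier as the error, where f̂ is our trained prediction function; this is just one common way to write it down):

$$\text{Err}_{\text{train}} = \frac{1}{m}\sum_{i\in\text{train}}\big(y_i - \hat{f}(x_i)\big)^2 \qquad\qquad \text{Err}_{\text{test}} = \frac{1}{m'}\sum_{j\in\text{test}}\big(y_j - \hat{f}(x_j)\big)^2$$

Underfitting (high bias) shows up as a high training error; overfitting (high variance) shows up as a low training error together with a much higher testing error.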

Dropout

This method is, conceptually, a way to turn off some neurons to avoid the "memorizing" in our weights and biases (thresholds), which is the goal of all regularization methods. The idea is to train the model without all the neurons (only in the hidden layers), turning neurons off with a random method and then applying a compensation. We are going to see an example in Python of how this method works. Remember that you should apply it in both forward and backpropagation (to the outputs in forward prop and to dZ in backprop).

Example to understand dropout

We use a random binomial draw to get a vector of zeros and ones with respect to a probability (0.8 in this case), and we multiply the layer's outputs by it to simulate turning neurons off. Then we apply a compensation to redistribute the value of the dropped neurons to the other neurons in the layer: we divide by the probability that we want to keep in the layer, so that the expected total activation stays at 100%.
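Here is a minimal numpy sketch of that idea (inverted dropout). The activation values are made up just for illustration; in a real network `a` would be the output of a hidden layer, and the same mask would also be applied to dZ in backpropagation:

```python
import numpy as np

np.random.seed(0)
keep_prob = 0.8  # probability of keeping each neuron active

# Pretend these are the outputs (activations) of one hidden layer
a = np.array([[0.5, 1.2, -0.3, 0.8, 0.1]])

# Random 0/1 mask: every neuron survives with probability keep_prob
mask = np.random.binomial(1, keep_prob, size=a.shape)

# Turn off the dropped neurons and compensate the survivors
# (inverted dropout: divide by keep_prob so the expected sum stays the same)
a_dropped = (a * mask) / keep_prob

print(mask)       # which neurons survived this pass
print(a_dropped)  # surviving activations scaled by 1 / keep_prob
```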

L1 and L2 regularization

These methods add a penalty term, controlled by a hyperparameter, to the cost (the cross-entropy function), and the difference between them is the power applied to the weights. L2 is referred to as weight decay and ridge regression because it uses the Euclidean norm: it consists of the sum of the squares of the weights in the network, multiplied by a parameter λ (the regularization parameter). L1, also called lasso regression, is the sum of the absolute values of the weights in the network multiplied by λ. The next pictures are the formulas of the loss and how to update the weights:

All formulas were taken from this very nice explanation of L1 and L2 regularization https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261#15c2

Formulas for the L1 and L2 loss and weight updates, taken from https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261#15c2
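Since the original images are not reproduced here, this is a sketch of the usual formulations for a network with weight matrices W^[l], m training examples, regularization parameter λ and learning rate α (the conventions may differ slightly from the pictures in the linked article):

$$J_{L2} = J_{\text{cross-entropy}} + \frac{\lambda}{2m}\sum_{l}\big\|W^{[l]}\big\|_2^2 \qquad\qquad J_{L1} = J_{\text{cross-entropy}} + \frac{\lambda}{m}\sum_{l}\big\|W^{[l]}\big\|_1$$

$$\text{L2 (weight decay):}\quad W^{[l]} \leftarrow W^{[l]} - \alpha\Big(dW^{[l]} + \frac{\lambda}{m}W^{[l]}\Big) \qquad\quad \text{L1:}\quad W^{[l]} \leftarrow W^{[l]} - \alpha\Big(dW^{[l]} + \frac{\lambda}{m}\,\mathrm{sign}\big(W^{[l]}\big)\Big)$$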

This is a python example implementing L2 https://github.com/PauloMorillo/holbertonschool-machine_learning/blob/master/supervised_learning/0x05-regularization/0-l2_reg_cost.py
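In case you don't want to open the link, here is a minimal sketch of what an L2-regularized cost can look like with numpy. This is not the exact code from the repository, and the dictionary layout ('W1', ..., 'WL' for the weight matrices) is an assumption:

```python
import numpy as np

def l2_reg_cost(cost, lambtha, weights, L, m):
    """Add the L2 penalty to a cost computed without regularization.

    cost:    cross-entropy cost of the network (without regularization)
    lambtha: regularization parameter (lambda)
    weights: dict holding the weight matrices as 'W1', ..., 'WL'
    L:       number of layers
    m:       number of training examples
    """
    l2 = sum(np.sum(np.square(weights['W' + str(i)])) for i in range(1, L + 1))
    return cost + (lambtha / (2 * m)) * l2

# Tiny usage example with made-up weights:
weights = {'W1': np.ones((3, 2)), 'W2': np.ones((1, 3))}
print(l2_reg_cost(0.5, 0.1, weights, 2, 100))  # 0.5 + (0.1 / 200) * 9 = 0.5045
```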

Both L1 and L2 reduce the value of the weights in each iteration. The difference is how they do it: the L2 penalty shrinks each weight in proportion to its value, so it reduces large weights quickly but slows down as they get small, while the L1 penalty subtracts a constant amount regardless of the weight's size, so it is faster at pushing small weights toward zero.

In conclusion, these methods prevent the regression from fitting the training dataset with zero bias, and they do it with a penalty. This penalty is controlled by an independent term, lambda, so we are adding a kind of "deviation" to our regression function (changing the slope), increasing the bias a little and reducing the high variance. Also, we should know that if we use L2 and increase lambda we can get a slope close to 0, but with L1 we can get exactly 0. This video shows it very well: https://www.youtube.com/watch?v=NGf0voTMlcs (min 4:55)
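Here is a tiny numeric sketch of that last claim (not the procedure from the video, just a brute-force search over candidate weights): with a big enough lambda, the L1-penalized loss is minimized at exactly 0, while the L2-penalized one only gets close to 0.

```python
import numpy as np

# Minimize (w - 1)^2 plus each penalty over a grid of candidate weights.
w = np.linspace(-2.0, 2.0, 401)   # grid that contains w = 0 exactly
lam = 5.0                         # a deliberately large lambda

l1_loss = (w - 1.0) ** 2 + lam * np.abs(w)
l2_loss = (w - 1.0) ** 2 + lam * w ** 2

print("best w with L1:", w[np.argmin(l1_loss)])   # exactly 0.0
print("best w with L2:", w[np.argmin(l2_loss)])   # about 0.17, small but not 0
```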

Early stopping

This method consists of evaluating the training and validation data at the same time and computing the error in both cases. At first the two values stay very close, but at some point the validation cost starts to increase; when that happens we should stop training. With this simple method we avoid overfitting. In this picture we can see the behavior:

Taken from https://srdas.github.io/DLBook/ImprovingModelGeneralization.html#Overfitting figure 8.6

Now, we should know that this is an idealized view of the training and validation errors; in real life we are going to get an error curve more like this:

Taken from https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf

In this case, if we want to find the minimum value we can use a technique called patience: waiting a number of epochs to see whether a lower value appears. For example, if at training iteration 30 the error is 9.5 and we know nothing about the later values, we could say this is the minimum error and stop early there. But if you look further, after about 40 iterations the error is 9.4, and up to around the 50th it keeps decreasing. So if we train up to 50 we could define a patience of about 10 epochs (an epoch being one pass over the whole dataset) and wait those 10 epochs without improvement before choosing the lowest value.
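Here is a minimal sketch of that patience logic. The validation costs are made-up numbers following the story above (a local minimum of 9.5 and a better one of 9.4 a few epochs later); in practice you would compute them from your model after every epoch:

```python
# Made-up validation costs, one per epoch
val_costs = [12.0, 10.5, 9.8, 9.5, 9.6, 9.7, 9.4, 9.45, 9.5, 9.55,
             9.6, 9.62, 9.65, 9.7, 9.72, 9.75, 9.8]

patience = 5                 # epochs we are willing to wait without improvement
best_cost = float('inf')
best_epoch = 0
wait = 0

for epoch, cost in enumerate(val_costs):
    if cost < best_cost:
        best_cost, best_epoch = cost, epoch   # new minimum: reset the counter
        wait = 0
    else:
        wait += 1                             # no improvement this epoch
        if wait >= patience:
            print(f"stopping at epoch {epoch}; best cost {best_cost} at epoch {best_epoch}")
            break
```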
Think of two trucks that want to go to the same town (maybe a delivery place called Prediction Town), but they don't know the road. The road has many exits that lead to Prediction Town or to other towns (like Not Prediction Town, Very Far From Prediction Town and End Of The Road Town), so the drivers have to pay attention to the road signs. The two trucks drive at the same time with the same speed, but each driver sees a different part of the road: one has a better view of the left side of the lane and the other of the right side. At some point there is a sign on the right side showing that Prediction Town is at the next exit. The truck on the right sees the sign and takes the exit; the truck on the left doesn't see anything and keeps following the road toward Not Prediction Town and Very Far From Prediction Town. The point where the right truck left the road is our early stopping point, because if a truck doesn't stop there it ends up in the other towns. The end.

Data augmentation
Another cause of overfitting is not having enough data to get a correct prediction. For example, imagine someone is looking for a lost family member: an old man shows you a black and white picture of a young woman and asks, "Have you seen my wife?". You probably ask him whether he has a more recent picture of her. In short, it is very difficult to identify or predict something if you don't have enough data. Now you want to help, so you ask for more data: her age, her hair (long, gray, black), what she is wearing, and more. With more data you will probably be able to recognize the old lady.
Now, how can we get more data to increase our dataset? We should think about our possible validation data and look at how it differs from the training data: color, shape, the orientation of things, everything we would have wanted to capture in the training data.

For example, imagine we need to create a model that predicts whether you are smiling or not, and we have only one picture of your smile but 30 pictures of validation data. Those pictures come with different filters: black and white, only the red, green or blue component, and maybe even the same picture as in training but flipped vertically because it was loaded in the wrong orientation.
As you know, these cases could be difficult to recognize, so we should think about them and increase our data (data augmentation): we add some filters (black and white, red, green and blue components) to our training picture and save the results to get more training data.

Scaling, resizing, changing the brightness or colors, and rotating are all ways to increase your training dataset.
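As a small sketch, here is how a few of those transformations can be generated with plain numpy (the `image` array is random data standing in for a real photo; in a real project you would more likely use a library such as Keras' ImageDataGenerator or torchvision transforms):

```python
import numpy as np

# Made-up (height, width, 3) image standing in for a real training photo
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)

augmented = [
    np.fliplr(image),                                                     # horizontal flip
    np.rot90(image),                                                      # 90 degree rotation
    np.clip(image.astype(np.int32) + 40, 0, 255).astype(np.uint8),        # brighter
    image.mean(axis=2, keepdims=True).repeat(3, axis=2).astype(np.uint8), # grayscale
    image[8:56, 8:56],                                                    # crop (then resize)
]
print(len(augmented), "new samples generated from one picture")
```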

Conclusion

It’s very important to use regularization techniques to avoid overfitting however is very important to take care with parameters to use for example:

Dropout: we should take care with the percentage of neurons we turn off, because this solution can itself become a problem for the prediction. Some books suggest that a keep probability of 0.8 or greater is a good value (meaning 20% or fewer of the neurons get an output of 0); in other cases we may hurt the correct prediction.

L2 — L1 at least that we want to arrive at 0 or almost 0 we should take care of the value of lambda because if lambda is big we are going to have a 0 value for L1 or almost 0 for L2.

For the early stopping method, it is necessary to choose a good patience parameter, because if we use a patience that is too small we are not going to reach the lowest validation error, as we saw in the description of this method. The way to choose this parameter is experimental.

Data augmentation: in some cases we can add a lot of images that end up being harmful for our prediction, so we should think about possible future cases and try not to create noise in our training dataset.

yeah, that’s all.
thanks.
