- Deep Feedforward Networks
- Overview
- Rectified Linear Unit Activation Function
- Gradient Based Learning
- Architecture Design
-
Deep feedforward networks, also called multilayer perceptrons (MLPs), are deep learning models whose aim is to approximate some function f*.
-
The approximation works as follows:
- For a classifier, y = f*(x) maps an input x to a category y.
- An MLP defines a mapping y = f(x;θ) and learns the parameters θ that result in the best approximation of f*.
-
They are called feedforward because information flows in one direction only, with no feedback connections:
input x ==> intermediate computations defining f ==> final output y
-
They are called networks because they are typically represented by composing together many different functions, e.g. f(x) = f⁽³⁾(f⁽²⁾(f⁽¹⁾(x))).
-
Depth and Width of Network
-
Hidden Layers and Model-Width
- The aim of training is to make f(x) match f*(x).
- Training examples directly specify what the output layer must do at each point x: produce a value close to y ≈ f*(x).
- The learning algorithm must decide how to use the other layers to best implement an approximation of f*.
- Since the training data does not show the desired output for each of these layers, they are called hidden layers.
- Width of the model = dimensionality of the hidden layers.
-
Non-Linear Transformations
- A linear model can be extended to represent non-linear functions of x by applying it to a transformed input φ(x) instead of to x itself.
- The mapping φ can be chosen in three ways:
- Very generic φ: enough capacity to fit the training set, but often poor generalization.
- Manually engineered φ: the dominant approach before deep learning, but it requires a great deal of human effort for each task.
- Deep learning of φ: φ is learned from data; the approach is highly generic, and the human designer only needs to find the right general function family instead of the exact right function.
Our model provides a function y = f(x;θ), and the learning algorithm adapts θ to make f as similar as possible to the XOR function y = f*(x).
- MSE loss function over the four input points X = {[0,0], [0,1], [1,0], [1,1]}: J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x;θ))²
- Linear model's definition: f(x; w, b) = xᵀw + b
- Minimizing J(θ) w.r.t. w and b gives w = 0 and b = 0.5, so the linear model outputs 0.5 everywhere, which is wrong.
- Constructing a feedforward network with one hidden layer: h = f⁽¹⁾(x; W, c) and y = f⁽²⁾(h; w, b)
- Final complete model: f(x; W, c, w, b) = f⁽²⁾(f⁽¹⁾(x))
- Making f⁽¹⁾ linear would make the entire feedforward network linear.
- Assuming a linear approach, let f⁽¹⁾(x) = Wᵀx and f⁽²⁾(h) = hᵀw; in that case f(x) = wᵀWᵀx, which is still linear in x, so a non-linear function is needed to describe the features.
- Most neural networks describe the features with an affine transformation controlled by learned parameters, followed by a fixed non-linear function called an activation function.
- The affine transformation from x to h followed by the activation function g is defined as: h = g(Wᵀx + c)
- The recommended activation function is ReLU, defined as: g(z) = max{0, z}
- Final non-linear model: f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b
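As a rough illustration of the XOR discussion above, the sketch below evaluates a one-hidden-layer ReLU network of the form f(x) = wᵀ max{0, Wᵀx + c} + b on the four XOR inputs and compares it with the constant prediction of the best linear fit (w = 0, b = 0.5). The concrete parameter values `W`, `c`, `w`, `b` are an assumption (one known exact solution, not stated in these notes), and the function names are hypothetical.

```python
def relu(z):
    """Rectified linear unit applied element-wise to a list."""
    return [max(0.0, v) for v in z]

# Assumed parameter values: one known exact solution for XOR.
W = [[1.0, 1.0],
     [1.0, 1.0]]   # input-to-hidden weights, W[i][j]: input i -> hidden unit j
c = [0.0, -1.0]    # hidden biases
w = [1.0, -2.0]    # hidden-to-output weights
b = 0.0            # output bias

def xor_net(x):
    """Evaluate w^T max{0, W^T x + c} + b for a 2-element input x."""
    # Affine transformation: z_j = sum_i W[i][j] * x[i] + c[j]
    z = [sum(W[i][j] * x[i] for i in range(2)) + c[j] for j in range(2)]
    h = relu(z)  # hidden features
    return sum(w_j * h_j for w_j, h_j in zip(w, h)) + b

def linear_model(x, w_lin=(0.0, 0.0), b_lin=0.5):
    """Linear model x^T w + b with the MSE-optimal values w = 0, b = 0.5."""
    return sum(w_i * x_i for w_i, x_i in zip(w_lin, x)) + b_lin

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [xor_net(x) for x in inputs]
linear_outputs = [linear_model(x) for x in inputs]

print(xor_outputs)     # [0.0, 1.0, 1.0, 0.0]
print(linear_outputs)  # [0.5, 0.5, 0.5, 0.5]
```

With these assumed parameters the ReLU network reproduces XOR exactly, while the linear model can only output 0.5 for every input.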
Concept: To understand gradient descent, let's consider a situation where we are at the top of Mount Errorest and need to reach the bottom.
One solution is to look in all directions and choose the direction that leads to the steepest descent.
Then we take a step in that direction and repeat the procedure from the new position.
Eventually we reach the bottom of Mount Errorest. This procedure is known as gradient descent.
To apply gradient-based learning, we must choose a cost function and decide how to represent the output of the model.
Our neural network needs to make predictions that are as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are: the error. A common metric is the sum of squared errors, SSE = Σ_i (y_i − ŷ_i)², where y_i is the real value and ŷ_i the prediction for example i. The point to note here is that the SSE is a function of the weights, because each prediction ŷ_i depends on them.
The SSE is quadratic in the prediction errors, so it resembles a parabola; since it depends on the weights, our goal is to find the weights that minimize the error (i.e. the vertex of the parabola).
At each step, a delta weight Δw = −η ∂SSE/∂w is computed, where η is the learning rate. This delta weight is used to modify the current weights: w ← w + Δw.
Refer to gradient_descent.py for an implementation of gradient descent in Python.
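The referenced file is not reproduced here. As an independent, minimal sketch of the idea (not the contents of gradient_descent.py), the code below fits a one-weight linear model ŷ = w·x + b by repeatedly stepping against the gradient of the SSE. The function names, the toy data, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
def sse(w, b, xs, ys):
    """Sum of squared errors of the model w * x + b on the data."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

def gradient_descent(xs, ys, learning_rate=0.01, steps=2000):
    """Minimize the SSE of w * x + b by plain (batch) gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        # Partial derivatives of the SSE with respect to w and b.
        grad_w = -2.0 * sum(e * x for e, x in zip(errors, xs))
        grad_b = -2.0 * sum(errors)
        # Delta weight = -(learning rate) * gradient, added to the weights.
        w += -learning_rate * grad_w
        b += -learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w_fit, b_fit = gradient_descent(xs, ys)
print(w_fit, b_fit, sse(w_fit, b_fit, xs, ys))
```

On this toy data the fitted values approach w = 2 and b = 1; a learning rate that is too large would make the updates diverge instead.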
We are given a classification problem in which we have to separate red and blue dots. Consider two scenarios in which a boundary line tries to classify the dots, and suppose we are also given, for each dot, the probability that it is blue or red. Multiplying the probabilities assigned to each dot's actual colour gives the total probability of each scenario.
The best scenario is the one in which the total probability is highest; here that is scenario 2.
Therefore our new goal is to move from scenario 1 towards scenario 2, i.e. to maximize the total probability.
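- Cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.
- The cost function used is the cross-entropy between the training data and the model's predictions (the sum of the negative logarithms of the probabilities assigned to the correct outcomes).
As a formula: J(θ) = −Σ_i log p_model(y_i | x_i; θ), where the sum runs over the training examples.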
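To make the link between total probability and cross-entropy concrete, here is a small sketch for the red/blue-dot example. The probability values are hypothetical (the original figures are not available); each number stands for the probability a scenario assigns to one dot's actual colour. The function names are also assumptions.

```python
import math

def total_probability(probs):
    """Product of the probabilities assigned to the correct outcomes."""
    product = 1.0
    for p in probs:
        product *= p
    return product

def cross_entropy(probs):
    """Sum of the negative logarithms of the same probabilities."""
    return -sum(math.log(p) for p in probs)

# Hypothetical probabilities of each dot having its actual colour.
scenario_1 = [0.6, 0.2, 0.1, 0.7]
scenario_2 = [0.7, 0.9, 0.8, 0.6]

total_1 = total_probability(scenario_1)
total_2 = total_probability(scenario_2)
ce_1 = cross_entropy(scenario_1)
ce_2 = cross_entropy(scenario_2)

print(total_1, ce_1)
print(total_2, ce_2)
```

Because the cross-entropy is the negative logarithm of the total probability, the scenario with the higher total probability has the lower cross-entropy, and sums of logs are easier to optimize than long products.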
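- An advantage of deriving the cost function from maximum likelihood is that it removes the burden of designing a cost function for each model: specifying p(y | x) automatically determines the cost −log p(y | x).
- The gradient of the cost function should be large and predictable enough to guide the learning algorithm.
- Functions that saturate (become very flat) make the gradient very small. This often happens because the activation function that produces the model's output saturates (e.g. an exp that saturates when its argument is very negative). The solution is to use the negative log-likelihood, whose log undoes the exp.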
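- Instead of learning a full probability distribution p(y | x;θ), we often want to learn just one conditional statistic of y given x.
- This means treating the cost function as a functional (a mapping from functions to real numbers) rather than just a function.
- Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations.
- First optimization problem: f* = argmin_f E[‖y − f(x)‖²]
- First result derived using calculus of variations: f*(x) = E[y | x], i.e. minimizing the mean squared error predicts the mean of y for each value of x.
- Second optimization problem: f* = argmin_f E[‖y − f(x)‖₁]
- Second result derived using calculus of variations: the minimizer predicts the median of y for each value of x; this cost function is known as mean absolute error.
- Mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization, because some output units that saturate produce very small gradients when combined with these cost functions.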
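The mean/median results above can be checked numerically in the simplest setting, a single constant prediction for a fixed set of targets (i.e. one value of x). The sketch below is only an illustration under that assumption; the sample values and the grid of candidate predictions are hypothetical.

```python
import statistics

def mse(prediction, ys):
    """Mean squared error of a constant prediction."""
    return sum((y - prediction) ** 2 for y in ys) / len(ys)

def mae(prediction, ys):
    """Mean absolute error of a constant prediction."""
    return sum(abs(y - prediction) for y in ys) / len(ys)

# Hypothetical targets observed for one value of x (note the outlier).
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

# Search a grid of candidate predictions from 0.0 to 100.0 in steps of 0.5.
candidates = [i * 0.5 for i in range(201)]
best_for_mse = min(candidates, key=lambda c: mse(c, ys))
best_for_mae = min(candidates, key=lambda c: mae(c, ys))

mean_y = statistics.mean(ys)
median_y = statistics.median(ys)

print(best_for_mse, mean_y)    # 22.0 22.0
print(best_for_mae, median_y)  # 3.0 3.0
```

The outlier pulls the MSE-optimal prediction (the mean) far from most of the data, while the MAE-optimal prediction (the median) stays at the middle value.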
The choice of cost function is tightly coupled with the choice of output unit, which determines the form of the cross-entropy function.
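- Linear units are affine transformations with no nonlinearity.
- Linear output units are often used to produce the mean of a conditional Gaussian distribution: p(y | x) = N(y; ŷ, I), where ŷ = Wᵀh + b and h are the features.
- Maximizing the log-likelihood is then the same as minimizing the MSE.
- Linear units do not saturate, so they pose little difficulty for gradient-based optimization.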
- Many tasks require predicting the value of a binary variable y; classification problems with two classes can be cast in this form.
- The maximum likelihood approach is to define a Bernoulli distribution over y conditioned on x.
- For a Bernoulli distribution, the neural net needs to predict only P(y = 1 | x), which must lie within the interval [0,1].
- With a thresholded linear approach, P(y = 1 | x) = max{0, min{1, wᵀh + b}}, any time wᵀh + b strayed outside the unit interval the gradient of the output would be 0, so learning would stall.
- A better approach is to use the sigmoid function: ŷ = σ(wᵀh + b), where σ(z) = 1 / (1 + exp(−z))
- A sigmoid output unit has two components: a linear layer first computes z = wᵀh + b, then the sigmoid activation function converts z into a probability.
- The resulting Bernoulli distribution is controlled by a sigmoidal transformation of z: P(y) = σ((2y − 1)z). The variable z defining such a distribution based on exponentiation and normalization is called a logit.
- The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is: J(θ) = −log P(y | x) = −log σ((2y − 1)z) = ζ((1 − 2y)z), where ζ(a) = log(1 + exp(a)) is the softplus function.
- Saturation occurs only when the model already has the right answer: when y = 1 and z is very positive, or when y = 0 and z is very negative.
- When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|.
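The behaviour of the sigmoid output unit and its loss can be sketched numerically. The code below assumes the standard definitions σ(z) = 1 / (1 + exp(−z)) and softplus ζ(a) = log(1 + exp(a)), and writes the loss as ζ((1 − 2y)z); the function names are hypothetical and the input values are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softplus(a):
    """log(1 + exp(a)), computed in a numerically stable way."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def bernoulli_nll(y, z):
    """Negative log-likelihood of label y (0 or 1) given logit z."""
    return softplus((1 - 2 * y) * z)

# Correct and confident: the loss is close to 0 (the loss saturates).
loss_right = bernoulli_nll(1, 20.0)
# Wrong and confident: the loss grows like |z|, so the gradient does not vanish.
loss_wrong = bernoulli_nll(0, 20.0)

print(loss_right, loss_wrong)
```

The loss only flattens out when the prediction is already correct; for a confidently wrong prediction it stays approximately linear in |z|, which keeps gradient-based learning moving.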
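- Softmax units represent a probability distribution over a discrete variable with n possible values, unlike sigmoid units, which represent a distribution over a binary variable.
- Now we need to produce a vector ŷ with ŷ_i = P(y = i | x).
- The new constraint requires each ŷ_i to lie between 0 and 1 and the entire vector to sum to 1.
- Going with a linear approach that predicts unnormalized log probabilities, z = Wᵀh + b, where z_i = log P̃(y = i | x) and softmax(z)_i = exp(z_i) / Σ_j exp(z_j). We want to maximize log P(y = i; z) = log softmax(z)_i, therefore: log softmax(z)_i = z_i − log Σ_j exp(z_j)
- [Figure: example of a classification problem that can be solved by a softmax function]
- The input z_i always contributes directly to the cost function through the first term.
- This z_i term cannot saturate, so learning can proceed even if the contribution of the second term becomes very small.
- Squared error is a poor loss function for softmax units.
- The softmax function has multiple output values, which saturate when the differences between input values become extreme.
- A numerically more stable variant of the softmax function can be defined as: softmax(z) = softmax(z − max_i z_i)
- Saturation conditions: softmax(z)_i saturates to 1 when z_i is the maximal input and much greater than all the other inputs, and saturates to 0 when z_i is not maximal and the maximum is much greater.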
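A minimal sketch of the stable softmax variant mentioned above, which subtracts the largest input before exponentiating. Subtracting a constant from every input does not change the result, but it keeps exp from overflowing. The function names and test inputs are assumptions for illustration.

```python
import math

def softmax(z):
    """Softmax of a list of logits, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(z):
    """log softmax(z)_i = z_i - log(sum_j exp(z_j)), computed stably."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

small = softmax([0.0, 1.0, 2.0])
# A naive math.exp(1000.0) would raise OverflowError; the shifted form works.
large = softmax([1000.0, 1001.0, 1002.0])

print(small)
print(large)
```

Both inputs differ only by a constant shift, so the two printed distributions agree up to rounding.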
One-hot encoding is used to represent the label when there are more than two classes: each class becomes a vector with a 1 in the position of that class and 0 everywhere else.
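A short sketch of one-hot encoding follows. The class names and the helper `one_hot` are hypothetical and only serve as an example.

```python
def one_hot(labels, classes):
    """Encode each label as a vector with a single 1 at its class index."""
    index = {name: i for i, name in enumerate(classes)}
    return [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]

# Hypothetical three-class problem.
classes = ["duck", "beaver", "walrus"]
encoded = one_hot(["beaver", "duck", "walrus"], classes)

print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each encoded vector has the same length as the number of classes, which matches the number of outputs of a softmax layer.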
Hidden Units
- ReLU units are an excellent default choice of hidden unit.
- Some hidden units are not differentiable at all input points, e.g. ReLU is not differentiable at z = 0.
- A function is differentiable at z iff the left and right derivatives are both defined and equal to each other.
- Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
- Activation function used by ReLU: g(z) = max{0, z}.
- The difference between a linear unit and a rectified linear unit is that the output of a ReLU is 0 across half its domain.
- The first derivative of ReLU is 1 everywhere that the unit is active, and the second derivative is 0 almost everywhere.
- One drawback of ReLUs is that they cannot learn via gradient-based methods on examples for which their activation is zero.
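Generalizations of ReLU, most of them based on using a non-zero slope α when z < 0, i.e. g(z) = max(0, z) + α min(0, z):
- Absolute value rectification: fixes α = −1, giving g(z) = |z|.
- Leaky and parametric ReLU: leaky ReLU fixes α to a small value such as 0.01, while parametric ReLU (PReLU) treats α as a learnable parameter.
- Maxout units: generalize ReLU further by dividing z into groups of k values and outputting the maximum of each group.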
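The ReLU generalizations listed above can be sketched with one helper that takes the negative-side slope as a parameter, assuming the common form g(z) = max(0, z) + α·min(0, z). The function names and the default slope of 0.01 are assumptions; in a parametric ReLU the slope would be learned rather than fixed.

```python
def rectifier(z, alpha=0.0):
    """max(0, z) + alpha * min(0, z): slope 1 for z > 0, slope alpha for z < 0."""
    return max(0.0, z) + alpha * min(0.0, z)

def relu(z):
    """Plain ReLU: zero slope on the negative side."""
    return rectifier(z, 0.0)

def absolute_value(z):
    """Absolute value rectification: alpha = -1 gives |z|."""
    return rectifier(z, -1.0)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: a small fixed slope on the negative side."""
    return rectifier(z, alpha)

def maxout(z, k):
    """Maxout: split the list z into groups of k values, keep each group's max."""
    return [max(z[i:i + k]) for i in range(0, len(z), k)]

print(relu(-2.0), absolute_value(-2.0), leaky_relu(-2.0))
print(maxout([1.0, 5.0, -3.0, 2.0], 2))  # [5.0, 2.0]
```

Unlike plain ReLU, the variants with a non-zero α still pass gradient for negative inputs, which addresses the drawback of units that are stuck at zero activation.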
- The use of sigmoidal functions as hidden units is discouraged, because they saturate across most of their domain, which makes gradient-based learning very difficult.
- The hyperbolic tangent resembles the identity function near 0 (tanh(0) = 0, whereas σ(0) = 1/2), which typically makes a tanh network easier to train than a sigmoid network.
Most neural networks are organized into groups of units called layers, arranged in a chain-like structure in which each layer is a function of the layer that preceded it.
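The chain structure can be sketched as a loop over layers, where each layer applies an activation to an affine transformation of the previous layer's output. Everything concrete here is an assumption for illustration: the helper names, the weights, and the choice of one hidden layer of width 3 followed by a linear output layer.

```python
def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]

def identity(values):
    """Linear (no-op) activation, used here for the output layer."""
    return list(values)

def dense(inputs, weights, biases, activation):
    """One layer: activation(W^T inputs + biases); weights[i][j] links input i to unit j."""
    pre_activation = [
        sum(weights[i][j] * inputs[i] for i in range(len(inputs))) + biases[j]
        for j in range(len(biases))
    ]
    return activation(pre_activation)

def forward(x, layers):
    """Feed x through the chain of layers; depth is len(layers)."""
    h = x
    for weights, biases, activation in layers:
        h = dense(h, weights, biases, activation)
    return h

# Hypothetical network: 2 inputs -> hidden layer of width 3 -> 1 output.
layers = [
    ([[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0], [1.0], [1.0]], [0.0], identity),
]

y = forward([1.0, 2.0], layers)
print(y)  # [3.0]
```

Adding an entry to `layers` increases the depth, while adding columns to a weight matrix increases that layer's width.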
In layman's terms, the universal approximation theorem states that:
Regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function.
We are not guaranteed, however, that the training algorithm will be able to learn that function.
There exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be.
-
Learning can fail for two reasons:
- The optimization algorithm used for training may not be able to find the parameter values that correspond to the desired function.
- The training algorithm might choose the wrong function as a result of overfitting.
-
Conclusion:
A feedforward network with a single hidden layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly.
In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.