Deep learning is one of the most important and popular ways to predict future events from data. It does so by using neural networks as the core of a predictive model. This article explains how neural networks fit into the framework of statistical modelling, how they are used to predict future events, and how the uncertainty of those predictions can be quantified.
Aleatoric uncertainty and the occurrence of events
When we roll two six-sided dice, many outcomes are possible. Likewise, if a coin is tossed 10 times, there are many possible sequences of heads and tails. Yet whatever actually happens is definite: any data, observables or events we collect are fixed once they have occurred. Before the fact, however, we cannot predict the outcome of the dice roll or the coin toss. This kind of uncertainty, arising from randomness inherent in the process itself, is called aleatoric: the information about which outcome we will get is fundamentally missing, no matter how much else we know.
We generally describe the occurrence of events with a probability distribution function, which assigns a value between 0 and 1 to every event d in the space of all possible events, E. If an event is impossible, P(d) = 0; if it is certain, P(d) = 1. Probability is additive, and the union of all possible events is certain, that is, P(E) = 1.
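To make the dice example above concrete, here is a small Python sketch (not taken from the article) that builds the distribution of the sum of two dice and checks that the probabilities behave as just described:

```python
from itertools import product
from collections import Counter

# Distribution of the sum of two fair six-sided dice.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(outcomes)
P = {s: n / 36 for s, n in counts.items()}   # P(d) for each event d

assert all(0 <= p <= 1 for p in P.values())  # every probability lies in [0, 1]
assert abs(sum(P.values()) - 1.0) < 1e-12    # P(E) = 1: some outcome is certain

print(P[7])            # ~0.167: a sum of 7 is the most likely event
print(P.get(13, 0.0))  # 0.0: an impossible event has probability zero
```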
Statistical Model
To make statistical predictions, we generally work with distributions that can be parameterized. A statistical model consists of a description of the distribution of the data together with any unobservable parameters. The distribution function then assigns probabilities to the occurrence of observable events for given values of the unobservable parameters, and the resulting function lets us interpret the probability of events occurring in the future.
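As an illustration of what a parameterized distribution looks like in code, here is a minimal sketch assuming a Gaussian model with parameters (mu, sigma); the model choice and the numbers are ours, not the article's:

```python
import numpy as np
from scipy.stats import norm

# A simple statistical model: data d are assumed to follow a normal
# distribution with unobservable parameters v = (mu, sigma).
mu, sigma = 2.0, 0.5           # candidate parameter values
d = np.array([1.8, 2.1, 2.4])  # observed data

# The model assigns a probability (density) to the observed data for
# these parameter values: the likelihood P(d | v).
likelihood = norm.pdf(d, loc=mu, scale=sigma).prod()
print(likelihood)

# The same model can then be used for prediction, e.g. the probability
# that a future observation exceeds 3.
print(1 - norm.cdf(3.0, loc=mu, scale=sigma))
```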
Subjective Inference
Bayesian inference tells us which parameter values are supported by the observed data, and in doing so it reduces the uncertainty in our predictions. Equating the two conditional expansions of the joint distribution, P(v, d) = P(d | v) P(v) = P(v | d) P(d), gives the posterior probability P(v | d) = P(d | v) P(v) / P(d) that the parameters take the value v when some data d ~ P have been observed.
The posterior can then be used as the basis of an updated model through the joint probability distribution. Looking closely at this updated model, the model itself has not changed; only the uncertainty in the model parameters has been reduced, because the data now constrain which parameter values are plausible. The updated model can therefore make more accurate and better informed predictions about the distribution of the data.
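A minimal worked example of this updating, using a conjugate beta-binomial coin-toss model (the coin, the uniform prior and the counts are illustrative assumptions, not part of the article):

```python
import numpy as np
from scipy.stats import beta

# Coin-toss sketch: the unknown parameter v is the probability of heads.
# Prior: Beta(1, 1), i.e. uniform.  Observed data d: 7 heads in 10 tosses.
heads, tails = 7, 3
posterior = beta(1 + heads, 1 + tails)   # conjugate posterior P(v | d)

# The posterior concentrates on values supported by the data.
print(posterior.mean())                  # ~0.667
print(posterior.interval(0.95))          # 95% credible interval for v

# Posterior predictive probability that the *next* toss is heads,
# obtained by averaging the model over the posterior.
v = np.linspace(0, 1, 10_001)
print(np.trapz(v * posterior.pdf(v), v)) # equals the posterior mean here
```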
Maximum Likelihood & Maximum a Posteriori Estimation
To fit a model to the true distribution, we need a measure of the distance between the two distributions. The measure most commonly used is the relative entropy, also known as the Kullback-Leibler divergence, which quantifies the information lost when one distribution is approximated by the other. Properties such as its non-symmetry mean that the relative entropy is not an ideal distance measure, but it is still the one generally used. MCMC techniques, discussed below, address the same fitting problem by sampling rather than point estimation. When using maximum likelihood or maximum a posteriori estimation, the prior or posterior probability density must be taken into account to avoid underestimating the epistemic error.
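A short sketch of the relative entropy and its link to maximum likelihood; the discrete distributions and the Gaussian data below are our own illustrative choices:

```python
import numpy as np

def kl(p, q):
    """Relative entropy D_KL(p || q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))   # different values: the KL divergence is not symmetric

# Maximum likelihood as divergence minimization: for Gaussian data with
# unit variance, the negative log-likelihood is minimized by the sample mean.
data = np.array([1.9, 2.3, 2.0, 2.4])
mus = np.linspace(0, 4, 401)
nll = [0.5 * np.sum((data - m) ** 2) for m in mus]
print(mus[np.argmin(nll)], data.mean())  # both ~2.15
```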
Model Comparison
To measure how well a model fits the data, we calculate the evidence, which is the likelihood integrated over all possible values of the model parameters. Should we choose a model that fits the data well but has a million parameters? Or an elegant model with only a few parameters that fits the data poorly? When we speak of the best model, we want the one that fits the data well with as few parameters as possible, so that its predictions of future events are consistent and robust.
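A rough sketch of what computing the evidence looks like for a one-parameter coin model, with two assumed priors standing in for a flexible and a rigid model (all numbers are illustrative):

```python
import numpy as np
from scipy.stats import beta, binom

# Evidence P(d) = ∫ P(d | v) P(v) dv, computed by numerical integration
# over the single parameter v (the probability of heads).
v = np.linspace(1e-6, 1 - 1e-6, 10_001)

def evidence(heads, n, prior):
    return np.trapz(binom.pmf(heads, n, v) * prior.pdf(v), v)

flexible = beta(1, 1)    # broad prior: the coin could have any bias
rigid = beta(50, 50)     # tight prior: the coin is very nearly fair

# Data strongly favouring a biased coin: 9 heads in 10 tosses.
print(evidence(9, 10, flexible))  # ~0.09
print(evidence(9, 10, rigid))     # ~0.01 -> the rigid model is penalized

# Data consistent with a fair coin: 5 heads in 10 tosses.
print(evidence(5, 10, flexible))  # ~0.09
print(evidence(5, 10, rigid))     # ~0.24 -> the simpler hypothesis wins
```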
MCMC techniques, among a host of other methods for characterizing the posterior distribution, can also be used here, giving us a way to quantify and reduce the uncertainty in the model.
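One of the simplest MCMC methods is the Metropolis-Hastings algorithm; the sketch below samples an assumed Gaussian-mean posterior with a flat prior and is only meant to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_posterior(v, data):
    """Unnormalized log posterior for a Gaussian mean with unit variance
    and a flat prior (an illustrative choice)."""
    return -0.5 * np.sum((data - v) ** 2)

def metropolis(data, n_steps=20_000, step=0.5):
    samples, v = [], 0.0
    logp = log_posterior(v, data)
    for _ in range(n_steps):
        proposal = v + step * rng.normal()
        logp_new = log_posterior(proposal, data)
        if np.log(rng.uniform()) < logp_new - logp:   # accept/reject rule
            v, logp = proposal, logp_new
        samples.append(v)
    return np.array(samples[n_steps // 2:])           # drop burn-in

data = np.array([1.9, 2.3, 2.0, 2.4])
draws = metropolis(data)
print(draws.mean(), draws.std())   # ~2.15 and ~0.5: posterior mean and spread
```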
Neural Networks as Statistical Models
Deep learning gives us another way to build models of the data distribution. It spans classification, regression, generation, supervised learning and much more, which is what makes it so powerful. In supervised learning, the data d are treated as pairs of inputs and targets, d = (x, y). For example, the inputs x could be pictures of babies, accompanied by labels y indicating boy or girl. We might then want to predict the label y′ for a previously unseen image x′, which is equivalent to predicting the pair of corresponding input and target, d, given that part of d is already known.
To model the distribution P with a neural network, we use a function f : (x, v) ↦ g = f(x, v) ∈ G, where f is a function parameterized by weights v that takes an input x and returns an output g from the space of possible network outputs G. The form of this function is specified by its (typically very many) parameters and its basic architecture, together with the optimization routine and, above all, the cost function. The loss function can be interpreted as an unnormalized measure of the probability of the data given the unobservable parameters.
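A minimal PyTorch sketch of such a function f(x, v); the architecture and sizes are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

# f : (x, v) -> g = f(x, v): a small network whose weights v parameterize
# a map from inputs x to outputs g in the output space G.
f = nn.Sequential(
    nn.Linear(2, 32),   # input x has 2 features (illustrative choice)
    nn.ReLU(),
    nn.Linear(32, 1),   # output g is a single value, e.g. a predicted mean
)

x = torch.randn(8, 2)   # a batch of 8 inputs
g = f(x)                # network outputs, shape (8, 1)

# The weights v are the unobservable parameters of the statistical model.
n_params = sum(p.numel() for p in f.parameters())
print(g.shape, n_params)
```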
In other words, using a neural network to predict y conditioned on x is a form of probability modelling. Classical neural networks and neural density estimators are usually distinguished by their loss functions. In the case of classical neural networks, the outputs are the values of the parameters that control the shape of the data probability distribution inside the model.
Take regression with a mean squared error loss as an example: the output of the neural network, g = f(x, v), is equivalent to the mean of a normal distribution. That is, if we feed in an input x, the network returns an estimate of the mean of the possible values of y, with unit variance.
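The equivalence between the mean squared error and a unit-variance Gaussian likelihood can be checked directly; the numbers below are illustrative:

```python
import torch

# Minimizing the mean squared error is, up to constants, the same as
# maximizing the likelihood of y under a Normal(g, 1) distribution
# centred on the network output g.
g = torch.tensor([1.2, 0.4, 2.0])   # network outputs f(x, v)
y = torch.tensor([1.0, 0.7, 1.8])   # observed targets

mse = torch.mean((g - y) ** 2)

normal = torch.distributions.Normal(loc=g, scale=1.0)
nll = -normal.log_prob(y).mean()    # average negative log-likelihood

# nll = 0.5 * mse + 0.5 * log(2*pi), so the two losses differ only by
# constants that do not affect the optimum.
print(mse.item(), (2 * (nll - 0.5 * torch.log(torch.tensor(2 * torch.pi)))).item())
```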
With these models in mind, optimizing the network parameters, usually called training, becomes easier to understand. During optimization the parameters are allowed to take any value, which is equivalent to assuming a uniform prior. Training the network by maximum likelihood estimation then minimizes the cross entropy, which in turn minimizes the relative entropy.
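A minimal training loop in this spirit, on synthetic data of our own making, where minimizing the MSE plays the role of maximum likelihood estimation:

```python
import torch
import torch.nn as nn

# Synthetic regression data (purely illustrative).
torch.manual_seed(0)
x = torch.randn(256, 2)
y = x @ torch.tensor([[1.5], [-0.5]]) + 0.1 * torch.randn(256, 1)

f = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(f.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()   # equivalent to the unit-variance Gaussian NLL

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(f(x), y)   # maximum likelihood objective
    loss.backward()
    optimizer.step()

print(loss.item())   # small residual error after training
```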
After optimization, i.e. training, the network gives us the parameters of the statistical model, from which predictions can be read off directly. If, however, the loss function only provides an unnormalized probability, MCMC should be used to draw samples that characterize the data distribution.
But when choosing such a statistical model and optimizing it, a few cautions are in order:
- The statistical model is defined by the loss function – if the loss function does not describe the distribution of the data in any way, the statistical model can go wrong. Considering loss functions beyond the mean squared error is one way to avoid building in unwanted distributional assumptions.
- A statistical model built on a neural network is overparameterized – a neural network may use millions of unidentifiable parameters to perform a simple task.
- Data bias – neural networks are built on data, trained on data and fit to data. Any bias present in the data will be strongly reflected in the model. Physical models, by contrast, are better equipped to avoid such biases.
Finally, the whole discussion can be summarized in one sentence: using a neural network with a loss function is, in effect, building an overparameterized statistical model that describes the distribution of the data.