I am Lazarus, come from the dead,

Come back to tell you all – I shall tell you all

These are words from “The Love Song of J. Alfred Prufrock”, by the greatest modernist poet of them all, the one and only Eliot – Thomas Stearns. Pretentious quotes aside, and with no snide contexts hiding beneath the white fog at the Golden Gate Bridge less than an hour away, let us now get to the point. All is well. Every so often, we disappear, and then reappear. But we take on a different form, hopefully also with shape.

This time, I would like to make a few notes on some quite basic Bayesian modeling ideas, with a view to seeing whether I can put these thoughts coherently. As they say, it is only when you explain things succinctly that you understand them. Through a mathematical example, I would like to bring out the connection between a Bayesian model’s prior belief and how that belief gets adjusted when data arrives. Specifically, the example demonstrates that with a large number of data points, the posterior becomes more and more precise, concentrating around the sample mean. When the amount of data is small, it does not carry enough information to explain the observations precisely. In that case, the prior belief matters, and it gets adjusted by data as it arrives.

## Single parameter model

This is a very elementary example, lifted entirely from BDA. Consider a single observation $y$, parameterized by $\theta$, with a Gaussian likelihood of known variance $\sigma^2$:

$$
p(y \mid \theta) = \mathrm{N}(y \mid \theta, \sigma^2)
$$

We also use a Gaussian prior:

$$
p(\theta) = \mathrm{N}(\theta \mid \mu_0, \tau_0^2)
$$

Now, to get the posterior, we use Bayes’ rule, but as we are ‘given’ the data, we treat $p(y)$ as constant, so the posterior basically becomes the product of the prior and the likelihood, up to an unknown normalizing constant $p(y)$:

$$
p(\theta \mid y) \propto p(\theta)\, p(y \mid \theta)
$$

Now when we plug our individual formulas into this, we get a product of two Gaussians, which is itself a Gaussian.

After going through the exercise of completing the square (and treating terms not containing $\theta$ as constant, etc.), we get

$$
p(\theta \mid y) = \mathrm{N}(\theta \mid \mu_1, \tau_1^2)
$$

where

$$
\mu_1 = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{1}{\sigma^2}\,y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}}
\qquad \text{and} \qquad
\frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2}
$$

We can interpret this result as a compromise between the prior mean $\mu_0$ and the ‘data’ term $y$ – in the first version below, we ‘adjust’ the estimate given by the prior to account for the data:

$$
\mu_1 = \mu_0 + (y - \mu_0)\,\frac{\tau_0^2}{\sigma^2 + \tau_0^2}
$$

Equivalently, we can say that the data has ‘shrunk’ towards the prior mean:

$$
\mu_1 = y - (y - \mu_0)\,\frac{\sigma^2}{\sigma^2 + \tau_0^2}
$$
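The single-observation update is easy to play with numerically. Here is a minimal sketch (the function and variable names are mine, and the numbers are made up): it computes the precision-weighted posterior mean and checks it against the ‘adjust the prior’ form above.

```python
import math

def posterior_update(mu0, tau0_sq, y, sigma_sq):
    """Posterior N(mu1, tau1_sq) for a single Gaussian observation y with
    known variance sigma_sq, under a Gaussian prior N(mu0, tau0_sq)."""
    prec = 1.0 / tau0_sq + 1.0 / sigma_sq        # posterior precision 1/tau1^2
    mu1 = (mu0 / tau0_sq + y / sigma_sq) / prec  # precision-weighted mean
    return mu1, 1.0 / prec

# Prior belief N(0, 1); one observation y = 4 with noise variance 1.
mu1, tau1_sq = posterior_update(0.0, 1.0, 4.0, 1.0)

# The equivalent 'adjust the prior' form: mu0 + (y - mu0) * tau0^2 / (sigma^2 + tau0^2)
adjusted = 0.0 + (4.0 - 0.0) * 1.0 / (1.0 + 1.0)
assert math.isclose(mu1, adjusted)  # both give 2.0, halfway between prior and data
```

With equal prior and noise variances, the posterior mean lands exactly halfway between the prior mean and the observation, which matches the ‘compromise’ reading.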

## Model with multiple observations

It’s a bit more interesting when we have more than one data point. What happens when we have lots and lots of data, as we would in, say, a general deep learning setting?

We use the same ideas, under an iid assumption. The likelihood is now a product over $n$ observations:

$$
p(y_1, \dots, y_n \mid \theta) = \prod_{i=1}^{n} \mathrm{N}(y_i \mid \theta, \sigma^2)
$$

The posterior depends on the data only through the sample mean $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ – a sufficient statistic:

$$
p(\theta \mid y_1, \dots, y_n) = \mathrm{N}(\theta \mid \mu_n, \tau_n^2)
$$

where

$$
\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\,\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}
$$

and

$$
\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}
$$
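Sufficiency is easy to check numerically. A small sketch (the variable names and numbers are mine): updating with all $n$ points at once through the sample mean gives exactly the same posterior as folding in the observations one at a time with the single-observation formula.

```python
import random

random.seed(0)
mu0, tau0_sq, sigma_sq = 0.0, 1.0, 2.0
ys = [random.gauss(3.0, sigma_sq ** 0.5) for _ in range(50)]
n = len(ys)
ybar = sum(ys) / n

# Batch update using only the sufficient statistic ybar:
prec_n = 1.0 / tau0_sq + n / sigma_sq
mu_n = (mu0 / tau0_sq + n * ybar / sigma_sq) / prec_n
tau_n_sq = 1.0 / prec_n

# Sequential update, one observation at a time (yesterday's posterior
# becomes today's prior):
mu, tau_sq = mu0, tau0_sq
for y in ys:
    prec = 1.0 / tau_sq + 1.0 / sigma_sq
    mu = (mu / tau_sq + y / sigma_sq) / prec
    tau_sq = 1.0 / prec

assert abs(mu - mu_n) < 1e-7 and abs(tau_sq - tau_n_sq) < 1e-7
```

The sequential view also makes the ‘prior gets adjusted by data as it arrives’ picture literal: each observation shifts the running mean and tightens the running variance.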

Now, after all this, we get to the crux. Initially, when there is no data, we rely entirely on the prior belief. As data starts to arrive, the prior (embodied by the $\frac{1}{\tau_0^2}\mu_0$ term) slowly gets weighted out by the data term. When we have a large number of data points, $n \to \infty$, the prior’s contribution vanishes and so does the posterior uncertainty:

$$
\mu_n \to \bar{y},
\qquad
\tau_n^2 \approx \frac{\sigma^2}{n} \to 0
$$

Here, we converge to an estimate centered on the sample mean, and the variance parameter goes to zero. In other words, as we get more and more points, we can estimate the mean more and more precisely. In a perverse way, this struck me as one reason why we resort to maximum likelihood in deep learning problems: as the amount of data increases, the uncertainty in estimating the model decreases, with the parameter eventually converging to a point estimate.
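As a quick sanity check of this limit, here is a sketch under the same setup (a deliberately wrong prior mean of 10, with all numbers made up): as $n$ grows, the posterior mean is dragged from the prior towards the sample mean, and the posterior variance collapses.

```python
import random

random.seed(1)
mu0, tau0_sq, sigma_sq, true_theta = 10.0, 1.0, 4.0, 0.0

for n in [1, 10, 100, 10000]:
    ys = [random.gauss(true_theta, sigma_sq ** 0.5) for _ in range(n)]
    ybar = sum(ys) / n
    prec_n = 1.0 / tau0_sq + n / sigma_sq
    mu_n = (mu0 / tau0_sq + n * ybar / sigma_sq) / prec_n
    tau_n_sq = 1.0 / prec_n
    print(f"n={n:6d}  ybar={ybar:+.3f}  mu_n={mu_n:+.3f}  tau_n^2={tau_n_sq:.5f}")

# As n grows, mu_n moves away from the (deliberately wrong) prior mean 10.0
# towards ybar, and the posterior variance tau_n^2 shrinks towards 0 —
# the Bayesian estimate degenerates into the maximum-likelihood point estimate.
```

At small $n$ the prior still dominates; by $n = 10000$ the posterior is, for all practical purposes, a point mass at the sample mean.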

## References

BDA3, Chapter 2: https://avehtari.github.io/BDA_course_Aalto/