Last Class:

  • Probability
  • Conditional probability
  • Bayes' rule
  • Expectation
  • Variance
  • Properties of Expectation and Variance.
  • The murder mystery: MBML
  • Power law distributions

Power law distributions

Central limit theorem

Suppose $ \{X_1, X_2, \dots\} $ is a sequence of i.i.d. random variables with $ E[X_i] = \mu $ and $ Var(X_i) = \sigma^2 $. Let $ S_n = \frac{X_1 + X_2 + \dots + X_n}{n} $. Then, as $ n $ grows, the distribution of $ S_n $ approaches $ N(\mu, \sigma^2 / n) $ (more formally, $ \sqrt{n}(S_n - \mu) \to N(0, \sigma^2) $ in distribution).
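
As a quick illustration, here is a minimal simulation sketch (the distribution and sample sizes are arbitrary choices, not part of the notes): sample means of i.i.d. Exponential(1) draws, which are far from normal individually, look approximately $ N(\mu, \sigma^2 / n) $.

import numpy as np

# Sketch: Exponential(1) has mu = 1 and sigma^2 = 1, so the mean of n = 50
# draws should be approximately N(1, 1/50).
rng = np.random.default_rng(0)
n = 50
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

print(means.mean())   # close to mu = 1
print(means.var())    # close to sigma^2 / n = 0.02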

Power law distribution

$ P[X = x] = C x^{- \alpha} \quad \quad x > x_{min} $

where,

$ C = (\alpha - 1) x_{min}^{(\alpha - 1)} $ and $ \alpha > 1 $

What's weird about power law distributions?

Q1. What happens to expectation when $\alpha < 2$ ?

Q2. What happens to variance when $ \alpha < 3$ ?
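
A simulation sketch of Q1 (the sampler and parameter values are illustrative choices): draws from a power law can be generated by inverse-CDF sampling, $ X = x_{min} (1 - U)^{-1/(\alpha - 1)} $ with $ U \sim Uniform(0, 1) $; for $ \alpha < 2 $ the running sample mean never settles down, because the expectation is infinite.

import numpy as np

# Sketch: inverse-CDF sampling from a power law with alpha = 1.5 < 2.
rng = np.random.default_rng(0)
alpha, x_min = 1.5, 1.0
u = rng.uniform(size=1_000_000)
x = x_min * (1 - u) ** (-1.0 / (alpha - 1))

# The running mean keeps drifting upward instead of converging (infinite mean).
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[999], running_mean[99_999], running_mean[-1])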

Maximum Likelihood Estimation

Intuition:

  • You are given a biased coin that lands heads with probability p. You toss the coin 100 times and observe 80 heads. Guess the value of p.
  • Mathematically, you can formulate the above problem as $$ p^* = \arg\max_{p} \; P[ \# \; \text{heads} = 80 \mid p ] $$
  • Now,

$$ P[ \# \; \text{heads} = 80 \mid p ] = \binom{100}{80} \, p^{80} (1 - p)^{20} \;\propto\; p^{80} (1 - p)^{20} $$

  • Differentiate (the log of) this expression w.r.t. $ p $, set the derivative to zero, and solve to obtain $ p^* = 0.8 $ (a sketch of this step follows the list).
  • This process is called maximum likelihood estimation. $ p ^ * $ is called the maximum likelihood estimate.
  • We try to find the parameter of the distribution for which the probability of the observed values is maximised.
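
A sketch of the differentiation step (using the log-likelihood, which has the same maximiser as the likelihood):

$$ \log L(p) = 80 \log p + 20 \log (1 - p), \qquad \frac{d}{dp} \log L(p) = \frac{80}{p} - \frac{20}{1 - p} = 0 \implies p^* = \frac{80}{100} = 0.8 $$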

Locate the MLE in the figure

Exercise problems:

  • Suppose you conduct a binomial experiment (remember that a single binomial experiment consists of n i.i.d. Bernoulli trials) with n = 10. You observe 7 heads and 3 tails in the 10 tosses of the coin. Assume that the data comes from $ Binomial(10, p) $. Find the MLE of p.
  • Now suppose you conduct three binomial experiments. You observe 7 heads and 3 tails in experiment 1, 6 heads and 4 tails in experiment 2, and 8 heads and 2 tails in experiment 3. Again, assume that the data comes from $ Binomial(10, p) $. Find the MLE of p (a numeric sanity check follows this list).
  • Generalise: you conduct $ N $ $ Binomial(n, p) $ experiments, where $ n $ is known. You observe $ h_i $ heads in experiment $ i $, where $ i = 1, 2, 3, \dots, N $. Find the MLE of p.
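
One way to sanity-check the second exercise numerically (a sketch; the grid search stands in for the calculus you are asked to do): maximise the joint Binomial(10, p) log-likelihood over a grid of p values.

import numpy as np

# Sketch: grid search for the MLE of p given 7, 6 and 8 heads in three
# Binomial(10, p) experiments.
heads = np.array([7, 6, 8])
n = 10

p_grid = np.linspace(0.001, 0.999, 1000)
log_lik = np.array([np.sum(heads * np.log(p) + (n - heads) * np.log(1 - p))
                    for p in p_grid])

print(p_grid[np.argmax(log_lik)])   # should be close to (7 + 6 + 8) / 30 = 0.7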

What's good about MLE

  • Intuitive.
  • Point estimate: makes the math simple.
  • As the number of samples increases, the estimate gets better. Mathematically, it asymptotically achieves the Cramér-Rao lower bound.

What's bad about MLE

  • Point estimate: No knowledge about the confidence in the estimate.
  • Can result in biased estimates (e.g. the MLE of $ \sigma^2 $ for a Gaussian, see the exercise below).
In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

# Grid of candidate values of p in [0, 1)
p = np.arange(0, 1, 1/2000)

# Likelihood of observing 8 heads and 2 tails, as a function of p
L = (p**8) * ((1-p)**2)
plt.plot(p, L)
plt.show()

# Likelihood of 800 heads and 200 tails: the maximiser is still p = 0.8,
# but the curve is far more sharply peaked around it
L = (p**800) * ((1-p)**200)
plt.plot(p, L)
Out[1]:
[<matplotlib.lines.Line2D at 0x7fa19b615f60>]

Bias and Variance of estimators

Wikipedia: https://en.wikipedia.org/wiki/Bias_of_an_estimator

Definition:

Suppose we have a statistical model, parameterized by a real number θ, giving rise to a probability distribution for observed data, $ P_\theta(X) = P(X\mid\theta) $, and a statistic $\hat\theta$ which serves as an estimator of θ based on any observed data $x$. That is, we assume that our data follow some unknown distribution $ P(X\mid\theta) $ (where θ is a fixed constant that is part of this distribution, but is unknown), and then we construct some estimator $ \hat\theta $ that maps observed data to values that we hope are close to θ. The bias of $ \hat\theta $ relative to $ \theta $ is defined as

$ \operatorname{Bias}_\theta[\,\hat\theta\,] = \operatorname{E}_{X\mid\theta}[\,\hat{\theta}\,]-\theta = \operatorname{E}_{X\mid\theta}[\, \hat\theta - \theta \,], $

where $ \operatorname{E}_{X\mid\theta} $ denotes expected value over the distribution $ P(X\mid\theta) $, i.e. averaging over all possible observations $ x $. The second equation follows since θ is measurable with respect to the conditional distribution $ P(X\mid\theta) $.

An estimator is said to be unbiased if its bias is equal to zero for all values of the parameter θ.

  • Let $ \hat{\theta} $ be an estimate for a parameter $ \theta $.

    • Bias: $ E_{X|\theta}[\hat{\theta} - \theta] = E_{X|\theta}[\hat{\theta}] - \theta$

    • Variance: $ E_{X|\theta}[(\hat{\theta} - E_{X|\theta}[\hat{\theta}])^2] $

  • Unbiased estimators: Bias = 0 i.e. $ E_{X|\theta}[\hat{\theta}] = \theta$

Example

  • Let $ x_1, x_2, \dots, x_n $ be a sample from a normal distribution with parameters $ \mu $ and $ \sigma^2 $. Derive the maximum likelihood estimates of $ \mu $ and $ \sigma^2 $.

    [Hint: log transform ]

  • Find the bias of $ \mu_{MLE} $.
  • Find the bias of $ \sigma^2_{MLE} $ (a simulation sketch follows this list).
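
A simulation sketch of the bias (the parameter values are arbitrary; the point is the $ 1/n $ normalisation in the variance MLE):

import numpy as np

# Sketch: the MLE of mu is unbiased, while the MLE of sigma^2 (divide by n)
# has expectation sigma^2 * (n - 1) / n, i.e. bias -sigma^2 / n.
rng = np.random.default_rng(0)
mu, sigma2, n = 0.0, 4.0, 5

samples = rng.normal(mu, np.sqrt(sigma2), size=(100_000, n))
mu_mle = samples.mean(axis=1)
sigma2_mle = ((samples - mu_mle[:, None]) ** 2).mean(axis=1)

print(mu_mle.mean())       # close to mu = 0
print(sigma2_mle.mean())   # close to sigma^2 * (n - 1) / n = 3.2, not 4.0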

Test your understanding

  • Let $ x_1, x_2, \dots, x_n $ be a sample of observations from a Poisson distribution with parameter $ \lambda $. Find the maximum likelihood estimate of $ \lambda $ in terms of the $ x_i $ and $ n $ (a derivation sketch follows).
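
A sketch of the standard derivation (fill in the steps yourself):

$$ \log L(\lambda) = \sum_{i=1}^{n} \left( x_i \log \lambda - \lambda - \log x_i! \right), \qquad \frac{d}{d\lambda} \log L(\lambda) = \frac{\sum_i x_i}{\lambda} - n = 0 \implies \hat{\lambda}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x_i $$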

MLE: Dice roll

Lagrange multipliers : Khan academy video

  • Consider the following constrained optimization problem: maximize $ f(x, y) = x^2 y $ subject to $ x^2 + y^2 = 1 $.
  • Intuition: at the optimum, the contours of $ f(x, y) = x^2 y $ and $ g(x, y) = x^2 + y^2 - 1 $ are tangent. Thus the gradients of $ f $ and $ g $ (which are normal to their contours; why?) are parallel, and the optimal point is found by solving $$ \nabla f = \lambda \nabla g \quad (1) $$ (a worked sketch for this example follows the list).
  • Prove that solving $ (1) $ together with the constraint is the same as solving: $$ \frac{\partial}{\partial x} (f - \lambda g) = 0 $$ $$ \frac{\partial}{\partial y} (f - \lambda g) = 0 $$ $$ \frac{\partial}{\partial \lambda} (f - \lambda g) = 0 $$
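
A worked sketch for the example above (verify each step):

$$ \nabla f = (2xy, \; x^2), \quad \nabla g = (2x, \; 2y) \implies 2xy = 2 \lambda x, \quad x^2 = 2 \lambda y, \quad x^2 + y^2 = 1 $$

Taking $ x \neq 0 $ gives $ y = \lambda $, hence $ x^2 = 2y^2 $, so $ y^2 = \frac{1}{3} $ and $ x^2 = \frac{2}{3} $; the maximum $ f = \frac{2}{3\sqrt{3}} $ is attained at $ y = \frac{1}{\sqrt{3}} $, $ x = \pm\sqrt{\frac{2}{3}} $.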

Biased die:

$$ P[X = i] = \theta_{i} \quad \quad (2) $$

where, $ i = 1, 2, 3, 4, 5, 6 $ and $ \sum_{i = 1}^{6} \; \theta_{i} = 1 $

A mathematically convenient way to write (2) is,

$$ P[X = x] = \prod_{i=1}^{6} \theta_{i} ^ {I(i=x)} $$

  • Problem: Find the MLE estimates for the parameters $ \theta_1, \theta_2, \dots, \theta_6 $ of a biased die, given that the number of 1s observed is $ n_1 $, the number of 2s observed is $ n_2 $, and so on (a derivation sketch follows).
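
A sketch of how the Lagrange-multiplier machinery applies here (the details are left for you to check): the log-likelihood is $ \sum_{i=1}^{6} n_i \log \theta_i $, maximised subject to $ \sum_i \theta_i = 1 $.

$$ \frac{\partial}{\partial \theta_i} \left( \sum_{j=1}^{6} n_j \log \theta_j - \lambda \Big( \sum_{j=1}^{6} \theta_j - 1 \Big) \right) = \frac{n_i}{\theta_i} - \lambda = 0 \implies \hat{\theta}_i = \frac{n_i}{\lambda} = \frac{n_i}{\sum_{j=1}^{6} n_j} $$

(the last step uses the constraint to pin down $ \lambda = \sum_j n_j $).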

Homework task

Multi-dimensional Gaussian distribution

Q1. What is the support of a 2D Gaussian distribution?

Q2. Write the expression for P[X = (x, y)] for a 2D Gaussian distribution with mean $ \mu = (\mu_1, \mu_2) $ and covariance matrix $ \Sigma = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} $.

Q3. What do the terms $ a_{11}, a_{12}, a_{21}, a_{22} $ represent?

Q4. Is the matrix $ \Sigma $ symmetric?

Q5. When the matrix $ \Sigma $ is diagonal, list some of the properties of the distribution.

Q6. Given observations $ x_1, x_2, \dots, x_n $ from an $ N $-dimensional Gaussian distribution with parameters $ \mu $ and $ \Sigma $, find the MLE estimates for $ \mu $ and $ \Sigma $. (First obtain the result for N = 2 and then generalize; a numeric sanity-check sketch follows.)
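
A numeric sanity-check sketch for Q6 (it assumes the standard results, namely that the MLE of $ \mu $ is the sample mean and the MLE of $ \Sigma $ uses a $ 1/n $ normalisation; derive these yourself first):

import numpy as np

# Sketch: estimate mu and Sigma from draws of a 2D Gaussian and compare
# against the true parameters used to generate the data.
rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])

x = rng.multivariate_normal(mu_true, Sigma_true, size=100_000)   # shape (n, 2)
mu_mle = x.mean(axis=0)
centered = x - mu_mle
Sigma_mle = centered.T @ centered / x.shape[0]                    # note 1/n, not 1/(n-1)

print(mu_mle)      # close to mu_true
print(Sigma_mle)   # close to Sigma_true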

Linear Regression

Topics:

  • The formulation.

  • Probabilistic Interpretation.

Formulation

Fit a line whose equation is of the form $ \hat{Y} = a + b X $

Minimise $ L = \frac{1}{n} \sum_i d_i^2 = \frac{1}{n} \sum_i (Y_i - \hat{Y}_i)^2 $, where $ d_i = Y_i - \hat{Y}_i $ is the $ i $-th residual.

  • Find the expressions for $ a^* $ and $ b^* $ which minimise $ L $ (a numeric check sketch follows).
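
A numeric check sketch (it assumes the closed form $ b^* = \frac{Cov(X, Y)}{Var(X)} $, $ a^* = \bar{Y} - b^* \bar{X} $, which you should derive rather than take from here):

import numpy as np

# Sketch: fit a line to synthetic data with true intercept 2 and slope 3,
# and compare the closed-form least-squares solution with numpy's polyfit.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=200)

b_star = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # Cov(X, Y) / Var(X)
a_star = y.mean() - b_star * x.mean()

print(a_star, b_star)
print(np.polyfit(x, y, deg=1))                        # [slope, intercept], for comparison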

Probabilistic Interpretation: Why L?

Assumptions:

  • Assume that $$ Y = \alpha + \beta X + \epsilon $$

where $$ \epsilon \sim N(0, \sigma^2) $$

  • Write the expression for P[Y|X].
  • Find the (conditional) MLE estimates of $ \alpha $ and $ \beta $ given $ n $ observations $ (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) $ (a sketch of the first step follows this list).
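
A sketch of the first step (under the assumptions above; the pause point below picks up from here):

$$ P[Y = y \mid X = x] = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{(y - \alpha - \beta x)^2}{2 \sigma^2} \right) $$

$$ LL(\alpha, \beta) = \sum_{i=1}^{n} \log P[y_i \mid x_i] = - \frac{n}{2} \log (2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 $$

so maximising $ LL $ over $ \alpha $ and $ \beta $ is the same as minimising the squared-error loss $ L $ above.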

[Pause after you equate $ \frac{\partial}{\partial \alpha} LL $ and $ \frac{\partial}{\partial \beta}LL$ to $ 0 $ ]

... To be continued in next class