D Probabilities
This appendix covers basic definitions and results of probability theory. It does not aim to replace a course on probability.
D.1 Probability, conditional probability, and dependence
We use \(p(a)\) interchangeably to denote:
- the probability that a logical event \(A\) occurs
- the probability that a discrete random variable \(A\) takes the value \(a\)
- the probability density of a continuous random variable \(A\) at the value \(a\)
The joint probability that two events occur (or that two random variables take particular values) is denoted \(p(a,b)\).
The conditional probability of an event \(a\) given that \(b\) occurs, denoted \(p(a|b)\) and read “probability of a given b”, is defined for \(p(b) > 0\) as:
\[\begin{align} p(a|b):=\frac{p(a,b)}{p(b)} \end{align}\]
The random variables \(a\) and \(b\) are independent, denoted \(a \perp b\), if and only if \[\begin{align} p(a,b) = p(a)p(b) \end{align}\] This is equivalent to saying that \(p(a|b) = p(a)\) and that \(p(b|a) = p(b)\).
Otherwise, the random variables \(a\) and \(b\) are dependent, denoted \(a \not\perp b\).
These results generalize to discrete random variables and to probability densities of continuous random variables.
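To build intuition, here is a minimal sketch in R (with made-up probabilities) that simulates two independent discrete random variables and checks empirically that the joint probability factorizes and that conditioning does not change the marginal:

```r
set.seed(42)
n <- 1e6

# Two independent discrete random variables with arbitrary distributions
a <- sample(c("a1", "a2"), n, replace = TRUE, prob = c(0.3, 0.7))
b <- sample(c("b1", "b2"), n, replace = TRUE, prob = c(0.6, 0.4))

p_ab <- mean(a == "a1" & b == "b1")  # estimate of p(a1, b1)
p_a  <- mean(a == "a1")              # estimate of p(a1)
p_b  <- mean(b == "b1")              # estimate of p(b1)

p_ab          # close to p_a * p_b, since a and b are independent
p_a * p_b
p_ab / p_b    # estimate of p(a1 | b1), close to p_a
```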
D.2 Expected value, variance, and covariance
If \(X\) is a random variable with probability mass function (discrete case) or probability density function (continuous case) \(p(x)\), then the expected value is defined as the sum (for discrete random variables) or the integral (for univariate continuous random variables): \[\operatorname{E}[X] = \sum_x x\, p(x) \quad \text{or} \quad \operatorname{E}[X] = \int x\, p(x)\, dx\] This loose definition suffices for our purposes; a mathematically rigorous definition of the expected value is more involved (see https://en.wikipedia.org/wiki/Expected_value).
The variance is defined as:
\[\operatorname{Var}[X]=\operatorname{E}[(X - \operatorname{E}[X])^2]\] The standard deviation is the square root of the variance:
\[\operatorname{SD}[X] = \sqrt{\operatorname{Var}[X]}\]
The covariance of two random variables \(X\) and \(Y\) is defined as:
\[\operatorname{Cov}[(X,Y)]=\operatorname{E}[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])]\] The Pearson correlation coefficient \(\rho_{X,Y}\) of two random variables \(X\) and \(Y\) is defined as:
\[\rho_{X,Y} = \frac{\operatorname{Cov}[(X,Y)]}{\operatorname{SD}[X]\operatorname{SD}[Y]}\]
The expected value of multidimensional random variables is defined per component. That is,
\[\operatorname{E}[(X_1,\ldots,X_n)]=(\operatorname{E}[X_1],\ldots,\operatorname{E}[X_n])\]
The covariance matrix of a multidimensional random variable \(X\) is the matrix of all pairwise covariances, i.e. with \((i,j)\)-th element being:
\[(\textbf{Cov}[X])_{i,j}=\operatorname{Cov}[(X_i,X_j)]\]
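For intuition, these population quantities can be approximated by simulation. The following is a minimal R sketch, with an arbitrarily chosen bivariate distribution, approximating the component-wise expected values and the covariance matrix from a large random sample:

```r
set.seed(1)
n <- 1e5

# X1 standard normal; X2 depends linearly on X1 (arbitrary choice for illustration)
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n, sd = 0.5)
X  <- cbind(x1, x2)

colMeans(X)   # approximates (E[X1], E[X2])
cov(X)        # approximates the covariance matrix Cov[X]
cor(x1, x2)   # approximates the Pearson correlation coefficient rho
```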
D.3 Sample estimates
Let \(\{x_1,\ldots,x_n\}\) be a finite sample of size \(n\) of independent realizations of a random variable \(X\). Considered as random variables, the \(x_i\) are independently and identically distributed (i.i.d.).
The sample mean, often denoted \(\bar x\), is defined as:
\[\bar x = \frac{1}{n}\sum_i x_i\] The sample mean is an unbiased estimator of the expected value. That is, \(\operatorname{E}[\bar x] = \operatorname{E}[X]\).
The sample variance is defined as:
\[\sigma^2_x = \frac{1}{n}\sum_i (x_i-\bar x)^2\] The sample variance is not an unbiased estimator of the variance. Therefore, one often uses the unbiased sample variance, defined as:
\[s_x^2 = \frac{1}{n-1}\sum_i (x_i-\bar x)^2\] for which \(\operatorname{E}[s^2_x] = \operatorname{Var}[X]\) holds.
The sample standard deviation and the unbiased sample standard deviation are defined as the square root of their variance counterparts.
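As a minimal R sketch of these estimators on made-up data (note that R's built-in var() and sd() implement the unbiased, \(1/(n-1)\), versions):

```r
x <- c(2.1, 3.5, 4.0, 5.2, 6.8)   # a small made-up sample

n      <- length(x)
x_bar  <- mean(x)                        # sample mean
sigma2 <- sum((x - x_bar)^2) / n         # (biased) sample variance
s2     <- sum((x - x_bar)^2) / (n - 1)   # unbiased sample variance

all.equal(s2, var(x))   # TRUE: R's var() uses the 1/(n-1) definition
sqrt(s2)                # unbiased sample standard deviation
sd(x)                   # same value
```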
The sample Pearson correlation coefficient is given by:
\[\begin{align} r =\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \end{align}\]
where \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\) is the sample mean, and analogously for \(\bar{y}\).
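A minimal R sketch with made-up data, comparing this explicit formula with R's built-in cor():

```r
x <- c(1.0, 2.0, 3.0, 4.0, 5.0)
y <- c(1.2, 1.9, 3.2, 4.1, 4.8)   # made-up data roughly increasing with x

# sample Pearson correlation coefficient computed from the formula above
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

r_manual
cor(x, y)   # same value
```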
D.4 Linear regression
This section derives the estimates for univariate linear regression.
For a data set \((x_i, y_i)\) with \(i \in \{1, \dots, N\}\), the univariate linear model is defined as \[y_i = \alpha + \beta x_i + \epsilon_i\] with free parameters \(\alpha\) and \(\beta\) and random errors \(\epsilon_i \sim N(0, \sigma^2)\) that are i.i.d. (independently and identically distributed).
The normal distribution is defined as \[N(\epsilon | 0, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{\epsilon^2}{2 \sigma^2}) .\]
The assumption that the errors \(\epsilon_i\) are independent and identically distributed allows us to factorize the likelihood of the data under the linear model as \[L(\alpha, \beta, \sigma^2) = \prod_{i=1}^{N} N(\epsilon_i | 0, \sigma^2) ,\] using the fact that the probability of independent events is the product of the probabilities of each individual event.
We are interested in finding the parameters \(\alpha\), \(\beta\), and \(\sigma^2\) that best model our data. This can be achieved by finding the parameters that maximize the likelihood of the data. Since maximizing the likelihood is equivalent to maximizing the log likelihood, and the log likelihood is easier to handle, we work with the log likelihood in the following.
The log likelihood of our data is defined as follows:
\[\begin{align} \log(L(\alpha, \beta, \sigma^2)) & = \log( \prod_{i=1}^{N} N(\epsilon_i | 0, \sigma^2) ) \\ & = \sum_{i=1}^{N} \log( N(y_i - (\alpha + \beta x_i) | 0, \sigma^2) ) \\ & = - 0.5 N \log(2 \pi \sigma^2) + \sum_{i=1}^{N} - \frac{(y_i - (\alpha + \beta x_i))^2}{2 \sigma^2} . \end{align}\]
The log likelihood is a concave quadratic function of \(\alpha\) and \(\beta\), so we can maximize it by computing its gradient and setting it to zero; doing the same for the derivative with respect to \(\sigma^2\) yields: \[\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}\] \[\hat{\beta} = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^N (x_i - \bar{x})^2}\] \[\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N (y_i - (\hat{\alpha} + \hat{\beta}x_i))^2\]
where \(\bar{x}\) and \(\bar{y}\) denote the sample means of the \(x_i\) and \(y_i\).
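As a sanity check of these closed-form estimates, here is a minimal R sketch on simulated data (true values \(\alpha = 1\), \(\beta = 2\), \(\sigma = 0.3\) chosen arbitrarily), comparing them with the coefficients returned by R's lm():

```r
set.seed(7)
N <- 100
x <- runif(N)
y <- 1 + 2 * x + rnorm(N, sd = 0.3)   # simulate y = alpha + beta * x + epsilon

# closed-form estimates derived above
beta_hat   <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat  <- mean(y) - beta_hat * mean(x)
sigma2_hat <- mean((y - (alpha_hat + beta_hat * x))^2)   # maximum-likelihood estimate

c(alpha_hat, beta_hat)
coef(lm(y ~ x))   # least-squares fit; gives the same estimates of alpha and beta
```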
D.5 Resources
The chapters on probability and random variables of Rafael Irizarry's book Introduction to Data Science give related primer material [https://rafalab.github.io/dsbook/].