D Probabilities

This appendix covers basic definitions and results of probability theory. It does not aim to replace a course on probability.

D.1 Probability, conditional probability, and dependence

We use \(p(a)\) interchangeably to denote:

  • the probability that a logical event \(A\) occurs
  • the probability that a discrete random variable \(A\) takes the value \(a\)
  • the probability density of a continuous random variable \(A\) at the value \(a\)

The joint probability that two events occur (or that two random variables take particular values) is denoted \(p(a,b)\).

The conditional probability of an event \(a\) given that \(b\) occurs, denoted \(p(a|b)\) and read “probability of a given b”, is defined for \(p(b) > 0\) as:

\[\begin{align} p(a|b):=\frac{p(a,b)}{p(b)} \end{align}\]

The random variables \(a\) and \(b\) are independent, denoted \(a \perp b\), if and only if \[\begin{align} p(a,b) = p(a)p(b) \end{align}\] This is equivalent to saying that \(p(a|b) = p(a)\) and that \(p(b|a) = p(b)\).

Otherwise, the random variables \(a\) and \(b\) are dependent, denoted \(a \not\perp b\).

These results generalize to discrete random variables and to probability densities of continuous random variables.
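As a small illustration, the following Python sketch (with an arbitrarily chosen joint distribution of two binary variables) computes a conditional probability from a joint probability table and checks the independence criterion:

```python
# Joint distribution p(a, b) of two binary random variables A and B
# (illustrative numbers; here A and B turn out to be dependent).
p_joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

# Marginals p(a) and p(b), obtained by summing out the other variable
p_a = {a: sum(p for (a2, _), p in p_joint.items() if a2 == a) for a in (0, 1)}
p_b = {b: sum(p for (_, b2), p in p_joint.items() if b2 == b) for b in (0, 1)}

# Conditional probability p(a | b) = p(a, b) / p(b)
def p_a_given_b(a, b):
    return p_joint[(a, b)] / p_b[b]

print(p_a_given_b(1, 1))   # 0.4 / 0.6 ≈ 0.667
print(p_a[1])              # 0.5  -> p(a|b) != p(a), so A and B are dependent

# Independence check: p(a, b) == p(a) * p(b) for all (a, b)
independent = all(abs(p_joint[(a, b)] - p_a[a] * p_b[b]) < 1e-12
                  for a in (0, 1) for b in (0, 1))
print(independent)         # False
```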

D.2 Expected value, variance, and covariance

If \(X\) is a random variable with a probability mass function (discrete case) or probability density function (continuous case) \(p(x)\), then the expected value is defined as the sum (for discrete random variables) or integral (for univariate continuous random variables)¹: \[\operatorname{E}[X] = \sum_x x\, p(x) \quad \text{or} \quad \operatorname{E}[X] = \int x\, p(x)\, dx\]

The variance is defined as:

\[\operatorname{Var}[X]=\operatorname{E}[(X - \operatorname{E}[X])^2]\] The standard deviation is the square root of the variance:

\[\operatorname{SD}[X] = \sqrt{\operatorname{Var}[X]}\]
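For instance, for a fair six-sided die these definitions can be evaluated directly from the probability mass function; a minimal Python sketch:

```python
# Expected value, variance, and standard deviation of a fair six-sided die,
# computed directly from its probability mass function (illustrative example).
values = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in values}

E_X = sum(x * pmf[x] for x in values)                  # E[X] = 3.5
Var_X = sum((x - E_X) ** 2 * pmf[x] for x in values)   # E[(X - E[X])^2] ≈ 2.917
SD_X = Var_X ** 0.5                                    # sqrt(Var[X]) ≈ 1.708

print(E_X, Var_X, SD_X)
```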

The covariance of two random variables \(X\) and \(Y\) is defined as:

\[\operatorname{Cov}[(X,Y)]=\operatorname{E}[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])]\] The Pearson correlation coefficient \(\rho_{X,Y}\) of two random variables \(X\) and \(Y\) is defined as:

\[\rho_{X,Y} = \frac{\operatorname{Cov}[(X,Y)]}{\operatorname{SD}[X]\operatorname{SD}[Y]}\]

The expected value of a multidimensional random variable is defined per component. That is,

\[\operatorname{E}[(X_1,\ldots,X_n)]=(\operatorname{E}[X_1],\ldots,\operatorname{E}[X_n])\]

The covariance matrix of a multidimensional random variable \(X\) is the matrix of all pairwise covariances, i.e. its \((i,j)\)-th element is:

\[(\textbf{Cov}[X])_{i,j}=\operatorname{Cov}[(X_i,X_j)]\]
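The following Python sketch (again with an arbitrarily chosen discrete joint distribution) evaluates the covariance, the Pearson correlation coefficient, and the covariance matrix of the pair \((X, Y)\) directly from these definitions:

```python
# Covariance, Pearson correlation, and covariance matrix of a pair (X, Y)
# with a small discrete joint distribution (illustrative numbers).
p_joint = {  # p(x, y)
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def expectation(f):
    """E[f(X, Y)] under the joint distribution."""
    return sum(f(x, y) * p for (x, y), p in p_joint.items())

E_X = expectation(lambda x, y: x)
E_Y = expectation(lambda x, y: y)
Cov_XY = expectation(lambda x, y: (x - E_X) * (y - E_Y))
Var_X = expectation(lambda x, y: (x - E_X) ** 2)
Var_Y = expectation(lambda x, y: (y - E_Y) ** 2)

rho = Cov_XY / (Var_X ** 0.5 * Var_Y ** 0.5)   # Pearson correlation coefficient

# Covariance matrix of the random vector (X, Y): all pairwise covariances
cov_matrix = [[Var_X, Cov_XY],
              [Cov_XY, Var_Y]]
print(Cov_XY, rho, cov_matrix)
```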

D.3 Sample estimates

Let \(\{x_1,...,x_n\}\) be a finite sample of size \(n\) of independent realizations of a random variable \(X\). Considered as random variables, the \(x_i\) are independent and identically distributed (i.i.d.).

The sample mean, often denoted \(\bar x\), is defined as:

\[\bar x = \frac{1}{n}\sum_i x_i\] The sample mean is an unbiased estimator of the expected value. That is, \(\operatorname{E}[\bar x] = \operatorname{E}[X]\).

The sample variance is defined as:

\[\sigma^2_x = \frac{1}{n}\sum_i (x_i-\bar x)^2\] The sample variance is not an unbiased estimator of the variance: its expectation is \(\operatorname{E}[\sigma^2_x] = \frac{n-1}{n}\operatorname{Var}[X]\). Therefore, one often uses the unbiased sample variance, defined as:

\[s_x^2 = \frac{1}{n-1}\sum_i (x_i-\bar x)^2\] for which \(\operatorname{E}[s^2_x] = \operatorname{Var}[X]\) holds.

The sample standard deviation and the unbiased sample standard deviation are defined as the square roots of their variance counterparts.
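As a quick illustration (with made-up numbers, using NumPy purely for convenience), the biased and unbiased sample variances correspond to NumPy's `ddof=0` and `ddof=1` settings:

```python
import numpy as np

# Sample mean, biased and unbiased sample variance of a small sample
# (numbers chosen for illustration).
x = np.array([2.1, 3.4, 1.9, 4.2, 3.0])
n = len(x)

x_bar = x.mean()                                    # sample mean
var_biased = ((x - x_bar) ** 2).sum() / n           # 1/n version (biased)
var_unbiased = ((x - x_bar) ** 2).sum() / (n - 1)   # 1/(n-1) version (unbiased)

# NumPy exposes both through the ddof ("delta degrees of freedom") argument:
assert np.isclose(var_biased, np.var(x, ddof=0))
assert np.isclose(var_unbiased, np.var(x, ddof=1))

sd_unbiased = np.sqrt(var_unbiased)                 # unbiased sample standard deviation
print(x_bar, var_biased, var_unbiased, sd_unbiased)
```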

The sample Pearson correlation coefficient is given by:

\[\begin{align} r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \end{align}\]

where \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\) is the sample mean, and analogously for \(\bar{y}\).
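The following NumPy sketch (illustrative data) evaluates this formula and cross-checks it against `np.corrcoef`:

```python
import numpy as np

# Sample Pearson correlation coefficient, computed from the formula above
# and cross-checked against np.corrcoef (illustrative data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])

x_c, y_c = x - x.mean(), y - y.mean()   # centered samples
r = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(r)
```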

D.4 Linear regression

Here we derive the estimates for univariate linear regression.

For a data set \((x_i, y_i)\) with \(i \in \{1, \dots, N\}\), the univariate linear model is defined as \[y_i = \alpha + \beta x_i + \epsilon_i\] with free parameters \(\alpha\) and \(\beta\) and random errors \(\epsilon_i \sim N(0, \sigma^2)\) that are independent and identically distributed (i.i.d.).

The normal distribution is defined as \[N(\epsilon | 0, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{\epsilon^2}{2 \sigma^2}) .\]

The assumption that the errors \(\epsilon_i\) are independent and identically distributed allows us to factorize the likelihood of the data under the linear model as \[L(\alpha, \beta, \sigma^2) = \prod_{i=1}^{N} N(\epsilon_i | 0, \sigma^2) ,\] using the fact that the joint probability of independent events is the product of the probabilities of the individual events.

We are interested in finding the parameters \(\alpha\), \(\beta\), and \(\sigma^2\) that best model our data. This can be achieved by finding the parameters that maximize the likelihood of the data. Since maximizing the likelihood is equivalent to maximizing the log likelihood, and the log likelihood is easier to handle, we work with the log likelihood in the following.

The log likelihood of our data is defined as follows:

\[\begin{align} \log(L(\alpha, \beta, \sigma^2)) & = \log\left( \prod_{i=1}^{N} N(\epsilon_i | 0, \sigma^2) \right) \\ & = \sum_{i=1}^{N} \log\left( N(y_i - (\alpha + \beta x_i) | 0, \sigma^2) \right) \\ & = -\frac{N}{2} \log(2 \pi \sigma^2) - \sum_{i=1}^{N} \frac{(y_i - (\alpha + \beta x_i))^2}{2 \sigma^2} . \end{align}\]
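Setting the partial derivatives of the log likelihood with respect to \(\alpha\), \(\beta\), and \(\sigma^2\) to zero gives the stationarity conditions:

\[\begin{align} \frac{\partial \log L}{\partial \alpha} & = \frac{1}{\sigma^2} \sum_{i=1}^{N} \left(y_i - (\alpha + \beta x_i)\right) = 0 \\ \frac{\partial \log L}{\partial \beta} & = \frac{1}{\sigma^2} \sum_{i=1}^{N} x_i \left(y_i - (\alpha + \beta x_i)\right) = 0 \\ \frac{\partial \log L}{\partial \sigma^2} & = -\frac{N}{2 \sigma^2} + \frac{1}{2 \sigma^4} \sum_{i=1}^{N} \left(y_i - (\alpha + \beta x_i)\right)^2 = 0 \end{align}\]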

The log likelihood is a quadratic function of \(\alpha\) and \(\beta\) for fixed \(\sigma^2\), so these conditions characterize its maximum. Solving them yields: \[\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}\] \[\hat{\beta} = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^N (x_i - \bar{x})^2}\] \[\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N \left(y_i - (\hat{\alpha} + \hat{\beta}x_i)\right)^2\]

with the sample means denoted by \(\bar{x}\) and \(\bar{y}\).
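A minimal NumPy sketch (with simulated data and arbitrarily chosen true parameters) that evaluates these closed-form estimates and cross-checks the slope and intercept against `np.polyfit`:

```python
import numpy as np

# Maximum-likelihood (least-squares) estimates for the univariate linear model,
# computed from the closed-form expressions above on simulated data.
rng = np.random.default_rng(0)
N = 200
x = rng.uniform(0, 10, size=N)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=N)   # true alpha=1.5, beta=0.8, sigma=0.5

x_bar, y_bar = x.mean(), y.mean()
beta_hat = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
alpha_hat = y_bar - beta_hat * x_bar
sigma2_hat = ((y - (alpha_hat + beta_hat * x)) ** 2).mean()   # residual sum of squares / N

# Cross-check the slope and intercept against NumPy's least-squares fit
slope, intercept = np.polyfit(x, y, deg=1)
assert np.isclose(beta_hat, slope) and np.isclose(alpha_hat, intercept)
print(alpha_hat, beta_hat, sigma2_hat)
```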

D.5 Resources

The chapters on probability and random variables of Rafael Irizarry’s book Introduction to Data Science give related primer material [https://rafalab.github.io/dsbook/].


  1. This loose definition suffices for our purposes. A rigorous mathematical definition of the expected value is more involved; see https://en.wikipedia.org/wiki/Expected_value