
Machine Learning - A Probabilistic Perspective

Preface

With the ever increasing amounts of data in electronic form, the need for automated methods for data analysis continues to grow. The goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest. Machine learning is thus closely related to the fields of statistics and data mining, but differs slightly in terms of its emphasis and terminology. This book provides a detailed introduction to the field, and includes worked examples drawn from application domains such as molecular biology, text processing, computer vision, and robotics.

Probability

probability theory

@(Joint probabilities)

We define the probability of the joint event A and B as follows:
$$P(A,B) = P(A\cap B) = P(A|B)P(B)$$
Given a joint distribution on two events P(A,B), we define the marginal distribution as follows:
$$P(A) = \sum_b P(A,B) = \sum_b P(A|B=b)P(B=b)$$
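As a concrete check of the product and sum rules, here is a small numpy sketch; the 2×2 joint table over two binary events A and B is made up purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(A, B) over two binary events,
# stored as joint[a, b]; the numbers are made up for illustration.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

# Product rule: P(A, B) = P(A | B) P(B).
p_b = joint.sum(axis=0)          # marginalize out A -> P(B)
p_a_given_b = joint / p_b        # P(A | B), one column per value of B
assert np.allclose(p_a_given_b * p_b, joint)

# Sum rule (marginalization): P(A) = sum_b P(A, B=b)
#                                  = sum_b P(A | B=b) P(B=b).
p_a = joint.sum(axis=1)
assert np.allclose(p_a, (p_a_given_b * p_b).sum(axis=1))
print(p_a)                       # [0.4 0.6]
```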

@(Mean and variance)

The most familiar property of a distribution is its mean, or expected value, denoted by $\mu$. For discrete rv's, it is defined as $E[X] = \sum_{x\in X} x\,p(x)$, and for continuous rv's, it is defined as $E[X] = \int_X x\,p(x)\,dx$.
The variance is a measure of the "spread" of a distribution, denoted by $\sigma^2$. It is defined as follows:
$$\begin{aligned}\operatorname{var}[X] &= E[(X-\mu)^2] = \int(x-\mu)^2 p(x)\,dx\\ &= \int x^2 p(x)\,dx + \mu^2\int p(x)\,dx - 2\mu\int x\,p(x)\,dx\\ &= E[X^2] - \mu^2\end{aligned}$$
from which we get the useful result $E[X^2] = \mu^2 + \operatorname{var}[X] = \mu^2 + \sigma^2$.
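A quick Monte Carlo sanity check of the identity $E[X^2] = \mu^2 + \sigma^2$; the choice of a Gamma(2, 3) distribution below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gamma(shape=2, scale=3) is an illustrative choice: its true mean is
# k * theta = 6 and its true variance is k * theta^2 = 18.
k, theta = 2.0, 3.0
x = rng.gamma(shape=k, scale=theta, size=1_000_000)

true_mu = k * theta
true_var = k * theta ** 2

# Sample estimate of E[X^2] vs. the identity mu^2 + var[X].
second_moment = (x ** 2).mean()
print(second_moment, true_mu ** 2 + true_var)   # both close to 54
```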

@(Transformations of random variables)

If $x \sim p(\cdot)$ is some random variable, and $y = f(x) = Ax + b$, then

$E[y] = E[Ax + b] = A\mu + b$, where $\mu = E[x]$

$\operatorname{cov}[y] = \operatorname{cov}[Ax + b] = A\varSigma A^T$, where $\varSigma = \operatorname{cov}[x]$
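These two identities are easy to verify numerically; the sketch below uses an arbitrary 2-d random vector and an affine map with a $3\times 2$ matrix $A$, all chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for x (mean and covariance) and the map y = Ax + b.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0, 2.0])

# Exact formulas: E[y] = A mu + b and cov[y] = A Sigma A^T.
mean_y = A @ mu + b
cov_y = A @ Sigma @ A.T

# Monte Carlo check using Gaussian samples with the given moments.
x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + b
print(np.allclose(y.mean(axis=0), mean_y, atol=0.05))         # True
print(np.allclose(np.cov(y, rowvar=False), cov_y, atol=0.2))  # True
```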

@(Multivariate change of variables *)

Let $f$ be a function that maps $\mathbb{R}^n$ to $\mathbb{R}^n$, and let $y = f(x)$. Then its Jacobian matrix $J$ is given by
$$J_{x\to y} = \frac{\partial(y_1,\cdots,y_n)}{\partial(x_1,\cdots,x_n)} = \begin{pmatrix}\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}\end{pmatrix}$$
$|\det J|$ measures how much a unit cube changes in volume when we apply $f$.
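To see this volume interpretation at work, the sketch below builds a finite-difference Jacobian for the polar-to-Cartesian map, whose $|\det J|$ is known to equal $r$; the `jacobian` helper and the evaluation point are my own additions, not from the text.

```python
import numpy as np

def f(x):
    """Polar -> Cartesian map: (r, theta) -> (r cos(theta), r sin(theta))."""
    r, theta = x
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def jacobian(f, x, eps=1e-6):
    """Central finite-difference estimate of J[i, j] = d y_i / d x_j."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

x = np.array([2.0, np.pi / 6])   # illustrative point (r, theta)
J = jacobian(f, x)

# A small cell of size dr * dtheta is stretched by a factor |det J| = r,
# matching the usual r dr dtheta area element.
print(np.abs(np.linalg.det(J)), x[0])   # both ~2.0
```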

Gaussian models

The pdf for an MVN (multivariate normal) in $D$ dimensions is defined by the following:

$$\mathcal{N}(\mathbf x|\mu, \varSigma) = \frac{1}{(2\pi)^{D/2}|\varSigma|^{1/2}}\exp\bigg[-\frac{1}{2}(\mathbf x - \mu)^T\varSigma^{-1}(\mathbf x - \mu)\bigg] \tag{1.0.1}$$
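Equation (1.0.1) can be evaluated directly and checked against scipy's reference implementation; the 2-d parameters below are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-d parameters and query point.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 2.5])

D = len(mu)
diff = x - mu
maha_sq = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
pdf = norm_const * np.exp(-0.5 * maha_sq)       # equation (1.0.1)

# Should match scipy's multivariate normal pdf.
print(pdf, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```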

The expression inside the exponent is the Mahalanobis distance between a data vector $\mathbf x$ and the mean vector $\mu$. We can gain a better understanding of this quantity by performing an eigendecomposition of $\varSigma$. That is, we write $\varSigma = U\varLambda U^T$, where $U$ is an orthonormal matrix of eigenvectors satisfying $U^TU = I$, and $\varLambda$ is a diagonal matrix of eigenvalues.

$$\varSigma^{-1} = U^{-T}\varLambda^{-1}U^{-1} = U\varLambda^{-1}U^T = \sum_{i=1}^D\frac{1}{\lambda_i}u_iu_i^T$$

$$\begin{aligned}(\mathbf x - \mu)^T\varSigma^{-1}(\mathbf x - \mu) &= (\mathbf x - \mu)^T\bigg(\sum_{i=1}^D\frac{1}{\lambda_i}u_iu_i^T\bigg)(\mathbf x - \mu)\\ &= \sum_{i=1}^D\frac{1}{\lambda_i}(\mathbf x - \mu)^Tu_iu_i^T(\mathbf x - \mu)\\ &= \sum_{i=1}^D\frac{y_i^2}{\lambda_i}\end{aligned}$$

where $y_i = u_i^T(\mathbf x - \mu)$. Recall that the equation for an ellipse in 2d is $\frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} = 1$, so the contours of equal probability density of a Gaussian lie along ellipses whose orientation is determined by the eigenvectors and whose elongation is determined by the eigenvalues.
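A short numerical check that the two forms of the Mahalanobis distance agree, using an arbitrary 2-d covariance matrix chosen for illustration.

```python
import numpy as np

# Illustrative 2-d mean, covariance, and query point.
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
x = np.array([1.0, -0.5])

# Eigendecomposition Sigma = U Lambda U^T (eigh is for symmetric matrices).
lam, U = np.linalg.eigh(Sigma)

diff = x - mu
# Mahalanobis distance computed directly via Sigma^{-1} ...
maha_direct = diff @ np.linalg.solve(Sigma, diff)
# ... and via the eigen form sum_i y_i^2 / lambda_i with y_i = u_i^T (x - mu).
y = U.T @ diff
maha_eigen = np.sum(y ** 2 / lam)

print(np.isclose(maha_direct, maha_eigen))   # True
```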