
Machine Learning - A Probabilistic Perspective

Preface

With the ever increasing amounts of data in electronic form, the need for automated methods for data analysis continues to grow. The goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest. Machine learning is thus closely related to the fields of statistics and data mining, but differs slightly in terms of its emphasis and terminology. This book provides a detailed introduction to the field, and includes worked examples drawn from application domains such as molecular biology, text processing, computer vision, and robotics.

Probability

probability theory

@(Joint probabilities)

We define the probability of the joint event A and B as follows:
$$P(A,B) = P(A\cap B) = P(A|B)P(B)$$
Given a joint distribution on two events P(A,B), we define the marginal distribution as follows:
$$P(A) = \sum_b P(A,B) = \sum_b P(A|B=b)P(B=b)$$
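As a concrete check of the product and sum rules, here is a small numpy sketch; the 2×2 joint table over two binary events A and B is made up purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(A, B) over two binary events,
# stored as joint[a, b]; the numbers are made up for illustration.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

# Product rule: P(A, B) = P(A | B) P(B).
p_b = joint.sum(axis=0)          # marginalize out A -> P(B)
p_a_given_b = joint / p_b        # P(A | B), one column per value of B
assert np.allclose(p_a_given_b * p_b, joint)

# Sum rule (marginalization): P(A) = sum_b P(A, B=b)
#                                  = sum_b P(A | B=b) P(B=b).
p_a = joint.sum(axis=1)
assert np.allclose(p_a, (p_a_given_b * p_b).sum(axis=1))
print(p_a)                       # [0.4 0.6]
```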

@(Mean and variance)

The most familiar property of a distribution is its mean, or expected value, denoted by $\mu$. For discrete rv's, it is defined as $E[X] = \sum_{x\in X} x\,p(x)$, and for continuous rv's, it is defined as $E[X] = \int_X x\,p(x)\,dx$.
The variance is a measure of the "spread" of a distribution, denoted by $\sigma^2$. It is defined as follows:
$$\begin{aligned}\operatorname{var}[X] &= E[(X-\mu)^2] = \int(x-\mu)^2 p(x)\,dx\\ &= \int x^2 p(x)\,dx + \mu^2\int p(x)\,dx - 2\mu\int x\,p(x)\,dx\\ &= E[X^2] - \mu^2\end{aligned}$$
from which we get the useful result $E[X^2] = \mu^2 + \operatorname{var}[X] = \mu^2 + \sigma^2$.
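A quick Monte Carlo sanity check of the identity $E[X^2] = \mu^2 + \sigma^2$; the choice of a Gamma(2, 3) distribution below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gamma(shape=2, scale=3) is an illustrative choice: its true mean is
# k * theta = 6 and its true variance is k * theta^2 = 18.
k, theta = 2.0, 3.0
x = rng.gamma(shape=k, scale=theta, size=1_000_000)

true_mu = k * theta
true_var = k * theta ** 2

# Sample estimate of E[X^2] vs. the identity mu^2 + var[X].
second_moment = (x ** 2).mean()
print(second_moment, true_mu ** 2 + true_var)   # both close to 54
```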

@(Transformations of random variables)

If $x \sim p(\cdot)$ is some random variable, and $y = f(x) = Ax + b$, then

$E[y] = E[Ax + b] = A\mu + b$, where $\mu = E[x]$

$\operatorname{cov}[y] = \operatorname{cov}[Ax + b] = A\varSigma A^T$, where $\varSigma = \operatorname{cov}[x]$
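These two identities are easy to verify numerically; the sketch below uses an arbitrary 2-d random vector and an affine map with a $3\times 2$ matrix $A$, all chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for x (mean and covariance) and the map y = Ax + b.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0, 2.0])

# Exact formulas: E[y] = A mu + b and cov[y] = A Sigma A^T.
mean_y = A @ mu + b
cov_y = A @ Sigma @ A.T

# Monte Carlo check using Gaussian samples with the given moments.
x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + b
print(np.allclose(y.mean(axis=0), mean_y, atol=0.05))         # True
print(np.allclose(np.cov(y, rowvar=False), cov_y, atol=0.2))  # True
```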

@(Multivariate change of variables *)

Let $f$ be a function that maps $\mathbb{R}^n$ to $\mathbb{R}^n$, and let $y = f(x)$. Then its Jacobian matrix $J$ is given by
$$J_{x\to y} = \frac{\partial(y_1,\cdots,y_n)}{\partial(x_1,\cdots,x_n)} = \begin{pmatrix}\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}\end{pmatrix}$$
$|\det J|$ measures how much a unit cube changes in volume when we apply $f$.
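To see this volume interpretation at work, the sketch below builds a finite-difference Jacobian for the polar-to-Cartesian map, whose $|\det J|$ is known to equal $r$; the `jacobian` helper and the evaluation point are my own additions, not from the text.

```python
import numpy as np

def f(x):
    """Polar -> Cartesian map: (r, theta) -> (r cos(theta), r sin(theta))."""
    r, theta = x
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def jacobian(f, x, eps=1e-6):
    """Central finite-difference estimate of J[i, j] = d y_i / d x_j."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

x = np.array([2.0, np.pi / 6])   # illustrative point (r, theta)
J = jacobian(f, x)

# A small cell of size dr * dtheta is stretched by a factor |det J| = r,
# matching the usual r dr dtheta area element.
print(np.abs(np.linalg.det(J)), x[0])   # both ~2.0
```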

Gaussian models

The pdf for an MVN (multivariate normal) in $D$ dimensions is defined by the following:

$$\mathcal{N}(\mathbf x|\mu, \varSigma) = \frac{1}{(2\pi)^{D/2}|\varSigma|^{1/2}}\exp\bigg[-\frac{1}{2}(\mathbf x - \mu)^T\varSigma^{-1}(\mathbf x - \mu)\bigg] \tag{1.0.1}$$
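Equation (1.0.1) can be evaluated directly and checked against scipy's reference implementation; the 2-d parameters below are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-d parameters and query point.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 2.5])

D = len(mu)
diff = x - mu
maha_sq = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
pdf = norm_const * np.exp(-0.5 * maha_sq)       # equation (1.0.1)

# Should match scipy's multivariate normal pdf.
print(pdf, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```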

The expression inside the exponent is the Mahalanobis distance between a data vector $\mathbf x$ and the mean vector $\mu$. We can gain a better understanding of this quantity by performing an eigendecomposition of $\varSigma$. That is, we write $\varSigma = U\varLambda U^T$, where $U$ is an orthonormal matrix of eigenvectors satisfying $U^TU = I$, and $\varLambda$ is a diagonal matrix of eigenvalues.

$$\varSigma^{-1} = U^{-T}\varLambda^{-1}U^{-1} = U\varLambda^{-1}U^T = \sum_{i=1}^D\frac{1}{\lambda_i}u_iu_i^T$$

$$\begin{aligned}(\mathbf x - \mu)^T\varSigma^{-1}(\mathbf x - \mu) &= (\mathbf x - \mu)^T\bigg(\sum_{i=1}^D\frac{1}{\lambda_i}u_iu_i^T\bigg)(\mathbf x - \mu)\\ &= \sum_{i=1}^D\frac{1}{\lambda_i}(\mathbf x - \mu)^Tu_iu_i^T(\mathbf x - \mu)\\ &= \sum_{i=1}^D\frac{y_i^2}{\lambda_i}\end{aligned}$$

where $y_i = u_i^T(\mathbf x - \mu)$. Recall that the equation for an ellipse in 2d is $\frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} = 1$, so the contours of equal probability density of a Gaussian lie along ellipses whose orientation is determined by the eigenvectors and whose elongation is determined by the eigenvalues.
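A short numerical check that the two forms of the Mahalanobis distance agree, using an arbitrary 2-d covariance matrix chosen for illustration.

```python
import numpy as np

# Illustrative 2-d mean, covariance, and query point.
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
x = np.array([1.0, -0.5])

# Eigendecomposition Sigma = U Lambda U^T (eigh is for symmetric matrices).
lam, U = np.linalg.eigh(Sigma)

diff = x - mu
# Mahalanobis distance computed directly via Sigma^{-1} ...
maha_direct = diff @ np.linalg.solve(Sigma, diff)
# ... and via the eigen form sum_i y_i^2 / lambda_i with y_i = u_i^T (x - mu).
y = U.T @ diff
maha_eigen = np.sum(y ** 2 / lam)

print(np.isclose(maha_direct, maha_eigen))   # True
```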