
The Elements of Statistical Learning (统计学习精要)

Overview of Supervised Learning

Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

We develop two simple but powerful prediction methods: the linear model fit by least squares and the k-nearest-neighbor prediction rule. The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions. The method of k-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

Linear Models and Least Squares

Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output $Y$ via the model

$$\hat Y = \hat\beta_0 + \sum_{j=1}^p X_j\hat\beta_j \tag{1.0.1}$$

The term $\hat\beta_0$ is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in $X$, include $\hat\beta_0$ in the vector of coefficients $\hat\beta$, and then write the linear model in vector form as an inner product

$$\hat Y = X^T\hat\beta \tag{1.0.2}$$

where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $Y$ is a scalar; in general $Y$ can be a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients.

In the least squares approach, we pick the coefficients $\beta$ to minimize the residual sum of squares

$$RSS(\beta) = \sum_{i=1}^N (y_i - x_i^T\beta)^2$$

or, in matrix notation,

$$RSS(\beta) = (Y - X\beta)^T(Y - X\beta)$$

where $X$ is the matrix whose rows are the training inputs and $Y$ is the vector of training outputs.

In general $Y = X\beta$ has no exact solution, so we minimize $RSS(\beta)$ instead; differentiating and setting the result to zero gives the normal equations $X^T(Y - X\beta) = 0$, i.e. $X^TY = X^TX\beta$. If $X^TX$ is nonsingular, the unique solution is given by

$$\hat\beta = (X^TX)^{-1}X^TY \tag{1.0.3}$$
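A minimal sketch of (1.0.3) on synthetic data (the sample size, coefficients, and noise level below are illustrative assumptions), solving the normal equations with NumPy rather than forming the inverse explicitly:

```python
import numpy as np

# Fit the linear model by least squares (1.0.3) on synthetic data.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # constant variable 1 included in X
beta_true = np.array([0.5, 1.0, -2.0, 3.0])                 # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction at an arbitrary input (with a leading 1): Y_hat = x^T beta_hat
x_new = np.array([1.0, 0.2, -0.1, 0.4])
print(beta_hat, x_new @ beta_hat)
```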

Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set $\mathcal{T}$ closest in input space to $x$ to form $\hat Y$. Specifically, the $k$-nearest-neighbor fit for $\hat Y$ is defined as follows:

$$\hat Y(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i \tag{1.0.4}$$

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.
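A small sketch of the $k$-nearest-neighbor fit (1.0.4), assuming Euclidean distance in input space; the training data and the choice of $k$ are illustrative:

```python
import numpy as np

def knn_fit(x, X_train, y_train, k):
    """k-nearest-neighbor fit (1.0.4): average y_i over the k closest x_i."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to x
    neighborhood = np.argsort(dists)[:k]          # indices forming N_k(x)
    return y_train[neighborhood].mean()

# Illustrative usage on synthetic training data.
rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(200, 2))
y_train = np.sin(3 * X_train[:, 0]) + rng.normal(scale=0.1, size=200)
print(knn_fit(np.array([0.3, -0.2]), X_train, y_train, k=15))
```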

Statistical Decision Theory

Local Methods in High Dimensions

Suppose we know that the relationship between $Y$ and $X$ is linear, $Y = X^T\beta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and we fit the model by least squares to the training data. For an arbitrary test point $x_0$ we have $\hat y_0 = x_0^T\hat\beta$, which can be written as $\hat y_0 = x_0^T\beta + \sum_{i=1}^N \ell_i(x_0)\epsilon_i$, where $\ell_i(x_0)$ is the $i$th element of $X(X^TX)^{-1}x_0$. Under this model the least squares estimates are unbiased, so the prediction error at $x_0$ comes only from the noise and the variance of $\hat y_0$.
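Continuing the book's argument (a sketch; $E_{\mathcal{T}}$ denotes expectation over training sets), the expected prediction error at $x_0$ then splits into irreducible noise, a variance term, and a bias term that vanishes:

$$\begin{aligned} EPE(x_0) &= E_{y_0|x_0}\,E_{\mathcal{T}}\,(y_0 - \hat y_0)^2 \\ &= \mathrm{Var}(y_0|x_0) + E_{\mathcal{T}}\big[\hat y_0 - E_{\mathcal{T}}\hat y_0\big]^2 + \big[E_{\mathcal{T}}\hat y_0 - x_0^T\beta\big]^2 \\ &= \sigma^2 + E_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2 + 0^2 \end{aligned}$$

The variance depends on the dimension only through $x_0^T(X^TX)^{-1}x_0$; for large $N$ with centered inputs it behaves roughly like $\sigma^2 p/N$, so least squares largely escapes the curse of dimensionality that plagues local methods, at the price of assuming the linear model is correct.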

Statistical Models, Supervised Learning and Function Approximation

Suppose in fact that our data arose from a statistical model $Y = f(X) + \epsilon$, where the random error $\epsilon$ has $E(\epsilon) = 0$ and is independent of $X$.
While least squares is generally very convenient, it is not the only criterion used, and in some cases it would not make much sense. A more general principle for estimation is maximum likelihood estimation. Suppose we have a random sample $y_i,\ i = 1, \ldots, N$ from a density $\Pr_\theta(y)$ indexed by some parameters $\theta$. The log-probability of the observed sample is $L(\theta) = \sum_{i=1}^N \log \Pr_\theta(y_i)$.

The principle of maximum likelihood assumes that the most reasonable values for $\theta$ are those for which the probability of the observed sample is largest. Least squares for the additive error model $Y = f_\theta(X) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, is equivalent to maximum likelihood using the conditional likelihood $\Pr(Y|X, \theta) = \mathcal{N}(f_\theta(X), \sigma^2)$. So although the additional assumption of normality seems more restrictive, the results are the same. The log-likelihood of the data is

$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f_\theta(x_i))^2 \tag{1.0.5}$$

and the only term involving $\theta$ is the last, which is $RSS(\theta)$ up to a scalar negative multiplier.
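A quick numerical sketch of that equivalence, assuming a one-parameter model $f_\theta(x) = \theta x$ and synthetic Gaussian data: over a grid of $\theta$ values, the maximizer of the log-likelihood (1.0.5) coincides with the minimizer of $RSS(\theta)$.

```python
import numpy as np

# For f_theta(x) = theta * x with Gaussian errors (assumed model), the theta
# maximizing the log-likelihood (1.0.5) equals the theta minimizing RSS(theta).
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.7 * x + rng.normal(scale=0.5, size=200)
sigma = 0.5                                     # noise level treated as known here

thetas = np.linspace(0.0, 3.0, 1001)
rss = np.array([np.sum((y - t * x) ** 2) for t in thetas])
loglik = -len(y) / 2 * np.log(2 * np.pi) - len(y) * np.log(sigma) - rss / (2 * sigma ** 2)

# Only the last term of L(theta) depends on theta, so the argmax matches the argmin of RSS.
print(thetas[np.argmin(rss)], thetas[np.argmax(loglik)])
```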

Linear Methods for Regression

We have an input vector $X^T = (X_1, X_2, \ldots, X_p)$ and want to predict a real-valued output $Y$. The linear regression model has the form

$$f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j = \sum_{j=0}^p X_j\beta_j \tag{2.0.1}$$

where the second form absorbs the intercept by taking $X_0 \equiv 1$.

Typically we have a set of training data $(x_1, y_1), \ldots, (x_N, y_N)$ from which to estimate the parameters $\beta$. Each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is a vector of feature measurements for the $i$th case. The most popular estimation method is least squares, in which we pick the coefficients $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ to minimize the residual sum of squares

$$\begin{aligned} RSS(\beta) &= \sum_{i=1}^N (y_i - f(x_i))^2 \\ &= \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 \end{aligned}$$

Least squares fitting has a simple geometric picture in the $(p+1)$-dimensional space occupied by the pairs $(X, Y)$. Denote by $X$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and similarly let $Y$ be the $N$-vector of outputs in the training set. Then we can write the residual sum-of-squares as

$$RSS(\beta) = (Y - X\beta)^T(Y - X\beta)$$

This is a quadratic function in the $p + 1$ parameters. Differentiating with respect to $\beta$ we obtain

$$\frac{\partial RSS}{\partial \beta} = -2X^T(Y - X\beta)$$

$$\frac{\partial^2 RSS}{\partial \beta\,\partial \beta^T} = 2X^TX$$

Assuming (for the moment) that $X$ has full column rank, and hence $X^TX$ is positive definite, we set the first derivative to zero:

$$X^T(Y - X\beta) = 0 \;\Rightarrow\; \hat\beta = (X^TX)^{-1}X^TY \tag{2.0.2}$$
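A short check of (2.0.2) on synthetic data (shapes and noise level are illustrative): at $\hat\beta$ the gradient $-2X^T(Y - X\hat\beta)$ vanishes up to round-off, and the normal-equation solution agrees with `np.linalg.lstsq`, which is the numerically safer choice when $X^TX$ is close to singular.

```python
import numpy as np

# Solve (2.0.2) and check the first-order condition X^T(Y - X beta_hat) = 0.
rng = np.random.default_rng(3)
N, p = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # leading column of 1s
Y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.3, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)                 # normal equations
gradient = -2 * X.T @ (Y - X @ beta_hat)                     # ~0 at the minimizer
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)           # QR/SVD-based alternative

print(np.max(np.abs(gradient)))           # round-off sized
print(np.allclose(beta_hat, beta_lstsq))  # True
```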

Up to now we have made minimal assumptions about the true distribution of the data. In order to pin down the sampling properties of $\hat\beta$, we now assume that the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed (non-random). The variance-covariance matrix of the least squares parameter estimates is easily derived from (2.0.2) and is given by

$$\mathrm{Var}(\hat\beta) = (X^TX)^{-1}\sigma^2$$

Typically one estimates the variance $\sigma^2$ by

$$\hat\sigma^2 = \frac{1}{N-p-1}\sum_{i=1}^N (y_i - \hat y_i)^2$$

The $N - p - 1$ rather than $N$ in the denominator makes $\hat\sigma^2$ an unbiased estimate of $\sigma^2$: $E(\hat\sigma^2) = \sigma^2$.
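A sketch estimating these quantities on simulated data (the true $\sigma$ and dimensions are assumed for illustration); the square roots of the diagonal of $(X^TX)^{-1}\hat\sigma^2$ are the usual standard errors of the coefficients.

```python
import numpy as np

# Unbiased noise estimate and the covariance of beta_hat on simulated data.
rng = np.random.default_rng(4)
N, p = 200, 3
sigma = 0.8                                                  # true noise level (assumed)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta = np.array([1.0, 0.5, -1.5, 2.0])
y = X @ beta + rng.normal(scale=sigma, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)                     # E(sigma2_hat) = sigma^2
cov_beta_hat = np.linalg.inv(X.T @ X) * sigma2_hat           # estimated Var(beta_hat)

print(sigma2_hat, np.sqrt(np.diag(cov_beta_hat)))            # noise estimate, standard errors
```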

Linear Methods for Classification

Basis Expansions and Regularization