
The Elements of Statistical Learning (统计学习精要)

Overview of Supervised Learning

Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

We develop two simple but powerful prediction methods: the linear model fit by least squares and the k-nearest-neighbor prediction rule. The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions. The method of k-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

Linear Models and Least Squares

Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output $Y$ via the model

$$\hat Y = \hat\beta_0 + \sum_{j=1}^p X_j\hat\beta_j \tag{1.0.1}$$

The term $\hat\beta_0$ is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in $X$, include $\hat\beta_0$ in the vector of coefficients $\hat\beta$, and then write the linear model in vector form as an inner product

$$\hat Y = X^T\hat\beta \tag{1.0.2}$$

where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $Y$ is a scalar; in general $Y$ can be a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients.

In the least squares approach, we pick the coefficients $\beta$ to minimize the residual sum of squares

$$RSS(\beta) = \sum_{i=1}^N (y_i - x_i^T\beta)^2$$

or, in matrix notation,

$$RSS(\beta) = (Y - X\beta)^T(Y - X\beta)$$

where $X$ is the matrix whose rows are the training inputs and $Y$ is the vector of training outputs.

In general $Y = X\beta$ has no exact solution, so we minimize $RSS(\beta)$ instead; differentiating and setting the result to zero gives the normal equations $X^T(Y - X\beta) = 0$, i.e. $X^TY = X^TX\beta$. If $X^TX$ is nonsingular, the unique solution is given by

$$\hat\beta = (X^TX)^{-1}X^TY \tag{1.0.3}$$
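A minimal sketch of (1.0.3) on synthetic data (the sample size, coefficients, and noise level below are illustrative assumptions), solving the normal equations with NumPy rather than forming the inverse explicitly:

```python
import numpy as np

# Fit the linear model by least squares (1.0.3) on synthetic data.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # constant variable 1 included in X
beta_true = np.array([0.5, 1.0, -2.0, 3.0])                 # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction at an arbitrary input (with a leading 1): Y_hat = x^T beta_hat
x_new = np.array([1.0, 0.2, -0.1, 0.4])
print(beta_hat, x_new @ beta_hat)
```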

Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set $\mathcal{T}$ closest in input space to $x$ to form $\hat Y$. Specifically, the $k$-nearest-neighbor fit for $\hat Y$ is defined as follows:

$$\hat Y(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i \tag{1.0.4}$$

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.
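A small sketch of the $k$-nearest-neighbor fit (1.0.4), assuming Euclidean distance in input space; the training data and the choice of $k$ are illustrative:

```python
import numpy as np

def knn_fit(x, X_train, y_train, k):
    """k-nearest-neighbor fit (1.0.4): average y_i over the k closest x_i."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to x
    neighborhood = np.argsort(dists)[:k]          # indices forming N_k(x)
    return y_train[neighborhood].mean()

# Illustrative usage on synthetic training data.
rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(200, 2))
y_train = np.sin(3 * X_train[:, 0]) + rng.normal(scale=0.1, size=200)
print(knn_fit(np.array([0.3, -0.2]), X_train, y_train, k=15))
```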

Statistical Decision Theory

Local Methods in High Dimensions

Suppose we know that the relationship between $Y$ and $X$ is linear, $Y = X^T\beta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and we fit the model by least squares to the training data. For an arbitrary test point $x_0$ we have $\hat y_0 = x_0^T\hat\beta$, which can be written as $\hat y_0 = x_0^T\beta + \sum_{i=1}^N \ell_i(x_0)\epsilon_i$, where $\ell_i(x_0)$ is the $i$th element of $X(X^TX)^{-1}x_0$. Under this model the least squares estimates are unbiased, so the prediction error at $x_0$ comes only from the noise and the variance of $\hat y_0$.
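Continuing the book's argument (a sketch; $E_{\mathcal{T}}$ denotes expectation over training sets), the expected prediction error at $x_0$ then splits into irreducible noise, a variance term, and a bias term that vanishes:

$$\begin{aligned} EPE(x_0) &= E_{y_0|x_0}\,E_{\mathcal{T}}\,(y_0 - \hat y_0)^2 \\ &= \mathrm{Var}(y_0|x_0) + E_{\mathcal{T}}\big[\hat y_0 - E_{\mathcal{T}}\hat y_0\big]^2 + \big[E_{\mathcal{T}}\hat y_0 - x_0^T\beta\big]^2 \\ &= \sigma^2 + E_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2 + 0^2 \end{aligned}$$

The variance depends on the dimension only through $x_0^T(X^TX)^{-1}x_0$; for large $N$ with centered inputs it behaves roughly like $\sigma^2 p/N$, so least squares largely escapes the curse of dimensionality that plagues local methods, at the price of assuming the linear model is correct.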

Statistical Models, Supervised Learning and Function Approximation

Suppose in fact that our data arose from a statistical model $Y = f(X) + \epsilon$, where the random error $\epsilon$ has $E(\epsilon) = 0$ and is independent of $X$.
While least squares is generally very convenient, it is not the only criterion used, and in some cases it would not make much sense. A more general principle for estimation is maximum likelihood estimation. Suppose we have a random sample $y_i,\ i = 1, \ldots, N$ from a density $\Pr_\theta(y)$ indexed by some parameters $\theta$. The log-probability of the observed sample is $L(\theta) = \sum_{i=1}^N \log \Pr_\theta(y_i)$.

The principle of maximum likelihood assumes that the most reasonable values for $\theta$ are those for which the probability of the observed sample is largest. Least squares for the additive error model $Y = f_\theta(X) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, is equivalent to maximum likelihood using the conditional likelihood $\Pr(Y|X, \theta) = \mathcal{N}(f_\theta(X), \sigma^2)$. So although the additional assumption of normality seems more restrictive, the results are the same. The log-likelihood of the data is

$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f_\theta(x_i))^2 \tag{1.0.5}$$

and the only term involving $\theta$ is the last, which is $RSS(\theta)$ up to a scalar negative multiplier.
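A quick numerical sketch of that equivalence, assuming a one-parameter model $f_\theta(x) = \theta x$ and synthetic Gaussian data: over a grid of $\theta$ values, the maximizer of the log-likelihood (1.0.5) coincides with the minimizer of $RSS(\theta)$.

```python
import numpy as np

# For f_theta(x) = theta * x with Gaussian errors (assumed model), the theta
# maximizing the log-likelihood (1.0.5) equals the theta minimizing RSS(theta).
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.7 * x + rng.normal(scale=0.5, size=200)
sigma = 0.5                                     # noise level treated as known here

thetas = np.linspace(0.0, 3.0, 1001)
rss = np.array([np.sum((y - t * x) ** 2) for t in thetas])
loglik = -len(y) / 2 * np.log(2 * np.pi) - len(y) * np.log(sigma) - rss / (2 * sigma ** 2)

# Only the last term of L(theta) depends on theta, so the argmax matches the argmin of RSS.
print(thetas[np.argmin(rss)], thetas[np.argmax(loglik)])
```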

Linear Methods for Regression

We have an input vector $X^T = (X_1, X_2, \ldots, X_p)$ and want to predict a real-valued output $Y$. The linear regression model has the form

$$f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j = \sum_{j=0}^p X_j\beta_j \tag{2.0.1}$$

where the second form absorbs the intercept by taking $X_0 \equiv 1$.

Typically we have a set of training data $(x_1, y_1), \ldots, (x_N, y_N)$ from which to estimate the parameters $\beta$. Each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is a vector of feature measurements for the $i$th case. The most popular estimation method is least squares, in which we pick the coefficients $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ to minimize the residual sum of squares

$$\begin{aligned} RSS(\beta) &= \sum_{i=1}^N (y_i - f(x_i))^2 \\ &= \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 \end{aligned}$$

Least squares fitting has a simple geometric picture in the $(p+1)$-dimensional space occupied by the pairs $(X, Y)$. Denote by $X$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and similarly let $Y$ be the $N$-vector of outputs in the training set. Then we can write the residual sum-of-squares as

$$RSS(\beta) = (Y - X\beta)^T(Y - X\beta)$$

This is a quadratic function in the $p + 1$ parameters. Differentiating with respect to $\beta$ we obtain

$$\frac{\partial RSS}{\partial \beta} = -2X^T(Y - X\beta)$$

$$\frac{\partial^2 RSS}{\partial \beta\,\partial \beta^T} = 2X^TX$$

Assuming (for the moment) that $X$ has full column rank, and hence $X^TX$ is positive definite, we set the first derivative to zero:

$$X^T(Y - X\beta) = 0 \;\Rightarrow\; \hat\beta = (X^TX)^{-1}X^TY \tag{2.0.2}$$
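A short check of (2.0.2) on synthetic data (shapes and noise level are illustrative): at $\hat\beta$ the gradient $-2X^T(Y - X\hat\beta)$ vanishes up to round-off, and the normal-equation solution agrees with `np.linalg.lstsq`, which is the numerically safer choice when $X^TX$ is close to singular.

```python
import numpy as np

# Solve (2.0.2) and check the first-order condition X^T(Y - X beta_hat) = 0.
rng = np.random.default_rng(3)
N, p = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # leading column of 1s
Y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.3, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)                 # normal equations
gradient = -2 * X.T @ (Y - X @ beta_hat)                     # ~0 at the minimizer
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)           # QR/SVD-based alternative

print(np.max(np.abs(gradient)))           # round-off sized
print(np.allclose(beta_hat, beta_lstsq))  # True
```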

Up to now we have made minimal assumptions about the true distribution of the data. In order to pin down the sampling properties of $\hat\beta$, we now assume that the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed (non-random). The variance-covariance matrix of the least squares parameter estimates is easily derived from (2.0.2) and is given by

$$\mathrm{Var}(\hat\beta) = (X^TX)^{-1}\sigma^2$$

Typically one estimates the variance $\sigma^2$ by

$$\hat\sigma^2 = \frac{1}{N-p-1}\sum_{i=1}^N (y_i - \hat y_i)^2$$

The $N - p - 1$ rather than $N$ in the denominator makes $\hat\sigma^2$ an unbiased estimate of $\sigma^2$: $E(\hat\sigma^2) = \sigma^2$.
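A sketch estimating these quantities on simulated data (the true $\sigma$ and dimensions are assumed for illustration); the square roots of the diagonal of $(X^TX)^{-1}\hat\sigma^2$ are the usual standard errors of the coefficients.

```python
import numpy as np

# Unbiased noise estimate and the covariance of beta_hat on simulated data.
rng = np.random.default_rng(4)
N, p = 200, 3
sigma = 0.8                                                  # true noise level (assumed)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta = np.array([1.0, 0.5, -1.5, 2.0])
y = X @ beta + rng.normal(scale=sigma, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)                     # E(sigma2_hat) = sigma^2
cov_beta_hat = np.linalg.inv(X.T @ X) * sigma2_hat           # estimated Var(beta_hat)

print(sigma2_hat, np.sqrt(np.diag(cov_beta_hat)))            # noise estimate, standard errors
```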

Linear Methods for Classification

Basis Expansions and Regularization