Overview of Supervised Learning
Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors
Here we develop two simple but powerful prediction methods: the linear model fit by least squares and the $k$-nearest-neighbor prediction rule. The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions. The method of $k$-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.
Linear Models and Least Squares
Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output $Y$ via the model
$$\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j.$$
The term $\hat\beta_0$ is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in $X$, include $\hat\beta_0$ in the vector of coefficients $\hat\beta$, and then write the linear model in vector form as an inner product
$$\hat{Y} = X^T\hat\beta,$$
where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients.
In the least squares approach, we pick the coefficients $\beta$ to minimize the residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T\beta)^2.$$
In matrix notation this is $\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$, and differentiating with respect to $\beta$ gives the normal equations $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$. If $\mathbf{X}^T\mathbf{X}$ is singular, these equations have no unique solution; if $\mathbf{X}^T\mathbf{X}$ is nonsingular, then the unique solution is given by
$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.$$
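As a concrete illustration, here is a minimal NumPy sketch of this fit, assuming $\mathbf{X}^T\mathbf{X}$ is nonsingular; the toy data and names (X, y, beta_hat, x0) are illustrative, not from the text.

```python
import numpy as np

# Toy training data: N = 100 observations, p = 3 inputs (illustrative only).
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # prepend the constant 1
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Unique least squares solution when X^T X is nonsingular:
# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction at a new input x0 (with the leading 1 included): y_hat = x0^T beta_hat
x0 = np.array([1.0, 0.2, -0.3, 0.1])
y_hat = x0 @ beta_hat
```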
Nearest-Neighbor Methods
Nearest-neighbor methods use those observations in the training set $\mathcal{T}$ closest in input space to $x$ to form $\hat Y$. Specifically, the $k$-nearest neighbor fit for $\hat Y$ is defined as follows:
$$\hat Y(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i,$$
where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.
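A minimal sketch of this rule as a function, assuming Euclidean distance in input space; the function name knn_predict and the toy data are illustrative.

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k=15):
    """k-nearest-neighbor fit: average y over the k training points closest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    neighbors = np.argsort(dists)[:k]              # indices of the k closest points
    return y_train[neighbors].mean()               # (1/k) * sum of their y values

# Illustrative usage with synthetic training data
rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(200, 2))
y_train = np.sin(3 * X_train[:, 0]) + rng.normal(scale=0.1, size=200)
print(knn_predict(np.array([0.0, 0.0]), X_train, y_train, k=15))
```

Larger $k$ averages over more neighbors and gives smoother, more stable fits; smaller $k$ tracks the training data more closely.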
Statistical Decision Theory
Local Methods in High Dimensions
Suppose that we know that the relationship between $Y$ and $X$ is linear,
$$Y = X^T\beta + \varepsilon,$$
where $\varepsilon \sim N(0, \sigma^2)$, and we fit the model by least squares to the training data. For an arbitrary test point $x_0$, we have $\hat y_0 = x_0^T\hat\beta$, which can be written as $\hat y_0 = x_0^T\beta + \sum_{i=1}^{N} \ell_i(x_0)\varepsilon_i$, where $\ell_i(x_0)$ is the $i$th element of $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0$. Under this model the least squares estimates are unbiased.
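A rough simulation sketch of this unbiasedness, under the stated model with illustrative dimensions and noise level: averaging $\hat y_0$ over many training sets should recover the true mean $x_0^T\beta$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 50, 5, 1.0
beta = rng.normal(size=p)
x0 = rng.normal(size=p)                      # arbitrary test point

preds = []
for _ in range(2000):                        # repeated training sets
    X = rng.normal(size=(N, p))
    eps = rng.normal(scale=sigma, size=N)
    y = X @ beta + eps                       # linear model Y = X^T beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds.append(x0 @ beta_hat)              # y_hat_0 = x0^T beta_hat

# Unbiasedness: the average prediction is close to the true mean x0^T beta
print(np.mean(preds), x0 @ beta)
```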
Statistical Models, Supervised Learning and Function Approximation
Suppose in fact that our data arose from a statistical model
$$Y = f(X) + \varepsilon,$$
where the random error $\varepsilon$ has $E(\varepsilon) = 0$ and is independent of $X$.
While least squares is generally very convenient, it is not the only criterion used, and in some cases it would not make much sense. A more general principle for estimation is maximum likelihood estimation. Suppose we have a random sample $y_i$, $i = 1, \ldots, N$, from a density $\Pr_\theta(y)$ indexed by some parameters $\theta$. The log-probability of the observed sample is
$$L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i).$$
The principle of maximum likelihood assumes that the most reasonable values for $\theta$ are those for which the probability of the observed sample is largest. Least squares for the additive error model $Y = f_\theta(X) + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$, is equivalent to maximum likelihood using the conditional likelihood
$$\Pr(Y \mid X, \theta) = N(f_\theta(X), \sigma^2).$$
So although the additional assumption of normality seems more restrictive, the results are the same. The log-likelihood of the data is
$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - f_\theta(x_i)\bigr)^2,$$
and the only term involving $\theta$ is the last, which is $\mathrm{RSS}(\theta)$ up to a scalar negative multiplier.
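A minimal numerical sketch of this equivalence for a linear $f_\theta(x) = x^T\theta$: the Gaussian log-likelihood and $-\mathrm{RSS}(\theta)/(2\sigma^2)$ differ only by a constant not involving $\theta$, so maximizing one is the same as minimizing the other. The function names and toy data are illustrative.

```python
import numpy as np

def rss(theta, X, y):
    resid = y - X @ theta
    return resid @ resid

def gaussian_log_likelihood(theta, X, y, sigma):
    N = len(y)
    return (-N / 2 * np.log(2 * np.pi) - N * np.log(sigma)
            - rss(theta, X, y) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=40)
sigma = 0.3

# The sum below is the same for every theta: the log-likelihood equals
# -RSS(theta)/(2 sigma^2) plus a theta-independent constant.
for theta in (np.zeros(3), np.array([1.0, -2.0, 0.5])):
    print(gaussian_log_likelihood(theta, X, y, sigma) + rss(theta, X, y) / (2 * sigma ** 2))
```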
Linear Methods for Regression
We have an input vector $X^T = (X_1, X_2, \ldots, X_p)$, and want to predict a real-valued output $Y$. The linear regression model has the form
$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j\beta_j.$$
Typically we have a set of training data $(x_1, y_1), \ldots, (x_N, y_N)$ from which to estimate the parameters $\beta$. Each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is a vector of feature measurements for the $i$th case. The most popular estimation method is least squares, in which we pick the coefficients $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ to minimize the residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2.$$
Linear least squares fitting can be viewed geometrically in the $(p+1)$-dimensional space occupied by the pairs $(X, Y)$. Denote by $\mathbf{X}$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and similarly let $\mathbf{y}$ be the $N$-vector of outputs in the training set. Then we can write the residual sum-of-squares as
$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta).$$
This is a quadratic function in the $p + 1$ parameters. Differentiating with respect to $\beta$ we obtain
$$\frac{\partial \mathrm{RSS}}{\partial\beta} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta), \qquad \frac{\partial^2 \mathrm{RSS}}{\partial\beta\,\partial\beta^T} = 2\mathbf{X}^T\mathbf{X}.$$
Assuming (for the moment) that $\mathbf{X}$ has full column rank, and hence $\mathbf{X}^T\mathbf{X}$ is positive definite, we set the first derivative to zero,
$$\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0,$$
to obtain the unique solution
$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.$$
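A sketch of this computation in matrix form, assuming full column rank; in practice a dedicated least squares solver such as np.linalg.lstsq (QR/SVD based) is usually preferred to forming the inverse explicitly. The data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 60, 4
features = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), features])      # N x (p+1), with a 1 in the first position
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=N)

# Setting the first derivative -2 X^T (y - X beta) to zero gives the normal equations
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: a least squares solver rather than an explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))   # True when X has full column rank
```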
Up to now we have made minimal assumptions about the true distribution of the data. In order to pin down the sampling properties of $\hat\beta$, we now assume that the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed (non-random). The variance–covariance matrix of the least squares parameter estimates is easily derived from (2.0.2) and is given by
$$\mathrm{Var}(\hat\beta) = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2.$$
Typically one estimates the variance $\sigma^2$ by
$$\hat\sigma^2 = \frac{1}{N - p - 1}\sum_{i=1}^{N}(y_i - \hat y_i)^2.$$
The $N - p - 1$ rather than $N$ in the denominator makes $\hat\sigma^2$ an unbiased estimate of $\sigma^2$: $E(\hat\sigma^2) = \sigma^2$.
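A short sketch of these two estimates on illustrative data; here $\sigma^2$ in the variance formula is replaced by its unbiased estimate $\hat\sigma^2$, and the square roots of the diagonal give the usual coefficient standard errors.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, sigma = 80, 3, 0.7
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # includes the intercept column
y = X @ rng.normal(size=p + 1) + rng.normal(scale=sigma, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# Unbiased estimate of sigma^2: divide by N - p - 1, not N
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)

# Estimated variance-covariance matrix: Var(beta_hat) = (X^T X)^{-1} sigma^2,
# with sigma^2 replaced by its estimate
var_beta_hat = np.linalg.inv(X.T @ X) * sigma2_hat
print(np.sqrt(np.diag(var_beta_hat)))            # standard errors of the coefficients
```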