
Applied Math and Machine Learning Basics

Linear Algebra

@(Eigendecomposition)

An eigenvector of a square matrix $A$ is a nonzero vector $v$ such that multiplication by $A$ only scales it: $Av = \lambda v$.
The scalar $\lambda$ is called the eigenvalue corresponding to this eigenvector.

Suppose the matrix $A$ has $n$ linearly independent eigenvectors $\{v^{(1)}, \ldots, v^{(n)}\}$ with corresponding eigenvalues $\{\lambda_1, \ldots, \lambda_n\}$. We concatenate the eigenvectors into a matrix with one eigenvector per column, $V = [v^{(1)}, \ldots, v^{(n)}]$, and likewise concatenate the eigenvalues into a vector $\lambda = [\lambda_1, \ldots, \lambda_n]^\top$. The eigendecomposition of $A$ can then be written as

$$A = V\operatorname{diag}(\lambda)V^{-1}$$

For a real symmetric matrix, the eigenvectors can be chosen to form an orthogonal matrix $Q$, giving $A_{\text{symmetric}} = Q\Lambda Q^\top$.
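
As a quick numerical illustration (a minimal NumPy sketch of my own, not from the original notes, using a random symmetric matrix so the decomposition is well behaved):

```python
import numpy as np

# A random symmetric matrix, so the eigenvectors can be chosen orthonormal.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T

lam, V = np.linalg.eig(A)                     # columns of V are eigenvectors
A_rec = V @ np.diag(lam) @ np.linalg.inv(V)
print(np.allclose(A, A_rec))                  # True: A = V diag(lambda) V^{-1}

# Symmetric case: orthogonal Q and real eigenvalues, A = Q Lambda Q^T.
w, Q = np.linalg.eigh(A)
print(np.allclose(A, Q @ np.diag(w) @ Q.T))   # True
```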

@(Singular Value Decomposition)[SVD]

The SVD factors a matrix $A$ into a product of three matrices: $A = UDV^\top$.
If $A$ is an $m \times n$ matrix, then $U$ is an $m \times m$ matrix, $D$ is an $m \times n$ matrix, and $V$ is an $n \times n$ matrix.

Each of these matrices has a special structure: $U$ and $V$ are defined to be orthogonal matrices, and $D$ is defined to be a diagonal matrix. Note that $D$ is not necessarily square.

The entries on the diagonal of $D$ are the singular values of $A$. The columns of $U$ are the left singular vectors, and the columns of $V$ are the right singular vectors.

The SVD of $A$ can be interpreted through eigendecompositions of matrices built from $A$: the left singular vectors of $A$ are the eigenvectors of $AA^\top$, the right singular vectors are the eigenvectors of $A^\top A$, and the nonzero singular values of $A$ are the square roots of the eigenvalues of $A^\top A$ (equivalently, of $AA^\top$).
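
This relationship is easy to check numerically; a small sketch with NumPy (my own example, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

# Full SVD: U is 5x5, Vt is 3x3, s holds the 3 singular values.
U, s, Vt = np.linalg.svd(A)

# Nonzero singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)                 # ascending order
print(np.allclose(np.sort(s**2), eigvals))            # True
```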

Perhaps the most useful property of the SVD is that it lets us generalize matrix inversion to non-square matrices.

@(Moore-Penrose Pseudoinverse)

Matrix inversion is not defined for non-square matrices. Suppose we want to solve a linear system by means of a left inverse $B$ of $A$: $Ax = y \to x = By$. Depending on the structure of the problem, it may not be possible to design a unique mapping from $A$ to $B$.

If $A$ has more rows than columns, the equation may have no solution; if $A$ has more columns than rows, it may have multiple solutions.

The Moore-Penrose pseudoinverse lets us make some headway in these cases. The pseudoinverse of $A$ is defined as

$$A^+ = \lim_{\alpha \searrow 0}(A^\top A + \alpha I)^{-1}A^\top$$

Practical algorithms for computing the pseudoinverse are not based on this definition, but on the formula

$$A^+ = VD^+U^\top$$

where $U$, $D$, and $V$ come from the SVD of $A$, and the pseudoinverse $D^+$ of the diagonal matrix $D$ is obtained by taking the reciprocal of its nonzero entries and then transposing the result.

When $A$ has more columns than rows, using the pseudoinverse to solve the linear system gives one of the many possible solutions; specifically, $x = A^+y$ is the solution with minimal Euclidean norm $\|x\|_2$ among all feasible solutions.
When $A$ has more rows than columns, there may be no solution; in that case the pseudoinverse yields the $x$ for which $Ax$ is as close as possible to $y$ in Euclidean distance $\|Ax - y\|_2$.
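
A small NumPy sketch (my own illustration, not from the text) computing $A^+$ from the SVD and checking it against `np.linalg.pinv` and the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))   # more rows than columns
y = rng.standard_normal(6)

# Pseudoinverse via the SVD: A+ = V D+ U^T, with nonzero singular values inverted.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))        # True
# x = A+ y minimizes ||Ax - y||_2, matching the least-squares solution.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(A_pinv @ y, x_ls))                 # True
```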

@(Trace)

The trace operator returns the sum of the diagonal entries of a matrix: $\operatorname{Tr}(A) = \sum_i A_{i,i}$.
Clearly $\operatorname{Tr}(A) = \operatorname{Tr}(A^\top)$.

Some matrix operations that are otherwise hard to describe can be expressed cleanly using matrix products and the trace operator.

For example, the trace gives another way to write the Frobenius norm of a matrix:

$$\|A\|_F = \sqrt{\operatorname{Tr}(AA^\top)}$$

The trace of a square matrix formed as a product of several factors is unchanged if the last factor is moved to the front of the product:

$$\operatorname{Tr}(ABC) = \operatorname{Tr}(CAB) = \operatorname{Tr}(BCA)$$

or, more generally,

$$\operatorname{Tr}\bigg(\prod_{i=1}^n F^{(i)}\bigg) = \operatorname{Tr}\bigg(F^{(n)}\prod_{i=1}^{n-1}F^{(i)}\bigg)$$

The trace is invariant under cyclic permutation even when the permutation changes the shape of the product. For example, with $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times m}$, we have $\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$ even though $AB$ is $m\times m$ and $BA$ is $n\times n$.
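
These identities are easy to verify numerically; a minimal NumPy sketch (mine, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

# Frobenius norm expressed via the trace.
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T))))  # True

# Cyclic invariance: AB is 3x3, BA is 5x5, but the traces agree.
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))                      # True
```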

@(Determinant)

The determinant of a square matrix $A$, written $\det(A)$, is a function mapping matrices to real numbers. It equals the product of the eigenvalues of $A$. Its absolute value measures how much multiplication by $A$ expands or contracts space: if the determinant is 0, space is collapsed completely along at least one dimension and loses all of its volume; if the determinant is 1, the transformation preserves volume.

$$\det(A) = \prod_{i=1}^n\lambda_i$$
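
A quick NumPy check of $\det(A) = \prod_i \lambda_i$ (an illustrative sketch of my own, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))

# The determinant equals the product of the (possibly complex) eigenvalues.
eigvals = np.linalg.eigvals(A)
print(np.isclose(np.linalg.det(A), np.prod(eigvals).real))  # True
```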


EARLY VISION: JUST ONE IMAGE

Linear Filters

Local Image Features

An object is separated from its background in an image by an occluding contour. Draw a path in the image that crosses such a contour. On one side, pixels lie on the object, and on the other, the background. Finding occluding contours is an important challenge, because the outline of an object—which is one cue to its shape—is formed by occluding contours.

COMPUTING THE IMAGE GRADIENT

For an image $I$, the gradient is $\nabla I = \left(\dfrac{\partial I}{\partial x}, \dfrac{\partial I}{\partial y}\right)^T$, which we could estimate by observing that $\begin{cases}\dfrac{\partial I}{\partial x}\approx I_{i+1,j} - I_{i,j} \\ \dfrac{\partial I}{\partial y}\approx I_{i,j+1} - I_{i,j}\end{cases}$

These kinds of derivative estimates are known as finite differences. Image noise tends to result in pixels not looking like their neighbors, so simple finite differences tend to give strong responses to noise. As a result, just taking one finite difference for $x$ and one for $y$ gives noisy gradient estimates. The way to deal with this problem is to smooth the image and then differentiate it.
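
One common way to implement this, smoothing with a Gaussian and then taking finite differences, is sketched below (my own example, assuming SciPy is available and using a synthetic image):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Synthetic test image: a bright square on a dark background, plus noise.
rng = np.random.default_rng(0)
image = np.zeros((64, 64))
image[16:48, 16:48] = 1.0
image += 0.1 * rng.standard_normal(image.shape)

# Smooth first, then take forward finite differences of the smoothed image.
# Here axis 1 is x (columns) and axis 0 is y (rows).
smoothed = gaussian_filter(image, sigma=2.0)
dI_dx = smoothed[:-1, 1:] - smoothed[:-1, :-1]
dI_dy = smoothed[1:, :-1] - smoothed[:-1, :-1]

gradient_magnitude = np.sqrt(dI_dx**2 + dI_dy**2)
print(gradient_magnitude.max())
```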

REPRESENTING THE IMAGE GRADIENT

There are two important representations of the image gradient.

  • The first is to compute edges, where there are very fast changes in brightness. These are usually seen as points where the magnitude of the gradient is extremal.
  • The second is to use gradient orientations, which are largely independent of illumination intensity.

FINDING CORNERS AND BUILDING NEIGHBORHOODS

Points worth matching are corners, because a corner can be localized, which means we can tell where a corner is. This motivates the more general term interest point often used to describe a corner.

DESCRIBING NEIGHBORHOODS WITH SIFT AND HOG FEATURES

We know the center, radius, and orientation of a set of image patches, and must now represent them. Orientations should provide a good representation. They are unaffected by changes in image brightness, and different textures tend to have different orientation fields. The pattern of orientations in different parts of the patch is likely to be quite distinctive. Our representation should be robust to small errors in the center, radius, or orientation of the patch, because we are unlikely to estimate these exactly right.


Probability

Definition

Conditional probability

The conditional probability of $x$ given that $y$ takes value $y^*$ tells us the relative propensity of the random variable $x$ to take different outcomes given that the random variable $y$ is fixed to value $y^*$. This conditional probability is written as $Pr(x|y = y^*)$.
The conditional probability $Pr(x|y = y^*)$ can be recovered from the joint distribution $Pr(x, y)$.

In particular, we examine the appropriate slice $Pr(x, y = y^*)$ of the joint distribution. The values in the slice tell us about the relative probability that $x$ takes various values having observed $y = y^*$, but they do not themselves form a valid probability distribution; they cannot sum to one, as they constitute only a small part of the joint distribution, which did itself sum to one. To calculate the conditional probability distribution, we hence normalize by the total probability in the slice:

$$Pr(x|y=y^*) = \frac{Pr(x, y = y^*)}{\int Pr(x, y=y^*)\,dx} = \frac{Pr(x, y = y^*)}{Pr(y=y^*)} \tag{1.0.1}$$

$$Pr(x|y) = \frac{Pr(x,y)}{Pr(y)} \;\to\; \begin{cases}Pr(x,y) = Pr(x|y)\,Pr(y) \\ Pr(x,y) = Pr(y|x)\,Pr(x)\end{cases} \;\to\; Pr(x|y)\,Pr(y) = Pr(y|x)\,Pr(x)$$

$$\begin{aligned}Pr(\omega, x, y, z) &= Pr(\omega, x, y|z)\,Pr(z)\\&= Pr(\omega, x|y,z)\,Pr(y|z)\,Pr(z)\\&= Pr(\omega|x, y,z)\,Pr(x|y,z)\,Pr(y|z)\,Pr(z)\end{aligned}$$
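
To make the slicing-and-normalizing concrete, here is a tiny NumPy sketch (my own toy joint table, not from the book) that recovers $Pr(x|y=y^*)$ and checks Bayes' rule:

```python
import numpy as np

# Toy joint distribution Pr(x, y) over 3 values of x and 2 values of y.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])
assert np.isclose(joint.sum(), 1.0)

# Take the slice Pr(x, y = y*) and normalize it to obtain Pr(x | y = y*).
y_star = 1
slice_ = joint[:, y_star]           # does not sum to one on its own
cond = slice_ / slice_.sum()        # divide by Pr(y = y*) = sum of the slice
print(cond, cond.sum())             # a valid distribution over x

# Bayes' rule check: Pr(x|y*) Pr(y*) == Pr(y*|x) Pr(x) for every x.
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)
print(np.allclose(cond * p_y[y_star], (joint[:, y_star] / p_x) * p_x))  # True
```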

Expectation

Given a function $f[\cdot]$ that returns a value for each possible value $x^*$ of the variable $x$ and a probability $Pr(x = x^*)$ that each value of $x$ occurs, we sometimes wish to calculate the expected output of the function. If we drew a very large number of samples from the probability distribution, calculated the function for each sample, and took the average of these values, the result would be the expectation.

The expected value of a function $f[\cdot]$ of a random variable $x$ is defined as

$$E[f[x]] = \sum_x f[x]\,Pr(x) \tag{1.0.2}$$

$$E[f[x]] = \int f[x]\,Pr(x)\,dx \tag{1.0.2}$$

for the discrete and continuous cases, respectively. This idea generalizes to functions $f[\cdot]$ of more than one random variable so that, for example, $E[f[x,y]] = \int\int f[x,y]\,Pr(x,y)\,dx\,dy$.

Special cases of expectation. For some functions $f[x]$, the expectation $E[f[x]]$ is given a special name. Here we use the notation $\mu_x$ to represent the mean with respect to random variable $x$ and $\mu_y$ the mean with respect to random variable $y$.

| Function $f[\cdot]$ | Expectation |
| --- | --- |
| $x$ | mean, $\mu_x$ |
| $x^k$ | $k$th moment about zero |
| $(x-\mu_x)^k$ | $k$th moment about the mean |
| $(x-\mu_x)^2$ | variance |
| $(x-\mu_x)^3$ | skew |
| $(x-\mu_x)^4$ | kurtosis |
| $(x-\mu_x)(y - \mu_y)$ | covariance of $x$ and $y$ |

There are four rules for manipulating expectations, which can be easily proved from the original definition:

  1. The expected value of a constant $k$ with respect to the random variable $x$ is just the constant itself

$$E[k] = k$$

  2. The expected value of a constant $\kappa$ times a function $f[x]$ of the random variable $x$ is $\kappa$ times the expected value of the function

$$E[\kappa f[x]] = \kappa E[f[x]]$$

  3. The expected value of the sum of two functions of a random variable $x$ is the sum of the individual expected values of the functions

$$E[f[x] + g[x]] = E[f[x]] + E[g[x]]$$

  4. The expected value of the product of two functions $f[x]$ and $g[y]$ of random variables $x$ and $y$ is equal to the product of the individual expected values if the variables $x$ and $y$ are independent

$$E[f[x]g[y]] = E[f[x]]\,E[g[y]] \hspace{2em} \text{where } x, y \text{ independent}$$

Using these rules, we can derive the relationship between the second moment about zero and the second moment about the mean (the variance):

$$\begin{aligned}E[(x-\mu)^2] &= E[x^2 - 2x\mu + \mu^2] \\&= E[x^2] - 2E[x\mu] + E[\mu^2] \\&= E[x^2] - 2\mu E[x] + E[\mu^2] \\ &= E[x^2] - 2E[x]E[x] + E[x]E[x] \\ &= E[x^2] - E[x]E[x]\end{aligned}$$
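
This identity can be sanity-checked by Monte Carlo sampling; a short sketch (mine, not from the book, using an arbitrary distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # any distribution works

mean = np.mean(x)
second_moment = np.mean(x**2)

# E[(x - mu)^2] = E[x^2] - E[x]^2
print(np.isclose(np.mean((x - mean)**2), second_moment - mean**2))  # True
# Both should be close to the true variance (scale^2 = 4 for this exponential).
print(np.mean((x - mean)**2))
```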


Notes on some frequently used mathematical formulas, theorems, and analysis methods.

Procrustes Analysis

Introduction

Take $N$ two-dimensional images of objects of the same class and annotate their contour points; this gives the training sample set

$$\Omega = \{X_1, X_2, \cdots, X_N\}$$

Because the shapes and positions of the target objects differ considerably across images, the resulting data are not affine-invariant and must be normalized. Here, Procrustes analysis is used to normalize all the shapes in the sample set; shape and position are still carried by the spatial coordinates of the sample points.

Definition

Procrustes analysis is a method for analyzing the distribution of shapes. Mathematically, it iterates between estimating a canonical shape and using least squares to find the affine transformation mapping each sample shape onto that canonical shape.

In this book, the normalization of two shapes (one being the canonical shape, the other a sample shape) proceeds as follows:

  1. Compute, for each sample point $i\ (i = 1, 2, \ldots, n)$, its mean over the $N$ images

$$(\bar x_i, \bar y_i) = \bigg(\frac{1}{N}\sum_{j=1}^N x_{ji}, \frac{1}{N}\sum_{j=1}^N y_{ji}\bigg)$$

  2. Normalize all the shapes by subtracting from each sample point its corresponding mean

$$(x_i', y_i') = (x_i - \bar x_i, y_i - \bar y_i)$$

  3. From the centered data, compute the centroid of the shape in each image; for the $i$-th image the centroid is

$$(\bar x_i, \bar y_i) = \bigg(\frac{1}{n}\sum_{j=1}^n x_{ji}, \frac{1}{n}\sum_{j=1}^n y_{ji}\bigg)$$

  4. Using the centroid and a rotation angle, align the canonical shape and the sample shape so that the Procrustes distance between the two shapes is minimized; the Procrustes distance is defined as

$$P_d^2 = \sum_{i=1}^n\big[(x_{i1} - x_{i2})^2 + (y_{i1} - y_{i2})^2\big]$$

Concretely, step 4 repeatedly iterates the following procedure:

  1. Compute the canonical shape as the mean of the normalized sample points over all images.
  2. Use least squares to find the rotation mapping each sample shape onto the canonical shape. By the definition of the Procrustes distance, this means solving

$$\min_{a,b}\sum_{i=1}^n\left\|\begin{bmatrix}a & -b\\ b & a\end{bmatrix}\begin{bmatrix}x_i \\ y_i\end{bmatrix}-\begin{bmatrix}c_x\\c_y\end{bmatrix}\right\|^2$$

where $a$ and $b$ are the rotation (and scale) parameters of the affine transformation:

$$\begin{bmatrix}a & -b\\ b & a\end{bmatrix} = \begin{bmatrix}k\cos(\theta) & -k\sin(\theta) \\ k\sin(\theta) & k\cos(\theta)\end{bmatrix}$$

Setting the partial derivatives of the objective to zero gives the desired $a$ and $b$:

$$\begin{bmatrix}a \\ b\end{bmatrix} = \frac{1}{\sum_i(x_i^2 + y_i^2)}\sum_{i=1}^n\begin{bmatrix}x_ic_x + y_ic_y \\ x_ic_y - y_ic_x\end{bmatrix}$$

  3. Apply the rotation to the sample shape to obtain a new shape aligned with the canonical shape

$$\begin{bmatrix}x' \\ y'\end{bmatrix} = \begin{bmatrix}a & -b \\ b & a\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}$$

  4. Repeat the steps above until a specified number of iterations is reached or the norm of the change in the canonical shape between consecutive iterations falls below a threshold (see the sketch after this list).
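
A minimal NumPy sketch of this iterative alignment (my own simplification of the steps above: shapes stored as $n\times 2$ arrays, already centered as in step 2):

```python
import numpy as np

def align_to_canonical(shape, canonical):
    """Least-squares rotation/scale (a, b) mapping `shape` onto `canonical`."""
    x, y = shape[:, 0], shape[:, 1]
    cx, cy = canonical[:, 0], canonical[:, 1]
    denom = np.sum(x**2 + y**2)
    a = np.sum(x * cx + y * cy) / denom
    b = np.sum(x * cy - y * cx) / denom
    R = np.array([[a, -b], [b, a]])
    return shape @ R.T            # apply [x'; y'] = R [x; y] to every point

def procrustes(shapes, n_iters=10, tol=1e-8):
    """shapes: array of shape (N, n, 2), already centered."""
    shapes = np.asarray(shapes, dtype=float)
    canonical = shapes.mean(axis=0)
    for _ in range(n_iters):
        shapes = np.stack([align_to_canonical(s, canonical) for s in shapes])
        new_canonical = shapes.mean(axis=0)
        converged = np.linalg.norm(new_canonical - canonical) < tol
        canonical = new_canonical
        if converged:
            break
    return canonical, shapes
```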

Vector Calculus

Vectors and Parametric Curves

Basic

Definition (Scalar, Point, Bi-point, Vector)

Scalar
A scalar $\alpha \in \mathbb{R}$ is simply a real number.
 
Point, Bi-point
A point $r \in \mathbb{R}^2$ is an ordered pair of real numbers, $r = (x, y)$ with $x \in \mathbb{R}$ and $y \in \mathbb{R}$. Here the first coordinate $x$ stipulates the location on the horizontal axis and the second coordinate $y$ stipulates the location on the vertical axis. Given two points $r$ and $r'$ in $\mathbb{R}^2$, the directed line segment with departure point $r$ and arrival point $r'$ is called the bi-point $r, r'$ and is denoted by $[r, r']$. We say that $r$ is the tail of the bi-point $[r, r']$ and that $r'$ is its head. The Euclidean length or norm of bi-point $[a, b]$ is simply the distance between $a$ and $b$, and it is denoted by $\|[a,b]\| = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$.
 
Vector
A vector $\vec a \in \mathbb{R}^2$ is a codification of movement of a bi-point: given the bi-point $[r, r']$, we associate to it the vector $\overrightarrow{rr'} = \begin{bmatrix}x' - x \\ y' - y\end{bmatrix}$, stipulating a movement of $x' - x$ units from $(x, y)$ along the horizontal axis and of $y' - y$ units from the current position along the vertical axis. The zero vector $\vec 0 = \begin{bmatrix}0 \\ 0\end{bmatrix}$ indicates no movement in either direction.


Filters-Instagram

This notebook includes both coding and written questions. Please hand in this notebook file with all the outputs and your answers to the written questions.

```python
# Setup
from __future__ import print_function

import numpy as np
import matplotlib.pyplot as plt
from time import time
#from skimage import io

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
%load_ext autoreload
%autoreload 2
```

Convolutions

@(Commutative Property)
Recall that the convolution of an image $f: \mathbb{R}^2\to\mathbb{R}$ and a kernel $h: \mathbb{R}^2\to\mathbb{R}$ is defined as follows:

$$\begin{aligned}(f*h)[m,n] &= \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} f[i,j]\cdot h[m-i,\,n-j] \\ &= (h*f)[m,n] \\ &= \sum_{i=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} h[i,j]\cdot f[m-i,\,n-j]\end{aligned} \tag{1.0.1}$$

Commutativity follows from a change of summation variables, replacing $i$ and $j$ with $m - i$ and $n - j$.
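
A quick numerical check of commutativity (my own sketch; it assumes SciPy's `convolve2d`, whose default "full" mode zero-pads both inputs):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
f = rng.standard_normal((5, 5))   # "image"
h = rng.standard_normal((3, 3))   # kernel

# Full (zero-padded) convolution is commutative: f * h == h * f.
print(np.allclose(convolve2d(f, h, mode='full'),
                  convolve2d(h, f, mode='full')))  # True
```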

@(Linear and Shift Invariance)
Let $f$ be a function $\mathbb{R}^2\to\mathbb{R}$. Consider a system $f\xrightarrow{S} g$, where $g = (f*h)$ with some kernel $h: \mathbb{R}^2\to\mathbb{R}$. Show that $S$ defined by any kernel $h$ is a Linear Shift Invariant (LSI) system. In other words, for any $h$, show that $S$ satisfies both of the following:

$$S[a\cdot f_1 + b\cdot f_2] = a\cdot S[f_1] + b\cdot S[f_2]$$

$$\text{If } f[m,n]\xrightarrow{S} g[m,n] \text{ then } f[m-m_0,\,n-n_0]\xrightarrow{S} g[m-m_0,\,n-n_0]$$


Preface

With the ever increasing amounts of data in electronic form, the need for automated methods for data analysis continues to grow. The goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest. Machine learning is thus closely related to the fields of statistics and data mining, but differs slightly in terms of its emphasis and terminology. This book provides a detailed introduction to the field, and includes worked examples drawn from application domains such as molecular biology, text processing, computer vision, and robotics.

Probability

probability theory

@(Joint probabilities)

We define the probability of the joint event $A$ and $B$ as follows:
$$P(A,B) = P(A\cap B) = P(A|B)P(B)$$
Given a joint distribution on two events $P(A,B)$, we define the marginal distribution as follows:
$$P(A) = \sum_b P(A,B) = \sum_b P(A|B=b)P(B=b)$$

@(Mean and variance)

The most familiar property of a distribution is its mean, or expected value, denoted by $\mu$. For discrete rv's, it is defined as $E[X] = \sum_{x\in \mathcal{X}} x\,p(x)$, and for continuous rv's, it is defined as $E[X] = \int_{\mathcal{X}} x\,p(x)\,dx$.
The variance is a measure of the "spread" of a distribution, denoted by $\sigma^2$. It is defined as follows:
$$\begin{aligned}\mathrm{var}[X] &= E[(X-\mu)^2] = \int(x-\mu)^2 p(x)\,dx\\ &= \int x^2 p(x)\,dx + \mu^2\int p(x)\,dx - 2\mu\int x\,p(x)\,dx \\ &= E[X^2] - \mu^2 \end{aligned}$$
from which $E[X^2] = \mu^2 + \mathrm{var}[X] = \mu^2 + \sigma^2$.


Image Processing

Point Operators

@(Point Operators)[pixel|color|compositing]

@(Pixel)

Two commonly used point processes are multiplication and addition with a constant

$$g(x) = af(x) + b \tag{1.0.1}$$

The parameters $a > 0$ and $b$ are often called the gain and bias parameters; sometimes these parameters are said to control contrast and brightness.
The bias and gain parameters can also be spatially varying: $g(x) = a(x)f(x) + b(x)$.

Multiplicative gain (both global and spatially varying) is a linear operation, since it obeys the superposition principle $h(f_0 + f_1) = h(f_0) + h(f_1)$.

Another commonly used dyadic (two-input) operator is the linear blend operator

$$g(x) = (1-\alpha)f_0(x) + \alpha f_1(x) \tag{1.0.2}$$
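
As a small NumPy sketch (mine, not from the text, assuming float images with values in [0, 1]), the gain/bias operator and the linear blend look like:

```python
import numpy as np

def gain_bias(f, a=1.2, b=0.05):
    """g(x) = a * f(x) + b, clipped back to the valid range."""
    return np.clip(a * f + b, 0.0, 1.0)

def linear_blend(f0, f1, alpha):
    """g(x) = (1 - alpha) * f0(x) + alpha * f1(x)."""
    return (1.0 - alpha) * f0 + alpha * f1

# Toy example: cross-dissolve between two constant images.
f0 = np.zeros((4, 4))
f1 = np.ones((4, 4))
print(gain_bias(f1)[0, 0])               # 1.0 after clipping
print(linear_blend(f0, f1, 0.25)[0, 0])  # 0.25
```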

@(Compositing and matting)

Compositing equation: $C = (1-\alpha)B + \alpha F$.

This operator attenuates the influence of the background image $B$ by a factor $(1 - \alpha)$ and then adds in the color (and opacity) values corresponding to the foreground layer $F$.


Overview of Supervised Learning

Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

We develop two simple but powerful prediction methods: the linear model fit by least squares and the $k$-nearest-neighbor prediction rule. The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions. The method of $k$-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

Linear Models and Least Squares

Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output $Y$ via the model

$$\hat Y = \hat\beta_0 + \sum_{j=1}^p X_j\hat\beta_j \tag{1.0.1}$$

The term $\hat\beta_0$ is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in $X$, include $\hat\beta_0$ in the vector of coefficients $\hat\beta$, and then write the linear model in vector form as an inner product

$$\hat Y = X^T\hat\beta \tag{1.0.2}$$

where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $Y$ is a scalar; in general $Y$ can be a $K$-vector, in which case $\beta$ would be a $p\times K$ matrix of coefficients.

In the least squares approach, we pick the coefficients $\beta$ to minimize the residual sum of squares

$$RSS(\beta) = \sum_{i=1}^N(y_i - x_i^T\beta)^2$$
or, in matrix notation,
$$RSS(\beta) = (Y - X\beta)^T(Y - X\beta)$$

In general $Y = X\beta$ has no exact solution, so we minimize $RSS(\beta)$ instead; setting its derivative to zero gives the normal equations $X^T(Y - X\beta) = 0$, i.e. $X^TY = X^TX\beta$. If $X^TX$ is nonsingular, then the unique solution is given by

$$\hat\beta = (X^TX)^{-1}X^TY \tag{1.0.3}$$
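
A small NumPy sketch (my own, with synthetic data) comparing the closed form $(X^TX)^{-1}X^TY$ against `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, p))])  # include the constant 1
beta_true = np.array([0.5, 2.0, -1.0, 0.3])
y = X @ beta_true + 0.1 * rng.standard_normal(N)

# Closed-form least squares (fine here; prefer lstsq/QR for ill-conditioned X).
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```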
