0%

Computer Vision - Models, Learning And Inference(计算机视觉:模型、学习和推理)

Probability

Definition

Conditional probability

The conditional probability of xx given that yy takes value yy^∗ tells us the relative propensity of the random variable xx to take different outcomes given that the random variable yy is fixed to value yy^*. This conditional probablity is written as Pr(x,y=y)Pr(x, y = y^*).
The conditional probability Pr(xy=y)Pr(x|y = y^∗) can be recovered from the joint distribution Pr(x,y)Pr(x,y).

In particular, we examine the appropriate slice Pr(x,y=y)Pr(x,y = y^∗) of the joint distribution.The values in the slice tell us about the relative probability that xx takes various values having observed y=yy = y^∗, but they do not themselves form a valid probability distribution; they cannot sum to one as they constitute only a small part of the joint distribution which did itself sum to one. To calculate the conditional probability distribution, we hence normalize by the total probability in the slice

Pr(xy=y)=Pr(x,y=y)Pr(x,y=y)dx=Pr(x,y=y)Pr(y=y)(1.0.1)Pr(x|y=y^*) = \frac{Pr(x, y = y^*)}{\int Pr(x, y=y^*)dx} = \frac{Pr(x, y = y^*)}{Pr(y=y^*)} \tag{1.0.1}

Pr(xy)=Pr(x,y)Pr(y){Pr(x,y)=Pr(xy)Pr(y)Pr(x,y)=Pr(yx)Pr(x)Pr(xy)Pr(y)=Pr(yx)Pr(x)\to Pr(x|y) = \frac{Pr(x,y)}{Pr(y)} \to \begin{cases}Pr(x,y) = Pr(x|y) Pr(y) \\Pr(x,y) = Pr(y|x)Pr(x)\end{cases} \to Pr(x|y)Pr(y) = Pr(y|x) Pr(x)

Pr(ω,x,y,z)=Pr(ω,x,yz,z)Pr(z)=Pr(ω,xy,z)Pr(yz)Pr(z)=Pr(ωx,y,z)Pr(xy,z)Pr(yz)Pr(z)\to \begin{aligned}Pr(\omega, x, y, z) &= Pr(\omega, x, y|z, z)Pr(z)\\&= Pr(\omega, x|y,z)Pr(y|z)Pr(z)\\&=Pr(\omega|x, y,z)Pr(x|y,z)Pr(y|z)Pr(z)\end{aligned}

Expectation

Given a function f[]f [\cdot] that returns a value for each possible value xx^∗ of the variable xx and a probability Pr(x=x)P r(x = x^∗) that each value of xx occurs, we sometimes wish to calculate the expected output of the function. If we drew a very large number of samples from the probability distribution, calculated the function for each sample, and took the average of these values, the result would be the expectation.

The expected value of a function f[]f[\cdot] of a random variable xx is defined as

E[f[x]]=xf[x]Pr(x)(1.0.2)E[f[x]] = \sum_xf[x]Pr(x) \tag{1.0.2}

E[f[x]]=f[x]Pr(x)dx(1.0.2)E[f[x]] = \int f[x]Pr(x)dx \tag{1.0.2}

for the discrete and continuous cases, respectively. This idea generalizes to functions f[]f[\cdot] of more than one random variable so that, for example E[f[x,y]]=f[x,y]Pr(x,y)dxdy\displaystyle E[f[x,y]] = \int \int f[x,y]Pr(x,y)dxdy

Special cases of expectation. For some functions f(x)f(x), the expectation E[f(x)]E[f(x)] is given a special name. Here we use the notation μx\mu_x to represent the mean with respect to random variable xx and μy\mu_y the mean with respect to random variable yy.

Function f[]f[\cdot] Expectation
xx mean, μx\mu_x
xkx^k kthk^{th} moment about zero
(xμx)k(x-\mu_x)^k kthk^{th} moment about the mean
(xμx)2(x-\mu_x)^2 variance
(xμx)3(x-\mu_x)^3 skew
(xμx)4(x-\mu_x)^4 kurtosis
(xμx)(yμy)(x-\mu_x)(y - \mu_y) covariance of xx and yy

There are four rules for manipulating expectations, which can be easily proved from the original definition

  1. The expected value of a constant kk with respect to the random variable xx is just the constant itself

E[k]=kE[k] = k

  1. The expected value of a constant κκ times a function f[x]f[x] of the random variable xx is κκ times the expected value of the function

E[κf[x]]=κE[f[x]]E[κf [x]] = κE[f [x]]

  1. The expected value of the sum of two functions of a random variable xx is the sum of the individual expected values of the functions

E[f[x]+g[x]]=E[f[x]]+E[g[x]]E[f [x] + g[x]] = E[f [x]] + E[g[x]]

  1. The expected value of the product of two functions f[x]f[x] and g[y]g[y] of random variables xx and yy is equal to the product of the individual expected values if the variables xx and yy are independent

E[f[x]g[y]]=E[f[x]]E[g[y]]where x, y independentE[f [x]g[y]] = E[f [x]]E[g[y]] \hspace{2em} \text{where x, y independent}

With above rules, get the relationship between the second moment around zero and the second moment about the mean (variance)

E[(xμ)2]=E[x22xμ+μ2]=E[x2]2E[xμ]+E[μ2]=E[x2]2μE[x]+E[μ2]=E[x2]2E[x]E[x]+E[x]E[x]=E[x2]E[x]E[x]\begin{aligned}E[(x-\mu)^2] &= E[x^2 - 2x\mu + \mu^2] \\&= E[x^2] -2E[x\mu] + E[\mu^2] \\&= E[x^2] - 2\mu E[x] + E[\mu^2] \\ &= E[x^2] - 2E[x]E[x] + E[x]E[x] \\ &= E[x^2] - E[x]E[x]\end{aligned}

Common probability distributions

Common probability distributions: the choice of distribution depends on the type/ domain of data to be modeled.

Data Type Domain Distribution
univariate, discrete, binary x{0,1}x\in\{0,1\} Bernoulli
univariate, discrete, multivalued x{1,2,,K}x\in\{1,2,\cdots,K\} categorical
univariate, continuous, unbounded xRx\in \Bbb{R} univariate normal
univariate, continuous, bounded x[0,1]x\in[0,1] beta
multivariate, continuous, unbounded xRKx\in \Bbb{R}^K multivariate normal
multivariate, continuous, bounded, sums to one x=[x1,x2,,xK]Txk[0,1],k=1Kxk=1\mathbb x = [x_1,x_2,\cdots,x_K]^T\\x_k\in[0,1],\sum_{k=1}^Kx_k = 1 Dirichlet
bivariate, continuous,x1unbounded,x2bounded below\text{bivariate, continuous,}\\x_1 \text{unbounded,} \\ x_2 \text{bounded below} x=[x1,x2]x1Rx2R+\mathbb x = [x_1,x_2]\\x_1\in \Bbb{R}\\x_2\in\Bbb{R}^+ normal-scaled inverse gamma
vector x and matrix Xx unbounded,X square, positive definite\text{vector x and matrix X}\\\text{x unbounded,}\\\text{X square, positive definite} xRXRK×KzTXz>0 zRK\mathbb x \in \Bbb{R}\\X\in\Bbb{R}^{K\times K}\\z^TXz\gt 0 \space\forall \mathbb z\in \Bbb{R}^K normal inverse Wishart

Probability distributions such as the categorical and normal distributions are obviously useful for modeling visual data. However, the need for some of the other distributions is not so obvious; for example, the Dirichlet distribution models KK positive numbers that sum to one. Visual data do not normally take this form.

Bernoulli distribution

The Bernoulli distribution is a discrete distribution that models binary trials: it describes the situation where there are only two possible outcomes x{0,1}x \in \{0, 1\} which are referred to as “failure” and “success.”
The Bernoulli has a single parameter λ[0,1]\lambda \in [0,1] which defines the probability of observing a success x=1x = 1. The distribution is hence

Pr(x=0)=1λPr(x=1)=λPr(x)=λx(1λ)1x\begin{aligned}Pr(x= 0) &= 1- \lambda \\Pr(x=1) &= \lambda\\Pr(x) &= \lambda^x(1-\lambda)^{1-x}\end{aligned}

Beta distribution

The beta distribution is defined on [0,1][0,1] and has parameters (α,β)(\alpha,\beta) whose relative values determine the expected value so E[λ]=αα+β\displaystyle E[\lambda] = \frac{\alpha}{\alpha + \beta}.

Mathematically, the beta distribution has the form

Pr(λ)=Γ[α+β]Γ[α]Γ[β]λα1(1λ)β1Pr(\lambda) = \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\Gamma[\beta]}\lambda^{\alpha - 1}(1-\lambda)^{\beta - 1}

where Γ[]\Gamma[\cdot] is the gamma function.

Categorical distribution

The categorical distribution is a discrete distribution that determines the probability of observing one of KK possible outcomes. Hence, the Bernoulli distribution is a special case of the categorical distribution when there are only two outcomes.
The probabilities of observing the KK outcomes are held in a K×1K \times 1 parameter vector λ=[λ1,λ2,,λK]\lambda = [λ_1,λ_2,\cdots,λ_K] where λk[0,1]\lambda_k \in [0,1] and k=1Kλk=1\sum_{k=1}^K \lambda_k = 1. The categorical distribution can be visualized as a normalized histogram with KK bins and can be written as

Pr(x=k)=λkPr(x = k) = \lambda_k

Matrix Perspectiew
Alternatively, we can think of the data as taking values x{e1,e2,,eK}x \in \{e_1,e_2,\cdots,e_K\} where eke_k is the kthk^{th} unit vector. All elements of eke_k are zero except the kthk^{th}, which is one. Here we can write

Pr(x=ek)=j=1Kλjxj=λkxj is the jth element of xPr(\mathbb x = e_k) = \prod_{j=1}^K\lambda_j^{x_j} = \lambda_k \\x_j \text{ is the } j^{th} \text{ element of } \mathbb x

Dirichlet distribution

The Dirichlet distribution is defined over KK continuous values λ1,λ2,,λKλ_1,λ_2,\cdots,λ_K where λk[0,1]λ_k ∈ [0, 1] and k=1Kλk=1\sum_{k=1}^K λ_k = 1. Hence it is suitable for defining a distribution over the parameters of the categorical distribution.

In K dimensions the Dirichlet distribution has KK parameters α1,,αKα_1,\cdots,α_K each of which can take any positive value. The relative values of the parameters determine the expected values E[λ1],,E[λk]E[λ_1],\cdots,E[λ_k]. The absolute values determine the concentration around the expected value. We write

Pr(λ1K)=Γ[k=1Kαk]k=1KΓ[αk]k=1Kλkαk1Pr(\lambda_{1\cdots K}) = \frac{\Gamma[\sum_{k=1}^K\alpha_k]}{\prod_{k=1}^K\Gamma[\alpha_k]}\prod_{k=1}^K\lambda_k^{\alpha_k - 1}

Just as the Bernoulli distribution was a special case of the categorical distribution with two possible outcomes, so the beta distribution is a special case of the Dirichlet distribution where the dimensionality is two.

Univariate normal distribution

The normal distribution has two parameters, the mean μμ and the variance σσ . The parameter μμ can take any value and determines the position of the peak. The parameter σ2σ^2 takes only positive values and determines the width of the distribution. The normal distribution is defined as

Pr(x)=12πσ2exp[(xμ)22σ2]Pr(x) = \frac{1}{\sqrt{2\pi\sigma^2}}exp\bigg[-\frac{(x-\mu)^2}{2\sigma^2}\bigg]

Normal-scaled inverse gamma distribution[TODO]

The normal-scaled inverse gamma distribution is defined over a pair of continuous values μ,σ2μ,σ^2, the first of which can take any value and the second of which is constrained to be positive. As such it can define a distribution over the mean and variance parameters of the normal distribution.

Multivariate normal distribution

The multivariate normal or Gaussian distribution models D-dimensional variables x\mathbb x where each of the D elements x1xDx_1 \cdots x_D is continuous and lies in the range [,+][-\infty, + \infty]

The multivariate normal distribution has two parameters: the mean μμ and covariance Σ\varSigma. The mean μμ is a D×1D \times 1 vector that describes the mean of the distribution. The covariance Σ\varSigma is a symmetric D×DD \times D positive definite matrix so that zTΣzz^T\varSigma z is positive for any real vector zz. The probability density function has the following form

Pr(x)=1(2π)D/2Σ1/2exp[(xμ)TΣ1(xμ)2]Pr(\mathbb x) = \frac{1}{(2\pi)^{D/2}|\varSigma|^{1/2}}exp\bigg[-\frac{(\mathbb x - \mu)^T\varSigma^{-1}(\mathbb x-\mu)}{2}\bigg]

Normal inverse Wishart distribution[TODO]

The normal inverse Wishart distribution defines a distribution over a D×1D \times 1 vector μμ and a D×DD \times D positive definite matrix Σ\varSigma. As such it is suitable for describing uncertainty in the parameters of a multivariate normal distribution.

Conjugacy

We have argued that the beta distribution can represent probabilities over the parame- ters of the Bernoulli. Similarly the Dirichlet defines a distribution over the parameters of the categorical, and there are analogous relationships between the normal-scaled inverse gamma and univariate normal and the normal inverse Wishart and the multivariate normal.

These pairs were carefully chosen because they have a special relationship: in each case, the former distribution is conjugate to the latter: the beta is conjugate to the Bernoulli and the Dirichlet is conjugate to the categorical and so on. When we multi- ply a distribution with its conjugate, the result is proportional to a new distribution which has the same form as the conjugate.

Fitting probability models

This chapter concerns fitting probability models to data {xi}i=1I\{\mathbf x_i\}_{i=1}^I. This process is referred to as learning because we learn about the parameters θθ of the model. It also concerns calculating the probability of a new datum x\mathbb x^∗ under the resulting model. This is known as evaluating the predictive distribution. We consider three methods: maximum likelihood,maximum a posteriori, and the Bayesian approach.

Maximum likelihood

As the name suggests, the maximum likelihood (ML) method finds the set of parameters θ^\hat θ under which the data {xi}i=1I\{\mathbf x_i\}_{i=1}^I are most likely. To calculate the likelihood function Pr(xiθ)Pr(\mathbb x_i|θ) at a single data point xi\mathbb x_i, we simply evaluate the probability density function at xi\mathbb x_i. Assuming each data point was drawn independently from the distribution, the likelihood function Pr(x1...Iθ)Pr(\mathbb x_{1...I} |θ) for a set of points is the product of the individual likelihoods. Hence, the ML estimate of the parameters is

θ^=argmaxθ[Pr(x1...Iθ)]=argmaxθ[i=1IPr(xiθ)](1.0.3)\hat\theta = \mathop{\arg\max}_{\theta}[Pr(\mathbb x_{1...I}|\theta)] = \mathop{\arg\max}_{\theta} \bigg[\prod_{i=1}^IPr(\mathbb x_i|\theta)\bigg] \tag{1.0.3}

To evaluate the predictive distribution for a new data point x\mathbb x^∗ (compute the probability that x\mathbb x^∗ belongs to the fitted model), we simply evaluate the probability density function Pr(xθ^)Pr(\mathbb x^*|\hat\theta) using the ML fitted parameters θ^\hat \theta.

Maximum a posteriori

In maximum a posteriori (MAP) fitting, we introduce prior information about the parameters θ
As the name suggests, maximum a posteriori estimation maximizes the posterior probability Pr(θx1...I)Pr(θ|\mathbb x_{1...I}) of the parameters

θ^=argmaxθ[Pr(θx1...I)]=argmaxθ[Pr(x1...Iθ)Pr(θ)Pr(θx1...I)]]=argmaxθ[i=1IPr(xiθ)Pr(θ)Pr(θx1...I)](1.0.4)\begin{aligned}\hat\theta &= \mathop{\arg\max}_\theta[Pr(θ|\mathbb x_{1...I})] \\ &= \mathop{\arg\max}_\theta\bigg[\frac{Pr(\mathbb x_{1...I}|\theta)Pr(\theta)}{Pr(θ|\mathbb x_{1...I})]}\bigg] \\ &= \mathop{\arg\max}_\theta \bigg[\frac{\prod_{i=1}^IPr(\mathbb x_i|\theta)Pr(\theta)}{Pr(θ|\mathbb x_{1...I})}\bigg]\end{aligned} \tag{1.0.4}

In fact, we can discard the denominator as it is constant with respect to the parameters and so does not affect the position of the maximum, and we get θ^=argmaxθ[i=1IPr(xiθ)Pr(θ)]\hat\theta = \mathop{\arg\max}_\theta\bigg[\prod_{i=1}^IPr(\mathbb x_i|\theta)Pr(\theta)\bigg]

Comparing this to the maximum likelihood criterion (Equation 1.0.3), we see that it is identical except for the additional prior term; maximum likelihood is a special case of maximum a posteriori where the prior is uninformative.

The Bayesian approach

The normal distribution

The most common representation for uncertainty in machine vision is the multivariate normal distribution.
Multivariate normal distribution has two parameters: the mean μμ and covariance Σ\varSigma. The mean μμ is a D×1D\times 1 vector that describes the position of the distribution. The covariance Σ\varSigma is a symmetric D×DD\times D positive definite matrix (implying that zTΣzz^T\varSigma z is positive for any real vector zz) and describes the shape of the distribution. The probability density function is

Pr(x)=1(2π)D/2Σ1/2exp[(xμ)TΣ1(xμ)2](1.0.5)Pr(\mathbb x) = \frac{1}{(2\pi)^{D/2}|\varSigma|^{1/2}}exp\bigg[-\frac{(\mathbb x - \mu)^T\varSigma^{-1}(\mathbb x-\mu)}{2}\bigg] \tag{1.0.5}

Covariance matrices in multivariate normals take three forms, termed spherical, diago- nal, and full covariances. For the two-dimensional (bivariate) case, these are

Σspher=[σ200σ2]Σdiag=[σ1200σ22]Σspher=[σ112σ122σ212σ222]\varSigma_{spher} = \begin{bmatrix}\sigma^2 & 0 \\ 0 & \sigma^2\end{bmatrix} \varSigma_{diag} = \begin{bmatrix}\sigma_1^2 & 0 \\ 0 & \sigma_2^2\end{bmatrix} \varSigma_{spher} = \begin{bmatrix}\sigma_{11}^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \sigma_{22}^2\end{bmatrix}

The spherical covariance matrix is a positive multiple of the identity matrix and so has the same value on all of the diagonal elements and zeros elsewhere. In the diagonal covariance matrix, each value on the diagonal has a different positive value. The full covariance matrix can have nonzero elements everywhere although the matrix is still constrained to be symmetric and positive definite so for the 2D example, σ122=σ212σ_{12}^2 = σ_{21}^2.

For the bivariate case, spherical covariances produce circular iso-density contours. Diagonal covariances produce ellipsoidal iso-contours that are aligned with the coordinate axes. Full covariances also produce ellipsoidal iso-density contours, but these may now take an arbitrary orientation.

When the covariance is spherical or diagonal, the individual variables are indepen- dent. For example, for the bivariate diagonal case with zero mean, we have

Pr(x1,x2)=12πΣexp[0.5(x1 x2)Σ1(x1x2)]=12πσ1σ2exp[0.5(x1 x2)(σ1200σ22)(x1x2)]=12πσ1exp[x122σ12]12πσ2exp[x222σ22]=Pr(x1)Pr(x2)\begin{aligned}Pr(x_1, x_2) &=\frac{1}{2\pi\sqrt{|\varSigma|}}exp\bigg[-0.5(x_1\space x_2)\varSigma^{-1}\binom{x_1}{x_2}\bigg]\\ &= \frac{1}{2\pi\sigma_1\sigma_2}exp\bigg[-0.5(x_1\space x_2)\begin{pmatrix}\sigma_1^{-2} & 0 \\ 0 &\sigma_2^{-2}\end{pmatrix}\binom{x_1}{x_2}\bigg]\\ &= \frac{1}{2\pi\sigma_1}exp\bigg[-\frac{x_1^2}{2\sigma_1^2}\bigg]\frac{1}{2\pi\sigma_2}exp\bigg[-\frac{x_2^2}{2\sigma_2^2}\bigg]\\ &= Pr(x_1)Pr(x_2)\end{aligned}

Decomposition of covariance

We can use the foregoing geometrical intuitions to decompose the full covariance matrix Σfull\varSigma_{full} . Given a normal distribution with mean zero and a full covariance matrix, we know that the iso-contours take an ellipsoidal form with the major and minor axes at arbitrary orientations.

With x=RxΣfull=RTΣdiagR\mathbb x' = R\mathbb x \to \varSigma_{full} = R^T\varSigma_{diag}^{'}R

Linear transformations of variables

The form of the multivariate normal is preserved under linear transformations y=Ax+b\mathbb y = A\mathbb x + \mathbb b.If the original distribution was

Pr(x)=Normx[μ,Σ]Pr(\mathbb x) = Norm_{\mathbb x}[\mu, \varSigma]

then the transformed variable y is distributed as

Pr(y)=Normy[Aμ+b,AΣAT]Pr(\mathbb y) = Norm_{\mathbb y}[A\mu + \mathbb b, A\varSigma A^T]

Marginal distributions

If we marginalize over any subset of random variables in a multivariate normal distribution, the remaining distribution is also normally distributed. If we partition the original random variable into two parts x=[x1T,x2T]T\mathbb x=[\mathbb x^T_1,x^T_2]^T so that

Pr(x)=Pr([x1x2])=Normx[[μ1μ2],[Σ11Σ21TΣ21Σ22]](1.0.6)Pr(\mathbb x) = Pr\bigg({\mathbb x_1\brack \mathbb x_2}\bigg) = Norm_{\mathbb x}\bigg[{\mu_1\brack \mu_2}, \begin{bmatrix}\varSigma_{11} & \varSigma_{21}^T\\\varSigma_{21}&\varSigma_{22}\end{bmatrix}\bigg] \tag{1.0.6}

Pr(x1)=Normx1[μ1,Σ11]Pr(\mathbb x_1) = Norm_{\mathbb x_1}[\mu_1, \varSigma_{11}]

Pr(x2)=Normx2[μ2,Σ22]Pr(\mathbb x_2) = Norm_{\mathbb x_2}[\mu_2, \varSigma_{22}]

Conditional distributions

If the variable x\mathbb x is distributed as a multivariate normal, then the conditional distribution of a subset of variables x1\mathbb x_1 given known values for the remaining variables x2\mathbb x_2 is also distributed as a multivariate normal as formual 1.0.6, then the conditional distributions are

Pr(x1x2=x2)=Normx1[μ1+Σ21TΣ221(x2μ2),Σ11Σ21TΣ221Σ21]Pr(\mathbb x_1|\mathbb x_2 = \mathbb x_2^*) = Norm_{\mathbb x_1}[\mu_1 + \varSigma_{21}^T\varSigma_{22}^{-1}(\mathbb x_2^* - \mu_2), \varSigma_{11} - \varSigma_{21}^T\varSigma_{22}^{-1}\varSigma_{21}]

Pr(x2x1=x1)=Normx2[μ2+Σ21Σ111(x1μ1),Σ22Σ21Σ111Σ21T]Pr(\mathbb x_2|\mathbb x_1 = \mathbb x_1^*) = Norm_{\mathbb x_2}[\mu_2 + \varSigma_{21}\varSigma_{11}^{-1}(\mathbb x_1^* - \mu_1), \varSigma_{22} - \varSigma_{21}\varSigma_{11}^{-1}\varSigma_{21}^T]

Product of two normals

The product of two normal distributions is proportional to a third normal distribution

Nromx[a,A]Normx[b,B]=kNormx[(A1+B1)1(A1a+B1b),(A1+B1)1](1.0.7)Nrom_{\mathbb x}[\mathbb a, A] Norm_{\mathbb x}[\mathbb b, B] =\\ \mathcal{k}\cdot Norm_{\mathbb x}\bigg[(A^{-1} + B^{-1})^{-1}(A^{-1}\mathbb a + B^{-1}\mathbb b), (A^{-1} + B^{-1})^{-1}\bigg]\tag{1.0.7}

where the constant κκ is itself a normal distribution

k=Norma[b,A+B]=Normb[a,A+B]k = Norm_{a}[\mathbb b, A + B] = Norm_b[\mathbb a, A+B]

Change of variable

Consider a normal distribution in variable x\mathbb x whose mean is a linear function Ay+bA\mathbb y + \mathbb b of a second variable y\mathbb y. We can reexpress this in terms of a normal distribution in y\mathbb y, which is a linear function Ax+bA′\mathbb x + \mathbb b′ of x\mathbb x so that

Normx[Ay+b,Σ]=kNormy[Ax+b,Σ](1.0.8)Norm_{\mathbb x}[A\mathbb y + \mathbb b, \varSigma] = k \cdot Norm_{\mathbb y}[A'\mathbb x + \mathbb b', \varSigma']\tag{1.0.8}

where κκ is a constant and the new parameters are given by

Σ=(ATΣ1A)1A=(ATΣ1A)1ATΣ1b=(ATΣ1A)1ATΣ1b(1.0.9)\begin{aligned}\varSigma' &= (A^T\varSigma^{-1}A)^{-1}\\ A'&= (A^T\varSigma^{-1}A)^{-1}A^T\varSigma^{-1}\\ \mathbb b' &=-(A^T\varSigma^{-1}A)^{-1}A^T\varSigma^{-1}\mathbb b \end{aligned} \tag{1.0.9}

Notes

  1. whitening transformation.

we can convert a normal distribution with mean μ and covariance ΣΣ to a new distribution with mean 0 and covariance II using the linear transformation y=Ax+by = Ax + b where

A=Σ1/2b=Σ1/2μA = \varSigma^{-1/2}\\ \mathbb b = -\varSigma^{-1/2}\mu

Machine learning for machine vision

Learning and inference in vision

In vision problems, we take visual data x\mathbb x and use them to infer the state of the world w\mathbb w. The world state w may be continuous (the 3D pose of a body model) or discrete (the presence or absence of a particular object). When the state is continuous, we call this inference process regression. When the state is discrete, we call it classification.

Example 1: Regression

Consider the situation where we make a univariate continuous measurement xx and use this to predict a univariate continuous state ω\omega. For example, we might predict the distance to a car in a road scene based on the number of pixels in its silhouette.

Model contingency of world on data (discriminative)

We define a probability distribution over the world state ω\omega and make its parameters contingent on the data xx.
Since the world state is univariate and continuous, we chose the univariate normal. We fix the variance, σ2\sigma^2 and make the mean μμ a linear function φ0+φ1xφ_0 +φ_1x of the data. So we have

Pr(ωx,θ)=Normω[ϕ0+ϕ1x,σ2](2.0.1)Pr(\omega|x, \mathbb \theta) = Norm_{\omega}[\phi_0 + \phi_1x, \sigma^2] \tag{2.0.1}

where θ={ϕ0,ϕ1,σ2}\mathbb \theta = \{\phi_0, \phi_1, \sigma^2\} are the unknown parameters of the model. This model is referred to as linear regression.
The learning algorithm estimates the model parameters θθ from paired training examples {xi,ωi}i=1I\{x_i, \omega_i\}_{i=1}^I. For example, in the MAP approach, we seek

θ^=argmaxθ[Pr(θω1...I,x1...I)]=argmaxθ[Pr(ω1...Ix1...I,θ)Pr(θ)]=argmaxθ[i=1IPr(ωIxi,θ)Pr(θ)]\begin{aligned}\hat\theta &= \mathop{\arg\max}_{\theta}[Pr(\theta|\omega_{1...I}, x_{1...I})] \\&=\mathop{\arg\max}_{\theta}[Pr(\omega_{1...I}|x_{1...I}, \theta)Pr(\theta)] \\ &=\mathop{\arg\max}_{\theta}[\prod_{i=1}^IPr(\omega_I|x_i, \theta)Pr(\theta)] \end{aligned}

where we have assumed that the I training pairs {xi,wi}i=1I\{x_i,w_i\}_{i=1}^I are independent, and defined a suitable prior Pr(θ)P r(θ).

Model the contingency of data on world (generative)

In the generative formulation, we choose a probability distribution over the data xx and make its parameters contingent on the world state ww. Since the data are univariate and continuous, we will model the data as a normal distribution with fixed variance, σ2σ^2 and a mean μμ that is a linear function φ0+φ1ωφ_0 +φ_1\omega of the world state. So that

Pr(xω,θ)=Normx[ϕ0+ϕ1ω,σ2](2.0.2)Pr(x|\omega, \theta) = Norm_{x}[\phi_0 + \phi_1\omega, \sigma^2] \tag{2.0.2}

Example 2: Binary classification

As a second example, we will consider the case where the observed measurement xx is univariate and continuous, but the world state ω\omega is discrete and can take one of two values. For example, we might wish to classify a pixel as belonging to a skin or non-skin region based on observing just the red channel.

Model contingency of world on data (discriminative)

We define a probability distribution over the world state ω{0,1}\omega \in \{0, 1\} and make its parameters contingent on the data xx. Since the world state is discrete and binary, we will use a Bernoulli distribution. This has a single parameter λλ, which determines the probability of success so that Pr(ω=1)=λPr(\omega = 1) = λ.
We make λλ a function of the data xx, but in doing so we must ensure the constraint 0λ10 ≤ λ ≤ 1 is obeyed. To this end, we form linear function φ0+φ1xφ_0 + φ_1x of the data xx, which returns a value in the range [][−∞ ∞]. We then pass the result through a function sig[]sig[•] that maps [][−∞ ∞] to [01][0 1], so that

Pr(ωx)=Bernω[sig[ϕ0+ϕ1x]]=Bernω[11+exp[ϕ0ϕ1x]](2.0.3)Pr(\omega|x) = Bern_\omega[sig[\phi_0 + \phi_1x]] = Bern_{\omega}\bigg[\frac{1}{1+exp[-\phi_0-\phi_1x]}\bigg] \tag{2.0.3}

Model contingency of data on world (generative)

We choose a probability distribution over the data xx and make its parameters contingent on the world state ww.Since the data are univariate and continuous, we will choose a univariate normal and allow the variance σ2σ^2 and the mean μμ to be functions of the binary world state ww. so that the likelihood is

Pr(xω,θ)=Normx[μω,σω2]Pr(x|\omega, \theta) = Norm_x[\mu_\omega, \sigma_\omega^2]

applications

Skin detection

The goal of skin-detection algorithms is to infer a label ω0,1\omega ∈ {0, 1} denoting the presence or absence of skin at a given pixel, based on the RGB measurements x=[xR,xG,xB]x = [x^R,x^G,x^B] at that pixel.This is a useful precursor to segmenting a face or hand, or it may be used as the basis of a crude method for detecting prurient content in Web images. Taking a generative approach, we describe the likelihoods as

Pr(xω=k)=Normx[μk,Σk]Pr(\mathbb x|\omega = k) = Norm_{\mathbb x}[\mu_k, \varSigma_k]

and the prior probability over states as Pr(ω)=Bernω[λ]Pr(\omega) = Bern_{\omega}[\lambda]
In the learning algorithm, we estimate the parameters μ0,μ1,Σ0,Σ1μ_0,μ_1,Σ_0,Σ_1 from training data pairs {wi,xi}i=1I\{w_i , x_i \}^I_{i=1} where the pixels have been labeled by hand.In particular, we learn μ0μ_0 and Σ0Σ_0 from the subset of the training data where ωi=0\omega_i = 0 and μ1μ_1 and Σ1Σ_1 from the subset where ωi=1\omega_i = 1.

Background subtraction

A second application of the generative classification model is for background subtraction. Here, the goal is to infer a binary label ωn{0,1}\omega_n \in \{0, 1\}, which indicates whether the nthn^{th} pixel in the image is part of a known background (ω=0\omega = 0) or whether a foreground object is occluding it (ω=1\omega = 1). As for the skin detection model, this is based on its RGB pixel data xnx_n at that pixel.

Modeling complex data densities

Regression models

Classification models

Connecting local models

The models in Part2 describe the relationship between a set of measurements and the world state. They work well when the measurements and the world state are both low dimensional. However, there are many situations where this is not the case, and these models are unsuitable.

For example, consider the semantic image labeling problem in which we wish to assign a label that denotes the object class to each pixel in the image. For example, in a road scene we might wish to label pixels as ‘road’, ‘sky’, ‘car’, ‘tree’, ‘building’ or ‘other’. For an image with N=10000N = 10000 pixels, this means we need to build a model relating the 10000 measured RGB triples to 6100006^{10000} possible world states. None of the models discussed so far can cope with this challenge: the number of parameters involved (and hence the amount of training data and the computational requirements of the learning and inference algorithms) is far beyond what current machines can handle.

Graphical models

Models for chains and trees

Models for grids

Preprocessing

Image preprocessing and feature extraction

Per-pixel transformations

Whitening

The goal of whitening is to provide invariance to fluctuations in the mean intensity level and contrast of the image. Such variation may arise because of a change in ambient lighting intensity, the object reflectance, or the camera gain. To compensate for these factors, the image is transformed so that the resulting pixel values have zero mean and unit variance. To this end, we compute the mean μμ and variance σ2σ^2 of the original grayscale image P.

μ=i=1Ij=1JpijIJ\mu = \frac{\sum_{i=1}^I\sum_{j=1}^Jp_{ij}}{IJ}

σ2=i=1Ij=1J(pijμ)2IJ\sigma^2 = \frac{\sum_{i=1}^I\sum_{j=1}^J(p_{ij} - \mu)^2}{IJ}

These statistics are used to transform each pixel value separately so that

xij=pijμσx_{ij} = \frac{p_{ij} - \mu}{\sigma}

Histogram equalization

The goal of histogram equalization is to modify the statistics of the intensity values so that all of their moments take predefined values. To this end, a nonlinear transformation is applied that forces the distribution of pixel intensities to be flat.

  1. compute the histogram of the original intensities h where the kth of K entries is given by

hk=i=1Ij=1Jδ[pijk]h_k = \sum_{i=1}^I\sum_{j=1}^J\delta[p_{ij} - k]

where the operation δ[]\delta[\cdot] returns one if the argument is zero and zero otherwise.

  1. cumulatively sum this histogram and normalize by the total number of pixels to compute the cumulative proportion cc of pixels that are less than or equal to each intensity level.

ck=l=1khlIJc_k = \frac{\sum_{l=1}^kh_l}{IJ}

  1. we use the cumulative histogram as a look up table to compute the transformed value so that

xij=Kcpijx_{ij} = Kc_{p_{ij}}

Linear filtering

We apply a filter, we convolve the P with the filter F, where two-dimensional convolution is defined as

xij=m=MMn=NNpim,jnfm,nx_{ij} = \sum_{m=-M}^M\sum_{n=-N}^Np_{i-m,j-n}f_{m,n}

Models for geometry [VIP]

The pinhole camera

Models for transformations

Multiple cameras

Models for vision [VIP]

Models for shape