Computer Vision: Models, Learning, and Inference

Probability

Definition

Conditional probability

The conditional probability of $x$ given that $y$ takes value $y^*$ tells us the relative propensity of the random variable $x$ to take different outcomes given that the random variable $y$ is fixed to value $y^*$ . This conditional probability is written as $Pr(x|y = y^*)$ .
The conditional probability $Pr(x|y = y^∗)$ can be recovered from the joint distribution $Pr(x,y)$ .

In particular, we examine the appropriate slice $Pr(x,y = y^*)$ of the joint distribution. The values in the slice tell us about the relative probability that $x$ takes various values having observed $y = y^*$ , but they do not themselves form a valid probability distribution: they cannot sum to one because they constitute only a small part of the joint distribution, which itself sums to one. To calculate the conditional probability distribution, we hence normalize by the total probability in the slice:

$Pr(x|y=y^*) = \frac{Pr(x, y = y^*)}{\int Pr(x, y=y^*)dx} = \frac{Pr(x, y = y^*)}{Pr(y=y^*)} \tag{1.0.1}$

This is usually abbreviated to $Pr(x|y) = \frac{Pr(x,y)}{Pr(y)}$ . Rearranging gives two factorizations of the joint distribution,

$\begin{cases}Pr(x,y) = Pr(x|y) Pr(y) \\Pr(x,y) = Pr(y|x)Pr(x)\end{cases} \to Pr(x|y)Pr(y) = Pr(y|x) Pr(x)$

and applying the same rule repeatedly factorizes a joint distribution over more variables:

$\begin{aligned}Pr(\omega, x, y, z) &= Pr(\omega, x, y|z)Pr(z)\\&= Pr(\omega, x|y,z)Pr(y|z)Pr(z)\\&=Pr(\omega|x, y,z)Pr(x|y,z)Pr(y|z)Pr(z)\end{aligned}$
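The slice-and-normalize recipe can be sketched for a discrete joint distribution stored as a table. The table values and the helper name `conditional_x_given_y` are made up for illustration; only the normalization step comes from the text.

```python
import numpy as np

# Hypothetical joint distribution Pr(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.15],
                  [0.05, 0.20]])

def conditional_x_given_y(joint, y_star):
    """Return Pr(x | y = y_star) by normalizing one column of the joint."""
    slice_ = joint[:, y_star]        # Pr(x, y = y*), does not sum to one
    return slice_ / slice_.sum()     # divide by Pr(y = y*)

cond = conditional_x_given_y(joint, y_star=1)
print(cond, cond.sum())
```

The column sum plays the role of $Pr(y = y^*)$ ; in the continuous case it would be replaced by an integral over $x$ .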

Expectation

Given a function $f[\cdot]$ that returns a value for each possible value $x^*$ of the variable $x$ and a probability $Pr(x = x^*)$ that each value of $x$ occurs, we sometimes wish to calculate the expected output of the function. If we drew a very large number of samples from the probability distribution, calculated the function for each sample, and took the average of these values, the result would be the expectation.

The expected value of a function $f[\cdot]$ of a random variable $x$ is defined as
$E[f[x]] = \sum_xf[x]Pr(x) \tag{1.0.2}$
$E[f[x]] = \int f[x]Pr(x)dx \tag{1.0.2}$
for the discrete and continuous cases, respectively. This idea generalizes to functions $f[\cdot]$ of more than one random variable so that, for example $\displaystyle E[f[x,y]] = \int \int f[x,y]Pr(x,y)dxdy$
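As a small illustration of the "average over many samples" reading, the sketch below compares the exact discrete sum with a sample average. The distribution, the function `f`, and the sample count are arbitrary choices made for this example.

```python
import random

# Hypothetical discrete distribution over x in {1, 2, 3, 4}.
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

def f(x):
    return x ** 2

# Exact expectation: sum over x of f[x] Pr(x).
exact = sum(f(x) * p for x, p in zip(values, probs))

# Sample-based estimate: average f over many draws from Pr(x).
rng = random.Random(0)
samples = rng.choices(values, weights=probs, k=200_000)
estimate = sum(f(x) for x in samples) / len(samples)

print(exact, estimate)
```

The estimate fluctuates around the exact value and tightens as the number of samples grows.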

Special cases of expectation. For some functions $f[x]$ , the expectation $E[f[x]]$ is given a special name. Here we use the notation $\mu_x$ to represent the mean with respect to random variable $x$ and $\mu_y$ the mean with respect to random variable $y$ .

Function $f[\cdot]$	Expectation
$x$	mean, $\mu_x$
$x^k$	$k^{th}$ moment about zero
$(x-\mu_x)^k$	$k^{th}$ moment about the mean
$(x-\mu_x)^2$	variance
$(x-\mu_x)^3$	skew
$(x-\mu_x)^4$	kurtosis
$(x-\mu_x)(y - \mu_y)$	covariance of $x$ and $y$

There are four rules for manipulating expectations, each of which can be proved directly from the definition:

The expected value of a constant $k$ with respect to the random variable $x$ is just the constant itself

$E[k] = k$

The expected value of a constant $κ$ times a function $f[x]$ of the random variable $x$ is $κ$ times the expected value of the function

$E[κf [x]] = κE[f [x]]$

The expected value of the sum of two functions of a random variable $x$ is the sum of the individual expected values of the functions

$E[f [x] + g[x]] = E[f [x]] + E[g[x]]$

The expected value of the product of two functions $f[x]$ and $g[y]$ of random variables $x$ and $y$ is equal to the product of the individual expected values if the variables $x$ and $y$ are independent

$E[f [x]g[y]] = E[f [x]]E[g[y]] \hspace{2em} \text{where x, y independent}$

Using these rules, we can derive the relationship between the second moment about zero and the second moment about the mean (the variance):

$\begin{aligned}E[(x-\mu)^2] &= E[x^2 - 2x\mu + \mu^2] \\&= E[x^2] -2E[x\mu] + E[\mu^2] \\&= E[x^2] - 2\mu E[x] + E[\mu^2] \\ &= E[x^2] - 2E[x]E[x] + E[x]E[x] \\ &= E[x^2] - E[x]E[x]\end{aligned}$
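The identity can be checked numerically on any small distribution. The one below is hypothetical, and `expect` is just a helper name for the discrete expectation sum.

```python
# Hypothetical discrete distribution.
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

def expect(f):
    """Discrete expectation E[f[x]] = sum_x f[x] Pr(x)."""
    return sum(f(x) * p for x, p in zip(values, probs))

mu = expect(lambda x: x)                        # E[x]
variance = expect(lambda x: (x - mu) ** 2)      # E[(x - mu)^2]
shortcut = expect(lambda x: x ** 2) - mu ** 2   # E[x^2] - E[x]E[x]

print(mu, variance, shortcut)
```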

Common probability distributions

The choice of distribution depends on the type and domain of the data to be modeled.

Data Type	Domain	Distribution
univariate, discrete, binary	$x\in\{0,1\}$	Bernoulli
univariate, discrete, multivalued	$x\in\{1,2,\cdots,K\}$	categorical
univariate, continuous, unbounded	$x\in \Bbb{R}$	univariate normal
univariate, continuous, bounded	$x\in[0,1]$	beta
multivariate, continuous, unbounded	$x\in \Bbb{R}^K$	multivariate normal
multivariate, continuous, bounded, sums to one	$\mathbb x = [x_1,x_2,\cdots,x_K]^T\\x_k\in[0,1],\sum_{k=1}^Kx_k = 1$	Dirichlet
$\text{bivariate, continuous,}\\x_1 \text{unbounded,} \\ x_2 \text{bounded below}$	$\mathbb x = [x_1,x_2]\\x_1\in \Bbb{R}\\x_2\in\Bbb{R}^+$	normal-scaled inverse gamma
$\text{vector x and matrix X}\\\text{x unbounded,}\\\text{X square, positive definite}$	$\mathbb x \in \Bbb{R}^K\\X\in\Bbb{R}^{K\times K}\\z^TXz\gt 0 \space\forall \mathbb z\in \Bbb{R}^K$	normal inverse Wishart

Probability distributions such as the categorical and normal distributions are obviously useful for modeling visual data. However, the need for some of the other distributions is not so obvious; for example, the Dirichlet distribution models $K$ positive numbers that sum to one. Visual data do not normally take this form.

Bernoulli distribution

The Bernoulli distribution is a discrete distribution that models binary trials: it describes the situation where there are only two possible outcomes $x \in \{0, 1\}$ which are referred to as “failure” and “success.”
The Bernoulli has a single parameter $\lambda \in [0,1]$ which defines the probability of observing a success $x = 1$ . The distribution is hence

$\begin{aligned}Pr(x= 0) &= 1- \lambda \\Pr(x=1) &= \lambda\\Pr(x) &= \lambda^x(1-\lambda)^{1-x}\end{aligned}$

Beta distribution

The beta distribution is defined on $[0,1]$ and has parameters $(\alpha,\beta)$ whose relative values determine the expected value so $\displaystyle E[\lambda] = \frac{\alpha}{\alpha + \beta}$ .

Mathematically, the beta distribution has the form
$Pr(\lambda) = \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha]\Gamma[\beta]}\lambda^{\alpha - 1}(1-\lambda)^{\beta - 1}$
where $\Gamma[\cdot]$ is the gamma function.
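A minimal sketch of the density, using only the standard library. The parameter values and the midpoint-rule grid are arbitrary; the grid serves only to check that the density integrates to one and that its mean matches $\alpha/(\alpha+\beta)$ .

```python
import math

def beta_pdf(lam, alpha, beta):
    """Density of the beta distribution at lam in (0, 1)."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * lam ** (alpha - 1) * (1 - lam) ** (beta - 1)

# Midpoint-rule integration on a fine grid (an illustrative check only).
alpha, beta = 2.0, 5.0
n = 10_000
grid = [(i + 0.5) / n for i in range(n)]
total = sum(beta_pdf(l, alpha, beta) for l in grid) / n
mean = sum(l * beta_pdf(l, alpha, beta) for l in grid) / n

print(total, mean, alpha / (alpha + beta))
```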

Categorical distribution

The categorical distribution is a discrete distribution that determines the probability of observing one of $K$ possible outcomes. Hence, the Bernoulli distribution is a special case of the categorical distribution when there are only two outcomes.
The probabilities of observing the $K$ outcomes are held in a $K \times 1$ parameter vector $\lambda = [λ_1,λ_2,\cdots,λ_K]$ where $\lambda_k \in [0,1]$ and $\sum_{k=1}^K \lambda_k = 1$ . The categorical distribution can be visualized as a normalized histogram with $K$ bins and can be written as

$Pr(x = k) = \lambda_k$

Matrix perspective
Alternatively, we can think of the data as taking values $x \in \{e_1,e_2,\cdots,e_K\}$ where $e_k$ is the $k^{th}$ unit vector. All elements of $e_k$ are zero except the $k^{th}$ , which is one. Here we can write
$Pr(\mathbb x = e_k) = \prod_{j=1}^K\lambda_j^{x_j} = \lambda_k$
where $x_j$ is the $j^{th}$ element of $\mathbb x$ .
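The product form can be checked directly: every factor with $x_j = 0$ contributes one, so only $\lambda_k$ survives. The parameter vector below is hypothetical.

```python
import numpy as np

lam = np.array([0.2, 0.5, 0.3])   # hypothetical parameters, sum to one

def categorical_prob_one_hot(x, lam):
    """Pr(x = e_k) = prod_j lam_j ** x_j for a one-hot vector x."""
    return float(np.prod(lam ** x))

e2 = np.array([0, 1, 0])           # second unit vector
p = categorical_prob_one_hot(e2, lam)
print(p)
```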

Dirichlet distribution

The Dirichlet distribution is defined over $K$ continuous values $λ_1,λ_2,\cdots,λ_K$ where $λ_k ∈ [0, 1]$ and $\sum_{k=1}^K λ_k = 1$ . Hence it is suitable for defining a distribution over the parameters of the categorical distribution.

In $K$ dimensions the Dirichlet distribution has $K$ parameters $α_1,\cdots,α_K$ each of which can take any positive value. The relative values of the parameters determine the expected values $E[λ_1],\cdots,E[λ_K]$ . The absolute values determine the concentration around the expected value. We write

$Pr(\lambda_{1\cdots K}) = \frac{\Gamma[\sum_{k=1}^K\alpha_k]}{\prod_{k=1}^K\Gamma[\alpha_k]}\prod_{k=1}^K\lambda_k^{\alpha_k - 1}$

Just as the Bernoulli distribution was a special case of the categorical distribution with two possible outcomes, so the beta distribution is a special case of the Dirichlet distribution where the dimensionality is two.
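The special-case relationship can be checked numerically. The sketch below evaluates a two-dimensional Dirichlet density and the corresponding beta density at the same point; the parameter values are arbitrary and the function names are illustrative.

```python
import math

def dirichlet_pdf(lam, alpha):
    """Dirichlet density; entries of lam must be positive and sum to one."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1) * math.log(l) for l, a in zip(lam, alpha))
    return math.exp(log_norm + log_kernel)

def beta_pdf(lam, alpha, beta):
    """Beta density at lam in (0, 1)."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * lam ** (alpha - 1) * (1 - lam) ** (beta - 1)

# With K = 2, lam_2 = 1 - lam_1, so the two densities should agree.
d = dirichlet_pdf([0.3, 0.7], [2.0, 5.0])
b = beta_pdf(0.3, 2.0, 5.0)
print(d, b)
```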

Univariate normal distribution

The normal distribution has two parameters, the mean $μ$ and the variance $σ^2$ . The parameter $μ$ can take any value and determines the position of the peak. The parameter $σ^2$ takes only positive values and determines the width of the distribution. The normal distribution is defined as

$Pr(x) = \frac{1}{\sqrt{2\pi\sigma^2}}exp\bigg[-\frac{(x-\mu)^2}{2\sigma^2}\bigg]$

Normal-scaled inverse gamma distribution[TODO]

The normal-scaled inverse gamma distribution is defined over a pair of continuous values $μ,σ^2$ , the first of which can take any value and the second of which is constrained to be positive. As such it can define a distribution over the mean and variance parameters of the normal distribution.

Multivariate normal distribution

The multivariate normal or Gaussian distribution models $D$-dimensional variables $\mathbb x$ where each of the $D$ elements $x_1 \cdots x_D$ is continuous and lies in the range $[-\infty, + \infty]$ .

The multivariate normal distribution has two parameters: the mean $μ$ and covariance $\varSigma$ . The mean $μ$ is a $D \times 1$ vector that describes the mean of the distribution. The covariance $\varSigma$ is a symmetric $D \times D$ positive definite matrix so that $z^T\varSigma z$ is positive for any real vector $z$ . The probability density function has the following form

$Pr(\mathbb x) = \frac{1}{(2\pi)^{D/2}|\varSigma|^{1/2}}exp\bigg[-\frac{(\mathbb x - \mu)^T\varSigma^{-1}(\mathbb x-\mu)}{2}\bigg]$
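One way to evaluate this density in practice is to work in the log domain and avoid forming $\varSigma^{-1}$ explicitly. This is a sketch with NumPy, not a prescribed implementation; the function name and example parameters are assumptions.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log of the multivariate normal density at x."""
    D = mu.shape[0]
    diff = x - mu
    # Solve Sigma * v = diff rather than forming the inverse explicitly.
    maha = diff @ np.linalg.solve(Sigma, diff)
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# At the mean the exponent vanishes, leaving only the normalizing constant.
density_at_mean = np.exp(mvn_logpdf(mu, mu, Sigma))
print(density_at_mean)
```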

Normal inverse Wishart distribution[TODO]

The normal inverse Wishart distribution defines a distribution over a $D \times 1$ vector $μ$ and a $D \times D$ positive definite matrix $\varSigma$ . As such it is suitable for describing uncertainty in the parameters of a multivariate normal distribution.

Conjugacy

We have argued that the beta distribution can represent probabilities over the parameters of the Bernoulli. Similarly the Dirichlet defines a distribution over the parameters of the categorical, and there are analogous relationships between the normal-scaled inverse gamma and univariate normal and the normal inverse Wishart and the multivariate normal.

These pairs were carefully chosen because they have a special relationship: in each case, the former distribution is conjugate to the latter: the beta is conjugate to the Bernoulli and the Dirichlet is conjugate to the categorical and so on. When we multiply a distribution with its conjugate, the result is proportional to a new distribution which has the same form as the conjugate.
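For the beta and Bernoulli pair this can be checked by hand: multiplying $\lambda^x(1-\lambda)^{1-x}$ by the beta density gives a term in $\lambda^{\alpha+x-1}(1-\lambda)^{\beta-x}$ , which has the form of a beta density with parameters $(\alpha + x, \beta + 1 - x)$ . The sketch below confirms numerically that the ratio of the product to that beta density does not depend on $\lambda$ ; the parameter values are arbitrary.

```python
import math

def beta_pdf(lam, alpha, beta):
    """Beta density at lam in (0, 1)."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * lam ** (alpha - 1) * (1 - lam) ** (beta - 1)

def bernoulli_pmf(x, lam):
    """Pr(x) = lam^x (1 - lam)^(1 - x) for x in {0, 1}."""
    return lam ** x * (1 - lam) ** (1 - x)

alpha, beta, x = 2.0, 3.0, 1

# Ratio of (Bernoulli * beta) to the candidate Beta(alpha + x, beta + 1 - x).
ratios = [
    bernoulli_pmf(x, lam) * beta_pdf(lam, alpha, beta)
    / beta_pdf(lam, alpha + x, beta + 1 - x)
    for lam in (0.1, 0.4, 0.8)
]
print(ratios)
```

The constant ratio is the proportionality factor mentioned in the text.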

Fitting probability models

This chapter concerns fitting probability models to data $\{\mathbf x_i\}_{i=1}^I$ . This process is referred to as learning because we learn about the parameters $θ$ of the model. It also concerns calculating the probability of a new datum $\mathbb x^*$ under the resulting model. This is known as evaluating the predictive distribution. We consider three methods: maximum likelihood, maximum a posteriori, and the Bayesian approach.

Maximum likelihood

As the name suggests, the maximum likelihood (ML) method finds the set of parameters $\hat θ$ under which the data $\{\mathbf x_i\}_{i=1}^I$ are most likely. To calculate the likelihood function $Pr(\mathbb x_i|θ)$ at a single data point $\mathbb x_i$ , we simply evaluate the probability density function at $\mathbb x_i$ . Assuming each data point was drawn independently from the distribution, the likelihood function $Pr(\mathbb x_{1...I} |θ)$ for a set of points is the product of the individual likelihoods. Hence, the ML estimate of the parameters is

$\hat\theta = \mathop{\arg\max}_{\theta}[Pr(\mathbb x_{1...I}|\theta)] = \mathop{\arg\max}_{\theta} \bigg[\prod_{i=1}^IPr(\mathbb x_i|\theta)\bigg] \tag{1.0.3}$

To evaluate the predictive distribution for a new data point $\mathbb x^∗$ (compute the probability that $\mathbb x^∗$ belongs to the fitted model), we simply evaluate the probability density function $Pr(\mathbb x^*|\hat\theta)$ using the ML fitted parameters $\hat \theta$ .
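As a concrete case, the sketch below fits a univariate normal by maximum likelihood. It relies on the standard closed-form result (not derived in these notes) that the ML estimates are the sample mean and the variance computed with divisor $I$ . The data values are made up.

```python
import math

data = [2.1, 1.9, 3.2, 2.8, 2.5]   # hypothetical observations

# Closed-form ML estimates for a univariate normal.
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

def log_likelihood(mu, var):
    """Log of the product of individual likelihoods (Equation 1.0.3)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

def predictive_density(x_new):
    """Pr(x* | theta_hat): the fitted density evaluated at a new point."""
    return (math.exp(-(x_new - mu_hat) ** 2 / (2 * var_hat))
            / math.sqrt(2 * math.pi * var_hat))

print(mu_hat, var_hat, predictive_density(2.4))
```

Working with the log of the product is a common design choice: it has the same maximizer and avoids numerical underflow for large $I$ .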

Maximum a posteriori

In maximum a posteriori (MAP) fitting, we introduce prior information about the parameters $θ$ . As the name suggests, maximum a posteriori estimation maximizes the posterior probability $Pr(θ|\mathbb x_{1...I})$ of the parameters

$\begin{aligned}\hat\theta &= \mathop{\arg\max}_\theta[Pr(\theta|\mathbb x_{1...I})] \\ &= \mathop{\arg\max}_\theta\bigg[\frac{Pr(\mathbb x_{1...I}|\theta)Pr(\theta)}{Pr(\mathbb x_{1...I})}\bigg] \\ &= \mathop{\arg\max}_\theta \bigg[\frac{\prod_{i=1}^IPr(\mathbb x_i|\theta)Pr(\theta)}{Pr(\mathbb x_{1...I})}\bigg]\end{aligned} \tag{1.0.4}$
In fact, we can discard the denominator as it is constant with respect to the parameters and so does not affect the position of the maximum, and we get $\hat\theta = \mathop{\arg\max}_\theta\bigg[\prod_{i=1}^IPr(\mathbb x_i|\theta)Pr(\theta)\bigg]$

Comparing this to the maximum likelihood criterion (Equation 1.0.3), we see that it is identical except for the additional prior term; maximum likelihood is a special case of maximum a posteriori where the prior is uninformative.
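A small example makes the role of the prior visible. For Bernoulli data with a beta prior, the posterior is proportional to $\lambda^{s+\alpha-1}(1-\lambda)^{I-s+\beta-1}$ , where $s$ is the number of successes, so its mode is $(s+\alpha-1)/(I+\alpha+\beta-2)$ . This closed form is a standard result assumed here rather than taken from the notes, and the data are hypothetical.

```python
def bernoulli_ml(data):
    """ML estimate of lambda: the fraction of successes."""
    return sum(data) / len(data)

def bernoulli_map(data, alpha, beta):
    """Mode of the posterior under a Beta(alpha, beta) prior (alpha, beta >= 1)."""
    return (sum(data) + alpha - 1) / (len(data) + alpha + beta - 2)

data = [1, 1, 1, 0]                        # hypothetical binary observations
lam_ml = bernoulli_ml(data)                # 0.75
lam_map = bernoulli_map(data, 2.0, 2.0)    # pulled toward 0.5 by the prior
lam_flat = bernoulli_map(data, 1.0, 1.0)   # uniform prior reproduces ML

print(lam_ml, lam_map, lam_flat)
```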

The Bayesian approach

The normal distribution

The most common representation for uncertainty in machine vision is the multivariate normal distribution.
The multivariate normal distribution has two parameters: the mean $μ$ and covariance $\varSigma$ . The mean $μ$ is a $D\times 1$ vector that describes the position of the distribution. The covariance $\varSigma$ is a symmetric $D\times D$ positive definite matrix (implying that $z^T\varSigma z$ is positive for any real vector $z$ ) and describes the shape of the distribution. The probability density function is

$Pr(\mathbb x) = \frac{1}{(2\pi)^{D/2}|\varSigma|^{1/2}}exp\bigg[-\frac{(\mathbb x - \mu)^T\varSigma^{-1}(\mathbb x-\mu)}{2}\bigg] \tag{1.0.5}$

Covariance matrices in multivariate normals take three forms, termed spherical, diagonal, and full covariances. For the two-dimensional (bivariate) case, these are

$\varSigma_{spher} = \begin{bmatrix}\sigma^2 & 0 \\ 0 & \sigma^2\end{bmatrix} \quad \varSigma_{diag} = \begin{bmatrix}\sigma_1^2 & 0 \\ 0 & \sigma_2^2\end{bmatrix} \quad \varSigma_{full} = \begin{bmatrix}\sigma_{11}^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \sigma_{22}^2\end{bmatrix}$

The spherical covariance matrix is a positive multiple of the identity matrix and so has the same value on all of the diagonal elements and zeros elsewhere. In the diagonal covariance matrix, each value on the diagonal has a different positive value. The full covariance matrix can have nonzero elements everywhere although the matrix is still constrained to be symmetric and positive definite so for the 2D example, $σ_{12}^2 = σ_{21}^2$ .

For the bivariate case, spherical covariances produce circular iso-density contours. Diagonal covariances produce ellipsoidal iso-contours that are aligned with the coordinate axes. Full covariances also produce ellipsoidal iso-density contours, but these may now take an arbitrary orientation.

When the covariance is spherical or diagonal, the individual variables are independent. For example, for the bivariate diagonal case with zero mean, we have
$\begin{aligned}Pr(x_1, x_2) &=\frac{1}{2\pi\sqrt{|\varSigma|}}exp\bigg[-0.5(x_1\space x_2)\varSigma^{-1}\binom{x_1}{x_2}\bigg]\\ &= \frac{1}{2\pi\sigma_1\sigma_2}exp\bigg[-0.5(x_1\space x_2)\begin{pmatrix}\sigma_1^{-2} & 0 \\ 0 &\sigma_2^{-2}\end{pmatrix}\binom{x_1}{x_2}\bigg]\\ &= \frac{1}{\sqrt{2\pi\sigma_1^2}}exp\bigg[-\frac{x_1^2}{2\sigma_1^2}\bigg]\cdot\frac{1}{\sqrt{2\pi\sigma_2^2}}exp\bigg[-\frac{x_2^2}{2\sigma_2^2}\bigg]\\ &= Pr(x_1)Pr(x_2)\end{aligned}$
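The factorization can be confirmed numerically by evaluating the bivariate density with a diagonal covariance and the product of the two univariate densities at the same point. The variances and the test point are arbitrary.

```python
import math

def norm_pdf(x, mu, var):
    """Univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bivariate_diag_pdf(x1, x2, var1, var2):
    """Zero-mean bivariate normal with covariance diag(var1, var2)."""
    det = var1 * var2
    quad = x1 ** 2 / var1 + x2 ** 2 / var2
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

joint = bivariate_diag_pdf(0.3, -1.2, 2.0, 0.5)
product = norm_pdf(0.3, 0.0, 2.0) * norm_pdf(-1.2, 0.0, 0.5)
print(joint, product)
```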

Decomposition of covariance

We can use the foregoing geometrical intuitions to decompose the full covariance matrix $\varSigma_{full}$ . Given a normal distribution with mean zero and a full covariance matrix, we know that the iso-contours take an ellipsoidal form with the major and minor axes at arbitrary orientations.

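Now view the distribution in a new coordinate frame aligned with the axes of the ellipse, in which the covariance $\varSigma_{diag}^{'}$ is diagonal. If the two frames are related by the rotation $\mathbb x' = R\mathbb x$ , substituting into the density in the new frame gives the quadratic term $\mathbb x^TR^T\varSigma_{diag}^{'-1}R\mathbb x$ , and since $R^{-1} = R^T$ we obtain

$\varSigma_{full} = R^T\varSigma_{diag}^{'}R$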
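In practice this decomposition can be obtained from an eigendecomposition of the covariance. The sketch below uses `numpy.linalg.eigh`; the covariance values are made up. Note that the orthogonal matrix returned may include a reflection as well as a rotation.

```python
import numpy as np

Sigma_full = np.array([[2.0, 0.8],
                       [0.8, 1.0]])

# eigh returns eigenvalues w and orthonormal eigenvectors as columns of V,
# so that Sigma_full = V @ diag(w) @ V.T.
w, V = np.linalg.eigh(Sigma_full)
R = V.T                      # maps x into the axis-aligned frame: x' = R x
Sigma_diag = np.diag(w)      # variances along the principal axes
reconstructed = R.T @ Sigma_diag @ R

print(w)
print(reconstructed)
```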

Linear transformations of variables

The form of the multivariate normal is preserved under linear transformations $\mathbb y = A\mathbb x + \mathbb b$ . If the original distribution was

$Pr(\mathbb x) = Norm_{\mathbb x}[\mu, \varSigma]$

then the transformed variable $\mathbb y$ is distributed as

$Pr(\mathbb y) = Norm_{\mathbb y}[A\mu + \mathbb b, A\varSigma A^T]$
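A quick empirical check of this rule: compute the predicted mean and covariance, then compare them with statistics of transformed samples. All parameter values below are arbitrary, and the seed is fixed only to make the run repeatable.

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
A = np.array([[2.0, 0.0],
              [1.0, -1.0]])
b = np.array([0.5, -0.5])

# Predicted parameters of y = A x + b.
mu_y = A @ mu + b
Sigma_y = A @ Sigma @ A.T

# Empirical check with seeded samples (rows of x are individual draws).
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b
emp_mean = y.mean(axis=0)
emp_cov = np.cov(y, rowvar=False)

print(mu_y, emp_mean)
print(Sigma_y, emp_cov)
```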

Marginal distributions

If we marginalize over any subset of random variables in a multivariate normal distribution, the remaining distribution is also normally distributed. If we partition the original random variable into two parts $\mathbb x=[\mathbb x^T_1,\mathbb x^T_2]^T$ so that

$Pr(\mathbb x) = Pr\bigg({\mathbb x_1\brack \mathbb x_2}\bigg) = Norm_{\mathbb x}\bigg[{\mu_1\brack \mu_2}, \begin{bmatrix}\varSigma_{11} & \varSigma_{21}^T\\\varSigma_{21}&\varSigma_{22}\end{bmatrix}\bigg] \tag{1.0.6}$

then the marginal distributions are

$Pr(\mathbb x_1) = Norm_{\mathbb x_1}[\mu_1, \varSigma_{11}]$
$Pr(\mathbb x_2) = Norm_{\mathbb x_2}[\mu_2, \varSigma_{22}]$

Conditional distributions

If the variable $\mathbb x$ is distributed as a multivariate normal, then the conditional distribution of a subset of variables $\mathbb x_1$ given known values for the remaining variables $\mathbb x_2$ is also a multivariate normal. With $\mathbb x$ partitioned as in Equation 1.0.6, the conditional distributions are

$Pr(\mathbb x_1|\mathbb x_2 = \mathbb x_2^*) = Norm_{\mathbb x_1}[\mu_1 + \varSigma_{21}^T\varSigma_{22}^{-1}(\mathbb x_2^* - \mu_2), \varSigma_{11} - \varSigma_{21}^T\varSigma_{22}^{-1}\varSigma_{21}]$
$Pr(\mathbb x_2|\mathbb x_1 = \mathbb x_1^*) = Norm_{\mathbb x_2}[\mu_2 + \varSigma_{21}\varSigma_{11}^{-1}(\mathbb x_1^* - \mu_1), \varSigma_{22} - \varSigma_{21}\varSigma_{11}^{-1}\varSigma_{21}^T]$
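The first conditional formula translates almost line for line into NumPy. The function name and the one-dimensional blocks used in the example are illustrative; the blocks follow the partition of Equation 1.0.6, with `S21` the lower-left block.

```python
import numpy as np

def condition_on_x2(mu1, mu2, S11, S21, S22, x2_star):
    """Mean and covariance of Pr(x1 | x2 = x2_star) for a partitioned normal."""
    gain = S21.T @ np.linalg.inv(S22)      # Sigma_21^T Sigma_22^{-1}
    mean = mu1 + gain @ (x2_star - mu2)
    cov = S11 - gain @ S21
    return mean, cov

# Hypothetical bivariate example with scalar blocks.
mu1, mu2 = np.array([0.0]), np.array([1.0])
S11 = np.array([[2.0]])
S21 = np.array([[0.6]])
S22 = np.array([[1.0]])
mean, cov = condition_on_x2(mu1, mu2, S11, S21, S22, np.array([2.0]))
print(mean, cov)
```

Observing $\mathbb x_2$ shifts the mean of $\mathbb x_1$ and can only reduce its covariance, never increase it.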

Product of two normals

The product of two normal distributions is proportional to a third normal distribution

$Norm_{\mathbb x}[\mathbb a, A] Norm_{\mathbb x}[\mathbb b, B] = \kappa\cdot Norm_{\mathbb x}\bigg[(A^{-1} + B^{-1})^{-1}(A^{-1}\mathbb a + B^{-1}\mathbb b), (A^{-1} + B^{-1})^{-1}\bigg]\tag{1.0.7}$
where the constant $\kappa$ is itself a normal distribution
$\kappa = Norm_{\mathbb a}[\mathbb b, A + B] = Norm_{\mathbb b}[\mathbb a, A+B]$
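The identity is easy to check in one dimension, where the covariances become scalar variances. The means, variances, and evaluation point below are arbitrary.

```python
import math

def norm_pdf(x, mean, var):
    """Univariate normal density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

a, A = 1.0, 2.0    # mean and variance of the first normal
b, B = 3.0, 0.5    # mean and variance of the second normal

var_new = 1.0 / (1.0 / A + 1.0 / B)
mean_new = var_new * (a / A + b / B)
kappa = norm_pdf(a, b, A + B)          # the scaling constant

x = 1.7
lhs = norm_pdf(x, a, A) * norm_pdf(x, b, B)
rhs = kappa * norm_pdf(x, mean_new, var_new)
print(lhs, rhs)
```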

Change of variable

Consider a normal distribution in variable $\mathbb x$ whose mean is a linear function $A\mathbb y + \mathbb b$ of a second variable $\mathbb y$ . We can re-express this in terms of a normal distribution in $\mathbb y$ whose mean is a linear function $A'\mathbb x + \mathbb b'$ of $\mathbb x$ so that

$Norm_{\mathbb x}[A\mathbb y + \mathbb b, \varSigma] = \kappa \cdot Norm_{\mathbb y}[A'\mathbb x + \mathbb b', \varSigma']\tag{1.0.8}$
where $\kappa$ is a constant and the new parameters are given by
$\begin{aligned}\varSigma' &= (A^T\varSigma^{-1}A)^{-1}\\ A'&= (A^T\varSigma^{-1}A)^{-1}A^T\varSigma^{-1}\\ \mathbb b' &=-(A^T\varSigma^{-1}A)^{-1}A^T\varSigma^{-1}\mathbb b \end{aligned} \tag{1.0.9}$

Notes

Whitening transformation.

We can convert a normal distribution with mean $μ$ and covariance $\varSigma$ to a new distribution with mean $\mathbb 0$ and covariance $I$ using the linear transformation $\mathbb y = A\mathbb x + \mathbb b$ where

$A = \varSigma^{-1/2}$
$\mathbb b = -\varSigma^{-1/2}\mu$
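One way to form $\varSigma^{-1/2}$ is through the eigendecomposition of $\varSigma$ , which yields the symmetric inverse square root. This is a sketch with made-up parameters; other square roots (for example, from a Cholesky factor) would whiten equally well.

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Symmetric inverse square root via Sigma = V diag(w) V^T.
w, V = np.linalg.eigh(Sigma)
A = V @ np.diag(w ** -0.5) @ V.T
b = -A @ mu

# Apply the linear transformation rule: mean A mu + b, covariance A Sigma A^T.
new_mean = A @ mu + b          # should be the zero vector
new_cov = A @ Sigma @ A.T      # should be the identity
print(new_mean)
print(new_cov)
```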

Machine learning for machine vision

Learning and inference in vision

In vision problems, we take visual data $\mathbb x$ and use them to infer the state of the world $\mathbb w$ . The world state $\mathbb w$ may be continuous (the 3D pose of a body model) or discrete (the presence or absence of a particular object). When the state is continuous, we call this inference process regression. When the state is discrete, we call it classification.

Example 1: Regression

Consider the situation where we make a univariate continuous measurement $x$ and use this to predict a univariate continuous state $\omega$ . For example, we might predict the distance to a car in a road scene based on the number of pixels in its silhouette.

Model contingency of world on data (discriminative)

We define a probability distribution over the world state $\omega$ and make its parameters contingent on the data $x$ .
Since the world state is univariate and continuous, we choose the univariate normal. We fix the variance $\sigma^2$ and make the mean $μ$ a linear function $φ_0 +φ_1x$ of the data. So we have

$Pr(\omega|x, \mathbb \theta) = Norm_{\omega}[\phi_0 + \phi_1x, \sigma^2] \tag{2.0.1}$

where $\mathbb \theta = \{\phi_0, \phi_1, \sigma^2\}$ are the unknown parameters of the model. This model is referred to as linear regression.
The learning algorithm estimates the model parameters $θ$ from paired training examples $\{x_i, \omega_i\}_{i=1}^I$ . For example, in the MAP approach, we seek

$\begin{aligned}\hat\theta &= \mathop{\arg\max}_{\theta}[Pr(\theta|\omega_{1...I}, x_{1...I})] \\&=\mathop{\arg\max}_{\theta}[Pr(\omega_{1...I}|x_{1...I}, \theta)Pr(\theta)] \\ &=\mathop{\arg\max}_{\theta}\bigg[\prod_{i=1}^IPr(\omega_i|x_i, \theta)Pr(\theta)\bigg] \end{aligned}$

where we have assumed that the $I$ training pairs $\{x_i,\omega_i\}_{i=1}^I$ are independent, and defined a suitable prior $Pr(θ)$ .
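As a simplified illustration, the sketch below fits the model with a flat prior, in which case the MAP criterion reduces to maximum likelihood and the estimates of $\phi_0, \phi_1$ are the least-squares solution. The flat prior and the training pairs are assumptions made for this example, not part of the notes.

```python
import numpy as np

# Hypothetical training pairs (x_i, w_i).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix with a column of ones for the offset phi_0.
X = np.column_stack([np.ones_like(x), x])
phi, *_ = np.linalg.lstsq(X, w, rcond=None)
phi0, phi1 = phi

residuals = w - X @ phi
sigma2 = np.mean(residuals ** 2)   # ML estimate of the noise variance

def predict_mean(x_new):
    """Mean of Pr(w | x_new, theta_hat) under Equation 2.0.1."""
    return phi0 + phi1 * x_new

print(phi0, phi1, sigma2, predict_mean(5.0))
```

With an informative prior on $\phi$ the solution would no longer be plain least squares; for instance, a zero-mean normal prior leads to a regularized fit.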

Model the contingency of data on world (generative)

In the generative formulation, we choose a probability distribution over the data $x$ and make its parameters contingent on the world state $\omega$ . Since the data are univariate and continuous, we will model the data as a normal distribution with fixed variance $σ^2$ and a mean $μ$ that is a linear function $φ_0 +φ_1\omega$ of the world state, so that

$Pr(x|\omega, \theta) = Norm_{x}[\phi_0 + \phi_1\omega, \sigma^2] \tag{2.0.2}$

Example 2: Binary classification

As a second example, we will consider the case where the observed measurement $x$ is univariate and continuous, but the world state $\omega$ is discrete and can take one of two values. For example, we might wish to classify a pixel as belonging to a skin or non-skin region based on observing just the red channel.

Model contingency of world on data (discriminative)

We define a probability distribution over the world state $\omega \in \{0, 1\}$ and make its parameters contingent on the data $x$ . Since the world state is discrete and binary, we will use a Bernoulli distribution. This has a single parameter $λ$ , which determines the probability of success so that $Pr(\omega = 1) = λ$ .
We make $λ$ a function of the data $x$ , but in doing so we must ensure the constraint $0 \le \lambda \le 1$ is obeyed. To this end, we form a linear function $φ_0 + φ_1x$ of the data $x$ , which returns a value in the range $[-\infty, \infty]$ . We then pass the result through a function $sig[\cdot]$ that maps $[-\infty, \infty]$ to $[0, 1]$ , so that

$Pr(\omega|x) = Bern_\omega[sig[\phi_0 + \phi_1x]] = Bern_{\omega}\bigg[\frac{1}{1+exp[-\phi_0-\phi_1x]}\bigg] \tag{2.0.3}$
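The model is short enough to write out directly. The parameter values below are arbitrary and the function names are illustrative; for very large negative activations a numerically safer formulation would be needed, since `math.exp` can overflow.

```python
import math

def sig(a):
    """Logistic sigmoid: maps a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def prob_world_is_one(x, phi0, phi1):
    """Pr(w = 1 | x) under the model in Equation 2.0.3."""
    return sig(phi0 + phi1 * x)

# Here the activation phi0 + phi1 * x is exactly zero.
p = prob_world_is_one(x=2.0, phi0=-1.0, phi1=0.5)
print(p)
```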

Model contingency of data on world (generative)

We choose a probability distribution over the data $x$ and make its parameters contingent on the world state $\omega$ . Since the data are univariate and continuous, we will choose a univariate normal and allow the variance $σ^2$ and the mean $μ$ to be functions of the binary world state $\omega$ , so that the likelihood is

$Pr(x|\omega, \theta) = Norm_x[\mu_\omega, \sigma_\omega^2]$

Applications

Skin detection

The goal of skin-detection algorithms is to infer a label $\omega \in \{0, 1\}$ denoting the presence or absence of skin at a given pixel, based on the RGB measurements $\mathbb x = [x^R,x^G,x^B]$ at that pixel. This is a useful precursor to segmenting a face or hand, or it may be used as the basis of a crude method for detecting prurient content in Web images. Taking a generative approach, we describe the likelihoods as

$Pr(\mathbb x|\omega = k) = Norm_{\mathbb x}[\mu_k, \varSigma_k]$

and the prior probability over states as $Pr(\omega) = Bern_{\omega}[\lambda]$ .

In the learning algorithm, we estimate the parameters $μ_0,μ_1,Σ_0,Σ_1$ from training data pairs $\{\omega_i , \mathbb x_i \}^I_{i=1}$ where the pixels have been labeled by hand. In particular, we learn $μ_0$ and $Σ_0$ from the subset of the training data where $\omega_i = 0$ and $μ_1$ and $Σ_1$ from the subset where $\omega_i = 1$ .
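The learning and inference steps might look like the sketch below. It uses synthetic points in place of hand-labeled pixels, fits each class by maximum likelihood, and sets $\lambda$ to the fraction of training points labeled 1; the data, the choice of $\lambda$ , and all function names are assumptions for illustration. Inference applies Bayes' rule to combine the two likelihoods with the prior.

```python
import numpy as np

def fit_class(X):
    """ML mean and covariance of the rows of X."""
    mu = X.mean(axis=0)
    diff = X - mu
    return mu, diff.T @ diff / X.shape[0]

def log_norm(x, mu, Sigma):
    """Log of the multivariate normal density at x."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + maha)

# Synthetic stand-in for hand-labeled RGB pixels (not real skin data).
rng = np.random.default_rng(0)
X0 = rng.normal([0.3, 0.4, 0.5], 0.1, size=(500, 3))   # label w = 0
X1 = rng.normal([0.8, 0.5, 0.4], 0.1, size=(500, 3))   # label w = 1

mu0, Sigma0 = fit_class(X0)
mu1, Sigma1 = fit_class(X1)
lam = len(X1) / (len(X0) + len(X1))    # assumed estimate of Pr(w = 1)

def posterior_skin(x):
    """Pr(w = 1 | x) by Bayes' rule."""
    l0 = np.exp(log_norm(x, mu0, Sigma0)) * (1 - lam)
    l1 = np.exp(log_norm(x, mu1, Sigma1)) * lam
    return l1 / (l0 + l1)

print(posterior_skin(np.array([0.8, 0.5, 0.4])))
print(posterior_skin(np.array([0.3, 0.4, 0.5])))
```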

Background subtraction

A second application of the generative classification model is for background subtraction. Here, the goal is to infer a binary label $\omega_n \in \{0, 1\}$ , which indicates whether the $n^{th}$ pixel in the image is part of a known background ( $\omega_n = 0$ ) or whether a foreground object is occluding it ( $\omega_n = 1$ ). As in the skin detection model, the inference is based on the RGB data $\mathbb x_n$ at that pixel.

Modeling complex data densities

Regression models

Classification models

Connecting local models

The models in Part 2 describe the relationship between a set of measurements and the world state. They work well when the measurements and the world state are both low dimensional. However, there are many situations where this is not the case, and these models are unsuitable.

For example, consider the semantic image labeling problem in which we wish to assign a label that denotes the object class to each pixel in the image. For example, in a road scene we might wish to label pixels as ‘road’, ‘sky’, ‘car’, ‘tree’, ‘building’ or ‘other’. For an image with $N = 10000$ pixels, this means we need to build a model relating the 10000 measured RGB triples to $6^{10000}$ possible world states. None of the models discussed so far can cope with this challenge: the number of parameters involved (and hence the amount of training data and the computational requirements of the learning and inference algorithms) is far beyond what current machines can handle.

Graphical models

Models for chains and trees

Models for grids

Preprocessing

Image preprocessing and feature extraction

Per-pixel transformations

Whitening

The goal of whitening is to provide invariance to fluctuations in the mean intensity level and contrast of the image. Such variation may arise because of a change in ambient lighting intensity, the object reflectance, or the camera gain. To compensate for these factors, the image is transformed so that the resulting pixel values have zero mean and unit variance. To this end, we compute the mean $μ$ and variance $σ^2$ of the original grayscale image P.

$\mu = \frac{\sum_{i=1}^I\sum_{j=1}^Jp_{ij}}{IJ}$
$\sigma^2 = \frac{\sum_{i=1}^I\sum_{j=1}^J(p_{ij} - \mu)^2}{IJ}$

These statistics are used to transform each pixel value separately so that

$x_{ij} = \frac{p_{ij} - \mu}{\sigma}$
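In NumPy the whole operation is a few lines. The tiny image is made up; note that a constant image has $\sigma = 0$ and would need special handling, which this sketch omits.

```python
import numpy as np

def whiten_image(P):
    """Shift and scale pixel values to zero mean and unit variance."""
    P = np.asarray(P, dtype=float)
    mu = P.mean()
    sigma = P.std()          # population standard deviation (divides by IJ)
    return (P - mu) / sigma

P = np.array([[10, 20],
              [30, 40]])
X = whiten_image(P)
print(X, X.mean(), X.std())
```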

Histogram equalization

The goal of histogram equalization is to modify the statistics of the intensity values so that all of their moments take predefined values. To this end, a nonlinear transformation is applied that forces the distribution of pixel intensities to be flat.

First, compute the histogram $h$ of the original intensities, where the $k^{th}$ of $K$ entries is given by

$h_k = \sum_{i=1}^I\sum_{j=1}^J\delta[p_{ij} - k]$

where the operation $\delta[\cdot]$ returns one if the argument is zero and zero otherwise.

Second, cumulatively sum this histogram and normalize by the total number of pixels to compute the cumulative proportion $c$ of pixels that are less than or equal to each intensity level.

$c_k = \frac{\sum_{l=1}^kh_l}{IJ}$

Finally, use the cumulative histogram as a lookup table to compute the transformed value so that

$x_{ij} = Kc_{p_{ij}}$
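The three steps map onto three NumPy calls. One deviation from the formulas above, adopted here as an assumption: intensities are indexed from 0 to $K-1$ rather than from 1 to $K$ , which is the usual convention for integer images. The example image is made up.

```python
import numpy as np

def equalize(P, K):
    """Histogram-equalize an integer image with intensities in 0..K-1."""
    P = np.asarray(P)
    h = np.bincount(P.ravel(), minlength=K)   # h_k: pixel count per level
    c = np.cumsum(h) / P.size                 # c_k: cumulative proportion
    return K * c[P]                           # look up each pixel's new value

P = np.array([[0, 0, 1],
              [1, 2, 3]])
X = equalize(P, K=4)
print(X)
```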

Linear filtering

To apply a filter, we convolve the image $P$ with the filter $F$ , where two-dimensional convolution is defined as

$x_{ij} = \sum_{m=-M}^M\sum_{n=-N}^Np_{i-m,j-n}f_{m,n}$
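A direct, unoptimized implementation of this sum is sketched below. The definition does not say what happens at the image border, so zero padding is assumed here; the function name is illustrative.

```python
import numpy as np

def convolve2d(P, F):
    """Direct 2D convolution; F has shape (2M+1, 2N+1), P is zero-padded."""
    P = np.asarray(P, dtype=float)
    F = np.asarray(F, dtype=float)
    M, N = F.shape[0] // 2, F.shape[1] // 2
    padded = np.pad(P, ((M, M), (N, N)))   # assumed boundary handling: zeros
    X = np.zeros_like(P)
    for m in range(-M, M + 1):
        for n in range(-N, N + 1):
            # Shifted copy of the image holding p[i - m, j - n] at (i, j).
            shifted = padded[M - m:M - m + P.shape[0],
                             N - n:N - n + P.shape[1]]
            X += shifted * F[m + M, n + N]
    return X

# Convolving an impulse with a filter reproduces the filter around the impulse.
P = np.zeros((5, 5))
P[2, 2] = 1.0
F = np.arange(1.0, 10.0).reshape(3, 3)
X = convolve2d(P, F)
print(X)
```

Looping over the filter taps rather than over pixels keeps the inner work vectorized, which is a convenient choice when the filter is much smaller than the image.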

Models for geometry [VIP]

The pinhole camera

Models for transformations

Multiple cameras

Models for vision [VIP]

Models for shape