Probability
Definition
Conditional probability
The conditional probability of $x$ given that $y$ takes value $y^*$ tells us the relative propensity of the random variable $x$ to take different outcomes given that the random variable $y$ is fixed to value $y^*$. This conditional probability is written as $Pr(x|y=y^*)$.
The conditional probability $Pr(x|y=y^*)$ can be recovered from the joint distribution $Pr(x,y)$. In particular, we examine the appropriate slice $Pr(x, y=y^*)$ of the joint distribution. The values in the slice tell us about the relative probability that $x$ takes various values having observed $y=y^*$, but they do not themselves form a valid probability distribution; they cannot sum to one, as they constitute only a small part of the joint distribution, which did itself sum to one. To calculate the conditional probability distribution, we hence normalize by the total probability in the slice:

$$Pr(x|y=y^*) = \frac{Pr(x, y=y^*)}{\int Pr(x, y=y^*)\,dx} = \frac{Pr(x, y=y^*)}{Pr(y=y^*)}.$$
Expectation
Given a function $f[\cdot]$ that returns a value for each possible value $x^*$ of the variable $x$ and a probability $Pr(x=x^*)$ that each value of $x$ occurs, we sometimes wish to calculate the expected output of the function. If we drew a very large number of samples from the probability distribution, calculated the function for each sample, and took the average of these values, the result would be the expectation.
The expected value of a function $f[\cdot]$ of a random variable $x$ is defined as

$$E[f[x]] = \sum_x f[x]\,Pr(x) \qquad\text{or}\qquad E[f[x]] = \int f[x]\,Pr(x)\,dx$$

for the discrete and continuous cases, respectively. This idea generalizes to functions of more than one random variable so that, for example,

$$E[f[x,y]] = \iint f[x,y]\,Pr(x,y)\,dx\,dy.$$
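As a sanity check on this definition, here is a minimal NumPy sketch (variable names and parameter values are my own) that approximates an expectation by averaging the function over many samples, here $f[x] = x^2$ with $x$ drawn from a normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples from Pr(x) = Norm(mu=1, sigma^2=4).
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Monte Carlo estimate of E[f[x]] for f[x] = x^2.
estimate = np.mean(x ** 2)

# Closed form for comparison: E[x^2] = sigma^2 + mu^2 = 5.
print(estimate)  # approximately 5.0
```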
Special cases of expectation. For some functions $f[\cdot]$, the expectation is given a special name. Here we use the notation $\mu_x$ to represent the mean with respect to random variable $x$ and $\mu_y$ the mean with respect to random variable $y$.
| Function $f[\cdot]$ | Expectation |
| --- | --- |
| $x$ | mean, $\mu_x$ |
| $x^k$ | $k$th moment about zero |
| $(x-\mu_x)^k$ | $k$th moment about the mean |
| $(x-\mu_x)^2$ | variance |
| $(x-\mu_x)^3$ | skew |
| $(x-\mu_x)^4$ | kurtosis |
| $(x-\mu_x)(y-\mu_y)$ | covariance of $x$ and $y$ |
There are four rules for manipulating expectations, which can be easily proved from the original definition
- The expected value of a constant $\kappa$ with respect to the random variable $x$ is just the constant itself: $E[\kappa] = \kappa$.
- The expected value of a constant $\kappa$ times a function $f[x]$ of the random variable $x$ is $\kappa$ times the expected value of the function: $E[\kappa f[x]] = \kappa\,E[f[x]]$.
- The expected value of the sum of two functions of a random variable $x$ is the sum of the individual expected values of the functions: $E[f[x] + g[x]] = E[f[x]] + E[g[x]]$.
- The expected value of the product of two functions $f[x]$ and $g[y]$ of random variables $x$ and $y$ is equal to the product of the individual expected values if the variables $x$ and $y$ are independent: $E[f[x]\,g[y]] = E[f[x]]\,E[g[y]]$.
Using these rules, we can derive the relationship between the second moment about zero and the second moment about the mean (the variance):

$$E[(x-\mu_x)^2] = E[x^2 - 2x\mu_x + \mu_x^2] = E[x^2] - 2\mu_x E[x] + \mu_x^2 = E[x^2] - E[x]^2.$$
Common probability distributions
Common probability distributions: the choice of distribution depends on the type/domain of data to be modeled.
| Data type | Domain | Distribution |
| --- | --- | --- |
| univariate, discrete, binary | $x \in \{0, 1\}$ | Bernoulli |
| univariate, discrete, multivalued | $x \in \{1, 2, \ldots, K\}$ | categorical |
| univariate, continuous, unbounded | $x \in \mathbb{R}$ | univariate normal |
| univariate, continuous, bounded | $x \in [0, 1]$ | beta |
| multivariate, continuous, unbounded | $\mathbf{x} \in \mathbb{R}^K$ | multivariate normal |
| multivariate, continuous, bounded, sums to one | $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_K]$, $\lambda_k \in [0,1]$, $\sum_k \lambda_k = 1$ | Dirichlet |
| bivariate, continuous | $\mu \in \mathbb{R}$, $\sigma^2 > 0$ | normal-scaled inverse gamma |
| multivariate vector and matrix | $\boldsymbol{\mu} \in \mathbb{R}^K$, $\boldsymbol{\Sigma}$ positive definite | normal inverse Wishart |
Probability distributions such as the categorical and normal distributions are obviously useful for modeling visual data. However, the need for some of the other distributions is not so obvious; for example, the Dirichlet distribution models $K$ positive numbers that sum to one, and visual data do not normally take this form. As discussed under Conjugacy below, these distributions are instead used to describe uncertainty in the parameters of the others.
Bernoulli distribution
The Bernoulli distribution is a discrete distribution that models binary trials: it describes the situation where there are only two possible outcomes which are referred to as “failure” and “success.”
The Bernoulli distribution has a single parameter $\lambda \in [0, 1]$, which defines the probability of observing a success $x = 1$. The distribution is hence

$$Pr(x=0) = 1 - \lambda, \qquad Pr(x=1) = \lambda,$$

which can be written compactly as $Pr(x) = \lambda^x(1-\lambda)^{1-x}$, or, for short, $Pr(x) = \mathrm{Bern}_x[\lambda]$.
Beta distribution
The beta distribution is defined on $\lambda \in [0, 1]$ and has two parameters $\alpha, \beta > 0$ whose relative values determine the expected value, so $E[\lambda] = \alpha/(\alpha + \beta)$.
Mathematically, the beta distribution has the form

$$Pr(\lambda) = \frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\,\Gamma[\beta]}\,\lambda^{\alpha-1}(1-\lambda)^{\beta-1},$$

where $\Gamma[\cdot]$ is the gamma function.
Categorical distribution
The categorical distribution is a discrete distribution that determines the probability of observing one of $K$ possible outcomes. Hence, the Bernoulli distribution is a special case of the categorical distribution when there are only two outcomes.
The probabilities of observing the $K$ outcomes are held in a parameter vector $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_K]$, where $\lambda_k \in [0, 1]$ and $\sum_{k=1}^{K}\lambda_k = 1$. The categorical distribution can be visualized as a normalized histogram with $K$ bins and can be written as

$$Pr(x=k) = \lambda_k.$$
Matrix perspective
Alternatively, we can think of the data as taking values $\mathbf{x} \in \{\mathbf{e}_1, \ldots, \mathbf{e}_K\}$, where $\mathbf{e}_k$ is the $k$th unit vector. All elements of $\mathbf{e}_k$ are zero except the $k$th, which is one. Here we can write

$$Pr(\mathbf{x} = \mathbf{e}_k) = \prod_{j=1}^{K}\lambda_j^{x_j} = \lambda_k,$$

where $x_j$ is the $j$th element of $\mathbf{x}$.
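A small sketch (parameter values are illustrative) of these two equivalent views of categorical data: sampling outcome indices from $\boldsymbol{\lambda}$ and representing each sample as a one-hot unit vector $\mathbf{e}_k$:

```python
import numpy as np

rng = np.random.default_rng(0)

lam = np.array([0.2, 0.5, 0.3])   # categorical parameters, sum to one
K = len(lam)

# Sample outcome indices k in {0, ..., K-1} with Pr(x = k) = lam[k].
idx = rng.choice(K, size=8, p=lam)

# Equivalent unit-vector (one-hot) representation: e_k has a single one.
one_hot = np.eye(K)[idx]

print(idx)
print(one_hot)
```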
Dirichlet distribution
The Dirichlet distribution is defined over $K$ continuous values $\lambda_1, \ldots, \lambda_K$, where $\lambda_k \in [0, 1]$ and $\sum_{k=1}^{K}\lambda_k = 1$. Hence it is suitable for defining a distribution over the parameters of the categorical distribution.
In $K$ dimensions the Dirichlet distribution has $K$ parameters $\alpha_1, \ldots, \alpha_K$, each of which can take any positive value. The relative values of the parameters determine the expected values $E[\lambda_k] = \alpha_k / \sum_j \alpha_j$. The absolute values determine the concentration around the expected value. We write

$$Pr(\lambda_1, \ldots, \lambda_K) = \frac{\Gamma\!\left[\sum_{k=1}^{K}\alpha_k\right]}{\prod_{k=1}^{K}\Gamma[\alpha_k]}\prod_{k=1}^{K}\lambda_k^{\alpha_k - 1},$$

or, for short, $\boldsymbol{\lambda} \sim \mathrm{Dir}[\alpha_1, \ldots, \alpha_K]$.
Just as the Bernoulli distribution was a special case of the categorical distribution with two possible outcomes, so the beta distribution is a special case of the Dirichlet distribution where the dimensionality is two.
Univariate normal distribution
The normal distribution has two parameters: the mean $\mu$ and the variance $\sigma^2$. The parameter $\mu$ can take any value and determines the position of the peak. The parameter $\sigma^2$ takes only positive values and determines the width of the distribution. The normal distribution is defined as

$$Pr(x) = \mathrm{Norm}_x[\mu, \sigma^2] = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right].$$
Normal-scaled inverse gamma distribution [TODO]
The normal-scaled inverse gamma distribution is defined over a pair of continuous values $(\mu, \sigma^2)$, the first of which can take any value and the second of which is constrained to be positive. As such it can define a distribution over the mean and variance parameters of the normal distribution.
Multivariate normal distribution
The multivariate normal or Gaussian distribution models $D$-dimensional variables $\mathbf{x}$ where each of the $D$ elements is continuous and lies in the range $[-\infty, +\infty]$.
The multivariate normal distribution has two parameters: the mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. The mean $\boldsymbol{\mu}$ is a $D \times 1$ vector that describes the mean of the distribution. The covariance $\boldsymbol{\Sigma}$ is a symmetric $D \times D$ positive definite matrix, so that $\mathbf{z}^T\boldsymbol{\Sigma}\mathbf{z}$ is positive for any real vector $\mathbf{z}$. The probability density function has the following form:

$$Pr(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right].$$

For short, we write $Pr(\mathbf{x}) = \mathrm{Norm}_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}]$.
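A quick sketch evaluating this density (assuming SciPy is available; the mean and covariance values are illustrative), cross-checked against a direct implementation of the formula above:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # symmetric positive definite
x = np.array([0.5, 0.0])

# Library evaluation of the density.
p_lib = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Direct evaluation of the formula.
D = len(mu)
diff = x - mu
norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
p_direct = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(p_lib, p_direct)   # the two values agree
```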
Normal inverse Wishart distribution [TODO]
The normal inverse Wishart distribution defines a distribution over a $D \times 1$ vector $\boldsymbol{\mu}$ and a $D \times D$ positive definite matrix $\boldsymbol{\Sigma}$. As such it is suitable for describing uncertainty in the parameters of a multivariate normal distribution.
Conjugacy
We have argued that the beta distribution can represent probabilities over the parameters of the Bernoulli. Similarly the Dirichlet defines a distribution over the parameters of the categorical, and there are analogous relationships between the normal-scaled inverse gamma and univariate normal and the normal inverse Wishart and the multivariate normal.
These pairs were carefully chosen because they have a special relationship: in each case, the former distribution is conjugate to the latter: the beta is conjugate to the Bernoulli and the Dirichlet is conjugate to the categorical and so on. When we multiply a distribution with its conjugate, the result is proportional to a new distribution which has the same form as the conjugate.
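A minimal numerical illustration of conjugacy for the beta–Bernoulli pair (data and prior values are made up): multiplying the Bernoulli likelihood of the data by a beta prior gives something proportional to another beta distribution, so the posterior over $\lambda$ is again a beta with its parameters incremented by the counts of successes and failures.

```python
import numpy as np
from scipy.stats import beta

# Prior over the Bernoulli parameter lambda: Beta(alpha, beta).
alpha_prior, beta_prior = 2.0, 2.0

# Observed binary data (successes and failures).
x = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
n_success = int(x.sum())
n_failure = len(x) - n_success

# Conjugate update: posterior is Beta(alpha + #successes, beta + #failures).
alpha_post = alpha_prior + n_success
beta_post = beta_prior + n_failure

posterior = beta(alpha_post, beta_post)
print(posterior.mean())   # posterior expectation of lambda
```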
Fitting probability models
This chapter concerns fitting probability models to data $\{x_i\}_{i=1}^{I}$. This process is referred to as learning because we learn about the parameters $\boldsymbol{\theta}$ of the model. It also concerns calculating the probability of a new datum $x^*$ under the resulting model. This is known as evaluating the predictive distribution. We consider three methods: maximum likelihood, maximum a posteriori, and the Bayesian approach.
Maximum likelihood
As the name suggests, the maximum likelihood (ML) method finds the set of parameters $\hat{\boldsymbol{\theta}}$ under which the data are most likely. To calculate the likelihood function at a single data point $x_i$, we simply evaluate the probability density function $Pr(x_i|\boldsymbol{\theta})$ at $x_i$. Assuming each data point was drawn independently from the distribution, the likelihood function for a set of points is the product of the individual likelihoods. Hence, the ML estimate of the parameters is

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; Pr(x_{1\ldots I}|\boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; \prod_{i=1}^{I} Pr(x_i|\boldsymbol{\theta}).$$
To evaluate the predictive distribution for a new data point $x^*$ (compute the probability that $x^*$ belongs to the fitted model), we simply evaluate the probability density function $Pr(x^*|\hat{\boldsymbol{\theta}})$ using the ML fitted parameters $\hat{\boldsymbol{\theta}}$.
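A short sketch (with synthetic data) of maximum likelihood fitting for a univariate normal: the ML estimates are simply the sample mean and the (biased) sample variance, and the predictive distribution is the normal pdf evaluated with those estimates.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic training data drawn from a normal distribution.
data = rng.normal(loc=3.0, scale=1.5, size=500)

# ML estimates for a univariate normal: sample mean and sample variance.
mu_ml = data.mean()
var_ml = data.var()          # divides by I, i.e. the ML estimator

# Predictive distribution: evaluate the density at a new point x*.
x_star = 2.0
p_star = norm(loc=mu_ml, scale=np.sqrt(var_ml)).pdf(x_star)
print(mu_ml, var_ml, p_star)
```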
Maximum a posteriori
In maximum a posteriori (MAP) fitting, we introduce prior information $Pr(\boldsymbol{\theta})$ about the parameters $\boldsymbol{\theta}$.

As the name suggests, maximum a posteriori estimation maximizes the posterior probability of the parameters,

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; Pr(\boldsymbol{\theta}|x_{1\ldots I}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; \frac{\prod_{i=1}^{I} Pr(x_i|\boldsymbol{\theta})\,Pr(\boldsymbol{\theta})}{Pr(x_{1\ldots I})},$$

where we have used Bayes' rule and again assumed independence of the data points. In fact, we can discard the denominator as it is constant with respect to the parameters and so does not affect the position of the maximum, and we get

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; \prod_{i=1}^{I} Pr(x_i|\boldsymbol{\theta})\,Pr(\boldsymbol{\theta}).$$
Comparing this to the maximum likelihood criterion above, we see that it is identical except for the additional prior term; maximum likelihood is a special case of maximum a posteriori where the prior is uninformative.
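A minimal sketch comparing the ML and MAP estimates for a Bernoulli parameter with a beta prior (the prior strength is an illustrative choice): the MAP estimate shrinks the raw success frequency toward the prior, and it recovers the ML answer when the prior is uniform ($\alpha = \beta = 1$).

```python
import numpy as np

# Binary observations and a Beta(alpha, beta) prior over lambda.
x = np.array([1, 1, 1, 0, 1])
alpha, beta_ = 3.0, 3.0           # prior pseudo-counts (illustrative)

n1 = x.sum()                      # number of successes
n0 = len(x) - n1                  # number of failures

# ML estimate: maximizes the likelihood alone.
lambda_ml = n1 / (n1 + n0)

# MAP estimate: maximizes likelihood times prior (mode of the beta posterior).
lambda_map = (n1 + alpha - 1) / (n1 + n0 + alpha + beta_ - 2)

print(lambda_ml, lambda_map)      # MAP is pulled toward 0.5 by the prior
```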
The Bayesian approach
The normal distribution
The most common representation for uncertainty in machine vision is the multivariate normal distribution.
The multivariate normal distribution has two parameters: the mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. The mean $\boldsymbol{\mu}$ is a $D \times 1$ vector that describes the position of the distribution. The covariance $\boldsymbol{\Sigma}$ is a symmetric $D \times D$ positive definite matrix (implying that $\mathbf{z}^T\boldsymbol{\Sigma}\mathbf{z}$ is positive for any real vector $\mathbf{z}$) and describes the shape of the distribution. The probability density function is

$$Pr(\mathbf{x}) = \mathrm{Norm}_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}] = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right].$$
Covariance matrices in multivariate normals take three forms, termed spherical, diagonal, and full covariances. For the two-dimensional (bivariate) case, these are

$$\boldsymbol{\Sigma}_{\text{spher}} = \begin{bmatrix}\sigma^2 & 0\\ 0 & \sigma^2\end{bmatrix}, \qquad \boldsymbol{\Sigma}_{\text{diag}} = \begin{bmatrix}\sigma_1^2 & 0\\ 0 & \sigma_2^2\end{bmatrix}, \qquad \boldsymbol{\Sigma}_{\text{full}} = \begin{bmatrix}\sigma_{11}^2 & \sigma_{12}^2\\ \sigma_{21}^2 & \sigma_{22}^2\end{bmatrix}.$$
The spherical covariance matrix is a positive multiple of the identity matrix and so has the same value on all of the diagonal elements and zeros elsewhere. In the diagonal covariance matrix, each value on the diagonal has a different positive value. The full covariance matrix can have nonzero elements everywhere, although the matrix is still constrained to be symmetric and positive definite, so for the 2D example, $\sigma_{12}^2 = \sigma_{21}^2$.
For the bivariate case, spherical covariances produce circular iso-density contours. Diagonal covariances produce ellipsoidal iso-contours that are aligned with the coordinate axes. Full covariances also produce ellipsoidal iso-density contours, but these may now take an arbitrary orientation.
When the covariance is spherical or diagonal, the individual variables are independent. For example, for the bivariate diagonal case with zero mean, we have

$$Pr(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\exp\!\left[-\frac{x_1^2}{2\sigma_1^2}-\frac{x_2^2}{2\sigma_2^2}\right] = \frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\!\left[-\frac{x_1^2}{2\sigma_1^2}\right]\frac{1}{\sqrt{2\pi\sigma_2^2}}\exp\!\left[-\frac{x_2^2}{2\sigma_2^2}\right] = Pr(x_1)\,Pr(x_2).$$
Decomposition of covariance
We can use the foregoing geometrical intuitions to decompose the full covariance matrix . Given a normal distribution with mean zero and a full covariance matrix, we know that the iso-contours take an ellipsoidal form with the major and minor axes at arbitrary orientations.
In particular, we can decompose the full covariance matrix as

$$\boldsymbol{\Sigma}_{\text{full}} = \mathbf{R}^T\,\boldsymbol{\Sigma}'_{\text{diag}}\,\mathbf{R},$$

where $\mathbf{R}$ is a rotation matrix that rotates the coordinate axes onto the principal axes of the ellipse and $\boldsymbol{\Sigma}'_{\text{diag}}$ is the diagonal covariance expressed in that rotated frame; this is simply the eigendecomposition of $\boldsymbol{\Sigma}_{\text{full}}$.
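A sketch (with an arbitrary full covariance) that recovers this decomposition numerically via the eigendecomposition: the eigenvectors give the rotation and the eigenvalues give the variances along the principal axes.

```python
import numpy as np

# An arbitrary full (symmetric positive definite) covariance matrix.
Sigma_full = np.array([[3.0, 1.2],
                       [1.2, 2.0]])

# Eigendecomposition: Sigma_full = evecs @ diag(evals) @ evecs.T.
evals, evecs = np.linalg.eigh(Sigma_full)
R = evecs.T                      # rows of R are the principal axes
Sigma_diag = np.diag(evals)      # diagonal covariance in the rotated frame

# Verify the decomposition Sigma_full = R^T Sigma_diag R.
print(np.allclose(R.T @ Sigma_diag @ R, Sigma_full))   # True
```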
Linear transformations of variables
The form of the multivariate normal is preserved under linear transformations $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$. If the original distribution was

$$Pr(\mathbf{x}) = \mathrm{Norm}_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}],$$

then the transformed variable $\mathbf{y}$ is distributed as

$$Pr(\mathbf{y}) = \mathrm{Norm}_{\mathbf{y}}[\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T].$$
Marginal distributions
If we marginalize over any subset of random variables in a multivariate normal distribution, the remaining distribution is also normally distributed. If we partition the original random variable into two parts $\mathbf{x} = [\mathbf{x}_1^T, \mathbf{x}_2^T]^T$ so that

$$Pr(\mathbf{x}) = \mathrm{Norm}_{\mathbf{x}}\!\left[\begin{bmatrix}\boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2\end{bmatrix}, \begin{bmatrix}\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{bmatrix}\right],$$

then the marginal distributions are

$$Pr(\mathbf{x}_1) = \mathrm{Norm}_{\mathbf{x}_1}[\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}], \qquad Pr(\mathbf{x}_2) = \mathrm{Norm}_{\mathbf{x}_2}[\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22}].$$
Conditional distributions
If the variable $\mathbf{x}$ is distributed as a multivariate normal, then the conditional distribution of a subset of variables given known values for the remaining variables is also a multivariate normal. With the partitioned form above, the conditional distributions are

$$Pr(\mathbf{x}_1|\mathbf{x}_2 = \mathbf{x}_2^*) = \mathrm{Norm}_{\mathbf{x}_1}\!\left[\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2^* - \boldsymbol{\mu}_2),\; \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right],$$

and similarly for $Pr(\mathbf{x}_2|\mathbf{x}_1 = \mathbf{x}_1^*)$.
Product of two normals
The product of two normal distributions is proportional to a third normal distribution,

$$\mathrm{Norm}_{\mathbf{x}}[\mathbf{a}, \mathbf{A}]\;\mathrm{Norm}_{\mathbf{x}}[\mathbf{b}, \mathbf{B}] = \kappa\,\mathrm{Norm}_{\mathbf{x}}\!\left[(\mathbf{A}^{-1}+\mathbf{B}^{-1})^{-1}(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b}),\;(\mathbf{A}^{-1}+\mathbf{B}^{-1})^{-1}\right],$$

where the constant $\kappa$ is itself a normal distribution,

$$\kappa = \mathrm{Norm}_{\mathbf{a}}[\mathbf{b}, \mathbf{A}+\mathbf{B}].$$
Change of variable
Consider a normal distribution in variable $\mathbf{x}$ whose mean is a linear function $\mathbf{A}\mathbf{y}$ of a second variable $\mathbf{y}$. We can re-express this in terms of a normal distribution in $\mathbf{y}$, whose mean is a linear function of $\mathbf{x}$, so that

$$\mathrm{Norm}_{\mathbf{x}}[\mathbf{A}\mathbf{y}, \boldsymbol{\Sigma}] = \kappa\,\mathrm{Norm}_{\mathbf{y}}[\mathbf{A}'\mathbf{x}, \boldsymbol{\Sigma}'],$$

where $\kappa$ is a constant and the new parameters are given by

$$\boldsymbol{\Sigma}' = (\mathbf{A}^T\boldsymbol{\Sigma}^{-1}\mathbf{A})^{-1}, \qquad \mathbf{A}' = \boldsymbol{\Sigma}'\mathbf{A}^T\boldsymbol{\Sigma}^{-1}.$$
Notes
- Whitening transformation. We can convert a normal distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ to a new distribution with mean $\mathbf{0}$ and identity covariance $\mathbf{I}$ using the linear transformation $\mathbf{y} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu})$, where $\boldsymbol{\Sigma}^{-1/2}$ can be computed from the eigendecomposition $\boldsymbol{\Sigma} = \mathbf{R}\boldsymbol{\Lambda}\mathbf{R}^T$ as $\boldsymbol{\Sigma}^{-1/2} = \mathbf{R}\boldsymbol{\Lambda}^{-1/2}\mathbf{R}^T$.
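A quick numerical check of the whitening transformation (sample size and parameter values are illustrative): after transforming samples of $\mathbf{x}$ by $\boldsymbol{\Sigma}^{-1/2}(\mathbf{x}-\boldsymbol{\mu})$, the empirical mean is close to zero and the empirical covariance close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([2.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Draw correlated samples from Norm(mu, Sigma).
x = rng.multivariate_normal(mu, Sigma, size=100_000)

# Whitening matrix Sigma^{-1/2} from the eigendecomposition Sigma = R Lambda R^T.
evals, R = np.linalg.eigh(Sigma)
W = R @ np.diag(evals ** -0.5) @ R.T

# Apply y = W (x - mu); the result has zero mean and identity covariance.
y = (x - mu) @ W.T
print(y.mean(axis=0))            # approximately [0, 0]
print(np.cov(y, rowvar=False))   # approximately the identity matrix
```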
Machine learning for machine vision
Learning and inference in vision
In vision problems, we take visual data $\mathbf{x}$ and use them to infer the state of the world $\mathbf{w}$. The world state $\mathbf{w}$ may be continuous (the 3D pose of a body model) or discrete (the presence or absence of a particular object). When the state is continuous, we call this inference process regression. When the state is discrete, we call it classification.
Example 1: Regression
Consider the situation where we make a univariate continuous measurement $x$ and use this to predict a univariate continuous state $w$. For example, we might predict the distance to a car in a road scene based on the number of pixels in its silhouette.
Model contingency of world on data (discriminative)
We define a probability distribution $Pr(w|x)$ over the world state $w$ and make its parameters contingent on the data $x$.
Since the world state is univariate and continuous, we choose the univariate normal. We fix the variance and make the mean a linear function of the data, so we have

$$Pr(w|x, \boldsymbol{\theta}) = \mathrm{Norm}_w[\phi_0 + \phi_1 x,\; \sigma^2],$$

where $\boldsymbol{\theta} = \{\phi_0, \phi_1, \sigma^2\}$ are the unknown parameters of the model. This model is referred to as linear regression.
The learning algorithm estimates the model parameters $\boldsymbol{\theta}$ from paired training examples $\{x_i, w_i\}_{i=1}^{I}$. For example, in the MAP approach, we seek

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; \prod_{i=1}^{I} Pr(w_i|x_i, \boldsymbol{\theta})\,Pr(\boldsymbol{\theta}),$$

where we have assumed that the $I$ training pairs are independent and defined a suitable prior $Pr(\boldsymbol{\theta})$.
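A minimal sketch of fitting this linear regression model on synthetic data (the prior is omitted, so this is the maximum likelihood special case): the intercept and slope come from least squares, and the variance from the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training pairs {x_i, w_i}: w depends linearly on x plus noise.
I = 200
x = rng.uniform(0, 10, size=I)
w = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=I)

# ML fit of Pr(w|x) = Norm(phi0 + phi1 * x, sigma^2) via least squares.
X = np.column_stack([np.ones(I), x])          # design matrix [1, x_i]
phi, *_ = np.linalg.lstsq(X, w, rcond=None)   # [phi0, phi1]
residuals = w - X @ phi
sigma2 = np.mean(residuals ** 2)              # ML estimate of the variance

print(phi, sigma2)
```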
Model the contingency of data on world (generative)
In the generative formulation, we choose a probability distribution $Pr(x|w)$ over the data $x$ and make its parameters contingent on the world state $w$. Since the data are univariate and continuous, we model them as a normal distribution with fixed variance and a mean that is a linear function of the world state, so that

$$Pr(x|w, \boldsymbol{\theta}) = \mathrm{Norm}_x[\phi_0 + \phi_1 w,\; \sigma^2].$$
Example 2: Binary classification
As a second example, we will consider the case where the observed measurement is univariate and continuous, but the world state is discrete and can take one of two values. For example, we might wish to classify a pixel as belonging to a skin or non-skin region based on observing just the red channel.
Model contingency of world on data (discriminative)
We define a probability distribution $Pr(w|x)$ over the world state $w$ and make its parameters contingent on the data $x$. Since the world state is discrete and binary, we will use a Bernoulli distribution. This has a single parameter $\lambda$, which determines the probability of success, so that $Pr(w) = \mathrm{Bern}_w[\lambda]$.
We make $\lambda$ a function of the data $x$, but in doing so we must ensure the constraint $0 \le \lambda \le 1$ is obeyed. To this end, we form a linear function $\phi_0 + \phi_1 x$ of the data, which returns a value in the range $[-\infty, +\infty]$. We then pass the result through the logistic sigmoid, which maps $[-\infty, +\infty]$ to $[0, 1]$, so that

$$Pr(w|x, \boldsymbol{\theta}) = \mathrm{Bern}_w\!\left[\frac{1}{1 + \exp[-\phi_0 - \phi_1 x]}\right].$$

This model is known as logistic regression.
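A small sketch of this construction (parameter values are arbitrary): a linear function of the data is squashed through the logistic sigmoid to produce a valid Bernoulli parameter.

```python
import numpy as np

def sigmoid(a):
    # Maps any real activation to the range (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

# Arbitrary parameters of the linear function phi0 + phi1 * x.
phi0, phi1 = -2.0, 0.05

# Example measurements (e.g., red-channel values of pixels).
x = np.array([10.0, 40.0, 80.0, 200.0])

# Bernoulli parameter lambda = Pr(w = 1 | x) for each measurement.
lam = sigmoid(phi0 + phi1 * x)
print(lam)          # values strictly between 0 and 1, increasing with x
```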
Model contingency of data on world (generative)
We choose a probability distribution $Pr(x|w)$ over the data $x$ and make its parameters contingent on the world state $w$. Since the data are univariate and continuous, we choose a univariate normal and allow the variance and the mean to be functions of the binary world state $w$, so that the likelihood is

$$Pr(x|w, \boldsymbol{\theta}) = \mathrm{Norm}_x[\mu_w,\; \sigma_w^2],$$

where $\boldsymbol{\theta} = \{\mu_0, \sigma_0^2, \mu_1, \sigma_1^2\}$ contains a mean and variance for each state.
Applications
Skin detection
The goal of skin-detection algorithms is to infer a binary label $w \in \{0, 1\}$ denoting the absence or presence of skin at a given pixel, based on the RGB measurement $\mathbf{x}$ at that pixel. This is a useful precursor to segmenting a face or hand, or it may be used as the basis of a crude method for detecting prurient content in Web images. Taking a generative approach, we describe the likelihoods as

$$Pr(\mathbf{x}|w=0) = \mathrm{Norm}_{\mathbf{x}}[\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0], \qquad Pr(\mathbf{x}|w=1) = \mathrm{Norm}_{\mathbf{x}}[\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1],$$

and the prior probability over states as

$$Pr(w) = \mathrm{Bern}_w[\lambda].$$
In the learning algorithm, we estimate the parameters $\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \lambda$ from training data pairs $\{\mathbf{x}_i, w_i\}$ where the pixels have been labeled by hand. In particular, we learn $\boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}_0$ from the subset of the training data where $w_i = 0$, and $\boldsymbol{\mu}_1$ and $\boldsymbol{\Sigma}_1$ from the subset where $w_i = 1$.
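A compact sketch of this generative classifier on synthetic "pixel" data (all values are made up): each class-conditional density is a normal fitted to the pixels with the corresponding label, and Bayes' rule combines the likelihoods with the prior to give the posterior over skin.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic labelled training data: RGB vectors and hand labels w (1 = skin).
skin = rng.multivariate_normal([180, 120, 100], 200 * np.eye(3), size=500)
non_skin = rng.multivariate_normal([90, 110, 120], 800 * np.eye(3), size=1500)
X = np.vstack([skin, non_skin])
w = np.concatenate([np.ones(len(skin)), np.zeros(len(non_skin))])

# Learn each class-conditional normal from the corresponding subset.
mu1, Sigma1 = X[w == 1].mean(axis=0), np.cov(X[w == 1], rowvar=False)
mu0, Sigma0 = X[w == 0].mean(axis=0), np.cov(X[w == 0], rowvar=False)
lam = (w == 1).mean()                      # prior Pr(w = 1)

# Inference for a new pixel: posterior via Bayes' rule.
x_new = np.array([170, 125, 105])
l1 = multivariate_normal(mu1, Sigma1).pdf(x_new) * lam
l0 = multivariate_normal(mu0, Sigma0).pdf(x_new) * (1 - lam)
print(l1 / (l0 + l1))                      # Pr(w = 1 | x)
```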
Background subtraction
A second application of the generative classification model is background subtraction. Here, the goal is to infer a binary label $w$, which indicates whether a given pixel in the image is part of a known background ($w = 0$) or whether a foreground object is occluding it ($w = 1$). As in the skin-detection model, this is based on the RGB data at that pixel.
Modeling complex data densities
Regression models
Classification models
Connecting local models
The models in Part 2 describe the relationship between a set of measurements and the world state. They work well when the measurements and the world state are both low dimensional. However, there are many situations where this is not the case, and these models are unsuitable.
For example, consider the semantic image labeling problem in which we wish to assign a label that denotes the object class to each pixel in the image. For example, in a road scene we might wish to label pixels as ‘road’, ‘sky’, ‘car’, ‘tree’, ‘building’ or ‘other’. For an image with 10,000 pixels, this means we need to build a model relating the 10,000 measured RGB triples to the $6^{10000}$ possible joint world states (one of six labels at each pixel). None of the models discussed so far can cope with this challenge: the number of parameters involved (and hence the amount of training data and the computational requirements of the learning and inference algorithms) is far beyond what current machines can handle.
Graphical models
Models for chains and trees
Models for grids
Preprocessing
Image preprocessing and feature extraction
Per-pixel transformations
Whitening
The goal of whitening is to provide invariance to fluctuations in the mean intensity level and contrast of the image. Such variation may arise because of a change in ambient lighting intensity, the object reflectance, or the camera gain. To compensate for these factors, the image is transformed so that the resulting pixel values have zero mean and unit variance. To this end, we compute the mean $\mu$ and variance $\sigma^2$ of the original grayscale image $\mathbf{P}$.
These statistics are used to transform each pixel value $p_{ij}$ separately so that

$$x_{ij} = \frac{p_{ij} - \mu}{\sigma}.$$
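A minimal sketch of this per-pixel whitening step for a grayscale image stored as a NumPy array (no particular image library is assumed):

```python
import numpy as np

def whiten(image):
    """Transform pixel values to have zero mean and unit variance."""
    p = image.astype(np.float64)
    mu = p.mean()
    sigma = p.std()
    return (p - mu) / sigma

# Example: a random 8-bit grayscale image.
P = np.random.default_rng(0).integers(0, 256, size=(100, 100))
X = whiten(P)
print(X.mean(), X.std())   # approximately 0 and 1
```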
Histogram equalization
The goal of histogram equalization is to modify the statistics of the intensity values so that all of their moments take predefined values. To this end, a nonlinear transformation is applied that forces the distribution of pixel intensities to be flat.
- Compute the histogram $\mathbf{h}$ of the original intensities, where the $k$th of $K$ entries is given by

  $$h_k = \sum_{i=1}^{I}\sum_{j=1}^{J}\delta[p_{ij} - k],$$

  where the operation $\delta[\cdot]$ returns one if the argument is zero and zero otherwise.
- Cumulatively sum this histogram and normalize by the total number of pixels to compute the cumulative proportion $c_k = \frac{1}{IJ}\sum_{l=1}^{k} h_l$ of pixels that are less than or equal to each intensity level.
- Use the cumulative histogram $\mathbf{c}$ as a lookup table to compute the transformed value, so that

  $$x_{ij} = K\,c_{p_{ij}}.$$
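Putting the three steps together, a sketch of histogram equalization for an 8-bit grayscale image held in a NumPy array (K = 256 intensity levels):

```python
import numpy as np

def histogram_equalize(image, K=256):
    """Flatten the intensity distribution of a grayscale image."""
    p = image.astype(np.int64)
    # Step 1: histogram of the original intensities.
    h = np.bincount(p.ravel(), minlength=K)
    # Step 2: cumulative proportion of pixels at or below each level.
    c = np.cumsum(h) / p.size
    # Step 3: use the cumulative histogram as a lookup table;
    # scaling by K spreads the values over (0, K].
    return K * c[p]

P = np.random.default_rng(0).integers(0, 256, size=(100, 100))
X = histogram_equalize(P)
print(X.min(), X.max())   # values now spread across roughly (0, 256]
```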
Linear filtering
To apply a filter, we convolve the image $\mathbf{P}$ with the filter $\mathbf{F}$, where two-dimensional convolution is defined as

$$x_{ij} = \sum_{m}\sum_{n} p_{i-m,\,j-n}\, f_{m,n}.$$
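A direct, unoptimized sketch of this definition (assuming NumPy and SciPy are available), cross-checked against scipy.signal.convolve2d for a 3×3 filter with zero padding at the image borders:

```python
import numpy as np
from scipy.signal import convolve2d

def convolve2d_direct(P, F):
    """Naive 'same'-size 2D convolution following the definition above."""
    I, J = P.shape
    M, N = F.shape
    pad_m, pad_n = M // 2, N // 2
    Pp = np.pad(P, ((pad_m, pad_m), (pad_n, pad_n)))   # zero padding
    X = np.zeros_like(P, dtype=np.float64)
    for i in range(I):
        for j in range(J):
            for m in range(M):
                for n in range(N):
                    # x_ij += p_{i-m', j-n'} f_{m,n}, with centred filter
                    # offsets m' = m - pad_m, n' = n - pad_n.
                    X[i, j] += Pp[i + 2 * pad_m - m, j + 2 * pad_n - n] * F[m, n]
    return X

P = np.arange(25, dtype=np.float64).reshape(5, 5)
F = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)  # Laplacian filter
print(np.allclose(convolve2d_direct(P, F), convolve2d(P, F, mode="same")))  # True
```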