EARLY VISION: JUST ONE IMAGE
Linear Filters
Local Image Features
An object is separated from its background in an image by an occluding contour. Draw a path in the image that crosses such a contour. On one side, pixels lie on the object, and on the other, the background. Finding occluding contours is an important challenge, because the outline of an object—which is one cue to its shape—is formed by occluding contours.
COMPUTING THE IMAGE GRADIENT
For an image $\mathcal{I}(x, y)$, the gradient is $\nabla\mathcal{I} = \left(\frac{\partial\mathcal{I}}{\partial x}, \frac{\partial\mathcal{I}}{\partial y}\right)^{T}$, which we could estimate by observing that

$$\frac{\partial\mathcal{I}}{\partial x} = \lim_{\delta x \rightarrow 0} \frac{\mathcal{I}(x + \delta x, y) - \mathcal{I}(x, y)}{\delta x} \approx \mathcal{I}_{i+1,j} - \mathcal{I}_{i,j}.$$
These kinds of derivative estimates are known as finite differences. Image noise tends to result in pixels not looking like their neighbors, so simple finite differences tend to give strong responses to noise. As a result, taking one finite difference for $\partial\mathcal{I}/\partial x$ and one for $\partial\mathcal{I}/\partial y$ gives noisy gradient estimates. The way to deal with this problem is to smooth the image and then differentiate it.
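As a minimal sketch of smoothing-then-differentiating, the two operations can be combined into a single derivative-of-Gaussian filter. This assumes NumPy and SciPy are available; the choice of sigma is illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def image_gradient(image, sigma=1.0):
    """Estimate the image gradient with derivative-of-Gaussian filters.

    Smoothing and differentiation are combined in one filter: order=1
    along an axis convolves with the first derivative of a Gaussian
    along that axis, suppressing the noise that raw finite differences
    would amplify.
    """
    # Axis 0 is rows (y), axis 1 is columns (x).
    dx = gaussian_filter(image, sigma, order=(0, 1))
    dy = gaussian_filter(image, sigma, order=(1, 0))
    return dx, dy
```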
REPRESENTING THE IMAGE GRADIENT
There are two important representations of the image gradient.
- The first is to compute edges, where there are very fast changes in brightness. Edges are usually marked at points where the magnitude of the gradient is extremal.
- The second is to use gradient orientations, which are largely independent of illumination intensity. A minimal sketch of both representations follows this list.
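The sketch below converts the gradient components into the two representations, assuming the `dx`, `dy` estimates from the gradient sketch above:

```python
import numpy as np

def gradient_magnitude_orientation(dx, dy):
    """Convert gradient components into magnitude and orientation.

    Edges are commonly marked where the magnitude is locally extremal
    along the gradient direction; orientation is largely unaffected by
    changes in illumination intensity.
    """
    magnitude = np.hypot(dx, dy)
    orientation = np.arctan2(dy, dx)  # radians, in (-pi, pi]
    return magnitude, orientation
```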
FINDING CORNERS AND BUILDING NEIGHBORHOODS
Points worth matching are corners, because a corner can be localized, which means we can tell where a corner is. This motivates the more general term interest point often used to describe a corner.
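One common way to score how well a point can be localized is a Harris-style corner response built from the local structure tensor. The sketch below is illustrative; the constant k and the window scale are conventional choices rather than values fixed by the text:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(image, sigma=1.0, k=0.05):
    """Harris-style corner response.

    The response is large where the structure tensor has two large
    eigenvalues, i.e., where the local window changes appearance when
    shifted in any direction, so the point can be localized.
    """
    # Derivative-of-Gaussian gradient estimates.
    dx = gaussian_filter(image, sigma, order=(0, 1))
    dy = gaussian_filter(image, sigma, order=(1, 0))
    # Smooth the products of derivatives over a local window.
    Ixx = gaussian_filter(dx * dx, 2 * sigma)
    Iyy = gaussian_filter(dy * dy, 2 * sigma)
    Ixy = gaussian_filter(dx * dy, 2 * sigma)
    det = Ixx * Iyy - Ixy ** 2
    trace = Ixx + Iyy
    return det - k * trace ** 2  # peaks mark candidate corners
```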
DESCRIBING NEIGHBORHOODS WITH SIFT AND HOG FEATURES
We know the center, radius, and orientation of an image patch, and must now represent it. Orientations should provide a good representation. They are unaffected by changes in image brightness, and different textures tend to have different orientation fields. The pattern of orientations in different parts of the patch is likely to be quite distinctive. Our representation should be robust to small errors in the center, radius, or orientation of the patch, because we are unlikely to estimate these exactly right.
SIFT Features
We can now compute a representation that is not affected by translation, rotation, or scale. For each patch, we rectify the patch by translating the center to the origin, rotating so the orientation direction lies along (say) the x-axis, and scaling so the radius is one. Any representation we compute for this rectified patch will be invariant to translations, rotations, and scale. Although we do not need to rectify in practice—instead, we can work the rectification into each step of computing the description—it helps to think about computing descriptions for a rectified patch.
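As an illustration of what rectification computes, the sketch below samples a rectified patch directly from the image by mapping rectified coordinates back through the scale, rotation, and translation; the `out_size` parameter is an arbitrary choice for this sketch:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def rectify_patch(image, center, radius, theta, out_size=32):
    """Sample a rectified patch: center at the origin, orientation
    along the x-axis, radius scaled to one.

    center = (row, col) of the patch; theta = orientation in radians.
    """
    # Rectified coordinates in [-1, 1] x [-1, 1].
    u = np.linspace(-1.0, 1.0, out_size)
    uu, vv = np.meshgrid(u, u)
    # Rotate by theta and scale by radius to map back into the image.
    cols = center[1] + radius * (np.cos(theta) * uu - np.sin(theta) * vv)
    rows = center[0] + radius * (np.sin(theta) * uu + np.cos(theta) * vv)
    return map_coordinates(image, [rows, cols], order=1, mode='nearest')
```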
A SIFT descriptor (for Scale Invariant Feature Transform) is constructed out of image gradients, and uses both magnitude and orientation. The descriptor is a set of histograms of image gradients that are then normalized; this normalization suppresses the effects of changes in illumination intensity. The histograms expose general spatial trends in the image gradients in the patch but suppress detail.
The standard SIFT descriptor is obtained by first dividing the rectified patch into an $n \times n$ grid. We then subdivide each grid element into an $m \times m$ subgrid of subcells.
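A minimal sketch of the grid-of-histograms idea, assuming `magnitude` and `orientation` arrays for the rectified patch (as computed earlier); it omits SIFT's Gaussian weighting, trilinear interpolation between bins, and clipped renormalization:

```python
import numpy as np

def sift_like_descriptor(magnitude, orientation, n_grid=4, n_bins=8):
    """Grid of magnitude-weighted orientation histograms, normalized.

    With n_grid=4 and n_bins=8 this yields the familiar
    128-dimensional descriptor length.
    """
    h, w = magnitude.shape
    desc = []
    for i in range(n_grid):
        for j in range(n_grid):
            rows = slice(i * h // n_grid, (i + 1) * h // n_grid)
            cols = slice(j * w // n_grid, (j + 1) * w // n_grid)
            # Histogram of orientations in this cell, weighted by
            # gradient magnitude.
            hist, _ = np.histogram(orientation[rows, cols],
                                   bins=n_bins, range=(-np.pi, np.pi),
                                   weights=magnitude[rows, cols])
            desc.append(hist)
    desc = np.concatenate(desc)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```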
HOG Features
The HOG feature (for Histogram Of Gradient orientations) is an important variant of the SIFT feature.
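If scikit-image is available, its `hog` function computes such a descriptor directly. In this usage sketch, `patch` is assumed to be a grayscale array, and the parameter values echo a common pedestrian-detection configuration rather than anything mandated here:

```python
from skimage.feature import hog

# Compute a HOG descriptor for a grayscale image patch: orientation
# histograms per cell, contrast-normalized over overlapping blocks.
descriptor = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), block_norm='L2-Hys')
```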
Texture
Texture is a phenomenon that is widespread, easy to recognise, and hard to define.
Texture is important, because texture appears to be a very strong cue to object identity. Most modern object recognition programs are built around texture representation machinery of one form or another. This may be because texture is also a strong cue to material properties: what the material that makes up an object is like. There are three main kinds of texture representation.
- Local texture representations encode the texture very close to a point in an image. These representations can't be comprehensive, because they look at a small piece of the image. However, local texture representations are very useful in image segmentation, where we must break an image into large, useful components, usually called regions; a minimal filter-bank sketch follows this list.
- Pooled texture representations describe the texture within an image domain, a description some problems require. For example, texture recognition is the problem of determining what texture is represented by a patch in an image: here we have a domain (the patch) and we want a representation of the overall texture in the domain. Similarly, in material recognition, one must decide what material is represented by a patch in the image.
- Data-driven texture representations model a texture by a procedure that can generate a textured region from an example.
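Anticipating the next section, a minimal sketch of a local texture representation: filter the image with a small bank of oriented Gaussian-derivative filters and keep the vector of responses at each pixel. Real filter banks use more orientations and dedicated spot and bar filters; the scales here are arbitrary choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filter_bank_responses(image, sigmas=(1.0, 2.0, 4.0)):
    """Stack of simple oriented filter responses at each pixel.

    Each scale contributes an x-derivative, a y-derivative, and a
    blob-like (Laplacian-of-Gaussian style) response.
    """
    responses = []
    for s in sigmas:
        responses.append(gaussian_filter(image, s, order=(0, 1)))  # vertical structure
        responses.append(gaussian_filter(image, s, order=(1, 0)))  # horizontal structure
        responses.append(gaussian_filter(image, s, order=(2, 0)) +
                         gaussian_filter(image, s, order=(0, 2)))  # blob response
    return np.stack(responses, axis=-1)  # (H, W, n_filters) per-pixel vectors
```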
LOCAL TEXTURE REPRESENTATIONS USING FILTERS
POOLED TEXTURE REPRESENTATIONS BY DISCOVERING TEXTONS
SYNTHESIZING TEXTURES AND FILLING HOLES IN IMAGES
IMAGE DENOISING
SHAPE FROM TEXTURE
MULTIPLE IMAGES
Stereopsis
Fusing the pictures recorded by our two eyes and exploiting the difference (or disparity) between them allows us to gain a strong sense of depth. This chapter is concerned with the design and implementation of algorithms that mimic our ability to perform this task, known as stereopsis.
Stereo vision involves two processes: the fusion of features observed by two (or more) eyes, and the reconstruction of their three-dimensional preimage.
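Once features have been fused and per-pixel disparities measured for a rectified pair, reconstruction reduces to triangulation, $Z = fB/d$. A minimal sketch, assuming `disparity` is a floating-point array in pixels:

```python
import numpy as np

def depth_from_disparity(disparity, focal_length_px, baseline):
    """Triangulate depth for a rectified stereo pair: Z = f * B / d.

    disparity: per-pixel disparity in pixels (from the fusion step);
    focal_length_px: focal length in pixels;
    baseline: distance between the two camera centers.
    """
    with np.errstate(divide='ignore'):
        depth = focal_length_px * baseline / disparity
    depth[disparity <= 0] = np.inf  # no match, or point at infinity
    return depth
```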