Revision 7476c4ae6a604a1f3b768e6b1dbbb35056cd7c57 (click the page title to view the current version)

Feature Descriptors

Changes from 7476c4ae6a604a1f3b768e6b1dbbb35056cd7c57 to baed391b29f629255a156922c0dc26c60c86ba8f

---
title: Feature Descriptors
categories: lecture
---

# Briefing

**Challenge** Authors use many words without definition.
They are assumed to be known, which is true within a discipline
but not when people cross over between computing. signal processing,
automation, mathemaics, and other disciplines.
Therefore it took me some time to triangulate several sources and
pin down some core concepts.  The briefint will focus on these.

## Feature Matching

In feature tracking, we use the time derivative to track
each feature point once detected.

In feature matching, we compare images which are not necessarily 
related on the time axis.  In other words, we need to compare
features directly, and pair corresponding features in two independent
images.

To identify features, we use feature descriptors.
One of the most popular descriptors is called SIFT -
Scale Invariant Feature Transform.

Unfortunately, this is not described in the textbook.
For details, one can refer to 
[Szeliski (2022)](https://szeliski.org/Book/) textbook,
Chapter

The principle used by SIFT is to gather statistics from the neighbourhood
around the feature point, forming a vector which is used as an identifier.

When we move to Chapter 5 of Ma's, we will use feature matching to 
triangulate points in 3D, and determine there depth,
that is their distance from the cameras.

## The Laplacian

The Laplacian is commonly used in image processing, even if I
have not seen it as often in the machine vision literature.
The idea is to use the second order derivative,
$$\nabla^2I(x,y)$$
$$\nabla^2I(x,y) = 
\frac{\partial^2f}{\partial x^2}+ \frac{\partial^2f}{\partial y^2}$$
This is commonly approximated by one of the filters
$$\begin{bmatrix}
  0 & 1 & 0 \\
  1 & -4 & 1 \\
  0 & 1 & 0 
\end{bmatrix}
\quad\text{or}\quad
\begin{bmatrix}
  1 & 1 & 1 \\
  1 & -8 & 1 \\
  1 & 1 & 1 
\end{bmatrix}
$$
To avoid the problems of the small window size, we typically combine
the Laplacian with a blurring filter, usually a Gaussian.  Thus we
compute
$$ E = L*G*I$$
where $I$ is the image, $L$ the Laplacian, and $G$ the Gaussian.
Because of the associativity of linear operations, we have
$$ E = L*(G*I)=(L*G)*I$$
In other words, we can precaulcualte the *Laplacian of Gaussian* (LoG) 
filter $L*G$ as one filter.

## Gaussian Scale Space

1.  The image is a 2D matrix/signal $I(x,y)$. 
2.  When we smoothen this by a Gaussian, we take a Gaussian function
    $G(x,y,\sigma)$ for some standard deviation $\sigma$, and calculate
    the convolution
    $$L(x,y,\sigma)=G(x,y,\sigma)*I(x,y)$$
3.  We can read this a 3D signal/tensor, i.e. we have a space in
    $x$, $y$, and $\sigma$ co-ordinates.
4.  Calculate the difference of Gaussian images
    $$D(x,y,\sigma) =
    L(x,y,k\sigma)- L(x,y,\sigma)=
    [G(x,y,k\sigma)-G(x,y,\sigma)]*I(x,y)$$
5.  This is also a tensor image.
6.  Each pixel in the tensor has 26 immediate neighbours, that
    is, forming a cube of 27 pixels.
  
## Corner Detection

Lowe use two criteria to identify corners.  Robust corners must
satisfy both.
Let the Hessian of the difference image be 
$$H=
\begin{bmatrix}
   D_{xx} & D_{xy} \\
   D_{xy} & D_{yy} 
\end{bmatrix}$$
Note that the Hessian is different from the matrix of differentials
used in the Harris detector, but the logic is still similar.
In both cases the eigenvectors show edges.

1.  Select pixels which are maximal or minimal within their
    $3\times3\times3$ neighbourhood in scale space.
2.  Reject pixels for which $\mathsf{Tr}(H)^2/\mathsf{Det}(H)>10$.

This does well in practice but is not as well founded theoretically
as the Harris detector.

## Descriptors

1.  Calculate the gradient in each pixel in a neighbourhood.
    - Calculated in the same scale where the corner was detected.
2.  The gradient gives orientation.
3.  Dominant orientation can serve as a descriptor.
4.  Danger: small gradients make an unreliable descriptor.
5.  Lowe (SIFT) uses a histogram of the orientations.

## Considerations

1.  How many octaves?
2.  How many sub-octaves?
3.  What $\sigma$s do we use in the Gaussian?
4.  How large neighbourhood do we use (SIFT uses $16\times16$?
5.  What fall-off function do we use over the neigbourhood?
6.  What bins do we use in the histogram?

There are so many steps in the SIFT calculations that implementations
are likely to differ.