---
title: Feature Descriptors
categories: lecture
---

# Briefing

**Challenge** Authors use many words without definition.
They are assumed to be known, which is true within a discipline
but not when people cross over between computing, signal processing,
automation, mathematics, and other disciplines.
Therefore it took me some time to triangulate several sources and
pin down some core concepts.  The briefing will focus on these.

## Feature Matching

In feature tracking, we use the time derivative to track
each feature point once detected.

In feature matching, we compare images which are not necessarily 
related on the time axis.  In other words, we need to compare
features directly, and pair corresponding features in two independent
images.

To identify features, we use feature descriptors.
One of the most popular descriptors is called SIFT -
Scale Invariant Feature Transform.

Unfortunately, this is not described in the textbook.
For details, one can refer to the
[Szeliski (2022)](https://szeliski.org/Book/) textbook,
Chapter

The principle used by SIFT is to gather statistics from the neighbourhood
around the feature point, forming a vector which is used as an identifier.
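
To make the pairing step concrete, here is a minimal sketch of
nearest-neighbour matching between two sets of descriptor vectors in NumPy.
The function name, the array shapes, and the 0.75 ratio threshold are
illustrative assumptions, not a prescription of how SIFT matching must be done.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Pair each descriptor in desc_a with its nearest neighbour in desc_b.

    desc_a, desc_b: arrays of shape (n, d) and (m, d), one descriptor per row.
    A match is kept only if the best neighbour is clearly closer than the
    second best; the 0.75 threshold is an illustrative choice.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to every descriptor in B
        j, k = np.argsort(dists)[:2]                # best and second-best neighbour
        if dists[j] < ratio * dists[k]:
            matches.append((i, j))
    return matches
```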

When we move to Chapter 5 of Ma's textbook, we will use feature matching to 
triangulate points in 3D and determine their depth,
that is, their distance from the cameras.

## Gaussian Scale Space


1.  The image is a 2D matrix/signal $I(x,y)$. 
2.  When we smooth this with a Gaussian, we take a Gaussian function
    $G(x,y,\sigma)$ for some standard deviation $\sigma$, and calculate
    the convolution
    $$L(x,y,\sigma)=G(x,y,\sigma)*I(x,y)$$
3.  We can read this as a 3D signal/tensor, i.e. we have a space in
    $x$, $y$, and $\sigma$ co-ordinates.
4.  Calculate the difference-of-Gaussian images
    $$D(x,y,\sigma) =
    L(x,y,k\sigma)- L(x,y,\sigma)=
    [G(x,y,k\sigma)-G(x,y,\sigma)]*I(x,y)$$
    (a sketch of this computation in code follows the list).
5.  This is also a tensor image.
6.  Each pixel in the tensor has 26 immediate neighbours; together with
    the pixel itself, they form a $3\times3\times3$ cube of 27 pixels.
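
A minimal sketch of steps 2–4 above, assuming NumPy and SciPy are available;
the values of `sigma0`, `k`, and the number of levels are illustrative
choices, not Lowe's exact settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, sigma0=1.6, k=2 ** 0.5, levels=5):
    """Build L(x, y, sigma) for a geometric sequence of sigmas and return
    the difference-of-Gaussian stack D(x, y, sigma)."""
    sigmas = [sigma0 * k ** i for i in range(levels)]
    L = np.stack([gaussian_filter(image.astype(float), s) for s in sigmas])
    D = L[1:] - L[:-1]  # D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)
    return D, sigmas
```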
  
## Corner Detection

Lowe uses two criteria to identify corners.  Robust corners must
satisfy both.
Let the Hessian of the difference image be 
$$H=
\begin{bmatrix}
   D_{xx} & D_{xy} \\
   D_{xy} & D_{yy} 
\end{bmatrix}$$
Note that the Hessian is different from the matrix of first-derivative
products (the structure tensor) used in the Harris detector, but the logic
is still similar.
In both cases the eigenvalues reveal whether a point lies on an edge
or at a corner.

1.  Select pixels which are maximal or minimal within their
    $3\times3\times3$ neighbourhood in scale space.
2.  Reject pixels for which $\mathsf{Tr}(H)^2/\mathsf{Det}(H)>10$.

This does well in practice but is not as well founded theoretically
as the Harris detector.
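
As a rough illustration, the two criteria could be checked on a
difference-of-Gaussian stack `D` indexed as (scale, row, col) along the
following lines; the finite-difference Hessian and the guard against a
non-positive determinant are my own additions to keep the sketch
self-contained.

```python
import numpy as np

def is_extremum(D, s, y, x):
    """Criterion 1: the pixel is the maximum or minimum of its
    3x3x3 neighbourhood in the DoG stack D (indexed scale, row, col)."""
    patch = D[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    centre = D[s, y, x]
    return centre == patch.max() or centre == patch.min()

def passes_edge_test(D, s, y, x, r=10.0):
    """Criterion 2: reject pixels with Tr(H)^2 / Det(H) > r, where H is the
    Hessian of the difference image, approximated by finite differences."""
    Dxx = D[s, y, x + 1] - 2 * D[s, y, x] + D[s, y, x - 1]
    Dyy = D[s, y + 1, x] - 2 * D[s, y, x] + D[s, y - 1, x]
    Dxy = (D[s, y + 1, x + 1] - D[s, y + 1, x - 1]
           - D[s, y - 1, x + 1] + D[s, y - 1, x - 1]) / 4.0
    tr, det = Dxx + Dyy, Dxx * Dyy - Dxy * Dxy
    # A non-positive determinant means the curvatures differ in sign,
    # so such points are discarded as well.
    return det > 0 and tr * tr / det <= r
```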

## Descriptors

1.  Calculate the gradient at each pixel in a neighbourhood.
    - This is calculated at the same scale at which the corner was detected.
2.  The gradient gives an orientation.
3.  The dominant orientation can serve as a descriptor.
4.  Danger: small gradients make an unreliable descriptor.
5.  Lowe (SIFT) uses a histogram of the orientations
    (see the sketch after this list).
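
A minimal sketch of such an orientation histogram over a neighbourhood
patch, using NumPy; the 36-bin layout and the magnitude weighting are common
choices, but the exact binning and fall-off function vary (see the
considerations below).

```python
import numpy as np

def orientation_histogram(patch, bins=36):
    """Histogram of gradient orientations over a neighbourhood,
    weighted by gradient magnitude; 36 bins means 10 degrees per bin."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)  # orientation of each pixel's gradient
    hist, _ = np.histogram(orientation, bins=bins,
                           range=(-np.pi, np.pi), weights=magnitude)
    return hist
```

The dominant orientation in step 3 is then the peak bin, `hist.argmax()`.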

## Considerations

1.  How many octaves?
2.  How many sub-octaves?
3.  What $\sigma$s do we use in the Gaussian?
4.  How large a neighbourhood do we use (SIFT uses $16\times16$)?
5.  What fall-off function do we use over the neighbourhood?
6.  What bins do we use in the histogram?

There are so many steps in the SIFT calculations that implementations
are likely to differ.
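
In practice, therefore, one usually relies on a library implementation and
compares its output against the concepts above. A minimal usage sketch,
assuming OpenCV (`opencv-python` 4.4 or later) is installed; the image file
names are placeholders.

```python
import cv2

img1 = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

# Pair descriptors by nearest neighbour, keeping only clear matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(desc1, desc2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])
print(len(good), "matches between the two images")
```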