# Briefing

Challenge authors use many words without definition. These are assumed to be known, which is true within a discipline, but not when people cross over between computing, signal processing, automation, mathematics, and other disciplines. It therefore took me some time to triangulate several sources and pin down some core concepts. This briefing will focus on these.

## Feature Matching

In feature tracking, we use the time derivative to track each feature point once detected.

In feature matching, we compare images which are not necessarily related on the time axis. In other words, we need to compare features directly, and pair corresponding features in two independent images.

To identify features, we use feature descriptors. One of the most popular descriptors is called SIFT - Scale Invariant Feature Transform.

Unfortunately, this is not described in the textbook. For details, one can refer to the relevant chapter of Szeliski's (2022) textbook.

The principle used by SIFT is to gather statistics from the neighbourhood around the feature point, forming a vector which is used as an identifier.

When we move to Chapter 5 of Ma's textbook, we will use feature matching to triangulate points in 3D and determine their depth, that is, their distance from the cameras.
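As a sketch of what "pairing corresponding features" means in practice (my own illustration, with hypothetical names, not from the textbook): each descriptor in one image is matched to its nearest neighbour in descriptor space in the other image, and Lowe's ratio test discards matches whose nearest neighbour is not clearly better than the runner-up.

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test.
    Returns (index-in-desc1, index-in-desc2) pairs."""
    matches = []
    for i, d in enumerate(desc1):
        # Euclidean distance from descriptor d to every descriptor in image 2.
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        # Keep the match only if it is clearly better than the second best.
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

# Two tiny descriptor sets: descriptor 0 of image 1 resembles
# descriptor 1 of image 2, and vice versa.
desc1 = np.array([[1.0, 0.0], [0.0, 1.0]])
desc2 = np.array([[0.0, 1.01], [1.0, 0.01], [5.0, 5.0]])
print(match_descriptors(desc1, desc2))  # [(0, 1), (1, 0)]
```

Real SIFT descriptors are 128-dimensional, but the matching logic is the same.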

## The Laplacian

The Laplacian is commonly used in image processing, even if I have not seen it as often in the machine vision literature. The idea is to use the second-order derivative, $\nabla^2 I(x,y)$. This is commonly approximated by one of the filters
$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad\text{or}\quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$
To avoid the problems of the small window size, we typically combine the Laplacian with a blurring filter, usually a Gaussian. Thus we compute $E = L*G*I$, where $$I$$ is the image, $$L$$ the Laplacian, and $$G$$ the Gaussian. Because convolution is associative, we have $E = L*(G*I) = (L*G)*I$. In other words, we can precalculate the Laplacian of Gaussian (LoG) filter $$L*G$$ as a single filter.
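The associativity argument can be checked numerically. The sketch below (my own illustration, with assumed kernel sizes) convolves the 3×3 Laplacian filter above with a Gaussian kernel to precompute a LoG kernel, then verifies that filtering once with the LoG kernel agrees with blurring first and applying the Laplacian second, away from the image border (where zero padding makes the two differ).

```python
import numpy as np

def conv_full(a, b):
    """Full 2D convolution of two small kernels (output size grows)."""
    out = np.zeros((a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i:i + b.shape[0], j:j + b.shape[1]] += a[i, j] * b
    return out

def conv_same(image, kernel):
    """Same-size convolution with zero padding. The kernels here are
    symmetric, so correlation and convolution coincide."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def gaussian_kernel(size, sigma):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return g / g.sum()

L = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)  # Laplacian
G = gaussian_kernel(7, 1.0)
LoG = conv_full(L, G)  # precomputed 9x9 Laplacian-of-Gaussian filter

rng = np.random.default_rng(0)
I = rng.random((16, 16))
E1 = conv_same(conv_same(I, G), L)  # L * (G * I)
E2 = conv_same(I, LoG)              # (L * G) * I
print(np.allclose(E1[4:-4, 4:-4], E2[4:-4, 4:-4]))  # True
```

Note that the LoG kernel sums to zero, like the Laplacian itself, so it gives no response on constant regions.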

## Gaussian Scale Space

1. The image is a 2D matrix/signal $$I(x,y)$$.
2. When we smoothen this by a Gaussian, we take a Gaussian function $$G(x,y,\sigma)$$ for some standard deviation $$\sigma$$, and calculate the convolution $L(x,y,\sigma)=G(x,y,\sigma)*I(x,y)$
3. We can read this as a 3D signal/tensor, i.e. we have a space in $$x$$, $$y$$, and $$\sigma$$ co-ordinates.
4. Calculate the difference of Gaussian images $D(x,y,\sigma) = L(x,y,k\sigma)- L(x,y,\sigma)= [G(x,y,k\sigma)-G(x,y,\sigma)]*I(x,y)$
5. This is also a tensor image.
6. Each pixel in the tensor has 26 immediate neighbours, that is, forming a cube of 27 pixels.
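The steps above can be sketched in NumPy. This is my own minimal illustration: the choices $\sigma_0 = 1.6$, $k = \sqrt{2}$, and five levels are common defaults, not something the notes mandate.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur (zero padding at the borders)."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    blur_1d = lambda row: np.convolve(row, g, mode="same")
    img = np.apply_along_axis(blur_1d, 0, img)   # blur columns
    return np.apply_along_axis(blur_1d, 1, img)  # blur rows

# Step 2: the Gaussian scale space L(x, y, sigma_i) with sigma_i = sigma0 * k**i.
sigma0, k, levels = 1.6, 2 ** 0.5, 5
rng = np.random.default_rng(1)
I = rng.random((48, 48))
L = np.stack([gaussian_blur(I, sigma0 * k ** i) for i in range(levels)])

# Step 4: the difference-of-Gaussian tensor D(x, y, sigma); one slice fewer than L.
D = L[1:] - L[:-1]

# Step 6: each interior pixel of D has 26 neighbours in a 3x3x3 cube.
def is_extremum(D, s, y, x):
    """True if D[s, y, x] is the max or min of its 27-pixel cube."""
    patch = D[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    v = D[s, y, x]
    return v == patch.max() or v == patch.min()
```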

## Corner Detection

Lowe uses two criteria to identify corners; robust corners must satisfy both. Let the Hessian of the difference image be
$$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$$
Note that the Hessian is different from the matrix of differentials used in the Harris detector, but the logic is still similar: in both cases the eigenvalues reveal edges.

1. Select pixels which are maximal or minimal within their $$3\times3\times3$$ neighbourhood in scale space.
2. Reject pixels for which $$\mathsf{Tr}(H)^2/\mathsf{Det}(H)>10$$.

This does well in practice but is not as well founded theoretically as the Harris detector.
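The trace-to-determinant test in step 2 can be written out with finite differences. This sketch is my own illustration (the function name is hypothetical): it estimates the Hessian entries of a 2D DoG slice at a pixel and returns the ratio $\mathsf{Tr}(H)^2/\mathsf{Det}(H)$.

```python
import numpy as np

def edge_response(D, y, x):
    """Tr(H)^2 / Det(H) from central finite differences on a 2D DoG slice.
    Returns infinity when Det(H) <= 0, so such points are always rejected."""
    dxx = D[y, x + 1] - 2 * D[y, x] + D[y, x - 1]
    dyy = D[y + 1, x] - 2 * D[y, x] + D[y - 1, x]
    dxy = (D[y + 1, x + 1] - D[y + 1, x - 1]
           - D[y - 1, x + 1] + D[y - 1, x - 1]) / 4
    tr = dxx + dyy
    det = dxx * dyy - dxy ** 2
    return tr ** 2 / det if det > 0 else np.inf

# An isotropic peak, D = -(x^2 + y^2): here dxx = dyy = -2 and dxy = 0,
# so the ratio is (-4)^2 / 4 = 4, well below the threshold of 10.
yy, xx = np.mgrid[0:9, 0:9].astype(float)
D = -((xx - 4) ** 2 + (yy - 4) ** 2)
print(edge_response(D, 4, 4))  # 4.0
```

An elongated ridge, by contrast, has one large and one near-zero eigenvalue, driving the ratio up and getting the point rejected.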

## Descriptors

1. Calculate the gradient at each pixel in a neighbourhood.
2. The gradients are calculated at the same scale where the corner was detected.
3. The dominant orientation can serve as a descriptor.
4. Danger: small gradients make an unreliable descriptor.
5. Lowe (SIFT) uses a histogram of the orientations.
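A magnitude-weighted orientation histogram can be sketched as below. This is my own minimal illustration; the choice of 36 bins (10° each) follows common SIFT practice, and weighting by gradient magnitude addresses the danger in step 4, since weak gradients then contribute little.

```python
import numpy as np

def orientation_histogram(patch, bins=36):
    """Gradient-orientation histogram of an image patch, weighted by
    gradient magnitude (bins=36 means 10-degree bins)."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 360.0),
                           weights=magnitude)
    return hist

# A horizontal ramp: every gradient points along +x (angle 0 degrees),
# so the dominant orientation falls in bin 0.
patch = np.tile(np.arange(8.0), (8, 1))
hist = orientation_histogram(patch)
print(np.argmax(hist))  # 0
```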

## Considerations

1. How many octaves?
2. How many sub-octaves?
3. What $$\sigma$$s do we use in the Gaussian?
4. How large a neighbourhood do we use (SIFT uses $$16\times16$$)?
5. What fall-off function do we use over the neighbourhood?
6. What bins do we use in the histogram?

There are so many steps in the SIFT calculations that implementations are likely to differ.