# Statistics

---
title: Evaluation and Statistics
---

## Heuristics
[Szeliski (2022): Computer Vision: Algorithms and Applications, 2nd ed.](https://szeliski.org/Book/)
Chapter 5.1-5.2

+ Accuracy
+ FP/FN rate
+ Confusion Matrix
# Exercises

## Exercise 1

Review the machine learning systems you studied last week.
Calculate the confusion matrix for each of the systems.

+ Are the errors reasonably balanced?
+ Are any classes particularly difficult to detect?

## Exercise 2

Review again each of the systems from last week.

1. Calculate the false positive and false negative rates.
2. Estimate the standard deviation of the error rates.
3. Assess the quality of the machine learning system.
Can you be confident that the error probability is satisfactory?
4. How large does the test set have to be to make a confident assessment?

# Briefing

## Recap

*What did we learn last week?*

1. Supervised learning
2. Loss function -- Cross-Entropy
+ $E(\mathbf{w}) = -\sum_n \ln p_{n,t_n}$
+ $p_{n,k}$ is the network's estimated probability that
object $n$ has the class $k$
+ $t_n$ is the true class of object $n$
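
As a minimal sketch of the loss above, the following computes the cross-entropy as the negative sum of log-probabilities that the network assigns to the true classes. The function name `cross_entropy` and the example probabilities are made up for illustration; they are not taken from last week's systems.

```python
import math

def cross_entropy(probs, targets):
    """Negative sum of the log-probabilities of the true classes.

    probs[n][k] is the estimated probability that object n has class k,
    and targets[n] is the true class t_n of object n.
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, targets))

# Hypothetical network outputs for three objects and three classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
targets = [0, 1, 2]  # true class of each object

loss = cross_entropy(probs, targets)
print(f"cross-entropy: {loss:.4f}")
```

A confident correct prediction (probability close to 1) contributes almost nothing to the loss, while a low probability for the true class is penalised heavily.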

## Regression problem - Gravitational Lensing

1.  Lensing model
+ Source - size $\sigma$ and position $(x,y)$
+ Lens  - Einstein radius $R_E$
+ Distorted Image
2.  Four parameters determine the distorted image.
+ Can we recover these four parameters from the image?
3.  Instead of a discrete class, we want the network to predict
$\sigma,x,y,R_E$
4.  Loss function is Mean Squared Error
+ $\mathrm{MSE} = \frac14\left[(\sigma-\hat\sigma)^2+(x-\hat x)^2 +(y - \hat y)^2+(R_E-\hat R_E)^2\right]$
+ Note, normalisation can vary.
+ The starting point is the sum of squared errors (SSE).
+ We may or may not normalise by dividing by the number
of data points and/or prediction parameters.
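
The relation between SSE and MSE for a single prediction can be sketched as follows. The parameter values are invented for illustration, and the choice to normalise by the number of parameters (four) is just one of the options mentioned above.

```python
def sse(true, pred):
    """Sum of squared errors over the predicted parameters."""
    return sum((t - p) ** 2 for t, p in zip(true, pred))

def mse(true, pred):
    """SSE normalised by the number of predicted parameters."""
    return sse(true, pred) / len(true)

# Hypothetical ground truth and prediction for (sigma, x, y, R_E).
true = (20.0, 10.0, -5.0, 30.0)
pred = (22.0, 9.0, -5.0, 28.0)

print("SSE:", sse(true, pred))
print("MSE:", mse(true, pred))
```

Since the normalisation is only a constant factor, it does not change which weights minimise the loss, but it does change the scale of the reported numbers.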

## Evaluation

+ Confusion Matrix
+ false positive, false negative
+ Accuracy

$$\mathrm{Acc} = \frac{ \mathrm{TP} + \mathrm{TN} }{ \mathrm{TP}+\mathrm{FN} + \mathrm{TN}+\mathrm{FP} }$$
+ Warning.  Biased datasets
+ Other heuristics

$$F_1 = 2\cdot\frac{ \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \cdot \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} }{ \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} + \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} }$$
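
The heuristics so far can be sketched in a few lines of Python for a two-class problem. The function names and the toy labels are assumptions for illustration, and $F_1$ is computed using its standard definition as the harmonic mean of precision and recall. The example data are deliberately biased (5 positives against 95 negatives) to show why accuracy alone can mislead.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif not t and not p:
            tn += 1
        elif p:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    # Undefined (division by zero) if there are no true positives at all.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical biased test set: 5 positives, 95 negatives.
# The classifier finds only one of the five positives.
y_true = [1] * 5 + [0] * 95
y_pred = [1] + [0] * 4 + [0] * 95

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={acc:.2f}  F1={f1:.2f}")
```

The accuracy looks high because the negatives dominate, whereas $F_1$ reveals that most positives are missed.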

+ TP/TN/FP/FN are Stochastic Variables
+ Binomially Distributed
+ Regression: absolute or squared error
+ Also a Stochastic Variable
(depends on the data drawn randomly from the population)
+ Mean over a large dataset gives a reasonable estimator
+ Standard Deviation can be estimated using the sample
standard deviation
+ Important: Each item in the test set makes one experiment/observation.
This allows statistical analysis of the error estimates.
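
A rough sketch of how these estimates could be computed is given below. It assumes independent test items, so that the number of errors is binomially distributed, and uses the normal approximation (with $z=1.96$ for roughly 95% confidence) to estimate a required test set size. The function names and all numbers are hypothetical.

```python
import math
import statistics

def error_rate_std(errors, n):
    """Estimated error probability and its standard deviation.

    Assumes n independent test items, i.e. a binomial error count.
    """
    p_hat = errors / n
    return p_hat, math.sqrt(p_hat * (1 - p_hat) / n)

def required_test_size(p, half_width, z=1.96):
    """Smallest n for which z * sqrt(p(1-p)/n) <= half_width
    (normal approximation to the binomial distribution)."""
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

# Classification: hypothetical 12 errors in a test set of 200 items.
p_hat, std = error_rate_std(errors=12, n=200)
print(f"error rate {p_hat:.3f} +/- {std:.3f}")

# Test set size needed to pin the error rate down to +/- 0.01.
n_needed = required_test_size(p=0.06, half_width=0.01)
print("required test set size:", n_needed)

# Regression: hypothetical squared errors, one per test item.
sq_errors = [0.8, 1.3, 0.2, 2.1, 0.6]
mean_sq = statistics.mean(sq_errors)
std_of_mean = statistics.stdev(sq_errors) / math.sqrt(len(sq_errors))
print(f"mean squared error {mean_sq:.2f} +/- {std_of_mean:.2f}")
```

The standard deviation shrinks only with the square root of the test set size, so halving the uncertainty requires four times as many test items.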

## Overtraining and Undertraining

+ Exercise last week.
+ I have not been able to generate the expected result.
+ The deep networks tested produce impressive results with
very little training.
+ Still, important principle.
+ The training data
+ contain a limited amount of information about the population
+ have some peculiar quirks
+ The network has a certain number of degrees of freedom (DOF),
i.e. weights, which can be adjusted to store information extracted
from the training set.
+ Undertraining means insufficient training to absorb
the relevant information
+ insufficient epochs
+ insufficient training set (relative to DOF)
+ A large network with many DOF can learn a small dataset completely.
+ Overtraining means that the network has learnt more than
what generalises
+ Sometimes, regularisation techniques are used to remove
(zero out) DOF with little impact
+ Occam's Razor

## Normalisation

+ Network layers add up coefficients from different sources
+ Large numbers contribute a lot.
+ Numbers with a small range contribute little.
+ Images are simple.  One range for all pixels.
+ Some datasets combine data with different ranges.
+ If some have range $\pm1$, some have range
$(0,10^{-5})$ and others $(0,10^5)$,
then small numbers are negligible and are effectively
ignored.
+ Scaling is standard procedure.
+ Scale each column of the training data to $(0,1)$ (or $\pm1$).
+ Store the scaling function and apply it to the test set
and all future data when making predictions.
+ This also applies to weights in the network.
+ Weights should be balanced between layers.
+ Batch Normalisation.
+ Too many normalisation and regularisation techniques to learn
all before starting.
+ New techniques keep emerging.
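
The scaling procedure for input data described above can be sketched as a simple min-max scaler. This is an illustration in plain Python with hypothetical function names and data, not the interface of any particular library. The key point is that the ranges are fitted on the training data only and then reused unchanged for the test set.

```python
def fit_min_max(rows):
    """Record (min, max) for each column of the training data."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_min_max(rows, ranges):
    """Scale each column to (0, 1) using the stored training ranges."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]

# Hypothetical data with three columns of very different ranges.
train = [
    [-1.0, 2e-6, 1e4],
    [0.5, 8e-6, 9e4],
    [1.0, 5e-6, 5e4],
]
test = [[0.0, 6e-6, 1e5]]

ranges = fit_min_max(train)                 # store the scaling function
scaled_train = apply_min_max(train, ranges)
scaled_test = apply_min_max(test, ranges)   # same scaling for new data

print(scaled_train)
print(scaled_test)
```

Note that test values can fall slightly outside $(0,1)$ when they exceed the range seen in training, as the last column of the test row does here. That is expected and usually harmless.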