---
title: Evaluation and Statistics
---

**Reading** [Szeliski (2022): Computer Vision: Algorithms and Applications, 2nd ed.](https://szeliski.org/Book/) Chapter 5.1-5.2

# Exercises

## Exercise 1 (Loss over epochs)

Review the machine learning systems you studied last week. If you have not already done it, make a plot which shows the total loss as a function of the number of epochs.

For instance, you can start with empty lists, and for each epoch you train once and test once, recording the total loss, like this:

```python
trainloss = []
trainloss2 = []
testloss = []
for epoch in range(12):  # loop over the dataset multiple times
    tloss = trainmodel(net, trainloader)
    trainloss.append(tloss)
    trainloss2.append(evalmodel(net, trainloader))
    testloss.append(evalmodel(net, testloader))
x = list(range(len(testloss)))
plt.plot(x, trainloss, "b", x, trainloss2, "k", x, testloss, "r")
plt.savefig("plot.svg")
```

**Tip** You may want to run through the exercise first with only two epochs (edit the 12 in the code above), for debugging purposes. You can increase the number of epochs when you have a working setup.

The functions `trainmodel()` and `evalmodel()` are defined in [NetForStatistics.py](Python/NetForStatistics.py). Obviously, you need to initialise the network and the datasets before you run the code above, again using functions from `NetForStatistics`, like this:

```python
(trainloader, testloader) = getDataset()
net = Net()
```

You may want to tweak the device (you can use `net = Net("cpu")` with the given code) and the number of epochs. You may also use the code differently and show the plot interactively instead of writing the file `plot.svg`.

### Questions for reflection

1. Which curve is which in the plot (`plot.svg`)? (Read the code to know.)
2. How does the loss behave differently on the training and test sets?
3. What is the difference between the two training set curves, `trainloss` and `trainloss2`?

## Exercise 2 (Error probability estimation)

Review again your machine learning system as in Exercise 1, but now we want to estimate the error probability instead of calculating loss. Consider the following function to calculate the error rate:

```python
def errorrate(net, dataloader):
    tcount = 0   # number of images tested
    terror = 0   # number of misclassified images
    net.eval()
    for i, data in enumerate(dataloader, 0):
        inputs, labels = (data[0].to(net.device), data[1].to(net.device))
        outputs = net(inputs)
        _, pred = torch.max(outputs.data, 1)  # most probable class per image
        error = (labels != pred).sum()
        print(labels, pred)  # debug output; remove once it works
        tcount += len(labels)
        terror += error
    return terror / tcount
```

### Code review

1. What does `errorrate()` do? Compare it to `evalmodel()` from Exercise 1. What is similar and what is different?
2. Note the line `_, pred = torch.max(outputs.data, 1)`. What does it do? What is `outputs` supposed to look like?
3. What quantity is calculated by `return terror / tcount`?

### Statistical Estimation?

Calculate the error rate on the test set, e.g.

```python
rE = errorrate(net, testloader)
```

This assumes that you have trained your model `net` as in Exercise 1. Now `rE`$\,=r_E$ is an observation of a stochastic variable with mean $\mu$ and standard deviation $\sigma$, where $\mu$ is equal to the probability of error when `net` is applied to a random image.

**Tip** `torch` makes the calculations on tensors. To use the value in python, we need to convert it to a scalar. When the tensor `t` has only a single element, this can be done with `t.item()`. ([My source](https://discuss.pytorch.org/t/get-value-out-of-torch-cuda-float-tensor/2539/7), but I tested it.)

We can estimate $\sigma$ as follows
$$\hat\sigma = \sqrt{ \frac{r_E(1-r_E)}{N} }$$
where $N$ is the number of images tested. Use python to calculate $\hat\sigma$ for your observed error rate `rE`. Note that you need to tweak the function to return `tcount` as well as `rE` so that you have a value for $N$.
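For instance, a minimal sketch, assuming you have tweaked `errorrate()` to return the pair `(terror/tcount, tcount)`:

```python
import math

rE, N = errorrate(net, testloader)  # assumes the tweaked two-value return
rE = rE.item()                      # single-element tensor -> python float

sigma_hat = math.sqrt(rE * (1 - rE) / N)
print(f"error rate {rE:.4f}, estimated standard deviation {sigma_hat:.4f}")
```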
### Reflection

4. Seeing your estimate $r_E$ for the error probability $\mu$ in relation to the estimated standard deviation $\hat\sigma$, what do you think of the performance?

### Smaller datasets

What was your value of $N$? Let's see what happens with smaller datasets. This is a crude hack, but it should work. Insert a line at the end of the `for` loop in the `errorrate()` function, like this:

```python
if tcount >= 1000:
    break
```

Recalculate the error rate and estimate the standard deviation, as you did above, with $N=1000$ and $N=100$ (changing the number in the break line).

### Reflection

5. Judge the error estimate for the two smaller tests. Would $N=100$ or $N=1000$ suffice for a confident assessment?
6. How does the test set size $N$ affect the confidence in the results?

### At the end

If you have the hardware for it, you may try with more epochs, or fewer if you haven't.

## Exercise 3

Review the machine learning systems you studied last week. Calculate the confusion matrix for each of the systems.

The confusion matrix is simply a matrix of values $c_{i,j}$, where each $c_{i,j}$ is the number of test images predicted to be in class $j$ while truly belonging to class $i$.

It is not an insurmountable task to tweak the `errorrate()` function to calculate all the $c_{i,j}$ values directly, but the following approach, sketched below, makes it a little simpler:

1. Tweak the `errorrate()` function to return a list of true labels and a list of predicted labels, simply concatenating all the `labels` (resp. `pred`) lists together.
2. Use the returned lists together with `MulticlassConfusionMatrix` from [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/classification/confusion_matrix.html#torchmetrics.classification.MulticlassConfusionMatrix).
3. Find a way to display/visualise the confusion matrix that you are comfortable with.
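A minimal sketch of steps 1-3, assuming `errorrate()` has been tweaked to return the concatenated true and predicted labels (in that order) as tensors:

```python
import matplotlib.pyplot as plt
from torchmetrics.classification import MulticlassConfusionMatrix

# Assumes the tweaked errorrate() returns (true labels, predicted labels).
labels, preds = errorrate(net, testloader)
labels, preds = labels.cpu(), preds.cpu()  # the metric state lives on the CPU here

metric = MulticlassConfusionMatrix(num_classes=10)  # cifar10 has ten classes
cm = metric(preds, labels)  # cm[i,j]: true class i predicted as class j

# One way to visualise: a simple heat map.
plt.imshow(cm, cmap="Blues")
plt.xlabel("predicted class")
plt.ylabel("true class")
plt.colorbar()
plt.savefig("confusion.svg")
```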
### Questions for Reflection

1. Are the errors reasonably balanced?
2. Are any classes particularly difficult to detect?
3. Are there good reasons why some classes are harder to predict than others?

## Exercise 4

Each of the entries $c_{i,j}$ in the confusion matrix is an error rate, like the ones we worked with previously.

1. How many samples ($N$) are used to compute one $c_{i,j}$?
2. Pick one large and one small $c_{i,j}$, estimate their standard deviations, and compare the two. What do you see?
3. In light of the confusion matrix, what do you think about the performance of the machine learning model?
4. How large ought the test set be to make a confident assessment?

## Exercise 5

The exercises above are based on the [cifar10 tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html), using the same dataset, the same network architecture, and the same basic routines.

+ Refactor your own code so that *you* have a code base which allows you to quickly test CNN classification on different datasets using different networks.
+ Test your code with different datasets and network architectures that you find in different tutorials and other sources on the net.

# Briefing

## Recap

+ *What did we learn last week?*
+ *What should we learn today?*

1. Supervised learning
2. Loss function -- Cross-Entropy
   + $E(\mathbf{w}) = -\sum_n \log p_{n,t_n}$
   + $p_{n,k}$ is the network's estimated probability that object $n$ has the class $k$
   + $t_n$ is the true class of object $n$

## Regression problem - Gravitational Lensing

1. Lensing model
   + Source - size $\sigma$ and position $(x,y)$
   + Lens - Einstein radius $R_E$
   + Distorted Image
2. Four parameters determine the distorted image.
   + Can we recover these four parameters from the image?
3. Instead of a discrete class, we want the network to predict $\sigma,x,y,R_E$
4. Loss function is Mean Squared Error
   + $\mathsf{MSE} = \frac14\left[(\sigma-\hat\sigma)^2+(x-\hat x)^2 +(y - \hat y)^2+(R_E-\hat R_E)^2\right]$
   + Note that the normalisation can vary.
   + The starting point is the sum of squared errors (SSE).
   + We may or may not normalise by dividing by the number of data points and/or prediction parameters.

## Evaluation

+ Confusion Matrix
  + false positive, false negative
+ Accuracy
  $$ \mathsf{Acc} = \frac{ \mathsf{TP} + \mathsf{TN} }{ \mathsf{TP}+\mathsf{FN} + \mathsf{TN}+\mathsf{FP} }$$
  + Warning: biased datasets
+ Other heuristics
  $$ F_1 = 2\cdot\frac{ \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FP}} \cdot \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FN}} }{ \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FP}} + \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FN}} }$$
+ TP/TN/FP/FN are Stochastic Variables
  + Binomially Distributed
+ Regression: absolute or squared error
  + Also a Stochastic Variable (depends on the data drawn randomly from the population)
  + Mean over a large dataset gives a reasonable estimator
  + Standard Deviation can be estimated using the sample standard deviation
+ Important: Each item in the test set makes one experiment/observation. This allows statistical estimation, as in the exercises.

## Overtraining and Undertraining

+ Exercise last week.
  + I have not been able to generate the expected result.
  + The deep networks tested produce impressive results with very little training.
  + Still, an important principle.
+ The training data
  + contain a limited amount of information about the population
  + have some peculiar quirks
+ The network has a certain number of DOF (weights) which can be adjusted to store information extracted from the training set.
+ Undertraining means insufficient training to absorb the relevant information
  + insufficient epochs
  + insufficient training set (relative to DOF)
+ A large network with many DOF can learn a small dataset completely.
+ Overtraining means that the network has learnt more than what generalises
+ Sometimes, regularisation techniques are used to remove (zero out) DOF with little impact
  + Occam's Razor

## Normalisation

+ Network layers add up coefficients from different sources
  + Large numbers contribute a lot.
  + Numbers with a small range contribute little.
+ Images are simple. One range for all pixels.
+ Some datasets combine data with different ranges.
  + If some have range $\pm1$, some have range $(0,10^{-5})$ and others $(0,10^5)$, then the small numbers are negligible and are effectively ignored.
+ Scaling is standard procedure (sketched after this section).
  + Scale each column of the training data to $(0,1)$ (or $\pm1$).
  + Store the scaling function and apply it to the test set and all future data when making predictions.
+ This also applies to weights in the network.
  + Weights should be balanced between layers.
  + Batch Normalisation.
+ There are too many normalisation and regularisation techniques to learn them all before starting.
  + New techniques keep emerging.
  + Gain some experience and return to extend your repertoire.
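As an illustration of the scaling procedure above, here is a minimal sketch of column-wise min-max scaling in `torch`; the data are made up, and the function names are placeholders:

```python
import torch

def fit_minmax(train):
    """Learn per-column scaling parameters from the training data only."""
    lo = train.min(dim=0).values
    span = train.max(dim=0).values - lo
    return lo, span

def apply_minmax(data, lo, span):
    """Scale each column to (0,1) using the stored parameters."""
    return (data - lo) / span

# Made-up tabular data: one row per sample, columns with very different ranges.
train = torch.tensor([[1e-5, 200.0], [5e-6, 900.0], [8e-6, 400.0]])
test = torch.tensor([[7e-6, 300.0]])

lo, span = fit_minmax(train)                  # fit on the training set only
train_scaled = apply_minmax(train, lo, span)
test_scaled = apply_minmax(test, lo, span)    # reuse the same scaling on new data
```

Note that the test set is scaled with the parameters fitted on the training set; fitting the scaling on the test set would leak information into the evaluation.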
# Debrief

+ Demos of Python modules and scripts
  + module with script [plotting loss over epochs](Python/cifar10.py) depending on [NetForStatistics.py](Python/NetForStatistics.py).
  + module with script [estimating error probability](Python/cifar10errorrate.py)
  + module with script [calculating confusion matrix](Python/cnn.py) (only tested 2022)