# Evaluation and Statistics

**Reading** Szeliski (2022): Computer Vision: Algorithms and Applications, 2nd ed. Chapter 5.1-5.2

# Exercises

## Exercise 1 (Loss over epochs)

Review the machine learning systems you studied last week. If you have not already done it, make a plot which shows the total loss as a function of the number of epochs.

For instance, you can start with empty lists, and for each epoch you train once and test once, recording the total loss, like this.

```
import matplotlib.pyplot as plt

# One list per curve; each gets one value per epoch
trainloss = []
trainloss2 = []
testloss = []
for epoch in range(12):  # loop over the dataset multiple times
    tloss = trainmodel(net, trainloader)
    trainloss.append(tloss)
    trainloss2.append(evalmodel(net, trainloader))
    testloss.append(evalmodel(net, testloader))
x = list(range(len(testloss)))  # epoch numbers for the x axis
plt.plot(x, trainloss, "b", x, trainloss2, "k", x, testloss, "r")
plt.savefig("plot.svg")
```

**Tip** You may want to run through the exercise first with only two epochs (edit the 12 in the code above), for debugging purposes. You can increase the number of epochs when you have a working setup.

The functions `trainmodel()` and `evalmodel()` are defined in NetForStatistics.py. Obviously, you need to initialise the network and the datasets before you run the code above, again using functions from `NetForStatistics`.

You may want to tweak the device (you can use `net = Net("cpu")` with the given code) and the number of epochs. You may also use the code differently and show the plot interactively instead of writing the file `plot.svg`.

### Questions for reflection

- Which curve is which in the plot (`plot.svg`)? (Read the code to know.)
- How does the loss behave differently on the training and test sets?
- What is the difference between the two training set curves, `trainloss` and `trainloss2`?

## Exercise 2 (Error probability estimation)

Review again your machine learning system as in Exercise 1, but now we want to estimate the error probability instead of calculating loss. Consider the following function to calculate the error rate:

```
import torch

def errorrate(net, dataloader):
    tcount = 0  # running count of images seen
    terror = 0  # running count of errors
    net.eval()
    for i, data in enumerate(dataloader, 0):
        inputs, labels = (data[0].to(net.device), data[1].to(net.device))
        outputs = net(inputs)
        _, pred = torch.max(outputs.data, 1)
        error = (labels != pred).sum()
        print(labels, pred)
        tcount += len(labels)
        terror += error
    return (terror/tcount)
```

### Code review

- What does `errorrate()` do? Compare it to `evalmodel()` from Exercise 1. What is similar and what is different?
- Note the line `_, pred = torch.max(outputs.data, 1)`. What does it do? What is `outputs` supposed to look like?
- What quantity is calculated by `return (terror/tcount)`?

### Statistical Estimation?

Calculate the error rate on the test set and store it in a variable `rE`, e.g. by calling `errorrate()` with `net` and `testloader`.

This assumes that you have trained your model `net` as in Exercise 1.

Now `rE` \(=r_E\) is an observation of a stochastic variable with mean \(\mu\) and standard deviation \(\sigma\), where \(\mu\) is equal to the probability of error when `net` is applied to a random image.

**Tip** `torch` makes the calculations on tensors. To use the values in python, we need to convert them to scalar values. When the tensor `t` has only a single element, this can be done with `t.item()`. (My source, but I tested it.)

We can estimate \(\sigma\) as follows \[\hat\sigma = \sqrt{ \frac{r_E(1-r_E)}{N} }\] where \(N\) is the number of images tested.

Use python to calculate \(\hat\sigma\) for your observed error rate `rE`. Note that you need to tweak the function to return `tcount` as well as `rE` so that you have a value for \(N\).
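As a minimal sketch of this calculation, assume that `rE` has already been converted to a plain python float and that `N` is the number of images tested. The function name `sigma_hat` and the numbers below are made up for illustration; they are not results from the exercise.

```python
import math

def sigma_hat(rE, N):
    """Estimated standard deviation of an error rate rE observed on N samples."""
    return math.sqrt(rE * (1 - rE) / N)

# Hypothetical numbers, only to show the call
rE = 0.2
N = 10000
s = sigma_hat(rE, N)
print(f"error rate {rE:.3f}, estimated standard deviation {s:.4f}")
```

If `rE` is still a single-element tensor, converting it with `rE.item()` first keeps the arithmetic in plain python.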

### Reflection

- Seeing your estimate \(r_E\) of the error probability \(p_E=\mu\) in relation to the estimated standard deviation \(\hat\sigma\), what do you think of the performance?

### Smaller datasets

What was your value of \(N\)? Let’s see what happens with smaller datasets. This is a crude hack, but it should work. Insert a line at the end of the `for` loop in the `errorrate()` function that breaks out of the loop once enough images have been tested.

Recalculate the error rate and estimate the standard deviation, as you did above, with \(N=1000\) and \(N=100\) (changing the number in the break line).
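The following is only a sketch of how such an early exit might look; it is an assumption, not the original snippet. To keep it self-contained, it works on plain python lists of (true labels, predicted labels) per batch instead of a real dataloader, and the names `count_errors` and `max_n` are hypothetical.

```python
def count_errors(batches, max_n=None):
    """Return (error rate, number of samples), stopping once max_n samples are seen."""
    tcount = 0
    terror = 0
    for labels, preds in batches:
        terror += sum(1 for t, p in zip(labels, preds) if t != p)
        tcount += len(labels)
        # Crude early exit: stop after the batch that reaches max_n samples
        if max_n is not None and tcount >= max_n:
            break
    return terror / tcount, tcount

# Toy batches standing in for a dataloader (assumed data)
batches = [([0, 1, 2, 3], [0, 1, 2, 0]), ([1, 1], [1, 0]), ([2, 2], [2, 2])]
rate_all, n_all = count_errors(batches)
rate_small, n_small = count_errors(batches, max_n=5)
print(rate_all, n_all, rate_small, n_small)
```

Because the check happens once per batch, the actual \(N\) can overshoot the limit by up to one batch, so it is safer to use the returned count than the requested limit.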

### Reflection

- Judge the error estimate for the two smaller tests. Would \(N=100\) or \(N=1000\) suffice for a confident assessment?
- How does the test set size \(N\) affect the confidence in the results?

### At the end

If you have the hardware for it, you may try with more epochs, or fewer if you haven’t.

## Exercise 3

Review the machine learning systems you studied last week. Calculate the confusion matrix for each of the systems.

The Confusion Matrix is simply a matrix of values \(c_{i,j}\), where each \(c_{i,j}\) is the number of test images predicted to be in class \(j\) while truly belonging to class \(i\).
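To make the definition concrete, here is a small sketch in plain python that counts the \(c_{i,j}\) directly from two lists of labels. The function name `confusion_matrix` and the toy labels are made up for illustration; this does not use torchmetrics.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """c[i][j] counts samples of true class i that were predicted as class j."""
    c = [[0] * num_classes for _ in range(num_classes)]
    for i, j in zip(true_labels, pred_labels):
        c[i][j] += 1
    return c

# Toy labels for three classes (assumed data)
true_labels = [0, 0, 1, 1, 2, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(true_labels, pred_labels, 3)
for row in cm:
    print(row)
```

Row \(i\) sums to the number of test images that truly belong to class \(i\), and the diagonal holds the correct predictions.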

It is not an insurmountable task to tweak the `errorrate()` function to calculate all the \(c_{i,j}\) values, but to make it a little simpler, you can proceed as follows.

- Tweak the `errorrate()` function to return a list of true labels and a list of predicted labels, simply concatenating all the `pred` (resp. `labels`) lists together.
- Use the returned lists together with `MulticlassConfusionMatrix` from torchmetrics.
- Find a way to display/visualise the confusion matrix that you are comfortable with.

### Questions for Reflection

- Are the errors reasonably balanced?
- Are any classes particularly difficult to detect?
- Are there good reasons why some classes are harder to predict than others?

## Exercise 4

Each of the entries \(c_{i,j}\) in the confusion matrix is an error rate, like the ones we worked with previously.

- How many samples (\(N\)) are used to compute one \(c_{i,j}\)?
- Pick one large and one small \(c_{i,j}\) and estimate their standard deviations, and compare the two. What do you see?
- In light of the confusion matrix, what do you think about the performance of the machine learning model?
- How large ought the test be to make a confident assessment?

## Exercise 5

The exercises above are based on the cifar10 tutorial, using the same dataset, the same network architecture, and the same basic routines.

- Refactor your own code so that *you* have a code base which allows you to quickly test classification with CNNs on different datasets using different networks.
- Test your code with different datasets and network architectures that you find in different tutorials and other sources on the net.

# Briefing

## Recap

*What did we learn last week?* *What should we learn today?*

- Supervised learning
- Loss function – Cross-Entropy
- \(E(\mathbf{w}) = -\sum_n \log p_{n,t_n}\)
- \(p_{n,k}\) is the network's estimated probability that object \(n\) has the class \(k\)
- \(t_n\) is the true class of object \(n\)
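As a small illustration of the cross-entropy loss, using the usual definition with the logarithm of \(p_{n,t_n}\), the sketch below sums the negative log-probability that the network assigns to the true class of each object. The function name `cross_entropy` and the probabilities are made up for illustration.

```python
import math

def cross_entropy(probs, targets):
    """Sum over objects n of -log p[n][t_n], where t_n is the true class."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets))

# Two objects, three classes (assumed probabilities, each row sums to 1)
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
targets = [0, 1]
E = cross_entropy(probs, targets)
print(f"cross-entropy: {E:.4f}")
```

Note that `torch.nn.CrossEntropyLoss` works on unnormalised network outputs rather than probabilities and averages over the batch by default, so its numbers are scaled differently from this sum.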

## Regression problem - Gravitational Lensing

- Lensing model
- Source - size \(\sigma\) and position \((x,y)\)
- Lens - Einstein radius \(R_E\)
- Distorted Image

- Four parameters determine the distorted image.
- Can we recover these four parameters from the image?

- Instead of a discrete class, we want the network to predict \(\sigma,x,y,R_E\)
- Loss function is Mean Squared Error
- \(\mathsf{MSE} = \frac14\left[(\sigma-\hat\sigma)^2+(x-\hat x)^2 +(y - \hat y)^2+(R_E-\hat R_E)^2\right]\)
- Note, normalisation can vary.
- The starting point is the sum of squared errors (SSE).
- We may or may not normalise by dividing by the number of data points and/or prediction parameters.
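A minimal sketch of the squared-error loss for a single image, assuming the four parameters are given in the order \(\sigma, x, y, R_E\) and that we normalise by the number of parameters. The function names and the numbers are made up for illustration.

```python
def sse(true_params, pred_params):
    """Sum of squared errors over the predicted parameters."""
    return sum((t - p) ** 2 for t, p in zip(true_params, pred_params))

def mse(true_params, pred_params):
    """SSE normalised by the number of parameters (one possible convention)."""
    return sse(true_params, pred_params) / len(true_params)

# Hypothetical ground truth and prediction: (sigma, x, y, R_E)
true_params = [1.0, 0.5, -0.5, 2.0]
pred_params = [1.5, 0.5, 0.5, 1.0]
loss = mse(true_params, pred_params)
print(f"SSE = {sse(true_params, pred_params)}, MSE = {loss}")
```

Over a whole dataset one would typically also average over the images, which is the second normalisation choice mentioned above.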

## Evaluation

- Confusion Matrix
- false positive, false negative

- Accuracy

\[ \mathsf{Acc} = \frac{ \mathsf{TP} + \mathsf{TN} }{ \mathsf{TP}+\mathsf{FN} + \mathsf{TN}+\mathsf{FP} }\]

- Warning. Biased datasets
- Other heuristics

\[ F_1 = 2\cdot\frac{ \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FP}} \cdot \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FN}} }{ \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FP}} + \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FN}} }\]

- TP/TN/FP/FN are Stochastic Variables
- Binomially Distributed

- Regression: absolute or squared error
- Also a Stochastic Variable (depends on the data drawn randomly from the population)
- Mean over a large dataset gives a reasonable estimator
- Standard Deviation can be estimated using the sample standard deviation

- Important: Each item in the test set makes one experiment/observation. This allows statistical
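The accuracy and \(F_1\) heuristics can be sketched in a few lines of python, taking \(F_1\) in its standard form as the harmonic mean of precision and recall. The counts below are invented to mimic a biased dataset with few positives, and the function names are only for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (no guard against zero counts)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented counts: 5 positives and 95 negatives in the test set
tp, fn, fp, tn = 1, 4, 1, 94
acc = accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy = {acc:.3f}, F1 = {f1:.3f}")
```

With these counts the accuracy looks high even though most positives are missed, which is the point of the warning about biased datasets.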

## Overtraining and Undertraining

- Exercise last week.
- I have not been able to generate the expected result.
- The deep networks tested produce impressive results with very little training.

- Still, important principle.
- The training data
- contain a limited amount of information about the population
- have some peculiar quirks

- The network has a certain number of DOF (weights) which can be adjusted to store information extracted from the training set.
- Undertraining means insufficient training to absorb the relevant information
- insufficient epochs
- insufficient training set (relative to DOF)

- A large network with many DOF can learn a small dataset completely.
- Overtraining means that the network has learnt more than what generalises

- Sometimes, regularisation techniques are used to remove (zero out) DOF with little impact
- Occam’s Razor

## Normalisation

- Network layers add up coefficients from different sources
- Large numbers contribute a lot.
- Numbers with a small range contribute little.

- Images are simple. One range for all pixels.
- Some datasets combine data with different ranges.
- If some have range \(\pm1\), some have range \((0,10^{-5})\) and others \((0,10^5)\), then small numbers are negligible and are effectively ignored.

- Scaling is standard procedure.
- Scale each column of the training data to \((0,1)\) (or \(\pm1\)).
- Store the scaling function and apply it to the test set and all future data when making predictions.

- This also applies to weights in the network.
- Weights should be balanced between layers.
- Batch Normalisation.

- Too many normalisation and regularisation techniques to learn all before starting.
- New techniques keep emerging.
- Gain some experience and return to extend your repertoire.
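The scaling procedure described above can be sketched as follows for a single column, in plain python. The function names `fit_minmax` and `apply_minmax` and the data are made up for illustration; libraries such as scikit-learn offer ready-made scalers for the same purpose.

```python
def fit_minmax(column):
    """Learn the scaling parameters from the training data only."""
    return min(column), max(column)

def apply_minmax(column, lo, hi):
    """Map values so that the training range [lo, hi] becomes [0, 1]."""
    return [(v - lo) / (hi - lo) for v in column]

# Assumed toy data for one feature column
train = [2.0, 4.0, 6.0, 10.0]
test = [3.0, 12.0]

lo, hi = fit_minmax(train)                  # store these for later use
train_scaled = apply_minmax(train, lo, hi)
test_scaled = apply_minmax(test, lo, hi)    # reuse the stored scaling
print(train_scaled, test_scaled)
```

Test values can fall outside \((0,1)\) when they lie outside the range seen in training; that is expected, since the scaling must not be refitted on the test set.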

# Debrief

- Demos of python modules and scripts
- module with script plotting loss over epochs depending on NetForStatistics.py.
- module with script estimating error probability
- module with script calculating confusion matrix (only tested 2022)