Evaluation and Statistics

Reading: Szeliski (2022), Computer Vision: Algorithms and Applications, 2nd ed., Chapters 5.1-5.2

Exercises

Exercise 1 (Loss over epochs)

Review the machine learning systems you studied last week. If you have not already done it, make a plot which shows the total loss as a function of the number of epochs.

For instance, you can start with empty lists, and for each epoch you train once and test once, recording the total loss, like this.

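import matplotlib.pyplot as plt

trainloss = []
trainloss2 = []
testloss = []

for epoch in range(12):  # loop over the dataset multiple times

    # train for one epoch and record the loss returned by the training pass
    tloss = trainmodel(net, trainloader)
    trainloss.append(tloss)

    # evaluate the model, without training, on both datasets
    trainloss2.append(evalmodel(net, trainloader))
    testloss.append(evalmodel(net, testloader))

# one point per epoch on the x axis
x = list(range(len(testloss)))
plt.plot(x, trainloss, "b", x, trainloss2, "k", x, testloss, "r")
plt.savefig("plot.svg")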

Tip: You may want to run through the exercise first with only two epochs (edit the 12 in the code above), for debugging purposes. You can increase the number of epochs when you have a working setup.

The functions trainmodel() and evalmodel() are defined in NetForStatistics.py. You need to initialise the network and the datasets before you run the code above, again using functions from NetForStatistics, like this:

(trainloader,testloader) = getDataset()
net = Net()

You may want to tweak the device (you can use net = Net("cpu") with the given code) and the number of epochs. You may also use the code differently and show the plot interactively instead of writing the file plot.svg.

Questions for reflection

Which curve is which in the plot (plot.svg)? (Read the code to know.)
How does the loss behave differently on the training and test sets?
What is the difference between the two training set curves, trainloss and trainloss2?

Exercise 2 (Error probability estimation)

Review again your machine learning system as in Exercise 1, but now we want to estimate the error probability instead of calculating loss. Consider the following function to calculate the error rate:

def errorrate(net,dataloader):
    tcount = 0
    terror = 0
    net.eval()
    for i, data in enumerate(dataloader, 0):

        inputs, labels = (data[0].to(net.device), data[1].to(net.device))

        outputs = net(inputs)
        _, pred = torch.max(outputs.data,1)
        error = (labels != pred).sum()
        print(labels,pred)
        tcount += len(labels)
        terror += error
    return (terror/tcount)

Code review

What does errorrate() do? Compare it to evalmodel() from Exercise 1. What is similar and what is different?
Note the line _, pred = torch.max(outputs.data,1). What does it do? What is outputs supposed to look like?
What quantity is calculated by return(terror/tcount)?

Statistical Estimation

Calculate the errorrate on the test set, e.g.

rE = errorrate(net,testloader)

This assumes that you have trained your model net as in Exercise 1.

Now rE\(=r_E\) is an observation of a stochastic variable with mean \(\mu\) and standard deviation \(\sigma\), where \(\mu = p_E\) is the probability of error when net is applied to a random image.

Tip: torch performs its calculations on tensors. To use a value in plain python, we need to convert it to a scalar. When the tensor t has only a single element, this can be done with t.item(). (My source, but I tested it.)

We can estimate \(\sigma\) as follows \[\hat\sigma = \sqrt{ \frac{r_E(1-r_E)}{N} }\] where \(N\) is the number of images tested.

Use python to calculate \(\hat\sigma\) for your observed error rate rE. Note that you need to tweak the function to return tcount as well as rE so that you have a value for \(N\).
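The calculation can be sketched in plain python as follows. The function name sigma_hat and the example values (an error rate of 0.4 observed on 10000 images) are hypothetical and chosen only for illustration. Substitute your own rE (converted with .item() if it is a tensor) and your own \(N\).

```python
import math

def sigma_hat(rE, N):
    """Estimate the standard deviation of an error rate rE observed on N images."""
    return math.sqrt(rE * (1 - rE) / N)

# Hypothetical example values, not results from the actual network.
rE = 0.4
N = 10000
s = sigma_hat(rE, N)
print(f"rE = {rE:.4f}, sigma_hat = {s:.4f}")
```

Since the function takes \(N\) as an argument, the same sketch can be reused for tests of other sizes.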

Reflection

Seeing your estimate \(r_E\) for \(p_E\) in relation to the estimated standard deviation \(\hat\sigma\), what do you think of the performance?

Smaller datasets

What was your value of \(N\)? Let’s see what happens with smaller datasets. This is a crude hack, but it should work. Insert a line at the end of the for loop in the errorrate() function, like this

if tcount >= 1000: break

Recalculate the error rate and estimate the standard deviation, as you did above, with \(N=1000\) and \(N=100\) (changing the number in the break line).

Reflection

Judge the error estimate for the two smaller tests. Would \(N=100\) or \(N=1000\) suffice for a confident assessment?
How does the test set size \(N\) affect the confidence in the results?

At the end

If you have the hardware for it, you may try with more epochs, or fewer if you haven’t.

Exercise 3

Review the machine learning systems you studied last week. Calculate the confusion matrix for each of the systems.

The Confusion Matrix is simply a matrix of values \(c_{i,j}\), where each \(c_{i,j}\) is the number of test images predicted to be in class \(j\) while truly belonging to class \(i\).
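To make the definition concrete, here is a minimal sketch in plain python which counts the entries directly from two lists of labels. The function name confusion_matrix and the toy labels are hypothetical; this is only an illustration of the definition, not the routine the exercise asks you to use.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """Return a num_classes x num_classes list of lists, where c[i][j]
    counts the samples of true class i that were predicted as class j."""
    c = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        c[t][p] += 1
    return c

# Hypothetical toy data with three classes (0, 1, 2).
true_labels = [0, 0, 1, 1, 2, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(true_labels, pred_labels, 3)
for row in cm:
    print(row)
```

The diagonal entries count the correct predictions, and every off-diagonal entry counts one particular kind of error.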

It is not an insurmountable task to tweak the errorrate() function to calculate all the \(c_{i,j}\) values, but to make it a little simpler, we can use a library function instead:

Tweak the errorrate() function to return a list of true labels and a list of predicted labels, simply concatenating all the pred (resp. labels) lists together.
Use the returned lists together with MulticlassConfusionMatrix from torchmetrics.
Find a way to display/visualise the confusion matrix that you are comfortable with.

Questions for Reflection

Are the errors reasonably balanced?
Are any classes particularly difficult to detect?
Are there good reasons why some classes are harder to predict than others?

Exercise 4

Each of the entries \(c_{i,j}\) in the confusion matrix is a count which, divided by the number of samples it is counted over, gives a rate like the error rates we worked with previously.

How many samples (\(N\)) are used to compute one \(c_{i,j}\)?
Pick one large and one small \(c_{i,j}\) and estimate their standard deviations, and compare the two. What do you see?
In light of the confusion matrix, what do you think about the performance of the machine learning model?
How large ought the test be to make a confident assessment?

Exercise 5

The exercises above are based on the cifar10 tutorial, using the same dataset, the same network architecture, and the same basic routines.

Refactor your own code so that you have a code base which allows you to quickly test classification with CNN on different datasets using different networks.
Test your code with different datasets and network architectures that you find in different tutorials and other sources on the net.

Briefing

Recap

What did we learn last week?
What should we learn today?

Supervised learning
Loss function – Cross-Entropy
- \(E(\mathbf{w}) = -\sum_n \log p_{n,t_n}\)
- \(p_{n,k}\) is the network's estimated probability that object \(n\) has the class \(k\)
- \(t_n\) is the true class of object \(n\)
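
As a small worked example, the cross-entropy can be computed in plain python as the negative sum of the log-probabilities which the network assigns to the true classes. The function name cross_entropy and the probabilities below are hypothetical, made up for illustration.

```python
import math

def cross_entropy(probs, targets):
    """Negative sum of the log-probabilities assigned to the true classes.
    probs[n][k] is the estimated probability that object n has class k,
    and targets[n] is the true class of object n."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets))

# Hypothetical network outputs for two objects and three classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
targets = [0, 1]

loss = cross_entropy(probs, targets)
print(loss)
```

The loss is zero only when every true class gets probability one, and it grows without bound as the probability of a true class approaches zero. Note that library implementations may normalise differently; torch.nn.CrossEntropyLoss, for instance, averages over the batch by default.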

Regression problem - Gravitational Lensing

Lensing model
- Source - size \(\sigma\) and position \((x,y)\)
- Lens - Einstein radius \(R_E\)
- Distorted Image
Four parameters determine the distorted image.
- Can we recover these four parameters from the image?
Instead of a discrete class, we want the network to predict \(\sigma,x,y,R_E\)
Loss function is Mean Squared Error
- \(\mathsf{MSE} = \frac14\left[(\sigma-\hat\sigma)^2+(x-\hat x)^2 +(y - \hat y)^2+(R_E-\hat R_E)^2\right]\)
- Note, normalisation can vary.
  - The starting point is the sum of squared errors (SSE).
  - We may or may not normalise by dividing by the number of data points and/or prediction parameters.
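
A minimal sketch of the MSE for a single image, normalised by the number of prediction parameters. The function name mse and the parameter values are hypothetical.

```python
def mse(true_params, pred_params):
    """Mean of the squared errors over the prediction parameters."""
    squared = [(t - p) ** 2 for t, p in zip(true_params, pred_params)]
    return sum(squared) / len(squared)

# Hypothetical values for (sigma, x, y, R_E) and a made-up prediction.
truth = (2.0, 10.0, -5.0, 20.0)
prediction = (3.0, 12.0, -5.0, 18.0)

sse = sum((t - p) ** 2 for t, p in zip(truth, prediction))
print("SSE:", sse)                      # sum of squared errors
print("MSE:", mse(truth, prediction))   # SSE divided by four parameters
```

Dropping the division gives the SSE. The two differ only by a constant factor, so they have the same minimum.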

Evaluation

Confusion Matrix
- false positive, false negative
Accuracy

\[ \mathsf{Acc} = \frac{ \mathsf{TP} + \mathsf{TN} }{ \mathsf{TP}+\mathsf{FN} + \mathsf{TN}+\mathsf{FP} }\]
- Warning. Biased datasets
- Other heuristics

\[ F_1 = 2\cdot\frac{ \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FP}} \cdot \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FN}} }{ \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FP}} + \frac{\mathsf{TP}}{\mathsf{TP}+\mathsf{FN}} }\]
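
The warning about biased datasets can be illustrated with a small sketch. Here \(F_1\) is computed as the harmonic mean of precision and recall, \(2PR/(P+R)\). The function names and the counts are hypothetical: a made-up test set of 1000 items with only 10 positives, and a classifier which finds just two of them. The sketch assumes at least one true positive, so that no denominator is zero.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all test items which are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts on a heavily biased test set.
tp, fn = 2, 8      # the 10 positives
fp, tn = 1, 989    # the 990 negatives

print("accuracy:", accuracy(tp, tn, fp, fn))
print("F1:", f1_score(tp, fp, fn))
```

The accuracy is above 99% even though eight out of ten positives are missed, while \(F_1\) stays low because it ignores the true negatives.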

TP/TN/FP/FN are Stochastic Variables
- Binomially Distributed
Regression: absolute or squared error
- Also a Stochastic Variable (depends on the data drawn randomly from the population)
- Mean over a large dataset gives a reasonable estimator
- Standard Deviation can be estimated using the sample standard deviation
Important: Each item in the test set makes one experiment/observation. This allows statistical
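
For the regression case, the estimators mentioned above can be sketched with the statistics module from the python standard library. The list of errors is hypothetical, with one absolute error per test item. The standard error of the mean in the last step is an addition for illustration; it indicates how much the mean itself would vary between test sets of this size.

```python
import math
import statistics

# Hypothetical absolute errors, one observation per test item.
errors = [0.5, 1.5, 1.0, 2.0, 1.0, 0.0, 1.5, 0.5]

mean_error = statistics.mean(errors)       # estimator of the expected error
std_error = statistics.stdev(errors)       # sample standard deviation
sem = std_error / math.sqrt(len(errors))   # standard error of the mean

print(f"mean = {mean_error:.3f}, std = {std_error:.3f}, sem = {sem:.3f}")
```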

Overtraining and Undertraining

Exercise last week.
- I have not been able to generate the expected result.
- The deep networks tested produce impressive results with very little training.
Still, important principle.
The training data
- contain a limited amount of information about the population
- have some peculiar quirks
The network has a certain number of DOF (weights) which can be adjusted to store information extracted from the training set.
Undertraining means insufficient training to absorb the relevant information
- insufficient epochs
- insufficient training set (relative to DOF)
A large network with many DOF can learn a small dataset completely.
- Overtraining means that the network has learnt more than what generalises
Sometimes, regularisation techniques are used to remove (zero out) DOF with little impact
- Occam’s Razor

Normalisation

Network layers add up coefficients from different sources
- Large numbers contribute a lot.
- Numbers with a small range contribute little.
Images are simple. One range for all pixels.
Some datasets combine data with different ranges.
- If some have range \(\pm1\), some have range \((0,10^{-5})\) and others \((0,10^5)\), then small numbers are negligible and are effectively ignored.
Scaling is standard procedure.
- Scale each column of the training data to \((0,1)\) (or \(\pm1\)).
- Store the scaling function and apply it to the test set and all future data when making predictions.
This also applies to weights in the network.
- Weights should be balanced between layers.
- Batch Normalisation.
Too many normalisation and regularisation techniques to learn all before starting.
- New techniques keep emerging.
- Gain some experience and return to extend your repertoire.
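
The scaling procedure described above can be sketched in plain python as follows. The function names fit_scaler and apply_scaler and the data are hypothetical; libraries offer ready-made versions of the same idea. The sketch assumes that no column is constant in the training data, so that the range is never zero.

```python
def fit_scaler(train_rows):
    """Record the minimum and maximum of each column in the training data."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]

def apply_scaler(scaler, rows):
    """Scale each column using the stored training minimum and maximum."""
    return [[(v - lo) / (hi - lo) for v, (lo, hi) in zip(row, scaler)]
            for row in rows]

# Hypothetical data: three columns with very different ranges.
train = [[-1.0, 0.0, 2e4],
         [0.0, 4e-6, 6e4],
         [1.0, 1e-5, 1e5]]
test = [[0.5, 5e-6, 5e4]]

scaler = fit_scaler(train)                   # fitted on the training data only
train_scaled = apply_scaler(scaler, train)
test_scaled = apply_scaler(scaler, test)     # same scaling reused for the test set
print(test_scaled)
```

Because the scaler is fitted on the training data only, scaled test values may fall slightly outside \((0,1)\).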

Debrief

Demos of python modules and scripts
- module with script plotting loss over epochs depending on NetForStatistics.py.
- module with script estimating error probability
- module with script calculating confusion matrix (only tested 2022)