---
title: Machine Learning and Statistics
categories: session
---

+ **Reading** R&N Chapter 19.  (19.3-19.5 only cursory)
+ **Briefing** [MLBriefing]()
+ **Web Lectures** (slides with audio track; please note the audio player at the bottom of the slide)
    - [Machine Learning and Statistics](http://www.hg.schaathun.net/talks/ai/statistics/)
    - [Basic Principles](http://www.hg.schaathun.net/talks/ai/ml/)

# Exercise

We use the [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/)
library for Python.
We can install it using
```sh
pip install -U libsvm-official
```
and import using
```python
from libsvm.svmutil import *
```
There is a [Quick Start Guide](https://github.com/cjlin1/libsvm/blob/master/python/README)
specifically for Python.

## Tutorial

First, we need a dataset.
There are a lot of open datasets available on the net, in various
formats and with various levels of documentation.
To minimise the effort needed to find out how to parse the files
(e.g. CSV files), we will use datasets already
[formatted for libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#ijcnn1).
A typical file may look like this:
```
-1 6:1 11:-0.731854 12:0.173431 13:0.0 14:0.00027 15:0.011684 16:-0.011052 17:0.024452 18:0.008337 19:0.001324 20:0.025544 21:-0.040728 22:-0.00081
-1 7:1 11:-0.731756 12:0.173431 13:0.00027 14:0.011684 15:-0.011052 16:0.024452 17:0.008337 18:0.001324 19:0.025544 20:-0.040728 21:-0.00081 22:-0.00389
```
Each row is an object and each column is a variable.
The first column is the class label $y$, typically $\pm1$.
The other columns form the feature vector $x$.  Note that the format is sparse.
Most of the elements $x_i$ are zero and are usually omitted;
an entry $i:j$ means $x_i=j$, and any element that is not listed is zero.
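
To make the sparse format concrete, here is a minimal sketch of how one line could be parsed by hand.
The function name `parse_libsvm_line` is made up for this illustration;
in the tutorial below, libsvm's own reader does this job for us.

```python
def parse_libsvm_line(line):
    """Split one line of a libsvm-format file into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])          # first column: the class label y
    features = {}
    for token in tokens[1:]:          # remaining columns: index:value pairs
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("-1 6:1 11:-0.731854 12:0.173431")
print(label)                    # the class label
print(features.get(11, 0.0))    # an element that is listed
print(features.get(1, 0.0))     # an element that is not listed, hence zero
```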

For this exercise, let us use the
[diabetes](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/diabetes) dataset.
It is relatively small and has not been preprocessed (yet).

**Step 1** Download the dataset and put it in your working directory.  E.g.
```sh
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/diabetes
```
We will assume that the file is called `diabetes`.


**Step 2** Start python and import the necessary libraries.
We will use numpy/scipy arrays as the data structure, and thus need to load
scipy as well.
```python
import libsvm.svmutil as svm
import scipy
```

**Step 3**  Load the data set.
```python
y, x = svm.svm_read_problem('diabetes',return_scipy=True)
```

**Step 3** One should always have a look at the data to see what we
are working with.
```python
y
x
print(x)
```
Pay some attention to `x`.  What data format is this?

**Step 4**  You are probably more familiar with dense matrices
than sparse matrices.  We can convert the sparse matrix as follows.
```python
xx = x.todense()
print(xx)
```
Note that `xx` has one row per object, that is per line in the input file (768),
and one column per feature (8), as you can check with `xx.shape`.
This is the customary layout when discussing data sets in machine learning,
and it is why we can select objects by slicing rows in the next step.

Also note that you can create your own data sets, directly as numpy arrays,
sparse or dense.
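
As a minimal sketch of what such a home-made data set could look like, the following builds
a tiny dense array and its sparse counterpart.  The values and the names `x_dense`, `x_sparse`
and `y_own` are invented for the illustration, with one object per row as in the slicing used
for training below.

```python
import numpy as np
from scipy import sparse

# Three objects with four features each; the values are made up.
x_dense = np.array([[0.0, 1.5, 0.0, 0.0],
                    [2.0, 0.0, 0.0, 0.3],
                    [0.0, 0.0, 0.0, 1.0]])
y_own = np.array([1.0, -1.0, 1.0])      # one class label per object

# The same data in compressed sparse row format, storing only the non-zeros.
x_sparse = sparse.csr_matrix(x_dense)

print(x_sparse.shape)                   # (number of rows, number of columns)
print(x_sparse.nnz)                     # number of stored non-zero elements
print(np.array_equal(x_sparse.toarray(), x_dense))
```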

**Step 5** Create a model for your data set.
```python
m = svm.svm_train(y[:200], x[:200, :], '-c 1')
```
This call uses only the first 200 objects for training, as you see from the
slicing.  This is important, because we need to reserve some data for 
*independent* testing.  The parameter `'-c 1'` is redundant in this case;
it sets $C=1$ which happens to be the default, but it is important to try
different parameter values to tune the SVM for the particular problem.
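
Taking the first 200 objects is the simplest possible split.  If the file happened to be sorted
in some way, a shuffled split might be safer.  The following is only a sketch under that
assumption; `split_indices` is a hypothetical helper, not part of libsvm.

```python
import numpy as np

def split_indices(n_objects, n_train, seed=0):
    """Return disjoint, randomly ordered index arrays for training and testing."""
    rng = np.random.default_rng(seed)       # fixed seed makes the split repeatable
    order = rng.permutation(n_objects)      # random ordering of 0..n_objects-1
    return order[:n_train], order[n_train:]

train_idx, test_idx = split_indices(768, 200)
print(len(train_idx), len(test_idx))
```

The index arrays could then take the place of the slices, for example `y[train_idx]` and
`x[train_idx, :]`, assuming the scipy arrays loaded in Step 3.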

**Step 6** Test the model.
```python
p_label, p_acc, p_val = svm.svm_predict(y[200:], x[200:, :], m)
```
Note that we exclude the first 200 objects which we used for training.
Let's look at the result, first the predicted class labels.
```python
print(p_label)
```
This is odd.  The model predicts class $1$ in all cases.  This is hardly useful.
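
A quick way to confirm such a suspicion is to count how often each label occurs.
The sketch below uses made-up label lists; in the tutorial, `p_label` and `y[200:]`
would take the place of `predicted` and `true`.

```python
from collections import Counter

# Made-up labels for illustration only.
predicted = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
true = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]

print(Counter(predicted))   # a single key means that only one class is predicted
print(Counter(true))        # the true labels contain both classes
```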

**Step 7** Scaling.  Lack of scaling is a common first mistake in machine learning.
If we look at the `x` array, we will see that feature 6 has a maximum value of $2.42$,
while feature 4 has a maximum of $846$ (numbering the columns from 0).  Without scaling,
feature 6 will therefore have very little influence on the result.
It is customary to scale each feature to the range $[0,1]$.
```python
scaleparam = svm.csr_find_scale_param(x,lower=0)
xscale = svm.csr_scale(x,scaleparam)
```
Note that we first calculate the scaling parameters and then apply the scaling to the
dataset.  This is important, because we can store the scaling parameters and apply them to
new data for which we want to make predictions.  The SVM model `m` is only valid for
data scaled with the same scaling parameters.
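
The idea of separating the two operations can be sketched with plain numpy as follows.
This is not libsvm's implementation; the helpers `find_scale_param` and `apply_scale`
and the small arrays are made up to illustrate scaling to $[0,1]$.

```python
import numpy as np

def find_scale_param(x):
    """Column-wise minimum and range, computed from the data x."""
    lo = x.min(axis=0)
    span = x.max(axis=0) - lo
    span[span == 0] = 1.0          # avoid division by zero for constant columns
    return lo, span

def apply_scale(x, param):
    """Scale x using previously computed parameters."""
    lo, span = param
    return (x - lo) / span

x_old = np.array([[0.0, 100.0], [1.0, 500.0], [2.0, 900.0]])
x_new = np.array([[1.0, 300.0]])   # new data arriving later

param = find_scale_param(x_old)    # computed once, from the old data only
print(apply_scale(x_old, param))
print(apply_scale(x_new, param))   # same parameters reused on the new data
```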

**Step 8** Try again.
```python
m = svm.svm_train(y[:200], xscale[:200, :], '-c 1')
p_label, p_acc, p_val = svm.svm_predict(y[200:], xscale[200:, :], m)
print(p_label)
```
This looks a little better.

**Step 8** Evaluation.  Let us compare predicted and true labels.
```python
p_label == y[200:]
```
Lots of both good and bad guesses.  Let's count them.
```python
(p_label == y[200:]).sum()
(p_label == y[200:]).sum()/len(p_label)
```
Right, we have 73.2% good guesses.
This number, called the *accuracy*, is also calculated automatically by
the prediction function.
```python
print(p_acc)
```
The first number is the accuracy and the second is the mean-squared error.
The third number is the squared correlation coefficient, but this applies
mainly to regression.  For classification problems, let's focus on accuracy.
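
To see what the first two numbers mean, they can be recomputed by hand.
The sketch below uses made-up labels and assumes that the accuracy is reported as a percentage.

```python
import numpy as np

# Made-up labels for illustration only.
true = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
predicted = np.array([1.0, 1.0, 1.0, -1.0, -1.0])

accuracy = 100.0 * np.mean(predicted == true)   # percentage of correct guesses
mse = np.mean((predicted - true) ** 2)          # mean-squared error

print(accuracy)
print(mse)
```

With labels $\pm1$, every wrong guess contributes $(\pm2)^2=4$ to the sum, so the
mean-squared error is just four times the error rate and tells us nothing new.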

**Step 9** Tuning parameters.
```python
m = svm.svm_train(y[:200], xscale[:200, :], '-c 3')
p_label, p_acc, p_val = svm.svm_predict(y[200:], xscale[200:, :], m)
print(p_acc)
```
We can also change the kernel.  Here `-t 3` selects the sigmoid kernel.
```python
m = svm.svm_train(y[:200], xscale[:200, :], '-c 3 -t 3')
p_label, p_acc, p_val = svm.svm_predict(y[200:], xscale[200:, :], m)
print(p_acc)
```
There are more parameters in the [documentation](https://www.csie.ntu.edu.tw/~cjlin/libsvm/)
as well as a very useful [guide for beginners](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
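
When trying many parameter combinations, it may help to generate the option strings
systematically.  The following is a minimal sketch; `option_grid` is a hypothetical helper
and the candidate values are arbitrary.

```python
from itertools import product

def option_grid(costs, kernels):
    """Return one libsvm option string per (C, kernel type) combination."""
    return ["-c {} -t {}".format(c, t) for c, t in product(costs, kernels)]

# Kernel types: 0 is linear, 2 is the radial basis function, 3 is sigmoid.
options = option_grid(costs=[0.1, 1, 10], kernels=[0, 2, 3])
for opt in options:
    print(opt)
```

Each string can be passed to `svm.svm_train` in place of `'-c 3 -t 3'` above,
and the resulting accuracies compared.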

## Further Exercises

Find a couple of other datasets, either from the 
[collection for libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/)
or from another source, such as [UCI](http://archive.ics.uci.edu/ml/index.php).

1.  Can you get good classification results?
2.  How much do you need to vary the parameters between data sets?
3.  Do you get better or worse results than your classmates?

# Debriefing

+ Experience?
+ False positives and false negatives
+ Interpreting the standard deviation
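
As a starting point for the discussion, false positives and false negatives can be counted
from the true and predicted labels.  The sketch below uses made-up labels and a hypothetical
helper `confusion_counts`.  The standard deviation shown is one possible reading of the last
point: it assumes that the test objects are independent, so that the number of correct guesses
is binomially distributed.

```python
import math

def confusion_counts(true, predicted, positive=1.0):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(true, predicted):
        if p == positive:
            if t == positive:
                tp += 1        # predicted positive, truly positive
            else:
                fp += 1        # predicted positive, truly negative
        else:
            if t == positive:
                fn += 1        # predicted negative, truly positive
            else:
                tn += 1        # predicted negative, truly negative
    return tp, fp, tn, fn

# Made-up labels for illustration only.
true = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
predicted = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

tp, fp, tn, fn = confusion_counts(true, predicted)
n = len(true)
accuracy = (tp + tn) / n
std = math.sqrt(accuracy * (1 - accuracy) / n)   # binomial standard deviation

print(tp, fp, tn, fn)
print(accuracy, std)
```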