Revision 5145cd983d06b9a3c641938744538adfe1150c85 (click the page title to view the current version)

ML

Changes from 5145cd983d06b9a3c641938744538adfe1150c85 to ec6daeb4ddae48c0e86055cb2fe9190ac1915ddb

---
title: Machine Learning and Statistics
categories: session
---

# Reading

+ R&N Chapter 19.  (19.3-19.5 only cursory)

# Briefing

+ AI and Cybernetic Systems are often described as either model-driven
  or data-driven.
    - what is the difference?
    - examples?
+ Note that there is not necessarily a sharp boundary between the two.
  We may have partial models in a data-driven approach.

## Florence Nightinggale

![The Lady with the Lamp](https://upload.wikimedia.org/wikipedia/commons/thumb/b/ba/Florence_Nightingale._Coloured_lithograph._Wellcome_V0006579.jpg/1200px-Florence_Nightingale._Coloured_lithograph._Wellcome_V0006579.jpg)

+ Nurse in the Crimean War 1853-56
+ First female member of the Royal Statistical Society (1859)
+ A pioneer in using statistics to make politics
+ Key to sanitary reforms in the British Army

> If doctors wash their hands more frequently, there are fewer deaths
  in their ward.

+ This is a simple quantitative problem.  
+ count hand washing events per ward
+ count death events per ward
+ compare the numbers
+ A key contribution of Ms Nightingale's was the visualisation of
  these numbers to make the authorities understand their implication.

> Does this have anything to do with intelligent agents?

Well, we do you use it to infer desireable action ...
to wash hands or not to wash hands, that's the question.

## Modelling it

Today we would say that Ms Nightingale's observation is obvious.
We have a simple *causal* model to say that doctors should wash their
hands. 

1.  Viruses and bacteria in wounds cause infection and death.
2.  Viruses and bacteria are carried around by dirty hands.
3.  Hence dirty hand cause death.
4.  Viruses and bacteria can (largely) be washed off.
5.  Clean hands carry fewer viruses and bacteria around than dirty hands.
6.  Hence, clean hands give less infection and death than dirty hands.

This causal model was not immediately available to Ms Nightingale, and
therefore she chose a data-driven model.  She observed that

1.  Some wards have few hand washes and few deaths.
2.  Some wards have many hand washes and many deaths.
3.  Wards with many hand washes and few deaths or vice versa are
    very rare.
4.  We infer that there is an increasing function $y=f(x)$ giving
    the approximate number of deaths $y$ as a function of the number
    of hand washes $x$.

This is an example of regression analysis.
The resulting model is one of corelation; 
certain events, such as dirty hands and deaths, tend to co-occur.
No causality is implied.

## Big Data

What is the difference between statistics and machine learning?

+ Basically, statistics can be calculated by hand.
    + Florence Nightingale only observed two variables in the given
      example
+ In machine learning we study data sets too large for manual
  compuatation.
+ Degrees of big data.  
    + Modern methods achieve results which were not possible ten years ago.  
    + What can we do ten years from now?

## Machine Learning

+ Essentially we solve the same problem as Florence Nightingale
    - observe some variables that we can control (hand washing)
    - observe some variables that we want to control (deaths)
    - predict how the former set influences the latter set
    - use this information to control what we want to control
      indirectly
+ Alternatively,
    - observe some variables that are easily observed
    - observe some variables that cannot always be observed
    - find the relationship between the two sets
    - use the observable information to predict the inobservable
+ **Attention** To build the model we need to observe the inobservable
    - inobservable may mean observable only in hindsight
    - historical data for training
    - in the future we need predictions before the observations become
      available
+ This is the case for many intelligent agent systems
    - we want to predict the payoff of potential actions
    - we cannot observe our own action, but the payoff is only
      observable after we have acted
    - however, we can observe both action and payoff in previous
      games
+ Machine Learning is always a question of modelling the relationship
  between the observable and the inobservable

## Types of Machine Learning

+ Two main problems
    - regression
    - classification
+ Three classes
    - supervised (with access to some ground truth)
    - unsupervised (without ground truth)
    - reinforcement learning

## Machine Learning as an Optimisation Problem

### Regression
+ Observe a data set $(x_1,y_1)$, $(x_2,y_2)$, \ldots $(x_n,y_n)$.
+ Assume that $y=f(x)$ for some unknown function $f$.
+ Goal: find a function $\hat f$ which approximates $f$.
+ Two famous cases.
    - **Regression** where $y$ is a continuous variable (vector).
    - **Classification** where $y$ a discrete label identifying
      a class to which $x$ belongs.
+ Very often, but not necessarily, classification is made via
  regression, where $y$ is interpretted as a *fitness to class*.
  The discrete label is obtained by thresholding the continuous
  variable $y$.
+ This is an **optimisation problem** where we minimise the **error**.
+ Different error measures exist
+ For a single data point, we have
  $$E_i = ||\hat y_i - y_i|| =  ||\hat f(x_i) - y_i||$$
  where $||\cdot||$ is some metric (distance measure), typically
  Euclidean distance for continuous vectors and Hamming distance
  for discrete vectors.
+ Aggregate error.
    + we could use just the sum $\sum E_i$
    + more commonly, we use squared errors $\sum E_i^2$ which 
      makes larger errors disproportionally more important.

### Classification
### Example: Linear Regression

## Algorithms
We assume a linear relationships, and require
$$ \hat f(x) = a + bx $$
Thus we have to solve the minimisation problem
$$\min_{a,b} \sum_i ||a + bx_i - y_i||^2$$

### Algorithms

+ Classic Statistics
    - if $\hat f$ has a simple form, you can solve the optimisation
      problem by calculus
+ ANN - Artificial Neural Networks
    - several layers of linear and non-linear functions
    - (usually) requires iterative optimisation algorithms
    - several possible optimisation algorithms, including GA
+ SVM - Support Vector Machines
+ PCA - Principal Component Analysis
    - the kernel trick transforms the problem into a higher dimension space
    - a linear solution in the higher dimension space becomes a non-linear
      solution in the original space
    - some kernels result in an infinite dimension space (Hilbert space)
+ PCA - Principal Component Analysis  (Diagnostic techniques)

### Classification

In classification we may try to minimise the error rate,
that is the number of samples for which $\hat f(x_i)\neq y_i$
divided by the number of samples.

### Considerations

+ Underfitting
+ Overfitting
+ bias-variance trade-off
+ Ockham's razor

### Assumptions

+ Samples are independently and identically distributed.
    + lest any statistical method be doomed

### Statistical Evaluation

Consider classification as an example.

+ The key performance heuristic is the the probability 
  $P(\hat f(x)\neq y)$ when $(x,y)$ is drawn uniformly at random
  from the population.
  This is the **error probability**.
+ The natural key performance indicator would be the error rate.
  That is, we draw $n$ samples $(x_i,y_i)$ from the population
  and count the number $f$ of pairs for which $\hat f(x) \neq y$.
  The **error rate** is $f/n$.
+ The error rate is an **estimator** for the error probability,
  but it is **not the same** the error probability.
    + the error rate is a random statistic, observed in the data.
    + the probability is a theoretical property describing what you
      would expect to observe
+ Estimators make errors.
    + this errors are randomly distributed
    + we can assess the error by estimating the standard deviation
      of the estimator
+ Let $p$ be the error probability and $r$ the error rate.
    + $E(r) = p$
    + $r$ is binomially distributed
    + $\sigma_r = \sqrt{\frac{p(1-p)}{n}}$
    + $\hat \sigma_r = \sqrt{\frac{r(1-r)}{n}}$
  


## Reasoning

+ Reasoning - deduction/induction/abduction

# Exercise

We use the [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/)
library for python.
We can install it using 
```sh
pip install -U libsvm-official
```
and import using
```python
from libsvm.svmutil import *
```
There is [Quick Start Guide](https://github.com/cjlin1/libsvm/blob/master/python/README)
specifically for Python.

## Tutorial

First, we need a dataset.
There are a lot of open datasets available on the net, in various
formats and with various levels of documentation.
To minimise the effort needed to find out how to parse the files
(e.g. CSV files), we will use datasets already
[formatted for libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#ijcnn1).
A typical file may look like this:
```
-1 6:1 11:-0.731854 12:0.173431 13:0.0 14:0.00027 15:0.011684 16:-0.011052 17:0.024452 18:0.008337 19:0.001324 20:0.025544 21:-0.040728 22:-0.00081
-1 7:1 11:-0.731756 12:0.173431 13:0.00027 14:0.011684 15:-0.011052 16:0.024452 17:0.008337 18:0.001324 19:0.025544 20:-0.040728 21:-0.00081 22:-0.00389
```
Each row is an object and each column is a variable.
The first column is the class label $y$, typically $\pm1$.
The other columns for the vector $x$.  Note that the format is sparse.
Most of the elements $x_i$ are zero; only the non-zero elements are listed,
so that $i:j$ means $x_i=j$.

For the this exercise, let us use the
[diabetes](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/diabetes) dataset.
It is relatively small and has not been preprocessed (yet).

**Step 1** Download the dataset and put it in your working directory.  E.g.
```sh
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/diabetes
```
We will assume that the file is called `diabetes`.


**Step 2** Start python and import the necessary libraries.
We will use numpy/scipy arrays as datastructure, and thus need to load
scipy as well.
```python
import libsvm.svmutil as svm
import scipy
```

**Step 3**  Load the data set.
```python
y, x = svm.svm_read_problem('diabetes',return_scipy=True)
```

**Step 3** One should always have a look at the data to see what we
are working with.
```python
y
x
print(x)
```
Pay some attention to `x`.  What data format is this?

**Step 4**  You are probably more familiar with dense matrices
than sparse matrices.  We can convert the sparse matrix as follows.
```python
xx = x.todense()
print(xx)
```
Note that `xx` has one row per feature (8) and one column per row
in the input file (768).  The customary format when discussing data
sets in machine learning would be the other way around, that is transposed.

Also note that you can create your own data sets, directly as numpy arrays,
sparse or dense.