Briefing: Machine Learning

Briefing

Briefing

AI and Cybernetic Systems are often described as either model-driven or data-driven.
- what is the difference?
- examples?
Note that there is not necessarily a sharp boundary between the two. We may have partial models in a data-driven approach.

Florence Nightinggale

Nurse in the Crimean War 1853-56
First female member of the Royal Statistical Society (1859)
A pioneer in using statistics to make politics
Key to sanitary reforms in the British Army

If doctors wash their hands more frequently, there are fewer deaths in their ward.

This is a simple quantitative problem.
count hand washing events per ward
count death events per ward
compare the numbers
A key contribution of Ms Nightingale’s was the visualisation of these numbers to make the authorities understand their implication.

Does this have anything to do with intelligent agents?

Well, we do you use it to infer desireable action … to wash hands or not to wash hands, that’s the question.

Mortality Chart due to Florence Nightingale

Modelling it

Today we would say that Ms Nightingale’s observation is obvious. We have a simple causal model to say that doctors should wash their hands.

Viruses and bacteria in wounds cause infection and death.
Viruses and bacteria are carried around by dirty hands.
Hence dirty hand cause death.
Viruses and bacteria can (largely) be washed off.
Clean hands carry fewer viruses and bacteria around than dirty hands.
Hence, clean hands give less infection and death than dirty hands.

This causal model was not immediately available to Ms Nightingale, and therefore she chose a data-driven model. She observed that

Some wards have few hand washes and few deaths.
Some wards have many hand washes and many deaths.
Wards with many hand washes and few deaths or vice versa are very rare.
We infer that there is an increasing function \(y=f(x)\) giving the approximate number of deaths \(y\) as a function of the number of hand washes \(x\).

This is an example of regression analysis. The resulting model is one of corelation; certain events, such as dirty hands and deaths, tend to co-occur. No causality is implied.

Big Data

What is the difference between statistics and machine learning?

Basically, statistics can be calculated by hand.
- Florence Nightingale only observed two variables in the given example
In machine learning we study data sets too large for manual compuatation.
Degrees of big data.
- Modern methods achieve results which were not possible ten years ago.
- What can we do ten years from now?

Machine Learning

Essentially we solve the same problem as Florence Nightingale
- observe some variables that we can control (hand washing)
- observe some variables that we want to control (deaths)
- model how the former set influences the latter set (and use this model to predict the behaviour of the latter set)
- use this information to control what we want to control indirectly
Alternatively,
- observe some variables that are easily observed
- observe some variables that cannot always be observed
- find the relationship between the two sets
- use the observable information to predict the inobservable
Attention To build the model we need to observe the inobservable
- inobservable may mean observable only in hindsight
- historical data for training
- in the future we need predictions before the observations become available
This is the case for many intelligent agent systems
- we want to predict the payoff of potential actions
- we cannot observe our own action, but the payoff is only observable after we have acted
- however, we can observe both action and payoff in previous games
Machine Learning is always a question of modelling the relationship between the observable and the inobservable

Types of Machine Learning

Two main problems
- regression
- classification
Three classes
- supervised (with access to some ground truth)
- unsupervised (without ground truth)
- reinforcement learning

Machine Learning as an Optimisation Problem

Observe a data set \((x_1,y_1)\), \((x_2,y_2)\), \((x_n,y_n)\).
Assume that \(y=f(x)\) for some unknown function \(f\).
Goal: find a function \(\hat f\) which approximates \(f\).
Two famous cases.
- Regression where \(y\) is a continuous variable (vector).
- Classification where \(y\) a discrete label identifying a class to which \(x\) belongs.
Very often, but not necessarily, classification is made via regression, where \(y\) is interpretted as a fitness to class. The discrete label is obtained by thresholding the continuous variable \(y\).
This is an optimisation problem where we minimise the error.
Different error measures exist
For a single data point, we have \[E_i = ||\hat y_i - y_i|| = ||\hat f(x_i) - y_i||\] where \(||\cdot||\) is some metric (distance measure), typically Euclidean distance for continuous vectors and Hamming distance for discrete vectors.
Aggregate error.
- we could use just the sum \(\sum E_i\)
- more commonly, we use squared errors \(\sum E_i^2\) which makes larger errors disproportionally more important.

Example: Linear Regression

We assume a linear relationships, and require \[ \hat f(x) = a + bx \] Thus we have to solve the minimisation problem \[\min_{a,b} \sum_i ||a + bx_i - y_i||^2\]

Algorithms

Classic Statistics
- if \(\hat f\) has a simple form, you can solve the optimisation problem by calculus
ANN - Artificial Neural Networks
- several layers of linear and non-linear functions
- (usually) requires iterative optimisation algorithms
- several possible optimisation algorithms, including GA
SVM - Support Vector Machines
- the kernel trick transforms the problem into a higher dimension space
- a linear solution in the higher dimension space becomes a non-linear solution in the original space
- some kernels result in an infinite dimension space (Hilbert space)
PCA - Principal Component Analysis (Diagnostic techniques)

Classification

In classification we may try to minimise the error rate, that is the number of samples for which \(\hat f(x_i)\neq y_i\) divided by the number of samples.

Considerations

Underfitting
Overfitting
bias-variance trade-off
Ockham’s razor

Assumptions

Samples are independently and identically distributed.
- lest any statistical method be doomed

Statistical Evaluation

Consider classification as an example.

The key performance heuristic is the the probability \(P(\hat f(x)\neq y)\) when \((x,y)\) is drawn uniformly at random from the population. This is the error probability.
The natural key performance indicator would be the error rate. That is, we draw \(n\) samples \((x_i,y_i)\) from the population and count the number \(f\) of pairs for which \(\hat f(x) \neq y\). The error rate is \(f/n\).
The error rate is an estimator for the error probability, but it is not the same the error probability.
- the error rate is a random statistic, observed in the data.
- the probability is a theoretical property describing what you would expect to observe
Estimators make errors.
- this errors are randomly distributed
- we can assess the error by estimating the standard deviation of the estimator
Let \(p\) be the error probability and \(r\) the error rate.
- \(E(r) = p\)
- \(r\) is binomially distributed
- \(\sigma_r = \sqrt{\frac{p(1-p)}{n}}\)
- \(\hat \sigma_r = \sqrt{\frac{r(1-r)}{n}}\)

Reasoning

Reasoning - deduction/induction/abduction