**Reading:**Stephen Marsland: Chapter 2.2- Look up
*estimation*and*confidence intervals*in statistics if you do not remember what it is and how to do it (e.g. Johnson and Bhattacharyya).

From previous exercises, you should have a function which loads the data set and returns a list of pairs, where each pair contains a class label and a feature vector. We need a function to split it into a training set and a test set.

Ideally we should split randomly, but randomness is tricky. We will learn it next week; for now we split it naïvly and deterministically. Still, we need to make sure that both classes are represented in both sets. There are many ways to do this, but let's take the opportunity to practice list processing in Haskell.

Write a function which takes a list of class label/feature vector pairs, and returns a list of lists, where each list includes pairs with the same class label.

There are many options for the type signature, and it has to be compatible with your solutions in previous tutorials. The following is one reasonable option:

```
splitClasses :: [Double] -> [(Double,[Double])]
-> [[(Double,[Double])]]
```

Here, the first argument is the list of class labels in use.

The simplest implementation will use list comprehension to form the single-class lists as follows.

```
splitClasses :: [Double] -> [(Double,[Double])] -> [[(Double,[Double])]]
splitClasses cl dl = map s cl
where s c = [ (x,y) | (x,y) <- dl, x == c ]
```

The above solution is not optimised for speed.

- How many times does the
`splitClasses`

have to read through the`dl`

list? - How can process the input list in a single pass?

Having a list of lists of objects, we need to take p% of the elements from each list for training and the remainder for testing. We need a function which may have the following signature:

```
mkTestTrainSets :: Double -> [[(Double,[Double])]]
-> ([(Double,[Double])],[(Double,[Double])])
```

The first argument is the fraction p. The second argument is the list of lists as produced by the function in Step 1.

You may want to create a number of auxiliary functions to make this work. You will need to think of the following operations.

- For each list, you need to
- find the length l of the list
- calculate the number of elements p*l/100 for training
- calculate the number of elements l-p*l/100 for testing
- get the elements for training
- get the elements for testing

- Concatenate all the training set lists into one list
- Concatenate all the test set lists into one list

This can be done in many different ways, and before you start coding, you should discuss different options with a fellow student or two.

```
mkTestTrainSets :: Double -> [[(Double,[Double])]]
-> ([(Double,[Double])],[(Double,[Double])])
mkTestTrainSets _ [] = ([],[])
mkTestTrainSets f (d:dl) = prepend e l
where l = mkTestTrainSets f dl
n = ceiling $ f * fromIntegral ( length d )
e = (take n d, drop n d)
prepend (x,y) (x',y') = (x++x',y++y')
```

If you have done everything correctly, the following test in GHCi should work:

```
s1 <- getData "wdbc.data"
let s2 = splitClasses [0.0,1.0] s1
mkTestTrainSets 0.2 s2
```

You may have to load the right module between each step; depending on how you have organised your modules.

Given a test set (as obtained in Problem 1) and a trained perceptron (as obtained in Tutorial 4), we are going to test the perceptron. Let's do a simple test for now, where we do not distinguish between false positives and false negatives. First, we want to test on a single test item.

```
testNeuron' :: Neuron -> (Double,[Double]) -> Bool
```

*Implement the testNeuron' function.*
You need to do recall with the neuron and the feature vector
and then compare the result to the class label given as
the first element of the input pair. This gives a Boolean value
for the return value.

Now we have a function to test a single neuron. How do you test the neuron on every item in the test set? Make a definition for the following function.

```
testNeuron :: Neuron -> [(Double,[Double]) -> [Bool]
```

The return value is a list of boolean values, where True indicates an error and False indicates a correct classification.

Having solved Problem 2 as well as Tutorial 4, we are able to both train and test a perceptron. Now we need to put it all into one executable program.

Create a Main module, with its main IO action which does the following.

- Read and parse the data set.
- Initialise the perceptron.
- Train the perceptron.
- Test the perceptron.
- Output the number of errors and the total number of tests.

It is quite possible that you get a lot of errors. Don't worry if you do. It is likely because the single neuron is a bit too simple. Next week we will discuss how to build networks of multiple neurons.

```
module Main where
import ANNData == your module to load CSV data
-- it should define mkTestTrainSets
import NeuralNetwork == your module defining a backpropagation network
-- it should define:
-- initNetwork
-- trainNetwork
-- testNetwork
p = 0.1
main :: IO ()
main = do
dl <- getData "wdbc.csv"
let (testS,trainS) = mkTestTrainSets p (splitClasses [0,1] dl)
-- Get the length of the feature vector
let dlen = length $ fst $ testS!!0
print $ "Input length: " ++ show dlen
-- Initialise and train network.
let untrainedNeuron = initNeuron dlen
let trainedNeuron = trainNeuron 0.25 trainS 1000 untrainedNeuron
-- Test and print (see definition below)
print "Untrained network, test data:"
testPrint untrainedNeuron testS
print "Untrained network, training data:"
testPrint untrainedNeuron trainS
print "Trained network, test data:"
testPrint trainedNeuron testS
print "Trained network, training data:"
testPrint trainedNeuron trainS
-- Auxiliar function.
testPrint n s = do
let result = testNetwork n s
-- testNetwork n s tests the classifier n on the data set s
-- and returns a list of Boolean, indicating which data
-- were correctly classified
--
let right = length [ x | x <- result, x ]
let wrong = length [ x | x <- result, not x ]
print ( "Correct: " ++ show right )
print ( "Errors: " ++ show wrong )
```

A common problem in machine learning is that features with very high magnitude may dominate over others, so that the classifier is not properly trained in all dimensions.

*Look at the breast cancer data set. What is the range of the
features in different columns?
*

We have prepared a scaled version of the data set. All the features have been brought into the [0,1] range by linear scaling. Feel free to try it, but note that there may be other reasons why the perceptron does not classify very well.

Given the number of errors *e* and the number of tests
*n*, we can calculate the error rate *e/n*.
We are interested in the *probability* of error, when the
classifier is used on a random, unknown object.
Answer the following:

- What is the relationship between the error rate and the probability of error?
- What is the probability distribution of the number of errors
*e*? - Calculate a 95% confidence interval for the error probability.