From previous exercises, you should have a function which loads the data set and returns a list of pairs, where each pair contains a class label and a feature vector. We need a function to split it into a training set and a test set.
Ideally we should split randomly, but randomness is tricky. We will learn it next week; for now we split it naïvly and deterministically. Still, we need to make sure that both classes are represented in both sets. There are many ways to do this, but let's take the opportunity to practice list processing in Haskell.
Write a function which takes a list of class label/feature vector pairs, and returns a list of lists, where each list includes pairs with the same class label.
There are many options for the type signature, and it has to be compatible with your solutions in previous tutorials. The following is one reasonable option:
splitClasses :: [Double] -> [(Double,[Double])] -> [[(Double,[Double])]]
Here, the first argument is the list of class labels in use.
The simplest implementation will use list comprehension to form the single-class lists as follows.
splitClasses :: [Double] -> [(Double,[Double])] -> [[(Double,[Double])]] splitClasses cl dl = map s cl where s c = [ (x,y) | (x,y) <- dl, x == c ]
The above solution is not optimised for speed.
splitClasseshave to read through the
Having a list of lists of objects, we need to take p% of the elements from each list for training and the remainder for testing. We need a function which may have the following signature:
mkTestTrainSets :: Double -> [[(Double,[Double])]] -> ([(Double,[Double])],[(Double,[Double])])
The first argument is the fraction p. The second argument is the list of lists as produced by the function in Step 1.
You may want to create a number of auxiliary functions to make this work. You will need to think of the following operations.
This can be done in many different ways, and before you start coding, you should discuss different options with a fellow student or two.
mkTestTrainSets :: Double -> [[(Double,[Double])]] -> ([(Double,[Double])],[(Double,[Double])]) mkTestTrainSets _  = (,) mkTestTrainSets f (d:dl) = prepend e l where l = mkTestTrainSets f dl n = ceiling $ f * fromIntegral ( length d ) e = (take n d, drop n d) prepend (x,y) (x',y') = (x++x',y++y')
If you have done everything correctly, the following test in GHCi should work:
s1 <- getData "wdbc.data" let s2 = splitClasses [0.0,1.0] s1 mkTestTrainSets 0.2 s2
You may have to load the right module between each step; depending on how you have organised your modules.
Given a test set (as obtained in Problem 1) and a trained perceptron (as obtained in Tutorial 4), we are going to test the perceptron. Let's do a simple test for now, where we do not distinguish between false positives and false negatives. First, we want to test on a single test item.
testNeuron' :: Neuron -> (Double,[Double]) -> Bool
You need to do recall with the neuron and the feature vector
and then compare the result to the class label given as
the first element of the input pair. This gives a Boolean value
for the return value.
Now we have a function to test a single neuron. How do you test the neuron on every item in the test set? Make a definition for the following function.
testNeuron :: Neuron -> [(Double,[Double]) -> [Bool]
The return value is a list of boolean values, where True indicates an error and False indicates a correct classification.
Having solved Problem 2 as well as Tutorial 4, we are able to both train and test a perceptron. Now we need to put it all into one executable program.
Create a Main module, with its main IO action which does the following.
It is quite possible that you get a lot of errors. Don't worry if you do. It is likely because the single neuron is a bit too simple. Next week we will discuss how to build networks of multiple neurons.
module Main where import ANNData == your module to load CSV data -- it should define mkTestTrainSets import NeuralNetwork == your module defining a backpropagation network -- it should define: -- initNetwork -- trainNetwork -- testNetwork p = 0.1 main :: IO () main = do dl <- getData "wdbc.csv" let (testS,trainS) = mkTestTrainSets p (splitClasses [0,1] dl) -- Get the length of the feature vector let dlen = length $ fst $ testS!!0 print $ "Input length: " ++ show dlen -- Initialise and train network. let untrainedNeuron = initNeuron dlen let trainedNeuron = trainNeuron 0.25 trainS 1000 untrainedNeuron -- Test and print (see definition below) print "Untrained network, test data:" testPrint untrainedNeuron testS print "Untrained network, training data:" testPrint untrainedNeuron trainS print "Trained network, test data:" testPrint trainedNeuron testS print "Trained network, training data:" testPrint trainedNeuron trainS -- Auxiliar function. testPrint n s = do let result = testNetwork n s -- testNetwork n s tests the classifier n on the data set s -- and returns a list of Boolean, indicating which data -- were correctly classified -- let right = length [ x | x <- result, x ] let wrong = length [ x | x <- result, not x ] print ( "Correct: " ++ show right ) print ( "Errors: " ++ show wrong )
A common problem in machine learning is that features with very high magnitude may dominate over others, so that the classifier is not properly trained in all dimensions.
Look at the breast cancer data set. What is the range of the features in different columns?
We have prepared a scaled version of the data set. All the features have been brought into the [0,1] range by linear scaling. Feel free to try it, but note that there may be other reasons why the perceptron does not classify very well.
Given the number of errors e and the number of tests n, we can calculate the error rate e/n. We are interested in the probability of error, when the classifier is used on a random, unknown object. Answer the following: