The Decision Tree Project

Arnel Curkic, Cato A. Goffeng & Amund Lågbu

March 6, 2014

Contents

1 Introduction
  1.1 About
  1.2 Raw data generation
  1.3 The problem's applications
  1.4 Attributes and conversion
      1.4.1 See5
      1.4.2 Weka
2 What others have done
3 Decision tree tools
  3.1 See5
  3.2 Weka
4 Trees and rules - what is best, and the difference between them
5 Interpretation of output & a review of rules generated
  5.1 See5
  5.2 Weka
6 Characterization of the generalizing ability of decision tree models using cross-validation
7 Classifications with different costs
8 Relevant decision tree options
  8.1 Winnowing
      8.1.1 Test with winnowing
      8.1.2 Test without winnowing
  8.2 Boosting
      8.2.1 A run with 10 boosting trials
      8.2.2 A run without boosting
      8.2.3 Comparison of runs
  8.3 Pruning techniques
  8.4 Algorithms which handle missing attributes
9 Evaluation
10 Appendix
  10.1 Java code created to shuffle .arff file
      10.1.1 main.java
      10.1.2 FileOrganizer.java
      10.1.3 ArrayShuffle.java
  10.2 A Java program for testing a small decision tree
      10.2.1 Main05b.java
      10.2.2 FileReader.java
      10.2.3 DecisionTree.java
      10.2.4 Rule.java
      10.2.5 FlowerClassifier.java
  10.3 Code to create a decision tree based on a tree handling missing data

Chapter 1 Introduction

1.1 About

We have chosen a dataset with information about flowers in the iris plant family. The four attributes in the set describe each flower's sepal length and width and petal length and width, and a class attribute specifies which species the flower belongs to. The dataset contains 150 instances of flower data [11]. The objective of this task is to create a decision tree which can separate the flowers from each other and determine what species a flower belongs to. According to Mitchell, overfitting is one possible issue that may occur when a decision tree is generated.
Overfitting occurs when, for example, a decision tree is fitted too closely to the training data and therefore performs poorly on later test data. Mitchell mentions two possible causes: 1) a tree where each branch is grown just deeply enough to classify the training samples; 2) noise in the data, or a training set too small to produce a representative sample of the target function [9]. Overfitting may occur in our project solution, and may be caused by both of the factors above, especially the size of the set. Our data set contains only 150 instances, and may be too small a training sample. Regardless of the possible problems, we separated the dataset into two parts - one with test data (50 instances), and another with training data (100 instances).

1.2 Raw data generation

The raw data was downloaded as .data files. The information in the files was sorted by flower species. Separating the files directly into training and testing data could lead to complications, since the first or last 50 instances would contain only one species of iris flower. Instead we reordered the file to make sure that 33-34 instances of each flower species were present in the training data, and 16-17 instances were present in the testing data.

1.3 The problem's applications

We chose a dataset that is a basic classification problem. The task performed is to use a set of values representing the width and length of a flower's sepal and petal to classify it into one of the three species of iris (setosa, versicolour or virginica).

1.4 Attributes and conversion

The attributes used in the dataset are measurements introduced by Sir Ronald Fisher, and the set consists of 50 instances of each of the flower species iris setosa, iris virginica and iris versicolor. Each instance contains its species and the width and length of the flower's petal and sepal. The measurements are all numerical values and are therefore given the continuous type when used in conjunction with See5, while the last attribute, the species, is the classification.

1.4.1 See5

We have installed the Windows version of C5.0 - See5. This is not the full version, but a tutorial version. It supports all the functionality of the full version of See5, except for the number of possible instances: the tutorial version supports 400 instances [10]. Since our data set contains only 150 instances, we have chosen to use this tool. Two file types are required as a minimum to perform a decision tree algorithm with See5: a .data file containing the data and a .names file containing the attributes and classes. A file with test data (.test) is optional, and may be used to test the created decision tree [12]. We have created three files to perform our decision tree algorithm: "iris.names", "iris.data" and "iris.test" (files with output and the decision tree are generated by See5 automatically). Both "iris.test" and "iris.data" are written in the same way. The two lines in the example below are copied from "iris.test" and contain five comma-separated values each. The first four values are the attributes of the current instance, while the last is the instance's class.

5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
...

The data file was downloaded from the Internet, while the test file was generated by moving 50 of the instances in the data file to the test file. After finishing the data files we created a .names file based on information provided in the tutorial "C5.0: An Informal Tutorial" [12]. The .names file contains the following information:
class.

sepalLength: continuous.
sepalWidth: continuous.
petalLength: continuous.
petalWidth: continuous.
class: Iris-setosa,Iris-versicolor,Iris-virginica.

The first line, "class.", names the attribute that will be predicted (in our case the flower's class). Then four attributes are described: "sepalLength", "sepalWidth", "petalLength" and "petalWidth". All are defined as continuous, which means a numeric attribute.

1.4.2 Weka

For Weka 3.6 there is no need to convert file formats before using the selected dataset. Weka comes with a folder 'data', where the iris data set is stored in an .arff file. The data is written in the following way:

@RELATION iris

@ATTRIBUTE sepallength REAL
....
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
....

Unlike the files used for generating a decision tree in See5, both descriptive information and data are placed together in the .arff file. The lines below "@DATA" are the data in the file. @RELATION is used to name the current relation. The @ATTRIBUTE tag is used to describe the attributes. REAL is used for numeric attributes (NUMERIC could also have been used). The line "@ATTRIBUTE class..." contains the different flower classes [5]. This file, however, contains all the data about the iris flowers, and is not separated into testing and training data. Weka supports functionality to separate a data set (as shown in figure 1.1), but we have done this manually, selecting the radio button "Supplied test set" and uploading our own set.

Figure 1.1: Different functionality for training/testing.

The test set has been generated by moving 50 of the flowers in the iris set into a new .arff file. 17 instances each of iris setosa and iris versicolor are moved into the test file. 16 instances of iris virginica are also moved, as shown in figure 1.2, where instances from the training file are passed into the testing file.

Figure 1.2: Iris data.

We didn't focus on picking the instances in each flower class randomly, since the instances didn't appear to be sorted on more levels than flower class. A programmatic version of this split is sketched below.
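For readers who prefer to script this step, the sketch below shows one way to produce the two files with the Weka Java API. It is a minimal sketch under our own assumptions: weka.jar (3.x) must be on the classpath, the file names are our own placeholders, and the split is random rather than the exact 33/34-per-class split described above.

import java.io.File;
import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: split iris.arff into a 100-instance training set and a
// 50-instance test set. File names are illustrative.
public class SplitIris {

    public static void main(String[] args) throws Exception {
        Instances all = DataSource.read("iris.arff");
        all.setClassIndex(all.numAttributes() - 1);
        all.randomize(new Random(42));              // the raw file is sorted by class
        save(new Instances(all, 0, 100), "iris-train.arff");
        save(new Instances(all, 100, 50), "iris-test.arff");
    }

    private static void save(Instances data, String name) throws Exception {
        ArffSaver saver = new ArffSaver();          // writes Instances back to .arff
        saver.setInstances(data);
        saver.setFile(new File(name));
        saver.writeBatch();
    }
}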
Chapter 2 What others have done

The iris dataset is popular, as shown on the UCI Machine Learning Repository website, where it is ranked as number one on the "most popular data sets" list [11]. The iris data set is used as one of many in a study where Mingers compares five different methods for pruning decision trees [8]. The five pruning methods were Error-Complexity Pruning (Err-Comp), Critical Value Pruning (Critical), Minimum-Error Pruning (Min-err), Reduced-Error Pruning (Reduce) and Pessimistic Error Pruning (Pessim). The experiment compares these five pruning methods in terms of the size of the pruned tree and its accuracy. The paper concludes that there are significant differences between the five pruning methods. The Min-err method was the least accurate; it also failed to provide a set of trees for the experts to examine. The Pessim pruning method was the most crude, but it was also the quickest, and separate test data is not required. The three other pruning methods are very similar overall and produced consistently low error rates over all the data sets, including the iris data set [8].

Another paper, by Buntine & Niblett [2], also uses the iris data set as one of many data sets to test whether random splitting could decrease error. Buntine & Niblett [2] and Mingers [7] reached much the same results. Buntine concludes that random splitting rules perform worse than other measures [2]. Mingers [7] also came to the conclusion that selecting attributes entirely randomly produces trees that are as accurate as those produced by using measures. Selecting random attributes also resulted in larger decision trees compared to using measures, on unpruned trees. After pruning there was little difference in tree size [7].

Murthy has created a randomized algorithm for building oblique decision trees. The iris data set is used with other data sets to experiment with and test this algorithm. The results showed that the algorithm produced small, accurate trees, and that its computational requirements are quite modest [4].

These are some of many examples where the iris data set has been used. Throughout this project we will use the iris data set to help us solve our problem situations. In the research referenced above, both Mingers [7] and Buntine [2] use many different data sets - different in size, number of attributes and other aspects. We, on the other hand, will only use the iris data set, which is rather small [11], and as mentioned before we might experience overfitting.

Chapter 3 Decision tree tools

To perform the task we will use both See5 and Weka for decision tree making. We have therefore performed some basic tests of the tools' functionality, to check whether we are able to use them.

3.1 See5

We have chosen See5 as a decision tree tool. The start-up window contains six buttons, and only one is clickable - "Locate Data" (as shown in figure 3.1).

Figure 3.1: See5's start-up window.

We may upload our data, test and description files by clicking on the selectable button. We are then presented with the screen shown in figure 3.2:

Figure 3.2: GUI presented when uploading files to See5.

After that we are able to create a decision tree by clicking the second button from the left - "Construct Classifier". We are then presented with the window shown in figure 3.3. We notice that we could use cross-validation to train and test the system with varying test data, but since this is only a basic test, we use the default values and click "Ok".

Figure 3.3: Window for classifier construction in See5.

See5 then generates a decision tree for us, and also tests the generated tree. A window presents the output (as illustrated in figure 3.4):

Figure 3.4: Output in See5.

We notice that the decision tree is presented at the top with different branches. By splitting the plants on whether they have petal length <= 1.9 or petal length > 1.9, the created decision tree is able to classify all the 33 iris setosas at the top of the tree. The algorithm then continues creating branches, and ends up with a tree that classifies 3 instances wrong - three percent errors. For the test data the tree generates two percent errors. All this information is also stored in two files generated automatically by See5 - iris.tree and iris.out.

3.2 Weka

We have chosen Weka as a decision tree tool. Weka supports visualization of data sets, as shown in figure 3.5, where the iris data set used in this task is visualized.

Figure 3.5: Visualization of training data in Weka.

Immediately we discover one of the advantages of using Weka - the good visualization of data. We notice that the small training set may cause an error, since all iris setosas in the training set (colored blue) have smaller petal width than the other flowers, as shown in figure 3.5. A visualization of the full data set, shown in figure 3.6, shows that one of the setosa flowers has a larger petal width than the others.
This was not visible in the training set, since that set didn't contain the flower with petal width greater than the other setosas. The visualizations in Weka make this easier to notice.

Figure 3.6: Visualization of petal width of all 150 iris flowers.

Weka has also implemented a number of algorithms, amongst others for decision tree making. These algorithms may be selected as shown in figure 3.7.

Figure 3.7: Algorithms in Weka.

Finally we tested Weka and analyzed the output of the RandomForest algorithm. This algorithm uses a number of trees, called a forest. These trees are generated based on random vectors. When a large number of trees have been generated, they vote for the most popular class amongst them [1]. The algorithm was used to generate a decision tree test, and the output is shown in figure 3.8. As the output shows, the confusion matrix revealed that one of the iris versicolors was classified as iris virginica during the testing. The others were classified correctly. Since this is only a test of basic functionality in Weka, no further analyses were done.

Figure 3.8: Output from a test of the random forest algorithm available in Weka.

Chapter 4 Trees and rules - what is best, and the difference between them

Decision trees and rules are similar in many ways. One way to learn a set of rules is to first learn a decision tree and then translate the tree into an equivalent set of rules - one rule for each leaf node in the decision tree. A set of rules can similarly be translated into a decision tree. The sequential covering algorithm learns a set of rules by first learning a single accurate rule, then removing the positive examples covered by this rule, and then continuing the process over the remaining training examples. This is an efficient, greedy algorithm for learning rule sets, similar to top-down decision tree learning algorithms such as the ID3 algorithm [9]. The main difference is that the ID3 algorithm can be viewed as simultaneous, rather than sequential, covering [9].

A decision tree model can be converted into a collection of if-then statements (a set of rules). The decision tree presentation is useful when you want to see how attributes in the data can split the data into subsets relevant to the problem. The rule set presentation is useful if we want to see how particular groups of items relate to a specific conclusion. We can look at decision rules as the verbal equivalent of the graphical decision tree. The sketch after this paragraph illustrates the tree-to-rules translation.
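The following is a minimal sketch of the leaf-to-rule translation described above, in the same style as the appendix code. The Node class is our own illustration; the petal length <= 1.9 split is borrowed from the See5 run in chapter 3, the petal width <= 1.7 split from the hard-coded tree in section 10.2.3, and the tree is simplified for the example.

import java.util.ArrayList;
import java.util.List;

// Sketch: emit one if-then rule per leaf of a binary decision tree.
public class TreeToRules {

    // Minimal node type: internal nodes test attribute <= threshold,
    // leaves carry a class label.
    static class Node {
        String attribute; double threshold; Node left, right; String label;
        Node(String a, double t, Node l, Node r) { attribute = a; threshold = t; left = l; right = r; }
        Node(String label) { this.label = label; }
        boolean isLeaf() { return label != null; }
    }

    // Walk the tree, accumulating the conditions along each root-to-leaf path.
    static void collectRules(Node n, String conds, List<String> out) {
        if (n.isLeaf()) {
            out.add("IF " + conds + " THEN class = " + n.label);
            return;
        }
        String sep = conds.isEmpty() ? "" : " AND ";
        collectRules(n.left, conds + sep + n.attribute + " <= " + n.threshold, out);
        collectRules(n.right, conds + sep + n.attribute + " > " + n.threshold, out);
    }

    public static void main(String[] args) {
        Node tree = new Node("petalLength", 1.9,
                new Node("Iris-setosa"),
                new Node("petalWidth", 1.7,
                        new Node("Iris-versicolor"),
                        new Node("Iris-virginica")));
        List<String> rules = new ArrayList<String>();
        collectRules(tree, "", rules);
        for (String r : rules) System.out.println(r);
    }
}

Run on this small tree, the program prints three rules, one per species, which shows why the two presentations carry the same information.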
Chapter 5 Interpretation of output & a review of rules generated

5.1 See5

We used See5 to generate a decision tree for classifying flower species. We used no boosting trials and no cross-validation, but a test set of 50 instances to check the tree built from our data set of 100 instances of flower data. The screen dump in figure 5.1 shows the output from See5.

Figure 5.1: Output in See5.

The output may be interpreted as the decision tree shown in figure 5.2.

Figure 5.2: Illustration of decision tree output from C5.

The confusion matrix showed that this relatively small tree generated only 2% errors when running the test data, but a higher error rate, 5%, on the training data. This tree has few rules, and every rule seems to be of importance. The first level of branches separates all the iris setosas from the rest of the iris flowers. The two branches on the next level then separate the iris versicolors from the iris virginicas, with some errors. Since the generated tree was pretty small, and clearly separates the flowers, we have not removed any of the tree's branches. All the rules seem sensible.

5.2 Weka

For the Weka test we decided to run the training on 100 instances of data, and tried cross-validation. The reason why we performed training with only 100 data instances is to later check whether any of the rules may be unnecessary by running the 50 instances in the test set. This approach was chosen mainly for the sake of learning. Since this data set was sorted, we also shuffled the data before performing the test. A small Java program was created for this purpose. The program code is shown in section 10.1.

In Weka we chose an algorithm for our decision tree making: the J48 algorithm. This algorithm has been shown to work well with the iris data set before. The researchers Tiwari, Srivastava and Pandey have made a comparison of decision tree algorithms on the iris data set. They found that the J48 algorithm is a good choice for small to medium data sets, such as iris [17]. The J48 algorithm is Weka's implementation of an algorithm that is better known as C4.5. This algorithm is able to handle cases with missing data. It is also a divide-and-conquer algorithm which may split a data set into smaller disjoint sets [6]. A sketch of how such a run can be scripted follows below.
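For reference, the same kind of run can be performed through Weka's Java API instead of the Explorer GUI. This is a minimal sketch under our own assumptions - weka.jar on the classpath and the shuffled file from section 10.1 - not the exact setup used for the figures below.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: build a J48 tree and evaluate it with 5-fold cross-validation.
public class J48CrossValidation {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("shuffled.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute

        J48 tree = new J48();                           // Weka's C4.5 implementation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 5, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());      // the confusion matrix
    }
}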
Several tests were performed with the training set, and the number of cross-validation folds was adjusted to see if that had an impact on the result. In the end a test with five cross-validation folds was selected. That particular test generated a confusion matrix with seven incorrect classifications, which is shown in figure 5.3. The matrix was linked to the decision tree shown in figure 5.4. This decision tree is further illustrated in figure 5.5.

Figure 5.3: Illustration of an output confusion matrix from Weka.

Figure 5.4: Illustration of an output decision tree from Weka.

Figure 5.5: Illustration of the decision tree shown in figure 5.4.

To check whether any of the rules were unnecessary, a Java program was created for this purpose. The program uses the generated tree to classify flowers. It also modifies the tree by removing sets of branches, and then prints the confusion matrix of every modified decision tree. For simplicity the Java program uses a data file which was originally modified for Cubist. This file has no flower names as strings, but numbers instead. The program is shown in section 10.2. After running the program twice - once with the 150 instances in the complete data file and once with the 50 instances in the test file - the program wrote the output shown in figures 5.6 and 5.7.

Figure 5.6: Decision tree test of the complete data set performed with a program written in Java.

Figure 5.7: Decision tree test of the test data.

According to the confusion matrices for the trees modified with Java, it is possible to shrink the size of the tree and still get good results. If we drop the lowest two branches of the tree, we get a tree which generates only seven errors for the whole data set, and two errors for the test set (which is just as good as the complete tree on the same set). It is possible that the full tree has been overfitted when trained with the 100 instances in the training set, and that a smaller tree would work just as well, or better, on other data. Mitchell mentions that it is hard to decide when to stop growing a decision tree [9], and further testing may therefore be necessary before we can make a conclusion. But according to William of Occam's principle, "Occam's Razor", a simple, small tree should always be preferred over a larger one [13], and this principle therefore strengthens the argument for choosing the modified tree instead of the original one. (The confusion matrices for the modified tree mentioned in this section are shown in figures 5.6 and 5.7 as "Confusion matrix without rule 3".) The illustration of the tree mentioned above is shown in figure 5.8 - without the two lowest branches.

Figure 5.8: Decision tree with modifications done after a review of a tree's generated rules.

Chapter 6 Characterization of the generalizing ability of decision tree models using cross-validation

In a limited dataset the risk of overfitting is high. To limit overfitting the predictive model to the validation data, we experimented with different techniques offered by See5. Cross-validation enables you to train and test on the same dataset, which in our case was 150 instances. There are multiple ways to perform this type of validation, but the See5 software utilizes a K-fold algorithm to divide the data evenly over a selected number of folds (subsets); we chose five for this exercise. On the first iteration four subsets are used to train the model while the last subset is used for testing. The subset used for testing is then swapped with one of the training subsets on the next iteration, and this process continues until all subsets have been used both for training and testing. The sketch below illustrates the rotation.
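A minimal, self-contained sketch of the fold rotation just described - our own illustration, not See5's implementation. With k set to the number of instances, the same loop becomes the leave-one-out scheme discussed further down.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: partition 150 instance indexes into k folds and rotate the test fold.
public class KFoldRotation {

    public static void main(String[] args) {
        int n = 150, k = 5;
        List<Integer> idx = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(1));          // shuffle once up front

        for (int fold = 0; fold < k; fold++) {
            List<Integer> train = new ArrayList<Integer>();
            List<Integer> test = new ArrayList<Integer>();
            for (int i = 0; i < n; i++)
                (i % k == fold ? test : train).add(idx.get(i));
            // Here: build a tree on 'train', classify 'test', record the error rate.
            System.out.printf("fold %d: %d training, %d test instances%n",
                    fold + 1, train.size(), test.size());
        }
    }
}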
Fold    Decision Tree
      ----------------
      Size      Errors

  1     4        3.3%
  2     4        3.3%
  3     3       13.3%
  4     4        3.3%
  5     4        0.0%

Mean  3.8        4.7%
SE    0.2        2.3%

        (a)   (b)   (c)    <-classified as
       ----  ----  ----
         49     1          (a): class Iris-setosa
               48     2    (b): class Iris-versicolor
                4    46    (c): class Iris-virginica

This enabled us to test the performance of different predictive models on the same training set. The combined growth in error percentage might seem like a negative, but the result is generalized and therefore considered less biased.

Another option is to use leave-one-out cross-validation (LOOCV). This is accomplished in See5 by selecting the cross-validation option with a fold count equal to the number of instances of data.

Mean  4.0        4.7%
SE    0.0        1.7%

        (a)   (b)   (c)    <-classified as
       ----  ----  ----
         49     1          (a): class Iris-setosa
               47     3    (b): class Iris-versicolor
                3    47    (c): class Iris-virginica

This algorithm uses one instance of the dataset for testing and the remaining data for training, then repeats the process for all the instances. This produces the same error rate, but a lower variability of the means.

Lastly we tried the K-fold cross-validation on decreasing datasets to emphasise the problem of small datasets. The table in figure 6.1 shows a clear increase in error rate as we reduce the dataset, compared to our result of 4.7% on 150 instances.

Figure 6.1: Table showing increase in error rate.

Chapter 7 Classifications with different costs

Differing costs will not be a problem for our chosen data set. Compare this with the heart disease data set, where a wrong classification could have major consequences: if a patient who is sick is diagnosed as healthy, this could lead to big problems. In our case, where we use the iris data set to classify flowers in the iris family, this is not a major concern. If one flower is wrongly classified it will not cause a problem, and therefore we will not have to use a .cost file.

Chapter 8 Relevant decision tree options

8.1 Winnowing

Winnowing is used to reduce the number of attributes used in a decision tree, by pre-selecting predictors. Usually this process leads to a different classifier than would have been the case without it. Winnowing is, according to Rulequest Research, best suited for large applications, where many of the attributes probably have little impact on the classification task [15]. We have made a small example in See5 which demonstrates the use of winnowing with our small data set. A classifier for the complete data set, consisting of 150 instances, has been constructed with See5 with and without the use of winnowing. The results are presented in 8.1.1 and 8.1.2.

8.1.1 Test with winnowing

The test in figure 8.1 shows that three attributes have been winnowed, and that only the petal width is used as a predictor.

Figure 8.1: Test with winnowing for decision tree construction.

8.1.2 Test without winnowing

The test without winnowing (ref. figure 8.2) removes no predictors, and two attributes are used when constructing the decision tree - petal length and then petal width.

Figure 8.2: Test without the use of winnowing.

8.2 Boosting

According to Appel et al., boosting is one of the most used decision tree learning techniques. The technique generates several so-called "weak learners" (trained decision trees with poor performance) and combines them to generate a single, strong classifier [3]. See5 has implemented functionality called adaptive boosting. The user may set a number of trials to be generated in See5 - and every trial will generate a decision tree. According to Rulequest Research a 10-classifier boosting will reduce the error rate by about 25% [15]. We have performed boosting in See5 by entering the number of desired trials. We have compared a 10-trial boosting run with a run without boosting at all. Pruning was not in use during the runs, to reduce disturbance from other functionality in See5. A sketch of the reweighting idea behind adaptive boosting follows.
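The following is a minimal sketch of the weight update used by AdaBoost-style boosting, the idea behind See5's adaptive boosting; it is our own illustration, not See5's actual code. Cases the current trial's tree misclassifies get a higher weight, so the next trial's tree concentrates on them.

// Sketch: one AdaBoost-style reweighting step over the training cases.
public class BoostingStep {

    static void reweight(double[] w, boolean[] misclassified) {
        double err = 0;
        for (int i = 0; i < w.length; i++) if (misclassified[i]) err += w[i];
        double alpha = 0.5 * Math.log((1 - err) / err);   // this trial's vote weight
        double sum = 0;
        for (int i = 0; i < w.length; i++) {
            w[i] *= Math.exp(misclassified[i] ? alpha : -alpha);  // up-weight the errors
            sum += w[i];
        }
        for (int i = 0; i < w.length; i++) w[i] /= sum;   // renormalize to a distribution
    }

    public static void main(String[] args) {
        double[] w = {0.25, 0.25, 0.25, 0.25};            // uniform start, four cases
        boolean[] miss = {false, true, false, false};     // the trial missed case 1
        reweight(w, miss);
        for (double x : w) System.out.printf("%.3f ", x); // prints 0.167 0.500 0.167 0.167
    }
}

After all trials, the trees vote with weight alpha, so the more accurate trials count more in the final classification.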
8.2.1 A run with 10 boosting trials

The decision tree creation with boosting reduces the error percentage to 0 on the training set. The boosted classifier uses all four attributes for classifying the flowers. On the test data we get one error: one of the iris versicolors is classified as an iris virginica.

Evaluation on training data (100 cases):

Trial       Decision Tree
-----     ----------------
          Size      Errors

   0        4       3( 3.0%)
   1        4       7( 7.0%)
   2        3      11(11.0%)
   3        3      12(12.0%)
   4        5       3( 3.0%)
   5        3       5( 5.0%)
   6        5       2( 2.0%)
   7        3      36(36.0%)
   8        5      38(38.0%)
   9        4       4( 4.0%)
boost               0( 0.0%)   <<

        (a)   (b)   (c)    <-classified as
       ----  ----  ----
         33                (a): class Iris-setosa
               33          (b): class Iris-versicolor
                     34    (c): class Iris-virginica

Attribute usage:

    100%  petalLength
    100%  petalWidth
     67%  sepalWidth
     40%  sepalLength

Evaluation on test data (50 cases):

Trial       Decision Tree
-----     ----------------
          Size      Errors

   0        4       1( 2.0%)
   1        4       3( 6.0%)
   2        3       5(10.0%)
   3        3       4( 8.0%)
   4        5       2( 4.0%)
   5        3       3( 6.0%)
   6        5       2( 4.0%)
   7        3      23(46.0%)
   8        5      21(42.0%)
   9        4       1( 2.0%)
boost               1( 2.0%)   <<

        (a)   (b)   (c)    <-classified as
       ----  ----  ----
         17                (a): class Iris-setosa
               16     1    (b): class Iris-versicolor
                     16    (c): class Iris-virginica

Time: 0.0 secs

8.2.2 A run without boosting

Without boosting, See5 generates a tree with three errors on the training set. The tree uses only two attributes - and generates one error on the test set.

Evaluation on training data (100 cases):

Decision Tree
----------------
Size      Errors

  4       3( 3.0%)   <<

        (a)   (b)   (c)    <-classified as
       ----  ----  ----
         33                (a): class Iris-setosa
               31     2    (b): class Iris-versicolor
                1    33    (c): class Iris-virginica

Attribute usage:

    100%  petalLength
     67%  petalWidth

Evaluation on test data (50 cases):

Decision Tree
----------------
Size      Errors

  4       1( 2.0%)   <<

        (a)   (b)   (c)    <-classified as
       ----  ----  ----
         17                (a): class Iris-setosa
               16     1    (b): class Iris-versicolor
                     16    (c): class Iris-virginica

Time: 0.0 secs

8.2.3 Comparison of runs

The run with 10 boosting trials created a classifier with fewer errors on the training set than the run without boosting. But on the test set both decision trees performed with the same number of errors. The classifier generated with boosting also used a larger number of attributes than the tree generated without boosting. Since our data set is small, and since both trees produced the same number of errors on the test set, it is difficult to conclude whether the boosted tree is better than the tree generated without boosting. The lower percentage of errors on the training set speaks in the boosted tree's favour, but the higher number of attributes/nodes speaks in the unboosted tree's favour, according to Occam's principle (mentioned in section 5.2).

8.3 Pruning techniques

Pruning is a technique in machine learning for reducing the size of a decision tree by going through each subtree and deciding whether it should be replaced with a leaf or a sub-branch. Turning off Global Pruning in See5 generally results in larger decision trees and rulesets [15]. We tested the pruning technique in See5 by turning off the "Global Pruning" option (ref. figure 3.3). We tested this on the complete iris data set with 150 instances. We then looked at the size of the decision tree and the number of rules, and how they differ with and without pruning.

Figure 8.3: Test without pruning showing the size of the tree.

Figure 8.4: Test without pruning showing number of rules.

Figure 8.5: Test with pruning showing the size of the tree.

Figure 8.6: Test with pruning showing number of rules.

The results we obtained by turning the pruning option on and off are presented in the figures above. We see that the number of rules is the same both with and without pruning. The size of the decision tree differs - from 5 leaves without pruning to 4 leaves with pruning.

8.4 Algorithms which handle missing attributes

The iris data set has been modified for this task to illustrate a solution to this general problem. 50 of the attribute values have been replaced randomly with a question mark, which is Weka's symbol for missing attributes. This operation was done by shuffling the data in an .arff file using the Java program in section 10.1. Then question marks were added to one of the attributes in the first 50 instances in the file, and the file was shuffled again.

Provost and Saar-Tsechansky mention three possible treatments of missing values in a data set [14]:

1. The instances with missing values may be discarded. This works best if the values are missing at random [14].

2. The values may be acquired. This method is not used by any algorithm, but involves for example buying the missing values in the set [14].

3. Imputation. This method involves estimation of the missing value(s). There are several techniques used by different algorithms to estimate values [14].

We used Weka's J48 algorithm during the test. We also used Weka's filter "ReplaceMissingValues" to handle the missing values. The filter replaces all missing values for, in this case, numeric attributes with the means from the training data [16]. A sketch of how the filter can be applied programmatically is shown below.
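A minimal sketch of applying the filter through Weka's Java API rather than the Explorer GUI; the file name is our own placeholder, and the classifier step is included only to show where the filtered data would be used.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

// Sketch: replace '?' values with attribute means, then train J48.
public class ImputeAndTrain {

    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("iris-missing.arff");   // placeholder name
        raw.setClassIndex(raw.numAttributes() - 1);

        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(raw);                             // learn the means
        Instances filled = Filter.useFilter(raw, filter);       // apply the filter

        J48 tree = new J48();
        tree.buildClassifier(filled);
        System.out.println(tree);                               // prints the tree
    }
}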
Figures 8.7 and 8.8 illustrate the output from the tree construction. In the first case, the ReplaceMissingValues filter was not in use. In the second run, the filter was applied. Both constructions use cross-validation for testing on the full data set. To test the tree generated in Weka, it was not enough to just perform cross-validation during creation, since this functionality only tests with the data generated by Weka itself. Instead we recreated the decision tree in our Java program, which is able to create confusion matrices. This program is shown in section 10.2. Some code was added to perform the test with the decision tree generated in Weka on the 150 instances of data with no missing values. The code is shown in section 10.3.

Figure 8.7: Output from test with no replacement of missing values.

Figure 8.8: Output from test with replacement of missing values by applying the filter 'ReplaceMissingValues' in Weka.

The tree created while using the ReplaceMissingValues filter is shown in figure 8.9.

Figure 8.9: Decision tree created while using the ReplaceMissingValues filter.

The generated decision tree managed surprisingly well when tested with the Java program. Figure 8.10 shows the tree's confusion matrix, with only a few errors when tested on the full data set without missing data.

Figure 8.10: Confusion matrix for the decision tree where the values were generated by Weka.

Chapter 9 Evaluation

The main difficulty we had while working on this assignment was the small number of records in our dataset. This contributed to less precise predictions and complications, as shown in several of the operations. We believe that increasing the training data would help generate more precise prediction models and produce a better result.

Bibliography

[1] Leo Breiman. Random forests. http://oz.berkeley.edu/~breiman/randomforest2001.pdf, 2001.

[2] Wray Buntine and Tim Niblett. Machine learning: A further comparison of splitting rules for decision-tree induction. http://link.springer.com/article/10.1023/A:1022686419106, 1992.

[3] Appel et al. Quickly boosting decision trees - pruning underachieving features early. http://jmlr.org/proceedings/papers/v28/appel13.pdf, 2013.

[4] Murthy et al. OC1: A randomized algorithm for building oblique decision trees. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.6068&rep=rep1&type=pdf, 1993.

[5] Richard Kirkby et al. Attribute-relation file format (ARFF). http://www.cs.waikato.ac.nz/~ml/weka/arff.html, 2008.

[6] A. Kusiak. Decision tree algorithm. http://read.pudn.com/downloads110/ebook/457186/C4.5%20%E5%86%B3%E7%AD%96%E6%A0%91/DecisionT1.pdf.

[7] John Mingers. Machine learning: An empirical comparison of selection measures for decision-tree induction. http://link.springer.com/article/10.1007/BF00116837, 1988.

[8] John Mingers. Machine learning: An empirical comparison of pruning methods for decision tree induction. http://link.springer.com/article/10.1023/A:1022604100933, 1989.

[9] Tom M. Mitchell. Machine Learning. MIT Press and McGraw-Hill Companies, Inc., 1997.

[10] No named author. Data mining lab. http://cecs.louisville.edu/datamining/dmlab/resources.html.

[11] No named author. Iris data set. http://archive.ics.uci.edu/ml/datasets/Iris.

[12] No named author. C5.0: An informal tutorial. http://www-ia.hiof.no/~rolando/ML/c50tutorial.html#.names, 2003.
[13] M. Nashville. Data mining with decision trees. http://decisiontrees.net/decision-trees-tutorial/tutorial-3-occams-razor/.

[14] F. Provost and M. Saar-Tsechansky. Handling missing values when applying classification models. http://www2.mccombs.utexas.edu/faculty/maytal.saar-tsechansky/ResearchPapers/saar-tsechansky07a.pdf, 2007.

[15] Rulequest Research. See5: An informal tutorial. http://www.rulequest.com/see5-win.html#OTHER, 2013.

[16] Sourceforge. Class ReplaceMissingValues. http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/ReplaceMissingValues.html.

[17] V. Tiwari, M. Srivastava and V. Pandey. Comparative investigation of decision tree algorithms on iris data. http://warse.org/pdfs/2013/ijacst03232013.pdf, 2013.

Chapter 10 Appendix

10.1 Java code created to shuffle .arff file

10.1.1 main.java

package Task01;

public class main {

    public static void main(String[] args) {
        String filePath = "C:\\Users\\amund_000\\Desktop\\Maskinlæring\\Project1\\iris.arff";
        FileOrganizer fo = new FileOrganizer(filePath);

        if (fo.initFile()) {
            ArrayShuffle as = new ArrayShuffle();
            if (as.shuffle(fo.getListsWithFlowerInfo())) {
                fo.store(as.getFlowerLists());
            }
            else
                System.out.println("Couldn't shuffle file.");
        }
        else
            System.out.println("Couldn't get/init file.");
    }
}

10.1.2 FileOrganizer.java

package Task01;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;

public class FileOrganizer {

    private String path;

    @SuppressWarnings("rawtypes")
    private ArrayList[] listsOfFileInfo;

    public FileOrganizer(String filePath) {
        this.listsOfFileInfo = new ArrayList[2];
        this.listsOfFileInfo[0] = new ArrayList<String>();  // header lines, up to and including @DATA
        this.listsOfFileInfo[1] = new ArrayList<String>();  // data lines
        this.path = filePath;
    }

    @SuppressWarnings({ "resource", "unchecked" })
    public boolean initFile() {
        FileReader fr;
        BufferedReader br;
        try {
            fr = new FileReader(this.path);
            br = new BufferedReader(fr);
            boolean storingHeader = true;
            String line;
            try {
                line = br.readLine();
                while (line != null) {
                    if (storingHeader) {
                        this.listsOfFileInfo[0].add(line);
                    }
                    else {
                        this.listsOfFileInfo[1].add(line);
                    }
                    if (line.startsWith("@DATA"))
                        storingHeader = false;
                    line = br.readLine();
                }
            } catch (IOException e) {
                return false;
            }
        } catch (FileNotFoundException e) {
            return false;
        }
        return true;
    }

    @SuppressWarnings("unchecked")
    public ArrayList<String>[] getListsWithFlowerInfo() {
        return (ArrayList<String>[]) this.listsOfFileInfo;
    }

    public void store(ArrayList<String>[] flowerLists) {
        try {
            PrintWriter writer = new PrintWriter("C:\\Users\\amund_000\\Desktop\\Maskinlæring\\Project1\\shuffled.arff", "UTF-8");
            for (int i = 0; i < flowerLists.length; i++) {
                for (int j = 0; j < flowerLists[i].size(); j++) {
                    writer.write(flowerLists[i].get(j));
                    writer.println();
                }
            }
            System.out.println("Success!");
            writer.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}

10.1.3 ArrayShuffle.java

package Task01;

import java.util.ArrayList;

public class ArrayShuffle {

    private ArrayList<String> shuffledList;
    private ArrayList<Integer> dataList;
    private ArrayList<String>[] finalList;

    public ArrayShuffle() {
        this.shuffledList = new ArrayList<String>();
    }

    public ArrayList<String>[] getFlowerLists() {
        return this.finalList;
    }
    public boolean shuffle(ArrayList<String>[] listsWithFlowerInfo) {
        int lengthOfFlowerData = listsWithFlowerInfo[1].size();
        if (lengthOfFlowerData > 0) {
            dataList = new ArrayList<Integer>();
            for (int i = 0; i < lengthOfFlowerData; i++) {
                dataList.add(i);
            }
            // Draw random indexes until none remain; this permutes the data lines.
            while (dataList.size() > 0) {
                int randomIndex = (int) Math.floor(Math.random() * dataList.size());
                this.shuffledList.add(listsWithFlowerInfo[1].get(dataList.remove(randomIndex)));
            }
            listsWithFlowerInfo[1] = shuffledList;
            this.finalList = listsWithFlowerInfo;
        }
        else
            return false;
        return true;
    }
}

10.2 A Java program for testing a small decision tree

10.2.1 Main05b.java

package Task05b;

import java.util.ArrayList;

public class Main05b {

    private static final String
            //DATA_PATH = "C:\\Users\\amund_000\\Desktop\\Maskinlæring\\testFiles\\test4\\irisTestData.data",
            DATA_PATH = "C:\\Users\\amund_000\\Desktop\\Maskinlæring\\testFiles\\test4\\iris.data",
            TREE_PATH = "C:\\Users\\amund_000\\Desktop\\Maskinlæring\\testFiles\\test4\\iris.treeFile";

    private static ArrayList<ArrayList<Float>> data;

    public static void main(String[] args) {
        _FileReader fileReader = new _FileReader(DATA_PATH);
        if (fileReader.fileHasBeenRead())
            data = fileReader.getData();
        //testData();

        DecisionTree dt;
        //We could have used tree data read from a file, but since this task is
        //somewhat small, we have chosen not to do that.
        if (fileReader.ReadTreeData(TREE_PATH)) {
            dt = new DecisionTree();
            dt.createOrdinaryDecisionTree();
            FlowerClassifier fc = new FlowerClassifier(dt);
            fc.classifyData(data, 0);

            dt.createDecisionTreeWithoutRule1();
            fc.setDecisionTree(dt);
            fc.classifyData(data, 1);

            dt.createDecisionTreeWithoutRule2();
            fc.setDecisionTree(dt);
            fc.classifyData(data, 2);

            dt.createDecisionTreeWithoutRule3();
            fc.setDecisionTree(dt);
            fc.classifyData(data, 3);
        }
    }

    @SuppressWarnings("unused")
    private static void testData() {
        for (int i = 0; i < data.size(); i++) {
            for (int j = 0; j < data.get(i).size(); j++) {
                System.out.print(data.get(i).get(j) + " ");
            }
            System.out.println();
        }
    }
}

10.2.2 FileReader.java

package Task05b;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class _FileReader {

    private ArrayList<ArrayList<Float>> flowerData;
    private boolean dataFileRead;
    private String treeData;

    public _FileReader(String dataPath) {
        flowerData = new ArrayList<ArrayList<Float>>();
        dataFileRead = false;
        readFile(dataPath);
    }

    public boolean fileHasBeenRead() {
        return this.dataFileRead;
    }

    private void readFile(String dataPath) {
        try {
            FileReader fr = new FileReader(dataPath);
            BufferedReader br = new BufferedReader(fr);
            String line = br.readLine();
            /*
             * Since we know the content of the file, there is no need to check
             * each line, for example for formatting errors, in this small example code.
             */
            while (line != null) {
                getDataFromLine(line);
                line = br.readLine();
            }
            br.close();
            dataFileRead = true;
        } catch (FileNotFoundException e) {
            dataFileRead = false;
        } catch (IOException e) {
            dataFileRead = false;
        }
    }

    @SuppressWarnings("unchecked")
    private void getDataFromLine(String line) {
        ArrayList<Float> arrList = new ArrayList<Float>();
        String[] dataArr = line.split(",");
        for (int i = 0; i < dataArr.length; i++)
            arrList.add(convertStringToFloat(dataArr[i]));
        this.flowerData.add((ArrayList<Float>) arrList.clone());
    }

    private Float convertStringToFloat(String string) {
        try {
            float f = Float.parseFloat(string);
            return f;
        } catch (NumberFormatException e) {
            System.out.println("Could not format the data correctly.\nExiting program.");
            System.exit(0);
        }
        return 0f;
    }

    public ArrayList<ArrayList<Float>> getData() {
        return this.flowerData;
    }

    public String getTreeData() {
        return this.treeData;
    }

    //The tree file has been modified to make creation of a tree easier
    //to accomplish.
    public boolean ReadTreeData(String path) {
        treeData = "";
        try {
            FileReader fr = new FileReader(path);
            BufferedReader br = new BufferedReader(fr);
            String line = br.readLine();
            while (line != null) {
                treeData += line;
                line = br.readLine();
            }
            br.close();
            return true;
        } catch (FileNotFoundException e) {
            return false;
        } catch (IOException e) {
            return false;
        }
    }
}

10.2.3 DecisionTree.java

package Task05b;

public class DecisionTree {

    private Rule rootRule;

    //The decision tree could have been created based on the data read from file;
    //instead the tree is hard coded for this small example.
    public void createOrdinaryDecisionTree() {
        Rule r3 = new Rule();
        r3.createRule(2, 4.9f, false, false, null, null, 2, 3);
        Rule r2 = new Rule();
        r2.createRule(3, 1.7f, true, false, r3, null, 0, 3);
        Rule r1 = new Rule();
        r1.createRule(3, .5f, false, true, null, r2, 1, 0);
        this.rootRule = r1;
    }

    public void createDecisionTreeWithoutRule1() {
        Rule r3 = new Rule();
        r3.createRule(2, 4.9f, false, false, null, null, 2, 3);
        Rule r2 = new Rule();
        r2.createRule(3, 1.7f, true, false, r3, null, 0, 3);
        this.rootRule = r2;
    }

    public void createDecisionTreeWithoutRule2() {
        Rule r3 = new Rule();
        r3.createRule(2, 4.9f, false, false, null, null, 2, 3);
        Rule r1 = new Rule();
        r1.createRule(3, .5f, false, true, null, r3, 1, 0);
        this.rootRule = r1;
    }

    public void createDecisionTreeWithoutRule3() {
        Rule r2 = new Rule();
        r2.createRule(3, 1.7f, false, false, null, null, 2, 3);
        Rule r1 = new Rule();
        r1.createRule(3, .5f, false, true, null, r2, 1, 0);
        this.rootRule = r1;
    }

    public Rule getRootRule() {
        return this.rootRule;
    }
}

10.2.4 Rule.java

package Task05b;

import java.util.ArrayList;

public class Rule {

    private boolean hasLeftRule, hasRightRule;
    private Rule leftRule, rightRule;
    private int leftClass, rightClass;
    private int attributeNumber;
    private float threshold;

    public void createRule(int attribute, float thresh, boolean hasLeft,
            boolean hasRight, Rule left, Rule right, int classL, int classR) {
        this.attributeNumber = attribute;
        this.threshold = thresh;
        this.hasLeftRule = hasLeft;
        this.hasRightRule = hasRight;
        this.leftClass = classL;
        this.rightClass = classR;
        if (this.hasLeftRule)
            this.leftRule = left;
        if (this.hasRightRule)
            this.rightRule = right;
    }

    public int checkRule(ArrayList<Float> arrList) {
        if (arrList.get(attributeNumber) <= threshold) {
            if (hasLeftRule)
                return this.leftRule.checkRule(arrList);
            else
                return leftClass;
        }
        else {
            if (hasRightRule)
                return this.rightRule.checkRule(arrList);
            else
                return rightClass;
        }
    }
}

10.2.5 FlowerClassifier.java
package Task05b;

import java.util.ArrayList;

public class FlowerClassifier {

    private DecisionTree decisionTree;
    private int[][] confusionMatrix;

    public FlowerClassifier(DecisionTree dt) {
        this.decisionTree = dt;
    }

    public void setDecisionTree(DecisionTree dt) {
        this.decisionTree = dt;
    }

    public void classifyData(ArrayList<ArrayList<Float>> data, int ruleNotUsed) {
        if (ruleNotUsed == 0)
            System.out.print("Confusion matrix to decision tree with all rules\n------------------------------------------------\n");
        if (ruleNotUsed == 1)
            System.out.print("Confusion matrix to decision tree without rule 1\n------------------------------------------------\n");
        if (ruleNotUsed == 2)
            System.out.print("Confusion matrix to decision tree without rule 2\n------------------------------------------------\n");
        if (ruleNotUsed == 3)
            System.out.print("Confusion matrix to decision tree without rule 3\n------------------------------------------------\n");

        confusionMatrix = new int[3][3];
        for (int i = 0; i < confusionMatrix.length; i++)
            for (int j = 0; j < confusionMatrix[i].length; j++)
                confusionMatrix[i][j] = 0;

        for (int i = 0; i < data.size(); i++)
            classify(data.get(i));

        for (int i = 1; i <= 3; i++)
            System.out.print(i + "\t");
        System.out.println();

        int classNumber = 1;
        for (int i = 0; i < confusionMatrix.length; i++) {
            for (int j = 0; j < confusionMatrix[i].length; j++)
                System.out.print(confusionMatrix[i][j] + "\t");
            System.out.println("\t | class " + classNumber);
            classNumber++;
        }
        System.out.println();
    }

    private void classify(ArrayList<Float> arrayList) {
        //Row: the true class (last value on the data line); column: the predicted class.
        confusionMatrix[(int) ((float) arrayList.get(arrayList.size() - 1)) - 1]
                [this.decisionTree.getRootRule().checkRule(arrayList) - 1]++;
    }
}

10.3 Code to create a decision tree based on a tree handling missing data

public void createDecisionTreeWithoutMissingValues() {
    Rule r1 = new Rule();
    r1.createRule(2, 4.9f, false, false, null, null, 2, 3);
    Rule r2 = new Rule();
    r2.createRule(3, 1.5f, true, false, r1, null, 0, 2);
    Rule r3 = new Rule();
    r3.createRule(2, 4.7f, false, true, null, r2, 2, 0);
    Rule r4 = new Rule();
    r4.createRule(2, 1.9f, false, true, null, r3, 1, 0);
    Rule r5 = new Rule();
    r5.createRule(3, 1.7f, true, false, r4, null, 0, 3);
    Rule r6 = new Rule();
    r6.createRule(3, .4f, false, true, null, r5, 1, 0);
    this.rootRule = r6;
}