An Experimental Comparison of Methods for Dealing with Missing Values in Data Sets when Learning Bayesian Networks

Maciej Osakowicz (1) & Marek J. Druzdzel (1,2)

(1) Faculty of Computer Science, Bialystok University of Technology, Wiejska 45-A, 15-351 Bialystok, Poland, maciej.osakowicz@gmail.com (currently with X-code Sp. z o.o., Klaudyny 21/4, 01-684 Warszawa)
(2) Decision Systems Laboratory, School of Information Sciences and Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, USA, marek@sis.pitt.edu

Extended Abstract

Decision-analytic modeling tools, such as Bayesian networks (Pearl, 1988), are increasingly finding successful practical applications. Their major strengths are sound theoretical foundations, the ability to combine existing data with expert knowledge, and the intuitive framework of directed graphs. Bayesian networks can be built from expert opinion, but they can also be learned from data. A major problem with the latter is that many practical data sets contain missing values, which defy simple counting when learning conditional probability distributions (Little & Rubin, 1987). Indeed, of the 189 real data sets available through the University of California at Irvine's Machine Learning Repository (2015), 58 (around 31%) contain missing values.

Several methods have been proposed for dealing with missing values. However, there is no comprehensive study that compares them on a variety of practical data sets. There are several questions that one might ask in this context, for example, which of the methods is the fastest. However, learning is part of a preparatory phase that is typically performed only once. Of greater importance is the quality of the resulting models, expressed by criteria such as accuracy, area under the ROC curve, or model calibration. Of these, model accuracy, typically measured on classification tasks, seems to be the most popular.

In this paper, we report the results of a systematic comparison of eight methods for dealing with missing values: (1) introducing, for each variable with missing values, an additional state that represents a missing value; (2) removing all records with missing values; (3) replacing each missing value by a state chosen uniformly at random from among the possible states of the variable; (4) replacing each missing value by the most likely value; (5) Similar Response Pattern Imputation (SRPI) (Joreskog & Sorbom, 1996); (6) replacing each missing value by the average value of the variable (Raymond, 1986); (7) replacing each missing value by the class average; and (8) the k-nearest neighbors method, tested in five variants (1-NN, 5-NN, 1%-NN, 10%-NN, and 25%-NN) (Fix & Hodges, 1951). An illustrative sketch of several of these strategies follows below.

Our comparison is based on 11 data sets from the UCI Machine Learning Repository: Adult, Thyroid Disease, Mammographic Mass, Cylinder Bands, Congressional Voting Records, Hepatitis, Echocardiogram, Soybean, Horse Colic, Heart Disease, and Annealing. Each of these sets contains missing values and each contains a class variable, which makes them suitable for a comparison of classification accuracy. We conducted the experiments using two network structures: (1) an unrestricted, general Bayesian network, and (2) a naïve Bayesian network, i.e., one in which all feature variables are independent conditional on the class variable.
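To make the second structure concrete, the naïve Bayes assumption implies the standard class-posterior factorization (a textbook identity, not a result of this paper):

    P(C \mid f_1, \ldots, f_n) \;\propto\; P(C) \prod_{i=1}^{n} P(f_i \mid C)

where C is the class variable and f_1, ..., f_n are the observed feature values; classification accuracy is then measured by comparing the most probable class with the recorded one.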
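The sketch below illustrates, in Python, three of the simpler imputation strategies listed above: the most likely value, the per-variable average, and k-NN. It is a minimal illustration only; the experiments themselves were implemented on top of SMILE, and the helper names, the pandas encoding, and the assumption of numerically coded attributes in the k-NN variant are ours.

    # Illustrative sketches of three imputation strategies compared in the
    # paper: most likely value (mode), per-variable average, and k-nearest
    # neighbors.  Not the implementation used in the experiments (those were
    # built on SMILE); helper names and the pandas encoding are assumptions.
    import pandas as pd

    def impute_most_likely(df: pd.DataFrame) -> pd.DataFrame:
        # Replace each missing value by the most frequent state of its variable.
        return df.fillna(df.mode().iloc[0])

    def impute_average(df: pd.DataFrame) -> pd.DataFrame:
        # Replace each missing value in a numeric column by the column average.
        return df.fillna(df.mean(numeric_only=True))

    def impute_knn(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
        # Replace the missing entries of each incomplete record by the average
        # of the k complete records closest to it.  Assumes numerically coded
        # attributes; distance is Euclidean over the attributes that are
        # observed in the incomplete record.
        complete = df.dropna()
        out = df.copy()
        for idx, row in df[df.isna().any(axis=1)].iterrows():
            observed = row.notna()
            dists = ((complete.loc[:, observed] - row[observed]) ** 2).sum(axis=1)
            neighbors = complete.loc[dists.nsmallest(k).index]
            for col in row.index[row.isna()]:
                out.at[idx, col] = neighbors[col].mean()
        return out

On a purely discrete data set one would first encode the states numerically, or replace neighbors[col].mean() with neighbors[col].mode().iloc[0] to obtain a categorical (majority-vote) variant of k-NN imputation.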
Our results support the following conclusions: (1) In most cases, the differences among the methods in terms of the classification accuracy of the resulting Bayesian networks are minimal; none of the methods stands out as consistently better than the others on all data sets. (2) The differences among the methods were smaller for naïve Bayes structures than for unrestricted structures. (3) Replacing missing values by an additional state seems to perform the worst among the tested methods. (4) Removal of records with missing data may work well when the number of records with missing values is small (e.g., fewer than 10%). In the case of several of our data sets, this method removed too many records to allow reliable learning of the parameters from the remaining records. (5) Replacing missing values by the average is a simple, fast method that leads to reasonable accuracy. (6) The k-NN method consistently leads to good results, although only when its various variants, i.e., different values of k, are considered. It is the most computationally intensive of the methods tested in our experiment, which may cause problems for very large data sets.

Acknowledgments

Partial support for this work has been provided by the National Institutes of Health under grant number U01HL101066-01. The implementation of this work is based on SMILE, a Bayesian learning and inference engine developed at the Decision Systems Laboratory and available at http://genie.sis.pitt.edu/.

References

E. Fix & J. L. Hodges (1951). Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical report, USAF School of Aviation Medicine, Randolph Field, TX.

University of California Irvine Machine Learning Repository (2015). https://archive.ics.uci.edu/ml/

K. Joreskog & D. Sorbom (1996). PRELIS 2: User's Reference Guide. Scientific Software International, Inc., Lincolnwood, IL, USA, 3rd edition.

R. J. A. Little & D. B. Rubin (1987). Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York, NY, USA.

J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Francisco, CA.

M. Raymond (1986). Missing data in evaluation research. Evaluation and the Health Professions, 9(4):395-420.