Optimal Training Sets for Bayesian Prediction of MeSH® Assignment
Downloaded from jamia.bmj.com on June 9, 2014 - Published by group.bmj.com

Journal of the American Medical Informatics Association, Volume 15, Number 4, July / August 2008

Research Paper

SUNGHWAN SOHN, PHD, WON KIM, PHD, DONALD C. COMEAU, PHD, W. JOHN WILBUR, MD, PHD

Abstract

Objectives: The aim of this study was to improve naïve Bayes prediction of Medical Subject Headings (MeSH) assignment to documents using optimal training sets found by an active learning-inspired method.

Design: The authors selected 20 MeSH terms whose occurrences cover a range of frequencies. For each MeSH term, they found an optimal training set, a subset of the whole training set. An optimal training set consists of all documents including a given MeSH term (C1 class) and those documents not including a given MeSH term (C−1 class) that are closest to the C1 class. These small sets were used to predict MeSH assignments in the MEDLINE® database.

Measurements: Average precision was used to compare MeSH assignment using the naïve Bayes learner trained on the whole training set, optimal sets, and random sets. The authors compared 95% lower confidence limits of average precisions of naïve Bayes with upper bounds for average precisions of a K-nearest neighbor (KNN) classifier.

Results: For all 20 MeSH assignments, the optimal training sets produced nearly 200% improvement over use of the whole training sets. In 17 of those MeSH assignments, naïve Bayes using optimal training sets was statistically better than a KNN. In 15 of those, optimal training sets performed better than optimized feature selection. Overall naïve Bayes averaged 14% better than a KNN for all 20 MeSH assignments. Using these optimal sets with another classifier, C-modified least squares (CMLS), produced an additional 6% improvement over naïve Bayes.

Conclusion: Using a smaller optimal training set greatly improved learning with naïve Bayes. The performance is superior to a KNN.
The small training set can be used with other sophisticated learning methods, such as CMLS, where using the whole training set would not be feasible.

J Am Med Inform Assoc. 2008;15:546–553. DOI 10.1197/jamia.M2431.

Affiliation of the authors: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD.

Supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.

The authors thank the reviewers for valuable feedback and suggestions.

Correspondence: Dr. Sunghwan Sohn, National Library of Medicine, Building 38A, 6N611C, 8600 Rockville Pike, Bethesda, MD 20894; e-mail: sohn@ncbi.nlm.nih.gov.

Received for review: 03/09/07; accepted for publication: 02/07/08.

Introduction

MEDLINE is a large collection of bibliographic records of articles in the biomedical literature maintained by the National Library of Medicine (NLM). In late 2006, MEDLINE included about 16.5 million references, which have been processed by human indexing. Each MEDLINE reference is assigned a number of relevant medical subject headings (MeSH). MeSH is a controlled vocabulary produced by the NLM and used for indexing, cataloging, and searching biomedical and health-related information and documents (see http://www.nlm.nih.gov/mesh/ for details of MeSH).

Human indexing is costly and requires intensive labor. The indexing cost at the NLM consists of data entry, NLM staff indexing and revising, contract indexing, equipment, and telecommunication costs.1 The annual budget for contracts to perform MEDLINE indexing including purchase orders at four foreign centers is several million dollars (James Marcetich, Head, NLM Index Section, personal communication, August 2007). NLM’s indexers are highly trained in MEDLINE indexing practice as well as in a subject domain(s) in the MEDLINE database.
Since 1990, the MEDLINE database has grown faster than before with more documents available in electronic form. The cost of human indexing of the biomedical literature is high, so many attempts have been made to provide automatic indexing.1-8 The NLM Indexing Initiative is a research effort to explore indexing methodologies for semiautomated user-assisted indexing as well as fully automated indexing applications.1,2 This project has produced a system for recommending indexing terms for arbitrary biomedical text, especially titles and abstracts of journal articles. The system has been in use by library indexers since September 2002. The system consists of several methods of discovering MeSH terms that are combined to produce an ordered list of recommended indexing terms. K-nearest neighbor (KNN) is used as one method to rank the MeSH terms that are candidates for indexing a document.5 This study sought to investigate the naïve Bayes learner as an alternative to KNN.

For the MEDLINE database, all references have titles and about half have abstracts. This is more than 16.5 gigabytes of data. With this much data, it is not realistic to run the most sophisticated machine learning algorithms. Only the simplest and most efficient algorithms can process this data on high-end commodity servers (2 CPUs, 4 GB of memory). Naïve Bayes has the efficiency to work with all of MEDLINE.

In this study, the performance of using naïve Bayes for automatic MeSH assignment was investigated. One challenge is that many MeSH terms occur in a very small portion of the MEDLINE database, and the corresponding naïve Bayes performance is poor.
This poor performance is related to a significantly imbalanced class distribution: very few documents include a given MeSH term (C1 class) and a very large number of documents do not include it (C−1 class). In Bayesian learning for binary classification, a large preponderance of C−1 documents dominates the decision process and deteriorates classification performance on the unseen test set. Careful example selection must be considered to solve this problem and improve the classification performance.

In this article, we perform example selection that starts from a small training set (STS) and iteratively adds informative examples selected from the whole training set (WTS) into the STS until an optimum is reached. Because a given MeSH term occurs in only a small portion of the WTS, all C1 documents are placed in the initial STS. Then, the C−1 documents most similar to C1 documents are iteratively added to the STS. As the size of the STS increases, the STS that produces the best result on the training set by a leave-one-out cross-validation method is selected as our optimal training set (OTS). Although we use the word optimal, the selected set might not be a global optimum because we find this set by a greedy approach. The detailed procedure will be explained under Example Selection in the Methods section. Naïve Bayes using this OTS provides a superior alternative to KNN, which is one method currently used in the NLM Indexing Initiative1,2 for MeSH prediction.

Traditionally, example selection has been used for three major reasons.9 The first reason is to control computational cost. Standard support vector machines (SVM)10 require long training times that scale superlinearly with the number of features and become prohibitive with large data sets. Pavlov et al.11 used boosting to combine multiple SVMs trained on small subsets.
Boley and Cao12 reduced the training data by partitioning the training set into disjoint clusters and replacing any cluster containing only nonsupport vectors with its representative. Quinlan13 developed a windowing technique to reduce the time for constructing decision trees for very large training sets. A decision tree was built on a randomly selected subset from the training set and tested on the remaining examples that were not included in this subset. Then, the selected misclassified examples were added to the initial subset, and a tree was constructed from the enlarged training set and tested on the remaining set. This cycle was repeated until all the remaining examples were correctly classified. For us, computational cost is not a concern because naïve Bayes can be trained efficiently on all of MEDLINE.

The second reason for example selection is to reduce labeling cost when labeling all examples is too expensive. Active learning is a way to deal with this problem. It uses current knowledge to predict the best choice of unknown data to label next in an attempt to improve the efficiency of learning. Active learning starts with a small number of labeled data as the initial training set. It then repeatedly cycles through learning from the training set, selecting the most informative data to be labeled, labeling them by a human expert, and adding newly labeled data to the training set. These informative documents may be near the decision boundary14,15 or they may be the documents producing maximal disagreement among a committee of classifiers.16,17 The emphasis is on the best possible learning from the fewest possible documents.14,18,19

While studying active learning methods, we saw instances where training on a small training set produced better results than training on the whole training set. Others have seen the same effect.15,20,21 However, they observed this effect only for some limited cases and the improvement was small (generally less than 10%).
By contrast, we find a large improvement for all cases, nearly 200% on average. Also, we do not need active learning to avoid labeling because the entire training set is labeled. We use an active learning-inspired approach not to minimize labeling cost, but to maximize performance.

The third reason is to improve learning by focusing on the most relevant examples. Our example selection belongs to this category. Boosting22 can also be implemented as a type of example selection. Wilbur23 used staged Bayesian retrieval on a subset of MEDLINE, and it outperformed a more standard boosting approach. He initially trained naïve Bayes on the whole training set and used it to select examples that had a higher probability of belonging to a small specialty database. Both the selected examples and the small specialty data set were used as a training set for the second stage classifier. Then he combined the two classifiers to obtain the best performance. His method seems to be similar to our example selection, but he did not know about the poor performance of the naïve Bayes classifier on the whole MEDLINE database. He used a relatively small subset of MEDLINE and saw only a small (less than 10%) improvement in performance. What he did was like the first round of optimization to obtain the OTS in our method. By contrast, we iteratively perform example selection to reach an optimum for a single classifier and find much greater improvement.

Example selection to deal with the imbalanced class problem has previously been proposed as a method to improve learning. Various strategies have been suggested to tackle this problem. Sampling to balance examples between the majority and minority classes is a commonly used method.24 Upsampling randomly selects examples with replacement from the minority class until the size of the minority class matches that of the majority class. It does not gain information about the minority class, but increases the misclassification cost of the minority class.
Alternatively, one can directly assign a larger misclassification cost for the minority class than that of the majority class.25,26 Down-sampling eliminates examples from the majority class until it balances with the minority class. Examples to be eliminated can be selected randomly or focused on those farther away from the minority class. Others, using clustering and various other algorithms, have attempted to reflect the character of the majority class in a fair manner.27 Down-sampling may lose information from the majority class and risks harming performance. Our method is conceptually similar to focused down-sampling. In focused down-sampling the criteria are set a priori and then examples are selected to balance class size. However, we do not aim for a balanced class size in our OTS. We explicitly adjust the subset of focused examples from the majority set iteratively until the best training is achieved.

Methods

Data Preparation

MEDLINE is a collection of references to articles in the biomedical literature. We used the titles and, where available, the abstracts. At the time of our experiment, MEDLINE included 16,534,506 references. For each MeSH term, the WTS was created by randomly selecting two-thirds of the documents from the C1 class and two-thirds from the C−1 class. This gives the WTS the same proportion of C1 and C−1 documents as in all of MEDLINE. The remaining documents served as our test set. Stop words were removed, but no stemming was performed. Using all single words and two-word phrases in the titles and abstracts provided 56,194,161 features. However, we used feature selection (Appendix A, online supplement available at www.jamia.org) and so not all of them were used in a naïve Bayes classifier. The MeSH terms are not used as features in the actual training and test process, but are only used to define classes.
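The data preparation described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the tiny stop list, the lowercase alphanumeric tokenization, and the choice to form two-word phrases from tokens that are adjacent after stop-word removal are all assumptions made for illustration.

```python
import random
import re

# Hypothetical stop list for illustration; the list used in the study is not given here.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def split_two_thirds(labels, seed=0):
    """Return (train_ids, test_ids), sampling two-thirds of each class at random.

    Sampling each class separately keeps the C1 : C-1 proportion of the
    training set equal to that of the full collection.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in (1, -1):
        ids = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(ids)
        cut = (2 * len(ids)) // 3
        train += ids[:cut]
        test += ids[cut:]
    return sorted(train), sorted(test)


def extract_features(text):
    """Return the set of single words and two-word phrases (no stemming)."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOP_WORDS]
    phrases = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return set(tokens) | phrases


# Toy usage: 30 C1 documents and 300 C-1 documents.
labels = [1] * 30 + [-1] * 300
train_ids, test_ids = split_two_thirds(labels)
features = extract_features("Effects of pentazocine in the rat")
```

In this toy run the training set holds 20 C1 and 200 C−1 documents, so the class ratio of the full collection is preserved, and each document reduces to a set of features whose presence or absence is all that a binary model needs.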
Classification Tasks

Our classification task was to predict which documents were assigned a particular MeSH term (C1 class) and which documents were not assigned that term (C−1 class). We selected 20 MeSH terms with the number of C1 class articles covering a wide frequency range: approximately 100,000, 50,000, 30,000, 20,000, 10,000, 5,000, 4,000, 3,000, 2,000, and 1,000 C1 class articles. All but one of these terms are leaf MeSH terms, and the appropriate documents can be searched for directly in PubMed. “Myocardial infarction” is an internal node of the MeSH hierarchy. The proper search is a union of the results of searching for it directly and searching for the terms below it in the hierarchy: “myocardial stunning” and “shock, cardiogenic.” A detailed explanation of MeSH can be found at http://www.nlm.nih.gov/mesh/.

Learning Methods

For our principal learner we used the naïve Bayes Binary Independence Model (BIM),28 in which a document is represented by a vector of binary attributes indicating presence or absence of features in the document (for details refer to Appendix A, online supplement). We made this choice because BIM can be trained rapidly and can efficiently handle a large amount of data.

C-modified least squares (CMLS)29 is a wide-margin classifier that has many properties in common with SVMs. However, its smooth loss function allows us to apply a gradient search method to optimize rapidly, and thus it can be applied to larger data sets. Although CMLS can be trained faster than an SVM, it is still impractical to apply CMLS to the WTS. However, we can run CMLS on the smaller OTS. Typically CMLS performs better than Bayes. The question is, how will it perform on an example set optimized for Bayesian learning?

Because the NLM Indexing Initiative1,2 currently uses a KNN method to aid MeSH prediction, it is valuable to compare our Bayes results on the OTS with the same KNN method.
The standard approach of a KNN classifier would compare all pairs of documents, one from the test set and one from the training set. This is a very expensive computation for a huge database such as MEDLINE. To reduce the computational cost we obtained the upper bounds for KNN average precision. For details refer to Appendix C (online supplement). The upper bounds of KNN were compared with 95% lower confidence limits of Bayes, which were obtained by Student’s t-test. Because a higher average precision is better, if the lower bound of the naïve Bayes method using the OTS is higher than the upper bounds we found for the KNN method, we can safely conclude that the naïve Bayes method using OTS is better than the KNN method.

Example Selection

To identify the optimal training set (OTS) for Bayesian learning, we followed a procedure that simulates active learning (Figure 1). This is not active learning because it requires the entire training set to be labeled before this example selection is performed.

Good results have been obtained from random downsampling of examples in other domains.24 To address questions of the size of our OTS versus the specific documents in that set, we used random sampling of the C−1 documents to create a random STS (Ran STS) with the same number of elements as the OTS. We then applied Bayes learning to these Ran STS.

Figure 1. Example selection algorithm.

* A score is defined in equation (A.5) in Appendix A (online supplement).

† For M, we used 1% of the number of C1 documents. For example, if the number of C1 documents is 100,000, then we use M = 1,000. This percentage was determined experimentally to balance the number of iterations to find the OTS with the chance of missing the true peak of the curve. If M is too large, the number of iterations will be reduced, but performance might be slightly worse if the true peak is missed. If M is too small, the number of iterations will be increased without significant improvement in performance.
‡ Using the same data set for both training and testing would generally lead to overtraining and give misleading results. This can be avoided by holding out a distinct validation set. Instead, we used leave-one-out cross validation on the training set. Each training set document is scored as if the model were trained on all other training set documents. Because of the independence assumptions of Bayes, an efficient implementation of this scoring is possible. The details of this implementation are described in Appendix B (online supplement).

Feature Selection

Proper feature selection often improves the performance of a machine learner. For the naïve Bayes classifier we implemented feature selection by setting a threshold and retaining only those features whose weights are in absolute value above that threshold. Previous research has shown this to be a highly effective feature selection method for naïve Bayes when class size is unbalanced.30 This allows us to reduce the feature dimensionality and generally see a gain in performance. In most of our work with naïve Bayes reported here, we use a fixed threshold value of 1. This allows a large reduction in the number of features and generally does not degrade performance. To further investigate the effectiveness of feature selection, we estimated optimal threshold values for each classification task on the WTS (using leave-one-out) and tested them in order to compare the performance with our example selection method.

Evaluation

To measure classification performance, we used average precision. This simple measure is well suited to our problem. To calculate average precision,31 the 5,511,502 documents in the test set are ordered by the score from the machine learner. Precisions are calculated at each rank where a C1 document is found and these precision values are averaged.
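The calculation just described can be written out as a short sketch. The function name and the toy scores are hypothetical, and ties are assumed to be broken by input order, which the text does not specify.

```python
def average_precision(scores, labels):
    """Average of the precisions at each rank where a C1 document (label 1) occurs."""
    # Rank documents from highest to lowest score (stable for ties).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking: C1 documents at ranks 1 and 3, so the value is (1/1 + 2/3) / 2.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, -1, 1, -1]
ap = average_precision(scores, labels)

# Padding the ranking with C-1 documents that score below every C1 document
# adds no new precision terms, so the value is unchanged.
ap_padded = average_precision(scores + [0.1, 0.05], labels + [-1, -1])
```

Because only ranks that hold a C1 document contribute, the measure depends on how high the C1 documents sit in the ranking and not on how many low-scoring C−1 documents follow them.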
A detailed definition of average precision is provided in Appendix D (online supplement). We also present precision-recall curves for a sample of the classification tasks.

Previous studies have shown limitations of accuracy and ROC for imbalanced data sets.32 Accuracy is only meaningful when the cost of misclassification is the same for all documents. Because we are dealing with cases where the class C1 is much smaller than C−1, the cost of misclassifying a C1 document needs to be much higher. One could classify all documents as C−1 and obtain high accuracy if the cost is taken to be the same over all documents. For example, with our largest C1 class, calling all documents C−1 gives an accuracy over 99%. For the smallest class, the accuracy would be more than 99.99%. Clearly a more sensitive measure is needed.

The challenge of our data set is measuring how high the C1 documents are ranked without being unduly influenced by the large number of low-ranking C−1 documents. Given a particular set of C1 and C−1 documents and their associated ROC score, the ROC score can be increased simply by adding additional irrelevant C−1 documents with scores lower than any existing C1 documents. In fact an arbitrarily high ROC score can be obtained by adding enough irrelevant low-scoring C−1 documents. In sharp contrast, adding C−1 documents that score below the lowest C1 document has no effect on the average precision. Average precision is much more sensitive to retrieving C1 documents in high ranks. Because we are only interested in such high-ranking C1 documents, average precision is well suited to our purpose.

Results

The numerical results of these experiments appear in Table 1. For naïve Bayes we used a fixed weight threshold of 1 for feature selection except for WTS Bayes OptCut, where we used a customized threshold value for each classification task.
Using a threshold of 1 allowed Bayes to use a much smaller number of features on the WTS, ranging from 15,261 to 1,080,168 features depending on the classification task (without a threshold there are 56,194,161 features). The OTS size varied from 2,508 to 742,540 documents for different MeSH assignments, which is 0.02% to 6.74% of the WTS size.

Table 1. Average Precisions for Prediction of MeSH Terms in MEDLINE Articles

| MeSH Terms | Number of C1 | OTS Size | WTS Bayes | WTS Bayes OptCut* | OTS Bayes | Ran STS Bayes | OTS CMLS |
|---|---|---|---|---|---|---|---|
| Rats, Wistar | 122,815 | 742,540 | 0.160 | 0.309 | 0.376 | 0.154 | 0.386 |
| Myocardial infarction | 101,810 | 252,131 | 0.325 | 0.644 | 0.674 | 0.322 | 0.688 |
| Blood platelets | 51,793 | 128,286 | 0.274 | 0.600 | 0.599 | 0.265 | 0.649 |
| Serotonin | 50,522 | 124,581 | 0.175 | 0.564 | 0.578 | 0.163 | 0.626 |
| State medicine | 31,338 | 357,993 | 0.096 | 0.215 | 0.262 | 0.091 | 0.244 |
| Bladder | 30,572 | 154,715 | 0.231 | 0.474 | 0.481 | 0.219 | 0.514 |
| Drosophila melanogaster | 21,695 | 53,740 | 0.243 | 0.684 | 0.688 | 0.204 | 0.689 |
| Tryptophan | 20,391 | 194,950 | 0.108 | 0.514 | 0.500 | 0.094 | 0.557 |
| Laparotomy | 10,284 | 173,304 | 0.043 | 0.218 | 0.209 | 0.040 | 0.289 |
| Crowns | 10,152 | 51,138 | 0.178 | 0.501 | 0.551 | 0.168 | 0.581 |
| Streptococcus mutans | 5,105 | 12,430 | 0.374 | 0.716 | 0.744 | 0.386 | 0.752 |
| Infectious mononucleosis | 5,040 | 46,260 | 0.134 | 0.537 | 0.583 | 0.130 | 0.614 |
| Blood banks | 4,076 | 39,494 | 0.109 | 0.256 | 0.315 | 0.107 | 0.345 |
| Humeral fractures | 4,087 | 31,793 | 0.128 | 0.450 | 0.507 | 0.122 | 0.569 |
| Tuberculosis, lymph node | 3,036 | 66,584 | 0.117 | 0.249 | 0.343 | 0.108 | 0.348 |
| Mentors | 3,275 | 55,214 | 0.048 | 0.367 | 0.368 | 0.038 | 0.419 |
| Tooth discoloration | 2,052 | 19,764 | 0.108 | 0.302 | 0.365 | 0.094 | 0.469 |
| Pentazocine | 2,014 | 12,202 | 0.041 | 0.678 | 0.590 | 0.030 | 0.681 |
| Hepatitis E | 1,032 | 2,508 | 0.309 | 0.611 | 0.675 | 0.290 | 0.629 |
| Genes, p16 | 1,057 | 11,847 | 0.100 | 0.319 | 0.286 | 0.081 | 0.244 |
| Average | | | 0.165 | 0.460 | 0.485 | 0.155 | 0.515 |

CMLS = C-modified least squares; MeSH = medical subject headings; OTS = optimal training set; Ran STS = random small training set.

* Used an optimal cutoff for feature selection in Bayes. The other Bayes classification tasks used cutoff value 1.
Whole training set (WTS) = 11,023,004 documents. Optimal training set (OTS) = C1 documents + optimal C−1 documents (details in Figure 1).

Figure 2. A comparison of precision-recall curves of WTS Bayes and OTS Bayes. (A) Drosophila melanogaster, (B) Streptococcus mutans, (C) mentors, (D) pentazocine.

The average precision using Bayes on the OTS (OTS Bayes) ranged from 0.209 to 0.744, with an average of 0.485. Compared to the overall average of 0.165 seen on the WTS, this is nearly a 200% improvement. Figure 2 shows precision-recall curves of WTS Bayes and OTS Bayes for some classification tasks. Here it is helpful to recall that the average precision is the area under the precision-recall curve. In all cases the area under the curve for OTS training was much larger than for WTS training. Also, the OTS curve was above the WTS curve for most recall levels except for recall levels close to 1 in some cases. This is highly preferable in information retrieval where relevant examples should appear in the top ranks.

To address the importance of the size of the OTS versus the specific documents in the OTS, we created a comparable random small training set (Ran STS in Table 1). It includes all of the C1 documents, just as in the OTS, and a number of randomly selected C−1 documents equal to the number of C−1 documents in the OTS. Thus, Ran STS has the same size as the OTS. For example, for the MeSH term “Rats, Wistar” the Ran STS has 122,815 C1 documents and 619,725 (= OTS size − C1 size = 742,540 − 122,815) C−1 documents randomly selected from the WTS. When Bayesian learning was performed on the random STS (Ran STS Bayes), the average precisions were a little lower than those obtained with the WTS and much lower than those obtained with the OTS.

The overall average using Bayes on the WTS with an optimal threshold (WTS Bayes OptCut in Table 1) was 0.460. The threshold ranged from 3.6 to 9.4.
The performance was much improved over the WTS with a fixed cutoff value of 1 (WTS Bayes), but not as good as using Bayes on the OTS. In 15 of 20 MeSH assignments, the OTS Bayes performed better than WTS Bayes OptCut.

The upper bounds of KNN were compared with 95% lower confidence limits of Bayes. Table 2 shows the 95% lower confidence limits of the average precisions for Bayes trained on the OTS. These were obtained by Student’s t-test. It also shows upper bounds for the KNN average precisions. In 17 of 20 MeSH assignments, Bayes using OTS was statistically better than KNN. We also performed the sign test; superiority in 17 of 20 cases yields a p-value of 0.00129. Therefore, we can safely conclude that naïve Bayes using the OTS is better than KNN.

Although it is not feasible to train a complex machine learning algorithm on the WTS because of its huge size, the much smaller size of the OTS allows us to use a more sophisticated learner such as CMLS. Using CMLS on the OTS (OTS CMLS in Table 1) further improved the performance and produced 6% better results on average than Bayes.

A plot of the average precision versus the size of the STS for three MeSH terms appears in Figure 3. The OTS occurs at the peak of each curve, at a much smaller size than the WTS.
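The sign-test p-value quoted above can be checked with a short calculation. The sketch below assumes a one-sided test with no ties, that is, the probability of 17 or more successes in 20 trials when each comparison is equally likely to go either way. That this is the exact form the authors used is an inference from the reported value; the function name is hypothetical.

```python
from math import comb


def sign_test_p(wins, n):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


p_value = sign_test_p(17, 20)  # 1351 / 1048576
print(f"{p_value:.5f}")        # prints 0.00129
```

The numerator is the number of ways to obtain 17, 18, 19, or 20 wins (1140 + 190 + 20 + 1 = 1351), and the denominator is the 2^20 equally likely outcomes under the null hypothesis.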
Table 2. A Comparison of 95% Lower Confidence Limit of Average Precision for Bayes and Upper Bound to KNN Average Precision

| MeSH Terms | OTS Bayes 95% Lower Confidence Limit | KNN Upper Bound |
|---|---|---|
| Rats, Wistar | 0.374* | 0.414 |
| Myocardial infarction | 0.671 | 0.623 |
| Blood platelets | 0.595 | 0.521 |
| Serotonin | 0.574 | 0.473 |
| State medicine | 0.258 | 0.216 |
| Bladder | 0.476 | 0.461 |
| Drosophila melanogaster | 0.683 | 0.579 |
| Tryptophan | 0.494 | 0.398 |
| Laparotomy | 0.202 | 0.151 |
| Crowns | 0.542 | 0.518 |
| Streptococcus mutans | 0.733 | 0.674 |
| Infectious mononucleosis | 0.571 | 0.506 |
| Blood banks | 0.305 | 0.231 |
| Humeral fractures | 0.493* | 0.530 |
| Tuberculosis, lymph node | 0.327 | 0.295 |
| Mentors | 0.350 | 0.301 |
| Tooth discoloration | 0.348* | 0.366 |
| Pentazocine | 0.565 | 0.436 |
| Hepatitis E | 0.654 | 0.567 |
| Genes, p16 | 0.271 | 0.267 |
| Average | | 0.426 |

* 95% lower confidence limit of OTS Bayes is less than the KNN upper bound.

KNN = K-nearest neighbor; MeSH = medical subject headings; OTS = optimal training set.

Discussion

Using an optimal training set can greatly improve learning with naïve Bayes. Although 11 million training documents can easily be handled, using all of them is not the best option. Much better results can be obtained using a carefully chosen smaller training set. In the smallest improvement seen, the average precision nearly doubled. In the best case, the average precision improved by a factor of 14.

Proper feature selection is another way to improve the naïve Bayes performance. When using an optimized weight threshold (WTS Bayes OptCut) for each classification task, we saw much better performance than using a fixed weight threshold for all classification tasks (WTS Bayes). The overall performance, however, was not as good as using the OTS.

As a further benefit of the OTS approach, the small size of the OTS allows using a more sophisticated machine learner such as CMLS. CMLS generally performs better than naïve Bayes.
In our case, CMLS trained on the OTS showed a 6% improvement over naïve Bayes trained on the OTS. In practice, it would make sense to apply CMLS to the OTS, if affordable, because in most cases it was better than naïve Bayes.

A KNN classifier, which is used as one method to rank MeSH terms by the Indexing Initiative at the NLM,1,2 was also compared with naïve Bayes on the OTS. In most cases, naïve Bayes performed better than KNN. On average CMLS on the OTS was 21% better than KNN.

An important question is why training on the relatively small OTS produces such a large improvement in the naïve Bayes classifier’s performance when compared with training on the WTS. At the most basic level, if the naïve assumption that all features are independent of each other given the context of the class were true, arguably, naïve Bayes would be the ideal classifier. Because this naïve assumption is generally false, various more complex algorithms such as support vector machines, CMLS, and decision trees, and even more sophisticated Bayesian network approaches33 have been developed to deal at some level with dependencies among features. These more complex algorithms often give an improvement over naïve Bayes but with a higher computational cost. Because our approach is successful in improving over naïve Bayes on the WTS, it must also be a way to deal with dependencies.

To see how our approach deals with dependencies, it is helpful to consider the following argument. Naïve Bayes on the WTS learns how to distinguish C1 and C−1 documents. But C−1 consists of two types of documents, i.e.,

$$C_{-1} = B_{1} \cup B_{-1} \tag{1}$$

where B1 is a very small set of documents that are close in content to C1 documents, whereas B−1 is most of C−1 and consists of documents that are distant from C1 and unlikely to be confused with it.
Now training naïve Bayes on the WTS is essentially teaching it to distinguish C1 and B−1 as B1 will have almost no influence on the probability estimates used to compute weights. Such training may be far from optimal in its ability to distinguish between C1 and B1. Our method is a way to determine B1 so that $\mathrm{OTS} = C_{1} \cup B_{1}$. Then training on the OTS is optimal naïve Bayesian training to distinguish C1 and B1, and because B−1 is already distant from C1 we may expect to see a large improvement in performance. But the very existence of such a set as B1 is only possible because features are not independently distributed in C−1. It is the co-occurrence of a number of features in a single document in B1 above random that gives that document a particular flavor or topicality that can be very similar to and confused with a document in C1. Removal of the B−1 set from the training process alleviates the feature dependency problem and improves the results. It is a solution somewhat similar to the support vector machine in which the training generally focuses on a small set of support vectors that determine the final training result while a large part of the training set is ignored. However, our Bayesian approach is much more efficient to train than a support vector machine.

Figure 3. Average precision versus number of documents for several MeSH terms.

In the final analysis, whatever improvement one sees must be reflected in a difference in the weights computed. Equation (A.6) in Appendix A (online supplement) gives the definition of a weight for a Bayesian learner. A document’s score is the sum of weights for features that appear in the document.
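A small numerical sketch may help make the argument above concrete. Equation (A.6) is in the online appendix and is not reproduced here, so the sketch assumes the standard log-odds weight of a binary independence model, without smoothing; the paper's exact formula may differ. All counts and names are hypothetical.

```python
from math import log


def bim_weight(p, q):
    """Assumed log-odds weight of a binary feature (standard BIM form).

    p = P(feature present | C1), q = P(feature present | C-1).
    """
    return log(p * (1 - q) / (q * (1 - p)))


def document_score(doc_features, weights):
    """Score a document as the sum of weights of the features it contains."""
    return sum(weights.get(f, 0.0) for f in doc_features)


# Hypothetical counts for one feature: (documents in set, documents containing it).
n_c1, f_c1 = 1_000, 100          # C1: feature in 10% of documents
n_b1, f_b1 = 5_000, 500          # B1 (near C1): also 10%
n_far, f_far = 1_000_000, 1_000  # B-1 (distant): 0.1%

p = f_c1 / n_c1
w_ots = bim_weight(p, f_b1 / n_b1)                    # trained on C1 and B1 only
w_wts = bim_weight(p, (f_b1 + f_far) / (n_b1 + n_far))  # B1 diluted by B-1

score = document_score({"x", "y", "z"}, {"x": 1.5, "y": -0.5})
```

With these counts the feature is equally common in C1 and B1, so its weight on the OTS is 0, but once B1 is diluted by the much larger B−1 the same feature receives a weight of about 4.3. Every B1 document that contains it is then pushed toward C1.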
A positive weight value denotes that a feature is more likely to appear in C1 documents, a negative weight value denotes that a feature is more likely to appear in C−1 documents, and a weight value close to zero denotes no preference for either class. Tables 3 and 4 show examples of features (words) from the “pentazocine” classification task with significant weight changes between the OTS and the WTS. The features in Table 3 are about equally likely to appear in C1 documents and in the nearby C−1 documents included in the OTS (set B1). They are not useful for distinguishing the two sets and have OTS weights near zero. When using the WTS, however, C−1 includes many distant documents that do not contain these features (set B−1), so the features now appear relatively more frequently in C1 documents, leading to positive WTS weights. Table 4 lists features that are somewhat related to C1 documents but appear more frequently in set B1 documents, and so have negative OTS weights. They are useful for recognizing that a document is not a C1 document. When the many distant B−1 documents are included in the WTS, these features also emerge as more common in C1 than in C−1 and receive a positive weight. In both Tables 3 and 4, the more positive weights obtained with the WTS move the scores of documents in B1 to a more positive value, harming the precision. Careful example selection, which eliminates irrelevant documents from the majority class, C−1, helps to alleviate this problem.

The poor results from the Ran STS show that random down-sampling of examples does not work in our data sets, even though others observed better performance [24] in a different setting. The dramatically better results from the OTS, which is the same size as the Ran STS, demonstrate that the improvement is due to the particular documents selected, not just the small size. The nature of the OTS is more important than its size. In the OTS, C−1 documents are more likely to be close to C1 documents; they lie near the decision boundary.
This gives more discriminative learning and better results.

We believe there is a larger lesson in our experience estimating probabilities of word occurrences. In any endeavor in which probabilities must be estimated, the choice of training data can be crucial to success. A large number of training examples that are irrelevant to the issue can seriously dilute those relevant examples that would otherwise provide useful probability estimates. This phenomenon may be important not only for naïve Bayes learners, but also for Bayesian networks, Markov models, and decision trees, which all use probabilities.

Table 3. “Pentazocine” Classification Task: Neutral Features Inappropriately Given Positive Weight by Training on the WTS

Feature Term         Weight Trained on OTS    Weight Trained on WTS
Antinociception      0.0019                   3.7516
Addictive            0.0115                   3.1606
Anesthetic agents    0.0115                   2.3108
Central action       0.0115                   3.4684

Abbreviations as in Table 1.

Table 4. “Pentazocine” Classification Task: Negative Weighted Features Inappropriately Given Positive Weight by Training on the WTS

Feature Term         Weight Trained on OTS    Weight Trained on WTS
Bupivacaine          −1.0459                  1.2844
Thiopental           −1.0459                  1.2598
Dynorphin            −1.0014                  1.4213
Sphincter of Oddi    −1.0014                  3.6554

Abbreviations as in Table 1.

There is a need for further work. Much work has gone into feature selection. More consideration should be given to example selection. This is especially true when the C1 documents are a very small proportion of the available training set. However, we have seen similar results in a few cases of balanced training sets. Although we are not doing active learning, our identification of the optimal training set for Bayesian learning is still iterative. We would like to identify this set directly, possibly using information available from learning on the whole training set. Finally, we would like to investigate optimal training set creation for other learning methods, such as CMLS.
References

Note: References 34–36 are cited in the online data supplement to this article at www.jamia.org.

1. Aronson AR, Bodenreider O, Chang HF, et al. The NLM Indexing Initiative. Proc AMIA Symp 2000:17–21.
2. Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative’s Medical Text Indexer. Medinfo 2004:268–72.
3. Cooper GF, Miller RA. An experiment comparing lexical and statistical methods for extracting MeSH terms from clinical free text. J Am Med Inform Assoc 1998;5:62–75.
4. Fowler J, Maram S, Kouramajian V, Devadhar V. Automated MeSH indexing of the World-Wide Web. Proc Annu Symp Comput Appl Med Care 1995:893–7.
5. Kim W, Aronson AR, Wilbur WJ. Automatic MeSH term assignment and quality assessment. Proc AMIA Symp 2001:319–23.
6. Kim W, Wilbur WJ. A strategy for assigning new concepts in the MEDLINE database. AMIA 2005 Symp Proc 2005:395–9.
7. Kouramajian V, Devadhar V, Fowler J, Maram S. Categorization by reference: A novel approach to MeSH term assignment. Proc Annu Symp Comput Appl Med Care 1995:878–82.
8. Ruch P. Automatic assignment of biomedical categories: Toward a generic approach. Bioinformatics 2006;22:658–64.
9. Blum AL, Langley P. Selection of relevant features and examples in machine learning. Artif Intell 1997;97:245–71.
10. Burges CJC. A tutorial on support vector machines for pattern recognition. Available electronically from the author: Bell Laboratories, Lucent Technologies, 1999.
11. Pavlov D, Mao J, Dom B. Scaling-up support vector machines using boosting algorithm. 15th International Conference on Pattern Recognition, Barcelona, Spain, September 3–8, 2000. Los Alamitos, CA: IEEE Computer Society; 2000:2219–22. Available at http://doi.ieeecomputersociety.org/10.1109/ICPR.2000.906052. Accessed May 22, 2008.
12. Boley D, Cao D. Training support vector machines using adaptive clustering. In: Berry M, Dayal U, Kamath C, Skillicorn D, eds. 4th SIAM International Conference on Data Mining, Lake Buena Vista, Florida, April 22–24, 2004. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2004:126–37.
13. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers, 1993.
14. Lewis DD, Catlett J. Heterogeneous uncertainty sampling for supervised learning. In: Cohen WW, Hirsh H, eds. Eleventh International Conference on Machine Learning, New Brunswick, New Jersey, July 10–13, 1994. San Francisco, CA: Morgan Kaufmann Publishers; 1994:148–56.
15. Lewis DD, Gale WA. A sequential algorithm for training text classifiers. 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, July 3–6, 1994. New York, NY: Springer-Verlag; 1994:3–12.
16. Freund Y, Seung H, Shamir E, Tishby N. Selective sampling using the query by committee algorithm. Mach Learn 1997;28:133–68.
17. Seung HS, Opper M, Sompolinsky H. Query by committee. Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, July 27–29, 1992. New York, NY: ACM Press; 1992:287–94. Available at http://doi.acm.org/10.1145/130385.130417. Accessed May 22, 2008.
18. Tong S, Koller D. Support vector machine active learning with applications to text classification. J Mach Learn Res 2001;2:45–66.
19. Roy N, McCallum A. Toward optimal active learning through sampling estimation of error reduction. In: Brodley CE, Danyluk AP, eds. Eighteenth International Conference on Machine Learning, Williamstown, MA, June 28–July 01, 2001. San Francisco, CA: Morgan Kaufmann Publishers; 2001.
20. Bordes A, Ertekin S, Weston J, Bottou L. Fast kernel classifiers with online and active learning. J Mach Learn Res 2005;6:1579–619.
21. Schohn M, Cohn D. Less is more: Active learning with support vector machines. In: Langley P (ed). Proceedings of the Seventeenth International Conference on Machine Learning 2000. San Francisco, CA: Morgan Kaufmann, 2000.
22. Schapire RE. The boosting approach to machine learning: An overview. MSRI Workshop on Nonlinear Estimation and Classification; 2002.
23. Wilbur WJ. Boosting naive Bayesian learning on a large subset of MEDLINE. American Medical Informatics 2000 Annual Symposium; 2000. Los Angeles, CA: American Medical Informatics Association, 2000:918–22.
24. Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intell Data Anal 2002;6:429–50.
25. Domingos P. MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 15–18, 1999. New York, NY: ACM Press, 1999:155–64.
26. Maloof M. Learning when data sets are imbalanced and when costs are unequal and unknown. Proceedings of the ICML-2003 Workshop: Learning with Imbalanced Data Sets II, August 21–24, 2003. Menlo Park, CA: AAAI Press, 2003:73–80.
27. Nickerson AS, Japkowicz N, Milios E. Using unsupervised learning to guide resampling in imbalanced data sets. Proceedings of the Eighth International Workshop on AI and Statistics, January 4–7, 2001. London, UK: Gatsby Computational Neuroscience Unit, 2001:261–5.
28. Lewis DD. Naive (Bayes) at forty: The independence assumption in information retrieval. ECML 1998:4–15.
29. Zhang T, Oles FJ. Text categorization based on regularized linear classification methods. Inf Retrieval 2001;4:5–31.
30. Mladenic D, Grobelnik M. Feature selection for unbalanced class distribution and naive Bayes. Sixteenth International Conference on Machine Learning, 1999. San Francisco, CA: Morgan Kaufmann, 1999:258–67.
31. Manning CD, Schütze H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
32. Visa S, Ralescu A. Issues in Mining Imbalanced Data Sets - A Review Paper. Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, April 16–17, 2005. Cincinnati, OH: University of Cincinnati, 2005:67–73.
33. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning 1997;29(2–3):131–63.
34. Madsen RE, Kauchak D, Elkan C. Modeling word burstiness using the Dirichlet distribution. 22nd International Conference on Machine Learning, 2005. Bonn, Germany: ACM Press, 2005:545–52.
35. Witten IH, Moffat A, Bell TC. Managing Gigabytes. Second edition. San Francisco: Morgan Kaufmann, 1999.
36. Salton G. Automatic Text Processing. Reading, MA: Addison-Wesley, 1989.

Optimal Training Sets for Bayesian Prediction of MeSH® Assignment. Sunghwan Sohn, Won Kim, Donald C Comeau, et al. J Am Med Inform Assoc 2008 15: 546-553. doi: 10.1197/jamia.M2431

Data Supplement: http://jamia.bmj.com/content/suppl/2009/11/20/15.4.546.DC1.html