Universidad de Granada
Department of Computer Science and Artificial Intelligence
PhD Programme in Computer Science and Information Technology
Real-coded evolutionary algorithms for the prototype generation problem
in instance-based supervised and semi-supervised learning
Doctoral Thesis
Isaac Triguero Velázquez
Granada, March 2014
Publisher: Editorial de la Universidad de Granada
Author: Isaac Triguero Velázquez
Legal deposit: GR 374-2015
ISBN: 978-84-9083-269-1
Universidad de Granada
Real-coded evolutionary algorithms for the prototype generation problem
in instance-based supervised and semi-supervised learning
DISSERTATION SUBMITTED BY
Isaac Triguero Velázquez
TO OBTAIN THE DEGREE OF DOCTOR IN COMPUTER SCIENCE
March 2014
SUPERVISORS
Francisco Herrera Triguero and Salvador García López
Department of Computer Science and Artificial Intelligence
The dissertation entitled “Real-coded evolutionary algorithms for the prototype generation problem in instance-based supervised and semi-supervised learning”, presented by Mr. Isaac Triguero Velázquez to obtain the degree of doctor, has been carried out within the Official PhD Programme in “Computer Science and Information Technology”, at the Department of Computer Science and Artificial Intelligence of the Universidad de Granada, under the supervision of Dr. Francisco Herrera Triguero and Dr. Salvador García López.
By signing this doctoral thesis, the PhD candidate and the thesis supervisors guarantee that the work has been carried out by the candidate under the supervision of the thesis supervisors and that, to the best of our knowledge, the rights of other authors to be cited have been respected whenever their results or publications have been used.
Granada, March 2014
The PhD Candidate
Signed: Isaac Triguero Velázquez
The Supervisors
Signed: Francisco Herrera Triguero
Signed: Salvador García López
This doctoral thesis has been developed with the funding of the predoctoral scholarship attached to the excellence research project P10-TIC-6858 of the Junta de Andalucía. It has also been supported by the projects TIN2008-06681-C06-01 and TIN2011-28488 of the Spanish Ministry of Science and Innovation.
Dedicated to the memory of my father:
D. Gabriel Triguero
Acknowledgements
This thesis is especially dedicated to the memory of my father, D. Gabriel Triguero, because even though he could not be present during its development, he was its driving force and, for me, the motivation needed to carry it out. I would also like to thank and dedicate this work to my mother, my brother, my nephews, my uncles and aunts, my cousins and, finally, my grandparents, may they rest in peace. All of them have always supported me throughout my education. This dissertation is by and for all of you.
From the academic point of view, I would first like to thank my thesis supervisors, Francisco Herrera and Salvador García, for all the effort, dedication and interest they have devoted to me. I am convinced that without them, without their many pieces of advice and without the knowledge they have passed on to me, it would have been impossible to finish this doctoral thesis. For me, Paco and Salva are synonymous with high-quality research, and I hope to keep carrying out this activity alongside them.
I would also like to thank José Manuel Benítez for his support; although not directly involved in my doctoral thesis, he has certainly influenced my education during this period. Thanks to him, my favourite film will always be “Hércules”, and I am still waiting for the new “Ulises” one.
I also owe a great deal to my companions in hardship. First of all, to Joaquín and his “pedantic english”, for all the advice and ideas that have undoubtedly been very important in this thesis. To my cohort colleagues Vicky, Álvaro and José Antonio; to the seniors of the group, the Alcalá brothers, Alberto, Julián, etc.; and to the not-so-seniors, Nacho, M. Cobo, Cristóbal (Jaén), Christoph, Fran, Michela, etc. From here I would also like to thank, and at the same time encourage, the kids of the future: Dani, Pablo, Sergio, Sara, Juanan, Rosa, Alber, M. Parra, Lala, etc.
I would also like to thank Jaume for all his support during my research stay at the University of Nottingham. I also thank German and Nicola for the good moments we spent running through the “hilly” Notts.
I cannot leave my friends out of this message of thanks: Álvaro (again, yes), his jokes, and our many Red Bulls and Sundays shared in the Orquídeas building. My lifelong friends, whom I consider part of my family, Trujillo and Emilio. My good friend Manolo, companion in training sessions and races, whom I hope will write a dissertation like this one in a few years. Finally, all my friends from Atarfe and my hobby and stage companions.
I wanted to leave for the end a very special thank-you to the person who knows first-hand the work this doctoral thesis has involved, as if she had done it herself. Without her affection, her understanding and her motivation throughout all these years I would never have been able to do it. Thank you, Marta!
My thanks also go to all those people who, though not mentioned here, have been no less important in the completion of this dissertation.
THANK YOU ALL
Table of Contents

I. PhD dissertation
   1. Introduction
   2. Preliminaries
      2.1 Nearest neighbor classification: Data reduction approaches
      2.2 Evolutionary algorithms
      2.3 Semi-supervised classification
      2.4 Big data
   3. Justification
   4. Objectives
   5. Summary
      5.1 Prototype Generation for Supervised Learning
          5.1.1 A review on Prototype Generation
          5.1.2 New Prototype Generation Methods based on Differential Evolution
          5.1.3 Integrating Prototype Selection and Feature Weighting within Prototype Generation
          5.1.4 Enabling Prototype Reduction Models to deal with Big Data Classification
      5.2 Self-labeling with prototype generation/selection for semi-supervised classification
          5.2.1 A Survey on Self-labeling Semi-Supervised Classification
          5.2.2 New Self-labeling Approaches Aided by Prototype Generation/Selection Models
   6. Discussion of results
      6.1 Prototype Generation for supervised learning
          6.1.1 A review on Prototype Generation
          6.1.2 New Prototype Generation Methods based on Differential Evolution
          6.1.3 Integrating Prototype Selection and Feature Weighting within Prototype Generation
          6.1.4 Enabling Prototype Reduction Models to deal with Big Data Classification
      6.2 Self-labeling with Prototype Generation/Selection for Semi-Supervised Classification
          6.2.1 A Survey on Self-labeling Semi-Supervised Classification
          6.2.2 New Self-labeling Approaches Aided by Prototype Generation/Selection Models
   7. Concluding Remarks
   Conclusiones
   8. Future Work

II. Publications: Published and Submitted Papers
   1. Prototype generation for supervised classification
      1.1 A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification
      1.2 IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification
      1.3 Differential Evolution for Optimizing the Positioning of Prototypes in Nearest Neighbor Classification
      1.4 Integrating a Differential Evolution Feature Weighting Scheme into Prototype Generation
      1.5 MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification
   2. Self-labeling with prototype generation/selection for semi-supervised classification
      2.1 Self-Labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study
      2.2 On the Characterization of Noise Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification
      2.3 SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification

Bibliography
Chapter I
PhD dissertation
1. Introduction
In recent years there has been rapid progress in technology and communications, led by the great expansion of the Internet. As a consequence, there is an increasing need to process and classify large quantities of data. This is especially important in a wide variety of fields, such as astronomy, geology, medicine, or the interpretation of the human genome, where the amount of available information has increased considerably. However, the real value of data lies in the possibility of extracting valuable knowledge for making decisions or for exploring and understanding the phenomenon that produced the data. Otherwise, we end up in the situation described by the expression “the world is becoming data rich but knowledge poor” [Bra07].
The manual or semiautomatic processing and collection of these data becomes impossible when the size of the data and the number of dimensions or parameters grow excessively. Nowadays it is common to find databases with millions of records and thousands of dimensions, so only computers can automate this process. Therefore, automatic procedures are required to acquire valuable knowledge from data.
Data Mining consists of solving real problems by analyzing the data that describe them. In the literature it is characterized as the science and technology of exploring data in order to discover unknown patterns already present in them. Many people regard Data Mining as a synonym of the Knowledge Discovery in Databases (KDD) process, while others view Data Mining as the main step of KDD [TSK05, WFH11, HKP11].
There are several definitions of the KDD process. For example, in [FPSS96] the authors define it as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. In [Fri97], the KDD process is regarded as an automatic exploratory analysis of large databases. A key aspect that characterizes the KDD process is the way in which it is divided into stages, according to the agreement of several important researchers in the topic. There are several ways to make this division, each with advantages and weaknesses [Han12]. In this thesis we adopt a hybridization widely used in recent years that organizes these stages into the following four steps:
• Problem Definition: Selection of the relevant data that form the problem, according to the prior knowledge obtained from the experts. Definition of the goals pursued by the end-user. It also includes the comprehension of both the selected data and the associated expert knowledge, in order to achieve a high degree of reliability.
• Data Gathering and Preparation: This stage includes operations for data cleaning, data
integration, data transformation and data reduction. The first one consists of the removal
of noisy and inconsistent data. The second tries to combine multiple data sources into a
single one. The third transforms and consolidates the data into forms that are appropriate
for performing data mining tasks. Finally, the data reduction process includes the selection
and extraction of both features and examples in a database. This phase aims to ease the
development of the following stages.
• Model Building and Evaluation: This is the process in which the methods are used to
extract valid data patterns. Firstly, this step includes the choice of the most suitable data
mining task, such as classification, regression, clustering or association, and the choice of
the data mining algorithm itself, belonging to one of the previous families. Secondly, the
adaptation of the selected algorithm to the addressed problem, tuning essential parameters
and applying validation procedures. Finally, estimating and interpreting the mined patterns
based on different measures of interest.
• Knowledge Deployment: This last stage involves the description of the discovered patterns so that they are useful to the end-user.
Figure 1: The KDD process

Figure 1 shows the KDD process and its four main stages, as described above. It is worth mentioning that all the stages are interconnected, showing that the KDD process is actually a self-organized scheme in which each stage has repercussions on the remaining stages.
As commented above, the main phase of the KDD process is also known as Data Mining. This discipline focuses on a narrower objective (with respect to the whole KDD process) that consists of the identification of patterns and the prediction of relationships from data. It is noteworthy that the success of a data mining technique does not rely only on its performance. These techniques are sensitive to the quality of the information provided. Thus, the higher the quality of the data, the better the decisions the generated models will be able to make. In this sense, preprocessing techniques are necessary in the KDD process before applying data mining techniques.
Data mining techniques are commonly categorized as descriptive or predictive methods. The former are devoted to discovering interesting patterns in the data. The latter aim to predict the behavior of a model through the analysis of the data. Both the descriptive and the predictive processes of data mining are conducted by machine learning algorithms [MD01, Alp10, WFH11]. The main objective of machine learning tools is to induce knowledge from problems that do not admit a straightforward and efficient algorithmic solution or that are informally or vaguely defined.
Machine learning algorithms can be used in two ways. They can be employed simply as black boxes, obtaining as a result only the outputs of the models; however, some algorithms can also be employed as knowledge representation tools, building a symbolic knowledge structure that is useful not only from the point of view of functionality but also from the perspective of interpretability.
Depending on the available information to perform a machine learning task, three different
problems can be considered:
• Supervised learning: In this problem, the values of the target variable/s and of a set of input variables are known. The aim is to establish a relation between the input variables and the output variable/s in order to predict the output variable/s of new examples (whose target value is unknown).
– In classification [DHS00] the values of the target variable/s are discrete and a finite
number of values (known as labels or classes) are available. For instance, the different
types of a disease such as Hepatitis A, B, C.
– In regression [CM98] the values of the target variable/s are continuous (real-coded). For example: temperature, electric consumption, weight, etc.
• Unsupervised learning: The values of the target variable/s are unknown. The aim lies in the description of relations and patterns in the data. The most common categories of unsupervised machine learning algorithms are clustering and association.
– In clustering [Har75], the process consists of splitting the data into several groups, with the examples belonging to each group being as similar as possible among themselves.
– Association [AIS93] is devoted to identifying relations within transactional data.
• Semi-supervised learning: This type of problem lies between the two previous learning paradigms: machine learning algorithms are provided with some examples for which the values of the target variable/s are known (generally a very reduced number) and many others for which the values of the target variable/s are unknown. In this paradigm, we can consider the prediction of target variables as a classification or regression problem, as well as the description of the data as a clustering or an association process. Thus, it is considered an extension of unsupervised and supervised learning that includes additional information typical of the other learning paradigm [ZG09].
In this thesis, we will focus on both supervised and semi-supervised classification. A classification method is defined as a technique that learns how to categorize examples or elements into several predefined classes. Broadly speaking, a classifier learns a model from an input data set (denoted as the training set), and then the model is applied to predict the class value of a given set of examples (the test set) that have not been used in the learning process. To measure the performance of a classifier, several criteria can be used:
• Accuracy: It measures the confidence of the learned classification model. It is usually
estimated as the percentage of test examples correctly classified over the total.
• Efficiency: Time spent by the model when classifying a test example.
• Interpretability: Clarity and credibility, from the human point of view, of the classification
model.
• Learning time: Time required by the machine learning algorithm to build the classification
model.
• Robustness: Minimum number of examples needed to obtain a precise and reliable classification model.
In the specialized literature there are different approaches to performing classification tasks, in both the supervised and the semi-supervised context, with successful results: for example, statistical techniques, discriminant functions, neural networks, decision trees, support vector machines and so on. However, as stated before, these techniques may be useless when the input data are impure, leading to the extraction of wrong models. Therefore, the preprocessing of the data becomes one of the most relevant stages in enabling data mining methods to obtain better and more accurate information [ZZY03]. Data preparation can generate a smaller data set in comparison to the original one, improving the efficiency of the data mining tool; moreover, the preparation yields high-quality data, which may result in high-quality models.
Among the data preparation strategies [Pyl99], data reduction methods aim to simplify the data in order to enable data mining algorithms to be applied not only in a faster way but also in a more accurate way, by removing noisy and redundant data. From the perspective of attributes or variables, the best-known data reduction processes are feature selection, feature weighting and feature generation [LM07]. Taking into consideration the instance space, we can highlight instance reduction methods [GDCH12, TDGH12].
Feature selection consists of choosing a representative subset of features from the original feature space, while feature generation creates new features to describe the data. From a similar point of view, feature weighting schemes assign a weight to each feature of the problem domain to modify the way in which distances between examples are computed [PV06]. This technique can be viewed as a generalization of feature selection algorithms: by assigning a real value as a weight, it provides a soft approximation of the relevance degree of each feature, so different features can receive different treatments.
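As an illustration of how a weight vector modifies the distance computation, consider the following minimal sketch (plain NumPy; an illustrative example, not code from this thesis). With all weights equal to one it reduces to the ordinary Euclidean distance, and binary weights recover feature selection as a special case.

import numpy as np

def weighted_euclidean(x, y, w):
    # Distance between feature vectors x and y under feature weights w.
    # w[i] = 1 for all i gives the plain Euclidean distance; w[i] = 0
    # discards feature i, so feature selection is the binary special case.
    x, y, w = np.asarray(x, float), np.asarray(y, float), np.asarray(w, float)
    return np.sqrt(np.sum(w * (x - y) ** 2))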
An instance reduction technique is devoted to finding the best reduced set that represents the original training data with a smaller number of instances. Its main purposes are to speed up the classification process and to reduce both the storage requirements and the sensitivity to noisy examples. This methodology can be divided into Instance Selection (IS) [MFV02, GCH08] and Instance Generation or abstraction (IG) [Koh90, LKL02], depending on how the reduced set is created. The former attempts to choose an appropriate subset of the original training data, while the latter can also build new artificial instances to better adjust the decision boundaries of the classes. In this manner, the IG process fills regions in the domain of the problem that have no representative examples in the original data set.
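To make the selection side concrete, the following is a minimal instance selection sketch in the spirit of Hart's classic condensed NN rule (an illustration, not an algorithm from this thesis): it keeps a subset that classifies every training sample correctly with the 1-NN rule, absorbing any instance the current subset misclassifies.

import numpy as np

def condensed_nn(X, y, rng=None):
    # Instance selection sketch inspired by Hart's Condensed NN rule.
    rng = rng or np.random.default_rng(0)
    keep = [rng.integers(len(X))]            # seed with one random instance
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep][d.argmin()] != y[i]:  # misclassified: absorb it
                keep.append(i)
                changed = True
    return X[keep], y[keep]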
Most of the instance reduction techniques have focused on enhancing the Nearest Neighbor (NN) classifier [CH67]. In the specialized literature, when IS or IG are applied to instance-based learning algorithms, they are commonly referred to as Prototype Selection (PS) and Prototype Generation (PG), respectively [DGH10a]. The NN rule is a nonparametric instance-based algorithm [AKA91] for pattern classification [DHS00, HTF09]. It belongs to the lazy learning family of methods, which refers to those methods that predict the class label from the raw training data and do not build a learning model. The NN algorithm predicts the class of a test sample according to a concept of similarity [CGG+ 09] between examples (commonly the Euclidean distance). Thus, the predicted class of a given test sample is set equal to the most frequent class among its nearest training samples.
Despite its simplicity, the NN rule has proved to be one of the most important and effective techniques in data mining and pattern recognition, being considered one of the top ten methods in data mining in [WK09]. However, the NN classifier also suffers from several problems that PS and PG techniques have been trying to alleviate. Four main weaknesses can be mentioned:
• High storage requirements: It needs all the examples of the training set to classify a test
example.
• High computational cost: Each classification implies the computation of similarities between
the test sample and all the examples of the training set.
• Low tolerance to noisy instances: All the training data are assumed to be relevant, so noisy data may induce incorrect classifications.
• Wrong assumptions about the input training examples: The NN rule makes predictions over existing data, assuming that the input data perfectly delimit the decision boundaries between classes.
PS techniques are limited to addressing the first three weaknesses, since they also assume that the best representative examples can be obtained from a subset of the original data, whereas PG methods generate new representative examples where needed, thus also tackling the fourth weakness mentioned above.
In the literature there was no complete categorization of PG methods, and they were frequently confused with PS methods. A considerable number of PG methods have been proposed, and some of them are rather unknown. The absence of a taxonomy focused on PG means that new algorithms are usually compared with only a subset of the complete family of PG methods and that, in most studies, no rigorous analysis is carried out.
Among the great number of existing PS and PG techniques, we can highlight as the most promising those based on Evolutionary Algorithms (EAs) [ES08]. EAs have been widely used in many different data mining problems [Fre02, PF09], acting as optimization strategies. EAs are a set of modern metaheuristics used successfully in many applications of great complexity. Their success in solving difficult problems has been the engine of a field known as Evolutionary Computation [GJ05]. These techniques are domain independent, which makes them ideal for applications where domain knowledge is difficult to provide. Moreover, they have the ability to explore large search spaces while consistently finding good solutions.
Given that the PS and PG problems can be seen as combinatorial and optimization problems, EAs have been used to solve them with successful results [CHL03, NL09]. Concretely, PS can be expressed as a binary search space problem, whereas PG is expressed as a continuous search space problem. Until now, existing evolutionary PG techniques did not take into consideration the selection of the most appropriate number of prototypes per class during the optimization process, which was their main drawback.
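To make the continuous formulation concrete, the sketch below evolves prototype positions with a classic DE/rand/1/bin differential evolution scheme, using 1-NN training accuracy as fitness. It is an illustration under simplifying assumptions, in particular a fixed number of prototypes per class (precisely the limitation noted above), and it is not one of the algorithms proposed in this thesis.

import numpy as np

rng = np.random.default_rng(0)

def nn_accuracy(prototypes, proto_labels, X, y):
    # Fitness: accuracy of the 1-NN rule over the training set when only
    # the prototypes are used as reference examples.
    d = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
    return np.mean(proto_labels[d.argmin(axis=1)] == y)

def de_prototype_positioning(X, y, per_class=2, pop_size=20, iters=100,
                             F=0.5, CR=0.9):
    # Each individual encodes the coordinates of all prototypes as a single
    # real-valued vector; class labels are fixed, only positions evolve.
    proto_labels = np.repeat(np.unique(y), per_class)
    n_feat = X.shape[1]
    lo = np.tile(X.min(axis=0), len(proto_labels))
    hi = np.tile(X.max(axis=0), len(proto_labels))
    pop = rng.uniform(lo, hi, (pop_size, len(lo)))
    fit = np.array([nn_accuracy(ind.reshape(-1, n_feat), proto_labels, X, y)
                    for ind in pop])
    for _ in range(iters):
        for i in range(pop_size):
            # DE/rand/1 mutation from three distinct random individuals.
            a, b, c = pop[rng.choice(np.delete(np.arange(pop_size), i), 3,
                                     replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            # Binomial crossover with at least one mutated component.
            mask = rng.random(len(lo)) < CR
            mask[rng.integers(len(lo))] = True
            trial = np.where(mask, mutant, pop[i])
            f_trial = nn_accuracy(trial.reshape(-1, n_feat), proto_labels, X, y)
            if f_trial >= fit[i]:            # greedy replacement
                pop[i], fit[i] = trial, f_trial
    best = pop[fit.argmax()].reshape(-1, n_feat)
    return best, proto_labels

The real-valued encoding is what makes DE a natural fit here: a prototype set is just a point in a continuous space, so mutation and crossover move prototypes around the input domain directly.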
Very recently, the term big data has been coined to refer to the challenges and advantages derived from collecting and processing vast amounts of data [Mar13]. Formally, it is defined as the quantity of data that exceeds the processing capabilities of a given system. It is attracting much attention in data mining because the knowledge extraction process from big data has become a very difficult task for most of the classical and advanced techniques. The main challenges are to deal with the increasing scale of data at the level of the number of instances, at the level of features or characteristics, and in terms of the complexity of the problem. Nowadays, with the availability of cloud platforms [PBA+ 08], we have at our disposal sufficient processing units to extract valuable knowledge from massive data. Therefore, the adaptation of data mining techniques to emerging technologies, such as distributed computation, will be a mandatory task to overcome their limitations.
As such, data reduction techniques should enable data mining algorithms to address big data problems with greater ease; on the other hand, these methods are themselves affected by the increase in the size and complexity of data sets, and they may be unable to provide a preprocessed data set in an acceptable time. Enhancing the scalability of data reduction techniques is therefore becoming a challenging topic.
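One natural way to scale an instance reduction method is the partition/merge pattern behind MapReduce: split the training set into disjoint chunks, reduce each chunk independently (the map phase), and concatenate the per-chunk results (the reduce phase). The sketch below runs this sequentially, with a stratified random sampler standing in for the per-chunk reducer; it illustrates the pattern only and is not the MRPR algorithm presented in Part II.

import numpy as np

def reduce_chunk(X, y, keep=0.1, rng=None):
    # Placeholder reducer: keep a stratified random fraction of the chunk.
    # Any prototype selection/generation method could be plugged in here.
    rng = rng or np.random.default_rng(0)
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        idx.extend(rng.choice(members, max(1, int(keep * len(members))),
                              replace=False))
    return X[idx], y[idx]

def map_reduce_reduction(X, y, n_chunks=4):
    # Map: run the reducer on disjoint chunks of the training set.
    # Reduce: concatenate the partial reduced sets.
    parts = [reduce_chunk(X[s], y[s])
             for s in np.array_split(np.arange(len(y)), n_chunks)]
    return np.vstack([p[0] for p in parts]), np.concatenate([p[1] for p in parts])

In a real deployment each chunk would be processed on a different node, and the merge step could be more elaborate (for example, filtering or fusing the partial reduced sets).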
In semi-supervised classification, the main problem is the lack of labeled examples. This problem has been addressed by several approaches with different assumptions about the characteristics of the input data [BC01, FUS08, Joa99]. Among them, self-labeled techniques do not make any specific assumptions about the input data. These techniques follow an iterative procedure, aiming to obtain an enlarged labeled data set by labeling the most confident unlabeled data within a supervised framework; they assume that their own predictions tend to be correct [Yar95, BM98]. A wide variety of self-labeling methods have been presented, with successful applications [LZ07, JP08]. However, in the literature there was no taxonomy of methods stating their main benefits and drawbacks. These methods present two main weaknesses (a minimal self-training sketch follows the list below):
• The addition of noisy examples to the enlarged labeled set, especially in the early stages of the self-labeling process, may lead to building wrong models. Hence, reducing the size of the unlabeled set by detecting noisy examples becomes an important task.
• They are limited by the number of labeled points and by their distribution when identifying reliable unlabeled examples. This problem is much more pronounced when the labeled ratio is greatly reduced and the labeled examples do not minimally represent the domain.
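As a point of reference for both weaknesses, a minimal self-training loop looks as follows. This is a sketch using scikit-learn's k-NN as the base classifier and assuming at least k labeled examples; real self-labeled methods add stopping criteria, noise filters or multiple views on top of this skeleton.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def self_training(X_lab, y_lab, X_unlab, n_iter=10, per_iter=10, k=3):
    # Iteratively label the unlabeled points the current classifier is
    # most confident about, then retrain on the enlarged labeled set.
    X_lab, y_lab, rest = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(n_iter):
        if len(rest) == 0:
            break
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_lab, y_lab)
        proba = clf.predict_proba(rest)
        top = np.argsort(proba.max(axis=1))[-per_iter:]  # most confident
        # Nothing prevents a confidently wrong label from being added
        # here (the first weakness listed above).
        X_lab = np.vstack([X_lab, rest[top]])
        y_lab = np.concatenate([y_lab,
                                clf.classes_[proba[top].argmax(axis=1)]])
        rest = np.delete(rest, top, axis=0)
    return KNeighborsClassifier(n_neighbors=k).fit(X_lab, y_lab)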
To the best of our knowledge, until the writing of this thesis, the application of PS and PG techniques was limited to supervised classification tasks, and they had not been applied to semi-supervised approaches. However, their use can help to alleviate the previous problems by reducing the number of noisy examples in the unlabeled set and by introducing newly generated labeled examples.
The present thesis is developed in two main parts: prototype generation for (1) supervised and (2) semi-supervised learning. It is very important to note that in the first part of this thesis PG acts as a pure data reduction technique, while in the second part PG methods work both as an unlabeled data reduction algorithm and as a generator of new labeled examples.
• In the former part, a deep study of the PG field will be performed, determining which families of methods are most promising and what their main advantages and drawbacks are. Then, we will design new PG techniques based on evolutionary algorithms that determine the best proportion of examples per class, in order to improve on current approaches. After that, we will combine PG with other data reduction approaches to increase classification accuracy. Finally, we will develop a cloud-based framework that enables prototype reduction techniques to be applied to big data problems.
• The latter part of this thesis is devoted to semi-supervised classification. In particular, we will carry out a survey of those semi-supervised learning methods that are based on self-labeling. Then, we will develop new algorithms to overcome their main drawbacks from two perspectives: (a) reducing the number of noisy examples in the unlabeled set and (b) generating synthetic data with PG techniques in order to diminish the influence of the lack of labeled examples.
After this introductory section, the next section (Section 2.) is devoted to describing in detail the four main related areas: data reduction for NN classification (Section 2.1), evolutionary algorithms (Section 2.2), semi-supervised classification (Section 2.3) and big data (Section 2.4). All of them are fundamental areas for defining and describing the proposals presented as results of this thesis.
Next, the justification of this memory is given in Section 3., describing the open problems addressed. The objectives pursued when tackling them are described in Section 4. Section 5. presents a summary of the works that compose this memory. A joint discussion of results is provided in Section 6., showing the connection between the objectives and how each of them has been reached. A summary of the conclusions drawn is provided in Section 7. Finally, in Section 8. we point out several open lines of future work derived from the results achieved.
The second part of the memory consists of eight journal publications, organized into two main sections: supervised and semi-supervised learning. These publications are the following:
• Prototype generation for supervised classification:
– A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor
Classification.
– IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification.
– Differential Evolution for Optimizing the Positioning of Prototypes in Nearest Neighbor
Classification.
– Integrating a Differential Evolution Feature Weighting Scheme into Prototype Generation.
– MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification.
• Self-labeling with prototype generation/selection for semi-supervised classification:
– Self-labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study.
– On the Characterization of Noise Filters for Self-Training Semi-Supervised in Nearest
Neighbor Classification.
– SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification.
2. Preliminaries
In this section we describe the background for this thesis. Firstly, Section 2.1 presents the NN classifier and the data reduction techniques used to improve its performance. Secondly, Section 2.2 describes the use of EAs in data mining, detailing those EAs used in this thesis. Thirdly, Section 2.3 gives a snapshot of semi-supervised classification. Finally, Section 2.4 provides information about the big data problem, its main characteristics and some of the solutions that have been proposed.
2.1 Nearest neighbor classification: Data reduction approaches
The NN algorithm was proposed by Fix and Hodges in 1951 [FH51] as a nonparametric classifier. However, its popularity increased after most of its main properties were described by Cover and Hart [CH67] in 1967. Its extension to the k nearest neighbors (k-NN) is nowadays considered one of the most important data mining techniques [WK09].

As a nonparametric classifier (in contrast to parametric ones), it does not assume any specific distribution or structure in the data. It is based on a very intuitive approach to classifying new examples: similarity between examples. Patterns that are similar, in some sense, should be assigned to the same class. The naive implementation of this rule has no learning phase, in that it uses all the training set objects to classify new incoming data. Thus, it belongs to the lazy learning family of methods [AKA91], in contradistinction to eager learning models [Mit97] that build a model during the learning (training) phase. Its theoretical properties guarantee that its probability of error is bounded above by twice the Bayes error probability for all distributions [GDCH12].
In supervised classification, a given data set is typically divided into training and test partitions, obtaining a training set composed of N samples and a test set with M samples. Each sample is formed by an attribute vector that contains the information describing the example. The standard k-NN algorithm uses the nearest (most similar) examples of the training set to predict the class of a test pattern. Figure 3 presents a simple example of how the decision is made depending on the number of neighbors used.
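As an illustration, the following minimal sketch implements the standard k-NN decision rule, assuming numeric attributes and a Euclidean distance (the function and variable names are ours, chosen for illustration):

    import numpy as np
    from collections import Counter

    def knn_classify(X_train, y_train, x_test, k=3):
        """Classify x_test by majority vote among its k nearest training samples."""
        # Distance from the test pattern to every training sample
        distances = np.linalg.norm(X_train - x_test, axis=1)
        # Indices of the k closest training samples
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of those neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]

With this rule, the same test point may receive a different label for k = 3 and k = 5, which is precisely the situation depicted in Figure 3.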
Despite its well-known performance, it suffers from several shortcomings, such as the need to store the full training set when performing a classification, a high computational cost, a low tolerance to noise (especially when k is set to 1) and the fact that the NN classifier focuses exclusively on existing data.
These weaknesses have been widely studied in the literature, leading to multiple solutions such as different similarity measures [PV06], optimization of the choice of the k parameter [Gho06] (Figure 3 shows an example of the variability of the decision according to this value), the design of fast and approximate approaches [GCB97] and the reduction of the training set [GDCH12].
Among these solutions, this thesis focuses on data preparation [Pyl99] for NN classifiers via data reduction. The aim of these techniques is to reduce the size of the training set in order to improve the performance of the classifier with respect to its efficiency and storage requirements. This field comprises different techniques; we highlight the following alternatives:
• Feature selection (FS): It reduces the data by removing irrelevant or redundant features [LM07]. In particular, the goal of feature selection is to find a minimum set of attributes such that the resulting probability distribution of the output attributes (or classes) is as close as possible to the original distribution obtained using all attributes. It increases generality and efficiency thanks to the reduction of the number of attributes per instance to process. Moreover, it enables the NN classifier to deal with high-dimensional problems.

Figure 3: An illustrative example of the NN algorithm. A problem with two classes: red circles and green triangles. The blue square represents a test example that should be classified according to its nearest neighbors. Taking the 3 nearest neighbors into consideration, it is classified as a red circle; however, using the 5 nearest neighbors, it would be marked as a green triangle.
• Feature weighting (FW): This does not select a subset of features; instead, it modifies the way in which the similarity between instances is computed, according to feature weights. Thus, the aim of these techniques is to determine the degree of importance of each feature. A good review on this topic can be found in [WAM97].
• Feature generation (FG): These techniques are also called feature extraction [GGNZ06]. Their objective is to find new features, as functions of the original ones, that better describe the training data. Therefore, in feature extraction, apart from the removal of attributes, subsets of attributes can be merged or can contribute to the creation of artificial substitute attributes. Linear and non-linear transformations or statistical techniques such as principal component analysis [Jol02] are classical examples of these techniques.
• Instance selection (IS): It consists of choosing the most representative instances in the training data [GDCH12]. Its focus is to find the best reduced set of instances from the original training set, one that does not contain noisy or irrelevant examples. These techniques perform the selection of the best subset by using rules and/or heuristics (even metaheuristics). When they are applied to improve the NN classifier, they are usually denoted as prototype selection (PS).
• Instance generation (IG): IG is considered an extension of IS in which, in addition to selecting data, these techniques are able to generate new artificial examples [LKL02]. The main objective is to represent the original training data with the lowest number of instances. As before, instance generation techniques applied to NN classification are named prototype generation (PG). Figure 4 illustrates an example of PG.
Figure 4: Illustrative PG example, extracted from [Koh90]. Subfigure (A) represents the original training data; the centroid of each class (red and blue) and the decision boundary are depicted. Subfigure (B) shows a very reduced selection/generation of points with a smooth decision boundary similar to the original.

Among these techniques, we will focus on PG and FW as the natural extensions of PS and FS, respectively. The main difference between these kinds of techniques is the way in which the problem is defined. On the one hand, PS and FS are binary search problems, so the selection can be represented as a binary array in which each element takes the value 1 if the corresponding instance/feature is currently selected by the algorithm and 0 otherwise. Therefore, there are a total of 2^N possible subsets, where N is the number of instances/features of the data set. On the other hand, PG and FW are real-valued search problems. Thus, these problems may be represented with real-valued arrays or matrices. Hence, they provide more general frameworks, allowing modifications of the internal values that represent each example or attribute. However, the search space becomes much more complex than in the binary case.
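The contrast between the two encodings can be made concrete with a toy example (all names and values here are illustrative):

    import numpy as np

    # PS/FS: a binary mask over the N instances (or features) of the training set.
    # A 1 means the corresponding element is kept; a 0, that it is discarded.
    ps_solution = np.array([1, 0, 1, 1, 0, 0, 1])   # one of the 2^N possible subsets

    # PG: a real-valued encoding. A candidate solution is a matrix of prototypes
    # (rows) whose attribute values (columns) can be freely adjusted in the space.
    pg_solution = np.array([[0.21, 3.40],
                            [1.75, 0.98],
                            [2.02, 2.64]])          # 3 prototypes, 2 attributes

    # FW: one real-valued weight per feature, rescaling the similarity computation.
    fw_solution = np.array([0.85, 0.10])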
Given the complexity of PG models, some works use a preliminary PS step [KO03a]. We will use the term Prototype Reduction (PR) to refer to the problems of instance selection or generation for the NN rule. Both PS and PG methodologies have been widely studied. More than 50 PS methods have been proposed. Generally, they can be categorized into three kinds of methods: condensation [Har68], edition [Wil72] or hybrid models [GCH08]. A complete review of this topic is [GDCH12]. Regarding PG techniques, they can be divided into several families depending on the main heuristic operation followed: positioning adjustment [NL09], class re-labeling [SBM+ 03], centroid-based [FHA07] and space-splitting [Sán04]. More information about PS and PG approaches can be found at the SCI2S thematic public website on Prototype Reduction in Nearest Neighbor Classification: Prototype Selection and Prototype Generation, http://sci2s.ugr.es/pr/.
FS methods have commonly been categorized by the mechanism used to assess the quality of a given subset of features. There are three main categories: filter [GE03], wrapper [KJ97] and embedded methods [SIL07]. In [WAM97], FW techniques were categorized along several dimensions, according to their weight learning bias, the weight space (binary or continuous), the representation of the features, their generality and the degree of employment of domain-specific knowledge. A large number of FW techniques are available in the literature, both classical and recent (for example, [PV06, FI08]). The best known group is the family of Relief-based algorithms [KR92], with ReliefF [Kon94] as its forerunner for tackling the FW problem.
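For reference, the following sketch shows how a set of learned feature weights typically enters the NN similarity computation, via a weighted Euclidean distance (the weight values are placeholders, not the output of any particular FW method):

    import numpy as np

    def weighted_distance(a, b, weights):
        """Euclidean distance in which each feature contributes according to its weight."""
        return np.sqrt(np.sum(weights * (a - b) ** 2))

    # Hypothetical weights learned by an FW method: feature 0 dominates the
    # decision, while feature 2 is effectively ignored.
    weights = np.array([0.9, 0.4, 0.0])
    d = weighted_distance(np.array([1.0, 2.0, 3.0]),
                          np.array([0.5, 2.5, 9.0]), weights)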
2.2 Evolutionary algorithms
Evolutionary algorithms (EAs) are techniques inspired by natural computation [GJ05, PF09] that have arisen as very competitive methods in the last decade [ES08]. They are a set of metaheuristics designed to tackle search and optimization problems. Many different evolutionary approaches have been published, all with a common way of working: evolving a set of solutions by applying different operators that modify those solutions until the process reaches a stopping criterion. The benefits of using EAs come from their flexibility and their fit to the objective target, in combination with robust behavior. Nowadays, EAs are considered very adaptable tools for solving complex optimization and search problems.
EAs have several features that make them very attractive for data mining [Fre02]. For example, they are able to deal with both binary and real-valued problems, as long as these can be formulated as a search procedure. These are the main reasons why they are used for upgrading and adjusting many different data mining algorithms. Recently, a great number of works have developed new techniques for data mining using EAs, applying them to different data mining tasks such as feature extraction, feature selection, classification and clustering [EVH10, KM99]. The main role of EAs in most of these approaches is optimization: they are used to improve the robustness and accuracy of some of the traditional data mining techniques.
EAs have also been applied to improve the NN classifier, acting as PS, PG, FS or FW algorithms. The first attempts in PS correspond to the papers [Kun95, KB98], aiming to optimize the accuracy obtained as well as the reduction achieved [CHL03]. Advanced works on evolutionary PS are [GCH08, DGH10b]. In terms of FS, a wide variety of evolutionary methods have also been proposed, acting as wrapper methods [KJ97, OLM04]. Both PS and FS algorithms encode the solutions as binary arrays that represent the selection performed.
EAs for PG are based on the positioning adjustment of prototypes, which is a suitable methodology for optimizing the position of a set of prototypes. Several proposals have been presented on this topic, such as an artificial immune model [Gar08] or particle swarm optimization [CGI09]. Many works in FW are related to clustering [GB08]; however, FW for NN classification can also be modeled with evolutionary approaches [DTGH12]. Evolutionary PG and FW approaches follow a real-coded representation.
Different kinds of EAs have been developed over the years, such as genetic algorithms, genetic programming, evolution strategies, evolutionary programming, differential evolution, cultural evolution algorithms and co-evolutionary algorithms. Among them, in this thesis we will focus on differential evolution as a modern real-coded algorithm [SP97, PSL05].
Differential evolution (DE) follows the general procedure of an EA, searching for a global optimum in a D-dimensional real parameter space. It works through a cycle of stages, performed for a given number of generations. Figure 5 presents the general structure of the DE algorithm. DE starts with a population of NP candidate solutions, commonly called individuals and represented as vectors.
Initially, the population should cover the entire search space as much as possible. In most of the
problems, this is achieved by uniformly randomizing individuals within the search space constrained
by the prescribed minimum and maximum bounds of each variable.
After initialization, DE applies the mutation operator to generate a mutant vector with respect to each individual in the current population. For each individual (denoted the "target vector"), its associated mutant vector is created through a differential mutation operation that is scaled by a parameter F. The method of creating this mutant vector is what differentiates one DE scheme from another. One of the simplest forms of DE mutation works as follows: three distinct individuals (all different from the target vector) are randomly sampled from the current population; the difference of two of these three vectors is scaled by the parameter F, and the scaled difference is added to the third one, whence we obtain the mutant vector.

Figure 5: Main stages of differential evolution
To enhance the diversity of the population, the crossover operator comes into play after the mutant vector has been generated. The DE algorithm can use three kinds of crossover schemes, known as binomial, exponential and arithmetic crossover. This operator is applied to each pair formed by a target vector and its corresponding mutant vector to generate a new vector (the "trial vector"). This operation is controlled by a crossover rate parameter CR.
To keep the population size constant over the subsequent generations, the DE algorithm performs a selection of the most promising vectors. In particular, it determines whether the target or the trial vector survives to the next generation. If the new trial vector yields a solution equal to or better than the target vector, it replaces the corresponding target vector in the next generation; otherwise, the target is retained in the population. Therefore, the population always improves or retains the same fitness values, but never deteriorates. This one-to-one selection procedure is generally kept fixed in most DE algorithms.
The success of DE in solving a specific problem crucially depends on choosing the appropriate mutation strategy and its associated control parameter values (F and CR), which determine the convergence speed. A fixed selection of these parameters can hence produce slow and/or premature convergence, depending on the problem. Thus, researchers have investigated parameter adaptation mechanisms to improve the performance of the basic DE algorithm [QHS09, DACK09, ZS09]. A good review on DE can be found in [DS11].
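The classical DE/rand/1 scheme with binomial crossover described above can be sketched as follows (a minimal illustration with fixed F and CR, for a minimization problem; the function names and the example fitness are ours):

    import numpy as np

    rng = np.random.default_rng(0)

    def de_rand_1_bin(fitness, bounds, NP=20, F=0.5, CR=0.9, generations=100):
        """Minimal DE: differential mutation, binomial crossover, one-to-one selection."""
        D = len(bounds)
        low, high = np.array(bounds).T
        pop = rng.uniform(low, high, size=(NP, D))    # uniform initialization
        fit = np.array([fitness(x) for x in pop])
        for _ in range(generations):
            for i in range(NP):
                # Mutation: three distinct individuals, all different from target i
                r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], 3, replace=False)
                mutant = pop[r1] + F * (pop[r2] - pop[r3])
                # Binomial crossover: each gene comes from the mutant with probability CR;
                # one randomly chosen gene is always taken from the mutant
                mask = rng.random(D) < CR
                mask[rng.integers(D)] = True
                trial = np.clip(np.where(mask, mutant, pop[i]), low, high)
                # One-to-one selection: the trial replaces the target if it is not worse
                f_trial = fitness(trial)
                if f_trial <= fit[i]:
                    pop[i], fit[i] = trial, f_trial
        return pop[np.argmin(fit)]

    # Example: minimize the sphere function in 5 dimensions
    best = de_rand_1_bin(lambda x: float(np.sum(x ** 2)), bounds=[(-5, 5)] * 5)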
2.3 Semi-supervised classification
Nowadays, the use of unlabeled data in conjunction with labeled data is a growing field in different research lines, ranging from bioinformatics to Web mining. In these areas, it is easier to obtain unlabeled than labeled data, because it requires less effort, expertise and time. In this context, traditional supervised approaches are limited to using labeled data to build a model. Semi-Supervised Learning (SSL) is the learning paradigm concerned with the design of models in the presence of both labeled and unlabeled data. Essentially, SSL methods use unlabeled samples to either modify or reprioritize the hypothesis obtained from labeled samples alone.
SSL is an extension of unsupervised and supervised learning that includes additional information typical of the other learning paradigm. Depending on the main objective of the methods, we can divide SSL into Semi-Supervised Classification (SSC) [CSZ06] and semi-supervised clustering [Ped85]. The former focuses on enhancing supervised classification by minimizing errors in the labeled examples while remaining compatible with the input distribution of the unlabeled instances. The latter, also known as constrained clustering, aims to obtain better-defined clusters than those obtained from unlabeled data alone. In this thesis, we focus on SSC.
SSC can be categorized into two slightly different settings [CW11], denoted transductive and
inductive learning. On the one hand, transductive learning concerns the problem of predicting
the labels of the unlabeled examples, given in advance, by taking both labeled and unlabeled data
together into account to train a classifier. On the other hand, inductive learning considers the given
labeled and unlabeled data as the training examples, and its objective is to predict unseen data.
Many different approaches have been suggested and studied in order to classify using unlabeled
data in SSC. Existing SSC algorithms are usually classified depending on the conjectures they
make about the relation of labeled and unlabeled data distributions. Broadly speaking, they are
based on the manifold and/or cluster assumption. The manifold assumption is satisfied if data lie
approximately on a manifold of lower dimensionality than the input space. The cluster assumption
states that similar examples should have the same label, so classes are well-separated and do not
cut through dense unlabeled data.
We group the following four methodologies according to Zhu and Goldberg's book [ZG09]:
• Graph-based: These methods represent the SSC problem as a graph min-cut problem [BC01], following the manifold assumption. Labeled and unlabeled examples constitute the graph nodes, and the similarity measurements between nodes correspond to the graph edges. The graph construction determines the behavior of this kind of algorithm [Joa03, BNS06]. These methods usually assume label smoothness over the graph. They are nonparametric, discriminative and transductive in nature. Advanced proposals can be found in [XWT11, WJC13].
• Generative models and cluster-then-label methods: The first attempts to deal with unlabeled data correspond to this area (based on the cluster assumption). It includes those methods that assume a joint probability model p(x, y) = p(y)p(x|y), where p(x|y) is an identifiable mixture distribution, for example a Gaussian mixture model. Hence, they follow a given parametric model using both unlabeled and labeled data. Cluster-then-label methods are closely related to generative models. Instead of using a probabilistic model, they apply a preliminary clustering step to the whole data set, and then label each cluster with the help of the labeled data. Recent advances in these topics are [FUS08, TH10].
• Semi-Supervised Support Vector Machines (S3VM): S3VM is an extension of standard Support Vector Machines (SVM) to unlabeled data. This approach also implements the cluster assumption. The methodology is also known as transductive SVM, although it learns an inductive rule defined over the whole search space. Advanced works on S3VM are [CSK08, AC10].
• Self-labeled methods: They form an important family of methods in SSC. They are not intrinsically geared to learning in the presence of both labeled and unlabeled data; rather, they use unlabeled points within a supervised learning paradigm. These techniques aim to obtain one (or several) enlarged labeled set(s), based on the most reliable predictions. Thus, these models do not make any specific assumptions about the input data, but they accept that their own predictions tend to be correct. Some authors state that self-labeling is likely to work well when the classes form well-separated clusters [ZG09] (cluster assumption).
In this thesis, we will focus on self-labeled methods. The major benefits of this family of methods are its simplicity and its wrapper nature. The former relates to ease of implementation and applicability. The latter means that any kind of classifier can be used, regardless of its complexity, which is very important depending on the problem tackled. As a caveat, the addition of wrongly labeled examples during the self-labeling process can lead to even worse performance. Several mechanisms have been proposed to mitigate this problem [LZ05].
A preeminent work with this philosophy is the self-training paradigm designed by Yarowsky [Yar95]. In self-training, a supervised classifier is initially trained with the labeled set L. It is then retrained with its own most confident predictions, enlarging its labeled training set (a minimal sketch of this loop is given after this paragraph). Thus, it is defined as a wrapper method for SSC. This idea was later extended by Blum and Mitchell [BM98] with the method known as co-training, which consists of two classifiers that are trained on two sufficient and redundant sets of attributes. This requirement implies that each subset of features should be able to perfectly define the frontiers between classes. The method then follows a mutual teaching procedure that works as follows: each classifier labels the examples it predicts most confidently from its point of view, and these are added to the L set of the other classifier. It is also known that its usefulness is constrained by the imposed requirement [DLM01], which is not satisfied in many real applications. Nevertheless, this method has become a template for recent models, thanks to the idea of using the agreement (or disagreement) of multiple classifiers and the mutual teaching approach. A good study of when co-training works can be found in [DLZ10].
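The self-training loop mentioned above might be sketched as follows, assuming a scikit-learn-style classifier with fit/predict_proba; the confidence threshold and iteration limit are placeholders, not the settings used in the thesis:

    import numpy as np

    def self_training(clf, X_label, y_label, X_unlabel, threshold=0.95, max_iter=10):
        """Iteratively enlarge the labeled set with the most confident predictions."""
        X_l, y_l, X_u = X_label.copy(), y_label.copy(), X_unlabel.copy()
        for _ in range(max_iter):
            if len(X_u) == 0:
                break
            clf.fit(X_l, y_l)
            proba = clf.predict_proba(X_u)
            confident = proba.max(axis=1) >= threshold   # most reliable predictions
            if not confident.any():
                break
            # Move the confident examples, with their predicted labels, into L
            pred = clf.classes_[proba[confident].argmax(axis=1)]
            X_l = np.vstack([X_l, X_u[confident]])
            y_l = np.concatenate([y_l, pred])
            X_u = X_u[~confident]
        return clf.fit(X_l, y_l)

Note that any wrongly labeled example added in an early iteration is never revisited, which is exactly the noise-accumulation weakness discussed later in this thesis.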
Due to the success of co-training, and given its relatively limited applicability, many works have proposed improving standard co-training by removing the established conditions. In [GZ00], the authors proposed a multi-learning approach in which two different supervised learning algorithms were used without splitting the feature space. They showed that this mechanism divides the instance space into a set of equivalence classes. Later, the same authors proposed a faster and more precise alternative, named Democratic co-learning (Democratic-Co) [ZG04], which is also based on multi-learning. As an alternative requiring neither sufficient and redundant views nor several supervised learning algorithms, Zhou and Li [ZL05] presented the Tri-Training algorithm, which attempts to determine the most reliable unlabeled data as the agreement of three classifiers (same learning algorithm). They then proposed the Co-Forest algorithm [LZ07], a similar approach that uses Random Forest. A further similar approach is Co-Bagging [HS10, HSP10], where confidence is estimated from the local accuracy of committee members. Other recent self-labeled approaches are [YC10, HYGL10, SZ11, HGG13].
In summary, all of these recent schemes work on the hypothesis that several weak classifiers, learned with a small number of instances, can produce better generalizations than a single weak classifier. These methods are also known as disagreement-based models and are motivated, in part, by the empirical success of ensemble learning. The term disagreement-based was recently coined by Zhou and Li [ZL10].
2.4 Big data
The rapid development of information technologies brings new challenges for collecting and analyzing vast amounts of data. The size of data sets is growing exponentially because, nowadays, we have numerous sources gathering new information (such as mobile devices, software logs, cameras, and so on), as well as better techniques to collect it. As a numerical example, in 2010 Facebook had 21 PetaBytes of internal warehouse data, with 12 TB of new data added every day and 800 TB of compressed data scanned daily [TSA+ 10].
The term big data refers to data sets that are so large that their collection and processing become very difficult for most data processing techniques. Formally, the big data problem can be defined as the quantity of data that exceeds the processing capabilities of a given system [MCD13] in terms of time and/or memory consumption. The big data problem has also been described with four terms: Volume, Velocity, Variety and Veracity (the 4Vs model [Lan01, LdRBH14]). Volume and velocity refer to the amount of data that has to be processed or stored and how quickly it will be analyzed. Variety relates to the different types of data and their structure. Finally, veracity is associated with data integrity and the trust in the information used to make decisions.
Big data applications are attracting much attention in a wide variety of areas, such as industry, medicine or financial businesses, which have progressively acquired a lot of raw data. With the availability of cloud platforms [PBA+ 08], they could take advantage of these massive data sets by extracting valuable information. However, analysis and knowledge extraction from big data become very difficult tasks for most classical and advanced data mining and machine learning tools.
Therefore, data mining techniques should be adapted to the emerging technologies to overcome their limitations. Among other solutions, the MapReduce framework [DG08, DG10], in conjunction with its distributed file system [GGL03], originally introduced by Google, offers a simple but robust environment for processing large data sets over a cluster of machines. This scheme is currently preferred in data mining over other parallelization schemes, such as MPI (Message Passing Interface) [SO98], because of its fault-tolerant mechanism, which is crucial for time-consuming jobs, and because of its simplicity.
MapReduce is a parallel programming paradigm designed to process or generate large data sets regardless of the underlying hardware or software. Based on functional programming, this model works in two different steps: the map phase and the reduce phase. Each one has key-value (< k, v >) pairs as input and output. The map phase takes each < k, v > pair and generates a set of intermediate < k, v > pairs. Then, MapReduce merges all the values associated with the same intermediate key into a list (known as the shuffle phase). The reduce phase takes that list as input to produce the final values. Figure 6 depicts a flowchart of the MapReduce framework. In a MapReduce program, all map and reduce operations run in parallel. First, all map functions run independently. Meanwhile, reduce operations wait until their respective maps have finished; then, they process different keys concurrently and independently. Note that the inputs and outputs of a MapReduce job are stored in an associated distributed file system that is accessible from any computer of the cluster used.
An illustrative example of the way MapReduce works is finding the average cost per year from a big list of cost records. Each record may be composed of a variety of values, but it includes at least the year and the cost. The map function extracts the pair < year, cost > from each record and emits it as its output. The shuffle stage groups the < year, cost > pairs by year, creating a list of costs per year, < year, list(cost) >. Finally, the reduce phase computes the average of all the costs contained in the list of each year.

Figure 6: MapReduce flowchart
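This example can be sketched with plain functions that mimic the map/shuffle/reduce contract (a single-machine simulation of the paradigm, not Hadoop code; the record fields are illustrative):

    from collections import defaultdict

    def map_fn(record):
        """Emit an intermediate <year, cost> pair for each cost record."""
        yield record["year"], record["cost"]

    def reduce_fn(year, costs):
        """Produce the final <year, average cost> value from the grouped list."""
        return year, sum(costs) / len(costs)

    def run_job(records):
        # Shuffle: group all intermediate values by their key
        grouped = defaultdict(list)
        for record in records:
            for year, cost in map_fn(record):
                grouped[year].append(cost)
        return dict(reduce_fn(year, costs) for year, costs in grouped.items())

    averages = run_job([{"year": 2010, "cost": 4.0},
                        {"year": 2010, "cost": 6.0},
                        {"year": 2011, "cost": 3.0}])
    # averages == {2010: 5.0, 2011: 3.0}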
Different implementations of the MapReduce framework are possible [DG08], depending on the available cluster architecture. Some implementations of MapReduce are Mars [HFL+ 08], Phoenix [TYK11] and Apache Hadoop [Whi12, Pro13a]. We will focus on the Hadoop implementation because of its performance, open source nature, ease of installation and its distributed file system (the Hadoop Distributed File System, HDFS). A Hadoop cluster follows a master-slave architecture, in which one master node manages an arbitrary number of slave nodes. The HDFS replicates file data across multiple storage nodes that can access the data concurrently. In such a cluster, a certain percentage of the slave nodes may be out of order temporarily. For this reason, Hadoop provides a fault-tolerant mechanism, such that when one node fails, Hadoop automatically restarts the task on another node.
In the specialized literature, several recent proposals have focused on the parallelization of machine learning tools based on the MapReduce approach [ZMH09, SFJ12]. For example, some classification techniques [HDW+ 11, PR12, CLL13] have been implemented within the MapReduce paradigm. They have shown that distributing the data and the processing under a cloud computing infrastructure is very useful for speeding up the knowledge extraction process. In fact, there is a growing open source project, called Apache Mahout [Pro13b], that collects distributed and scalable machine learning algorithms implemented on top of Hadoop. Nowadays, it supplies implementations of several specific techniques, such as k-means clustering, a naive Bayes classifier, collaborative filtering, etc.
Data reduction techniques, such as PG, PS, FS or FW, should help data mining algorithms to address big data problems. However, these methods are also affected by the increase in the size and complexity of data sets, and they become unable to provide a preprocessed data set in a reasonable time. Therefore, they should also be adapted to the new technologies.
3. Justification
As explained in the previous sections, instance generation methods are very useful tools for reducing the size of training data sets in order to improve the performance of data mining techniques (e.g. prototype generation for the nearest neighbor rule). Advanced evolutionary techniques are optimization models that may provide a potential enhancement to generate appropriate sets of representative examples.
To adopt instance/prototype generation models as outstanding data reduction techniques, the
following issues should be taken into consideration:
• In supervised classification, a great number of prototype generation techniques have been proposed in the literature. Some of them are based on evolutionary techniques, with promising results. However, this field remains relatively unexplored: it is frequently confused with prototype selection and its main drawbacks are not well known. To tackle the design of new prototype reduction models, we consider that:
– First of all, it is necessary to have a deep knowledge of the current state of the art, analyzing in detail the main strengths and weaknesses of these models and comparing their performance theoretically and empirically. After that, the main drawbacks of existing prototype generation techniques should be addressed. These issues could be solved through several strategies based on evolutionary algorithms.
– Another interesting trend is the hybridization of different data reduction algorithms into a single algorithm. In this sense, the performance of prototype generation techniques could be improved even further by taking the feature space into consideration, or by relying on the simplicity of prototype selection methods to accelerate the convergence of these techniques.
– Finally, it is known that data mining and data reduction techniques lack the scaling-up capabilities needed to tackle big data problems. Therefore, the study and design of scalable mining algorithms will be needed.
• Until now, prototype generation techniques have focused on supervised contexts, in which all the available training data are labeled, aiming to find the smallest reduced set that represents the original labeled set. However, they can also be useful in other fields, such as self-labeling semi-supervised classification, where labeled data are sparse and scattered, to provide new synthetic data or to identify reliable examples. Thus, the following issues should be addressed:
– Self-labeling semi-supervised learning is a growing field that permits us to tackle the shortage of labeled examples with supervised models. A thorough study of this family of algorithms is essential in order to discern their advantages and disadvantages.
– The previous study will allow us to understand how self-labeling methods can be improved with the aid of prototype generation algorithms. The generation of synthetic data and the detection of noisy examples in the field of semi-supervised learning may be a good way to fill in labeled data regions and avoid the introduction of noisy data.
All these issues refer to the main topic of this thesis: the development of new prototype generation models for supervised and semi-supervised classification through evolutionary approaches.
4. Objectives
After studying the current state of all the areas described in the previous sections, it is possible to focus on the actual objectives of this thesis. They include the research and analysis of the background fields described before, and the development of advanced models for prototype generation in supervised and semi-supervised contexts, based on the most promising properties of each field.

More specifically, two main objectives motivate the present thesis: the analysis, design and implementation of evolutionary prototype generation techniques for (1) supervised and (2) semi-supervised learning. In what follows, we elaborate the sub-objectives that form each one.
• Prototype generation for supervised classification.
– To study the current state of the art in prototype generation. A theoretical and empirical study of the state of the art in the field of prototype generation, in order to categorize existing trends and discover the strengths and weaknesses of each family of methods. To the best of our knowledge, there is currently no general categorization of these techniques that establishes an overview of the methods proposed in the literature. The goal of this study is to provide guidelines about the application of these techniques, allowing a broad readership to differentiate between techniques and make appropriate decisions about the most suitable method for a given type of problem. Moreover, this step will be our starting point for the next objectives.
– To provide new evolutionary prototype generation models. After analyzing the latest works published on prototype generation, our objective is to develop new evolutionary prototype generation models that overcome the known issues and limitations of the current state of the art. Therefore, the aim is to design more accurate models with higher reduction rates by using better evolutionary approaches. To do so, we will rely on the success of the differential evolution algorithm in real-coded problems.
– To combine the previous prototype generation models with other data reduction approaches. The models previously designed can be improved even further if other data reduction techniques, such as prototype selection and feature weighting, are considered, establishing a cooperation between different data preprocessing tasks within a single algorithm. To address this objective, we will study two possibilities: the first is to design hybrid prototype selection and generation algorithms, whereas the second will focus on the combination with an evolutionary feature weighting approach.
– To enable prototype reduction models to be applied to big data sets. Given that the application of prototype reduction techniques is not feasible on big data sets in terms of runtime and memory consumption, we aim to develop new algorithmic strategies, based on the emerging cloud technologies, that allow prototype reduction to be applied without major algorithmic modifications of the original proposals.
• Self-labeling with prototype generation and selection for semi-supervised classification.
– To review the state of the art in self-labeling semi-supervised classification techniques. A survey of the state of the art in self-labeling semi-supervised algorithms, to gain a full understanding of their capabilities. At the time of writing this thesis, there is no taxonomy of these methods. Our goal is to analyze their main strengths and drawbacks, in order to discover how prototype generation algorithms can be useful in this field.
– To develop new self-labeling approaches with the aid of prototype generation or selection models. Given the complex scenario of semi-supervised learning, in which the number of labeled examples is very small, we will focus on the application of prototype generation and selection algorithms to alleviate its main drawbacks. Two research lines will be established: the first will deal with the removal of noisy labeled and unlabeled examples that can be wrongly added during the self-labeling process, while the second will be based on the generation of new synthetic labeled data.
5. Summary
This thesis is composed of eight works, organized into two main parts. Each part is devoted to pursuing one of the objectives, and the respective sub-objectives, described above.
• Prototype Generation for Supervised Learning:
– A Review on Prototype Generation.
– New Prototype Generation Methods based on Differential Evolution.
– Integrating Prototype Selection and Feature Weighting within Prototype Generation.
– Enabling Prototype Reduction Models to deal with Big Data Classification.
• Self-labeling with Prototype Generation/Selection for Semi-Supervised Classification:
– A Survey on Self-labeling Semi-Supervised Classification.
– New Self-labeling Approaches Aided by Prototype Generation/Selection Models.
This section presents a summary of the different proposals of this dissertation, according to the two objectives pursued (Section 5.1 and Section 5.2, respectively). In each section, we describe the associated publications and their main contents.
5.1 Prototype Generation for Supervised Learning
This subsection encloses all the works related to the first part of this thesis, devoted to the study and development of PG algorithms for supervised learning. Section 5.1.1 summarizes the review performed on PG. Section 5.1.2 presents the proposed evolutionary model for PG. Then, Section 5.1.3 briefly explains the proposed schemes to integrate PG with other data reduction approaches. Finally, Section 5.1.4 presents a big data solution for prototype reduction approaches.
5.1.1 A review on Prototype Generation
The NN rule has been shown to perform well in many different classification and pattern recognition tasks. However, it suffers from several shortcomings: time response, noise sensitivity, high storage requirements and dependence on the existing data to make predictions. Several approaches have been suggested and studied in order to tackle these drawbacks. Among them, prototype reduction models reduce the training data used for classification. The PS process consists of choosing a subset of the original training data, whereas PG builds new artificial prototypes to increase the accuracy of NN classification.
PS has been widely studied in the literature [GDCH12]; however, although they each relate to different problems, PG algorithms are commonly confused with PS. Moreover, at the time of writing this thesis, there was no general categorization or taxonomy for these kinds of methods, even though more than 24 techniques had been proposed. For these reasons, we performed an exhaustive survey on this topic. From a theoretical point of view, we proposed a taxonomy based on the main characteristics of these methods. From an empirical point of view, we conducted a wide experimental study measuring their performance in terms of accuracy and reduction capabilities.
We identified the main characteristics of PG methods. They include the type of reduction performed (incremental, decremental, fixed or mixed), the kind of resulting generated set (condensed, edited or hybrid), the generation mechanism (positioning adjustment [NL09], class re-labeling [SBM+ 03], centroid-based [FHA07] and space-splitting [Sán04]) and the way in which they evaluate the search (filter, semi-wrapper and wrapper). According to these characteristics, we classified them into several families, from the generation heuristic followed down to the reduction type. Moreover, some criteria to compare these kinds of methods are explained, as well as related and advanced work.
The experimental study involved a great number of problems (59), differentiating between small/large data sets, numerical/nominal/mixed data sets and binary/multi-class problems. Finally, we included a visualization section to illustrate the way PG methods work on a 2-dimensional data set.
The journal article associated with this part is:
• I. Triguero, J. Derrac, S. Garcı́a, F. Herrera, A Taxonomy and Experimental Study on
Prototype Generation for Nearest Neighbor Classification. IEEE Transactions on Systems, Man, and Cybernetics–Part C: Applications and Reviews 42 (1) (2012) 86–100, doi:
10.1109/TSMCC.2010.2103939.
5.1.2 New Prototype Generation Methods based on Differential Evolution
The family of positioning adjustment of prototypes stands out as a successful trend within the PG methodology. The aim of these techniques is to correct the position of a subset of prototypes from the initial set by using a (real-coded) optimization procedure. Many proposals belong to this family, such as learning vector quantization [Koh90] and its successive improvements [LMYW05, KO03b], genetic algorithms [FI04] and particle swarm optimization [NL09, CGI09].
Most of the existing positioning adjustment techniques start with an initial set of prototypes and try to improve the classification accuracy by adjusting it. Two initialization schemes are commonly used:
• The number of representative instances for each class is proportional to the number of instances of that class in the input data.
• All the classes are represented by the same number of prototypes.
This initialization process becomes their main drawback, since this parameter can be very dependent on the problem tackled. Some PG approaches [FI04, LMYW05] automatically compute the number of prototypes to be retained but, in complex domains, they need to retain many prototypes.
To address these limitations, we propose a novel evolutionary procedure to automatically find the smallest reduced set that is able to achieve suitable classification accuracy over different types of problems. This method follows an iterative prototype adjustment scheme with an incremental approach. At each step, an optimization procedure is used to adjust the position of the prototypes, and the method adds new prototypes if needed. As a second contribution of this work, we adopted the differential evolution technique [SP97, PSL05] as the optimizer. Specifically, we used a self-adaptive differential evolution algorithm named SFLSDE [NT09] to avoid the convergence problems related to fixed parameters. Our proposal is denoted Iterative Prototype Adjustment based on Differential Evolution (IPADE).
Among other characteristics of our evolutionary proposal, we would like to highlight the way in which it codifies the individuals: each individual in the population encodes a single prototype, so that the whole population forms the resulting reduced set.
To contrast the behavior of our proposal, we conducted experiments on a great number of real-world data sets; the classification accuracy and reduction rate of our approach were investigated, and its performance was compared with classical and recent PG models.
The journal article associated with this part is:
• I. Triguero, S. Garcı́a, F. Herrera, IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification. IEEE Transactions on Neural Networks 21 (12) (2010) 1984-1990, doi:
10.1109/TNN.2010.2087415.
An extension of this work was presented in the following conference paper:
• I. Triguero, S. Garcı́a, F. Herrera, Enhancing IPADE Algorithm with a Different Individual Codification. 6th International Conference on Hybrid Artificial Intelligence Systems
(HAIS2011). Wroclaw, Poland, 23-25 May 2011, LNAI 6679, pp. 262–270
This extension consisted of a new individual codification that allowed the IPADE approach to improve its accuracy even further. Concretely, each individual codified a complete reduced set. We denoted this algorithm IPADECS.
5.1.3 Integrating Prototype Selection and Feature Weighting within Prototype Generation
The hybridization of techniques has become a very useful tool in the development of new advanced data reduction models. It is common to use classical PS methods in the early or late stages of a PG algorithm as mechanisms for removing noisy or redundant prototypes. For example, some PG methods implement the ENN or DROP algorithms as early filtering processes [LKL02, Sán04] and, in [KO03b], a hybridization method based on LVQ3 post-processing of conventional prototype reduction approaches is proposed. The two approaches reviewed here are hybrid algorithms that combine the efforts of several data preprocessing approaches at once.
In the first proposal, we combine PG with a previous PS step, aiming to improve the performance of positioning adjustment algorithms. Several issues motivate the combination of PS and PG:
• PS algorithms assume that the best representative examples can be obtained from a subset
of the original data, whereas PG methods generate new representative examples if needed.
• PG methods pose a more complex problem than PS, so finding a promising solution requires a higher cost for positioning adjustment methods.
• Determining the number of instances per class is not straightforward for PG approaches.
To hybridize both methodologies, we perform a preliminary PS stage before the adjustment process, to initialize a subset of prototypes. With this idea, we mitigate the complexity of positioning adjustment methods, because we provide a promising initial solution to the PG technique. Note also that PS methods are not forced to select a predetermined number of prototypes of each class; they select the most suitable number of prototypes per class. In addition, if the selected prototypes can be tuned in the search space, the main drawback associated with PS is also overcome.
To understand how the proposed hybrid model can improve the classification accuracy of isolated PS and PG methods, we analyzed the combination of several PS methods (DROP3 [WM00], ICF [BM02] and SSMA [GCH08]) and generation algorithms (LVQ3 [Koh90], PSO [NL09] and a proposed differential evolution algorithm). Moreover, we also analyzed several adaptive differential evolution schemes, such as SADE [QHS09], DEGL [DACK09], JADE [ZS09] and SFLSDE [NT09], and a wide variety of mutation/crossover operators for PG. As a result, we found that the model composed of SSMA and SFLSDE was the best performing approach (denoted SSMA-SFLSDE).
In the second proposal, we develop a hybrid FW approach with PG and PS. To do so, we first design a differential evolution FW scheme that is also based on the self-adaptive SFLSDE algorithm [NT09]. The aim of this FW algorithm is to provide a set of optimal feature weights for a given set of prototypes, maximizing the accuracy obtained with the NN classifier. It is then hybridized with two different prototype reduction approaches, IPADECS and the hybrid SSMA-SFLSDE, denoting the resulting hybrid models IPADECS-DEFW and SSMA-DEPGFW, respectively. Note that the hybridization process differs between the two models, given that IPADECS is a pure PG algorithm, whereas SSMA-SFLSDE combines PS and PG.
As an additional study, we also analyze the scaling-up problem of prototype reduction when dealing with large data sets (up to 300,000 instances). To tackle this problem, we propose several strategies, based on stratification [CHL05], to apply the proposed hybrid approach to large problems.
To test the proposed hybrid scheme in comparison with other FW approaches and with isolated prototype reduction methods, we performed a wide experimental study with many different data sets.
The journal articles associated with this part are:
• I. Triguero, S. Garcı́a, F. Herrera, Differential Evolution for Optimizing the Positioning of
Prototypes in Nearest Neighbor Classification. Pattern Recognition 44 (4) (2011) 901-916,
doi: 10.1016/j.patcog.2010.10.020.
• I. Triguero, J. Derrac, S. Garcı́a, F. Herrera, Integrating a Differential Evolution Feature
Weighting scheme into Prototype Generation. Neurocomputing 97 (2012) 332-343, doi:
10.1016/j.neucom.2012.06.009.
5.1.4 Enabling Prototype Reduction Models to deal with Big Data Classification
Nowadays, analyzing and extracting knowledge from large-scale data sets is a very challenging task. Although data reduction techniques should help data mining algorithms to tackle big data problems, these methods are also affected by the increase in the size and complexity of data sets, and they become unable to provide a preprocessed data set in a reasonable time. Hence, a new class of scalable data reduction methods that embraces the huge storage and processing capacity of cloud platforms is required.
With these issues in mind, in this part of the thesis we focused on the development of several strategies to give prototype reduction methods the capacity to deal with big data problems. Several solutions had previously been developed to enable data reduction techniques to deal with this problem. For prototype reduction, we can find a data-level approach based on a distributed partitioning model that maintains the class distribution (also called stratification). This splits the original training data into several subsets that are addressed individually, and then joins each partial reduced set into a global solution; a sketch of this partitioning idea is given below.
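A minimal sketch of such a stratified split follows (the reduction step applied to each partition is left abstract; the function name is ours):

    import numpy as np
    from collections import defaultdict

    def stratified_partitions(X, y, n_parts, seed=0):
        """Split (X, y) into n_parts subsets that preserve the class distribution."""
        rng = np.random.default_rng(seed)
        parts = defaultdict(list)
        for label in np.unique(y):
            idx = rng.permutation(np.where(y == label)[0])
            # Deal the shuffled indices of each class round-robin over the partitions
            for i, j in enumerate(idx):
                parts[i % n_parts].append(j)
        return [np.array(p) for p in parts.values()]

    # Each partition is then reduced independently, and the partial reduced sets
    # are joined into a single global reduced set.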
This approach had been used for PS in [CHL05, DGH10b]. We extended it to PG in the
following conference paper:
• I. Triguero, J. Derrac, S. Garcı́a, F. Herrera, A Study of the Scaling up Capabilities of
Stratified Prototype Generation. Third World Congress on Nature and Biologically Inspired
Computing (NABIC’11), Salamanca (Spain), pp. 304-309, October 19-21, 2011
This scheme provided promising results in enabling PG and PS methods to be applied to large data sets. However, two main problems arise when the size of the data is greatly increased:
• A stratified partitioning process cannot be carried out when the data set is so big that it occupies all the available RAM.
• This scheme does not consider that joining the partial solutions into a global one could generate a reduced set with redundant or noisy instances that may damage the classification performance.
Aiming to handle both drawbacks, we proposed a novel distributed partitioning framework relying on the success of the MapReduce approach [DG08]. We denoted this framework "MapReduce for Prototype Reduction" (MRPR). The map and reduce phases were carefully designed to perform a proper data reduction process. Specifically, the map phase splits the original training set into several subsets that are addressed individually by applying the prototype reduction technique. In the reduce stage, we integrate the multiple partial solutions (reduced sets of prototypes) into a single one. To do this, we propose an iterative filtering and fusion of prototypes as part of the reduce phase. We analyzed different strategies, of varying computational effort, for the integration of the partial solutions generated by the mappers.
We analyzed the training and test accuracy, runtime and reduction capabilities of prototype reduction techniques under the proposed framework. Several variations of the proposed model were investigated with different numbers of mappers and four big data sets of up to 5.7 million instances. As prototype reduction techniques, we focused on the hybrid SSMA-SFLSDE algorithm previously proposed, as well as two PS methods (FCNN [Ang07] and DROP3 [WM00]) and two PG methods (LVQ3 [Koh90] and RSP3 [Sán04]).
The journal article associated with this part is:
• I. Triguero, D. Peralta, J. Bacardit, S. Garcı́a, F. Herrera, MRPR: A MapReduce Solution
for Prototype Reduction in Big Data Classification. Submitted to Neurocomputing.
5.2 Self-labeling with prototype generation/selection for semi-supervised classification
This subsection presents a summary of the works related to the second part of this thesis, devoted to analyzing and designing new self-labeling approaches with PG algorithms.
5.2.1 A Survey on Self-labeling Semi-Supervised Classification
Self-labeling methods are appropriate tools to tackle problems with large amounts of unlabeled data and a small quantity of labeled data. Unlike other semi-supervised learning approaches, these techniques do not make any specific assumptions about the characteristics of the input data. Based on traditional supervised classification algorithms, self-labeling methods follow an iterative procedure in which they accept that the predictions performed by supervised methods tend to be correct.
These techniques aim to obtain an enlarged labeled set, based on their most confident predictions, to classify unlabeled data. In the literature [ZG09], self-labeled techniques are typically divided into self-training [Yar95] and co-training [BM98]. In the former, a classifier is trained with an initial small number of labeled examples and then retrained with its own most confident predictions, enlarging its labeled training set. The latter splits the feature space into two different, conditionally independent views [DLZ10], training one classifier on each view and having them teach each other the most confidently predicted examples. Multi-view learning for semi-supervised classification is a generalization of co-training that requires neither explicit feature splits nor the iterative mutual-teaching procedure [ZL05, LZ07]. However, these concepts are scattered and frequently confused in the literature.
There was no general categorization focused on self-labeled techniques. In the literature, we can find general SSL surveys [Zhu05], but they are not exclusively focused on self-labeled techniques, nor especially on studying the similarities among them. For these reasons, we performed a survey of self-labeled methods. Firstly, we proposed a taxonomy based on their main characteristics. Secondly, we conducted an exhaustive study, involving a large number of data sets with different ratios of labeled data, aiming to measure their performance in terms of transductive and inductive classification capabilities.
We conducted experiments involving a great number of data sets with different ratios of labeled data: 10%, 20%, 30% and 40%. In addition, we tested the performance of the best performing methods over 9 high-dimensional data sets obtained from the book by Chapelle et al. [CSZ06]. In this study, we included different base classifiers, such as NN, C4.5 [Qui93], Naive Bayes [JL01] and SVM [Vap98, Pla99]. Furthermore, a comparison with the supervised learning context was performed, analyzing how far self-labeling techniques are from traditional supervised learning.
The journal article associated with this part is:
• I. Triguero, S. Garcı́a, F. Herrera, Self-Labeled Techniques for Semi-Supervised Learning:
Taxonomy, Software and Empirical Study. Knowledge and Information Systems, in press
(2014).
5.2.2 New Self-labeling Approaches Aided by Prototype Generation/Selection Models
To the best of our knowledge, prototype reduction techniques had not previously been used to improve the performance of self-labeling methods. However, their capability to generate new data (PG) and to detect noisy samples (via PS) makes them an interesting avenue to exploit. The two approaches proposed in this part of the thesis aim to exploit the abilities of PG and PS techniques in the field of semi-supervised learning.
The first contribution is devoted to studying the influence of noisy data in one of the most widely used self-labeling approaches: the self-training algorithm [Yar95]. This simple approach perfectly exemplifies the problem of self-labeling techniques, which can make erroneous predictions if noisy examples are labeled and incorporated into the training set. This problem is especially important in the initial stages of the algorithm.
In [LZ05], the authors proposed adding a statistical filter to the self-training process, naming this algorithm SETRED. Nevertheless, this method does not perform well in many domains. Using a particular filter that has been designed and tested under different conditions is not straightforward. Although the aim of any filter is to remove potentially noisy examples, correct examples and examples containing valuable information may also be removed. Therefore, detecting truly noisy examples is a challenging task. In the self-training approach, the number of available labeled data and the induced noisy examples are two decisive factors when filtering noise.
The aim of this work was to deepen the integration of different noise filters into the self-training process, analyzing recent proposals in order to establish their suitability and to distinguish the most relevant features of filters. We distinguish two types of noise detection mechanisms: local and global. We call local methods those techniques in which the removal decision is based on a local neighborhood of instances [WM00]. Global methods create different models from the training data; mislabeled examples can be considered noisy depending on the hypothesis agreement of the classifiers used. As such, both methodologies can be considered edition-based algorithms for data reduction.
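As an illustration of a local mechanism, the following ENN-style edition sketch, in the spirit of the editing methods discussed in [WM00], removes an instance when the majority of its k nearest neighbors disagree with its label (integer class labels and plain numpy are assumed; this is not the exact filter evaluated in the paper):

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbor-style local filter: drop instances whose
    k nearest neighbors (excluding themselves) mostly disagree with them."""
    n = len(X)
    keep = np.ones(n, dtype=bool)
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)               # exclude self-matches
    for i in range(n):
        nn = np.argsort(d2[i])[:k]             # indices of the k nearest neighbors
        votes = np.bincount(y[nn], minlength=y.max() + 1)
        if votes.argmax() != y[i]:             # majority label disagrees
            keep[i] = False
    return X[keep], y[keep]
```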
In our experiments, we focused on the NN rule as the base classifier and ten different noise filters, involving a wide variety of data sets with different ratios of labeled data: 10%, 20%, 30% and 40%.
In the second contribution, we designed a framework, named SEG-SSC, to improve the classification performance of any given self-labeled method by using synthetic labeled data. In our previous survey, we detected that self-labeled techniques are limited by the number of labeled points and their distribution when identifying reliable unlabeled examples. This problem is even more pronounced when the labeled ratio is greatly reduced and the labeled examples do not minimally represent the domain. Moreover, most of the advanced models use some diversity mechanism, such as bootstrapping [Bre96], to provide differences between the hypotheses learned by the multiple classifiers. However, these mechanisms may provide a performance similar to classical self-training or co-training approaches if the number of labeled data is insufficient to achieve different learned hypotheses.
The aim of this work is to alleviate these weaknesses by using new synthetic labeled examples to introduce diversity into multiple classifier approaches and to fulfill the labeled data distribution. The principal aspects of the proposed framework are listed below (an illustrative sketch of the generation step follows the list):
• Introducing diversity into the multiple classifiers by using more (new) labeled data.
• Fulfilling the labeled data distribution with the aid of unlabeled data.
• Being applicable to any kind of self-labeled method.
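As an illustration of the generation step, the sketch below creates new labeled points by interpolating between random same-class pairs, an oversampling-style stand-in for the PG techniques that SEG-SSC actually employs:

```python
import numpy as np

def generate_synthetic(X_lab, y_lab, n_new=50, seed=None):
    """Create synthetic labeled examples by interpolating between random
    pairs of same-class labeled points (an illustrative stand-in for the
    PG-based generation used by SEG-SSC)."""
    rng = np.random.default_rng(seed)
    X_new, y_new = [], []
    for _ in range(n_new):
        c = rng.choice(np.unique(y_lab))            # pick a class at random
        idx = np.flatnonzero(y_lab == c)
        if len(idx) < 2:
            continue                                # need a pair to interpolate
        a, b = rng.choice(idx, size=2, replace=False)
        lam = rng.random()                          # interpolation factor in [0, 1)
        X_new.append(X_lab[a] + lam * (X_lab[b] - X_lab[a]))
        y_new.append(c)
    return np.array(X_new), np.array(y_new)
```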
In our empirical studies, we applied this scheme to four recent self-labeled methods that belong to different families. We tested their capabilities with a large number of data sets and a very reduced labeled ratio. A study on high-dimensional data sets, extracted from the book by Chapelle et al. [CSZ06] and the BBC News web page [BBC14], has also been included.
The journal articles associated with this part are:
• I. Triguero, José A. Sáez, J. Luengo, S. Garcı́a, F. Herrera, On the Characterization of Noise
Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification. Neurocomputing
132 (2014) 30-41, doi: 10.1016/j.neucom.2013.05.055.
• I. Triguero, S. Garcı́a, F. Herrera, SEG-SSC: A Framework based on Synthetic Examples
Generation for Self-Labeled Semi-Supervised Classification. Submitted to IEEE Transactions on Cybernetics.
6. Discussion of results
The following subsections summarize and discuss the results obtained in each specific stage of the
thesis.
6.1 Prototype Generation for supervised learning
This subsection is devoted to discussing the main results obtained for the first objective of this thesis.
6.1.1 A review on Prototype Generation
Classical and recent approaches for PG have been thoroughly analyzed in the development of this review. As a result, we have highlighted the basic and advanced features of these techniques by designing a taxonomy of methods. It allowed us to establish several guidelines about which families of methods are more promising, which are less exploited, and which are more susceptible to improvement.
The extensive experimental study carried out has compared the performance of the current PG approaches on many different problems. To validate and support the results obtained, we used a statistical analysis based on nonparametric tests. The best methods of each category have been highlighted. The results of this comparison have shown the potential of the positioning adjustment family of methods, which obtains a very good trade-off between accuracy, reduction rate and runtime. Specifically, we observed that evolutionary positioning adjustment algorithms, such as the PSO algorithm [NL09], obtained the most accurate results. We also stressed that this family is commonly based on a fixed reduction type, which could be its main drawback. The concrete choice of a PG method will nevertheless depend on the problem tackled, but the results offered in this work can help to reduce the set of candidates.
As an additional consequence of this paper, a complete software package of PG techniques has
been developed for the KEEL platform [AFSG+ 09]. Moreover, all the data sets prepared for this
work have been included in the KEEL data set repository [AFFL+ 11]. Both contributions allow
future PG proposals to conduct rigorous analyses, comparing against the state of the art over a great number of problems. All the results, source code and data sets can be found at the following web page: http://sci2s.ugr.es/pgtax.
6.1.2 New Prototype Generation Methods based on Differential Evolution
In this part of the thesis, we have presented a new data reduction technique called IPADE, which iteratively learns the most adequate number of prototypes per class and their respective positioning for the NN classifier, acting as a PG method.
The proposed technique uses a real-parameter optimization procedure based on self-adaptive differential evolution, which allowed us to adjust the positioning of the prototypes at each step of the algorithm.
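The core optimization step of such a scheme can be sketched with the classical DE/rand/1/bin operators over individuals that encode flattened prototype sets. Note that this is a generic DE sketch with fixed F and CR, not the self-adaptive operators actually used by IPADE:

```python
import numpy as np

def de_rand_1_bin(population, fitness, F=0.5, CR=0.9, seed=None):
    """One generation of DE/rand/1/bin over a population of flattened
    prototype sets. `fitness(v)` scores a vector, e.g. the 1NN accuracy
    over the training set when v is decoded as a set of prototypes."""
    rng = np.random.default_rng(seed)
    pop_size, dim = population.shape
    scores = np.array([fitness(v) for v in population])
    for i in range(pop_size):
        candidates = [j for j in range(pop_size) if j != i]
        r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
        mutant = population[r1] + F * (population[r2] - population[r3])
        cross = rng.random(dim) < CR
        cross[rng.integers(dim)] = True          # guarantee at least one gene
        trial = np.where(cross, mutant, population[i])
        t_score = fitness(trial)
        if t_score >= scores[i]:                 # greedy one-to-one replacement
            population[i], scores[i] = trial, t_score
    return population, scores
```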
The large experimental study performed, with its respective statistical evaluation, allows us to show that IPADE is a suitable method for PG in NN classification. The results showed that IPADE significantly outperforms all the comparison algorithms with respect to classification accuracy and reduction rate. The balance that this method achieves between accuracy and reduction rate is noteworthy. Given its incremental nature, this algorithm stands out especially for its reduction power. The complete set of results can be found at http://sci2s.ugr.es/ipade/.
In the subsequent extension, IPADECS, we obtained more accurate results by changing the individual codification. This allowed us to apply new mutation and crossover operators that resulted in better convergence speed and, therefore, better accuracy.
6.1.3 Integrating Prototype Selection and Feature Weighting within Prototype Generation
Two hybrid models have been developed to improve the performance of PG algorithms.
The first model combined PS and PG into a single algorithm. It showed the synergy between PS and PG when building hybrid algorithms, which allowed us to find very promising solutions. The proposed hybrid models are able to tackle several drawbacks of the isolated methods. Concretely, we have analyzed the use of positioning adjustment algorithms as an optimization procedure after a previous PS stage.
The wide experimental study performed has allowed us to justify the behavior of hybrid algorithms when dealing with small and large data sets. These results have been contrasted with several nonparametric statistical procedures, which have reinforced the conclusions.
We concluded that the LVQ3 algorithm [Koh90] does not produce optimal positioning in most cases, whereas PSO [NL09] and the proposed differential evolution scheme result in excellent accuracy rates in comparison with isolated PS and PG methods. Regarding the previous PS stage, we especially emphasize the use of SSMA [GCH08] in combination with one of the two mentioned optimization approaches, which also achieves high reduction rates in the final set of prototypes obtained.
As part of this work, we also analyzed several self-adaptive differential evolution algorithms and a multitude of mutation/crossover operators for PG, studying their convergence capabilities. Among the analyzed approaches, we observed that the SFLSDE algorithm [NT09] achieved a great balance between exploration and exploitation during the evolutionary process.
The second model introduced a novel data reduction technique that exploits the cooperation between FW and prototype reduction to improve the classification performance of the NN rule, its storage requirements and its running time. A self-adaptive differential evolution algorithm has been used to optimize the feature weights and the positioning of the prototypes for the NN algorithm, acting as an FW scheme and a PG method, respectively.
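The effect of the evolved weights on the NN rule is simply a reweighting of the distance; a minimal sketch, assuming non-negative weights:

```python
import numpy as np

def weighted_nn_predict(x, prototypes, labels, w):
    """1NN prediction under a feature-weighted Euclidean distance:
    d_w(x, p) = sqrt(sum_i w_i * (x_i - p_i)^2)."""
    d2 = (w * (prototypes - x) ** 2).sum(axis=1)   # sqrt not needed for argmin
    return labels[np.argmin(d2)]
```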
The experimental study performed allowed us to contrast the behavior of these hybrid models when dealing with a wide variety of data sets with different numbers of instances and features. These hybrid models have been able to overcome isolated prototype reduction methods due to the fact that FW changes the way in which distances between prototypes are measured, and therefore the adjustment of prototypes can be more refined. In the comparison between the proposed IPADECS-DEFW and SSMA-DEPGFW, we observed that in terms of reduction rate IPADECS-DEFW is the best performing hybrid model; in terms of accuracy, however, SSMA-DEPGFW usually obtains more accurate results.
Moreover, the proposed stratified procedure proved to be a useful tool to tackle the scaling-up problem.
6.1.4 Enabling Prototype Reduction Models to deal with Big Data Classification
A MapReduce solution for prototype reduction methods has been developed, denominated MRPR. The proposed scheme has been shown to be a suitable tool for applying these methods to big classification data sets with excellent results. Otherwise, these techniques would be limited to tackling small or medium problems that do not contain more than several thousand examples, due to memory and runtime restrictions.
We have taken advantage of cloud environments, making use of the Apache Hadoop implementation [Pro13a] to develop a MapReduce framework for prototype reduction. The MapReduce paradigm has offered a simple, transparent and efficient environment in which to parallelize the prototype reduction computation. Among the three different reduce types investigated (Join, Filtering and Fusion), we have found that a reducer based on fusion of prototypes obtains reduced sets with higher reduction rates and accuracy performance.
The designed framework enables prototype reduction techniques to be applied to data sets with an unlimited number of instances without major algorithmic modifications, just by using more computers if needed. It also guarantees that the objectives of prototype reduction models are maintained, so that it reaches high reduction rates without significant accuracy loss. Some guidelines about which prototype reduction methods are more suitable for the proposed model are provided.
The experimental study carried out has shown that MRPR obtains very competitive results with different prototype reduction algorithms. Its application has resulted in a very large reduction of storage requirements and classification time for the NN rule when dealing with big data sets.
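The overall flow can be sketched without Hadoop as follows: the map phase runs a reduction method per split, and the reduce phase fuses the partial sets. The pairwise averaging used here is only an illustrative stand-in for the fusion reducer described above:

```python
import numpy as np

def mrpr(train_splits, reduce_method, fuse_rounds=1, seed=None):
    """Plain-Python sketch of the MRPR flow (no Hadoop involved).
    Map phase: run a prototype reduction method on each disjoint split.
    Reduce phase: fuse the partial prototype sets by averaging random
    same-class pairs. `reduce_method(X, y)` returns a reduced (X, y)."""
    rng = np.random.default_rng(seed)
    partial = [reduce_method(X, y) for X, y in train_splits]   # map phase
    X = np.vstack([p[0] for p in partial])
    y = np.concatenate([p[1] for p in partial])
    for _ in range(fuse_rounds):                               # reduce phase
        X_out, y_out = [], []
        for c in np.unique(y):
            Xc = X[y == c]
            order = rng.permutation(len(Xc))
            for i in range(0, len(Xc) - 1, 2):                 # fuse random pairs
                X_out.append((Xc[order[i]] + Xc[order[i + 1]]) / 2.0)
                y_out.append(c)
            if len(Xc) % 2:                                    # odd prototype kept
                X_out.append(Xc[order[-1]])
                y_out.append(c)
        X, y = np.array(X_out), np.array(y_out)
    return X, y
```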
6.2 Self-labeling with Prototype Generation/Selection for Semi-Supervised Classification
This subsection presents a discussion of the results achieved in the field of self-labeling semi-supervised learning with the aid of PG and PS techniques.
6.2.1 A Survey on Self-labeling Semi-Supervised Classification
An overview of the growing field of self-labeled methods has been performed, classifying existing models according to their main features. As a result, we have designed a taxonomy that evaluates the main properties of these methods. This study has allowed us to provide some remarks and guidelines for non-experts and researchers in this topic. We have identified the strengths of every family of methods, indicating which families may be improved.
We have detected the main characteristics of self-labeling approaches. They include the addition mechanism (incremental, batch, amending), the number of classifiers and learning algorithms (single or multiple), and the type of view (single or multiple). According to these characteristics, we have classified the methods into several families, proceeding from the type of view down to the addition mechanism. Moreover, some other properties have been highlighted, such as the kind of confidence measure (simple, agreement and combination), the type of teaching (self- and mutual teaching) and the stopping criteria. To compare these techniques, we defined four criteria: transductive and inductive accuracy, influence of the number of labeled instances, noise tolerance and time requirements.
The empirical study performed allows us to highlight several methods from among the whole set. In both transductive and inductive settings, methods that use multiple classifiers and a single view, such as TriTraining [ZL05], Democratic-Co [GZ00] and Co-Bagging [HS10], have proven to be the best-performing ones.
The experiments conducted with high-dimensional data sets and a very reduced labeled ratio show that much more work is needed in the field of self-labeled techniques to deal with these problems.
Moreover, a semi-supervised learning module has been developed for the KEEL software, integrating the analyzed methods and data sets. The developed software allows the reader to reproduce the experiments carried out and to use it as an SSL framework in which to implement new methods. It can be a useful tool for performing experimental analyses in an easier and more effective way.
A web site with all the complementary material is available at http://sci2s.ugr.es/SelfLabeled, including this work's basic information, the source code of the analyzed algorithms, all the data sets involved and the complete results obtained.
6.2.2 New Self-labeling Approaches Aided by Prototype Generation/Selection Models
Two different approaches have been studied in order to improve the performance of self-labeling methods with prototype reduction techniques.
In the first attempt, we analyzed the characteristics of a wide variety of noise filters of different natures to improve the self-training approach. We included some PS algorithms (denoted local approaches) and some other ensemble-based models (named global methods). The experimental analysis performed allowed us to distinguish which characteristics of filtering techniques report better behavior when addressing the transductive and inductive problems. We have checked that global filters (the CF [GBLG99] and IPF [KR07] algorithms) stand out as the best performing family of filters, showing that the hypothesis agreement of several classifiers for selecting noisy examples remains robust when the ratio of available labeled data is reduced. Most of the local approaches need more labeled data to perform better. The use of these filters has resulted in a better performance than that achieved by the previously proposed self-training methods, SETRED and SNNRCE.
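A global filter in the sense used here can be sketched as a cross-validated ensemble vote, in the spirit of the classification filter CF [GBLG99]; the classifier factories are assumed to follow a scikit-learn-style fit/predict interface:

```python
import numpy as np

def majority_vote_filter(X, y, make_classifiers, n_folds=5):
    """Global noise filter: each learner is trained on (n_folds - 1)
    partitions and predicts the held-out fold; examples misclassified
    by a majority of the learners are removed from the training set."""
    n = len(X)
    folds = np.arange(n) % n_folds
    errors = np.zeros(n)
    for make_clf in make_classifiers:            # e.g. factories for C4.5, NB, NN
        pred = np.empty(n, dtype=y.dtype)
        for f in range(n_folds):
            clf = make_clf()
            clf.fit(X[folds != f], y[folds != f])
            pred[folds == f] = clf.predict(X[folds == f])
        errors += (pred != y)
    keep = errors <= len(make_classifiers) / 2   # keep unless a majority erred
    return X[keep], y[keep]
```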
Hence, the use of global filters is highly recommended in this field, and they can be useful for further work with other semi-supervised approaches and other base classifiers. A web page with all the complementary material is available at http://sci2s.ugr.es/SelfTraining+Filters, including this paper's basic information, all the data sets created and the complete results obtained for each algorithm.
In our second contribution, we developed a framework called SEG-SSC to improve the performance of any self-labeled semi-supervised classification method. It is focused on the idea of generating synthetic examples with PG techniques in order to diminish the drawbacks caused by the absence of labeled examples, which deteriorates the efficiency of this family of methods.
The wide experimental study carried out has allowed us to investigate the behavior of the proposed scheme on a high number of data sets with varied numbers of instances and features. Within the proposed framework, the four self-labeled techniques used have been able to overcome the original self-labeled methods, because the addition of new labeled data improves the diversity of multi-classifier approaches and fulfills the distribution of labeled data. Thus, our proposal becomes a suitable tool for enhancing self-labeled methods.
7. Concluding Remarks
In this thesis, we have addressed several problems pursuing a common objective: the analysis, design and implementation of evolutionary PG algorithms. Two main research lines compose this dissertation: PG for supervised and for semi-supervised classification.
In the first research line, our initial objective was to obtain a full understanding of the PG field in order to improve the performance of the NN classifier. To do so, we carried out a theoretical and empirical review of PG methods, focused on characterizing the traits of the related techniques. Based upon the lessons learned from this study, we proposed an iterative prototype adjustment model based on differential evolution. This technique, named IPADE (and its extension IPADECS), aims to generate the smallest reduced set of prototypes that represents the original training data. Depending on the problem tackled, this technique has been able to determine the most appropriate number of prototypes per class, providing an adequate trade-off between accuracy and reduction rate. In the experiments conducted, this model has significantly outperformed the current state of the art of PG methods. Thus, it has become a useful tool to improve the classification performance of the NN algorithm.
Aiming to improve the performance of PG models further, we have developed hybrid data reduction techniques that combine PG with PS and FW. Our first attempts were based on the combination of a previous PS stage with an optimization of the positioning of the prototypes via PG. This model has been shown to perform better than isolated PS and PG algorithms. Among the different hybridizations analyzed, the SSMA-SFLSDE algorithm was the best performing approach. The second hybrid model has taken the feature space into consideration. To do this, we have designed an evolutionary FW scheme that has been integrated within the previous models (IPADECS and SSMA-SFLSDE). The proposed hybrid models have been able to overcome isolated prototype reduction methods, changing the way in which distances between prototypes are measured with the proposed FW scheme.
Moreover, we have dealt with the big data problem for prototype reduction methods by proposing a novel framework based on MapReduce, named MRPR. The designed model has distributed
the processing of prototype reduction techniques among a set of computing elements, combining
the resulting sets of prototypes in different manners. As a result, the MRPR framework has enabled prototype reduction models to be applied to big data problems with an acceptable runtime and without accuracy loss. It has become our last contribution to the state of the art of PG in supervised learning.
The second part of this dissertation has been devoted to the field of self-labeling semi-supervised classification. Once again, we have performed a survey of the specialized literature in order to categorize self-labeled methods, identifying their essential issues. We have observed that these methods are currently far from the performance that could be obtained if all the training instances were labeled. We also concluded that these techniques have a strong dependence on the available labeled data. To alleviate these limitations, we have made use of prototype reduction models.
As a first contribution to the field of self-labeling, we have performed an exhaustive study to characterize the behavior of noise filters within a self-training approach. With this study, we have reduced the number of mislabeled examples that are added to the labeled set, avoiding the learning of wrong models. It has allowed us to remove noisy data from unlabeled data with edition-based models. Furthermore, we have also identified the most appropriate filtering techniques to perform this task.
Finally, we have proposed a framework called SEG-SSC that uses synthetic examples to improve
the performance of self-labeling methods. This model has utilized PG techniques to create synthetic examples in order to fulfill the labeled data distribution, enhancing the performance of self-labeling approaches through the incorporation of the generated data at different stages of the learning process.
Conclusions
In this thesis, different problems have been addressed in pursuit of a common objective: the analysis, design and implementation of evolutionary prototype generation algorithms. Two research lines make up this dissertation: prototype generation for supervised and for semi-supervised classification. In the first research line, our initial objective was to obtain a complete understanding of the prototype generation field in order to improve the performance of the nearest neighbor classifier. To this end, a theoretical and experimental review of prototype generation methods has been carried out, focused on characterizing the traits of these techniques. Based on the lessons learned in this study, an iterative prototype adjustment model based on the differential evolution algorithm has been proposed. This technique, called IPADE (and its extension IPADECS), has aimed to generate the smallest possible reduced set of prototypes that represents the original training set. Depending on the problem tackled, this technique has been able to determine the most appropriate number of examples per class, providing a good balance between accuracy and reduction rate. In the experiments carried out, the proposed model has significantly improved upon the state of the art in prototype generation. It has thus become a very useful tool for improving the performance of the nearest neighbor classifier.
With the aim of further improving the performance of prototype generation models, hybrid data reduction techniques have been developed that combine prototype generation with prototype selection and feature weighting. Our first attempts were based on the combination of a preliminary instance selection stage with a stage that adjusts the positioning of these instances via prototype generation. Among the different hybridizations studied, the SSMA-SFLSDE model stood out as the best performing algorithm. The second hybrid model has taken the feature space into consideration. To do this, an evolutionary feature weighting scheme has been designed and integrated into the previous models (IPADECS and SSMA-SFLSDE). The proposed hybrid models have been able to outperform isolated prototype reduction methods, changing the way in which distances between prototypes are measured with the designed feature weighting approach.
Furthermore, the problem caused by large amounts of data (big data) in prototype reduction algorithms has been addressed through the development of a MapReduce-based model, called MRPR. The designed model has distributed the processing performed by prototype reduction techniques over a set of processors, combining the resulting sets of prototypes in different ways. As a result, the MRPR model has given prototype reduction techniques the ability to be executed over large data sets in an acceptable time and without loss of accuracy. This has been the last contribution to the state of the art of prototype generation in supervised learning.
The second part of this thesis has been devoted to the field of semi-supervised classification with self-labeling. Once again, a complete study of the specialized literature has been carried out in order to characterize self-labeled methods, identifying their main problems. It has been observed that these methods are currently far from the performance that would be obtained if all the training examples were labeled. It is also concluded that these techniques have a strong dependence on the available labeled data. To reduce the influence of these limitations, prototype reduction models have been used.
As a first contribution to the field of self-labeling, an exhaustive study has been carried out to characterize the behavior of noise filtering algorithms within a self-training approach. With this study, the number of mislabeled examples added to the labeled set has been reduced, avoiding the learning of incorrect models. This has made it possible to remove noise from the unlabeled data with edition-based models. In addition, the types of filters most appropriate for this task have been identified.
Finally, a scheme called SEG-SSC has been proposed that uses synthetic examples to improve the performance of self-labeled methods. This method has used prototype generation techniques to create synthetic examples in order to complete the distribution of the labeled data. This model has improved the performance of self-labeling approaches by incorporating the generated data at different stages of the learning process.
8. Future Work
The results achieved in this PhD thesis may open up new lines of future work on different challenging problems. In what follows, we present some research lines that can be addressed starting from the current studies.
Prototype reduction for fuzzy nearest neighbor classification: Besides data reduction techniques, there are several approaches for improving the performance of NN classification. One of the trends that has arisen in recent decades is the introduction of fuzzy set theory [Zad65] within the mechanics of the NN rule, giving a class membership to each instance of the training set. These techniques have been shown to perform very well, improving the accuracy capabilities of the standard NN [DGH14].
However, these techniques can be improved even further if data reduction techniques are applied. Several fuzzy versions of the classical PS techniques have been proposed (such as [YC98] or [ZLZ11]). However, there is still large potential for improving fuzzy NN algorithms if advanced PG or PS algorithms are developed for this problem.
Fuzzy self-labeling approaches: Continuing with fuzzy set theory, we can find some recent fuzzy models for semi-supervised learning [MPJ11, PGDB13]. However, this topic is currently in its infancy and much more research is needed.
The way in which fuzzy models establish the class membership of an example may become an interesting approach to measure confidence at the time of labeling an unlabeled example. The underlying idea used in [DGH14] for fuzzy NN can easily be extended to self-labeled techniques.
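As an illustration, the following sketch computes a fuzzy-kNN membership vector in the spirit of the classical fuzzy k-nearest neighbor rule, with crisp neighbor memberships, integer class labels and the usual distance-based weighting; the resulting memberships could be read directly as labeling confidences in a self-labeling scheme:

```python
import numpy as np

def fuzzy_knn_memberships(x, X_train, y_train, k=5, m=2.0):
    """Class membership vector for x under a fuzzy-kNN rule: each neighbor
    votes with weight 1 / d^(2/(m-1)); memberships are normalized to sum to 1."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nn = np.argsort(d)[:k]                         # k nearest training samples
    w = 1.0 / np.maximum(d[nn], 1e-12) ** (2.0 / (m - 1.0))
    u = np.zeros(y_train.max() + 1)
    for j, i in enumerate(nn):
        u[y_train[i]] += w[j]
    return u / u.sum()    # highest entry: predicted class and its confidence
```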
Extending SEG-SSC to other semi-supervised learning schemes: There are many possible variations of our proposed SEG-SSC scheme that could be interesting to explore as future work. In our opinion, the use of generation techniques with self-labeled techniques is not only a new way to improve the capabilities of this family of techniques, but could also be useful for most existing semi-supervised learning algorithms, such as graph-based methods, generative models or semi-supervised support vector machines.
New big data approaches: As we commented in Section 2.4, the problem of big data affects classical and advanced data mining and data reduction techniques [BBBE11]. Therefore, the ideas established in this thesis for prototype reduction in large-scale data sets could be extended to many different fields. For example, we could tackle the following topics:
• Feature selection/weighting: These techniques are also affected by the increment in the size of the data sets. They are not able to select or find appropriate weights for features in problems in which the number of instances is very high. Using the ideas learned in the proposed MapReduce framework, we could extend these techniques to large-scale problems by carefully developing the reduce phase.
• Classification methods: Apart from the development of scalable data reduction methods, the adaptation of standard classification methods to tackle big data problems may be very interesting.
• Semi-supervised learning: Given that unlabeled data are quite easy to obtain, it is possible to find problems in which we have very little labeled data and large amounts of unlabeled data. This offers a very complex scenario, in which the learning process may not be feasible in time, but the use of these unlabeled points is highly recommended. Therefore, the design of scalable semi-supervised models is becoming an interesting topic.
Tackling other related learning paradigms with prototype reduction models: Besides standard classification problems, there are many other challenges in machine learning which are of great interest to the research community [YW06]. We consider that the lessons learned in this thesis can be used to explore new problems such as:
• One-class classification: This problem deals with situations in which not all of the classes are available at the training step [ZYY+ 14]. It assumes that the classifier is built on the basis
of samples coming only from a single class, while it must discriminate between the known examples and new, unseen examples (known as outliers) that do not meet the assumption about the concept. Applying prototype reduction models in this topic may be useful to reduce the computational complexity and the noise sensitivity of the models. In fact, there are some preliminary studies on prototype reduction for one-class classification [CdO11].
• Multi-label classification: There are many real-world applications where one instance can be assigned to multiple classes. As an example, consider the problem of assigning functions to proteins, where one protein could be labeled with multiple functions. This problem, known as multi-label learning [TK07], increases the complexity of the classification process. The adaptation of prototype reduction models to this scenario is a very challenging and interesting task.
• Semi-supervised multi-label classification: A significant challenge is to classify examples with multiple labels by using a small number of labeled samples. This problem can be tackled by combining semi-supervised learning and multi-label learning. The extension of the ideas proposed in self-labeled learning, or the application of data reduction models, could be useful to tackle this problem. There are just a few works that use data reduction in this field (for instance, [LYG+ 10]).
Chapter II
Publications: Published and Submitted Papers
1. Prototype generation for supervised classification
The journal paper associated with this part is:
1.1 A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification
• I. Triguero, S. Garcı́a, F. Herrera, A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification. IEEE Transactions on Systems,
Man, and Cybernetics–Part C: Applications and Reviews 42 (1) (2012) 86–100, doi:
10.1109/TSMCC.2010.2103939.
– Status: Published.
– Impact Factor (JCR 2012): 2.548
– Subject Category: Computer Science, Artificial Intelligence. Ranking 17 / 115 (Q1).
– Subject Category: Computer Science, Cybernetics. Ranking 2 / 21 (Q1).
– Subject Category: Computer Science, Interdisciplinary Applications. Ranking 16 / 100
(Q1).
A Taxonomy and Experimental Study on Prototype
Generation for Nearest Neighbor Classification
Isaac Triguero, Joaquı́n Derrac, Salvador Garcı́a, and Francisco Herrera
Abstract—The nearest neighbor (NN) rule is one of the most
successfully used techniques to resolve classification and pattern
recognition tasks. Despite its high classification accuracy, this rule
suffers from several shortcomings in time response, noise sensitivity, and high storage requirements. These weaknesses have been
tackled by many different approaches, including a good and well-known solution that we can find in the literature, which consists
of the reduction of the data used for the classification rule (training data). Prototype reduction techniques can be divided into two
different approaches, which are known as prototype selection and
prototype generation (PG) or abstraction. The former process consists of choosing a subset of the original training data, whereas PG
builds new artificial prototypes to increase the accuracy of the NN
classification. In this paper, we provide a survey of PG methods
specifically designed for the NN rule. From a theoretical point of
view, we propose a taxonomy based on the main characteristics
presented in them. Furthermore, from an empirical point of view,
we conduct a wide experimental study that involves small and large
datasets to measure their performance in terms of accuracy and
reduction capabilities. The results are contrasted through nonparametric statistical tests. Several remarks are made to understand which PG models are appropriate for application to different
datasets.
Index Terms—Classification, learning vector quantization
(LVQ), nearest neighbor (NN), prototype generation (PG),
taxonomy.
I. INTRODUCTION
The nearest neighbor (NN) algorithm [1] and its derivatives have been shown to perform well, as a nonparametric classifier, in machine-learning and data-mining (DM) tasks
[2]–[4]. It is included in a more specific field of DM known
as lazy learning [5], which refers to the set of methods that
predicts the class label from raw training data and does not obtain learning models. Although NN is a simple technique, it has
demonstrated itself to be one of the most interesting and effective algorithms in DM [6] and pattern recognition [7], and it has
been considered one of the top ten methods in DM [8]. A wide range of new real problems have been stated as classification problems [9], [10], for which NN has provided great support; see, for instance, [11] and [12].
The most intuitive approach to pattern classification is based
on the concept of similarity [13]–[15]; obviously, patterns that
are similar, in some sense, have to be assigned to the same
class. The classification process involves partitioning samples
into training and testing categories. Let x_p be a training sample from the n available samples in the training set. Let x_t be a test sample, ω be the true class of a training sample, and ω̂ be the predicted class for a test sample (ω, ω̂ = 1, 2, . . . , Ω). Here, Ω is the total number of classes. During the training process, we use only the true class ω of each training sample to train the classifier, while during testing, we predict the class ω̂ of each test sample. With the 1NN rule, the predicted class of test sample x_t is set equal to the true class ω of its NN, where nn_t is an NN to x_t if it minimizes the distance:

d(nn_t, x_t) = min_i { d(nn_i, x_t) }.

For kNN, the predicted class of test sample x_t is set equal to the most frequent true class among the k nearest training samples. This forms the decision rule D : x_t → ω̂.
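In code, this decision rule is a single argmin over distances; a minimal numpy sketch with illustrative array names:

```python
import numpy as np

def nn_classify(x_t, X_train, y_train):
    """1NN rule: predict the true class of the nearest training sample."""
    d = np.sqrt(((X_train - x_t) ** 2).sum(axis=1))   # d(x_p, x_t) for all p
    return y_train[np.argmin(d)]
```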
Despite its high classification accuracy, it is well known that
NN suffers from several drawbacks [4]. Four weaknesses could
be mentioned as the main causes that prevent the successful
application of this classifier. The first one is the necessity of
high storage requirements in order to retain the set of examples
that defines the decision rule. Furthermore, the storage of all of
the data instances also leads to high computational costs during
the calculation of the decision rule, which is caused by multiple
computations of similarities between the test and training samples. Regarding the third one, NN (especially 1NN) presents
low tolerance to noise because of the fact that it considers all
data relevant, even when the training set may contain incorrect
data. Finally, NN makes predictions over existing data, and it assumes that input data perfectly delimits the decision boundaries
among classes.
Several approaches have been suggested and studied in order
to tackle the aforementioned drawbacks [16]. The research on
similarity measures to improve the effectiveness of NN (and
other related techniques based on similarities) is very extensive
in the literature [15], [17], [18]. Other techniques reduce overlapping between classes [19] based on local probability centers,
thus increasing the tolerance to noise. Researchers have also investigated distance functions that are suitable for use under high-dimensionality conditions [20].
A successful technique that simultaneously tackles the computational complexity, storage requirements, and noise tolerance
of NN is based on data reduction [21], [22]. These techniques
aim to obtain a representative training set with a lower size
compared to the original one and with a similar or even higher
classification accuracy for new incoming data. In the literature, these are known as reduction techniques [21], instance
selection [23]–[25], prototype selection (PS) [26], and prototype generation (PG) [22], [27], [28] (which are also known as
prototype abstraction methods [29], [30]). Although the PS and
PG problems are frequently confused and considered to be the same problem, each of them relates to a different problem. PS
methods concern the identification of an optimal subset of representative objects from the original training data by discarding
noisy and redundant examples. PG methods, by contrast, besides selecting data, can generate and replace the original data
with new artificial data [27]. This process allows it to fill regions
in the domain of the problem that have no representative examples in the original data. Thus, PS methods assume that the best
representative examples can be obtained from a subset of the
original data, whereas PG methods generate new representative
examples if needed, thus tackling also the fourth weakness of
NN mentioned earlier.
The PG methods that we study in this survey are those specifically designed to enhance NN classification. Nevertheless, many
other techniques could be used for the same goal, but they are out of the scope of this survey. For instance, clustering
techniques allow us to obtain a representative subset of prototypes or cluster centers, but they are obtained for more general
purposes. A very good review of clustering can be found in [31].
Nowadays, there is no general categorization for PG methods. In the literature, a brief taxonomy for prototype reduction
schemes was proposed in [22]. It includes both PS and PG methods and compares them in terms of classification accuracy and
reduction rate. In this paper, the authors divide the prototype reduction schemes into creative (PG) and selecting methods (PS),
but it is not exclusively focused on PG methods, and especially,
on studying the similarities among them. Furthermore, a considerable number of PG algorithms have been proposed and some
of them are rather unknown. The first approach that we can find in the literature, called PNN [32], is based on merging prototypes.
One of the most important families of methods is that based
on learning vector quantization (LVQ) [33]. Other methods are
based on splitting the dimensional space [34], and even evolutionary algorithms and particle swarm optimization [35] have
also been used to tackle this problem [36], [37].
Because of the absence of a focused taxonomy in the literature, we have observed that the new algorithms proposed are
usually compared with only a subset of the complete family of
PG methods and, in most of the studies, no rigorous analysis
has been carried out.
These are the reasons that motivate the global purpose of this
paper, which can be divided into three objectives.
1) To propose a new and complete taxonomy based on the
main properties observed in the PG methods. The taxonomy will allow us to know the advantages and drawbacks
from a theoretical point of view.
2) To make an empirical study that analyzes the PG algorithms in terms of accuracy, reduction capabilities, and
time complexity. Our goal is to identify the best methods in each family, depending on the size and type of the
datasets, and to stress the relevant properties of each one.
3) To illustrate through graphical representations the trend of
generation performed by the schemes studied in order to
justify the results obtained in the experiments.
The experimental study will include a statistical analysis
based on nonparametric tests, and we will conduct experiments
that involve a total of 24 PG methods and 59 small- and large-size datasets. The graphical representations of selected data will
be done by using a two-dimensional (2-D) dataset called banana
with moderate complexity features.
This paper is organized as follows. A description of the properties and an enumeration of the methods, as well as related
and advanced work on PG, are given in Section II. Section III
presents the taxonomy proposed. In Section IV, we describe the
experimental framework, and Section V examines the results
obtained in the empirical study and presents a discussion of
them. Graphical representations of generated data by PG methods are illustrated in Section VI. Finally, Section VII concludes
the paper.
II. PROTOTYPE GENERATION: BACKGROUND
PG builds new artificial examples from the training set; a formal specification of the problem is the following. Let x_p be an instance, where x_p = (x_{p1}, x_{p2}, . . . , x_{pm}, x_{pω}), with x_p belonging to a class ω given by x_{pω}, in an m-dimensional space in which x_{pi} is the value of the ith feature of the pth sample. Then, let us assume that there is a training set TR, which consists of n instances x_p, and a test set TS composed of s instances x_t, with x_{tω} unknown. The purpose of PG is to obtain a generated prototype set TG, which consists of r prototypes (with r < n) that are either selected or generated from the examples of TR. The prototypes of the generated set are determined to represent efficiently the distributions of the classes and to discriminate well when used to classify the training objects. Their cardinality should be sufficiently small to reduce both the storage and the evaluation time spent by an NN classifier. In this paper, we will focus on the use of the NN rule, with k = 1, to classify the examples of TR and TS by using TG as reference.
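Under this formulation, a PG method can be scored by the two quantities used throughout this paper: the accuracy of the 1NN rule over TS taking TG as reference, and the reduction rate 1 - r/n. A minimal sketch, reusing the nn_classify function from the earlier snippet:

```python
import numpy as np

def evaluate_pg(TG_X, TG_y, TS_X, TS_y, n_train):
    """Score a generated set TG: 1NN test accuracy using TG as reference,
    plus the reduction rate 1 - r/n with respect to the original TR size."""
    preds = np.array([nn_classify(x, TG_X, TG_y) for x in TS_X])
    accuracy = float((preds == TS_y).mean())
    reduction = 1.0 - len(TG_X) / n_train
    return accuracy, reduction
```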
This section presents an overview of the PG problem. Three
main topics will be discussed in the following.
1) In Section II-A, the main characteristics, which will define the categories of the taxonomy proposed in this paper,
will be outlined. They refer to the type of reduction, resulting generation set, generation mechanisms, and evaluation
of the search. Furthermore, some criteria to compare PG
methods are established.
2) In Section II-B, we briefly enumerate all the PG methods
proposed in the literature. The complete and abbreviated
names will be given together with the proposed reference.
3) Finally, Section II-C explores other areas related to PG
and gives an interesting summary of advanced work in
this research field.
A. Main Characteristics in Prototype Generation Methods
This section establishes different properties of PG methods
that will be necessary for the definition of the taxonomy in the
following section. The issues discussed here include the type
of reduction, resulting generation set, generation mechanisms,
and evaluation of the search. Finally, some criteria will be set in
order to compare the PG methods.
1) Type of Reduction: PG methods search for a reduced set TG of prototypes to represent the training set TR; there are a variety of schemes by which the size of TG can be established.
a) Incremental: An incremental reduction starts with an
empty reduced set T G or with only some representative
prototypes from each class. Then, a succession of additions
of new prototypes or modifications of earlier prototypes
occurs. One important advantage of this kind of reduction
is that these techniques can be faster and need less storage
during the learning phase than nonincremental algorithms.
Furthermore, this type of reduction allows the technique
to adequately establish the number of prototypes required
for each dataset. Nevertheless, it could produce adverse results by requiring a high number of prototypes to adjust TR, thus producing overfitting.
b) Decremental: The decremental reduction begins with
TG = TR, and then, the algorithm starts to reduce TG or modify the prototypes in TG. It can be accomplished
by following different procedures, such as merging, moving or removing prototypes, and relabeling classes. One
advantage observed in decremental schemes is that all
training examples are available for examination to make a
decision. On the other hand, a shortcoming of these kinds
of methods is that they usually present a high computational cost.
c) Fixed: It is common to use a fixed reduction in PG. These
methods establish the final number of prototypes for TG using a previously defined user parameter related to the percentage of retention of TR. This is the main drawback
of this approach, apart from the fact that it is very dependent on each dataset tackled. However, these techniques
only focus on increasing the classification accuracy.
d) Mixed: A mixed reduction begins with a preselected subset TG, obtained either by random selection with fixed reduction or by the run of a PS method; then, additions, modifications, and removals of prototypes are done in TG. This type of reduction combines the advantages of those previously seen, thus allowing several rectifications
to solve the problem of fixed reduction. However, these
techniques are prone to overfit the data, and they usually
have high computational cost.
2) Resulting Generation Set: This factor refers to the resulting set generated by the technique, i.e., whether the final set will
retain border, central, or both types of points.
a) Condensation: This set includes the techniques, which
return a reduced set of prototypes that are closer to the
decision boundaries, that are also called border points.
The reason behind retaining border points is that internal
points do not affect the decision boundaries as much as
border points and, thus, can be removed with relatively little effect on classification. The idea behind these methods
is to preserve the accuracy over the training set, but the
generalization accuracy over the test set can be negatively
affected. Nevertheless, the reduction capability of condensation methods is normally high because border points are fewer than internal points in most of the data.
b) Edition: These schemes instead seek to remove or modify
border points. They act over points that are noisy or do
not agree with their NNs, thus leaving smoother decision
boundaries behind. However, such algorithms do not remove internal points that do not necessarily contribute to
the decision boundaries. The effect obtained is related to
the improvement of generalization accuracy in test data,
although the reduction rate obtained is lower.
c) Hybrid: Hybrid methods try to find the smallest set TG that maintains or even increases the generalization accuracy in test data. To achieve this, they allow modifications of internal and border points based on some specific criteria followed by the algorithm. The NN classifier is highly adaptable to these methods, obtaining great
improvements, even with a very small reduced set of
prototypes.
3) Generation Mechanisms: This factor describes the different mechanisms adopted in the literature to build the final TG set.
a) Class relabeling: This generation mechanism consists of
changing the class labels of samples from TR that are suspected of having errors and of belonging to other classes. Its purpose is to cope with all types of
imperfections in the training set (mislabeled, noisy, and
atypical cases). The effect obtained is closely related to
the improvement in generalization accuracy of the test
data, although the reduction rate is kept fixed.
b) Centroid based: These techniques are based on generating artificial prototypes by merging a set of similar examples. The merging process usually consists of computing averaged attribute values over a selected set, yielding the so-called centroids (a minimal merging sketch is given after this list). The identification and
selection of the set of examples are the main concerns of
the algorithms that belong to this category. These methods
can obtain a high reduction rate, but they are also related
to accuracy rate losses.
c) Space splitting: This set includes the techniques based on
different heuristics to partition the feature space, along
with several mechanisms to define new prototypes. The
idea consists of dividing TR into some regions, which will be replaced with representative examples establishing the decision boundaries associated with the original TR.
This mechanism works on a space level because of the
fact that the partitions are found in order to discriminate,
as well as possible, a set of examples from others, whereas
centroid-based approaches work on the data level, which
mainly focuses on the optimal selection of only a set of
examples to be treated. The reduction capabilities of these
techniques usually depend on the number of regions that
are needed to represent TR.
d) Positioning adjustment: The methods that belong to this
family aim to correct the position of a subset of prototypes
from the initial set by using an optimization procedure.
New positions of prototypes can be obtained by using the
movement idea in the m-dimensional space, thus adding
or subtracting some quantities to the attribute values of the
prototypes. This mechanism is usually associated with a
fixed or mixed type of reduction.
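As a minimal illustration of the centroid-based mechanism above, in the spirit of PNN's prototype merging [32], the sketch below repeatedly replaces the closest same-class pair of prototypes with their centroid; the target size and the unweighted average are illustrative simplifications:

```python
import numpy as np

def merge_centroids(prototypes, labels, target_size):
    """Centroid-based PG sketch: while the set is larger than target_size,
    replace the closest same-class pair of prototypes with their centroid."""
    P = [np.asarray(p, dtype=float) for p in prototypes]
    L = list(labels)
    while len(P) > target_size:
        best, pair = np.inf, None
        for i in range(len(P)):                  # find the closest same-class pair
            for j in range(i + 1, len(P)):
                if L[i] == L[j]:
                    d = float(np.sum((P[i] - P[j]) ** 2))
                    if d < best:
                        best, pair = d, (i, j)
        if pair is None:                         # no same-class pair left to merge
            break
        i, j = pair
        P[i] = (P[i] + P[j]) / 2.0               # unweighted centroid of the pair
        del P[j], L[j]
    return np.vstack(P), np.array(L)
```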
4) Evaluation of Search: The NN itself is an appropriate
heuristic to guide the search of a PG method. The decisions
made by the heuristic must have an evaluation measure that
allows the comparison of different alternatives. The evaluation
of search criterion depends on the use or nonuse of NN in such
an evaluation.
a) Filter: We refer to filter techniques when they do not
use the NN rule during the evaluation phase. Different
heuristics are used to obtain the reduced set. They can be
faster than NN, but the performance in terms of accuracy
obtained could be worse.
b) Semiwrapper: NN is used for partial data to determine the
criteria of making a certain decision. Thus, NN performance can be measured over localized data, which will
contain most of the prototypes that will be influenced in making a decision. It is an intermediate approach, where a
tradeoff between efficiency and accuracy is expected.
c) Wrapper: In this case, the NN rule fully guides the search
by using the complete training set with the leave-one-out validation scheme. The conjunction of the two mentioned factors allows us to get a great estimator
of generalization accuracy, thus obtaining better accuracy
over test data. However, each decision involves a complete
computation of the NN rule over the training set and the
evaluation phase can be computationally expensive.
5) Criteria to Compare PG Methods: When comparing PG methods, a number of criteria can be used to assess the relative strengths and weaknesses of each algorithm. These include storage reduction, noise tolerance, generalization accuracy, and time requirements.
1) Storage reduction: One of the main goals of the PG methods is to reduce storage requirements. Furthermore, another goal closely related to this is to speed up classification. A reduction in the number of stored instances will
typically yield a corresponding reduction in the time it
takes to search through these examples and classify a new
input vector.
2) Noise tolerance: Two main problems may occur in the
presence of noise. The first is that very few instances will
be removed because many instances are needed to maintain the noisy decision boundaries. Second, the generalization accuracy can suffer, especially if noisy instances
are retained instead of good instances, or these are not
relabeled with the correct class.
3) Generalization accuracy: A successful algorithm will often be able to significantly reduce the size of the training set without significantly reducing the generalization accuracy.

[TABLE I. PG methods reviewed.]
4) Time requirements: Usually, the learning process is carried
out just once on a training set; therefore, it seems not to
be a very important evaluation method. However, if the
learning phase takes too long, it can become impractical
for real applications.
B. Prototype Generation Methods
More than 25 PG methods have been proposed in the literature. This section is devoted to enumerating and designating them according to the standard followed in this paper. For more
details on their implementations, the reader can visit the URL
http://sci2s.ugr.es/pgtax. Implementations of the algorithms in
Java can be found in the KEEL software [38].
Table I presents an enumeration of the PG methods reviewed
in this paper. The complete name, abbreviation, and reference are provided for each one. In the case of there being more than
one method in a row, they were proposed together and the best
performing method (indicated by the respective authors) is depicted in bold. We will use the best representative method of
each proposed paper; therefore, only the methods in bold, when
more than one method is proposed, will be compared in the
experimental study.
C. Related and Advanced Work
Nowadays, research to enhance the NN rule through data preprocessing is common and highly demanded. PG could represent a feasible and promising technique to obtain the expected
results, which justifies its relationship to other methods and
problems. This section provides a brief review on other topics
closely related to PG, and describes other interesting work and
future trends, which have been studied over the past few years.
1) Prototype selection: With the same objective as PG, storage reduction, and classification accuracy improvement,
these methods are limited to only selecting examples from
the training set. More than 50 methods can be found in
the literature. In general, three kinds of methods are usually differentiated, which are also based on edition [54],
condensation [55], or hybrid models [21], [56]. Advanced
proposals can be found in [24] and [57]–[59].
2) Instance and rule learning hybridizations: This family includes all the methods that simultaneously use instances and rules to classify a new object. If the
values of the object are within the range of a rule, its consequent predicts the class; otherwise, if no rule matches
with the object, the most similar rule or instance stored
in the database is used to estimate the class. Similarity
is viewed as the closest rule or instance based on a distance measure. In short, these methods can generalize an
instance into a hyperrectangle or rule [60], [61].
3) Hyperspherical prototypes: This area [62] studies the use
of hyperspheres to cover the training patterns of each class.
The basic idea is to cluster the space into several objects,
each of them corresponding only to one class, and the
class of the nearest object is assigned to the test example.
4) Weighting: This task consists of applying weights to the
instances of the training set, thus modifying the distance
measure between them and any other instance. This technique could be integrated with the PS and PG methods [16], [63]–[66] to improve the accuracy
in classification problems and to avoid overfitting. A complete review dealing with this topic can be found in [67].
5) Distance functions: Several distance metrics have been
used with NN, especially when working with categorical
attributes [68]. Many different distance measures try to
optimize the performance of NN [15], [64], [69], [70],
and they have successfully increased the classification
accuracy. Advanced work is based on adaptive distance
functions [71].
6) Oversampling: This term is frequently used in learning
with imbalanced classes [72], [73], and is closely related
to undersampling [74]. Oversampling techniques replicate and generate artificial examples that belong to the
minority classes in order to strengthen the presence of
minority samples and to increase the performance over
them. SMOTE [75] is the most well known oversampling
technique and it has been shown to be very effective in
many domains of application [76].
III. PROTOTYPE GENERATION: TAXONOMY
The main characteristics of the PG methods have been described in Section II-A, and they can be used to categorize the
PG methods proposed in the literature. The type of reduction,
resulting generation set, generation mechanisms, and the evaluation of the search constitute a set of properties that define each PG method. This section presents the taxonomy of PG methods based on these properties.

Fig. 1. Prototype generation map.
In Fig. 1, we show the PG map with the representative
methods proposed in each paper ordered in time. We refer
to representative methods, which are preferred by the authors or have reported the best results in the corresponding proposal paper. Some interesting remarks can be seen in
Fig. 1.
1) Only two class-relabeling methods have been proposed
for PG algorithms. The reason is that, although both methods obtain great accuracy results, their underlying concept does not achieve
high reduction rates, which is one of the most important
objectives of PG. Furthermore, it is important to point out
that both algorithms are based on decremental reduction,
and that they have noise filtering purposes.
2) The condensation techniques constitute a wide group.
They usually use a semiwrapper evaluation with any type
of reduction. It is considered a classic idea due to the fact
that, in recent years, hybrid models have been preferred over condensation techniques, with few exceptions. ICPL2 was the first PG method with a hybrid approach, combining edition and condensation stages.
3) Recent efforts in proposing positioning adjustment algorithms have focused on mixed reduction. Most of the methods following this scheme are based on LVQ, and the recent approaches try to alleviate the main drawback of fixed reduction.
4) There have been many efforts in centroid-based techniques because they have reported a great synergy with the NN rule since the first algorithm, PNN. Furthermore, many of them are based on simple and intuitive heuristics, which allow them to obtain a reduced set with high accuracy.
By contrast, those with decremental and mixed reduction
are slow techniques.
5) Wrapper evaluation appeared a few years ago and is only present in hybrid approaches. This evaluation search is intended to optimize a selection, without taking into account computational costs.

Fig. 2. Prototype generation hierarchy.

Fig. 2 illustrates the categorization following a hierarchy
based on this order: generation mechanisms, resulting generation set, type of reduction, and finally, evaluation of the search.
The properties studied here can help to understand how the
PG algorithms work. In the following sections, we will establish which methods perform best, for each family, considering several metrics of performance with a wide experimental
framework.
IV. EXPERIMENTAL FRAMEWORK
In this section, we show the factors and issues related to
the experimental study. We provide the measures employed to
evaluate the performance of the algorithms (see Section IV-A),
details of the problems chosen for the experimentation (see
Section IV-B), parameters of the algorithms (see Section IV-C),
and finally, the statistical tests employed to contrast the results
obtained are described (see Section IV-D).
A. Performance Measures for Standard Classification
In this study, we deal with multiclass datasets. In these domains, two measures are widely used because of their simplicity
and successful application. We refer to the classification rate
and Cohen’s kappa rate measures, which we will explain in the
following.
1) Classification rate: It is the number of successful hits (correct classifications) relative to the total number of classifications. It has been by far the most commonly used metric
to assess the performance of classifiers for years [2], [77].
2) Cohen’s Kappa (Kappa rate): It is an alternative measure to the classification rate, since it compensates for random hits [78]. In contrast to the classification rate, kappa evaluates the portion of hits that can be attributed to the classifier itself (i.e., not to mere chance), relative to all the classifications that cannot be attributed to chance alone. An easy way to compute Cohen’s kappa is to make use of the resulting confusion matrix (see Table III) in a classification task. With the following expression, we can obtain Cohen’s kappa:

$$kappa = \frac{n\sum_{i=1}^{\Omega} h_{ii} - \sum_{i=1}^{\Omega} T_{r_i} T_{c_i}}{n^2 - \sum_{i=1}^{\Omega} T_{r_i} T_{c_i}} \qquad (1)$$

where $h_{ii}$ is the cell count in the main diagonal (the number of true positives for each class), $n$ is the number of examples, $\Omega$ is the number of class labels, and $T_{r_i}$ and $T_{c_i}$ are the rows’ and columns’ total counts, respectively ($T_{r_i} = \sum_{j=1}^{\Omega} h_{ij}$, $T_{c_i} = \sum_{j=1}^{\Omega} h_{ji}$).
Cohen’s kappa ranges from −1 (total disagreement)
through 0 (random classification) to 1 (perfect agreement).
For multiclass problems, kappa is a very useful, yet simple, measure of a classifier’s classification rate while
compensating for random successes.
The main difference between the classification rate and
Cohen’s kappa is the scoring of the correct classifications. Classification rate scores all the successes over all
classes, whereas Cohen’s kappa scores the successes independently for each class and aggregates them. The second
way of scoring is less sensitive to randomness caused by
a different number of examples in each class.
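For illustration, (1) can be computed directly from the confusion matrix of Table III; a minimal sketch follows (the function name is illustrative, and a NumPy matrix of counts is assumed):

```python
import numpy as np

def cohen_kappa(confusion: np.ndarray) -> float:
    """Cohen's kappa from an Omega x Omega confusion matrix, as in (1)."""
    n = confusion.sum()                # total number of examples
    hits = np.trace(confusion)         # sum of the main diagonal h_ii
    Tr = confusion.sum(axis=1)         # row totals T_ri
    Tc = confusion.sum(axis=0)         # column totals T_ci
    chance = np.dot(Tr, Tc)            # sum_i T_ri * T_ci
    return (n * hits - chance) / (n ** 2 - chance)
```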
B. Datasets
In the experimental study, we selected 59 datasets from the
University of California, Irvine (UCI) repository [79] and KEEL
dataset1 [38]. Table II summarizes the properties of the selected
datasets. It shows, for each dataset, the number of examples
(#Ex.), the number of attributes (#Atts.), the number of numerical (#Num.) and nominal (#Nom.) attributes, and the number of
classes (#Cl.). The datasets are grouped into two categories depending on their size: small datasets have fewer than 2000 instances, and large datasets have more than 2000 instances. The datasets considered are partitioned by using the tenfold cross-validation (10-fcv) procedure.
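As a sketch of the 10-fcv procedure (using scikit-learn; the shuffling and seed are assumptions, not the exact partitions used in the study):

```python
from sklearn.model_selection import KFold

def tenfold_partitions(X, y):
    """Yield the ten training/test splits of tenfold cross-validation:
    each fold serves once as the test set, the other nine as training."""
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                     random_state=1).split(X):
        yield X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```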
C. Parameters
Many different method configurations have been established
by the authors in each paper for the PG techniques. In our
experimental study, we have used the parameters defined in the reference where each method was originally described, assuming that the values of the parameters were optimally chosen. The configuration parameters, which are common to all
problems, are shown in Table IV. Note that some PG methods
have no parameters to be fixed; therefore, they are not included
in this table.
1 http://sci2s.ugr.es/keel/datasets.
TABLE II
SUMMARY DESCRIPTION FOR CLASSIFICATION DATASETS
TABLE III
CONFUSION MATRIX FOR AN Ω-CLASS PROBLEM
In most of the techniques, Euclidean distance is used as the
similarity function, to decide which neighbors are closest. Furthermore, to avoid problems with a large number of attributes
and distances, all datasets have been normalized between 0 and
1. This normalization process allows all the PG methods to be applied over each dataset, independently of the types of attributes.
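A minimal sketch of this min-max normalization (the guard against constant attributes is an implementation assumption):

```python
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Rescale every attribute to the [0, 1] interval."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (X - mins) / span
```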
D. Statistical Tests for Performance Comparison

In this paper, we use hypothesis-testing techniques to provide statistical support for the analysis of the results [80], [81]. Specifically, we use nonparametric tests because the initial conditions that guarantee the reliability of the parametric tests may not be satisfied, causing the statistical analysis to lose credibility. These tests are suggested in the studies presented in [80] and [82]–[84], where their use in the field of machine learning is highly recommended.

The Wilcoxon test [82], [83] is adopted considering a level of significance of α = 0.1. More information about statistical tests and the results obtained can be found in the web site associated with this paper (http://sci2s.ugr.es/pgtax).
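For illustration, a pairwise comparison of two PG methods with the Wilcoxon signed-rank test can be sketched with SciPy (names are illustrative; this is not the exact script behind the published tables):

```python
from scipy.stats import wilcoxon

def compare_methods(acc_a, acc_b, alpha=0.1):
    """Wilcoxon signed-rank test over paired per-dataset accuracies;
    reject the null hypothesis of equal performance when p < alpha."""
    statistic, p_value = wilcoxon(acc_a, acc_b)
    return p_value, p_value < alpha
```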
E. Other Considerations
We want to outline that the implementations are based only on
the descriptions and specifications given by the respective authors in their papers. No advanced data structures or enhancements for improving the efficiency of PG methods have been used. All methods are available in KEEL software [38].
V. ANALYSIS OF RESULTS
This section presents the average results collected in the experimental study and some discussions of them; the complete results can be found on the web page associated with this paper.

TABLE IV
PARAMETER SPECIFICATION FOR ALL THE METHODS EMPLOYED IN THE EXPERIMENTATION
The study will be divided into two parts: analysis of the results
obtained over small-size datasets (see Section V-A) and over
large datasets (see Section V-B). Finally, a global analysis is
added in Section V-C.
A. Analysis and Empirical Results of Small-Size Datasets
Table V presents the average results obtained by the PG methods over the 40 small-size datasets. Red. denotes reduction rate
achieved, train Acc. and train Kap. present the accuracy and
kappa obtained in the training data, respectively; on the other
hand, tst Acc. and tst Kap. present the accuracy and kappa obtained over the test data. Finally, Time denotes the average time
elapsed in seconds to finish a run of a PG method. The algorithms
are ordered from the best to the worst for each type of result.
Algorithms highlighted in bold are those which obtain the best
result in their corresponding family, according to the first level
of the hierarchy in Fig. 2.
Fig. 3 depicts the opposition between the two objectives: reduction and test accuracy. Each algorithm located inside the graphic gets its position from the average values
of each measure evaluated (exact position corresponding to the
beginning of the name of the algorithm). Across the graphic,
there is a line that represents the threshold of test accuracy
achieved by the 1NN algorithm without preprocessing. Note
that in Fig. 3(a), the names of some PG methods overlap, and
hence, Fig. 3(b) shows this overlapping zone.
To complete the set of results, the web site associated with
this paper contains the results of applying the Wilcoxon test
to all possible comparisons among all the PG methods considered in small
datasets.
Observing Table V, Fig. 3, and the Wilcoxon Test, we can
point out some interesting facts as follows.
1) Some classical algorithms are at the top in accuracy and
kappa rate. For instance, GENN, GMCA, and MSE obtain better results than other recent methods over test
data. However, these techniques usually have a poor associated reduction rate. We can observe this statement in
the Wilcoxon test, where classical methods significantly
outperform other recent approaches in terms of accuracy and kappa rates. However, in terms of Acc. ∗ Red. and
Kap. ∗ Red. measures, typically, these methods do not
outperform recent techniques.
2) PSO and ENPC can be highlighted as the best performing methods of the positioning adjustment family. They belong to different subfamilies, fixed and mixed reduction, respectively. PSO focuses on improving the classification accuracy, and it obtains a good generalization capability. On the other hand, ENPC has overfitting as its main drawback, which is clearly discernible from Table V. In general, LVQ-based approaches
obtain worse accuracy rates than 1NN, but the reduction
rate achieved by them is very high. MSE and HYB are the
most outstanding techniques belonging to the subgroup of
condensation and positioning adjustment.
3) With respect to class-relabeling methods, GENN obtains
better accuracy/kappa rates but worse reduction rates than
Depur. However, the statistical test informs that GENN
does not outperform the Depur algorithm in terms of
accuracy and kappa rate. Furthermore, when the reduction
rate is taken into consideration, i.e., when the statistical test
is based on the Acc. ∗ Red. and Kap. ∗ Red. measures,
the Depur algorithm clearly outperforms GENN.
4) The decremental approaches belonging to the centroids
family require high computation times, but usually offer
good reduction rates. MCA and PNN tend to overfit the
data, but GMCA obtains excellent results.
5) In the whole centroids family, two methods deserve particular mention: ICPL2 and GMCA. Both generate a reduced
prototype set with good accuracy rates in test data. The
other approaches based on fixed and incremental reduction
are less appropriate to improve the effectiveness of 1NN,
but they are very fast and offer highly reduced generated
sets.
6) Regarding space-splitting approaches, several differences
can be observed. RSP3 is an algorithm based on Chen’s
algorithm, but tries to avoid drastic changes in the form of
the decision boundaries, and it produces a good tradeoff
between reduction and accuracy. Although the POC algorithm is a relatively modern technique, it does not obtain great results. We can justify these results because the α parameter is very sensitive to each dataset. Furthermore,
TABLE V
AVERAGE RESULTS OBTAINED BY THE PG METHODS OVER SMALL DATASETS
Fig. 3. Accuracy in test versus reduction in small datasets. (a) All PG methods. (b) Zoom in the overlapping reduction-rate zone.
it is quite slow when tackling datasets with more than two
classes.
7) The best methods in accuracy/kappa rates for each one of
the families are PSO, GENN, ICPL2, and RSP3, respectively, and five methods outperform 1NN in accuracy.
8) In general, hybrid methods obtain the best result in terms
of accuracy and reduction rate.
9) Usually, there is no difference between the rankings obtained with accuracy and kappa rates, except for some
concrete algorithms. For example, we can observe that
1NN obtains a lower ranking with the kappa measure, which
probably indicates that 1NN benefits from random hits.
Furthermore, in the web site associated with this paper, we can
find an analysis of the results depending on the type of attributes
of the datasets. We show the results in accuracy/kappa rate for
all PG methods differentiating between numerical, nominal, and
mixed datasets. In numerical and nominal datasets, all attributes
must be numerical and nominal, respectively, whereas in mixed
datasets, we include those datasets with numerical and nominal
attributes mixed. Observing these tables, we want to outline
different properties of the PG methods.
1) In general, there is no difference in performance between
numerical, nominal, and mixed datasets, except for some
concrete algorithms. For example, in mixed datasets, we
can see that a class-relabeling method, GENN, is at the top because it does not modify the attributes. However, in numerical datasets, PSO is the best performing method, indicating that the positioning adjustment strategy is usually well adapted to numerical datasets.
2) In fact, comparing these tables, we observe that some
representative techniques of the positioning adjustment
family, such as PSO, MSE, and ENPC, have an accuracy/kappa rate close to 1NN. However, over nominal and
mixed datasets, they decrease their accuracy rates.
3) ICPL2 and GMCA techniques obtain good accuracy/kappa rates independent of the type of input data.
Finally, we perform a study depending on the number of
classes of the datasets. In the web site associated with this paper,
we show the average results in accuracy/kappa rate differentiating between binary and multiclass datasets. We can analyze
several details from the results collected, which are as follows.
1) Eight techniques outperform 1NN in accuracy when they
tackle binary datasets. However, over multiclass datasets,
there are only three techniques that are able to overcome
1NN.
2) Centroid-based techniques usually perform well when
dealing with multiclass datasets. For instance, we can
highlight the MCA, SGP, PNN, ICPL2, and GMCA techniques, which increase their respective rankings with multiclass datasets.
3) GENN and ICPL2 techniques obtain good accuracy/kappa rates independent of the number of
classes.
4) PSCSA performs well with binary datasets. However, over multiclass datasets, its performance decreases.
5) Some methods present significant differences between accuracy and kappa measures when dealing with binary
datasets. We can highlight MSE, Depur, Chen, and BTS3 as techniques penalized by the kappa measure.
B. Analysis and Empirical Results of Large-Size Datasets
This section presents the study and analysis of large-size
datasets. The goal of this study is to analyze the effect of scaling
up the data in PG methods. For time complexity reasons, several algorithms cannot be run over large datasets. PNN, MCA,
GMCA, ICPL2, and POC are extremely slow techniques, and
their time complexity quickly increases when the data scale up
or contain more than five classes.
TABLE VI
AVERAGE RESULTS OBTAINED BY THE PG METHODS OVER LARGE DATASETS

Table VI shows the average results obtained, and Fig. 4 illustrates the comparison between the accuracy and reduction rates
of the PG methods over large-size datasets. Finally, the web
site associated with this paper contains the results of applying
the Wilcoxon test over all possible comparisons among all the PG methods considered in large datasets.
These tables allow us to highlight some observations of the
results obtained as follows.
1) Only the GENN approach outperforms 1NN in accuracy/kappa rate.
2) Some methods present clear differences when dealing with
large datasets. For instance, we can highlight the PSO and
RSP3 techniques. The former may suffer from a lack of
convergence because the performance obtained
in training data is slightly higher than that obtained by
1NN; hence, it may be a sign that more iterations are
needed to tackle large datasets. On the other hand, the
techniques based on space partitioning present some drawbacks when the data scale up and are made up of more
attributes. This is the case with RSP3.
3) In general, LVQ-based methods do not work well when
the data scale up.
Fig. 4. Accuracy in test versus reduction in large datasets. (a) All PG methods considered over large datasets. (b) Zoom in the overlapping reduction-rate zone.
4) BTS3 stands out as the best centroids-based method over
large-size datasets because the best performing ones over
small datasets were also the most complex in time, and
they cannot be run here.
5) Although ENPC overfits the data, it is the best performing method when considering the tradeoff between accuracy/kappa and reduction rates. PSO can also be highlighted as a good
candidate in this type of dataset.
6) There are no significant differences between the accuracy
and kappa rankings when dealing with large datasets.
Again, we differentiate between numerical, nominal, and
mixed datasets. Complete results can be found in the web site
associated with this paper. Observing these results, we want to
outline different properties of PG methods over large datasets.
Note that there is only one dataset with mixed attributes; for
this reason, we focus this analysis on the differences between
numerical and nominal datasets.
1) When only numerical datasets are taken into consideration, three algorithms outperform the 1NN rule: GENN,
PSO, and ENPC.
2) Over nominal large datasets, no PG method outperforms
1NN.
3) MixtGauss and AMPSO are highly conditioned on the
type of input data, preferring numerical datasets. By contrast, RSP3 is better adapted to nominal datasets.
Finally, we again perform an analysis of the behavior of the PG techniques depending on the number of classes, but in this case over large datasets. The web site associated with this paper presents the results. Observing these results, we can point out
several comments.
1) Over binary large datasets, there are four algorithms
that outperform 1NN. However, when the PG techniques tackle multiclass datasets, no PG method overcomes 1NN.
2) When dealing with large datasets, there are no important differences between the accuracy and kappa rankings with binary datasets.
3) Class-relabeling methods perform well independent of the
number of classes.
C. Global Analysis
This section shows a global view of the obtained results. As
a summary, we want to outline several remarks on the use of
PG because the choice of a certain method depends on various
factors.
1) Several PG methods can be emphasized according to their
test accuracy/kappa obtained: PSO, ICPL2, ENPC, and
GENN. In principle, in terms of reduction capabilities,
PSCSA and AVQ obtain the best results, but they offer
poor accuracy rates. Taking into consideration the computational cost, we can consider DSM, LVQ3, and VQ to be
the fastest algorithms.
2) Edition schemes usually outperform the 1NN classifier,
but the number of prototypes in the result set is too high.
This fact could be prohibitive over large datasets because
there is no significant reduction. Furthermore, other PG
methods have shown that it is possible to preserve high
accuracy with a better reduction rate.
3) A high reduction rate serves no purpose if there is no minimum guarantee of performance accuracy. This is the case
of PSCSA or AVQ. Nevertheless, MSE offers excellent
reduction rates without losing performance accuracy.
4) For the tradeoff reduction–accuracy rate, PSO has been
reported to have the best results over small-size datasets. In
the case of dealing with large datasets, the ENPC approach
seems to be the most appropriate one.
5) A good reduction–accuracy balance is difficult to achieve
with a fast algorithm. Considering this restriction, we
could say that RSP3 allows us to yield generated sets
with a good tradeoff among reduction, accuracy, and time
complexity.
VI. VISUALIZATION OF DATA RESULTING SETS: A CASE
STUDY BASED ON BANANA DATASET
This section is devoted to illustrating the subsets resulting from some PG algorithms considered in this study. To
do this, we focus on the banana dataset, which contains 5300
examples in the complete set. It is an artificial dataset of two
Fig. 5. Data generated sets in banana dataset. (a) Banana original (0.8751, 0.7476). (b) GENN (0.0835, 0.8826, 0.7626). (c) LVQ3 (0.9801, 0.8370, 0.6685).
(d) Chen (0.9801, 0.8792, 0.7552). (e) RSP3 (0.8962, 0.8755, 0.7482). (f) BTS3 (0.9801, 0.8557, 0.7074). (g) SGP (0.9961, 0.6587, 0.3433). (h) PSO
(0.9801, 0.8819, 0.7604). (i) ENPC (0.7485, 0.8557, 0.7086).
classes composed of three well-defined clusters of instances of
the class −1 and two clusters of the class 1. Although the borders are clear among the clusters, there is a high overlap between
both classes. The complete dataset is illustrated in Fig. 5(a).
The pictures of the generated sets by some PG methods could
help to visualize and understand their way of working and the
results obtained in the experimental study. The reduction rate
and the accuracy and kappa values in test data registered in the
experimental study are specified for each one. In the original
dataset, the two values indicated correspond to accuracy and
kappa with 1NN.
1) Fig. 5(b) depicts the generated data by the algorithm
GENN. It belongs to the edition approaches, and the generated subset differs slightly from the original dataset. Those
samples found within the class boundaries can either be
removed or be relabeled. It is noticeable that the clusters
of different classes are a little more separated.
2) Fig. 5(c) shows the resulting subset of the classical LVQ3
condensation algorithm. It can be appreciated that most of
the points are moved to define the class boundaries, but a
few interior points are also used. The accuracy and kappa
decrease with respect to the original, as is usually the case
with condensation algorithms.
3) Fig. 5(d) and (e) represents the sets generated by the Chen
and RSP3 methods, respectively. These methods are based
on a space-splitting strategy, but the first one requires the
specification of the final size of the generated sets, while
the latter does not. We can see that the Chen method
generates prototypes keeping a homogeneous distribution
of points in the space. RSP3 was proposed to fix some
problems observed in the Chen method, but in this concrete
dataset, this method is worse in accuracy/kappa rates than
its ancestor. However, the reduction type of Chen’s method
is fixed, and it is very dependent on the dataset tackled.
4) Fig. 5(f) and (g) represents the sets of data generated
by BTS3 and SGP methods. Both techniques are cluster-based and present very high reduction rates over this
dataset. SGP does not work well in this dataset because
it promotes the removal of prototypes and uses an incremental order, which does not allow us to choose the most
appropriate decision. BTS3 uses a fixed reduction type;
thus, it focuses on improving accuracy rates, but its generation mechanisms are not well suited for this type of
dataset.
5) Fig. 5(h) and (i) illustrates the sets of data generated by
PSO and ENPC methods. They are wrapper and hybrid
methods of the position-adjusting family and iterate many
times to obtain an optimal reallocation of prototypes. PSO
requires the final size of the subset selected as a parameter,
and this parameter is highly conditioned by the complexity
of the dataset addressed. In the banana case, keeping 2%
of prototypes seems to work well. On the other hand,
ENPC can adjust the number of prototypes required to fit
a specific dataset. In the case study presented, we can see
that it obtains similar sets to those obtained by the Chen
approach because it also fills the regions with a homogeneous distribution of generated prototypes. In decision
boundaries, the density of prototypes is increased, which may produce quite noisy samples for further classification of the test data. This explains its poor behavior in this problem
with respect to PSO, the lower reduction rate achieved,
and the decrement of accuracy/kappa rates with regard to
the original dataset classified with 1NN.
We have seen the resulting datasets of condensation, edition,
and hybrid methods and different generation mechanisms with
some representative PG methods. Although the methods can be
categorized as a specific family, they do not follow a specific
behavior pattern, since some of the condensation techniques
may generate interior points (as in LVQ3), others clusters of data (RSP3), or even points with a homogeneous distribution in space (Chen or ENPC). Nevertheless, the visual characteristics of generated sets are also of interest and can help in deciding the choice of a PG method.
VII. CONCLUSION
In this paper, we have provided an overview of the PG methods proposed in the literature. We have identified the basic and
advanced characteristics. Furthermore, existing work and related fields have been reviewed. Based on the main characteristics studied, we have proposed a taxonomy of the PG methods.
The most important methods have been empirically analyzed
over small and large classification datasets. To illustrate
and strengthen the study, some graphical representations of data
subsets selected have been drawn and statistical analysis based
on nonparametric tests has been employed. Several remarks and
guidelines can be suggested.
1) A researcher who needs to apply a PG method should
know the main characteristics of these kinds of methods in
order to choose the most suitable. The taxonomy proposed
and the empirical study can help a researcher to make this
decision.
2) To propose a new PG method, a rigorous analysis should be conducted to compare it with the most well-known approaches and with those that fit the basic properties of the new proposal. To do this, the taxonomy and the analysis of influence in the literature can help guide a future proposal toward the correct methods.
3) This paper helps nonexperts in PG methods to differentiate
between them, to make an appropriate decision about their
application, and to understand their behavior.
4) It is important to know the main advantages of each PG
method. In this paper, many PG methods have been empirically analyzed, but a specific conclusion cannot be drawn
regarding the best performing method. This choice depends on the problem tackled, but the results offered in
this paper could help to reduce the set of candidates.
REFERENCES
[1] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,”
IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.
[2] E. Alpaydin, Introduction to Machine Learning, 2nd ed. Cambridge,
MA: MIT Press, 2010.
[3] V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory and
Methods, 2nd ed. New York: Interscience, 2007.
[4] I. Kononenko and M. Kukar, Machine Learning and Data Mining: Introduction to Principles and Algorithms. West Sussex: Horwood, 2007.
[5] E. K. Garcia, S. Feldman, M. R. Gupta, and S. Srivastava, “Completely
lazy learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 9, pp. 1274–
1285, Sep. 2010.
[6] A. N. Papadopoulos and Y. Manolopoulos, Nearest Neighbor Search: A
Database Perspective. New York: Springer-Verlag, 2004.
[7] G. Shakhnarovich, T. Darrell, and P. Indyk, Eds., Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Cambridge, MA: MIT
Press, 2006.
[8] X. Wu and V. Kumar, Eds., The Top Ten Algorithms in Data Mining.
(Chapman & Hall/CRC Data Mining and Knowledge Discovery Series).
Boca Raton, FL: CRC, 2009.
[9] A. Shintemirov, W. Tang, and Q. Wu, “Power transformer fault classification based on dissolved gas analysis by implementing bootstrap and
genetic programming,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev.,
vol. 39, no. 1, pp. 69–79, Jan. 2009.
[10] P. G. Espejo, S. Ventura, and F. Herrera, “A survey on the application of
genetic programming to classification,” IEEE Trans. Syst., Man, Cybern.
C, Appl. Rev., vol. 40, no. 2, pp. 121–144, Mar. 2009.
[11] S. Magnussen, R. McRoberts, and E. Tomppo, “Model-based mean square
error estimators for k-nearest neighbour predictions and applications using remotely sensed data for forest inventories,” Remote Sens. Environ.,
vol. 113, no. 3, pp. 476–488, 2009.
[12] M. Govindarajan and R. Chandrasekaran, “Evaluation of k-nearest neighbor classifier performance for direct marketing,” Expert Syst. Appl.,
vol. 37, no. 1, pp. 253–258, 2009.
[13] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed.
New York: Wiley, 2001.
[14] Y. Chen, E. Garcia, M. Gupta, A. Rahimi, and L. Cazzanti, “Similaritybased classification: Concepts and algorithms,” J. Mach. Learning Res.,
vol. 10, pp. 747–776, 2009.
[15] K. Weinberger and L. Saul, “Distance metric learning for large margin
nearest neighbor lassification,” J. Mach. Learning Res., vol. 10, pp. 207–
244, 2009.
[16] F. Fernández and P. Isasi, “Local feature weighting in nearest prototype
classification,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 40–53, Jan.
2008.
[17] E. Pekalska and R. P. Duin, “Beyond traditional kernels: Classification in two dissimilarity-based representation spaces,” IEEE Trans.
Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 6, pp. 729–744, Nov.
2008.
[18] P. Cunningham, “A taxonomy of similarity mechanisms for case-based
reasoning,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 11, pp. 1532–
1543, Nov. 2009.
[19] B. Li, Y. W. Chen, and Y. Chen, “The nearest neighbor algorithm of local
probability centers,” IEEE Trans. Syst., Man, Cybern. B. Cybern., vol. 38,
no. 1, pp. 141–154, Feb. 2008.
[20] C.-M. Hsu and M.-S. Chen, “On the design and applicability of distance
functions in high-dimensional data space,” IEEE Trans. Knowl. Data
Eng., vol. 21, no. 4, pp. 523–536, Apr. 2009.
TRIGUERO et al.: A TAXONOMY AND EXPERIMENTAL STUDY ON PROTOTYPE GENERATION FOR NEAREST NEIGHBOR CLASSIFICATION
[21] D. R. Wilson and T. R. Martinez, “Reduction techniques for instancebased learning algorithms,” Mach. Learning, vol. 38, no. 3, pp. 257–286,
2000.
[22] S. W. Kim and B. J. Oommen, “A brief taxonomy and ranking of creative
prototype reduction schemes,” Pattern Anal. Appl., vol. 6, pp. 232–244,
2003.
[23] H. Brighton and C. Mellish, “Advances in instance selection for instancebased learning algorithms,” Data Mining Knowl. Discov., vol. 6, no. 2,
pp. 153–172, 2002.
[24] E. Marchiori, “Class conditional nearest neighbor for large margin instance
selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2, pp. 364–
370, Feb. 2010.
[25] N. Garcı́a-Pedrajas, “Constructing ensembles of classifiers by means of
weighted instance selection,” IEEE Trans. Neural Netw., vol. 20, no. 2,
pp. 258–277, Feb. 2009.
[26] E. Pekalska, R. P. W. Duin, and P. Paclı́k, “Prototype selection for
dissimilarity-based classifiers,” Pattern Recognit., vol. 39, no. 2, pp. 189–
208, 2006.
[27] M. Lozano, J. M. Sotoca, J. S. Sánchez, F. Pla, E. Pekalska, and R.
P. W. Duin, “Experimental study on prototype optimisation algorithms
for prototype-based classification in vector spaces,” Pattern Recognit.,
vol. 39, no. 10, pp. 1827–1838, 2006.
[28] H. A. Fayed, S. R. Hashem, and A. F. Atiya, “Self-generating prototypes
for pattern classification,” Pattern Recognit., vol. 40, no. 5, pp. 1498–
1509, 2007.
[29] W. Lam, C. K. Keung, and D. Liu, “Discovering useful concept prototypes
for classification based on filtering and abstraction,” IEEE Trans. Pattern
Anal. Mach. Intell., vol. 14, no. 8, pp. 1075–1090, Aug. 2002.
[30] J. S. Sánchez, “High training set size reduction by space partitioning and
prototype abstraction,” Pattern Recognit., vol. 37, no. 7, pp. 1561–1564,
2004.
[31] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Trans.
Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
[32] C.-L. Chang, “Finding prototypes for nearest neighbor classifiers,” IEEE
Trans. Comput., vol. C-23, no. 11, pp. 1179–1184, Nov. 1974.
[33] T. Kohonen, “The self organizing map,” Proc. IEEE, vol. 78, no. 9,
pp. 1464–1480, Sep. 1990.
[34] C. H. Chen and A. Jóźwik, “A sample set condensation algorithm for the
class sensitive artificial neural network,” Pattern Recognit. Lett., vol. 17,
no. 8, pp. 819–823, Jul. 1996.
[35] R. Kulkarni and G. Venayagamoorthy, “Particle swarm optimization in wireless-sensor networks: A brief survey,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., to be published. DOI: 10.1109/TSMCC.2010.2054080.
[36] F. Fernández and P. Isasi, “Evolutionary design of nearest prototype classifiers,” J. Heurist., vol. 10, no. 4, pp. 431–454, 2004.
[37] L. Nanni and A. Lumini, “Particle swarm optimization for prototype reduction,” Neurocomputing, vol. 72, no. 4–6, pp. 1092–1097, 2008.
[38] J. Alcalá-Fdez, L. Sánchez, S. Garcı́a, M. J. del Jesus, S. Ventura, J.
M. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, J. C. Fernández,
and F. Herrera, “KEEL: A software tool to assess evolutionary algorithms
for data mining problems,” Soft Comput., vol. 13, no. 3, pp. 307–318,
2008.
[39] J. Koplowitz and T. Brown, “On the relation of performance to editing in nearest neighbor rules,” Pattern Recognit., vol. 13, pp. 251–255,
1981.
[40] S. Geva and J. Sitte, “Adaptive nearest neighbor pattern classification,”
IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 318–322, Mar. 1991.
[41] Q. Xie, C. A. Laszlo, and R. K. Ward, “Vector quantization technique for
nonparametric classifier design,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 15, no. 12, pp. 1326–1330, Dec. 1993.
[42] Y. Hamamoto, S. Uchimura, and S. Tomita, “A bootstrap technique for
nearest neighbor classifier design,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 19, no. 1, pp. 73–79, Jan. 1997.
[43] R. Odorico, “Learning vector quantization with training count (LVQTC),”
Neural Netw., vol. 10, no. 6, pp. 1083–1088, 1997.
[44] C. Decaestecker, “Finding prototypes for nearest neighbour classification by
means of gradient descent and deterministic annealing,” Pattern Recognit., vol. 30, no. 2, pp. 281–288, 1997.
[45] J. C. Bezdek, T. R. Reichherzer, G. Lim, and Y. Attikiouzel, “Multiple prototype classifier design,” IEEE Trans. Syst., Man Cybern. C, Appl. Rev., vol. 28, no. 1, pp. 67–79, Feb. 1998.
[46] R. Mollineda, F. Ferri, and E. Vidal, “A merge-based condensing strategy
for multiple prototype classifiers,” IEEE Trans. Syst., Man Cybern. B,
Cybern., vol. 32, no. 5, pp. 662–668, Oct. 2002.
99
[47] J. S. Sánchez, R. Barandela, A. I. Marqués, R. Alejo, and J. Badenas,
“Analysis of new techniques to obtain quality training sets,” Pattern
Recognit. Lett., vol. 24, no. 7, pp. 1015–1022, 2003.
[48] S. W. Kim and B. J. Oommen, “Enhancing prototype reduction schemes with LVQ3-type algorithms,” Pattern Recognit., vol. 36, pp. 1083–1093,
2003.
[49] C.-W. Yen, C.-N. Young, and M. L. Nagurka, “A vector quantization
method for nearest neighbor classifier design,” Pattern Recognit. Lett.,
vol. 25, no. 6, pp. 725–731, 2004.
[50] J. Li, M. T. Manry, C. Yu, and D. R. Wilson, “Prototype classifier design
with pruning,” Int. J. Artif. Intell. Tools, vol. 14, no. 1–2, pp. 261–280,
2005.
[51] T. Raicharoen and C. Lursinsap, “A divide-and-conquer approach to the
pairwise opposite class-nearest neighbor (POC-NN) algorithm,” Pattern
Recognit. Lett., vol. 26, no. 10, pp. 1554–1567, 2005.
[52] A. Cervantes, I. M. Galván, and P. Isasi, “AMPSO: A new particle swarm
method for nearest neighborhood classification,” IEEE Trans. Syst., Man,
Cybern. B, Cybern., vol. 39, no. 5, pp. 1082–1091, Oct. 2009.
[53] U. Garain, “Prototype reduction using an artificial immune model,” Pattern Anal. Appl., vol. 11, no. 3–4, pp. 353–363, 2008.
[54] D. L. Wilson, “Asymptotic properties of nearest neighbor rules using
edited data,” IEEE Trans. Syst., Man Cybern., vol. SMC-2, no. 3, pp. 408–
421, Jul. 1972.
[55] P. E. Hart, “The condensed nearest neighbor rule,” IEEE Trans. Inf.
Theory, vol. IT-18, no. 3, pp. 515–516, May 1968.
[56] S. Garcı́a, J.-R. Cano, E. Bernadó-Mansilla, and F. Herrera, “Diagnose of
effective evolutionary prototype selection using an overlapping measure,”
Int. J. Pattern Recognit. Artif. Intell., vol. 28, no. 8, pp. 1527–1548, 2009.
[57] S. Garcı́a, J. R. Cano, and F. Herrera, “A memetic algorithm for evolutionary prototype selection: A scaling up approach,” Pattern Recognit.,
vol. 41, no. 8, pp. 2693–2709, 2008.
[58] H. A. Fayed and A. F. Atiya, “A novel template reduction approach for
the k-nearest neighbor method,” IEEE Trans. Neural Netw., vol. 20, no. 5,
pp. 890–896, May 2009.
[59] J. Derrac, S. Garcı́a, and F. Herrera, “IFS-CoCo: Instance and feature
selection based on cooperative coevolution with nearest neighbor rule,”
Pattern Recognit., vol. 43, no. 6, pp. 2082–2105, 2010.
[60] P. Domingos, “Unifying instance-based and rule-based induction,” Mach.
Learning, vol. 24, no. 2, pp. 141–168, 1996.
[61] O. Luaces and A. Bahamonde, “Inflating examples to obtain rules,” Int.
J. Intell. Syst., vol. 18, pp. 1113–1143, 2003.
[62] H. A. Fayed, S. R. Hashem, and A. F. Atiya, “Hyperspherical prototypes
for pattern classification,” Int. J. Pattern Recognit. Artif. Intell., vol. 23,
no. 8, pp. 1549–1575, 2009.
[63] D. Wettschereck, D. W. Aha, and T. Mohri, “A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms,”
Artif. Intell. Rev., vol. 11, no. 1–5, pp. 273–314, 1997.
[64] R. Paredes and E. Vidal, “Learning weighted metrics to minimize nearestneighbor classification error,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 28, no. 7, pp. 1100–1110, Jul. 2006.
[65] M. Z. Jahromi, E. Parvinnia, and R. John, “A method of learning weighted
similarity function to improve the performance of nearest neighbor,” Inf.
Sci., vol. 179, no. 17, pp. 2964–2973, 2009.
[66] C. Vallejo, J. Troyano, and F. Ortega, “InstanceRank: Bringing order to
datasets,” Pattern Recognit. Lett., vol. 31, no. 2, pp. 133–142, 2010.
[67] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning,”
Artif. Intell. Rev., vol. 11, pp. 11–73, 1997.
[68] D. R. Wilson and T. R. Martinez, “Improved heterogeneous distance functions,” J. Artif. Intell. Res., vol. 6, no. 1, pp. 1–34, 1997.
[69] R. D. Short and K. Fukunaga, “Optimal distance measure for nearest neighbor classification,” IEEE Trans. Inf. Theory, vol. IT-27, no. 5, pp. 622–627,
Sep. 1981.
[70] C. Gagné and M. Parizeau, “Coevolution of nearest neighbor classifiers,”
Int. J. Pattern Recognit. Artif. Intell., vol. 21, no. 5, pp. 921–946, 2007.
[71] J. Wang, P. Neskovic, and L. Cooper, “Improving nearest neighbor rule
with a simple adaptive distance measure,” Pattern Recognit. Lett., vol. 28,
no. 2, pp. 207–213, 2007.
[72] Y. Sun, A. K. C. Wong, and M. S. Kamel, “Classification of imbalanced
data: A review.,” Int. J. Pattern Recognit. Artif. Intell., vol. 23, no. 4,
pp. 687–719, 2009.
[73] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans.
Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[74] S. Garcı́a and F. Herrera, “Evolutionary under-sampling for classification
with imbalanced data sets: Proposals and taxonomy,” Evol. Comput.,
vol. 17, no. 3, pp. 275–306, 2009.
100
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 42, NO. 1, JANUARY 2012
[75] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE:
Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16,
pp. 321–357, 2002.
[76] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, “Automatically
countering imbalance and its empirical relationship to cost,” Data Mining
Knowl. Discov., vol. 17, no. 2, pp. 225–252, 2008.
[77] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques, 2nd ed. San Francisco, CA: Morgan Kaufmann,
2005.
[78] A. Ben-David, “A lot of randomness is hiding in accuracy,” Eng. Appl.
Artif. Intell., vol. 20, pp. 875–885, 2007.
[79] A. Asuncion and D. Newman. (2007). UCI machine learning repository.
[Online]. Available: http://www.ics.uci.edu/mlearn/MLRepository.html.
[80] S. Garcı́a, A. Fernández, J. Luengo, and F. Herrera, “A study of statistical techniques and performance measures for genetics-based machine
learning: Accuracy and interpretability,” Soft Comput., vol. 13, no. 10,
pp. 959–977, 2009.
[81] D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 2nd ed. London, U.K.: Chapman & Hall, 2006.
[82] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,”
J. Mach. Learning Res., vol. 7, pp. 1–30, 2006.
[83] S. Garcı́a and F. Herrera, “An extension on “statistical comparisons of
classifiers over multiple data sets” for all pairwise comparisons,” J. Mach.
Learning Res., vol. 9, pp. 2677–2694, 2008.
[84] S. Garcı́a, A. Fernández, J. Luengo, and F. Herrera, “Advanced nonparametric tests for multiple comparisons in the design of experiments
in computational intelligence and data mining: Experimental analysis of
power,” Inf. Sci., vol. 180, pp. 2044–2064, 2010.
Isaac Triguero received the M.Sc. degree in computer science from the University of Granada,
Granada, Spain, in 2009, where he is currently working toward the Ph.D. degree with the Department of
Computer Science and Artificial Intelligence.
His current research interests include data mining,
data reduction, and evolutionary algorithms.
Joaquı́n Derrac received the M.Sc. degree in
computer science from the University of Granada,
Granada, Spain, in 2008, where he is currently working toward the Ph.D. degree with the Department of
Computer Science and Artificial Intelligence.
His current research interests include data mining, data reduction, lazy learning, and evolutionary
algorithms.
Salvador Garcı́a received the M.Sc. and Ph.D.
degrees in computer science from the University
of Granada, Granada, Spain, in 2004 and 2008,
respectively.
He is currently an Assistant Professor with the
Department of Computer Science, University of
Jaén, Jaén, Spain. His research interests include
data mining, data reduction, data complexity, imbalanced learning, statistical inference, and evolutionary
algorithms.
Francisco Herrera received the M.Sc. and Ph.D. degrees in mathematics from the University of Granada,
Granada, Spain, in 1988 and 1991, respectively.
He is currently a Professor with the Department
of Computer Science and Artificial Intelligence, University of Granada. He has authored or coauthored
more than 150 papers in international journals. He is
a coauthor of the book Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge
Bases (Hackensack, NJ: World Scientific, 2001). He
has coedited five international books and 20 special
issues in international journals on different soft computing topics. He is an Associate Editor of the journals IEEE TRANSACTIONS ON FUZZY SYSTEMS, Information Sciences, Mathware and Soft Computing, Advances in Fuzzy Systems,
Advances in Computational Sciences and Technology, and the International
Journal of Applied Metaheuristics Computing. He is also an Area Editor of the
journal Soft Computing (in the area of genetic algorithms and genetic fuzzy systems). He is also a member of several journal editorial boards, such as Fuzzy Sets
and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, the International Journal of Hybrid
Intelligent Systems, and Memetic Computation. His research interests include
computing with words and decision making, data mining, data preparation, instance selection, fuzzy-rule-based systems, genetic fuzzy systems, knowledge
extraction based on evolutionary algorithms, memetic algorithms, and genetic
algorithms.
1.2 IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification
• I. Triguero, S. Garcı́a, F. Herrera, IPADE: Iterative Prototype Adjustment for Nearest Neighbor Classification. IEEE Transactions on Neural Networks 21 (12) (2010) 1984-1990, doi:
10.1109/TNN.2010.2087415.
– Status: Published.
– Impact Factor (JCR 2010): 2.633
– Subject Category: Computer Science, Artificial Intelligence. Ranking 17 / 108 (Q1).
– Subject Category: Computer Science, Hardware & Architecture. Ranking 3 / 48 (Q1).
– Subject Category: Computer Science, Theory & Methods. Ranking 8 / 97 (Q1).
– Subject Category: Engineering, Electrical & Electronic. Ranking 22 / 247 (Q1).
IPADE: Iterative Prototype Adjustment for Nearest
Neighbor Classification
Isaac Triguero, Salvador García, and Francisco Herrera
Abstract— Nearest prototype methods are a successful trend
in many pattern classification tasks. However, they present
several shortcomings such as time response, noise sensitivity, and
storage requirements. Data reduction techniques are suitable to
alleviate these drawbacks. Prototype generation is an appropriate
process for data reduction, which allows the fitting of a dataset
for nearest neighbor (NN) classification. This brief presents a
methodology to learn iteratively the positioning of prototypes
using real parameter optimization procedures. Concretely, we
propose an iterative prototype adjustment technique based on
differential evolution. The results obtained are contrasted with
nonparametric statistical tests and show that our proposal consistently outperforms previously proposed methods, thus becoming
a suitable tool in the task of enhancing the performance of the
NN classifier.
Index Terms— Classification, differential evolution, nearest
neighbor, prototype generation.
I. I NTRODUCTION
Classification is one of the most important tasks in machine
learning and data mining [1], [2]. Most machine learning
methods build a model during the learning process, known
as eager learning methods [3], but there are some approaches
where the algorithm does not need a model. These algorithms
are known as lazy learning methods [4].
The nearest neighbor (NN) algorithm [5] and its derivatives
belong to the family of lazy learning. It has proved itself to
perform well for classification problems in many domains [2],
[6] and is considered one of the top ten methods in data mining
[7]. NN is a nonparametric classifier, which requires the storage of the entire training set; it classifies unseen cases by finding the class labels of the closest instances to them.
In order to determine how close two instances are, several
distances or similarity measures have been proposed [8]–[10].
The effectiveness and simplicity of the NN may be affected
by several weaknesses such as high computational cost, high
storage requirement, and sensitivity to noise. Furthermore, NN
makes predictions over existing data and assumes that input
data perfectly delimits the decision boundaries among classes.
Several approaches have been suggested and studied to
tackle the drawbacks mentioned above, for instance, weighting
Manuscript received June 10, 2010; revised September 3, 2010; accepted
October 9, 2010. This work was supported by TIN2008-06681-C06-01.
I. Triguero and F. Herrera are with the CITIC-UGR, Department of
Computer Science and Artificial Intelligence, University of Granada, Granada
18071, Spain (e-mail: triguero@decsai.ugr.es; herrera@decsai.ugr.es).
S. García is with the Department of Computer Science, University of Jaén,
Jaén 23071, Spain (e-mail: sglopez@ujaen.es).
Digital Object Identifier 10.1109/TNN.2010.2087415
schemes [11], [12] have been widely used to improve the
results of the NN classifier.
A successful technique that simultaneously tackles the computational complexity, storage requirements, and sensitivity to
noise of NN is based on data reduction. These techniques
aim to obtain a representative training set with a smaller size
compared to the original one and with similar or even higher
classification accuracy for new incoming data. Apart from
feature selection [13], data reduction can be divided into two
different approaches, known as prototype selection [14], [15]
and prototype generation (PG) or abstraction [16], [17]. The
former process consists of choosing a subset of the original
training data, while PG can also build new artificial prototypes
to better adjust the decision boundaries between classes in NN
classification.
In the specialized literature, a great number of PG techniques have been proposed. Since the first approach, PNN, based on merging prototypes [18], and divide-and-conquer-based schemes [19], many other proposals of PG were
considered, for instance, Mixt_Gauss [20], ICPL [17], and
RSP [21].
Positioning adjustment of prototypes is another perspective
within the PG methodology. It aims to correct the position of a
subset of prototypes from the initial set by using an optimization procedure. Many proposals belong to this family, such
as learning vector quantization (LVQ) [22] and its successive
improvements [23], [24], genetic algorithms [25], and particle
swarm optimization (PSO) [26], [27].
Many existing positioning adjustment techniques start with an initial set of prototypes and try to improve
the classification accuracy by adjusting it. Two schemes of
initialization are commonly used (see the sketch after this list).
1) The number of representative instances for each class is
proportional to their number in the input data.
2) All the classes are represented by the same number of
prototypes.
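A minimal sketch of these two initialization schemes (function and parameter names are illustrative and not taken from any cited method):

```python
import numpy as np

def initial_prototypes(X, y, total, scheme="proportional", seed=0):
    """Draw the initial prototype set: 'proportional' allocates per-class
    counts proportional to class frequency; 'uniform' gives every class
    the same number of prototypes."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    GS_X, GS_y = [], []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if scheme == "proportional":
            k = max(1, round(total * len(idx) / len(y)))
        else:  # 'uniform'
            k = max(1, total // len(classes))
        chosen = rng.choice(idx, size=min(k, len(idx)), replace=False)
        GS_X.append(X[chosen])
        GS_y.append(y[chosen])
    return np.concatenate(GS_X), np.concatenate(GS_y)
```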
This initialization process is their main drawback due to the
fact that this parameter can be very dependent on the problem
tackled. Some PG approaches [23], [25] compute the number
of prototypes to be retained automatically, but in complex domains they need to retain many prototypes. We
propose a novel procedure to automatically find the smallest
reduced set that achieves suitable classification accuracy over
different types of problems. This method follows an iterative
prototype adjustment scheme with an incremental approach.
At each step, an optimization procedure is used to adjust the
position of the prototypes, and the method adds new prototypes
if needed. As a second contribution of this brief, we will
adopt the differential evolution (DE) [28], [29] technique as
optimizer. Our proposal will be denoted “iterative prototype
adjustment based on differential evolution” (IPADE).
In experiments on 50 real-world benchmark datasets, the
classification accuracy and reduction rate of our approach are
investigated and its performance is compared with classical
and recent PG models.
The rest of this brief is organized as follows. Section II
describes the background of PG and DE. Section III explains
the proposed algorithm IPADE. Section IV discusses the
experimental framework and presents the analysis of results.
Finally, in Section V we summarize our conclusions.
II. BACKGROUND
This section covers the background information necessary
to define and describe our proposal. Section II-A presents the
background on PG. Section II-B shows the main characteristics
of DE.
A. PG
PG is an important technique in data reduction. It has
been widely applied to instance-based classifiers and can be
defined as the application of instance construction algorithms
over a dataset to improve the classification accuracy of an NN
classifier.
More specifically, PG can be defined as follows. Let $x_p$ be an instance where $x_p = (x_{p1}, x_{p2}, \ldots, x_{pm}, x_{p\omega})$, with $x_p$ belonging to a class $\omega$ of the possible classes given by $x_{p\omega}$ and an $m$-dimensional space in which $x_{pi}$ is the value of the $i$th feature of the $p$th sample. Furthermore, let $x_t$ be an instance where $x_t = (x_{t1}, x_{t2}, \ldots, x_{tm}, x_{t\psi})$, with $x_t$ belonging to a class $\psi$, which is unknown, of the possible classes. Then, let us assume that there is a training set $TR$ which consists of $n$ instances $x_p$ and a test set $TS$ composed of $s$ instances $x_t$. The purpose of PG is to obtain a prototype generated set $GS$, which consists of $r$, $r < n$, prototypes $p_u$, where $p_u = (p_{u1}, p_{u2}, \ldots, p_{um}, p_{u\omega})$, which are generated from the examples of $TR$. The prototypes of the generated set are determined to represent efficiently the distributions of the classes and to discriminate well when used to classify the training objects. Their cardinality should be sufficiently small to reduce both the storage and evaluation time spent by an NN classifier.
The PG approaches can be divided into several families
depending on the main heuristic operation followed. The first
approach that we can find in the literature, called PNN [18],
belongs to the family of methods that carry out a merging
of prototypes of the same class in successive iterations,
generating centroids. Other well-known methods are those
based on a divide-and-conquer scheme, by separating the
m-dimensional space into two or more subspaces with the
purpose of simplifying the problem at each step [19]. Recent
advances that follow a similar operation include Mixt_Gauss
[20], which is an adaptive PG algorithm considered in the
framework of mixture modeling by Gaussian distributions
while assuming a statistical independence of features, and the
RSP3 technique [21] which tries to avoid drastic changes in
the form of decision boundaries associated with TR, which is
the main shortcoming observed in the classical approach [19].
One of the most important families of methods is based on
adjusting the position of the prototypes that can be viewed
as an optimization process. The main algorithm belonging
to this family is LVQ [22]. LVQ can be understood as an artificial neural network in which a neuron corresponds to a prototype, and a weight-based competition is carried out in order to locate each neuron at a concrete place of the $m$-dimensional space so as to increase the classification accuracy.
The third version of this algorithm, LVQ3, reported the best
results. Several approaches have been proposed that modify the
basic LVQ, for instance LVQPRU [23] which extends LVQ
by using a pruning step to remove noisy instances, or the
HYB algorithm [24] that constitutes a hybridization of several
prototype reduction techniques. Specifically, HYB combines
support vector machines (SVMs) with LVQ3 and executes
a search in order to find the most promising parameters of
LVQ3.
As a positioning adjustment of prototypes technique, a
genetic algorithm called ENPC was proposed for PG in
[25]. This algorithm executes different operators in order
to find the most suitable position of the prototypes. PSO
was proposed for PG in [26], [27], and they also belong to
the positioning adjustment of prototypes category of methods. The main difference between them is the type of codification of the particles. The PSO approach proposed in
[26] codifies a complete solution GS per particle. However,
AMPSO [27] encodes each prototype of GS in a single
particle. AMPSO has been shown to be more effective than
PSO [26].
B. DE
DE follows the general procedure of an evolutionary algorithm. It starts with a population of NP candidate solutions, the so-called individuals. The generations in DE are denoted by $G = 0, 1, \ldots, G_{max}$. It is usual to denote each individual as a $D$-dimensional vector $X_{i,G} = \{x_{i,G}^1, \ldots, x_{i,G}^D\}$, called a "target vector".
After initialization, DE applies the mutation operator to generate a mutant vector $V_{i,G}$, with respect to each individual $X_{i,G}$, in the current population. For each target $X_{i,G}$ at generation $G$, its associated mutant vector is $V_{i,G} = \{V_{i,G}^1, \ldots, V_{i,G}^D\}$. The method of creating this mutant vector is that which differentiates one DE scheme from another. We focus on DE/Rand/1, which generates the mutant vector as follows:

$V_{i,G} = X_{r_1,G} + F \cdot (X_{r_2,G} - X_{r_3,G})$.   (1)
After the mutation phase, the crossover operation is applied to each pair of the target vector $X_{i,G}$ and its corresponding mutant vector $V_{i,G}$ to generate a new trial vector, which we denote $U_{i,G}$. There are three kinds of crossover operators, known as "binomial," "exponential," and "arithmetic." Specifically, we will focus on the well-known DE/CurrentToRand/1 strategy [30], which generates the trial vector $U_{i,G}$ by linearly combining the target vector $X_{i,G}$ and the corresponding mutant vector $V_{i,G}$ as follows:

$U_{i,G} = X_{i,G} + K \cdot (V_{i,G} - X_{i,G})$.   (2)
Now, incorporating (1) in (2) and simplifying, we obtain

$U_{i,G} = X_{i,G} + K \cdot (X_{r_1,G} - X_{i,G}) + F \cdot (X_{r_2,G} - X_{r_3,G})$.   (3)
The indices $r_1^i$, $r_2^i$, and $r_3^i$ are mutually exclusive integers randomly generated within the range $[1, NP]$, which are also different from the base index $i$. The scaling factor $F$ is a positive control parameter for scaling the difference vectors. $K$ is a random number from $[0, 1]$.
When the trial vector has been generated, we must decide
which individual between X i,G and Ui,G should survive in the
population of the next generation G + 1. If the new trial vector
yields an equal or better solution than the target vector, it
replaces the corresponding target vector in the next generation,
otherwise the target is retained in the population.
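To make these operators concrete, here is a minimal sketch of trial-vector generation per Eq. (3) together with the one-to-one selection just described. It assumes the population is an (NP, D) NumPy array and that the fitness function is to be maximized; the function names and the fixed F value are illustrative, not from the published implementation.

import numpy as np

rng = np.random.default_rng(0)

def trial_vector(pop, i, F=0.5):
    # Eq. (3): DE/CurrentToRand/1. r1, r2, r3 are distinct and differ from i;
    # K is drawn uniformly from [0, 1].
    candidates = [j for j in range(len(pop)) if j != i]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    K = rng.random()
    return pop[i] + K * (pop[r1] - pop[i]) + F * (pop[r2] - pop[r3])

def select(target, trial, fitness):
    # One-to-one survival: the trial replaces the target if equal or better.
    return trial if fitness(trial) >= fitness(target) else target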
The success of the DE algorithm in solving a specific problem crucially depends on the appropriate choice of its associated control parameter values, which determine the convergence speed. Hence, a fixed selection of these parameters can
produce a slow and/or premature convergence depending on
the problem. Thus, researchers have investigated the parameter
adaptation mechanisms to improve the performance of the
basic DE algorithm. One of the most successful adaptive DE
algorithms is SFLSDE [31]. It uses two local search algorithms
in the scale factor space to find the appropriate parameters for
a given X i,G .
III. IPADE
In this section, we present and describe the IPADE approach
in depth. IPADE follows an iterative scheme in which it determines the most appropriate number of prototypes per class and
their best positioning. Concretely, IPADE is divided into three
different stages: initialization (Section III-A), optimization
(Section III-B), and addition of prototypes (Section III-C).
Fig. 1 shows the pseudocode of the model proposed. In
the following, we describe the most significant instructions,
enumerated from 1 to 26.
A. Initialization
A random selection (stratified or not) of examples from TR may not be the most adequate procedure to initialize the GS. Instead, IPADE iteratively learns prototypes in order to find the most appropriate structure of GS. Instruction 1 generates the initial solution GS. In this step, GS must represent each class with one prototype and should cover the entire search space as much as possible. For this reason, each class distribution is represented by its respective centroid. This initialization was satisfactorily used by the approaches proposed in [16] and [20]. The centroid of a class does not completely cover the region of that class, however, and does not avoid misclassifications. Thus, instruction 2 applies the first optimization stage using the initial GS composed of one centroid per class. The optimization stage must modify the prototypes of GS using the idea of movement in the $m$-dimensional space, that is, adding or subtracting some quantities to the attribute values of the prototypes. It is important to point out that we normalize all attributes of the dataset to the [0, 1] range.
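A minimal sketch of instruction 1 under these assumptions (attributes already normalized to [0, 1], data and labels as NumPy arrays); the function name is ours.

import numpy as np

def initial_GS(X, y):
    # One prototype per class: the centroid of the class distribution.
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return centroids, classes  # attribute vectors plus their fixed class labels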
1:  GS = Initialization(TR)
2:  DE_Optimization(GS, TR)
3:  AccuracyGlobal = Evaluate(GS, TR)
4:  registerClass[1..numClasses] = optimizable
5:  while AccuracyGlobal ≠ 1.0 and at least one class is optimizable do
6:      lessAccuracy = ∞
7:      for i = 1 to numClasses do
8:          if registerClass[i] == optimizable then
9:              AccuracyClass[i] = Evaluate(GS, examples of class i in TR)
10:             if AccuracyClass[i] < lessAccuracy then
11:                 lessAccuracy = AccuracyClass[i]
12:                 targetClass = i
13:             end if
14:         end if
15:     end for
16:     GStest = GS ∪ RandomExampleForClass(TR, targetClass)
17:     DE_Optimization(GStest, TR)
18:     accuracyTest = Evaluate(GStest, TR)
19:     if accuracyTest > AccuracyGlobal then
20:         AccuracyGlobal = accuracyTest
21:         GS = GStest
22:     else
23:         registerClass[targetClass] = non-optimizable
24:     end if
25: end while
26: return GS

Fig. 1. IPADE algorithm—basic structure.
B. DE Optimization for IPADE
In this section, we explain the proposal to apply the underlying idea of the DE algorithm to the PG problem as a position
adjusting of prototypes scheme.
First of all, it is necessary to define the solution codification.
In the proposed DE algorithm, each individual in the population encodes a single prototype without the class label and, as
such, the dimension of the individuals is equal to the number
of attributes of the specific problem. An individual classifies
an example of TR when it is the closest particle (in terms of
Euclidean distance) to that example.
The DE algorithm uses each prototype $p_u$ of GS, provided by the IPADE algorithm, as an initial population. Next, mutation and crossover operators guide the optimization of the positioning of each $p_u$ in the $m$-dimensional space. It is important to point out that these operators only produce modifications in the attributes of the prototypes of GS; hence, the class value remains unchangeable throughout the evolutionary cycle. We will focus on the well-known DE/CurrentToRand/1 strategy [30] to generate the trial prototypes $p_u'$ because it reportedly has the best behavior. It can be viewed as

$p_u' = p_u + K \cdot (p_{r_1} - p_u) + F \cdot (p_{r_2} - p_{r_3})$.   (4)
The examples $p_{r_1}$, $p_{r_2}$, and $p_{r_3}$ are randomly extracted from TR and belong to the same class as $p_u$. In the hypothetical case that TR does not contain enough prototypes of the class of $p_u$, i.e., there are not at least three prototypes of this class in TR, we artificially generate the necessary number of new prototypes $p_{r_j}$, $1 \le j \le 3$, with the same class label as $p_u$, using small random perturbations such as $p_{r_j} = (p_{u1} + rand[-0.1, 0.1], p_{u2} + rand[-0.1, 0.1], \ldots, p_{um} + rand[-0.1, 0.1], p_{u\omega})$.
After applying this operator, we check whether any values have fallen out of the range [0, 1]. If a computed value is greater than 1, we truncate it to 1, and if it is lower than 0, we set it to 0.
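Putting Eq. (4), the donor fallback, and the truncation together, a sketch for moving a single prototype might look as follows; same_class_pool, the fixed F, and the helper name are our illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def move_prototype(p_u, same_class_pool, F=0.5):
    # same_class_pool: attribute vectors of the TR examples sharing p_u's class.
    pool = [np.asarray(q, dtype=float) for q in same_class_pool]
    while len(pool) < 3:
        # Not enough same-class examples: perturb p_u with rand[-0.1, 0.1].
        pool.append(p_u + rng.uniform(-0.1, 0.1, size=p_u.shape))
    r1, r2, r3 = rng.choice(len(pool), size=3, replace=False)
    trial = p_u + rng.random() * (pool[r1] - p_u) + F * (pool[r2] - pool[r3])
    return np.clip(trial, 0.0, 1.0)  # truncate any value leaving [0, 1]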
After the mutation process over all the prototypes of GS, we obtain a trial solution GS', constituted by the trial prototypes $p_u'$. The selection operator decides which solution, GS or GS', should survive for the next iteration. The 1NN rule guides this operator to obtain the corresponding fitness value. We try to maximize this value, so the selection operator can be viewed as follows:

$GS = \begin{cases} GS' & \text{if } accuracy(GS') \ge accuracy(GS) \\ GS & \text{otherwise.} \end{cases}$   (5)
In order to guarantee a high-quality solution, we use the
ideas established in [31] to obtain a self-adaptive algorithm.
Instruction 3 evaluates the accuracy of the initial solution, measured by classifying the examples of TR with the prototypes
of GS by using the NN rule.
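A minimal sketch of the Evaluate(GS, TR) fitness used throughout the pseudocode, under the assumption that GS and TR fit in memory as NumPy arrays; the vectorized distance computation is ours, not the published implementation.

import numpy as np

def evaluate(GS, GS_labels, X, y):
    # Accuracy of the 1NN rule over TR: each example takes the class of its
    # nearest prototype in GS (Euclidean distance).
    d = np.linalg.norm(X[:, None, :] - GS[None, :, :], axis=2)  # (n, r)
    predictions = GS_labels[np.argmin(d, axis=1)]
    return float(np.mean(predictions == y))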
C. Addition of Prototypes
After the first optimization process, IPADE enters an iterative loop (instructions 5–25) to determine which classes need more prototypes to faithfully represent their class distribution.
In order to do this, we need to define two types of classes. A class ω is said to be optimizable if it allows the addition of new prototypes to improve its local classification accuracy. The local accuracy of ω is computed by classifying the examples of TR whose class is ω with the prototypes kept in GS (using the NN rule). The target class will be the optimizable class with the least registered accuracy. In instructions 7–15, the algorithm identifies the target class of each iteration. Initially, all classes start as optimizable (instruction 4).
In order to reduce the classification error of the target class, IPADE extracts a random example of this class from TR and adds it to the current GS in a new trial set GStest (instruction 16). This addition forces the re-positioning of the prototypes of GStest by again using the optimization process (instruction 17) and its corresponding evaluation of predictive accuracy (instruction 18).
After this process, we have to ensure that the new positioning of the prototypes of GStest, generated with the optimizer, has produced a successful improvement of the accuracy rate with respect to the previous GS. If the global accuracy of GStest is less than the accuracy of GS, IPADE does not add this prototype to GS and the class is registered as non-optimizable. Otherwise, GS = GStest.
The stopping criterion is satisfied when the accuracy rate is 1.0 or all the classes are registered as non-optimizable. The algorithm returns GS as the smallest reduced set that is able to classify TR appropriately.
IV. EXPERIMENTAL FRAMEWORK AND ANALYSIS OF RESULTS
This section presents the experimental framework (Section IV-A) and the comparative study between our proposal
and other PG techniques (Section IV-B).
A. Experimental Framework
In this section, we show the issues related to the experimental study. In order to compare the performance of the algorithms, we use four measures: accuracy [1], [32]; the reduction rate, measured as

$Reduction\ rate = 1 - size(GS)/size(TR)$;   (6)

Acc·Red, measured as accuracy · reduction rate; and execution time.¹
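For concreteness, Eq. (6) and the Acc·Red measure amount to the following; the numbers are hypothetical and only illustrate the arithmetic.

def reduction_rate(gs_size, tr_size):
    # Eq. (6): fraction of the training set removed by the PG method.
    return 1.0 - gs_size / tr_size

# Hypothetical example: 20 prototypes kept from 400 training instances.
red = reduction_rate(20, 400)   # 0.95
acc_red = 0.80 * red            # Acc.Red for an accuracy of 0.80 -> 0.76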
We use 50 datasets² from the KEEL dataset repository³ [33], [34]. These datasets contain between 100 and 20 000 instances, and the number of attributes ranges from 2 to 60. The datasets considered are partitioned using the 10-fold cross-validation (10-fcv) procedure.
Many different configurations are established by the authors of each paper for the different techniques. We focus this experimentation on the parameters recommended by the respective authors, assuming that these values were chosen optimally. However, we have carried out a preliminary study for each method on the number of iterations performed, testing 300, 500, and 1000 iterations on all the datasets. This parameter can be very sensitive to the problem tackled: an excessive number of iterations may produce overfitting on some problems, and a lower number of iterations may not be enough to tackle other datasets. For this reason, we present the results of the best performing number of iterations for each method and dataset. The complete set of results can be found on the associated web site (http://sci2s.ugr.es/ipade/). The configuration parameters of IPADE and the methods used in the comparison are shown in Table I. In this table, the values of the parameters Fl, Fu, iterSFGSS, and iterSFHC of the IPADE algorithm are the recommended values established in [31]. Furthermore, Euclidean distance is used as the similarity function, and the stochastic methods have been run three times per partition.
Implementations of the algorithms can be found on the associated web site or in the KEEL software tool [33].
B. Analysis of Results
In this section, we analyze the results obtained. Specifically,
we check the performance of the IPADE model and seven
other PG techniques.
1 Reduction rate and execution time information can be found in the
web page.
2 Datasets: abalone, appendicitis, australian, balance, banana, bands, breast,
bupa, car, chess, cleveland, coil2000, contraceptive, crx, dermatology, ecoli,
flare-solar, german, glass, haberman, hayes-roth, heart, hepatitis, housevotes,
iris, led7digit, lymphography, magic, mammographic, marketing, monks,
newthyroid, page-blocks, pima, ring, saheart, satimage, segment, sonar, spectheart, splice, tae, thyroid, tic-tac-toe, titanic, twonorm, wine, wisconsin,
yeast, zoo.
3 Available at http://sci2s.ugr.es/keel/datasets.
TABLE I
PARAMETER SPECIFICATION FOR ALL THE METHODS EMPLOYED IN THE EXPERIMENTATION

Algorithm    Parameters
IPADE        Iterations of basic DE = 300/500/1000, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
RSP3         Subset choice = diameter
Mixt_Gauss   Reduction rate = 0.95
ENPC         Iterations = 300/500/1000
AMPSO        Iterations = 300/500/1000, C1 = 1.0, C2 = 1.0, C3 = 0.25, Vmax = 1, W = 0.1, X = 0.5, Pr = 0.1, Pd = 0.1
LVQPRU       Iterations = 300/500/1000, α = 0.1, WindowWidth = 0.5
HYB          Search_Iter = 300/500/1000, Optimal_Iter = 1000, α = 0.1, I = 0, F = 0.5, Initial_Window = 0, Final_Window = 0.5, δ = 0.1, δ_Window = 0.1, Initial Selection = SVM
LVQ3         Iterations = 300/500/1000, α = 0.1, WindowWidth = 0.2, ε = 0.1

Fig. 2. Accuracy results over 50 datasets. [Scatterplot: accuracy of IPADE (x-axis) against the accuracy of the second algorithm (y-axis), comparing IPADE with 1NN, RSP3, LVQ3, Mixt_Gauss, LVQPRU, HYB, AMPSO and ENPC; the y = x line is drawn for reference.]

Fig. 3. Acc·Red results over 50 datasets. [Scatterplot: Accuracy·Reduction rate of IPADE against that of the second algorithm, for the same methods and with the same y = x reference line.]

Fig. 4. Map of convergence over three different datasets. [Line plot: accuracy against the iterations of the main loop (1 to 9) for the Cleveland, Thyroid and Car datasets; the reduction rate reached at each step is shown in brackets.]
In the scatterplot of Fig. 2, each point compares IPADE to
a second algorithm on a single dataset. The x-axis position of
the point is the accuracy of IPADE, and the y-axis position is
the accuracy of the comparison algorithm. Therefore, points
below the y = x line correspond to datasets for which IPADE
performs better than a second algorithm.
In order to test the reduction capabilities of PG methods
in comparison with IPADE, Fig. 3 shows at each point the
Acc·Red obtained on a single dataset.
Fig. 4 shows a graphical representation of the convergence of the IPADE model over three different datasets. The graph shows a line representing the accuracy rate at each step and its corresponding reduction rate (in brackets). The x-axis represents the number of iterations of the main loop of IPADE, and the y-axis represents the accuracy rate currently achieved.
Tables II and III present the statistical analysis conducted by
nonparametric multiple comparison procedures for accuracy
and Acc·Red, respectively. More specifically, we have used
the Friedman aligned (FA) procedure [35], [36] to compute the
set of rankings that represent the effectiveness associated with
each algorithm (second column). Both tables are ordered from
the best to the worst ranking. In addition, the third column
shows the adjusted p-value obtained with Holm's test (HAPV) [35]. Note that IPADE is established as the control algorithm
because it has obtained the best FA ranking. By using a level
of significance α = 0.01, IPADE is significantly better than the
rest of the methods, considering both accuracy and Acc·Red
measures. More information about these tests and other statistical procedures can be found at http://sci2s.ugr.es/sicidm/.
For the sake of simplicity, we only include the graphical
and statistical results achieved, whereas the complete results
can be found at the web page associated with this brief.
Looking at Tables II and III and Figs. 2–4, we make the following observations.
1) Fig. 2 shows that the proposed IPADE outperforms, on
average, the rest of the PG techniques with the parameter
setting established. The most competitive algorithms against IPADE, in terms of the accuracy measure, are the LVQ3 and LVQPRU algorithms. In this figure, most of the
LVQ3 and LVQPRU points are close to the y = x
line. However, the statistical test confirms that IPADE
significantly outperforms these methods.
TABLE II
AVERAGE RANKINGS OF THE ALGORITHMS (FA + HAPV) FOR THE ACCURACY MEASURE

Algorithm     Accuracy FA   Accuracy HAPV
IPADE         109.63        —
LVQPRU        199.66        6.4064×10−4
LVQ3          203.22        6.4064×10−4
RSP3          231.80        7.9160×10−6
1NN           236.53        4.2657×10−6
AMPSO         248.11        3.4239×10−6
ENPC          259.13        5.0703×10−7
HYB           268.14        5.4223×10−8
Mixt_Gauss    272.02        5.6781×10−9
TABLE III
AVERAGE RANKINGS OF THE ALGORITHMS (FA + HAPV) FOR THE Acc·Red MEASURE

Algorithm     Acc·Red FA   Acc·Red HAPV
IPADE         53.83        —
LVQ3          125.92       0.0055
Mixt_Gauss    169.18       2.2324×10−5
LVQPRU        170.38       2.2324×10−5
AMPSO         182.54       2.9965×10−6
ENPC          267.91       9.3280×10−16
RSP3          275.32       9.9684×10−17
HYB           362.57       1.5321×10−31
1NN           421.83       1.5321×10−44
2) The tradeoff between accuracy and reduction rate is
an important factor because the efficiency of the NN
classifier depends on the resulting number of prototypes
of the GS. Fig. 3 shows that achieving this balance
between accuracy and reduction rate is a difficult task.
IPADE is the best performing method considering the
balance between accuracy and reduction rates. In Fig. 3,
there are more points under the y = x line in comparison
with Fig. 2. Furthermore, Table III also supports this
statement, showing smaller p-values when the reduction
rate is considered.
3) Observing the map of convergence in Fig. 4, we can highlight the DE algorithm as a promising optimizer because it is able to reach highly accurate results very quickly. This implies that the IPADE scheme needs only a small number of iterations.
V. C ONCLUSION
In this brief, we have presented a new data reduction
technique called IPADE which iteratively learns the most
adequate number of prototypes per class and their respective
positioning for the NN algorithm, acting as a PG method. This
technique uses a real parameter optimization procedure based
on DE in order to adjust the positioning of the prototypes at
each step. The large experimental study performed allowed
us to show that IPADE is a suitable method for PG in NN
classification. Furthermore, due to the fact that IPADE is a
heuristic optimization approach, as future work this technique
could be used for building an ensemble of classifiers.
REFERENCES
[1] E. Alpaydin, Introduction to Machine Learning, 2nd ed. Cambridge,
MA: MIT Press, 2010.
[2] I. Kononenko and M. Kukar, Machine Learning and Data Mining:
Introduction to Principles and Algorithms. Chichester, U.K.: Horwood
Publishing Ltd., 2007.
[3] T. M. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[4] E. K. García, S. Feldman, M. R. Gupta, and S. Srivastava, “Completely
lazy learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 9, pp. 1274–
1285, Sep. 2010.
[5] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE
Trans. Inform. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[6] A. N. Papadopoulos and Y. Manolopoulos, Nearest Neighbor Search: A
Database Perspective. New York: Springer-Verlag, 2004.
[7] X. Wu and V. Kumar, The Top Ten Algorithms in Data Mining. London,
U.K.: Chapman & Hall, 2009.
[8] D. R. Wilson and T. R. Martinez, “Improved heterogeneous distance
functions,” J. Artif. Intell. Res., vol. 6, no. 1, pp. 1–34, Jan. 1997.
[9] F. Fernández and P. Isasi, “Local feature weighting in nearest prototype
classification,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 40–53,
Jan. 2008.
[10] N. García-Pedrajas, “Constructing ensembles of classifiers by means of
weighted instance selection,” IEEE Trans. Neural Netw., vol. 20, no. 2,
pp. 258–277, Feb. 2009.
[11] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning,” Artif. Intell. Rev., vol. 11, nos. 1–5, pp. 11–73, Feb. 1997.
[12] R. Parades and E. Vidal, “Learning prototypes and distances: A prototype reduction technique based on nearest neighbor error minimization,”
Pattern Recognit., vol. 39, no. 2, pp. 180–188, Feb. 2006.
[13] H. Liu and H. Motoda, Feature Extraction, Construction and Selection:
A Data Mining Perspective. Norwell, MA: Kluwer, 2001.
[14] H. Liu and H. Motoda, Instance Selection and Construction for Data
Mining. Norwell, MA: Kluwer, 2001.
[15] H. Fayed and A. Atiya, “A novel template reduction approach for the
k-nearest neighbor method,” IEEE Trans. Neural Netw., vol. 20, no. 5,
pp. 890–896, May 2009.
[16] H. A. Fayed, S. R. Hashem, and A. F. Atiya, “Self-generating prototypes
for pattern classification,” Pattern Recognit., vol. 40, no. 5, pp. 1498–
1509, May 2007.
[17] W. Lam, C.-K. Keung, and D. Liu, “Discovering useful concept prototypes for classification based on filtering and abstraction,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1075–1090,
Aug. 2002.
[18] C.-L. Chang, “Finding prototypes for nearest neighbor classifiers,” IEEE
Trans. Comput., vol. 23, no. 11, pp. 1179–1184, Nov. 1974.
[19] C. H. Chen and A. Jóźwik, “A sample set condensation algorithm for
the class sensitive artificial neural network,” Pattern Recognit. Lett.,
vol. 17, no. 8, pp. 819–823, Jul. 1996.
[20] M. Lozano, J. M. Sotoca, J. S. Sánchez, F. Pla, E. Pekalska, and R.
P. W. Duin, “Experimental study on prototype optimisation algorithms
for prototype-based classification in vector spaces,” Pattern Recognit.,
vol. 39, no. 10, pp. 1827–1838, Oct. 2006.
[21] J. S. Sánchez, “High training set size reduction by space partitioning and
prototype abstraction,” Pattern Recognit., vol. 37, no. 7, pp. 1561–1564,
2004.
[22] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[23] J. Li, M. T. Manry, C. Yu, and D. R. Wilson, “Prototype classifier design
with pruning,” Int. J. Artif. Intell. Tools, vol. 14, nos. 1–2, pp. 261–280,
2005.
[24] S.-W. Kim and B. J. Oommen, “Enhancing prototype reduction schemes
with LVQ3-type algorithms,” Pattern Recognit., vol. 36, no. 5, pp. 1083–
1093, May 2003.
[25] F. Fernández and P. Isasi, “Evolutionary design of nearest prototype
classifiers,” J. Heuristics, vol. 10, no. 4, pp. 431–454, Jul. 2004.
[26] L. Nanni and A. Lumini, “Particle swarm optimization for prototype reduction,” Neurocomputing, vol. 72, nos. 4–6, pp. 1092–1097,
Jan. 2009.
[27] A. Cervantes, I. M. Galván, and P. Isasi, “AMPSO: A new particle
swarm method for nearest neighborhood classification,” IEEE Trans.
Syst., Man, Cybern., Part B: Cybern., vol. 39, no. 5, pp. 1082–1091,
Oct. 2009.
[28] R. Storn and K. Price, “Differential evolution – A simple and efficient
heuristic for global optimization over continuous spaces,” J. Global
Optim., vol. 11, no. 4, pp. 341–359, Dec. 1997.
[29] K. V. Price, R. M. Storn, and J. A. Lampinen, Differential Evolution: A
Practical Approach to Global Optimization (Natural Computing Series),
G. Rozenberg, T. Bäck, A. E. Eiben, J. N. Kok, and H. P. Spaink, Eds.
New York: Springer-Verlag, 2005.
[30] K. V. Price, An Introduction to Differential Evolution. London, U.K.:
McGraw-Hill, 1999.
[31] F. Neri and V. Tirronen, “Scale factor local search in differential
evolution,” Memetic Comput., vol. 1, no. 2, pp. 153–171, 2009.
[32] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques, 2nd ed. San Mateo, CA: Morgan Kaufmann, 2005.
[33] J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesus, S. Ventura, J.
M. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, J. C. Fernández, and F. Herrera, “KEEL: A software tool to assess evolutionary
algorithms for data mining problems,” Soft Comput., vol. 13, no. 3,
pp. 307–318, Oct. 2009.
[34] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L.
Sánchez, and F. Herrera, “KEEL data-mining software tool: Data
set repository, integration of algorithms and experimental analysis
framework,” J. Multiple-Valued Logic Soft Comput., 2010, to be
published.
[35] S. García, A. Fernández, J. Luengo, and F. Herrera, “Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental
analysis of power,” Inform. Sci., vol. 180, no. 10, pp. 2044–2064,
May 2010.
[36] S. García, A. Fernández, J. Luengo, and F. Herrera, “A study of statistical techniques and performance measures for genetics-based machine
learning: Accuracy and interpretability,” Soft Comput., vol. 13, no. 10,
pp. 959–977, Apr. 2009.
1. Prototype generation for supervised classification

1.3 Differential Evolution for Optimizing the Positioning of Prototypes in Nearest Neighbor Classification
• I. Triguero, S. Garcı́a, F. Herrera, Differential Evolution for Optimizing the Positioning of
Prototypes in Nearest Neighbor Classification. Pattern Recognition 44 (4) (2011) 901-916,
doi: 10.1016/j.patcog.2010.10.020.
– Status: Published.
– Impact Factor (JCR 2011): 2.292
– Subject Category: Computer Science, Artificial Intelligence. Ranking 18 / 111 (Q1).
– Subject Category: Engineering, Electrical & Electronic. Ranking 35 / 245 (Q1).
Pattern Recognition 44 (2011) 901–916

Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification

Isaac Triguero a, Salvador García b, Francisco Herrera a
a Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
b Department of Computer Science, University of Jaén, 23071 Jaén, Spain
ARTICLE INFO

Article history:
Received 2 June 2010
Received in revised form 6 September 2010
Accepted 24 October 2010

Keywords: Differential evolution; Prototype generation; Prototype selection; Evolutionary algorithms; Classification

ABSTRACT

Nearest neighbor classification is one of the most used and well known methods in data mining. Its simplest version has several drawbacks, such as low efficiency, high storage requirements and sensitivity to noise. Data reduction techniques have been used to alleviate these shortcomings. Among them, prototype selection and generation techniques have been shown to be very effective. Positioning adjustment of prototypes is a successful trend within the prototype generation methodology. Evolutionary algorithms are adaptive methods based on natural evolution that may be used for searching and optimization. Positioning adjustment of prototypes can be viewed as an optimization problem, thus it can be solved using evolutionary algorithms. This paper proposes a differential evolution based approach for optimizing the positioning of prototypes. Specifically, we provide a complete study of the performance of four recent advances in differential evolution. Furthermore, we show the good synergy obtained by the combination of a prototype selection stage with an optimization of the positioning of prototypes previous to nearest neighbor classification. The results are contrasted with non-parametrical statistical tests and show that our proposals outperform previously proposed methods.

© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
The nearest neighbor (NN) algorithm [1] and its derivatives have
been shown to perform well for classification problems in many
domains [2,3]. These algorithms are also known as instance-based
learning [4] and belong to the lazy learning family of methods [5].
The extended version of NN to k neighbors is considered one of the
most influential data mining algorithms [6] and it has attracted
much attention and research efforts in recent years [7–10]. The NN
classifier requires that all of the data instances are stored and
unseen cases classified by finding the class labels of the closest
instances to them. In order to determine how close two instances
are, several distances or similarity measures have been proposed
[11–13] and this issue is continually under review [14,15]. Despite
its simplicity and high classification accuracy, it suffers from
several drawbacks such as high computational cost, high storage
requirement and sensitivity to noise.
Data reduction processes are very useful in data mining to
improve and simplify the models extracted by the algorithms [16].
In NN, apart from feature selection [17,18], two main data
reduction techniques have been used with promising results:
prototype selection (PS) and prototype generation (PG) [19,20].
The former is limited to selecting a subset of instances from the
original training set. Typically, three types of PS methods are
known: condensation [21], edition [22] and hybrid methods [23].
Condensation methods try to remove examples which are redundant or irrelevant, meaning that these examples do not offer any capabilities in the classification task. However, edition methods focus on removing noisy examples, which are those examples
that induce classification errors. Finally, hybrid methods combine
both approaches.
In the specialized literature, a wide number of PS techniques
have been proposed. Since the first approaches for data condensation and edition, CNN [21] and ENN [22], many other proposals of
PS have become well-known in this field. For example, IBL methods
[4], DROP family methods [19] and ICF [23]. Recent approaches to
PS are introduced in [24–26].
Regarding PG methods, also known as prototype abstraction
methods [27], they are not only able to select data, but can also
modify them, allowing interpolations, movements of instances and
artificial generation of new data. Well known methods for PG are
PNN [28], learning vector quantization (LVQ) [29], Chen's algorithm
[30], ICPL [27], HYB [31] and MixtGauss [32]. A good study of PS and
PG can be found in [33].
Evolutionary algorithms (EAs) [34] have been successfully used
in different data mining problems [35,36]. Given that PS and PG
problems could be seen as combinatorial and optimization problems, EAs have been used to solve them with excellent results
[37–40]. PS can be expressed as a binary space search problem and,
as far as we know, the best evolutionary model proposed for PS is
based on memetic algorithms [41] and is called SSMA [38]. PG is
expressed as a continuous space search problem. EAs for PG are
based on the positioning adjustment of prototypes, which is a
suitable method to optimize the position of prototypes, however, it
usually depends upon an initial subset of prototypes extracted from
the training set. Several proposals are presented on this topic, such
as ENPC [42] or PSCSA [43].
Particle swarm optimization (PSO) [44,45] and differential
evolution (DE) [46,47] are two effective evolutionary optimization
techniques for continuous spaces. In fact, PSO has been satisfactorily used for prototype adjustment [39,40]. The first attempts at
using DE for PG can be found in [48]. In that contribution, we carried out a preliminary study on the use of DE, concluding that the classic DE scheme offers competitive results compared to other PG approaches on small size data sets.
The first contribution of this paper is the use of the DE algorithm
[47] for prototype adjustment. The specialized literature on DE
collects several advanced schemes: SADE [49], OBDE [50], DEGL
[51], JADE [52] and SFLSDE [53]. We will study the mentioned
proposals for PG, except OBDE. This last one, as the authors state,
may not be used for problems where basic knowledge is available.
It constantly seeks the opposite solution to the one evaluated in the
search process and, in PG, this behavior does not make sense.
The remaining proposals will be compared with evolutionary and
non-evolutionary PG algorithms and we will analyze the behavior
of each DE algorithm in this problem.
It is common to use classical PS methods in pre- or late stages of
a PG algorithm as mechanisms for removing noisy or redundant
prototypes. For example, some PG methods implement ENN or
DROP algorithms as early filtering processes [27,54] and, in [31], a
hybridization method based on LVQ3 post-processing of conventional prototype reduction approaches is proposed.
The second contribution of this paper follows a similar idea to
that presented in [31], but it is extended so that PS methods
can be hybridized with any positioning adjustment of prototype
method. Specifically, we study the use of LVQ3, PSO and DE
algorithms for the optimization positioning of prototypes after a
PS process. We will see that LVQ3 does not produce optimal
positioning in most cases, whereas PSO and DE result in
excellent accuracy rates in comparison with isolated PS methods
and PG methods. We especially emphasize the use of SSMA in
combination with one of the two mentioned optimization
approaches to also achieve high reduction rates in the final set
of prototypes obtained.
As we have stated before, the use of DE algorithms for the PG problem motivates the global purpose of this paper, which can be divided into three objectives:

• To make an empirical study for analyzing the DE algorithms for the PG problem in terms of accuracy and reduction capabilities. Our goal is to identify the best DE methods and stress the relevant properties of each one when they tackle the PG problem.

• To understand how positioning adjustment techniques can improve the classification accuracy of PS and PG methods with the use of hybridization models.

• To check the behavior and scaling-up capabilities of DE approaches and hybrid approaches for PG when tackling large size data sets.
The experimental study will include a statistical analysis based
on non-parametric tests and we will conduct experiments involving a total of 56 small and large size data sets.
In order to organize this paper, Section 2 describes the background of PS, PG and DE. Section 3 explains the DE algorithm
proposed for tackling the position adjustment problem. Section 4
presents the framework of the hybridization proposed. Section 5
discusses the experimental framework and Section 6 presents the
analysis of results. Finally, in Section 7 we summarize our
conclusions.
2. Background
This section covers the background information necessary to
define and describe our proposals. Section 2.1 presents the background on PS and PG. Next, Section 2.2 shows the main characteristics of DE and the most recent advances proposed in the literature
are presented in Section 2.3.
2.1. PS and PG algorithms
This section presents the definition and notation for both PS and
PG problems.
A formal specification of the PS problem is the following: Let $x_p$ be an example where $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, \omega)$, with $x_p$ belonging to a class $\omega$ given by $x_{p\omega}$ and a $D$-dimensional space in which $x_{pi}$ is the value of the $i$-th feature of the $p$-th sample. Then, let us assume that there is a training set TR which consists of $n$ instances $x_p$ and a test set TS composed of $t$ instances $x_q$, with $\omega$ unknown. Let $SS \subseteq TR$ be the subset of selected samples resulting from the execution of a PS algorithm; then we classify a new pattern $x_q$ from TS by the NN rule acting over SS.
The purpose of PG is to obtain a prototype generated set GS, which consists of $r$, $r < n$, prototypes, which are either selected
or generated from the examples of TR. The prototypes of the
generated set are determined to represent efficiently the distributions of the classes and to discriminate well when used to classify
the training objects. Their cardinality should be sufficiently small to
reduce both the storage and evaluation time spent by an NN
classifier.
Both evolutionary and non-evolutionary approaches to PS and
PG will be analyzed in the experimental study. A brief description of
the methods compared will be detailed in Section 5.
2.2. Differential evolution
Differential evolution follows the general procedure of an EA. DE starts with a population of NP candidate solutions, the so-called individuals. The initial population should cover the entire search space as much as possible. In some problems, this is achieved by uniformly randomizing individuals, but in other problems, such as that considered in this paper, basic knowledge of the problem is available and the use of other initialization mechanisms is more effective. The subsequent generations in DE are denoted by $G = 0, 1, \ldots, G_{max}$.
It is usual to denote each individual as a $D$-dimensional vector $X_{i,G} = \{x_{i,G}^1, \ldots, x_{i,G}^D\}$, called a "target vector".
2.2.1. Mutation operation
After initialization, DE applies the mutation operator to generate a mutant vector $V_{i,G}$, with respect to each individual $X_{i,G}$, in the current population. For each target $X_{i,G}$ at generation $G$, its associated mutant vector is $V_{i,G} = \{V_{i,G}^1, \ldots, V_{i,G}^D\}$. The method of creating this mutant vector is that which differentiates one DE scheme from another. Six of the most frequently referenced strategies are listed below:

"DE/Rand/1": $V_{i,G} = X_{r_1^i,G} + F \cdot (X_{r_2^i,G} - X_{r_3^i,G})$   (1)

"DE/Best/1": $V_{i,G} = X_{best,G} + F \cdot (X_{r_1^i,G} - X_{r_2^i,G})$   (2)

"DE/RandToBest/1": $V_{i,G} = X_{i,G} + F \cdot (X_{best,G} - X_{i,G}) + F \cdot (X_{r_1^i,G} - X_{r_2^i,G})$   (3)

"DE/Best/2": $V_{i,G} = X_{best,G} + F \cdot (X_{r_1^i,G} - X_{r_2^i,G}) + F \cdot (X_{r_3^i,G} - X_{r_4^i,G})$   (4)

"DE/Rand/2": $V_{i,G} = X_{r_1^i,G} + F \cdot (X_{r_2^i,G} - X_{r_3^i,G}) + F \cdot (X_{r_4^i,G} - X_{r_5^i,G})$   (5)

"DE/RandToBest/2": $V_{i,G} = X_{i,G} + F \cdot (X_{best,G} - X_{i,G}) + F \cdot (X_{r_1^i,G} - X_{r_2^i,G}) + F \cdot (X_{r_3^i,G} - X_{r_4^i,G})$   (6)

The indices $r_1^i$, $r_2^i$, $r_3^i$, $r_4^i$, $r_5^i$ are mutually exclusive integers randomly generated within the range [1, NP], which are also different from the base index $i$. These indices are randomly generated once for each mutation. The scaling factor $F$ is a positive control parameter for scaling the difference vectors. $X_{best,G}$ is the best individual of the population in terms of fitness.

2.2.2. Crossover operator
After the mutation phase, a crossover operation is applied to increase the potential diversity of the population. The DE algorithm can use three kinds of crossover schemes, known as "Binomial", "Exponential" and "Arithmetic" crossovers. This operator is applied to each pair of the target vector $X_{i,G}$ and its corresponding mutant vector $V_{i,G}$ to generate a new trial vector that we denote $U_{i,G}$. The mutant vector exchanges its components with the target vector $X_{i,G}$.
We will focus on the binomial crossover scheme, which is performed on each component whenever a randomly picked number between 0 and 1 is less than or equal to the crossover rate (CR). The CR is a user-specified constant within the range [0, 1), which controls the fraction of parameter values copied from the mutant vector. This scheme may be outlined as

$U_{i,G}^j = \begin{cases} V_{i,G}^j & \text{if } rand(0,1) \le CR \text{ or } j = j_{rand} \\ X_{i,G}^j & \text{otherwise} \end{cases}, \quad j = 1, 2, \ldots, D$   (7)

where $rand(0,1) \in [0, 1]$ is a uniformly distributed random number, and $j_{rand} \in \{1, 2, \ldots, D\}$ is a randomly chosen index, which ensures that $U_{i,G}$ gets at least one component from $V_{i,G}$.
Finally, we describe the arithmetic crossover, which generates the trial vector $U_{i,G}$ as

$U_{i,G} = X_{i,G} + K \cdot (V_{i,G} - X_{i,G})$   (8)

where $K$ is the combination coefficient, which is usually used in the interval [0, 1]. This strategy is known as "DE/CurrentToRand/1".

2.2.3. Selection operator
When the trial vector has been generated, we must decide which individual between $X_{i,G}$ and $U_{i,G}$ should survive in the population of the next generation $G+1$. The selection operator is described as follows:

$X_{i,G+1} = \begin{cases} U_{i,G} & \text{if } f(U_{i,G}) \text{ is better than } f(X_{i,G}) \\ X_{i,G} & \text{otherwise} \end{cases}$   (9)

where $f()$ is the fitness function to be minimized. If the new trial vector yields a solution equal to or better than the target vector, it replaces the corresponding target vector in the next generation; otherwise the target is retained in the population. Therefore, the population always gets better or retains the same fitness values, but never deteriorates. This one-to-one selection procedure is generally kept fixed in most of the DE algorithms.

2.3. Advanced proposals for DE
The success of DE in solving a specific problem crucially depends on choosing the appropriate mutation strategy and its associated control parameter values (F and CR), which determine the convergence speed. Hence, a fixed selection of these parameters can produce slow and/or premature convergence depending on the problem. Thus, researchers have investigated parameter adaptation mechanisms to improve the performance of the basic DE algorithm.
Now, we describe four of the newest and best DE algorithms proposed in the literature.
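The strategies (1)–(6) and the binomial crossover (7) can be sketched compactly as below, assuming an (NP, D) NumPy population; this is an illustrative transcription of the formulas, not the implementation used in the experiments.

import numpy as np

rng = np.random.default_rng(0)

def mutant(pop, i, best, F=0.5, strategy="rand/1"):
    # Five mutually exclusive donor indices, all different from i (Eqs. (1)-(6)).
    r = rng.choice([j for j in range(len(pop)) if j != i], size=5, replace=False)
    X, Xb = pop, pop[best]
    formulas = {
        "rand/1":       X[r[0]] + F * (X[r[1]] - X[r[2]]),
        "best/1":       Xb + F * (X[r[0]] - X[r[1]]),
        "randtobest/1": X[i] + F * (Xb - X[i]) + F * (X[r[0]] - X[r[1]]),
        "best/2":       Xb + F * (X[r[0]] - X[r[1]]) + F * (X[r[2]] - X[r[3]]),
        "rand/2":       X[r[0]] + F * (X[r[1]] - X[r[2]]) + F * (X[r[3]] - X[r[4]]),
        "randtobest/2": X[i] + F * (Xb - X[i]) + F * (X[r[0]] - X[r[1]])
                             + F * (X[r[2]] - X[r[3]]),
    }
    return formulas[strategy]

def binomial_crossover(target, mutant_v, CR=0.9):
    # Eq. (7): take a component from the mutant when rand <= CR; the forced
    # j_rand position guarantees at least one mutant component survives.
    mask = rng.random(len(target)) <= CR
    mask[rng.integers(len(target))] = True
    return np.where(mask, mutant_v, target)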
2.3.1. Self-adaptive differential evolution (SADE)
SADE [49] was proposed by Qin et al. to alleviate the expensive trial-and-error search for the most adequate parameters and mutation strategy. They simultaneously implement four mutation strategies (Eqs. (1), (5), (6) and (8)), the so-called candidate pool. For each target vector $X_{i,G}$ in the current population, we have to decide which strategy is selected. Initially, the probability with respect to each strategy is 1/S, where S is the number of strategies. SADE adapts the probability of generating offspring by each strategy based on their success ratios in the past LP generations. Specifically, they introduce success and failure memories to store the number of $U_{i,G}$ that enter the next generation, and the number of discarded $U_{i,G}$.
In SADE, the mutation factors $F_i$ are independently generated at each generation according to a normal distribution N(0.5, 0.3). The proper choice of CR can lead to successful optimization, so they consider gradually adjusting the range of CR values according to the previous values. CR is adjusted by using a memory, which stores the CR values with respect to each strategy, and a normal distribution.
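A schematic sketch of SADE's bookkeeping follows: strategy probabilities start at 1/S and are refreshed from observed success ratios, while F_i is redrawn from N(0.5, 0.3). The exact update formula of [49] works over the success/failure memories of the last LP generations, so the simple ratio below is a stand-in, not the published rule.

import numpy as np

rng = np.random.default_rng(0)

strategies = ["rand/1", "rand/2", "randtobest/2", "currenttorand/1"]  # candidate pool
probs = np.full(len(strategies), 1.0 / len(strategies))  # initially 1/S each

def pick_strategy():
    return strategies[rng.choice(len(strategies), p=probs)]

def sample_F():
    return rng.normal(0.5, 0.3)  # F_i ~ N(0.5, 0.3), redrawn at each generation

def updated_probs(successes, failures, eps=0.01):
    # Stand-in for the success-ratio update over the past LP generations.
    ratios = (successes + eps) / (successes + failures + eps)
    return ratios / ratios.sum()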
2.3.2. Adaptive differential evolution with optional external archive (JADE)
JADE [52] is proposed by Zhang and Sanderson and is based on a new mutation strategy and parameter adaptation. The new strategy, called DE/RandToPBest with an optional archive, is created to resolve the premature convergence of greedy strategies such as DE/RandToBest/k¹ and DE/Best/k. The authors call $p$ the percentage (per unit) of individuals that are considered in the mutation strategy. A mutant vector with DE/RandToPBest/1 with archive is generated as follows:

$V_{i,G} = X_{i,G} + F_i \cdot (X_{best,G} - X_{i,G}) + F_i \cdot (X_{r1,G} - \tilde{X}_{r2,G})$   (10)

where $X_{i,G}$, $X_{r1,G}$ and $X_{best,G}$ are selected from P (the current population), while $\tilde{X}_{r2,G}$ is randomly chosen from the union $P \cup A$, where A is the archive of inferior solutions stored from recent explorations. The archive is initially empty. Then, after each generation, the solutions that fail in the selection process are stored in the archive. When the archive size exceeds a certain threshold, some solutions are randomly removed from the archive A. Furthermore, this algorithm proposes a parameter adaptation where the mutation factor $F_i$ of each individual is independently generated according to a Cauchy distribution [55,56] with location parameter $\mu_F$ and scale parameter 0.1, and then it is truncated to 1 if $F_i \ge 1$ or regenerated if $F_i \le 0$. At each generation, the crossover rate $CR_i$ is generated according to a normal distribution $N(\mu_{CR}, 0.1)$, and then truncated to [0, 1].

¹ It is also known as DE/CurrentToBest/1.
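The JADE parameter sampling just described can be sketched directly; mu_F and mu_CR are the adapted location parameters, and the Cauchy draw via the tangent transform is a standard trick, not anything specific to [52].

import numpy as np

rng = np.random.default_rng(0)

def jade_F(mu_F):
    # F_i ~ Cauchy(mu_F, 0.1): truncated to 1 if F_i >= 1, regenerated if F_i <= 0.
    while True:
        F = mu_F + 0.1 * np.tan(np.pi * (rng.random() - 0.5))
        if F >= 1.0:
            return 1.0
        if F > 0.0:
            return F

def jade_CR(mu_CR):
    # CR_i ~ N(mu_CR, 0.1), truncated to [0, 1].
    return float(np.clip(rng.normal(mu_CR, 0.1), 0.0, 1.0))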
2.3.3. Differential evolution using a neighborhood-based mutation operator (DEGL)
DEGL [51] is also motivated by DE/RandToBest/1. The authors propose a new mutation model based on neighborhoods. They define two kinds of neighborhood, called "local" and "global", and hence two kinds of mutation operator. The local neighborhood is not necessarily local in the sense of geographical nearness or similar fitness values. These mutation operators are combined into one in the following manner. For each member of the population, a local trial vector is created by employing the best (fittest) vector in the neighborhood as

$L_{i,G} = X_{i,G} + F \cdot (X_{Lbest_i,G} - X_{i,G}) + F \cdot (X_{p,G} - X_{q,G})$   (11)

where $Lbest_i$ is the best vector in the local neighborhood of $X_{i,G}$, and $p$, $q$ are the indices of two random vectors extracted from the local neighborhood.
Similarly, the global trial vector is created as

$g_{i,G} = X_{i,G} + F \cdot (X_{gbest,G} - X_{i,G}) + F \cdot (X_{r_1,G} - X_{r_2,G})$   (12)

where $gbest$ is the best vector in the current population, and $r_1$ and $r_2$ are randomized in the interval [1, NP].
To combine both operators, they use a new parameter, known as the "scalar weight" $\omega \in (0, 1)$, and the following expression:

$V_{i,G} = \omega \cdot g_{i,G} + (1 - \omega) \cdot L_{i,G}$   (13)

As with other adaptive methods, they propose different schemes for the adaptation of the new $\omega$ parameter: a deterministic linear or exponential increment, a random value for each vector, or a self-adaptive weight factor scheme. However, they do not present an adaptive control parameter for F and CR.

2.3.4. Scale factor local search in differential evolution (SFLSDE)
Scale factor local search in differential evolution was proposed by Neri and Tirronen [53]. This self-adaptive algorithm was inspired by memetic algorithms. In order to guarantee a high quality solution, SFLSDE uses two local search algorithms in the scale factor space to find the appropriate parameters for a given $X_{i,G}$. Specifically, they follow two different approaches: scale factor golden section search (SFGSS) and scale factor hill-climb (SFHC). Both are based on changing the scale factor value and calculating the fitness value of the trial vector $U_{i,G}$ after the mutation and crossover phases.
SFLSDE follows the typical DE scheme, but at each iteration five random numbers are generated ($rand_1, \ldots, rand_5$) and used to determine the corresponding trial vector $U_{i,G}$. The values of the parameters are as follows:

$F_i = \begin{cases} SFGSS & \text{if } rand_5 < \tau_3 \\ SFHC & \text{if } \tau_3 \le rand_5 < \tau_4 \\ F_l + F_u \cdot rand_1 & \text{if } rand_2 < \tau_1 \text{ and } rand_5 > \tau_4 \\ F_i & \text{otherwise} \end{cases}$   (14)

$CR_i = \begin{cases} rand_3 & \text{if } rand_4 < \tau_2 \\ CR_i & \text{otherwise} \end{cases}$   (15)

where $\tau_k$, $k \in \{1, 2, 3, 4\}$, are constant threshold values. In [53], the authors only use the DE/rand/1/Bin mutation strategy. In the experimental study, we incorporate other mutation strategies.
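Eqs. (14)–(15) translate into the following control-flow sketch; the threshold values tau are placeholders (the recommended settings are given in [53]), and the returned strings stand for invoking the corresponding local search on the scale factor.

import numpy as np

rng = np.random.default_rng(0)

def sflsde_parameters(F_i, CR_i, Fl=0.1, Fu=0.9, tau=(0.1, 0.03, 0.06, 0.07)):
    # tau = (tau1, tau2, tau3, tau4): placeholder thresholds, see [53].
    t1, t2, t3, t4 = tau
    r1, r2, r3, r4, r5 = rng.random(5)
    if r5 < t3:
        F_new = "SFGSS"          # run the golden section search in scale factor space
    elif r5 < t4:
        F_new = "SFHC"           # run the hill-climb in scale factor space
    elif r2 < t1:
        F_new = Fl + Fu * r1     # random restart of the scale factor (Eq. (14))
    else:
        F_new = F_i
    CR_new = r3 if r4 < t2 else CR_i  # Eq. (15)
    return F_new, CR_new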
3. Differential evolution for prototype generation

In this section we explain the proposal to apply the underlying idea of DE to the PG problem as a position adjusting of prototypes scheme. Fig. 1 shows the pseudo-code of the model proposed with the DE/Rand/1 mutation strategy and binomial crossover. In the following we describe the most significant instructions, enumerated from 1 to 34.

Fig. 1. DE algorithm basic structure.

First of all, it is necessary to define the solution codification. In the proposed DE algorithm, each individual $X_{i,G}$ in the population encodes a complete solution; that is, a reduced set of prototypes is encoded sequentially in each individual.
The number of prototypes encoded in each individual defines its individual size, denoted $r$ as previously. A user parameter will set this value $r$. It is necessary to point out that this
parameter $r$ is different to the parameter $D$ explained in Section 2.2. The dimensionality $D$ corresponds to the number of input attributes of the problem.
Following the notation used in Section 2.2, $X_{i,G}$ defines the target vector, but in our case the target vector can be represented as a matrix. Table 1 describes the structure of an individual. Furthermore, each prototype $p_j$, $1 \le j \le r$, of an individual $X_{i,G}$ has a class $x_{p\omega,j}$. This class value remains unchangeable by the DE operators throughout the evolutionary cycle, and it is fixed from the beginning of the process. The number of prototypes evolved for each class is assigned in the initialization stage.
3.1. Initialization

DE begins with a population of NP individuals $X_{i,G}$. Given that this problem provides some knowledge based on the initial arrangement of training samples, instruction 3 initializes each individual $X_{i,G}$ by choosing $r$ random prototypes from TR.
The initialization process ensures that every class has at least one representative prototype. Specifically, we use an initial random stratified selection of prototypes which guarantees that the number of prototypes encoded in each individual for each class is proportional to the number of them in TR. There must be at least one prototype for each class encoded in the individuals. Fig. 2 shows an example.
It is important to point out that every solution must have the same structure: they must have the same number of prototypes per class, and the classes must have the same arrangement in the matrix $X_{i,G}$. Following the example of Fig. 2, each individual $X_{i,0}$ should contain four prototypes, in the following order: three prototypes of Class 0, then one of Class 1.
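A sketch of this stratified initialization for one individual, matching the Fig. 2 example (with the guarantee of at least one prototype per class); the array-based representation and the function name are our assumptions.

import numpy as np

rng = np.random.default_rng(0)

def stratified_individual(X, y, RR=0.95):
    # r = (1 - RR) * n prototypes, distributed proportionally per class.
    z = 1.0 - RR
    rows, labels = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        k = max(int(z * len(idx)), 1)  # truncate, but keep at least one per class
        rows.append(X[rng.choice(idx, size=k, replace=False)])
        labels += [c] * k
    return np.vstack(rows), np.array(labels)  # fixed class arrangement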
Table 1
Encoding of a set of prototypes in an individual $X_{i,G}$ for the DE algorithm.

              Attribute 1   Attribute 2   ...   Attribute D   Class
Prototype 1   xp1,1         xp2,1         ...   xpD,1         xpω,1
Prototype 2   xp1,2         xp2,2         ...   xpD,2         xpω,2
...           ...           ...           ...   ...           ...
Prototype r   xp1,r         xp2,r         ...   xpD,r         xpω,r
3.2. Mutation and crossover operators

The mutation and crossover strategies explained in Section 2 have been implemented. In instructions 9–14, the algorithm selects 3 or 5 random individuals, depending on the mutation strategy, and then generates the mutant matrix $V_{i,G}$ with respect to each individual $X_{i,G}$ in the current population. The operations of addition, subtraction and scalar product are carried out as for typical matrices. This is the justification for the individuals having the same structure: in order for the mutation operator to make sense, the operators must act over the same attributes and over prototypes of the same class in all cases.
After applying this operator, it is necessary to check that the mutant matrix $V_{i,G}$ has been generated with correct values for all features of the prototypes, i.e., to check that the values are in the correct range. Instruction 15 normalizes all attributes of the data set to the [0, 1] range, so this procedure only needs to check whether any values fall out of the range [0, 1]. If a computed value is greater than 1, we truncate it to 1, and if it is lower than 0, we establish it at 0.
Our previous work [48] indicates that the binomial crossover operator has more suitable behavior for the PG problem than the rest of the operators. The new trial matrix is generated by using Eq. (7), and instructions 16–21 show this operation. In PG, instead of interchanging attribute values, the mutant matrix $V_{i,G}$ exchanges its prototypes with the target $X_{i,G}$ to generate a new trial matrix $U_{i,G}$.
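In matrix form, the PG variant of the binomial crossover exchanges whole prototypes (rows) rather than single attributes; the sketch below assumes (r, D) NumPy matrices with the class column kept apart, and is an illustration rather than the published code.

import numpy as np

rng = np.random.default_rng(0)

def pg_binomial_crossover(target_m, mutant_m, CR=0.9):
    # Row-wise Eq. (7): each prototype comes from the mutant when rand <= CR;
    # one row is forced from the mutant, and attributes are clipped to [0, 1].
    r = len(target_m)
    swap = rng.random(r) <= CR
    swap[rng.integers(r)] = True
    trial = np.where(swap[:, None], mutant_m, target_m)
    return np.clip(trial, 0.0, 1.0)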
3.3. Selection operator

This operator must decide which individual between $X_{i,G}$ and $U_{i,G}$ should survive in the population of the next generation $G+1$ (instructions 23–26). The NN rule, with $k = 1$ (1NN), guides this operator. The instances in TR are classified with the prototypes encoded in $X_{i,G}$ or $U_{i,G}$ by the 1NN rule with a leave-one-out validation scheme, and their corresponding fitness values are measured as the $accuracy()$ obtained, which represents the number of successful hits (correct classifications) relative to the total number of classifications. We try to maximize this value, so the selection operator can be viewed as follows:

$X_{i,G+1} = \begin{cases} U_{i,G} & \text{if } accuracy(U_{i,G}) \ge accuracy(X_{i,G}) \\ X_{i,G} & \text{otherwise} \end{cases}$   (16)

In case of a tie between the accuracy values, we select $U_{i,G}$ in order to give the mutated individual the opportunity to enter the population.
Finally, instructions 27–30 check whether the selected individual obtains the best fitness in the population, and instruction 34 returns the best individual found during the evolutionary process.
4. Hybridizations of prototype selection and generation
methods
This section presents the hybridization model that we propose.
Section 4.1 enumerates the arguments that justify hybridization.
Section 4.2 explains how to construct the hybrid model.
4.1. Motivation
Fig. 2. Initialization process for an individual X_{i,0} in the Appendicitis data set. TR contains 95 examples. If we set the reduction rate (RR) at 0.95 and let z = 1 − RR, then r = z · 95 examples = 4 prototypes (truncating this value). Appendicitis is composed of two classes, with 76 and 19 examples respectively. Hence, the individual X_{i,0} should contain z · 76 = 3 prototypes of Class 0 and z · 19 = 0 prototypes of Class 1 (truncating again); we ensure that Class 1 has at least one prototype.
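A minimal sketch of this stratified initialization (our own illustration under the assumptions above; function and variable names are ours). It reproduces the arithmetic of the caption, truncating the per-class counts and guaranteeing at least one prototype per class:

```python
import numpy as np

def init_individual(X_tr, y_tr, rr=0.95, rng=np.random.default_rng()):
    """Pick z = 1 - RR of the training examples of each class as the initial
    prototypes, truncating but keeping at least one prototype per class."""
    z = 1.0 - rr
    protos, labels = [], []
    for c in np.unique(y_tr):
        idx = np.flatnonzero(y_tr == c)
        n_c = max(1, int(z * len(idx)))   # e.g. 0.05*76 -> 3; 0.05*19 -> 0 -> 1
        chosen = rng.choice(idx, size=n_c, replace=False)
        protos.append(X_tr[chosen])
        labels.append(y_tr[chosen])
    return np.vstack(protos), np.concatenate(labels)
```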
As we stated before, PS and PG relate to different problems. The
main drawback of PS methods is that they assume that the best
representative examples can be obtained from a subset of the
original data whereas PG methods generate new representative
examples if needed. Specifically, positioning adjustment methods
aim to correct the position of a subset of prototypes from the initial
set by using an optimization procedure.
However, the positioning adjustment methods are not free of drawbacks:

• They relate to a more complex problem than PS, i.e. the search space can be more difficult to explore.
• As a result of the above, finding a promising solution by using positioning adjustment methods requires a higher cost than a PS method.
• Positioning adjustment methods usually initialize the generated set GS with a fixed number of random prototypes from TR, which will be modified in successive iterations. This characteristic is one of the weaknesses of these methods, because this parameter can be very dependent on the specific problem. In principle, a practitioner must know the exact number of prototypes which will compose the final solution for each problem; moreover, the proportion of prototypes between classes should be estimated in order to obtain good solutions. Thus, two schemes of initialization are commonly used:
  – The number of representative instances for each class is proportional to the number of them in the input data.
  – All the classes are represented with the same number of prototypes.
As we have seen, the appropriate choice of the number of
prototypes per class has not been addressed by positioning
adjustment techniques.
4.2. Hybrid model

Random selection (stratified or not) of prototypes from the TR may not be the most adequate procedure to initialize the GS. Instead, we can use a PS algorithm prior to the adjustment process to initialize a subset of prototypes. Making use of this idea, we mitigate the first and second drawbacks stated before, as most of the effort performed by positioning adjustment is made over a localized search area given by a PS solution. We also tackle the third weakness, because the heuristic of the PS methods is not forced to select a determinate number of prototypes of each class; it selects the most suitable number of prototypes per class. In addition to this, if the prototypes selected by a PS method can be tuned in the search space, the main drawback associated with PS is also overcome.

To hybridize PS and positioning adjustment methods, two different methods of initialization of the positioning adjustment algorithm will be used, depending on the type of codification of the solution:

• Complete solution per individual: This corresponds to the case where each individual of the population encodes a complete GS (i.e., that used by DE and PSO). The SS must be inserted once as one of the individuals of the population, initializing the rest of the individuals as the standard procedure does.
• Others: This is the case where the complete GS is optimized (i.e., the scheme used by LVQ3). The resulting SS of the PS methods is used by the positioning adjustment procedure as the initial set.

When each individual of the evolutionary algorithm encodes a complete GS, it helps to alleviate the complexity of the optimization procedure, because there is a promising initial individual in the population. Operators used by DE and PSO benefit from the presence of this individual. Furthermore, this type of codification tries to avoid getting stuck at a local optimum, initializing the rest of the individuals with random solutions extracted from the TR, keeping the same structure as the SS selected by the PS method, as in the example given in Section 3.1.

Fig. 3 shows the two different hybrid models. Specifically, Fig. 3(a) presents the scheme to hybridize a PS method with DE and PSO, and Fig. 3(b) shows the hybridization process with LVQ3.

5. Experimental framework

In this section, we show the factors and issues related to the experimental study. We provide the measures employed to evaluate the performance of the algorithms (Section 5.1), details of the problems chosen for the experimentation (Section 5.2), an enumeration of the algorithms used for comparison with their respective parameters (Section 5.3) and, finally, the statistical tests employed to contrast the results obtained (Section 5.4).

5.1. Performance measures for standard classification

In this work, we deal with multi-class data sets. In these domains, two measures are widely used for measuring the effectiveness of classifiers because of their simplicity and successful application: accuracy and Cohen's kappa rate. Furthermore, the reduction rate will be used as the classification efficiency measure. They are explained as follows:

• Accuracy: the number of successful hits (correct classifications) relative to the total number of classifications. It has been by far the most commonly used metric for assessing the performance of classifiers for years [57,58].
• Cohen's kappa (kappa rate): an alternative measure to the classification rate, since it compensates for random hits [59]. In contrast to accuracy, kappa evaluates the portion of hits that can be attributed to the classifier itself (i.e., not to mere chance), relative to all the classifications that cannot be attributed to chance alone. Cohen's kappa ranges from −1 (total disagreement) through 0 (random classification) to 1 (perfect agreement). For multi-class problems, kappa is a very useful, yet simple, meter of a classifier's accuracy that compensates for random successes (a computational sketch is given after this list).
• Reduction rate: one of the main goals of the PG and PS methods is to reduce storage requirements. Another, closely related goal is to speed up classification: a reduction in the number of stored instances typically yields a corresponding reduction in the time it takes to search through these examples and classify a new input vector.
Note that Accuracy and Kappa measures are applied over the
training data with a leave-one-out validation scheme.
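A minimal sketch of both measures (our own illustration; names are assumptions), computing accuracy and Cohen's kappa from the confusion matrix:

```python
import numpy as np

def accuracy_and_kappa(y_true, y_pred):
    """Accuracy = hit rate; Cohen's kappa discounts the hits expected by
    chance, estimated from the marginals of the confusion matrix."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    k = len(classes)
    cm = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        cm[np.searchsorted(classes, t), np.searchsorted(classes, p)] += 1
    n = cm.sum()
    p_o = np.trace(cm) / n                       # observed agreement (accuracy)
    p_e = (cm.sum(0) * cm.sum(1)).sum() / n**2   # agreement expected by chance
    return p_o, (p_o - p_e) / (1 - p_e)          # (accuracy, kappa)
```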
5.2. Data sets
In the experimental study, we selected 56 data sets from the UCI
repository [60] and the KEEL-dataset repository2 [61]. Table 2
summarizes the properties of the selected data sets. It shows, for
each data set, the number of examples (#Ex.), the number of
attributes (#Atts.), and the number of classes (#Cl.). The data sets
are grouped into two categories depending on their size: small data sets have fewer than 2000 instances, and large data sets have more than 2000. The data sets considered are
partitioned using the 10-fold cross-validation (10-fcv) [62,63]
procedure.
² http://sci2s.ugr.es/keel/datasets
Fig. 3. Hybrid model. (a) PSO and DE approaches; (b) LVQ approach.
In K-fold cross-validation (K-fcv), the original sample is randomly partitioned into K subsamples. A single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds are then averaged to produce a single estimation.
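For concreteness, a minimal sketch of the partitioning just described (our own illustration, not the KEEL tooling actually used):

```python
import numpy as np

def kfold_indices(n, k=10, rng=np.random.default_rng(0)):
    """K-fcv as described above: random partition into k subsamples; each
    fold serves once as validation, the remaining k-1 as training."""
    perm = rng.permutation(n)
    folds = np.array_split(perm, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, val
```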
5.3. Comparison algorithms and parameters

Several methods, evolutionary and non-evolutionary, have been selected to perform an exhaustive study of the capabilities of our proposals. Those methods are as follows:

• 1NN: The 1NN rule is used as a baseline limit of performance.

Prototype selection methods:

• DROP3: This combines an edition stage with a decremental approach in which the algorithm checks all the instances in order to find those which should be deleted from GS [19].
• ICF: This method follows an iterative procedure in which the instances susceptible to removal from GS are determined based on reachability and coverage properties of each instance [23].
• SSMA: This memetic algorithm makes use of a local search or meme specifically developed for the prototype selection problem. This interweaving of the global and local search phases allows the two to influence each other [38].

Prototype generation methods:

• LVQ3: Learning vector quantization can be understood as a special case of artificial neural network in which a neuron corresponds to a prototype and a weight-based competition is carried out in order to locate each neuron in a concrete place of the m-dimensional space to increase the classification accuracy [29]. It will be used as an optimizer in the proposed hybrid models.
• MixtGauss: This is an adaptive PG method considered in the framework of mixture modeling by Gaussian distributions, while assuming statistical independence of features. The prototypes are chosen as the mean vectors of the optimized Gaussians, whose mixtures are fit to model each of the classes [32].
Table 2
Summary description for classification data sets.

Data set           #Ex.    #Atts.  #Cl.
Abalone            4174    8       28
Appendicitis       106     7       2
Australian         690     14      2
Autos              205     25      6
Balance            625     4       3
Banana             5300    2       2
Bands              539     19      2
Breast             286     9       2
Bupa               345     6       2
Car                1728    6       4
Chess              3196    36      2
Cleveland          297     13      5
Coil2000           9822    85      2
Contraceptive      1473    9       3
crx                125     15      2
Dermatology        366     33      6
Ecoli              336     7       8
Flare-solar        1066    9       2
German             1000    20      2
Glass              214     9       7
Haberman           306     3       2
Hayes-roth         133     4       3
Heart              270     13      2
Hepatitis          155     19      2
Housevotes         435     16      2
Iris               150     4       3
Led7digit          500     7       10
Lymphography       148     18      4
Magic              19020   10      2
Mammographic       961     5       2
Marketing          8993    13      9
Monks              432     6       2
Movement_libras    360     90      15
Newthyroid         215     5       3
Pageblocks         5472    10      5
Penbased           10992   16      10
Pima               768     8       2
Saheart            462     9       2
Satimage           6435    36      7
Segment            2310    19      7
Sonar              208     60      2
Spambase           4597    57      2
Spectheart         267     44      2
Splice             3190    60      3
Tae                151     5       3
Texture            5500    40      11
Thyroid            7200    21      3
Tic-tac-toe        958     9       2
Titanic            2201    3       2
Twonorm            7400    20      2
Vehicle            846     18      4
Vowel              990     13      11
Wine               178     13      3
Wisconsin          683     9       2
Yeast              1484    8       10
Zoo                101     16      7
• HYB: This constitutes a hybridization of several prototype reduction techniques. Concretely, HYB combines support vector machines with LVQ3 and executes a search in order to find the most appropriate parameters of LVQ3 [31].
• RSP3: This technique is based on Chen's algorithm [30]. The main difference between them is that in Chen's algorithm any subset containing a mixture of instances belonging to different classes can be chosen to be divided. By contrast, in RSP3 [54], the subset with the highest overlapping degree is the one picked to be split. This process tries to avoid drastic changes in the form of the decision boundaries associated with TR, which are the main shortcoming of Chen's algorithm.
• ENPC: This follows a genetic scheme with five operators, which focus their attention on defining regions in the search space [42].
• PSO: This adjusts the position of an initial set with the PSO rules, attempting to minimize the classification error [39]. We will use it as an optimizer in the proposed hybrid models.
• PSCSA: This is based on an artificial immune system [64], using the clonal selection algorithm to find the most appropriate position for a prototype set [43].
Many different configurations are established by the authors of each paper for the different techniques. We focus this experimentation on the parameters recommended by the respective authors, assuming that these values were chosen optimally. The configuration parameters, which are common for all problems, are shown in Table 3. Note that some methods have no parameters to be fixed, so they are not included in this table. In all of the techniques, Euclidean distance is used as the similarity function, and the stochastic methods have been run three times per partition.
5.4. Statistical tools for analysis
In this paper, we use hypothesis testing techniques to provide statistical support for the analysis of the results [65,66]. Specifically, we use non-parametric tests, because the initial conditions that guarantee the reliability of parametric tests may not be satisfied, causing the statistical analysis to lose credibility. These tests are suggested in the studies presented in [67,68,65,69], where their use in the field of machine learning is highly recommended.
Throughout the study, we perform several non-parametric tests. The Wilcoxon test [67,68] will be used to perform multiple pairwise comparisons between the different schemes of our proposals, adopting a level of significance of α = 0.1.
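As a usage illustration (ours, not part of the paper's experimental code; the accuracy values below are made up), the Wilcoxon signed-ranks test can be applied to the paired per-data-set results of two methods:

```python
from scipy.stats import wilcoxon

# Hypothetical paired test accuracies of two methods over the same data sets.
acc_method_a = [0.76, 0.81, 0.69, 0.88, 0.74, 0.80]
acc_method_b = [0.74, 0.80, 0.70, 0.85, 0.71, 0.78]

stat, p_value = wilcoxon(acc_method_a, acc_method_b)
print(f"W = {stat}, p = {p_value:.4f}")  # reject equality when p < alpha = 0.1
```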
Furthermore, in order to perform multiple comparisons between our proposals and the rest of the techniques considered, we will use the Friedman Aligned-Ranks test [70] to detect statistical differences among a group of results, and the Holm post-hoc test [71] to find out which algorithms are distinctive among the 1 × n comparisons performed [69]. A complete description of these statistical tests can be found in Appendix A.
More information about these tests and other statistical procedures can be found at http://sci2s.ugr.es/sicidm/.
6. Analysis of results
In this section, we analyze the results obtained from different
experimental studies. Specifically, our aims are:
• To compare the different DE schemes to each other and to several classical and recent prototype reduction techniques for 1NN-based classification over small data sets (Section 6.1).
• To test the performance of our DE schemes when the size of the problems is increased (Section 6.2).
• To show the convergence process of basic and advanced DE algorithms (Section 6.3).
• To analyze the benefits of hybrid models over small data sets (Section 6.4).
• To check if the performance of hybrid models is maintained with large data sets (Section 6.5).
6.1. Analysis and results of DE schemes over small size data sets
This study is divided into two parts. First, in Section 6.1.1 we
compare the different schemes of DE and identify the best
alternatives for the positioning adjustment of prototypes. The
Wilcoxon test will be used to support this analysis [67]. Next,
Section 6.1.2 shows a comparative study of the best DE methods
with other classical and recent PG techniques. In this case, the
Friedman Aligned Ranks test for multiple comparisons will be used
in association with the Holm post-hoc test [69]. We have used a
total of 40 small data sets of the general framework for both
experiments.
6.1.1. Results of DE schemes over small data sets
We focus this experiment on comparing the differences in
performance of the DE methods based on the experimental framework stated previously.
Table 4 shows the average results (and their standard deviations, '±') obtained over small data sets in training and test data by six different mutation strategies for the basic DE, two configurations of the SADE parameters, one for JADE, four different schemes for DEGL and, finally, SFLSDE tested with two mutation strategies. The best value in each column is highlighted in bold. The Wilcoxon test is conducted to compute, for each method and with a level of significance of α = 0.1, the number of algorithms outperformed by it and the number of algorithms with no detected differences in performance. Specifically, the column denoted by '+' reflects the number of methods outperformed by the method in the row, and the column '+=' shows the number of methods with similar or worse performance than the method in the row.
Observing Table 4, we can point out some interesting facts:

• The choice of an adequate mutation strategy seems to be an important factor that influences the results obtained. When the perturbation process is based on the selection of random individuals to generate a new solution, it may be affected by a lack of exploitation capability. However, when the best individual guides the search, exploration capabilities are reduced. RandToBest strategies have reported the best results because they strike a good balance between exploration (random individuals) and exploitation (the best individual).
• Advanced proposals such as JADE and DEGL, which are completely motivated by the RandToBest strategy, clearly outperform those basic DE techniques which are based on Rand and Best strategies. SADE probably loses accuracy in the iterations that only execute Rand and Best strategies. The DEGL algorithm with an exponential increment of its weight factor has reported the best kappa test value, and the statistical test shows that it also overcomes more methods, supported by a level of significance of α = 0.1, in terms of accuracy rate. The other DEGL variants obtain similar results, except for the random approach, which is probably affected by a lack of convergence.
• SFLSDE with the RandToBest strategy achieves the best average results in accuracy. This technique combines the best mutation strategy with two local searches which allow it to find the most suitable parameters during the evolution process.

Looking at accuracy, kappa rate and the statistical test, three methods deserve particular mention: DEGL exponential, SFLSDE Rand and SFLSDE RandToBest. We will use these methods in the comparison with other PG methods.

Table 3
Parameter specification for all the methods employed in the experimentation.

Algorithm   Parameters
SSMA        Population = 30, Eval = 10 000, Cross = 0.5, Mutation = 0.001
LVQ3        Iterations = 500, α = 0.1, WindowWidth = 0.2, epsilon = 0.1, Reduction Rate = 0.95/0.99
HYB         Search_Iter = 200, Optimal_Iter = 1000, alpha = 0.1, I_epsilon = 0, F_epsilon = 0.5, Initial_Window = 0, Final_Window = 0.5, delta = 0.1, delta_Window = 0.1, Initial Selection = SVM
RSP3        Subset Choice = Diameter
ENPC        Iterations = 250
PSO         SwarmSize = 40, Iterations = 500, C1 = 1, C2 = 3, Vmax = 0.25, Wstart = 1.5, Wend = 0.5, Reduction Rate = 0.95/0.99
PSCSA       HyperMutation Rate = 2, Clonal Rate = 10, Mutation Rate = 0.01, Stim_Threshold = 0.89, α = 0.4
DE          PopulationSize = 40, Iterations = 500, F = 0.5, CR = 0.9, Reduction Rate = 0.95/0.99
SADE        PopulationSize = 40, Iterations = 500, Learning Period = 50 and 100, Reduction Rate = 0.95/0.99
JADE        PopulationSize = 40, Iterations = 500, p = 0.05, c = 0.1, Reduction Rate = 0.95/0.99
DEGL        PopulationSize = 40, Iterations = 500, F = 0.8, CR = 0.9, WeightFactor = 0.0, WeightScheme = Exponential, Adaptive, Random and Linear, Reduction Rate = 0.95/0.99
SFLSDE      PopulationSize = 40, Iterations = 500, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9, Reduction Rate = 0.95/0.99

Note: the reduction rate of the fixed-reduction algorithms has been established at 0.95 for small data sets and 0.99 for large ones.

Table 4
Results of different DE models over small data sets (mean ± standard deviation; '+' and '+=' as defined above, for test accuracy and test kappa).

Algorithm                  Acc. train        Acc. test         Kappa train       Kappa test        Acc Tst +/+=   Kappa Tst +/+=
DE/Rand/1/Bin              0.7679 ± 0.1536   0.7268 ± 0.1670   0.5816 ± 0.2216   0.4947 ± 0.2579   0/13    0/10
DE/Best/1/Bin              0.8005 ± 0.1275   0.7393 ± 0.1464   0.6355 ± 0.1986   0.5155 ± 0.2468   0/10    0/10
DE/RandToBest/1/Bin        0.8279 ± 0.1192   0.7524 ± 0.1460   0.6859 ± 0.1887   0.5384 ± 0.2453   0/14    1/13
DE/Best/2/Bin              0.8285 ± 0.1174   0.7434 ± 0.1445   0.6854 ± 0.1875   0.5212 ± 0.2463   1/13    2/13
DE/Rand/2/Bin              0.7962 ± 0.1456   0.7348 ± 0.1563   0.6295 ± 0.2187   0.5061 ± 0.2484   0/12    0/7
DE/RandToBest/2/Bin        0.8231 ± 0.1250   0.7567 ± 0.1426   0.6735 ± 0.2031   0.5406 ± 0.2484   0/14    1/14
SADE LP 50                 0.8195 ± 0.1563   0.7513 ± 0.1452   0.6708 ± 0.1563   0.5335 ± 0.2482   1/12    2/13
SADE LP 100                0.8243 ± 0.1232   0.7502 ± 0.1435   0.6776 ± 0.1979   0.5324 ± 0.2432   0/12    0/12
JADE                       0.8209 ± 0.1219   0.7541 ± 0.1417   0.6708 ± 0.1974   0.5417 ± 0.2415   0/14    3/14
DEGL EXP                   0.8144 ± 0.1563   0.7597 ± 0.1394   0.6728 ± 0.1563   0.5529 ± 0.2328   6/14    4/14
DEGL ADAP                  0.8211 ± 0.1563   0.7525 ± 0.1401   0.6606 ± 0.1563   0.5351 ± 0.2436   1/14    1/14
DEGL RANDOM                0.8146 ± 0.1563   0.7488 ± 0.1392   0.6615 ± 0.1936   0.5329 ± 0.2335   0/13    2/12
DEGL LINEAR                0.8187 ± 0.1563   0.7550 ± 0.1404   0.6689 ± 0.1930   0.5437 ± 0.2371   1/13    2/13
SFLSDE/Rand/1/Bin          0.8347 ± 0.1563   0.7582 ± 0.1563   0.6960 ± 0.1948   0.5461 ± 0.1563   2/14    4/14
SFLSDE/RandToBest/1/Bin    0.8411 ± 0.1563   0.7619 ± 0.1563   0.7079 ± 0.1563   0.5516 ± 0.1563   2/14    2/13
6.1.2. Comparison with other PG techniques over small data sets
In this section, we perform a comparison between the three best DE models identified above (SFLSDE Rand, SFLSDE RandToBest and DEGL exponential) and the other seven PG methods. Table 5 shows the average results collected. In this case, we add the reduction rate, which is an important measure for comparing the methods.

Table 6 presents the rankings obtained by the Friedman Aligned (FA) procedure with the accuracy measure. In this table, algorithms are ordered from the best to the worst ranking. Furthermore, the third column shows the adjusted p-value obtained with Holm's test (Holm APV) [69]. Note that SFLSDE RandToBest is established as the control algorithm because it has obtained the best FA ranking. Holm's test uses the same level of significance as the Wilcoxon test, α = 0.1. Algorithms highlighted in bold are those which have been outperformed at this level of significance.
Observing both Tables 5 and 6, we want to make some interesting comments:

• DE methods significantly outperform the other PG techniques, except for PSO, in accuracy and kappa rate. PSO is clearly the most competitive PG algorithm with respect to DE: it has the same type of solution codification as DE and a similar evolutionary scheme, but the advanced proposals of the DE algorithm usually obtain better average results.
• We could also have stressed the RSP3, HYB and ENPC algorithms as competitive with DE. However, as we can see in the table, they obtain good performance on the training results but do not report great test results; therefore they suffer from higher overfitting than the DE algorithms.
• In terms of reduction capabilities, the reduction rate of DE has been fixed at 0.95. Only MixtGauss and PSCSA obtain better reduction rates, but they offer lower accuracy/kappa rates. DE outperforms the rest of the methods with similar or lower reduction rates.
Now, we focus our attention on the first statement. Holm's test has reported no significant differences between DE and PSO. PSO probably benefits from the multiple comparison test, because it significantly outperforms the rest of the PG techniques. For this reason, we want to check the comparison between PSO and DE with the Wilcoxon test. Table 7 shows the p-values obtained. As we can see, the advanced DE proposals always outperform the PSO algorithm with a level of significance of α = 0.1.
6.2. Analysis and results of DE schemes over large size data sets
This section presents the study and analysis of large size data
sets. The goal of this study is to analyze the effect of scaling up the
data in DE methods. Again, we divide this section into two different
stages. First, in Section 6.2.1 we look for the best DE method over
large data sets. Next, Section 6.2.2 compares the results with other
PG methods.
6.2.1. Results of DE schemes over large data sets
In order to test the performance of the DE methods we have established a high reduction rate (0.99). Table 8 presents the comparative study. Again, we use the Wilcoxon test to differentiate between the different proposals.
Table 6
Average rankings of the algorithms over small data sets (Friedman Aligned-Ranks + Holm's test).

Algorithm                  FA ranking   Holm APV
SFLSDE/RandToBest/1/Bin    131.4625     –
SFLSDE/Rand/1/Bin          138.3        1.0
DEGL EXP                   139.4        1.0
PSO                        158.275      1.0
RSP3                       225.3999     0.0044
1NN                        226.0        0.0044
HYB                        258.3625     4.85 × 10⁻⁵
ENPC                       268.15       1.07 × 10⁻⁵
Mixt_Gauss                 269.6875     9.33 × 10⁻⁶
PSCSA                      286.1875     4.75 × 10⁻⁷
LVQ3                       324.275      1.19 × 10⁻¹⁰
Table 7
Results of the Wilcoxon test compared with PSO over small data sets.

Comparison                   p-Value
DEGL EXP vs. PSO             0.0106
SFLSDE Rand vs. PSO          0.0986
SFLSDE RandToBest vs. PSO    0.0408
Table 5
Comparison between the three best DE models and other PG approaches over small data sets (mean ± standard deviation).

Algorithm                  Acc. train        Acc. test         Kappa train       Kappa test        Reduction
1NN                        0.7369 ± 0.1654   0.7348 ± 0.1664   0.4985 ± 0.2910   0.4918 ± 0.2950   0.0000 ± 0.0000
MixtGauss                  0.7138 ± 0.1545   0.6932 ± 0.1668   0.4888 ± 0.2473   0.4546 ± 0.2680   0.9552 ± 0.0084
LVQ3                       0.6931 ± 0.1560   0.6763 ± 0.1662   0.4421 ± 0.2458   0.4114 ± 0.1563   0.9488 ± 0.0083
HYB                        0.8309 ± 0.0154   0.7153 ± 0.1651   0.6988 ± 0.2573   0.4790 ± 0.1563   0.4278 ± 0.1563
RSP3                       0.7924 ± 0.1373   0.7325 ± 0.1591   0.6112 ± 0.2420   0.5004 ± 0.2861   0.7329 ± 0.1185
ENPC                       0.8247 ± 0.1477   0.7167 ± 0.1597   0.6800 ± 0.2532   0.4818 ± 0.2936   0.7220 ± 0.1447
PSO                        0.8238 ± 0.1274   0.7501 ± 0.1409   0.6791 ± 0.1950   0.5332 ± 0.2402   0.9491 ± 0.0072
PSCSA                      0.6787 ± 0.1835   0.6682 ± 0.1874   0.4461 ± 0.2466   0.4231 ± 0.2540   0.9858 ± 0.1563
DEGL EXP                   0.8144 ± 0.1563   0.7597 ± 0.1394   0.6728 ± 0.1563   0.5529 ± 0.2328   0.9483 ± 0.1563
SFLSDE/Rand/1/Bin          0.8347 ± 0.1563   0.7582 ± 0.1563   0.6960 ± 0.1948   0.5461 ± 0.1563   0.9481 ± 0.1563
SFLSDE/RandToBest/1/Bin    0.8411 ± 0.1563   0.7619 ± 0.1563   0.7079 ± 0.1563   0.5516 ± 0.1563   0.9481 ± 0.1563
Table 8
Results of different DE models over large data sets (mean ± standard deviation; '+'/'+=' as in Table 4).

Algorithm                  Acc. train        Acc. test         Kappa train       Kappa test        Acc Tst +/+=   Kappa Tst +/+=
DE/Rand/1/Bin              0.7831 ± 0.0055   0.7798 ± 0.2075   0.5709 ± 0.2713   0.5639 ± 0.2761   0/12    0/12
DE/Best/1/Bin              0.8025 ± 0.0038   0.7881 ± 0.2069   0.5883 ± 0.2849   0.5605 ± 0.2894   0/8     0/8
DE/RandToBest/1/Bin        0.8124 ± 0.0032   0.7968 ± 0.2087   0.6115 ± 0.2845   0.5815 ± 0.2888   2/10    2/10
DE/Best/2/Bin              0.8183 ± 0.0045   0.7988 ± 0.2086   0.6224 ± 0.2859   0.5843 ± 0.2897   1/11    1/11
DE/Rand/2/Bin              0.7888 ± 0.0068   0.7838 ± 0.2088   0.5803 ± 0.2691   0.5686 ± 0.2753   0/10    0/10
DE/RandToBest/2/Bin        0.8243 ± 0.0045   0.8088 ± 0.2113   0.6377 ± 0.2920   0.6073 ± 0.2928   6/14    6/14
SADE LP 50                 0.8107 ± 0.0036   0.7966 ± 0.2063   0.6178 ± 0.2722   0.5918 ± 0.2748   2/13    2/13
SADE LP 100                0.8070 ± 0.0032   0.7941 ± 0.2070   0.6030 ± 0.2793   0.5789 ± 0.2829   0/12    0/12
JADE                       0.8136 ± 0.0110   0.8020 ± 0.2058   0.6204 ± 0.2803   0.5969 ± 0.2830   6/13    6/13
DEGL EXP                   0.8076 ± 0.0032   0.7951 ± 0.2058   0.6044 ± 0.2783   0.5792 ± 0.2829   0/11    0/11
DEGL ADAP                  0.8069 ± 0.0041   0.7946 ± 0.2074   0.6027 ± 0.2811   0.5761 ± 0.2883   1/10    1/10
DEGL RANDOM                0.8069 ± 0.0037   0.7938 ± 0.2079   0.6025 ± 0.2799   0.5772 ± 0.2844   0/8     0/8
DEGL LINEAR                0.8088 ± 0.0031   0.7961 ± 0.2058   0.6052 ± 0.2810   0.5811 ± 0.2831   1/11    1/11
SFLSDE/Rand/1/Bin          0.8327 ± 0.0046   0.8181 ± 0.2074   0.6541 ± 0.2840   0.6243 ± 0.2925   9/14    9/14
SFLSDE/RandToBest/1/Bin    0.8341 ± 0.0030   0.8154 ± 0.2072   0.6556 ± 0.2879   0.6184 ± 0.2910   11/14   11/14
We can make several observations from these results:

• Some models present important differences when tackling large data sets. We can stress JADE as a good algorithm when the size of the data set grows: with large data sets, this algorithm overcomes most of the advanced proposals, except for SFLSDE, which remains the best advanced DE model.
• The number of difference vectors perturbed by the mutation operator does seem to be an important factor that influences the final result obtained when dealing with large data sets.
• SFLSDE/Rand/1/Bin has reported the best average results in accuracy and kappa rate. However, the statistical test shows that SFLSDE RandToBest behaves better than the Rand strategy. As we can observe in Table 8, SFLSDE/Rand/1/Bin is able to overcome nine methods and SFLSDE RandToBest 11 methods. The rest of the advanced proposals are not able to overcome SFLSDE.
• When dealing with large data sets, the statistical test notes larger differences between the methods. Concretely, we can observe that SFLSDE RandToBest outperforms a total of 11 methods out of 14 with a level of significance of 0.1.

We select both SFLSDE algorithms for the next comparison as they have, in general, reported the best results.

Table 9
Comparison between the two best DE models and other PG approaches over large data sets (mean ± standard deviation).

Algorithm                  Acc. train        Acc. test         Kappa train       Kappa test        Reduction
1NN                        0.8197 ± 0.0023   0.8072 ± 0.0100   0.6195 ± 0.0229   0.5948 ± 0.0181   0.0000 ± 0.0000
MixtGauss                  0.7534 ± 0.0141   0.7505 ± 0.2315   0.4913 ± 0.3251   0.4860 ± 0.3255   0.9514 ± 0.0001
LVQ3                       0.6840 ± 0.0057   0.6767 ± 0.2680   0.4409 ± 0.2926   0.4264 ± 0.2962   0.9899 ± 0.0011
HYB                        0.7888 ± 0.0234   0.7618 ± 0.2168   0.5992 ± 0.2790   0.5567 ± 0.3153   0.5727 ± 0.2903
RSP3                       0.7922 ± 0.2545   0.7556 ± 0.2708   0.6299 ± 0.3266   0.5597 ± 0.3397   0.8100 ± 0.1369
ENPC                       0.8809 ± 0.1610   0.7986 ± 0.2188   0.7613 ± 0.2497   0.6170 ± 0.2949   0.8205 ± 0.1919
PSO                        0.8022 ± 0.0055   0.8049 ± 0.2136   0.6177 ± 0.2887   0.5948 ± 0.2880   0.9899 ± 0.0011
PSCSA                      0.6730 ± 0.2190   0.6707 ± 0.2205   0.3900 ± 0.2376   0.3842 ± 0.2824   0.9988 ± 0.0017
SFLSDE/Rand/1/Bin          0.8388 ± 0.0039   0.8249 ± 0.2205   0.6570 ± 0.3036   0.6281 ± 0.3131   0.9901 ± 0.0002
SFLSDE/RandToBest/1/Bin    0.8414 ± 0.0028   0.8236 ± 0.2199   0.6598 ± 0.3079   0.6240 ± 0.3131   0.9901 ± 0.0002
Table 10
Average rankings of the algorithms over large data sets (Friedman Aligned-Ranks + Holm's test).

Algorithm                  FA ranking   Holm APV
SFLSDE/RandToBest/1/Bin    47.0313      –
SFLSDE/Rand/1/Bin          48.0938      0.9482
1NN                        65.1875      0.7359
PSO                        66.0625      0.7359
ENPC                       71.875       0.5174
RSP3                       86.9063      0.0746
HYB                        92.6875      0.0319
Mixt_Gauss                 94.375       0.0269
LVQ3                       113.5313     3.9323 × 10⁻⁴
PSCSA                      119.25       9.3584 × 10⁻⁵
6.2.2. Comparison with other PG techniques over large data sets
In this section, we perform a comparison between the two best DE models obtained for large data sets (the SFLSDE models) and the same algorithms as in Section 6.1.2. Table 9 shows the average results obtained, and Table 10 displays the FA ranking and the adjusted p-value obtained with Holm's test.

Observing Tables 9 and 10, we can summarize that:

• Most of the PG methods present clear differences when dealing with large data sets. For instance, ENPC improves on the ranking it obtained over small data sets. Together with PSO, it is the most competitive PG technique with respect to the DE model, and Holm's test supports this statement. However, SFLSDE usually obtains better average results.
• DE methods significantly overcome the rest of the PG techniques. Specifically, accuracy and kappa rate demonstrate that SFLSDE Rand is able to improve by 0.02 the average results obtained by the best PG technique (PSO).
• In order to improve the efficiency of the 1NN rule when tackling large data sets, the reduction rate becomes more important. The FA ranking indicates that only the SFLSDE models are able to outperform 1NN with a high reduction rate (0.99), to a greater extent than the rest of the PG methods.
Again, we use the Wilcoxon test to check whether the DE models are able to outperform the most competitive PG algorithms. Specifically, we carry out a study with ENPC and PSO; Table 11 presents the results. The Wilcoxon test shows that ENPC is outperformed with α = 0.1. This hypothesis is rejected for PSO, although the p-value is smaller than the adjusted p-value of Holm's test.
Table 11
Results of the Wilcoxon test compared with PSO and ENPC over large data sets.

Comparison                    p-Value
SFLSDE Rand vs. PSO           0.1981
SFLSDE RandToBest vs. PSO     0.1928
SFLSDE Rand vs. ENPC          0.0183
SFLSDE RandToBest vs. ENPC    0.0214
6.3. Analysis of convergence

One of the most important issues in the development of any EA is the analysis of the convergence of its population. If the EA does not evolve in time, it will not be able to obtain suitable solutions. Fig. 4 shows a graphical representation of the convergence capabilities of the DE models; specifically, the best basic DE (DE/RandToBest/2/Bin), SADE (LP = 50), JADE, DEGL exponential, and SFLSDE with RandToBest.
To perform this analysis we have selected the small Bupa data set. The graphic shows a line representing the fitness value of the best individual of each population: the X-axis represents the number of iterations carried out, and the Y-axis represents the fitness value achieved so far.

As we can see in the graphic, SADE and DEGL quickly find promising solutions, but then spend more than 300 iterations without an improvement. However, SFLSDE and DE are slower to converge, which usually allows them to find a better solution at the end of the process; they strike a good balance between exploration and exploitation during the evolution.
6.4. Analysis and results of hybrid models over small size data sets

This section shows the average results obtained for our hybrid models when they are applied to small data sets. Table 12 collects the average results and the Wilcoxon test. The abilities of the hybrid models are shown, and their performance is compared with that of the basic components that take part in them.

The results achieved in this part of the study allow us to conclude the following:

• Hybrid models always outperform the basic algorithms upon which they are based. The good synergy between PG and PS methods is clearly demonstrated by the obtained results. We selected three PS methods with different reduction rates (ICF 0.7107, DROP3 0.8202 and SSMA 0.9553). A priori, a lower reduction rate should allow better accuracy results to be obtained. In terms of accuracy/kappa rates, we can observe how the hybrid models DROP3 + SFLSDE and ICF + SFLSDE probably produce overfitting on the training data, because they do not present good generalization capabilities, obtaining lower accuracy/kappa rates in the test results.

[Fig. 4. Map of convergence: Bupa data set. Line plot of the fitness value of the best individual (Y-axis, approximately 65–80) against the number of iterations (X-axis, 0–500), with one curve for each of DE, SADE, JADE, DEGL and SFLSDE.]

Table 13
Average runtime of the optimizer algorithms over small data sets.

          SFLSDE    PSO       LVQ3
Runtime   40.5483   42.3168   0.2316

³ These results have been obtained with an Intel(R) Core(TM) i7 CPU 920 at 2.67 GHz.
Table 12
Hybrid models with small data sets (mean ± standard deviation; '+'/'+=' as in Table 4).

Algorithm                          Acc. train        Acc. test         Kappa train       Kappa test        Reduction         Acc Tst +/+=   Kappa Tst +/+=
DROP3                              0.7527 ± 0.1240   0.7011 ± 0.1497   0.5498 ± 0.2139   0.4544 ± 0.2651   0.8202 ± 0.0148   0/5     1/6
ICF                                0.7118 ± 0.1343   0.6784 ± 0.1505   0.4797 ± 0.2364   0.4175 ± 0.2644   0.7107 ± 0.1369   0/4     0/4
SSMA                               0.8207 ± 0.1335   0.7581 ± 0.1518   0.6685 ± 0.2089   0.5455 ± 0.2685   0.9554 ± 0.0343   7/15    6/15
LVQ3                               0.6931 ± 0.1560   0.6763 ± 0.1662   0.4421 ± 0.2458   0.4114 ± 0.1563   0.9488 ± 0.0083   0/4     0/2
PSO                                0.8238 ± 0.1274   0.7501 ± 0.1409   0.6791 ± 0.1950   0.5332 ± 0.2402   0.9491 ± 0.0072   6/15    5/14
SFLSDE/RandToBest/1/Bin            0.8411 ± 0.1563   0.7619 ± 0.1563   0.7079 ± 0.1563   0.5516 ± 0.1563   0.9481 ± 0.1563   9/14    7/13
DROP3 + LVQ3                       0.7666 ± 0.1197   0.7027 ± 0.1471   0.5705 ± 0.2200   0.4553 ± 0.2703   0.8202 ± 0.0809   0/5     1/5
DROP3 + PSO                        0.8645 ± 0.0970   0.7501 ± 0.1349   0.7474 ± 0.1688   0.5286 ± 0.2588   0.8202 ± 0.0809   5/9     6/13
DROP3 + SFLSDE/RandToBest/1/Bin    0.8711 ± 0.0958   0.7620 ± 0.1348   0.7605 ± 0.1687   0.5488 ± 0.2572   0.8202 ± 0.0809   6/13    6/14
ICF + LVQ3                         0.7384 ± 0.1189   0.6865 ± 0.1413   0.5229 ± 0.2191   0.4282 ± 0.2590   0.7107 ± 0.1369   0/5     0/5
ICF + PSO                          0.8677 ± 0.0980   0.7523 ± 0.1398   0.7526 ± 0.1678   0.5318 ± 0.2593   0.7107 ± 0.1369   5/9     3/8
ICF + SFLSDE/RandToBest/1/Bin      0.8738 ± 0.0991   0.7618 ± 0.1401   0.7642 ± 0.1730   0.5462 ± 0.2695   0.7107 ± 0.1369   7/14    7/15
SSMA + LVQ3                        0.8347 ± 0.1100   0.7704 ± 0.1267   0.6842 ± 0.2087   0.5619 ± 0.2657   0.9554 ± 0.0343   9/14    6/15
SSMA + PSO                         0.8617 ± 0.1007   0.7770 ± 0.1267   0.7376 ± 0.1852   0.5727 ± 0.2515   0.9554 ± 0.0343   7/15    9/15
SSMA + SFLSDE/RandToBest/1/Bin     0.8651 ± 0.1010   0.7845 ± 0.1256   0.7407 ± 0.1891   0.5836 ± 0.2524   0.9554 ± 0.0343   10/15   10/15
1NN                                0.7369 ± 0.1654   0.7348 ± 0.1664   0.4985 ± 0.2910   0.4918 ± 0.2950   0.0000 ± 0.0000   2/11    4/10
• SFLSDE is the best performing method in comparison with the basic PS and PG algorithms. Furthermore, when it is applied as the optimizer in the hybrid models it achieves the best accuracy/kappa rates. As we stated before, it can sometimes produce overfitting. However, as we can see from the test results of SSMA + SFLSDE, when a high-reduction PS method is applied, SFLSDE is very effective. We can extrapolate this statement to LVQ3 and PSO, which do not produce overfitting over the resulting set selected by SSMA.
• Although LVQ3 does not offer competitive results on its own, when it is used to optimize a PS solution it is able to improve the position of the prototypes appropriately. For instance, as we can see with SSMA + LVQ3 in comparison with SSMA, LVQ3 works properly when it starts from a good solution. Although PSO and DE outperform LVQ3 as optimizers, an advantage of LVQ3 is that it is faster: Table 13 shows the average runtime³ of the optimizers over small data sets. As we can see, the learning time of LVQ3 is clearly lower than that of PSO and DE, which codify a complete solution per individual.
• With the same reduction rate, PSO outperforms LVQ3, and it is more effective when we use it in the hybrid models. Nevertheless, judging from the obtained results and in comparison with the DE algorithms, PSO is probably affected by a lack of convergence because of the absence of an adaptive process to improve its own parameters during the evolution.

6.5. Analysis and results of hybrid models over large size data sets

In this section we want to check whether the performance of the hybrid models is maintained when dealing with large data sets. Table 14 shows this experiment.

We briefly summarize some interesting facts:

• When dealing with large data sets, the reduction rate must be taken into consideration as one of the main parameters to improve the efficiency of the 1NN rule. The DE model has been fixed with a high reduction rate (0.99), and the advanced proposal SFLSDE outperforms the rest of the basic PG and PS techniques, which are far from achieving this reduction rate.
• SSMA was proposed to cover a drawback of the conventional evolutionary PS methods: their lack of convergence when facing large problems. We can observe that it is the best PS method and that its performance is improved when we hybridize it with an optimization procedure. SSMA provides a promising solution which enables any optimization process, including LVQ3 (which does not offer great results in combination with ICF and DROP3), to converge quickly.

Table 14
Hybrid models with large data sets (mean ± standard deviation; '+'/'+=' as in Table 4).

Algorithm                          Acc. train        Acc. test         Kappa train       Kappa test        Reduction         Acc Tst +/+=   Kappa Tst +/+=
DROP3                              0.7744 ± 0.0232   0.7472 ± 0.2256   0.5592 ± 0.3079   0.5119 ± 0.3248   0.9061 ± 0.0569   1/5     1/5
ICF                                0.6781 ± 0.1863   0.6621 ± 0.2016   0.4202 ± 0.3079   0.3940 ± 0.3143   0.8037 ± 0.1654   0/2     0/2
SSMA                               0.8493 ± 0.0021   0.8196 ± 0.2220   0.6725 ± 0.3134   0.6221 ± 0.3177   0.9844 ± 0.0100   9/14    9/14
LVQ3                               0.6840 ± 0.0057   0.6767 ± 0.2680   0.4409 ± 0.2926   0.4264 ± 0.2962   0.9899 ± 0.0011   0/4     0/4
PSO                                0.8022 ± 0.0055   0.8049 ± 0.2136   0.6177 ± 0.2887   0.5948 ± 0.2880   0.9899 ± 0.0011   3/13    3/13
SFLSDE/RandToBest/1/Bin            0.8414 ± 0.0028   0.8236 ± 0.2199   0.6598 ± 0.3079   0.6240 ± 0.3131   0.9901 ± 0.0002   7/12    7/12
DROP3 + LVQ3                       0.7730 ± 0.2166   0.7412 ± 0.2382   0.5661 ± 0.3140   0.5106 ± 0.3308   0.9061 ± 0.0569   3/6     3/6
DROP3 + PSO                        0.8386 ± 0.1913   0.7974 ± 0.2236   0.6496 ± 0.2911   0.5768 ± 0.3130   0.9061 ± 0.0569   5/9     5/9
DROP3 + SFLSDE/RandToBest/1/Bin    0.8538 ± 0.1868   0.8152 ± 0.2260   0.6843 ± 0.2875   0.6173 ± 0.3135   0.9061 ± 0.0569   6/14    6/14
ICF + LVQ3                         0.6851 ± 0.1796   0.6641 ± 0.1997   0.4297 ± 0.3042   0.3960 ± 0.3133   0.8302 ± 0.1386   0/3     0/3
ICF + PSO                          0.8397 ± 0.1950   0.8046 ± 0.2259   0.6444 ± 0.3050   0.5846 ± 0.3226   0.8302 ± 0.1386   5/12    5/12
ICF + SFLSDE/RandToBest/1/Bin      0.8367 ± 0.1924   0.8145 ± 0.2260   0.6434 ± 0.2933   0.6082 ± 0.3151   0.8302 ± 0.1386   6/13    6/13
SSMA + LVQ3                        0.8534 ± 0.1998   0.8244 ± 0.2197   0.6793 ± 0.3120   0.6312 ± 0.3163   0.9847 ± 0.0099   6/15    6/15
SSMA + PSO                         0.8576 ± 0.2007   0.8241 ± 0.2225   0.6970 ± 0.3030   0.6384 ± 0.3092   0.9847 ± 0.0099   8/15    8/15
SSMA + SFLSDE/RandToBest/1/Bin     0.8635 ± 0.1979   0.8291 ± 0.2213   0.7056 ± 0.3005   0.6442 ± 0.3105   0.9847 ± 0.0099   11/15   11/15
1NN                                0.8197 ± 0.0023   0.8072 ± 0.0100   0.6195 ± 0.0229   0.5948 ± 0.0181   0.0000 ± 0.0000   3/15    3/15
7. Conclusions
In this work, we have presented differential evolution and its
recent advanced proposals as a data reduction technique. Specifically, it was used to optimize the positioning of the prototypes for
the nearest neighbor algorithm, acting as a prototype generation
method.
The first aim of this paper is to determine which proposed DE
algorithm works properly to tackle the PG problem. Specifically, we
have studied the different mutation strategies recognized in the
literature, and the recent approaches to adapt the parameters of
this evolutionary algorithm in order to find a good balance between
exploration and exploitation.
The second contribution of this paper shows the good relation
between PS and PG in obtaining hybrid algorithms that allow us to
find very promising solutions. Hybrid models are able to tackle
several drawbacks of the isolated PS and PG methods. Concretely,
we have analyzed the use of positioning adjustment algorithms as
an optimization procedure after a previous PS stage. Our DE model
is an appropriate optimizer which has reported the best results in
terms of accuracy and reduction rate.
The wide experimental study performed has allowed us to justify the behavior of the DE algorithms when dealing with small and large data sets. These results have been contrasted with several non-parametric statistical procedures, which have reinforced the conclusions drawn.
Acknowledgement
Supported by the Spanish Ministry of Science and Technology
under Project TIN2008-06681-C06-01.
Appendix A. Friedman Aligned Ranks and adjusted p-values
The Friedman test is based on n sets of ranks, one set for each
data set in our case; and the performances of the algorithms
analyzed are ranked separately for each data set. Such a ranking
scheme allows for intra-set comparisons only, since inter-set
comparisons are not meaningful. When the number of algorithms
for comparison is small, this may pose a disadvantage. In such
cases, comparability among data sets is desirable and we can
employ the method of aligned ranks [70].
In this technique, a value of location is computed as the average
performance achieved by all algorithms in each data set. Then it
calculates the difference between the performance obtained by an algorithm and the value of location. This step is repeated for all algorithms and data sets. The resulting differences, called aligned
observations, which keep their identities with respect to the data
set and the combination of algorithms to which they belong, are
then ranked from 1 to kn relative to each other. Then, the ranking
scheme is the same as that employed by a multiple comparison
procedure which employs independent samples; such as the
Kruskal–Wallis test [72]. The ranks assigned to the aligned
observations are called aligned ranks.
The Friedman Aligned Ranks test statistic can be written as

T = \frac{(k-1)\left[\sum_{j=1}^{k} \hat{R}_{\cdot j}^{2} - (kn^{2}/4)(kn+1)^{2}\right]}{\frac{kn(kn+1)(2kn+1)}{6} - \frac{1}{k}\sum_{i=1}^{n} \hat{R}_{i\cdot}^{2}} \qquad (17)

where \hat{R}_{i\cdot} is the rank total of the i-th data set and \hat{R}_{\cdot j} is the rank total of the j-th algorithm.
The test statistic T is compared for significance with a chi-square distribution with k − 1 degrees of freedom. Critical values can be found in Table A3 of [66]. Furthermore, the p-value can be computed through normal approximations [73]. If the null hypothesis is rejected, we can proceed with a post-hoc test. In this study, we use the Holm post-hoc procedure.
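A sketch of how the statistic of Eq. (17) can be computed (our own illustration; it assumes a performance matrix with one row per data set and one column per algorithm, and relies on scipy for ranking and the chi-square tail):

```python
import numpy as np
from scipy.stats import rankdata, chi2

def friedman_aligned(results):
    """Friedman Aligned Ranks test of Eq. (17). results: (n data sets,
    k algorithms). Returns the statistic T and its chi-square p-value."""
    n, k = results.shape
    aligned = results - results.mean(axis=1, keepdims=True)  # aligned observations
    ranks = rankdata(aligned.ravel()).reshape(n, k)          # ranks 1..kn
    Rj = ranks.sum(axis=0)   # rank total of each algorithm
    Ri = ranks.sum(axis=1)   # rank total of each data set
    T = (k - 1) * (np.sum(Rj**2) - (k * n**2 / 4) * (k*n + 1)**2) / (
        k*n*(k*n + 1)*(2*k*n + 1)/6 - np.sum(Ri**2)/k)
    return T, chi2.sf(T, k - 1)
```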
We focus on the comparison between a control method, which
is usually the proposed method, and a set of algorithms used in the
empirical study. This set of comparisons is associated with a set or
family of hypotheses, all of which are related to the control method.
Any of the post-hoc tests is suitable for application to nonparametric tests working over a family of hypotheses.
The test statistic for comparing the i-th algorithm and j-th
algorithm depends on the main non-parametric procedure used. In
this case, it depends on the Friedman Aligned Ranks test:
Since the set of related rankings is converted to absolute
rankings, the expression for computing the test statistic in Friedman Aligned Ranks is the same as that used by the Kruskal–Wallis
test [72,74]
z = \left(\hat{R}_{i} - \hat{R}_{j}\right) \Big/ \sqrt{\frac{k(n+1)}{6}} \qquad (18)

where \hat{R}_{i} and \hat{R}_{j} are the average rankings by Friedman Aligned Ranks of the algorithms compared.
In statistical hypothesis testing, the p-value is the probability of
obtaining a result at least as extreme as the one that was actually
observed, assuming that the null hypothesis is true. It is a useful
and interesting datum for many consumers of statistical analysis. A
p-value provides information about whether a statistical hypothesis test is significant or not, and it also indicates something about
‘‘how significant’’ the result is: the smaller the p-value, the stronger
the evidence against the null hypothesis. Most importantly, it does
this without committing to a particular level of significance.
When a p-value is considered in a multiple comparison, it reflects the probability error of a certain comparison, but it does not take into account the remaining comparisons belonging to the family. If one is comparing k algorithms and in each comparison the level of significance is α, then in a single comparison the probability of not making a Type I error is (1 − α), and the probability of not making a Type I error in the k − 1 comparisons is (1 − α)^{k−1}. Then the probability of making one or more Type I errors is 1 − (1 − α)^{k−1}. For instance, if α = 0.05 and k = 10 this is 0.37, which is rather high.
One way to solve this problem is to report adjusted p-values
(APVs) which take into account that multiple tests are conducted.
An APV can be compared directly with any chosen significance level
a. We recommend the use of APVs due to the fact that they provide
more information in a statistical analysis.
The z value in all cases is used to find the corresponding probability (p-value) from the table of the normal distribution N(0,1), which is then compared with an appropriate level of significance α [66, Table A1]. The post-hoc tests differ in the way they adjust the value of α to compensate for multiple comparisons.
Next, we will define the Holm procedure and we will explain
how to compute the APVs. The notation used in the computation of
the APVs is as follows:
• Indexes i and j each correspond to a concrete comparison or hypothesis in the family of hypotheses, according to an incremental ordering by their p-values. Index i always refers to the hypothesis whose APV is being computed, and index j refers to another hypothesis in the family.
• p_j is the p-value obtained for the j-th hypothesis.
• k is the number of classifiers being compared.
The Holm procedure adjusts the value of α in a step-down manner. Let p_1, p_2, ..., p_{k−1} be the ordered p-values (smallest to largest), so that p_1 ≤ p_2 ≤ ... ≤ p_{k−1}, and H_1, H_2, ..., H_{k−1} the corresponding hypotheses. The Holm procedure rejects H_1 to H_{i−1} if i is the smallest integer such that p_i > α/(k − i). Holm's step-down procedure starts with the most significant p-value. If p_1 is below α/(k − 1), the corresponding hypothesis is rejected and we are allowed to compare p_2 with α/(k − 2). If the second hypothesis is rejected, the test proceeds with the third, and so on. As soon as a certain null hypothesis cannot be rejected, all the remaining hypotheses are retained as well.

Holm APV_i: min{v; 1}, where v = max{(k − j)p_j : 1 ≤ j ≤ i}.
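A sketch of the Holm APV computation just described (our own illustration; the input is the list of unadjusted p-values of the k − 1 comparisons against the control):

```python
import numpy as np

def holm_apv(p_values):
    """Holm adjusted p-values: APV_i = min{max_{j<=i} (k - j) p_j, 1},
    with hypotheses ordered by increasing p-value."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)      # hypotheses sorted by p-value
    k = len(p) + 1             # number of classifiers being compared
    apv = np.empty_like(p)
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):   # rank j = 1..k-1
        running_max = max(running_max, (k - rank) * p[idx])
        apv[idx] = min(running_max, 1.0)
    return apv
```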
References
[1] T.M. Cover, P.E. Hart, Nearest neighbor pattern classification, IEEE Transactions
on Information Theory 13 (1) (1967) 21–27.
[2] A.N. Papadopoulos, Y. Manolopoulos, Nearest Neighbor Search: A Database
Perspective, Springer, 2004.
[3] I. Kononenko, M. Kukar, Machine Learning and Data Mining: Introduction to
Principles and Algorithms, Horwood Publishing Limited, 2007.
[4] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Machine
Learning 6 (1) (1991) 37–66.
[5] E.K. Garcia, S. Feldman, M.R. Gupta, S. Srivastava, Completely lazy learning, IEEE
Transactions on Knowledge and Data Engineering 22 (9) (2010) 1274–1285.
[6] X. Wu, V. Kumar (Eds.), The Top Ten Algorithms in Data Mining. Chapman &
Hall/CRC Data Mining and Knowledge Discovery, 2009.
[7] B.M. Steele, Exact bootstrap k-nearest neighbor learners, Machine Learning 74
(3) (2009) 235–255.
[8] P. Chaudhuri, A.K. Ghosh, H. Oja, Classification based on hybridization of
parametric and nonparametric classifiers, IEEE Transactions on Pattern
Analysis and Machine Intelligence 31 (7) (2009) 1153–1164.
[9] Y.-C. Liaw, M.-L. Leou, C.-M. Wu, Fast exact k nearest neighbors search using an
orthogonal search tree, Pattern Recognition 43 (6) (2010) 2351–2358.
[10] J. Derrac, S. Garcı́a, F. Herrera, IFS-CoCo: instance and feature selection based on
cooperative coevolution with nearest neighbor rule, Pattern Recognition 43 (6)
(2010) 2082–2105.
[11] D.R. Wilson, T.R. Martinez, Improved heterogeneous distance functions,
Journal of Artificial Intelligence Research 6 (1997) 1–34.
[12] R. Paredes, E. Vidal, Learning weighted metrics to minimize nearest-neighbor
classification error, IEEE Transactions on Pattern Analysis and Machine
Intelligence 28 (7) (2006) 1100–1110.
[13] M.Z. Jahromi, E. Parvinnia, R. John, A method of learning weighted similarity
function to improve the performance of nearest neighbor, Information
Sciences 179 (17) (2009) 2964–2973.
[14] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest
neighbor classification, Journal of Machine Learning Research 10 (2009) 207–244.
[15] Y. Chen, E.K. Garcia, M.R. Gupta, A. Rahimi, L. Cazzanti, Similarity-based
classification: concepts and algorithms, Journal of Machine Learning Research
10 (2009) 747–776.
[16] D. Pyle, Data Preparation for Data Mining, The Morgan Kaufmann Series in Data
Management Systems, Morgan Kaufmann, 1999.
[17] H. Liu, H. Motoda, Feature Extraction, Construction and Selection: A Data
Mining Perspective, Kluwer Academic Publishers, 2001.
[18] Y. Li, B.-L. Lu, Feature selection based on loss-margin of nearest neighbor
classification, Pattern Recognition 42 (9) (2009) 1914–1921.
[19] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning
algorithms, Machine Learning 38 (3) (2000) 257–286.
[20] H.A. Fayed, S.R. Hashem, A.F. Atiya, Self-generating prototypes for pattern
classification, Pattern Recognition 40 (5) (2007) 1498–1509.
[21] P.E. Hart, The condensed nearest neighbor rule, IEEE Transactions on Information Theory 18 (1968) 515–516.
[22] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data,
IEEE Transactions on System, Man and Cybernetics 2 (3) (1972) 408–421.
[23] H. Brighton, C. Mellish, Advances in instance selection for instance-based
learning algorithms, Data Mining and Knowledge Discovery 6 (2) (2002)
153–172.
[24] E. Marchiori, Hit miss networks with applications to instance selection, Journal
of Machine Learning Research 9 (2008) 997–1017.
[25] H.A. Fayed, A.F. Atiya, A novel template reduction approach for the k-nearest
neighbor method, IEEE Transactions on Neural Networks 20 (5) (2009)
890–896.
[26] E. Marchiori, Class conditional nearest neighbor for large margin instance
selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 32
(2) (2010) 364–370.
[27] W. Lam, C.K. Keung, D. Liu, Discovering useful concept prototypes for
classification based on filtering and abstraction, IEEE Transactions on Pattern
Analysis and Machine Intelligence 14 (8) (2002) 1075–1090.
[28] C.-L. Chang, Finding prototypes for nearest neighbor classifiers, IEEE Transactions on Computers 23 (11) (1974) 1179–1184.
[29] T. Kohonen, The self organizing map, Proceedings of the IEEE 78 (9) (1990)
1464–1480.
[30] C.H. Chen, A. Jóźwik, A sample set condensation algorithm for the class
sensitive artificial neural network, Pattern Recognition Letters 17 (8) (1996)
819–823.
[31] S.W. Kim, J. Oomenn, A brief taxonomy and ranking of creative prototype
reduction schemes, Pattern Analysis and Applications 6 (2003) 232–244.
[32] M. Lozano, J.M. Sotoca, J.S. Sánchez, F. Pla, E. Pekalska, R.P.W. Duin, Experimental study on prototype optimisation algorithms for prototype-based
classification in vector spaces, Pattern Recognition 39 (10) (2006) 1827–1838.
[33] J.C. Bezdek, L.I. Kuncheva, Nearest prototype classifier designs: an experimental study, International Journal of Intelligent Systems 16 (2001)
1445–1473.
[34] A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, Springer-Verlag, Berlin, 2003.
[35] A.A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms, Springer-Verlag, Berlin, 2002.
[36] G.L. Pappa, A.A. Freitas, Automating the Design of Data Mining Algorithms: An
Evolutionary Computation Approach, Natural Computing, Springer, 2009.
[37] J.-R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance
selection for data reduction in KDD: an experimental study, IEEE Transactions
on Evolutionary Computation 7 (6) (2003) 561–575.
[38] S. Garcı́a, J.R. Cano, F. Herrera, A memetic algorithm for evolutionary prototype
selection: a scaling up approach, Pattern Recognition 41 (8) (2008) 2693–2709.
[39] L. Nanni, A. Lumini, Particle swarm optimization for prototype reduction,
Neurocomputing 72 (4–6) (2008) 1092–1097.
[40] A. Cervantes, I.M. Galván, P. Isasi, AMPSO: a new particle swarm method for
nearest neighborhood classification, IEEE Transactions on Systems, Man, and
Cybernetics—Part B: Cybernetics 39 (5) (2009) 1082–1091.
[41] N. Krasnogor, J. Smith, A tutorial for competent memetic algorithms: model,
taxonomy, and design issues, IEEE Transactions on Evolutionary Computation
9 (5) (2005) 474–488.
[42] F. Fernández, P. Isasi, Evolutionary design of nearest prototype classifiers,
Journal of Heuristics 10 (4) (2004) 431–454.
[43] U. Garain, Prototype reduction using an artificial immune model, Pattern
Analysis and Applications 11 (3–4) (2008) 353–363.
[44] J. Kennedy, R. Eberhart, Learning representative exemplars of concepts: an
initial case study, in: Proceedings of the IEEE International Conference on
Neural Networks, 1995, pp. 1942–1948.
[45] R. Poli, J. Kennedy, T. Blackwell, Particle swarm optimization, Swarm Intelligence 1 (1) (2007) 33–57.
[46] R. Storn, K.V. Price, Differential evolution—a simple and efficient heuristic for
global optimization over continuous spaces, Journal of Global Optimization 11
(10) (1997) 341–359.
[47] K.V. Price, R.M. Storn, J.A. Lampinen, Differential Evolution A Practical
Approach to Global Optimization, Natural Computing Series, 2005.
[48] I. Triguero, S. Garcı́a, F. Herrera, A preliminary study on the use of differential
evolution for adjusting the position of examples in nearest neighbor classification, in: Proceedings of the IEEE Congress on Evolutionary Computation, 2010,
pp. 630–637.
[49] A.K. Qin, V.L. Huang, P.N. Suganthan, Differential evolution algorithm with
strategy adaptation for global numerical optimization, IEEE Transactions on
Evolutionary Computation 13 (2) (2009) 398–417.
[50] S. Rahnamayan, H. Tizhoosh, M. Salama, Opposition-based differential evolution, IEEE Transaction on Evolutionary Computation 12 (1) (2008) 64–79.
[51] S. Das, A. Abraham, U.K. Chakraborty, A. Konar, Differential evolution using a
neighborhood-based mutation operator, IEEE Transactions on Evolutionary
Computation 13 (3) (2009) 526–553.
[52] J. Zhang, A.C. Sanderson, JADE: adaptive differential evolution with optional
external archive, IEEE Transactions on Evolutionary Computation 13 (5) (2009)
945–958.
[53] F. Neri, V. Tirronen, Scale factor local search in differential evolution, Memetic
Computing 1 (2) (2009) 153–171.
[54] J.S. Sánchez, High training set size reduction by space partitioning and
prototype abstraction, Pattern Recognition 37 (7) (2004) 1561–1564.
[55] H.A. David, H.N. Nagaraja, Order Statistics, third ed., Wiley, 2003.
[56] T.J. Rothenberg, F.M. Fisher, C.B. Tilanus, A note on estimation from a cauchy
sample, Journal of the American Statistical Association 59 (306) (1966) 460–463.
[57] E. Alpaydin, Introduction to Machine Learning, second ed., MIT Press, Cambridge, MA, 2010.
[58] I.H. Witten, E. Frank, Data Mining: Practical machine learning tools and
techniques, second ed., Morgan Kaufmann, San Francisco, 2005.
[59] A. Ben-David, A lot of randomness is hiding in accuracy, Engineering Applications of Artificial Intelligence 20 (2007) 875–885.
[60] A. Asuncion, D. Newman, UCI machine learning repository, 2007. URL: http://www.ics.uci.edu/~mlearn/MLRepository.html.
[61] J. Alcalá-Fdez, L. Sánchez, S. Garcı́a, M.J. del Jesus, S. Ventura, J.M. Garrell,
J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, F. Herrera, KEEL: a
software tool to assess evolutionary algorithms for data mining problems, Soft
Computing 13 (3) (2009) 307–318.
[62] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, PrenticeHall, London, 1982.
[63] R. Nisbet, J. Elder, G. Miner, Handbook of Statistical Analysis and Data Mining
Applications, Elsevier, 2009.
[64] L.N. de Castro, J. Timmis, Artificial Immune Systems: A New Computational
Intelligence Approach, Springer, 2002.
[65] S. Garcı́a, A. Fernández, J. Luengo, F. Herrera, A study of statistical techniques
and performance measures for genetics—based machine learning: accuracy
and interpretability, Soft Computing 13 (10) (2009) 959–977.
[66] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, fourth ed., Chapman & Hall/CRC, 2006.
[67] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal
of Machine Learning Research 7 (2006) 1–30.
[68] S. Garcı́a, F. Herrera, An extension on ‘‘statistical comparisons of classifiers over
multiple data sets’’ for all pairwise comparisons, Journal of Machine Learning
Research 9 (2008) 2677–2694.
[69] S. Garcı́a, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for
multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Information Sciences
180 (2010) 2044–2064.
[70] J. Hodges, E. Lehmann, Ranks methods for combination of independent experiments
in analysis of variance, Annals of Mathematical Statistics 33 (1962) 482–497.
[71] S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian
Journal of Statistics 6 (1979) 65–70.
[72] W.H. Kruskal, W.A. Wallis, Use of ranks in one-criterion variance analysis,
Journal of the American Statistical Association 47 (1952) 583–621.
[73] M. Abramowitz, Handbook of Mathematical Functions, With Formulas, Graphs,
and Mathematical Tables, Dover Publications, 1974.
[74] W.W. Daniel, Applied Nonparametric Statistics, Duxbury Thomson Learning,
1990.
Isaac Triguero Velázquez received the M.Sc. degree in Computer Science from the University of Granada, Granada, Spain, in 2009.
He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data
mining, data reduction and evolutionary algorithms.
Salvador Garcı́a López received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively.
He is currently an Assistant Professor in the Department of Computer Science, University of Jaén, Jaén, Spain. His research interests include data mining, data reduction, data
complexity, imbalanced learning, statistical inference and evolutionary algorithms.
Francisco Herrera Triguero received the M.Sc. degree in Mathematics in 1988 and the Ph.D. degree in Mathematics in 1991, both from the University of Granada, Spain.
He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has published more than 150 papers in
international journals. He is coauthor of the book ‘‘Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases’’ (World Scientific, 2001).
Regarding editorial activities, he has co-edited five international books and 20 special issues in international journals on different Soft Computing topics. He acts as associate editor of the journals IEEE Transactions on Fuzzy Systems, Information Sciences, Mathware and Soft Computing, Advances in Fuzzy Systems, Advances in Computational Sciences and Technology, and International Journal of Applied Metaheuristics Computing. He currently serves as area editor of the journal Soft Computing (area of genetic algorithms and genetic fuzzy systems), and as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, and Memetic Computing.
His current research interests include computing with words and decision making, data mining, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy
systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.
1.4 Integrating a Differential Evolution Feature Weighting scheme into Prototype Generation
• I. Triguero, J. Derrac, S. García, F. Herrera, Integrating a Differential Evolution Feature Weighting scheme into Prototype Generation. Neurocomputing 97 (2012) 332–343, doi: 10.1016/j.neucom.2012.06.009.
– Status: Published.
– Impact Factor (JCR 2012): 1.634
– Subject Category: Computer Science, Artificial Intelligence. Ranking 37 / 115 (Q2).
Integrating a differential evolution feature weighting scheme
into prototype generation
Isaac Triguero a,*, Joaquín Derrac a, Salvador García b, Francisco Herrera a

a Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
b Department of Computer Science, University of Jaén, 23071 Jaén, Spain
* Corresponding author. Tel.: +34 958 240598; fax: +34 958 243317. E-mail addresses: triguero@decsai.ugr.es (I. Triguero), jderrac@decsai.ugr.es (J. Derrac), sglopez@ujaen.es (S. García), herrera@decsai.ugr.es (F. Herrera).
Article history: Received 23 November 2011; received in revised form 13 March 2012; accepted 1 June 2012; available online 1 July 2012. Communicated by M. Bianchini.

Keywords: Differential evolution; Prototype generation; Prototype selection; Feature weighting; Nearest neighbor; Classification

Abstract: Prototype generation techniques have arisen as very competitive methods for enhancing the nearest neighbor classifier through data reduction. Within the prototype generation methodology, the methods that adjust the positioning of the prototypes have shown outstanding performance. Evolutionary algorithms have been used to optimize the positioning of the prototypes with promising results. However, these results can be improved even further if other data reduction techniques, such as prototype selection and feature weighting, are considered.

In this paper, we propose a hybrid evolutionary scheme for data reduction, incorporating a new feature weighting scheme within two different prototype generation methodologies. Specifically, we focus on a self-adaptive differential evolution algorithm in order to optimize both the feature weights and the placement of the prototypes. The results are contrasted with nonparametric statistical tests, showing that our proposal outperforms previously proposed methods, thus showing itself to be a suitable tool for enhancing the performance of the nearest neighbor classifier.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction
The design of classifiers can be considered one of the
most important tasks in machine learning and data mining [1,2].
Most machine learning methods build a model during the learning process, known as eager learning methods [3], but there are
some approaches where the algorithm does not need a model.
These algorithms are known as lazy learning methods [4].
The Nearest Neighbor (NN) rule [5] is a simple and effective
supervised classification technique which belongs to the lazy
learning family of methods. NN is a nonparametric classifier,
which requires that all training data instances are stored. Unseen
cases are classified by finding the class labels of the closest
instances to them. The extended version of NN to k neighbors
(kNN) is considered one of the most influential data mining
algorithms [6] and it has attracted much attention and research
in recent years [7,8]. However, NN may have several disadvantages, such as high computational cost, high storage requirement
and sensitivity to noise, which can affect its performance.
Furthermore, NN makes predictions over existing data and it
assumes that input data perfectly delimits the decision boundaries among classes.
Many approaches have been proposed to improve the performance of the NN rule. One way to simultaneously tackle the
computational complexity, storage requirements, and sensitivity
to noise of NN is based on data reduction [9]. These techniques try
to obtain a reduced version of the original training data, with the
double objective of removing noisy and irrelevant data. Taking
into consideration the feature space, we can highlight Feature
Selection (FS) [10–13] and feature generation/extraction [14] as
the main techniques. FS consists of choosing a representative
subset of features from the original feature space, while feature
generation creates new features to describe the data. From the
perspective of the instances, data reduction can be divided into
Prototype Selection (PS) [15–17] and Prototype Generation (PG)
[18,19]. The former process consists of choosing an appropriate
subset of the original training data, while the latter can also build
new artificial prototypes to better adjust the decision boundaries
between classes in NN classification. In this way, PG does not
assume that input data perfectly defines the decision boundaries
among classes.
Another way to improve the performance of NN is the employment of weighting schemes. Feature Weighting (FW) [20] is a well
known technique which consists of assigning a weight to each
feature of the domain of the problem to modify the way in which
distances between examples are computed [21]. This technique
can be viewed as a generalization of FS algorithms, allowing us to
obtain a soft approximation of the feature relevance degree
assigning a real value as a weight, so different features can
receive different treatments.
Evolutionary algorithms [22] have been successfully used in
different data mining problems [23,24]. Given that PS, PG and FW
problems could be seen as combinatorial and optimization
problems, evolutionary algorithms have been used to solve them
with excellent results [25]. PS can be expressed as a binary space
search problem. To the best of our knowledge, memetic algorithms [26] have provided the best evolutionary model proposed
for PS, called SSMA [27]. PG is expressed as a continuous space
search problem. Evolutionary algorithms for PG are based on the
positioning adjustment of prototypes [28–30], which is a suitable
methodology to optimize the location of prototypes. Concretely,
Differential Evolution (DE) [31,32] and its advanced approaches
[33] have been demonstrated as being the most effective positioning adjustment techniques [34]. Regarding FW methods,
many successful evolutionary proposals, most of them based on
genetic algorithms, have been proposed, applied to the NN
algorithm [35].
Typically, positioning adjustment methods [28,36,30] focus on
the placement process and they do not take into consideration the
selection of the most appropriate number of prototypes per class.
Recently, two different approaches have been proposed in order
to tackle this problem. First, in [37,38], this problem is addressed
by an iterative addition process that determines which classes
need more prototypes to be represented. This algorithm is
denoted as IPADECS. Secondly, in [34], the algorithm SSMA-DEPG
is presented, in which a previous PS stage is applied to provide the
appropriate choice of the number of prototypes per class.
In these techniques, the increase in the size of the data set is a
crucial problem. It has been addressed in PS and PG by using
stratification techniques [39,40]. They split the data set into
various parts to make the application of a prototype reduction
technique easier, using a mechanism to join the solutions of each
part into a global solution.
The aim of this work is to propose a hybrid approach which
combines these two PG methodologies with FW to enhance the
NN rule addressing its main drawbacks. In both schemes, the
most promising feature weights and location of the prototypes
are generated by the SFLSDE algorithm [41] acting as an FW and
PG method respectively. Evolutionary PG methods usually tend to
overfit the training data in a small number of iterations. For this
reason we apply, during the evolutionary optimization process, an
FW stage to modify the fitness function of the PG method and
determine the relevance of each feature.
The hybridization of the PG and FW problems is the main
contribution of this paper, which can be divided into three
objectives:
• To propose a new FW technique based on a self-adaptive DE. To the best of our knowledge, DE has not yet been applied to the FW problem.
• To carry out an empirical study to analyze the hybridization models in terms of classification accuracy. Specifically, we will analyze whether the integration of an FW stage with PG methods improves the quality of the resulting reduced sets.
• To check the behavior of these hybrid approaches when dealing with huge data sets, developing a stratified model with the proposed hybrid scheme.

To test the behavior of these approaches, the experimental study will include a statistical analysis based on nonparametric statistical tests [42]. We shall conduct experiments involving a total of 46 classification data sets with different properties.

In order to organize this paper, Section 2 describes the background of PS, PG, FW, DE and stratification. Section 3 explains the hybridization algorithms proposed. Section 4 discusses the experimental framework and Section 5 presents the analysis of results. Finally, in Section 6 we summarize our conclusions.
2. Background
This section covers the background information necessary to
define and describe our proposals. Section 2.1 presents a formal
definition of the PS and PG problems. Section 2.2 describes the main characteristics of FW. Section 2.3 explains the DE technique.
Finally, Section 2.4 details the characteristics of the stratification
procedure.
2.1. PS and PG problems
This section presents the definition and notation for both PS
and PG problems.
A formal specification of the PS problem is the following: let $x_p$ be an example, where $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, \omega)$, with $x_p$ belonging to a class $\omega$ and to a $D$-dimensional space in which $x_{pi}$ is the value of the $i$-th feature of the $p$-th sample. Then, let us assume that there is a training set $TR$ which consists of $n$ instances $x_p$, and a test set $TS$ composed of $t$ instances $x_q$, with $\omega$ unknown. Let $SS \subseteq TR$ be the subset of selected samples resulting from the execution of a PS algorithm; then we classify a new pattern $x_q$ from $TS$ by the NN rule acting over $SS$.

The purpose of PG is to obtain a prototype generated set $GS$, which consists of $r$, $r < n$, prototypes, which are either selected or generated from the examples of $TR$. The prototypes of the generated set are determined so as to efficiently represent the distributions of the classes and to discriminate well when used to classify the training objects. Their cardinality should be sufficiently small to reduce both the storage and the evaluation time spent by an NN classifier.
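To make this notation concrete, the following minimal Python sketch (ours, not from the paper) classifies a query from TS with the NN rule acting over a reduced set such as SS or GS; the toy arrays are hypothetical:

```python
import numpy as np

def nn_classify(reduced_set, reduced_labels, query):
    """1NN rule: return the class of the closest prototype in the reduced set."""
    dists = np.sqrt(((reduced_set - query) ** 2).sum(axis=1))  # Euclidean distances
    return reduced_labels[np.argmin(dists)]

# Hypothetical reduced set produced by a PS or PG method from TR.
GS = np.array([[0.1, 0.2], [0.8, 0.9]])
GS_labels = np.array([0, 1])
print(nn_classify(GS, GS_labels, np.array([0.7, 0.8])))  # -> 1
```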
Both methodologies have been widely studied in the specialized literature. More than 50 PS methods have been proposed. In
general, they can be categorized into three kinds of methods:
condensation [43], edition [44] or hybrid models [27]. A complete
review of this topic is proposed in [17]. Regarding PG techniques,
they can be divided into several families depending on the main
heuristic operation followed: positioning adjustment [30], class
re-labeling [45], centroid-based [18] and space-splitting [46].
A recent PG review is proposed in [19].
More information about PS and PG approaches can be found at the SCI2S thematic public website on Prototype Reduction in Nearest Neighbor Classification: Prototype Selection and Prototype Generation (http://sci2s.ugr.es/pr/).
2.2. Feature weighting
The aim of FW methods is to reduce the sensitivity to
redundant, irrelevant or noisy features in the NN rule, by
modifying its distance function with weights. These modifications
allow us to perform more robust classification tasks, increasing in
this manner the global accuracy of the classifier.
The most well known distance or dissimilarity measure for the
NN rule is the Euclidean Distance (Eq. (1)), where xp and xq are
two examples and D is their number of features. We will use it
throughout this study as it is simple, easy to optimize, and has
been widely used in the field of instance-based learning [47]:

$$EuclideanDistance(x_p, x_q) = \sqrt{\sum_{i=1}^{D} (x_{pi} - x_{qi})^2} \qquad (1)$$

FW methods often extend this equation by applying a different weight $W_i$ to each feature, modifying the way in which the distance measure is computed (Eq. (2)):

$$FWDist(x_p, x_q) = \sqrt{\sum_{i=1}^{D} W_i \cdot (x_{pi} - x_{qi})^2} \qquad (2)$$
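As a minimal sketch (ours) of Eq. (2), assuming the weights already lie in [0, 1]: setting $W_i = 0$ discards the $i$-th feature entirely, which is how FW generalizes FS:

```python
import numpy as np

def fw_dist(xp, xq, w):
    """Feature-weighted Euclidean distance of Eq. (2)."""
    return np.sqrt(np.sum(w * (xp - xq) ** 2))

xp, xq = np.array([0.2, 0.9, 0.5]), np.array([0.4, 0.1, 0.5])
w = np.array([1.0, 0.0, 0.6])  # a zero weight removes the second feature
print(fw_dist(xp, xq, w))      # only the first and third features contribute
```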
This technique has been widely used in the literature. As far as
we know, the most complete study performed can be found in
[20], in which a review of several FW methods for lazy learning
algorithms is presented (with most of them applied to improve
the performance of the NN rule). In this review, FW techniques
were categorized by several dimensions, according to its weight
learning bias, the weight space (binary or continuous), the
representation of features, their generality and the degree of
employment of domain specific knowledge.
A wide number of FW techniques are available in the literature, both classical (see [20]) and recent (for example, [21,35]).
The most well known group of them is the family of Relief-based
algorithms. The Relief algorithm [48] (which was originally an FS
method) has been widely studied and modified, producing some
interesting versions of the original approach [49]. Some of them
are based on ReliefF [50] which is the first step of development of
Relief-based methods as FW techniques [51].
Finally, it is important to note that it is possible to find some
approaches dealing simultaneously with FW and FS tasks, for
instance, inside a Tabu Search procedure [52] or by managing
ensemble-based approaches [53].
2.3. Differential evolution
DE follows the general procedure of an evolutionary algorithm [33]. DE starts with a population of $NP$ solutions, so-called individuals. The initial population should cover the entire search space as much as possible. In some problems this is achieved by uniformly randomizing individuals, but in other problems, such as the PG problem, basic knowledge of the problem is available and the use of other initialization mechanisms is more effective. The subsequent generations are denoted by $G = 0, 1, \ldots, G_{max}$. In DE, it is common to denote each individual as a $D$-dimensional vector $X_{i,G} = \{x^1_{i,G}, \ldots, x^D_{i,G}\}$, called a ``target vector''.

After initialization, DE applies the mutation operator to generate a mutant vector $V_{i,G}$ with respect to each individual $X_{i,G}$ in the current population. For each target $X_{i,G}$ at generation $G$, its associated mutant vector is $V_{i,G} = \{V^1_{i,G}, \ldots, V^D_{i,G}\}$. The method of creating this mutant vector is what differentiates one DE scheme from another. In this work, we focus on the RandToBest/1 strategy, which generates the mutant vector as follows:

$$V_{i,G} = X_{i,G} + F \cdot (X_{best,G} - X_{i,G}) + F \cdot (X_{r^i_1,G} - X_{r^i_2,G}) \qquad (3)$$

The indices $r^i_1$, $r^i_2$ are mutually exclusive integers randomly generated within the range $[1, NP]$, which are also different from the base index $i$. The scaling factor $F$ is a positive control parameter for scaling the difference vectors.
After the mutation phase, the crossover operation is applied to each pair of the target vector $X_{i,G}$ and its corresponding mutant vector $V_{i,G}$ to generate a new trial vector, denoted $U_{i,G}$. We focus on the binomial crossover scheme, which is performed on each component whenever a randomly picked number between 0 and 1 is less than or equal to the crossover rate ($CR$), which controls the fraction of parameter values copied from the mutant vector. Then, we must decide which individual should survive in the next generation $G+1$. The selection operator is described as follows:

$$X_{i,G+1} = \begin{cases} U_{i,G} & \text{if } \mathcal{F}(U_{i,G}) \text{ is better than } \mathcal{F}(X_{i,G}) \\ X_{i,G} & \text{otherwise} \end{cases}$$
where F is the fitness function to be minimized. If the new trial
vector yields a solution equal to or better than the target vector, it
replaces the corresponding target vector in the next generation;
otherwise the target is retained in the population. Therefore, the
population always gets better or retains the same fitness values,
but never deteriorates. This one-to-one selection procedure is
generally kept fixed in most of the DE algorithms.
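To illustrate the operators just described, here is a minimal sketch (ours, plain DE rather than the self-adaptive SFLSDE discussed below) of one generation with RandToBest/1 mutation (Eq. (3)), binomial crossover and one-to-one selection, minimizing a hypothetical fitness:

```python
import numpy as np

rng = np.random.default_rng(0)

def de_generation(pop, fitness, F=0.5, CR=0.9):
    """One generation of classic DE: RandToBest/1 mutation (Eq. (3)),
    binomial crossover and one-to-one survivor selection (fitness minimized)."""
    NP, D = pop.shape
    fit = np.array([fitness(x) for x in pop])
    best = pop[np.argmin(fit)]                     # X_best,G
    next_pop = pop.copy()
    for i in range(NP):
        r1, r2 = rng.choice([j for j in range(NP) if j != i], size=2, replace=False)
        v = pop[i] + F * (best - pop[i]) + F * (pop[r1] - pop[r2])  # mutant V_i,G
        mask = rng.random(D) <= CR                 # binomial crossover
        mask[rng.integers(D)] = True               # copy at least one component
        u = np.where(mask, v, pop[i])              # trial vector U_i,G
        if fitness(u) <= fit[i]:                   # one-to-one selection
            next_pop[i] = u
    return next_pop

pop = rng.random((10, 3))                          # NP = 10 individuals, D = 3
pop = de_generation(pop, lambda x: float(np.sum(x ** 2)))
```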
The success of DE in solving a specific problem crucially
depends on choosing the appropriate mutation strategy and its
associated control parameter values (F and CR) that determine the
convergence speed. Hence, a fixed selection of these parameters
can produce slow and/or premature convergence depending on
the problem. Thus, researchers have investigated the parameter
adaptation mechanisms to improve the performance of the basic
DE algorithm [54–56].
One of the most successful adaptive DE algorithms is the Scale
Factor Local Search in Differential Evolution (SFLSDE) proposed
by [41]. This method was established as the best DE technique for
PG in [34].
2.4. Stratification for prototype reduction schemes
When performing data reduction, the scaling up problem
appears as the number of training examples increases beyond
the capacity of the prototype reduction algorithms, harming their
effectiveness and efficiency. This is a crucial problem which must
be overcome in most practical applications of data reduction
methods. In order to avoid it, in this work we will consider the use
of the stratification strategy, initially proposed in [39] for PS, and
[40] for PG.
This stratification strategy splits the training data into disjoint
strata with equal class distribution. The initial data set D is
divided into two sets, TR and TS, as usual (for example a 10th of
the data for TS, and the rest for TR in a 10-fold cross validation).
Then, $TR$ is divided into $t$ disjoint sets $TR_j$, strata of equal size, $TR_1, TR_2, \ldots, TR_t$, maintaining the class distribution within each subset. In this manner, the sets $TR$ and $TS$ can be represented as follows:

$$TR = \bigcup_{j=1}^{t} TR_j, \qquad TS = D \setminus TR \qquad (4)$$

Then, a prototype reduction method is applied to each $TR_j$, obtaining a reduced set $RS_j$ for each partition.

In PS and PG stratification procedures, the final reduced set is obtained by joining every $RS_j$ obtained; it is denoted as the Stratified Reduced Set ($SRS$):

$$SRS = \bigcup_{j=1}^{t} RS_j \qquad (5)$$

When the $SRS$ has been obtained, it is ready to be used by an NN classifier to classify the instances of $TS$.
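A minimal sketch (ours) of the splitting step: each class is shuffled and dealt into t strata, so every stratum keeps the class distribution of TR, as Eq. (4) requires; a prototype reduction method would then be run on each stratum and the resulting RS_j concatenated into the SRS of Eq. (5):

```python
import numpy as np

def split_into_strata(X, y, t, rng):
    """Split TR into t disjoint strata, keeping the class distribution
    of TR within each stratum (the decomposition of Eq. (4))."""
    strata = [[] for _ in range(t)]
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)                        # random assignment within the class
        for k, chunk in enumerate(np.array_split(idx, t)):
            strata[k].extend(chunk.tolist())
    return [np.array(s) for s in strata]

rng = np.random.default_rng(1)
X, y = rng.random((1000, 5)), rng.integers(0, 3, size=1000)
strata = split_into_strata(X, y, t=4, rng=rng)
# A prototype reduction method runs on each (X[s], y[s]) to obtain RS_j;
# concatenating the RS_j yields the SRS of Eq. (5).
```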
The use of the stratification procedure does not have a great
cost in time. Usually, the process of splitting the training data into
strata, and joining them when the prototype reduction method
has been applied, is not time-consuming, as it does not require
any kind of additional processing. Thus, the time needed for the
stratified execution is almost the same as that taken in the
execution of the prototype reduction method in each strata,
which is significantly lower than the time spent if no stratification
is applied, due to the time complexity of the PS and PG methods,
which most of the time is $O(N^2)$ or higher.
The prototypes present in TR are independent of each other, so
the distribution of the data into strata will not degrade their
representation capabilities if the class distribution is maintained.
The number of strata, which should be fixed empirically, will determine their size. By using a proper number of strata it is
possible to greatly reduce the training set size. This situation
allows us to avoid the drawbacks that appeared due to the scaling
up problem.
3. Hybrid evolutionary models integrating feature weighting
and prototype generation
In this section we describe in depth the proposed hybrid
approaches and their main components. First of all, we present
the proposed FW scheme based on DE (Section 3.1). Next, as we
established previously, we will design a hybrid model with the
proposed FW for each of the two most effective PG methodologies,
IPADECS [37] (Section 3.2) and SSMA-DEPG [34] (Section 3.3).
Finally, we develop a stratified model for our hybrid proposals
(Section 3.4).
3.1. Differential evolution for feature weighting
As we stated before, FW can be viewed as a continuous space
search problem in which we want to determine the most appropriate weights for each feature in order to enhance the NN rule.
Specifically, we propose the use of a DE procedure to obtain the
best weights, which allows a given reduced set to increase the
performance of the classification made in TR. We denote this
algorithm as DEFW.
DEFW starts with a population of $NP$ individuals $X_{i,G}$. In order to encode a weight vector in a DE individual, the algorithm uses a real-valued vector containing $D$ elements corresponding to the $D$ attributes, whose values range in the interval [0,1]. This means that each individual $X_{i,G}$ in the population encodes a complete solution to the FW problem. Following the ideas established in [54,55,41], the initial population should cover the entire search space as much as possible, which is achieved by uniformly randomizing individuals within the defined range.
After the initialization process, DEFW enters a loop in which the mutation and crossover operators, explained in Section 2.3, guide the optimization of the feature weights by generating new trial vectors $U_{i,G}$. After applying these operators, we check whether any value falls outside the range $[\theta, 1]$. If a computed value is greater than 1, we truncate it to 1. Furthermore, based on [48], if the value is lower than a threshold $\theta$, we consider the feature to be irrelevant and it is therefore set to 0. In our experiments, $\theta$ has been fixed empirically at 0.2.

Finally, the selection operator must decide which generated trial vectors should survive in the population of the next generation $G+1$. For our purpose, the NN rule guides this operator. The instances in TR are classified with the prototypes of the given reduced set, but in this case the distance measure for the NN rule is modified according to Eq. (2), where the weights $W_i$ are obtained from $X_{i,G}$ and $U_{i,G}$. Their corresponding fitness values are measured as the accuracy obtained, which represents the number of successful hits (correct classifications) relative to the total number of classifications. We try to maximize this value, so the selection operator can be viewed as follows:

$$X_{i,G+1} = \begin{cases} U_{i,G} & \text{if } accuracy(ReducedSet, U_{i,G}) \geq accuracy(ReducedSet, X_{i,G}) \\ X_{i,G} & \text{otherwise} \end{cases} \qquad (6)$$

In case of a tie between the accuracy values, we select $U_{i,G}$ in order to give the mutated individual the opportunity to enter the population. In order to overcome the limitation of the parameter selection ($F$ and $CR$), we use the ideas established in [41] to implement a self-adaptive DE scheme.

3.2. IPADECS-DEFW: hybridization with IPADECS

The IPADECS algorithm [37] follows an iterative prototype adjustment scheme with an incremental approach. At each step, an optimization procedure is used to adjust the position of the prototypes, adding new ones if needed. The aim of this algorithm is to determine the most appropriate number of prototypes per class and to adjust their positioning during the evolutionary process. Specifically, IPADECS uses the SFLSDE technique as an optimizer with a complete solution per individual codification. At the end of the process, IPADECS returns the best GS found.

The hybrid model which composes IPADECS and DEFW can basically be described as the combination of an IPADECS stage and then a DEFW stage to determine the best weights. Fig. 1 shows the pseudo-code of this hybrid scheme. The algorithm proceeds as follows:

Initially, we perform an IPADECS stage in which all the features have a relevance degree of 1.0 (Instructions 1–4). Then, the algorithm enters a loop in which we try to find the most appropriate weights and placement of the prototypes:
– Instruction 6 performs a DE optimization of the feature
weights, so that, the best GS obtained from the IPADECS
algorithm is used to determine the appropriate weights, as
we described in Section 3.1. Furthermore, the current
weights are inserted as one of the individuals of the FW
population. In this way, we ensure that the FW scheme
does not degrade the performance of the GS obtained with
IPADECS, due to the selection operator used in DEFW.
– Next, Instruction 7 generates a new GS, with IPADECS, but
in this case, the optimization process takes into consideration the new weights to calculate the distances between
prototypes (see Eq. (2)). The underlying idea of this instruction is that IPADECS should generate a different GS due to
the fact that the distance measure has changed, and therefore, the continuous search space has been modified.
1:  Weights[1..D] = 1.0
2:  bestWeights[1..D] = Weights[1..D]
3:  GS = IPADECS(Weights)
4:  Accuracy = EvaluateWithWeights(GS, TR, Weights)
5:  for i = 1 to MAXITER do
6:      newWeights[1..D] = DEFW(GS, Weights)
7:      GSaux = IPADECS(newWeights)
8:      AccuracyTrial = EvaluateWithWeights(GSaux, TR, newWeights)
9:      if AccuracyTrial > Accuracy then
10:         Accuracy = AccuracyTrial
11:         GS = GSaux
12:     end if
13:     Weights = newWeights
14: end for
15: return GS, Weights

Fig. 1. Hybridization of IPADECS and DEFW.
– After this process, we have to check that the new positioning of prototypes GSaux, with its respective weights, yields an improvement of the accuracy rate with respect to the previous GS. If the computed accuracy of the new GSaux and its respective weights is greater than the best accuracy found, we save GSaux as the current GS (Instructions 8–12).
– Instruction 13 stores the obtained weights, which will be used in the next iteration.
After a previously fixed number of iterations, the hybrid model
returns the best GS and its respective best feature weights.
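Two pieces that the pseudo-code of Fig. 1 leaves implicit are the repair of out-of-range weights and the accuracy-based fitness guiding the DEFW selection of Eq. (6). The following sketch (ours; names are hypothetical) illustrates both, with the threshold θ = 0.2 used in Section 3.1:

```python
import numpy as np

THETA = 0.2  # irrelevance threshold of Section 3.1

def repair_weights(w):
    """Keep each weight in [THETA, 1]: truncate values above 1 and
    mark features with weights below THETA as irrelevant (weight 0)."""
    w = np.minimum(w, 1.0)
    w[w < THETA] = 0.0
    return w

def weighted_accuracy(GS_X, GS_y, TR_X, TR_y, w):
    """Fitness of Eq. (6): fraction of TR classified correctly by 1NN
    over the reduced set, using the weighted distance of Eq. (2)."""
    hits = 0
    for x, label in zip(TR_X, TR_y):
        d = ((GS_X - x) ** 2 * w).sum(axis=1)   # squared weighted distances
        hits += int(GS_y[np.argmin(d)] == label)
    return hits / len(TR_y)

# DEFW selection sketch: a trial weight vector u replaces the target on
# improvement or tie, giving the mutated individual a chance to enter:
# if weighted_accuracy(GS_X, GS_y, TR_X, TR_y, repair_weights(u)) >= acc_target:
#     population[i] = u
```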
3.3. SSMA-DEPGFW: hybridization with SSMA-DEPG
The SSMA-DEPG approach [34] uses a PS algorithm prior to the
adjustment process to initialize a subset of prototypes, finding a
promising selection of prototypes per class.
Specifically, the SSMA algorithm is applied [27]. This is a
memetic algorithm which makes use of a local search or meme
specifically developed for the PS problem. The interweaving of the
global and local search phases allows the two to influence each
other. The resulting SS is inserted as one of the individuals of the
population in the SFLSDE algorithm, which, in this case, is acting as a
PG method. Next, it performs mutation and crossover operations to
generate new trial solutions. Again, the NN rule guides the selection
operator, therefore the SSMA-DEPG returns the best location of the
prototypes, which increases the classification rate.
Fig. 2 outlines the hybrid model. To hybridize FW with SSMA-DEPG, the method is carried out as follows:
Firstly, it is necessary to apply an SSMA stage to determine the
number of prototypes per class (Instruction 1).
Next, the rest of the individuals are randomly generated,
extracting prototypes from the TR and keeping the same
structure as the SS selected by the PS method, thus they must
have the same number of prototypes per class, and the classes
must have the same arrangement in the matrix $X_{i,G}$.
At this stage, we have established the relevance degree of all
features to 1.0. Then, Instruction 4 determines the best
classification accuracy obtained in the NP population.
After this, our hybrid model enters into a cooperative loop
between FW and SFLSDE.
– The proposed FW method is applied with the best GS found
up to that moment. Once again, the current weights are
inserted as one of the individuals of the FW population
(Instruction 6).
– Then, a new optimization stage is applied to all the
individuals of the population, with the obtained weights
modifying the distance measure between prototypes.
Finally, the method returns the best GS with its appropriate
feature weights, and it is ready to be used as a reference set by
the NN classifier.
1:  GS[1] = SSMA()
2:  Generate GS[2..NP] randomly with the prototype distribution of GS[1]
3:  Weights[1..D] = 1.0
4:  Determine the best GS
5:  for i = 1 to MAXITER do
6:      Weights[1..D] = DEFW(GS[best], Weights)
7:      GS[1..NP] = SFLSDE(GS[1..NP], Weights)
8:      Determine the best GS
9:  end for
10: return GS[best], Weights

Fig. 2. Hybridization of SSMA-DEPG and DEFW.
3.4. A stratified scheme for hybrid FW and PG methods
Since the immediate application of these hybrid methods over
huge sets should be avoided due to their computational cost, we
propose the use of a stratification procedure to mitigate this
drawback, and thus develop a suitable approach to huge
problems.
PS and PG stratified models join every resulting set $RS_j$, obtained from the application of these techniques to each stratum $TR_j$. Nevertheless, in the proposed hybrid scheme we obtain, for each stratum, a generated reduced set and its respective feature weights. To develop a stratified method, we study two different strategies:

• Join procedure: In this variant, the SRS is also generated as the union of every $RS_j$. However, the weight of each feature is recalculated by applying the DEFW algorithm, in this case using the SRS as the given reduced set. The stratified method returns the SRS and its obtained weights to classify the instances of TS.
• Voting rule: This approach applies a majority voting rule. Each pair formed by $RS_j$ and its respective weights is used to compute a candidate class for each instance of TS. The final class is assigned via majority voting of the computed class per stratum; in our implementation, ties are broken at random (a minimal sketch of this rule follows below).
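A minimal sketch (ours) of the voting rule: given the class predicted for one TS instance by each stratum's (RS_j, weights_j) pair, the majority class is returned and ties are broken at random:

```python
import numpy as np
from collections import Counter

def voting_rule(stratum_predictions, rng):
    """Majority vote over the classes predicted for one TS instance by
    each stratum's (RS_j, weights_j) pair; ties are broken at random."""
    counts = Counter(stratum_predictions)
    top = max(counts.values())
    tied = [c for c, n in counts.items() if n == top]
    return tied[0] if len(tied) == 1 else rng.choice(tied)

rng = np.random.default_rng(2)
print(voting_rule([1, 0, 1, 2, 1], rng))  # -> 1
```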
4. Experimental framework
In this section, we present the main characteristics related to
the experimental study. Section 4.1 introduces the data sets used
in this study. Section 4.2 summarizes the algorithms used for
comparison with their respective parameters. Finally, Section 4.3
describes the statistical tests applied to contrast the results
obtained.
4.1. Data sets
In this study, we have selected 40 classification data sets for
the main experimental study. These are well-known problems in
the area, taken from the KEEL data set repository (http://sci2s.ugr.es/keel/datasets) [57]. Table 1
summarizes the properties of the selected data sets. It shows, for
each data set, the number of examples (#Ex.), the number of
attributes (#Atts.), and the number of classes (#Cl.). The data sets
considered in this study contain between 100 and 20 000
instances, and the number of attributes ranges from 2 to 85. In
addition, they are partitioned using the 10 fold cross-validation
(10-fcv) procedure and their values are normalized in the interval
[0,1] to equalize the influence of attributes with different range
domains. In addition, instances with missing values have been
discarded before the execution of the methods over the data sets.
Furthermore, we will perform an additional experiment applying our hybrid models to six huge data sets, which contain more than 20 000 instances. Table 2 shows their characteristics, including the exact number of strata (#Strata) and instances per stratum (#Instances/stratum).

Table 1
Summary description for the classification data sets.

Data set       #Ex.   #Atts. #Cl.     Data set       #Ex.    #Atts. #Cl.
Abalone        4174   8      28       Lym            148     18     4
Banana         5300   2      2        Magic          19 020  10     2
Bands          539    19     2        Mammographic   961     5      2
Breast         286    9      2        Marketing      8993    13     9
Bupa           345    6      2        Monks          432     6      2
Chess          3196   36     2        Newthyroid     215     5      3
Cleveland      297    13     5        Nursery        12 690  8      5
Coil2000       9822   85     2        Pima           768     8      2
Contraceptive  1473   9      3        Ring           7400    20     2
Crx            125    15     2        Saheart        462     9      2
Dermatology    366    33     6        Spambase       4597    57     2
Flare-solar    1066   9      2        Spectfheart    267     44     2
German         1000   20     2        Splice         3190    60     3
Glass          214    9      7        Tae            151     5      3
Haberman       306    3      2        Thyroid        7200    21     3
Hayes-roth     133    4      3        Titanic        2201    3      2
Heart          270    13     2        Twonorm        7400    20     2
Housevotes     435    16     2        Wisconsin      683     9      2
Iris           150    4      3        Yeast          1484    8      10
Led7digit      500    7      10       Zoo            101     16     7

Table 2
Summary description for the huge classification data sets.

Data set    #Ex.     #Atts.  #Cl.  #Strata  #Instances/stratum
Adult       48 842   14      2     10       4884
Census      299 285  41      2     60       4990
Connect-4   67 557   42      3     14       4826
Fars        100 968  29      8     20       5048
Letter      20 000   16      26    4        5000
Shuttle     58 000   9       7     12       4833
4.2. Comparison algorithms and parameters
In order to perform an exhaustive study of the capabilities of
our proposals, we have selected some of the main proposed
models in the literature of PS, PG and FW. In addition, the NN
rule with k = 1 (1NN) has been included as a baseline limit of
performance. Apart from SSMA, IPADECS and SSMA-DEPG, which
have been explained above, the rest of the methods are described as follows:

• TSKNN: A Tabu search based method for simultaneous FS and FW, which encodes in its solutions the current set of features selected (binary codification), the current set of weights assigned to the features, and the best value of k found for the kNN classifier. Furthermore, this method uses fuzzy kNN [58] to avoid ties in the classification process [52].
• ReliefF: The first Relief-based method adapted to perform the FW process [50]. The weights computed by the original Relief algorithm are not binarized to {0, 1}; instead, they are employed as final weights for the kNN classifier. This method was marked as the best performance-based FW method in [20].
• GOCBR: A genetic algorithm designed for a simultaneous PS and FW process in the same chromosome. Weights are represented by binary chains, thus preserving the binary codification of the chromosomes. It has been applied successfully to several real-world applications [59].
Table 3
Parameter specification for all the methods used in the experimentation.

Algorithm      Parameters
SSMA           Population = 30, Evaluations = 10 000, Crossover Probability = 0.5, Mutation Probability = 0.001
SSMA-DEPG      PopulationSFLSDE = 40, IterationsSFLSDE = 500, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
IPADECS        Population = 10, iterations of Basic DE = 500, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
TSKNN          Evaluations = 10 000, M = 10, N = 2, P = ceil(sqrt(#Features))
ReliefF        K value for contributions = Best in [1, 20]
GOCBR          Evaluations = 10 000, Population = 100, Crossover Probability = 0.7, Mutation Probability = 0.1
SSMA-DEPGFW    MAXITER = 20, PopulationSFLSDE = 40, IterationsSFLSDE = 50, PopulationDEFW = 25, IterationsDEFW = 200, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
IPADECS-DEFW   MAXITER = 20, PopulationIPADECS = 10, iterations of Basic DE = 50, PopulationDEFW = 25, IterationsDEFW = 200, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9

Many different configurations are established by the authors of each paper for the different techniques. We focus this experimentation on the recommended parameters proposed by the respective authors, assuming that the choice of the values of the parameters was optimally made. The configuration parameters, which are common to all problems, are shown in Table 3. In all of the techniques, Euclidean distance is used as a similarity function, and the stochastic methods have been run three times per partition. Note that the values of the parameters Fl, Fu, iterSFGSS and iterSFHC remain constant in all the DE optimizations, and are the recommended values established in [41]. Implementations of the algorithms can be found in the KEEL software tool [57].
4.3. Statistical tools for analysis
Hypothesis testing techniques provide us with a way to
statistically support the results obtained in the experimental
study, identifying the most relevant differences found between
the methods [60]. To this end, the use of nonparametric tests will
be preferred over parametric ones, since the initial conditions that
guarantee the reliability of the latter may not be satisfied, causing
the statistical analysis to lose credibility.
We will focus on the use of the Friedman Aligned-ranks (FA)
test [42], as a tool for contrasting the behavior of each of our
proposals. Its application will allow us to highlight the existence
of significant differences between methods. Later, post hoc
procedures like Holm’s or Finner’s will find out which algorithms
are distinctive among the 1 × n comparisons performed. Furthermore, we will use the Wilcoxon Signed-Ranks test in those cases
in which we analyze differences between pairs of methods not
marked as significant by the previous tests.
More information about these tests and other statistical procedures specifically designed for use in the field of Machine Learning can be found at the SCI2S thematic public website on Statistical Inference in Computational Intelligence and Data Mining (http://sci2s.ugr.es/sicidm/).
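For illustration only: the pairwise Wilcoxon signed-ranks test used in this study is available in SciPy; the sketch below (ours, with hypothetical per-data-set accuracies) shows its use. The FA test and the Holm and Finner post hoc procedures are not provided by scipy.stats and would need a dedicated implementation:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired per-data-set accuracies of two methods.
acc_a = np.array([89.9, 67.0, 95.6, 71.8, 83.2, 94.7])
acc_b = np.array([89.6, 69.8, 90.6, 71.4, 82.3, 94.0])

stat, p = wilcoxon(acc_a, acc_b)  # Wilcoxon signed-ranks test on the differences
print(f"statistic = {stat:.1f}, p-value = {p:.4f}")
```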
5. Analysis of results
In this section, we analyze the results obtained from different
experimental studies. Specifically, our aims are:
To compare the proposed hybrid schemes to each other over
the 40 data sets (Section 5.1).
To test the performance of these models in comparison with
previously proposed methods (Section 5.2).
To check whether the performance of the hybrid models is maintained with huge data sets, using the proposed stratified model (Section 5.3).
5.1. Comparison of the proposed hybrid schemes
We focus this experiment on comparing both hybrid schemes in terms of accuracy and reduction capabilities. Fig. 3 shows a star plot in which the test accuracy obtained by IPADECS-DEFW and SSMA-DEPGFW is presented for each data set, allowing us to see more easily how both algorithms behave in the same domains.
Fig. 3. Accuracy rate comparison.
Fig. 4. Reduction rate comparison.
The reduction rate is defined as

$$\text{Reduction Rate} = 1 - \frac{size(GS)}{size(TR)} \qquad (7)$$
It has a strong influence on the efficiency of the solutions
obtained, due to the cost of the final classification process performed
by the 1NN classifier. Fig. 4 illustrates a star plot representing the
reduction rate obtained in each data set for both hybrid
models. These star plots represent the performance as the
distance from the center; hence a higher area determines the best
average performance. The plots allow us to visualize the
performance of the algorithms comparatively for each problem and in general.

[Fig. 5: relative movement diagram; each arrow starts at the origin, with ΔReduction on the x-axis and ΔAccuracy on the y-axis for IPADECS-DEFW versus SSMA-DEPGFW.]
Fig. 5. Accuracy/reduction rates comparison.

Table 4
Results of the Wilcoxon signed-ranks test comparing hybrid schemes.

Comparison                                      R+     R−     p-Value
Accuracy rate: SSMA-DEPGFW vs IPADECS-DEFW      442    338    0.4639
Reduction rate: IPADECS-DEFW vs SSMA-DEPGFW     802    18     4.602 × 10⁻¹⁰
Fig. 5 shows a graphical comparison between these methods
considering both objectives simultaneously (accuracy and reduction rate), by using a relative movement diagram [61]. The idea of
this diagram is to represent with an arrow the results of two
methods on each data set. The arrow starts at the coordinate
origin and the coordinates of the tip of the arrow are given by the
difference between the reduction (x-axis) and accuracy (y-axis) of
IPADECS-DEFW and SSMA-DEPGFW, in this order. Furthermore,
numerical results will be presented later in Tables 5 and 6.
Apart from these figures, we use the Wilcoxon test to statistically compare our proposals on both measures. Table 4 collects the results of its application to the accuracy and reduction rates. This table shows the R+ and R− rankings achieved and their associated p-values.
Observing Figs. 3–5 and Table 4 we want to make some
comments:
Fig. 3 shows that both proposals present similar behavior in many domains. Nevertheless, SSMA-DEPGFW obtains the best average result in 24 of the 40 data sets. The Wilcoxon test confirms this statement, showing that there are no significant differences between the two approaches, although R+ is greater for SSMA-DEPGFW.
In terms of reduction capabilities, IPADECS-DEFW is shown to be
the best performing hybrid model. As the Wilcoxon test reports,
it obtains significant differences with respect to SSMA-DEPGFW.
In Fig. 5, we observe that most of the arrows point to the right side of the plot. This means that IPADECS-DEFW obtains a higher reduction rate in the problems addressed. Moreover, there are a similar number of arrows pointing up and down, depicting that the accuracy of both methods is similar. Hence, we can state that IPADECS-DEFW finds the best trade-off between accuracy and reduction rate.
5.2. Comparison with previously proposed methods

In this subsection, we perform a comparison between the two
proposed hybrid models and the comparison methods established
above. We analyze the results obtained in terms of the accuracy in
test data and reduction rate.
Table 5 shows the accuracy test results for each method considered in the study. For each data set, the mean accuracy (Acc) and the standard deviation (SD) are computed. The last row presents the average considering all the data sets.

Table 6 presents the reduction rate achieved. Reduction rates are only shown for those methods which perform a relevant reduction of the instances of TR. In this table we can observe that the methods based on SSMA obtain the same average reduction rate. This is due to the fact that SSMA is used to obtain the appropriate number of prototypes per class, determining the reduction capabilities at the beginning of the hybrid models SSMA-DEPG and SSMA-DEPGFW. IPADECS and IPADECS-DEFW obtain slightly different reduction rates because they use the same PG approach while changing the fitness function with weights.

Table 5
Accuracy test obtained.

Data sets        1NN           SSMA          SSMA-DEPG     SSMA-DEPGFW   IPADECS       IPADECS-DEFW  TSKNN         ReliefF       GOCBR
                 Acc     SD    Acc     SD    Acc     SD    Acc     SD    Acc     SD    Acc     SD    Acc     SD    Acc     SD    Acc     SD
Abalone          19.91   1.60  26.09   1.41  25.66   1.71  25.61   1.34  22.21   2.34  25.47   2.39  24.65   1.43  14.71   1.85  20.75   1.32
Banana           87.51   1.03  89.64   0.89  89.55   1.14  89.94   1.16  84.09   4.38  89.70   0.97  89.51   0.84  68.53   2.76  87.87   0.87
Bands            63.09   4.65  59.02   8.98  69.78   6.08  67.00   6.55  67.15   5.91  69.97   5.89  73.67   8.33  70.15   6.38  71.45   6.15
Breast           65.35   6.07  73.79   4.05  70.32   7.51  69.63   7.64  70.91   7.15  71.00   8.52  72.02   6.45  62.47   9.71  67.14   8.14
Bupa             61.08   6.88  62.79   8.47  66.00   7.80  67.41   7.96  65.67   8.48  67.25   5.27  62.44   7.90  56.46   4.37  61.81   6.31
Chess            84.70   2.36  90.05   1.67  90.61   2.18  95.56   1.69  80.22   3.81  94.52   1.03  95.94   0.40  96.09   0.57  87.48   1.15
Cleveland        53.14   7.45  54.78   6.29  56.15   6.76  55.80   6.11  52.49   4.48  54.14   6.20  56.43   6.84  55.10   8.62  52.80   5.75
Coil2000         89.63   0.77  94.00   0.12  94.00   0.12  94.00   0.12  94.04   0.09  94.02   0.09  94.03   0.05  94.02   0.06  91.75   0.40
Contraceptive    42.77   3.69  48.14   5.93  48.74   4.46  50.17   3.35  48.54   4.67  54.79   3.61  42.70   0.22  39.99   6.05  43.38   3.65
Crx              79.57   5.12  84.78   4.90  85.65   4.46  85.65   4.83  85.22   4.80  85.07   4.54  86.23   3.90  80.43   3.62  84.20   3.91
Dermatology      95.35   3.45  95.10   5.64  95.37   4.04  94.02   4.31  96.18   3.01  96.73   2.64  96.47   4.01  95.92   2.77  96.46   2.98
Flare-solar      55.54   3.20  65.47   3.97  66.14   3.42  66.95   3.48  66.23   3.13  65.48   3.25  67.16   4.07  57.60   3.51  65.20   3.13
German           70.50   4.25  73.20   4.69  71.90   3.11  72.10   5.13  71.80   3.25  71.40   4.27  71.40   2.20  69.30   1.42  70.30   5.37
Glass            73.61  11.91  68.81   8.19  71.98   9.47  73.64   8.86  69.09  11.13  71.45  11.94  76.42  13.21  80.65  12.04  67.67  14.10
Haberman         66.97   5.46  73.17   3.75  71.53   6.38  73.18   2.61  74.45   6.40  71.53   4.92  74.15   5.07  63.34   8.42  68.94   6.34
Hayes-roth       35.70   9.11  56.18  13.39  75.41  10.57  76.41  10.49  77.05   7.67  75.52  12.11  54.36  11.56  80.20  10.67  67.49  10.55
Heart            77.04   8.89  83.70  10.10  82.22   8.25  85.19   8.11  83.70   9.83  80.74   9.19  81.48   6.42  78.15   9.72  76.67   8.77
Housevotes       92.16   5.41  92.39   4.99  93.55   5.36  94.24   3.78  92.64   3.71  94.00   4.76  95.16   3.34  94.00   3.48  92.83   6.27
Iris             93.33   5.16  96.00   4.42  94.00   4.67  94.67   4.99  94.67   4.00  94.67   4.00  94.00   4.67  94.00   5.54  94.00   3.59
Led7digit        40.20   9.48  34.00   6.69  71.40   4.90  71.80   4.77  72.40   3.88  71.20   4.66  10.80   3.12  63.20   5.53  69.80   4.42
Lym              73.87   8.77  83.03  13.95  80.29  15.48  81.76   9.83  78.41   9.31  80.66  14.74  74.54   8.95  70.43  22.52  79.34   9.46
Magic            80.59   0.90  82.03   0.75  82.31   0.65  83.24   0.96  80.23   1.47  83.17   1.01  83.25   0.68  76.68   5.46  80.66   0.71
Mammographic     73.68   5.59  81.27   5.32  81.27   5.48  81.86   6.03  79.71   4.41  83.67   5.55  82.62   4.76  70.76   4.28  78.67   3.84
Marketing        27.38   1.34  30.87   1.63  31.39   0.70  31.90   1.34  30.69   1.11  31.94   1.39  24.05   1.33  26.45   1.91  27.19   1.47
Monks            77.91   5.42  96.79   3.31  95.44   3.21  98.86   1.53  91.20   4.76  96.10   2.48  100.00  0.00  100.00  0.00  79.21   7.15
Newthyroid       97.23   2.26  96.30   3.48  97.68   2.32  96.73   3.64  98.18   3.02  97.71   3.07  93.48   2.95  97.25   4.33  94.87   4.50
Nursery          82.67   0.92  85.58   1.17  85.38   1.09  92.99   0.76  64.79   4.58  85.10   1.42  82.67   0.88  78.94  32.09  83.53   1.05
Pima             70.33   3.53  74.23   4.01  74.89   5.81  73.23   5.43  76.84   4.67  71.63   7.35  75.53   5.85  70.32   5.65  70.59   4.88
Ring             75.24   0.82  92.86   1.03  93.49   1.05  93.45   0.64  89.70   1.03  91.22   0.94  84.23   1.17  73.08   1.11  74.54   0.48
Saheart          64.49   3.99  71.66   3.46  70.35   5.10  69.47   4.36  70.36   3.07  71.21   3.37  68.22  11.35  60.83   9.15  66.45  14.20
Spambase         89.45   1.17  88.28   1.72  89.84   0.97  88.69   2.12  90.89   0.95  92.50   1.38  92.54   1.21  60.58   0.08  89.82   1.48
Spectfheart      69.70   6.55  74.20   8.69  79.02   7.31  79.68  10.73  80.54   4.25  77.93   4.70  76.01  10.12  78.30  11.92  74.99   6.87
Splice           74.95   1.15  73.32   1.63  78.37   4.44  82.51   4.80  79.78   3.99  88.53   1.99  71.72   1.72  78.24   1.30  74.20   1.54
Tae              40.50   8.43  53.17  12.66  56.54  15.86  58.38  11.80  57.71  11.11  58.33  12.04  30.54   2.56  49.12   3.77  55.00   3.98
Thyroid          92.58   0.81  94.14   0.74  94.58   0.55  96.93   2.39  93.99   0.36  94.28   0.58  95.87   0.61  92.57   0.26  92.85   0.73
Titanic          60.75   6.61  73.51   2.47  78.96   2.30  78.83   2.22  78.19   2.92  79.01   2.11  77.78   2.79  61.33   7.90  78.83   2.22
Twonorm          94.68   0.73  96.34   0.74  96.92   0.79  96.50   0.72  97.66   0.69  97.76   0.72  96.96   0.87  94.65   1.01  94.93   1.25
Wisconsin        95.57   2.59  96.57   2.65  96.14   2.12  96.42   2.23  96.42   1.94  96.28   1.72  96.00   3.61  96.28   2.14  97.14   3.30
Yeast            50.47   3.91  57.55   1.66  58.09   2.14  56.88   1.60  57.35   3.13  59.17   3.61  55.86  12.99  51.55   4.97  53.44   6.50
Zoo              92.81   6.57  85.33   9.73  95.33   6.49  95.83   9.72  96.33   8.23  96.67   6.83  66.25   8.07  96.83   2.78  96.17   5.16
Average          70.80  20.06  75.20  18.98  77.66  17.24  78.43  17.55  76.44  17.46  78.29  17.29  73.68  22.09  72.46  19.78  74.51  17.89

Table 6
Reduction rates obtained.

Data sets      SSMA    SSMA-DEPG  SSMA-DEPGFW  IPADECS  IPADECS-DEFW
Abalone        0.9749  0.9749     0.9749       0.9886   0.9882
Banana         0.9900  0.9900     0.9900       0.9981   0.9981
Bands          0.9567  0.9567     0.9567       0.9872   0.9866
Breast         0.9790  0.9790     0.9790       0.9820   0.9829
Bupa           0.9417  0.9417     0.9417       0.9848   0.9842
Chess          0.9782  0.9782     0.9782       0.9981   0.9981
Cleveland      0.9710  0.9710     0.9710       0.9600   0.9600
Coil2000       0.9999  0.9999     0.9999       0.9997   0.9997
Contraceptive  0.9672  0.9672     0.9672       0.9926   0.9922
Crx            0.9844  0.9844     0.9844       0.9929   0.9929
Dermatology    0.9663  0.9663     0.9663       0.9806   0.9806
Flare-solar    0.9955  0.9955     0.9955       0.9969   0.9969
German         0.9686  0.9686     0.9686       0.9940   0.9920
Glass          0.9237  0.9237     0.9237       0.9393   0.9393
Haberman       0.9840  0.9840     0.9840       0.9904   0.9891
Hayes-roth     0.9006  0.9006     0.9006       0.9436   0.9436
Heart          0.9716  0.9716     0.9716       0.9853   0.9831
Housevotes     0.9826  0.9826     0.9826       0.9849   0.9849
Iris           0.9630  0.9630     0.9630       0.9748   0.9748
Led7digit      0.9693  0.9693     0.9693       0.9747   0.9747
Lym            0.9504  0.9504     0.9504       0.9594   0.9594
Magic          0.9808  0.9808     0.9808       0.9996   0.9996
Mammographic   0.9895  0.9895     0.9895       0.9938   0.9938
Marketing      0.9825  0.9825     0.9825       0.9961   0.9961
Monks          0.9750  0.9750     0.9750       0.9910   0.9910
Newthyroid     0.9700  0.9700     0.9700       0.9835   0.9835
Nursery        0.9396  0.9396     0.9396       0.9992   0.9992
Pima           0.9780  0.9780     0.9780       0.9916   0.9916
Ring           0.9902  0.9902     0.9902       0.9956   0.9956
Saheart        0.9735  0.9735     0.9735       0.9931   0.9911
Spambase       0.9805  0.9805     0.9805       0.9971   0.9971
Spectfheart    0.9696  0.9696     0.9696       0.9817   0.9817
Splice         0.9679  0.9679     0.9679       0.9947   0.9947
Tae            0.9139  0.9139     0.9139       0.9558   0.9558
Thyroid        0.9982  0.9982     0.9982       0.9992   0.9989
Titanic        0.9960  0.9960     0.9960       0.9990   0.9987
Twonorm        0.9952  0.9952     0.9952       0.9993   0.9993
Wisconsin      0.9932  0.9932     0.9932       0.9951   0.9951
Yeast          0.9681  0.9681     0.9681       0.9858   0.9858
Zoo            0.9010  0.9010     0.9010       0.9086   0.9086
Average        0.9695  0.9695     0.9695       0.9842   0.9840
To verify the performance of each of our proposals, we have divided the nonparametric statistical study into two different parts. Firstly, we will compare IPADECS-DEFW and SSMA-DEPGFW with the rest of the comparison methods separately (excluding the other proposal) in terms of test accuracy.

Tables 7 and 8 present the results of the FA test for IPADECS-DEFW and SSMA-DEPGFW, respectively. In these tables, the computed FA rankings, which represent the associated effectiveness, are presented in the second column. Both tables are ordered from the best (lowest) to the worst (highest) ranking. The third column shows the adjusted p-value (APV) with Holm's test. Finally, the fourth column presents the APV with Finner's test. Note that IPADECS-DEFW and SSMA-DEPGFW are established as control algorithms because they have obtained the best FA ranking in their respective studies. The methods whose APVs are below α = 0.1 are outperformed by the control at that level of significance.

Table 7
Average FA rankings of IPADECS-DEFW and the rest of the comparison methods.

Algorithm      FA ranking   Holm APV   Finner APV
IPADECS-DEFW   97.6750      –          –
SSMA-DEPG      106.8875     0.6561     0.6561
IPADECS        133.9000     0.1599     0.0926
SSMA           149.0250     0.0392     0.0182
TSKNN          153.1375     0.0294     0.0128
GOCBR          190.4875     0          0
ReliefF        209.0500     0          0
1NN            243.8375     0          0

p-Value by the FA test = 9.915 × 10⁻⁶.

Table 8
Average FA rankings of SSMA-DEPGFW and the rest of the comparison methods.

Algorithm      FA ranking   Holm APV   Finner APV
SSMA-DEPGFW    93.8875      –          –
SSMA-DEPG      108.0750     0.4929     0.4929
IPADECS        133.5250     0.1107     0.0643
SSMA           149.5250     0.0215     0.0100
TSKNN          155.2250     0.0121     0.0053
GOCBR          190.8000     0          0
ReliefF        208.7250     0          0
1NN            244.2375     0          0

p-Value by the FA test = 9.644 × 10⁻⁶.
In this study, we have observed that the hybrid schemes perform well with large data sets (those with more than 2000 instances). We select the large data sets from Table 1 and compare the weighted and unweighted proposals. Fig. 6 shows this comparison. The x-axis position of each point is the accuracy of the original proposal on a single data set, and the y-axis position is the accuracy of the weighted algorithm. Therefore, points above the y = x line correspond to data sets for which the new proposals perform better than the original algorithm.

[Fig. 6: scatter plot over the large data sets; x-axis, accuracy of the original proposal (75–100); y-axis, accuracy of the weighted proposal (75–100); the y = x line is drawn, with one series for SSMA-DEPG vs SSMA-DEPGFW and another for IPADECS vs IPADECS-DEFW.]
Fig. 6. Accuracy results over large data sets.
Given Fig. 6 and the results shown before, we can make the
following analysis:
SSMA-DEPGFW and IPADECS-DEFW achieve the best average
results. It is important to note that the two hybrid models
clearly outperform the methods upon which they are based.
The good synergy between PG and FW methods is demonstrated with the obtained results. Specifically, if we focus our
attention on those data sets with a large number of features
(see splice, chess, etc.), we can state that, in general, the
hybridization between PG and a DEFW scheme can be useful
to increase the classification accuracy obtained.
Furthermore, Fig. 6 shows that the proposed weighted algorithms are able to overcome, in most cases, the original
proposal when dealing with large data sets.
Both SSMA-DEPGFW and IPADECS-DEFW achieve the lowest (best) ranking in the comparison. The p-value of the FA test is lower than 10⁻⁵ in both cases, meaning that significant differences have been detected between the methods of the experiment.
Holm's procedure states that the differences of IPADECS-DEFW over 1NN, ReliefF, GOCBR, TSKNN and SSMA are significant (α = 0.1). Finner's procedure goes further, also highlighting the difference over IPADECS (Finner APV = 0.0926).
In the case of SSMA-DEPGFW the results are similar: the differences over 1NN, ReliefF, GOCBR, TSKNN and SSMA are marked as significant by Holm's test (α = 0.1), whereas Finner's again also highlights the difference over IPADECS (Finner APV = 0.0643).
These results suggest that our proposals, SSMA-DEPGFW and IPADECS-DEFW, significantly improve upon all the comparison methods considered except SSMA-DEPG. The multiple comparison test applied does not detect significant differences between the three best methods. Hence, we study this last case carefully, applying a pairwise comparison between our proposals and SSMA-DEPG. Specifically, we focus on the Wilcoxon test, which allows us to gain further insight into the comparison of this method with our proposals. Table 9 shows the results of its application, comparing SSMA-DEPGFW and IPADECS-DEFW with SSMA-DEPG. The results obtained suggest that it is outperformed by the new proposals at the α = 0.1 level. Although this result is not as strong as the differences found by Holm's and Finner's procedures, it still supports the existence of a significant improvement of SSMA-DEPGFW and IPADECS-DEFW over SSMA-DEPG.

Table 9
Results of the Wilcoxon signed-ranks test.

Comparison                   R+     R−     p-Value
SSMA-DEPGFW vs SSMA-DEPG     586.5  233.5  0.0469
IPADECS-DEFW vs SSMA-DEPG    521    259    0.0682
5.3. Analyzing scaling up capabilities: a stratified model
In this study, we select the hybrid model IPADECS-DEFW as the
best trade-off between accuracy and reduction rate to implement a
stratified model, considering the two strategies explained in Section
3.4. The performance of this method is analyzed by using six huge
data sets taken from the KEEL data set repository (see Table 2).
To check the performance of the proposed stratified models,
we perform a comparison with the stratified versions of IPADECS
and SSMA-DEPG proposed in [40]. Furthermore, 1NN behavior has
also been analyzed as a baseline method for this study. For all the
techniques, we used the same setup as in the former study, and fixed the strata size as near as possible to 5000 instances. Table 2 shows the exact number of strata and instances per stratum.
Table 10 shows the accuracy test results for each method
considered in this study. For each data set, the mean accuracy
(Acc) and the standard deviation (SD) are computed. The best result
for each column is highlighted in bold. The last row presents the
average considering all the huge data sets. Table 11 collects the
Table 6
Reduction rates obtained.

Data sets       SSMA     SSMA-DEPG   SSMA-DEPGFW   IPADECS   IPADECS-DEFW
Abalone         0.9749   0.9749      0.9749        0.9886    0.9882
Banana          0.9900   0.9900      0.9900        0.9981    0.9981
Bands           0.9567   0.9567      0.9567        0.9872    0.9866
Breast          0.9790   0.9790      0.9790        0.9820    0.9829
Bupa            0.9417   0.9417      0.9417        0.9848    0.9842
Chess           0.9782   0.9782      0.9782        0.9981    0.9981
Cleveland       0.9710   0.9710      0.9710        0.9600    0.9600
Coil2000        0.9999   0.9999      0.9999        0.9997    0.9997
Contraceptive   0.9672   0.9672      0.9672        0.9926    0.9922
Crx             0.9844   0.9844      0.9844        0.9929    0.9929
Dermatology     0.9663   0.9663      0.9663        0.9806    0.9806
Flare-solar     0.9955   0.9955      0.9955        0.9969    0.9969
German          0.9686   0.9686      0.9686        0.9940    0.9920
Glass           0.9237   0.9237      0.9237        0.9393    0.9393
Haberman        0.9840   0.9840      0.9840        0.9904    0.9891
Hayes-roth      0.9006   0.9006      0.9006        0.9436    0.9436
Heart           0.9716   0.9716      0.9716        0.9853    0.9831
Housevotes      0.9826   0.9826      0.9826        0.9849    0.9849
Iris            0.9630   0.9630      0.9630        0.9748    0.9748
Led7digit       0.9693   0.9693      0.9693        0.9747    0.9747
Lym             0.9504   0.9504      0.9504        0.9594    0.9594
Magic           0.9808   0.9808      0.9808        0.9996    0.9996
Mammographic    0.9895   0.9895      0.9895        0.9938    0.9938
Marketing       0.9825   0.9825      0.9825        0.9961    0.9961
Monks           0.9750   0.9750      0.9750        0.9910    0.9910
Newthyroid      0.9700   0.9700      0.9700        0.9835    0.9835
Nursery         0.9396   0.9396      0.9396        0.9992    0.9992
Pima            0.9780   0.9780      0.9780        0.9916    0.9916
Ring            0.9902   0.9902      0.9902        0.9956    0.9956
Saheart         0.9735   0.9735      0.9735        0.9931    0.9911
Spambase        0.9805   0.9805      0.9805        0.9971    0.9971
Spectfheart     0.9696   0.9696      0.9696        0.9817    0.9817
Splice          0.9679   0.9679      0.9679        0.9947    0.9947
Tae             0.9139   0.9139      0.9139        0.9558    0.9558
Thyroid         0.9982   0.9982      0.9982        0.9992    0.9989
Titanic         0.9960   0.9960      0.9960        0.9990    0.9987
Twonorm         0.9952   0.9952      0.9952        0.9993    0.9993
Wisconsin       0.9932   0.9932      0.9932        0.9951    0.9951
Yeast           0.9681   0.9681      0.9681        0.9858    0.9858
Zoo             0.9010   0.9010      0.9010        0.9086    0.9086
Average         0.9695   0.9695      0.9695        0.9842    0.9840
Table 7
Average FA rankings of IPADECS-DEFW and the rest of the comparison methods.

Algorithm      FA ranking   Holm APV   Finner APV
IPADECS-DEFW   97.6750      –          –
SSMA-DEPG      106.8875     0.6561     0.6561
IPADECS        133.9000     0.1599     0.0926
SSMA           149.0250     0.0392     0.0182
TSKNN          153.1375     0.0294     0.0128
GOCBR          190.4875     0          0
ReliefF        209.0500     0          0
1NN            243.8375     0          0

p-value by the FA test = 9.915 × 10⁻⁶.
In Table 10, we observe that the IPADECS-DEFW model with the join procedure obtains the best average accuracy result. The Wilcoxon test has been conducted to compare this method with the rest; Table 12 shows the results of its application. As in the former study, IPADECS-DEFW obtains a slightly lower reduction power than IPADECS, as can be seen in Table 11. Observing these tables, we can summarize that, with an appropriate stratification procedure, the idea of combining PG and FW is also applicable to huge data sets, obtaining good
Table 8
Average FA rankings of SSMA-DEPGFW and the rest of the comparison methods.

Algorithm      FA ranking   Holm APV   Finner APV
SSMA-DEPGFW    93.8875      –          –
SSMA-DEPG      108.0750     0.4929     0.4929
IPADECS        133.5250     0.1107     0.0643
SSMA           149.5250     0.0215     0.0100
TSKNN          155.2250     0.0121     0.0053
GOCBR          190.8000     0          0
ReliefF        208.7250     0          0
1NN            244.2375     0          0

p-value by the FA test = 9.644 × 10⁻⁶.
Fig. 6. Accuracy results over large data sets: accuracy of each weighted proposal (SSMA-DEPGFW, IPADECS-DEFW) plotted against the accuracy of its original version (SSMA-DEPG, IPADECS), with the identity line y = x as reference.
Table 9
Results of the Wilcoxon signed-ranks test.

Comparison                    R+      R−      p-value
SSMA-DEPGFW vs SSMA-DEPG      586.5   233.5   0.0469
IPADECS-DEFW vs SSMA-DEPG     521     259     0.0682
results. The Wilcoxon test supports this statement, showing that IPADECS-DEFW with the join procedure is able to significantly outperform IPADECS, SSMA-DEPG and 1NN at the α = 0.1 level.
6. Conclusions
In this paper, we have introduced a novel data reduction technique which exploits the cooperation between FW and PG to improve the classification performance of the NN rule, as well as its storage requirements and running time. A self-adaptive DE algorithm has been used to optimize the feature weights and the positioning of the prototypes for the nearest neighbor algorithm, acting as an FW scheme and a PG method, respectively.

The proposed DEFW scheme has been incorporated within two of the most promising PG methods. These hybrid models are able to outperform isolated PG methods because FW changes the way in which distances between prototypes are measured, so the adjustment of prototypes can be more refined. Furthermore, we have proposed a stratified procedure specifically designed to deal with huge data sets.
The wide experimental study performed has allowed us to
contrast the behavior of these hybrid models when dealing with a
wide variety of data sets with different numbers of instances and
Table 10
Accuracy test results in huge data sets.

             1NN              IPADECS          SSMA-DEPG        IPADECS-DEFW (Join)   IPADECS-DEFW (Voting rule)
Data sets    Acc     SD       Acc     SD       Acc     SD       Acc     SD            Acc     SD
Adult        0.7960  0.0035   0.8263  0.0032   0.8273  0.0098   0.8335  0.0077        0.8313  0.0031
Census       0.9253  0.0010   0.9439  0.0005   0.9460  0.0009   0.9477  0.0007        0.9428  0.0300
Connect-4    0.6720  0.0036   0.6569  0.0009   0.6794  0.0061   0.6847  0.0058        0.6624  0.0045
Fars         0.7466  0.0034   0.7439  0.0218   0.7625  0.0036   0.7676  0.0039        0.7536  0.0033
Letter       0.9592  0.0002   0.9420  0.0082   0.9053  0.0082   0.9632  0.0121        0.9699  0.0075
Shuttle      0.9993  0.0004   0.9941  0.0021   0.9967  0.0021   0.9967  0.0008        0.9967  0.0015
Average      0.8497  0.0020   0.8512  0.0061   0.8529  0.0051   0.8656  0.0052        0.8595  0.0083
Table 11
Reduction rate results in huge data sets.

Data sets    IPADECS   SSMA-DEPG   IPADECS-DEFW (Join)   IPADECS-DEFW (Voting rule)
Adult        0.9986    0.9882      0.9986                0.9986
Census       0.9994    0.9973      0.9987                0.9987
Connect-4    0.9990    0.9822      0.9981                0.9981
Fars         0.9968    0.9808      0.9957                0.9957
Letter       0.9924    0.9805      0.9901                0.9901
Shuttle      0.9986    0.9981      0.9971                0.9971
Average      0.9975    0.9878      0.9964                0.9964
Table 12
Results obtained by the Wilcoxon test for algorithm IPADECS-DEFW join.

IPADECS-DEFW join vs          R+     R−    p-value
1NN                           20.0   1.0   0.0625
IPADECS                       21.0   0.0   0.0312
SSMA-DEPG                     15.0   0.0   0.0625
IPADECS-DEFW voting rule      12.0   3.0   0.1775
features. The proposed stratified procedure has shown that this technique is useful for tackling the scaling up problem. The results have been validated by several nonparametric statistical procedures, which support the conclusions drawn.
As future work, we consider that this methodology could be
extended by using different learning algorithms such as support
vector machines, decision trees, and so on, following the guidelines given in similar studies for training set selection [62–64].
Acknowledgments
Supported by the Research Projects TIN2011-28488 and TIC-6858. J. Derrac holds an FPU scholarship from the Spanish
Ministry of Education and Science.
References
[1] E. Alpaydin, Introduction to Machine Learning, 2nd edition, MIT Press,
Cambridge, MA, 2010.
[2] I. Kononenko, M. Kukar, Machine Learning and Data Mining: Introduction to
Principles and Algorithms, Horwood Publishing Limited, 2007.
[3] T.M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[4] D.W. Aha (Ed.), Lazy Learning, Springer, 1997.
[5] T.M. Cover, P.E. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf.
Theory 13 (1) (1967) 21–27.
[6] X. Wu, V. Kumar (Eds.), The Top Ten Algorithms in Data Mining, Chapman &
Hall/CRC Data Mining and Knowledge Discovery, 2009.
[7] Y. Gao, F. Gao, Edited AdaBoost by weighted kNN, Neurocomputing 73 (16–18)
(2010) 3079–3088.
[8] J. Derrac, S. Garcı́a, F. Herrera, IFS-CoCo: instance and feature selection based
on cooperative coevolution with nearest neighbor rule, Pattern Recognition
43 (6) (2010) 2082–2105.
[9] D. Pyle, Data Preparation for Data Mining, The Morgan Kaufmann Series in
Data Management Systems, Morgan Kaufmann, 1999.
[10] J.M. Urquiza, I. Rojas, H. Pomares, L.J. Herrera, J. Ortega, A. Prieto, Method for
prediction of protein–protein interactions in yeast using genomics/proteomics information and feature selection, Neurocomputing 74 (16) (2011)
2683–2690.
[11] H. Liu, H. Motoda (Eds.), Computational Methods of Feature Selection,
Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, Chapman
& Hall/CRC, 2007.
[12] J.X. Peng, S. Ferguson, K. Rafferty, P. Kelly, An efficient feature selection
method for mobile devices with application to activity recognition, Neurocomputing 74 (17) (2011) 3543–3552.
[13] J. Derrac, C. Cornelis, S. Garcı́a, F. Herrera, Enhancing evolutionary instance
selection algorithms by means of fuzzy rough set based feature selection, Inf.
Sci. 186 (1) (2012) 73–92.
[14] H. Liu, H. Motoda, Feature Extraction, Construction and Selection: A Data
Mining Perspective, Kluwer Academic Publishers, 2001.
[15] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning
algorithms, Mach. Learn. 38 (3) (2000) 257–286.
[16] A. Guillén, L.J. Herrera, G. Rubio, H. Pomares, A. Lendasse, I. Rojas, New
method for instance or prototype selection using mutual information in time
series prediction, Neurocomputing 73 (10–12) (2010) 2030–2038.
[17] S. Garcı́a, J. Derrac, J. Cano, F. Herrera, Prototype selection for nearest
neighbor classification: taxonomy and empirical study, IEEE Trans. Pattern
Anal. Mach. Intell. 34 (3) (2012) 417–435.
[18] H.A. Fayed, S.R. Hashem, A.F. Atiya, Self-generating prototypes for pattern
classification, Pattern Recognition 40 (5) (2007) 1498–1509.
[19] I. Triguero, J. Derrac, S. Garcı́a, F. Herrera, A taxonomy and experimental
study on prototype generation for nearest neighbor classification, IEEE Trans.
Syst. Man Cybern.—Part C: Appl. Rev. 42 (1) (2012) 86–100.
[20] D. Wettschereck, D.W. Aha, T. Mohri, A review and empirical evaluation of
feature weighting methods for a class of lazy learning algorithms, Artif. Intell.
Rev. 11 (1997) 273–314.
[21] R. Paredes, E. Vidal, Learning weighted metrics to minimize nearest-neighbor
classification error, IEEE Trans. Pattern Anal. Mach. Intell. 28 (7) (2006)
1100–1110.
[22] A.E. Eiben, J.E. Smith, Introduction to Evolutionary Computing, Springer-Verlag, Berlin, 2003.
[23] A.A. Freitas, Data Mining and Knowledge Discovery with Evolutionary
Algorithms, Springer-Verlag, Berlin, 2002.
[24] G.L. Pappa, A.A. Freitas, Automating the Design of Data Mining Algorithms:
An Evolutionary Computation Approach, Natural computing, Springer, 2009.
[25] J.R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance
selection for data reduction in KDD: an experimental study, IEEE Trans. Evol.
Comput. 7 (6) (2003) 561–575.
[26] N. Krasnogor, J. Smith, A tutorial for competent memetic algorithms: model,
taxonomy, and design issues, IEEE Trans. Evol. Comput. 9 (5) (2005) 474–488.
[27] S. Garcı́a, J.R. Cano, F. Herrera, A memetic algorithm for evolutionary
prototype selection: a scaling up approach, Pattern Recognition 41 (8)
(2008) 2693–2709.
[28] F. Fernández, P. Isasi, Evolutionary design of nearest prototype classifiers, J.
Heuristics 10 (4) (2004) 431–454.
[29] A. Cervantes, I.M. Galván, P. Isasi, AMPSO: a new particle swarm method for
nearest neighborhood classification, IEEE Trans. Syst. Man Cybern.—Part B:
Cybern. 39 (5) (2009) 1082–1091.
[30] L. Nanni, A. Lumini, Particle swarm optimization for prototype reduction,
Neurocomputing 72 (4–6) (2008) 1092–1097.
[31] R. Storn, K.V. Price, Differential evolution—a simple and efficient heuristic for
global optimization over continuous spaces, J. Global Optim. 11 (10) (1997)
341–359.
[32] K.V. Price, R.M. Storn, J.A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, Natural Computing Series, Springer, 2005.
[33] S. Das, P. Suganthan, Differential evolution: a survey of the state-of-the-art,
IEEE Trans. Evol. Comput. 15 (1) (2011) 4–31.
[34] I. Triguero, S. Garcı́a, F. Herrera, Differential evolution for optimizing the
positioning of prototypes in nearest neighbor classification, Pattern Recognition 44 (4) (2011) 901–916.
[35] F. Fernández, P. Isasi, Local feature weighting in nearest prototype classification, IEEE Trans. Neural Networks 19 (1) (2008) 40–53.
[36] J. Li, M.T. Manry, C. Yu, D.R. Wilson, Prototype classifier design with pruning,
Int. J. Artif. Intell. Tools 14 (1–2) (2005) 261–280.
[37] I. Triguero, S. Garcı́a, F. Herrera, IPADE: iterative prototype adjustment for
nearest neighbor classification, IEEE Trans. Neural Networks 21 (12) (2010)
1984–1990.
[38] I. Triguero, S. Garcı́a, F. Herrera, Enhancing IPADE algorithm with a different
individual codification, in: Proceedings of the 6th International Conference
on Hybrid Artificial Intelligence Systems (HAIS), Lecture Notes in Artificial
Intelligence, vol. 6679, 2011, pp. 262–270.
[39] J.R. Cano, F. Herrera, M. Lozano, Stratification for scaling up evolutionary
prototype selection, Pattern Recognition Lett. 26 (7) (2005) 953–963.
[40] I. Triguero, J. Derrac, S. Garcı́a, F. Herrera, A study of the scaling up
capabilities of stratified prototype generation, in: Proceedings of the Third
World Congress on Nature and Biologically Inspired Computing (NABIC’11),
2011, pp. 304–309.
[41] F. Neri, V. Tirronen, Scale factor local search in differential evolution,
Memetic Comput. 1 (2) (2009) 153–171.
[42] S. Garcı́a, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests
for multiple comparisons in the design of experiments in computational
intelligence and data mining: experimental analysis of power, Inf. Sci. 180
(2010) 2044–2064.
[43] P.E. Hart, The condensed nearest neighbor rule, IEEE Trans. Inf. Theory 14 (3) (1968) 515–516.
[44] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited
data, IEEE Trans. Syst. Man Cybern. 2 (3) (1972) 408–421.
[45] J.S. Sánchez, R. Barandela, A.I. Marqués, R. Alejo, J. Badenas, Analysis of new
techniques to obtain quality training sets, Pattern Recognition Lett. 24 (7)
(2003) 1015–1022.
[46] J.S. Sánchez, High training set size reduction by space partitioning and
prototype abstraction, Pattern Recognition 37 (7) (2004) 1561–1564.
[47] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Mach.
Learn. 6 (1) (1991) 37–66.
[48] K. Kira, L.A. Rendell, A practical approach to feature selection, in: Proceedings
of the Ninth International Conference on Machine Learning, Morgan Kaufmann, Aberdeen, Scotland, 1992, pp. 249–256.
[49] K. Ye, K. Feenstra, J. Heringa, A. Ijzerman, E. Marchiori, Multi-RELIEF: a
method to recognize specificity determining residues from multiple
sequence alignments using a machine learning approach for feature weighting, Bioinformatics 24 (1) (2008) 18–25.
[50] I. Kononenko, Estimating attributes: analysis and extensions of RELIEF, in:
Proceedings of the 1994 European Conference on Machine Learning, Springer
Verlag, Catania, Italy, 1994, pp. 171–182.
[51] M.R. Sikonja, I. Kononenko, Theoretical and empirical analysis of ReliefF and
RReliefF, Mach. Learn. 53 (1-2) (2003) 23–69.
[52] M.A. Tahir, A. Bouridane, F. Kurugollu, Simultaneous feature selection and
feature weighting using hybrid tabu search/k-nearest neighbor classifier,
Pattern Recognition Lett. 28 (4) (2007) 438–446.
[53] J. Gertheiss, G. Tutz, Feature selection and weighting by nearest neighbor
ensembles, Chemometr. Intell. Lab. Syst. 99 (2009) 30–38.
[54] A.K. Qin, V.L. Huang, P.N. Suganthan, Differential evolution algorithm with
strategy adaptation for global numerical optimization, IEEE Trans. Evol.
Comput. 13 (2) (2009) 398–417.
[55] S. Das, A. Abraham, U.K. Chakraborty, A. Konar, Differential evolution using a
neighborhood-based mutation operator, IEEE Trans. Evol. Comput. 13 (3)
(2009) 526–553.
[56] J. Zhang, A.C. Sanderson, JADE: adaptive differential evolution with optional
external archive, IEEE Trans. Evol. Comput. 13 (5) (2009) 945–958.
[57] J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. Garcı́a, L. Sánchez, F. Herrera,
KEEL data-mining software tool: data set repository, integration of algorithms
and experimental analysis framework, J. Mult.-Valued Logic Soft Comput. 17 (2–3) (2011) 255–287.
[58] J.M. Keller, M.R. Gray, J.A. Givens, A fuzzy K-nearest neighbor algorithm, IEEE
Trans. Syst. Man Cybern. 15 (4) (1985) 580–585.
[59] H. Ahn, K. Kim, Bankruptcy prediction modeling with hybrid case-based reasoning and genetic algorithms approach, Appl. Soft Comput. 9 (2009) 599–607.
[60] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 5th edition, Chapman & Hall/CRC, 2011.
[61] C. Garcı́a-Osorio, A. de Haro-Garcı́a, N. Garcı́a-Pedrajas, Democratic instance
selection: a linear complexity instance selection algorithm based on classifier
ensemble concepts, Artif. Intell. 174 (2010) 410–441.
[62] J.R. Cano, F. Herrera, M. Lozano, Evolutionary stratified training set selection
for extracting classification rules with trade off precision-interpretability,
Data Knowl. Eng. 60 (2007) 90–108.
[63] S. Garcı́a, A. Fernández, F. Herrera, Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set
selection over imbalanced problems, Appl. Soft Comput. 9 (2009) 1304–1314.
[64] L. Nanni, A. Lumini, Prototype reduction techniques: a comparison among
different approaches, Expert Syst. Appl. 38 (9) (2011) 11820–11828.
Isaac Triguero Velázquez received the M.Sc. degree in
Computer Science from the University of Granada, Granada, Spain, in 2009. He is currently a Ph.D. student in the
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His
research interests include data mining, semi-supervised
learning, data reduction and evolutionary algorithms.
Joaquı́n Derrac Rus received the M.Sc. degree in
Computer Science from the University of Granada,
Granada, Spain, in 2008. He is currently a Ph.D. student
in the Department of Computer Science and Artificial
Intelligence, University of Granada, Granada, Spain. His
research interests include data mining, data reduction,
statistical inference and evolutionary algorithms.
Salvador Garcı́a López received the M.Sc. and Ph.D.
degrees in Computer Science from the University of
Granada, Granada, Spain, in 2004 and 2008, respectively.
He is currently an Assistant Professor in the Department of Computer Science, University of Jaén, Jaén,
Spain. He has had more than 25 papers published in
international journals. He has co-edited two special
issues of international journals on different Data
Mining topics. His research interests include data
mining, data reduction, data complexity, imbalanced
learning, semi-supervised learning, statistical inference and evolutionary algorithms.
Francisco Herrera Triguero received the M.Sc. in
Mathematics in 1988 and the Ph.D. in Mathematics
in 1991, both from the University of Granada, Spain.
He is currently a Professor in the Department of
Computer Science and Artificial Intelligence at the University of Granada. He has had more than 200 papers
published in international journals. He is a coauthor of the
book ‘‘Genetic Fuzzy Systems: Evolutionary Tuning and
Learning of Fuzzy Knowledge Bases’’ (World Scientific,
2001). He currently acts as Editor in Chief of the international journal ‘‘Progress in Artificial Intelligence’’
(Springer) and serves as area editor of the Journal Soft
Computing (area of evolutionary and bioinspired algorithms) and International Journal of Computational Intelligence Systems (area of
information systems). He acts as associate editor of the journals: IEEE Transactions
on Fuzzy Systems, Information Sciences, Advances in Fuzzy Systems, and International
Journal of Applied Metaheuristics Computing; and he serves as member of several
journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence,
Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence,
International Journal of Hybrid Intelligent Systems, Memetic Computation, Swarm and
Evolutionary Computation.
He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish
National Award on Computer Science ARITMEL to the ‘‘Spanish Engineer on Computer
Science’’, and International Cajastur ‘‘Mamdani’’ Prize for Soft Computing (Fourth
Edition, 2010).
His current research interests include computing with words and decision making,
data mining, data preparation, instance selection, fuzzy rule based systems, genetic
fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic
algorithms and genetic algorithms.
1.5. MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification
• I. Triguero, D. Peralta, J. Bacardit, S. Garcı́a, F. Herrera, MRPR: A MapReduce Solution
for Prototype Reduction in Big Data Classification. Neurocomputing.
– Status: Submitted.
MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification

Isaac Triguero^a, Daniel Peralta^a, Jaume Bacardit^b, Salvador García^c, Francisco Herrera^a

^a Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
^b School of Computing Science, Newcastle University, NE1 7RU, Newcastle, UK
^c Department of Computer Science, University of Jaén, 23071 Jaén, Spain
Abstract
In the era of big data, analyzing and extracting knowledge from large-scale data sets is a very interesting and challenging task. The application of standard data mining tools to such data sets is not straightforward. Hence, a new class of scalable mining methods that embrace the huge storage and processing capacity of cloud platforms is required. In this work, we propose
a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim at representing
original training data sets as a reduced number of instances. Their main
purposes are to speed up the classification process and reduce the storage
requirements and sensitivity to noise of the nearest neighbor rule. However,
the standard prototype reduction methods cannot cope with very large data
sets. To overcome this limitation, we develop a MapReduce-based framework
to distribute the functioning of these algorithms through a cluster of computing elements, proposing several algorithmic strategies to integrate multiple
partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied over big
data classification problems without significant accuracy loss. We test the
speeding up capabilities of our model with data sets up to 5.7 millions of
Email addresses: triguero@decsai.ugr.es (Isaac Triguero),
dperalta@decsai.ugr.es (Daniel Peralta), jaume.bacardit@newcastle.ac.uk (Jaume
Bacardit), sglopez@ujaen.es (Salvador Garcı́a), herrera@decsai.ugr.es (Francisco
Herrera)
Preprint submitted to Neurocomputing
March 3, 2014
instances. The results show that this model is a suitable tool to enhance the
performance of the nearest neighbor classifier with big data.
Keywords:
Big data, Mahout, Hadoop, Prototype reduction, Prototype generation,
Nearest neighbor classification
1. Introduction
The term big data is increasingly being used to refer to the challenges and advantages derived from collecting and processing vast amounts of data [1]. Formally, it is defined as the quantity of data that exceeds the processing capabilities of a given system [2] in terms of time and/or memory consumption. It is attracting much attention in a wide variety of areas, such as industry, medicine or financial businesses, because these areas have progressively acquired large amounts of raw data. Nowadays, with the availability of cloud platforms [3], they can take advantage of these massive data sets by extracting valuable information. However, the analysis and knowledge extraction process from big data becomes a very difficult task for most of the classical and advanced data mining and machine learning tools [4, 5].
Data mining techniques should be adapted to these emerging technologies [6, 7] to overcome their limitations. In this sense, the MapReduce framework [8, 9], in conjunction with its distributed file system [10], originally introduced by Google, offers a simple but robust environment for tackling the processing of large data sets over a cluster of machines. This scheme is currently preferred in data mining over other parallelization schemes, such as MPI (Message Passing Interface) [11], because of its fault-tolerant mechanism, which is crucial for time-consuming jobs, and because of its simplicity. In the specialized literature, several recent proposals have focused on the parallelization of machine learning tools based on the MapReduce approach [12, 13]. For example, some classification techniques such as [14, 15, 16] have been implemented within the MapReduce paradigm. They have shown that distributing the data and the processing under a cloud computing infrastructure is very useful for speeding up the knowledge extraction process.
Data reduction techniques [17] emerged as preprocessing algorithms that aim to simplify and clean the raw data, enabling data mining algorithms to be applied not only in a faster way, but also in a more accurate way, by removing noisy and redundant data. From the perspective of the attribute space, the most well-known data reduction processes are feature selection and feature extraction [18]. Considering the instance space, we highlight instance reduction methods, which are usually divided into instance selection [19] and instance generation or abstraction [20]. Advanced models that tackle both problems simultaneously can be found in [21, 22, 23]. As such, these techniques should make it easier for data mining algorithms to address big data problems; however, these methods are also affected by the increase in the size and complexity of data sets, and they are unable to provide a preprocessed data set in a reasonable time.
This work is focused on Prototype Reduction (PR) techniques [20], which are instance reduction methods that aim to improve the classification capabilities of the Nearest Neighbor rule (NN) [24]. These techniques may select instances from the original data set, or build new artificial prototypes, to form a resulting set of prototypes that better adjusts the decision boundaries between classes in NN classification. PR techniques have proved to be very competitive at reducing the computational cost and high storage requirements of the NN algorithm, as well as at improving its classification performance [25, 26, 27].
Large-scale data cannot be tackled by standard data reduction techniques because their runtime becomes impractical. Several solutions have been developed to enable data reduction techniques to deal with this problem. For PR, we can find a data-level approach based on a distributed partitioning model that maintains the class distribution (also called stratification). This approach splits the original training data into several subsets that are addressed individually, and then joins each partial reduced set into a global solution. It has been used for instance selection [28, 29] and generation [30] with promising results. However, two main problems appear when the data set size increases:
• A stratified partitioning process cannot be carried out when the size of the data set is so big that it occupies all the available RAM memory.
• This scheme does not consider that joining each partial solution into
a global one could generate a reduced set with redundant or noisy
instances that may damage the classification performance.
In this work, we propose a new distributed framework for PR, based on the stratification procedure, which handles the drawbacks mentioned above. To do so, we rely on the success of the MapReduce framework, carefully designing the map and reduce tasks to perform a proper PR process. Concretely, the map phase corresponds to the splitting procedure and the application of the PR technique, while the reduce stage performs a filtering or fusion of prototypes to avoid introducing harmful prototypes into the resulting preprocessed data set.

We will denote this framework “MapReduce for Prototype Reduction” (MRPR). The idea of splitting the data into several subsets and processing them separately fits the MapReduce philosophy better than other parallelization schemes for two reasons: first, each subset is processed individually, so no data exchange between nodes is needed [31]; second, the computational cost of each chunk could be so high that a fault-tolerant mechanism is mandatory. For the reduce stage we study three different strategies, of varying computational effort, for the integration of the partial solutions generated by the mappers.
Developing a distributed partitioning scheme based on MapReduce for
PR motivates the global purpose of this work, which can be divided into
three objectives:
• To enable PR techniques to deal with big data classification problems.
• To analyze and illustrate the scalability of the proposed scheme in terms
of classification accuracy and runtime.
• To study how PR techniques enhance the NN rule when dealing with
big data.
To test the performance of our model, we will conduct experiments on big data sets, focusing on an advanced PR technique called SSMA-SFLSDE, which was recently proposed in [27]. Moreover, some additional experiments with other PR techniques will also be carried out. The experimental study includes an analysis of the training and test accuracy, runtime and reduction capabilities of PR techniques under the proposed framework. Several variations of the proposed model will be investigated with different numbers of mappers and four data sets of up to 5.7 million instances.
The rest of the paper is organized as follows. In Section 2, we provide
some background material about PR and MapReduce. In Section 3, we
describe the MapReduce implementation proposed for PR and discuss which
PR methods are candidates to be adapted to this framework. We present
and discuss the empirical results in Section 4. Finally, Section 5 summarizes
the conclusions of the paper.
2. Background
In this section we provide some background information about the topics
used in this paper. Section 2.1 presents the PR problem and its weaknesses
to deal with big data. Section 2.2 introduces the MapReduce paradigm and
the implementation used in this work.
2.1. Prototype reduction and big data
This section defines the PR problem, its current trends and the drawbacks
of tackling big data with PR techniques. A formal notation of the PR problem is the following: let TR be a training data set and TS a test set, formed by n and t samples, respectively. Each sample x_p is a tuple (x_p1, x_p2, ..., x_pD, ω), where x_pf is the value of the f-th feature of the p-th sample. Each sample belongs to a class ω, given by x_pω, and to a D-dimensional space. The class ω is known for the samples of TR, while it is unknown for those of TS.

The purpose of PR is to provide a reduced set RS which consists of rs, rs < n, prototypes, either selected or generated from the examples of TR. The prototypes of RS should be calculated so as to represent the distributions of the classes efficiently and to discriminate well when they are used to classify the training objects. The size of RS should be sufficiently small to deal with the storage and evaluation time problems of the NN classifier.
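As a point of reference, the following is a minimal Java sketch of the NN rule (1NN with Euclidean distance) that a reduced set RS is meant to serve; the data layout (features in a double[] array, class labels stored separately) is an assumption made only for this illustration.

public class NearestNeighbor {

    /** Returns the class label of the prototype in rs closest to query q. */
    static int classify(double[][] rs, int[] labels, double[] q) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < rs.length; i++) {
            double d = 0;
            for (int f = 0; f < q.length; f++) {
                double diff = rs[i][f] - q[f];
                d += diff * diff;               // squared Euclidean distance
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        double[][] rs = {{0.0, 0.0}, {1.0, 1.0}};   // two toy prototypes
        int[] labels = {0, 1};
        System.out.println(classify(rs, labels, new double[]{0.9, 0.8})); // prints 1
    }
}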
As we stated above, PR is usually divided into those approaches that are limited to selecting instances from TR, known as prototype selection, and those that may generate artificial examples (prototype generation). Both strategies have been studied in depth in the literature. Most of the recent proposals are based on evolutionary algorithms to select [32, 33] or generate [25, 26] an appropriate RS. Furthermore, a hybrid approach between prototype selection and generation can be found in [27]. Recent reviews about these topics are [19]
and [20]. More information about PR can be found at the SCI2S thematic
public website on Prototype Reduction in Nearest Neighbor Classification:
Prototype Selection and Prototype Generation¹.

¹ http://sci2s.ugr.es/pr/
Despite the promising results shown by PR techniques with small and medium data sets, they lack the scalability to address big TR data sets (from tens of thousands of instances onwards [29]). The main problems found when dealing with large-scale data are:
• Runtime: The complexity of PR models is O((n · D)²) or higher, where n is the number of instances and D the number of features. Although these techniques are applied only once on a TR, if this process takes too long its application could become inoperable for real applications.

• Memory consumption: Most PR methods need to store many partial calculations, intermediate solutions and/or the entire TR in the main memory. When TR is too big, it could easily exceed the available RAM memory.
As we will see in further sections, these weaknesses motivate the use of distributed partitioning procedures, which divide the TR into disjoint subsets that can be managed by PR methods [28].
2.2. MapReduce

MapReduce is a parallel programming paradigm [8, 9] designed to process or generate large data sets. It allows us to tackle big data sets over a computer cluster regardless of the underlying hardware or software. It is characterized by its high transparency for programmers, which allows applications to be parallelized in an easy and comfortable way.

Based on functional programming, this model works in two different steps: the map phase and the reduce phase. Each one has key-value (< k, v >) pairs as input and output. Both phases are defined by the programmer. The map phase takes each < k, v > pair and generates a set of intermediate < k, v > pairs. Then, MapReduce merges all the values associated with the same intermediate key into a list (known as the shuffle phase). The reduce phase takes that list as input to produce the final values. Figure 1 depicts a flowchart of the MapReduce framework. In a MapReduce program, all map and reduce operations run in parallel. First, all map functions are run independently. Meanwhile, reduce operations wait until their respective maps have finished. Then, they process different keys concurrently and independently. Note that the inputs and outputs of a MapReduce job are stored in an associated distributed file system that is accessible from any computer of the cluster used.
Figure 1: Flowchart of the MapReduce framework
An illustrative example of the way MapReduce works is computing the average cost per year from a big list of cost records. Each record may contain a variety of values, but it at least includes the year and the cost. The map function extracts the pairs < year, cost > from each record and transmits them as its output. The shuffle stage groups the < year, cost > pairs by year, creating a list of costs per year, < year, list(cost) >. Finally, the reduce phase computes the average of all the costs contained in the list of each year.
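As an illustration, a minimal Hadoop sketch of this example could look as follows; the record format (a CSV line whose first two fields are year and cost) and the class names are assumptions made only for the example, and the job driver configuration is omitted.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class YearlyCostAverage {

    // Map phase: from each record, emit the intermediate pair <year, cost>.
    public static class CostMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split(",");
            // Assumed record layout: year,cost,...
            context.write(new Text(fields[0]),
                          new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    // Reduce phase: receives <year, list(cost)> after the shuffle and
    // outputs the average cost of that year.
    public static class AverageReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text year, Iterable<DoubleWritable> costs,
                              Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable cost : costs) {
                sum += cost.get();
                count++;
            }
            context.write(year, new DoubleWritable(sum / count));
        }
    }
}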
Different implementations of the MapReduce framework are possible [8], depending on the available cluster architecture. Some implementations of MapReduce are Mars [34], Phoenix [35] and Apache Hadoop [36, 37]. In this paper we focus on the Hadoop implementation because of its performance, open-source nature, ease of installation and its distributed file system (the Hadoop Distributed File System, HDFS).

A Hadoop cluster follows a master-slave architecture, where one master node manages an arbitrary number of slave nodes. The HDFS replicates file data across multiple storage nodes that can access the data concurrently. In such a cluster, a certain percentage of the slave nodes may be out of order temporarily. For this reason, Hadoop provides a fault-tolerant mechanism: when one node fails, Hadoop automatically restarts the task on another node.
As we commented above, the MapReduce approach can be useful for many different tasks. In terms of data mining, it offers a propitious environment to successfully speed up these kinds of techniques. In fact, there is a growing open-source project, called Apache Mahout [38], that collects distributed and scalable machine learning algorithms implemented on top of Hadoop. Nowadays, it supplies implementations of several specific techniques, such as k-means for clustering, a naive Bayes classifier, collaborative filtering, etc. We base our implementations on this library.
3. MRPR: MapReduce for prototype reduction
In this section we present the proposed MapReduce approach for PR.
Firstly, we argue the motivation that justifies our proposal (Section 3.1). Then, we detail the proposed model in depth (Section 3.2). Finally, we comment on which PR methods can be implemented within the proposed framework depending on their main characteristics (Section 3.3).
3.1. Motivation
As mentioned before, PR methods decrease their performance when dealing with large numbers of instances. The distribution and parallelization of the workload among different sub-processes may ease the problems enumerated previously (runtime and memory consumption). To tackle this challenge we have to create an efficient and flexible PR design that takes advantage of parallelization schemes and cloud-enabled infrastructures. The designed framework should enable PR techniques to be applied to data sets with an unlimited number of instances without major algorithmic modifications, just by using more computers. Furthermore, this model should guarantee that the objectives of PR models are maintained; that is, it should provide high reduction rates without significant accuracy loss.
In our previous work [30], a distributed partitioning approach was proposed to alleviate these issues. This model splits the training set TR into d disjoint subsets (TR1, TR2, ..., TRd) with equal class distribution and size. Then, a PR model is applied to each TRj, obtaining a resulting reduced set RSj. Finally, all the RSj (1 ≤ j ≤ d) are merged into a final reduced set RS, which is used to classify the instances of TS with the NN rule. This partitioning process has been shown to perform well in medium-size domains. However, it has some limitations:
• Maintaining the proportion of examples per class of TR within each subset TRj cannot be accomplished when the size of the data set does not fit in the main memory. Hence, this strategy cannot scale to data sets of arbitrary size.
• Joining all the partial reduced sets RSj into a final RS may lead to the introduction of noisy and/or redundant examples. Each resulting RSj tries to represent, with the minimum number of instances, a proportion of the entire TR. Thus, when the size of TR is very high, the instances contained in some TRj subsets may be located very near to each other in the D-dimensional space. Therefore, the final RS may enclose unnecessary instances to represent the training data. The likelihood of this issue increases with the number of partitions.
Moreover, it is important to note that this distributed model was not implemented within any parallel environment that ensures high scalability and fault tolerance. These weaknesses motivate the design of a parallel PR system based on cloud technologies.

In [30], we compared some relevant PR methods under the distributed partitioning model. We concluded that the best performing approach was the SSMA-SFLSDE model [27]. In our experiments, we will mainly focus on this PR model (although other models will also be investigated).
3.2. Parallelizing PR with MapReduce
This section explains how to parallelize PR techniques following a MapReduce procedure. Section 3.2.1 details the map phase and Section 3.2.2 presents the reduce stage. At the end of the section, Figure 3 illustrates a high-level scheme of the proposed parallel system, MRPR.
3.2.1. Map phase
Suppose a training set TR, of a determined size, stored in the HDFS as a single file. The first step of MRPR is devoted to splitting TR into a given number of disjoint subsets. From a Hadoop perspective, the TR file is composed of h HDFS blocks that are accessible from any computer of the cluster, independently of its size. Let m be the number of map tasks (a user-defined parameter). Each map task (Map1, Map2, ..., Mapm) will form an associated subset TRj, with 1 ≤ j ≤ m, from the instances of the chunk into which the training set file is divided. Note that this partitioning is performed sequentially, so that Mapj corresponds to the j-th data chunk of h/m HDFS blocks. Thus, each map will process approximately the same number of instances.
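For instance, if the TR file occupies h = 1024 HDFS blocks and m = 128 map tasks are used, each Mapj reads the h/m = 8 consecutive blocks that make up its chunk.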
Under this scheme, if the partitioning procedure is applied directly over TR, the class distribution of each subset TRj could be biased by the original distribution of instances in the file. As we stated before, a proper stratified partitioning cannot be carried out if the size of TR does not fit in the main memory. In order to develop a scheme that scales easily to any number of instances, we previously randomize the entire file. This operation is not time-consuming in comparison with the application of the PR technique, and it needs to be applied only once. It does not ensure that every class is represented proportionally to its number of instances in TR; however, probabilistically, each chunk should include approximately a number of instances of class ω according to the probability of belonging to this class in the original TR.
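One possible way to carry out this initial randomization without loading TR into memory (an implementation detail the text does not fix) is a preliminary MapReduce pass that emits each instance under a randomly drawn key; since the framework sorts by key before the reduce stage, writing the result back to the HDFS yields a randomly ordered file.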
When each map has formed its corresponding TRj, a PR step is performed using TRj as the input training data. This step generates a reduced set RSj. Note that PR techniques may consume different computational times even when applied to data sets of similar characteristics; this mainly depends on the stopping criteria of each PR model. Nevertheless, MapReduce starts the reduce phase as soon as the first mapper has finished. Figure 2 contains the pseudo-code of the map function, which is basically the application of the PR technique to each training partition. As each map finishes its processing, the results are forwarded to a single reduce task.
3.2.2. Reduce phase
The reduce phase consists of the iterative aggregation of all the RSj into a single set RS. Figure 2 shows the pseudo-code of the implemented reduce function. Initially, RS = ∅. To perform this aggregation, we propose different alternatives:

• Join: This simple option, based on stratification, concatenates all the RSj sets into a final reduced set RS. Instruction 7 indicates how the reduce function progressively joins all the RSj as the mappers finish their processing. This type of reducer implements the same strategy used in the distributed partitioning procedure that we proposed previously [30]. As such, this joining process does not guarantee that the resulting RS contains no irrelevant or even harmful instances, but it is included as a baseline.

• Filtering: This alternative explores the idea of a filtering stage that removes noisy instances during the formation of RS. It is based on
1:  function map(Number of split j)
2:      Constitute TRj with the instances of split j.
3:      RSj = PrototypeReduction(TRj)
4:      return RSj
5:  end function

6:  function reduce(RSj, typeOfReducer)          ⊲ Initially RS = ∅
7:      RS = RS ∪ RSj
8:      if typeOfReducer == Filtering then
9:          RS = Filtering(RS)
10:     end if
11:     if typeOfReducer == Fusion then
12:         RS = Fusion(RS)
13:     end if
14:     return RS
15: end function

Figure 2: Map and reduce functions
those prototype selection methods belonging to the edition family [19].

This kind of method is commonly based on simple heuristics that discard points that are noisy or do not agree with their neighbors. Such methods supply smoother decision boundaries for the NN classifier. In general, edition schemes enhance generalization capabilities by performing a slight reduction of the original training set.

These characteristics are very appropriate for the current stage of our framework. At this point, the map phase has reduced each partition to a subset of representative instances. When aggregating them into a single RS set, we do not pursue further reduction of RS; we focus on removing noisy instances, if any. Therefore, the reduce function iteratively applies a filtering to the current RS: as the mappers end their execution, the reduce function is run and the next RS is computed as the filtered set obtained from its current content and the new RSj. This is described in instructions 8–10 of Figure 2.
• Fusion: In this variant we aim to eliminate redundant prototypes. To accomplish this objective we rely on the success of centroid-based methods for prototype generation [20]. These techniques reduce a prototype set by merging similar examples [39]. Since in this step we have to fuse all the RSj into a single set, these methods can be very useful to generate a final set without redundant or very similar prototypes. As in the previous scheme, the fusion phase is progressively applied during the creation of RS. Instructions 11–13 of Figure 2 explain how the fusion phase is applied in the MapReduce framework.
As we have explained, MRPR uses a single reducer, which is run every time a mapper is completed. With the adopted strategy, the use of a single reducer is computationally less expensive than using several, and it decreases the MapReduce overhead (especially the network overhead) [40].

Figure 3: MRPR scheme

In summary, Figure 3 outlines the way the MRPR framework works, differentiating between the map and reduce phases. It puts emphasis on how the single reducer works and forms the final RS. The resulting RS will be used as the training set for the NN rule to classify the unseen data of the TS set.
3.3. Which PR methods are more suitable for the MRPR framework?
In this subsection we explain which kinds of PR techniques fit the proposed MRPR framework in its respective stages. In the map phase, the main prototype reduction process is carried out by a PR technique. Then, depending on the selected reduce type, we should select a filtering or a fusion PR technique to combine the resulting reduced sets. In what follows, we discuss which PR techniques are more appropriate for these stages and how to combine them.

All PR algorithms take a training set (in our case TRj) as input and return a reduced set RSj. Therefore, all of them could be implemented in the map phase of MRPR according to the description given above. However, depending on their characteristics (reduction, accuracy and runtime), we should take the following aspects into consideration to select a proper PR algorithm:
• A very accurate PR technique is desirable. However, in many PR techniques this implies a low reduction rate. A resulting RS with an excessive number of instances can negatively influence the time needed by the reduce phase.

• The runtime of a PR algorithm will determine the number of mappers into which the TR set of a given problem should be divided. Depending on the problem tackled, a very high number of mappers may result in TRj subsets that are not representative of the original TR.
According to [19, 20], there are six main PR families: edition [41], condensation [42], hybrid approaches [43], positioning adjustment [25], centroid-based [44] and space splitting [45]. Although there are differences between the methods of each family, most of them behave in a similar way. With these previous notes in mind, we can state the following general recommendations:
• Edition-based methods are focused on cleaning the training set by removing noisy data. Thus, these methods are usually very fast and accurate, but they obtain a very low reduction rate. To implement these methods in our framework we recommend the use of a very fast reduce phase: for instance, a simple join scheme, a filtering reducer with the ENN method [41] or a fusion reducer based on PNN [39].

• Condensation, hybrid and space-splitting approaches commonly offer a good trade-off between reduction, accuracy and runtime. Their reduction rate is normally around 60–80%, so, depending on the problem addressed, the reducer should have a moderate time consumption. For example, we recommend the use of ENN [41] or Depur [46] for filtering reducers and GMCA [44] for fusion.
• Positioning adjustment techniques may offer a very high reduction rate, which may even be adjustable as a user-defined parameter. These techniques can provide very accurate results in a relatively moderate runtime. To implement these techniques we suggest the inclusion of very accurate reducers, such as ICPL2 [47] for fusion, because the high reduction rate will allow them to be applied in a fast way.

• Centroid-based algorithms are very accurate, with a moderate reduction power, but (in general) very time-consuming. Although their implementation is feasible and could be useful in some problems, we assume that their use should be limited to the later stage (the reduce phase).
As general suggestions for combining PR techniques in the map and reduce phases, we can establish the following rules:
• High reduction rates in the map phase permit very accurate reducers.
• Low reduction rates in the map phase need fast reducers (join, filtering
or a fast fusion).
As commented in the previous section, we propose the use of edition-based methods for the filtering reduce type and centroid-based algorithms to fuse prototypes. In our experiments, we will focus on a simple but effective edition technique: the edited nearest neighbor (ENN) [41]. This algorithm removes an instance from a set of prototypes if it does not agree with the majority of its k nearest neighbors. As algorithms to fuse prototypes, we will use the ICPL2 method presented in [47] as a more accurate option and the GMCA model [44] for a faster reduce phase. The ICPL2 model integrates several prototypes by identifying borders and merging those instances that are not located on these borders; it stands out as the best performing model of the centroid-based family in [20]. The GMCA approach merges prototypes based on hierarchical clustering, providing a good trade-off between accuracy and runtime.
4. Experimental study
In this section we present the questions raised by the experimental study and the results obtained. Section 4.1 describes the performance measures used to evaluate the MRPR model. Section 4.2 details the hardware and software support used in our experiments. Section 4.3 shows the parameters of the algorithms involved and the data sets chosen. Section 4.4 presents and discusses the results achieved. Finally, Section 4.5 includes additional experiments using different PR techniques within the MRPR model.
4.1. Performance measures
In this work we study the performance of a parallel PR system to improve
the NN classifier. Hence, we need several types of measures to characterize
the abilities of the proposed approach and its variants. In the following, we
briefly describe the considered measures:
• Accuracy: It counts the number of correct classifications with respect to the total number of instances classified [4, 48]. In our experiments we will compute both training and test classification accuracy.
• Reduction rate: It measures the reduction in storage requirements achieved by a PR algorithm:

    ReductionRate = 1 − size(RS)/size(TR)    (1)

Reducing the stored instances in the TR set will yield a time reduction when classifying a new input sample. For example, a reduced set of 2500 prototypes obtained from a training set of 1000000 instances attains a reduction rate of 1 − 2500/1000000 = 0.9975.
• Runtime: We will quantify the total time spent by MRPR to generate the RS, including all the computations performed by the MapReduce framework.

• Test classification time: It refers to the time needed to classify all the instances of TS with respect to a given TR. For PR, it is directly related to the reduction rate.
• Speed up: It usually checks the efficiency achieved by a parallel system in comparison with the sequential version of the algorithm; that is, it measures the relation between the runtimes of the sequential and parallel versions. If the calculation is executed on c processing cores and is considered fully parallelizable, the maximum theoretical speed up would be equal to the number of cores used, according to Amdahl's law [49]. With a MapReduce parallelization scheme, each map corresponds to a single core, so that the number of mappers used determines the maximum attainable speed up. However, due to the magnitude of the data sets used, we cannot run the sequential version of the selected PR technique (SSMA-SFLSDE) because its execution is extremely slow. For this reason, we will take the runtime with the minimum number of mappers as the reference time to calculate the speed up. Therefore, the speed up will be computed as:

    Speedup = (parallel time with minimum number of mappers) / (parallel time)    (2)
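For instance, under this definition, a configuration whose runtime is half of the reference runtime obtained with the minimum number of mappers achieves a speed up of 2.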
4.2. Hardware and software used
The experiments have been carried out on a cluster of twelve nodes: one master node and eleven compute nodes. Each of the compute nodes has the following features:
• Processors: 2 x Intel Xeon CPU E5-2620
• Cores: 6 per processor (12 threads)
• Clock Speed: 2.00 GHz
• Cache: 15 MB
• Network: Gigabit Ethernet (1 Gbps)
• Hard drive: 2 TB
• RAM: 64 GB
The master node works as the user interface and hosts both Hadoop master processes: the NameNode and the JobTracker. The NameNode handles the HDFS, coordinating the slave machines by means of their respective DataNode processes and keeping track of the files and the replications of each HDFS block. The JobTracker is the MapReduce framework master process that manages the TaskTrackers of each compute node. Its responsibilities are maintaining the load balance and the fault tolerance in the system, ensuring that all nodes get their part of the input data chunk and reassigning the parts that could not be executed.
The specific details of the software used are the following:
• MapReduce implementation: Hadoop 2.0.0-cdh4.4.0, MapReduce 1 runtime (Classic); Cloudera's open-source Apache Hadoop distribution [50].
• Maximum maps tasks: 128.
• Maximum reducer tasks: 1.
• Machine learning library: Mahout 0.8.
• Operating system: Cent OS 6.4.
Note that the total number of cores in the cluster is 132. However, the maximum number of map tasks is limited to 128, and a single task is reserved for the reducer.
4.3. Data sets and methods
In this experimental study we will use four big classification data sets taken from the UCI repository [51]. Table 1 summarizes their main characteristics: for each data set, we show the number of examples (#Examples), the number of attributes (#Dimension) and the number of classes (#ω).
Table 1: Summary description of the big data classification data sets used.

Data set                               #Examples   #Dimension   #ω
PokerHand                              1025010     10           10
KddCup 1999 (DOS vs. normal classes)   4856151     41           2
Susy                                   5000000     18           2
RLCP                                   5749132     4            2
These data sets have been partitioned using a 5-fold cross-validation (5-fcv) scheme: the data set is split into 5 folds, each one containing 20% of the examples. For each fold, a PR algorithm is run over the examples contained in the remaining folds (that is, in the training partition, TR).
Table 2: Approximate number of instances in each TRj subset according to the number of mappers used.

                              Number of mappers
Data set            64     128     256    512   1024
PokerHand        12813    6406    3203   1602    801
Kddcup (10%)      6070    3035    1518    759    379
Kddcup (50%)     30351   15175    7588   3794   1897
Kddcup (100%)    60702   30351   15175   7588   3794
Susy             62469   31234   15617   7809   3904
RLCP             71862   35931   17965   8983   4491
Then, the resulting RS is tested on the current fold using the NN rule. Test partitions are kept aside during the PR phase in order to analyze the generalization capabilities provided by the generated RS. Because of the randomness of some of the operations that these algorithms perform, they have been run three times per partition.
Aiming to investigate the effect of the number of instances on our MRPR scheme, we create three different versions of the KDD Cup data set by randomly selecting 10%, 50% and 100% of the instances of the original data set. We denote these versions as Kddcup (10%), Kddcup (50%) and Kddcup (100%). The number of instances of a data set and the number of mappers used in our scheme are directly related. Table 2 shows the approximate number of instances per chunk, that is, the size of each TRj for MRPR, according to the number of mappers established. When the number of instances per chunk exceeds twenty thousand, the execution of the PR technique is not feasible in time; therefore, we are unable to carry out those experiments.
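The values in Table 2 follow directly from this partitioning: each TRj receives roughly size(TR)/m instances, where TR holds 4/5 of the data set (5-fcv) and m is the number of mappers. A minimal sketch of this arithmetic (ours, for illustration):

    def chunk_size(n_examples, n_mappers, folds=5):
        # The training partition holds (folds-1)/folds of the data set,
        # and MRPR splits it evenly across the mappers
        tr_size = n_examples * (folds - 1) / folds
        return round(tr_size / n_mappers)

    print(chunk_size(1025010, 64))    # PokerHand, 64 mappers      -> ~12813
    print(chunk_size(4856151, 512))   # Kddcup (100%), 512 mappers -> ~7588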
As we stated before, we focus on the hybrid SSMA-SFLSDE algorithm [27] to test the MRPR model. However, in Section 4.5, we conduct some additional experiments with other PR techniques. Concretely, we use LVQ3 [52] and RSP3 [45] as pure prototype generation algorithms, as well as DROP3 [43] and FCNN [53] as prototype selection algorithms.

Furthermore, we use the ENN algorithm [41] as the edition method for the filtering-based reducer. For the fusion-based reducer, we apply a very accurate centroid-based technique called ICLP2 [47] when SSMA-SFLSDE and LVQ3 are run in the map phase, motivated by the high reduction rate of these positioning adjustment methods. For RSP3, DROP3 and FCNN we rely on a faster fusion method known as GMCA [44].
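To make the role of the reducers concrete, the following sketch (ours, not the actual MRPR implementation; ICLP2 and GMCA are considerably more elaborate merging procedures) contrasts the join reducer, which simply concatenates the partial RSj sets, with a filtering reducer that applies Wilson's ENN [41] to the joined set:

    import numpy as np

    def join_reducer(partial_sets):
        # Join: concatenate the partial reduced sets RSj as they arrive
        xs, ys = zip(*partial_sets)
        return np.vstack(xs), np.concatenate(ys)

    def filtering_reducer(partial_sets, k=3):
        # Filtering: join, then apply ENN, which keeps a prototype only if
        # the majority of its k nearest neighbors share its class
        # (assumes integer class labels)
        x, y = join_reducer(partial_sets)
        keep = np.zeros(len(x), dtype=bool)
        for i in range(len(x)):
            dists = np.linalg.norm(x - x[i], axis=1)
            dists[i] = np.inf                  # exclude the point itself
            nn = np.argsort(dists)[:k]
            votes = np.bincount(y[nn], minlength=int(y.max()) + 1)
            keep[i] = votes.argmax() == y[i]
        return x[keep], y[keep]

A fusion reducer would additionally merge close prototypes of the same class into centroids, which explains the higher reduction rates reported for it below.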
Table 3: Parameter specification for all the methods involved in the experimentation.

Algorithm        Parameters
MRPR             Number of mappers = 64/128/256/512/1024, Number of reducers = 1,
                 Type of reduce = Join/Filtering/Fusion
SSMA-SFLSDE      PopulationSFLSDE = 40, IterationsSFLSDE = 500,
                 iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9
ICLP2 (Fusion)   Filtering method = RT2
ENN (Filtering)  Number of neighbors = 3, Euclidean distance
NN               Number of neighbors = 1, Euclidean distance
LVQ3             Iterations = 100, alpha = 0.1, WindowWidth = 0.2, epsilon = 0.1
RSP3             Subset choice = Diameter
DROP3            Number of neighbors = 3, Euclidean distance
FCNN             Number of neighbors = 3, Euclidean distance
GMCA (Fusion)    Number of neighbors = 1, Euclidean distance
In addition, the NN classifier has been included as a performance baseline. Table 3 presents all the parameters involved in our experimental study. These parameters have been fixed according to the recommendations of the corresponding authors of each algorithm. Note that our research is not devoted to optimizing the accuracy obtained with a PR method on a specific problem; we focus our experiments on analyzing the behavior of the proposed parallel system. To do so, we study the influence of the number of mappers and the type of reduce on the accuracy achieved and the runtime needed. In some of the experiments we use a higher number of mappers than the available map tasks (128). In these cases, the Hadoop system queues the remaining tasks, which are dispatched as soon as any map task finishes its processing.
A brief description of the PR methods used follows:
• SSMA-SFLSDE: This algorithm is a hybridization of prototype selection and generation. First, a prototype selection step is performed based on the memetic algorithm SSMA [32]. This approach makes use of a local search specifically developed for prototype selection, and this initial step allows us to find a promising selection of prototypes per class. Then, the resulting RS is inserted as one of the individuals of the population of an adaptive differential evolution algorithm [54, 55], which acts as a prototype generation model to adjust the positioning of the selected prototypes.
• LVQ3: This method combines strategies to “punish” or “reward” prototypes in order to adjust the positioning of a set of initial prototypes. Therefore, it is included in the positioning adjustment family.
• RSP3: This technique tries to avoid drastic changes in the form of the decision boundaries associated with TR by splitting it into different subsets according to the highest overlapping degree [45]. As such, it belongs to the family of space-splitting PR techniques.
• DROP3: This model combines a noise-filtering stage and a decremental approach to remove, from the original TR set, instances that are considered harmful according to their nearest neighbors. It is included in the family of hybrid edition and condensation PR techniques.
• FCNN: Following an incremental methodology, this algorithm starts by introducing into the resulting RS the centroid of each class. Then, prototypes contained in TR are added according to the nearest neighbor of each centroid. It belongs to the condensation-based family.
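All of these techniques are finally evaluated with the NN rule; a minimal sketch of 1-NN classification with a generated RS (our own illustration, using the Euclidean distance of Table 3) is:

    import numpy as np

    def nn_classify(rs_x, rs_y, queries):
        # NN rule: label each query with the class of its nearest
        # prototype in the reduced set (Euclidean distance)
        preds = np.empty(len(queries), dtype=rs_y.dtype)
        for i, q in enumerate(queries):
            dists = np.linalg.norm(rs_x - q, axis=1)
            preds[i] = rs_y[np.argmin(dists)]
        return preds

    # Toy usage: a two-prototype RS classifying two queries
    rs_x = np.array([[0.0, 0.0], [1.0, 1.0]])
    rs_y = np.array([0, 1])
    print(nn_classify(rs_x, rs_y, np.array([[0.1, 0.2], [0.9, 0.8]])))  # [0 1]

Since each query is compared against every stored prototype, the test classification time grows linearly with size(RS), which is why the reduction rate matters so much at this scale.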
4.4. Exhaustive evaluation of the MRPR framework for the SSMA-SFLSDE
method
This section presents and analyzes the results collected in the experimental study with the SSMA-SFLSDE method from two different points of
view:
• Firstly, we study the accuracy and reduction results obtained with the
three implemented reducers of the MRPR model. We will check the
performance achieved in comparison with the NN rule (Section 4.4.1).
• Secondly, we analyze the scalability of the proposed approach in terms
of runtime and speed up (Section 4.4.2).
Tables 4, 5, 6 and 7 summarize all the results obtained on the considered data sets. They show the training/test accuracy, runtime and reduction rate obtained by the SSMA-SFLSDE algorithm within our MRPR framework, depending on the number of mappers (#Mappers) and the reduce type. For each of these measures, average (Avg.) and standard deviation (Std.) results are presented (from the 5-fcv experiment).
Table 4: Results obtained for the PokerHand problem.

Reduce type  #Mappers  Train Avg.  Train Std.  Test Avg.  Test Std.  Runtime Avg.  Runtime Std.  Red. rate Avg.  Red. rate Std.  Clas. time (TS)
Join         64        0.5158      0.0007      0.5102     0.0008     13236.6012    147.8684      97.5585         0.0496          1065.1558
Filtering    64        0.5212      0.0008      0.5171     0.0014     13292.8996    222.3406      98.0714         0.0386          848.0034
Fusion       64        0.5201      0.0011      0.5181     0.0015     14419.3926    209.9481      99.1413         0.0217          374.8814
Join         128       0.5111      0.0005      0.5084     0.0011     3943.3628     161.4213      97.2044         0.0234          1183.6378
Filtering    128       0.5165      0.0007      0.5140     0.0007     3949.2838     135.4213      97.7955         0.0254          920.8190
Fusion       128       0.5157      0.0012      0.5139     0.0006     4301.2796     180.5472      99.0250         0.0119          419.6914
Join         256       0.5012      0.0010      0.4989     0.0010     2081.0662     23.6610       96.5655         0.0283          1451.1200
Filtering    256       0.5045      0.0010      0.5024     0.0006     2074.0048     25.4510       97.2681         0.0155          1135.2452
Fusion       256       0.5161      0.0004      0.5151     0.0007     2231.4050     14.3391       98.8963         0.0045          478.8326
Join         512       0.5066      0.0007      0.5035     0.0009     1101.8868     16.6405       96.2849         0.0487          1545.4300
Filtering    512       0.5114      0.0010      0.5091     0.0005     1101.2614     13.0263       97.1122         0.0370          1472.6066
Fusion       512       0.5088      0.0008      0.5081     0.0009     1144.8080     18.3065       98.7355         0.0158          925.1834
Join         1024      0.4685      0.0008      0.4672     0.0008     598.2918      11.6175       95.2033         0.0202          2132.7362
Filtering    1024      0.4649      0.0009      0.4641     0.0010     585.4320      8.4529        96.2073         0.0113          1662.5460
Fusion       1024      0.5052      0.0003      0.5050     0.0009     601.0838      7.4914        98.6249         0.0157          1345.6998
NN           –         0.5003      0.0007      0.5001     0.0011     –             –             –               –               48760.8242
Table 5: Results obtained for the Kddcup (100%) problem.

Reduce type  #Mappers  Train Avg.  Train Std.  Test Avg.  Test Std.  Runtime Avg.  Runtime Std.  Red. rate Avg.  Red. rate Std.  Clas. time (TS)
Join         256       0.9991      0.0003      0.9993     0.0003     8536.4206     153.7057      99.9208         0.0007          1630.8426
Filtering    256       0.9991      0.0003      0.9991     0.0003     8655.6950     148.6363      99.9249         0.0009          1308.1294
Fusion       256       0.9994      0.0000      0.9994     0.0000     8655.6950     148.6363      99.9279         0.0008          1110.4478
Join         512       0.9991      0.0001      0.9992     0.0001     4614.9390     336.0808      99.8645         0.0010          5569.8084
Filtering    512       0.9989      0.0001      0.9989     0.0001     4941.7682     44.8844       99.8708         0.0013          5430.4020
Fusion       512       0.9992      0.0001      0.9993     0.0001     5018.0266     62.0603       99.8660         0.0006          2278.2806
Join         1024      0.9990      0.0002      0.9991     0.0002     2620.5402     186.5208      99.7490         0.0010          5724.4108
Filtering    1024      0.9989      0.0000      0.9989     0.0001     3103.3776     15.4037       99.7606         0.0011          4036.5422
Fusion       1024      0.9991      0.0002      0.9991     0.0002     3191.2468     75.9777       99.7492         0.0010          4247.8348
NN           –         0.9994      0.0001      0.9993     0.0001     –             –             –               –               2354279.8650
Moreover, the average classification time on TS is computed as the time needed to classify all the instances of TS with the corresponding RS generated by MRPR. Furthermore, we compare these results with the accuracy and the test classification time achieved by the NN classifier, which uses the whole TR set to classify all the instances of TS. In these tables, average accuracies higher than or equal to those obtained with the NN algorithm have been highlighted in bold, and the best ones overall, in the training and test phases, are stressed in italic.
4.4.1. Analysis of accuracy and reduction capabilities
This section focuses on comparing the resulting accuracy and reduction rates of the different versions of MRPR. Figure 4 depicts the test accuracy achieved according to the number of mappers in each of the data sets considered, averaged for each reduce type. The average accuracy of the NN rule is presented as a horizontal line (y = average accuracy) to show the accuracy difference between using the whole TR or a generated RS as the training data set. In addition, Figure 5 plots the reduction rates attained by each type of reduce for all the problems, showing the average reduction rate obtained with 256 mappers.
Table 6: Results obtained for the Susy problem.

Reduce type  #Mappers  Train Avg.  Train Std.  Test Avg.  Test Std.  Runtime Avg.  Runtime Std.  Red. rate Avg.  Red. rate Std.  Clas. time (TS)
Join         256       0.6953      0.0005      0.7234     0.0004     69153.3210    4568.5774     97.4192         0.0604          30347.0420
Filtering    256       0.6941      0.0001      0.7282     0.0003     66370.7020    4352.1144     97.7690         0.0046          24686.3550
Fusion       256       0.6870      0.0002      0.7240     0.0002     69796.7260    4103.9986     98.9068         0.0040          11421.6820
Join         512       0.6896      0.0012      0.7217     0.0003     26011.2780    486.6898      97.2050         0.0052          35067.5140
Filtering    512       0.6898      0.0002      0.7241     0.0003     28508.2390    484.5556      97.5609         0.0036          24867.5478
Fusion       512       0.6810      0.0002      0.7230     0.0002     30344.2770    489.8877      98.8337         0.0302          12169.2180
Join         1024      0.6939      0.0198      0.7188     0.0417     13524.5692    1941.2683     97.1541         0.5367          45387.6154
Filtering    1024      0.6826      0.0005      0.7226     0.0006     14510.9125    431.5152      97.3203         0.0111          32568.3810
Fusion       1024      0.6757      0.0004      0.7208     0.0008     15562.1193    327.8043      98.7049         0.0044          12135.8233
NN           –         0.6899      0.0001      0.7157     0.0001     –             –             –               –               1167200.3250
Table 7: Results obtained for the RLCP problem.

Reduce type  #Mappers  Train Avg.  Train Std.  Test Avg.  Test Std.  Runtime Avg.  Runtime Std.  Red. rate Avg.  Red. rate Std.  Clas. time (TS)
Join         256       0.9963      0.0000      0.9963     0.0000     29549.0944    62.4140       98.0091         0.0113          10534.0450
Filtering    256       0.9963      0.0000      0.9963     0.0000     29557.2276    62.7051       98.0091         0.0113          10750.9012
Fusion       256       0.9963      0.0000      0.9963     0.0000     26814.9270    1574.4760     98.6291         0.0029          10271.0902
Join         512       0.9962      0.0001      0.9962     0.0000     10093.9022    61.6980       97.9911         0.0019          11767.8596
Filtering    512       0.9962      0.0001      0.9962     0.0000     10916.6962    951.5328      97.9919         0.0016          11689.1144
Fusion       512       0.9962      0.0001      0.9963     0.0000     11326.7812    85.6898       98.3012         0.0036          10856.8888
Join         1024      0.9960      0.0001      0.9960     0.0001     5348.4346     20.6944       97.9781         0.0010          10930.7026
Filtering    1024      0.9960      0.0001      0.9960     0.0001     5328.0388     14.8981       97.9781         0.0010          11609.2740
Fusion       1024      0.9960      0.0001      0.9960     0.0001     5569.2214     16.5025       98.2485         0.0015          10653.3659
NN           –         0.9946      0.0001      0.9946     0.0001     –             –             –               –               769706.2186
[Figure 4: Test accuracy results: (a) PokerHand, (b) Kddcup (100%), (c) Susy, (d) RLCP; test accuracy according to the number of mappers for the Join, Filtering and Fusion reduce types.]

[Figure 5: Reduction rate achieved with 256 mappers on PokerHand, Kddcup (100%), RLCP and Susy, for each reduce type.]
According to these graphics and tables, we can make several observations:

• Since, within the MRPR framework, a PR algorithm does not have access to the full information about the problem addressed, the accuracy obtained is expected to decrease as the number of instances available in the training set used by each map is reduced. How strongly the accuracy is reduced depends crucially on the specific problem tackled and its complexity. However, this behavior should be generalizable to most problems, because there will be a minimum number of instances below which the performance decreases drastically. Observing the previous tables and graphics, we can see that in the case of the PokerHand problem the performance is markedly deteriorated when the problem is divided into 1024 subsets (mappers), in both the training and test phases. In the Susy data set, the accuracy deteriorates gradually as the number of mappers is incremented. For the Kddcup (100%) and RLCP problems, the performance is only very slightly reduced when the number of mappers is increased (on the order of three or four ten-thousandths).
• Nevertheless, it is important to highlight that, although the accuracy of the PR algorithm may gradually decrease, it does not fall far from that achieved with the NN rule. In fact, it can even be higher, as happens for the PokerHand, Susy and RLCP problems. This situation occurs because PR techniques remove noisy instances from the TR set that damage the classification performance of the NN rule. Moreover, PR models typically smooth the decision boundaries between classes, which usually translates into an improvement of the generalization capabilities (test accuracy).
• When tackling large-scale problems, the reduction rate of a PR technique becomes much more important, under the premise that the accuracy is not greatly deteriorated. A high reduction rate implies a significant decrease in the computational time spent classifying new instances. As we commented before, the accuracy obtained by our model does not drop dramatically when the number of mappers is increased, and the same behavior is found in terms of reduction capabilities. The number of mappers also influences the reduction rate achieved, because the lack of information about the whole problem may degrade the reduction capabilities of PR techniques. However, in general, the reduction rates obtained are very high, representing the original problem with less than 5% of the total number of instances. This allows us to classify the TS very quickly.
• Independently of the number of mappers and the type of reduce, there are no major differences between the results of the training and test phases. The partitioning process slightly reduces the accuracy and reduction rate because of the lack of complete information. By contrast, this mechanism helps avoid the overfitting problem, that is, overlearning the training set.
• Comparing the different reduce types, we can check that, in general, the fusion approach outperforms the other kinds of reducers in most of the data sets, resulting in better training and test accuracy. It is noteworthy that, in the case of the PokerHand data set, when the other types of reducers decrease their performance with 1024 mappers, the fusion reducer is able to preserve its accuracy. We can also observe that the filtering reducer provides higher accuracy than the join approach in the PokerHand and Susy problems, while their results are very similar for the Kddcup (100%) and RLCP sets.
• A quick glance at Figure 5 reveals that the fusion scheme always reports the highest reduction rate, followed by the filtering scheme. Besides promoting a higher reduction rate, the fusion reducer has also shown the best accuracy. Therefore, merging the resultant RSj sets with a fusion or a filtering process provides better accuracy and reduction rates than a simple joining phase.
• Considering the results provided by the NN rule with the whole TR, Figure 4 shows that, in terms of accuracy, the MRPR model with the fusion scheme outperforms the NN rule in the PokerHand, Susy and RLCP problems, and a very similar behavior is reached for the Kddcup (100%) data set. Moreover, the reduction rate attained by the MRPR model implies a much lower test classification time. For example, we can see in Table 4 that the classification of the PokerHand data set can be performed up to 130 times faster than with the NN classifier when the fusion method and 64 mappers are used. A similar improvement is achieved in the Susy and RLCP problems. For the Kddcup (100%) data set this improvement is much more accentuated: classifying the test set can be approximately 2120 times faster (using the fusion reducer and 256 mappers). These results demonstrate and exemplify the necessity of applying PR techniques to large-scale problems.
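These factors follow directly from the classification times reported in Tables 4 and 5:

    # Test classification times (seconds) taken from Tables 4 and 5
    nn_poker, mrpr_poker = 48760.8242, 374.8814     # fusion, 64 mappers
    nn_kdd, mrpr_kdd = 2354279.8650, 1110.4478      # fusion, 256 mappers

    print(nn_poker / mrpr_poker)   # ~130x faster than NN on PokerHand
    print(nn_kdd / mrpr_kdd)       # ~2120x faster on Kddcup (100%)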
4.4.2. Analysis of the scalability
In this part of the experimental study we concentrate on the analysis of the runtime and speed up of the MRPR model. As defined in Section 4.3, we divided the Kddcup problem into three sets with different numbers of instances, aiming to study the influence of the number of instances on the same problem. Figure 6 draws the average runtime (obtained in the 5-fcv experiment) according to the number of mappers used for each problem considered. Moreover, Figure 7 depicts the speed up achieved by MRPR with the fusion reducer.
Note that, as we clarified in Section 4.1, the speed up has been computed using the runtime with the minimum number of mappers (minMaps) as the reference time. This implies that the speed up does not represent the gain obtained with respect to the number of cores. In this chart, the speed up of MRPR with minMaps in each data set is set to 1. Since the complexity of SSMA-SFLSDE is O((n · D)^2), we cannot expect a quadratic speed up, because the proposed scheme is focused only on the number of instances. Furthermore, it is very important to remember that, in the cluster used, the maximum number of mappers available at the same time is 128, and the remaining tasks are queued.
Figure 8 presents an average runtime comparison between the results obtained on the three versions of the Kddcup problem. It shows, for each set, the average runtime of the MRPR approach with 256, 512 and 1024 mappers, using the fusion-based reducer.

Given these figures and the previous tables, we want to outline the following comments:
• Despite the performance shown by the filtering and fusion reducers in comparison with the joining scheme, all the reduce alternatives spend very similar runtimes to generate a final RS. This means that, although the fusion and filtering reducers require extra computations with respect to the join approach, we take advantage of the way MapReduce works: the reduce stage is executed while the mappers are still finishing, so that most of the extra calculations needed by the filtering and fusion approaches are performed before all the mappers have finished.

[Figure 6: Average runtime obtained by MRPR: (a) PokerHand, (b) Kddcup (100%), (c) Susy, (d) RLCP; runtime (s) according to the number of mappers for each reduce type.]
• In Figure 7, we can observe different tendencies depending on the data set used. This is due to the fact that these problems have different numbers of features, which also determine the complexity of the PR technique. For this reason, it is easier to obtain a higher speed up with PokerHand than, for instance, with the Kddcup problem, because it has fewer characteristics. The same behavior is shown for the Susy and RLCP problems: with a similar number of instances, a slightly better speed up is achieved with RLCP. In addition, according to this figure, we can note that, with the same resources (128 simultaneous mappers), MRPR is able to accelerate the processing of PR techniques by dividing the TR set into a higher number of subsets. As we checked in the previous section, these speed ups do not come at the cost of a significant accuracy loss.

[Figure 7: Speed up achieved by MRPR with the fusion reducer on PokerHand, Kddcup (10%), Kddcup (50%), Kddcup (100%), RLCP and Susy, according to the number of mappers.]

[Figure 8: Runtime comparison on the three versions of the Kddcup problem, using MRPR with the fusion reducer and 256, 512 and 1024 mappers.]
• Figure 8 illustrates the increase in average runtime when the size of the same problem grows. For problems with quadratic complexity, we could expect that, with the same number of mappers, this increase would also be quadratic. In this figure, we can see that the runtime increase is much smaller than quadratic. For example, with 512 mappers, MRPR spends 2571.0068 seconds on Kddcup (50%) and 8655.6950 seconds on the full problem. As shown in Table 2, the approximate number of instances in each TRj subset is twice as large for Kddcup (100%) as for Kddcup (50%) with 512 mappers. Therefore, the computational cost does not increase quadratically.
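Under a simple cost model in which each map processes its chunk of n/m instances at quadratic cost (our simplification, ignoring framework overheads), doubling the chunk size would multiply the per-map runtime by four; the observed factor is smaller:

    # Runtimes reported above for 512 mappers (seconds)
    t_50, t_100 = 2571.0068, 8655.6950

    # Chunk sizes from Table 2: 3794 vs. 7588 instances per mapper
    model_factor = (7588 / 3794) ** 2     # = 4.0 under the quadratic model
    observed_factor = t_100 / t_50        # ~ 3.37

    print(model_factor, observed_factor)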
4.5. Experiments on different PR techniques

In this section we perform some additional experiments using four different PR techniques within the proposed MRPR framework. In these experiments, the number of mappers has been fixed to 64 and we focus on the PokerHand problem. Table 8 shows the results obtained.
Figure 9 presents a comparison across the four techniques within MRPR. Figure 9a depicts the test accuracy obtained by the four techniques using the three reduce types, and Figure 9b shows the time needed to classify the test set. In both plots, the results of the NN rule are presented as a baseline. As before, results better than those of the NN rule are stressed in bold, and the best ones overall are highlighted in italic.
Observing these results, we can see that the MRPR model works appropriately with these techniques. Nevertheless, we can point out several differences with respect to the results obtained with SSMA-SFLSDE:
• Since LVQ3 is a positioning adjustment method with a high reduction rate, we observe a similar behavior between this technique and SSMA-SFLSDE within the MRPR model. Note that this algorithm has also been run with ICLP2 as the fusion method. We can highlight that the filtering and fusion reduce schemes greatly improve the performance of LVQ3 in both accuracy and reduction rate.
Table 8: Results obtained for the PokerHand problem with 64 mappers.

PR technique  Reduce type  Train Avg.  Train Std.  Test Avg.  Test Std.  Runtime Avg.  Runtime Std.  Red. rate Avg.  Red. rate Std.  Clas. time (TS)
LVQ3          Join         0.4686      0.0005      0.4635     0.0014     15.3526       0.8460        97.9733         0.0001          841.5352
LVQ3          Filtering    0.4892      0.0007      0.4861     0.0013     17.7602       0.1760        98.6244         0.0101          487.0822
LVQ3          Fusion       0.4932      0.0010      0.4918     0.0012     83.7830       4.8944        99.3811         0.0067          273.4192
FCNN          Join         0.4883      0.0008      0.4889     0.0010     39.8196       2.1829        17.7428         0.0241          28232.4110
FCNN          Filtering    0.5185      0.0006      0.5169     0.0005     5593.4358     23.1895       47.3255         0.0310          19533.5424
FCNN          Fusion       0.6098      0.0002      0.4862     0.0006     3207.8540     37.2208       72.5604         0.0080          9854.8956
DROP3         Join         0.5073      0.0004      0.5044     0.0014     69.5268       2.5605        77.0352         0.0141          8529.0618
DROP3         Filtering    0.5157      0.0005      0.5124     0.0013     442.9670      2.6939        81.2203         0.0169          8139.5878
DROP3         Fusion       0.5390      0.0004      0.5011     0.0005     198.1450      5.2750        92.3467         0.0043          1811.0866
RSP3          Join         0.6671      0.0003      0.5145     0.0007     219.2912      2.8126        53.0566         0.0554          17668.5268
RSP3          Filtering    0.6491      0.0003      0.5173     0.0008     1898.5854     10.8303       58.8459         0.0280          17181.5448
RSP3          Fusion       0.5786      0.0004      0.5107     0.0010     1448.4272     60.5462       84.3655         0.0189          5741.6588
NN            –            0.5003      0.0007      0.5001     0.0011     –             –             –               –               48760.8242
[Figure 9: Results obtained by MRPR with different PR techniques on the PokerHand problem: (a) test accuracy and (b) test classification time (s) for DROP3, FCNN, LVQ3 and RSP3 under each reduce type, with the NN rule as baseline.]
• In the previous section we observed that the filtering and fusion stages provide a greater reduction rate than the join scheme. In this section, we can see that for FCNN, DROP3 and RSP3 their effect is even more accentuated, due to the fact that these techniques have less reduction power than SSMA-SFLSDE and LVQ3. Therefore, the filtering and fusion algorithms become more important with these techniques in order to achieve a high reduction rate.
• The runtime needed by the filtering and fusion schemes crucially depends on the reduction rate of the technique used. For example, the FCNN method initially provides a very low reduction rate (around 18%), so the runtime of the filtering and fusion reducers is greater than the time needed by the join reducer. However, as commented before, the application of these reducers increases the reduction rate, resulting in a faster classification time.
• As commented previously, we have used a fusion reducer based on GMCA when FCNN, DROP3 and RSP3 are applied. It is noteworthy that this fusion approach has resulted in a faster runtime in comparison with the filtering scheme. Nevertheless, as expected, the performance reached with this fusion reducer, in terms of accuracy, is lower than that obtained with ICLP2 in combination with SSMA-SFLSDE.
• Comparing the results obtained with these techniques and SSMA-SFLSDE, we can observe that the best test accuracy among them is obtained with RSP3 and the filtering scheme (0.5173), with a medium reduction rate (58.8459%). However, the SSMA-SFLSDE algorithm was able to achieve a higher test accuracy (0.5181) using the fusion reducer with a very high reduction rate (99.1413%).
5. Concluding remarks
In this paper we have developed a MapReduce solution for prototype reduction, denominated MRPR. The proposed scheme enables these kinds of techniques to be applied to big classification data sets with promising results. Otherwise, these techniques would be limited to tackling small or medium problems containing no more than several thousand examples, due to memory and runtime restrictions. The MapReduce paradigm has offered a simple, transparent and efficient environment in which to parallelize the prototype reduction computation. Three different reduce types have been investigated, Join, Filtering and Fusion, aiming to provide more accurate preprocessed sets. We have found that a reducer based on fusion of prototypes makes it possible to obtain reduced sets with higher reduction rates and accuracy performance.
The experimental study carried out has shown that MRPR obtains very competitive results. We have tested its behavior with different kinds of PR techniques, analyzing the accuracy, reduction rate and computational cost obtained. In particular, we have studied two prototype selection methods (FCNN and DROP3), two prototype generation techniques (LVQ3 and RSP3) and the hybrid SSMA-SFLSDE algorithm.
The main achievements of MRPR have been:

• It has allowed us to apply PR techniques to large-scale problems.

• It incurs no significant accuracy or reduction losses, while achieving very good speed ups.

• Its application has resulted in a very large reduction of the storage requirements and classification time of the NN rule when dealing with big data sets.
As future work, we consider the study of new frameworks that enable PR techniques to deal with both large-scale and high-dimensional data sets.
Acknowledgment
Supported by the Research Projects TIN2011-28488, P10-TIC-6858 and P11-TIC-7765. D. Peralta holds an FPU scholarship from the Spanish Ministry of Education and Science (FPU12/04902).
References
[1] V. Marx, The big challenges of big data, Nature 498 (7453) (2013) 255–
260.
[2] M. Minelli, M. Chambers, A. Dhiraj, Big Data, Big Analytics: Emerging
Business Intelligence and Analytic Trends for Today’s Businesses (Wiley
CIO), 1st Edition, Wiley Publishing, 2013.
[3] D. Plummer, T. Bittman, T. Austin, D. Cearley, D. Smith, Cloud computing: Defining and describing an emerging phenomenon, Technical report, Gartner (2008).
[4] E. Alpaydin, Introduction to Machine Learning, 2nd Edition, MIT Press,
Cambridge, MA, 2010.
[5] M. Woźniak, M. Graña, E. Corchado, A survey of multiple classifier systems as hybrid systems, Information Fusion 16 (2014) 3–17.
[6] S. Sakr, A. Liu, D. Batista, M. Alomari, A survey of large scale data
management approaches in cloud environments, IEEE Communications
Surveys and Tutorials 13 (3) (2011) 311–336.
[7] J. Bacardit, X. Llorà, Large-scale data mining using genetics-based
machine learning, Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery 3 (1) (2013) 37–61.
[8] J. Dean, S. Ghemawat, Mapreduce: simplified data processing on large
clusters, Communications of the ACM 51 (1) (2008) 107–113.
[9] J. Dean, S. Ghemawat, MapReduce: A flexible data processing tool, Communications of the ACM 53 (1) (2010) 72–77.
[10] S. Ghemawat, H. Gobioff, S.-T. Leung, The google file system, in: Proceedings of the nineteenth ACM symposium on Operating systems principles, SOSP ’03, 2003, pp. 29–43.
[11] M. Snir, S. Otto, MPI-The Complete Reference: The MPI Core, MIT
Press, 1998.
[12] W. Zhao, H. Ma, Q. He, Parallel k-means clustering based on mapreduce, in: M. Jaatun, G. Zhao, C. Rong (Eds.), Cloud Computing, Vol.
5931 of Lecture Notes in Computer Science, Springer Berlin Heidelberg,
2009, pp. 674–679.
[13] A. Srinivasan, T. Faruquie, S. Joshi, Data and task parallelism in ILP
using mapreduce, Machine Learning 86 (1) (2012) 141–168.
[14] Q. He, C. Du, Q. Wang, F. Zhuang, Z. Shi, A parallel incremental
extreme svm classifier, Neurocomputing 74 (16) (2011) 2532 – 2540.
[15] I. Palit, C. Reddy, Scalable and parallel boosting with mapreduce, IEEE
Transactions on Knowledge and Data Engineering 24 (10) (2012) 1904–
1916.
[16] G. Caruana, M. Li, Y. Liu, An ontology enhanced parallel SVM for
scalable spam filter training, Neurocomputing 108 (2013) 45 – 57.
[17] D. Pyle, Data Preparation for Data Mining, The Morgan Kaufmann
Series in Data Management Systems, Morgan Kaufmann, 1999.
[18] H. Liu, H. Motoda (Eds.), Computational Methods of Feature Selection,
Chapman & Hall/Crc Data Mining and Knowledge Discovery Series,
Chapman & Hall/Crc, 2007.
[19] S. Garcı́a, J. Derrac, J. Cano, F. Herrera, Prototype selection for nearest
neighbor classification: Taxonomy and empirical study, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3) (2012) 417–
435.
[20] I. Triguero, J. Derrac, S. Garcı́a, F. Herrera, A taxonomy and experimental study on prototype generation for nearest neighbor classification,
IEEE Transactions on Systems, Man, and Cybernetics–Part C: Applications and Reviews 42 (1) (2012) 86–100.
[21] J. Derrac, S. Garcı́a, F. Herrera, IFS-CoCo: Instance and feature selection based on cooperative coevolution with nearest neighbor rule,
Pattern Recognition 43 (6) (2010) 2082–2105.
[22] J. Derrac, C. Cornelis, S. Garcı́a, F. Herrera, Enhancing evolutionary
instance selection algorithms by means of fuzzy rough set based feature
selection, Information Sciences 186 (1) (2012) 73–92.
[23] N. Garcı́a-Pedrajas, A. de Haro-Garcı́a, J. Pérez-Rodrı́guez, A scalable
approach to simultaneous evolutionary instance and feature selection,
Information Sciences 228 (2013) 150–174.
[24] T. M. Cover, P. E. Hart, Nearest neighbor pattern classification, IEEE
Transactions on Information Theory 13 (1) (1967) 21–27.
[25] L. Nanni, A. Lumini, Particle swarm optimization for prototype reduction, Neurocomputing 72 (4-6) (2008) 1092–1097.
[26] I. Triguero, S. Garcı́a, F. Herrera, IPADE: Iterative prototype adjustment for nearest neighbor classification, IEEE Transactions on Neural
Networks 21 (12) (2010) 1984–1990.
[27] I. Triguero, S. Garcı́a, F. Herrera, Differential evolution for optimizing
the positioning of prototypes in nearest neighbor classification, Pattern
Recognition 44 (4) (2011) 901–916.
[28] J. R. Cano, F. Herrera, M. Lozano, Stratification for scaling up evolutionary prototype selection, Pattern Recognition Letters 26 (7) (2005)
953–963.
[29] J. Derrac, S. Garcı́a, F. Herrera, Stratified prototype selection based
on a steady-state memetic algorithm: a study of scalability, Memetic
Computing 2 (3) (2010) 183–199.
[30] I. Triguero, J. Derrac, S. Garcı́a, F. Herrera, A study of the scaling up
capabilities of stratified prototype generation, in: Proceedings of the
third World Congress on Nature and Biologically Inspired Computing
(NABIC’11), 2011, pp. 304–309.
[31] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, E. Chang, Parallel spectral clustering in distributed systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (3) (2011) 568–586.
[32] S. Garcı́a, J. R. Cano, F. Herrera, A memetic algorithm for evolutionary
prototype selection: A scaling up approach, Pattern Recognition 41 (8)
(2008) 2693–2709.
[33] N. Garcı́a-Pedrajas, J. Pérez-Rodrı́guez, Multi-selection of instances: A
straightforward way to improve evolutionary instance selection, Applied
Soft Computing 12 (11) (2012) 3590 – 3602.
[34] B. He, W. Fang, Q. Luo, N. K. Govindaraju, T. Wang, Mars: A mapreduce framework on graphics processors, in: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, ACM, New York, NY, USA, 2008, pp. 260–269.
[35] J. Talbot, R. M. Yoo, C. Kozyrakis, Phoenix++: Modular mapreduce
for shared-memory systems, in: Proceedings of the Second International
Workshop on MapReduce and Its Applications, ACM, New York, NY,
USA, 2011, pp. 9–16. doi:10.1145/1996092.1996095.
[36] T. White, Hadoop: The Definitive Guide, 3rd Edition, O’Reilly Media,
Inc., 2012.
[37] Apache Hadoop Project, Apache Hadoop (2013).
URL http://hadoop.apache.org/
[38] Apache Mahout Project, Apache Mahout (2013).
URL http://mahout.apache.org/
[39] C.-L. Chang, Finding prototypes for nearest neighbor classifiers, IEEE
Transactions on Computers 23 (11) (1974) 1179–1184.
[40] C.-T. Chu, S. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Ng, K. Olukotun,
Map-reduce for machine learning on multicore, in: Advances in Neural
Information Processing Systems, 2007, pp. 281–288.
[41] D. L. Wilson, Asymptotic properties of nearest neighbor rules using
edited data, IEEE Transactions on System, Man and Cybernetics 2 (3)
(1972) 408–421.
[42] P. E. Hart, The condensed nearest neighbor rule, IEEE Transactions on Information Theory 14 (3) (1968) 515–516.
[43] D. R. Wilson, T. R. Martinez, Reduction techniques for instance-based
learning algorithms, Machine Learning 38 (3) (2000) 257–286.
[44] R. Mollineda, F. Ferri, E. Vidal, A merge-based condensing strategy for
multiple prototype classifiers, IEEE Transactions on Systems, Man and
Cybernetics B 32 (5) (2002) 662–668.
[45] J. S. Sánchez, High training set size reduction by space partitioning and
prototype abstraction, Pattern Recognition 37 (7) (2004) 1561–1564.
[46] J. S. Sánchez, R. Barandela, A. I. Marqués, R. Alejo, J. Badenas, Analysis of new techniques to obtain quality training sets, Pattern Recognition
Letters 24 (7) (2003) 1015–1022.
[47] W. Lam, C. K. Keung, D. Liu, Discovering useful concept prototypes for classification based on filtering and abstraction, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (8) (2002) 1075–1090.
[48] I. H. Witten, E. Frank, Data Mining: Practical machine learning tools
and techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
[49] G. M. Amdahl, Validity of the single processor approach to achieving
large scale computing capabilities, in: Proc. Spring Joint Comput. Conf.,
ACM, 1967, pp. 483–485.
[50] Cloudera, Cloudera distribution including apache hadoop (2013).
URL http://www.cloudera.com
[51] A. Frank, A. Asuncion, UCI machine learning repository (2010).
URL http://archive.ics.uci.edu/ml
[52] T. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (9)
(1990) 1464–1480.
[53] F. Angiulli, Fast nearest neighbor condensation for large data sets classification, IEEE Transactions on Knowledge and Data Engineering 19 (11)
(2007) 1450–1464.
[54] K. V. Price, R. M. Storn, J. A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, Natural Computing Series, Springer, 2005.
[55] F. Neri, V. Tirronen, Scale factor local search in differential evolution,
Memetic Computing 1 (2) (2009) 153–171.
2. Self-labeling with prototype generation/selection for semi-supervised classification
The journal papers associated with this part are:
2.1 Self-Labeled Techniques for Semi-Supervised Learning: Taxonomy, Software and Empirical Study
• I. Triguero, S. Garcı́a, F. Herrera, Self-Labeled Techniques for Semi-Supervised Learning:
Taxonomy, Software and Empirical Study. Knowledge and Information Systems, in press
(2014), doi: 10.1007/s10115-013-0706-y.
– Status: Accepted for publication.
Knowl Inf Syst
DOI 10.1007/s10115-013-0706-y
SURVEY PAPER
Self-labeled techniques for semi-supervised learning:
taxonomy, software and empirical study
Isaac Triguero · Salvador García · Francisco Herrera
Received: 14 May 2013 / Revised: 21 August 2013 / Accepted: 5 November 2013
© Springer-Verlag London 2013
Abstract Semi-supervised classification methods are suitable tools to tackle training sets
with large amounts of unlabeled data and a small quantity of labeled data. This problem has
been addressed by several approaches with different assumptions about the characteristics of
the input data. Among them, self-labeled techniques follow an iterative procedure, aiming
to obtain an enlarged labeled data set, in which they accept that their own predictions tend
to be correct. In this paper, we provide a survey of self-labeled methods for semi-supervised
classification. From a theoretical point of view, we propose a taxonomy based on the main
characteristics presented in them. Empirically, we conduct an exhaustive study that involves
a large number of data sets, with different ratios of labeled data, aiming to measure their
performance in terms of transductive and inductive classification capabilities. The results
are contrasted with nonparametric statistical tests. Note is then taken of which self-labeled
models are the best-performing ones. Moreover, a semi-supervised learning module has
been developed for the Knowledge Extraction based on Evolutionary Learning software,
integrating analyzed methods and data sets.
Keywords Learning from unlabeled data · Semi-supervised learning · Self-training ·
Co-training · Multi-view learning · Classification
I. Triguero (B) · F. Herrera
Department of Computer Science and Artificial Intelligence, Research Center on Information and Communications Technology (CITIC-UGR), University of Granada, 18071 Granada, Spain
e-mail: triguero@decsai.ugr.es
F. Herrera
e-mail: herrera@decsai.ugr.es
S. García
Department of Computer Science, University of Jaén, 23071 Jaén, Spain
e-mail: sglopez@ujaen.es
1 Introduction
The semi-supervised learning (SSL) paradigm [1] has attracted much attention in many
different fields ranging from bioinformatics to Web mining, where it is easier to obtain unlabeled than labeled data because it requires less effort, expertise and time consumption. In
this context, traditional supervised learning [2] is limited to using labeled data to build a
model [3]. Nevertheless, SSL is a learning paradigm concerned with the design of models
in the presence of both labeled and unlabeled data. Essentially, SSL methods use unlabeled samples to either modify or reprioritize the hypothesis obtained from labeled samples
alone.
SSL is an extension of unsupervised and supervised learning by including additional
information typical of the other learning paradigm. Depending on the main objective of
the methods, we can divide SSL into semi-supervised classification (SSC) [4] and semi-supervised clustering [5,6]. The former focuses on enhancing supervised classification by
minimizing errors in the labeled examples, but it must also be compatible with the input
distribution of unlabeled instances. The latter, also known as constrained clustering, aims to
obtain better-defined clusters than those obtained from unlabeled data. We focus on SSC.
SSC can be categorized into two slightly different settings [7], denoted transductive and
inductive learning. On the one hand, transductive learning concerns the problem of predicting
the labels of the unlabeled examples, given in advance, by taking both labeled and unlabeled
data together into account to train a classifier. On the other hand, inductive learning considers
the given labeled and unlabeled data as the training examples, and its objective is to predict
unseen data. In this paper, we address both settings in order to carry out an extensive analysis
of the performance of the studied methods.
Many different approaches have been suggested and studied in order to classify using
unlabeled data in SSC. They usually make different assumptions related to the link between
the distribution of unlabeled and labeled data. Generative models [8] learn a joint probability
model that depends on the assumption that the data follow a determined parametric model.
There are also other algorithms such as transductive inference for support vector machines
[9] that assume that the classes are well separated and do not cut through dense unlabeled
data. Alternatively, SSC can also be viewed as a graph min-cut problem [10]. If two instances
are connected by a strong edge, their labels are likely to be the same. In this case, the graph
construction determines the behavior of this kind of algorithm [11]. In addition, there are
recent studies that address multiple assumptions in one model [7,12].
A successful methodology to tackle the SSC problem is based on traditional supervised
classification algorithms [2]. These techniques aim to obtain one (or several) enlarged labeled
set(s), based on their most confident predictions, to classify unlabeled data. We denote these
algorithms self-labeled techniques. In the literature [1], self-labeled techniques are typically
divided into self-training and co-training. Self-training [13,14] is a simple and effective SSL
methodology that has been successfully applied in many real instances. In the self-training
process, a classifier is trained with an initial small number of labeled examples, aiming
to classify unlabeled points. Then it is retrained with its own most confident predictions,
enlarging its labeled training set. This model does not make any specific assumptions for the
input data, but it accepts that its own predictions tend to be correct. The standard co-training
[15] methodology assumes that the feature space can be split into two different conditionally
independent views and that each view is able to predict the classes perfectly [16–18]. It
trains one classifier in each specific view, and then the classifiers teach each other the most
confidently predicted examples. Multi-view learning [19] for SSC is usually understood to be
a generalization of co-training, without requiring explicit feature splits or the iterative mutual-
teaching procedure [20–22]. However, these concepts are sparse and frequently confused in
the literature.
There is currently no general categorization focused on self-labeled techniques. In the
literature, we find several SSL surveys [23,24]. These include a general classification of
SSL methods, dividing self-labeled techniques into self-training, co-training and multi-view
learning, but they are not exclusively focused on self-labeled techniques nor especially on
studying the similarities among them. Furthermore, we can find a good theoretical survey
about disagreement-based models in [25], introducing research advances in this paradigm.
Nevertheless, this categorization is a subset of self-labeling approaches without an explicit
definition of a taxonomy.
Because of the absence of a focused taxonomy in the literature, we have observed that the
new algorithms proposed are usually compared with only a subset of the complete family
of self-labeled methods. Furthermore, in most of the studies, no rigorous analysis has been
carried out. They do not follow a complete experimental framework. Instead, the new proposal
is usually contrasted with a reduced number of data sets, only analyzing either the transductive
or the inductive capabilities of the algorithms.
These are the reasons that motivate the global purpose of this paper, which can be divided
into four objectives:
• To propose a new and complete taxonomy based on the main properties observed in self-labeled methods. The taxonomy will allow us to discern the advantages and drawbacks
from a theoretical point of view. As a result, many frequently confused concepts will be
clarified.
• To establish an experimental methodology to analyze the transductive and inductive
capabilities of these kinds of algorithms.
• To make an empirical study of the state of the art of self-labeled techniques. Our goal is to
identify the best methods in each family, depending on the ratio of labeled data, and to
stress the relevant properties of each one.
• To provide an open-source SSL module for the Knowledge Extraction based on Evolutionary Learning (KEEL) software tool [26]. This is a research tool for solving data
mining problems which contains a great number of machine learning algorithms. The
source code of the analyzed algorithms and a wide variety of data sets are available in
this module.
We will conduct experiments involving a total of 55 UCI/KEEL classification data sets with
different ratios of labeled data: 10, 20, 30 and 40 %. The experimental study will include
a statistical analysis based on nonparametric statistical tests [27,28]. Then, we will test the
performance of the best-performing methods over nine data sets obtained from the book of
Chapelle [4]. These problems have a great number of features and a very reduced number
of labeled data. A Web site with all the complementary material is available at http://sci2s.ugr.es/SelfLabeled, including this paper's basic information, the source code of the analyzed algorithms, all the data sets involved and the complete results obtained.
The rest of the paper is organized as follows: Sect. 2 provides a description of the properties
and an enumeration of the methods, as well as related and advanced work on SSC. Section 3
presents the taxonomy proposed. In Sect. 4 we describe the experimental framework, and Sect.
5 examines the results obtained and presents a discussion on them. In Sect. 6 we summarize
our conclusions. “Appendix” shows details and characteristics of the SSL software module.
2 Self-labeled techniques: background
The SSC problem can be defined as follows: Let xp be an example, where xp = (xp1, xp2, ..., xpD, ω), with xp belonging to a class ω and a D-dimensional space in which xpi is the value of the ith feature of the pth sample. Then, let us assume that there is a labeled set L which consists of n instances xp with ω known. Furthermore, there is an unlabeled set U which consists of m instances xq with ω unknown, with m > n. The set L ∪ U forms the training set (denoted TR). The purpose of SSC is to obtain a robust learned hypothesis using TR instead of L alone, which can be applied in two slightly different settings: transductive and inductive learning.

Transductive learning is described as the application of an SSC technique to classify all the m instances xq of U correctly. The class assignment should efficiently represent the distribution of the classes, based on the input distribution of unlabeled instances and the L instances.

Let TS be a test set composed of t unseen instances xr with ω unknown, which has not been used at the training stage of the SSC technique. The inductive learning phase consists of correctly classifying the instances of TS based on the previously learned hypothesis.
This section presents an overview of self-labeled techniques. Three main topics will be
discussed:
• In Sect. 2.1, the common characteristics in self-labeled techniques which will define the
categories of the taxonomy proposed in this paper will be outlined.
• In Sect. 2.2, we briefly enumerate all the self-labeled techniques proposed in the literature.
The complete and abbreviated name will be given together with the proposal reference.
• Finally, Sect. 2.3 explores other areas related to self-labeled techniques and provides a
summary of advanced work in this research field.
2.1 Common characteristics in self-labeled techniques
This section provides a framework, establishing different properties of self-labeled techniques, for the definition of the taxonomy in the next section. Other issues that influence
some of the methods are presented in this section although they are not involved in the
proposed taxonomy. Finally, some criteria will be set in order to compare self-labeled
methods.
2.1.1 Main properties of self-labeled techniques
Self-labeled methods search iteratively for one or several enlarged labeled set(s) (EL) of prototypes to efficiently represent the TR. For simplicity of reading, in what follows we restrict the description of these properties to one EL. The taxonomy proposed will be based on the following characteristics:
• Addition mechanism: There are a variety of schemes by which an EL is formed.

– Incremental: A strictly incremental approach begins with an enlarged labeled set EL = L and adds, step by step, the most confident instances of U if they fulfill certain criteria. In this case, the algorithm crucially depends on the way in which it determines the confidence predictions of each unlabeled instance, that is, the probability of belonging to each class. Under such a scheme, the order in which the instances are added to the EL determines the learning hypotheses and therefore the following stages of the algorithm. One of the most important aspects of this approach is the number of instances added in each iteration. On the one hand, this number could be defined as a constant parameter and/or independent of the classes of the instances. On the other hand, it can be chosen as a value proportional to the number of instances of each class in L. In our experiments, we implement the latter, as suggested in [15]. This is the simplest and most intuitive way of addressing the SSL problem, and it often corresponds to classical self-labeled approaches. One advantage of this approach is that it can be faster during the learning phase than nonincremental algorithms. Nevertheless, the main disadvantage is that strictly incremental algorithms can add instances with erroneous class-label predictions. Hence, if this occurs, the learned hypotheses could produce a low performance in the transductive and inductive phases.
– Batch: Another way to generate an EL set is in batch mode. This involves deciding whether each instance meets the addition criteria before adding any of them to the EL; then, all those that do meet the criteria are added at once. In this sense, batch techniques do not assign a definitive class to each unlabeled instance during the learning process; they can reprioritize the hypotheses obtained from labeled samples. Batch processing suffers from increased time complexity compared with incremental algorithms.
– Amending: Amending models appeared as a solution to the main drawback of the strictly incremental approach. In this approach, the algorithm starts with EL = L and can iteratively add or remove any instance that meets a specific criterion. This mechanism allows rectification of already performed operations, and its main advantage is that it makes it easy to achieve an accuracy-suited EL set of instances. As in the incremental approach, its behavior can also depend on the number of instances added per iteration. Typically, these methods have been designed to avoid the introduction of noisy instances into EL at each iteration [14,29]. However, under the rubric of amending models, many other proposals can be developed. For instance, incremental and batch mode algorithms in combination with a prior or a posterior cleaning phase of noisy instances are considered to be amending models. Despite its flexibility, this scheme usually has high computational demands compared with incremental and batch algorithms.
• Single-classifier versus multi-classifier: Self-labeled techniques can use one or more classifiers during the enlarging phase of the labeled set. As we stated before, all of these methods follow a wrapper methodology, using classifier(s) to establish the possible class of unlabeled instances.

In a single-classifier model, each unlabeled instance belongs to the most probable class assigned by the single classifier used. This implies that these probabilities should be explicitly measured, and there are different ways in which these confidences are computed. For example, in probabilistic models, such as naive Bayes, the confidence predictions can usually be measured as the output probability in prediction, and other methods, such as nearest neighbor [30], could approximate confidence in terms of distance. In general, the main advantage of single-classifier methods is their simplicity, allowing us to compute confidence probabilities faster.

Multi-classifier methods combine the learned hypotheses of several classifiers to predict the class of unlabeled instances. The underlying idea of using multi-classifiers in SSL is that several weak classifiers, learned with a small number of instances, can produce better generalization capabilities than only one weak classifier. These methods are motivated, in part, by the empirical success of ensemble methods. Two different approaches are commonly used to calculate the confidence predictions in multi-classifier methods: agreement of classifiers and combination of the probabilities obtained by the single classifiers. These techniques usually obtain a more accurate precision than single-classifier models. Another effect of using several classifiers in self-labeled techniques is their increased complexity. In both approaches, different confidence measures have been analyzed in the literature; they will be explained in Sect. 2.1.2.
• Single-learning versus multi-learning: Apart from the number of classifiers, a key concern is whether they are constituted by the same (single) or different (multiple) learning algorithms. The number of different learning algorithms used can also determine the confidence predictions of unlabeled data.

In a multi-learning approach, the confidence predictions are computed as the integration of a group of different kinds of learners to boost the classification performance. It works under the hypothesis that different learning techniques present different behaviors, exploiting the classification bias between them to generate locally different models. Multi-learning methods are closely linked with multi-classifier models: a multi-learning method is itself a multi-classifier method in which the different classifiers come from different learning methods. Hence, the general properties explained above for multi-classifiers also extrapolate to multi-learning. A specific drawback of this approach is the choice of the most adequate learning approaches.

By contrast, a single-learning approach can be linked to both single and multi-classifiers. With a single-learning algorithm, the confidence prediction measurement is relaxed with regard to the rest of the properties, type of view and number of classifiers. The goal of this approach, which has been shown to perform well in several domains, is simplicity [20].
• Single-view versus multi-view: This characteristic refers to the way in which the input
feature space is taken into consideration by the self-labeled technique. In a multi-view
self-labeled technique, L is split into two or more subsets of the features L_subk of
dimension M (M < D), by projecting each instance x_p of L onto the selected M-dimensional subspace. Multi-view requires redundant and conditionally independent
views, provided that the class attribute and each subset are sufficient to train a good
classifier. Therefore, the performance of these kinds of techniques depends on the division procedure followed. One important advantage of this kind of view is that if this
assumption is met [31], the multi-view approach makes fewer generalization errors,
using the hypothetical agreement of classifiers. Nevertheless, this could lead to adverse
results being obtained because the compatibility and independence of the features’
subsets obtained are strong assumptions and many real data sets cannot satisfy these
hypotheses.
By contrast, a single-view self-labeled technique does not make any specific assumptions about the input data. Hence, it uses the complete L to train. Most self-labeled techniques adhere to this scheme due to the fact that in many real-world data sets
the feature input space cannot be appropriately divided into conditionally independent
views.
In the literature, many methods have been erroneously classified as multi-view or co-training methods without the requirement of sufficient and redundant views. This term is
frequently confused with single-view methods that use multi-classifiers. Note that multi-view methods need, by definition, several classifiers. However, a single-view method
could use single and multi-classifiers (a sketch of a simple two-view feature split is shown below).
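As a purely mechanical illustration (function name and seed are hypothetical), the following sketch builds two disjoint views by randomly splitting the feature indices; note that a random split does not, by itself, guarantee the redundancy and conditional-independence assumption discussed above:

```python
import numpy as np

def random_two_views(X, seed=0):
    """Project the instances onto two disjoint feature subspaces."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])          # shuffle feature indices
    view1, view2 = perm[:len(perm) // 2], perm[len(perm) // 2:]
    return X[:, view1], X[:, view2]
```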
2.1.2 Other properties
We may remark upon other properties that explore how self-labeled methods work. Although
they influence the operation and hence the results obtained with some of these techniques,
we have not included them in the proposed taxonomy for the sake of simplicity and usability.
• Confidence measures: An inaccurate confidence measure leads to adding mislabeled
examples to EL, which implies a performance degradation of the self-labeling
process. The previously explained characteristics define in a global manner the way in
which the confidence predictions are estimated. Nevertheless, the specific combination
of these characteristics leads to different ideas to establish these probabilities.
– Simple: As we stated before, the probability predictions can be extracted from the
used learning model. This characteristic is essentially present in single-classifier
methods. For example, probabilistic models return an associated probability of
belonging to each class for each unlabeled instance. In decision tree algorithms
[32], the confidence probabilities can be estimated as the accuracy of the leaf that
makes the prediction. Instance-based learning approaches can estimate probabilities
in terms of dissimilarity between instances.
– Agreement and combination: Multi-classifier methods can approximate their confidence probabilities based on the predictions obtained for each classifier. One way
to calculate them is the hypothetical agreement of the used classifiers. Typically, a
majority voting rule is used in which ties are commonly solved by a random process.
By contrast, there are other methods that generate their confidence predictions as the
aggregation of the probabilities obtained by each classifier in order to find a final
confidence value for each unlabeled instance. Furthermore, it would also be possible
to combine the agreement of classifiers with the calculated probabilities, developing
a hybrid confidence prediction model. Not much research is currently underway with
regard to this latter scheme.
When considering these schemes, some questions should be taken into account
depending on the number of different learning algorithms used. In a single-learning
proposal, the method should retain multiple different hypotheses and combine (by agreement or combination) their decisions during the computation of confidence probabilities. In this framework, a mandatory step is to generate new labeled sets based on
the original L. For this purpose, self-labeled techniques usually apply bootstrapping
techniques, such as resampling [33]. Nevertheless, there are other proposals, such as
a tenfold cross-validation scheme to create new labeled sets, as proposed in [34].
The effect obtained is related to the improvement of generalization capabilities with
respect to the use of a single classifier. However, in SSL, the labeled set tends to be
small, and the diversity obtained by bootstrap sampling is limited, due to the fact
that the obtained bootstraps are very similar. In multi-learning approaches, diversity
is obtained from the different learning algorithms; thus, they do not generate new
labeled sets. Both confidence schemes are illustrated in the sketch below.
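The two multi-classifier confidence schemes can be sketched as follows (an illustration assuming the predictions or probabilities of each classifier are already available; ties in the vote are broken here by the lowest class index rather than by a random process):

```python
import numpy as np

def agreement_confidence(votes, n_classes):
    """votes: (n_classifiers, n_instances) predicted class indices.
    Confidence = fraction of classifiers agreeing on the majority class."""
    n_clf = votes.shape[0]
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0), counts.max(axis=0) / n_clf

def combination_confidence(probas):
    """probas: (n_classifiers, n_instances, n_classes) class probabilities.
    Confidence = maximum of the averaged probability distributions."""
    avg = probas.mean(axis=0)
    return avg.argmax(axis=1), avg.max(axis=1)
```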
• Self-teaching versus mutual-teaching: Independently of the learning procedure used,
one of these characteristics appears in multi-classifier methods. In a mutual-teaching
approach, the classifiers teach each other their most confident predicted examples. Each
classifier C_i has an associated EL_i, which is initialized in different ways. At each stage,
all the classifiers are trained with their respective EL_i, and then each EL_i is increased with the
most confident examples obtained as the hypotheses combination (or agreement) of the
remaining classifiers. Under this scheme, a classifier C_i does not use its own predictions to
increase its EL_i. However, if C_i is unable to detect some interesting unlabeled instances
as target examples, the rest of the classifiers may teach different hypotheses to C_i (a minimal sketch of one mutual-teaching stage is given below).
To construct the final learned hypothesis, two approaches can be followed: join or voting
procedures. The join procedure consists of forming a complete EL through the combination of each EL_i without repetitions. With a voting procedure, the final hypothesis is
obtained by applying a majority voting rule, using all the EL_i.
By contrast, the self-teaching property refers to those multi-classifiers that maintain a
single EL. In this case, the combination of hypotheses is blended to form a unique
EL. As far as we know, there is no proposed model that combines mutual-teaching and
self-teaching.
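A minimal sketch of one mutual-teaching stage, under illustrative assumptions (two or more probabilistic scikit-learn-style classifiers and a fixed confidence threshold; the initialization of each EL_i and the final join/voting step are omitted):

```python
import numpy as np

def mutual_teaching_step(clfs, ELs, X_l, y_l, X_u, threshold=0.9):
    # 1) train each classifier Ci on L plus its own ELi
    for clf, EL in zip(clfs, ELs):
        idx = sorted(EL)
        X_tr = np.vstack([X_l, X_u[idx]]) if idx else X_l
        y_tr = np.concatenate([y_l, [EL[i] for i in idx]]) if idx else y_l
        clf.fit(X_tr, y_tr)
    # 2) enlarge each ELi with the combined confident predictions
    #    of the REMAINING classifiers (never Ci's own predictions)
    for i, EL in enumerate(ELs):
        others = [c for k, c in enumerate(clfs) if k != i]
        proba = np.mean([c.predict_proba(X_u) for c in others], axis=0)
        for u in range(len(X_u)):
            if u not in EL and proba[u].max() >= threshold:
                EL[u] = others[0].classes_[proba[u].argmax()]
    return clfs, ELs
```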
• Stopping criteria: This is related to the mechanism used to stop the self-labeling process. It
is an important factor due to the fact that it determines the size of the EL formed and therefore
the learned hypothesis. Three main approaches can be found in the literature. Firstly,
in classical approaches such as self-training, the self-labeling process is repeated until all
the instances from U have been added to EL. If erroneous unlabeled instances are added
to EL, they can damage the classification performance. Secondly, in [15], the authors
suggest choosing examples from a smaller pool instead of the whole unlabeled set to form
the EL, establishing a limited number of iterations (a sketch of this pool-based scheme is
given below). This approach has been successfully applied to many self-labeled methods,
outperforming the overall classification rate of the previous stopping criterion. However,
the maximum number of iterations is usually prefixed, and it is not adaptive to the size of
the data set used. Thirdly, the termination criterion can be satisfied when the classifiers
used in the self-labeling process do not change the learned hypotheses. This criterion limits
the number of unlabeled instances added to EL; however, it does not ensure that erroneous
unlabeled instances will not be added to EL.
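The pool-based scheme of [15] can be sketched as follows (illustrative Python; the pool size, the number of additions per iteration and the wrapper classifier are arbitrary placeholders):

```python
import numpy as np

def self_training_with_pool(clf, X_l, y_l, X_u, pool_size=75, n_add=5,
                            max_iter=40, seed=0):
    rng = np.random.default_rng(seed)
    L_X, L_y = np.array(X_l), list(y_l)
    U = list(range(len(X_u)))                      # still-unlabeled indices
    for _ in range(max_iter):                      # limited number of iterations
        if not U:
            break                                  # classical criterion: U exhausted
        clf.fit(L_X, L_y)
        pool = rng.choice(U, size=min(pool_size, len(U)), replace=False)
        conf = clf.predict_proba(X_u[pool]).max(axis=1)
        best = pool[np.argsort(conf)[-n_add:]]     # most confident in the pool
        L_X = np.vstack([L_X, X_u[best]])
        L_y += list(clf.predict(X_u[best]))
        U = [u for u in U if u not in set(best)]
    return clf.fit(L_X, L_y)
```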
2.1.3 Criteria to compare self-labeled methods
When comparing self-labeled methods, there are a number of criteria that can be used to
compare the relative strengths and weaknesses of each algorithm. These include transductive
and inductive accuracy, influence of the number of labeled instances, noise tolerance and
time requirements.
• Transductive and inductive accuracy: A successful self-labeling algorithm will often be
able to appropriately enlarge the labeled set, increasing the transductive and inductive
accuracy.
• Influence of the number of labeled instances: The number of labeled instances that a self-labeled technique manages determines the learned hypothesis. A good technique should
be able to learn an appropriate hypothesis with the lowest possible number of labeled
instances.
• Noise tolerance: Noisy instances can be harmful if they are added to the labeled set, as
they bias the classification of unlabeled data to incorrect classes, which could make the
enlarged labeled set in the next iteration even more noisy. This problem may especially
occur in the initial stages of this process, and it can also be more harmful with a reduced
number of labeled instances. Two types of noisy instances may appear during the self-labeling process. The first one is caused by the distribution of the instances of L in their
respective classes. It can lead the classifier to erroneously label some instances. Second,
there may be outliers within the original unlabeled data. These can be detected, avoiding
their labeling and inclusion in L.
• Time requirements: The learning process is usually carried out just once on a training set.
If the objective of a specific application is related to transductive learning, this learning
process becomes more important as it is responsible for managing unlabeled data. For
inductive learning, it does not seem to be a very important evaluation method because
test instances are classified based on the learned model. However, if the learning phase
takes too long, it can become impractical for real applications.
2.2 Self-labeled methods
We have performed a thorough search of the specialized literature, identifying the most relevant
self-labeled algorithms. At the time of writing, more than 20 methods have been proposed in
the literature. This section is devoted to enumerating and designating them according to the
standard followed in this paper. For more details on their implementations, the reader can
visit the URL http://sci2s.ugr.es/SelfLabeled. Implementations of most of the algorithms in
Java can be found in the KEEL software [35] (see “Appendix”).
Table 1 presents an enumeration of the self-labeled methods reviewed in this paper. The
complete name, abbreviation and reference are provided for each one. When there is
more than one method in a row, they were proposed together or by the same authors,
and the best-performing method (indicated by the respective authors) is depicted in bold.
Table 1 Self-labeled methods reviewed

Complete name                                              Abbr. name       References
Standard self-training                                     Self-Training    [13]
Standard co-training                                       Co-Training      [15]
Statistical co-learning                                    Statistical-Co   [34]
ASSEMBLE                                                   ASSEMBLE         [36]
Democratic co-learning                                     Democratic-Co    [37]
Self-training with editing                                 SETRED           [14]
Tri-training                                               TriTraining      [20]
Tri-training with editing                                  DE-TriTraining   [38]
Co-forest                                                  CoForest         [21]
Random subspace method for co-training                     Rasco            [39]
Co-training by committee: AdaBoost                         Co-Adaboost      [40,41]
Co-training by committee: bagging                          Co-Bagging       [40,41]
Co-training by committee: RSM                              Co-RSM           [40,41]
Co-training by committee: tree-structured ensemble         Co-Tree          [42]
Co-training with relevant random subspaces                 Rel-Rasco        [43]
Classification algorithm based on local clusters centers   CLCC             [44]
Ant-based semi-supervised classification                   APSSC            [45]
Self-training nearest neighbor rule using cut edges        SNNRCE           [46]
Robust co-training                                         R-Co-Training    [17]
Adaptive co-forest editing                                 ADE-CoForest     [47]
Co-training with NB and SVM classifiers                    Co-NB-SVM        [18]
We will test some representative methods; therefore, only the methods in bold will be compared
in the experimental study.
2.3 Related and advanced work
Nowadays, the use of unlabeled data in conjunction with labeled data is a growing field in
different research lines. Self-labeled techniques form a feasible and promising group to make
use of both kinds of data, which is closely related to other methods and problems. This section
provides a brief review of other topics related to self-labeled techniques and describes other
interesting work and future trends which have been studied in the last few years.
With the same objective as self-labeled techniques, we group the three following methodologies according to Zhu and Goldberg’s book [1]:
• Generative models and cluster-then-label methods: The first attempts to deal with unlabeled data correspond to this area. It includes those methods that assume a joint probability
model p(x, y) = p(y) p(x|y), where p(x|y) is an identifiable mixture distribution, for
example, a Gaussian mixture model. Hence, it follows a determined parametric model
[48] using both unlabeled and labeled data. Cluster-then-label methods are closely related
to generative models. Instead of using a probabilistic model, they apply a previous clustering step to the whole data set, and then they label each cluster with the help of labeled
data. Recent advances in these topics are [8,49].
• Graph-based: This represents the SSC problem as a graph min-cut problem [10]. Labeled
and unlabeled examples constitute the graph nodes, and the similarity measurements
between nodes correspond to the graph edges. The graph construction determines the
behavior of this kind of algorithm [50,51]. These methods usually assume label smoothness over the graph. Its main characteristics are nonparametric, discriminative and transductive in nature. Advanced proposals can be found in [11,52].
• Semi-supervised support vector machines (S3VM): S3VM is an extension of standard
support vector machines (SVM) with unlabeled data [53]. This approach implements
the cluster assumption for SSL, that is, examples in the same data cluster have similar labels, so
classes are well separated and do not cut through dense unlabeled data. This methodology
is also known as transductive SVM, although it learns an inductive rule defined over the
search space. Advanced works on S3VM are [54–56].
Regarding other problems connected with self-labeled techniques, we briefly describe the
following topics:
• Semi-supervised clustering: This problem, also known as constrained clustering, aims to
obtain better-defined clusters than the ones obtained from unlabeled data [57]. Labeled
data are used to define pairwise constraints between examples: must-links and cannot-links. The former link establishes the examples that must be in the same cluster, and the
latter refers to those examples that cannot be in the same cluster [58]. A brief review of
this topic can be found in [59].
• Active learning: With the same objective as SSL, avoiding the cost of data labeling, active
learning [60] tries to select the most important examples from a pool of unlabeled data.
These examples are queried by an expert and are then labeled with the appropriate class,
aiming to minimize effort and time consumption. Many active learning algorithms select
as query the examples with maximum label ambiguity or least confidence. Several hybrid
methods between self-labeled techniques and active learning [41,61–63] have been proposed and show that active learning queries maximize the generalization capabilities of
SSL.
• Semi-supervised dimensionality reduction: This area studies the curse of dimensionality
when it is addressed in an SSL framework. The goal of dimensionality reduction algorithms is to find a faithful low-dimensional mapping, or selection, of the high-dimensional
data. Traditional dimensionality reduction techniques designed for supervised and unsupervised learning, such as linear discriminant analysis [64] and principal component
analysis [65], are not appropriate to deal with both labeled and unlabeled data. Recently,
many different frameworks have been proposed to use classical methods in this environment [66,67]. Two well-known dimensionality reduction solutions are feature selection
and feature extraction [68] which have attracted much attention in recent years [69].
• Self-supervised: This paradigm has been recently presented in [70]. It integrates knowledge from labeled data with some features and knowledge from unlabeled data with all the
features. Thus, self-supervised algorithms learn novel features from unlabeled examples
without destroying partial knowledge previously acquired from labeled examples.
• Partial label: This is a paradigm proposed in [71]. It deals with partially labeled
multi-class classification, in which each instance has a candidate set of labels instead of
a single one, only one of which is correct. In these circumstances, a classifier should learn
how to disambiguate the partially labeled training instance and generalize to unseen data.
3 Self-labeled techniques: taxonomy
The main characteristics of the self-labeled methods have been described in Sect. 2.1.1,
and they can be used to categorize the self-labeled methods proposed in the literature. The
type of view, number of learning algorithms, number of classifiers and addition mechanism
constitute a set of properties that define each method. This section presents the taxonomy of
self-labeled techniques based on these properties.
Figure 1 illustrates the categorization following a hierarchy based on this order: type of
view, number of learning algorithms, number of classifiers and addition mechanism. Considering this figure and the year of publication of each analyzed method, some interesting
observations about existing and still missing proposals can be made:
• The number of single-classifier methods is smaller than that of multi-classifier ones. They constitute
four of the 15 methods proposed in the literature. Although these methods may obtain
great results, they do not have a refined confidence prediction mechanism because, in
general, they are limited to extracting the most confident predictions from the learner
used. Nevertheless, two of the most recent approaches belong to this category.
• Amending models appeared a few years ago. Most of the recent research efforts are
focused on these kinds of models because they have reported a great synergy with the
iterative scheme presented in self-labeled methods. In different ways, they remove those
instances that are harmful to the classification task in order to alleviate the main drawback
of the incremental addition mechanism.
• Only two multi-learning approaches have been proposed for self-labeled algorithms, and
the most recent was published in 2004. In our opinion, more research is required in this
area. For instance, there is no amending model that avoids introducing noisy instances
into the enlarged labeled set. Similarly, no amending approaches have been designed for
the family of methods which uses multiple views.
• Standard co-training has been widely used in many real applications [72]. However,
there is only a reduced number of advanced multi-view approaches which, for example, use
multi-learning or amending addition schemes.
Fig. 1 Self-labeled techniques hierarchy
The properties studied here can help us to understand how the self-labeled algorithms work. In the following sections, we will establish which methods perform best
for each family, considering several performance metrics within a wide experimental
framework.
4 Experimental framework
This section describes all the properties and issues related to the experimental framework
followed in this paper. We provide the measures used to observe differences in the performance of the algorithms (Sect. 4.1), the main characteristics of the problems used (Sect.
4.2), the parameters of the algorithms and the base classifiers used (Sect. 4.3) and finally a
brief description of the nonparametric statistical tests used to contrast the results obtained
(Sect. 4.4).
4.1 Performance measures
Two performance measures are commonly applied because of their simplicity and successful
application when multi-class classification problems are dealt with. As standard classification
methods, the performance of SSC algorithms can be measured in terms of accuracy [1] and
Cohen’s kappa rate [73]. They are briefly explained as follows:
• Accuracy: It is the number of successful hits (correct classifications) relative to the total
number of classifications. It has been by far the most commonly used metric for assessing
the performance of classifiers for years [2,74].
• Cohen’s kappa (Kappa rate): It evaluates the portion of hits that can be attributed to the
classifier itself, excluding random hits, relative to all the classifications that cannot be
attributed to chance alone. Cohen’s kappa ranges from −1 (total disagreement) through
0 (random classification) to 1 (perfect agreement). For multi-class problems, kappa is a
very useful, yet simple, metric for measuring a classifier’s accuracy while compensating
for random successes.
Both metrics will be adopted to measure the efficacy of self-labeled methods in transductive and inductive phases.
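For reference, kappa can be computed from the confusion matrix as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the row and column marginals. A minimal sketch:

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                            # confusion matrix
    n = cm.sum()
    p_o = np.trace(cm) / n                                       # observed agreement
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2       # chance agreement
    return (p_o - p_e) / (1 - p_e)
```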
4.2 Data sets
The experimentation is based on 55 standard classification data sets taken from the UCI
repository [75] and the KEEL-data set repository1 [35]. Table 2 summarizes the properties
of the selected data sets. It shows, for each data set, the number of examples (#Examples),
the number of attributes (#Features) and the number of classes (#Classes). The data sets
considered contain between 100 and 19,000 instances, the number of attributes ranges from
2 to 90, and the number of classes varies between 2 and 28.
These data sets have been partitioned using the tenfold cross-validation procedure, that is,
the data set has been split into ten folds, each one containing 10 % of the examples of the data
set. For each fold, an algorithm is trained with the examples contained in the rest of folds
(training partition) and then tested with the current fold. It is noteworthy that test partitions
are kept aside to evaluate the performance of the learned hypothesis.
Each training partition is divided into two parts: labeled and unlabeled examples. Using
the recommendation established in [46], in the division process we do not maintain the class
proportion in the labeled and unlabeled sets since the main aim of SSC is to exploit unlabeled
data for better classification results. Hence, we use a random selection of examples that will
be marked as labeled instances, and the class label of the rest of the instances will be removed.
We ensure that every class has at least one representative instance.
In order to study the influence of the amount of labeled data, we take different ratios when
dividing the training set. In our experiments, four ratios are used: 10, 20, 30 and 40 %. For
instance, assuming a data set that contains 1,000 examples, when the labeled rate is 10 %,
100 examples are put into L with their labels, while the remaining 900 examples are put into
U without their labels. In summary, this experimental study involves a total of 220 data sets
(55 data sets × 4 labeled rates).
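The division of each training partition can be sketched as follows (illustrative Python; the function name and seed are arbitrary): labeled indices are drawn at random, without preserving class proportions, while guaranteeing at least one representative per class.

```python
import numpy as np

def split_labeled_unlabeled(y_train, ratio, seed=0):
    rng = np.random.default_rng(seed)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    n_labeled = max(int(round(ratio * len(y_train))), len(classes))
    # one representative per class first ...
    labeled = {rng.choice(np.flatnonzero(y_train == c)) for c in classes}
    # ... then random instances until the desired ratio is reached
    for i in rng.permutation(len(y_train)):
        if len(labeled) >= n_labeled:
            break
        labeled.add(i)
    unlabeled = [i for i in range(len(y_train)) if i not in labeled]
    return sorted(labeled), unlabeled
```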
Apart from these data sets, the best methods will also be tested with nine high-dimensional
problems. These data sets have been extracted from the book of Chapelle [4]. To analyze
transductive and inductive capabilities, these data sets have also been partitioned using the
methodology explained above, except for the number of labeled data. We will use two splits for
training partitions, with 10 and 100 labeled examples, respectively. The remaining instances
are marked as unlabeled points. Table 3 presents the main characteristics of these data sets.
All the data sets created can be found on the Web site associated with this paper.2
1 http://sci2s.ugr.es/keel/datasets.
2 http://sci2s.ugr.es/SelfLabeled.
Table 2 Summary description of the original data sets

Data set         #Examples  #Features  #Classes
abalone          4,174      8          28
appendicitis     106        7          2
australian       690        14         2
autos            205        25         6
banana           5,300      2          2
breast           286        9          2
bupa             345        6          2
chess            3,196      36         2
cleveland        297        13         5
coil2000         9,822      85         2
contraceptive    1,473      9          3
crx              125        15         2
dermatology      366        33         6
ecoli            336        7          8
flare-solar      1,066      9          2
german           1,000      20         2
glass            214        9          7
haberman         306        3          2
heart            270        13         2
hepatitis        155        19         2
housevotes       435        16         2
iris             150        4          3
led7digit        500        7          10
lymphography     148        18         4
magic            19,020     10         2
mammographic     961        5          2
marketing        8,993      13         9
monks            432        6          2
movement_libras  360        90         15
mushroom         8,124      22         2
nursery          12,690     8          5
pageblocks       5,472      10         5
penbased         10,992     16         10
phoneme          5,404      5          2
pima             768        8          2
ring             7,400      20         2
saheart          462        9          2
satimage         6,435      36         7
segment          2,310      19         7
sonar            208        60         2
spambase         4,597      55         2
spectheart       267        44         2
splice           3,190      60         3
tae              151        5          3
texture          5,500      40         11
tic-tac-toe      958        9          2
thyroid          7,200      21         3
titanic          2,201      3          2
twonorm          7,400      20         2
vehicle          846        18         4
vowel            990        13         11
wine             178        13         3
wisconsin        683        9          2
yeast            1,484      8          10
zoo              101        17         7
Table 3 Summary description of high-dimensional data sets

Data set  #Examples  #Features  #Classes
bci       400        117        2
coil      1,500      241        6
coil2     1,500      241        2
digit1    1,500      241        2
g241c     1,500      241        2
g241n     1,500      241        2
secstr    83,679     315        2
text      1,500      11,960     2
usps      1,500      241        2
4.3 Parameters and base classifiers
In this subsection, we show the configuration parameters of all the methods used in this
study. The selected values are common to all problems, and they were selected according
to the recommendations of the corresponding authors of each algorithm, which are also the
default parameter settings included in the KEEL software [35] that we used to develop our
experiments. The approaches analyzed should be as general and as flexible as possible. A
good choice of parameters facilitates better performance over different data sources, but
their operation should allow good enough results to be obtained even when the parameters are
not optimized for a specific data set. This is the main purpose of this experimental survey:
to show the generalization in performance of each self-labeled technique. The configuration
parameters of all the methods are specified in Table 4.
Some of the self-labeled methods have been designed with one or more specific base
classifier(s). In this study, these algorithms maintain their used classifier(s). However, the
interchange of the base classifier is allowed in other approaches. Specifically, they are: Self-Training, Co-Training, TriTraining, DE-TriTraining, Co-Bagging, Rasco and Rel-Rasco. In this study,
we select four classic and well-known classifiers in order to find differences in performance
among these self-labeled methods. They are K-nearest neighbor, C4.5, naive Bayes and SVM.
All of these selected base classifiers have been considered as one of the ten most influential
data mining algorithms in [76].
A brief description of the base classifiers and their associated confidence prediction computation is given as follows:
• K-nearest neighbor (KNN): This is one of the simplest and most effective methods based
on dissimilarities among a set of instances. It belongs to the lazy learning family of
methods [77], which do not build a model during the learning process. With this method,
confidence predictions can be approximated by the distance to the currently labeled set.
• C4.5: This is a well-known decision tree algorithm [32]. It induces classification rules
in the form of decision trees for a given training set. The decision tree is built with a
top-down scheme, using the normalized information gain (difference in entropy) that is
obtained from choosing an attribute for splitting the data. The attribute with the highest
normalized information gain is the one used to make the decision. Confidence predictions
are obtained from the accuracy of the leaf that makes the prediction. The accuracy of
a leaf is the percentage of correctly classified train examples from the total number of
covered train instances.
• Naive Bayes (NB): Its aim is to construct a rule which will allow us to assign future objects
to a class, assuming independence of attributes when probabilities are established. For
continuous data, we follow a typical assumption in which continuous values associated
with each class are distributed according to a Gaussian distribution [78]. The extraction
of probabilities is straightforward, due to the fact that this method explicitly computes
the probability belonging to each class for the given test instance.
• Support vector machines (SVM): This maps the original input space into a higher-dimensional feature space using a certain kernel function [79]. In the new feature space,
the SVM algorithm searches the optimal separating hyperplane with maximal margin
in order to minimize an upper bound of the expected risk instead of the empirical risk.
Specifically, we use the SMO training algorithm, proposed in [80], to obtain the SVM
base classifiers. Using a logistic model, we can use the probability estimate from the
SMO [80] as the confidence for the predicted class.
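As an orientation only, the following sketch shows hypothetical Python (scikit-learn) analogues of these four base classifiers; the paper's experiments rely on the Java KEEL implementations, and here predict_proba simply plays the role of the confidence prediction in each case:

```python
from sklearn.neighbors import KNeighborsClassifier   # distance-based confidence
from sklearn.tree import DecisionTreeClassifier      # leaf-accuracy confidence
from sklearn.naive_bayes import GaussianNB           # explicit class posteriors
from sklearn.svm import SVC                          # logistic (Platt) estimates

base_classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "C4.5-like": DecisionTreeClassifier(criterion="entropy"),
    "NB": GaussianNB(),
    "SMO-like": SVC(kernel="poly", degree=1, C=1.0, probability=True),
}

def most_confident(clf, X_l, y_l, X_u):
    """Fit on the labeled data and return predicted labels and confidences."""
    clf.fit(X_l, y_l)
    proba = clf.predict_proba(X_u)
    return clf.classes_[proba.argmax(axis=1)], proba.max(axis=1)
```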
Table 4 Parameter specification for all the base learners and self-labeled methods used in the experimentation

Algorithm        Parameters
KNN              Number of neighbors = 3, Euclidean distance
C4.5             Confidence level c = 0.25, minimum number of item-sets per leaf i = 2, prune after the tree building
NB               No parameters specified
SMO              C = 1.0, tolerance parameter = 0.001, epsilon = 1.0e-12, kernel type = polynomial, polynomial degree = 1, fit logistic models = true
Self-Training    MAX_ITER = 40
Co-Training      MAX_ITER = 40, initial unlabeled pool = 75
Democratic-Co    Classifiers = 3NN, C4.5, NB
SETRED           MAX_ITER = 40, threshold = 0.1
TriTraining      No parameters specified
DE-TriTraining   Number of neighbors k = 3, minimum number of neighbors = 2
CoForest         Number of RandomForest classifiers = 6, threshold = 0.75
Rasco            MAX_ITER = 40, number of views/classifiers = 30
Co-Bagging       MAX_ITER = 40, committee members = 3, ensemble learning = Bagging, pool U = 100
Rel-Rasco        MAX_ITER = 40, number of views/classifiers = 30
CLCC             Number of RandomForest classifiers = 6, threshold = 0.75, manipulative beta parameter = 0.4, initial number of clusters = 4, running frequency z = 10, best center sets = 6, optional step = true
APSSC            Spread of the Gaussian = 0.3, evaporation coefficient = 0.7, MT = 0.7
SNNRCE           Threshold = 0.5
ADE-CoForest     Number of RandomForest classifiers = 6, threshold = 0.75, number of neighbors k = 3, minimum number of neighbors = 2
4.4 Statistical test for performance comparison
Statistical analyses are highly recommended in the field of data mining to find significant
differences between the results obtained by the studied methods. We consider the use of
nonparametric tests according to the recommendation made in [27,81], where a set of simple,
safe and robust nonparametric tests for statistical comparisons of classifiers is presented. In
these studies, the use of nonparametric tests is preferred to parametric ones, since the
initial conditions that guarantee the reliability of the latter may not be satisfied, causing the
statistical analysis to lose credibility.
The Wilcoxon signed-ranks test [28,82] will be adopted to conduct pairwise comparisons
between all the methods considered in this study. Considering the ratio of the number of
data sets to the number of methods that we will compare throughout this paper, we fix the
significance level α = 0.1 for all comparisons.
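As an illustration of how such a pairwise comparison can be run (a sketch with toy accuracy values, using SciPy as a stand-in for the statistical software referenced below):

```python
import numpy as np
from scipy.stats import wilcoxon

# accuracies of two methods over the same collection of data sets (toy values)
acc_a = np.array([0.81, 0.77, 0.92, 0.68, 0.74, 0.85, 0.79, 0.88])
acc_b = np.array([0.78, 0.75, 0.90, 0.70, 0.71, 0.83, 0.77, 0.86])

stat, p_value = wilcoxon(acc_a, acc_b)
print(p_value < 0.1)   # reject the null hypothesis of equal performance?
```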
Furthermore, we will use the Friedman test [83] in the global analysis to detect differences between the methods considered outstanding, with a multiple comparison analysis.
The Bergmann–Hommel procedure [84], highlighted as the most powerful post hoc test in [81], is applied to find out which algorithms are distinctive in n × n
comparisons. Any interested reader can find more information about these tests at http://sci2s.ugr.es/sicidm/, together with the software for applying the statistical tests.
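Analogously, a minimal sketch of the Friedman test and the average rankings it is based on (toy values; the Bergmann–Hommel post hoc step is not part of SciPy and is omitted here):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

acc = np.array([[0.81, 0.78, 0.80],    # rows: data sets, columns: methods
                [0.77, 0.75, 0.79],
                [0.92, 0.90, 0.89],
                [0.68, 0.70, 0.66],
                [0.74, 0.71, 0.73]])
stat, p_value = friedmanchisquare(*acc.T)        # one argument per method
avg_ranks = rankdata(-acc, axis=1).mean(axis=0)  # lower rank = better method
```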
4.5 Other considerations
We want to stress that the implementations are based only on the descriptions and specifications given by the respective authors in their papers. No advanced data structures or
enhancements for improving the suitability of self-labeled methods have been applied.
5 Analysis of results
This section presents the average results collected in the experimental study and some discussion on them. Due to the extent of the experimental analysis carried out, we report the
complete tables of results on the Web page associated with this paper (see footnote 2). This
study will be divided into two different parts: analysis of the results obtained in transductive
learning (see Sect. 5.1) and inductive learning (see Sect. 5.2) considering different ratios of
labeled data. A global discussion and the identification of outstanding methods are added in
Sect. 5.3. Some representative methods will also be tested on high-dimensional data sets
with small labeled ratio (see Sect. 5.4). Finally, a comparison with the supervised learning
paradigm is performed in Sect. 5.5.
5.1 Transductive results
As we claimed before, the main objective of transductive learning is to predict the true class
label of the unlabeled data used to train. Hence, a good exploitation of unlabeled data can lead
to successful results. Within the framework of transductive learning, we have analyzed which
are the best or the most appropriate proposals according to their characteristics, as explained
before.
Table 5 presents the average accuracy results obtained in the transductive phase. Specifically, it shows the overall results of the analyzed algorithms over the 55 used data sets with
10, 20, 30 and 40 % of labeled data, respectively. For those methods that work with different
classifiers, we have tested various base classifiers, specifying them in brackets. Acc
presents the average accuracy obtained. The algorithms are ordered from the best to the
worst accuracy obtained. Even though kappa results are not reported in the paper, we want
to check the gain (or loss) of each method in the established ranking of algorithms (in terms
of accuracy measure) when the kappa measure is taken into account.
For this purpose, K shows the oscillation of each method in the classification order
established with accuracy with respect to the position obtained with kappa. This information
reveals whether or not a certain algorithm benefits from random hits in comparison with
the rest of the methods. Complete kappa results can be found on the associated Web site.
Furthermore, in this table, we highlight those methods whose performance is within 5 % of
the range between the best and the worst method, that is, above value_best − 0.05 · (value_best − value_worst). We use boldface for the accuracy measure and italic for kappa. They should
be considered as a set of outstanding methods, regardless of their specific position in the
table.

Table 5 Average results obtained by self-labeled methods in the transductive phase
Figure 2 depicts a star plot representing the average transductive accuracy obtained by each
method for the four labeled ratios considered. This star plot presents the performance as the
distance from the center; therefore, a larger area indicates a better average performance.
This illustration allows us to easily visualize the average performance of the algorithms
comparatively for each labeled ratio and in general. Figure 3 presents the same results in a
bar chart, aiming to compare the specific accuracy values.
Apart from the average results, we use the Wilcoxon test to statistically compare self-labeled methods in the different labeled ratios. Table 6 summarizes all possible comparisons
involved in the Wilcoxon test between all the methods, considering accuracy
results. Again, kappa results and the individual comparisons are exhibited on the aforementioned Web site, where a detailed report of statistical results can be found. This table
presents, for each method in the rows, the number of self-labeled methods outperformed by
using the Wilcoxon test under the column represented by the “+” symbol. The column with
the “±” symbol indicates the number of wins and ties obtained by the method in the row.
The maximum value for each column is highlighted in bold.
Fig. 2 Labeled ratio comparison (star plot): transductive phase
Fig. 3 Labeled ratio comparison (bar chart): transductive phase
Table 6 Wilcoxon test summary results: transductive accuracy

Method                 10 %      20 %      30 %      40 %
                       +    ±    +    ±    +    ±    +    ±
Self-Training (KNN)    13   31   12   31   10   26   10   21
Self-Training (C45)    17   31   18   31   19   30   18   31
Self-Training (NB)     4    10   8    11   2    10   0    10
Self-Training (SMO)    9    29   11   30   14   33   17   33
Co-Training (KNN)      7    22   9    18   5    19   3    19
Co-Training (C45)      13   30   17   31   21   32   21   33
Co-Training (NB)       9    25   10   25   4    22   6    19
Co-Training (SMO)      12   31   15   34   26   34   24   34
Democratic-Co          27   34   28   34   27   34   26   34
SETRED                 17   33   15   32   12   29   11   29
TriTraining (KNN)      15   31   12   28   8    22   6    19
TriTraining (C45)      25   34   26   34   28   34   27   34
TriTraining (NB)       10   28   13   28   10   27   8    23
TriTraining (SMO)      6    28   12   30   9    30   16   33
DE-TriTraining (KNN)   13   33   12   28   9    25   9    28
DE-TriTraining (C45)   14   30   14   28   15   28   15   30
DE-TriTraining (NB)    9    21   10   19   4    20   4    22
DE-TriTraining (SMO)   12   31   14   31   15   30   13   31
CoForest               21   34   20   34   21   34   24   34
Rasco (KNN)            0    1    0    1    0    2    0    6
Rasco (C45)            4    10   2    8    5    26   9    29
Rasco (NB)             4    14   4    8    3    14   1    10
Rasco (SMO)            2    3    2    6    2    13   4    17
Co-Bagging (KNN)       14   32   16   31   13   30   14   27
Co-Bagging (C45)       24   34   26   34   28   34   11   31
Co-Bagging (NB)        7    21   10   24   7    23   6    21
Co-Bagging (SMO)       8    31   13   33   14   30   20   33
Rel-Rasco (KNN)        0    1    0    1    0    2    0    6
Rel-Rasco (C45)        4    10   2    8    4    26   9    29
Rel-Rasco (NB)         5    13   4    8    3    14   1    10
Rel-Rasco (SMO)        2    4    2    6    2    10   2    14
CLCC                   3    19   2    9    0    8    0    4
APSSC                  4    19   9    16   3    15   1    14
SNNRCE                 22   34   19   33   17   33   19   33
ADE-CoForest           11   31   11   29   6    25   6    28
Once the results are presented in the above tables and graphics, we outline some comments
related to the properties observed, pointing out the best-performing methods in terms of
transductive capabilities:
• Considering the influence of the labeled ratio, Fig. 2 shows that, as could be expected, most of the algorithms obtain a regular increment in their transductive accuracy when the number of labeled data is increased. By contrast, we can observe how several methods are highly affected by the labeled ratio. For instance, multi-view methods obtain higher accuracy and kappa rankings when the labeled ratio is increased. In Table 5, we can also point out that two techniques are always at the top in accuracy and kappa rate independently of the labeled ratio: Democratic-Co and TriTraining (C45). They are also noteworthy as the methods that always obtain a transductive performance within 5 % of the range between the best and the worst method in both accuracy and kappa measures. Moreover, Co-Training (SMO) and Co-Bagging (C45) are considered outstanding methods in most of the labeled ratios, mainly in the kappa measure. We can observe the validity of these statements in the Wilcoxon test, which confirms the averaged results obtained in accuracy and kappa rates.
• For those methods whose wrapper classifier can be set, we can find significant differences in their transductive behavior. In general, C4.5 offers the best transductive accuracy results in most of the techniques. In depth, classical self-training and co-training approaches obtain better results when C4.5 or SMO is used as the base classifier. Tri-Training also works well with C4.5 and with KNN. The edited version of Tri-Training and Co-Bagging present a good synergy with C4.5. Finally, the use of Rasco and Rel-Rasco is more appropriate when NB or C4.5 is established as the base classifier.
• The classical Co-Training with SMO or C4.5 as base classifier can be stressed as the best-performing method from the multi-view family. Rasco and Rel-Rasco are based on the idea of using random feature subspaces (relevant or not) to construct different learned hypotheses. This idea performs well in several domains, as shown in the corresponding papers. Nevertheless, to deal with a wide variety of data sets, such as in this experimental study, a more accurate feature selection methodology should be used to enhance their performance.
• Regarding single-view algorithms, several subfamilies deserve particular mention. In general, incremental approaches obtain the best results in accuracy and kappa rates. The results obtained by CoForest show it to be the best batch model. Moreover, CoForest is at least statistically similar to the rest of the methods in transductive capabilities. From the amending approaches, we can highlight SNNRCE as one important method, which is able to clearly outperform the Self-Training (KNN) method on which it is based in both measures.
• Usually, there is no significant difference (K) between the rankings obtained with accuracy and kappa rates, except for some concrete algorithms. For example, we can observe that DE-TriTraining usually obtains a lower ranking with the kappa measure; this probably indicates that it benefits from random hits. In contrast, other algorithms, such as APSSC and Co-Training (SMO), improve their rankings when the kappa measure is used.
Furthermore, we perform an analysis of the results depending on the number of classes.
On the Web site associated with this paper, we show the rankings obtained in accuracy and
kappa for all the methods differentiating between binary and multi-class data sets. Figure
4 displays a summary representation of this study. It shows, for each method and labeled
ratio, the differences between the ranking obtained in binary problems and the ranking achieved in
multi-class data sets. Hence, positive bars indicate that the method performs better in multi-class problems, and negative bars show that the method obtains a higher ranking over binary
domains.

Fig. 4 Differences between rankings in binary and multi-class domains: transductive phase

We can analyze several details from the results collected, which are as follows:
• When the transductive analysis is divided into binary and multi-class data sets, we find
wide differences in the rankings previously obtained. Democratic-Co and TriTraining
(C45) continue to lead the ranking if we take into consideration only binary data sets.
Nevertheless, in multi-class problems they are not the best-performing methods, although
they maintain a good behavior.
• Single-classifier methods with an amending addition scheme, such as SETRED and
SNNRCE, are noteworthy as the best-performing methods to deal with multi-class problems.
• Many differences appear in this study depending on the base classifier. In contrast to
the previous analysis in which C4.5 was, in most cases, highlighted as the best base
classifier, C4.5 is now stressed when tackling binary data sets, whereas for multi-class
data sets, we can highlight the KNN rule as the most adequate base classifier for most
of the self-labeled techniques. Specifically, Self-Training, Co-training, Co-Bagging and
Tri-Training with KNN obtain a higher accuracy and kappa rates in comparison with the
rest of the base classifiers.
5.2 Inductive results (test phase)
In contrast to transductive learning, the aim of inductive learning is to classify unknown
examples. In this way, inductive learning proves the generalization capabilities of the self-labeled methods, checking whether the previously learned hypotheses are appropriate or not.
Table 7 shows the average results obtained, and Figs. 5 and 6 illustrate the comparison
between labeled ratios with a star plot and a bar chart, respectively. Finally, the summary
of the results of applying the Wilcoxon test to all the techniques on test data is presented in
Table 8.

Table 7 Average results obtained by self-labeled methods in the inductive phase

These results allow us to highlight some differences in the generalization capabilities of
the analyzed methods:
• Some methods present clear differences when dealing with the inductive phase. For
instance, the amending model SNNRCE obtains a lower generalization accuracy/kappa,
and it is outperformed by SETRED, the other member of its family. It may suffer from an
excessive elimination rule for candidate instances to be incorporated into the labeled set.
On the other hand, classical self-training and co-training methods are shown to perform
well in the inductive phase, and they are at the top in accuracy and kappa rate.
• TriTraining (C45) and Democratic-Co remain at the top of the rankings, established by
the accuracy measure, in conjunction with Co-bagging (C4.5) which is a very competitive
method in the test phase. In Table 7, we observe that the kappa ranking penalizes the
Democratic-Co algorithm and considers other methods to be outstanding. For example,
the classical Co-Training (SMO) is positioned as one of the most important methods when
the kappa rate is taken into consideration. The Wilcoxon test supports these ideas, showing
that, in most labeled ratios, TriTraining (C45) is the best proposal to the detriment of
Democratic-Co.
• It is worth mentioning that, in general, those methods that are based on C4.5 and SMO as
base classifier(s) obtain a higher number of outperformed methods (+) and, consequently,
a higher number of wins and ties (±) than the numbers obtained in the transductive
phase. Hence, these methods present good generalization capabilities, and their use is
recommended for inductive tasks.
Fig. 5 Labeled ratio comparison (star plot): inductive phase
Fig. 6 Labeled ratio comparison (bar chart): inductive phase
Table 8 Wilcoxon test summary results: inductive accuracy

Method                  10 %      20 %      30 %      40 %
                        +    ±    +    ±    +    ±    +    ±
Self-Training (KNN)     11   30   11   25   9    28   6    24
Self-Training (C4.5)    19   32   22   32   21   32   23   33
Self-Training (NB)      2    11   8    13   1    9    1    9
Self-Training (SMO)     9    29   10   28   11   34   17   32
Co-Training (KNN)       6    26   10   23   4    22   5    21
Co-Training (C4.5)      11   30   23   32   21   33   22   34
Co-Training (NB)        10   27   10   25   4    24   3    23
Co-Training (SMO)       11   32   14   33   21   34   25   34
Democratic-Co           26   34   26   34   25   34   26   34
SETRED                  15   34   14   29   10   29   11   29
TriTraining (KNN)       13   34   10   23   3    20   2    16
TriTraining (C4.5)      28   34   31   34   24   34   27   34
TriTraining (NB)        10   29   11   28   9    28   4    20
TriTraining (SMO)       9    31   12   33   12   33   18   32
DE-TriTraining (KNN)    11   31   11   26   8    26   7    25
DE-TriTraining (C4.5)   12   29   13   29   11   28   12   28
DE-TriTraining (NB)     9    25   10   25   3    19   2    20
DE-TriTraining (SMO)    11   30   17   31   16   32   14   29
CoForest                19   34   18   34   20   34   20   34
Rasco (KNN)             0    1    0    1    0    5    0    11
Rasco (C4.5)            2    10   2    8    4    28   11   29
Rasco (NB)              3    12   2    7    4    17   2    15
Rasco (SMO)             2    9    2    8    3    21   3    23
Co-Bagging (KNN)        12   31   18   31   17   32   16   29
Co-Bagging (C4.5)       27   34   28   34   27   34   15   32
Co-Bagging (NB)         8    24   10   28   4    25   4    20
Co-Bagging (SMO)        7    31   16   33   15   33   24   34
Rel-Rasco (KNN)         0    1    0    1    0    3    0    10
Rel-Rasco (C4.5)        2    10   2    8    4    28   10   29
Rel-Rasco (NB)          3    14   3    8    4    17   1    14
Rel-Rasco (SMO)         2    10   2    8    3    21   3    22
CLCC                    2    20   2    9    0    2    0    2
APSSC                   2    20   9    16   2    17   1    15
SNNRCE                  16   34   11   27   5    23   6    22
ADE-CoForest            7    30   10   28   3    23   4    27
• In this phase, the ratio of labeled instances has a greater influence on the performance
obtained by all the algorithms. In comparison with the transductive results, there are
greater differences between the results obtained for each method, comparing, as extreme
cases, the 10 and 40 % ratios. As we can see in Fig. 5, the star plot shows abrupt changes,
mainly from 10 to 20 % of labeled instances.
Fig. 7 Box plot accuracy, inductive phase: a 10 % labeled ratio, b 20 % labeled ratio, c 30 % labeled ratio, d 40 % labeled ratio
Aside from these tables, Fig. 7 collects box plot representations for each labeled ratio. In
this figure, we select a subset of methods whose performances are of interest. In particular,
we select the most promising alternatives based on the best base classifier choice from the
previous study. Box plots have been shown to be an effective tool in data reporting because
they allow the graphical representation of the performance of algorithms, indicating important
characteristics such as the median, extreme values and spread of values about the median in
the form of quartiles (Q1 and Q3).
• The box plots show which results are more robust in the sense that the boxes are more
compact. In general, Self-training and Co-bagging present the smallest box plots in all
the labeled ratios. By contrast, other methods such as CoForest and Rasco are less robust,
as was previously reflected in the average results and Wilcoxon test.
• Median results also help us to identify promising algorithms which perform well in many
domains. It is interesting that Co-Bagging shows the best median value in most
of the labeled ratios in comparison with Democratic-Co and TriTraining (C4.5), which
were pointed out as the best-performing methods.
Again, we differentiate between binary and multi-class data sets. Figure 8 depicts a summary graphic; the Web site associated with this paper presents the complete results.

Fig. 8 Differences between rankings in binary and multi-class domains: inductive phase

Observing these results, we can make several comments.
• Over binary data sets, we find a significant increment in the ranking obtained by Rasco
(C45) and Rel-Rasco (C45). In general, C4.5 models lead the ranking established in
accuracy and kappa rates.
• When only multi-class data sets are taken into consideration, as in the transductive phase,
we also observe that TriTraining, SETRED and Self-Training based on KNN reach the
top positions.
5.3 Global analysis
This section provides a global perspective on the obtained results. As a summary, we want
to outline several remarks regarding the previous studies:
123
I. Triguero et al.
Table 9 Average Friedman rankings of outstanding algorithms in transductive and inductive phases

Algorithm            Ranking transductive    Algorithm            Ranking inductive
TriTraining (C45)    3.9477                  TriTraining (C45)    3.9455
Democratic-Co        4.1205                  Democratic-Co        4.2295
Co-Bagging (C45)     4.2409                  Co-Bagging (C45)     4.2727
Co-Training (SMO)    4.3409                  Co-Training (SMO)    4.4159
Self-Training (C45)  4.6136                  Self-Training (C45)  4.4727
Self-Training (SMO)  4.7636                  TriTraining (SMO)    4.8023
Co-Bagging (SMO)     4.9432                  Co-Bagging (SMO)     4.8955
TriTraining (SMO)    5.0295                  Self-Training (SMO)  4.9659
• In general, those self-labeled methods based on C4.5 or SMO have been shown to perform
well in binary data sets. In contrast, the KNN rule is shown to work better with multi-class
domains. Regarding the NB classifier, the continuous distribution version used [78] has
not reported competitive results in comparison with the rest of classifiers. It will probably
obtain a better performance if a discretization process is used for continuous attributes
[85,86].
• According to the type of data sets, we can also claim that single-classifier models usually
outperform those that are based on multi-classifiers when dealing with multi-class problems. By contrast, multi-classifier methods show a better behavior than single-classifier ones
when tackling binary data sets.
• Eight self-labeled methods have been emphasized as outstanding methods according to
the accuracy/kappa obtained, at least in one of the labeled ratios of the transductive or
inductive phases: TriTraining (C45), TriTraining (SMO), Democratic-Co, Co-Bagging
(C45), Co-Bagging (SMO), Co-Training (SMO), Self-Training (C45) and Self-Training
(SMO).
Now, we focus our attention on these eight outstanding methods, performing a multiple
comparison statistical test between them. To analyze the global capabilities of these algorithms independently of labeled ratios, the statistical test is conducted with all of the 220
data sets (55 data sets × 4 labeled rates). Table 9 presents the results of the Friedman test,
which has been carried out considering the accuracy results obtained in the transductive and
inductive phases. In this table, the computed Friedman rankings are presented, representing
the associated effectiveness of each method in both phases. Algorithms are ordered from the
best (lowest) to the worst (highest) ranking.
Table 10 provides information about the state of retainment or rejection of all the hypotheses, comparing outstanding methods in the transductive and inductive phases. It shows the
adjusted p value (APV) obtained with Bergmann–Hommel’s procedure for the 28 established comparisons. The table is set out so that each row contains a hypothesis in which the first algorithm mentioned (left side of the comparison) outperforms the second one (right side). The
hypotheses are ordered from the most to the least significant differences. Those APVs highlighted in bold correspond to hypotheses whose left method outperforms the right method,
at an α = 0.1 level of significance.
Figure 9 outlines two graphs for transductive and inductive statistical results, respectively.
The x-axis presents those methods that are not outperformed by any other algorithm. For each
of them, the y-axis collects the methods that they outperform according to the Bergmann–
Hommel test.
Table 10 Multiple comparison test: Bergmann–Hommel’s APVs
Fig. 9 Graphical comparison with Bergmann–Hommel’s test: a transductive results, b inductive results
Fig. 10 Two-dimensional projections of g241n. Red crosses: class +1; blue circles: class −1. a problem, b g241n: 10 labeled data, c g241n: 100 labeled data (color figure online)
Thus, observing the multiple comparison test and Fig. 9, we can highlight TriTraining
(C45), Democratic-Co, Co-Bagging (C45) and Co-Training (SMO) as methods that appear
on the x-axis in both Fig. 9a, b. Hence, they are not statistically outperformed by any algorithm. Self-Training (C45) is also highlighted as a non-outperformed method in the inductive
phase.
5.4 Experiments on high-dimensional data sets with small labeled ratio
The aim of this section is to check the behavior of self-labeled methods when they deal with
high-dimensional data and a reduced labeled ratio. To do this, we focus our attention on four of
the best methods highlighted above: Democratic-Co, TriTraining (C4.5), Co-Bagging (C4.5)
and Co-Training (SMO). The data sets used were provided by Chapelle in [4], and their main
characteristics were described in Sect. 4.2. To illustrate the complexity of these problems,
Fig. 10 depicts an example of one partition of the g241n problem. This graph presents a
two-dimensional projection (obtained with PCA [87]) of the problem and the 10 and 100
labeled data points used with the self-labeled methods.
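A minimal sketch of the kind of projection behind Fig. 10 (with synthetic stand-in data matching g241n's dimensions; only the plotting mechanics are intended to be representative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 241))                 # g241n-sized feature matrix
y = np.where(X[:, 0] + rng.normal(size=1500) > 0, 1, -1)

X2 = PCA(n_components=2).fit_transform(X)        # two-dimensional projection
plt.scatter(*X2[y == 1].T, marker="x", c="red", label="class +1")
plt.scatter(*X2[y == -1].T, marker="o", facecolors="none",
            edgecolors="blue", label="class -1")
plt.legend()
plt.show()
```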
Tables 11 and 12 show the accuracy results obtained in the transductive and inductive
phases with 10 and 100 labeled data, respectively. To measure the goodness of self-labeled
techniques with this kind of data, we compare their results with those obtained with the base
classifiers. Therefore, C4.5, KNN, SMO and NB have been trained with the available labeled
Table 11 High-dimensional data sets: self-labeled performance with ten labeled data

           Democratic-Co     TriTraining (C4.5)  Co-Bagging (C4.5)  Co-Training (SMO)
Data sets  TRS     TST       TRS     TST         TRS     TST        TRS     TST
bci        0.5014  0.4875    0.5134  0.5050      0.5043  0.5100     0.5146  0.5250
coil       0.8328  0.8333    0.7864  0.7753      0.7792  0.7727     0.8239  0.8300
coil2      0.8050  0.8027    0.6897  0.7040      0.7139  0.7173     0.7601  0.7673
digit1     0.5945  0.6020    0.5337  0.5273      0.5261  0.5113     0.7372  0.7520
g241c      0.5290  0.5220    0.5169  0.5040      0.5213  0.5007     0.5925  0.6067
g241n      0.5043  0.5020    0.5029  0.5033      0.4990  0.4993     0.5340  0.5253
secstr     0.5719  0.5718    0.5097  0.5085      0.5298  0.5283     0.5281  0.5141
text       0.5604  0.5533    0.5272  0.5167      0.5190  0.5180     0.4993  0.5000
usps       0.8050  0.8027    0.6897  0.7040      0.7139  0.7173     0.7601  0.7673
Average    0.6338  0.6308    0.5855  0.5831      0.5896  0.5861     0.6389  0.6431
Table 12 High-dimensional data sets: self-labeled performance with 100 labeled data

Data sets   Democratic-Co       TriTraining (C4.5)   Co-Bagging (C4.5)   Co-Training (SMO)
            TRS      TST        TRS      TST          TRS      TST         TRS      TST
bci         0.5027   0.5450     0.5588   0.5625       0.5604   0.5550      0.6573   0.6500
coil        0.8635   0.8773     0.8372   0.8393       0.8439   0.8480      0.9110   0.9047
coil2       0.8557   0.8413     0.7972   0.8033       0.8064   0.8333      0.8063   0.7927
digit1      0.9370   0.9347     0.8208   0.8600       0.8072   0.8127      0.9158   0.9173
g241c       0.6033   0.5213     0.5689   0.5413       0.5685   0.5660      0.7334   0.7453
g241n       0.5420   0.5053     0.5792   0.6067       0.5696   0.5733      0.7320   0.7313
secstr      0.5917   0.5915     0.5436   0.5421       0.5573   0.5500      0.5476   0.5515
text        0.6608   0.6667     0.6728   0.6800       0.6920   0.7333      0.6797   0.6735
usps        0.8557   0.8413     0.8184   0.8000       0.7904   0.7913      0.8063   0.7927
Average     0.7125   0.7027     0.6885   0.6928       0.6884   0.6959      0.7544   0.7510
examples to predict the class of the remaining unlabeled ones. Note that the information used for
these techniques corresponds to the initial stage of all the self-labeled schemes. However, it is
also known that, depending on the problem, unlabeled data can lead to worse performance [1];
hence, the inclusion of these baselines shows whether self-labeled techniques are appropriate
for these high-dimensional problems. Results are presented in Tables 13 and 14.
In these tables we can see that self-labeled techniques do not fit these kinds of problems adequately. They are not able to significantly outperform the baseline techniques, which in some cases achieve a better performance. When only ten labeled points are used, base classifiers perform equally to or better than most of the self-labeled techniques. If the number of labeled points is increased to 100, we observe that some self-labeled techniques, such as TriTraining (C4.5) and Co-Bagging (C4.5), perform better than their base classifier. However, they are not really competitive with the results obtained with KNN or SMO with 100 labeled points. Figure 11 illustrates an example of the evolution of the transductive and inductive accuracy of Democratic-Co during the self-labeling process. With a self-labeling approach, it is expected that the accuracy increases as the iterations go by. Nevertheless, in
Table 13 High-dimensional data sets: baselines performance with ten labeled data

Data sets   C4.5                KNN                 SMO                 NB
            TRS      TST        TRS      TST        TRS      TST        TRS      TST
bci         0.5189   0.5175     0.4889   0.4725     0.5194   0.5000     0.4966   0.5000
coil        0.7945   0.7913     0.7938   0.7900     0.8398   0.8413     0.6921   0.6913
coil2       0.6997   0.7107     0.7767   0.7867     0.7794   0.7867     0.7676   0.7673
digit1      0.5353   0.5220     0.7738   0.7773     0.6889   0.6673     0.6748   0.6787
g241c       0.5175   0.4973     0.5466   0.5653     0.6020   0.6140     0.5446   0.5587
g241n       0.5160   0.5187     0.5431   0.5333     0.5091   0.5060     0.5048   0.5000
secstr      0.5209   0.5215     0.5121   0.5127     0.5155   0.5155     0.5232   0.5220
text        0.5034   0.5064     0.5143   0.5102     0.5201   0.5167     0.4936   0.4986
usps        0.6997   0.7107     0.7767   0.7867     0.7794   0.7867     0.7676   0.7673
Average     0.5895   0.5885     0.6362   0.6372     0.6393   0.6371     0.6072   0.6093
Table 14 High-dimensional data sets: baselines performance with 100 labeled data

Data sets   C4.5                KNN                 SMO                 NB
            TRS      TST        TRS      TST        TRS      TST        TRS      TST
bci         0.5569   0.5525     0.5204   0.5600     0.6581   0.6500     0.5212   0.5275
coil        0.8226   0.8220     0.9422   0.9387     0.9182   0.9113     0.7684   0.7653
coil2       0.7762   0.7860     0.9245   0.9160     0.8386   0.8300     0.8624   0.8593
digit1      0.7726   0.7800     0.9361   0.9373     0.9146   0.9080     0.9365   0.9453
g241c       0.5446   0.5433     0.5919   0.5973     0.7405   0.7533     0.7202   0.7300
g241n       0.5377   0.5267     0.6286   0.6380     0.7371   0.7400     0.6877   0.6780
secstr      0.5298   0.5284     0.5156   0.5152     0.5240   0.5257     0.5340   0.5346
text        0.5054   0.5049     0.5064   0.5177     0.5196   0.5206     0.5003   0.4945
usps        0.7762   0.7860     0.9245   0.9160     0.8386   0.8300     0.8624   0.8593
Average     0.6469   0.6478     0.7211   0.7262     0.7433   0.7410     0.7103   0.7104
Fig. 11 Transductive (TRS) and inductive (TST) accuracy of Democratic-Co in the g241n problem with 10 and 100 labeled data (series TRS-10, TST-10, TRS-100 and TST-100 over 10 self-labeling iterations)
these problems we find that this expectation is not satisfied in most cases. In the plot we see that in intermediate iterations the accuracy deteriorates. This means that the estimation of the most confident examples is erroneous and much harder to obtain in these domains. Therefore, these results exemplify the difficulty of these problems when a very reduced number of labeled data are used. In our opinion, more research is required to give self-labeled techniques the ability to deal with high-dimensional problems with a very reduced labeled ratio.

Fig. 12 Differences in average test results between outstanding models and C4.5 and SMO
5.5 How far removed is semi-supervised learning from the traditional supervised learning paradigm?
This section is devoted to checking how far removed the accuracy obtained with self-labeled methods in the SSL context is from that obtained with supervised learning. It is clear that
SSL implies a more complex problem than the standard supervised learning problem. In
SSL, algorithms are provided with a smaller number of labeled examples to learn a correct
hypothesis. Specifically, the performance obtained with self-labeled methods is theoretically
upper-bounded by the traditional supervised learning algorithms used as base classifiers.
To contrast this idea, we compare the inductive (test) results obtained with the best methods
highlighted in the previous section with C4.5 and SMO classifiers. These classifiers are
trained with completely labeled training sets, using the same tenfold cross-validation scheme
to compute the accuracy test results. The complete results of this study are available on the
associated Web site.
Figure 12 draws a graphical comparison between outstanding inductive methods, C4.5
and SMO classifiers. For each self-labeled method, the average result obtained is shown
in each labeled ratio. The average result of C4.5 and SMO is represented as a line y =
Average Result, to show the differences between self-labeled methods and these classifiers.
As we can observe in this figure, it is noteworthy that with a reduced number of labeled
examples (10 %), self-labeled techniques are far removed from the results obtained with base
classifiers which use a completely labeled training set. Although an increment in the labeled
ratio does not produce a proportional increase in the performance obtained with self-labeled
techniques, it indicates that from a 20 % labeled ratio onward, they offer an acceptable classification performance. As an extreme case, Co-Training (SMO) does not perform well with
10 % of labeled data, and it shows a great improvement when the labeled ratio is augmented,
approaching the SMO classifier with all labeled training examples.
6 Concluding remarks and global guidelines
The present paper provides a complete overview of the self-labeled methods proposed in the
literature. We have analyzed the basic and advanced features presented in them. Furthermore,
existing and related work have also been reviewed. Based on the main properties studied, we
have proposed a taxonomy of self-labeled methods.
The most important methods have been empirically analyzed in terms of transductive and
inductive settings. In order to strengthen this experimental study, we have conducted statistical
analyses based on nonparametric tests which help us to characterize the capabilities of each
method, supporting the conclusions drawn. Several remarks can be made and guidelines
suggested:
• This paper helps nonexperts in self-labeled methods to differentiate between them, to
make an appropriate decision about their application and to understand their behavior.
• A researcher who needs to apply a self-labeled method should know the main characteristics of these kinds of methods in order to choose the most suitable, depending on the
type of problem. The taxonomy proposed and the empirical study can help a researcher
to make this decision.
• It is important to know the main advantages of each self-labeled method. In this paper,
many methods have been empirically analyzed, but a specific conclusion cannot be drawn
regarding the best-performing method. This choice depends on the problem tackled, but
the results offered in this paper could help to reduce the set of candidates.
• SSC is a growing field, and more research studies should be conducted. In this paper,
several guidelines about unexplored and promising families have been described.
• To propose a new self-labeled method, rigorous analyses should be considered to compare
it with the most well-known approaches and those that fit with the basic properties of the
new proposal in terms of transductive and inductive learning. To do this, the taxonomy
and the proposed experimental framework can help guide a future proposal toward the
correct method.
• The empirical study allows us to highlight several methods from among the whole set. In
both transductive and inductive settings, TriTraining (C45), Democratic-Co, Co-Bagging
(C45) and Co-Training (SMO) are shown to be the best-performing methods. Furthermore, in the inductive phase, the classical Self-Training with C45 as base classifier is
also remarkable as an outstanding method.
• The experiments conducted with high-dimensional data sets and very reduced labeled
ratio show that much more work is needed in the field of self-labeled techniques to deal
with these problems.
• The developed software (see “Appendix”) allows the reader to reproduce the experiments carried out and to use it as an SSL framework in which to implement new methods. It could be a useful tool for conducting experimental analyses in an easier and more effective way.
Acknowledgments This work is supported by the Research Projects TIN2011-28488, TIC-6858 and P11-TIC-7765.
Fig. 13 A snapshot of the semi-supervised learning module for KEEL
7 Appendix
As a consequence of this work, we have developed a complete SSL framework which has
been integrated into the Knowledge Extraction based on Evolutionary Learning (KEEL)
tool3 [26]. This research tool is open-source software, written in Java, that supports data
management and the design of experiments. Until now, KEEL has paid special attention to
the implementation of supervised and unsupervised learning, clustering, pattern mining and
so on. Nevertheless, it did not offer support for SSL. We integrated a new SSL module into
this software.
The main characteristics of this module are as follows:
• All the data sets involved in the experimental study have been included into this module
and can be used for new experiments. These data sets are composed of three files for each
partition: training, transductive and test partitions. The first is composed of labeled and unlabeled instances (labeled as “unlabeled”), the transductive partition contains the real class of the unlabeled instances, and the test partition collects the test instances. These data sets are included in the KEEL-data set repository and are static, ensuring that further experiments carried out will no longer be dependent on particular data partitions.
• It allows the design of SSL experiments which generate all the XML scripts and a JAR program for running them, packaged as a zip file for an off-line run. The SSL module is designed for experiments containing multiple data sets and algorithms connected among themselves to obtain the desired experimental setup. The parameter configuration of the methods is also customizable, as well as the number of executions, validation scheme
and so on. Figure 13 shows a snapshot of an experiment with three analyzed self-labeled
methods and the customization of the parameters of the algorithm APSSC. Note that every
3 http://www.keel.es.
method could be executed apart from the KEEL tool with an appropriate configuration
file.
• Special care has been taken to allow researchers to use this module to assess the relative effectiveness of their own procedures. Guidelines about how to integrate a method into KEEL can be found in [35].
The KEEL version with the SSL module is available on the associated Web site.
References
1. Zhu X, Goldberg AB (2009) Introduction to semi-supervised learning, 1st edn. Morgan and Claypool,
San Rafael, CA
2. Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd
edn. Morgan Kaufmann, San Francisco
3. Zhu Y, Yu J, Jing L (2013) A novel semi-supervised learning framework with simultaneous text representing. Knowl Inf Syst 34(3):547–562
4. Chapelle O, Schlkopf B, Zien A (2006) Semi-supervised learning, 1st edn. The MIT Press, Cambridge,
MA
5. Pedrycz W (1985) Algorithms of fuzzy clustering with partial supervision. Pattern Recognit Lett 3:13–20
6. Zhao W, He Q, Ma H, Shi Z (2012) Effective semi-supervised document clustering via active learning
with instance-level constraints. Knowl Inf Syst 30(3):569–587
7. Chen K, Wang S (2011) Semi-supervised learning via regularized boosting working on multiple semisupervised assumptions. IEEE Trans Pattern Anal Mach Intell 33(1):129–143
8. Fujino A, Ueda N, Saito K (2008) Semisupervised learning for a hybrid generative/discriminative classifier
based on the maximum entropy principle. IEEE Trans Pattern Anal Mach Intell 30(3):424–437
9. Joachims T (1999) Transductive inference for text classification using support vector machines. In: Proceedings of 16th international conference on machine learning, Morgan Kaufmann, pp 200–209
10. Blum A, Chawla S (2001) Learning from labeled and unlabeled data using graph mincuts. In: Proceedings
of the eighteenth international conference on machine learning, pp 19–26
11. Wang J, Jebara T, Chang S-F (2013) Semi-supervised learning using greedy max-cut. J Mach Learn Res 14(1):771–800
12. Mallapragada PK, Jin R, Jain A, Liu Y (2009) Semiboost: boosting for semi-supervised learning. IEEE
Trans Pattern Anal Mach Intell 31(11):2000–2014
13. Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting of the association for computational linguistics, pp 189–196
14. Li M, Zhou ZH (2005) SETRED: self-training with editing. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 3518 LNAI,
pp 611–621
15. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of
the annual ACM conference on computational learning theory, pp 92–100
16. Du J, Ling CX, Zhou ZH (2010) When does co-training work in real data? IEEE Trans Knowl Data Eng
23(5):788–799
17. Sun S, Jin F (2011) Robust co-training. Int J Pattern Recognit Artif Intell 25(07):1113–1126
18. Jiang Z, Zhang S, Zeng J (2013) A hybrid generative/discriminative method for semi-supervised classification. Knowl-Based Syst 37:137–145
19. Sun S (2013) A survey of multi-view machine learning. Neural Comput Appl 23(7–8):2031–2038
20. Zhou ZH, Li M (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans Knowl
Data Eng 17:1529–1541
21. Li M, Zhou ZH (2007) Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans Syst Man Cybern A Syst Hum 37(6):1088–1098
22. Sun S, Shawe-Taylor J (2010) Sparse semi-supervised learning using conjugate functions. J Mach Learn
Res 11:2423–2455
23. Zhu X (2005) Semi-supervised learning literature survey. Technical report 1530, Computer Sciences,
University of Wisconsin-Madison
24. Chawla N, Karakoulas G (2005) Learning from labeled and unlabeled data: an empirical study across
techniques and domains. J Artif Intell Res 23:331–366
25. Zhou Z-H, Li M (2010) Semi-supervised learning by disagreement. Knowl Inf Syst 24(3):415–439
26. Alcalá-Fdez J, Sánchez L, García S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J,
Rivas VM, Fernández JC, Herrera F (2009) KEEL: a software tool to assess evolutionary algorithms for
data mining problems. Soft Comput 13(3):307–318
27. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
28. García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis
of power. Inf Sci 180:2044–2064
29. Triguero I, Sáez JA, Luengo J, García S, Herrera F (2013) On the characterization of noise filters for
self-training semi-supervised in nearest neighbor classification, Neurocomputing (in press)
30. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
31. Dasgupta S, Littman ML, McAllester DA (2001) Pac generalization bounds for co-training. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems. Neural
information processing systems: natural and synthetic, vol 14. MIT Press, Cambridge, pp 375–382
32. Quinlan JR (1993) C4.5 programs for machine learning. Morgan Kaufmann Publishers, San Francisco,
CA
33. Efron B, Tibshirani RJ (1993) An Introduction to the bootstrap. Chapman & Hall, New York
34. Goldman S, Zhou Y (2000) Enhancing supervised learning with unlabeled data. In: Proceedings of the
17th international conference on machine learning. Morgan Kaufmann, pp 327–334
35. Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL datamining software tool: data set repository, integration of algorithms and experimental analysis framework.
J Multiple-Valued Logic Soft Comput 17(2–3):255–277
36. Bennett K, Demiriz A, Maclin R (2002) Exploiting unlabeled data in ensemble methods. In: Proceedings
of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 289–296
37. Zhou Y, Goldman S (2004) Democratic co-learning. In: IEEE international conference on tools with
artificial intelligence, pp 594–602
38. Deng C, Guo M (2006) Tri-training and data editing based semi-supervised clustering algorithm. In:
Gelbukh A, Reyes-Garcia C (eds) MICAI 2006: advances in artificial intelligence, vol 4293 of lecture
notes in computer science. Springer, Berlin, pp 641–651
39. Wang J, Luo S, Zeng X (2008) A random subspace method for co-training. In: IEEE international joint
conference on computational intelligence, pp 195–200
40. Hady M, Schwenker F (2008) Co-training by committee: a new semi-supervised learning framework. In:
IEEE international conference on data mining workshops, ICDMW ’08, pp 563–572
41. Hady M, Schwenker F (2010) Combining committee-based semi-supervised learning and active learning.
J Comput Sci Technol 25:681–698
42. Hady M, Schwenker F, Palm G (2010) Semi-supervised learning for tree-structured ensembles of rbf
networks with co-training. Neural Netw 23:497–509
43. Yaslan Y, Cataltepe Z (2010) Co-training with relevant random subspaces. Neurocomputing 73(10–
12):1652–1661
44. Huang T, Yu Y, Guo G, Li K (2010) A classification algorithm based on local cluster centers with a few
labeled training examples. Knowl-Based Syst 23(6):563–571
45. Halder A, Ghosh S, Ghosh A (2010) Ant based semi-supervised classification. In: Proceedings of the 7th
international conference on swarm intelligence, ANTS’10, Springer, Berlin, Heidelberg, pp 376–383
46. Wang Y, Xu X, Zhao H, Hua Z (2010) Semi-supervised learning based on nearest neighbor rule and cut
edges. Knowl-Based Syst 23(6):547–554
47. Deng C, Guo M (2011) A new co-training-style random forest for computer aided diagnosis. J Intell Inf
Syst 36:253–281. doi:10.1007/s10844-009-0105-8
48. Nigam K, Mccallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using em. Mach Learn 39(2):103–134
49. Tang X-L, Han M (2010) Semi-supervised Bayesian artmap. Appl Intell 33(3):302–317
50. Joachims T (2003) Transductive learning via spectral graph partitioning. In: Proceedings of twentieth
international conference on machine learning, vol 1, pp 290–297
51. Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning
from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434
52. Xie B, Wang M, Tao D (2011) Toward the optimization of normalized graph Laplacian. IEEE Trans
Neural Netw 22(4):660–666
53. Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov
2(2):121–167
54. Chapelle O, Sindhwani V, Keerthi SS (2008) Optimization techniques for semi-supervised support vector
machines. J Mach Learn Res 9:203–233
55. Adankon M, Cheriet M (2010) Genetic algorithm-based training for semi-supervised svm. Neural Comput
Appl 19:1197–1206
56. Tian X, Gasso G, Canu S (2012) A multiple kernel framework for inductive semi-supervised svm learning.
Neurocomputing 90:46–58
57. Sugato B, Raymond JM (2003) Comparing and unifying search-based and similarity-based approaches to
semi-supervised clustering. In: Proceedings of the ICML-2003 workshop on the continuum from labeled
to unlabeled data in machine learning and data mining, pp 42–49
58. Yin X, Chen S, Hu E, Zhang D (2010) Semi-supervised clustering with metric learning: an adaptive kernel
method. Pattern Recognit 43(4):1320–1333
59. Grira N, Crucianu M, Boujemaa N (2004) Unsupervised and semi-supervised clustering: a brief survey.
In: A review of machine learning techniques for processing multimedia content. Report of the MUSCLE
European network of excellence FP6
60. Freund Y, Seung HS, Shamir E, Tishby N (1997) Selective sampling using the query by committee
algorithm. Mach Learn 28:133–168
61. Muslea I, Minton S, Knoblock C (2002) Active + semi-supervised learning = robust multi-view learning.
In: Proceedings of ICML-02, 19th international conference on machine learning, pp 435–442
62. Zhang Q, Sun S (2010) Multiple-view multiple-learner active learning. Pattern Recognit 43(9):3113–3119
63. Yu H (2011) Selective sampling techniques for feedback-based data retrieval. Data Min Knowl Discov
22(1–2):1–30
64. Belhumeur P, Hespanha J, Kriegman D (1997) Eigenfaces vs. fisherfaces: recognition using class specific
linear projection. IEEE Trans Pattern Anal Mach Intell 19(7):711–720
65. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: data mining, inference and
prediction, 2nd edn. Springer, Berlin
66. Song Y, Nie F, Zhang C, Xiang S (2008) A unified framework for semi-supervised dimensionality reduction. Pattern Recognit 41(9):2789–2799
67. Li Y, Guan C (2008) Joint feature re-extraction and classification using an iterative semi-supervised
support vector machine algorithm. Mach Learn 71:33–53
68. Liu H, Motoda H (eds) (2007) Computational methods of feature selection. Chapman &Hall/CRC data
mining and knowledge discovery series. Chapman & Hall/CRC, Boca Raton, FL
69. Zhao J, Lu K, He X (2008) Locality sensitive semi-supervised feature selection. Neurocomputing 71(10–
12):1842–1849
70. Gregory PA, Gail AC (2010) Self-supervised ARTMAP. Neural Netw 23:265–282
71. Cour T, Sapp B, Taskar B (2011) Learning from partial labels. J Mach Learn Res 12:1501–1536
72. Joshi A, Papanikolopoulos N (2008) Learning to detect moving shadows in dynamic environments. IEEE
Trans Pattern Anal Mach Intell 30(11):2055–2063
73. Ben-David A (2007) A lot of randomness is hiding in accuracy. Eng Appl Artif Intell 20:875–885
74. Alpaydin E (2010) Introduction to machine learning, 2nd edn. MIT Press, Cambridge, MA
75. Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/mlearn/
MLRepository.html
76. Wu X, Kumar V (eds) (2009) The top ten algorithms in data mining. Chapman & Hall/CRC data mining
and knowledge discovery. Chapman & Hall/CRC, Boca Raton, FL
77. Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6(1):37–66
78. John GH, Langley P (2001) Estimating continuous distributions in Bayesian classifiers. In: Proceedings
of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann, San Mateo, pp
338–345
79. Vapnik VN (1998) Statistical learning theory. Wiley-Interscience, London
80. Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. MIT
Press, Cambridge, MA
81. García S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets
for all pairwise comparisons. J Mach Learn Res 9:2677–2694
82. Sheskin DJ (2011) Handbook of parametric and nonparametric statistical procedures, 5th edn. Chapman
& Hall/CRC, Boca Raton, FL
83. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of
variance. J Am Stat Assoc 32:675–701
84. Bergmann G, Hommel G (1988) Improvements of general multiple test procedures for redundant systems
of hypotheses. In: Bauer P, Hommel G, Sonnemann E (eds) Multiple hypotheses testing. Springer, Berlin
pp 100–115
85. Yang Y, Webb G (2009) Discretization for naive-Bayes learning: managing discretization bias and variance. Mach Learn 74(1):39–74
86. García S, Luengo J, Saez JA, López V, Herrera F (2013) A survey of discretization techniques: taxonomy
and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25(4):734–750
87. Jolliffe IT (1986) Principal component analysis. Springer, Berlin
Author Biographies
Isaac Triguero received the M.Sc. degree in Computer Science from
the University of Granada, Granada, Spain, in 2009. He is currently
a Ph.D. student in the Department of Computer Science and Artificial
Intelligence, University of Granada, Granada, Spain. His research interests include data mining, data reduction, biometrics, evolutionary algorithms and semi-supervised learning.
Salvador García received the M.Sc. and Ph.D. degrees in Computer
Science from the University of Granada, Granada, Spain, in 2004
and 2008, respectively. He is currently an Associate Professor in the
Department of Computer Science, University of Jaén, Jaén, Spain. He
has published more than 30 papers in international journals. As edited
activities, he has co-edited two special issues in international journals
on different Data Mining topics. His research interests include data
mining, data reduction, data complexity, imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms.
Francisco Herrera received his M.Sc. in Mathematics in 1988 and
Ph.D. in Mathematics in 1991, both from the University of Granada,
Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has
published more than 230 papers in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and
Learning of Fuzzy Knowledge Bases” (World Scientific, 2001). He currently acts as Editor in Chief of the international journal “Progress
in Artificial Intelligence” (Springer). He acts as area editor of the
International Journal of Computational Intelligence Systems and associate editor of the journals: IEEE Transactions on Fuzzy Systems,
Information Sciences, Knowledge and Information Systems, Advances
in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing; and he serves as member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Information Fusion, Evolutionary Intelligence, International
Journal of Hybrid Intelligent Systems, Memetic Computation, and
Swarm and Evolutionary Computation. He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish National Award on Computer Science ARITMEL to
the “Spanish Engineer on Computer Science”, International Cajastur “Mamdani” Prize for Soft Computing
(Fourth Edition, 2010), IEEE Transactions on Fuzzy System Outstanding 2008 Paper Award (bestowed in
2011), and 2011 Lotfi A. Zadeh Prize Best paper Award of the International Fuzzy Systems Association.
His current research interests include computing with words and decision making, bibliometrics, data mining, biometrics, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.
2. Self-labeling with prototype generation/selection for semi-supervised classification

2.2 On the Characterization of Noise Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification
• I. Triguero, José A. Sáez, J. Luengo, S. Garcı́a, F. Herrera, On the Characterization of Noise
Filters for Self-Training Semi-Supervised in Nearest Neighbor Classification. Neurocomputing
132 (2014) 30-41, doi: 10.1016/j.neucom.2013.05.055.
– Status: Published.
– Impact Factor (JCR 2014): Not available.
– Current Impact Factor of the Journal (JCR 2012): 1.634
– Subject Category: Computer Science, Artificial Intelligence. Ranking 37 / 115 (Q2).
Neurocomputing 132 (2014) 30–41
On the characterization of noise filters for self-training
semi-supervised in nearest neighbor classification
Isaac Triguero a, José A. Sáez a, Julián Luengo b, Salvador García c, Francisco Herrera a
a Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071 Granada, Spain
b Department of Civil Engineering, LSI, University of Burgos, 09006 Burgos, Spain
c Department of Computer Science, University of Jaén, 23071 Jaén, Spain
Article info
Article history: Received 22 October 2012; Received in revised form 18 February 2013; Accepted 30 May 2013; Available online 12 November 2013
Keywords: Noise filters; Noisy data; Self-training; Semi-supervised learning; Nearest neighbor classification

Abstract
Semi-supervised classification methods have received much attention as suitable tools to tackle training sets with large amounts of unlabeled data and a small quantity of labeled data. Several semi-supervised learning models have been proposed with different assumptions about the characteristics of the input data. Among them, the self-training process has emerged as a simple and effective technique, which does not require any specific hypotheses about the training data. Despite its effectiveness, the self-training algorithm usually makes erroneous predictions, mainly at the initial stages, if noisy examples are labeled and incorporated into the training set.
Noise filters are commonly used to remove corrupted data in standard classification. In 2005, Li and Zhou proposed the addition of a statistical filter to the self-training process. Nevertheless, in this approach, filtering methods have to deal with a reduced number of labeled instances and the erroneous predictions this may induce. In this work, we analyze the integration of a wide variety of noise filters into the self-training process to distinguish the most relevant features of filters. We will focus on the nearest neighbor rule as a base classifier and ten different noise filters. We provide an extensive analysis of the performance of these filters considering different ratios of labeled data. The results are contrasted with nonparametric statistical tests that allow us to identify relevant filters, and their main characteristics, in the field of semi-supervised learning.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
The construction of classifiers can be considered one of the
most important and challenging tasks in machine learning and
data mining [1]. Supervised classification, which has attracted
much attention and research efforts [2], aims to build classifiers
using a set of labeled data. By contrast, in many real-world tasks,
unlabeled data are easier to obtain than labeled ones because they
require less effort, expertise and time. In this context, semi-supervised learning (SSL) [3] is a learning paradigm
concerned with the design of classifiers in the presence of both
labeled and unlabeled data.
SSL extends unsupervised and supervised learning by including additional information typical of the other learning paradigm. Depending on the main objective of the methods, SSL encompasses several settings such as semi-supervised classification (SSC) [4]
and semi-supervised clustering [5]. The former focuses on enhancing
supervised classification by minimizing errors in the labeled examples
but it must also be compatible with the input distribution of unlabeled
instances. The latter, also known as constrained clustering [6], aims to
obtain better defined clusters than the ones obtained from unlabeled
data. There are other SSL settings, including regression with labeled
and unlabeled data, or dimensionality reduction [7] to find a faithful low-dimensional mapping, or selection of high-dimensional data in an SSL context. We focus on SSC.
SSC can be categorized into two slightly different settings [8],
denoted as transductive and inductive learning. On one hand,
transductive learning concerns the problem of predicting the
labels of the unlabeled examples, given in advance, by taking both
labeled and unlabeled data together into account to train a
classifier. On the other hand, inductive learning considers the
given labeled and unlabeled data as the training examples, and its
objective is to predict unseen data. In this paper, we address both
settings to carry out an extensive analysis of the performance of
the studied methods.
Many different approaches have been proposed to classify using
unlabeled data in SSC. They usually make different assumptions
related to the link between the distribution of unlabeled and
labeled data. Generative models [9] assume a joint probability
model p(x, y) = p(y)p(x|y), where p(x|y) is an identifiable mixture
distribution, for example a Gaussian mixture model [10]. The
standard co-training [11] methodology assumes that the feature
space can be split into two different conditionally independent
views and that each view is able to predict the classes perfectly
[12–15]. It trains one classifier in each specific view, and then the
classifiers teach each other the most confident predicted examples.
Multiview learning [16,17] can be viewed as a generalization of co-training, without requiring explicit feature splits or the iterative
mutual-teaching procedure. Instead, it focuses on the explicitly
hypothetical agreement of several classifiers [18]. There are also
other algorithms such as transductive inference for support vector
machines [19,20] that assume that the classes are well-separated
and do not cut through dense unlabeled data. Alternatively, SSC can
also be viewed as a graph min-cut problem [21]. If two instances
are connected by a strong edge, their labels are likely to be the
same. In this case, the graph construction determines the behavior
of this kind of algorithm [22]. In addition, there are recent studies
which address multiple assumptions in one model [8].
Self-training [23,24] is a simple and effective SSL methodology
which has been successfully applied in many real instances
[25,26]. In the self-training process, a classifier is trained with an
initially small number of labeled examples, aiming to classify
unlabeled points. Then it is retrained with its own most confident
predictions, enlarging its labeled training set. This model does not
make any specific assumptions for the input data, but it accepts
that its own predictions tend to be correct.
However, this idea can lead to erroneous predictions if noisy
examples are classified as the most confident examples and
incorporated into the labeled training set. In [27], the authors
propose the addition of a statistical filter [28] to the self-training
process, naming this algorithm SETRED. Nevertheless, this method
does not perform well in many domains. The use of a particular
filter which has been designed and tested under different conditions is not straightforward. Although the aim of any filter is to
remove potentially noisy examples, both correct examples and
examples containing valuable information may also be removed.
Thus, detecting true noisy examples is a challenging task because
the success of filtering methods depends on several factors [29]
such as the kind and nature of data errors, the quantity of noise
removed or the capabilities of the classifier to deal with the loss of
useful information related to the filtering. In the self-training
approach, the number of available labeled data and the induced
noisy examples are two decisive factors when filtering noise.
Hence, the performance of the combination of filtering techniques
and self-training relies heavily on the filtering method chosen. So much so that the inclusion or absence of one prototype in the labeled training set can alter the following stages of the self-training approach, especially in early steps. For these reasons, the
inclusion and analysis of the most suitable filtering method into
the self-training is mandatory in order to diminish the influence of
noisy data.
Filtering techniques follow different approaches to determine
whether an example could be noisy or not. We distinguish two
types of noise detection mechanism: local and global. We call local methods those techniques in which the removal decision is based on a local neighborhood of instances [30,31]. Global methods create different models from the training data. Mislabeled
examples can be considered noisy depending on the hypothesis
agreement of the used classifiers [32,33]. It is necessary to
mention that there are other related approaches in which unlabeled data are used to identify mislabeled training data [34,35].
In this work we study in depth the integration of different noise filters, and we further analyze recent proposals in order to establish their suitability with respect to the self-training process.
We will adopt the Nearest Neighbor (NN) rule [36] as the base
classifier, which has been highlighted as one of the most influential techniques in data mining [37]. For each filtering family, the
most representative noise filters will be tested. The analysis of the
behavior of noise filters in self-training motivates the global
purpose of this paper, which pursues three objectives:
• To determine which characteristics of noise filters are more appropriate for inclusion in the self-training process.
• To perform an empirical study analyzing the transductive and inductive capabilities of the filtered and non-filtered self-training algorithm.
• To check the behavior of this approach when dealing with data sets with different ratios of labeled data.
We will conduct experiments involving a total of 60 classification data sets with different ratios of labeled data: 10%, 20%, 30%
and 40%. In order to test the behavior of noise filters, the
experimental study will include a statistical analysis based on
nonparametric statistical tests [38]. A web page with all the
complementary material is available at http://sci2s.ugr.es/SelfTraining+Filters, including this paper's basic information, all the data
sets created and the complete results obtained for each algorithm.
The rest of the paper is organized as follows: Section 2 defines
the SSC problem and the self-training approach. Section 3 explains
how to combine self-training with noise filters. Section 4 introduces the filtering algorithms used. Section 5 presents the experimental framework and Section 6 discusses the analysis of results
obtained. Finally, in Section 7 we summarize our conclusions.
2. Background: semi-supervised learning via the self-training
approach
This section provides the necessary information to understand
the proposed integration of noise filters into the self-training
process. Section 2.1 defines the SSC problem. Then, Section 2.2
presents the self-training approach used to address the SSC
problem.
2.1. Semi-supervised classification
This section presents the definition and notation for the SSC
problem. A specification of this problem follows: Let xp be an
example where x_p = (x_p1, x_p2, …, x_pD, ω), with x_p belonging to a class ω and a D-dimensional space in which x_pi is the value of the i-th feature of the p-th sample. Then, let us assume that there is a labeled set L which consists of n instances x_p with ω known. Furthermore, there is an unlabeled set U which consists of m instances x_q with ω unknown, with m ≫ n. The L ∪ U set forms the
training set TR. The purpose of SSC is to obtain a robust learned
hypothesis using TR instead of L alone, which can be applied in
two slightly different settings: transductive and inductive learning.
Transductive learning is described as the application of an SSC
technique to classify all the m instances x_q of U with their correct
class. The class assignation should represent the distribution of the
classes efficiently, based on the input distribution of unlabeled
instances and the L instances.
Let TS be a test set composed of t unseen instances x_r with ω
unknown, which has not been used at the training stage of the SSC
technique. The inductive learning phase consists of correctly
classifying the instances of TS based on the previously learned
hypothesis.
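The following minimal sketch illustrates this notation with a 1NN base classifier; all names and the random stand-in data are illustrative only.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))                # stand-in features
y = rng.integers(0, 2, size=500)             # stand-in class labels
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.1, random_state=1)
X_l, X_u, y_l, y_u = train_test_split(X_tr, y_tr, train_size=0.1, random_state=1)

model = KNeighborsClassifier(n_neighbors=1).fit(X_l, y_l)   # learn from L only
print("transductive accuracy (on U):", model.score(X_u, y_u))
print("inductive accuracy (on TS):", model.score(X_ts, y_ts))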
2.2. Self-training
3. Combining self-training and filtering methods
The self-training approach is a wrapper methodology characterized by the fact that the learning process uses its own predictions to teach itself. This process is also known as bootstrapping or
self-teaching [39]. In general, self-training can be used either as
inductive or transductive learning depending on the nature of the
classifier. Self-training follows an iterative procedure in which a
classifier is trained using labeled data to predict the labels of
unlabeled data in order to obtain an enlarged labeled set L.
Fig. 1 outlines the pseudo-code of the self-training methodology. In the following we describe the most significant instructions
enumerated from 1 to 22.
First of all, it is necessary to determine the number of unlabeled
instances which will be added to L in each iteration. Note that this
parameter can be a constant, or it can be chosen as a proportional
value of the number of instances of each class in L, as Blum and
Mitchell suggest in [11]. We apply this idea in our implementations to determine the amount of prototypes per class which will
be added to L in each iteration (Instructions 1–11).
Then, the algorithm enters into a loop to enlarge the labeled set
L (Instructions 14–20). Instruction 15 calculates the confidence
predictions of all the unlabeled instances, as the probability of
belonging to each class. The way in which the confidence predictions are measured is dependent on the type of classifier used. Unlike probabilistic models such as Naive Bayes, whose confidence predictions can be measured as the output probability in prediction, the NN rule has no explicitly measured confidence for an instance. For the NN rule, the algorithm approximates confidence in terms of distance; hence, the most confident unlabeled instance is defined as the closest unlabeled instance to any labeled one (as defined in [27,3]).
Next, instruction 16 creates a set L′ consisting of the most
confident unlabeled data for each class, keeping the proportion of
instances per class previously computed. L′ is labeled with its
predictions and added to L (Instruction 17). Instruction 18 removes
the instances of L′ from U.
In the original description of the self-training approach [23],
this process was repeated until all the instances from U had been
added to L. However, following [11], we have established a limit to
the number of iterations, MAXITER (Instruction 14). Hence, a pool
of unlabeled examples smaller than U is used in our
implementations.
Finally, the obtained L set is used to classify the U set for
transductive learning and the TS for inductive learning.
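A minimal sketch of this loop, assuming the NN rule with the distance-based confidence described above; it omits the per-class proportions of Blum and Mitchell and the pool of unlabeled examples for brevity, and all names are illustrative.

import numpy as np
from scipy.spatial.distance import cdist

def self_training(X_l, y_l, X_u, per_iter=5, max_iter=40):
    # Iteratively enlarge the labeled set L with the most confident predictions.
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        d = cdist(X_u, X_l)                          # distances from U to L
        nearest = d.argmin(axis=1)                   # nearest labeled neighbor
        pick = d.min(axis=1).argsort()[:per_iter]    # smallest distance = most confident
        labels = y_l[nearest[pick]]                  # 1NN prediction for chosen points
        X_l = np.vstack([X_l, X_u[pick]])            # add them to L
        y_l = np.concatenate([y_l, labels])
        X_u = np.delete(X_u, pick, axis=0)           # and remove them from U
    return X_l, y_l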
In this section we explain the combination of self-training with
noise filters in depth. As mentioned above, the goal of the self-training process is to find the most adequate class label for
unlabeled data with the aim of enlarging the L set. However, in
SSC, the number of initial labeled examples tends to be too small
to train a classifier with good generalization capabilities. In this
scenario, noisy instances can be harmful if they are added to the
labeled set, as they bias the classification of unlabeled data to
incorrect classes, which could make the enlarged labeled set in the
next iteration even more noisy. This problem may especially occur
in the initial stages of this process.
Fig. 1. Self-training pseudo-code.

Two types of noisy instances may appear during the self-labeling process:
• The first is caused by the distribution of classes in L, which can lead the classifier to label some instances erroneously.
• There may be outliers within the original unlabeled data. This second kind can be detected, avoiding its labeling and its inclusion into L.
These ideas motivate the treatment of noisy data during the
self-training process. Filtering methods have been commonly used
to deal with noisy data. Nevertheless, most of the proposed
filtering schemes have been designed within the supervised classification framework. Hence, the number of labeled data can
determine the way in which one filter decides whether an
example is noisy or not. If incorrect examples are appropriately
detected and removed during the labeling process, the generalization capabilities of the classifier are expected to be improved.
The filtering technique should be applied at each iteration, after
L′ is formed, in order to detect both types of noisy instances. The
identification of noisy examples is performed using the set L ∪ L′ as
a training set. If one example of L′ is annotated as a possible noisy
example, it is removed from L′, and it will not be added to L.
Nevertheless, this instance should be cleaned from U. Note that this
scheme does not try to relabel suspicious examples and thereby
avoids the introduction of new noise into the training set [27].
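A minimal sketch of this filtering hook, continuing the self-training sketch of Section 2.2; noise_filter is a placeholder for any of the filters described in Section 4, assumed to return a boolean mask over its training set. Candidates flagged as noisy are simply dropped (the caller still removes them from U).

import numpy as np

def filtered_step(X_l, y_l, X_cand, y_cand, noise_filter):
    # L ∪ L' is the filter's training set; only candidates from L' may be dropped.
    X_all = np.vstack([X_l, X_cand])
    y_all = np.concatenate([y_l, y_cand])
    noisy = noise_filter(X_all, y_all)               # boolean mask over X_all
    keep = ~noisy[len(X_l):]                         # flags restricted to the L' block
    return np.vstack([X_l, X_cand[keep]]), np.concatenate([y_l, y_cand[keep]])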
Fig. 2 shows a case study of filtered self-training in a two-class problem. In this figure, we can observe the first iteration of the process and how the selection of the most confident examples can fail due to the distribution of the given labeled instances. One example of Class 2 has been selected as one of the most confident instances of Class 1. A filtering technique is needed to remove incorrectly labeled instances in order to avoid erroneous future labeling in subsequent iterations.

Fig. 2. Example of labeling process with editing.
4. Filtering methods
This section describes the filters adopted in our study. Filtering
methods are preprocessing mechanisms to detect and eliminate
noisy examples in the training set. The separation of noise
detection and learning has the advantage that noisy examples do
not influence the model building design [40].
As we explained before, each method considers an example to
be harmful depending on its nature. Broadly speaking, we can
categorize filters into two different types: local (Section 4.1) and
global filters (Section 4.2).
In all descriptions, we use TR to refer to the training set, FS to refer to the filtered set, and ND to refer to the noisy data identified in the training set (initially, ND = ∅).
4.1. Local filters
These methods create neighborhoods of instances to detect
suspicious examples. Most of them are based on the distance
between prototypes to determine their similarity. The best known
distance measure for these filters is the Euclidean distance (Eq. (1)). We will use it throughout this study, since it is simple, easy to optimize, and has been widely used in the field of instance-based learning [41]:

EuclideanDistance(X, Y) = √( Σ_{i=1}^{D} (x_pi − x_qi)² )    (1)

Here we offer a brief description of the local filtering methods studied:
Edited Nearest Neighbor (ENN) [42]: This algorithm starts with FS = TR. Then each instance in FS is removed if it does not agree with the majority of its k nearest neighbors (see the sketch after this list).
All kNN (AllKNN) [43]: The All kNN technique is an extension of ENN. Initially, FS = TR. Then the NN rule is applied k times. In each execution, the NN rule varies the number of neighbors considered between 1 and k. If one instance is misclassified by the NN rule, it is registered as removable from FS. Then all those that meet the criteria are removed at once.
Relative Neighborhood Graph Edition (RNGE) [44]: This technique builds a proximity undirected graph G = (V, E), in which each vertex corresponds to an instance from TR. There is a set of edges E, so that (x_i, x_j) ∈ E if and only if x_i and x_j satisfy some neighborhood relation (Eq. (2)). In this case, we say that these instances are graph neighbors. The graph neighbors of a given point constitute its graph neighborhood. The edition scheme discards those instances misclassified by their graph neighbors (by the usual voting criterion):

(x_i, x_j) ∈ E ⇔ d(x_i, x_j) ≤ max(d(x_i, x_k), d(x_j, x_k)), ∀ x_k ∈ TR, k ≠ i, j.    (2)
Modified Edited Nearest Neighbor (MENN) [45]: This algorithm starts with FS = TR. Then each instance x_p in FS is removed if it does not agree with all of its k + l nearest neighbors, where l is the number of instances in FS which are at the same distance as the last neighbor of x_p.
Furthermore, MENN works with a prefixed number of pairs (k, k′). k is employed as the number of neighbors used to perform the editing process, and k′ is employed to validate the edited set FS obtained. The best pair found is employed as the final reference set. If two or more sets are found to be optimal, then both are used in the classification of the test instances. A majority rule is used to decide the output of the classifier in this case.
Nearest Centroid Neighbor Edition (NCNEdit) [46]: This algorithm defines the neighborhood, taking into account not only
the proximity of prototypes to a given example, but also their
symmetrical distribution around it. Specifically, it calculates the
k nearest centroid neighbors (k NCNs). These k neighbors can
be searched for through an iterative procedure [47] in the
following way:
1. The first neighbor of x_p is also its nearest neighbor, x_q^1.
2. The i-th neighbor, x_q^i, i ≥ 2, is such that the centroid of this and the previously selected neighbors, x_q^1, …, x_q^i, is the closest to x_p.
The NCN Editing algorithm is a slight modification of ENN,
which consists of discarding from FS every example misclassified by the k NCN rule.
Cut edges weight statistic (CEWS) [28]: This method generates a
graph with a set of vertices V¼TR, and a set of edges, E,
connecting each vertex to its nearest neighbor. An edge connecting two vertices that have different labels is denoted as a
cut edge. If an example is located in a neighborhood with too
many cut edges, it should be considered as noise. Next, a
statistical procedure is applied in order to label cut edges as
noisy examples. This is the filtering method used in the first
self-training approach with edition [27] (SETRED).
Edited Nearest Neighbor Estimating Class Probabilistic and
Threshold (ENNTh) [48]: This method applies a probabilistic
NN rule, where the class of an instance is decided as a weighted
probability of the class of its nearest neighbors (each neighbor
has the same a priori probability, and its associated weight is
the inverse of its distance). The editing process is performed
starting with FS = TR and deleting from FS every prototype
misclassified by this probabilistic rule.
Furthermore, it defines a threshold in the NN rule, which will
not consider instances with an assigned probability lower than
the established threshold.
Multiedit [49,50]: This method starts with FS = ∅ and a new set R defined as R = TR. Then this technique splits R into nf blocks: R_1, …, R_nf (nf > 2). For each instance of block R_i, it applies a kNN rule with R_((i+1) mod nf) as the training set. All misclassified instances are discarded. The remaining instances constitute the new TR. This process is repeated while at least one instance is discarded.
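As announced in the ENN description, the following minimal sketch illustrates that filter with k = 3 and the Euclidean distance of Eq. (1); it marks all instances simultaneously and assumes integer class labels.

import numpy as np
from scipy.spatial.distance import cdist

def enn_mask(X, y, k=3):
    # True for instances that disagree with the majority of their k nearest
    # neighbors, i.e. the examples ENN would remove; FS = X[~enn_mask(X, y)].
    d = cdist(X, X)
    np.fill_diagonal(d, np.inf)              # an instance is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbors of each point
    noisy = np.empty(len(X), dtype=bool)
    for i in range(len(X)):
        votes = np.bincount(y[nn[i]])
        noisy[i] = votes.argmax() != y[i]    # disagrees with the majority vote
    return noisy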
4.2. Global filters
We denote as global filters those methods which apply a
classifier to several subsets of TR in order to detect problematic
examples. These methods use different methodologies to divide
the TR. Then, these methods create models over the generated
subsets and use different heuristics to determine noisy examples.
From here, nf is the number of folds in which the training data are
partitioned by the filtering method.
Classification Filter (CF) [32]: The main steps of this filtering algorithm are the following (a simplified sketch of this scheme is given at the end of this subsection):
1. Split the current training data set TR using an nf-fold cross-validation scheme.
2. For each of these nf parts, a learning algorithm is trained on the other nf − 1 parts, resulting in nf different classifiers. Here, C4.5 is used as the learning algorithm [51].
3. These nf resulting classifiers are then used to tag each instance in the excluded part as either correct or mislabeled, by comparing the training label with that assigned by the classifier.
4. The misclassified examples from the previous step are added to ND.
5. Remove the noisy examples from the training set: FS ← TR \ ND.
Iterative Partitioning Filter (IPF) [33]: This method removes
noisy instances in multiple iterations until a stopping criterion
is reached. The iterative process stops if, for a number of
consecutive iterations, the number of identified noisy examples
in each of these iterations is less than a percentage of the size
of the original training data set. The basic steps of each
iteration are as follows:
1. Split the current training data set TR into nf equal sized
subsets.
2. Build a model with the C4.5 algorithm over each of these nf
subsets and use them to evaluate the whole current training
data set TR.
3. Add to ND the noisy examples identified in TR using a voting
scheme.
4. Remove the noisy examples from the training set: FS ← TR \ ND.
Two voting schemes can be used to identify noisy examples:
consensus and majority. The former removes an example if it is
misclassified by all the classifiers, whereas the latter removes
an instance if it is misclassified by more than half of the
classifiers. Furthermore, a noisy instance should be misclassified by the model which was induced in the subset containing
that instance. In our experimentation we consider the majority
scheme in order to detect most of the potentially noisy
examples.
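The sketch below illustrates the CF scheme with nf = 5, as announced above; since scikit-learn does not include C4.5, a CART decision tree is used as a stand-in, so this is an approximation of the filter rather than the exact method evaluated here.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def classification_filter_mask(X, y, nf=5):
    # Each fold is tagged by a model trained on the remaining folds; examples
    # whose prediction disagrees with their training label go to ND.
    noisy = np.zeros(len(X), dtype=bool)
    kf = KFold(n_splits=nf, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        noisy[test_idx] = clf.predict(X[test_idx]) != y[test_idx]
    return noisy                             # FS = X[~classification_filter_mask(X, y)]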
5. Experimental framework
This section describes the experimental study carried out in
this paper. We provide the measures used to determine the
performance of the algorithms (Section 5.1), the characteristics
of the problems used for the experimentation (Section 5.2),
an enumeration of the algorithms used with their respective
parameters (Section 5.3) and finally a description of the nonparametric statistical tests applied to contrast the results obtained
(Section 5.4).
5.1. Performance measures
Two measures are widely used for measuring the effectiveness
of classifiers: accuracy [1,2] and Cohen's kappa rate [52]. They are
briefly explained as follows:
Accuracy: It is the number of successful hits (correct classifica-
tions) relative to the total number of classifications. It has been
by far the most commonly used metric for assessing the
performance of classifiers for years [1,2].
Cohen's kappa (Kappa rate): It evaluates the portion of hits that
can be attributed to the classifier itself, excluding random hits,
relative to all the classifications that cannot be attributed to
chance alone. Cohen's kappa ranges from −1 (total disagreement) through 0 (random classification) to 1 (perfect agreement). For multi-class problems, kappa is a very useful, yet simple, measure of a classifier's accuracy that compensates for random successes (a small worked example is given after Table 1).

Table 1
Summary description of the original data sets.

Data set        #Ex.    #Atts.  #Cl.
abalone         4174    8       28
appendicitis    106     7       2
australian      690     14      2
autos           205     25      6
balance         625     4       3
banana          5300    2       2
bands           539     19      2
breast          286     9       2
bupa            345     6       2
chess           3196    36      2
coil2000        9822    85      2
contraceptive   1473    9       3
crx             125     15      2
dermatology     366     33      6
ecoli           336     7       8
flare           1066    9       2
german          1000    20      2
glass           214     9       7
haberman        306     3       2
heart           270     13      2
hepatitis       155     19      2
ionosphere      351     33      2
housevotes      435     16      2
iris            150     4       3
led7digit       500     7       10
letter          20,000  16      10
lym             148     18      4
magic           19,020  10      2
mammograph      961     5       2
marketing       8993    13      9
monks           432     6       2
movement        360     90      15
mushroom        8124    22      2
nursery         12,690  8       5
pageblocks      5472    10      5
penbased        10,992  16      10
phoneme         5404    5       2
pima            768     8       2
PostOper        90      8       3
ring            7400    20      2
saheart         462     9       2
satimage        6435    36      7
segment         2310    19      7
sonar           208     60      2
spambase        4597    55      2
spectheart      267     44      2
splice          3190    60      3
tae             151     5       3
texture         5500    40      11
thyroid         7200    21      3
tic-tac-toe     958     9       2
titanic         2201    3       2
twonorm         7400    20      2
vehicle         846     18      4
vowel           990     13      11
wdbc            569     30      2
wine            178     13      3
wisconsin       683     9       2
yeast           1484    8       10
zoo             101     16      7
5.2. Data sets
The experimentation is based on 60 standard classification data
sets taken from the KEEL data set repository (http://sci2s.ugr.es/keel/datasets.php) [53,54]. Table 1
summarizes the properties of the selected data sets. It shows, for
each data set, the number of examples (#Ex.), the number of
attributes (#Atts.), and the number of classes (#Cl.). The data sets
considered in this study contain between 100 and 20,000
instances, the number of attributes ranges from 2 to 85 and the
number of classes varies between 2 and 28. Their values are
normalized in the interval [0,1] to equalize the influence of
attributes with different range domains when using the NN rule.
These data sets have been partitioned using the ten-fold cross-validation procedure. Each training partition is divided into two
parts: labeled and unlabeled examples. Following the recommendation established in [24], in the division process we do not
maintain the class proportion in the labeled and unlabeled sets,
since the main aim of SSC is to exploit unlabeled data for better
classification results. Hence, we use a random selection of examples that will be marked as labeled instances, and the class label of
the rest of the instances will be removed. We ensure that every class
has at least one representative instance.
In order to study the influence of the amount of labeled data,
we take different ratios when dividing the training set. In our
experiments, four ratios are used: 10%, 20%, 30% and 40%. For
instance, assuming a data set which contains 1000 examples,
when the labeled rate is 10%, 100 examples are put into L with
their labels while the remaining 900 examples are put into U
without their labels.
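A minimal sketch of this labeled/unlabeled split (a hypothetical helper of ours, not from the paper); it draws the labeled subset at random while guaranteeing at least one representative per class, as required above:

import numpy as np

def split_labeled_unlabeled(y, labeled_ratio=0.1, seed=1):
    """Return (labeled_idx, unlabeled_idx) for one training partition."""
    rng = np.random.default_rng(seed)
    n_labeled = int(round(labeled_ratio * len(y)))
    # One guaranteed representative per class.
    labeled = [rng.choice(np.where(y == c)[0]) for c in np.unique(y)]
    rest = np.setdiff1d(np.arange(len(y)), labeled)
    extra = rng.choice(rest, size=max(n_labeled - len(labeled), 0),
                       replace=False)
    labeled_idx = np.concatenate([labeled, extra]).astype(int)
    unlabeled_idx = np.setdiff1d(np.arange(len(y)), labeled_idx)
    return labeled_idx, unlabeled_idx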
Table 2
Parameter specification for all the methods employed in the experimentation.

Algorithm      Parameters
SelfTraining   MAX_ITER = 40
ENN            Number of neighbors = 3
AllKNN         Number of neighbors = 3
RNGE           Order of the graph = 1st order
MENN           Number of neighbors = 3
NCNEdit        Number of neighbors = 3
CEWS           Threshold = 0.1
ENNTh          Noise threshold = 0.7
Multiedit      Number of sub-blocks = 3
CF             Number of partitions: n = 5, Base algorithm: C4.5
IPF            Number of partitions: n = 5, Filter type: majority, Iterations for stop criterion: i = 3, Examples removed pct.: p = 1%, Base algorithm: C4.5
SNNRCE         Threshold = 0.5
In summary, this experimental study involves
a total of 240 data sets (60 data sets × 4 labeled rates). Note that test
partitions are kept aside to evaluate the performance of the
learned hypothesis.
All the data sets created can be found on the web page
associated with this paper (http://sci2s.ugr.es/SelfTraining+Filters).
5.3. Algorithms used and parameters

Apart from the original self-training proposal, two of the main
variants of this algorithm proposed in the literature are SETRED
[27] and SNNRCE [24]. The former corresponds to the first attempt
to use a particular filter (CEWS) [28] during the self-training
process. Hence, we will consider SETRED as equivalent to self-training with a CEWS filter. The latter algorithm is a recent
approach which introduces several steps into the original self-training approach, such as a re-labeling stage and a relative graph-based neighborhood to determine the confidence level during the
labeling process. We include these proposals in the experimental
study as comparative techniques.
Table 2 shows the configuration parameters, which are common to all problems, of the comparison techniques and filters used
with self-training. We focus this experimentation on the recommended parameters proposed by their respective authors, assuming that the choice of the values of the parameters was optimally
made. For those filtering methods which are based on the NN rule,
we have established the number of nearest neighbors as k = 3. In
filtering algorithms, a value k > 1 may be convenient when the
interest lies in protecting the classification task against noisy
instances, as Wilson and Martinez suggested in [30]. In all of the
techniques, we use the Euclidean distance. Due to the fact that
CEWS, Multiedit, CF, IPF and SNNRCE are stochastic methods, they
have been run three times per partition.
Implementations of the algorithms can be found on the web
site associated with this paper.
5.4. Statistical tools for analysis

The use of hypothesis testing methods to support the analysis
of results is highly recommended in the field of Machine Learning.
The aim of these techniques is to identify the most relevant
differences found between the methods [55,56]. To this end, the
use of nonparametric tests is preferred to parametric ones,
since the initial conditions that guarantee the reliability of the
latter may not be satisfied, causing the statistical analysis to lose
credibility [57].
Throughout the empirical study, we will focus on the use of the
Friedman Aligned-Ranks (FAR) test [38] as a tool for contrasting
the behavior of each proposal. Its application will allow us to
highlight the existence of significant differences between methods. The Finner test is applied as a post hoc procedure to find out
which algorithms present significant differences. More information about these tests and other statistical procedures can be
found at http://sci2s.ugr.es/sicidm/.
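As an illustration of how FAR rankings of the kind reported below can be obtained, the following sketch (our own reading of the aligned-ranks procedure, not code from the paper) removes each problem's mean performance, ranks all aligned observations jointly, and averages the ranks per algorithm; the magnitudes it produces are consistent with those in Tables 3–6.

import numpy as np
from scipy.stats import rankdata

def friedman_aligned_ranks(perf):
    """perf: (n_datasets, n_algorithms) array of accuracy/kappa values.
    Returns the average aligned rank of each algorithm; smaller is
    better, since larger performances receive smaller ranks here."""
    aligned = perf - perf.mean(axis=1, keepdims=True)  # remove dataset effect
    # Rank all n_datasets * n_algorithms aligned values together.
    ranks = rankdata(-aligned, method="average").reshape(perf.shape)
    return ranks.mean(axis=0)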
6. Analyzing the integration of noise filters in the self-training
approach

In this section, we analyze the results obtained in our experimental study. In particular, our aims are as follows:
To compare the transductive capabilities achieved with the
different kinds of filters under different ratios of labeled data
(Section 6.1).
To study how filtering techniques help the self-training methodology within the generalization process (Section 6.2).
To analyze the behavior of the best filtering techniques in
several data sets (Section 6.3).
To present a global analysis of the results obtained in terms of
the properties of the filtering methods (Section 6.4).
Given the extent of the experimental analysis carried out,
we report the complete experimental results on the web page
associated with this paper. In this section we present summary
figures and the statistical tests conducted. Tables 3–6 tabulate the
information of the statistical analysis performed by nonparametric
multiple comparison procedures over 10%, 20%, 30% and 40% of
labeled data, respectively. In these tables, filtering methods have
been sorted according to their family, starting from classic to more
recent methods. In each table, we carry out a total of four
statistical tests for accuracy and kappa measures, differentiating
between transductive and test phases. The rankings computed
according to the FAR test [38] represent the effectiveness associated with each algorithm. The best (lowest) ranking obtained in
each FAR test is marked with '*', which determines the control
algorithm for the post hoc test. Next, together with each FAR
ranking, we present the adjusted p-value with Finner's test (Finner
APV) based on the control algorithm. Those APVs highlighted in
bold show the methods outperformed by the control, at the α = 0.1
level of significance.
In these tables, we include as a baseline the NN rule trained
only with labeled data (NN-L), to determine the goodness of the
SSC techniques. Note that this technique corresponds to the initial
stage of all the self-training schemes. However, it is also known
that, depending on the problem, unlabeled data can lead to worse
performance [3]; hence, the inclusion of NN-L shows whether the
self-training scheme is an outstanding methodology for SSC.
6.1. Transductive results
As we stated before, the main objective of transductive learning
is to predict the true class label of the unlabeled data used to train.
Hence, a good exploitation of unlabeled data can lead to successful
results.
Observing Tables 3–6 we can make the following analysis:
Considering 10% of labeled instances, the FAR procedure highlights the global filter CF as the best performing in terms of
transductive learning. With this filter, self-training is able to
Table 3
Average rankings of the algorithms (Friedman aligned-ranks + Finner test) over 10% labeled rate.

                         Transductive phase                                   Test phase
Algorithm                FAR (Acc.)  Finner APV  FAR (Kappa)  Finner APV     FAR (Acc.)  Finner APV  FAR (Kappa)  Finner APV
SelfTraining-ENN         416.5580    0.0041      390.2920     0.0154         421.3917    0.0006      397.9667     0.0018
SelfTraining-AllKNN      405.1160    0.0080      395.7580     0.0125         392.8583    0.0060      368.2917     0.0125
SelfTraining-RNGE        321.7920    0.3526      293.7170     0.7979         374.0583    0.0155      342.4333     0.0532
SelfTraining-MENN        364.0000    0.0762      408.7833     0.0066         371.4833    0.0165      405.9417     0.0013
SelfTraining-NCNEdit     402.9254    0.0080      353.3750     0.1134         424.4250    0.0006      395.1250     0.0020
SelfTraining-CEWS        330.2333    0.2775      294.9000     0.7979         379.4667    0.0120      367.8167     0.0125
SelfTraining-ENNTh       363.2833    0.0762      397.6667     0.0125         354.0833    0.0387      379.2000     0.0063
SelfTraining-Multiedit   377.2250    0.0398      413.2080     0.0061         370.5500    0.0165      403.9167     0.0013
SelfTraining-CF          281.6174*   –           282.8255*    –              268.6083*   –           261.4167*    –
SelfTraining-IPF         314.1333    0.4293      297.7164     0.7805         355.0750    0.0387      341.5000     0.0532
SelfTraining             432.5500    0.0015      381.6421     0.0243         464.9833    0.0000      452.3417     0.0000
SNNRCE                   345.4421    0.1577      445.5000     0.0005         388.3083    0.0072      454.2000     0.0000
NN-L                     721.6250    0.0000      721.1176     0.0000         511.2083    0.0000      506.3500     0.0000
Table 4
Average rankings of the algorithms (Friedman aligned-ranks + Finner test) over 20% labeled rate.

                         Transductive phase                                   Test phase
Algorithm                FAR (Acc.)  Finner APV  FAR (Kappa)  Finner APV     FAR (Acc.)  Finner APV  FAR (Kappa)  Finner APV
SelfTraining-ENN         414.1167    0.0012      396.7750     0.0066         434.2333    0.0087      420.5583     0.0235
SelfTraining-AllKNN      424.6750    0.0006      405.3250     0.0045         431.6667    0.0087      415.0333     0.0235
SelfTraining-RNGE        312.8417    0.3315      306.2333     0.5485         316.0083    0.7926      319.8167     0.7618
SelfTraining-MENN        363.8083    0.0358      412.0750     0.0031         378.9333    0.1075      395.9500     0.0530
SelfTraining-NCNEdit     393.1583    0.0051      353.2250     0.0968         397.6000    0.0419      372.9000     0.1453
SelfTraining-CEWS        358.9417    0.0391      350.5750     0.1006         402.0917    0.0366      383.9083     0.0926
SelfTraining-ENNTh       360.5083    0.0391      398.9083     0.0064         355.1500    0.2630      367.0833     0.1731
SelfTraining-Multiedit   396.5667    0.0045      432.3583     0.0008         376.4000    0.1097      400.1250     0.0476
SelfTraining-CF          270.9667*   –           279.6167*    –              305.1917*   –           307.3500*    –
SelfTraining-IPF         297.7083    0.5156      287.8333     0.8417         350.3833    0.2927      343.1667     0.4105
SelfTraining             485.0083    0.0000      416.2333     0.0027         482.1667    0.0002      439.4583     0.0079
SNNRCE                   545.1667    0.0000      599.0000     0.0000         436.1000    0.0087      494.4333     0.0001
NN-L                     453.0333    0.0000      438.3417     0.0007         410.5750    0.0248      416.7167     0.0235
Table 5
Average rankings of the algorithms (Friedman aligned-ranks + Finner test) over 30% labeled rate.

                         Transductive phase                                   Test phase
Algorithm                FAR (Acc.)  Finner APV  FAR (Kappa)  Finner APV     FAR (Acc.)  Finner APV  FAR (Kappa)  Finner APV
SelfTraining-ENN         394.2000    0.0384      381.1417     0.1484         419.4167    0.0052      405.9500     0.0126
SelfTraining-AllKNN      381.6917    0.0700      382.6500     0.1484         388.0500    0.0280      378.4333     0.0467
SelfTraining-RNGE        341.9833    0.3260      324.3333     0.6993         327.0500    0.4380      293.5667*    –
SelfTraining-MENN        335.3250    0.3541      357.0083     0.2833         393.7167    0.0248      391.3500     0.0232
SelfTraining-NCNEdit     403.8500    0.0247      379.5167     0.1484         425.5667    0.0039      396.8750     0.0205
SelfTraining-CEWS        378.8083    0.0714      364.6333     0.2285         408.9500    0.0098      411.3167     0.0126
SelfTraining-ENNTh       338.2333    0.3440      355.2167     0.2833         393.0417    0.0248      393.1583     0.0231
SelfTraining-Multiedit   373.8500    0.0829      400.2833     0.0670         373.5167    0.0607      411.1667     0.0126
SelfTraining-CF          302.7250    0.8561      314.1583     0.8555         293.1833*   –           296.8750     0.9359
SelfTraining-IPF         295.2667*   –           306.6667*    –              320.5833    0.5054      311.7000     0.6911
SelfTraining             437.8167    0.0021      409.8417     0.0477         451.6500    0.0014      429.3500     0.0039
SNNRCE                   644.0250    0.0000      653.7667     0.0000         450.2833    0.0014      519.5583     0.0000
NN-L                     448.7250    0.0011      447.2833     0.0038         431.4917    0.0031      437.2000     0.0029
significantly outperform 8 of the 12 comparison techniques for
the accuracy and kappa measures. The IPF filter, which also
belongs to the global family of filters, can be stressed as an
excellent filter with this labeled ratio. Furthermore, we can
highlight RNGE and CEWS as the most competitive local filters
in comparison with CF. The comparison technique SNNRCE is
outperformed in terms of the kappa measure. By contrast, considering the accuracy measure, it is not overcome at a level
of significance of α = 0.1. This fact indicates that SNNRCE benefits
from random hits.
When the number of labeled instances is increased to 20%, we
observe a clear improvement in terms of accuracy and kappa
rate for all the studied methods. Again, global filters obtain the
two best rankings and the CF filter is stressed as the best
performing method. The number of techniques outperformed
has also been increased. RNGE is the most relevant local
filtering technique in comparison with CF and IPF.
When the data set has a relatively higher number of labeled
instances (30% or 40%), both local and global filtering techniques display similar behavior. This is because all the filtering
Table 6
Average rankings of the algorithms (Friedman aligned-ranks + Finner test) over 40% labeled rate.

                         Transductive phase                                   Test phase
Algorithm                FAR (Acc.)  Finner APV  FAR (Kappa)  Finner APV     FAR (Acc.)  Finner APV  FAR (Kappa)  Finner APV
SelfTraining-ENN         375.1500    0.2488      359.7333     0.4135         404.8000    0.0909      394.6000     0.0718
SelfTraining-AllKNN      359.7917    0.2681      352.1083     0.4401         380.9417    0.2558      378.9333     0.1449
SelfTraining-RNGE        331.3083    0.5635      324.2500     0.8323         322.9083*   –           308.5917*    –
SelfTraining-MENN        363.6417    0.2559      384.6250     0.2378         350.3083    0.5620      363.1917     0.2380
SelfTraining-NCNEdit     368.7833    0.2488      354.9667     0.4401         426.7917    0.0454      403.9000     0.0485
SelfTraining-CEWS        375.0833    0.2488      379.7667     0.2450         418.6333    0.0548      427.5667     0.0227
SelfTraining-ENNTh       351.8833    0.3315      367.7417     0.3477         352.9083    0.5620      362.3667     0.2380
SelfTraining-Multiedit   340.5667    0.4534      354.6833     0.4401         353.1917    0.5620      377.9667     0.1449
SelfTraining-CF          322.5083    0.6813      323.7500     0.8323         343.9083    0.6097      339.9000     0.4466
SelfTraining-IPF         305.6167*   –           314.1167*    –              371.1250    0.3389      354.7583     0.2818
SelfTraining             448.5083    0.0021      423.7333     0.0305         453.4750    0.0090      424.6833     0.0227
SNNRCE                   676.8667    0.0000      680.7000     0.0000         477.7833    0.0020      523.0667     0.0000
NN-L                     456.7917    0.0014      456.3250     0.0033         419.7250    0.0548      416.9750     0.0250
Fig. 3. Accuracy test over 10% of labeled data. (Figure: per-data-set test accuracy of ENN, AllKNN, RNGE, Multiedit, MENN, NCNEdit, ENNTh, CEWS, CF, IPF, SNNRCE and the no-filter baseline over the 60 data sets, with data sets on the x-axis ordered by the accuracy of standard self-training.)
techniques are able to detect noisy examples more easily
when a representative number of labeled data is available. Nevertheless, the
IPF filter stands out with the best ranking in both kappa and
accuracy measures for high labeled ratios. Note that it is able to
obtain better results than the standard self-training or the
baseline NN-L, which shows the usefulness of the filtering
process with a greater number of labeled data.
6.2. Inductive results (test phase)
In contrast to transductive learning, the aim of inductive
learning is to classify unknown examples. In this way, inductive
learning tests the generalization capabilities of the analyzed
methods, checking whether the previously learned hypotheses are appropriate or not.
Apart from Tables 3–6, we include four figures representing the
accuracy obtained by the methods in the different labeled ratios.
Figs. 3 and 4 illustrate the accuracy test obtained in each data set
over 10% and 40% of labeled instances, respectively. For the sake of simplicity,
the figures for 20% and 30% of labeled instances and their
corresponding accuracy tables can be found on the associated
web page.
The aim of these figures is to determine in which data sets the
original self-training algorithm is outperformed. For this reason, we
take the standard self-training as the baseline method to be overcome. For a better visualization, on the x-axis the data sets are
ordered from the maximum to the minimum accuracy obtained by
the standard self-training. The y-axis position is the accuracy test of
each algorithm. Self-training without any filter is drawn as a line.
Therefore, points above this line correspond to data sets for which
the other proposals perform better than the original algorithm.
Finally, Table 7 summarizes the main differences of each
method over the basic self-training algorithm in each labeled
ratio. In this table, we present the number of data sets in which the
obtained accuracy and kappa rates for each technique are strictly
greater (Wins) and greater than or equal to (Ties + Wins) those of the baseline. Again, the best results in each column are highlighted in bold.
Fig. 4. Accuracy test over 40% of labeled data. (Figure: same representation as Fig. 3 for the 40% labeled ratio.)
Table 7
Comparison of each method over the basic self-training approach.

                                       10%          20%          30%          40%
Algorithm                              Acc.  Kappa  Acc.  Kappa  Acc.  Kappa  Acc.  Kappa
SelfTraining-ENN        Wins           24    26     29    28     24    23     25    23
                        Ties + Wins    40    41     49    43     45    42     49    45
SelfTraining-AllKNN     Wins           24    28     28    26     25    24     25    22
                        Ties + Wins    40    43     45    42     43    41     47    42
SelfTraining-RNGE       Wins           27    32     38    33     34    35     29    28
                        Ties + Wins    42    45     51    47     53    52     52    49
SelfTraining-MENN       Wins           26    26     30    30     27    25     26    26
                        Ties + Wins    40    40     44    44     40    39     46    43
SelfTraining-NCNEdit    Wins           20    28     29    27     24    23     23    23
                        Ties + Wins    38    46     52    47     48    47     52    50
SelfTraining-ENNTh      Wins           23    28     32    28     26    23     27    27
                        Ties + Wins    38    42     48    44     42    38     45    44
SelfTraining-CEWS       Wins           26    26     24    22     19    22     20    16
                        Ties + Wins    49    49     47    45     43    45     44    40
SelfTraining-Multiedit  Wins           24    24     31    28     28    24     31    27
                        Ties + Wins    40    40     45    42     45    42     49    43
SelfTraining-CF         Wins           34    35     36    33     29    30     27    27
                        Ties + Wins    45    47     50    46     49    46     51    49
SelfTraining-IPF        Wins           15    28     32    30     30    29     24    24
                        Ties + Wins    31    44     48    46     50    47     46    45
SNNRCE                  Wins           29    30     31    26     28    20     24    18
                        Ties + Wins    29    31     32    28     29    21     24    19
Taking into account these figures, the previous statistical tests
and Table 7, we may make some observations to summarize:
When the inductive learning problem is addressed with 10% of
labeled data, there are significant and interesting differences
between the methods. Concretely, the global CF filter is highlighted as the most promising filter. The Finner procedure
signals the differences between this filter and the rest of the comparison techniques in both accuracy and kappa measures. Fig. 3
also corroborates this statement, because the majority of its
points are above the baseline.
With 20% of labeled instances, CF is still the most suitable
algorithm in terms of generalization capabilities. Depending on
the measure used, accuracy or kappa, the CF filter is able to
obtain a significant improvement over different filters. Nevertheless, IPF, ENNTh and RNGE are established as the best
performing filters, which are not statistically outperformed in
both accuracy and kappa measures. Table 7 shows that these
methods obtain a similar number of victories and ties over the
standard self-training.
Using 30% of labeled instances, two methods from different families, CF and
RNGE, are the two best algorithms. CF is noteworthy in terms
of test accuracy, whereas RNGE is established as the control
algorithm in terms of the kappa measure. In both cases, the
global filter IPF is still a robust filter, not clearly outperformed
by the control ones.
With 40% of labeled examples, RNGE works appropriately but it
does not obtain a distinctive difference from most of the filters.
By contrast, this method shows significant differences in comparison with NN-L, standard self-training and SNNRCE.
The figures reveal that when the labeled ratio increases, a
notable number of points move closer to the baseline.
This means that when a considerable number of
known data are available, the basic self-training algorithm
works well, avoiding misclassification. Table 7 also
confirms this idea, because the number of data sets in which
basic self-training is outperformed decreases over 30% and 40%
of labeled data.
6.3. Analyzing the behavior of the noise filters

The noise detection capabilities of filtering techniques determine the behavior of the self-training filtering scheme. In this
subsection, we analyze how many instances are detected
as noisy during the self-labeling process.
To perform this analysis, we have selected two data sets with
10% of labeled data (we choose nursery and yeast as illustrative
data sets in which the filters reach the same number of iterations
in the self-training approach) and the three best filters (CF, IPF and
RNGE). Table 8 collects the total number of removed instances
(#RI) at the end of the self-training filtered process for each
method. Furthermore, the improvement in average test results
over the standard self-training is also shown (Impr.).
In addition, Fig. 5 shows a graphical representation of the
number of removed instances in each self-training iteration for
both the nursery and yeast data sets. The x-axis represents the
number of iterations carried out, and the y-axis represents the
number of instances removed in each iteration.
As Table 8 and Fig. 5 show, each method works in a different
way, annotating different instances as noisy during the
self-training process. For
instance, the CF filter tends to remove a greater number of
instances than IPF and RNGE. Nevertheless, Table 8 shows that
this fact does not imply that its improvement capabilities are
better.
In both Fig. 5(a) and (b), the global filters CF and IPF present similar
trends. We can see that this kind of filtering technique detects
analogous proportions of noisy instances across the iterations. However, the local filter, RNGE, shows a different behavior
from that of the global filters.
6.4. Global analysis
This section provides a global perspective on the obtained
results. As a summary we want to outline several points about the
self-training filtered approach in general and the characteristics of
noise filters that are more appropriate to improve the self-training
process:
Table 8
Removed instances for each filtering method.

            SelfTraining-CF      SelfTraining-IPF     SelfTraining-RNGE
Data set    #RI     Impr.        #RI     Impr.        #RI     Impr.
Nursery     1313    0.0794       1196    0.0805       1129    0.0781
Yeast       284     0.0411       156     0.0303       201     0.0317
In all the studies performed, one of the self-training filtered
methods always obtains the best results. Independent of the labeled rate
and the established baselines, NN-L and self-training without a
filter are clearly outperformed by at least one of these methods.
This fact highlights the good synergy between noise filters and
the self-training process. As we stated before, the inclusion of
one erroneous example in the labeled data can alter the
following stages and, therefore, the generalization
capabilities of the final model. Inductive results highlight the generalization abilities and the usefulness of using self-training in conjunction
with filtering techniques. Nevertheless, there are significant
differences depending on the selected filter.
Comparing transductive and test phases, we can state that, in
general, the SSC methods used differ widely when tackling the
inductive phase. This shows the necessity of mechanisms,
such as appropriate filtering methodologies, to find robust learned
hypotheses which allow the classifiers to predict unseen cases.
In general, an increment in the labeled ratio reduces the
benefit of the filtering techniques. This supports the view that the
use of SSC methods is more appropriate in the presence of a
lower number of labeled examples. However, even in these
cases, the analyzed self-training filtered algorithms can still be
helpful, as has been shown in the reported results.
The working process of one filter is an important factor in both
transductive and inductive learning. In the conducted experimental study, we have observed that global methods are more
robust in most of the experiments independent of the labeled
Fig. 5. Filtering process. (a) Nursery data set and (b) yeast data set. (Figure: number of detected noisy instances per self-training iteration, over eight iterations, for CF, IPF and RNGE.)
ratio. These methods assume that label errors are independent of the particular classifiers learned from the data, so collecting predictions from different classifiers can provide a better
estimation of mislabeled examples than collecting information
from a single classifier only. This idea performs well under
semi-supervised learning conditions and especially when the
number of labeled data is reduced. Local approaches also help
to improve the self-training process; however, they are more
useful when there is a higher number of labeled data. This
implies that the idea of constructing a local neighborhood to
determine whether one instance should be considered as noise is not
the most appropriate way to deal with SSC.
7. Conclusions
In this paper, we have analyzed the characteristics of a wide
variety of noise filters, of a different nature, to improve the self-training approach in SSC. Most of these filters have been previously studied from a traditional supervised learning perspective.
However, the filtering process can be more difficult in semi-supervised learning due to the reduced number of labeled
instances.
The experimental analysis performed, supported from a statistical point of view, has allowed us to distinguish which characteristics of filtering techniques lead to a better behavior when
addressing the transductive and inductive problems. We have
checked that global filters (the CF and IPF algorithms) stand out as
the best performing family of filters, showing that the hypothesis
agreement of several classifiers is also robust when the ratio of
available labeled data is reduced. Most of the local approaches
need more labeled data to perform better. The use of these filters
has resulted in a better performance than that achieved by the
previously proposed self-training methods, SETRED and SNNRCE.
Thus, the use of global filters is highly recommended in this
field, which can be useful for further work with other SSC
approaches and other base classifiers. As future work, we consider
the design of new global filters for SSC that use fuzzy rough set
models [58,59].
Acknowledgments
Supported by the Research Projects TIN2011-28488 and P11-TIC-7765. J.A. Sáez holds an FPU scholarship from the Spanish
Ministry of Education and Science.
References
[1] E. Alpaydin, Introduction to Machine Learning, 2nd ed., MIT Press, Cambridge,
MA, 2010.
[2] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools
and Techniques, 3rd ed., Morgan Kaufmann, San Francisco, 2011.
[3] X. Zhu, A.B. Goldberg, Introduction to Semi-Supervised Learning, 1st ed.,
Morgan and Claypool, 2009.
[4] O. Chapelle, B. Schlkopf, A. Zien, Semi-Supervised Learning, 1st ed., The MIT
Press, 2006.
[5] W. Pedrycz, Algorithms of fuzzy clustering with partial supervision, Pattern
Recognition Lett. 3 (1985) 13–20.
[6] N. Seliya, T. Khoshgoftaar, Software quality analysis of unlabeled program
modules with semisupervised clustering, IEEE Trans. Syst. Man Cybern. Part A:
Syst. Humans 37 (2) (2007) 201–211.
[7] L. Faivishevsky, J. Goldberger, Dimensionality reduction based on nonparametric mutual information, Neurocomputing 80 (2012) 31–37.
[8] K. Chen, S. Wang, Semi-supervised learning via regularized boosting working
on multiple semi-supervised assumptions, IEEE Trans. Pattern Anal. Mach.
Intell. 33 (1) (2011) 129–143.
[9] A. Fujino, N. Ueda, K. Saito, Semisupervised learning for a hybrid generative/
discriminative classifier based on the maximum entropy principle, IEEE Trans.
Pattern Anal. Mach. Intell. 30 (3) (2008) 424–437.
[10] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric
framework for learning from labeled and unlabeled examples, J. Mach. Learn.
Res. 7 (2006) 2399–2434.
[11] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training,
in: Proceedings of the Annual ACM Conference on Computational Learning
Theory, 1998, pp. 92–100.
[12] Z. Yu, L. Su, L. Li, Q. Zhao, C. Mao, J. Guo, Question classification based on cotraining style semi-supervised learning, Pattern Recognition Lett. 31 (13)
(2010) 1975–1980.
[13] J. Du, C.X. Ling, Z.-H. Zhou, When does co-training work in real data? IEEE
Trans. Knowl. Data Eng. 23 (5) (2010) 788–799.
[14] Y. Yaslan, Z. Cataltepe, Co-training with relevant random subspaces, Neurocomputing 73 (10–12) (2010) 1652–1661.
[15] J. Xu, H. He, H. Man, Dcpe co-training for classification, Neurocomputing 86
(2012) 75–85.
[16] Z.-H. Zhou, M. Li, Tri-training: exploiting unlabeled data using three classifiers,
IEEE Trans. Knowl. Data Eng. 17 (2005) 1529–1541.
[17] M. Li, Z.-H. Zhou, Improve computer-aided diagnosis with machine learning
techniques using undiagnosed samples, IEEE Trans. Syst. Man Cybern., Part A:
Syst. Humans 37 (6) (2007) 1088–1098.
[18] S. Sun, J. Shawe-Taylor, Sparse semi-supervised learning using conjugate
functions, J. Mach. Learn. Res. 11 (2010) 2423–2455.
[19] T. Joachims, Transductive inference for text classification using support vector
machines, in: Proceedings of the 16th International Conference on Machine
Learning, Morgan Kaufmann, 1999, pp. 200–209.
[20] X. Tian, G. Gasso, S. Canu, A multiple kernel framework for inductive semisupervised SVM learning, Neurocomputing 90 (2012) 46–58.
[21] A. Blum, S. Chawla, Learning from labeled and unlabeled data using graph
mincuts, in: Proceedings of the 18th International Conference on Machine
Learning, 2001, pp. 19–26.
[22] A. Mantrach, N. Van Zeebroeck, P. Francq, M. Shimbo, H. Bersini, M. Saerens,
Semi-supervised classification and betweenness computation on large, sparse,
directed graphs, Pattern Recognition 44 (6) (2011) 1212–1224.
[23] D. Yarowsky, Unsupervised word sense disambiguation rivaling supervised
methods, in: Proceedings of the 33rd Annual Meeting of the Association for
Computational Linguistics, 1995, pp. 189–196.
[24] Y. Wang, X. Xu, H. Zhao, Z. Hua, Semi-supervised learning based on nearest
neighbor rule and cut edges, Knowl. Based Syst. 23 (6) (2010) 547–554.
[25] Y. Li, H. Li, C. Guan, Z. Chin, A self-training semi-supervised support vector
machine algorithm and its applications in brain computer interface, in:
ICASSP, IEEE International Conference on Acoustics, Speech and Signal
Processing—Proceedings, 2007, pp. 385–388.
[26] U. Maulik, D. Chakraborty, A self-trained ensemble with semisupervised SVM:
an application to pixel classification of remote sensing imagery, Pattern
Recognition 44 (3) (2011) 615–623.
[27] M. Li, Z.-H. Zhou, SETRED: self-training with editing, in: Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics), 2005, pp. 611–621.
[28] F. Muhlenbach, S. Lallich, D. Zighed, Identifying and handling mislabelled
instances, J. Intell. Inf. Syst. 39 (2004) 89–109.
[29] X. Wu, X. Zhu, Mining with noise knowledge: error-aware data mining, IEEE
Trans. Syst. Man Cybern. Part A: Syst. Humans 38 (4) (2008) 917–932.
[30] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning
algorithms, Mach. Learn. 38 (3) (2000) 257–286.
[31] S. García, J. Derrac, J. Cano, F. Herrera, Prototype selection for nearest neighbor
classification: taxonomy and empirical study, IEEE Trans. Pattern Anal. Mach.
Intell. 34 (3) (2012) 417–435.
[32] D. Gamberger, R. Boskovic, N. Lavrac, C. Groselj, Experiments with noise
filtering in a medical domain, in: Proceedings of the 16th International
Conference on Machine Learning, 1999, pp. 143–151.
[33] T.M. Khoshgoftaar, P. Rebours, Improving software quality prediction by noise
filtering techniques, J. Comput. Sci. Technol. 22 (2007) 387–396.
[34] D. Guan, W. Yuan, Y.-K. Lee, S. Lee, Nearest neighbor editing aided by
unlabeled data, Inf. Sci. 179 (13) (2009) 2273–2282.
[35] D. Guan, W. Yuan, Y.-K. Lee, S. Lee, Identifying mislabeled training data with
the aid of unlabeled data, Appl. Intell. 35 (2011) 345–358.
[36] T.M. Cover, P.E. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf.
Theory 13 (1) (1967) 21–27.
[37] X. Wu, V. Kumar (Eds.), The Top Ten Algorithms in Data Mining, Chapman &
Hall/CRC Data Mining and Knowledge Discovery, 2009.
[38] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for
multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inf. Sci. 180 (2010)
2044–2064.
[39] N. Chawla, G. Karakoulas, Learning from labeled and unlabeled data: an
empirical study across techniques and domains, J. Artif. Intell. Res. 23 (2005)
331–366.
[40] D. Gamberger, N. Lavrac, S. Dzeroski, Noise detection and elimination in data
preprocessing: experiments in medical domains, Appl. Artif. Intell. 14 (2000)
205–223.
[41] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Mach.
Learn. 6 (1) (1991) 37–66.
[42] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data,
IEEE Trans. Syst. Man Cybern. 2 (3) (1972) 408–421.
[43] I. Tomek, An experiment with the edited nearest-neighbor rule, IEEE Trans.
Syst. Man Cybern. 6 (6) (1976) 448–452.
[44] J.S. Sánchez, F. Pla, F.J. Ferri, Prototype selection for the nearest neighbour rule
through proximity graphs, Pattern Recognition Lett. 18 (1997) 507–513.
[45] K. Hattori, M. Takahashi, A new edited k-nearest neighbor rule in the pattern
classification problem, Pattern Recognition 33 (3) (2000) 521–528.
[46] J. Sánchez, R. Barandela, A. Marques, R. Alejo, J. Badenas, Analysis of new
techniques to obtain quality training sets, Pattern Recognition Lett. 24 (7)
(2003) 1015–1022.
[47] B.B. Chaudhuri, A new definition of neighborhood of a point in multidimensional space, Pattern Recognition Lett. 17 (1) (1996) 11–17.
[48] F. Vázquez, J. Sánchez, F. Pla, A stochastic approach to Wilson's editing
algorithm, in: Proceedings of the 2nd Iberian Conference on Pattern Recognition and Image Analysis, 2005, pp. 35–42.
[49] P.A. Devijver, J. Kittler, On the edited nearest neighbor rule, in: Proceedings of
the Fifth International Conference on Pattern Recognition, 1980, pp. 72–80.
[50] F.J. Ferri, J.V. Albert, E. Vidal, Consideration about sample-size sensitivity of a
family of edited nearest-neighbor rules, IEEE Trans. Syst. Man Cybern. 29 (4)
(1999) 667–672.
[51] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, USA, 1993.
[52] A. Ben-David, A lot of randomness is hiding in accuracy, Eng. Appl. Artif. Intell.
20 (2007) 875–885.
[53] J. Alcalá-Fdez, L. Sánchez, S. García, M.J. del Jesus, S. Ventura, J.M. Garrell,
J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J.C. Fernández, F. Herrera, KEEL: a
software tool to assess evolutionary algorithms for data mining problems, Soft
Comput. 13 (3) (2009) 307–318.
[54] J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez,
F. Herrera, KEEL data-mining software tool: data set repository, integration
of algorithms and experimental analysis framework, J. Multiple-Valued Logic
Soft Comput. 17 (2–3) (2010) 255–277.
[55] S. García, A. Fernández, J. Luengo, F. Herrera, A study of statistical techniques
and performance measures for genetics-based machine learning: accuracy
and interpretability, Soft Comput. 13 (10) (2009) 959–977.
[56] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, 5th ed., Chapman & Hall/CRC, 2011.
[57] S. García, F. Herrera, An extension on “Statistical comparisons of classifiers
over multiple data sets” for all pairwise comparisons, J. Mach. Learn. Res. 9
(2008) 2677–2694.
[58] N. Mac Parthalain, R. Jensen, Fuzzy-rough set based semi-supervised learning,
in: 2011 IEEE International Conference on Fuzzy Systems (FUZZ), 2011,
pp. 2465–2472.
[59] J. Derrac, N. Verbiest, S. García, C. Cornelis, F. Herrera, On the use of
evolutionary feature selection for improving fuzzy rough set based prototype
selection, Soft Comput. 17 (2013) 223–238.
Isaac Triguero Velázquez received the M.Sc. degree in
Computer Science from the University of Granada,
Granada, Spain, in 2009. He is currently a Ph.D. student
in the Department of Computer Science and Artificial
Intelligence, University of Granada, Granada, Spain. His
research interests include data mining, data reduction,
evolutionary algorithms and semi-supervised learning.
José Antonio Sáez Muñoz received his M.Sc. in Computer Science from the University of Granada, Granada,
Spain, in 2009. He is currently a Ph.D. student in the
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His
research interests include the study of the impact of
noisy data in classification, data preprocessing, fuzzy
rule-based systems and imbalanced learning.
Julián Luengo Martín received his M.Sc. in Computer
Science and Ph.D. from the University of Granada,
Granada, Spain, in 2006 and 2011 respectively. He is
currently an Assistant Professor in the Department of
Civil Engineering, University of Burgos, Burgos, Spain.
His research interests include machine learning and
data mining, data preparation in knowledge discovery
and data mining, missing values, data complexity and
semi-supervised learning.
Salvador García López received his M.Sc. and Ph.D.
degrees in Computer Science from the University of
Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the
Department of Computer Science, University of Jaén,
Jaén, Spain. He has published more than 30 papers in
international journals. As edited activities, he has coedited two special issues in international journals on
different Data Mining topics. His research interests
include data mining, data reduction, data complexity,
imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms.
Francisco Herrera Triguero received his M.Sc. in
Mathematics in 1988 and Ph.D. in Mathematics in
1991, both from the University of Granada, Spain.
He is currently a Professor in the Department of
Computer Science and Artificial Intelligence at the
University of Granada. He has had more than 200
papers published in international journals. He is a
coauthor of the book Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases
(World Scientific, 2001).
He currently acts as an Editor in Chief of the international journal Progress in Artificial Intelligence
(Springer) and serves as an Area Editor of the Journal
Soft Computing (area of evolutionary and bioinspired algorithms) and International
Journal of Computational Intelligence Systems (area of information systems). He
acts as an Associated Editor of the journals: IEEE Transactions on Fuzzy Systems,
Information Sciences, Advances in Fuzzy Systems, and International Journal of
Applied Metaheuristics Computing; and he serves as a Member of several journal
editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence,
Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation,
Swarm and Evolutionary Computation.
He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish
National Award on Computer Science ARITMEL to the Spanish Engineer on
Computer Science, and International Cajastur Mamdani Prize for Soft Computing
(Fourth Edition, 2010), the 2011 IEEE Transactions on Fuzzy Systems Outstanding
Paper Award and the 2011 Lotfi A. Zadeh Prize Best paper Award of the International Fuzzy Systems Association.
His current research interests include computing with words and decision
making, data mining, bibliometrics, data preparation, instance selection, fuzzy rule
based systems, genetic fuzzy systems, knowledge extraction based on evolutionary
algorithms, memetic algorithms and genetic algorithms.
2.3 SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification

• I. Triguero, S. García, F. Herrera, SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification. IEEE Transactions on Cybernetics
– Status: Submitted.
SEG-SSC: A Framework based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification

Journal: IEEE Transactions on Cybernetics
Manuscript ID: CYB-E-2013-08-0823.R1
Manuscript Type: Regular Paper
Date Submitted by the Author: n/a
Complete List of Authors: Triguero, Isaac (University of Granada, Computer Science and Artificial Intelligence); García, Salvador (University of Jaén, Computer Science); Herrera, Francisco (University of Granada, Dept. of Computer Science and Artificial Intelligence)
Keywords: self-labeled methods, co-training, synthetic examples, semi-supervised classification
SEG-SSC: A Framework based on Synthetic
Examples Generation for Self-Labeled
Semi-Supervised Classification
Isaac Triguero, Salvador Garcı́a, and Francisco Herrera
This work was supported by the Research Projects TIN2011-28488, P10-TIC-6858 and P11-TIC-7765.
I. Triguero and F. Herrera are with the Department of Computer Science and Artificial Intelligence of the University of Granada, CITIC-UGR, Granada, Spain, 18071. E-mails: {triguero, herrera}@decsai.ugr.es. Salvador García is with the Department of Computer Science of the University of Jaén, Jaén, Spain, 23071. E-mail: sglopez@ujaen.es.
Abstract—Self-labeled techniques are semi-supervised classification methods that address the shortage of labeled examples
via a self-learning process based on supervised models. They
progressively classify unlabeled data and use them to modify the
hypothesis learned from labeled samples. Most relevant proposals
are currently inspired by boosting schemes to iteratively enlarge
the labeled set. Despite their effectiveness, these methods are
constrained by the number of labeled examples and their distribution, which in many cases is sparse and scattered. The aim of
this work is to design a framework, named SEG-SSC, to improve
the classification performance of any given self-labeled method
by using synthetic labeled data. These are generated via an
oversampling technique and a positioning adjustment model that
use both labeled and unlabeled examples as reference. Next, these
examples are incorporated in the main stages of the self-labeling
process. The principal aspects of the proposed framework are:
(a) introducing diversity to the multiple classifiers used by using
more (new) labeled data, (b) fulfilling labeled data distribution
with the aid of unlabeled data, and (c) being applicable to
any kind of self-labeled method. In our empirical studies, we
have applied this scheme to four recent self-labeled methods,
testing their capabilities with a large number of data sets. We
show that this framework significantly improves the classification
capabilities of self-labeled techniques.
Index Terms—Self-Labeled methods, co-training, synthetic examples, semi-supervised classification.
I. INTRODUCTION

HAVING a multitude of unlabeled data and few labeled
ones occurs quite often in many practical applications
such as medical diagnosis, spam filtering, bioinformatics, etc.
In this scenario, learning appropriate hypotheses with traditional supervised classification methods [1] is not straightforward because they can only exploit labeled data. Nevertheless,
Semi-Supervised Classification (SSC) [2], [3], [4] approaches
also utilize unlabeled data to improve the predictive performance, modifying the learned hypothesis obtained from
labeled examples alone.
With SSC we may pursue two different objectives: transductive and inductive classification [5]. The former is devoted
to predicting the correct labels of a set of unlabeled examples
that is also used during the training phase. The latter refers
to the problem of predicting unseen data by learning from
labeled and unlabeled data as training examples. In this work,
we will analyze both settings.
Existing SSC algorithms are usually classified depending
on the conjectures they make about the relation of labeled
and unlabeled data distributions. Broadly speaking, they are
based on the manifold and/or cluster assumption. The manifold assumption is satisfied if data lie approximately on a
manifold of lower dimensionality than the input space [6]. The
cluster assumption states that similar examples should have
the same label. Graph-based models [7] are the most common
approaches to implementing the manifold assumption [8]. As
regards examples of models based on the cluster assumption,
we can find generative models [9] or semi-supervised support
vector machines [10]. Recent studies have addressed multiple
assumptions in one model [11], [5], [12].
Self-labeled techniques are SSC methods that do not make
any specific suppositions about the input data [13]. These
models use unlabeled data within a supervised framework
via a self-training process. First attempts correspond to the
self-training algorithm [14] that iteratively enlarges the labeled training set by adding the most confident predictions
of the supervised classifier used. The standard co-training
[15] methodology splits the feature space into two different
conditionally independent views. Then, it trains one classifier
in each specific view, and the classifiers teach each other the
most confidently predicted examples. Advanced approaches
do not require explicit feature splits or the iterative mutual-teaching procedure imposed by co-training, as they are commonly based on disagreement-based classifiers [16], [17],
[18]. These models have been successfully applied to many
real applications such as image classification [19], shadow
detection [20], computer-aided diagnosis [21], etc.
Self-labeled techniques are limited by the number of labeled
points and their distribution when identifying reliable unlabeled
examples. This problem is more pronounced when the labeled
ratio is greatly reduced and the labeled examples do not minimally
represent the domain. Moreover, most of the advanced models
use some diversity mechanisms, such as bootstrapping [22], to
provide differences between the hypotheses learned with the
multiple classifiers. However, these mechanisms may provide
a similar performance to classical self-training or co-training
approaches if the number of labeled data is insufficient to
achieve different learned hypotheses.
The aim of this work is to alleviate these weaknesses by
using new synthetic labeled examples to introduce diversity
to multiple classifier approaches and fulfill the labeled data
distribution. A complete motivation for the use of synthetic
labeled examples is discussed in Section III-A.
We propose a framework applicable to any self-labeled
method that incorporates synthetic examples in the self-learning process. We will denote this framework "Synthetic
Examples Generation for Self-labeled Semi-supervised Classification" (SEG-SSC). It is composed of two main parts:
generation and incorporation.
• The generation process consists of an oversampling technique and a later adjustment of the positioning of the
examples. It is initially inspired by the SMOTE algorithm
[23] to generate new synthetic examples, for all the
classes, based on both the small labeled set and the
unlabeled data (see the sketch after this list). Then, this process is refined using a
positioning adjustment of prototypes model [24] based
on a differential evolution algorithm [25].
• New labeled points are then included in two of the main
steps of a self-labeling method: the initialization phase
and the update of the labeled training set, so that it
introduces new examples in a progressive manner during
the self-labeling process.
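The interpolation step of the generation part can be sketched as follows (our own illustration under stated assumptions, not the authors' code): a synthetic example is placed on the segment between a labeled point and one of its nearest neighbors in the reference set, which here is assumed to contain both the labeled and the unlabeled points.

import numpy as np

def smote_like_synthetic(X_labeled, X_reference, k=5, seed=1):
    """Generate one synthetic example per labeled point by interpolating
    towards a random one of its k nearest neighbors in X_reference
    (assumed to include the labeled points themselves)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for x in X_labeled:
        d = np.linalg.norm(X_reference - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]        # skip the point itself
        nn = X_reference[rng.choice(neighbors)]
        gap = rng.random()                        # uniform in [0, 1)
        synthetic.append(x + gap * (nn - x))      # point on the segment
    return np.array(synthetic)

The positioning of these points is subsequently refined by the differential evolution adjustment, which is not sketched here.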
An extensive experimental analysis is carried out to check
the performance of the proposed framework. We apply the
SEG-SSC scheme to four recent self-labeled techniques that
have different characteristics, comparing the performance obtained with the original proposals. We conduct experiments
over 55 standard classification data sets extracted from the
KEEL and UCI repositories [26], [27] and 11 high dimensional
data sets from the book by Chapelle et al. [2]. The results will
be contrasted with nonparametric statistical tests [28].
The remainder of this paper is organized as follows. Section
II defines the SSC problem and sums up the classical and
current self-labeled approaches. Then, Section III presents the
proposed framework, explaining its motivation and the details
of its implementation. Section IV describes the experimental
setup and discusses the results obtained. Finally, Section V
summarizes the conclusions drawn in this work.
II. SELF-LABELED SEMI-SUPERVISED CLASSIFICATION

This section provides the definition of the SSC problem
(Section II-A) and briefly describes the most relevant self-labeled approaches proposed in the literature (Section II-B).

A. Semi-supervised classification
A formal description of the SSC problem is as follows: let
$x_p$ be an example where $x_p = (x_{p1}, x_{p2}, \ldots, x_{pD}, \omega)$, with $x_p$
belonging to a class $\omega$ and a $D$-dimensional space in which
$x_{pi}$ is the value of the $i$-th feature of the $p$-th sample. Then,
let us assume that there is a labeled set $L$ which consists of
$n$ instances $x_p$ with $\omega$ known and an unlabeled set $U$ which
consists of $m$ instances $x_q$ with $\omega$ unknown, where $m > n$. The
set $L \cup U$ forms the training set $TR$. Moreover, there is a test
set $TS$ composed of $t$ unseen instances $x_r$ with $\omega$ unknown,
which has not been used at the training stage.
The aim of SSC is to obtain a robust learned hypothesis
using $TR$ instead of $L$ alone. It can be applied in two slightly
different settings. On the one hand, transductive learning is devoted
to classifying all the $m$ instances $x_q$ of $U$ with their correct
class. The class assignment should represent the distribution
of the classes efficiently, based on the input distribution of
$L$ and $U$. On the other hand, the inductive learning phase
consists of correctly classifying the instances of $TS$ based on
the previously learned hypothesis.

B. Self-labeled techniques: previous work

Self-labeled techniques form an important family of methods in SSC [3]. They are not intrinsically geared to learning
in the presence of both labeled and unlabeled data, but they
use unlabeled points within a supervised learning paradigm.
These techniques aim to obtain one (or several) enlarged
labeled set(s), based on the most reliable predictions. Thus,
these models do not make any specific assumptions about the
input data, but the models accept that their own predictions
tend to be correct. Some authors state that self-labeling is
likely to succeed when the classes form well-separated
clusters [3] (cluster assumption).
The major benefits of this family of methods are its simplicity
and its wrapper methodology. The former is related to
the ease of implementation and applicability. The latter
means that any kind of classifier can be used regardless of
its complexity, which is very important depending on the
problem tackled. As a caveat, the addition of wrongly labeled
examples during the self-labeling process can lead to an even
worse performance. Several mechanisms have been proposed
to reduce this problem [29].
A preeminent work with this philosophy is the self-training
paradigm designed by Yarowsky [14]. In self-training, a supervised classifier is initially trained with the L set. Then it is
retrained with its own most confident predictions, enlarging its
labeled training set. Thus, it is defined as a wrapper method
for SSC. This idea was later extended by Blum and Mitchell
[15] with the method known as co-training. This consists of
two classifiers that are trained on two sufficient and redundant
sets of attributes. This requirement implies that each subset
of features should be able to perfectly define the frontiers
between classes. Then, the method follows a mutual teaching
procedure that works as follows: each classifier labels the most
confidently predicted examples from its point of view and they
are added to the L set of the other classifier. It is also known
that its usefulness is constrained by the imposed requirement [30],
which is not satisfied in many real applications. Nevertheless,
this method has become an example for recent models thanks
to the idea of using the agreement (or disagreement) of
multiple classifiers and the mutual teaching approach. A good
study of when co-training works can be found in [31].
Due to the success of co-training and its relatively limited
applicability, many works have proposed improvements of
standard co-training that eliminate the established conditions.
In [32], the authors proposed a multi-learning approach, so that
two different supervised learning algorithms were used without
splitting the feature space. They showed that this mechanism
divides the instance space into a set of equivalence classes.
Later, the same authors proposed a faster and more precise
alternative, named Democratic co-learning (Democratic-Co)
[33], which is also based on multi-learning. As an alternative
which requires neither sufficient and redundant views nor
several supervised learning algorithms, Zhou and Li [34] presented the Tri-Training algorithm, which attempts to determine
the most reliable unlabeled data as the agreement of three classifiers (same learning algorithm). Then, they proposed the Co-Forest algorithm [21] as a similar approach that uses Random
Forest [35]. A further similar approach is Co-Bagging [36],
[37], where confidence is estimated from the local accuracy of
committee members. Other recent self-labeled approaches are
[38], [39], [40], [41], [42].
In summary, all of these recent schemes work on the
hypothesis that several weak classifiers, learned with a small
number of instances, can produce better generalizations than
only one weak classifier. These methods are also known
as disagreement-based models that are motivated, in part,
by the empirical success of ensemble learning. The term
disagreement-based was recently coined by Zhou and Li [17].
III. SYNTHETIC EXAMPLES GENERATION FOR SELF-LABELED METHODS
In this section we present the SEG-SSC framework. Firstly, Section III-A enumerates the arguments that justify our proposal. Secondly, Section III-B explains how to generate useful synthetic examples in a semi-supervised scenario. Finally, Section III-C describes the SEG-SSC framework, emphasizing when synthetic data should be used.

A. Motivation: Why add synthetic examples?

The most important weakness of self-labeling models can occur when erroneous labeled examples are added to the labeled training set. This incorrectly modifies the learned model, which may lead to the addition of further wrong examples in successive iterations. Why does this situation occur?

• There may be outliers in the original unlabeled set. This problem can be avoided if they are detected and not included in the labeled training set. Several solutions to it exist in the literature, such as editing schemes [29], [43], [44] or other mechanisms [32]. Recent models, such as Tri-Training [34] or Co-Forest [21], establish criteria to compensate for the negative influence of noise by augmenting the labeled training set with sufficient new labeled data.

• Independently of the number of unlabeled examples, self-labeled methods can be limited by the distribution of the labeled input data. If the available labeled instances do not reliably represent the domain of the problem, the estimation of confidence predictions becomes complicated, because the supervised classifiers used do not have enough information to establish coherent hypotheses. This is even more difficult if the labeled points are very close to the decision boundaries. Figure 1 shows an example with the appendicitis problem [27]. This picture presents a two-dimensional projection (obtained with PCA [45]) of the problem and a partition with 10% of labeled examples. As we can observe, not only is the problem not well represented by the labeled points, but some of the nearest unlabeled points to the two labeled examples of class 1 (blue circles) belong to class 0 (red crosses). This fact can distort the confidence that a self-labeled method estimates with the base classifier.

• A greatly reduced labeled ratio may produce a lack of diversity among self-labeling methods with more than one classifier. As we have established above, multiple classifier approaches work as a combination of several weak classifiers. However, if there are only a few labeled data it is very difficult to obtain different hypotheses, and the classifiers end up being nearly identical. For example, the Tri-Training algorithm is based on a bootstrapping approach [22]. This re-sampling technique creates a new labeled set for each classifier by modifying the original L. In general, this operation yields labeled sets different from the original, but the differences are not significant in the case of small labeled data sets with outliers in the sample. As a consequence, it can lead to biased examples which do not accurately represent the domain of the problem. Although multi-learning approaches attempt to achieve diversity by using different kinds of learning models, a reduced number of instances usually damages their performance because the models are too weak.

The first limitation has already been addressed in the literature with different mechanisms [46]. However, the last two issues are currently open problems.

In order to ease both situations, mainly induced by the shortage of labeled points, we introduce new labeled data into the self-labeling process. To do this, we rely on the success of oversampling approaches in imbalanced domains [47], [48], [49], [50], with the difference that we deal with all the classes of the problem.

Nevertheless, the use of synthetic data in self-labeling methods is not straightforward and must be carefully performed. The aim of using an oversampling method is to effectively reinforce the decision regions between classes. To do so, we are aided by the distribution of the unlabeled data in conjunction with the labeled data, because focusing only on labeled examples may generate noisy instances when the second issue explained above occurs. The effectiveness of this idea is empirically checked in Section IV.
B. Generation of synthetic examples

To generate new labeled data in an SSC context we perform certain operations on the available data, using both the labeled and unlabeled sets. Algorithm 1 outlines the pseudocode of the proposed oversampling technique. This method is initially based on the SMOTE algorithm [23], which was designed for imbalanced domains [51] and is limited to oversampling the minority class. In our proposal, we use the underlying idea of SMOTE as an initialization procedure to generate new examples of all the classes. Furthermore, the resulting synthetic set of prototypes is then readjusted with a positioning adjustment of prototypes scheme [24]. Therefore, this mechanism is divided into two phases: initialization and adjustment of prototypes.
Fig. 1. Two-dimensional projections of Appendicitis (1st and 2nd principal components). (a) Appendicitis problem. (b) Appendicitis with 10% labeled points (Democratic-Co, 0.7590 accuracy). Red circles, class 0. Blue squares, class 1. White triangles, unlabeled.
Algorithm 1: Generation of synthetic examples

1: Input: Labeled set L, Unlabeled set U, Oversampling factor f, Number of Neighbors k
2: Output: OverSampled set
3: OverSampled = ∅
4: TR = L ∪ U
5: ratio = (f · #TR) / #L
6: Randomize TR
7: for i = 1 to NumberOfClasses do
8:    PerClass[i] = getFromClass(L, i)
9:    for j = 1 to #PerClass[i] do
10:       Generated = 0
11:       repeat
12:          neighbors[1..k] = Compute k nearest neighbors for PerClass[i][j] in TR
13:          nn = Random number between 1 and k
14:          Sample = PerClass[i][j]
15:          Nearest = TR[neighbors[nn]]
16:          for m = 1 to NumberOfAttributes do
17:             dif = Nearest[m] − Sample[m]
18:             gap = Random number between 0 and 1
19:             Synthetic[m] = Sample[m] + gap · dif
20:          end for
21:          OverSampled = OverSampled ∪ Synthetic
22:          Generated++
23:       until Generated ≥ ratio
24:    end for
25: end for
26: OverSampled = DE_adjustment(OverSampled, L)
27: return OverSampled

Initialization: We start from the L and U sets as well as a user-defined oversampling factor f and a number k of nearest neighbors. We will generate a set of synthetic prototypes OverSampled that is initialized as empty (Instruction 3). The ratio of synthetic examples to be generated is computed according to f and the proportion of labeled examples in the training set TR (Instructions 4 and 5). Furthermore, to prevent the influence of the order of labeled and unlabeled instances when computing distances, the TR set is randomized (Instruction 6).

Next, the algorithm enters a loop (Instructions 7-25) to proportionally oversample each class, using its own labeled samples as the base. Thus, we extract from L the set PerClass of examples that belong to the current class (Instruction 8). Each one will serve as a base prototype and will be oversampled as many times as the previously computed ratio indicates (Instructions 11-23).

New synthetic examples are located along the line segments joining the base prototype and any of its k nearest neighbors (randomly chosen). To face the SSC scenario, the nearest neighbors are not looked for only in the L set, but are searched for in the whole TR set (Instruction 12). In this way, we try to avoid the negative effects of the second weakness of self-labeled techniques explained before. Following the idea of SMOTE, each synthetic attribute starts from the difference between an existing sample and one of its nearest neighbors (Instruction 17). Then, this difference is scaled by a random number in the range [0,1] and added to the base example (Instructions 18 and 19). It is noteworthy that the class value of the generated example is the same as that of the considered base sample. The generated prototypes are iteratively stored in OverSampled until the stopping condition is satisfied.
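As a complement to the pseudocode, a minimal NumPy sketch of this initialization phase is given below, assuming numeric attributes; the function name generate_synthetic and its defaults are ours, not part of any released software:

import numpy as np

def generate_synthetic(L_X, L_y, U_X, f=0.25, k=5, seed=None):
    # Sketch of Algorithm 1 (initialization phase): SMOTE-style oversampling
    # of every class, searching the neighbors in TR = L u U.
    rng = np.random.default_rng(seed)
    TR = np.vstack([L_X, U_X])
    ratio = int(np.ceil(f * len(TR) / len(L_X)))   # copies per base prototype
    synth_X, synth_y = [], []
    for c in np.unique(L_y):
        for sample in L_X[L_y == c]:               # each labeled point is a base
            d = np.linalg.norm(TR - sample, axis=1)
            neigh = np.argsort(d)[1:k + 1]         # k nearest neighbors in TR
            for _ in range(ratio):
                nearest = TR[rng.choice(neigh)]    # one neighbor at random
                gap = rng.random(sample.shape)     # one gap per attribute
                synth_X.append(sample + gap * (nearest - sample))
                synth_y.append(c)                  # keep the base class value
    return np.array(synth_X), np.array(synth_y)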
Adjustment of prototypes: Can we use this process to improve the distribution of labeled input data? The answer depends on the specific problem and partition used. Although the generation algorithm provides more labeled examples that may be very useful in many domains, they are not totally confident: they may suffer from the same problem as the self-labeling approaches and their confidence predictions. It is well known that SMOTE can generate noisy data [52], which are usually eliminated with editing schemes. Because we are not interested in removing synthetic data, we apply an evolutionary adjustment process to the OverSampled set (Instruction 26) based on the differential evolution algorithm used in [53].

Differential evolution [25] follows the general procedure of an evolutionary algorithm [54]. It starts with a set of candidate solutions, the so-called individuals, which evolve during a determined number of generations through different operators (mutation, crossover and selection), aiming to minimize/maximize a fitness function. For our purposes, this algorithm is adapted in the following way:

• Each individual encodes a single prototype. The process consists of the optimization of the location of all the individuals of the population.

• Mutation and crossover operators guide the optimization of the positioning of the prototypes. These operators only modify the attributes of the prototypes of the OverSampled set, keeping the class value unchanged throughout the evolutionary cycle. We focus on the DE/CurrentToRand/1 strategy to generate new prototypes [55].

• Then, we obtain a new set of synthetic prototypes that should be evaluated to decide whether it is better than the current set. To make this decision, we use the most reliable data we have, that is, the labeled data L. The generated data should be able to correctly classify L. To check this, the nearest neighbor rule is used as the base classifier to obtain the corresponding fitness value, which we try to maximize.

The stopping criterion is met when the generated data perfectly classify L, or when a given number of iterations has been performed. More details can be found in Section III.B of [53].

It is worth mentioning that this optimization process is only applied in cases in which the former oversampling approach generates synthetic data that are not able to classify L. We thereby endow our model with greater robustness. Figure 2 shows an example of a resulting set of over-sampled prototypes in the appendicitis problem. We can observe that, in comparison with Figure 1, the available labeled data points better represent the domain of the problem.
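A simplified sketch of this adjustment phase is shown below. It uses a fixed-parameter DE/CurrentToRand/1 scheme with set-level selection instead of the self-adaptive variant of [53], so it is an approximation of the process described above, not a faithful reimplementation:

import numpy as np

def de_adjustment(S_X, S_y, L_X, L_y, iters=100, F=0.5, K=0.5, seed=0):
    # Moves the synthetic prototypes S_X with DE/CurrentToRand/1; the class
    # values S_y never change. Fitness = 1NN accuracy over the labeled set L.
    rng = np.random.default_rng(seed)

    def fitness(P):
        # classify every labeled point with its nearest prototype
        d = np.linalg.norm(L_X[:, None, :] - P[None, :, :], axis=2)
        return np.mean(S_y[d.argmin(axis=1)] == L_y)

    best = fitness(S_X)
    for _ in range(iters):
        if best == 1.0:                 # stop once L is perfectly classified
            break
        trial = S_X.copy()
        for i in range(len(S_X)):
            r1, r2, r3 = rng.choice(len(S_X), size=3, replace=False)
            trial[i] = S_X[i] + K * (S_X[r1] - S_X[i]) + F * (S_X[r2] - S_X[r3])
        f = fitness(trial)
        if f >= best:                   # keep the better prototype set
            S_X, best = trial, f
    return S_X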
Fig. 2. Example of data generation in the Appendicitis problem (10% labeled + initial synthetic data). Two-dimensional projections of Appendicitis. Red circles, class 0. Blue squares, class 1. White triangles, unlabeled. Red stars, synthetic class 0. Blue pentagons, synthetic class 1. (SEG-SSC+Democratic - 0.8072 accuracy.)
C. Self-labeling with synthetic data

In this subsection, we describe the SEG-SSC framework in depth. With the generation method presented, we obtain new labeled data that can be directly used to improve the generalization capabilities of self-labeled approaches. Nevertheless, the aim of this framework is to be as flexible as possible, so that it can be applied to different self-labeled algorithms. Although each method proceeds in a different way, the methods either share some operations or perform very similar ones. Therefore, we explain how to incorporate synthetic examples into the self-labeling process in order to address the limitations on the distribution of labeled data and the lack of diversity in multiple classifier methods.

In general, self-labeled methods use a set of N classifiers Ci, where i ∈ [1, N], to predict the class of unlabeled instances. Each Ci has an associated labeled set Li that is iteratively enlarged. In what follows, we describe the three main operations that support our proposal; a simplified code sketch is given after the list. For clarity, Figure 3 depicts a flowchart of the proposed scheme, outlining its more general operations and way of working.

• Initialization of classifiers: In current approaches, Li is initially formed from the available data in L. Depending on the particular method, they may use the same labeled data for each Li or apply bootstrapping to introduce diversity. As we showed before, both alternatives can lead to a lack of diversity when more than one classifier is used. To solve this, we promote the generation of different synthetic examples for each classifier Ci, so the generation mechanism is applied a total of N times. Because the L data are the most confident examples, we ensure that they belong to each Li in conjunction with synthetic examples. Note that the generation method has some randomness, so different executions generate distinct synthetic points. This ensures the diversity between the Li sets.

• Self-labeling stage: After the initialization phase, each classifier is trained with its respective Li. Then, the learned hypotheses are used to classify unlabeled points, determining the most reliable examples. There are several ways to perform this operation: single classifier approaches extract their confidence from the base classifier, while multiple classifier approaches calculate confidence predictions in terms of the agreement or combination of hypotheses. Independently of the procedure followed, each classifier obtains a set of confident predictions that will be used to enlarge Li. At this point, there are two possibilities: self- or mutual teaching. The former uses its own predictions to augment its Li. With a mutual teaching approach, a classifier Cj teaches its confident predictions to the rest of the classifiers, that is, it increases Li, ∀ i ≠ j. When all the Li have been increased, a new oversampling stage is performed for each Li, using its prototypes and the remaining unlabeled examples. The resulting Li sets are ready to be used in the next iteration.

• Final classification: The stopping criterion depends on the specific self-labeled method used; it is usually defined by a given number of iterations or by the learned hypotheses of the classifiers no longer changing. When it is satisfied, not all the unlabeled instances need to have been added to one of the Li sets. For this reason, the resulting Li sets have to be used to classify the remaining instances of U and the TS set.

As such, this scheme is applicable to any self-labeling method and should provide better generalization capabilities to all of them. To test the proposed framework, we have applied these ideas to four self-labeling approaches: Democratic-Co [33], Tri-Training [34], Co-Forest [21] and Co-Bagging [36], [37]. These models have different characteristics, such as distinct mechanisms to determine confident examples (agreement or combination), teaching schemes, use of different learning algorithms or different initialization schemes. Table I summarizes the main properties of these models. We modify these models by adding synthetic examples, as explained above, to get an idea of how flexible our framework is. The modified versions of these algorithms will be denoted SEG-SSC+Democratic-Co, SEG-SSC+Tri-Training, SEG-SSC+Co-Forest and SEG-SSC+Co-Bagging.
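To make these operations concrete, the following runnable toy applies the same idea to plain self-training with a single KNN classifier, reusing the generate_synthetic sketch from Section III-B; it is our simplification of the flow in Figure 3, not the exact procedure of the multi-classifier methods, which differ in their confidence and teaching steps:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def seg_self_training(L_X, L_y, U_X, n_iter=10, per_iter=10):
    # Toy SEG-SSC-style self-training: synthetic data at initialization and
    # after every enlargement of the labeled set.
    U = U_X.copy()
    clf = KNeighborsClassifier(n_neighbors=3)
    for _ in range(n_iter):
        SX, Sy = generate_synthetic(L_X, L_y, U)          # Section III-B sketch
        clf.fit(np.vstack([L_X, SX]), np.concatenate([L_y, Sy]))
        if len(U) == 0:
            break
        proba = clf.predict_proba(U)
        pred = clf.classes_[proba.argmax(axis=1)]         # labels proposed for U
        take = np.argsort(proba.max(axis=1))[-per_iter:]  # most reliable points
        L_X = np.vstack([L_X, U[take]])
        L_y = np.concatenate([L_y, pred[take]])
        U = np.delete(U, take, axis=0)                    # shrink unlabeled set
    return clf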
Fig. 3. SEG-SSC flowchart.
TABLE I
MAIN CHARACTERISTICS OF SELECTED SELF-LABELED METHODS

Algorithm      Initialization   Classifiers                      Teaching scheme   Confidence rule
Democratic-Co  Simple           Different learning algorithms    Self-teaching     Weighted majority
Tri-Training   Bootstrapping    Same learning algorithms         Mutual-teaching   Majority
Co-Forest      Bootstrapping    Same learning algorithms         Self-teaching     Majority
Co-Bagging     Simple           Same learning algorithms         Self-teaching     Majority
IV. EXPERIMENTAL SETUP AND ANALYSIS OF RESULTS
This section presents all of the issues related to the experimental framework used in this work and the analysis of
results. Section IV-A describes the main properties of the
data sets used and the parameters of the selected algorithms.
Section IV-B presents and analyzes the results obtained with
standard classification data sets. Finally, Section IV-C studies
the behavior of the proposed framework when dealing with
high dimensional problems.
A. Data sets and parameters
The experimentation is based on 55 standard classification data sets taken from the UCI repository [27] and the KEEL data set repository [26] (http://sci2s.ugr.es/keel/datasets), and 11 high dimensional problems
extracted from the book by Chapelle et al. [2] and the BBC News web page [56]. Tables II and III summarize the properties of the selected data sets. They show, for each data set, the number of examples (#Ex.), the number of attributes (#D.) and the number of classes (#ω.). The standard classification data sets considered contain between 100 and 19,000 instances, the number of attributes ranges from 2 to 90 and the number of classes varies between 2 and 28. In turn, the 11 high dimensional data sets contain between 400 and 83,679 instances and the number of features ranges from 117 to 11,960.

All the data sets have been partitioned using the 10-fold cross-validation procedure; that is, each data set has been split into 10 folds, each one containing 10% of its examples. For each fold, an algorithm is trained with the examples contained in the rest of the folds (training partition) and then tested with the current fold. Note that test partitions are kept aside to assess the performance of the learned hypothesis.

Each training partition is then divided into two parts: labeled and unlabeled examples. Following the recommendation established in [40], in the division process we do not maintain the class proportion in the labeled and unlabeled sets, since the main aim of SSC is to exploit unlabeled data for better classification results. Hence, we use a random selection of examples that will be marked as labeled instances, and the class label of the rest of the instances will be removed. We ensure that every class has at least one representative instance. In standard classification data sets we have taken a labeled ratio of 10%. For high dimensional data sets, we use two splits of the training partitions, with 10 and 100 labeled examples, respectively. In both cases, the remaining instances are marked as unlabeled points.
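A sketch of this division process (a helper of our own, not taken from the compared software) follows:

import numpy as np

def labeled_unlabeled_split(y, labeled_ratio=0.10, seed=None):
    # Random selection of labeled instances, no class-proportion enforcement,
    # but at least one representative instance per class.
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    n_labeled = max(int(labeled_ratio * len(y)), len(classes))
    labeled = [rng.choice(np.where(y == c)[0]) for c in classes]
    rest = np.setdiff1d(np.arange(len(y)), labeled)
    extra = rng.choice(rest, size=n_labeled - len(labeled), replace=False)
    labeled_idx = np.concatenate([labeled, extra]).astype(int)
    unlabeled_idx = np.setdiff1d(np.arange(len(y)), labeled_idx)
    return labeled_idx, unlabeled_idx    # labels of unlabeled_idx are removed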
TABLE II
SUMMARY DESCRIPTION OF STANDARD CLASSIFICATION DATA SETS

Data set        #Ex.   #D.  #ω.      Data set          #Ex.   #D.  #ω.
abalone         4174    8   28       movement libras    360   90   15
appendicitis     106    7    2       mushroom          8124   22    2
australian       690   14    2       nursery          12690    8    5
autos            205   25    6       pageblocks        5472   10    5
banana          5300    2    2       penbased         10992   16   10
breast           286    9    2       phoneme           5404    5    2
bupa             345    6    2       pima               768    8    2
chess           3196   36    2       ring              7400   20    2
cleveland        297   13    5       saheart            462    9    2
coil2000        9822   85    2       satimage          6435   36    7
contraceptive   1473    9    3       segment           2310   19    7
crx              125   15    2       sonar              208   60    2
dermatology      366   33    6       spambase          4597   55    2
ecoli            336    7    8       spectheart         267   44    2
flare-solar     1066    9    2       splice            3190   60    3
german          1000   20    2       tae                151    5    3
glass            214    9    7       texture           5500   40   11
haberman         306    3    2       tic-tac-toe        958    9    2
heart            270   13    2       thyroid           7200   21    3
hepatitis        155   19    2       titanic           2201    3    2
housevotes       435   16    2       twonorm           7400   20    2
iris             150    4    3       vehicle            846   18    4
led7digit        500    7   10       vowel              990   13   11
lymphography     148   18    4       wine               178   13    3
magic          19020   10    2       wisconsin          683    9    2
mammographic     961    5    2       yeast             1484    8   10
marketing       8993   13    9       zoo                101   17    7
monks            432    6    2

TABLE III
SUMMARY DESCRIPTION OF HIGH DIMENSIONAL DATA SETS

Data set    #Ex.    #D.   #ω.  Reference
bci           400    117   2   [2]
coil         1500    241   6   [2]
coil2        1500    241   2   [2]
digit1       1500    241   2   [2]
g241c        1500    241   2   [2]
g241n        1500    241   2   [2]
secstr      83679    315   2   [2]
text         1500  11960   2   [2]
usps         1500    241   2   [2]
bbc          2225   9636   5   [56]
bbcsport      737   4613   5   [56]

TABLE IV
PARAMETER SPECIFICATION FOR THE BASE LEARNERS AND THE SELF-LABELED METHODS USED IN THE EXPERIMENTATION

Algorithm       Parameters
KNN             Number of Neighbors = 3, Euclidean Distance
C4.5            Confidence level: c = 0.25; Minimum number of item-sets per leaf: i = 2; Prune after the tree building
Democratic-Co   Classifiers = 3NN, C4.5, NB
Tri-Training    No parameters specified
Co-Forest       Number of RandomForest Classifiers = 6, Threshold = 0.75
Co-Bagging      MAX_ITER = 40, Committee members = 3, Ensemble Learning = Bagging, Pool U = 100
SEG-SSC         Oversampling factor = 0.25, Number of Neighbors = 5
Differential evolution parameters   Iterations = 100, iterSFGSS = 8, iterSFHC = 20, Fl = 0.1, Fu = 0.9

Regarding the parameters of the algorithms, the selected values are fixed for all problems, and they have been chosen according to the recommendations of the corresponding authors of each algorithm. From our point of view, the approaches analyzed should be as general and as flexible as possible. A good choice of parameters for a specific data source would boost performance, but the methods should offer good enough results even though the parameters are not optimized for each data set. This is the main purpose of this experimental setup: to show how the proposed framework can improve the efficacy of self-labeled techniques. Table IV specifies the configuration parameters of all the methods. Because these algorithms carry out some random operations during the labeling process, they have been run three times per partition.

In this table, we also present the parameters involved in our framework: the oversampling factor, the number of neighbors and the parameters needed for the differential evolution optimization. They could also be adjusted for each problem; however, with the same aim of being as flexible as possible, we have fixed these values empirically in previous experiments. The parameters used for the differential evolution optimization are the same as those established in [53], except for the number of iterations, which has been reduced. We decrease this value because, under this framework, the reference set used by differential evolution contains a smaller number of instances than in the case of supervised learning.

The Co-Forest and Democratic-Co algorithms were designed and tested with determined base classifiers. In this study, these algorithms maintain their classifiers. However, the interchange of the base classifier is allowed in the Tri-Training and Co-Bagging approaches. In these cases, we will
test two base classifiers, the K-Nearest Neighbor [57] and the C4.5 [58] algorithms. A brief description of these base classifiers and their associated confidence prediction computation is given as follows:

• K-Nearest Neighbor (KNN): This is an instance-based learning algorithm that belongs to the lazy learning family of methods [59]. As such, it does not build a model during the learning process and relies on dissimilarities among a set of instances. Self-labeled methods that need to estimate confidence predictions from this classifier can approximate them in terms of the distance to the currently labeled set.

• C4.5: This is a decision tree algorithm [58] that induces classification rules for a given training set. The decision tree is built with a top-down scheme, using the normalized information gain (difference in entropy) obtained from choosing an attribute to split the data. The attribute with the highest normalized information gain is the one used to make the decision. Confidence predictions can be obtained from the accuracy of the leaf that makes the prediction, where the accuracy of a leaf is the percentage of correctly classified training examples over the total number of training instances it covers.
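As an illustration, one simple form of these two confidence estimates (our reading of the descriptions above, not the exact formulas of the original methods) is:

import numpy as np

def knn_confidence(x, L_X, L_y, predicted_class):
    # Heuristic KNN confidence: the closer the nearest labeled point of the
    # predicted class, the higher the confidence.
    d = np.linalg.norm(L_X[L_y == predicted_class] - x, axis=1)
    return 1.0 / (1.0 + d.min())

def leaf_confidence(correct_covered, total_covered):
    # C4.5-style confidence: accuracy of the leaf that makes the prediction.
    return correct_covered / total_covered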
B. Experiments on standard classification data sets
In this subsection we compare the modified versions of
the selected self-labeled methods (within SEG-SSC) with the
original ones, focusing on the results obtained on the 55
standard classification data sets and a labeled ratio of 10%. We
analyze the transductive and inductive accuracy capabilities
of these methods. Both results are presented in Tables V and
VI, respectively. In these tables, we have specified the base
classifier between brackets for Tri-Training and Co-Bagging
algorithms.
Aside from these tables, Figure 4 depicts two box plot
representations of the results obtained in transductive and
inductive settings, respectively. With these box plots we show
a graphical comparison of the performance of the algorithms,
indicating their most important characteristics such as the
median, extreme values and spread of values about the median
in the form of quartiles (Q1 and Q3).
Observing these tables and the figure we can appreciate
differences between each of the original proposals and the
improvement achieved by the addition of synthetic examples.
Nevertheless, the use of hypothesis testing methods is mandatory in order to contrast the results of a new proposal with
several comparison methods. The aim of these techniques is to
identify the most relevant differences found between methods,
which is highly recommended in the data mining field [28].
To do this, we focus on the Wilcoxon signed-ranks test [60]
because it establishes a pairwise comparison between methods.
In this way, we can see if there are significant differences
between the original and modified versions. More information
about this test and other statistical procedures can be found at
http://sci2s.ugr.es/sicidm/.
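For instance, each pairwise comparison can be run with SciPy's implementation of the test, taking the per-data-set accuracies of the two methods as paired samples:

from scipy.stats import wilcoxon

def compare(acc_original, acc_seg_ssc, alpha=0.1):
    # acc_original[i], acc_seg_ssc[i]: accuracy of both versions on data set i.
    stat, p_value = wilcoxon(acc_seg_ssc, acc_original)   # paired, two-sided
    return p_value, p_value < alpha    # significant at the level used here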
Fig. 4. Box plots of transductive (a) and inductive (b) accuracy rates. The boxes contain 50% of the data (Q1 to Q3), blue points are the median values and the lines extend to the most extreme values.

Table VII collects the results of the application of the Wilcoxon signed-ranks test to the transductive and inductive accuracy rates. It shows the R+ and R− rankings achieved and the associated p-values. Adopting a level of significance of α = 0.1, we emphasize in bold face those comparisons in which SEG-SSC significantly outperforms the original algorithm.
With these results we can make the following analysis:

• In Tables V and VI we can see that our framework provides a great improvement in accuracy to the self-labeled techniques used, in most of the data sets, and rarely does it significantly reduce their performance level. On average, the versions that use synthetic examples always outperform the algorithms upon which they are based, in both the transductive and inductive phases. In general, the average improvement achieved for one algorithm in the transductive setting is more or less maintained in the inductive test, which shows the robustness of the models. Co-Bagging (KNN) seems to be the algorithm that benefits most from synthetic instances; by contrast, SEG-SSC does not significantly increase the average performance of Co-Forest. Comparing all the algorithms, the best performing approach is SEG-SSC+Democratic-Co.

• It is known that the performance of self-labeled algorithms depends firstly on the general abilities of their base classifiers [61]. We notice that C4.5 is a better base classifier than KNN for the Tri-Training philosophy. However, the Co-Bagging algorithm performs in a similar way with both classifiers. As expected, the results obtained with our framework are also affected by the base classifier. At this point, we can see that the algorithms based on KNN offer a greater average improvement.
TABLE V
TRANSDUCTIVE ACCURACY RESULTS OVER STANDARD CLASSIFICATION DATA SETS.
(Per-data-set accuracy of the original proposals and the SEG-SSC versions of Democratic-Co, Tri-Training (KNN), Tri-Training (C4.5), Co-Forest, Co-Bagging (KNN) and Co-Bagging (C4.5) over the 55 data sets, together with the average of each column.)
• In Figure 4, the size of the boxes is related to the robustness of the algorithms. We observe that, in many cases, SEG-SSC yields more compact boxes than the original algorithms, and in the cases in which the boxes have more or less the same size, they sit higher in the plot. Median results also help us to identify algorithms that perform well in many domains; again, most of the median values of the modified versions are higher than those of the original proposals. Taking median values into account, SEG-SSC+Tri-Training (C4.5) may be considered the best model.

• According to the Wilcoxon signed-ranks test, all the methods within SEG-SSC significantly overcome their original proposals in terms of transductive learning, supporting the previous conclusions. However, in the inductive phase, we find that Co-Forest and Co-Bagging (C4.5) have not been significantly improved. Even so, they report higher R+ rankings than the original models, which means that they perform slightly better.
TABLE VI
INDUCTIVE ACCURACY RESULTS OVER STANDARD CLASSIFICATION DATA SETS.
(Same structure as Table V: per-data-set accuracy of the original and SEG-SSC versions of the six methods, with column averages.)
TABLE VII
RESULTS OF THE WILCOXON SIGNED-RANKS TEST ON TRANSDUCTIVE AND INDUCTIVE PHASES

                                                       Transductive phase         Test phase
Comparison                                             R+    R−    p-value       R+    R−    p-value
SEG-SSC+Democratic-Co vs. Democratic-Co                1086  454   0.0080        1057  428   0.0067
SEG-SSC+Tri-Training (KNN) vs. Tri-Training (KNN)      1371  169   0.0000        995   545   0.0588
SEG-SSC+Tri-Training (C4.5) vs. Tri-Training (C4.5)    1083  457   0.0086        966   574   0.0997
SEG-SSC+Co-Forest vs. Co-Forest                        1132  353   0.0008        863   622   0.2970
SEG-SSC+Co-Bagging (KNN) vs. Co-Bagging (KNN)          1201  339   0.0003        1204  336   0.0003
SEG-SSC+Co-Bagging (C4.5) vs. Co-Bagging (C4.5)        966   574   0.0997        943   597   0.1460
C. Experiments on high dimensional problems with a small labeled ratio

This subsection is devoted to studying the behavior of the proposed framework when it is applied to high dimensional data with a very reduced labeled ratio. Most of the considered data sets (9 of 11) were provided in the book by Chapelle et al. [2], in which the studies were performed using only 10 and 100 labeled instances. We perform a similar study, with the difference that we also investigate the inductive abilities of the models. Furthermore, the BBC and BBCsport data sets have also been analyzed in a semi-supervised context with a small number of labeled instances [62].
In the scatterplots of Figure 5 we depict transductive and
inductive accuracy results obtained with 10 and 100 labeled
data. In these plots, the x-axis position of the point is the
accuracy of the original self-labeled method on a single data
set, and the y-axis position is the accuracy of the modified
algorithm. Therefore, points above the y = x line correspond
to data sets for which new proposals perform better than the
original algorithm.
Table VIII tabulates the average results obtained on the 11 data sets considered, including transductive and inductive phases for both the 10 and 100 labeled splits.
Given Figure 5 and Table VIII, we can make the following comments:

• In all the plots of Figure 5, most of the points are above the y = x line, which means that, with the proposed framework, the self-labeled techniques perform better than the original algorithms. Differentiating between 10 and 100 available labeled points, we can see that with 100 labeled examples there are more points above this line in both the transductive and inductive phases. We do not discern great differences between the performance obtained in the two learning phases, which shows that the hypotheses learned with the available labeled and unlabeled data were appropriate.
Fig. 5. High dimensional data sets: transductive and inductive accuracy results. Scatterplots of the accuracy of each original method (x-axis) against its SEG-SSC version (y-axis), with the y = x line for reference. (a) 10 labeled points: transductive accuracy. (b) 10 labeled points: inductive accuracy. (c) 100 labeled points: transductive accuracy. (d) 100 labeled points: inductive accuracy.

TABLE VIII
HIGH DIMENSIONAL DATA SETS: AVERAGE RESULTS OBTAINED IN TRANSDUCTIVE (TRS) AND INDUCTIVE (TST) PHASES.

#L   Phase  Democratic-Co  Tri-Training  Tri-Training  Co-Forest  Co-Bagging  Co-Bagging
                           (KNN)         (C4.5)                   (KNN)       (C4.5)
10   TRS    56.6987        56.1988       53.1233       58.0919    56.8976     53.8810
10   TST    56.5331        55.9070       53.3841       57.6709    56.3126     53.8852
100  TRS    70.8563        66.5833       68.7295       70.1887    65.4552     68.7148
100  TST    70.1286        65.4899       69.1031       69.2605    66.1272     69.7143

#L   Phase  SEG-SSC+       SEG-SSC+      SEG-SSC+      SEG-SSC+   SEG-SSC+    SEG-SSC+
            Democratic-Co  Tri-Training  Tri-Training  Co-Forest  Co-Bagging  Co-Bagging
                           (KNN)         (C4.5)                   (KNN)       (C4.5)
10   TRS    58.4330        57.1121       58.7192       55.5031    57.2811     57.8010
10   TST    58.8520        56.0891       58.3297       55.6181    57.4786     58.1651
100  TRS    73.4449        70.1754       71.7133       68.0144    70.4770     72.4768
100  TST    73.5080        65.1796       71.1483       67.1261    71.0570     71.9971
• Table VIII shows that, on average, the proposed scheme obtains a better performance level than the original ones in most cases, independently of the learning phase and the number of labeled data considered. Attending to the difference between transductive and inductive results, we observe that, in general, SEG-SSC increments both proportionally. Nevertheless, there are significant differences between the results obtained with 10 and 100 labeled points.
• With these results in mind, we can see the good synergy between synthetic examples and self-labeled techniques in these domains. But what are the main differences with respect to the results obtained in the previous subsection? We observe great differences between the algorithms that use KNN as a base classifier and those that use C4.5. With standard classification data sets, we ascertained that C4.5 was the best base classifier for Tri-Training and that it performs similarly to KNN for Co-Bagging. These statements are maintained in these domains, where C4.5 performs better. In this study, SEG-SSC+Democratic-Co may be highlighted as the best performing model, obtaining the highest transductive and inductive accuracy results with 10 and 100 labeled examples.
V. CONCLUDING REMARKS
In this paper we have developed a novel framework, called SEG-SSC, to improve the performance of any self-labeled semi-supervised classification method. It is focused on the idea of using synthetic examples in order to diminish the drawbacks occasioned by the absence of labeled examples, which deteriorates the efficiency of this family of methods.

The proposed self-labeled scheme with synthetic examples has been incorporated into four well-known self-labeled techniques, which have been modified by introducing the necessary elements to follow the designed framework. These models are able to outperform the original self-labeled methods because the addition of new labeled data improves the diversity of multiple classifier approaches and fills in the distribution of labeled data.

The wide experimental study carried out has allowed us to investigate the behavior of the proposed scheme on a large number of data sets with varied numbers of instances and features. The results have been statistically compared, supporting the assertion that our proposal is a suitable tool for enhancing self-labeled methods.

There are many possible variations of our proposed semi-supervised scheme that could be interesting to explore as future work. In our opinion, the use of oversampling techniques with self-labeled techniques is not only a new way to improve the capabilities of this family of techniques, but could also be useful for most of the existing semi-supervised learning algorithms.
REFERENCES

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[2] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, 1st ed. The MIT Press, 2006.
[3] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning, 1st ed. Morgan and Claypool, 2009.
[4] F. Schwenker and E. Trentin, "Pattern classification and clustering: A review of partially supervised learning approaches," Pattern Recognition Letters, vol. 37, pp. 4-14, 2014.
[5] K. Chen and S. Wang, "Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 129-143, 2011.
[6] G. Wang, F. Wang, T. Chen, D.-Y. Yeung, and F. Lochovsky, "Solution path for manifold regularized semisupervised classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 308-319, 2012.
[7] A. Blum and S. Chawla, "Learning from labeled and unlabeled data using graph mincuts," in Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 19-26.
[8] J. Wang, T. Jebara, and S.-F. Chang, "Semi-supervised learning using greedy max-cut," Journal of Machine Learning Research, vol. 14, no. 1, pp. 771-800, 2013.
[9] A. Fujino, N. Ueda, and K. Saito, "Semisupervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 424-437, 2008.
[10] T. Joachims, "Transductive inference for text classification using support vector machines," in Proc. 16th International Conference on Machine Learning. Morgan Kaufmann, 1999, pp. 200-209.
[11] P. Kumar Mallapragada, R. Jin, A. Jain, and Y. Liu, "Semiboost: Boosting for semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2000-2014, 2009.
[12] Q. Wang, P. Yuen, and G. Feng, "Semi-supervised metric learning via topology preserving multiple semi-supervised assumptions," Pattern Recognition, vol. 46, no. 9, pp. 2576-2587, 2013.
[13] I. Triguero, S. García, and F. Herrera, "Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study," Knowledge and Information Systems, pp. 1-40, 2014, in press, doi: 10.1007/s10115-013-0706-y.
[14] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995, pp. 189-196.
[15] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with Co-Training," in Proceedings of the Annual ACM Conference on Computational Learning Theory, 1998, pp. 92-100.
[16] K. Bennett, A. Demiriz, and R. Maclin, "Exploiting unlabeled data in ensemble methods," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 289-296.
[17] Z.-H. Zhou and M. Li, "Semi-supervised learning by disagreement," Knowledge and Information Systems, vol. 24, no. 3, pp. 415-439, 2010.
[18] G. Jin and R. Raich, "Hinge loss bound approach for surrogate supervision multi-view learning," Pattern Recognition Letters, vol. 37, pp. 143-150, 2014.
[19] U. Maulik and D. Chakraborty, "A self-trained ensemble with semisupervised svm: An application to pixel classification of remote sensing imagery," Pattern Recognition, vol. 44, no. 3, pp. 615-623, 2011.
[20] A. Joshi and N. Papanikolopoulos, "Learning to detect moving shadows in dynamic environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 2055-2063, Nov. 2008.
[21] M. Li and Z. H. Zhou, "Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 37, no. 6, pp. 1088-1098, 2007.
[22] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, August 1996.
[23] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[24] I. Triguero, S. García, and F. Herrera, "Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification," Pattern Recognition, vol. 44, no. 4, pp. 901-916, 2011.
[25] K. V. Price, R. M. Storn, and J. A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, ser. Natural Computing Series, G. Rozenberg, T. Bäck, A. E. Eiben, J. N. Kok, and H. P. Spaink, Eds., 2005.
[26] J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, "KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255-277, 2011.
[27] A. Frank and A. Asuncion, "UCI machine learning repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[28] S. García, A. Fernández, J. Luengo, and F. Herrera, "Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power," Information Sciences, vol. 180, pp. 2044-2064, 2010.
[29] M. Li and Z. H. Zhou, "SETRED: self-training with editing," in Lecture Notes in Computer Science, vol. 3518, 2005, pp. 611-621.
[30] S. Dasgupta, M. L. Littman, and D. A. McAllester, "PAC generalization bounds for co-training," in Advances in Neural Information Processing Systems 14, 2001, pp. 375-382.
[31] J. Du, C. X. Ling, and Z. H. Zhou, "When does co-training work in real data?" IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 5, pp. 788-799, 2010.
[32] S. Goldman and Y. Zhou, "Enhancing supervised learning with unlabeled data," in Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann, 2000, pp. 327-334.
[33] Y. Zhou and S. Goldman, "Democratic co-learning," in Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, 2004, pp. 594-602.
[34] Z. H. Zhou and M. Li, "Tri-training: Exploiting unlabeled data using three classifiers," IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 1529-1541, 2005.
[35] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[36] M. Hady and F. Schwenker, "Combining committee-based semi-supervised learning and active learning," Journal of Computer Science and Technology, vol. 25, pp. 681-698, 2010.
[37] M. Hady, F. Schwenker, and G. Palm, "Semi-supervised learning for tree-structured ensembles of rbf networks with co-training," Neural Networks, vol. 23, pp. 497-509, 2010.
[38] Y. Yaslan and Z. Cataltepe, "Co-training with relevant random subspaces," Neurocomputing, vol. 73, no. 10-12, pp. 1652-1661, 2010.
[39] T. Huang, Y. Yu, G. Guo, and K. Li, "A classification algorithm based on local cluster centers with a few labeled training examples," Knowledge-Based Systems, vol. 23, no. 6, pp. 563-571, 2010.
[40] Y. Wang, X. Xu, H. Zhao, and Z. Hua, "Semi-supervised learning based on nearest neighbor rule and cut edges," Knowledge-Based Systems, vol. 23, no. 6, pp. 547-554, 2010.
[41] S. Sun and Q. Zhang, "Multiple-view multiple-learner semi-supervised learning," Neural Processing Letters, vol. 34, no. 3, pp. 229-240, 2011.
[42] A. Halder, S. Ghosh, and A. Ghosh, "Aggregation pheromone metaphor for semi-supervised classification," Pattern Recognition, vol. 46, no. 8, pp. 2239-2248, 2013.
[43] M.-L. Zhang and Z.-H. Zhou, "CoTrade: Confident co-training with data editing," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 6, pp. 1612-1626, 2011.
[44] I. Triguero, J. A. Sáez, J. Luengo, S. García, and F. Herrera, "On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification," Neurocomputing, 2013, in press, doi: 10.1016/j.neucom.2013.05.055.
[45] I. T. Jolliffe, Principal Component Analysis. Berlin; New York: Springer-Verlag, 1986.
[46] C. Deng and M. Guo, "A new co-training-style random forest for computer aided diagnosis," Journal of Intelligent Information Systems, vol. 36, pp. 253-281, 2011.
[47] Y. Sun, A. K. C. Wong, and M. S. Kamel, "Classification of imbalanced data: A review," International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, no. 4, pp. 687-719, 2009.
[48] H. He and E. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009.
Isaac Triguero received the M.Sc. degree in Computer Science from the University of Granada,
Granada, Spain, in 2009.
He is currently a Ph.D. student in the Department
of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. His research interests include data mining, data reduction, biometrics, evolutionary algorithms and semi-supervised
learning.
Salvador Garcı́a received the M.Sc. and Ph.D.
degrees in Computer Science from the University
of Granada, Granada, Spain, in 2004 and 2008,
respectively.
He is currently an Associate Professor in the Department of Computer Science, University of Jaén,
Jaén, Spain. He has published more than 40 papers
in international journals. As editorial activities, he has
co-edited two special issues in international journals
on different data mining topics and is a member of
the editorial board of the Information Fusion journal.
His research interests include data mining, data reduction, data complexity,
imbalanced learning, semi-supervised learning, statistical inference and evolutionary algorithms.
[36] M. Hady and F. Schwenker, “Combining committee-based semi-supervised learning and active learning,” Journal of Computer Science
and Technology, vol. 25, pp. 681–698, 2010.
[37] M. Hady, F. Schwenker, and G. Palm, “Semi-supervised learning for
tree-structured ensembles of RBF networks with co-training,” Neural
Networks, vol. 23, pp. 497–509, 2010.
[38] Y. Yaslan and Z. Cataltepe, “Co-training with relevant random subspaces,” Neurocomputing, vol. 73, no. 10-12, pp. 1652–1661, 2010.
[39] T. Huang, Y. Yu, G. Guo, and K. Li, “A classification algorithm based on
local cluster centers with a few labeled training examples,” Knowledge-Based Systems, vol. 23, no. 6, pp. 563–571, 2010.
[40] Y. Wang, X. Xu, H. Zhao, and Z. Hua, “Semi-supervised learning based
on nearest neighbor rule and cut edges,” Knowledge-Based Systems,
vol. 23, no. 6, pp. 547–554, 2010.
[41] S. Sun and Q. Zhang, “Multiple-view multiple-learner semi-supervised
learning,” Neural Processing Letters, vol. 34, no. 3, pp. 229–240, 2011.
[42] A. Halder, S. Ghosh, and A. Ghosh, “Aggregation pheromone metaphor
for semi-supervised classification,” Pattern Recognition, vol. 46, no. 8,
pp. 2239–2248, 2013.
[43] M.-L. Zhang and Z.-H. Zhou, “CoTrade: Confident co-training with data
editing,” IEEE Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics, vol. 41, no. 6, pp. 1612–1626, 2011.
[44] I. Triguero, J. A. Sáez, J. Luengo, S. Garcı́a, and F. Herrera, “On
the characterization of noise filters for self-training semi-supervised in
nearest neighbor classification,” Neurocomputing, 2013, in press, doi:
10.1016/j.neucom.2013.05.055.
[45] I. T. Jolliffe, Principal Component Analysis. Berlin; New York:
Springer-Verlag, 1986.
[46] C. Deng and M. Guo, “A new co-training-style random forest for
computer aided diagnosis,” Journal of Intelligent Information Systems,
vol. 36, pp. 253–281, 2011.
[47] Y. Sun, A. K. C. Wong, and M. S. Kamel, “Classification of imbalanced
data: A review,” International Journal of Pattern Recognition and
Artificial Intelligence, vol. 23, no. 4, pp. 687–719, 2009.
[48] H. He and E. Garcia, “Learning from imbalanced data,” IEEE Transactions
on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284,
2009.
[49] S. Garcı́a, J. Derrac, I. Triguero, C. J. Carmona, and F. Herrera,
“Evolutionary-based selection of generalized instances for imbalanced
classification,” Knowledge-Based Systems, vol. 25, no. 1, pp. 3–12, 2012.
[50] H. Zhang and M. Li, “RWO-Sampling: A random walk over-sampling
approach to imbalanced data classification,” Information Fusion, 2014,
in press, doi: 10.1016/j.inffus.2013.12.003.
[51] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, “An
insight into classification with imbalanced data: Empirical results
and current trends on using data intrinsic characteristics,” Information
Sciences, vol. 250, pp. 113–141, 2013.
[52] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A study of the
behaviour of several methods for balancing machine learning training
data,” SIGKDD Explorations, vol. 6, no. 1, pp. 20–29, 2004.
[53] I. Triguero, S. Garcı́a, and F. Herrera, “IPADE: Iterative prototype
adjustment for nearest neighbor classification,” IEEE Transactions on
Neural Networks, vol. 21, no. 12, pp. 1984–1990, 2010.
[54] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing.
Springer-Verlag, Berlin, 2003.
[55] S. Das and P. Suganthan, “Differential evolution: A survey of the state-of-the-art,” IEEE Transactions on Evolutionary Computation, vol. 15,
no. 1, pp. 4–31, 2011.
[56] “BBC datasets,” 2014. [Online]. Available: http://mlg.ucd.ie/datasets/
bbc.html
[57] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,”
IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27,
1967.
[58] J. R. Quinlan, C4.5: programs for machine learning. San Francisco,
CA, USA: Morgan Kaufmann Publishers, 1993.
[59] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning
algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[60] F. Wilcoxon, “Individual Comparisons by Ranking Methods,” Biometrics
Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[61] Z. Jiang, S. Zhang, and J. Zeng, “A hybrid generative/discriminative
method for semi-supervised classification,” Knowledge-Based Systems,
vol. 37, pp. 137–145, 2013.
[62] W. Li, L. Duan, I. Tsang, and D. Xu, “Co-labeling: A new multi-view
learning approach for ambiguous problems,” in Proceedings - IEEE
International Conference on Data Mining, ICDM, 2012, pp. 419–428.
Francisco Herrera received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both
from the University of Granada, Spain.
He is currently a Professor in the Department of
Computer Science and Artificial Intelligence at the
University of Granada. He has published more than
240 papers in international journals. He is coauthor
of the book “Genetic Fuzzy Systems: Evolutionary
Tuning and Learning of Fuzzy Knowledge Bases”
(World Scientific, 2001).
He currently acts as Editor in Chief of the international journals “Information Fusion” (Elsevier) and “Progress in Artificial
Intelligence” (Springer). He acts as area editor of the International Journal
of Computational Intelligence Systems and associate editor of the journals:
IEEE Transactions on Fuzzy Systems, Information Sciences, Knowledge and
Information Systems, Advances in Fuzzy Systems, and International Journal
of Applied Metaheuristic Computing; and he serves as a member of several
journal editorial boards, among others: Fuzzy Sets and Systems, Applied
Intelligence, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, and Swarm and
Evolutionary Computation.
He received the following honors and awards: ECCAI Fellow 2009, 2010
Spanish National Award on Computer Science ARITMEL to the “Spanish
Engineer on Computer Science”, International Cajastur “Mamdani” Prize for
Soft Computing (Fourth Edition, 2010), the IEEE Transactions on Fuzzy Systems
Outstanding 2008 Paper Award (bestowed in 2011), and the 2011 Lotfi A. Zadeh
Prize Best Paper Award of the International Fuzzy Systems Association.
His current research interests include computing with words and decision
making, bibliometrics, data mining, biometrics, data preparation, instance selection, fuzzy rule based systems, genetic fuzzy systems, knowledge extraction
based on evolutionary algorithms, memetic algorithms and genetic algorithms.
Bibliography
[AC10]
Adankon M. and Cheriet M. (2010) Genetic algorithm-based training for semi-supervised SVM. Neural Computing and Applications 19: 1197–1206.
[AFFL+ 11] Alcalá-Fdez J., Fernandez A., Luengo J., Derrac J., Garcı́a S., Sánchez L., and Herrera
F. (2011) KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and
Soft Computing 17(2-3): 255–277.
[AFSG+ 09] Alcalá-Fdez J., Sánchez L., Garcı́a S., del Jesus M. J., Ventura S., Garrell J. M.,
Otero J., Romero C., Bacardit J., Rivas V. M., Fernández J. C., and Herrera F. (2009)
KEEL: a software tool to assess evolutionary algorithms for data mining problems.
Soft Computing 13(3): 307–318.
[AIS93]
Agrawal R., Imieliński T., and Swami A. (1993) Mining association rules between sets
of items in large databases. SIGMOD Rec. 22(2): 207–216.
[AKA91]
Aha D. W., Kibler D., and Albert M. K. (1991) Instance-based learning algorithms.
Machine Learning 6: 37–66.
[Alp10]
Alpaydin E. (2010) Introduction to Machine Learning. The MIT Press, 2nd edition.
[Ang07]
Angiulli F. (2007) Fast nearest neighbor condensation for large data sets classification.
IEEE Transactions on Knowledge and Data Engineering 19(11): 1450–1464.
[BBBE11]
Bizer C., Boncz P. A., Brodie M. L., and Erling O. (2011) The meaningful use of big
data: four perspectives - four challenges. SIGMOD Record 40(4): 56–60.
[BBC14]
BBC (2014) BBC datasets. http://mlg.ucd.ie/datasets/bbc.html.
[BC01]
Blum A. and Chawla S. (2001) Learning from labeled and unlabeled data using graph
mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 19–26.
[BM98]
Blum A. and Mitchell T. (1998) Combining labeled and unlabeled data with Co-Training. In Proceedings of the Annual ACM Conference on Computational Learning
Theory, pp. 92–100.
[BM02]
Brighton H. and Mellish C. (2002) Advances in instance selection for instance-based
learning algorithms. Data Mining and Knowledge Discovery 6(2): 153–172.
[BNS06]
Belkin M., Niyogi P., and Sindhwani V. (2006) Manifold regularization: A geometric
framework for learning from labeled and unlabeled examples. Journal of Machine
Learning Research 7: 2399–2434.
[Bra07]
Bramer M. (2007) Principles of Data Mining (Undergraduate Topics in Computer
Science). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[Bre96]
Breiman L. (August 1996) Bagging predictors. Machine Learning 24: 123–140.
[CdO11]
Cabral G. G. and de Oliveira A. L. I. (2011) A novel one-class classification method
based on feature analysis and prototype reduction. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Anchorage, Alaska, USA,
October 9-12, 2011, pp. 983–988.
[CGG+ 09] Chen Y., Garcia E., Gupta M., Rahimi A., and Cazzanti L. (2009) Similarity-based
classification: Concepts and algorithms. Journal of Machine Learning Research 10:
747–776.
[CGI09]
Cervantes A., Galván I. M., and Isasi P. (2009) AMPSO: A new particle swarm method
for nearest neighborhood classification. IEEE Transactions on Systems, Man, and
Cybernetics–Part B: Cybernetics 39(5): 1082–1091.
[CH67]
Cover T. M. and Hart P. E. (1967) Nearest neighbor pattern classification. IEEE
Transactions on Information Theory 13(1): 21–27.
[CHL03]
Cano J. R., Herrera F., and Lozano M. (2003) Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study. IEEE Transactions
on Evolutionary Computation 7(6): 561–575.
[CHL05]
Cano J. R., Herrera F., and Lozano M. (2005) Stratification for scaling up evolutionary
prototype selection. Pattern Recognition Letters 26(7): 953–963.
[CLL13]
Caruana G., Li M., and Liu Y. (2013) An ontology enhanced parallel SVM for scalable
spam filter training. Neurocomputing 108: 45–57.
[CM98]
Cherkassky V. S. and Mulier F. (1998) Learning from Data: Concepts, Theory, and
Methods. John Wiley & Sons, Inc., New York, NY, USA, 1st edition.
[CSK08]
Chapelle O., Sindhwani V., and Keerthi S. S. (2008) Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research 9: 203–
233.
[CSZ06]
Chapelle O., Schölkopf B., and Zien A. (2006) Semi-Supervised Learning. The MIT
Press, 1st edition.
[CW11]
Chen K. and Wang S. (2011) Semi-supervised learning via regularized boosting working
on multiple semi-supervised assumptions. IEEE Transactions on Pattern Analysis and
Machine Intelligence 33(1): 129–143.
[DACK09] Das S., Abraham A., Chakraborty U. K., and Konar A. (2009) Differential evolution
using a neighborhood-based mutation operator. IEEE Transactions on Evolutionary
Computation 13(3): 526–553.
[DG08]
Dean J. and Ghemawat S. (January 2008) MapReduce: simplified data processing on
large clusters. Communications of the ACM 51(1): 107–113.
[DG10]
Dean J. and Ghemawat S. (2010) MapReduce: A flexible data processing tool. Communications of the ACM 53(1): 72–77.
[DGH10a]
Derrac J., Garcı́a S., and Herrera F. (2010) A survey on evolutionary instance selection
and generation. International Journal of Applied Metaheuristic Computing 1(1): 60–92.
[DGH10b]
Derrac J., Garcı́a S., and Herrera F. (2010) Stratified prototype selection based on
a steady-state memetic algorithm: a study of scalability. Memetic Computing 2(3):
183–199.
[DGH14]
Derrac J., Garcı́a S., and Herrera F. (2014) Fuzzy nearest neighbor algorithms: Taxonomy, experimental analysis and prospects. Information Sciences 260: 98–119.
[DHS00]
Duda R., Hart P., and Stork D. (2000) Pattern Classification. Wiley-Interscience, John
Wiley & Sons, Southern Gate, Chichester, West Sussex, England, 2nd edition.
[DLM01]
Dasgupta S., Littman M. L., and McAllester D. A. (2001) PAC generalization bounds
for co-training. In Advances in Neural Information Processing Systems 14, Neural Information Processing Systems: Natural and Synthetic, pp. 375–382.
[DLZ10]
Du J., Ling C. X., and Zhou Z. H. (2010) When does co-training work in real data?
IEEE Transactions on Knowledge and Data Engineering 23(5): 788–799.
[DS11]
Das S. and Suganthan P. (2011) Differential evolution: A survey of the state-of-the-art.
IEEE Transactions on Evolutionary Computation 15(1): 4–31.
[DTGH12] Derrac J., Triguero I., Garcia S., and Herrera F. (2012) Integrating instance selection,
instance weighting, and feature weighting for nearest neighbor classifiers by coevolutionary algorithms. IEEE Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics 42(5): 1383–1397.
[ES08]
Eiben A. E. and Smith J. E. (2008) Introduction to Evolutionary Computing. Natural
Computing. Springer-Verlag, 2nd edition.
[EVH10]
Espejo P., Ventura S., and Herrera F. (2010) A survey on the application of genetic
programming to classification. IEEE Transactions on Systems, Man and Cybernetics
Part C: Applications and Reviews 40(2): 121–144.
[FH51]
Fix E. and Hodges J. (February 1951) Discriminatory analysis, nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation
Medicine.
[FHA07]
Fayed H. A., Hashem S. R., and Atiya A. F. (2007) Self-generating prototypes for
pattern classification. Pattern Recognition 40(5): 1498–1509.
[FI04]
Fernández F. and Isasi P. (2004) Evolutionary design of nearest prototype classifiers.
Journal of Heuristics 10(4): 431–454.
[FI08]
Fernández F. and Isasi P. (2008) Local feature weighting in nearest prototype classification. IEEE Transactions on Neural Networks 19(1): 40–53.
[FPSS96]
Fayyad U. M., Piatetsky-Shapiro G., and Smyth P. (1996) From data mining to knowledge discovery in databases. AI Magazine 17(3): 37–54.
[Fre02]
Freitas A. A. (2002) Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer-Verlag.
[Fri97]
Friedman J. H. (1997) Data mining and statistics: What’s the connection? In Proceedings of the 29th Symposium on the Interface Between Computer Science and Statistics.
[FUS08]
Fujino A., Ueda N., and Saito K. (2008) Semisupervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(3): 424–437.
[Gar08]
Garain U. (2008) Prototype reduction using an artificial immune model. Pattern Analysis & Applications 11(3-4): 353–363.
[GB08]
Gancarski P. and Blansche A. (October 2008) Darwinian, Lamarckian, and Baldwinian
(co)evolutionary approaches for feature weighting in k-means-based algorithms. IEEE
Transactions on Evolutionary Computation 12(5): 617–629.
[GBLG99] Gamberger D., Lavrac N., and Groselj C. (1999) Experiments with noise
filtering in a medical domain. In Proceedings of the Sixteenth International Conference
on Machine Learning, pp. 143–151.
[GCB97]
Grother P. J., Candela G. T., and Blue J. L. (1997) Fast implementations of nearest
neighbor classifiers. Pattern Recognition 30(3): 459–465.
[GCH08]
Garcı́a S., Cano J. R., and Herrera F. (2008) A memetic algorithm for evolutionary
prototype selection: A scaling up approach. Pattern Recognition 41(8): 2693–2709.
[GDCH12] Garcı́a S., Derrac J., Cano J. R., and Herrera F. (2012) Prototype selection for nearest
neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern
Analysis and Machine Intelligence 34(3): 417–435.
[GE03]
Guyon I. and Elisseeff A. (2003) An introduction to variable and feature selection.
Journal of Machine Learning Research 3: 1157–1182.
[GGL03]
Ghemawat S., Gobioff H., and Leung S.-T. (2003) The google file system. In Proceedings
of the nineteenth ACM symposium on Operating systems principles, SOSP ’03, pp. 29–
43.
[GGNZ06] Guyon I., Gunn S., Nikravesh M., and Zadeh L. A. (Eds.) (2006) Feature Extraction:
Foundations and Applications. Springer.
[Gho06]
Ghosh A. K. (2006) On optimum choice of k in nearest neighbor classification. Computational Statistics and Data Analysis 50(11): 3113–3123.
[GJ05]
Ghosh A. and Jain L. C. (Eds.) (2005) Evolutionary Computation in Data Mining.
Springer-Verlag.
[GZ00]
Goldman S. and Zhou Y. (2000) Enhancing supervised learning with unlabeled data.
In Proceedings of the 17th International Conference on Machine Learning, pp. 327–334.
Morgan Kaufmann.
[Har68]
Hart P. E. (1968) The condensed nearest neighbor rule. IEEE Transactions on Information Theory 14(3): 515–516.
[Har75]
Hartigan J. A. (1975) Clustering Algorithms. John Wiley & Sons, Inc., New York, NY,
USA.
[HDW+ 11] He Q., Du C., Wang Q., Zhuang F., and Shi Z. (2011) A parallel incremental extreme
SVM classifier. Neurocomputing 74(16): 2532–2540.
[HFL+ 08]
He B., Fang W., Luo Q., Govindaraju N. K., and Wang T. (2008) Mars: A MapReduce
framework on graphics processors. In Proceedings of the 17th International Conference
on Parallel Architectures and Compilation Techniques, PACT ’08, pp. 260–269. ACM,
New York, NY, USA.
[HGG13]
Halder A., Ghosh S., and Ghosh A. (2013) Aggregation pheromone metaphor for semisupervised classification. Pattern Recognition 46(8): 2239–2248.
[HKP11]
Han J., Kamber M., and Pei J. (2011) Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition.
[HS10]
Hady M. and Schwenker F. (2010) Combining committee-based semi-supervised learning and active learning. Journal of Computer Science and Technology 25: 681–698.
[HSP10]
Hady M., Schwenker F., and Palm G. (2010) Semi-supervised learning for tree-structured ensembles of RBF networks with co-training. Neural Networks 23: 497–509.
[HTF09]
Hastie T., Tibshirani R., and Friedman J. (2009) The Elements of Statistical Learning:
Data Mining, Inference and Prediction. Springer, New York Berlin Heidelberg, 2nd
edition.
[HYGL10] Huang T., Yu Y., Guo G., and Li K. (2010) A classification algorithm based on local
cluster centers with a few labeled training examples. Knowledge-Based Systems 23(6):
563–571.
[JL01]
John G. H. and Langley P. (2001) Estimating continuous distributions in Bayesian
classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann, San Mateo.
[Joa99]
Joachims T. (1999) Transductive inference for text classification using support vector
machines. In Proc. 16th Internation Conference on Machine Learning, pp. 200–209.
Morgan Kaufmann.
[Joa03]
Joachims T. (2003) Transductive learning via spectral graph partitioning. In Proceedings, Twentieth International Conference on Machine Learning, volume 1, pp.
290–297.
[Jol02]
Jolliffe I. T. (2002) Principal Component Analysis. Springer, 2nd edition.
[JP08]
Joshi A. and Papanikolopoulos N. (November 2008) Learning to detect moving shadows in
dynamic environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11): 2055–2063.
[KB98]
Kuncheva L. I. and Bezdek J. C. (1998) Nearest prototype classification: Clustering,
genetic algorithms, or random search? IEEE Transactions on Systems, Man, and
Cybernetics 28(1): 160–164.
[KJ97]
Kohavi R. and John G. H. (1997) Wrappers for feature subset selection. Artificial
Intelligence 97(1-2): 273–324.
[KM99]
Krishna K. and Murty M. (1999) Genetic k-means algorithm. IEEE Transactions on
Systems, Man, and Cybernetics, Part B: Cybernetics 29(3): 433–439.
[KO03a]
Kim S. W. and Oommen B. J. (2003) A brief taxonomy and ranking of creative prototype
reduction schemes. Pattern Analysis and Applications 6: 232–244.
[KO03b]
Kim S.-W. and Oommen B. J. (2003) Enhancing prototype reduction schemes with
LVQ3-type algorithms. Pattern Recognition 36(5): 1083–1093.
[Koh90]
Kohonen T. (1990) The self-organizing map. Proceedings of the IEEE 78(9): 1464–
1480.
[Kon94]
Kononenko I. (1994) Estimating attributes: Analysis and extensions of RELIEF. In
Proceedings of the 1994 European Conference on Machine Learning, Catania, Italy,
pp. 171–182. Springer Verlag.
[KR92]
Kira K. and Rendell L. A. (1992) A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, Scotland,
pp. 249–256. Morgan Kaufmann.
[KR07]
Khoshgoftaar T. M. and Rebours P. (2007) Improving software quality prediction by
noise filtering techniques. Journal of Computer Science and Technology 22: 387–396.
[Kun95]
Kuncheva L. I. (1995) Editing for the k–nearest neighbors rule by a genetic algorithm.
Pattern Recognition Letters 16: 809–814.
[Lan01]
Laney D. (February 2001) 3D data management: Controlling data volume, velocity,
and variety. Technical report, META Group.
[LdRBH14] López V., del Rı́o S., Benı́tez J. M., and Herrera F. (2014) Cost-sensitive linguistic
fuzzy rule based classification systems under the mapreduce framework for imbalanced
big data. Fuzzy Sets and Systems in press, doi: 10.1016/j.fss.2014.01.015.
[LKL02]
Lam W., Keung C. K., and Liu D. (2002) Discovering useful concept prototypes for
classification based on filtering and abstraction. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(8): 1075–1090.
[LM07]
Liu H. and Motoda H. (Eds.) (2007) Computational Methods of Feature Selection.
Chapman & Hall/Crc Data Mining and Knowledge Discovery Series. Chapman &
Hall/Crc.
[LMYW05] Li J., Manry M. T., Yu C., and Wilson D. R. (2005) Prototype classifier design with
pruning. International Journal on Artificial Intelligence Tools 14(1-2): 261–280.
[LYG+ 10]
Li G.-Z., You M., Ge L., Yang J. Y., and Yang M. Q. (2010) Feature selection for
semi-supervised multi-label learning with application to gene function analysis. In
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, BCB ’10, pp. 354–357. ACM, New York, NY, USA.
[LZ05]
Li M. and Zhou Z. H. (2005) SETRED: self-training with editing. In Lecture Notes
in Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), volume 3518 LNAI, pp. 611–621.
[LZ07]
Li M. and Zhou Z. H. (2007) Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Transactions on Systems, Man and
Cybernetics, Part A: Systems and Humans 37(6): 1088–1098.
[Mar13]
Marx V. (2013) The big challenges of big data. Nature 498(7453): 255–260.
[MCD13]
Minelli M., Chambers M., and Dhiraj A. (2013) Big Data, Big Analytics: Emerging
Business Intelligence and Analytic Trends for Today’s Businesses (Wiley CIO). Wiley
Publishing, 1st edition.
[MD01]
Mjolsness E. and DeCoste D. (2001) Machine learning for science: State of the art and
future prospects. Science 293: 2051–2055.
[MFV02]
Mollineda R., Ferri F., and Vidal E. (2002) A merge-based condensing strategy for
multiple prototype classifiers. IEEE Transactions on Systems, Man and Cybernetics
B 32(5): 662–668.
[Mit97]
Mitchell T. M. (1997) Machine Learning. McGraw-Hill.
[MPJ11]
Mac Parthalain N. and Jensen R. (2011) Fuzzy-rough set based semi-supervised learning. In IEEE International Conference on Fuzzy Systems (FUZZ), pp. 2465–2472.
[NL09]
Nanni L. and Lumini A. (2009) Particle swarm optimization for prototype reduction.
Neurocomputing 72(4-6): 1092–1097.
[NT09]
Neri F. and Tirronen V. (2009) Scale factor local search in differential evolution.
Memetic Computing 1(2): 153–171.
[OLM04]
Oh I.-S., Lee J.-S., and Moon B.-R. (November 2004) Hybrid genetic algorithms for
feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence
26(11): 1424–1437.
[PBA+ 08]
Plummer D., Bittman T., Austin T., Cearley D., and Smith D. (2008) Cloud computing:
Defining and describing an emerging phenomenon. Technical report, Gartner.
[Ped85]
Pedrycz W. (1985) Algorithms of fuzzy clustering with partial supervision. Pattern
Recognition Letters 3: 13–20.
[PF09]
Pappa G. L. and Freitas A. A. (2009) Automating the Design of Data Mining Algorithms: An Evolutionary Computation Approach. Natural computing. Springer.
[PGDB13] Poria S., Gelbukh A., Das D., and Bandyopadhyay S. (2013) Fuzzy clustering for semisupervised learning - case study: Construction of an emotion lexicon. In Batyrshin I.
and González Mendoza M. (Eds.) Advances in Artificial Intelligence, volume 7629 of
Lecture Notes in Computer Science, pp. 73–86. Springer Berlin Heidelberg.
[Pla99]
Platt J. C. (1999) Fast training of support vector machines using sequential minimal
optimization. MIT Press.
[PR12]
Palit I. and Reddy C. (2012) Scalable and parallel boosting with MapReduce. IEEE
Transactions on Knowledge and Data Engineering 24(10): 1904–1916.
[Pro13a]
Apache Hadoop Project (2013) Apache Hadoop. http://hadoop.apache.org/.
[Pro13b]
Apache Mahout Project (2013) Apache Mahout. http://mahout.apache.org/.
[PSL05]
Price K. V., Storn R. M., and Lampinen J. A. (2005) Differential Evolution: A Practical
Approach to Global Optimization. Natural Computing Series. Springer.
[PV06]
Paredes R. and Vidal E. (2006) Learning weighted metrics to minimize nearestneighbor classification error. IEEE Transactions on Pattern Analysis and Machine
Intelligence 28(7): 1100–1110.
[Pyl99]
Pyle D. (1999) Data Preparation for Data Mining. The Morgan Kaufmann Series in
Data Management Systems. Morgan Kaufmann.
[QHS09]
Qin A. K., Huang V. L., and Suganthan P. N. (2009) Differential evolution algorithm
with strategy adaptation for global numerical optimization. IEEE Transactions on
Evolutionary Computation 13(2): 398–417.
[Qui93]
Quinlan J. R. (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers, San Francisco, CA, USA.
[Sán04]
Sánchez J. S. (2004) High training set size reduction by space partitioning and prototype abstraction. Pattern Recognition 37(7): 1561–1564.
[SBM+ 03]
Sánchez J. S., Barandela R., Marqués A. I., Alejo R., and Badenas J. (2003) Analysis
of new techniques to obtain quality training sets. Pattern Recognition Letters 24(7):
1015–1022.
[SFJ12]
Srinivasan A., Faruquie T., and Joshi S. (2012) Data and task parallelism in ILP using
MapReduce. Machine Learning 86(1): 141–168.
[SIL07]
Saeys Y., Inza I., and Larrañaga P. (2007) A review of feature selection techniques in
bioinformatics. Bioinformatics 23(19): 2507–2517.
[SO98]
Snir M. and Otto S. (1998) MPI-The Complete Reference: The MPI Core. MIT Press.
[SP97]
Storn R. and Price K. V. (1997) Differential evolution - A simple and efficient heuristic
for global optimization over continuous spaces. Journal of Global Optimization 11(4):
341–359.
[SZ11]
Sun S. and Zhang Q. (2011) Multiple-view multiple-learner semi-supervised learning.
Neural Processing Letters 34(3): 229–240.
[TDGH12] Triguero I., Derrac J., Garcı́a S., and Herrera F. (2012) A taxonomy and experimental
study on prototype generation for nearest neighbor classification. IEEE Transactions
on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42(1): 86–100.
[TH10]
Tang X.-L. and Han M. (2010) Semi-supervised Bayesian ARTMAP. Applied Intelligence
33(3): 302–317.
[TK07]
Tsoumakas G. and Katakis I. (2007) Multi-label classification: An overview. International Journal of Data Warehousing and Mining 2007: 1–13.
[TSA+ 10]
Thusoo A., Shao Z., Anthony S., Borthakur D., Jain N., Sen Sarma J., Murthy R.,
and Liu H. (2010) Data warehousing and analytics infrastructure at facebook. In
Proceedings of the 2010 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’10, pp. 1013–1020. ACM, New York, NY, USA.
[TSK05]
Tan P., Steinbach M., and Kumar V. (2005) Introduction to Data Mining. Addison-Wesley Longman Publishing Co., Inc., 1st edition.
[TYK11]
Talbot J., Yoo R. M., and Kozyrakis C. (2011) Phoenix++: Modular MapReduce
for shared-memory systems. In Proceedings of the Second International Workshop on
MapReduce and Its Applications, pp. 9–16. ACM, New York, NY, USA.
[Vap98]
Vapnik V. N. (1998) Statistical Learning Theory. Wiley-Interscience.
[WAM97]
Wettschereck D., Aha D. W., and Mohri T. (1997) A review and empirical evaluation of
feature weighting methods for a class of lazy learning algorithms. Artificial Intelligence
Review 11: 273–314.
[WFH11]
Witten I. H., Frank E., and Hall M. A. (2011) Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems.
Morgan Kaufmann.
[Whi12]
White T. (2012) Hadoop: The Definitive Guide. O’Reilly Media, Inc., 3rd edition.
[Wil72]
Wilson D. L. (1972) Asymptotic properties of nearest neighbor rules using edited data.
IEEE Transactions on Systems, Man, and Cybernetics 2(3): 408–421.
[WJC13]
Wang J., Jebara T., and Chang S.-F. (2013) Semi-supervised learning using greedy
max-cut. Journal of Machine Learning Research 14(1): 771–800.
[WK09]
Wu X. and Kumar V. (Eds.) (2009) The Top Ten Algorithms in Data Mining. Data
Mining and Knowledge Discovery. Chapman & Hall/CRC.
[WM00]
Wilson D. R. and Martinez T. R. (2000) Reduction techniques for instance-based
learning algorithms. Machine Learning 38(3): 257–286.
[XWT11]
Xie B., Wang M., and Tao D. (2011) Toward the optimization of normalized graph
Laplacian. IEEE Transactions on Neural Networks 22(4): 660–666.
[Yar95]
Yarowsky D. (1995) Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational
Linguistics, pp. 189–196.
[YC98]
Yang M. and Cheng C. (1998) On the edited fuzzy k-nearest neighbor rule. IEEE
Transactions on Systems, Man, and Cybernetics Part B: Cybernetics 28(3): 461–466.
[YC10]
Yaslan Y. and Cataltepe Z. (2010) Co-training with relevant random subspaces. Neurocomputing 73(10-12): 1652–1661.
[YW06]
Yang Q. and Wu X. (2006) 10 challenging problems in data mining research. International Journal of Information Technology and Decision Making 5(4): 597–604.
[Zad65]
Zadeh L. (1965) Fuzzy sets. Information and Control 8(3): 338–353.
[ZG04]
Zhou Y. and Goldman S. (2004) Democratic co-learning. In IEEE International Conference on Tools with Artificial Intelligence, pp. 594–602.
[ZG09]
Zhu X. and Goldberg A. B. (2009) Introduction to Semi-Supervised Learning. Morgan
and Claypool, 1st edition.
[Zhu05]
Zhu X. (2005) Semi-supervised learning literature survey. Technical Report 1530,
Computer Sciences, University of Wisconsin-Madison.
[ZL05]
Zhou Z. H. and Li M. (2005) Tri-training: Exploiting unlabeled data using three
classifiers. IEEE Transactions on Knowledge and Data Engineering 17: 1529–1541.
[ZL10]
Zhou Z.-H. and Li M. (2010) Semi-supervised learning by disagreement. Knowledge
and Information Systems 24(3): 415–439.
[ZLZ11]
Zhai J.-H., Li N., and Zhai M.-Y. (2011) The condensed fuzzy k-nearest neighbor rule
based on sample fuzzy entropy. In Proceedings of the 2011 International Conference
on Machine Learning and Cybernetics (ICMLC’11), Guilin, China, July 10-13, pp.
282–286.
[ZMH09]
Zhao W., Ma H., and He Q. (2009) Parallel k-means clustering based on MapReduce.
In Jaatun M., Zhao G., and Rong C. (Eds.) Cloud Computing, volume 5931 of Lecture
Notes in Computer Science, pp. 674–679. Springer Berlin Heidelberg.
[ZS09]
Zhang J. and Sanderson A. C. (2009) JADE: Adaptive differential evolution with
optional external archive. IEEE Transactions on Evolutionary Computation 13(5):
945–958.
[ZYY+ 14]
Zhu F., Ye N., Yu W., Xu S., and Li G. (2014) Boundary detection and sample
reduction for one-class support vector machines. Neurocomputing 123: 166–173.
[ZZY03]
Zhang S., Zhang C., and Yang Q. (2003) Data preparation for data mining. Applied
Artificial Intelligence 17(5-6): 375–381.