Open Challenges for Data Stream Mining Research

Georg Krempl, University Magdeburg, Germany, georg.krempl@iti.cs.uni-magdeburg.de
Indre Žliobaite, Aalto University and HIIT, Finland, indre.zliobaite@aalto.fi
Dariusz Brzeziński, Poznan U. of Technology, Poland, dariusz.brzezinski@cs.put.poznan.pl
Eyke Hüllermeier, University of Paderborn, Germany, eyke@upb.de
Mark Last, Ben-Gurion U. of the Negev, Israel, mlast@bgu.ac.il
Vincent Lemaire, Orange Labs, France, vincent.lemaire@orange.com
Tino Noack, TU Cottbus, Germany, noacktin@tu-cottbus.de
Ammar Shaker, University of Paderborn, Germany, ammar.shaker@upb.de
Sonja Sievi, Astrium Space Transportation, Germany, sonja.sievi@astrium.eads.net
Myra Spiliopoulou, University Magdeburg, Germany, myra@iti.cs.uni-magdeburg.de
Jerzy Stefanowski, Poznan U. of Technology, Poland, jerzy.stefanowski@cs.put.poznan.pl
ABSTRACT
Every day, huge volumes of sensory, transactional, and web data
are continuously generated as streams, which need to be analyzed
online as they arrive. Streaming data can be considered as one
of the main sources of what is called big data. While predictive
modeling for data streams and big data has received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, overlooking important challenges imposed by real-world applications.
This article presents a discussion on eight open challenges for data
stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and
define new application-relevant research directions for data stream
mining. The identified challenges cover the full cycle of knowledge
discovery and involve such problems as: protecting data privacy,
dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream mining algorithms. The resulting analysis is illustrated by practical
applications and provides general suggestions concerning lines of
future research in data stream mining.
1. INTRODUCTION
The volumes of automatically generated data are constantly increasing. According to the Digital Universe Study [18], over 2.8ZB
of data were created and processed in 2012, with a projected increase of 15 times by 2020. This growth in the production of digital data results from our surrounding environment being equipped
with more and more sensors. People carrying smart phones produce
data, database transactions are being counted and stored, streams of
data are extracted from virtual environments in the form of logs or
user generated content. A significant part of such data is volatile,
which means it needs to be analyzed in real time as it arrives. Data
stream mining is a research field that studies methods and algorithms for extracting knowledge from volatile streaming data [14;
5; 1]. Although data streams, online learning, big data, and adaptation to concept drift have become important research topics during
the last decade, truly autonomous, self-maintaining, adaptive data
mining systems are rarely reported. This paper identifies real-world
challenges for data stream research that are important but as yet unsolved. Our objective is to present to the community a position
paper that could inspire and guide future research in data streams.
This article builds upon discussions at the International Workshop
on Real-World Challenges for Data Stream Mining (RealStream)1
in September 2013, in Prague, Czech Republic.
Several related position papers are available. Dietterich [10] presents
a discussion focused on predictive modeling techniques that are
applicable to streaming and non-streaming data. Fan and Bifet [12]
concentrate on challenges presented by large volumes of data. Žliobaite et al. [48] focus on concept drift and adaptation of systems
during online operation. Gaber et al. [13] discuss ubiquitous data
mining with attention to collaborative data stream mining. In this
paper, we focus on research challenges for streaming data inspired
and required by real-world applications. In contrast to existing position papers, we raise issues connected not only with large volumes of data and concept drift, but also such practical problems
as privacy constraints, availability of information, and dealing with
legacy systems.
The scope of this paper is not restricted to algorithmic challenges;
it aims at covering the full cycle of knowledge discovery from data
(CRISP [40]), from understanding the context of the task, to data
preparation, modeling, evaluation, and deployment. We discuss
eight challenges: making models simpler, protecting privacy and
confidentiality, dealing with legacy systems, stream preprocessing,
timing and availability of information, relational stream mining,
analyzing event data, and evaluation of stream mining algorithms.
Figure 1 illustrates the positioning of these challenges in the CRISP
cycle. Some of these apply to traditional (non-streaming) data mining as well, but they are critical in streaming environments. Along
with further discussion of these challenges, we present our position
on where the forthcoming focus of research and development efforts
should be directed to address them.
In the remainder of the article, section 2 gives a brief introduction to
data stream mining, sections 3–7 discuss each identified challenge,
and section 8 highlights action points for future research.
1 http://sites.google.com/site/realstream2013
SIGKDD Explorations, Volume 16, Issue 1
Figure 1: CRISP cycle with data stream research challenges.
2. DATA STREAM MINING
Mining big data streams faces three principal challenges: volume,
velocity, and volatility. Volume and velocity require a high volume
of data to be processed in limited time. Starting from the first arriving instance, the amount of available data constantly increases from
zero to potentially infinity. This requires incremental approaches
that incorporate information as it becomes available, and online
processing if not all data can be kept [15]. Volatility, on the other
hand, corresponds to a dynamic environment with ever-changing
patterns. Here, old data is of limited use, even if it could be saved
and processed again later. This is due to change, which can affect the
induced data mining models in multiple ways: change of the target
variable, change in the available feature information, and drift.
Changes of the target variable occur for example in credit scoring, when the definition of the classification target “default” versus
“non-default” changes due to business or regulatory requirements.
Changes in the available feature information arise when new features become available, e.g. due to a new sensor or instrument.
Similarly, existing features might need to be excluded due to regulatory requirements, or a feature might change in its scale, if data
from a more precise instrument becomes available. Finally, drift is
a phenomenon that occurs when the distributions of features x and
target variables y change in time. The challenge posed by drift has
been subject to extensive research, thus we provide here solely a
brief categorization and refer to recent surveys like [17].
In supervised learning, drift can affect the posterior P(y|x), the
class-conditional feature distribution P(x|y), the feature distribution P(x), and the class prior
distribution P(y). The distinction based on which distribution is
assumed to be affected, and which is assumed to be static, serves to
assess the suitability of an approach for a particular task. It is worth
noting that the problem of changing distributions is also present in
unsupervised learning from data streams.
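The relationship between these four distributions can be made explicit. The identity below is standard probability theory rather than a result of this paper, and the time index t is a notational assumption added here for illustration only.

```latex
P_t(x, y) \;=\; P_t(y \mid x)\, P_t(x) \;=\; P_t(x \mid y)\, P_t(y)
```

Drift between two time points t and t' means that P_t(x, y) differs from P_{t'}(x, y). Since a classifier that minimizes the expected error depends only on P(y|x), a change confined to P(x) leaves the optimal decision rule unchanged, whereas a change in P(y|x) requires adapting it.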
A further categorization of drift can be made by:
• smoothness of concept transition: Transitions between concepts can be sudden or gradual. The former is sometimes also
denoted in literature as shift or abrupt drift.
• systematic or unsystematic: In the former case, there are
patterns in the way the distributions change that can be exploited to predict change and perform faster model adaptation. Examples are subpopulations that can be identified and
show distinct, trackable evolutionary patterns. In the latter
case, no such patterns exist and drift occurs seemingly at random. An example for the latter is fickle concept drift.
• real or virtual: While the former requires model adaptation,
the latter corresponds to observing outliers or noise, which
should not be incorporated into a model.
• singular or recurring contexts: In the former case, a model
becomes obsolete once and for all when its context is replaced by a novel context. In the latter case, a model’s context might reoccur at a later moment in time, for example due
to a business cycle or seasonality; obsolete models
might therefore regain value.
Stream mining approaches in general address the challenges posed
by volume, velocity and volatility of data. However, in real-world
applications these three challenges often coincide with other, to
date insufficiently considered ones.
The next sections discuss eight identified challenges for data stream
mining, providing illustrations with real world application examples, and formulating suggestions for forthcoming research.
3. PROTECTING PRIVACY AND CONFIDENTIALITY
Data streams present new challenges and opportunities with respect
to protecting privacy and confidentiality in data mining. Privacy
preserving data mining has been studied for over a decade (see,
e.g., [3]). The main objective is to develop data mining techniques that do not uncover information or patterns which compromise confidentiality and privacy obligations. Modeling can be
done on original or anonymized data, but when the model is released, it should not contain information that may violate privacy
or confidentiality. This is typically achieved by controlled distortion of sensitive data by modifying the values or adding noise.
Ensuring privacy and confidentiality is important for gaining the trust
of users and society in autonomous stream data mining
systems. While in offline data mining a human analyst working
with the data can do a sanity check before releasing the model, in
data stream mining privacy preservation needs to be done online.
Several existing works relate to privacy preservation in publishing
streaming data (e.g. [46]), but no systematic research in relation to
broader data stream challenges exists.
We identify two main challenges for privacy preservation in mining
data streams. The first challenge is incompleteness of information.
Data arrives in portions and the model is updated online. Therefore, the model is never final and it is difficult to judge privacy
preservation before seeing all the data. For example, suppose GPS
traces of individuals are being collected for modeling the traffic situation, and person A currently travels from the campus to
the airport. The privacy of this person will be compromised if there
are no similar trips by other persons in the very near future. However, near-future trips are unknown at the current time, when the
model needs to be updated.
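One way to picture this trade-off is a buffer that holds a trip back until enough similar trips have been seen. The sketch below is only an illustration under assumptions not made in the text: similarity is reduced to an exact match on a hypothetical route key, the threshold k and the waiting horizon are arbitrary parameters, and the class name is invented for this example.

```python
from collections import defaultdict, deque


class DelayedTripRelease:
    """Release trips to the model only once k trips share a route key
    within `horizon` time units; older trips expire unreleased."""

    def __init__(self, k=3, horizon=60.0):
        self.k = k
        self.horizon = horizon
        self.pending = defaultdict(deque)  # route key -> deque of (timestamp, trip)

    def add(self, timestamp, route_key, trip):
        """Buffer a trip and return the trips that may now be released."""
        queue = self.pending[route_key]
        # Drop buffered trips that have waited longer than the horizon.
        while queue and timestamp - queue[0][0] > self.horizon:
            queue.popleft()
        queue.append((timestamp, trip))
        if len(queue) >= self.k:
            released = [t for _, t in queue]
            queue.clear()
            return released
        return []


buf = DelayedTripRelease(k=2, horizon=10.0)
first = buf.add(0.0, ("campus", "airport"), "trip-A")    # held back
second = buf.add(5.0, ("campus", "airport"), "trip-B")   # both released
```

The cost is visible in the usage lines: the first trip reaches the model only after a delay, and a trip with no companions within the horizon never reaches it at all, which is exactly the tension between timely model updates and privacy described above.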
On the other hand, data stream mining algorithms may have some
inherent privacy preservation properties due to the fact that they do
not need to see all the modeling data at once, and can be incrementally updated with portions of data. Investigating privacy preservation properties of existing data stream algorithms makes another
interesting direction for future research.
The second important challenge for privacy preservation is concept
drift. As data may evolve over time, fixed privacy preservation
rules may no longer hold. For example, suppose winter comes,
snow falls, and far fewer people commute by bike. By knowing
that a person comes to work by bike and having a set of GPS traces,
it may not be possible to identify this person uniquely in summer,
when there are many cyclists, but possible in winter. Hence, an important direction for future research is to develop adaptive privacy
preservation mechanisms that would diagnose such a situation and
adapt themselves to preserve privacy in the new circumstances.
4. STREAMED DATA MANAGEMENT
Most of the data stream research concentrates on developing predictive models that address a simplified scenario, in which data is
already preprocessed, and completely and immediately available for
free. However, successful business implementations depend strongly
on the alignment of the machine learning algorithms used with
both the business objectives and the available data. This section
discusses often omitted challenges connected with streaming data.
4.1 Streamed Preprocessing
Data preprocessing is an important step in all real-world data analysis applications, since data comes from complex environments and
may be noisy, redundant, and contain outliers and missing values. Many
standard procedures for preprocessing offline data are available and
well established, see e.g. [33]; however, the data stream setting introduces new challenges that have not yet received sufficient research
attention.
While in traditional offline analysis data preprocessing is a one-off
procedure, usually done by a human expert prior to modeling, in the
streaming scenario manual processing is not feasible, as new data
continuously arrives. Streaming data needs fully automated preprocessing methods that can optimize the parameters and operate
autonomously. Moreover, preprocessing models need to be able to
update themselves automatically along with evolving data, in a similar way as predictive models for streaming data do. Furthermore,
all updates of preprocessing procedures need to be synchronized
with the subsequent predictive models; otherwise, after an update in
preprocessing the data representation may change and, as a result,
the previously used predictive model may become useless.
Except for some studies, mainly focusing on feature construction
over data streams, e.g. [49; 4], no systematic methodology for data
stream preprocessing is currently available.
As an illustrative example for challenges related to data preprocessing, consider predicting traffic jams based on mobile sensing data.
People using navigation services on mobile devices can opt to send
anonymized data to the service provider. Service providers, such as
Google, Yandex or Nokia, provide estimations and predictions of
traffic jams based on this data. First, the data of each user is mapped
to the road network, the speed of each user on each road segment
of the trip is computed, data from multiple users is aggregated, and
finally the current speed of the traffic is estimated.
There are a lot of data preprocessing challenges associated with
this task. First, noisiness of GPS data might vary depending on
location and load of the telecommunication network. There may
be outliers, for instance, if somebody stopped in the middle of a
segment to wait for a passenger, or a car broke down. The number of
pedestrians using mobile navigation may vary and require adaptive
instance selection. Moreover, road networks may change over time,
leading to changes in average speeds, in the number of cars and
even car types (e.g. heavy trucks might be banned, new optimal
routes emerge). All these issues require automated preprocessing
actions before feeding the newest data to the predictive models.
The problem of preprocessing for data streams is difficult due to
the nature of the data (continuously arriving and evolving). An analyst cannot know for sure what kind of data to expect
in the future, and cannot deterministically enumerate possible actions. Therefore, not only models, but also the procedure itself
needs to be fully automated.
This research problem can be approached from several angles. One
way is to look at existing predictive models for data streams, and
try to integrate them with selected data preprocessing methods (e.g.
feature selection, outlier definition and removal).
Another way is to systematically characterize the existing offline
data preprocessing approaches, try to find a mapping between those
approaches and problem settings in data streams, and extend preprocessing approaches for data streams in such a way as traditional
predictive models have been extended for data stream settings.
In either case, developing individual methods and methodology for
preprocessing of data streams would bridge an important gap in the
practical applications of data stream mining.
4.2 Timing and Availability of Information
Most algorithms developed for evolving data streams make simplifying assumptions on the timing and availability of information. In
particular, they assume that information is complete, immediately
available, and received passively and for free. These assumptions
often do not hold in real-world applications, e.g., patient monitoring, robot vision, or marketing [43]. This section is dedicated to the
discussion of these assumptions and the challenges resulting from
their absence. For some of these challenges, corresponding situations in offline, static data mining have already been addressed
in the literature. We will briefly point out where a mapping of such
known solutions to the online, evolving stream setting is easily feasible, for example by applying windowing techniques. However,
we will focus on problems for which no such simple mapping exists and which are therefore open challenges in stream mining.
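As a small illustration of the windowing idea mentioned above, the sketch below applies a familiar offline technique, mean imputation of missing feature values, over a sliding window of the most recent observations. The class name, the window size, and the fallback value 0.0 used before a feature has ever been observed are assumptions made for this example, not part of any cited method.

```python
from collections import deque
import math


class WindowedMeanImputer:
    """Replace missing values (None or NaN) with the mean of the last
    `window` observed values of the same feature."""

    def __init__(self, n_features, window=100):
        self.buffers = [deque(maxlen=window) for _ in range(n_features)]

    @staticmethod
    def _missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    def transform_one(self, x):
        """Impute one instance, then update the windows with its observed values."""
        out = []
        for value, buf in zip(x, self.buffers):
            if self._missing(value):
                # Assumed fallback of 0.0 until the feature has been seen once.
                out.append(sum(buf) / len(buf) if buf else 0.0)
            else:
                out.append(float(value))
                buf.append(float(value))
        return out
```

Because old values fall out of the window, the imputed value follows a drifting feature mean. Choosing the window size automatically is one of the questions that such a simple mapping leaves open.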
4.2.1 Handling Incomplete Information
Completeness of information assumes that the true values of all
variables, that is of features and of the target, are revealed eventually to the mining algorithm.
The problem of missing values, which corresponds to incompleteness of features, has been discussed extensively for the offline,
static settings. A recent survey is given in [45]. However, only a few
works address data streams, and in particular evolving data streams.
Thus, several open challenges remain, some of which are pointed out in the
review [29]: How to address the problem that the frequency with
which missing values occur is unpredictable, yet strongly affects the
quality of imputations? How to (automatically) select the best imputation technique? How to proceed in the trade-off between speed
and statistical accuracy?
Another problem is that of missing values of the target variable. It
has been studied extensively in the static setting as semi-supervised
learning (SSL, see [11]). A requirement for applying SSL techniques to streams is the availability of at least some labeled data
from the most recent distribution. While first attempts at this problem have been made, e.g. the online manifold regularization approach in [19] and the ensemble-based approach suggested by
[11], improvements in speed and the provision of performance guarantees remain open challenges. A special case of incomplete information is “censored data” in Event History Analysis (EHA), which
is described in section 5.2. A related problem discussed below is
active learning (AL, see [38]).
4.2.2 Dealing with Skewed Distributions
Class imbalance, where the class prior probability of the minority class is small compared to that of the majority class, is a frequent problem in real-world applications like fraud detection or
credit scoring. This problem has been well studied in the offline
setting (see e.g. [22] for a recent book on that subject), and has
also been studied to some extent in the online, stream-based setting
(see [23] for a recent survey). However, among the few existing
stream-based approaches, most do not pay attention to drift of the
minority class, and as [23] pointed out, a more rigorous evaluation
of these algorithms on real-world data has yet to be done.
4.2.3 Handling Delayed Information
Latency means information becomes available with significant delay. For example, in the case of so-called verification latency, the
value of the preceding instance’s target variable is not available before the subsequent instance has to be predicted. On evolving data
streams, this is more than a mere problem of streaming data integration between feature and target streams, because, due to concept drift,
patterns show temporal locality [2]. It means that feedback on the
current prediction is not available to improve the subsequent predictions, but will only become available for much later
predictions. Thus, there is no recent sample of labeled data at all
that would correspond to the most recent unlabeled data, and semi-supervised learning approaches are not directly applicable.
A related problem in static, offline data mining is that addressed
by unsupervised transductive transfer learning (or unsupervised domain adaptation): given labeled data from a source domain, a predictive model is sought for a related target domain in which no
labeled data is available. In principle, ideas from transfer learning
could be used to address latency in evolving data streams, for example by employing them in a chunk-based approach, as suggested
in [43]. However, adapting them for use in evolving data streams
has not been tried yet and constitutes a non-trivial, open task, as
adaptation in streams must be fast and fully automated and thus
cannot rely on iterated careful tuning by human experts.
Furthermore, consecutive chunks constitute several domains, thus
the transitions between several subsequent chunks might provide
exploitable patterns of systematic drift. This idea has been introduced in [27], and a few so-called drift-mining algorithms that
identify and exploit such patterns have been proposed since then.
However, the existing approaches cover only a very limited set of
possible drift patterns and scenarios.
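To make the effect of verification latency in this section concrete, the following sketch runs a test-then-train loop in which the label of each instance is revealed only a fixed number of instances later. It is a toy setup under stated assumptions: the fixed delay, the function name `run_with_latency`, and the trivial `MajorityClass` learner are all invented for this illustration.

```python
from collections import Counter, deque


class MajorityClass:
    """Trivial incremental learner, used only to keep the sketch self-contained."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def run_with_latency(stream, model, delay):
    """Predict each instance on arrival; reveal its label `delay` instances later.

    Returns the fraction of correct predictions among the instances whose
    labels arrived before the stream ended.
    """
    waiting = deque()  # (prediction, x, y) tuples whose label is still hidden
    correct = evaluated = 0
    for x, y in stream:
        waiting.append((model.predict(x), x, y))
        if len(waiting) > delay:
            y_hat, x_old, y_old = waiting.popleft()
            correct += int(y_hat == y_old)
            evaluated += 1
            model.learn(x_old, y_old)  # the model only ever sees stale labels
    return correct / evaluated if evaluated else 0.0


# Five instances of class "a" followed by a sudden change to class "b".
stream = [(i, "a") for i in range(5)] + [(i, "b") for i in range(5)]
acc_no_delay = run_with_latency(stream, MajorityClass(), delay=0)
acc_delayed = run_with_latency(stream, MajorityClass(), delay=3)
```

On this toy stream the delayed run scores lower than the immediate one, because the learner spends longer without any labels and then predicts with labels from a concept that has already been replaced.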
4.2.4 Active Selection from Costly Information
The challenge of intelligently selecting among costly pieces of information is the subject of active learning research. Active stream-based selective sampling [38] describes a scenario in which instances arrive one-by-one. While the instances’ feature vectors are
provided for free, obtaining their true target values is costly, and the
definitive decision whether or not to request this target value must
be taken before proceeding to the next instance. This corresponds
to a data stream, but not necessarily to an evolving one. As a result,
only a small subset of stream-based selective sampling algorithms
is suited for non-stationary environments. To make things worse,
many contributions do not state explicitly whether they were designed for drift, nor do they provide experimental evaluations
on such evolving data streams, thus leaving the reader the arduous task of assessing their suitability for evolving streams. A first, recent attempt to provide an overview of the existing active learning
strategies for evolving data streams is given in [43]. The challenges
for active learning posed by evolving data streams are:
• uncertainty regarding convergence: in contrast to learning
in static contexts, due to drift there is no guarantee that with
additional labels the difference between model and reality
narrows down. This leaves the formulation of suitable stopping
criteria a challenging open issue.
• necessity of perpetual validation: even if there has been
convergence due to some temporary stability, the learned hypotheses can get invalidated at any time by subsequent drift.
This can affect any part of the feature space and is not necessarily detectable from unlabeled data. Thus, without perpetual validation the mining algorithm might lock itself to a
wrong hypothesis without ever noticing.
• temporal budget allocation: the necessity of perpetual validation raises the question of optimally allocating the labeling
budget over time.
• performance bounds: in the case of drifting posteriors, no
theoretical work exists that provides bounds for errors and
label requests. However, deriving such bounds will also require assuming some type of systematic drift.
The task of active feature acquisition, where one has to actively
select among costly features, constitutes another open challenge on
evolving data streams: in contrast to the static, offline setting, the
value of a feature is likely to change with its drifting distribution.
5. MINING ENTITIES AND EVENTS
Conventional stream mining algorithms learn over a single stream
of arriving entities. In subsection 5.1, we introduce the paradigm
of entity stream mining, where the entities constituting the stream
are linked to instances (structured pieces of information) from further streams. Model learning in this paradigm involves the incorporation of the streaming information into the stream of entities;
learning tasks include cluster evolution, migration of entities from
one state to another, and classifier adaptation as entities re-appear with
a different label than before.
Then, in subsection 5.2, we investigate the special case where entities are associated with the occurrence of events. Model learning
then implies identifying the moment of occurrence of an event on
an entity. This scenario might be seen as a special case of entity
stream mining, since an event can be seen as a degenerate instance
consisting of a single value (the event’s occurrence).
5.1 Entity Stream Mining
Let T be a stream of entities, e.g. customers of a company or patients of a hospital. We observe entities over time, e.g. on a company’s website or at a hospital admission vicinity: an entity appears
and re-appears at discrete time points, new entities show up. At a
time point t, an entity e ∈ T is linked with different pieces of information - the purchases and ratings performed by a customer, the
anamnesis, the medical tests and the diagnosis recorded for the patient. Each of these information pieces i_j(t) is a structured record
or an unstructured text from a stream T_j, linked to e via the foreign
key relation. Thus, the entities in T are in 1-to-1 or 1-to-n relation
with entities from further streams T_1, ..., T_m (stream of purchases,
stream of ratings, stream of complaints etc). The schema describing the streams T, T_1, ..., T_m can be perceived as a conventional
relational schema, except that it describes streams instead of static
sets.
In this relational setting, the entity stream mining task corresponds
to learning a model ζ_T over T, thereby incorporating information
from the adjoint streams T_1, ..., T_m that "feed" the entities in T.
Albeit the members of each stream are entities, we use the term
"entity" only for stream T – the target of learning, while we denote
the entities in the other streams as "instances". In the unsupervised
setting, entity stream clustering encompasses learning and adapting
clusters over T, taking into account the other streams that arrive at different speeds. In the supervised setting, entity stream classification
involves learning and adapting a classifier, notwithstanding the fact
that an entity’s label may change from one time point to the next,
as new instances referencing it arrive.
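The relational setting can be pictured with a minimal sketch in which instances from the adjoint streams are folded into one running summary per entity, so that a conventional stream learner could consume a flat feature vector for each entity. Everything here is an assumption for illustration: the class and method names are hypothetical, instances are reduced to single numeric values, and count and mean are arbitrary choices of summary.

```python
from collections import defaultdict


class EntityAggregator:
    """Keep one running summary per entity and per adjoint stream."""

    def __init__(self, stream_names):
        self.stream_names = list(stream_names)
        # entity id -> stream name -> [count, running mean]
        self.summaries = defaultdict(
            lambda: {s: [0, 0.0] for s in self.stream_names}
        )

    def update(self, entity_id, stream_name, value):
        """Fold one numeric instance from an adjoint stream into its entity."""
        stats = self.summaries[entity_id][stream_name]
        stats[0] += 1
        stats[1] += (value - stats[1]) / stats[0]  # incremental mean

    def features(self, entity_id):
        """Flat vector (count, mean per stream) for a conventional learner."""
        summary = self.summaries[entity_id]
        return [v for s in self.stream_names for v in summary[s]]


agg = EntityAggregator(["purchases", "ratings"])
agg.update("customer-1", "purchases", 10.0)
agg.update("customer-1", "purchases", 30.0)
agg.update("customer-1", "ratings", 4.0)
profile = agg.features("customer-1")
```

A summary of this kind uses constant memory per entity, but it discards the order in which instances arrived and never forgets, so it cannot by itself reflect how an entity evolves.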
entities, thus the challenges pertinent to stream mining also apply
here. One of these challenges, and one much discussed in the context of big data, is volatility. In relational stream mining, volatility
refers to the entity itself, not only to the stream of instances that
reference the entities. Finally, an entity is ultimately big data by
itself, since it is described by multiple streams. Hence, next to the
problem of dealing with new forms of learning and new aspects of
drift, the subject of efficient learning and adaption in the Big Data
context becomes paramount.
5.1.1 Challenges of Aggregation
5.2 Analyzing Event Data
The first challenge of entity stream mining task concerns information summarization: how to aggregate into each entity e at each
time point t the information available on it from the other streams?
What information should be stored for each entity? How to deal
with differences in the speeds of the individual streams? How to
learn over the streams efficiently? Answering these questions in a
seamless way would allow us to deploy conventional stream mining
methods for entity stream mining after aggregation.
The information referencing a relational entity cannot be held perpetually for learning, hence aggregation of the arriving streams is
necessary. Information aggregation over time-stamped data is traditionally practiced in document stream mining, where the objective is to derive and adapt content summaries on learned topics.
Content summarization on entities, which are referenced in the document stream, is studied by Kotov et al., who maintain for each
entity the number of times it is mentioned in the news [26].
In such studies, summarization is a task by itself. Aggregation of
information for subsequent learning is a bit more challenging, because summarization implies information loss - notably information about the evolution of an entity. Hassani and Seidl monitor
health parameters of patients, modeling the stream of recordings
on a patient as a sequence of events [21]: the learning task is then
to predict forthcoming values. Aggregation with selective forgetting of past information is proposed in [25; 42] in the classification
context: the former method [25] slides a window over the stream,
while the latter [42] forgets entities that have not appeared for a
while, and summarizes the information in frequent itemsets, which
are then used as new features for learning.
Events are an example for data that occurs often yet is rarely analyzed in the stream setting. In static environments, events are usually studied through event history analysis (EHA), a statistical method for modeling and analyzing the temporal distribution of events
related to specific objects in the course of their lifetime [9]. More
specifically, EHA is interested in the duration before the occurrence
of an event or, in the recurrent case (where the same event can occur repeatedly), the duration between two events. The notion of
an event is completely generic and may indicate, for example, the
failure of an electrical device. The method is perhaps even better
known as survival analysis, a term that originates from applications
in medicine, in which an event is the death of a patient and survival
time is the time period between the beginning of the study and the
occurrence of this event. EHA can also be considered as a special
case of entity stream mining described in section 5.1, because the
basic statistical entities in EHA are monitored objects (or subjects),
typically described in terms of feature vectors x ∈ Rn , together
with their survival time s. Then, the goal is to model the dependence of s on x. A corresponding model provides hints at possible
cause-effect relationships (e.g., what properties tend to increase a
patient’s survival time) and, moreover, can be used for predictive
purposes (e.g., what is the expected survival time of a patient).
Although one might be tempted to approach this modeling task as
a standard regression problem with input (regressor) x and output (response) s, it is important to notice that the survival time s
is normally not observed for all objects. Indeed, the problem of
censoring plays an important role in EHA and occurs in different
facets. In particular, it may happen that some of the objects survived till the end of the study at time tend (also called the cut-off
point). They are censored or, more specifically, right censored,
since tevent has not been observed for them; instead, it is only
known that tevent > tend . In snapshot monitoring [28], the data
stream may be sampled multiple times, resulting in a new cut-off
point for each snapshot. Unlike standard regression analysis, EHA
is specifically tailored for analyzing event data of that kind. It is
built upon the hazard function as a basic mathematical tool.
5.1.2 Challenges of Learning
Even if information aggregation over the streams T_1, . . . , T_m is
performed intelligently, entity stream mining still calls for more
than conventional stream mining methods. The reason is that entities of stream T re-appear in the stream and evolve. In particular,
in the unsupervised setting, an entity may be linked to conceptually different instances at each time point, e.g. reflecting a customer’s change in preferences. In the supervised setting, an entity
may change its label; for example, a customer’s affinity to risk may
change in response to market changes or to changes in family status. This corresponds to entity drift, i.e. a new type of drift beyond
the conventional concept drift pertaining to model ζ_T. Hence, how
should entity drift be traced, and how should the interplay between
entity drift and model drift be captured?
In the unsupervised setting, Oliveira and Gama learn and monitor
clusters as states of evolution [32], while [41] extend that work to
learn Markov chains that mark the entities’ evolution. As pointed
out in [32], these states are not necessarily predefined; they must
themselves be learned. In [43], we report on further solutions to
the entity evolution problem and to the problem of learning with
forgetting over multiple streams and over the entities referenced by
them.
Conventional concept drift also occurs when learning a model over
SIGKDD Explorations
5.2.1 Survival function and hazard rate
Suppose the time of occurrence of the next event (since the start or
the last event) for an object x is modeled as a real-valued random
variable T with probability density function f (· | x). The hazard
function or hazard rate h(· | x) models the propensity of the occurrence of an event, that is, the instantaneous (conditional) probability density of an event
occurring at time t, given that no event has occurred so far:
\[ h(t \mid x) = \frac{f(t \mid x)}{S(t \mid x)} = \frac{f(t \mid x)}{1 - F(t \mid x)} , \]
where S(· | x) is the survival function and F (· | x) the cumulative
distribution of f (· | x). Thus,
\[ F(t \mid x) = P(T \le t) = \int_0^t f(u \mid x) \, du \]
is the probability of an event to occur before time t. Correspondingly, S(t | x) = 1 − F (t | x) is the probability that the event did
not occur until time t (the survival probability). It can hence be
used to model the probability of the right-censoring of the time for
an event to occur.
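As a short worked step, the definitions above link the hazard rate and the survival function directly. The derivation below is a sketch under the assumptions that T is a non-negative continuous random variable, so that f = dF/dt = −dS/dt and S(0 | x) = 1:

```latex
\begin{align*}
h(t \mid x) &= \frac{f(t \mid x)}{S(t \mid x)}
             = -\frac{\tfrac{d}{dt} S(t \mid x)}{S(t \mid x)}
             = -\frac{d}{dt} \log S(t \mid x), \\
S(t \mid x) &= \exp\!\left( -\int_0^t h(u \mid x) \, du \right).
\end{align*}
```

Hence, specifying the hazard rate is sufficient to determine the survival function, and vice versa.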
A simple example is the Cox proportional hazards model [9] with a constant baseline hazard, in
which the hazard rate is constant over time; thus, it does depend
on the feature vector x = (x_1, . . . , x_n) but not on time t. More
specifically, the hazard rate is modeled as a log-linear function of
the features x_i:
\[ h(t \mid x) = \lambda(x) = \exp(x^\top \beta) \]
The model is proportional in the sense that increasing x_i by one
unit multiplies the hazard rate λ(x) by a factor of α_i = exp(β_i).
For this model, one easily derives the survival function S(t | x) =
exp(−λ(x) · t) and an expected survival time of 1/λ(x).
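The constant-hazard model can be written down in a few lines. The following is a minimal sketch, assuming a constant hazard λ(x) = exp(xᵀβ); with h = f/S this gives the survival probability exp(−λ(x) · t). The function names and the coefficient values are illustrative assumptions, not part of the original model description.

```python
import math
from typing import Sequence


def hazard_rate(x: Sequence[float], beta: Sequence[float]) -> float:
    """Constant hazard rate lambda(x) = exp(x . beta)."""
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)))


def survival(t: float, x: Sequence[float], beta: Sequence[float]) -> float:
    """Probability that no event has occurred up to duration t."""
    return math.exp(-hazard_rate(x, beta) * t)


def expected_survival_time(x: Sequence[float], beta: Sequence[float]) -> float:
    """Mean of the exponential duration distribution, 1 / lambda(x)."""
    return 1.0 / hazard_rate(x, beta)


# Hypothetical coefficients: the first feature doubles the hazard rate.
beta = [math.log(2.0), 0.5]
ratio = hazard_rate([1.0, 1.0], beta) / hazard_rate([0.0, 1.0], beta)
```

Here `ratio` is the proportionality factor exp(β_1) for a one-unit increase of the first feature, independent of the values of the other features.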
5.2.2 EHA on data streams
Although the temporal nature of event data naturally fits the data
stream model and, moreover, event data is naturally produced by
many data sources, EHA has been considered in the data stream
scenario only very recently. In [39], the authors propose a method
for analyzing earthquake and Twitter data, namely an extension of
the above Cox model based on a sliding window approach. The
authors of [28] modify standard classification algorithms, such as
decision trees, so that they can be trained on a snapshot stream of
both censored and non-censored data.
Like in the case of clustering [35], where one distinguishes between
clustering observations and clustering data sources, two different
settings can be envisioned for EHA on data streams:
1. In the first setting, events are generated by multiple data sources
(representing monitored objects), and the features pertain to
these sources; thus, each data source is characterized by a
feature vector x and produces a stream of (recurrent) events.
For example, data sources could be users in a computer network, and an event occurs whenever a user sends an email.
2. In the second setting, events are produced by a single data
source, but now the events themselves are characterized by
features. For example, events might be emails sent by an
email server, and each email is represented by a certain set
of properties.
Statistical event models on data streams can be used in much the
same way as in the case of static data. For example, they can serve
predictive purposes, i.e., to answer questions such as “How much
time will elapse before the next email arrives?” or “What is the
probability of receiving more than 100 emails within the next hour?”.
What is specifically interesting, however, and indeed distinguishes
the data stream setting from the static case, is the fact that the model
may change over time. This is a subtle aspect, because the hazard
model h(t | x) itself may already be time-dependent; here, however, t is not the absolute time but the duration time, i.e., the time
elapsed since the last event. A change of the model is comparable to concept drift in classification, and means that the way in
which the hazard rate depends on time t and on the features x_i
changes over time. For example, consider the event “increase of
a stock rate” and suppose that β_i = log(2) for the binary feature
x_i = energy sector in the above Cox model (which, as already
mentioned, does not depend on t). Thus, this feature doubles the
hazard rate and hence halves the expected duration between two
events. Needless to say, however, this influence may change over
time, depending on how well the energy sector is doing.
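One simple way to let such a model follow changes over time is to re-estimate it on a sliding window. The sketch below is a deliberately simplified, feature-free illustration and not the Cox-model extension of [39]: it assumes a constant hazard rate within the window and uses the standard maximum likelihood estimate for exponentially distributed durations with right censoring (number of observed events divided by total observed duration). The class name and window size are hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingWindowHazard:
    """Constant hazard rate estimated from the most recent observations only."""

    def __init__(self, window_size: int) -> None:
        # Each entry is (duration, event_observed); the oldest entries drop out.
        self._window: Deque[Tuple[float, bool]] = deque(maxlen=window_size)

    def update(self, duration: float, event_observed: bool) -> None:
        """Add one (possibly right-censored) duration to the window."""
        self._window.append((duration, event_observed))

    def rate(self) -> Optional[float]:
        """Observed events divided by total exposure time, or None if undefined."""
        exposure = sum(duration for duration, _ in self._window)
        if exposure <= 0:
            return None
        events = sum(1 for _, observed in self._window if observed)
        return events / exposure


estimator = SlidingWindowHazard(window_size=3)
for duration, observed in [(2.0, True), (2.0, True), (4.0, False)]:
    estimator.update(duration, observed)
rate_before = estimator.rate()   # 2 events over 8 time units
estimator.update(1.0, True)      # the oldest observation leaves the window
rate_after = estimator.rate()    # 2 events over 7 time units
```

Censored durations contribute exposure time but no event, so they lower the estimated rate instead of being discarded. The window size trades off reactivity to model changes against the variance of the estimate.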
Dealing with model changes of that kind is clearly an important
challenge for event analysis on data streams. Although the problem
is to some extent addressed by the works mentioned above, there
is certainly scope for further improvement, and for using these approaches to derive predictive models from censored data. Besides,
there are many other directions for future work. For example, since
the detection of events is a main prerequisite for analyzing them,
the combination of EHA with methods for event detection [36] is
an important challenge. Indeed, this problem is often far from trivial, and in many cases, events (such as frauds, for example) can only
be detected with a certain time delay; dealing with delayed events
is therefore another important topic, which was also discussed in
section 4.2.
6. EVALUATION OF DATA STREAM ALGORITHMS
All of the aforementioned challenges are milestones on the road to
better algorithms for real-world data stream mining systems. To
verify if these challenges are met, practitioners need tools capable of evaluating newly proposed solutions. Although in the field
of static classification such tools exist, they are insufficient in data
stream environments due to such problems as: concept drift, limited processing time, verification latency, multiple stream structures, evolving class skew, censored data, and changing misclassification costs. In fact, the myriad of additional complexities posed
by data streams makes algorithm evaluation a highly multi-criterial
task, in which optimal trade-offs may change over time.
Recent developments in applied machine learning [6] emphasize
the importance of understanding the data one is working with and
using evaluation metrics which reflect its difficulties. As mentioned before, data streams set new requirements compared to traditional data mining and researchers are beginning to acknowledge the shortcomings of existing evaluation metrics. For example, Gama et al. [16] proposed a way of calculating classification
accuracy using only the most recent stream examples, therefore allowing for time-oriented evaluation and aiding concept drift detection. Methods which test the classifier’s robustness to drifts and
noise on a practical, experimental level are also starting to arise
[34; 47]. However, all these evaluation techniques focus on single criteria such as prediction accuracy or robustness to drifts, even
though data streams make evaluation a constant trade-off between
several criteria [7]. Moreover, in data stream environments there is
a need for more advanced tools for visualizing changes in algorithm
predictions with time.
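The idea of computing accuracy from only the most recent examples can be sketched as follows. This is an illustrative test-then-train bookkeeping class with a fixed-size window; it is an assumption-based simplification and not necessarily the exact estimator proposed in [16]. The class name is hypothetical.

```python
from collections import deque
from typing import Deque, Hashable


class WindowedAccuracy:
    """Accuracy over the last `window_size` predictions of a stream classifier."""

    def __init__(self, window_size: int) -> None:
        # 1 for a correct prediction, 0 for a wrong one; old outcomes are dropped.
        self._hits: Deque[int] = deque(maxlen=window_size)

    def update(self, y_true: Hashable, y_pred: Hashable) -> float:
        """Record one prediction made before the true label was revealed."""
        self._hits.append(1 if y_true == y_pred else 0)
        return self.value()

    def value(self) -> float:
        """Current windowed accuracy (0.0 while no prediction has been seen)."""
        return sum(self._hits) / len(self._hits) if self._hits else 0.0


metric = WindowedAccuracy(window_size=3)
for y_true, y_pred in [(1, 1), (0, 1), (1, 1), (1, 1)]:
    current = metric.update(y_true, y_pred)
```

Because old outcomes leave the window, a drop in this value after a period of stability can serve as a simple indication that the concept may have changed.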
The problem of creating complex evaluation methods for stream
mining algorithms lies mainly in the size and evolving nature of
data streams. It is much more difficult to estimate and visualize,
for example, prediction accuracy if evaluation must be done online, using limited resources, and the classification task changes
with time. In fact, the algorithm’s ability to adapt is another aspect which needs to be evaluated, although information needed to
perform such evaluation is not always available. Concept drifts are
known in advance mainly when using synthetic or benchmark data,
while in more practical scenarios occurrences and types of concepts
are not directly known and only the label of each arriving instance
is known. Moreover, in many cases the task is more complicated, as
labeling information is not instantly available. Other difficulties in
evaluation include processing complex relational streams and coping with class imbalance when class distributions evolve with time.
Finally, not only do we need measures for evaluating single aspects
of stream mining algorithms, but also ways of combining several of
these aspects into global evaluation models, which would take into
account expert knowledge and user preferences.
Clearly, evaluation of data stream algorithms is a fertile ground
for novel theoretical and algorithmic solutions. In terms of prediction measures, data stream mining still requires evaluation tools
that would be immune to class imbalance and robust to noise. In
our opinion, solutions to this problem should involve not only metrics based on relative performance to baseline (chance) classifiers,
but also graphical measures similar to PR-curves or cost curves.
Furthermore, there is a need for integrating information about concept drifts in the evaluation process. As mentioned earlier, possible
ways of considering concept drifts will depend on the information
that is available. If true concepts are known, algorithms could be
evaluated based on: how often they detect drift, how early they detect it, how they react to it, and how quickly they recover from it.
Moreover, in this scenario, evaluation of an algorithm should be
dependent on whether it takes place during drift or during times of
concept stability. A possible way of tackling this problem would be
the proposal of graphical methods, similar to ROC analysis, which
would work online and visualize concept drift measures alongside
prediction measures. Additionally, these graphical measures could
take into account the state of the stream, for example, its speed,
number of missing values, or class distribution. Similar methods
could be proposed for scenarios where concepts are not known in
advance; however, in these cases measures should be based on drift
detectors or label-independent stream statistics. Above all, due to
the number of aspects which need to be measured, we believe that
the evaluation of data stream algorithms requires a multi-criterial
view. This could be done by drawing inspiration from multiple criteria decision analysis, where trade-offs between criteria are determined
using user feedback. In particular, a user could express their
preferences among criteria (for example, between memory consumption,
accuracy, reactivity, self-tuning, and adaptability) by deciding between alternative algorithms for a given data stream. It is worth
noting that such a multi-criterial view on evaluation is difficult to
encapsulate in a single number, as is usually done in traditional
offline learning. This might suggest that researchers in this area
should turn towards semi-qualitative and semi-quantitative evaluation, for which systematic methodologies should be developed.
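When the true drift points are known, for example in synthetic data, the drift-related criteria mentioned above can be computed directly. The following sketch is one possible scoring scheme under stated assumptions: a detection counts as correct if it is the first unused alarm within `max_delay` instances after a true drift and before the next drift; all other alarms are false alarms. The function name and the tolerance parameter are hypothetical.

```python
from typing import Dict, List, Sequence


def score_drift_detections(true_drifts: Sequence[int],
                           detections: Sequence[int],
                           max_delay: int) -> Dict[str, object]:
    """Match alarms to known drift positions and summarize detection quality."""
    drifts = sorted(true_drifts)
    alarms = sorted(detections)
    used = set()
    delays: List[int] = []
    missed = 0
    for i, position in enumerate(drifts):
        # An alarm must arrive within max_delay and before the next true drift.
        upper = position + max_delay
        if i + 1 < len(drifts):
            upper = min(upper, drifts[i + 1] - 1)
        match = next((j for j, alarm in enumerate(alarms)
                      if j not in used and position <= alarm <= upper), None)
        if match is None:
            missed += 1
        else:
            used.add(match)
            delays.append(alarms[match] - position)
    return {"delays": delays,
            "missed": missed,
            "false_alarms": len(alarms) - len(used)}


result = score_drift_detections(true_drifts=[100, 500],
                                detections=[50, 120, 130, 900],
                                max_delay=100)
```

In this example the drift at position 100 is detected with a delay of 20 instances, the drift at position 500 is missed, and the remaining three alarms count as false alarms. Such counts could be reported alongside prediction measures rather than folded into a single number.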
Finally, a separate research direction involves rethinking the way
we test data stream mining algorithms. The traditional train, cross-validate, test workflow in classification is not applicable to sequential data, which makes, for instance, parameter tuning much more
difficult. Similarly, ground truth verification in unsupervised learning is practically impossible in data stream environments. With
these problems in mind, it is worth stating that there is still a shortage of real and synthetic benchmark datasets. Such a situation
might be a result of non-uniform standards for testing algorithms on
streaming data. As a community, we should decide on such matters
as: What characteristics should benchmark datasets have? Should
they have prediction tasks attached? Should we move towards online evaluation tools rather than datasets? These questions should
be answered in order to solve evaluation issues in controlled environments before we create measures for real-world scenarios.
7. FROM ALGORITHMS TO DECISION SUPPORT SYSTEMS
While a lot of algorithmic methods for data streams are already
available, their deployment in real applications with real streaming
data presents a new dimension of challenges. This section points
out two such challenges: making models simpler and dealing with
legacy systems.
7.1 Making models simpler, more reactive, and more specialized
In this subsection, we discuss aspects like the simplicity of a model,
its proper combination of offline and online components, and its
customization to the requirements of the application domain. As
an application example, consider the French Orange Portal², which
registers millions of visits daily. Most of these visitors are only
known through anonymous cookie IDs. For all of these visitors,
the portal aims to provide specific and relevant content
as well as to display ads for targeted audiences. Using information
about visits to the portal, the questions are: which parts of the portal
does each cookie visit, when and which content did it consult,
which advertisement was shown, and when (if at all) was it clicked? All this information generates hundreds of gigabytes of data each week. A
user profiling system needs a back-end part to preprocess
the information required as input by a front-end part, which will
compute appetency to advertising (for example) using stream mining techniques (in this case a supervised classifier). Since the ads
to display change regularly, based on marketing campaigns, extensive parameter tuning is infeasible, as one has to react quickly to
change. Currently, these tasks are either solved using bandit methods from game theory [8], which impairs adaptation to drift, or
done offline in big data systems, resulting in slow reactivity.
7.1.1 Minimizing parameter dependence
Adaptive predictive systems are intrinsically parametrized. In most
cases, setting or tuning these parameters is a difficult
task, which in turn negatively affects the usability of these systems.
Therefore, it is highly desirable for the system to have as few
user-adjustable parameters as possible. Unfortunately, the state of the
art does not produce methods with trustworthy or easily adjustable
parameters. Moreover, many predictive modeling methods use a
lot of parameters, rendering them particularly impractical for data
stream applications, where models are allowed to evolve over time,
and input parameters often need to evolve as well.
The process of predictive modeling encompasses fitting of parameters on a training dataset and subsequently selecting the best model,
either by heuristics or principled methods. Recently, model selection methods have been proposed that do not require internal cross-validation, but rather use the Bayesian machinery to design regularizers with data-dependent priors [20]. However, they are not yet
applicable in data streams, as their computational time complexity
is too high and they require all examples to be kept in memory.
7.1.2 Combining offline and online models
Online and offline learning are mostly considered as mutually exclusive, but it is their combination that might enhance the value
of data the most. Online learning, which processes instances one-by-one and builds models incrementally, has the virtue of being
fast, both in the processing of data and in the adaptation of models. Offline (or batch) learning has the advantage of allowing the
use of more sophisticated mining techniques, which might be more
time-consuming or require a human expert. While the first allows
the processing of “fast data” that requires real-time processing and
adaptivity, the second allows processing of “big data” that requires
longer processing time and larger abstraction.
Their combination can take place in many steps of the mining process, such as the data preparation and the preprocessing steps. For
example, offline learning on big data could extract fundamental and
sustainable trends from data using batch processing and massive
parallelism. Online learning could then take real-time decisions
from online events to optimize an immediate pay-off. In the online
advertisement application mentioned above, the user-click prediction is done within a context, defined for example by the currently
viewed page and the profile of the cookie. The decision which
banner to display is done online, but the context can be preprocessed offline. By deriving meta-information such as “the profile is
a young male, the page is from the sport cluster”, the offline component can ease the online decision task.
² www.orange.fr
mining can make a decisive contribution to enhance and facilitate
the required monitoring tasks. Currently, we are planning to use the
ISS Columbus module as a technology demonstrator for integrating data stream processing and mining into the existing monitoring
processes [31]. Figure 2 exemplifies the failure management system (FMS) of the ISS Columbus module. While it is impossible to
simply redesign the FMS from scratch, we can outline the following challenges.
7.1.3 Solving the right problem
Domain knowledge may help to solve many issues raised in this
paper, by systematically exploiting particularities of application domains. However, this is seldom considered, as typical data stream
methods are created to deal with a large variety of domains. For instance, in some domains the learning algorithm receives only partial feedback upon its prediction, i.e. a single bit of right-or-wrong,
rather than the true label. In the user-click prediction example, if a
user does not click on a banner, we do not know which one would
have been correct, but solely that the displayed one was wrong.
This is related to the issues on timing and availability of information discussed in section 4.2.
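The partial-feedback setting can be illustrated with a small sketch. It assumes a fixed set of banners and a simple epsilon-greedy selection rule; it is a generic illustration and not the mortal multi-armed bandit method of [8]. The class name, banner identifiers, and the `epsilon` parameter are hypothetical.

```python
import random
from typing import Dict, List


class BannerSelector:
    """Click-rate estimates learned from right-or-wrong feedback only."""

    def __init__(self, banners: List[str]) -> None:
        self.shows: Dict[str, int] = {banner: 0 for banner in banners}
        self.clicks: Dict[str, int] = {banner: 0 for banner in banners}

    def click_rate(self, banner: str) -> float:
        shown = self.shows[banner]
        return self.clicks[banner] / shown if shown else 0.0

    def choose(self, epsilon: float, rng: random.Random) -> str:
        """Explore with probability epsilon, otherwise show the best banner so far."""
        if rng.random() < epsilon:
            return rng.choice(list(self.shows))
        return max(self.shows, key=self.click_rate)

    def update(self, shown_banner: str, clicked: bool) -> None:
        # Only the displayed banner is updated: a missing click says nothing
        # about which of the other banners would have been clicked.
        self.shows[shown_banner] += 1
        if clicked:
            self.clicks[shown_banner] += 1


selector = BannerSelector(["a", "b"])
selector.update("a", clicked=True)
selector.update("a", clicked=False)
selector.update("b", clicked=False)
best = selector.choose(epsilon=0.0, rng=random.Random(0))
```

Because the estimates of banners that are not displayed never change, some exploration is needed, which is the price paid for receiving only a single bit of feedback per decision.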
However, building predictive models that systematically incorporate domain knowledge or domain-specific information requires
choosing the right optimization criteria. As mentioned in section 6, the data stream setting requires optimizing multiple criteria
simultaneously, as optimizing only predictive performance is not
sufficient. We need to develop learning algorithms that optimize an objective function intrinsically and simultaneously covering memory consumption, predictive performance, reactivity,
self-monitoring and tuning, and (explainable) auto-adaptivity. Data
streams research is lacking methodologies for forming and optimizing such criteria.
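One elementary building block for comparing models under several criteria is Pareto dominance. The sketch below assumes that every criterion is expressed as a cost (lower is better); the candidate names and their values for error rate and memory consumption are invented for illustration and do not stem from any experiment.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on all criteria and better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Names of all candidates that are not dominated by any other candidate."""
    return sorted(
        name for name, costs in candidates.items()
        if not any(dominates(other, costs)
                   for other_name, other in candidates.items()
                   if other_name != name)
    )


# Hypothetical candidates described by (error rate, memory in MB).
candidates = {"A": (0.10, 50.0), "B": (0.12, 20.0), "C": (0.15, 60.0)}
front = pareto_front(candidates)
```

Here candidate C is dominated by A, while A and B represent different trade-offs between error and memory. Choosing between non-dominated candidates is where user preferences or domain knowledge would have to enter.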
Therefore, models should be simple so that they do not depend on
a set of carefully tuned parameters. Additionally, they should combine offline and online techniques to address challenges of big and
fast data, and they should solve the right problem, which might
consist in solving a multi-criteria optimization task. Finally, they
have to be able to learn from a small amount of data and with low
variance [37], to react quickly to drift.
7.2 Dealing with Legacy Systems
In many application environments, such as financial services or
health care systems, business-critical applications have been in operation
for decades. Since these applications produce massive amounts of
data, it is very promising to process this data
with real-time stream mining approaches. However, it is often impossible to change existing infrastructures in order to introduce fully
fledged stream mining systems. Rather than changing existing infrastructures, approaches are required that integrate stream mining
techniques into legacy systems. In general, problems concerning
legacy systems are domain-specific and encompass both technical
and procedural issues. In this section, we analyze challenges posed
by a specific real-world application with legacy issues — the ISS
Columbus spacecraft module.
Figure 2: ISS Columbus FMS (components shown: 1. ISS Columbus module; 2. Ground control centre; 3. Engineering support centre; 4. Mission archive; 5. Assembly, integration, and test facility)
7.2.2 Complexity
Even though spacecraft monitoring is very challenging by itself,
it becomes increasingly difficult and complex due to the integration of data stream mining into such legacy systems. However,
this integration is intended to enhance and facilitate current monitoring processes. Thus, appropriate mechanisms are required to integrate data
stream mining into the current processes in a way that decreases complexity.
7.2.3 Interlocking
As depicted in Figure 2, the ISS Columbus module is connected
to ground instances. Real-time monitoring must be applied aboard
where computational resources are restricted (e.g. processor speed
and memory or power consumption). Near real-time monitoring or
long-term analysis must be applied on-ground where the downlink
suffers from latencies because of a long transmission distance, is
subject to bandwidth limitations, and is continuously interrupted due
to loss of signal. Consequently, new data stream mining mechanisms are necessary which ensure a smooth interlocking functionality of on-board and ground instances.
7.2.1 ISS Columbus
Spacecraft are very complex systems, exposed to very different
physical environments (e.g. space), and associated with ground stations. These systems are under constant and remote monitoring
by means of telemetry and commands. The ISS Columbus module has been in operation for more than 5 years. For some time,
it has been pointed out that the monitoring process is not as efficient as
previously expected [30]. However, we assume that data stream
7.2.4 Reliability and Balance
The reliability of spacecraft is indispensable for astronauts’ health
and mission success. Accordingly, spacecraft pass very long and
expensive planning and testing phases. Hence, potential data stream
mining algorithms must ensure reliability and the integration of
such algorithms into legacy systems must not cause critical side
effects. Furthermore, data stream mining is an automatic process
which neglects interactions with human experts, while spacecraft
monitoring is a semi-automatic process and human experts (e.g.
the flight control team) are responsible for decisions and consequent actions. This problem poses the following question: How to
integrate data stream mining into legacy systems when automation
needs to be increased but the human expert needs to be maintained
in the loop? Abstract discussions on this topic are provided by expert systems [44] and the MAPE-K reference model [24]. Expert
systems aim to combine human expertise with artificial expertise
and the MAPE-K reference model aims to provide an autonomic
control loop. A balance must be struck which considers both aforementioned aspects appropriately.
Overall, the Columbus study has shown that extending legacy systems with real time data stream mining technologies is feasible and
it is an important area for further stream-mining research.
8. CONCLUDING REMARKS
In this paper, we discussed research challenges for data streams,
originating from real-world applications. We analyzed issues concerning privacy, availability of information, relational and event
streams, preprocessing, model complexity, evaluation, and legacy
systems. The discussed issues were illustrated by practical applications including GPS systems, Twitter analysis, earthquake predictions, customer profiling, and spacecraft monitoring. The study of
real-world problems highlighted shortcomings of existing methodologies and showcased previously unaddressed research issues.
Consequently, we call on the data stream mining community to consider the following action points for data stream research:
• developing methods for ensuring privacy with incomplete
information as data arrives, while taking into account the
evolving nature of data;
• considering the availability of information by developing models that handle incomplete, delayed and/or costly feedback;
• taking advantage of relations between streaming entities;
• developing event detection methods and predictive models
for censored data;
• developing a systematic methodology for streamed preprocessing;
• creating simpler models through multi-objective optimization criteria, which consider not only accuracy, but also computational resources, diagnostics, reactivity, interpretability;
• establishing a multi-criteria view towards evaluation, dealing
with absence of the ground truth about how data changes;
• developing online monitoring systems, ensuring reliability of
any updates, and balancing the distribution of resources.
As our study shows, there are challenges in every step of the CRISP
data mining process. To date, modeling over data streams has
been viewed and approached as an extension of traditional methods. However, our discussion and application examples show that
in many cases it would be beneficial to step aside from building
upon existing offline approaches, and start blank considering what
is required in the stream setting.
Acknowledgments
We would like to thank the participants of the RealStream2013
workshop at ECMLPKDD2013 in Prague, and in particular Bernhard Pfahringer and George Forman, for suggestions and discussions on the challenges in stream mining. Part of this work was
funded by the German Research Foundation, projects SP 572/11-1
(IMPRINT) and HU 1284/5-1, the Academy of Finland grant 118653
(ALGODAN), and the Polish National Science Center grants
DEC-2011/03/N/ST6/00360 and DEC-2013/11/B/ST6/00963.
9. REFERENCES
[1] C. Aggarwal, editor. Data Streams: Models and Algorithms.
Springer, 2007.
[2] C. Aggarwal and D. Turaga. Mining data streams: Systems
and algorithms. In Machine Learning and Knowledge Discovery for Engineering Systems Health Management, pages
4–32. Chapman and Hall, 2012.
[3] R. Agrawal and R. Srikant. Privacy-preserving data mining.
SIGMOD Rec., 29(2):439–450, 2000.
[4] C. Anagnostopoulos, N. Adams, and D. Hand. Deciding what
to observe next: Adaptive variable selection for regression in
multivariate data streams. In Proc. of the 2008 ACM Symp. on
Applied Computing, SAC, pages 961–965, 2008.
[5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom.
Models and issues in data stream systems. In Proc. of the 21st
ACM SIGACT-SIGMOD-SIGART Symposium on Principles
of Database Systems, PODS, pages 1–16, 2002.
[6] C. Brodley, U. Rebbapragada, K. Small, and B. Wallace.
Challenges and opportunities in applied machine learning. AI
Magazine, 33(1):11–24, 2012.
[7] D. Brzezinski and J. Stefanowski. Reacting to different types
of concept drift: The accuracy updated ensemble algorithm.
IEEE Trans. on Neural Networks and Learning Systems,
25:81–94, 2014.
[8] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal
multi-armed bandits. In Proc. of the 22nd Conf. on Neural
Information Processing Systems, NIPS, pages 273–280, 2008.
[9] D. Cox and D. Oakes. Analysis of Survival Data. Chapman &
Hall, London, 1984.
[10] T. Dietterich. Machine-learning research. AI Magazine,
18(4):97–136, 1997.
[11] G. Ditzler and R. Polikar. Semi-supervised learning in nonstationary environments. In Proc. of the 2011 Int. Joint Conf.
on Neural Networks, IJCNN, pages 2741–2748, 2011.
[12] W. Fan and A. Bifet. Mining big data: current status, and forecast to the future. SIGKDD Explorations, 14(2):1–5, 2012.
[13] M. Gaber, J. Gama, S. Krishnaswamy, J. Gomes, and F. Stahl.
Data stream mining in ubiquitous environments: state-of-the-art and current directions. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery, 4(2):116–138,
2014.
[14] M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data
streams: A review. SIGMOD Rec., 34(2):18–26, 2005.
[15] J. Gama. Knowledge Discovery from Data Streams. Chapman
& Hall/CRC, 2010.
[16] J. Gama, R. Sebastiao, and P. Rodrigues. On evaluating
stream learning algorithms. Machine Learning, 90(3):317–
346, 2013.
[17] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and
A. Bouchachia. A survey on concept-drift adaptation. ACM
Computing Surveys, 46(4), 2014.
[18] J. Gantz and D. Reinsel. The digital universe in 2020: Big
data, bigger digital shadows, and biggest growth in the far
east, December 2012.
[19] A. Goldberg, M. Li, and X. Zhu. Online manifold regularization: A new learning setting and empirical study. In Proc.
of the European Conf. on Machine Learning and Principles
of Knowledge Discovery in Databases, ECMLPKDD, pages
393–407, 2008.
[20] I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selection: Beyond the bayesian/frequentist divide. Journal of Machine Learning Research, 11:61–87, 2010.
[21] M. Hassani and T. Seidl. Towards a mobile health context
prediction: Sequential pattern mining in multiple streams.
In Proc. of IEEE Int. Conf. on Mobile Data Management,
MDM, pages 55–57, 2011.
[22] H. He and Y. Ma, editors. Imbalanced Learning: Foundations, Algorithms, and Applications. IEEE, 2013.
[23] T. Hoens, R. Polikar, and N. Chawla. Learning from streaming data with concept drift and imbalance: an overview.
Progress in Artificial Intelligence, 1(1):89–101, 2012.
[24] IBM. An architectural blueprint for autonomic computing.
Technical report, IBM, 2003.
[25] E. Ikonomovska, K. Driessens, S. Dzeroski, and J. Gama.
Adaptive windowing for online learning from multiple interrelated data streams. In Proc. of the 11th IEEE Int. Conf. on
Data Mining Workshops, ICDMW, pages 697–704, 2011.
[26] A. Kotov, C. Zhai, and R. Sproat. Mining named entities
with temporally correlated bursts from multilingual web news
streams. In Proc. of the 4th ACM Int. Conf. on Web Search and
Data Mining, WSDM, pages 237–246, 2011.
[27] G. Krempl. The algorithm APT to classify in concurrence of
latency and drift. In Proc. of the 10th Int. Conf. on Advances
in Intelligent Data Analysis, IDA, pages 222–233, 2011.
[28] M. Last and H. Halpert. Survival analysis meets data stream
mining. In Proc. of the 1st Worksh. on Real-World Challenges
for Data Stream Mining, RealStream, pages 26–29, 2013.
[29] F. Nelwamondo and T. Marwala. Key issues on computational
intelligence techniques for missing data imputation - a review.
In Proc. of World Multi Conf. on Systemics, Cybernetics and
Informatics, volume 4, pages 35–40, 2008.
[30] E. Noack, W. Belau, R. Wohlgemuth, R. Müller, S. Palumberi,
P. Parodi, and F. Burzagli. Efficiency of the columbus failure
management system. In Proc. of the AIAA 40th Int. Conf. on
Environmental Systems, 2010.
[31] E. Noack, A. Luedtke, I. Schmitt, T. Noack, E. Schaumlöffel,
E. Hauke, J. Stamminger, and E. Frisk. The columbus module
as a technology demonstrator for innovative failure management. In German Air and Space Travel Congress, 2012.
[32] M. Oliveira and J. Gama. A framework to monitor clusters
evolution applied to economy and finance problems. Intelligent Data Analysis, 16(1):93–111, 2012.
[33] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., 1999.
[34] T. Raeder and N. Chawla. Model monitor (M²): Evaluating, comparing, and monitoring models. Journal of Machine
Learning Research, 10:1387–1390, 2009.
[35] P. Rodrigues and J. Gama. Distributed clustering of ubiquitous data streams. WIREs Data Mining and Knowledge Discovery, pages 38–54, 2013.
[36] T. Sakaki, M. Okazaki, and Y. Matsuo. Tweet analysis for
real-time event detection and earthquake reporting system development. IEEE Trans. on Knowledge and Data Engineering, 25(4):919–931, 2013.
[37] C. Salperwyck and V. Lemaire. Learning with few examples:
An empirical study on leading classifiers. In Proc. of the 2011
Int. Joint Conf. on Neural Networks, IJCNN, pages 1010–
1019, 2011.
[38] B. Settles. Active Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan and Claypool
Publishers, 2012.
[39] A. Shaker and E. Hüllermeier. Survival analysis on data
streams: Analyzing temporal events in dynamically changing environments. Int. Journal of Applied Mathematics and
Computer Science, 24(1):199–212, 2014.
[40] C. Shearer. The CRISP-DM model: the new blueprint for data
mining. J Data Warehousing, 2000.
[41] Z. Siddiqui, M. Oliveira, J. Gama, and M. Spiliopoulou.
Where are we going? predicting the evolution of individuals. In Proc. of the 11th Int. Conf. on Advances in Intelligent
Data Analysis, IDA, pages 357–368, 2012.
[42] Z. Siddiqui and M. Spiliopoulou. Classification rule mining
for a stream of perennial objects. In Proc. of the 5th Int. Conf.
on Rule-based Reasoning, Programming, and Applications,
RuleML, pages 281–296, 2011.
[43] M. Spiliopoulou and G. Krempl. Tutorial "Mining multiple
threads of streaming data". In Proc. of the Pacific-Asia Conf.
on Knowledge Discovery and Data Mining, PAKDD, 2013.
[44] D. Waterman. A Guide to Expert Systems. Addison-Wesley,
1986.
[45] W. Young, G. Weckman, and W. Holland. A survey of
methodologies for the treatment of missing values within
datasets: limitations and benefits. Theoretical Issues in Ergonomics Science, 12, January 2011.
[46] B. Zhou, Y. Han, J. Pei, B. Jiang, Y. Tao, and Y. Jia. Continuous privacy preserving publishing of data streams. In Proc.
of the 12th Int. Conf. on Extending Database Technology,
EDBT, pages 648–659, 2009.
[47] I. Zliobaite. Controlled permutations for testing adaptive
learning models. Knowledge and Information Systems, In
Press, 2014.
[48] I. Zliobaite, A. Bifet, M. Gaber, B. Gabrys, J. Gama,
L. Minku, and K. Musial. Next challenges for adaptive learning systems. SIGKDD Explorations, 14(1):48–55, 2012.
[49] I. Zliobaite and B. Gabrys. Adaptive preprocessing for
streaming data. IEEE Trans. on Knowledge and Data Engineering, 26(2):309–321, 2014.
Twitter Analytics: A Big Data Management Perspective
Oshini Goonetilleke†, Timos Sellis†, Xiuzhen Zhang†, Saket Sathe§
† School of Computer Science and IT, RMIT University, Melbourne, Australia
§ IBM Melbourne Research Laboratory, Australia
{oshini.goonetilleke, timos.sellis, xiuzhen.zhang}@rmit.edu.au, ssathe@au.ibm.com
ABSTRACT
With the inception of the Twitter microblogging platform in 2006, a myriad of research efforts have emerged studying different aspects of the Twittersphere. Each study exploits its own tools and mechanisms to capture, store, query and analyze Twitter data. Inevitably, platforms have been developed to replace this ad-hoc exploration with a more structured and methodological form of analysis. Another body of literature focuses on developing languages for querying tweets. This paper addresses issues around the big data nature of Twitter and emphasizes the need for new data management and query language frameworks that address limitations of existing systems. We review existing approaches that were developed to facilitate Twitter analytics, followed by a discussion on research issues and technical challenges in developing integrated solutions.
1. INTRODUCTION
The massive growth of data generated from social media sources has resulted in a growing interest in efficient and effective means of collecting, analyzing and querying large volumes of social data. The existing platforms exploit several characteristics of big data, including large volumes of data, velocity due to the streaming nature of data, and variety due to the integration of data from the web and other sources. Hence, social network data presents an excellent testbed for research on big data.
In particular, the online social networking and microblogging platform Twitter has seen exponential growth in its user base since its inception in 2006, with now over 200 million monthly active users producing 500 million tweets (Twittersphere, the postings made to Twitter) daily¹. A wide research community has been established since then with the hope of understanding interactions on Twitter. For example, studies have been conducted in many domains exploring different perspectives of understanding human behavior. Prior research focuses on a variety of topics including opinion mining [12, 14, 38], event detection [46, 65, 76], spread of pandemics [26, 58, 68], celebrity engagement [74] and analysis of political discourse [28, 40, 70]. These types of efforts have enabled researchers to understand interactions on Twitter related to the fields of journalism, education, marketing, disaster relief etc.
¹ http://tnw.to/s0n9u
The systems that perform analysis in the context of these interactions typically involve the following major components: focused crawling, data management and data analytics. Here, data management comprises information extraction, pre-processing, data modeling and query processing components. Figure 1 shows a block diagram of such a system and depicts interactions between various components. Until now, there has been a significant amount of prior research around improving each of the components shown in Figure 1, but to the best of our knowledge, there have been no frameworks that propose a unified approach to Twitter data management that seamlessly integrates all these components. Following these observations, in this paper we extensively survey the techniques that have been proposed for realising each of the components shown in Figure 1, and then motivate the need and challenges of a unified framework for managing Twitter data.
Figure 1: An abstraction of a Twitter data management platform
In our survey of existing literature we observed ways in which researchers have tried to develop general platforms to provide a repeatable foundation for Twitter data analytics. We review the tweet analytics space primarily focusing on the following key elements:
• Data Collection. Researchers have several options for collecting a suitable data set from Twitter. In Section 2 we briefly describe mechanisms and tools that focus primarily on facilitating the initial data acquisition phase. These tools systematically capture the data using any of Twitter's publicly accessible APIs.
• Data management frameworks. In addition to providing a module for crawling tweets, these frameworks provide support for pre-processing, information extraction and/or visualization capabilities. Pre-processing deals with preparing tweets for data analysis. Information extraction aims to derive more insight from the tweets, which is not directly reported by the Twitter API, e.g. a sentiment for a given tweet. Several frameworks provide functionality to present the results in many output forms of visualizations while others are built exclusively to search over a large tweet collection. In Section 3 we review existing data management frameworks.
• Languages for querying tweets. A growing body of literature proposes declarative query languages as a mechanism of extracting structured information from tweets. Languages present end users with a set of primitives beneficial in exploring the Twittersphere along different dimensions. In Section 4 we investigate declarative languages and similar systems developed for querying a variety of tweet properties.
We have identified the essential ingredients for a unified Twitter data management solution, with the intention that an analyst will easily be able to extend its capabilities for specific types of research. Such a solution will allow the data analyst to focus on the use cases of the analytics task by conveniently using the functionality provided by an integrated framework. In Section 5 we present our position and emphasize the need for integrated solutions that address limitations of existing systems. Section 6 outlines research issues and challenges associated with the development of integrated platforms. Finally we conclude in Section 7.
2. OPTIONS FOR DATA COLLECTION
Researchers have several options when choosing an API for data collection, i.e. the Search, Streaming and the REST API. Each API has varying capabilities with respect to the type and the amount of information that can be retrieved. The Search API is dedicated to running searches against an index of recent tweets. It takes keywords as queries, with the possibility of multiple queries combined as a comma separated list. A request to the Search API returns a collection of relevant tweets matching a user query. The Streaming API provides a stream to continuously capture the public tweets, where parameters are provided to filter the results of the stream by hashtags, keywords, Twitter user ids, usernames or geographic regions. The REST API can be used to retrieve a fraction of the most recent tweets published by a Twitter user. All three APIs limit the number of requests within a time window, and rate limits are imposed at the user and the application level. Responses obtained from the Twitter API are generally in the JSON format. Third party libraries² are available in many programming languages for accessing the Twitter API. These libraries provide wrappers and methods for authentication and other functions to conveniently access the API.
² https://dev.twitter.com/docs/twitter-libraries
Publicly available APIs do not guarantee complete coverage of the data for a given query as the feeds are not designed for enterprise access. For example, the Streaming API only provides a random sample of 1% (known as the Spritzer stream) of the public Twitter stream in real-time. Applications where this rate limitation is too restrictive rely on third party resellers like GNIP, DataSift or Topsy³, which provide access to the entire collection of the tweets known as the Twitter FireHose. At a cost, resellers can provide unlimited access to archives of historical data, real-time streaming data or both. It is mostly the corporate businesses who opt for such alternatives to gain insights into their consumer and competitor patterns.
In order to obtain a dataset sufficient for an analysis task, it is necessary to efficiently query the respective API methods, within the bounds of imposed rate limits. The requests to the API may have to run continuously, spanning several days or weeks. Creating the users' social graph for a community of interest requires additional modules that crawl user accounts iteratively. Large crawls with more complete coverage were made possible with the use of whitelisted accounts [21, 45] and using the computation power of cloud computing [55]. Due to Twitter's current policy, whitelisted accounts are discontinued and are no longer an option as a means of large data collection. Distributed systems have been developed [16, 45] to make continuously running, large scale crawls feasible. There are other solutions that provide extended features to process the incoming Twitter data as discussed next.
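To make the constraint of querying "within the bounds of imposed rate limits" concrete, the following is a minimal sketch of a sliding-window limiter that paces requests. It is an illustration only: the class `SlidingWindowLimiter`, the function `crawl`, and the `fetch` callback are hypothetical names introduced here, no real Twitter endpoint is called, and the request budget and window length are placeholders rather than actual Twitter limits.

```python
from collections import deque
import time


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the limiter can be tested
        self._sent = deque()        # timestamps of requests inside the window

    def wait_time(self):
        """Seconds to wait before the next request (0.0 if allowed now)."""
        now = self.clock()
        # Forget requests that have already left the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record(self):
        """Note that a request was just sent."""
        self._sent.append(self.clock())


def crawl(queries, fetch, limiter, sleep=time.sleep):
    """Run `fetch(query)` for every query, pausing when the budget is used up.

    `fetch` is a hypothetical callback standing in for an API request.
    """
    results = []
    for query in queries:
        delay = limiter.wait_time()
        if delay > 0:
            sleep(delay)
        limiter.record()
        results.append(fetch(query))
    return results
```

A long-running crawl would pass a real request function as `fetch`; because the clock and the sleep function are injectable, the pacing logic can be checked without waiting in real time.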
3. DATA MANAGEMENT FRAMEWORKS
3.1 Focused Crawlers
The focus in studies like TwitterEcho [16] and Byun et al. [19] is data collection, where the primary contributions are driven by crawling strategies for effective retrieval and better coverage. TwitterEcho describes an open source distributed crawler for Twitter. Data can be collected from a focused community of interest, and it adopts a centralized distributed architecture in which multiple thin clients are deployed to create a scalable system. TwitterEcho devises a user expansion strategy by which the user's follower lists are crawled iteratively using the REST API. The system also includes modules responsible for controlling the expansion strategy. The user selection feature identifies user accounts to be monitored by the system, with modules for user profile analysis and language identification. The user selection feature is customized to crawl the focused community of Portuguese tweets but can be adapted to target other communities. Byun et al. [19] in their work propose a rule-based data collection tool for Twitter with the focus of analysing sentiment of Twitter messages. It is a Java-based open source tool developed using the Drools⁴ rule engine. They stress the importance of an automated data collector that also filters out unnecessary data such as spam messages.
³ http://gnip.com/, http://datasift.com/, http://about.topsy.com/
⁴ http://drools.jboss.org/
3.2 Pre-processing and Information Extraction
Apart from data collection, several frameworks implement methods to perform extensive pre-processing and information extraction of the tweets. Pre-processing tasks of TrendMiner [61] take into account the challenges posed by the noisy genre of tweets. Tokenization, stemming and POS tagging are some of the text processing tasks that better prepare tweets for the analysis task. The platform provides separate built-in modules to extract information such as location, language, sentiment and named entities that are deemed very useful in data analytics. The creation of a pipeline of these tools allows the data analyst to extend and reuse each component with relative ease.
TwitIE [17] is another open-source information extraction NLP pipeline customized for microblog text. For the purpose of information extraction (IE), the general purpose IE pipeline ANNIE is used, and it consists of components such as a sentence splitter, POS tagger and gazetteer lists (for location prediction). Each step of the pipeline addresses drawbacks in traditional NLP systems by addressing the inherent challenges in microblog text. As a result, individual components of ANNIE are customized. Language identification, tokenisation, normalization, POS tagging and named entity recognition are performed, with each module reporting accuracy on tweets.
Baldwin [11] presents a system designed for event detection on Twitter with functionality for pre-processing. JSON results returned by the Streaming API are parsed and piped through language filtering and lexical normalisation components. Messages that do not have location information are geo-located using probabilistic models, since location is a critical issue in identifying where an event occurs. Information extraction modules require knowledge from external sources and are generally more expensive tasks than language processing. Platforms that support real-time analysis [11, 76] require processing tasks to be conducted on-the-fly, where the speed of the underlying algorithms is a crucial consideration.
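The pre-processing chain described in Section 3.2 (parse the JSON result, filter by language, tokenize, apply lexical normalisation) can be sketched in a few lines. This is a simplified illustration and not the implementation of TrendMiner, TwitIE or the system of Baldwin [11]. It assumes each tweet arrives as a JSON object with `text` and `lang` fields, and the small `NORMALISATION` dictionary and all function names are hypothetical; real systems rely on much larger lexicons and trained language identifiers.

```python
import json
import re

# Hypothetical normalisation lexicon, for illustration only.
NORMALISATION = {"u": "you", "gr8": "great", "2moro": "tomorrow"}

# Keep URLs, @mentions and #hashtags as single tokens; split the rest
# into words and individual punctuation marks.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split tweet text into tokens, preserving Twitter-specific entities."""
    return TOKEN_RE.findall(text)


def normalise(tokens):
    """Lower-case ordinary words and map known shorthand to standard forms."""
    out = []
    for tok in tokens:
        if tok.startswith(("@", "#", "http://", "https://")):
            out.append(tok)  # leave entities untouched
        else:
            low = tok.lower()
            out.append(NORMALISATION.get(low, low))
    return out


def preprocess(json_line, keep_lang="en"):
    """Parse one JSON tweet; return normalised tokens, or None if filtered out."""
    tweet = json.loads(json_line)
    if tweet.get("lang") != keep_lang:  # language filtering step
        return None
    return normalise(tokenize(tweet.get("text", "")))
```

Keeping each step as a separate function mirrors the pipeline idea in the text: an analyst can swap in a different tokenizer or normaliser without touching the other stages.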
3.3 Generic Platforms
There are several proposals in which researchers have tried to develop generic platforms to provide a repeatable foundation for Twitter data analytics. TwitterZombie [15] is a platform to unify the data gathering and analysis methods by presenting a candidate architecture and methodological approach for examining specific parts of the Twittersphere. It outlines an architecture for standard capture, transformation and analysis of Twitter interactions using Twitter's Search API. This tool is designed to gather data from Twitter by executing a series of independent search jobs on a continual basis, and the collected tweets and their metadata are kept in a relational DBMS. One of the interesting features of TwitterZombie is its ability to capture hierarchical relationships in the data returned by Twitter. A network translator module performs post-processing on the tweets and stores hashtags, mentions and retweets separately from the tweet text. Raw tweets are transformed into a representation of interactions to create networks of retweets, mentions and users mentioning hashtags. This feature captured by TwitterZombie, which other studies pay little attention to, is helpful in answering different types of research questions with relative ease. Social graphs are created in the form of a retweet or mention network, and the tool does not crawl for the user graph with traditional following relationships. The study also discusses how multi-byte tweets in languages like Arabic or Chinese can be stored by performing transliteration.
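The transformation of raw tweets into networks of retweets, mentions and users mentioning hashtags can be illustrated with a short sketch. It is not TwitterZombie's network translator: the function names are hypothetical, tweets are assumed to be simple `(author, text)` pairs, and retweets are recognised only through the textual "RT @user" convention, whereas real API results carry richer metadata.

```python
import re
from collections import Counter

MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
RETWEET_RE = re.compile(r"^RT @(\w+)")  # assumption: textual retweet marker


def extract_interactions(text):
    """Return (retweeted_user_or_None, mentioned_users, hashtags) for one tweet."""
    match = RETWEET_RE.match(text)
    retweeted = match.group(1).lower() if match else None
    mentions = [m.lower() for m in MENTION_RE.findall(text)]
    if retweeted is not None:
        # The first @name belongs to the retweet marker, not to a mention.
        mentions.remove(retweeted)
    hashtags = [h.lower() for h in HASHTAG_RE.findall(text)]
    return retweeted, mentions, hashtags


def build_networks(tweets):
    """Build weighted edge lists from an iterable of (author, text) pairs."""
    retweet_edges, mention_edges, hashtag_edges = Counter(), Counter(), Counter()
    for author, text in tweets:
        retweeted, mentions, hashtags = extract_interactions(text)
        if retweeted is not None:
            retweet_edges[(author, retweeted)] += 1
        for user in mentions:
            mention_edges[(author, user)] += 1
        for tag in hashtags:
            hashtag_edges[(author, tag)] += 1
    return retweet_edges, mention_edges, hashtag_edges
```

Storing each interaction type as its own edge list corresponds to keeping hashtags, mentions and retweets separate from the tweet text, so that each network can be queried independently.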
More recently, TwitHoard [69] suggests a framework of supporting processors for data analytics on Twitter with emphasis on selection of a proper data set for the definition of a campaign. The platform consists of three layers: the campaign crawling layer, the integrated modeling layer and the data analysis layer. In the campaign crawling layer, a configuration module follows an iterative approach to ensure the campaign converges to a proper set of filters (keywords). Collected tweets, metadata and the community data (relationships among Twitter users) are stored in a graph database. This study should be highlighted for its distinction in allowing a flexible querying mechanism on top of a data model built on raw data. The model is generated in the integrated modeling layer and comprises a representation of associations between terms (e.g. hashtags) used in tweets and their evolution in time. Their approach is interesting as it captures the often overlooked temporal dimension. In the third layer, the data analysis layer, a query language is used to design a target view of the campaign data that corresponds to a set of tweets that contain, for example, the answer to an opinion mining question.
While including components for capture and storage of tweets, additional tools have been developed to search through the collected tweets. The architecture of CoalMine [72] presents a social network data mining system demonstrated on Twitter, designed to process large amounts of streaming social data. The ad-hoc query tool provides an end user with the ability to access one or more data files through a Google-like search interface. Appropriate support is provided for a set of boolean and logical operators for ease of querying on top of a standard Apache Lucene index. The data collection and storage component is responsible for establishing connections to the REST API and for storing the JSON objects returned in compressed
formats. support is provided for a
set
of boolean
and logical
operators
for ease to
of querying
on
In building
support
platforms
it is necessary
make provitop
of
a
standard
Apache
Lucene
index.
The
data
collection
sions for practical considerations such as processing big data.
and storage component
is responsible
for establishing
TrendMiner
[61] facilitates
real-time analysis
of tweets conand
nections
the REST API
and to and
store
the JSON
objects
takes intotoconsideration
scalability
efficiency
of processreturned
compressed
formats.
ing large in
volumes
of data.
TrendMiner makes an effort to
In building
platforms
is necessarytools
to make
proviunify
some support
of the existing
textit processing
for Online
sions
practical considerations
such as
processing
big data.
Socialfor
Networking
(OSN) data, with
emphasis
on adapting
TrendMiner
[61] facilitates
real-time
analysisbatches
of tweets
and
to
real-life scenarios
that include
processing
of miltakes
intodata.
consideration
scalability
efficiency
processlions of
They envision
the and
system
to be of
developed
ing
large batch-mode
volumes of data.
TrendMiner
makes
an effort
to
for both
and online
processing.
TwitIE
[17] as
unify someinofthe
the
existingsection
text processing
for Online
discussed
previous
is anothertools
open-source
inSocial
Networking
(OSN)
data, with
emphasisfor
onmicroblog
adapting
formation
extraction
NLP pipeline
customized
to real-life scenarios that include processing batches of miltext.
lions of data. They envision the system to be developed
for both
batch-mode and online Platforms
processing. TwitIE [17] as
3.4
Application-specific
discussed in the previous section is another open-source inApart from the above mentioned general purpose platforms,
formation extraction NLP pipeline customized for microblog
there are many frameworks targeted at conducting specific
text.
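The boolean querying over an index that is described above for CoalMine [72] can be illustrated with a minimal sketch. This is not CoalMine's implementation and it does not use Apache Lucene; it is a toy in-memory inverted index, and the function names (`build_index`, `search`), the whitespace tokenisation, and the sample tweets are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(tweets):
    """Map each lowercase token to the set of tweet ids that contain it."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for token in text.lower().split():
            index[token].add(tweet_id)
    return index


def search(index, all_ids, must=(), any_of=(), must_not=()):
    """Boolean query: all `must` terms, at least one `any_of` term (if given),
    and none of the `must_not` terms."""
    result = set(all_ids)
    for term in must:                      # AND
        result &= index.get(term.lower(), set())
    if any_of:                             # OR
        union = set()
        for term in any_of:
            union |= index.get(term.lower(), set())
        result &= union
    for term in must_not:                  # NOT
        result -= index.get(term.lower(), set())
    return sorted(result)


# Hypothetical sample data, keyed by tweet id.
tweets = {
    1: "flood warning issued for brisbane",
    2: "brisbane weather is sunny today",
    3: "flood relief volunteers needed",
}
index = build_index(tweets)
hits = search(index, tweets.keys(), must=["flood"], must_not=["brisbane"])
print(hits)
```

A production index would add proper tokenisation, ranking and persistence; the sketch only shows how boolean operators reduce to set operations on posting lists.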
3.4 Application-specific Platforms
Apart from the above mentioned general purpose platforms, there are many frameworks targeted at conducting specific types of analysis with Twitter data. Emergency Situation Awareness (ESA) [76] is a platform developed to detect, assess, summarise and report messages of interest published on Twitter for crisis coordination tasks. The objective of their work is to convert large streams of social media data into useful situation awareness information in real-time. The ESA platform consists of modules to detect incidents, condense and summarise messages, classify messages of high value, identify and track issues and finally to conduct forensic analysis of historical events. The modules are enriched by a suite of visualisation interfaces. Baldwin et al. [11] propose another support platform focused on detecting events on Twitter. The Twitter stream is queried with a set of keywords specified by the user with the objective of filtering the stream on a topic of interest. The results are piped through text processing components and the geo-located tweets are visualised on a map for better interaction. Clearly, platforms of this nature that deal with incident exploration need to make provisions for real-time analysis of the incoming Twitter stream and produce suitable visualizations of detected incidents.
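The pipeline described above for Baldwin et al. [11] (filter the stream by user-specified keywords, then keep the geo-located tweets for display on a map) can be sketched roughly as follows. The tweet dictionaries, the `coordinates` field and both function names are hypothetical; the sketch stands in for the idea only and is not the authors' system or the Twitter API.

```python
def filter_stream(stream, keywords):
    """Yield tweets whose text contains at least one of the given keywords."""
    wanted = {k.lower() for k in keywords}
    for tweet in stream:
        tokens = set(tweet["text"].lower().split())
        if tokens & wanted:
            yield tweet


def geo_points(tweets):
    """Keep only geo-located tweets as (coordinates, text) pairs for plotting."""
    return [(t["coordinates"], t["text"]) for t in tweets if t.get("coordinates")]


# Hypothetical stream; coordinates are (latitude, longitude) or None.
stream = [
    {"text": "Earthquake felt downtown", "coordinates": (-37.81, 144.96)},
    {"text": "earthquake drill tomorrow", "coordinates": None},
    {"text": "Lunch time", "coordinates": (0.0, 0.0)},
]
matched = list(filter_stream(stream, ["earthquake"]))
points = geo_points(matched)
print(points)
```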
3.5 Data Model and Storage Mechanisms
Data models are not discussed in detail in most studies, as a simple data model is sufficient to conduct basic forms of analysis. When standard tweets are collected, flat files [11, 72] are the preferred choice. Several studies that capture the social relationships [15, 19] of the Twittersphere employ the relational data model but do not necessarily store the relationships in a graph database. As a consequence, many analyses that can be performed conveniently on a graph are not captured by these platforms. Only TwitHoard [69] models co-occurrence of terms as a graph with temporally evolving properties. Twitter Zombie [15] and TwitHoard [69] should be highlighted for capturing interactions including the retweets and term associations apart from the traditional follower/friend social relationships. TrendMiner [61] draws explicit discussion on making provisions for processing millions of data items and takes advantage of the Apache Hadoop MapReduce framework to perform distributed processing of the tweets stored as key-value pairs. CoalMine [72] also has Apache Hadoop at the core of its batch processing component, which is responsible for efficient processing of large amounts of data.
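The MapReduce-style processing of tweets stored as key-value pairs, mentioned above for TrendMiner [61], can be illustrated with a single-process sketch. It does not use Hadoop and is not TrendMiner's code; the three phases (`map_tweet`, `shuffle`, `reduce_counts`) and the hashtag-counting task are assumptions chosen to show the programming model.

```python
from collections import defaultdict
from itertools import chain


def map_tweet(tweet_id, text):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one tweet."""
    return [(tok.lower(), 1) for tok in text.split()
            if tok.startswith("#") and len(tok) > 1]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the partial counts for one hashtag."""
    return key, sum(values)


# Hypothetical key-value store: tweet id -> tweet text.
tweets = {
    "t1": "Storm hits coast #storm #alert",
    "t2": "Stay safe #Storm",
    "t3": "no tags here",
}
mapped = chain.from_iterable(map_tweet(k, v) for k, v in tweets.items())
counts = dict(reduce_counts(k, vs) for k, vs in shuffle(mapped).items())
print(counts)
```

In a real deployment the map and reduce functions would run on separate workers and the shuffle would be handled by the framework; only the data flow is the same.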
3.6 Support for Visualization Interfaces
There are many platforms designed with integrated tools predominantly for visualization, to analyse data in spatial, temporal and topical perspectives. One tool is tweetTracker [44], which is designed to aid monitoring of tweets for humanitarian and disaster relief. TweetXplorer [54] also provides useful visualization tools to explore Twitter data. For a particular campaign, visualizations in TweetXplorer help analysts to view the data along different dimensions: most interesting days in the campaign (when), important users and their tweets (who/what) and important locations in the dataset (where). Systems like TwitInfo [48], Twitcident [8] and Torettor [65] also provide a suite of visualisation capabilities to explore tweets in different dimensions relating to specific applications like fighting fire and detecting earthquakes. Web-mashups like Trendsmap [5] and Twitalyzer [6] provide a web interface and enterprise business solutions to gain real-time trends and insights into user groups.
Table 1 illustrates an overview of related approaches and features of different platforms. Pre-processing in Table 1 indicates if any form of language processing task such as POS tagging or normalization is conducted. Information extraction refers to the types of post processing performed to infer additional information, such as sentiment or named entities (NEs). Multiple ticks correspond to a task that is carried out extensively. In addition to collecting tweets, some studies also capture the user's social graph while others propose the need to regard interactions of hashtags, retweets and mentions as separate properties. Backend data models supported by the platform shape the types of analysis that can be conveniently done on each framework. From the summary in Table 1, we can observe that each study on data management frameworks concentrates on a set of challenges more than others.
We aim to recognize the key ingredients of an integrated framework that takes into account shortcomings of existing systems.
4. LANGUAGES FOR QUERYING TWEETS
The goal of proposing declarative languages and systems for querying tweets is to put forward a set of primitives or an interface for analysts to conveniently query specific interactions on Twitter exploring the user, time, space and topical dimensions. High level languages for querying tweets extend capabilities of existing languages such as SQL and SPARQL. Queries are either executed on the Twitter stream in real-time or on a stored collection of tweets.
4.1 Generic Languages
TweeQL [49] provides a streaming SQL-like interface to the Twitter API and provides a set of user defined functions (UDFs) to manipulate data. The objective is to introduce a query language to extract structure and useful information that is embedded in unstructured Twitter data. The language exploits both relational and streaming semantics. UDFs allow for operations such as location identification, string processing, sentiment prediction, named entity extraction and event detection. In the spirit of streaming semantics, it provides SQL constructs to perform aggregations over the incoming stream on a user-specified time window. The result of a given query can be stored in a relational fashion for subsequent querying.
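The aggregation over a user-specified time window described above can be sketched in a few lines. The sketch does not reproduce TweeQL syntax or its UDFs; `windowed_counts`, the fixed-size (tumbling) window, and the assumption that tweets arrive in time order with a `location` field already extracted are illustrative choices.

```python
from collections import Counter


def windowed_counts(stream, window_seconds, key_fn):
    """Yield (window_start, Counter) for consecutive fixed-size time windows.

    Assumes `stream` yields (timestamp_seconds, tweet) pairs in
    non-decreasing timestamp order.
    """
    current_start = None
    counts = Counter()
    for ts, tweet in stream:
        start = ts - (ts % window_seconds)   # window this tweet falls into
        if current_start is None:
            current_start = start
        elif start != current_start:         # window closed: emit and reset
            yield current_start, counts
            current_start, counts = start, Counter()
        counts[key_fn(tweet)] += 1
    if current_start is not None:            # emit the last open window
        yield current_start, counts


# Hypothetical stream of (timestamp, tweet) pairs.
stream = [
    (0, {"location": "Boston"}),
    (20, {"location": "Tokyo"}),
    (59, {"location": "Boston"}),
    (61, {"location": "Tokyo"}),
]
results = list(windowed_counts(stream, 60, lambda t: t["location"]))
for start, counts in results:
    print(start, dict(counts))
```

Each emitted row (window start, key, count) could then be written to a relational table for subsequent querying, which mirrors the last step described above.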
Models for representing any social network in RDF have been proposed by Martin and Gutierrez [50] allowing queries in SPARQL. The work explores the feasibility of adoption of this model by demonstrating their idea with an illustrative prototype but does not focus on a single social network like Twitter in particular. TwarQL [53] extracts content from tweets and encodes it in RDF format using shared and well known vocabularies (FOAF, MOAT, SIOC) enabling querying in SPARQL. The extraction facility processes plain tweets and expands its description by adding sentiment annotations, DBPedia entities, hashtag definitions and URLs. The annotation of tweets using different vocabularies enables querying and analysis in different dimensions such as location, users, sentiment and related named entities. The infrastructure of TwarQL enables subscription to a stream that matches a given query and returns streaming annotated data in real-time.
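Encoding a tweet as RDF-style triples and matching a triple pattern against them, as described above for TwarQL [53], can be sketched without any RDF library. This is not TwarQL's actual schema: the `ex:` identifiers and the `ex:hasTag` predicate are hypothetical, the SIOC terms are used only as familiar examples of a shared vocabulary, and `match` is a stand-in for a single SPARQL triple pattern.

```python
def tweet_to_triples(tweet):
    """Encode one tweet as (subject, predicate, object) triples."""
    subject = f"ex:tweet/{tweet['id']}"
    triples = [
        (subject, "rdf:type", "sioc:Post"),
        (subject, "sioc:content", tweet["text"]),
        (subject, "sioc:has_creator", f"ex:user/{tweet['user']}"),
    ]
    for tag in tweet.get("hashtags", []):
        # ex:hasTag is a made-up predicate used only for this illustration.
        triples.append((subject, "ex:hasTag", f"ex:tag/{tag.lower()}"))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Hypothetical annotated tweet.
tweet = {"id": 42, "user": "alice",
         "text": "Heavy rain in Colombo #Flood", "hashtags": ["Flood"]}
triples = tweet_to_triples(tweet)
tagged = [s for s, _, _ in match(triples, p="ex:hasTag", o="ex:tag/flood")]
print(tagged)
```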
is not treated any different from other metadata
Temporal
topical
features are
of paramount
importance
reported. and
Topics
are regarded
as part
of the tweet
content
in an
evolving
stream
Twitter.
In theAPI.
lanor
what
drivesmicroblogging
the data filtering
tasklike
from
the Twitter
guages
above,
time
and topic
of a tweet
(topic
can
repreThere have
been
efforts
to exploit
features
that
gobe
well
besented
a hashtag)
aretime
considered
meta-data
of the
yond asimply
simpleby
filter
based on
and topic.
Plachouras
tweetStavrakas
and is not[59]
treated
other modelling
metadata
and
stressany
thedifferent
need forfrom
temporal
reported.
aretoregarded
as part
of the
tweet content
of terms inTopics
Twitter
effectively
capture
changing
trends.
or
what refers
drives to
theany
data
filtering
task phrase
from the
A term
word
or short
of Twitter
interest API.
in a
There have
been efforts
to or
exploit
features
go recogniwell betweet,
including
hashtags
output
of anthat
entity
yond
a simple Their
filter proposed
based on query
time and
topic. can
Plachouras
tion process.
operators
express
and Stavrakas
the need
for temporal
modelling
complex
queries[59]
forstress
associations
between
terms over
varyof terms in Twitter to effectively capture changing trends.
A term refers to any word or short phrase of interest in a
tweet, including hashtags or output of an entity recognition process. Their proposed query operators can express
complex queries for associations between terms over vary-
Volume 16, Issue 1
Page 14
Table 1: Overview of related approaches in data management frameworks.
Prepossessing
Examples of
Social and/or other
Data Store
extracted information
interactions captured?
TwitterEcho [16]
Language
Yes
Not given
Byun et al. [19]
Location
Yes
Relational
Table 1: Overview
of related approaches in data management
Twitter Zombie [15]
Yesframeworks.
Relational
Prepossessing
Examples of
Social and/or
Data
TwitHoard [69]
Yes other
GraphStore
DB
extracted information
interactions
CoalMine [72]
Nocaptured?
Files
TwitterEcho
[16]
Language
Yes
Not given
TrendMiner [61]
Location,
Sentiment, NEs
No
Key-value
pairs
Byun
et
al.
[19]
Location
Yes
Relational
TwitIE [17]
Language, Location, NEs
No
Not given
Twitter
Yes
Relational
ESA [76]Zombie [15]
Location, NEs
No
Not given
TwitHoard
[69]
Yes
Graph
DB
Baldwin et al. [11]
Language, Location
No
Flat files
CoalMine [72]
No
Files
TrendMiner [61]
Location, Sentiment, NEs
No
Key-value pairs
[17] to discover context
of collected
Language,
NEs
No
Not given
ing time TwitIE
granularities,
data.Location,
graph
database to demonstrate
the feasibility
of the model.
ESA
[76]
Location,
NEs
No
given
Operators also allow retrieving a subset of tweets satisfying
languages that operate on the twitter Not
stream
like TweeQL
Baldwin
et al. [11]
No the output inFlat
files TweeQL
these complex
conditions
on term associations.
ThisLanguage,
enables Location
and TwarQL generates
real-time;
the end user to select a good set of terms (hashtags) that
[49] allows the resulting tweets to be collected in batches
drive the data collection which has a direct impact on the
then stores them in a relational database, while TwarQL [53]
ing
timeofgranularities,
to discover
context
of collected data.
graph
to information
demonstrateextraction
the feasibility
of the
model.
quality
the results generated
from
the analysis.
at
the database
end of the
phase,
annotated
Operators
also allow
retrieving
a subsetofoftweets
tweetsoften
satisfying
languages
operate
on the twitter stream like TweeQL
tweets are that
encoded
in RDF.
Spatial features
are another
property
overthese
conditions
on term
associations.
This
enables
and TwarQL generates the output in real-time; TweeQL
lookedcomplex
in complex
analysis.
Previously
discussed
work
uses
the
user attribute
to select aasgood
set of terms
that
[49] allows
the Languages
resulting tweets
be collected
in batches
4.3
Query
fortoSocial
Networks
the end
location
a mechanism
to (hashtags)
filter tweets
in
drive
the
data
collection
which
has
a
direct
impact
on
the
then
stores
them
in
a
relational
database,
while
TwarQL
[53]
space. To complete our discussion we briefly outline two
To the best of our knowledge, there is no existing work
quality
theuse
results
generated
from thetoanalysis.
at
the
end
of
the
information
extraction
phase,
annotated
studies of
that
geo-spatial
properties
perform complex
focusing on high level languages operating on the Twittweetssocial
are encoded
RDF. it is important to note proSpatial
another
propertyDoytsher
of tweetsetoften
overanalysisfeatures
using theare
location
attribute.
al. [31]
inter’s
graph. inHowever
looked
in acomplex
analysis.
discussed
work uses
troduced
model and
queryPreviously
language suited
for integrated
posals for declarative query languages tailored for querying
4.3
Query Languages
Social
the
attribute
a mechanism
filter
tweetsnetin
datalocation
connecting
a socialasnetwork
of users to
with
a spatial
social networks
in general [10,for
30, 32,
50, 51,Networks
64]. One of the
space.
complete
discussion
we briefly
two
work to To
identify
placesour
visited
frequently.
Edges outline
named lifeTo the supported
best of ourareknowledge,
there
is no existing
queries
path queries
satisfying
a set of work
constudies
properties
to perform
complex
patternsthat
are use
usedgeo-spatial
to associate
the social
and spatial
netfocusing
high
level
operating
on the
Twitditions ononthe
path,
andlanguages
the languages
in general
take
adanalysis
the location
attribute. Doytsher
et al. [31] for
inworks. using
Different
time granularities
can be expressed
ter’s social
graph. properties
However it
is important
to Semantics
note provantage
of inherent
of social
networks.
troduced
a model
andrepresented
query language
suited
for integrated
each visited
location
by the
life-pattern
edge.
posals
declarative
tailored
for querying
of
the for
languages
are query
based languages
on Datalog
[51], SQL
[32, 64]
data
social network ofemploys
users with
a spatial synnetEven connecting
though thea implementation
a partially
social
networks in general
[10, 30, 32, 50, 51,
64].
One of the
or RDF/SPARQL
[50]. Implementations
are
conducted
on
work
identifyitplaces
visited
frequently.
Edges named
thetictodataset,
will be
interesting
to investigate
how lifethe
queries supported
are path
satisfying
a set ofsocial
conbibliographical
networks
[32],queries
Facebook
and evolving
patterns
are networks
used to associate
social and
spatial
socio-spatial
and the the
life-pattern
edges
that netare
ditions
thelike
path,
and the
languages
general
take adcontent on
sites
Yahoo!
Travel
[10] andinare
not tested
on
works.
Differentthe
time
granularities
be expressed
for
used to associate
spatial
and socialcan
networks
can be repvantage of
inherenttaking
properties
of social
networks.
Semantics
Twitter
networks
Twitter
specific
affordances
into
each
visited
by thewith
life-pattern
edge.
resented
in a location
real socialrepresented
network dataset
location inforof the languages are based on Datalog [51], SQL [32, 64]
consideration.
Even
though
implementation
employs
a partially
synmation,
such the
as Twitter.
GeoScope
[18] finds
information
or RDF/SPARQL [50]. Implementations are conducted on
thetic
it will
be interesting
to investigate
how the
trends dataset,
by detecting
significant
correlations
among trending
bibliographical
networks
[32], Facebook
and evolving
4.4
Information
Retrieval
- Tweet
Searchsocial
socio-spatial
the life-pattern
edges
location-topicnetworks
pairs in aand
sliding
window. This
givesthat
riseare
to
content
sites
like
Yahoo!
Travel
[10]
and
are
not tested
on
Another class of systems presents textual queries
to effiused
to associateofthe
spatial and
caninformabe repthe importance
capturing
the social
notionnetworks
of spatial
Twitter
networks
taking
Twitter
specific
affordances
into
ciently search over a corpus of tweets. The challenges in this
resented
in ainreal
social
networkindataset
with
location
infortion trends
social
networks
analysis
tasks.
Real-time
consideration.
area are similar to that of information retrieval in addition
mation,
as Twitter.
GeoScope
[18] in
finds
information
detectionsuch
of crisis
events from
a location
space,
exhibits
with having to deal with peculiarities of tweets. The short
trends
by
detecting
significant
correlations
among
trending
the possible value of Geoscope. In one of the experiments
4.4
Retrieval
- Tweet
length Information
of tweets in particular
creates
added Search
complexity to
location-topic
a sliding
This gives
rise to
Twitter is usedpairs
as aincase
study window.
to demonstrate
its usefulAnother
class
of
systems
presents
textual
queries relevant
to effitext-based
search
tasks
as
it
is
difficult
to
identify
the
of capturing
the to
notion
of spatial
informaness,importance
where a hashtag
is chosen
represent
the topic
and
ciently matching
search over
a corpus
of[13,35].
tweets. Expanding
The challenges
in conthis
tweets
a
user
query
tweet
tion
trendswhich
in social
networks
in analysis
tasks.
Real-time
city from
the tweet
originates
chosen
to capture
the
area
are
similar to
that
oftoinformation
retrieval
in The
addition
tent
is
suggested
as
a
way
enhance
the
meaning.
goal
detection
location. of crisis events from a location in space, exhibits
with
having
to deal
with peculiarities
of tweets. need
The in
short
of such
systems
is to express
a user’s information
the
the possible value of Geoscope. In one of the experiments
lengthof of
tweets text
in particular
creates
added
complexity
to
form
a
simple
query,
much
like
in
search
engines,
and
Twitter
is
used
as
a
case
study
to
demonstrate
its
useful4.2 Data Model for the Languages
text-based
search
tasks
as it is difficult
to identify
relevant
return
a
tweet
list
in
real-time
with
effective
strategies
for
ness, where a hashtag is chosen to represent the topic and
tweets matching
a user measurements
query [13,35]. Expanding
conranking
and relevance
[33, 34, 71]. tweet
Indexing
Relational,
RDF the
andtweet
Graphs
are the chosen
most common
choices
city
from which
originates
to capture
the
tent is suggested
as a way to
enhance
the meaning.
The goal
mechanisms
are
discussed
in
[23]
as
they
directly
impact
efof
data
representation.
There
is
a
close
affiliation
in
these
location.
of
suchretrieval
systems of
is to
express
a user’s
need track
in the5
ficient
tweets.
The
TRECinformation
micro-blogging
data models observing that, for instance, a graph can easily
form
of a simple
text query,
much like
searchreal-time
engines, and
is
dedicated
to calling
participants
to in
conduct
adcorrespond to a set of RDF triples or vice versa. In fact,
4.2
Data Model
for theand
Languages
return
a tweet
in areal-time
withcollection.
effective strategies
for
hoc search
taskslist
over
given tweet
Publications
some studies
like Plachouras
Stavrakas [59] have put
ranking
and
relevance
measurements
[33,
34,
71].
Indexing
Relational,
RDF
and
Graphs
are
the
most
common
choices
of
TREC
[56],
documents
the
findings
of
all
systems
in
the
forward their data model as a labeled multi digraph and have
mechanisms
are discussed
[23] as tweets
they directly
impact
efof
data arepresentation.
Therefor
is its
a close
affiliation in None
these
task
of ranking
the most in
relevant
matching
a prechosen
relational database
implementation.
5
ficient
of tweets.
data
models
observing
for Twitter
instance,social
a graph
can easily
definedretrieval
set of user
queries.The TREC micro-blogging track
of these
query
systems that,
models
network
with
is dedicated
to calling
participants
to conduct
real-time
adcorrespond
a set relationships
of RDF triples
or vice
versa.
In fact,
following or to
retweet
among
users.
Doytsher
et
Table
2 illustrates
an overview
of related
approaches
in syshoc
tasks over
a given
tweet
collection.
Publications
some
like Plachouras
and query
Stavrakas
[59] have
al. [31]studies
implement
their algebraic
operators
with put
the
temssearch
for querying
tweets.
Data
models
and dimensions
inof TREC [56], documents the findings of all systems in the
forward
their
dataand
model
as a labeled
multi digraph
and have
use of both
graph
a relational
database
as the underlying
5
task
of ranking the most relevant tweets matching a prechosen
a relational
for its compare
implementation.
http://trec.nist.gov/
data storage.
They database
experimentally
relationalNone
and
defined set of user queries.
of these query systems models Twitter social network with
following or retweet relationships among users. Doytsher et
Table 2 illustrates an overview of related approaches in sysal. [31] implement their algebraic query operators with the
tems for querying tweets. Data models and dimensions inuse of both graph and a relational database as the underlying
5
http://trec.nist.gov/
data storage. They experimentally compare relational and
SIGKDD Explorations
Volume 16, Issue 1
Page 15
Table 2: Overview of approaches in systems for querying tweets.
(Data Model columns: Relational, RDF, Graph. Explored dimensions columns: Text, Time, Space, Social Network, Real-Time.)
System | Relational | RDF | Graph | Text | Time | Space | Social Network | Real-Time
TweeQL [49] | | | | | | | | Yes
TwarQL [53] | | | | | | | | Yes
Plachouras et al. [60] | | | | | | | | No
Doytsher et al. [31]∗ | | | | | | | | No
GeoScope et al. [18]∗ | | | | | | | | Yes
Languages on social networks∗ | | | | | | | | No
Tweet search systems | | | | | | | | Yes
Data models and dimensions investigated in each system are depicted in Table 2. Systems that have made provision for the real-time streaming nature of the tweets are indicated in the Real-time column. Multiple ticks correspond to a dimension explored in detail. Note that the systems marked with an asterisk (*) are not implemented specifically targeting tweets, though their application is meaningful and can be extended to the Twittersphere. We observe potential for developing languages for querying tweets that include querying by dimensions that are not captured by existing systems, especially the social graph.
5. THE NEED FOR INTEGRATED SOLUTIONS
There is a need to assimilate individual efforts with the goal of providing a unified framework that can be used by researchers and practitioners across many disciplines. Integrated solutions should ideally handle the entire workflow of the data analysis life cycle, from collecting the tweets to presenting the results to the user. The literature we have reviewed in previous sections outlines efforts that support different parts of the workflow. In this section, we present our position with the aim of outlining significant components of an integrated solution addressing the limitations of existing systems.
According to a review of literature conducted on the microblogging platform [25], the majority of published work on Twitter concentrates on the user domain and the message domain. The user domain explores properties of Twitter users in the microblogging environment, while the message domain deals with properties exhibited by the tweets themselves. In comparison to the extent of work done on the microblogging platform, only a few studies investigate the development of data management frameworks and query languages that describe and facilitate processing of online social networking data. In consequence, there is opportunity for improvement in this area for future research addressing the challenges in data management. We elicit the following high-level components and envisage a platform for Twitter that encompasses such capabilities:
Focused crawler: Responsible for retrieval and collection of Twitter data by crawling the publicly accessible Twitter APIs. A focused crawler should allow the user to define a campaign with suitable filters, monitor output and iteratively crawl Twitter for large volumes of data until its coverage of relevant tweets is satisfactory.
Pre-processor: As highlighted in Section 3.2, this stage usually comprises modules for pre-processing and information extraction considering the inherent peculiarities of tweets, and not all frameworks we discussed provided this functionality. Pre-processing components, for example normalization and tokenization, should be implemented, with the option for the end users to customize the modules to suit their requirements. Information extraction modules such as location prediction, sentiment classification and named entity recognition are useful in conducting analysis on tweets and attempt to derive more information from plain tweet text and their metadata. Ideally, a user should be able to integrate any combination of the components into their own applications.
Data model: Much of the literature presented (Section 3.5 and 4.2) does not emphasize or draw explicit discussions on the data model in use. The logical data model greatly influences the types of analysis that can be done with relative ease on collected data. A physical representation of the model involving suitable indexing and storage mechanisms of large volumes of data is an important consideration for efficient retrieval. We notice that current research pays little attention to queries on Twitter interactions, the social graph in particular. A graph view of the Twittersphere is consistently overlooked and we recognize great potential in this area. The graph construction on Twitter is not limited to considering the users as nodes and links as following relationships; embracing useful characteristics such as retweet, mention and hashtag (co-occurrence) networks in the data model will create opportunity to conduct complex analysis on these structural properties of tweets. We envision a new data model that proactively captures structural relationships, taking into consideration efficient retrieval of relevant data to perform the queries.
Query language: Languages described in Section 4 define both simple operators to be applied on tweets and advanced operators that extract complex patterns, which can be manipulated in different types of applications. Some languages provide support for continuous queries on the stream or queries on a stored collection, while others offer flexibility for both. The advent of a graph view makes crucial contributions in analyzing the Twittersphere, allowing us to query Twitter data in novel and varying forms. It will be interesting to investigate how typical functionality [73] provided by graph query languages can be adapted to Twitter networks. In developing query languages, one could investigate the distinction between real-time and batch mode processing. Visualizing the data retrieved as a result of a query in a suitable manner is also an important concern. A set of pre-defined output formats will be useful in order to provide an informative visualization over a map for a query that returns an array of locations. Another interesting avenue to explore is the introduction of a ranking mechanism on the query result. Ranking criteria may involve relevance, timeliness or network attributes like the reputation of users in the case of a social graph. Ranking functions are a standard requirement in the field of information retrieval [23,39] and studies like SociQL [30] report the use of visibility and reputation metrics to rank results generated from a social graph. A query language with a graph view of the Twittersphere along with capabilities for visualizations and ranking will certainly benefit the upcoming data analysis efforts of Twitter.
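As an illustration of the ranking mechanism discussed above, the following minimal sketch combines the three criteria named in the text (relevance, timeliness and user reputation) into one score. The field names, the weights, the 24-hour half-life and the use of follower counts as a reputation proxy are all assumptions made for this example; they are not taken from SociQL [30] or any other reviewed system.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Result:
    tweet_id: str
    relevance: float        # assumed to be pre-computed and scaled to [0, 1]
    created_at: datetime
    author_followers: int   # hypothetical stand-in for user reputation


def score(result, now, half_life_hours=24.0, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of relevance, timeliness and reputation (illustrative weights)."""
    age_hours = max((now - result.created_at).total_seconds() / 3600.0, 0.0)
    timeliness = 0.5 ** (age_hours / half_life_hours)   # halves every half-life
    reputation = min(math.log1p(result.author_followers) / math.log1p(1_000_000), 1.0)
    w_rel, w_time, w_rep = weights
    return w_rel * result.relevance + w_time * timeliness + w_rep * reputation


def rank(results, now):
    """Return the results ordered from highest to lowest score."""
    return sorted(results, key=lambda r: score(r, now), reverse=True)


now = datetime(2014, 6, 1, tzinfo=timezone.utc)
results = [
    Result("old", 0.9, now - timedelta(hours=48), 1000),
    Result("weak", 0.1, now, 10),
    Result("fresh", 0.9, now, 1000),
]
ranked_ids = [r.tweet_id for r in rank(results, now)]
```

A query language could expose such a function through an ordering clause; how the weights should be chosen is an open design question that this sketch does not address.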
Here, we focused on the key ingredients required for a fully developed solution and discussed improvements we can make on existing literature. In the next section, we identify challenges and research issues involved.
6. CHALLENGES AND RESEARCH ISSUES
To complete our discussion, in this section we summarize key research issues in data management and present technical challenges that need to be addressed in the context of building a data analytics platform for Twitter.
6.1 Data Collection Challenges
Once a suitable Twitter API has been identified, we can define a campaign with a set of parameters. The focused crawler can be programmed to retrieve all tweets matching the query of the campaign. If a social graph is necessary, separate modules would be responsible for creating this network iteratively. Exhaustively crawling all the relationships between Twitter users is prohibitive given the restrictions set by the Twitter API. Hence the focused crawler is required to prioritize the relationships to crawl based on the impact and importance of specific Twitter accounts. In the case that the platform handles multiple campaigns in parallel, there is a need to optimize the access to the API. Typically, the implementation of a crawler should aim to minimize the number of API requests, considering the restrictions, while fetching data for many campaigns in parallel. Hence building an effective crawling strategy that optimizes the use of the available API requests is a challenging task.
Appropriate coverage of the campaign is another significant concern and denotes whether all the relevant information has been collected. When specifying the parameters to define the campaign, a user needs a very good knowledge of the relevant keywords. Depending on the specified keywords, a collection may miss relevant tweets in addition to the tweets removed due to restrictions by APIs. The work of Plachouras and Stavrakas [60] is an initial step in this direction as it investigates this notion of coverage and proposes mechanisms to automatically adapt the campaign to evolving hashtags.
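The two scheduling concerns raised in this subsection, sharing a limited number of API requests among parallel campaigns and crawling the most important accounts first, can be sketched as follows. This is only an illustration under stated assumptions: the request budget, the campaign weights and the importance scores are hypothetical inputs, and no actual Twitter API call or rate limit is modelled.

```python
import heapq


def split_budget(campaign_weights, budget):
    """Divide a request budget among campaigns in proportion to their weights.

    Uses largest-remainder rounding so that the allocations sum to the budget.
    """
    total = sum(campaign_weights.values())
    shares = {c: budget * w / total for c, w in campaign_weights.items()}
    alloc = {c: int(s) for c, s in shares.items()}
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda c: shares[c] - alloc[c], reverse=True)
    for campaign in by_remainder[:leftover]:
        alloc[campaign] += 1
    return alloc


def plan_crawl(account_importance, budget):
    """Pick the accounts whose relationships are crawled first.

    account_importance maps an account name to a hypothetical importance score.
    """
    heap = [(-score, name) for name, score in account_importance.items()]
    heapq.heapify(heap)
    plan = []
    while heap and len(plan) < budget:
        _, name = heapq.heappop(heap)
        plan.append(name)
    return plan


allocation = split_budget({"elections": 3, "weather": 1}, budget=15)
first_accounts = plan_crawl({"x": 5, "y": 50, "z": 20}, budget=2)
```

In a real crawler the importance scores would have to be estimated from data already collected, and the plan would be recomputed as each rate-limit window expires.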
6.2 Pre-processing Challenges
Many problems associated with summarization, topic detection and part-of-speech (POS) tagging, in the case of well-formed documents, e.g. news articles, have been extensively studied in the literature. Traditional named entity recognizers (NERs) heavily depend on local linguistic features [62] of well-formed documents like capitalization and POS tagging of previous words. None of these characteristics hold for tweets, whose short utterances are limited to 140 characters and make use of informal language, undoubtedly making a simple task such as POS tagging more challenging. Besides the length limit, heavy and inconsistent usage of abbreviations, capitalizations and uncommon grammar constructions pose additional challenges to text processing. With the volume of tweets being orders of magnitude more than news articles, most of the conventional methods cannot be directly applied to the noisy genre of Twitter data. Any effort that uses Twitter data needs to make use of appropriate Twitter-specific strategies to pre-process text, addressing the challenges associated with intrinsic properties of tweets.
Similarly, information extraction from tweets is not straightforward as it is difficult to derive context and topics from a tweet that is a scattered part of a conversation. There is separate literature on identifying entities (references to organizations, places, products, persons) [47,63], languages [20,36] and sentiment [57] present in the tweet text for a richer source of information. Location is another vital property representing spatial features either of the tweet or of the user. The location of each tweet may be optionally recorded if using a GPS-enabled device. A user can also specify his or her location as a part of the user profile, and this is often reported at varying granularities. The drawback is that only a small portion of about 1% of the tweets are geo-located [24]. Since analysis almost always requires the location property, when it is absent, studies apply their own mechanisms to infer the location of the user, a tweet, or both. There are two major approaches for location prediction: content analysis with probabilistic language models [24, 27, 37] or inference from social and other relations [22, 29, 67].
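To make the pre-processing step concrete, the sketch below shows one possible Twitter-specific tokenizer and normalizer. It is an assumption-laden illustration and not the method of any reviewed system: the regular expression, the placeholder tokens `<url>` and `<user>`, and the rule that squeezes elongated words are all choices made for this example.

```python
import re

# Order matters: URLs first, then mentions and hashtags, then plain words.
TOKEN_RE = re.compile(
    r"""
    https?://\S+        # URLs
  | [@\#]\w+            # mentions and hashtags
  | \w+(?:'\w+)?        # words, optionally with an inner apostrophe
    """,
    re.VERBOSE,
)


def normalize(token):
    """Map a raw token to a normalized form (illustrative rules only)."""
    if token.startswith(("http://", "https://")):
        return "<url>"
    if token.startswith("@"):
        return "<user>"
    token = token.lower()
    # Squeeze runs of three or more repeated characters, e.g. "soooo" -> "soo".
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def preprocess(text):
    """Tokenize a tweet and normalize every token."""
    return [normalize(t) for t in TOKEN_RE.findall(text)]


tokens = preprocess("Soooo happy!!! @bob check http://t.co/x #Win")
```

Keeping hashtags as tokens while masking URLs and user names is one of several reasonable designs; a customizable platform would let end users switch each rule on or off.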
6.3 Data Management Challenges
In Section 3.5 and Section 4.2 we outlined several alternative approaches in literature for a data model to characterise the Twittersphere. The relational and RDF models are frequently chosen, while graph-based models are acknowledged but not realized concretely at the implementation phase. Ways in which we can apply graph data management in Twitter are extremely diverse and interesting; different types of networks can be constructed apart from the traditional social graph, as outlined in Section 5. The advent of a graph view to model the Twittersphere gives rise to a range of queries that can be performed on the structure of the tweets, essentially capturing a wider range of use case scenarios used in typical data analytics tasks.
As discussed in Section 4.3, there are already languages similar to SQL adapted to social networks. Many of the techniques in literature are for the case of generic social networks under a number of specific assumptions. For example, the social networks satisfy properties such as the power law distribution, sparsity and small diameters [52]. We envision queries that take another step further and execute on Twitter graphs. Simple query languages FQL [1] and YQL [7] provide features to explore properties of Facebook and Yahoo APIs but are limited to querying only part (usually a single user's connections) of the large social graph. As pointed out in the report on the Databases and Web 2.0 Panel at VLDB 2007 [9], understanding and analyzing trust, authority, authenticity, and other quality measures in social networks pose major research challenges. While there are investigations on these quality measures in Twitter, it's about time that we enable declarative querying of networks using such measurements.
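The networks other than the follower graph that are mentioned above (retweet, mention and hashtag co-occurrence networks) can be derived from tweet records as weighted edge lists. The following sketch assumes a simplified, hypothetical record layout with the keys `author`, `mentions`, `retweet_of` and `hashtags`; real API payloads are structured differently and would need to be mapped to this form first.

```python
from collections import Counter
from itertools import combinations


def build_networks(tweets):
    """Return weighted edge counters for three interaction networks."""
    mention_edges = Counter()   # directed: author -> mentioned user
    retweet_edges = Counter()   # directed: retweeting user -> original author
    hashtag_edges = Counter()   # undirected: pair of co-occurring hashtags
    for tweet in tweets:
        author = tweet["author"]
        for user in tweet.get("mentions", []):
            mention_edges[(author, user)] += 1
        if tweet.get("retweet_of"):
            retweet_edges[(author, tweet["retweet_of"])] += 1
        # Sort the distinct tags so each undirected pair has one canonical key.
        tags = sorted({h.lower() for h in tweet.get("hashtags", [])})
        for a, b in combinations(tags, 2):
            hashtag_edges[(a, b)] += 1
    return mention_edges, retweet_edges, hashtag_edges


sample = [
    {"author": "alice", "mentions": ["bob"], "hashtags": ["Storm", "NYC"]},
    {"author": "carol", "retweet_of": "alice", "hashtags": ["nyc", "storm"]},
    {"author": "alice", "mentions": ["bob"], "hashtags": ["storm"]},
]
mentions, retweets, cooccurrence = build_networks(sample)
```

Edge weights of this kind are one possible input for the quality measures discussed above, although the text does not prescribe how such measures should be computed.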
One of the predominant challenges is the management of large graphs that inevitably result from modeling users, tweets and their properties as graphs. With the large volume of data involved in any practical task, a data model should be information rich, yet a concise representation that enables expression of useful queries. Queries on graphs should be optimized for large networks and should ideally run independent of the size of the graph. There are already approaches that investigate efficient algorithms on very large graphs [41–43, 66]. Efficient encoding and indexing mechanisms should be in place, taking into account variations of indexing systems already proposed for tweets [23] and indexing of graphs [75] in general. We need to consider maintaining indexes for tweets, keywords, users and hashtags for efficient access of data in advanced queries. In certain situations it may be impractical to store the entire raw tweet and the complete user graph, and it may be desirable to either compress or drop portions of the data. It is important to investigate which properties of tweets and the graph should be compressed.
Besides the above challenges, tweets impose general research issues related to big data. Challenges should be addressed in the same spirit as any other big data analytics task. In the face of challenges posed by large volumes of data being collected, the NoSQL paradigm should be considered as an obvious choice for dealing with them. Developed solutions should be extensible for upcoming requirements and should indeed scale well. When collecting data, a user needs to consider scalable crawling as there are large volumes of tweets received, processed and indexed per second. With respect to implementation, it is necessary to investigate paradigms that scale well, like MapReduce, which is optimized for offline analytics on large data partitioned on hundreds of machines. Depending on the complexity of the queries supported, it might be difficult to express graph algorithms intuitively in MapReduce graph models; consequently, databases such as Titan [4], DEX [3], and Neo4j [2] should be compared for graph implementations.
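The indexes for tweets, keywords, users and hashtags mentioned above can be pictured as a single inverted index keyed by field and value. The sketch below is an in-memory illustration only; the class name, its methods and the whitespace tokenization are assumptions for this example, and it does not address compression, persistence or distribution.

```python
from collections import defaultdict


class TweetIndex:
    """Minimal in-memory inverted index over users, keywords and hashtags."""

    def __init__(self):
        # Maps (field, value) to the set of tweet ids containing that value.
        self._postings = defaultdict(set)

    def add(self, tweet_id, author, text, hashtags=()):
        self._postings[("user", author.lower())].add(tweet_id)
        for word in text.lower().split():
            self._postings[("keyword", word)].add(tweet_id)
        for tag in hashtags:
            self._postings[("hashtag", tag.lower())].add(tweet_id)

    def lookup(self, **criteria):
        """Return ids of tweets matching all given field=value criteria."""
        sets = [
            self._postings.get((field, value.lower()), set())
            for field, value in criteria.items()
        ]
        return set.intersection(*sets) if sets else set()


index = TweetIndex()
index.add("1", "alice", "storm hits coast", hashtags=["Weather"])
index.add("2", "bob", "storm warning", hashtags=["weather"])
storm_by_alice = index.lookup(keyword="storm", user="Alice")
```

Conjunctive queries reduce to set intersections over posting lists, which is the property that makes such indexes attractive for the advanced queries discussed above.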
7. CONCLUSION
In this paper we addressed issues around the big data nature of Twitter analytics and the need for new data management and query language frameworks. By conducting a careful and extensive review of the existing literature we observed ways in which researchers have tried to develop general platforms to provide a repeatable foundation for Twitter data analytics. We reviewed the tweet analytics space by exploring mechanisms primarily for data collection, data management and languages for querying and analyzing tweets. We have identified the essential ingredients required for a unified framework that addresses the limitations of existing systems. The paper outlines research issues and identifies some of the challenges associated with the development of such integrated platforms.
8. REFERENCES
[1] FaceBook Query Language(FQL) overview. https://developers.facebook.com/docs/technical-guides/fql.
[2] Neo4j: The world's leading graph database. http://www.neo4j.org/.
[3] Sparksee: Scalable high-performance graph database. http://www.sparsity-technologies.com/.
[4] Titan: distributed graph database. http://thinkaurelius.github.io/titan.
[5] TrendsMap, Realtime local twitter trends. http://trendsmap.com/.
[6] Twitalyzer: Serious analytics for social business. http://twitalyzer.com.
[7] Yahoo! Query Language guide on YDN. https://developer.yahoo.com/yql/.
[8] F. Abel, C. Hauff, and G. Houben. Twitcident: fighting fire with information from social web streams. In WWW, pages 305–308, 2012.
[9] S. Amer-Yahia, V. Markl, A. Halevy, A. Doan, G. Alonso, D. Kossmann, and G. Weikum. Databases and Web 2.0 panel at VLDB 2007. In SIGMOD Record, volume 37, pages 49–52, Mar. 2008.
[10] S. Amer-Yahia, L. V. Lakshmanan, and C. Yu. SocialScope: Enabling information discovery on social content sites. In CIDR, 2009.
[11] T. Baldwin, P. Cook, and B. Han. A support platform for event detection using social intelligence. In Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 69–72, 2012.
[12] L. Barbosa and J. Feng. Robust sentiment detection on Twitter from biased and noisy data. pages 36–44, Aug. 2010.
[13] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: interactive topic-based browsing of social status streams. In 23rd annual ACM symposium on User interface software and technology - UIST, pages 303–312, Oct. 2010.
[14] A. Bifet and E. Frank. Sentiment knowledge discovery in twitter streaming data. Discovery Science. Springer Berlin Heidelberg, pages 1–15, Oct. 2010.
[15] A. Black, C. Mascaro, M. Gallagher, and S. P. Goggins. Twitter Zombie: Architecture for capturing, socially transforming and analyzing the Twittersphere. In International conference on Supporting group work, pages 229–238, 2012.
[16] M. Boanjak and E. Oliveira. TwitterEcho - A distributed focused crawler to support open research with twitter data. In International conference companion on World Wide Web, pages 1233–1239, 2012.
[17] K. Bontcheva and L. Derczynski. TwitIE: an open-source information extraction pipeline for microblog text. In International Conference on Recent Advances in Natural Language Processing, 2013.
[18] C. Budak, T. Georgiou, and D. E. Abbadi. GeoScope: Online detection of geo-correlated information trends
in
social
networks. PVLDB,
7(4):229–240,
2013.
text.
In International
Conference
on Recent
Advances
in Natural Language Processing, 2013.
[18] C. Budak, T. Georgiou, and D. E. Abbadi. GeoScope:
Online detection of geo-correlated information trends
in social networks. PVLDB, 7(4):229–240, 2013.
Volume 16, Issue 1
SIGKDD Explorations
What is Tumblr: A Statistical Overview and Comparison
Yi Chang†, Lei Tang§, Yoshiyuki Inagaki†, Yan Liu‡
† Yahoo Labs, Sunnyvale, CA 94089
§ @WalmartLabs, San Bruno, CA 94066
‡ University of Southern California, Los Angeles, CA 90089
yichang@yahoo-inc.com, leitang@acm.org, inagakiy@yahoo-inc.com, yanliu.cs@usc.edu
ABSTRACT
Tumblr, one of the most popular microblogging platforms, has
gained momentum recently. It is reported to have 166.4 million
users and 73.4 billion posts as of January 2014. While many articles about Tumblr have been published in the major press, there has
been little scholarly work so far. In this paper, we provide a pioneering
analysis of Tumblr from a variety of aspects. We study the social
network structure among Tumblr users, analyze its user-generated
content, and describe reblogging patterns to analyze its user behavior. We aim to provide a comprehensive statistical overview of
Tumblr and compare it with other popular social services, including
the blogosphere, Twitter and Facebook, in answering two key
questions: What is Tumblr? How is Tumblr different from other
social media networks? In short, we find that Tumblr has richer
content than other microblogging platforms, and that it combines
characteristics of social networking, the traditional blogosphere, and
social media. This work serves as an early snapshot of Tumblr that
later work can leverage.
1. INTRODUCTION
Tumblr, one of the most prevalent microblogging sites, has become a phenomenon in recent years, and it was acquired by Yahoo! in
2013. By mid-January 2014, Tumblr had 166.4 million users
and 73.4 billion posts¹. It is reported to be the most popular
social site among the younger generation, as half of Tumblr's visitors are
under 25 years old². Tumblr is ranked the 16th most popular
site in the United States, making it the 2nd most dominant blogging
site, the 2nd largest microblogging service, and the 5th most prevalent social site³. In contrast to the momentum Tumblr has gained in
the recent press, little academic research has been conducted on this
burgeoning social service. Natural questions arise: What is Tumblr? What is the difference between Tumblr and other blogging or
social media sites?
Traditional blogging sites, such as Blogspot⁴ and Live Journal⁵,
have high-quality content but little social interaction. Nardi et
al. [17] investigated blogging as a form of personal communication and expression, and showed that the vast majority of blog posts
are written by ordinary people with a small audience. On the contrary, popular social networking sites like Facebook⁶ have richer
social interactions, but lower-quality content compared with the blogosphere. Since most social interactions are either unpublished or
less meaningful for the majority of the public audience, it is natural for
Facebook users to form different communities or social circles. Microblogging services, in between traditional blogging and online
social networking services, have intermediate-quality content and
intermediate social interactions. Twitter⁷, which is the largest microblogging site, limits each post to 140 characters,
and the Twitter following relationship is not reciprocal: a Twitter
user does not need to follow back if the user is followed by another.
As a result, Twitter is considered a new form of social media [11], and
short messages can be broadcast to a Twitter user's followers in
real time.
¹ http://www.tumblr.com/about
² http://www.webcitation.org/64UXrbl8H
³ http://www.alexa.com/topsites/countries/US
⁴ http://blogspot.com
⁵ http://livejournal.com
Tumblr is also positioned as a microblogging platform. Tumblr users
can follow another user without being followed back, which forms a non-reciprocal social network; a Tumblr post can be re-broadcast by
a user to their own followers via reblogging. But unlike Twitter, Tumblr has no length limitation for each post, and Tumblr also supports
multimedia posts, such as images, audio or video. With these differences in mind, are the social network, user-generated content, or
user behavior on Tumblr dramatically different from those of other social
media sites?
In this paper, we provide a statistical overview of Tumblr from
assorted aspects. We study the social network structure among
Tumblr users and compare its network properties with those of other commonly used networks. We also study the content generated in Tumblr and examine the content generation patterns. Going one step further,
we analyze how a blog post is reblogged and propagated
through the network, both topologically and temporally. Our study
shows that Tumblr provides hybrid microblogging services: it has the dual characteristics of both social media and traditional blogging. Some surprising patterns also surface. We describe these
intriguing findings and provide insights, which we hope can be
leveraged by other researchers to understand more about this new
form of social media.
2. TUMBLR AT FIRST SIGHT
Tumblr is ranked the second largest microblogging service, right
after Twitter, with over 166.4 million users and 73.4 billion posts
by January 2014. Registering on Tumblr is easy: one can sign up
with a valid email address within 30 seconds.
Once signed in, a user can follow other users. Unlike in
Facebook, connections in Tumblr do not require mutual confirmation. Hence the social network in Tumblr is unidirectional.
⁶ http://facebook.com
⁷ http://twitter.com
Both Twitter and Tumblr are considered microblogging platforms. Compared with Twitter, Tumblr exhibits several differences:
• There is no length limitation for each post;
• Tumblr supports multimedia posts, such as images, audio
and video;
• Similar to hashtags in Twitter, bloggers can also tag their
blog posts, which is commonplace in traditional blogging.
But tags in Tumblr are separate from the blog content, while in
Twitter a hashtag can appear anywhere within a tweet;
• Tumblr recently (Jan. 2014) allowed users to mention and
link to specific users inside posts. This @user mechanism
needs more time to be adopted by the community;
• Tumblr does not differentiate verified accounts.
Figure 1: Post Types in Tumblr
Figure 2: Distribution of Posts (better viewed in color). Photo: 78.11%; Text: 14.13%; Quote: 2.27%; Audio: 2.01%; Video: 1.35%; Chat: 0.85%; Answer: 0.82%; Link: 0.46%.
Specifically, Tumblr defines 8 types of posts: photo, text, quote,
audio, video, chat, link and answer. As shown in Figure 1, one
has the flexibility to start a post of any type except answer. Text,
photo, audio, video and link posts allow one to post, share and comment on
any multimedia content. Quote and chat posts, which are not available
in most other social networking platforms, let Tumblr users share
a quote or a chat history from iChat or MSN. Answer posts occur only when
one interacts with other users: when a user posts a question, specifically a text post ending with a question
mark, the user can enable the option for others to answer the question; this option is disabled automatically after 7 days. A post can
also be reblogged by another user, who thereby broadcasts it to their own followers. The reblogged post quotes the original post by default and
allows the reblogger to add comments.
Figure 2 shows the distribution of Tumblr post types, based
on the 586.4 million posts we collected. As seen in the figure, even
though all kinds of content are supported, photo and text posts dominate
the distribution, accounting for more than 92% of the posts. Therefore, we concentrate on these two types of posts in our content
analysis later.
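The "more than 92%" figure follows directly from the shares reported in Figure 2. The short Python sketch below only redoes that arithmetic; the dictionary and the helper name `combined_share` are illustrative choices, not part of the original study.

```python
# Post-type shares in percent, copied from Figure 2 of the paper.
post_share = {
    "photo": 78.11, "text": 14.13, "quote": 2.27, "audio": 2.01,
    "video": 1.35, "chat": 0.85, "answer": 0.82, "link": 0.46,
}

def combined_share(shares, types):
    """Sum the percentage shares of the given post types (hypothetical helper)."""
    return round(sum(shares[t] for t in types), 2)

# Photo and text together: the two types the content analysis focuses on.
photo_text = combined_share(post_share, ["photo", "text"])

# Sanity check: the eight reported shares should add up to 100%.
total = combined_share(post_share, post_share)

print(photo_text, total)
```

The rounding to two decimals only guards against floating-point noise; the reported shares themselves are taken as given.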
Since Tumblr has a strong presence of photos, it is natural to compare it with other photo- or image-based social networks such as Flickr⁸
and Pinterest⁹. Flickr is mainly an image hosting website, and
Flickr users can add contacts and comment on or like others' photos. Yet,
unlike in Tumblr, one cannot reblog another user's photo in Flickr.
Pinterest is designed for curators, allowing one to share photos or
videos of her taste with the public. Pinterest links a pin to the
commercial website where the product presented in the pin can
be purchased, which accounts for a stronger e-commerce behavior.
Therefore, the target audiences of Tumblr and Pinterest are quite
different: the majority of Tumblr users are under age 25, while
Pinterest is heavily used by women aged 25 to 44 [16].
⁸ http://flickr.com
⁹ http://pinterest.com
We directly sampled a sub-graph snapshot of the social network from
Tumblr in August 2013, which contains 62.8 million nodes and
3.1 billion edges. Though this graph is not up to date, we believe that many network properties should be well preserved given
the scale of the graph. In addition, we sampled about 586.4 million
Tumblr posts from August 10 to September 6, 2013. Unfortunately, Tumblr does not require users to fill in basic profile information, such as gender or location. Therefore, it is impossible for
us to conduct user profile analysis as done in other works. To handle such a large volume of data, most statistical patterns
are computed on a MapReduce cluster, and some of the algorithms
are tricky. We skip the involved implementation details and
concentrate solely on the derived patterns.
Most statistical patterns can be presented in three different forms:
the probability density function (PDF), the cumulative distribution function (CDF), or the complementary cumulative distribution function (CCDF),
describing Pr(X = x), Pr(X ≤ x) and Pr(X ≥ x) respectively, where X is a random variable and x is a certain value. Due
to the space limit, it is impossible to include all of them. Hence,
we decide which form(s) to include based on convenience of presentation and
of comparison with other relevant papers. That is, if
the CCDF is reported in a relevant paper, we also try to report the CCDF
here so that a rigorous comparison is possible.
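As a minimal illustration of these three forms, the Python sketch below computes the empirical PDF, CDF and CCDF of a small sample, using the definitions given in the text (note that the CCDF here is Pr(X ≥ x), not Pr(X > x)). The function name `empirical_distributions` and the toy data are assumptions made for illustration; the paper's own statistics were computed on a MapReduce cluster whose implementation is not described.

```python
from collections import Counter

def empirical_distributions(samples):
    """Return (pdf, cdf, ccdf) dicts mapping each observed value x to
    Pr(X = x), Pr(X <= x) and Pr(X >= x), estimated from the samples."""
    n = len(samples)
    counts = Counter(samples)
    pdf, cdf, ccdf = {}, {}, {}
    below = 0  # number of samples strictly smaller than the current x
    for x in sorted(counts):
        c = counts[x]
        pdf[x] = c / n
        cdf[x] = (below + c) / n
        ccdf[x] = (n - below) / n
        below += c
    return pdf, cdf, ccdf

# Toy sample standing in for, e.g., user degrees (illustrative data only).
degrees = [0, 0, 1, 1, 1, 2, 5]
pdf, cdf, ccdf = empirical_distributions(degrees)
for x in sorted(pdf):
    print(x, round(pdf[x], 3), round(cdf[x], 3), round(ccdf[x], 3))
```

With these definitions, cdf[x] + ccdf[x] equals 1 + pdf[x], because the value x itself is counted on both sides.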
Next, we study properties of Tumblr through different lenses, in
particular, as a social network, a content generation website, and
an information propagation platform, respectively.
3. TUMBLR AS SOCIAL NETWORK
We begin our analysis of Tumblr by examining its social network
topology. Numerous social networks have been analyzed
in the past, such as the traditional blogosphere [21], Twitter [10; 11],
Facebook [22], and an instant-messenger communication network [13].
Here we run an array of standard network analyses for comparison with
other networks, with results summarized in Table 1¹⁰.
Footnote 10: Even though we wished to include results for other popular social media networks such as Pinterest, Sina Weibo and Instagram, analyses of those websites are either not available or are small-scale case studies that are difficult to generalize to a comprehensive scale for a fair comparison. In fact, in the table we observe quite a discrepancy between the numbers reported for a small Twitter data set and those for another comprehensive snapshot.

Table 1: Comparison of Tumblr with other popular social networks. The numbers for Blogosphere, Twitter-small, Twitter-huge, Facebook, and MSN are obtained from [21; 10; 11; 22; 13], respectively. In the table, – means that the corresponding statistic is not available or not applicable; GCC denotes the giant connected component; the symbols in parentheses m, d, e, r respectively denote mean, median, the 90% effective diameter, and diameter (the maximum shortest path in the network).

| Metric | Tumblr | Blogosphere | Twitter-small | Twitter-huge | Facebook | MSN |
| #nodes | 62.8M | 143,736 | 87,897 | 41.7M | 721M | 180M |
| #links | 3.1B | 707,761 | 829,467 | 1.47B | 68.7B | 1.3B |
| in-degree distr | ∝ k^−2.19 | ∝ k^−2.38 | ∝ k^−2.4 | ∝ k^−2.276 | – | – |
| degree distr in r-graph | ≠ power-law | – | – | – | ≠ power-law | ∝ k^0.8 e^(−0.03k) |
| direction | directed | directed | directed | directed | undirected | undirected |
| reciprocity | 29.03% | 3% | 58% | 22.1% | – | – |
| degree correlation | 0.106 | – | – | >0 | 0.226 | – |
| avg distance | 4.7(m), 5(d) | 9.3(m) | – | 4.1(m), 4(d) | 4.7(m), 5(d) | 6.6(m), 6(d) |
| diameter | 5.4(e), ≥ 29(r) | 12(r) | 6(r) | 4.8(e), ≥ 18(r) | < 5(e) | 7.8(e), ≥ 29(r) |
| GCC coverage | 99.61% | 75.08% | 93.03% | – | 99.91% | 99.90% |

Degree Distribution. Since Tumblr does not require mutual confirmation when one user follows another, we represent the follower-followee network in Tumblr as a directed graph: the in-degree of a user represents how many followers the user has attracted, while the out-degree indicates how many other users the user is following. Our sampled sub-graph contains 62.8 million nodes and 3.1 billion edges. Within this social graph, 41.40% of nodes have 0 in-degree, and the maximum in-degree of a node is 4.06 million. By contrast, 12.74% of nodes have 0 out-degree, and the maximum out-degree of a node is 155.5k. The most popular Tumblr users include equipo (footnote 11), instagram (footnote 12), and woodendreams (footnote 13). This indicates the media characteristic of Tumblr: the most popular user has an audience of more than 4 million, while more than 40% of users are purely audience since they do not have any followers.
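The in-degree and out-degree statistics quoted in the Degree Distribution paragraph can be derived from a follower-followee edge list. The sketch below is a single-machine illustration under the assumption that each edge is a (follower, followee) pair; the function and key names are hypothetical, and the paper's own MapReduce implementation is not described.

```python
from collections import Counter

def degree_stats(edges, nodes):
    """edges: iterable of (follower, followee) pairs; nodes: non-empty list of user ids."""
    in_deg = Counter()
    out_deg = Counter()
    for follower, followee in edges:
        out_deg[follower] += 1   # the follower follows one more user
        in_deg[followee] += 1    # the followee gains one follower
    n = len(nodes)
    return {
        "zero_in_frac": sum(1 for u in nodes if in_deg[u] == 0) / n,
        "zero_out_frac": sum(1 for u in nodes if out_deg[u] == 0) / n,
        "max_in": max(in_deg[u] for u in nodes),
        "max_out": max(out_deg[u] for u in nodes),
    }

if __name__ == "__main__":
    toy_edges = [("a", "b"), ("c", "b"), ("b", "a")]  # toy data
    print(degree_stats(toy_edges, ["a", "b", "c", "d"]))
```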
Figure 3(a) shows the distribution of in-degrees as the blue curve and that of out-degrees as the red curve, where the y-axis is the complementary cumulative distribution function (CCDF): the probability that an account has an in-degree or out-degree of at least k, i.e., P(K ≥ k). Tumblr users' in-degree follows a power-law distribution with exponent −2.19, which is quite similar to the power-law exponent of Twitter at −2.28 [11] or that of traditional blogs at −2.38 [21]. This also agrees with the earlier empirical observation that most social networks have a power-law exponent between −2 and −3 [6].
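The paper does not say how the exponent was fitted. One standard option, discussed in [6], is the maximum-likelihood estimator for a continuous power law with density p(x) ∝ x^(−α) for x ≥ x_min, namely α = 1 + n / Σ ln(x_i / x_min). The sketch below is an assumption-laden illustration: it treats degrees as continuous values, takes x_min as given, and checks the estimator on synthetic data only.

```python
import math
import random

def powerlaw_alpha_mle(values, x_min):
    """Continuous MLE: alpha = 1 + n / sum(ln(x / x_min)) over all x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum

def sample_powerlaw(alpha, x_min, n, rng):
    """Inverse-transform sampling from a continuous power law (synthetic data)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)
    synthetic = sample_powerlaw(2.19, 1.0, 20000, rng)
    print(powerlaw_alpha_mle(synthetic, 1.0))  # should be close to 2.19
```

For real degree data, the choice of x_min and the discreteness of degrees both affect the estimate, so a straight-line fit on a log-log plot and this estimator need not agree exactly.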
Regarding the out-degree distribution, we notice that the red curve drops sharply when the out-degree is around 5000, since there was a limit that ordinary Tumblr users could follow at most 5000 other users. Tumblr users' out-degree does not follow a power-law distribution, which is similar to the blogosphere of traditional blogging [21].
If we explore users' in-degree and out-degree together, we obtain the normalized 3-D histogram in Figure 3(b). As both in-degree and out-degree follow a heavy-tailed distribution, we zoom in only on those users whose in-degrees and out-degrees are less than 2^10. There is a positive correlation between in-degree and out-degree, as shown by the dominance of the diagonal bars. In aggregate, a user with low in-degree tends to have low out-degree as well, even though some nodes, especially the most popular ones, have very imbalanced in-degree and out-degree.
Reciprocity. Since Tumblr is a directed network, we would like to examine the reciprocity of the graph. We derive the backbone of the Tumblr network by keeping only the reciprocal connections, i.e., those where user a follows b and vice versa. Let r-graph denote the corresponding reciprocal graph. We found that 29.03% of Tumblr user pairs have a reciprocal relationship, which is higher than the 22.1% reciprocity on Twitter [11] and the 3% reciprocity on the blogosphere [21], indicating a stronger interaction between users in the network. Figure 3(c) shows the distribution of degrees in the r-graph. There is a turning point due to the Tumblr limit of 5000 followees for ordinary users. The reciprocal relationship on Tumblr does not follow a power-law distribution, since the curve is mostly convex, similar to the pattern reported for Facebook [22].

Footnote 11: http://equipo.tumblr.com
Footnote 12: http://instagram.tumblr.com
Footnote 13: http://woodendreams.tumblr.com
Meanwhile, it has been observed that a user's degree is correlated with the degrees of their friends. This is also called degree correlation or degree assortativity [18; 19]. Over the derived r-graph, we obtain a correlation of 0.106 between the terminal nodes of reciprocal connections, reconfirming the positive degree assortativity reported for Twitter [11]. Nevertheless, compared with the strong social network Facebook, Tumblr's degree assortativity is weaker (0.106 vs. 0.226).
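The r-graph construction and the degree correlation can be sketched as follows. Two points are assumptions of this example rather than statements from the paper: reciprocity is computed as the fraction of directed links that are reciprocated (one common definition), and degree assortativity is computed as the Pearson correlation of the degrees at the two ends of each reciprocal edge, in the spirit of [18]. Node ids are assumed to be mutually comparable (e.g., integers).

```python
def reciprocal_graph(edges):
    """Return (set of directed edges, set of undirected reciprocal edges a < b)."""
    edge_set = {(a, b) for a, b in edges if a != b}
    r_edges = {(a, b) for (a, b) in edge_set if a < b and (b, a) in edge_set}
    return edge_set, r_edges

def reciprocity(edges):
    """Fraction of directed links whose reverse link also exists."""
    edge_set, r_edges = reciprocal_graph(edges)
    return 2 * len(r_edges) / len(edge_set) if edge_set else 0.0

def degree_assortativity(r_edges):
    """Pearson correlation between the degrees of the endpoints of each edge."""
    degree = {}
    for a, b in r_edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    xs, ys = [], []
    for a, b in r_edges:
        # Count each undirected edge in both orientations for symmetry.
        xs += [degree[a], degree[b]]
        ys += [degree[b], degree[a]]
    if not xs:
        raise ValueError("the r-graph has no edges")
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    if var == 0:
        raise ValueError("all endpoint degrees are equal; correlation undefined")
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / len(xs)
    return cov / var

if __name__ == "__main__":
    toy = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 4)]  # toy data
    _, r = reciprocal_graph(toy)
    print(reciprocity(toy), degree_assortativity(r))
```

In the toy graph the r-graph is a star, which is maximally disassortative, so the correlation is negative; a value such as 0.106 indicates a mild tendency of high-degree users to connect reciprocally with other high-degree users.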
Degree of Separation. The small-world phenomenon is almost universal among social networks. With this huge Tumblr network, we are able to validate the well-known "six degrees of separation" as well. Figure 4 displays the distribution of the shortest paths in the network. To approximate the distribution, we randomly sample 60,000 nodes as seeds and calculate for each seed the shortest paths to all other nodes. The distribution of path lengths reaches its mode, with the highest probability, at 4 hops, and has a median of 5 hops. On average, the distance between two connected nodes is 4.7. Even though the longest shortest path in the approximation has 29 hops, 90% of shortest paths are within 5.4 hops. All these numbers are close to those reported for Facebook and Twitter, yet significantly smaller than those obtained for the blogosphere and the instant messenger network [13].
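The sampling procedure above can be sketched with breadth-first search from randomly chosen seeds. This is a single-machine illustration, not the authors' implementation: the adjacency-dictionary input, the function names, and the integer-valued quantile (the paper's 5.4-hop effective diameter is presumably interpolated) are all assumptions of this example, as is the question of whether edge direction is followed.

```python
import random
from collections import Counter, deque

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sampled_path_lengths(adj, num_seeds, rng):
    """Histogram {hops: count} of shortest paths from randomly sampled seeds."""
    nodes = list(adj)
    seeds = rng.sample(nodes, min(num_seeds, len(nodes)))
    hist = Counter()
    for s in seeds:
        for target, d in bfs_distances(adj, s).items():
            if target != s:
                hist[d] += 1
    return hist

def summarize(hist, quantile=0.9):
    """Mean, mode, maximum, and smallest hop count covering `quantile` of paths."""
    total = sum(hist.values())
    mean = sum(d * c for d, c in hist.items()) / total
    mode = max(hist, key=hist.get)
    running = 0
    for d in sorted(hist):
        running += hist[d]
        if running >= quantile * total:
            return {"mean": mean, "mode": mode, "q_hops": d, "max": max(hist)}

if __name__ == "__main__":
    path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # toy data
    print(summarize(sampled_path_lengths(path_graph, 4, random.Random(1))))
```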
Component Size. The previous result shows that those users who are connected have a small average distance. It relies on the assumption that most users are connected to each other, which we now confirm. Because the Tumblr graph is directed, we compute all weakly connected components by ignoring the direction of edges. It turns out that the giant connected component (GCC) encompasses 99.61% of nodes in the graph. Over the derived r-graph, 97.55% of nodes reside in the corresponding GCC. This finding suggests that the whole graph is almost a single connected component, and almost all users can reach others through just a few hops.
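Weakly connected components can be found by ignoring edge direction and merging endpoints with a union-find structure. The following is a minimal sketch under that approach; the function name is hypothetical and the paper does not state which algorithm it used.

```python
from collections import Counter

def giant_component_coverage(nodes, edges):
    """Fraction of nodes in the largest weakly connected component."""
    nodes = list(nodes)
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving keeps trees shallow
            u = parent[u]
        return u

    for a, b in edges:  # edge direction is ignored
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    sizes = Counter(find(u) for u in nodes)
    return max(sizes.values()) / len(nodes)

if __name__ == "__main__":
    print(giant_component_coverage(range(6), [(0, 1), (2, 1), (3, 4)]))
```

Applying the same function to the reciprocal edges only would give the GCC coverage of the r-graph.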
To give a concrete overview, we summarize commonly used network statistics in Table 1. The numbers from other popular social networks (blogosphere, Twitter, Facebook, and MSN) are also included for comparison. From this compact view, it is clear that traditional blogs yield a significantly different network structure. Tumblr, even though originally proposed for blogging, yields a network structure that is more similar to Twitter and Facebook.
[Figure 3 panels: (a) in/out degree distribution, CCDF versus in-degree or out-degree on log-log axes; (b) in/out degree correlation, percentage of users with in-degree = 2^X and out-degree = 2^Y; (c) degree distribution in r-graph, CCDF versus in-degree (same as out-degree) on log-log axes]
Figure 3: Degree Distribution of Tumblr Network
[Figure 4 plots: PDF (log scale) and CDF versus shortest path length, for path lengths from 0 to 30]
Figure 4: Shortest Path Distribution

Table 2: Statistics of User Generated Contents

| | Text Post Dataset | Photo Caption Dataset |
| # Posts | 21.5 M | 26.3 M |
| Mean Post Length | 426.7 Bytes | 64.3 Bytes |
| Median Post Length | 87 Bytes | 29 Bytes |
| Max Post Length | 446.0 K Bytes | 485.5 K Bytes |
4. TUMBLR AS BLOGOSPHERE FOR CONTENT GENERATION
As Tumblr was initially proposed for the purpose of blogging, here we analyze its user generated contents. As described earlier, photo and text posts account for more than 92% of total posts. Hence, we concentrate only on these two types of posts. One text post may contain a URL, a quote or a raw message. In this study, we are mainly interested in the authentic contents generated by users. Hence, we extract raw messages as the content information of each text post, by removing quotes and URLs. Similarly, photo posts contain three categories of information: photo URL, quoted photo caption, and raw photo caption. While the photo URL might contain a lot of additional meta information, it would require tremendous effort to analyze all images in Tumblr. Hence, we focus on raw photo captions as the content of each photo post. We end up with two content datasets: one of text posts, and the other of photo captions.
What is the effect of having no length limit for posts? Both Tumblr and Twitter are considered microblogging platforms, yet there is one key difference: Tumblr has no length limit, while Twitter enforces a strict limit of 140 bytes for each tweet. How does this key difference affect user posting behavior?
It has been reported that the average length of posts on Twitter is 67.9 bytes and the median is 60 bytes (see footnote 14). The corresponding statistics for Tumblr are shown in Table 2. For the text post dataset, the average length is 426.7 bytes and the median is 87 bytes, both of which, as expected, are longer than those of Twitter. Keep in mind that Tumblr's numbers are obtained after removing all quotes, photos and URLs, which means the reported numbers understate the discrepancy between Tumblr and Twitter. The big gap between mean and median is due to a small percentage of extremely long posts. For instance, the longest text post in our sampled dataset is 446K bytes. As for photo captions, we naturally expect them to be much shorter than text posts. The average length is around 64.3 bytes, but the median is only 29 bytes. Although photo posts are dominant in Tumblr, the numbers of text posts and photo captions in Table 2 are comparable, because the majority of photo posts do not contain any raw photo caption.
A further related question: is the 140-byte limit sensible? We plot the post length distribution of the text post dataset, zooming into posts shorter than 280 bytes, in Figure 5. About 24.48% of posts exceed 140 bytes, which indicates that roughly one quarter of posts would have to be rewritten in a more compact form if the limit were enforced on Tumblr.
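The length statistics above (mean, median, and the share of posts beyond 140 bytes) are straightforward to compute once a byte encoding is fixed. The sketch below assumes UTF-8, which the paper does not specify, and uses hypothetical function names and toy posts.

```python
from statistics import mean, median

def byte_lengths(posts):
    """Length of each post in bytes, assuming UTF-8 encoding."""
    return [len(p.encode("utf-8")) for p in posts]

def fraction_over_limit(posts, limit=140):
    """Share of posts whose encoded length exceeds `limit` bytes."""
    lengths = byte_lengths(posts)
    return sum(1 for n in lengths if n > limit) / len(lengths)

def length_summary(posts):
    """Mean, median and maximum post length in bytes."""
    lengths = byte_lengths(posts)
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}

if __name__ == "__main__":
    toy_posts = ["a" * 140, "a" * 141, "é" * 71, "hi"]  # toy data
    print(fraction_over_limit(toy_posts), length_summary(toy_posts))
```

A few very long posts pull the mean far above the median, which is the effect visible in Table 2.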
Blending all numbers above together, we can see at least two types
of posts: one is more like posting a reference (URL or photo) with
added information or short comments, the other is authentic user
generated content like in traditional blogging. In other words, Tumblr is a mix of both types of posts, and its no-length-limit policy
encourages its users to post longer high-quality content directly.
What are people talking about? Because there is no length limit on Tumblr, blog posts tend to be more meaningful, which allows us to run topic analysis over the two datasets to get an overview of the content. We run LDA [4] with 100 topics on both datasets, and showcase several topics and their corresponding keywords in Tables 3 and 4, which also clearly show the high quality of textual content on Tumblr. Medical, Pets, Pop Music and Sports are interests shared across the two datasets, although the representative topical keywords may differ even for the same topic. Finance and Internet attract enough attention only in text posts, while only photo posts show a significant amount of interest in the Photography and Scenery topics. We want to emphasize that most of these keywords are semantically meaningful and representative of the topics.

Footnote 14: http://www.quora.com/Twitter-1/What-is-the-average-length-of-a-tweet

[Figure 5 plot: CCDF versus post length (bytes), for post lengths from 0 to 300 bytes]
Figure 5: Post Length Distribution

Table 3: Topical Keywords from Text Post Dataset

| Topic | Topical Keywords |
| Pop Music | music song listen iframe band album lyrics video guitar |
| Sports | game play team win video cookie ball football top sims fun beat league |
| Internet | internet computer laptop google search online site facebook drop website app mobile iphone |
| Pets | big dog cat animal pet animals bear tiny small deal puppy |
| Medical | anxiety pain hospital mental panic cancer depression brain stress medical |
| Finance | money pay store loan online interest buying bank apply card credit |

Table 4: Topical Keywords from Photo Caption Dataset

| Topic | Topical Keywords |
| Pets | cat dog cute upload kitty batch puppy pet animal kitten adorable |
| Scenery | summer beach sun sky sunset sea nature ocean island clouds lake pool beautiful |
| Pop Music | music song rock band album listen lyrics punk guitar dj pop sound hip |
| Photography | photo instagram pic picture check daily shoot tbt photography |
| Sports | team world ball win football club round false soccer league baseball |
| Medical | body pain skin brain depression hospital teeth drugs problems sick cancer blood |

Who are the major contributors of contents? There are two potential hypotheses. 1) One supposes that socially popular users post more. This is derived from the observation that popular users are followed by many users, so blogging is one way to attract a larger audience of followers. Meanwhile, it might be true that blogging is an incentive for celebrities to interact with or reward their followers. 2) The other assumes that long-term users (in terms of registration time) post more, since they are accustomed to the service and are more likely to have their own focused communities or social circles. These peer interactions encourage them to generate more authentic content to share with others.

Do socially popular users or long-term users generate more contents? To answer this question, we choose a fixed time window of two weeks in August 2013 and examine how frequently each user blogs on Tumblr. We sort all users by their in-degree (or by duration since registration) and then partition them into 10 equi-width bins. For each bin, we calculate the average blogging frequency. For easy comparison, we set the maximal value over all bins to 1 and normalize the other bins relative to it. The results are displayed in Figure 6, where the x-axis from left to right indicates increasing in-degree (or decreasing duration). For brevity, we show only the result for the text post dataset, as similar patterns were observed for photo captions.
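The binning and normalization procedure described above can be sketched as follows. The text says users are sorted and then partitioned into 10 equi-width bins; this example interprets that as equal-width slices of the sorted ranking (so each bin holds roughly the same number of users), which is an assumption. The function name and the toy data are likewise hypothetical.

```python
from statistics import mean

def normalized_bin_means(users, num_bins=10):
    """users: list of (sort_key, posts_in_window) pairs.

    Sort by sort_key (e.g., in-degree), cut the ranking into num_bins slices,
    average the posting frequency per slice, and scale so the largest is 1.
    """
    ranked = sorted(users, key=lambda u: u[0])
    n = len(ranked)
    bin_means = []
    for i in range(num_bins):
        chunk = ranked[i * n // num_bins:(i + 1) * n // num_bins]
        bin_means.append(mean(freq for _, freq in chunk) if chunk else 0.0)
    top = max(bin_means)
    return [m / top for m in bin_means] if top else bin_means

if __name__ == "__main__":
    toy_users = [(i, i) for i in range(1, 21)]  # (in-degree, posts), toy data
    print(normalized_bin_means(toy_users))
```

Replacing `mean` with `median` from the same module would give the median curve that the figures also report.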
The patterns are strong in both figures. Users who have a higher in-degree tend to post more, in terms of both mean and median. One caveat is that what we observe and report here is merely correlation, and it does not imply causality. Here we draw the conservative conclusion that social popularity is highly positively correlated with user blogging frequency. A similar positive correlation is also observed in Twitter [11]. In contrast, the pattern in terms of user registration time was unexpected before we drew the figure. Surprisingly, users who registered either earliest or latest tend to post less frequently, while those in between are inclined to post more frequently. Evidently, our initial hypothesis that long-term users blog more does not hold. There could be different explanations in hindsight. Rather than guessing the underlying explanation, we leave this phenomenon as an open question for future researchers.
For reference, we also look at the average post length of users, because it has been adopted as a simple metric to approximate the quality of blog posts [1]. The corresponding correlations are plotted in Figure 7. In terms of post length, the tail users of the social network write the longest posts. Meanwhile, long-term and recently joined users tend to post longer blogs. This pattern is exactly opposite to that of post frequency. That is, the more frequently one blogs, the shorter the blog posts are, and less frequent bloggers tend to have longer posts. This is plausible considering that each individual has limited time and resources. We also replaced the average post length with the maximum post length for each individual user, and the pattern remains.
In summary, without a post length limit, Tumblr users are inclined to write longer blogs, leading to higher-quality user generated content, which can be leveraged for topic analysis. The social celebrities (those with a large number of followers) are the main contributors of contents, which is similar to Twitter [24]. Surprisingly, long-term users and recently registered users tend to blog less frequently. Post length in general has a negative correlation with post frequency: the more frequently one posts, the shorter those posts tend to be.
5. TUMBLR FOR INFORMATION PROPAGATION
Tumblr offers one feature which is missing in traditional blog services: reblog. Once a user posts a blog, other users in Tumblr can reblog it to comment on it or to broadcast it to their own followers. This enables information to be propagated through the network. In this section, we examine the reblogging patterns in Tumblr. We examine all blog posts uploaded within the first 2 weeks, and count reblog events in the 2 weeks right after each blog is posted, so that the choice of time window introduces no bias in our blog data.

[Figure 6 plots: normalized post frequency (mean and median) versus in-degree from low to high, and versus registration time from early to late]
Figure 6: Correlation of Post Frequency with User In-degree or Duration Time since Registration

[Figure 7 plots: normalized post length (mean and median) versus in-degree from low to high, and versus registration time from early to late]
Figure 7: Correlation of Post Length with User In-degree or Duration Time since Registration
Who is reblogging? Firstly, we would like to understand which users tend to reblog more. Those who reblog frequently serve as information transmitters. Similar to the previous section, we examine the correlation of reblogging behavior with users' in-degree. As shown in Figure 8, social celebrities, who are the major source of contents, reblog a lot more than other users. These reblogs are propagated further through their huge number of followers. Hence, they serve as both content contributors and information transmitters. On the other hand, users who registered earlier reblog more as well. The socially popular and long-term users are the backbone of the Tumblr network, making it a vibrant community for information propagation and sharing.
Reblog size distribution. Once a blog is posted, it can be reblogged by others. Those reblogs can be reblogged even further, which leads to a tree structure called a reblog cascade, with the first author as the root node. The reblog cascade size indicates the number of reblog actions involved in the cascade. Figure 9 plots the distribution of reblog cascade sizes. Not surprisingly, it follows a power-law distribution, with the majority of reblog cascades involving few reblog events. Yet, within a time window of two weeks, the maximum cascade size reaches 116.6K. To gain a detailed understanding of reblog cascades, we zoom into the short head and plot the CCDF up to a reblog cascade size of 20 in Figure 9. Only about 19.32% of reblog cascades have size greater than 10. By contrast, only 1% of retweet cascades have size larger than 10 [11]. The reblog cascades in Tumblr thus tend to be larger than retweet cascades in Twitter.
Reblog depth distribution. As shown in previous sections, almost any pair of users is connected through a few hops. How many hops does a blog take to propagate to another user in reality? To answer this, we look at the reblog cascade depth, the maximum number of nodes to pass in order to reach a leaf node from the root node in the reblog cascade structure. Note that reblog depth and size are different. A cascade of depth 2 can involve hundreds of nodes if every other node in the cascade reblogs the same root node.
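The difference between cascade size and depth can be made concrete with a small tree traversal. In this sketch both quantities are counted in nodes, including the root, so that a post reblogged exactly once has size 2 and depth 2; this convention is an assumption chosen to match the examples in the text, and the input format (parent, child pairs) is hypothetical.

```python
from collections import defaultdict

def cascade_size_and_depth(root, reblogs):
    """reblogs: iterable of (parent_post, child_post) pairs forming a tree.

    Returns (size, depth): the number of nodes in the cascade and the number
    of nodes on the longest root-to-leaf path.
    """
    children = defaultdict(list)
    for parent, child in reblogs:
        children[parent].append(child)
    size, depth = 0, 0
    stack = [(root, 1)]  # (post, nodes on the path from the root to this post)
    while stack:
        post, level = stack.pop()
        size += 1
        depth = max(depth, level)
        for child in children.get(post, ()):
            stack.append((child, level + 1))
    return size, depth

if __name__ == "__main__":
    flat = [("r", "a"), ("r", "b"), ("r", "c")]  # everyone reblogs the root
    deep = [("r", "a"), ("a", "b")]              # a chain of reblogs
    print(cascade_size_and_depth("r", flat), cascade_size_and_depth("r", deep))
```

The flat cascade is larger but shallower than the chain, which is the distinction the paragraph above draws.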
Figure 10 plots the distribution of the number of hops: again, the reblog cascade depth distribution follows a power law according to the PDF; when zooming into the CCDF, we observe that only 9.21% of reblog cascades have depth larger than 6. That is, the majority of cascades reach just a few hops, which is consistent with the findings reported for Twitter [3]. In fact, 53.31% of cascades in Tumblr have depth 2. Nevertheless, the maximum depth among all cascades reaches 241 based on two weeks of data. This looks unlikely at first glance, considering that any two users are just a few hops away. It happens because users can add comments while reblogging, and thus one user is likely to be involved in one reblog cascade multiple times. We notice that some Tumblr users adopt reblogging as a way to converse or chat.

[Figure 8 plots: normalized reblog frequency (mean and median) versus in-degree from low to high, and versus registration time from early to late]
Figure 8: Correlation of Reblog Frequency with User In-degree or Duration Time since Registration

[Figure 9 plots: PDF of reblog cascade size on log-log axes, and CCDF zoomed into small reblog cascade sizes]
Figure 9: Distribution of Reblog Cascade Size

Reblog Structure Distribution. Since most reblog cascades span only a few hops, we show the distribution of cascade tree structures up to size 5 in Figure 11. The structures are sorted by their coverage. A substantial percentage of cascades (36.05%) are of size 2, i.e., a post being reblogged merely once. Generally speaking, a reblog cascade with a flat structure tends to have a higher probability than a reblog cascade of the same size with a deep structure. For instance, a reblog cascade of size 3 has two variants, of which the flat one covers 9.42% of cascades while the deep one covers only 5.85%. The same pattern applies to reblog cascades of size 4 and 5. In other words, it is in general easier to spread a message widely than deeply. This implies that, when trying to maximize influence or information propagation, it might be acceptable to consider only the cascade effect within a few hops and to focus on those nodes with a larger audience.

Temporal pattern of reblog. Having investigated information propagation spatially in terms of network topology, we now study how quickly a blog is reblogged. Figure 12 displays the distribution of the time gap between a post and its first reblog. There is a strong bias toward recency. The larger the time gap since a blog is posted, the less likely it is to be reblogged. 75.03% of first reblogs arrive within the first hour after a blog is posted, and 95.84% of first reblogs appear within one day. By comparison, it has been reported that "half of retweeting occurs within an hour and 75% under a day" [11] on Twitter. In short, Tumblr reblogs have a strong bias toward recency, and information propagation on Tumblr is fast.
6. RELATED WORK
There is a rich literature on both existing and emerging online social network services. Statistical patterns across different types of social networks have been reported, including the traditional blogosphere [21], user-generated content platforms like Flickr, Youtube and LiveJournal [15], Twitter [10; 11], an instant messenger network [13], Facebook [22], and Pinterest [7; 20]. The majority of them observe shared patterns such as a long-tail distribution for user degrees (power law or power law with exponential cut-off), a small (90% quantile effective) diameter, positive degree correlation, and a homophily effect in terms of user profiles (age or location), but not with respect to gender. Indeed, people are more likely to talk to the opposite sex [13]. A recent study of Pinterest observed that women tend to be more active and engaged than men [20], and that women and men have different interests [5]. We have compared Tumblr's patterns with other social networks in Table 1 and observed that most of those trends hold in Tumblr, apart from some differences in the numbers.
Lampe et al. [12] conducted a set of survey studies on Facebook users, and showed that people use Facebook to maintain existing offline connections. Java et al. [10] presented one of the earliest research papers on Twitter, and found that users leverage Twitter to talk about their daily activities and to seek or share information. In addition, Gilbert et al. [7] is one of the early studies on Pinterest; it found, from a statistical point of view, that female users repin more but have fewer followers than male users. Hochman and Schwartz [8] published an early paper using Instagram data, and indicated differences in local color usage and cultural production rate, for the analysis of location-based visual information flows.
Existing studies on user influence are based on social networks or content analysis. McGlohon et al. [14] found that topology features can help distinguish blogs, and that the temporal activity of blogs is very non-uniform and bursty, but self-similar. Bakshy et al. [3] investigated the attributes and relative influence of users based on the Twitter follower graph, and concluded that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potential influencers, thereby capturing average effects. Hopcroft et al. [9] studied Twitter user influence based on two-way reciprocal relationship prediction. Weng et al. [23] extended the PageRank algorithm to measure the influence of Twitter users, taking both the topical similarity between users and the link structure into account. Kwak et al. [11] studied the topological and geographical properties of the entire Twittersphere and observed some notable properties of Twitter, such as a non-power-law follower distribution, a short effective diameter, and low reciprocity, marking a deviation from known characteristics of human social networks.

However, due to data access limitations, the majority of existing scholarly papers are based on either Twitter data or traditional blogging data. This work closes the gap by providing the first overview of Tumblr, so that others can use it as a stepping stone to investigate this evolving social service further or to compare it with other related services.

[Figure 10 plots: PDF of reblog cascade depth on log-log axes, and CCDF zoomed into small reblog cascade depths]
Figure 10: Distribution of Reblog Cascade Depth

[Figure 11 diagrams: cascade tree structures up to size 5 with their coverage; coverage values in extracted order: 36.05%, 9.42%, 5.85%, 3.58%, 1.69%, 1.44%, 2.78%, 1.20%, 1.15%, 0.58%, 0.51%, 0.42%, 0.33%, 0.31%, 0.24%, 0.21%]
Figure 11: Cascade Structure Distribution up to Size 5. The percentage at the top is the coverage of cascade structure.

[Figure 12 plot: CDF versus lag time of first reblog, with axis marks at 1m, 10m, 1h, 1d and 1w]
Figure 12: Distribution of Time Lag between a Blog and its first Reblog

7. CONCLUSIONS AND FUTURE WORK
In this paper, we provide a statistical overview of Tumblr in terms of social network structure, content generation and information propagation. We show that Tumblr serves as a social network, a blogosphere and social media simultaneously. It provides high-quality content with rich multimedia information, which offers unique characteristics to attract young users. Meanwhile, we also summarize and offer as rigorous a comparison as possible with other social services, based on numbers reported in other papers. Below we highlight some key findings:
• With multimedia support in Tumblr, photos and text account for the majority of blog posts, while audio and video posts are still rare.
• Tumblr, though initially proposed for blogging, yields a significantly different network structure from the traditional blogosphere. Tumblr's network is much denser and better connected. About 29.03% of connections on Tumblr are reciprocal, while the blogosphere has only 3%. The average distance between two users in Tumblr is 4.7, which is roughly half of that in the blogosphere. The giant connected component covers 99.61% of nodes, compared to 75% in the blogosphere.
• The Tumblr network is highly similar to Twitter and Facebook, with a power-law in-degree distribution, a non-power-law out-degree distribution, positive degree assortativity for reciprocal connections, a small distance between connected nodes, and a dominant giant connected component.
• Without a post length limit, Tumblr users tend to post longer. Approximately 1/4 of text posts have authentic contents beyond 140 bytes, implying a substantial portion of high-quality blog posts for other tasks like topic analysis and text mining.
• Social celebrities tend to be more active. They post and reblog more frequently, serving as both content generators and information transmitters. Moreover, frequent bloggers tend to write short posts, while infrequent bloggers spend more effort on writing longer posts.
• In terms of duration since registration, long-term users and recently registered users post less frequently. Yet, long-term users reblog more.
• The majority of reblog cascades are tiny in terms of both size and depth, though extreme ones are not uncommon. It is relatively easier to propagate a message wide but shallow rather than deep, suggesting the priority for influence maximization or information propagation.
• Compared with Twitter, Tumblr is more vibrant and faster in terms of reblogs and interactions. Tumblr reblogs have a strong bias toward recency. Approximately 3/4 of first reblogs occur within the first hour and 95.84% appear within one day.
This snapshot research is by no means complete. There are several directions in which to extend this work. First, some patterns described here are correlations. They do not reveal the underlying mechanism. It is imperative to differentiate correlation from causality [2] so that we can better understand user behavior. Secondly, it is observed that Tumblr is very popular among young users, with half of Tumblr's visitor base being under 25 years old. Why is this so? We need to combine content analysis and social network analysis with user profiles to find out. In addition, since more than 70% of Tumblr posts are images, it is necessary to go beyond photo captions, and analyze image content together with other meta information.
8. REFERENCES
[1] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the influential bloggers in a community. In Proceedings of WSDM,
2008.
[2] A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence
and correlation in social networks. In Proceedings of KDD,
2008.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone’s an influencer: quantifying influence on twitter. In
Proceedings of WSDM, 2011.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022,
2003.
[5] S. Chang, V. Kumar, E. Gilbert, and L. Terveen. Specialization, homophily, and gender in a social curation site: Findings
from pinterest. In Proceedings of The 17th ACM Conference
on Computer Supported Cooperative Work and Social Computing, CSCW’14, 2014.
[6] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law
distributions in empirical data. arXiv preprint arXiv:0706.1062, 2007.
[7] E. Gilbert, S. Bakhshi, S. Chang, and L. Terveen. ’i need to
try this!’: A statistical overview of pinterest. In Proceedings
of the SIGCHI Conference on Human Factors in Computing
Systems (CHI), 2013.
[8] N. Hochman and R. Schwartz. Visualizing instagram: Tracing
cultural visual rhythms. In Proceedings of the Workshop on
Social Media Visualization (SocMedVis) in conjunction with
ICWSM, 2012.
[9] J. E. Hopcroft, T. Lou, and J. Tang. Who will follow you
back?: reciprocal relationship prediction. In Proceedings of
ACM International Conference on Information and Knowledge Management (CIKM), pages 1137–1146, 2011.
[10] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In
WebKDD/SNA-KDD ’07, pages 56–65, New York, NY, USA,
2007. ACM.
[11] H. Kwak, C. Lee, H. Park, and S. B. Moon. What is twitter, a
social network or a news media. In Proceedings of 19th International World Wide Web Conference (WWW), 2010.
[12] C. Lampe, N. Ellison, and C. Steinfield. A familiar
face(book): Profile elements as signals in an online social network. In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems (CHI), 2007.
[13] J. Leskovec and E. Horvitz. Planetary-scale views on a large
instant-messaging network. In WWW ’08: Proceeding of the
17th international conference on World Wide Web, pages
915–924, New York, NY, USA, 2008. ACM.
[14] M. McGlohon, J. Leskovec, C. Faloutsos, M. Hurst, and N. S.
Glance. Finding patterns in blog shapes and blog evolution.
In Proceedings of ICWSM, 2007.
[15] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and
B. Bhattacharjee. Measurement and analysis of online social
networks. In IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 29–42,
New York, NY, USA, 2007. ACM.
[16] S. Mittal, N. Gupta, P. Dewan, and P. Kumaraguru. The pinbang theory: Discovering the pinterest world. arXiv preprint
arXiv:1307.4952, 2013.
[17] B. Nardi, D. J. Schiano, S. Gumbrecht, and L. Swartz. Why
we blog. Commun. ACM, 47(12):41–46, 2004.
[18] M. E. J. Newman. Assortative mixing in networks. Physical
review letters, 89(20): 208701, 2002.
[19] M. E. J. Newman. Mixing patterns in networks. Physical Review E, 67(2): 026126, 2003.
[20] R. Ottoni, J. P. Pesce, D. Las Casas, G. Franciscani, P. Kumaraguru, and V. Almeida. Ladies first: Analyzing gender roles and behaviors in pinterest. In Proceedings of ICWSM,
2013.
[21] X. Shi, B. Tseng, and L. A. Adamic. Looking at the blogosphere topology through different lenses. In Proceedings of
ICWSM, 2007.
[22] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow.
The anatomy of the facebook social graph. arXiv preprint
arXiv:1111.4503, 2011.
[23] J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of
WSDM, pages 1137–1146, 2010.
[24] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who
says what to whom on twitter. In Proceedings of WWW, 2011.
Volume 16, Issue 1
Page 29
Change Detection in Streaming Data in the Era of Big Data: Models and Issues
Dang-Hoan Tran
Vietnam Maritime University
484 Lach Tray, Ngo Quyen
Haiphong, Vietnam
danghoan.tran@vimaru.edu.vn
Mohamed Medhat Gaber
School of Computing Science and Digital Media, Robert Gordon University
Riverside East, Garthdee Road
Aberdeen AB10 7GJ, UK
m.gaber1@rgu.ac.uk
Kai-Uwe Sattler
Ilmenau University of Technology
PO Box 100565
Ilmenau, Germany
kus@tu-ilmenau.de
ABSTRACT
Big Data is identified by its three Vs, namely velocity, volume, and variety. The area of data stream processing has long dealt with the first two Vs, velocity and volume. Over a decade of intensive research, the community has provided many important research discoveries in the area. The third V of Big Data has been the result of social media and the large unstructured data it generates. Streaming techniques have also been proposed recently to address this emerging need. However, a hidden factor can represent an important fourth V, that is, variability or change. Our world is changing rapidly, and accounting for variability is a crucial success factor. This paper provides a survey of change detection techniques as applied to streaming data. The review is timely with the rise of Big Data technologies, and the need to have this important aspect highlighted and its techniques categorized and detailed.
1. INTRODUCTION
Today’s world is changing very fast. Changes occur in every aspect of life. Therefore, the ability to detect, adapt, and react to change plays an important role in all aspects of life. The physical world is often represented in some model or some information system. The changes in the physical world are reflected in terms of the changes in data or in the model built from data. Therefore, the nature of data is changing.
The advance of technology results in the data deluge. The data volume is increasing at an estimated rate of 50% per year [39]. The data flood makes traditional methods, including traditional distributed frameworks and parallel models, inappropriate for processing, analyzing, storing, and understanding these massive data sets. The data deluge needs a new generation of computing tools that Jim Gray calls the 4th paradigm in scientific computing [25]. Recently, there have been some emerging computing paradigms that meet the requirements of Big Data as follows. The parallel batch processing model only deals with stationary massive data [17]. However, evolving data continuously arrives at high speed. In fact, online data stream processing is the main approach to dealing with the three characteristics of Big Data: big volume, big velocity, and big variety. Streaming data processing is a model of Big Data processing. Streaming data is temporal in nature. In addition to the temporal nature, streaming data may include spatial characteristics. For example, geographic information systems can produce spatial-temporal data streams. Streaming data processing and mining have been deployed in real-world systems such as InfoSphere Streams (IBM)¹, RapidMiner Streams Plugin², StreamBase³, MOA⁴, and AnduIN⁵. In order to deal with high-speed data streams, a hybrid model that combines the advantages of both the parallel batch processing model and the streaming data processing model has been proposed. Some projects for such a hybrid model include S4⁶, Storm⁷, and Grok⁸.
One of the challenges facing data stream processing and mining is the changing nature of streaming data. Therefore, the ability to identify trends, patterns, and changes in the underlying processes generating data contributes to the success of processing and mining massive high-speed data streams.
A model of continuous distributed monitoring has been recently proposed to deal with streaming data coming from multiple sources. This model has many observers, where each observer monitors a single data stream. The goal of continuous distributed monitoring is to perform some tasks that need to aggregate the incoming data from the observers. Continuous distributed monitoring is applied to monitor networks such as sensor networks, social networks, and ISP networks [11].
Change detection is the process of identifying differences in the state of an object or phenomenon by observing it at different times or different locations in space. In the streaming context, change detection is the process of segmenting a data stream into different segments by identifying the points where the stream dynamics change [53]. A change detection method consists of the following tasks: change detection and localization of change. Change detection identifies whether a change occurs, and responds to the presence of such change. Besides change detection, localization of changes determines the location of change. The problem of locating the change has been studied in statistics in the problems of change point detection.
This paper presents the background issues and notation relevant to the problem of change detection in data streams.
1 http://www-01.ibm.com/software/data/infosphere/streams/
2 http://www-ai.cs.uni-dortmund.de/auto?self=$eit184kc
3 http://www.streambase.com/
4 http://moa.cs.waikato.ac.nz/
5 http://www.tu-ilmenau.de/dbis/research/anduin/
6 http://incubator.apache.org/s4/
7 https://github.com/nathanmarz/storm/wiki/Tutorial
8 https://www.numenta.com/grok_info.html
2. CHANGE DETECTION IN STREAMING DATA
The streaming computational model is considered one of the widely used models for processing and analyzing massive data. Streaming data processing helps the decision-making process in real time. A data stream is defined as follows.
DEFINITION 1. A data stream is an infinite sequence of elements
$$S = (X_1, T_1), \ldots, (X_j, T_j), \ldots \tag{1}$$
Each element is a pair $(X_j, T_j)$ where $X_j$ is a d-dimensional vector $X_j = (x_1, x_2, \ldots, x_d)$ arriving at the time-stamp $T_j$. The time-stamp is defined over a discrete domain with a total order. There are two types of time-stamps: an explicit time-stamp is generated when data arrive; an implicit time-stamp is assigned by some data stream processing system.
Streaming data has the following fundamental characteristics. First, data arrives continuously. Second, streaming data evolves over time. Third, streaming data is noisy and may be corrupted. Fourth, timely intervention is important. From the characteristics of streaming data and the data stream model, data stream processing and mining pose the following challenges. First, as streaming data arrives rapidly, the techniques of streaming data processing and analysis must keep up with the data rate to prevent the loss of important information as well as avoid data redundancy. Second, as the speed of streaming data is very high, the data volume exceeds the processing capacity of the existing systems. Third, because the value of data decreases over time, the recent streaming data is sufficient for many applications. Therefore, one can only capture and process the data as soon as it is generated.
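Definition 1 can be made concrete with a small Python sketch. The names `Element` and `with_implicit_timestamps` are illustrative assumptions and do not come from the paper. The sketch assumes implicit time-stamps drawn from a discrete counter on the processing side, and a finite list stands in for a conceptually infinite source.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Element:
    """One stream element (X_j, T_j) in the sense of Definition 1."""
    x: Tuple[float, ...]  # d-dimensional vector X_j
    t: int                # time-stamp T_j from a discrete, totally ordered domain


def with_implicit_timestamps(vectors: Iterable[Tuple[float, ...]]) -> Iterator[Element]:
    """Attach implicit time-stamps, assigned on arrival by the processing side."""
    for t, x in zip(count(1), vectors):
        yield Element(x=tuple(x), t=t)


# A finite list stands in for an unbounded source (assumption for illustration).
readings = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4)]
stream = list(with_implicit_timestamps(readings))
```

Because the function is a generator, elements are produced one at a time, which mirrors the requirement that data be processed as soon as it is generated.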
2.1 Change Detection: Definitions and Notation
This section presents concepts and classification of changes and change detection methods. To develop a change detection method, we should understand what a change is.
DEFINITION 2. Change is defined as the difference in the state of an object or phenomenon over time and/or space [52; 1].
From a system point of view, change is the process of transition from one state of a system to another. In other words, a change can be defined as the difference between an earlier state and a later state. An important distinction between change and difference is that a change refers to a transition in the state of an object or a phenomenon over time, while a difference means the dissimilarity in the characteristics of two objects. A change can reflect a short-term trend or a long-term trend. For example, a stock analyst may be interested in the short-term change of the stock price.
Change detection is defined as the process of identifying differences in the state of an object or phenomenon by observing it at different times [54]. In the above definition, a change is detected on the basis of differences of an object at different times, without considering the differences of an object in locations in space. In many real-world applications, changes can occur in terms of both time and space. For example, multiple spatial-temporal data streams representing triples (latitude, longitude, time) are created in traffic information systems using GPS [23]. Hence, change detection can be defined as follows.
DEFINITION 3. Change detection is the process of identifying differences in the state of an object or phenomenon by observing it at different times and/or different locations in space.
A distinction between concept drift detection and change detection is that concept drift detection focuses on labeled data, while change detection can deal with both labeled and unlabeled data. Change analysis both detects and explains the change. Hido et al. [26] proposed a method for change analysis by using supervised learning.
DEFINITION 4. Change point detection is identifying time points at which properties of time series data change [32].
Depending on the specific application, change detection can be called by different terms such as burst detection, outlier detection, or anomaly detection. Burst detection is a special kind of change detection. A burst is a period in the stream whose aggregated sum exceeds a threshold [31]. Outlier detection is a special kind of change detection. Anomaly detection can be seen as a special type of change detection in streaming data.
To find a solution to the problem of change detection, we should consider the aspects of change of the system in which we want to detect changes. As shown in [52], the aspects of change that must be considered include subject of change, type of change, cause of change, effect of change, response of change, temporal issues, and spatial issues. In particular, to design an algorithm for detecting changes in sensor streaming data, the major questions we need to answer include: What is the system in which the changes need to be detected? What are the principles used to model the problem? What is the data type? What are the constraints of the problem? What is the physical subject of change? What is the meaning of change to the user? How to respond and react to this change? How to visualize this change?
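The burst definition above (a period whose aggregated sum exceeds a threshold) can be illustrated with a minimal sliding-window sketch. This is an assumption-laden illustration and not the method of [31]: the function name `burst_positions`, the fixed window length `w`, and the value of `threshold` are all hypothetical choices.

```python
from collections import deque
from typing import Iterable, List


def burst_positions(values: Iterable[float], w: int, threshold: float) -> List[int]:
    """Return indices j where the sum of the last w values exceeds threshold."""
    window = deque()
    total = 0.0
    hits = []
    for j, v in enumerate(values):
        window.append(v)
        total += v
        if len(window) > w:
            total -= window.popleft()  # keep only the last w values
        if len(window) == w and total > threshold:
            hits.append(j)
    return hits


counts = [1, 0, 2, 9, 8, 7, 1, 0]
bursts = burst_positions(counts, w=3, threshold=15)  # -> [4, 5, 6]
```

The running sum is updated in constant time per element, so the sketch keeps only `w` values in memory regardless of stream length.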
A change detection method can fall into one of two types: batch change detection and sequential change detection. Given a sequence of N observations $x_1, \ldots, x_N$, where N is invariant, the task of a batch change detection method is deciding whether a change occurs at some point in the sequence by using all N available observations. When the arriving speed of data is too high, batch change detection is suitable. In other words, a change detection method using the two adjacent windows model will be used. However, the drawback of the batch change detection method is that its running time is very large when detecting changes in a large amount of data. In contrast, the sequential change detection problem is based on the observations so far. If no change is detected, the next observation is processed. Whenever a change is detected, the change detector is reset.
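The contrast between the two types can be sketched as follows. Both functions are illustrative assumptions, not algorithms taken from the paper: `batch_detect` scans every split of a fixed sequence for the largest gap between the two means, and `sequential_detect` uses a Page-Hinkley style statistic for upward shifts, resetting itself after each alarm. The parameters `threshold`, `delta`, and `lam` are hypothetical tuning values.

```python
from typing import Iterable, List, Optional, Sequence


def batch_detect(xs: Sequence[float], threshold: float) -> Optional[int]:
    """Batch detection: use all N observations; return the best split index or None."""
    best_k, best_gap = None, 0.0
    for k in range(1, len(xs)):
        left = sum(xs[:k]) / k
        right = sum(xs[k:]) / (len(xs) - k)
        gap = abs(left - right)
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k if best_gap > threshold else None


def sequential_detect(xs: Iterable[float], delta: float, lam: float) -> List[int]:
    """Sequential detection (Page-Hinkley style, upward shifts only)."""
    alarms = []
    n, mean, cum, cum_min = 0, 0.0, 0.0, 0.0
    for j, x in enumerate(xs):
        n += 1
        mean += (x - mean) / n          # running mean of observations so far
        cum += x - mean - delta         # cumulative deviation above the mean
        cum_min = min(cum_min, cum)
        if cum - cum_min > lam:
            alarms.append(j)
            n, mean, cum, cum_min = 0, 0.0, 0.0, 0.0  # reset the detector
    return alarms


data = [0.0] * 10 + [5.0] * 10
```

The batch version recomputes sums for every candidate split, which reflects the running-time drawback noted above; the sequential version does constant work per observation.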
Change detection methods can be classified into the following approaches: threshold-based change detection methods; state-based change detection methods; and trend-based change detection methods.
A change detection algorithm should meet three main requirements [37]: accuracy, promptness, and online operation. The algorithm should detect as many actual change points as possible and generate as few false alarms as possible. The algorithm should detect each change point as early as possible. The algorithm should be sufficiently efficient for a real-time environment.
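The accuracy and promptness requirements above can be quantified once true change points are known. The following sketch is one possible scoring scheme under stated assumptions, not an evaluation protocol from the paper: an alarm counts as a detection if it falls within `max_delay` steps after a true change point, alarm positions are assumed to be unique, and the function name `evaluate` is hypothetical.

```python
from typing import Dict, List


def evaluate(true_points: List[int], alarms: List[int], max_delay: int) -> Dict[str, float]:
    """Match each true change point to the first unused alarm within max_delay steps."""
    used = set()
    delays = []
    for cp in sorted(true_points):
        for a in sorted(alarms):
            if a not in used and cp <= a <= cp + max_delay:
                used.add(a)
                delays.append(a - cp)
                break
    detected = len(delays)
    return {
        "detected": detected,                          # accuracy: found change points
        "missed": len(true_points) - detected,         # accuracy: undetected change points
        "false_alarms": len(alarms) - len(used),       # accuracy: unmatched alarms
        "mean_delay": sum(delays) / detected if detected else float("nan"),  # promptness
    }


report = evaluate(true_points=[100, 300], alarms=[104, 180, 310], max_delay=20)
```

Here the alarm at 180 matches no true change point and is counted as a false alarm, while the delays of 4 and 10 steps average to 7.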
Change detection in data streams allows us to identify time-evolving trends and time-evolving patterns. Research issues on mining changes in data streams include modeling and representation of changes, change-adaptive mining methods, and interactive exploration of changes [19]. Change detection plays an important role in the field of data stream analysis. Since a change in the model may convey interesting time-dependent information and knowledge, the change of the data stream can be used for understanding the nature of several applications. Basically, interesting research problems on mining changes in data streams can be classified into three categories: modeling and representation of changes, mining methods, and interactive exploration of changes. A change detection algorithm can be used as a sub-procedure in many other data stream mining algorithms in order to deal with the changing data in data streams [28; 4]. A definition of change detection for streaming data is given as follows.
DEFINITION 5. Change detection is the process of segmenting a data stream into different segments by identifying the points where the stream dynamics change [53].
Figure 1: A general diagram for detecting changes in data stream (diagram labels: incoming data stream; data stream processing/mining; model; the changes of data generating process; the changes of the generated model).
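Definition 5 views change detection as segmentation. Assuming the change points have already been identified by some detector, the segmentation step itself can be sketched as below; the function name `segment` and the use of list indices as change points are illustrative assumptions.

```python
from typing import List, Sequence


def segment(stream: Sequence[float], change_points: Sequence[int]) -> List[List[float]]:
    """Split a finite stream prefix at the given change-point indices."""
    bounds = [0] + sorted(change_points) + [len(stream)]
    # Each consecutive pair of bounds delimits one segment; empty segments are skipped.
    return [list(stream[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]


segments = segment([1, 1, 1, 5, 5, 9], change_points=[3, 5])
```

Each resulting segment is assumed to have homogeneous dynamics, with a change point marking the first element of the following segment.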
As data streams evolve over time in nature, there is growing emphasis on detecting changes not only in the underlying data distribution, but also in the models generated by data stream processing and data stream mining. As can be seen in Figure 1, a change can occur in the data stream or in the streaming model. Therefore, there are two types of change detection problems: change detection in the data generating process and change detection in the model generated by data stream processing or mining. The fundamental issues of detecting changes in data streams include characterizing and quantifying changes and detecting changes. A change detection method in streaming data needs a trade-off among space-efficiency, detection performance, and time-efficiency.
2.2 Change Detection Methods in Streaming Data
Over the last 50 years, change detection has been widely studied and applied in both academic research and industry. For example, it has been studied for a long time in the following fields: statistics, signal processing, and control theory. In recent years, many change detection methods have been proposed for streaming data. The approaches to detecting changes in data streams can be classified as follows.
• Data stream model: A data stream can fall into one of the following models: time series model, cash register model, and turnstile model [41]. On the basis of the data stream model, there are change detection algorithms developed for the corresponding data stream model. Krishnamurthy et al. presented a sketch-based change detection method for the most general streaming model, the turnstile model [35].
• Data characteristics: Change detection methods can be classified on the basis of the characteristics of the streaming data, such as data dimensionality, data label, and data type. A data item coming from a data stream can be univariate or multidimensional. Ideally, a general algorithm would detect changes in both univariate and multidimensional data streams. Change detection algorithms for streaming multivariate data have been presented in [14; 34; 36]. Data streams can be classified into categorical and numerical data streams, and a change detection algorithm can be developed for either type. In real-world applications, each data item in a data stream may include multiple attributes of both numerical and categorical data. In such situations, the data stream can be projected onto each attribute or group of attributes, and change detection methods can then be applied to the corresponding projected data streams. Data streams are also classified into labeled and unlabeled data streams. A labeled data stream is one
SIGKDD Explorations
whose individual example is associated with a given class label; otherwise, it is an unlabeled data stream. A change detection algorithm that identifies changes in a labeled data stream is called supervised change detection [34; 5], while one that detects changes in an unlabeled data stream is called an unsupervised change detection algorithm [7]. The advantage of the supervised approach is its high detection accuracy. However, ground truth data must be generated. Thus, an unsupervised change detection approach is preferred to the supervised one when ground truth data is unavailable.
• Completeness of statistical information: On the basis of the completeness of statistical information, a change detection algorithm can fall into one of the following three categories. Parametric change detection schemes are based on knowing the full prior information before and after the change. For example, in distributional change detection methods, the data distributions before and after the change are known [41; 42]. A recently introduced method for detecting changes in stock order streams is a parametric method in which the stream of stock orders is assumed to conform to the Poisson distribution [37]. The advantage of parametric change detection approaches is that they can produce more accurate results than semi-parametric and nonparametric methods. However, in many real-time applications, data may not conform to any standard distribution, so parametric approaches are inapplicable. Semi-parametric methods are based on the assumption that the distribution of observations belongs to some class of distribution functions, and that the parameters of the distribution function change at disorder moments. Recently, Kuncheva [36] has proposed a semi-parametric method that uses a semi-parametric log-likelihood for testing a change. Nonparametric methods make no distributional assumptions on the data. Nonparametric methods for detecting changes in the underlying data distribution include the Wilcoxon test, kernel methods, the Kullback-Leibler distance, and the Kolmogorov-Smirnov test. Nonparametric methods can be classified into two categories: nonparametric methods using a window [33] and nonparametric methods without a window [27]. We have paid particular attention to nonparametric change detection methods using a window because in many real-world applications the distributions under both the null hypothesis and the alternative hypothesis are unknown in advance. Furthermore, we are only interested in recent data. A common approach to identifying a change is to compare two samples in order to find the difference between them, which is called two-sample change detection or window-based change detection. As a data stream is infinite, a sliding window is often used to detect changes. Window-based change detection incurs a high detection delay [37]. A window-based change detection scheme is based on a dissimilarity measure between two distributions or synopses extracted from the reference window and the current window.
• Velocity of data change: Aggarwal proposes a framework that can deal with changes in both the spatial velocity profile and the temporal velocity profile [1; 2]. In this approach, the changes in data density occurring at each location are estimated by estimating the velocity density in some user-defined temporal window. An important advantage of this approach is that it visualizes the changes, which helps the user understand them intuitively.
• Speed of response: If a change detection method needs to react to the detected changes as fast as possible, the
Volume 16, Issue 1
Page 32
quickest detection of change should be used. Quickest change detection can help a system raise a timely alarm. Timely warnings are economically beneficial and, in some cases, may save human lives, for example in fire-fighting systems. Change detection methods using two overlapping windows can react quickly to changes in streaming data, while methods using the adjacent windows model may incur a high delay. As a change can be abrupt or gradual, there exist abrupt change detection algorithms and gradual change detection algorithms [46; 40].
• Decision making methodology: Based on the decision making methodology, a change detection method can fall into one of the following categories: rank-based methods [33], density-based methods [55], and information-theoretic methods [15]. A change detection problem can also be classified into batch change detection and sequential change detection. Based on the detection delay that a change detector suffers from, a change detection method can fall into one of two types: real-time change detection and retrospective change detection. Based on the spatial or temporal characteristics of the data, a change detection algorithm can fall into one of three kinds: spatial change detection, temporal change detection, or spatio-temporal change detection [6].
• Application: On the basis of the applications that generate them, data streams can be classified into transactional data streams, sensor data streams, network data streams, stock order data streams, astronomy data streams, video data streams, etc. There are change detection methods tailored to specific applications, such as change detection methods for sensor streaming data [56] and change detection methods for transactional streaming data [45; 57; 8]. For example, van Leeuwen and Siebes [57] have presented a change detection method for transactional streaming data based on the principle of Minimum Description Length.
• Stream processing methodology: Based on the methodology for processing a data stream, a data stream can be classified as an online data stream or an off-line data stream [38]. In some work, an online data stream is called a live stream, while an off-line data stream is called an archived data stream [18]. An online data stream needs to be processed online because of its high speed. Such online data streams include stock ticker streams, streams of network measurements, and streams of sensor data. An off-line stream is a sequence of updates to warehouses or backup devices. Queries over off-line streams can be processed off-line. However, as there is insufficient time to process off-line streams, techniques for summarizing data are necessary. In an off-line change detection method, the entire data set is available to the analysis process that detects the change. An online method detects the change incrementally, based on the most recently arrived data item. An important distinction between the off-line method and the online one is that the online method is constrained by detection and reaction time, due to the requirements of real-time applications, while the off-line method is free from these constraints. Methods for detecting changes can be useful for streaming data warehouses, where both live streams of data and archived data streams are available [24; 29]. In this work, we focus on developing methods for detecting changes in online data streams, in particular sensor data streams.
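The window-based, nonparametric scheme described in the list above can be illustrated with a minimal Python sketch. It compares a reference window with a sliding current window using the two-sample Kolmogorov-Smirnov statistic. The function names, the window size, the significance level, and the policy of resetting the reference window after an alarm are illustrative assumptions, not the method of any cited work. The threshold is the usual large-sample critical value, and because the test is repeated at every step, the nominal significance level is only approximate.

```python
from collections import deque
import math


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def detect_changes(stream, window=100, alpha=0.01):
    """Yield the indices at which the current window differs from the reference."""
    # Large-sample critical value for two windows of equal size (assumption).
    threshold = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt(2.0 / window)
    reference, current = [], deque(maxlen=window)
    for t, x in enumerate(stream):
        if len(reference) < window:
            reference.append(x)  # fill the reference window first
            continue
        current.append(x)  # sliding window over the most recent items
        if len(current) == window and ks_statistic(reference, current) > threshold:
            yield t
            # Illustrative policy: restart with the current window as reference.
            reference, current = list(current), deque(maxlen=window)
```

Calling `list(detect_changes(values))` on a finite sequence returns the alarm positions. A larger window lowers the threshold and makes the test more sensitive, at the cost of a longer detection delay, which matches the trade-off noted for window-based methods.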
The first work on model-based change detection, proposed in [21; 22], is FOCUS. The central idea behind FOCUS is that models can be divided into structural and measurement components. To detect deviation between two models, the authors compare specific parts of the corresponding models. The models obtained by data mining algorithms include frequent item sets, decision trees, and clusters. A change in the model may convey interesting information or knowledge about an event or phenomenon. Model change is defined in terms of the difference between the two sets of parameters of two models and the quantitative characteristics of the two models. As such, model change detection amounts to finding the difference between these parameter sets and quantitative characteristics. We should distinguish between detecting changes in the data distribution by using models and detecting changes in a model built from streaming data. While model change detection aims to identify the difference between two models, change detection in the underlying data distribution by using models infers the changes between two data sets from the difference between the two models constructed from them. Changes in the underlying data distribution can induce corresponding changes in the model produced from the data generating process.
As models can be generated by statistical methods or data mining methods, change detection in models can be classified by whether the model is a data mining model or a statistical model. The two kinds of models in which we are interested in detecting changes are predictive models and explanatory models. A predictive model is used to predict future changes. Detecting changes in patterns can be beneficial for many applications. In an explanatory model, a change that occurred is both detected and explained. There are several approaches to change detection: the one-model approach, the two-model approach, and the multiple-model approach.
A model-based change detection algorithm consists of two phases: model construction and change detection. First, a model is built by using some stream mining method such as decision trees, clustering, or frequent patterns. Second, a difference measure between two models is computed based on the characteristics of the model; this step is also called the quantification of model difference. Therefore, one fundamental issue here is to quantify the changes between two models and to determine the criteria for deciding whether and when a change in the model occurs. Recently, some change detection methods for streaming data based on clustering have been proposed [10; 3]. Depending on the data stream mining model, there are corresponding problems of detecting changes in the model, as follows. Ikonomovska et al. [30] have presented an algorithm for learning regression trees from streaming data in the presence of concept drift. Their change detection method is based on sequential statistical tests that monitor the changes of the local error at each node of the tree and inform the learning process of the local changes.
Detecting changes in stream cluster models has received increasing attention. Zhou et al. [59] have presented a method for tracking the evolution of clusters over sliding windows by using temporal cluster features and the exponential histogram, a structure called the exponential histogram of cluster features. Chen and Liu [9] have presented a framework for detecting changes in clustering structures constructed from categorical data streams by using hierarchical entropy trees to capture the entropy characteristics of clusters, and then detecting changes in the clustering structures based on these entropy characteristics.
Depending on the data stream mining model, there are corresponding problems of detecting changes in the model [14]. Recently, Ng and Dash [44] have introduced an algorithm for mining frequent patterns from evolving data streams. Their algorithm is capable of updating the frequent patterns based on algorithms for detecting changes in the underlying data distributions. Two windows are used for change detection: the reference window and the current window. At the initial stage, the reference window is initialized with the first batch of transactions from the data stream. The current window moves along the data stream and captures the next batch of transactions. Two frequent item sets are constructed from the two corresponding windows by using the Apriori algorithm. A statistical test is performed on the two absolute support values that Apriori computes from the reference window and the current window. Based on the statistical test, the deviation is either significant or insignificant. If the deviation is significant, a change in the data stream is reported. Chang and Lee [8] have presented a method for monitoring recent changes in frequent item sets from a data stream by using a sliding window.
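The comparison of support values between a reference window and a current window can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which statistical test is applied, so a two-proportion z-test is used here as a stand-in, and supports are counted directly for a given item set instead of being produced by an Apriori implementation. The names `support` and `support_changed` and the default `z_crit` are hypothetical.

```python
import math


def support(window, itemset):
    """Absolute support: number of transactions containing every item of itemset."""
    items = frozenset(itemset)
    return sum(1 for transaction in window if items <= set(transaction))


def support_changed(reference, current, itemset, z_crit=2.576):
    """Two-proportion z-test on the supports of one item set in two windows.

    z_crit=2.576 corresponds to a two-sided 1% significance level (assumption).
    """
    n_ref, n_cur = len(reference), len(current)
    s_ref, s_cur = support(reference, itemset), support(current, itemset)
    pooled = (s_ref + s_cur) / (n_ref + n_cur)
    variance = pooled * (1 - pooled) * (1 / n_ref + 1 / n_cur)
    if variance == 0:
        return False  # support is 0% or 100% in both windows: no evidence of change
    z = (s_ref / n_ref - s_cur / n_cur) / math.sqrt(variance)
    return abs(z) > z_crit
```

Each window is a list of transactions, and each transaction is any iterable of items. A significant deviation for a monitored item set would then be reported as a change in the stream.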
2.3 Design Methodology
There are two design methodologies for developing change detection algorithms for streaming data. The first methodology is to adapt existing change detection methods to streaming data. However, many traditional change detection methods, such as some kernel-based and density-based change detection methods, cannot be extended to streaming data because of their high computational complexity. The second methodology is to develop new change detection methods for streaming data.
There are two common approaches to the problem of change detection in streaming data distributions: distance-based change detectors and predictive model-based change detectors. In the former, two windows are used to extract two data segments from the data stream. The change is quantified by using some dissimilarity measure. If the dissimilarity measure is greater than a given threshold, a change is detected. In the latter, as in distance-based change detectors, two windows are used for detecting changes. However, instead of comparing the dissimilarity measure between two windows with a given threshold, a change is detected by using the prediction error of the model built from the current window and the predictive model constructed from the reference window.
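A predictive model-based detector can be sketched in a few lines of Python. The sketch makes several assumptions that are not in the text: the predictive model is a least-squares line fitted to (x, y) pairs from the reference window, the prediction error is the mean squared error, and a change is flagged when the error on the current window exceeds `factor` times the error on the reference window. All names and default values are hypothetical.

```python
def fit_line(points):
    """Least-squares fit of y = intercept + slope * x to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def mean_squared_error(model, points):
    """Average squared prediction error of the line model on the given points."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)


def prediction_error_changed(reference, current, factor=3.0, floor=1e-9):
    """Flag a change when the reference model predicts the current window poorly."""
    model = fit_line(reference)
    # The floor avoids a zero baseline when the reference window is fit exactly.
    baseline = max(mean_squared_error(model, reference), floor)
    return mean_squared_error(model, current) > factor * baseline
```

Unlike a distance-based detector, this check never compares the two windows directly. It only asks whether the relationship learned from the reference window still holds on the current one.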
erance. Scalability refers to the ability to extend the size of the network without significantly reducing the performance of the framework. As faults may occur due to transmission errors and the effects of noisy channels between the local sensors and the fusion center, a distributed change detection method should be able to tolerate these faults in order to keep the system functioning.
• Distributed change detection using the local approach is directly relevant to the problem of multiple hypothesis testing and data fusion, because each local change detector needs to perform a hypothesis test to determine whether a change has occurred. Therefore, besides the detection performance of the local change detection algorithms at the node level, including the probability of detection and the probability of false alarm, the detection performance of the distributed change detection method at the fusion center must be taken into account.
Distributed detection and data fusion have been widely studied for many decades. However, distributed detection in streaming data has received attention only recently.
3.1 Distributed Detection: One-shot versus Continuous
Distributed detection of changes can be classified into two types of models as follows.
• One-shot distributed detection of changes: Figure 2 shows two models of one-shot distributed change detection. In a one-shot change detection method, a change detector detects a change and reacts to it once the change is detected. One-shot distributed change detection has received a great deal of attention for a long time. It includes two models: distributed detection with decision fusion, as shown in Figure 2(a), and distributed detection without decision fusion, as illustrated in Figure 2(b). What are the differences between one-shot detection and continuous detection?
Knowledge discovery from massive amounts of streaming data can be achieved only if we can develop change detection frameworks that monitor streaming data created by multiple sources, such as sensor networks and the WWW [13]. The objectives of designing a distributed change detection scheme are maximizing the lifetime of the network, maximizing the detection capability, and minimizing the communication cost [58].
There are two approaches to the problem of change detection in streaming data created from multiple sources. In the centralized approach, all remote sites send raw data to the coordinator. The coordinator aggregates all the raw streaming data received from the remote sites, and detection of changes is performed on the aggregated streaming data. In most cases, communication consumes the largest amount of energy. The lifetime of sensors is therefore drastically reduced when they communicate raw measurements to a centralized server for analysis. Centralized approaches suffer from the following problems: communication constraints, power consumption, robustness, and privacy. Distributed detection of changes in streaming data addresses the challenges that come from the problem of change detection, data stream processing, and the problem of distributed computing. The challenges coming from the distributed computing environment are as follows.
• are
Distributed
change detection in streaming data is a problem
of distributed computing in nature. Therefore, a distributed
• Distributed
change
detection
in streaming
data isthe
a problem
framework for
detecting
changes
should meet
properof
distributed
computing
in
nature.
Therefore,
a
distributed
ties of distributed computing such scalability, and fault tolframework for detecting changes should meet the properties of distributed computing such scalability, and fault tol-
As
of the properties
of distributed Computing
computational systems
3.2oneLocality
in Distributed
is locality [43], a distributed algorithm for detecting changes in
As one of the properties of distributed computational systems
streaming data should meet the locality. A local algorithm is deis locality [43], a distributed algorithm for detecting changes in
fined as one whose resource consumption is independent of the
streaming data should meet the locality. A local algorithm is desystem size. The scalability of distributed stream mining algofined
onebewhose
resource
consumption
is independent
the
rithmsascan
achieved
by using
the local change
detectionof
algosystem
rithms size. The scalability of distributed stream mining algorithms can be achieved by using the local change detection algoLocal algorithms can fall into one of two categories [16]:Exact
rithms
local algorithms are defined as ones that produce the same results
Local
algorithmsalgorithm;
can fall into
one of twolocal
categories
[16]:Exact
as a centralized
Approximate
algorithms
are allocal
algorithms
are
defined
as
ones
that
produce
the
same
results
gorithms that produce approximations of the results that centralas
a centralized
Approximate
local algorithms
alized
algorithms algorithm;
would produce.
Two attractive
properties are
of logorithms
that
produce
approximations
of
the
results
that
centralcal algorithms are scalability and fault tolerance. A distributed
ized algorithms would produce. Two attractive properties of local algorithms are scalability and fault tolerance. A distributed
DISTRIBUTED CHANGE DETECTION
IN
STREAMINGCHANGE
DATA DETECTION
DISTRIBUTED
Knowledge
discovery from massive
amount of streaming data
IN STREAMING
DATA
SIGKDD Explorations
SIGKDD Explorations
one-shot
detection
and continuous
detection.
• tween
Continuous
distributed
detection
of changes:
In this chapter,
we
propose
two
continuous
distributed
detection
mod• Continuous distributed detection of changes:
In this chapels
as
shown
in
Figure
3.
An
important
distinction
between
ter, we propose two continuous distributed detection modcontinuous
of changes
and one-shot
els as showndistributed
in Figure 3.detection
An important
distinction
between
distributed
detection
of
changes
is
that
the
inputs
to the
continuous distributed detection of changes and
one-shot
one-shot distributed change detection are batches of data
distributed detection of changes is that the inputs to the
while the inputs to the continuous distributed detection of
one-shot distributed change detection are batches of data
changes are the data streams in which data items continuwhile the inputs to the continuous distributed detection of
ously arrive.
changes are the data streams in which data items continuously detection
arrive. model without fusion is a truly distributed
Distributed
detection model in which The decision-making process occurs at
Distributed detection model without fusion is a truly distributed
each sensor.
detection model in which The decision-making process occurs at
each
3.2 sensor.
Locality in Distributed Computing
[Figure 2: One-shot distributed change detection models. (a) Distributed detection without decision fusion. (b) Distributed detection with decision fusion. Diagram labels: Phenomenon; x1, x2; u1, u2; Local decision; Global decision.]
[Figure 3: Continuous distributed change detection models. (a) Distributed continuous detection without decision fusion. (b) Distributed continuous detection with decision fusion. Diagram labels: Phenomenon; Data stream 1, Data stream 2; u1, u2; Local decision; Global decision.]
framework for mining streaming data should be robust to network partitions and node failures.
An advantage of local approaches is the ability to preserve privacy [20]. A drawback of the local approach to the problem of distributed change detection is the synchronization problem. For example, the local change detection approach can meet the principle of localized algorithms in wireless sensor networks, in which data processing is performed at node level as much as possible in order to reduce the amount of information sent over the network.
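The node-level processing principle described above can be illustrated with a small sketch. It is not taken from the paper: the class name `LocalMeanShiftDetector`, the two-window mean comparison, and the parameter values are all assumptions chosen for illustration. Each node keeps a fixed amount of state regardless of how many other nodes exist, and it reports only when its local decision flips, instead of shipping every raw measurement to a coordinator.

```python
from collections import deque


class LocalMeanShiftDetector:
    """Hypothetical node-level detector (illustrative assumption, not the paper's method).

    It compares the mean of the most recent `window` items with the mean of
    the `window` items that preceded them, and flags a change when the two
    means differ by more than `threshold`.
    """

    def __init__(self, window=5, threshold=1.0):
        self.window = window
        self.threshold = threshold
        self.reference = deque(maxlen=window)  # older items
        self.recent = deque(maxlen=window)     # newest items

    def update(self, x):
        # When the recent window is full, its oldest item moves to the reference window.
        if len(self.recent) == self.window:
            self.reference.append(self.recent[0])
        self.recent.append(x)
        if len(self.reference) < self.window:
            return False  # not enough history yet
        ref_mean = sum(self.reference) / self.window
        rec_mean = sum(self.recent) / self.window
        return abs(rec_mean - ref_mean) > self.threshold


def messages_sent(stream, detector):
    """Count the messages a node sends if it reports only changes of its local decision."""
    sent = 0
    previous = False
    for x in stream:
        decision = detector.update(x)
        if decision != previous:
            sent += 1
            previous = decision
    return sent


# Synthetic stream with one abrupt shift; a centralized scheme would send all 40 values.
stream = [0.0] * 20 + [5.0] * 20
local_messages = messages_sent(stream, LocalMeanShiftDetector(window=5, threshold=1.0))
print(len(stream), local_messages)
```

The memory used per node depends only on `window`, not on the size of the network, which is the sense in which such a detector is local. The synchronization drawback mentioned above is not addressed by this sketch.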
3.3 Distributed Detection of Changes in Streaming Data
Over the last decades, the problem of decentralized detection has received much attention. There are two directions of research on decentralized detection. The first focuses on aggregating measurements from multiple sensors to test a single hypothesis. The second focuses on dealing with multiple dependent testing/estimation tasks from multiple sensors [51]. Distributed change detection usually involves a set of sensors that receive observations from the environment and then transmit those observations back to a fusion center in order to reach the final consensus of detection. Decentralized detection and data fusion are therefore two closely related tasks that arise in the context of sensor networks [48; 47]. Two traditional approaches to decentralized change detection are data fusion and decision fusion. In data fusion, each node detects change and sends a quantized version of its observation to a fusion center, which is responsible for making the decision on the detected changes and for further relaying information. In contrast, in decision fusion, each node performs local change detection by using some local change detection algorithm, updates its decision based on the received information, and broadcasts its new decision again. This process repeats until consensus among the nodes is reached. Compared to data fusion, decision fusion can reduce the communication cost because sensors need only transmit their local decisions, which are represented by small data structures.
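The contrast between the two fusion schemes can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol of any cited work: the function names, the uniform quantizer, the averaging rule at the fusion center, the majority-vote update, and the tie-breaking rule (a tie counts as "no change") are all hypothetical choices.

```python
def data_fusion(observations, levels, lo, hi, threshold):
    """Data fusion sketch: nodes send quantized observations, the fusion center decides.

    Assumption: a uniform quantizer with `levels` levels on [lo, hi], and a
    fusion center that thresholds the average of the quantized values.
    """
    step = (hi - lo) / (levels - 1)
    quantized = [
        lo + round((min(max(x, lo), hi) - lo) / step) * step
        for x in observations
    ]
    return sum(quantized) / len(quantized) > threshold


def decision_fusion(local_decisions, neighbours, max_rounds=10):
    """Decision fusion sketch: nodes exchange one-bit decisions until they agree.

    In every round each node adopts the majority of its own decision and the
    decisions broadcast by its neighbours (assumption: ties resolve to False).
    Returns (consensus, rounds); consensus is None if max_rounds is exhausted.
    """
    decisions = list(local_decisions)
    rounds = 0
    while len(set(decisions)) > 1 and rounds < max_rounds:
        updated = []
        for i, own in enumerate(decisions):
            votes = [own] + [decisions[j] for j in neighbours[i]]
            updated.append(sum(votes) * 2 > len(votes))
        decisions = updated
        rounds += 1
    consensus = decisions[0] if len(set(decisions)) == 1 else None
    return consensus, rounds


# Five fully connected nodes; three of them detected a change locally.
full_mesh = {i: [j for j in range(5) if j != i] for i in range(5)}
result = decision_fusion([True, True, False, True, False], full_mesh)
print(result)
```

Only one bit per node and round crosses the network in `decision_fusion`, whereas `data_fusion` needs enough bits to encode one of `levels` values per observation. On sparse topologies a synchronous majority rule is not guaranteed to converge, which is why the sketch caps the number of rounds.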
Although there is a great deal of work on distributed detection and data fusion, most of this work focuses on one-time change detection solutions. A one-time query is defined as a query that needs to process the data once in order to provide the answer [12]. Likewise, a one-time change detection method is one that needs to process the data once in response to a change that has occurred. In real-world applications, we need approaches capable of continuously monitoring the changes of the events occurring in the environment. Recently, work on continuous detection and monitoring of changes has started to receive attention [49; 13; 50]. Das et al. [13] presented a scalable distributed framework for detecting changes in astronomy data streams using local, asynchronous eigen monitoring algorithms. Palpanas et al. [49] proposed a distributed framework for outlier detection in real-time data streams. In their framework, each sensor estimates and maintains a model of its underlying distribution by using kernel density estimators. However, they did not show how to reach the global detection decision.
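The idea of a sensor maintaining a kernel density model of its own readings can be sketched as follows. This does not reproduce the framework of Palpanas et al. [49]; the class name `SlidingKDE`, the Gaussian kernel, the fixed bandwidth, the sliding window, and the density threshold are assumptions made purely for illustration.

```python
import math
from collections import deque


class SlidingKDE:
    """Hypothetical per-sensor model: a Gaussian kernel density estimate
    over a sliding window of the most recent readings."""

    def __init__(self, window=100, bandwidth=0.5):
        self.samples = deque(maxlen=window)  # oldest readings are dropped automatically
        self.bandwidth = bandwidth

    def add(self, x):
        self.samples.append(x)

    def density(self, x):
        """Estimated probability density at x (0.0 if no samples were seen yet)."""
        if not self.samples:
            return 0.0
        h = self.bandwidth
        norm = 1.0 / (len(self.samples) * h * math.sqrt(2.0 * math.pi))
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in self.samples)

    def is_deviation(self, x, min_density=0.01):
        """Flag x as a deviation when the model assigns it a very low density."""
        return self.density(x) < min_density


model = SlidingKDE(window=100, bandwidth=0.5)
for reading in [0.0, 0.1, -0.1, 0.2, -0.2]:
    model.add(reading)
print(model.is_deviation(0.0), model.is_deviation(5.0))
```

Such a model yields only a local verdict per sensor. Combining these verdicts into a global detection decision is the open step noted above, and this sketch does not attempt it.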
4. CONCLUDING REMARKS
We argued in this paper that variability, or simply change, is crucial in a world full of factors that alter the behavior of the data and, consequently, the underlying model. The ability to detect such changes in centralized as well as distributed systems plays an important role in assessing the validity of data models. The paper presented the state of the art in this area of paramount importance. Techniques, in some cases, are tightly coupled with application domains. However, most of the techniques reviewed in this paper are generic and could be adapted to different application domains.
With Big Data technologies reaching a mature stage, future work in change detection is expected to exploit such scalable data processing tools to efficiently detect, localize, and classify occurring changes. For example, distributed change detection models can make use of the MapReduce framework to accelerate their respective processes.
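As a rough indication of what a MapReduce-style formulation could look like, the sketch below emulates the map and reduce steps with plain Python functions on in-memory partitions. It does not use Hadoop or any real MapReduce API, and the function names, the window labels "reference" and "current", and the mean-shift test are assumptions made for illustration only.

```python
from functools import reduce


def map_partition(partition):
    """Map step: summarise one data partition as (count, sum) per window label."""
    stats = {}
    for label, value in partition:
        count, total = stats.get(label, (0, 0.0))
        stats[label] = (count + 1, total + value)
    return stats


def reduce_stats(a, b):
    """Reduce step: merge two partial summaries by adding counts and sums."""
    merged = dict(a)
    for label, (count, total) in b.items():
        c0, t0 = merged.get(label, (0, 0.0))
        merged[label] = (c0 + count, t0 + total)
    return merged


def mean_shift_detected(partitions, threshold):
    """Flag a change when the global means of the two windows differ by more than threshold."""
    totals = reduce(reduce_stats, map(map_partition, partitions), {})
    means = {label: total / count for label, (count, total) in totals.items()}
    return abs(means["current"] - means["reference"]) > threshold


# Two partitions, each holding labelled items from both windows.
partitions = [
    [("reference", 1.0), ("current", 4.0)],
    [("reference", 3.0), ("current", 6.0)],
]
print(mean_shift_detected(partitions, threshold=1.0))
```

Because the partial summaries are merged by addition, the reduce step is associative and commutative, so partitions can be processed in any order or in parallel.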
5. REFERENCES
[1] C. Aggarwal. A framework for diagnosing changes in evolving data streams. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 575–586. ACM New York, NY, USA, 2003.
[2] C. Aggarwal. On change diagnosis in evolving data streams. IEEE Transactions on Knowledge and Data Engineering, pages 587–600, 2005.
[3] C. Aggarwal. A segment-based framework for modeling and mining data streams. Knowledge and information systems, 30(1):1–29, 2012.
[4] C. Aggarwal and P. Yu. A survey of synopsis construction in data streams. Data streams: models and algorithms, page 169, 2007.
[5] A. Bondu and M. Boullé. A supervised approach for change detection in data streams. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 519–526. IEEE, 2011.
[6] S. Boriah, V. Kumar, M. Steinbach, C. Potter, and S. Klooster. Land cover change detection: a case study. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 857–865. ACM, 2008.
[7] G. Cabanes and Y. Bennani. Change detection in data streams through unsupervised learning. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1–6. IEEE, 2012.
[8] J. Chang and W. Lee. estwin: adaptively monitoring the recent change of frequent itemsets over online data streams. In Proceedings of the twelfth international conference on Information and knowledge management, pages 536–539. ACM, 2003.
[9] K. Chen and L. Liu. HE-Tree: a framework for detecting changes in clustering structure for categorical data streams. The VLDB Journal, pages 1–20.
[10] T. Chen, C. Yuan, A. Sheikh, and C. Neubauer. Segment-based change detection method in multivariate data stream, Apr. 9 2009. WO Patent WO/2009/045,312.
[11] G. Cormode. The continuous distributed monitoring model. SIGMOD Record, 42(1):5, 2013.
[12] G. Cormode and M. Garofalakis. Efficient strategies for continuous distributed tracking tasks. IEEE Data Engineering Bulletin, 28(1):33–39, 2005.
[13] K. Das, K. Bhaduri, S. Arora, W. Griffin, K. Borne, C. Giannella, and H. Kargupta. Scalable Distributed Change Detection from Astronomy Data Streams using Local, Asynchronous Eigen Monitoring Algorithms. In SIAM International Conference on Data Mining, Nevada, 2009.
[14] T. Dasu, S. Krishnan, D. Lin, S. Venkatasubramanian, and K. Yi. Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams. In Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, page 34. Springer, 2009.
[15] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In 38th Symposium on the Interface of Statistics, Computing Science, and Applications. Citeseer, 2005.
[16] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, pages 18–26, 2006.
[17] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[18] N. Dindar, P. M. Fischer, M. Soner, and N. Tatbul. Efficiently correlating complex events over live and archived data streams. In ACM DEBS Conference, 2011.
[19] G. Dong, J. Han, L. Lakshmanan, J. Pei, H. Wang, and P. Yu. Online mining of changes from data streams: Research problems and preliminary results. Citeseer.
[20] A. R. Ganguly, J. Gama, O. A. Omitaomu, M. M. Gaber, and R. R. Vatsavai. Knowledge discovery from sensor data, volume 7. CRC, 2008.
[21] V. Ganti, J. Gehrke, and R. Ramakrishnan. A framework for measuring changes in data characteristics. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 126–137. ACM, 1999.
[22] V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Loh. A framework for measuring differences in data characteristics. Journal of Computer and System Sciences, 64(3):542–578, 2002.
[23] S. Geisler, C. Quix, and S. Schiffer. A data stream-based evaluation framework for traffic information systems. In Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming, pages 11–18. ACM, 2010.
[24] L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk. Stream warehousing with datadepot. In Proceedings of the 35th SIGMOD international conference on Management of data, pages 847–854. ACM, 2009.
[25] A. J. Hey, S. Tansley, and K. M. Tolle. The fourth paradigm: data-intensive scientific discovery. Microsoft Research Redmond, WA, 2009.
[26] S. Hido, T. Idé, H. Kashima, H. Kubo, and H. Matsuzawa. Unsupervised change analysis using supervised learning. In Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining, pages 148–159. Springer-Verlag, 2008.
[27] S. Ho and H. Wechsler. Detecting changes in unlabeled data streams using martingale. In Proceedings of the 20th international joint conference on Artificial intelligence, pages 1912–1917. Morgan Kaufmann Publishers Inc., 2007.
[28] W. Huang, E. Omiecinski, and L. Mark. Evolution in Data Streams. 2003.
[29] W. Huang, E. Omiecinski, L. Mark, and M. Nguyen. History guided low-cost change detection in streams. Data Warehousing and Knowledge Discovery, pages 75–86, 2009.
[30] E. Ikonomovska, J. Gama, R. Sebastião, and D. Gjorgjevik. Regression trees from data streams with drift detection. In Discovery Science, pages 121–135. Springer, 2009.
[31] M. Karnstedt, D. Klan, C. Pölitz, K.-U. Sattler, and C. Franke. Adaptive burst detection in a stream engine. In Proceedings of the 2009 ACM symposium on Applied Computing, pages 1511–1515. ACM, 2009.
[32] Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In Proceedings of 2009 SIAM International Conference on Data Mining (SDM2009), pages 389–400, 2009.
[33] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, page 191. VLDB Endowment, 2004.
[34] A. Kim, C. Marzban, D. Percival, and W. Stuetzle. Using labeled data to evaluate change detectors in a multivariate streaming environment. Signal Processing, 89(12):2529–2536, 2009.
[35] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pages 234–247. ACM New York, NY, USA, 2003.
[36] L. Kuncheva. Change detection in streaming multivariate data using likelihood detectors. Knowledge and Data Engineering, IEEE Transactions on, (99):1–1, 2011.
[37] X. Liu, X. Wu, H. Wang, R. Zhang, J. Bailey, and K. Ramamohanarao. Mining distribution change in stock order streams. Proc. of ICDE, pages 105–108, 2010.
[38] G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th international conference on Very Large Data Bases, pages 346–357. VLDB Endowment, 2002.
[39] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Byers. Big data: The next frontier for innovation, competition and productivity. McKinsey Global Institute, May, 2011.
[40] A. Maslov, M. Pechenizkiy, T. Kärkkäinen, and M. Tähtinen. Quantile index for gradual and abrupt change detection from cfb boiler sensor data in online settings. In Proceedings of the Sixth International Workshop on Knowledge Discovery from Sensor Data, pages 25–33. ACM, 2012.
[41] S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.
[42] S. Muthukrishnan, E. van den Berg, and Y. Wu. Sequential change detection on data streams. ICDM Workshops, 2007.
[43] M. Naor and L. Stockmeyer. What can be computed locally? pages 184–193, 1993.
[44] W. Ng and M. Dash. A change detector for mining frequent patterns over evolving data streams. In Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on, pages 2407–2412. IEEE, 2008.
[45] W. Ng and M. Dash. A test paradigm for detecting changes in transactional data streams. In Database Systems for Advanced Applications, pages 204–219. Springer, 2008.
[46] D. Nikovski and A. Jain. Fast adaptive algorithms for abrupt change detection. Machine learning, 79(3):283–306, 2010.
[47] R. Niu and P. K. Varshney. Performance analysis of distributed detection in a random sensor field. Signal Processing, IEEE Transactions on, 56(1):339–349, 2008.
[48] R. Niu, P. K. Varshney, and Q. Cheng. Distributed detection in a large wireless sensor network. Information Fusion, 7(4):380–394, 2006.
[49] T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Distributed deviation detection in sensor networks. ACM SIGMOD Record, 32(4):77–82, 2003.
[50] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya. Anomaly detection in large-scale data stream networks. Data Mining and Knowledge Discovery, 28(1):145–189, 2014.
[51] R. Rajagopal, X. Nguyen, S. C. Ergen, and P. Varaiya. Distributed online simultaneous fault detection for multiple sensors. In Information Processing in Sensor Networks, 2008. IPSN'08. International Conference on, pages 133–144. IEEE, 2008.
[52] J. Roddick, L. Al-Jadir, L. Bertossi, M. Dumas, H. Gregersen, K. Hornsby, J. Lufter, F. Mandreoli, T. Mannisto, E. Mayol, et al. Evolution and change in data management: issues and directions. ACM Sigmod Record, 29(1):21–25, 2000.
[53] G. Ross, D. Tasoulis, and N. Adams. Online annotation and
regimeand
switching
dataOnline
streams.
In Proceed[53] prediction
G. Ross, D.for
Tasoulis,
N. Adams.
annotation
and
ings
of the for
2009
ACMswitching
symposium
on streams.
Applied In
Computing,
prediction
regime
data
Proceedpages
ACM,
2009.
ings of1501–1505.
the 2009 ACM
symposium
on Applied Computing,
pages 1501–1505. ACM, 2009.
[54] A. Singh. Review Article Digital change detection techusing
remotely-sensed
data. International
Journal
of
[54] niques
A. Singh.
Review
Article Digital
change detection
techRemote
Sensing,
10(6):989–1003,
niques using
remotely-sensed
data.1989.
International Journal of
Remote Sensing, 10(6):989–1003, 1989.
[55] X. Song, M. Wu, C. Jermaine, and S. Ranka. Statistical
data. In Statistical
Proceed[55] change
X. Song,detection
M. Wu,for
C. multi-dimensional
Jermaine, and S. Ranka.
ings
of detection
the 13th ACM
SIGKDD international
change
for multi-dimensional
data. Inconference
Proceedon
discovery
and datainternational
mining, pagesconference
667–676.
ingsKnowledge
of the 13th
ACM SIGKDD
ACM,
2007. discovery and data mining, pages 667–676.
on Knowledge
ACM, 2007.
[56] D.-H. Tran and K.-U. Sattler. On detection of changes in
data streams.
In Proceedings
of the 9thof
International
[56] sensor
D.-H. Tran
and K.-U.
Sattler. On detection
changes in
Conference
on Advances
in Mobile of
Computing
and Multisensor data streams.
In Proceedings
the 9th International
media,
pageson50–57.
ACM,
Conference
Advances
in 2011.
Mobile Computing and Multimedia, pages 50–57. ACM, 2011.
[57] M. van Leeuwen and A. Siebes. Streamkrimp: Detecting
data streams.
Machine
and Knowledge
[57] change
M. van in
Leeuwen
and A.
Siebes.Learning
Streamkrimp:
Detecting
Discovery
in Databases,
pages 672–687,
2008.
change in data
streams. Machine
Learning
and Knowledge
Discovery in Databases, pages 672–687, 2008.
[58] V. Veeravalli and P. Varshney. Distributed inference in wiresensor networks.
Philosophical
Transactions
the
[58] less
V. Veeravalli
and P. Varshney.
Distributed
inference inofwireRoyal
Societynetworks.
A: Mathematical,
Physical
and Engineering
less sensor
Philosophical
Transactions
of the
Sciences,
370(1958):100–117,
2012.
Royal Society
A: Mathematical,
Physical and Engineering
Sciences,
370(1958):100–117,
2012.
[59] A. Zhou, F. Cao, W. Qian, and C. Jin. Tracking clusters
overand
sliding
windows.
Knowledge
[59] in
A. evolving
Zhou, F.data
Cao,streams
W. Qian,
C. Jin.
Tracking
clusters
and
Information
Systems,
15(2):181–214,
2008.
in evolving data streams over sliding windows. Knowledge
and Information Systems, 15(2):181–214, 2008.
SIGKDD Explorations
Volume 16, Issue 1
Page 38
Contextual Crowd Intelligence
Beng Chin Ooi†, Kian-Lee Tan†, Quoc Trung Tran†, James W. L. Yip§,
Gang Chen#, Zheng Jye Ling§, Thi Nguyen†, Anthony K. H. Tung†, Meihui Zhang†
† National University of Singapore § National University Health System # Zhejiang University
† {ooibc, tankl, tqtrung, thi, atung, zmeihui}@comp.nus.edu.sg
§ {james yip, zheng jye ling}@nuhs.edu.sg
# cg@zju.edu.cn
ABSTRACT
Most data analytics applications are industry/domain specific, e.g.,
predicting patients at high risk of being admitted to an intensive care
unit in the healthcare sector, or predicting malicious SMSs in the
telecommunication sector. Existing solutions are based on “best
practices”, i.e., the systems’ decisions are knowledge-driven
and/or data-driven. However, there are rules and exceptional cases
that can only be precisely formulated and identified by
subject-matter experts (SMEs) who have accumulated many years
of experience. This paper envisions a more intelligent database
management system (DBMS) that captures such knowledge to
effectively address industry/domain specific applications. At
its core, the system is a hybrid human-machine database engine
in which the machine interacts with the SMEs as part of a feedback
loop to gather, infer, ascertain and enhance the database
knowledge and processing. We discuss the challenges of
building such a system through examples in healthcare predictive
analytics, a popular area for big data analytics.
1. INTRODUCTION
Most data analytics applications are industry or domain specific.
For example, many prediction tasks in healthcare require prior
medical knowledge, such as identifying patients at high risk of
being admitted to the intensive care unit, or predicting the
probability of patients being readmitted to the hospital
within 30 days after discharge. Another example, from the
telecommunication sector, is the identification of malicious SMSs,
which requires input from security experts. Building competent tools
to effectively address these problems is important, as industrial
organizations face increasing pressure to improve outcomes while
reducing costs [3].
Existing solutions to industry or domain specific tasks are based
on “best practices”. These solutions are knowledge-driven (i.e.,
utilizing general guidelines such as existing clinical guidelines or
literature from medical journals) and/or data-driven (i.e., deriving
rules from observational data) [31]. Let us consider the task of
identifying the risk factors related to heart failure. The
knowledge-driven solution uses risk factors identified from
existing clinical knowledge or literature, such as age,
hypertension and diabetes status. However, it may miss other
unknown risk factors specific to the population of interest. The
reason is that the guidelines are generic and based on existing
knowledge, which results in models that may not adequately
represent the underlying complex disease processes in the
population with a comprehensive list of risk factors [31]. The
data-driven solution employs machine learning algorithms to
derive risk factors solely from observational data. An alternative
approach combines the knowledge-driven and data-driven
approaches in the data analytics applications [31]. However, there
are exceptional situations that are not easy to capture or
formalize, and for which neither general guidelines are available nor
rules can be derived from data (e.g., in rare conditions). Instead,
only through many years of experience can subject-matter
experts (SMEs) formulate and identify these situations. The
challenge then is to capture and utilize such knowledge
to effectively support industry/domain specific applications, e.g.,
by improving the accuracy of the prediction tasks.
This paper proposes building the next generation of intelligent
database management systems (DBMSs) that exploit contextual
crowd intelligence. The crowd intelligence here refers to the
knowledge and experience of subject-matter experts (SMEs).
Although such knowledge is an important component in
transforming data into information, it is currently not captured by
a structured system. The participants in an intelligent crowd are
domain experts rather than “unknown” lay-persons in existing
systems that use crowdsourcing as part of database query
processing (e.g., CrowdDB [13], Deco [24], Qurk [23],
CDAS [12; 22]) and information extraction or knowledge
acquisition (e.g., HIGGINS [21] and CASTLE [28]).
For applications where data confidentiality and privacy are important
(e.g., healthcare analytics), the intelligent crowd may consist
only of experts from within the organization, since the tasks cannot
be outsourced to external parties. Given that the crowd is known
a priori, there is an assurance of user accountability, which
translates to an assurance in the quality of the answers. A recent
system, called Data Tamer [30], also proposed leveraging an
expert crowdsourcing system to enhance machine computation, but
in the context of data curation. Our proposition differs from Data
Tamer in several aspects. First, the target applications of our work
(i.e., data analytics) are different from those of Data Tamer (i.e.,
data curation). Thus, each system needs to address a
different set of challenges. Second, the domain experts in our
context are also users/reviewers of the system. Thus, the experts
are likely to take ownership and hence are motivated to improve
the accuracy of the analytics and the usability of the applications.
This would reduce the need to localize/customize the system since
the experts/users are continuously interacting with the system;
these experts define the “best practices” for the system. For
example, doctors in a particular department may use a different
convention or notation from another department, e.g., when
doctors write “PID” in the orthopedic department, the acronym
refers only to “Prolapsed Intervertebral Disc” and not to
“Pelvic Inflammatory Disease”. Clearly, such knowledge can only
be provided by internal domain experts. In contrast, experts in
Data Tamer are not the users of the system and hence there is a
need to customize/localize the system for different use-cases.
In order to entrench the crowd intelligence into the DBMS, the
system needs to keep SMEs as part of the feedback loop. The
system can then utilize the feedback provided by the SMEs
to infer, ascertain and enhance its processing, thus continuously
improving its effectiveness. For example, when
predicting the risk of unplanned patient readmissions, the system
asks the doctors to label the patients whose readmission it
predicts with low confidence, and to state the
rules/hypotheses that they used to do the labeling. One
example of such an expert rule is that an elderly patient who lives
alone and has had several severe diseases is likely to be
readmitted to the hospital frequently. The system would then
verify or adjust these rules/hypotheses and report back to the
doctors with evidence to support or reject their rules/hypotheses.
Such interactions are beneficial to both the system and the doctors.
Eventually, the application system evolves over time. SMEs
become part of this evolving process by sharing their domain
knowledge and rich experience, thereby contributing to the
improvement and development of the system. Hence, the experts
are more willing and comfortable to use the system to alleviate the
burden of their duties.
This work is part of our CIIDAA project on building large scale,
Comprehensive IT Infrastructure for Data-intensive Applications
and Analysis [2]. Our collaborators are clinicians in the National
University Health System (NUHS) [5]. The project aims to harness
the power of cloud computing to solve big data problems in the real
world, with healthcare predictive analytics being a popular area for
big data analytics [26].
Organization.
The remainder of this paper is organized as
follows. Section 2 presents motivating examples in healthcare
predictive analytics. Section 3 discusses the architecture of an
intelligent DBMS that aims to embed contextual crowd
intelligence. Section 4 elaborates on research problems that we
need to address in order to build an intelligent DBMS. Section 5
presents our preliminary results on the problem of predicting the
risk of unplanned patient readmissions. Section 6 presents the
related work. Finally, Section 7 concludes our work.
2. MOTIVATING EXAMPLES
Let us consider a hospital that has an integrated view of the medical
care records of patients as shown in Table 1. The table contains two
types of information:
• Structured information, including the case identifier,
patient’s name, age, gender, race, the number of days that
the patient stayed at the hospital during a particular visit
(LengthOfStay), and the number of days before the patient
was readmitted to the hospital after discharge
(Readmission); and
• Unstructured information, i.e., free text from a doctor’s note
that contains additional useful information about a patient’s
healthcare profile, such as past medical history, social
factors, previous medications, complaints of patients based
on a doctor’s investigations, major lab results, issues and
progress, etc.
The tuples in this table are extracted from real cases of patients
admitted to the National University Hospital (NUH) in Singapore.
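As a rough illustration of the two kinds of information, one tuple of Table 1 could be modeled as a record with typed structured fields and a single free-text field. The class and field names below are assumptions made for this sketch, not part of the system described in this paper; the values are the two rows of Table 1, with the notes abbreviated.

```python
from dataclasses import dataclass


@dataclass
class CareRecord:
    """One tuple of the medical care table (hypothetical field names)."""
    case_id: str
    name: str
    age: int
    gender: str
    race: str
    length_of_stay: int  # days in hospital during this visit
    readmission: int     # days before readmission after discharge
    doctors_note: str    # unstructured free text


records = [
    CareRecord("Case 1", "Patient 1", 71, "Female", "Chinese", 5, 20,
               "PMH: 1 IHD - on GTN 0.5mg prn 2 DM - on Metformin 750mg "
               "- HbA1c 7.5% 09/12 3 HL Stays with son"),
    CareRecord("Case 2", "Patient 2", 60, "Male", "Malaysian", 10, 20,
               "Social issues: Single, no child. Now at sheltered home."),
]

# Structured fields can be filtered directly; the note needs text processing.
readmitted_within_30_days = [r.case_id for r in records if r.readmission <= 30]
print(readmitted_within_30_days)
```

The contrast is the point of the sketch: the readmission filter is a one-line comparison, whereas anything held in `doctors_note` has to be extracted first.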
Healthcare professionals often have queries relating to predicting
the severity of patients’ conditions, such as identifying patients at
high risk of being admitted to an intensive care unit, or predicting the
probability of patients being readmitted to the hospital soon
after discharge. There are also queries that monitor real-time data
of patients in critical condition for unusual events, such as
whether patients are at high risk of collapsing. With correct
predictions, doctors can intervene early to limit the
deterioration of a patient’s health. This can potentially
reduce the burden on limited healthcare resources in the primary
and acute care facilities. For instance, if a patient is at high risk
of unplanned post-discharge readmission, he can potentially
benefit from close follow-up after discharge, e.g., the hospital
sends a case manager or nurse to examine him once every three
days. In addition, important queries related to public health
surveillance can be answered in a timely fashion. For example, it
is critical to provide real-time, early information to alert
decision-makers to emerging threats that need to be addressed in a
particular population. The ultimate goal of these predictive queries
is to predict, pre-empt and prevent, for better healthcare outcomes.
3. AN INTELLIGENT DBMS FOR BIG DATA ANALYTICS
In this section, we discuss the challenges of addressing big data
analytics and present an overview of a hybrid human-machine
system for these tasks.
3.1 Challenges of Big Data Analytics
Essentially, many tasks of big data analytics can be viewed as
conventional data mining problems, such as classifying patients
into different classes (high or low risk of being admitted to
intensive care units). There are, however, three important aspects
that differentiate big data analytics from traditional machine
learning problems.
• First, many valuable features for the analytics tasks are
stored in unstructured data, for example, doctor’s notes [25].
We cannot simply treat these notes as traditional
“bag-of-words” documents. Instead, we need powerful tools
to extract from these documents the right entities (such as
diseases, medications, laboratory tests) and domain-specific
relationships (such as the relationship between a disease
and a laboratory test). The text in unstructured data has to
be contextualized to each organization’s practice, e.g.,
doctors in a particular department may use a different
convention or notation from another department.
• Second, there is usually a lack of training samples with
well-defined class labels. For instance, when predicting the
risk of committing suicide for each patient, the total number of
patients known to have committed suicide (i.e., class 1) is
very small. However, this does not mean that all the remaining
patients did not commit suicide (i.e., class 0). Hence we
need to infer the correct class labels for these patients. This
problem also occurs in other domains such as home security
and banking. For example, one important task that many
national security agencies need to perform is identifying
persons or groups of people who are likely to commit a
crime [4]. In this setting, the agency maintains a very small
set of people who have committed a crime. However, we
cannot simply assume that the remaining people are not
likely to commit a crime. As before, we need to infer the
correct class labels for these people. Another example is in
telecommunication, where a service provider wants to
predict whether an SMS is malicious. In this case, we do not
have any predefined class labels and might need to ask
security experts to provide the class labels for some sample
cases.
• Lastly, data in different domains (e.g., healthcare,
telecommunication, home security) is expected to grow
dramatically in the years ahead [26]. For instance, patients
in intensive care units are constantly being monitored, and
their historical records have to be retained. This can easily
result in hundreds of millions of (historical) patient
records. As another example, during a mass casualty
disaster (e.g., SARS, H5N1), there is an overwhelming
number of patients who have to be monitored and tracked,
and the information about each patient is huge by itself.
Furthermore, streaming data arrive continuously, e.g., new
data from the real-time data feed are constantly being
inserted. Hence, a system in a healthcare setting must
provide real-time predictions, e.g., predicting the
survival of patients in the next 6 hours.
The three aspects above call for a new generation of
intelligent DBMSs that can provide effective solutions for big data
analytics. Our proposition of exploiting contextual crowd
intelligence is, we believe, a big step towards this goal.
Table 1: Medical care table
CaseID  Name       Age  Gender  Race       LengthOfStay  Readmission
Case 1  Patient 1  71   Female  Chinese    5             20
Case 2  Patient 2  60   Male    Malaysian  10            20
Doctor’s note (Case 1):
PMH:
1 IHD
- on GTN 0.5mg prn
2 DM
- on Metformin 750mg
- HbA1c 7.5% 09/12
3 HL
Stays with son · · ·
Doctor’s note (Case 2):
Social issues: Single, no child
Used to live with friend in a shophouse
Now at sheltered home since Sept 2011.
No next-of-kin or visitor.
···
3.2 Contextual Data Management
The central theme of crowd intelligence is to engage domain experts
both as participants who fine-tune the system and as
end-users of the system. Figure 1 presents an intelligent system
that exploits contextual crowd intelligence for big data analytics.
The system first builds a knowledge base, to be used subsequently
for the analytics tasks, from historical data, domain
knowledge from SMEs (e.g., doctors), and other sources such as
general clinical guidelines. Each source contributes to building some
“weak classifiers”. The system needs to combine these classifiers
to derive a final classifier that achieves a high level of accuracy for
prediction purposes. The system also needs to go through several
iterations of interaction with the experts to refine, for example, the
final classifier. As such, the experts participate in the entire
process of fine-tuning the system and decide on the “best
practices”. When real-time data or a feed arrives, the system
performs the prediction on the fly and alerts the experts
immediately. Hence, the experts become the end-users of the
system.
We have developed the epiC system [1; 10; 19] to support large
scale data processing, and are extending it to support healthcare
analytics. Figure 2 shows the software stack of epiC. At the
bottom, the storage layer supports different storage systems (e.g.,
Hadoop Distributed File System (HDFS) and a key-value storage
system, ES2 [8]) for both unstructured and structured data. The
next layer (the security layer) enables users to protect
data privacy by encryption. The third layer (the
distributed processing layer) provides a distributed processing
infrastructure called E3 [9] that supports different parallel
processing logics such as MapReduce [11], Directed Acyclic
Graph (DAG) and SQL. The top layer (the analytics
layer) exploits the contextual crowd intelligence for big data
analytics. The details of this layer are shown in Figure 1. In
Figure 2, KB is the knowledge base and iCrowd is the component
that interacts with the domain experts. Different components of
the analytics layer (e.g., scalable machine learning algorithms) can
process their data with the most appropriate data processing model,
and their computations will be automatically executed in parallel
by the lower layers.
In the remainder of this paper, we focus only on the analytics layer.
For more details of the other layers of the epiC system, please refer
to [1; 10; 19].
4. RESEARCH PROBLEMS
In this section, we elaborate on the research problems that we need
to address in order to build an intelligent system for big data
analytics.
4.1 Asking Experts The Right Questions
Given a large volume of data and the limited amount of time that
domain experts can spend on building the systems, we need
to ask the experts the right questions. In the context of healthcare
analytics, we plan to elicit the following domain knowledge from
doctors.
• Labelings. The system asks doctors to label tuples for which the
system has low confidence in performing the prediction
task. There are two important issues here. First, doctors
have different levels of confidence when answering different
questions, i.e., doctors are reluctant to assess patient profiles
outside their specialties. Second, since there is so
much information about patients, selecting the relevant
features of each patient to present to the doctors, so as not
to overwhelm them, is also a major issue.
In essence, what we need is a diverse set of labeled patients
that covers the whole data space as much as possible. One
possible solution is to group similar patient profiles together
and show these groups to doctors. The purpose is to let the
doctors select the groups of patients for which they are
comfortable providing labels. In addition, for each
group, we present only the features for which the patients in the
group have similar values. In this way, we can avoid
overwhelming the doctors with information. Note that, in
some cases, we need to perform hierarchical clustering to
reduce the number of patients shown to the doctors each
time. Selecting the right clustering algorithms and
developing effective visualization tools to present patients’
profiles are important here.
• Rules/Hypotheses. The system collects the expert
rules/hypotheses that the doctors used to do the labeling.
For example, to predict the risk of unplanned patient
readmissions, the doctors suggested a hypothesis that social
factors and the status of the diseases are important risk
indicators for readmission. The system would then verify or
adjust these hypotheses and report back to the doctors with
evidence to support or reject their hypotheses. Such
interactions are beneficial to both the system and the
doctors.
• Inferred implicit knowledge. The system can also infer
implicit and valuable knowledge based on the
answers/reactions of the domain experts. For instance, if the
doctors label two patients who belong to a given cluster
differently, then the system can adjust the distance function
used to compute the similarity between two patients, and
thus infer which features are more important. Such
knowledge is implicit in that the doctors themselves may not be
aware of it.
Figure 1: Contextual crowd intelligence for big data analytics.
(Diagram labels: Historical data; Classifier from SMEs; Classifier from
data; Classifier from general guidelines; Classifier from other sources;
Classifier; Knowledge Base; Predictor; Real-time data/feed; SMEs;
Questions; Answers; Recommendations; Feedback; Derive; Update.)
We can also ask the same kind of questions for the analytics tasks
in other domains. For instance, to predict malicious SMSs, we
need to select a small set of messages (by utilizing some clustering
algorithms) and ask the experts to provide labels for these
samples. We also collect rules and heuristics that the experts
utilize to label the SMSs.
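The labeling step of this section can be sketched in a few lines. The sketch below is only one possible reading: it assumes that the system's confidence is a predicted probability (so the tuples closest to 0.5 are the least certain), and that "similar values" means equal categorical values or numeric values within a relative tolerance. The function names, the tolerance and the sample data are all hypothetical.

```python
def select_for_labeling(probabilities, budget):
    """Return the ids of the `budget` tuples predicted with least confidence.

    `probabilities` maps a tuple id to the predicted probability of the
    positive class; values near 0.5 are treated as low confidence.
    """
    ranked = sorted(probabilities,
                    key=lambda case_id: abs(probabilities[case_id] - 0.5))
    return ranked[:budget]


def shared_features(group, tolerance=0.1):
    """Return the features on which all patients in a group are similar.

    Numeric features are similar when their spread is at most `tolerance`
    times the largest absolute value; other features must be equal.
    """
    shared = []
    for name in group[0]:
        values = [patient[name] for patient in group]
        if all(isinstance(v, (int, float)) for v in values):
            scale = max(abs(v) for v in values) or 1.0
            if max(values) - min(values) <= tolerance * scale:
                shared.append(name)
        elif len(set(values)) == 1:
            shared.append(name)
    return shared


predicted = {"case_a": 0.95, "case_b": 0.52, "case_c": 0.10, "case_d": 0.40}
to_label = select_for_labeling(predicted, budget=2)

group = [
    {"age": 71, "gender": "Female", "length_of_stay": 5},
    {"age": 74, "gender": "Female", "length_of_stay": 12},
]
to_show = shared_features(group)
print(to_label, to_show)
```

Here the doctors would be asked about `case_b` and `case_d`, and for the sample group they would see only `age` and `gender`, since the lengths of stay differ too much to summarize the group.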
4.2 Extracting Domain Entities From Unstructured Data
Feature selection is very important for any machine learning task
and can greatly affect the algorithm’s quality. Processing doctor’s
notes to extract important features is an essential
step for healthcare analytics problems. There are several
state-of-the-art Natural Language Processing (NLP) engines for
processing clinical documents, such as MedLEE [14] and
cTAKES [27]. These engines process clinical notes, identifying
types of clinical entities (e.g., medications, diseases, procedures,
lab tests) from various medical dictionaries (a.k.a. knowledge
bases), such as the Unified Medical Language System (UMLS) [6].
Figure 2: The software stack of epiC for big data analytics.
We now discuss several problems that arise from the nature of the
unstructured data and the incompleteness of the knowledge base,
and subsequently discuss a hybrid human-machine approach to
solve these problems. The discussion uses the following running
example. We run cTAKES on the doctor’s note of patient 1 (in
Table 1), and obtain the following clinical entities: (1) diseases:
IHD (Ischemic Heart Disease) and DM; (2) medications: GTN
and Metformin; and (3) laboratory test: HbA1c.
Ambiguous mentions. In many cases, a mention in the free text
may refer to different domain entities. For instance, in the running
example, “DM” may refer to two different diseases, “Dystrophy
Myotonic” and “Diabetes Mellitus”. We note that this problem is
not uncommon as doctors tend to use abbreviations in their notes.
For example, “CCF” refers to either “Congestive heart failure” or
“Carotid-Cavernous Fistula”; “PID” refers to either
“Prolapsed Intervertebral Disc” or “Pelvic Inflammatory Disease”.
There are also cases where only a human, but not the machine, can
understand the meaning of some mentions in the text. For
example, suppose that we are extracting the social factor of the
patients in Table 1. It is rather easy to extract the social factor for
patient 1, since the text contains the phrase “stays with son”.
However, it is challenging, if not impossible, for the machine to
extract the social factor for patient 2. The reason is that the
paragraph contains several different keywords relating to the
social factor, such as “single”, “no child”, “live with friend”,
“sheltered home”, “next-of-kin”.
Incomplete knowledge base. The knowledge base is incomplete
for the following reasons. First, the terms used in the doctor’s
notes could be specific to a country or a particular hospital,
whereas the existing knowledge bases may only cover the
universal ones. Thus, these terms do not exist in the dictionary.
One example is the term “HL” in our running example, which
refers to the “Hyperlipidemia” disease but is not captured in
UMLS. Second, the relationships between entities covered in
existing medical knowledge bases (like UMLS) are far from
complete. In the running example, the fact that the medication
Metformin is used to treat Diabetes Mellitus (DM) is
missing in UMLS. The relationships that exist between domain entities can
be used to derive implicit and useful information. For instance,
from the result of the lab test HbA1c, we can infer
whether the DM condition is well-controlled (i.e., the relationship
between a disease and a lab test).
A hybrid human-machine approach. To infer the correct
entities from unstructured data, a hybrid human-machine solution
should be employed. The system can leverage the information
from the knowledge base (e.g., UMLS) together with the implicit
information (signals) inherent in the unstructured data (e.g.,
doctor’s notes) to improve the accuracy of its inference process
and enhance the knowledge base as well. The system will pose
questions to the healthcare professionals for verification. Based on
the answers from the experts, the system adjusts its inference
results. The inference process gets more accurate and complete as
the system runs more iterations. Meanwhile, the knowledge base
becomes more comprehensive and customized to each
organization’s practice. More specifically, in our running example:
• Since “DM” appears together with the laboratory test “HbA1c”
in the paragraph, the machine conjectures that “DM”
refers to the “Diabetes Mellitus” disease only. The reason is
that HbA1c is a laboratory test that monitors the control of
diabetes and HbA1c does not have any relationship with the
other disease related to “DM” (i.e., “Dystrophy Myotonic”).
• To correctly infer the disease “Hyperlipidemia” for “HL”,
the machine infers a pattern of “num d” where num is a
fraction annotation and d is a disease. (“1 IHD” and “2
DM” are two examples.) The machine then infers that “HL”
may refer to a disease since the phrase “3 HL” follows the
pattern. The machine then poses a question to a doctor:
which disease does “HL” represent? In this case, the doctor
confirms that “HL” represents the “Hyperlipidemia”
disease. Based on the answer, the machine adds the
mapping between the mention “HL” and the disease
“Hyperlipidemia” to the knowledge base. Hence, the
knowledge base becomes more comprehensive and
customized to NUH’s practice.
• To identify the missing relationship between the medication
Metformin and the disease DM, the machine infers a pattern
of “d on med”, where d is a disease, med is a medication
and med is used to treat d. (“IHD on GTN” is an example.)
The machine conjectures that there should be a
relationship between DM and Metformin, since the phrase
“DM on Metformin” follows the pattern. The machine then
verifies this inference with the doctors. The doctors confirm
that they typically write the medications that are used to
treat a disease right next to the disease, and connect them
with the preposition “on”. Clearly, such a rule is
very useful: the machine can then infer other missing
relationships using this expert rule, with fewer questions
being posed to the doctors.
• To derive the social factor for patient 2, the machine can first
attempt to derive the information using a simple strategy
such as analyzing the NLP structure of sentences containing
patterns like “stay with”, “live with”. For complicated cases
where the machine cannot find the information, we need
to tap the knowledge of the experts.
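The two patterns in the running example, “num d” and “d on med”, can be illustrated with a small sketch. It is not the inference procedure of the system: the patterns are hard-coded as regular expressions instead of being learned, and the toy knowledge base, the function name and the question wording are assumptions made for illustration. The note is the one of patient 1 in Table 1.

```python
import re

# Toy knowledge base (illustrative entries only, not taken from UMLS).
KNOWN_DISEASES = {"IHD": "Ischemic Heart Disease", "DM": "Diabetes Mellitus"}
KNOWN_TREATMENTS = {("IHD", "GTN")}

NUMBERED_ITEM = re.compile(r"^(\d+)\s+(\w+)$")  # the "num d" pattern
ON_MEDICATION = re.compile(r"^-\s*on\s+(\w+)")  # the "d on med" pattern


def review_note(lines):
    """Return the questions that one doctor's note raises for the experts."""
    questions = []
    current = None  # most recent mention that matched "num d"
    for line in lines:
        line = line.strip()
        item = NUMBERED_ITEM.match(line)
        if item:
            current = item.group(2)
            if current not in KNOWN_DISEASES:
                questions.append(f"Which disease does '{current}' represent?")
            continue
        med = ON_MEDICATION.match(line)
        if med and current and (current, med.group(1)) not in KNOWN_TREATMENTS:
            questions.append(f"Is {med.group(1)} used to treat {current}?")
    return questions


note = ["PMH:", "1 IHD", "- on GTN 0.5mg prn", "2 DM", "- on Metformin 750mg",
        "- HbA1c 7.5% 09/12", "3 HL", "Stays with son"]
questions = review_note(note)
for question in questions:
    print(question)
```

Only the mentions and relationships missing from the knowledge base turn into questions, which reflects the goal of posing fewer questions to the doctors as the knowledge base grows.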
4.3 Combining Multiple Weak Classifiers
We can obtain different classifiers from multiple sources, such as
classifiers built from the observational data, rules used by the
doctors and general clinical guidelines. Each source of knowledge
can be considered a “weak classifier”, and the task is to combine
these classifiers to derive a final classifier that achieves very high
accuracy in prediction.
[Figure 3: An example of several rounds of learning for healthcare
predictive analytics. Round 1: C1 (classifier from doctor A), C2
(classifier from doctor B), C3 (classifier from data) and C4
(classifier from general guidelines). Round 2: intermediate
classifiers C5 and C6. Round 3: final classifier C7.]
There are many ways to achieve the goal. Figure 3 shows an
example of a process consisting of three rounds of learning for the
task of predicting the severity of patients. In the first round, the
system computes four classifiers: C1 and C2 are the classifiers
derived from rules provided by SMEs (i.e., doctors); C3 is the
classifier derived from historical data; and C4 is the classifier
derived from clinical guidelines.
It is essential to resolve disagreeing opinions from various
sources. There are several ways to combine different classifiers,
such as using majority voting over the outputs of different
rules/classifiers or combining the features used in different
input classifiers.
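As a minimal sketch of the majority-voting option mentioned above (one of several possible combination schemes, not necessarily the one used by the authors), each classifier can be treated as a function from a patient record to a label. The function name `majority_vote` and the record layout are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(classifiers, patient):
    """Combine the outputs of several classifiers by simple majority voting.

    Each classifier is a callable that maps a patient record to a label.
    Ties are broken in favour of the label that was voted for first.
    """
    votes = [clf(patient) for clf in classifiers]
    label, _ = Counter(votes).most_common(1)[0]
    return label
```

A weighted vote, in which sources that have proved more reliable count more, would be a natural refinement, but how to choose such weights is part of the open question discussed in this section.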
It is likely that the classifiers built in the first round will not
all agree with each other on the prediction task. Thus, in this
example, the system performs two additional rounds of learning to
improve the accuracy of the classifier. It is also possible that there
is no way to reconcile the classifiers, i.e., there will be multiple
different classifiers. In such situations, it may be necessary to
“rank” the results of the different classifiers, and pick the answer
that is ranked highest. How to do this is an open question.
4.4 Scalable Processing
Big data analytics is characterized by the so-called 3V features:
Volume - a huge amount of data, Velocity - a high data ingestion
rate, and Variety - a mix of structured, semi-structured and
unstructured data. These requirements force us to rethink the
whole software stack to address big data analytics efficiently and
effectively, ranging from the storage layer, which should handle
both structured and unstructured data, to the application layer,
which should support scalable machine learning algorithms. To
illustrate the points, let us reconsider the problem of predicting
malicious SMSs. The collection of SMSs is huge, e.g., in the order
of hundreds of terabytes. As discussed in Section 4.1, we need to
pick a set of SMSs for domain experts to label. Conventional
clustering algorithms may not work well here as we need to
handle such a large amount of data. The problem is even more
challenging in our context, as we need to frequently get the
domain experts involved in building the system. The delay
introduced by human response times may be a major obstacle to
keeping the latency of the system low.
Scalability is also a concern in terms of the high-dimensional
data space. Our data set inherently contains a large number of
features. For instance, the information about patients covers
thousands of different diseases and lab tests. One solution to
reduce the dimensions is to group these attributes semantically,
e.g., grouping together different diseases that share the same
“root”. For instance, Hypertension, Hypotension and Ischaemic
Heart disease can be grouped together under the category of
Cardiovascular disease.
Clearly, to perform such tasks, we need to consult the domain
experts as different hospitals/doctors may have different
opinions/reasoning in performing this task. This is, again, an
example of getting the domain experts involved in building the
systems.
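The semantic grouping described above can be sketched as a simple lookup. This is a hypothetical sketch: the mapping `DISEASE_ROOT` and the function `group_features` are assumptions for illustration, and in practice the mapping would be supplied and reviewed by the domain experts.

```python
# Hypothetical mapping from specific diagnoses to a shared "root" category;
# in practice the domain experts would supply and review this mapping.
DISEASE_ROOT = {
    "Hypertension": "Cardiovascular disease",
    "Hypotension": "Cardiovascular disease",
    "Ischaemic Heart disease": "Cardiovascular disease",
}


def group_features(patient_diseases):
    """Map a patient's diseases to root categories, dropping duplicates.

    Diseases without a known root are kept unchanged so no information is lost.
    """
    return sorted({DISEASE_ROOT.get(d, d) for d in patient_diseases})
```

Because different hospitals may group diseases differently, keeping the mapping as data rather than hard-coding it makes it easy for experts to revise.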
                       # actual class 1    # actual class 0
# predicted class 1          1071                1321
# predicted class 0          4587               22070
(a) Using only structured features

                       # actual class 1    # actual class 0
# predicted class 1          2679                4250
# predicted class 0          2979               19141
(b) Using both structured and derived features

Table 2: The accuracy of our classifier.
4.5 Engaging Expert Users
As the system needs to interact with SMEs frequently, it is
important to engage the experts throughout the process of building
and using the system. The system should provide several
functionalities for this purpose:
• A user-friendly interface for the experts to provide their
inputs such as rules, hypotheses, labels, etc.
• The system should provide not only the final outcome (e.g.,
whether the patient is at high/low risk of being sent to ICU)
but also the reasons that drive its decision. Therefore,
keeping track of the provenance of the knowledge is
important. For instance, when the system makes a decision
that differs from experts’ opinions, the system should be
able to trace back whether the mismatch is mainly due to the
use of some general guidelines, or due to other experts’
opinions.
• Presenting feedback to the experts. For instance, the system
can explain how well an expert performs compared to other
colleagues. As another example, the system can reveal
comments and annotations by other experts to see whether
an expert would change her decision. It is also interesting to
present new patterns of knowledge that an expert may lack
and potentially educate her.
5. PRELIMINARY RESULTS
We are studying the problem of predicting the probability of
patients being readmitted into the hospital within 30 days after
discharge. We refer to the task as readmission prediction for short.
We use the clinical data drawn from the National University
Hospital’s Computerized Clinical Data Repository (CCDR) and
focus only on the elderly patients (i.e., patients older than
60) admitted to the hospital in 2012. The table used for the
prediction task is the medical care table¹, which has a schema
similar to the one presented in Table 1. There are in total 29049
elderly patients admitted to NUH in 2012, of whom 5658 were
readmitted within 30 days, i.e., the proportion of patients who
were readmitted (i.e. class label 1) is 0.188.
¹ To derive the medical care table, we joined information from
various relations in CCDR, including: Discharge Summary, Patient
Demographics, Visit and Encounter, Lab Results and Emergency
Department.
5.1 Interacting with Domain Experts
We have been getting the doctors involved in the following tasks.
Hypothesis/Rules. Our clinician collaborators have suggested a
hypothesis that the following features (indicators) might be
important for the readmission prediction:
• Socio-economic factors, e.g., who the care-givers are and
the patient’s economic status.
• Lab findings. We should extract the lab findings that the
doctors mentioned in their notes instead of using the labs
recorded in the structured data in CCDR. The reason is that
patients typically have hundreds of lab tests but only a small
number of them are important and are captured in the doctor’s
notes. As a result, selecting lab findings mentioned by
doctors naturally reduces the dimensions of the data set.
• Comorbidity influence, i.e., we should take into account the
past medical history of the patient together with the disease
status (whether the disease has been well-controlled).
Participants in a crowd-sourcing system. We adopted a hybrid
human-machine approach to extract the social factors and lab
findings from doctors’ free-text notes.
To extract the social factors, we use an NLP technique to analyze
sentences containing phrases related to the social factor such as
“live (with)”, “stay (with)”, “main care-giver” to pinpoint some
keywords such as “daughter”, “family”, “spouse”, etc. The system
then asks the doctors to handpick a set of predefined categories of
social factors. For instance, living with family and being cared
for by professional helpers (e.g., maid, domestic helpers) are in
the same group. As another example, living alone and living in a
community nursing home are in the same group. The system also
performs a postprocessing step to pull out cases that can be
assigned more than one category of social factors. The system
then asks the doctors to label these cases manually. (About 200
cases needed to be labeled manually.)
To extract the lab findings, the system first uses a simple pattern
matching technique to extract all possible lab tests mentioned in
the note. For instance, if the note contains a pattern of the form
“word num” where word is some word and num is a number, then
word is a candidate lab test. A candidate is accepted as a lab test
if it exists in the medical dictionary under the category of lab
tests. For candidates that are not present in the dictionary but
appear frequently in the notes, the system asks the doctors to
verify them. This process uncovered actual lab tests that are
missing from the dictionary, such as “TW”, which is a local
convention used inside NUH.
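The “word num” matching step can be sketched as follows. This is a hypothetical sketch rather than the system's actual code: the small `LAB_TEST_DICTIONARY` merely stands in for the medical dictionary, and the regular expression, the `min_count` threshold and the function name are assumptions.

```python
import re

# Hypothetical dictionary of known lab tests; the paper uses a medical
# dictionary with a lab-test category, which this set merely stands in for.
LAB_TEST_DICTIONARY = {"Hb", "Na", "K"}

# "word num": an alphabetic token followed by a number.
WORD_NUM = re.compile(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\b")


def extract_lab_tests(notes, min_count=2):
    """Split "word num" matches into confirmed lab tests and frequent unknown
    words that should be shown to a doctor for verification."""
    confirmed, unknown_counts = [], {}
    for note in notes:
        for word, value in WORD_NUM.findall(note):
            if word in LAB_TEST_DICTIONARY:
                confirmed.append((word, float(value)))
            else:
                unknown_counts[word] = unknown_counts.get(word, 0) + 1
    to_verify = sorted(w for w, c in unknown_counts.items() if c >= min_count)
    return confirmed, to_verify
```

With two notes that both mention “TW” followed by a number, “TW” would be returned for verification, whereas a word that matches the pattern only once (such as “stay 3 days”) would fall below the threshold and not be shown to the doctors.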
Extracting medical concepts. We run the cTAKES NLP engine
over the UMLS dictionary to extract the past medical history of a
patient. We are in the process of developing algorithms to improve
the accuracy of extraction (to resolve problems mentioned in
Section 4.2). In the meantime, we use the number of diseases that
the patient has as an indicator instead of the actual diseases.
5.2 Results
After interacting with the doctors to extract relevant features, we
obtained two sets of features for the prediction task:
• Structured features: patients’ demographics (age, gender,
race), the number of days that the patient stayed at the
hospital, the number of previous hospitalizations, and the
number of prior emergency visits in the last six months
before admission.
• Derived features from free-texts (we refer to these features
as derived features for short): social factors, lab findings, and
past medical history (i.e., diseases).
We used WEKA [15] to run a 10-fold cross-validation with the
Bayesian Network classifier to construct a readmission classifier².
Table 2 reports the prediction results aggregated over all 10
validation folds. If only structured features are used to build the
classifier (Table 2(a)), the resulting classifier correctly predicts
1071 cases that are readmitted (within 30 days). The precision and
recall in this case are 0.448 and 0.189, respectively. Meanwhile, if
both structured and derived features are used to build the classifier
(Table 2(b)), the resulting classifier correctly predicts 2679
cases that are readmitted. The precision and recall are 0.387 and
0.473, respectively. Clearly, the recall improves significantly
(from 0.189 to 0.473), at the cost of a lower precision, with the
use of the derived features from the doctors’ free-text notes. The
result is also very promising when compared with the predictions
made manually by domain experts such as physicians, case managers,
and nurses [7]. The recall reported in [7] is in the range
[0.149, 0.306]. The conclusion in [7] is that care-providers were
not able to accurately predict which patients were at highest risk
of readmission. However, we believe that a hybrid machine-human
solution would greatly alleviate the problem.
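The reported precision and recall values follow directly from the counts in Table 2. The short sketch below recomputes them; the function name `precision_recall` is ours, and the counts are read from the table under the assumption that class 1 (readmitted) is the positive class.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for class 1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts taken from Table 2: tp = predicted 1 and actually 1,
# fp = predicted 1 but actually 0, fn = predicted 0 but actually 1.
structured_only = precision_recall(tp=1071, fp=1321, fn=4587)
with_derived = precision_recall(tp=2679, fp=4250, fn=2979)

print([round(v, 3) for v in structured_only])  # [0.448, 0.189]
print([round(v, 3) for v in with_derived])     # [0.387, 0.473]
```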
We would like to emphasize that there is much room to further
improve the accuracy of the prediction, such as enhancing the
feature extraction process, employing additional features (e.g.,
disease status, specific diagnoses, medications), and using
classifiers designed for highly imbalanced data sets.
6. RELATED WORK
Work related to our proposition can be broadly classified into the
following three categories.
Existing solutions for industry/domain specific applications.
Existing solutions are currently built based on “best practices”.
One direction is the knowledge-driven approach, which is based on
general guidelines such as clinical guidelines, e.g., IBM
Watson [3]. Another direction is the data-driven approach, which is
based on “rules” extracted from the observational data, e.g., [16;
18; 20]. Recently, IBM proposed to combine the strengths of the
two directions [31]. However, these solutions have not explored
the exceptionally complicated rules/patterns that can only be
provided by internal domain experts with years of working
experience. Our research aims to fill this gap: we seek to engage
the experts as users of the system, and tap into their expertise to
enhance the database knowledge and processing. There are several
benefits of employing internal domain experts. First, we do not
need to customize/localize the system for different use-cases; the
experts themselves define the “best practices” for the system.
Second, in terms of the data used to build the knowledge base, our
system is based mainly on observational data and knowledge provided
by domain experts, whereas others (e.g., IBM Watson) need to
process a much larger amount of inputs such as medical journals,
white papers, medical policies and practices, information on the
web, etc. Third, the system should become more “intelligent” over
time as the expert users continuously enhance the system with
their expert knowledge.
² We also used other classifiers such as decision tree, rule-based
classifier, SVM, etc., and observed that the Bayesian Network
classifier provides the best result.
Crowdsourcing in databases. There has been a lot of recent
interest in the database community in using crowdsourcing as part
of database query processing (e.g., CrowdDB [13], Deco [24],
Qurk [23], CDAS [12; 22]). As discussed, the intelligent crowds
in our context are domain experts (rather than the lay-persons in
existing crowds) who are also users/reviewers of the system.
Furthermore, exploiting an intelligent crowd can be much more
collaborative in nature. In typical crowdsourcing, the crowd
members are not aware of each other’s answers. But in our context,
we can actually go through several iterations and see whether the
experts will change their decisions when they are provided with
comments and annotations by other experts.
A recent system, called Data Tamer [30], also leverages an expert
crowdsourcing system to enhance machine computation, but in the
context of data curation. As discussed in Section 1, the key
difference between our proposition and Data Tamer lies in the fact
that the domain experts in our context are also users/reviewers of
the system. Thus, the experts are likely to take ownership and
hence are motivated to improve the accuracy of the analytics and
the usability of the applications. This would reduce the need to
localize/customize the system. Also, each system needs to address
a different set of challenges, since the targeted applications are
different.
Active learning. In the active learning model, the data come
unlabeled but the goal is to ultimately learn a classifier (e.g., [17;
29; 32]). The idea is to query the labels of just a few points that are
especially informative in order to obtain an accurate classifier. The
labels are obtained from highly-trained experts (e.g., doctors). The
scope of our proposition is much more general than active learning
in the following respects. First, we would like to exploit as much
domain knowledge from experts as possible, not restricted to only
the class labels as in active learning. For instance, rules and
hypotheses provided by experts with many years of experience
must be exploited in several cases. Second, active learning focuses
on getting a better classifier, so the query points presented to the
crowd are usually those data points that are close to the
separating boundary. However, these are also the data points that
the experts are usually not very clear about. As such, we need to be
able to identify additional information that should be provided for
the experts to be able to make an informed decision. Lastly, we
need to handle a large amount of data, whereas existing solutions
on active learning usually deal with small data sets.
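To make the notion of “informative” query points concrete, the following minimal sketch shows uncertainty sampling, one common active learning strategy (not necessarily the one used in the cited works). It assumes a binary classifier that outputs a probability for the positive class; the function name `select_queries` and the input format are hypothetical.

```python
def select_queries(probabilities, budget):
    """Pick the `budget` unlabeled items whose predicted probability of the
    positive class is closest to 0.5, i.e. closest to the decision boundary.

    `probabilities` maps an item id to the current classifier's estimate.
    """
    ranked = sorted(probabilities, key=lambda item: abs(probabilities[item] - 0.5))
    return ranked[:budget]
```

The selected items are exactly the borderline cases that, as noted above, experts may also find hard to judge, which is why additional context may need to be shown alongside each query.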
7. CONCLUSION
Each of us is a subject-matter expert (SME) of our profession, and
we carry with us a vast amount of knowledge and insights not
captured by a structured system. This may explain the
emergence of Knowledge Management systems. However, there
are many rules and exceptional cases that can only be formulated
by experts with many years of experience. Such rules, when
properly coded, can help in facilitating contextual decision
making. This paper envisions a more intelligent DBMS that
captures such information or knowledge. At the core, the system is
a hybrid human-machine database processing engine where the
machine keeps the SMEs as part of the feedback loop to gather,
infer, ascertain and enhance the database knowledge and
processing. This paper discussed many open challenges that we
need to tackle in order to build such a system.
8. ACKNOWLEDGEMENTS
This work was supported by the National Research Foundation,
Prime Minister’s Office, Singapore under Grant No.
NRF-CRP8-2011-08. We thank Associate Professor Gerald C.H.
Koh and Dr. Chuen Seng Tan (Saw Swee Hock School of Public
Health, National University Health System) for sharing with us
domain knowledge in healthcare.
9. REFERENCES
[1] http://www.comp.nus.edu.sg/∼epic.
[2] The comprehensive it infrastructure for data-intensive
applications and analysis project.
http://www.comp.nus.edu.sg/∼ciidaa/.
[3] Ibm big data for healthcare. http://www.ibm.com.
[4] The minority report: Chicago’s new police computer predicts
crimes, but is it racist?
http://www.theverge.com/2014/2/19/5419854/the-minorityreport-this-computer-predicts-crime-but-is-it-racist.
[5] National university health system. http://www.nuhs.edu.sg/.
[6] Unified medical language system.
http://www.nlm.nih.gov/research/umls/.
[7] N. Allaudeen, J. L. Schnipper, E. J. Orav, R. M. Wachter, and
A. R. Vidyarthi. Inability of providers to predict unplanned
readmissions. J Gen Intern Med, 26(7):771–776.
[8] Y. Cao, C. Chen, F. Guo, D. Jiang, Y. Lin, B. C. Ooi, H. T.
Vo, S. Wu, and Q. Xu. Es2: A cloud data storage system for
supporting both oltp and olap. In ICDE, pages 291–302, 2011.
[9] G. Chen, K. Chen, D. Jiang, B. C. Ooi, L. Shi, H. T. Vo,
and S. Wu. E3: an elastic execution engine for scalable data
processing. JIP, 20(1):65–76, 2012.
[10] G. Chen, H. Jagadish, D. Jiang, D. Maier, B. Ooi, K. Tan, and
W. Tan. Federation in cloud data management: Challenges
and opportunities. TDKE, 2014.
[11] J. Dean and S. Ghemawat. Mapreduce: Simplified data
processing on large clusters. Commun. ACM, 51(1), Jan. 2008.
[12] J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid
machine-crowdsourcing system for matching web tables. In
ICDE, pages 976–987, 2014.
[13] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and
R. Xin. Crowddb: answering queries with crowdsourcing. In
SIGMOD Conference, pages 61–72, 2011.
[14] C. Friedman, P. O. Alderson, J. H. Austin, J. J. Cimino, and
S. B. Johnson. A general natural-language text processor for
clinical radiology. JAMIA, 1(2):161–174, 1994.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann,
and I. H. Witten. The weka data mining software: An update.
SIGKDD Explorations, 11(1), 2009.
[16] J. Han. Data Mining: Concepts and Techniques. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[17] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active
learning and its application to medical image classification. In
ICML, pages 417–424, 2006.
[18] A. Hosseinzadeh, M. T. Izadi, A. Verma, D. Precup, and D. L.
Buckeridge. Assessing the predictability of hospital
readmission using machine learning. In IAAI, 2013.
[19] D. Jiang, G. Chen, B. C. Ooi, K.-L. Tan, and S. Wu. epic:
an extensible and scalable system for processing big data. In
PVLDB, 2014.
[20] P. S. Keenan, S.-L. T. Normand, Z. Lin, E. E. Drye, K. R.
Bhat, J. S. Ross, J. D. Schuur, B. D. Stauffer, S. M. Bernheim,
A. J. Epstein, Y. Wang, J. Herrin, J. Chen, J. J. Federer, J. A.
Mattera, Y. Wang, and H. M. Krumholz. An administrative
claims measure suitable for profiling hospital performance on
the basis of 30-day all-cause readmission rates among patients
with heart failure. Circ Cardiovasc Qual Outcomes, 1(1):29–37,
2008.
[21] K. S. Kumar, P. Triantafillou, and G. Weikum. Human
computing games for knowledge acquisition. In CIKM, pages
2513–2516, 2013.
[22] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang.
Cdas: a crowdsourcing data analytics system. PVLDB,
5(10):1040–1051, 2012.
[23] A. Marcus, E. Wu, S. Madden, and R. C. Miller.
Crowdsourced databases: Query processing with people. In CIDR,
pages 211–214, 2011.
[24] A. G. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis,
and J. Widom. Deco: declarative crowdsourcing. In CIKM,
pages 1203–1212, 2012.
[25] S. Perera, A. Sheth, K. Thirunarayan, S. Nair, and N. Shah.
Challenges in understanding clinical notes: Why nlp engines
fall short and where background knowledge can help. In CIKM
Workshop, 2013.
[26] W. Raghupathi and V. Raghupathi. Big data analytics in
healthcare: promise and potential. Health Information Science
and Systems, 2014.
[27] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn,
K. K. Schuler, and C. G. Chute. Mayo clinical text analysis
and knowledge extraction system (ctakes): architecture,
component evaluation and applications. JAMIA, 17(5):507–513,
2010.
[28] T. K. Sean Goldberg, Daisy Zhe Wang. Castle: Crowd-assisted
system for textual labeling & extraction. HCOM, 2013.
[29] B. Settles. Active learning literature survey. Technical report,
University of Wisconsin–Madison, 2010.
[30] M. Stonebraker, D. Bruckner, I. Ilyas, G. Beskales,
M. Cherniack, S. Zdonik, A. Pagan, and S. Xu. Data curation at
scale: The data tamer system. In CIDR, 2013.
[31] J. Sun, J. Hu, D. Luo, M. Markatou, F. Wang, S. Edabollahi,
S. E. Steinhubl, Z. Daar, and W. F. Stewart. Combining
knowledge and data driven insights for identifying risk factors
using electronic health records. In AMIA, 2012.
[32] J. Wiens and J. Guttag. Active learning applied to
patient-adaptive heartbeat classification. In NIPS, pages
2442–2450, 2010.
On Power Law Distributions in Large-scale Taxonomies
Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Eric Gaussier and Massih-Reza Amini
Université Grenoble Alpes, CNRS
F-38000 Grenoble, France
rohit.babbar@imag.fr, cornelia.metzig@imag.fr, ioannis.partalas@imag.fr,
eric.gaussier@imag.fr, massih-reza.amini@imag.fr
ABSTRACT
In many of the large-scale physical and social complex systems
phenomena fat-tailed distributions occur, for which different
generating mechanisms have been proposed. In this paper, we study
models of generating power law distributions in the evolution of
large-scale taxonomies such as Open Directory Project, which
consist of websites assigned to one of tens of thousands of
categories. The categories in such taxonomies are arranged in tree
or DAG structured configurations having parent-child relations
among them. We first quantitatively analyse the formation process
of such taxonomies, which leads to power law distribution as the
stationary distributions. In the context of designing classifiers
for large-scale taxonomies, which automatically assign unseen
documents to leaf-level categories, we highlight how the fat-tailed
nature of these distributions can be leveraged to analytically
study the space complexity of such classifiers. Empirical
evaluation of the space complexity on publicly available datasets
demonstrates the applicability of our approach.
1. INTRODUCTION
With the tremendous growth of data on the web from various sources
such as social networks, online business services and news
networks, structuring the data into conceptual taxonomies leads to
better scalability, interpretability and visualization. Yahoo!
directory, the open directory project (ODP) and Wikipedia are
prominent examples of such web-scale taxonomies. The Medical
Subject Heading hierarchy of the National Library of Medicine is
another instance of a large-scale taxonomy in the domain of life
sciences. These taxonomies consist of classes arranged in a
hierarchical structure with parent-child relations among them and
can be in the form of a rooted tree or a directed acyclic graph.
ODP for instance, which is in the form of a rooted tree, lists over
5 million websites distributed among close to 1 million categories
and is maintained by close to 100,000 human editors. Wikipedia, on
the other hand, represents a more complicated directed graph
taxonomy structure consisting of over a million categories. In this
context, large-scale hierarchical classification deals with the
task of automatically assigning labels to unseen documents from a
set of target classes which are represented by the leaf level
nodes in the hierarchy.
In this work, we study the distribution of data and the hierarchy
tree in large-scale taxonomies with the goal of modelling the
process of their evolution. This is undertaken by a quantitative
study of the evolution of large-scale taxonomy using models of
preferential attachment, based on the famous model proposed by
Yule [33] and showing that throughout the growth process, the
taxonomy exhibits a fat-tailed distribution. We apply this
reasoning to both category sizes and tree connectivity in a simple
joint model.
Formally, a random variable X is defined to follow a power law
distribution if for some positive constant a, the complementary
cumulative distribution is given as follows:
$P(X > x) \propto x^{-a}$
Power law distributions, or more generally fat-tailed
distributions that decay slower than Gaussians, are found in a
wide variety of physical and social complex systems, ranging from
city population, distribution of wealth to citations of scientific
articles [23]. They are also found in network connectivity, where
the internet and Wikipedia are prominent examples [27; 7]. Our
analysis in the context of large-scale web-taxonomies leads to a
better understanding of such large-scale data, and is also
leveraged to present a concrete analysis of space complexity for
hierarchical classification schemes. Due to the ever increasing
scale of training data size in terms of the number of documents,
feature set size and number of target classes, the space
complexity of the trained classifiers plays a crucial role in the
applicability of classification systems in many applications of
practical importance.
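The definition of a power law via the complementary cumulative distribution suggests a simple empirical check: on a log-log scale, P(X > x) plotted against x should be roughly a straight line with slope -a. The sketch below illustrates this under stated assumptions; the function names are ours, and a plain least-squares fit is used only for illustration, since it is a crude estimator compared with maximum-likelihood methods.

```python
import math


def empirical_ccdf(sizes):
    """Return sorted distinct values x with the empirical P(X > x)."""
    n = len(sizes)
    ordered = sorted(sizes)
    points = []
    for i, x in enumerate(ordered):
        # Only emit a point at the last occurrence of each distinct value.
        if i + 1 == n or ordered[i + 1] != x:
            points.append((x, (n - i - 1) / n))
    return points


def loglog_slope(points):
    """Least-squares slope of log P(X > x) against log x.

    For a power law the slope estimates -a. Points with P(X > x) = 0 or
    x <= 0 have no logarithm and are skipped.
    """
    xs = [math.log(x) for x, p in points if p > 0 and x > 0]
    ys = [math.log(p) for x, p in points if p > 0 and x > 0]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx
```

Applied to category sizes of a taxonomy (number of documents per category), a slope that stays roughly constant over several orders of magnitude would be consistent with the fat-tailed behaviour discussed here.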
The
space
complexity
analysis
in feature
this paper
and
of target
classes, the
space
complexity
of the
videsnumber
an analytical
comparison
of the
trained
model for
hitrained classifiers
a crucialwhich
role incan
thebeapplicability
of
erarchical
and flat plays
classification,
used to select
classification
systems
many applications
of practical
imthe appropriate
modelina-priori
for the classification
probportance.
lem at hand, without actually having to train any modThe Exploiting
space complexity
analysis
presented
in this paper
proels.
the power
law nature
of taxonomies
to study
vides
an analytical
comparisonforofhierarchical
the trained Support
model forVechithe training
time complexity
erarchical
andhas
flat been
classification,
which
can19].
be used
select
tor
Machines
performed
in [32;
The to
authors
the appropriate
model
a-priori
for the classification
probtherein
justify the
power
law assumption
only empirically,
lem
at our
hand,
without
to we
train
any modunlike
analysis
in actually
Section 3having
wherein
describe
the
els. Exploiting
the of
power
law nature
taxonomiesmore
to study
generative
process
large-scale
web of
taxonomies
conthe training
complexity
for processes
hierarchical
Support
Veccretely,
in thetime
context
of similar
studied
in other
tor Machines
has the
beenimportant
performedinsights
in [32; of
19].[32;
The
models.
Despite
19],authors
space
therein justify
law assumption
only
complexity
has the
not power
been treated
formally so
far.empirically,
unlike
our analysis
in paper
Section
3 wherein
we describe
The remainder
of this
is as
follows. Related
workthe
on
generative
process
of
large-scale
web
taxonomies
more
conreporting power law distributions and on large scale hierarcretely,classification
in the context
of similar in
processes
in other
chical
is presented
Section studied
2. In Section
3,
models.
thegrowth
important
insights
of [32; 19], space
we recall Despite
important
models
and quantitatively
juscomplexity
has not of
been
treated
far.
tify
the formation
power
lawsformally
as they so
are
found in hiThe remainder of this paper is as follows. Related work on
reporting power law distributions and on large scale hierarchical classification is presented in Section 2. In Section 3,
we recall important growth models and quantitatively justify the formation of power laws as they are found in hi-
Volume 16, Issue 1
Page 47
erarchical large-scale web taxonomies by studying the evolution dynamics that generate them. More specifically, we
present a process that jointly models the growth in the size
of categories, as well as the growth of the hierarchical tree
structure. We derive from this growth model why the class
erarchical
large-scale
taxonomies
by hierarchy
studying the
size
distribution
at a web
given
level of the
alsoevoexlution
dynamics
generate
them.
More we
specifically,
we
hibits power
law that
decay.
Building
on this,
then appeal
present
a process
jointly
models
thethe
growth
in the size
to
Heaps’
law in that
Section
4, to
explain
distribution
of
of categories,
as categories
well as thewhich
growth
theexploited
hierarchical
tree
features
among
is of
then
in Secstructure.
derive the
fromspace
this growth
model
the class
tion 5 for We
analysing
complexity
forwhy
hierarchical
size distribution
at a given
level of isthe
hierarchyvalidated
also exclassification
schemes.
The analysis
empirically
hibits
poweravailable
law decay.
Building
on from
this, we
on publicly
DMOZ
datasets
the then
Largeappeal
Scale
1
to Heaps’ lawText
in Section
4, to explain
the (LSHTC)
distribution
of
and
Hierarchical
Classification
Challenge
2
featuresdata
among
categories
which Intellectual
is then exploited
in Secfrom World
Property
Orpatent
(IPC)
tion
5 for analysing
the space
complexity
hierarchical
ganization.
Finally, Section
6 concludes
thisfor
work.
classification schemes. The analysis is empirically validated
on publicly available DMOZ datasets from the Large Scale
2.
RELATED
WORK Challenge (LSHTC)1 and
Hierarchical
Text Classification
2
Power
law
distributions
reported
in a wide
varietyOrof
World
Intellectual
Property
patent data (IPC) fromare
physical
andFinally,
social complex
[22],
such
as in interganization.
Section 6systems
concludes
this
work.
net topologies. For instance [11; 7] showed that internet
topologies exhibit power laws with respect to the in-degree
of the nodes. Also the size distribution of website categories,
measured in terms of number of websites, exhibits a fat-tailed
distribution, as empirically demonstrated in [32; 19] for the
Open Directory Project (ODP). Various models have been proposed
for the generation of power law distributions, a phenomenon
that may be seen as fundamental in complex systems as the
normal distribution in statistics [25]. However, in contrast to
the straight-forward derivation of normal distribution via the
central limit theorem, models explaining power law formation
all rely on an approximation. Some explanations are based on
multiplicative noise or on the renormalization group formalism
[28; 30; 16]. For the growth process of large-scale taxonomies,
models based on preferential attachment are most appropriate,
which are used in this paper. These models are based on the
seminal model by Yule [33], originally formulated for the
taxonomy of biological species, detailed in section 3. It
applies to systems where elements of the system are grouped
into classes, and the system grows both in the number of
classes, and in the total number of elements (which are here
documents or websites). In its original form, Yule's model
serves as explanation for power law formation in any taxonomy,
irrespective of an eventual hierarchy among categories. Similar
dynamics have been applied to explain scaling in the
connectivity of a network, which grows in terms of nodes and
edges via preferential attachment [2]. Recent further
generalizations apply the same growth process to trees
[17; 14; 29]. In this paper, we describe the approximate
power law in the child-to-parent category relations by the
model by Klemm et al. [17]. Furthermore, we combine this
formation process in a simple manner with the original Yule
model in order to explain also a power law in category sizes,
i.e. we provide a comprehensive explanation for the formation
process of large-scale web taxonomies such as DMOZ. From the
second, we infer a third scaling distribution for the number of
features per category. This is done via the empirical Heaps'
law [10], which describes the scaling relationship between
text length and the size of its vocabulary.

1 http://lshtc.iit.demokritos.gr/
2 http://web2.wipo.int/ipcpub/
Earlier work on exploiting the hierarchy among target classes
for the purpose of text classification includes [18; 6] and
[8], wherein the number of target classes was limited to a few
hundred. However, the work by [19] is among the pioneering
studies in hierarchical classification towards addressing
web-scale directories such as the Yahoo! directory, consisting
of over 100,000 target classes. The authors analyse the
performance with respect to accuracy and training time
complexity for flat and hierarchical classification. More
recently, other techniques for large-scale hierarchical text
classification have been proposed. Prevention of error
propagation by applying Refined Experts trained on a
validation set was proposed in [4]. In this approach,
bottom-up information propagation is performed by utilizing
the output of the lower level classifiers in order to improve
classification at the top level. The deep classification method
proposed in [31] first applies hierarchy pruning to identify a
much smaller subset of target classes. Prediction of a test
instance is then performed by re-training a Naive Bayes
classifier on the subset of target classes identified from the
first step. More recently, Bayesian modelling of large-scale
hierarchical classification has been proposed in [15], in which
hierarchical dependencies between the parent-child nodes are
modelled by centring the prior of the child node at the
parameter values of its parent.

In addition to prediction accuracy, other metrics of
performance such as prediction and training speed as well as
space complexity of the model have become increasingly
important. This is especially true in the context of challenges
posed by problems in the space of Big Data, wherein an optimal
trade-off among such metrics is desired. The significance
of prediction speed in such scenarios has been highlighted in
recent studies such as [3; 13; 24; 5]. The prediction speed is
directly related to space complexity of the trained model, as
it may not be possible to load a large trained model in the
main memory due to its sheer size. Despite its direct impact
on prediction speed, no earlier work has focused on space
complexity of hierarchical classifiers.

Additionally, while the existence of power law distributions
has been used for analysis purposes in [32; 19], no thorough
justification is given for the existence of such a phenomenon.
Our analysis in Section 3 attempts to address this issue in
a quantitative manner. Finally, power law semantics have
been used for model selection and evaluation of large-scale
hierarchical classification systems [1]. Unlike problems
studied in the classical machine learning sense, which deal
with a limited number of target classes, this application forms
a blueprint for extracting hidden information in big data.

3. POWER LAW IN LARGE-SCALE WEB TAXONOMIES
We begin by introducing the complementary cumulative size
distribution for category sizes. Let $N_i$ denote the size of
category $i$ (in terms of number of documents); then the
probability that $N_i > N$ is given by

$$P(N_i > N) \propto N^{-\beta} \tag{1}$$

where $\beta > 0$ denotes the exponent of the power law
distribution.^3 Empirically, it can be assessed by plotting the
rank of a category's size against its size (see Figure 1). The
derivative of this distribution, the category size probability
density $p(N_i)$, then also follows a power law with exponent
$(\beta + 1)$, i.e. $p(N_i) \propto N_i^{-(\beta+1)}$.

3 To avoid confusion, we denote the power law exponents for
in-degree distribution and feature size distribution $\gamma$
and $\delta$.

Two of our empirical findings are a power law for both the
complementary cumulative category size distribution and
the counter-cumulative in-degree distribution, shown in
Figures 1 and 2, for the LSHTC2-DMOZ dataset, which is a subset
of ODP. The dataset^4 contains 394,000 websites and 27,785
categories. The number of categories at each level of the
hierarchy is shown in Figure 3.

4 http://lshtc.iit.demokritos.gr/LSHTC2 datasets

Figure 1: Category size vs rank distribution for the
LSHTC2-DMOZ dataset. (x-axis: category size $N$; y-axis: # of
categories with $N_i > N$; annotation: $\beta = 1.1$)

Figure 2: Indegree vs rank distribution for the LSHTC2-DMOZ
dataset. (x-axis: # of indegrees $dg$; y-axis: # of categories
with $dg_i > dg$; annotation: $\gamma = 1.9$)

Figure 3: Number of categories at each level in the hierarchy
of the LSHTC2-DMOZ database. (x-axis: Level, 1 to 5; y-axis:
# categories)

We explain the formation of these two laws via models by
Yule [33] and a related model by Klemm [17], detailed in
sections 3.1 and 3.2, which are then related in section 3.3.

3.1 Yule's model
Yule's model describes a system that grows in two quantities,
in elements and in classes in which the elements are assigned.
It assumes that for a system having $\kappa$ classes, the
probability that a new element will be assigned to a certain
class $i$ is proportional to its current size,

$$p(i) = \frac{N_i}{\sum_{i'=1}^{\kappa} N_{i'}} \tag{2}$$
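As a minimal sketch of Equation 2, the assignment probabilities
can be computed directly from the current class sizes. The
function name `assignment_probabilities` and the example sizes
are illustrative assumptions and do not come from the original
model description.

```python
from fractions import Fraction


def assignment_probabilities(sizes):
    """Return p(i) = N_i / sum_{i'} N_{i'} for every class (Equation 2)."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("at least one class must contain an element")
    # Fractions keep the arithmetic exact, so the probabilities sum to 1.
    return [Fraction(n, total) for n in sizes]


# Hypothetical system with kappa = 3 classes of sizes 6, 3 and 1.
probs = assignment_probabilities([6, 3, 1])
```

In this example a class that is twice as large is twice as
likely to receive the next element, which is the size-proportional
rule the model relies on.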
It further assumes that for every $m$ elements that are added
to the pre-existing classes in the system, a new class of size
1 is created.^5

5 The initial size may be generalized to other small sizes; for
instance Tessone et al. consider entrant classes with size
drawn from a truncated power law [29].

The described system is constantly growing in terms of elements
and classes, so strictly speaking, a stationary state does not
exist [20]. However, a stationary distribution, the so-called
Yule distribution, has been derived using the approach of the
master equation with similar approximations by [26; 23; 17].
Here, we follow Newman [23], who considers as one time-step the
duration between the creation of two consecutive classes. From
this follows that the average number of elements per class is
always $m + 1$, and the system contains $\kappa(m + 1)$ elements
at a moment where the number of classes is $\kappa$. Let
$p_{N,\kappa}$ denote the fraction of classes having $N$
elements when the total number of classes is $\kappa$.
Between two successive time instances, the probability for a
given pre-existing class $i$ of size $N_i$ to gain a new element
is $m N_i/(\kappa(m + 1))$. Since there are $\kappa\, p_{N,\kappa}$
classes of size $N$, the expected number of such classes which
gain a new element (and grow to size $(N + 1)$) is given by:

$$\frac{m N}{\kappa (m + 1)}\, \kappa\, p_{N,\kappa} = \frac{m}{m + 1}\, N\, p_{N,\kappa} \tag{3}$$

The number of classes with $N$ websites is thus lower by the
above quantity, but some which had $(N - 1)$ websites prior to
the addition of a new class have now one more website. This
step depicting the change of the state of the system from
$\kappa$ classes to $(\kappa + 1)$ classes is shown in Figure 4.
Therefore, the expected number of classes with $N$ documents
when the number of classes is $(\kappa + 1)$ is given by the
following equation:

$$(\kappa + 1)\, p_{N,(\kappa+1)} = \kappa\, p_{N,\kappa} + \frac{m}{m + 1} \left[ (N - 1)\, p_{(N-1),\kappa} - N\, p_{N,\kappa} \right] \tag{4}$$

The first term in the right hand side of Equation 4 corresponds
to classes with $N$ documents when the number of classes is
$\kappa$. The second term corresponds to the contribution from
classes of size $(N - 1)$ which have grown to size $N$; this is
shown by the left arrow (pointing rightwards) in Figure 4. The
last term corresponds to the decrease resulting from classes
which have gained an element and have become of size $(N + 1)$;
this is shown by the right arrow (pointing rightwards) in
Figure 4. The equation for the class of size 1 is given by:

$$(\kappa + 1)\, p_{1,(\kappa+1)} = \kappa\, p_{1,\kappa} + 1 - \frac{m}{m + 1}\, p_{1,\kappa} \tag{5}$$

Table 1: Summary of notation used in Section 3

Variables
  $N_i$            Number of elements in class $i$
  $dg_i$           Number of subclasses of class $i$
  $d_i$            Number of features of class $i$
  $\kappa$         Total number of classes
  $DG$             Total number of in-degrees (= subcategories)
  $p_{N,\kappa}$   Fraction of classes having $N$ elements when
                   the total number of classes is $\kappa$
Constants
  $m$              Number of elements added to the system after
                   which a new class is added
  $w$              $\in [0, 1]$. Probability that attachment of
                   subcategories is preferential
Indices
  $i$              Index for the class

As the number of classes $\kappa$ (and therefore the number of
elements $\kappa(m + 1)$) in the system increases, the
probability that a new element is classified into a class of
size $N$, given by Equation 3, is assumed to remain constant and
independent of $\kappa$. Under this hypothesis, the stationary
distribution for class sizes can be determined by solving
Equation 4 and using Equation 5 as the initial condition. This
is given by

$$p_N = (1 + 1/m)\, B(N, 2 + 1/m) \tag{6}$$

where $B(\cdot, \cdot)$ is the beta function. Equation 6 has
been termed Yule distribution [26]. Written for a continuous
variable $N$, it has a power law tail:

$$p(N) \propto N^{-2 - \frac{1}{m}}$$

From the above equation the exponent of the density function
is between 2 and 3. Its cumulative size distribution
$P(N_i > N)$, as given by Equation 1, has an exponent given by

$$\beta = (1 + (1/m)) \tag{7}$$

which is between 1 and 2. The higher the frequency $1/m$
at which new classes are introduced, the bigger $\beta$ becomes,
and the lower the average class size. This exponent is stable
over time although the taxonomy is constantly growing.
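The stationary solution can be checked numerically. The sketch
below evaluates Equation 6 through the log-gamma function,
verifies that it satisfies Equations 4 and 5 once $p_{N,\kappa}$
no longer depends on $\kappa$, and measures the log-log slope of
the tail. It also contains a small seeded simulation of the
growth rule. All function names, the starting state of a single
class of size 1, and the chosen parameter values are assumptions
made for illustration only.

```python
import math
import random


def yule_pmf(n, m):
    """Equation 6: p_N = (1 + 1/m) * B(N, 2 + 1/m), via log-gamma."""
    a = 2.0 + 1.0 / m
    log_beta = math.lgamma(n) + math.lgamma(a) - math.lgamma(n + a)
    return (1.0 + 1.0 / m) * math.exp(log_beta)


def stationary_residual(n, m):
    """Left side minus right side of Equations 4 and 5 with p_{N,kappa} = p_N."""
    rate = m / (m + 1.0)
    if n == 1:
        return yule_pmf(1, m) - (1.0 - rate * yule_pmf(1, m))
    gain = (n - 1) * yule_pmf(n - 1, m)  # classes growing from N-1 to N
    loss = n * yule_pmf(n, m)            # classes growing from N to N+1
    return yule_pmf(n, m) - rate * (gain - loss)


def tail_exponent(m, n_lo=10_000, n_hi=100_000):
    """Log-log slope of the density between two large class sizes."""
    rise = math.log(yule_pmf(n_hi, m)) - math.log(yule_pmf(n_lo, m))
    return rise / (math.log(n_hi) - math.log(n_lo))


def simulate_yule(steps, m, seed=0):
    """Grow one class of size 1 for `steps` time-steps; return class sizes."""
    rng = random.Random(seed)
    sizes = [1]
    owners = [0]  # one entry per element, storing the index of its class
    for _ in range(steps):
        for _ in range(m):
            # A uniformly chosen element belongs to class i with
            # probability N_i / sum(N), which is Equation 2.
            cls = owners[rng.randrange(len(owners))]
            sizes[cls] += 1
            owners.append(cls)
        sizes.append(1)  # the new class of size 1
        owners.append(len(sizes) - 1)
    return sizes
```

For $m = 4$, `tail_exponent(4)` should come out close to
$-(2 + 1/m) = -2.25$, in line with the power law tail stated
above. `simulate_yule(steps, m)` returns `steps + 1` classes
holding `1 + steps * (m + 1)` elements in total, so the average
class size approaches $m + 1$ as the text describes.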
Figure 4: Illustration of Equation 4. Individual classes grow
constantly, i.e., move to the right over time, as indicated by
arrows. A stationary distribution means that the height of
each bar remains constant. (x-axis: class size, with bars at
N-1, N and N+1; y-axis: number of classes)

3.2 Preferential attachment models for networks and trees
A similar model has been formulated for network growth by
Barabási and Albert [2], which explains the formation of a
power law distribution in connectivity degree of nodes. It
assumes that the networks grow in terms of nodes and edges,
and that every newly added node to the system connects
with a fixed number of edges to existing nodes. Attachment
is again preferential, i.e. the probability for a newly added
node $i$ to connect to a certain existing node $j$ is
proportional to the number of existing edges of node $j$.

A node in the Barabási-Albert (BA) model corresponds to a
class in Yule's model, and a new edge to two newly assigned
elements. Every added edge counts towards both the degree of an
existing node $j$ and that of the newly added node $i$. For
this reason the existing nodes $j$ and the newly added node $i$
always grow by the same number of edges, implying $m = 1$
and consequently $\beta = 2$ in the BA-model, independently of
the number of edges that each new node creates.

The seminal BA-model has been extended in many ways.
For hierarchical taxonomies, we use a preferential attachment
model for trees by [17]. The authors considered growth
via directed edges, and explain power law formation in the
in-degree, i.e. the edges directed from children to parent in
a tree structure. In contrast to the BA-model, newly added
nodes and existing nodes do not increase their in-degree by
the same amount, since new nodes start with an in-degree
of 0. Leaf nodes thus cannot attract attachment of nodes,
and preferential attachment alone cannot lead to a power law.
A small random term ensures that some nodes attach
to existing ones independently of their degree, which is
analogous to the start of a new class in the Yule model.
The probability $v$ that a new node attaches as a child to the
existing node $i$ with in-degree $dg_i$ becomes

$$v(i) = w\, \frac{d_i - 1}{DG} + (1 - w)\, \frac{1}{DG}, \tag{8}$$

where $DG$ is the size of the system measured in the total
number of in-degrees. $w \in [0, 1]$ denotes the probability that
the attachment is preferential, $(1 - w)$ the probability that
it is random to any node, independently of their numbers
of in-degrees. As has been done for the Yule process [26;
23; 14; 29], the stationary distribution is again derived via
the master equation (Equation 4). The exponent of the asymptotic
power law in the in-degree distribution is $\beta = 1 + 1/w$. This
model is suitable to explain scaling properties of the tree or
network structure of large-scale web taxonomies, which have
also been analysed empirically, for instance for subcategories
of Wikipedia [7]. It has also been applied to directory trees
in [14].
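Equation 8 can be evaluated for a small tree as follows. This is
only a sketch under a stated assumption: the numerator
$(d_i - 1)$ of the preferential term is taken to be the
in-degree of node $i$, so that leaf nodes attract new children
only through the random term, as the text describes. The
function name `attachment_weights` and the example tree are
hypothetical.

```python
def attachment_weights(in_degrees, w):
    """Evaluate Equation 8 for every node, given the nodes' in-degrees.

    Assumption: the term (d_i - 1) in Equation 8 equals the in-degree
    of node i, so leaf nodes receive only the random contribution.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    total_in_degree = sum(in_degrees)  # DG in the text
    if total_in_degree == 0:
        raise ValueError("the tree needs at least one child-to-parent link")
    return [
        w * dg / total_in_degree + (1.0 - w) / total_in_degree
        for dg in in_degrees
    ]


# Hypothetical tree: a root with two leaf children, w = 0.5.
weights = attachment_weights([2, 0, 0], w=0.5)
```

Under this assumption the values sum to $w + (1 - w)\, n / DG$ for
a tree with $n$ nodes. Since a single rooted tree has
$DG = n - 1$ child-to-parent links, the sum approaches 1 as the
tree grows; a sampler for small trees would normalise the
weights before drawing.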
3.3 Model for hierarchical web taxonomies
We now apply these models to large-scale web taxonomies
like DMOZ. Empirically, we uncovered two scaling laws: (a)
one for the size distribution of leaf categories and (b) one for
the in-degree (child-to-parent link) distribution of categories
(shown in Figure 2). These two scaling laws are linked in a
non-trivial manner: a category may be very small or even
not contain any websites, but nevertheless be highly connected.
Since, on the other hand, (a) and (b) arise jointly,
we propose here a model generating the two scaling laws
in a simple generic manner. We suggest a combination of
the two processes detailed in subsections 3.1 and 3.2 to
describe the growth process: websites are continuously added
to the system, and classified into categories by human referees.
At the same time, the categories are not a mere set,
but form a tree structure, which itself grows in two quantities:
in the number of nodes (categories) and in the number of
in-degrees of nodes (child-to-parent links, i.e.
subcategory-to-category links). Based on the DMOZ rules that
tell voluntary referees how to classify websites, we propose a
simple combined description of the process. Altogether, the
database grows in three quantities:

(i) Growth in websites. New websites are assigned into
categories $i$, with probability $p(i) \propto N_i$ (Figure 5).
This assignment happens independently of the hierarchy
level of the category. However, only leaf categories
may receive documents.
Figure 5: (i): A website is assigned to existing categories
with p(i) ∝ Ni .
(ii) Growth in categories. With probability 1/m, the referees
assign a website into a newly created category,
at any level of the hierarchy (Figure 6).
This assumption would suffice to create a power law in
the category size distribution, but since a tree-structure
among categories exists, we also assume that the event
of category creation is also attaching at particular places
to the tree structure. The probability v(i) that a category
is created as the child of a certain parent category
i can depend in addition on the in-degree di of that
category (see Equation 9).
Figure 6: (ii): Growth in categories is equivalent to growth
of the tree structure in terms of in-degrees.
(iii) Growth in children categories. Finally, the hierarchy
may also grow in terms of levels, since with a certain
probability (1 − w), new children categories are assigned independently of the number of children, i.e.
its in-degree di of the category i (Figure 7). Like in
[17], the attachment probability to a parent i is
$$v(i) = w \, \frac{d_{g_i} - 1}{D_G} + (1 - w) \, \frac{\epsilon_i}{D_G}. \qquad (9)$$
Figure 7: (iii): Growth in children categories.
Equation 8, where $\epsilon_i = 1$, would suffice to explain
power law in-degrees $d_{g_i}$ and in category sizes $N_i$.
To link the two processes more plausibly, it can be
assumed that the second term in Equation 9, denoting
assignment of new 'first children', depends on the size
$N_i$ of parent categories,
$$\epsilon_i = \frac{N_i}{N}, \qquad (10)$$
since this is closer to the rules by which the referees
create new categories, but is not essential for the explanation
of the power laws. It reflects that the bigger a leaf category,
the higher the probability that referees create a child category
when assigning a new website to it.
To summarize, the central idea of this joint model is to consider
two measures for the size of a category: the number of its websites
$N_i$ (which governs the preferential attachment of new websites),
and its in-degree, i.e. the number of its children $d_{g_i}$, which
governs the preferential attachment of new categories. To explain
the power law in the category sizes, assumptions (i) and (ii) are
the requirements. For the power law in the number of indegrees,
assumptions (ii) and (iii) are the requirements. The empirically
found exponents β = 1.1 and γ = 1.9 yield a frequency of new
categories 1/m = 0.1 and a frequency of new indegrees (1 − w) = 0.9.
3.4 Other interpretations
Instead of assuming in Equations 9 and 10 that referees decide
to open a single child category, it is more realistic to
assume that an existing category is restructured, i.e. one or
several child categories are created, and websites are moved
into these new categories such that the parent category contains
fewer websites or even none at all. If one of the new
children categories inherits all websites of the parent category
(see Figure 8), the Yule model applies directly. If
the websites are partitioned differently, the model contains
effective shrinking of categories. This is not described by
the Yule model, and the master Equation 4 considers only
growing categories. However, it has been shown [29; 21]
that models including shrinking categories also lead to the
formation of power laws. Further generalizations compatible
with power law formation are that new categories do not
necessarily start with one document, and that the frequency
of new categories does not need to be constant.
Figure 8: Model without and with shrinking categories. In
the left figure, a child category inherits all the elements of
its parent and takes its place in the size distribution.
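As a rough illustration of the growth rules (i) to (iii) of Section 3.3, the following Python sketch simulates the joint process. It is a simplification under stated assumptions, not the authors' implementation: the function name `simulate_taxonomy`, the single-root initial state, and the uniform fallback when no category has children yet are hypothetical choices made for this example. A new category is created with probability 1/m; its parent is drawn proportionally to the number of children with probability w and uniformly at random otherwise. The defaults m = 10 and w = 0.1 mirror the frequencies 1/m = 0.1 and (1 − w) = 0.9 reported in the text.

```python
import random


def simulate_taxonomy(steps, m=10.0, w=0.1, seed=0):
    """Toy simulation of the joint growth process (illustrative assumptions only).

    Returns two parallel lists: sizes[i] is the number of websites N_i of
    category i, children[i] is its number of child categories.
    """
    rng = random.Random(seed)
    sizes = [1]      # category 0 is the root and starts with one website
    children = [0]   # the root starts without children, so it is a leaf
    for _ in range(steps):
        if rng.random() < 1.0 / m:
            # Rules (ii)/(iii): the new website opens a new category.
            if rng.random() < w and sum(children) > 0:
                # Preferential choice: parent drawn proportionally to its children.
                parent = rng.choices(range(len(sizes)), weights=children)[0]
            else:
                # Random choice: every existing category is equally likely.
                parent = rng.randrange(len(sizes))
            children[parent] += 1
            sizes.append(1)
            children.append(0)
        else:
            # Rule (i): assign the website to an existing leaf with p(i) ∝ N_i.
            leaves = [i for i, c in enumerate(children) if c == 0]
            weights = [sizes[i] for i in leaves]
            target = rng.choices(leaves, weights=weights)[0]
            sizes[target] += 1
    return sizes, children


if __name__ == "__main__":
    sizes, children = simulate_taxonomy(5000)
    print("categories:", len(sizes))
    print("largest category sizes:", sorted(sizes, reverse=True)[:5])
```

Sorting `sizes` and `children` in decreasing order and plotting them against rank on log-log axes gives a quick visual check of whether the simulated taxonomy shows heavy-tailed behaviour; the sketch makes no claim about reproducing the fitted exponents.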
Figure 9: Category size distribution for each level of the
LSHTC2-DMOZ dataset.
(Levels 2 to 5; x-axis: category size N; y-axis: # of categories with Ni > N.)
Figure 11: Number of features vs rank distribution.
(Power-law fit with δ = 1.9; x-axis: category size in features d; y-axis: # of categories with di > d.)
3.5 Limitations
However, Figures 1 and 2 do not exhibit perfect power law
decay for several reasons. Firstly, the dataset is limited.
Secondly, the hypothesis that the assignment probability
(Equation 2) depends uniquely on the size of a category
might be too strong for web directories, neglecting the change
in importance of topics. In reality, big categories can exist
which receive only few new documents or none at all.
Dorogovtsev and Mendes [9] have studied this problem by introducing
an assignment probability that decays exponentially
with age. For a low decay parameter they show that the
stronger this decay, the steeper the power law; for strong
decay, no power law forms. A last reason might be that referees
re-structure categories in ways strongly deviating from
the rules (i) - (iii).
3.6 Statistics per hierarchy level
The tree-structure of a database allows also to study the
sizes of class belonging to a given level of the hierarchy. As
shown in Figure 3 the DMOZ database contains 5 levels of
different size. If only classes on a given level l of the hierarchy
are considered, we equally found a power law in category
size distribution as shown in Figure 9. Per-level power
law decay has also been found for the in-degree distribution.
This result may equally be explained by the model
introduced above: Equations 2 and 9 respectively, are valid
also if instead of p(k) one considers the conditional probability
p(l)p(i|l), where
$p(l) = \frac{\sum_{i'=1,l}^{\kappa} N_{i',l}}{\sum_{i'=1}^{\kappa} N_{i'}}$
is the probability of assignment to a given level, and
$p(i|l) = \frac{N_{i,l}}{\sum_{i'=1,l}^{\kappa} N_{i',l}}$
the probability of being assigned to a given class within that
level. The formation process may be seen as a Yule process
within a level if $\sum_{i'=1,l}^{\kappa} N_{i',l}$ is used for the normalization
in Equation 2, and this formation happens with probability
p(l) that a website gets assigned into level l. Thereby,
the rate $m_l$ at which new classes are created need not
be the same
for every level, and therefore the exponent of
the power law fit may vary from level to level. Power law
decay for the per-level class size distribution is a straightforward
corollary of the described formation process, and
will be used in Section 5 to analyse the space complexity of
hierarchical classifiers.
4. RELATION BETWEEN CATEGORY SIZE AND NUMBER OF FEATURES
Having explained the formation of two scaling laws in the
database, a third one has been found for the number of
features di in each category, G(d) (see Figures 11 and 12).
This is a consequence of both the category size distribution,
shown (in Figure 1) in combination with another power law,
termed Heaps' law [10]. This empirical law states that the
number of distinct words R in a document is related to the
length n of a document as follows
$$R(n) = K n^{\alpha}, \qquad (11)$$
where the empirical α is typically between 0.4 and 0.6. For
the LSHTC2-DMOZ dataset, Figure 10 shows that for the
collection of words and the collection of websites, similar
exponents are found. An interpretation of this result is that
the total number of words in a category can be measured
approximately by the number of websites in a category,
although not all websites have the same length.
Figure 10 shows that bigger categories contain also more features,
but this increase is weaker than the increase in websites.
This implies that fewer very 'feature-rich' categories exist,
which is also reflected in the high decay exponent δ = 1.9
of a power-law fit in Figure 11 (compared to the slower decay
of the category size distribution shown in Figure 1, where
β = 1.1). Catenation of the size distribution measured in
features and Heaps' law yields again size distribution measured
in websites: $P(i) = R(G(d_i))$, i.e. multiplication of
the exponents yields that δ · α = 1.1 which confirms our
empirically found value β = 1.1.
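The exponent arithmetic above can be checked with a short Python sketch. It is only an illustration: the helper `fit_power_law_exponent` is a hypothetical name, the data are synthetic points generated exactly from Equation 11 (not the LSHTC2-DMOZ measurements), and the constant `heaps_k` is an arbitrary assumption. The sketch recovers α as the slope of a least-squares line in log-log space and then multiplies δ = 1.9 by the two α values shown in Figure 10.

```python
import math


def fit_power_law_exponent(xs, ys):
    """Slope of the least-squares line of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Synthetic Heaps'-law data R(n) = K * n**alpha (assumed values, noise-free).
heaps_k, alpha = 5.0, 0.59
lengths = [10 ** k for k in range(2, 8)]
distinct_words = [heaps_k * n ** alpha for n in lengths]
alpha_hat = fit_power_law_exponent(lengths, distinct_words)

# Multiplying the feature-size decay exponent by the Heaps exponents.
delta = 1.9
beta_from_words = delta * 0.59  # alpha measured against number of words
beta_from_docs = delta * 0.53   # alpha measured against number of documents
```

With the two exponents of Figure 10 the product δ · α falls between roughly 1.0 and 1.1, which is the sense in which it agrees with the fitted β = 1.1.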
(Figure 10 panels: nb of features vs. nb of words, α = 0.59; nb of features vs. nb of docs in collection, α = 0.53.)
Figure 10: Heaps’ law: number of distinct words vs. number of words, and vs number of documents.
5. SPACE COMPLEXITY OF LARGE-SCALE HIERARCHICAL CLASSIFICATION
Fat-tailed distributions in large-scale web taxonomies highlight
the underlying structure and semantics which are useful
to visualize important properties of the data especially in
big data scenarios. In this section we focus on the applications
in the context of large-scale hierarchical classification,
wherein the fit of power law distribution to such taxonomies
can be leveraged to concretely analyse the space complexity
of large-scale hierarchical classifiers in the context of a
generic linear classifier deployed in top-down hierarchical
cascade.
In the following sections we first present formally the task of
hierarchical classification and then we proceed to the space
complexity analysis for large-scale systems. Finally, we
empirically validate the derived bounds.
5.1 Hierarchical Classification
In single-label multi-class hierarchical classification, the training
set can be represented by $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$. In the
context of text classification, $x^{(i)} \in \mathcal{X}$ denotes the vector
representation of document i in an input space $\mathcal{X} \subseteq \mathbb{R}^d$.
The hierarchy in the form of a rooted tree is given by $G = (V, E)$,
where $V \supseteq \mathcal{Y}$ denotes the set of nodes of G, and
E denotes the set of edges with parent-to-child orientation.
The leaves of the tree, which usually form the set of target
classes, are given by $\mathcal{Y} = \{u \in V : \nexists v \in V, (u, v) \in E\}$.
Assuming that there are K classes, the label $y^{(i)} \in \mathcal{Y}$ represents
the class associated with the instance $x^{(i)}$. The hierarchical
relationship among categories implies a transition from generalization
to specialization as one traverses any path from
root towards the leaves. This implies that the documents
which are assigned to a particular leaf also belong to the
inner nodes on the path from the root to that leaf node.
5.2 Space Complexity
The prediction speed for large-scale classification is crucial
for its application in many scenarios of practical importance.
It has been shown in [32; 3] that hierarchical classifiers are
usually faster at training and test time than flat
classifiers. However, given the large physical memory of
modern systems, what also matters in practice is the size
of the trained model with respect to the available physical
memory. We, therefore, compare the space complexity of
hierarchical and flat methods, which governs the size of the
trained model in large scale classification. The goal of this
analysis is to determine the conditions under which the size
of the hierarchically trained linear model is lower than that
of the flat model.
As a prototypical classifier, we use a linear classifier of the
form $w^T x$ which can be obtained using standard algorithms
such as Support Vector Machine or Logistic Regression. In
this work, we apply one-vs-all L2-regularized L2-loss support
vector classification as it has been shown to yield state-of-the-art
performance in the context of large scale text classification
[12]. For flat classification one stores weight vectors
$w_y, \forall y$ and hence in a K class problem in d dimensional
feature space, the space complexity for flat classification is:
$$\mathrm{Size}_{flat} = d \times K \qquad (12)$$
which represents the size of the matrix consisting of K weight
vectors, one for each class, spanning the entire input space.
We need a more sophisticated analysis for computing the
space complexity for hierarchical classification. In this case,
the total number of weight vectors is much larger,
since these are computed for all the nodes in the tree and not
only for the leaves as in flat classification. In spite of this, the
size of the hierarchical model can be much smaller than that of
the flat model in large scale classification. Intuitively,
when the feature set size is high (top levels in the hierarchy),
the number of classes is small, and on the contrary, when the
number of classes is high (at the bottom), the feature set
size is low.
In order to analytically compare the relative sizes of hierarchical
and flat models in the context of large scale classification,
we assume power law behaviour with respect to the
number of features, across levels in the hierarchy. More precisely,
if the categories at a level in the hierarchy are ordered
with respect to the number of features, we observe a power
law behaviour. This has also been verified empirically as illustrated
in Figure 12 for various levels in the hierarchy, for
one of the datasets used in our experiments. More formally,
the feature size $d_{l,r}$ of the r-th ranked category, according
to the number of features, for level l, 1 ≤ l ≤ L − 1, is given
by:
$$d_{l,r} \approx d_{l,1} \, r^{-\beta_l} \qquad (13)$$
where $d_{l,1}$ represents the feature size of the category ranked
1 at level l and $\beta_l > 0$ is the parameter of the power law.
Using this ranking as above, let $b_{l,r}$ represent the number
of children of the r-th ranked category at level l ($b_{l,r}$ is the
branching factor for this category), and let $B_l$ represent the
total number of categories at level l. Then the size of the
entire hierarchical classification model is given by:
$$\mathrm{Size}_{hier} = \sum_{l=1}^{L-1} \sum_{r=1}^{B_l} b_{l,r} \, d_{l,r} \approx \sum_{l=1}^{L-1} \sum_{r=1}^{B_l} b_{l,r} \, d_{l,1} \, r^{-\beta_l} \qquad (14)$$
Here level l = 1 corresponds to the root node, with $B_1 = 1$.
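To make Equations 12 to 14 concrete, the following Python sketch evaluates both model sizes for a small synthetic hierarchy. Everything about the example is an assumption chosen for illustration: a complete tree with branching factor 10 and depth L = 4 (so K = 1000 leaves), a top-ranked feature size of 100000 at every internal level, and a common power-law parameter of 1.5. The function names `size_flat` and `size_hier` are hypothetical.

```python
def size_flat(d, num_leaves):
    """Equation 12: one d-dimensional weight vector per leaf class."""
    return d * num_leaves


def size_hier(branching, d_top, betas):
    """Right-hand side of Equation 14.

    branching[l][r] is the number of children of the (r+1)-th ranked
    category at internal level l (index 0 is the root level),
    d_top[l] is the feature size of the top-ranked category at that level,
    betas[l] is the power-law parameter of that level.
    """
    total = 0.0
    for b_level, d_l1, beta_l in zip(branching, d_top, betas):
        for rank, b in enumerate(b_level, start=1):
            # Each node stores one weight vector per child, of size d_{l,r}.
            total += b * d_l1 * rank ** (-beta_l)
    return total


# Synthetic complete tree: 1, 10 and 100 internal nodes, 10 children each.
b, d1, K = 10, 100000, 1000
branching = [[b] * 1, [b] * 10, [b] * 100]
d_top = [d1, d1, d1]
betas = [1.5, 1.5, 1.5]

flat = size_flat(d1, K)
hier = size_hier(branching, d_top, betas)
```

In this toy setting the hierarchical model needs far fewer parameters than the flat one, because only the top-ranked categories at each level carry a large feature set.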
Figure 12: Power-law variation for features in different levels
for LSHTC2-a dataset, Y-axis represents the feature set size
plotted against rank of the categories on X-axis
We now state a proposition that shows that, under some conFigure
Power-law
variation
for features
in different
levels
ditions 12:
on the
depth of
the hierarchy,
its number
of leaves,
for
LSHTC2-a
dataset,
Y-axis
represents
the
feature
set
its branching factors and power law parameters, the sizesize
of
plotted
againstclassifier
rank of is
the
categories
a hierarchical
below
that ofonitsX-axis
flat version.
1. For a hierarchy
of that,
categories
depth
L
WeProposition
now state a proposition
that shows
underof
some
conβl andits
b=
maxl,rof
bl,rleaves,
. Deand K leaves,
β =ofmin
ditions
on the let
depth
the1≤l≤L
hierarchy,
number
noting
the space
complexity
of alaw
hierarchical
classification
its branching
factors
and power
parameters,
the size of
and the
one ofthat
its corresponding
flat vermodel
by Sizehier
a hierarchical
classifier
is below
of its flat version.
sion by Sizef lat , one has:
Proposition 1. For a hierarchy of categories of depth L
K
= then
maxl,r bl,r . Deand KFor
leaves,
letif ββ =
β > 1,
> min1≤l≤L βl and
(> b1),
K − b(L
1)
(15)
noting the space complexity
of a− hierarchical
classification
of its
corresponding
model by Sizehier and the one Size
< Sizef lat flat verhier
sion by Sizef lat , one has:
1−β
b(L−1)(1−β) − 1
<
K, then
ForFor
For $\beta > 1$, if $\beta > \frac{K}{K - b(L-1)}$ $(> 1)$, then
$$\mathrm{Size}_{hier} < \mathrm{Size}_{flat} \tag{15}$$
For $0 < \beta < 1$, if $\frac{b^{(L-1)(1-\beta)} - 1}{b^{(1-\beta)} - 1} < \frac{1-\beta}{b} K$, then
$$\mathrm{Size}_{hier} < \mathrm{Size}_{flat} \tag{16}$$
Proof. As $d_{l,1} \le d_1$ and $B_l \le b^{(l-1)}$ for $1 \le l \le L$, one has, from Equation 14 and the definitions of $\beta$ and $b$:
$$\mathrm{Size}_{hier} \le b\, d_1 \sum_{l=1}^{L-1} \sum_{r=1}^{b^{(l-1)}} r^{-\beta}$$
One can then bound $\sum_{r=1}^{b^{(l-1)}} r^{-\beta}$ using ([32]):
$$\sum_{r=1}^{b^{(l-1)}} r^{-\beta} < \frac{b^{(l-1)(1-\beta)} - \beta}{1-\beta} \quad \text{for } \beta \neq 0, 1 \tag{17}$$
leading to, for $\beta \neq 0, 1$:
$$\mathrm{Size}_{hier} < b\, d_1 \sum_{l=1}^{L-1} \frac{b^{(l-1)(1-\beta)} - \beta}{1-\beta} = b\, d_1 \left[ \frac{b^{(L-1)(1-\beta)} - 1}{(b^{(1-\beta)} - 1)(1-\beta)} - (L-1)\frac{\beta}{1-\beta} \right] \tag{18}$$
where the last equality is based on the sum of the first terms of the geometric series $(b^{(1-\beta)})^l$.
If $\beta > 1$, since $b > 1$, it implies that $\frac{b^{(L-1)(1-\beta)} - 1}{(b^{(1-\beta)} - 1)(1-\beta)} < 0$. Therefore, Inequality 18 can be re-written as:
$$\mathrm{Size}_{hier} < b\, d_1 (L-1) \frac{\beta}{\beta - 1}$$
Using our notation, the size of the corresponding flat classifier is $\mathrm{Size}_{flat} = K d_1$, where $K$ denotes the number of leaves. Thus:
$$\text{If } \beta > \frac{K}{K - b(L-1)}\ (> 1), \text{ then } \mathrm{Size}_{hier} < \mathrm{Size}_{flat}$$
which proves Condition 15.
The proof for Condition 16 is similar: assuming $0 < \beta < 1$, it is this time the second term in Equation 18, $-(L-1)\frac{\beta}{1-\beta}$, which is negative, so that one obtains:
$$\mathrm{Size}_{hier} < b\, d_1 \frac{b^{(L-1)(1-\beta)} - 1}{(b^{(1-\beta)} - 1)(1-\beta)}$$
and then:
$$\text{If } \frac{b^{(L-1)(1-\beta)} - 1}{b^{(1-\beta)} - 1} < \frac{1-\beta}{b} K, \text{ then } \mathrm{Size}_{hier} < \mathrm{Size}_{flat}$$
which concludes the proof of the proposition.
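The bound of Inequality 17 can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names and the sampled values of β and n are illustrative assumptions, and n plays the role of b^(l−1). Note that for n = 1 both sides equal 1, so the strict inequality is only checked for n ≥ 2.

```python
def power_sum(n, beta):
    """Left-hand side of Inequality 17: sum of r**(-beta) for r = 1..n."""
    return sum(r ** (-beta) for r in range(1, n + 1))


def integral_bound(n, beta):
    """Right-hand side of Inequality 17: (n**(1-beta) - beta) / (1-beta)."""
    if beta == 1:
        raise ValueError("the bound is undefined for beta == 1")
    return (n ** (1 - beta) - beta) / (1 - beta)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for beta in (0.5, 1.5, 2.0):
        for n in (2, 10, 1000):
            lhs = power_sum(n, beta)
            rhs = integral_bound(n, beta)
            print(f"beta={beta}, n={n}: sum={lhs:.4f}, bound={rhs:.4f}")
```

The bound follows from comparing the sum with 1 plus the integral of x^(−β) from 1 to n, which is why it holds on both sides of β = 1.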
It can be shown, but this is beyond the scope of this paper, that Condition 16 is satisfied for a range of values of $\beta \in\ ]0, 1[$. However, as is shown in the experimental part, it is Condition 15 of Proposition 1 that holds in practice.
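Conditions 15 and 16 are simple enough to evaluate for a given taxonomy. The sketch below is one possible way to do so, under stated assumptions: the function names are hypothetical, and the example values of K, b, L and β are made up for illustration rather than taken from the datasets of the paper.

```python
def condition_15_threshold(K, b, L):
    """Return K / (K - b(L-1)), the threshold that beta must exceed.

    Assumes K > b(L-1); otherwise Condition 15 cannot be satisfied.
    """
    denominator = K - b * (L - 1)
    if denominator <= 0:
        raise ValueError("Condition 15 requires K > b(L-1)")
    return K / denominator


def condition_15_holds(beta, K, b, L):
    """Condition 15: beta > 1 and beta > K / (K - b(L-1))."""
    return beta > 1 and beta > condition_15_threshold(K, b, L)


def condition_16_holds(beta, K, b, L):
    """Condition 16, which only applies for 0 < beta < 1."""
    if not 0 < beta < 1:
        return False
    lhs = (b ** ((L - 1) * (1 - beta)) - 1) / (b ** (1 - beta) - 1)
    rhs = (1 - beta) / b * K
    return lhs < rhs


if __name__ == "__main__":
    # Hypothetical taxonomy: 1000 leaves, branching factor 10, depth 4.
    K, b, L = 1000, 10, 4
    print("threshold:", condition_15_threshold(K, b, L))
    print("Condition 15 with beta=1.5:", condition_15_holds(1.5, K, b, L))
    print("Condition 16 with beta=0.5:", condition_16_holds(0.5, K, b, L))
```

When either condition holds, the proposition guarantees that the hierarchical model is smaller than the flat one; when neither holds, the proposition is simply silent.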
The previous proposition complements the analysis presented in [32], in which it is shown that the training and test time of hierarchical classifiers is considerably lower than that of their flat counterparts. In this work we show that the space complexity of hierarchical classifiers is also better, under a condition that holds in practice, than that of their flat counterparts. Therefore, for large-scale taxonomies whose feature size distribution exhibits power law decay, hierarchical classifiers should be faster than flat ones, for the following reasons:
1. As shown above, the space complexity of hierarchical classifiers is lower than that of flat classifiers.
2. For K classes, only O(log K) classifiers need to be evaluated per test document, as against O(K) classifiers in flat classification.
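The second reason can be illustrated by counting classifier evaluations. This is a rough sketch under simplifying assumptions that are not stated in the paper: the hierarchy is a balanced tree in which every parent has the same number of children and holds one classifier, so that a top-down pass evaluates one classifier per level. The function names are hypothetical.

```python
def evaluations_top_down(num_classes, branching):
    """Classifiers evaluated on one root-to-leaf path of a balanced tree.

    Assumption: one classifier per parent node, one parent visited per level.
    """
    if num_classes < 1 or branching < 2:
        raise ValueError("need num_classes >= 1 and branching >= 2")
    levels, reachable_leaves = 0, 1
    # Integer arithmetic avoids rounding issues of floating-point logarithms.
    while reachable_leaves < num_classes:
        reachable_leaves *= branching
        levels += 1
    return levels


def evaluations_flat(num_classes):
    """One classifier per class is evaluated in flat classification."""
    return num_classes


if __name__ == "__main__":
    for K, b in ((1024, 2), (125, 5), (10_000, 10)):
        print(K, b, evaluations_top_down(K, b), evaluations_flat(K))
```

Real taxonomies are rarely balanced, so the logarithmic count should be read as the order of growth rather than an exact figure.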
In order to empirically validate the claim of Proposition 1, we measured the trained model sizes of a standard top-down hierarchical scheme (TD), which uses a linear classifier at each parent of the hierarchy, and of the flat one.
We use the publicly available DMOZ data of the LSHTC challenge, which is a subset of Directory Mozilla. More specifically, we used the large dataset of the LSHTC-2010 edition, and two datasets were extracted from the LSHTC-2011 edition. These are referred to as LSHTC1-large, LSHTC2-a and LSHTC2-b respectively in Table 2. The fourth dataset (IPC) comes from the patent collection released by the World Intellectual Property Organization. The datasets are in the LibSVM format and have been preprocessed by stemming and stopword removal. Various properties of interest for the datasets are shown in Table 2.

Dataset        #Tr./#Test      #Classes   #Feat.
LSHTC1-large   93,805/34,880   12,294     347,255
LSHTC2-a       25,310/6,441    1,789      145,859
LSHTC2-b       36,834/9,605    3,672      145,354
IPC            46,324/28,926   451        1,123,497

Table 2: Datasets for hierarchical classification with the properties: number of training/test examples, target classes and size of the feature space. The depth of the hierarchy tree for LSHTC datasets is 6 and for the IPC dataset is 4.
Table 3 shows the difference in trained model size (actual value of the model size on the hard drive) between the two classification schemes for the four datasets, along with the values defined in Proposition 1. The last column gives the quantity K/(K−b(L−1)) of Condition 15.

Dataset        TD     Flat   β      b     K/(K−b(L−1))
LSHTC1-large   2.8    90.0   1.62   344   1.12
LSHTC2-a       0.46   5.4    1.35   55    1.14
LSHTC2-b       1.1    11.9   1.53   77    1.09
IPC            3.6    10.5   2.03   34    1.17

Table 3: Model size (in GB) for flat and hierarchical models along with the corresponding values defined in Proposition 1. The last column gives the quantity K/(K−b(L−1)).
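The figures of Table 3 can be cross-checked with a few lines of code. The sketch below copies the TD size, flat size, β and Condition 15 threshold from the table as given; the variable and function names are illustrative assumptions. It computes the flat-to-TD size ratio and whether β exceeds the threshold for each dataset.

```python
# (dataset, TD size in GB, flat size in GB, beta, Condition 15 threshold),
# copied from Table 3.
TABLE_3 = [
    ("LSHTC1-large", 2.8, 90.0, 1.62, 1.12),
    ("LSHTC2-a", 0.46, 5.4, 1.35, 1.14),
    ("LSHTC2-b", 1.1, 11.9, 1.53, 1.09),
    ("IPC", 3.6, 10.5, 2.03, 1.17),
]


def summarize_table3(rows):
    """Return (dataset, flat/TD size ratio, beta > threshold) per row."""
    summary = []
    for name, td_size, flat_size, beta, threshold in rows:
        summary.append((name, flat_size / td_size, beta > threshold))
    return summary


if __name__ == "__main__":
    for name, ratio, condition_met in summarize_table3(TABLE_3):
        print(f"{name}: flat/TD = {ratio:.1f}, Condition 15 met: {condition_met}")
```

In every row β exceeds the threshold, and the flat model is larger than the TD model, by more than a factor of ten for the three DMOZ datasets and by roughly a factor of three for IPC.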
As shown for the three DMOZ datasets, the trained model for flat classifiers can be an order of magnitude larger than for hierarchical classification. This results from the sparse and high-dimensional nature of the problem, which is quite typical in text classification. For flat classifiers, the entire feature set participates for all the classes, but for top-down classification, the number of classes and features participating in classifier training are inversely related when traversing the tree from the root towards the leaves. As shown in Proposition 1, the power law exponent β plays a crucial role in reducing the model size of the hierarchical classifier.
6. CONCLUSIONS
In this work we presented a model in order to explain the dynamics that exist in the creation and evolution of large-scale taxonomies such as the DMOZ directory, where the categories are organized in a hierarchical form. More specifically, the presented process models jointly the growth in the size of the categories (in terms of documents) as well as the growth of the taxonomy in terms of categories, which to our knowledge have not been addressed in a joint framework. From one of them, the power law in category size distribution, we derived power laws at each level of the hierarchy, and with the help of Heaps’ law a third scaling law in the feature size distribution of categories, which we then exploit for performing an analysis of the space complexity of linear classifiers in large-scale taxonomies. We provided a grounded analysis of the space complexity for hierarchical and flat classifiers and proved that the complexity of the former is always lower than that of the latter. The analysis has been empirically validated on several large-scale datasets, showing that the size of the hierarchical models can be significantly smaller than the ones created by a flat classifier.
The space complexity analysis can be used in order to estimate beforehand the size of trained models for large-scale data. This is of importance in large-scale systems where the size of the trained models may impact the inference time.
7. ACKNOWLEDGEMENTS
This work has been partially supported by ANR project Class-Y (ANR-10-BLAN-0211), BioASQ European project (grant agreement no. 318652), LabEx PERSYVAL-Lab ANR-11-LABX-0025, and the Mastodons project Garguantua.
8. REFERENCES
[1] R. Babbar, I. Partalas, C. Metzig, E. Gaussier, and M.-R. Amini. Comparative classifier evaluation for web-scale taxonomies using power law. In European Semantic Web Conference, 2013.
[2] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[3] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Neural Information Processing Systems, pages 163–171, 2010.
[4] P. N. Bennett and N. Nguyen. Refined experts: improving classification in large taxonomies. In Proceedings of the 32nd international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11–18, 2009.
[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances In Neural Information Processing Systems, pages 161–168, 2008.
[6] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 78–87, 2004.
[7] A. Capocci, V. D. Servedio, F. Colaiori, L. S. Buriol, D. Donato, S. Leonardi, and G. Caldarelli. Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E, 74(3):036116, 2006.
[8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proceedings of the twenty-first international conference on Machine learning, ICML ’04, pages 27–34, 2004.
[9] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks with aging of sites. Physical Review E, 62(2):1842, 2000.
[10] L. Egghe. Untangling Herdan’s law and Heaps’ law: Mathematical and informetric arguments. Journal of the American Society for Information Science and Technology, 58(5):702–709, 2007.
[11] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. SIGCOMM.
[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[13] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In IEEE International Conference on Computer Vision (ICCV), pages 2072–2079, 2011.
[14] M. M. Geipel, C. J. Tessone, and F. Schweitzer. A complementary view on the growth of directory trees. The European Physical Journal B, 71(4):641–648, 2009.
[15] S. Gopal, Y. Yang, B. Bai, and A. Niculescu-Mizil. Bayesian models for large-scale hierarchical classification. In Neural Information Processing Systems, 2012.
[16] G. Jona-Lasinio. Renormalization group and probability theory. Physics Reports, 352(4):439–458, 2001.
[17] K. Klemm, V. M. Eguíluz, and M. San Miguel. Scaling in the structure of directory trees in a computer cluster. Physical Review Letters, 95(12):128701, 2005.
[18] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, 1997.
[19] T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD, 2005.
[20] B. Mandelbrot. A note on a class of skew distribution functions: Analysis and critique of a paper by H. A. Simon. Information and Control, 2(1):90–99, 1959.
[21] C. Metzig and M. B. Gordon. A model for scaling in firms’ size and growth rate distribution. Physica A, 2014.
[22] M. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5):323–351, 2005.
[23] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 2005.
[24] I. Partalas, R. Babbar, É. Gaussier, and C. Amblard. Adaptive classifier selection in large-scale hierarchical classification. In ICONIP, pages 612–619, 2012.
[25] P. Richmond and S. Solomon. Power laws are disguised Boltzmann laws. International Journal of Modern Physics C, 12(03):333–343, 2001.
[26] H. A. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.
[27] C. Song, S. Havlin, and H. A. Makse. Self-similarity of complex networks. Nature, 433(7024):392–395, 2005.
[28] H. Takayasu, A.-H. Sato, and M. Takayasu. Stable infinite variance fluctuations in randomly amplified Langevin systems. Physical Review Letters, 79(6):966–969, 1997.
[29] C. J. Tessone, M. M. Geipel, and F. Schweitzer. Sustainable growth in complex networks. EPL (Europhysics Letters), 96(5):58005, 2011.
[30] K. G. Wilson and J. Kogut. The renormalization group and the ε expansion. Physics Reports, 12(2):75–199, 1974.
[31] G.-R. Xue, D. Xing, Q. Yang, and Y. Yu. Deep classification in large-scale text hierarchies. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 619–626, 2008.
[32] Y. Yang, J. Zhang, and B. Kisiel. A scalability analysis of classifiers in text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’03, pages 96–103, 2003.
[33] G. U. Yule. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character, 213:21–87, 1925.
Interview: Michael Brodie, Leading Database Researcher,
Industry Leader, Thinker
Gregory Piatetsky
KDnuggets
Brookline, MA
gregory@kdnuggets.com
ABSTRACT
We discuss the most important database research advances,
industry developments, the role of relational and NoSQL databases,
Computing Reality, Data Curation, Cloud Computing, the Tamr and
Jisto startups, what he learned as Chief Scientist of Verizon,
Knowledge Discovery, Privacy Issues, and more.
Keywords
Data Curation, NoSQL, Cloud Computing,
Verizon, Privacy, Computing Reality.
1. INTRODUCTION
I had the pleasure of working with Michael Brodie when we were
both at GTE Laboratories in the 1990s, where he was already a
world-famous researcher and a department manager. I recently met him
at another conference, and our discussion led to this interview.
Michael is still very sharp, very active, and busy: he answered
these questions while flying from Boston to Doha, Qatar, where he
is advising Qatar Computing Research Institute.
Parts of this interview were published in KDnuggets [1-3].
2. BACKGROUND
Dr. Michael L. Brodie [4] has served as Chief Scientist of a
Fortune 20 company, an Advisory Board member of leading national
and international research organizations, and an invited speaker and
lecturer. In his role as Chief Scientist Dr. Brodie has researched
and analyzed challenges and opportunities in advanced technology,
architecture, and methodologies for Information Technology
strategies. He has guided advanced deployments of emergent
technologies at industrial scale, most recently Cloud Computing and
Big Data.
Throughout his career Dr. Brodie has been active in both
advanced, academic research and large-scale industrial practice
attempting to obtain mutual benefits from the industrial
deployment of innovative technologies while helping research to
understand industrial requirements and constraints. He has
contributed to multi-disciplinary problem solving at scale in
contexts such as Terrorism and Individual Privacy, and
Information Technology Challenges in Healthcare Reform.
3. INTERVIEW
Gregory Piatetsky: You have started as a researcher in
Databases (PhD from Toronto) and had a very distinguished
and varied career spanning academia, industry, and
government, in US, Europe, Australia, and Latin America
over the last 25+ years. From your unique vantage point, what
were 3 most important database research advances?
Michael Brodie: Three most important database research
advances:
1. Ted Codd’s Relational model of data (1970) is the most
important database research advance as it launched what
is now a $28 BN/year market still growing at 11%
CAGR with over 215 RDBMSs on the market. More
important to me it launched four decades of amazing
research advances starting with query optimization
(Selinger) and transactions (Gray) and innovation that
has probably grown at 20% CAGR.
2. The next most important research advance or stage was
a change in perspective that specific domains require
their own DBMS such as graph databases, array stores,
document stores, key-value stores, NoSQL, NewSQL,
and many more to come. DB-Engines.com lists twelve
DBMS categories, thus bumping the database world
from managing 8% of the world’s data to about 12%, but
due to the growth of non-database data back to 10%.
Soon, due to the role of data in our digitized world, there
will be data management systems for many more
domains. While this is amazingly cool, how do we solve
multi-disciplinary (multi-data domain) problems in a
consistent rather than disjoint way?
3. The next most important research advance is just
emerging and is mind blowing. I call it Computing
Reality, acknowledging that every datum (every real
world observation) is not definitive but probabilistic.
Unlike conventional databases and more like reality,
Computing Reality has no single version of truth. How
do we model such worlds, more realistic worlds, and
compute over them? The simple answer is that it is
already in Big Data sources. There are many related
attempts to address Computing Reality including social
computing, probabilistic computing, probabilistic
databases, Open Worlds in AI, Web Science,
Approximate Computing, Crowd Computing, and more.
Perhaps this will be the next generation of computing.
GP: What about the most important database industry
developments?
MB: Alas, the database industry, like all industries, has a legacy
problem that stifles innovation. It has taken over 30 years to
emerge from the relational era. The most important recent
database industry development came from outside the database
industry: it is Big Data and its marketing arm called MapReduce
and its data sidekicks, Hadoop and NoSQL. Frankly, the
database industry has been insular and protected its relational
turf for FAR too long. Smart folks at Yahoo!, Google and other
places saw value in data, non-database data, and thus emerged
MapReduce, Hadoop, and NoSQL: generally crappy database
ideas, but they woke up the database industry¹. Hadoop and NoSQL
are growing in demand. In time it will be seen that they are
amazing for a very specific problem domain, embarrassingly
parallel problems, but a money pit for everything else. The
importance of MapReduce is that it forced the database industry to
get out of their hammocks.
GP: What is the role of Relational Databases, NoSQL
databases, Graph databases, and other databases today?
MB: Relational Databases have two extremely well established roles.
Conventional row stores serve the OLTP community as the
backbone of enterprise operations. These blindingly fast
transaction processors are moving in-memory. OLTP stores are
modest in number and size (< 1 TB) growing and declining in
lock step with business growth and decline. Column stores,
OLAP, are the backbone of data warehouses and until recently
business intelligence. In general there are huge numbers of these,
often of very large size in the Petabyte and Exabyte range. This is
where Big Data battle lines are being drawn. What fun!!
This is also where we turn from polishing the relational round ball
[5] and focus on the dozen or so other DBMS categories.
Taking over is relative; none of the 12 other categories has more
than 3% of the database market. Graph databases serve graph
applications like networking in communications, telecom, social
networks, and of course NSA applications! But what is wonderful
about these emerging classes of data-domain specific DBMSs is
that we are only now discovering the rich use cases that they
serve.
The use cases define the DBMSs and the DBMSs help formulate
the use cases. SciDB is a superb example of managing scientific
data and computation at scale. It is awkward for both communities
– database folks who don’t speak linear algebra or matrices, and
scientists who only speak R. Exciting times. For a little fun look at
the database-engines list [6].
Database Engines
DB-Engines lists 216 different database management systems,
which are classified according to their database model (e.g.
relational DBMS, key-value stores etc.). This pie-chart shows the
number of systems in each category. Some of the systems belong
to more than one category.
¹ On June 25, 2014 Google launched Cloud Dataflow, replacing
MapReduce and marking the decline of MR and Hadoop, as
predicted at launch in 2010 by Mike Stonebraker.
Popularity changes per category, April 2014, over 1 year
• Graph DBMS – growing dramatically 3.5X
• Wide column stores – 2X
• Document stores – 2X
• Native XML DBMS – 1.5X
• Key-value stores – 1.5X
• Search engines – 1.5X
• RDF stores – 1.5X
• Object oriented DBMS – flat
• Multivalue DBMS – flat
• Relational DBMS – flat
GP: You have held an amazing variety of positions in
academia, industry, government organizations, VC firms, and
start-ups, in the US, Brazil, Canada, Australia, and Europe.
Which 3 positions were most satisfying to you and why?
MB: What a great question. Thank you for asking because it
caused me to think about what I have really enjoyed over 40
years. Somehow CSAIL at MIT and the Faculty of Computing
and Communications at EPFL jump to mind.
There are scary smart people at those places. Like climbing
mountains it both scares and exhilarates me. To be frank my jobs
at big enterprises in hindsight are confusing. I guess I was window
dressing because my role did not feel like it had impact. So
getting motivated and scared at MIT and EPFL are probably top,
so there’s number one. Why? Just look down 5,000 feet and ask
why am I here?
Second is a combination of Advisory roles at US Academy of
Science, DERI, STI, ERCIM, Web Science Trust, and others
because they gave me a sense of collaborating, challenging, and
contributing. How cool is that?
Third would be working at startups like Tamr and Jisto. Imagine
waking up in the morning and thinking you might change the
world. That requires that I conceive the world not just differently,
but so that it solves someone’s REAL problem. Even more cool.
Gregory Piatetsky: Currently you are an adviser at a startup
called Tamr [7], co-founded by another leading DB researcher
and serial entrepreneur Michael Stonebraker. What can you
tell us about Tamr and its product?
Michael Brodie: Consider the data universe. Since the 1980’s I
have said in keynotes that the database and business worlds deal
with less than 10% of the world’s data most of which is
structured, discrete, and conforms to some schema. With the Web
and Internet of Things in the 1990s massive amounts of
unstructured data began to emerge with a growth rate that was
inconceivable while shrinking database data to less than 8%.
EMC/IDC claims [14] that our Digital Universe is 4.4 zettabytes
and will double every two years until 2020 when it will be 44
zettabytes.
[If you are constantly amazed at the growth of the Digital
World, you don’t understand it yet – A profound, casual
comment of my departed friend, Gerard Berry, Academie
Francais.]
In 1988 or so you, Gregory, and a few others saw the potential of
data with your knowledge discovery in databases – then a radical
idea. Little did others, including me, realize the potential of this,
now named Big Data. Even though Big Data is hot in 2014,
almost 30 years later, it’s application, tools, and technologies are
in their infancy, analogous to the emergence of the Web in the
early 1990s. Just as the Web has and is changing the world, so too
will Big Data.
Compared with database data, Big Data is crazy. It’s largely
not understood; hence it is schema-less or model-less. Big Data is
inconceivably massive, dirty, imprecise, incomplete, and
heterogeneous beyond anything we’ve seen before. Yet it
trumps finite, precise, database data in many ways hence is a
treasure trove of value. Big Data is qualitatively different from
database data that is a small subset of Big Data - EMC/IDC claims
1% as of 2013. It offers far greater potential thus value and
requires different thinking, tools, and techniques. Database data is
approached top-down. Telco billing folks know billing inside out
so they create models that they impose, top-down on data. Data
that does not comply is erroneous. Database data, like Telco bills,
must be precise with a single version of truth, so that the billing
amount is justifiable. Due in part to scale, Big Data must be
approached bottom up. More fundamentally, we should let data
speak; see what models or correlations emerge from the data, e.g.,
to discover if adding strawberry to the popsicle line-up makes
sense (a known unknown) or to discover something we never
thought of (unknown unknowns). Rather than impose a
preconceived, possibly biased, model on data we should
investigate what possible models, interpretations, or correlations
are in the data (possibly in the phenomena) that might help us
understand it.
Hence, the new paradigm is to approach Big Data bottom-up
due to the scale of the data and to let the data speak. Big Data
is a different, larger world than the database world. The database
world (small data) is a small corner of the Big Data world.
Correspondingly Big Data requires new tools, e.g., Big Data
Analytics, Machine Learning (the current red haired child),
Fourier transforms, statistics, visualizations, in short any model
that might help elucidate the wisdom in the data. But how do you
get Big Data, e.g., 100 data sources, 1,000, 100,000 or even
500,000, into these tools? How do you identify the 5,000 data
sources that include Sally Blogs and consolidate them into a
coherent, rational, consistent view of dear Sally? When questions
arise in consolidating Sally’s data, how do you bring the relevant
human expertise, if needed, to bear – at scale on 1 million people?
Many successful Big Data projects report that this data curation
process takes 80% of the project resources leaving 20% for the
problem at hand. Data curation is so costly because it is largely
manual hence it is error prone. That’s where Tamr comes to the
rescue. It is a solution to curate data at scale.
We call it collaborative data curation because it optimizes the use
of indispensable human experts. Data Curation is for Big Data
what Data Integration is for small data.
Data Curation is bottom up and Data Integration is top down.
It took me about a year to understand that fundamental difference.
I have spent over 20 years of my professional life dealing with
those amazing Data Integration platforms and some of the world’s
largest data integration applications. Those technologies and
platforms apply beautifully to database data – small data; they
simply do not apply to Big Data.
To emphasize what is ahead, here is a prediction. Data Integration
is increasingly crucial to combining top-down data into
meaningful views. Data Integration is a huge challenge and huge
market that will not go away. Big Data is orders of magnitude
larger than small or database data. Correspondingly Data Curation
will be orders of magnitude larger than Data Integration. The
world will need Data Curation solutions like Tamr to let data
scientists focus on analytics, the essential use and value of big
data, while containing the costs of data preparation. In addition to
Tamr there are over 65 very cool data curation products
contributing to addressing the growing need and creating a new
software market. What is also cool about data curation is that it
can be used to enrich the existing information assets that are the
core of most enterprise’s applications and operations. Of course,
the really cool potential of data curation is that it makes Big Data
analytics more efficiently available to allow users to discover
things that they never knew! How cool is that?
For more on Data Curation at scale, see Stonebraker [8].
GP: You also advise another startup Jisto. What can you tell
us about your role there?
MB: I am having a blast with Jisto [9]– some amazingly talented
young engineers [PhDs actually] with lots of energy and a killer
idea. Jisto is an exceptional example of the quality you ask about
in the next question.
Cloud computing enabled by virtualization is radically changing
the world by reducing the cost and increasing the availability of
computing resources. Can you imagine that only 50% of the
world’s servers are virtualized?
Pop quiz [do not cheat and read ahead].
What is the average CPU utilization of physical servers,
worldwide? Of virtual servers?
Answer: Virtual machine CPU utilization is typically in the 30-50%
range while physical servers are 10-20%, due to risk and
scheduling, but mostly cultural challenges.
Jisto enables enterprises to transparently run more compute-intensive
workloads on these paid-for but unused resources
whether on premises or in public or private clouds, thus reducing
costs by 75–90% over acquiring more hardware or cloud
resources.
Jisto provides a high-performance, virtualized, elastic cloud-computing
environment from underutilized enterprise or cloud
computing resources (servers, laptops, etc.) without impacting the
primary task on those resources. Organizations that will benefit
most from Jisto are those that run parallelized compute-intensive
applications in the data center or in private and public clouds (e.g.,
Amazon Web Services, Windows Azure, Google Cloud Platform,
IBM SmartCloud).
Jisto is currently looking for early adopters for its beta program
who will gain significant reduction in the cost of their computing
possibly avoiding costly data center expansion.
GP: You are also a Principal at First Founders Limited. What
do you look for in young business ventures - how do you
determine quality?
MB: There are armies of people who evaluate the potential of
startups. The professional ones are called Venture Capitalists
(VCs). The retired ones are called Angels. Like any serious
problem there is due diligence to determine and evaluate the
factors relevant to the business opportunity, the technology, the
business plan, etc. as the many books [10] and formulas suggest.
If you are reading a book, then you don’t know. Ultimately it
comes down to good taste developed over years of successful
experience. Andy Palmer, a serial entrepreneur, good friend, and
very smart guy said “Do it once really well then repeat.” Andy
ought to know, Tamr is about his 25th startup.
At First Founders I can do some technology, Jim can do finance,
Howard can do business plans. Collectively we make a judgment.
But good VCs are the wizards. They have Rolodexes. When their
taste says maybe, they refer the startup to the relevant folks in their
network who essentially do the due diligence for them. Like at
First Founders, the judgment is crowd sourced or, as we say
at Tamr, expert sourced. I have a growing trust of the
crowd and especially of the expert crowd.
GP: You were a Chief Scientist at Verizon for over 10 years
(and before that at GTE Labs which became part of Verizon).
What were some of the most interesting projects you were
involved in at GTE and Verizon?
MB: The technical challenge that stays with me is that addressed
by the Verizon Portal, Verizon’s solution for Enterprise
Telecommunications – providing Telecommunication services to
enterprise customers, such as Microsoft. Verizon, like all large
Telcos, is the result of the merger & acquisition of 300+ smaller
Telcos. Each had at least 3 billing systems; hence Verizon
acquired over 1,000 billing systems. Billing is only one of over a
dozen systems categories, including sales, marketing, ordering,
and provisioning. Providing a customer like Microsoft with a
telephone bill for each Microsoft organization requires integrating
data potentially from over 1,000 databases. As is the case for most
enterprises, Verizon and Microsoft reorganize constantly
complicating the sources to be integrated, like Microsoft, and the
targets, Verizon’s changing businesses, e.g., wireline and FiOS.
Every service company faces this little-discussed massive
challenge.
Integrating 1,000s of operational systems is a backward looking
problem. The cool forward-looking problem was Verizon IT’s
Standard Operating Environment (SOE). Prior to cloud platforms
and cloud providers, Verizon IT (actually one team) sought to
develop an SOE onto which Verizon’s major applications (over
6,000) could be migrated to be managed virtually on an internal
cloud. What a fun challenge. When the team left Verizon as a
group over 60 major corporate applications, including SAP, had
been migrated. Smart folks, good solution that failed in Verizon.
In industry, challenges are 80-20; 80% political, 20%
technical. The SOE is being reborn in the infrastructure of
another major infrastructure corporation.
Finally, the next most interesting and yet unsolved industry
challenge was getting over the legacy that Mike Stonebraker and I
addressed in [11].
How do you keep a massive system up to date in terms of the
application requirements and the underlying technology or
migrate it to a modern, efficient, more cost effective platform?
Enterprises tend to invest only in new revenue generating
opportunities often leaving the legacy problem to grow and grow.
So existing systems like billing languish and accumulate. It’s like
a teenager never tidying their room for 60 years. Now where
are my blue shoes? I suggested to Mike Stonebraker that we
rewrite our 1995 book. He did not even respond to the email,
suggesting that it is largely a political problem and not technical,
no matter the brilliant technical solution provided.
Lesson: If you are a CIO, clean up your goddamn room; you’re
not going out until you do!
GP: Around 1989 when you were a manager at GTE Labs and
I was a member of technical staff there, you were somewhat
skeptical of the idea I proposed for research into Knowledge
Discovery in Databases (then called KDD or Data Mining, and
more recently Predictive Analytics, and Data Science). The
field has progressed significantly since then. From your point
of view, what are the main successes and disappointments of
KDD/Data Mining/Predictive Analytics and can Data Science
become an actual science?
MB: My current research concerns the scientific and
philosophical underpinnings of Big Data and Data Science. With
Big Data we are undergoing a fundamental shift in thinking and in
computing. Big Data is a marvelous tool to investigate What –
correlations or patterns that suggest that things might have or will
occur.
Big Data’s weakness is that it says nothing about Why –
causation or why a phenomenon occurred or will occur.
A pernicious aspect of What are the biases that we bring to it. On
a personal note, my biased recall of 1989 was how marvelous
your ideas were and the amazing potential of data mining. I accept
your view that I was skeptical rather than enthusiastic as I recall.
You see I modified reality to fit my desire to be on the winning
side, which I was not then. Hence, what we think that we thought
may bear little resemblance to reality or, more precisely, other
people’s reality. As Richard Feynman said,
“The first principle is that you must not fool yourself - and you
are the easiest person to fool.”
That said, I see the main successes of this trend as a nascent
trajectory along the lines of Big Data, Data Analytics, Business
Intelligence, Data Science, and whatever the current trendy term
is. The World of What is phenomenal – machines proposing
potential correlations that are beyond our ability to identify.
Humans consider seven plus or minus 2 variables at a time, a
rather simple model, while models, such as Machine Learning,
can consider millions or billions of variables at a time. Yet 95%
(or even 99.99999%) of the resulting correlations may be
meaningless. For example, ~99% of credit card transactions are
legitimate with less than 1% that are fraudulent, yet the 1% can
kill the profits of a bank. So precision and outlier cases, called
anomalies in science, can matter. So it pays to search for
apparently anomalous behavior – as it is happening!
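The fraud example can be made concrete with a small sketch. It assumes a synthetic batch of 1,000 transactions with exactly 1% fraud (the interview gives only approximate rates) and a hypothetical detector that labels everything legitimate. The helper names are illustrative, not from any library.

```python
def accuracy(labels, predictions):
    """Fraction of transactions whose predicted label matches the true one."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def recall(labels, predictions):
    """Fraction of truly fraudulent transactions that were flagged."""
    flagged = sum(1 for y, p in zip(labels, predictions) if y and p)
    total_fraud = sum(1 for y in labels if y)
    return flagged / total_fraud if total_fraud else 0.0


# Assumed synthetic data: True marks a fraudulent transaction.
labels = [True] * 10 + [False] * 990

# Hypothetical "detector" that never flags anything.
always_legit = [False] * len(labels)

acc = accuracy(labels, always_legit)
rec = recall(labels, always_legit)
print(f"accuracy={acc:.2%}, fraud recall={rec:.2%}")
```

The do-nothing detector looks excellent on accuracy yet catches none of the rare cases, which is why the outliers rather than the bulk of the data deserve the attention.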
We have already seen massive benefits of Big Data in the stock
market, electoral predictions, marketing success, and many more
that underlie the Big Data explosion. Yet there is a potential Big
Data Winter ahead if people blindly apply Big Data and more
specifically Machine Learning. The failures concern limited
models of phenomena and the human tendency of bias. People can
and do use What (Big Data, etc.) to support their biases and
limited models, e.g., used to support the claim of the absence of
climate change or lack of human impact on climate change, rather
than letting the data speak to suggest directions and models that
we may never have thought of. As it has always been, it takes
courage to change from a discrete world of top-down models [I
know how this works!] to an ambiguous, probabilistic world
[What possible ways does this work?].
Those are natural successes and limitations of an emerging field.
The direction, opportunities, and changes are profound. I
experience a mix of fear and tingles thinking of asking the data to
speak. Hoping that I can be open to what it says and
distinguishing s..t from Shinola.
I call the vision Computing Reality. It may be the Next
Generation of Computing.
GP: In your very insightful report of the White House-MIT
Big Data Privacy Workshop [12] you have a quote “Big data
has rendered obsolete the current approach to protecting
privacy and civil liberties”. Will people get used to much less
privacy (as the digitally-savvy younger people seem to be) or
will government regulation and/or technology be able to
protect privacy? How will this play in US vs. Europe vs. other
regions of the world?
MB: As an undergraduate at the University of Toronto, I was
extremely fortunate to have had Kelly Gotlieb, the Father of
Computing in Canada, as a mentor. I was a student in his 1971
course, Computers and Society, later to become the first book on
the topic. Kelly and the issues, including privacy, have resonated
with me throughout my career. Kelly observed that privacy, like
many other cultural norms, varies over time. So yes, privacy will
fluctuate from Alan Westin’s notion of determining how your
personal information is communicated to the Facebook-esque “Get
over it”.
While personal privacy is undergoing significant change,
disclosure of information assets that are part of the digital
economy or of government or corporate strategy may have very
significant impacts on our economy and democracy. Hence, this
raises issues of security, protection, and cultural and social issues
too complex to be treated here.
However, there are a number of very smart people looking at
various aspects. The quote you cite is from Craig Mundie [13] who
explores changes that Big Data brings debating the balancing of
economic versus privacy issues.
Very smart folks, like Butler Lampson and Mike Stonebraker, are
commenting on practical solutions to this age-old problem. Their
arguments are along the following lines. Due to the massive scale
of Big Data, and what I call Computing Reality, previously top-down
solutions for security, such as anticipating and preventing
security breaches, will simply not scale to Big Data. They must be
augmented with new approaches including bottom-up solutions
such as Stonebraker’s logging to detect and stem previously
unanticipated security breaches and Weitzner’s accountable
systems.
To beat the Heartbleed bug and others like it, “Organizations need
to be able to detect attackers and issues well after they have made
it through their gates, find them, and stop them before damage can
occur,” Gazit, a leading cyber security expert said recently. “The
only way to achieve such a laser-precision level of detection is
through the use of hyper-dimensional big data analytics,
deploying it as part of the very core of the defense mechanisms.
Big Data has rendered obsolete the current approach to
protecting privacy and civil liberties.”
Hence, Big Data requires a shift from a focus on top-down
methods of controlling data generation and collection to a focus
on data usage. Not only do top-down methods not scale, “Tightly
restricting data collection and retention could rob society of a
hugely valuable resource [13]”. Adequate let alone complete
solutions will take years to develop.
GP: What interesting technical developments do you expect in
Database and Cloud Technology in the next 5 years?
MB: I call the Big Picture Computing Reality in which we model
the world from whatever reasonable perspectives emerge from the
data and are appropriate, e.g., have veracity, and make decisions
symbiotically with machines and people collaborating to optimize
resources while achieving measures of veracity for each result.
One subspace of this world is what we currently know with high
levels of confidence, the type of information that we store in
relational databases. Another encompassing space is what we
know but forgot or don’t want to remember (unknown knowns)
and a third is what we speculate but do not know (known
unknowns), these are all the hypotheses that we make but do not
know in science, business, and life.
The rest of the data space – unknown unknowns - is infinite;
otherwise learning would be at an end. That is the space of
discovery.
I am investigating Computing Reality to investigate the entire
space with the objective of accelerating Scientific Discovery.
This is practically interesting because very little of our world is
discrete, bounded, finite, or involves a single version of truth, yet
that is the world of most computing. With Computing Reality we
hope to be far more pragmatic and realistic. This is technically
and theoretically interesting because we have almost no
mathematical or computing models in these areas. Those that exist
are just emerging or are massively complex. How cool is that?
You see what old retired guys get to do?
GP: What do you like to do in your free time? What recent
book did you like?
MB: Free time – what a concept! My yoga teacher, Lynne,
recommended that I should try to do nothing one day, and I will. I
will. Soon. Really. Life is such a blast; it’s hard to keep still.
My activities include the gym (4 times a week); hiking/climbing
~75 mountains in the USA, Nepal, Greece, Italy, France, Switzerland,
and even Australia; 42 of the 48 4,000 footers in NH (most with
Mike Stonebraker); cooking (daily and special occasions with my
son Justin, an amazing chef and brewer, when he’s not doing his
PhD), travel, and my garden; all of these – except the gym and
garden - with family and close friends.
(Michael Brodie and his son on a peak in New Hampshire)
Very cool Big Data Books:
Big Data: A Revolution That Will Transform How We Live, Work,
and Think by Viktor Mayer-Schonberger, Kenneth Cukier,
Houghton Mifflin Harcourt confused and inspired me, then
The Signal and the Noise: Why So Many Predictions Fail-but
Some Don't, by Nate Silver, Penguin Press, inspired me.
Real books:
- Ken Follett’s The Pillars of the Earth; Century Trilogy
(Fall of Giants, Winter of the World and Edge of
Eternity)
- Henning Mankell’s The Fifth Woman (A Kurt
Wallander Mystery)
GP: You just returned from Doha, Qatar where you were
advising the Qatar Computing Research Institute (QCRI), quite far
from Silicon Valley, New York, or Boston. What is
happening there and what computing research are they
doing?
MB: This was my first visit to Qatar, which was remarkable
culturally and intellectually. Culturally I saw the spectacular results of
hydrocarbon wealth and vision, e.g., amazing architecture
emerging from the desert. Intellectually I saw the beginnings of
Qatar’s National Vision 2030 to transform Qatar’s economy from
hydrocarbon-based to knowledge-based.
One step in this direction by the Qatar Foundation was to create
the Qatar Computing Research Institute (QCRI). In less than three
years QCRI has established the beginnings of a world-class
computer science research group seeded with world-class
researchers in strategically important areas such as Social
Computing, Data Analysis, Cyber Security, and Arabic Language
Technologies (e.g., Machine Learning and Translation) amongst
others. Each group already has multiple publications over several
years in the leading conferences in their areas, e.g., SIGMOD and
VLDB for Data Analysis. I spent my time reviewing with them
what I consider to be some of the most challenging issues in Big
Data.
4. REFERENCES
[1] Gregory Piatetsky, Exclusive Interview: Michael Brodie,
Leading Database Researcher, Industry Leader, Thinker, in
KDnuggets, April 2014,
http://www.kdnuggets.com/2014/04/michael-brodie-databaseresearcher-leader-thinker.html
[2] Gregory Piatetsky, Interview (part 2): Michael Brodie on Data
Curation, Cloud Computing, Startup Quality, Verizon, in
KDnuggets, May 2014,
http://www.kdnuggets.com/2014/04/interview-michael-brodie-2data-curation-cloud-computing-verizon.html
[3] Gregory Piatetsky, Interview (part 3): Michael Brodie on
Industry Lessons, Knowledge Discovery, and Future Trends, in
KDnuggets, May 2014,
http://www.kdnuggets.com/2014/05/interview-michael-brodie-3industry-lessons-knowledge-discovery-trends-qcri.html
[4] www.michaelbrodie.com/michael_brodie.asp
[5] Michael Stonebraker. Are We Polishing a Round Ball? Panel
Abstract. ICDE, page 606. IEEE Computer Society, 1993
[6] Database Engines List, http://db-engines.com/en/blog_post/23
[7] Gregory Piatetsky, Exclusive: Tamr at the New Frontier of Big
Data Curation, KDnuggets, May 2014,
http://www.kdnuggets.com/2014/05/tamr-new-frontier-big-datacuration.html
[8] Stonebraker et al, Data Curation at Scale: The Data Tamer
System In CIDR 2013 (Conference on Innovative Data Systems
Research).
[9] http://jisto.com/
[10] R. Field, "Disciplined Entrepreneurship: 24 Steps to a
Successful Startup by Bill Aulet", Journal of Business & Finance
Librarianship, vol. 19, no. 1, pp. 83-86, Jan. 2014.
[11] M. Brodie and M. Stonebraker. Legacy Information Systems
Migration: The Incremental Strategy, Morgan Kaufmann
Publishers, San Francisco, CA (1995) ISBN 1-55860-330-1
[12] M. Brodie, White House-MIT Big Data Privacy Workshop
Report, KDnuggets, Mar 27, 2014.
http://www.kdnuggets.com/2014/03/white-house-mit-big-dataprivacy-workshop-report.html
[13] Craig Mundie, Privacy Pragmatism: Focus on Data Use, Not
Data Collection, Foreign Affairs, March/April 2014.
[14] The Digital Universe of Opportunities: Rich Data and the
Increasing Value of the Internet of Things. IDC/EMC, April 2014.
About the authors:
Gregory Piatetsky is a Data Scientist, co-founder of KDD
conferences and ACM SIGKDD society for Knowledge
Discovery and Data Mining, and President and Editor of
KDnuggets. He tweets about Analytics, Big Data, Data Science,
and Data Mining at @kdnuggets.