Journal of Emerging Technologies in Web Intelligence
ISSN 1798-0461
Volume 4, Number 3, August 2012

Contents

Special Issue: Web Data Mining
Guest Editor: Richard Khoury

Guest Editorial
Richard Khoury  205

SPECIAL ISSUE PAPERS

Query Classification using Wikipedia's Category Graph
Milad AlemZadeh, Richard Khoury, and Fakhri Karray  207

Towards Identifying Personalized Twitter Trending Topics using the Twitter Client RSS Feeds
Jinan Fiaidhi, Sabah Mohammed, and Aminul Islam  221

Architecture of a Cloud-Based Social Networking News Site
Jeff Luo, Jon Kivinen, Joshua Malo, and Richard Khoury  227

Analyzing Temporal Query for Improving Web Search
Rim Faiz  234

Trend Recalling Algorithm for Automated Online Trading in Stock Market
Simon Fong, Jackie Tai, and Pit Pichappan  240

A Novel Method of Significant Words Identification in Text Summarization
Maryam Kiabod, Mohammad Naderi Dehkordi, and Mehran Sharafi  252

Attribute Overlap Minimization and Outlier Elimination as Dimensionality Reduction Techniques for Text Classification Algorithms
Simon Fong and Antonio Cerone  259

New Metrics between Bodies of Evidences
Pascal Djiknavorian, Dominic Grenier, and Pierre Valin  264

REGULAR PAPERS

Bringing location to IP Addresses with IP Geolocation
Jamie Taylor, Joseph Devlin, and Kevin Curran  273

On the Network Characteristics of the Google's Suggest Service
Zakaria Al-Qudah, Mohammed Halloush, Hussein R. Alzoubi, and Osama Al-kofahi  278

RISING SCHOLAR PAPERS

Review of Web Personalization
Zeeshan Khawar Malik and Colin Fyfe  285

SHORT PAPERS

An International Comparison on Need Analysis of Web Counseling System Design
Takaaki Goto, Chieko Kato, Futoshi Sugimoto, and Kensei Tsuchida  297

Special Issue: Web Data Mining
Guest Editorial
Richard Khoury
Department of Software Engineering, Lakehead University, Thunder Bay, Canada
Email: rkhoury@lakeheadu.ca

The Internet is a massive and continuously growing source of data and information on subjects ranging from breaking news to personal anecdotes to objective documentation. It has proven to be a hugely beneficial resource for researchers, giving them a source of free, up-to-date, real-world data that can be used in a wide range of projects and applications. Hundreds of new algorithms and systems are proposed each year to filter out desired information from this seemingly endless amount of data, to clean and organize it, to infer knowledge from it, and to act on this knowledge.

The importance of the web in scientific research today can be informally gauged by counting the number of published papers that use the name of a major website in their titles, abstracts, or keyword lists. To illustrate, we gathered these statistics using the IEEE Xplore search system for eight well-known websites for each of the past 10 years. The results of this survey, presented in Figure 1, indicate that research interest in web data is increasing steadily. The individual websites naturally experience increases and decreases in popularity; for example, we can see Google overtake Yahoo! around 2007. In 2011, the three most cited websites in our sample of the scientific literature were Google, Facebook and Twitter.

Figure 1. Number of publications per year that use the name of a major website.

This ranking is similar to, but not identical to, the real-world popularity of these websites as measured by the website ranking site Alexa.
Alexa's ranking does put Google and Facebook in first and second place respectively, but ranks Twitter ninth, below YouTube, Yahoo!, Baidu and Wikipedia. There is nonetheless a good similarity between the rankings of Figure 1 and those of Alexa, which is not unexpected: to be useful for scientific research a site needs to contain a lot of data, which means that it must be visited and contributed to by a lot of users, which in turn means a high Alexa rating.

This special issue is thus dedicated to the topic of Web Data Mining. We attempted to compile papers that touch upon both a variety of websites and a variety of data mining challenges. Clearly it would be impossible to create a representative sample of all data mining tasks and all websites in use in the literature. However, after thorough peer review and careful deliberation, we have selected the following papers as good examples of the range of web data mining challenges being addressed today.

In our first paper, "Query Classification Using Wikipedia's Category Graph", the authors Milad AlemZadeh, Richard Khoury and Fakhri Karray perform data mining on Wikipedia, one of the more popular websites both in the literature and with the public. The task they focused on is query classification, or the challenge of determining the topic intended by a query given only the words of that query. This is a challenge with broad applicability, from web search engines to question-answering systems. Their work demonstrates how a system exploiting web data can perform this task with virtually no domain restrictions, making it very appealing for applications that need to interact with human beings in any setting whatsoever.

Next, we move from Wikipedia to Twitter, a website whose popularity we discussed earlier. In "Towards Identifying Personalized Twitter Trending Topics using the Twitter Client RSS Feeds", Jinan Fiaidhi, Sabah Mohammed, and Aminul Islam take on the challenge of mining the massive, real-time stream of tweets for interesting trending topics. Moreover, the notion of what is interesting can be personalized for each user based not only on the tweets' vocabulary, but also on the user's personal details and geographical location. Their paper thus defines the first true Twitter stream personalization system.

Staying on the topic of defining innovative new systems, in "Architecture of a Cloud-Based Social Networking News Site", three undergraduate engineering students, Jeff Luo, Jon Kivinen, and Joshua Malo, give us a tour of a social networking platform they developed. Their work presents a new perspective on web data mining from social networks, in which every aspect of the social network is under the control of the researchers, from the type of information users can put up to the underlying cloud architecture itself.

In "Analyzing Temporal Queries for Improving Web Search", Rim Faiz brings us back to the topic of web query understanding. Her work focuses on the challenge of adding a temporal understanding component to web search systems. Mining temporal information in this way can help improve search engines by making it possible to correctly interpret queries that are dependent on temporal context. And indeed, her enhanced method shows a promising increase in accuracy.
In "Trend Recalling Algorithm for Automated Online Trading in Stock Market", Simon Fong, Jackie Tai, and Pit Pichappan exploit another source of web data: the online stock market. This data source is one that is often overlooked (only 75 references for the entire 2002-2011 period we studied in Figure 1), but one of unquestionable importance today. This paper shows how this data can be mined for trends, which can then be used to successfully guide trading decisions. Specifically, by matching the current trend to past trends and recalling the trading strategies that worked in the past, the system can adapt its behaviour and greatly increase its profits.

In a further example of both the variety of web data and of data mining tasks, in "A Novel Method of Significant Words Identification in Text Summarization", Maryam Kiabod, Mohammad Naderi Dehkordi and Mehran Sharafi mine a database of web newswire in order to train a neural network to mimic the behaviour of a human reader. This neural network underlies the ability of their system to pick out the important keywords and key sentences that summarize a document. Once trained on the web data set, the system works quite well, and in fact outperforms commercially available summarization tools.

Our final two papers take a wider view of the challenge of web data mining, and focus on the mining process itself. In our penultimate paper, "Attribute Overlap Minimization and Outlier Elimination as Dimensionality Reduction Techniques for Text Classification Algorithms", Simon Fong and Antonio Cerone note that the massive and increasing volume of online documents, and the corresponding increase in the number of features to be handled to represent them, is becoming a problem for web mining algorithms, and especially for real-time algorithms. They thus explore the challenge of feature reduction in web documents. Their experiments, conducted on Wikipedia articles in multiple languages and on CNN.com news articles, demonstrate not only the possibility but also the benefits of dimensionality reduction of web data.

Our final paper, "New Metrics between Bodies of Evidences" by Pascal Djiknavorian, Dominic Grenier, and Pierre Valin, presents a higher-level theoretical perspective on web data mining. They propose new metrics to compare and evaluate evidence and uncertainty in the context of the Dempster-Shafer theory. Their work introduces fundamental theoretical advances of consequence for all information retrieval applications. It could be useful, for example, for a new generation of web search systems that can pinpoint relevant information in a web page, rather than consider the page as a whole. It could also be useful to handle the uncertainty incurred when combining information from multiple heterogeneous web data sources.

Richard Khoury received his Bachelor's Degree and his Master's Degree in Electrical and Computer Engineering from Laval University (Québec City, QC) in 2002 and 2004 respectively, and his Doctorate in Electrical and Computer Engineering from the University of Waterloo (Waterloo, ON) in 2007. Since August 2008, he has been an Assistant Professor, tenure track, in the Department of Software Engineering at Lakehead University. Dr. Khoury has published 20 papers in international journals and conferences, and has served on the organization committee of three major conferences.
His primary area of research is natural language processing, but his research interests also include data mining, knowledge management, machine learning, and intelligent systems.

Query Classification using Wikipedia's Category Graph

Milad AlemZadeh
Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, Ontario, Canada
Email: malemzad@uwaterloo.ca

Richard Khoury
Department of Software Engineering, Lakehead University, Thunder Bay, Ontario, Canada
Email: richard.khoury@lakeheadu.ca

Fakhri Karray
Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, Ontario, Canada
Email: karray@uwaterloo.ca

Abstract— Wikipedia's category graph is a network of 300,000 interconnected category labels, and it can be a powerful resource for many classification tasks. However, its size and lack of order can make it difficult to navigate. In this paper, we present a new algorithm to efficiently exploit this graph and accurately rank classification labels given user-specified keywords. We highlight multiple possible variations of this algorithm, and study the impact of these variations on the classification results in order to determine the optimal way to exploit the category graph. We implement our algorithm as the core of a query classification system and demonstrate its reliability using the KDD CUP 2005 and TREC 2007 competitions as benchmarks.

Index Terms— Keyword search, Natural language processing, Knowledge based systems, Web sites, Semantic Web

I. INTRODUCTION

Query classification is the task of Natural Language Processing (NLP) whose goal is to identify the category label, in a predefined set, that best represents the domain of a question being asked. An accurate query classification system would be beneficial in many practical systems, including search engines and question-answering systems. Query classification shares some similarities with other categorization tasks in NLP, and with document classification in particular. However, the challenge of query classification is accentuated by the fact that a typical query is only between one and four words long [1], [2], rather than the hundreds or thousands of words one can get from an average text document. Such a limited number of keywords makes it difficult to select the correct category label, and moreover it makes the selection very sensitive to "noise words", or words unrelated to the query that the user entered for some reason, such as not remembering the correct name or technical term to query for. A second challenge of query classification comes from the fact that, while document libraries and databases can be specialized to a single domain, the users of query systems expect to be able to ask queries about any domain at all [1].

This paper continues our work on query classification using the Wikipedia category graph [3], [4]. It refines and expands on our previous work by studying multiple different design alternatives that similar classification systems could opt for, and considers the impact of each one. In contrast with our previous papers, the focus here is not on presenting a single classification system, but on implementing and comparing multiple systems that differ on critical points. The rest of the paper is organized as follows.
Section 2 presents an overview of the literature in the field of query classification, with a special focus on the use of Wikipedia for that task. We present in detail our ranking and classification algorithm in Section 3, and take care to highlight the points where we considered different design options. Each of these options was implemented and tested, and in Section 4 we describe and analyze the experimental results we obtained with each variation of our system. Finally, we give some concluding remarks in Section 5.

II. BACKGROUND

Query classification is the task of NLP that focuses on inferring the domain information surrounding user-written queries, and on assigning to each query the best category label from a predefined set. Given the ubiquity of search engines and question-handling systems today, this challenge has been receiving a growing amount of attention. For example, it was the topic of the ACM's annual KDD CUP competition in 2005 [5], where 37 systems competed to classify a set of 800,000 real web queries into a set of 67 categories designed to cover most topics found on the internet. The winning system was designed to classify a query by comparing its word vector to that of each website in a set pre-classified in the Google directory. The query was assigned the category of the most similar website, and the directory's set of categories was mapped to the KDD CUP's set [2]. This system was later improved by introducing a bridging classifier and an intermediate-level category taxonomy [6].

Most query classifiers in the literature, like the system described above, are based on the idea of mapping the queries into an external knowledge source (an objective third-party knowledge base) or an internal knowledge source (user-specific information) to classify them. This simple idea leads to a great variety of classification systems. Using an internal knowledge source, Cao et al. [7] developed a query classifier that disambiguates the queries based on the context of the user's recent online history. On the other hand, many very different external knowledge sources have been used in practice, including ontologies [8], websites [9], web query logs [10], and Wikipedia [4], [11], [12].

Exploiting Wikipedia as a knowledge source has become commonplace in scientific research. Several hundred journal and conference papers have been published using this tool since its creation in 2001. However, while both query classification and NLP using Wikipedia are common challenges, to the best of our knowledge there have been only three query classification systems based on Wikipedia. The first of these three systems was proposed by Hu et al. [11]. Their system begins with a set of seed concepts to recognize, and it retrieves the Wikipedia articles and categories relevant to these concepts. It then builds a domain graph by following the links in these articles using a Markov random walk algorithm. Each step from one concept to the next on the graph is assigned a transition probability, and these probabilities are then used to compute the likelihood of each domain. Once the knowledge base has been built in this way, a new user query can be classified simply by using its keywords to retrieve a list of relevant Wikipedia domains, and sorting them by likelihood. Unfortunately, their system remained small-scale and limited to only three basic domains, namely "travel", "personal name" and "job".
It is not a general-domain classifier such as the one we aim to create. The second query classification system was designed by one of our co-authors in [12]. It follows Wikipedia's encyclopedia structure to classify queries step by step, using the query's words to select titles, then selecting articles based on these titles, then categories from the articles. At each step, the weights of the selected elements are computed based on the relevant elements in the previous step: a title's weight depends on the words that selected it, an article's weight on the titles', and a category's weight on the articles'. Unlike [11], this system was a general classifier that could handle queries from any domain, and its performance would have ranked near the top of the KDD CUP 2005 competition.

The last query classification system is our own previous work, described in [4]. It is also a general classifier, but it differs fundamentally from [12]. Instead of using titles and articles to pinpoint the categories in which to classify a query as was done in [12], the classifier of [4] used titles only to create a set of inexact initial categories for the query, and then explored the category graph to discover the best goal categories from a set of predetermined valid classification goals. This classifier also differs from the one described in this work on a number of points, including the equations used to weight and rank categories and the mapping of the classification goals. But the most fundamental difference is the use in this paper of pre-computed base-goal category distances instead of an exploration algorithm. As we will show in this paper, all these modifications are justified both from a theoretical standpoint and practically by improvements in the experimental results.

While using Wikipedia for query classification has not been a common task, there have been several document classification projects done using that resource which are worth mentioning. Schönhofen [13] successfully developed a complete document classifier using Wikipedia, by mapping the document's vocabulary to titles, articles, and finally categories, and weighting the mapping at each step. In fact, we used some of the mapping techniques he developed in one of our previous works [12]. Alternatively, other authors use Wikipedia to enrich existing text classifiers by improving upon the simple bag-of-words approach. The authors of [14] use it to build a kernel to map the document's words to the Wikipedia article space and classify there, while the authors of [15] and [16] use it for text enrichment, to expand the vocabulary of the text by adding relevant synonyms taken from Wikipedia titles. Interestingly, improvements are reported in the classification results of [13], [15] and [16], while only [14] reports worse results than the bag-of-words method. The conclusion seems to be that working in the word space is the better option, a conclusion that [14] also shares. Likewise, that is the approach we used in the system we present in this paper.

III. ALGORITHM

Wikipedia's category graph is a massive set of almost 300,000 category labels, describing every domain of knowledge and ranging from the very precise, such as "fictional secret agents and spies", to the very general, such as "information". The categories are connected by hypernym relationships, with a child category having an "is-a" relationship to its parents.
However, the graph is not strictly hierarchical: there exist shortcuts in the connections (i.e. starting from one child category and going up two different paths of different lengths to reach the same parent category) as well as loops (i.e. starting from one child category and going up a path to reach the same child category again). The query classification algorithm we propose in this paper is designed to exploit this graph structure. As we will show in this section, it is a three-stage algorithm, with a lot of flexibility possible within each step. The first stage is a pre-processing stage, during which the category graph is built and critical application-specific information is determined. This stage needs to be done only once to create the system, by contrast with the next two stages that are executed for each submitted query. In the second stage, a user's query is mapped to a set of base categories, and these base categories are weighted and ranked. And finally, the algorithm explores the graph starting from the base categories and going towards the nearest goal categories in stage 3. The pseudocode of our new algorithm is shown in Figure 1.

Input: Wikipedia database dump
1. CG ← the Category Graph extracted from Wikipedia
2. Associate to each category in CG the list of all titles pointing to it
3. GC ← the set of Goal Categories identified in CG
4. Dist(GC,CG) ← the shortest-path distance between every GC and all categories in CG

Input: User query, CG
5. KL ← Keyword List of all keywords in the user query
6. TL ← Title List of all titles in CG featuring at least one word in KL
7. KTW ← Keyword-Title Weight, a list containing the weight of a keyword from KL featured in a title from TL
8. BC ← Base Categories, all categories in CG pointed to by TL
9. CD ← Category Density for all BC computed from the KTW
10. BC ← top BC ranked by CD

Input: GC, Dist(GC,BC), CD
11. GS ← Goal Score of each GC, computed based on their distance to each BC and on CD
12. Return: top 3 GC ranked by GS

Figure 1. Structure of the three steps of our classification algorithm: the pre-processing step (top), the base category evaluation (middle), and the exploration for the goal categories (bottom).

A. Stage 1: Pre-Processing the Category Graph

We begin the first stage of our algorithm by extracting the list of categories in Wikipedia and the connections between categories from the database dump made freely available by the Wikimedia Foundation. For this project, we used the version available from September 2008. Furthermore, our graph includes one extra piece of information in addition to the categories, namely the article titles. In Wikipedia, each article is an encyclopedic entry on a given topic which is classified in a set of categories, and which is pointed to by a number of titles: a single main title, some redirect titles (for common alternative names, including foreign translations and typos) and some disambiguation titles (for ambiguous names that may refer to it). For example, the article for the United States is under the main title "United States", as well as the redirect titles "USA", "United States of America" and "United Staets" (common typo redirection), and the disambiguation title "America". Our pre-processing deletes stopwords and punctuation from the titles, then maps them directly to the categories of the articles and discards the articles.
After this processing, we find that our category graph features 5,453,808 titles and 282,271 categories.

The next step in constructing the category graph is to define a set of goal categories that are acceptable classification labels. The exact number and nature of these goal categories will be application-specific. However, the set of Wikipedia category labels is large enough to cover numerous domains at many levels of precision, which means that it will be easy for system designers to identify a subset of relevant categories for their applications, or to map an existing category set to Wikipedia categories.

The final pre-processing step is to define, compute and store the distance between the goal categories and every category in the graph. This distance between two categories is the number of intermediate categories that must be visited on the shortest path between them. We allow any path between two categories, regardless of whether it goes up to parent categories or down to children categories or zigzags through the graph. This stands in contrast with our previous work [4], where we only allowed paths going from child to parent category. The reason for adopting this more permissive approach is to make our classifier more general: the parent-only approach may work well in the case of [4], where all the goal categories selected were higher in the hierarchy than the average base category, but it would fail miserably in the opposite scenario, where the base categories are parents of the goal categories. When searching for the shortest paths, we can avoid the graph problems we mentioned previously, of multiple paths and loops between categories, by only saving the first encounter of a category and by terminating paths that revisit categories. Finally, we can note that, while exploring the graph to find the shortest distance between every goal category and all other categories may seem like a daunting task, for a set of about 100 goal categories such as we used in our experiments it can be done in only a few minutes on a regular computer.
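As a rough illustration of the distance pre-computation in step 4 of Figure 1, the sketch below runs a breadth-first search from each goal category. It is our own sketch, not the authors' implementation; `category_graph` is an assumed adjacency dictionary that lists each category's parents and children together, matching the paper's choice to ignore edge direction.

```python
from collections import deque

def goal_distances(category_graph, goal_categories):
    """Breadth-first search from each goal category. Saving only the first
    encounter of a category and never revisiting one yields the shortest-path
    distance while side-stepping the shortcuts and loops of the graph."""
    distances = {}
    for goal in goal_categories:
        dist = {goal: 0}
        queue = deque([goal])
        while queue:
            current = queue.popleft()
            for neighbour in category_graph.get(current, ()):
                if neighbour not in dist:          # first encounter = shortest path
                    dist[neighbour] = dist[current] + 1
                    queue.append(neighbour)
        distances[goal] = dist                     # distance from `goal` to every reachable category
    return distances
```

Since each goal category requires only one traversal of the graph, a table of this kind for roughly 100 goal categories stays well within the few minutes of computation mentioned above.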
B. Stage 2: Discovering the Base Categories

The second stage of our algorithm, as shown in Figure 1, is to map the user's query to an initial set of weighted base categories. This is accomplished by stripping the query of stopwords to keep only relevant keywords, and then generating the exhaustive list of titles that feature at least one of these keywords. Next, the algorithm considers each title t and determines the weight Wt of the keywords it contains. This weight is computed based on two parameters: the number of keywords featured in the title (Nk), and the proportional importance of keywords in the title (Pk). The form of the weight equation is given in equation (1).

W_t = N_k P_k    (1)

The number of keywords featured in the title is a simple and unambiguous measure. The proportional importance is open to interpretation, however. In this research, we considered three different measures of importance. The first is simply the proportion of keywords in the title (N_k / N_t, where N_t is the total number of words in title t). The second is the proportion of characters in the title that belong to keywords (C_k / C_t, where C_k is the number of characters of the keywords featured in the title and C_t is the total number of characters in title t). This metric assumes that longer keywords are more important; in the context of queries, which are only a few words long [1], [2], it may be true that more emphasis was meant by the user on the longest, most evident word in the query. The final measure of proportional importance is based on the words' inverted frequencies. It is computed as the sum of the inverted frequencies of the keywords in the title over the sum of the inverted frequencies of all title words (ΣF_k / ΣF_t), where the inverted frequency of a word w is computed as:

F_w = ln(T / T_w)    (2)

In equation (2), T is the total number of titles in our category graph and T_w is the number of titles featuring word w. It is, in essence, the IDF part of the classic term frequency-inverse document frequency (TFIDF) equation, (N_w / N) ln(T / T_w), where N_w is the number of instances of word w in a specific title (or more generally, a document) and N is the total number of words in that title. The TF part (N_w / N) is ignored because it does not give a reliable result when dealing with short titles that only feature each word once or twice. We have used this metric successfully in the past in another classifier we designed [12].

We can see from equation (1) that every keyword appearing in a title will receive the same weight W_t. Moreover, when a title is composed exclusively of query keywords, their weight will be the number of keywords contained in the title. The maximum weight a keyword can have is thus equal to the number of keywords in the query; it occurs in the case where a title is composed of all query keywords and nothing else.

Next, our algorithm builds a set of base categories by listing exhaustively all categories pointed to by the list of titles. This set of base categories can be seen as an initial coarse classification for the query. These base categories are each assigned a density value. A category's density value is computed by determining the maximum weight each query keyword takes in the list of titles that point to that category, then summing the weights of all keywords, as shown in equation (3).

D_i = Σ_k max_t(W_t^{k,i})    (3)

In that equation, D_i is the density of category i, and W_t^{k,i} refers to the weight W_t of a title t that contains keyword k and points to category i. Following our discussion on equation (1), we can see that the maximum density a category can have is the square of the number of query keywords. It happens in the case where each keyword has its maximum value in that category, meaning that one of the titles pointing to the category is composed of exactly the query words.

At the end of this stage of the algorithm, we have a weighted list of base categories, featuring some categories pointed to by high-weight words and summing to a high density score, and a lot of categories pointed to by only lower-weight words and having a lower score. In our experiments, we found that the set contains over 3,000 base categories on average. We limit the size of this list by keeping only the set of highest-density categories, as categories with a density too low are deemed to be too unrelated to the original query to be of use. This can be done either on a density basis (i.e. keeping categories whose density is more than a certain proportion of the highest density obtained for this query, regardless of the number of categories this represents, as we did in [4]) or on a set-size basis (i.e. keeping a fixed number of categories regardless of their density, the approach we will prioritize in this paper). When using the set-size approach, a question arises on how to deal with ties when the number of tied categories exceeds the size of the set to return. In our system, we break ties by keeping a count of the number of titles that feature keywords and that point to each category, and giving priority to the categories pointed to by more titles.
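A minimal sketch of this second stage, under assumed data structures, could look as follows. The hypothetical `titles` dictionary maps each pre-processed title to the categories of its article; the sketch uses the keyword-proportion measure Wt = Nk * (Nk / Nt) for equation (1) and the density of equation (3), and breaks ties by the number of supporting titles as described above. It is our illustration, not the authors' code.

```python
def stage2_base_categories(query_keywords, titles, set_size=25):
    """Weight matching titles with W_t = N_k * (N_k / N_t), accumulate the
    per-keyword maxima of equation (3) into a category density, and keep the
    `set_size` densest base categories, breaking ties by title support."""
    keywords = {w.lower() for w in query_keywords}
    max_weight = {}   # (category, keyword) -> best title weight seen so far
    support = {}      # category -> number of matching titles (tie-breaker)

    for title, categories in titles.items():
        words = title.lower().split()
        matched = keywords.intersection(words)
        if not matched:
            continue
        n_k, n_t = len(matched), len(words)
        w_t = n_k * (n_k / n_t)                       # equation (1) with P_k = N_k / N_t
        for cat in categories:
            support[cat] = support.get(cat, 0) + 1
            for kw in matched:
                if w_t > max_weight.get((cat, kw), 0.0):
                    max_weight[(cat, kw)] = w_t

    density = {}                                       # equation (3): sum of per-keyword maxima
    for (cat, _kw), w in max_weight.items():
        density[cat] = density.get(cat, 0.0) + w

    ranked = sorted(density, key=lambda c: (density[c], support[c]), reverse=True)
    return [(c, density[c]) for c in ranked[:set_size]]
```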
C. Stage 3: Ranking the Goal Categories

Once the list of base categories is available, the third and final stage of the algorithm is to determine which of the goal categories identified in the first stage are the best classification labels for the query. As we outlined in the pseudocode of Figure 1, our system does this by ranking the goal categories based on their shortest-path distance to the selected base categories. There are of course other options that have been considered in the literature. For example, Coursey and Mihalcea [17] proposed an alternative metric based on graph centrality, while Syed et al. [18] developed a spreading activation scheme to discover related concepts in a set of documents. Some of these ideas could be adapted into our method in future research.

However, even after settling on the shortest-path distance metric, there are many ways we could take the base categories' densities into account in the goal categories' ranking. The simplest option is to use the density as a threshold value: to cut off base categories that have a density lower than a certain value, and then rank the goal categories according to which are closest to any remaining base category, regardless of density. That is the approach we used in [4]. On the other hand, taking the density into account creates different conditions for the system. Since some base categories are now more important than others, it becomes acceptable, for example, to rank a goal that is further away from several high-density base categories higher than a goal that is closer to a low-density base category. We thus define a ranking score for the goal categories as the sum, over all base categories, of a ratio of their density to the distance separating the goal and base. There are several ways to compute this ratio; five options that we considered in this study are:

S_j = Σ_i D_i / (dist(i,j) + 0.0001)    (4)
S_j = Σ_i D_i / (dist(i,j)^2 + 0.0001)    (5)
S_j = Σ_i D_i e^{-dist(i,j)}    (6)
S_j = Σ_i D_i e^{-2 dist(i,j)}    (7)
S_j = Σ_i D_i e^{-dist(i,j)^2}    (8)

In each of these equations, the score S_j of goal category j is computed as the sum, for all base categories i, of the density D_i of that category, which was computed in equation (3), divided by a function of the distance between categories i and j. This function is a simple division in equations (4) and (5), but the exponentials in equations (6)-(8) put progressively more importance on the distance compared to the density. The addition of 0.0001 in equations (4) and (5) is simply to avoid a division by zero in the case where a selected base category is also a goal category. Finally, the goal categories with the highest score are returned as classification results. In our current version of the system, we return the top three categories, to allow for queries to belong to several different categories. We believe that this corresponds to a human level of categorization; for example, in the KDD CUP 2005 competition [5], human labelers used on average 3.3 categories per query. However, this parameter is flexible, and we ran experiments keeping anywhere from one to five goal categories.
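A sketch of this third stage is given below, reusing the hypothetical structures of the earlier sketches (`base_categories` as (category, density) pairs and the precomputed `goal_distances` table). The five scoring options mirror equations (4)-(8), with equation (5) as the default since that is the variant retained later in Section IV; this is our illustration rather than the paper's implementation.

```python
import math

def stage3_rank_goals(base_categories, goal_distances, scoring="eq5", top_n=3):
    """Rank goal categories by summing, over the retained base categories,
    a function of base density and base-to-goal distance (equations 4-8)."""
    def contribution(density, d):
        if scoring == "eq4":
            return density / (d + 0.0001)
        if scoring == "eq5":
            return density / (d * d + 0.0001)
        if scoring == "eq6":
            return density * math.exp(-d)
        if scoring == "eq7":
            return density * math.exp(-2 * d)
        return density * math.exp(-d * d)              # eq8

    scores = {}
    for goal, dist in goal_distances.items():
        total = 0.0
        for category, density in base_categories:
            if category in dist:                       # unreachable bases contribute nothing
                total += contribution(density, dist[category])
        scores[goal] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```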
IV. EXPERIMENTAL RESULTS

The various alternatives and options for our classifier described in the previous section were all implemented and tested, in order to study the behavior of the system and determine the optimal combination. That optimal combination was then subjected to a final set of tests with new data. In order to compare and study the variations of our system, we submitted them all to the same challenge as the KDD CUP 2005 competition [5]. The 37 solutions entered in that competition were evaluated by classifying a set of 800 queries into up to five categories from a predefined set of 67 target categories c_j and comparing the results to the classification done by three human labelers. The solutions were ranked based on overall precision and overall F1 value, as computed by equations (9)-(14). The competition's Performance Award was given to the system with the top overall F1 value, and the Precision Award was given to the system with the top overall precision value within the top 10 systems evaluated on overall F1 value. Overall Recall was not used in the competition, but is included here because it is useful in our experiments.

Precision = Σ_j (queries correctly labeled as c_j) / Σ_j (queries labeled as c_j)    (9)
Recall = Σ_j (queries correctly labeled as c_j) / Σ_j (queries belonging to c_j)    (10)
F1 = 2 × Precision × Recall / (Precision + Recall)    (11)
Overall Precision = (1/3) Σ_{L=1}^{3} (Precision against labeler L)    (12)
Overall Recall = (1/3) Σ_{L=1}^{3} (Recall against labeler L)    (13)
Overall F1 = (1/3) Σ_{L=1}^{3} (F1 against labeler L)    (14)

In order to compare our system to the KDD CUP competition results, we need to use the same set of category labels. As we mentioned in Section 3, the size and level of detail of Wikipedia's category graph make it possible to identify categories to map most sets of labels to. In our case, we identified 99 goal categories in Wikipedia corresponding to the 67 KDD CUP categories. These correspondences are presented in Appendix A.
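For reference, the overall scores of equations (9)-(14) against the three labelers can be computed along the following lines. The structures `predictions` and `labelers` are hypothetical stand-ins for the KDD CUP data, and the sketch is ours rather than the competition's evaluation code.

```python
def evaluate(predictions, labelers):
    """`predictions` maps a query to the set of categories returned by the
    classifier; `labelers` is a list of three dicts mapping each query to the
    set of categories assigned by one human labeler."""
    def against(labels):
        returned = correct = relevant = 0
        for query, predicted in predictions.items():
            truth = labels.get(query, set())
            returned += len(predicted)
            relevant += len(truth)
            correct += len(predicted & truth)
        precision = correct / returned if returned else 0.0      # equation (9)
        recall = correct / relevant if relevant else 0.0         # equation (10)
        f1 = (2 * precision * recall / (precision + recall)      # equation (11)
              if precision + recall else 0.0)
        return precision, recall, f1

    per_labeler = [against(labels) for labels in labelers]
    overall = [sum(values) / len(per_labeler) for values in zip(*per_labeler)]
    return {"precision": overall[0], "recall": overall[1], "f1": overall[2]}  # equations (12)-(14)
```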
A. Proportional Importance of Keywords

The first aspect of the system we studied is the different formulae for the proportional importance of query keywords in a title. As we explained in Section IIIB, the choice of formula has a direct impact on the system, as it determines which titles are more relevant given the user's query. This in turn determines the relevance of the base categories that lead to the goal categories. A bad choice at this stage can have an impact on the rest of the system. The weight of a title, and of the query keywords it contains, is a function of the two parameters presented in equation (1), namely the number of keywords present in the title and the importance of those keywords in that title. Section IIIB gives three possible mathematical definitions of keyword importance in a title. They are a straightforward proportion of keywords in the title, the proportion of characters in the title that belong to keywords, and the proportion of IDF of keywords to the total IDF of the title, as computed with equation (2). We implemented all three equations and tested the system independently using each. In all implementations, we limited the list of base categories to 25, weighted the goal categories using equation (5), and varied the number of returned goal categories from 1 to 5. The results of these experiments are presented in Figure 2.

Figure 2. Overall precision (dashed line), recall (dotted line) and F1 (solid line) using Nk*(Ck/Ct) (dark squares), Nk*(Nk/Nt) (medium triangles), and Nk*(ΣFk/ΣFt) (light circles).

The three different experiments are shown with different grey shades and markers: dark squares for the formula using the proportion of characters, medium triangles for the formula using the proportion of words, and light circles for the formula using the proportion of IDF. Three results are also shown for each experiment: the overall precision computed using equation (12) in a dashed line, the overall recall of equation (13) in a dotted line, and the overall F1 of equation (14) in a solid line.

A few observations can be made from Figure 2. The first is that the overall result curves of all three variations have the same shape. This means that the system behaves in a very consistent way regardless of the exact formula used. There is no point where the results given one equation shoot off into a wildly different range of values from the other two equations. Moreover, while the exact difference in the results between the three equations varies, there is no point where they switch and one equation goes from giving worse results than another to giving better results. We can also see that the precision decreases and the recall increases as we increase the number of acceptable goal categories. This result was to be expected: increasing the number of categories returned in the results means that each query is classified in more categories, leading to more correct classifications (which increase recall) and more incorrect classifications (which decrease precision).

Finally, we can note that the best equation for the proportional importance of keywords in titles is consistently the proportion of keywords (Nk / Nt), followed closely by the proportion of characters (Ck / Ct), while the proportion of IDF (ΣFk / ΣFt) trails in third position. It is surprising that the IDF measure gives the worst results of the three, when it worked well in other projects [12]. However, the IDF measure is based on a simple assumption, that a word with low semantic importance is one that is used commonly in most documents of the corpus. In our current system, however, the "documents" are article titles, which are by design short, limited to important keywords, and stripped of semantically irrelevant words. These are clearly in contradiction with the assumptions that underlie the IDF measure. We can see this clearly when we compare the statistics of the keywords given in the example in [12] with the same keywords in our system, as we do in Table I.

TABLE I. COMPARISON OF IDF OF SAMPLE KEYWORDS
Keyword        | Tw*    | Fw* | Tw    | Fw
WWE            | 2,705  | 7.8 | 657   | 9.0
Chief          | 83,977 | 5.6 | 1,695 | 8.1
Executive      | 82,976 | 5.8 | 867   | 8.7
Chairman       | 40,241 | 7.2 | 233   | 10.1
Headquartered  | 38,749 | 7.1 | 10    | 13.2
(* Columns 2 and 3 are taken from [12].)

The system in [12] computed its statistics from the entire Wikipedia corpus, including article text, and thus computed reliable statistics; in the example in Table I the rarely-used company name WWE is found much more significant than the common corporate nouns chief, executive, chairman and headquartered. On the other hand, in our system WWE is used in almost as many titles as executive and has a comparable Fw score, which is dwarfed by the Fw scores of chairman and headquartered, two common words that are very rarely used in article titles.
Finally, we can wonder if the two parts of equation (1) are really necessary, especially since the best equation we found for proportional importance repeats the Nk term. To explore that question, we ran the same test again using each part of the equation separately. Figure 3 plots the overall F1 using Nk alone in a light dotted line with circle markers, using Nk / Nt in a black dashed line with square markers, and reproduces the overall F1 of Nk * (Nk / Nt) from Figure 2 in a medium solid line with triangle markers for comparison. This figure shows clearly that using the complete equation gives better results than using either one of its components.

Figure 3. Overall F1 using Nk*(Nk/Nt) (solid medium triangles), Nk/Nt (dashed dark rectangles), and Nk (light dotted circles).

B. Size of the Base Category Set

The second aspect of the system we studied comes at the end of the second stage of the algorithm, when the list of base categories is trimmed down to keep only the most relevant ones. This list will initially contain all categories connected to any title that contains at least one of the keywords the user specified. As we mentioned before, the average number of base categories generated by a query is 3,400 and the maximum is 45,000. These base categories are then used to compute the score of the goal categories, using one of the summations of equations (4)-(8). This test aims to see if the quality of the results can be improved by limiting the size of the set of base categories used in this summation, and if so, what the approximate ideal size is. For this test, we used Wt = Nk * (Nk / Nt) for equation (1), the best formula found in the previous test. We again weighted the goal categories using equation (5) and varied the number of returned goal categories from 1 to 5. Figure 4 shows the F1 value of the system under these conditions when trimming the list of base categories to 500 (black solid line with diamonds), 100 (light solid line with circles), 50 (medium solid line with triangles), 25 (light dotted line with squares), 10 (black dashed line with squares) and 1 (black dotted line with circles).

Figure 4. Overall F1 using from 1 to 500 base categories.

Figure 4 shows clearly that the quality of the results drops if the set of base categories is too large (500) or too small (1). The difference in the results between the other four cases is less marked, and in fact the results with 10 and 100 base categories overlap. More notably, the results with 10 base categories start weaker than the case with 100, spike around 3 goal categories to outperform it, then drop again and tie it at 5 goal categories. This instability seems to indicate that 10 base categories are not enough. The tests with 25 and 50 base categories are the two that yield the best results; it thus seems that the optimal size of the base category set is in that range. The 25 base category case outperforms the 50 case, and is the one we will prefer. It is interesting to consider that in our previous study [4], we used the other alternative we proposed, namely to trim the set based on the density values.
The cutoff we used was half the density value of the base category in the set with the highest density; any category with less than that density value was eliminated. This gave us a set of 28 base categories on average, a result which is consistent with the optimum we discovered in the present study.

C. Goal Category Score and Ranking

Another aspect of the system we wanted to study is the choice of equations we can use to account for the base categories' density and distance when ranking the goal categories. The option we used in the previous subsection, to find the nearest goals to any of the retained base categories regardless of their densities, is entirely valid. The alternative we consider here is to rank the goal categories as a function of their distance to each base category and of the density of that base. We proposed five possible equations in Section IIIC to mathematically combine density and distance to rank the goal categories. Equation (4) considers both distance and density evenly, and the others put progressively more importance on the distance, up to equation (8). To illustrate the different impact of equations (4)-(8), consider three fictional base categories: one which has a density of 4 and is at a distance of 4 from a goal category, a second with a density of 4 and a distance of 3 from the same goal category, and a third with a density of 3 and a distance of 3 from the goal. The contribution of each of these bases to the goal category in each of the summations is given in Table II.

TABLE II. IMPACT OF THE GOAL CATEGORY EQUATIONS
Equation | Density 4, Distance 4 | Density 4, Distance 3 | Density 3, Distance 3
(4)      | 1.00                  | 1.33                  | 1.00
(5)      | 0.25                  | 0.44                  | 0.33
(6)      | 0.07                  | 0.20                  | 0.15
(7)      | 0.001                 | 0.01                  | 0.007
(8)      | 4.5×10^-7             | 0.0005                | 0.0004

As we can see in this table, the contribution of each base decreases as we move down from equation (4) to equation (8), but it decreases a lot more and a lot faster for the base at a distance of 4. The contribution to the summation of the category at a distance of 4 is almost equal to that of the categories at a distance of 3 when using equation (4), but is three orders of magnitude smaller when using equation (8). That is the result of putting more and more emphasis on distance rather than density: the impact of a farther-away higher-density category becomes negligible compared to a closer lower-density category. Meanwhile, comparing the contribution of the two categories of different densities at the same distance shows that, while they are in the same order of magnitude, the higher-density one is always more important than the lower-density one, as we would want.
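The Table II figures can be reproduced directly from equations (4)-(8); the short check below is our own illustration and not part of the system.

```python
import math

# Reproduce the Table II contributions for the three (density, distance) cases in the text.
cases = [(4, 4), (4, 3), (3, 3)]
formulas = {
    "(4)": lambda d, x: d / (x + 0.0001),
    "(5)": lambda d, x: d / (x * x + 0.0001),
    "(6)": lambda d, x: d * math.exp(-x),
    "(7)": lambda d, x: d * math.exp(-2 * x),
    "(8)": lambda d, x: d * math.exp(-x * x),
}
for name, f in formulas.items():
    row = ", ".join(f"{f(density, distance):.2g}" for density, distance in cases)
    print(f"Equation {name}: {row}")
# Prints approximately 1, 1.3, 1 for (4); 0.25, 0.44, 0.33 for (5); 0.073, 0.2, 0.15 for (6);
# 0.0013, 0.0099, 0.0074 for (7); and 4.5e-07, 0.00049, 0.00037 for (8), matching Table II.
```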
For comparison, we also ran the classification using the exploration algorithm from our previous work [4], and included those results as a black dotted line with square markers. We can see from Figure 5 that putting too much importance on distance rather than density can have a detrimental impact on the quality of the results: the results using equations (7) and (8) are the worst of the five equations. Even the results from equation (6) are of debatable quality: although it is in the same range as the results of equations (4) and (5), it shows a clear downward trend as we increase the number of goal categories considered, going from the best result for 2 goals to second-best with 3 goals to a narrow third place with 4 goals and finally to a more distant third place with 5 goals. Finally, we can see that the results using the exploration algorithm of [4] are clearly the worst ones, despite the system being updated to use the better category density equations and goal category mappings discovered in this study. The main difference between the two systems is thus the use of our old exploration algorithm to discover the goal categories nearest to any of the 25 base categories. This is also the source of the poorer results: the exploration algorithm is very sensitive to noise and outlier base categories that are near a goal category, and will return that goal category as a result. On the other hand, all five equations have in common that they sum the value of all base categories for each goal category, and therefore build-in noise tolerance. An outlier base category will seldom give enough of a score boost to one goal to eclipse the combined effect of the 24 other base categories on other goals. Out of curiosity, we ran the same test a second time, but this time keeping the 100 highest-density base categories. These results are presented in Figure 6, using the same line conventions as Figure 5. It is interesting to see that this time it is equation (6) that yields the best results with a solid lead, not equation (5). This indicates a more fundamental relationship in our system: the best summation for the goal categories is not an absolute but depends on the number of base categories retained. With a smaller set of 25 base categories, the system works best when it considers a larger picture including the impact of more distant categories. But with a larger set of 100 base categories, the abundance of more distant categories seems to generate noise, and the system works best by limiting their impact and by focusing on closer base categories. Figure 5. Goal score formulae using 25 base categories. Figure 6. Goal score formulae using 100 base categories. © 2012 ACADEMY PUBLISHER D. Number of Goal Categories Returned The final parameter in our system is the number of goal categories to return. We have already explained in Section IIIC that our preference to return three goal categories is based on a study of human classification – namely, in the KDD CUP 2005 competition [5], human labelers used on average 3.3 categories per query. Moreover, looking at the F1 figures we presented in the previous subsections, we can see that the curve seems to be exponential, with each extra category returned giving a lesser increase in F1 value. Returning a fifth goal JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 category gives the least improvement compared to returning only four goal categories, and in fact in some cases it causes a drop in F1. 
Returning three categories seems to be at the limit between the initial faster rate of increase of the curve and the later plateau. Another way to look at the question is to consider the average score of goal categories at each rank, after summing the densities of the base categories for each and ranking them. If on average the top-ranked categories have a large difference to the rest of the graph, it will show that there exist a robust division between the likelycorrect goal categories to return and the other goal categories. The opposite observation, on the other hand, would reveal that the rankings could be unstable and sensitive to noise, and that there is no solid score distinction between the goals our system returns and the others. For this part of the study, we used the summation of equation (5). We can recall from our discussion in Section III that the maximum word weight is Nk and the maximum category density is Nk². Queries in the KDD CUP data set are at most 10 words long, giving a maximum base category density of 100. This in turn gives a maximum goal category score of 1,002,400 using equation (5) and 25 base categories in the case where the distance between the goal category and each of the base categories is one except for a single base category at a distance of zero; in other words, the goal is one of the base categories found and all other base categories are immediately connected to it. More realistically, we find in our experiments that the average base category density computed by equation (3) is 1.48, and the average distance between a base and goal category is 5.6 steps, so an average goal category score using equation (5) would be 1.18. Figure 7 shows the average score of the goal category at each rank over all KDD CUP queries used in our experiment from the previous section, obtained using the method described above. This graph shows that the top category has on average a score of 3,272, several orders of magnitude above the average but still below our theoretical maximum. In fact, even the maximum score we observed in our experiments is only 67,506, very far below the theoretical maximum. This is due to the fact that most base categories are more than a single step removed from the goal category. The graph also shows the massive difference between the first three ranks of goal categories and the other 96. The average score goes from 3,272, to 657 at rank 2 and 127 at rank 3, down to 20 and 16 at ranks 4 and 5 respectively, then cover the interval from 2 to 0.7 between ranks 6 and 99. This demonstrates a number of facts. First of all, both the values of the first five goal ranks and the differences between their scores when compared to the other 94 shows that these first ranks are resilient to noise and variations. It also justifies our decision to study the performance of our system using the top 1 to 5 goal categories, and it gives further experimental support to our decision to limit the number © 2012 ACADEMY PUBLISHER 215 Figure 7. Average goal category score per rank over all KDD CUP queries. of goal categories returned by the classifier to three. It is interesting to note that the average score of the categories over the entire distribution is 42.53, very far off from our theoretical average of 1.18. However, if we ignore the first three ranks, whose values are very high outliners in this distribution, the average score becomes 1.62. Moreover, the average score over ranks 6 to 99 is 1.28. Both of these values are in line with the average we expected to find. 
E. The Optimal System

After having performed these experiments, we are ready to put forward the optimal classifier, the one that combines the best features from the options we have studied. This classifier uses Wt = Nk * (Nk / Nt) for equation (1), selects the top 25 base categories, ranks the goal categories using the summation formula of equation (5), and returns the top three categories ranked. The results we obtain with that system are presented in Table III, with other KDD CUP competition finalists reported in [5] for comparison. Note that participants had the option of entering their system for only the precision ranking or only the F1 ranking rather than both, and several participants chose to use that option. Consequently, there are some N/A values in the results in Table III.

TABLE III. CLASSIFICATION RESULTS
System            | F1 Rank | Overall F1 | Precision Rank | Overall Precision
KDDCUP #22        | 1       | 0.4444     | N/A            | 0.4141
KDDCUP #37        | N/A     | 0.4261     | 1              | 0.4237
KDDCUP #21        | 6       | 0.3401     | 2*             | 0.3409
Our system        | 7       | 0.3366     | 2              | 0.3643
KDDCUP #14        | 7*      | 0.3129     | N/A            | 0.3173
Our previous work | 10      | 0.2827     | 7              | 0.3065
KDDCUP Mean       |         | 0.2353     |                | 0.2545
KDDCUP Median     |         | 0.2327     |                | 0.2446
(* indicates competition systems that would have been outranked by ours.)

As can be seen from Table III, our system performs well above the competition average, and in fact ranks in the top-10 of the competition in F1 and in the top-5 in precision.
These queries are all single words, where the word is either an uncommon abbreviation (the query "AATFCU" for example), misspelled in an unusual way ("egyptains"), an erroneous compounding of two words ("contactlens"), a rare website URL, or even a combination of the above (such as the misspelled URL "studioeonline.com" instead of "studioweonline.com"). These are all situations that occur in real user search queries, and they are therefore present in the KDDCUP data set. It is worth noting that Wikipedia titles include common cases of all these errors, so that only the 5.9% most unusual cases lead to failure in our system.

It is interesting to study a specific example, to see the system's behavior step by step. We chose for this purpose to study a query for "internet explorer" in the KDDCUP set. This query was manually classified by the competition's three labelers into the KDDCUP categories "Computers\Software; Computers\Internet & Intranet; Computers\Security; Computers\Multimedia; Information\Companies & Industries" by the first labeler, into "Computers\Internet & Intranet; Computers\Software" by the second labeler, and into "Computers\Software; Computers\Internet & Intranet; Information\Companies & Industries" by the third labeler.

The algorithm begins by identifying a set of relevant base categories using the procedure explained in Section III.B and then weighting them using equation (3). For this query, our algorithm identifies 1,810 base categories and keeps the 25 highest-density ones, breaking the tie for position 25 by considering the number of titles pointing to the categories, as we explained in Section III.B. For any two-word query, the maximum title weight value that can be computed by equation (1) is 2, and the maximum base category density value that can be returned by equation (3) is 4. And in fact, we find that 8 categories receive this maximum density, including some of the examples listed in Table IV.

TABLE IV
SAMPLE BASE CATEGORIES

Rank    Category                                   Density   Titles
1       Internet Explorer                          4         36
2       Internet history                           4         32
3       Windows web browsers                       4         20
8       Microsoft criticisms and controversies     4         4
25      HTTP                                       2.67      5
26      Mobile phone web browsers                  2.67      4
33      Cascading Style Sheets                     2         5
37      Internet                                   1         17
660     PlayStation Games                          0.4       2
905     Islands of Finland                         0.33      1
1811    History of animation                       0.05      1

We can see from these examples that the top-ranked base categories are indeed very relevant to the query. Examining the entire set of base categories reveals that the density values drop to half the maximum by rank 33, and to a quarter of it by rank 37. The density value continues to drop as we go down the list: the average density of a base category in this example is 0.4, which corresponds to rank 660; by the middle of the list, at rank 905, the density is 0.33; and the final category in the list has a density of only 0.05. It can also be seen from the samples in Table IV that the relevance of the categories to the query does seem to decrease along with the density value. Looking at the complete list of 1,810 base categories, we find that the first non-software-related category is "Exploration" at rank 41, with a density of 1. But software-related categories continue to dominate the list, mixed with a growing number of non-software categories, until rank 354 (density of 0.5 and 1 title pointing to the category), where non-computer categories begin to dominate. Incidentally, the last software-related category in the list is "United States internet case law", at rank 1791 with a density of 0.11.
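To make the selection step above concrete, the following is a minimal sketch of how the top-25 base-category cutoff with a title-count tie-break could be implemented. It assumes, consistent with the maxima stated in the paper, that Nk is the number of query keywords matched by a Wikipedia title and Nt the total number of words in that title; the class and method names are ours, not the authors'.

    import java.util.*;

    class BaseCategorySelector {

        // Title weight of equation (1): Wt = Nk * (Nk / Nt), under the assumption
        // stated above about what Nk and Nt denote.
        static double titleWeight(int matchedKeywords, int titleLength) {
            return matchedKeywords * (matchedKeywords / (double) titleLength);
        }

        static class BaseCategory {
            final String name;
            final double density;    // accumulated by equation (3) over matching titles
            final int titleCount;    // number of matching titles pointing to this category
            BaseCategory(String name, double density, int titleCount) {
                this.name = name;
                this.density = density;
                this.titleCount = titleCount;
            }
        }

        // Keep the 'cutoff' highest-density categories; ties (e.g., at position 25)
        // are broken in favour of the category with more titles pointing to it.
        static List<BaseCategory> selectTop(List<BaseCategory> all, int cutoff) {
            all.sort((a, b) -> {
                int byDensity = Double.compare(b.density, a.density);
                return byDensity != 0 ? byDensity : Integer.compare(b.titleCount, a.titleCount);
            });
            return all.subList(0, Math.min(cutoff, all.size()));
        }
    }

In the "internet explorer" example above, the eight categories with density 4 would sort ahead of the rest, and the tie at position 25 would be resolved in favour of the category with more matching titles.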
The next step of our algorithm is to rank the 99 goal categories using the sum of density values in equation (5). Sample rankings are given in Table V. This table uses the Wikipedia goal category labels; the matching KDDCUP categories can be found in Appendix A.

TABLE V
SAMPLE GOAL CATEGORIES

Rank    Goal Category       Score
1       Internet            11.51
2       Software            10.09
3       Computing           8.03
4       Internet culture    5.63
5       Websites            4.97
16      Technology          3.54
18      Magazines           3.39
30      Industries          2.86
49      Law                 2.54
99      Renting             1.30
Refer to Appendix A for the list of KDDCUP categories corresponding to these goal categories.

We can see from these results that the scores drop by half from the first result to the fourth one. This is much less drastic than the drop we observed on average in Figure 7, but it is nonetheless consistent with it, as it shows a quick drop from a peak over the first three ranks and a longer, more stable tail over ranks 4 to 99. It is also encouraging to see that the two best goal categories selected by our system correspond to "Computers\Internet & Intranet" and "Computers\Software", the only two categories to be picked by all three KDDCUP labelers. The fourth goal corresponds to "Online Community\Other" and is the first goal that is not in the KDDCUP "Computers\" category, although it is still strongly relevant to the query. Further down, the first goal that corresponds neither to a "Computers\" nor an "Online Community\" category is Technology ("Information\Science & Technology") at rank 16, which is still somewhat related to the query, and the first truly irrelevant result is Magazines ("Living\Book & Magazine") at rank 18, with a little over a quarter of the top category's score. Of the categories picked by the labelers, the one that ranked worst in our system was "Information\Companies & Industries" at rank 30. All the other categories they identified are found in the top-10 results of our system.

F. New Data and Final Tests

In order to show that our results in Table III are general and not due to picking the best system for a specific data set, we ran two more tests of our system with two new data sets. The first data set is a set of 111 KDD CUP 2005 queries classified by a competition judge. This set was not part of the 800 test queries we used previously; it was a set of queries made available by the competition organizers to participants prior to the competition, to develop and test their systems. Naturally, the queries in this set are similar to the other KDD CUP queries, and so we expect similar results. The second data set is a set of queries taken from the TREC 2007 Question-Answering (QA) track [19]. That data set is composed of 445 questions on 70 different topics; we randomly selected three questions per topic to use for our test. It is also worth noting that the questions in TREC 2007 were designed to be asked sequentially, meaning that a system could rely on information from the previous questions, while our system is designed to classify each query by itself with no query history. Consequently, questions that were too vague to be understood without previous information were disambiguated by adding the topic label.
For example, the question "Who is the CEO?" in the series of questions on the company 3M was rephrased as "Who is the CEO of 3M?". Finally, two of the co-authors independently labeled the questions with KDD CUP categories in order to have a standard against which to compare our system's results in equations (9) and (10). The TREC data set was selected in order to subject our system to very different testing conditions: instead of the short, keyword-only KDD CUP web queries, TREC has long and grammatically-correct English questions.

The results from both tests are presented in Table VI, along with our system's development results already presented in Table III for comparison.

TABLE VI
TEST CLASSIFICATION RESULTS

Query set      Overall F1   Overall Precision   Overall Recall
KDDCUP 111     0.3636       0.4254              0.3175
TREC           0.4639       0.4223              0.5267
KDDCUP 800     0.3366       0.3643              0.3195

These results show that our classifier works better with the test data than with the training data it was developed and optimized on. This counter-intuitive result requires explanation. The greatest difference in our results is on recall, which increases by over 20% from the training KDDCUP test to the TREC test. Recall, as presented in equation (10), is the ratio of correct category labels identified by our system for a query to the total number of category labels the query really has. Since our classifier returns a fixed number of three categories per query, it stands to reason that it cannot achieve perfect recall for a query set that assigns more than three categories per query, and that it can achieve better recall on a query set that assigns fewer categories per query. To examine this hypothesis, we compared the results of five labelers individually: the three labelers of the KDDCUP competition and the two labelers of our TREC data set (the 111 KDDCUP demo queries, having been labeled by only one person, were not useful for this test). Specifically, we looked at the average number of categories per query each labeler used and the recall value our system achieved using that labeler's query set.

TABLE VII
CATEGORIZATION AND RECALL

Query set            Average number of categories   Recall
TREC Labeler 1       1.93 ± 0.81                    0.5443
TREC Labeler 2       2.91 ± 0.92                    0.5090
KDDCUP Labeler 2     2.39 ± 0.93                    0.3763
KDDCUP Labeler 1     3.67 ± 1.13                    0.3076
KDDCUP Labeler 3     3.85 ± 1.09                    0.2747

The results, presented in Table VII, show that our intuition is correct: query sets with fewer categories per query lead to higher recall, with the most drastic example being the increase of 1.5 categories per query between KDDCUP labelers 2 and 3, which yielded a 10% decrease in recall. However, it also appears from that table that the relationship does not hold across different query sets: KDDCUP labeler 2 assigns fewer labels per query than TREC labeler 2 but still has a much lower recall.

Next, we can contrast the two KDDCUP tests: they both had nearly identical recall, but the new data gave a 6% increase in precision. This is interesting because the queries come from the same data set, they are web keyword searches of the same average length, and the correct categorization statistics are nearly identical to those of Labeler 3, so we would actually expect the recall to be lower than it ended up being. An increase in both precision and recall can have the same origin in equations (9) and (10): a greater proportion of correct categories identified by our classifier. But everything else being equal, this would only happen if the queries themselves were easier for our system to understand. To verify this hypothesis, we checked both query sets for words that are unknown to our system.
As we explained previously, many of these unknown words are simple typos ("egyptains") or missing spaces between two words ("contactlens"); while they are unknown to and ignored by our system, their meaning is immediately obvious to the human labelers. The labelers thus have more information with which to classify the queries, which makes it inherently more difficult for our system to generate the same classification. Upon evaluation of our data, we find that the KDDCUP set of 800 queries features about twice the frequency of unknown words as the set of 111 queries. Indeed, 10.4% of queries in the 800-query set have unknown words and 4.4% of words overall are unknown, while only 5.4% of queries in the 111-query set have unknown words and only 2.5% of words in that set are unknown. This is an important difference between the two query sets, and we believe it explains why the 111 queries are more often classified correctly. It incidentally also indicates that an automated spelling corrector should be incorporated into the system in the future.

The better performance of our system on the TREC query set can be explained in the same way. Because that set is composed of correct English questions, it features even fewer unknown words: a mere 0.4% of words in 1.9% of queries. Moreover, for the same reason, the queries are much longer: on average 5.3 words in length after stopword removal, compared to 2.4 words for the KDDCUP queries. This means that even if there is an unknown word in a query, there are still enough other words in a TREC query for our system to make a reasonably good classification.

Differences in the queries aside, there do not appear to be any major distinctions, much less setbacks, when our classifier is used on new and unseen data sets. It seems robust enough to handle new queries in a different spread of domains, and to handle both web-style keyword searches and English questions without loss of precision or recall.

Finally, it is interesting to determine how our classifier's performance compares to that of a human doing the same labeling task. Query classification is a subjective task: since queries are short and often ambiguous, their exact meaning and classification is often dependent on human interpretation [20]. It is clear from Table VII that this is the case for our query sets: the human labelers do not agree with each other on the classification of these queries. We can evaluate the human labelers by computing the F1 of each one's classification compared to the others' on the same data set. In the case of the KDDCUP data, the average F1 of human labelers is known to be between 0.4771 and 0.5377 [5], while for our labeled TREC data we can compute the F1 between the two human labelers to be 0.5605. This means our system achieves between 63% and 71% of a human's performance when labeling the KDDCUP queries (0.3366/0.5377 ≈ 0.63 and 0.3366/0.4771 ≈ 0.71), and 83% of a human's performance when labeling the TREC queries (0.4639/0.5605 ≈ 0.83). It thus appears that, by this benchmark, our classifier again performs better on the TREC data set than on the KDDCUP one. This gives further weight to our conclusion that our system is robust enough to handle very diverse queries.

V. CONCLUSION

In this paper, we presented a ranking and classification algorithm that exploits the Wikipedia category graph to find the best set of goal categories given user-specified keywords. To demonstrate its efficiency, we implemented a query classification system using our algorithm.
We performed a thorough study of the algorithm in this paper, focusing on each design decision individually and considering the practical impact of different alternatives. We showed that our system’s classification results compare favorably to those of the KDD CUP 2005 competition: it would have ranked 2nd on precision with a performance 10% better than the competition mean, and 7th in the competition on F1. We further detailed the results of an example query in key steps of the algorithm, to demonstrate that each partial result is correct. And finally we presented two blind tests on different data sets that were not used to develop the system, to validate our results. We believe this work will be of interest to anyone developing query classification systems, text classification systems, or most other kinds of classification software. By using Wikipedia, a classification system gains the ability to classify queries into a set of almost 300,000 categories covering most of human knowledge and which can easily be mapped to a simpler application-specific set of categories when needed, as we did in this study. And while we considered and tested multiple alternatives at every design stage of our system, it is possible to conceive of further alternatives that could be implemented on the same framework and compared to our results. Future work can focus on exploring these alternatives and further improving the quality of the classification. In that respect, as we indicated in Section IV.F, one of the first directions to work in will be to integrate an automated corrector into the system, to address the problem of unknown words. APPENDIX A This appendix lists how we mapped the 67 KDD CUP categories to 99 corresponding Wikipedia categories in the September 2008 version of the encyclopedia. KDD CUP Category Computers\Hardware Wikipedia Category Computer hardware Internet Computers\Internet & Intranet Computer networks Computers\Mobile Mobile computers Computing Computers\Multimedia Multimedia Networks Computers\Networks & Telecommunication Telecommunications JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 
3, AUGUST 2012 Computers\Security Computers\Software Computers\Other Entertainment\Celebrities Computer security Software Computing Celebrities Games Entertainment\Games & Toys Toys Entertainment\Humor & Fun Humor Entertainment\Movies Film Entertainment\Music Music Entertainment\Pictures & Photographs Photos Entertainment\Radio Radio Entertainment\TV Television Entertainment\Other Entertainment Arts Information\Arts & Humanities Humanities Companies Information\Companies & Industries Industries Science Information\Science & Technology Technology Information\Education Education Law Information\Law & Politics Politics Regions Information\Local & Municipalities Regional Local government Reference Information\References & Libraries Libraries Information\Other Information Books Living\Book & Magazine Magazines Automobiles Living\Car & Garage Garages Living\Career & Jobs Employment Dating Living\Dating & Relationships Intimate relationships Family Living\Family & Kids Children Fashion Living\Fashion & Apparel Clothing Finance Living\Finance & Investment Investment Food and drink Living\Food & Cooking Cooking Decorative arts Living\Furnishing & Furnishings Houseware Home appliances Giving Living\Gifts & Collectables Collecting Health Living\Health & Fitness Exercise Landscape Living\Landscaping & Gardening Gardening Pets Living\Pets & Animals Animals © 2012 ACADEMY PUBLISHER Living\Real Estate Living\Religion & Belief Living\Tools & Hardware Living\Travel & Vacation Living\Other Online Community\Chat & Instant Messaging Online Community\Forums & Groups Online Community\Homepages Online Community\People Search Online Community\Personal Services Online Community\Other Shopping\Auctions & Bids Shopping\Stores & Products Shopping\Buying Guides & Researching Shopping\Lease & Rent Shopping\Bargains & Discounts Shopping\Other Sports\American Football Sports\Auto Racing Sports\Baseball Sports\Basketball Sports\Hockey Sports\News & Scores Sports\Schedules & Tickets Sports\Soccer Sports\Tennis Sports\Olympic Games Sports\Outdoor Recreations Sports\Other 219 Real estate Religion Belief Tools Hardware (mechanical) Travel Holidays Personal life On-line chat Instant messaging Internet forums Websites Internet personalities Online social networking Virtual communities Internet culture Auctions and trading Retail Product management Consumer behaviour Consumer protection Renting Sales promotion Bargaining theory Distribution, retailing, and wholesaling American football Auto racing Baseball Basketball Hockey Sports media Sport events Seasons Football (soccer) Tennis Olympics Outdoor recreation Sports REFERENCES [1] Maj. B. J. Jansen, A. Spink, T. Saracevic, “Real life, real users, and real needs: a study and analysis of user queries on the web”, Information Processing and Management, vol. 36, issue 2, 2000, pp. 207-227. [2] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, Q. Yang, “Q2C@UST: our winning solution to query classification in KDDCUP 2005”, ACM SIGKDD Explorations Newsletter, vol. 7, issue 2, 2005, pp. 100-110. [3] M. Alemzadeh, F. Karray, “An efficient method for tagging a query with category labels using Wikipedia towards enhancing search engine results”, 2010 IEEE/WIC/ACM International Conference on Web 220 [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 Intelligence and Intelligent Agent Technology, Toronto, Canada, 2010, pp. 192-195. M. Alemzadeh, R. Khoury, F. 
Karray, “Exploring Wikipedia’s Category Graph for Query Classification”, in Autonomous and Intelligent Systems, M. Kamel, F. Farray, W. Gueaieb, A. Khamis (eds.), Lecture notes in Artificial Intelligence, 1st edition, vol. 6752, Springer, 2011, pp. 222-230. Y. Li, Z. Zheng, H. Dai, “KDD CUP-2005 report: Facing a great challenge”, ACM SIGKDD Explorations Newsletter, vol. 7 issue 2, 2005, pp. 91-99. D. Shen, J. Sun, Q. Yang, Z. Chen, “Building bridges for web query classification”, Proceedings of SIGIR’06, 2006, pp. 131-138. H. Cao, D. H. Hu, D. Shen, D. Jiang, J.-T. Sun, E. Chen, and Q. Yang. “Context-aware query classification”, Proceedings of SIGIR, 2009. J. Fu, J. Xu, K. Jia, “Domain ontology based automatic question answering”, International Conference on Computer Engineering and Technology (ICCET '08), vol. 2, 2009, pp. 346-349. J. Yu, N. Ye, “Automatic web query classification using large unlabeled web pages”, 9th International Conference on Web-Age Information Management, Zhangjiajie, China, 2008, pp. 211-215. S. M. Beitzel, E. C. Jensen, D. D. Lewis, A. Chowdhury, O. Frieder, “Automatic classification of web queries using very large unlabeled query logs”, ACM Transactions on Information Systems, vol. 25, no. 2, 2007, article 9. J. Hu, G. Wang, F. Lochovsky, J.-T. Sun, Z. Chen, “Understanding user's query intent with Wikipedia”, Proceedings of the 18th international conference on World Wide Web, Spain, 2009, pp. 471-480. R. Khoury, “Query Classification using Wikipedia”, International Journal of Intelligent Information and Database Systems, vol. 5, no. 2, April 2011, pp. 143-163. P. Schönhofen, “Identifying document topics using the Wikipedia category network”, Web Intelligence and Agent Systems, IOS Press, Vol. 7, No. 2, 2009, pp. 195-207. Z. Minier, Z. Bodo, L. Csato, “Wikipedia-based kernels for text categorization”, International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Romania, 2007, pp, 157-164. P. Wang, J. Hu, H.-J. Zeng, Z. Chen, “Using Wikipedia knowledge to improve text classification”, Knowledge and Information Systems, vol. 19, issue 3, 2009, pp. 265-281. S. Banerjee, K. Ramanathan, A. Gupta, “Clustering short texts using Wikipedia,” Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam, Netherlands, 2007 pp. 787-788. K. Coursey, R. Mihalcea, “Topic identification using Wikipedia graph centrality”, Proceedings of NAACL HLT, 2009, pp. 117-120. Z. S. Syed, T. Finin, A. Joshi, “Wikipedia as an ontology for describing documents”, Proceedings of the Second International Conference on Weblogs and Social Media, March 2008. H. T. Dang, D. Kelly, J. Lin, “Overview of the TREC 2007 question answering track”, Proceedings of the Sixteenth Text Retrieval Conference, 2007. B. Cao, J.-T. Sun, E. W. Xiang, D. H. Hu, Q. Yang, Z. Chen, “PQC: personalized query classification”, Proceedings of the 18th ACM conference on information and knowledge management, Hong Kong, China, 2009, pp. 1217-1226. © 2012 ACADEMY PUBLISHER JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 
3, AUGUST 2012

Towards Identifying Personalized Twitter Trending Topics using the Twitter Client RSS Feeds

Jinan Fiaidhi, Sabah Mohammed
Department of Computer Science, Lakehead University, Thunder Bay, Ontario P7B 5E1, Canada
{jfiaidhi,mohammed}@lakeheadu.ca

Aminul Islam
Department of Computer Science, Lakehead University, Thunder Bay, Ontario P7B 5E1, Canada
maislam@lakeheadu.ca

Abstract—We are currently witnessing an information explosion, aided by micro-blogging services like Twitter. Although Twitter provides a real-time list of the most popular topics people tweet about, known as trending topics, it is often hard to understand what these trending topics are about, and most of them are far removed from the personal preferences of the Twitter user. In this article, we address the issue of personalizing the search for trending topics by enabling the Twitter user to provide RSS feeds that capture their personal preferences, along with a Twitter client that can filter personalized tweets and trending topics according to a sound algorithm for capturing the trending information. The algorithms used are Latent Dirichlet Allocation (LDA) and the Levenshtein distance. Our experiments show that the developed prototype for personalized trending topics (T3C) finds more interesting trending topics that match the Twitter user's list of preferences than traditional techniques without RSS personalization.

Index Terms—Trending topics; Twitter; Streaming; Classification; LDA

I. INTRODUCTION

Twitter is a popular social networking service with over 100 million users. Twitter monitors the millions and billions of 140-character bits of wisdom that travel the Twitterverse and lists the top 10 hottest trends (also known as "trending topics") [1]. With such social networking, streams have become the main source of information, for sharing and analyzing it as it comes into the system. Streams are a central concept in most important Twitter applications. They have become so important that they even replace search engines as a starting point for Web browsing: a typical Web session now consists of reading Twitter streams and following the links found in these streams instead of starting with a Web search.

One of the central applications of Twitter streams is mining trending topics in real time. To develop such an application, one needs to use one of the three Twitter Application Programming Interfaces (APIs). The first is the REpresentational State Transfer (REST) API, which covers the basic Twitter functions (e.g., sending direct messages, retweeting, manipulating lists). The second is the Twitter Search API, which can do everything that Twitter Advanced Search can do. The third is the streaming API, which gives developers low-latency access to Twitter's global stream of tweet data. In particular, the streaming API gives the developer the ability to create a long-standing connection to Twitter that receives "push" updates when new tweets matching certain criteria arrive, obviating the need to constantly poll for updates. For this reason, the streaming API has become the more common choice for Twitter applications such as finding trending topics. In such an approach, the user subscribes to a set of other users (e.g., through FriendFeed) and reads the stream made up of their posts.
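As an illustration of this push-based model, the following is a minimal sketch of a filtered streaming client using the Twitter4J library mentioned later in this paper; it is not the T3C implementation itself, and the tracked keyword and the credential handling are placeholder assumptions.

    import twitter4j.*;

    // Minimal sketch of a filtered streaming connection (not the actual T3C code).
    public class StreamingSketch {
        public static void main(String[] args) throws Exception {
            // Assumes OAuth credentials are configured in twitter4j.properties.
            TwitterStream stream = new TwitterStreamFactory().getInstance();

            // Print every tweet pushed to us that matches the filter.
            stream.addListener(new StatusAdapter() {
                @Override
                public void onStatus(Status status) {
                    System.out.println(status.getUser().getScreenName() + ": " + status.getText());
                }
            });

            // Track a sample keyword; Twitter pushes matching tweets as they arrive.
            FilterQuery query = new FilterQuery();
            query.track(new String[]{"basketball"});
            stream.filter(query);
        }
    }

The filter predicate can also be a location bounding box rather than a keyword, which corresponds to the location-based collection described in the next section.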
However, the problem with this approach is that there is always a compromise between the number of users one would like to follow and the amount of information one is able to consume. Twitter users share a variety of comments regarding a wide range of topics. Some researchers have recommended a streaming approach that identifies interesting tweets based on their density, negativity, trending and influence characteristics [2, 3]. However, mining this content to define user interests is a challenge that requires an effective solution. Certainly, identifying a personalized stream that contains only a moderate number of posts that are potentially interesting for the user can be used for the customization and personalization of a variety of commercial and non-commercial applications, such as product marketing and recommendation. Recently, Twitter introduced local trending topics, which contribute to the solution of this problem by improving the Discover tab to show what users in your geographic area are tweeting about. But this service falls short of providing many kinds of personalized trends, such as displaying trends only from those you follow or from those they follow. There are other current attempts to fill this gap, such as Cadmus (http://thecadmus.com/) and KeyTweet (http://keytweet.com/); however, there is no comprehensive solution that provides a wide range of personalization avenues for Twitter users. This article introduces our investigation into identifying personalized trending topics over a stream of tweets.

II. RELATED RESEARCH

Research on topic identification within textual data is related to information retrieval, data mining, or a hybrid of both. Information retrieval research provides searching techniques that can identify the main concepts in a given text based on structural elements available within the provided text (e.g., by identifying noun phrases as good topic markers [5]). This is a multi-stage process that starts by identifying key concepts within a document, then grouping these to find topics, and finally mapping the topics back to documents and using the mapping to find higher-level groupings. Information retrieval research also utilizes computational linguistics and natural language techniques to predict important terms in a document, using methods like coreference, anaphora resolution, or discourse centering [6]. However, the important terms identified by linguistic techniques do not necessarily correspond to the subject or theme of the text. Predicting important terms involves numerically weighting the terms in a document; terms with the top weights are judged important and representative of the document. Term extraction methods like term frequency-inverse document frequency (TF-IDF) [7] follow this direction: TF-IDF generally extracts from a text keywords which represent topics within the text, but it does not perform segmentation. A segmentation method (e.g., TextTiling [8]) generally segments a text into blocks (paragraphs) according to topic changes within the text, but it does not by itself identify (or label) the topics discussed in each of the blocks. While both techniques (i.e., TF-IDF and segmentation) have some appealing features—notably the basic identification of sets of words that are discriminative for documents in the collection—these approaches also provide a relatively small reduction in description length and reveal little in the way of inter- or intra-document statistical structure.
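For reference, a standard form of the TF-IDF weighting discussed above is given below; this is the textbook definition rather than a formula taken from [7], so the exact notation is our assumption.

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}

Here tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the number of documents in the collection; the terms with the highest weights are then taken as topic markers.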
To address these shortcomings, IR researchers have proposed several other dimensionality reduction and topic identification techniques, such as latent semantic indexing (LSI) and Latent Dirichlet Allocation (LDA) [9]. On the other hand, data mining approaches try to analyze text to find frequent itemsets, or groups of named entities that commonly appear together in a training dataset, and use these associations to predict topics in future documents [10]. This approach assumes a previously available dataset and is not suitable for streaming and dynamically changing topics such as those associated with Twitter. For this reason, we consider this approach to be out of the scope of this article.

III. DEVELOPING A STREAMING CLIENT FOR IDENTIFYING TRENDING TOPICS

It is a simple task to start developing a Twitter streaming client, especially with the availability of a variety of Twitter streaming libraries (e.g., Twitter4J, http://repo1.maven.org/maven2/net/homeip/yusuke/twitter4j/; JavaTwitter, http://www.javacodegeeks.com/2011/10/java-twitter-client-withtwitter4j.html; JTwitter, http://www.winterwell.com/software/jtwitter.php). However, modifying such a client to search for trending topics and adapt to the user's preferences is another issue that adds programming complexity. The advantage of using trending topics is to reduce the messaging overload that each active user receives every day. Without classification of the incoming tweets, users are forced to march through a chronologically-ordered morass to find tweets of interest. Finding personalized trending topics and grouping tweets into coherently clustered trending topics for more directed exploration will simplify searching for and identifying tweets of interest. In this section we present a Twitter client that groups tweets, according to the user's preferences, into topics mentioned explicitly or implicitly, which users can then browse for items of interest. To implement this topic clustering, we have developed a revised LDA (Latent Dirichlet Allocation) algorithm for discovering trending topics. Figure 1 illustrates the structure of our Trending Topics Twitter Client (T3C).

Figure 1. The Structure of the T3C Twitter Client.

Data was collected using the Twitter streaming API (http://twitter.com), with the filter tweet stream providing the input data and the trends/location stream providing the list of terms identified by Twitter as trending topics. The filter streaming API is a limited stream that returns public statuses matching one or more filter predicates. The United States (New York) and Canada (Toronto) were used as the locations for evaluation. The Google Geocoding API (http://developers.google.com/maps/geocoding) was used to get location-specific Twitter data. The streaming data was collected automatically using the Twitter4J API and stored in a tabular CSV-formatted file. Data was collected at different time intervals for the same city and topic, and we collected datasets on different topics from different cities at different time intervals. We collected about 200 tweets on Canadian gas price topics from Thunder Bay and Toronto on 25th and 26th April 2012, and about 28,000 tweets on sports (basketball) related topics from New York and Toronto on 8th and 9th May 2012.
We collected about 600 tweets on health (flu) related topics from Toronto and Vancouver on 8th and 9th May 2012, about 6,000 tweets on political (election) topics from Los Angeles and Toronto on 9th and 10th May 2012, and about 2,000 tweets on education (engineering school) related topics from Toronto and New York on 9th and 10th May 2012. Additionally, we collected a large set of data from the USA and Canada between 25th June 2012 and 30th June 2012: a total of 2,736,048 tweets (economy 1,795,211, education 89,455, health 390,801, politics 60,265, sports 400,316). We ran our client to collect the data automatically, and we used multiple Twitter accounts to collect data concurrently; the collected data is available at http://flash.lakeheadu.ca/~maislam/Data. Next, the tweets were preprocessed to remove URLs, Unicode characters, usernames, punctuation, HTML, etc. A stop word file containing common English stop words (http://flash.lakeheadu.ca/~maislam/Data/stopwords.txt) was used to filter common words out of the tweets. The T3C client collects tweets and filters those that match the user's preferences according to the feeds sent by the T3C user via the RSS protocol.

IV. T3C TRENDING TOPICS PERSONALIZATION

In this section, we describe an improved Twitter personalization mechanism that incorporates the user's personalization RSS feeds. The user of our T3C Twitter client can provide his or her own personalization feeds via the RSS protocol, or can directly upload the personalization data list from a given URL. The T3C reads these feeds with an XMLEventReader that reads all the available feeds and stores them in a personalization data list. Figure 2 illustrates the filtering of personalized tweets through the streaming process.

Figure 2. Filtering Personalized Tweets During Streaming.

While the Twitter API collects tweets to form a dataset, the tweets that are related to the user's personalization feeds are filtered using a string similarity method based on the Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance). The Levenshtein distance is a measure between strings: the minimum cost of transforming one string into another through a sequence of edit operations. In our T3C, the use of this measure can be illustrated with the following code snippet:

    // Filter the streamed tweets: keep a tweet only if it is within a Levenshtein
    // distance of 80 from at least one entry in the user's personalization list.
    while ((line = input.readLine()) != null) {
        line = cleanup(line);                    // strip URLs, usernames, punctuation, etc.
        double distance = 80;                    // default threshold when no personalization list is given
        if (personalize.size() > 0) {
            distance = 4000;                     // large initial value so the minimum distance can be found
        }
        for (int j = 0; j < personalize.size(); ++j) {
            String comparisonTweet = personalize.get(j);
            int thisDistance = Util.computeLevenshteinDistance(comparisonTweet, line);
            if (distance > thisDistance) {
                distance = thisDistance;         // keep the minimum distance over all personalization entries
            }
        }
        if (distance <= 80) {
            articleTextList.add(line);           // tweet is close enough to the user's preferences
        }
    }

The similarity detection loop continues until the end of the dataset. For each tweet we remove URLs, Unicode characters, usernames, punctuation, HTML, stop words, etc. The similarity loop then iterates over the user's personalization RSS data list to get the minimum Levenshtein distance value. In our implementation we have set a distance threshold of 80, which we found to be a good value for catching most related personalized tweets (a sketch of a standard Levenshtein distance routine is given below). We also found that the Levenshtein distance lets us remove duplicate tweets: a distance of zero indicates that two tweets are identical [11]. After filtering the personalized tweets, the Latent Dirichlet Allocation (LDA) algorithm is used to generate the trending topics model.
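The Util.computeLevenshteinDistance routine used above is not shown in the paper; the following is a minimal sketch of what such a utility could look like, using the standard dynamic-programming formulation of the Levenshtein distance. The class and method names are taken from the snippet above, but this body is our assumption, not the authors' code.

    class Util {
        // Standard dynamic-programming Levenshtein distance: the minimum number of
        // single-character insertions, deletions, and substitutions turning a into b.
        static int computeLevenshteinDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // delete all of a
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insert all of b
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,       // deletion
                                                d[i][j - 1] + 1),      // insertion
                                       d[i - 1][j - 1] + cost);        // substitution
                }
            }
            return d[a.length()][b.length()];
        }
    }

A distance of zero, as noted above, means the two strings are identical, which is how duplicate tweets can be detected.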
The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [4]. LDA makes the assumption that document generation can be explained in terms of these distributions, which are assumed to have a Dirichlet prior. First a topic distribution is chosen for the document, and then each word in the document is generated by randomly selecting a topic from the topic distribution and randomly selecting a word from the chosen topic. Given a set of documents, the main challenge is to infer the word distributions and topic mixtures that best explain the observed data. This inference is computationally intractable, but an approximate answer can be found using a Gibbs sampling approach. The LingPipe LDA implementation (http://aliasi.com/lingpipe/docs/api/com/aliasi/cluster/LatentDirichletAllocation.html) was used in our Twitter client prototype. In this LDA implementation, a topic is nothing more than a discrete probability distribution over words. That is, given a topic, each word has a probability of occurring, and the sum of all word probabilities in a topic must be one. For the purposes of LDA, a document is modeled as a sequence of tokens. We use the tokenizer factory and symbol table to convert the text to a sequence of token identifiers in the symbol table, using the static utility method built into LingPipe LDA (http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html). Using the LingPipe LDA API, we can report topics according to the following code snippet:

    // Report the top words of each LDA topic, keeping only words that are close
    // (Levenshtein distance <= 4) to an entry in the user's personalization list.
    for (int topic = 0; topic < numTopics; ++topic) {
        int topicCount = sample.topicCount(topic);   // number of tokens assigned to this topic
        ObjectToCounterMap<Integer> counter = new ObjectToCounterMap<Integer>();
        for (int wordId = 0; wordId < numWords; ++wordId) {
            String word = mSymbolTable.idToSymbol(wordId);
            double distance = 4;                     // default threshold when no personalization list is given
            if (personalize.size() > 0) {
                distance = 4000;                     // large initial value so the minimum can be found
            }
            for (int j = 0; j < personalize.size(); ++j) {
                String comparisonTweet = personalize.get(j);
                int thisDistance = Util.computeLevenshteinDistance(comparisonTweet, word);
                if (distance > thisDistance) {
                    distance = thisDistance;
                }
            }
            if (distance <= 4) {
                counter.set(Integer.valueOf(wordId), sample.topicWordCount(topic, wordId));
            }
        }
        List<Integer> topWords = counter.keysOrderedByCountList();
    }

This iterative process maps the word identifiers to their counts in the current topic. The resulting mapping is sorted by count, from high to low, and assigned to a list of integers. Trending topics are then ranked according to the Z score, by testing the binomial hypothesis of a word's frequency in the personalized topic against its frequency in the corpus [11]. Tables I and II illustrate running T3C with and without personalization, with an initial search around football and basketball.

TABLE I.
TRENDING TOPICS WITHOUT PERSONALIZATION

Trending Topic        Count   Probability
basketball            12479   0.161
play                  2471    0.032
watch                 1277    0.016
school                1271    0.016
game                  1153    0.015
bballproblemz         1109    0.014
#basketballproblem    1082    0.014
basketballproblem     1079    0.014
asleep                1063    0.014
love                  874     0.011
player                853     0.011
football              647     0.008

TABLE II.
TRENDING TOPICS WITH PERSONALIZATION

Trending Topic   Count   Probability
basketball       3660    0.298
watch            372     0.030
love             322     0.026
play             322     0.026
game             289     0.024
football         204     0.017
player           181     0.015
team             82      0.007
don              72      0.007
season           72      0.007
baseball         59      0.005
short            57      0.005
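The Z-score ranking mentioned above tests how over-represented a word is in the personalized topic relative to the whole corpus. The exact statistic used by the authors is not given in the paper, so the following is only a sketch of a standard one-proportion binomial z-test that matches the description; the class, method, and variable names are ours.

    class TrendScore {
        // One-proportion binomial z-test: how far the word's rate inside the
        // personalized topic deviates from its expected rate under the
        // corpus-wide frequency. A larger z means the word is trending harder.
        static double zScore(int countInTopic, int topicSize,
                             int countInCorpus, int corpusSize) {
            double p0 = countInCorpus / (double) corpusSize;      // corpus-wide word rate
            double observed = countInTopic / (double) topicSize;  // rate inside the topic
            double stdErr = Math.sqrt(p0 * (1 - p0) / topicSize);
            return (observed - p0) / stdErr;
        }
    }

Words within a topic would then be ranked by decreasing z to obtain the final trending-topic ordering.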
V. EXPERIMENTATION RELATED TO THE IDENTIFICATION OF TRENDING TOPICS

Our experimentation starts by collecting reasonable samples of tweets on general topics like health, education, sports, economy, and politics. For this purpose, we run our T3C client to find trending topics both with certain personalization feeds/queries applied and without any personalization. To demonstrate the effects of our RSS-based personalization, we conducted a series of experiments. For the first experiment, we collected 390,801 tweets related to the health topic and applied our personalization mechanism by uploading medical feeds related to cancer/oncology research from MedicalNewsToday (http://www.medicalnewstoday.com/rss/cancer-oncology.xml). For comparison purposes, we used the same sample and filtered trending topics without personalization feeds, using the Twitter filter API and the LDA algorithm. For personalization, we first apply the Twitter filter API to get general health-related tweets, then call our RSS reader to read the client's personalization feeds, and after that we apply the Levenshtein distance algorithm (between the user's personalization feeds and the health-related tweets) followed by the LDA algorithm to finally find the personalized trending topics. Figures 3a and 3b illustrate the comparison between finding trending topics with personalization and without personalization. For this experiment, we fixed several variables: the Dirichlet priors at 0.01 for η and 0.01 for α, the number of topics at 12, the number of samples at 2000, the burn-in period at 200, and the sampling frequency at 5 (see http://alias-i.com/lingpipe/docs/api/index.html).

Figure 3. Comparing health-related trending topics with RSS personalization and without personalization: (a) comparison histogram; (b) comparison graph.

Figures 4a and 4b show the most frequent health-related trending topic words for both the personalized and non-personalized cases. Moreover, we conducted similar experiments using other general topics. For economy and finance, we collected 1,795,211 tweets and used the Economist banking RSS feed (http://www.economist.com/topics/banking/index.xml). For education we collected 89,455 tweets and used the CBC technology feed (http://rss.cbc.ca/lineup/technology.xml). For politics we collected 60,265 tweets and used the CBC politics feed (http://rss.cbc.ca/lineup/politics.xml). Finally, for sports we collected 400,316 tweets and used the CBC sports feed (http://rss.cbc.ca/lineup/sports.xml). We publish all the results of these experiments on our Lakehead University Flash server (http://flash.lakeheadu.ca/~maislam/TestSample). Our experiments show clearly that our RSS-based personalization mechanism finds trending topics that match the user's preferences as expressed through the provided feeds.

Figure 4. Frequency counts for trending topics with and without RSS personalization.

VI. CONCLUSIONS

Among the numerous tweets users receive daily, certain popular issues tend to capture their attention. Such trending topics are of great interest not only to Twitter micro-bloggers but also to advertisers, marketers, journalists and many others. An examination of the state of the art in this area reveals progress that lags its importance [14]. In this article, we have introduced a new method for identifying trending topics using RSS feeds.
In this method we used two algorithms to identify tweets that are similar to the RSS Levenshtein Distance algorithm and the LDA. Although LDA is a popular information retrieval algorithm that have been used also for finding trending topics [12], no attempt that we know have used the RSS feed for personalization. Figure 5 shows a screenshot of GUI of our RSS-Based Personalization Twitter Client (T3C). 226 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 [4] [5] [6] [7] Figure 5. GUI for the RSS-Based T3C Client. We are continuing our attempts to develop more personalization mechanisms that adds more focused identification of personalized trending topics using techniques that utilize machine learning algorithms [13]. The results of these experiments will be the subject of our next article. ACKNOWLEDGMENT [8] [9] [10] [11] Dr. J. Fiaidhi would like to acknowledge the support of NSERC for the research conducted in this article. REFERENCES James Benhardus, Streaming Trend Detection in Twitter, 2010 UCCS REU FOR ARTIFICIAL INTELLIGENCE, NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL FINAL REPORT. [2] Ming Hao et. al, Visual sentiment analysis on twitter data streams,2011 IEEE Conference onVisual Analytics Science and Technology (VAST), 23-28 Oct. 2011, pp277 – 278 [3] Suzumura, T. and Oiki, T., StreamWeb: Real-Time Web Monitoring with Stream Computing, 2011 IEEE [12] [1] © 2012 ACADEMY PUBLISHER [13] [14] International Conference on Web Services (ICWS), 4-9 July 2011, pp620 – 627 Kevin R. Canini, Lei Shi and Thomas L. Griffiths, Online Inference of Topics with Latent Dirichlet Allocation,In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009, http://cocosci.berkeley.edu/tom/papers/topicpf.pdf Bendersky, M. and Croft, W.B. Discovering key concepts in verbose queries. SIGIR '08, ACM Press (2008). Nomoto, Tadashi and Matsumoto, Yuji,EXPLOITING TEXT STRUCTURE FOR TOPIC IDENTIFICATION, Workshop On Very Large Corpora, 1996 Salton, G., & Yang, C. S. (1973). On the specification of term values in automatic indexing. Journal of Documentation, 29(4), 351–372. Hearst, M. (1997). Texttiling: Segmenting text into multiparagraph subtopic passages. Computational Linguistics, 23(1), 33–64. David M. Blei, Andrew Y. Ng and Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003) 993-1022 Alexander Pak and Patrick Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)}, May 19-21, 2010,Valletta, Malta. Alex Hai Wang, Don’t Follow me: Spam Detection in Twitter, IEEE Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT), 26-28 July 2010, http://test.scripts.psu.edu/students/h/x/hxw164/files/SECR YPT2010_Wang.pdf Daniel Ramage, Susan Dumais, and Dan Liebling, Characterizing Microblogs with Topic Models, in Proc. ICWSM 2010, American Association for Artificial Intelligence , May 2010 Kathy Lee, Diana Palsetia, Ramanathan Narayanan, Md. Mostofa Ali Patwary, Ankit Agrawal, and Alok Choudhary, Twitter Trending Topic Classification, 2011 11th IEEE International Conference on Data Mining Workshops, ICDMW2011,pp.251-258. Fang Fang and Nargis Pervin, Finding Trending Topics in Twitter in Real Time, NRICH Research, 2010, Available online: http://nrich.comp.nus.edu.sg/research_topic3.html. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 
3, AUGUST 2012 227 Architecture of a Cloud-Based Social Networking News Site Jeff Luo, Jon Kivinen, Joshua Malo, Richard Khoury Department of Software Engineering, Lakehead University, Thunder Bay, Canada Email: {yluo, jlkivine, jmalo, rkhoury}@lakeheadu.ca Abstract—Web 2.0 websites provide a set of tools for internet users to produce and consume news items. Prominent examples of news production sites include Issuu (issuu.com) and FlippingBook (page-flip.com) which allows users to upload publication files and transform them into flash-animated online publications with integrated socialmedia-sharing and statistics-tracking features. A prominent example of news consumption site is Google News (news.google.com), which allows users some degree of control over the layout of the presentation of news feeds, including trusted news sources and extra category keywords, but offers no real editing and social sharing components. This proposed project bridges the gap between news production sites and news consumption sites in order to offer to any user - including non-profit organizations, student or professional news media organizations, and the average Internet user - the ability to create, share, and consume social news publications in a way that gives users complete control of the layout and content of their paper, the facilities to share designs and article collections socially, as well as provide related article suggestions all in a single easy to use horizontally scaling system. Index Terms—Web 2.0, Web engineering, Cloud computing, News, Social networking, Recommendation systems I. INTRODUCTION Web 2.0 applications, and in particular social networking sites, enjoy unprecedented popularity today. For example, there were over 900,000 registered users on Facebook at the end of March 2012 1 , more than the population of any country on Earth save China and India. In a parallel development, a recent survey by the Pew Research Centre [1] discovered that an overwhelming 92% of Americans get news from multiple platforms, including in 61% of cases online news sources. Moreover, this survey showed that the news is no longer seen as a passive “they report, we read” activity but as an interactive activity, with 72% of Americans saying they follow the news explicitly because they enjoy talking about it with other people. The social aspect of news is dominant online: 75% of online news consumers receive news articles from friends, 52% retransmit those news articles, and 25% of people contribute to them by writing comments. There is also a clear intersection between 1 http://newsroom.fb.com/content/default.aspx?NewsAreaId=22 © 2012 ACADEMY PUBLISHER doi:10.4304/jetwi.4.3.227-233 social networks and news consumption: half of socialnetwork-using news consumers use the social network to receive news items from their friends or from people they follow, and a third use their social networking site to actively share news. However, news portal sites such as GoogleNews remain the single most popular online news source. There is thus a clear user need for a social networking news site. Such a site will combine the interactive social aspect of news that users enjoy with the diversity of news sources that portal sites offer. 
The aim of this paper is to develop a new Web 2.0 site that offers any news publisher - including non-profit organizations, student or professional news media organizations, and the average Internet user - the ability to create and share news publications in a way that gives them complete control over the layout and content of their paper as well as the sharing and accessibility of their publications. The system will allow readers to discover new publications by offering helpful suggestions based on their current interests, their reading patterns, and their social network connections. It will also allow them to share news, comments, and interact generally with other readers and friends in their social network. Finally, the cloud platform will offer a horizontally-scaling and flexible platform for the system. The contribution of this paper is thus to present the design and development of a new Web 2.0 special-purpose social networking site. From a web data mining point of view, implementing and controlling such a system opens up a lot of very interesting avenues of research. The news articles available on the system will create a growing text and multimedia corpus useful for a wide range of research projects, ranging from traditional text classification to specialized applications such as news topic detection and tracking [2]. Social networking platforms now supply data for a wide range of social relationship studies, such as exploring community structures and divides [3]. And feedback from the recommendation system will be useful to determine which variables are more or less influential in human decision-making [4]. The rest of this paper is organized as follows. The next section will present a brief overview of related projects. We will give an overview of the structure of our proposed system in Section III. In Section IV we will present in more details the functional requirements of each of the system’s components. We will bring these components 228 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 together and present the overall architecture of the system in Section V. Next, Section VI will discuss implementation and testing considerations of the system. Finally, we will present some concluding thoughts in Section VII. II. RELATED WORKS There has been some work done in developing alternative architectures for social networking sites. For example, the authors of [5] propose a mobile social network server divided into five components: an HTTP server that interacts with the web, standard profile repository database and privacy control components, a location database that allows the system to keep track of the user’s location as he moves around with his mobile device, and the matching logic component that connects the other four components together. The authors of [6] take it one step further to create a theme-based mobile social network, which is aware not only of the user’s location but also of activities related to his interests in his immediate surroundings, of their duration and of other participants. Our proposed system is also a theme-based social network, as it learns the users’ interests both from what is stated explicitly in their profiles and implicitly from the material they read, and will propose new publications based on these themes. However, the architectures mentioned above were based on having a single central web server, in contrast to our cloud architecture. 
Researchers have been aware for some time of the network congestion issues that comes with the traditional client-server architecture [7]. Cloud-based networking is a growingly popular solution to this problem. A relevant example of this solution is the cloud-based collaborative multimedia sharing system proposed in [8]. The building block of that system is a media server that allows users in a common session to collaborate on multimedia streams in real time. Media servers are created and destroyed according to user demand for the service by a group manager server, and users access the system through an access control server. The entire system is designed to interface with existing social networks (the prototype was integrated to Facebook). By comparison, our system is not a stand-alone component to integrate to an external social network, but an entire and complete social network. There are many open research challenges related to online news data mining. Some examples surveyed in [9] include automated content extraction from the news websites, topic detection and tracking, clustering of related news items, and news summarization. All these challenges are further compounded when one considers that online news sources are multilingual, and therefore elements of automated translation and corpus alignment may be required. These individual challenges are all combined into one in the task of news aggregation [9], or automatically collecting articles from multiple sources and clustering the related ones under a single headline. News aggregate sites are of critical importance however; as the PEW survey noted, they are the single most popular source of online news [1]. Our news-themed © 2012 ACADEMY PUBLISHER social network site would serve as a new type of news aggregate site. Researchers working on recommendation systems have shown that individuals trust people they are close to (family members, close friends) over more distant acquaintances or complete strangers. This connection between relationship degree and trust can be applied to social networks, to turn friend networks in to a Web-oftrust [10]. The trust a user feels for another can be further extrapolated from their joint history (such as the number of public and private messages exchanged), the overlap in their interests, or even simply whether the second user has completed their personal profile on the site they are both members of [4]. It is clear, then, that social networks are a ripe source of data for recommendation systems. Our proposed system is in line with this realization. One of the key areas of applied research today in Cloud is on performance and scalability. The authors of [11] propose a dynamic scaling architecture via the use of “a front-end load balancer routing user requests to web applications deployed on virtual machines (VMs)” [11], aiming to maximizing resource utilization in the VMs while minimizing total number of VMs. The authors also used a scaling algorithm based on threshold number of active user sessions. Our proposed system adopts this approach, but considers the thresholds of both the virtual machines' hardware utilization as well as the number of active user-generated requests and events, instead of sessions. Further, our system adopts the performance architecture principles discussed in [12] to examine the practical considerations in the design and development of performance intelligence architectures. 
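To make the scaling policy described above more concrete, here is a minimal sketch of a threshold-based scaling decision that considers both VM hardware utilization and the number of active user-generated requests, as the text describes. The thresholds, class, and method names are illustrative assumptions, not the system's actual values or code.

    // Illustrative sketch of a threshold-based scaling decision.
    class ScalingPolicy {
        static final double CPU_SCALE_OUT = 0.80;   // assumed average VM utilization above which a VM is added
        static final double CPU_SCALE_IN  = 0.30;   // assumed average VM utilization below which a VM is removed
        static final int    REQ_SCALE_OUT = 1000;   // assumed active user requests/events per VM

        enum Action { SCALE_OUT, SCALE_IN, NONE }

        static Action decide(double avgCpuUtilization, int activeRequests, int vmCount) {
            int requestsPerVm = activeRequests / Math.max(1, vmCount);
            // Scale out if either the hardware load or the request load crosses its threshold.
            if (avgCpuUtilization > CPU_SCALE_OUT || requestsPerVm > REQ_SCALE_OUT) {
                return Action.SCALE_OUT;
            }
            // Scale in only when both signals are comfortably low and more than one VM is running.
            if (vmCount > 1 && avgCpuUtilization < CPU_SCALE_IN && requestsPerVm < REQ_SCALE_OUT / 2) {
                return Action.SCALE_IN;
            }
            return Action.NONE;
        }
    }

In practice the inputs would come from the server-side monitoring data described below, while client-side end-to-end measurements would be used to evaluate whether the chosen thresholds actually preserve the user-perceived performance.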
For performance metrics and measurements, our system adopts the resource, workload, and performance indicators discussed in [13], together with the approach discussed in [14]: server-side monitoring data is used to determine the thresholds and to decide when to trigger a reconfiguration of the cloud, while client-side end-to-end monitoring data is used to evaluate the effectiveness of the performance architecture and implementation designed into the system, as it would be felt and perceived by its users.

Our proposed system thus stands at the intersection of several areas of research. Part of its appeal is that it would combine all these components into a single unified website and serve as a research and development platform for researchers in all these areas. Likewise, the web data generated by the system would be valuable in several research projects. And all this would be done while answering a real user need.

III. SYSTEM OVERVIEW

There are eight major components to our proposed system:
1. The cloud component serves the overall function of performing active systems monitoring and providing a high-performance, horizontally-scaling back-end infrastructure for the system.
2. A relational database system is essential to store and retrieve all the data that will be handled by the system, including user information, social network information, news articles, and layout information.
3. The social network aspect of the system is crucial to turn the passive act of reading news into an active social activity. Social networking will comprise a large portion of the front-end functionalities available to users.
4. A natural language processing engine will be integrated into the system to analyze all the articles submitted. It will work both on individual articles, to detect each article's topic and classify it appropriately, and on sets of articles, to detect trends and discover related articles.
5. A suggestion engine will combine information from both the social network and the natural language processor in order to suggest new reading materials for each individual user.
6. The content layout system is central to the content producer's experience. It will provide the producer with a simple and easy-to-use interface to control all aspects of a news article's layout (placement of text and multimedia, margins and spacing, etc.) and thus to create their own unique experience for the readers.
7. The user interface will give the reader access to the content and will display it in the way the producer designed it. The interface will also give the user access to and control over his involvement in the community through the social network.
8. The business logic component will facilitate user authentication and access control, to ensure users are able to connect to the system and access their designs and article collections, and to prevent them from accessing unauthorized content.

IV. SYSTEM REQUIREMENTS

Each of the eight components listed in Section III is responsible for a set of functionalities in the overall system. The functional requirements these components must satisfy in order for the entire system to work properly are described here.

A. Cloud
The Cloud component responds to changes in processing demand by modifying the amount of available computing resources. It does this by monitoring and responding to traffic volume and resource usage, and by creating or destroying virtual resources as needed.
Additionally, the Cloud component is responsible for distributing the traffic load across the available resources. The Cloud must be designed to support the following functionalities: ability to interface with a Cloud Hypervisor [15] that virtualizes system resources to allow the system to control the operation of the Cloud; ability to perform real time monitoring of the web traffic and workloads in the system; ability to monitor the state and performance of the system, including its individual machines; © 2012 ACADEMY PUBLISHER 229 ability for individual virtual servers within the system to communicate with each other; ability for the virtual servers to load share and load balance amongst each other; ability to distribute workloads evenly across the set of virtual servers within the system; ability to add or remove computing resources into the system based upon demand and load; ability to dynamically reconfigure the topology of virtual servers to optimally consume computing resources; ability to scale horizontally. B. Database The database component is required for persistent storage and optimized data retrieval. The database must be designed to support the following functionalities: represent and store subscribers and authors of each given publication; represent and store layouts of individual articles; represent and store sets of linked articles to form a publication; represent and store the social network. C. Social Network The social network component provides users with a richer content discovery experience by allowing users to obtain meaningful content suggestions. It must support the following functionalities: support user groups (aka friend lists); map social relationships; model user interactions with articles and publications; control sharing and privacy; comment on articles. D. Natural Language Processor The natural language processor allows the articles in the system to be analyzed and used for content suggestions and discovery. A basic version can simply build a word vector for each article, and computes the cosine similarity with the word vectors of other articles and of categories. The natural language processor must support the following features: ability to categorize the topics of an article or newspaper; ability to measure the similarity between different articles. E. Suggestion Engine The suggestion engine is a meaningful content discovery tool for users. One way to suggest new content to users is to display related articles alongside what they are currently viewing. The suggestion engine must support the following features: ability to draw conclusions on the interests of a reader given their activities and relationships on the social network; ability to draw conclusions on the interests of a reader based on the set of articles they have read; 230 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 ability to provide relevant article suggestions based upon the conclusions reached about the user’s interests. F. Content Layout System The layout system will provide content producers with an experience similar to that of a desktop editor application. Page design should be done in-browser with tools that the editor will find familiar. 
The layout system must support the following features: drag and drop content into place; adjust style of content; adjust layout of content including but not limited to adding, removing and adjusting columns; and adjusting size and position of areas for content; save and reload layouts or layout elements; aid positioning using grids or guidelines; save and reload default layouts for a publication. G. User Interface The user interface should render the content for the reader the way the content producer envisioned it to be. Navigation through content should be non-obtrusive. The user interface must support the following features: overall styling during reading experience set by content producer; user account management including but not limited to profile information, authentication, password recovery; content production including but not limited to viewing, adding, modifying, and removing content, issues, and layouts; unobtrusive navigation while immersed in the reading experience; options during the reading experience to comment on and/or share the article to the internal social networking, by email, and/or to the currently ubiquitous social networking sites. H. Business Logic User access to content should be controlled in order to differentiate between a user who may edit content of a given article or publication and a user who may only view the content. Additionally, in the case of paid subscriptions to publications, access control needs to differentiate between users who may or may not read certain articles. The business logics must support the following features: ability to authenticate users using unique login and passwords; ability to enforce access control of user’s data based upon the user’s privileges. I. Non-Functional Requirements The system’s non-functional requirements are consistent with those of other web-based systems. They are as follow: the site must be accessible with all major web browsers, namely Internet Explorer, Firefox, Chrome, and Safari; © 2012 ACADEMY PUBLISHER system-generated data must be kept to a minimum and encoded so as to minimize the amount of bandwidth used; the user interfaces must be easy to understand and use, both for readers, producers, and administrators; the system must be quick to respond and errorfree. V. SYSTEM ARCHITECTURE A. Logical Architecture Figure 1 illustrates the logical connections between the components of our proposed system. The user interface is presented to the user through a browser, and it directly connects to the business and cloud logics module which will use the Suggestion Engine and Layout Engine as needed. It is also connected to the Social Networking module, which provides the social networking functionalities. The data is stored in a Database, which the Natural Language Processor periodically queries to analyze all available articles. The Social Networking module will maintain a Social Graph of all user relationships and interactions. Meanwhile, the Cloud Logics component will monitor the system’s overall performance and interact with the Cloud Hypervisor as needed to adjust the system’s physical structure to respond to levels of user demand. B. Physical Architecture All the end users of our system, be them readers or producers, will connect to the site via a web browser running on any device or platform. The browser connects to the cloud’s software load balancer, which is hosted on a virtual machine. 
The load balancer forwards each request to one of a set of identical web servers (also hosted as virtual machines), which service the requests and may load share amongst themselves. Furthermore, the web servers connect to a set of databases that replicate between each other, for both fail-safe redundancy and throughput. The web servers also connect to the social network. This setup is illustrated in Figure 2.

Figure 1. Logical architecture of the system.

Figure 2. Physical architecture of the system.

C. Architecture Details
The software architecture for the client-side Web UI consists mainly of JavaScript model classes used to represent the locations of elements on the page. Objects of these classes are saved to the database through requests to the web server, or instantiated from saved object states retrieved from the web server. The software architecture for the server side consists mainly of controllers and models: Ruby on Rails uses a Model-View-Controller architecture, in which controllers use instances of Models and Views to render a page for the user, and requests are mapped to member functions of controllers based mainly on the request URI and HTTP method.

The natural language processor is implemented as a client-server system as well. The NLP server is responsible for the language processing functionalities of the system. It uses TCP to receive communications from the NLP client, and it also interfaces directly with the database to fetch information independently of the rest of the system.

The cloud is composed of several interconnected subcomponents. There is a Cloud Controller, which is responsible for real-time monitoring and workload management functions, dynamic cloud reconfiguration and server load balancing, and cloud hypervisor operations. This is the central component of the cloud, which manages the other servers and optimizes the entire system; to illustrate its function, its state chart is presented in Figure 3. The Controller accepts incoming TCP connections from the Load Balancer, and also connects via TCP to each of the Server System Monitors that reside in the servers under its control. The Server System Monitors are a simple monitoring subcomponent which periodically gathers system performance information and forwards it to the Controller through a TCP connection. Both the Controller and the Monitors are written in C++ as Linux applications. The Load Balancer, by contrast, is implemented in PHP as a web application at the forefront of the Cloud. Its function is to accept incoming HTTP requests from users, forward them to designated web servers, and balance the workload so that no single server is over- or under-utilized. It also forwards the responses from the servers back to the users. The Load Balancer communicates with the cloud over TCP to obtain the information for the web servers.
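Purely as an illustration of the monitor-to-controller message flow described above (the real Server System Monitors and Cloud Controller are C++ Linux applications), the following Python sketch shows a monitor that periodically reports a load snapshot to the controller over TCP. The port number, message format, reporting interval and controller address are assumptions, not part of the prototype.

```python
# Illustrative sketch of a Server System Monitor forwarding performance data
# to the Cloud Controller over TCP. The message format, port and interval are
# assumptions; the /proc/loadavg read assumes a Linux host, as in the prototype.
import json
import socket
import time


def collect_stats() -> dict:
    """Gather a minimal performance snapshot of this virtual server."""
    with open("/proc/loadavg") as f:               # 1-minute load average
        load1 = float(f.read().split()[0])
    return {"host": socket.gethostname(), "load1": load1, "ts": time.time()}


def report_forever(controller_host: str, controller_port: int = 9100,
                   interval_s: float = 5.0) -> None:
    """Periodically open a TCP connection and forward one JSON stats message."""
    while True:
        stats = collect_stats()
        with socket.create_connection((controller_host, controller_port)) as sock:
            sock.sendall(json.dumps(stats).encode("utf-8") + b"\n")
        time.sleep(interval_s)


if __name__ == "__main__":
    report_forever("cloud-controller.local")       # hypothetical controller address
```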
VI. PROTOTYPE IMPLEMENTATION AND TESTING

A working prototype of the entire system was implemented and run on a VMware ESXi 5.0 physical server with a quad-core CPU (2 GHz per core) and 3 GB of RAM. Each of the virtual machines within runs the Ubuntu 11.10 Server operating system (32-bit) with virtual hardware settings of 1 CPU core (VMware time-shares the 4 physical CPU cores among the virtual machines) and 5 GB of HDD space. The setup also allocates 384 MB of RAM to the web and load balancer servers, 512 MB of RAM to the MySQL database and NLP servers, and 1024 MB of RAM to the Cloud Controller server. The server is connected to the internet through a router with the ability to set static IP addresses and port forwarding.

A screenshot of the content layout interface is given in Figure 4. While the interface is simple, it gives content producers complete control to add, move and edit text, headings and multimedia items.

Figure 4. The prototype's content layout editor.

A set of test cases was developed and run to verify the functionality of critical features of the system. The cloud controller was tested for its server manipulation features: the ability to create new servers, to configure them correctly, to reconnect to them to check on them, and to delete them when they are no longer needed. Each of these operations also tested the cloud controller's ability to update the network's topology. With these components validated, we then proceeded to test higher-level functionalities, such as the controller's ability to gather usage statistics from the servers, to balance the servers' workloads and optimize the topology, and to forward requests and responses.

Figure 3. State chart of the cloud controller program.

The user interface and database were tested by registering both regular user accounts and publication editor accounts, and executing the legal and illegal functions of both classes of users. The editor could create new publications and new articles inside the publications, and edit these articles using the layout editor shown in Figure 4. The regular user could browse the publications, subscribe to those desired, and see the articles displayed in exactly the way the editor had laid them out. Finally, the natural language processing functionalities of the system (along with that section of the database) were tested by uploading a set of 10 news articles into the system and then having the processor parse them, build word vectors, and compare these to a set of predefined class vectors to classify the articles into their correct topics (a sketch of this vector comparison is given at the end of this section).

The last test we ran was a workload test, designed to verify the robustness and reliability of our cloud controller and architecture. For this test, we used a set of other computers to send HTTP requests, both requests for the website's home page and multi-page requests. Given our system's hardware, we tested setups with one and two web servers. In each case, we measured the system's throughput, CPU usage, disk usage and memory usage. In both setups, we found similar disk, memory, and CPU usage. The throughput was different, however, with the two-web-server setup consistently performing better than the setup with only one web server. On requests for the home page, the two-server setup yielded an average throughput of 258 kB/s against 182 kB/s for the setup with one web server. And for multi-page requests, which required database queries and more processing, the throughput of the two-server setup was 142 kB/s against 102 kB/s for the single-server setup. This improvement in throughput of roughly 40% demonstrates that our system can scale up quite efficiently. To further illustrate, the response times of both setups during our test are shown graphically in Figure 5. In that figure, the higher line is the response time of the single-server setup, the lower line is the response time of the two-server setup, and the individual points are HTTP requests. We can see that throughout the experiments, the single-server setup consistently requires more time to respond to the requests than the two-server setup. While these tests were conducted with static topologies, we expect the response times to "jump" from the single-server line to the two-server line when the system adds a web server dynamically, and the throughput to increase in a similar fashion. Additional jumps are expected as the system adds further servers.

Figure 5. Response times of the two setups tested.
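As a rough illustration of the word-vector comparison used in the NLP test above, the following Python sketch builds term-frequency vectors and assigns an article to the class whose predefined vector gives the highest cosine similarity. The tokenization, the toy class vectors and the function names are assumptions made for illustration; they are not the prototype's actual implementation.

```python
# Minimal sketch of word-vector classification: each article becomes a
# term-frequency vector and is assigned to the class with the most similar
# (cosine similarity) predefined class vector. Sample classes are made up.
import math
import re
from collections import Counter


def word_vector(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(article: str, class_vectors: dict[str, Counter]) -> str:
    vec = word_vector(article)
    return max(class_vectors, key=lambda c: cosine(vec, class_vectors[c]))


if __name__ == "__main__":
    classes = {
        "sports": word_vector("match team goal season league player score"),
        "finance": word_vector("market stock index shares trading investors bank"),
    }
    print(classify("The index fell as investors dumped bank shares.", classes))  # finance
```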
VII. CONCLUSION

Building a Web 2.0 social networking site is a very ambitious project. In this paper, we developed a web application with smart horizontal scaling using a cloud-based architecture, incorporating modern aspects of web technology as well as elements of natural language processing to help readers discover content and help publishers get discovered. The size and scope of this project make it a challenge for any developer, and this paper aims to be a roadmap to help others duplicate or improve upon our architecture. While the system is completely functional in the state described in this paper, there is room to further develop and improve each one of its components. There is a wide range of NLP and recommendation algorithms in the literature, some of which could be adopted to improve the natural language processor and the suggestion engine respectively. New editing tools can be added to the content layout system to give more control to content producers. The design of a better user interface is an open challenge, not just for our system but for the entire software world. Gathering real workload usage data will allow us to fine-tune the cloud's load balancing algorithms. And finally, the social networking component of the system could be both simplified and enhanced by linking our system to an existing social network such as Facebook, Google+, or Twitter. Each new feature and improvement we make in each component will of course require additional testing. And once a more complete and polished version of the site is ready, it should be deployed and used in practice, to gather both real-world usage information and user feedback that will help guide the next iteration of the system.

REFERENCES
[1] Kristen Purcell, Lee Rainie, Amy Mitchell, Tom Rosenstiel, Kenny Olmstead, "Understanding the participatory news consumer: How internet and cell phone users have turned news into a social experience", Pew Research Center, March 2010. Available: http://www.pewinternet.org/Reports/2010/OnlineNews.aspx?r=1, accessed April 2012.
[2] Xiangying Dai, Yunlian Sun, "Event identification within news topics", International Conference on Intelligent Computing and Integrated Systems (ICISS), October 2010, pp. 498-502.
[3] Nam P. Nguyen, Thang N. Dinh, Ying Xuan, My T. Thai, "Adaptive algorithms for detecting community structure in dynamic social networks", Proceedings of IEEE INFOCOM, 2011, pp. 2282-2290.
[4] Chen Wei, Simon Fong, "Social Network Collaborative Filtering Framework and Online Trust Factors: a Case Study on Facebook", 5th International Conference on Digital Information Management, 2010.
[5] Yao-Jen Chang, Hung-Huan Liu, Li-Der Chou, Yen-Wen Chen, Haw-Yun Shin, "A General Architecture of Mobile Social Network Services", International Conference on Convergence Information Technology, November 2007, pp. 151-156.
[6] Jiamei Tang, Sangwook Kim, "Theme-Based Mobile Social Network System", IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing (DASC), 2011, pp. 1089-1095.
[7] Rabih Dagher, Cristian Gadea, Bogdan Ionescu, Dan Ionescu, Robin Tropper, "A SIP Based P2P Architecture for Social Networking Multimedia", 12th IEEE/ACM International Symposium on Distributed Simulation and Real-Time Applications, 2008, pp. 187-193.
[8] Cristian Gadea, Bogdan Solomon, Bogdan Ionescu, Dan Ionescu, "A Collaborative Cloud-Based Multimedia Sharing Platform for Social Networking Environments", Proceedings of the 20th International Conference on Computer Communications and Networks (ICCCN), 2011, pp. 1-6.
[9] Wael M.S. Yafooz, Siti Z.Z. Abidin, Nasiroh Omar, "Challenges and issues on online news management", IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2011, pp. 482-487.
[10] Paolo Massa, Paolo Avesani, "Trust-aware recommender systems", Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys '07), October 2007, Minneapolis, USA, pp. 17-24.
[11] Trieu C. Chieu, Ajay Mohindra, Alexei A. Karve, "Scalability and Performance of Web Applications in a Compute Cloud", IEEE 8th International Conference on e-Business Engineering (ICEBE), Oct. 2011, pp. 317-323.
[12] Prasad Calyam, Munkundan Sridharan, Yingxiao Xu, Kunpeng Zhu, Alex Berryman, Rohit Patali, Aishwarya Venkataraman, "Enabling performance intelligence for application adaptation in the Future Internet", Journal of Communications and Networks, vol. 13, no. 6, Dec. 2011, pp. 591-601.
[13] Jerry Gao, Pushkala Pattabhiraman, Xiaoying Bai, W. T. Tsai, "SaaS performance and scalability evaluation in clouds", 2011 IEEE 6th International Symposium on Service Oriented System Engineering (SOSE), Dec. 2011, pp. 61-71.
[14] Niclas Snellman, Adnan Ashraf, Ivan Porres, "Towards Automatic Performance and Scalability Testing of Rich Internet Applications in the Cloud", 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA), Aug.-Sept. 2011, pp. 161-169.
[15] Bhanu P. Tholeti, "Hypervisors, virtualization, and the cloud: Learn about hypervisors, system virtualization, and how it works in a cloud environment", IBM developerWorks, September 2011. Available: http://www.ibm.com/developerworks/cloud/library/clhypervisorcompare, accessed April 2012.

Analyzing Temporal Query for Improving Web Search

Rim Faiz
LARODEC, IHEC, University of Carthage, Tunisia
E-mail: Rim.Faiz@ihec.rnu.tn

Abstract— The search for pertinent information on the web is a recent concern of the information society. Processing based on statistics is no longer enough to handle (i.e. to search, translate, or summarize) relevant information from texts. The problem is how to extract knowledge while taking into account document contents as well as the context of the query. Requests looking for events taking place during a certain time period (e.g. between 1990 and 2001) do not yet provide the expected results. We propose a method to transform the query in order to "understand" its context and its temporal framework. Our method is validated by the SORTWEB system.
Index Terms— Information Extraction, Semantics of Queries, Web Search, Temporal Expression Identification

I. INTRODUCTION

The Web has become the primary source of information in the world, and the search for relevant information on the Web is considered one of the new needs of the information society. The value of consulting this medium depends on the effectiveness of search engines at retrieving information. The main search engines operate essentially on keywords, but this technique has limitations: thousands of pages are offered in response to each query, yet only some contain the relevant information. To improve the quality of the obtained results, search engines must take into account the semantics of queries. The methods of information processing based on statistics are no longer sufficient to meet the needs of users who want to manipulate (search, translate, summarize, ...) information on the Web. One thing has become necessary: introducing "more semantics" into the search for relevant information from texts.

The extraction of specific information remains the fundamental question of our study. In this sense, it shares the concerns of researchers who have examined text understanding (Sabah, 2001), (Nazarenko and Poibeau, 2004), (Poibeau and Nazarenko, 1999), as well as of those dealing today with the link between the semantic web and textual data (Berners-Lee et al., 2001), (Poibeau, 2004).

The objective of our work is to refine the search for information on the web. It is to treat the content structure and make it usable for other types of automatic processing. Indeed, when the user makes his query, he generally expects to find precisely what he seeks, i.e. to find "the relevant information" without being overwhelmed by a volume of uncontrollable and unmanageable answers.

In the section that follows, we present some new methods based on the analysis of context for improving search on the Web. Then we propose our method, which is based on two concepts: the concept of context in general (Desclés et al., 1997), (Lawrence et al. 1998) and the concept of temporal context (Faiz, 2002), (ElKhlifi and Faiz, 2010). Finally, we present the validation of our method through the SORTWEB system.

II. RELATED WORKS ON TEMPORAL INFORMATION RETRIEVAL

Nowadays, the web is used by people who seek information via a search engine and exploit the results themselves. Tomorrow, the web should primarily be used by automatons that will themselves address the questions asked by people and automatically give the best results. Thus, the web becomes a forum for the exchange of information between machines, allowing access to a very large volume of information and providing the means to manage this information. In this case, a machine can make sense of the volume of information available on the web and thus provide more consistent assistance to people, provided that we endow the machine with some "intelligence". By "intelligence", we mean linking human intelligence with artificial intelligence to optimize information search activities on the web.

The search for information involves the user in a process of interrogating the search engine. The defined query is matched against the indexes of documents, and the documents whose indexes have an adequate "similarity" to the query (i.e. the keywords of the query exist in the resulting documents) are considered relevant. However, the request for information expressed by a query can be an inaccurate description of the user's needs.
In general, when the user is not satisfied with the results of its initial query, he tries to change it so as to identify its needs better. This change in the query is to be reformulated. In general, the reformulation is expressed by removing or adding words. The results of the study by PD Bruza (Bruza and al., 2000), (Bruza and Dennis, 1997), conducted on reformulations made by users themselves have shown that reformulation is often the repetition of the initial JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 request, the adding or the withdrawal of few words, changing the spelling of the request, or the use of its derivatives or abbreviations. In this context, we can cite the system developed by HyperIndex P.D. Bruza and al. (Bruza and Dennis, 1997) (Dennis et al., 2002) relating to a technical reformulation of queries that helps the user to refine or extend the initial request by the addition, deletion or substitution of terms. The terms of reformulation, are extracted from the titles of Web pages. It is a post-interrogation reformulation: the user defines an initial query, after which the resulting titles of Web pages provided by the search system are analyzed as a lattice of terms in order to be used by the HyperIndex search engine. The user can navigate through this HyperIndex giving an overview of all possible forms of reformulation (refinement or enlargement). Other work has been developed in this context, we can cite: - R. W. Van Der Pol, (Van Der Pol, 2003) proposed a system to reformulation pre-interrogation based on the representation of a medical field. This field is organized into concepts linked by a certain number of binary relations (i.e. causes, treats and subclass). The complaints are built in a specification language in which users express their needs. The reformulation of requests is automatic. It takes place in two stages, the first concern the identification of concepts that pairs the need of the user, the second concerns the making up of these terms in order to formulate the request. - A. D. Mezaour (Mezaour, 2004) proposed a method of targeted research documents. The proposed language allows the user to combine multiple criteria to characterize the pages of interest with the use of logical operators. Each criteria specified in a query can target the search for its values (keywords) on a fixed part of the structure of a page (for example, its title) or characterize a particular property of a page (example: URL). By using the logical operators conjunction and disjunction, it is possible to combine the above criteria in order to target both the type of page (html, pdf, etc.) with certain properties of the URL of a page, or characteristics of some key parts (title, body of the document). Mezaour thinks a possibility of improving its approach consists in enriching the initial request by synonyms representing the values of words for each query. According to him, the assessment of his requests passes over relevant documents that do not contain the terms of the request but equivalent synonyms. - O. Alonso (Alonso et al., 2016) proposed a method for clustering and exploring search results based on temporal expressions within the text. They mentioned that temporal reasoning is also essential in supporting the emerging temporal information retrieval research direction (Alonso et al., 2011). In other work (Strötgen et al. 
2012), they present an approach to identify top relevant temporal expressions in documents using expression, document, corpus, and query-based features. They present two relevance functions: one to calculate relevance scores for temporal expressions in general, and © 2012 ACADEMY PUBLISHER 235 one with respect to a search query, which consists of a textual part, a temporal part, or both. - In their work, E. Alfonseca et al. (Alfonseca et al., 2009) showed how query periodicities could be used to improve query suggestions, although they seem to have more limited utility for general topical categorization. - A. Kulkarni et al. (2011), in their work, showed that Web search is strongly influenced by time. They mentioned that the relationship between documents and queries can change as people’s intent changes. They have explored how queries, their associated documents, and query intents change over the course of 10 weeks by analyzing large scale query log data, a daily Web crawl, and periodic human relevance judgments. To improve their work, A. Kulkarni et al. plan to develop a search algorithm that uses the term history in a document to identify the most relevant documents. - A. Kumar et al. (2011) proposed a language modeling approach that builds histograms encoding the probability of different temporal periods for a document. They have shown that it is possible to perform accurate temporal resolution of texts by combining evidence from both explicit temporal expressions and the implicit temporal properties of general words. Initial results indicate this language modeling approach is effective for predicting the dates of publication of short stories, which contain few explicit mentions of years. - Zhao et al. (2012) develop a temporal reasoning system that addresses three fundamental tasks related to temporal expressions in text: extraction, normalization to time intervals and comparison. They demonstrate that their system can perform temporal reasoning by comparing normalized temporal expressions with respect to several temporal relations. We note that, in general, manual reformulation aims at building a new query with a list of terms proposed by the system. In the case of an automatic reformulation, the system will build the new query. However, the method of automatic reformulation, generally, does not take into account the context of the query. The standard model of search tools admits many disadvantages such as limited diversification, competence and performance. While, the establishment of research by the context is much more advantageous. The contextual information retrieval refers to implicit or explicit knowledge regarding the intentions of the user, the user's environment and the system itself. The hypothesis of our work is that making explicit certain elements of context could improve the performance of information research systems. The improved performance of engines is a major issue. Our study deals with a particular aspect: taking into account the temporal context. In order to improve accuracy and allow a more contextual search, we described a method based on the analysis of the temporal context of a query so as to obtain relevant event information. 236 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 III. CONTRIBUTIONS The explosion in the volume of data and the improving of the storage capacity of databases were not accompanied by the development of analytical tools and research needed to exploit this mass of information. 
The realization of intelligent systems research has become an emergency. In addition, queries for responding to requests for information from users become very complex and the extraction of the most relevant data becomes increasingly difficult when the data sources are diverse and numerous. It is imperative to consider the semantics of the data and use this semantics to improve web search. More especially as the results of a search query with a search engine returns a large number of documents which is not easy to manage and operate. Indeed, in carrying out tests on several search engines, we found inefficient engines for queries on a date or a period of time. Therefore, we propose to develop a tool to take into account the temporal context of the query. In this context, we propose an approach, like those aimed at improving the performance of search engines (Agichtein et al., 2001), (Glover et al., 1999, 2001) (Lawrence et al. 2001) such as the introduction of the concept of context, the analysis of web pages or the creation of specific search engines in a given field. The objective of our work is to improve the efficiency and accuracy of event information retrieval on the Web and analyzing the temporal context for understanding the query. Therefore, the matter is to propose more precise queries semantically close to the original user’s queries. Our study consists on the one hand to reformulate queries searching for text documents having an event aspect, i.e. containing temporal markers (i.e. during, after, since, etc.) taking into account the temporal context of the query, and on the other hand, to obtain relevant results specifically responding to the queries. The question that arises is how to find event information and transform collections of data into intelligible knowledge, useful and interesting in the temporal context where we are. We found that, in general, queries seeking one or more events taking place at a given date or during a determined period do not produce the expected results. For example, the scientific discoveries since 1940. In this sample of query, the user wants to seek scientific discoveries since 1940 until today, not for the year 1940 only; it is then to deal with a period of time. Indeed, a standard search engine only searches on the term "1940" and not on the time period in question, from which the idea of the reformulation of the user’s query, basing the search on the term introduced by the user and a combination of words synonymous with the terms of the original query. The processing of the query is mainly done at the context level. The system must be able to understand the timing of the query. Therefore, we provide it with some intelligence (to approach the human reasoning) plus a semantic analysis (for understanding the query). Such a system is very difficult to implement for several reasons: © 2012 ACADEMY PUBLISHER The diversity of documents types on the web (file types: doc, txt, ppt, pdf, ps, etc.), The multitude of languages, The richness of languages: it is very difficult to establish a genuine process of parsing which took into account the structure of each sentence. To do this, we will focus our work on a document type and a type of event queries containing temporal indicators (in the month, in the year, between time and date, etc.). For the identification of temporal expressions, we used our method of automatic filtering of temporal information we have developed in earlier works, (Faiz and Biskri, 2002). 
The temporal information in the query is retrieved by identifying temporal markers (since, during, before, until, ...) or by the presence of an explicit date in the query. Then, to interpret these terms and to search for event information taking place on a date or over a period, we propose a time representation based on the concept of interval (Allen and Ferguson, 1994). This representation is based on the start and end dates of events (punctual or instantaneous events and durative events). Moreover, in view of the type of queries we study and of temporal markers such as "before", "after" and "until", we need to express them in terms of intervals.

There are two types of events: punctual or instantaneous events (Evi) and durative events (Evd). For an instantaneous event, the beginning date is equal to the ending date of the event: d(Evi) = f(Evi). A durative event is one that takes place without interruption: d(Evd) ≠ f(Evd). More generally, we consider that an event E admits a start date d(E) and an end date f(E), with d(E) ≤ f(E). We thus distinguish events of zero duration, d(E) = f(E), which are expressed, for example, by the phrase "in + (date)" (e.g. "in 2001"), and events of non-zero duration, d(E) ≠ f(E), whose interval is [d(E), f(E)] and which are expressed, for example, by "between 1990 and 2001" or "since 1980". The temporal granularity on which we base our examples is the year.

In our work, we need to represent the temporal information contained in the query in the form of an interval, in order to use it as additional information for the query. Thus, we apply interpretation rules to determine the time interval explicitly. Example: if the query contains "from" + beginning_year, then interval = [beginning_year, current_year]. So, if a document contains the sought event taking place during the generated time interval, we consider it relevant. To better understand the context of the query made by the user, we also considered extending the query by adding words synonymous with the event in question. Example: for the word "attack", we use the synonyms "attack, explosion, crime", etc.
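To make these interpretation rules concrete, the following sketch shows one possible way to normalize the temporal part of a query into a year interval and to enrich the event part with synonyms. The regular expressions, the marker list and the miniature synonym base are assumptions made for illustration; this is not the SORTWEB implementation itself.

```python
# Rough sketch of the interpretation rules described above: detect a temporal
# marker, normalize it into a year interval, and enrich the event part with
# synonyms. Marker patterns and the toy synonym base are illustrative assumptions.
import datetime
import re

CURRENT_YEAR = datetime.date.today().year
MARKERS = {"since", "from", "between", "and", "in", "before", "until", "during"}
SYNONYMS = {"attack": ["attack", "explosion", "crime"]}   # assumed synonym base


def interpret_interval(query: str):
    """Return (start_year, end_year) for the temporal part of the query, or None."""
    m = re.search(r"\bbetween\s+(\d{4})\s+and\s+(\d{4})", query, re.I)
    if m:
        return int(m.group(1)), int(m.group(2))        # durative event: d(E) != f(E)
    m = re.search(r"\b(?:since|from)\s+(\d{4})", query, re.I)
    if m:
        return int(m.group(1)), CURRENT_YEAR           # e.g. "since 1980" -> [1980, now]
    m = re.search(r"\bin\s+(\d{4})", query, re.I)
    if m:
        year = int(m.group(1))
        return year, year                              # punctual event: d(E) = f(E)
    return None


def reformulate(query: str):
    """Split the query into an enriched event part and a normalized interval."""
    interval = interpret_interval(query)
    event_terms = [w for w in re.findall(r"[a-z]+", query.lower()) if w not in MARKERS]
    enriched = []
    for term in event_terms:
        enriched.extend(SYNONYMS.get(term, [term]))
    return enriched, interval


if __name__ == "__main__":
    print(reformulate("attack since 2000"))
    # (['attack', 'explosion', 'crime'], (2000, <current year>))
```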
IV. VALIDATION OF THE PROPOSED METHOD: THE SORTWEB SYSTEM

We validated our work by developing the SORTWEB system (System Optimization of Time Queries on the Web), which improves Web search through automatic query reformulation in order to obtain relevant results and meet the expectations of the user. This reformulation is done by automatically adding terms synonymous with the event sought. Enriching the request allows a better search, with results derived from the search terms and their synonyms rather than only from the terms entered by the user. The process is as follows. A query of the form "event + temporal marker + date or time period" is analyzed and segmented into two parts by detecting a temporal marker (in the month, in the year, since, ...): the event part, containing the description of the event sought, which is transformed and reformulated using the synonym base so as to enrich the query with terms of the same meaning and thus take account of the semantics of the request; and the part with the date or time period during which the event took place. The web search is launched once these changes have been made to the query (cf. Figure 1). The latter part does not always take the form of an explicit date (for example: "last year", "next year") or of an interval (for example: "during this century", "since 1990"); it is therefore processed into the standard form of a date or a time interval. For example, "since 1980" will be represented by the interval [1980, 2006].

Figure 1. SORTWEB system architecture.

The search itself is done using a search engine. Each returned document is then downloaded, analyzed and filtered according to the temporal constraint of the request. The filtering consists of traversing the documents and verifying whether the selected information respects the semantics of the initial request. After going through all the downloaded documents, only the addresses of the documents considered relevant are added to the results page.

To test and validate our system, we launched the same requests (for example, "attacks since 2000", "wars between 1990 and 2002") on several search engines such as Google and Yahoo. We observed that our system returns a much smaller number of documents than those engines do directly. In addition, the returned results contain relevant documents obtained through the search of synonyms and not only through the terms of the initial request. For example, the query "the attacks since 2000" is processed by our system, and the search is carried out using the term "attack" as well as the terms "explosion" and "crime". The use of synonyms is very important because the user may be interested in documents containing not only the word "attack" but also other words in the same context.

It should be noted that the evaluation of an information retrieval system is measured by the degree of relevance of its results. The difficulty lies in the fact that user relevance is different from system relevance: in general, in a relevant document the user can find the information he needs, and we speak of user relevance when the user considers that a document meets his needs, whereas system relevance is judged through the matching function used. To determine the relevance of the obtained results, we conducted an evaluation by human experts and found that 80% of the results were relevant. We also calculated the accuracy to evaluate the quality of the answers provided by the system; the results of the tests were measured using the accuracy rate as follows: Accuracy = (number of relevant documents found / number of documents found) = 80.6%.

While our method has many advantages, such as minimizing the number of results while keeping their relevance, and returning documents found through words (synonyms of the user's request terms) added automatically by the system, it also opens up new avenues of study. One of the perspectives we intend to pursue is improving the search for event information. We plan to work further on well-known events in the countries where they occur, such as the event "pilgrimage", which may be associated with "Saudi Arabia".

V. CONCLUSION

The new generation of search engines differs from the previous generation in that these engines increasingly incorporate techniques beyond simple keyword search, adding other methods to improve results, such as the introduction of the concept of context, the analysis of web pages, or the creation of search engines specific to a given area. Thus, the improved performance of engines is a major issue.
Our study addresses a particular aspect: taking into account the temporal context. In order to improve accuracy and allow a more contextual search, we have described a method based on the analysis of the temporal context of a query to obtain relevant event information.

REFERENCES
[1] Alfonseca, E., Ciaramita, M. and Hall, K. (2009), Gazpacho and summer rash: Lexical relationships from temporal patterns of Web search queries. In Proceedings of EMNLP 2009, pp. 1046-1055.
[2] Agichtein E., Lawrence S. and Gravano L. (2001), Learning Search Engine Specific Query Transformations for Question Answering. Proceedings of the Tenth International World Wide Web Conference, WWW10, May 1-5.
[3] Allen J.F., Ferguson G. (1994), Actions and Events in Interval Temporal Logic. Journal of Logic and Computation, vol. 4, no. 5, pp. 531-579.
[4] Alfonseca, E., Ciaramita, M. and Hall, K. (2009), Gazpacho and summer rash: Lexical relationships from temporal patterns of Web search queries. In Proceedings of EMNLP 2009, pp. 1046-1055.
[5] Alonso, O. and Gertz, M. (2006), Clustering of search results using temporal attributes. In Proceedings of SIGIR 2006, pp. 597-598.
[6] Alonso O., Strötgen J., Baeza-Yates R. and Gertz M. (2011), Temporal information retrieval: Challenges and opportunities. TWAW 2011, Hyderabad, India, pp. 1-8.
[7] Berners-Lee T., Hendler J. and Lassila O. (2001), The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American.
[8] Bruza P., McArthur R., Dennis S. (2000), Interactive internet search: keyword, directory and query reformulation mechanisms compared. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, Athens, ACM Press, pp. 280-287.
[9] Bruza P.D. and Dennis S. (1997), Query Reformulation on the Internet: Empirical Data and the Hyperindex Search Engine. Proceedings of RIAO-97, Computer Assisted Information Searching on the Internet.
[10] Dennis S., Bruza P., McArthur R. (2002), Web searching: A process-oriented experimental study of three interactive search paradigms. Journal of the American Society for Information Science and Technology, vol. 53, no. 2, pp. 120-133.
[11] ElKhlifi A. and Faiz R. (2010), French-written Event Extraction based on Contextual Exploration. Proceedings of the 23rd International FLAIRS Conference, AAAI Press, California, USA.
[12] Faiz R. and Biskri I. (2002), Hybrid approach for the assistance in the events extraction in great textual data bases. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (IEEE SMC 2002), Tunisia, 6-9 October, Vol. 1, pp. 615-619.
[13] Faiz R. (2002), Exev: extracting events from news reports. Actes des Journées internationales d'Analyse statistique des Données Textuelles (JADT 2002), A. Morin and P. Sébillot (Eds.), Vol. 1, France, pp. 257-264.
[14] Faiz R. (2006), Identifying relevant sentences in news articles for event information extraction. International Journal of Computer Processing of Oriental Languages (IJCPOL), World Scientific, Vol. 19, No. 1, pp. 1-19.
[15] Glover E., Flake G., Lawrence S., Birmingham W., Giles C.L., Kruger A. and Pennock D. (2001), Improving category specific web search by learning query modifications. In Symposium on Applications and the Internet (SAINT-2001), pp. 23-31.
[16] Glover E., Lawrence S., Gordon M., Birmingham W., Giles C.L. (1999), Architecture of a Metasearch Engine that Supports User Information Needs.
Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM 99), ACM, pp. 210-216.
[17] Kulkarni A., Teevan J., Svore K. M. and Dumais S. T. (2011), Understanding temporal query dynamics. WSDM '11, February 9-12, 2011, Hong Kong, China.
[18] Kumar A., Lease M., Baldridge J. (2011), Supervised language modeling for temporal resolution of texts. CIKM 2011, pp. 2069-2072.
[19] Lawrence S., Coetzee F., Glover E., Pennock D., Flake G., Nielsen F., Krovetz R., Kruger A., Giles C.L. (2001), Persistence of Web References in Scientific Research. IEEE Computer, Vol. 34, pp. 26-31.
[20] Mezaour A. (2004), Recherche ciblée de documents sur le Web [Targeted document search on the Web]. Revue des Nouvelles Technologies de l'Information (RNTI), D.A. Zighed and G. Venturini (Eds.), Cépaduès-Editions, vol. 2, pp. 491-502.
[21] Nazarenko A., Poibeau T. (2004), L'évaluation des systèmes d'analyse et de compréhension de textes [Evaluation of text analysis and understanding systems]. In L'évaluation des systèmes de traitement de l'information, Chaudiron S. (Ed.), Paris, Lavoisier.
[22] Poibeau T. (2004), Annotation d'informations textuelles : le cas du web sémantique [Annotation of textual information: the case of the semantic web]. Revue d'Intelligence Artificielle (RIA), vol. 18, no. 1, Paris, Editions Hermès, pp. 139-157.
[23] Poibeau T., Nazarenko A. (1999), L'extraction d'information, une nouvelle conception de la compréhension de texte [Information extraction, a new conception of text understanding]. Traitement Automatique des Langues (TAL), vol. 40, no. 2, pp. 87-115.
[24] Sabah G. (2001), Sens et traitements automatiques des langues [Meaning and automatic language processing]. In Pierrel J. M. (Ed.), Ingénierie des langues, Hermès.
[25] Strötgen J., Alonso O. and Gertz M. (2012), Identification of Top Relevant Temporal Expressions in Documents. In TempWeb 2012: 2nd Temporal Web Analytics Workshop (together with WWW 2012), Lyon, France.
[26] Van der Pol R.W. (2003), Dipe-D: A Tool for Knowledge-Based Query Formulation in Information Retrieval. Information Retrieval, vol. 6, no. 1, pp. 21-47.
[27] Zhao R., Do Q., Roth D. (2012), A Robust Shallow Temporal Reasoning System. Proceedings of the Demonstration Session at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL 2012), pp. 29-32, Montréal, Canada.

Dr. Rim Faiz obtained her Ph.D. in Computer Science from the University of Paris-Dauphine, France. She is currently a Professor of Computer Science at the University of Carthage, Institute of High Business Study (IHEC) at Carthage, Tunisia. Her research interests include Artificial Intelligence, Machine Learning, Natural Language Processing, Information Retrieval, Text Mining, Web Mining and the Semantic Web. She is a member of the scientific and organization committees of several international conferences and has several publications in international journals and conferences (AAAI, IEEE, ACM, ...). Dr. Faiz is also in charge of the Professional Master "Electronic Commerce" and the Research Master "Business Intelligence applied to Management" at IHEC of Carthage.
Trend Recalling Algorithm for Automated Online Trading in Stock Market

Simon Fong, Jackie Tai
Department of Computer and Information Science, University of Macau, Taipa, Macau SAR
Email: ccfong@umac.mo, ma56562@umac.mo

Pit Pichappan
School of Information Systems, Al Imam University, Riyadh, Saudi Arabia
Email: pichappan@dirf.org

Abstract—Unlike financial forecasting, the mechanical trading technique called Trend Following (TF) does not predict any market movement; instead it identifies a trend early in the trading day and then trades automatically according to a pre-defined strategy, regardless of how the market moves at run time. Trend following trading has a long and successful history among speculators. The traditional TF trading method relies on human judgment to set the rules (aka the strategy), and the TF strategy is subsequently executed in a purely objective, operational manner. Finding the correct strategy at the beginning is crucial in TF. This usually involves human intervention in first identifying a trend and then configuring when to place an order and when to close it out, once certain conditions are met. In this paper, we present a new type of TF, namely the Trend Recalling algorithm, which operates in a totally automated manner. It works by partially matching the current trend with one of the proven successful patterns from the past. Our experiments, based on real stock market data, show that this algorithm has an edge over other trend following methods in profitability. The new algorithm is also compared to the time-series forecasting type of stock trading, and in a simulation it can even outperform the best forecasting method.

Index Terms—Trend Following Algorithm, Automated Stock Market Trading

I. INTRODUCTION

Trend following (TF) [1] is a reactive trading method that responds to the real-time market situation; it does neither price forecasting nor prediction of any market movement. Once a trend is identified, it activates the predefined trading rules and adheres rigidly to them until the next prominent trend is detected. Trend following does not guarantee a profit every time; nonetheless, over a long period of time it will probably profit by obtaining more gains than losses. Since TF is an objective mechanism that is totally free from human judgment and technical forecasting, the trends and patterns of the underlying data play the primary role in deciding its ultimate performance. It was already shown in [2] that market fluctuation adversely affects the performance of TF. Although financial cycles are known phenomena, it remains controversial whether cycles can be predicted, or whether past values cannot forecast future values because they are random in nature. Nonetheless, we observed that although cycles cannot be easily predicted, the abstract patterns of such cycles can be practically recalled and used for simple pattern matching. The formal interpretation of a financial cycle (better known as an economic cycle) refers to economy-wide fluctuations in production or economic activity over several months or years. Here we consider it as the cycle that runs continuously between bull market and bear market; some people refer to this as the market cycle (the two are, in any case, highly correlated). In general a cycle is made up of four stages: "(1) consolidation (2) upward advancement (3) culmination (4) decline" [3]. Despite being termed cycles, they do not follow a mechanical or predictable periodic pattern.
However, similar patterns have been observed to keep repeating themselves in the future, in approximate shapes; it is just a question of when. We can anticipate that some exceptional peak (or other particular pattern) of the market trend that happens today will one day happen again, just as it happened in history. For instance, in the "1997 Asian Financial Crisis" [4] the Hang Seng Index in Hong Kong plunged from top to bottom (stages 3 to 4); about ten years later the scenario repeated itself in the "2008 Financial Crisis" [5] with a similar pattern.

Dow Theory [6] describes the market trend (part of the cycle) as three types of movement. (1) The "primary movement", main movement or primary trend can last from a few months to several years; it can be either a bullish or a bearish market trend. (2) The "secondary movement", medium trend or intermediate reaction may last from ten days to three months. (3) The "minor movement" or daily swing varies from hours to a day. The primary trend is a part of the cycle and consists of one or several intermediate reactions, while the daily swings are the minor movements that make up all the detailed movements. Now, if we project the previous assumption that the cycle keeps rolling continuously onto the minor daily movements, can we assume that the trend that happens today may also appear again some days later?

Here is an example of this assumption. Figure 1 shows two intra-day trend graphs of Hang Seng Index Futures, sourced from two different dates: 2009-12-07 (top) and 2008-01-31 (bottom). Although they are not exactly the same, in terms of major upward and downward trends the two graphs do look alike. This is the underlying concept of our trend recalling trading strategies, which are based on searching for similar patterns from the past. This concept is valid for TF because TF works by smoothing out the averages of the time series: minor fluctuations or jitters along the trend are averaged out. This is important because TF is known to work well on major trending cycles, i.e. the major outlines of the market trend.

The paper is structured as follows. Details of the trend recalling algorithm are presented in Section 2, step by step. Simulation experiments for evaluating the performance of the Trend Recalling algorithm in automated trading are carried out in Section 3; in particular, we compare the Trend Recalling algorithm with a selected time series forecasting algorithm. Section 4 concludes the paper.

Figure 1. Intra-day trend graphs of 2009-12-07 and 2008-01-31.

II. RECALLING PAST TRENDS

An improved version of the Trend Following algorithm, called Trend Recalling, is proposed in this paper; it looks back to the past for reference when selecting the best trading strategy. It works exactly like TF except that the trading rules are borrowed from one of the best-performing past trends that most closely matches the current one. The design of a TF system is grounded on the rules that are summarized by Michael W. Covel into the following five questions [7]:
1. How does the system determine what market to buy or sell at any time?
2. How does the system determine how much of a market to buy or sell at any time?
3. How does the system determine when you buy or sell a market?
4. How does the system determine when you get out of a losing position?
5. How does the system determine when you get out of a winning position?
There is no standard answer to these questions; likewise, there exists no definite guideline for how the trading rules in TF should be implemented. The first and second questions are already answered in our previous works [1][2]. The third question is rather challenging: it is actually the core decision maker in the TF system and where the key to making a profit lies; questions 4 and 5 are related to it. Suppose that we have found a way to identify a trend signal to buy or sell, and we have a position opened. If the system along the way identifies another trend signal that complies with the direction of the currently opened position, we should keep the position open, since the signal suggests that the trend is not yet over. However, if the signal runs counter to the current position, we should probably close out, regardless of whether we are currently winning or losing, as it indicates a trend reversal. Our improved TF algorithm is designed to answer this question of when to buy or sell; the clue is derived from the most similar past trend.
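The open/close rule just described can be summarized in a few lines of code. The sketch below is illustrative only: the +1/-1 encoding of buy/sell signals and the Position class are assumptions, not anything taken from the authors' implementation.

```python
# Sketch of the position-management rule described above: a new trend signal in
# the same direction keeps the position open; a counter signal closes it out.
# The +1/-1 signal encoding and the Position class are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

LONG, SHORT = +1, -1


@dataclass
class Position:
    direction: int      # LONG or SHORT
    entry_price: float


def on_signal(position: Optional[Position], signal: int, price: float) -> Optional[Position]:
    """Apply one trend signal to the (possibly empty) open position."""
    if position is None:
        return Position(direction=signal, entry_price=price)   # open on the first signal
    if signal == position.direction:
        return position                                         # trend not over: keep it open
    return None                                                 # counter signal: close out


if __name__ == "__main__":
    pos = on_signal(None, LONG, 22000.0)
    pos = on_signal(pos, LONG, 22150.0)    # same direction -> stays open
    pos = on_signal(pos, SHORT, 22080.0)   # trend reversal -> closed out
    print(pos)                             # None
```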
Figure 1. Intra-day of 2009-12-07 and 2008-01-31 day trend graphs.

II. RECALLING PAST TRENDS

An improved version of the Trend Following algorithm, called Trend Recalling, is proposed in this paper; it looks back to the past for reference when selecting the best trading strategy. It works exactly like TF except that the trading rules are borrowed from one of the best performing past trends that most closely matches the current trend. The design of a TF system is grounded on the rules summarized by Michael W. Covel into the following five questions [7]:
1. How does the system determine what market to buy or sell at any time?
2. How does the system determine how much of a market to buy or sell at any time?
3. How does the system determine when you buy or sell a market?
4. How does the system determine when you get out of a losing position?
5. How does the system determine when you get out of a winning position?
There is no standard answer to the above questions; likewise there exists no definite guideline for how the trading rules in TF should be implemented. The first and second questions are already answered in our previous works [1][2]. The third question is rather challenging; it is actually the core decision maker in the TF system and where the key to making profit lies, and questions 4 and 5 are related to it. Suppose that we have found a way to identify a trend signal to buy or sell, and we have a position opened. If the system along the way identifies another trend signal that complies with the direction of the currently opened position, we should keep the position open, since the signal suggests that the trend is not yet over. However, if it runs counter to the current position, we should probably close out, regardless of whether we are currently winning or losing, as it indicates a trend reversion. Our improved TF algorithm is designed to answer this question of when to buy or sell. The clue is derived from the most similar past trend.
It is a fact that financial cycles do exist, and it is hypothesized that a trend on a particular day from the past could happen again some days later. This assumption supports the Trend Recalling trading mechanism, which is the basic driving force that our improved trend following algorithm relies on. The idea is expressed as a process diagram in Figure 2. As can be seen in the diagram, there are four major processes for decision making, namely Pre-processing, Selection, Verification and Analysis. Figure 2 shows how our improved TF model works by recalling a trading strategy that used to perform well in the past, matching the current shape of the pattern to that of the old time. A handful of such patterns and their corresponding trading strategies are short-listed; one strategy is picked from the list after thorough verification and analysis.

A. Pre-processing

In this step, raw historical data collected from the past market are archived into a pool of samples. The pool size is chosen arbitrarily by the user; five years of data were archived in the database in our case. A sample is a day trend from the past with the corresponding trading strategy attached. The trend serves as an index pattern for locating the winning trading strategy, which takes the format of a sequence of buy and sell decisions. A good trading strategy is one that maximized profit in the past given the specific market trend pattern. This past pattern, deemed to be similar to the current market trend, serves as guidance for locating the strategy to be applied for decision making during the current trade session. Because such a past day trend yielded a good profit before, reusing its strategy comes close to guaranteeing a trading strategy superior to human judgment or to a complex time series forecasting algorithm. The past samples and their best trading strategies are indexed by an indicator that we name "EDM" (exponential divergence in movement). EDM is a crisp-valued indicator based on the difference of two moving averages:

EDM(t) = f( EMA_s(t) - EMA_l(t) )  (1)

EMA(t) = ( price(t) - EMA(t-1) ) * 2/(n+1) + EMA(t-1)  (2)

Figure 2. Improved TF process with Recalling function.
Where price(t) is the current price at any given time t, n is the number of periods, s denotes a shorter period of Exponential Moving Average, EMA(t) at time t, l represents a longer period EMA(t), f(.) is a function for generating the crisp result. The indicator sculpts the trend; and based on this information, a TF program finds a list of best trading strategies, which can potentially generate high profit. The following diagram in Figure 3 is an example of pre-processing a trend dated on 2009-12-07 that shows the EDM. As indicated from the diagram the program first found a long position at 10:00 followed by a short position at around 10:25, then a long position at 11:25, finally a short position around 13:51 and closes it out at the end of the day, which reaps a total of 634 index points. Each index point is equivalent to $50 Hong Kong dollars (KKD). In Hong Kong stock market, there is a two hours break between morning and afternoon sessions. To avoid this discontinuation on the chart, we shift the time backward, and joined these two sessions into one, so 13:15 is equivalent to 15:15. Figure 3. Example of EDM and preprocessed trend of 2009-12-07. © 2012 ACADEMY PUBLISHER JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 243 2009-12-07 closed market day trend Best fitted sample 2008-01-31 Figure 4. Example of 2009-12-07 sample (above) and its corresponding best fitness (2008-01-31) day trend and RSI graph (below). B. Selection Once a pool of samples reached a substantial size, the Trend Recalling mechanism is ready to use. The stored past samples are searched and the matching ones are short-listed. The goal of this selection process is to find the most similar samples from the pool, which will be used as a guideline in the forthcoming trading session. A foremost technical challenge is that no two trends are exactly the same, as they do differ from day to day as the market fluctuates in all different manners. Secondly, even two sample day trends look similar but their price ranges can usually be quite different. With consideration of these challenges, it implies that the sample cannot be compared directly value to value and by every interval for a precise match. Some normalization is necessary for enabling some rough matches. Furthermore the comparison should allow certain level of fuzziness. Hence each sample trend should be converted into a normalized graph, and by comparing their rough edges and measure the difference, it is possible to quantitatively derive a numeric list of similarities. In pattern recognition, the shape of an image can be converted to an outline like a wire-frame by using © 2012 ACADEMY PUBLISHER some image processing algorithm. The same type of algorithm is used here for extracting features from the trend line samples for quick comparison during a TF trading process. In our algorithm, each sample is first converted into a normalized graph, by calculating their technical indicators data. A popular indicator Relative Strength Index (RSI) has a limited value range (from 1 to 100), which is suitable for fast comparison, and they are sufficient to reflect the shape of a trend. In other words, these indicators help to normalize each trend sample into a simple 2D line graph. We can then simply compare each of their differences of shapes by superimposing these line graphs on top of each other for estimating the differences. 
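To make the two ingredients of this step concrete, the sketch below implements the EMA recurrence and EDM indicator of equations (1)-(2), together with a rough sum-of-absolute-differences distance between two normalized (e.g. RSI) day-trend graphs. Taking f(.) to be the plain EMA difference and using this particular distance measure are assumptions made for the sake of a runnable example; the paper leaves both unspecified.

import java.util.Arrays;

// Sketch of the EDM indicator (equations (1)-(2)) and a simple shape
// distance between two normalized day trends (e.g. their RSI graphs).
public class TrendRecallSketch {

    // EMA(t) = (price(t) - EMA(t-1)) * 2/(n+1) + EMA(t-1), seeded with the first price.
    static double[] ema(double[] price, int n) {
        double[] out = new double[price.length];
        out[0] = price[0];
        double k = 2.0 / (n + 1);
        for (int t = 1; t < price.length; t++) {
            out[t] = (price[t] - out[t - 1]) * k + out[t - 1];
        }
        return out;
    }

    // EDM(t) = f(EMA_s(t) - EMA_l(t)); here f is the identity (assumption).
    static double[] edm(double[] price, int shortN, int longN) {
        double[] s = ema(price, shortN), l = ema(price, longN);
        double[] out = new double[price.length];
        for (int t = 0; t < price.length; t++) out[t] = s[t] - l[t];
        return out;
    }

    // Rough distance between two normalized graphs of equal sampling:
    // the smaller the value, the more alike the two day trends.
    static double shapeDistance(double[] a, double[] b) {
        double d = 0;
        int len = Math.min(a.length, b.length);
        for (int t = 0; t < len; t++) d += Math.abs(a[t] - b[t]);
        return d;
    }

    public static void main(String[] args) {
        double[] prices = {100, 101, 103, 102, 104, 106, 105};
        System.out.println(Arrays.toString(edm(prices, 2, 5)));
    }
}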
This comparison approach produces a hierarchical similarity list, so that we can get around the inexact matching problem and allow a certain level of fuzziness without losing the similarity attributes. Figure 4 shows an example of two similar sample trend graphs with the RSI displayed: the blue line is the original market trend, the red line is the moving average and the green line is the RSI.

C. Verification

During this process, each candidate from the list is tested against the current market state. Ranked from the smallest difference value (the most similar) downwards, the candidates are passed through a fitness test. Each trend sample corresponds to a specific trading strategy (already established in the pre-processing step). Each trading strategy is extracted and evaluated against historical data; the strategy is then tested on how well it performs as a trial, and each trial performance is recorded. The trial performance is used as a criterion to rearrange the list. Here we have an example before and after the fitness test, which was run on 2009-12-07 in the middle of a simulated trade session. The comparison is done based solely on the EDM indicator of the moving market price. Verification is needed because the selection of these candidates is a best-effort approach: the current and past market situations may still differ to a certain extent.

D. Confirmation

After the verification process is done, the candidate list is re-sorted according to the fitness test results. The fittest one is used as the reference trading strategy during the subsequent TF decision making. To further improve performance on top of referencing the past best strategy, some technical analysis should be consulted as well. Following the advice of Richard L. Weissman in his book [8], the two-moving-average crossover system should be used as signal confirmation. A cross-over means that a rise in the market price starts to emerge; it must cross over its averaged trend. The two-moving-average crossover system entails the rise of a second, shorter-term moving average. Instead of simple moving averages, however, exponential moving averages (EMA) of the RSI should be used, that is, a short-term RSI EMA and a long-term RSI EMA crossover system. When a changing trend is confirmed and it appears to be a good trading signal, the crossover system must also be referenced to check whether it gives a consistent signal; otherwise the potential change in trend is considered a false signal or intermittent noise. For example, in our case the trading strategy from the recalled sample hints at a long position trade, so we check whether the RSI crossover system shows the short-term EMA crossing over its long-term EMA or not.

Figure 5. Fitness test applied on 2009-12-07 at the time 14:47.
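A minimal sketch of the confirmation check just described: a hinted long trade is accepted only when the short-term RSI EMA has just crossed above the long-term RSI EMA, and a hinted short trade only on a downward cross. The two EMA series are assumed to be precomputed elsewhere, and the exact crossing test is an illustrative choice.

// Confirmation sketch: a recalled trading hint is accepted only when the
// short-term EMA of the RSI crosses the long-term EMA of the RSI in the
// same direction as the hint.
public class CrossoverConfirmation {

    enum Hint { LONG, SHORT }

    // rsiEmaShort / rsiEmaLong hold the EMA of the RSI series up to time t.
    static boolean confirms(Hint hint, double[] rsiEmaShort, double[] rsiEmaLong, int t) {
        if (t < 1) return false;  // not enough history yet
        boolean crossedUp   = rsiEmaShort[t - 1] <= rsiEmaLong[t - 1]
                           && rsiEmaShort[t]     >  rsiEmaLong[t];
        boolean crossedDown = rsiEmaShort[t - 1] >= rsiEmaLong[t - 1]
                           && rsiEmaShort[t]     <  rsiEmaLong[t];
        // A long hint needs an upward cross-over; a short hint a downward one.
        return hint == Hint.LONG ? crossedUp : crossedDown;
    }
}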
In addition to validating the hinted trading signals from the past strategy, market volatility should be considered during decision making. It was found in our previous work [2] that the performance of TF is affected mostly by market fluctuation; losses resulted because frequent wrong trading actions were triggered by the TF rules when the market fluctuated too often. The market fluctuation is therefore fuzzified as a fuzzy volatility indicator. This fuzzy volatility indicator is embedded in the TF mechanism to monitor the volatility automatically and to proceed with trading only when the market fluctuation is neither too high nor too low. When the reading of this fuzzy system is too high or too low, the system closes out the position. There are many ways to calculate volatility; the most common one is finding the standard deviation of an asset's closing price over a year. The central concept of volatility is the amount of change or variance in the price of the asset over a period of time. So, we can measure market volatility simply by the following equations:

Volatility(t) = SMA_n( ln(price(t)) - ln(price(t-1)) ) * C  (3)

SMA(t) = ( Close(t) + Close(t-1) + ... + Close(t-n+1) ) / n  (4)

where ln(.) is the natural logarithm, n is the number of periods, t is the current time, and C is a constant that enlarges the value to a significant figure. SMA is the Simple Moving Average, i.e., the average stock price over a certain period of time. By observing how the equation responds to historical data, we found the maximum volatility to be ±15 percent. Based on the previous fluctuation test result, we can define it as the following fuzzy membership.

Figure 6. Fitness test applied on 2009-12-07 at the time 14:47.

During the trading session, volatility is constantly referenced while the following rules apply in the TF system:
IF volatility is too positive high and a long position is opened THEN close it out
IF volatility is too positive high and no position is opened THEN open a short position
IF volatility is too low THEN do nothing
IF volatility is too negative high and a short position is opened THEN close it out
IF volatility is too negative high and no position is opened THEN open a long position
These rules have a higher priority than the trade strategies, so that when any of these conditions is met they take over control regardless of the decision the trade strategy has made; in other words, only the volatility factor is considered. The four processes are summarized as pseudo code in the Appendix. Though the model is generic and should be able to work on any market with varying patterns, a new sample pool is recommended for each different market, as described in the Pre-processing section.

III. EXPERIMENTS

Two experiments are conducted in this project. One verifies the efficacy of the Trend Recalling algorithm in a simulated automated trading system; the other compares the profitability of the Trend Recalling algorithm against time series forecasting algorithms. The objective of the experiments is to investigate the feasibility of the Trend Recalling algorithm in an automated trading environment as an alternative to time-series forecasting.

A. Performance of Trend Recalling in Automated Trading

The improved TF algorithm with the Trend Recalling function is programmed into an automated trading simulator in Java. A simplified diagram of the prototype is shown in Figure 7. It is essentially an automated system that adopts trading algorithms for deciding when to buy and sell based on predefined rules and the current market trend. The system interfaces with application plug-ins that instruct an online broker to trade in an open market. The trading interval is one minute. Two sets of data are used for the experiment to avoid bias in data selection.
One set is market data of Hang Seng Index Futures collected during the year of 2010, the other one is H-Share also during the same year. They are basically time-series that have two attributes: timestamp and price. Their prices move and the records get updated in every minute. The two datasets however share the same temporal format and the same length, with identical market start and end times for fair comparisons of the algorithms. The past patterns stored in the data base are collected from the past 2.5 years for the use by Trend Recalling algorithm. All trials of simulations are run and the corresponding trading strategies are decided by the automated trading on the fly. At the end of the day, a trade is concluded by measuring the profit or loss that the system has made. The overall performance of the algorithms is the sum of profit/loss averaged by the number of days. In the simulation each trade is calculated in the unit of index point, each index point is equivalent to HKD 50, which is subject to overhead cost as defined by Interactive Broker unbundled commission scheme at HKD 19.3 per trade. The Return-of-Investment (ROI) is the prime performance index that is based on Hong Kong Exchange current Initial margin requirement (each contract HKD 7400 in year of 2010). Figure 7. A prototype diagram that shows the trading algorithms is the core in the system. A time-sequence illustration is shown in Figure 8 that depicts the essential ‘incubation’ period required prior to the start of trading. The timings are chosen arbitrarily. However, sufficient time (e.g., 30 minutes was chosen in our experiment) should be allowed since the beginning of the market for RSI to be calculated. Subsequently another buffer period of time followed by the calculation of the first RSI0 would be required for growing the initial trend pattern to be used for matching. If this initial part of the © 2012 ACADEMY PUBLISHER trend pattern is too short, the following trading by the Trend Recalling algorithm may not work effectively because of inaccurate matching by short patterns. If it is stalled for too long for accumulating a long matching pattern, it would be late for catching up with potential trading opportunities for the rest of the day. The stock market is assumed to operate on a daily basis. A fresh trade is started from the beginning of each day. In our case, we chose to wait for 30 minutes between the time when 246 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 RSI0 is calculated and when the trading by Trend Recalling started. The Trend Recalling steps repeat in every interval that periodically guides the buying or selling actions in the automatic trading system. As time progresses the matching pattern lengthens, matching would increasingly become more accurate and the advices for trading actions become more reliable. In our simulation we found that the whole process by Trend . Recalling algorithm that includes fetching samples from the database, matching and deciding the trading actions etc., consumes a small amount of running time. In average, it takes only 463.44 milliseconds to complete a trading decision with standard deviation of 45.16; the experiment was run on a PC with a CPU of Xeon QC X3430 and 4Gb RAM, Windows XP SP3 operating system. 
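For concreteness, the accounting just described reduces to simple arithmetic: index points gained are converted at HKD 50 per point, each trade is charged the HKD 19.3 commission, and ROI is taken against the HKD 7,400 initial margin per contract. A small sketch of that arithmetic follows; the number of trades in the example is an assumed figure.

// Profit-and-loss bookkeeping used in the simulation: HKD 50 per index point,
// HKD 19.3 commission per trade, ROI measured against the HKD 7,400 initial
// margin per contract (2010 figures quoted in the text).
public class TradeAccounting {

    static final double HKD_PER_POINT = 50.0;
    static final double COMMISSION_PER_TRADE = 19.3;
    static final double INITIAL_MARGIN = 7400.0;

    static double netProfitHkd(double indexPointsGained, int tradesExecuted) {
        return indexPointsGained * HKD_PER_POINT - tradesExecuted * COMMISSION_PER_TRADE;
    }

    static double roi(double netProfitHkd) {
        return netProfitHkd / INITIAL_MARGIN;  // e.g. 0.5 means a 50% return on margin
    }

    public static void main(String[] args) {
        // Example: the 634-point day of Figure 3, assuming four trades were placed.
        double profit = netProfitHkd(634, 4);
        System.out.printf("profit = HKD %.1f, ROI = %.1f%%%n", profit, roi(profit) * 100);
    }
}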
Figure 8. Illustration of the incubation period in market trading by Trend Recalling. (The market starts at 09:45; the initial RSI0 is calculated from 10:15, since the RSI needs at least 15 minutes of data; from 10:45 the starting trend is collected for matching, after which the Selection, Verification and Confirmation steps repeat, deciding buy, sell or no action at every interval until the market closes.)

The simulation results are shown in Table I and Table II respectively, for running the trading systems with the different TF algorithms over Hang Seng Index Futures data and H-Share data. The Static TF is one that has predefined thresholds P and Q whose values do not change throughout the whole trade; P and Q are the bars such that, when the current price moves beyond them, the system automatically sells or buys respectively. Dynamic TF allows the values of the bars to change. Fuzzy TF essentially fuzzifies these bars, and FuzzyVix fuzzifies both the bars and the volatility of the market price. Readers who are interested in the full details can refer to [2]. The Tables show the performance figures in terms of ROI, profits and losses, and the error rates. Overhead costs per trade are taken into account when calculating profits. The error rate is the frequency, or percentage of times, that the TF system made a wrong move that incurred a loss. As we observed from both Tables, more than a 400% increase in ROI is achieved by the Trend Recalling algorithm at the end of the experimental runs. This is a significant result, as it implies that the proposed algorithm can reap more than four times the initial investment annually. The trading pattern of the Trend Recalling algorithm is shown in Figure 9 for the Hang Seng Index Futures data and in Figure 10 for H-Share; the same default simulation parameters are used. The trading pattern of the Trend Recalling algorithm is compared to that of the other TF algorithms proposed earlier by the authors; readers interested in the other TF algorithms can refer to [2][3] for details. From the Figures, the trading performance of the Trend Recalling strategy remains in profit throughout and keeps improving over the long run. Figure 11 shows a longitudinal view of trading results over a day; one can see that TF does not guarantee profits at all times, but overall there are more profits than losses.

TABLE I. PERFORMANCE OF ALL TF TRADING ALGORITHMS ON HANG SENG INDEX FUTURES 2010

Figure 9. Simulation of all TF trading algorithms on Hang Seng Index Futures during 2010 (series: Static, Dynamic, Fuzzy, FuzzyVix, Recalling; vertical axis: Index Point; horizontal axis: Date).

TABLE II. PERFORMANCE OF ALL TF TRADING ALGORITHMS ON H-SHARE 2010

Figure 10.
Simulation of all TF trading algorithms on H-Share during 2010. Trend Recalling: Daily Profit and Loss Index Point 600 400 200 0 -200 -400 Figure 11. Profit and loss diagram. B. Comparison of Trend Recalling and Time Series Forecasting Time series forecasting (TSF) is another popular technique for stock market trading by mining over the former part of the trend in order to predicting the trend of near-future. The major difference between TSF and TF is that, TSF focuses on the current movements of the trend with no regard to history, and TSF regresses over a set of past observations collected over time. Some people may distinguish them as predictive and reactive types of trading algorithms. Though the reactive type of algorithms have not been widely studied in research community, there are many predictive types of time series forecasting models available, such as stationary model, trend model, linear trend model, regression model, etc. Some advance even combined neural network with TSF [9]. In our experiment here, we want to compare the working performance of TSF and TF, which is represented by its best performer so far – Trend Recalling algorithm. For a fair comparison, both types of algorithms would operate over the same dataset, which is the Hang © 2012 ACADEMY PUBLISHER Seng Index Futures. We simulate their operations and trading results over a year, under the same conditions, and compare the level of profits each of them can achieve. The profit or loss for each trade would be recorded down, and then compute an average return-of-investment (ROI) out of them. ROI will then be the common performance indicator for the two competing algorithms. It is assumed that ROI is of prime interest here though there may be other technical performance indictors available for evaluating a trading algorithm [10]. For examples, Need to Finish, Price Sensitivity, Risk Tolerance, Frequency of Trade Signals and Algorithmic Trading Costs etc. In the TSF, future values are predicted continuously as trading proceeds. If the predicted value is greater than the closing value, the system shall take a long position for the upcoming trade. And if it is lower than the previous value, it takes a short position; anything else it will do nothing. Instead of testing out each individual algorithm under the TSF family, a representative algorithm will be chosen JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 based on its best prediction accuracy for this specific set of testing data. Oracle Crystal Ball [11] that is well-known prediction software with good industrial strength is used to find a prediction model that offers the best accuracy. For comparison the “best” candidate forecasting algorithm is selected by Oracle Crystal Ball that yields the lowest average prediction error. Oracle Crystal Ball has built-in estimators that calculate the performance of each prediction model by four commonly used accuracy measures: the mean absolute deviation (MAD), the mean absolute percent error (MAPE), the mean square error (MSE), and the root mean square error (RMSE). Theil’s U statistic is a relative accuracy measure that compares the forecasted results with a naive forecast. When Theil’s U is one the forecasting technique is about as good as guessing; more than one implies the forecasting technique is worse than guessing, less than one means it is better than guessing. Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation in the prediction error. 
The value always lies between 0 and 4. If the Durbin–Watson statistic is substantially smaller than 2, there is evidence of positive serial correlation. In general if Durbin–Watson is smaller than 1, there may be cause for alarm. Small values of Durbin–Watson statistic indicate successive error terms are, on average, close in value to one another, or positively correlated. If it is greater than 2 successive error terms are, on average, much different in value to one another, i.e., negatively correlated. In regressions, this can imply an underestimation of the level of statistical significance. Table 3 lists the prediction accuracies in terms of the error measures. The best performer is Single Exponential Smoothing prediction model for the chosen testing dataset. TABLE III. PERFORMANCE OF PREDICTIVE MODELS GENERATED BY ORACLE CRYSTAL BALL With this optimal prediction model suggested by Oracle Crystal Ball for the given data, we apply the following trade strategies for the prediction model: For a long position to open, the following equation should be satisfied, Pvt+1 ﹣ Pt > 0. For a short position to open, the following equation should be satisfied, Pv t+1 ﹣ Pt < 0 where Pv t+1 is the predictive value, and Pt is the closing price at the time t. The two trading models, one by TF and the other by TSF, are put vis-à-vis in the simulation. The simulation results are gathered and presented in Table 4 and their corresponding performance curves are shown in Figure 12. The results show that Trend Recalling consistently outperformed Single Exponential Smoothing algorithm in our experiment. © 2012 ACADEMY PUBLISHER 249 TABLE IV. SIMULATION RESULTS OF "PREDICTIVE MODEL" AND "REACTIVE MODEL" IV. CONCLUSION Trend following has been known as a rational stock trading technique that just rides on the market trends with some preset rules for deciding when to buy or sell. TF has been widely used in industries, but none of it was studied academically in computer science communities. We pioneered in formulating TF into algorithms and evaluating their performance. Our previous work has shown that its performance suffers when the market fluctuates in large extents. In this paper, we extended the original TF algorithm by adding a market trend recalling function, innovating a new algorithm called Trend Recalling Algorithm. Trading strategy that used to make profit from the past was recalled for serving as a reference for the current trading. The trading strategy was recalled by matching the current market trend that was elapsed since the market opened, with the past market trend at which good profit was made by the strategy. Matching market trend patterns was not easy because patterns can be quite different in details, and the problem was overcome in this paper. Our simulation showed that the improved TF model with Trend Recalling algorithm is able to generate profit from stock market trading at more than four times of ROI. The new Trend Recalling algorithm was shown to outperform the previous TF algorithms as well as a timeseries forecasting algorithm in our experiments. 250 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 Predictive (Overnight) Reactive (Trend Recalling) 8000 6000 Index Point 4000 2000 0 20100104 20100202 20100305 20100408 20100507 20100608 20100709 20100809 20100907 20101008 20101108 20101207 -2000 -4000 Date Figure 12. Simulation trade result of predictive model and reactive model on HSI futures contracts in 2010. REFERENCES [1] [2] [3] [4] [5] Fong S. 
and Tai J., " Improved Trend Following Trading Model by Recalling Past Strategies in Derivatives Market", The Third International Conferences on Pervasive Patterns and Applications (PATTERNS 2011), 25-30 September 2011, Rome, Italy, pp. 31-36 Fong S., Tai J., and Si Y.W., "Trend Following Algorithms for Technical Trading in Stock Market", Journal of Emerging Technologies in Web Intelligence (JETWI), Academy Publisher, ISSN 1798-0461, Volume 3, Issue 2, May 2011, Oulu, Finland, pp. 136-145. Stan Weinstein's, Secrets for Profiting in Bull and Bear Markets, pp. 31-44, McGraw-Hill, USA, 1988. Wikipedia, 1997 Asian Financial Crisis, Available at http://en.wikipedia.org/wiki/1997_Asian_Financial_Crisis, last accessed on July-3-2012. Wikipedia, Financial crisis of 2007–2010, Available at http://en.wikipedia.org/wiki/Financial_crisis_of_2007, last accessed on July-3-2012. © 2012 ACADEMY PUBLISHER [6] [7] [8] [9] [10] [11] Schannep J., "Dow Theory for the 21st Century: Technical Indicators for Improving Your Investment Results", Wiley, USA, 2008. Covel M.W., "Trend Following: How Great Traders Make Millions in Up or Down Markets", New Expanded Edition, Prentice Hall, USA, 2007, pp. 220-231. Weissman R.L., "Mechanical Trading Systems: Pairing Trader Psychology with Technical Analysis", Wiley, USA, 2004, pp. 10-19. Mehdi K. and Mehdi B., "A New Hybrid Methodology for Nonlinear Time Series Forecasting", Modelling and Simulation in Engineering, vol. 2011, Article ID 379121, 5 pages, 2011. Domowitz I. and Yegerman H., "Measuring and Interpreting the Performance of Broker Algorithms", 2005, Techical Report, ITG Inc., August 2005, pp. 1-12. Oracle Crystal Ball, a spreadsheet-based application for predictive modeling, forecasting, simulation Available at http://www.oracle.com/technetwork/middleware/crystalbal l/overview/index.html, last accessed on July-3-2012. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 Appendix – The Pseudo Code of the Trend Recalling Algorithm © 2012 ACADEMY PUBLISHER 251 252 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 A Novel Method of Significant Words Identification in Text Summarization Maryam Kiabod Department of Computer Engineering, Najafabad Branch, Islamic azad University, Isfahan, Iran Email: m_kiabod@sco.iaun.ac.ir Mohammad Naderi Dehkordi and Mehran Sharafi Department of Computer Engineering, Najafabad Branch, Islamic azad University, Isfahan, Iran Email: naderi@iaun.ac.ir, mehran_sharafi@iaun.ac.ir Abstract—Text summarization is a process that reduces the size of the text document and extracts significant sentences from a text document. We present a novel technique for text summarization. The originality of technique lies on exploiting local and global properties of words and identifying significant words. The local property of word can be considered as the sum of normalized term frequency multiplied by its weight and normalized number of sentences containing that word multiplied by its weight. If local score of a word is less than local score threshold, we remove that word. Global property can be thought of as maximum semantic similarity between a word and title words. Also we introduce an iterative algorithm to identify significant words. This algorithm converges to the fixed number of significant words after some iterations and the number of iterations strongly depends on the text document. 
We used a two-layered backpropagation neural network with three neurons in the hidden layer to calculate weights. The results show that this technique has better performance than MS-word 2007, baseline and Gistsumm summarizers. Index Terms—Significant Words, Text Summarization, Pruning Algorithm I. INTRODUCTION As the amount of information grows rapidly, text summarization is getting more important. Text summarization is a tool to save time and to decide about reading a document or not. It is a very complicated task. It should manipulate a huge quantity of words and produce a cohesive summary. The main goal in text summarization is extracting the most important concept of text document. Two kinds of text summarization are: Extractive and Abstractive. Extractive method selects a subset of sentences that contain the main concept of text. In contrast, abstractive method derives main concept of text and builds the summarization based on Natural Language Processing. Our technique is based on extractive method. There are several techniques used for extractive method. Some researchers applied statistical criterions. Some of these criterions include TF/IDF (Term Frequency-Inverse Document Frequency) [1], number of words occurring in title [2], and number of numerical © 2012 ACADEMY PUBLISHER doi:10.4304/jetwi.4.3.252-258 data [3]. Using these criterions does not produce a readerfriendly summary. As a result NLP (Natural Language Processing) and lexical cohesion [4] are used to guarantee the cohesion of the summary. Lexical cohesion is the chains of related words in text that capture a part of the cohesive structure of the text. Semantic relations between words are used in lexical cohesion. Halliday and Hasan [5] classified lexical cohesion into two categories: reiteration category and collocation category. Reiteration category considers repetition, synonym, and hyponyms, while collocation category deals with the co-occurrence between words in text document. In this article, we present a new technique which benefits of the advantages of both statistical and NLP techniques and reduces the number of words for Natural Language Processing. We use two statistical features: term frequency normalized by number of text words and number of sentences containing the word normalized by total number of text sentences. Also we use synonym, hyponymy, and meronymy relations in reiteration category to reflect the semantic similarity between text words and title words. A twolayered backpropation neural network is used to automate identification of weights of features. The rest of the article is organized as follow. Section 2 provides a review of previous works on text summarization systems. Section 3 presents our technique. Section 4 describes experimental results and evaluation. Finally we conclude and suggest future work in section 5. II. TEXT SUMMARIZATION APPROACHES Automatic text summarization dates back to fifties. In 1958, Luhn [6] created text summarization system based on weighting sentences of a text. He used word frequency to specify topic of the text document. There are some methods that consider statistical criterions. Edmundson [7] used Cue method (i.e. "introduction", "conclusion", and "result"), title method and location method for determining the weight of sentences. Statistical methods suffer from not considering the cohesion of text. Kupiec, Pederson, and Chen [8] suggested a trainable method to summarize text document. In this method, JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 
4, NO. 3, AUGUST 2012 number of votes collected by the sentence determines the probability of being included the sentence in the summary. Another method includes graph approach proposed by Kruengkrai and Jaruskululchi [9] to determine text title and produce summary. Their approach takes advantages of both the local and global properties of sentences. They used clusters of significant words within each sentence to calculate the local property of sentence and relations of all sentences in document to determine global property of text document. Beside statistical methods, there are other approaches that consider semantic relations between words. These methods need linguistic knowledge. Chen, Wang, and Guan [10] proposed an automated text summarization system based on lexical chain. Lexical chain is a series of interrelated words in a text. WordNet is a lexical database which includes relations between words such as synonym, hyponymy, meronymy, and some other relations. Svore, Vander Wende and Bures [11] used machine learning algorithm to summarize text. Eslami, Khosravyan D., Kyoomarsi, and Khosravi proposed an approach based on Fuzzy Logic [12]. Fuzzy Logic does not guarantee the cohesion of the summary of text. Halavati, Qazvinian, Sharif H. applied Genetic algorithm in text summarization system [13]. Latent Semantic Analysis [14] is another approach used in text summarization system. Abdel Fattha and Ren [15] proposed a technique based on Regression to estimate text features weights. In regression model a mathematical function can relate output to input variables. Feature parameters were considered as input variables and training phase identifies corresponding outputs. There are some methods that combine algorithms, such as, Fuzzy Logic and PSO [16]. Salim, Salem Binwahla, and Suanmali [17] proposed a technique based on fuzzy logic. Text features (such as similarity to title, sentence length, and similarity to keywords, etc.) were given to fuzzy system as input parameters. Ref. [18] presented MMR (Maximal Marginal Relevance) as text summarization technique. In this approach a greedy algorithm is used to select the most relevant sentences of text to user query. Another aim in this approach is minimizing redundancy with sentences already included in the summary. Then, a linear combination of these two criterions is used to choose the best sentences for summary. Carbonell and Goldstein [19] used cosine similarity to calculate these two properties. In 2008 [20] used centroid score to calculate the first property and cosine similarity to compute the second property. Different measures of novelty were used to adopt this technique [21, 22]. To avoid greedy algorithms problems, many have used optimization algorithms to solve the new formulation of the summarization task [23, 24, 25]. III. PROPOSED TECHNIQUE The goal in extractive text summarization is selecting the most relevant sentences of the text. One of the most important phases in text summarization process is identifying significant words of the text. Significant words play an important role in specifying the best sentences for summary. There are some methods to identify significant words of the text. Some methods use statistical techniques and some other methods apply semantic relations between words of the text to determine significant words of text. Such as term frequency (TF), similarity to title words, etc. each method has its own advantages and disadvantages. 
In our work, a combination of these methods is used to improve the performance of the text summarization system. In this way, we use the advantages of several techniques to make text summarization system better. We use both statistical criterions and semantic relations between words to identify significant words of text. Our technique has five steps: preprocessing, calculating words score, significant words identification, calculating sentences score, and sentence selection. These steps are shown in Fig. 1. Figure 1: the flowchart of proposed technique © 2012 ACADEMY PUBLISHER 253 254 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 The first step, preprocessing, involves preparing text document for the next analysis and pruning the words of the text document. This step involves sentence segmentation, sentence tokenization part of speech tagging, and finding the nouns of the text document. Keywords or significant words are usually nouns, so finding nouns of the text can help improving performance of our system. The second step, calculating words scores, calculates words scores according to their local score and global score explained in detail later. Local score is determined based on statistical criterions and global score is determined through semantic similarity between a word and title words. The third step, significant words identification, uses words score and an iterative algorithm to select the most important words of text. The fourth step, calculating sentence score, calculates sentence score according to sentence local score, sentence global score and sentence location. The fifth step, sentence selection, selects the most relevant sentences of text based on their scores. These five steps are explained in detail in the next five sections. A. Preprocessing The first step in text summarization involves preparing text document to be analyzed by text summarization algorithm. First of all we perform sentence segmentation to separate text document into sentences. Then sentence tokenization is applied to separate the input text into individual words. Some words in text document do not play any role in selecting relevant sentences of text for summary, Such as stop words ("a", "an", the"). For this purpose, we use part of speech tagging to recognize types of the text words. Finally, we separate nouns of the text document. Our technique works on nouns of text. In the rest of the article we use "word" rather than "noun". In this phase, we use two statistical criterions: term frequency of the word normalized by total number of words (represented by TF) and number of sentences containing the word normalized by total number of sentences of text document (represented by Sen_Count). We combine these two criterions to define equation (1) to calculate local score of words. word_local_score = α * TF + (1- α) * Sen_Count (1) where α is weight of the parameter and is in the range of (0, 1). We utilize a two-layered backpropagation neural network with three neurons in hidden layer, maximum error of 0.001, and learning rate of 0.2 to obtain this weight. The dendrites weights of this network are initialized in the range of (0, 1). We use sigmoid function as transfer function. The importance of each parameter is determined by the average of dendrites weights connected to the input neuron that represents a parameter [26]. After training neural network with training dataset we use weights to calculate words local scores. 
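A minimal sketch of equation (1): TF is the word's term frequency normalized by the total number of words, Sen_Count is the number of sentences containing the word normalized by the total number of sentences, and the weight α is assumed to be supplied externally by the trained neural network described above.

import java.util.List;

// Sketch of equation (1): word_local_score = alpha * TF + (1 - alpha) * Sen_Count.
// The weight alpha is assumed to come from the trained neural network; here it
// is simply a parameter.
public class LocalScore {

    static double localScore(String word, List<List<String>> sentences, double alpha) {
        int totalWords = 0, occurrences = 0, sentencesWithWord = 0;
        for (List<String> sentence : sentences) {
            boolean found = false;
            for (String w : sentence) {
                totalWords++;
                if (w.equalsIgnoreCase(word)) { occurrences++; found = true; }
            }
            if (found) sentencesWithWord++;
        }
        double tf = occurrences / (double) totalWords;                   // normalized term frequency
        double senCount = sentencesWithWord / (double) sentences.size(); // normalized sentence count
        return alpha * tf + (1 - alpha) * senCount;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("summarization", "reduces", "text"),
                List.of("text", "summarization", "extracts", "sentences"));
        System.out.println(localScore("summarization", sentences, 0.6));
    }
}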
The algorithm in this step prunes words of the text document and deletes words without any role in selecting relevant sentences for summary. This is done by defining a threshold and taking words whose scores are above that threshold. This algorithm is shown in Algorithm 1: Algorithm 1: Word pruning Algorithm Input: local score of words, words list Output: pruned words list 1. B. Calculating Words Score After preparing input text for text summarization process, it is time to determine words score to be used in later steps. In this step we utilize combination of statistical criterions and lexical cohesion to calculate text words scores. Finding semantic relations between words is a complicated and time consuming process. So, first of all, we remove unimportant words. For this reason, we calculate local score of word. If local score of a word is less than the word_local_score_threshold, we will remove that word. Word_local_score_threshold is the average of all text words scores multiplied by a PF (a number in the range of (0, 1) as a Pruning Factor in word selection). By increasing PF, more words will be removed from text document. In this way, the number of words decreases and the algorithm gets faster. We calculate global score for remaining words based on reiteration category of lexical cohesion. Finally, we calculate words scores by using local and global score of words. This step is described in detail in three next sections. 2. foreach words w of text do 3. If (word_local_score < word_local_score_threshold) Delete word from significant words list; 4. end 5. end 6. return pruned words list; In this algorithm, i represents word index and PF stands for Pruning Factor. The first line of the Algorithm 1 computes local score threshold of words by taking the average of the local score of words multiplied by PF. The second line of it prunes words by taking words whose scores are above the word_local_score_threshold. Finally, the algorithm returns the pruned words list in the seventh line. Calculating global score of words Calculating local score of words In this phase, we consider semantic similarity between text words and title words. We use WordNet, a lexical © 2012 ACADEMY PUBLISHER JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 database, to determine semantic relations between text words and title words. We fixed the weight of repetition and synonym to 1, of hyponymy and hyperonymy to 0.7, and of meronymy and holonymy to 0.4. We also consider repetition of keywords in the text and fix the weight of it to 0.9. We define equation (2) to calculate global score of words: Word_global_score = Max (sim (w, )) (2) According to this equation, first of all, we calculate the maximum similarity between each word and title words. Then the sum of maximum similarities is calculated to determine global score of words. This score is used in the next section. Calculating word score The final phase in this step is calculating word score. In our technique, word score is calculated by combination of local score and global score of word. We define equation (3) to calculate word score. Word_score=α*(word_local_score)+β*(word_global_s core) (3) α and β are determined by neural network illustrated before. C. Identifying Significant Words Significant words play an important role in text summarization systems. The sentences containing important words have better chance to be included in summary. 
In the case of finding significant words of text with a high accuracy, the results of text summarization will be great. So, we focus on significant word identification process to improve text summarization results. In this step, we introduce a new iterative method to determine significant words of text. In this method, significant words are initiated with text words. Then a threshold is defined to be used to identify the words that should be removed from initial significant words. This is done by applying the average of all significant words scores in previous iteration as word_score_threshold. If a word score is less than this threshold, we will remove that word from significant words list. In each loop of this algorithm some words are deleted from significant words list. The algorithm converges to the fixed number of significant words after some iteration. The algorithm is shown below: Algorithm 2: Significant words identification algorithm 5. 6. 7. 8. 9. 10. 11. 255 if (word_score< words_score_threshold) Delete word from significant words list; end end Word_score_threshold:=average(significant_words_scores); end return significant words list; words_score_threshold in Algorithm 2 is the average of all scores of significant words of text. This threshold changes in every iteration of algorithm. The new value of it is calculated through the average of scores of significant words in previous iteration of algorithm. The first line of Algorithm 2 initiates significant words list by text words. The second line initiates Word_score_threshold by calculating the average of scores of text words. The third line to the tenth line iterates to delete unimportant words from significant words list. The ninth line of the algorithm computes words_score_threshold for the next iteration. Finally, the algorithm returns significant words list in line ten. D. Calculating Sentence Score In this step, we use significant words determined by previous step to calculate sentence score. Our technique in this phase is based on Kruengkrai and Jaruskululchi [9] approach, but we changed the parameters. They combined local and global properties of a sentence to determine sentence score as follow: Sentence_score = α*G + (1-α)*L (4) Where G is the normalized global connectivity score and L is the normalized local clustering score. It results this score in the range of (0, 1). We define G and L as follow: G= (5) L= (6) where is the maximum semantic relation among sentence words and title and keywords. As shown in equation (5), we consider semantic relations among sentence words and title and keywords to determine the global property of a sentence. Then, we normalize it by total number of words in the sentence. The parameter α determines the importance of G and L. we use neural network illustrated before to determine α. Baxendale [27] showed that sentences located at first and last paragraph of text document are more important and having greater chances to be included in summary. So, we divide text document into three sections and multiply sentences scores in the first and last section by 0.4 and in the second section by 0.2. The algorithm is shown below. Input: text words list, text words scores Output: significant words list Algorithm 3: Sentence score calculation algorithm 1. significant_words := text_words; 2. Word_score_threshold :=average(text_words_scores); 3. 4. 
while number of significant words changes do foreach significant words of text do © 2012 ACADEMY PUBLISHER Input: number of significant words of each sentence, total number of significant words of text, total number of words in each sentence, similarity score between a word and title words, sentence location, and the parameters α and β Output: scores of sentences JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 1. foreach sentence of text do 2. sentence_local_score:= 3. sentence_global_score := ; ; 4. Sentence_score := α*G + (1-α)*L; 5. If ((1/3)*TSN < sentence_loc < (2/3)*TSN) 6. Sentence_score *:=0.2; 7. else : Sentence_score *:=0.4; 8. end 9. end 10. return scores of sentences; TSN IN Algorithm 3 is referred as total number of text sentences. Sentence_loc is the location of sentence in text document. The Algorithm 3 repeats line two to line eight for each sentence. Line two computes local score of sentences. The third line of the algorithm computes global score of sentence. The forth line computes sentence score according to local score and global score. The fifth line to the eighth line considers the sentence location. If sentence location is in the first section or last section of the text document, multiply it’s score by 0.4 otherwise multiply score of sentence by 0.2. Finally, the algorithm returns sentences scores in line ten. E. Sentence Selection After calculating scores of the sentences, we can use these scores to select the most important sentences of text. This is done by ranking sentences according to their scores in decreasing order. Sentences with higher score tend to be included in summary more than other sentences of the text document. In our technique these sentences have more similarity to title. This similarity is measured according to statistical and semantic techniques used in our technique. Another criterion to choose sentences for summary is Compression Rate. Compression rate is a scale to decrease the size of text summary. A higher compression rate leads to a shorter summary. We fix compression rate to 80%. Then n topscoring sentences are selected according to compression rate to form the output summary. We use DUC2002 1 as input data to train neural network and test our technique. DUC 2002 is a collection of newswire articles, comprised of 59 document clusters. Each document in DUC2002 consists of 9 to 56 sentences with an average of 28 sentences. Each document within the collections has one or two manually created abstracts with approximately 100 words which are specified by a model. We evaluate the technique for different PF. The best result was achieved for PF=0.25 as shown in Fig. 2. We compare our results with MS-word 2007, Gistsumm, and baseline summarizers. MS-word 2007 uses statistical criterions, such as term frequency, to summarize a text. Gistsumm uses the gist as a guideline to identify and select text segments to include in the final extract. Gist is calculated on the basis of a list of keywords of the source text and is the result of the measurement of the representativeness of intra- and inter-paragraph sentences. The baseline is the first 100 words from the beginning of the document as determine by DUC2002. The results are shown in Fig. 3 and Fig. 4. The numerical results are shown in Table 1. The text number in Table 1 shows the text number in the tables. Our technique (OptSumm) reaches the average precision of 0.577, recall of 0.4935 and f-measure of 0.531. 
The MSword 2007 summarizer achieves the average precision of 0.258, recall of 0.252 and f-measure of 0.254. The Gistsumm reaches the average precision of 0.333 and fmeasure of 0.299. the baseline achieves the average of 0.388, recall of 0.28 and f-measure of 0.325.the results have shown that our system has better performance in comparison with MS-word 2007, Gistsumm and baseline summarizers. Fig. 3, Fig. 4, and Fig. 5 show that the precision score, the Recall score, and F-measure are higher when we use OptSumm rather than MS-word 2007, Gistsumm, and baseline summarizers. 1 0.8 Precision 256 0.6 PF=0.25 0.4 PF=0.5 0.2 PF=0.75 0 1 3 5 7 PF=1.0 Text Number IV. EVALUATION Figure 2: the comparison of different PF Text summarization evaluation is a complicated task. We use three criterions to evaluate our system [28]: Precision Rate = (7) Recall Rate = (8) F-measure= (9) 1. © 2012 ACADEMY PUBLISHER 9 11 13 www.nlpir.nist.gov JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 257 1.2 1 OptSumm 0.6 0.8 F-measure Precision 1 0.8 MS-Word 0.4 GistSumm 0.2 baseline 0 1 3 5 7 9 11 13 0.6 OptSumm 0.4 MS-word 0.2 Gistsumm baseline 0 Text Number 1 3 5 7 9 11 13 Text Number Figure 3: the comparison of precision score among four summarizers Figure 5: the comparison of F-measure score among four summarizers 1 Recall 0.8 0.6 OptSumm 0.4 MS-Word 0.2 GistSumm baseline 0 1 3 5 7 9 11 13 Text Number Figure 4: the comparison of recall score among four summarizers Table I. THE COMPARISON OF PRECISION AND RECALL AMONG FOUR SUMMARIZERS Text Number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 SET NO. D061J D062J D106g D113h D083a D071f D072f D092c D074b D091c D110h D102e D098e average Model Precision b a a b b a j a a j b f a - 0.45 0.8 0.875 0.8 0.5 0.66 0.5 0.85 0.8 0.27 0.875 0.107 0.125 0.577 OptSumm Recall Fmeasure 0.5 0.473 0.66 0.723 0.38 0.529 0.44 0.567 0.428 0.461 0.5 0.568 0.5 0.5 0.666 0.746 0.666 0.726 0.42 0.328 0.777 0.823 0.33 0.161 0.142 0.132 0.4935 0.531 Precision 0.1 0.142 0.363 0.25 0.4 0.33 0.222 0.55 0.4 0.1 0.5 0.01 0.125 0.258 MS-word 2007 Recall Fmeasure 0.125 0.111 0.166 0.153 0.22 0.273 0.111 0.153 0.285 0.332 0.375 0.351 0.25 0.235 0.55 0.55 0.33 0.361 0.15 0.12 0.55 0.523 0.11 0.018 0.142 0.132 0.252 0.254 Precision 0.285 0.4 0.33 0.25 0.25 0.2 0.44 0.57 0.2 0.36 0.6 0.09 0.2 0.333 GistSumm Recall Fmeasure 0.25 0.266 0.33 0.361 0.166 0.220 0.111 0.153 0.142 0.181 0.125 0.153 0.5 0.468 0.22 0.317 0.16 0.177 0.57 0.441 0.33 0.425 0.11 0.099 0.142 0.166 0.272 0.299 Precision 0.5 1.0 0.625 0.8 0.4 0.5 0.2 0.2 0.6 0.1 0.166 0.1 0 0.388 baseline Recall 0.375 0.5 0.27 0.44 0.285 0.75 0.125 0.111 0.5 0.15 0.111 0.15 0 0.28 Fmeasure 0.428 0.666 0.377 0.567 0.332 0.6 0.153 0.142 0.545 0.12 0.133 0.12 0 0.325 V. CONCLUSION and FUTURE WORK REFERENCES In this article, we proposed a new technique to summarize text documents. We introduced a new approach to calculate words scores and identify significant words of the text. A neural network was used to determine the style of human reader and to which words and sentences the human reader deems to be important in a text. The evaluation results show better performance than MS-word 2007, GistSumm, and baseline summarizers. In future work, we intend to use other features, such as font based feature and cue-phrase feature in words local score and calculate words scores based on it. Also the sentence local score and global score can be changed to reflect the reader's needs. 
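The evaluation above relies on the precision, recall and F-measure criteria of equations (7)-(9). Under the standard set-overlap reading for extractive summaries, comparing the sentences selected by a system against those of a model summary, they can be computed as in the following sketch; representing sentences by integer identifiers is an assumption of this sketch.

import java.util.HashSet;
import java.util.Set;

// Standard set-overlap reading of precision, recall and F-measure between the
// sentences chosen by a summarizer and the sentences of a model summary.
public class SummaryEvaluation {

    static double[] evaluate(Set<Integer> system, Set<Integer> model) {
        Set<Integer> overlap = new HashSet<>(system);
        overlap.retainAll(model);  // correctly extracted sentences
        double precision = system.isEmpty() ? 0 : overlap.size() / (double) system.size();
        double recall    = model.isEmpty()  ? 0 : overlap.size() / (double) model.size();
        double f = (precision + recall == 0) ? 0
                 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f };
    }

    public static void main(String[] args) {
        double[] prf = evaluate(Set.of(1, 3, 5, 8), Set.of(1, 2, 5, 8));
        System.out.printf("P=%.2f R=%.2f F=%.2f%n", prf[0], prf[1], prf[2]);
    }
}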
[1] M.Wasson, "Using Leading Text for News Summaries: Evaluation results and implications for commercial summarization applications”, In Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the ACL, pp.1364-1368, 1998. [2] G.Salton,C.Buckley,"Term-weighting Approaches in Automatic Text Retrieval", Information Proceeding and Management 24,1988,513-523.Reprinted in:Sparck-Jones, K.; Willet ,P.(eds).Readings in I.Retreival, Morgan Kaufmann,pp.323-328,1997 [3] C.Y.Lin, "Training a Selection Function for Extraction", In Proceedings of eighth international conference on Information and knowledge management, Kansas City, Missouri, United States, pp.55-62,1999. [4] M.Hoey, Patterns of Lexis in Text. Oxford: Oxford University Press, 1991 © 2012 ACADEMY PUBLISHER 258 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 [5] M.Halliday, and Hasan, R.1975.Cohesion in English. London: Longman [6] H.P.Luhn, “The Automatic Creation of Literature Abstracts”, IBM journal of Research Development, 1958, pp.159-165. [7] H.P.Edmundson, “New Methods in Automatic Extraction”, journal of the ACM, 1969, pp.264-285. [8] J.Kupiec , j.Pedersen, AND j.Chen, “A Trainable Document Summarizer”, In Proceedings of the 18th ACMSIGIR Conference,1955,pp.68-73. [9] C.Jaruskululchi, Kruengkrai, “Generic Text Summarization Using Local and Global Properties of Sentences”, IEEE/WIC international conference on web intelligence, October 2003, pp.13-16. [10] Y.Chen, X. Wang, L.V.YI Guan,” Automatic Text Summarization Based on Lexical Chains”, in Advances in Natural Computation, 2005, pp.947-951. [11] K.Svore, L.Vanderwende, and C.Bures, “Enhancing Single-document Summarization by Combining Ranknet and Third-party Sources”, In Proceeding of the EMNLPCoNLL. [12] F.Kyoomarsi, H.Khosravi, E.Eslami, and P.Khosravyan Dehkordy, “Optimizing Text Summarization Based on Fuzzy Logic”, In Proceedings of Seventh IEEE/ACIS International Conference on Computer and Information Science, IEEE, University of shahid Bahonar Kerman,2008,pp.347-352. [13] V.Qazvinian, L.Sharif Hassanabadi, R.Halavati, “Summarization Text with a Genetic Algorithm-Based Sentence Extraction”, International of Knowledge Management Studies (IJKMS),2008,vol.4,no.2,pp.426-444. [14] S.Hariharan, “Multi Document Summarization by Combinational Approach”, International Journal of Computational Cognition, 2010, vol.8, no.4, pp.68-74. [15] M.Abdel Fattah, and F.Ren, “Automatic Text Summarization”, Proceedings of World of Science, Engineering and Technology,2008,vol.27,pp.195-192. [16] L.Suanmali, M.Salem, N.Binwahlan and Salim, “sentence Features Fusion for Text Summarization Using Fuzzy Logic”, IEEE, 2009,pp.142-145. [17] L.Suanmali, N. Salim, and M.Salem Binwahlan, “Fuzzy Swarm Based Text Summarization”, journal of computer science, 2009, pp.338-346. [18] J. Carbonell and J. Goldstein, “The use of MMR, diversitybased rerunning for reordering documents and producing © 2012 ACADEMY PUBLISHER [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] summaries,” in Proceedings of theAnnual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336, 1998. C. D. Manning and H. Schutze, Foundations of Natural Language Processing.MIT Press, 1999. S. Xie and Y. Liu, “Using corpus and knowledge-based similarity measure in Maximum Marginal Relevance for meeting summarization,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 
[21] G. Murray, S. Renals, and J. Carletta, "Extractive summarization of meeting recordings", in Proceedings of the 9th European Conference on Speech Communication and Technology, pp. 593-596, 2005.
[22] D. R. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, A. Celebi, S. Dimitrov, E. Drabek, A. Hakim, W. Lam, D. Liu, J. Otterbacher, H. Qi, H. Saggion, S. Teufel, A. Winkel, and Z. Zhang, "MEAD — a platform for multidocument multilingual text summarization", in Proceedings of the International Conference on Language Resources and Evaluation, 2004.
[23] R. McDonald, "A study of global inference algorithms in multi-document summarization", in Proceedings of the European Conference on IR Research, pp. 557-564, 2007.
[24] S. Ye, T.-S. Chua, M.-Y. Kan, and L. Qiu, "Document concept lattice for text understanding and summarization", Information Processing and Management, vol. 43, no. 6, pp. 1643-1662, 2007.
[25] W. Yih, J. Goodman, L. Vanderwende, and H. Suzuki, "Multi-document summarization by maximizing informative content-words", in Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1776-1782, 2007.
[26] N. Soltanian Zadeh and L. Sharif, "Evaluation of Effective Parameters and Their Effect on Summarization Systems Using Neural Network", Fifth Annual International Conference of the Computer Society of Iran, 2008.
[27] P. Baxendale, "Machine-Made Index for Technical Literature — an Experiment", IBM Journal of Research and Development, 1958.
[28] Y.Y. Chen, O.M. Foong, S.P. Yong, and I. Kurniawan, "Text Summarization for Oil and Gas Drilling Topic", Proceedings of World Academy of Science, Engineering and Technology, vol. 32, pp. 37-40, 2008.

Attribute Overlap Minimization and Outlier Elimination as Dimensionality Reduction Techniques for Text Classification Algorithms

Simon Fong
Department of Computer and Information Science, University of Macau, Macau SAR
Email: ccfong@umac.mo

Antonio Cerone
International Institute for Software Technology, United Nations University, Macau SAR
Email: antonio@iist.unu.edu

Abstract—Text classification is the task of assigning free-text documents to predefined groups. Many algorithms have been proposed; in particular, dimensionality reduction (DR), an important data pre-processing step, has been studied extensively. DR can effectively reduce the feature representation space, which in turn helps improve the efficiency of text classification. Two DR methods, namely Attribute Overlap Minimization (AOM) and Outlier Elimination (OE), are applied to downsize the feature representation space, on the number of attributes and the number of instances respectively, prior to training a decision model for text classification. AOM works by switching the membership of overlapped attributes (also known as features or keywords) to the group in which they occur most frequently. Dimensionality is lowered when only significant and unique attributes describe each group. OE eliminates instances that describe infrequent attributes. These two DR techniques can be used together with conventional feature selection to further enhance its effectiveness. In this paper, two datasets, one on classifying languages and one on categorizing online news into six emotion groups, are tested with a combination of AOM, OE and a wide range of classification algorithms. Significant improvements in prediction accuracy, tree size and speed are observed.
Index Terms—Data stream mining, optimized very fast decision tree, incremental optimization.

I. INTRODUCTION

Text classification is a classical text mining process that concerns automatically sorting unstructured, free-text documents into predefined groups [1]. The problem receives much attention from the data mining research community because of its practical importance in many online applications, such as automatic categorization of web pages in search engines [2], detection of public moods online [3], and information retrieval systems that selectively acquire online text documents into preferred categories. Given the online nature of text classification applications, the algorithms often have to deal with massive volumes of online text stored in unstructured formats, such as hypertexts, emails, electronic news archives and digital libraries. A prominent challenge of text classification is processing the high dimensionality of the attribute representation space manifested by the text data. Text information is often represented by a string variable, which is a single-dimensional data array or linked list in computer memory. Though the size of a string may be bounded, a string variable can potentially contain an infinite number of word combinations, and each string that represents an instance of a text document will have a different length. The large number of values in the training dataset and the irregular length of each instance make training a classifier extremely difficult. To tackle this issue, the text strings are transformed into a fixed-sized list of attributes that represent the frequency of occurrence of each corresponding word in the dataset. The frequency list is often called a word vector, in the form of a bit vector, which is an occurrence-frequency representation of the words. The length of a word vector is bounded by the maximum number of unique words that exist in the dataset. An example in WEKA (the 'Waikato Environment for Knowledge Analysis', a popular suite of machine learning software written in Java and developed at the University of Waikato) illustrates how sentences in natural language are converted to word vectors of frequency counts.

Figure 1. An example of a text string converted to a word vector.

Although word vectors can be processed by most classification algorithms, the transformation approach is not scalable. For large texts that contain many words, the word vector grows prohibitively large, which slows down model training and leads to the well-known data mining problem called the 'curse of dimensionality'. The word vector is also mostly sparse and occupies unnecessary runtime memory. Hence dimensionality reduction (DR) techniques are extensively studied by researchers. These techniques aim to reduce the number of components of a dataset, such as word vectors, while representing the original data as accurately as possible. DR often yields fewer features and/or instances, so a compact representation of the data can be achieved, improving text mining performance and reducing computational costs. Two types of DR are usually applied, often together, for reducing the number of attributes/features and for streamlining the number of instances.
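As an illustration of the word-vector conversion shown in Figure 1, the following sketch builds occurrence-frequency vectors for a handful of sentences. It is a plain Python illustration, not WEKA's filter; the example documents are invented.

```python
from collections import Counter

def to_word_vectors(documents):
    """Convert raw text strings into fixed-length word-frequency vectors."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One attribute per unique word in the whole dataset.
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, vectors

# Invented example documents.
docs = ["cancer cure found", "cancer cases rising", "markets rising today"]
vocab, vecs = to_word_vectors(docs)
print(vocab)   # the attribute (word) list
print(vecs)    # one frequency vector per document
```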
They attempt to eliminate irrelevant and redundant attributes/data from the training dataset and/or its transformed representation, making the training data compact and efficient for a data-intensive task like constructing a classifier. In this paper, a DR method called Attribute Overlap Minimization (AOM) is introduced, which reduces the number of dimensions by refining the membership of each group that the word vector is most likely to belong to. Furthermore, the corresponding instances that do not fit well in the rearranged groups are removed. This paper reports on this DR technique, and experiments are conducted to demonstrate its effectiveness over two different datasets.

II. MODEL FRAMEWORK

A typical text mining workflow consists of data pre-processing, which includes data cleaning, formatting and missing-value handling, dimensionality reduction, and data mining model training. Figure 2 shows such a typical text mining workflow. A classifier, which is enabled by data mining algorithms, needs to be trained initially by processing a substantial amount of pre-labeled records to an acceptable accuracy before it can be used for classifying new, unseen instances into the predicted groups. The data mining algorithms are relatively mature in their efficacy, and their performance largely depends on the quality of the training data, which is the result of the DR that tries to abstract the original dataset into a compact representation. A type of DR method well known as stemming [4] has been proposed and widely used in the past. Stemming algorithms, or so-called stemmers, are designed to reduce a single word to its stem or root form [5] by finding its morphological root. This is done by removing the suffix of the words, and it helps shorten the length of most terms. The other important type of DR is feature selection, which selects only the attributes whose values represent the words that exist in the text documents, and filters out those attributes that have little predictive power with respect to the classification model, so that a subset of the original attributes can be retained for building an accurate model. A comparative study [6] evaluated different feature selection methods with respect to reducing the dimensionality of the text space. It was shown that between 50% and 90% of the terms in the text space can be removed by using suitable feature selection schemes without sacrificing any accuracy in the final classification model.

Figure 2. A typical text-mining workflow: data extraction from the web, formatting of unstructured text, denoising and stemming, class labeling of training records, dimensionality reduction (attribute reduction by feature selection and AOM, data reduction by outlier removal), and classification model training.

Both types of DR methods reduce the dimensionality of a dataset as an important element of the text data pre-processing stage. However, it is observed that feature selection heavily removes less-important attributes based on their potential contributing power in a classifier, without regard to the context of the training text data. We identify that one of the leading factors in misclassification is the confusion of the contexts of words in different groups.
The confusion disrupts the training process of the classification model by mistakenly interpreting a word/term from an instance as an indication of one group when in fact it is more likely to belong to another. A redundant and false mapping relation between the attributes and the target group is thereby created in the model, which dampens the accuracy of the resulting classification model. The source of this problem is the common attributes that are owned by more than one group. A single term, without reference to the context of its use, can belong to two or more target groups of text. In the example given in Figure 1, the individual term 'Cancer' actually occurs in a case that belongs to 'Good news', while the same term could intuitively be deemed an element of 'Bad news'. To rectify this problem, a data pre-processing method called Attribute Overlap Minimization (AOM) is proposed. In principle, it works by relocating each term to the group in which the term has the highest occurrence frequency. The relocation can be absolute, that is, based on a winner-takes-all approach: the group that has the highest frequency count of the overlapped word recruits it entirely. In the dataset, the instances that contain the overlapped words have to delete them if their labeled class group is not the winner group; the instances that belong to the winner group continue to own the words for describing the characteristics of the group. Another, milder approach is to assign 'weights' according to the relative occurrence frequencies across the groups. The strict approach may have the disadvantage of over-relocation, which leads to a situation where the winner group monopolizes the ownership of the frequently occurring terms, leaving the other groups short of key terms for training up their mapping relations. However, when the dataset has a sufficient number of instances and the overlapped terms are not too many, AOM works well and fast. Compared to FS, AOM has the advantage of preserving most of the attributes, and yet it can prevent potential confusion in the classification training. Another benefit is speed, since it is not necessary to refer to any ontological information during processing. An example is shown in Figure 3, where common words with the same spelling overlap across different languages. AOM is a competitive scheme in which the language group where the words appear most frequently acquires the overlapped words.

Figure 3. An illustration of overlapped words among different languages (e.g. 'en', 'die', 'data', 'pour', 'de', 'se', 'que', 'un', 'la' shared among German, English, French and Spanish).
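A minimal sketch of the winner-takes-all variant of AOM described above: each overlapped term is kept only by the class in which it occurs most often, and it is dropped from the losing classes. The data structures and function names are illustrative assumptions, not code from the paper.

```python
from collections import defaultdict

def attribute_overlap_minimization(term_counts_per_class):
    """term_counts_per_class: dict class -> dict term -> frequency.
    Returns a copy in which every term is kept only by the class
    where it occurs most frequently (winner takes all)."""
    # Find, for each term, the class with the highest frequency.
    best_class = {}
    for cls, counts in term_counts_per_class.items():
        for term, freq in counts.items():
            if term not in best_class or freq > best_class[term][1]:
                best_class[term] = (cls, freq)
    reduced = defaultdict(dict)
    for cls, counts in term_counts_per_class.items():
        for term, freq in counts.items():
            if best_class[term][0] == cls:   # keep only for the winner group
                reduced[cls][term] = freq
    return dict(reduced)

# Invented toy data: 'data' overlaps between English and Spanish.
counts = {"english": {"data": 7, "the": 9}, "spanish": {"data": 3, "la": 6}}
print(attribute_overlap_minimization(counts))
# {'english': {'data': 7, 'the': 9}, 'spanish': {'la': 6}}
```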
III. EXPERIMENT

In order to validate the feasibility of our proposed model, a text mining program is built in WEKA over two representative datasets, using a wide range of classification algorithms. We aim to study the performance of the classifiers together with the use of different dimensionality reduction methods. The training data, which are obtained from online websites, are unstructured in nature. After the conversion, the word vector grows to a size of 8135 attributes for maintaining frequency counts for each word in the documents. A combination of DR techniques is applied in our experiment. An outlier removal algorithm is used for trimming off data rows that have exceptionally different values from the norm. For reducing the number of attributes, a standard feature selection (FS) algorithm, Chi-Square, is used because of its popularity and generality, together with our novel approach, Attribute Overlap Minimization (AOM). Two training datasets are used in the experiment: one is a collection of sample sentences on topics related to data mining, retrieved from Wikipedia websites in four different languages – Spanish, French, English and German. The other is excerpted from the CNN news website, from news articles released over the ten days around New Year 2012. The news collection has a good mix of political happenings, important world events and lifestyle stories. One hundred sample news articles were obtained in total, and they were rated manually according to six basic human psychological emotions, namely Anger, Fear, Joy, Love, Sadness and Surprise. The data are formatted into ARFF format, with one news item/instance per row in the following structure: <emotion>, <"text of the news">, where the second field has a variable length. Similarly, for the language sample dataset, the structure is <language>, <"wiki page text">. HTML tags, punctuation marks and symbols are filtered out. The training datasets are then subjected to the above-mentioned dimensionality reduction methods to transform them into concise datasets in which the remaining attributes have substantial predictive power. Accuracy, which is a key performance indicator, is defined as the percentage of correctly classified instances over the total number of instances in the training dataset. Other indicators are the decision tree size, or the number of generated rules, which reflects the runtime memory requirement, and the time taken for training the model. By applying attribute reduction and data reduction, we can observe that the initial number of attributes is reduced greatly, from 8135 to 11. Having a concise and elite set of attributes is crucial in real-time applications, and in text mining of online news the number of attributes is proportional to the coverage of the news articles – the more unique words (vocabulary) that are covered, the greater the number of attributes.

TABLE I. PERFORMANCE OF DECISION TREE MODEL TESTED UNDER DIFFERENT TYPES OF DR METHODS APPLIED, LANGUAGE DATASET.
TABLE II. PERFORMANCE OF DECISION TREE MODEL TESTED UNDER DIFFERENT TYPES OF DR METHODS APPLIED, EMOTION DATASET.

In general, it can be seen from the above tables that the smallest tree size, the highest accuracy and a very short training time are obtained when the three DR methods are used together. The language dataset represents a scenario where the number of attributes is approximately 10 times larger than the number of instances, which is usual in text mining when a vector space is used.
The emotion dataset represents an extremely imbalanced case where the ratio of attributes to instances is greater than 80:1. It should be highlighted that, by applying the series FS+AOM+OE in the extreme case of the emotion dataset, the number of attributes was not cut to an extremely small number (50 instead of 11), which is still sufficient to characterize an emotion group, and the number of instances was not overly reduced (91 over 63), which is sufficient for training the model; yet the accuracy achieved is the highest possible. The experiment is then extended to evaluate the use of machine learning algorithms, with the benchmarking objective of achieving the highest accuracy. The selection of machine learning algorithms used in our experiment is by no means exhaustive, but it forms the basis of a performance comparison that should cover most of the popular algorithms. The machine learning algorithms are grouped into five main categories, Decision Tree, Rules, Bayes, Meta and Miscellaneous; all of them are known to be effective for data classification to certain extents. Three versions of inflected datasets were text-mined by the different classification algorithms in this experiment. They are the dataset with FS only, the transformed dataset with reduced attributes and overlapped attributes rearranged (by both FS and AOM), and the transformed dataset with both attributes reduced and outliers removed (FS+AOM+OE). The full performance results in terms of accuracy, tree/rule size and time taken are shown in Tables III, IV and V.

TABLE III. PERFORMANCE COMPARISON USING DIFFERENT CLASSIFIERS FOR LANGUAGE DATASET WITH FS TECHNIQUE ONLY.
TABLE IV. PERFORMANCE COMPARISON USING DIFFERENT CLASSIFIERS FOR LANGUAGE DATASET WITH FS AND AOM TECHNIQUES.
TABLE V. PERFORMANCE COMPARISON USING DIFFERENT CLASSIFIERS FOR LANGUAGE DATASET WITH FS+AOM+OE TECHNIQUES.

The experiments are then repeated with respect to accuracy only, graphically showing the effects of applying no technique at all, techniques that reduce the attributes, and techniques that reduce both attributes and instances. The results are displayed as scatter plots in Figure 4 and Figure 5 for the language dataset and the emotion dataset respectively.

Figure 4. Accuracy graph of classifiers over the language dataset.
Figure 5. Accuracy graph of classifiers over the emotion dataset.

From the charts, across which the accuracy values of the various classifiers are laid out, it can be observed that in general DR methods indeed yield a certain improvement. The improvement between the original dataset without any technique applied and the inflected datasets with DR techniques is very apparent in the emotion dataset, which represents a very large vector space. This means that for text mining applications that deal with a wide coverage of vocabulary, like online news, it is essential to apply DR techniques to maintain accuracy. In fact, the gain results in Table VI show a big leap in improvement between no DR applied and DR applied: 3.584684% vs. 101.3853% increases for the language and emotion datasets respectively. On a second note, the improvement gain between with and without outlier elimination is relatively higher for the language dataset (5.519077% > 3.584684%). This implies the importance of removing outliers, especially in a relatively small vector space. Of all the classifiers under test, decision-tree-type and Bayes-type classifiers outperform the rest.
This phenomenon is observed consistently over the different datasets and the different DR techniques used. All the classification algorithms yield improvement and survive model training with a dataset of high dimensionality, except Rotation Forest.

TABLE VI. % PERFORMANCE GAIN – (L) LANGUAGE, (R) EMOTION.

IV. CONCLUSION

Novel dimensionality reduction techniques for text mining, namely Attribute Overlap Minimization and Outlier Elimination, are introduced in this paper. Their performance is tested in empirical experiments to verify the advantage of the techniques. The results show that the techniques are effective, especially on a large vector space.

REFERENCES

[1] E. Leopold and J. Kindermann, "Text categorization with support vector machines: how to represent texts in input space?", Machine Learning, vol. 46, pp. 423-444, 2002.
[2] X. Qi and B. Davison, "Web Page Classification: Features and Algorithms", ACM Computing Surveys, vol. 41, no. 2, pp. 12-31, 2009.
[3] S. Fong, "Measuring Emotions from Online News and Evaluating Public Models from Netizens' Comments: A Text Mining Approach", Journal of Emerging Technologies in Web Intelligence, vol. 4, no. 1, pp. 60-66, 2012.
[4] P. Ponmuthuramalingam and T. Devi, "Effective Dimension Reduction Techniques for Text Documents", International Journal of Computer Science and Network Security, vol. 10, no. 7, pp. 101-109, 2010.
[5] M.F. Porter, "An Algorithm for Suffix Stripping", Program, vol. 14, no. 3, pp. 130-137, 1980.
[6] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization", in Proceedings of the Fourteenth International Conference on Machine Learning, pp. 412-420, 1997.

New Metrics between Bodies of Evidences

Pascal Djiknavorian, Dominic Grenier
Laval University/Electrical and Computer Engineering, Quebec, Canada
Email: {djikna, Dominic.Grenier}@gel.ulaval.ca

Pierre Valin
Defence R&D Canada Valcartier/Decision Support Systems section, Quebec, Canada
Email: pierre.valin@drdc-rddc.gc.ca

Abstract—We address the computational difficulties caused by the heavy processing load required by the use of the Dempster-Shafer Theory (DST) in Information Retrieval. Specifically, we focus our efforts on the measure of performance known as the Jousselme distance between two basic probability assignments (or bodies of evidence). We first discuss the extension of the Jousselme distance from the DST to the Dezert-Smarandache Theory, a generalization of the DST. This is followed by an introduction to two new metrics we have developed: a Hamming-inspired metric for evidences, and a metric based on the degree of shared uncertainty. The performances of these metrics are compared with each other.

Index Terms—Dempster-Shafer, Measure of performance, Evidential Theory, Dezert-Smarandache, Distance

I. INTRODUCTION

Comparing two or more bodies of evidence (BOE) over a large frame of discernment, in the Dempster-Shafer theory of evidence [1, 2], may not always give intuitive choices from which we can simply pick the proposition with the largest basic probability assignment (BPA, or mass) or belief. A metric becomes very useful to analyze the behavior of a decision system in order to correct and enhance its performance. It is also useful when trying to evaluate the distance between two systems giving different BOEs, and it is helpful to determine whether a source of information regularly gives an answer that is far from the other sources, so that this faulty source can be down-weighted or discarded. Different approaches to deal with conflicting or unreliable sources are proposed in [3, 4, 5]. Although the Dempster-Shafer Theory (DST) has many advantages, such as its ability to deal with uncertainty and ignorance, it has the problem of quickly becoming computationally heavy, as it is an NP-hard problem [6]. To alleviate this computational burden, many approximation techniques of belief functions exist [7, 8, 9].
References [10, 11] show implementations and a comparative study of some approximation techniques. To be able to efficiently evaluate the various approximation techniques, one needs some form of metric. The Jousselme distance between two bodies of evidence [12] is one of them. However, there is a problem with this metric: it requires the computation of the cardinal of a given set, an operation which is computationally very costly within the DST. Alternatives to the Jousselme distance are thus needed; this is the objective of the research we present here.

A. The Dempster-Shafer Theory in Information Retrieval

The authors of [13] use the DST to combine visual and textual measures for ranking and choosing the best word to use as an annotation for an image. The DST is also used in the modeling of uncertainty in Information Retrieval (IR) applied to structured documents. We find in [14] that the use of the DST is due to: (i) its ability to represent leaf objects; (ii) its ability to capture uncertainty and the aggregation operator it provides, allowing the expression of uncertainty with respect to aggregated components; and (iii) the properties of the aggregation operator, which are compatible with those defined by the logical model developed by [15]. Extensible Markup Language (XML) IR, by contrast to traditional IR, deals with documents that contain structural markup which can be used as hints to assess the relevancy of individual elements instead of the whole document. Reference [16] presents how the DST can be used in the weighting of elements in the document. It is also used to express uncertainty and to combine evidence derived from different inferences, providing relevancy values for all elements of the XML document. Good mapping algorithms that perform efficient syntactic and semantic mappings between classes and their properties in different ontologies are often required for Question Answering systems. For that purpose, a multi-agent framework was proposed in [17]. In this framework, individual agents perform the mappings, and their beliefs are combined using the DST. In that system, the DST is used to deal with the uncertainty related to the use of different ontologies. The authors also use similarity assessment algorithms between concepts (words) and inherited hypernyms; once information is represented as BOEs, metrics between BOEs could be used to accomplish this. As shown in [18], the fundamental issues in IR are the selection of an appropriate scheme/model for document representation and query formulation, and the determination of a ranking function to express the relevance of a document to a query. The authors compare IR systems based on probability and belief theories, and note a series of advantages and disadvantages of using the DST in IR. Putting aside the issue of computational complexity, they come to the conclusion that the DST is the better option, thanks to its ability to deal with uncertainty and ignorance. The most significant differences between DST and probability theory are the explicit representation of uncertainty and the evidence combination mechanism. This can allow for more effective document processing [19].
It is also reported by [20] that the uncertainty occurring in IR can come from three sources regarding the relation of a document to a query: (i) the existence of different evidences; (ii) an unknown number of evidences; and (iii) the existence of incorrect evidences. There is thus a clear benefit to using a method that can better combine evidences and handle their uncertainty. Interested readers are encouraged to consult [21] for an extensive study of the use of the Dempster-Shafer Theory in Information Retrieval.

II. BACKGROUND

A. Dempster-Shafer Theory of Evidence

The Dempster-Shafer Theory (DST) has been in use for over 40 years [1, 2]. The theory of evidence, or DST, has been shown to be a good tool for representing and combining pieces of uncertain information. It offers a powerful approach to managing the uncertainties within the problem of target identity. DST requires no a priori information about the probability distribution of the hypotheses; it can also resolve conflicts and can assign a mathematical meaning to ignorance. However, traditional DST has the major inconvenience of being an NP-hard problem [6]. As various evidences are combined over time, Dempster-Shafer (DS) combination rules tend to generate more and more propositions (i.e. focal elements), which in turn have to be combined with new input evidences. Since this problem grows exponentially, the number of retained solutions must be limited by some approximation scheme, which truncates the number of such propositions in a coherent (but somewhat arbitrary) way.

Let Θ be the frame of discernment, i.e. the finite set of mutually exclusive and exhaustive hypotheses. The power set of Θ, noted 2^Θ, is the set of the subsets of Θ, where ∅ denotes the empty set.

1) Belief functions: Based on the information provided by sensor sources and known a priori information (i.e. a knowledge base), a new proposition is built. Then, based on this proposition, a Basic Probability Assignment (BPA, or mass function) is generated, taking into account some uncertainty or vagueness. Let us call m_new the new incoming BPA. The core of the fusion process is the combination of m_new and the BPA at the previous time, m_{t-1}. The resulting BPA at time t, m_t, is then the support for decision making. Using different criteria, the best candidate for identification is selected from the database. On the other hand, m_t must be combined with a new incoming BPA and thus becomes the m_{t-1} of the next step. However, this step must be preceded by a proposition management step, where the BPA is approximated. Indeed, since the combination process is based on intersections of sets, the number of focal elements increases exponentially and rapidly becomes unmanageable. This proposition management step is a crucial one, as it can influence the entire identification process.

The Basic Probability Assignment is a function m: 2^Θ → [0, 1] which satisfies the following conditions:

m(\emptyset) = 0   (1)
\sum_{A \subseteq \Theta} m(A) = 1   (2)

where m(A) is called the mass of A. It represents our confidence in the fact that "all we know is that the object belongs to A". In other words, m(A) is a measure of the belief attributed exactly to A, and to none of the subsets of A. The elements of 2^Θ that have a non-zero mass are called focal elements. Given a BPA m, two functions from 2^Θ to [0, 1] are defined: a belief function Bel and a plausibility function Pl, such that

Bel(A) = \sum_{B \subseteq A} m(B)   (3)
Pl(A) = \sum_{B \cap A \neq \emptyset} m(B)   (4)

It can also be stated that Pl(A) = 1 - Bel(Ā), where Ā is the complement of A; Bel(A) measures the total belief that the object is in A, whereas Pl(A) measures the total belief that can move into A.
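To make the definitions in equations (1)-(4) concrete, the following sketch evaluates belief and plausibility for a BPA whose focal elements are represented as Python frozensets. It is an illustrative toy, not the authors' MATLAB implementation.

```python
def belief(bpa, A):
    """Bel(A): total mass of focal elements B entirely contained in A (eq. 3)."""
    return sum(m for B, m in bpa.items() if B <= A)

def plausibility(bpa, A):
    """Pl(A): total mass of focal elements B intersecting A (eq. 4)."""
    return sum(m for B, m in bpa.items() if B & A)

# Toy frame of discernment and BPA (masses sum to 1, m(empty set) = 0).
bpa = {frozenset({"t1"}): 0.5,
       frozenset({"t1", "t2"}): 0.3,
       frozenset({"t1", "t2", "t3"}): 0.2}

A = frozenset({"t1"})
print(belief(bpa, A), plausibility(bpa, A))  # 0.5 and 1.0
```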
The functions m, Bel and Pl are in one-to-one correspondence, so it is equivalent to talk about any one of them or about the corresponding body of evidence.

2) Conflict definition: The conflict corresponds to the sum of all masses for which the set intersection yields the null set ∅. K is called the conflict factor and is defined as:

K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B)   (5)

K measures the degree of conflict between m_1 and m_2: K = 0 corresponds to the absence of conflict, whereas K = 1 implies a complete contradiction between m_1 and m_2. Indeed, K = 0 if and only if no empty set is created when m_1 and m_2 are combined. On the other hand, we get K = 1 if and only if all the sets resulting from this combination are empty.

3) Dempster-Shafer Combination Formulae: In DST, a combined or "fused" mass is obtained by combining the previous m_1 (presumably the result of previous fusion steps) with a new m_2 to obtain a fused result as follows:

m_{12}(C) = \frac{1}{1-K} \sum_{A \cap B = C} m_1(A)\, m_2(B), \quad C \neq \emptyset   (6)
m_{12}(\emptyset) = 0   (7)

The renormalization step using the conflict K, corresponding to the sum of all masses for which the set intersection yields the null set, is a critical feature of the DS combination rule. Formulated as in equation (6), the DS combination rule is associative. Many alternative ways of redistributing the conflict lose this property. The associativity of the DS combination rule is critical when the timestamps of the sensor reports are unreliable, because an associative rule of combination is impervious to a change in the order of incoming reports. By contrast, other rules can be extremely sensitive to the order of combination.

B. Dezert-Smarandache Theory

The Dezert-Smarandache Theory (DSmT) [22, 23, 24] encompasses DST as a special case, namely when all intersections are null. Both the DST and the DSmT use the language of masses assigned to each declaration from a sensor. In DST, a declaration is a set made up of singletons of the frame of discernment Θ, and all sets that can be made from them through unions are allowed (this is referred to as the power set 2^Θ). In DSmT, all unions and intersections are allowed for a declaration, forming the much larger hyper power set D^Θ, whose size follows the Dedekind sequence. For a case of cardinality 3, Θ = {θ1, θ2, θ3}, the hyper power set, with its 19 elements built from θ1, θ2 and θ3 through unions and intersections, is still of manageable size. For larger cardinalities, the hyper power set makes computations prohibitively expensive (in CPU time). Table I illustrates the problem with the first few cardinalities of 2^Θ and D^Θ.

TABLE I. CARDINALITIES FOR DST AND DSMT
Cardinal of Θ | Cardinal of 2^Θ | Cardinal of D^Θ
2             | 4               | 5
3             | 8               | 19
4             | 16              | 167
5             | 32              | 7,580
6             | 64              | 7,828,353

1) Dezert-Smarandache Hybrid Combination Formulae: In DSmT, the hybrid rule [22, 23, 24] appropriate for constraints turns out to be much more complicated. The reader is referred to the series of books on DSmT [22, 23, 24] for lengthy descriptions of the meaning of this formula. A three-step approach is proposed in the second of these books, which is used in this work. From now on, the term "hybrid" will be dropped for simplicity.

C. Pignistic Transformation

1) Classical Pignistic Transformation: One of the most popular transformations is the pignistic transformation proposed by Smets [25] as a basis for decision in the evidential theory framework. The decision rule based on a BPA m is:

BetP(\theta_i) = \sum_{\theta_i \in A \subseteq \Theta} \frac{m(A)}{|A|}, \qquad \theta_{id} = \arg\max_{\theta_i \in \Theta} BetP(\theta_i)   (13)-(15)

with θ_id the identified object among the objects in Θ. This decision rule has the main advantage of taking into account the cardinality of the focal elements.

2) DSm Cardinal: The Dezert-Smarandache (DSm) cardinal [22, 23, 24] of a set A, noted C_M(A), accounts for the total number of partitions of A, including all intersection subsets. Each of these partitions possesses a numeric weight equal to 1, and thus they are all equal. The DSm cardinal is used in the generalized pignistic transformation to redistribute the mass of a set A among all its partitions B such that B ⊆ A.
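Returning to equations (5)-(7), the sketch below gives a direct implementation of the conflict factor and of the classical Dempster combination rule over focal elements represented as frozensets. It is an illustrative toy, not the authors' MATLAB system, and it does not attempt the DSmT hybrid rule.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Classical DS combination (eqs. (5)-(7)): conjunctive combination
    of two BPAs followed by renormalization by 1 - K."""
    combined, conflict = {}, 0.0
    for (A, mA), (B, mB) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            combined[C] = combined.get(C, 0.0) + mA * mB
        else:
            conflict += mA * mB          # mass sent to the empty set (eq. 5)
    if conflict >= 1.0:
        raise ValueError("Total conflict: Dempster's rule is undefined")
    return {C: mass / (1.0 - conflict) for C, mass in combined.items()}, conflict

# Toy BPAs over the frame {t1, t2, t3}.
m1 = {frozenset({"t1"}): 0.6, frozenset({"t1", "t2"}): 0.4}
m2 = {frozenset({"t2"}): 0.3, frozenset({"t1", "t2", "t3"}): 0.7}
fused, K = dempster_combine(m1, m2)
print(K)       # conflict factor, here 0.18
print(fused)   # fused BPA, masses sum to 1
```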
3) Generalized Pignistic Transformation: The mathematical transformation that lets us go from a representation model of belief functions to a probabilistic model is called the generalized pignistic transformation [22, 23, 24]. The following equation defines the transformation operator:

BetP(A) = \sum_{X \in D^\Theta} \frac{C_M(X \cap A)}{C_M(X)}\, m(X)   (16)

D. Jousselme Distance between two BOEs

1) Similarity Properties: Diaz et al. [26] expect that a good similarity measure should respect six properties, stated as equations (17)-(22): monotonicity (the measure is increasing with the common part of the two sets and decreasing with their differing part), symmetry, exclusiveness, identity of indiscernibles, and normalization.

2) Jaccard Similarity Measure: The Jaccard similarity measure [27] is a statistic used for comparing the similarity and diversity of sample sets. It was originally created for species similarity evaluation:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}   (23)

3) Distance Properties: A distance function, also called a distance metric, on a set of points is a function with four properties [28, 29]:

d(x, y) \geq 0 \quad (non-negativity)   (24)
d(x, y) = 0 \iff x = y \quad (identity of indiscernibles)   (25)
d(x, y) = d(y, x) \quad (symmetry)   (26)
d(x, z) \leq d(x, y) + d(y, z) \quad (triangle inequality)   (27)

Some authors also require that the set of points be non-empty.

4) Jousselme Distance: To analyze the performance of an approximation algorithm, to compare its proximity to the non-approximated version, or to analyze the performance of the DS fusion algorithm by comparing its proximity to the ground truth if available, the Jousselme distance measure can be used [12]. The Jousselme distance is a Euclidean distance between two BPAs. Let m_1 and m_2 be two BPAs defined on the same frame of discernment Θ; the distance between m_1 and m_2 is defined as:

d_J(m_1, m_2) = \sqrt{\tfrac{1}{2}\,(m_1 - m_2)^T D\,(m_1 - m_2)}   (28)
D(A, B) = \frac{|A \cap B|}{|A \cup B|}   (29)

where D is the Jaccard similarity measure between focal sets A and B.

III. NEW METRICS

A. Extension of the Jousselme distance to the DSmT

The Jousselme distance as defined originally in [12] can work without major changes within the DSm framework. The user simply has to use two BPAs defined over the DSm theory instead of BPAs defined within the DS theory. Boundaries, size, and thus the amount of computation will of course increase, but otherwise there is no counter-indication to using this distance in DSmT. We thus keep equation (28) as the definition of the Jousselme distance within DSmT, with the DSm cardinal as the cardinality. Tables II and III show the bodies of evidence and their distances to one another. The example was realized with a discernment frame of size three, so that the cardinal of its hyper power set is 19 for the free model, as defined by Dezert and Smarandache [22]. Table II is divided into three sections, each of which represents the data for one BOE; the three columns give the focal sets, the associated BPA value, and the cardinal of that set.

TABLE II. FIRST SERIES OF THREE BODIES OF EVIDENCE.

Pairwise computation between the different pairs of BOEs took quite some time because of all the calculations required by the Jousselme distance of evidences. The results are shown in Table III. The proof that all the properties are respected has already been given for the DST in [12].

TABLE III. EXTENDED JOUSSELME DISTANCE RESULTS.
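As an illustration of equations (28)-(29), the sketch below computes the Jousselme distance between two BPAs whose focal elements are frozensets, using the Jaccard similarity as the weighting matrix D. It is a toy Python illustration (the original study used MATLAB) and works for DST-style focal sets; a DSmT version would replace the set cardinalities with the DSm cardinal.

```python
import math

def jaccard(A, B):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two focal sets (eq. 29)."""
    return len(A & B) / len(A | B)

def jousselme_distance(m1, m2):
    """d(m1, m2) = sqrt(0.5 * (m1 - m2)^T D (m1 - m2)), D(A, B) = Jaccard(A, B)."""
    focal_sets = sorted(set(m1) | set(m2), key=sorted)
    diff = [m1.get(A, 0.0) - m2.get(A, 0.0) for A in focal_sets]
    quad = sum(diff[i] * diff[j] * jaccard(A, B)
               for i, A in enumerate(focal_sets)
               for j, B in enumerate(focal_sets))
    return math.sqrt(0.5 * max(quad, 0.0))

m1 = {frozenset({"t1"}): 0.7, frozenset({"t1", "t2"}): 0.3}
m2 = {frozenset({"t2"}): 0.4, frozenset({"t1", "t2", "t3"}): 0.6}
print(jousselme_distance(m1, m2))
```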
The difference with the original version of the distance presented in [12] is the allowed presence of intersections, which creates the hyper power set out of the power set. This difference adds further computation to obtain the distance value. More specifically, the cardinal evaluation part of the Jousselme distance is worsened by the increase in size of the hyper power set compared to the power set.

B. Hamming-inspired metric on evidences

1) Continuous XOR mathematical operator: In [30], Weisstein defines the standard OR operator as a connective in logic which yields true if any one of a sequence of conditions is true, and false if all conditions are false. In [31], Germundsson and Weisstein define the standard XOR logical operator as a connective in logic known as the exclusive OR or exclusive disjunction. It yields true if exactly one, but not both, of two conditions is true. This operator is typically described as the symmetric difference in set theory [32]. As such, the authors define it as the union of the complement of A with respect to B and of B with respect to A. Figure 1 is a Venn diagram displaying the binary XOR operator on discrete numerical values.

Figure 1. Venn diagram displaying the binary XOR operator on discrete numerical values.
Figure 2. Venn diagram displaying the continuous XOR operator.

Starting with the standard XOR logical operator and inspired by the Hamming distance [33], which implicitly uses a symmetric difference, we develop the idea of a continuous XOR operator. Figure 2 shows a simple case similar to that of the previous figure but using values from [0, 1]. We can see that it works as an absolute value of the difference applied to each partition of the Venn diagrams, one to another.

2) Metric between evidences based on the Hamming distance principle: The Hamming distance [33] between two strings is the minimum number of substitutions required to change one string into the other. In other words, it is defined by the sum of the absolute values of the differences. From this, with the DSm cardinal [22], and using the continuous XOR mathematical operator, we have developed a new distance, the Hamming Distance of Evidences (HDE). This distance is bounded within normal values, such that 0 ≤ d_HDE ≤ 1. This new distance also respects the properties of equations (24)-(27): non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. The HDE is defined as in equation (30), which uses the super-power-set form of the BPA defined in equation (31), where S^Θ is the super-power set made of the disjoint parts of the Venn diagram of Θ.

d_{HDE}(m_1, m_2) = \tfrac{1}{2} \sum_{X \in S^\Theta} \left| m_1^{S}(X) - m_2^{S}(X) \right|   (30)
m^{S}(X) = \sum_{A \in D^\Theta,\; X \subseteq A} \frac{m(A)}{C_M(A)}, \quad X \in S^\Theta   (31)
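A simplified sketch of the HDE follows. It assumes the two BPAs have already been re-expressed over a common set of disjoint parts (the super-power-set refinement of equation (31)); under that assumption the HDE reduces to half the sum of the absolute mass differences, as stated in equation (30). The part labels are invented for illustration.

```python
def hamming_distance_of_evidences(m1_parts, m2_parts):
    """HDE over BPAs already re-expressed on disjoint parts of the
    super-power set: half the sum of absolute mass differences (eq. 30)."""
    parts = set(m1_parts) | set(m2_parts)
    return 0.5 * sum(abs(m1_parts.get(p, 0.0) - m2_parts.get(p, 0.0)) for p in parts)

# Toy example: masses already distributed over three disjoint parts p1, p2, p3.
m1 = {"p1": 0.6, "p2": 0.4}
m2 = {"p1": 0.2, "p2": 0.4, "p3": 0.4}
print(hamming_distance_of_evidences(m1, m2))  # 0.4, bounded within [0, 1]
```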
The HDE uses the BPA mass distributed among the different parts (sets) of S^Θ that compose the BPA from D^Θ; this transition from D^Θ to S^Θ is done using equation (31). Using the super-power-set version of the BPA gives us a more refined and precise definition of it. Once in the super-power-set framework, we use an adaptation of the Hamming distance, i.e. the continuous XOR operation defined previously. Its implementation is most easily understood as the sum of the absolute differences between the BPAs in S^Θ, divided by 2. (This is equivalent to the symmetric difference expression used to define the XOR operator in the literature [32].) For the BOEs defined in Table II in the previous section, without any constrained set, we get the results given in Table IV. We can then easily compare relative distances against a reliable point of reference; the Jousselme distance is considered to be our distance of reference.

TABLE IV. HAMMING DISTANCE ON EVIDENCES RESULTS.

C. Metric using a degree of shared uncertainty

1) Similarity coefficient of degree of shared uncertainty: The idea behind a similarity coefficient of degree of shared uncertainty is to quantify the degree of shared uncertainty that lies behind a pair of sets. We want to avoid the use of cardinal operators, so we conceived a decision-tree test which evaluates the degree of shared uncertainty. The following equation gives the coefficient of similarity between a pair of sets when using the metric we suggest:

s(A, B) = \begin{cases} 3 & \text{if } A = B \\ 2 & \text{if } A \subset B \text{ or } B \subset A \\ 1 & \text{if } A \cap B \neq \emptyset \text{ and neither set includes the other} \\ 0 & \text{if } A \cap B = \emptyset \end{cases}   (32)

Equation (32) gives a coefficient value of 3 when the pair of sets is equal; the value 2 when one of the sets is included in the other; and 1 when the sets have a non-empty intersection but neither is included in the other nor equal to it. Finally, the coefficient has a value of 0 when the intersection between the pair of sets is the empty set. The maximum value of the coefficient of similarity between sets A and B is therefore 3.

2) Metric between evidences based on a degree of shared uncertainty: From the similarity coefficient of degree of shared uncertainty defined above, we get the following distance, noted d_DSU and defined in equation (33). In that equation, the factor η is a normalization factor required to bound the distance. The summation goes over the matrix of every possible pair of sets taken from the focal elements of the two BOEs:

d(m_1, m_2) = 1 - \frac{1}{\eta} \sum_{A \in F_1} \sum_{B \in F_2} s(A, B)   (33)

Even if we consider (33) as the distance using the similarity coefficient, we might want to consider building one that uses only a triangular matrix out of the matrix domain of the summation. However, since commutativity is a built-in property, that measure would contain some useless redundancy. Equation (33) can be expressed in the simple form 1 - s, where s is a similarity factor. Since distances use dissimilarity factors (so that a distance of 0 means that the two BOEs are identical), the subtraction from 1 is required. However, the idea of a distance solely based on equation (33) is not enough. One should weight the similarities with the mass values from the BPAs in order to really represent the distance between bodies of evidence and not only between combinations of sets. We propose (34) as a final equation for that reason:

d_{DSU}(m_1, m_2) = 1 - \frac{1}{\eta} \sum_{A \in F_1} \sum_{B \in F_2} s(A, B)\, m_1(A)\, m_2(B)   (34)

Table V uses a simple case to show the inner workings of this method. The first matrix shown in the table is a computation matrix with the degree of shared uncertainty, defined in (32), and the product of the masses of the pair of sets. The second matrix gives the weighted similarity values. Finally, the last part of Table V indicates the sum of the values from the previous matrix, i.e. the value of the sum in equation (34), the normalization factor, and finally the Distance of Shared Uncertainty (DSU). This distance could be qualified as discrete in the sense that not all values will be possible for the DSU in a given distance measurement; however, that is true only for fixed values of the BPA.

TABLE V. SIMPLE CASE OF METRIC BASED ON SHARED UNCERTAINTY DEGREE.
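The decision-tree test of equation (32) translates directly into code. The sketch below is an illustrative Python version of the coefficient only; weighting each coefficient by the product of the masses of the pair of sets, as described for Table V, is then a straightforward extra multiplication.

```python
def shared_uncertainty_coefficient(A, B):
    """Similarity coefficient of degree of shared uncertainty (eq. 32):
    3 for equal sets, 2 when one set includes the other, 1 for a partial
    overlap, 0 for disjoint sets. No cardinality computation is needed."""
    if A == B:
        return 3
    if A <= B or B <= A:
        return 2
    if A & B:
        return 1
    return 0

A, B = frozenset({"t1", "t2"}), frozenset({"t2", "t3"})
print(shared_uncertainty_coefficient(A, B))   # 1: overlapping, no inclusion
print(shared_uncertainty_coefficient(A, A))   # 3: identical sets
```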
Since BPA values are continuous in [0, 1], the DSU can nonetheless take values from a continuous range. Table VI shows the results of the metric based on the degree of shared uncertainty, measured on the same BOEs described in Table II and previously used in Table III for the Jousselme distance and Table IV for the Hamming distance on evidences.

TABLE VI. METRIC BASED ON SHARED UNCERTAINTY DEGREE RESULTS.

IV. EXAMPLES AND PERFORMANCES

This section explores the metrics presented in the previous section. These metrics will be used as distance measurements. We have implemented a DST/DSmT combination system within Matlab; the details explaining how DSmT was implemented appear in [34, 35]. Functions have been added to that system for the computation of the various metrics.

A. A few simple examples

1) Exploration case 1: Using the same bodies of evidence as presented in Table II, we obtained the distances and execution times, in seconds, given in Table VII for the same inputs given to the three distances presented previously: the Jousselme distance, the HDE and the DSU. Based only on this data, it is difficult to choose which metric is best. However, we can already see, as expected, that the Jousselme distance would be difficult to use in real-time complex cases due to the computation time it requires.

TABLE VII. DISTANCE AND TIME OF EXECUTION VALUES FOR CASE 1 (distance and execution time for each of the Jousselme distance, HDE and DSU).

2) Exploration case 2: This case further explores the behavior of the distance metrics. We use two bodies of evidence. The first BOE is held fixed. For the second BOE, we successively increment the mass of one focal element nine times, reducing the mass of the second focal element by the same amount. The results of this exploration case are given in Table VIII. We can notice from that table that the DSU is not able to correctly account for the mass distributions; obviously, this is an undesirable behavior occurring for a pair of BOEs with identical sets. We can also see that the HDE and the Jousselme distance respond in a symmetric manner to the symmetric mass distribution around equal BOEs; in other words, symmetric steps give equal values, as they should. For the step where the two BOEs are equal, all metrics give the proper distance of zero.

TABLE IX. EXECUTION TIME VALUES FOR CASE 2.

Table IX shows that both the HDE and the DSU demonstrate a clear advantage over the Jousselme distance in terms of execution times.

3) Exploration case 3: Figure 3 shows the 7 possible partitions of a size-3 case.

Figure 3. Venn diagram with the 7 partitions of a size-3 case.

This case proceeds a little differently from the previous two. Instead of keeping identical BOEs with varying masses, the BOEs themselves are now varied; third and fourth focal elements are introduced in some of the BOEs for that purpose. The first BOE is always the same, and the BOEs used as the second one in the pairwise distances are labelled A through G. The results of this case are given in Table X. As expected, we can observe an increase in distance variation for several pairs of cases; the notation used signifies that the observed distance variation going from case X to case Y is increasing. Cases F and G are particularly interesting: the difference between F and G lies in which focal element receives the transferred mass, a disjunction in G and mainly an intersection in F.
The DSU metric gives the same value for case F as for case G, while all the other metrics give smaller values for case G than for case F. Similar conclusions are obtained when comparing the metrics for the pairs of cases (A, C) and (B, D): for a similar mass redistribution, giving the mass to a disjunction results in a smaller distance than distributing it to an intersection.

3) Exploration conclusions: In general, it is better for identical sets to have the lowest distance. Otherwise, a minimal number of sets will minimize the distribution of mass onto unshared partitions. With no identical partitions in common, it is preferable to have a higher mass on disjunctive sets which have more common partitions. It is also better to have disjunctive sets that are as specific as possible, in other words of lowest cardinality. Hence, giving too much mass to a set that has too many partitions uncommon with the targeted ID or ground truth must be avoided. To obtain smaller distance values, the masses of a BOE need to be distributed on sets that share a higher ratio of common partitions with the other BOE. Finally, the use of either the Jousselme distance (adapted to DSmT) or the HDE, which is much quicker, is recommended.

TABLE VIII. DISTANCE VALUES FOR CASE 2.
TABLE X. DISTANCE VALUES FOR CASE 3.

V. CONCLUSIONS

This paper introduced two new distances between evidences, for both the Dempster-Shafer Theory and the Dezert-Smarandache Theory, to replace the Jousselme distance. When the size of the discernment frame gets large, the Jousselme distance calculation becomes too heavy to handle in a reasonable amount of time; in time-critical systems, it would be better to use the Hamming distance of evidences. For the distance using the degree of shared uncertainty, the DSU, further studies must be done: a correction may be required so that it properly takes the masses into account when facing identical bodies of evidence. Future work would include the use of DSmT [22, 23, 24] and its hierarchical information representation abilities, in conjunction with approximation-of-belief-functions algorithms, in Information Retrieval.

ACKNOWLEDGMENT

The authors wish to thank the reviewers for their comments. This work was carried out as part of Pascal Djiknavorian's doctoral research at Université Laval. Pascal Djiknavorian's studies were partly funded by RDDC.

REFERENCES

[1] G. Shafer, A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ, USA, 1976.
[2] A. Dempster, "Upper and lower probabilities induced by multivalued mapping", The Annals of Mathematical Statistics, vol. 38, pp. 325-339, 1967.
[3] M. C. Florea and E. Bosse, "Dempster-Shafer Theory: combination of information using contextual knowledge", in Proceedings of the 12th International Conference on Information Fusion, Seattle, WA, USA, July 6-9, 2009, pp. 522-528.
[4] S. Le Hegarat-Mascle, I. Bloch, and D. Vidal-Madjar, "Application of Dempster-Shafer evidence theory to unsupervised classification in multisource remote sensing", IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 4, pp. 1018-1031, August 1997.
[5] J. Klein and O. Colot, "Automatic discounting rate computation using a dissent criterion", in Proceedings of the Workshop on the Theory of Belief Functions, Brest, France, April 1-2, 2010.
[6] P. Orponen, "Dempster's rule of combination is #P-complete", Artificial Intelligence, vol. 44, no. 1-2, pp. 245-253, 1990.
[7] B. Tessem, "Approximations for efficient computation in the theory of evidence", Artificial Intelligence, vol. 61, pp. 315-329, June 1993.
[8] M. Bauer, "Approximation Algorithms and Decision Making in the Dempster-Shafer Theory of Evidence — An Empirical Study", International Journal of Approximate Reasoning, vol. 17, no. 2-3, pp. 217-237, 1997.
[9] D. Boily and P. Valin, "Truncated Dempster-Shafer Optimization and Benchmarking", in Proceedings of Sensor Fusion: Architectures, Algorithms, and Applications IV, SPIE Aerosense 2000, Orlando, Florida, April 24-28, 2000, vol. 4051, pp. 237-246.
[10] P. Djiknavorian, P. Valin and D. Grenier, "Approximations of belief functions for fusion of ESM reports within the DSm framework", in Proceedings of the 13th International Conference on Information Fusion, Edinburgh, UK, 2010.
[11] P. Djiknavorian, A. Martin, P. Valin and D. Grenier, "Étude comparative d'approximation de fonctions de croyances généralisées / Comparative Study of Approximations of Generalized Belief Functions", in Proceedings of Logique Floue et ses Applications (LFA 2010), Lannion, France, November 2010.
[12] A.-L. Jousselme, D. Grenier, and E. Bosse, "A new distance between two bodies of evidence", Information Fusion, vol. 2, no. 2, pp. 91-101, June 2001.
[13] X. Rui, N. Yu, T. Wang, and M. Li, "A Search-Based Web Image Annotation Method", in Proceedings of the IEEE International Conference on Multimedia and Expo, 2007, pp. 655-658.
[14] M. Lalmas, "Dempster-Shafer's theory of evidence applied to structured documents: modelling uncertainty", in Proceedings of the 20th Annual International ACM SIGIR Conference, Philadelphia, PA, USA, pp. 110-119, 1997. DOI: 10.1145/258525.258546
[15] Y. Chiaramella, P. Mulhem and F. Fourel, "A Model for Multimedia Information Retrieval", Technical Report, Basic Research Action FERMI 8134, 1996.
[16] F. Raja, M. Rahgozar, and F. Oroumchian, "Using Dempster-Shafer Theory in XML Information Retrieval", Proceedings of World Academy of Science, Engineering and Technology, vol. 14, August 2006.
[17] M. Nagy, M. Vargas-Vera and E. Motta, "Uncertain Reasoning in Multi-agent Ontology Mapping on the Semantic Web", in Proceedings of the Sixth Mexican International Conference on Artificial Intelligence – Special Session, MICAI 2007, November 4-10, 2007, pp. 221-230. DOI: 10.1109/MICAI.2007.11
[18] K.R. Chowdhary and V.S. Bansal, "Information Retrieval using probability and belief theory", in Proceedings of the 2011 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC), 2011, pp. 188-191. DOI: 10.1109/ETNCC.2011.5958513
[19] A. Verikas, A. Lipnickas, K. Malmqvist, M. Bacauskiene, and A. Gelzinis, "Soft combination of neural classifiers: A comparative study", Pattern Recognition Letters, vol. 20, pp. 429-444, 1999.
[20] A.M. Fard and H. Kamyar, Intelligent Agent based Grid Data Mining using Game Theory and Soft Computing, Bachelor of Science Thesis, Ferdowsi University of Mashhad, September 2007.
[21] I. Ruthven and M. Lalmas, "Using Dempster-Shafer's Theory of Evidence to Combine Aspects of Information Use", Journal of Intelligent Information Systems, vol. 19, no. 3, pp. 267-301, 2002.
[22] F. Smarandache and J. Dezert (eds.), Advances and Applications of DSmT for Information Fusion, vol. 1, American Research Press, 2004.
[23] F. Smarandache and J. Dezert (eds.), Advances and Applications of DSmT for Information Fusion, vol. 2, American Research Press, 2006.
[24] F. Smarandache and J. Dezert (eds.), Advances and Applications of DSmT for Information Fusion, vol. 3, American Research Press, 2009.
[25] Ph. Smets, "Data fusion in the Transferable Belief Model", in Proceedings of the 3rd International Conference on Information Fusion (FUSION 2000), Paris, July 10-13, 2000, pp. PS21-PS33.
[26] J. Diaz, M. Rifqi, and B. Bouchon-Meunier, "A Similarity Measure between Basic Belief Assignments", in Proceedings of the 9th International Conference on Information Fusion, Florence, Italy, July 10-13, 2006.
[27] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura", Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547-579, 1901.
[28] E.W. Weisstein, "Distance", From MathWorld — A Wolfram Web Resource, July 2012. http://mathworld.wolfram.com/Distance.html
[29] M. Fréchet, "Sur quelques points du calcul fonctionnel", Rendiconti del Circolo Matematico di Palermo, vol. 22, pp. 1-74, 1906.
[30] E.W. Weisstein, "OR", From MathWorld — A Wolfram Web Resource, July 2012. http://mathworld.wolfram.com/OR.html
[31] R. Germundsson and E.W. Weisstein, "XOR", From MathWorld — A Wolfram Web Resource, July 2012. http://mathworld.wolfram.com/XOR.html
[32] E.W. Weisstein, "Symmetric Difference", From MathWorld — A Wolfram Web Resource, July 2012. http://mathworld.wolfram.com/SymmetricDifference.html
[33] R.W. Hamming, "Error detecting and error correcting codes", Bell System Technical Journal, vol. 29, no. 2, pp. 147-160, 1950.
[34] P. Djiknavorian and D. Grenier, "Reducing DSmT hybrid rule complexity through optimization of the calculation algorithm", in Advances and Applications of DSmT for Information Fusion, F. Smarandache and J. Dezert (eds.), American Research Press, 2006.
[35] P. Djiknavorian, "Fusion d'informations dans un cadre de raisonnement de Dezert-Smarandache appliquée sur des rapports de capteurs ESM sous le STANAG 1241", Master's Thesis, Université Laval, 2008.

Pascal Djiknavorian received a B.Eng. in computer engineering and a certificate in business administration in 2005 from Laval University. He also completed, in 2008, an M.Sc. in electrical engineering on information fusion within the Dezert-Smarandache theory framework applied to ESM reports under STANAG 1241. He is currently a Ph.D. student in information fusion at Laval University, supervised by Professor Dominic Grenier and Professor Pierre Valin. He has a dozen publications as book chapters, journal articles and conference papers. His research interests include evidential theory, Dezert-Smarandache theory, approximation algorithms, and optimization methods. Mr. Djiknavorian is a graduate student member of the IEEE.

Dominic Grenier received the M.Sc. and Ph.D. degrees in electrical engineering in 1985 and 1989, respectively, from Université Laval, Quebec City, Canada. From 1989 to 1990, he was a Postdoctoral Fellow in the radar division of the Defense Research Establishment in Ottawa (DREO), Canada. In 1990, he joined the Department of Electrical Engineering at Université Laval, where he has been a Full Professor since 2000. He was also co-editor of the Canadian Journal of Electrical and Computer Engineering for 6 years. Recognized by the undergraduate students in electrical and computer engineering at Université Laval as the electromagnetism and RF specialist, his excellence in teaching has resulted in his being awarded the "Best Teacher Award" many times.
In 2009, he obtained a special fellowship from the Quebec Minister of Education. His research interests include inverse synthetic aperture radar imaging, signal array processing for high-resolution direction of arrival, and data fusion for identification. Prof. Grenier has 32 publications in refereed journals and 75 more in conference proceedings. In addition, 33 graduate students have completed their theses under his direction since 1992. Prof. Grenier is a registered professional engineer in the Province of Quebec (OIQ), Canada.

Pierre Valin received a B.Sc. in honours physics (1972) and an M.Sc. degree (1974) from McGill University, then a Ph.D. in theoretical high energy physics from Harvard University (1980), under the supervision of the 1979 Nobel Laureate Dr. Sheldon Glashow. He was a faculty lecturer at McGill and an Associate Professor of Physics in New Brunswick, at Moncton and Fredericton, before joining Lockheed Martin Canada (then called Paramax) in 1993 as a Principal Member of R&D. In 2004, he became a defence scientist at Defence R&D Canada (DRDC) at Valcartier, where he currently leads a research group in Future C2 Concepts & Structures. He has been thrust leader for Air Command at DRDC since 2007. He has been particularly active in the International Society of Information Fusion (ISIF) through the organization of FUSION 2001 and 2007. He has been an ISIF board member since 2003 and VP Membership since 2004, and was president in 2006. He is also an associate editor for the Journal of Advances in Information Fusion (JAIF). Dr. Valin's interests focus mainly on the following topics: Multi-Sensor Data Fusion (MSDF) requirements and design, C2 systems, algorithmic benchmarking, use of a priori information databases, imagery classifiers (EO/IR and SAR) and their fusion, neural networks, fuzzy logic, information processing and uncertainty representation, reasoning techniques for recognition and identification (Bayes, Dempster-Shafer, Dezert-Smarandache), SAR image processing, Network Centric Warfare, distributed information fusion, dynamic resource management, as well as theoretical and mathematical physics.

Bringing location to IP Addresses with IP Geolocation
Jamie Taylor, Joseph Devlin, Kevin Curran
School of Computing and Intelligent Systems, University of Ulster, Magee Campus, Northland Road, Northern Ireland, UK
Email: kj.curran@ulster.ac.uk

Abstract - IP Geolocation allows us to assign a geographical location to an IP address, building up a picture of the person behind that IP address. This can have many potential benefits for business and other types of application. Because the IP address of a device is unique to that device, its location can be narrowed down from the continent to the country and, in some cases, even to the street address of the device. This method of tracking can produce very broad results, and sometimes an accurate result can only be obtained with some input from the user about their location. In some countries, laws are in place stating that a service can only track a user as far as their country without consent. If the user consents, the service can consult the ISP's logs and locate the user as accurately as possible. The ability to determine the exact location of a person connecting over the Internet can not only lead to innovative location-based services but can also dramatically optimise the shipment of data from end to end.
In this paper we will look at applications and methodologies (both traditional and more recent) for IP Geolocation.

I. INTRODUCTION

IP Geolocation is the process of obtaining the geographical location of an individual or party starting out with nothing more than an IP address [1]. The uses (both current and potential) of IP Geolocation are many. Already, the technology is being used in advertising, sales and security. Geolocation is the identification of the real-world geographic location of an Internet-connected computer, mobile device, website visitor or other endpoint. IP address Geolocation data can include information such as country, region, city, postal/zip code, latitude, longitude and time zone [1]. Geolocation may refer to the practice of assessing the location, to the actual assessed location, or to locational data. Geolocation is increasingly being implemented to ensure web users around the world are successfully navigated to content that has been localised for them. Due to the '.com' dilemma, most companies are finding that more than half of the visitors to their global (.com) home pages are based outside of their home markets. The majority of these users do not find the country site that has been developed for them. Companies such as Amazon have introduced geolocation as a method of dealing with this problem [2].

There are organisations that are responsible for allocating IP addresses. The Internet Assigned Numbers Authority (IANA) is responsible for allocating large blocks of IP addresses to the following five Regional Internet Registries (RIRs) that serve specific regions of the world: AfriNIC (Africa), APNIC (Asia/Pacific), ARIN (North America), LACNIC (Latin America) and RIPE NCC (Europe, the Middle East and Central Asia). These RIRs then allocate blocks of IP addresses to Internet Service Providers (ISPs), which in turn allocate IP addresses to businesses, organizations and individual consumers. Using the above information, IP addresses can be broken down into geographical locations within a few steps, but to get a more accurate result than that the user may have to provide additional details to aid the process. In some cases, this practice becomes more efficient the more it is used: a user's location can be estimated by closely matching their IP address with a neighbouring IP address that has already been located. Many businesses have been started up just by hosting large databases of IP addresses to allow services to apply this technology with varying degrees of efficiency and accuracy. There are many methods of tracking a device, such as GPS and cell phone triangulation, but IP Geolocation, the least accurate, is becoming popular among website owners and government bodies alike.

The foundation for geolocation is the Internet Protocol (IP) address, a numeric string assigned to every device attached to the Internet. When you surf the web, your computer sends out this IP address to every website you visit. IP addresses are not like mailing addresses. That is, most are not fixed to a specific geographic location. And knowing that a particular ISP (Internet Service Provider) is based in a particular city is no guarantee that you'll know where its customers are located [3]. That is where geolocation service providers come in. Geolocation service providers build massive databases that link each IP address to a specific location. Some geolocation databases are available for sale, and some can also be searched for free online.
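To make the database-driven lookup concrete, the following is a minimal sketch of how such a table of address blocks mapped to locations can be searched for the block containing a given IP. The sample prefixes and location records are invented for illustration only; real providers maintain millions of such entries and far richer fields.

```python
# A minimal sketch of a geolocation-database lookup: a table of address
# blocks mapped to locations, searched for the block containing a given IP.
# The prefixes and locations below are invented for illustration only.
import ipaddress

GEO_DB = [
    ("203.0.113.0/24",  {"country": "GB", "region": "Northern Ireland", "city": "Derry"}),
    ("198.51.100.0/24", {"country": "US", "region": "New York",         "city": "New York"}),
    ("192.0.2.0/24",    {"country": "FR", "region": "Ile-de-France",    "city": "Paris"}),
]

def geolocate(ip_string):
    """Return the location record of the first block containing the IP, else None."""
    ip = ipaddress.ip_address(ip_string)
    for prefix, location in GEO_DB:
        if ip in ipaddress.ip_network(prefix):
            return location
    return None

print(geolocate("203.0.113.57"))   # {'country': 'GB', ...}
print(geolocate("8.8.8.8"))        # None: block not present in this toy table
```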
As the IP system is in a constant state of flux, many providers update their databases on a daily or weekly basis. Some geolocation vendors report a 5-10% change in IP address locations each week. Geolocation can provide much more than a geographic location. Many geolocation providers supply up to 30 data fields for each IP address that can help to further determine whether users really are where they say they are. These may include country, region, state, city, ZIP code and area code; latitude/longitude; time zone; network connection type; and domain name and type (i.e. .com or .edu). Not every IP address accurately represents the location of the web user. For example, some multinational companies route Internet traffic from their many international offices through a few IP addresses, which may create the impression that some Internet users are in, say, the UK when they are actually based in France. If someone is using a dial-up connection from Ireland back to their ISP in France, it will appear as if they are in France. There are also proxy services that allow web users to cloak their identities online; a few geolocation providers, however, have introduced technology that can look past these proxy servers to determine the user's true location. In addition, some providers can now locate, down to a city-street level, people connecting to the Internet via mobile phones or public Wi-Fi networks. This is accomplished through cell tower and Wi-Fi access point triangulation [4]. Here, we will be looking at this technology in more detail and at what it could mean for us and our lives going forward.

This paper is structured as follows. In Section 2, we look in more depth at the applications of IP Geolocation (both current and potential). Section 3 then presents a number of IP Geolocation methods, starting with more 'traditional' methods before progressing to those more recent and 'hybrid' in nature. In Section 4, we outline some methods for avoiding IP geolocation, and we conclude our discussion in Section 5.

II. IP GEOLOCATION USAGE

Localization is the process of adapting a product or service to target a specific group of users. These changes can include the look and feel of the product, the language and even fundamental changes in how the service or product works. Many global organisations would like to be able to tailor the experience of a website to the types of users viewing it, as it can have a significant impact on whether or not a user will use the service. The ability to gather useful metrics increases when you add in the fact that you can tell where your customers are from. Take, for example, Google. Google provides localized versions of its search engine to almost every country in the world. Using Geolocation, it can select the correct language for each user and alter their search results to reflect more accurately what the user is actually searching for. If Google so chose, it could even start to omit certain results to comply with national laws. Google ads use this feature heavily by making sure that local businesses can reach people in their area so as to increase the impact of the advertising. This localization of websites is becoming increasingly popular, and Geolocation is a tool that makes it easy to find out which version of a website to show.
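A minimal sketch of geolocation-driven localization as just described: the visitor's country code (obtained from a geolocation lookup such as the one sketched earlier) is mapped to a localized site version and language. The mapping table and the choose_site_version() helper are hypothetical, shown only to make the selection step concrete.

```python
# A minimal sketch of choosing a localized site version from a visitor's
# country code. The table and helper below are hypothetical.
LOCALIZED_SITES = {
    "GB": ("google.co.uk", "en-GB"),
    "FR": ("google.fr",    "fr"),
    "DE": ("google.de",    "de"),
}
DEFAULT_SITE = ("google.com", "en")

def choose_site_version(country_code):
    """Pick the localized domain and language for a visitor's country."""
    return LOCALIZED_SITES.get(country_code, DEFAULT_SITE)

domain, language = choose_site_version("FR")
print(domain, language)   # google.fr fr
```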
Other websites use localisation in the opposite way. Instead of attempting to increase the use of the site by accommodating worldwide users, some websites use a user's location to ensure that certain users cannot access the website or its content. This practice is most common on sites that host copyrighted content such as movies, TV shows or music. An example of this is the BBC iPlayer. This service cannot be accessed in the USA, for example, as the BBC iPlayer will not allow anyone with an IP address outside the UK to view the content. Online gaming/gambling websites use Geolocation tools to ensure that they are not committing crimes in countries where gambling is illegal [5]. An example of this is www.WilliamHill.com. This website filters out American users to avoid breaking laws in that country. In Italy, a country where unlicensed gambling is illegal, a gambling website will only be granted a licence if it applies Geolocation tools to restrict who can access the site. In 2012, MegaUpload.com was involved in a legal dispute with regard to its facilitation of copyright infringement. To try to avoid such charges, the company, which held all its assets in Hong Kong, made sure to use Geolocation tools to filter out anyone in Hong Kong from using its services. This meant that MegaUpload.com was committing copyright infringement in every country in the world except Hong Kong.

IP Geolocation has a vast array of both current and potential uses and areas of application. Of course, the accuracy (or granularity) needed varies from application to application. Through the use of IP Geolocation, advertisements can be specifically tailored to an individual based on their geographical location. For example, a user in London will see adverts relevant to the London area, a user in New York will see adverts relevant to the New York area, and so forth. Additional information such as local currency, pricing and tax can also be presented. A real-life example of this would be Google AdSense. As one may imagine, the accuracy needed for this is considerable; we would need a town or (even better) a street, as opposed to, say, a country or state, in order to provide accurate information to the user.

As the online space continues to become the place to do business, issues once thought to be solved now rear their heads again. For example, DVD drives were region-locked to prevent media being played outside the intended region, but problems exist in combating this resurging issue in the online space. Other examples of content restriction include the enforcement of blackout restrictions for broadcasting, blocking illegal downloads and the filtering of material based on culture. Content localisation, on the other hand, works to ensure that only relevant information is displayed to the user. The accuracy needed for an application that shows a visitor from Miami, Florida beach attire instead of parkas is lower than that needed for advertising. Here, geolocation at the country level is normally sufficient to ensure that users from one country cannot access content exclusive to another country, for example.
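The country-level restriction just described (an iPlayer-style check) can be sketched as follows. The geolocate() lookup is assumed to come from a geolocation database such as the one sketched earlier; the allowed-country set and the stub lookup table are illustrative, not any broadcaster's actual rules.

```python
# A minimal sketch of country-level content restriction based on geolocation.
# ALLOWED_COUNTRIES and the stub lookup are illustrative only.
ALLOWED_COUNTRIES = {"GB"}

def can_stream(ip_string, geolocate):
    """Return True if the IP geolocates to an allowed country."""
    location = geolocate(ip_string)
    return location is not None and location.get("country") in ALLOWED_COUNTRIES

# Example with a stub lookup table standing in for a real geolocation service.
stub_db = {"203.0.113.57": {"country": "GB"}, "198.51.100.20": {"country": "US"}}
lookup = stub_db.get
print(can_stream("203.0.113.57", lookup))   # True  (UK address, allowed)
print(can_stream("198.51.100.20", lookup))  # False (US address, blocked)
```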
Businesses can often struggle to adhere to national and regional laws due to the degree of variance between them. Failure to comply with these laws, however, can result in financial penalties or even prison time. Advertising, for instance, can be subject to tight control over what can be advertised where and when, and whether the product or service in question can be advertised in a particular location at all. Indeed, even the above examples of content restriction are often carried out to comply with legal requirements. In addition, there is the need to avoid trading with countries, groups and individuals black-listed by governments. Quova (www.quova.com) gives the OFAC (Office of Foreign Assets Control – United States) and the need to comply with its economic and trade sanctions as an example of this. IP Geolocation offers a powerful tool to help comply with these legal requirements. However, to use IP Geolocation effectively in this scenario, we would need state-level accuracy, as laws can vary from state to state.

IP Geolocation also has much to offer in security. It is used as a security measure by financial institutions to help protect against fraud by checking the geographical location of the user and comparing it with common trends. In the field of sales, user location can be compared with the billing address, for example. MaxMind (www.maxmind.com) is one such group, offering products such as minFraud, which provides relevant information about an IP's historic behaviour, legitimate and suspicious, and attempts to detect potential fraud by analysing the differences between the user location and the billing address.

III. METHODS OF IP GEOLOCATION

A common approach to IP Geolocation is to create and manually maintain a database containing the relevant data. These non-automated methods (i.e. those relying on some form of human interaction or contribution) can be undesirable. Problems include the fact that IP addresses are dynamically assigned rather than static, so the database requires frequent updating (potentially at considerable financial cost and with the risk of human error). The switch from IPv4 (2^32 possible addresses) to IPv6 (2^128 possible addresses) increases the challenge exponentially.

Delay Based Methods

One approach is to rely on delay measurements in order to geolocate a target. It should be noted, however, that these approaches rely on a set of 'landmarks', where a landmark is a point whose location is already known. A common way to construct this set of landmarks is to take a subset of nodes from the PlanetLab network (www.planet-lab.org), which consists of more than 1000 nodes.

Constraint Based Geolocation (CBG) is a delay-based method employing multilateration (estimating a position using some fixed points) [6]. The ability of CBG to create and maintain a dynamic relationship between IP address and geographical location is one of the method's key contributions to the IP Geolocation process, since most preceding work relied on a static IP-address-to-location relationship. To calibrate the distance-to-delay relationship, each landmark measures its delay to all other landmarks. A bestline is then created, where a bestline is the least distorted relationship between geographic distance and network delay. A circle then emanates from each landmark, the radius of which represents the target's estimated distance (calculated as above) from that landmark. The area of intersection is the region in which the target is believed to reside; CBG will commonly guess that the target is at the centroid of this region. The area of this region is an indication of confidence: the smaller the area, the more confident CBG is in its answer, while a larger area implies a lower level of confidence.

Speed of Internet (SOI) [7] can be viewed as a simplification of CBG. Whereas CBG calculates a distance-to-delay conversion value for each landmark, SOI uses a general conversion value across all landmarks. This value is 4/9c (where c is the speed of light in a vacuum). Numerous delays (such as circuitous paths and packetization) prevent data from travelling through fibre optic cables at its highest potential speed (2/3c). It is therefore reasoned that 4/9c can be used to safely narrow the region of intersection without sacrificing location accuracy.

Shortest Ping is the simplest delay-based technique. In this approach a target is simply mapped to the closest landmark based on round-trip time (also known as ping time).

Delay-based methods rely on the distance between the target and its nearest landmark, which is a good predictor of the estimation error. Round-trip time is also a good indication of the error; delay-based methods work well when the RTT is small, and performance deteriorates as the RTT increases. Having to take the network as it is, feeling our way around with delay measurements rather than being able to map it out, is something we would be keen to overcome in order to improve accuracy. As we will see, using topology information and other forms of external information can greatly increase accuracy.
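The following is a minimal sketch of the delay-based, CBG-style idea described above, simplified to a flat 2-D plane rather than the real globe. The landmark coordinates, RTTs and the delay-to-distance conversion factor are invented for illustration; a real system would calibrate per-landmark bestlines from landmark-to-landmark measurements rather than use one fixed factor.

```python
# A minimal, illustrative sketch of CBG-style multilateration on a 2-D plane.
# Landmark coordinates, RTTs and the conversion factor are invented.
import itertools

# (x_km, y_km, measured_rtt_ms) for three hypothetical landmarks.
LANDMARKS = [(0.0, 0.0, 8.0), (300.0, 0.0, 6.0), (0.0, 300.0, 7.0)]
KM_PER_MS = 100.0  # crude one-way delay-to-distance upper bound

def estimate_position(landmarks, km_per_ms=KM_PER_MS, step=5.0):
    """Centroid of grid points lying inside every landmark's distance circle."""
    radii = [(x, y, (rtt / 2.0) * km_per_ms) for x, y, rtt in landmarks]
    feasible = []
    for gx, gy in itertools.product(
            [i * step for i in range(-100, 101)], repeat=2):
        if all((gx - x) ** 2 + (gy - y) ** 2 <= r ** 2 for x, y, r in radii):
            feasible.append((gx, gy))
    if not feasible:
        return None
    n = len(feasible)
    return (sum(p[0] for p in feasible) / n, sum(p[1] for p in feasible) / n)

print(estimate_position(LANDMARKS))  # centroid of the intersection region
```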
Topology-based Geolocation (TBG)

The methods here attempt to go beyond using delay measurements as their sole metric. Some seek to combine traditional delay measurements with additional information such as knowledge of network topology; some even attempt to recast the problem entirely. The reliance of delay-based methods upon a carefully chosen set of landmarks is a problem [7]. Topology-based Geolocation (TBG), however, uses topology in addition to delay-based measurements to increase consistency and accuracy. This topology is the combination of the set of measurements between landmarks, the set of measurements between landmarks and the target (both obtained by traceroute) and structural observations about collocated interfaces. The target is then located using this topology in conjunction with end-to-end delays and per-hop latency estimates. When presented with a number of potential locations for a target, TBG will map the target to the location of the last constrained router. It should be noted, however, that TBG incurs some overheads that simple delay-based methods do not. TBG must first construct its topology information, and an additional overhead can be found in refreshing this information to ensure it is up to date and accurate. However, the authors point out that this topology information can be used for multiple targets, so this overhead need not necessarily apply to every measurement one may wish to make. There are three main variants of TBG: 1) TBG-pure, using active landmarks only; 2) TBG-passive, using active and passive landmarks; and 3) TBG-undns, using active and passive landmarks in conjunction with verified hints. Once successfully located, intermediate routers can be used as additional landmarks to help locate other network entities.
Figure 1: Identifying and clustering multiple network interfaces [7]

In order to accurately determine the locations of these intermediate routers with confidence, and to be able to use them effectively, we must record position estimates for all routers encountered so that we can base our final position estimate on as much information as possible. For instance, in trying to geolocate a router that is one hop from a given point and multiple hops from another given point, we need to record all routers we encounter, which allows us to determine its position with more accuracy. In other words, a geolocation technique has to simultaneously geolocate the targets as well as the routers encountered [7]. Discovering that a router has multiple network interfaces is a common occurrence. Normally these interfaces are then grouped (or clustered) together (this process of identification and resolution is also known as IP aliasing); otherwise we falsely inflate and complicate our topology information. Part (a) of Figure 1 shows two routers u and v which are in fact multiple interfaces of the same physical router. In part (b) we see how the topology has been simplified by identifying u and v and clustering them.

Web Parsing Approach

Another approach to improving upon delay-based methods is through the use of additional external information beyond inherently inaccurate delay-based measurements [8]. This can be achieved by parsing additional information from the web. Prime candidates are therefore those organisations that publish their geographic location on their website. The method seeks to extract, verify and utilize this information to improve accuracy. The overall system is made up of two main components: the first is a three-tier measurement methodology, which seeks to obtain a target's location; the second is a methodology for extracting and verifying information from the web, which is then used to create web-based landmarks. The three-tier measurement methodology uses a slightly modified version of CBG (where 4/9c is used as an upper bound rather than 2/3c) to obtain a rough starting point. Tiers 2 and 3 then bring in information from the web, obtained by the second component, to increase the accuracy of the final result. The information extraction and verification methodology relies on websites having a geographical address (primarily a ZIP code). This ZIP code, combined with a keyword such as university or business, is passed to a public mapping service. If this produces multiple IPs within the domain name, they are grouped together and refined during the verification process. Assuming one has used a public search tool as suggested, the first stage of verification is to remove results from the search if their ZIP code does not match that in the original query. In cases where the use of a shared hosting technique or a CDN (Content Delivery Network) results in an IP address being used for multiple domain names, the landmark (and subsequently the IP) is discarded. Finally, in the case where a branch office assumes the IP of its headquarters, the ZIP codes are compared again to confirm its identity as a branch and it is subsequently removed.
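A minimal sketch of the three verification rules just described. The candidate records, their field names and the is_branch_of_headquarters flag are hypothetical stand-ins; the original method applies these checks to results returned by a public mapping/search service.

```python
# A minimal sketch of the web-landmark verification rules described above.
def verify_landmarks(candidates, query_zip):
    """Keep only candidate web landmarks that pass the three checks above."""
    # Group the domains seen for each IP so shared-hosting/CDN cases can be spotted.
    domains_per_ip = {}
    for c in candidates:
        domains_per_ip.setdefault(c["ip"], set()).add(c["domain"])

    verified = []
    for c in candidates:
        if c["zip"] != query_zip:                   # rule 1: ZIP must match the original query
            continue
        if len(domains_per_ip[c["ip"]]) > 1:        # rule 2: IP shared by several domains
            continue
        if c.get("is_branch_of_headquarters"):      # rule 3: branch office using the HQ's IP
            continue
        verified.append(c)
    return verified

candidates = [
    {"domain": "example-university.edu", "ip": "198.51.100.10", "zip": "44240"},
    {"domain": "cdn-hosted-blog.example", "ip": "203.0.113.5",  "zip": "44240"},
    {"domain": "another-site.example",    "ip": "203.0.113.5",  "zip": "44240"},
    {"domain": "example-shop.com",        "ip": "192.0.2.7",    "zip": "90210"},
]
print(verify_landmarks(candidates, "44240"))
# Only example-university.edu survives: the two domains sharing 203.0.113.5
# are dropped by rule 2, and example-shop.com fails the ZIP check (rule 1).
```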
As with TBG described above, the method presented here also incurs certain overheads that delay-based methods are able to avoid. The measurement stage has a delay of 1-2 seconds for each measurement made. This is the result of 8 RTTs (round-trip times): 2 of which are performed in the first tier and 3 in each of the second and third tiers. The verification stage incurs an overhead for each ZIP code considered, as all landmarks for each ZIP are cached. However, this will only require occasional updates and thus does not affect each and every search.

The authors of [9] attempt to improve the accuracy of IP Geolocation by broadening the scope of information considered, casting IP Geolocation as a machine-learning classification problem. Here a Naive Bayes classifier is used along with a set of latency, hop count and population density measurements. Each of these metrics/variables can be assigned a weight to affect how it influences and informs the classifier. Results are classed in quintiles, with each quintile representing 20% of the target IPs and a level of confidence in the results within that quintile.
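To make the classification framing concrete, the following is a minimal sketch of a Naive Bayes classifier over latency, hop count and population-density features. The training rows, feature values and region labels are entirely invented, and the use of scikit-learn's GaussianNB is an assumption for illustration; the original work in [9] uses real measurements and its own classifier implementation.

```python
# A minimal sketch of a learning-based geolocation framing: Naive Bayes over
# latency, hop count and population-density features. All data are invented.
from sklearn.naive_bayes import GaussianNB

# Each row: [latency_ms_to_nearest_landmark, hop_count, population_density]
X_train = [
    [ 5.0,  4, 5200.0],   # labelled "city_A"
    [ 7.5,  5, 4800.0],   # labelled "city_A"
    [42.0, 14,  310.0],   # labelled "region_B"
    [38.5, 12,  280.0],   # labelled "region_B"
]
y_train = ["city_A", "city_A", "region_B", "region_B"]

model = GaussianNB()
model.fit(X_train, y_train)

# Classify an unlocated target IP from its measured features.
target_features = [[6.3, 5, 5000.0]]
print(model.predict(target_features))         # ['city_A']
print(model.predict_proba(target_features))   # per-class confidence estimates
```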
IV. GEOLOCATION EVASION (CYBERTRAVEL)

With Geolocation restrictions becoming more popular, internet users are finding ways to evade these restrictions. Every country has its own laws that it applies to cases involving Geolocation, but those laws were not written with the technology in mind. Cybertravel is a phrase, admittedly almost unknown but apt, that refers to evading Geolocation, GPS and other similar tracking technologies by pretending to be in a real-world location that you are not. Cybertravel is not the same as making yourself anonymous: the latter is about making your location unknown, while the former is about providing an incorrect location. One way to do this is to alter your IP address to make it seem that you are from another region. Many people use this evasion technique to access content that is restricted; it is popular with people who are trying to access websites hosting copyrighted TV shows. Another, less popular, way to cybertravel is to gain remote access to another device that is physically in the region from which you want to access the internet. In this way you are not actually altering your IP; it is as if you had physically travelled to that region and accessed the internet from there. Services like TOR (https://www.torproject.org/) provide internet anonymity. Actively trying to hide your identity or location can mean that websites cannot determine your region and thus may not allow you access to their content at all, which is why cybertravel is the method of choice for accessing region-restricted content. Services exist that allow a user to pay a monthly fee in return for an IP address from a particular region. An example of this is www.myexpatnetwork.co.uk. This company allows users from outside the UK to gain a UK IP address. The company only deals in the GBP currency and is marketed to UK residents. The company is not breaking any laws by 'leasing' these IP addresses, and because it is marketing to UK residents who are abroad it may believe it is covered from the charge of advertising a Geolocation evasion tool. This service description may not stand up in court, as a large portion of its customers are likely non-UK residents looking to access UK-only content. Evasion of Geolocation has not become a major issue at the moment. As with many issues like this, most organisations do not care until services like myexpatnetwork become popular and so easy to use that a serious financial loss looms. Governments are starting to pay attention to this issue now as they begin to understand the difficulty of enforcing their laws against companies and people outside their jurisdiction who commit crimes on the internet.

V. CONCLUSION

We have provided an overview of IP Geolocation applications and methodologies, both traditional and those that attempt to push the envelope. The methodologies presented here vary both in their complexity and their accuracy; as such, we cannot claim any one method as the ideal solution. The optimal approach is therefore highly sensitive to the type of application being developed.

REFERENCES

[1] Lassabe, F. (2009) Géolocalisation et prédiction dans les réseaux Wi-Fi en intérieur. PhD thesis, Université de Franche-Comté, Besançon.
[2] Brewster, S., Dunlop, M. (2002) Mobile Computer Interaction. ISBN: 978-3-540-23086-1. Springer.
[3] Furey, E., Curran, K., Lunney, T., Woods, D. and Santos, J. (2008) Location Awareness Trials at the University of Ulster, Networkshop 2008 - The JANET UK International Workshop on Networking 2008, The University of Strathclyde, 8th-10th April 2008.
[4] Furey, E., Curran, K. and McKevitt, P. (2010) Predictive Indoor Tracking by the Probabilistic Modelling of Human Movement Habits. IERIC 2010 - Intel European Research and Innovation Conference 2010, Intel Ireland Campus, Leixlip, Co. Kildare, 12-14th October 2010.
[5] Sawyer, S. (2011) EU Online Gambling and IP Geolocation, Neustar IP Intelligence, http://www.quova.com/blog2/4994/
[6] Gueye, B., Ziviani, A., Crovella, M. and Fdida, S. (2004) Constraint Based Geolocation of internet hosts. In IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pp. 288-293.
[7] Katz-Bassett, E., John, J., Krishnamurthy, A., Wetherall, D., Anderson, T. and Chawathe, Y. (2006) Towards IP Geolocation Using Delay and Topology Measurements. In IMC '06: Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pp. 71-84.
[8] Wang, Y., Burgener, D., Flores, M., Kuzmanovic, A. and Huang, C. (2011) Towards Street-Level Client-Independent IP Geolocation. In NSDI'11: Proceedings of the 8th USENIX conference on networked systems design and implementation, pp. 27-36.
[9] Eriksson, B., Barford, P., Sommers, J. and Nowak, R. (2010) A Learning-based Approach for IP Geolocation. In PAM'10: Proceedings of the 11th international conference on Passive and active measurement, pp. 171-180.

On the Network Characteristics of the Google's Suggest Service
Zakaria Al-Qudah, Mohammed Halloush, Hussein R. Alzoubi, Osama Al-kofahi
Yarmouk University, Dept. of Computer Engineering, Irbid, Jordan
Email: {zakaria.al-qudah, mdhall, halzoubi, osameh}@yu.edu.jo

Abstract— This paper investigates application- and transport-level characteristics of Google's interactive Suggest service by analyzing a passively captured packet trace from a campus network. In particular, we study the number of HTTP GET requests involved in user search queries, the inter-request times, the number of HTTP GET requests per TCP connection, the number of keyword suggestions that Google's Suggest service provides to users, and how often users utilize these suggestions in enhancing their search queries. Our findings indicate, for example, that nearly 40% of Google search queries involve five or more HTTP GET requests. For 36% of these requests, Google returns no suggestions, and 57% of the time users do not utilize returned suggestions.
Furthermore, we find that some HTTP characteristics, such as the inter-request generation time, differ for an interactive search application from those of traditional Web applications. These results confirm the findings of other studies that examined interactive applications and reported that such applications are more aggressive than traditional Web applications.

Index Terms— Ajax, Web 2.0, Network measurements, Performance, Google search engine

I. INTRODUCTION

Interactive Web applications have become extremely common today. The majority of Web sites, including major Web-based email services (e.g., Gmail, Yahoo mail, etc.), map services, social networks, and web search services, support an interactive user experience. One of the enabling technologies for this interactive, user-engaging experience is Asynchronous Javascript and XML (AJAX) [1]. AJAX allows the web client (browser) to asynchronously fetch content from a server without the need for typical user interactions such as clicking a link or a button. Due to this asynchronous nature, these interactive applications exhibit traffic characteristics that might be different from those of classical applications. In classical (non-interactive) applications, requests for content are usually issued in response to human actions such as clicking a link or submitting a Web form. Thus, the human factor is the major factor in traffic generation. With interactive applications, however, requests can be issued in response to user interactions that typically would not generate requests, such as filling a text field or hovering the mouse over a link or an image. Furthermore, these requests can be made even without user intervention at all, as in the case of fetching new email messages with Gmail or updating news content on a news Web site. Therefore, traffic generation is not necessarily limited by the human factor.

In this paper we focus on one such interactive application: the Google interactive search engine. The Google search engine has many interactive features, implemented using the AJAX technology, that are aimed at providing a rich search experience [2]. For example, Google Suggest (or Autocomplete) [3] provides suggested search phrases to users as they type their query (see Fig. 1). The user can select a suggested search phrase, optionally edit it, and submit a search query for that search phrase. Suggestions are created using a prediction algorithm to help users find what they are looking for. The Google Instant Search service [4] streams continuously updated search results to the user as they type their search phrases (see Fig. 2). This is hoped to guide users' search process even if they do not know exactly what they are looking for. The other interactive feature that Google provides is Instant Previews [5]. With Instant Previews (shown in Fig. 3), users can see a preview of the web pages returned in the search results by simply hovering over these search results. This service is aimed at providing users with the ability to quickly compare results and pinpoint relevant content in the results web page.

Figure 1. A snapshot of the Google Suggest feature

In this paper, we study the characteristics of the Google Suggest service by analyzing a passively captured packet trace from the Yarmouk University campus in Jordan.
We look into the number of HTTP GET requests a search query generates, the inter-request generation time, the number of suggestions Google typically returns for a request, and the percentage of time the returned suggestions are actually utilized by users. We have already begun to study the Google Instant feature and plan to study Instant Previews in the future.

Figure 2. A snapshot of the Google Instant search feature. Note that the search results for the first suggestion are displayed before the user finishes typing his/her query

Figure 3. A snapshot of the Google Instant Preview feature. Note the displayed preview of the web page of the first search result entry "Yarmouk University"

The rest of this paper is organized as follows. Section II highlights some background information related to our work. Section III motivates our work. Section IV presents the related work. Section V describes the packet trace capturing environment and its characteristics. Section VI presents our results and discusses our findings. We conclude and present our future work plans in Section VII.

II. BACKGROUND

When browsing the web, one normally uses web search engines several times a day to find the required information. Web search engines are therefore visited by a huge number of people every day. Web search can use query-based, directory-based, or phrase-based query-reformulation-assisted search methods. Google is considered among the most popular search engines on the web. The Google search engine uses the standard Internet query search method [6], [7].

In 2010 Google announced Google Instant, which live-updates search results interactively as users type their queries. Every time the user hits a new character, the search results are changed accordingly based on what the search engine thinks the user is looking for. This can save substantial user time since, most of the time, the results that a user is looking for are returned before they finish typing. Another advantage of Google Instant is that users are less likely to type misspelled keywords because of the instant feedback. The public generally provided positive feedback towards this new feature [8].

Google Instant Preview is another feature provided by Google. This feature allows users to get snapshots of web pages for the search results without leaving the search results page. This feature enhances the searcher's experience and satisfaction. Google Instant Preview provides an image of the web page in addition to the extracted text. Previews are dynamically generated because content is continuously changing. Google users are reported to be 5% more satisfied with this new feature [9].

III. MOTIVATION

One motivation of this study is that we believe that measuring such services is extremely important due to the recent popularity of interactive web features. Characterizing new trends in network usage helps the research community and network operators update their mental model of network usage. Another motivation is that such characterization is quite important for building simulators and performance benchmarks and for designing new services and enhancing existing ones. Moreover, Google interactive features may produce a large amount of information, which may result in a bad experience for users on mobile devices or over low-speed Internet connections [8].
With the prevalence of browsing the web via mobile devices today, we believe that characterizing these services is vital to understanding their performance. To the best of our knowledge, this is the first attempt at characterizing the interactive features of a search engine from the application- and transport-level perspectives.

IV. RELATED WORK

The AJAX technology suite enables automated HTTP requests without human intervention by allowing web browsers to make requests asynchronously. This has been made possible through the use of advanced features of HTTP 1.1 such as prefetching data from servers, HTTP persistent connections, and pipelining. These features mask network latency and give end users a smoother experience of web applications. Therefore, AJAX creates interactive web applications and increases speed and usability [10]. The authors of [10] performed a traffic study of a number of Web 2.0 applications and compared their characteristics to traditional HTTP traffic through statistical analysis. They collected HTTP traces from two networks, the Munich Scientific Network in Munich, Germany and the Lawrence Berkeley National Laboratories (LBNL) in Berkeley, USA, and classified traffic into Web 2.0 application traffic and conventional application traffic. They used packet-level traces from large user populations and then reconstructed HTTP request-response streams. They identified the 500 most popular web servers that used AJAX-enabled Web 2.0 applications. Google Maps is one of the first applications that used AJAX; therefore, the authors focused on Google Maps traffic. The findings of this study show that Web 2.0 traffic is more aggressive and bursty than classical HTTP traffic. This is due to the active prefetching of data, which means many more automatic HTTP requests and consequently a greater number of bytes transferred. Moreover, they found that sessions in AJAX applications last longer and are more active than conventional HTTP traffic. Furthermore, AJAX inter-request times within a session are very similar and much shorter, because requests are more frequent than in all other HTTP traffic.

Besides [10], some work exists in the literature on characterizing HTTP traffic generated by popular Web 2.0 websites. In [11], for example, the authors examined traces of Web-based service usage from an enterprise and a university. They examined methodologies for analyzing Web-based service classes and identifying service instances, service providers, and brands, pointing to the strengths and weaknesses of the techniques used. The authors also studied the evolution of Web workloads over the past decade, where they found that although Web services have significantly changed over time, the underlying object-level properties have not. The authors of [12] studied HTTP traffic from their campus network related to map applications. Their work examined the traffic from four map web sites: Google Maps, Yahoo Maps, Baidu Maps, and Sogou Maps. In their paper, they proposed a method for analyzing the mash-up (combining data from multiple sources) characteristics of Google Maps traffic. They found that 40% of Google Maps sessions come from mash-ups on other websites and that caching is still useful in web-based map applications. Li et al. [13] studied the evolution of HTTP traffic and classified its usage.
The results provided are based on a trace collected in 2003 and another collected in 2006. The total bytes in each HTTP traffic class in the two traces were compared. The authors found that overall HTTP traffic increased by 180%, while Web browsing and Crawler traffic both increased by 108%. However, Web apps, File download, Advertising, Web mail, Multimedia, News feeds and IM showed a sharp rise. Maier et al. [14] presented a study of residential broadband Internet traffic using packet-level traces from a European ISP. The authors found that session durations are quite short. They also found that HTTP, not peer-to-peer, carries most of the traffic. They observed that Flash Video contributes 25% of all HTTP traffic, followed by RAR archives, while peer-to-peer contributes only 14% of the overall traffic. Moreover, most DSL lines fail to utilize their available bandwidth, and connections from client-server applications achieve higher throughput per flow than P2P connections. In [15], a study of user sessions on YouTube was conducted. The results obtained from the study indicate longer user think times and longer inter-transaction times. The results also show that, in terms of content, large video files are transferred. Finally, in [16], [17], the authors proposed AJAXTRACKER, a tool for mimicking human interaction with a web service and collecting traces. The proposed tool captures measurements by imitating mouse events that result in the exchange of messages between the client and the Web server. The generated traces can be used for studying and characterizing different applications such as mail and maps.

V. DATA SET AND METHODOLOGY

As mentioned, this study is conducted based on a packet-level trace captured at the edge of the engineering building at Yarmouk University, Jordan. The engineering building contains roughly 180 hosts that are connected through a typical 100 Mbps Ethernet. The trace was collected over a period of five business days and contains a total of 31490 HTTP transactions that are related to Google search. To extract the transactions that are related to Google search, the URL or the "HOST:" HTTP request header has to contain the word "google". The search query is contained in the URL in the form "q=xyz", where "xyz" is the query. After identifying an HTTP request as a Google search request, the corresponding HTTP response is also extracted. The returned suggestions are extracted from these HTTP responses. We also collect the type of the returned HTTP response in order to separate queries from one another, as explained later in Section VI.
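A minimal sketch of the extraction step just described: filtering Google-search transactions out of a list of HTTP requests and pulling the partial query from the "q=" URL parameter. The record format (dicts with "url" and "host" fields) is a hypothetical stand-in; the study itself works on a raw packet trace.

```python
# A minimal sketch of filtering Google search transactions from an HTTP trace
# and extracting the "q=" query parameter. The record format is illustrative.
from urllib.parse import urlparse, parse_qs

def is_google_search(request):
    """A transaction is kept if 'google' appears in the URL or Host header."""
    return "google" in request["url"].lower() or "google" in request["host"].lower()

def extract_query(request):
    """Return the value of the q= parameter, or None if absent."""
    params = parse_qs(urlparse(request["url"]).query)
    values = params.get("q")
    return values[0] if values else None

trace = [
    {"host": "www.google.com", "url": "/complete/search?q=you&client=hp"},
    {"host": "www.example.org", "url": "/index.html"},
]
searches = [r for r in trace if is_google_search(r)]
print([extract_query(r) for r in searches])   # ['you']
```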
VI. RESULTS

In this section, we measure various parameters related to Google search queries. To identify the boundaries of a search query, we manually analyzed a portion of the collected trace. We found that throughout the time a user is typing the search phrase, the browser generates HTTP GET requests. For these HTTP GET requests, the type of the HTTP response is either "text/xml" or "text/javascript". When the user hits the Return key (to obtain the search results), the browser generates another HTTP GET request, for which the type of the returned HTTP response is "text/html". We verified this observation by actively performing a number of search queries and observing the captured traffic.

In our trace, however, we found a number of occurrences of a scenario where a series of HTTP GET requests from a user appears to be related to two different queries, yet this series of HTTP GET requests is not split by a "text/html" response separating the boundaries of the two queries. There are a number of usage scenarios that could result in such behaviour. For example, a user might type in a search phrase and get interrupted for some reason; the search query then ends without the Return key being hit. Furthermore, the TCP connection that is supposed to carry the last HTTP response might get disrupted after the user hits the Return key and before the HTTP response is delivered back to the user. To handle such cases, we consider two HTTP GET requests that are not split by a "text/html" response to belong to two different search queries if the time separation between the two requests is greater than t seconds. To find a suitable value for this parameter, we plot the percentage of queries identified using the time separation, relative to the overall number of search queries, for different values of t in Fig. 4.

Figure 4. Setting of parameter t

The figure suggests that, in general, the percentage of queries identified using the time-separation heuristic is insensitive to the setting of the parameter t when t is above 30 seconds, and t = 60 is an appropriate value since the percentage of search queries identified using the time-spacing heuristic remains stable around this value. Therefore, we choose this value throughout our evaluation below.

A. HTTP GET Requests

This subsection investigates the number of HTTP GET requests a search query typically involves. We identify a total of 7598 search queries. We plot the Cumulative Distribution Function (CDF) of the number of HTTP GET requests in a query in Fig. 5.

Figure 5. No. of HTTP GET requests per search query

As shown, over 40% of search queries involve only one HTTP GET request. The possible reasons for the existence of these queries include (i) users not turning on Google Suggest and (ii) users copying search phrases, pasting them into Google and hitting the Return key. The figure also shows that around 30% of search queries involve five or more HTTP GET requests, with some search queries involving over 90 HTTP GET requests. These results show that the "chatty" nature of AJAX-based applications reported in [10] for the map and email applications also applies to the Google Suggest application. Among the nearly 60% of search queries for which we believe users are enabling Suggest (i.e., the number of GET requests is greater than one), the vast majority of queries seem not to utilize the suggestions for the first few characters, because users continue to type despite the returned suggestions. A possible reason for this might be that for a small number of characters of the search phrase, Google returns quite general suggestions that are usually not selected by the user. We believe there is room for improvement in the service design by not returning suggestions for the initial few characters of the query.
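The query-boundary rule described above can be sketched as follows: a new query starts after a "text/html" response (the user hit Return) or after a gap of more than t = 60 seconds between consecutive GET requests. The request tuples (timestamp in seconds, response content type) are illustrative, not the paper's actual data format.

```python
# A minimal sketch of the query-boundary heuristic described above.
T_SECONDS = 60.0

def split_into_queries(requests, t=T_SECONDS):
    """Group a user's time-ordered GET requests into per-query lists."""
    queries, current, last_ts, boundary = [], [], None, False
    for timestamp, content_type in requests:
        gap = last_ts is not None and (timestamp - last_ts) > t
        if current and (boundary or gap):
            queries.append(current)
            current = []
        current.append((timestamp, content_type))
        boundary = (content_type == "text/html")   # Return key pressed
        last_ts = timestamp
    if current:
        queries.append(current)
    return queries

requests = [(0.0, "text/xml"), (0.4, "text/xml"), (1.1, "text/html"),
            (2.0, "text/xml"), (90.0, "text/xml")]
print(len(split_into_queries(requests)))   # 3 queries
```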
B. Inter-Request Times

Next, we turn our attention to investigating the time spacing between HTTP GET requests within the same search query. Fig. 6 shows the results.

Figure 6. HTTP GET inter-request times within a search query

As shown, 64% of HTTP GET requests involved in a search query are separated by less than one second. We contrast these results with our mental model of traditional HTTP interactions. Normally, a number of HTTP requests are made to download a Web page along with its embedded objects; a think time then elapses before new requests are made to download a new page [18]. The interactive search application generates a radically different pattern of HTTP GET requests. This is due to the fact that requests for a new set of Google suggestions are automatically made while the user types the search query. This also confirms the results of [10] indicating that inter-request times are shorter in AJAX-based applications than in traditional applications. We believe, however, that the traffic characteristics of these interactive applications are generally application-dependent and not technology-dependent. That is, the characteristics of the traffic generated by an application employing the AJAX technology depend on the type of the application and not on the fact that it uses AJAX. This is because AJAX enables the application to generate traffic automatically without user intervention (or in response to user actions that typically do not generate traffic, such as hovering over a link); however, it is up to the application logic to decide whether to generate HTTP requests and when to generate them.

C. TCP Connections

In our trace, we find that each HTTP GET request is carried over its own TCP connection. To verify whether this is a result of the deployed HTTP proxy, we performed a number of search queries from the authors' houses (i.e., using residential broadband network connections). This experiment involved performing a number of search queries with different web browsers (Microsoft Internet Explorer, Mozilla Firefox, and Google Chrome) on a Microsoft Windows 7 machine. We captured and examined the packet trace of these search queries. Our findings indicate that, contrary to what we find in our trace, various HTTP GET requests can be carried over the same TCP connection. We note here that HTTP proxies are commonly deployed in institutional networks. Therefore, our network setting is not necessarily unique, and we believe it is legitimate to assume that many other institutions employ similar network settings. We note that having a separate TCP connection per request might have a significant impact on the performance of this service. In particular, each new TCP connection requires the TCP three-way handshake, which might add a significant delay. Furthermore, if an HTTP request needs to be split over many packets, the rate at which these packets are transmitted to the server is limited by the TCP congestion control mechanisms. These added delays might therefore limit the usefulness of the Suggest service, since suggestions usually become obsolete once the user types new text.
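A rough, illustrative estimate of the avoidable handshake delay discussed in the TCP Connections subsection above. The RTT value, the request count and the helper name are invented; real numbers depend on the client's network path.

```python
# A rough estimate of extra delay from opening a fresh TCP connection per
# suggestion request instead of reusing one persistent connection.
def extra_handshake_delay(num_requests, rtt_ms):
    """One additional round trip (SYN / SYN-ACK) per new connection,
    compared with reusing a single persistent connection."""
    return (num_requests - 1) * rtt_ms  # the first connection is needed either way

# Example: a 7-character phrase generating 7 suggestion requests over a 60 ms path.
print(extra_handshake_delay(7, 60))   # 360 ms of avoidable handshake latency
```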
We note here that HTTP proxies are commonly deployed in institutional networks. Therefore, our network setting is not necessarily unique, and we believe that it is totally legitimate to assume that many other institutions are employing similar network settings. We note that having a separate TCP connection per request might have significant impact on the performance of this service. In particular, each new TCP connection requires the TCP three-way handshake which might add a significant delay. Furthermore, if an HTTP request needs to be split over many packets, the rate at which these packets are transmitted to the server is limited by the TCP congestion control mechanisms. Therefore, these added delays might limit the usefulness of the Suggest service since suggestions usually become obsolete when the user types new text. 0.6 0.6 0.4 0.2 0 0.1 Percentage (logscale) 1 Figure 8. Percentage of time users actually use the returned suggestions of the search query Next, we investigate the percentage of time users do actually use the returned suggestions during the search process. To assess this, within a search query, we assume that the user has utilized the returned suggestions if the search phrase in the current request matches one of the suggestions appeared in the response for the previous request. To illustrate this, consider the following scenario from our trace. An HTTP GET request was sent to Google with “you” as a partial search phrase. Google responded with “youtube, you, youtube downloader, yout, youtube to mp3, youtu, youtube download, youtube music, you top, you born” as search suggestions. The next HTTP GET request was sent to Google with “youtube” as the search phrase. In this case, we assume that the user has utilized the returned suggestion since the current HTTP GET request involves one of the suggestions that were provided as a response to the previous HTTP GET request. That is, the user is asking for “youtube” which was one of the suggestions made by Google in the previous response. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 We note that this is an upper limit on the usage of this service because a search phrase in the current request may match the suggestions in the previous request, yet, the user might have typed the phrase instead of selecting it from the list of returned suggestions. The result is plotted in Fig. 8. As shown, nearly 58% of the time, users do not use the returned suggestions at all. On the other hand, nearly 10% of queries are constructed with complete guidance of the Suggest service. The following scenario from our trace illustrates a case where a query can be constructed with complete help of Google suggestions. A user typed “f” which triggered a request for suggestions to Google. Google responsed with “facebook,face,fa,friv,fac,firefox,faceboo,farfesh,factjo,fatafeat” as suggestions. The next and final request was for “facebook” which is among the suggested search phrases. In this case the user selected a search phrase from the first set of returned suggestion to complete the search query. This means suggestions are fully utilized. Hence, full utilization of google suggest service is acheived when the user selects a suggested phrase from each returned list of suggestions for a particular search query. VII. 
C ONCLUSIONS AND F UTURE W ORK In this paper, we have investigated the applicationand transport-level characteristics of Google’s Suggest interactive feature as observed in a passively captured packet trace from a campus network. We find that a large number of HTTP GET requests could be issued to obtain suggestions for a search query. Interestingly, the characteristics of the HTTP GET requests deviate significantly from those of HTTP GET requests issued for classical Web interactions. In particular, while classical Web interactions are limited by the human factor (thinktime), interactive applications are not necessarily limited by this factor. Furthermore, we have characterized the number and usefulness of suggestions made by Google. To this end, we have found that Google responds to the majority of requests for suggestions with either zero or 10 suggestions (the number 10 is the maximum number of suggestions returned per request). However, nearly 58% of users do not utilize the returned suggestions at all. We have already begun to investigate the characteristics of other Google interactive search features such as Google instant search and plan to evaluate the Google instant preview as well. R EFERENCES [1] J. J. Garrett, “Ajax: A new approach to web applications,” http://adaptivepath.com/ideas/essays/archives/000385.php, February 2005, [Online; Stand 18.03.2008]. [Online]. Available: http://adaptivepath.com/ideas/essays/archives/000385.php [2] “Ajax:A New Approach to Web Applications,” http://adaptivepath.com/ideas/ajax-new-approach-webapplications. [3] “Google Suggest (or Autocomplete),” http://www.google.com/support/websearch/bin/static.py?hl= en&page=guide.cs&guide=1186810&answer=106230&rd=1. © 2012 ACADEMY PUBLISHER 283 [4] “Google Instant,” http://www.google.com/instant/. [5] “Google Instant Previews,” http://www.google.com/landing/instantpreviews/#a. [6] P. Bruza, R. McArthur, and S. Dennis, “Interactive internet search: keyword, directory and query reformulation mechanisms compared,” in SIGIR’00, 2000, pp. 280–287. [7] S. Dennis, P. Bruza, and R. McArthur, “Web searching: A process-oriented experimental study of three interactive search paradigms,” Journal of the American Society for Information Science and Technology, vol. 53, issue 2, pp. 120–130, 2002. [8] http://dejanseo.com.au/google-instant/. [9] http://dejanseo.com.au/google-instant-previews/. [10] F. Schneider, S. Agarwal, T. Alpcan, and A. Feldmann, “The new web: Characterizing ajax traffic.” in PAM’08, 2008, pp. 31–40. [11] P. Gill, M. Arlitt, N. Carlsson, A. Mahanti, and C. Williamson, “Characterizing organizational use of webbased services: Methodology, challenges, observations and insights,” ACM Transactions on the Web, 2011. [12] S. Lin, Z. Gao, and K. Xu, “Web 2.0 traffic measurement: analysis on online map applications,” in Proceedings of the 18th international workshop on Network and operating systems support for digital audio and video, ser. NOSSDAV ’09. New York, NY, USA: ACM, 2009, pp. 7–12. [Online]. Available: http://doi.acm.org/10.1145/1542245.1542248 [13] W. Li, A. W. Moore, and M. Canini, “Classifying http traffic in the new age,” 2008. [14] G. Maier, A. Feldmann, V. Paxson, and M. Allman, “On dominant characteristics of residential broadband internet traffic,” in Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, ser. IMC ’09. New York, NY, USA: ACM, 2009, pp. 90–102. [Online]. Available: http://doi.acm.org/10.1145/1644893.1644904 [15] P. Gill, M. Arlitt, Z. Li, and A. 
Zakaria Al-Qudah received his B.S. degree in Computer Engineering from Yarmouk University, Jordan, in 2004. He received his M.S. and Ph.D. degrees in Computer Engineering from Case Western Reserve University, USA, in 2007 and 2010, respectively. He is currently an assistant professor of Computer Engineering at Yarmouk University. His research interests include the Internet, content distribution networks, and security.

Mohammed Halloush received the B.S. degree from Jordan University of Science and Technology, Irbid, Jordan, in 2004, and the M.S. and Ph.D. degrees in Electrical Engineering from Michigan State University, East Lansing, MI, USA, in 2005 and 2009, respectively. He is currently an assistant professor in the Department of Computer Engineering at Yarmouk University, Irbid, Jordan. His research interests include network coding, multimedia communications, wireless communications, and networking.

Hussein Al-Zoubi received his M.S. and Ph.D. degrees in Computer Engineering from the University of Alabama in Huntsville, USA, in 2004 and 2007, respectively. Since 2007, he has been working with the Department of Computer Engineering, Hijjawi Faculty for Engineering Technology, Yarmouk University, Jordan. He is currently an associate professor. His research interests include computer networks and their applications: wireless and wired networks, security, multimedia, queuing analysis, and high-speed networks.

Osameh Al-Kofahi received his B.S. degree in Electrical and Computer Engineering from Jordan University of Science and Technology, Irbid, Jordan, in 2002. He received his Ph.D. degree from Iowa State University, USA, in 2009. His research interests include wireless networks, especially Wireless Sensor Networks (WSNs), Wireless Mesh Networks (WMNs) and ad hoc networks, survivability and fault tolerance in wireless networks, and practical network coding.

Review of Web Personalization
Zeeshan Khawar Malik, Colin Fyfe
School of Computing, University of The West of Scotland
Email: {zeeshan.malik,colin.fyfe}@uws.ac.uk

Abstract— Today the internet is in its personalized phase, in which each user is able to view content that matches his or her interests and needs. Web users now rely on the internet for nearly all of the problems they face in their daily lives.
If someone wants to find a job, he or she will look on the internet; similarly, if someone wants to buy a product, the internet is the preferred platform. Because of the large number of users and the large amount of data on the internet, people have come to prefer platforms where they can find what they need in as little time as possible. The only way to make the web intelligent in this sense is through personalization. Web personalization was introduced more than a decade ago, and many researchers have contributed to making this strategy as efficient and as convenient for the user as possible. Web personalization research draws on many related areas, including AI, machine learning, data mining and natural language processing. This report describes the development of web personalization, with a description of the processes that have made the technique popular and widespread. It also highlights the importance of this strategy, the benefits and limitations of the methods introduced within it, and how the approach has made the internet easier to use.

Index Terms— Web Personalization, Learning, Matching and Recommendation

(This work was supported by University of The West of Scotland.)

I. INTRODUCTION

In the early days of internet technology, people struggled to browse and find data matching their interests and needs because of the sheer richness of information available online. The concept of web personalization has to a very large extent enabled internet users to find the most appropriate information for their interests. It is one of the major contributions on the internet derived from the earlier concept of Adaptive Hypermedia, which became popular through its contribution to adaptive web-based hypermedia in teaching systems [1], [2] and [3]. Adaptive Hypermedia arose from observing the browsing habits of different users on the internet, where people faced great difficulty in choosing among the many links available at one time. Based on this observation, adaptive hypermedia was introduced to provide the most appropriate links to users based on their browsing habits. The concept became more popular when it was introduced in the area of educational hypermedia [4]. Web personalization [11], [22], [72], [16] is closely linked with adaptive hypermedia in that the former usually works on open corpus hypermedia, whereas the latter mostly worked on, and became popular with, closed corpus hypermedia. The basic objective of personalization is similar to that of adaptive hypermedia: to help users by giving them the most appropriate information for their needs. The reason web personalization has become more popular than adaptive hypermedia is its frequent implementation in commercial applications. Very few areas of the internet remain that this concept has not reached. Most areas have adopted it, including e-business [5], e-tailing [6], e-auctioning [7], [8] and others [9].
User adaptive services and personalization features both are basically designed for enabling users to reach their targeted needs without spending much time in searching. Web Personalization is divided into three main phases 1) Learning [10] 2) Matching [11] and 3) Recommendation [5], [12] as shown in Fig. 1 and Fig. 2 in detail. Learning is further subdivided into two types 1) Explicit Learning and 2) Implicit Learning. There is one more type of learning method mentioned most frequently nowadays by different researchers called behavioural learning [13] that also comes under the Implicit Learning category. The next stage is the matching phase. There is more than one type of matching or filtration techniques proposed by different researchers which primarily include 1) Content-Based Filtration [14] 2) Collaborative Filtration [15], [16] [17], [18] 3) Rule-Based Filtration [19] and 4) Hybrid Filtration [20] These prime categories further include sub-categories mentioned later that are based on the prior mentioned categories but are used to further enhance the performance and efficiency of this phase. There is still a lot of weaknesses to the efficiency and performance of this phase. Many new ideas are currently being proposed all over the world to further improve the performance in finding the nearest neighbours in the shortest possible time and producing more accurate results. The last phase is the recommendation [21] phase which is responsible for displaying the closest match to the interest and personalized choice of users. In this report a detailed review of web personalization is made by taking into account the following major points. 286 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 Figure 1. Three Stages of Web Personalization. Figure 2. Web Personalization Process. 1) What is Web Personalization? What are the main Building Blocks of Web Personalization? 2) What are the major techniques that are involved in each phase of web personalization? 3) Description of each phase with complete overview of all the major contributions that are made in each phase of web personalization. II. WHAT IS WEB PERSONALIZATION? Web Personalization can be defined as a process of helping users by providing customized or relevant information on the basis of Web Experience to a particular user or set of users [22]. A Form of user-to-system interactivity that uses a set of technological features to adapt the content, delivery, and arrangement of a communication to individual users explicitly registered and/or implicitly determined preferences [23]. One of the first and foremost companies who had introduced this concept of personalization was Yahoo in 1996 [24]. Yahoo has introduced this feature of personalizing the user needs and requirement by providing different facilitating products to its users like Yahoo Companion, Yahoo Personalized Search and Yahoo Modules. Yahoo experienced quite a number of challenges which include scalability issues, usability issues and large-scale personalization issues but summing up as whole find it quite a successful feature as far as the user needs and requirements were concerned. © 2012 ACADEMY PUBLISHER Similarly Amazon, one of the biggest companies in the internet market, summarizes the recommendation system with three common approaches 1) Traditional Collaborative Filtering 2) Cluster Modelling and 3) Search-Based Methods as described in [25]. 
Amazon has also incorporated this method of web personalization and the most well-known use of collaborative filtering is also done by Amazon as well. Amazon.com, the poster child of personalization, will start recommending needlepoint books to you as soon as you order that ideal gift for your great aunt.(http://www.shorewalker.com) Web Personalization is the art of customizing items responding to the needs of users. Due to the large amount of data on the internet, people often get so confused in reaching their correct destination and spend so much time in searching and browsing the internet that in the end they get disappointed and prefer to do their work using traditional means. The only way to help internet users is by providing an organized look to the data and personalizing the whole decoration of items to satisfy the individual’s desire and in doing this the only way is to embed features of web personalization. Everyday a user has a different mood when browsing the internet and based on that day’s particular interest the user browses the internet, but definitely a time comes when the interest starts becoming redundant day by day and at that particular situation if the historical transactional record [5], [12] is maintained properly and the user behaviour is recorded [13] properly then the company can take benefit in filtering the record based on a single user or a group of users and can recommend useful links according to the interest of the user. Web Personalization can also be defined as a recommendation system that performs information filtering. The most important layer on which this feature is strongly dependent is the data layer [26]. This layer plays a very important role in recommendation. The system which is capable of storing data from more than one dimension is able to personalize the data in a much better way. Hence the feature of web personalization has a pretty closed relationship with web mining Web Personalization is normally offered as an implicit facility to the user: whereas some websites considered it optional for the user, most websites do it implicitly without asking the user. The issues that are considered very closely while offering web personalization is the issue of high-scalability of data [27], lack of performance issues [19], correct recommendation issues , black box filtration issues [28], [29] and other privacy issues [30]. Black box filtration is defined as a scenario where the user cannot understand the reason behind the recommendation and is unable to control the recommendation process. It is very difficult to cover the filtration process for a large amount of data which includes pages and products while maintaining a correct prediction and performance JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 accuracy and this normally happens due to the sparsity of data and the incremental cost of correlation among the users [31], [32]. This feature has a strong effect on internet marketing as well. Personalizing users needs is a much better way of selling items without wasting much time. This feature further pushes the sales ratio and helps merchants convince their customers without confusing them and puzzling them [33]. The internet has now become a strong source of earning money. 
The first step towards selling any item or generating revenue involves marketing of that item and convincing the user that the items which are being offered are of a superior quality and nobody can give them this item with such a high quality and at such a low price. In order to make the first step closer to the user, one way is by personalizing the items for each user regarding his/her area of interest. It means personalization can easily be used to reduce the gap between any two objects which can be a user and a product, a user and a user, a merchant and consumer, a publisher and an advertiser [34], a friend and an enemy and all the other combinations that are currently operating with each other on the internet. In a recent survey conducted by [23] in City University London, it is found that personalization as a whole is becoming really very popular in news sites as well. Electronic News platforms such as WSJ.com, NYTimes.com, FT.com, Guardian.co.uk, BBC News online, WashingtonPost.com, News.sky.com, Telegraph.co.uk, theSun.co.uk, TimesOnline.co.uk and Mirror.co.uk which has almost completely superseded traditional news organizations are right now considered to have one of the highest user viewership platforms globally. Today news sites are highly looking towards these personalization features and trying to adopt both explicit and implicit ways that includes email newsletters, one-to-one collaborative filtering, homepage customization, homepage edition, mobile editions and apps, my page, my stories, RSS feeds, SMS alerts, Twitter feeds and widgets as a former and contextual recommendations/ aggregations, Geo targeted editions, aggregated collaborative filtering, multiple metrics and social collaborative filtering as ways for personalizing the information just to further attract a users attention and to enable users to view specific information according to their interest. Due to the increasing number of viewers day by day these news platforms are becoming one of the biggest sources of internet marketing as well and most of the advertisers from all over the world are trying very hard to offer maximum percentage in terms of PPC (Pay Per Click) and PPS (Pay Per Sale) strategy to place their advertisement on these platforms to increase their sales and to generate revenue. So it is once again proved that personalization is one of the most important features that give a very high support to internet advertising as well. While discussing internet advertising the most popular and fastest way to promote any product or item on the internet is through affiliate marketing [35]. Affiliate Marketing offers different methods as discussed in [36] © 2012 ACADEMY PUBLISHER 287 for the affiliates to generate revenue from the merchants by selling or promoting their items. Web Personalization is playing an important role in reducing the gap between affiliates and advertisers by facilitating affiliates and providing them an easy way of growing with the merchant by making their items sell in a personalized and specific way. With the growing nature of this feature it is proved as confirmed by [37] that the era of personalization has begun and further states that people what they want is a brittle and shallow civic philosophy. It is hard to guess what people really want but still researchers are trying to reach as close as possible. Further in this report the basic structure of web personalization is explained in detail. III. LEARNING This phase is considered one of the compulsory phases of web personalization. 
Learning is the first step towards the implementation of web personalization. The next two phases are totally dependent on this phase. The better this phase is executed, the better and more accurate the next two phases will execute. Different researchers have proposed different methods for learning such as Web Watcher in [38] which learns the user’s interest using reinforcement learning. Similarly Letizia in [39] behaves as an assistant for browsing the web and learns the user’s web behaviour in a conventional web browser. A system in [40] is described as a system that learns user profiles and analyses user behaviour to perform filtered net-news. Similarly in [41] the author uses re-inforcement learning to analyse and learn a user’s preferences and web browsing behaviour. Recent research in [42] proposed a method of semantic web personalization which works on the content structure and based on the ontology terms learns to recognize patterns from the web usage log files. Learning is primarily the process of data collection defined in two different categories as mentioned earlier 1) Explicit Learning and 2) Implicit Learning Which are further elaborated below:A. IMPLICIT LEARNING Implicit learning is a concept which is beneficial since there is no extra time consumption from the user point of view. In this category nobody will ask the user to give feedback regarding the product’s use, nobody will ask the user to insert product feedback ratings, nobody will ask the user to fill feedback forms and in fact nobody will ask the user to spend extra time in giving feedback anywhere and in any form. The system implicitly records different kinds of information related to the user which shows the user’s interest and personalized choices. The three most important sources that are considered while getting implicit feedback for a user includes 1) Reading time of the user at any web page 2) Scrolling over the same page again and again and 3) behavioural interaction with the system. 288 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 1) GEO LOCATIONS: Geolocation technology helps in finding the real location of any object. This is very beneficial as an input to a personalization system and hence most of the popular portals like Google implicitly store geographical location of each user using Google search engine and then personalize the search results for each user according to the geographical location of that user. This concept is becoming very popular in other areas of the internet as well which primarily includes internet advertising [43]. Due to the increase in the mobility of internet spatial information is also becoming pervasive on the web. These mobile devices help in collecting additional information such as context information, location and time related to a particular user’s transaction on the web [44]. Intelligent techniques [45], [46] and [47] are proposed by researchers to record the spatial information in a robust manner and this further plays an additional role of accuracy in personalizing the record of the user. It is evident from the fact that many services on the internet require collection of spatial information in order to become more effective with respect to the needs of the user. Services such as restaurant finders, hospital finders, patrol station finders and post office finders on the internet require spatial information for giving effective recommendations to the users. 
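To make the role of spatial information concrete, here is a minimal sketch (assumed, illustrative code, not taken from the cited systems) of how a location-aware recommender such as a restaurant finder might pre-filter candidate items by distance from the user before any further personalization. The haversine formula and the 5 km cut-off are arbitrary illustrative choices.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearby_items(user_location, items, radius_km=5.0):
    """Keep only candidate items within `radius_km` of the user, closest first,
    so that a later ranking step works on a location-filtered list."""
    lat, lon = user_location
    in_range = [(haversine_km(lat, lon, i["lat"], i["lon"]), i) for i in items]
    return [i for d, i in sorted(in_range, key=lambda pair: pair[0]) if d <= radius_km]

# Toy data: two candidate restaurants and a user position.
restaurants = [
    {"name": "Cafe A", "lat": 48.2086, "lon": 16.3739},
    {"name": "Diner B", "lat": 48.3100, "lon": 16.4000},
]
print([r["name"] for r in nearby_items((48.2082, 16.3738), restaurants)])  # ['Cafe A']
```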
2) BEHAVIORAL LEARNING: In this category the individual behaviour of a user is recorded by taking into consideration the click count of the user at a particular link, the time spent on each page and the search text frequency [13], [40]. Social networking sites are nowadays found to extract the behaviour of each individual and this information is used by many online merchants to personalize pages in accordance with the extracted information retrieved from social sites. An adaptive web is mostly preferred nowadays which changes with time. In order to absorb the change, the web should be capable enough to record user’s interest and can easily adapt the ever increasing changes with respect to the user’s interest in terms of buying or any other activity on the web. Many interesting techniques have been proposed to record user’s behaviour [48], [49] and adapt with respect to the changes by observing the dynamic behaviour of the user. 3) CONTEXTUAL RELATED INFORMATION: There are many organizations like ChoiceStream, 7 Billion People, Inc, Mozenda and Danskin that are working as product development companies and are producing web personalization software that can help online merchants filter records on the basis of this software to give personalized results to their users. Some of these companies are gathering contextual related information from various blogs, video galleries, photo galleries and tweets and based on these aggregated data are producing personalized results. Apart from this since the origin of Web 2.0 the data related to users is becoming very sparse and many learning techniques are proposed by different researchers to extract useful information from this high amount of data by taking into account the tagging behaviour of the user, the collaborative ratings of the user and to record © 2012 ACADEMY PUBLISHER social bookmarking and blogging activities of the user [50], [51]. 4) SOCIAL COLLABORATIVE LEARNING: Online Social Networking and Social Marketing Sites [52] are the best platforms to derive a user’s interest and to analyse user behaviour. Social Collaborative filtering records social interactions among people of different cultures and communities involved together in the form of groups in social networking sites. This clustering of people shows close relationship among people in terms of nature and compatibility among people. Social Collaborative Learning systems learn a user’s interests by taking into account the collaborative attributes of people lying in the same group and give benefit to their users from these socially collaborative data by personalizing their needs on the basis of the filtered information they extract from these social networking sites. This social networking site introduces many new concepts that portray the feature of web personalization like facebook Beacon introduced by Facebook but removed due to privacy issues [53]. 5) SIMULATED FEEDBACKS: This is the latest concept discussed by [54] and [55] in which the researchers have proposed a method for search engine personalization based on web query logs analysis using prognostic search methods to generate implicit feedback. This concept is the next generation personalization method which the popular search engines like Google and yahoo can use to extract implicitly simulated feedbacks from their user’s query logs using AI methods and can personalize their retrieval process. This concept is divided into four steps 1) query formulation 2) searching 3) browsing the results and 4) generating clicks. 
The query formulation works by selecting a search session from user’s historical data and sending the queries sequentially to the search engine.The second steps involves retrieval of data based on the query selected in the previous step.The browsing result session is the most important step in which the patience factor of the user is learned based on the number of clicks per session, maximum page rank clicked in a session, time spent in a session and number of queries executed in each session.The last step is the scoring phase based on the number of clicks the user made on each link in every session. This is one of the dynamic ways proposed to get simulated feedback based on insight from query logs and using artificial methods to generate feedbacks. B. EXPLICIT LEARNING Explicit Learning methods are considered more expensive in terms of time consumption and less efficient in terms of user dependency. This method includes all possible ways that merchants normally adopt to explicitly get their user’s feedback in the form of email newsletter, registration process, user rating, RSS Twitter feeds, blogs, forums and getting feedbacks through widgets. Through explicit learning sometimes the chance of error becomes greater. Error arises because sometimes the user is not in a mood to give feedback and therefore enters bogus information into the explicit panel [56]. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 1) EMAIL NEWSLETTER: This strategy of being in touch with your registered users is getting very popular day by day [23] and [57]. The sign up process for this strategy will help the merchant find their user’s interest by knowing which product’s update the user want in his/her mailbox regularly. This strategy is the best way of electronic marketing as well as finding the interest of your customers. There are many independent companies like Aweber and Getresponse that are offering this service to most merchants on the internet and people are getting a lot of benefits in terms of revenue generation and building a close personalized relationship with their customers. Tools like iContact [58]have a functionality to do message personalization as well. Message personalization is a strategy through which certain parameters in the email’s content can be generalized and is one more quick and personalized way of explicitly getting feedback by just writing one generic email for all the users. 2) PREFERENCE REGISTRATION: This concept is incorporated by content providing sites such as news sites to get user preferences through registration for the recommendation of content. Every person has his own choice of content view so these news sites have embedded a content preference registration module where a user can enter his/her preference about the content so that the system can personalize the page in accordance with the preferences entered by the users. Most web portals create user profiles using a preference registration mechanism by asking questions of the user during registrations that identify their interest and reason for registering but on the other hand these web portals also have to face various security issues in the end as well [59]. The use of a web mining strategy has reduced this technique of preference registration system [60]. 3) SMS REGISTRATION: Mobile SMS service is being used in many areas starting from digital libraries [61] up to behavioral change intervention in health services [62] as well. 
Today mobile technology is getting popular day by day and people prefer to get regular updates on mobiles instead of their personal desktops inbox. Buyers who till now only expect location-based services through mobile are also expecting time and personalization features in mobile as well [63]. Most websites like Minnesota West are offering SMS registration through which they can get personalized interests of their users explicitly and can send regular updates through SMS on their mobiles regarding the latest news of their products and packages. 4) EXPLICIT USER RATING: Amazon, one of the most popular e-commerce based companies on the internet has incorporated three kinds of rating methods 1) A Star Rating 2) A Review and 3) A Thumbs Up/ Down Rating. The star rating helps the customers judge the quality of the product. A Review rating shows the review of existing customer after buying the product and a Thumbs Up/Down Rating gives the customer’s feedback after reading the reviews of other people related to that product. These explicit user rating methods are one of the biggest sources to judge customer’s needs and desires © 2012 ACADEMY PUBLISHER 289 about the product and Amazon is using this information for personalization purposes. Explicit user rating plays a vital role in identifying user’s need but extra time consumption of this process means that sometimes the user feels very uncomfortable to do it or sometimes the user feels very reluctant in doing it unless and until some benefit is coming out of it [64]. However still websites have incorporated this method to gather data and identify user’s interest. 5) RSS TWITTER FEEDS: RDF Site Summary is used to give regular updates about the blog entries, news headlines, audio and video in a standard format. RSS Feeds help customers get updated information about the latest updates on the merchant’s site. Users sometimes feel very tired searching for their interest related articles and this RSS Feeding feature help users by updating them about the articles of interest.To them this feature of RSS Feeding is very popular among content-oriented sites such as News sites and researchers are trying to evolve techniques to extract feedbacks from these RSS feeds for recommendations [65]. This concept is also being used by many merchants for the personalization process by getting user’s interests with regards to the updates a user requires in the form of RSS. Similarly the twitter social platform is becoming very popular in enabling the user to get updated about the latest information. Most merchants’ sites are offering integration with a user’s twitter account to get the latest feeds of those merchants’ product on the individual’s twitter accounts. Almost 1000+ tweets are generated by more than 200 million people in one second which in itself is an excellent source for recommender systems [66].These two methods are also used by many site owners especially news sites so that they can use this information for personalizing the user page. 6) SOCIAL FEEDBACK PAGES: Social feedback pages are those pages which companies usually build on social-networking sites to get comments from their customers related to the discussion of their products. These product pages are also explicitly used by the merchants to derive personalized interest of their users and to know the emotions of their customers with their products [67]. 
It has now became a trend that every brand, either small or large before introducing itself into the market, first uses the social web to get feedback about their upcoming brand directly from the user and then based on the feedback introduce their own brand into the market [68]. Although the information on the page seems to be very large and raw but still it is considered a very useful way to extract user’s individual perception regarding any product or service. 7) USER FEEDBACK: User Feedback plays a vital role to get a customer’s feedback about the company’s quality of services, quality of products offered and many other things. This information is collected by most merchants to gather a user’s interests so that they can give a personalized view of information to that user next time when the user visits their site.It is identified in [69] that most of the user feedbacks are differentiated in terms of 290 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 explicitness, validity and acquisition costs. It is identified that especially for new users explicit customer requirements as in [70] are also considered as a useful source of user feedbacks for personalization. Overall user feedbacks plays a very important role for recommendation but it is proved that in most of the systems, gradually the explicit user feedbacks decreases with time and sometimes it shows a very negative effect on a user’s behaviour [71]. 8) BLOGS AND FORUMS: Blogs and forums play a vital role in creating a discussion platform where a user can share his views about the product or services he has purchased online. Most e-companies offer these platforms to their customers where customers explicitly give their feedback regarding the products by participating in the forum or by giving comments on articles posted by the vendor related to the products or services. This information is used by the merchant for personalizing their layouts on the basis of user feedbacks from these additional platforms. Semantic Blogging Agent as in [72] is one of the agents proposed by researchers that works as a crawler and extracts semantic related information from the blogs using natural language processing methods to provide personalized services. Blogging is also very popular among mobile users as well. Blogs not only contains the description of various products, services, places or any interesting subject but also contains user’s comments on each article and with mobile technology the participation ratio has increased a lot. Researchers have proposed various content recommendation techniques in blogs for mobile phone users as in [73] and [74]. IV. MATCHING The matching module is another important part of web personalization. The matching module is responsible for extracting the recommendation list of data for the target user by using an appropriate matching technique. Different researchers have proposed more than one matching criterias but all of them lie under three basic categories of matching 1) Content-Based Filtration Technique 2) Rule-Based Reasoning Technique and 3) Collaborative Filtration Technique. A. CONTENT-BASED FILTRATION TECHNIQUE Content-Based filtration approach filters data based on a user’s previous liking based stored data. There are different approaches for the content-based filtration technique. Some merchants have incorporated a rating system and ask customers to rate the content and based on the rating of the individual, filter the content next time for that individual [75]. 
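As a rough illustration of this rating-driven content filtration, the sketch below (hypothetical code; the rating threshold, tokenization and scoring are simplifications) builds a term profile from the items a user rated highly and scores unseen items by how many profile terms they contain. A real system would also remove stop words, apply stemming and weight terms, as discussed next.

```python
from collections import Counter

def build_profile(rated_items, threshold=4):
    """Aggregate the words of items rated at or above `threshold`
    into a term-frequency profile representing the user's interests."""
    profile = Counter()
    for text, rating in rated_items:
        if rating >= threshold:
            profile.update(text.lower().split())
    return profile

def score(profile, text):
    """Score an unseen item by how strongly its words overlap the profile."""
    return sum(profile[w] for w in set(text.lower().split()))

rated = [("city team wins the football cup final", 5),
         ("parliament debates the new tax bill", 1)]
profile = build_profile(rated)
candidates = ["football transfer news for the city team", "tax bill passes parliament"]
print(max(candidates, key=lambda c: score(profile, c)))  # 'football transfer news for the city team'
```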
There is more than one content-based page segmentation approach introduced by researchers through which the page is divided into smaller units using different methods. The content is filtered in each segment and then the decision is made whether this segment of the page is incorporated in the filtered page or not [76], [77]. Content-based filtration technique is feasible only if there is something stored on the basis of content that © 2012 ACADEMY PUBLISHER shows the user’s interest for e.g. it is easy to give a recommendation for the joke about a horse out of many horse related jokes stored in the database on the basis of a user’s previous liking but it is impossible to extract the funniest joke out of all the jokes related to horses; for that one has to use collaborative filtration technique. In order to perform content filtration the text should be structured but for both structured and unstructured data one has to incorporate the process of stemming [78] especially news sites which contains news articles which are example of unstructured data. There are different approaches used for content filtration as mentioned in figure 3. Figure 3. Methods Used in Content Filtration Technique. 1) USER PROFILE: The profile of user plays a vital role in content filtration [79]. The profiles mainly consist of two important pieces of information. 1) The first consists of the user’s preferred choice of data. A user profile contains all the data that shows a user’s interest. The record contains all the data that shows a user’s preference model. 2) Secondly it contains the historical record of the user’s transactions. It contains all the information regarding the ratings of users, the likes and dislikes of the users and all the queries typed by the user for record retrieval. These profiles are used by the content filtration system [80] for displaying a user’s preferred data which will be personalized according to the user’s interest. 2) DECISION TREE: A decision tree is another method used for content filtration. Decision tree is created by recursively partitioning the training data as in [81]. In decision trees a document or a webpage is divided into subgroups and it will be continuously subdivided until a single type of a class is left. Using decision trees it is possible to find the interests of a user but it works well on structured data and in fact it is not feasible for unstructured text classification [82]. 3) RELEVANCE FEEDBACK: Relevance feedback [83] and [84] is used to help users refine their queries on the basis of previous search results. This method is also used for content filtration in which a user rates the documents returned by the retrieval system with respect to their interest. The most common algorithm that is used for relevance feedback purposes is Rocchio’s algorithm [85]. Rocchio’s algorithm maintains the weights for both relevant and non-relevant documents retrieved after the execution of the query and on the basis of a weighted sum incrementally move the query vector towards the cluster of relevant documents and away from irrelevant documents. 4) LINEAR CLASSIFICATION: There are numerous linear classification methods [86], [87] that are used for JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 text categorization purposes. In this method the document is represented in a vector space. 
The learning process will produce an output of n-dimensional weight vector whose dot product with an n-dimensional instance produces a numeric score prediction that leads to a linear regression strategy. The most important benefit of these linear classification approaches is that they can be easily learned on an incremental basis and can easily be deployed on web. 5) PROBABILISTIC METHODS: This is one more technique used for text classification and the method primarily used in it is the Naive Bayesian Classifier [88]. The two most common methods of Bayesian Classifier that are used for text classification are the multinomial model and multivariate Bernoulli as described in [89]. Some probabilistic models are called generative Models. B. COLLABORATIVE FILTERING Most online shops store records related to the buying of products by different customers. It is true that many products can be bought by many customers and it is also true that a single product can be bought by more than one customer but in order to predict which product the new customer should buy it is important to know the number of products that have been bought by other customers with the same background and choice and for this purpose collaborative filtration is performed. Collaborative filtration [27], [90], [91] is the process through which one can predict based on collaborative information from multiple users the list of items for the new users. Collaborative Filtration has some limitations as well that come with the increase in the number of items because it is very difficult to scale this technique to high volume of data while maintaining a reasonable prediction accuracy however apart from these limitations collaborative filtering is the most popular technique that is incorporated by most merchants for personalization. Many collaborative systems are designed on the basis of datasets on which these systems have to be implemented. The collaborative system designed for one dataset where there are more users than items may not work properly for any other type of datasets. The researchers in [92] perform a complete evaluation of collaborative systems with respect to the datasets being used, the methods of prediction and also perform a comparative evaluation of several different evaluation metrics on various nearestneighbour based collaborative filtration algorithms. There are different approaches used for collaborative filtration as mentioned in Fig. 4. Figure 4. Methods Used in Collaborative Filtration Technique. © 2012 ACADEMY PUBLISHER 291 1) MODEL-BASED APPROACH: Model-based approaches such as [93] classify the data based on probabilistic hidden semantic associations among co-occurring objects. Model-based approaches divide the data into multiple segments and based on a user’s likelihood [94] move the specific data into atleast one segment based on the probability and threshold value. Most of the modelbased approaches are computationally very expensive but most of them gather user’s interest and classify them into multiple segments [95], [96] and [97]. 2) MEMORY-BASED APPROACH: Clustering Algorithms such as K-means [98] are considered as the basis for memory-based approaches. The data is clustered and classified based on local centroid of each cluster. Most of the collaborative filtration techniques such as [99] work on user profiles based on their navigational patterns. Similarly [100] performs clustering based on sliding window of time in active sessions and [101] presents a fuzzy concept for clustering. 
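The following minimal sketch (toy data and cosine similarity chosen for illustration, not a reproduction of the cited systems) shows the memory-based idea just described: a target user's likely interest in an unseen item is predicted as a similarity-weighted average of the ratings given by the most similar users.

```python
from math import sqrt

ratings = {  # user -> {item: rating}; toy data for illustration only
    "alice": {"camera": 5, "laptop": 4, "novel": 1},
    "bob":   {"camera": 4, "laptop": 5, "tablet": 4},
    "carol": {"novel": 5, "tablet": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(u[i] ** 2 for i in common)) * sqrt(sum(v[i] ** 2 for i in common)))

def predict(user, item, k=2):
    """Similarity-weighted average of the ratings that the k most similar
    neighbours who have rated `item` gave to that item."""
    neighbours = [(cosine(ratings[user], r), r[item])
                  for name, r in ratings.items() if name != user and item in r]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(s for s, _ in neighbours)
    return sum(s * r for s, r in neighbours) / total_sim if total_sim else None

print(round(predict("alice", "tablet"), 2))  # ~2.99: a weighted blend of bob's 4 and carol's 2
```

In practice the neighbourhood is found over sparse, clustered profiles (for example via K-means as cited above) rather than by comparing every pair of users, which is what makes the approach scale.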
3) CASE-BASED APPROACH: Most of the times one problem has one solution which represents a case in the case-based reasoning approach. In case-based reasoning [102] if a new customer comes and needs a solution to his/her problem then depending upon the previously stored problems that are linked with at least one case solution, the one which is nearest to the customer’s problem will be considered as the case solution to his/her problem. Case-based recommender are closer to user requirements and work more efficiently and intelligently than normal filtering approaches in a way that every case works as a perfect match for a subset of users and so the data for consideration becomes less as compared to normal filtration approaches which resulted in an increase in performance as well as accuracy. Overall case-based reasoning always helps in improving the quality of recommendations [103]. 4) TAG-BASED APPROACH: A Tag-based approach as in [104] was introduced in collaborative filtering to increase the accuracy of the CF process. Usually two persons like one item based on different reasons such as one person may like a product as he is finding that product funny whereas another user likes that item as he is finding that product entertaining, so a tag is an extra-facility to write a user’s views in one or two short words in the form of a tag that shows his/her reason for his interest and will help in finding the similarity and dissimilarity among user’s interest using collaborative filtration. tagbased filtration sometimes are dependent on additional factors such as popularity of tag, representation of tag and affinity between user and tags [105]. 5) PERSONALITY BASED APPROACH: A Personality-based approach [106] was introduced to add the emotional attitude of the users to the collaborative filtration process which became further useful in reducing the high computational processing in calculating the similarity matrix for all users. User attitude plays a vital role in deriving the likes and dislikes of users so by using a big five personality model [106] the researcher explicitly derive the interest of the user that makes the 292 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 collaborative filtration process more robust and accurate. 6) RULE-BASED FILTRATION: This approach is one more method that is used for personalization purposes. The concept of rule-based approach is elaborated as all the business rules that are created by merchants either on the basis of transactions or on the basis of expert policies to further facilitate or create attraction in their online business. Rule-based approach such as a merchant offers gold membership, silver membership or bronze membership to its customers based on specific rules. Similarly a merchant offers discount coupons to its customers who make purchases on weekends. These rule-based approaches [107] are created in different ways as templatedriven rule-based filtering approach, interestingness-based rule filtering approach, similarity-based rule-filtering approach and incremental profiling approach. Rules are also identified using mining rules as Apriori [108] which is used to discover association rules; similarly Cart is a decision tree [109] used to identify classification rules. The only limitation in the rule-based approach is the creation of invalid or unreasonable rules just on one or two transactions which makes the data very sparse and complex to understand. 
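To illustrate how such rules can be mined from transactions, the sketch below (illustrative code only; it enumerates single-item antecedents, whereas Apriori-style algorithms such as [108] handle arbitrary itemsets) derives association rules that clear minimum support and confidence thresholds. The support threshold is exactly what guards against the limitation noted above, namely rules backed by only one or two transactions.

```python
from itertools import combinations
from collections import Counter

transactions = [  # toy purchase baskets, for illustration only
    {"camera", "memory card"}, {"camera", "memory card", "tripod"},
    {"camera", "tripod"}, {"laptop", "mouse"}, {"laptop", "mouse"},
]

def association_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return rules A -> B (single-item antecedent and consequent) whose
    support and confidence clear the given thresholds; rules supported by
    very few baskets are discarded by the support threshold."""
    n = len(baskets)
    item_count = Counter(i for b in baskets for i in b)
    pair_count = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))
    rules = []
    for pair, c in pair_count.items():
        if c / n < min_support:
            continue  # too few supporting transactions to trust the rule
        a, b = tuple(pair)
        for ante, cons in ((a, b), (b, a)):
            confidence = c / item_count[ante]
            if confidence >= min_confidence:
                rules.append((ante, cons, c / n, confidence))
    return rules

for ante, cons, sup, conf in association_rules(transactions):
    print(f"{ante} -> {cons}  support={sup:.2f} confidence={conf:.2f}")
```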
A rule-based approach is very much dependent on the business rules and a sudden change in any rule will have a very high impact on the whole data as well. 7) HYBRID APPROACHES: A single technique is not considered enough to give a recommendation taking into account all the dynamic scenarios for each user. It is true that each user has his own historical background and his own list of likes and dislikes. Sometimes one method of filtration is not enough for one particular case for example collaborative filtration process is not beneficial for a new user with not enough historical background but is proved excellent in other scenarios, similarly a contentbased filtration process is not feasible where a user has not enough data associated with it that shows his likes or dislikes. Taking into account these scenarios researchers have proposed different hybrid methods [26] that include more than one technique [110] for filtration to be used for personalization purpose which could be used on the basis of a union or intersection for recommendation. WEIGHTED APPROACH: In this approach [111] the results of more than one method for filtration are calculated numerically for recommendation. MIXED APPROACH: In this approach [112] the results of more than one approach are displayed based on ranking and the recommender’s confidence of recommendation. SWITCHING APPROACH: In this approach [21] more than one method for filtrations is used in a way that if one is unable to recommend with high confidence the system will be switched to the second method for filtration and if the second as well is unable to recommend with high confidence, the system will switch to the third recommender. FEATURE AUGMENTATION: In this approach [113] a contributing recommendation system is augmented with the actual recommender to increase its performance in terms of recommendation. © 2012 ACADEMY PUBLISHER CASCADING: In this approach [114] the primary and secondary recommenders are organized in a cascading way such that on each retrieval both recommenders break ties with each other for recommendation. V. RECOMMENDATION Recommendation is considered the final phase of personalization whose performance and work is dependent wholly upon the previous two stages. Recommendation is the retrieval process which functions in accordance with the learning and matching phase. The review of all the methods which are discussed in learning and matching phase recommendation is primarily and conclusively based on four main methods that include contentbased recommendation, collaborative-based recommendation, knowledge-based recommendation and based on user-demographics or user demographic profiles. VI. FUTURE DIRECTIONS The overall objective of reviewing the whole era of web personalization is to realize its importance in terms of the facilities it provides to the end-users as well as giving a precise overview of the list of almost all the methods that have been introduced in each of its phases. One more important aim of this review is to give a brief overview of web personalization to those researchers working in other areas of the internet so that they are able to use this feature to evolve some intelligent solutions which match human needs in their areas as well. Some of the highlighted areas of the internet for future directions with respect to web personalization are:1)Internet marketing is the first step towards any product or service recognition on the internet. 
Through web personalization one is able to judge to some extent the browsing needs of the user and if a person is able to see advertisements of those products or services which he/she is looking for then the chances of that person’s interest in buying or even clicking that advertisement’s link will rise. Researchers are already trying to use personalizing features for doing improved social web marketing as in [52] and helping customers in decision making using web personalization [115]. 2)Internet of Things [69] is a recent development of internet. The internet of things will make all the identifiable things communicate with each other wirelessly. This concept of web personalization can offer many applications to IoT (Internet of Things) like personalizing things to control and communicate as per users interest, helping the customer in selecting the shop within a pre-selected shopping list, guidance in interacting with things of the user related to their interest and enabling the things learn from users personalized behaviour. 3)Affiliate Networks are the key platform for both the publishers and advertisers to interact with each other. There is a huge gap [34] between the publisher and advertiser in terms of selecting the most appropriate choice based on similarity. This gap can be reduced JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 3, AUGUST 2012 using web personalization by collaboratively filtering the transactional profiles of publisher and advertiser and giving recommendations on the basis of best match to both of them. 4)The future of mobile networking also requires personalization, ambient awareness and adaptability [116] in its services. All services need to be personalized for each individual in his or her environment and in accordance with his or her preferences and different services should be adapted on a real time basis. In other words in every field of life which includes aerospace and aviation, automotive systems, telecommunications, intelligent buildings, medical technology, independent living, pharmaceutical, retail, logistics, supply chain management, processing industries, safety, security and privacy requires personalization in them to enable these technologies more user specific and in compatible with human needs. VII. C ONCLUSION Every second there is an increment of data on the web. With this increase of data and information on the web the adoption of web personalization will continue to grow unabated. This trend has now become a need and with the passage of time this trend will enter every field of our life and so in the future we will be provided with everything that we actually require. In this paper we have briefly describe the various research carried out in the area of web personalization. This paper also states how the adoption of web personalization is essential for users to facilitate, organize, personalize and to provide exactly needed data ACKNOWLEDGMENT The authors would like to gratefully acknowledge the careful reviewing of an earlier version of this paper which has greatly improved the paper. R EFERENCES [1] P. Brusilovsky and C. Peylo, “Adaptive and intelligent web-based educational systems,” Int. J. Artif. Intell. Ed., vol. 13, no. 2-4, pp. 159–172, Apr. 2003. [2] W. N. Nicola Henze, “Adapdibility in the kbs hyperbook system,” in Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, 1999. [3] N. Henze and W. 
Zeeshan Khawar Malik is currently a PhD candidate at the University of the West of Scotland. He received his MS and BS (honors) degrees from the University of Central Punjab, Lahore, Pakistan, in 2003 and 2006, respectively. He is an Assistant Professor at the University of the Punjab, Lahore, Pakistan, currently on leave for his PhD studies.

Colin Fyfe is a Personal Professor at the University of the West of Scotland. He has published more than 350 refereed papers and has been Director of Studies for 23 PhDs. He is on the editorial boards of 6 international journals and has been a Visiting Professor at universities in Hong Kong, China, Australia, Spain, South Korea and the USA.

An International Comparison on Need Analysis of Web Counseling System Design
Takaaki GOTO
Graduate School of Informatics and Engineering, The University of Electro-Communications, Japan
Email: gototakaaki@uec.ac.jp
Chieko KATO, Futoshi SUGIMOTO, Kensei TSUCHIDA
Department of Information Sciences and Arts, Toyo University, Japan
Email: {kato-c, f_sugi, kensei}@toyo.jp
doi:10.4304/jetwi.4.3.297-300

Abstract— More and more people are working abroad, and the number of them who suffer from mental health problems is increasing. Because it is often difficult for them to find a place to get treatment, online counseling is becoming increasingly popular. We are building an online web counseling system to support the mental health of people working abroad. The design of such a site matters: the screen is all that users can see, and visitors who may be wary or anxious should feel comfortable enough to try online counseling, so the colors shown on the screen must be chosen with care. We surveyed how online counseling is perceived in different countries and found that the preferred colors for a web counseling site differ depending on the user's country. For web counseling, color is clearly important.

Index Terms— web counseling, design, analysis of variance, international comparison

I. INTRODUCTION

According to figures released by the Ministry of Foreign Affairs, the number of Japanese people living abroad has topped one million since 2005 and is increasing year by year [1]. There are also many foreigners living in and visiting Japan, especially from the U.S. and China [2], [3]. People living in foreign countries face many problems, such as language, culture, human relationships and having to live alone away from their families, and these problems can cause mental health issues. We organized a project team to support people working abroad. Our team includes professionals who help with legal matters, computer technology, psychological problems and nursing [4], [5]. Online counseling is very useful because it can be accessed by people living far away and in their own language. For example, Chinese and English speaking people living in Japan can get help even if they don't speak Japanese.
Japanese people living abroad in remote areas where there are no psychiatrists or psychologists can also seek counseling online.

There are two reasons why design is so important, and both must be taken into account when designing a web counseling system. The first is that counseling is different from other online activities such as shopping or playing games. Counseling users have different needs from ordinary shoppers or casual online users, so the system and its screens must be designed differently; a conventional web site design may not be suitable for counseling. Past research also shows that people with mental health problems look for a particular kind of design, different from the designs that attract other users [6]. Online shopping sites, for example, use designs meant to attract people instantly and only for a moment, whereas counseling is not a spur-of-the-moment activity: people need to feel assured that they can continue it for a certain period of time, and they need to feel confident that they can trust their counselors. The second reason is that there are many foreigners both in Japan and in other countries, and what users consider an effective design differs from country to country. In particular, preferred colors have been found to differ by country. In a study covering 20 countries including Japan, China, Germany, and the U.S., people were asked which colors they like best and feel most familiar with [7]. The color Japanese and American people like best is bright purple, followed by bright red. Chinese people like white best, then bright purple. The color most familiar to Japanese people is bright red, then white; Chinese people are most familiar with bright red, then bright orange; and American people are most familiar with bright red and bright purple. Countries thus have different color preferences. We therefore conducted a survey in each country to find out which colors are appropriate for online counseling. In Japan, China and the U.S., both the background color and the text color were investigated, and the data were analyzed statistically.

II. DETAILS OF SURVEY

There were 122 subjects aged 18 to 25: 39 Japanese, 52 Chinese and 31 Americans. The survey was conducted in January 2010 in Japan, June 2010 in China, and September 2010 in the U.S. In the survey, people were shown five different colors for backgrounds and for text, and their impressions were rated on 6 levels from very good to very bad. Figures 1 and 2 show what the survey looked like.

Figure 1. An example of background color of the Web counseling system (beige).
Figure 2. An example of background color of the Web counseling system (blue).

The subjects were asked which colors they would prefer if they were actually going to use online counseling. They looked at all five colors at the same time, printed on paper, with the colors on the left side and the rankings from very good to very bad on the right side. An analysis was also made of the size of the letters. The background and text colors were selected for their effectiveness in helping people feel relaxed ([8], [9]). The colors were represented as follows: Red: #de424c, Blue: #006b95, Purple: #5f3785, Beige: #ceb59f, Green: #008f59.
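The survey materials and responses lend themselves to a very simple data encoding. The sketch below is not from the paper: it only illustrates the five stimulus colors with the hex values listed above and one hypothetical long-format record type for a single 6-point judgment, with all field names being illustrative assumptions.

# Minimal sketch (not the authors' code): the stimulus palette and one way to record
# a single judgment so that responses can later be grouped by country and color.
from dataclasses import dataclass

# Hex values as listed in the survey description above.
PALETTE = {
    "red": "#de424c",
    "blue": "#006b95",
    "purple": "#5f3785",
    "beige": "#ceb59f",
    "green": "#008f59",
}

@dataclass
class Rating:
    country: str    # "Japan", "China", or "U.S."
    subject: int    # anonymous subject id (illustrative)
    stimulus: str   # "background" or "text"
    color: str      # key into PALETTE
    score: int      # 1 (very bad) .. 6 (very good)

# Example record: a Japanese subject rating the beige background as good.
sample = Rating(country="Japan", subject=1, stimulus="background", color="beige", score=5)
print(sample.color, PALETTE[sample.color], sample.score)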
The intended effects of these colors are summarized in Table I ([8], [9], [10]).

TABLE I. AN EXAMPLE OF THE MEANING OF EACH COLOR.
Red: makes people feel more energetic and gives them a better feeling, but it can also make them feel nervous and more aggressive.
Blue: helps people feel relaxed and calm, but it can also make them feel cold and lonely.
Green: helps people feel peaceful and has a healing effect, but it can also make them feel selfish and lazy.
Purple: gives people a noble feeling; it is a mysterious color that makes people think deeply, but it is not realistic and can make people feel vague and uneasy.
Beige: helps people relax and is thought of as sincere, but many people feel that beige is conservative and unattractive.

III. ANALYSIS AND CONSIDERATION OF RESULTS

Background color and country were used as the two factors in a two-way ANOVA layout. Only the colors that showed a significant difference are considered here. Although no main effect of country was found, the main effect of background color (F(4, 460) = 36.42, p < .01) and the interaction of country and background color (F(8, 460) = 5.09, p < .01) were significant. We therefore examined whether each background color differed among the three countries. For red, the overall impression was not good. People in the U.S. (M = 2.27, SD = 0.94) rated red even worse than people in Japan (M = 3.19, SD = 1.33; the difference between the U.S. and Japan was significant in the multiple comparison, p < .05) or China (M = 3.76, SD = 1.43; the difference between the U.S. and China was significant, p < .01). Figure 3 shows the results for red.

Figure 3. Evaluation for background color red.

Red was a familiar color in all three countries, but as a background color for online counseling it was not preferred, which does not match previous research. Especially in the U.S., red was not suitable as a background color for online counseling.

Next, the analysis for text color is explained. Text color and country were used as the two factors in a two-way ANOVA layout. The main effect of country (F(2, 116) = 4.68, p < .05), the main effect of text color (F(4, 464) = 54.20, p < .01) and the interaction of text color and country (F(8, 464) = 4.82, p < .01) were all significant. We therefore examined whether each text color differed among the three countries. Significant differences were found for text written in red and in green. Red text was rated best in the U.S. (M = 5.0, SD = 0.93) compared to Japan (M = 3.6, SD = 1.14; the difference between the U.S. and Japan was significant, p < .01) and China (M = 4.72, SD = 1.10; the difference between China and Japan was significant, p < .05). Figure 4 shows the results for text written in red.

Figure 4. Evaluation for text color red.

Text written in green was rated best in China (M = 4.57, SD = 1.04; the difference between China and Japan was significant, p < .05) and the U.S. (M = 4.55, SD = 1.04), compared to Japan (M = 3.76, SD = 1.36; the difference between the U.S. and Japan was significant, p < .05). Figure 5 shows the results for green.

Figure 5. Evaluation for text color green.

An analysis was also made of letter size, but no significant difference was found.
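As a rough guide to reproducing this kind of analysis, the sketch below runs a two-way ANOVA with an interaction term and a follow-up pairwise comparison using statsmodels. It is only an illustration: the file name and column names ("country", "color", "rating") are assumptions, the repeated-measures structure of the paper's design is ignored for brevity, and Tukey's HSD stands in for whichever multiple-comparison procedure the authors actually used.

# Minimal sketch (not the authors' analysis code): two-way ANOVA of rating by country
# and color, plus a per-color pairwise comparison between countries.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Assumed long-format table: one row per (subject, color) judgment with columns
# country, color, rating (1..6).
data = pd.read_csv("ratings.csv")

# Main effects of country and background color plus their interaction.
model = smf.ols("rating ~ C(country) * C(color)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))

# Follow-up comparison of countries for the red background only, mirroring the
# per-color country comparisons reported above (Tukey HSD used here as a stand-in).
red = data[data["color"] == "red"]
print(pairwise_tukeyhsd(red["rating"], red["country"], alpha=0.05))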
IV. CONCLUSION

A survey was conducted to evaluate designs for online counseling, comparing three countries: Japan, the U.S. and China. The results were analyzed statistically. Color preferences for online counseling differ from color preferences in general, and when red is used as a background or text color, the differences among countries must be considered carefully. In future research, we would like to survey many more people living in different countries. The desirable amount of time spent on online counseling, as well as age and gender, will also be considered when designing an online counseling web site.

REFERENCES
[1] The Ministry of Foreign Affairs of Japan, "Annual report of statistics on Japanese nationals overseas," (in Japanese). [Online]. Available: http://www.mofa.go.jp/mofaj/toko/tokei/hojin/index.html
[2] The Ministry of Justice, "About number of foreign residents," (in Japanese). [Online]. Available: http://www.moj.go.jp/nyuukokukanri/kouhou/nyuukokukanri01_00013.html
[3] Japan National Tourism Organization, "Changing numbers of foreign visitors," (in Japanese). [Online]. Available: http://www.jnto.go.jp/jpn/reference/tourism_data/pdf/marketingdata_tourists_after_vj.pdf
[4] Chieko Kato, Yasunori Shiono, Takaaki Goto, and Kensei Tsuchida, "Development of online counseling system and usability evaluation," Journal of Emerging Technologies in Web Intelligence, vol. 3, no. 21, pp. 146–153, 2011.
[5] Takaaki Goto, Chieko Kato, and Kensei Tsuchida, "GUI for online counseling system," Journal of the Visualization Society of Japan, vol. 30, no. 117, pp. 90–95, 2010, (in Japanese).
[6] Akito Kobori, Chieko Kato, Nobuo Takahashi, Kensei Tsuchida, and Heliang Zhuang, "Investigation and analysis of effective images for on-line counseling system," Proceedings of the 2009 IEICE Society Conference, p. 149, 2009, (in Japanese).
[7] Hideaki Chijiiwa, "International comparison of emotion for colors," Journal of Japanese Society for Sensory Evaluation, vol. 6, no. 1, pp. 15–19, 2002, (in Japanese).
[8] Yoshinori Michie, "Psychology and color for comfort," Re, vol. 26, no. 1, pp. 26–29, 2004, (in Japanese).
[9] Keiko Yamawaki, Yokuwakaru Shikisai Shinri. Natsumesha, 2005, (in Japanese).
[10] Jonathan Dee and Lesley Taylor, Color Therapy. Sunchoh Publishing, 2006, (in Japanese).

Takaaki GOTO received his M.E. and Dr. Eng. degrees from Toyo University in 2003 and 2009, respectively. In 2009 he joined the University of Electro-Communications as a Project Assistant Professor at the Center for Industrial and Governmental Relations, and he has since been a Project Assistant Professor in the Graduate School of Informatics and Engineering at the University of Electro-Communications. His main research interests are applications of graph grammars, visual languages, and software development environments. He is a member of IPSJ, IEICE Japan and IEEE.

Chieko KATO graduated from the Faculty of Literature, Shirayuri Women's University in 1997, and received her M.A. from Tokyo University and her Dr. Eng. degree from Hosei University in 1999 and 2007, respectively. She served from 2003 to 2006 as an Assistant Professor at the Oita Prefectural Junior College of Arts and Culture. She currently teaches at Toyo University, which she joined in 2006 as an Assistant Professor, and was promoted to Associate Professor in 2007.
Her research areas include clinical psychology and psychological statistics. She is a member of IEICE Japan, the Design Research Association, and the Japanese Society of Psychopathology of Expression and Arts Therapy.

Futoshi SUGIMOTO received his B.S. degree in communication systems engineering and M.S. degree in management engineering from the University of Electro-Communications, Tokyo, Japan, in 1975 and 1978, respectively, and his Ph.D. degree in computer science from Toyo University, Tokyo, Japan, in 1998. In 1978 he joined Toyo University as a Research Associate in the Department of Information and Computer Sciences. From 1984 to 1999 he was an Assistant Professor, from 2000 to 2005 an Associate Professor, and from 2006 to 2008 a Professor in the same department. Since 2009 he has been a Professor in the Department of Information Sciences and Arts. From April 2000 to March 2001 he was an exchange fellow at the University of Montana, USA. His current research interests are in cognitive engineering and human interfaces. Dr. Sugimoto is a member of the Institute of Image Information and Television Engineers, IPSJ, and the Human Interface Society (Japan).

Kensei TSUCHIDA received his M.S. and D.S. degrees in mathematics from Waseda University in 1984 and 1994, respectively. He was a member of the Software Engineering Development Laboratory, NEC Corporation, from 1984 to 1990. From 1990 to 1992 he was a Research Associate in the Department of Industrial Engineering and Management at Kanagawa University. In 1992 he joined Toyo University, where he was an Instructor until 1995, an Associate Professor from 1995 to 2002, and a Professor from 2002 to 2009 in the Department of Information and Computer Sciences; since 2009 he has been a Professor in the Faculty of Information Sciences and Arts. He was a Visiting Associate Professor in the Department of Computer Science at Oregon State University from 1997 to 1998. His research interests include software visualization, human interfaces, graph languages, and graph algorithms. He is a member of IPSJ, IEICE Japan and the IEEE Computer Society.

Call for Papers and Special Issues

Aims and Scope

Journal of Emerging Technologies in Web Intelligence (JETWI, ISSN 1798-0461) is a peer-reviewed and indexed international journal that aims at gathering the latest advances on various topics in web intelligence and reporting how organizations can gain competitive advantage by applying emergent techniques in real-world scenarios. Papers and studies that couple intelligence techniques and theories with specific web technology problems are mainly targeted. Survey and tutorial articles that emphasize the research and application of web intelligence in a particular domain are also welcomed.
These areas include, but are not limited to, the following:
• Web 3.0
• Enterprise Mashup
• Ambient Intelligence (AmI)
• Situational Applications
• Emerging Web-based Systems
• Ambient Awareness
• Ambient and Ubiquitous Learning
• Ambient Assisted Living
• Telepresence
• Lifelong Integrated Learning
• Smart Environments
• Web 2.0 and Social Intelligence
• Context Aware Ubiquitous Computing
• Intelligent Brokers and Mediators
• Web Mining and Farming
• Wisdom Web
• Web Security
• Web Information Filtering and Access Control Models
• Web Services and Semantic Web
• Human-Web Interaction
• Web Technologies and Protocols
• Web Agents and Agent-based Systems
• Agent Self-organization, Learning, and Adaptation
• Agent-based Knowledge Discovery
• Agent-mediated Markets
• Knowledge Grid and Grid Intelligence
• Knowledge Management, Networks, and Communities
• Agent Infrastructure and Architecture
• Cooperative Problem Solving
• Distributed Intelligence and Emergent Behavior
• Information Ecology
• Mediators and Middlewares
• Granular Computing for the Web
• Ontology Engineering
• Personalization Techniques
• Semantic Web
• Web based Support Systems
• Web based Information Retrieval Support Systems
• Web Services, Services Discovery & Composition
• Ubiquitous Imaging and Multimedia
• Wearable, Wireless and Mobile e-interfacing
• E-Applications
• Cloud Computing
• Web-Oriented Architectures

Special Issue Guidelines

Special issues feature specifically aimed and targeted topics of interest contributed by authors responding to a particular Call for Papers or by invitation, edited by guest editor(s). We encourage you to submit proposals for creating special issues in areas that are of interest to the Journal. Preference will be given to proposals that cover some unique aspect of the technology and ones that include subjects that are timely and useful to the readers of the Journal. A Special Issue is typically made of 10 to 15 papers, with each paper 8 to 12 pages in length.

The following information should be included as part of the proposal:
• Proposed title for the Special Issue
• Description of the topic area to be focused upon and justification
• Review process for the selection and rejection of papers
• Name, contact, position, affiliation, and biography of the Guest Editor(s)
• List of potential reviewers
• Potential authors for the issue
• Tentative time-table for the call for papers and reviews

If a proposal is accepted, the guest editor will be responsible for:
• Preparing the "Call for Papers" to be included on the Journal's Web site.
• Distributing the Call for Papers broadly to various mailing lists and sites.
• Getting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be informed of the Instructions for Authors.
• Providing us the completed and approved final versions of the papers formatted in the Journal's style, together with all authors' contact information.
• Writing a one- or two-page introductory editorial to be published in the Special Issue.

Special Issue for a Conference/Workshop

A special issue for a Conference/Workshop is usually released in association with the committee members of the Conference/Workshop, such as the general chairs and/or program chairs, who are appointed as the Guest Editors of the Special Issue. A Special Issue for a Conference/Workshop is typically made of 10 to 15 papers, with each paper 8 to 12 pages in length.
Guest Editors are involved in the following steps in guest-editing a Special Issue based on a Conference/Workshop:
• Selecting a title for the Special Issue, e.g. "Special Issue: Selected Best Papers of XYZ Conference".
• Sending us a formal "Letter of Intent" for the Special Issue.
• Creating a "Call for Papers" for the Special Issue, posting it on the conference web site, and publicizing it to the conference attendees. Information about the Journal and Academy Publisher can be included in the Call for Papers.
• Establishing criteria for paper selection/rejection. Papers can be nominated based on multiple criteria, e.g. rank in the review process plus the evaluation from the Session Chairs and the feedback from the Conference attendees.
• Selecting and inviting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be informed of the Author Instructions. Usually, the Proceedings manuscripts should be expanded and enhanced.
• Providing us the completed and approved final versions of the papers formatted in the Journal's style, together with all authors' contact information.
• Writing a one- or two-page introductory editorial to be published in the Special Issue.

More information is available on the web site at http://www.academypublisher.com/jetwi/.