Proceedings of the 3rd ACM RecSys'11 Workshop on
Recommender Systems and the
Social Web
EDITORS
Jill Freyne, Sarabjot Singh Anand,
Ido Guy, Andreas Hotho
October 23rd, 2011
Chicago, Illinois, USA
ORGANIZING COMMITTEE
Jill Freyne, CSIRO, TasICT Centre, Australia
Sarabjot Singh Anand, University of Warwick, UK
Ido Guy, IBM Research, Haifa, Israel
Andreas Hotho, University of Würzburg, Germany
PROGRAM COMMITTEE
Shlomo Berkovsky, CSIRO, Australia
Peter Brusilovsky, University of Pittsburgh, USA
Robin Burke, DePaul University, USA
Xiongcai Cai, University of New South Wales, Australia
Elizabeth Daly, IBM Research, Cambridge, MA USA
Jon Dron, Athabasca University, Canada
Casey Dugan, IBM Research, USA
Rosta Farzan, Carnegie Mellon University, USA
Werner Geyer, IBM Research, USA
Max Harper, University of Minnesota, USA
C. Lee Giles, The Pennsylvania State University, USA
Kristina Lerman, University of Southern California, USA
Luiz Pizzato, The University of Sydney, Australia
Lars Schmidt-Thieme, University of Hildesheim, Germany
Shilad Sen, Macalester College, St. Paul, USA
Aaditeshwar Seth, IIT Delhi, India
Barry Smyth, University College Dublin, Ireland
Gerd Stumme, University of Kassel, Germany
Juergen Vogel, SAP Research
FOREWORD
The exponential growth of the social web poses challenges and presents new
opportunities for recommender system research. The social web has turned
information consumers into active contributors creating massive amounts of
information. Finding relevant and interesting content at the right time and in the right
context is challenging for existing recommender approaches. At the same time,
social systems by their definition encourage interaction between users and both
online content and other users, thus generating new sources of knowledge for
recommender systems. Web 2.0 users explicitly provide personal information and
implicitly express preferences through their interactions with others and the system
(e.g. commenting, friending, rating, etc.). These various new sources of knowledge
can be leveraged to improve recommendation techniques and develop new
strategies which focus on social recommendation. The Social Web provides huge
opportunities for recommender technology and in turn recommender technologies
can play a part in fuelling the success of the Social Web phenomenon.
The goal of this workshop was to bring together researchers and practitioners to
explore, discuss, and understand challenges and new opportunities for recommender
systems and the Social Web. We encouraged contributions in the following areas:
• Case studies and novel fielded social recommender applications
• Economy of community-based systems: using recommenders to encourage users to contribute and sustain participation
• Social network and folksonomy development: recommending friends, tags, bookmarks, blogs, music, communities etc.
• Recommender systems mash-ups, Web 2.0 user interfaces, rich media recommender systems
• Collaborative knowledge authoring, collective intelligence
• Recommender applications involving users or groups directly in the recommendation process
• Exploiting folksonomies, social network information, interaction, user context and communities or groups for recommendations
• Trust and reputation aware social recommendations
• Semantic Web recommender systems, use of ontologies or microformats
• Empirical evaluation of social recommender techniques, success and failure measures
• Social recommender systems in the enterprise
The workshop consisted of both technical sessions, in which selected participants presented their results or ongoing research, and informal breakout sessions on more focused topics.
Papers discussing various aspects of recommender systems in the Social Web were submitted, and 10 papers were selected for presentation and discussion at the workshop through a formal reviewing process.
The Workshop Chairs
October 2011
Third Workshop on Recommender Systems and the Social Web
23rd October, 2011
Workshop Programme
8:30 - 9:00 - Opening & Introductions
9:00 - 10:15 Paper Session I - Social Search and Discovery
Kevin McNally, Michael O'Mahony and Barry Smyth. Evaluating User Reputation in
Collaborative Web Search (15+5)
Owen Phelan, Kevin McCarthy and Barry Smyth. Yokie - A Curated, Real-time Search &
Discovery System using Twitter (15+5)
Tamara Heck, Isabella Peters and Wolfgang G. Stock. Testing Collaborative Filtering
against Co-Citation Analysis and Bibliographic Coupling for Academic Author
Recommendation (15+5)
Discussion (15)
10:15-10:45 - Coffee Break
10:45 - 12:30 - Paper Session II - Groups, Communities, and Networks
Lara Quijano-Sanchez, Juan Recio-Garcia and Belen Diaz-Agudo. Group
recommendation methods for social network environments (15+5)
Yu Chen and Pearl Pu. Do You Feel How We Feel? An Affective Interface in Social
Group Recommender Systems (10+5)
Amit Sharma, Meethu Malu and Dan Cosley. PopCore: A system for Network-Centric
Recommendations (10+5)
Shaghayegh Sahebi and William Cohen. Community-Based Recommendations: a
Solution to the Cold Start Problem (10+5)
Maria Terzi, Maria-Angela Ferrario and Jon Whittle. Free Text In User Reviews: Their
Role In Recommender Systems (10+5)
Discussion (20)
12:30 – 14:00 Lunch
14:00-15:00 - Keynote Presentation (Werner Geyer, title TBD)
15:05-15:45 - Paper Session III - User Generated Content
Sandra Garcia Esparza, Michael O'Mahony and Barry Smyth. A Multi-Criteria Evaluation
of a User Generated Content Based Recommender System (15+5)
Jonathan Gemmell, Tom Schimoler, Bamshad Mobasher and Robin Burke. Personalized
Recommendation by Example (15+5)
Discussion (15)
15:45-16:15 - Coffee Break
16:15 - 18:00 - Breakout Sessions in Groups
18:00 - Closing
Contents

Evaluating User Reputation in Collaborative Web Search
Kevin McNally, Michael O'Mahony and Barry Smyth .......... 1

Yokie - A Curated, Real-time Search & Discovery System using Twitter
Owen Phelan, Kevin McCarthy and Barry Smyth .......... 9

Testing Collaborative Filtering against Co-Citation Analysis and Bibliographic Coupling for Academic Author Recommendation
Tamara Heck, Isabella Peters and Wolfgang G. Stock .......... 16

Group recommendation methods for social network environments
Lara Quijano-Sanchez, Juan Recio-Garcia and Belen Diaz-Agudo .......... 24

Do You Feel How We Feel? An Affective Interface in Social Group Recommender Systems
Yu Chen and Pearl Pu .......... 32

PopCore: A system for Network-Centric Recommendations
Amit Sharma, Meethu Malu and Dan Cosley .......... 36

Community-Based Recommendations: a Solution to the Cold Start Problem
Shaghayegh Sahebi and William Cohen .......... 40

Free Text In User Reviews: Their Role In Recommender Systems
Maria Terzi, Maria-Angela Ferrario and Jon Whittle .......... 45

A Multi-Criteria Evaluation of a User Generated Content Based Recommender System
Sandra Garcia Esparza, Michael O'Mahony and Barry Smyth .......... 49

Personalized Recommendation by Example
Jonathan Gemmell, Tom Schimoler, Bamshad Mobasher and Robin Burke .......... 57
Evaluating User Reputation in Collaborative Web Search
Kevin McNally, Michael P. O’Mahony, Barry Smyth
CLARITY Centre for Sensor Web Technologies
School Of Computer Science & Informatics
University College Dublin
{firstname.lastname}@ucd.ie
ABSTRACT
Often today’s recommender systems look to past user activity in order to influence future recommendations. In the case
of social web search, employing collaborative recommendation techniques allows for personalization of search results.
If recommendations arise from past user activity, the expertise of those users driving the recommendation process can
play an important role when it comes to ensuring recommendation quality. Hence the reputation of users is important
in collaborative and social search tasks, in addition to result relevance as traditionally considered in web search. In
this paper we explore this concept of reputation; specifically,
investigating how reputation can enhance the recommendation engine at the core of the HeyStaks social search utility.
We evaluate a number of different reputation models in the
context of the HeyStaks system, and demonstrate how incorporating reputation into the recommendation process can
enhance the relevance of results recommended by HeyStaks.
1. INTRODUCTION
The early years of web search (1995-1998) were characterised by innovation as researchers came to discover some
of the shortcomings of traditional term-based information
retrieval techniques in the face of large-scale, heterogeneous
web content, and in the face of queries from users who
were far from search experts. While traditional term-based
matching techniques played an important role in result selection, they were not sufficiently robust when it came to
delivering a reliable and relevant ranking of search results.
The significant breakthrough that led to modern web search
engines came about through the work of Brin and Page [1],
and Kleinberg [6], highlighting the importance of link connectivity when it came to understanding the importance of
web pages. In the end, ranking metrics based on this type
of connectivity data came to provide a key signal for all of
today’s mainstream search engines.
By and large the world of web search has remained relatively stable over the past decade or more. Mainstream
search engines have innovated around the edges of search
but their core approaches have remained intact. However
there are signs that this is now changing and it is an interesting time in the world of mainstream web search, especially as all of the mainstream players look to the world
of social networks to provide new types of search content
and, importantly in this paper, new sources of ranking signals. There is now considerable interest in the concept of
social search, based on the idea that information in our social graphs can be used to improve mainstream search. For
example, the HeyStaks system [19] has been developed to
add a layer of social search onto mainstream search engines,
using recommendation techniques to automatically suggest
results to users based on pages that members of their social
graphs have found to be interesting for similar queries in
the past. HeyStaks adds collaboration to conventional web
search and allows us to benefit from the past search histories
of people we trust and on topics that matter to us.
In this paper we examine the role of reputation in HeyStaks’
recommendation engine. Previously we have described how
to estimate the reputation of a searcher by analysing how
frequently their past search efforts have translated into useful recommendations for other users [9, 11]. We have also
examined user behaviour in HeyStaks, and highlighted the
potential for reputation to unearth users who have gained
the most benefit from the system and whose activity benefits others [10]. For example, if my previous searches (and
the pages that I find) lead to result recommendations to
others that are regularly acted on (selected, tagged, shared
etc.), then my reputation should increase, whereas if my
past search efforts rarely translate into useful recommendations then my reputation should decline. In this paper we
expand on previous work by considering a number of user
reputation models, showing how these models can be used
to estimate result reputation, and comparing the ability of
these models to influence recommendation quality based on
recent live-user data.
2. RELATED WORK
Recently there has been considerable interest in reputation
systems to provide mechanisms to evaluate user reputation
and inter-user trust across a growing number of social web
and e-commerce applications. For example, the reputation
system used by eBay has been examined by Jøsang et al.
[5] and Resnick et al. [16]. Briefly, eBay elicits feedback
from buyers and sellers regarding their interactions with
each other, and that information is aggregated in order to
calculate user reputation scores. The aim is to reward good
behaviour on the site and to improve robustness by leveraging reputation to predict whether a vendor will honour
future transactions. Resnick found that using information
received directly from users to calculate reputation is not
without its problems [16]. Feedback is generally reciprocal; users almost always give positive feedback if they themselves have received positive feedback from the person they performed a transaction with. Jøsang confirms this, stating that this may lead to malicious use of the system and as such requires manual curation.
The work of O’Donovan and Smyth [14] addresses reputation in recommender systems. In this case, a standard collaborative filtering algorithm is modified to add a user-user
trust score to complement the normal profile or item-based
similarity score, so that recommendation partners are chosen from those users that are not only similar to the target
user, but who have also had a positive recommendation history with that user. It is posited that reputation can be
estimated by measuring the accuracy of a profile at making
predictions over time. Using this metric, average prediction error is improved by 22%.
Other recent research has examined reputation systems
employed in social networking platforms. Lazzari performed
a case study of the professional social networking site Naymz
[8]. He warns that calculating reputation on a global level
allows users who have interacted with only a small number
of others to accrue a high degree of reputation, making the
system vulnerable to malicious use. Similar to Jøsang in
[5], Lazzari suggests that vulnerability lies in the site itself,
allowing malicious users to game the reputation system for
their own ends. However, applying reputation globally affords malicious users influence over the entire system, which
adds to its vulnerability.
The previous section outlined our intention to present different reputation models to be applied to HeyStaks users.
These models are in part derived from constructing a graph
based on collaborations that occur in the HeyStaks community. Perhaps two of the most well-known link analysis
algorithms that are applied to online social network graphs
are PageRank and HITS.
PageRank is the well known algorithm employed by the
Google search engine to rank web search results [1]. The key
intuition behind PageRank is that pages on the web can be
modeled as vertices in a directed graph, where the edge set
is determined by the hyperlinks between pages. PageRank
leverages this link structure to produce an estimate of the relative importance of web pages, with inlinks from pages seen as a form of recommendation from page authors. Important pages are considered to be those with a relatively large number
of inlinks. Moreover, pages that are linked to by many other
important pages receive higher ranks themselves. PageRank
is a recursive algorithm, where the ranks of pages are a function of the ranks of those pages that link to them.
The HITS algorithm [6] was also developed to rank web
search results and, like PageRank, makes use of the link
structure of the web to perform ranking. In particular, HITS
computes two distinct scores for each page: an authority
score and a hub score. The former provides an estimate of
the value of a page’s content while the latter measures the
value of its links to other pages. Pages receive higher authority scores if they are linked to by pages with high hub scores,
and receive higher hub scores if they link to many pages with
high authority scores. HITS is an iterative algorithm where
authority and hub scores are computed recursively.
A lot of work has been done in the area of link analysis in
the social web space in the recent past, often by employing
the techniques introduced by Page and Kleinberg. For example the well-known algorithm FolkRank [4], an adaptation
of PageRank, looks to exploit users’ disposition for adding
metadata to online content in order to construct a graph
based on social tagging information. Work by Schifanella et
al. [18] expands on the idea behind FolkRank, and claims
that examination of folksonomy data can help in predicting
links between people in the social network graphs of Flickr
and Last.fm.
In this paper we consider reputation models in the context
of the HeyStaks social search service which seek to capture
the quality of search knowledge that is contributed by users.
Further, we present a framework in which user reputation is
employed to influence the recommendations that are made
by HeyStaks. Using data from a live-user trial, we show how
this approach leads to significant improvements in the ranking of recommendations from a quality perspective. Our work differs from the approaches above in that we wish to leverage the HeyStaks social graph to determine who provides the best quality content, as determined by their community.
3. THE HEYSTAKS RECOMMENDATION ENGINE
In this section we review the HeyStaks recommendation
engine to provide sufficient context for this work. Further
details can be found in [19] (which focuses on the relevance
model) and in [11] (which focuses on the reputation model).
3.0.1 Profiling Stak Pages
Each stak in HeyStaks captures the search activities of its
stak members. The basic unit of stak information is a result
(URL) and each stak S is associated with a set of results, S = {r_1, ..., r_k}. Each result is also anonymously associated
with a number of implicit and explicit interest indicators,
based on the type of actions (for example, selecting, voting,
tagging and sharing) that users can perform on these pages.
These actions can be associated with a degree of confidence that the user finds the page to be relevant. Each result page r_i^S from stak S is associated with relevance indicators: the number of times the result has been selected (Sl), the query terms (q_1, ..., q_n) that led to its selection, the terms contained in the snippet of the selected result (s_1, ..., s_k), the number of times the result has been tagged (Tg), the terms used to tag it (t_1, ..., t_m), the votes it has received (v^+, v^−), and the number of people it has been shared with (Sh), as per Equation 1:

r_i^S = {q_1, ..., q_n, s_1, ..., s_k, t_1, ..., t_m, v^+, v^−, Sl, Tg, Sh} .    (1)
Importantly, this means each result page is associated with
a set of term data (query and/or tag terms) and a set of usage data (the selection, tag, share, and voting count). The
term data provides the basis for retrieving and ranking recommendation candidates. The usage data provides an additional source of evidence that can be used to filter results
and to generate a final set of recommendations.
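To make the structure of these profiles concrete, a minimal sketch in Python might represent a stak result as follows; the names are illustrative, not the HeyStaks implementation:

```python
from dataclasses import dataclass, field

@dataclass
class StakResult:
    """Illustrative index entry for one result (URL) in a stak, per Equation 1."""
    url: str
    query_terms: list[str] = field(default_factory=list)    # q_1..q_n that led to selections
    snippet_terms: list[str] = field(default_factory=list)  # s_1..s_k from the result snippet
    tag_terms: list[str] = field(default_factory=list)      # t_1..t_m used to tag the result
    votes_up: int = 0     # v+
    votes_down: int = 0   # v-
    selections: int = 0   # Sl
    taggings: int = 0     # Tg
    shares: int = 0       # Sh

    def index_terms(self) -> list[str]:
        # Term data used to retrieve and rank recommendation candidates
        return self.query_terms + self.snippet_terms + self.tag_terms
```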
3.0.2 Recommending Search Results
At search time, the searcher's query q and current stak S are used to generate a list of recommendations. Here we discuss recommendation generation from the current stak S only, although recommendations may also come from other staks that the user has joined or created. There are two key steps when it comes to generating recommendations. First, a set of recommendation candidates is retrieved from S based on the overlap between the query terms and the terms used to index each recommendation (query, snippet, and tag terms). These candidates are then filtered and ranked. Results that do not exceed certain activity thresholds are eliminated, for example, results with only a single selection or results with more negative votes than positive votes (see [19]). The remaining candidates are then ranked according to a weighted score of their relevance and reputation, as per Equation 2, where w is used to adjust the relative influence of relevance and reputation:

score(r, q) = w × rep(r, t) + (1 − w) × rel(q, r) .    (2)

The relevance of a result r with respect to a query q is computed using TF-IDF [17], which gives high weights to terms that are popular for a result r but rare across other stak results, thereby serving to prioritise results that match distinguishing index terms, as per Equation 3:

rel(q, r) = Σ_{t ∈ q} tf(t, r) × idf(t)^2 .    (3)
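As an illustration of Equations 2 and 3, a small Python sketch (assuming the StakResult record above; the exact term weighting used by HeyStaks may differ) is:

```python
import math
from collections import Counter

def rel(query_terms, result, stak):
    """Equation 3 (sketch): TF-IDF relevance of a result for a query."""
    tf = Counter(result.index_terms())   # term frequency over the result's index terms
    n_results = len(stak)
    total = 0.0
    for t in set(query_terms):
        df = sum(1 for r in stak if t in r.index_terms())  # frequency within the stak
        if df == 0:
            continue
        idf = math.log(n_results / df)
        total += tf[t] * idf ** 2
    return total

def score(result, query_terms, stak, rep_r, w=0.5):
    """Equation 2: blend result reputation (rep_r, computed separately)
    with term-based relevance."""
    return w * rep_r + (1 - w) * rel(query_terms, result, stak)
```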
The reputation of a result r at time t (rep(r, t)) is an orthogonal measure of recommendation quality. The intuition
is that we should prefer results that originate from more reputable stak members. We explore user reputation and how
it can be computed in the next section.
4. REPUTATION MODELS FOR SOCIAL SEARCH
For HeyStaks, searchers themselves play a crucial role in
determining what gets recommended and to whom, and so
the quality of these searchers can be an important factor
to consider during recommendation. Recommendation candidates originating from the activities of very experienced
users, for example, might be considered ahead of candidates that come from the activity of less experienced users.
This is particularly important given the potential for malicious users to disrupt stak quality by introducing dubious
results to a stak. For example, as it stands it is feasible
for a malicious user to flood a stak with results in the hope
that at least some will be recommended to other users at
search time. This type of gaming has the potential to significantly degrade recommendation quality; see also recent
related research on malicious users and robustness by the
recommender systems community [3, 7, 13, 15]. For this
reason we propose to complement the relevance of a page,
during recommendation, with an orthogonal measure of reputation to reflect the predicted quality of the users who are
responsible for this recommendation. In fact we propose a
variety of reputation models and in Section 5 we evaluate
their effectiveness in practice.
4.1 Search, Collaboration, and Reputation
The long-term value of HeyStaks as a social search service
depends critically on the ability of users to benefit from its
quality search knowledge and if, for example, all of the best
search experiences are tied up in private staks and never
shared, then this long-term value will be greatly diminished.
!"#$%#
!"#$%#
!"#
&'#
&#
!"#$%#
$#
&#
!"#"
!"#$%#
&#
!%#
#"
!"#$%#
&"#
&"
!"$%&&"'()*%'+
,"-"$*%'+
(a)
%&%&%&
!'#
%&%&%&
discuss recommendation generation from the current stak S
only, although recommendations may also come from other
staks that the user has joined or created. There are two key
steps when it comes to generating recommendations. First,
a set of recommendation candidates are retrieved from S
!"#"
based on the overlap between the query terms and
the terms
#"
used to index each recommendation (query,
snippet,
and
$"
!"
!"$%&&"'()*%'+
tag terms). These recommendations are then
filtered and
,"-"$*%'+
ranked. Results that do not exceed certain activity thresholds are eliminated; such as, for example, results with only
a single selection or results with more negative votes than
positive votes (see [19]). Remaining recommendation candidates are then ranked according to a weighted score of its
relevance and reputation (Equation 2), where w is used to
adjust the relative influence of relevance and reputation.
%"
)#
)#
(#
!"#$%#
)#
&%#
(b)
Figure 1: Collaboration and reputation: (a) the
consumer c selects result r, which has been recommended based on the producer p’s previous activity,
so that c confers some unit of reputation (rep) on p.
(b) The consumer c selects a result r that has been
produced by several producers, p1 , ..., pk ; reputation
is shared amongst these producers with each user receiving an equal share of rep/k units of reputation.
Thus, our model of reputation must recognise the quality of
shared search knowledge. There is a way to capture this notion of shared search quality in a manner that serves to incentivise users to behave in just the right way to grow
long-term value for all. The key idea is that the quality of
shared search knowledge can be estimated by looking at the
search collaborations that naturally occur within HeyStaks.
If HeyStaks recommends a result to a searcher, and the
searcher chooses to act on this result (i.e. select, tag, vote
or share), then we can view this as a single instance of
search collaboration. The current searcher who chooses to
act on the recommendation is known as the consumer and,
in the simplest case, the original searcher, whose earlier action on this result caused it to be added to the search stak,
and ultimately recommended, is known as the producer. In
other words, the producer created search knowledge that
was deemed to be relevant enough to be recommended and
useful enough for the consumer to act upon it. The basic
idea behind our reputation models is that this act of implicit
collaboration between producer and consumer confers some
unit of reputation on the producer (Figure 1(a)). And the
reputation models that we will present in what follows differ in the way that they distribute and aggregate reputation
among these collaborations.
4.2 Graph-Based Reputation Models
We can treat the collaborations that occur among HeyStaks
users as a type of graph. Each node represents a unique
user and the edges represent collaborations between pairs of
users. These edges are directed to reflect the producer/consumer
relationships; reputation flows along these edges and is aggregated at the nodes. As such, the extent to which users
collaborate (i.e., the number of times each user is a producer
in a collaboration event) is used to weight the nodes in the
collaboration graph. We now present a series of graph-based
reputation model alternatives.
4.2.1 Reputation as a Weighted Count of Collaboration Events
Our first and simplest reputation model calculates the reputation of a producer as a weighted sum of the collaboration
events in which they have participated. The simplest case is
captured by Figure 1(a) where a single producer participates
in a collaboration event with a given consumer and benefits
from a single unit of reputation as a result. More generally
however, at the time when the consumer acts (selects, tags,
votes etc.) on the promoted result, there may have been a
number of past producers who each contributed part of the
search knowledge that caused this result to be promoted. A
specific producer may have been the first to select the result
in a given stak, but subsequent users may have selected it
for different queries, or they may have voted on it or tagged
it or shared it with others independently of its other producers. Alternatively, a collaboration event can have a knock-on
effect, where the original producer–consumer relationship is
broadened as more people act on the same recommendation
over time. The original consumer becomes a second producer as a new user acts on the same recommendation, and
so on. Thus we need to be able to share reputation across
these different producers; see Figure 1(b).
More formally, let us consider the selection of a result r
by a user c, the consumer, at time t. The producers responsible for the recommendation of this result are given by
producers(r, t) as per Equation 4 such that each pi denotes
a specific user ui in a specific stak Sj .
producers(r, t) = {p_1, ..., p_k} .    (4)
Then, for each producer of r, pi , we update its reputation
as in Equation 5. In this way reputation is shared equally
among its k contributing producers.
rep(p_i, t) = rep(p_i, t − 1) + 1/k .    (5)
As it stands this reputation model is susceptible to gaming
in the following manner. To increase their reputation, malicious users could attempt to flood a stak with pages in the
hope that at least some are recommended and subsequently
acted on by other users. If this happens, then these malicious producers will benefit from increased reputation, and
further pages from these users may continue to be recommended. The problem is that the current reputation model
distributes reputation equally among all producers. To address this we can adjust our reputation model by changing
the way in which reputation is distributed. The basic idea
is that a producer should receive more reputation if many
of their past contributions have been consumed by other
users but the should receive less reputation if most of their
contributions have not been consumed.
More formally, for a producer pi , let nt (pi , t − 1) be the
total number of distinct results that this user has added to
the stak in question prior to time t; remember that pi refers
to a user ui and a specific stak Sj . Further, let nr (pi , t − 1)
be the number of these results that have been subsequently
recommended and consumed by other users. We define the
consumption ratio according to Equation 6, where κ is an initialization constant that is set to 0.01 in our experiments:

consumption_ratio(p_i, t) = κ + n_r(p_i, t − 1) / n_t(p_i, t − 1) .    (6)

Accordingly, if a producer has a high consumption ratio it means that many of their contributions have been consumed by other users, suggesting that the producer has added useful content to the stak. In contrast, if a user has a low consumption ratio then it means that few of their contributions have proven to be useful to other users.
Thus, given the selection of a result r by a consumer c at time t, if p_1, ..., p_k are the contributing producers, then we can use their consumption ratios as the basis for sharing reputation according to Equation 7:

rep(p_i, t) = rep(p_i, t − 1) + consumption_ratio(p_i, t) / Σ_{p ∈ {p_1, ..., p_k}} consumption_ratio(p, t) .    (7)

In this way, users who have a history of contributing many irrelevant results to a stak (that is, users with low consumption ratios) will receive a small proportion of the reputation share compared to users who have a history of contributing many useful results.
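A compact sketch of both sharing schemes (Equations 5-7), using an illustrative Producer record rather than HeyStaks' actual data structures:

```python
from dataclasses import dataclass

KAPPA = 0.01  # initialisation constant κ from the paper

@dataclass
class Producer:
    n_added: int = 0       # n_t: distinct results added to the stak
    n_consumed: int = 0    # n_r: of those, results later recommended and consumed
    reputation: float = 0.0

def consumption_ratio(p: Producer) -> float:
    """Equation 6: κ plus the fraction of a producer's results consumed by others."""
    return KAPPA + p.n_consumed / max(p.n_added, 1)  # max() guards empty history (not in the paper)

def share_reputation(producers: list[Producer], weighted: bool = True) -> None:
    """Distribute one unit of reputation among the producers of a consumed result:
    equally (Equation 5) or in proportion to consumption ratios (Equation 7)."""
    if not weighted:
        for p in producers:
            p.reputation += 1.0 / len(producers)        # Equation 5
        return
    total = sum(consumption_ratio(p) for p in producers)
    for p in producers:
        p.reputation += consumption_ratio(p) / total    # Equation 7
```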
4.2.2 Reputation as PageRank
The PageRank algorithm can be readily applied to compute the reputation of HeyStaks users, which take the place
of web pages in the graph. When a collaboration event occurs, directed links are inserted from the consumer (i.e. the
user who selects or votes etc. on the recommended page)
to each of the producers (i.e. the set of users whose previous activity on the page caused it to be recommended by
HeyStaks). Once all the collaboration events up to some
point in time, t, have been captured on the graph, the
PageRank algorithm is then executed and the reputation
(PageRank) of each user pi at time t is computed as:
PR(p_i) = (1 − d)/N + d Σ_{p_j ∈ M(p_i)} PR(p_j) / |L(p_j)| ,    (8)

where d is a damping factor, N is the number of users, M(p_i) is the set of inlinks (from consumers) to (producer) p_i and L(p_j) is the set of outlinks from p_j (i.e. the other users from whom p_j has consumed results). In this paper, we use the JUNG (Java Universal Network/Graph) Framework (http://jung.sourceforge.net/) implementation of PageRank.
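The paper uses the Java JUNG framework; purely as an illustration, the same computation can be sketched in Python with networkx (an assumption, not the authors' code):

```python
import networkx as nx

def pagerank_reputation(collaboration_events, d=0.85):
    """Build the directed consumer -> producer collaboration graph and run
    PageRank (Equation 8); each event is (consumer, [producers])."""
    g = nx.DiGraph()
    for consumer, producers in collaboration_events:
        for p in producers:
            g.add_edge(consumer, p)  # reputation flows from consumer to producer
    return nx.pagerank(g, alpha=d)   # {user: reputation score}

# Hypothetical events: u1 consumes a result produced by u2 and u3, etc.
events = [("u1", ["u2", "u3"]), ("u2", ["u3"]), ("u4", ["u3"])]
print(pagerank_reputation(events))
```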
4.2.3 Reputation as HITS
As with PageRank, we use the collaboration graph and the
HITS algorithm to estimate user reputation. In this regard,
it seems appropriate to consider producers as authorities and
consumers as hubs. However, as we will discuss in Section 5,
hub scores are useful when it comes to identifying a particular class of users which act both as useful consumers and
producers of high quality search knowledge. Thus we model
user reputation using both authority and hub scores, which
we compute using the JUNG implementation of the HITS
algorithm. Briefly, the algorithm operates as follows. After
initialisation, repeated iterations are used to update the authority (auth(pi )) and hub scores (hub(pi )) for each user pi .
At each iteration, authority and hub scores are given by:

auth(p_i) = Σ_{p_j ∈ M(p_i)} hub(p_j)    (9)

hub(p_i) = Σ_{p_j ∈ L(p_i)} auth(p_j)    (10)

where, as before, M(p_i) is the set of inlinks (from consumers) to (producer) p_i and L(p_i) is the set of outlinks from p_i (i.e. the other users from whom p_i has consumed results).
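A direct power-iteration sketch of Equations 9 and 10, with the usual L2 normalisation after each pass (a standard step the paper does not spell out):

```python
def hits_reputation(edges, iterations=50):
    """Compute hub and authority scores over consumer -> producer edges,
    given as (consumer, producer) pairs."""
    nodes = {u for edge in edges for u in edge}
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        auth = {n: sum(hub[c] for c, p in edges if p == n) for n in nodes}  # Eq. 9
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        hub = {n: sum(auth[p] for c, p in edges if c == n) for n in nodes}  # Eq. 10
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return auth, hub

# e.g. hits_reputation([("u1", "u2"), ("u1", "u3"), ("u4", "u3")])
```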
4.3 Reputation and Result Recommendation
In the previous sections we have described reputation models for users. Individual stak members accumulate reputation when results that they have added to the stak are recommended and acted on by other users. We have described
how reputation is distributed between multiple producers
during these collaboration events. In this section we describe how this reputation information can be used to produce better recommendations at search time.
The recommendation engine described in Section 3 operates at the level of an individual result page and scores
each recommendation candidate based on how relevant it is
to the target query. If we are to allow reputation to influence recommendation ranking, as well as relevance, then we
need to transform our user-based reputation measure into a
result-based reputation measure. How then can we compute
the reputation of a result that have been recommended by
a set of producers?
Before the reputation of a page is calculated, the reputation score of each producer is normalized according to the
maximum user reputation score existing in the stak at the
time that the recommendation is made. But how can we
calculate the reputation of a page based on that of its producers? One option is to simply add the reputation scores of
the producers. However, this favours results that have been
produced by lots of producers, even if the reputation of these
producers is low. Another option is to compute the average
of the reputation scores of the producers, but this tends to
depress the reputation of results that have been produced by
many low-reputation users even if some users have very high
reputation scores. In our work we have found a third option
to work best. The reputation of a result page r (at time t) is
simply the maximum reputation of its associated producers;
see Equation 11. Thus, as long as at least some of the producers are considered reputable then this result will receive
a high reputation score, even if many of the producers have
low reputation scores. These less reputable users might be
novices and so their low reputations are not so much of a
concern in the face of highly reputable producers.
rep(r, t) = max_{p_i ∈ {p_1, ..., p_k}} rep(p_i, t) .    (11)
Now we have two ways to evaluate the appropriateness of a page for recommendation — the relevance of the page as per Equation 3 and its reputation as per Equation 11 — and we can combine these two scores using a simple weighted sum according to Equation 2 to calculate the rank score of a result page r and its producers p_1, ..., p_k at time t, with respect to query q.
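Putting Equations 11 and 2 together, a minimal ranking sketch (illustrative; it assumes producer reputations have already been computed and that the stak's maximum reputation is available for normalisation):

```python
def result_reputation(producer_reps, max_stak_rep):
    """Equation 11: a result's reputation is the maximum reputation of its
    producers, each normalised by the highest reputation in the stak."""
    return max(rep / max_stak_rep for rep in producer_reps)

def rank_score(relevance, producer_reps, max_stak_rep, w=0.5):
    """Equation 2 applied at ranking time: blend reputation and relevance."""
    return w * result_reputation(producer_reps, max_stak_rep) + (1 - w) * relevance
```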
5. EVALUATION
In previous work [19] we have demonstrated how the standard relevance-based recommendations generated by HeyStaks
can be more relevant than the top ranking results of Google.
In this work we wish to compare HeyStaks’ relevance-based
recommendation technique to an extended version of the
system that also includes reputation. In more recent prior
work, our initial proof-of-concept reputation model has been
outlined and motivated, and a preliminary evaluation of reputation scores assigned to early adopters of the HeyStaks
system was carried out [11]. We have also shown that
user reputation scores can be used to positively influence
HeyStaks recommendations [12], however this work focused
on only one model.
The purpose of this paper has been to build on previous
work by proposing a number of alternatives for estimating the
reputation of users (producers) who are helping other users
(consumers) to search within the HeyStaks social search service.

Question
1. Who was the last Briton to win the men's singles at Wimbledon?
2. Which Old Testament book is about the sufferings of one man?
3. Which reporter fronted the film footage that sparked off Band Aid?
4. Which space probes failed to find life on Mars?
Table 1: A sample of the user-trial questions.

The aim is to explore known link-analysis techniques
to find a mechanism that best captures HeyStaks users’ reputation in terms of the quality of content they provide their
community. We measure each model’s effectiveness by allowing the scores to influence recommendations made by
HeyStaks: the hypothesis is that by allowing reputation, as well as relevance, to influence the ranking of result recommendations, we can improve the overall quality of search
results. In this section we evaluate these reputation models
using data generated during a recent closed, live-user trial
of HeyStaks, designed to evaluate the utility of HeyStaks’
brand of collaborative search in fact-finding information discovery tasks.
5.1 Dataset and Methodology
Our live-user trial involved 64 first-year undergraduate
university students with varying degrees of search expertise.
Users were asked to participate in a general knowledge quiz,
during a supervised laboratory session, answering as many
questions as they could from a set of 20 questions in the
space of 1 hour. Each student received the same set of questions which were randomly presented to avoid any ordering
bias. The questions were selected for their obscurity and
difficulty; see Table 1 for a sample of these questions. Each
user was allocated a desktop computer with the Firefox web
browser and HeyStaks’ toolbar pre-installed; they were permitted to use Google, enhanced by HeyStaks functionality,
as an aid in the quiz. The 64 students were randomly divided
into search groups. Each group was associated with a newly
created search stak, which would act as a repository for the
groups’ search knowledge. We created 6 solitary staks, each
containing just a single user, and 4 shared staks containing
5, 9, 19, and 25 users. The solitary staks served as a benchmark to evaluate the search effectiveness of individual users
in a non-collaborative search setting, whereas the different
sizes of shared staks provided an opportunity to examine the
effectiveness of collaborative search across a range of different group sizes. All activity on both Google search results
and HeyStaks recommendations was logged, as well as all
queries submitted during the experiment.
During the 60 minute trial, 3,124 queries and 1,998 result
activities (selections, tagging, voting, popouts) were logged,
and 724 unique results were selected. During the course
of the trial, result selections — the typical form of search
activity — dominated over HeyStaks-specific activities such
as tagging and voting. On average, across all staks, result
selections accounted for just over 81% of all activities, with
tagging accounting for just under 12% and voting for 6%.
In recent work we described the performance results of
this trial showing how larger groups tended to benefit from
the increased collaboration effects of HeyStaks [9]. Members
of shared staks answered significantly more questions correctly, and with fewer queries, than the members of solitary
staks who did not benefit from collaboration. In this paper
we are interested in exploring reputation. No reputation
model was used during the live-user trial and so recommendations were ranked based on relevance only. However the
data produced makes it possible for us to replay the user
trial so that we can construct our reputation models and
use them to re-rank HeyStaks recommendations. We can
retrospectively test the quality of re-ranked results versus
the original ranking against a ground-truth relevance; since
as part of the post-trial analysis, each selected result was
manually classified as relevant (the result contained the answer to a question), partially relevant (the result referred to
an answer, but not explicitly), or not-relevant (the result did
not contain any reference to an answer) by experts.
5.2 User Reputation
We now examine the type of user reputation values that
are generated from the trial data. In Figure 2, box-plots are
shown for the median reputation scores across the 4 shared
staks and for each reputation model. Here we see that for
the WeightedSum model there is a clear difference in the median reputation score for members of the 5 person stak when
compared to members of the larger staks. This is not evident in results for the PageRank model, which shows very
similar reputation scores, regardless of stak size. For the
Hubs and Authority models we see very exaggerated median
reputation scores for the largest 25-person stak, whereas the
median reputation scores for members of the smaller staks
are orders of magnitude less. Next we consider, for members of each stak, how the reputation scores produced by the
four reputation models compare. The pairwise rank correlations between user reputation scores given by each reputation model are shown in Table 2. With the exception of
the 5 person stak (likely due to the relatively small number of users in this particular stak), correlations are seen
to be high between the WeightedSum, PageRank and Authority models. For example, pairwise correlations between
these models in the range 0.90-0.94 are observed for the 25
person stak. In contrast, the correlations between the Hubs
model and the other models are much lower; and indeed,
are negative for the smaller 5 and 9 person staks. It is
difficult to draw precise conclusions about the Hubs correlations for each of the staks concerned (given the constrained
nature of the user-trial and the different numbers of users
in each stak), but since the HITS Hubs metric is designed
to identify pages that contain useful links towards authoritative pages in the web search domain (analogous to good
consumers rather than producers in our context), such low
correlations are to be expected with the other models which
more directly focus on producer activity.
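For reference, pairwise rank correlations of the kind reported in Table 2 can be computed with scipy's spearmanr; the values below are made up for illustration, not the trial data:

```python
from scipy.stats import spearmanr

# Reputation scores for the same stak members under two models
# (illustrative values only).
weighted_sum = [0.91, 0.42, 0.73, 0.10, 0.55]
pagerank     = [0.85, 0.38, 0.61, 0.15, 0.47]

rho, _ = spearmanr(weighted_sum, pagerank)
print(f"pairwise rank correlation: {rho:.2f}")
```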
Further, a desirable property of a reputation model is that
it should capture consumption diversity, meaning that in order for producers to gain high reputation, many consumers
should benefit from the content that producers contribute
to staks. Table 3 shows the Pearson correlation between the
number of distinct consumers per producer (per stak) and
producer reputation according to each of the four reputation models tested. Across all staks, Authority displays the
highest correlations (between 0.98 and 1), indicating that
this model is particularly effective in capturing consumption
diversity. This is to be expected, given that user Authority
scores are directly influenced by the number of consumers interacting with them. In contrast and given the nature of the
Hubs model, it unsurprisingly fails to capture consumption
diversity. For the larger staks, we can see good correlations
are achieved for the WeightedSum and PageRank models
(a) 5 person stak
             PageRank   Hubs    Authority
WeightedSum     0.90    -0.60      0.30
PageRank                -0.70      0.50
Hubs                              -0.90

(b) 9 person stak
             PageRank   Hubs    Authority
WeightedSum     0.88    -0.67      0.72
PageRank                -0.68      0.70
Hubs                              -0.98

(c) 19 person stak
             PageRank   Hubs    Authority
WeightedSum     0.84     0.31      0.83
PageRank                 0.31      0.91
Hubs                               0.37

(d) 25 person stak
             PageRank   Hubs    Authority
WeightedSum     0.94     0.35      0.90
PageRank                 0.30      0.92
Hubs                               0.18

Table 2: Pairwise rank correlations between user reputation scores given by each reputation model for (a) 5 person, (b) 9 person, (c) 19 person and (d) 25 person staks.
also, but less so for the smaller staks. In future work, we
plan on refining our WeightedSum model in order to better
reflect consumption diversity for such small-sized staks.
Figure 2 shows that there are significant differences in user
reputation scores produced by the four different models. But
how best to interpret these differences? In this work, we
consider that the true test of these reputation models is the
extent to which they improve the quality of results recommended by HeyStaks. We have described how HeyStaks
combines term-based relevance and user reputation to generate its recommendation rankings (see Equation 2); in the
following section we regenerate each of the recommendation
lists produced during the trial using our reputation models
and compare the performance of each.
5.3 From Reputation to Quality
Since we have ground-truth relevance information for all
of the recommendations (relative to the quiz questions), we
can then determine the quality of the resulting recommendations. Specifically, we focus on the top recommended result
and note whether it is relevant (that is, contains the answer
to the question) or not relevant (does not contain the answer
to the question). For each reputation model we compute an
overall relevance rate, as the ratio of the percentage of recommendation sessions where the top result was deemed to
be relevant, to the percentage of those where the top result
was not-relevant. Moreover, we can compare this to the relevance rate of the recommendations made by the standard
HeyStaks ranking (i.e. when w = 0 in Equation 2) in the
                       Stak Size
               5       9      19      25
WeightedSum   0.41    0.52   0.78    0.85
PageRank      0.75    0.58   0.85    0.92
Hubs         -0.86   -0.63   0.43    0.26
Authority     1.00    1.00   0.98    0.99

Table 3: Correlations between the number of distinct consumers per producer per stak and producer reputation.
Figure 2: Reputation scores per user, per stak for the four reputation models.
trial to compute an overall relevance benefit; such that a relevance benefit of 40%, for a given reputation model, means
that this model generated 40% more relevant recommendations than the standard HeyStaks ranking scheme.
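One plausible reading of these rate and benefit definitions, as a sketch (one boolean per recommendation session, marking whether the top recommendation was relevant; function names are illustrative):

```python
def relevance_rate(top_is_relevant):
    """Ratio of sessions whose top recommendation was relevant to those
    whose top recommendation was not relevant."""
    relevant = sum(top_is_relevant)
    return relevant / (len(top_is_relevant) - relevant)  # assumes some non-relevant sessions

def relevance_benefit(model_sessions, baseline_sessions):
    """Percentage improvement of a model's relevance rate over the
    standard relevance-only HeyStaks ranking."""
    base = relevance_rate(baseline_sessions)
    return 100.0 * (relevance_rate(model_sessions) - base) / base
```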
Figure 3 presents a graph of relevance benefit versus the
weighting (w) used in Equation 2 to adjust the influence of
term-based relevance versus user reputation during recommendation. The results for all four reputation models indicate a significant benefit in recommendation quality when
compared to the standard HeyStaks recommendations. As
we increase the influence of reputation over relevance during recommendation (by increasing w) we see a consistent increase in the relevance benefit, up to values of w in the range
0.5-0.7. For example, we can see that for w = 0.5, the reputation models are driving a relative improvement in recommendation relevance of about 30-40% compared to default
HeyStaks’ relevance-only based recommendations. Overall
the Hubs model performs best. It consistently outperforms
the other models across all values of w and achieves a maximum relevance benefit of about 45% at w = 0.7. Looking
at mean relevance benefit across reputation models, Hubs
is clearly the best performer. For example, Hubs achieves
a mean relevance benefit of 31%, while the other models
achieve similar mean relevance benefits of between 21-25%.
In a sense, this finding is counter-intuitive and highlights
an interesting property of the HITS algorithm in this context. One might expect, for example, that the Authority
model would outperform Hubs, given that Authority scores
capture the extent to which users are good producers of quality search knowledge (i.e. users whose recommendations are
frequently selected by other users), while Hubs captures the
extent to which users are good consumers (i.e. users who select, tag, vote etc. HeyStaks recommendations deriving from
the activity of good producers).

Figure 3: Relevance benefit vs. reputation model.

However, given the manner in which the collaboration graph is constructed (Section
4.2), once a user has consumed a recommended result, then
that user is also considered to be a producer of the result in
question if it is recommended by HeyStaks and selected by
other users at future points in time. Thus, good consumers
— who select recommended results from many good producers (i.e. producers with high Authority scores) — serve
as a “filter” for a broad base of quality search knowledge, and
hence re-ranking default HeyStaks recommendations using
reputation scores from the Hubs model leads to the better
recommendation performance observed in Figure 3.
5.4 Limitations
In this evaluation we have compared a number of reputation models based on live-user search data. One limitation of this approach is that although the evaluation uses live-user search data, the final recommendations are not themselves evaluated using live users. Instead we replay users'
searches to generate reputation-enhanced recommendations.
The reason for this is the difficulty in securing sufficiently
many live-users for a trial of this nature, which combines
a number of reputation models and therefore a number of
experimental conditions. That being said, our evaluation
methodology is sound since we evaluate the final recommendations with respect to their ground-truth relevance. We
have an objective measure of page relevance based on the
Q&A nature of the trial and we use this to evaluate the genuine relevance of the final recommendations. The fact that
our reputation models deliver relevance benefits above and
beyond the standard HeyStaks recommendation algorithm is
a clear indication that reputation provides a valuable ranking signal. Of course this evaluation cannot tell whether
users will actually select these reputation ranked recommendations, although there is no reason to suppose that they
would treat these recommendations differently from the default HeyStaks recommendations, which they are inclined to
select. We view this as a matter for future work.
Another point worth noting is that the live-user trial is
limited to a specific type of search task, in this case a Q&A
search task. Although such a task is informational in nature
(according to stipulations set out by Broder [2]) it would
be unsafe to draw general conclusions in relation to other
more open-ended search tasks. However, this type of focused
search task is not uncommon among web searchers and as
such we feel it represents an important and suitable use-case that is worthy of evaluation. Moreover, previous work
[19] has looked at the role of HeyStaks in more open-ended
search tasks to note related benefits to end-users from its
default relevance-based recommendations. As part of our
8
future work we are currently in the process of deploying
and evaluating our reputation model across similar general-purpose search tasks.
6. CONCLUSIONS
In this paper we have described a number of different user
reputation models designed to mediate result recommendation in collaborative search systems. We have described the
results of a comparative evaluation in the context of real-user
data which highlights the ability of these models to improve
overall recommendation quality, when combined with conventional recommendation ranking metrics. Moreover, we
have found that one model, based on the well-known HITS
Hubs metric, seems to perform especially well, delivering relative improvements of up to 45%. We believe that this work
lays the ground-work for future research in this area which
will focus on scaling-up the role of reputation in HeyStaks
and refining the combination of relevance and reputation
during recommendation.
Our reputation model is utility-based [11], based on an
analysis of the usefulness of producer recommendations during collaboration events. Currently, in HeyStaks the identity
of users (producers and consumers) is not revealed and so
users do not see where their recommendations come from.
In the future it may be appropriate to relax this anonymity
condition in certain circumstances (under user control). By
doing so it will then be possible for individual users to better understand the source of their recommendations and the
reputation of their collaborating users. As such this model
can ultimately lead to the formation of trust-based relationships via search collaboration.
7. ACKNOWLEDGMENTS
This work is supported by Science Foundation Ireland under grant 07/CE/I1147.
8. REFERENCES
[1] S. Brin and L. Page. The Anatomy of a Large-Scale
Hypertextual Web Search Engine. In WWW ’98:
Proceedings of the 7th international conference on
World Wide Web, pages 107–117, Brisbane, Australia,
1998. ACM.
[2] A. Broder. A taxonomy of web search. SIGIR Forum,
36:3–10, September 2002.
[3] K. Bryan, M. O’Mahony, and P. Cunningham.
Unsupervised Retrieval of Attack Profiles in
Collaborative Recommender Systems. In RecSys ’08:
Proceedings of the 2008 ACM conference on
Recommender systems, pages 155–162, New York, NY,
USA, 2008. ACM.
[4] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme.
FolkRank: A Ranking Algorithm for Folksonomies. In
FGIR 2006, pages 2–5. CiteSeer, 2006.
[5] A. Jøsang, R. Ismail, and C. Boyd. A Survey of Trust
and Reputation Systems for Online Service Provision.
Decis. Support Syst., 43(2):618–644, 2007.
[6] J. M. Kleinberg. Authoritative Sources in a
Hyperlinked Environment. Journal of the ACM,
46(5):604–632, 1999.
[7] S. K. Lam and J. Riedl. Shilling recommender systems for fun and profit. In Proceedings of the 13th International World Wide Web Conference (WWW '04), pages 393–402, New York, NY, USA, May 17–20 2004. ACM.
[8] M. Lazzari. An Experiment on the Weakness of Reputation Algorithms Used in Professional Social Networks: The Case of Naymz. In IADIS International Conference e-Society, pages 519–522, 2010.
[9] K. McNally, M. P. O'Mahony, and B. Smyth. Social and Collaborative Web Search: An Evaluation Study. In Proceedings of the 16th International Conference on Intelligent User Interfaces (IUI '11), pages 387–390, Palo Alto, California, USA, 2011. ACM.
[10] K. McNally, M. P. O'Mahony, B. Smyth, M. Coyle, and P. Briggs. Collaboration and Reputation in Social Web Search. In 2nd Workshop on Recommender Systems and the Social Web, in association with the 4th ACM Conference on Recommender Systems (RecSys 2010), 2010.
[11] K. McNally, M. P. O'Mahony, B. Smyth, M. Coyle, and P. Briggs. Towards a Reputation-based Model of Social Web Search. In Proceedings of the 14th International Conference on Intelligent User Interfaces (IUI 2010), pages 179–188, Hong Kong, China, 2010.
[12] K. McNally, M. P. O'Mahony, B. Smyth, M. Coyle, and P. Briggs. A Case-study of Collaboration and Reputation in Social Web Search. ACM Transactions on Intelligent Systems and Technology (TIST), in press, 2011.
[13] B. Mobasher, R. Burke, R. Bhaumik, and C. Williams. Toward trustworthy recommender systems: An analysis of attack models and algorithm robustness. ACM Transactions on Internet Technology (TOIT), 7(4):1–40, 2007.
[14] J. O'Donovan and B. Smyth. Trust in Recommender Systems. In IUI '05: Proceedings of the 10th International Conference on Intelligent User Interfaces, pages 167–174, 2005.
[15] M. P. O'Mahony, N. J. Hurley, and G. C. M. Silvestre. Promoting recommendations: An attack on collaborative filtering. In Proceedings of the 13th International Conference on Database and Expert Systems Applications (DEXA 2002), pages 494–503, Aix-en-Provence, France, 2002. Springer.
[16] P. Resnick and R. Zeckhauser. Trust Among Strangers in Internet Transactions: Empirical Analysis of eBay's Reputation System. Advances in Applied Microeconomics, 11:127–157, 2002.
[17] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[18] R. Schifanella, A. Barrat, C. Cattuto, B. Markines, and F. Menczer. Folks in folksonomies: social link prediction from shared metadata. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10), pages 271–280, New York, New York, USA, 2010. ACM.
[19] B. Smyth, P. Briggs, M. Coyle, and M. P. O'Mahony. Google Shared. A Case Study in Social Search. In User Modeling, Adaptation and Personalization. Springer-Verlag, June 2009.
Yokie - A Curated, Real-time
Search & Discovery System using Twitter
Owen Phelan, Kevin McCarthy and Barry Smyth
CLARITY Centre for Sensor Web Technologies
School Of Computer Science & Informatics
University College Dublin
Email: firstname.lastname@ucd.ie
ABSTRACT
Social networks and the Real-time Web (RTW) have joined
Search and Discovery as central pillars of online human activities. These are staple venues of interaction, with vast
social graphs facilitating messaging and sharing of information. Twitter1 , for example, boasts 200 million users posting over 150 million messages every day. Such volumes of
content being disseminated make for a tempting source of
relevant content on the web. In this paper, we present Yokie, a novel search and discovery system that sources its index from the shared URLs of a curated selection of Twitter users. The added benefit of this method is that tweets containing these URLs carry extra contextual information, such as terms describing the URL and its publishing time, as well as tweet metadata which can include location and user data. Also, since we are exploiting a social graph structure
of content sharing, it is possible to explore novel reputation
ranking of content. The mixture of contextual data, with
the fundamental harnessing of sharing activities amongst a
curated set of users combine to produce a novel system that,
with an initial online user trial, has shown promising results.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous
General Terms
Algorithms, Experimentation, Theory
Keywords
Search, Discovery, Information Retrieval, Relevance, Reputation, Twitter
1 Twitter - http://www.twitter.com
Sample      Tweet count    Tweet count (with URL)    %
Sample 1    54221          11964                     22.065251
Sample 2    1411784        331445                    23.47703
Sample 3    6924205        1539323                   22.231043
Sample 4    7453870        1647295                   22.09986
Sample 5    60042573       13113525                  21.840378
                           Average:                  22.468298
                           Std. Dev.:                0.67627121

Figure 1: Analysis of 5 public Twitter datasets of varying sizes consisting of public tweets, with percentage of Tweets containing URLs. Datasets gathered at various points between 2009 and 2011. Sets 1, 2 and 3 were focussed scrapes, specific to a set of hashtags. Sets 4 and 5 were general public scrapes of the Twitter firehose.
1. INTRODUCTION
Google², Bing³ and Yahoo!⁴ are household tools for finding relevant items on the web, of varying quality and relevance to the user's search query or task. These systems rely on automatic software "crawlers" that build queryable indexes by navigating the web of documents. These crawlers index documents based on their content, find edges between documents (hyperlinks), and perform a set of weighting and relevance calculations to decide on hubs and authorities of the web, while improving index quality [3]. More recently, search systems have started to introduce context into their ranking and retrieval strategies, such as the location and time of document publication. These are mostly content-based (related to a document's actual content), as it is difficult for a web crawler to determine the precise contextual features of a web document.

Social networks are an abundant resource of social activity and discussion. In the case of Twitter, we estimate that on average 22% of tweets contain a hyperlink to a document (analysis shown in Figure 1).
2 Google - http://www.google.com
3 Microsoft Bing - http://www.bing.com
4 Yahoo! - http://www.yahoo.com
This rate has held steady despite the three-fold increase in Twitter's tweets-per-day rate in the past year, and a ten-fold increase between 2009 and 2010. These URLs can be news items, photos, geo-located "check-ins", videos, as well as vanilla URLs to websites [13]. The interesting dynamic here is that these networks allow users to repost, or retweet, other people's items, which allows these links to propagate throughout the graphs of users on the service.
In this paper, we present Yokie, a novel search and discovery system with several main attributes relating to content sources, querying, ranking and retrieval:

1: The system uses posted and shared content that contains hyperlinks as the basis of an index of webpages, the main content of which is based on the user-generated text included with each hyperlink.

2: It operates using a curated list of sources; in this case, these sources are Twitter users who post Tweets. These curated lists of users are called Search Parties.

3: The querying component of the system allows users to add extra contextual filters in addition to query strings; these are in the form of a temporal window (between two dates). It extracts a range of contextual features from shared content, and as such the system's querying UI can be adapted to exploit these also.

4: The added contextual data from these messages can give the user interesting and more relevant ways of ranking content over traditional approaches, as well as interesting item discovery opportunities.
We use the contextual data for presentation and added ranking functionality. Emphasis is also placed on contextual filtering, such as temporal windows in queries, rather than just a single-attribute keyword search. In the following sections, we will describe the system in greater detail, along with the details of a live user evaluation of the prototype, and a discussion of the results and future directions.
2. BACKGROUND
Social network activity dominates traffic and per-user time spent on the web [13]. Real-time web products and services provide access to new types of information, and the real-time nature of these data streams provides as many opportunities as it does challenges. In addition, companies like Twitter have adopted a very open approach to making their data available via APIs⁵. It is no surprise then that the recent literature includes analyses of Twitter's real-time data and graph, largely with a view to developing an understanding of why and how people are using services like Twitter; see for example [5, 7, 8, 11, 13, 15].
For instance, the work of Kwak et al. [13] describes a very comprehensive analysis of Twitter users and Twitter usage, covering almost 42 million users, nearly 1.5 billion social connections, and over 100 million tweets. The authors examined reciprocity and homophily among Twitter users, compared a number of different ways to evaluate user influence, and investigated how information diffuses through the Twitter ecosystem as a result of social relationships and retweeting behaviour.
5 Twitter Developer API - http://developer.twitter.com
[Figure 2 shows an example tweet by user @phelo, posted at 16:23 GMT from Dublin, Ireland: "obama in japan on #g20 #ecotalks http://bit.ly/82os5zx", annotated with its extracted terms (obama, japan), hashtags (#g20, #ecotalks) and URL (http://bit.ly/82os5zx).]

Figure 2: Example of a Twitter tweet containing a hyperlink and contextual information
Some of our own previous work has explored using Twitter as a news discovery and recommendation service, with item discovery appearing to be a prominently useful feature [15]. Krishnamurthy et al. identify classes of Twitter users based on behaviours and geographical dispersion [12]. They highlight the process of producing and consuming content based on retweet actions, where users source and disseminate information through the network. Other work by Chen et al. and Bernstein et al. has looked at using the content of messages to recommend content topically [1, 4].

Curation and content editorial are age-old practices in publishing. News organizations operate editorial teams to filter output for relevant, interesting, topical and aesthetic content for their audiences. In the domain of recommender systems, curation is an interesting avenue of exploration, for example as a benchmark against automatic or intelligent methods of item recommendation.
Related to the idea of curation are the notions of the Trust, Provenance and Reputation of those who provide input into the system. Reputation scoring is an active field in Recommender Systems [17] and Social Search Systems [2]. In particular, focus is placed on finding reputable sources of information from which to extract and present content. As an example, the TrustRank technique proposed by Gyongyi et al. computes a reputation score for elements in a web graph with the purpose of detecting spam [6], whereas several explorations, such as those by McNally et al., have examined the notion of computing reputable users in a social search context [14].

Yokie's inherent novelty starts with the broad range of related research fields that it opens up for exploration. The system's main features and technologies are described in the next section.
3. YOKIE
Twitter is an expansive natural resource of user-generated content: while each item may seem to comprise only 140 characters, it also contains a rich quantity of metadata and contextual information that is published in a timely manner. Rather than an automatic crawler locating documents on the web, Yokie follows a curated set of users on Twitter, and is capable of receiving a stream of documents contributed by users in a timely fashion. These documents contain hyperlinks, descriptive text, and other metadata such as time of publishing and the number of people in a potential audience.
In the example of Figure 2, the user @phelo has posted a URL with a set of text (Obama in Japan on #G20 #ecotalks) at a given time. The system extracts the URL, resolves it (expanding it to e.g. www.cnn.com/obama.html) and stores it. Instead of using the content of that URL as the basis of the search index, it uses the set of surrounding text. The index also takes into account the time data of when the tweet was published. The index item that relates to that URL also contains related content from tweets posted by the curated set of users that contain that same URL. These contextual pieces of metadata are all stored so the system can perform content ranking and re-ranking.
To give a typical use case, a user may query the term "obama" in a traditional search system and get back relevant content based on some ranking strategy. In Yokie, the UI allows the user to directly query a search term along with a defined temporal window, so the query will look like "obama" between "6 hours ago" and "now". Yokie has the ability to parse natural-language date-time strings, which we feel allows for an easier definition of the search task than cumbersome date-picking UIs. The results list can then be re-ranked using a number of features.
In the following sections, we will describe in greater detail how the system accomplishes its data gathering, storage, retrieval, presentation and user interactions. We will also discuss several ranking strategies that enable the user to explore results based on the contextual data from both their originating and related tweets.
3.1 Architecture

The architecture of the system, as presented in Figure 4A, highlights how data is gathered, stored and queried, and how results are presented to the user. Here, we will discuss each main component and how each operates during query, retrieval, results presentation and interaction times.

3.1.1 The Search Party

A key component of the system is a curated list of content sources, which we have termed a Search Party⁶. An example could be a user curating a list of Twitter users who they believe to be related to, or indeed to talk about, a given domain. This allows users to curate dedicated search engines for personal and community use based around a domain-specific topic. In the current prototype we have curated a seed list of 140 Twitter users who mostly discuss Technology, and who have been listed in Twitter's list feature under a Technology category.

6 It is intended that when the system has a number of these parties, they are indexed separately.

[Figure 4A depicts the full pipeline: a data-gathering agent (scraping tweets defined by the Search Party; filtering and parsing content; resolving URLs; finding mentions of URLs; analyzing tweet metadata) feeds an Indexer, which writes tweet content, timestamp, URL and hashtags to the Main Content Index (updating text if not new) and URL metadata to a Metadata Database. A Querying System parses the query term and date window (e.g. "iPad" from "1 day ago" to "now"), queries the Main Content Index and retrieves results together with their metadata (tweeter, mentions, date, urlTitle, urlDescription, etc.); a Re-Ranking System then re-ranks results based on that metadata (age, reputation, mentions, relevance, etc.) and calculated reputation scores. Figure 4B illustrates how several tweets by curated users x, y and z sharing the same URL (http://bit.ly/z9ka), together with public retweets or mentions of it, are merged into a single Solr index document (the combined term set, e.g. {debt, crisis, talks, great, solved, obama, #obama, ceiling}) and a database metadata record (time, location, original user, URL title, etc.).]

Figure 4: Yokie System - (A) Full System Architecture. (B) Shows the process of describing and indexing a document.
Figure 3: Yokie in a browser. As shown above, the system takes a traditional approach to layout; it includes functions for viewing extra metadata related to the item (pane to the right of the "DesignTaxi" item).
3.1.2 Content Gathering
The data-gathering agent uses Twitter's API to scrape a domain of Tweets, or a subset of the total stream. The system can be adapted to listen to the public stream, or sources can be curated based on user lists, keywords, geographical metadata or algorithmic analysis of relevant, interesting or important content. These messages are then stored and indexed using the service described in the following subsection. This component also carries out real-time language classification and finds related messages that contain the same URL so the system can calculate item popularity.
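To make the gathering step concrete, the following is a minimal sketch of the per-tweet processing, assuming tweets have already been fetched from Twitter's API as JSON dictionaries (the field names follow Twitter's tweet format; the helper itself is an illustration, not Yokie's actual code):

import re
from datetime import datetime

URL_PATTERN = re.compile(r'https?://\S+')

def extract_record(tweet):
    """Pull the URL and contextual metadata out of one tweet dict;
    returns None for tweets without a hyperlink, since only
    URL-bearing tweets enter the index."""
    urls = URL_PATTERN.findall(tweet.get('text', ''))
    if not urls:
        return None
    return {
        'url': urls[0],  # shortened URL, to be resolved before indexing
        'text': URL_PATTERN.sub('', tweet['text']).strip(),
        'user': tweet['user']['screen_name'],
        'followers': tweet['user']['followers_count'],
        'hashtags': [w for w in tweet['text'].split() if w.startswith('#')],
        'timestamp': datetime.strptime(tweet['created_at'],
                                       '%a %b %d %H:%M:%S +0000 %Y'),
    }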
3.1.3 Storage & Indexing
Once content is gathered, it is pushed to the Storage and Indexing subsystem. This is responsible for extracting metadata from the tweets, for instance timestamp data, hashtags (#obama, etc.), user profile information and location, as well as the message content itself. The process by which Yokie deals with indexing and storing metadata is described in Figure 4B. The main content, the urlID of the URL mentioned in the message, and the timestamp are pushed to an indexer for storage and querying; in our current implementation we use Apache Solr⁷ for this. We store the remaining extracted metadata in a database; the current implementation of the system uses the MongoDB⁸ NoSQL system for this databasing functionality.
7 Apache Solr - http://lucene.apache.org/solr/
8 MongoDB - http://mongodb.org
These content indexes and databases give us a quick, programmable way of querying the content, while providing an equally handy way of gathering the associated metadata for presenting the results of a contextual query, re-ranking based on metadata, and presenting further metadata to the user.
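As a rough illustration of this split between index and metadata store, the sketch below pushes a gathered record into Solr and MongoDB using the pysolr and pymongo client libraries; the core name 'yokie', the field names and the store() helper are our own assumptions, not the paper's actual schema:

import pysolr
from pymongo import MongoClient

solr = pysolr.Solr('http://localhost:8983/solr/yokie')  # hypothetical core
db = MongoClient().yokie                                # hypothetical database

def store(record, url_id):
    # Index only what querying needs: content, urlID and timestamp.
    solr.add([{
        'id': url_id,
        'content': record['text'],
        'timestamp': record['timestamp'].isoformat() + 'Z',
    }])
    # All remaining metadata (user, hashtags, follower counts, ...) goes
    # to the metadata database, keyed by the same urlID for re-ranking.
    db.metadata.update_one(
        {'_id': url_id},
        {'$set': {k: v for k, v in record.items() if k != 'text'}},
        upsert=True)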
3.1.4 User Interface
The system is essentially made up of a query interface, currently comprising a query string field and two temporal fields, from and to. The system takes in a query string with an associated time window, which can be either a natural-language expression (e.g. "1 day ago", "now", "last week", etc.) or a fixed date ("12 December 2010"). The UI also allows users to drill down on results to explore related content, such as the original tweet that the URL was shared with, the time and day it was shared, and the related Tweet mentions (if any). A re-ranking menu is also presented, which allows users to re-rank the results (this will be discussed further below). Such interface tweaks have been successfully applied to systems in the past and have been shown to provide added value for users and to motivate participation [16] (in this case, for the curated list of Twitter users).
3.1.5 Querying
The querying subsystem is the largest component of the system. It parses user queries, based on the triple {QueryString, Tmax, Tmin}. The natural-language date strings it receives (e.g. "1 week ago" to "1 hour ago") are parsed into a computer-readable format (e.g. 12 June 2011 12:31:41 translates to the UNIX timestamp 1307881901). Users can specify specific dates, as well as special keywords such as "yesterday" (12am the day before) and "now".
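One way to realise this parsing step is with an off-the-shelf natural-language date parser; the sketch below uses the open-source dateparser package purely as an illustration (the paper does not state which parser Yokie employs), and follows Section 5.3's convention that Tmin corresponds to the "date to" bound:

from datetime import datetime
import dateparser  # pip install dateparser

def parse_query(query_string, date_from, date_to):
    """Build the query triple, e.g. parse_query('obama', '6 hours ago', 'now')."""
    def to_unix(s):
        dt = datetime.now() if s == 'now' else dateparser.parse(s)
        return int(dt.timestamp())
    return {'QueryString': query_string,
            'Tmax': to_unix(date_from),   # older bound ("date from")
            'Tmin': to_unix(date_to)}     # newer bound ("date to")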
The triple is pushed to the Querying subsystem, and a set of database IDs of URLs (urlIDs) is returned. The querying system takes these resulting urlIDs and finds the complete database object for each URL. These objects contain pertinent metadata for the URL: its title, expanded hyperlink and description, as well as the surrounding Tweet content related to the initial tweet that mentioned it. At querying time, the system uses the expanded metadata from the database to re-rank the vector of URLs based on the user's specified ranking strategy; these strategies are explained in the following subsection.
3.2 Ranking Strategies

For our initial prototype of the system, we implemented several ranking strategies that are used to order the results lists presented to users. The main aim in exposing several strategies beyond the main relevance strategy was to expose the potential benefits of the added contextual data extracted from the stream of tweets.

3.2.1 Relevance

Traditional IR systems use Term Frequency x Inverse Document Frequency (TFxIDF) scoring [18]; we label this as "Relevance" in Yokie, and it represents this evaluation's benchmark. The indexing component of the architecture natively ranks items based on relevance. The remaining strategies rank items algorithmically post retrieval time, as described in the following subsections.

3.2.2 Item Age

Since the content that is indexed is timely, and the querying system has temporal windowing, it is prudent to consider the ability to rank items based on their age. We allow the user to rank the list by newer and older items. This is particularly useful in the context of the temporal window, as users may query between a certain date or time and "now", then rank by newer first. This gives them a near-real-time updating of content related to the query.

3.2.3 Item Popularity

When an item is indexed by the data-gathering agent, a separate thread begins that searches Twitter for mentions of the same URL. As such, we consider Popularity a useful metric. This is the total number of unique mentions of a given URL inside the query time window. These related tweets are sourced from the public feed, as well as from amongst the users of the curated Search Party.

3.2.4 Item Longevity

Longevity describes the total length of time an item appears in the domain (the amount of time between the first mention/activity and the last mention/activity of the item). This score applies to items that have more than one occurrence in the set. For example, a given URL U has a longevity score l based on the difference between the Unix timestamp of the latest mention Tmax and that of the first mention Tmin.

3.2.5 Reputation

As described in Section 2, reputation is an interesting subfield appearing in recommender systems and search contexts. In the current iteration of Yokie, we use reputation scoring of Twitter users to rank items, placing items from more reputable users higher in a descending list. In this iteration of the system, we use a shallow summation of the total potential audience of the URL, based on the sum of the follower counts of each person in the curated domain list. Our motivation for doing so relates to the notion that follower relationships in Twitter's directed social graph may act as a form of promotion, or a vote in favor of a person to follow. In future iterations of the system we hope to explore a range of more comprehensive reputation scoring approaches based on graph analyses and topic detection.
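The following sketch makes the strategies concrete as a single re-ranking function. It is illustrative only: the field names ('relevance', 'mentions', 'sharer_followers') are stand-ins for Yokie's metadata records, with longevity computed as Tmax - Tmin over an item's mention timestamps and reputation as the summed follower counts, per the definitions above:

def score(item, strategy):
    # item: metadata dict with 'timestamp' (Unix time of first share),
    # 'mentions' (Unix times of all mentions in the query window),
    # 'sharer_followers' (follower counts of curated sharers) and the
    # index-supplied 'relevance' (TFxIDF) score.
    if strategy == 'relevance':
        return item['relevance']
    if strategy == 'newer_first':
        return item['timestamp']
    if strategy == 'older_first':
        return -item['timestamp']
    if strategy == 'popularity':
        return len(set(item['mentions']))   # unique mentions in the window
    if strategy == 'longevity':
        return max(item['mentions']) - min(item['mentions'])
    if strategy == 'reputation':
        return sum(item['sharer_followers'])  # potential audience size
    raise ValueError(strategy)

def rerank(items, strategy):
    return sorted(items, key=lambda i: score(i, strategy), reverse=True)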
In the following section we will discuss a live user evaluation which, among other things, encouraged user re-ranking of their results lists using the properties discussed above.
4. ONLINE USER EVALUATION
In order to capture some preliminary usage statistics, we deployed the prototype (as seen in Figure 3) for a live user evaluation. Our aim here was to examine patterns in how users interacted with the querying interface and the subsequent results lists on a session-level basis. We launched the prototype system, which was online for one week. In this setup, we curated a list of 140 users on Twitter who we believed contributed mostly technology content. Yokie's data-gathering agent gathered past and current content from each account; in all, during the evaluation it captured 75,021 unique URLs, the oldest dating from May 2007 and the newest from the present hour of the evaluation. This varied depending on the number of statuses and the posting frequency of each account in the curated list.
The broad makeup of the evaluation participants was technology and Computer Science research students and colleagues within our group. The definition of the technology-oriented search party was initially explained to them. We captured a range of participant activities in the system. For each user, we counted search sessions, which contained one or more queries, as well as any click-throughs from the results list and highlighting actions in the metadata window. Initially, we also defaulted the temporal window to be between "one month ago" and "now" (the time the query was performed). Data from each result interaction was captured, including list positions and scores. The results of these interactions are presented in the following section.
5. PRELIMINARY RESULTS
In this section, we will focus on preliminary results from the evaluation that represent usage trends and patterns, and highlight the novel nature of the system and its interface. In total, the system ran for seven days, with 35 unique participants who performed 223 queries in total.
5.1 Sessions

In a broad sense, each user partook in search sessions that contained queries and other interactions. A session is defined as the total set of interactions (queries, click-throughs, hovers, reformulations, etc.) between the initial query and the final instance of user interaction. The evaluation gathered 69 search sessions in total across all 35 participants. On average, each user performed 3 queries per session. Here, we will discuss query activity, focusing particularly on the temporal windowing aspect of the query.

5.1.1 Queries
As mentioned, there were in total 223 queries performed by all users across the 7 days. Delving deeper into the makeup of the queries themselves, we see that of those 223 queries there were 116 unique query strings, showing high overlap and duplication across the set. In some of these cases, the overlap was due to query reformulation, which is discussed in a section below. The average query length amongst this set of unique queries was 1.54 terms, or an average of 9 characters.
5.1.2 Temporal windowing
As mentioned in the previous sections, Yokie's querying interface has an emphasis on the searcher providing a temporal window for each query. Across the 223 queries, the average time window was 55 days in size; however, 198 of the queries had the temporal window starting from "now" and extending backwards, which suggests people were interested in currently updating content relating to a query. Some more interesting results relating to the windowing are described in Section 5.3.
5.2 Result-lists & User Interactions
The main user interactions with results lists were item "hovering", through which users were encouraged to explore extra metadata regarding an item, and clicking on the items themselves. Here we will describe these interactions and some details regarding the makeup of the results lists.
5.2.1 Results-lists
For each of the 223 queries, the results lists ranged between 3 and 4000 returned items, with an average size of 294 items. In each case, the user was only presented with the top 50 based on the ranking strategy. For each interaction with items on the results list, data such as the position of the item in the list and its score was captured.
5.2.2 Clicks & Hovers
Once results lists were presented to the user post querying, the user had the option either to "peek" at extra metadata relating to the URL, as shown in the screenshot in Figure 3, or to click on the item in a traditional fashion to visit the page. One interesting metric was that while there was a reasonable number of click-throughs (80), there was tremendous interest in the metadata pane, which garnered 267 activations. Also interesting was that 90% of the click-throughs that the system captured occurred on items that had their metadata exposed. If we consider click-throughs to be the ultimate success metric of a query, then this shows that users are highly interested in exploring more information relating to their results.
[Figure 5 is a bar chart titled "Number of Re-ranks by Strategy", plotting #reranks (0-40) for each strategy: Relevance, Newer First, Older First, Mentions, Reputation.]

Figure 5: Breakdown of frequency of re-ranking per strategy.
5.2.3 Re-ranking of results
The ranking strategies outlined in Section 3 were implemented in the system, namely Relevance, Newest first, Oldest first, Popularity, Reputation and Longevity; each of these was explained to the users at the beginning of the live evaluation. As shown in Figure 5, users preferred re-ranking using each of the strategies rather than the benchmark relevance metric. Reputation was the most popular re-ranking strategy employed, but was only slightly ahead of popularity.
Users also seemed to like ranking with newer items at the top of the results lists. This, perhaps, shows the utility of a real-time search system, which can potentially provide up-to-the-minute results based solely on their freshness.
5.3 Query Reformulation
One interesting result that emerged concerns the user activity of query reformulation. Related search analysis work by [9] and [10] has discussed the user practice of reformulating queries in search sessions. In all, there were 24 sessions that contained a reformulation of the query, but in each of these reformulations the user did not change the query term in any way; the reformulation exclusively took the form of a modification of the time window. In 8 of the cases, the users refreshed the results, but this is because the Tmin value (or "date to") was set at "now", which meant that content was potentially changing at a real-time rate; however, the results count did not change during their reformulation. In 5 cases, the reformulation was based on a narrowing of the time window; for example, a user who queried the term "iPad" from "3 weeks ago" to "3 days ago" reformulated it to "1 week ago" to "1 day ago". The remaining 11 cases of reformulation involved a widening of the time window, where users were querying over a broader period of time. These typically produced significantly larger results lists.

These time-based query reformulations may have occurred because we had provided an explicit interface for modifying the time window. Another interesting interaction, or lack thereof, was in relation to re-ranking: within the queries that users reformulated, not once did they re-rank first.
6. CONCLUSIONS
In this paper, we have presented a novel search and discovery platform that harnesses users' urge to share content on the real-time web as a basis for finding and indexing relevant content. We also explore other emerging themes in information discovery, such as curation as a means of selecting and editing the underlying structure of a system. As mentioned in the previous section, an initial user evaluation has shown the system to provide an engaging search interface that allows, among other things, easy query reformulation.
Presently we are formulating changes to the system based on the outcomes and observations of the evaluation described in this paper. We hope to explore further the potential of the contextual features that are extracted from Twitter and other social networks. Analyses of social graphs are commonly exploited in recommendation, and the unique architecture we have employed in Yokie will allow us to explore the use of these graphs further, especially in the context of user and item reputation scoring.
The curation system merits further exploration, particularly as a basis of evaluation compared with standard automatic neighborhood formation. One considerable experiment would involve the role and usefulness of curation in such a system, as compared with automated systems for content indexing. Techniques surrounding tag recommendation and content and query expansion would be starting avenues, as would topic detection using algorithms such as Latent Dirichlet Allocation to group items and relate them to a query based on topical similarity. Naturally, all of these proposed techniques would culminate in a larger-scale user trial involving many more participants, with a more focused agenda to explore each of these. Yokie is novel in that it is positioned at the union of many different fields of research, including IR, Recommender Systems, Social Search systems and social networks, to name but a few. As such, it has wide potential for users and research goals.
7. ACKNOWLEDGEMENTS
With sincere thanks to our evaluation participants. This
work is generously supported by Science Foundation Ireland
under Grant No. 07/CE/11147 CLARITY CSET.
8. REFERENCES
[1] Michael S. Bernstein, Bongwon Suh, Lichan Hong, Jilin Chen, Sanjay Kairam, and Ed H. Chi. Eddi: Interactive topic-based browsing of social status streams. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, UIST '10, pages 303–312, 2010.
[2] Oisı́n Boydell and Barry Smyth. Capturing
community search expertise for personalized web
search using snippet-indexes. In Proceedings of the
15th ACM international conference on Information
and knowledge management, CIKM ’06, pages
277–286, New York, NY, USA, 2006. ACM.
[3] Sergey Brin and Lawrence Page. The anatomy of a
large-scale hypertextual web search engine. Comput.
Netw. ISDN Syst., 30:107–117, April 1998.
[4] Jilin Chen, Rowan Nairn, Les Nelson, Michael
Bernstein, and Ed Chi. Short and tweet: experiments
on recommending content from information streams.
In Proceedings of the 28th international conference on
Human factors in computing systems, CHI ’10, pages
1185–1194, New York, NY, USA, 2010. ACM.
[5] Sandra Garcia Esparza, Michael P. O’Mahony, and
Barry Smyth. On the real-time web as a source of
recommendation knowledge. In RecSys 2010,
Barcelona, Spain, September 26-30 2010. ACM.
[6] Zoltán Gyöngyi, Hector Garcia-Molina, and Jan
Pedersen. Combating web spam with trustrank. In
VLDB ’04: Proceedings of the Thirtieth international
conference on Very large data bases, pages 576–587.
VLDB Endowment, 2004.
[7] John Hannon, Mike Bennett, and Barry Smyth.
Recommending twitter users to follow using content
and collaborative filtering approaches. In RecSys 2010:
Proceedings of the The 4th ACM Conference on
Recommender Systems, Barcelona, Spain, September
26-30 2010. ACM.
[8] Bernardo A. Huberman, Daniel M. Romero, and Fang
Wu. Social networks that matter: Twitter under the
microscope. SSRN eLibrary, 2008.
[9] Bernard J. Jansen, Danielle L. Booth, and Amanda
Spink. Patterns of query reformulation during web
searching. Journal of the American Society for
Information Science and Technology, 60(7):1358–1371,
July 2009.
[10] Bernard J. Jansen, Amanda Spink, and Vinish
Kathuria. How to define searching sessions on web
search engines. In Proceedings of the 8th Knowledge
discovery on the web international conference on
Advances in web mining and web usage analysis,
WebKDD’06, pages 92–109, Berlin, Heidelberg, 2007.
Springer-Verlag.
[11] Akshay Java, Xiaodan Song, Tim Finin, and Belle
Tseng. Why we twitter: understanding microblogging
usage and communities. In Proceedings of the Joint 9th
WEBKDD and 1st SNA-KDD Workshop, pages
56–65, 2007.
[12] Balachander Krishnamurthy, Phillipa Gill, and Martin
Arlitt. A few chirps about twitter. In WOSP ’08:
Proceedings of the first workshop on Online social
networks, pages 19–24, NY, USA, 2008. ACM.
[13] Haewoon Kwak, Changhyun Lee, Hosung Park, and
Sue Moon. What is twitter, a social network or a news
media? In WWW ’10, pages 591–600, 2010.
[14] Kevin McNally, Michael P. O’Mahony, Barry Smyth,
Maurice Coyle, and Peter Briggs. Towards a
reputation-based model of social web search. In
Proceedings of the 15th international conference on
Intelligent user interfaces, IUI ’10, pages 179–188,
New York, NY, USA, 2010. ACM.
[15] Owen Phelan, Kevin McCarthy, Mike Bennett, and
Barry Smyth. Terms of a feather: content-based news
recommendation and discovery using twitter. In
Proceedings of the 33rd European conference on
Advances in information retrieval, ECIR’11, pages
448–459, Berlin, Heidelberg, 2011. Springer-Verlag.
[16] Al M. Rashid, Kimberly Ling, Regina D. Tassone,
Paul Resnick, Robert Kraut, and John Riedl.
Motivating participation by displaying the value of
contribution. In Proceedings of the SIGCHI conference
on Human Factors in computing systems, CHI ’06,
pages 955–958, New York, NY, USA, 2006. ACM.
[17] Paul Resnick, Ko Kuwabara, Richard Zeckhauser, and
Eric Friedman. Reputation systems. Commun. ACM,
43:45–48, December 2000.
[18] Fabrizio Sebastiani. Machine learning in automated
text categorization. ACM Comput. Surv., 34:1–47,
March 2002.
Testing Collaborative Filtering against Co-Citation
Analysis and Bibliographic Coupling
for Academic Author Recommendation
Tamara Heck, Isabella Peters and Wolfgang G. Stock
Heinrich-Heine-University, Dept. of Information Science
D-40225 Düsseldorf, Germany
Tamara.Heck@hhu.de, Isabella.Peters@hhu.de, Stock@phil.hhu.de
ABSTRACT
Recommendation systems have become an important tool to overcome information overload and to help people make the right choice among needed items, which can be e.g. documents, products, tags or even other people. The last item type has aroused our interest: scientists are in need of different collaboration partners to work with, i.e. experts on a topic similar to their research field. Co-citation and bibliographic coupling have become standard measurements in scientometrics for detecting author similarity, but it can be laborious to gather these data accurately. As collaborative filtering (CF) has been shown to produce acceptable results in recommender systems, we investigate a comparison of scientometric analysis methods and CF methods. We use data from the social bookmarking service CiteULike as well as from the multi-discipline information services Web of Science and Scopus to recommend authors as potential collaborators for a target scientist. The paper aims to answer how a relevant author cluster for a target scientist can be proposed with CF and how the results differ in comparison with co-citation and bibliographic coupling. In this paper we show first results, to be complemented by an explicit user evaluation with the help of the target authors.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications –
Scientific databases. H.3.3 [Information Storage and
Retrieval]: Information Search and Retrieval – Information
filtering. H.3.5 [Information Storage and Retrieval]: Online
Information Services – Web-based services.
General Terms
Measurement, Experimentation, Human Factors, Management.
Keywords
Collaborative Filtering, Recommendation, Evaluation, Social Bookmarking, Personalization, Similarity Measurement, Bibliographic Coupling, Author Co-Citation, Social Tagging.
1. INTRODUCTION
An important task for knowledge management in academic
settings and in knowledge-intensive companies is to find the
"right" people who can work together to successfully solve a scientific or technological problem. This can either be a partner having the same skills and providing similar know-how, or someone with complementary skills to form a collaborative team. In both cases the research interests must be similar. Amongst other things, this interest can be identified from a person's scientific publications. As examples, we list some situations in which expert recommendations are very useful:
• compilation of a (formal) working group in a large university
department or company,
• compilation of researchers for preparing a project proposal for a
research grant (inside and outside the department and company),
• forming a Community of Practice (CoP), independent of institutional affiliation and following only shared interests,
• approaching colleagues in preparation for a congress, a panel or a workshop,
• asking colleagues for contributions to a handbook or a
specialized journal issue,
• finding appropriate co-authors.
It is very important for cooperation in science and technology that the reputation of the experts is established [15]. A recommendation service must not suggest just anybody who is possibly relevant, but has to check up on the expert's reputation. The reputation of a person in science and technology grows with her or his number of publications in peer-reviewed journals and with the citations of those publications [14]. So we are going to use academic information services, which store publication and citation data, as the basis for our author recommendation. Multi-discipline information services which allow publication and citation counts are Web of Science (WoS) and Scopus [34, 40, 41]. Additionally, our experimental expert recommendation also applies data from CiteULike, which is a social bookmarking service for academic literature [18, 22]. So we can consider not only the authors' perspectives (by tracking their publications, references and citations via WoS and Scopus), but also the perspectives of the readers (by tracking their bookmarks and tags via CiteULike) to recommend relevant partners. Our research questions are: 1) Can we propose a relevant author cluster for a target scientist with CF applying CiteULike data? 2) Are these results different from the results based on co-citation and bibliographic coupling?
Recommender systems (RS) nowadays use different methods and algorithms to recommend items, e.g. products, movies, music or articles, to a Web user. The aim is personalized recommendation [5], i.e. to get a list of items which are unknown to the target user and which he might be interested in. One problem is to find the best resources for user a and to rank them according to their
relevance [16]. Two approaches are normally distinguished (among other distinctive recommender methods and hybridizations): the content-based approach, which tries to identify similarities between items based on their content and on items positively rated by user a, and the collaborative filtering (CF) approach, which considers not only the ratings of user a but also the ratings of other users [a.o. 16, 20, 25, 35, 37, 42, 48]. One advantage of CF compared to the content-based method is that recommendations rely on the evaluation of other users and not only on the item's content, which can be inappropriate as a quality indication.
RS work with user ratings assigned to the items, also called the user-item response [16]: they can be scalar (e.g. 1-5 stars), binary (like/dislike) or unary, i.e. a user doesn't rate an item, but his purchase or access of the item is assumed to be a positive response. The latter user-item response can also be used for recommendations in social tagging systems (STS), e.g. social bookmarking systems like BibSonomy, CiteULike and Connotea [38]. STS have a folksonomy structure with user-resource-tag relations, which is the basis for CF. In STS not only recommendations of items are possible, but also recommendations of tags and users, which is the basis for our academic author recommendations. We apply approaches of CF to recommend potential collaboration partners to academic researchers. Hereby we ask whether CF in an STS recommends different results than the established scientometric measurements, author co-citation and bibliographic coupling of authors. In general these measurements are not explicitly used for recommendation, but rather for author and scientific network analysis [54].
2. RELATED WORK
RS can be constructed in many different ways, e.g. choosing the appropriate algorithm especially for personal recommendation [54], defining user interactions and user models [44], facing criteria like RS accuracy, efficiency and stability [16] and focusing on optimal RS learning models [47]. With the appearance of bookmarking and collaboration services on the Web, several algorithms and hybridizations have been developed [27]. They may differ in the combination of the considered relations between users, items and tags and in the weights used. Similarity fusion [59], for example, combines user- and item-based filtering (subcategories of CF) and additionally adds ratings of similar items by similar users. Cacheda et al. give an overview of different algorithms and compare the performance of the methods, also proposing a new algorithm which takes account of the users' positive or negative ratings of the items [11]. Bogers and van den Bosch compare three different collaborative filtering algorithms, two item-based and one user-based; the latter outperformed the others over a period of 37 months [8]. But the most evident problem seems to be the cold start, i.e. new items cannot be recommended at the beginning [2]. Said et al. are also concerned with the cold-start problem and the performance of different algorithms within a time span: adding tag similarity measures can improve the quality of item recommendation because tags offer more detailed information about items [50]. Hotho et al. propose FolkRank [27], a graph-based approach similar to the idea of PageRank, which can be applied in a system with a folksonomy structure like a bookmarking service. Here users, tags and resources are the nodes in the graph and the relations between them become the weighted edges, taking into account weight-spreading as PageRank does. In the current approach, similarity based on users and tags within CiteULike is measured separately. Using the relations between them, as is done in the FolkRank method, may lead to better recommendations. However, this method may not be applied to bibliographic coupling and author co-citation [see paragraph 3] without modifications.
Several papers investigate expert recommendation, mainly for business institutions [45, 46, 62]. Petry et al. developed the expert recommendation system ICARE, which recommends experts in an organization. There the focus doesn't lie on an author's publications and citations, but for example on his organizational level, his availability and his reputation [45]. Reichling and Wulf investigated a recommender system for a European industrial association supporting its knowledge management, preceded by a field study and interviews with the employees. Experts were defined according to their collections of written documents, which were automatically analyzed. Additionally, a post-integrated user profile with information about their background and job is used [46]. Using user profiles in bookmarking services could also be helpful to provide further information about a user's interests and improve user recommendation, which could be a new research approach worth investigating. However, this approach has serious problems with privacy and data security on the Web.
Apart from people recommendation for commercial companies [a.o. 12, 51], other approaches concentrate on Web 2.0 users and academics. Au Yeung et al., using the non-academic bookmarking system Del.icio.us, define an expert user as someone who has high-quality documents in his bookmark collection (many other users with high expertise have them in their collections) and who tends to identify useful documents before other users do (according to the timestamp of a bookmark) [3]. In their comparison, their SPEAR algorithm is better at finding such experts than the HITS algorithm, which is used for link structure analysis. Compared to the current approach, the "high-quality documents" in this experiment are the publications of our target author, i.e. a user who has bookmarked one of these publications is important for our user-based recommendation (see paragraph 3). A weighted approach like that of Au Yeung et al., who weighted a user's bookmarks according to their quality, could also be interesting to test. Blazek focuses on expert recommendation of sets of articles for a "Domain Novice Researcher", i.e. for example new academics who enter a new domain using a collection of academic documents [7]. A main aspect hereby is again the cold-start problem: citation analysis can hardly be applied to novice researchers, as long as there are no or only few references and citations. Therefore in the current approach only target authors were chosen who have published at least five articles in the last five years. Blazek understands his expert recommendation mainly as a recommendation of relevant documents. Heck and Peters propose to use social bookmarking systems for scientific literature such as BibSonomy, CiteULike and Connotea to recommend researchers who are unknown to the target researcher but share the same interests and are therefore potential cooperation partners for building CoPs [24]. Users are recommended when they have either common bookmarks or common tags, a method founded on the idea of CF. A condition is that the researcher who should get relevant expert recommendations must be active in the social bookmarking system and put his relevant literature into his internet library. In this project, besides the additional comparison of CF against co-citation and bibliographic coupling, we avoid the problem of the "active researcher", i.e. we look at the users in CiteULike who have bookmarked our target researcher's publications. Therefore the recommendation doesn't depend on the target scientist himself, which would be based on his bookmarks and assigned tags, but on the bookmarking users and their collaborative filtering. The approach of Cabanac is similar to [24], but he concentrates only on user similarity networks and
relevant articles, not on the recommendation of unknown researchers [10]. He uses the concepts of Ben Jabeur et al. to build a social network for recommending relevant literature [4]. The following entities can be used: Co-authorship, Authorship, Citation, Reference, Bookmarking, Tagging, Annotation and Friendship. Additionally, Cabanac adds social clues like the connectivity of researchers and meeting opportunities at scientific conferences. According to him these social clues lead to a better performance of the recommendation system. Both approaches [4, 10] aim to build a social network to show researchers' connectivity to each other. In this project co-authorship, for example, is not important, as we try to recommend unknown researchers or academics whom our target author does not have in mind. Zanardi and Capra, proposing a "Social Ranking", calculate similarity between users based on shared tags, and between tag pairs based on the bookmarks they both describe [63]. The tag similarity is compared with a user's query tag; both user and tag similarity are then combined. Their results show that user similarity improves accuracy whereas tag similarity improves coverage.
Another important aspect of RS is their evaluation. RS should not only prove accuracy and efficiency, but also usefulness for the users [26]. The users' needs must be detected to make the best recommendations. Besides RS evaluation based on models [29], some papers investigate user evaluation [39]. McNee et al. show recommender pitfalls to assure users' acceptance and growing usage of recommenders as knowledge management tools. This is also one of our main concerns in this paper, because we want to recommend potential collaboration partners to our target scientists; they have to judge whether the recommended people are useful for their scientific work.
3. MODELING RECOMMENDATION
3.1 Similarity Algorithm
The most common similarity measures in Information Science are Cosine, Dice and Jaccard-Sneath [1, 2, 31, 33, 49, 57]. The last two are similar and come to similar results [17]. Additionally, Hamers et al. proved that, for citation measurements, similarity values with the cosine coefficient are twice as high as those the Jaccard coefficient yields [21]. According to van Eck and Waltman the most popular similarity measures are the association strength, the cosine, the inclusion index and the Jaccard index [58]. In our comparative experiment we make use of the cosine. Our own experiences [23] and results from the literature [49] show that the cosine works well, but in later project steps we want to extend the similarity measures to Dice and Jaccard-Sneath as well.
3.2 Collaborative Filtering Using Bookmarks
and Tags in CiteULike
Social bookmarking systems like BibSonomy, Connotea and CiteULike have become very popular [36]: unlike bookmarking systems like Del.icio.us, they focus on academic literature management. The basis for social recommendation are their folksonomies. A folksonomy [38, 43] is defined as a tuple F := (U, T, R, Y), where U, T and R are finite sets whose elements are usernames, tags and resources, and Y is a ternary relation between them, Y ⊆ U × T × R, whose elements are called tag actions or assignments. The tripartite structure allows matching users, resources or tags which are similar to each other. CF uses data of the users in a system to measure similarity [16]. To get a 2-dimensional matrix for applying traditional CF, which is not possible in the ternary relation Y, one could split F into three 2-dimensional subsets: the docsonomy DF := (T, R, Z) where Z ⊆ T × R, the personomy PUT := (U, T, X) where X ⊆ U × T, and the user-resource relation, which in our case we call the personal bookmark list (PBL): PBLUR := (U, R, W) where W ⊆ U × R.
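A small sketch of this splitting, representing Y as a set of (user, tag, resource) triples; the function name is ours, but the three projections mirror the definitions above:

def split_folksonomy(Y):
    """Project the ternary relation Y ⊆ U x T x R onto the three
    2-dimensional subsets used for traditional CF."""
    docsonomy = {(t, r) for (u, t, r) in Y}  # Z ⊆ T x R
    personomy = {(u, t) for (u, t, r) in Y}  # X ⊆ U x T
    pbl       = {(u, r) for (u, t, r) in Y}  # W ⊆ U x R
    return docsonomy, personomy, pbl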
In our experimental comparison we want to cluster scientific authors who have similar research interests. The scientometric analyses are co-citation and bibliographic coupling, which we compare with data from CiteULike using CF. We are therefore not interested in the CiteULike users themselves, but in the tags and bookmarks they connect with our target author, i.e. the bookmarked papers which our target author published. We set Ra for all bookmarked articles which our target author a published and Ta for all tags which are assigned to those articles. To set our database for the author similarity measure we have two possible methods:

1. We search for all users u ∈ U who have at least one article of the target author a in their bookmark list: PBLURa := (U, Ra, W) where W ⊆ U × Ra.

2. We search for all documents to which users assigned the same tags as to the target author a's articles: DFa := (Ta, R, Z) where Z ⊆ Ta × R.
The disadvantage of the first method, in our case, is the small number of users; it can be difficult to rely only on these users for identifying similarity [30]. Therefore we use the second method: resources (here: scientific papers) can be supposed similar if the same tags have been assigned to them. Our assumption is that the authors of these documents are also similar, because users describe their papers with the same keywords. Tags show topical relations, and authors with thematic relations concerning their research field are potential collaboration partners. Additionally, the more common tags two documents have, the more similar they are. In some cases very general tags like "nanotube" and "spectroscopy" were assigned to our target authors' articles, so we decided to set a minimum number of unique tags a document must have in common with a target author's document:

DFa := (Ta, R, Z) where Z ⊆ Ta × R and {r ∈ Ta × R with |Ta| ≥ 2}   (1)
On this database we measure author similarity in two different ways: (A) based on common tags t assigned to the authors' documents by users; (B) based on common users u. We use the cosine coefficient as explained above:

a) sim(a, b) := |Ta ∩ Tb| / √(|Ta| · |Tb|)     b) sim(a, b) := |Ua ∩ Ub| / √(|Ua| · |Ub|)   (2)
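In code, both variants of Eq. 2 reduce to the same cosine over sets; a minimal sketch, where the caller passes the tag sets Ta and Tb (variant a) or the user sets Ua and Ub (variant b):

from math import sqrt

def cosine_sim(set_a, set_b):
    """Cosine coefficient of Eq. 2: |A ∩ B| / sqrt(|A| * |B|)."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / sqrt(len(set_a) * len(set_b))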
Note that the latter method leads to different results than applying the proposed first method for database modeling. If we applied the first method, we would find all users who have at least one document of target author a in their bookmark list. With the second method, we get all users who have at least one document in their bookmark list which is similar to any of target author a's articles, i.e. users who bookmarked a document of a may be left out. As we want to apply one unique dataset for the author similarity measure, we do not merge both methods, but measure tag-based and user-based similarity in the dataset described above. Nevertheless, where no tags were available, we chose the first method (see paragraph 5).
3.3 Author Co-Citation and Bibliographic
Coupling of Authors
There are four relations between two authors concerning their publications, references and citations: co-authorship, direct citation, bibliographic coupling of authors and author co-citation. The first two relationships are not appropriate for our problem, for here it is certain that one author knows the other: of course, one knows one's co-authors, and we can assume that an author knows whom she has cited. Our goal is to recommend unknown scientists. Bibliographic coupling (BC) [28] and co-citation (CC) [55] are undirected weighted linkages of two papers, calculated through the fraction of shared references (BC) or co-citations (CC). We aggregate the data from the document level to the author level. Bibliographic coupling of authors means that two authors a and b are linked if they cite the same authors in their references. We mine data about bibliographic coupling of authors using WoS, for this information service allows searches for "related records", where the relation is calculated by the number of references a certain document has in common with the source article [13, 56]. Our assumption is: two authors who have two documents with a high number of shared references are more similar than two authors who have the same number of shared references spread over many documents, i.e. the number of shared references per document is important.
Consider authors a, b and c:

if sim(a, b) > sim(a, c)   (3)

then |Refa ∩ Refb| / |Da ∪ Db| > |Refa ∩ Refc| / |Da ∪ Dc|   (4)

where Ref is the set of references of an author and D the set of documents of an author {d ∈ D × Refa}. For example: author a has 6 references in common with authors b and c. These 6 common references are found in two unique documents of author a and of author b respectively, but in 6 unique documents of author c, i.e.:

6 / (2 + 2) > 6 / (2 + 6)   (5)
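The per-document normalisation of Eq. 4 and the worked example of Eq. 5 can be sketched directly; refs and docs are assumed mappings from each author to his or her reference set and document set:

def bc_similarity(refs, docs, a, b):
    """Shared references normalised by the combined document count (Eq. 4)."""
    shared = len(refs[a] & refs[b])
    return shared / len(docs[a] | docs[b])

# Eq. 5: authors a and b share 6 references across 2 documents each,
# giving 6/(2+2) = 1.5; author c spreads the same 6 shared references
# over 6 documents, giving 6/(2+6) = 0.75, so sim(a, b) > sim(a, c).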
Therefore it can be said that authors a and b are similar if their documents have similar reference lists. Our assumption leads to the following dataset model for BC, where we take all authors of related documents with at least n common references with any of the target author's publications, where n may vary in different cases:

BC := (Refd(a), D, S) where S ⊆ Refd(a) × D and {d ∈ D : |Refd(a)| ≥ n, n ∈ ℕ}   (6)

where Refd(a) is the set of references in one document d of target author a. Unique authors of the dataset are accumulated; the list of the generated authors of the related documents is cut at m ∈ ℕ unique authors (m > 30), because their publications and references for BC have to be analyzed manually in WoS. For these related authors we measure similarity with the cosine (Eq. 2a), where T is substituted with H, and Ha is the number of unique references of target author a and Hb the number of references of author b.
Author co-citation (ACC) [32, 52, 53, 60, 61] means that two authors a and b are linked if they are cited in the same documents. ACC is then measured with the cosine (Eq. 2a), where T is substituted with J, and Ja is the number of unique citing articles which cite target author a and Jb the number of unique citing articles which cite author b. To mine the author co-citation data it is not possible to work with WoS, for in the references section of a bibliographic entry there is only the first author of the cited documents and not, as is needed, a declaration of all authors [65]. Therefore we mine those data from Scopus, where we can find more than one author of the cited literature. We perform an inclusive all-author co-citation, i.e. two authors are considered co-cited when a paper they co-authored is cited [64]. The dataset is based on the documents which cite at least one of the target author's articles in Scopus:

ACC := (D, Ca, Q) where Q ⊆ D × Ca with |Q| > 0   (7)
where Ca is the set of cited articles of target author a. The list of
potential similar authors is cut at m 𝜖 ℕ unique authors (m > 30)
because their publications for ACC have to be analyzed manually
in Scopus. According to the research literature, the combination of both methods, BC and ACC, performs best in representing research activities [6, 9, 19]. Applying the four proposed
mined datasets and similarity approaches we can assemble four
different sets of potential similar authors, which we call clusters.
One cluster is based on BC in WoS, one cluster is based on ACC
in Scopus, one cluster is based on common users in CiteULike
and one cluster is based on common tags in CiteULike. We can
now analyze the authors who are most similar to our target author
according to the cosine coefficient and evaluate the results.
Additionally, based on the mined datasets we can also measure the similarity between all authors of a cluster. These results are shown in visualizations, which we call graphs. Therefore, for each cluster a visualized graph exists, which will also be evaluated.
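As an illustration of this step, the following minimal sketch (our own, with assumed data shapes) collects the pairwise cosine similarities between all authors of a cluster as a weighted edge list, optionally thresholded, which could then be imported into a visualization tool such as Gephi.

import itertools
import math

def cosine(sa: set, sb: set) -> float:
    # Set-based cosine coefficient, as in the sketch above
    return len(sa & sb) / math.sqrt(len(sa) * len(sb)) if sa and sb else 0.0

def build_edge_list(author_items: dict, threshold: float = 0.0) -> list:
    """Weighted edges (author1, author2, cosine) between all cluster authors.

    author_items maps an author name to the set of items (references,
    citing articles, users or tags) the similarity is based on."""
    edges = []
    for (a, sa), (b, sb) in itertools.combinations(author_items.items(), 2):
        w = cosine(sa, sb)
        if w > threshold:
            edges.append((a, b, w))
    return edges

cluster = {"Author A": {1, 2, 3}, "Author B": {2, 3, 4}, "Author C": {7, 8}}
print(build_edge_list(cluster, threshold=0.2))  # only the A-B edge survives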
4. DATASET LIMITATIONS
While filtering the information in the three information services, different problems arise, which we point out briefly because the recommendation results highly depend on the source
dataset. In Scopus we detected differences in the metadata: An
identical article may appear in different ways, i.e. for example
title and authors may be complete in one reference list of an
article, but incomplete in a reference list of another article. In our
case, several co-authors in the dataset are missing and could not be
considered for co-citation. The completeness of co-authorship
highly varies: In a random sample, where the co-citation dataset is
adjusted with data of the Scopus website, five of 14 authors have
a complete coverage, three of them have coverage between 70 and
90 %, five between 55 and 70 % and one author only has coverage
of about 33 %. In the information services there is the problem of
homonymy concerning author names. Additionally in CiteULike
users also misspell author names, which we rechecked for our
dataset. The id-number for an author in Scopus is practical for
identification, but it may also fail when two or more authors with
the same name are allocated to the same research field and change
their working place several times. In WoS we don’t have an
author-id, and it is more difficult to distinguish a single person.
Therefore we check the filtered author’s document list and if
necessary correct it based on the articles’ subject area.
5. EXPERIMENTAL RESULTS
We cooperated with physicists of the Forschungszentrum Jülich and have worked with 6 researchers so far. For each of the 6 target academic authors (35-50 years old) we built individual clusters with authors who are supposed to be similar to them. We limited the source data for the dataset modeling to the authors' publications between 2006 and 2011 to make recommendations based on the current research interests of the physicists. To summarize, each scientist received
the following four clusters: 1. Based on author co-citation (COCI)
in Scopus, 2. Based on bibliographic coupling (BICO) in WoS, 3.
Based on common users in CiteULike (CULU) and 4. Based on
common tags in CiteULike (CULT). Based on the cosine
similarity we are also able to show graphs of all four clusters
using the cosine coefficient for similarity measure between all
authors (e.g. Fig. 1 and Fig. 2). We applied the software Gephi¹ for the cluster visualization. The nodes (= author names) are sized according to their connections, the edges according to the cosine weight. Consider that the CiteULike graphs are much bigger because all related authors are taken into account.
¹ http://gephi.org/
Figure 1. Extract of a CULT graph, circle = target author, cosine interval 0.99-0.49.
Figure 2. BICO graph, circle = target author, cosine threshold 0.2.
To get a clear graph arrangement for a better evaluation, we set thresholds
based on the cosine coefficient when needed. Additionally we left
out author-pairs with a similarity of 1 if they had only one user or
tag (in the CiteULike dataset) in common because this would
distort the results.
We briefly summarize the most interesting answers from part one (see Section 6): As
confirmed in our earlier studies [24] most of the physicists work
in research teams, i.e. they collaborate in small groups (in general
not more than 5 people). The choice of people for possible collaboration highly depends on their research interest: there must be a high thematic overlap. On the other hand, if the overlap is too high, it can be disadvantageous. Some interviewees who rated a similar author in a cluster as important stated that they would not cooperate with that author because he does exactly the same research, i.e. he is important for their own work but is rather a competitor. Another stated reason against collaboration was too little thematic overlap. Successful collaborations with international institutes are sought. In general
our interviewees meet new colleagues at conferences and
scientific workshops.
While modeling the datasets we found that one of the six authors did not have any users who bookmarked any of his articles in CiteULike. Some articles were found, but they had been added to the system by the CiteULike operators themselves, so the CiteULike clusters could not be modeled for this scientist. One
researcher’s articles were bookmarked, but not tagged. In this
case, we used method 1 in 3.2 to model the dataset. In all four
clusters we ranked the similar authors with the cosine. In general it can be seen that the cosine coefficient for BC is very low compared to the one for ACC and the similarity measures in CiteULike. This is because some authors have a lot of references, which minimizes similarity. Additionally, similarity is
comparatively very high for measurements in CiteULike because
the number of users and assigned tags related to the target
authors’ publications was relatively low.
6. EVALUATION
To validate our experimental results we had our 6 target physicists evaluate the clusters as well as the graphs. The evaluation is
divided into three parts. Part one is a semi-structured interview with questions about the scientist's research behavior and the procurement of relevant literature, as well as his working behavior, i.e. does he work in teams and with whom does he cooperate? These questions should give a picture of the scientist's work and help to interpret the following evaluation results. In the
second part the target author has to rank the proposed similar
authors according to their relevance. For this, the top ten authors of all four measurements are listed in alphabetical order (co-authors eliminated). The interviewee is asked whether he knows the
proposed authors, how important these authors are for his research
(rating from not important (1) to very important (10)), with whom
he would cooperate and which important authors he misses.
In part three our author has to evaluate the cluster graphs (rating
from 1 to 10) according to the distribution of the authors and the
generated groups. Here the questions are: 1. In your individual assessment, does the distribution of the authors reflect the reality of the research community and its collaborations? 2. Are there any other important authors
you didn’t remember before? 3. Would this graph, i.e. the
recommendation of similar authors, help you e.g. to organize a
workshop or find collaboration partners?
Figure 3. Coverage of important authors in the recommendation of the Top 20 authors (bar chart per cluster: COCI, BICO, CULU, CULT; y-axis 0-100 %).
Part two of the evaluation is concerned with the similar author
ranking. We analyze all authors an interviewee claimed important
with at least a rating of 5 and all important authors, which the
researcher additionally added and which were not on the Top 10
list of any cluster. In general our target authors have up to 30
people they claim most important for their recent scientific work.
Figure 3 shows the coverage of these important authors for the
first 20 ranks based on the cosine (consider author 6 didn’t have
any publication bookmarked in CiteULike). For example
concerning target author 1: 30 % of the 20 most similar authors of
the co-citation cluster (COCI) are claimed important. In the
bibliographic coupling cluster (BICO) it is 20 %, in the CiteULike
cluster based on users (CULU) 30 % and in the CiteULike cluster
based on tags 25 %. Compared to the other target authors there are
great differences. The BICO and COCI clusters can be said to provide the best results except for authors 1 and 5. Concerning the
CiteULike clusters, they are slightly worse, but not in all cases: for author 1, CULU provides the same coverage as COCI, and both CULU and CULT are better than BICO. For author 5 (no CULT because no tags were assigned), CULU has full coverage, which means that all 20 top authors ranked by the cosine are claimed important by the target author. Beyond the coverage results shown in Figure 3, it is interesting to look at the important authors
who were only found in the CiteULike clusters: e.g. 6 of the 29
important authors of target author 1 are only in the CiteULike
cluster, the same applies to 5 of 19 important authors of target
author 2.
The great differences may also depend on the interviewees' recent research activities: some of the physicists said that they had slightly changed their research interest. Similar authors who were important in the past are not important nowadays. One problem with our applied similarity measure may be that it is based on past data, i.e. publications of the last five years. The authors the interviewees marked as important are important for recent research. Had we considered all authors who are or have been important, the results for the clusters would have been better.
In the third part of the evaluation the interviewee had to evaluate
the graphs. The average cluster relevance ratings (based on six target users) are 5.08 for COCI, 8.7 for BICO, 2.13 for CULU and 5.25 for CULT. Note that only four authors had publications and tags in CiteULike to be analyzed. For author 5, for whom we applied
method 1 (see 3.2) in case of missing tags, no CiteULike graph
could be modeled because only one user bookmarked his articles
and we measured author similarity only on the numbers of authors
this user had in his literature list. Two authors claimed BICO and
CULT to be very relevant and proposed to combine these two to
get all important authors and relevant research communities. In
BICO and COCI some interviewees missed important authors.
Two of the interviewees stated that the authors in BICO and
COCI are too obvious to be similar and were interested in bigger
graphs with new potential interesting colleagues. A combined
cluster could help them to find researcher groups, partners for
cooperation and it would be supportive to intensify relationships
among colleagues. Looking at the graphs, almost all target authors recollected important colleagues who had not come to mind at first, which they found very helpful. They stated that bigger graphs like CULT show more unknown and possibly relevant people. However, to give a clear statement about the similar researchers who were unknown to the target user, the interviewee would have had to look at these researchers' publications.
Assumptions can be made that if an unknown person is clearly
connected to a known relevant researcher group, this person
would do similar relevant work. As the interviewees stated that the distribution of the researchers is shown correctly, it is likely, but not explicitly proven, that the unknown scientists are also allocated correctly within the graph.
An important factor for all interviewees is a clear cluster
arrangement. A problem which may concern CUL clusters is the
sparse dataset, i.e. if only few tags were assigned to one author’s
publications or only one user bookmarked them, the cluster cannot show highly distinguishable communities. That was the case with authors 2 and 5. Author 2 gave worse ratings to the CUL graphs
because they didn’t show clear distributions and author groups.
Further categorizations of authors, e.g. via tags or author
keywords, might help to classify scientists’ work.
7. DISCUSSION
In our project we analyzed academic author recommendation
based on different author relations in three information services.
We combined two classical approaches (co-citation and
bibliographic coupling) with collaborative filtering methods. First
results and the evaluation show that the combination of different
methods leads to the best results. Similarity based on users and
assigned tags of an online bookmarking system may complement
co-citation and bibliographic coupling. For some target authors, more important similar authors were found in CiteULike than in Scopus or WoS. The interviewees approved this assumption with
the graph relevance ranking. They and other researchers in former
studies confirm that there is a need for author recommendation:
Many physicists do not work alone, but in project teams. The
cooperation with colleagues of the same research field is essential.
A recommender system could support them. Our paper shows a
new approach to recommend relevant collaboration colleagues for
scientific authors. The challenge will be to combine the different
similarity approaches. One method is the simple summation of the
cosine values. The cumulated cosine values provide better ranking
results for some relevant researchers, but they are not satisfactory.
Further investigations will be made in a weighted algorithm which
considers the results of all four clusters. The relations between
user- and tag-based similarity in a bookmarking system should
also be considered and tested, e.g. with a graph based approach
like FolkRank [27] or expertise analysis (SPEAR) [3]. Besides this, the paper did not study important aspects of a running recommender system like accuracy and efficiency. Research has to be done in these fields. An issue which may also be addressed is social network analysis and graph construction.
8. ACKNOWLEDGMENTS
Tamara Heck is financed by a grant of the Strategische
Forschungsfonds of the Heinrich-Heine-University Düsseldorf.
We would like to thank Oliver Hanraths for data mining, Stefanie
Haustein for valuable discussions and the physicists of
Forschungszentrum Jülich for participating in the evaluation.
9. REFERENCES
[1] Ahlgren, P., Jarneving, B. and Rousseau, R. 2003.
Requirements for a cocitation similarity measure, with
special reference to Pearson’s correlation coefficient. Journal
of the American Society for Information Science and
Technology, 54(6), 550-560.
[2] Ahn, H. J. 2008. A new similarity measure for collaborative
filtering to alleviate the new user cold-starting problem.
Information Sciences, 178, 37-51.
[3] Au Yeung, C. M., Noll, M., Gibbins, N., Meinel, C. and
Shadbolt, N. 2009. On measuring expertise in collaborative
tagging systems. Web Science Conference: Society On-Line,
18th-20th March 2009, Athens, Greece.
[4] Ben Jabeur, L., Tamine, L. and Boughanem, M. 2010. A
social model for literature access: towards a weighted social
network of authors. Proceedings of RIAO '10 International
Conference on Adaptivity, Personalization and Fusion of
Heterogeneous Information. Paris, France, 32-39.
[5] Berkovsky, S., Kuflik, T. and Ricci F. 2007. Mediation of
user models for enhanced personalization in recommender
systems. User Model User-Adap Inter, 18, 245-286.
[6] Bichteler, J. and Eaton, E. A. 1980. The combined use of
bibliographic coupling and cocitation for document-retrieval.
Journal of the American Society for Information Science,
31(4), 278-282.
[7] Blazek, R. 2007. Author-Statement Citation Analysis
Applied as a Recommender System to Support Non-Domain-Expert Academic Research. Doctoral Dissertation. Fort
Lauderdale, FL: Nova Southeastern University.
[8] Bogers, T. and van den Bosch, A. 2008. Recommending
scientific articles using CiteULike. Proceedings of the 2008
ACM Conference on Recommender Systems. New York,
NY, 287-290.
[9] Boyack, K. W. and Klavans, R. 2010. Co-citation analysis,
bibliographic coupling, and direct citation. Which citation
approach represents the research front most accurately?
Journal of the American Society for Information Science and
Technology, 61(12), 2389-2404.
[25] Herlocker, J. L., Konstan, J. A., Borchers, A. and Riedl, J.
1999. An algorithmic framework for performing
collaborative filtering. Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval. New York, 230-237.
[10] Cabanac, G. 2010. Accuracy of inter-researcher similarity
measures based on topical and social clues. Scientometrics,
87(3), 597-620.
[26] Herlocker, J. L., Konstan, J. A., Terveen L. G. and Riedl, J.
T. 2004. Evaluating collaborative filtering recommender
systems. ACM Transactions on Information Systems, 22(1),
5-53.
[11] Cacheda, F., Carneiro, V., Fernández, D. and Formoso, V.
2011. Comparison of collaborative filtering algorithms:
Limitations of current techniques and proposals for scalable,
high-performance recommender systems. ACM Transactions
on the Web, 5(1), article 2.
[27] Hotho, A., Jäschke, R., Schmitz, C. and Stumme, G. 2006.
Information retrieval in folksonomies: Search and ranking
(pp. 411-426). In Sure, Y., Domingue, J. (Eds.), The
Semantic Web: Research and Applications, Lecture Notes in
Computer Science 4011, Springer, Heidelberg.
[12] Cai, X., Bain, M., Krzywicki, A., Wobcke, W., Kim, Y. S.,
Compton, P. and Mahidadia, A. 2011. Collaborative filtering
for people to people recommendation in social networks.
Lecture Notes in Computer Science, 6464, 476-485.
[28] Kessler, M. M. 1963. Bibliographic coupling between
scientific papers. American Documentation, 14, 10-25.
[13] Cawkell, T. 2000. Methods of information retrieval using
Web of Science. Pulmonary hypertension as a subject
example. Journal of Information Science, 26(1), 66-70.
[14] Cronin, B. 1984. The Citation Process. The Role and
Significance of Citations in Scientific Communication.
London, UK: Taylor Graham.
[15] Cruz, C. C. P., Motta, C. L. R., Santoro, F. M. and Elia, M.
2009. Applying reputation mechanisms in communities of
practice. A case study. Journal of Universal Computer
Science, 15(9), 1886-1906.
[16] Desrosiers, C. and Karypis, G. 2011. A comprehensive
survey of neighborhood-based recommendation methods (pp.
107-144). In Ricci, F., Rokach, L., Shapira, B. and Kantor,
P.B (Eds.), Recommender Systems Handbook. Springer, NY.
[17] Egghe, L. 2010. Good properties of similarity measures and
their complementarity. Journal of the American Society for
Information Science and Technology, 61(10), 2151-2160.
[18] Emamy, K. and Cameron, R. 2007. CiteULike. A
researcher’s bookmarking service. Ariadne, 51.
[19] Gmur, M. 2003. Co-citation analysis and the search for
invisible colleges. A methodological evaluation.
Scientometrics, 57(1), 27-57.
[20] Goldberg, D., Nichols, D., Oki, B. M. and Terry, D. 1992.
Using collaborative filtering to weave an information
tapestry. Communications of the ACM, 35(12), 61-70.
[21] Hamers, L., Hemeryck, Y., Herweyers, G. and Janssen, M.
1989. Similarity measures in scientometric research: The
Jaccard Index versus Salton’s cosine formula. Information
Processing & Management, 25(3), 315-318.
[22] Haustein, S. and Siebenlist, T. 2011. Applying social
bookmarking data to evaluate journal usage. Journal of
Informetrics, 5, 446-457.
[23] Heck, T. 2011. A comparison of different user-similarity
measures as basis for research and scientific cooperation.
Information Science and Social Media International
Conference August 24-26, Åbo/Turku, Finland.
[24] Heck, T. and Peters, I. 2010. Expert recommender systems: Establishing Communities of Practice based on social bookmarking systems. In Proceedings of I-Know 2010, 10th International Conference on Knowledge Management and Knowledge Technologies, 458-464.
[29] Krohn-Grimberghe, A., Nanopoulos, A. and Schmidt-Thieme, L. 2010. A novel multidimensional framework for
evaluating recommender systems. In Proceedings of the
ACM RecSys 2010 Workshop on User-Centric Evaluation of
Recommender Systems and Their Interfaces (UCERSTI).
New York, NY, ACM.
[30] Lee, D. H. and Brusilovsky, P. 2010a. Social networks and
interest similarity. The case of CiteULike. In Proceedings of
the 21st ACM Conference on Hypertext & Hypermedia,
Toronto, Canada, 151-155.
[31] Lee, D. H. and Brusilovsky, P. 2010b. Using self-defined
group activities for improving recommendations in
collaborative tagging systems. In Proceedings of the Fourth
ACM Conference on Recommender Systems. NY, 221-224.
[32] Leydesdorff, L. 2005. Similarity measures, author cocitation
analysis, and information theory. Journal of the American
Society for Information Science and Technology, 56(7), 769-772.
[33] Leydesdorff, L. 2008. On the normalization and visualization
of author co-citation data. Salton’s cosine versus the Jaccard
index. Journal of the American Society for Information
Science and Technology, 59(1), 77-85.
[34] Li, J., Burnham, J. F., Lemley, T. and Britton, R. M. 2010.
Citation analysis. Comparison of Web of Science, Scopus,
SciFinder, and Google Scholar. Journal of Electronic
Resources in Medical Libraries, 7(3), 196-217.
[35] Liang, H., Xu, Y., Li, Y. and Nayak, R. 2008. Collaborative
filtering recommender systems using tag information. ACM
International Conference on Web Intelligence and Intelligent
Agent Technology. New York, NY, 59-62.
[36] Linde, F. and Stock, W.G. 2011. Information Markets.
Berlin, Germany, New York, NY: De Gruyter Saur.
[37] Luo, H., Niu, C., Shen, R. and Ullrich, C. 2008. A
collaborative filtering framework based on both local user
similarity and global user similarity. Machine Learning,
72(3), 231-245.
[38] Marinho, L. B., Nanopoulos, A., Schmidt-Thieme, L.,
Jäschke, R., Hotho, A., Stumme, G. and Symeonidis, P.
2011. Social tagging recommenders systems (pp. 615-644).
In Ricci, F., Rokach, L., Shapira, B. and Kantor, P.B (Eds.),
Recommender Systems Handbook. Springer, NY.
[39] McNee, S. M., Kapoor, N. and Konstan, J.A. 2006. Don’t
look stupid. Avoiding pitfalls when recommending research
papers. In Proc. of the 20th anniversary Conference on
Computer Supported Cooperative Work. New York, NY,
ACM, 171-180.
[52] Schneider, J.W. and Borlund, P. 2007a. Matrix Comparison,
Part 1: Motivation and important issues for measuring the
resemblance between proximity measures or ordination
results. Journal of the American Society for Information
Science and Technology, 58(11), 1586-1595.
[40] Meho, L. I. and Rogers, Y. 2008. Citation counting, citation
ranking, and h-index of human-computer interaction
researchers. A comparison of Scopus and Web of Science.
Journal of the American Society for Information Science and
Technology, 59(11), 1711-1726.
[53] Schneider, J. W. and Borlund, P. 2007b. Matrix Comparison,
Part 2: Measuring the resemblance between proximity
measures or ordination results by use of the Mantel and
Procrustes statistics. Journal of the American Society for
Information Science and Technology, 58(11), 1596-1609.
[41] Meho, L. I. and Sugimoto, C. R. 2009. Assessing the
scholarly impact of information studies. A tale of two
citation databases – Scopus and Web of Science. Journal of
the American Society for Information Science and
Technology, 60(12), 2499-2508.
[54] Shepitsen, A., Gemmell, J., Mobasher, B. and Burke, R.
2008. Personalized recommendation in social tagging
systems using hierarchical clustering. In Proc. of the 2008
ACM Conference on Recommender Systems. NY, 259-266.
[42] Parra, D. and Brusilovsky, P. 2009. Collaborative filtering
for social tagging systems. An Experiment with CiteULike.
In Proc. of the Third ACM Conference on Recommender
Systems. New York, NY, ACM, 237-240.
[43] Peters, I. 2009. Folksonomies. Indexing and Retrieval in
Web 2.0. Berlin, Germany: De Gruyter Saur.
[44] Ramezani, M., Bergman, L., Thompson, R., Burke, R. and
Mobasher, B. 2008. Selecting and applying recommendation
technology. In Proc. of International Workshop on
Recommendation and Collaboration, in Conjunction with
2008 International ACM Conference on Intelligent User
Interfaces. Canaria, Canary Islands, Spain.
[45] Petry, H., Tedesco, P., Vieira, V. and Salgado, A. C. 2008.
ICARE. A context-sensitive expert recommendation system.
In The 18th European Conference on Artificial Intelligence.
Workshop on Recommender Systems. Patras, Greece, 53-58.
[46] Reichling, T. and Wulf, V. 2009. Expert recommender
systems in practice. Evaluating semi-automatic profile
generation. CHI ’09. In Proc. of the 27th International
Conference on Human Factors in Computing Systems. New
York, NY, 59-68.
[47] Rendle, S., Marinho, L. B., Nanopoulos, A. and Schmidt-Thieme, L. 2009. Learning optimal ranking with tensor
factorization for tag recommendation. In Proceedings of the
15th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. New York, NY, 727-736.
[55] Small, H. 1973. Cocitation in scientific literature. New
measure of relationship between 2 documents. Journal of the
American Society for Information Science, 24(4), 265-269.
[56] Stock, W. G. 1999. Web of Science. Ein Netz
wissenschaftlicher Informationen – gesponnen aus Fußnoten
[Web of Science. A web of scientific information – cocooned
from footnotes]. Password, no. 7+8, 21-25.
[57] Van Eck, N. J. and Waltman, L. 2008. Appropriate similarity
measures for author co-citation analysis. Journal of the
American Society for Information Science and Technology,
59(10), 1653-1661.
[58] Van Eck, N. J. and Waltman, L. 2009. How to normalize
cooccurrence data? An analysis of some well-known
similarity measures. Journal of the American Society for
Information Science and Technology, 60(8), 1635-1651.
[59] Wang, J., de Vries, A. P. and Reinders, M. J. T. 2006.
Unifying userbased and itembased collaborative filtering
approaches by similarity fusion. In Proc. of the 29th Annual
International ACM SIGIR Conference on Research and
Development in Information Retrieval. NY, 501-508.
[60] White, H. D. and Griffith, B. 1981. Author cocitation. A
literature measure of intellectual structure. Journal of the
American Society for Information Science, 32(3), 163-171.
[61] White, H. D. and Griffith, B. 1982. Authors as markers of
intellectual space. Co-citation in studies of science,
technology and society. Journal of Documentation, 38(4),
255-272.
[48] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. and
Riedl, J. 1994. Grouplens: An open architecture for
collaborative filtering of netnews. In Proc. of CSCW’94,
ACM Conference on Computer Supported Cooperative
Work. New York, NY, ACM, 175-186.
[62] Yukawa, T., Kasahara, K., Kita, T. and Kato, T. 2001. An expert recommendation system using concept-based relevance discernment. In Proceedings of ICTAI '01, 13th IEEE International Conference on Tools with Artificial Intelligence. Dallas, Texas, 257-264.
[49] Rorvig, M. 1999. Images of similarity: A visual exploration
of optimal similarity metrics and scaling properties of TREC
topic-document sets. Journal of the American Society for
Information Science and Technology, 50(8), 639-651.
[63] Zanardi, V. and Capra, L. 2008. Social ranking: Uncovering
relevant content using tag-based recommender systems.
Proceedings of the 2008 ACM Conference on Recommender
Systems. New York, NY, 51-58.
[50] Said, A., Wetzker, R., Umbrath, W. and Hennig, L. 2009. A
hybrid PLSA approach for warmer cold start in folksonomy
recommendation. In Proc. of the International Conference on
Recommender Systems. New York, NY, 87-90.
[64] Zhao, D. and Strotmann, A. 2007. All-author vs. first author
co-citation analysis of the Information Science field using
Scopus. Proceedings of the 70th Annual Meeting of the
American Society for Information Science and Technology,
44(1). Milwaukee, Wisconsin, 1-12.
[51] Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. 2000.
Analysis of recommendation algorithms for ecommerce. In
Proceedings of the 2nd ACM Conference on
Electronic Commerce. New York, NY, 158-167.
[65] Zhao, D. and Strotmann, A. 2011. Counting first, last, or all
authors in citation analysis. Collaborative stem cell research
field. Journal of the American Society for Information
Science and Technology, 62(4), 654-676.
Group recommendation methods for social network environments
Lara Quijano-Sanchez
lara.quijano@fdi.ucm.es
Juan A. Recio-Garcia
jareciog@fdi.ucm.es
Belen Diaz-Agudo
belend@sip.ucm.es
Universidad Complutense de Madrid, Spain
ABSTRACT
Social networks present an opportunity for enhancing the
design of social recommender systems, particularly group
recommenders. Concepts like personality, tie strength or
emotional contagion are social factors that increase the accuracy of the system’s prediction when including them in
a group recommendation model. This paper analyses the
inclusion of social factors in current preference aggregation
strategies by exploiting the knowledge available in social network environments. Proposed techniques are evaluated in a
real application for Facebook that recommends movies for
groups of users.
General Terms
Algorithms, Human Factors, Performance
Keywords
Recommender Systems, Social Networks, Personality, Trust
1. INTRODUCTION
Recommender systems are born from the necessity of having some kind of guidance when searching through complex
product spaces. More precisely, group recommenders are
built to help groups of people decide a common activity or
item. Nowadays, social networks are commonly used to organize events and activities for groups of users. Therefore,
they are an ideal environment for exploiting recommendation techniques. Moreover, we can use the various relationships captured in these communities (trust, confidence, tie
strength, etc.) in new ways, by incorporating better indicators of relationship information.
There is a proliferation of recommender systems that cope
with the challenge of addressing recommendations for groups
of users in different domains like MusicFX [12], FlyTrap [2]
or LET’S BROWSE [9] among many. What all these recommenders have in common is that the group recommendations take the personal preferences obtained from their
users into account but they consider each user equal to the
others. The recommendation is not influenced by their personality or the way each one behaves in a group when joining
a decision-making process. In our approach we propose to
study how people interact depending on their personality or
their closeness in order to improve group recommendations.
Our previous work [16, 15, 17] studies how to measure the personality of and trust between users. It also proposes several recommendation methods that incorporate both factors.
Group recommendation approaches are typically based on
generating an aggregated preference using the user’s individual preferences. As stated in [6] the main approaches to
generate the preference aggregation are (a) merging the recommendations made for individuals, (b) aggregation of ratings for individuals and (c) constructing a group preference
model. Masthoff [10] presents a compilation of the most important preference aggregation techniques. These basic approaches merge the ratings predicted individually for each
item to calculate a global prediction for the group. The selection of a proper aggregation strategy is a key element in
the success of recommendations. The main contribution of
this paper is a comparative analysis of these strategies applied to our social-enhanced recommendation methods. This
study will help us choose the best aggregation strategy for
our group recommendation approach.
The second goal of this paper is to illustrate the potential of the proposed social recommendation techniques, to
be able to evaluate them in a real environment and to make them available to our users. To do so, we have developed
HappyMovie, a movie recommendation application for Facebook. Our application applies the methods presented in this
paper to aid groups of users in deciding what is the best movie
to watch together. Our system takes advantage of the popularization among users of organizing events through social
networks. It is becoming very common that someone proposes
an activity and invites several friends to the event using
Facebook or any other online community. HappyMovie goes
one step further and guides groups of friends to decide on an
activity to perform together (in our case, selecting a proper
movie for the group).
Summing up, in this paper we present a group recommender application embedded in a social network that allows us to study and improve (as we will later conclude) the
performance of different aggregation techniques when using
them in our recommendation method based on personality
and trust factors.
Section 2 introduces our Facebook application HappyMovie.
Our social-enhanced group recommendation method is ex-
Figure 1: HappyMovie Main Page
Figure 2: Personality test in HappyMovie
plained in Section 3. We present the case study in the movie
recommendation domain and the experimental evaluation of
the presented methods in Section 4. Finally Section 5 concludes the paper.
2. FACEBOOK'S GROUP RECOMMENDATION APPLICATION: HAPPYMOVIE
HappyMovie is a Facebook application where we provide
a movie recommendation for a group of people planning to
go to the cinema together. The application works like Facebook's event application, where the event is going to the cinema. The application recommends to the attending users a
movie from the current movie listing of a selected place. In
HappyMovie users can create new events, invite their Facebook friends to any of their events and erase themselves
from them. When a user starts the application, as shown in
Figure 1, she can¹:
¹ We must point out that it is necessary to perform activities 1 and 2 before being able to perform any of the others.
1. Answer to a personality test: In previous works we
have studied the different behaviours that people have
in conflict situations according to their personality. In
[16, 15] we presented a group recommendation method
that distinguishes different types of individuals regarding their personality. There are different approaches
that can be used in order to obtain the different roles
that people play when interacting in a decision making
process, for example the TKI test [19]. We have used a
movie metaphor that consists of displaying two movie characters with opposite personalities for five possible personality aspects. We have proven in [17] that with this test we are able to obtain, in a less tedious way, the personality measures that the TKI test obtains, with comparable results in the recommendations. In our test,
one character represents the essential characteristics
of the personality feature, while the other one represents all the opposite ones. What the user has to do
is to choose which of each pair of characters she identifies with more, by simply moving an arrow that indicates the degree of conformity, as shown in Figure 2.
Note that additional information about the characters
and the category they are representing is provided to
our users.
2. Perform a movies preference test: This test is
shown in Figure 3. This information about the user
is required to predict the ratings for the movies to be
recommended. Our group recommendation strategies
combine individual recommendations to find a movie
suitable for all the users in the group. This individual
recommender estimates the ratings that a user would
assign to each product in the catalogue. It is built using the jCOLIBRI framework [3] and follows a content
based approach [14]. To do so, it assigns an estimated
rating to the products to be recommended based on
an average of the ratings given by the user in the preference test to the most similar products. In our case,
it returns an average of the three most similar rated
items.
3. Create a new event: Figure 4 shows how this option
is presented. To create an event, users must establish
the place, the deadline for users to join the event and
the date when the event will take place. Invited users
receive a notification of the event and are able to accept or reject it. Once the event has been created any
attending user can see the date and place of the event
and a proposal of three movies, that are the best ones
that our group recommender has found for the current
group of attending users.
4. Access to events the user is already attending:
When a user enters an event, the application calculates
the trust that the user has with all the other users that
have joined the event up to now. Current research
has pointed out that people tend to rely more on recommendations from people they trust (friends) than
on recommendations based on anonymous ratings [18].
This social element is even more important when we
are performing a group recommendation where users
have to decide an item for the whole group. This kind
of recommendation usually follows an argumentation process, where each user defends her preferences and rebuts others' opinions. Here, when users must change
their mind to reach a common decision, the trust between users is the major factor. Note that trust is
also related to tie strength and previous works have
reported that both are conceptually different but there
is a correlation between them [8]. The calculation of
the trust is the part that benefits the most from embedding the application in a social network. With a
Figure 3: Preferences test in HappyMovie
standalone application, the task of obtaining the data
required to compute the trust between users is very
tedious. Now, we are able to calculate the trust between users extracting the specific information from
each of their own profiles in the social network. Users
in Facebook, can post on their profiles a huge amount
of personal information that can be analysed to compute the trust with other users: distance in the social
network, number of shared comments, likes and interests, personal information, pictures, games, duration of
the friendship, etc [5, 4]. We analyse 10 different trust
factors comparing the information stored in their Facebook profiles. Next, these factors are combined using a
weighted average. A detailed explanation of the trust
factors obtained from Facebook and the combination
process is provided in [16].
When the event is created, the application looks up the current movie listing of the selected city and provides a list of 3 movies, which represent the best 3 movies that the recommender has found in the movie listing for the users that have joined the event up to now; this is shown in Figure 5. This list is automatically updated
every time a user joins the event or retires from it. This
process keeps going on until the last possible day to
join or retire from the event, the deadline date. From
that date until the date when the event will take place
users can vote on the 3 final proposed movies. When the
final date arrives the votes are analyzed and the most
voted movie is the one presented.
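The individual predictor mentioned in activity 2 above can be summarized with the following sketch, a hypothetical reimplementation with assumed data shapes rather than the jCOLIBRI-based HappyMovie code: the estimated rating for an unseen movie is the average of the user's ratings for the three most similar movies she rated in the preference test, with similarity here assumed to be a simple overlap over descriptive terms.

import math

def similarity(desc_a: set, desc_b: set) -> float:
    # Assumed content similarity: cosine over sets of descriptive terms
    # (genres, keywords); the actual system compares jCOLIBRI case attributes.
    if not desc_a or not desc_b:
        return 0.0
    return len(desc_a & desc_b) / math.sqrt(len(desc_a) * len(desc_b))

def predict_rating(target_desc: set, rated: list, k: int = 3) -> float:
    """Average the ratings of the k most similar movies the user already rated."""
    ranked = sorted(rated, key=lambda m: similarity(target_desc, m["desc"]),
                    reverse=True)
    top = ranked[:k]
    return sum(m["rating"] for m in top) / len(top)

rated_movies = [
    {"desc": {"action", "sci-fi"}, "rating": 9},
    {"desc": {"drama", "romance"}, "rating": 3},
    {"desc": {"action", "thriller"}, "rating": 8},
    {"desc": {"comedy"}, "rating": 5},
]
print(predict_rating({"action", "sci-fi", "thriller"}, rated_movies))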
3. GROUP RECOMMENDATION METHOD
Our group recommendation method is based on the typical preference aggregation approaches plus personality and
social factors. The novelty presented in this paper is the
evaluation of the different aggregation approaches that exist
when using them with our group recommendation method.
These approaches [11, 13] aggregate the users' individual predicted ratings pred(u, i) to obtain an estimation gpred(G, i) for the group. Then the item with the highest group predicted score is proposed.
Figure 4: How to create an activity in HappyMovie
gpred(G, i) = 𝒢_{∀u ∈ G} pred(u, i)   (1)
Here G is a group of users, which user u belongs to. This
function provides an aggregated value that predicts the group
preference for a given item i. By using this estimation, our
group recommender proposes the set of k items with the highest group predicted score. This is what we will later refer to as the “Standard recommender”.
In our proposal, we modify the individual ratings with
the personality and trust factors. This way, we modify the
impact of the individual preferences as shown in Equation
2.
gpred(G, i) = 𝒢_{∀u ∈ G} pred′(u, i)
pred′(u, i) = 𝒢_{∀v ∈ G} f(pred(u, i), p_u, t_{u,v})   (2)
where gpred(G, i) is the group rating prediction for a given
item i, pred(u, i) is the original individual prediction for user
u and item i, pu is the personality value for user u and tu,v
is the trust value between users u and v.
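Read operationally (our sketch, not the authors' implementation), Equation 2 is a two-stage pipeline: a function computing the modified individual prediction pred′(u, i) from personality and trust, followed by an aggregation function over the group.

def gpred(group, item, pred_mod, aggregate):
    """Group prediction: aggregate the modified individual predictions
    pred'(u, i) of all group members (Equation 2)."""
    return aggregate([pred_mod(u, item) for u in group])

# Example with an identity pred' (no personality/trust modification yet)
# and plain averaging as the aggregation function.
ratings = {("u1", "m1"): 6.0, ("u2", "m1"): 8.0}
print(gpred(["u1", "u2"], "m1",
            lambda u, i: ratings[(u, i)],
            lambda xs: sum(xs) / len(xs)))  # 7.0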
There are several ways to modify the predicted rating for
a user according to the personality and trust factors. The
one that has proven to be the most efficient in our previous
experiments performed in [16, 15], is the Delegation-based
method. The idea behind this method is that users create
their opinions based on the opinions of their friends. The
delegation-based method tries to simulate the following behaviour: when we are deciding which item to choose within
a group of users we ask the people who we trust. Moreover,
we also take into account their personality to give a certain
importance to their opinions (for example, because we know
that a selfish person may get angry if we do not choose her
preferred item). The estimation of the delegation-based rating (dbr(u, i)) given an user u and an item i is computed
in this way:
pred′(u, i) = dbr(u, i) = (1 / Σ_{v ∈ G} t_{u,v}) · Σ_{v ∈ G, v ≠ u} t_{u,v} · (pred(v, i) + p_v)   (3)

Figure 5: How events look in HappyMovie
In this formula, we take into account the recommendation
pred(v, i) of every friend v for item i. This rating is increased
or decreased depending on her personality (pv ), and finally it
is weighted according to the level of trust (tu,v ). Note that
this formula is not normalized by the group size and uses
the accumulated personality. Therefore, this formula could
return a value out of the ratings range. As we are only interested in giving a final ordered list of the users preferences
in the products of a given catalogue, it is not necessary to
normalize the results given by our formula.
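A minimal sketch of Equation 3 (assumed data shapes; not the authors' code) makes the mechanics explicit: every friend's predicted rating, shifted by that friend's personality value, is weighted by the trust the target user places in him, and the sum is divided by the accumulated trust over the whole group, as in the formula.

def dbr(u, item, group, pred, personality, trust):
    """Delegation-based rating of user u for an item (Equation 3).

    pred[(v, item)] is the individual prediction of user v, personality[v]
    the personality value p_v, and trust[(u, v)] the trust t_{u,v}."""
    total_trust = sum(trust[(u, v)] for v in group)  # denominator over all v in G
    weighted = sum(trust[(u, v)] * (pred[(v, item)] + personality[v])
                   for v in group if v != u)
    return weighted / total_trust

group = ["ana", "bob", "eva"]
pred = {("ana", "m1"): 7.0, ("bob", "m1"): 5.0, ("eva", "m1"): 9.0}
personality = {"ana": 0.4, "bob": 0.9, "eva": 0.1}
trust = {("ana", "ana"): 1.0, ("ana", "bob"): 0.6, ("ana", "eva"): 0.3}
print(dbr("ana", "m1", group, pred, personality, trust))  # ≈ 3.3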
Next, we will explain the aggregation functions that can
be applied to combine the individual estimations.
3.1 Aggregation Functions
A wide set of aggregation functions has been devised for
combining individual preferences [10], being the average and
least misery the most commonly used. In our previous research we only evaluated the performance of our approach
using social factors with the average satisfaction function.
As we have said before choosing the aggregation function
that performs best is a key element to provide good recommendations. Here we explain the ones that we have studied
for our group recommendation method.
• Average Satisfaction: Refers to the common arithmetic mean, which is a method to derive the central
tendency of a sample space. It computes the average
of the predicted ratings of each member of the group.
The function that represents this strategy is:

gpred(G, i) = (1 / |G|) Σ_{u ∈ G} pred′(u, i)   (4)

Where pred′(u, i) is the predicted rating for each user u and every item i, and gpred(G, i) is the final rating of item i for the group.

• Borda Count: (Borda, 1781). The Borda count is a single-winner election method in which users rank candidates in order of preference. The Borda count determines the winner of an election by giving each candidate a certain number of points corresponding to the position in which she is ranked by each voter. Once all votes have been counted the candidate with the most points is the winner. Because it sometimes elects broadly acceptable candidates, rather than those preferred by the majority, the Borda count is often described as a consensus-based electoral system, rather than a majority-based one. Finally, to obtain the group preference ordering, the points awarded for the individuals are added up.

gpred(G, i) = Σ_{u ∈ G} bs(u, i)
bs(u, i) = pos(pred′(u, i), OL)
OL = {pred′(u, i_1), …, pred′(u, i_n)} where pred′(u, i_p) ≤ pred′(u, i_{p+1})   (5)

Where bs(u, i) is the Borda score assigned to each item rated by user u. It is obtained as the position of the estimated rating for item i in the ordered list OL of the ratings predicted for all the items. A problem arises when an individual has multiple alternatives with the same rating. We have decided to distribute the points.

• Copeland Rule: (Copeland, 1951) ranks the alternatives according to the difference between the number of alternatives they beat and the number of alternatives they lose against. It is a good procedure to overcome problems resulting from voting cycles [7].

gpred(G, i) = Σ_{u ∈ G} cs(u, i)   (6)
cs(u, i) = 1 if wins(u, i) > losses(u, i); −1 if wins(u, i) < losses(u, i); 0 a.o.c.
wins(u, i) = |{j ≠ i : pred′(u, i) > pred′(u, j)}|
losses(u, i) = |{j ≠ i : pred′(u, i) < pred′(u, j)}|
• Approval Voting: is a single-winner voting system
used for elections. Each voter may vote for (approve
of) as many candidates as they wish. The winner is the
candidate receiving the most votes. Users could vote
for all alternatives with a rating higher than a certain
threshold δ, as this means voting for all alternatives
they like at least a little bit.
gpred(G, i) = Σ_{u ∈ G} as(u, i)
as(u, i) = 1 if pred′(u, i) ≥ δ; 0 a.o.c.   (7)
• Least Misery: This strategy follows the idea that,
even if the average satisfaction is high, a solution that
leaves one or more members very dissatisfied is likely
to be considered undesirable. This strategy considers
that a group is as happy as its least happy member.
The final list of ratings is the minimum of the individual ratings. A disadvantage can be that if
the majority really likes one item, but one person does
not, then it will never be chosen.
gpred(G, i) = min_{u ∈ G} pred′(u, i)   (8)
• Most Pleasure Strategy: It is the opposite of the
previous strategy, it chooses the highest rating for each
item to form the final list of predicted ratings.
gpred(G, i) = max_{u ∈ G} pred′(u, i)   (9)
• Average Without Misery: assigns the average of
the weights in the individual ratings. The difference
here is that those preferences which have a weight under a certain threshold will not be considered.
gpred(G, i) = (1 / |G|) Σ_{u ∈ G, pred′(u, i) > δ} pred′(u, i)   (10)
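The simpler of these strategies can be condensed into a few lines of Python (our illustration, not the HappyMovie code); each function takes the list of modified individual predictions pred′(u, i) of one item for all group members.

def average(preds):
    # Average Satisfaction (Eq. 4)
    return sum(preds) / len(preds)

def least_misery(preds):
    # Least Misery (Eq. 8): the group is as happy as its least happy member
    return min(preds)

def most_pleasure(preds):
    # Most Pleasure (Eq. 9)
    return max(preds)

def average_without_misery(preds, delta):
    # Average Without Misery (Eq. 10): ignore ratings at or below threshold delta
    kept = [p for p in preds if p > delta]
    return sum(kept) / len(kept) if kept else 0.0

def approval(preds, delta):
    # Approval Voting (Eq. 7): count ratings at or above threshold delta
    return sum(1 for p in preds if p >= delta)

# preds holds pred'(u, i) for every group member and a single item
preds = [7.5, 4.0, 9.0]
for f in (average, least_misery, most_pleasure):
    print(f.__name__, f(preds))
print("approval", approval(preds, delta=6))
print("avg_wo_misery", average_without_misery(preds, delta=5))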
In the next section we study how the different aggregation functions influence the results of the group recommendation, and we also prove the validity of our method.
4. EXPERIMENTAL EVALUATION
We have run an experiment in the movie recommendation domain. In it we have been able to benefit from the
advantages of having the group recommendation application, HappyMovie, embedded in a social network. We have
tested our theories with groups of real users and also with
synthetically generated users. The reason for using synthetically generated users, besides the real users, is that we
wanted to have control of the data distribution, which does
not happen when using real data. This synthetic data set
lets us explore every group composition and personality distribution within the group. It also lets us reproduce the
behaviour of large groups that are very difficult to organize
in experiments with real users. With both the synthetic and
real data we are able to explore the realistic and the pure
combinational distributions. After obtaining the necessary
data to perform our case study we have implemented two
versions of our system: a standard recommender that only
aggregates preferences (the baseline explained in Equation 1,
we will refer to it from now on as “Base recommender”) and
another one that implements our method: the DelegationBased method (we will refer to it as “Delegation Method
• Experimental set-up of the real data: As we want
to evaluate the real performance of our group recommendation method, we test our Facebook application
HappyMovie. From it we obtain the real users data
used in this first experiment. To do so we create different events in the social network as explained in 2. In
these events we ask volunteers to complete some tests
that let us obtain two of the factors used by our system: personality and personal preferences. The demographic data about our participants (mean age, gender,
etc) is quite varied because they have been selected
among our friends, family and students. We finally
have 58 users. The tests are the ones presented in
Section 2 (Figures 2 and 3) that let us obtain the personality and individual preferences of each user. Note
that trust is obtained by analysing Facebook profiles.
In order to make a good recommendation it is necessary to have accurate information about users' individual preferences. To obtain this, we have asked our users to rate at least 20 movies in the personal preferences test (we determined that 20 was the minimum number of ratings needed in order to have a representative profile of the users' preferences). On average, the users have rated 30 movies from the list of 50 movies to rate.
Now we have all the information required to build the
individual profile of our users which is necessary for our
recommendation method. This profile is based, as we
have explained before, on three different aspects: personality, individual preferences and trust with other
users. We need a way to measure the accuracy of the
group recommendation. To be able to compare our
results with what will happen in real life, we brought
our users together in person and asked them to mix in different ways several times and simulate that they are going to the cinema together, forming groups that would actually occur in reality. We provided them
15 movies that represent the current movie listing and
we ask them to choose which 3 movies they actually
would watch together. We have chosen to ask for just 3 movies because real users are only interested in a few movies they really want to watch. Knowing that a movie listing is normally formed by no more than 15 movies, we have considered that 3 would be the maximum number of movies that a user will be really interested in watching at a time. Later, users created events in
HappyMovie and joined them with the same configuration as they did in reality; we managed to gather 15
groups (which means 15 events in the application) of 9
(4 groups), 5 (6 groups) or 3 (5 groups) members. The
three movies that each group chooses are stored as the
real group favorites set –rgf –. Our goal is to evaluate
the accuracy of our recommender by comparing the set
of the 3 final proposed movies in each event as shown
in Figure 5 –the pgf set– with the real preferences rgf.
The more the predicted movie set resembles the real one, the better the results our application is providing.
The evaluation metrics applied to compare both sets
are explained in Section 4.1.
• Experimental set-up of the Synthetic Data: We
have performed a second experiment, where we have
simulated the behaviour of 100 people. In it we assign
to our synthetic users a random personality value. We
basically define five different types of personality according to the range provided by the TKI [19] and our
movie metaphor personality tests: very selfish, selfish,
tolerant, cooperative and very cooperative. For example if we consider a very selfish person her personality
value must be contained in a range of [0.8,1.0]. When
we analyzed the range of the personality values of the
real users, there were some of these ranges that were
unexplored because with smaller samples (58) and a non-fixed group composition not all the possible situations appear. As we wanted to study all the possible behaviors we decided to use the synthetic data.
We must note that the validity of experimenting with
these synthetically generated users has been already
proven in our previous studies [16].
To study the effects of the different types of personalities we generate 20 users for each type of personality.
We group users in sets of 3, 5, 10, 15, 20 and 40 people. For each group size we select the components of
the group so that the personality distributions contain
all the possible combinations: groups of very selfish,
selfish, tolerant, cooperative, very cooperative, very
selfish & very cooperative, very selfish & tolerant, ...
and so on until we reach 13 possible combinations and
groups for each size. In the end we have 76 groups (13
different distributions for each size, except for the 40
people group where we only had 11 combinations due
to the resemblance of personalities in such big groups).
The next step needed in our experiment is to obtain
the real group favorites (rgf ) and be able to measure
the recommendation accuracy. We have to simulate
the individual preferences test, so that we know which
movies would each of our users have chosen individually (from the same movie listing of a cinema, that
we proposed to our real users). Afterwards we also
have to determine which of those movies the group as a
whole would have finally decided to watch. To do so,
we use the description of the movies to predict which
movies each user likes.
We have given to each of our synthetically generated
users a random profile. These profiles are constructed
according to typical preferences in movies of real life
people according to their age, sex and preferences and
studding the Movielens data set [1]. For example, the
ratings that a user with a childish profile would give
are very high ratings to animation, children or musical
movies and very low ratings to drama, horror, documental, etc. From these profiles we are able to predict
the individual likes of our users. We have selected the
same list of 50 heterogeneous movies from the preferences test that HappyMovie offers (see Figure 3) and
we rated them for each user according to their profile.
Afterwards, with a content-based recommender we rate
and organize the listing of the cinema in order of preference by comparing the items with the ones in the
simulated preferences. We chose the top 3 and marked
them as the real individual favorites –rif –. Secondly
we need to obtain the decision of the group. Now that
we know which movies the individual users would argue for, we reproduce a real-life situation where everyone
discusses their preferences, taking into account the personalities and the friendship between them and then
we finally obtain the real favorite movies for the group
rgf. We use this information to evaluate the accuracy
of our recommender by comparing how many of the
first 3 recommended movies in the predicted group favorites –pgf – belong to the rgf set of that group.
4.1 Evaluation metrics
As we need to have an evaluation function to measure the
accuracy of our method, we have studied several aspects before deciding which matters we should take into account in
the evaluation process of our experiment. We need to compare the results of our recommender system to the real preferences of the users (that is, what would happen in a real life
situation), also we must note: 1) The number of estimated
movies that we were going to take into account: Long lists
of ordered items are of no use in this case scenario. Real
users are only interested in a few movies they really want
to watch. This fact discards several evaluation metrics that
compare the ordering of the items in the real list of favourite
movies and the estimated one (MAE, nDCGs, etc). 2) The
number of relevant and retrieved items in our system is fixed :
Therefore, we cannot use general measures like recall or precision. However, there are some metrics used in the Information Extraction field [20] that limit the retrieved set. This is
the case of the precision@n measure that computes the precision after n items have been retrieved. We have decided
to use the precision@3 to evaluate how many of the movies
in pgf are in the rgf set (note that |rgf | = 3). This kind of
evaluation can be seen from a different point of view: we are
usually interested in having at least one of the movies from
pgf in the rgf set. This measure is called success@n and
returns 1 if there is at least one hit in the first n positions.
Therefore, we could use success@3 to evaluate our system
computing the rate of recommendations where we have at
least one-hit in the real group favorites list. For example,
a 90% of accuracy using success@3 represents that the recommender suggests at least one correct movie for the 90%
of the evaluated groups. In fact, success@3 is equivalent
to having precision@3 > 1/3, when considering retrievals
individually before computing the average. We can also define a 2success@3 metric (equivalent to precision@3 > 2/3)
that represents how many times the estimated favorites list
pgf contains at least two movies from rgf. Obviously, it is
much more difficult to achieve high results using this second
measure.
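A minimal sketch of the three measures just described, assuming pgf and rgf are lists of item identifiers as above:

```python
def precision_at_n(pgf, rgf, n=3):
    """Fraction of the first n predicted items that are real favorites."""
    return len(set(pgf[:n]) & set(rgf)) / n

def k_success_at_n(pgf, rgf, n=3, k=1):
    """1 if at least k of the first n predictions are real favorites, else 0.
    k=1 gives success@n; k=2 gives the stricter 2success@n."""
    return int(len(set(pgf[:n]) & set(rgf)) >= k)
```

Averaging k_success_at_n over all evaluated groups yields the reported percentages, e.g. 90% accuracy under success@3.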
4.2 Results
This section describes the results obtained in the two experiments: using real vs. synthetic data. As explained before, we have built two recommenders: the standard recommender (we refer to it as “Base recommender”) and the one using the delegation-based method (we refer to it as “Delegation Method recommender”), both explained in Section 3. In both, we have tested the 7 different types of aggregation functions explained in Section 3.1.
Figure 6 shows the results for the real and the synthetic data using the recommender that implements the delegation-based method for all the different merging functions. We can see that the best two merging functions are average satisfaction for the real data and least misery for the synthetic data.

Figure 6: Results for the real data and the synthetic data using our method for all the different merging functions.
We have also analysed the improvement of our method over the base recommender for all the different aggregation functions. This improvement is shown in Figures 7 and 8. From them we conclude that, for the real data, our method improves on the base recommender by 10% for the success@3 measure and by 12% for the 2success@3 measure. For the synthetic data, the recommender that implements our method improves the results by 16% for the success@3 measure and by 7% for the 2success@3 measure. With these results we can conclude that our method significantly improves on the base recommender.
In Figures 9 and 10 we study the results of the recommender that implements our method on the real and synthetic data when varying the group size. We performed the analysis for all the different aggregation functions. We can see that while some aggregation functions, like average satisfaction, give better results for small groups (we consider groups of size 10 or less to be small), others, like least misery or most pleasure, work the other way round and give better results for big groups. In this particular case we can clearly see the necessity of our synthetically generated data, because with the real data we only had 3 different group sizes and the results were not significant. This is why, on average, as we can see in Figure 6, the best function for the real data is average satisfaction (recall that the group sizes in the real data are 9, 5 and 3, all of them considered small groups). On the other hand, on average, the best function for the synthetic data is least misery, which makes sense as the results in Figure 10 show that this function works better for big groups and, on average, the synthetic data groups are big (with groups of 15, 20 and 40). Figures 9 and 10 reflect a similar behaviour as the group size grows. This difference in the results of the different aggregation functions when varying the group size opens the possibility of improving our method with an adaptive group recommender, where the recommendation algorithm adapts itself to the personality distribution in the group, its size and other characteristics. Figures 9 and 10 also show that for the groups that are equal in size, and therefore comparable (3, 5 and 9), the results for synthetic and real data across all the different merging functions differ on average by no more than 0.11. This shows that our synthetically generated data is valid and yields consistent results.
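The aggregation functions themselves are defined in Section 3.1, which is not reproduced here; for reference, the three most discussed in this section are commonly defined in the group recommendation literature as follows (a sketch; the function names are ours):

```python
def average_satisfaction(ratings):
    """Mean of the members' predicted ratings for an item."""
    return sum(ratings) / len(ratings)

def least_misery(ratings):
    """The least satisfied member dictates the group score."""
    return min(ratings)

def most_pleasure(ratings):
    """The most satisfied member dictates the group score."""
    return max(ratings)
```

A group list is then built by scoring every candidate item with one of these functions over the members' predicted ratings and keeping the top 3 as pgf.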
5. CONCLUSIONS
This paper introduces HappyMovie, a movie recommender system for groups that models the social side of users to provide better recommendations. It is integrated in Facebook in order to infer the social behaviours within the group. It is clear that groups have an influence on individuals when reaching a common decision; this is commonly referred to as emotional contagion. This contagion is usually proportional to the trust between individuals, as closer friends have a higher influence. Therefore, our system analyses users' interactions (exchanged messages, shared photos, friends in common, ...) to calculate this social factor. However, the influence of the group also depends on the degree of conformity of the individual, which is reflected in the individual's behaviour when facing a conflict situation. Here, personality influences the acceptance of others' proposals. Our model includes personality by asking the users to answer a short personality test. Both variables, personality and trust, are used to estimate the preference of each user for a given item. To do so, we modify the ratings estimated by a standard content-based recommender in our delegation-based method. This method models the effect of emotional contagion and obtains an estimation that is based on the estimations for other users with a close relationship. This closeness is inferred from the trust between users. Finally, the personality variable represents the degree of conformity with the items preferred by these closer individuals.

Figure 7: Comparison of the results for the real data with and without our method for all the different merging functions.

Figure 8: Comparison of the results for the synthetic data with and without our method for all the different merging functions.
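A hedged sketch of what such a delegation-based adjustment could look like; the blending formula below is only our guess at the shape of the method, based on the description above (see [15, 16] for the actual model):

```python
def delegated_rating(user, item, est, trust, personality):
    """Blend a user's own content-based estimate (est[user][item]) with the
    estimates of trusted peers, in proportion to the user's conformity."""
    own = est[user][item]
    peers = {p: t for p, t in trust[user].items() if t > 0}
    if not peers:
        return own
    peer_view = sum(t * est[p][item] for p, t in peers.items()) / sum(peers.values())
    conformity = 1.0 - personality[user]  # assumes personality in [0, 1]; low = conformist
    return (1.0 - conformity) * own + conformity * peer_view
```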
Individual predictions must be combined to obtain a global
recommendation for the group. Several aggregation strategies have been proposed to obtain this final value: average
satisfaction, least misery, approval voting, etc. We have
evaluated these strategies applied to our social-enhanced recommendation method with real and synthetically generated
users. The results show that average satisfaction (computing the average estimated rating) and least misery are the best options, and that our method improves the accuracy of standard approaches by 12%.

Figure 9: Comparison of the results of all the different merging functions when varying the group size for the real data.

Figure 10: Comparison of the results of all the different merging functions when varying the group size for the synthetic data.

6. ACKNOWLEDGMENTS
Supported by the Spanish Ministry of Science and Education (TIN2009-13692-C03-03) and the Madrid Education Council and UCM (Group 910494). We also thank the friends and students who participated in the experiment.

7. REFERENCES
[1] J. Bobadilla, F. Serradilla, and A. Hernando. Collaborative filtering adapted to recommender systems of e-learning. Knowl.-Based Syst., 22(4):261–265, 2009.
[2] A. Crossen, J. Budzik, and K. J. Hammond. Flytrap: Intelligent group music recommendation. In IUI '02: Proceedings of the 7th International Conference on Intelligent User Interfaces, pages 184–185. ACM, 2002.
[3] B. Díaz-Agudo, P. A. González-Calero, J. A. Recio-García, and A. A. Sánchez-Ruiz-Granados. Building CBR systems with jCOLIBRI. Sci. Comput. Program., 69(1-3):68–75, 2007.
[4] E. Gilbert and K. Karahalios. Predicting tie strength with social media. In CHI '09, pages 211–220. ACM, 2009.
[5] J. Golbeck. Combining provenance with trust in social networks for semantic web content filtering. In L. Moreau and I. T. Foster, editors, Provenance and Annotation of Data, IPAW 2006, Revised Selected Papers, volume 4145 of Lecture Notes in Computer Science, pages 101–108. Springer, 2006.
[6] A. Jameson and B. Smyth. Recommendation to groups. In P. Brusilovsky, A. Kobsa, and W. Nejdl, editors, The Adaptive Web: Methods and Strategies of Web Personalization, volume 4321 of Lecture Notes in Computer Science, pages 596–627. Springer, 2007.
[7] C. Klamler. The Copeland rule and Condorcet's principle. Economic Theory, 25(3):745–749, 2005.
[8] D. Z. Levin, R. Cross, and L. C. Abrams. The strength of weak ties you can trust: The mediating role of trust in effective knowledge transfer. Management Science, 50:1477–1490, 2004.
[9] H. Lieberman, N. W. V. Dyke, and A. S. Vivacqua. Let's Browse: A collaborative web browsing agent. In IUI, pages 65–68, 1999.
[10] J. Masthoff. Group modeling: Selecting a sequence of television items to suit a group of viewers. User Modeling and User-Adapted Interaction, pages 37–85, 2004.
[11] J. Masthoff and A. Gatt. In pursuit of satisfaction and the prevention of embarrassment: Affective state in group recommender systems. User Modeling and User-Adapted Interaction, 16(3-4):281–319, 2006.
[12] J. F. McCarthy and T. D. Anagnost. MusicFX: An arbiter of group preferences for computer supported collaborative workouts. In CSCW '98: Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work, pages 363–372. ACM, 1998.
[13] M. O'Connor, D. Cosley, J. A. Konstan, and J. Riedl. PolyLens: A recommender system for groups of users. In ECSCW '01: Proceedings of the Seventh European Conference on Computer Supported Cooperative Work, pages 199–218, Norwell, MA, USA, 2001. Kluwer Academic Publishers.
[14] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In The Adaptive Web, pages 325–341, 2007.
[15] L. Quijano-Sánchez, J. A. Recio-García, B. Díaz-Agudo, and G. Jiménez-Díaz. Social factors in group recommender systems. ACM TIST, TIST-2011-01-0013, in press, 2011.
[16] L. Quijano-Sánchez, J. A. Recio-García, and B. Díaz-Agudo. Personality and social trust in group recommendations. In ICTAI '10, pages 121–126. IEEE Computer Society, 2010.
[17] L. Quijano-Sánchez, J. A. Recio-García, and B. Díaz-Agudo. HappyMovie: A Facebook application for recommending movies to groups. In ICTAI '11, to be published, 2011.
[18] R. R. Sinha and K. Swearingen. Comparing recommendations made by online systems and friends. In DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries, 2001.
[19] K. Thomas and R. Kilmann. Thomas-Kilmann Conflict Mode Instrument. Tuxedo, NY, 1974.
[20] S. Tomlinson. Comparing the robustness of expansion techniques and retrieval measures. In CLEF, pages 129–136, 2006.
Do You Feel How I Feel? An Affective Interface in Social
Group Recommender Systems
Yu Chen and Pearl Pu
Human Computer Interaction Group
Swiss Federal Institute of Technology
CH-1015, Lausanne, Switzerland
yu.chen@epfl.ch, pearl.pu@epfl.ch
ABSTRACT
Group and social recommender systems aim to recommend items of interest to a group or a community of people. The user issues in such systems cannot be addressed by examining the satisfaction of their members as individuals; rather, group satisfaction should be studied as a result of the interaction and interface methods that support group awareness and interaction. In this paper, we introduce the Affective Color Tagging Interface (ACTI), which supports emotional tagging and feedback within a social group in pursuit of an affective recommender system. We further apply ACTI to GroupFun, a social group music recommender system. We then report the results of a field study and, in particular, how the social relationships within a group influence users' acceptance of and attitudes toward ACTI.
Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User
Interfaces –Graphical user interfaces (GUI), User-centered
design. H.5.3 [Information Interfaces and Presentation]: Group
and Organization Interfaces - Organizational design, Web-based
interaction
General Terms
Design, Human Factors
Keywords
Group and Social Recommender Systems, Interface Design,
Interaction Design, Affective Interface, Emotional Contagion
1. INTRODUCTION
With the proliferation of social networks, social groups have extended in meaning from families, friends and colleagues to people who share the same interests or experiences in online communities, or who are socially connected. The field trial version of Google+ allows users to define social “circles” of connections, share “sparks” they find interesting and join unplanned “hangouts”; interest groups on Last.fm are formed by members who support the same singers or music bands; members of LinkedIn groups are usually people who share a similar academic, industrial or technical background.
Meanwhile, we have learned from theories in psychology that multimedia files such as music, video and pictures evoke emotion. Purely numeric ratings are not sufficient for users to accurately provide feedback, and attempts have been made to recommend music by other methods. Hu and Pu [3] have shown that personality quizzes reveal more hidden aspects of user
preferences and enhance recommendation accuracy. Musicovery [2] has developed an interactive interface for users to select music categories based on their mood.
However, in group and social recommender systems, the affective state of a group is more than the sum of its members' states. Users' emotions can be influenced not only by the recommended items, e.g., music, but also by the emotions of others. Group formation and characteristics have been investigated as a premise for studying social group recommender systems [4]. In this paper, we describe the results of an empirical study of an affective interface for providing feedback in a social group recommender system. Our goal is to establish a basic understanding of this area, with a particular focus on the following two questions:
1) How do users respond to an affective user interface in a social group environment?
2) How do social relationships influence user behavior and attitudes?
To answer these questions, we implemented GroupFun, a music recommender system that suggests music playlists to groups of people. We conducted a field study in which users indicated their preferences by rating music; we then invited them to evaluate the experimental Affective Color Tagging Interface (ACTI). This was followed by interview questions investigating their attitudes towards ACTI.
The next section discusses existing work and how it relates to ours. This is followed by a description of the functionalities and interface of GroupFun and the design issues of ACTI in Section 3. Section 4 describes the hypotheses and procedure of a pilot study. After reporting the study results in Section 5, the paper concludes with limitations and future work in Section 6.
2. RELATED WORK
2.1 Emotion in Recommender Systems
The main goal of studying recommender systems is to improve user satisfaction. However, satisfaction is a highly subjective metric. Masthoff and Gatt [5] have treated satisfaction as an affective state, or mood, based on the following aspects of social and psychological theory: 1) mood impacts judgement; 2) retrospective feelings can differ from the feelings experienced; 3) expectation can influence emotion; and 4) emotions wear off over time. However, they did not propose any feasible methods for applying these psychological theories.
Musicovery is a typical example of a website that recommends music by user-selected mood. Musicovery classifies mood along two dimensions: dark-positive and energetic-calm. It uses a highly interactive interface that lets users experience different emotion categories and their corresponding music. However, such a recommender does not apply in a social group environment, as individual emotions differ from one another.
2.2 Emotional Contagion
Masthoff and Gatt [5] also showed that in group recommender systems, members' emotions can influence each other; this phenomenon is called emotional contagion. Hancock et al. (2008) [6] investigated emotional contagion and showed that emotions can be sensed in text-based computer-mediated communication; more significantly, they showed that emotional contagion occurred between partners. Sy et al. (2005) [7] carried out a large-scale user study involving 189 users forming 56 groups. They showed that leaders transfer their moods to group members and that leaders' moods impact the effort and coordination of groups. However, to the best of our knowledge, implementations of features related to emotional contagion in group recommender systems are lacking.
2.3 Group Relationships
Emotional contagion depends on the relationships among group members. Masthoff defined and distinguished different types of relationships. In a communal sharing relationship (e.g., friends), group members are more likely to care about each other, and their emotions are more likely to influence each other. For example, if your friend feels happy, you are likely to feel happy, and if your friend feels sad, you are likely to feel sad. In an equality matching relationship (e.g., strangers), members are less likely to be influenced by others. Such differences caused by relationships have not been verified experimentally. In our work, we compare how two such groups differ in their attitudes towards an affective interface in a group environment.
3. PROTOTYPE SYSTEM
We have developed a music group recommender system named GroupFun (http://apps.facebook.com/groupfun/), a Facebook application that allows groups of users to share music for events such as graduation parties. The functions of GroupFun mainly include: 1) group management, and 2) music playlist recommendation. Users are able to create a group, invite their friends to the group, and join other groups. Each user rates the uploaded songs¹ on a 5-point scale, as shown in Figure 1.

Figure 1. Screenshot of original rating interface

We further designed the Affective Color Tagging Interface (ACTI) as an additional widget that allows users to tag the emotions evoked by the music they have listened to. Unlike Musicovery, which recommends music based on user mood, ACTI uses emotions as an explanation, feedback and re-communication channel. In a group environment, each user usually actively persuades other users to adopt his or her own preferences. In previous user studies (reported in another paper), we observed that users would not use text for explanations because of the effort it requires. Users also commented that their preferences for music differ with context, and contexts usually correspond to emotions in music. For instance, peaceful music is usually selected for chatting, while energetic music is a top candidate for parties.
We adopted the Geneva Emotional Music Scale (GEMS) for evaluating user emotion [8]. GEMS is the first instrument designed to evaluate music-evoked emotions. We adopt the short version, GEMS-9, consisting of 9 classes of emotions: wonder, transcendence, power, tenderness, nostalgia, peacefulness, joy, sadness and tension. Each class provides a scale from 1 to 5 indicating the intensity of the emotion. However, asking users to evaluate the emotions evoked by music is not our research focus. Rather, we want to provide a ludic affective feedback interface in which a group of users can participate, and a survey-style questionnaire easily distracts users from the system itself. Inspired by the Geneva Emotional Wheel (GEW) [9], we visualize the evaluation scale as a color wheel, as shown in Figure 2. This wheel contains the 9 dimensions defined by GEMS, and each dimension contains 5 degrees of intensity, visualized by the size of the circles, their distance from the center and the saturation of their colors. The smaller the circle, the less intense the emotion. Users can tag their evoked emotions in any of the 9 dimensions according to intensity.

¹ The algorithms for recommending playlists and songs are introduced in another paper.

Figure 2. GEMW distribution in GroupFun
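As an illustration of the visual encoding just described, the sketch below maps an (emotion, intensity) tag to circle parameters; the constants are hypothetical, chosen only to mirror the rule that smaller, paler circles denote weaker emotions.

```python
GEMS_9 = ["wonder", "transcendence", "power", "tenderness", "nostalgia",
          "peacefulness", "joy", "sadness", "tension"]

def acti_circle(emotion, intensity):
    """Map an emotion tag to wheel coordinates: angle by GEMS dimension,
    size/distance/saturation by intensity (1 = weakest, 5 = strongest)."""
    assert emotion in GEMS_9 and 1 <= intensity <= 5
    return {
        "angle_deg": GEMS_9.index(emotion) * 360.0 / len(GEMS_9),  # wheel sector
        "radius_px": 4 + 3 * intensity,    # smaller circle = less intense
        "distance_px": 20 * intensity,     # closer to the center = weaker
        "saturation": intensity / 5.0,     # paler color = weaker
    }
```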
We further included the ACTI widget in the GroupFun rating interface. Users can give emotional feedback on songs by clicking the emotion button at the left side of the song ratings. The emotional rating interface then pops up showing the collective emotions of the group, which visualize the overall emotional character of the music (Figure 3). This serves as an alternative channel for explanation and persuasion.
Figure 3. Screenshot of rating interface with ACTI

4. PILOT STUDY
4.1 Hypotheses
We postulated that interface and interaction design in social group recommender systems depends on the relationships among group members. In other words, interface and interaction design differs between groups made up of friends and groups made up of strangers. We also hypothesized that groups whose members are in close relationships would like the ACTI interface, while groups whose members are not close to each other would be less likely to use it.
4.2 Participants
In total, two groups (4 participants each) joined the user study. Two PhD students from a course voluntarily participated, and each of them invited another three people to form a group. In order to investigate the effect of relationship on group behavior, we asked the first student to invite people who are not familiar with each other, and the second student to invite people who are familiar with each other.
4.3 Scenarios and Roles
In order to organize the experiment, we designed a car-driving scenario: a group of 4 people travelling together from Lausanne to Geneva by car, a ride of around 45 minutes. Based on the different group relationships, we designed the first scenario as strangers sharing a car ride and the second as friends travelling together. In each group, one of the participants is the driver and the other three are passengers. They use GroupFun to select a playlist for their trip. This scenario is a typical group activity where all members consume a recommendation list simultaneously, and it provides two types of roles, one driver and three passengers. The driver could be the user who creates the group. Since the focus group is small, it is not realistic to design a large-group scenario such as a party. In this experiment, GroupFun provides 15 songs, which suits the experiment in the sense that 15 songs of roughly three minutes each fill a 45-minute ride.
4.4 Experiment Design and Procedure
We carried out the user study on the two groups over two days. We first invited each member of a group to meet in our lab with their favorite music. We then debriefed each group on the procedure of the study and the usage of GroupFun, and assigned roles within each group. Users started following the scenario after exploring the original GroupFun for around five minutes.
In each group, the driver first created a group, sent invitations to his or her friends (the other participants of the group), and uploaded his or her favorite music. The invited users accepted the invitation, joined the group and contributed songs respectively. Meanwhile, they could listen to the group music and rate songs.
They then continued with a survey questionnaire evaluating GroupFun. After that, we invited them to explore the experimental GroupFun interface with ACTI, followed by an interview. The user study in both groups was recorded for further analysis.
5. INTERVIEW RESULTS
In order to verify the differences in relationships among group members, each participant rated their closeness to each of the others (except themselves) on a scale from 1 (don't know him/her) to 5 (know him/her very well). For example, if Participant A is in a very close relationship with Participant B, A is expected to rate 5 for B. Each group thus yields 12 pairwise ratings (4 participants rating 3 others each), so the totals can range from 12 to 60. The total score in Group 1 is 33, while that in Group 2 is 44. This result verifies the group differences we expected.
It is easy to see from the video recordings that both groups enjoyed listening to the group music very much; for example, some of them even sang along while listening. More interestingly, we found that members of Group 2 even discussed with each other a song that they particularly enjoyed.
During the interview phase, we first asked them to recall the experiment scenario and whether the music influenced their mood. All of them agreed that the music evoked emotions.
Each participant then compared the original GroupFun interface with the experimental interface with ACTI. In Group 1, only one member supported ACTI, while the others thought it unnecessary and too complicated to include additional features, as it cost them more effort. By contrast, members of Group 2 were highly positive about ACTI. As one user said, “It is interesting to listen to music while evaluating their mood. It is even more interesting to compare their results with group emotion.”
Following this question, we asked them whether their emotions were influenced by the group emotion. As we hypothesized, members of Group 1 did not see any influence, but they still enjoyed using GroupFun because of the recommended group music. By contrast, 3 out of 4 participants in Group 2 said they would like to view the group emotion when tagging their own mood evoked by the music.
Finally, we asked participants for their suggestions on the interaction and interface design of GroupFun, and collected some valuable ones. One participant in Group
1 regarded simplicity as an important factor. He would prefer interesting functions, but the interface should not distract users from the main function of GroupFun, which explains why he did not like ACTI. On the other hand, one member of Group 2 said some interesting group activity would help them listen to the suggested songs more carefully and therefore provide more accurate and responsible ratings. Another participant in Group 2 also mentioned that they would like to see more enjoyable interfaces in GroupFun as an entertaining application.
6. CONCLUSIONS
We designed an affective color tagging interface (ACTI) and applied it to GroupFun, a social group music recommender system. We invited two different types of groups to our user study: the members of one group do not know each other well, while the other group consists of friends in a good relationship. Users in both groups prefer simple but interesting interfaces. Even though both groups of users enjoyed listening to the group songs and indicated positive attitudes towards GroupFun, the user study did show obvious differences in group behavior. Members of the group with close relationships were active in discussion with each other, liked interfaces that support group emotion, and considered them interesting and entertaining. Meanwhile, the group whose members do not know each other well considered the affective interface complicated and not useful. This further supports our hypothesis that interface design in social group recommender systems should consider group formation and relationships.
However, this work is still at a preliminary stage and has some limitations. First, as a pilot study, it only surveyed the needs of two groups of users; in order to establish design guidelines, we need more groups, of more types, and larger-scale user studies. Furthermore, a browser-based affective interface is limited: user emotion could instead be captured automatically in an ambient environment. Our future work therefore also includes ambient affective interfaces in social group recommender systems.
7. ACKNOWLEDGEMENT
We thank all participants for their interest in our project, their
valuable time and suggestions.
8. REFERENCES
[1] J. McCarthy and T. Anagnost. MusicFX: An arbiter of group preferences for computer supported collaborative workouts. In Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work, pages 363–372, 1998.
[2] Musicovery. http://musicovery.com/
[3] R. Hu and P. Pu. A study on user perception of personality-based recommender systems. In UMAP 2010, volume 6075 of Lecture Notes in Computer Science, pages 291–302. Springer, 2010.
[4] M. O'Connor, D. Cosley, J. A. Konstan, and J. Riedl. PolyLens: A recommender system for groups of users. In Proceedings of the Seventh European Conference on Computer Supported Cooperative Work, pages 199–218, Bonn, Germany, 2001.
[5] J. Masthoff and A. Gatt. In pursuit of satisfaction and the prevention of embarrassment: Affective state in group recommender systems. User Modeling and User-Adapted Interaction, 16(3):281–319, 2006.
[6] J. T. Hancock, K. Gee, et al. I'm sad you're sad: Emotional contagion in CMC. In Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, pages 295–298, San Diego, CA, USA, 2008.
[7] T. Sy, S. Côté, et al. The contagious leader: Impact of the leader's mood on the mood of group members, group affective tone, and group processes. Journal of Applied Psychology, 90(2):295–305, 2005.
[8] M. Zentner, D. Grandjean, and K. R. Scherer. Emotions evoked by the sound of music: Characterization, classification, and measurement. Emotion, 8:494–521, 2008.
[9] T. Banziger, V. Tran, and K. R. Scherer. The Emotion Wheel: A tool for the verbal report of emotional reactions. In Proceedings of the General Meeting of the International Society for Research on Emotions (ISRE), Bari, Italy, 2005.
PopCore: A system for Network-Centric Recommendations
Amit Sharma
Dept. of Computer Science, Cornell University, Ithaca, NY 14853
asharma@cs.cornell.edu

Meethu Malu
Information Science, Cornell University, Ithaca, NY 14850
mm956@cornell.edu

Dan Cosley
Information Science, Cornell University, Ithaca, NY 14850
danco@cs.cornell.edu
ABSTRACT
In this paper we explore the idea of network-centric recommendations. In contrast to individually-oriented recommendations enabled by social network data, a network-centric approach to recommendation introduces new goals, such as effective information exchange, enabling shared experiences, and supporting user-initiated suggestions, in addition to conventional goals like recommendation accuracy. We are building a Facebook application, PopCore, to study how to support these goals in a real network, using recommendations in the entertainment domain. We describe the design and implementation of the system and initial experiments, and end with a discussion of possible research questions and short-term goals for the system.
Keywords
recommender systems, social recommendation, network-centric
1. INTRODUCTION
Users are increasingly disclosing information about themselves and their relationships on social websites such as Facebook, Twitter, and Google+. These data provide signals
that have been used to augment traditional collaborative
filtering techniques by making network-aware recommendations [8, 9]. Such recommenders use social data to support
prediction, provide social context for the recommendations,
and help alleviate the cold-start problem typically found in
recommender systems. Much of their power comes from social forces, such as homophily, trust and influence, and thus
these recommenders do not just provide better recommendations, they can also support the study of these forces. For
example, in [4], the authors divide a user’s social contacts
into familiarity and similarity networks (proxies for trust
and homophily, respectively), and study their relative impact on the quality of recommendation.
But we can take this a step farther. Just as a user’s network can influence the recommendations he/she receives,
the recommendations, in turn, can also influence the network and alter the underlying social processes. For instance, new recommendations can alter the diversity of the set of items within a network, while group recommendations can strengthen social ties. Thinking of recommendations as being embedded in a network, rather than merely informed by it, provides a new context for analyzing and designing recommender systems, and an important one, given people's increasing interaction and consumption in online social networks.
In this paper, we lay out our approach to exploring this network-centric approach to recommendation system design. We start by discussing the new concerns such systems foreground, focusing on design goals that come from thinking about the social aspects of recommendations embedded in a network, compared to more individually-focused systems. Second, we introduce PopCore, the network-centric recommender system we are building on Facebook to support these goals. We have already deployed an initial proof-of-concept version to conduct initial experiments around network-aware algorithms [13]; here, we discuss how we are evolving the system to support the social design goals. We close by laying out the issues that doing network-centric recommendation raises, most notably the tension between social sharing, privacy, and identity management, and by outlining the initial questions we hope to address as we design and build both the system and the community.
2. NETWORK-CENTRIC DESIGN GOALS
Thinking about recommendations as embedded in a social network raises a number of questions, ranging from using network data to improve individual recommendations to
using people’s behavior to study large-scale patterns of diffusion and other social science forces at work. Given our goal
to design a useful network-centric recommender system, here
we focus on design goals that capture social elements that
are more salient than they would be in a typical e-commerce
recommender application.
Directed Recommendations. An integral part of social experience is sharing information with others person-to-person. Such user-generated directed suggestions have been studied for link-sharing [2] and are ripe for study in other domains, for integration with automated applications, and for application to a social network context. Allowing directed suggestions might encourage people to be more active participants in the system and give them ways to express their identity. These suggestions may also be more accurate than collaborative filtering for certain tasks [6], and in aggregate
support data mining and automated recommendation from this user-generated ‘buzz’ [10].

Figure 1: A mockup of the PopCore interface. By default, recommendations are shown from all three domains, Movies, Books and TV. The controls at the top help a user decide the composition of the list of items, while the lower section provides contextual visualization for the recommendations.
Shared Experiences and Conversation. For many items, especially in the entertainment domain, enjoyment depends not just on personal preferences but also on social experiences such as enjoying the content with other people [3]. Given an item such as a movie, is it possible to predict the people who might join you for it? This is slightly different from group recommendation, which is typically aimed at a predefined group of people [1], and we expect that leveraging network information will make such predictions more effective than earlier approaches that combined individual lists of recommendations [11]. Conversation is another social experience, and since people who disagree about movies have livelier conversations [7], algorithms might focus on recommending items that evoke strong reactions, or even “anti-recommendations”, alongside the traditional goal of accurate recommendation. Systems aimed at individuals are unlikely to recommend hated items, but people often like to talk about them, and this propagation of negative opinion may also help others avoid bad experiences.
Network Awareness. Negative information is a specific kind of awareness, and people have a broad interest in awareness of what is happening in their social network [5]. From the point of view of information, taste, and fashion, it is useful to know who the opinion leaders are, who the active and effective recommenders are, what items are becoming hot or not, and who is knowledgeable about a given topic [12]. Thus, supporting social interaction not just between individuals but at the network level is likely to be valuable in a network-centric recommender system.
3. POPCORE: THE PLATFORM
We now discuss how we are starting to realize these goals in PopCore, a Facebook application we are developing for providing and studying network-centric recommendations. We chose Facebook because it provides us with both network and preference data (through Likes), and also supports a diverse set of item domains. PopCore works by fetching a user's and her friends' profile data on Facebook (subject to the user's permission) and providing recommendations based on those signals. Currently, we restrict PopCore to the entertainment domain, including movies, books, TV shows and music. These categories have a fair amount of activity and broad popular appeal.
3.1 System Description/Design
We decided on a simple three-part interface, as shown in
Fig. 1. The center section contains the main content to be
shown (a list of items), while the top and bottom sections
show content-filtering controls and contextual visualizations
respectively. Each of the interface components, from the
logo on down, is designed to support both the goals outlined
above and the collection of interesting data to study.
PopMix. The top section is the control panel, PopMix, which gives a user full control over the type of items shown in the content section. The controls are designed to be intuitive, inspired by a common interface metaphor: a music equalizer. Just as a music mixer lets a user set the sound output according to his tastes, the PopMix controls give the user control over the domain, genre, popularity and other parameters he may choose. To account for temporal preferences, we also include a special recency knob
that allows users to select the proportion of recent versus older items shown. In addition, users may also view items expected to be available soon. For such current and ‘future’ items, users may notify and invite their friends (chosen manually or from a system-recommended list). This supports the goal of shared consumption.
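A sketch of how such a recency knob might be applied; the 0-1 knob value, the item fields and the one-year cutoff are our assumptions, not PopCore's actual API:

```python
from datetime import datetime, timedelta

def apply_recency_knob(ranked_items, knob, n=20, cutoff_days=365):
    """Return n items mixing recent and older ones in the knob's proportion.
    knob = 1.0 shows only recent items, knob = 0.0 only older ones."""
    cutoff = datetime.now() - timedelta(days=cutoff_days)
    recent = [i for i in ranked_items if i["release_date"] >= cutoff]
    older = [i for i in ranked_items if i["release_date"] < cutoff]
    n_recent = round(knob * n)
    return recent[:n_recent] + older[:n - n_recent]
```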
Eventually, we plan to implement filters that allow people to control social network parameters as well. For instance, a user may choose the relative importance or proportion of network signals for recommendation, such as the link-distance of people from the user, interaction strength, age, or location. People may also select a subset of people manually, or a named group of people, in which case the recommendation morphs more into a stream of items from those sources.
Stackpiles. The middle section shows a number of views relevant to user tasks such as getting automated recommendations, receiving directed suggestions, and remembering suggestions to follow up on. The top right corner of this section contains the tab-buttons to switch views, as shown in Fig. 1.
The primary view presents automatically generated suggestions filtered by the user's PopMix settings. The recommendation algorithm ranks a user's friends by their relevance to the user on a list of parameters, such as interaction strength and the number of commonly Liked items; the most popular items according to this weighted user popularity are then chosen (see the sketch below). In a given view, a user is shown a list of items arranged as cards in distributed stackpiles. Items are grouped into stackpiles based on their similarity, using k-means clustering over their attributes. The number of piles and their distribution is generated dynamically.
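The ranking step could look roughly like the sketch below; the relevance features and data structures are placeholders for the parameters mentioned above, not PopCore's actual code:

```python
from collections import defaultdict

def friend_relevance(user, friend, interactions, likes):
    """Hypothetical relevance of a friend: interaction strength plus the
    number of commonly Liked items."""
    return interactions[(user, friend)] + len(likes[user] & likes[friend])

def weighted_popular_items(user, friends, interactions, likes, n=30):
    """Score items by the summed relevance of the friends who Like them."""
    scores = defaultdict(float)
    for friend in friends:
        w = friend_relevance(user, friend, interactions, likes)
        for item in likes[friend] - likes[user]:  # skip items the user already Likes
            scores[item] += w
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

The returned list would then be clustered into stackpiles, e.g. with k-means over item attribute vectors.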
On flipping an item card, a user gets options to Like, Dislike, or rate the movie on a scale from 0.5 to 5. Providing a number of ways to interact with the item supports rich data collection and, in the case of Dislike, the idea of anti-recommendations. Users may also directly suggest an item to one or more friends using the PopCore button. These suggestions are sent to the target users as a Wall post or private message, depending on the user's preferences. PopCore members can also see these suggestions as a view in the content window by clicking the “N” button at its upper right. People can type in any friend; PopCore also suggests people who may be a good fit for enjoying the item with.
The other main view is a user's personal library, which contains the user's ‘For later’ list and the list of items for which the user has provided strong feedback. The ‘For later’ list may be thought of as a non-linear queue (and can be accessed through the “Q” button). The list benefits from the same stackpiling metaphor, giving the user a more visual and organized view of his/her library. Items are stackpiled based on their similarity and recency in the list by default; however, users have full control to customize the groups.
Visualizations. The bottom section contains visualizations of network activity around items that support the network awareness goals described earlier. The default view is a word cloud showing items weighted by the number of the user's friends who have Liked them. Other visualizations include showing the friends who have contributed the most to the content shown to a user (either through directed recommendations or algorithmically) along with the items that have been recommended (Fig. 2), or a timeline showing the entry and growth of recent items in a user's network. The goal of these visualizations is to help the user navigate the multi-part social activity information in a clear, intuitive fashion.

Figure 2: A visualization showing aggregated behavior among the user's friends, weighted by the number of recommendations they make, with a detailed view of each item that has been recommended in the user's network.
4. ISSUES AND RESEARCH QUESTIONS
We conclude by discussing the major issues we expect
around deploying a real network-centric recommender.
4.1 Trading off social and private elements.
A primary issue is that having access to more information enhances the social discovery and consumption experience, but there is a direct trade-off with privacy. For example, the visualization component is designed to show individual activity about either items or people, and aggregate information about the other. “Activity” in this case might mean making or receiving directed suggestions, rating items, getting recommendations, adding items to one's queue, and so on. Consider showing items as the detail, people as the aggregate, and queuing as the activity. Users might want to know which items their friends are intending to consume, but it may often be the case that an individual using the system will queue a sequence of movies. Her picture will then grow in the visualization as her stream of queued movies changes, immediately conveying her queuing behavior to others and compromising her privacy.
Identity management also comes into play. Having Likes and Dislikes visible to all friends makes it easier for a user's friends to follow his/her interests, but does it then affect Liking behavior, based on concerns about privacy and identity management? Similarly, a queue is a definite indication of interest, and making it accessible to others will directly benefit shared experiences and co-operation, but it is unclear whether users would want a public queue. For now, we have decided to keep everything except Likes and Dislikes private, but to give the user an option to selectively enable items for sharing whenever an action is taken, in the hope of balancing identity, privacy, and discovery without imposing too much work.
4.2 Long-term goals and short-term questions.
The other major issue we see is that building out a network-centric recommender while building up its userbase promises to consume a fair amount of time. Thus, our short-term goal is to answer questions that need no or limited social interaction while the system and userbase develop.
Tradeoffs in doing network-centric recommendation. A network-centric approach affords fast algorithms, real-time capabilities, and modest user requirements compared to conventional collaborative filtering's use of large datasets, but it places a lot of emphasis on a person's immediate social network. This reduces the pool of available items and may also lead to a loss of diversity among the items recommended. We plan to pit our algorithm against state-of-the-art collaborative filters and compare the performance of both in terms of the activity generated around recommendations and users' satisfaction with the automated recommendations they receive from each. Eventually we hope to develop recommendation strategies that use recommendations computed both on the full dataset and in a network-centric way within the user's local network.
Interpreting actions and developing metrics. PopCore provides a wide variety of actions that users can take with an item, including putting it in their queue, publicly Liking or Disliking it, or suggesting it to friends. All these actions may convey signals that can be used both to improve the quality of recommendations and to evaluate them, although we need to learn how to interpret them. What is the difference between a “Like” (which is public) and a 5-star rating (which probably is not)? Sharing an item provides an indication of “interestingness”, but unlike a rating it does not provide a definite scale of enjoyment, and in fact people may share disliked items.
Exploring cross-domain recommendations. The network-centric approach relies heavily on people and their connections, and less on the items. This suggests that we may be able to cross-recommend items based on a user's network information and his/her preferences in a related domain, a task at which collaborative filters have not been very successful. Designing algorithms for cross-domain recommendation within a network is an interesting question in itself.
Social explanations. Right now PopCore uses data harvested from Wikipedia to present additional information about items to help people make decisions. However, that data does not explain why a recommendation was made, which is a commonly wanted feature in real-world recommender systems [14]. Using network information to help justify automated recommendations may be a powerful feature, given the way people already rely on this information to make decisions.
Once we have built the userbase, we will be in a better position to ask questions about the explicit social elements we are designing for. Comparing directed to automatic recommendations, studying the value of awareness of network activity around items, exploring how recommendations and consumption propagate in networks, and developing effective metrics for measuring social outcomes are all questions that we hope to address in the long term, and that we think are key for recommender systems as they move into social networks.
Acknowledgement.
We would like to acknowledge support from NSF grant
IIS 0910664.
5. REFERENCES
[1] S. Amer-Yahia, S. B. Roy, A. Chawlat, G. Das, and
C. Yu. Group recommendation: Semantics and
efficiency. Proc. VLDB Endow., 2:754–765, August
2009.
[2] M. S. Bernstein, A. Marcus, D. R. Karger, and R. C.
Miller. Enhancing directed content sharing on the
web. In Proc. CHI, pages 971–980, 2010.
[3] P. Brandtzag, A. Folstad, and J. Heim. Enjoyment:
Lessons from karasek. In M. Blythe, K. Overbeeke,
A. Monk, and P. Wright, editors, Funology, volume 3
of Human-Computer Interaction Series, pages 55–65.
Springer Netherlands, 2005.
[4] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel,
S. Yogev, and S. Ofek-Koifman. Personalized
recommendation of social software items based on
social relations. In Proc. RecSys, pages 53–60, 2009.
[5] A. N. Joinson. Looking at, looking up or keeping up
with people?: Motives and use of Facebook. In Proc.
SIGCHI, CHI ’08, pages 1027–1036, New York, NY,
USA, 2008. ACM.
[6] V. Krishnan, P. Narayanashetty, M. Nathan,
R. Davies, and J. Konstan. Who predicts better?
Results from an online study comparing humans and
an online recommender system. In Proc. RecSys, pages
211–218, Lausanne, Switzerland, 10/23/2008 2008.
ACM.
[7] P. J. Ludford, D. Cosley, D. Frankowski, and
L. Terveen. Think different: Increasing online
community participation using uniqueness and group
dissimilarity. In Proc. SIGCHI, CHI ’04, pages
631–638, New York, NY, USA, 2004. ACM.
[8] H. Ma, I. King, and M. R. Lyu. Learning to
recommend with social trust ensemble. In Proc.
SIGIR, pages 203–210, 2009.
[9] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King.
Recommender systems with social regularization. In
Proc. WSDM, pages 287–296, 2011.
[10] H. Nguyen, N. Parikh, and N. Sundaresan. A software
system for buzz-based recommendations. In Proc.
SIGKDD, KDD ’08, pages 1093–1096, New York, NY,
USA, 2008. ACM.
[11] M. O’Connor, D. Cosley, J. A. Konstan, and J. Riedl.
PolyLens: A recommender system for groups of users.
In Proc. ECSCW, pages 199–218, Norwell, MA, USA,
2001.
[12] N. S. Shami, Y. C. Yuan, D. Cosley, L. Xia, and
G. Gay. That’s what friends are for: Facilitating ’who
knows what’ across group boundaries. In Proc.
GROUP, GROUP ’07, pages 379–382, New York, NY,
USA, 2007. ACM.
[13] A. Sharma and D. Cosley. Network-centric
recommendation: Personalization with and in social
networks. In IEEE International Conference on Social
Computing, Boston, MA, USA, 2011.
[14] K. Swearingen and R. Sinha. Beyond Algorithms: An
HCI Perspective on Recommender Systems. In ACM
SIGIR Workshop on Recommender Systems, New
Orleans, USA, 2001.
Community-Based Recommendations: a Solution to the
Cold Start Problem
Shaghayegh Sahebi
Intelligent Systems Program, University of Pittsburgh
sahebi@cs.pitt.edu

William W. Cohen
Machine Learning Department, Carnegie Mellon University
wcohen@cs.cmu.edu
ABSTRACT
The “cold-start” problem is a well-known issue in recommendation systems: there is relatively little information about each user, which results in an inability to draw the inferences needed to recommend items to users. In this paper, we propose a solution to this problem based on homophily in social networks: we can use social network information to fill the gap underlying the cold-start problem and find similarities between users. In this study, we use communities, extracted from different dimensions of social networks, to capture the similarities along these different dimensions and, accordingly, help recommendation systems work from the discovered latent similarities. By different dimensions, we mean the friendship network, the item-similarity network, the commenting network, and so on.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval—information filtering
Keywords
Recommendation, Cold-Start, Community Detection, Social
Media
1. INTRODUCTION
Recommendation systems have been developed as one of the possible solutions to the information overload problem. The cold start problem [7] is a typical problem in recommendation systems, and in recent years several studies have tried to address it. For example, in [6] and [7], hybrid recommendation approaches that combine content and usage data are proposed, and in [1], a new similarity measure considering impact, popularity, and proximity is introduced as a solution to this problem. Most of these approaches consider content information or demographic data, not connection information, when performing the recommendations; however, in some cases this information might not be available. In this paper, we suggest user connections and ratings in social networks as a replacement. With the advance of the OpenID protocol and the emergence of new social networks, user activities, connections and ratings in various networks are now more accessible.
Social networks offer connections of different dimensions: people may be friends with each other, they might have similar interests, and they may rate content similarly. These different dimensions can be used to detect communities among people, and using community detection techniques, the collective behavior of users can be predicted. For example, in [5], a comparison has been made between familiarity-network-based and similarity-network-based recommendations. In [4], a typical traditional collaborative filtering (CF) approach is compared to a social recommender/social filtering approach. These studies do not utilize latent community detection techniques to address the cold start problem.
This study aims to use different dimensions of social networks to extract latent communities and use these communities to provide a solution to the cold start problem. In this paper, we first give a brief introduction to community detection methods. Then, we describe the Principal Modularity Maximization method [8] in section 2.1. After that, we propose our approaches to utilizing the community detection algorithm in section 2.2, describe the dataset in section 3, and discuss the experiments in section 4.
2. COMMUNITY DETECTION
With the growth of social network web sites, the number of subjects within these networks has been growing rapidly. Community detection in social media analysis [3] helps us understand more of users' collective behavior. Community detection techniques aim to find subgroups among subjects such that the amount of interaction within a group is greater than the interaction outside it. Multiple statistical and graph-based methods have recently been used for community detection; Bayesian generative models [2], graph clustering approaches, hierarchical clustering, and modularity-based methods [3] are a few examples.
While existing social networks consist of multiple types of subjects and interactions among them, most of these techniques focus on only one dimension of these interactions. Consider the example of blog networks, in which people can connect to each other, comment on each other's posts, post links to other posts in their own blog posts, or blog about similar subjects. By considering only one of these dimensions, e.g. the connection network, we lose important information about the other dimensions, and the resulting communities will represent only a part of the existing ones. In this paper we use the modularity-based community detection method for multi-dimensional networks presented by Tang et al. [8], as briefly described in the following subsection.
2.1 Principal Modularity Maximization
Modularity-based methods assess the strength of a community partition for real-world networks by taking into account the degree distribution of nodes: the modularity measures how far the within-group interaction of the found communities deviates from that of a uniform random graph with the same degree distribution. The modularity measure is defined as follows:
Q = (1 / 2m) Tr(S^T B S)    (1)

B = A - (d d^T) / 2m    (2)
where S is a matrix indicating community membership (S_ij = 1 if node i belongs to community j, and 0 otherwise) and B is the modularity matrix defined in equation 2, which measures the deviation of the network interactions from a random graph. In equation 2, A represents the sparse interaction matrix between the actors of the network, d is the vector of node degrees, and m is the total number of edges. The goal in modularity-based methods is to maximize Q, the strength of the community partition. By relaxing S to a matrix with continuous elements, the optimal S can be computed as the top k eigenvectors of the modularity matrix B [8].
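In NumPy, extracting these structural features from a single dimension of the network might look like the following sketch, a direct transcription of equations 1 and 2 rather than the authors' code:

```python
import numpy as np

def structural_features(A, k):
    """Top-k eigenvectors of the modularity matrix B = A - d d^T / (2m)."""
    d = A.sum(axis=1)                   # node degrees
    m = d.sum() / 2.0                   # number of edges
    B = A - np.outer(d, d) / (2.0 * m)  # modularity matrix (eq. 2)
    vals, vecs = np.linalg.eigh(B)      # B is symmetric; eigenvalues ascending
    return vecs[:, -k:]                 # eigenvectors of the k largest eigenvalues
```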
As noted before, networks can consist of multiple dimensions, such as a friendship dimension, a co-rating dimension, a commenting dimension, etc. Principal Modularity Maximization (PMM) [8] is a modularity-based method for finding hidden communities in multi-dimensional networks. The idea is to integrate the network information of multiple dimensions in order to discover cross-dimension group structures. The method is a two-phase strategy for identifying the hidden structures shared across dimensions: in the first phase, structural features are extracted from each dimension of the network via modularity analysis (structural feature extraction), and then the features are integrated to find a community structure among nodes (cross-dimension integration). The assumption behind this cross-dimension integration is that the structures of all the dimensions in the network should be similar to each other. In the first step, structural features are defined as network-extracted dimensions that are indicative of community structure; they can be computed as a low-dimensional embedding using the top eigenvectors of the modularity matrix. Minimizing the difference among the features of the various dimensions in cross-dimension integration is equivalent to performing Principal Component Analysis (PCA) on them. This results in a continuous community membership matrix S, which shows how much each node belongs to each community. To group all the nodes into discrete community memberships based on these features, a simple clustering algorithm such as k-means is applied to S. As a result of this clustering, each node belongs to exactly one community.
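The two-phase strategy can be sketched as follows, reusing top_k_structural_features from the previous snippet; this is our reading of PMM under the assumption that every dimension is given as an adjacency matrix over the same node set, with PCA and k-means taken from scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pmm_communities(adjacency_matrices, k_features=10, n_communities=30):
    # Phase 1: structural feature extraction, one N x k block per dimension.
    features = [top_k_structural_features(A, k_features)
                for A in adjacency_matrices]
    X = np.hstack(features)        # N x (k_features * n_dims); assumes this >= n_communities
    # Phase 2: cross-dimension integration via PCA -> continuous membership matrix S.
    S = PCA(n_components=n_communities).fit_transform(X)
    # Discretize with k-means: each node is assigned to exactly one community.
    labels = KMeans(n_clusters=n_communities, n_init=10).fit_predict(S)
    return S, labels
```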
2.2 Cold Start Problem and Community Detection in Recommendation Systems
The "cold start" problem [7] occurs in recommendation systems due to a lack of information about users or items. Usage-based recommendation systems work based on the similarity of a user's taste to that of other users, while content-based recommendation takes into account the similarity of the items a user has consumed to other existing items. When a user is a newcomer to a system, or has not yet rated enough items, there is not enough evidence for the recommendation system to build a profile of the user's taste, and the user's profile will not be comparable to other users or items. As a result, the recommendation system cannot recommend any items to such a user. Regarding the cold start problem for items: when an item is new in a usage-based recommendation system, no users have rated it, so it does not appear in any user profile. Since collaborative filtering recommends the items consumed in similar user profiles, this new item cannot be considered for recommendation to anyone.
In this paper, we concentrate on the cold start problem for new users. We propose that if a user is new to one system but has a history in another system, we can use the external profile to recommend relevant items in the new system. As an example, consider a new user on YouTube whose Facebook profile we are aware of. A comprehensive profile of the user can be produced from the movies he/she posted, liked, or commented on in Facebook, and this profile can be used to recommend relevant movies on YouTube to the same user. In this example, the type of the recommended items is the same: movies. Another hypothesis is that a user's interest in specific items might reveal his/her interest in other items. This is the same hypothesis that underlies multi-dimensional network community detection: we expect multiple dimensions of a network to have a similar structure. As an example, if a user is new to the books section of a system but has a profile in the movies section, we can consider users similar to him/her in terms of movie ratings to have a similar taste in books; or if two users are friends, we expect them to show more similar behavior in the system. Utilizing user profiles in other dimensions to predict their interests in a new dimension can thus serve as a solution to the cold start problem.
Community detection can provide us with a group of users similar to the target user considering multiple dimensions. We can use this information in multiple ways, as suggested in the following. In traditional collaborative filtering, the predicted rating of active user a on item j is calculated as a weighted sum of similar users' ratings of the same item (Equation 3), where n is the number of similar users we take into account, α is a normalizer, v_{i,j} is the vote of user i on item j, \bar{v}_i is the average rating of user i, and w(a, i) is the weight of each of these n similar users.
p_{a,j} = \bar{v}_a + \alpha \sum_{i=1}^{n} w(a, i)(v_{i,j} - \bar{v}_i) \qquad (3)
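For reference, a minimal sketch of this prediction rule; the function name and the convention that 0 marks a missing vote are our assumptions.

```python
import numpy as np

def predict_rating(ratings, weights, a, j, neighbors):
    """p_{a,j} = v_bar_a + alpha * sum_i w(a,i) * (v_{i,j} - v_bar_i)  (equation 3).

    ratings: dense user-by-item matrix, 0 = missing vote; weights: N x N weight matrix.
    """
    def mean_vote(u):
        rated = ratings[u] != 0
        return ratings[u][rated].mean() if rated.any() else 0.0

    alpha = 1.0 / (np.abs(weights[a, neighbors]).sum() + 1e-12)   # normalizer
    adjust = sum(weights[a, i] * (ratings[i, j] - mean_vote(i)) for i in neighbors)
    return mean_vote(a) + alpha * adjust
```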
The value of w(a, i) can be calculated in many ways; common methods are cosine similarity, Euclidean similarity, or Pearson correlation on user profiles. We propose and try multiple approaches to community-based collaborative filtering for predicting user ratings. Once we have found latent communities in the data, we need to use this information to help recommend content to users. Our assumption is that users within the same latent community represent a user's interests better than the set of all users. We propose approaches that consist of combinations of the following:
1. Using a community-based similarity measure to calculate w(a, i): This is specifically useful with the PMM community detection algorithm. PMM produces an N × K matrix S indicating multi-dimensional community membership; it shows how much each user belongs to each community. We define the community-based similarity among the users of the system as the N × N matrix W in equation 4 and use it as the weight function in equation 3. Here, N is the total number of users, and each element of the matrix shows the similarity between two users based on the communities they belong to.

W = S S^T \qquad (4)
2. Using co-community users (users within the active user's community) instead of the k-nearest neighbors: We define the predicted rating as in equation 5, in which community(a) indicates the community assigned to the active user by the community detection algorithm. Based on that, only users within the user's community are considered in the CF algorithm.

p_{a,j} = \bar{v}_a + \alpha \sum_{i \in community(a)} w(a, i)(v_{i,j} - \bar{v}_i) \qquad (5)
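Both variants amount to small changes in how the weights and the neighbor set of equation 3 are chosen; a sketch under the same assumptions as above, where S and labels come from the PMM snippet:

```python
import numpy as np

# Variant 1: community-based similarity (equation 4), used as w(a, i).
def community_similarity(S):
    return S @ S.T        # W = S S^T, an N x N similarity matrix

# Variant 2: restrict the neighbor set to the active user's community (equation 5).
def co_community_neighbors(labels, a):
    return [i for i in range(len(labels)) if labels[i] == labels[a] and i != a]
```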
In addition to using the proposed methods to address the cold-start problem, we believe that the second variant is useful when there is a large number of users and, as a result, the traditional collaborative filtering approach takes a lot of space and time. Instead, we can detect the community a user belongs to and use only that community's members to find relevant items for the user.
3. DATASET
The dataset used in this study is based on an online Russian social network called imhonet (www.imhonet.ru). This web site contains many aspects of a social network, including friendships, comments, and ratings on items. We use a dataset that includes the connections between the users of this web site and their ratings of books and movies. The friendship network contains approximately 240,000 connections among around 65,000 users; the average number of friends per user is about 3.5. Additionally, the dataset contains about 16 million ratings on about 50,000 movies and more than 11.5 million ratings on about 195,000 books. Figure 1 shows, on a log-log scale, the number of book ratings per user, and Figure 2 shows the number of ratings per book. As can be seen, the number of ratings per book follows a power law distribution, but the number of book ratings per user does not: it looks like a combination of two power law distributions. That is because imhonet asked its users to rate at least 20 books in order to build more complete user profiles.
[Figure 1: Log-log plot of number of book ratings per user]
[Figure 2: Log-log plot of number of ratings per book]
If we look at the movie rating distribution (omitted due to space restrictions), we can see the same behavior: following imhonet's request, many users rated around 20 movies. Friendship connections between users follow a power law distribution too. To reduce the volume of the data, we used the ratings of users who had at least one connection in the dataset. The resulting dataset contains about 9 million movie ratings by 48,000 users on 50,000 movies and 1.2 million book ratings by 13,000 users on 140,000 books. For the experiments, we picked 10,000 of these users at random.
4. EXPERIMENTS
We separated 10% of users as test users and the remainder as training users. To simulate the cold start problem, we removed all the book ratings of test users from the dataset and tried to predict these ratings for them. We performed 10-fold cross-validation on this data. To apply PMM to the problem at hand, we need to define the various network dimensions. The first is obvious: we can simply use the friendship network itself. Then, we need a method to construct a similarity graph of users from their book and
movie ratings. To do so, we define an edge weight s(r_i, r_j) between each pair of users as follows. Let r_i be the rating vector of user i, let σ_x be the standard deviation of the non-zero elements of a vector x, and let covar(x, y) be the covariance over the positions where both x and y are non-zero. The similarity function is then

s(r_i, r_j) = \frac{\mathrm{covar}(r_i, r_j)}{\sigma_{r_i} \sigma_{r_j}} \qquad (6)

provided that r_i and r_j overlap in at least 3 positions, and 0 otherwise. A similarity score of 0 indicates that no edge should be added. This function is a modified version of Pearson's correlation coefficient that takes into account the standard deviation of a user's full set of ratings instead of just the standard deviation of the overlap with another user. As such it is no longer constrained to the interval [−1, 1] and does not have a direct interpretation, but it better represents the similarity between users. We can then use this function to create graphs from the book and movie ratings.
Once we have the different dimensions of the network, we can run PMM on the friendship, books, and movies graphs to obtain the latent communities. We set the number of communities and the number of neighbors in the collaborative filtering approach to 30 in this experiment. Graphical results of performing PMM are shown in Figures 3 and 4, which were created with the Gephi software (www.gephi.org).
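A sketch of the modified correlation of equation 6, assuming each user's ratings are stored as a dense vector in which 0 means "not rated":

```python
import numpy as np

def rating_similarity(r_i, r_j, min_overlap=3):
    """Modified Pearson correlation (equation 6); returns 0 (no edge) on small overlap."""
    both = (r_i != 0) & (r_j != 0)
    if both.sum() < min_overlap:
        return 0.0
    cov = np.cov(r_i[both], r_j[both])[0, 1]    # covariance on co-rated items only
    sigma_i = r_i[r_i != 0].std()               # std over each user's own ratings
    sigma_j = r_j[r_j != 0].std()
    if sigma_i == 0 or sigma_j == 0:
        return 0.0
    return cov / (sigma_i * sigma_j)
```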
[Figure 3: Communities detected by PMM, shown in a graph drawn with Gephi. Each community forms a square; nodes are imhonet users and links are their friendship connections.]
[Figure 4: Pie chart of the number of users in each community. Each color represents a community.]
We considered different combinations of the approaches proposed in Section 2.2 as follows:
1. We consider a vector space model for book and movie ratings and build user profiles by concatenating these two vectors in a combined space; we then perform traditional collaborative filtering using Pearson correlation on the concatenated vectors (CF),
2. As described in case 1, but performing collaborative filtering for all users using their community memberships as a similarity measure (CF with Community Simil),
3. As in case 2, we perform traditional collaborative filtering within the community (CF within Community),
4. We perform collaborative filtering using the community-based similarity measure within the community (a combination of cases 1 and 2) (CF with Community Simil within Community).
The performance of these different combinations is reported in Figure 5 in terms of nDCG at top-k recommendations for k ranging from one to ten. Notice that collaborative filtering within the members of a community works slightly better than the other methods. Also, performing CF within a community, whether with the community-based similarity measure or with Pearson correlation, works better than performing CF with a constant number of kNN neighbors. On the other hand, using community-based similarity reduces nDCG for both the within-community and global CF methods. While this means that using only community members in CF helps in recommending more interesting items to users, it also means that Pearson correlation works better as a similarity measure for CF than the community-based similarity measure. Overall, the nDCG results we obtain for the cold start problem are reasonable, since the problem is simulated in such a way that, without information from the other dimensions, recommending items to these users would be impossible.
[Figure 5: nDCG at top k recommendations]
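For reference, a minimal sketch of the nDCG@k computation behind Figure 5, assuming binary relevance (whether a recommended item is among the user's held-out books); this is the standard formulation, not code from the study.

```python
import numpy as np

def ndcg_at_k(recommended, relevant, k):
    """nDCG@k with binary gains: DCG of the top-k list over the ideal DCG.

    recommended: ranked list of item ids; relevant: set of held-out item ids.
    """
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    dcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```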
5. CONCLUSIONS AND FUTURE WORK
We showed that performing collaborative filtering within community members is more effective than running collaborative filtering over all users. We also showed that using other dimensions of user interests or user connections helps to achieve a reasonable nDCG in the cold-start setting. Based on our experiments, the number of members in each community follows a power law. It would therefore be interesting to study the performance of the proposed community-based recommendation methods on communities of different sizes, to see whether these methods help more in small, mid-size, or large communities. Another interesting study would be to consider the effect of the number of neighbors in the simple collaborative filtering approach on the results, that is, which number of neighbors works best in collaborative filtering and whether this number is related to the average detected community size. Another future work would be to
consider Bayesian generative models of community detection and study how grouping the connections of a user and assigning them to the different dimensions of the network would affect the quality of recommendations.
6. ACKNOWLEDGMENTS
We would like to thank the administration of imhonet who
kindly provided anonymized data for our study. Also, we
would like to thank Dr. Peter Brusilovsky and Daniel Mills
for their help during this study. This research is partially
supported by the National Science Foundation under Grants
No. 1059577 and 1138094.
7. REFERENCES
[1] H. J. Ahn. A new similarity measure for collaborative
filtering to alleviate the new user cold-starting problem.
Information Sciences, 178(1):37 – 51, 2008.
[2] C. Delong and K. Erickson. Social topic models for community extraction. 2008.
[3] S. Fortunato. Community detection in graphs. Physics
Reports, 486(3-5):75 – 174, 2010.
[4] G. Groh and C. Ehmig. Recommendations in taste related domains: collaborative filtering vs. social filtering. In Proc. of the 2007 international ACM conf. on Supporting group work, GROUP '07, pages 127–136, New York, NY, USA, 2007. ACM.
[5] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel, S. Yogev, and S. Ofek-Koifman. Personalized recommendation of social software items based on social relations. In Proc. of the third ACM conf. on Recommender systems, RecSys '09, pages 53–60, New York, NY, USA, 2009. ACM.
[6] S.-T. Park and W. Chu. Pairwise preference regression for cold-start recommendation. In RecSys '09, pages 21–28, New York, NY, USA, 2009. ACM.
[7] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proc. of the ACM SIGIR conf. on Research and Development in Information Retrieval, pages 253–260. ACM Press, 2002.
[8] L. Tang, X. Wang, and H. Liu. Uncovering groups via heterogeneous interaction analysis. In ICDM, 2009.
Free Text In User Reviews: Their Role In Recommender Systems
Maria Terzi, Maria-Angela Ferrario, Jon Whittle
School of Computing & Communications, InfoLab21, Lancaster University, LA1 4WA Lancaster UK
m.terzi@lancaster.ac.uk, m.ferrario@lancaster.ac.uk, j.n.whittle@lancaster.ac.uk
ABSTRACT
As short free text user-generated reviews become ubiquitous on the social web, opportunities emerge for new approaches to recommender systems that can harness users' reviews in open text form. In this paper we present a first experiment towards the development of a hybrid recommender system which calculates users' similarity based on the content of their reviews. We apply this approach to the movie domain and evaluate the performance of LSA, a state-of-the-art similarity measure, at estimating the similarity of users' reviews. Our initial investigation indicates that users' similarity is not well reflected in traditional score-based recommender systems, which rely solely on users' ratings. We argue that short free text reviews can be used as a complementary and effective information source. However, we also find that LSA underperforms when measuring the similarity of short, informal user-generated reviews, and we argue that further research is needed to develop similarity measures better suited to noisy short text.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Search
and Retrieval – information filtering
General Terms
Algorithms, Human factors, Experimentation
Keywords
Recommender systems, social web, similarity measures, user
reviews.
1. INTRODUCTION
Recommender systems provide personalized suggestions to users about products or services they might be interested in. The two main approaches for developing recommender systems, namely 'content-based' and 'collaborative filtering', are principally based on user ratings. Collaborative filtering approaches compare the ratings of two users to recommend new items to each. They work
by pairing users that have rated the same items similarly, and recommending items to each user based on the rating history of the other paired user. Content-based filtering approaches (e.g., Amazon, http://www.amazon.com/) analyse item features and descriptions to identify items that are likely to interest the user. This is achieved by building user profiles based on the features of items a user has previously rated, then measuring the similarity of a profile with the extracted features of other items. Potentially interesting items are then recommended to the user based on the result of this similarity computation.
The main problem with making recommendations based on item ratings is that "we do not understand how users compose their judgment of the varying attributes of an item into a single rating, or how users' usage of rating schemes will vary between each other" [1]. Indeed, people may rate items similarly, but their ratings may be based on different combinations of item features. Thus, a recommender system based blindly on ratings (without any indication of the underlying reasons why a user gave a particular item a particular rating) may not produce accurate suggestions. For example, a collaborative filtering system recommending movies will match two users that rated a number of movies with the same (high) score. However, one of the users may rate a movie highly because his favorite actor plays the lead role, while the other may do so because he likes the special effects. Thus, any recommendations made to these two users based on their respective ratings of these films could be inappropriate.
Current implementations of the content-based approach utilize multi-rating systems [2, 3], whereby a user rates each aspect of an item. The additional structure in the user rating enables a more effective correlation of the rating with product features, thereby allowing more accurate delivery of recommended content. These systems, however, depend critically on the willingness of the user to explicitly rate various features of an item.
Surprisingly, despite the popularity of publicly viewable reviewing on the social web, limited work has been undertaken to utilize it for improving recommender systems, as recent research has focused mainly on product feature extraction.
Recent work in the field aims to extract item features and sentiment from user reviews to enhance item recommendations. Jakob et al [4] use opinion mining and sentiment analysis techniques to infer the ratings of item aspects from user reviews and improve the prediction accuracy of star-rating-based recommender systems. Long et al [5] propose a method to estimate feature ratings from users' textual reviews. In particular, they focus on 'specialized reviews', that is, reviews that extensively discuss one feature, and suggest that recommender systems can use specialized reviews to make recommendations based on their main feature.
In this paper we undertake a feasibility study for a potential recommender system which calculates users' similarity based on their reviews. The core of such a system is a similarity measure that identifies similar reviews, that is, reviews that refer to the same features or use the same adjectives to describe an item. The proposed system will use free text user reviews to a) match users that provided similar reviews for items and b) match users with items based on their reviews.
Our approach differs from existing work [4, 5] focusing on feature extraction. The proposed system does not attempt to extract predictive features of items from long-text reviews to make recommendations; rather, it aims to pair users with other users and items based on the similarity of short text reviews. We argue that such a system may overcome the limitations of current rating-based approaches, as further described in Section 2.
To investigate the feasibility of such a system, we present the results of an experiment in which a corpus of short free text movie reviews is analyzed to determine whether the content of the reviews matches the rating given by the user. Furthermore, we investigate the accuracy of current rating-based collaborative filtering approaches by measuring the similarity of reviews that have the same movie rating. In addition, we evaluate the effectiveness of LSA (Latent Semantic Analysis), a state-of-the-art similarity measure, for judging the similarity of user reviews. This provides an indication of the feasibility of implementing the proposed system using similarity of user reviews.
The remainder of the article is structured as follows. Section 2 briefly describes the functionality of the proposed system. Section 3 provides a review of state-of-the-art similarity measures for short text, including LSA. Section 4 describes a pilot study undertaken to test the feasibility of the approach and the performance of LSA as a short-review similarity measure. Section 5 provides a summary and outlook for further investigations.
2. TOWARDS A FREE TEXT RECOMMENDER SYSTEM
The recommender system we propose builds on the assumption that a free text review contains the reasons why a rating was given. The system is a hybrid approach consisting of collaborative filtering coupled with a content-based analysis of the user reviews. The collaborative filtering aspect of the system will measure the similarity between two users by comparing the similarity of their reviews (instead of ratings) for the items that both users have reviewed, whilst recommending new items of potential interest to the target user.
The advantage of this system is that it will overcome a common limitation of rating-based collaborative filtering approaches: the system will be able to match users based on the similarity of their reviews, even if they rated an item differently.
To highlight this point, in Figure 1 we present three reviews from the RottenTomatoes (http://www.rottentomatoes.com/) movie review site. All three reviews are about the movie Pirates of the Caribbean: On Stranger Tides, the fourth in the series. Two reviews have the same rating: a traditional rating-based collaborative filtering approach would judge them as similar. Evidently, these two reviews are different: one reviewer referred to the main actor, while the other referred to the storyline. Additionally, each user has a differing opinion of this movie relative to the previous ones in the series. These two users rated the movie with the same score, yet the reasons why they assigned that score differ. In such circumstances, considering these two reviews as equivalent would be naive.
[Figure 1: User reviews from RottenTomatoes.]
Finally, the third reviewer in Figure 1 comments positively about the storyline, a review which is highly similar to the second, despite the difference in the final score. Based on these reviews, the system we propose would class these two reviewers as similar, in sharp contrast to how their similarity would be evaluated by a rating-based approach.
The proposed system will also adopt a content-based approach. It will build item profiles based on the reviews shared about them, and user profiles based on the reviews that users share. The system will then compare the similarity between a target user profile and item profiles to identify items of potential interest for each user. For example, if a user often comments about special effects, his profile would be identified as similar to movies that users have reviewed as having good special effects. The advantage of this approach over current recommender systems is that it enables recommendations based not only on pre-defined movie features, but also on potentially interesting features of a movie.
Since commenting and reviewing is already one of the main user activities on the social web, and no additional interaction from users is required, the proposed system can potentially be applied to any popular review website or social networking site.
For the development of this system we require a measure that is able to identify similar reviews. We propose similarity measures over user reviews as a different approach for measuring similarity and making recommendations, rather than extracting features and sentiment from reviews. For that reason, we present an overview of current similarity measures for short text, to select and evaluate the best-performing one on user reviews.
3. SIMILARITY MEASURES
The majority of user reviews on the social web can be described as noisy short text: short free text often containing abbreviations and spelling errors. Limited work has been undertaken to evaluate similarity measures on such short user-generated text. Baeza-Yates et al [6] propose a method for recommending search engine queries based on similar query logs.
Table 1. Feature similarity scores and explanations
feature similarity | score | explanation
not similar        | 1     | the reviews don't contain reference to the same feature(s)
somehow similar    | 2     | the reviews contain a reference to some of the same features
similar            | 3     | the reviews contain reference to exactly the same feature(s)
To identify similar queries they applied cosine similarity on a variation of the term frequency-inverse document frequency (tf-idf) weighted Vector Space Model (VSM) that uses URL click popularity in place of the idf. Their results indicate that this particular similarity measure cannot be used as a criterion for making recommendations on its own. However, more sophisticated similarity measure techniques exist and could potentially provide better results.
VSM is one of the first approaches for computing semantic similarity. However, it depends on exact matches between words present in queries and documents, and is therefore subject to problems such as polysemy and synonymy. Latent semantic models can moderate these issues by using co-occurrences of words over the entire collection. The simplest and best-known latent model is Latent Semantic Analysis (LSA) [7], where the term vectors are mapped into a lower-dimensional space based on Singular Value Decomposition. LSA was compared with word and n-gram vectors by Lee et al [8] to evaluate their performance on short text comments. Results indicated that LSA performs better than the word and n-gram vectors.
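As an illustration, a minimal LSA pipeline in Python (our sketch with scikit-learn, not the LSA implementation evaluated later): tf-idf vectors are projected into a lower-dimensional space with truncated SVD, and two reviews are compared by cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_similarity(corpus, review_a, review_b, n_components=100):
    """Cosine similarity of two reviews in an LSA space trained on `corpus`."""
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(corpus)
    svd = TruncatedSVD(n_components=min(n_components, X.shape[1] - 1))
    svd.fit(X)
    a, b = svd.transform(tfidf.transform([review_a, review_b]))
    return cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]
```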
Li et al [9] propose a hybrid measure called STASIS that uses information from WordNet (http://wordnetweb.princeton.edu/perl/webwn), a large lexical database of nouns, verbs, adjectives and adverbs grouped into sets of cognitive synonyms, to compute similarity between short texts. The measure STS, proposed by Islam et al [10], calculates similarity between short texts using string similarity along with corpus-based word similarity. Omiotis, proposed by Tsatsaronis et al [11], is a measure of semantic relatedness between texts based on a word-to-word measure that uses semantic links between words from WordNet. LSA, STASIS, STS, and Omiotis were compared by Tsatsaronis et al [11] to evaluate their performance in measuring the similarity of sentences from a dictionary. Their results indicated that Omiotis had the highest Spearman's correlation with human judgments (r=0.8959), with LSA second best (r=0.8614).
Based on the reported results, the availability of the measures, and the indicated performance of LSA on different short text datasets, we decided to carry out an experiment to evaluate the effectiveness of LSA in measuring the similarity of user reviews, compared against human judgments. The experimental design and results follow.

4. EXPERIMENT
A three-phase experiment was carried out to evaluate the feasibility of the proposed system. We used movie reviews as a platform, specifically collecting data from RottenTomatoes, a movie review website that allows users to express their opinions about movies with a scalar rating and a text review. In the first phase, we investigate whether users' reviews contain useful content, that is, content that represents the underlying reasons why a rating was composed. In the second phase, we investigate whether reviews with the same rating are similar, that is, whether they contain references to the same movie features, such as actors and plot. In phase 3, we evaluate the performance of LSA in measuring the similarity of user reviews.

4.1 Data
Our data sample consists of 80 short review pairs (fewer than 200 characters each). Each pair contains two reviews associated with the same score. Four score rating categories (0-15, 20-30, 35-45, 50-60) were used to obtain a broad representation of reviews with different scores. Our sample was extracted using the following steps: first, we randomly selected 100 reviews from each of the top 20 movies as ranked by the RottenTomatoes box office; second, we selected the 600 reviews that contained references to movie features; third, we randomly selected two reviews associated with the same score from each of the four categories. The third step was repeated for each of the top 20 movies.

4.2 Procedure
In the first phase of the experiment, the 2000 reviews were manually classified into two groups: those that reference movie features and those that do not. This classification was made in order to simplify the similarity judgment procedure: a clear definition of similarity enabled participants to systematically judge the similarity of pairs of reviews. In the second phase, the main experiment was carried out by three participants independently. Participants were instructed to rate the similarity of 80 pairs using a three-point scale: 1) not similar, 2) somehow similar, and 3) similar. Similarity was defined as the presence of the same movie features in a pair of reviews. Explicit directions for measuring similarity based on features were given, and are provided in Table 1. In the third phase, we measured the correlation of the human similarity scores with scores produced by LSA. We report LSA values obtained from the LSA website (http://lsa.colorado.edu).

4.3 Results and Discussion
During the first phase, we identified that user movie reviews contain content that can be used for making recommendations. 600 of the 2000 reviews collected contained references to movie features, 1300 reviews contained general adjectives about the movie, and only 100 did not contain any useful content or were in a different language. This suggests that there are mainly two types of movie reviews: "feature-based" reviews, which refer to specific features, and "discussion-based" reviews, which describe the user's general opinion of the movie. This result is promising since it suggests that movie reviews contain content which can be used for making accurate recommendations.

For the second phase, an inter-rater reliability analysis using the Kappa statistic was performed to determine consistency among the raters. The inter-rater reliability was found to be Kappa=0.725 (p<0.001), 95% CI (0.631, 0.804), which indicates substantial agreement. As presented in Figure 2, 42 of the movie review pairs with the same rating (52.5%) were judged as "not similar" (mode 1), 22 review pairs (27.5%) were judged as "somehow similar" (mode 2), and 16 pairs (20%) were judged as "similar" (mode 3). Thus, the majority of the review pairs were not similar, referring to completely different movie features despite identical ratings. Based on the assumption that users' reviews represent the reasons behind a rating, these results suggest that people compose their ratings based on different aspects of the film; thus ratings alone may be an insufficient source of knowledge for making accurate recommendations.

[Figure 2: Number of review pairs per similarity category]

Furthermore, a Spearman's rank correlation coefficient analysis was carried out to examine whether there is a relation between the mode of the human judgments and the LSA similarity measure. The results revealed a significant positive relationship (r=0.406, N=80, p<0.001), although the correlation was weak in strength.

According to related work [11], LSA has a high correlation (r~0.8) with human judgments when measuring similarity, which was the principal motivation for choosing this measure. The weak correlation found in this experiment may be due to the nature of the dataset: free text user reviews are noisy, and often include spelling mistakes, movie-specific terms and abbreviations, which LSA cannot recognize. In light of this, the performance of LSA could be enhanced using a larger dictionary containing terms frequently used in movie reviewing, such as movie features.

The initial results of this experiment suggest that user reviews are a promising source of knowledge for recommender systems. Moreover, the majority of pairs of randomly selected reviews with the same rating score (52.5%) are not similar: even if users rate a movie with the same score, this does not necessarily mean that each rating was based on similar reasons. Thus, current recommender systems using only ratings may lack accuracy. In addition, this experiment shows that while LSA has performed well at judging similarity in other settings, its performance on short user reviews is not sufficient.
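The correlation analysis above is straightforward to reproduce; a sketch using SciPy, where human_modes and lsa_scores are hypothetical arrays holding the per-pair mode of the three judgments and the corresponding LSA values:

```python
from scipy.stats import spearmanr

def correlate_judgments(human_modes, lsa_scores):
    """Spearman rank correlation between human similarity modes and LSA scores."""
    rho, p_value = spearmanr(human_modes, lsa_scores)
    return rho, p_value
```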
5. CONCLUSION AND OUTLOOK
In this article we presented an initial investigation towards the development of a recommender system that makes recommendations based on user-generated free text reviews. The results of this small experiment show that reviews can be used as an alternative way of building user and item profiles for recommender systems, since reviews typically represent the opinion of a user about an item or some of its features.

Moreover, the results show that users' similarity is not well reflected in current rating-based recommender systems: for any two user ratings of a particular item, there is a high possibility that each rating was based on different features of that item. However, replications of this experiment using a larger dataset and more participants are needed to validate these results. Additionally, a limitation of this experiment was its focus only on 'feature-based' reviews. Further experiments will be conducted to determine which reviews are more useful for recommendations and how to extract them automatically.

Results also indicated that LSA performs weakly in measuring the similarity of short user-generated reviews. Further investigation into the reasons for this weak performance is needed, along with the evaluation of other approaches.

In conclusion, additional work must be carried out to evaluate the feasibility of the proposed system and establish its effectiveness over conventional rating-based recommender systems. Our preliminary results, however, do lend weight to the idea that item reviews, one of the most prominent features of the social web, offer a natural source of rating information. While still a work in progress, our preliminary results constitute a useful addition to the current line of research in this field, while complementing research in the field of similarity measures.

6. REFERENCES
[1] Lathia, N., Hailes, S., and Capra, L. 2008. The effect of correlation coefficients on communities of recommenders. In Proceedings of SAC 2008, 2000-2005.
[2] Lakiotaki, K., Matsatsinis, N.F., and Tsoukiàs, A. 2011. Multicriteria User Modeling in Recommender Systems. IEEE Intelligent Systems, 2011, 64-76.
[3] Wang, Y., Stash, N., Aroyo, L., Hollink, L., and Schreiber, G. 2009. Semantic relations for content-based recommendations. In Proceedings of K-CAP 2009, 209-210.
[4] Jakob, N., Weber, S. H., Müller, M.-C., and Gurevych, I. 2009. Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations. In Proceedings of the 1st international CIKM workshop on topic-sentiment analysis for mass opinion.
[5] Long, C., Zhang, J., Huang, M., Zhu, X., Li, M., and Ma, B. 2009. Specialized review selection for feature rating estimation. In Proceedings of the IEEE/WIC/ACM International Joint Conference on Web Intelligence, WI-IAT '09.
[6] Baeza-Yates, R., Hurtado, C., and Mendoza, M. 2004. Query Recommendation Using Query Logs in Search Engines. Current Trends in Database Technology, EDBT 2004.
[7] Landauer, T.K., and Dumais, S.T. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge.
[8] Lee, M.D., Pincombe, B., and Welsh, M. 2005. An empirical evaluation of models of text document similarity. In Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1254-1259.
[9] Li, Y., McLean, D., Bandar, Z., O'Shea, J., and Crockett, K.A. 2006. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering.
[10] Islam, A., and Inkpen, D. 2008. Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data.
[11] Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. 2010. Text Relatedness Based on a Word Thesaurus. Journal of Artificial Intelligence Research (JAIR), 2010, 1-39.
A Multi-Criteria Evaluation of a User-Generated Content
Based Recommender System
Sandra Garcia Esparza, Michael P. O’Mahony, Barry Smyth
CLARITY: Centre for Sensor Web Technologies
School of Computer Science and Informatics
University College Dublin, Ireland
{sandra.garcia-esparza, michael.omahony, barry.smyth}@ucd.ie
ABSTRACT
The Social Web provides new and exciting sources of information that may be used by recommender systems as a
complementary source of recommendation knowledge. For
example, User-Generated Content, such as reviews, tags,
comments, tweets etc. can provide a useful source of item
information and user preference data, if a clear signal can
be extracted from the inevitable noise that exists within
these sources. In previous work we explored this idea, mining term-based recommendation knowledge from user reviews, to develop a recommender that compares favourably
to conventional collaborative-filtering style techniques across
a range of product types. However, this previous work focused solely on recommendation accuracy and it is now well
accepted in the literature that accuracy alone tells just part
of the recommendation story. For example, for many, the
promise of recommender systems lies in their ability to surprise with novel recommendations for less popular items that
users might otherwise miss. This makes for a riskier recommendation prospect, of course, but it could greatly enhance
the practical value of recommender systems to end-users. In
this paper we analyse our User-Generated Content (UGC)
approach to recommendation using metrics such as novelty,
diversity, and coverage and demonstrate superior performance, when compared to conventional user-based and itembased collaborative filtering techniques, while highlighting a
number of interesting performance trade-offs.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous
General Terms
Algorithms, Experimentation
Keywords
Recommender Systems, User-Generated Content, Performance Metrics
1. INTRODUCTION
Recommender systems allow users to discover information, products and services by predicting their needs based
on the past behaviour of like-minded users. Typically these systems are classified into three main categories: collaborative filtering (CF), content-based (CB) and hybrid approaches. In CF approaches [4, 7], users are recommended items that users with similar interests have liked in the past, where users' interests in items are represented by ratings.
In contrast, in CB approaches [14], users are recommended
items that are similar to those items that the user liked in the
past, where item descriptions (e.g. movies can be described
using metadata such as actors, genres etc.) are used to measure the similarity between items. Finally, researchers have
looked at the potential of combining CF and CB approaches
as the basis for hybrid recommendation strategies [5]. However, one of the problems with these systems is that they
need sufficient amounts of data in order to provide useful recommendations, and sometimes neither ratings nor metadata
are available in such quantities. For this reason researchers
have started looking into additional sources of recommendation data. In the last few years, the Social Web has experienced significant growth, with the emergence of new services, such as Twitter, Flixster and Foursquare, whose users
collectively generate very large volumes of content in the
form of micro-blogs, reviews, ratings and check-ins. These
rich sources of information, namely User-Generated Content (UGC), which sometimes relate to products and services (such as movie reviews or restaurant check-ins), are
becoming increasingly plentiful and researchers have already
started to utilise this content for the purposes of recommendation.
Here we focus on UGC in the form of product reviews (see,
for example, Figure 1). We believe this type of information
offers important advantages in comparison to other sources
of recommendation knowledge such as ratings and metadata.
For instance, when reviewing products, users often discuss
particular aspects that they like about them (e.g. “Tom
Hanks and Tim Allen at their best”), as well as commenting
on general interests (e.g. “I love animation”), which is not
always reflected in other sources of recommendation knowledge. In this sense, in a collaborative filtering approach, two
users that have rated movies similarly are treated as people
with similar interests. However, it often happens that users
may like the same movies for different reasons. For example,
one user may have rated a movie with a high score because
they loved the special effects while the other one rated the
same movie highly because they loved the plot and the actors'
performances. In a similar way, a content-based approach usually relies on product descriptions to draw item
similarities, but it does not consider how the user feels towards each of these descriptions. A solution to this problem
is to consider multi-criteria ratings, where different aspects
of a product (or service) are rated separately [2]. For example, when rating restaurants in TripAdvisor, users can
rate along the dimensions of service, food, value and atmosphere. One of the disadvantages of this approach is that
these multi-criteria ratings tend to be predefined and thus
can restrict users from commenting on other aspects of the
product. Social tagging provides a solution to this by allowing users to associate tags with content. In [18] these tags, which reflect users' preferences, are introduced into recommender algorithms called tagommenders, and results show that these algorithms performed better than state-of-the-art algorithms. The authors of [8] also comment on the benefits of using tag-based user profiles and investigate different techniques for building condensed and optimised profiles. However, not many systems provide social tagging functionality; users' ratings and reviews are more common and abundant, and for this reason we believe it is important to consider them for recommendation purposes.
The most common way to evaluate recommenders is to measure how accurate they are at predicting items that users like. However, one of the problems with evaluating the accuracy of top-N recommendation lists is that current metrics (such as precision and recall) reward algorithms that accurately predict test set items (typically constructed by selecting a random subset of users' known ratings), while failing to consider items that are not present in the test sets but which users may in fact like. Hence current recommendation accuracy evaluation techniques are limited, which may result in promising new algorithms being labelled as poor performers. Further, it has been shown that accuracy on its own is insufficient to fully measure the utility of recommendations [19]. Other performance criteria, such as the novelty and diversity of recommended lists, are now acknowledged as also being important for user satisfaction. In addition, the ability to recommend as many products as possible, known as coverage, is also a desirable system property.
While these performance criteria have been considered in the past [20, 24, 25], they have been less explored in the context of UGC-based recommenders. In past work we implemented a recommendation approach where UGC in the form of micro-reviews was used as the source of recommendation knowledge [9]. An evaluation performed on 4 different domains showed that UGC, while inherently noisy, provides a useful recommendation signal and outperformed a variation of a collaborative filtering based approach. Here, we expand on this work by considering additional performance criteria, such as novelty, diversity and coverage, and we compare the performance of our approach with traditional user-based and item-based collaborative filtering approaches [4, 7].
2. RELATED WORK
UGC has been leveraged by recommender systems for different purposes, such as enriching user profiles or extracting ratings from reviews using sentiment analysis techniques. In addition, in the last few years it has been shown that accuracy on its own is insufficient to fully measure the utility of recommendations, and new metrics have been proposed. Here, we provide an overview of some of the work that has been carried out in these areas.
[Figure 1: A Flixster review of the movie 'Toy Story'.]
2.1 Review-based Recommendations
Recent work has focused on leveraging UGC in the form of reviews for recommendations. For example, a methodology for building a recommender system which leverages user-generated content is described in [23]. Although an evaluation is not performed, the authors propose a hybrid of a collaborative filtering and a content-based approach to recommend hotels and attractions, where the collaborative filtering component utilises the review text to compute user similarities in place of traditional preference-based similarity computations. Moreover, they also comment on the advantages of using user-generated content for recommender systems, such as providing a better rationale for recommended products and increasing user trust in the system.
An early attempt to build a recommender system based
on user-generated review data is described in [1]. Here,
an ontology is used to extract concepts from camera reviews and recommendations are provided based on users’ requests about a product; for example, “I would like to know
if Sony361 is a good camera, specifically its interface and
battery consumption”. In this case, the features interface
and battery are identified, and for each of them a score is
computed according to the opinions (i.e. polarities) of other
users and presented to the user.
Similar ideas are described in [3], which looks at using user-generated movie reviews from IMDb in combination with
movie meta-data (e.g. keywords, genres, plot outlines and
synopses) as input for a movie recommender system. Their
results show that user reviews provide the best source of
information for movie recommendations, followed by movie
genre data. In addition, in [12, 15], the number of ratings
in a collaborative filtering system is increased by inferring
new ratings from user reviews using sentiment analysis techniques. While [15] generate ratings for Flixster reviews by
extracting the overall sentiment expressed in the review, [12]
extract features and their associated opinions from IMDb
reviews and a rating is created by averaging the opinion
polarities (i.e. positive or negative) across the various features. Both approaches achieve better performance when
using the ratings inferred from reviews when compared to
using ratings predicted by traditional collaborative filtering
approaches.
2.2 Beyond Accuracy Metrics
Typically, the performance evaluation of recommendation algorithms is done in terms of accuracy, which measures how well a recommender system predicts the items that users like. However, it has been shown that accuracy on its
own is insufficient to fully measure the utility of recommendations and over the last few years new metrics have been
proposed [13,19]. Here we are interested in coverage, novelty
and diversity.
Coverage is a measure of the domain of items in the system over which the recommender system can form predictions or make recommendations [10]. Sometimes algorithms
can provide highly accurate recommendations but only for
a small portion of the item space. Systems with poor coverage may be capable of just recommending well-known or
popular products, while less mainstream products (belonging to the long tail) are rarely if ever recommended, resulting in a poor experience for recommendation consumers and
providers alike. For this reason it is useful to examine accuracy and coverage in combination; a “good” recommender
will achieve high performance along both dimensions.
Novelty measures how new or different recommendations
are from a user’s perspective. For instance, a movie recommender that keeps suggesting movies that the user is already
aware of is unlikely to be very useful to the user, although
the recommendations may well have high accuracy. Diversity refers to how dissimilar the products in recommendation
lists are. For example, a music recommendation list where
most albums are from the same band will have a low diversity, and such a situation is not desirable. Indeed, in [20], it
is argued that diversity can sometimes be as important as
similarity.
Current research is focused on improving the novelty and
diversity of the recommendations without sacrificing accuracy. For instance, in [24] the authors suggest an approach
to recommend novel items by partitioning the user profile
into clusters of similar items. Further, in [25], the authors
introduce diversity in their recommendation lists and results
show that although the new lists show a decrease in accuracy, users are more satisfied with the diversified lists. UGC
in the form of tags has also been used to introduce diversity in recommendation lists and results showed that their
method was also able to improve the accuracy [22].
The goal of this paper is to explore the advantages of using UGC in the form of reviews in a recommender system. Similar work has been discussed above; however, these approaches have been evaluated only in terms of accuracy, and other metrics, which have proven to be equally important, have not been taken into account. In this paper we provide an evaluation of a UGC-based recommendation approach which considers metrics that go beyond accuracy; in particular, we are interested in the properties of novelty, diversity and coverage.
3. METHODOLOGY
The goal of this paper is to evaluate our UGC-based recommender in a multi-criteria evaluation. The recommender is similar to that described in our previous work [9], where UGC in the form of product micro-reviews is used as the source of recommendation knowledge. In particular, we propose an index-based approach where users and items are represented by the terms used in their associated reviews. This section describes two variants of our index-based approach and two variants of collaborative recommenders, which are used as benchmark techniques.
[Figure 2: An index-based approach for recommendation.]
3.1 Index-based Approaches
The approach to recommend products to users consists
of an index-based approach where products and users are
represented by the terms in their associated reviews. In
this initial work we only consider terms from positive user
reviews (i.e. reviews which are associated with a rating of
greater than 3 on a 5 point scale). The reason for this is that
in such reviews users tend to talk about the aspects they like
in products. This poses some obvious limitations discussed
in section 5 which will be addressed in future work.
Our recommendation approach involves the creation of a
product index, where each product Pi can be viewed as a
document made up of the set of terms, t1 , . . . , tn , used in its
associated user reviews, r1 , . . . , rk , as per Equation 1.
P_i = \{r_1, \ldots, r_k\} = \{t_1, \ldots, t_n\} \qquad (1)
Using techniques from the information retrieval community, we can apply weights to the terms associated with a particular product according to how representative they are of that product. In this work we use the well-known TFIDF approach [17] to term weighting (Equation 2). Briefly, the weight of a term t_j in a product P_i, with respect to some collection of products P, is proportional to the frequency of occurrence of t_j in P_i (denoted n_{t_j,P_i}), but inversely proportional to the frequency of occurrence of t_j in P overall, thus giving preference to terms that help to discriminate P_i from the other products in the collection. We use Lucene (http://lucene.apache.org/) to provide this indexing, term-weighting and retrieval functionality.

\mathrm{TFIDF}(P_i, t_j, P) = \frac{n_{t_j,P_i}}{\sum_{t_k \in P_i} n_{t_k,P_i}} \times \log \frac{|P|}{|\{P_k \in P : t_j \in P_k\}|} \qquad (2)
Similarly, we can create the profile of a user by using the
terms in their associated (positive) reviews. To provide recommendations for this user we can use their profile as a
query into the product index, and return the top-N list of
products that best match the query as recommendations.
In addition to the vanilla index-based approach (IB) outlined above, we also consider a variation (IB+) where only nouns and adjectives from reviews (extracted using the Stanford Parser, http://www-nlp.stanford.edu/software/lex-parser.shtml) are used to form the product index and user queries. We also considered extracting only nouns from reviews, but better results were obtained when adjectives were included. Further, for both index-based approaches we applied stemming and stop-word removal, and removed words that appeared in more than 60% of user profiles and products in order to exclude common domain-specific words (such as "movie", "plot" and "story").
This index-based approach is illustrated in Figure 2. One
of the advantages of this approach is that user profiles can be
independent from the product index (i.e. users may not have
reviewed products from the product index), allowing us to
use a product index from one particular source (e.g. a movie
index created from Flixster reviews) with user profiles from
another source (e.g. user interests extracted by analysing
that user’s Twitter messages). This independence allows
for cross-domain possibilities which in turn can be used to
mitigate the cold start problem.
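To make the mechanics of this approach concrete, the following minimal Python sketch rebuilds the product index and query step using plain dictionaries rather than Lucene; the function names and data layout are illustrative assumptions, not the authors’ implementation.

import math
from collections import Counter

def build_index(product_reviews):
    # product_reviews: product id -> list of token lists, one per positive review.
    index = {p: Counter(t for review in reviews for t in review)
             for p, reviews in product_reviews.items()}
    n_products = len(index)
    # Document frequency: in how many products does each term occur?
    df = Counter(t for counts in index.values() for t in counts)
    tfidf = {}
    for p, counts in index.items():
        total = sum(counts.values())
        # Equation 2: term frequency within the product, discounted by
        # how common the term is across the whole collection.
        tfidf[p] = {t: (n / total) * math.log(n_products / df[t])
                    for t, n in counts.items()}
    return tfidf

def recommend(user_terms, tfidf, top_n=10):
    # Treat the user's (positive) review terms as a query and score each
    # product by the summed weights of the matching terms.
    scores = {p: sum(weights.get(t, 0.0) for t in user_terms)
              for p, weights in tfidf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

In practice Lucene handles the weighting and retrieval internally with its own scoring; the sketch simply makes the role of Equation 2 explicit.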
3.2 Collaborative Filtering Approaches

We study two variations of collaborative filtering (CF): user-based and item-based techniques. For these techniques, entries in the user-product ratings matrix consist of 1 (if the user likes a product, i.e. has assigned a rating of ≥ 4) or the special symbol ⊥ (meaning that the user has not reviewed the product or has assigned a rating of ≤ 3).

User-based CF (UBCF) [4]. In order to provide a top-N list of recommended items for a target user, the k most similar users (neighbours) to that user are selected using cosine similarity. Then, the union of the products rated by each of the neighbours (less those already present in the target user’s profile) is returned, ordered by descending frequency of occurrence. The top-N products are returned as recommendations.

Item-based CF (IBCF) [7]. In this case, recommended products are generated by first taking the union of the k most similar products (using cosine similarity) to each of the products in the target user’s profile, again ordered by descending frequency of occurrence. After removing the products already present in the target user’s profile, the top-N products are selected as recommendations.

4. EVALUATION

4.1 Metrics

We use four different metrics in order to evaluate our index-based recommendation approach.

Accuracy measures the extent to which the system can predict the items that the users like. We measure this in terms of the F1 metric (Equation 3), which is the harmonic mean of precision and recall [21].

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \quad (3)$$

Novelty measures how new or different recommended products are to a user. Typically, the most popular items in a system are the ones that users will be the most familiar with. Likewise, less mainstream items are more likely to be unknown by users. In related work novelty is often based on item popularity [6, 25]. Here we follow a similar approach and compute the novelty of a product as one minus its popularity, where the popularity ($\mathrm{popularity}(i)$) of a product, $i$, is given by the number of reviews submitted for the product divided by the maximum number of reviews submitted over all products. Hence, the novelty of a top-N list of recommended products is computed as the average of each product’s novelty, as per Equation 4.

$$\mathrm{Novelty} = \frac{\sum_{i=1}^{N} (1 - \mathrm{popularity}(i))}{N}. \quad (4)$$

We define the diversity of a top-N list of recommended products as the average of their pairwise dissimilarities [20]. Let $i$ and $j$ denote two products and let $\mathrm{dissimilarity}(i, j) = 1 - \mathrm{similarity}(i, j)$; then, assuming a symmetric similarity measure, diversity is given by:

$$\mathrm{Diversity} = \frac{2 \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (1 - \mathrm{similarity}(i, j))}{N \times (N - 1)}. \quad (5)$$

^2 Nouns and adjectives are extracted from review text using the Stanford Parser (http://www-nlp.stanford.edu/software/lex-parser.shtml).
The similarity, similarity(i, j), between two products is
computed using cosine similarity on the corresponding columns
of the ratings matrix; for normalisation purposes, this is
divided by the maximum similarity obtained between two
products. Similar trends were obtained by computing similarity over the documents in the product index.
Finally, the ability to make recommendations for as many
products as possible is also a desirable system property. This
is reflected by the coverage metric, which for a given user is
defined as the percentage of the unrated set of products for
which the recommender is capable of making recommendations. Then the overall coverage provided by the system is
given by the mean coverage over all users.
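A compact rendering of Equations 3–5 in Python may help make the metrics precise; this is a minimal sketch in which all names are illustrative and similarity() is assumed to return normalised values in [0, 1].

def f1(precision, recall):
    # Equation 3: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def novelty(top_n, popularity):
    # Equation 4: average of one minus each recommended product's popularity.
    return sum(1 - popularity[i] for i in top_n) / len(top_n)

def diversity(top_n, similarity):
    # Equation 5: average pairwise dissimilarity of the recommended products.
    n = len(top_n)
    total = sum(1 - similarity(top_n[i], top_n[j])
                for i in range(n - 1) for j in range(i + 1, n))
    return 2 * total / (n * (n - 1))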
4.2 Dataset and Methodology
In this paper we consider Flixster^3 as our source of data.
Flixster is an online movie website where users can rate
movies and also write reviews about them. We selected reviews authored in the English language only and performed
some standard preprocessing on the reviews, such as removing stop-words, special symbols, digits and multiple character repetitions (e.g. we reduce cooool to cool). Further, we
selected users and movies with at least 10 associated positive reviews. This produced a total of 43179 reviews (and
ratings) by 2157 users on 763 movies. The average number
of reviews per user is 20 and per movie is 57.
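For instance, the repetition collapsing can be done with a single regular expression; collapsing runs to two characters (so that cooool becomes cool) is an assumption consistent with the example above, not necessarily the authors’ exact rule.

import re

def collapse_repetitions(text):
    # Replace any run of three or more identical characters with two.
    return re.sub(r'(.)\1{2,}', r'\1\1', text)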
To evaluate each algorithm, first we randomly split each
user’s reviews and ratings into training (60%) and test (40%)
sets. Second, we create the product index or ratings matrix
using the training data. Then, for each user we produce
a top-N list of recommendations using the approaches described in Section 3, and compute accuracy using the test
set. We also compute diversity, novelty and coverage as described in Section 4, and compute averages across all users.

^3 http://www.flixster.com

Figure 3: (a) Accuracy, (b) novelty, (c) diversity and (d) coverage for the index-based and collaborative filtering approaches.
We repeated this procedure five times and again averaged
the metrics.
We note that when generating user recommendations using the index-based approaches, we first remove the reviews
in each user profile from the product index. For the CF
approaches, we performed evaluations using different neighbourhood sizes and found that the best accuracy for UBCF
and IBCF was achieved for k = 200 and k = 100, respectively. These are the values used when comparing the CF
algorithms against the index-based approaches.
4.3 Results
The results shown in Figure 3 indicate that the collaborative filtering approaches outperformed both index-based
approaches in terms of accuracy, with user-based CF performing best overall. The index-based approach using nouns
and adjectives (IB+) actually performed slightly worse than
the standard bag-of-words approach (IB), which indicates a
loss of information using only the terms selected.
Although the accuracy results may seem discouraging at
first, our index-based approaches outperformed both CF approaches in terms of diversity, novelty and coverage. The
worst performing approach in terms of coverage was IBCF
(only 63.23%), while both IB and IB+ achieved in excess
of 90% coverage. In terms of novelty, IB+ was the best
approach, with 63% novelty for top-10 recommended lists,
followed by IB and IBCF, with UBCF providing the poorest novelty (approximately 34%). In terms of diversity, the
index-based approaches performed significantly better, with
IB+ providing 87% diversity for top-10 recommended lists,
compared to 77% for the best CF approach (IBCF).
It is interesting to note that in the above results, the
neighbourhood size (k) for both CF approaches was tuned
based on delivering the best accuracy performance. An interesting question that arises is how well the CF approaches
would perform if neighbourhood sizes were tuned according to other performance criteria (e.g. novelty or diversity) — would the recommendation accuracy provided by
CF still outperform the index-based approaches? To answer
this question, we repeated the above analysis using different neighbourhood sizes for CF. In particular we computed
accuracy versus novelty and diversity for the two CF approaches and compared them with the IB+ approach.
Results are presented in Figure 4 and show that by reducing the neighbourhood size, CF can achieve better diversity
and novelty than when using a bigger neighbourhood size.
However, this comes at the cost of reduced accuracy performance. In fact, for neighbourhood sizes of k = 10, the
novelty and diversity performance of the CF approaches is
closest to that achieved by the IB+ approach but at the cost
of poorer (in the case of UBCF) or almost equivalent (in the
case of IBCF) accuracy compared to IB+.
For comparison purposes, we also show novelty and diversity versus coverage in Figure 5. It can be seen that, for the
CF approaches, a higher coverage is achieved when using
larger neighbourhood sizes, although neither CF approach
can beat the coverage achieved by the IB+ approach. For
the CF approaches, the results in Figure 5 are also interesting in that they show a clear tradeoff exists between optimising for coverage performance on the one hand (larger neighbourhood sizes) and optimising for novelty and diversity performance on the other (smaller neighbourhood sizes).
Figure 4: Novelty vs. Accuracy for (a) UBCF and (b) IBCF and Diversity vs. Accuracy for (c) UBCF and (d) IBCF.
Further, examining the results in Figures 4 and 5 together,
although similar accuracy to IB+ was achieved by the CF
approaches when using a small neighbourhood (k = 10), we
can also see that the coverage achieved at k = 10 is only 9%
and 11% for UBCF and IBCF, respectively. This scenario
is obviously never desirable since 90% of the system’s items
cannot be recommended, showing that if we want to maintain coverage while having high levels of accuracy, novelty
and diversity, then the UGC-based approaches are preferable to both CF approaches in this evaluation setting.
Hence we can conclude that the index-based approaches
compare quite favourably to the collaborative filtering techniques, when the range of performance metrics evaluated in
this work are taken into consideration. This is a noteworthy result, and underlines the potential of the UGC-based
recommenders as described in this paper.
5. DISCUSSION AND FUTURE WORK
While past work has focused on improving recommendation accuracy, recent work has shown that other metrics need
to be explored in order to improve the user experience. In
fact, algorithms which recommend novel items are likely to
be more useful to end-users than those recommending more
popular items. Such algorithms may, for example, introduce users to an entirely new space of items, regardless of
the rating that they might actually give to said novel items.
Further, a live evaluation performed by [25] showed that
users preferred more diverse lists instead of more accurate
(and less diverse) lists.
In this paper we consider a multi-criteria performance
evaluation and, although trade-offs exist for all evaluated
approaches, we believe the findings indicate that the UGC-based approach offers the best trade-off among all metrics and algorithms considered. For example, in order to achieve similar levels of novelty and diversity using the CF approaches as achieved by the UGC-based approach, a significant loss in coverage must be accepted.
We believe a reason for the higher novelty and diversity
performance achieved by the UGC-based approach is that
profiles created using this technique often reflect particular
aspects and topics that users are interested in, allowing for
more diverse (and often novel) recommendation lists compared to using ratings alone. For example, even if a user
rated only science fiction movies, there will be specific aspects that differentiate this user from another who is also
a fan of this genre (e.g. one may prefer aliens and strange
creatures while the other might prefer the more romantic
elements in the storyline).
There is an interesting range of future work to be carried out in order to improve our current approach and to explore other benefits of UGC:
• Enhancing the Real-Time Web. In this paper we
used UGC in the form of long-form movie reviews.
However we are also interested in UGC in the form
of Real-Time Web data (e.g. Twitter messages) which
captures users’ preferences in real-time. For instance,
people often post messages about the movies they liked,
their favourite football team or their dream vacation
experience. This data facilitates the building of rich
user profiles which in turn allows recommenders to
better address users’ needs. In fact, in past work
[9], we demonstrated that micro-blogging messages can
provide a useful recommendation signal despite their
short-form and inconsistent use of language. Further,
as discussed above, in our index-based approach user
profiles can be independent from the product index,
allowing us to use a product index from one particular source (e.g. Flixster) with user profiles from another source (e.g. Twitter). This allows for cross-domain possibilities which will be explored in future work.

Figure 5: Novelty vs. Coverage for (a) UBCF and (b) IBCF and Diversity vs. Coverage for (c) UBCF and (d) IBCF.
• Improved index-based approach. One of the limitations of our approach is that it is based on positive reviews. In future we will also consider negative
reviews which may be useful to avoid recommending
certain products to users. Another limitation is that
positive reviews may have negative aspects (e.g. ‘I
don’t generally like romantic comedies but I loved this
one’) or negative reviews may have positive aspects
(e.g. ‘Susan Sarandon was the only good thing in this
movie’). To address this problem, we will extend our
approach by using feature extraction techniques together with sentiment analysis [11, 16] in order to create richer user profiles and product indexes. Choosing
the optimum terms to represent users and items is also
a problem to be solved in order to reduce the sparsity of
the term-based profiles. Evaluating the effect of these
improvements on various performance metrics will also
be carried out in future work.
• The cold-start problem. In future work we will also
explore how UGC can help in solving (or at least in
mitigating) the well-known cold-start problem, which
is related to the new user and new item problems.
One of the advantages of using UGC as recommendation knowledge is that it facilitates a cross-domain
approach for users who have not reviewed any products from a particular product index. If data relating
to such users can be sourced from other domains (e.g.
Twitter or Facebook feeds), then they can still benefit
from recommendations. Further, a system which does
not have reviews for particular products could provide
recommendations for these products by building an index based on reviews from another system.
• Integrating UGC in traditional recommenders.
Collaborative filtering algorithms have proven to be
effective when there is a sufficient amount of ratings
available, but their performance decreases when the
number of ratings is limited. Our work shows preliminary evidence that UGC-based approaches have the
potential to complement recommendation knowledge
in the form of ratings and to improve the response
of recommender systems to data sparsity. In future
work we will study the performance of a hybrid recommender that benefits from the strengths of multiple
data sources.
6. CONCLUSIONS
In this paper we have considered an alternative source of
recommendation knowledge based on user-generated product reviews. Our findings indicate that recommenders utilising this source of knowledge can deliver comparable recommendation performance — across a range of criteria — compared to traditional CF-based techniques. In future work
we would like to study other properties of UGC for recommendation, such as the ability to address the well known
cold-start problem. Further, despite the simplicity of the
current approach, our results are promising and further improvements can be incorporated in order to increase recommendation accuracy, while trying to maintain high levels of
coverage, novelty and diversity.
7. ACKNOWLEDGMENTS
Based on work supported by Science Foundation Ireland,
Grant No. 07/CE/I1147.
8. REFERENCES

[1] S. Aciar, D. Zhang, S. Simoff, and J. Debenham. Recommender system based on consumer product reviews. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI-IATW ’06), pages 719–723, Washington, DC, USA, 2006. IEEE Computer Society.
[2] G. Adomavicius and Y. Kwon. New recommendation techniques for multicriteria rating systems. IEEE Intelligent Systems, 22(3):48–55, 2007.
[3] S. Ahn and C.-K. Shi. Exploring movie recommendation system using cultural metadata. Transactions on Edutainment II, pages 119–134, 2009.
[4] J. S. Breese, D. Heckerman, and C. M. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In G. F. Cooper and S. Moral, editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI ’98), pages 43–52. Morgan Kaufmann, 1998.
[5] R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
[6] O. Celma and P. Herrera. A new approach to evaluating novel recommendations. In Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys ’08), pages 179–186, New York, NY, USA, 2008. ACM.
[7] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1):143–177, 2004.
[8] C. S. Firan, W. Nejdl, and R. Paiu. The benefit of using tag-based profiles. In Proceedings of the 2007 Latin American Web Conference, pages 32–41, Washington, DC, USA, 2007. IEEE Computer Society.
[9] S. Garcia Esparza, M. P. O’Mahony, and B. Smyth. Effective product recommendation using the real-time web. In Proceedings of the 30th International Conference on Artificial Intelligence (SGAI ’10), 2010.
[10] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, January 2004.
[11] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04), pages 168–177, New York, NY, USA, 2004. ACM.
[12] N. Jakob, S. H. Weber, M. C. Müller, and I. Gurevych. Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations. In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion (TSA ’09), pages 57–64, New York, NY, USA, 2009. ACM.
[13] S. M. McNee, J. Riedl, and J. A. Konstan. Making recommendations better: an analytic model for human-recommender interaction. In CHI ’06 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’06), pages 1103–1108, New York, NY, USA, 2006. ACM.
[14] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In P. Brusilovsky, A. Kobsa, and W. Nejdl, editors, The Adaptive Web: Methods and Strategies of Web Personalization, pages 325–341. Springer-Verlag, Berlin, Heidelberg, 2007.
[15] D. Poirier, I. Tellier, F. Françoise, and S. Julien. Toward text-based recommendations. In Proceedings of the 9th International Conference on Adaptivity, Personalization and Fusion of Heterogeneous Information (RIAO ’10), Paris, France, 2010.
[16] A.-M. Popescu and O. Etzioni. Extracting product features and opinions from reviews. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT ’05), pages 339–346, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[17] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986.
[18] S. Sen, J. Vig, and J. Riedl. Tagommenders: connecting users to items through tags. In Proceedings of the 18th International Conference on World Wide Web (WWW ’09), pages 671–680, New York, NY, USA, 2009. ACM.
[19] G. Shani and A. Gunawardana. Evaluating recommendation systems. In Recommender Systems Handbook, pages 257–298, 2009.
[20] B. Smyth and P. McClave. Similarity vs. diversity. In Proceedings of the 4th International Conference on Case-Based Reasoning: Case-Based Reasoning Research and Development (ICCBR ’01), pages 347–361, London, UK, 2001. Springer-Verlag.
[21] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979.
[22] C. Wartena and M. Wibbels. Improving tag-based recommendation by topic diversification. In Proceedings of the 33rd European Conference on Advances in Information Retrieval (ECIR ’11), pages 43–54, Berlin, Heidelberg, 2011. Springer-Verlag.
[23] R. T. A. Wietsma and F. Ricci. Product reviews in mobile decision aid systems. In Pervasive Mobile Interaction Devices (PERMID 2005), pages 15–18, Munich, Germany, 2005.
[24] M. Zhang and N. Hurley. Statistical modeling of diversity in top-N recommender systems. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT ’09), volume 01, pages 490–497, Washington, DC, USA, 2009. IEEE Computer Society.
[25] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web (WWW ’05), pages 22–32, New York, NY, USA, 2005. ACM.
Personalized Recommendation by Example in Social
Annotation Systems
Jonathan Gemmell, Thomas Schimoler, Bamshad Mobasher, Robin Burke
Center for Web Intelligence
School of Computing, DePaul University
Chicago, Illinois, USA
jgemmell,tschimoler,mobasher,rburke@cs.depaul.edu
ABSTRACT
Resource recommendation by example allows users to explore new
resources similar to an example or query resource. While common in many contemporary Internet applications this function is
not commonly personalized. This work looks at a particular type
of Internet application: social annotation systems, which enable users to annotate resources with tags. In these systems recommendation by example naturally occurs as users often navigate through the resource space by clicking on resources. We propose cascading hybrids to combine personalized and non-personalized approaches. Our extensive evaluation on three real-world datasets reveals that personalization is indeed beneficial, that cascading hybrids can effectively integrate personalized and non-personalized recommenders, and that the characteristics of the underlying data influence the effectiveness of the hybrids.
1. INTRODUCTION
The discovery of new and interesting resources remains a central function of the World Wide Web. As the Web has evolved to encompass new forms of social interaction, so too have new forms of resource interaction developed. In the so-called Social Web users rate,
annotate, share, upload, promote and blog about online resources.
This abundance of data offers new ways to model resources as well
as users, and injects new life into old resource discovery tasks. In
this work we focus on a particular type of resource discovery, recommendation by example, in which a user asks for resources similar to an example.
Users in the movie domain may ask for more movies like “The
Godfather.” The recommendation engine is then responsible for
generating a list of similar movies, such as “The Deer Hunter.”
The user could then select this new film producing a new recommendation set. In this manner the user can discover new movies as
he navigates through the resource space.
This type of functionality is common in today’s Internet. When
a user views an item in Amazon^1, Netflix^2 or LastFM^3, the system
often presents the user with related items. However, these functions
are often not personalized.
While the example (or query) resource is clearly a necessary
input to a recommendation by example algorithm, we assert that
the user profile is equally important. Two users may both select
“The Godfather.” However, one might be more interested in crime
drama, while the other is particularly interested in Marlon Brando.
Recommendation by example algorithms must take into account
the user’s preferences in order to maximize the benefit of the system.
In this paper we focus on a particular type of system, the social
annotation system, in which users interact with the application by
annotating resources with tags. Users are drawn to these applications due to their low entry barrier and freedom from preconceived
hierarchies; users can annotate a resource with any tag they wish.
The result is a rich landscape of users, resources and tags. For our
experimental study we make the assumption that if a user annotates
two resources with identical tags then the two resources are — in
the eyes of that user — similar. This assumption permits our evaluation of personalized recommendation by example algorithms.
We evaluate several component recommenders, some of which ignore the user profile and some of which ignore the example resource. We then construct cascading hybrid recommenders to exploit
the benefits of both. A recommendation list of size k is produced
through one algorithm. That list is then reordered by a second algorithm and n elements are presented to the user.
Our extensive evaluation conducted on three real world datasets
found that personalization improves the effectiveness of recommendation by example algorithms. Cascading hybrid recommenders
effectively incorporate algorithms that focus on the user profile
with those that rely on the example resource. Finally, differences in
the underlying characteristics of the social annotation systems require different combinations of component recommenders to maximize the effectiveness of the cascading hybrid.
In the next Section we explore related work. In Section 3, we
formalize the notion of personalized resource recommendation by
example, define six algorithms based on cosine similarity and collaborative filtering and present our cascading hybrid models. Our
experimental results and evaluation follow in Section 4. We conclude the paper with a general discussion of our results.
^1 www.amazon.com
^2 www.netflix.com
^3 www.lastfm.com

Figure 1: An example of resource recommendation by example: similar artists to Radiohead are recommended in LastFM. If the user were to select Arcade Fire from the list, a new list of similar artists would be presented. In this manner the user can explore the resource space by moving from resource to resource.

2. RELATED WORK

The notion of recommendation by example is a core component of information retrieval systems, particularly in the domain
of e-commerce recommenders [27, 32]. Early approaches include
association rule mining [1] and content-based classification [17].
Content-based filtering has been combined with collaborative filtering in several ways [2, 3, 22] in order to improve prediction effectiveness for personalized retrieval. More generally, hybrid recommender systems [4] have been shown to be an effective method
of drawing out the best performance among several independent
component algorithms. This work draws from these prior efforts
in applying a hybrid recommender to the domain of social annotation systems and specifically accommodating a recommendation
by example query.
There has been considerable work on the general recommendation problem in social annotation systems. Generalizable latentvariable retrieval model for annotation systems [31] can be used to
determine resource relevance for queries of several forms. Tagging
data was combined with classic collaborative filtering in order to
further filter a user’s domain of interest [28]. More recently, several techniques [12, 13, 18] have built upon and refined this earlier
work. None of these approaches, however, deal with the possibility
of resources themselves as queries.
Some work has focused on resource-to-resource comparison in
social annotation, although little in the way of direct recommendation. Some have considered the problem of measuring the similarity of resources (as well as tags) in a social annotation system by
various means of aggregation [19]. An author-topic latent variable
model has been used in order to determine web resources with identical functionality [23]. They do not, however, specifically seek to
recommend resources to a particular user, but rather simply enable
resource discovery utilizing the annotation data.
Our own previous work regarding annotation systems has fo-
cused on the use of tag clusters for personalized recommendation [11,
30] and linear weighted hybrid recommenders for both tag [6] and
resource [7, 8, 10] recommendation. Here we extend our examination to cascading hybrids comprised of simple components for the
specific problem of recommendation by example.
3. RECOMMENDATION BY EXAMPLE
Recommenders are a critical component of today’s Web applications, reducing the burden of information overload and offering
the user a personalized view of the information space. One such
example is recommendation by example in which users select an
example resource and view additional resources which are similar.
Figure 1 illustrates a common paradigm. A user is viewing information about the band Radiohead. Near the bottom several other
bands are displayed under the heading “similar artists.” A user
might click on the link to Arcade Fire and discover yet more bands.
In this way the user can explore the resource space jumping from
artist to artist. Similarity in such a system can be measured based
on content features or co-occurrences in user transactions. In social
annotation systems, like that shown in Figure 1, similarity is often
measured by the correlation of tags assigned to the resources.
Many Web sites offer the functionality of resource recommendation by example. However, it is not often personalized. Personalization focuses the results based on the interests of the user. For
example, the system may know that the user is fond of rich vocals
or heavy bass. By incorporating these preferences into the recommendation, the system can better serve the user.
In this work we seek to explore the impact of personalization on
recommendation by example recommenders. We focus our experimentation on social annotation systems. To that end we first present
the data model for these systems. We then discuss recommendation
by example for social annotation systems in a general sense before
presenting algorithms based on cosine similarity and collaborative
filtering. We present the framework for cascading hybrids which
integrate both approaches. Finally, we compare the cascading hybrid to other integrative techniques.
3.1 Data Model
The foundation of a social annotation system is the annotation:
the record of a user labeling a resource with one or more tags. A
collection of annotations results in a complex network of interrelated users, resources and tags [20]. A social annotation system
can be described as a four-tuple: D = ⟨U, R, T, A⟩, where, U is a
set of users; R is a set of resources; T is a set of tags; and A is a set
of annotations. Each annotation a is a tuple ⟨u, r, Tur ⟩, where u is
a user, r is a resource, and Tur is the set of tags that u has applied
to r. It is sometimes useful to view a social annotation system as a
three-dimensional matrix, URT, in which an entry URT(u,r,t) is 1 if
u has tagged r with t.
Many of our component recommenders rely on two-dimensional
projections of the three dimensional annotation data [21]. These
projections sacrifice some of the data’s informational content but reduce
the dimensionality of the data, making it easier for algorithms to
leverage.
We define the relation between resources and tags as RT (r, t),
the number of users that have applied t to r, a notion strongly resembling the “bag-of-words” vector space model [25]. Similarly,
we can produce the projection U T . A user is modeled as a vector
over the set of tags, where each weight, U T (u, t), measures how
often a user applied a particular tag across all resources. In all,
there are six possible two-dimensional projections: U R, U T , RU ,
RT , T U , T R (or three if you wish to consider the transpose). In the
case of U R, we do not weigh resources by the number of tags a
user applies, as this is not representative of the user interest. Rather
we define U R to be binary, indicating whether or not the user has
annotated the resource.
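The projections can be read directly off the annotations; the following is a minimal sketch, assuming annotations are (user, resource, tag-set) tuples and using illustrative names.

from collections import defaultdict

def project(annotations):
    RT = defaultdict(lambda: defaultdict(int))  # RT(r, t): users who applied t to r
    UT = defaultdict(lambda: defaultdict(int))  # UT(u, t): how often u applied t overall
    UR = defaultdict(set)                       # binary: has u annotated r?
    for u, r, tags in annotations:
        UR[u].add(r)
        for t in tags:
            RT[r][t] += 1
            UT[u][t] += 1
    return RT, UT, UR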
In our previous work we explored the notion of information channels [10]. An information channel describes the relationship between the underlying dimensions in a tagging system: users, resources and tags. A strong information channel between two dimensions means that information in the first dimension will be useful in building a predictor for the second dimension. While our
previous efforts describe various ways for calculating the strength
of an information channel, in this work we simply rely on the notion to evaluate the experimental results.
3.2 Recommendation by Example in Social Annotation Systems
Personalized resource recommendation engines reduce the effort of navigating large information spaces by giving each user a
tailored view of the resource space. To achieve this end the algorithms presented in this section accept a user u, an example rq and
a candidate resource r. With these considerations in mind, we can
view any recommendation by example algorithm as a function:

$$\phi : U \times R \times R \to \mathbb{R}. \quad (1)$$
Given a particular instance, ϕ(u, rq , r), the function assigns a
real-valued score to the candidate resource r given the user profile
u and the example (or query) resource rq . A system computing
such a function can iterate over all potential recommendations and
recommend those with the highest scores. This general framework,
of course, requires a means to calculate the relevance. The following sections provide several techniques.
3.3 Cosine Models
Cosine similarity is commonly used in information retrieval to
measure the agreement between two vectors. By modeling resources as vectors, the cosine similarity between two resources can be
computed. Our first recommender, CSrt , models resources as a
vector of tags taken from RT and computes the cosine similarity
between the example resource and the potential recommendation:
$$\phi(u, r_q, r) = \frac{\sum_{t \in T} RT(r_q, t) \times RT(r, t)}{\sqrt{\sum_{t \in T} RT(r_q, t)^2} \times \sqrt{\sum_{t \in T} RT(r, t)^2}} \quad (2)$$
Resources can also be modeled as a vector of users taken from
RU . We call this approach CSru . Neither of these approaches is
personalized; they completely ignore the input user u. Yet, they are
the type of recommender one might expect in a recommendation
by example scenario, making them useful baselines.
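A sketch of CSrt under the same assumptions as before (RT as nested dictionaries); the names are illustrative, and cosine() is reused below for the collaborative components as well.

import math

def cosine(v, w):
    # v, w: sparse vectors as dicts (e.g. tag -> weight).
    dot = sum(x * w.get(k, 0) for k, x in v.items())
    norm = (math.sqrt(sum(x * x for x in v.values()))
            * math.sqrt(sum(x * x for x in w.values())))
    return dot / norm if norm else 0.0

def cs_rt(rq, candidates, RT):
    # Equation 2: score each candidate by the cosine similarity of its tag
    # vector to the example resource's tag vector; the user is ignored.
    return {r: cosine(RT[rq], RT[r]) for r in candidates if r != rq}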
3.4 Collaborative Filtering

In order to take advantage of the user profile we rely on collaborative filtering algorithms. More specifically, we employ two user-based algorithms [16, 29] and two item-based algorithms [5, 26].
The user-based approaches, KNNur and KNNut, model users
as either resources or tags gathered from U R or U T . To make recommendations, we filter the potential neighbors to only those who
have used the example resource rq . We perform cosine similarity
to find the k nearest neighbors to u and then use these neighbors
to recommend resources using a weighted sum based on user-user
similarity:
$$\phi(u, r_q, r) = \sum_{v \in N_{r_q}} \sigma(u, v)\,\theta(v, r) \quad (3)$$

where $N_{r_q}$ is the neighbourhood of users that have annotated $r_q$,
σ(u, v) is the cosine similarity between the users u and v, and
θ(v, r) is 1 if v has annotated r and 0 otherwise.
Filtering users by the query resource focuses the algorithm on
the user’s query but still leaves a great deal of room for resources
dissimilar to the example to make their way into the recommendation set. These two approaches, however, are strongly personalized.
The item-based algorithms rely on the similarity between resources rather than between users. When modeling resources as
users we call the algorithm KNNru. When modeling them as tags we call it KNNrt. Again a weighted sum is used, as is common in
collaborative filtering:
$$\phi(u, r_q, r) = \sum_{s \in N_r} \sigma(r, s)\,\theta(u, s) \quad (4)$$

where $N_r$ is the set of the k resources nearest to r drawn from the user profile and σ(r, s) is the similarity between s and r. This procedure
ignores the query resource entirely, instead focusing on the similarity of the potential recommendations to those the user has already
annotated.
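The user-based variant of Equation 3 can be sketched as follows, assuming profiles[v] holds v’s U R or U T vector and annotated[v] the set of resources v has annotated (both illustrative names), and reusing cosine() from above; the item-based variant of Equation 4 is symmetric, with neighbourhoods computed over resources instead of users.

def knn_user_based(u, rq, profiles, annotated, candidates, k=20):
    # Equation 3: neighbours are restricted to users who annotated the
    # example resource rq; the k most similar to u are then retained.
    peers = [v for v in profiles if v != u and rq in annotated[v]]
    neighbours = sorted(peers,
                        key=lambda v: cosine(profiles[u], profiles[v]),
                        reverse=True)[:k]
    # Weighted sum of user-user similarities over neighbours who annotated r.
    return {r: sum(cosine(profiles[u], profiles[v])
                   for v in neighbours if r in annotated[v])
            for r in candidates}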
3.5 Cascading Hybrids
Hybrid recommenders are powerful tools used to combine the
results of multiple components into a single framework [4]. In this
work we focus on cascading recommenders, which reorder the output of one recommender by the results of a second. The variable
k is used to determine how many of the resources are taken from the
first recommender and passed to the second. If k is set very small
60
then the second is severely limited in what it can recommend. On
the other hand if k is set very large then the first algorithm has very
little influence on the final result since all the resources might be
passed onto the second recommender.
Given the six recommenders described above, several cascading
hybrids are possible, thirty in all. We limit the scope of this paper
to an examination of hybrids that start with a component based on
cosine similarity and reorder the results based on collaborative filtering. In our preliminary experiments these combinations offer the
best results. Ideally, the first recommender will focus the results on
the example resource. The value of k will be tuned such that the
most relevant resources are passed to the second recommender. The
second recommender will then reorder the resources based on the
user’s profile thereby personalizing the final recommendation list.
More formally we describe the recommender as:
$$\phi(u, r_q, r) = \chi_1(k, r)\,\phi_2(u, r_q, r) \quad (5)$$

where $\phi_2(u, r_q, r)$ is the score taken from the second recommender and $\chi_1(k, r)$ is 1 if r is ranked among the top k resources by the first recommender (and 0 otherwise).
The cascading recommender has many advantages. First is its efficiency: the cosine similarities can be computed offline and the collaborative filtering approaches — which are more computationally intense — would only need to rerank a small subset of
the resources. Second, the hybrid can leverage multiple information channels of the data. For example, the resource-tag channel exploited by CSrt and the user-resource channel exploited by
KNNur can be combined in a single cascading recommender.
Third, by leveraging multiple information channels the hybrid can
produce results superior to what either can produce alone.
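Equation 5 reduces to a two-stage ranking; a minimal sketch, where first and second are assumed to be scoring functions with the signature of ϕ and all names are illustrative:

def cascade(u, rq, first, second, candidates, k=50, top_n=10):
    # Stage 1: shortlist the k candidates ranked highest by the first
    # (typically non-personalized, precomputable) recommender.
    shortlist = sorted(candidates, key=lambda r: first(u, rq, r),
                       reverse=True)[:k]
    # Stage 2: reorder the shortlist by the second (personalized) recommender.
    return sorted(shortlist, key=lambda r: second(u, rq, r),
                  reverse=True)[:top_n]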
3.6 Other Integrative Models
Hybrid recommenders are not the only means to integrate multiple information channels of social annotation data. Graph-based approaches such as Adapted PageRank [15] and tensor factorization algorithms such as Pairwise Interaction Tensor Factorization (PITF) [24] have met with great success, particularly in tag recommendation.
Adapted PageRank models the data as a graph composed of users,
resources and tags connected through the annotations. A preference vector is used to model the input. The PageRank values are
then calculated and used to make recommendations. However, the computational requirements of Adapted PageRank make it ill-suited for large-scale deployment; the PageRank vector must be calculated for each recommendation.
PITF, on the other hand, offers a far better running time. It achieves excellent results for tag recommendation, but it is not clear how to adapt the algorithm to the recommendation by example scenario. In the context of tag recommendation, PITF prioritizes tags from both a user and a resource model in order to make
recommendations thereby reusing tags. In resource recommendation however the algorithm cannot promote resources from the user
profile as these are already known to the user. This requirement
conflicts with the assumptions of the prioritization model; all possible candidate recommendations are in effect treated as negative
examples.
Second, many proposed tensor factorization methods, PITF included, require an element from two of the data spaces in order to
produce elements from the third. For example a user and resource
can be used to produce tags. In recommendation by example the
input is a user and a resource while the expected output also comes
from the resource space. Finally, in our investigation into tag-based
resource recommendation [9], we found that hybrid recommenders
often outperform PITF.

             KNNur   KNNut   KNNru   KNNrt
Bibsonomy      20      20      10       2
MovieLens      20      30       2      20
LastFM         50      50       2       2

Table 1: Values for k in the collaborative filtering recommenders.

                      CSrt
             KNNur   KNNut   KNNru   KNNrt
Bibsonomy      20      20      30      20
MovieLens      50      20      75      50
LastFM         50      30     150      20

Table 2: Values for k in the cascading recommenders beginning with CSrt.
While these previous efforts are not evaluated in this paper, due either to their scalability or to their applicability to the recommendation by example paradigm, they do demonstrate the benefit of an integrative model leveraging multiple channels of the data and provide
inspiration for the cascading hybrids.
4. EXPERIMENTAL EVALUATION
Here we describe the methods used to gather and pre-process our three real-world datasets. We describe how test cases were generated and the cross-fold validation used in our study. The evaluation metrics are presented and the results for each dataset are given separately before we draw some final conclusions.
4.1 Datasets
Our experiments into recommendation by example were completed on three real-world social annotation systems. As part of our
data processing we generated p-cores [15] for each dataset. Users,
resources and tags were removed in order to produce a residual
dataset that guarantees each user, resource and tag occur in at least
p annotations. 5-cores or 20-cores were generated depending on the size of the dataset, thus allowing five-fold cross validation.
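A p-core can be computed by iterating the filter to a fixed point; a minimal sketch, assuming the data has been flattened to (user, resource, tag) triples (the paper’s annotations carry tag sets, so this is a simplification):

from collections import Counter

def p_core(triples, p):
    # triples: a set of (user, resource, tag) tuples.
    while True:
        counts = [Counter(t[i] for t in triples) for i in range(3)]
        kept = {t for t in triples
                if all(counts[i][t[i]] >= p for i in range(3))}
        if kept == triples:
            return triples
        triples = kept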
Bibsonomy users annotate both URL bookmarks and journal articles. The dataset was gathered on 1 January 2009 and is made
available online [14]. A 5-core was taken producing 13,909 annotations with 357 users, 1,738 resources and 1,573 tags.
MovieLens is administered by the GroupLens research lab at the
University of Minnesota. They provide a dataset containing users,
rating of movies, and tags. A 5-core generated 35,366 annotations
with 819 users, 2,445 resources and 2,309 tags.
LastFM users share their musical tastes online. Users have the
option to tag songs, artists or albums. The experiments here are
limited to album data though experiments with artist and song data
show similar trends. A p-core of 20 was drawn from the data and
contains 2,368 users, 2,350 resources, 1,141 tags and 172,177 annotations.
4.2 Methodology
Recommendation by example plays a critical role in modern Internet applications. However, it is difficult to directly capture the user’s perception of how two resources relate to one another. Moreover, to the best of our knowledge, there do not exist datasets in which a user has explicitly stated that he believes two items are similar.
                      CSru
             KNNur   KNNut   KNNru   KNNrt
Bibsonomy     150     150     150     150
MovieLens      75      20      20      75
LastFM         75      75      75      30

Table 3: Values for k in the cascading recommenders beginning with CSru.
For these reasons we limit the scope of our experiments to social annotation systems. In these systems a user applies tags to
resources, in effect describing them in a way that is important to the
user. The basic assumption of this work is that if a user annotates
two resources with the same tags then the two resources are similar
from that user’s perspective.
Since a user may annotate resources with any number of tags we
segment the results into cases in which one, two, three, four or five
tags are in agreement. This segmentation allows us to analyze the
results when there is very high probability that two resources are
similar (when a user applies several similar tags to both resources)
or when the probability is lower (when only a single common tag
is applied to both resources).
We partition each dataset into five folds. The profile of each user is randomly but equally distributed across the partitions. In order to tune
the variables we use four partitions as training data and the fifth as
testing data. The variables were tuned for the collaborative filtering
algorithms as well as the cascading hybrid. Then the testing partition was discarded and we continued with four-fold cross validation
to produce our results.
To evaluate the algorithms, we iterated over all annotations in the
testing data. Each annotation contains a user, a resource and a set of
tags applied by the user to the resource. We compare these tags to
the tags in the user’s annotations from the training data. If there is
a match we generate a test case consisting of the user, the resource
from the training data as the example resource and the resource
from the holdout data as the target resource. The example resource
and target resource may have one tag in common or several. We
evaluate these cases separately looking at as many as five common
tags.
Since each test case has only one target resource, we judge the
effectiveness of the recommenders with the hit ratio, the percentage of times the target resource is found in the recommendation
set. We measure hit ratio for the top 10 recommended resources.
The results are averaged for each user, averaged over all users, and
finally averaged over all four folds.
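The hit ratio is then straightforward to compute; a sketch, assuming test_cases is a list of (user, example, target) triples and recommend returns a ranked list of resources. The per-user and per-fold averaging described above is omitted for brevity.

def hit_ratio(test_cases, recommend, top_n=10):
    # Fraction of test cases whose single target resource appears in the
    # top-n recommendations for the (user, example) pair.
    hits = sum(1 for u, rq, target in test_cases
               if target in recommend(u, rq)[:top_n])
    return hits / len(test_cases)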
4.3 Experimental Results
Table 1 reports the k values used in the collaborative filtering algorithms. Experiments were conducted using values of 2, 5, 10, 20,
30 and 50. Tables 2 and 3 present the values used in the cascading
recommenders. The values 20, 30, 50, 75, 100 and 150 were tested.
The experimental results are given in Figure 2. The first subfigure (Bibsonomy - CSrt) shows the results for CSrt and the four
hybrids that incorporate CSrt with the collaborative filtering approaches. For each case the results are displayed when one, two,
three, four or five tags are common among the example resource
and the target resource. In general when a single tag is common
between two annotations it is difficult to know the user’s intent and
the hit ratio is low. However, with more tags in agreement it is safer
to assume that the user views the two resources in similar terms and
the hit ratio increases to almost 30%. When comparing algorithms,
we find little difference in these two cases. For readability, the remaining subfigures only report the case when five tags are shared
between the example and target resource. The y-axis displays the
hit ratio for a recommendation set of 10 resources.
In general we find that personalization is important for resource
recommendation by example algorithms. It is particularly important for domains in which personal taste strongly influences the
user’s consumption of resources. MovieLens and LastFM, which
connect users with movies and music, both receive a benefit from
our personalized cascading hybrids. In contrast, Bibsonomy, which
allows its users to annotate journal articles, receives very little benefit. These users often annotate articles for professional reasons.
Since their individual preferences play a small role in their annotations, the impact of personalizing the recommendation by example
algorithms is diminished.
A second finding is that the cascading hybrid recommenders effectively merge the benefits of their component parts. If one component exploits the information channel between users and resources
while a second leverages the channel between resources and tags,
then the cascading hybrid benefits from both.
Thirdly, we see that the underlying characteristics of the annotation systems vary. Some have strong user-resource channels. Others have better developed user-tag channels. The cascading recommenders expose these differences and offer insights into how user
behavior might differ from system to system. In the remainder of
this section we evaluate each dataset in greater detail.
4.3.1 Bibsonomy
The top left graph of Figure 2 shows that CSrt alone achieves a
hit ratio of approximately 4% when the results are limited to cases
where a single tag is annotated to both the query resource and the
recommended resource by the user. When two tags are in common the hit ratio rises to approximately 8%. When five tags are in
common it jumps to 30%.
We assume that with five tags in common between the query
resource and recommended resource the likelihood that the user
views the two resources in a similar way is greater than when only
one tag is in common. The increase in hit ratio appears to bear
this out. Furthermore, we notice that except in certain rare circumstances the relative performance between algorithms is the same
regardless of how many shared tags are considered. For simplicity,
we restrict our discussion to cases where five tags overlap.
In Bibsonomy CSrt clearly outperforms CSru. It achieves a
hit ratio of nearly 30% while the other does little better than 10%.
In this application it appears that resources are better modeled by
tags than by users, at least for the task of recommending resources
by example.
The personalization afforded by the cascading recommenders appears to offer little benefit. In the best case, CSrt/KNNur, the improvement is only a fraction of a percent. Moreover, in the remaining hybrids, the results are actually worse than CSrt alone. When using CSru as the initial recommender, the hybrid composed of CSru and KNNrt produces an improvement, but its overall performance is still less than CSrt alone.
These results suggest that the resource-tag information channel
is particularly informative in Bibsonomy. Examination into how
this system is used seems to offer some explanation why this might
be the case. Bibsonomy allows users to annotate journal articles
and Web pages. Many users employ the system to organize papers
relevant to their research. To that end they select tags useful for retrieving their resources at a later date and they often use tags drawn from their area of expertise. The result is a highly informative tag
space that CSrt is able to exploit.
Figure 2: The hit ratio for a recommendation set of size 10 for the baseline recommenders (CSrt and CSru) and the corresponding cascading recommenders leveraging the collaborative filtering algorithms (KNNur, KNNut, KNNru and KNNrt).

Since CSrt draws from the resource-tag space, it is not surprising that KNNrt is not able to improve the results. Yet, the other
approaches appear not to add information to the recommender either. A look at Table 2 shows that 20 resources were taken from
CSrt for the other algorithms to reorder. This value was tuned
to produce the best performance and is the lowest among all the
cascading hybrids. It may be that the reliance of the collaborative
algorithms on the user profiles rather than the example resource
profiles is particularly detrimental in a system where the resources
are organized not by personal taste but by the content of the resource.
4.3.2 MovieLens
In MovieLens we see dramatically different results. First, CSru
still does not perform as well as CSrt, but it is far more competitive. This suggests that in this domain several information channels
of the data might be exploited to benefit the recommendation engine.
Second, the cascading hybrids improve upon the performance
of CSrt by as much as 19%. Alone CSrt results in a hit ratio
of 29%, whereas coupled with KNNur it results in 48%. The
other cascading hybrids also improve the results, but to a lesser
degree. These results imply that combining component recommenders which rely on different channels of the data can increase
accuracy.
We also see that CSru is improved by combining it with the collaborative recommenders. Best results are achieved by combining
it with KNNrt, producing a hit ratio of 38%.
In this system users tag movies. Often they use genres or an
actor’s name to annotate a film. In this sense they are similar to
users that annotate journal articles in Bibsonomy. In both cases,
tags are often drawn from a common domain specific vocabulary.
Yet, in MovieLens we observe that the user’s preferences play a
more important role. By means of an example, two users may both
like science fiction and annotate their finds as such. If one is drawn
toward British television (Dr. Who) and the other prefers modern
blockbusters (The Transformers) then even though they both use
the tag scifi, it means something different to each of them. A recommender like CSrt could whittle the possible films down to a
general area, but a collaborative algorithm such as KNNur can
narrow the field down even further by focusing on the user’s preferences.
It is also informative to note that whereas KNNur is the better secondary model for CSrt, KNNrt is the better secondary model for CSru. This may be because those secondary algorithms complement the primary recommenders. For example, CSrt focuses on the resource-tag channel. KNNrt does not offer substantially new information since it also focuses on the resource-tag channel. Instead, better results are achieved by combining CSrt with KNNur, an algorithm that leverages the user-resource channel.
4.3.3 LastFM
LastFM provides another example of how personalization can
improve upon standard cosine models. In this case, however, the
best cascading hybrid is built from CSru, which models the resource as a vector of users, and KNNru, a collaborative algorithm that also models resources over the user space. The result is a more modest improvement in performance, from 25% to 29%. We observe that KNNru also makes the best cascading recommender of those combined with CSrt. The strength of this component
may be explained by an evaluation of the system itself.
LastFM users annotate music — tracks, artists and albums. Like MovieLens users, they often use tags common to the domain such as genre or a singer’s name. However, they also use tags like sawlive or albumiown. The tag space is used not only to describe the resource, but also the user’s relationship to it, making the tag space far noisier.
Furthermore, LastFM users more commonly focus on a particular genre of music, whereas MovieLens users are likely to watch a
broader swath of genres, from horror to adventure to action. This
interaction makes the user space in LastFM slightly cleaner than in MovieLens. These two differences suggest the user space would be a better model for resources than the tag space. In fact that is what we observe: CSru is the best initial component and KNNru is
the best secondary component.
5. CONCLUSION
In this paper we have investigated the use of cascading hybrids
for the personalization of recommendation by example algorithms
in social annotation systems. This form of interaction offers great
utility to users as they navigate large information spaces, yet it is rarely personalized. We contend that personalization is an important ingredient toward satisfying the user’s needs. Our experimental analysis on three real-world datasets reveals that in some cases personalization offers little benefit. However, in other contexts, particularly where resource consumption is driven by personal preference, the benefits can be quite large.
Our proposed cascading hybrids leverage multiple information channels of the data, producing superior results, yet they offer advantages beyond accuracy. Since much of the work can be completed offline and only a small subset of resources needs to be evaluated for reranking, the algorithm is very efficient, permitting fast online personalized recommendations.
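A rough sketch of that offline/online split, again with assumed names and data layout rather than the authors' actual implementation:

    from math import sqrt

    def cosine(v, w):
        dot = sum(v[k] * w.get(k, 0.0) for k in v)
        norms = (sqrt(sum(x * x for x in v.values())) *
                 sqrt(sum(x * x for x in w.values())))
        return dot / norms if norms else 0.0

    def precompute_candidates(rt_channel, pool=100):
        """Offline: cache each resource's top candidates under the
        non-personalized cosine model, so the all-pairs work is done once."""
        return {r: sorted((s for s in rt_channel if s != r),
                          key=lambda s: cosine(rt_channel[r], rt_channel[s]),
                          reverse=True)[:pool]
                for r in rt_channel}

    def recommend_online(query_resource, cache, rerank_score, top_n=10):
        """Online: rerank only the small cached pool with a per-user score,
        keeping the personalized step cheap at request time."""
        return sorted(cache[query_resource], key=rerank_score,
                      reverse=True)[:top_n]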
Cascading hybrids built from different combinations of component recommenders performed differently across our three datasets. These results suggest that users interact with social annotation systems in varying ways, producing datasets with different underlying characteristics. The cascading hybrids expose these characteristics and allow investigations into why they might occur.
6. ACKNOWLEDGMENTS
This work was supported in part by a grant from the Department
of Education, Graduate Assistance in the Area of National Need,
P200A070536.