Recommending Articles for an Online Newspaper - ILK
RECOMMENDING ARTICLES FOR AN ONLINE NEWSPAPER

HAIT Master Thesis series no. 09-002
J.M.P. (Joost) Kneepkens BICT
June 2009

Thesis submitted in partial fulfilment of the requirements for the degree of Master of Arts in Communication and Information Sciences, Master track Human Aspects of Information Technology, at the Faculty of Humanities of Tilburg University.

Supervisor: Drs. A.M. (Toine) Bogers, ILK Research Group, Tilburg University
Other exam committee members: Dr. J.J. (Hans) Paijmans, ILK Research Group, Tilburg University; Drs. J.D. (Jaap) Meijers, Trouw, PCM Uitgevers

Department of Communication and Information Sciences, Faculty of Humanities, Tilburg University, Tilburg, The Netherlands, June 2009

Abstract

This research presents an evaluation of a recommender system that automatically generates recommendations for articles from an online newspaper. A prototype of the system was built in cooperation with the Dutch newspaper Trouw. With the data retrieved from Trouw, the system was able to recommend related articles for the news articles that were published daily on the Trouw website. Every day, and for every online article, Trouw Web editors judge the top 15 recommendations as correct or incorrect in an online application. During the research period, we looked at performance differences in combination with article growth, with incorporating temporal information, and with incorporating author and section metadata. The results show that article growth has an influence on the number of approved recommendations over time, and that the MAP and P@n scores improve over time as well. When incorporating temporal information, the MAP scores were best when only textual similarity was taken into account, while the P@n scores were best when textual similarity and recency were weighted equally. However, the differences in MAP and P@n scores between these two variations were not significant. Finally, incorporating author and section metadata did not lead to better recommendations; in terms of MAP scores it even performed significantly worse than the baseline algorithm.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Problem statement & research questions
  1.3 Research method
  1.4 Scope and relevance
  1.5 Outline
2 Related work
  2.1 Online newspapers
  2.2 News article recommendation
    2.2.1 Recommender systems
    2.2.2 Information Retrieval
    2.2.3 Information Filtering
3 The Trouw Recommender architecture
  3.1 Trouw usage scenario
  3.2 Data collection
  3.3 Judgments
4 Evaluation
  4.1 Recall and Precision
  4.2 Precision at rank p and MAP scores
  4.3 Subjective evaluation
5 Article growth
  5.1 Experimental setup
  5.2 Results
  5.3 Discussion
6 Incorporating temporal information in recommendations
  6.1 Experimental setup
  6.2 Results
  6.3 Discussion
7 Incorporating other metadata in recommendations
  7.1 Experimental setup
  7.2 Results
  7.3 Discussion
8 Conclusion and future work
  8.1 Article growth
  8.2 Incorporating temporal information in recommendations
  8.3 Incorporating other metadata in recommendations
  8.4 Future work
References
Appendix A

1 Introduction

1.1 Motivation

Newspapers are as old as Ancient Rome, when announcements were carved on stone or metal and posted in public places (Newspaper, 2001). However, the first 'recognized' newspaper was published in 1605 by Johann Carolus. When printing techniques became more advanced during the Industrial Revolution, newspapers became a more widely circulated means of communication.
Due to the availability of news via 24-hour television and the Internet, newspapers had to launch online variants in order to keep up with their readers. The number of visitors of online newspapers is still growing according to Nielsen Online, which conducted an investigation on behalf of the Newspaper Association of America (Sigmund, 2008). American newspaper websites attracted more than 66.4 million unique visitors (40.7% of all American Internet users) on average in the first quarter of 2008. This is a record number that represents a 12.3% increase over the same period in 2007.

Publishing news articles online has some advantages compared to the printed version. One advantage is that online articles can be hyperlinked. Hyperlinks in an article are used to refer to another section, article, person, or perhaps a location. Links to related articles are sometimes called recommendations and should entice the user to get more information about that specific topic. Generating recommendations can be done manually or automatically with a recommender system. In Chapter 2 we will discuss different types of recommender systems.

Another advantage of publishing news articles online is personalization. With personalization it is possible to create a newspaper containing only articles that correspond to the user's interests. The user will only read those articles that are interesting to him, just like he would do with a printed newspaper. All other, uninteresting articles will not be visible to the user. Personalization will also be discussed in Chapter 2.

This Master's thesis is about evaluating a recommender system that automatically generates news article recommendations. The system makes recommendations based on a so-called focus article. This article is compared to the other articles present in an index, and the top 15 recommendations returned by the system are judged as correct or incorrect.

1.2 Problem statement & research questions

In this thesis we will evaluate a recommender system for an online newspaper. A prototype of this system was built for the Dutch newspaper Trouw and was called the Trouw Recommender. This research involves the whole process of setting up the system architecture, choosing the right algorithms, judging the recommendations, and evaluating the chosen algorithms. To evaluate this recommender system, the following research questions are formulated:

1. What kind of influence does article growth have on generating recommendations?
2. What kind of influence does recency have on generating recommendations?
3. What kind of influence does author or section metadata have on generating recommendations?

For the first research question we will look at what article growth does to the relevance of the recommendations returned by the system. We want to find out whether article growth helps the system return more reliable recommendations. For the second research question we will look at whether recent articles are more relevant than older articles. Finally, for the last research question we will look at whether metadata about the section and author can be incorporated in the recommendation algorithm and whether this changes the performance of the system.

1.3 Research method

For this thesis, the following methods will be used. A prototype of a recommender system is set up. It will collect articles and make recommendations based on data that come from Trouw.
To see these recommendations, the Web editors of Trouw go to an online application, where the recommendations can be judged as correct or incorrect. Every day, Trouw Web editors judge the recommendations that are made by the system. Meanwhile, different recommendation algorithms are used to collect data for the research questions. The editors will not be aware of which algorithm is used while they are judging, because the recommended articles are presented in the same way for each algorithm. Finally, we will use the results of the judgments to evaluate the predictions of the system.

1.4 Scope and relevance

Online advertising is one of the most profitable business models for Internet services to date. Internet advertising revenues in the United States totalled $11.5 billion for the first six months of 2008 (PricewaterhouseCoopers, 2008). Of these $11.5 billion, 21% comes from displaying banner ads, where companies show their advertisements on different websites. The owners of the websites on which the advertisements are published receive revenue related to the number of times an ad is clicked on (cost-per-click) or how many times the ad was shown to visitors (cost-per-impression). Therefore, like many other commercial websites, it is important for online newspapers that visitors stay on their website as long as possible. This is called eyeball time, which refers to the time a user spends on a website. Presenting recommendations with articles could increase eyeball time, because it gives readers the possibility to go to these recommendations and read them. The chance that a reader will stay on the website longer will be higher, and so will the chance of clicking on one of the banner ads displayed on the website.

1.5 Outline

In this thesis we will first discuss, in Chapter 2, previous studies in information retrieval, recommender systems, and online news and article recommendation. The architecture of the prototype of the Trouw Recommender, which will be used during this research, is described in Chapter 3. In Chapter 4 we describe how the evaluation of the system is done. After that we address the three research questions in Chapters 5, 6, and 7; each of these chapters contains the experimental setup, the results, and a discussion of those results. Finally, in Chapter 8, we draw conclusions and discuss what future work should be focussed on.

2 Related work

2.1 Online newspapers

The rise of online newspapers started in the middle of the 1990s, when McAdams created an online version of The Washington Post (McAdams, 1995). She and her team were the first to set up an online newspaper, which confronted them with a lot of difficulties. They used the newspaper metaphor as the structural model for the online service: it had to be so user-friendly that anyone who can read could figure out how the online version should be used. The newspaper metaphor also uses the front page as the entry point of the system, a term still seen on many websites today. They came to the conclusion that an online newspaper cannot be a direct copy of the printed version. Furthermore, it is hard to figure out what to keep and what to discard. Finally, a whole new team of editors had to be formed to get the newspaper online.
These editors had to think in terms of two-way communication rather than a one-way medium, because online publishing is bi-directional.

As mentioned in the previous chapter, around 40% of all American Internet users visited one or more online newspapers during the first quarter of 2008. Most of these visitors probably only read articles that they think are interesting to them, as they would do with printed newspapers. Although printed newspapers tend to be more portable and easier to manipulate, online newspapers have an argument in their favour: personalization (Kamba, Bharat, & Albers, 1994). With personalization it is possible to create a newspaper containing only articles that correspond to the user's interests. The system of Kamba et al. made use of personalization without conscious user involvement, realized by realistic rendering, dynamic control, interactivity, and implicit feedback. Their system showed articles in a flexible layout, which could be manipulated by the user with a set of controls. With these controls, users were able to reorder articles according to their interests.

2.2 News article recommendation

Recommendation is widely used in different commercial systems, where each system uses its own data and, in most cases, standard algorithms like k-Nearest Neighbour are used to optimize its recommendations. In this section we will first describe how recommender systems work and what different types there are. Then a more general explanation of Information Retrieval (IR) will be given. Finally, another technique called Information Filtering (IF) will be explained.

2.2.1 Recommender systems

Earlier work on recommender systems shows two main algorithmic approaches that can be distinguished: collaborative filtering and content-based filtering. Work on recommender systems started in the early 1990s, when Goldberg et al. (1992) developed an experimental system that could filter e-mails. For a specific user, their filter was able to distinguish interesting and non-interesting e-mails. They used collaborative filtering for their system, which means that people collaborate to help one another perform filtering by recording their reactions to documents they read. Others who used the same system with these filter methods could access these reactions and thereby also see only those e-mails of interest to them.

In general, two broad classes of collaborative filtering algorithms can be distinguished: memory-based algorithms and model-based algorithms (Breese, Heckerman, & Kadie, 1998). Memory-based collaborative filtering is the more classic approach; it uses statistical techniques to find sets of neighbours and uses these as a source for making recommendations. This method can be used for user-based collaborative filtering. Resnick et al. (1994) developed a system that made use of this user-based approach. Their system, based on Usenet, let users rate articles according to their interests. With those ratings, the system could make predictions for other users and return ranked articles to them. The system compared each user's ratings and made use of the heuristic that people who agreed in the past are likely to agree again. Sarwar et al. (2001) computed the similarity between different items and used a set of items as nearest neighbours to do the recommendation, an approach called item-based collaborative filtering.
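To make the memory-based approach more concrete, the sketch below shows user-based collaborative filtering with cosine similarity and a nearest-neighbour prediction. It is a minimal illustration with an invented ratings matrix; it is not the implementation of any of the systems cited above.

```python
# Minimal sketch of memory-based (user-based) collaborative filtering.
# The user names, item ids, and ratings are invented for illustration.
from math import sqrt

ratings = {
    # user -> {item: rating}
    "anna":  {"a1": 5, "a2": 3, "a3": 4},
    "bart":  {"a1": 4, "a2": 2, "a4": 5},
    "chris": {"a2": 5, "a3": 1, "a4": 2},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors (dot product over shared items)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def predict(user, item, k=2):
    """Predict a rating as the similarity-weighted average over the k nearest neighbours
    who rated the item; people who agreed in the past are assumed to agree again."""
    neighbours = sorted(
        ((cosine(ratings[user], ratings[other]), other)
         for other in ratings if other != user and item in ratings[other]),
        reverse=True)[:k]
    num = sum(sim * ratings[other][item] for sim, other in neighbours)
    den = sum(sim for sim, _ in neighbours)
    return num / den if den else None

print(predict("anna", "a4"))  # items liked by similar users get a high predicted rating
```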
Collaborative filtering can also be used with model-based algorithms. Where memory-based algorithms operate over the entire user database to make predictions, model-based collaborative filtering uses the user database to estimate or learn a model, which is then used for predictions (Breese, Heckerman, & Kadie, 1998). These predictions can be seen as the expected value of a vote, calculated from what is known about a user. Two common models for model-based collaborative filtering are cluster models and Bayesian networks (Breese, Heckerman, & Kadie, 1998). Cluster models group like-minded users into classes; given a user's class membership, his ratings are assumed to be independent. Bayesian networks use titles as the variables (nodes) of the network, where the possible values of each title are the allowed ratings. From these data, the system can learn the structure of the network, which encodes the dependencies between titles, and the conditional probabilities (Pennock, Horvitz, Lawrence, & Giles, 2000).

Das et al. (2007) made use of collaborative filtering for the personalization of Google News. Because of the scale of their system (Google News receives millions of page views and clicks from millions of users) and the frequency with which the models have to be rebuilt, they found existing recommender systems unsuitable for their needs. A mixture of memory-based and model-based algorithms was used to generate recommendations. For the memory-based approach, they made use of PLSI and MinHash, and for the model-based approach they used item covisitation. The scores of the algorithms were combined, with an option to give more weight to a specific algorithm, to obtain a ranked list of stories. Finally, the top K stories were chosen from this list as recommendations for the user.

There are some problems that can occur while recommending with collaborative filtering. For example, if a new item appears in the database, it will never be recommended until more information is obtained, by another user either rating it or specifying which other items it is similar to (Balabanovic & Shoham, 1997). This is also called the cold start problem. Another problem occurs when a user's interests are very unique and cannot be compared to those of the rest of the users, which will lead to poor recommendations (Claypool, Gokhale, Miranda, Murnikov, Netes, & Sartin, 1999).

Next to collaborative filtering, there is content-based filtering, which selects items based on a comparison between their content and the user's preferences. In general, content-based filtering tries to recommend items similar to those a given user has liked in the past, whereas collaborative filtering identifies users whose tastes are similar to those of the given user and recommends items they have liked (Balabanovic & Shoham, 1997). So with content-based filtering, items are recommended based on information about the item itself rather than on the preferences of other users (Mooney & Roy, 2000). For content-based filtering, standard machine learning methods like naive Bayes classification are commonly used. The naive Bayes assumption states that the probability of each word event is dependent on the document class but independent of the word's context and position.

Bogers et al. (2007) compared and evaluated relatively old and simple retrieval algorithms against newer state-of-the-art approaches such as language modelling. They developed a news recommender system and compared three algorithms.
The first was the standard tf-idf algorithm, used as a baseline; the second was the Okapi retrieval function; and the last was the language-modelling (LM) framework. The tf-idf algorithm performed worst in comparison to the other two algorithms. There were no significant differences between the Okapi and LM algorithms; however, the LM variant generated the recommendations 5.5 times faster than the Okapi algorithm.

Another way to use content-based filtering was proposed by Maidel et al. (2008), who used an ontology for ranking items for online newspapers. In their personalized newspaper system, the ePaper, they included a well-known ontology in the news domain called NewsCodes. For both the news items and the users, profiles were built consisting of ontology concepts. These profiles were used to measure similarity by considering the hierarchical distance between the concepts in the two profiles' hierarchies. The degree of similarity of an item's profile to a user's profile was based on the number of concept matches between the two profiles (three degrees of matching were possible) and on the weights of the concepts in the user's profile (Maidel, Shoval, Shapira, & Taieb-Maimon, 2008).

2.2.2 Information Retrieval

Information Retrieval (IR) has been a field of research since the 1950s, when it became possible to store large amounts of information on computers and when finding the useful information became a necessity (Singhal, 2001). The early IR systems were based on Boolean logic, but had shortcomings such as the difficulty of forming good queries and the absence of document ranking. Nowadays users of IR systems expect ranked results; therefore models such as the vector space model, probabilistic models, and language models are now used most in IR.

The vector space model makes use of vectors that represent documents. One component corresponds to each term in the dictionary, where dictionary terms that do not occur in the document get a weight of zero. All documents in the collection are then viewed as a set of vectors in a vector space, where each axis represents a term. To quantify the similarity between two documents, the cosine of the angle between the two vectors is calculated, or the dot product is used as a similarity measurement. Probabilistic models estimate the probability of relevance of documents for a query, because true probabilities are not available to an IR system. This estimated probability of relevance is used for ranking the documents, which is the basis of the Probability Ranking Principle (Manning, Raghavan, & Schütze, 2008). Language models make use of the idea that a document is a good match to a query if the document model is likely to generate the query. This will happen if the query words often occur in the document. For each document a probabilistic language model is built and used to estimate the probability that this model would generate the query. This probability is again used to rank the documents (Manning, Raghavan, & Schütze, 2008).

To implement IR systems using any of these models, the data that will be used needs to be indexed first. Before the data are indexed, they typically have to be processed first to add, delete, or modify information in a document, such as removing stop words or extracting information about the author. After processing, the indexing stage builds a searchable data structure, which is called the index.
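As a small illustration of the vector space model described above, the following sketch computes tf-idf weights and ranks two documents by cosine similarity to a focus document. The toy documents are invented; this is a minimal sketch, not the code of any actual IR system.

```python
# Minimal sketch of the vector space model: tf-idf weighting and cosine similarity.
import math
from collections import Counter

docs = {
    "d1": "cabinet presents new budget budget debate".split(),
    "d2": "parliament debates the new budget".split(),
    "d3": "football club wins cup final".split(),
}

def tf_idf_vectors(docs):
    """One weight per term per document: term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for words in docs.values() for term in set(words))
    return {doc_id: {t: tf * math.log(n / df[t]) for t, tf in Counter(words).items()}
            for doc_id, words in docs.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

vectors = tf_idf_vectors(docs)
# Rank d2 and d3 by similarity to the "focus document" d1; d2 shares terms, d3 does not.
print(sorted(((cosine(vectors["d1"], vectors[d]), d) for d in ("d2", "d3")), reverse=True))
```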
The resulting index contains references to the document contents and is used to answer queries. IR systems let a user create a query of keywords describing the information needed. The keywords are used to look up references in the index, and the matching documents are displayed to the user. It is the intention of the system to return the most relevant set of results, based on the information need of the user.

2.2.3 Information Filtering

Information filtering (IF) systems focus on filtering information based on a user's profile. A user's profile can be created by letting the user specify and combine interests explicitly, or by letting the system implicitly monitor the user's behaviour (Hanani, Shapira, & Shoval, 2001). Filtering within IF systems is performed when a user automatically receives the information he needs according to his profile. One of the advantages of IF is its ability to adapt to the user's long-term interests and to deliver the corresponding information to the user. This information can be given as a notice to the user, or the system can use the information to take action on behalf of the user. Information filtering differs from information retrieval in the way the interests of a user are represented. Instead of letting the user pull information using a query, an information filtering system tries to model the user's long-term interests and push relevant information to the user.

3 The Trouw Recommender architecture

There are six paid national morning newspapers in the Netherlands at the moment, and Trouw is one of them. Trouw, which was founded in 1943 as an illegal newspaper during the Second World War, distinguishes itself from other quality newspapers by an explicit focus on news and views from the world of religion and philosophy (Trouw, 2004). Currently, Trouw has approximately 120 employees working on the creation of the daily newspaper. Seven of these 120 employees are responsible for the website of Trouw (http://www.trouw.nl), where anyone can read all the latest news online.

3.1 Trouw usage scenario

Two types of news articles are published on the website of Trouw: articles from the ANP (Algemeen Nederlands Persbureau, the leading news agency of the Netherlands), which are published automatically, and articles from Trouw's own printed version, which are published by the Trouw Web editors. For our research we will only focus on the second type of articles. On a daily basis, between 16 and 24 news articles from the printed version are published on the website of Trouw. When someone visits the website and reads an article, there are sometimes links to other, related articles that could be of interest to the reader. These related articles are called recommendations, and in this chapter we explain how they are generated.

In the past, Web editors from Trouw performed a search task by hand, within the content management system (CMS) of their website, to find articles that were related to a focus article. Not only was this search task very time-consuming, there was also a chance of not getting back all relevant articles from the CMS, because the editors had to formulate the correct and related keywords. Trouw wanted to have this search task done automatically by a computer, and therefore a prototype of a recommender system was developed. The idea behind this prototype was that it should generate article recommendations automatically and that the Web editors would only have to judge the correctness of these recommendations.
3.2 Data collection

For making the article recommendations, an open source toolkit called Lemur (http://www.lemurproject.org) was used. Lemur is designed to facilitate research in language modelling and information retrieval. With this toolkit it is possible to construct basic text retrieval systems using language modelling methods, as well as traditional methods such as those based on the vector space model and Okapi. For the Trouw Recommender, a decision had to be made which algorithm Lemur should use. As mentioned in Section 2.2.1, Bogers et al. (2007) compared three different algorithms for generating recommendations based on news articles. Based on their findings, the simple language model (Kullback-Leibler divergence) with Jelinek-Mercer smoothing is used for the Trouw Recommender. A simple LM algorithm creates a language model for each document and estimates the probability of generating the query according to each of these models (Ponte & Croft, 1998). A simple LM for a document d is the maximum likelihood estimator. Kullback-Leibler divergence (KLD) is applied to measure the divergence between two probability distributions, and can be used as a distance between LMs (Fernández, 2007). Finally, smoothing tries to balance the probability of terms that appear in a document with the probability of the ones that are missing. Smoothing discounts the probability mass assigned to the words seen and distributes the extra probability to the unseen terms according to some fallback model. Jelinek-Mercer smoothing involves a linear interpolation of the maximum likelihood model with the collection model, using a coefficient λ (Fernández, 2007).

Before the toolkit can make the recommendations, articles are collected every night at 3 o'clock from the TERA (http://www.teradp.com) database located at Trouw (Figure 3.1). These are in XML (eXtensible Mark-up Language) format and contain the full text and metadata of each news article.

Figure 3.1: Daily article collection

Because certain types of articles should never be recommended or never require recommendations, some pre-processing is done first. First, each article is formatted as plain text. Articles from the following categories are filtered out: weather reports and TV & radio guide listings. There is also a filter for articles that contain fewer than 80 words, because according to the Trouw Web editors these are also irrelevant for making recommendations. When all this processing has been done, the formatted text is put in a database called "Trouw". This Trouw database is a MySQL database with four tables containing all articles, judgments, recommendations, and users. The next step is converting this information into a format that can be used by Lemur. For the Trouw Recommender it is converted to the common Standard Generalized Mark-up Language (SGML) format used in the TREC (http://trec.nist.gov/) community. Each article in an SGML file is organized according to Figure 3.2. When all articles have been converted to this format, they are indexed. Stop word removal is performed during this indexing, while stemming is not. Stop word removal removes words that add little value to the document content; a list of Dutch stop words from the Snowball project (http://snowball.tartarus.org) is used for this.
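To make this scoring step more concrete, the sketch below ranks candidate articles by the log-likelihood of the focus article's terms under a Jelinek-Mercer smoothed document model; ranking by KL divergence with a maximum-likelihood query model is rank-equivalent to this query likelihood. The toy articles and the value of λ are invented for illustration, and this is a simplified stand-in for, not a copy of, what Lemur actually does.

```python
# Minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing.
# The focus article plays the role of the "query"; indexed articles are the "documents".
import math
from collections import Counter

LAMBDA = 0.5  # illustrative interpolation coefficient; the thesis does not fix a value here

def jm_score(query_terms, doc_terms, coll_counts, coll_size, lam=LAMBDA):
    """log P(query | document) with p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|collection)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_ml = doc_counts[w] / doc_len if doc_len else 0.0
        p_coll = coll_counts[w] / coll_size
        p = (1 - lam) * p_ml + lam * p_coll
        if p > 0:                       # terms absent from the whole collection contribute nothing
            score += math.log(p)
    return score

# Toy collection: the focus article is scored against two candidate articles.
articles = {
    "cand1": "kabinet presenteert nieuwe begroting".split(),
    "cand2": "voetbalclub wint bekerfinale".split(),
}
focus = "nieuwe begroting kabinet".split()
collection = [w for terms in articles.values() for w in terms] + focus
coll_counts, coll_size = Counter(collection), len(collection)

ranked = sorted(articles, key=lambda d: jm_score(focus, articles[d], coll_counts, coll_size),
                reverse=True)
print(ranked)  # cand1 shares the focus article's terms and should rank above cand2
```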
After this indexing step, all new articles are in the Lemur index and can be used for recommendation.

<DOC>
  <DOCNO> Unique article number </DOCNO>
  <TEXT>
    <TITLE> Article title </TITLE>
    <DATE> Article date </DATE>
    <SECTION> Section </SECTION>
    <ONLINE> Yes or No </ONLINE>
    <ITEMLENGTH> Length of the article </ITEMLENGTH>
    <AUTHOR> Author of the article </AUTHOR>
    <ABSTRACT> Abstract of the article </ABSTRACT>
    <CAPTION> Caption of the article </CAPTION>
    <BODY> The whole text of the article </BODY>
  </TEXT>
</DOC>

Figure 3.2: SGML formatting of an article

The next step in this recommendation process is to find articles that are related to the new online articles. Because Trouw thought it was not necessary to make recommendations for articles that were published in the past, recommendations are only made for the new online articles. Therefore, only the new articles are converted into the TREC format (Figure 3.3), again with stop word removal.

Figure 3.3: Daily article recommendation workflow

These new articles are also indexed, and Lemur then makes the recommendations. Using the simple language model algorithm mentioned above, Lemur returns 50 articles as related for each new article. Because Lemur also recommends the focus article itself, only 49 recommendations can be used effectively. However, the Trouw Web editors wanted to judge only 15 recommendations, so only the top 15 articles are displayed. Each related article gets a relevance score, and the higher the score, the more related an article should be. The simple LM algorithm (KLD with Jelinek-Mercer smoothing) mentioned above is used to calculate this relevance score. All new recommendations are then stored in the Trouw database with their relevance scores and are ready for judgment.

3.3 Judgments

Now that the recommendations have been made for all the new articles, the Trouw Web editors have to judge them. It is up to them to judge the top 15 recommendations of each article as correct or incorrect. Before the Web editors can judge the articles, they have to log in to a web application with their own username and password. When they are logged in, the editors see all article titles of the current or latest date for which there were recommended articles (Figure 3.4).

Figure 3.4: Trouw Recommender main window

The left side of the window shows all the days on which articles were recommended. When an editor clicks on a date, all articles of that day are shown in the middle of the window. Furthermore, there are arrows on the left to navigate to all available dates on which there are recommended articles. In the middle of the window, all articles of that date are shown by their title. If the editor wants to read a whole article, the title of that article can be clicked and the whole article is displayed in the centre, as can be seen in Figure A.1 in the Appendix.

Figure 3.5: Trouw Recommender interface for judging a focus article

To judge the recommendations of a specific article, the editor clicks on the green icon next to that article. In a new window, the focus article is displayed on the left and the top 15 recommendations are displayed on the right. As can be seen in Figure 3.5, the recommended articles are also displayed with their titles.
To give an idea of what the recommended article is about, the abstract or, if no abstract is available, the first 50 words are displayed. The section, date of publication, and the word count are also shown with the recommended article. The recommendations are ranked from high to low according to a normalized confidence score. This normalization was performed using the following formula:

Equation 3.1: Normalization of confidence score

    score_norm = (score_original − score_min) / (score_max − score_min)

where score_min is the lowest and score_max the highest relevance score of all 49 recommended articles. To obtain percentages, the normalized scores are multiplied by 100. By default, recommendations are set as incorrect, but when the normalized score reaches a confidence of 70% or higher, the recommendation is automatically set as correct. This was done in consultation with Trouw, as they wanted the top of the recommendations to be set as correct already. They also wanted buttons to set all recommendations as correct or incorrect with one click.

It is now up to the Web editor to judge whether the recommended articles are related to the focus article (see Figure 3.5). The editor can set the radio button of each recommended article to either correct or incorrect. When all 15 recommendations have been judged, the editor clicks on a submit button to save his judgments. This is all the work a Web editor has to do on a daily basis. When a focus article has been judged, the search icon next to the green icon becomes coloured and clickable. When a Web editor clicks on this icon, all articles that were judged as correct are shown. Around 11 o'clock each morning, a computer at Trouw automatically collects the recommendations that were judged by the Web editors of Trouw. From these collected recommended articles, a maximum of five are published next to their focus article. Figure A.2 and Figure A.3 in the Appendix show an example of a recommended focus article and how the recommendations of this article are shown on the Trouw website.

4 Evaluation

In this chapter, we describe our evaluation of the recommender's performance. In search engine evaluation one primary distinction is usually made: the distinction between effectiveness and efficiency (Croft, Metzler, & Strohman, 2009). Effectiveness measures the ability to find the right information, and efficiency measures just how quickly this is done. IR research focuses first on improving effectiveness, and once a technique has been established the focus shifts to finding the most efficient method. Because we are looking for a way to improve the Trouw Recommender's ability to find the right information, our evaluation focus will be on effectiveness.

4.1 Recall and Precision

To measure effectiveness, two measurements are most common, namely precision and recall (Croft, Metzler, & Strohman, 2009). Precision measures how well the system rejects non-relevant documents, and recall measures how well the system finds all relevant documents. This presumes that, given a specific query, there is a set of retrieved and non-retrieved documents and that we know which ones are relevant and which ones are not. The results of this specific query can be summarized as shown in Table 4.1, under the assumption that relevance is binary.
Table 4.1: Sets of documents defined by a simple search with binary relevance (Croft, Metzler, & Strohman, 2009)

                     Relevant   Non-relevant
    Retrieved        A ∩ B      Ā ∩ B
    Non-retrieved    A ∩ B̄      Ā ∩ B̄

The set of relevant documents in this table is A, the non-relevant set is Ā, B is the retrieved set, and B̄ is the non-retrieved set. The operator ∩ gives the intersection between two sets of documents. With this table we can define the two effectiveness measurements as follows:

    precision = |A ∩ B| / |B|        recall = |A ∩ B| / |A|

Because the editors of Trouw only judge the top 15 recommendations, only precision will be used as an evaluation measurement. Besides that, we do not know what all relevant documents are, because we cannot expect the editors to judge each and every document for each new article.

4.2 Precision at rank p and MAP scores

Because the relevance judgments are true or false (correct / incorrect), a binary evaluation method is chosen. Based on precision, we introduce two measurements. The first measurement is precision at rank n (or P@n), where we will be using 5, 10, and 15 as n (because the editors only judge the top 15 recommendations). This measurement is typically used to compare the output at the top of the ranking, which is what we want to examine during this research. However, a major disadvantage of this measurement is that it does not distinguish between the rankings of relevant documents within the top n results (Croft, Metzler, & Strohman, 2009). Therefore we will also be using a second measurement called Mean uninterpolated Average Precision (MAP). Using MAP, the average of the precision scores is calculated after each relevant article. MAP gives us a single-figure measure of overall system quality.

Figure 4.1: Recall and precision values for rankings from two different queries (Croft, Metzler, & Strohman, 2009)

Figure 4.1 gives an example of the recall and precision from two different queries. The P@5 score for both queries is 0.4, while the average precision and MAP for both queries are:

    Average precision query 1 = (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 0.62
    Average precision query 2 = (0.5 + 0.4 + 0.43) / 3 = 0.44
    Mean average precision = (0.62 + 0.44) / 2 = 0.53

The average precision is calculated by taking the sum of the precision values at each rank where an article was relevant, and dividing this by the number of relevant articles. The MAP is calculated by taking the average of all the average precisions. To obtain these scores, some steps have to be completed. First, the Lemur scores of each recommendation are obtained from the database. Second, the scores are recalculated with the algorithm that was used during judging; these algorithms will be described in the following chapters. Third, the newly calculated scores are ordered in descending order. Finally, the precision at rank n and MAP scores are calculated with a tool called trec_eval (http://trec.nist.gov/trec_eval/). To use trec_eval, a file containing all query relevance judgments has to be created. This "qrel" file contains all focus and recommended article combinations that were judged as correct by the editors of Trouw.
The qrel file was constructed in the following format, where a tab divides each field:

    Focus id             Dummy   Recommended id        Relevant
    TR_ART0…0281623.2    0       TR_ART0…0281807.1     1
    TR_ART0…0281623.2    0       TR_ART0…0281731       1
    TR_ART0…0281623.2    0       TR_ART0…0281897.4     1

The first field is used for the id of the focus article; the second is a dummy field, which is not used during the calculations. The third field is for the id of the recommended article, and the last field contains a 1, indicating that this combination is relevant. Next to the relevance judgments, a file with an ordered list of all recommendations is needed for trec_eval. In this file the recommendations are ranked in the same order as they were shown to the Web editor who judged them. This file was constructed in the following format:

    Focus id             Q0   Recommended id        Rank   Score     Run
    TR_ART0…0281623.2    Q0   TR_ART0…0281807.1     1      -3.0899   RunName
    TR_ART0…0281623.2    Q0   TR_ART0…0281731       2      -3.0932   RunName
    TR_ART0…0281623.2    Q0   TR_ART0…0281746       3      -3.1142   RunName
    TR_ART0…0281623.2    Q0   TR_ART0…0281897.4     4      -3.1353   RunName

This file looks similar to the qrel file, but it has some different fields. The first field is again used for the id of the focus article. The second and fourth fields are ignored by trec_eval. The third field is again for the id of the recommended article. The fifth field is used for the relevance scores that come from Lemur. Finally, the last field is also a dummy field, which can be used for the name of the run. After creating these files, trec_eval can be run with the two files as parameters. All kinds of measurements are then returned by trec_eval, including P@n and MAP. For each research question we used different lists containing the recommendations and judgments for that period. We will describe the results and discuss each research question in Chapters 5, 6, and 7.

4.3 Subjective evaluation

Because we could not check how the editors of Trouw worked with the system, some editors were questioned about working with the Trouw Recommender afterwards. Two editors, who judged most of the articles, were asked how the system performed according to them and what they thought could be done better. In general they were very pleased with the results returned by the system. According to them, most of the time the recommendations were very good. However, they pointed out that for specific topics like arts or music, or for articles about a specific person, the recommended articles were not very related.

There are two things that can cause these bad recommendations. One is that all recommendations are normalized before being displayed to an editor, which always results in one recommended article with 100% confidence for each focus article. This is confusing for the editor judging the article, because the system automatically sets this 100% confidence article as correct. So an editor may think that, because the system gives the first recommendation 100% confidence while it is not relevant, the system is not very reliable. In reality, the system may have returned low relevance scores for all of these recommendations, but due to the normalization there is always one article with 100% confidence. This is a problem that should be solved in future work.
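To make Equation 3.1 and the 70% threshold concrete, the sketch below applies the min-max normalization to a handful of Lemur scores (the values from the run-file example above are reused as toy input) and shows why the top-ranked recommendation always receives a confidence of exactly 100%.

```python
# Minimal sketch of the confidence normalization of Equation 3.1, applied to the
# relevance scores of one focus article. The input values are illustrative.
def normalize(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:                       # guard against a degenerate score range
        return [100.0 for _ in scores]
    return [(s - lo) / (hi - lo) * 100 for s in scores]   # percentages

lemur_scores = [-3.0899, -3.0932, -3.1142, -3.1353]       # toy Lemur relevance scores
confidences = normalize(lemur_scores)
print([round(c, 1) for c in confidences])                 # [100.0, 92.7, 46.5, 0.0]

# Default judgment: incorrect, unless the normalized confidence is 70% or higher.
defaults = ["correct" if c >= 70 else "incorrect" for c in confidences]
print(defaults)                                           # ['correct', 'correct', 'incorrect', 'incorrect']

# Note that min-max normalization always maps the best-scoring recommendation to
# exactly 100%, even if its absolute relevance score was low — which is exactly the
# source of confusion the editors reported.
```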
The other thing that can cause bad recommendations is that 15 recommendations are always displayed, even if they are barely related to the focus article. So if even the first recommended article is not related, the other 14 recommended articles are, according to the system, even less related. However, it was agreed with Trouw to always display the top 15 recommendations, so this is not a problem to be solved but something that had to be explained to the editors beforehand.

Besides these problems, there were also some technical infrastructure problems that could not always be solved rapidly, because an external company is involved. Since another company delivers the ICT infrastructure for the website of Trouw, there were some struggles with delivering the correct data to the Trouw Recommender. These problems are in some cases connected to the renewed website of Trouw, which went live in September 2008. Due to this, there were periods in which no articles were recommended because the articles did not have the "online" tag set in their XML files. Another problem that occurred after the launch of the new website is that information about the section was no longer included in the XML. This would have made our third research question impossible to answer, because we would not have been able to compare the section of the focus article with the section of the recommended article. Fortunately, we chose the period from 5 June 2008 until 20 August 2008 for the third research question, so we did not have to deal with this problem. There were also some complaints by the editors about the loading time when opening the judging window (Figure 3.5). It took some time for the PHP script to collect all 49 recommendations, recalculate their scores with the new algorithms, and then show them reordered to the editors. Rewriting the PHP and MySQL code should solve this problem.

5 Article growth

As described in Chapter 2, recommender systems have to deal with a problem called the cold-start problem. This problem occurs when there are not enough data to make good recommendations. The Trouw Recommender could have the same problem during the start-up period. On the first day the Trouw Recommender was launched, only articles from that day were collected, and only those articles could be used as recommendations. By the second day, the number of articles that could be used for recommendation had approximately doubled. This means that as the number of articles increases, so does the chance of having related articles in the database. Therefore we want to find out whether the performance of the system gets better when the number of articles in the index grows over time. In this chapter we will first describe the experimental setup, then present the results, and finally discuss our first research question: "What kind of influence does article growth have on generating recommendations?"

5.1 Experimental setup

For this experiment we used the scores from the standard algorithm, used by Lemur as described in Chapter 4, to rank the recommendations. The scores of the top 15 recommendations were normalized with Equation 3.1. The judgments of the Web editors were collected during a period of six weeks, from February 5th 2008 till March 21st 2008. We chose this six-week period because we wanted to have the same time span for each research question.
The number of articles added during this period should be enough to see whether article growth has an influence on the recommendations. Looking at the article growth for this first period of six weeks, we see that on average 76 articles were added every weekday (Monday till Friday). Of these 76 articles, on average 16 were also published online. Recommendations are generated only for these online articles. Every Saturday, an average of 120 articles was added to the MySQL database of the Trouw Recommender, of which 24 were meant for the online newspaper. So for both the weekdays and Saturdays, approximately 20% of the newly added articles are meant for online publishing. Figure A.4 in the Appendix shows all days of this period with their number of added articles.

Figure 5.1: Recommended articles during the first period (unjudged recommendations: 467, 61%; one or more approvals: 179, 23%; no approvals: 120, 16%)

As can be seen in Figure 5.1, 766 articles were recommended by the system during this first period. Of these 766 recommended articles, the editors of Trouw managed to judge only 299 articles (39%). We used trec_eval to calculate the MAP scores for these 299 judged articles and discovered that only 179 articles (60% of all articles judged) had one or more approved recommendations. So 40% of all articles judged did not have any good recommendations during this first period of six weeks. Figure 5.2 shows how these percentages change over time for the judged articles that have no approved recommendations. Eventually, over time, around 20% of all recommended articles have no approved recommendations.

Figure 5.2: No approved recommendations over time

Not only did we collect the judgments of the recommended articles, but also information about which editor judged them. In the results we will also go into the differences between the editors. Because each recommended article was judged by only one editor, it was not possible to measure inter-annotator agreement. In addition to the six-week period used for this research question, we also collected the baseline runs of all judgments made during this research project. Because research question 2 and the period after research question 3 also used the baseline Lemur algorithm, we can use this to see how performance changes over the course of the entire project. Statistical significance of differences between results was determined using two-sample equal variance t-tests (p < 0.05). Because we wanted to establish whether there was a significant difference between editors, only the two-tailed test was used.

5.2 Results

After creating the qrel file with all approved recommendations and the 299 lists with the recommendations of each judged article for this first period, we can look at the results using trec_eval. Looking at the MAP scores in Figure 5.3, it can be seen that only 179 of the 299 judged articles are visible. This is because trec_eval did not return zero scores for articles without approved recommendations.

Figure 5.3: MAP scores of judged articles over time during the first period

As illustrated by the blue line, there is a lot of variance in the MAP scores over time. However, the most common and also the highest MAP score is 1.0, occurring 89 out of the 179 times (58.5%).
A MAP score of 1.0 can mean two things: either there is only one approved recommendation, at position 1, or all recommendations judged as correct occupy the top positions of the ranking, with no incorrect recommendations before or between them. For example, if there are three correct recommendations, there will only be a MAP score of 1.0 if these three recommendations are on positions 1, 2, and 3. The lowest MAP score is 0.15 and occurred only twice during this first period.

The red line in Figure 5.3 shows the average MAP scores over time. The average MAP scores of the first 30 articles show a lot of variance and spikes. This is probably because some articles had bad recommendations in the beginning, resulting in low MAP scores. As can be seen, there is a constant upward trend after 30 judged articles, but this growth stagnates around 90 judged articles at a highest average MAP score of 0.9211. After that, the average MAP scores show a slight downward trend, ending up at an average MAP score of 0.8956 after 179 judged articles. However, the differences between the MAP scores of articles 1 to 90 and those of articles 90 to 179 were not significant. Because trec_eval did not return MAP scores for the 120 articles with no approved recommendations, the average MAP score of 0.8843 is not right. The corrected average MAP score should be 0.5362: the sum of the MAP scores of the 179 judged articles with one or more approved recommendations, divided by 299, the total number of judged articles. Not only have we calculated the MAP scores of each judged article, but also the P@n scores. The average and corrected average MAP and P@n scores for this first period are listed in Table 5.1. When taking all 299 articles into account, all four score types decrease by 40%. That is the percentage of the 120 articles judged with no approved recommendations out of the total of 299 judged articles.

Table 5.1: Average and corrected average MAP and P@n scores during the first period

    Type of score              MAP      P@5      P@10     P@15
    Average scores             0.8956   0.5084   0.3101   0.2235
    Corrected average scores   0.5362   0.3034   0.1856   0.1338

Figure 5.4 shows the averages of all three P@n scores and the average number of approved recommendations. As can be seen, all P@n scores show the same increase after 80 judged articles. Also, after 155 judged articles, the P@n scores tend to increase a bit more. It is remarkable that all the P@n scores increase after 80 judged articles, while the MAP scores became more constant after 90 judged articles (Figure 5.3). But as can be seen from the purple line, the average number of approvals shows the same increase at the end. Because P@n scores are calculated by taking the number of approved articles in the top n divided by n, this means that the upward trend of the average approvals has an influence on the increase of the P@n scores.

Figure 5.4: P@n scores and average approvals of judged articles over time during the first period

The P@n scores are still growing at the end, so it seems that after 299 judged articles there is still growth in the P@n scores. In Figure 5.9 we will indeed see that this growth continues until 350 judged articles; after that it goes downward to a constant average P@5 score of 0.3787, a constant average P@10 score of 0.2352, and a constant average P@15 score of 0.1703. During this first period, three editors judged the recommendations made by the Trouw Recommender.
One editor judged 195 articles, another judged 90 articles, and one judged only 14 articles. We calculated the average MAP score for each editor; the results are shown in Table 5.2.

Table 5.2: Differences in scores and approvals between the editors during the first period

    Editor ID   Articles judged   Average MAP   Corrected average MAP   Approvals   Average approvals per article
    4           90                0.8794        0.8501                  381         4.2
    11          14                1.0           0.8571                  24          1.7
    18          195               0.8976        0.3682                  201         1.0

The two editors who judged most of the articles have very similar average MAP scores of 0.8794 and 0.8976. But if we look at the corrected average MAP scores, where the articles with no approvals are also used in the calculation, the MAP score of Editor 18 decreases by almost 59% to 0.3682. This is the percentage of his judged articles with no approved recommendations, as can be seen in Table A.1 in the Appendix. Taking all the MAP scores of both editors, the differences between them are significant. This also applies to Editors 11 and 18; their MAP scores are significantly different too. Only the MAP scores of Editor 4 and Editor 11 are not significantly different. Looking at the average approvals per article, which is the total number of approved recommendations divided by the number of judged articles for each editor, a similar difference between Editors 4 and 18 can be seen. Editor 18 has an average of 1.0 approvals per article, while Editor 4 has an average of 4.2 approvals per article. As can be seen in Table A.1 in the Appendix, only 3.3% of Editor 4's judged articles have no correct judgments, and therefore his corrected average MAP is relatively high in comparison to Editor 18's.

Because in general one editor did all the judging for a given day, it has to be taken into account which editor made the judgments on which days. This distribution over the days is shown in Figure 5.5: Editor 18 judged the first nine days, the tenth day was judged by both Editors 11 and 18, then one more day by Editor 18, followed by four days by Editor 4. Finally, Editor 18 judged another day, and at the end Editor 4 judged another three days.

Figure 5.5: Dates of judgments and number of judged articles for each editor

Figure 5.6 shows the average MAP scores for each editor after each judged article over time. The three coloured lines are distributed over the days as shown in Figure 5.5. Again, the average MAP scores are only based on the articles that were judged with one or more approved recommendations. This means that Editor 4 judged 87 out of 90 articles with one or more approved recommendations, Editor 11 judged 12 out of 14 articles, and Editor 18 judged 80 out of 195 articles.

Figure 5.6: Average MAP scores of judged articles by each editor

Looking at the green line, corresponding to Editor 18, the average MAP score improves towards 0.9. All 12 articles judged by Editor 11 had a MAP score of 1.0. Finally, the average MAP scores of Editor 4 show a downward trend, starting from a MAP score of 1.0 and ending at an average MAP score of 0.85. Because Editor 18 judged the first ten days, it looks like the recommendations became better over time. For Editor 4, however, the system seems to get slightly worse over time. The P@5 scores of each editor in Figure 5.7 show something different from the MAP scores shown in Figure 5.6.
For all three editors, the P@5 scores show a similar downward trend in the beginning and become more constant over time. Editor 4 ends up with an average P@5 score of 0.6161, Editor 11 with an average of 0.3667, and Editor 18 with an average P@5 score of 0.4125.

Figure 5.7: Average P@5 scores of judged articles by each editor

In Table 5.3 the average and corrected average P@n scores are listed. Corrected average means that the articles without approved recommendations are also taken into account when calculating the averages. Just as with the MAP scores, the corrected averages drop strongly for Editor 18: all his P@n scores drop by 59%, because he judged that percentage of his articles without approved recommendations.

Table 5.3: Editors with their average and corrected average P@n scores

Editor ID               4       11      18
Average P@5             0.6161  0.3667  0.4125
Corrected average P@5   0.5956  0.3143  0.1692
Average P@10            0.3897  0.2000  0.2400
Corrected average P@10  0.3767  0.1714  0.0985
Average P@15            0.2874  0.1333  0.1675
Corrected average P@15  0.2778  0.1143  0.0687

Only the P@10 and P@15 scores of Editors 11 and 18 were not significantly different from each other; for all other editor combinations the scores were significantly different. This significance was calculated over the scores of all judged articles, so including the zero values of the judged articles without approved recommendations.

Because the baseline Lemur algorithm was also used for some recommendations during the second period and after the third period, we were able to collect additional data for it after the first period. In total, there were 1765 judged articles whose recommendations were made using the baseline Lemur algorithm. Out of these 1765 judged articles, 1370 articles (77.6%) had one or more approved recommendations. This is 17.6 percentage points more than during the first period, where only 60% of the 299 judged articles had one or more approvals. The MAP and average MAP scores for all these 1370 articles are plotted in Figure 5.8.

Figure 5.8: MAP scores of all judged articles recommended with the baseline Lemur algorithm

Looking at the MAP scores, a lot of variation over time can be seen, although most of the MAP scores lie between 0.9 and 1.0 (68.8%), as can be seen in Table 5.4. The period around 300 articles contains some low scores. The red line, corresponding to the average MAP scores, shows the same: after 100 articles this line goes downward until 300 articles and then becomes more constant, at an average MAP score of 0.88.

Table 5.4: Number of articles for different MAP scores

MAP score  Articles  % of all articles
0.0 – 0.1  2         0.1
0.1 – 0.2  11        0.8
0.2 – 0.3  19        1.4
0.3 – 0.4  34        2.5
0.4 – 0.5  62        4.5
0.5 – 0.6  42        3.1
0.6 – 0.7  55        4.0
0.7 – 0.8  69        5.0
0.8 – 0.9  133       9.7
0.9 – 1.0  943       68.8

The P@n scores of all 1370 articles, shown in Figure 5.9, have an opposite trend compared to the MAP scores in Figure 5.8. The P@n scores show an upward trend from 100 articles until about 300 articles. After that there is a downward trend, which becomes more constant around 1300 articles.

Figure 5.9: P@n scores of all judged articles with baseline Lemur algorithm

As the editors of Trouw mentioned in Chapter 4, some articles had bad recommendations due to the subject of the article.
Because the subject of an article is related to its section, we looked at the different sections and their numbers of articles. The results for the different sections are shown in Table 5.5. Unfortunately, only the period from 5 February until 5 September 2008 could be used: after the implementation of the new Trouw website, the daily retrieved XML files no longer contained information about the section. During this period, 512 out of the 1937 judged articles had no approved recommendations.

Table 5.5: Different sections with total articles and articles with no approved recommendations during the period 5 February 2008 until 5 September 2008

Section  Total number of articles  Articles with no approvals  % of all articles from same section  % of all articles with no approvals
GI_ART   4320                      145                         3.4                                  28.3
NI_ECO   2640                      34                          1.3                                  6.6
NI_NED   6645                      96                          1.4                                  18.8
NI_SPO   2130                      29                          1.4                                  5.7
NI_WER   4365                      54                          1.2                                  10.5
VE_LNG   255                       10                          3.9                                  2.0
VE_OVE   3765                      77                          2.0                                  15.0
VE_POD   2400                      28                          1.2                                  5.5
VE_REL   30                        1                           3.3                                  0.2
empty    2490                      38                          1.5                                  7.4

Most articles without approved recommendations (28.3%) came from the section GI_ART, which contains articles related to all kinds of recreation, such as art, music, and other forms of leisure.

5.3 Discussion

One of the most remarkable observations during the first period is that the average MAP scores show an upward trend in the beginning (Figure 5.3), while the P@n scores show a more downward trend in the beginning (Figure 5.4). Looking further into this phenomenon, it turns out to be related to the editor who was judging at the time. Editor 18, who judged the first 164 articles with 97 articles having no approved recommendations, approved on average only 1 recommendation per article for his other 67 judged articles. For the MAP scores this is favourable, because a single approval per article yields a MAP score of 1.0 whenever that approval is ranked first. For the P@n scores this is unfavourable, because one approval per article means that P@5 can be at most 0.2. The P@n scores are therefore not very high at the beginning, but they improve later on, when Editor 4 made his judgments. This editor has an average of 4.2 approvals per article, which can result in a P@5 score of 0.8 if four approved articles are in the top 5 recommendations. Due to these differences in approval behaviour among editors, we prefer MAP as a measurement instrument over P@n.

If we look beyond the first period of six weeks, we notice that the share of judged articles without approved recommendations drops to 22% after 1400 judged articles (Figure 5.2). In the end, 663 out of the 2951 judged articles (22.5%) had no approved recommendations. This means that the system returned better recommendations over time, resulting in more approved recommendations by the editors. A certain share of judged articles without approved recommendations, such as the remaining 22.5%, will probably always remain: there will always be some articles that do not have any good recommendations, because their topic is unique in comparison to all other articles in the index. This could also be seen in Table 5.5, where 28.3% of the judged articles without approved recommendations came from the section GI_ART.

Finally, we looked at the different sections and the number of judged articles without approved recommendations. Most of these articles with zero approvals came from the GI_ART section.
Because these articles are mostly not related to news, it is probably more difficult to generate good recommendations for them. This is related to the problem mentioned in Chapter 2: when an article is very unique, it cannot be matched well against the rest of the articles, which leads to poor recommendations (Claypool, Gokhale, Miranda, Murnikov, Netes, & Sartin, 1999).

6 Incorporating temporal information in recommendations

A very important, if not the most important, aspect of a newspaper website is recency. Only articles of the same day or a couple of days old are interesting for visitors of the website. If an article is published a couple of days after the event has happened, it is not news anymore. Because news depends very much on recency, it is likely that this also holds for the recommendations. Therefore we want to find out to what extent recency is important for the recommendations made by the system. In this chapter we will first describe the experimental setup, then the results, and finally we will discuss the second research question, "What kind of influence does recency have on generating recommendations?", which is related to incorporating temporal information in the recommendations.

6.1 Experimental setup

To incorporate dates in the recommendations we had to combine the relevance scores returned by Lemur with the difference in days between the focus and recommended articles. We decided to calculate a linear combination of the relevance score produced by Lemur and the temporal information in the form of recency. To be able to combine these two scores, we scaled them both to the [0, 1] interval. Because we wanted a weighted average of the difference in days and the Lemur score, different formulas were drafted in order to find the best proportion between these two values. The formula that was implemented is:

Equation 6.1: Formula for combining relevance score and recency score

new_score = λ · 1 / √(d + 1) + (1 − λ) · (score − score_min) / (score_max − score_min)

where
d = difference in days between the focus and recommended article
λ = linear combination factor between 0 and 1
score = Lemur relevance score of the recommended article
score_min = lowest Lemur score of the 49 recommended articles
score_max = highest Lemur score of the 49 recommended articles

To compute the new score we first take the difference in days between the focus article and the recommended article as d. To this number d we add 1 to prevent a division by zero when the focus and recommended article are from the same day. We then divide one by the square root of d + 1, because this moderates the influence of the recency discount. After that we multiply this with a variable factor λ, which can be changed to assign more weight to recency or to relevance if desired. We decided to vary this λ-variable over five values: 0, 0.25, 0.5, 0.75, or 1. If the λ-value is 0, the difference in days is not involved in the calculation, because it is multiplied by 0. If the λ-value is 1, only the difference in days is involved in the calculation, because the normalized Lemur score is then multiplied by 0. Consequently, λ-value 0.25 leans more towards the Lemur score, while λ-value 0.75 leans more towards the difference in days. This also means that λ-value 0.5 takes both the Lemur score and the difference in days into account in equal amounts.
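The reordering step based on Equation 6.1 can be sketched as follows. This is an illustrative Python sketch of the linear combination as described above, not the original PHP implementation; the data structure used for a recommendation is an assumption made for the example.

```python
# Illustrative sketch of the reordering in Equation 6.1: a min-max-normalized
# Lemur relevance score combined with a recency discount 1 / sqrt(d + 1),
# weighted by lambda. Not the original Trouw PHP code.
from math import sqrt

def rerank(recommendations, lam):
    """recommendations: list of (article_id, lemur_score, days_difference)."""
    scores = [score for _, score, _ in recommendations]
    s_min, s_max = min(scores), max(scores)
    span = (s_max - s_min) or 1.0            # guard against identical scores

    def combined(item):
        _, score, d = item
        relevance = (score - s_min) / span   # normalized to [0, 1]
        recency = 1.0 / sqrt(d + 1)          # d = 0 for same-day articles
        return lam * recency + (1 - lam) * relevance

    return sorted(recommendations, key=combined, reverse=True)

# With lambda = 0.5, relevance and recency count equally; with lambda = 0,
# the ordering is purely by the (normalized) Lemur score.
candidates = [("a", -2.1, 0), ("b", -1.4, 30), ("c", -3.0, 2)]
print(rerank(candidates, lam=0.5))
```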
To obtain results for all five λ-values, we randomly assigned one of the five predefined λ-values to an article as it was judged by one of the editors. Unfortunately, the editors of Trouw did not manage to judge the whole period from 21-03-2008 till 05-06-2008. Because of that, we could only use the judgments for 284 articles. These articles were divided into 190 articles with a λ-value of 0, 15 articles with a λ-value of 0.25, 24 articles with λ = 0.5, 26 articles with a λ-value of 0.75, and finally 29 articles with λ = 1. The proportions of these values are visualised in Figure 6.1, where it can be seen that λ-value 0 occurs far too often in comparison to the other λ-values. Due to a bug in the PHP code that shows the interface for judging the focus article (Figure 3.5), the randomization function did not lead to an equal distribution over the five different λ-values.

Figure 6.1: Division by number of articles for each λ-value (λ = 0: 190, λ = 0.25: 15, λ = 0.5: 24, λ = 0.75: 26, λ = 1: 29)

As can be seen in Figure 6.2, there are 9 full days where almost no λ-value other than 0 occurs; therefore only the days where all five λ-values occur will be used for evaluation. These days are 22 March, 26 until 29 March, and 2 April 2008. Only 117 articles could thus be used, so conclusions have to be drawn from a small number of results. Again, lists of ranked judgments had to be made, but this time each λ-value had to have its own list and each set of fifteen judgments had to be reordered according to Equation 6.1 above. Different variants of these lists were made, for each λ-value and for each editor. The statistical significance of the results was again determined using two-tailed, two-sample equal variance t-tests (p < 0.05).

Figure 6.2: Number of articles with their λ-value divided over the days during this period

6.2 Results

When we look at the editors who judged during this period, we see some different editors in comparison to the first period. This time there were six different editors who judged the recommendations. The number of judged articles per editor, out of the 284 judged articles, can be seen in Figure 6.3. How the λ-values were distributed over each editor can be seen in Table 6.1.

Figure 6.3: Number of judged articles per editor during second period (Editor 4: 73, Editor 7: 7, Editor 9: 11, Editor 10: 92, Editor 17: 57, Editor 18: 39)

Table 6.1: Editors of the second period with their statistics

Editor ID  λ-value  Articles judged  Average MAP score  Approvals  Average approvals per article
4          0        60               0.8278             449        7.5
4          0.25     1                1.0000             6          6.0
4          0.5      7                0.7715             36         5.1
4          0.75     4                0.6088             13         3.3
4          1        1                1.0000             1          1.0
7          0        7                0.4285             47         6.7
9          0        11               0.7197             78         7.1
10         0        92               0.5662             346        3.8
17         0        14               0.4964             14         1.0
17         0.25     9                0.3333             5          0.6
17         0.5      11               0.5290             22         2.0
17         0.75     10               0.4000             20         2.0
17         1        13               0.2248             11         0.8
18         0        7                0.7321             20         2.9
18         0.25     5                0.2200             5          1.0
18         0.5      6                0.6210             18         3.0
18         0.75     12               0.2449             14         1.2
18         1        9                0.2882             10         1.1

It is remarkable that both Editors 4 and 10 judged a high number of articles with λ-value 0. Besides the articles with λ-value 0, Editor 4 judged 13 articles with other λ-values, but Editor 10 judged 92 articles with only λ-value 0. Table 6.1 shows that only Editors 17 and 18 have a reasonably equal distribution over all five λ-values. Of all 284 articles judged during this second period, 81 articles were judged without approved recommendations (28.5%).
Taking only the 117 articles judged by Editors 4, 17, and 18, 51 articles were judged without any correct recommended articles. This means that 43.6% of these judged articles had no recommendations that were related to the focus article. Looking at each λ-value of these 117 articles in Table 6.2, we see that λ-values 0 and 0.5 have the lowest percentage of articles without approved recommendations. The average number of approvals is also highest for those two λ-values.

Table 6.2: Number of judged articles and articles without approved recommendations per λ-value

λ-value  Articles judged  Approvals  Average approvals per article  No approved recommendations  % with no approved recommendations
0        25               62         2.5                            8                            32.0
0.25     15               25         1.7                            9                            60.0
0.5      23               70         3.0                            7                            30.4
0.75     26               43         1.8                            13                           50.0
1        28               47         1.5                            14                           50.0

Looking at all judged articles per editor, as shown in Table A.2 in the Appendix, we see that Editors 10, 17, and 18 judged the most articles without approved recommendations. Editor 10 judged 31.5% of his articles without approved recommendations. Editor 17 judged more than half of his articles, 56.1%, without approved recommended articles. Finally, Editor 18 judged 48.7%, also almost half of his articles, without approved recommended articles. Of the other three editors, Editor 4 has his highest percentage at six approved recommendations (15.2%). Editor 7, with only seven articles judged, has his highest percentage at four approved recommendations (28.6%). And finally, Editor 9 has his highest percentage of 18.2% at three, seven, and eleven approved recommendations.

For calculating the MAP scores of the second period, only the judgments of Editors 4, 17, and 18 during the days 22 March, 26 March until 29 March, and 2 April 2008 were used. They were able to judge 117 articles; the average and corrected average MAP scores of each λ-value are listed in Table 6.3.

Table 6.3: MAP score for each λ-value of Editors 4, 17, and 18

λ-value  Average MAP  Corrected average MAP
0        0.8554       0.5817
0.25     0.7694       0.3078
0.5      0.8177       0.5688
0.75     0.7324       0.3662
1        0.6508       0.3254

The first column of average MAP scores is calculated without taking the articles without approved recommendations into account. The second column with corrected average MAP scores is calculated over all judged articles. As can be seen from Table 6.3, λ-values 0 and 0.5 suffer the least from articles without approved recommendations. This is because both have about 30% of their articles judged without approved recommendations, as we saw in Table 6.2, while the other three λ-values all have 50% or more of their articles without approved recommendations. The lowest corrected average MAP score is for λ-value 0.25, while λ-value 1 has the lowest plain average MAP score.

Figure 6.4 shows the average MAP scores of the five different λ-values for Editors 4, 17, and 18 during the period of 22 March, 26 March until 29 March, and 2 April 2008.

Figure 6.4: Average MAP scores of all five λ-values judged by Editors 4, 17, and 18

These are again the average MAP scores without the articles with zero approved recommendations. Both λ-values 0 and 0.5 score very similarly, ending up with average MAP scores of 0.8554 and 0.8177, respectively. λ-value 0.25 does not perform badly according to this figure, but has the disadvantage that 60% of its articles were judged without approved recommendations.
We looked at the differences between the MAP scores, but only the differences in MAP scores between λ-values 0 and 1, and between 0.5 and 1, were significant. All judged articles, including those with zero approvals, were taken into account when testing these differences.

Table 6.4: Corrected average P@n scores for each λ-value of Editors 4, 17, and 18

λ-value  P@5     P@10    P@15
0        0.2960  0.1960  0.1653
0.25     0.1200  0.0667  0.0533
0.5      0.3913  0.2565  0.1855
0.75     0.2231  0.1500  0.1179
1        0.2000  0.1500  0.1024

The P@5 scores in Figure 6.5 show that λ-value 0.5 performed best over time, although a downward trend is visible. It is remarkable that λ-value 0 scored low on the P@5 scores, while it scored high on the MAP scores. There are significant differences in the P@5 scores between λ-values 0 and 0.25, between 0.25 and 0.5, and finally between 0.5 and 1. Again, all judged articles were taken into account when testing these differences.

Figure 6.5: Average P@5 scores of all five λ-values judged by Editors 4, 17, and 18

6.3 Discussion

It is regrettable that only 117 judged articles could be used to evaluate the performance of the algorithm used during this second period. Not only did the editors of Trouw manage to judge just 284 articles, but the fact that the random assignment in the PHP script did not function properly is also very disappointing. Due to the small number of articles that could be used for evaluating this second research question, it is hard to tell whether these results also hold in the long term.

It looks like λ-value 0.5 would be a good alternative to the baseline Lemur algorithm. Not only were its MAP scores almost the same as for λ-value 0, it even performed better than λ-value 0 on the P@n scores. It also had the highest average number of approvals and the lowest number of judged articles without approved recommendations. However, the differences between λ-value 0.5 and λ-value 0 were not significant for either the MAP or the P@5 scores. Figure 6.5 showed that the P@5 scores of λ-value 0.5 were the highest of all λ-values, while Figure 6.4 shows that λ-value 0 had the highest MAP scores at the end. The higher P@5 scores of λ-value 0.5 mean that there are more approved recommendations in the top 5 recommendations. The P@5 scores of λ-value 0 show that there were only two or fewer approved recommendations in the top 5, because the P@5 scores were 0.4 or lower. Given the high MAP scores for λ-value 0, these two approved recommendations were probably at positions 1 and 2 (MAP score of 1.0), or 1 and 3 (MAP score of 0.8333).

When only recency was taken into account, which was the case with λ-value 1, the scores for both MAP and P@n were not very good in comparison to the other λ-values. The average MAP score of λ-value 1 ends up at 0.6508, with a corrected average MAP score of 0.3254. The average P@5 score of λ-value 1 is also 0.4, but the corrected average P@5 score is only 0.2. However, λ-value 0.25 scored worst for both corrected averages, with 0.3078 for MAP and 0.1200 for P@5. This is probably caused by the small number of judged articles (15) and the high percentage of judged articles without approved recommendations (60%).
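The significance claims in this and the surrounding chapters are based on two-tailed, two-sample equal-variance t-tests with p < 0.05, computed over per-article scores including the zero values. The thesis does not show how these tests were run; the snippet below is a hypothetical sketch of such a test using SciPy, with made-up score lists.

```python
# Hypothetical sketch of the significance test described in the text:
# a two-tailed, two-sample equal-variance t-test (p < 0.05) over per-article
# MAP scores, including zero scores for articles without approvals.
from scipy.stats import ttest_ind

map_lambda_00 = [1.0, 0.8333, 0.0, 0.5, 1.0, 0.0]   # made-up example scores
map_lambda_05 = [1.0, 0.75, 0.3333, 0.0, 1.0, 0.5]  # made-up example scores

t_stat, p_value = ttest_ind(map_lambda_00, map_lambda_05, equal_var=True)
print(t_stat, p_value, p_value < 0.05)
```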
7 Incorporating other metadata in recommendations

In this chapter we attempt to answer our last research question: "What kind of influence does author or section metadata have on generating recommendations?". It is likely that articles from the same section, for example articles in the section "Sport", are more related to each other than articles from different sections. The same could hold for articles by the same author, who is likely to write different articles about similar topics. We will first explain how these two types of metadata are incorporated in the recommender system. After that we will look at the results of the judgments with this incorporated metadata.

7.1 Experimental setup

Because we already had some experience with combining metadata with the standard Lemur score from the previous research question, we were able to use a similar kind of computation for this metadata. For each article, information about the author and section is stored in the MySQL database. With these data it is possible to compare the metadata of the focus article with the metadata of its recommended articles. We therefore use the author and section data from the database to reorder the recommendations. We made use of the following calculation to incorporate the author and section metadata:

Equation 7.1: Formula for combining author and section metadata with the relevance score from Lemur

Author AND section match:     new_score = e^score · 2
Author OR section match:      new_score = e^score · 1.5
Neither author NOR section:   new_score = e^score · 1

The relevance scores from Lemur, as they are stored in the Trouw database, range from approximately −1 to −8. In fact these correspond to small positive numbers, as they are the results of the standard Language Model (KLD & JL Smoothing) algorithm described in Chapter 3; Lemur returns the natural logarithm of these small positive numbers, resulting in larger negative numbers. In order to combine the Lemur score with the metadata, we decided to use the original small positive numbers (by exponentiating the stored scores) and to give a bonus if there was any similarity in metadata. So if both the author and the section of the focus and recommended article matched, the exponentiated Lemur score was multiplied by two. If only the author or only the section matched, the exponentiated Lemur score was multiplied by one and a half. Finally, if neither the author nor the section matched, the plain exponentiated Lemur score was used.

For the six-week period from 6 June 2008 till 20 August 2008, we only have judgments that were reordered according to Equation 7.1 above. During this third period, 1350 articles received recommendations from the system (Figure 7.1). Out of these 1350 articles, the editors managed to judge 1090 articles (81%); 224 of these 1090 articles were judged with no correct recommendations (20.6%).

Figure 7.1: Recommended articles during third period (one or more approvals: 866, 64%; no approvals: 224, 17%; unjudged recommendations: 260, 19%)

Due to some misunderstanding, all articles judged during this period were recommended with the same (metadata) algorithm. Fortunately, the baseline Lemur score was used again while judging the articles for the period after 20 August 2008, so we were able to use the same number of judged articles from that period for comparison. Out of these 1090 articles, 209 were judged with no correct recommendations (19.2%).
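The reordering implied by Equation 7.1 can be sketched as follows. This is an illustrative Python sketch under the assumptions described above (exponentiating the stored log score and multiplying by 2, 1.5, or 1 depending on the match); the field names and data layout are hypothetical, not the actual Trouw database schema or implementation.

```python
# Illustrative sketch of the metadata bonus in Equation 7.1; field names and
# the candidate tuples are hypothetical, not the actual Trouw implementation.
from math import exp

def metadata_score(lemur_log_score, focus, candidate):
    """Exponentiate the stored log score and apply the author/section bonus."""
    base = exp(lemur_log_score)        # back to the original small positive number
    author_match = focus["author"] == candidate["author"]
    section_match = focus["section"] == candidate["section"]
    if author_match and section_match:
        return base * 2.0              # author AND section
    if author_match or section_match:
        return base * 1.5              # author OR section
    return base                        # neither author nor section

focus = {"author": "Koos Dijksterhuis", "section": "NI_NED"}
candidates = [
    ("a", -2.3, {"author": "Koos Dijksterhuis", "section": "NI_NED"}),
    ("b", -1.9, {"author": "-", "section": "NI_NED"}),
    ("c", -1.5, {"author": "-", "section": "VE_POD"}),
]
reranked = sorted(candidates,
                  key=lambda item: metadata_score(item[1], focus, item[2]),
                  reverse=True)
print([article_id for article_id, _, _ in reranked])
```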
Seven editors judged the recommendations in both periods; besides these, there were two other editors who judged in only one of the two periods. We again used two-tailed, two-sample equal variance t-tests (p < 0.05) to determine the statistical significance of the results.

7.2 Results

We created 2180 lists, of which 1090 were based on the recommendations made with the author and/or section algorithm during this third period, and 1090 were based on the baseline Lemur algorithm after this third period. Now we can examine whether the performance of the system improves when author and section metadata are incorporated. We calculated the average MAP scores for both periods and plotted them in Figure 7.2.

Figure 7.2: Average MAP scores of judged articles over time during both periods

The blue line represents the judged articles of the third period, while the red line shows the average MAP scores of the judged articles after that third period. It is remarkable that both lines seem to show similar movements until 300 judged articles, although they come from different periods. After these 300 judged articles, the baseline Lemur scores increase to a top average MAP score of 0.9167, while the metadata scores decrease to a low of 0.8262. Because both periods contained judged articles without approved recommendations, the corrected average MAP scores are lower: 0.6536 for the metadata algorithm and 0.7142 for the baseline Lemur algorithm. Taking all MAP scores for these two periods, including those of articles without approved recommendations, the difference between them was significant.

The average P@5 scores of the judged articles over time are shown in Figure 7.3. The starting points of the two periods are very different. While the average P@5 score of the metadata algorithm starts high, at a maximum of 0.8, the average P@5 score of the baseline Lemur algorithm starts at a minimum of 0.2. The recommendations for the first articles therefore seem better for the metadata algorithm than for the baseline Lemur algorithm. After 50 judged articles, both periods are quite similar with an average P@5 score of 0.5, and these P@5 scores are not significantly different from each other. The corrected average P@5 scores are 0.3943 for the metadata algorithm and 0.3701 for the baseline Lemur algorithm.

Figure 7.3: Average P@5 scores of judged articles over time during both periods

In both periods, a group of eight editors made the judgments of the recommendations; Editors 13 and 16 each judged in only one of the two periods. The number of articles judged, average MAP scores, corrected average MAP scores, approvals, and average approvals of all editors can be seen in Table 7.1 and Table 7.2.
Table 7.1: Editors who judged during the third period

Editor ID  Articles judged  % articles with no approved recommendations  Average MAP score  Corrected average MAP score  Approvals  Average approvals
4          91               0.0                                          0.8651             0.8651                       496        5.5
10         47               40.4                                         0.7574             0.4512                       137        2.9
11         40               20.0                                         0.8104             0.6484                       114        2.9
12         40               5.0                                          0.7006             0.6656                       202        5.1
15         30               16.7                                         0.6087             0.5073                       73         2.4
16         463              14.7                                         0.8260             0.7047                       1433       3.1
17         284              31.0                                         0.8603             0.5937                       614        2.2
18         97               36.1                                         0.8199             0.5241                       186        1.9

Editor 4 judged all of his articles with one or more correct recommendations; therefore his corrected average MAP score of 0.8651 is equal to his average MAP score. All other editors judged one or more articles without approved recommendations. Editor 10 has the highest percentage of judged articles without approved recommendations (40.4%) and, because of this, the lowest corrected average MAP score of 0.4512.

Table 7.2: Editors who judged after the third period

Editor ID  Articles judged  % articles with no approved recommendations  Average MAP score  Corrected average MAP score  Approvals  Average approvals
4          11               0.0                                          0.9197             0.9191                       52         4.7
10         49               26.5                                         0.8723             0.6409                       72         1.5
11         581              17.6                                         0.9375             0.7729                       1236       2.1
12         297              5.7                                          0.7978             0.7521                       918        3.1
13         18               5.6                                          0.8696             0.8213                       104        5.7
15         127              19.7                                         0.8723             0.7006                       254        2.0
17         152              44.7                                         0.9045             0.4999                       202        1.3
18         41               26.8                                         0.9269             0.6783                       122        3.0

As can be seen in Table 7.2, for the period after the third period, Editor 4 again judged all of his articles with one or more correct recommendations, but this time he judged only one day with 11 articles. The percentages of articles without approved recommendations are very similar for Editors 11, 12, and 15: in both periods they judged respectively around 20%, 5%, and again 20% of their articles without approved recommendations. In this period after the third period, Editor 17 judged the most articles without approved recommendations (44.7%), meaning that almost half of the articles he judged had no correct recommendations. If we look at the MAP scores of the seven editors who judged in both periods, for four of them the scores are significantly different between the two periods: Editors 10, 11, 15, and 17. Editors 10, 11, and 15 scored better with the baseline Lemur algorithm, while Editor 17 scored better with the metadata algorithm.

We also looked at the authors of the judged articles that were recommended during both periods. In the period from 6 June 2008 till 20 August 2008, 340 different authors wrote the 1092 judged articles. As can be seen in Table 7.3, the author with the most articles during this period was just a dash (-). The numbers two, three, and five were all groups of editors from different sections (foreign countries, politics, and economy). Only the fourth, Koos Dijksterhuis, is a real person, who wrote 28 of the judged articles.

Table 7.3: Top five authors of judged focus articles during the third period

Name of the author            Number of articles
-                             119
Van onze redactie buitenland  43
Van onze redactie politiek    31
Koos Dijksterhuis             28
Van onze redactie economie    22

For the same-length period after this third period, with the standard Lemur algorithm, 324 different authors wrote the 1097 judged articles. The same authors that wrote the articles for the third period were also the top five authors for the articles after that period. Only the number of articles written differs for all five authors, as can be seen in Table 7.4.
Table 7.4: Top five authors of judged focus articles after the third period

Name of the author            Number of articles
-                             104
Van onze redactie economie    40
Koos Dijksterhuis             37
Van onze redactie buitenland  36
Van onze redactie politiek    34

So in the top 5 there is almost no difference in authors between the two periods, only in the number of articles. However, we found that there is a problem that can occur with authors. When looking at the list of different authors, it was remarkable to see that sometimes the location of the author was also included in his name. An example is the author "Frank Kools", who also appears under names such as "Frank Kools New Hampshire", "Frank Kools New York", "Frank Kools Pikeville, Kentucky", "Frank Kools Richmond, Virginia", "Frank Kools Steelton, Pennsylvania", and "Frank Kools Washington". This author is probably an American columnist, given all the different American cities added after his name. But this makes simple comparison very hard, because of all these different names for one author.

7.3 Discussion

From the results of the average MAP scores during this third period, it is clear that incorporating author and section metadata had a negative influence on the performance of the system. The MAP scores of the metadata algorithm go downward, while the MAP scores of the baseline Lemur algorithm show an upward trend, and these differences were significant. Although the P@5 scores were slightly higher for the metadata algorithm during the third period, the differences between the two algorithms were not significant. This is also the case for the P@10 and P@15 scores, which show the same trend over time as the P@5 scores.

Looking at the different editors during the period of this third research question, we saw that seven editors judged articles during both periods. Four of these seven editors judged the articles with the metadata algorithm significantly differently from the articles with the baseline Lemur algorithm. In general, the corrected average MAP scores for articles judged with the baseline Lemur algorithm were higher than those for articles judged with the metadata algorithm. Only Editor 17 judged the articles with the metadata algorithm higher, with a corrected average MAP score of 0.5937, against a corrected average MAP score of 0.4999 for the articles with the baseline Lemur algorithm.

Finally, the author metadata turned out to be harder to compare than we had anticipated beforehand. The most common author in both periods was just a dash, meaning that the author field was empty during indexing. Besides that, three of the top five authors were groups of editors covering a specific topic, which can be seen as rough versions of the section. Also, the problem that one author can have different names, including the location or section he writes the article for, should be solved in order to make good comparisons between the focus and recommended articles.

8 Conclusion and future work

Now that all research questions have been described and discussed in the previous chapters, we present our conclusions in this chapter. First we draw our conclusions for each research question separately. After that, we discuss what future work on the Trouw Recommender should be focussed on.
8.1 Article growth

Our first research question was: "What kind of influence does article growth have on generating recommendations?". Looking at article growth in relation to the measurement tools MAP and P@n, we saw that for both measures the average scores improve over time. However, the highest average MAP score was already measured after about 100 judged articles and remained reasonably constant afterwards. The corrected average P@n scores, on the other hand, had their highest averages after 350 judged articles and show a more downward trend afterwards. In general, it can be said that article growth had less influence on the MAP scores than on the P@n scores.

Looking at the number of approvals per article, we can say that article growth had a positive influence on the performance of the system. For the whole period in which the baseline Lemur algorithm was used, the share of judged articles without approved recommendations dropped from 75% after 50 judged articles to about 20% after 2900 judged articles. The system apparently returned better recommendations over time, resulting in more judged articles with approved recommendations.

8.2 Incorporating temporal information in recommendations

The second research question of this thesis was: "What kind of influence does recency have on generating recommendations?". Using the MAP and P@n measurements, we saw that λ-value 0 scored best on MAP, while λ-value 0.5 scored best on P@n. Where λ-value 0 relies only on the Lemur relevance score, λ-value 0.5 takes both recency and the Lemur relevance score into account equally. Besides that, λ-value 0.5 scored second best, after λ-value 0, on the average MAP scores. When only recency was taken into account, which was the case with λ-value 1, both MAP and P@n had the lowest averages. However, the corrected averages for both MAP and P@n were lowest for λ-value 0.25, due to the small number of judged articles and the high share of judged articles without approved recommendations.

So for the MAP measure, recency did not lead to better recommendations, while for P@n, recency in combination with Lemur relevance did result in better recommendations. In both cases, however, the differences in scores between λ-value 0 and λ-value 0.5 were not significant. It can therefore be said that incorporating temporal information had no demonstrable influence on generating better recommendations. This may be because the recommendations were already recent, also for the baseline Lemur algorithm, so that no significant differences between the two algorithms appear. The small number of judged articles for each λ-value probably also contributed to these results, so it is not clear whether these results also hold for the performance of the Trouw Recommender in the long term.

8.3 Incorporating other metadata in recommendations

Our third research question was: "What kind of influence does author or section metadata have on generating recommendations?". We used two periods to compare the baseline Lemur algorithm with the metadata algorithm. Looking at the average MAP scores for these two periods, the baseline Lemur algorithm scored better than the metadata algorithm, and the difference in MAP scores between these two periods was significant.
The P@n scores, however, were slightly better for the metadata algorithm: a corrected average P@5 score of 0.3943, against 0.3701 for the baseline Lemur algorithm. This difference was not significant. Given the significant differences in the MAP scores for the two periods, it can be said that author and/or section metadata did not lead to better recommendations; the baseline Lemur algorithm performed better over the same number of articles. The author metadata appeared to be too messy to make good comparisons between the author of the focus article and the authors of the recommended articles. It is also possible that many recommendations already came from the same section as the focus article; incorporating this metadata would then hardly change the ordering of the recommendations.

8.4 Future work

Now that we have investigated article growth, incorporating temporal data, and incorporating metadata for the Trouw Recommender, other options could be researched. One option related to incorporating temporal data is to reorder only the top 15 recommendations according to recency. This would take the best recommendations according to the baseline Lemur algorithm and show them ordered by date instead of by Lemur relevance score. It would be interesting to see how this method performs in comparison to Equation 6.1 with λ-value 0.5, as discussed in Chapter 6.

Another option that could result in better recommendations for the Trouw Recommender is relevance feedback. With relevance feedback it is possible to use the information about the results of the judgments, whether or not they were relevant, to perform a new query. This should lead to better recommendations when Lemur uses this query to generate recommendations again. However, it is questionable whether the same query will be generated again in the future: because the Trouw Recommender uses the whole focus article as its query, it is not likely that the exact same query will occur again.

The problem that one author can occur under different names in the MySQL database should be solved in future work, to see if incorporating "real" author metadata could lead to better recommendations. One option is to cut off all words that follow the real author name, for example locations or sections, and keep only the author's first and last name.

During the subjective evaluation, we mentioned that always having one article with a 100% confidence score was sometimes seen as a problem. The fact that there is always one article with 100% confidence is a consequence of normalization. This should be solved in future work, otherwise the editors could think that the system is not working properly.

As we noted in the related work, online newspapers have the advantage that they can make use of personalization. If we look at the possibilities for this in combination with the Trouw Recommender, we would first need to build up user profiles. These user profiles can then be used to capture the users' interests, and these interests can then be used to generate personal recommendations. This would, however, have a great impact on the architecture of both the Trouw Recommender and the Trouw website.
Because the Trouw Recommender would need to generate personal recommendations for each unique user, it would become more time-consuming and would probably need more resources for computing the recommendations. The Trouw website, on the other hand, would need a login for its users. Besides that, it would have to collect the click-through data of each user, which can be used to build up that user's profile. A lot of research has to be conducted before personalization can be used in combination with the Trouw Recommender.

References

Balabanovic, M., & Shoham, Y. (1997). Fab: content-based, collaborative recommendation. Communications of the ACM, 40(3), 66-72.

Bogers, T., & van den Bosch, A. (2007). Comparing and evaluating information retrieval algorithms for news recommendation. RecSys '07: Proceedings of the 2007 ACM Conference on Recommender Systems (pp. 141-144). ACM Press.

Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence, 43-52.

Claypool, M., Gokhale, A., Miranda, T., Murnikov, P., Netes, D., & Sartin, M. (1999). Combining content-based and collaborative filters in an online newspaper. Proceedings of the ACM SIGIR Workshop on Recommender Systems.

Croft, W. B., Metzler, D., & Strohman, T. (2009). Search Engines: Information Retrieval in Practice. Addison Wesley.

Das, A., Datar, M., & Garg, A. (2007). Google news personalization: scalable online collaborative filtering. WWW '07: Proceedings of the 16th International Conference on World Wide Web (pp. 271-280). New York, NY, USA: ACM Press.

Fernández, R. T. (2007). The effect of smoothing in language models for novelty detection. Future Directions in Information Access, FDIA'2007. Glasgow.

Goldberg, D., Nichols, D., Oki, B., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61-70.

Hanani, U., Shapira, B., & Shoval, P. (2001). Information filtering: Overview of issues, research and systems. User Modeling and User-Adapted Interaction, 11(3), 203-259.

Kamba, T., Bharat, K., & Albers, M. (1994). The Krakatoa Chronicle: An interactive, personalized newspaper on the web. Proceedings of the 4th International Conference on World Wide Web (pp. 159-170).

Maidel, V., Shoval, P., Shapira, B., & Taieb-Maimon, M. (2008). Evaluation of an ontology-content based filtering method for a personalized newspaper. RecSys '08: Proceedings of the 2008 ACM Conference on Recommender Systems (pp. 91-98). New York, NY, USA: ACM Press.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.

McAdams, M. (1995). Inventing an online newspaper. Interpersonal Computing and Technology, 64-90.

Mooney, R. J., & Roy, L. (2000). Content-based book recommending using learning for text categorization. DL '00: Proceedings of the Fifth ACM Conference on Digital Libraries (pp. 195-240). New York, NY, USA: ACM Press.

Newspaper. (2001, 12 28). Retrieved 02 12, 2009 from Wikipedia, the free encyclopedia: http://en.wikipedia.org/wiki/Newspaper

Pennock, D. M., Horvitz, E., Lawrence, S., & Giles, C. L. (2000). Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (pp. 473-480). San Francisco: Morgan Kaufmann.
Ponte, J., & Croft, W. (1998). A language modeling approach to information retrieval. SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 275-281). New York, NY, USA: ACM Press.

PricewaterhouseCoopers. (2008, 09 7). IAB Internet Advertising Revenue. Retrieved 02 11, 2009 from IAB: http://www.iab.net/media/file/IAB_PWC_2008_6m.pdf

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of netnews. CSCW '94: ACM Conference on Computer Supported Cooperative Work (pp. 175-186). New York, NY, USA: ACM Press.

Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. WWW '01: Proceedings of the 10th International Conference on World Wide Web (pp. 285-295). New York, NY, USA: ACM Press.

Sigmund, J. (2008, 04 14). Newspaper web sites attract record audiences in first quarter. Retrieved 05 06, 2008 from Newspaper Association of America: www.naa.org

Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, 24(4), 35-43.

Trouw. (2004, 01 30). Retrieved 10 27, 2008 from Wikipedia, de vrije encyclopedie: http://nl.wikipedia.org/wiki/Trouw_(krant)

Yang, C., Chen, H., & Hong, K. (2003). Visualization of large category map for internet browsing. Decision Support Systems, 35(1), 89-102.

Appendix A

Figure A.1: Trouw Recommender full article view

Figure A.2: Trouw Recommender interface for judging article "Conferentie Antillen lokt demonstratie uit"

Figure A.3: Article "Conferentie Antillen lokt demonstratie uit" on the Trouw website with the two recommended articles in the grey box on the bottom left

Figure A.4: New articles added during the first period

Table A.1: Approvals and percentages of all judged articles for each editor in the first period

           Editor 4                    Editor 11                   Editor 18
Approvals  Articles  % of articles     Articles  % of articles     Articles  % of articles
0          3         3.3               2         14.3              115       59
1          9         10                5         35.7              30        15.4
2          16        17.8              6         42.9              23        11.8
3          16        17.8              0         0                 15        7.7
4          17        18.9              0         0                 4         2.1
5          12        13.3              0         0                 2         1
6          3         3.3               0         0                 2         1
7          2         2.2               1         7.1               1         0.5
8          3         3.3               0         0                 0         0
9          1         1.1               0         0                 1         0.5
10         2         2.2               0         0                 0         0
11         1         1.1               0         0                 0         0
12         1         1.1               0         0                 1         0.5
13         1         1.1               0         0                 0         0
14         2         2.2               0         0                 1         0.5
15         1         1.1               0         0                 0         0

Table A.2: Approvals of all judged articles for each editor in the second period

           Editor 4                    Editor 7                    Editor 9
Approvals  Articles  % of articles     Articles  % of articles     Articles  % of articles
0          0         0                 1         16.7              0         0
1          2         2.5               0         0                 1         9.1
2          10        12.7              0         0                 1         9.1
3          8         10.1              0         0                 2         18.2
4          7         8.9               1         16.7              0         0
5          9         11.4              1         16.7              0         0
6          12        15.2              0         0                 0         0
7          4         5.1               1         16.7              2         18.2
8          2         2.5               0         0                 0         0
9          3         3.8               0         0                 1         9.1
10         4         5.1               0         0                 2         18.2
11         6         7.6               0         0                 1         9.1
12         3         3.8               1         16.7              0         0
13         1         1.3               0         0                 0         0
14         1         1.3               0         0                 0         0
15         7         8.7               1         16.7              1         9.1

           Editor 10                   Editor 17                   Editor 18
Approvals  Articles  % of articles     Articles  % of articles     Articles  % of articles
0          29        31.5              32        56.1              19        48.7
1          15        16.3              8         14.0              4         10.3
2          12        13.0              7         12.3              7         18.0
3          7         7.6               3         5.3               2         5.1
4          2         2.2               5         8.8               1         2.6
5          1         1.1               0         0                 3         7.7
6          3         3.3               0         0                 0         0
7          4         4.4               1         1.8               1         2.6
8          3         3.3               0         0                 1         2.6
9          1         1.1               0         0                 1         2.6
10         2         2.2               0         0                 0         0
11         3         3.3               0         0                 0         0
12         2         2.2               0         0                 0         0
13         0         0                 0         0                 0         0
14         3         3.3               1         1.8               0         0
15         5         5.4               0         0                 0         0