Context-aware recommender system for large e-commerce platforms

by Jacek Wasilewski

A thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Computer Science in the Institute of Computing Science of the Poznań University of Technology, Poznań

Thesis supervisor: Mikołaj Morzy, PhD, DSc.

June 2013

Contents

1 Introduction   1
2 Related work   5
  2.1 Context   6
  2.2 Recommender system   9
  2.3 Context-aware recommender systems   14
3 Dataset   17
  3.1 Source   18
  3.2 Dataset structure   18
4 Text preprocessing   21
  4.1 General transformations   22
  4.2 Stop words removal   23
  4.3 Stemming   23
  4.4 Tags removal   24
5 Context-aware recommender system   27
  5.1 General idea   28
  5.2 Text preprocessing   31
  5.3 Category-based context creation   33
    5.3.1 Description   33
    5.3.2 Experiment   35
  5.4 Latent Semantic Indexing-based context creation   37
    5.4.1 Description   37
    5.4.2 Experiment   40
  5.5 Network-based context creation   45
    5.5.1 Description   45
    5.5.2 Experiment   52
6 Conclusions   59
Bibliography   62
A Category-based context creation experiment data   65
  A.1 Items data   66
  A.2 Recommended items   66
B Latent Semantic Indexing experiment data   71
  B.1 Items data   72
  B.2 Term-document frequency matrix   73
  B.3 Singular Value Decomposition   74
  B.4 Queries   76
C Network-based context creation experiment data   77
  C.1 Categories networks   78
  C.2 Compressed categories networks   81
  C.3 Modules of merged network   84
List of Figures   87
List of Tables   88

Chapter 1

Introduction

Nowadays, the amount of data and the number of different items we can buy are several times larger than what was available only a few years ago. It is no longer possible to compare every item of the type we are interested in purchasing in order to choose the one that suits us perfectly. We may also not be aware of the different variants of a product, or of all the additional accessories we might need while using the product we have chosen. Given today's technological progress, we expect specialized systems to analyze our needs and the available products, and then suggest which product would be best for us and what else we might need or be interested in. Without a doubt, personalization and recommendation are key to all of this.
There are several types of recommender systems, but the majority of existing approaches focus on recommending the most similar items according to users' interests and recent views, and usually do not take into account other, sometimes important, information such as time, place, mood or connections between items. For instance, if we are viewing an eBay listing for a bag of vanilla-flavored ground coffee, a typical recommender system will return other listings of the same type of coffee. Usually this is what we expect from recommender systems. In many cases, however, rather than items of the same type as the recently viewed one, we may be expecting items that have something in common with it. For example, if we are looking for a volleyball we might want to see, besides other volleyballs, also t-shirts and shorts. Likewise, while browsing coffee machines we would most likely also want to buy, or at least see, bags of coffee or thermoses. Highlighting connected items is not a new idea: it is called cross-selling and is widely used even in traditional shops, where laces are placed next to shoes because we might need an additional pair.

Cross-selling is a very simple idea and is easily applicable in small e-commerce systems, because all we need is to define connections between items. When it comes to larger systems such as the Amazon, eBay or Allegro (allegro.pl) platforms, with millions of accessible items, this task becomes very difficult, and doing it manually is unreasonable and practically impossible. We must also note that items sold via large e-commerce platforms are usually offered by different sellers who describe them in various ways, so the same item can be presented differently. The amount of data and the variety of descriptions make this problem complicated, and it requires automatic and dynamic solutions.

In this thesis we discuss a specific type of recommender system - the context-aware recommender system (CARS) - which tries to build a model enriched with information about complementary products. We also address the main problems that occur while processing the data, show how these complex tasks can be divided into subtasks, and for each subtask present approaches that can be used to perform it. Because the solutions must be ready to work with a huge amount of data, we take the complexity of each task into consideration.

The rest of the thesis is organized as follows. Chapter 2 discusses the general notions of context and recommender systems, as well as how we connect these ideas. In Chapter 3 we present information about the dataset used in the experiments. Chapter 4 describes methods of text and web preprocessing. After that, in Chapter 5, we present the idea of how contexts can be constructed, along with a few methods of building them: category-based context creation, Latent Semantic Indexing-based context creation and network-based context creation. Chapter 6 contains conclusions and presents some opportunities for further work.

Chapter 2

Related work

Before we describe what exactly a context-aware recommender system is, in Section 2.1 we start by discussing the general notion of context. Then, in Section 2.2, we focus on different approaches to recommender systems. After that, in Section 2.3, we present the ideas behind context-aware recommender systems.
2.1 Context

What is a context? The question is simple, but there is no simple answer. According to the Webster dictionary[1], context is the interrelated conditions in which something exists or occurs; it is a kind of environment. The topic has been studied across many scientific disciplines, and there is even a conference, called CONTEXT, devoted to the interdisciplinary use of context in the cognitive sciences (linguistics, psychology, computer science, neuroscience) and social sciences, but also in medicine, law and business. The well-known business researcher Coimbatore Krishnarao Prahalad has suggested that companies must deliver not only competitive products but also unique, real-time customer experiences shaped by customer context, and that this would be the next big thing for CRM (Customer Relationship Management) practitioners[22]. This is only one of many definitions of context. Mary Bazire and Patrick Brézillon analyzed over 150 different definitions, coming mainly from the web, and observed that it is difficult to find a single definition satisfying every discipline; the multifaceted nature of this intuitive notion makes it difficult to define it precisely for a given area[17].

Since we focus on recommender systems in the area of e-commerce applications, we discuss the definition from the points of view of data mining, e-commerce personalization, databases, information retrieval, context-aware pervasive systems and marketing. In this we follow [11] and [8].

Data mining

Following [18], we define context as the events that characterize the life stages of a customer and discriminate his or her preferences. Examples of context in data mining include the purchase of a new car, planning a wedding, changing jobs, or the sets of items bought in a shop. Knowledge of these situations helps in discovering data mining patterns that are well fitted and accurate for them. In some cases we might be looking for information about which data are relevant to a chosen problem, e.g. which data describe the probability of an infection occurring after a liver transplant; in other cases we might want to find out what other insurance products we should have been offered if we have recently bought car insurance for a new sports car.

E-commerce personalization

In the field of e-commerce, within the meaning of [8], the context is the intent of a purchase made by a customer interested in buying a particular item. A single customer may have many contexts, one for each intent. For instance, buying a new DVD player, a set of horror movies on DVD as a gift for a friend, a new hairdryer and an old set of cups gives us four different contexts, even if all of these items were bought by the same customer, with the same account, and even from the same seller; each time the reason for the purchase was different. In [8] separate profiles are built for each reason, depending on customer behavior. Moreover, the same context can be shared by many customers.

Context-aware pervasive systems

Initially the term context-aware system was used for systems which, depending on the location of the user's mobile phone, provided information about every object near the phone. More specifically, a context-aware system was classified as one that can adapt according to this information[7].
In later works this context-awareness has pertained not only to location but also to date, season, temperature[20], the user's emotional status and any other information that might describe the relationship between the user and the application[6].

Information retrieval

Information retrieval is strongly connected with Web search, which is one of its most common applications. In this case we describe context as a set of other topics related to the search query. It has been shown that contextual data positively affect the results of information retrieval, yet most existing retrieval systems base their results only on queries, ignoring the context of the query. In those that do use a contextual approach, the retrieval techniques focus on a short-term context, such as the context of the currently issued query, in contrast to a long-term context that includes the user's tastes and preferences. Google Web search, for instance, tries to adjust search results to the user's search history.

Marketing

In the field of marketing and management, research has shown that items, their quality and other attributes are judged depending on the context or purpose of a purchase. It has been shown that in many situations customers make different buying decisions because of the different strategies they apply when buying, for example, a car versus a hairdryer. According to [12], consumers vary in their decision-making rules because of the usage situation, the use of the good or service (for family, for gift, for self) and the purchase situation (catalog sale, in-store shelf selection, salesperson-aided purchase). The previously mentioned Prahalad[22] describes context as the precise physical location of a customer at any given time, the exact minute he or she needs the service, and the kind of technological mobile device over which that experience (or service) will be received. Prahalad also focuses on delivering unique, real-time customer experiences, e.g. VOD (Video On Demand) services where the customer gets exactly what he or she wants, exactly when and where he or she wants it. This is the three-dimensional space described by Prahalad. Although he mentions real-time experiences, the idea can be applied at any time and place in the future.

This section has shown that there is no single definition of context; it is a complex term whose meaning in fact depends on the situation in which it is used.

2.2 Recommender system

The recommender system field derives directly from other fields of science, such as cognitive science, approximation theory, information retrieval, forecasting theory and customer choice modeling, and became an independent research field in the mid-1990s, when researchers started to focus on the rating problem. Commonly, the problem of creating a recommender system is reduced to the problem of providing an appropriate scoring function that also covers new users and new, unseen items[10]. Formally, let S be the set of all users and T the set of all possible and available items. Now, let u be a utility function that represents the usefulness of item t to user s. Recommender systems are usually classified into three categories, based on how recommendations are made: content-based, collaborative filtering and hybrid approaches. We now present each of these approaches.

Content-based

The content-based approach derives from information retrieval and information search research.
Because of the advances in those fields and the importance of text-based applications, many recently developed content-based recommender systems focus on recommending items containing textual information. These systems are also supported by user profiles containing data such as tastes, preferences and needs. A user's preferences can be collected directly via questionnaires, or indirectly using machine learning techniques applied to previous activity.

Formally, let Content(t) be an item profile characterized by a set of attributes extracted from the item description. This set of attributes is used to find the most appropriate or most similar items. Because content-based recommender systems are commonly used with text-based items, the content is usually represented as a set of keywords. Measuring the importance of each keyword $k_i$ in document $d_j$ requires some weighting $w_{ij}$, which can be defined in many ways. One of the most popular weighting methods, derived from information retrieval, is term frequency/inverse document frequency (TF-IDF), defined as follows[10]. Let $N$ be the total number of documents, and assume that word $k_i$ appears in $n_i$ of them. Also assume that $f_{i,j}$ is the number of times word $k_i$ appears in document $d_j$. Then the term frequency of word $k_i$ in document $d_j$ is defined as:

$$TF_{i,j} = \frac{f_{i,j}}{\max_z f_{z,j}}$$

where the maximum frequency is computed over all words $k_z$ that occur in document $d_j$. A word that occurs frequently in most documents does not help to distinguish them, so the inverse document frequency (IDF) is often used together with TF. The inverse document frequency of word $k_i$ is defined as:

$$IDF_i = \log \frac{N}{n_i}$$

The TF-IDF weight of word $k_i$ in document $d_j$ is then defined as:

$$w_{ij} = TF_{i,j} \cdot IDF_i$$

and the content of the document as:

$$Content(d_j) = (w_{1j}, w_{2j}, \ldots, w_{kj})$$

In content-based systems, the utility function $u(s, t)$ is usually defined as:

$$u(s, t) = score(UserContentPreferences(s), Content(t))$$

where UserContentPreferences can be defined in different ways and, in general, is computed from previous user behavior. Both arguments can be represented as TF-IDF vectors of keyword weights, $\vec{w}_s$ and $\vec{w}_t$. One of the most popular utility functions for comparing sets of keywords is the cosine similarity measure, defined as:

$$u(s, t) = \cos(\vec{w}_s, \vec{w}_t) = \frac{\vec{w}_s \cdot \vec{w}_t}{\|\vec{w}_s\| \, \|\vec{w}_t\|} = \frac{\sum_{i=1}^{K} w_{i,s} \, w_{i,t}}{\sqrt{\sum_{i=1}^{K} w_{i,s}^2} \, \sqrt{\sum_{i=1}^{K} w_{i,t}^2}}$$

where $K$ is the total number of words.
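To make these definitions concrete, the following is a minimal sketch (not taken from the thesis) of TF-IDF weighting and cosine scoring over a small, hypothetical document collection; the corpus and the use of one liked item's description as the UserContentPreferences stand-in are illustrative assumptions only.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """TF-IDF vectors: TF = f / max_f within a document, IDF = log(N / n_i)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    doc_freq = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        max_f = max(counts.values())
        vectors.append([(counts[w] / max_f) * math.log(n / doc_freq[w]) for w in vocab])
    return vocab, vectors

def cosine(a, b):
    """Cosine similarity between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical item descriptions; the user profile is approximated here by the
# description of one item the user liked.
docs = ["vanilla flavored ground coffee",
        "coffee machine espresso",
        "espresso machine cleaner",
        "vanilla coffee syrup"]
_, vectors = tf_idf_vectors(docs)
profile = vectors[0]
for i, vec in enumerate(vectors[1:], start=1):
    print(f"u(s, d{i}) = {cosine(profile, vec):.3f}")
```

With this toy data the last document scores highest, because it shares the relatively rare keyword "vanilla" with the profile, while words shared by all documents get an IDF of zero and contribute nothing.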
We can identify three main problems connected with the content-based approach: limited content analysis, over-specialization and the new user problem.

Content-based techniques are limited by the features that are explicitly associated with the objects recommended by those systems. Therefore, in order to have a sufficient set of features, the content must either be in a form that can be parsed automatically by a computer, or the features must be assigned to items manually. Another aspect of limited content analysis is that if two different items are represented by the same set of features, they are indistinguishable.

When the system can only recommend items that score highly against a user's profile, the user is limited to receiving items similar to those already rated; this is called over-specialization. The problem, which has also been studied in other domains, is often addressed by introducing some randomness into the recommendation process. In certain cases items should not be recommended at all if they are too similar to something the user has already seen; the diversity of recommendations is often a desirable feature in recommender systems.

Finally, the user has to rate a sufficient number of items before a content-based recommender system can really understand the user's preferences and produce reliable recommendations. Therefore, a new user, having very few ratings, will not receive accurate recommendations.

Collaborative filtering

Different approaches exist within the collaborative filtering method, where the system tries to predict relevant items based on users' ratings and on finding users with similar taste. More formally, the utility function $u(s, t)$, where $s$ is a user and $t$ is an item, is predicted based on the utilities $u(s_j, t)$ assigned to item $t$ by those users $s_j \in S$ who have taste similar to user $s$. Algorithms used in collaborative filtering can be grouped into two classes: memory-based and model-based.

Memory-based algorithms are, in general, heuristics that make rating predictions based on the ratings of the entire set of previously rated items. The rating of an unseen item is usually calculated as an aggregation of other users' ratings for that item:

$$r_{s,t} = \operatorname{aggr}_{s' \in \hat{S}} \; r_{s',t}$$

where $\hat{S}$ is a subset of $S$, usually the most similar users. As the aggregation function we can use a simple average, but in most cases the weighted sum or the adjusted weighted sum is used:

$$r_{s,t} = k \sum_{s' \in \hat{S}} sim(s, s') \times r_{s',t}$$

and

$$r_{s,t} = \bar{r}_s + k \sum_{s' \in \hat{S}} sim(s, s') \times (r_{s',t} - \bar{r}_{s'})$$

where $k$ is a normalizing factor. To determine the similarity between users we can use the cosine similarity described earlier in Section 2.2, or the Pearson correlation coefficient, defined as:

$$sim(x, y) = \frac{\sum_{t \in T_{xy}} (r_{x,t} - \bar{r}_x)(r_{y,t} - \bar{r}_y)}{\sqrt{\sum_{t \in T_{xy}} (r_{x,t} - \bar{r}_x)^2} \, \sqrt{\sum_{t \in T_{xy}} (r_{y,t} - \bar{r}_y)^2}}$$

The only difference between the cosine similarity used in the content-based approach and in collaborative filtering is that in the former we calculate similarity between TF-IDF vectors, whereas in collaborative filtering we use vectors of user-specified ratings. Collaborative filtering is traditionally used to compare users, but it can also be used to compare items; in this case correlation-based and cosine-based techniques are used to compute similarity between items and to obtain ratings for them. This idea was extended into the top-N recommendation method.
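As an illustration of the memory-based approach, here is a small sketch, under assumptions of my own (a toy rating dictionary, each user's mean taken over all of their ratings), of the adjusted weighted sum with Pearson similarity; it is not the thesis's implementation.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data.
ratings = {
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "carol": {"i1": 2, "i2": 5, "i4": 1},
}

def mean(user):
    r = ratings[user]
    return sum(r.values()) / len(r)

def pearson(x, y):
    """Pearson correlation over the items rated by both users (T_xy)."""
    common = set(ratings[x]) & set(ratings[y])
    if not common:
        return 0.0
    mx, my = mean(x), mean(y)
    num = sum((ratings[x][t] - mx) * (ratings[y][t] - my) for t in common)
    dx = sqrt(sum((ratings[x][t] - mx) ** 2 for t in common))
    dy = sqrt(sum((ratings[y][t] - my) ** 2 for t in common))
    return num / (dx * dy) if dx and dy else 0.0

def predict(user, item):
    """Adjusted weighted sum: r_{s,t} = r_s_mean + k * sum sim(s,s') (r_{s',t} - r_{s'}_mean)."""
    neighbours = [u for u in ratings if u != user and item in ratings[u]]
    sims = {u: pearson(user, u) for u in neighbours}
    norm = sum(abs(s) for s in sims.values())
    if norm == 0:
        return mean(user)
    k = 1.0 / norm
    return mean(user) + k * sum(sims[u] * (ratings[u][item] - mean(u)) for u in neighbours)

print(predict("alice", "i4"))  # predicted rating of the unseen item i4 for alice
```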
In contrast to memory-based algorithms, the model-based approach uses the collection of ratings to learn a model. Machine learning provides many methods that can be used for model training, such as the naive Bayes model, neural networks, clustering, SVM, decision trees and others. The literature also mentions approaches from other fields, such as statistical models, Gibbs sampling, probabilistic relational models, linear regression, the maximum entropy model and more complex probabilistic models, for instance the Markov decision process, probabilistic latent semantic analysis or the generative semantics of Latent Dirichlet Allocation. The main problem of the model-based approach is, in many cases, the complexity of the methods and how they handle huge amounts of data.

The collaborative method also has problems it must face: the appearance of new users and new items, and sparsity of the dataset.

The new user problem is the same as in the content-based method: in order to make accurate recommendations, the system must first learn the user's preferences from the ratings the user provides. There is no simple solution to this problem; its impact can, however, be minimized by using the hybrid approach described further on.

New items are added to recommender systems regularly. Collaborative systems rely solely on users' preferences to make recommendations, so until a new item has been rated by a substantial number of users, the recommender system is not able to recommend it. This problem is similar to the new user problem and can also be reduced by hybrid methods.

Sparsity of data is another problem of the collaborative filtering method. In any recommender system, the number of ratings already obtained is usually very small compared to the number of ratings that need to be predicted. The success of a collaborative recommender system depends on the availability of observed users' preferences. As a result, for a user whose taste is unusual compared to the rest of the population, there might not be any users with similar taste, so the recommendation results might be poor. One way to reduce this influence is to apply demographic filtering, in which we look for other users of, e.g., the same age, location or social status[10].

Several recommender systems use a hybrid approach, combining collaborative and content-based methods, which helps to avoid certain limitations of both. We can distinguish four ways of putting the collaborative and content-based approaches together, which we shortly introduce. One way to build a hybrid recommender system is to implement separate collaborative and content-based systems and then combine their outputs into one final recommendation, using either a linear combination of ratings or voting. The second method is to add collaborative characteristics to content-based models, which can be achieved, for example, by applying a dimensionality reduction technique to a group of content-based profiles. The third method is to add content-based characteristics to collaborative models: we use traditional collaborative techniques but also maintain a content-based profile for each user, and these profiles are used to calculate the similarity between two users. Finally, we can develop a unified recommendation model mixing, e.g., the user's age and the article subject. Expert knowledge can also be used to augment hybrid recommender systems, especially to prevent the new user or new item problem; its main drawback is the implicit requirement that the subject domain must be well specified or described by ontologies.

This section has presented the main approaches in the field of recommender systems, their typical characteristics and applications. We have also introduced the most commonly used formulas, which in general can be useful in solving such problems.

2.3 Context-aware recommender systems

In this section we focus on context-aware recommender systems as a union of the notions of context and recommender systems introduced before. Traditionally, all types of recommender systems deal with two types of entities, users and items, but as we have said in Section 2.1, all our behaviors have their background, as well as reasons which describe our current needs and explain our recent decisions. In the traditional two-dimensional user-item approach the reasons, e.g. of ratings, are hidden, but we cannot ignore their existence.
Sometimes it might even be more important why we have made a decision than the decision itself; it certainly provides a lot of information for making more accurate recommendations. Substantial work in this field was done by the authors of [11]. They extended the user-item approach with support for additional contextual data. This contextual data can be introduced in three ways: manually, as a set of questions, e.g. about preferences; automatically, provided by the user's device, e.g. the location of a mobile phone; or by analyzing the user's set of actions.

The different methods of using contextual data can, in general, be grouped into two categories: recommendation via context-driven querying and search, and recommendation via contextual preference elicitation and estimation. The context-driven querying and search approach is used, for example, when we are looking for a restaurant serving Korean cuisine: the recommender system finds the best-matching restaurant in the neighborhood that is currently open. The second method, contextual preference elicitation and estimation, tries to learn a model based on the user's activities and interactions with other users, but also on the user's feedback about previously recommended items; to achieve that, data analysis techniques from machine learning or data mining can be used. The two concepts can also be combined into one that has features of both.

The paper mentioned before introduces three paradigms for applying context in the workflow of a recommender system; Figure 2.1 presents all of them. In the contextual pre-filtering approach, contextual information is used to filter the data set before a traditional recommendation algorithm is applied. In the contextual post-filtering approach, recommendations are generated on the entire data set, and the resulting set of recommendations is then adjusted using the contextual information. Contextual modeling approaches use contextual information directly in the recommendation function, as an explicit predictor of a rating for an item.

Figure 2.1: Paradigms for incorporating context in recommender systems: (a) contextual pre-filtering, (b) contextual post-filtering, (c) contextual modeling. [11]

Whereas contextual pre-filtering and post-filtering approaches can use traditional two-dimensional recommendation algorithms, the contextual modeling approach uses multidimensional recommendation algorithms[13]. Examples of heuristic-based and model-based approaches are described in [11].
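As a simple illustration of the pre-filtering paradigm (a sketch of my own, not code from the thesis), context can act as a filter on the rating data before any standard 2D recommender is applied; the rating tuples, the day-of-week context and the trivial stand-in recommender below are all hypothetical.

```python
from collections import namedtuple

# Hypothetical rating records: (user, item, rating, context), where the
# context is simply the day of week on which the rating was given.
Rating = namedtuple("Rating", "user item rating context")

data = [
    Rating("u1", "i1", 5, "saturday"),
    Rating("u1", "i2", 3, "monday"),
    Rating("u2", "i1", 4, "saturday"),
    Rating("u2", "i3", 2, "saturday"),
]

def contextual_prefilter(ratings, context):
    """Keep only the ratings whose context matches the target context."""
    return [r for r in ratings if r.context == context]

def recommend_2d(ratings, user):
    """Stand-in for any traditional 2D recommender: rank items the target
    user has not rated by their average rating in the (filtered) data."""
    seen = {r.item for r in ratings if r.user == user}
    scores = {}
    for r in ratings:
        if r.item not in seen:
            scores.setdefault(r.item, []).append(r.rating)
    return sorted(((sum(v) / len(v), i) for i, v in scores.items()), reverse=True)

saturday_data = contextual_prefilter(data, "saturday")
print(recommend_2d(saturday_data, "u1"))
```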
In this section we have presented a general and most common description of how context and recommender systems work together.

This chapter has presented general information about current work in the fields of context definition, recommender system approaches and context-awareness of recommender systems. In Section 2.1 we have seen several definitions of context from different points of view, especially e-commerce and marketing. Next, we gave a brief review of traditional recommender system approaches. The last section connected the information from the two previous sections and introduced the definition of a general context-aware recommender system.

Chapter 3

Dataset

In this chapter we present the characteristics of a typical dataset used in e-commerce platforms. We give information about its source, and also present its structure and data types, to provide a good understanding of the data.

3.1 Source

Since we are working on a real, industrial problem, the best way to obtain reliable results is to work with real data. Fortunately, we have been able to obtain a real dataset from a running e-commerce platform, Allegro.pl. Allegro.pl (allegro.pl) is the biggest e-commerce platform in Poland and Eastern Europe, with several million users in Poland alone. Every day about 500 thousand items are sold on this platform, out of about 18 million available items. The dataset contains information about items and purchases from the period between September 2006 and April 2007, with a total of 305,899 items from 17,519 categories and 1,327,872 purchases. An initial subset of users was picked, and for every user in this subset their items and connected data were retrieved.

It is necessary to mention that a current dataset would probably differ from the one we have access to. The reason is the change in customer and seller behavior: for example, in our dataset 65% of all items have the "buy now" option, whereas currently (in 2013) it is about 88%. Also, most sellers are now professional sellers and companies that use the platform as another distribution channel. Because of that, the way descriptions are created, and their quality, might be different.

3.2 Dataset structure

The dataset structure is presented in Figure 3.1. Every item in the dataset is described by a few features, such as name, description, prices, starting and ending dates, and also information about quantity and the number of items sold.
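The entities and attributes shown in Figure 3.1 can be summarized as a schema sketch; the Python dataclasses below are only an illustrative rendering of the diagram, not code accompanying the thesis, and optional attributes are marked as such.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, List

@dataclass
class Category:
    id: int
    name: str

@dataclass
class User:
    id: int                      # hashed, anonymized identifier
    creating_time: datetime
    rating: int
    group: str
    activation_time: Optional[datetime] = None
    super_seller_status: Optional[bool] = None

@dataclass
class Item:
    id: int
    name: str
    price: float
    description: str
    starting_time: datetime
    ending_time: datetime
    bid_count: int
    quantity: int
    photo_count: int
    quantity_sold: int
    starting_price: Optional[float] = None
    buy_now_price: Optional[float] = None
    categories: List[Category] = field(default_factory=list)  # leaf category and the categories above it

@dataclass
class Bid:
    date: datetime
    quantity: int
    amount: float
    item: Optional[Item] = None
    buyer: Optional[User] = None

@dataclass
class Feedback:
    id: int
    creation_time: datetime
    type: str
    description: str
```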
Figure 3.1: Entity-relationship diagram representing the dataset (entities: Item, Category, User, Bid, Feedback).

Items are grouped into predefined categories, where the categories are organized as a forest; that is why every item can be connected to many categories - one leaf category and the categories above this leaf. Every category is described by a name.

For privacy reasons, the users in the dataset had to be anonymized in a way that prevents identification. Because of that, the dataset does not include personal data, only a hashed ID and information about account creation, the user's rating on the platform and internal user groups. Every user can own an item, which means that he or she is the seller of this item. A user is also able to buy an item or bid in an auction; in this case the user is called a buyer, and the information is stored in the bid relation. For every bid (transaction) the users involved (seller and buyer) can give feedback about the transaction.

The dataset contains different types of data: text, integer, float and date. All price fields use the float type, time fields use dates, and the rating, quantity, amount and count fields use integers. The name and description fields use the text type, which is the most difficult type of field because of the high dimensionality hidden in the data. The next chapter focuses on methods of preprocessing text fields so that they can be used in further processing.

Chapter 4

Text preprocessing

In this chapter we present some preprocessing tasks that usually have to be performed before any further processing. In this we follow [16]. In Section 4.1 we focus on general, simple text transformations; then, in Section 4.2, we present the stop word removal problem, both in general and in the specific form that occurs for e-commerce item descriptions. Section 4.3 is about stemming: what it is, why it is important, and how it may improve the quality of the dataset after preprocessing. Finally, in Section 4.4, we focus on tag removal and parsing web pages.

4.1 General transformations

The basic rule, usually applied at the beginning, is to ensure that all letters are converted to either upper or lower case. The first and most common problem in cleansing text is the presence of digits. In traditional information retrieval systems, tokens containing digits are removed, because typically they are not words. An exception might be specific patterns, like dates, times and others restricted by a regular expression. It is worth mentioning that in specific situations, for instance in search engines, words containing digits are not removed and are indexed. Different treatment of punctuation marks, hyphens and other special characters can lead to different, sometimes inconsistent, results, so there is no single perfect approach. For example, in English some words can be written in different ways by different people: state-of-the-art or state of the art. If we replace hyphens with spaces in the first form, we eliminate the problem of inconsistency. But we cannot simply replace special characters with spaces in every situation; the action that should be executed is not always obvious.
Some words may have a hyphen as an integral part of the word. In general, two types of special character removal are used: (1) simply removing the special character from the original text without leaving anything in its place, e.g. state-of-the-art becomes stateoftheart, and (2) replacing the special character with a space. In some situations both forms of the same word have to be stored and used in processing, because determining which form is correct is hard: for example, if we convert pre-processing into pre processing and then search for the word preprocessing, we may receive no results.

4.2 Stop words removal

Stop words are words which are filtered out prior to natural language processing. Any group of words can be used as stop words, but usually they are words that occur frequently, are insignificant in a language, and only help construct sentences without representing any specific content. For the English language, common stop words include: a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with and more. Besides language stop words, we can define other stop words depending on our needs and the application. For example, we can create a list of words that in specific situations only create noise and do not provide any useful information, e.g. words often used in item names solely to attract attention.

4.3 Stemming

Word formation and declension are part of many languages: based on one word we create another with a very similar meaning, according to the context. For example, in English we can create adjectives and adverbs from nouns, nouns have plural forms, verbs have gerund forms (created by adding -ing), and verbs in the past tense differ from the present tense. These are treated as syntactic variations of the same word. Such variations can worsen results in, e.g., document searching, because a relevant document may contain a variation of a query word but not the exact word. One solution to this problem is stemming.

Stemming refers to the process of reducing words to their stems. A stem is the portion of a word that is left after removing its prefixes and suffixes. In English, most variants of a word are generated by the introduction of suffixes (rather than prefixes), so stemming in English usually means suffix removal, or stripping. In other languages, such as Polish, the stemming process may look different, because the word formation rules are different. Stemming enables different variations of a word to be matched in retrieval, which improves recall.

Many researchers have studied the advantages and disadvantages of stemming. Stemming certainly increases recall and can reduce the size of the indexing structure. However, it can hurt precision, because many irrelevant documents may be considered relevant. Despite the many experiments that have been conducted, there is still no simple answer to the question of whether one should use stemming or not; it depends on the dataset, so the results of stemming should always be checked to measure its usefulness.
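The following sketch, written for illustration and not taken from the thesis, shows stop word removal combined with a deliberately naive suffix-stripping stemmer for English; a real system would use a proper stemmer (for instance Porter's algorithm for English, or a dictionary-based tool for Polish), and the word lists here are invented.

```python
# Small English stop word list (a subset of the examples given above).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "for", "in", "on", "to", "with"}

def naive_stem(word):
    """Very rough suffix stripping - only to illustrate the idea of stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [naive_stem(t) for t in tokens]                # stemming

print(preprocess("The covers for bikes are matching bikes of all sizes"))
# ['cover', 'bike', 'match', 'bike', 'all', 'size']
```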
4.4 Tags removal

If the documents we have to process are not pure text - for example, web pages or XML documents - then the approaches presented before might not be enough. Different types of preprocessing might be useful; we focus only on the most common ones, suitable for our dataset.

Tag removal (whether HTML or XML tags) can be handled similarly to punctuation or hyphen removal. There is, however, one issue that needs careful consideration. Tags define the structure of the document, and removing them destroys this structure and can make the document inconsistent. For example, in an HTML document of a typical commercial page, information is presented in many rectangular blocks or layers; simply removing the HTML tags may cause problems by joining text that should not be joined. The problem is worse with XML documents, where by removing XML tags we can lose semantic information and merge two different strings into one text.

In this chapter we have presented some basic methods of text preprocessing, which are used later in the data preparation and normalization phase.

Chapter 5

Context-aware recommender system

As mentioned in Chapter 2, contexts can have different meanings. In this chapter we describe the idea of an item type context and how it can be used in recommendations. Section 5.1 introduces the concept of this method and some information about the implementation. After that, in Section 5.2, we present how text data are processed to extract features. Section 5.3 shows a simple method of context creation based only on term frequencies, together with its evaluation. The next section, Section 5.4, presents Latent Semantic Indexing as a method of finding connections inside the data and building item contexts based on them. Finally, we propose network-based context creation in Section 5.5.

5.1 General idea

In typical applications of context-aware recommender systems, context is usually defined as a set of features that describe the user's or item's environment, such as location, mood or preferences. Accordingly, if we are looking for a new car, we will receive cars that are being sold in the immediate neighborhood. Assuming that we have bought this car, the next thing recommended to us by a typical content-based or collaborative filtering recommender system would be another car, which is unreasonable. This example may be very particular, but it perfectly describes the disadvantage of a typical recommender system. More likely, after buying a car we would be looking for additional tires, windscreen wiper blades, motor oil or an audio system. So rather than buying the same type of item again, we would probably be looking for complementary items which, together with the main item, create the set of items we might need. Imagine we have bought a new coffee machine: instead of recommendations of other coffee machines, we would be more interested in equipment that works with our new machine, such as filters and coffee; we might also be interested in buying a coffee grinder, new cups or spoons. All these items create a set of connected products that together form the context of a product.

With these examples in mind, we now define the context of items. If we have a set of item types, in which every type is similar to the other types from the same set, then we call this set of types a context of items. As we can see, the context is based not on the items themselves but rather on the categories (types) of items.

The recommendation process is quite similar to the typical recommender system workflow.
It is presented in Figure 5.1.

Figure 5.1: Context-aware recommendation process (viewing item → find the item type → find contexts for the item type → retrieve items from the contexts → filter items with constraints → rank items → recommendation results).

As we can see, everything starts from the item that is currently being viewed by the user. For this item, instead of looking for other similar items, we determine its type: every item has exactly one type, but every item type may belong to an unlimited number of contexts. For example, a coffee machine may occur in the context of coffee equipment, together with coffee, grinders and thermoses, but also as a piece of kitchen equipment, together with microwaves and dishwashers. At this point it is hard to determine, based only on the viewed item, the particular reason for which it has been viewed. After the contexts are determined we, in general, retrieve all items from these contexts. Most items from a context would probably be inappropriate, so item filtering is performed, and the items that remain after this step are ranked in the next one. For comparison, in a typical content-based recommender system we would perform filtering and ranking on the items from the whole dataset rather than on the subset created by the context.

The recommendation process is simple in itself, but many difficult problems are hidden behind it. We highlight five of them and describe them shortly. The first and main problem is how to create contexts, and whether this should be done on-line or off-line. Creating a context is a kind of data mining clustering, where we try to find groups of similar items while taking some constraints into account. Because of the size of the dataset, performance issues are more than important: it is impossible to perform typical agglomerative clustering on a dataset of 21 million items in reasonable time; on the other hand, we are looking for items available at viewing time, and if a new type of item appears in the dataset we want to receive results for it as well. These reasons show that some tasks should be performed on-line and others off-line to provide optimal recommendation results. The second problem is how to select the most relevant items from the previously selected contexts: the number of items can still be huge, and from this subset we would like to keep only those which, for example, are compatible with the currently viewed item. The next problem, similar to the filtering problem, is ranking the items. A lot of effort has been put into this area, but it is still a complex problem to pick the top 5 items for which the probability that the user will take an interest in them is very high. Text processing is another problem that occurs when working with content-type data; the results of the other steps may depend on how well it has been performed. The last problem, which is an effect of the methods chosen for every step, is performance: recommendations must be not only precise but also fast, so all solutions must be designed with execution time in mind.
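The process in Figure 5.1 can be summarized as a thin pipeline sketch; every function and data structure below is a hypothetical placeholder for the corresponding step (none of the names come from the thesis), and the sketch only fixes the order of the steps.

```python
# Hypothetical toy data: item -> type, and context -> set of item types.
ITEM_TYPE = {"coffee machine X": "coffee machine", "filters Y": "coffee filters",
             "ground coffee Z": "coffee", "microwave W": "microwave"}
CONTEXTS = {"coffee equipment": {"coffee machine", "coffee filters", "coffee"},
            "kitchen equipment": {"coffee machine", "microwave"}}

def find_item_type(item):
    return ITEM_TYPE[item]

def find_contexts(item_type):
    return [c for c, types in CONTEXTS.items() if item_type in types]

def retrieve_items(contexts, all_items):
    allowed = set().union(*(CONTEXTS[c] for c in contexts)) if contexts else set()
    return [i for i in all_items if ITEM_TYPE[i] in allowed]

def filter_items(candidates, viewed_item):
    # Example constraint: drop the viewed item itself and items of its own type.
    return [i for i in candidates
            if i != viewed_item and ITEM_TYPE[i] != ITEM_TYPE[viewed_item]]

def rank_items(candidates, viewed_item):
    # Placeholder ranking: alphabetical; a real system would score relevance.
    return sorted(candidates)

def recommend(viewed_item, all_items, top_n=5):
    """Sketch of the recommendation process from Figure 5.1."""
    contexts = find_contexts(find_item_type(viewed_item))
    candidates = retrieve_items(contexts, all_items)
    return rank_items(filter_items(candidates, viewed_item), viewed_item)[:top_n]

print(recommend("coffee machine X", list(ITEM_TYPE)))
```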
5.2 Text preprocessing

Since the items in the dataset are described mainly by text, a unified preprocessing procedure was prepared. It is a fairly typical natural language processing workflow, presented in Figure 5.2 and described below. This procedure is applied to every text field that occurs in the dataset model.

Figure 5.2: Text preprocessing workflow (original text → lowercasing → tag stripping (optional) → tokenization → stemming → text cleansing → removal of language, platform and other stop words → word length filtering → processed text).

The first step of the procedure is lowercasing the string. The reason for this step is quite obvious: if we have two strings TExt and text and compare them without lowercasing, the result is that these two words are different; lowercasing makes the two strings identical.

If we have to handle texts that contain different kinds of tags, such as HTML or XML tags, we perform tag stripping, which simply means removing any kind of tags from the original text. It is useful when we have to process descriptions full of HTML tags that only format the description and place some images inside; afterwards we obtain plain text. As mentioned in Chapter 4, tag removal sometimes merges two parts of text that should stay separated and together do not provide any useful information. Because we have two types of text fields, titles and descriptions, we apply tag removal to descriptions and not to titles.

Tokenizing the text means splitting sentences into words. Normally it is done by splitting on spaces, but we also split the text on other special characters. It is helpful when we have titles like great ***new*** bike: after tokenizing only on spaces we would receive great, ***new***, bike, but after tokenizing also on other special characters we get the words great, new, bike, which is what we wanted to achieve. Unfortunately, some information can be lost here. Imagine that the title of an item contains words like C-3PO or R2-D2 which are, e.g., model numbers; if we tokenize on hyphens, we receive the words C, 3PO, R2, D2 and the information is lost.

The most important step, with the biggest influence on the results of preprocessing, is stemming. There is no universal method, because it depends on the language and the available libraries. In general, the goal is to reverse word formation back to the root word while keeping the meaning in context. Since we are working with a Polish dataset, we use a library called Morfologik, which performs this task quite well.

When our titles and descriptions are split into words and stemmed, we can sometimes find in the resulting set of words terms like R2, 3PO or 12345, which usually cannot be treated as normal words: they are typos, part numbers, or information about 24-hour shipping or service. If we are looking for the words that describe an item best, words containing digits will probably not provide any useful information. At this stage we also remove verbs from descriptions: when a verb describes an action rather than the item, we find it useless, because it usually reads like See it now! and does not add any information.

After we have obtained terms that are likely to be real words, it is time to remove those words whose only role in the language is to create and connect other words. Usually this is done by applying a stop word list; some examples of English stop words were presented in Chapter 4. In general, stop word removal serves to remove terms that we consider useless for some reason and that only introduce noise into the text. In our application we have also decided to analyze words used by sellers only to emphasize and highlight titles. For example, if we have a title like Superb brand new cross bike - invoice, and words like superb, brand, invoice are used by a lot of sellers, then they do not convey any important information about the item and only clutter the title. Other, more platform-specific stop word lists can also be used, for example to remove usernames or platform names.

At the end, the length of the terms is checked. It is very likely that a one-letter word is not a correct word, and sometimes even two-letter words do not exist, so they can be removed from the final set of words.

It must be said that text preprocessing is a very important part of the whole process, because it can change the quality of the output; it should certainly be performed very carefully, and there is neither a simple rule nor a setting that will work with every dataset.
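A compact sketch of this workflow is shown below, for illustration only: the stop word lists are invented, verb removal is omitted, and since Morfologik is a Java library, stemming is represented by a placeholder function that a real implementation would replace with a proper Polish stemmer.

```python
import re

LANGUAGE_STOP_WORDS = {"i", "na", "do", "z", "the", "a", "of"}   # hypothetical
PLATFORM_STOP_WORDS = {"superb", "brand", "invoice", "okazja"}   # seller emphasis words

def stem(word):
    # Placeholder for a dictionary-based stemmer such as Morfologik.
    return word

def preprocess(text, strip_tags=False, min_length=2):
    text = text.lower()                                    # lowercasing
    if strip_tags:                                         # optional tag stripping
        text = re.sub(r"<[^>]+>", " ", text)
    tokens = re.split(r"[^0-9a-ząćęłńóśźż]+", text)        # tokenize on special characters
    tokens = [t for t in tokens if t and not any(c.isdigit() for c in t)]  # drop terms with digits
    tokens = [stem(t) for t in tokens]                     # stemming
    tokens = [t for t in tokens
              if t not in LANGUAGE_STOP_WORDS and t not in PLATFORM_STOP_WORDS]
    return [t for t in tokens if len(t) >= min_length]     # word length filtering

print(preprocess("Superb brand new cross bike - invoice"))
# ['new', 'cross', 'bike']
```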
5.3 Category-based context creation

5.3.1 Description

In typical e-commerce platforms all goods are labeled with a category, and in most cases the category describes the type of the item. Categories usually form a category tree, where the deeper we go, the more detailed the type we get. An example of a category tree is shown in Figure 5.3.

Figure 5.3: Example of a category tree structure (top-level categories such as Clothes, Shoes & Accessories and Music, with subcategories such as Women's Clothing, Men's Clothing, Lingerie & Nightwear, Bras & Bra Sets, CDs, Records, 7" Singles and 12" Singles).

Although the items in a single category form a group, such a group is not a context as we defined it previously; in our definition, the sets of items found in the same category are the building blocks from which contexts are constructed. For example, if we focus on the context of coffee, the category where we can find coffee machines is only one of many elements that create the coffee context. Those elements are usually placed in different categories, sometimes not even sharing the same root of the tree, yet together they can construct the context.

Every category can be described by the items placed inside it. Since every item is characterized by a title, a description and a set of attributes, all of this can be transformed into a set of keywords characterizing the item; having that, we can create a description of the category. Because a description built by simply concatenating the items' keywords would be very long and pointless, we count the occurrences of each term and create a list ordered by frequency. We can say that the most relevant terms describe the category best, and they form the new category description. This category description can be treated as a virtual item that represents all items of the selected category and can be used as an element of a context. These virtual items then have to be connected to create contexts. Because their descriptions are bags of keywords, simple cosine similarity can be used to calculate similarity values, and the virtual items can be ranked by those values to find the most similar ones, which become the candidates from which a context is built. After that, when the context candidates are selected, real items from the corresponding categories are picked for further filtering.
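A minimal sketch of this idea follows; the category data are hypothetical, and the code only builds frequency-based "virtual items" and ranks categories by cosine similarity, as described above, without the later filtering step.

```python
import math
from collections import Counter

# Hypothetical keyword sets of items, grouped by category.
category_items = {
    "Coffee machines": [["coffee", "machine", "espresso"], ["coffee", "machine", "pressure"]],
    "Coffee":          [["coffee", "ground", "arabica"], ["coffee", "beans"]],
    "Grinders":        [["coffee", "grinder", "manual"]],
    "Microwaves":      [["microwave", "oven", "grill"]],
}

def virtual_item(items):
    """Category description: term frequencies aggregated over all its items."""
    counts = Counter()
    for keywords in items:
        counts.update(keywords)
    return counts

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

virtual = {cat: virtual_item(items) for cat, items in category_items.items()}

def context_candidates(category, top_n=2):
    """Rank the other categories by similarity of their virtual items."""
    base = virtual[category]
    scores = [(cosine(base, v), c) for c, v in virtual.items() if c != category]
    return sorted(scores, reverse=True)[:top_n]

print(context_candidates("Coffee machines"))
```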
5.3.2 Experiment

The accuracy of the presented method has been validated by selecting items from different categories and then manually evaluating the correctness of the results, according to intuition and the usefulness of the recommended items and categories/contexts. To achieve that, we selected 7 items of different types, from different categories, and for every item we prepared 4 recommendations. More information about these items can be found in Appendix A.1. The items have been recommended randomly according to the calculated contexts, constraints and filters; examples of them can be found in Appendix A.2.

The first item (I1), for which we looked for recommendations, is a bag of coffee. As a result we received another bag of coffee (R11), a coffee grinder (R12), a thermal cup (R13) and a set of cups for coffee and tea. The categories of these items vary, which means that their descriptions have been quite good and they have probably been connected by the coffee keyword. Thanks to the variety of categories, all of them together create a context of items: the recommended items are not of the same type as the viewed item.

The second item (I2) is also connected with coffee: it is a coffee machine. As a result we got 4 items of the same thing, a coffee machine; the only thing that distinguishes them is the category. It means that for the category of the viewed item the most important word is not a word connected with coffee, as could be expected, but probably some other term connected with this machine. The recommender system has acted like a typical one, serving items of the same type; because our assumptions are different, these results are not acceptable.

The third item (I3) is a specific type of bicycle. In this case the result set contains a cover for a bike (R31), a kids' bike (R32), a bicycle (R33) and a book about cycling (R34). We are not looking only for items of the same type, so the book about cycling and the bike cover can be treated as complementary items, and bike R33 as a suggestion of another bike. It is hard to determine whether the bike for children is adequate: a 6-7 year old child is probably not looking for a bike itself, more likely a parent is doing so, in which case it would be more like looking for a gift, but we were not originally looking for a kids' bike. Nevertheless, these recommendations partially build a context.

The fourth item (I4) is a dress. In this case we got 4 items of other dresses, but when we analyze their categories we can see that the original item is an adult dress while the recommended ones are for children. Although we received recommendations of relevant items because of the keyword, they do not create a valid context and are not useful. The reason is probably the type of items we have in the dataset, but it shows that even with quite strong keywords in the category descriptions, the connections between categories might be constructed badly.

A volleyball is our fifth item (I5). In the result set we can find an American football (R51), a rugby ball (R52), a soccer video game (R53) and a garden ball toy (R54). Although all the items are balls, it is hard to say they come from the same context. Item R53 is certainly not connected to the viewed item; it appears in the results because of phrases that exist in the Polish language. The other items are strongly connected with the word ball, and if we were looking for a ball in general, the context would have been constructed quite well; however, we think it is more likely that we are looking for things connected with volleyball, and the context should be created around this concept.

The sixth item (I6) is Apple's MP3 player; as a result we received two other MP3 players
and two chargers for this kind of device. The context is very narrow, but we have been offered other devices and additional equipment for the currently viewed one.

The last item (I7) is Apple's laptop, and as a result we have been offered two covers for Apple's computers. The items are connected by the manufacturer, but are not relevant in this case.

As we can see, the results of recommending items by creating contexts between categories/item types may vary. Everything depends on how well the items are described and what category descriptions they create. Because of that, the method is also unstable and can easily be manipulated; in this respect it behaves like a naive Bayes classifier. This approach also does not work well for categories whose keywords are very general and common and can be used in many contexts. The reason is the use of single keywords: using n-grams instead of single words to describe categories can produce more accurate results in some cases, but in others it can narrow the context. These cons may become pros in some particular situations, for instance when items carry a manufacturer name and we are looking for compatible parts or equipment. Another advantage of this method is its low complexity and execution speed: collecting category descriptions can be done off-line, periodically, and is not a very complex task, and searching for categories to create contexts can be done efficiently using indexers such as Lucene/Solr.

5.4 Latent Semantic Indexing-based context creation

5.4.1 Description

Typically, when we talk about calculating similarity between documents, the most common method is cosine similarity based on term frequency. This is a good method if documents about the same thing use the same words; however, this does not happen very often, and many concepts or objects can be described in multiple ways (using different words) due to context and people's language habits. If our query uses words different from the words used in a document, the document will not be retrieved even though it may be relevant, because the document may use synonyms of the query words. This results in low recall. For example, "picture", "image" and "photo" are synonyms in the context of digital cameras; if the user query only contains the word "picture", relevant documents that contain "image" or "photo" but not "picture" will not be retrieved.

Latent Semantic Indexing (LSI), proposed in [9], tries to deal with this problem by identifying statistical associations of terms. It is assumed that there is some latent semantic structure in the data that is hidden by the randomness of word choice. To mine this latent structure and remove noise, a statistical technique called singular value decomposition (SVD) is used. This structure is also called the hidden concept space, which connects syntactically different but semantically similar documents.

Let D be the text collection, m the number of distinct words in D, and n the number of documents in D. As input, LSI takes the m × n term-document matrix A. Documents are represented as columns of A and words as rows. The matrix is usually computed using term frequencies, but TF-IDF can also be used. Every cell of matrix A, denoted by $A_{ij}$, stores the number of occurrences of word i in document j.
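For illustration only, the snippet below builds such a term-document count matrix from a few hypothetical, already preprocessed titles; the rows are words and the columns are documents.

```python
import numpy as np

# Hypothetical titles, already lowercased, stemmed and cleaned.
titles = ["biustonosz push up koronka",
          "stanik sport fitness",
          "skarpeta sport bawelna"]

docs = [t.split() for t in titles]
vocab = sorted({w for d in docs for w in d})

# A[i, j] = number of occurrences of word i in document j.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        A[vocab.index(w), j] += 1

print(vocab)
print(A)
```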
SVD is used in LSI to factorize the matrix A into three matrices:

$$A = U \Sigma V^{T}$$

where U is an m × r matrix whose columns are the eigenvectors associated with the r non-zero eigenvalues of AA^T; moreover, the columns of U are unit orthogonal vectors, i.e. U^T U = I. V is an n × r matrix whose columns are the eigenvectors associated with the r non-zero eigenvalues of A^T A; the columns of V are also unit orthogonal vectors, i.e. V^T V = I. Σ is an r × r diagonal matrix, Σ = diag(σ_1, σ_2, ..., σ_r), σ_i > 0, where σ_1, σ_2, ..., σ_r are the non-negative square roots of the r non-zero eigenvalues of AA^T. They are ordered decreasingly, i.e. σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0.

One of the features of SVD is that we can delete some insignificant dimensions of the transformed space to optimally approximate the matrix A. The significance of the dimensions is indicated by the magnitude of the singular values in Σ. Let us use only the k largest singular values and set the remaining small ones to zero. The resulting approximation of A is denoted by A_k. We also reduce the sizes of the matrices Σ, U and V by deleting the last r − k rows and columns from Σ, the last r − k columns of U and the last r − k columns of V. As a result we obtain

$$A_k = U_k \Sigma_k V_k^{T},$$

which means that we use the k largest singular triplets to approximate the original matrix A. The new space is called the k-concept space. Figure 5.4 shows the original and reduced matrices schematically.

Figure 5.4: Schematic representation of singular value decomposition of the matrix A (A, of size m × n, is factorized into U (m × r), Σ (r × r) and V^T (r × n); keeping only the first k dimensions gives U_k, Σ_k, V_k^T and the approximation A_k).

Latent Semantic Indexing does not reconstruct the original matrix A perfectly. The truncated SVD captures most of the important basic structure in the association of terms and documents and at the same time removes the noise and variability of word usage [16].

In general, Latent Semantic Indexing is used to reveal hidden connections between texts about the same thing when different phrases are used. This characteristic can be used to create contexts of items: all items similar to each other belong to the same concept space, which means that they represent the same context.
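The truncation step can be sketched as follows, assuming NumPy is available (our assumption; any SVD routine would do): the full decomposition is computed and only the k largest singular triplets are kept.

```python
import numpy as np

def truncated_svd(A, k):
    """Factorize A = U S V^T and keep the k largest singular triplets (A_k = U_k S_k V_k^T)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]          # m x k, term vectors in the k-concept space
    S_k = np.diag(s[:k])    # k x k diagonal matrix of the largest singular values
    V_k = Vt[:k, :].T       # n x k, one row per document
    return U_k, S_k, V_k
```

For the experiment below, a call such as truncated_svd(A, 2) would yield the matrices U_k, Σ_k and V_k used in the similarity computations.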
5.4.2 Experiment

To check how the LSI approach works for items available in the dataset, we have selected 13 items from three categories: Socks, Bras and Bra accessories. All these categories have a common ancestor category, Clothes. This example has been selected because the Polish language commonly uses two synonymous names for a bra - stanik and biustonosz - which can be used to perform a simple evaluation of whether the LSI method is working or not. The types of the items are presented in table 5.1; more information about the items is given in Appendix B.1.

No   Category         Item type
D1   Socks            Suit socks
D2   Socks            Sport socks
D3   Socks            Sport socks
D4   Bras             Bra
D5   Bras             Bra
D6   Bras             Bra
D7   Bra accessories  Bra accessories
D8   Socks            Socks
D9   Bras             Bra
D10  Bras             Bra
D11  Bras             Sport bra
D12  Bras             Sport bra
D13  Bras             Sport bra (only manufacturer name)

Table 5.1: Selected items and their types.

The first step of the experiment is creating a term-document frequency matrix in which item titles are treated as documents. Every title is tokenized, all non-words are removed and stemming is performed. After preprocessing every title in this way, we create the set of words occurring in all titles and count their occurrences in the documents; the results are put into the matrix A. For the presented example the term-document frequency matrix A can be found in Appendix B.2.

With the matrix prepared in this way, we apply the SVD decomposition to it. As a result we receive three matrices: U (of size 34 × 13), Σ (13 × 13) and V (13 × 13), which are enclosed in Appendix B.3. The next step of Latent Semantic Indexing is choosing the value of k used to prune the matrices. After analysing the matrix Σ we have decided to evaluate k = 3 and k = 2, because the drop of the values on the diagonal after k = 3 is relatively big. We have received better results with k = 2, and those results are presented here; we only highlight the differences between the results for k = 2 and k = 3. With the chosen value of k we create the new matrices: U_k of size 34 × k, Σ_k of size k × k and V_k of size 13 × k.

To evaluate the method we have prepared 6 queries built from the available words. The queries are itemized below and their query vectors are shown in Appendix B.4:

• Q1: biustonosz,
• Q2: biustonosz sport,
• Q3: push up,
• Q4: skarpeta,
• Q5: sport,
• Q6: stanik.

Each query vector is transformed into the concept space using the matrices created in the SVD process:

$$v_k = q^{T} U_k \Sigma_k^{-1}$$

where q^T is the transposed vector representing a single query, U_k is the matrix U after resizing to k columns, Σ_k^{-1} is the inverse of the matrix Σ after resizing to k × k dimensions, and v_k is the resulting query vector.

To evaluate whether Latent Semantic Indexing is working, we calculate the similarity between the vector v_k and every document, represented by the rows of the matrix V_k. This can be done using cosine similarity:

$$\mathrm{sim}(v_k, V_k^{(i)}) = \frac{v_k \cdot V_k^{(i)}}{\lVert v_k \rVert \, \lVert V_k^{(i)} \rVert}, \qquad i = 1, 2, \ldots, d$$

where d is the total number of documents and V_k^{(i)} is the i-th row of the matrix V_k.

The results for every query and document in our case are presented in table 5.2. Queries are represented by columns and documents by rows.

       Q1     Q2     Q3     Q4     Q5     Q6
D1    0,01   0,39  -0,13   0,99   0,72  -0,15
D2   -0,06   0,32  -0,21   0,99   0,66  -0,23
D3    0,23   0,58   0,08   0,96   0,85   0,06
D4    0,99   0,87   0,99  -0,11   0,62   0,99
D5    0,99   0,91   0,99  -0,04   0,67   0,99
D6    0,99   0,88   0,99  -0,10   0,63   0,99
D7    0,98   0,84   0,99  -0,19   0,56   0,99
D8    0,26   0,61   0,11   0,96   0,87   0,09
D9    0,98   0,83   0,99  -0,20   0,55   0,99
D10   0,99   0,86   0,99  -0,14   0,59   0,99
D11   0,96   0,98   0,92   0,22   0,85   0,91
D12   0,92   0,99   0,86   0,35   0,91   0,85
D13   0,73   0,94   0,62   0,66   0,99   0,61

Table 5.2: Similarities between documents and queries.
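The two formulas above can be combined into a short sketch (again assuming NumPy; the function and variable names are ours, not the thesis's): a raw term-count query vector is folded into the k-concept space and compared with every document row of V_k.

```python
import numpy as np

def fold_in_query(q, U_k, S_k):
    """Map a term-count query vector q (length m) into the concept space: v_k = q^T U_k S_k^{-1}."""
    return q @ U_k @ np.linalg.inv(S_k)

def query_document_similarities(q, U_k, S_k, V_k):
    """Cosine similarity between the folded-in query and every document (row of V_k)."""
    v_k = fold_in_query(q, U_k, S_k)
    norms = np.linalg.norm(V_k, axis=1) * np.linalg.norm(v_k)
    return (V_k @ v_k) / norms
```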
According to the results, Query 1, which uses one of the two names for a bra, also finds the documents that use the other name. Even though document #13 does not use this name, its similarity is still quite high. The bra accessories item has also been included in the group of similar items. The similarity between this query and the socks items is close to zero in 2 of the 4 cases; the other cases have a similarity of about 0.25, probably because of the word sport, which occurs in the titles of socks as well as bras. Query 2 has checked this hypothesis: adding the word sport to the query slightly increases the similarities between the query and the items representing the socks group. Query 3 has checked what the similarities look like when we are looking for a push-up bra, which is one subtype of bra; in the results it has a high similarity value for almost all bra items. Query 4 shows the similarities for the socks items and how they differ from the bra items; however, document #13 has a high similarity value because it uses words such as sport and elegancki that also occur in the titles of socks items. Query 5 has tested how using a common word like sport affects the similarities; in this case it does not distinguish the items well. The last query, Query 6, is analogous to Query 1 but uses the second word for bra, and the results are very similar.

Using the same equations we can also measure the similarities between the queries themselves; they are presented in table 5.3.

      Q1     Q2     Q3     Q4     Q5     Q6
Q1   1,00   0,92   0,98  -0,01   0,69   0,98
Q2          1,00   0,85   0,36   0,91   0,84
Q3                 1,00  -0,16   0,58   0,99
Q4                        1,00   0,70  -0,18
Q5                               1,00   0,56
Q6                                      1,00

Table 5.3: Similarities between queries.

These results also confirm strong connections between some types of items or their keywords. Query 1 gets results similar to Query 2, Query 3 and Query 6, which are all about bras. Query 4 has the strongest connection with Query 5, but Query 5 is an example of a query about the very general and widely used word sport, so its similarity is above average with every query.

Another way to compare the results of LSI is to calculate document-document similarities. The similarities between documents before applying LSI are presented in table 5.4 and after applying LSI in table 5.5.

      D1    D2    D3    D4    D5    D6    D7    D8    D9    D10   D11   D12   D13
D1   1,00  0,38  0,43  0,00  0,00  0,00  0,00  0,00  0,00  0,00  0,00  0,00  0,33
D2         1,00  0,50  0,00  0,00  0,00  0,00  0,38  0,00  0,00  0,00  0,00  0,00
D3               1,00  0,00  0,00  0,15  0,00  0,21  0,00  0,00  0,16  0,21  0,21
D4                     1,00  0,22  0,18  0,20  0,00  0,00  0,25  0,40  0,25  0,00
D5                           1,00  0,00  0,00  0,00  0,00  0,00  0,22  0,28  0,00
D6                                 1,00  0,54  0,00  0,20  0,23  0,00  0,00  0,00
D7                                       1,00  0,00  0,22  0,25  0,00  0,00  0,00
D8                                             1,00  0,00  0,00  0,25  0,33  0,33
D9                                                   1,00  0,00  0,00  0,00  0,00
D10                                                        1,00  0,00  0,00  0,33
D11                                                              1,00  0,51  0,25
D12                                                                    1,00  0,33
D13                                                                          1,00

Table 5.4: Similarities between documents before applying LSI.

      D1    D2    D3    D4    D5    D6    D7    D8    D9    D10   D11   D12   D13
D1   1,00  0,99  0,97  0,09  0,01  0,07  0,16  0,96  0,17 -0,11  0,25  0,38  0,68
D2         1,00  0,95  0,17  0,09  0,15  0,24  0,94  0,25 -0,19  0,17  0,30  0,62
D3               1,00  0,13  0,20  0,14  0,05  0,99  0,04  0,10  0,46  0,57  0,82
D4                     1,00  0,99  0,99  0,99  0,16  0,99  0,99  0,93  0,88  0,66
D5                           1,00  0,99  0,98  0,23  0,98  0,99  0,96  0,91  0,71
D6                                 1,00  0,99  0,17  0,99  0,99  0,94  0,89  0,67
D7                                       1,00  0,08  0,99  0,99  0,91  0,84  0,60
D8                                             1,00  0,07  0,13  0,49  0,60  0,84
D9                                                   1,00  0,99  0,90  0,84  0,59
D10                                                        1,00  0,92  0,87  0,64
D11                                                              1,00  0,99  0,88
D12                                                                    1,00  0,93
D13                                                                          1,00

Table 5.5: Similarities between documents after applying LSI.

We can see how the similarities have changed and how connections that were hidden have appeared, e.g. between D1 and D8 or between D4 and D9. Weak connections have also been strengthened.

Latent Semantic Indexing is a method that could be used to create clusters of similar items. It connects items together by discovering and exposing their latent context. However, the method also has some disadvantages. The main one is the complexity of SVD, which is O(nm²) or O(n²m); computing such large matrices, with dimensions in the millions, is practically impossible in the original form. Another problem is the noise caused by words that connect a lot of items while the term itself is too general or does not mean anything.
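For completeness, the document-document comparison reported in tables 5.4 and 5.5 can be sketched the same way: cosine similarities are taken between the columns of A (before LSI) and between the document rows of V_k (after LSI). The helper below assumes NumPy and normalizes the rows of whatever matrix it is given.

```python
import numpy as np

def row_cosine_similarities(M):
    """Pairwise cosine similarities between the rows of M (one document per row)."""
    normalized = M / np.linalg.norm(M, axis=1, keepdims=True)
    return normalized @ normalized.T

# sim_before = row_cosine_similarities(A.T)   # documents are the columns of A
# sim_after  = row_cosine_similarities(V_k)   # documents are the rows of V_k
```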
5.5 Network-based context creation

5.5.1 Description

The network-based context creation approach mixes the two approaches presented in Section 5.3 and Section 5.4 and minimizes their disadvantages. The main disadvantage of category-based context creation is that it cannot find contexts more specific than a category - sub-contexts - nor more general ones, i.e. contexts whose items are distributed over many categories. This can be achieved with LSI, but LSI has its own drawbacks, such as its complexity, and it does not produce typical contexts as a result. In network-based context creation we use the tag-cloud property from category-based context creation together with the information about how those tags are connected with each other. Those connections create a structured network that holds hidden information about contexts; dividing this network into subnetworks gives us more and more concrete contexts. There are two ways of creating this network: (1) taking all items from the dataset and analyzing them all together, or (2) analyzing items per category in order to create a small network for each category and then merging these networks into one. The first solution does not differ from the LSI-based approach, so we have decided to divide the computations per category. For every context, corresponding to a subnetwork, we look for the terms that represent the items from that context; those terms can be treated as labels of the context and of its items.

Internally this approach uses network connectivity, network modularity, the PageRank algorithm, the Hyperlink-Induced Topic Search (HITS) algorithm and an inverted index, which we briefly present before describing the approach itself.

In an undirected graph G, two vertices u and v are called connected if G contains a path from u to v; otherwise they are called disconnected. A graph is said to be connected if every pair of vertices in the graph is connected. A connected component is a maximal connected subgraph of G. Each vertex belongs to exactly one connected component, as does each edge [2].

The term modularity has many different meanings depending on the field and context in which it is used; here we focus on its meaning in network and graph analysis. According to [4], modularity is a measure of the structure of a network or graph: it measures how readily the graph can be divided into modules (groups, clusters). Graphs with a high modularity value have nodes strongly connected within the same module and only weakly connected with nodes from other modules. The modularity is, up to a multiplicative constant, the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random [19]. The modularity can be either positive or negative, so it is possible to search for module structure by looking for divisions of a network that have positive modularity values. A precise mathematical formulation of modularity for networks with two or more modules can be found in [19].
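As a small illustration of these two definitions (using the NetworkX library - our assumption, the thesis does not name a specific toolkit), the sketch below lists the connected components of a weighted term network and scores a candidate division of each component by its modularity.

```python
import networkx as nx
from networkx.algorithms import community

def components_with_modularity(G):
    """For every connected component of G, propose a division into modules and report its modularity."""
    results = []
    for nodes in nx.connected_components(G):
        component = G.subgraph(nodes)
        if component.number_of_edges() == 0:
            # An isolated node forms its own trivial module.
            results.append((component, [set(component.nodes)], 0.0))
            continue
        division = community.greedy_modularity_communities(component, weight="weight")
        q = community.modularity(component, division, weight="weight")
        results.append((component, division, q))
    return results
```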
PageRank is a link analysis algorithm that assigns a numerical weight to each element of a connected set of documents, with the purpose of measuring its relative importance within that set. The algorithm may be applied not only to Web pages, which was its first application, but to any collection of entities in which connections exist. More precisely, PageRank is a probability distribution used to represent the likelihood that we will be interested in seeing a specific document. PageRank can be calculated for collections of documents of any size. Several research papers assume that the distribution is evenly divided among all documents in the collection at the beginning of the computation. The PageRank computation requires several passes through the collection to adjust the approximate PageRank values so that they reflect the theoretical values more closely. A probability is expressed as a numeric value between 0 and 1 [5]. More information about the PageRank algorithm can be found in [21].

Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities, is a link analysis algorithm that was originally used to rate networks of Web pages. The idea behind hubs and authorities stemmed from an insight into the creation of web pages when the Internet was originally forming: certain web pages, known as hubs, served as large directories that were not actually authoritative in the information they held, but were used as compilations of a broad catalog of information leading users directly to other, authoritative pages. In other words, a good hub is a page that points to many other pages, and a good authority is a page that is linked by many different hubs. The scheme therefore assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages [3]. More information about the HITS algorithm and the calculation of its values can be found in [14, 15].

The inverted index of a document collection is basically a data structure that attaches to each distinctive term a list of all documents that contain the term. Given a set of documents D = {d1, d2, ..., dN}, each document has a unique identifier (ID). An inverted index consists of two parts: a vocabulary V, containing all the distinct terms in the document set, and, for each distinct term ti, an inverted list of postings. Each posting stores the ID (denoted by idj) of a document dj that contains the term ti, together with other pieces of information about the term ti in the document dj [16].

Context creation based on network analysis can be divided into 5 steps: (1) analyzing the category network, (2) analyzing possible connections between categories, (3) simplification of the category network, (4) item mapping and (5) network analysis of the general, merged network. The workflow is presented in figure 5.5; we now describe all these steps more precisely.

Figure 5.5: Schematic workflow of the network-based context creation approach (for every category: create a network with terms as nodes, find the most relevant terms in the category, find duplicated terms, simplify the network and map items; then merge the networks, find contexts and re-map items).
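A compact sketch of these building blocks on a term network G (NetworkX again assumed; item_terms is a hypothetical mapping from item ids to their preprocessed title terms, introduced only for illustration):

```python
import networkx as nx
from collections import defaultdict

def rank_terms(G, top_n=5):
    """PageRank selects the most relevant terms; HITS hub scores later help to spot
    overly generic 'connector' terms that behave like stop words."""
    pagerank = nx.pagerank(G, weight="weight")
    hubs, _authorities = nx.hits(G)
    top_terms = sorted(pagerank, key=pagerank.get, reverse=True)[:top_n]
    return top_terms, hubs

def build_inverted_index(item_terms):
    """Inverted index: for every term, the list of item ids whose description contains it."""
    index = defaultdict(list)
    for item_id, terms in item_terms.items():
        for term in terms:
            index[term].append(item_id)
    return dict(index)
```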
Analyzing all items from the dataset at once is a very complex task that requires a lot of resources. Since all items are divided into categories, their analysis can also be divided and performed separately for each category. In this step we want to obtain the words that describe all, or most, items from the selected category. We are also interested in the connections between those words, which correspond to their similarity. To achieve that, the first thing we have to do is calculate the Term Frequency (TF) matrix for the items of the current category. Instead of processing all items it is possible to use a representative subset, e.g. 10 000 items; the reason is the cost of the SVD that we perform in the next step, while applying LSI to the items of the category. Latent Semantic Indexing is parametrized by the dimension k to which we want to reduce our matrices, and this value can be set manually or automatically, e.g. so as to retain 99% of the variance of the matrix Σ, that is, the smallest k for which

$$\frac{\sum_{j=1}^{k} \sigma_j}{\sum_{j=1}^{n} \sigma_j} \geq 0.99.$$

Next, we create queries as in Section 5.4.2, but for every term that occurs in the TF matrix. After that we calculate the cosine similarities between the queries; as a result we obtain an m × m matrix M with values between 0 and 1. Before we use this matrix as the input for graph creation, we multiply every cell by the corresponding term frequency from the TF matrix. The reason is that we want to add a word-popularity factor and distinguish words whose similarities might be the same but where one word should be more important than the other. It is worth noting that we do not need to fill all the cells: similarity is symmetric, so a lower or an upper triangular matrix is enough to create the undirected network, which we now build from the matrix M.

Once we have created the network corresponding to the matrix M, we calculate the modularity for every connected component. If the modularity value is greater than or equal to some value p, we split the input graph (network) into subgraphs (subnetworks). We repeat this recursively until the modularity value of a subnetwork is less than p. For every subnetwork obtained in this way we run the PageRank algorithm to find the most relevant nodes, which correspond to the most relevant words in the selected category. From those most relevant words in every subnetwork we select the top N, which we use in the further steps.

Retrieving the top N important words and discarding all the less relevant ones in a category has some disadvantages. One of them is the possible loss of information and of connections between words about the same thing, or from the same context, that have not been important enough for the PageRank algorithm. Ideally, we also do not want disconnected networks representing different categories; to analyze them better we need information about how close they are to each other and whether they have overlapping nodes, i.e. an overlapping context. If we do not take care of that, we can end up with many disjoint networks. To prevent this, and to find candidates that might connect networks together, we look for common words across the whole dataset; in other words, we look for duplicated words among all items. This can be done efficiently, e.g. by using prefix trees. As a result of this step we get a set of words that occur at least a specified number of times.

In the third step we take the set of terms retrieved in steps #1 and #2 and delete from the network created in step #1 all terms that are not in this set. While removing nodes we also remove all edges incident to them. After that we have a smaller network which can be used in further processing and analysis.

Once the network has been shrunk to the most relevant and necessary words, every item has to be bound to the terms that describe it. This can be done by creating inverted indices over the vocabulary used in the third step and all items of the category.
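Two of these steps can be made concrete with a short sketch (NumPy and NetworkX assumed; the greedy modularity routine is a stand-in, since the thesis does not fix a particular community detection algorithm): choosing k so that the leading singular values retain the required fraction of the spectrum, and recursively splitting a network into modules while the modularity stays at or above p.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community

def choose_k(singular_values, retained=0.99):
    """Smallest k whose leading singular values account for the required fraction of their total sum."""
    fractions = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(fractions, retained) + 1)

def split_into_contexts(G, p=0.1):
    """Recursively split a (sub)network into modules while its modularity is >= p.
    Returns a nested list that mirrors the context hierarchy."""
    if G.number_of_edges() == 0:
        return sorted(G.nodes)
    division = community.greedy_modularity_communities(G, weight="weight")
    q = community.modularity(G, division, weight="weight")
    if q < p or len(division) < 2:
        return sorted(G.nodes)
    return [split_into_contexts(G.subgraph(nodes).copy(), p) for nodes in division]
```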
The last step is the one in which the contexts are actually built. At the beginning we merge all the small networks corresponding to the categories into one bigger network that represents the contexts of all items in the dataset and the connections between them. On this network we calculate the modularity for every connected component and, just as in step #1, if the modularity value is greater than or equal to some value p, we split the input network into subnetworks and repeat this recursively until the modularity value of a subnetwork is less than p. Every subnetwork obtained in this way forms a context, and more general contexts contain more specific, smaller contexts. For debugging purposes the PageRank algorithm can be run on every subnetwork to check what the context is about by looking at the most important terms in the ranking. From the process of splitting the network into modules we create a hierarchical graph that represents the contexts and their divisions into more specific contexts. After that we have to re-map the item bindings from the modules created in step #1 and simplified in step #3 to the new modules created after the merging and modularity analysis.

Because the result is a graph that shows the structure of the contexts, the recommendation process can be described as a random walk through this graph. A sample graph is presented in figure 5.6. Every node represents a context and the subnetwork obtained from the modularity analysis. We distinguish two types of moves: changing the context on the same level, or moving to the parent context - the red and blue arrows in the figure show these two possibilities. When changing contexts or sub-contexts we set the probabilities that determine which route should be chosen: probability p1 for changing the context to another one on the same level and p2 for moving to the parent of the current context. Moving to the parent context can give more diversity in the results, so usually p1 > p2.

Figure 5.6: Contexts' structure graph and possible recommendation jumps (a tree of contexts C1 and C2 with sub-contexts such as C1.1, C1.2, C2.1, ...; p1 and p2 label the two types of jump).

When visualizing the network created after merging, it is likely that some modules are connected with others by words that should have been filtered out by one of the stop-word lists - their meaning is too general and does not add any information to any context. If we check their hub values after applying the HITS algorithm and dividing the network into modules, we can see that words with a high hub value which connect different modules should in many cases be removed. This step might be done manually and periodically to improve the correctness of context creation.

It is also worth mentioning that in this approach only the recommendation process is done on-line - all other tasks are done off-line and periodically, because the average characterization of items in one category (the typical, most common terms used to describe them) does not change quickly or often. All new items that appear between calculations need to be mapped to an appropriate context; to achieve that we have to store all the terms from which each context has been created, together with its corresponding network, and calculate e.g. the cosine similarity between an item and those terms to find the most appropriate context or contexts.
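The jump mechanics can be sketched as follows; this is only our illustration of the idea, with hypothetical children/parents dictionaries describing the context tree and the probabilities p1 and p2 normalized against each other (the thesis does not fix their exact semantics).

```python
import random

def next_context(children, parents, current, p1=0.7, p2=0.3):
    """One step of the random walk over the context hierarchy: with the larger probability
    jump to a sibling context on the same level, otherwise move up to the parent context."""
    parent = parents.get(current)
    siblings = [c for c in children.get(parent, []) if c != current] if parent is not None else []
    if siblings and random.random() < p1 / (p1 + p2):
        return random.choice(siblings)                    # change context on the same level
    return parent if parent is not None else current      # change parent context (or stay at the root)
```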
5.5.2 Experiment

In order to evaluate network-based context creation we have picked the 12 categories presented in table 5.6. Together these categories contain exactly 1872 items. All items have been preprocessed, the titles divided into terms and the TF matrix created. To show how our approach changes the number of terms describing a category during processing, we count them; the counts for the original data are presented in table 5.7. For every category the LSI analysis has been performed and the weighted similarities have been calculated. After that we have created, for every category, a network between terms with the weighted similarities as edge weights. Visualizations of those networks can be found in Appendix C.1.

Based on those networks we have selected the most important words, which are used next to describe the items connected to those categories - we have picked the 5 most important words per category, 103 terms in total for all categories. We have also determined 97 duplicated terms within the selected categories. Having the most important words for every category and the list of duplicates, we have been able to shrink the original networks into their compressed versions. All those networks can be found in Appendix C.2, and their term counts are also given in table 5.7. As we can see, the number of edges decreases from a few percent up to almost 80%. At this point we can merge all the networks presented in Appendix C.2 into one network; figure 5.7a presents the result of this process.

(a) Before hubs removal. (b) After hubs removal.
Figure 5.7: Network representing 12 categories merged together.

The network created from the compressed networks, which represent the connections between terms, has only one connected component, although there is a chance that some of the terms connecting different modules (every module has a different node colour) should not exist in this network. To check that, the HITS algorithm has been performed and its results are reflected in the sizes of the nodes. After a brief analysis of the hub nodes that connect different modules, we have decided to remove a few more nodes which, in our opinion, cannot distinguish any specific type of items. After this removal we have received the network shown in figure 5.7b. Instead of one connected component we now have two connected components. This shows that the hub analysis of nodes connecting modules (which can also improve stop-word lists) has quite a large impact on the network structure. The modularity values of the top and bottom subnetworks are equal to 0.102 and 0.386, so if we set the value p = 0.1 we have to continue the analysis on their submodules. The structures of the components are shown in figure 5.8.

(a) Connected component #1. (b) Connected component #2.
Figure 5.8: Connected components.

For every component we have analyzed its modules; their visualizations are enclosed in Appendix C.3. Because every module has to be analyzed to decide whether it can be divided into submodules, module #1.3 has in the end been divided into 3 submodules and 2 sub-submodules - see figure C.9 in Appendix C.3. For debugging purposes the PageRank algorithm has been executed on the final submodules to give, more or less, the topic of each context. The divisions based on the modularity values have created the context structure shown in figure 5.9. After analysing the terms we can find out approximately what the topics of those contexts are; our interpretation is presented in table 5.8. We can see that the topics are generally divided into two groups, because our two connected components are reflected in the context division.

The recommendation process - traversing the graph - will result in picking items from connected contexts:
for example, if we are viewing an item whose context is #2.4 (Socks), we may receive items from context #2.3 (Leg warmers, which are also a kind of socks) but also from #2.1 (Bras), which belong to lingerie as well and might be useful.

Figure 5.9: Graph of contexts structure (context #1 with sub-contexts #1.1-#1.4, #1.3.1-#1.3.3 and #1.3.3.1-#1.3.3.2, and context #2 with sub-contexts #2.1-#2.4).

In this section we have presented a few different ideas for creating contexts based on item keywords. We started with the simple idea of analyzing the items of each category and creating category descriptions. Although this method is fast, because it can take advantage of existing technologies and methods, the results might not be satisfactory. To improve the relevance and to connect contexts more strongly, we have tested the Latent Semantic Indexing approach to context creation. This approach, on the other hand, finds hidden connections between items but is very demanding in terms of resources, especially when it has to be performed on huge datasets. As a result of those two approaches we created the network-based method, which performs the category analysis separately, finds hidden connections between items inside each category and then analyzes the whole dataset using network and graph algorithms to find the contexts in the data.

No  Category name               Category path                                                                    Items type
1   Skarpetki i podkolanówki    Odzież, Obuwie, Dodatki / Odzież damska / Bielizna / Skarpetki i podkolanówki    Socks
2   Przelewowe                  RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Przelewowe                 Coffee machines
3   Ciśnieniowe                 RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Ciśnieniowe                Coffee machines
4   Młynki do kawy              RTV i AGD / Sprzęt AGD / Kuchnia / Młynki do kawy                                Coffee grinders
5   Silikonowe                  Odzież, Obuwie, Dodatki / Odzież damska / Bielizna / Biustonosze / Silikonowe    Bras
6   Sportowe                    Odzież, Obuwie, Dodatki / Odzież damska / Bielizna / Biustonosze / Sportowe      Bras
7   Typu push-up                Odzież, Obuwie, Dodatki / Odzież damska / Bielizna / Biustonosze / Typu push-up  Bras
8   Naczynia do kawy i herbaty  Antyki i Sztuka / Antyki / Ceramika / Naczynia do kawy i herbaty                 Antique coffee & tea cups
9   Mielone                     Dom i Ogród / Żywność / Kawy / Mielone                                           Coffee
10  Rozpuszczalne               Dom i Ogród / Żywność / Kawy / Rozpuszczalne                                     Coffee
11  Ziarniste                   Dom i Ogród / Żywność / Kawy / Ziarniste                                         Coffee
12  Inne kawy                   Dom i Ogród / Żywność / Kawy / Inne kawy                                         Coffee

Table 5.6: Selected categories for evaluation.

No  Number of distinct terms  Number of distinct terms after compression
1   104  32
2   19   13
3   102  25
4   11   10
5   17   11
6   20   13
7   130  29
8   16   11
9   151  54
10  16   12
11  55   41
12  43   26

Table 5.7: Distinct term counts for the original and compressed category networks.

Context No  Topic interpretation
1           Coffee equipment
1.1         Coffee grinders
1.2         Coffee
1.3         Coffee machines
1.3.1       Coffee machines
1.3.2       Coffee machines (specific manufacturer)
1.3.3       Coffee & breakfast
1.3.3.1     Coffee
1.3.3.2     Breakfast accessories
1.4         Grounded coffee
2           Lingerie
2.1         Bras
2.2         Push-up bras
2.3         Leg warmers
2.4         Socks

Table 5.8: Topics interpretation of created contexts.

Chapter 6

Conclusions

The topic of this thesis was to create approaches that could be used in the context creation process and then in context-aware recommender systems.
The first problem was defining the meaning of context; we have ended up with the notion of a context as a set of complementary things that might be recommended together in order to create a full set of connected items. We have then presented three methods, where the task of each subsequent one was to create contexts while eliminating the problems observed in the results of the previously evaluated one, so each new approach has been a consequence of the previous one. As a result we have obtained a solution that mixes different approaches and fields of computer science to perform its task. It analyzes groups of items categorized by people, removes unnecessary information, builds groups of items and tries to connect them all together in order to divide them into contexts shared between categories. To achieve that we use Latent Semantic Indexing and network analysis methods such as the PageRank algorithm, the HITS algorithm and modularity. We think that this method may give decent results for the recommendation task on a real e-commerce platform.

Further work

It is impossible to judge the efficiency of any method presented in this thesis based only on our subjective opinion. Objective evaluation of recommender systems is hard because it depends on users' experience, their current needs, age, sex and social status: some recommended items might be suitable for one person and totally unacceptable for another. Well-known validation methods from machine learning often cannot be used because of this lack of objectivity - we cannot simply check an equation. Nevertheless, some methods of validating recommender systems have been presented in the literature; for example, to check users' satisfaction we can run user trials. We consider this the first thing that should be done as further work.

In all methods we have relied on the results of text-processing techniques. We have observed that this is the most important step in the whole context creation flow, and it could be improved by using better stemmers or by analyzing the texts more precisely. For instance, we have been splitting titles into terms built from single words and analyzing those. Instead, we could use n-grams and then check the impact of this decision on the final results; an n-gram creates a small context by itself, and in the network-based context creation approach n-grams might correspond to small cliques. Checking the impact of using n-grams could be the next subject of further verification.

Some of the methods used to perform subtasks, such as SVD for decomposing matrices, are quite complex and expensive in terms of processor and memory resources. We have tried to minimize the datasets they have to handle, but there are other ways of adapting the algorithms to larger sets of data: for instance, instead of the standard SVD, CUR matrix approximation or stochastic SVD might be used. However, this may affect the final results, so it has to be evaluated before being used in real systems.

In this thesis we have presented a simple idea of graph traversing in order to serve recommendation results. Other, more complex probabilistic models could be applied to provide a more accurate and realistic behaviour of context switching, and this can be a topic for further research.

Bibliography

[1] Webster Dictionary - Context. Available at http://www.merriam-webster.com/dictionary/context, last checked 13/06/22.
[2] Wikipedia: Connectivity. Available at http://en.wikipedia.org/wiki/Connectivity_(graph_theory), last checked 13/06/11.
[3] Wikipedia: HITS algorithm. Available at http://en.wikipedia.org/wiki/HITS_algorithm, last checked 13/06/11.
[4] Wikipedia: Modularity (networks). Available at http://en.wikipedia.org/wiki/Modularity_(networks), last checked 13/06/10.
[5] Wikipedia: PageRank. Available at http://en.wikipedia.org/wiki/PageRank, last checked 13/06/11.
[6] A. Dey, G. Abowd, D. Salber. A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications. Human Computer Interaction, 16(2):97–166, 2001.
[7] B. Schilit, M. Theimer. Disseminating Active Map Information to Mobile Hosts. IEEE Network, 8(5):22–32, 1994.
[8] C. Palmisano, A. Tuzhilin, M. Gorgoglione. Using Context to Improve Predictive Models of Customers in Personalization Applications. IEEE Transactions on Knowledge and Data Engineering, 20(11):1535–1549, 2008.
[9] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[10] G. Adomavicius, A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[11] G. Adomavicius, A. Tuzhilin. Context-Aware Recommender Systems. In Recommender Systems Handbook. Springer, 2011.
[12] G. Lilien, P. Kotler, S. Moorthy. Marketing Models. USA: Prentice Hall, pages 22–23, 1992.
[13] K. Verbert, N. Manouselis, X. Ochoa, M. Wolpers, H. Drachsler, I. Bosnic, E. Duval. Context-Aware Recommender Systems for Learning: A Survey and Future Challenges. IEEE Transactions on Learning Technologies, 5(4):318–335, 2012.
[14] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, September 1999.
[15] Jon M. Kleinberg. Hubs, authorities, and communities. ACM Comput. Surv., 31(4es), December 1999.
[16] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[17] M. Bazire, P. Brézillon. Understanding context before using it. In 5th International and Interdisciplinary Conference on Modeling and Using Context, volume 3554 of Lecture Notes in Artificial Intelligence, pages 29–40. Springer-Verlag, 2005.
[18] M. Berry, G. Linoff. Data mining techniques: For marketing, sales, and customer support. Wiley, 1997.
[19] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, June 2006.
[20] P. Brown, J. Bovey, X. Chen. Context-Aware Applications: From the Laboratory to the Marketplace. IEEE Personal Communications, 4(5):58–64, 1997.
[21] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999.
[22] C. K. Prahalad. Beyond CRM: C.K. Prahalad Predicts Customer Context Is the Next Big Thing. In American Management Association MwWorld, 2004.

Appendix A
CATEGORY-BASED CONTEXT CREATION EXPERIMENT DATA A.1 Items data No Item title Category path I1 Kawa Senseo “Jacobs Kronung” Dom i Ogród / Żywność / Pozostałe I2 EKSPRES CIŚNIENIOWO-PRZELEWOWY RTV i AGD / Sprzęt AGD / Kuchnia / Ekspresy do kawy / Ciśnieniowe Sport i Turystyka / Rowery i akcesoria / I3 rower GIANT BOULDER 500 Rowery / Szosowe Odzież, Obuwie, Dodatki / Odzież I4 damska / Odzież / Suknie i sukienki / Czarna subtelna sukienko tunika ESSENCE, r.48/50 Sukienki Sport i Turystyka / Sporty drużynowe / I5 Piłka do siatkówki, piłka siatkowa (80806-7) Siatkówka / Piłki RTV i AGD / Sprzęt audio przenośny / I6 IPOD APPLE 2GB MP4 / 2GB Komputery / Komputery - inne / Apple I7 NOWY MacBook Pro 17” 2,33GHz/2GB RAM,F.Vat.Gw / Notebooki Table A.1: Items selected to experiment. A.2 No Recommended items Item title Category Path Kawa Segafredo Intermezzo 1kg - oferta Dom i Ogród / Żywność / Kawy / POLCAFFE Ziarniste MŁYNEK DO KAWY FIRST RTV i AGD / Sprzęt AGD / Kuchnia / GWARAN24M-C SERWIS, FA.VA Młynki do kawy KUBEK TERMICZNY 400ML 16CM Motoryzacja / Gadżety motoryzacyjne / KAWA w SAMOCHODZIE Kubki R11 R12 R13 Dom i Ogród / Wyposażenie / Kubki i R14 ZESTAW DO KAWY I HERBATY filiżanki / Zestawy Table A.2: Items recommended for item I1. 67 A.2. RECOMMENDED ITEMS No Item title Category path Ekspres kawowy Ciśnieniowy przelewowy RTV i AGD / Sprzęt AGD / Kuchnia / clatronik Ekspresy do kawy / Ciśnieniowe Ekspres przelewowy PREDOM ZELMER RTV i AGD / Sprzęt AGD / Kuchnia / typ 215.2 Ekspresy do kawy / Przelewowe NAJLEPSZY EKSPRES CIŚNIENIOWY RTV i AGD / Sprzęt AGD / Kuchnia / 50% CENY, NAJTANIEJ Ekspresy do kawy / Ciśnieniowe R21 R22 R23 Firma i Przemysł / Gastronomia / R24 ekspres ciśnieniowy automatyczny Saeco Wyposaż. i akcesoria barowe / Ekspresy ciśnieniowe Table A.3: Items recommended for item I2. No Item title Category path POKROWIEC NA Sport i Turystyka / Rowery i akcesoria / R31 ROWER-MOTOR-SKUTER ITD. Akcesoria / Torby i sakwy / Pozostałe SUPER CENA ROWER DLA DZIECKA 4-8 LAT Dla Dzieci / Zabawki / Pojazdy / Na KOŁA 14” UŻYWANY pedały DAMSKI ROWER SUNCITY 26” ALU Sport i Turystyka / Rowery i akcesoria / SUPER CENA!***** Rowery / Trekkingowe JAZDA ROWEREM GORSKIM LOPES Sport i Turystyka / Rowery i akcesoria / NOWOSC KRAKOW KSIEG Literatura, instrukcje R32 R33 R34 Table A.4: Items recommended for item I3. 68APPENDIX A. CATEGORY-BASED CONTEXT CREATION EXPERIMENT DATA No Item title Category path Sukienka sztruksowa + bluza 92 2-3 lata Dla Dzieci / Odzież / Niemowlęta i małe SUPER!!! dzieci / Rozmiar 92 / Sukienki SUKIENKA NEXT WIOSNA/LATO Dla Dzieci / Odzież / Niemowlęta i małe 2007 R.86 dzieci / Rozmiar 86 / Sukienki R41 R42 Dla Dzieci / Odzież / Niemowlęta i małe R43 SUKIENKA DLA MAŁEJ WRÓŻKI dzieci / Rozmiar 68 / Sukienki Dla Dzieci / Odzież / Niemowlęta i małe R44 SUKIENKA,TUNICZKA-92,98CM. dzieci / Rozmiar 92 / Sukienki Table A.5: Items recommended for item I4. No Item title Category path WILSON Piłka do futbolu Sport i Turystyka / Sporty drużynowe / amerykańskiego NFL Futbol amerykański / Piłki R51 Sport i Turystyka / Sporty drużynowe / R52 Piłka, Okazja Rugby / Piłki SUPER HIT GRA SOCCER PIŁKA Gry / Konsole i automaty / Game Boy NOŻNA ZOBACZ SAM!!!! Color / Gry / Sportowe bo-bas/do skakania KONIK - PIŁKA * Dla Dzieci / Zabawki / Ogrodowe / SUPER JAKOŚĆ Pozostałe R53 R54 Table A.6: Items recommended for item I5. No Item title Category path Ładowarka sieciowa - iPod Mini Nano RTV i AGD / Sprzęt audio przenośny / Video 3G 4G Pozostałe R61 OKAZJA! 
M10 MP4 512MB RTV i AGD / Sprzęt audio przenośny / R62 AUDIO/VIDEO/FM/PENDR/DYK MP4 / 512MB GW. Ładowarka sieciowa - iPod Mini Nano RTV i AGD / Sprzęt audio przenośny / Video 3G 4G Pozostałe PROMOCJA! M06 MP4 2GB RTV i AGD / Sprzęt audio przenośny / AUDIO/VIDEO/FM/PEN/DYK GW. MP4 / 2GB R63 R64 Table A.7: Items recommended for item I6. 69 A.2. RECOMMENDED ITEMS No Item title Category path NOWY Pokrowiec Tucano na iMac 20” Komputery / Komputery - inne / Apple F.Vat !!! / Komputery NOWY Pokrowiec Tucano na iMac G5 Komputery / Komputery - inne / Apple 17” F.Vat / Komputery R71 R72 Table A.8: Items recommended for item I7. 70APPENDIX A. CATEGORY-BASED CONTEXT CREATION EXPERIMENT DATA 71 Appendix B Latent Semantic Indexing experiment data 72APPENDIX B. LATENT SEMANTIC INDEXING EXPERIMENT DATA B.1 Items data No Category Item type D1 Socks Suit socks Item title CZARNE ELEGANCKIE SKARPETKI HENDERSON 39 - 42 FROTOWE SKARPETKI NIKE 42-46 FROTA D2 Socks Sport socks SKARPETY # SKARPETY SKARPETKI SPORTOWE D3 Socks Sport socks EXTREME E3 rozm.29-30 STANIK BIUSTONOSZ PUSH-UP 80D - BIAŁY D4 Bras Bra (31) Miss Selfridge ŚWIETNY KORONKOWY D5 Bras Bra BIUSTONOSZ 75D SILIKONOWY STANIK CharaBra -PIĘKNY D6 Bras Bra BIUST ! ROZM B WKŁADKI SILIKONOWE pod STANIK - D7 Bra accessories Bra accessories Powiększ BIUST ! D8 Socks Socks D9 Bras Bra D10 Bras Bra D11 Bras Sport bra 3 PARY SPORTOWYCH FROTA 44-46 UN BRA SILIKONOWY Stanik B,C,D FACECI OSZALEJĄ Stanik Triumph dla aktywnych, Nowy tanio r.75D DKNY -75 C- BIAŁY SPORTOWY TOP-BIUSTONOSZ 60-70A BIUSTONOSZ - SPORTOWY - NOWY 0d D12 Bras Sport bra 1zl! 224 D13 Bras Sport bra Sportowy,elegancki Triumph,80A Table B.1: Items selected to experiment. 73 B.2. TERM-DOCUMENT FREQUENCY MATRIX B.2 Term-document frequency matrix D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 czarny 1 0 0 0 0 0 0 0 0 0 0 0 0 elegancki 1 0 0 0 0 0 0 0 0 0 0 0 1 skarpeta 1 2 2 0 0 0 0 0 0 0 0 0 0 henderson 0 0 0 0 0 0 0 0 0 0 0 0 0 frota 0 2 0 0 0 0 0 1 0 0 0 0 0 nike 0 1 0 0 0 0 0 0 0 0 0 0 0 sport 0 0 1 0 0 0 0 1 0 0 1 1 1 extreme 0 0 1 0 0 0 0 0 0 0 0 0 0 rozmiar 0 0 1 0 0 1 0 0 0 0 0 0 0 stanik 0 0 0 1 0 1 1 0 0 1 0 0 0 biustonosz 0 0 0 1 1 0 0 0 0 0 1 1 0 push 0 0 0 1 0 0 0 0 0 0 0 0 0 up 0 0 0 1 0 0 0 0 0 0 0 0 0 biały 0 0 0 1 0 0 0 0 0 0 1 0 0 miss 0 0 0 0 1 0 0 0 0 0 0 0 0 selfridge 0 0 0 0 1 0 0 0 0 0 0 0 0 świetny 0 0 0 0 0 0 0 0 0 0 0 0 0 koronka 0 0 0 0 1 0 0 0 0 0 0 0 0 silikon 0 0 0 0 0 1 1 0 1 0 0 0 0 charabra 0 0 0 0 0 1 0 0 0 0 0 0 0 piękny 0 0 0 0 0 1 0 0 0 0 0 0 0 biust 0 0 0 0 0 1 1 0 0 0 0 0 0 wkładka 0 0 0 0 0 0 1 0 0 0 0 0 0 powiększyć 0 0 0 0 0 0 1 0 0 0 0 0 0 para 0 0 0 0 0 0 0 1 0 0 0 0 0 un 0 0 0 0 0 0 0 0 1 0 0 0 0 bra 0 0 0 0 0 0 0 0 1 0 0 0 0 facet 0 0 0 0 0 0 0 0 1 0 0 0 0 oszaleć 0 0 0 0 0 0 0 0 0 0 0 0 0 triumph 0 0 0 0 0 0 0 0 0 1 0 0 1 aktywny 0 0 0 0 0 0 0 0 0 1 0 0 0 nowy 0 0 0 0 0 0 0 0 0 0 0 1 0 dkny 0 0 0 0 0 0 0 0 0 0 1 0 0 top 0 0 0 0 0 0 0 0 0 0 1 0 0 Table B.2: Term-document frequency matrix of selected items. 74APPENDIX B. 
LATENT SEMANTIC INDEXING EXPERIMENT DATA B.3 Singular Value Decomposition 0,07 -0,02 0,02 0,07 0,10 0,06 0,23 0,08 0,25 -0,35 0,12 0,03 -0,54 0,10 0,00 -0,05 0,27 0,17 -0,13 0,42 0,02 0,25 -0,48 0,17 -0,06 0,18 0,75 -0,20 0,16 0,00 0,12 0,31 0,12 0,16 0,11 0,11 -0,08 0,07 -0,06 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,43 -0,16 0,09 -0,44 -0,18 -0,47 -0,12 -0,19 -0,11 -0,14 0,04 -0,04 0,03 0,18 -0,08 0,07 -0,23 -0,03 -0,11 0,00 -0,02 -0,01 -0,07 -0,10 0,20 0,24 0,33 0,15 -0,42 0,50 -0,15 -0,25 -0,16 -0,14 -0,01 0,13 0,16 -0,10 0,08 0,16 -0,01 0,00 0,19 0,05 0,23 -0,06 0,06 -0,06 0,30 0,01 -0,18 0,01 0,19 0,17 0,14 0,25 0,05 0,32 -0,21 -0,05 -0,28 -0,03 0,00 -0,10 0,03 0,07 0,53 0,17 -0,17 0,35 -0,14 0,07 0,04 -0,10 0,14 0,01 0,00 -0,20 0,11 0,31 -0,52 -0,25 -0,10 0,19 0,09 -0,08 0,02 -0,01 0,08 0,29 -0,04 0,02 0,13 -0,10 -0,21 0,16 0,03 0,02 0,24 -0,08 0,01 0,30 -0,16 0,15 0,02 0,13 -0,10 -0,21 0,16 0,03 0,02 0,24 -0,08 0,01 0,30 -0,16 0,15 0,06 0,22 -0,29 -0,18 0,08 0,00 -0,12 0,35 0,05 -0,13 -0,15 -0,18 0,07 0,01 0,04 -0,10 -0,13 -0,09 0,21 0,24 -0,33 -0,03 0,01 -0,07 -0,17 0,02 0,01 0,04 -0,10 -0,13 -0,09 0,21 0,24 -0,33 -0,03 0,01 -0,07 -0,17 0,02 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,01 0,04 -0,10 -0,13 -0,09 0,21 0,24 -0,33 -0,03 0,01 -0,07 -0,17 0,02 0,05 0,39 0,31 0,06 -0,37 -0,02 0,08 0,06 0,08 -0,04 0,00 0,02 0,05 0,03 0,18 0,14 0,06 0,00 0,08 -0,15 -0,11 -0,22 -0,33 0,00 0,08 0,03 0,03 0,18 0,14 0,06 0,00 0,08 -0,15 -0,11 -0,22 -0,33 0,00 0,08 0,03 0,04 0,33 0,25 0,03 -0,02 0,00 -0,15 -0,19 0,21 -0,09 0,00 0,03 0,08 0,01 0,16 0,12 -0,03 -0,02 -0,08 0,00 -0,07 0,44 0,24 0,00 -0,06 0,05 0,01 0,16 0,12 -0,03 -0,02 -0,08 0,00 -0,07 0,44 0,24 0,00 -0,06 0,05 0,06 0,00 -0,05 0,02 -0,11 -0,25 -0,12 -0,15 -0,09 -0,01 0,24 -0,44 -0,45 0,00 0,05 0,06 0,03 -0,36 -0,02 0,23 0,25 -0,14 0,04 0,00 -0,01 -0,02 0,00 0,05 0,06 0,03 -0,36 -0,02 0,23 0,25 -0,14 0,04 0,00 -0,01 -0,02 0,00 0,05 0,06 0,03 -0,36 -0,02 0,23 0,25 -0,14 0,04 0,00 -0,01 -0,02 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,04 0,09 -0,04 0,21 0,28 -0,36 0,38 -0,09 -0,23 0,09 -0,24 0,04 0,30 0,01 0,07 0,02 0,01 0,20 -0,17 0,19 -0,02 -0,24 0,21 -0,29 0,13 -0,42 0,03 0,05 -0,13 0,07 -0,08 -0,02 -0,03 -0,10 0,00 0,12 0,30 0,64 -0,13 0,04 0,08 -0,19 0,03 -0,08 -0,03 -0,15 0,11 0,13 -0,14 -0,45 -0,02 -0,08 0,04 0,08 -0,19 0,03 -0,08 -0,03 -0,15 0,11 0,13 -0,14 -0,45 -0,02 -0,08 Table B.3: Matrix U . 75 B.3. SINGULAR VALUE DECOMPOSITION 3,71 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 3,17 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 2,89 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 2,21 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 2,01 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 1,98 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 1,83 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 1,80 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 1,56 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 1,43 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 1,34 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 1,14 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,88 Table B.4: Matrix Σ. 
0,25 -0,07 0,04 0,15 0,19 0,12 0,42 0,15 0,39 -0,51 0,16 0,03 -0,48 0,68 -0,25 0,19 -0,50 -0,07 -0,22 0,00 -0,04 -0,02 -0,09 -0,14 0,23 0,21 0,59 -0,03 0,01 0,42 0,10 0,46 -0,10 0,11 -0,09 0,43 0,01 -0,21 0,01 0,08 0,42 -0,29 -0,46 0,32 0,06 0,04 0,44 -0,12 0,02 0,40 -0,18 0,13 0,04 0,14 -0,28 -0,29 -0,18 0,42 0,44 -0,60 -0,05 0,01 -0,09 -0,19 0,02 0,11 0,56 0,40 0,12 0,01 0,17 -0,27 -0,20 -0,35 -0,47 -0,01 0,09 0,02 0,05 0,49 0,34 -0,07 -0,04 -0,16 0,00 -0,13 0,68 0,35 0,00 -0,07 0,05 0,22 0,00 -0,13 0,04 -0,22 -0,49 -0,22 -0,26 -0,14 -0,02 0,33 -0,51 -0,39 0,02 0,17 0,17 0,07 -0,72 -0,05 0,42 0,44 -0,21 0,06 0,00 -0,01 -0,02 0,03 0,22 0,05 0,03 0,41 -0,33 0,35 -0,04 -0,37 0,31 -0,39 0,15 -0,37 0,16 0,27 -0,56 0,06 -0,17 -0,06 -0,27 0,19 0,20 -0,21 -0,60 -0,02 -0,07 0,13 0,16 -0,37 0,15 -0,16 -0,04 -0,06 -0,18 0,01 0,17 0,41 0,73 -0,12 0,13 0,08 -0,18 0,45 0,14 -0,37 0,35 -0,12 0,00 -0,18 0,07 -0,10 0,63 Table B.5: Matrix V . 76APPENDIX B. LATENT SEMANTIC INDEXING EXPERIMENT DATA B.4 Queries Q1 Q2 Q3 Q4 Q5 Q6 czarny 0 0 0 0 0 0 elegancki 0 0 0 0 0 0 skarpeta 0 0 0 1 0 0 henderson 0 0 0 0 0 0 frota 0 0 0 0 0 0 nike 0 0 0 0 0 0 sport 0 1 0 0 1 0 extreme 0 0 0 0 0 0 rozmiar 0 0 0 0 0 0 stanik 0 0 0 0 0 1 biustonosz 1 1 0 0 0 0 push 0 0 1 0 0 0 up 0 0 1 0 0 0 biały 0 0 0 0 0 0 miss 0 0 0 0 0 0 selfridge 0 0 0 0 0 0 świetny 0 0 0 0 0 0 koronka 0 0 0 0 0 0 silikon 0 0 0 0 0 0 charabra 0 0 0 0 0 0 piękny 0 0 0 0 0 0 biust 0 0 0 0 0 0 wkładka 0 0 0 0 0 0 powiększyć 0 0 0 0 0 0 para 0 0 0 0 0 0 un 0 0 0 0 0 0 bra 0 0 0 0 0 0 facet 0 0 0 0 0 0 oszaleć 0 0 0 0 0 0 triumph 0 0 0 0 0 1 aktywny 0 0 0 0 0 0 nowy 0 0 0 0 0 0 dkny 0 0 0 0 0 0 top 0 0 0 0 0 0 Table B.6: Word vectors of queries used in an evaluation. 77 Appendix C Network-based context creation experiment data 78APPENDIX C. NETWORK-BASED CONTEXT CREATION EXPERIMENT DATA C.1 Categories networks (a) Network between terms in category #1. (b) Network between terms in category #2. (c) Network between terms in category #3. (d) Network between terms in category #4. Figure C.1: Networks between terms for categories #1 - #4. 79 C.1. CATEGORIES NETWORKS (a) Network between terms in category #5. (b) Network between terms in category #6. (c) Network between terms in category #7. (d) Network between terms in category #8. Figure C.2: Networks between terms for categories #5 - #8. 80APPENDIX C. NETWORK-BASED CONTEXT CREATION EXPERIMENT DATA (a) Network between terms in category #9. (b) Network between terms in category #10. (c) Network between terms in category (d) Network between terms in category #11. #12. Figure C.3: Networks between terms for categories #9 - #12. C.2. COMPRESSED CATEGORIES NETWORKS C.2 81 Compressed categories networks (a) Compressed network between terms in (b) Compressed network between terms in category #1. category #2. (c) Compressed network between terms in (d) Compressed network between terms in category #3. category #4. Figure C.4: Compressed networks between terms for categories #1 - #4. 82APPENDIX C. NETWORK-BASED CONTEXT CREATION EXPERIMENT DATA (a) Compressed network between terms in (b) Compressed network between terms in category #5. category #6. (c) Compressed network between terms in (d) Compressed network between terms in category #7. category #8. Figure C.5: Compressed networks between terms for categories #5 - #8. C.2. COMPRESSED CATEGORIES NETWORKS 83 (a) Compressed network between terms in (b) Compressed network between terms in category #9. category #10. 
(c) Compressed network between terms in category #11. (d) Compressed network between terms in category #12.
Figure C.6: Compressed networks between terms for categories #9 - #12.

C.3 Modules of merged network

(a) Submodule #1.1. (b) Submodule #1.2. (c) Submodule #1.3. (d) Submodule #1.4.
Figure C.7: Submodules of component #1.

(a) Submodule #2.1. (b) Submodule #2.2. (c) Submodule #2.3. (d) Submodule #2.4.
Figure C.8: Submodules of component #2.

(a) Submodule #1.3.1. (b) Submodule #1.3.2. (c) Submodule #1.3.3 and its 2 submodules - #1.3.3.1 and #1.3.3.2.
Figure C.9: Submodules of module #1.3.

List of Figures

2.1  Paradigms for incorporating context in recommender systems. [11]  16
3.1  Entity-Relation Diagram representing the dataset.  19
5.1  Context-aware recommendation process.  29
5.2  Text preprocessing workflow.  31
5.3  Example of category tree structure.  34
5.4  Schematic representation of singular value decomposition of the matrix A.  39
5.5  Schematic workflow of network-based context creation approach.  48
5.6  Contexts' structure graph and possible recommendation jumps.  51
5.7  Network representing 12 categories merged together.  53
5.8  Connected components.  54
5.9  Graph of contexts structure.  55
C.1  Networks between terms for categories #1 - #4.  78
C.2  Networks between terms for categories #5 - #8.  79
C.3  Networks between terms for categories #9 - #12.  80
C.4  Compressed networks between terms for categories #1 - #4.  81
C.5  Compressed networks between terms for categories #5 - #8.  82
C.6  Compressed networks between terms for categories #9 - #12.  83
C.7  Submodules of component #1.  84
C.8  Submodules of component #2.  85
C.9  Submodules of module #1.3.  86

List of Tables

5.1  Selected items and their types.  40
5.2  Similarities between documents and queries.  42
5.3  Similarities between queries.  43
5.4  Similarities between documents before applying LSI.  44
5.5  Similarities between documents after applying LSI.  44
5.6  Selected categories for evaluation.  56
5.7  Distinct term counts for original and compressed category networks.  57
5.8  Topics interpretation of created contexts.  57
A.1  Items selected to experiment.  66
A.2  Items recommended for item I1.  66
A.3  Items recommended for item I2.  67
A.4  Items recommended for item I3.  67
A.5  Items recommended for item I4.  68
A.6  Items recommended for item I5.  68
A.7  Items recommended for item I6.  68
A.8  Items recommended for item I7.  69
B.1  Items selected to experiment.  72
B.2  Term-document frequency matrix of selected items.  73
B.3  Matrix U.  74
B.4  Matrix Σ.  75
B.5  Matrix V.  75
B.6  Word vectors of queries used in an evaluation.  76