Personalized Resource Categorisation in Folksonomies
Muzaffer Ege Alper and Şule Gündüz Öğüdücü
Faculty of Computer and Informatics, Istanbul Technical University, Maslak, Istanbul, Turkey
malper@itu.edu.tr, sgunduz@itu.edu.tr

ABSTRACT
Folksonomies constitute an important type of Web 2.0 service, in which users collectively annotate (or "tag") resources to create custom categories. Semantic relations among these categories hint at the possibility of another categorization at a higher level. Discovering these more general categories, called "topics", is an important task. One problem is to discover semantically coherent topics together with the small sets of tags that cover them, in order to facilitate more detailed item search. Another important problem is to find words/phrases that describe these topics, i.e., labels or "meta-tags". Such labeled topics can immensely increase the item search efficiency of users in a folksonomy service; however, this possibility has not been sufficiently exploited to date. In this paper, a probabilistic model is used to identify topics in a folksonomy, which are then associated with relevant, descriptive meta-tags. In addition, a small set of diverse and relevant tags is found which covers the semantics of each topic well. The resulting topics form a personalized categorization of folksonomy data, due to the personalized nature of the model employed. The results show that the proposed method is successful at discovering important topics and the corresponding identifying meta-tags.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Text Mining

General Terms
Algorithms

Keywords
Statistical Topic Models, Topic Model Labeling, Meta-Tag Generation, Personalized Information Retrieval

(c) 2012 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the national government of Turkey. As such, the government of Turkey retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. MDS August 12-16 2012, Beijing, China. Copyright 2012 ACM 978-1-4503-1546-3/12/08 ...$15.00.

1. INTRODUCTION
One of the core challenges of data mining is to turn vast amounts of data into manageable information. Web 2.0 services have immensely increased the rate of data acquisition, increasing the demand for more advanced methods to mine information from them. Folksonomies constitute one type of such services, where users collectively annotate (or "tag") resources to create custom categories. These tags, however, are semantically related, and these relations can be exploited to form higher-level categorizations, also called "topics". One approach to forming topics would be to employ dictionaries to find clusters of semantically related tags. Another approach, called topic modeling, has recently gained popularity; here, the semantic relations among words are assumed to show themselves in co-occurrences of the words in documents. Latent Dirichlet Allocation (LDA) is a particularly successful method of this kind [2]. The semantic relations (hypernymy, synonymy, and others) among tags can then be discovered and exploited further using the resulting tag-topic distribution, as in [13]. Another important goal is to find identifying words/phrases for the discovered topics; this is the topic labeling problem [8, 10]. It has gained attention in recent years due to its broad application areas in text mining and information retrieval, such as multi-document summarization [4] and opinion mining [9]. The corresponding labels are useful as a summary of each topic and also make the topic model more transparent to the user [8].
Most existing models extract these labels from text collections according to the contents of the collections, without considering the needs of users. However, a label that is very difficult for one user to understand may be very meaningful for another. Thus, in folksonomies it is highly desirable to automatically generate meaningful, personalized labels for a topic. In the context of folksonomies, labels also enable categorization of resources, so that resources tagged "computer" and "informatics" may fall into the same category. In a sense, tags are tools for organizing resources and meta-tags are generalizations of them. This higher-level categorization can immensely increase the item search efficiency in a folksonomy service. Since the computed categorizations are considered an aid that helps users locate their resources of interest faster, construction of personalized categories is also crucial. For this reason, the different usage patterns of users should be taken into account in the categorization process. However, to the best of our knowledge, this possibility has not been sufficiently exploited to date. A more serious problem with topic labeling is that the methods found in the literature are not applicable to the problem of meta-tag generation in folksonomies. As noted in other works, multiple words/phrases are necessary for identifying names, since single words are usually too broad in meaning or are usually found inside longer phrases [10]. However, their solution is also not suitable here, because it assumes that there is an underlying document set behind the bag-of-words dataset employed in the topic modeling process; the contents of these documents are used to find identifying phrases for topics.
The tagging data in folksonomies, however, naturally come in a bag-of-words format, and this representation is not derived directly from the contents of the documents/resources. Rather, it reflects the interest of the user in a resource as well as the user's personal understanding of the resource. Besides, many folksonomy services host resources which are not text-based (videos, pictures, animations, etc.). Even for bookmarking services such as "Del.icio.us"¹, whose resources are websites, collection of content is problematic due to Flash-based websites, video hosting sites with little (or noisy) textual content, etc. For these reasons, it is desirable to construct meta-tags using only the bag-of-words data obtained from the user annotations of the resources. In this paper, a probabilistic model, called the Latent Interest Model (LIM) [1], is used to identify topics in a folksonomy. This model is similar to the method proposed in [15], but also has important differences, as mentioned in Section 2.3. The success of LIM as a personalized probabilistic model of folksonomies was previously shown in the context of item recommendation [1], where the items are tags or resources. Here, it is employed to discover a personalized categorization (taxonomy) of resources in folksonomies. The contribution of this paper is threefold. First, a compact and more informative representation of topics is proposed (using the criteria of minimum redundancy and maximum relevance [12]) instead of the usual listing of the top-K most probable words per topic. Second, a method is proposed to assign clear, identifying words/phrases to topics without assuming an underlying textual dataset. Finally, these methods, together with the application of LIM, are employed to discover personalized resource categorizations in folksonomies. The resulting categorisation has two levels: the level of topics (meta-tags) and the lower level of tags in a given topic (see the second stage in Figure 2).
The quality of the generated meta-tags and the resource categorization is also demonstrated with a user study. Note that the proposed methods can also be employed in any other application of topic models. The paper is organized as follows: in Section 2, formal definitions of a folksonomy and the topic labeling problem are given, together with a brief discussion of the particular personalized topic model employed. Section 3 explains the proposed solution to the labeling problem in detail. Section 4 demonstrates the efficiency of the proposed method both qualitatively and quantitatively. Section 5 reviews the related work and, finally, Section 6 concludes the paper.

2. BACKGROUND
This section provides the necessary background on the problem of meta-tag generation in folksonomies. First, the notion of a folksonomy is discussed and the problem of discovering meta-tags in a folksonomy is presented. Finally, the particular model employed in this work, LIM, a personalized probabilistic model for folksonomy data previously proposed in [1], is explained.

¹ http://delicious.com/

Table 1: A sample topic from LIM
Tag (w)       Probability (p(w|z))
writing       0.248
reference     0.066
grammar       0.055
english       0.054
language      0.043
quotes        0.037
literature    0.026
copywriting   0.021
resources     0.019
tools         0.019

2.1 Folksonomies
A folksonomy can be formally defined as a quadruple of sets (T, U, R, P), where T is the set of all possible tag words, U is the set of all users, R is the set of all resources and P ⊆ T × U × R is the set of annotations. Elements of P are called "tagging triplets", or just "taggings" for short. We say a user u bookmarked a resource r if there exists a triplet (t, u, r) ∈ P for some t ∈ T. There are also rare cases where a user bookmarks a resource without tagging it; in this case, we assume that a tagging exists but the tag is unknown.
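As an illustration, the quadruple definition above maps directly to code. The sketch below is a minimal mirror of the formal structure; the tags, users and resource names are hypothetical.

```python
# A folksonomy as a quadruple (T, U, R, P): tags, users, resources,
# and the set of tagging triplets P ⊆ T × U × R. Data is hypothetical.
T = {"writing", "grammar", "python"}
U = {"alice", "bob"}
R = {"style-guide.example.org", "py-tutorial.example.org"}
P = {
    ("writing", "alice", "style-guide.example.org"),
    ("grammar", "alice", "style-guide.example.org"),
    ("python",  "bob",   "py-tutorial.example.org"),
}

def bookmarked(u, r, taggings):
    """u bookmarked r iff some tagging triplet (t, u, r) exists."""
    return any(user == u and res == r for (_tag, user, res) in taggings)
```

Under this reading, `bookmarked("alice", "style-guide.example.org", P)` holds because two taggings by that user exist for the resource.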
2.2 Topic Labeling
Topic models are probabilistic models of textual data (in bag-of-words format), stated in terms of conditional distributions p(w|z) of words w given topics z and distributions p(z) of the topics themselves. It is generally assumed that a successful topic model gives most of the probability weight to semantically related words. There have been several attempts to use probabilistic models in topic extraction; common methods are probabilistic Latent Semantic Analysis (pLSI) [5] and LDA [2]. The intuition behind these models is that the observed texts are derived from a generative model. In such a model, unseen latent variables (the topics) are represented as random variables over the set of words; the words and the documents in the data set are generated from these latent variables. Table 1 shows a sample topic with probability values computed using LIM, which is also a probabilistic topic model. Note that only the first ten words are listed. This topic is obviously related to English grammar, with related references, resources and tools. The problem of topic labeling, or, in the particular case of folksonomies, meta-tag generation, is to find words/phrases that are representative of the topic, i.e., that summarize the information in the conditional word probability distribution p(w|z). The solution to this problem is composed of two steps [8, 10]: first, a list of candidate labels is selected or generated; then, the best candidate label is selected using a score function. The proposed method considers only the words in the topic model; as a result, the scoring function has a simple form, namely the product of the conditional probabilities of the words of a given label (a word/tag pair). The contribution of this paper is mainly in finding non-redundant and descriptive candidate labels. As an example, the method proposed in this paper assigns "writing/grammar" to the topic shown in Table 1.
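For instance, using the p(w|z) values of Table 1, the score of a two-tag label reduces to the product of its members' conditional probabilities. The candidate pairs below are picked by hand for illustration only; the actual candidate generation is the subject of Section 3.

```python
# Top of the p(w|z) distribution for the sample topic of Table 1.
p_w_given_z = {"writing": 0.248, "reference": 0.066, "grammar": 0.055,
               "english": 0.054, "language": 0.043}

def label_score(pair, p):
    """Score of a candidate two-tag label under the simple scheme:
    the product of the conditional probabilities p(w|z) of its members."""
    w1, w2 = pair
    return p[w1] * p[w2]

candidates = [("writing", "grammar"), ("reference", "english")]
best = max(candidates, key=lambda c: label_score(c, p_w_given_z))
# best == ("writing", "grammar"), i.e. the label "writing/grammar"
```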
2.3 Introduction to LIM
Much of the previous work on probabilistic modeling of folksonomies assumed a global meaning for tags [7]. LIM, on the contrary, is a personalized model that assumes the meaning (or "interest context") of a tagging event is determined by the user, tag and resource collectively. This is achieved by incorporating users and resources into the model as random variables. Figure 1 shows the graphical model of this generative process.

[Figure 1: Graphical Model of LIM]

Notice that the latent variables z have a broader meaning in our model than simply being a topic of words as in LDA [2]. They are better interpreted as "interest contexts", since the tagging triplets they generate can be regarded as statements of what interest a given resource is to a user, declared by the user with the associated tag. Such a reading of a folksonomy assumes that each tag a user attaches to a resource is a declaration of why the user is interested in the resource and/or how the user intends to use it. The (hyper)parameters of the model are α, β, γ and ν. These are, respectively, the asymmetric Dirichlet prior for the distribution over "interest contexts" (θ) and the symmetric Dirichlet priors for the distributions of users, resources and tags (Φu, Φr and Φt) [14]. To estimate these distributions, we employ Collapsed Gibbs Sampling [3]. We sample from the posterior p(z|u, r, t, D), where the variables Φu, Φr, Φt and θ have been integrated out, using

p(z_i = k | z_{-i}, u, t, r, α, β, γ, ν, D) = (a_{ik} b_{iuk} c_{itk} d_{irk}) / Σ_{k'} (a_{ik'} b_{iuk'} c_{itk'} d_{irk'})    (1)

where

a_{ik} = N_k^{-i} + α_k,
b_{iuk} = (N_{uk}^{-i} + γ) / (N_k^{-i} + Uγ),
c_{itk} = (N_{tk}^{-i} + ν) / (N_k^{-i} + Tν),
d_{irk} = (N_{rk}^{-i} + β) / (N_k^{-i} + Rβ).

In the expressions above, the index i in z_i denotes the i-th triplet in the corpus, z_i is the associated "interest" variable, and z_{-i} are the interest variables corresponding to the other triplets.
N_{urtk}^{-i} is used as the count of a particular triplet associated with a particular "interest" in the corpus, excluding the i-th triplet. In this form it can be either 0 or 1, depending on whether the triplet (t, u, r) is a triplet other than the i-th and the interest context sampled for it is the k-th. Dropped indices denote a summation over that index; for example, N_{uk}^{-i} = Σ_{r∈R, t∈T} N_{urtk}^{-i}. Notice the effect of the Dirichlet prior parameters as pseudo-counts, which is a consequence of the conjugacy of the Dirichlet and multinomial distributions. The (Bayes) estimates of the four essential conditional probabilities (Φ̂u, Φ̂r, Φ̂t, θ̂) in the model are:

Φ̂u = (N_{uz} + γ) / (N_z + Uγ),
Φ̂r = (N_{rz} + β) / (N_z + Rβ),
Φ̂t = (N_{tz} + ν) / (N_z + Tν),
θ̂_z = (N_z + α_z) / (N + Σ_k α_k).

2.4 Resource Categorization using LIM
Resource categorization using LIM can be achieved in two ways. First, each "interest context" provides a ranked list of resources by computing p(r|z) and listing the top-K resources. Since the interest contexts in a folksonomy reflect the different interests of different users, only some of these topics are shown to a particular user, determined by applying a threshold δ to p(z|u) ∝ p(u|z)p(z). The parameters K and δ determine the amount of resources categorized for user u. In the Results section, the effect of varying K on categorization performance is explored. The second kind of resource categorization is via p(r|t, u). Notice that, in this form, the listing does not necessarily correspond to a particular topic. However, in the next section, a method is introduced to compute a set of tags that are maximally related to the topic while having as little redundancy as possible. As a result, the tags in this set and the related resources can be loosely seen as sub-categories of the corresponding topics.
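As a concrete (toy) illustration, the sampling update of Eq. (1) can be sketched as below. The corpus, parameter values and function name are hypothetical; the counts are recomputed from scratch for clarity, whereas a real sampler would maintain them incrementally and decrement/increment around each draw.

```python
def lim_gibbs_probs(i, z, triplets, K, n_users, n_res, n_tags,
                    alpha, beta, gamma, nu):
    """p(z_i = k | z_{-i}, ...) of Eq. (1) for all k, computed from
    count statistics that exclude triplet i (the N^{-i} counts)."""
    t, u, r = triplets[i]
    N_k  = [0] * K   # triplets assigned to context k
    N_uk = [0] * K   # triplets of user u assigned to k
    N_tk = [0] * K   # triplets with tag t assigned to k
    N_rk = [0] * K   # triplets on resource r assigned to k
    for j, (tj, uj, rj) in enumerate(triplets):
        if j == i:
            continue
        k = z[j]
        N_k[k] += 1
        N_uk[k] += uj == u
        N_tk[k] += tj == t
        N_rk[k] += rj == r
    probs = []
    for k in range(K):
        a = N_k[k] + alpha[k]                               # a_ik
        b = (N_uk[k] + gamma) / (N_k[k] + n_users * gamma)  # b_iuk
        c = (N_tk[k] + nu)    / (N_k[k] + n_tags * nu)      # c_itk
        d = (N_rk[k] + beta)  / (N_k[k] + n_res * beta)     # d_irk
        probs.append(a * b * c * d)
    total = sum(probs)
    return [p / total for p in probs]
```

Note the pseudo-count role of β, γ, ν: each ratio is a smoothed estimate of the corresponding conditional, exactly as in the estimates Φ̂u, Φ̂r, Φ̂t above.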
Finally, once the topic is itself labeled with a descriptive phrase, the task of computing a personalized hierarchical resource categorization is complete. The whole process can be visualized as in Figure 2.

3. META-TAG GENERATION
Intuitively, a meta-tag should be representative of its topic. This is achieved when the meta-tag is both specific to the topic and general enough to cover the semantic variety in the topic [8]. In this section, the proposed method to automatically determine meta-tags for topics in a folksonomy is presented. Unlike [10], the proposed model does not assume an underlying text dataset, which enables it to be applicable to folksonomy data. First, a method to find a compact representation for topics is discussed. Then, this new representation is employed to find descriptive labels for the topics.

[Figure 2: The overall process of personalized hierarchical resource categorization]

Table 2: A topic from LIM with redundancies
Tag          Probability
webdesign    0.134
design       0.132
blog         0.117
tutorial     0.094
inspiration  0.079
resources    0.056
web          0.05
css          0.044
photoshop    0.028
web2.0       0.016

Table 3: A topic from LIM with compact representation
Tag          Score
webdesign    0.183
jquery       0.161
blog         0.158
news         0.147
magazine     0.132
bookmarks    0.115
icons        0.111
php          0.099
javascript   0.096
css3         0.096
textures     0.094

3.1 Compact Topic Representation
The common method to present a topic is to list the top-K most probable words given the topic. This list, however, can contain redundancies. Consider the topic in Table 2. This topic is obviously related to websites with blogs, references and resources for web design. It can be directly observed that several tags repeat these themes, like "webdesign", "design", "web" or "resources", "web2.0". In this section, a method to produce a more compact and informative representation is presented. The results are shown in Table 3. With a comparable number of words, the tags shown in the table give a much more diverse and detailed picture of the concepts significantly related to the topic.

In order to eliminate redundancies, the concepts of relevance and redundancy are employed. The proposed method is similar in spirit to the mRMR feature selection method [12]. However, the definitions of relevance and redundancy are slightly different in this work. The relevance of a word (tag) w to a topic z is defined as

Rel(w, z) = log p(z|w) − log p(z) = log p(w|z) − log p(w).

In other words, relevance is the difference between the self-information of the topic and that of the topic given the tag/word. Notice that relevance is defined as a function of a realization of a random variable rather than the random variable itself, so there is no expectation. One would like to define redundancy similarly:

Red*(wi, wj) = log p(w_t = wi | w_{t+1} = wj) − log p(w_t = wi),

where w_t is the random variable corresponding to the word sampled at time t, while wi and wj are particular realizations of this random variable (we drop the superscript t for realizations, since the set of possible words from which we sample is the same for all time steps). However, the model assumptions of LIM state that the tags/words are distributed i.i.d.: one sample at step t does not yield additional information about the sample at step t+1. To circumvent this issue, another measure for similarity, and consequently redundancy, is used:

Red(wi, wj) = log p(w_t = wi | τ(wi) = τ(wj)) − log p(w_t = wi),
p(w_t = wi | τ(wi) = τ(wj)) = Σ_z p(z | w_{t+1} = wj) p(w_t = wi | z),

where τ(·) indicates the corresponding topic of the given word. The above equation can be read as follows: redundancy is the information supplied by knowing that the words at t and t+1 have the same "interest contexts", with respect to the event of observing wi at a given time.
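With the tag-topic distributions p(w|z) and the topic prior p(z) in hand, both quantities reduce to a few lines. A minimal sketch with hypothetical probabilities (two topics, a handful of tags); the function names are illustrative, not from the paper:

```python
import math

# Hypothetical model output: p(w|z) per topic and the topic prior p(z).
p_w_z = {0: {"webdesign": 0.6, "design": 0.3, "jquery": 0.1},
         1: {"cooking": 0.7, "recipes": 0.2, "design": 0.1}}
p_z = {0: 0.5, 1: 0.5}

def p_w(w):
    """Marginal p(w) = sum_z p(w|z) p(z)."""
    return sum(p_w_z[z].get(w, 0.0) * p_z[z] for z in p_z)

def rel(w, z):
    """Rel(w, z) = log p(w|z) - log p(w)."""
    return math.log(p_w_z[z][w]) - math.log(p_w(w))

def p_z_given_w(z, w):
    return p_w_z[z].get(w, 0.0) * p_z[z] / p_w(w)

def red(wi, wj):
    """Red(wi, wj) = log p(w=wi | tau(wi)=tau(wj)) - log p(w=wi),
    with p(w=wi | same context as wj) = sum_z p(z|wj) p(wi|z)."""
    p_same = sum(p_z_given_w(z, wj) * p_w_z[z].get(wi, 0.0) for z in p_z)
    return math.log(p_same) - math.log(p_w(wi))
```

Here `red("design", "webdesign")` is positive ("design" co-occurs with "webdesign" in topic 0), while `red("design", "cooking")` is negative: knowing the contexts match makes observing "design" less likely.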
Observe that if the words wi and wj have similar probabilities over several different topics, −log p(w_t = wi | τ(wi) = τ(wj)) will be much smaller than −log p(w_t = wi), indicating a large redundancy. The algorithm then proceeds in an iterative fashion, where at each iteration the tag (not already in the list) with the highest score value is added:

Score(w, z, S) = p(w|z) · (Rel(w, z) − Σ_{wj ∈ S} Red(w, wj)),
S_{t+1} = S_t ∪ {arg max_{w ∉ S_t} Score(w, z, S_t)}.

In the above equations, S_t is the set of tags at the t-th step. Note that this step t denotes the iteration of the algorithm, whereas the "time" t in the previous paragraphs refers to the sampling time of a word in the corpus according to the probabilistic model. S_0 is defined as the empty set. Note also that the candidate words/tags are picked from the top-L most probable words given the topic, where L is taken to be W/T, W being the number of words and T the number of topics. This constraint improves the computational efficiency of the method while leaving the results the same, since most of the probability mass is allocated to this set by the topic model LIM. Finally, a threshold value has to be determined so that tags with low scores, which are irrelevant to the topic or redundant, are not included in the candidate label set. After trying different alternatives on our dataset, the "best" threshold is empirically determined to be one eighth of the highest score for a given topic. Note that the "compact" representation usually results in a list with many fewer identifying words than, say, ten. However, this is not always the case, as can be seen from Table 3. In that case, the words which are eliminated due to redundancy are replaced with other words that are non-redundant and more informative about the topic. It is, of course, possible to think of other viable options for determining cut-off points in the compact list. One option would be to consider resource coverage as defined in Section 4.3, possibly setting a cut-off point satisfying a given coverage threshold in probability (see Table 9 and the related discussion). However, for the purposes of meta-tag generation, we noticed that the simple approach used in this work suffices.

3.2 Labeling Topics
The compact representation of topics presented in the previous section is more efficient than the top-10 or top-5 most probable words listing in the literature, but it is still too long to use for labels in a taxonomy of resources. Therefore, in this section, a method to derive labels using these compact lists is introduced. As discussed previously, the goal is to find a relevant phrase that also strikes a balance between generality and specificity. The generality criterion is satisfied by the top-scoring members of the compact list (which may or may not be the most probable words given the topic). However, it is observed that sometimes these tags are too general to fully specify the topic, so accompanying words which increase the specificity are necessary. For example, "science" is a very broad meta-tag, whereas "science/astronomy" is much more specific and helpful to users. Additionally, some tags are naturally used in pairs with other tags, such as "social" and "web": despite one being redundant to the other, combined they are more informative than either alone. For these reasons, a method to find accompanying tags for the tags in the compact list is discussed in this section. Consequently, the final meta-tag is selected from these augmented tags using a simple criterion. The accompanying/augmenting tags are determined using redundancy to the respective tag, among the top-L most probable tags as in Section 3.1:

Aug(w, z) = arg max_{wj ∈ topL} Red(wj, w).

The combinations of tags formed in this fashion are candidates for topic labels.
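The greedy selection and the augmentation step can be sketched as follows. The probabilities, relevance/redundancy values and function names below are hypothetical stand-ins for the model quantities of Section 3.1; in particular, the toy Red function marks only "webdesign" and "design" as mutually redundant.

```python
def compact_set(words, p_w_given_z, rel, red, max_size=10):
    """Greedily build the compact tag list: at each step add the tag
    maximizing Score(w, z, S) = p(w|z) * (Rel(w, z) - sum_{wj in S} Red(w, wj)).
    Tags scoring below one eighth of the highest (first) score are dropped."""
    S, cutoff = [], None
    while len(S) < max_size:
        best_w, best_s = None, float("-inf")
        for w in words:
            if w in S:
                continue
            s = p_w_given_z[w] * (rel(w) - sum(red(w, wj) for wj in S))
            if s > best_s:
                best_w, best_s = w, s
        if best_w is None:
            break
        if cutoff is None:
            cutoff = best_s / 8.0   # one eighth of the highest score
        if best_s < cutoff:
            break
        S.append(best_w)
    return S

def augment(w, top_l, red):
    """Aug(w, z): among the top-L tags, the one most redundant with w."""
    return max((wj for wj in top_l if wj != w), key=lambda wj: red(wj, w))

# Toy demonstration with hypothetical values.
p = {"webdesign": 0.4, "design": 0.35, "jquery": 0.25}
rel_vals = {"webdesign": 1.0, "design": 0.9, "jquery": 0.8}

def toy_red(a, b):
    return 1.0 if {a, b} == {"webdesign", "design"} else 0.0

S = compact_set(list(p), p, lambda w: rel_vals[w], toy_red)   # ["webdesign", "jquery"]
aug = augment("webdesign", list(p), toy_red)                  # "design"
```

In the toy run, "design" is skipped by the greedy step (its score turns negative once "webdesign" is in S), yet it is exactly the tag the augmentation step attaches to "webdesign", mirroring the "social"/"web" example above.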
In this work, we simply choose the most probable combination (p(wi, wj|z) = p(wi|z) p(wj|z)) among these candidates as the meta-tag for the corresponding topic z.

4. RESULTS
In this section, the results of the proposed meta-tag generation method are shown using data collected from the popular folksonomy website Del.icio.us. The dataset consists of 34,665 users, 6,429 resources, 9,641 tags, 5,546,813 taggings and 1,358,522 bookmarks. This set was produced by collecting data from Del.icio.us and performing standard cleaning operations, such as stemming of tags using standard methods such as Pling, removing non-English websites and removing stop-words from tags. In addition, resources which were not bookmarked by at least 100 users and tags which were not used in at least 100 bookmarks were removed (the resulting dataset is also called a 100-core), similar to other studies on folksonomies such as [6]. The proposed model contains one asymmetric Dirichlet prior parameter vector α and three other parameters, β, γ and ν, for the symmetric priors. Additionally, the number of "interest contexts" is also a parameter of the model. Traditionally, this last parameter (the number of topics) is manually tuned, either using perplexity on unseen data or using another performance criterion regarding the specific application. In this work, this value is found using the precision/recall curve on a validation dataset. The other parameters can also be computed by maximum (marginal) likelihood estimation, which is the prevalent method in the literature; however, we have observed that validation-based parameter estimation performed better in our setting [1]. The estimated parameters used in this paper are β = 0.1, γ = 1, ν = 0.01, with 200 topics. The results are reported in three parts. First, the solution to the classical topic labeling problem is reported by showing ten randomly sampled topics together with the related tag probabilities.
The goal here is to show that the meta-tags can successfully represent the underlying tag distribution. The second part shows the efficiency of resource categorization using the proposed method; here, twenty randomly sampled topics are used to test the categorization performance. The resulting precision/recall values of this set of experiments are shown, together with two samples of "best" and "worst" categories, followed by a discussion of these results. Finally, the efficiency of the compact representation is discussed by showing the improved resource coverage results.

4.1 Topic Labeling Results
The results are reported using 10 randomly selected topics; for each topic, the classical top-10 most probable words list, the compact list and the meta-tags are shown in Table 4, Table 5 and Table 6, respectively.

Table 4: 10 random topics from LIM
Topic 0: tools, internet, dns, network, web, test, networking, ip, speed, speedtest
Topic 1: javascript, tools, programming, development, webdev, code, web, js, testing, ajax
Topic 2: torrent, download, bittorrent, search, p2p, music, movies, software, video, rapidshare
Topic 3: wiki, reference, wikipedia, encyclopedia, web2.0, collaboration, wikis, tools, research, information
Topic 4: tools, pdf, converter, online, conversion, free, convert, software, file, video
Topic 5: dictionary, language, reference, english, tools, translation, word, thesaurus, education, online
Topic 6: photo, photography, images, flickr, tools, web2.0, pictures, search, sharing, gallery
Topic 7: programming, processing, software, art, visualization, design, opensource, graphics, interactive, code
Topic 8: generator, tools, webdesign, design, web, favicon, css, online, text, html
Topic 9: typography, fonts, webdesign, css, design, tools, web, @font-face, type, resources

Table 5: Compact representations of 10 random topics from LIM
Topic 0: internet, tools, resources, reference, flash, download, search, technology, visualization, web2.0
Topic 1: javascript, tools, online, collaboration, free, web2.0, software
Topic 2: torrent, blog
Topic 3: wiki
Topic 4: pdf, tools, html, music, graphics, photo, resources, images, technology, howto
Topic 5: dictionary, web, web2.0, social
Topic 6: photo, browser, searchengine
Topic 7: programming, processing, inspiration, images, art, blog, media, graphic, research, linux
Topic 8: generator
Topic 9: typography, tools, online

The results in Table 6 show the descriptive value of the computed meta-tags. They also hint at the effect of these labels, or meta-tags, in improving the resource search efficiency of users in a folksonomy. In order to better appreciate the compact topic representation, it is important to have a deeper understanding of the kind of "redundancy" that this representation eliminates. To this end, consider a simple scenario where a user selects a topic and the relevant resources (ordered w.r.t. p(r|z)) are listed to her alongside the descriptive tags in the compact representation. This general list is useful; however, a user might also want to narrow her search by selecting one of the listed tags, in which case the resources are listed w.r.t. p(r|t, u). Now, if one were to use the top-K most probable tags as the representation of the topics, many of the tags would lead to a similar listing of resources, thus being redundant. This issue is discussed further in Section 4.3.

4.2 Resource Categorization Results
The previous discussion demonstrates the effectiveness of the proposed method in finding a label that covers and summarizes the underlying tag distribution. The major claim of this paper, however, is that these meta-tags are also representative of the related resources and that these resources form a useful categorization of the resources in a folksonomy.

Table 7: Average relevance of resources to the suggested meta-tags at different coverage levels
Average Precision   76.6%   84%
Coverage            62%     31%
Success at this task is evaluated using the judgements of five subjects (four Master's students and one PhD student), who scored each of the 20 most probable resources for 20 randomly selected topics (out of 200) according to whether they are related to the associated meta-tag. These binary scores, indicating relevance or irrelevance, are used to compute precision/recall values. Following this, the two best and two worst topics are also shown with their meta-tags and the corresponding resources. A discussion of these results provides significant insights into the different properties of the proposed method. Table 7 shows the average precision, that is, the ratio of average scores to the number of given resources, at different coverage (or recall) values. Results for two coverage values are reported, to indicate the performance difference between the top 10 and top 20 most probable resources, with coverage 31% and 62% respectively. The coverage of these results is computed by taking the ratio of the number of included websites to the number of all websites. Note that this value is calculated by assuming that the precision in the sampled topics is an indicator of the precision in the rest. Furthermore, we assumed that there are few cases of overlapping resources among the 20 most probable resources of the topics. Thus, this value is just a rough estimate, which is helpful to get a sense of the generality of our results. The two best and worst scored topics are shown in Table 8. Notice that in this table only the root URLs of the websites are shown, due to space restrictions; thus some websites appear to be listed twice when, in fact, the entries refer to different pages of those websites. Several important conclusions can be drawn from these results. First of all, the first word in the meta-tag tends to have a broader meaning, while the second term usually makes the meta-tag more specific.
The two "best" topics clearly show the usefulness of the augmenting tag in forming a descriptive and specific meta-tag. However, in the worst topics, it is seen that this specificity can also hurt. For example, the "programming/python" topic includes URLs of websites that are broadly related to programming but unrelated to Python. In quantitative terms, 20% of the 20 websites listed under this topic would be scored 1 instead of 0 if the augmenting tag "python" were not shown. However, we believe the qualitative results show the overall usefulness of the augmenting tags. Indeed, when asked, the test subjects indicated that in 75% of the topics (15/20) the augmenting tags served to produce a more specific label without disturbing the relevance to the resources. Another interesting observation is that, sometimes, the users in our dataset and the test subjects disagree on the relevance of resources to some of the concepts. For example, "projecteuler.net" is judged irrelevant by all test subjects due to its apparent irrelevance to "python"; but it is tagged with "python" by 5 users in our dataset! This, we believe, is again due to the personalized usage of websites. For many, the aforementioned website is irrelevant to Python, but for some it is a resource for Python exercises. This result is another indication of the importance of personalized meta-tag generation. The non-personalized nature of our experiments is a factor that lowers the scores. In practice, this topic would only be shown to users with similar "tastes", and this resource would have a higher probability of being useful/meaningful.

Table 6: Meta-Tags for the 10 topics
Topic 0: internet/dns        Topic 5: dictionary/language
Topic 1: javascript/js       Topic 6: photo/flickr
Topic 2: torrent/download    Topic 7: processing/programming
Topic 3: wiki/wikipedia      Topic 8: generator/tools
Topic 4: pdf/converter       Topic 9: typography/fonts

Table 9: Coverage of resources using the most probable tags (M1) and the compact tag set (M2)
Measure       M1            M2            p-value
Cardinality   47 ± 1.74     50 ± 1.5      0.001
Probability   4.33 ± 0.19   4.59 ± 0.18   0.008
4.3 Efficiency of the Compact Representation The compact representation discussed in Section 3.1 is expected to result in a set of tags that covers the semantics of the particular interest context well. This can be shown by using a sample scenerio. Assume that a user selects a related meta-tag and browses this topic by selecting a tag from the associated set of tags. Obviously, the user will only be patient enough to try a few tags and browse a limited number of resources within the set of resources associated with the tag (i.e. the set of resources listed according to p(r|t)). Then the goal is to select this set of tags such that the resulting set of resources are as diverse and relevant as possible. In other words, we must avoid repeating resources. To show the advantage of our method in such a scenerio, we exercise the following procedure. For each user in the dataset, we find the most relevant topic. Then, in this topic, we select a set of tags, either using the most probable tags or the proposed compact tag set and find the associated set of resources. Finally, the sets are measured using either set cardinality or the total probability mass of the set. These two values are shown in Table 9, where we select three tags and show 20 resources to the user. The p-values from the t-test is also shown in the last column. The results show a statistically significant improvement in categorisation efficiency using the compact tag set. 5. RELATED WORK The problem of meta-tag generation to enhance resource categorization is, to the best of our knowledge, novel. The most similar works in the literature consider the problem of finding representative words [8] or phrases [10] for multinomial word distributions. Lau et.al. [8] considers training a topics Topic 3 wiki/wikipedia Topic 8 generator/tools Topic 4 pdf/converter Topic 8 typography/fonts Support Vector Regression method to determine the order of words for topics. 
Their features include WordNet-based word relations; the coverage of words, in terms of mean conditional probability computed using Wikipedia word frequencies; and the distributional similarity score of Pantel and Lin [11]. The data set is prepared by manually labeling selected topics. They observed that many users disagreed on the most representative words in some of the topics, and consequently the method's performance on such topics is low. The supervised nature of this method prohibits its use in a web-based folksonomy service, since either online learning or frequent re-training of the topic models would require the perpetual collection of manually labeled data, which is unrealistic. In addition, using a single word to describe topics is generally insufficient, as observed in [10]. Mei et al. [10] propose using phrases to label topics. The phrases are obtained from the n-grams of the actual documents and are then scored with respect to relevance and coverage. Their paper defines coverage using the concept of redundancy; however, they choose to minimize the maximum redundancy between words instead of the sum of redundancies, as is done in this paper. Another criterion the authors employ is the discriminating capability of tags, that is, the relevance of a tag to the desired topic minus the sum of its relevance to the other topics. This criterion is meaningful given their goal. In the setting of this paper, however, taking discrimination into account is not desired, since, due to the personalized nature of LIM, some topics are expected to have similar tag distributions but different resource distributions. For example, LIM might provide two different topics on “movies” with similar tag distributions but leading to different movie resources.
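The contrast between the two redundancy criteria can be made concrete with a small greedy-selection sketch: penalizing the sum of pairwise redundancies (as in this paper) versus only the maximum redundancy (as in Mei et al. [10]). This is a simplified illustration rather than either paper's actual algorithm; the relevance scores and pairwise redundancy values are invented for the example.

```python
# Greedy tag selection under two different redundancy penalties. All
# numbers below are hypothetical; real scores would come from the model.

def greedy_select(candidates, relevance, redundancy, k, penalty):
    """Greedily pick k tags maximizing relevance minus a redundancy penalty.

    `penalty` aggregates the pairwise redundancies between a candidate
    and the already selected tags: `sum` mirrors this paper's criterion,
    `max` mirrors the criterion of Mei et al. [10].
    """
    selected = []
    while len(selected) < k:
        best = max(
            (t for t in candidates if t not in selected),
            key=lambda t: relevance[t]
            - (penalty(redundancy[t][s] for s in selected) if selected else 0.0),
        )
        selected.append(best)
    return selected

relevance = {"photo": 1.0, "recipes": 0.9, "image": 0.8, "flickr": 0.8}
redundancy = {
    "photo":   {"recipes": 0.0, "image": 0.4, "flickr": 0.7},
    "recipes": {"photo": 0.0, "image": 0.4, "flickr": 0.0},
    "image":   {"photo": 0.4, "recipes": 0.4, "flickr": 0.5},
    "flickr":  {"photo": 0.7, "recipes": 0.0, "image": 0.5},
}
tags = list(relevance)

# "image" is mildly redundant with both chosen tags, "flickr" strongly
# redundant with one; the two criteria therefore pick different third tags.
print(greedy_select(tags, relevance, redundancy, 3, sum))  # ['photo', 'recipes', 'flickr']
print(greedy_select(tags, relevance, redundancy, 3, max))  # ['photo', 'recipes', 'image']
```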
Finally, they assume that the underlying documents are readily available; in our case, however, the underlying resources include many non-textual websites, and acquiring the text data and constructing all possible n-grams would be inefficient even for the text-based websites. The proposed method employs only the available bag-of-words data in labeling topics.

6. CONCLUSIONS
With the rapid growth of social tagging systems, the so-called folksonomies, organizing the vast amounts of on-line resources on these systems according to their topics has become a critical issue. Another important characteristic of folksonomy systems is that users can choose any keyword as a tag and can assign one or more tags to a resource, resulting in a wide variety of tags that can be redundant and ambiguous. Although statistical topic modeling has been well studied, there are no existing methods for automatically generating personalized topic labels in folksonomy systems. In this paper, a novel method for personalized resource categorization and labeling in folksonomies is introduced. The method is capable of creating a personalized taxonomy of resources with meaningful labels, called meta-tags, so that users can easily locate resources of interest.
The resulting meta-tags offer a balance between generality and specificity. The qualitative results from randomly drawn topics and the quantitative results from human judgements indicate the usefulness of the proposed method.

Table 8: Two best and worst scored topics

Best 1 (photo/flickr): www.flickr.com, www.tineye.com, www.compfight.com, compfight.com, labs.ideeinc.com, bighugelabs.com/flickr, tineye.com, photobucket.com, taggalaxy.de, www.airtightinteractive.com, hugin.sourceforge.net, wylio.com, www.cooliris.com, www.flickriver.com, min.us, www.dropmocks.com, www.smugmug.com, labs.systemone.at, labs.ideeinc.com, photosynth.net

Best 2 (food/recipes): www.tastespotting.com, www.epicurious.com, www.supercook.com, smittenkitchen.com, www.seriouseats.com, www.cookingforengineers.com, allrecipes.com, www.stilltasty.com, www.101cookbooks.com, www.chow.com, www.foodnetwork.com, www.opensourcefood.com, www.cookingbynumbers.com, www.jamieoliver.com, www.thekitchn.com, thisiswhyyourefat.com, foodgawker.com, www.101cookbooks.com, www.recipezaar.com, mingmakescupcakes.yolasite.com

Worst 1 (programming/python): projecteuler.net, gettingreal.37signals.com, stackoverflow.com, code.google.com/edu, mitpress.mit.edu/sicp, samizdat.mines.edu/howto, diveintohtml5.org, www.e-booksdirectory.com, diveintopython.org, www.freetechbooks.com, mitpress.mit.edu/sicp, www.python.org, www.indiangeek.net, jqfundamentals.com, www.gigamonkeys.com/book, detexify.kirelabs.org, blog.objectmentor.com, eloquentjavascript.net, developer.mozilla.org, learnyouahaskell.com

Worst 2 (health/fitness): www.nutritiondata.com, www.fitbit.com, hundredpushups.com, www.coolrunning.com, www.gmap-pedometer.com, preyproject.com, hundredpushups.com, www.webmd.com, www.mapmyrun.com, www.informationisbeautiful.net, www.fitday.com, www.thedailyplate.com, www.mayoclinic.com, www.patientslikeme.com, adeona.cs.washington.edu, www.lumosity.com, www.successwithtracywhite.com, www.walkscore.com, www.bikely.com, www.boston.com

7. ACKNOWLEDGMENTS
This project was partially funded by TUBITAK under project ID 110E027.

8. REFERENCES
[1] M. E. Alper and S. Gunduz-Oguducu. Personalized recommendation in folksonomies using a joint probabilistic model of users, resources and tags. Submitted for publication, 2012.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[3] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[4] A. Haghighi and L. Vanderwende. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies, NAACL '09, pages 362–370, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[5] T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence, UAI '99, pages 289–296, 1999.
[6] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recommendations in folksonomies. In PKDD 2007, pages 506–514, Berlin, Heidelberg, 2007. Springer-Verlag.
[7] R. Krestel, P. Fankhauser, and W. Nejdl. Latent dirichlet allocation for tag recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, RecSys '09, pages 61–68, New York, NY, USA, 2009. ACM.
[8] J. H. Lau, D. Newman, S. Karimi, and T. Baldwin. Best topic word selection for topic labelling. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 605–613, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[9] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 171–180, New York, NY, USA, 2007. ACM.
[10] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, pages 490–499, New York, NY, USA, 2007. ACM.
[11] P. Pantel and D. Lin. Discovering word senses from text. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 613–619, New York, NY, USA, 2002. ACM.
[12] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, Aug. 2005.
[13] J. Tang, H.-f. Leung, Q. Luo, D. Chen, and J. Gong. Towards ontology learning from folksonomies. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI '09, pages 2089–2094, San Francisco, CA, USA, 2009. Morgan Kaufmann Publishers Inc.
[14] H. M. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Proceedings of NIPS, 2009.
[15] X. Wu, L. Zhang, and Y. Yu. Exploring social annotations for the semantic web. In Proceedings of the 15th International Conference on World Wide Web, WWW '06, pages 417–426, New York, NY, USA, 2006. ACM.