Learning to Rank for Information Retrieval
Transcription
Learning to Rank for Information Retrieval Tie-Yan Liu Lead Researcher Microsoft Research Asia Speaker • Tie-Yan Liu – Lead Researcher, Microsoft Research Asia – Co-author of 70+ papers in SIGIR, WWW, NIPS, ICML, KDD, etc. – Co-author of SIGIR Best Student Paper (2008) and JVCIR Most Cited Paper Award (2004~2006). – Author of two books on learning to rank. – Associate editor, ACM Transactions on Information System; Editorial board member, IR Journal. – PC Chair of RIAO 2010, Area Chair of SIGIR (2008~2011), Track Chair of WWW (2011). 2011/5/9 Tie-Yan Liu, Copyright reserved 2 Outline • • • • • • Introduction Major Approaches to Learning to Rank Analysis of the Approaches Benchmark Evaluation of the Approaches Statistical Learning Theory for Ranking Summary and Outlook 2011/5/9 Tie-Yan Liu, Copyright reserved 3 Ranking in Information Retrieval Overwhelmed by Flood of Information 2011/5/9 Tie-Yan Liu, Copyright reserved 5 Facts about the Web • According to www.worldwidewebsize.com, major search engines index at least tens of billions of web pages. • CUIL.com claimed that they index 127 billion pages. • Google claimed that they know 1 trillion pages. 2011/5/9 Tie-Yan Liu, Copyright reserved 6 Ranking is Essential Indexed Document Repository D d j Ranking model Query 2011/5/9 Tie-Yan Liu, Copyright reserved 7 Conventional Ranking Models • Query-dependent – – – – Boolean model, extended Boolean model, etc. Vector space model, latent semantic indexing (LSI), etc. BM25 model, statistical language model, etc. Span based model, distance aggregation model, etc. • Query-independent – PageRank, TrustRank, BrowseRank, Toolbar Clicks, etc. 2011/5/9 Tie-Yan Liu, Copyright reserved 8 Evaluation of Ranking Models • Construct a test set containing a large number of (randomly sampled) queries and their associated documents, and relevance judgment for each querydocument pair. • Evaluate the ranking result given by a model for a particular query in the test set with an evaluation measure. • Use the average measure over the entire test set to represent the overall ranking performance. 2011/5/9 Tie-Yan Liu, Copyright reserved 9 Collecting Queries and Documents • Queries are usually randomly sampled from query logs. • TREC’s Pooling strategy for sampling documents – A pool of possibly relevant documents is created by taking a sample of documents selected by the various participating systems. – Top 100 documents retrieved in each submitted run for a given query are selected and merged into the pool for human assessment. – On average, an assessor judges the relevance of approximately 1500 documents per query. 2011/5/9 Tie-Yan Liu, Copyright reserved 10 Relevance Judgment • Degree of relevance lk – Binary: relevant vs. irrelevant – Multiple ordered categories: Perfect > Excellent > Good > Fair > Bad • Pairwise preference lu ,v – Document A is more relevant than document B • Total order l • Documents are ranked as {A,B,C,..} according to their relevance 2011/5/9 Tie-Yan Liu, Copyright reserved 11 Evaluation Measure - MAP • Precision at position k for query q: #{relevant documents in top k results} k • Average precision for query q: k P @ k lk AP # {relevant documents} P@k 1 1 2 3 AP 0.76 3 1 3 5 • MAP: averaged over all queries. 2011/5/9 Tie-Yan Liu, Copyright reserved 12 Evaluation Measure - NDCG • NDCG at position n for query q: NDCG @ k Z k G 1 ( j ) ( j ) k j 1 Normalization Cumulating Gain G ( j ) 2 l 1 Position discount 1 ( j ) 1, ( j ) 1 / log( j 1) • Averaged over all queries. 
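To make the two evaluation measures concrete, here is a minimal Python sketch (not part of the tutorial itself) that computes P@k, AP, and NDCG@k from a ranked list of relevance labels. It assumes binary labels for P@k and AP, graded labels for NDCG, the gain 2^l − 1, and a base-2 logarithm in the position discount, since the slide leaves the log base unspecified.

```python
import math

def precision_at_k(labels, k):
    """P@k: fraction of relevant documents among the top k (binary labels)."""
    return sum(1 for l in labels[:k] if l > 0) / k

def average_precision(labels):
    """AP: average of P@k over the positions k that hold a relevant document."""
    num_rel = sum(1 for l in labels if l > 0)
    if num_rel == 0:
        return 0.0
    return sum(precision_at_k(labels, k)
               for k, l in enumerate(labels, start=1) if l > 0) / num_rel

def dcg_at_k(labels, k):
    """DCG@k with gain 2^l - 1 and discount 1/log2(j+1)."""
    return sum((2 ** l - 1) / math.log2(j + 1)
               for j, l in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k):
    """NDCG@k: DCG@k divided by the DCG@k of the ideal (label-sorted) ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# Relevant documents at positions 1, 3, and 5, as in the AP example on the slide:
print(round(average_precision([1, 0, 1, 0, 1]), 2))   # 0.76
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 3))
```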
1 ( j ) The document ranked at the j-th position in π. 2011/5/9 ( j) Tie-Yan Liu, Copyright reserved The rank position of document j in π. 13 Evaluation Measure - Summary • Query-level: every query contributes equally to the measure. – Computed on documents associated with the same query. – Bounded for each query. – Averaged over all test queries. • Position-based: rank position is explicitly used. – Top-ranked objects are more important. – Relative order vs. relevance score of each document. – Non-continuous and non-differentiable w.r.t. scores 2011/5/9 Tie-Yan Liu, Copyright reserved 14 Learning to Rank Challenges of Conventional Models • Manual parameter tuning is usually difficult, especially when there are many parameters and the evaluation measures are non-smooth. • Manual parameter tuning sometimes leads to overfitting. • It is non-trivial to combine the large number of models proposed in the literature to obtain an even more effective model. 2011/5/9 Tie-Yan Liu, Copyright reserved 16 Machine Learning Can Help • Machine learning is an effective tool – To automatically tune parameters. – To combine multiple evidences. – To avoid over-fitting (by means of regularization, etc.) • “Learning to Rank” – In general, those methods that use machine learning technologies to solve the problem of ranking can be named as “learning to rank” methods. 2011/5/9 Tie-Yan Liu, Copyright reserved 17 Learning to Rank • In most recent works, learning to rank is defined as having the following two properties: – Feature based; – Discriminative training. 2011/5/9 Tie-Yan Liu, Copyright reserved 18 Feature Based • Documents represented by feature vectors. – Features are extracted for each query-document pair. – Even if a feature is the output of an existing retrieval model, one assumes that the parameter in the model is fixed, and only learns the optimal way of combining these features. • The capability of combining a large number of features is very promising. – It can easily incorporate any new progress on retrieval model, by including the output of the model as a feature. 2011/5/9 Tie-Yan Liu, Copyright reserved 19 Discriminative Training • An automatic learning process based on the training data, • Contains the objects under – With the four pillars of discriminative learning • • • • Input space, Output space, Hypothesis space, Loss function. investigation. • Usually objects are represented by feature vectors, extracted according to different applications – Not even necessary to have a probabilistic explanation. 2011/5/9 Tie-Yan Liu, Copyright reserved 20 Discriminative Training • An automatic learning process based on the training data, are two relatedlearning but different – With the four pillars of• There discriminative • • • • Input space, Output space, Hypothesis space, Loss function. – Not even necessary explanation. 2011/5/9 definitions of the output space. - Output space of the task. - Output space to facilitate learning. • Example: using regression to solve classification - Output space of the task: {+1, -1} - Output space to facilitate learning: toreal have a probabilistic numbers Tie-Yan Liu, Copyright reserved 21 Discriminative Training • An automatic learning process based on the training data, – With the four pillars of discriminative learning • • • • Input space, Output space, Hypothesis space, Loss function. – Not even necessary explanation. 2011/5/9 • Defines the class of functions mapping the input space to the output space. 
• The functions operate on the feature vectors of the input objects, and make predictions according to the format of tothe have a probabilistic output space. Tie-Yan Liu, Copyright reserved 22 Discriminative Training • An automatic learning process based on the • The loss function measures to what training data, degree the prediction generated by the is in accordance with the – With the four pillars ofhypothesis discriminative learning ground truth label. • • • • Input space, Output space, Hypothesis space, Loss function. – Not even necessary to explanation. 2011/5/9 - For example, widely used loss functions for classification include the exponential loss, the hinge loss, and the logistic loss. - With the loss function, an empirical risk can be defined on the training set, have probabilistic and theaoptimal hypothesis is usually (but not always) learned by means of empirical risk minimization. Tie-Yan Liu, Copyright reserved 23 Discriminative Training • An automatic learning process based on the training data, • This is highly demanding for real search engines, – Everyday these search engines will receive a lot of user feedback and usage logs – It is very important to automatically learn from the feedback and constantly improve the ranking mechanism. 2011/5/9 Tie-Yan Liu, Copyright reserved 24 Learning to Rank Framework 2011/5/9 Tie-Yan Liu, Copyright reserved 25 Wide Applications of Learning to Rank • • • • • • • Document retrieval Question answering Multimedia retrieval Text summarization Online advertising Collaborative filtering Machine translation 2011/5/9 Tie-Yan Liu, Copyright reserved 26 Recent Trends on Learning to Rank • Successfully applied to web search • Hot topic in IR and ML – Hundreds of publications at SIGIR, ICML, NIPS, etc. – Multiple sessions at SIGIR every year – Several workshops at SIGIR, ICML, NIPS, etc. – Several tutorials at SIGIR, WWW, ACL, etc. – Special issue at IRJ and JMLR – Yahoo! Learning to rank challenges – …… 2011/5/9 Tie-Yan Liu, Copyright reserved 27 Outline • • • • • • Introduction Major Approaches to Learning to Rank Analysis of the Approaches Benchmark Evaluation of the Approaches Statistical Learning Theory for Ranking Summary and Outlook 2011/5/9 Tie-Yan Liu, Copyright reserved 28 Learning to Rank Algorithms Least Square Retrieval Function Query refinement (WWW 2008) (TOIS 1989) Nested Ranker (SIGIR 2006) SVM-MAP (SIGIR 2007) ListNet (ICML 2007) Pranking (NIPS 2002) MPRank (ICML 2007) LambdaRank (NIPS 2006) Frank (SIGIR 2007) Learning to retrieval info (SCC 1995) MHR (SIGIR 2007) RankBoost (JMLR 2003) LDM (SIGIR 2005) Large margin ranker (NIPS 2002) IRSVM (SIGIR 2006) RankNet (ICML 2005) Ranking SVM (ICANN 1999) Discriminative model for IR (SIGIR 2004) SVM Structure (JMLR 2005) OAP-BPM (ICML 2003) Subset Ranking (COLT 2006) GPRank (LR4IR 2007) QBRank (NIPS 2007) GBRank (SIGIR 2007) Constraint Ordinal Regression (ICML 2005)McRank (NIPS 2007) AdaRank (SIGIR 2007) RankCosine (IP&M 2007) Relational ranking (WWW 2008) 2011/5/9 CCA (SIGIR 2007) SoftRank (LR4IR 2007) ListMLE (ICML 2008) Supervised Rank Aggregation (WWW 2007) Learning to order things (NIPS 1998) Round robin ranking (ECML 2003) Tie-Yan Liu, Copyright reserved 29 Key Questions to Answer • To what respect are these learning to rank algorithms similar and in which aspects do they differ? What are the strengths and weaknesses of each algorithm? • Empirically speaking, which of those many learning to rank algorithms perform the best? 
• What are the unique theoretical issues for ranking that should be investigated? • Are there many remaining issues regarding learning to rank to study in the future? 2011/5/9 Tie-Yan Liu, Copyright reserved 30 Categorization of the Algorithms Category Algorithms Pointwise Approach Regression: Least Square Retrieval Function (TOIS 1989), Regression Tree for Ordinal Class Prediction (Fundamenta Informaticae, 2000), Subset Ranking using Regression (COLT 2006), … Classification: Discriminative model for IR (SIGIR 2004), McRank (NIPS 2007), … Ordinal regression: Pranking (NIPS 2002), OAP-BPM (EMCL 2003), Ranking with Large Margin Principles (NIPS 2002), Constraint Ordinal Regression (ICML 2005), … Pairwise Approach Learning to Retrieve Information (SCC 1995), Learning to Order Things (NIPS 1998), Ranking SVM (ICANN 1999), RankBoost (JMLR 2003), LDM (SIGIR 2005), RankNet (ICML 2005), Frank (SIGIR 2007), MHR(SIGIR 2007), GBRank (SIGIR 2007), QBRank (NIPS 2007), MPRank (ICML 2007), IRSVM (SIGIR 2006), LambdaRank (NIPS 2006),… Listwise Approach Non-measure specific: ListNet (ICML 2007), ListMLE (ICML 2008), BoltzRank (ICML 2009) … Measure-specific: AdaRank (SIGIR 2007), SVM-MAP (SIGIR 2007), SoftRank (LR4IR 2007), RankGP (LR4IR 2007), … 2011/5/9 Tie-Yan Liu, Copyright reserved 31 The Poinitwise Approach The Pointwise Approach • Reduce ranking to – Regression • Subset Ranking – Classification • Discriminative model for IR • MCRank – Ordinal regression q • PRanking • Ranking with large margin principle x1 x2 x m ( x , y ), ( x , y ),..., ( x 1 2011/5/9 Tie-Yan Liu, Copyright reserved 1 2 2 m , ym ) 33 Subset Ranking using Regression (D. Cossock and T. Zhang, COLT 2006) • Regard relevance degree as real number, and use regression to learn the ranking function. L( f ; x j , y j ) f ( x j ) y j 2 Regression-based 2011/5/9 Tie-Yan Liu, Copyright reserved 34 McRank (P. Li, et al. NIPS 2007) • Multi-class classification is used to learn the ranking function. – For document x j , the output of the classifier is ŷ j . – Loss function: surrogate function of I{ yˆ j y j } • Ranking is produced by combining the outputs of the classifiers. K pˆ j ,k P( yˆ j k ), f ( x j ) pˆ j ,k k k 1 Classification-based 2011/5/9 Tie-Yan Liu, Copyright reserved 35 Pranking with Ranking (K. Krammer and Y. Singer, NIPS 2002) • Ranking is solved by ordinal regression x w x Ordinal regression based 2011/5/9 Tie-Yan Liu, Copyright reserved 36 Problem with Pointwise Approach • Properties of IR evaluation measures have not been well considered. – The fact is ignored that some documents are associated with the same query and some are not. • When the number of associated documents varies largely for different queries, the overall loss function will be dominated by those queries with a large number of documents. – The position of documents in the ranked list is invisible to the loss functions. • The pointwise loss function may unconsciously emphasize too much those unimportant documents (which are ranked low in the final results). 2011/5/9 Tie-Yan Liu, Copyright reserved 37 The Pairwise Approach The Pairwise Approach • Reduce ranking to pairwise classification – – – – – RankNet and Frank RankBoost Ranking SVM MHR IR-SVM q x1 x2 x m x1 , x2 ,1, x2 , x1 ,1,..., x 2 , xm ,1, xm , x2 ,1 2011/5/9 Tie-Yan Liu, Copyright reserved 39 RankNet (C. Burges, et al. 
ICML 2005) • Target probability: Pu ,v 1, if yu ,v 1; Pu ,v 0, otherwise • Modeled probability: Pu ,v ( f ) exp( f ( xu ) f ( xv )) 1 exp( f ( xu ) f ( xv )) • Cross entropy as the loss function l ( f ; xu , xv , yu ,v ) Pu ,v log Pu ,v ( f ) (1 Pu ,v ) log(1 Pu ,v ( f )) • Use neural network as model, and gradient descent as algorithm, to optimize the cross-entropy loss. 2011/5/9 Tie-Yan Liu, Copyright reserved 40 FRank (M. Tsai, et al. SIGIR 2007) • Similar setting to that of RankNet • New loss function: fidelity L( f ; xu , xv , yu ,v ) 1 Pu ,v Pu ,v ( f ) (1 Pu ,v ) (1 Pu ,v ( f )) 2011/5/9 Tie-Yan Liu, Copyright reserved 41 RankBoost (Y. Freund, et al. JMLR 2003) • Exponential loss L( f ; xu , xv , yu ,v ) exp( yu,v ( f ( xu ) f ( xv ))) 2011/5/9 Tie-Yan Liu, Copyright reserved 42 RankBoost • Due to the analogy to AdaBoost, RankBoost inherits Weak learnermany that can nice properties, such as the ability in feature selection, classify important pairs more correctly will have convergence in training, and certain generalization abilities. larger weight α Hypothesis is the combination of weak learners Emphasize those hard pairs by setting their distributions. 2011/5/9 Tie-Yan Liu, Copyright reserved 43 Ranking SVM (R. Herbrich, et al. , Advances in Large Margin Classifiers, 2000; T. Joachims, KDD 2002) • Hinge loss L( f ; xu , xv , yu,v ) 1 yu,v f ( xu ) f ( xv ) 2011/5/9 Tie-Yan Liu, Copyright reserved 44 Ranking SVM n 1 min || w || 2 C u(,iv) 2 i 1 u ,v: yu( i,v) 1 wT xu(i ) xv(i ) 1 u(,iv) , if yu(i,)v 1. uv(i ) 0, i 1,..., n. xu-xv as positive instance of learning Use SVM to perform binary classification on these instances, to learn model parameter w • Since Ranking SVM is well rooted in the framework of SVM, it inherits nice properties of SVM. For example, with the help of margin maximization, Ranking SVM can have a good generalization ability. Kernel tricks can also be applied to Ranking SVM, so as to handle complex non-linear problems. 2011/5/9 Tie-Yan Liu, Copyright reserved 45 Problems with Pairwise Approach • When we are given the relevance judgment in terms of multiple ordered categories, converting it to pairwise preference will lead to the missing of the information about the finer granularity in the relevance judgment • The distribution of document pair number is more skewed than the distribution of document number, with respect to different queries. • The pairwise approach is more sensitive to noisy label than the pointwise approach. • Most of the pairwise ranking algorithms have not considered the position in the final ranking results. 2011/5/9 Tie-Yan Liu, Copyright reserved 46 Improved Algorithms • Multi-hyperplane ranker (MHR) • IR-SVM • LambdaRank 2011/5/9 Tie-Yan Liu, Copyright reserved 47 Multi-Hyperplane Ranker (T. Qin, et al. SIGIR 2007) • If we have K categories, then will have K(K-1)/2 pairwise preferences between two labels. – Learn a model fk,l for the pair of categories k and l, using any previous method (for example Ranking SVM). • Testing – Rank Aggregation f ( x ) k ,l f k , l ( x ) k ,l 2011/5/9 Tie-Yan Liu, Copyright reserved 48 IR-SVM (J. Xu, et al. WWW 2007) • Introduce query-level normalization to the pairwise loss function. – Use the number of document pairs to normalize the pairwise loss. • IRSVM (Y. Cao, J. Xu, et al. SIGIR 2006; T. Qin, T.-Y. Liu, et al. IPM 2007) N 1 1 min || w ||2 C ~ ( i ) uv( i ) 2 i 1 m u ,v Query-level normalizer Significant improvement has been observed by using the query-level normalizer. 
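To summarize the pairwise surrogate losses above in one place, the following sketch (an illustration, not code from any of the cited papers) evaluates them for a single scored document pair. The only quantity that matters is the score difference f(x_u) − f(x_v); IR-SVM's query-level normalization then amounts to dividing each pair's loss by the number of pairs contributed by its query. No numerical safeguards are included, so moderate score differences are assumed.

```python
import math

def pairwise_losses(f_u, f_v, y_uv=1):
    """Surrogate losses for one pair (x_u, x_v); y_uv = +1 means x_u should
    rank above x_v, and f_u, f_v are the scores produced by the ranking model."""
    diff = f_u - f_v
    p_model = math.exp(diff) / (1.0 + math.exp(diff))    # RankNet's modeled P_uv(f)
    p_target = 1.0 if y_uv > 0 else 0.0                  # RankNet's target probability
    cross_entropy = -(p_target * math.log(p_model)
                      + (1 - p_target) * math.log(1 - p_model))    # RankNet
    fidelity = 1 - (math.sqrt(p_target * p_model)
                    + math.sqrt((1 - p_target) * (1 - p_model)))   # FRank
    exponential = math.exp(-y_uv * diff)                           # RankBoost
    hinge = max(0.0, 1.0 - y_uv * diff)                            # Ranking SVM
    return {'RankNet': cross_entropy, 'FRank': fidelity,
            'RankBoost': exponential, 'RankingSVM': hinge}

print(pairwise_losses(f_u=2.0, f_v=0.5))
```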
2011/5/9 Tie-Yan Liu, Copyright reserved 49 LambdaRank (C. Burges, NIPS 2006) • Basic idea – The evaluation measures (which are position based) are directly used to define the gradient with respect to each document pair in the training process. • Why is it feasible to directly define the gradient? 2011/5/9 Tie-Yan Liu, Copyright reserved 50 LambdaRank • Lambda function – The gradient is directly defined as a lambda function, without caring what the corresponding loss function really looks like. – For each pair of documents xu and xv, in each round of optimization, their scores are updated by +λ and − λ. 2011/5/9 Tie-Yan Liu, Copyright reserved 51 The Listwise Approach Listwise Approach • Measure-specific loss – Optimize the IR evaluation measures directly • Non measure-specific loss – Optimize a loss function defined on all documents associated with a query, according to the unique properties of ranking 2011/5/9 Tie-Yan Liu, Copyright reserved 53 Measure-specific Listwise Ranking • Also known as “Direct Optimization of IR Measures”. • It is natural to directly optimize what is used to evaluate the ranking results. • However, it is non-trivial. – Evaluation measures such as NDCG are non-continuous and nondifferentiable since they depend on the rank positions. (S. Robertson and H. Zaragoza, Information Retrieval 2007). – It is challenging to optimize such objective functions, since most optimization techniques in the literature were developed to handle continuous and differentiable cases. 2011/5/9 Tie-Yan Liu, Copyright reserved 54 Tackle the Challenges • Approximate the objective – Soften (approximate) the evaluation measure so as to make it smooth and differentiable (SoftRank) • Bound the objective – Optimize a smooth and differentiable upper bound of the evaluation measure (SVM-MAP) • Optimize the non-smooth objective directly – Use IR measure to update the distribution in Boosting (AdaRank) – Use genetic programming (RankGP) 2011/5/9 Tie-Yan Liu, Copyright reserved 55 SoftRank (M. Taylor, et al. LR4IR 2007/WSDM 2008) • Key Idea: – Avoid sorting by treating scores as random variables • Steps – Construct score distribution – Map score distribution to rank distribution – Compute expected NDCG (SoftNDCG) which is smooth and differentiable. – Optimize SoftNDCG by gradient descent. 2011/5/9 Tie-Yan Liu, Copyright reserved 56 Construct Score Distribution • Considering the score for each document sj as random variable, by using Gaussian. p(s j ) Ν (s j | f ( x j ), s2 ) 2011/5/9 Tie-Yan Liu, Copyright reserved 57 From Scores to Rank Distribution • Probability of du being ranked before dv uv Ν (s | f ( xu ) f ( xv ),2 s2 )ds 0 • Rank Distribution p (ju ) (r ) p (ju 1) (r 1) uj p (ju 1) (r )(1 uj ) Define the probability of a document being ranked at a particular position 2011/5/9 Tie-Yan Liu, Copyright reserved 58 Soft NDCG • Expected NDCG as the objective, or in the language of loss function, m m 1 L( f ; x, y ) 1 E[ NDCG] 1 Z m (2 1)r p j ( r ) j 1 yj r 0 – Only Pj(r) contain the model parameter w, and it is differential with respect to w. • A neural network is used as the model, and gradient descent is used as the algorithm to optimize the above loss function. 2011/5/9 Tie-Yan Liu, Copyright reserved 59 SVM-MAP (Y. Yue, et al. SIGIR 2007) • Use the framework of Structured SVM for optimization – I. Tsochantaridis, et al. ICML 2004; T. Joachims, ICML 2005. • Sum of slacks ∑ξ upper bounds ∑ (1-AP). 
1 2 C min w 2 N N (i ) i 1 y ' y (i ) : wT (y (i ) , x (i ) ) wT (y ' , x (i ) ) (y (i ) , y ' ) (i ) ( y ' , x) ( y' y' u u ,v: yu 1, yv 0 ( y, x ) 2011/5/9 ( x v u u ,v: yu 1, yv 0 )( xu xv ) (y (i ) , y' ) 1 AP(y' ) xv ) Tie-Yan Liu, Copyright reserved 60 Challenges in SVM-MAP • For AP, the true ranking is a ranking where the relevant documents are all ranked in the front, e.g., • An incorrect ranking would be any other ranking, e.g., • Exponential number of rankings, thus an exponential number of constraints! • Solution: use a working set only containing the most violated constraints. 2011/5/9 Tie-Yan Liu, Copyright reserved 61 Finding the Most Violated Constraint • Most violated y’: arg max ( y, y' ) wT ( y' , x) y' • If the relevance at each position is fixed, Δ(y,y’) will be the same. • Strategy (with a complexity of O(mlogm)) – Sort the relevant and irrelevant documents and form a perfect ranking. – Start with the perfect ranking and swap two adjacent relevant and irrelevant documents, so as to find the optimal interleaving of the two sorted lists. 2011/5/9 Tie-Yan Liu, Copyright reserved 62 AdaRank (J. Xu and H. Li. SIGIR 2007) Weak learner that can • Embed IR evaluation measure in the distribution rank important queries update of Boosting more correctly will have larger weight α Emphasize those hard queries by setting their distributions. 2011/5/9 Tie-Yan Liu, Copyright reserved 63 RankGP (J. Yeh, et al. LR4IR 2007) • Define ranking function as a tree • Learning – Single-population GP, which has been widely used to optimize non-smoothing non-differentiable objectives. • Evolution mechanism – Crossover, mutation, reproduction, tournament selection. • Fitness function – IR evaluation measure (MAP). 2011/5/9 Tie-Yan Liu, Copyright reserved 64 Non Measure-specific Listwise Ranking • Defining listwise loss functions based on the understanding on the unique properties of ranking for IR. • Representative Algorithms – ListNet – ListMLE – BoltzRank 2011/5/9 Tie-Yan Liu, Copyright reserved 65 Ranking Loss is Non-trivial! • An example: – function f: f(A)=3, f(B)=0, f(C)=1 – function h: h(A)=4, h(B)=6, h(C)=3 – ground truth g: g(A)=6, g(B)=4, g(C)=3 ACB BAC ABC • Question: which function is closer to ground truth? – Based on pointwise similarity: sim(f,g) < sim(g,h). – Based on pairwise similarity: sim(f,g) = sim(g,h) – Based on cosine similarity between score vectors? f: <3,0,1> g:<6,4,3> h:<4,6,3> Closer! sim(f,g)=0.85 sim(g,h)=0.93 • However, according to NDCG, f should be closer to g! 2011/5/9 Tie-Yan Liu, Copyright reserved 66 ListNet (Z. Cao, et al. ICML 2007) • Ranked list ↔ Permutation probability distribution • More informative representation for ranked list: permutation and ranked list has 1-1 correspondence. 2011/5/9 Tie-Yan Liu, Copyright reserved 67 Defining Permutation Probability • Probability of a permutation is defined with Luce Model m P( | f ) j 1 ( f ( x ( j ) )) m k j ( f ( x ( k ) )) • Example: PABC | f f ( A) f (B) f (C) f (A) f (B) f (A) f (B) f (C) f (C) P(A ranked No.1) P(B ranked No.2 | A ranked No.1) = P(B ranked No.1)/(1- P(A ranked No.1)) P(C ranked No.3 | A ranked No.1, B ranked No.2) 2011/5/9 Tie-Yan Liu, Copyright reserved 68 Properties of Luce Model • Continuous, differentiable, and concave w.r.t. score s. 
• Suppose f(A) = 3, f(B)=0, f(C)=1, P(ACB) is largest and P(BCA) is smallest; for the ranking based on f: ACB, swap any two, probability will decrease, e.g., P(ACB) > P(ABC) • Translation invariant, when (s) exp( s) • Scale invariant, when (s) a s 2011/5/9 Tie-Yan Liu, Copyright reserved 69 Distance between Ranked Lists φ = exp Using KL-divergence to measure difference between distributions dis(f,g) = 0.46 dis(g,h) = 2.56 2011/5/9 Tie-Yan Liu, Copyright reserved 70 K-L Divergence Loss • Loss function = KL-divergence between two permutation probability distributions (φ = exp) L( f ; x, y ) DPy || P | f (x) Probability distribution defined by the ground truth Probability distribution defined by the model output • Model = Neural Network • Algorithm = Gradient Descent 2011/5/9 Tie-Yan Liu, Copyright reserved 71 Limitations of ListNet • Too complex to use in practice: O(m⋅m!) • By using the top-k Luce model, the complexity of ListNet can be reduced to polynomial order of m. • However, – When k is large, the complexity is still high! – When k is small, the information loss in the top-k Luce model will affect the effectiveness of ListNet. 2011/5/9 Tie-Yan Liu, Copyright reserved 72 Solution • Given the scores outputted by the ranking model, compute the likelihood of a ground truth permutation, and maximize it to learn the model parameter. • The complexity is reduced from O(m⋅m!) to O(m). 2011/5/9 Tie-Yan Liu, Copyright reserved 73 ListMLE (F. Xia, et al. ICML 2008) • Likelihood Loss L( f ; x, y ) log P( y | ( f (x))) • Model = Neural Network • Algorithm = Stochastic Gradient Descent 2011/5/9 Tie-Yan Liu, Copyright reserved 74 Extension of ListMLE • Dealing with other types of ground truth labels using the concept of “equivalent permutation set”. – For relevance degree y { y | u v, if l 1 (u ) l 1 (u ) } y y – For pairwise preference y { y | u v, if l 1 (u ), 1 (u ) 1} y y L( f , x, y ) min log P y | f (w, x) y y 2011/5/9 Tie-Yan Liu, Copyright reserved 75 BoltzRank (M. Volkovs and R. Zemel, ICML 2009) • BoltzRank is also based on permutation probability models. • BoltzRank utilizes a scoring function composed of individual and pairwise potentials to define a conditional probability distribution, in the form of a Boltzmann distribution, over all permutations of documents retrieved for a given query. 2011/5/9 Tie-Yan Liu, Copyright reserved 76 BoltzRank • E(π|s) thus represents the lack of compatibility between the relative document orderings given by p and those given by s, with more positive energy indicating less compatibility. • Using the energy function, we can now define the conditional Boltzmann distribution over document permutations by exponentiating and normalizing 2011/5/9 Tie-Yan Liu, Copyright reserved 77 BoltzRank • After defining the permutation probability as above, one can follow the idea in ListMLE or ListNet to define the loss function. • It is also possible to extend the above idea to optimize evaluation measures. – For example, suppose the evaluation measure is NDCG, then its expected value with respect to all the possible permutations can be computed as follows. 2011/5/9 Tie-Yan Liu, Copyright reserved 78 Discussions • Advantages – Take all the documents associated with the same query as the learning instance – Rank position is visible to the loss function • Problems – Complexity issue. – The use of the position information is insufficient. 
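The following sketch (an illustration under stated assumptions, not the authors' implementation) shows the Luce-model permutation probability with φ = exp and the ListMLE likelihood loss built on it. A log-sum-exp trick is used for numerical stability, and the full ListNet permutation distribution is omitted because of the O(m·m!) cost discussed above.

```python
import numpy as np

def luce_log_prob(scores_in_rank_order):
    """Log P(pi | f) under the Luce model with phi = exp:
    prod_j exp(s_(j)) / sum_{k >= j} exp(s_(k)),
    where the scores are listed in the order given by the permutation pi."""
    s = np.asarray(scores_in_rank_order, dtype=float)
    log_prob = 0.0
    for j in range(len(s)):
        tail = s[j:]
        m = tail.max()
        log_prob += s[j] - (m + np.log(np.sum(np.exp(tail - m))))  # log-sum-exp
    return log_prob

def listmle_loss(model_scores, ground_truth_order):
    """ListMLE: negative log-likelihood of the ground-truth permutation
    under the Luce model induced by the model's scores."""
    ordered = [model_scores[i] for i in ground_truth_order]
    return -luce_log_prob(ordered)

# Example from the slides: f(A)=3, f(B)=0, f(C)=1 with ground truth A > B > C.
print(round(listmle_loss([3.0, 0.0, 1.0], ground_truth_order=[0, 1, 2]), 3))
```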
2011/5/9 Tie-Yan Liu, Copyright reserved 79 Outline • • • • • • Introduction Major Approaches to Learning to Rank Analysis of the Approaches Benchmark Evaluation of the Approaches Statistical Learning Theory for Ranking Summary and Outlook 2011/5/9 Tie-Yan Liu, Copyright reserved 80 Question • Different loss functions are used in different approaches, while the models learned with all the approaches are evaluated by IR measures. • What are the relationship between the loss functions and IR evaluation measures? 2011/5/9 Tie-Yan Liu, Copyright reserved 81 The Pointwise Approach • Many pointwise loss functions can upper bound (1-NDCG). • Minimization of the pointwise loss functions can lead to the maximization of NDCG. 2011/5/9 Tie-Yan Liu, Copyright reserved 82 The Pointwise Approach • For regression based algorithms (D. Cossock and T. Zhang, COLT 2006) 1 1 NDCG( f ) Zm 1/ 2 j j 1 m 1/ f ( x j ) y j j 1 m • For multi-class classification based algorithms (P. Li, et al. NIPS 2007) 15 1 NDCG( f ) Zm 2011/5/9 2 m m m 2 2 j m jm I{ y j yˆ j } j 1 j 1 j 1 Tie-Yan Liu, Copyright reserved 83 The Pointwise Approach • The minimization of the losses is only the sufficient condition of minimizing (1-NDCG). xi , yi x1 ,4 x , 3 2 x ,2 3 x ,1 4 xi , f ( xi ) Z m 21.4 1 NDCG( f ) 0 m I j 1 { y j f ( x j )} 4 x1 ,3 x , 2 2 x ,1 3 x ,0 4 2 2 m m m m 15 1 1 2 m I{ y j f ( x j )} 1.15 1 Z m j 1 log( j 1) log( j 1 ) j 1 j 1 2011/5/9 Tie-Yan Liu, Copyright reserved 84 Pairwise and Non-measure Specific Listwise Approaches (W. Chen, et al. NIPS 2009) • Many pairwise and non-measure specific listwise loss functions are surrogate functions, and also upper bounds of a quantity named the “essential loss”. • The essential loss is an upper bound of (1-NDCG) and (1-MAP). • The minimization of these surrogate loss functions can lead to the maximization of NDCG and MAP. 2011/5/9 Tie-Yan Liu, Copyright reserved 85 Essential Loss for Ranking • Model ranking as a sequence of classifications Ground truth permutation: Prediction of the ranking function: y A B C D 2011/5/9 y B A B C D D C y {A B C D} {B A D C} y B C D D C D C Classifier √ Tie-Yan Liu, Copyright reserved 86 Essential Loss vs. Ranking Measures 1) Both (1-NDCG) and (1-MAP) are upper bounded by the essential loss. 2) The zero value of the essential loss is a necessary and sufficient condition for the zero values of (1-NDCG) and (1-MAP). 2011/5/9 Tie-Yan Liu, Copyright reserved 87 Essential Loss vs. Surrogate Losses 1) Many pairwise and listwise loss functions are upper bounds of the essential loss. 2) Therefore, the pairwise and listwise loss functions are also upper bounds of (1-NDCG) and (1-MAP). 2011/5/9 Tie-Yan Liu, Copyright reserved 88 Discussions • Revealed that different loss functions are upper bounds of the same thing. • Explained why many different pairwise and listwise algorithms perform similarly well. • Suggest better algorithms – The essential loss is a tighter upper bound of measure-based ranking errors than existing surrogate loss functions. – Our experiments showed that higher ranking performances can be achieved by optimizing such a tighter bound. 2011/5/9 Tie-Yan Liu, Copyright reserved 89 Measure-specific Listwise Approach • To what degree is a surrogate measure correlated with the ranking measure? • Can the maximization of a surrogate measure lead to the same ranking model that maximizes the ranking measure? 2011/5/9 Tie-Yan Liu, Copyright reserved 90 Tendency Correlation as a Tool (Y. He and T.-Y. 
Liu, IR Journal 2010) 0 0.2 Finding the best linear 0.4 transformation, by applying which to one of the functions, the maximum difference between the tendencies of the two functions in the entire input space can be minimized. 2011/5/9 The “tendency” of a function with respect to two inputs is defined as the difference in the values of the function at the two inputs. Tie-Yan Liu, Copyright reserved 91 Theoretical Justification of Tendency Correlation If the tendency correlation of a surrogate measure is arbitrarily strong, its optimization will lead to the ranking model that can optimize the corresponding ranking measure, on both training and test sets. 2011/5/9 Tie-Yan Liu, Copyright reserved 92 Major Results - SoftRank 1) When a parameter in SoftRank is set to be small, the tendency correlation of its surrogate measure will become strong, regardless of the data distribution. 2) However, in this case, the surrogate measure will contain many local optima, therefore robust global optimization techniques are needed. 2011/5/9 Tie-Yan Liu, Copyright reserved 93 Major Results – SVM-MAP, etc. 1) The surrogate measures in SVM-MAP, SVM-NDCG, DORM, and PermuRank are concave and easy to optimize. 2) No matter how the parameters in these algorithms are set, for certain data distributions, the tendency correlation of their surrogate measures can never be as strong as expected. 2011/5/9 Tie-Yan Liu, Copyright reserved 94 Summary • With the tool of tendency correlation, we reveal the relationship between surrogate measures and original ranking measures. • The theorems well explain previous empirical observations – Some methods can perform well on different datasets, while some others cannot on some datasets. • The theorems provide a guideline for trade-off between effectiveness and ease of optimization. 2011/5/9 Tie-Yan Liu, Copyright reserved 95 Outline • • • • • • Introduction Major Approaches to Learning to Rank Analysis of the Approaches Benchmark Evaluation of the Approaches Statistical Learning Theory for Ranking Summary and Outlook 2011/5/9 Tie-Yan Liu, Copyright reserved 96 Benchmark Datasets for Learning to Rank Role of a Benchmark Dataset • Enable researchers to focus on technologies instead of experimental preparations. • Enable fair comparison among algorithms. • Examples – Reuters and RCV-1 for text classification – UCI for general machine learning 2011/5/9 Tie-Yan Liu, Copyright reserved 98 The LETOR Collection (T.-Y. Liu, et al. LR4IR 2007) • The first benchmark dataset for learning to rank – Document corpora – Document sampling – Feature extraction – Meta information 2011/5/9 Tie-Yan Liu, Copyright reserved 99 Document Corpora • The “.gov” corpus • The OHSUMED corpus – 106 queries 2011/5/9 Tie-Yan Liu, Copyright reserved 100 Document Sampling • The “.gov” corpus – Top 1000 documents per query returned by BM25 • The OHSUMED corpus – All judged documents 2011/5/9 Tie-Yan Liu, Copyright reserved 101 Feature Extraction • The “.gov” corpus – 64 features – TF, IDF, BM25, LMIR, PageRank, HITS, Relevance Propagation, URL length, etc. • The OHSUMED corpus – 40 features – TF, IDF, BM25, LMIR, etc. 2011/5/9 Tie-Yan Liu, Copyright reserved 102 Meta Information • Statistical information about the corpus – Number of documents, number of streams, and number of (unique) terms in each stream. • Raw information of the documents associated with each query – Term frequency, and document length. • Relational information – Hyperlink graph, sitemap information, and similarity relationship matrix of the corpora. 
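As a practical aside, the sketch below groups such data by query, assuming the familiar LETOR/MSLR file layout of one query-document pair per line, "label qid:<id> <feature>:<value> ... # comment". The exact column layout can differ between dataset versions, and the file path shown is hypothetical.

```python
from collections import defaultdict

def load_letor(path):
    """Parse a LETOR/MSLR-style feature file and group (features, label)
    pairs by query id. Lines look like '2 qid:10 1:0.03 2:0.50 ... # docid'."""
    queries = defaultdict(list)
    with open(path) as fin:
        for line in fin:
            line = line.split('#')[0].strip()      # drop trailing comments
            if not line:
                continue
            tokens = line.split()
            label = int(tokens[0])
            qid = tokens[1].split(':')[1]
            features = {int(k): float(v)
                        for k, v in (t.split(':') for t in tokens[2:])}
            queries[qid].append((features, label))
    return queries

# queries = load_letor('MSLR-WEB10K/Fold1/train.txt')  # hypothetical local path
```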
2011/5/9 Tie-Yan Liu, Copyright reserved 103 LETOR Website 2011/5/9 Tie-Yan Liu, Copyright reserved 104 New Benchmark Datasets • Yahoo! Challenge Datasets • Microsoft Learning to Rank Datasets • Both are large-scale datasets retired from commercial search engines. 2011/5/9 Tie-Yan Liu, Copyright reserved 105 Yahoo! Challenge Datasets • Released on March 2010. • Accessible to participants of the challenge • About 30K queries • 700K documents • 700 features for each document. • http://learningtorankchallenge.yahoo.com/ 2011/5/9 Tie-Yan Liu, Copyright reserved 106 Yahoo! Challenge Datasets 2011/5/9 Tie-Yan Liu, Copyright reserved 107 Microsoft Learning to Rank Datasets • Released on June 16, 2010. • Publicly available • More than 30K queries • 3.7M documents • 136 features for each document. • http://research.microsoft.com/en-us/projects/mslr/ 2011/5/9 Tie-Yan Liu, Copyright reserved 108 Microsoft Learning to Rank Datasets 2011/5/9 Tie-Yan Liu, Copyright reserved 109 Evaluation Results on LETOR Datasets Algorithms under Investigation • The pointwise approach – Linear regression • The pairwise approach – Ranking SVM, RankBoost, Frank • The listwise approach – ListNet – SVM-MAP, AdaRank Linear ranking model (except RankBoost) 2011/5/9 Tie-Yan Liu, Copyright reserved 111 Experimental Results – TD2003 2011/5/9 Tie-Yan Liu, Copyright reserved 112 Experimental Results – TD2004 2011/5/9 Tie-Yan Liu, Copyright reserved 113 Experimental Results – NP2003 2011/5/9 Tie-Yan Liu, Copyright reserved 114 Experimental Results – NP2004 2011/5/9 Tie-Yan Liu, Copyright reserved 115 Experimental Results – HP2003 2011/5/9 Tie-Yan Liu, Copyright reserved 116 Experimental Results – HP2004 2011/5/9 Tie-Yan Liu, Copyright reserved 117 Experimental Results - OHSUMED 2011/5/9 Tie-Yan Liu, Copyright reserved 118 Experimental Results - Winning Numbers 2011/5/9 Tie-Yan Liu, Copyright reserved 119 Experimental Results - Winning Numbers 40 35 30 Regression RankSVM 25 RankBoost 20 FRank ListNet 15 AdaRank 10 SVMmap 5 0 N@1 2011/5/9 N@3 N@10 P@1 P@3 Tie-Yan Liu, Copyright reserved P@10 MAP 120 Discussions • NDCG@1 – {ListNet, AdaRank} > {SVM-MAP, RankSVM} > {RankBoost, FRank} > {Regression} • NDCG@3, @10 – {ListNet} > {AdaRank, SVM-MAP, RankSVM, RankBoost} > {FRank} > {Regression} • MAP – {ListNet} > {AdaRank, SVM-MAP, RankSVM} > {RankBoost, FRank} > {Regression} The listwise approach has certain advantages, especially in correctly predicting top-ranked documents. 2011/5/9 Tie-Yan Liu, Copyright reserved 121 Notes • The experimental results can be found at the LETOR website – http://research.microsoft.com/~letor/ • The results are still primal – The implementations of the these algorithms are not perfect – Almost every algorithm can be further improved. – The datasets are relatively small. 2011/5/9 Tie-Yan Liu, Copyright reserved 122 Outline • • • • • • Introduction Major Approaches to Learning to Rank Analysis of the Approaches Benchmark Evaluation of the Approaches Statistical Learning Theory for Ranking Summary and Outlook 2011/5/9 Tie-Yan Liu, Copyright reserved 123 Why Theory? • In practice, one can only observe experimental results on relatively small datasets. To some extent, such empirical results are not trustworthy, because – Small training set cannot fully realize the potential of a learning algorithm. – Small test set cannot reflect the true performance of an algorithm, since the query space is too huge to be represented by a small number of samples. 
• Statistical learning theory for ranking analyzes the performance of an algorithm when the training data is infinite and the test data is really randomly sampled. 2011/5/9 Tie-Yan Liu, Copyright reserved 124 Generalization Analysis • More specifically, in the training phase, one minimizes the empirical risk on the training data. • In the test phase, we evaluate the expected risk of the ranking model on any sample. • Generalization analysis is concerned with the bound of the difference between the expected and empirical risks, when the number of training data approaches infinity. 2011/5/9 Tie-Yan Liu, Copyright reserved 125 Generalization Analysis ? • An algorithm is regarded as better than the other algorithm if its empirical risk can converge to the expected risk but that of the other cannot. • An algorithm is regarded as better than the other if its corresponding convergence rate is faster than that of the other. 2011/5/9 Tie-Yan Liu, Copyright reserved 126 Generalization Analysis • With different assumptions on the data, we have different definitions of the empirical and expected risks. • As a consequence, we have different generalization bounds. 2011/5/9 Tie-Yan Liu, Copyright reserved 127 Statistical Ranking Framework • Document Ranking – Assume that all the documents are i.i.d. no matter they are associated with the same query or not • Subset Ranking – Regard documents as deterministically given, and only regard queries as i.i.d. random variables • Two-layer Ranking – Assume that only the documents associated with the same query are i.i.d. (the distribution depends on queries), but queries are also i.i.d. in the query space 2011/5/9 Tie-Yan Liu, Copyright reserved 128 Document Ranking Framework • Pointwise Approach • Pairwise Approach • Listwise Approach – Cannot be explained using document ranking framework 2011/5/9 Tie-Yan Liu, Copyright reserved 129 Subset Ranking Framework • Pointwise Approach • Pairwise Approach • Listwise Approach 2011/5/9 Tie-Yan Liu, Copyright reserved 130 Two-Layer Ranking Framework • Pointwise Approach • Pairwise Approach • Listwise Approach – Cannot be explained using document ranking framework 2011/5/9 Tie-Yan Liu, Copyright reserved 131 Two Kinds of Generalization Bounds • Uniform generalization analysis – To reveal a bound of the generalization ability for any function in the function class under investigation. • Algorithm-dependent generalization analysis – Only applicable to the model learned from the training data using a specific algorithm. 2011/5/9 Tie-Yan Liu, Copyright reserved 132 Uniform Generalization Bound for Document Ranking (Y. Freund, et al. JMLR 2003) • W.r.t. pairwise 0-1 loss, based on VC dimension. Convergence Rate: log 𝑚 /𝑚 2011/5/9 Tie-Yan Liu, Copyright reserved 133 Uniform Generalization Bound for Document Ranking (S. Agarwal, et al. JMLR 2005) • W.r.t. pairwise 0-1 loss, based on Rank Shatter Coefficient. Convergence Rate: For linear scoring function in 1-D space: 1 𝑚 For linear scoring function in d-D space: log 𝑚 /𝑚 2011/5/9 Tie-Yan Liu, Copyright reserved 134 Uniform Generalization Bound for Document Ranking (P. Bartlett and S. Mendelson, JMLR 2002) • W.r.t. pairwise 0-1 loss, for k-partite ranking, based on properties of U-statistics. 1 Convergence Rate: 𝑚 2011/5/9 Tie-Yan Liu, Copyright reserved 135 Uniform Generalization Bound for Document Ranking (S. Clemencon, et al. Annual Stats 2008) • W.r.t. pairwise surrogate loss, based on VC dimension. 
1 Convergence Rate: 𝑚 2011/5/9 Tie-Yan Liu, Copyright reserved 136 Uniform Generalization Bound for Subset Ranking (Y. Lan, et al. ICML 2009) • W.r.t. listwise surrogate loss, based on Rademacher average. • The generalization bound for ListMLE is much tighter than that for ListNet, especially when the length of the list is large. • The generalization bound for ListMLE decreases monotonously, while that of ListNet increases monotonously, with respect to the length of the list. 1 Convergence • The linear transformation functionRate: is the best choice in terms 𝑛 of the generalization bound in most cases 2011/5/9 Tie-Yan Liu, Copyright reserved 137 Uniform Generalization Bound for Two-Layer Ranking (W. Chen, et al. NIPS 2010) • W.r.t. pairwise 0-1/surrogate loss, based on Rademacher average. • The increasing number of either queries or documents per query in the training data will enhance the two-layer generalization ability. • Only if m ∞ and n ∞ simultaneously, does the two-layer generalization bound uniformly converge. • If we only have the budget to label C documents in total, there is an optimal tradeoff between the number of training queries and that of training documents per query. 2011/5/9 Tie-Yan Liu, Copyright reserved 138 Algorithm-dependent Generalization Bound for Document Ranking (S. Agarwal, et al. ALT 2008 and COLT 2005) • W.r.t. pairwise surrogate loss, based on stability. For example, Ranking SVM is a stable algorithm, 16𝜅2 and its stability coefficient is β(𝑚) = λ𝑚 , where K 𝑥, 𝑥 ≤ 𝜅 2 < ∞. 2011/5/9 Tie-Yan Liu, Copyright reserved 140 Algorithm-dependent Generalization Bound for Subset Ranking (Y. Lan, et al. ICML 2008) • W.r.t. pairwise surrogate loss, based on query-level stability. For Ranking SVM, 𝜏 𝑛 = For IR-SVM, 𝜏 𝑛 = 4𝜅2 . λ𝑛 4𝜅2 𝑚𝑖 max 1 𝑖 λ𝑛 𝑚 𝑛 . The convergence rates for these two algorithms are both O(1/n). By comparing the case with a finite number of training queries, the bound for IR-SVM is much tighter than that for Ranking SVM. 
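For readers who have not seen such results, the bounds referenced above typically share one generic shape, shown here only as an illustration under the standard assumption of a loss bounded in [0, 1]; it is not the exact statement of any of the cited theorems. With probability at least 1 − δ over the draw of m i.i.d. training samples,

```latex
R(f) \;\le\; \widehat{R}_m(f) \;+\; 2\,\mathfrak{R}_m\bigl(\ell \circ \mathcal{F}\bigr)
      \;+\; \sqrt{\frac{\ln(1/\delta)}{2m}}
\qquad \text{for all } f \in \mathcal{F},
```

where R and R̂_m are the expected and empirical risks, ℜ_m is the Rademacher average of the loss class, and the last term decays at the 1/√m rate that appears in the convergence rates quoted above. The document ranking, subset ranking, and two-layer ranking frameworks differ mainly in what counts as one of the m i.i.d. samples: a document, a query, or a query together with its own sample of documents.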
2011/5/9 Tie-Yan Liu, Copyright reserved 141 Summary 2011/5/9 Tie-Yan Liu, Copyright reserved 142 Outline • • • • • • Introduction Major Approaches to Learning to Rank Analysis of the Approaches Benchmark Evaluation of the Approaches Statistical Learning Theory for Ranking Summary and Outlook 2011/5/9 Tie-Yan Liu, Copyright reserved 143 Listwise approach - non-measure specific - measure specific • NIPS workshop on • ICML workshop on • 1st SIGIR workshop learning to rank structure output on LR4IR Data Activity Pointwise approach Pairwise approach Algorithms Theory • LETOR 1.0 • LETOR 2.0 • • • • Pranking DMIR Ranking SVM RankBoost < 2005 2011/5/9 • Gaussian process for ordinal regression • RankNet 2005 2006 WWW tutorial • SIGIR tutorial • 2nd LR4IR workshop • 1st SIGIR workshop on • • diversity • LETOR 3.0 • Generalization of pairwise methods • Consistency of pointwise methods • • • • Subset Ranking • • IR-SVM • • LambdaRank • • P-norm push • • Nested Ranker • • RankCosine • • • • • • Consistency of Listwise methods MC-Rank GBRank MHR Frank ListNet Supervised RA RankGP AdaRank SVM-MAP 2007 Tie-Yan Liu, Copyright reserved • • • • • • • • • • • • WWW tutorial ACL tutorial IRJ special issue 3rd LR4IR workshop 2nd Diversity workshop • LETOR 4.0 • Generalization of listwise methods IsoRank QBRank • CDN Ranker • ListMLE • Relational Ranking • GlobalRank • Query-dependent Ranker AppRank • PermuRank SoftRank G-SoftRank SVM-NDCG 2008 OWA Rank BoltzmanRank Bayes-Luce SmoothRank Robust sparse ranker …… 2009 144 Answer to Question 1 • To what respect are these learning to rank algorithms similar and in which aspects do they differ? What are the strengths and weaknesses of each algorithm? Approach Modeling Pointwise Regression/ classification/ ordinal regression Pairwise Pairwise classification Listwise 2011/5/9 Ranking Pro Con Easy to leverage existing theories and algorithms • Accurate ranking ≠ Accurate score or category •Effectively • Position info is invisible to the loss minimize (1NDCG). query-level and position based • More complex • New theory needed Tie-Yan Liu, Copyright reserved Connection 145 Answer to Question 2 • Empirically speaking, which of those many learning to rank algorithms perform the best? – LETOR makes it possible to perform fair comparisons among different learning to rank methods. – Empirical studies on LETOR have shown that the listwise ranking algorithms seem to have certain advantages over other algorithms, especially in predicting the top-ranked documents, and the pairwise ranking algorithms seem to outperform the pointwise algorithms. 2011/5/9 Tie-Yan Liu, Copyright reserved 146 Answer to Question 3 • What are the unique theoretical issues for ranking that should be investigated? – Unique properties of ranking for IR: evaluation is performed at query level and is position based. – The risks should also be defined at the query level, and a new theoretical framework is needed for conducting analyses on the learning to rank methods. – The “true loss” for ranking should consider the position information in the ranking result, but not as simple as the 0−1 loss in classification. 2011/5/9 Tie-Yan Liu, Copyright reserved 147 Answer to Question 4 • Are there many remaining issues regarding learning to rank to study in the future? – – – – – – – 2011/5/9 Learning from logs Feature engineering Advanced ranking model Ranking theory Stability of ranking Online learning Large-scale learning to rank Tie-Yan Liu, Copyright reserved 148 Learning from Logs • Data scale matters a lot! 
– Ground truth mining from logs is one of the possible ways to construct large-scale training data. • Ground truth mining may lose information . – User sessions, frequency of clicking a certain document, frequency of a certain click pattern, and diversity in the intentions of different users. • Directly learn from logs. – E.g., Regard click-through logs as the “ground truth”, and define its likelihood as the loss function. 2011/5/9 Tie-Yan Liu, Copyright reserved 149 Feature Engineering • Ranking is deeply rooted in IR – Not just new algorithms and theories. • Feature engineering is very importance – How to encode the knowledge on IR accumulated in the past half a century in the features? – Currently, these kinds of works have not been given enough attention to. 2011/5/9 Tie-Yan Liu, Copyright reserved 150 Advanced Ranking Model • Listwise ranking function • Generative ranking model • Semi-supervised ranking • Relational ranking • Query-dependent ranking • …… 2011/5/9 Tie-Yan Liu, Copyright reserved 151 Ranking Theory • • • • • A more unified view on loss functions True loss for ranking Statistical consistency for ranking Complexity of ranking function class …… 2011/5/9 Tie-Yan Liu, Copyright reserved 152 References 2011/5/9 Tie-Yan Liu, Copyright reserved 153 Acknowledgement • I would like to thank – My colleagues Tao Qin, Jun Xu, Hang Li, and Wei-Ying Ma – Our interns Zhe Cao, Ming-Feng Tsai, Xiubo Geng, Yuting Liu, Yanyan Lan, Fen Xia, Wei Chen, Jiang Bian, Yin He, Di He, and Wenkui Ding. – Our internal collaborators Chris Burges, Michael Taylor, Stephen Robertson, Emine Yilmaz, and Ralf Herbrich. – Our external collaborators Hsin-Hsi Chen, Hongyuan Zha, Zhiming Ma, Liwei Wang, Cheng Xiang Zhai, Thorsten Joachims, Olivier Chappell, and Yi Chang. 2011/5/9 Tie-Yan Liu, Copyright reserved 154 tyliu@microsoft.com http://research.microsoft.com/people/tyliu/ http://weibo.com/tieyanliu/ 2011/5/9 Tie-Yan Liu, Copyright reserved 155