Deep Learning for Mixed-Script IR
Parth Gupta
PRHLT Research Center, Technical University of Valencia, Spain
http://www.dsic.upv.es/~pgupta/

Outline
• Introduction to Mixed-Script Information Retrieval (MSIR)
• Motivation
• Prevalence of MSIR on the Web
• Deep-learning basics
• Deep learning for MSIR
• Experiments and results
• Summary and future work

Scripts!

“Scripted” World Wide Web
• Many languages have native-script content on the Web
• However, for various socio-cultural and technical reasons, many languages that use non-Roman indigenous scripts also have abundant Web data in Roman-transliterated form
• Examples: song lyrics, forums and reviews, social media

Mixed Script IR – Example
• MSIR: narrow search results by script

Mixed-Script Query Analysis
• 13.78 B queries issued in India, from Bing query logs
• 30 M unique queries
• 6% are transliterated queries, i.e. contain Hindi words in Roman script
• How many queries would benefit from MSIR?
Transliterated Query Categories

Category          Example queries                                  % of queries
Named Entity      Takshashila mahavidhyalaya; Baba Ramdev          28
Entertainment     Himmatwala remake; Ek villain reviews            27
Information Src   Bade ghar ki beti premchand; Swayamvaram info    25
Culture           Ganesh Chaturthi 2014; Silk kurta                1.4
Recipe            Palak paneer by Tarala Dalal                     1.2
Research          Vishwa arthik mandi me bharat                    0.04
Others            Meaning of pyar                                  0.01

Our Contributions
• Formal introduction of Mixed-Script Information Retrieval
• Quantitative analysis of mixed-script queries
• Trainable solution to handle mixed-script term equivalents and variants

Mixed-Script Web Search
• Native script: तेरी गलियाँ
• Non-native (Roman) script: Teri Galliyan, Teri Galiyaan
• Term-level variants: Galiyan, Galliyan, Galiyaan ↔ गलियाँ

Mixed-Script Equivalents Query Formulation
• Original query: teri galiyan song lyrics
• Extracted MSIR part: teri galiyan
• Variants of “teri”: “teri”, “तेरी”
• Variants of “galiyan”: “galiyan”, “galiyaan”, “गलियाँ”
• Formulated query: “teri$galiyan”, “teri$galiyaan”, “तेरी$गलियाँ”

Dataset
• Training data: 30k spelling-variant pairs, e.g. khushboo/khushbu/khusbhu ↔ खुशबू, mehke/mahake/mehake ↔ महके
• Test data: 62,888 lyrics documents in both scripts (Devanagari & Roman)
• 25 Roman-script test queries
• Evaluation measures: MAP and MRR
• Publicly available as part of the shared task at FIRE: http://research.microsoft.com/en-us/events/fire13_st_on_transliteratedsearch/FIRE14ST.aspx

Mixed-Script Space: Model
• Terms are transliterated into a non-native script in such a way that the sound or pronunciation is preserved
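The query-formulation step from the “Mixed-Script Equivalents Query Formulation” slide can be sketched as a cross-product over per-term variant sets. This is a minimal illustration: the `$` joiner and the variant lists come from the slide, while the helper name `formulate_query` is hypothetical.

```python
from itertools import product

def formulate_query(variant_sets):
    """Join one variant from each term's set with '$' to form
    every mixed-script phrase candidate."""
    return ["$".join(combo) for combo in product(*variant_sets)]

variants = [
    ["teri", "तेरी"],                    # variants of "teri"
    ["galiyan", "galiyaan", "गलियाँ"],   # variants of "galiyan"
]
for phrase in formulate_query(variants):
    print(phrase)
```

Note the full cross-product yields six candidates; the slide lists only a subset, so in practice the candidate set would presumably be pruned by the similarity threshold described later.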
• Intra-/inter-script associations at the phoneme level
  – Intra-script: u → oo, s → sh
  – Inter-script: k → क, kh → ख
• Jointly model an abstract space for multi-script words to simultaneously handle spelling variations in multiple scripts

Term Modelling
• We treat phonemes as character-level “topics” within terms
• Phonemes of the language can be captured by character n-grams
• Consider the feature set F = {f_1, …, f_K} containing character grams of scripts s_i for all i ∈ {1, …, r}, with |F| = K
• Each datum = <term_{s1}, term_{s2}, …, term_{sr}>
• Each datum can be represented as a K-dimensional feature vector x, where x_k is the count of the k-th feature f_k ∈ F
• We observe that count data of character grams within terms follow a Dirichlet-multinomial distribution

Training Method
• Deep autoencoder: a non-linear dimensionality-reduction technique [Hinton and Salakhutdinov, 2006]
• The visible layer models the character grams through multinomial sampling under the replicated softmax [Salakhutdinov and Hinton, 2009]

Building Blocks of the Autoencoder
• Restricted Boltzmann Machines (RBM)
• Replicated Softmax Model (RSM):
  – visible layer is softmax, sampled multinomially
  – hidden layer is scaled by |D|

Pretraining
• Each RBM is pre-trained using Contrastive Divergence:
  Δw_ij = ε(⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model)
• The trained RBMs are stacked on top of each other, and the autoencoder is then fine-tuned using backpropagation

Finding Equivalents
• A priori, the complete lexicon of the reference/source collection is projected into the abstract space using the autoencoder
• Given a query term qt, its feature vector v_qt is also projected into the abstract space as h_qt
• All terms with cosine similarity greater than θ are considered equivalents
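The featurization and equivalent-finding steps above can be sketched as follows. This is a simplified illustration, not the presented method: the raw character uni-/bi-gram count vectors stand in for the autoencoder codes h, and all function names are hypothetical.

```python
import numpy as np

def char_ngrams(term, n_values=(1, 2)):
    """Count character uni- and bi-grams of a term."""
    counts = {}
    for n in n_values:
        for i in range(len(term) - n + 1):
            g = term[i:i + n]
            counts[g] = counts.get(g, 0) + 1
    return counts

def featurize(terms):
    """Map each term to a K-dimensional count vector over a shared feature set F."""
    feats = sorted({g for t in terms for g in char_ngrams(t)})
    index = {g: k for k, g in enumerate(feats)}
    X = np.zeros((len(terms), len(feats)))
    for row, t in enumerate(terms):
        for g, c in char_ngrams(t).items():
            X[row, index[g]] = c
    return X, index

def equivalents(query, lexicon, theta=0.8):
    """Return lexicon terms whose cosine similarity to the query exceeds theta.
    Raw n-gram vectors stand in here for the learned projection h."""
    X, _ = featurize([query] + lexicon)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X[1:] @ X[0]  # all cosine similarities in one matrix-vector product
    return [t for t, s in zip(lexicon, sims) if s > theta]

print(equivalents("galiyan", ["galliyan", "galiyaan", "khushboo"]))
# → ['galliyan', 'galiyaan']
```

The single matrix-vector product in `equivalents` is also the operation the later Scalability slide refers to: against a pre-projected lexicon matrix, the online step is one matrix multiplication, which parallelises trivially on multi-core CPUs or GPUs.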
Experiments
• Experiment: ad hoc retrieval
• Unit of matching: word 2-gram
• Retrieval model*: parameter-free DFR [Amati, 2006]
  * We used the publicly available implementation from Terrier 3.5

Baselines
• Naïve: original query
• Naïve+Trans: original + transliterated query
• LSI: expansion using 100-d CL-LSI
• Editex: expansion using Editex [Zobel and Dart, 1996]
• Editex+Trans
• CCA: expansion using Canonical Correlation Analysis [Kumar and Udupa, 2011]
• Deep-Mono (only Roman-script expansion)
• Deep-Mixed (mixed-script expansion)

Results

Method         MRR       MAP       θ
Naïve          0.6857    0.2910    NA
Naïve+Trans    0.6590    0.3560    NA
LSI            0.7533    0.3522    0.92
Editex         0.7767    0.3788    NA
Editex+Trans   0.7433    0.4000    NA
CCA            0.7640    0.3891    0.997
Deep-Mono      0.8000    0.4153    0.96
Deep-Mixed     0.8740*   0.5039*   0.96

* Statistically significant (p-value < 0.01, t-test) compared with the best-performing baseline.

[Figures: number of equivalents at θ; retrieval performance w.r.t. θ; learnt representations]

Scalability
• A very important aspect of Web search
• The algorithm has two parts:
  1. offline indexing of the reference lexicon
  2. online calculation of similarity between the input term and the indexed items
• The indexing, being offline and one-time, is not a concern
• The similarity estimation is essentially a matrix multiplication, so it can easily be parallelised using multi-core CPUs or GPUs

Demo
http://memex2.dsic.upv.es:8080/DeepTranslit/translit.html

Finally: Why This Way?
• Topic models for documents are usually trained over a subset of the vocabulary (top-n terms) and hence have to deal with the non-trivial problem of marginalising over unobserved terms
• In our formulation the feature set is finite (|F| = K)!
• For example, consider (without loss of generality) two scripts s1 (Roman) and s2 (Devanagari), where the total number of characters is 26 in s1 and 50 in s2
• Considering uni- and bi-character grams, K is upper-bounded by 26 + 26² + 50 + 50² = 3252

Mixed-Script IR (MSIR): Summary
• The Web, especially its non-English part, exhibits a large amount of mixed-script content and search behaviour
• Expanding such queries with mixed-script equivalents greatly improves retrieval
• The presented method provides a highly effective and efficient solution to the problem of MSIR
• In future work, MSIR should be extended in close connection with multilingual IR

Thank you for your attention! Questions/Comments?
• Code: http://users.dsic.upv.es/~pgupta/mixed-script-ir.html
• FIRE 2014 Shared Task on Transliterated Search: 4th Sep – 13th Oct 2014 (bit.ly/TLATC2)

Related
• Code: bit.ly/1vrsOOJ

Thanks!

Backup Slides

Contributions Revisited
• We formally introduced the problem of Mixed-Script IR
• Quantitative analysis of Web-search query logs
• Deep-learning-based method to mine mixed-script equivalents

Training Data
• ~30k pairs (automatically extracted) [Gupta et al., 2012], e.g. khushboo/khushbu/khusbhu ↔ खुशबू, mehke/mahake/mehake ↔ महके

“Scripted” World Wide Web
• And lots more…
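The feature-space bound from the “Why this way?” slide can be checked with a one-line computation, using the script sizes stated on that slide (26 Roman and 50 Devanagari characters):

```python
# Upper bound on |F| for uni- and bi-character grams over two scripts,
# as on the slide: Roman (26 characters) and Devanagari (50 characters).
roman, devanagari = 26, 50
K = roman + roman**2 + devanagari + devanagari**2
print(K)  # 3252
```

This finiteness is what lets the model avoid marginalising over unobserved features, unlike word-level topic models trained on a top-n vocabulary.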