Deep Learning for
Mixed-Script IR
Parth Gupta
PRHLT Research Center
Technical University of Valencia, Spain
http://www.dsic.upv.es/~pgupta/
Valencia
Outline
• Introduction of Mixed-Script Information Retrieval
(MSIR)
• Motivation
• Prevalence of MSIR on Web
• Deep-Learning Basics
• Deep-learning for MSIR
• Experiments and Results
• Summary and Future Work
Scripts!
“Scripted” World Wide Web
• Many languages have native-script content on the Web
• But, for various socio-cultural and technical reasons, many languages that use non-Roman indigenous scripts also have abundant content on the Web in Roman transliterated form
“Scripted” World Wide Web
Song Lyrics
“Scripted” World Wide Web
Forums and
Reviews
“Scripted” World Wide Web
Social Media
Mixed Script IR - Example
MSIR
Narrow by Script
Mixed-Script Query Analysis
• 13.78 B queries issued in India, from Bing query logs
• 30 M unique queries
• 6% are transliterated queries, i.e., they contain Hindi words in Roman script
• How many queries would benefit from MSIR?
Transliterated Query Categories
Category          Example Queries                                   % of queries
Named Entity      Takshashila mahavidhyalaya; Baba Ramdev           28
Entertainment     Himmatwala remake; Ek villain reviews             27
Information Src   Bade ghar ki beti premchand; Swayamvaram info     25
Culture           Ganesh Chaturthi 2014; Silk kurta                 1.4
Recipe            Palak paneer by Tarala Dalal                      1.2
Research          Vishwa arthik mandi me bharat                     0.04
Others            Meaning of pyar                                   0.01
Our Contributions
• Formal Introduction of Mixed-Script Information
Retrieval
• Quantitative analysis of Mixed-Script Queries
• Trainable solution to handle Mixed-Script term
Equivalents and Variants
Mixed-Script Web Search
• Non-native Script: Teri Galliyan / Teri Galiyaan
• Native Script: तेरी गलियाँ / तेरी गलिय
Mixed-Script Web Search
• Non-native Script: Teri Galliyan / Teri Galiyaan
• Native Script: तेरी गलियाँ / तेरी गलिय
• Mixed-Script Equivalents: Galiyan, Galliyan, Galiyaan, गलियाँ, गलिय
Query Formulation
• Original Query: teri galiyan song lyrics
• Extract MSIR Part: teri galiyan
• Variants of “teri”: “teri”, “तेरी”
• Variants of “galiyan”: “galiyan”, “galiyaan”, “गलियाँ”
• Formulated Query: “teri$galiyan”, “teri$galiyaan”, “तेरी$गलियाँ”
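This expansion-and-formulation step can be sketched as follows; `formulate_query`, the hard-coded variant lists, and the way 2-gram units are built are illustrative stand-ins, not the paper's actual code:

```python
from itertools import product

def formulate_query(terms, variants):
    """Expand each extracted query term with its mixed-script variants
    and build word 2-gram units (joined by '$') from the combinations."""
    expanded = [variants.get(t, [t]) for t in terms]
    unigrams = [v for group in expanded for v in group]
    bigrams = ["$".join(pair)
               for left, right in zip(expanded, expanded[1:])
               for pair in product(left, right)]
    return unigrams + bigrams

# Variant sets taken from the slide (normally mined automatically)
variants = {
    "teri": ["teri", "तेरी"],
    "galiyan": ["galiyan", "galiyaan", "गलियाँ"],
}
formulated = formulate_query(["teri", "galiyan"], variants)
```

The unigram variants and the `$`-joined pairs together form the expanded query that is handed to the retrieval engine.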
Dataset
• Training Data (30k pairs)
  khushboo → ख़ुशबू
  khushbu → ख़ुशबू
  khusbhu → ख़ुशबू
  …
  mehke → महके
  mahake → महके
  mehake → महके
• Test Data
  • 62,888 lyrics documents in both scripts (Devanagari & Roman)
  • 25 Roman-script test queries
• Evaluation Measures: MAP and MRR
• Publicly available as part of the shared task at FIRE¹

¹ http://research.microsoft.com/en-us/events/fire13_st_on_transliteratedsearch/FIRE14ST.aspx
Mixed-Script Space: Model
• Terms are transliterated into a non-native script in
such a way that the sound or pronunciation is
preserved.
• Intra/Inter-script associations at phoneme level
• Intra Script: u → oo, s → sh
• Inter Script: k → क, kh → ख
• Jointly model an abstract space for multi-script
words to simultaneously handle spelling variations
in multiple scripts
Term Modelling
• We treat the phonemes as character-level “topics” in the terms
• Phonemes of the language can be captured by character n-grams
• Consider the feature set F = {f_1, …, f_K} containing character n-grams of scripts s_i for all i ∈ {1, …, r}, with |F| = K
• Each datum = ⟨term_s1, term_s2, …, term_sr⟩
• Each datum can be represented as a K-dimensional feature vector x, where x_k is the count of the k-th feature f_k ∈ F
• We observe that the count data of character n-grams within terms follow a Dirichlet-multinomial distribution
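A minimal sketch of this feature extraction; the function names and the tiny feature set are illustrative, not from the paper:

```python
from collections import Counter

def char_ngrams(term, n_values=(1, 2)):
    """Character uni- and bi-grams of a term (the phoneme-level features)."""
    grams = []
    for n in n_values:
        grams += [term[i:i + n] for i in range(len(term) - n + 1)]
    return grams

def feature_vector(datum, feature_index):
    """K-dimensional count vector x for a multi-script datum
    <term_s1, ..., term_sr>; x_k is the count of feature f_k."""
    x = [0] * len(feature_index)
    counts = Counter(g for term in datum for g in char_ngrams(term))
    for gram, c in counts.items():
        if gram in feature_index:
            x[feature_index[gram]] = c
    return x

# Toy feature set F spanning both scripts (the real K covers all uni/bi-grams)
feats = {f: i for i, f in enumerate(["k", "h", "kh", "u", "oo", "ख", "ब"])}
x = feature_vector(("khushboo", "ख़ुशबू"), feats)
```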
Training Method
• Deep Autoencoder : Non-linear Dimensionality
Reduction Technique [Hinton and Salakhutdinov, 2006]
• Visible layer models the character grams through
multinomial sampling under replicated softmax
[Salakhutdinov and Hinton, 2009]
Building Blocks of Autoencoder
• Restricted Boltzmann Machines (RBM)
• Replicated Softmax Model (RSM)
  • Visible layer is softmax, sampled multinomially
  • Hidden layer biases are scaled by the document length |D|
Pretraining
• Each RBM is pre-trained using Contrastive Divergence
  Δw_ij = ε(⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model)
• Trained RBMs are stacked on top of each other, and the autoencoder is then fine-tuned using backpropagation
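A toy CD-1 weight update for a single RBM might look like the following sketch; biases are omitted for brevity, and the replicated-softmax visible layer of the actual model is replaced here by plain logistic units:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, v0, eps=0.1):
    """One CD-1 step: Δw_ij = ε(<v_i h_j>_data − <v_i h_j>_model)."""
    # Positive phase: hidden probabilities driven by the data
    h0 = sigmoid(v0 @ W)
    # One Gibbs step: sample hidden, reconstruct visible, re-infer hidden
    h_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(h_sample @ W.T)          # reconstruction ("model" side)
    h1 = sigmoid(v1 @ W)
    # Contrastive Divergence gradient estimate
    return W + eps * (np.outer(v0, h0) - np.outer(v1, h1))

W = rng.normal(scale=0.01, size=(6, 3))   # 6 visible, 3 hidden units
v = np.array([1., 0., 1., 1., 0., 0.])    # toy binary n-gram indicator
W = cd1_update(W, v)
```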
Finding Equivalents
• A priori, the complete lexicon of the reference/source collection is projected into the abstract space using the autoencoder.
• Given a query term q_t, its feature vector v_qt is also projected into the abstract space as h_qt.
• All terms with cosine similarity greater than θ are considered equivalents.
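A sketch of this lookup, assuming the autoencoder's hidden codes are already available as rows of a matrix; all names and the 3-dimensional toy codes are illustrative:

```python
import numpy as np

def find_equivalents(h_query, H_lexicon, terms, theta=0.96):
    """Return lexicon terms whose abstract-space code has cosine
    similarity greater than theta with the query term's code."""
    Hn = H_lexicon / np.linalg.norm(H_lexicon, axis=1, keepdims=True)
    qn = h_query / np.linalg.norm(h_query)
    sims = Hn @ qn
    return [t for t, s in zip(terms, sims) if s > theta]

# Toy 3-d "hidden" codes standing in for autoencoder outputs
terms = ["galiyan", "galiyaan", "गलियाँ", "paneer"]
H = np.array([[0.90, 0.10, 0.00],
              [0.88, 0.12, 0.01],
              [0.91, 0.09, 0.02],
              [0.00, 0.20, 0.95]])
q = np.array([0.90, 0.10, 0.00])
equivalents = find_equivalents(q, H, terms)
```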
Experiments
• Experiment: Ad hoc Retrieval
• Unit of matching: Word 2-gram
• Retrieval Model*: parameter free DFR [Amati, 2006]
* We used the publicly available implementation from Terrier 3.5
Baselines
• Naïve: Original Query
• Naïve + Trans: Original + Transliterated Query
• LSI: Expansion using 100-d CL-LSI
• Editex: Expansion using Editex [Zobel and Dart, 1996]
• Editex+Trans
• CCA: Expansion using Canonical Correlation
Analysis [Kumar and Udupa, 2011]
• Deep-Mono (Only Roman Script Expansion)
• Deep-Mixed (Mixed Script Expansion)
Results
Method         MRR       MAP       θ
Naïve          0.6857    0.2910    NA
Naïve+Trans    0.6590    0.3560    NA
LSI            0.7533    0.3522    0.92
Editex         0.7767    0.3788    NA
Editex+Trans   0.7433    0.4000    NA
CCA            0.7640    0.3891    0.997
Deep-Mono      0.8000    0.4153    0.96
Deep-Mixed     0.8740*   0.5039*   0.96

* Statistically significant (p-value < 0.01, t-test) compared with the best-performing baseline.
Number of Equivalents at θ
Retrieval Performance wrt θ
Learnt Representations
Scalability
• Scalability is a very important aspect of Web search
• The algorithm has two parts:
  1. offline indexing of the reference lexicon
  2. online calculation of similarity between the input term and the indexed items
• Indexing is offline and one-time, so it is not a concern
• Similarity estimation is essentially a matrix multiplication, so it can easily be parallelised on multi-core CPUs or GPUs
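For instance, normalising the codes once lets a batch of query terms be scored against the whole indexed lexicon with a single matrix product, which BLAS or a GPU parallelises across cores (a sketch with random stand-in codes):

```python
import numpy as np

def batch_similarity(Q, H):
    """Cosine similarity of every query code (rows of Q) against every
    indexed lexicon code (rows of H) as one matrix multiplication."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    return Qn @ Hn.T   # (num_queries x lexicon_size) similarity matrix

rng = np.random.default_rng(0)
S = batch_similarity(rng.random((8, 128)),      # 8 query-term codes
                     rng.random((10000, 128)))  # 10k indexed lexicon codes
```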
Demo
http://memex2.dsic.upv.es:8080/DeepTranslit/translit.html
Finally: Why this way?
• The topic models for documents are usually trained
over a subset of vocabulary (top-n terms) and
hence, they have to deal with the non-trivial
problem of marginalising over unobserved terms.
• In our formulation the feature set is finite (|F| = K)!
• For example, consider (without loss of generality)
two scripts 𝑠1 (Roman) and 𝑠2 (Devanagari), where
total number of characters in 𝑠1 = 26 and 𝑠2 = 50.
• Considering uni and bi char-grams, 𝐾 is upper
bounded by 𝐾 = 3252 (26 + 262 + 50 + 502)
Mixed-Script IR (MSIR)
• The Web, especially its non-English part, exhibits a high amount of mixed-script content and search behaviour
• Expanding such queries with Mixed-Script Equivalents greatly improves retrieval
• The presented method provides a highly effective and efficient solution to the problem of MSIR
• In the future, MSIR should be extended in close connection with Multilingual IR
Thank you for your attention!
Questions/Comments?
• Code : http://users.dsic.upv.es/~pgupta/mixed-script-ir.html
• FIRE2014 Shared Task on Transliterated Search
• 4th Sep – 13th Oct 2014 (bit.ly/TLATC2)
Thanks!
• Backup slides
Contributions Revisited
• We formally introduced the problem of Mixed-Script IR
• The quantitative analysis of the Web Search Query
logs
• Deep-learning based method to mine Mixed-Script
Equivalents
Training Data (del)
• ∼30k pairs (automatically extracted) [Gupta et al, 2012]
  khushboo → ख़ुशबू
  khushbu → ख़ुशबू
  khusbhu → ख़ुशबू
  …
  mehke → महके
  mahake → महके
  mehake → महके
“Scripted” World Wide Web (del)
And a Lot More…