Semantic Matching by Non-Linear Word Transportation for Information Retrieval
Jiafeng Guo*, Yixing Fan*, Qingyao Ai+, W. Bruce Croft+
*CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
+Center for Intelligent Information Retrieval, University of Massachusetts Amherst, MA, USA
Outline
- Introduction
- Non-Linear Word Transportation Model
- Discussion
- Experiments
- Conclusions
Introduction
- Effective retrieval models are mostly built on the Bag-of-Words (BoW) representation
- Vocabulary mismatch: relevance scores rely on exact matching of query words against document words, ignoring semantically related words
Techniques
- Query Expansion
- Latent Models
- Translation Models
- Word Embedding
- Word Mover's Distance
Query Expansion
- Global methods: based on the corpus being searched or a hand-crafted thesaurus
- Local methods: based on top-ranked documents (pseudo-relevance feedback, PRF)
- Problem: query drift
Latent Models
- Represent queries and documents in a latent space of reduced dimensionality (e.g., LDA-based document models)
- Problems: loss of many detailed matching signals over words; do not improve performance on their own (need to be combined with traditional models)
Translation Models
- Translate documents into queries (modeling word dependencies)
  - Mixture model and binomial model (Berger et al.)
  - Title–document pairs (Jin et al.)
  - Mutual information between words (Karimzadehgan et al.)
- Problem: how to formalize and estimate the translation probabilities
Word Embedding
- Distributed representations capturing the semantic and syntactic regularities of words
- Their potential in IR needs to be further explored
  - Bag of Word Embeddings (BoWE) representations
  - Monolingual and bilingual retrieval (Vulić et al.)
  - Generalized language model (Ganguly et al.)
Word Mover's Distance
- Rooted in the transportation problem from urban planning and civil engineering
- Earth Mover's Distance: applied to image retrieval and multimedia search
- Word Mover's Distance: applied to document classification
Non-Linear Word Transportation
- Represent documents and queries as Bags of Word Embeddings (BoWE)
- Cast relevance as a non-linear transportation problem (inspired by WMD)
- Fixed document-word capacities, non-fixed query-word capacities
- Efficient approximation via neighborhood pruning and indexing strategies
Bag of Word Embeddings (BoWE)
- Richer representation: captures similarity between words (e.g., "car" and "auto")
- Word embedding matrix W ∈ R^{K×V}
- Document: D = {(w_1^d, tf_1), …, (w_m^d, tf_m)}
- Query: Q = {(w_1^q, qtf_1), …, (w_n^q, qtf_n)}
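As a minimal sketch of the BoWE representation described above, the following builds the (embedding, term-frequency) pairs for a document from its tokens; the helper name and the toy embedding values are illustrative assumptions, not from the paper.

```python
from collections import Counter

import numpy as np

def bag_of_word_embeddings(tokens, embeddings):
    """Build a BoWE representation: a list of (embedding vector, term frequency)
    pairs, one per distinct in-vocabulary word (hypothetical helper)."""
    counts = Counter(t for t in tokens if t in embeddings)
    return [(embeddings[w], tf) for w, tf in counts.items()]

# Toy embedding table standing in for a K x V embedding matrix W.
emb = {
    "car":  np.array([0.9, 0.1]),
    "auto": np.array([0.8, 0.2]),
    "fish": np.array([0.1, 0.9]),
}

doc = bag_of_word_embeddings("car auto car fish".split(), emb)
print(len(doc))  # 3 distinct in-vocabulary words
```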
Non-Linear Word Transportation
- Information capacity
  - Document words: fixed capacity
  - Query words: unlimited capacity, reflecting the vague nature of query intent
- Information gain (profit)
  - Follows the law of diminishing marginal returns
Non-Linear Word Transportation
- Find the optimal flows F = {f_ij} from document words to query words
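As a sketch of the optimization behind "find the optimal flows", combining the ingredients named on these slides (capacities c_i on document words, profits r_ij, and a logarithm for diminishing marginal returns on the unconstrained query side); the paper's exact objective may additionally weight query terms.

```latex
\max_{F}\; \sum_{w_j^q \in Q} \log\!\Big( \sum_{w_i^d \in D} f_{ij}\, r_{ij} \Big)
\qquad \text{s.t.} \quad \sum_{j} f_{ij} \le c_i \;\; \forall i,
\qquad f_{ij} \ge 0 \;\; \forall i, j
```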
Non-Linear Word Transportation
- Document word capacity (smoothed with collection statistics): c_i = (tf_i + u · cf_i / |C|) / (|D| + u)
- Transportation profit: r_ij = max(cos(w_i^d, w_j^q), 0)
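The two quantities above can be computed directly; this sketch uses toy numbers (the document length, frequencies, and smoothing value u are illustrative assumptions).

```python
import numpy as np

def capacity(tf, cf, collection_len, doc_len, u):
    """Smoothed capacity c_i = (tf_i + u * cf_i / |C|) / (|D| + u)."""
    return (tf + u * cf / collection_len) / (doc_len + u)

def profit(w_d, w_q):
    """Transportation profit r_ij = max(cos(w_i^d, w_j^q), 0)."""
    cos = np.dot(w_d, w_q) / (np.linalg.norm(w_d) * np.linalg.norm(w_q))
    return max(cos, 0.0)

# Hypothetical numbers: a word occurring twice in a 100-word document,
# 1000 times in a 10^6-word collection, smoothing u = 1000.
c = capacity(tf=2, cf=1000, collection_len=1e6, doc_len=100, u=1000)
print(c)  # (2 + 1) / 1100

r = profit(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
print(r)  # negative cosine is clipped to 0.0
```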
Transportation Profit
- Risk parameter α: an exact word match should yield more profit than a semantically related word transported multiple times
  - e.g., "salmon" and "fish" (cosine similarity 0.72)
- The higher α, the less profit a semantic (non-exact) transportation can bring
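One simple way to realize the damping described here is to raise the clipped cosine similarity to the power α; this is an illustrative assumption about the role of α, not necessarily the paper's exact formula. Exact matches (similarity 1.0) keep full profit, while related words lose profit as α grows.

```python
def damped_profit(cos_sim, alpha):
    """Damp semantic-match profit as sim ** alpha (a sketch of the risk
    parameter's effect): exact matches are unaffected, related words
    contribute less as alpha grows."""
    return max(cos_sim, 0.0) ** alpha

sim = 0.72  # e.g., cos("salmon", "fish") from the slide
print(damped_profit(1.0, 5))  # exact match keeps profit 1.0 for any alpha
print(damped_profit(sim, 1))  # 0.72
print(damped_profit(sim, 5))  # much smaller than 0.72
```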
Model Summary
- Non-linear word transportation model: captures both exact and semantic matching signals
- Damping effects: document word capacity and transportation profit
- Neighborhood pruning: avoids scoring all |V| × |Q| word pairs (e.g., via kNN over embeddings)
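The kNN pruning mentioned above can be sketched as follows: for each vocabulary word, precompute its k most cosine-similar words and index them, so only those pairs are scored at query time. The function name and the toy vocabulary are illustrative assumptions.

```python
import numpy as np

def top_k_neighbors(word_vec, emb_matrix, k):
    """Return indices of the k vocabulary words most cosine-similar to
    word_vec; precomputing these per word sidesteps the |V| x |Q| blow-up."""
    norms = np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(word_vec)
    sims = emb_matrix @ word_vec / norms
    return np.argsort(-sims)[:k]

# Toy vocabulary of 5 two-dimensional embeddings (hypothetical values).
V = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]])
print(top_k_neighbors(V[0], V, k=2))  # word 0 and its closest neighbor
```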
Model Discussion
- Word alignment effect: due to the relaxation of constraints on the query side and the diminishing marginal effect, a document is assigned a higher score if it can interpret more distinct query words
Semantic Matching
- Query expansion: local-analysis methods are orthogonal to our work
- Latent models: our model instead represents the document as a bag of word embeddings, preserving detailed matching signals
- Statistical translation models: our model offers more flexibility, allowing multiple features in estimation
NWT vs. Word Mover's Distance
- NWT: relevance between queries and documents; a maximum-profit, non-linear transportation problem
- WMD: dissimilarity between documents; a minimum-cost, linear transportation problem
Experiments
Word Embeddings and Evaluation
- Word embeddings
  - Corpus-specific: CBOW and Skip-Gram
  - Corpus-independent: GloVe
- Evaluation measures: MAP, and …
Retrieval Performance and Analysis
Case Studies
- Named entities: for the query "brazil america relation", the embedding neighbors include "argentina" and "spain" for "brazil", and "europe" and "africa" for "america"
- Ambiguous acronyms: for "Find information on taking the SAT college entrance exam", "sat" draws in the weekday abbreviations "fri", "tue" and "wed"
Impact of Word Embeddings
Different Dimensionality
Indexed Neighbor Size
Linear vs. Non-Linear
Conclusions
- A transportation-based model over BoWE representations captures detailed semantic matching signals
- The non-linear formulation: relaxation of query-side constraints and the diminishing marginal effect
- Flexibility in model definition: word capacity and transportation profit