Download presentation
Presentation is loading. Please wait.
Published byRose Taylor Modified over 9 years ago
1
1 Statistical Machine Translation Models for Personalized Search Rohini U AOL India R&D, Bangalore India Rohini.uppuluri@corp.aol.com Vamshi Ambati Language Technologies Institute Carnegie Mellon University Pittsburgh, USA vamshi@cs.cmu.edu Vasudeva Varma, SIEL, LTRC, IIIT Hyderabad, India vv@iiit.ac.in
2
2 Agenda Introduction Introduction Related Work Related Work Background Background User Profile as Translation Model User Profile as Translation Model Personalized Search Personalized Search Learning User Profile Learning User Profile Re-ranking Re-ranking Experiments Experiments Conclusions and Future Work Conclusions and Future Work
3
3 Introduction Current Web Search engines Current Web Search engines Provide users with documents “relevant” to their information need Provide users with documents “relevant” to their information need Issues Issues Information overload Information overload To cater Hundreds of millions of users To cater Hundreds of millions of users Terabytes of data Terabytes of data Poor description of Information need Poor description of Information need Short queries - Difficult to understand Short queries - Difficult to understand Word ambiguities Word ambiguities Users only see top few results Users only see top few results Relevance Relevance subjective – depends on the user subjective – depends on the user One size Fits all ???
4
4 Continued.. Search is not a solved problem! Search is not a solved problem! Poorly described information need Poorly described information need Java – (Java island / Java programming language ) Java – (Java island / Java programming language ) Jaguar – (cat /car) Jaguar – (cat /car) Lemur – (animal / lemur tool kit) Lemur – (animal / lemur tool kit) SBH – (State bank of Hyderabad/Syracuse Behavioral Health care) SBH – (State bank of Hyderabad/Syracuse Behavioral Health care) Given prior information Given prior information I am into biology – best guess for Jaguar? I am into biology – best guess for Jaguar? past queries - { information retrieval, language modeling } – best guess for lemur? past queries - { information retrieval, language modeling } – best guess for lemur?
5
5 Review of Personalized Search Personalized Search Personalized Search Query logs Machine learning Language modeling Community based Others
6
6 Statistical Language Modeling based Approaches: Introduction Statistical language modeling : task of estimating probability distribution that captures statistical regularities of natural language Statistical language modeling : task of estimating probability distribution that captures statistical regularities of natural language Applied to a number of problems – Speech, Machine Translation, IR, Summarization Applied to a number of problems – Speech, Machine Translation, IR, Summarization
7
7 Statistical Language Modeling based Approaches: Background Query Formulation Model User Information need Ideal Document Given a query, which is most likely to be the Ideal Document? Lemur Query In spite of the progress, not much work to capture, model and integrate user context ! P(Q/D) = P(q1….qn/D) = Π P(qi/D)
8
8 Noisy Channel based approach Motivation Query Generation Process (Noisy Channel) Ideal Document Retrieval Query Generation Process (Noisy Channel)
9
9 Similar to Statistical Machine Translation Given an english sentence translate into french Given an english sentence translate into french Given a query, retrieve documents closer to ideal document Given a query, retrieve documents closer to ideal document Noisy channel 1 French Sentence English Sentence Noisy Channel 2 Ideal Document Query P(e/f) P(q/w)
10
10 Learning user profile User profile: Translation Model User profile: Translation Model Triples : (qw,dw,p(qw/dw)) Triples : (qw,dw,p(qw/dw)) Use Statistical Machine Translation methods Use Statistical Machine Translation methods Learning user profile training a translation model Learning user profile training a translation model In SMT: Training a translation model In SMT: Training a translation model From Parallel texts From Parallel texts Using EM algorithm Using EM algorithm
11
11 Learning User profile Extracting Parallel Texts Extracting Parallel Texts From Queries and corresponding snippets from clicked documents From Queries and corresponding snippets from clicked documents Training a Translation Model Training a Translation Model GIZA++ - an open source tool kit widely used for training translation models in Statistical Machine Translation research. GIZA++ - an open source tool kit widely used for training translation models in Statistical Machine Translation research.
12
12 Sample user profile
13
13 Reranking Recall, in general LM for IR Recall, in general LM for IR Noisy Channel based approach Noisy Channel based approach Lemur - Encyclopedia gives a brief description of the physical traits of this animal. The Lemur toolkit for language modeling and information retrieval is documented and made available for download. lemur Lemur encyclopedia … brief … Lemur toolkit … information retireval … P(lemur/retrieval) D4 : D1 : P(Q/D) = Π P(qi/D)
14
14 Experiments Performed evaluation on explicit feedback data collected from 7 users Performed evaluation on explicit feedback data collected from 7 users Experiments Experiments Comparison with Contextless Ranking Comparison with Contextless Ranking Comparison between different training models and contexts Comparison between different training models and contexts
15
15 Data Explicit Feedback data collected from 7 users For each query, each user examined top 10 documents and identified top 10 documents Collected the top 10 results for all queries. Total documents 3469 documents Set up 3469 documents - created lucene index. For reranking, first retrieve the results using lucene and then rerank them using the noisy channel approach. We perform 10 fold cross validation Data and Set up
16
16 Data User No. Q % unique words in Q Total Rel Avg. Rel 137892366.378 25068.421783.56 36182.632984.885 42686.951013.884 53380.761344.06 62978.08983.379 72988.311153.965
17
17 Metrics Precision@n Precision@n Number of documents relevant / n Number of documents relevant / n
18
18 Set up Test Data Train Data User Profiles User Profile Learner Reranker Reranked Results
19
19 UserContextlessProposed 10.14330.2445 20.14260.2445 30.10160.1216 40.05570.1541 50.18870.3933 60.15660.3941 70.10.1833 Avg0.12680.2332
20
20 Results Training Model IBM Model1 GIZA++ Document Train Snippet Train Document Train Snippet Train Document Test 0.20620.23330.17990.2075 Snippet Test 0.2028 0.2488 0.18340.2034
21
21 Results I - Document Training and Document Testing II - Document Training and Snippet Testing III - Snippet Training and Document Testing IV - Snippet Training and Snippet Testing
22
22 Conclusions and Future Work Proposed a stat MT based approach for modeling user model Proposed a stat MT based approach for modeling user model Captures Richer context, relations between q and w. Captures Richer context, relations between q and w. In future, In future, N-gram based method : trigrams etc N-gram based method : trigrams etc Noisy Channel based method : bigram Noisy Channel based method : bigram
23
23 Questions?
24
24 Thank you
25
25 References Adam Berger and John D. Lafferty. 1999. Information retrieval as statistical translation. In Research and Development in Information Retrieval, pages 222–229. Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 19(2):263–311. W. Bruce Croft, Stephen Cronen-Townsend, and Victor Larvrenko. 2001. Relevance feedback and personalization: A language modeling perspective. In DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries. Jamie Allan et. al. 2003. Challenges in information retrieval language modeling. In SIGIR Forum, volume 37 Number 1. K. Sugiyama K. Hatano and M. Yoshikawa. 2004. Adaptive web search based on user profile constructed without any effort from users. In Proceedings of WWW 2004, page 675 684. Victor Lavrenko and W. Bruce Croft. 2001. Relevance-based language models. In Research and Development in Information Retrieval, pages 120–127. F. Liu, C. Yu, and W. Meng. 2002. Personalized web search by mapping user queries to categories. In Proceedings of the eleventh international conference on Information and knowledge management, ACM Press, pages 558–565. Tom Mitchell. 1997. Machine Learning. McGrawHill.
26
26 Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51. Jay M. Ponte and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Research and Development in Information Retrieval, pages 275–281. A. Pretschner and S. Gauch. 1999. Ontology based personalized search. In ICTAI., pages 391– 398. J. J. Rocchio. 1971. Relevance feedback in information retrieval, the smart retrieval system. Experiments in Automatic Document Processing, pages 313–323. G. Salton and C. Buckley. 1990. Improving retrieval performance by relevance feedback. Journal of the American Society of Information Science, 41:288–297. Xuehua Shen, Bin Tan, and Chengxiang Zhai. 2005. Implicit user modeling for personalized search. In Proceedings of CIKM 2005. F. Song and W. B. Croft. 1999. A general language model for information retrieval. In Proceedings on the 22nd annual international ACM SIGIR conference, page 279280. Micro Speretta and Susan Gauch. 2004. Personalizing search based on user search histories. In Thirteenth International Conference on Information and Knowledge Management (CIKM 2004). Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of ACM SIGIR’01, pages 334–342.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.