Language Model in Turkish IR
Melih Kandemir, F. Melih Özbekoğlu, Can Şardan, Ömer S. Uğurlu

Outline
– Indexing problem and proposed solution
– Previous work
– System architecture
– Language modeling concept
– Evaluation of the system
– Conclusion

Indexing Problem
“A Language Modeling Approach to Information Retrieval”, Jay M. Ponte and W. Bruce Croft, 1998
– The indexing model is important in the probabilistic retrieval model
– Current indexing models do not lead to improved retrieval results

Indexing Problem
Failure is due to unwarranted assumptions:
– 2-Poisson model: “elite” documents
– N-Poisson model: mixture of more than 2 Poisson distributions

Proposed Solution
– Retrieval based on probabilistic language modeling
– A language model is a probability distribution that captures the statistical regularities of the generation of language
– A language model is inferred for each document

Proposed Solution
– Estimate the probability of generating the query from each document model
– Documents are ranked according to these probabilities
– Users have a reasonable idea of the terms that are likely to occur in documents of interest
– tf and idf are integral parts of the language model

Previous Work
– Robertson–Sparck Jones model and Croft–Harper model: focus on relevance
– Fuhr integrated indexing and retrieval models: used statistics as heuristics
– Wong and Yao used utility theory and information theory

Previous Work
Kalt’s approach is the most similar:
– A maximum likelihood estimator is used
– Collection statistics are integral parts of the model
– Documents are members of language classes

System Overview (architecture diagram)
Components: the user, a UI, the application server (LM-Search with query evaluator and indexer), the index DB (PostgreSQL, accessed via JDBC), and the document repository; the two retrieval methods return different result sets.

System Architecture (indexing pipeline)
Document repository → stemming & term selection (no stemming / first 5 / lemmatiser) → inverted index generation (tf.idf and language model) → index DB

tf.idf vs. Language Model
The two methods produce different result sets; a GUI is provided for seeing the differences between the tf.idf and LM results.

Vocabulary Extraction (stemming & term selection)
– No stemmer: Turkish is agglutinative, so low precision is expected
– First 5 characters: as effective as more complex solutions
– Lemmatiser: high precision is expected; Zemberek2 (MPL license), open-source software with a Java interface that is easy to use and finds the stems of words; the first valid stem will be used, and word sense disambiguation (using WordNet or POS) may be added in the future
(A small sketch of the three options follows below.)
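As a rough illustration only (not the project's actual code), the three term-selection options could look like the following Python sketch; lemmatise() is a hypothetical placeholder for the Zemberek2 call, whose real API is not shown in the slides.

```python
# Sketch of the three vocabulary-extraction options; names are illustrative.

def no_stemming(token: str) -> str:
    # Use the surface form as the index term.
    return token.lower()

def first_five(token: str) -> str:
    # Truncate to the first 5 characters, a cheap heuristic for agglutinative Turkish.
    return token.lower()[:5]

def lemmatise(token: str) -> str:
    # Placeholder: the real system takes the first valid stem returned by the
    # Zemberek2 Java lemmatiser; the bridge to that library is not sketched here.
    raise NotImplementedError("call Zemberek2 here")
```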

Database (index DB)
– Hybrid index structure: the inverted index contains fields for both weighting methods (tf.idf and language model), so the same table can be evaluated in either form
– Three indices, one per term-selection variant (no stemming / first 5 / lemmatiser)
– Implemented on a PostgreSQL database server

Language Modeling: Inverted Index Implementation
An example inverted index for m terms: each term t_i is stored with its collection frequency cf_t and a postings list of (document, P(t_i | M_d), tf_{t_i,d}) entries, e.g. t_1 → cf_t, (d_1, P(t_1 | M_d1), tf), …, (d_n, P(t_1 | M_dn), tf).
If a document does not contain the term, its probability can still be calculated using cf_t.
– f̄_t: mean term frequency of t (and hence its mean probability) in the documents containing it
– cf_t: frequency of t in all documents
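A minimal sketch of such an index, assuming a simple in-memory layout rather than the project's actual PostgreSQL schema (document ids, token lists and field names are illustrative):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [token, ...]} -> term -> {"cf": int, "postings": [(doc_id, p_ml, tf)]}."""
    index = {}
    for doc_id, tokens in docs.items():
        if not tokens:
            continue
        dl = len(tokens)                      # document length in terms
        counts = defaultdict(int)
        for t in tokens:
            counts[t] += 1
        for t, tf in counts.items():
            entry = index.setdefault(t, {"cf": 0, "postings": []})
            entry["cf"] += tf                 # collection frequency of t
            entry["postings"].append((doc_id, tf / dl, tf))   # p_ml(t|M_d) = tf/dl
    return index
```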

The Baseline Approach: tf.idf
We will use the traditional tf.idf term weighting approach as the baseline model:
– Robertson’s tf score
– Standard idf score
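The two formulas appeared as images on the original slide; presumably they are the INQUERY-style weights used as the baseline by Ponte and Croft:

$$ tf'_{t,d} = \frac{tf_{t,d}}{tf_{t,d} + 0.5 + 1.5\,\frac{dl_d}{avg\_dl}}, \qquad idf_t = \frac{\log\frac{N + 0.5}{df_t}}{\log (N + 1)} $$

where N is the number of documents in the collection, df_t the document frequency of t, dl_d the document length, and avg_dl the average document length.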

Language Modeling: Definition
– An alternative approach to indexing and retrieval
– Definition of a language model: a probability distribution that captures the statistical regularities of the generation of language
– Intuition behind it: users have a reasonable idea of the terms that are likely to occur in documents of interest and will choose query terms that distinguish these documents from others in the collection

Language Modeling: The Approach
The following assumptions are not made:
– Term distributions in the documents are parametric
– Documents are members of pre-defined classes
Ranking uses the “query generation probability” rather than the “probability of relevance”.

Language Modeling: The Approach
P(t | M_d): the probability that term t is generated by the language model M_d of document d

Language Modeling: Theory
Maximum likelihood estimate of the probability of term t under the term distribution for document d (reconstructed below):
– tf_{t,d}: raw term frequency of t in document d
– dl_d: total number of terms in document d
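The estimate itself was an image on the slide; it is presumably the standard Ponte and Croft maximum likelihood estimate:

$$ \hat{p}_{ml}(t \mid M_d) = \frac{tf_{t,d}}{dl_d} $$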

Language Modeling: Theory
An additional, more robust estimate obtained from a larger amount of data:
– p_avg(t): mean probability of term t in the documents containing it
– df_t: number of documents that contain term t
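Again the formula was an image; the corresponding Ponte and Croft estimate is presumably the average of the maximum likelihood estimates over the documents that contain t:

$$ \hat{p}_{avg}(t) = \frac{\sum_{d \,:\, t \in d} \hat{p}_{ml}(t \mid M_d)}{df_t} $$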

Language Modeling: Theory
The risk function (reconstructed below), where f̄_t is the mean term frequency of term t in the documents that contain it.
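The risk function shown on the slide is presumably the geometric-distribution risk from Ponte and Croft, used to decide how much to trust the document's own estimate versus the mean estimate:

$$ \hat{R}_{t,d} = \left(\frac{1}{1 + \bar{f}_t}\right) \times \left(\frac{\bar{f}_t}{1 + \bar{f}_t}\right)^{tf_{t,d}} $$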

Language Modeling: The Ranking Formula
Let p̂(t | M_d) be the probability of term t being produced by document d given the document model M_d. The probability of producing the query Q for a given document model M_d is then a product over the terms; both formulas are reconstructed below.
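Both formulas were images on the original slide; the Ponte and Croft versions are presumably:

$$ \hat{p}(t \mid M_d) =
\begin{cases}
\hat{p}_{ml}(t \mid M_d)^{\,1 - \hat{R}_{t,d}} \times \hat{p}_{avg}(t)^{\,\hat{R}_{t,d}} & \text{if } tf_{t,d} > 0,\\[4pt]
\dfrac{cf_t}{cs} & \text{otherwise,}
\end{cases} $$

$$ \hat{P}(Q \mid M_d) = \prod_{t \in Q} \hat{p}(t \mid M_d) \times \prod_{t \notin Q} \bigl(1 - \hat{p}(t \mid M_d)\bigr) $$

where cs is the total number of tokens in the collection.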

Language Modeling: Inverted Index Implementation
An example inverted index for m terms, in the same layout as before but storing only (document, P(t_i | M_d)) pairs per term together with cf_t; a ranking sketch over this index follows below.
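A compact sketch of how a query could be scored against the index built earlier, following the estimates above (the product over non-query terms is omitted for brevity; names and layout are illustrative, not the project's code):

```python
import math

def score_query(query_terms, index, doc_ids, cs):
    # index: term -> {"cf": int, "postings": [(doc_id, p_ml, tf), ...]}
    # cs: total number of tokens in the collection.
    stats = {}
    for t in set(query_terms):
        entry = index.get(t)
        if entry is None:
            continue
        postings = entry["postings"]
        df = len(postings)
        p_avg = sum(p for _, p, _ in postings) / df        # mean probability of t
        f_bar = entry["cf"] / df                           # mean tf in docs containing t
        stats[t] = (entry["cf"], p_avg, f_bar,
                    {d: (p, tf) for d, p, tf in postings})

    scores = {}
    for d in doc_ids:
        log_p = 0.0
        for t, (cf, p_avg, f_bar, postings) in stats.items():
            if d in postings:
                p_ml, tf = postings[d]
                risk = (1.0 / (1.0 + f_bar)) * (f_bar / (1.0 + f_bar)) ** tf
                p = (p_ml ** (1.0 - risk)) * (p_avg ** risk)
            else:
                p = cf / cs                                # back off to collection frequency
            log_p += math.log(p)
        scores[d] = log_p
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```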

Evaluation
Perform recall/precision experiments:
– Recall/precision results
– Non-interpolated average precision
– Precision figures for the top N documents, for several values of N
– R-precision
(A sketch of these measures is given below.)
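For reference, these measures can be computed per query as in the following sketch, assuming a ranked list of document ids and the set of relevant ids:

```python
def precision_at(ranking, relevant, n):
    # Fraction of the top-n retrieved documents that are relevant (n >= 1).
    return sum(1 for d in ranking[:n] if d in relevant) / n

def average_precision(ranking, relevant):
    # Non-interpolated AP: mean of precision at each rank where a relevant doc appears.
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    # Precision at rank R, where R is the number of relevant documents.
    return precision_at(ranking, relevant, len(relevant)) if relevant else 0.0
```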

Other Metrics
Compare the baseline (tf.idf) results to our language model results:
– Percent change between the two result sets
– I / D, where I is the number of queries whose performance improved and D is the number of queries whose performance changed
(A sketch follows below.)
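A sketch of the two comparison measures, assuming per-query average precision scores for the baseline and the LM run:

```python
def percent_change(baseline_ap, lm_ap):
    # Relative change of the LM score over the tf.idf baseline score (baseline_ap > 0).
    return 100.0 * (lm_ap - baseline_ap) / baseline_ap

def improved_over_changed(baseline_scores, lm_scores):
    # I / D: queries improved divided by queries whose score changed at all.
    improved = sum(1 for b, l in zip(baseline_scores, lm_scores) if l > b)
    changed = sum(1 for b, l in zip(baseline_scores, lm_scores) if l != b)
    return improved / changed if changed else 0.0
```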

Document Repository (document source)
– Milliyet news archive (2001–2005)
– XML file (1.1 GB) of news articles, ready for indexing
– XML schema: … (FIXME)

Summary
– Indexing and stemming: Zemberek2 lemmatiser, Java environment
– Data: news archive from 2001 to 2005, from Milliyet
– Evaluation: the methods will be compared according to their recall/precision performance

Conclusion
– First language modelling approach to Turkish IR
– The LM approach is non-parametric, makes fewer assumptions, and is more relaxed
– A better performance than the baseline tf.idf method is expected

Thanks for listening … Any Questions?