Lemur Application toolkit Kanishka P Pathak Bioinformatics CIS 595.

Presentation transcript:

Introduction
- A language model (LM) is a probabilistic mechanism for generating text.
- In the past several years, there has been significant interest in the use of language modeling for text and natural language processing tasks.
- We now have text information retrieval (IR) based on statistical language modeling.
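To make "a probabilistic mechanism for generating text" concrete, here is a minimal sketch of a unigram language model. It is only an illustration: the function names and the sample sentence are made up here, and the estimate is plain relative frequency with no smoothing.

```python
from collections import Counter


def train_unigram(text):
    """Estimate word probabilities by relative frequency (maximum likelihood)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def sequence_probability(model, text):
    """Probability that the unigram model generates the given word sequence."""
    prob = 1.0
    for word in text.lower().split():
        # Words never seen in training get probability zero (no smoothing here).
        prob *= model.get(word, 0.0)
    return prob


# Hypothetical training text, chosen only for illustration.
model = train_unigram("the cat sat on the mat")
print(model["the"])
print(sequence_probability(model, "the cat"))
```

Any text containing a word that was not seen in training gets probability zero under this model, which is why retrieval models usually smooth their estimates.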

Previous work
- The first statistical modeler was Claude Shannon.
- He thought of human language as a statistical source and measured how well simple n-gram models did at predicting and compressing natural text.

- For many years, language models were used in speech recognition.
- However, basic language modeling ideas have been used in information retrieval for quite some time.
- Some of the previous models are:
  - naïve Bayes model
  - Robertson and Sparck Jones (RSJ) model

Their limitations
- Naïve Bayes: suffers from the "independence assumptions" it makes.
- RSJ: relies on the distribution of query terms in "relevant" and "non-relevant" documents.

Turning the problem around
- Ponte and Croft proposed a smoothed version of the document unigram model to assign a score to a query.
- Berger and Lafferty built on this model.
- Their approach: "predict the input (i.e. the query)".
- This opened up new ways to think about information retrieval.
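The idea of scoring a query with a smoothed document unigram model can be sketched as follows. This is not Ponte and Croft's exact estimator: it assumes Jelinek-Mercer (linear interpolation) smoothing with a hypothetical weight `lam`, and the two toy documents are invented for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc_tokens, collection_counts, collection_size, lam=0.5):
    """Log-probability of the query under a smoothed document unigram model.

    Smoothing is linear interpolation with the collection model
    (an assumption of this sketch, not taken from the slides).
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query.lower().split():
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts[term] / collection_size
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (hypothetical data).
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "speech recognition with language models".split(),
}
collection_counts = Counter(t for tokens in docs.values() for t in tokens)
collection_size = sum(collection_counts.values())


def rank(query):
    """Return document names ordered from best to worst score."""
    scores = {
        name: query_log_likelihood(query, tokens, collection_counts, collection_size)
        for name, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


print(rank("information retrieval"))
```

Documents are ranked by how likely each one is to have generated the query, which is the "predict the input" view in miniature.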

Lemur
- A lemur is a nocturnal, monkey-like African animal largely confined to the island of Madagascar.
- The name was chosen partly because of its resemblance to "LM/IR".
- It was also chosen because the LM community was an island to the IR community.

What is the Lemur project?
- It is a research project being carried out by the Computer Science Dept. at the Univ. of Massachusetts and Carnegie Mellon University.
- It is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA).
- It is designed to facilitate research in language modeling and information retrieval.
- It is written in C/C++ and runs under Unix as well as Windows.

Components and their interaction

The toolkit
- The Lemur toolkit is available on the site www-2.cs.cmu.edu/~lemur
- To use the toolkit: download, compile, execute

Example of applications
- Pre-processing: ParseQuery, ParseToFile
- Building/Adding Index: PushIndexer, BuildBasicIndex
- Retrieval/Evaluation: RetEval, StructQueryEval
- Summarization: BasicSummApp, MMRSummApp

What do we need to run an application?
- Text documents in a format accepted by Lemur (TREC format)
- A parameter file

Document format in Lemur
There are 5 document formats supported by Lemur:
- TREC
- WEB
- CHINESE
- CHINESECHAR
- ARABIC

Example of a document format
Say we take the document format "web":
<DOC> any_number_here any_number_here Text here </DOC>

Example of document format
<DOC>
Ballistic Cam Design
This paper presents a digital computer program for the rapid calculation of manufacturing data essential to the design of preproduction cams which are utilized in ballistic computers of tank fire control systems. The cam profile generated introduces the superelevation angle required by tank main armament for a particular type ammunition.
CACM November, 1961
Archambault, M.
CA JB March 15, :37 PM
</DOC>
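As a rough illustration of how such a file could be split into documents, the sketch below pulls out everything between `<DOC>` and `</DOC>`. It is not Lemur's parser: the inner tags are not visible in this transcript, so the whole body is treated as plain text, and the function name and sample string are hypothetical.

```python
import re

# Non-greedy match so that adjacent documents are not merged into one.
DOC_PATTERN = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def split_documents(raw):
    """Return the whitespace-normalized body of every <DOC>...</DOC> block."""
    return [" ".join(body.split()) for body in DOC_PATTERN.findall(raw)]


# Hypothetical input with two documents.
sample = "<DOC> 1 first text </DOC><DOC> 2 second text </DOC>"
print(split_documents(sample))
```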

Example of what a parameter file looks like
Say we are creating a parameter file for the application 'BuildBasicIndex'. The parameter file needs to have the following contents:
1. inputFile: the path to the source file
2. outputPrefix: a prefix name for your index
3. maxDocuments: maximum number of documents to index (default )
4. maxMemory: maximum amount of memory to be used for indexing (default 128MB)

Eg:
inputFile=/usr/mydata/source;
outputPrefix=/usr/mydata/index;
maxDocuments=200000;
C:\lemur>BuildBasicIndex c:\lemur\buildpa
The index file generated is: /usr/mydata/index.bsc
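The `name=value;` layout of the example can be read with a few lines of code. The sketch below is an illustration of the format only, not the toolkit's own parameter reader; `parse_parameters` is a hypothetical helper, and the `.bsc` suffix is taken from the example above.

```python
def parse_parameters(text):
    """Parse 'name=value;' entries into a dictionary of strings."""
    params = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the final ';'
        key, _, value = entry.partition("=")
        params[key.strip()] = value.strip()
    return params


example = (
    "inputFile=/usr/mydata/source; "
    "outputPrefix= /usr/mydata/index; "
    "maxDocuments=200000;"
)
params = parse_parameters(example)

# The example above names the generated index file as outputPrefix + ".bsc".
print(params["outputPrefix"] + ".bsc")
```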

Contd.
Run the application with the parameter file as the only argument, or as the first argument if the application can take other parameters from the command line.

Example:
C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt
OR
C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt c:\lemur\source.txt
where:
- BuildBasicIndex is the application
- parambasic.txt is a parameter file for BuildBasicIndex
- source.txt is the file containing the source document

Lemur API
The Lemur API is intended to allow a programmer to use the toolkit for special-purpose applications that are not implemented in the toolkit itself. The API interfaces are grouped at three different levels:
1. Utility level
2. Indexer level
3. Retrieval level

API levels
- Utility level: includes common utilities such as memory management, a default exception handler and a program argument handler.
- Indexer level: converts the raw text into efficient data structures so that information such as word counts can be accessed conveniently and efficiently later.
- Retrieval level: most useful for users who want to build a prototype system or an evaluation system.
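What the indexer level does can be pictured with a small inverted index that stores word counts per document. This is a generic sketch, not Lemur's actual index structures or API; `build_index` and the toy documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {document id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


# Hypothetical two-document collection.
docs = {"d1": "cam design cam", "d2": "tank design"}
index = build_index(docs)

print(index["cam"])
print(index["design"])
```

Once the index is built, looking up a term's counts is a dictionary access rather than a scan over the raw text.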

Future developments
- Summarizing
- Filtering
- Question answering
- Language generation

References
- www-2.cs.cmu.edu/~lemur
- Jay M. Ponte and W. Bruce Croft (CS, UMass Amherst), "A language modeling approach to information retrieval"

THANK YOU Any questions?