Lemur Toolkit
http://net.pku.edu.cn/~wbia
Peng Bo (彭波), pb@net.pku.edu.cn
School of Electronics Engineering and Computer Science, Peking University
3/21/2011

Recap: Information Retrieval Models
Vector space model
Probabilistic models
Language model

Some formulas for Sim(VSM)
Dot product, Cosine, Dice, Jaccard
[Figure: document vector D and query vector Q in the term space t1, t2, t3, separated by the angle θ]
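The formula images did not survive in this transcript. As a reference sketch, the usual textbook definitions of the four measures are given here for a document vector D = (d_1, …, d_n) and a query vector Q = (q_1, …, q_n) of term weights. The symbols d_i and q_i are assumed notation, not taken from the slide.

```latex
\begin{align*}
\text{Dot product:}\quad & Sim(D,Q) = \sum_{i} d_i\, q_i \\
\text{Cosine:}\quad & Sim(D,Q) = \frac{\sum_{i} d_i\, q_i}{\sqrt{\sum_{i} d_i^2}\;\sqrt{\sum_{i} q_i^2}} \\
\text{Dice:}\quad & Sim(D,Q) = \frac{2 \sum_{i} d_i\, q_i}{\sum_{i} d_i^2 + \sum_{i} q_i^2} \\
\text{Jaccard:}\quad & Sim(D,Q) = \frac{\sum_{i} d_i\, q_i}{\sum_{i} d_i^2 + \sum_{i} q_i^2 - \sum_{i} d_i\, q_i}
\end{align*}
```

The cosine measure is the dot product normalized by both vector lengths, which is why it depends only on the angle θ between D and Q.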

BM25 (Okapi system) – Robertson et al.
Considers tf, qtf and document length
Formula components: document length normalization, TF factors
k1, k2, k3, b: parameters
qtf: query term frequency
dl: document length
avdl: average document length
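The formula itself is missing from this transcript. The commonly cited Okapi form that uses exactly these parameters is sketched below. The symbols N (number of documents), n_t (number of documents containing term t), |Q| (number of query terms) and w_t are assumed notation, and the slide may have shown a slightly different variant.

```latex
\begin{align*}
Score(D,Q) &= \sum_{t \in Q} w_t \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
  \;+\; k_2 \cdot |Q| \cdot \frac{avdl - dl}{avdl + dl} \\
K &= k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right) \\
w_t &= \log \frac{N - n_t + 0.5}{n_t + 0.5}
\end{align*}
```

The factor with k_1 and K is the document TF factor with length normalization, the factor with k_3 is the query TF factor, and the k_2 term is a global document length correction.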

Standard Probabilistic IR
[Diagram: information need → query; the query is matched against documents d1, d2, …, dn in the document collection]

IR based on Language Model (LM)
[Diagram: information need → query; documents d1, d2, …, dn in the document collection are linked to the query by generation]
A query generation process:
For an information need, imagine an ideal document.
Imagine what words could appear in that document.
Formulate a query using those words.

Language Modeling for IR
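Estimate a multinomial probability distribution from the text.
Smooth the distribution with one estimated from the entire collection:
$P(w|D) = (1-\lambda)\,P_{ML}(w|D) + \lambda\,P(w|C)$
Here $P_{ML}(w|D)$ is the unsmoothed estimate from the document text and $\lambda$ is the smoothing weight.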

Query Likelihood
Estimate the probability that the document generated the query terms:
$P(Q|D) = \prod_{q \in Q} P(q|D)$
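A minimal sketch of query likelihood scoring with the linear smoothing described above, written in plain Python rather than with the Lemur API. The function names, the toy documents and the value lam=0.4 are illustrative assumptions.

```python
import math
from collections import Counter


def jm_prob(word, doc_counts, doc_len, coll_counts, coll_len, lam=0.4):
    """Smoothed P(w|D) = (1 - lam) * P_ML(w|D) + lam * P(w|C)."""
    p_doc = doc_counts[word] / doc_len if doc_len else 0.0
    p_coll = coll_counts[word] / coll_len
    return (1 - lam) * p_doc + lam * p_coll


def query_log_likelihood(query, doc, collection, lam=0.4):
    """Return log P(Q|D), the sum over query terms q of log P(q|D)."""
    doc_counts = Counter(doc)
    coll_counts = Counter(w for d in collection for w in d)
    coll_len = sum(coll_counts.values())
    score = 0.0
    for q in query:
        p = jm_prob(q, doc_counts, len(doc), coll_counts, coll_len, lam)
        if p == 0.0:  # term does not occur anywhere in the collection
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (hypothetical data, for illustration only).
docs = [
    "lemur toolkit language model retrieval".split(),
    "madagascar lemur primate".split(),
]
query = "language model".split()
scores = [query_log_likelihood(query, d, docs) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])
```

Working in log space avoids underflow when the product runs over many query terms; the ranking is unchanged because the logarithm is monotonic.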

Kullback-Leibler Divergence
Estimate models for the document and the query, then compare them:
$KL(Q\|D) = \sum_{w} P(w|Q)\,\log\frac{P(w|Q)}{P(w|D)}$
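The comparison can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Lemur's implementation: the document model is smoothed with the collection model so that every query word has non-zero probability, every query word is assumed to occur in the collection, and all names and data are hypothetical.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram distribution over the given tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed(model, coll_model, lam=0.4):
    """Mix a model with the collection model over the collection vocabulary."""
    return {w: (1 - lam) * model.get(w, 0.0) + lam * p
            for w, p in coll_model.items()}


def kl_divergence(query_model, doc_model):
    """KL(Q||D) = sum over w of P(w|Q) * log(P(w|Q) / P(w|D))."""
    return sum(pq * math.log(pq / doc_model[w])
               for w, pq in query_model.items() if pq > 0)


# Toy collection (hypothetical data, for illustration only).
docs = [
    "lemur toolkit language model retrieval".split(),
    "madagascar lemur primate".split(),
]
coll_model = unigram_model([w for d in docs for w in d])
query_model = unigram_model("language model".split())
doc_models = [smoothed(unigram_model(d), coll_model) for d in docs]
divergences = [kl_divergence(query_model, m) for m in doc_models]
# Smaller divergence means the document model is closer to the query model.
ranking = sorted(range(len(docs)), key=lambda i: divergences[i])
```

Documents are ranked by increasing divergence, so the first index in `ranking` is the best match.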

Questions
Among the three classic information retrieval models, which one is your best choice when designing your retrieval system?
How can you tune the model parameters to achieve optimal performance?
When you have a new idea about a retrieval problem, how can you prove that it works?

A Brief History of IR
Slides from Prof. Ray Larson, University of California, Berkeley, School of Information

Experimental IR systems
Probabilistic indexing – Maron and Kuhns, 1960
SMART – Gerard Salton at Cornell – Vector space model, 1970’s
SIRE at Syracuse
I3R – Croft
Cheshire I (1990)
TREC – 1992
Inquery
Cheshire II (1994)
MG (1995?)
Lemur (2000?)
IS 240 – Spring 2011

Historical Milestones in IR Research
1958 Statistical Language Properties (Luhn)
1960 Probabilistic Indexing (Maron & Kuhns)
1961 Term association and clustering (Doyle)
1965 Vector Space Model (Salton)
1968 Query expansion (Rocchio, Salton)
1972 Statistical Weighting (Sparck Jones)
Poisson Model (Harter, Bookstein, Swanson)
1976 Relevance Weighting (Robertson, Sparck Jones)
1980 Fuzzy sets (Bookstein)
1981 Probability without training (Croft)

Historical Milestones in IR Research (cont.)
Linear Regression (Fox)
Probabilistic Dependence (Salton, Yu)
Generalized Vector Space Model (Wong, Raghavan)
Fuzzy logic and RUBRIC/TOPIC (Tong, et al.)
Latent Semantic Indexing (Dumais, Deerwester)
Polynomial & Logistic Regression (Cooper, Gey, Fuhr)
TREC (Harman)
Inference networks (Turtle, Croft)
Neural networks (Kwok)
Language Models (Ponte, Croft)

Information Retrieval – Historical View
Research:
Boolean model, statistics of language (1950’s)
Vector space model, probabilistic indexing, relevance feedback (1960’s)
Probabilistic querying (1970’s)
Fuzzy set/logic, evidential reasoning (1980’s)
Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)
Industry:
DIALOG, Lexis-Nexis, STAIRS (Boolean based)
Information industry (O($B))
Verity TOPIC (fuzzy logic)
Internet search engines (O($100B?)) (vector space, probabilistic)

Research Systems Software
INQUERY (Croft)
OKAPI (Robertson)
PRISE (Harman)
SMART (Buckley)
MG (Witten, Moffat)
CHESHIRE (Larson)
LEMUR toolkit
Lucene
Others

Lemur Toolkit Project
Some slides from Don Metzler, Paul Ogilvie & Trevor Strohman
[1] INDRI – Overview
[2] Lemur Toolkit

Zoology 101
Lemurs are primates found only in Madagascar.
50 species (17 are endangered)
Ring-tailed lemurs: Lemur catta

Zoology 101
The indri is the largest type of lemur.
When it was first spotted, the natives yelled "Indri! Indri!", Malagasy for "Look! Over there!"

About The Lemur Project
The Lemur Project was started in 2000 by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University. Over the years, a large number of UMass and CMU students and staff have contributed to the project. The project's first product was the Lemur Toolkit, a collection of software tools and search engines designed to support research on using statistical language models for information retrieval tasks. Later the project added the Indri search engine for large-scale search, the Lemur Query Log Toolbar for capture of user interaction data, and the ClueWeb09 dataset for research on web search.

Installation
Linux, OS/X:
  Extract software/lemur-4.12.tar.gz
  ./configure --prefix=/install/path
  make
  make install
Windows:
  Run software/lemur-4.12-install.exe
  Documentation in windoc/index.html

Installation
Use Lemur-4.12 instead.
A Java runtime (JDK 6) is needed for the evaluation tool.
Environment variable: PATH
  Linux: modify ~/.bash_profile
  Windows: MyComputer/Properties…

Indexing
Document Preparation
Indexing Parameters
Time and Space Requirements
More information can be found at:

Two Index Formats
KeyFile: term positions; metadata; offline incremental; InQuery query language
Indri: term positions; metadata; fields / annotations; online incremental; InQuery and Indri query languages
Blue text indicates settings specific to Indri applications. Refer to online documentation for other Lemur application parameters.

Indexing – Document Preparation
Document formats: the Lemur Toolkit can handle several document formats without any modification:
TREC Text
TREC Web
Plain Text
Microsoft Word (*)
Microsoft PowerPoint (*)
HTML
XML
PDF
Mbox
(*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.

Indexing – Document Preparation
If your documents are not in a format that the Lemur Toolkit can process directly:
If necessary, extract the text from the document.
Wrap the plain text in TREC-style wrappers:
  <DOC>
  <DOCNO>document_id</DOCNO>
  <TEXT>
  Index this document text.
  </TEXT>
  </DOC>
– or –
For more advanced users, write your own parser to extend the Lemur Toolkit.
Plain-text documents could also be used directly.
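The wrapping step is easy to script. The sketch below is one possible way to do it in Python. Using each file name (without extension) as the document id, reading *.txt files from a single directory, and the function names are all assumptions made for illustration. The text is written as-is, without escaping.

```python
from pathlib import Path


def wrap_trec(doc_id, text):
    """Wrap plain text in the TREC-style <DOC> wrapper shown above."""
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def build_trec_file(input_dir, output_file):
    """Concatenate every *.txt file in input_dir into one TREC-text file."""
    input_dir = Path(input_dir)
    with open(output_file, "w", encoding="utf-8") as out:
        for path in sorted(input_dir.glob("*.txt")):
            # Assumption: the file name without extension is a usable document id.
            out.write(wrap_trec(path.stem, path.read_text(encoding="utf-8")))
```

For example, `build_trec_file("docs", "corpus.trectext")` would produce a single file that can then be indexed with the trectext document class; both names are placeholders.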

Indexing - Parameters
Basic usage to build an index:
  IndriBuildIndex <parameter_file>
The parameter file includes options for:
Where to find your data files
Where to place the index
How much memory to use
Stopwords, stemming, fields
Many other parameters

Indexing – Parameters
The standard parameter file is an XML document:
  <parameters>
    <option></option>
    …
  </parameters>

Indexing – Parameters
Where to find your source files and what type to expect:
BuildIndex
  <dataFiles>: name of a file containing the list of data files to index.
IndriBuildIndex
  <corpus>
  <path>: (required) the path to the source files (absolute or relative)
  <class>: (optional) the document type to expect. If omitted, IndriBuildIndex will attempt to guess the file type from the file’s extension.
  <parameters>
    <corpus>
      <path>/path/to/source/files</path>
      <class>trectext</class>
    </corpus>
  </parameters>
Note: put link into web page for document classes.

Indexing - Parameters
The <index> parameter tells IndriBuildIndex where to create the index or which index to add to incrementally.
If the index does not exist, a new one is created.
If the index already exists, new documents are appended to it.
  <parameters>
    <index>/path/to/the/index</index>
  </parameters>

Indexing - Parameters
<memory>: defines a “soft limit” on the amount of memory the indexer should use before flushing its buffers to disk. Use K for kilobytes, M for megabytes, and G for gigabytes.
  <parameters>
    <memory>256M</memory>
  </parameters>

Indexing - Parameters
BuildIndex: stopwords are defined in a file named within <stopwords>filename</stopwords>.
IndriBuildIndex: stopwords can be defined within a <stopper> block, with each individual stopword enclosed in <word> tags.
  <parameters>
    <stopper>
      <word>first_word</word>
      <word>next_word</word>
      …
      <word>final_word</word>
    </stopper>
  </parameters>

Indexing – Parameters
Term stemming can also be applied while indexing, via the <stemmer> tag. Specify the stemmer type in the <name> tag within it. Stemmers included with the Lemur Toolkit include the Krovetz stemmer and the Porter stemmer.
  <parameters>
    <stemmer>
      <name>krovetz</name>
    </stemmer>
  </parameters>
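The individual options above can be combined into one parameter file. The sketch below generates such a file with Python's standard xml.etree.ElementTree module. It only uses the tags shown on these slides; the function name, the default values and the example stopwords are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_index_params(corpus_path, index_path, doc_class="trectext",
                       memory="256M", stemmer="krovetz", stopwords=()):
    """Return an IndriBuildIndex parameter file as an XML string."""
    root = ET.Element("parameters")
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = doc_class
    ET.SubElement(root, "index").text = index_path
    ET.SubElement(root, "memory").text = memory
    if stopwords:
        stopper = ET.SubElement(root, "stopper")
        for word in stopwords:
            ET.SubElement(stopper, "word").text = word
    if stemmer:
        name = ET.SubElement(ET.SubElement(root, "stemmer"), "name")
        name.text = stemmer
    return ET.tostring(root, encoding="unicode")


# Placeholder paths taken from the slides; the stopwords are example values.
xml_text = build_index_params("/path/to/source/files", "/path/to/the/index",
                              stopwords=["the", "of"])
```

Writing `xml_text` to a file, for example a hypothetical build.xml, and running `IndriBuildIndex build.xml` would then follow the basic usage pattern described earlier.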

Retrieval
Parameters
Query Formatting
Interpreting Results

Retrieval - Parameters
Basic usage for retrieval:
  IndriRunQuery/RetEval <parameter_file>
The parameter file includes options for:
Where to find the index
The query or queries
How much memory to use
Formatting options
Many other parameters

Just as with indexing, the parameter file is a well-formed XML document with options wrapped in <parameters> tags:
  <parameters>
    <option></option>
    …
  </parameters>

The <index> parameter tells IndriRunQuery/RetEval where to find the repository.
  <parameters>
    <index>/path/to/the/index</index>
  </parameters>

The <query> parameter specifies a query, either as plain text or in the Indri query language:
  <parameters>
    <query>
      <number>1</number>
      <text>this is the first query</text>
    </query>
    <query>
      <number>2</number>
      <text>another query to run</text>
    </query>
  </parameters>
Query file format:
  <DOC>
  <DOCNO> 1 </DOCNO>
  What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?
  </DOC>
  <DOC>
  <DOCNO> 2 </DOCNO>
  I am interested in articles written either by Prieve or Udo Pooch
  Prieve, B.
  Pooch, U.
  </DOC>
Run a simple query now from the command line.

Retrieval – Query Formatting
TREC-style topics cannot be processed directly by IndriRunQuery/RetEval. Format the queries accordingly:
Format them by hand, or
Write a script to extract the fields (lovely Python!)
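Such a script can be quite short. The sketch below assumes the common TREC topic layout (<top> blocks containing <num> and <title> fields), uses only the title as the query text, and writes the <query>, <number> and <text> tags of the IndriRunQuery parameter format. The sample topic is made up for illustration, and removing punctuation is a cautious assumption so that stray characters are not read as query operators.

```python
import re
import xml.etree.ElementTree as ET

TOPIC_RE = re.compile(r"<top>(.*?)</top>", re.S)
NUM_RE = re.compile(r"<num>\s*(?:Number:)?\s*(\d+)")
TITLE_RE = re.compile(r"<title>\s*(?:Topic:)?\s*(.*?)\s*(?=<|$)", re.S)


def topics_to_params(topics_text):
    """Convert TREC-style topics into an IndriRunQuery <parameters> document."""
    root = ET.Element("parameters")
    for block in TOPIC_RE.findall(topics_text):
        num = NUM_RE.search(block)
        title = TITLE_RE.search(block)
        if not (num and title):
            continue  # skip topics without a number or a title
        # Keep letters, digits and spaces only, then collapse whitespace.
        text = re.sub(r"[^A-Za-z0-9 ]+", " ", title.group(1))
        text = " ".join(text.split())
        query = ET.SubElement(root, "query")
        ET.SubElement(query, "number").text = num.group(1)
        ET.SubElement(query, "text").text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical topic, for illustration only.
sample_topics = """
<top>
<num> Number: 1
<title> Lemur toolkit: indexing

<desc> Description:
Find documents that describe how the Lemur toolkit builds an index.
</top>
"""
params_xml = topics_to_params(sample_topics)
```

The description and narrative fields are ignored here; a longer query could be built by extracting them in the same way.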

Retrieval – Parameters
To specify a maximum number of results to return, use the <count> tag:
  <parameters>
    <count>50</count>
  </parameters>

Result formatting options: IndriRunQuery/RetEval has built-in formatting specifications for TREC and INEX retrieval tasks.

Retrieval – Parameters
TREC – Formatting directives:
<runID>: a string specifying the id for a query run, used in TREC scorable output.
<trecFormat>: true to produce TREC scorable output, otherwise false (default).
  <parameters>
    <runID>runName</runID>
    <trecFormat>true</trecFormat>
  </parameters>
Command-line option: -runID=runName

Outputting INEX Result Format
Must be wrapped in <inex> tags.
<participant-id>: specifies the participant-id attribute used in submissions.
<task>: specifies the task attribute (default CO.Thorough).
<query>: specifies the query attribute (default automatic).
<topic-part>: specifies the topic-part attribute (default T).
<description>: specifies the contents of the description tag.
  <parameters>
    <inex>
      <participant-id>LEMUR001</participant-id>
    </inex>
  </parameters>

Retrieval - Evaluation
To use trec_eval, format the IndriRunQuery results with the appropriate trec_eval formatting directives in the parameter file:
  <runID>runName</runID>
  <trecFormat>true</trecFormat>
The resulting output will be in standard TREC format, ready for evaluation:
  <queryID> Q0 <DocID> <rank> <score> <runID>
  150 Q0 AP runName
  150 Q0 AP runName
Run canned queries now! Evaluate using trec_eval.
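For a quick look at a run before handing it to trec_eval, the six-column format above can be parsed with a few lines of Python. This is a minimal sketch: the function name is an assumption, and the document ids and scores in the sample lines are made-up values.

```python
from collections import defaultdict, namedtuple

Result = namedtuple("Result", "query_id doc_id rank score run_id")


def parse_trec_results(lines):
    """Group '<queryID> Q0 <DocID> <rank> <score> <runID>' lines by query id."""
    runs = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, doc_id, rank, score, run_id = parts
        runs[qid].append(Result(qid, doc_id, int(rank), float(score), run_id))
    for results in runs.values():
        results.sort(key=lambda r: r.rank)  # best-ranked document first
    return dict(runs)


# Hypothetical result lines, for illustration only.
sample_lines = [
    "150 Q0 DOC-B 2 -5.1 runName",
    "150 Q0 DOC-A 1 -4.5 runName",
    "",
]
runs = parse_trec_results(sample_lines)
top_doc = runs["150"][0].doc_id
```

Each query id maps to its results in rank order, so `runs["150"][0]` is the top-ranked document for query 150.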

Use RetEval for TF.IDF
First run ParseToFile to convert document-formatted queries into queries:
  ParseToFile paramfile queryfile
Parameter file options:
  <parameters>
    <docFormat>web</docFormat>
    <outputFile>filename</outputFile>
    <stemmer>stemmername</stemmer>
    <stopwords>stopwordfile</stopwords>
  </parameters>
Example:
  <parameters>
    <docFormat>web</docFormat>
    <outputFile>queries.reteval</outputFile>
    <stemmer>krovetz</stemmer>
  </parameters>

Use RetEval for TF.IDF
Then run RetEval:
  RetEval paramfile
Parameter file options:
  <parameters>
    <index>index</index>
    <retModel>0</retModel>
    <textQuery>query filename</textQuery>
    <resultCount>1000</resultCount>
    <resultFile>tfidf.res</resultFile>
  </parameters>
retModel values: 0 for TF-IDF, 1 for Okapi, 2 for KL-divergence, 5 for cosine similarity.
Example:
  <parameters>
    <index>index</index>
    <retModel>0</retModel>
    <textQuery>queries.reteval</textQuery>
    <resultCount>1000</resultCount>
    <resultFile>tfidf.res</resultFile>
  </parameters>

Evaluate Results
TREC qrels: the ground truth, judged by human assessors.
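To make the role of the qrels concrete, the sketch below computes average precision for one query in plain Python. It assumes the usual four-column qrels layout (query id, iteration, document id, relevance) and treats any relevance value above zero as relevant. The data is made up, and this is an illustration of the measure, not a replacement for trec_eval or ireval.

```python
from collections import defaultdict


def load_qrels(lines):
    """Parse '<queryID> <iteration> <DocID> <relevance>' lines into {qid: relevant doc ids}."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, doc_id, rel = parts
        if int(rel) > 0:
            relevant[qid].add(doc_id)
    return dict(relevant)


def average_precision(ranked_docs, relevant_docs):
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant_docs:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            hits += 1
            total += hits / rank
    return total / len(relevant_docs)


# Hypothetical judgments and ranking, for illustration only.
qrels = load_qrels([
    "150 0 DOC-A 1",
    "150 0 DOC-B 0",
    "150 0 DOC-C 1",
])
ranking = ["DOC-A", "DOC-B", "DOC-C"]
ap = average_precision(ranking, qrels["150"])
```

Here the relevant documents are found at ranks 1 and 3, so the precision values 1/1 and 2/3 are averaged. Averaging this value over all queries gives mean average precision.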

Ireval tool
  java -jar "D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar" result qrels > pr.result

More Stories about Indri

lemur_sigir_2006 Paul Ogilvie Trevor Strohman

Thank You! Q&A
