Terrier: TERabyte RetRIevER
An Introduction
By: Kavita Ganesan (Last Updated April 21st, 2009)

About Terrier
- Information retrieval toolkit
- Developed by the Information Retrieval Group at the University of Glasgow since 2001
- The team:
  - 3 researchers
  - 5 PhD students
  - 5 programmers

About Terrier
- Provides a platform for developing large-scale IR applications
  - Uses Hadoop to distribute indexing
  - Splits indexing tasks across different nodes on a cluster
- Java based
- The weighting model in Terrier is based on the Divergence From Randomness (DFR) framework
- Also includes other IR models

State-of-the-art functionalities
- Hyperlink structure analysis to rank pages
- Automatic query expansion and reformulation techniques
- Pre-retrieval query performance predictors
- Compression techniques

Other notable features
- Selects the optimal weighting model
  - Based on the statistical features of the query

Toolkit Comparison: Lemur, Lucene, Terrier

Indexing
- Lemur
  - Claims it can index up to terabytes of data
  - Incremental indexing
- Lucene
  - Can index over 20 MB/minute on a home machine
  - Small RAM requirements: only 1 MB heap
  - Index size about 20%-30% of the size of the text indexed (400 GB -> 80 GB)
  - Nutch supports distributed indexing
  - Incremental indexing
- Terrier
  - Some numbers:
    - Size of files to index: 400 GB
    - Resulting size of index files: 17 GB (about 4% of the actual text)
    - Time to build: 3 days (2 processors)
    - Time to retrieve: 4 sec/query (8 processors)
  - Supports distributed indexing
  - Does not support incremental indexing

Retrieval Models
- Lemur: KL-divergence, vector space, Okapi BM25, language model
- Lucene: TF-IDF VSM, Boolean retrieval model
- Terrier: 126 Divergence From Randomness (DFR) models, Okapi BM25, language modeling, TF-IDF

Prog. Lang
- Lemur: C++
- Lucene: Java
- Terrier: Java
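The index-size figures quoted in this comparison can be checked with a little arithmetic. The Python sketch below is only an illustration: the function names are made up here, and the throughput calculation assumes 1 GB = 1024 MB and nonstop indexing over the full three days, neither of which the slide states.

```python
def index_ratio(index_gb: float, text_gb: float) -> float:
    """Index size as a fraction of the size of the text that was indexed."""
    return index_gb / text_gb


def throughput_mb_per_min(text_gb: float, days: float) -> float:
    """Average indexing throughput, assuming 1 GB = 1024 MB and nonstop indexing."""
    return text_gb * 1024 / (days * 24 * 60)


if __name__ == "__main__":
    # The "400 GB -> 80 GB" example from the comparison
    print(f"{index_ratio(80, 400):.1%}")
    # The "Some numbers" entry: 17 GB of index files for 400 GB of text
    print(f"{index_ratio(17, 400):.2%}")
    # 400 GB indexed in 3 days, under the assumptions stated above
    print(f"{throughput_mb_per_min(400, 3):.1f} MB/minute")
```

The throughput value is an average over the whole build on a multi-processor setup, so it is not directly comparable to a per-machine figure such as the 20 MB/minute quoted for a home machine.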

Out of the box capabilities
- Index and evaluate on TREC test collections
- Index standard file formats
  - HTML, PDF, Word, Excel, PowerPoint files
- GUI-based desktop search application

Other out of the box capabilities
- Indexing support using Hadoop
- Highly compressed index data structures
- Options for various stemming techniques
- Many document weighting model options
  - 126 Divergence From Randomness (DFR) models
  - Okapi BM25
  - Language modeling
  - TF-IDF
- Modifiable code
  - Open source code base (Mozilla Public Licence)
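To give a feel for what a document weighting model computes, here is a minimal Python sketch of the textbook Okapi BM25 formula for a single query term. It is not Terrier's implementation: the function name, the parameter names, and the defaults k1 = 1.2 and b = 0.75 are common textbook choices assumed here for illustration.

```python
import math


def bm25_score(tf: float, doc_len: float, avg_doc_len: float,
               n_docs: int, doc_freq: int,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook Okapi BM25 contribution of one query term to one document.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    if tf <= 0:
        return 0.0
    # Inverse document frequency with 0.5 smoothing; it turns negative
    # for terms that occur in more than half of the documents.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Document length normalisation: longer documents are damped more.
    norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    # Saturating term-frequency component.
    return idf * tf * (k1 + 1) / (tf + norm)
```

A document's score for a query is the sum of these per-term contributions over the query terms it contains. Other weighting models replace this formula with a different one while using the same kind of collection statistics.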

Nice to have… but not there
- Ability to easily build a search engine
- Incremental indexing
  - Re-create the index every time
  - Write your own code for incremental indexing
- Flexible indexer
  - Implement your own indexer for non-standard data formats
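Because incremental indexing is not built in, the "write your own code" route needs at least a way to tell which documents changed since the last index build. The Python sketch below shows one possible approach based on content hashes. It does not use any Terrier API, and the names `snapshot` and `diff_snapshots` are hypothetical. What to do with the resulting lists (for example, index only the new files and rebuild when files change) is left open.

```python
import hashlib
from pathlib import Path


def snapshot(root: str) -> dict[str, str]:
    """Map each file under root (by relative path) to a SHA-256 digest of its contents."""
    base = Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify files as added, changed, or removed between two snapshots."""
    return {
        "added": sorted(k for k in new if k not in old),
        "changed": sorted(k for k in new if k in old and new[k] != old[k]),
        "removed": sorted(k for k in old if k not in new),
    }
```

In use, the snapshot taken at the last index build would be stored (for example as JSON) and compared against a fresh snapshot before the next build.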

Benefits of using Terrier
- Terrier is an active, ongoing project
  - Benefit from new models
  - Performance enhancements
  - New features
- Can index large amounts of data
  - Scalable in the long run
- Good support from the team
  - Wiki
  - Discussion forums

…Benefits of using Terrier
- Easy to set up and use
- Very modular
- Source files are fully modifiable and well documented

How To Get Started?
1. Download the binary
   - You get the full source code with this download
2. Unzip the file to a directory
3. Modify the configuration files
   - Models to use
   - Stemmer
   - Etc.
4. You are now ready to index and evaluate
   - Use the pre-existing scripts to index and evaluate
Full setup instructions are available in the Terrier documentation.

Terrier’s Directory Structure
The directories of Terrier are:
- bin/ : contains useful scripts for running Terrier
- etc/ : contains the configuration files (which models to use, stopword list, stemmer, etc.)
- doc/ : contains the documentation of Terrier
- lib/ : contains the compiled Terrier classes and the external libraries used by Terrier
- licenses/ : contains the license information of the components included with Terrier
- share/ : contains a stop word list, an example of documents to test with Terrier, and other infrequently changing files
- src/ : contains the source code of Terrier (the source files needed to start modifications)
- var/index : contains the data structures
- var/results : contains the retrieval results
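After unzipping, a quick sanity check is to confirm that the directories listed above are present. The Python sketch below does only that. The directory names are taken from the slide, while the function name `missing_dirs` and the parameter name `terrier_home` are assumptions made for this example. The exact layout may differ between Terrier versions.

```python
from pathlib import Path

# Directory names as listed on the slide.
EXPECTED_DIRS = [
    "bin", "etc", "doc", "lib", "licenses",
    "share", "src", "var/index", "var/results",
]


def missing_dirs(terrier_home: str) -> list[str]:
    """Return the expected directories that are absent under terrier_home."""
    root = Path(terrier_home)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty result means every listed directory was found. Anything else names the directories to look into before editing the configuration files or running the scripts.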
