Terrier: TERabyte RetRIevER An Introduction By: Kavita Ganesan (Last Updated April 21st, 2009)


1 Terrier: TERabyte RetRIevER
An Introduction
By: Kavita Ganesan (Last Updated April 21st, 2009)

2 About Terrier
- Information retrieval toolkit
- Developed by the Information Retrieval Group at the University of Glasgow since 2001
- The team:
  - 3 researchers
  - 5 PhD students
  - 5 programmers

3 About Terrier
- Provides a platform for developing large-scale IR applications
  - Uses Hadoop to distribute indexing
  - Splits indexing tasks across different nodes on a cluster
- Java-based
- Weighting models in Terrier are based on the Divergence From Randomness (DFR) framework [Read More]
- Also includes other IR models
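To make "splits indexing tasks across different nodes" concrete, here is a minimal sketch of the general idea: each shard of documents gets its own partial inverted index, and the partial indexes are merged afterwards. This is a toy, single-process illustration in Python, not Terrier's or Hadoop's actual code; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_partial_index(docs):
    """Build an inverted index (term -> {doc_id: term frequency}) for one shard."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def merge_indexes(partials):
    """Merge per-shard indexes; doc ids are assumed unique across shards."""
    merged = defaultdict(dict)
    for partial in partials:
        for term, postings in partial.items():
            merged[term].update(postings)
    # Sort postings by doc id so the merged lists have a stable order.
    return {term: dict(sorted(p.items())) for term, p in merged.items()}


# Two hypothetical shards, as if each were indexed on a different node.
shard_a = {"d1": "terrier indexes large collections", "d2": "hadoop splits work"}
shard_b = {"d3": "terrier uses hadoop"}

index = merge_indexes([build_partial_index(shard_a), build_partial_index(shard_b)])
print(index["terrier"])  # {'d1': 1, 'd3': 1}
print(index["hadoop"])   # {'d2': 1, 'd3': 1}
```

In a real cluster the per-shard step would run in parallel on separate machines; only the merge needs to see all partial results.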

4 State-of-the-art functionalities
- Hyperlink structure analysis to rank pages
- Automatic query expansion/reformulation techniques
- Pre-retrieval query performance predictors
- Compression techniques

5 Other notable features
- Selects the optimal weighting model
  - Based on the statistical features of the query

6 Toolkit Comparison (Lemur, Lucene, Terrier)

Indexing
- Lemur:
  - Claims it can index up to terabytes of data
  - Incremental indexing
- Lucene:
  - Can index over 20 MB/minute on a home machine
  - Small RAM requirements: only 1 MB heap
  - Index size about 20%-30% of the size of the text indexed (400 GB → 80 GB)
  - Nutch supports distributed indexing
  - Incremental indexing
- Terrier:
  - Some numbers:
    - Size of files to index: 400 GB
    - Resulting size of index files: 17 GB (about 4% of the actual text)
    - Time to build: 3 days (2 processors)
    - Time to retrieve: 4 sec/query (8 processors)
  - Supports distributed indexing
  - Does not support incremental indexing

Retrieval models
- Lemur: KL-divergence, vector space, Okapi BM25, language model
- Lucene: TF-IDF VSM, Boolean retrieval model
- Terrier: 126 Divergence From Randomness (DFR) models, Okapi BM25, language modeling, TF-IDF

Programming language
- Lemur: C++
- Lucene: Java
- Terrier: Java
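The index-size figures on this slide can be checked with a line of arithmetic each. The sketch below only recomputes the ratios from the numbers quoted on the slide; the helper name is hypothetical.

```python
def index_ratio(text_gb, index_gb):
    """Return index size as a fraction of the size of the text indexed."""
    return index_gb / text_gb


# Figures quoted on the slide (in GB).
terrier_ratio = index_ratio(400, 17)  # Terrier: 400 GB of text, 17 GB index
lucene_ratio = index_ratio(400, 80)   # Lucene: 400 GB of text at the 20% lower bound

print(f"Terrier: {terrier_ratio:.2%}")  # Terrier: 4.25%
print(f"Lucene:  {lucene_ratio:.0%}")   # Lucene:  20%
```

So the slide's "4%" for Terrier is 4.25% rounded down, and the 80 GB Lucene figure corresponds to the low end of its quoted 20%-30% range.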

7 Out-of-the-box capabilities
- Index and evaluate on TREC test collections
- Index standard file formats
  - HTML, PDF, Word, Excel, PowerPoint files
- GUI-based desktop search application

8 Other out-of-the-box capabilities
- Indexing support using Hadoop
- Highly compressed index data structures
- Options for various stemming techniques
- Many document weighting model options
  - 126 Divergence From Randomness (DFR) models
  - Okapi BM25
  - Language modeling
  - TF-IDF
- Modifiable code
  - Open source code base (Mozilla Public Licence)
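As an illustration of what a document weighting model computes, here is a minimal sketch of textbook Okapi BM25, one of the models listed above. It is not Terrier's implementation: the whitespace tokenizer, the non-negative IDF variant `ln(1 + (N - df + 0.5) / (df + 0.5))`, and the defaults `k1=1.2`, `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "terrier is an information retrieval toolkit",
    "lucene is a search library",
    "terrier supports many weighting models",
]
scores = bm25_scores("terrier weighting", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The third document matches both query terms and ranks first, the first matches only "terrier", and the second matches nothing and scores zero.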

9 Nice to have… but not there
- Ability to easily build a search engine
- Incremental indexing
  - Re-create the index every time
  - Write your own code for incremental indexing
- Flexible indexer
  - Implement your own indexer for non-standard data formats

10 Benefits of using Terrier
- Terrier is an active, ongoing project
  - Benefit from new models
  - Performance enhancements
  - New features
- Can index large amounts of data
  - Scalable in the long run
- Good support from the team
  - Wiki
  - Discussion forums

11 …Benefits of using Terrier
- Easy to set up and use
- Very modular
- Source files are fully modifiable and well documented [Show]

12 How To Get Started?
1. Download the binary [download]
   - You get the full source code with this download
2. Unzip the file to a directory
3. Modify configuration files
   - Models to use
   - Stemmer
   - Etc.
4. You are now ready to index and evaluate
   - Use pre-existing scripts to index and evaluate
[Full Setup Instructions]

13 Terrier's Directory Structure
The directories of Terrier are:
- bin/ : contains useful scripts for running Terrier
- etc/ : contains the configuration files
- doc/ : contains the documentation of Terrier
- lib/ : contains the compiled Terrier classes and the external libraries used by Terrier
- licenses/ : contains the license information of the components included with Terrier
- share/ : contains a stop word list, an example of documents to test with Terrier, and other infrequently changing files
- src/ : contains the source code of Terrier
- var/index : contains the data structures
- var/results : contains the retrieval results
Slide callouts:
- Which models? Stopword list, stemmer, etc.
- Source files needed to start modifications
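After unzipping, a quick way to confirm the layout is to check for the directories named on this slide. The sketch below is a generic filesystem check, not a Terrier tool; the function name is hypothetical, and the demo runs against a temporary directory rather than a real installation.

```python
import tempfile
from pathlib import Path

# Directories named on the slide, relative to the unpacked Terrier directory.
EXPECTED_DIRS = [
    "bin", "etc", "doc", "lib", "licenses", "share", "src",
    "var/index", "var/results",
]


def missing_dirs(terrier_home):
    """Return the expected directories that do not exist under terrier_home."""
    root = Path(terrier_home)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Demo: a temporary directory containing only three of the expected folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["bin", "etc", "src"]:
        (Path(tmp) / name).mkdir()
    demo_missing = missing_dirs(tmp)

print(demo_missing)
# ['doc', 'lib', 'licenses', 'share', 'var/index', 'var/results']
```

To use it on a real installation, pass the directory you unzipped Terrier into; an empty list means every directory from the slide is present.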

