Published by Madeline Hawkins. Modified over 9 years ago.
1
Terrier: TERabyte RetRIevER
An Introduction
By: Kavita Ganesan (Last updated April 21st, 2009)
2
About Terrier
– Information retrieval toolkit
– Developed by the Information Retrieval Group at the University of Glasgow since 2001
– The team: 3 researchers, 5 PhD students, 5 programmers
3
About Terrier
– Provides a platform for developing large-scale IR applications
– Uses Hadoop to distribute indexing: indexing tasks are split across the nodes of a cluster
– Java-based
– The weighting model in Terrier is based on the Divergence From Randomness (DFR) framework
– Also includes other IR models
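As a rough illustration of the DFR idea, the general shape of a DFR term weight in the published DFR framework is a product of two information measures, with the term frequency normalised by document length. The symbols below (tf, tfn, c, avg_l, l) follow common notation and do not come from the slides; the individual models shipped with Terrier differ in how Prob_1 and Prob_2 are instantiated.

```latex
w(t, d) = \bigl(1 - \mathrm{Prob}_2(\mathit{tfn})\bigr) \cdot \bigl(-\log_2 \mathrm{Prob}_1(\mathit{tfn})\bigr),
\qquad
\mathit{tfn} = \mathit{tf} \cdot \log_2\!\left(1 + c \cdot \frac{\mathit{avg\_l}}{l}\right)
```

Here Prob_1 is the probability of seeing tfn occurrences of the term under a chosen model of randomness, Prob_2 captures the information gain once the term has been accepted as a descriptor of the document, l is the document length, avg_l is the average document length, and c is a free parameter.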
4
State-of-the-art functionalities
– Hyperlink structure analysis to rank pages
– Automatic query expansion/re-formulation techniques
– Pre-retrieval query performance predictors
– Compression techniques
5
Other notable features
– Selects the optimal weighting model based on the statistical features of the query
6
Toolkit Comparison: Lemur, Lucene, Terrier

Indexing
– Lemur: claims it can index up to terabytes of data; incremental indexing
– Lucene: can index over 20 MB/minute on a home machine; small RAM requirements (only 1 MB heap); index size about 20%-30% of the size of the text indexed (400 GB → 80 GB); Nutch supports distributed indexing; incremental indexing
– Terrier: some numbers: size of files to index 400 GB, resulting size of index files 17 GB (4% of actual text), time to build 3 days (2 processors), time to retrieve 4 sec/query (8 processors); supports distributed indexing; does not support incremental indexing

Retrieval Models
– Lemur: KL-divergence, vector space, Okapi BM25, language model
– Lucene: TF-IDF VSM, Boolean retrieval model
– Terrier: 126 Divergence From Randomness (DFR) models, Okapi BM25, language modeling, TF-IDF

Prog. Lang.
– Lemur: C++
– Lucene and Terrier: Java
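The size figures quoted in the comparison can be checked with a few lines of arithmetic. This is only a sketch; the function name `index_ratio` is made up for illustration, and the inputs are the numbers from the slide (a 17 GB index and an 80 GB index, each for 400 GB of text).

```python
def index_ratio(index_gb: float, collection_gb: float) -> float:
    """Return the index size as a fraction of the collection size."""
    if collection_gb <= 0:
        raise ValueError("collection size must be positive")
    return index_gb / collection_gb


# Figures quoted on the slide.
ratio_17 = index_ratio(17, 400)  # 17 GB index for 400 GB of text
ratio_80 = index_ratio(80, 400)  # 80 GB index for 400 GB of text

print(f"{ratio_17:.2%}")  # 4.25%
print(f"{ratio_80:.2%}")  # 20.00%
```

The first ratio matches the "4% of actual text" figure after rounding, and the second sits at the lower end of the quoted 20%-30% range.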
7
Out of the box capabilities
– Index and evaluate on TREC test collections
– Index standard file formats: HTML, PDF, Word, Excel, PowerPoint files
– GUI-based desktop search application
8
Other out of the box capabilities
– Indexing support using Hadoop
– Highly compressed index data structures
– Options for various stemming techniques
– Many document weighting model options: 126 Divergence From Randomness (DFR) models, Okapi BM25, language modeling, TF-IDF
– Modifiable code: open source code base (Mozilla Public Licence)
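To give a feel for what a document weighting model such as Okapi BM25 computes, here is a minimal sketch of the textbook BM25 formula in Python. It is not Terrier's implementation: the function name, the whitespace tokenisation, the toy documents, and the parameter values k1=1.2 and b=0.75 (common textbook defaults) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against a tokenised query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF with +1 inside the log so the value stays positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents are penalised via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy collection, made up for this example.
docs = [
    "terrier indexes large document collections".split(),
    "lucene is a java search library".split(),
    "terrier and lucene are both java based".split(),
]
scores = bm25_scores(["terrier", "indexes"], docs)
print(scores)
```

The first document matches both query terms and is the shortest, so it scores highest; the second matches neither term and scores zero.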
9
Nice to have…but not there
– Ability to easily build a search engine
– Incremental indexing
  – Re-create the index every time, or
  – Write your own code for incremental indexing
– Flexible indexer
  – Implement your own indexer for non-standard data formats
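The difference between re-creating an index and updating it incrementally can be sketched with a toy in-memory inverted index. This is a conceptual illustration only and has nothing to do with Terrier's on-disk index structures; the class `ToyIndex` and its methods are hypothetical names.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        """Incremental update: add one document, leave other postings alone."""
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    @classmethod
    def rebuild(cls, documents):
        """Full rebuild: index every document from scratch."""
        index = cls()
        for doc_id, text in documents.items():
            index.add_document(doc_id, text)
        return index


corpus = {"d1": "terrier indexes documents", "d2": "lucene indexes documents"}
incremental = ToyIndex.rebuild(corpus)

# A new document arrives.
corpus["d3"] = "terrier supports hadoop"

# Option 1: re-create the index over the whole corpus (cost grows with corpus size).
rebuilt = ToyIndex.rebuild(corpus)

# Option 2: incremental update (cost grows only with the new document).
incremental.add_document("d3", corpus["d3"])

print(incremental.postings == rebuilt.postings)  # True
```

Both routes end with the same postings; the rebuild simply re-reads every document to get there, which is the cost the slide refers to.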
10
Benefits of using Terrier
– Terrier is an active, ongoing project, so users benefit from:
  – New models
  – Performance enhancements
  – New features
– Can index large amounts of data
  – Scalable in the long run
– Good support from the team
  – Wiki
  – Discussion forums
11
…Benefits of using Terrier
– Easy to set up and use
– Very modular
– Source files are fully modifiable and well documented
12
How To Get Started?
1. Download the binary
   – You get the full source code with this download
2. Unzip the file to a directory
3. Modify the configuration files
   – Models to use
   – Stemmer
   – Etc.
4. You are now ready to index and evaluate
   – Use the pre-existing scripts to index and evaluate
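To make the configuration step more concrete, the sketch below parses a small Java-style properties snippet and pulls out the choices the slide mentions (model, stemmer, stopword list). The property names and values (`termpipelines`, `stopwords.filename`, `trec.model`, `PL2`) are assumptions written from memory of Terrier's documentation; check them against the documentation for your Terrier version before relying on them. The helper `parse_properties` is a hypothetical name.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


EXAMPLE = """
# ASSUMED property names and values; verify against your Terrier version.
termpipelines=Stopwords,PorterStemmer
stopwords.filename=stopword-list.txt
trec.model=PL2
"""

props = parse_properties(EXAMPLE)
pipeline = props["termpipelines"].split(",")
print(pipeline)
print(props["trec.model"])
```

Under these assumptions, switching the stemmer or the weighting model is a one-line edit in the configuration file, with no code changes needed.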
13
Terrier's Directory Structure
The directories of Terrier are:
– bin/ : contains useful scripts for running Terrier
– etc/ : contains the configuration files
– doc/ : contains the documentation of Terrier
– lib/ : contains the compiled Terrier classes and the external libraries used by Terrier
– licenses/ : contains the license information of the components included with Terrier
– share/ : contains a stop word list, an example of documents to test with Terrier, and other infrequently changing files
– src/ : contains the source code of Terrier
– var/index : contains the data structures
– var/results : contains the retrieval results
Configuration choices: which models? stopword list, stemmer, etc.
Source files needed to start modifications
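After unzipping, a quick sanity check of the layout listed above can be scripted. The sketch below is not part of Terrier; `missing_dirs` is a hypothetical helper, and the demo runs against a temporary directory rather than a real installation.

```python
import tempfile
from pathlib import Path

# Directory names taken from the slide.
EXPECTED_DIRS = [
    "bin", "etc", "doc", "lib", "licenses",
    "share", "src", "var/index", "var/results",
]


def missing_dirs(terrier_home):
    """Return the expected directories that do not exist under terrier_home."""
    root = Path(terrier_home)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Demo on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    for d in ("bin", "etc", "var/index"):
        (Path(tmp) / d).mkdir(parents=True)
    demo_missing = missing_dirs(tmp)

print(demo_missing)
```

Pointing `missing_dirs` at the directory where the download was unzipped should return an empty list if the layout matches the slide.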