
1 Techniques for Gigabyte-Scale N-gram Based Information Retrieval on PCs
Ethan L. Miller, University of Maryland Baltimore County
CADIP PI meeting, 9/9/99

2 What’s the problem?
- N-gram based IR is becoming more important
  - Language-independent
  - Garble-tolerant
  - Better accuracy (phrases, etc.)?
- Scalability of n-gram IR is now necessary
  - Adapt traditional (word-based) IR techniques to n-grams
  - More unique terms per corpus
  - More unique terms per document
  - Avoid use of language-dependent techniques

3 What did we do about it?
- Scaled an n-gram based IR system to handle a gigabyte of text on a commodity (<$5K) PC
  - Adapted compression techniques
  - Used in-memory and on-disk methods
  - Preserved the beneficial properties of n-gram based retrieval
- Showed that disk isn’t much slower than memory for postings lists
  - Fast file systems can move data quickly
  - Decompression times dominate transfer times

4 Overview
- Information retrieval and n-gram basics
- Adapting word-based techniques to n-grams
- Scaling techniques to more terms
- Adapting to the different numerical characteristics of n-gram based IR
- Performance
- TELLTALE design
- Future work
- Conclusions

5 What’s an n-gram?
- N-gram == n-character sequence
- Terms are gathered by sliding a window along the text (see the sketch after this list)
- Term generator & IR engine need no language-specific knowledge
- N-grams have desirable properties
  - Language-independent
  - Garble-resistant
  - Incorporate inter-word relations
- N-grams have difficulties
  - More unique terms per corpus & document
  - Lower counts per term
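A minimal sketch of the sliding-window term generation described above. This is illustrative C++ only, not the TELLTALE source; the window width n is the only parameter, and no language-specific knowledge is needed.

```cpp
// Sliding-window n-gram generation: every n-character window of the text
// becomes a term.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> ngrams(const std::string& text, std::size_t n) {
    std::vector<std::string> terms;
    if (text.size() < n) return terms;
    for (std::size_t i = 0; i + n <= text.size(); ++i) {
        terms.push_back(text.substr(i, n));   // each window is one term
    }
    return terms;
}

int main() {
    for (const auto& t : ngrams("telltale", 5)) {
        std::cout << t << '\n';               // tellt, ellta, lltal, ltale
    }
}
```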

6 Information retrieval in a nutshell
- Create an inverted index for the corpus
  - Table of terms (words or n-grams) in the corpus
  - Postings list for each term
  - Posting = <doc #, term weight in doc>
  - Many potential weighting schemes for terms
- Find documents in the corpus “similar” to the query
  - Break the query into terms
  - Similarity between the query and a given document is a function of the term vectors for each
  - Results are ranked
  - The similarity function often looks like a dot product (see the sketch after this list)
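A minimal sketch of these two pieces: an inverted index mapping each term to a postings list of <doc #, weight> pairs, and a dot-product scoring loop over the query terms. Illustrative C++ only; raw weights stand in for whatever weighting scheme is actually used, and query term weights are assumed to be 1.

```cpp
// Inverted index + dot-product similarity between a query and each document.
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Posting = std::pair<int, double>;                       // <doc #, term weight in doc>
using InvertedIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Score every document against the query by summing weights of shared terms.
std::unordered_map<int, double> score(const InvertedIndex& index,
                                      const std::vector<std::string>& queryTerms) {
    std::unordered_map<int, double> scores;
    for (const auto& term : queryTerms) {
        auto it = index.find(term);
        if (it == index.end()) continue;
        for (const auto& [doc, weight] : it->second) {
            scores[doc] += weight;                            // query weight assumed 1
        }
    }
    return scores;                                            // rank by descending score
}

int main() {
    InvertedIndex index = {
        {"ellta", {{0, 2.0}, {1, 1.0}}},
        {"lltal", {{0, 1.0}}},
    };
    for (const auto& [doc, s] : score(index, {"ellta", "lltal"})) {
        std::cout << "doc " << doc << " score " << s << '\n';
    }
}
```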

7 N-grams vs. words as terms
- There are fewer unique words than unique n-grams; the difference grows with n and can reach an order of magnitude
  - 5-grams => ~4x as many unique terms as words
  - 6-grams => ~10x as many unique terms as words
  - Longer n-grams => even higher ratios
- More postings per document
  - (5-gram postings) / (word postings) ≈ 10
  - Most 5-gram postings have a count of 1

8 N-gram IR: memory usage
- Postings lists
  - Naïve: 12 bytes per entry (see the sketch after this list)
  - Better: compression!
- N-gram (term) table
  - 1 entry per n-gram
  - ~40 bytes per entry
- Document & file information
  - Large structures
  - Relatively few instances!
- Most memory is used by the postings lists & the n-gram hash table
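A hedged sketch of why a naïve posting costs on the order of 12 bytes. The slide does not give the layout; one plausible arrangement (an assumption, not the actual TELLTALE struct) is a 4-byte document id, a 4-byte count, and a 4-byte link to the next posting.

```cpp
// One possible 12-byte uncompressed posting entry (assumed layout).
#include <cstdint>
#include <iostream>

struct NaivePosting {
    uint32_t docId;          // document identifier
    uint32_t count;          // occurrences of the term in that document
    NaivePosting* next;      // linked postings list (4 bytes on a 32-bit PC)
};

int main() {
    // 12 bytes on the 32-bit PCs of the era; on a 64-bit build the pointer
    // alone grows this to 16, which makes compression matter even more.
    std::cout << sizeof(NaivePosting) << " bytes per posting\n";
}
```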

9 Corpus compression
- Compress the integers in the postings to reduce index size
  - Posting count
  - Document identifier (stored as the difference from the previous identifier in the sorted list; see the sketch after this list)
- Try different compression techniques & adjust parameters to best fit n-grams
  - Simple compression: easy to code, but is it effective enough?
  - Gamma compression

10 Gigabyte-scale N-gram IR on PCs
5-gram posting counts Q: What’s the count for a particular posting in a document? A: Almost certainly 1! 80% of all postings have a count of 1 98% have a count of 5 or less Distribution is more skewed for n-grams than for words CADIP PI meeting - 9/9/99 Gigabyte-scale N-gram IR on PCs

11 Document identifier gaps
- The gap-size curve is less steep than the posting-count curve
- It is also less steep than the corresponding curve for words
  - Compression may be less effective
  - Parameters may need to be changed

12 Simple compression
- A raw (uncompressed) index requires 6x the storage of the documents themselves
- Represent each number in 8, 16, or 32 bits (one possible bit layout is sketched after this list)
  - 8 bits: values up to 2^7 - 1
  - 16 bits: values up to 2^14 - 1
  - 32 bits: everything else (up to 30 bits)
- Simple compression effectiveness
  - 960 MB of text -> 1085 MB index
  - Factor of 6 reduction relative to no compression
  - gzip compressed the index by another factor of 2

13 Gamma compression
- Represent each number as a unary n followed by an m-bit binary value
  - The n-to-m translation table can be tuned
  - Adjust the translation to minimize the total number of bits used (a sketch follows this list)
- Posting counts
  - Represent “1” in 1 bit
  - Small numbers take very few bits
- Document gaps
  - Small numbers still get small representations, but...
  - Shallower curve: don’t weight the code as heavily toward small numbers

14 Gamma compression results
- Used a single translation vector for simplicity
  - Selected to minimize the combined encoded size of posting counts and document gaps
  - A vector of <0,2,4,…,16,18,28> worked best
  - Within 3% of the minimum obtained by compressing each set separately
- Posting counts compressed far more than document gaps
- 960 MB of text -> 647 MB of index
  - Postings lists = 485 MB
  - Overhead (doc info, n-gram headers) = 150 MB

15 Postings lists: memory vs. disk
- Constructed indices for a 257 MB corpus
- Ran queries with postings lists
  - In memory
  - On disk
- On-disk lists were slower, as expected, but…
  - Less than a 2x slowdown
  - Decompression is not much slower than disk I/O
  - Seek time was less critical than we thought

16 N-gram library rewrite
- Build more efficient data structures
  - Better dynamic storage
  - Reduced memory consumption
- Make on-disk storage work better
  - More efficient
  - Independent of the underlying byte order
- Build to a standard API
  - Reusable component
  - Fits with legacy apps

17 Data structure design
- Main data structures (a sketch follows this list)
  - TermTable
    - Maintains per-term information
    - Stores term text as a hashed 64-bit value
  - PostingsList
    - Keeps compressed postings lists
    - Dynamically allocates chunks as needed
- Other structures
  - DocTable
  - Corpus (includes the other structures)
- Structures use templates extensively

18 Data structures: connections
[Diagram: the TermTable maps hashed terms such as H(‘ellta’) and H(‘lltal’) to per-term entries (nOccs, nDocs, …, PostList); each PostList points to chunked postings (chunk1, chunk2, …) holding <Count, DocId> pairs, and the DocIds refer into the DocTable.]

19 Current status
- Basic data structures are working
  - PostingsList
  - HashTable (for documents, terms)
- Structures still need to be tied together
  - Corpus data structure
  - Term generation (parsing)

20 Future work
- Currently rewriting the IR system from scratch
  - Better memory & postings-list management
  - Support for trying different term weighting schemes & reduction mechanisms
  - Support for excluding n-grams that won’t matter
- Explore the tradeoff between disk and memory
- Try new weighting algorithms with n-grams
- Parallelize the IR engine (-> Linux clusters)
- Gauge IR performance for n-grams on large corpora

21 Conclusions
- Demonstrated an n-gram based IR system indexing a gigabyte on a commodity PC
  - Used compression & disk storage for scaling
  - Preserved the properties of n-gram based retrieval
- Found sources of performance improvement for scalable IR systems
  - Compression helps more than memory residence
  - Disk access isn’t so bad if the file system is fast

