CSE 454: Crawlers & Indexing (Copyright © Weld 2002-2005)


Slide 1: CSE 454 Crawlers & Indexing

Slide 2: Course Overview
– Networking
– Cluster Computing
– Crawler Architecture
– Case Studies: Nutch, Google, Altavista
– Machine Learning
– Information Extraction
– Indexing
– Cool Topics

Slide 3: Administrivia
Glenn Kelman's Talk ??
Today's class is based in part on:
– Mercator: A Scalable, Extensible Web Crawler
– Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley, 1999
No class on Thursday (group meetings)
Colloquia:
– 3:30 today
– 1:30 Monday, EE 031, "Self-Managing DBMS"

Slide 4: Standard Web Search Engine Architecture
[Architecture diagram: crawl the web → store documents, check for duplicates, extract links → create an inverted index (DocIds) → search engine servers take the user query and show results to the user. Slide adapted from Marti Hearst, UC Berkeley.]

Slide 5: Search Engine Components
Spider – getting the pages
Indexing – storing (e.g. in an inverted file)
Query Processing – Booleans, …
Ranking – vector space model, PageRank, anchor text analysis
Summaries
Refinement

Slide 6: Spiders
243 active spiders registered as of 1/01
– Inktomi Slurp: standard search engine
– Digimarc: downloads just images, looking for watermarks
– AdRelevance: looking for ads

Slide 7: Spiders (Crawlers, Bots)
Basic algorithm (see the sketch below):
    Queue := initial page URL0
    Do forever:
        Dequeue URL
        Fetch P
        Parse P for more URLs; add them to the queue
        Pass P to a (specialized?) indexing program
Issues:
– Which page to look at next? (keywords, recency, focus, ???)
– Avoid overloading a site
– How deep within a site to go?
– How frequently to visit pages?
– Traps!
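
A minimal sketch of this loop in Python, using only the standard library. The seed URL, the page limit, and the crude regex link extractor are illustrative assumptions, not part of the original slide; real crawlers also need politeness, robots.txt checks, and trap defenses covered later.

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def index(url, page):
        """Stand-in for the (specialized?) indexing program."""
        print("indexed", url, len(page), "chars")

    def crawl(seed, max_pages=10):
        frontier = deque([seed])                  # Queue := initial page URL0
        seen = {seed}
        while frontier and max_pages > 0:         # "do forever", bounded for the sketch
            url = frontier.popleft()              # dequeue URL
            try:
                page = urlopen(url, timeout=5).read().decode("utf-8", "replace")  # fetch P
            except OSError:
                continue
            for href in re.findall(r'href="([^"]+)"', page):   # parse P for more URLs (crude)
                link = urljoin(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)         # add them to the queue
            index(url, page)                      # pass P to the indexing program
            max_pages -= 1

    crawl("http://example.com/")                  # placeholder seed URL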

Slide 8: Crawling Issues
Storage efficiency
Search strategy
– Where to start
– Link ordering
– Circularities
– Duplicates
– Checking for changes
Politeness
– Forbidden zones: robots.txt
– CGI & scripts
– Load on remote servers
– Bandwidth (download only what you need)
Parsing pages for links
Scalability

Slide 9: Robot Exclusion
A person may not want certain pages indexed.
Crawlers should obey the Robot Exclusion Protocol.
– But some don't
Look for the file robots.txt at the highest directory level
– If the domain is example.com, robots.txt goes in example.com/robots.txt
A specific document can be shielded from a crawler by adding the standard robots META tag to the page, e.g. <meta name="robots" content="noindex">

Slide 10: Robots Exclusion Protocol
Format of robots.txt — two fields:
– User-agent: specifies a robot
– Disallow: tells the agent what to ignore
To exclude all robots from a server:
    User-agent: *
    Disallow: /
To exclude one robot from two directories:
    User-agent: WebCrawler
    Disallow: /news/
    Disallow: /tmp/
View the robots.txt specification online.
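
A hedged example of checking these rules programmatically with Python's standard urllib.robotparser; the host and user-agent strings are placeholders.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # placeholder host
    rp.read()                                      # fetch and parse robots.txt
    if rp.can_fetch("MyCrawler", "https://example.com/news/story.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")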

Slide 11: Managing Load
Very important question … (stay tuned)

Slide 12: Outgoing Links?
Parse HTML… looking for… what?
[Figure: two fragments of raw HTML markup ("a href = …", "frame", "font", list tags), posing the question of which tokens actually carry outgoing links.]

Slide 13: Which tags / attributes hold URLs?
Anchor tag: <a href="URL"> … </a>
Option tag: <option value="URL"> … </option>
Map: <area href="URL">
Frame: <frame src="URL">
Link to an image: <img src="URL">
Relative path vs. absolute path: relative URLs must be resolved against the page's base URL
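
A sketch of link extraction with Python's standard html.parser, scanning the tag/attribute pairs listed on the slide; the example URLs are placeholders.

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects URLs from the tag/attribute pairs listed on the slide."""
        URL_ATTRS = {"a": "href", "area": "href", "frame": "src",
                     "img": "src", "option": "value"}

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            wanted = self.URL_ATTRS.get(tag)
            for name, value in attrs:
                if name == wanted and value:
                    # resolve relative paths against the page's base URL
                    self.links.append(urljoin(self.base_url, value))

    parser = LinkExtractor("http://example.com/dir/page.html")   # placeholder URL
    parser.feed('<a href="../other.html">x</a> <img src="/logo.gif">')
    print(parser.links)   # ['http://example.com/other.html', 'http://example.com/logo.gif']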

Slide 14: Web Crawling Strategy
Starting location(s)
Traversal order
– Depth first (LIFO)
– Breadth first (FIFO)
– Or ???
Politeness
Cycles? Coverage?
[Figure: a small example graph of pages b, c, d, e, f, g, h, i, j illustrating traversal order.]

Slide 15: Structure of the Mercator Spider
1. Remove URL from queue
2. Simulate network protocols & REP
3. Read with RewindInputStream (RIS)
4. Has the document been seen before? (checksums and document fingerprints)
5. Extract links
6. Download new URL?
7. Has the URL been seen before?
8. Add URL to frontier

Slide 16: URL Frontier (priority queue)
Most crawlers do breadth-first search from the seeds.
Politeness constraint: don't hammer servers!
– Obvious implementation: "live host table"
– Is this efficient? Will it even fit in memory?
Mercator's politeness:
– One FIFO subqueue per thread
– Choose the subqueue by hashing the host's name
– Dequeue the first URL whose host has NO outstanding requests
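
A minimal sketch of the host-hashed subqueue idea. The number of subqueues and the simple scan-for-an-idle-queue dequeue policy are simplifications of Mercator's actual scheme, shown only to make the hashing trick concrete.

    from collections import deque
    from urllib.parse import urlparse

    NUM_SUBQUEUES = 8                       # assume one per worker thread
    subqueues = [deque() for _ in range(NUM_SUBQUEUES)]
    busy = [False] * NUM_SUBQUEUES          # does this host group have an outstanding request?

    def enqueue(url):
        host = urlparse(url).hostname or ""
        subqueues[hash(host) % NUM_SUBQUEUES].append(url)   # all URLs of a host land in one subqueue

    def dequeue():
        # return a URL whose host group has no outstanding request
        for i, q in enumerate(subqueues):
            if q and not busy[i]:
                busy[i] = True              # mark outstanding until the fetch completes
                return i, q.popleft()
        return None

    def fetch_done(i):
        busy[i] = False                     # fetch finished; this subqueue may be served again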

Slide 17: Fetching Pages
Need to support http, ftp, gopher, …
– Extensible!
Need to fetch multiple pages at once.
Need to cache as much as possible
– DNS
– robots.txt
– Documents themselves (for later processing)
Need to be defensive!
– Time out HTTP connections
– Watch for "crawler traps" (e.g., infinite URL names); see section 5 of the Mercator paper
– Use a URL filter module
– Checkpointing!

Slide 18: (A)synchronous I/O
Problem: network + host latency
– Want to GET multiple URLs at once
Google: single-threaded crawler + asynchronous I/O
Mercator: multi-threaded crawler + synchronous I/O
– Easier to code?
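
A sketch of the multi-threaded, synchronous-I/O style (the Mercator side of the comparison) using only Python's standard library; the URL list and thread count are placeholders, and error handling is omitted.

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url):
        # each worker blocks on synchronous I/O; parallelism comes from the threads
        with urlopen(url, timeout=5) as resp:
            return url, resp.read()

    urls = ["http://example.com/a", "http://example.com/b"]   # placeholder URLs
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, body in pool.map(fetch, urls):
            print(url, len(body), "bytes")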

Slide 19: Duplicate Detection
URL-seen test: has this URL been seen before?
– To save space, store a hash
Content-seen test: different URL, same document
– Suppress link extraction from mirrored pages
What to save for each doc?
– 64-bit "document fingerprint"
– Minimize the number of disk reads upon retrieval
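
One way to get such 64-bit values, as a sketch: truncate a cryptographic hash. This is an illustrative choice, not Mercator's actual fingerprint function.

    import hashlib

    def fingerprint64(data: bytes) -> int:
        """64-bit fingerprint: the first 8 bytes of a SHA-1 digest (illustrative choice)."""
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    url_seen = set()       # hashes of URLs already enqueued
    doc_seen = set()       # fingerprints of document contents already processed

    def is_new_url(url: str) -> bool:
        h = fingerprint64(url.encode("utf-8"))
        if h in url_seen:
            return False
        url_seen.add(h)
        return True

    def is_new_content(page: bytes) -> bool:
        fp = fingerprint64(page)
        if fp in doc_seen:
            return False           # mirrored page: suppress link extraction
        doc_seen.add(fp)
        return True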

Slide 20: Mercator Statistics
[Figure: histogram of document sizes, with exponentially increasing size bins on the x-axis.]
Page type breakdown:
    PAGE TYPE     PERCENT
    text/html     69.2%
    image/gif     17.9%
    image/jpeg     8.1%
    text/plain     1.5%
    pdf            0.9%
    audio          0.4%
    zip            0.4%
    postscript     0.3%
    other          1.4%

Slide 21: Advanced Crawling Issues
Limited resources
– Fetch the most important pages first
Topic-specific search engines
– Only care about pages relevant to the topic: "focused crawling"
Minimize stale pages
– Efficient re-fetch to keep the index timely
– How to track the rate of change for pages?

Slide 22: Focused Crawling
Priority queue instead of FIFO.
How to determine priority?
– Similarity of the page to the driving query (use traditional IR measures)
– Backlinks: how many links point to this page?
– PageRank (Google): some links to this page count more than others
– Forward links of a page
– Location heuristics (e.g., is the site in .edu? does the URL contain 'home'?)
– Linear combination of the above
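
A toy sketch of a priority frontier using a heap, scoring URLs by a linear combination of signals; the weights, feature values, and URLs below are made up for illustration.

    import heapq

    # illustrative weights for a linear combination of priority signals
    WEIGHTS = {"query_sim": 1.0, "backlinks": 0.5, "in_edu": 0.3}

    def priority(url, query_sim, backlinks):
        score = (WEIGHTS["query_sim"] * query_sim
                 + WEIGHTS["backlinks"] * backlinks
                 + WEIGHTS["in_edu"] * (1.0 if ".edu/" in url else 0.0))
        return -score            # heapq is a min-heap, so negate for "largest first"

    frontier = []                # priority queue replacing the FIFO frontier
    heapq.heappush(frontier, (priority("http://example.edu/home", 0.8, 12), "http://example.edu/home"))
    heapq.heappush(frontier, (priority("http://example.com/page", 0.2, 3), "http://example.com/page"))
    _, next_url = heapq.heappop(frontier)    # highest-priority page is fetched first
    print(next_url)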

Slide 23: Review: Precision & Recall
Precision – proportion of selected items that are correct
Recall – proportion of target items that were selected
Precision-Recall curve – shows the tradeoff
[Figure: overlapping sets "system returned these" and "actual relevant docs", partitioning items into tp, fp, fn, tn.]
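
In terms of the tp / fp / fn counts from the figure, these definitions are:

    \text{Precision} = \frac{tp}{tp + fp}, \qquad \text{Recall} = \frac{tp}{tp + fn}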

Slide 24: Review
Vector space representation
– Dot product as similarity metric
TF-IDF for computing weights
– w_ij = f(i,j) * log(N / n_j), where f(i,j) is the frequency of term t_j in document d_i, N is the number of documents, and n_j is the number of documents containing t_j
But how do we process queries efficiently?
[Figure: query q and document d_j as vectors over terms t_1, t_2.]
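
A toy illustration of these weights and the dot-product similarity, under the usual bag-of-words assumption; the three-document corpus is made up.

    import math
    from collections import Counter

    docs = ["the crawler fetches pages",
            "the indexer builds an inverted index",
            "pages link to other pages"]                     # toy corpus
    N = len(docs)
    bags = [Counter(d.split()) for d in docs]
    df = Counter(t for bag in bags for t in bag)             # n_j: docs containing term t_j

    def weight(bag, term):
        return bag[term] * math.log(N / df[term])            # w_ij = f(i,j) * log(N / n_j)

    def score(query, bag):
        q = Counter(query.split())
        return sum(weight(q, t) * weight(bag, t) for t in q if t in bag and t in df)

    for i, bag in enumerate(bags):
        print(i, round(score("inverted index", bag), 2))     # only doc 1 scores above zero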

Slide 25: Thinking about Efficiency
Clock cycle: 2 GHz
– Typically completes 2 instructions / cycle (~10 cycles / instruction, but pipelining & parallel execution)
– Thus: 4 billion instructions / sec
Disk access: 1-10 ms
– Depends on seek distance; published average is 5 ms
– Thus: 200 seeks / sec (and we are ignoring rotation and transfer times)
Disk is 20 million times slower !!!
Store the index in an Oracle database?
Store the index using files and the Unix filesystem?

Slide 26: Index Size over Time
[Figure: number of indexed pages over time, self-reported by the engines. Google: 50% of the web?]

Slide 27: Review (repeats slide 24)
Vector space representation with the dot product as similarity metric; TF-IDF for computing weights. But how do we process queries efficiently?

Slide 28: Efficient Retrieval
Document-term matrix:
          t_1    t_2   ...  t_j   ...  t_m      nf
    d_1   w_11   w_12  ...  w_1j  ...  w_1m     1/|d_1|
    d_2   w_21   w_22  ...  w_2j  ...  w_2m     1/|d_2|
    ...
    d_i   w_i1   w_i2  ...  w_ij  ...  w_im     1/|d_i|
    ...
    d_n   w_n1   w_n2  ...  w_nj  ...  w_nm     1/|d_n|
w_ij is the weight of term t_j in document d_i. Most w_ij's will be zero.

Slide 29: Naïve Retrieval
Consider query q = (q_1, q_2, …, q_j, …, q_m), nf = 1/|q|.
How do we evaluate q? (i.e., compute the similarity between q and every document)
Method 1: compare q with every document directly.
Document data structure:
    d_i : ((t_1, w_i1), (t_2, w_i2), ..., (t_j, w_ij), ..., (t_m, w_im), 1/|d_i|)
– Only terms with positive weights are kept
– Terms are in alphabetic order
Query data structure:
    q : ((t_1, q_1), (t_2, q_2), ..., (t_j, q_j), ..., (t_m, q_m), 1/|q|)

Slide 30: Naïve Retrieval (continued)
Method 1: compare q with documents directly
    initialize all sim(q, d_i) = 0;
    for each document d_i (i = 1, …, n) {
        for each term t_j (j = 1, …, m)
            if t_j appears in both q and d_i
                sim(q, d_i) += q_j * w_ij;
        sim(q, d_i) = sim(q, d_i) * (1/|q|) * (1/|d_i|);
    }
    sort documents in descending similarities;
    display the top k to the user;

Slide 31: Observation
Method 1 is not efficient
– Needs to access most non-zero entries in the doc-term matrix
Solution: use an index (inverted file)
– A data structure that permits fast searching
Like an index in the back of a textbook
– Keywords → page numbers, e.g. "Etzioni, 40, 55, 60-63, 89, 220"
– The keyword is the lexicon entry; the page numbers are the occurrences

Slide 32: Search Processing (Overview)
1. Lexicon search – e.g. looking in the index to find the entry
2. Retrieval of occurrences – seeing where the term occurs
3. Manipulation of occurrences – going to the right page

Slide 33: Index
A file is a list of words by position
– First entry is the word in position 1 (first word)
– Entry 4562 is the word in position 4562 (4562nd word)
– Last entry is the last word
An inverted file is a list of positions by word!
Inverted file (aka "index") for the text of this slide:
    a          (1, 4, 40)
    entry      (11, 20, 31)
    file       (2, 38)
    list       (5, 41)
    position   (9, 16, 26)
    positions  (44)
    word       (14, 19, 24, 29, 35, 45)
    words      (7)
    4562       (21, 27)

Slide 34: Inverted Files for Multiple Documents
[Figure: lexicon plus occurrence index; each posting stores DOCID, OCCUR (count), POS 1, POS 2, …]
Example: "jezebel" occurs 6 times in document 34, 3 times in document 44, 4 times in document …
This is one method; Alta Vista uses an alternative …

Slide 35: Many Variations Possible
Address space (flat, hierarchical)
Record term-position information
Precalculate TF-IDF info
Store header, font & tag info
Compression strategies

Slide 36: Using Inverted Files
Several data structures:
1. For each term t_j, create a list (inverted file list) that contains all document ids that have t_j:
    I(t_j) = { (d_1, w_1j), (d_2, w_2j), …, (d_i, w_ij), …, (d_n, w_nj) }
– d_i is the document id number of the ith document
– Weights come from the frequency of the term in the doc
– Only entries with non-zero weights should be kept

Slide 37: Inverted Files (continued)
More data structures:
2. Normalization factors of documents are pre-computed and stored in an array: nf[i] stores 1/|d_i|

Slide 38: Inverted Files (continued)
More data structures:
3. Lexicon: a hash table for all terms in the collection, mapping t_j → pointer to I(t_j)
– Inverted file lists are typically stored on disk
– The number of distinct terms is usually very large
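
A minimal in-memory sketch of these three structures, assuming raw term frequencies as the weights w_ij (the TF-IDF weighting from slide 24 could be substituted); the toy documents are made up.

    import math
    from collections import Counter, defaultdict

    docs = {1: "a text about crawlers",
            2: "a text about inverted files",
            3: "inverted files on disk"}                     # toy collection

    inverted = defaultdict(list)    # lexicon: term -> I(t_j), the inverted file list
    nf = {}                         # nf[i] = 1/|d_i|

    for doc_id, text in docs.items():
        freqs = Counter(text.split())                # w_ij = term frequency (simplification)
        nf[doc_id] = 1.0 / math.sqrt(sum(w * w for w in freqs.values()))
        for term, w in freqs.items():
            inverted[term].append((doc_id, w))       # only non-zero weights are stored

    print(inverted["inverted"])      # e.g. [(2, 1), (3, 1)]
    print(round(nf[1], 3))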

Slide 39: Retrieval Using Inverted Files
    initialize all sim(q, d_i) = 0
    for each term t_j in q
        find I(t_j) using the hash table
        for each (d_i, w_ij) in I(t_j)
            sim(q, d_i) += q_j * w_ij
    for each document d_i
        sim(q, d_i) = sim(q, d_i) * nf[i]
    sort documents in descending similarities and display the top k to the user
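
The same procedure in runnable form, reusing the `inverted` and `nf` structures from the sketch above; query term weights q_j are taken as raw frequencies here, and the 1/|q| factor is dropped since it does not change the ranking.

    from collections import Counter, defaultdict

    def retrieve(query, inverted, nf, k=10):
        q = Counter(query.split())                        # (t_j, q_j) pairs
        sim = defaultdict(float)                          # accumulators, all start at 0
        for term, q_j in q.items():
            for doc_id, w_ij in inverted.get(term, []):   # I(t_j) via the hash table
                sim[doc_id] += q_j * w_ij
        for doc_id in sim:
            sim[doc_id] *= nf[doc_id]                     # apply document normalization
        return sorted(sim.items(), key=lambda x: x[1], reverse=True)[:k]

    print(retrieve("inverted files", inverted, nf))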

Slide 40: Observations about Method 2
If doc d doesn't contain any term of query q, then d won't be considered when evaluating q.
Only non-zero entries in the columns of the document-term matrix that correspond to query terms are used to evaluate the query.
Computes the similarities of multiple documents simultaneously (w.r.t. each query word).

Slide 41: Efficient Retrieval
Example (Method 2): Suppose
    q  = { (t1, 1), (t3, 1) },                    1/|q| = 0.707
    d1 = { (t1, 2), (t2, 1), (t3, 1) },           nf[1] = 0.408
    d2 = { (t2, 2), (t3, 1), (t4, 1) },           nf[2] = 0.408
    d3 = { (t1, 1), (t3, 1), (t4, 1) },           nf[3] = 0.577
    d4 = { (t1, 2), (t2, 1), (t3, 2), (t4, 2) },  nf[4] = 0.277
    d5 = { (t2, 2), (t4, 1), (t5, 2) },           nf[5] = 0.333
    I(t1) = { (d1, 2), (d3, 1), (d4, 2) }
    I(t2) = { (d1, 1), (d2, 2), (d4, 1), (d5, 2) }
    I(t3) = { (d1, 1), (d2, 1), (d3, 1), (d4, 2) }
    I(t4) = { (d2, 1), (d3, 1), (d4, 1), (d5, 1) }
    I(t5) = { (d5, 2) }

Slide 42: Efficient Retrieval (continued)
(Same q, documents, and inverted lists as the previous slide.)
After t1 is processed:
    sim(q, d1) = 2, sim(q, d2) = 0, sim(q, d3) = 1, sim(q, d4) = 2, sim(q, d5) = 0
After t3 is processed:
    sim(q, d1) = 3, sim(q, d2) = 1, sim(q, d3) = 2, sim(q, d4) = 4, sim(q, d5) = 0
After normalization:
    sim(q, d1) = .87, sim(q, d2) = .29, sim(q, d3) = .82, sim(q, d4) = .78, sim(q, d5) = 0

Slide 43: Efficiency versus Flexibility
Storing computed document weights is good for efficiency but bad for flexibility.
– Recomputation is needed if the TF and IDF formulas change and/or the TF and DF information changes.
Flexibility is improved by storing raw TF and DF information, but efficiency suffers.
A compromise:
– Store pre-computed TF weights of documents.
– Apply IDF weights to the query term TF weights instead of to the document term TF weights.

Slide 44: How Inverted Files are Created
[Pipeline diagram: Crawler → Repository → Scan → Forward Index → Sort → Sorted Index → Scan → { NF (docs), Lexicon, Inverted File List (with ptrs to docs) }]

Slide 45: Creating Inverted Files — Repository
[Same pipeline diagram, highlighting the Repository stage.]
Repository: file containing all documents downloaded
– Each doc has a unique ID
– A pointer file maps from IDs to the start of each doc in the repository

Slide 46: Creating Inverted Files — Scan
[Same pipeline diagram, highlighting the first Scan stage.]
NF: length of each document
Forward index: for each document, the terms it contains (with positions)

Slide 47: Creating Inverted Files — Sort
[Same pipeline diagram, highlighting the Sort stage.]
Sorted index (with positional info as well)

Slide 48: Creating Inverted Files — Lexicon & Inverted File List
[Same pipeline diagram, highlighting the final Scan stage, which produces the Lexicon and the Inverted File List; each posting holds DOCID, OCCUR, POS 1, POS 2, …, plus ptrs to docs.]
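
A toy, in-memory illustration of this sort-based pipeline (the real construction runs out of core); the document texts and IDs are made up.

    from itertools import groupby

    repository = {1: "a text about crawlers", 2: "crawlers fetch a text"}   # toy repository

    # Scan: emit forward-index records (doc_id, term, position)
    forward = [(doc_id, term, pos)
               for doc_id, text in repository.items()
               for pos, term in enumerate(text.split(), start=1)]

    # Sort: reorder records by term (then doc, then position)
    records = sorted(forward, key=lambda r: (r[1], r[0], r[2]))

    # Scan again: group into the lexicon and inverted file list with positions
    inverted_file = {}
    for term, group in groupby(records, key=lambda r: r[1]):
        postings = {}
        for doc_id, _, pos in group:
            postings.setdefault(doc_id, []).append(pos)      # DOCID -> [POS 1, POS 2, ...]
        inverted_file[term] = postings

    print(inverted_file["crawlers"])    # {1: [4], 2: [1]}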

Slide 49: The Lexicon
Grows slowly (Heaps' law)
– O(n^β) where n = text size; β is a constant, ~0.4 – 0.6
– E.g. for a 1 GB corpus, lexicon ≈ 5 MB
– Can reduce with stemming (Porter algorithm)
Store the lexicon in a file in lexicographic order
– Each entry points to its location in the occurrence file (aka inverted file list)

Slide 50: Construction
Build a trie (or hash table) over the text:
    "This is a text. A text has many words. Words are made from letters."
Occurrence lists:
    letters: 60
    made: 50
    many: 28
    text: 11, 19
    words: 33, 40
[Figure: the corresponding trie, branching on letters l, m, t, w.]
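
A short sketch reproducing these occurrence lists, assuming positions are 1-based character offsets of each word (which matches the numbers on the slide); a plain dict stands in for the trie.

    import re
    from collections import defaultdict

    text = "This is a text. A text has many words. Words are made from letters."

    occurrences = defaultdict(list)          # word -> list of 1-based character offsets
    for match in re.finditer(r"[A-Za-z]+", text):
        occurrences[match.group().lower()].append(match.start() + 1)

    for word in ("letters", "made", "many", "text", "words"):
        print(word + ":", occurrences[word])
    # text: [11, 19]   words: [33, 40]   made: [50]   many: [28]   letters: [60]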

Slide 51: Memory Too Small?
Merging
– When a word is shared by two partial lexicons
– Concatenate their occurrence lists
– O(n1 + n2)
Overall complexity
– O(n log(n/M)), where M is the amount of available memory

Slide 52: Stop Lists
Language-based stop list:
– words that bear little meaning
Subject-dependent stop lists
Removing stop words
– From the document
– From the query
(From Peter Brusilovsky, Univ. of Pittsburgh, INFSCI 2140)

Slide 53: Stemming
Are these different index terms?
– retrieve, retrieving, retrieval, retrieved, retrieves…
Stemming algorithm:
– (retrieve, retrieving, retrieval, retrieved, retrieves) → retriev
– Strips suffixes (-s, -ed, -ly, -ness)
– Morphological stemming

Slide 54: Stemming (continued)
Can reduce vocabulary by ~1/3
Implementations exist in C, Java, Perl, Python, C#
Criterion for removing a suffix:
– Does "a document is about w1" mean the same as "a document about w2"?
Problems: sand / sander & wand / wander
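
A deliberately naive suffix stripper just to make the idea concrete; this is NOT the Porter algorithm (which applies ordered rule sets with measure conditions), and the suffix list is an arbitrary illustration.

    SUFFIXES = ("ing", "ness", "ed", "er", "ly", "es", "s")   # illustrative list

    def crude_stem(word):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:   # keep a minimal stem
                return word[:-len(suffix)]
        return word

    print([crude_stem(w) for w in ["retrieves", "retrieved", "retrieving"]])
    # ['retriev', 'retriev', 'retriev']
    print(crude_stem("sander"), crude_stem("wander"))
    # sand wand  -- "wander" wrongly conflates with "wand": the slide's problem case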

Slide 55: Compression
What should we compress?
– Repository
– Lexicon
– Inverted index
What properties do we want?
– Compression ratio
– Compression speed
– Decompression speed
– Memory requirements
– Pattern matching on compressed text
– Random access

Slide 56: Inverted File Compression
Each inverted list has the form ⟨ f_t; d_1, d_2, …, d_{f_t} ⟩ — the ids of the documents containing term t.
A naïve representation costs roughly ⌈log N⌉ bits per document pointer.
The list can also be stored as ⟨ f_t; d_1, d_2 − d_1, d_3 − d_2, … ⟩; each difference is called a d-gap.
The trick is the encoding: small d-gaps can be coded in few bits, even though in the worst case a gap still needs about ⌈log N⌉ bits.
Assume the d-gap representation for the rest of the talk, unless stated otherwise.
(Slides adapted from Tapas Kanungo and David Mount, Univ. of Maryland)
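
A sketch of the d-gap idea combined with a simple variable-byte code. This is just one of many possible codes; the Golomb coding mentioned later in the slides is not shown here.

    def to_dgaps(doc_ids):
        """Store the first id, then differences between consecutive ids."""
        return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

    def from_dgaps(gaps):
        ids, total = [], 0
        for g in gaps:
            total += g
            ids.append(total)
        return ids

    def vbyte_encode(n):
        """Variable-byte code: 7 data bits per byte, high bit marks the last byte."""
        out = bytearray()
        while n >= 128:
            out.append(n % 128)
            n //= 128
        out.append(n + 128)
        return bytes(out)

    postings = [3, 7, 11, 23, 29, 127, 1000]
    gaps = to_dgaps(postings)                    # [3, 4, 4, 12, 6, 98, 873]
    encoded = b"".join(vbyte_encode(g) for g in gaps)
    print(gaps, len(encoded), "bytes")           # small gaps take one byte each
    assert from_dgaps(gaps) == postings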

Slide 57: Text Compression
Two classes of text compression methods:
Symbolwise (or statistical) methods
– Estimate probabilities of symbols (modeling step)
– Code one symbol at a time (coding step)
– Use shorter codes for the most likely symbols
– Usually based on either arithmetic or Huffman coding
Dictionary methods
– Replace fragments of text with a single code word, typically an index to an entry in the dictionary
– E.g. Ziv-Lempel coding: replaces strings of characters with a pointer to a previous occurrence of the string
– No probability estimates needed
Symbolwise methods are better suited for coding d-gaps.

Slide 58: Classifying d-gap Compression Methods
Global: each list is compressed using the same model
– Non-parameterized: the probability distribution for d-gap sizes is predetermined
– Parameterized: the probability distribution is adjusted according to certain parameters of the collection
Local: the model is adjusted according to some parameter, like the frequency of the term
By definition, local methods are parameterized.

Slide 59: Conclusion
Local methods are best.
Parameterized global models ~ non-parameterized ones
– Pointers are not scattered randomly in the file
In practice, the best index compression algorithm is the local Bernoulli method (using Golomb coding).
Compressed inverted indices are usually faster and smaller than
– Signature files
– Bitmaps
Compressed size: Local < Parameterized Global < Non-parameterized Global — but not by much.