Inside Internet Search Engines: Spidering and Indexing
Jan Pedersen and William Chang
SIGIR '99

Basic Architectures
[Architecture diagram: a Browser issues ~20M queries/day against a 24x7 Search Engine; the Spider crawls the Web (800M pages?) and builds the Index; queries are recorded in a Log. Annotations: Spam, Freshness, Quality results.]

Basic Algorithm
(1) Pick a URL from the pending queue and fetch it
(2) Parse the document and extract its hrefs
(3) Place unvisited URLs on the pending queue
(4) Index the document
(5) Go to (1)

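A minimal, single-process sketch of this loop, using only the Python standard library; the `pending`, `visited`, and `index` structures are illustrative stand-ins, not the architecture of any particular engine:

```python
# Minimal single-process sketch of the basic spidering loop.
# All names here (pending, visited, index) are illustrative; a real
# spider distributes this loop and persists its queues (see later slides).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class HrefExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def spider(seed_urls, max_pages=100):
    pending = deque(seed_urls)          # pending queue
    visited = set()
    index = {}                          # url -> raw document
    while pending and len(visited) < max_pages:
        url = pending.popleft()         # (1) pick URL and fetch
        if url in visited:
            continue
        visited.add(url)
        try:
            doc = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue
        parser = HrefExtractor()        # (2) parse and extract hrefs
        parser.feed(doc)
        for href in parser.hrefs:       # (3) enqueue unvisited URLs
            absolute = urljoin(url, href)
            if absolute not in visited:
                pending.append(absolute)
        index[url] = doc                # (4) index the document
    return index                        # (5) loop back via while
```

Using a deque and popleft gives breadth-first behavior; swapping in a stack would give depth-first, which is exactly the queue-maintenance point of the next slide.
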
Issues
- Queue maintenance determines behavior
  - Depth vs. breadth
- Spidering can be distributed, but queues must be shared (one scheme sketched below)
- URLs must be revisited
  - Status tracked in a database
- Revisit rate determines freshness
  - Search engines typically revisit every URL monthly

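One way to share queues across distributed spiders (an assumed scheme, not one described on the slide) is to partition URLs by host hash so each worker owns a disjoint slice of the frontier:

```python
# Hypothetical sketch of sharing a URL queue across N spider workers by
# hashing the host: no two workers fetch the same URL, and per-host
# politeness is easy to enforce because one worker owns each host.
# This partitioning scheme is an assumption, not a design from the talk.
import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 8

def owner(url: str) -> int:
    """Return the worker id responsible for this URL's host."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_WORKERS

def route(urls, queues):
    """Place each discovered URL on the queue of the worker that owns it."""
    for url in urls:
        queues[owner(url)].append(url)
```
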
Deduping
- Many URLs point to the same page
  - DNS aliasing
- Many pages are identical
  - Site mirroring
- How big is my index, really?

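A hedged sketch of the simplest deduping tactic: fingerprint normalized page content so DNS aliases and mirrors collapse to a single entry. The normalization and hash choice are assumptions, and real systems also need near-duplicate detection (e.g., shingling):

```python
# Hash normalized page content so mirrored or aliased copies collapse
# to one entry. SHA-1 and the whitespace normalization are illustrative.
import hashlib

def content_key(doc: str) -> str:
    """Fingerprint a page by its whitespace-normalized, lowercased text."""
    normalized = " ".join(doc.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def dedupe(pages):
    """pages: iterable of (url, doc). Keep one URL per distinct content."""
    seen = {}
    for url, doc in pages:
        seen.setdefault(content_key(doc), url)  # first URL wins
    return seen  # content fingerprint -> canonical URL
```
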
Smart Spidering
- Revisit rate based on modification history
  - Rapidly changing documents visited more often
  - Revisit queues divided by priority
- Acceptance criteria based on quality
  - Only index quality documents
  - Determined algorithmically

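A sketch of modification-history-driven scheduling under an assumed halve-or-double heuristic; the daily and monthly bounds are illustrative, not figures from the talk:

```python
# Pages observed to change get revisited sooner; stable pages drift
# toward a slow queue. The policy and bounds below are assumptions.
import heapq
import time

MIN_INTERVAL = 1 * 86400    # revisit fast movers daily (assumed bound)
MAX_INTERVAL = 30 * 86400   # revisit stable pages monthly (assumed bound)

def next_interval(previous: float, changed: bool) -> float:
    """Shrink the revisit interval when the page changed, grow it when not."""
    interval = previous / 2 if changed else previous * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))

def schedule(queue, url, interval):
    """queue is a heap of (due_time, url): earliest-due pages come first."""
    heapq.heappush(queue, (time.time() + interval, url))
```
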
Spider Equilibrium
- URL queues do not increase in size
- New documents are discovered and indexed
  - Spider keeps up with the desired revisit rate
  - Index drifts upward in size
- At equilibrium the index is Everyday Fresh
  - As if every page were revisited every day
  - Requires ~10% daily revisit rates, on average

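A back-of-the-envelope check on the ~10% figure, under an assumed mix of change rates (the split below is invented for illustration): revisiting each page about as often as it changes catches changes roughly as well as daily full revisits would:

```python
# Assumed (not from the talk): a small fraction of pages change daily,
# the rest roughly monthly. Revisit each class at its own change rate.
daily_changers = 0.07        # assumed fraction of pages changing daily
monthly_changers = 0.93      # assumed fraction changing ~monthly

# Daily changers revisited every day, monthly changers every 30 days:
avg_daily_revisit_rate = daily_changers * 1.0 + monthly_changers / 30
print(f"{avg_daily_revisit_rate:.1%}")   # ~10.1% of the index per day
```
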
Computational Constraints
- Equilibrium requires increasing resources
  - Yet total disk space is a system constraint
- Strategies for dealing with space constraints
  - Simple refresh: only revisit known URLs
  - Prune URLs via stricter acceptance criteria
  - Buy more disk

Special Collections
- Newswire
- Newsgroups
- Specialized services (Deja)
- Information extraction
  - Shopping catalogs
  - Events, recipes, etc.

The Hidden Web
- Non-indexable content
  - Behind passwords, firewalls
  - Dynamic content
- Often searchable through a local interface
- A network of distributed search resources
  - How to access? Ask Jeeves!

Spam
- Manipulation of content to affect ranking
  - Bogus meta tags
  - Hidden text (see the sketch below)
  - Jump pages tuned for each search engine
- Add URL is a spammer's tool
  - 99% of submissions are spam
- It's an arms race

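To make the hidden-text trick concrete, here is a toy detector for 1999-era HTML that flags <font> text colored to match the page background; the heuristic is purely illustrative and not attributed to any engine:

```python
# Flag text whose <font> color equals the <body> bgcolor: a classic
# hidden-text spam pattern. A toy check, not a production spam filter.
import re

def has_hidden_text(html: str) -> bool:
    body = re.search(r'<body[^>]*bgcolor=["\']?(#?\w+)', html, re.I)
    if not body:
        return False
    background = body.group(1).lower()
    for font in re.finditer(r'<font[^>]*color=["\']?(#?\w+)', html, re.I):
        if font.group(1).lower() == background:
            return True
    return False
```
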
Representation
- For precision, indices must support phrases
  - Phrases make the best use of short queries
  - The web is precision-biased
- Document location is also important
  - Title vs. summary vs. body
- Meta tags offer a special challenge
  - To index or not?

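A hedged sketch of a positional, field-aware index in which each posting records (document, field, position), supporting both phrase queries and title-vs-body weighting; the field names and structures are assumptions:

```python
# Each posting is (doc_id, field, position): positions enable phrase
# matching, and the field tag lets ranking weight title hits over body.
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: {"title": str, "body": str}} (assumed layout)."""
    index = defaultdict(list)   # term -> [(doc_id, field, position), ...]
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for pos, term in enumerate(text.lower().split()):
                index[term].append((doc_id, field, pos))
    return index

def phrase_hits(index, phrase):
    """Docs/fields where the phrase's terms occur at adjacent positions."""
    terms = phrase.lower().split()
    postings = [set(index.get(t, [])) for t in terms]
    hits = set()
    for d, f, p in postings[0]:
        if all((d, f, p + i) in postings[i] for i in range(1, len(terms))):
            hits.add((d, f))
    return hits
```
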
Indexing Tricks
- Inverted indices are non-incremental
  - Designed for compactness and high-speed access
  - Updated through merging with new indices
- Indices can be huge
  - Minimize copying
  - Use RAID for speed and reliability

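A sketch of the merge-based update the slide alludes to: build a small index over the new documents, then combine it with the old index in one sequential pass. The in-memory sorted-dict representation stands in for compressed on-disk posting files:

```python
# Instead of inserting postings in place, merge an old and a new index
# term by term with a single sorted pass per posting list.
import heapq

def merge_indices(old, new):
    """old, new: {term: sorted list of doc ids} -> merged index."""
    merged = {}
    for term in sorted(set(old) | set(new)):
        merged[term] = list(heapq.merge(old.get(term, []),
                                        new.get(term, [])))
    return merged
```

Merging sequentially keeps access patterns streaming and minimizes copying, which is exactly what matters when indices are huge.
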
Truncation
- Search engines do not store all postings
  - How could they?
- Tuned to return 10 good hits quickly
- Boolean queries are evaluated conservatively
  - Negation is a particular problem
- Some measurement methods depend on strong queries: how accurate can they be?

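One plausible reading of truncation, sketched under assumptions: cap each term's posting list at the K highest-quality documents, enough to return 10 good hits quickly. A side effect is that Boolean evaluation, and negation especially, becomes approximate, since a missing posting may simply have been truncated:

```python
# Keep only the K "best" postings per term by a per-document quality
# score. K and the score are assumptions; actual cutoff policies are
# not described on the slide.
import heapq

K = 1000  # assumed per-term posting cap

def truncate(postings, quality):
    """postings: {term: [doc_id, ...]}; quality: callable doc_id -> score.
    Returns an index holding at most K highest-quality docs per term."""
    return {
        term: heapq.nlargest(K, docs, key=quality)
        for term, docs in postings.items()
    }
```
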
The Role of NLP
- Many search engines do not stem
  - Precision bias suggests conservative term treatment
- What about non-English documents?
  - N-grams are popular for Chinese
  - Language ID, anyone?

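A minimal character-n-gram language identifier in the spirit of the closing bullet; the toy training snippets are invented, and real profiles are trained on large corpora (Cavnar & Trenkle-style n-gram ranking is the classic approach):

```python
# Identify a document's language by comparing its character trigram
# counts against tiny per-language profiles. The training strings are
# made-up placeholders; real systems train on large corpora.
from collections import Counter

def ngrams(text, n=3):
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

PROFILES = {  # assumed toy training data
    "en": ngrams("the quick brown fox jumps over the lazy dog"),
    "fr": ngrams("le renard brun rapide saute par-dessus le chien"),
    "de": ngrams("der schnelle braune fuchs springt ueber den hund"),
}

def identify(text):
    """Pick the profile sharing the most n-gram mass with the input."""
    doc = ngrams(text)
    return max(PROFILES, key=lambda lang: sum((doc & PROFILES[lang]).values()))
```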