Published by Rudolf Roy Clarke. Modified over 9 years ago.
1
An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto
2
Search Engines
3
What Are They?
Tools for finding information on the Web
-Problem: “hidden” databases, e.g. New York Times (i.e., databases hosted by the web site itself. These cannot be accessed by Yahoo, Google, etc.)
Based on a machine-constructed index of Web contents (usually contains keywords found in the documents)
Directory of search engines: www.searchenginecolossus.com
Search engine statistics: www.searchengineshowdown.com, www.searchenginewatch.com
4
What They Do
1. Acquire the document collection, e.g., web documents (off-line)
2. Create and save an inverted index (off-line)
3. Match queries to documents (on-line; the actual retrieval)
4. Present the results to user (on-line; may include summarization, extraction, translation)
5
Typical Architecture
Spider
-Crawls the web to find pages by following hyperlinks
-Ongoing process; never catches up
Indexer
-Produces the data structures for fast searching of all words in the pages (i.e., it updates the lexicon)
Retrieval System
-User interface and query language
-Performs database lookup to find documents likely to be relevant
-Document “relevance” based on a ranking heuristic
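The spider's "following hyperlinks" step is a graph traversal. The sketch below is a minimal illustration, assuming a breadth-first strategy and a toy in-memory link graph in place of real HTTP fetching; `crawl`, `TOY_WEB`, and the page names are hypothetical and not taken from the slides.

```python
from collections import deque

def crawl(link_graph, seed):
    """Breadth-first traversal of a link graph, starting at `seed`.

    `link_graph` maps each page to the list of pages it links to.
    Returns the pages in the order they were visited.
    """
    visited = {seed}          # pages already discovered
    order = []                # visit order, for inspection
    frontier = deque([seed])  # FIFO queue of pages still to process
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in link_graph.get(page, []):
            if target not in visited:
                visited.add(target)
                frontier.append(target)
    return order

# Hypothetical toy web: keys are pages, values are outgoing hyperlinks.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
    "e.html": ["a.html"],  # nothing links here, so a crawl from a.html never finds it
}

print(crawl(TOY_WEB, "a.html"))
# prints ['a.html', 'b.html', 'c.html', 'd.html']
```

Note that `e.html` is never reached: a spider only finds pages that are linked, directly or indirectly, from its starting points.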
6
Did you know?
The concept of a Web spider was developed by Dr. Fuzzy Mauldin
Implemented in 1994 on the Web
Went into the creation of Lycos
Tangible evidence of commercial success: Newell-Simon Hall
Dr. Michael L. (Fuzzy) Mauldin
7
Did you know?
Developed here at CMU by Prof. Raul Valdes-Perez and a group of graduate students in 2000
Queries other web search engines and clusters documents into categories based on content
8
A look at Google
10,000+ Linux servers!
Supports searches in 104 different languages
Receives millions of searches per day
Spiders and indexes over 8 billion documents (updated monthly), encompassing HTML and 12 other file formats (e.g., *.pdf, *.ps, *.doc)
PageRank algorithm estimates “importance” based on link counts
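The slide only says that PageRank estimates importance from link counts. As a rough illustration, the sketch below uses the commonly published power-iteration formulation on a toy graph. The damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the graph itself are all assumptions, and this is not a description of the production system.

```python
def pagerank(link_graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `link_graph` maps each page to the list of pages it links to;
    every link target must also appear as a key.
    """
    pages = list(link_graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in link_graph.items():
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: "c" is linked from three pages, "d" from none.
TOY_GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

ranks = pagerank(TOY_GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph `c` ends up ranked highest and `d` lowest, which matches the intuition that pages with more (and better-ranked) inbound links are treated as more important.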
9
Google’s server farm
10
Why Spider the Web?
User Perceptions
Most annoying: Search engine finds nothing (too small an index; less of an issue since 1997 or so)
Somewhat annoying: Obsolete links
-Must regularly identify and delete dead links (Google also caches many pages)
-Done every 1-2 weeks in best engines
Mildly annoying: Failure to find new site
-Re-spider “entire” web
-Done every 2-4 weeks in best engines
11
Cost of Spidering
Semi-parallel algorithmic decomposition
-Spider can (and does) run on hundreds of servers simultaneously
-Very high network connectivity
-Servers can migrate from spidering to query processing depending on time-of-day load
Running a full web spider takes days even with hundreds of dedicated servers
12
Current Status of Web Spiders
Enhanced Spidering
-Link counts for pages can be established during spidering
Unsolved Problems
-Most spidering re-traverses a stable web graph; how to do on-demand re-spidering when changes occur?
-Achieving complete or near-complete coverage is still a major issue
-Cannot spider information stored in local databases
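To illustrate how link counts can be established during spidering, here is a minimal sketch that tallies inbound links while traversing a toy in-memory graph. The function name, the graph, and the choice to count each linking page once are assumptions made for the example.

```python
from collections import Counter, deque

def crawl_with_link_counts(link_graph, seed):
    """Breadth-first crawl that also counts inbound links per page.

    Only links found on pages the crawl actually reaches are counted.
    """
    visited = {seed}
    frontier = deque([seed])
    inlinks = Counter()
    while frontier:
        page = frontier.popleft()
        # set() so that a page linking twice to the same target counts once
        for target in set(link_graph.get(page, [])):
            inlinks[target] += 1
            if target not in visited:
                visited.add(target)
                frontier.append(target)
    return inlinks

# Hypothetical toy web; e.html is not reachable from a.html.
TOY_LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
    "e.html": ["a.html"],
}

counts = crawl_with_link_counts(TOY_LINKS, "a.html")
print(sorted(counts.items()))
# prints [('a.html', 1), ('b.html', 1), ('c.html', 2), ('d.html', 1)]
```

The link from `e.html` to `a.html` is missed because the crawl never discovers `e.html`, a small-scale version of the coverage problem listed above.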
13
An Inverted Index
Data structure to permit fast searching
[Figure: LEXICON and TERM INDEX; each index entry has fields DOCID, OCCUR, POS 1, POS 2, ...]
Example: “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, ...
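A minimal sketch of such a structure, assuming whitespace tokenization and a plain dictionary as the lexicon: each term maps to the documents containing it, together with the word positions (the occurrence count is the length of the position list). The document IDs echo the slide, but the texts are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to {doc_id: [positions where the term occurs]}.

    `documents` maps a document ID to its text.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

# Hypothetical mini-collection.
DOCS = {
    34: "jezebel met ahab and jezebel ruled",
    44: "ahab ruled",
    56: "the story of jezebel",
}

index = build_inverted_index(DOCS)
for doc_id, positions in sorted(index["jezebel"].items()):
    print(doc_id, len(positions), positions)
# prints:
# 34 2 [0, 4]
# 56 1 [3]
```

Looking up a term is a single dictionary access, so query time depends on the number of matching documents rather than on the size of the whole collection.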
14
Ranking (Scoring) Documents
Must display “hits” in some order... how to choose?
-e.g., “relevance”, recency, popularity, reliability
Some ranking heuristics
-Presence of search terms in title of document
-Proximity of search terms to start of document
-Search term occurrences within a document and the inverse frequency of a search term in a collection (common terms given less weight)
-Link popularity (how many pages point to this one)
Challenges
-User queries often provide very limited information
-Tradeoff exists between precision and recall
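The heuristic that combines term occurrences with inverse collection frequency is usually called TF-IDF. The sketch below shows one simple variant, assuming raw term counts, idf = log(N / document frequency), and whitespace tokenization; the function name and sample documents are hypothetical, and real engines combine many more signals.

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    """Rank documents for `query` with a basic TF-IDF score.

    Returns (doc_id, score) pairs, highest score first.
    """
    tokenized = {d: text.lower().split() for d, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once.
            doc_freq = sum(1 for w in tokenized.values() if term in w)
            if doc_freq == 0:
                continue  # term appears nowhere in the collection
            idf = math.log(n_docs / doc_freq)  # common terms get less weight
            score += counts[term] * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical mini-collection.
SAMPLE_DOCS = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "the bird sang",
}

ranked = tf_idf_scores("the dog", SAMPLE_DOCS)
print(ranked[0][0])  # prints 2
```

Because "the" occurs in every document, its idf is log(1) = 0 and it contributes nothing; only "dog" separates document 2 from the others.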
15
Search Engine Sizes
Source: www.searchenginewatch.com
Legend: AV = Altavista, FAST = FAST, GG = Google, INK = Inktomi, NL = Northern Light