SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000
Last Time l Web Search –Directories vs. Search engines –How web search differs from other search »Type of data searched over »Type of searches done »Type of searchers doing search –Web queries are short »This probably means people are often using search engines to find starting points »Once at a useful site, they must follow links or use site search –Web search ranking combines many features
What about Ranking? l Lots of variation here –Pretty messy in many cases –Details usually proprietary and fluctuating l Combining subsets of: –Term frequencies –Term proximities –Term position (title, top of page, etc) –Term characteristics (boldface, capitalized, etc) –Link analysis information –Category information –Popularity information l Most use a variant of vector space ranking to combine these l Here’s how it might work: –Make a vector of weights for each feature –Multiply this by the counts for each feature
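A minimal sketch of the "vector of weights times feature counts" idea above. The feature names and weight values here are hypothetical; real engines keep theirs proprietary and change them often.

```python
# Hypothetical weights, one per ranking feature (not taken from any real engine).
WEIGHTS = {
    "term_frequency": 1.0,
    "title_match": 3.0,
    "proximity": 2.0,
    "inlinks": 0.5,
}

def score(feature_counts):
    """Dot product of the weight vector and one page's feature counts."""
    return sum(WEIGHTS[name] * feature_counts.get(name, 0) for name in WEIGHTS)

def rank(pages):
    """Return page ids sorted from highest to lowest score."""
    return sorted(pages, key=lambda page_id: score(pages[page_id]), reverse=True)

# Made-up feature counts for two pages and one query.
pages = {
    "a.html": {"term_frequency": 5, "title_match": 0, "proximity": 1, "inlinks": 2},
    "b.html": {"term_frequency": 2, "title_match": 1, "proximity": 2, "inlinks": 10},
}
print(rank(pages))  # ['b.html', 'a.html']
```

Page b.html wins despite a lower term frequency, because the title match and incoming links carry more weight in this made-up setting.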
From description of the NorthernLight search engine, by Mark Krellenstein
High-Precision Ranking Proximity search can help get high-precision results if the query has more than one term –Hearst ’96 paper: »Combine Boolean and passage-level proximity »Shows significant improvements when retrieving top 5, 10, 20, 30 documents »Results reproduced by Mitra et al. ’98 »Google uses something similar
Boolean Formulations, Hearst 96 Results
Spam l Spam: –Undesired content l Web Spam: –Content is disguised as something it is not, in order to »Be retrieved more often than it otherwise would »Be retrieved in contexts that it otherwise would not be retrieved in
Web Spam l What are the types of Web spam? –Add extra terms to get a higher ranking »Repeat “cars” thousands of times –Add irrelevant terms to get more hits »Put a dictionary in the comments field »Put extra terms in the same color as the background of the web page –Add irrelevant terms to get different types of hits »Put “sex” in the title field in sites that are selling cars –Add irrelevant links to boost your link analysis ranking l There is a constant “arms race” between web search companies and spammers
Commercial Issues General internet search is often commercially driven –Commercial sector sometimes hides things – harder to track than research –On the other hand, most CTOs for search engine companies used to be researchers, and so help us out –Commercial search engine information changes monthly –Sometimes motivations are commercial rather than technical »Goto.com uses payments to determine ranking order »iwon.com gives out prizes
Web Search Architecture
l Preprocessing –Collection gathering phase »Web crawling –Collection indexing phase l Online –Query servers –This part not talked about in the readings
From description of the FAST search engine, by Knut Risvik
Standard Web Search Engine Architecture [Diagram: crawl the web → check for duplicates, store the documents (DocIds) → create an inverted index → inverted index → search engine servers; a user query goes to the search engine servers, which show results to the user]
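As a rough illustration of the "create an inverted index" step, here is a toy in-memory version. The documents and the whitespace tokenizer are assumptions for the example; a real engine also stores word positions and works at far larger scale.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return doc ids containing every query term (Boolean AND)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Made-up documents keyed by DocId.
docs = {1: "used cars for sale", 2: "new cars and trucks", 3: "web search engines"}
index = build_inverted_index(docs)
print(search(index, "cars"))      # [1, 2]
print(search(index, "new cars"))  # [2]
```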
More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.
Inverted Indexes for Web Search Engines l Inverted indexes are still used, even though the web is so huge l Some systems partition the indexes across different machines; each machine handles different parts of the data l Other systems duplicate the data across many machines; queries are distributed among the machines l Most do a combination of these
From description of the FAST search engine, by Knut Risvik. In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries. Each row can handle 120 queries per second. Each column can handle 7M pages. To handle more queries, add another row.
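The row and column figures above lend themselves to a quick back-of-the-envelope calculation. This sketch assumes one machine per grid cell and uses a made-up collection size and query load; only the 120 queries per second and 7M pages figures come from the slide.

```python
import math

QUERIES_PER_ROW = 120         # from the slide: each row handles 120 queries/sec
PAGES_PER_COLUMN = 7_000_000  # from the slide: each column handles 7M pages

def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines), assuming one machine per grid cell."""
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)  # partitions of the data
    rows = math.ceil(peak_qps / QUERIES_PER_ROW)         # replicas of each partition
    return rows, columns, rows * columns

# Hypothetical load: 300M pages and 500 queries per second at peak.
print(grid_size(300_000_000, 500))  # (5, 43, 215)
```

Adding a row increases query capacity without changing how the pages are split; adding a column increases the number of pages covered.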
Cascading Allocation of CPUs l A variation on this that produces a cost-savings: –Put high-quality/common pages on many machines –Put lower quality/less common pages on fewer machines –Query goes to high quality machines first –If no hits found there, go to other machines
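One way to picture the cascade is as an ordered list of index tiers, consulted in turn. The tier names, index contents, and URLs below are hypothetical and stand in for whole groups of machines.

```python
def cascade_search(query_term, tiers):
    """Try each tier in order; return hits from the first tier that has any."""
    for tier_name, index in tiers:
        hits = index.get(query_term, [])
        if hits:
            return tier_name, hits
    return None, []

# Each tier is (name, term -> list of pages); contents are made up.
tiers = [
    ("high-quality", {"cars": ["autos.example/home"]}),
    ("other", {"cars": ["misc.example/a"], "zeppelins": ["misc.example/z"]}),
]

print(cascade_search("cars", tiers))       # ('high-quality', ['autos.example/home'])
print(cascade_search("zeppelins", tiers))  # ('other', ['misc.example/z'])
```

The saving comes from the fact that most queries stop at the first tier, so the lower tiers need fewer machines.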
Web Crawlers l How do the web search engines get all of the items they index? l Main idea: –Start with known sites –Record information for these sites –Follow the links from each site –Record information found at new sites –Repeat
Web Crawlers l How do the web search engines get all of the items they index? l More precisely: –Put a set of known sites on a queue –Repeat the following until the queue is empty: »Take the first page off of the queue »If this page has not yet been processed: l Record the information found on this page –Positions of words, links going out, etc l Add each link on the current page to the queue l Record that this page has been processed l In what order should the links be followed?
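The queue-based loop above can be sketched as follows. To keep the example runnable without a network, it crawls a toy in-memory "web" (a dict of page name to outgoing links); the `fetch_links` callback and `TOY_WEB` are stand-ins, not part of any real crawler.

```python
from collections import deque

def crawl(seed_urls, fetch_links):
    """Process pages from a queue until it is empty; return the visit order."""
    queue = deque(seed_urls)      # put a set of known sites on a queue
    processed = set()
    visit_order = []
    while queue:
        url = queue.popleft()     # take the first page off of the queue
        if url in processed:
            continue
        links = fetch_links(url)  # "record the information": here, only the links
        visit_order.append(url)
        queue.extend(links)       # add each link on the current page to the queue
        processed.add(url)        # record that this page has been processed
    return visit_order

# Toy link graph; note the loop from C back to A.
TOY_WEB = {"A": ["B", "C"], "B": ["D"], "C": ["D", "A"], "D": []}
print(crawl(["A"], lambda url: TOY_WEB.get(url, [])))  # ['A', 'B', 'C', 'D']
```

Because pages come off the front of the queue, this particular version visits pages in breadth-first order; the "processed" check is what stops the C to A link from causing an infinite loop.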
Page Visit Order Animated examples of breadth-first vs depth-first search on trees: Structure to be traversed
Page Visit Order l Animated examples of breadth-first vs depth-first search on trees: Breadth-first search [animation]
Page Visit Order l Animated examples of breadth-first vs depth-first search on trees: Depth-first search [animation]
Depth-First Crawling (more complex – graphs & sites) [Figure: a graph of linked pages grouped into sites (Site 1, Site 2, Site 3, Site 5, Site 6), visited in depth-first order]
Breadth-First Crawling (more complex – graphs & sites) [Figure: a graph of linked pages grouped into sites (Site 1, Site 2, Site 3, Site 5, Site 6), visited in breadth-first order]
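The difference between the two visit orders comes down to which end of the frontier the next page is taken from. The sketch below uses a small made-up tree; the page names are hypothetical.

```python
from collections import deque

def visit_order(graph, start, breadth_first=True):
    """Return pages in breadth-first (queue) or depth-first (stack) order."""
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        # Queue: take the oldest entry. Stack: take the newest entry.
        page = frontier.popleft() if breadth_first else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        links = graph.get(page, [])
        # For depth-first, push links in reverse so the first link is explored first.
        frontier.extend(links if breadth_first else reversed(links))
    return order

TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
print(visit_order(TREE, "root", breadth_first=True))
# ['root', 'a', 'b', 'a1', 'a2', 'b1']
print(visit_order(TREE, "root", breadth_first=False))
# ['root', 'a', 'a1', 'a2', 'b', 'b1']
```

Breadth-first reaches every top-level page before going deeper, which tends to spread requests across sites; depth-first follows one chain of links as far as it goes before backing up.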
Web Crawling Issues l Keep out signs –A file called robots.txt tells the crawler which directories are off limits l Freshness –Figure out which pages change often –Recrawl these often l Duplicates, virtual hosts, etc –Convert page contents with a hash function –Compare new pages to the hash table l Lots of problems –Server unavailable –Incorrect HTML –Missing links –Infinite loops l Web crawling is difficult to do robustly!
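Two of these issues are easy to sketch with the Python standard library: honoring keep-out rules and spotting exact duplicates by hashing page contents. The rules, user-agent name, and URLs below are made up for illustration, and the hash check only catches byte-identical pages, not near-duplicates.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Parse keep-out rules from text instead of fetching them over the network.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def is_duplicate(page_text, seen_hashes):
    """Hash the page contents and compare against the hashes seen so far."""
    digest = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

print(robots.can_fetch("examplebot", "http://www.example.com/index.html"))      # True
print(robots.can_fetch("examplebot", "http://www.example.com/private/x.html"))  # False

seen = set()
print(is_duplicate("<html>same page</html>", seen))  # False (first time seen)
print(is_duplicate("<html>same page</html>", seen))  # True (same content again)
```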
Cha-Cha l Cha-Cha searches an intranet –Sites associated with an organization l Instead of hand-edited categories –Computes shortest path from the root for each hit –Organizes search results according to which subdomain the pages are found in
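A shortest path from the root can be found with a breadth-first search over the link graph. The sketch below is not Cha-Cha's actual code; the page names and link structure are hypothetical.

```python
from collections import deque

def shortest_paths(links, root):
    """Breadth-first search: shortest link path from root to each reachable page."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in paths:  # first time reached means shortest path
                paths[target] = paths[page] + [target]
                queue.append(target)
    return paths

# Made-up intranet link graph.
links = {
    "home": ["dept", "library"],
    "dept": ["courses"],
    "library": ["courses", "hours"],
    "courses": ["is202"],
}
paths = shortest_paths(links, "home")
print(paths["hours"])  # ['home', 'library', 'hours']
print(paths["is202"])  # ['home', 'dept', 'courses', 'is202']
```

Showing each hit together with its path gives the searcher context about where the page sits within the organization's site.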
Cha-Cha Web Crawling Algorithm l Start with a list of servers to crawl –for UCB, simply start with l Restrict crawl to certain domain(s) –*.berkeley.edu l Obey No Robots standard l Follow hyperlinks only –do not read local filesystems »links are placed on a queue »traversal is breadth-first l See first lecture or the technical papers for more information
Summary l Web search differs from traditional IR systems –Different kind of collection –Different kinds of users/queries –Different economic motivations l Ranking combines many features in a difficult-to-specify manner –Link analysis and proximity of terms seem especially important –This is in contrast to the term-frequency orientation of standard search »Why?
Summary (cont.) l Web search engine architecture –Similar in many ways to standard IR –Indexes usually duplicated across machines to handle many queries quickly l Web crawling –Used to create the collection –Can be guided by quality metrics –Is very difficult to do robustly
Web Search Statistics
Information from searchenginewatch.com Searches per Day (info missing for fast.com, Excite, Northernlight, etc.)
Information from searchenginewatch.com Web Search Engine Visits
Information from searchenginewatch.com Percentage of web users who visit the site shown
Information from searchenginewatch.com Search Engine Size (July 2000)
Information from searchenginewatch.com Does size matter? You can’t access many hits anyhow.
Information from searchenginewatch.com Increasing numbers of indexed pages, self-reported
Information from searchenginewatch.com Increasing numbers of indexed pages (more recent), self-reported
Information from searchenginewatch.com Web Coverage
From description of the FAST search engine, by Knut Risvik
Information from searchenginewatch.com Directory sizes