The Inside Story
Christine Reilly
CSCI 6175
September 27, 2011
Back in the late 1990s…
Problems With 1990s Search Engines
Spam: top results were ads
Users only look at top 10 results
Rapid growth of the Web
Outline
Motivation
How to Build a Search Engine
Storing All the Data
PageRank
Google Performance
The Future of Search
Conclusions
Step 1: Crawl to Retrieve Pages
Page: “Welcome to my page. I have links to other pages on my page.”
Page: “Welcome to another page. I also have links to other pages on my page.”
[Diagram: URL List]
Issues With Web Crawling
How to crawl as much of the web as possible
Choosing the order of pages to crawl
Storing all the pages
When to re-crawl
Don’t irritate the page owner
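The crawl step and its URL list can be illustrated with a minimal sketch. It assumes a breadth-first crawl order, which is only one possible answer to the ordering question above. The names `TOY_WEB`, `crawl`, and `fetch_links` are hypothetical, the "web" is an in-memory dictionary rather than real HTTP, and re-crawling and politeness delays are not modeled.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links on that page.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["c.example/", "a.example/x"],
    "c.example/": [],
}

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    frontier = deque(seed_urls)   # the "URL List" of pages still to fetch
    seen = set(seed_urls)         # never queue the same URL twice
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
```

Swapping the `deque` for a priority queue would be one way to experiment with other crawl orders.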
Step 2-a: Create Hit List
Page: “All About Bicycles” (heading); “Bicycles are fun to ride. But watch out for cars on the road.”
Hits:
…
pageX; Bicycles; 50; h1; 1st
pageX; Bicycles; 60; norm; 1st
pageX; fun; 67; norm; none
pageX; ride; 81; norm; none
…
Step 2-b: Create Anchors File
Page: “All About Bicycles” (heading); “Bicycles are fun to ride. But watch out for cars on the road.”
Anchors:
pageX; linkM; Bicycles
pageX; linkN; cars
More Steps
Create inverted index sorted by word
Create lexicon
Search uses lexicon, inverted index, and PageRank
Search Process
Parse the query
Find documents that have all search terms
Compute the rank of each document
Return the top k documents (sorted by rank)
Search for “bicycle”
Inverted Index:
bicycle; pageA; 30, 70
bicycle; pageB; 98, 1100
car; pageA; 103
car; pageC; 107
car; pageD; 119, 598, 2004
Results: pageA, pageB
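The lookup in this example can be sketched in a few lines. The index contents are taken from the slide; the `search` function and the scores in `TOY_RANK` are hypothetical stand-ins for the real ranking, made up only so the sort has something to work with.

```python
# Inverted index from the slide: word -> {page: positions of the word}.
INVERTED_INDEX = {
    "bicycle": {"pageA": [30, 70], "pageB": [98, 1100]},
    "car": {"pageA": [103], "pageC": [107], "pageD": [119, 598, 2004]},
}

# Made-up rank scores (assumption, not from the slides).
TOY_RANK = {"pageA": 0.4, "pageB": 0.3, "pageC": 0.2, "pageD": 0.1}

def search(query, index, rank, k=10):
    """Return up to k pages containing every query term, best rank first."""
    terms = query.lower().split()
    if not terms:
        return []
    # One set of pages per term; a page must appear in all of them.
    postings = [set(index.get(term, {})) for term in terms]
    matches = set.intersection(*postings)
    # Sort by descending rank, breaking ties by page name.
    return sorted(matches, key=lambda page: (-rank.get(page, 0.0), page))[:k]

results = search("bicycle", INVERTED_INDEX, TOY_RANK)
```

A two-word query such as `"bicycle car"` keeps only pages present in both posting lists.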
Ranking Results of a Query
Hit type: title, anchor, URL, large font, etc.
PageRank (more about that next)
Documents with words appearing closer together have higher weight
Data Storage
Use specialized data structures
Avoid expensive disk seeks
Repository of Crawled Web Pages
Record layout: docId | urlLen | pageLen | url | page
Pages compressed using zlib
All other data structures can be rebuilt from repository and list of crawler errors
Hit Data Structure
2 bytes per hit
3 types of hits:
– Plain
– Fancy (URL, title, meta tag, etc.)
– Anchor text
Parts of the hit data structure (bits used by part):
Plain: Cap (1) | Font (3) | Position (12)
Fancy: Cap (1) | Font = 7 | Type (4) | Position (8)
Anchor: Cap (1) | Font = 7 | Type (4) | Hash (4) | Position (4)
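A plain hit fits in 16 bits because 1 + 3 + 12 = 16. The sketch below packs and unpacks one, assuming the capitalization bit is the highest bit and the position occupies the lowest 12 bits; the slide gives the field widths but not the bit order. The function names are hypothetical, and clamping oversized positions to 4095 is also an assumption.

```python
def pack_plain_hit(cap, font, position):
    """Pack a plain hit into 16 bits: cap (1) | font (3) | position (12)."""
    if not 0 <= font <= 6:
        # Font value 7 is reserved to mark fancy and anchor hits.
        raise ValueError("plain hits use font values 0..6")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumption: clamp to the 12-bit maximum
    return (int(bool(cap)) << 15) | (font << 12) | position

def unpack_plain_hit(hit):
    """Return (cap, font, position) from a packed 16-bit plain hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

packed = pack_plain_hit(True, 3, 60)
```

The fancy and anchor layouts would follow the same shift-and-mask pattern with their own field widths.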
Forward Index
docId | wordId (24) | num hits (8) | list of hits
      | wordId (24) | num hits (8) | list of hits
      | null wordId
docId | wordId (24) | num hits (8) | list of hits
      | wordId (24) | num hits (8) | list of hits
      | wordId (24) | num hits (8) | list of hits
      | null wordId
(n) = number of bits used
Inverted Index
Lexicon:
wordId | num Docs
wordId | num Docs
wordId | num Docs
Index:
docId (27) | num Hits (5) | Hit List
docId (27) | num Hits (5) | Hit List
docId (27) | num Hits (5) | Hit List
docId (27) | num Hits (5) | Hit List
…
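One way to see how the inverted index relates to the forward index is to regroup the same hit lists by wordId instead of docId. The sketch below does this with plain Python dictionaries; `invert` and the sample data are hypothetical, and the bit-level layout is ignored.

```python
from collections import defaultdict

def invert(forward_index):
    """Regroup docId -> [(wordId, hits)] into wordId -> [(docId, hits)].

    Also returns a lexicon mapping each wordId to its number of documents.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):           # keep postings in docId order
        for word_id, hits in forward_index[doc_id]:
            inverted[word_id].append((doc_id, hits))
    lexicon = {word_id: len(postings) for word_id, postings in inverted.items()}
    return dict(sorted(inverted.items())), lexicon  # sorted by wordId

# Hypothetical forward index: two documents, wordIds 7 and 9.
forward = {1: [(7, [30, 70]), (9, [103])], 2: [(7, [98])]}
inverted, lexicon = invert(forward)
```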
Importance of a Web Page
Simple approximation: count backlinks
– Can easily create many links to my own page
– A page with one link from a “good” web page should get a higher importance
Better method: PageRank
– Use graph of the web
– Measure relative importance of web pages
Simplified PageRank
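A commonly cited form of the simplified ranking is given here as an assumption about what this slide presents. The symbols are not from the slides: $B_u$ is the set of pages linking to page $u$, $N_v$ is the number of outgoing links on page $v$, and $c$ is a normalization constant.

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
```

Each page splits its own rank evenly among the pages it links to, so a single link from a highly ranked page can count for more than many links from obscure ones.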
The Real PageRank
Handles cycles of pages
Random Surfer: periodically jump to a random page
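The random-surfer idea can be sketched as a short power iteration. This is an illustrative version, not the production algorithm: the function name `pagerank`, the damping value 0.85, the fixed iteration count, and the choice to spread the rank of link-less pages evenly over all pages are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate ranks for a graph given as page -> list of targets."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        # The (1 - damping) share models the surfer jumping to a random page.
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n
                    for page in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# A and B link to each other (a cycle); C links into the cycle.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

Without the random jump, the A and B cycle would keep accumulating rank from C and never pass any on; the jump term keeps every page's rank above zero.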
Quality of Results
Simple example showed high-quality results
Google is now very widely used
Other Performance Metrics
Storage: all data used takes 55 GB
– Better compression -> 7 GB
System Performance
– Crawl: 9 days first time, 2.6 days (48.5 pages / s) second time
– Indexer: 54 pages / s; runs in parallel with crawl
– Sorting with 4 parallel machines: 24 hours
Search Performance: not a focus of the research
Modern Search Challenge
Return relevant results
– Find hotel in NYC with certain amenities
– Assemble a geographically distributed committee
Current search engines: users sift through tons of results to find relevant information
Information Extraction
Extract meaningful data from text, store as structured data
Example:
– Text: “Paris is the stylish capital of France”
– Data tuple: (Paris, capital of, France)
Automatically create collections of data that are currently human curated
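The Paris example can be reproduced with a deliberately naive pattern match. This is only a toy sketch: real information extraction systems use far more robust techniques, and the pattern `CAPITAL_PATTERN` and function `extract_capital_tuples` are hypothetical names that handle just this one sentence shape.

```python
import re

# Matches "<Name> is the [optional adjectives] capital of <Name>".
CAPITAL_PATTERN = re.compile(
    r"(?P<subject>[A-Z]\w+) is the (?:\w+ )*?capital of (?P<object>[A-Z]\w+)"
)

def extract_capital_tuples(text):
    """Return (subject, 'capital of', object) tuples found in the text."""
    return [
        (match.group("subject"), "capital of", match.group("object"))
        for match in CAPITAL_PATTERN.finditer(text)
    ]

tuples = extract_capital_tuples("Paris is the stylish capital of France")
```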
Conclusions
Ways to improve search:
– Format of text on page
– Following page links
Search must scale as the web grows
Search has come a long way, but new techniques will improve it
Questions?