The Inside Story Christine Reilly CSCI 6175 September 27, 2011.



Back in the late 1990’s…

Problems With 1990’s Search Engines
– Spam: top results were ads
– Users only look at top 10 results
– Rapid growth of the Web

Outline
– Motivation
– How to Build a Search Engine
– Storing All the Data
– Page Rank
– Google Performance
– The Future of Search
– Conclusions

Step 1: Crawl to Retrieve Pages
– Example page text: “Welcome to my page. I have links to other pages on my page.”
– Diagram label: URL List

Step 1: Crawl to Retrieve Pages
– Example page text: “Welcome to another page. I also have links to other pages on my page.”
– Diagram label: URL List
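The crawl step on these two slides can be sketched as a loop over a URL list: take a URL, retrieve the page, extract its links, and append any new ones to the list. This is a minimal sketch, not the crawler the slides describe. The names `LinkExtractor`, `crawl`, and `FAKE_WEB` are hypothetical, and an in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    url_list = deque([seed])  # plays the role of the slide's "URL List"
    seen = {seed}
    pages = {}
    while url_list:
        url = url_list.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved: skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                url_list.append(link)
    return pages


# Hypothetical in-memory "web" used instead of network access.
FAKE_WEB = {
    "pageA": 'Welcome to my page. <a href="pageB">other</a> <a href="pageC">more</a>',
    "pageB": 'Welcome to another page. <a href="pageA">back</a>',
    "pageC": "No links here.",
}

crawled = crawl("pageA", FAKE_WEB.get)
print(sorted(crawled))
```

A first-in, first-out URL list gives breadth-first order. Choosing a different order is one of the crawling issues raised on the next slide.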

Issues With Web Crawling
– How to crawl as much of web as possible
– Choose order of pages to crawl
– Storing all the pages
– When to re-crawl
– Don’t irritate the page owner

Step 2-a: Create Hit List
Example page (pageX):
  All About Bicycles
  Bicycles are fun to ride. But watch out for cars on the road.
Hits:
  …
  pageX; Bicycles; 50; h1; 1st
  pageX; Bicycles; 60; norm; 1st
  pageX; fun; 67; norm; none
  pageX; ride; 81; norm; none
  …
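A rough sketch of hit-list creation follows. It records one tuple per word occurrence with the page, the word, a position, and whether the word came from the heading or the normal text. The slide does not say how positions are counted or what its last column means, so this sketch assumes a running word index and omits that column. The function name `make_hits` is hypothetical.

```python
import re


def make_hits(page_id, heading, body):
    """Return (page, word, position, kind) for every word on the page."""
    hits = []
    position = 0  # assumption: position is a running word index
    for kind, text in (("h1", heading), ("norm", body)):
        for word in re.findall(r"[A-Za-z0-9']+", text):
            hits.append((page_id, word, position, kind))
            position += 1
    return hits


hits = make_hits(
    "pageX",
    "All About Bicycles",
    "Bicycles are fun to ride. But watch out for cars on the road.",
)
for hit in hits[:6]:
    print(hit)
```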

Step 2-b: Create Anchors File
Example page (pageX):
  All About Bicycles
  Bicycles are fun to ride. But watch out for cars on the road.
Anchors:
  pageX; linkM; Bicycles
  pageX; linkN; cars
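The anchors file pairs each link on a page with its anchor text. The sketch below assumes the example page marks up "Bicycles" and "cars" as links to `linkM` and `linkN`; the HTML string and the class name `AnchorExtractor` are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (page, link target, anchor text) for every <a href=...>."""

    def __init__(self, page_id):
        super().__init__()
        self.page_id = page_id
        self.anchors = []
        self._href = None  # target of the link currently being read
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.anchors.append((self.page_id, self._href, text))
            self._href = None


# Hypothetical markup for the example page on the slide.
PAGE_X = (
    "<h1>All About Bicycles</h1>"
    '<p><a href="linkM">Bicycles</a> are fun to ride. '
    'But watch out for <a href="linkN">cars</a> on the road.</p>'
)

extractor = AnchorExtractor("pageX")
extractor.feed(PAGE_X)
print(extractor.anchors)
```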

More Steps
– Create inverted index sorted by word
– Create lexicon
– Search uses lexicon, inverted index, and Page Rank

Search Process
– Parse the query
– Find documents that have all search terms
– Compute the rank of the document
– Return the top k documents (sorted by rank)

Search for “bicycle”
Inverted Index:
  bicycle; pageA; 30, 70
  bicycle; pageB; 98, 1100
  car; pageA; 103
  car; pageC; 107
  car; pageD; 119, 598, 2004
Results: pageA, pageB
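The search process can be tried on the slide's own index entries. The sketch below parses the query, keeps only documents that contain every term, scores them, and returns the top k. The scoring is a placeholder assumption (number of matching hits, ties broken by page name), not the ranking the talk describes, and the function name `search` is hypothetical.

```python
# Inverted index from the slide: word -> {page: [positions]}.
INDEX = {
    "bicycle": {"pageA": [30, 70], "pageB": [98, 1100]},
    "car": {"pageA": [103], "pageC": [107], "pageD": [119, 598, 2004]},
}


def search(query, index, k=10):
    """Return up to k pages that contain all query terms, best first."""
    terms = query.lower().split()  # 1. parse the query
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    docs = set(postings[0])        # 2. documents with all search terms
    for posting in postings[1:]:
        docs &= set(posting)

    def score(doc):                # 3. placeholder rank: count of hits
        return sum(len(posting[doc]) for posting in postings)

    ranked = sorted(docs, key=lambda doc: (-score(doc), doc))
    return ranked[:k]              # 4. top k documents, sorted by rank


print(search("bicycle", INDEX))
print(search("bicycle car", INDEX))
```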

Ranking Results of a Query
– Hit type: title, anchor, URL, large font, etc.
– PageRank (more about that next)
– Documents with words appearing closer together have higher weight


Data Storage
– Use specialized data structures
– Avoid expensive disk seeks

Repository of Crawled Web Pages
– Record layout: docId | urlLen | pageLen | url | page
– Pages compressed using zlib
– All other data structures can be rebuilt from repository and list of crawler errors
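A repository record with the fields docId, urlLen, pageLen, url, and page could be written and read back as sketched below. The slide gives the field order and the use of zlib but not the field widths, so the three 32-bit unsigned integers in the header are an assumption, as are the function names.

```python
import struct
import zlib

# Header: docId, urlLen, pageLen. The 32-bit widths are an assumption.
HEADER = struct.Struct("<III")


def pack_record(doc_id, url, page):
    """Serialize one crawled page; the page body is zlib-compressed."""
    url_bytes = url.encode("utf-8")
    page_bytes = zlib.compress(page.encode("utf-8"))
    header = HEADER.pack(doc_id, len(url_bytes), len(page_bytes))
    return header + url_bytes + page_bytes


def unpack_record(buf, offset=0):
    """Read one record at `offset`; also return the next record's offset."""
    doc_id, url_len, page_len = HEADER.unpack_from(buf, offset)
    url_start = offset + HEADER.size
    page_start = url_start + url_len
    end = page_start + page_len
    url = buf[url_start:page_start].decode("utf-8")
    page = zlib.decompress(buf[page_start:end]).decode("utf-8")
    return doc_id, url, page, end


repository = pack_record(1, "pageA", "Welcome to my page.") + pack_record(
    2, "pageB", "Welcome to another page."
)
first = unpack_record(repository)
second = unpack_record(repository, first[3])
print(first[:3])
print(second[:3])
```

Storing the lengths ahead of the variable-size fields lets a reader walk the repository sequentially without an index.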

Hit Data Structure
– 2 bytes per hit
– 3 types of hits:
  – Plain
  – Fancy (URL, title, meta tag, etc.)
  – Anchor text
– Parts of the hit data structure (bits used by each part):
  Plain:  Cap (1) | Font (3) | Position (12)
  Fancy:  Cap (1) | Font = 7 | Type (4) | Position (8)
  Anchor: Cap (1) | Font = 7 | Type (4) | Hash (4) | Position (4)
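The 2-byte hit layouts can be illustrated with bit shifts. The field widths come from the slide. The bit order (Cap in the highest bit, Position in the lowest bits) and all function names are assumptions of this sketch.

```python
FANCY_FONT = 7  # font value 7 marks a fancy or anchor hit


def _check(value, bits, name):
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")


def pack_plain(cap, font, position):
    """Cap (1) | Font (3) | Position (12)."""
    _check(font, 3, "font")
    if font == FANCY_FONT:
        raise ValueError("font 7 is reserved for fancy and anchor hits")
    _check(position, 12, "position")
    return (int(bool(cap)) << 15) | (font << 12) | position


def pack_fancy(cap, hit_type, position):
    """Cap (1) | Font = 7 | Type (4) | Position (8)."""
    _check(hit_type, 4, "type")
    _check(position, 8, "position")
    return (int(bool(cap)) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def pack_anchor(cap, hit_type, anchor_hash, position):
    """Cap (1) | Font = 7 | Type (4) | Hash (4) | Position (4)."""
    _check(hit_type, 4, "type")
    _check(anchor_hash, 4, "hash")
    _check(position, 4, "position")
    return (
        (int(bool(cap)) << 15)
        | (FANCY_FONT << 12)
        | (hit_type << 8)
        | (anchor_hash << 4)
        | position
    )


def is_plain(hit):
    return (hit >> 12) & 0x7 != FANCY_FONT


def unpack_plain(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


def unpack_fancy(hit):
    return bool(hit >> 15), (hit >> 8) & 0xF, hit & 0xFF


def unpack_anchor(hit):
    return bool(hit >> 15), (hit >> 8) & 0xF, (hit >> 4) & 0xF, hit & 0xF


plain = pack_plain(True, 3, 60)
fancy = pack_fancy(False, 2, 200)
anchor = pack_anchor(False, 1, 9, 5)
print(unpack_plain(plain), unpack_fancy(fancy), unpack_anchor(anchor))
```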

Forward Index
(n) = number of bits used
  docId | wordId (24) | num hits (8) | list of hits
        | wordId (24) | num hits (8) | list of hits
        | null wordId
  docId | wordId (24) | num hits (8) | list of hits
        | wordId (24) | num hits (8) | list of hits
        | wordId (24) | num hits (8) | list of hits
        | null wordId

Inverted Index
Lexicon:
  wordId | num Docs
  wordId | num Docs
  wordId | num Docs
Index:
  docId (27) | num Hits (5) | Hit List
  docId (27) | num Hits (5) | Hit List
  docId (27) | num Hits (5) | Hit List
  docId (27) | num Hits (5) | Hit List
  …
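The relationship between the forward index (by docId) and the inverted index (by wordId) can be sketched with plain dictionaries. This ignores the bit-level packing shown on the slides. The sample IDs and the function name `invert` are made up for illustration.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn docId -> [(wordId, hits)] into wordId -> [(docId, hits)].

    Also returns a lexicon mapping each wordId to its number of documents.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):  # keeps doc lists ordered by docId
        for word_id, hits in forward_index[doc_id]:
            inverted[word_id].append((doc_id, hits))
    lexicon = {word_id: len(docs) for word_id, docs in inverted.items()}
    return lexicon, dict(inverted)


# Hypothetical forward index: docId -> list of (wordId, hit list).
FORWARD = {
    1: [(10, [30, 70]), (20, [103])],
    2: [(10, [98, 1100])],
    3: [(20, [107])],
}

lexicon, inverted = invert(FORWARD)
print(lexicon)
print(inverted)
```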


Importance of a Web Page
– Simple approximation: count backlinks
  – Can easily create many links to my own page
  – A page with one link from a “good” web page should get a higher importance
– Better method: PageRank
  – Use graph of the web
  – Measure relative importance of web pages

Simplified Page Rank

The Real Page Rank
– Handles cycles of pages
– Random Surfer: periodically jump to a random page
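The random-surfer idea can be sketched as a power iteration: at each step the surfer follows a link with probability `damping`, and otherwise jumps to a random page. The value 0.85, the fixed iteration count, and the treatment of pages without outgoing links (their rank is spread over all pages) are assumptions of this sketch and are not stated on the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Random jump: every page receives an equal share of (1 - damping).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links passes rank to all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page web containing a cycle A -> C -> A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this small graph C ends up highest because it is linked from both A and B, while B receives only half of A's rank.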


Quality of Results
– Simple example showed high quality results
– Current Google is used by tons of people

Other Performance Metrics
– Storage: All data used takes 55 GB
  – Better compression -> 7 GB
– System Performance
  – Crawl: 9 days first time, 2.6 days (48.5 pages / s) second time
  – Indexer: 54 pages / s; runs in parallel with crawl
  – Sorting with 4 parallel machines: 24 hours
– Search Performance: not a focus of the research
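As a quick check of scale, the crawl rate and duration quoted on this slide imply about 10.9 million pages for the second crawl. The helper name below is hypothetical and the calculation uses only the slide's two numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_crawled(pages_per_second, days):
    """Total pages fetched at a steady rate over the given number of days."""
    return pages_per_second * days * SECONDS_PER_DAY


second_crawl = pages_crawled(48.5, 2.6)
print(f"{second_crawl:,.0f} pages")
```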


Modern Search Challenge
– Return relevant results
  – Find hotel in NYC with certain amenities
  – Assemble a geographically distributed committee
– Current search engines: sift through tons of results, find relevant information

Information Extraction
– Extract meaningful data from text, store as structured data
– Example:
  – Text: “Paris is the stylish capital of France”
  – Data tuple: (Paris, capital of, France)
– Automatically create collections of data that are currently human curated
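The Paris example can be reproduced with a single hand-written pattern. This is only a toy sketch of turning text into a tuple. Real information extraction systems are far more general, and the pattern and the function name `extract` are made up for illustration.

```python
import re

# Matches "<Subject> is the [adjectives] capital of <Object>".
PATTERN = re.compile(
    r"(?P<subject>[A-Z][a-z]+) is the (?:\w+ )*?"
    r"(?P<relation>capital of) (?P<object>[A-Z][a-z]+)"
)


def extract(text):
    """Return (subject, relation, object) tuples found in the text."""
    return [
        (m["subject"], m["relation"], m["object"])
        for m in PATTERN.finditer(text)
    ]


tuples = extract("Paris is the stylish capital of France")
print(tuples)
```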


Conclusions
– Ways to improve search:
  – Format of text on page
  – Following page links
– Search must scale as the web grows
– Search has come a long way, but new techniques will improve it

Questions?