CS 349: WebBase
What the WebBase can and can't do

Summary
– What is in the WebBase
– Performance Considerations
– Reliability Considerations
– Other sources of data

WebBase Repository
– 25 million web pages
– 150 GB (50 GB compressed)
– Spread across roughly 30 disks

What kind of web pages?
– Everything you can imagine
– Infinite web pages (truncated at 100K)
– 404 errors
– Very little correct HTML

Duplicate Web Pages
– Duplicate sites
  – Crawl root pages first
  – Find duplicates and assume the same for the remainder of the crawl
– Duplicate hierarchies off of the main page
  – Mirror sites
– Duplicate pages
– Near-duplicate pages

Shiva’s Test Results
– 36% duplicates
– 48% near-duplicates
– Largest sets of duplicates:
  – TUCOWS (100)
  – MS IE Server Manuals (90)
  – Unix Help Pages (75)
  – RedHat Linux Manual (55)
  – Java API Doc (50)

Order of Web Pages
– First half million are root pages
– After that, pages in PageRank order
– Roughly by importance

Structure of Data
– Each record: magic number (4 bytes), packet length (4 bytes), packet (~2K bytes)
– Packet is compressed
– Packet contains: docID, URL, HTTP headers, HTML data
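
As a sketch, reading one of these repository files might look like the following Python. The big-endian byte order and zlib compression are assumptions for illustration; the slide does not specify them.

    import struct
    import zlib

    def read_repository(path):
        """Iterate over decompressed packets in a repository file.
        Per the slide: 4-byte magic number, 4-byte packet length, then a
        compressed packet holding docID, URL, HTTP headers, and HTML."""
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break                                     # end of file
                magic, length = struct.unpack(">II", header)  # byte order: an assumption
                packet = zlib.decompress(f.read(length))      # zlib: an assumption
                yield packet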

Performance Issues: An Example
– One disk seek per document:
  – 10 ms seek latency + 10 ms rotational latency
  – x ms read latency + x ms OS overhead
  – y ms processing
– Realistically 50 ms per document = 20 docs per second
– 25 million / 20 docs per second = 1,250,000 seconds ≈ 2 weeks (too slow)

How fast does it have to be?
– Answer: 4 ms per doc = 250 docs per second
– 25 million / 250 = 100,000 seconds ≈ 1.2 days
– Reading + uncompressing + parsing ≈ 3 to 4 ms per document
– So there is not much room left for processing
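
The arithmetic behind both budgets, as a quick back-of-envelope check:

    DOCS = 25_000_000

    # Seek-bound: one random disk access per document, ~50 ms each
    print(DOCS * 0.050 / 86_400)   # ~14.5 days: too slow

    # Target budget: ~4 ms per document, i.e. 250 docs per second
    print(DOCS * 0.004 / 86_400)   # ~1.2 days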

How can you do something complicated?
– Really fast processing to generate smaller intermediate results
– Run complex processing over the smaller results
– Example: duplicate detection
  – Compute shingles from all documents
  – Find pairs of documents that share shingles
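
A minimal sketch of the two passes, assuming word-level shingles of length 5 hashed down to 8-byte fingerprints (both choices are illustrative, not necessarily what the WebBase tools used):

    import hashlib
    import re
    from collections import defaultdict

    def shingles(text, k=5):
        """Fast first pass: hash every k-word window of a document
        down to a small fingerprint set."""
        words = re.findall(r"\w+", text.lower())
        return {
            hashlib.md5(" ".join(words[i:i + k]).encode()).digest()[:8]
            for i in range(max(len(words) - k + 1, 1))
        }

    def candidate_pairs(doc_shingles):
        """Complex second pass runs over the much smaller shingle table:
        documents sharing a fingerprint become candidate duplicates."""
        by_shingle = defaultdict(list)
        for doc_id, fingerprints in doc_shingles.items():
            for fp in fingerprints:
                by_shingle[fp].append(doc_id)
        pairs = set()
        for docs in by_shingle.values():
            pairs.update((a, b) for i, a in enumerate(docs) for b in docs[i + 1:])
        return pairs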

Bulk Processing of Large Result Sets
– Example: resolving anchors
– Resolve URLs and save from - to pairs in ASCII
– Compute a 64-bit checksum of the “To” URLs
– Bulk merge against a checksum - docID table
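
A sketch of the bulk-merge idea: rather than one random table lookup per anchor, checksum everything, sort both sides, and resolve in a single sequential pass. The specific checksum (first 8 bytes of MD5) is an assumption.

    import hashlib

    def checksum64(url):
        # 64-bit URL checksum; truncated MD5 is an illustrative choice
        return hashlib.md5(url.encode()).digest()[:8]

    def bulk_resolve(to_urls, checksum_to_docid):
        """Resolve anchor target URLs to docIDs in one sort-merge pass."""
        anchors = sorted((checksum64(u), u) for u in to_urls)
        table = sorted(checksum_to_docid.items())   # (checksum, docID) pairs
        resolved, i, j = [], 0, 0
        while i < len(anchors) and j < len(table):
            if anchors[i][0] == table[j][0]:
                resolved.append((anchors[i][1], table[j][1]))
                i += 1
            elif anchors[i][0] < table[j][0]:
                i += 1                              # URL not in table: unresolvable
            else:
                j += 1
        return resolved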

Reliability - Potential Sources of Problems
– Source code bugs
– Hardware failures
– OS failures
– Running out of resources

Software Engineering Guidelines
– Number of bugs seen ~ log(size of dataset)
– Not just your bugs:
  – OS bugs
  – Disk OS bugs
– Generate incremental results
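
One hedged sketch of "generate incremental results": flush partial output and a resume point periodically, so a crash late in a long run does not force a restart from zero. The file names and flush interval are illustrative.

    import os

    CHECKPOINT = "progress.ckpt"    # illustrative names
    OUTPUT = "results.out"
    FLUSH_EVERY = 10_000

    def resume_point():
        # Where did the previous run stop?
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return int(f.read())
        return 0

    def run(process_doc, num_docs):
        with open(OUTPUT, "a") as out:
            for doc_id in range(resume_point(), num_docs):
                out.write(process_doc(doc_id) + "\n")
                if (doc_id + 1) % FLUSH_EVERY == 0:
                    out.flush()
                    os.fsync(out.fileno())           # results safely on disk
                    with open(CHECKPOINT, "w") as ck:
                        ck.write(str(doc_id + 1))    # safe to resume here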

Other Available Data
– Link graph of the Web
– List of PageRanks
– List of URLs

Link Graph of the Web
– From DocID : To DocID pairs
– Try the red bars on Google to find backlinks
– Interesting information
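
The graph is just an edge list of docID pairs. A minimal sketch of inverting it to answer backlink queries, assuming fixed-width 4-byte docIDs (the on-disk encoding is not specified in the slides):

    import struct
    from collections import defaultdict

    def load_backlinks(path):
        """Read (from docID, to docID) pairs and invert them so the
        backlinks of any page can be looked up directly."""
        backlinks = defaultdict(list)
        with open(path, "rb") as f:
            while True:
                edge = f.read(8)        # two 4-byte docIDs per edge
                if len(edge) < 8:
                    break
                src, dst = struct.unpack(">II", edge)
                backlinks[dst].append(src)
        return backlinks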

What is PageRank?
– A measure of “importance”
– You are important if important things point to you
– Random surfer model
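
A minimal power-iteration sketch of the random-surfer model. The damping factor 0.85 follows Brin and Page's paper; the iteration count and toy graph are illustrative.

    def pagerank(links, d=0.85, iters=50):
        """links: {page: [pages it points to]}. Each step, a random surfer
        follows an outlink with probability d, else jumps to a random page."""
        pages = list(links)
        n = len(pages)
        rank = dict.fromkeys(pages, 1.0 / n)
        for _ in range(iters):
            new = dict.fromkeys(pages, (1 - d) / n)
            for p in pages:
                outs = links[p] or pages    # dangling page: spread rank everywhere
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
            rank = new
        return rank

    # Toy example: B and C both point to A, so A comes out most "important"
    print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))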

Uncrawled URLs
– Image links
– MailTo links
– CGI links
– Plain uncrawled HTML links

Summary
– WebBase has lots of web pages
  – Very heterogeneous and weird
– Performance Considerations
  – Code should be very, very fast
  – Use bulk processing
– Reliability Considerations
  – Write out intermediate results
– Auxiliary data