Slide 1: CS246 Search Engine Scale
Junghoo "John" Cho, UCLA Computer Science

Junghoo "John" Cho (UCLA Computer Science) 2 High-Level Architecture  Major modules for a search engine? 1. Crawler  Page download & Refresh 2. Indexer  Index construction  PageRank computation 3. Query Processor  Page ranking  Query logging

Junghoo "John" Cho (UCLA Computer Science) 3 General Architecture

Junghoo "John" Cho (UCLA Computer Science) 4 Scale of Web Search  Number of pages indexed: ~ 10B pages  Index refresh interval: Once per month ~ 1200 pages/sec  Number of queries per day: 290M in Dec 2008 ~ 3000 queries/sec  Services often run on commodity Intel-Linux boxes

Junghoo "John" Cho (UCLA Computer Science) 5 Other Statistics  Average page size: 15KB  Average query size: 40B  Average result size: 5KB  Average number of links per page: 10

Junghoo "John" Cho (UCLA Computer Science) 6 Size of Dataset (1)  Total raw HTML data size 10G x 15KB = 150 TB!  Inverted index roughly the same size as raw corpus: 150 TB for index itself  With appropriate compression, 3:1 compression ratio 100 TB data residing in disk

Junghoo "John" Cho (UCLA Computer Science) 7 Size of Dataset (2)  Number of disks necessary for one copy  (100 TB) / (1TB per disk) = 100 disk

Junghoo "John" Cho (UCLA Computer Science) 8 Data Size and Crawling  Efficient crawl is very important  1 page/sec  1200 machines just for crawling  Parallelization through thread/event queue necessary  Complex crawling algorithm -- No, No!  Well-optimized crawler  ~ 100 pages/sec (10 ms/page)  ~ 12 machines for crawling  Bandwidth consumption  1200 x 15KB x 8bit ~ 150Mbps  One dedicated OC3 line (155Mbps) for crawling ~ $400,000 per year

Junghoo "John" Cho (UCLA Computer Science) 9 Data Size and Indexing  Efficient Indexing is very important  1200 pages / sec  Indexing steps  Load page, extract words – Network/disk intensive  Sort word, postings – CPU intensive  Write sorted postings – Disk intensive  Pipeline indexing steps LSWLSWLSW P1 P2 P3

Junghoo "John" Cho (UCLA Computer Science) 10 Simplified Indexing Model  Model  Copy 50TB data from disks in S to disks in D through network  S: crawling machines  D: indexing machines  50TB crawled data, 50TB index  Ignore actual processing  1TB disk per machine  ~50 machines in S and D each  8GB RAM per machine :::: SD

Junghoo "John" Cho (UCLA Computer Science) 11 Data Flow  Disk  RAM  Network  RAM  Disk  No hardware is error free  Disk (undetected) error rate ~ 1 per  Network error rate ~ 1 bit per  Memory soft error rate ~ 1 bit error per month (1GB)  Typically go unnoticed for small data Disk RAM Network RAMDisk

Junghoo "John" Cho (UCLA Computer Science) 12 Data Flow  Assuming 1Gbit/s link between machines  1TB per machine, 30MB/s transfer rate  Half day just for data transfer Disk RAM Network RAMDisk

Junghoo "John" Cho (UCLA Computer Science) 13 Errors from Disk  Undetected disk error rate ~ 1 per  5x10 13 bytes data read in total 5X10 13 bytes data write in total  10 byte errors from disk read/write Disk RAM Network RAMDisk

Junghoo "John" Cho (UCLA Computer Science) 14 Errors from Memory  1 bit error per month per 1GB  100 machines with 8GB each  8*100 bit errors/month  15 bit error per half day  15 byte error from memory corruption Disk RAM Network RAMDisk

Junghoo "John" Cho (UCLA Computer Science) 15 Errors from Network  1 error per  5 x 8 x bits transfer  400 bit errors scatters around the stream  400 byte errors Disk RAM Network RAMDisk

Junghoo "John" Cho (UCLA Computer Science) 16 Data Size and Errors (1)  During index construction/copy, something always goes wrong  400 byte errors from network, 30 byte errors from disk, 15 byte errors from memory  Very difficult to trace and debug  Particularly disk and memory error  No OS/application assumes such errors yet  Pure hardware errors, but very difficult to differentiate hardware error and software bug  Software bugs may also cause similar errors

Junghoo "John" Cho (UCLA Computer Science) 17 Data Size and Errors (2)  Very difficult to trace and debug  Data corruption in the middle of, say, sorting completely screws up the sorting  Need a data-verification step after every operation  Algorithm, data structure must be resilient to data corruption  Check points, etc.  ECC RAM is a must  Can detect most of 1 bit errors

Junghoo "John" Cho (UCLA Computer Science) 18 Data Size and Reliability  Disk mean time to failure ~ 3 years  (3 x 365 days) / 100 disks ~ 10 day One disk failure every 10 days  Remember, this is just for one copy  Data organization should be very resilient to disk failure

Junghoo "John" Cho (UCLA Computer Science) 19 Data Size and Query Processing  Index size: 50TB  50 disks  Potentially 50-machine cluster to answer a query  If one machine goes down, the cluster goes down  Multi-tier index structure can be helpful  Tier 1: Popular (high PageRank) page index  Tier 2: Less popular page index  Most queries can be answered by tier-1 cluster (with fewer machines)

Junghoo "John" Cho (UCLA Computer Science) 20 Implication of Query Load  3000 queries / sec  Rule of thumb: 1 query / sec per CPU  Depends on number of disks, memory size, etc.  ~ 3000 machines just to answer queries  5KB / answer page  3000 x 5KB x 8bit ~ 120 Mbps  Half dedicated OC3 line (155Mbps) ~ $300,000

Junghoo "John" Cho (UCLA Computer Science) 21 Hardware at Google  ~10K Intel-Linux cluster  Assuming 99.9% uptime (8 hour downtime per year)  10 machines are always down  Nightmare for system administrators  Assuming 3-year hardware replacement  Set up, replace and dump 10 machines every day  Heterogeneity is unavoidable Position Requirements: Able to lift/move lbs equipment on a daily basis.  Job posting at Google