CS246: Search-Engine Scale


CS246: Search-Engine Scale Junghoo “John” Cho UCLA

Search Engine: High-Level Architecture
- Crawler: page download & refresh
- Indexer: page analysis, index construction
- Query Processor: page ranking, query logging
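
To make the data flow between these three modules concrete, here is a minimal sketch; the class and method names (Crawler.fetch, Indexer.build, QueryProcessor.answer) are illustrative placeholders, not the interfaces of any real engine.

```python
# Minimal sketch of the three major modules and the data that flows between them.
# All names are hypothetical; a real engine distributes each module over many machines.

class Crawler:
    def fetch(self, url: str) -> str:
        """Download one page and return its HTML (stubbed out here)."""
        return "<html>example page</html>"

class Indexer:
    def build(self, pages: dict[str, str]) -> dict[str, list[str]]:
        """Build a tiny inverted index: term -> list of URLs."""
        index: dict[str, list[str]] = {}
        for url, html in pages.items():
            for term in html.lower().split():
                index.setdefault(term, []).append(url)
        return index

class QueryProcessor:
    def __init__(self, index: dict[str, list[str]]):
        self.index = index

    def answer(self, query: str) -> list[str]:
        """Return URLs containing every query term (no ranking in this sketch)."""
        results = None
        for term in query.lower().split():
            urls = set(self.index.get(term, []))
            results = urls if results is None else results & urls
        return sorted(results or [])
```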

Junghoo "John" Cho (UCLA Computer Science) General Architecture Junghoo "John" Cho (UCLA Computer Science)

Scale of Web Search
- Number of pages indexed: ~10 billion (vs. 23 million pages in 1998)
- Index refresh interval: once per month (~1,200 pages/sec)
- Queries: ~40,000 queries/sec
- Services run on clusters of commodity Intel-Linux boxes

Other Relevant Statistics
- Average page size: 2 MB
- Average query size: 40 B
- Average result size: 5 KB

Size of Dataset
- Total raw HTML size: 10 billion pages x 2 MB = 20 PB! (vs. 150 GB in 1998)
- Inverted index is roughly the same size as the raw corpus: another 20 PB for the index itself
- With simple compression, a 3:1 compression ratio is easy to get: ~7 PB compressed (let's use 8 PB for simplicity)
- Number of disks necessary for one copy: 8 PB / 10 TB per disk = 800 disks
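
A quick back-of-the-envelope sketch of the sizing above, using only the numbers stated on this slide:

```python
# Back-of-the-envelope sizing for one copy of the index (numbers from the slide).
PAGES = 10e9              # pages indexed
PAGE_SIZE = 2e6           # average page size: 2 MB
DISK_CAPACITY = 10e12     # 10 TB per disk
COMPRESSION = 3           # ~3:1 compression on the inverted index

raw_corpus = PAGES * PAGE_SIZE                       # ~20 PB of raw HTML
index_uncompressed = raw_corpus                      # inverted index ~ same size as corpus
index_compressed = index_uncompressed / COMPRESSION  # ~7 PB, rounded up to 8 PB
disks_one_copy = 8e15 / DISK_CAPACITY                # ~800 disks for a single copy

print(f"raw corpus: {raw_corpus / 1e15:.0f} PB")
print(f"compressed index: {index_compressed / 1e15:.1f} PB")
print(f"disks for one copy: {disks_one_copy:.0f}")
```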

Impact of Data Size on Web Crawling
- An efficient crawl is very important
  - At 1 page/sec per machine → 1,200 machines just for crawling!
  - Parallelization through multi-threading or an event queue is necessary
  - Complex crawling algorithms -- no, no!
- With reasonable optimization: ~100 pages/sec/machine (10 ms/page) → ~12 machines for crawling (see the sketch below)
- Bandwidth consumption: 1,200 pages/sec x 2 MB x 8 bits ~ 20 Gbps
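
A minimal sketch of the "many fetches in flight per machine" idea, using only Python's standard library; the seed URLs are hypothetical placeholders, and a production crawler would add per-host politeness delays, robots.txt handling, URL frontiers, and so on.

```python
# Thread-pool crawler sketch. The slide's ~100 pages/sec/machine corresponds to
# ~10 ms of machine time per page amortized; since each individual fetch takes
# far longer than 10 ms, many fetches must be in flight at once.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str, timeout: float = 10.0):
    """Download one page; return (url, body) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception:
        return url, None

seeds = ["https://example.com/", "https://example.org/"]  # hypothetical seed URLs

with ThreadPoolExecutor(max_workers=100) as pool:         # ~100 fetches in flight
    for url, body in pool.map(fetch, seeds):
        if body is not None:
            print(url, len(body), "bytes")
```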

Impact of Data Size on Indexing
- Efficient indexing is critical: 1,200 pages/sec
- Indexing steps
  - Load pages – network intensive
  - Extract and sort words, create postings – CPU intensive
  - Write postings – disk intensive
- Pipeline the indexing steps across partitions so that network, CPU, and disk stay busy at the same time (see the sketch below):
  P1: L S W
  P2:    L S W
  P3:       L S W
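
A minimal sketch of the load → sort → write pipeline using Python's standard threading and queue modules; the batches are toy data, and the load and write stages are stubbed out.

```python
# Pipelined indexing sketch: while one batch is being sorted (CPU), the next is
# being loaded (network) and the previous one is being written (disk).
import queue
import threading

load_q: queue.Queue = queue.Queue(maxsize=2)
sort_q: queue.Queue = queue.Queue(maxsize=2)

def load(batches):
    for batch in batches:                 # network-intensive stage (stubbed)
        load_q.put(batch)
    load_q.put(None)                      # signal end of stream

def sort_stage():
    while (batch := load_q.get()) is not None:
        sort_q.put(sorted(batch))         # CPU-intensive stage
    sort_q.put(None)

def write_stage():
    while (postings := sort_q.get()) is not None:
        print("writing", postings)        # disk-intensive stage (stubbed)

batches = [["web", "search", "engine"], ["crawl", "index", "query"]]
threads = [threading.Thread(target=load, args=(batches,)),
           threading.Thread(target=sort_stage),
           threading.Thread(target=write_stage)]
for t in threads: t.start()
for t in threads: t.join()
```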

Simplified Indexing Model
- Copy 8 PB of data from the disks in S to the disks in D over the network
  - S: crawling machines, D: indexing machines
  - 8 PB of crawled data, 8 PB of index; ignore the actual processing
- 4 x 10 TB disks per machine → ~200 machines in S and in D
- 16 GB of RAM per machine
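
The machine counts follow directly from the disk budget; a quick check using the slide's numbers:

```python
# Machine counts for the simplified indexing model.
DATA = 8e15                 # 8 PB to copy
DISK = 10e12                # 10 TB per disk
DISKS_PER_MACHINE = 4
RAM_PER_MACHINE_GB = 16

disks_needed = DATA / DISK                              # 800 disks for one copy
machines_per_side = disks_needed / DISKS_PER_MACHINE    # ~200 machines in S and in D
total_ram_gb = 2 * machines_per_side * RAM_PER_MACHINE_GB  # 6,400 GB across S and D
print(machines_per_side, total_ram_gb)
```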

Junghoo "John" Cho (UCLA Computer Science) Data Flow Disk → RAM → Network → RAM → Disk No hardware is error free Disk (undetected) error rate ~ 1 per 1013 Network error rate ~ 1 bit per 1012 Memory soft error rate ~ 1 bit error per month (1GB) Typically go unnoticed for small data Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Data Flow Assuming 1Gps link between machines 40TB per machine, 100MB/s transfer rate → 40,000 seconds (~half day) just for data transfer Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Errors from Disk Undetected disk error rate ~ 1 per 1013 8 PB * 8 bits (= 64 x 1015 bits) data read in total 8 PB * 8 bits (= 64 x 1015 bits) data write in total → 2 x 6,400 (= 12,800) bit errors (just because we are reading/writing from disk!!) Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Errors from Memory 1 bit error per month per 1 GB 400 machines with 16 GB each → 6,400 GB memory → 6,400 bit errors/month (= 100 bit errors/half day) (just because so we use so many systems!!) Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Errors from Network 1 error per 1012 8 PB x 8 bits (= 64 x 1015) transfer → 64,000 bit errors from network transfer!! Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Impact of Data Size on Data Corruption (1)
- During indexing, something always goes wrong
  - 64,000 bit errors from the network
  - 6,400 bit errors from disk I/O
  - 100 bit errors from memory
- Very difficult to trace and debug
  - These are pure hardware errors, but it is very difficult to tell a hardware error from a software bug
  - Software bugs may also cause similar errors
- "Consumer" OSes and applications are not designed to deal with such errors yet

Impact of Data Size on Data Corruption (2)
- The data pipeline should verify data integrity at every step
  - Data corruption in the middle of, say, a sort completely ruins the sort
  - Need a data-verification step after every operation (see the sketch below)
- Algorithms and data structures must be resilient to data corruption
  - Checkpoints, etc.
- ECC RAM is a must
  - Detects and corrects most single-bit errors
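
A minimal sketch of per-step data verification: checksum each block when it is produced and verify the checksum before the next stage consumes it. The file name is an illustrative placeholder.

```python
# Checksum every block between pipeline stages so silent corruption is caught
# early instead of propagating into the sort/merge steps.
import zlib

def write_block(path: str, data: bytes) -> int:
    """Write a data block and return its CRC32 checksum."""
    with open(path, "wb") as f:
        f.write(data)
    return zlib.crc32(data)

def read_block(path: str, expected_crc: int) -> bytes:
    """Read a block back and fail loudly if it was corrupted in storage or transit."""
    with open(path, "rb") as f:
        data = f.read()
    if zlib.crc32(data) != expected_crc:
        raise IOError(f"checksum mismatch for {path}: data corrupted")
    return data

crc = write_block("postings.part0", b"example posting data")  # hypothetical file name
postings = read_block("postings.part0", crc)                  # verified before the next stage
```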

Scale of Query Processing
- 40,000 queries/sec
- Rule of thumb: 10 requests/sec/CPU
  - The exact number depends on the number of disks, memory size, etc.
- → ~4,000 machines just to answer queries
- 5 KB per answer page: 40,000 x 5 KB x 8 bits ~ 1.6 Gbps
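
A quick check of the query-serving numbers above, using the slide's rule of thumb:

```python
# Back-of-the-envelope query-serving capacity (numbers from the slide).
QPS = 40_000                   # queries per second
QPS_PER_CPU = 10               # rule-of-thumb requests/sec per CPU
RESULT_BITS = 5e3 * 8          # 5 KB answer page, in bits

machines = QPS / QPS_PER_CPU              # ~4,000 machines just for queries
egress_gbps = QPS * RESULT_BITS / 1e9     # ~1.6 Gbps of result traffic
print(machines, egress_gbps)
```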

Scale of Query Processing
- Index size: 8 PB → 800 disks, 200 machines
  - Potentially a 200-machine cluster is needed to answer a single query
  - If one machine goes down, the whole cluster may go down
- A multi-tier index structure can help (see the sketch below)
  - Tier 1: index of popular (high-PageRank) pages
  - Tier 2: index of less popular pages
  - Answer most queries with the tier-1 cluster alone
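
A minimal sketch of two-tier query serving: try the small popular-page index first and fall back to the full index only when tier 1 returns too few hits. The index objects, their search() method, and the threshold are hypothetical placeholders.

```python
# Two-tier query serving sketch: most queries never touch the large tier-2 cluster.
MIN_RESULTS = 10  # hypothetical threshold for a "good enough" tier-1 answer

def search_two_tier(query: str, tier1, tier2) -> list[str]:
    """Answer from tier 1 (popular pages) when possible; fall back to tier 2."""
    results = tier1.search(query)            # small cluster, high-PageRank pages
    if len(results) >= MIN_RESULTS:
        return results
    return results + tier2.search(query)     # large cluster, long-tail pages
```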

Impact of Data Size on System Reliability
- Disk mean time to failure: ~3 years
  - → (3 x 365 days) / 800 disks ~ 1 day: one disk failure every day!!
  - Remember, this is just for one copy of the data
- Data organization should be very resilient to disk failure
  - We should never lose data even if we lose a disk
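
A quick sketch of the failure-rate arithmetic above, using the slide's MTTF and disk count:

```python
# Expected disk-failure rate for one copy of the index (numbers from the slide).
MTTF_DAYS = 3 * 365        # ~3-year mean time to failure per disk
DISKS = 800                # disks holding a single copy of the index

days_between_failures = MTTF_DAYS / DISKS        # ~1.4 days -> roughly one failure per day
failures_per_year = DISKS * 365 / MTTF_DAYS      # ~270 failures per year, for one copy
print(f"one failure every ~{days_between_failures:.1f} days "
      f"(~{failures_per_year:.0f} failures/year)")
```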

Hardware at Google
- ~10K-machine Intel-Linux cluster
- Assuming 99.9% uptime (~8 hours of downtime per year), 10 machines are always down
  - A nightmare for system administrators
- Assuming a 3-year hardware replacement cycle, set up, replace, and retire ~10 machines every day
  - Hardware heterogeneity is unavoidable
- "Position Requirements: Able to lift/move 20-30 lbs equipment on a daily basis." -- job posting at Google
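
A quick check of the cluster-operations arithmetic above; the uptime and replacement-cycle figures are the slide's assumptions:

```python
# Cluster-operations arithmetic for a ~10K-machine cluster.
MACHINES = 10_000
UPTIME = 0.999                 # assumed per-machine availability (99.9%)
REPLACEMENT_YEARS = 3          # assumed hardware lifetime

always_down = MACHINES * (1 - UPTIME)                     # ~10 machines down at any time
replaced_per_day = MACHINES / (REPLACEMENT_YEARS * 365)   # ~9 machines replaced per day
print(always_down, round(replaced_per_day, 1))
```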