CS246 Search Engine Scale.

High-Level Architecture
- Major modules for a search engine?
  1. Crawler: page download and refresh
  2. Indexer: index construction, PageRank computation
  3. Query processor: page ranking, query logging
Junghoo "John" Cho (UCLA Computer Science)

General Architecture (diagram)

Scale of Web Search
- Number of pages indexed: ~10B pages
- Index refresh interval: once per month, ~1,200 pages/sec
- Query load: 40,000 queries/sec
- Services often run on commodity Intel-Linux boxes
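The crawl rate implied by a monthly refresh can be checked with a few lines of arithmetic. This is only a sketch: the 30-day month and the function names are assumptions, not part of the slides. Under that assumption, 10B pages per month works out to roughly 3,900 pages/sec, while 1,200 pages/sec corresponds to about 3B pages per month. The later slides use the 1,200 pages/sec figure.

```python
# Back-of-the-envelope crawl-rate check (assumes a 30-day month).
SECONDS_PER_MONTH = 30 * 24 * 3600


def pages_per_second(pages, interval_s=SECONDS_PER_MONTH):
    """Crawl rate needed to refresh `pages` once per interval."""
    return pages / interval_s


def pages_per_interval(rate, interval_s=SECONDS_PER_MONTH):
    """Number of pages a sustained crawl rate covers in one interval."""
    return rate * interval_s


if __name__ == "__main__":
    print(f"10B pages/month needs {pages_per_second(10e9):,.0f} pages/sec")
    print(f"1,200 pages/sec covers {pages_per_interval(1200) / 1e9:.1f}B pages/month")
```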

Other Statistics
- Average page size: 15 KB
- Average query size: 40 B
- Average result size: 5 KB
- Average number of links per page: 10

Size of Dataset (1)
- Total raw HTML data size: 10B pages x 15 KB = 150 TB
- The inverted index is roughly the same size as the raw corpus: 150 TB for the index itself
- With appropriate compression (3:1 ratio), about 100 TB of data resides on disk

Size of Dataset (2)
- Number of disks necessary for one copy: (100 TB) / (1 TB per disk) = 100 disks
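The dataset-size estimate on the two slides above can be reproduced as follows. This is a minimal sketch assuming decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes); the function name is illustrative.

```python
# Dataset-size estimate, using decimal units (an assumption).
KB, TB = 10**3, 10**12


def dataset_size(pages, avg_page_bytes, compression_ratio, disk_bytes):
    raw = pages * avg_page_bytes          # raw HTML
    index = raw                           # index assumed as large as the raw corpus
    on_disk = (raw + index) // compression_ratio
    disks = -(-on_disk // disk_bytes)     # ceiling division
    return raw, on_disk, disks


raw, on_disk, disks = dataset_size(10 * 10**9, 15 * KB, 3, 1 * TB)

if __name__ == "__main__":
    print(f"raw: {raw // TB} TB, on disk: {on_disk // TB} TB, disks: {disks}")
```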

Data Size and Crawling
- An efficient crawler is very important
  - At 1 page/sec per machine -> 1,200 machines just for crawling
  - Parallelization through threads or an event queue is necessary
  - Complex crawling algorithm -- No, No!
- Well-optimized crawler
  - ~100 pages/sec (10 ms/page)
  - ~12 machines for crawling
- Bandwidth consumption
  - 1,200 x 15 KB x 8 bits ~ 150 Mbps
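The machine and bandwidth counts above follow from simple division. A small sketch, assuming 15 KB = 15,000 bytes (the helper names are illustrative):

```python
import math


def crawl_machines(target_rate, per_machine_rate):
    """Machines needed to sustain target_rate pages/sec."""
    return math.ceil(target_rate / per_machine_rate)


def crawl_bandwidth_mbps(pages_per_sec, page_bytes):
    """Inbound bandwidth in megabits per second."""
    return pages_per_sec * page_bytes * 8 / 1e6


if __name__ == "__main__":
    print(crawl_machines(1200, 1))            # naive crawler, 1 page/sec
    print(crawl_machines(1200, 100))          # optimized crawler, 100 pages/sec
    print(crawl_bandwidth_mbps(1200, 15_000)) # close to the ~150 Mbps on the slide
```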

Data Size and Indexing
- Efficient indexing is very important
  - 1,200 pages/sec
- Indexing steps
  - Load page, extract words (L): network/disk intensive
  - Sort words, postings (S): CPU intensive
  - Write sorted postings (W): disk intensive
- Pipeline the indexing steps
    P1: L S W
    P2:   L S W
    P3:     L S W
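One way to picture the load/sort/write pipeline is three threads connected by queues, so that one batch can be loaded while the previous one is being sorted. This is a toy sketch, not the indexer described in the slides: the stage functions, the in-memory `runs` list standing in for disk writes, and all names are assumptions. In CPython the sort stage would not truly run in parallel with other Python code; a real system would use separate processes or machines.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the input stream


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def load(batch):
    # L: turn a batch of (doc_id, text) pages into (word, doc_id) postings
    return [(w, doc_id) for doc_id, text in batch for w in text.lower().split()]


def sort_postings(postings):
    # S: sort postings by word, then by document id
    return sorted(postings)


def run_pipeline(batches):
    q_load, q_sort, q_write, q_done = (queue.Queue() for _ in range(4))
    runs = []  # stands in for sorted runs written to disk

    def write(postings):
        # W: "write" one sorted run
        runs.append(postings)
        return len(postings)

    wiring = [(load, q_load, q_sort), (sort_postings, q_sort, q_write),
              (write, q_write, q_done)]
    threads = [threading.Thread(target=stage, args=w) for w in wiring]
    for t in threads:
        t.start()
    for batch in batches:
        q_load.put(batch)
    q_load.put(SENTINEL)
    for t in threads:
        t.join()
    return runs


if __name__ == "__main__":
    print(run_pipeline([[(1, "b a")], [(2, "c a")]]))
```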

Simplified Indexing Model
- Copy 50 TB of data from disks in S to disks in D through the network
  - S: crawling machines
  - D: indexing machines
- 50 TB crawled data, 50 TB index
  - Ignore actual processing
- 1 TB disk per machine
  - ~50 machines in S and D each
- 8 GB RAM per machine

Data Flow
- Disk -> RAM -> Network -> RAM -> Disk
- No hardware is error free
  - Disk (undetected) error rate ~ 1 per 10^13 bytes
  - Network error rate ~ 1 bit per 10^12 bits
  - Memory soft error rate ~ 1 bit error per month (per 1 GB)
  - These typically go unnoticed for small data

Data Flow
- Assuming a 1 Gbit/s link between machines
  - 1 TB per machine, 30 MB/s transfer rate
  - -> Half a day just for data transfer
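The half-day figure can be checked directly. In this sketch (decimal units assumed, names illustrative), the 30 MB/s rate is below the 125 MB/s a 1 Gbit/s link can carry, so it is the limiting rate.

```python
def transfer_hours(num_bytes, rate_bytes_per_s):
    """Hours needed to move num_bytes at a sustained rate."""
    return num_bytes / rate_bytes_per_s / 3600


link_rate = 1e9 / 8                  # 1 Gbit/s expressed in bytes/sec (125 MB/s)
effective = min(30e6, link_rate)     # the slower of the two rates dominates
hours = transfer_hours(1e12, effective)

if __name__ == "__main__":
    print(f"{hours:.1f} hours per machine")  # a little over 9 hours
```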

Errors from Disk
- Undetected disk error rate ~ 1 per 10^13 bytes
- 5 x 10^13 bytes read in total
- 5 x 10^13 bytes written in total
- -> ~10 byte errors from disk reads/writes

Errors from Memory
- 1 bit error per month per 1 GB
- 100 machines with 8 GB each
  - -> 8 x 100 bit errors/month
  - -> ~15 bit errors per half day
  - -> ~15 byte errors from memory corruption

Errors from Network
- 1 bit error per 10^12 bits
- 5 x 8 x 10^13 bits transferred
  - -> 400 bit errors scattered around the stream
  - -> ~400 byte errors
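The expected error counts for the three sources can be reproduced from the stated rates. A rough sketch: the function names and the 30-day month are assumptions, and the memory figure comes out near 13, which the slides round to about 15.

```python
def disk_errors(bytes_read, bytes_written, bytes_per_error=1e13):
    """Expected undetected disk errors for a read-plus-write workload."""
    return (bytes_read + bytes_written) / bytes_per_error


def memory_errors(machines, gb_per_machine, days, days_per_month=30):
    """Expected soft errors at 1 bit error per GB per month."""
    return machines * gb_per_machine * days / days_per_month


def network_errors(bytes_transferred, bits_per_error=1e12):
    """Expected bit errors on the wire."""
    return bytes_transferred * 8 / bits_per_error


if __name__ == "__main__":
    print(disk_errors(5e13, 5e13))        # 50 TB read, 50 TB written
    print(memory_errors(100, 8, 0.5))     # 100 machines, 8 GB each, half a day
    print(network_errors(5e13))           # 50 TB over the network
```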

Data Size and Errors (1)
- During index construction/copy, something always goes wrong
  - ~400 byte errors from network, ~10 byte errors from disk, ~15 byte errors from memory
- Very difficult to trace and debug
  - Particularly disk and memory errors
  - No OS or application assumes such errors yet
  - These are pure hardware errors, but it is very difficult to tell a hardware error from a software bug
  - Software bugs may also cause similar errors

Data Size and Errors (2)
- Very difficult to trace and debug
  - Data corruption in the middle of, say, sorting completely breaks the sort
- Need a data-verification step after every operation
- Algorithms and data structures must be resilient to data corruption
  - Checkpoints, etc.
- ECC RAM is a must
  - Can detect most 1-bit errors
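A data-verification step could look like the following sketch, which stores a digest with each block and re-checks it after every read or transfer, plus a cheap order check after sorting. The slides do not say which checks were used. SHA-256, the 32-byte prefix layout, and the function names are assumptions chosen for illustration.

```python
import hashlib

DIGEST_LEN = 32  # SHA-256 digest size in bytes


def seal(block: bytes) -> bytes:
    """Prefix a block with its SHA-256 digest before writing or sending it."""
    return hashlib.sha256(block).digest() + block


def unseal(sealed: bytes) -> bytes:
    """Verify the digest after reading or receiving; raise on corruption."""
    digest, block = sealed[:DIGEST_LEN], sealed[DIGEST_LEN:]
    if hashlib.sha256(block).digest() != digest:
        raise ValueError("block failed verification")
    return block


def is_sorted(items) -> bool:
    """Post-sort check: every element is <= its successor."""
    return all(a <= b for a, b in zip(items, items[1:]))


if __name__ == "__main__":
    sealed = seal(b"sorted postings")
    print(unseal(sealed))
    flipped = bytearray(sealed)
    flipped[-1] ^= 0x01  # simulate a single-bit error
    try:
        unseal(bytes(flipped))
    except ValueError as err:
        print("detected:", err)
```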

Data Size and Reliability
- Disk mean time to failure ~ 3 years
  - -> (3 x 365 days) / 100 disks ~ 10 days
  - One disk failure every 10 days
  - Remember, this is just for one copy
- Data organization should be very resilient to disk failure
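The failure interval is the per-disk mean time to failure divided by the number of disks. A minimal sketch, assuming failures are spread evenly over time (the function name is illustrative):

```python
def days_between_failures(mttf_years, num_disks):
    """Average gap between disk failures across the whole pool."""
    return mttf_years * 365 / num_disks


if __name__ == "__main__":
    print(f"{days_between_failures(3, 100):.1f} days")  # about 11, i.e. roughly 10
```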

Data Size and Query Processing
- Index size: 50 TB -> 50 disks
  - Potentially a 50-machine cluster to answer a query
  - If one machine goes down, the cluster goes down
- A multi-tier index structure can be helpful
  - Tier 1: popular (high PageRank) page index
  - Tier 2: less popular page index
  - Most queries can be answered by the tier-1 cluster (with fewer machines)
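The multi-tier idea can be sketched as a lookup that consults the small tier-1 index first and only falls through to tier 2 when it finds too few results. The fall-through rule, the `min_results` threshold, the dictionary-based indexes, and the function names are all assumptions made for illustration. The slides only state that most queries can be answered by tier 1.

```python
def lookup(index, terms):
    """Return sorted ids of documents containing every term (AND semantics)."""
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


def tiered_search(tiers, terms, min_results=2):
    """Query tiers in order; stop as soon as enough results are found."""
    results, tiers_used = [], 0
    for index in tiers:
        tiers_used += 1
        for doc_id in lookup(index, terms):
            if doc_id not in results:
                results.append(doc_id)
        if len(results) >= min_results:
            break
    return results, tiers_used


if __name__ == "__main__":
    tier1 = {"web": [1, 2], "search": [1, 2]}             # popular pages
    tier2 = {"web": [7], "search": [7], "rare": [9]}      # less popular pages
    print(tiered_search([tier1, tier2], ["web", "search"]))
    print(tiered_search([tier1, tier2], ["rare"]))
```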

Implication of Query Load
- 40,000 queries/sec
- Rule of thumb: 10 requests/sec per CPU
  - Depends on number of disks, memory size, etc.
  - ~4,000 machines just to answer queries
- 5 KB per answer page
  - 40,000 x 5 KB x 8 bits ~ 1.6 Gbps
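The query-serving estimate is the same kind of arithmetic as the crawling estimate. A short sketch with illustrative names, assuming 5 KB = 5,000 bytes and one CPU per machine:

```python
import math


def query_machines(queries_per_sec, per_cpu_rate=10):
    """Machines needed at the rule-of-thumb rate per CPU."""
    return math.ceil(queries_per_sec / per_cpu_rate)


def result_bandwidth_gbps(queries_per_sec, result_bytes):
    """Outbound bandwidth for result pages, in gigabits per second."""
    return queries_per_sec * result_bytes * 8 / 1e9


if __name__ == "__main__":
    print(query_machines(40_000))
    print(result_bandwidth_gbps(40_000, 5_000))
```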

Hardware at Google
- ~10K-machine Intel-Linux cluster
- Assuming 99.9% uptime (8 hours of downtime per year)
  - 10 machines are always down
  - Nightmare for system administrators
- Assuming 3-year hardware replacement
  - Set up, replace and dump 10 machines every day
  - Heterogeneity is unavoidable
- "Position Requirements: Able to lift/move 20-30 lbs equipment on a daily basis." (job posting at Google)
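The operational numbers on this slide follow from the cluster size. A minimal sketch with illustrative function names, assuming a 365-day year:

```python
def machines_down(cluster_size, uptime):
    """Expected number of machines down at any moment."""
    return cluster_size * (1 - uptime)


def downtime_hours_per_year(uptime):
    """Downtime per machine per year implied by an uptime fraction."""
    return (1 - uptime) * 365 * 24


def replacements_per_day(cluster_size, lifetime_years):
    """Machines to replace daily under a fixed hardware lifetime."""
    return cluster_size / (lifetime_years * 365)


if __name__ == "__main__":
    print(f"{machines_down(10_000, 0.999):.0f} machines down")
    print(f"{downtime_hours_per_year(0.999):.1f} hours of downtime per year")
    print(f"{replacements_per_day(10_000, 3):.1f} replacements per day")
```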