CS246 Search Engine Scale.

CS246 Search Engine Scale

High-Level Architecture
Major modules for a search engine? Crawler Page download & Refresh Indexer Index construction PageRank computation Query Processor Page ranking Query logging Junghoo "John" Cho (UCLA Computer Science)

General Architecture Junghoo "John" Cho (UCLA Computer Science)

Scale of Web Search Number of pages indexed: ~ 10B pages
Index refresh interval: Once per month ~ 1200 pages/sec Number of queries per: 40,000 queries/sec Services often run on commodity Intel-Linux boxes Junghoo "John" Cho (UCLA Computer Science)

Other Statistics Average page size: 15KB Average query size: 40B
Average result size: 5KB Average number of links per page: 10 Junghoo "John" Cho (UCLA Computer Science)

Size of Dataset (1) Total raw HTML data size 10G x 15KB = 150 TB!
Inverted index roughly the same size as raw corpus: 150 TB for index itself With appropriate compression, 3:1 compression ratio 100 TB data residing in disk Junghoo "John" Cho (UCLA Computer Science)

Size of Dataset (2) Number of disks necessary for one copy
(100 TB) / (1TB per disk) = 100 disk Junghoo "John" Cho (UCLA Computer Science)

Data Size and Crawling Efficient crawl is very important
1 page/sec  1200 machines just for crawling Parallelization through thread/event queue necessary Complex crawling algorithm -- No, No! Well-optimized crawler ~ 100 pages/sec (10 ms/page) ~ 12 machines for crawling Bandwidth consumption 1200 x 15KB x 8bit ~ 150Mbps Junghoo "John" Cho (UCLA Computer Science)

Data Size and Indexing Efficient Indexing is very important
1200 pages / sec Indexing steps Load page, extract words – Network/disk intensive Sort word, postings – CPU intensive Write sorted postings – Disk intensive Pipeline indexing steps P1 L S W P2 L S W P3 L S W Junghoo "John" Cho (UCLA Computer Science)

Simplified Indexing Model
Copy 50TB data from disks in S to disks in D through network S: crawling machines D: indexing machines 50TB crawled data, 50TB index Ignore actual processing 1TB disk per machine ~50 machines in S and D each 8GB RAM per machine : Junghoo "John" Cho (UCLA Computer Science)

Data Flow Disk  RAM  Network  RAM  Disk No hardware is error free
Disk (undetected) error rate ~ 1 per 1013 Network error rate ~ 1 bit per 1012 Memory soft error rate ~ 1 bit error per month (1GB) Typically go unnoticed for small data Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Data Flow Assuming 1Gbit/s link between machines
1TB per machine, 30MB/s transfer rate  Half day just for data transfer Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Errors from Disk Undetected disk error rate ~ 1 per 1013
5x1013 bytes data read in total 5X1013 bytes data write in total  10 byte errors from disk read/write Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Errors from Memory 1 bit error per month per 1GB
100 machines with 8GB each  8*100 bit errors/month  15 bit error per half day  15 byte error from memory corruption Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Errors from Network 1 error per 1012
5 x 8 x 1013 bits transfer  400 bit errors scatters around the stream  400 byte errors Network Disk RAM Disk RAM Junghoo "John" Cho (UCLA Computer Science)

Very difficult to trace and debug
Data Size and Errors (1) During index construction/copy, something always goes wrong 400 byte errors from network, 30 byte errors from disk, 15 byte errors from memory Very difficult to trace and debug Particularly disk and memory error No OS/application assumes such errors yet Pure hardware errors, but very difficult to differentiate hardware error and software bug Software bugs may also cause similar errors Junghoo "John" Cho (UCLA Computer Science)

Data Size and Errors (2) Very difficult to trace and debug
Data corruption in the middle of, say, sorting completely screws up the sorting Need a data-verification step after every operation Algorithm, data structure must be resilient to data corruption Check points, etc. ECC RAM is a must Can detect most of 1 bit errors Junghoo "John" Cho (UCLA Computer Science)

Data Size and Reliability
Disk mean time to failure ~ 3 years  (3 x 365 days) / 100 disks ~ 10 day One disk failure every 10 days Remember, this is just for one copy Data organization should be very resilient to disk failure Junghoo "John" Cho (UCLA Computer Science)

Data Size and Query Processing
Index size: 50TB  50 disks Potentially 50-machine cluster to answer a query If one machine goes down, the cluster goes down Multi-tier index structure can be helpful Tier 1: Popular (high PageRank) page index Tier 2: Less popular page index Most queries can be answered by tier-1 cluster (with fewer machines) Junghoo "John" Cho (UCLA Computer Science)

Implication of Query Load
40,000 queries / sec Rule of thumb: 10 requests / sec per CPU Depends on number of disks, memory size, etc. ~ 4,000 machines just to answer queries 5KB / answer page 40,000 x 5KB x 8bit ~ 1.6 Gbps Junghoo "John" Cho (UCLA Computer Science)

Hardware at Google ~10K Intel-Linux cluster
Assuming 99.9% uptime (8 hour downtime per year) 10 machines are always down Nightmare for system administrators Assuming 3-year hardware replacement Set up, replace and dump 10 machines every day Heterogeneity is unavoidable Position Requirements: Able to lift/move lbs equipment on a daily basis. Job posting at Google Junghoo "John" Cho (UCLA Computer Science)

CS246 Search Engine Scale.

Similar presentations

Presentation on theme: "CS246 Search Engine Scale."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS246 Search Engine Scale.

Similar presentations

Presentation on theme: "CS246 Search Engine Scale."— Presentation transcript:

Similar presentations

About project

Feedback