Published by Kory Samson Wells. Modified over 9 years ago.
1
CS246 Search Engine Scale
2
Junghoo "John" Cho (UCLA Computer Science)
High-Level Architecture
  Major modules for a search engine?
  1. Crawler
     Page download & refresh
  2. Indexer
     Index construction
     PageRank computation
  3. Query Processor
     Page ranking
     Query logging
3
General Architecture
4
Scale of Web Search
  Number of pages indexed: ~10B pages
  Index refresh interval: once per month (~1200 pages/sec)
  Number of queries per day: 290M in Dec 2008 (~3000 queries/sec)
  Services often run on commodity Intel-Linux boxes
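The per-second figures on this slide are averages over the stated period. A minimal sketch of that arithmetic, assuming a 30-day month; the helper name `per_second` is illustrative and not from the slides. Note that 10B pages spread over 30 days comes to roughly 3,900 pages/sec, while a rate of ~1200 pages/sec corresponds to refreshing about 3B pages per month.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(count: float, days: float) -> float:
    """Average events per second when `count` events are spread over `days` days."""
    return count / (days * SECONDS_PER_DAY)

# 290M queries in one day
query_rate = per_second(290e6, 1)

# 10B pages refreshed once per month (assumption: 30-day month)
crawl_rate = per_second(10e9, 30)

# Pages covered in one 30-day month at the slide's ~1200 pages/sec
pages_per_month_at_1200 = 1200 * 30 * SECONDS_PER_DAY

print(f"queries/sec: {query_rate:.0f}")
print(f"pages/sec for 10B pages per month: {crawl_rate:.0f}")
print(f"pages per month at 1200 pages/sec: {pages_per_month_at_1200:.2e}")
```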
5
Other Statistics
  Average page size: 15 KB
  Average query size: 40 B
  Average result size: 5 KB
  Average number of links per page: 10
6
Size of Dataset (1)
  Total raw HTML data size: 10G x 15 KB = 150 TB!
  Inverted index roughly the same size as raw corpus: 150 TB for the index itself
  With appropriate compression (3:1 compression ratio): 100 TB of data residing on disk
7
Size of Dataset (2)
  Number of disks necessary for one copy: (100 TB) / (1 TB per disk) = 100 disks
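The dataset-size estimate can be reproduced in a few lines. This is only a sketch of the slide arithmetic, using decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes) as an assumption; the variable names are illustrative.

```python
KB = 10**3
TB = 10**12

pages = 10 * 10**9            # ~10B pages indexed
avg_page_bytes = 15 * KB      # average page size

raw_html = pages * avg_page_bytes          # raw corpus
index = raw_html                           # index roughly the same size as the corpus
compressed = (raw_html + index) // 3       # 3:1 compression ratio
disks_per_copy = compressed // (1 * TB)    # 1 TB per disk

print(raw_html // TB, "TB raw HTML")
print(compressed // TB, "TB on disk after compression")
print(disks_per_copy, "disks for one copy")
```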
8
Data Size and Crawling
  Efficient crawling is very important
    1 page/sec: 1200 machines just for crawling
    Parallelization through threads/event queue is necessary
    Complex crawling algorithm: No, no!
  Well-optimized crawler
    ~100 pages/sec (10 ms/page)
    ~12 machines for crawling
  Bandwidth consumption
    1200 x 15 KB x 8 bits ~ 150 Mbps
    One dedicated OC3 line (155 Mbps) for crawling, ~$400,000 per year
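A short sketch of the crawler sizing above: the machine count is the target crawl rate divided by the per-machine rate, and bandwidth is pages/sec times page size in bits. The function name `machines_needed` is hypothetical; the numbers are the ones on the slide.

```python
import math

TARGET_PAGES_PER_SEC = 1200
AVG_PAGE_BYTES = 15_000  # 15 KB, decimal units assumed

def machines_needed(target_rate: float, per_machine_rate: float) -> int:
    """Machines required to sustain `target_rate` pages/sec."""
    return math.ceil(target_rate / per_machine_rate)

naive = machines_needed(TARGET_PAGES_PER_SEC, 1)        # 1 page/sec per machine
optimized = machines_needed(TARGET_PAGES_PER_SEC, 100)  # 100 pages/sec per machine

bandwidth_mbps = TARGET_PAGES_PER_SEC * AVG_PAGE_BYTES * 8 / 1e6

print(naive, optimized, bandwidth_mbps)
```

The exact bandwidth is 144 Mbps, which the slide rounds to ~150 Mbps, just under one OC3 line.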
9
Data Size and Indexing
  Efficient indexing is very important
    1200 pages/sec
  Indexing steps
    Load page, extract words: network/disk intensive
    Sort words, postings: CPU intensive
    Write sorted postings: disk intensive
  Pipeline indexing steps
    L (load), S (sort), W (write) overlapped across P1, P2, P3
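The benefit of pipelining the load/sort/write steps can be illustrated with a toy schedule. This sketch assumes every stage takes one equal time step per batch, which is a simplification not stated on the slide; `pipeline_schedule` is an illustrative name.

```python
STAGES = ("L", "S", "W")  # load, sort, write

def pipeline_schedule(num_batches: int) -> list:
    """For each time step, list the (stage, batch) pairs running concurrently."""
    steps = []
    for t in range(num_batches + len(STAGES) - 1):
        active = [
            (stage, t - offset)
            for offset, stage in enumerate(STAGES)
            if 0 <= t - offset < num_batches
        ]
        steps.append(active)
    return steps

schedule = pipeline_schedule(3)
for t, active in enumerate(schedule):
    print(t, active)

sequential_steps = 3 * len(STAGES)  # no overlap: each batch runs L, S, W alone
pipelined_steps = len(schedule)     # overlapped stages
```

With three batches the pipelined schedule finishes in 5 steps instead of 9, because the network/disk-bound load, CPU-bound sort, and disk-bound write of different batches run at the same time.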
10
Simplified Indexing Model
  Model: copy 50 TB of data from disks in S to disks in D through the network
    S: crawling machines
    D: indexing machines
  50 TB crawled data, 50 TB index
    Ignore actual processing
  1 TB disk per machine
    ~50 machines in S and D each
  8 GB RAM per machine
11
Data Flow
  Disk -> RAM -> Network -> RAM -> Disk
  No hardware is error free
    Disk (undetected) error rate ~ 1 per 10^13
    Network error rate ~ 1 bit per 10^12
    Memory soft error rate ~ 1 bit error per month (1 GB)
  Typically go unnoticed for small data
12
Data Flow
  Assuming 1 Gbit/s link between machines
  1 TB per machine, 30 MB/s transfer rate
    Half a day just for data transfer
  Disk -> RAM -> Network -> RAM -> Disk
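The "half a day" estimate follows from dividing the per-machine data size by the transfer rate. A minimal sketch, assuming decimal units and machines copying in parallel so that one machine's 1 TB determines the total time; `transfer_hours` is an illustrative name.

```python
def transfer_hours(num_bytes: float, bytes_per_sec: float) -> float:
    """Hours needed to move `num_bytes` at a sustained `bytes_per_sec`."""
    return num_bytes / bytes_per_sec / 3600

hours = transfer_hours(1e12, 30e6)  # 1 TB at 30 MB/s
link_capacity_mb_per_sec = 1e9 / 8 / 1e6  # 1 Gbit/s expressed in MB/s

print(f"{hours:.1f} hours per machine")
print(f"link capacity: {link_capacity_mb_per_sec:.0f} MB/s")
```

A 1 Gbit/s link carries up to 125 MB/s, so the stated 30 MB/s rate, not the link, is the limiting factor in this model.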
13
Errors from Disk
  Undetected disk error rate ~ 1 per 10^13
  5 x 10^13 bytes of data read in total
  5 x 10^13 bytes of data written in total
  10 byte errors from disk read/write
  Disk -> RAM -> Network -> RAM -> Disk
14
Errors from Memory
  1 bit error per month per 1 GB
  100 machines with 8 GB each
    8 x 100 bit errors/month
    15 bit errors per half day
  15 byte errors from memory corruption
  Disk -> RAM -> Network -> RAM -> Disk
15
Errors from Network
  1 error per 10^12
  5 x 8 x 10^13 bits transferred
  400 bit errors scattered around the stream
    400 byte errors
  Disk -> RAM -> Network -> RAM -> Disk
16
Data Size and Errors (1)
  During index construction/copy, something always goes wrong
    400 byte errors from network, 10 byte errors from disk, 15 byte errors from memory
  Very difficult to trace and debug
    Particularly disk and memory errors
    No OS/application assumes such errors yet
    Pure hardware errors, but very difficult to differentiate a hardware error from a software bug
    Software bugs may also cause similar errors
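The three error counts come from multiplying the amount of data handled by the per-unit error rates given on the preceding slides. A sketch of that arithmetic, assuming a 30-day month for the memory estimate; the variable names are illustrative.

```python
# Disk: ~1 undetected error per 10^13 bytes, 5x10^13 bytes read plus 5x10^13 written
disk_bytes = 5e13 + 5e13
disk_errors = disk_bytes / 1e13

# Network: ~1 bit error per 10^12 bits, 5x10^13 bytes transferred
network_bits = 5e13 * 8
network_errors = network_bits / 1e12

# Memory: ~1 bit error per GB per month, 100 machines x 8 GB, over half a day
memory_gb = 100 * 8
errors_per_month = memory_gb * 1
memory_errors = errors_per_month * (0.5 / 30)  # assumption: 30-day month

print(f"disk: {disk_errors:.0f}, network: {network_errors:.0f}, memory: {memory_errors:.1f}")
```

The memory figure comes out at about 13 bit errors per half day, which the slides round to 15.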
17
Data Size and Errors (2)
  Very difficult to trace and debug
  Data corruption in the middle of, say, sorting completely screws up the sorting
    Need a data-verification step after every operation
  Algorithms and data structures must be resilient to data corruption
    Checkpoints, etc.
  ECC RAM is a must
    Can detect most 1-bit errors
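One common way to implement a data-verification step is to attach a checksum to each block before it is written or sent, and recheck it afterwards. The sketch below uses CRC-32 from Python's standard `zlib` module as an example; the slides do not name a specific scheme, and the helper names `seal` and `verify` are hypothetical.

```python
import zlib

def seal(payload: bytes) -> tuple:
    """Return the payload together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)

def verify(payload: bytes, expected_crc: int) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(payload) == expected_crc

block, crc = seal(b"sorted postings for one index block")

# Simulate a single flipped bit somewhere along disk -> RAM -> network -> RAM -> disk
corrupted = bytes([block[0] ^ 0x01]) + block[1:]

intact_ok = verify(block, crc)
corrupted_ok = verify(corrupted, crc)
print(intact_ok, corrupted_ok)
```

A failed check tells the pipeline to re-read or re-transfer the block from the last checkpoint instead of sorting or indexing corrupted data.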
18
Data Size and Reliability
  Disk mean time to failure ~ 3 years
    (3 x 365 days) / 100 disks ~ 10 days
    One disk failure every 10 days
  Remember, this is just for one copy
  Data organization should be very resilient to disk failure
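The failure interval is the per-disk mean time to failure divided by the number of disks. A small sketch, assuming failures are independent and spread evenly over time; `days_between_failures` is an illustrative name.

```python
def days_between_failures(mttf_years: float, num_disks: int) -> float:
    """Expected days between disk failures across a fleet of `num_disks`."""
    return mttf_years * 365 / num_disks

one_copy = days_between_failures(3, 100)    # 100 disks for one copy of the data
two_copies = days_between_failures(3, 200)  # the interval halves with a second copy

print(f"{one_copy:.1f} days, {two_copies:.1f} days")
```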
19
Data Size and Query Processing
  Index size: 50 TB, i.e. 50 disks
    Potentially a 50-machine cluster to answer a query
    If one machine goes down, the cluster goes down
  Multi-tier index structure can be helpful
    Tier 1: popular (high PageRank) page index
    Tier 2: less popular page index
    Most queries can be answered by the tier-1 cluster (with fewer machines)
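A toy sketch of a two-tier lookup: the query goes to the small tier-1 index first and touches tier 2 only when tier 1 cannot supply enough results. The fallback rule, the dictionary-based indexes, and the names `search_tiered`, `tier1`, `tier2`, and `k` are all assumptions for illustration; the slides do not specify how the tiers are consulted.

```python
def search_tiered(term: str, tier1: dict, tier2: dict, k: int = 3) -> tuple:
    """Return up to k document ids and whether the tier-2 index was consulted."""
    hits = list(tier1.get(term, []))[:k]
    used_tier2 = False
    if len(hits) < k:
        used_tier2 = True
        extra = [doc for doc in tier2.get(term, []) if doc not in hits]
        hits += extra[: k - len(hits)]
    return hits, used_tier2

# Hypothetical postings: term -> document ids, already in ranked order
tier1 = {"ucla": [1, 2, 3], "cs246": [4]}
tier2 = {"cs246": [4, 9, 10], "rare": [7]}

popular = search_tiered("ucla", tier1, tier2)
mixed = search_tiered("cs246", tier1, tier2)
print(popular, mixed)
```

Queries like the first one never reach the larger tier-2 cluster, which is the point of the tiered design.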
20
Implication of Query Load
  3000 queries/sec
  Rule of thumb: 1 query/sec per CPU
    Depends on number of disks, memory size, etc.
  ~3000 machines just to answer queries
  5 KB per answer page
    3000 x 5 KB x 8 bits ~ 120 Mbps
    About three quarters of a dedicated OC3 line (155 Mbps), ~$300,000
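The query-serving estimate uses the same style of arithmetic as the crawler sizing. A sketch under the slide's rule of thumb of 1 query/sec per CPU, assuming one CPU per machine and decimal units; the variable names are illustrative.

```python
QUERIES_PER_SEC = 3000
QUERIES_PER_SEC_PER_CPU = 1   # rule of thumb from the slide
RESULT_PAGE_BYTES = 5_000     # 5 KB per answer page
OC3_MBPS = 155

machines = QUERIES_PER_SEC // QUERIES_PER_SEC_PER_CPU
bandwidth_mbps = QUERIES_PER_SEC * RESULT_PAGE_BYTES * 8 / 1e6
oc3_fraction = bandwidth_mbps / OC3_MBPS

print(machines, bandwidth_mbps, round(oc3_fraction, 2))
```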
21
Hardware at Google
  ~10K Intel-Linux cluster
  Assuming 99.9% uptime (8 hours of downtime per year)
    10 machines are always down
    Nightmare for system administrators
  Assuming 3-year hardware replacement
    Set up, replace and dump 10 machines every day
    Heterogeneity is unavoidable
  "Position Requirements: Able to lift/move 20-30 lbs equipment on a daily basis." (job posting at Google)
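The operational numbers on this slide can be checked the same way. A sketch assuming a 365-day year and replacements spread evenly over the 3-year cycle; the variable names are illustrative.

```python
MACHINES = 10_000
UPTIME = 0.999
REPLACEMENT_YEARS = 3

downtime_hours_per_year = (1 - UPTIME) * 365 * 24
machines_down = MACHINES * (1 - UPTIME)
replaced_per_day = MACHINES / (REPLACEMENT_YEARS * 365)

print(f"{downtime_hours_per_year:.2f} hours of downtime per machine per year")
print(f"{machines_down:.0f} machines down on average")
print(f"{replaced_per_day:.1f} machines replaced per day")
```

The exact values are about 8.8 hours of downtime and about 9 replacements per day, which the slide rounds to 8 hours and 10 machines.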