
Slide 1: CS246 Search Engine Scale
Junghoo "John" Cho (UCLA Computer Science)

Slide 2: High-Level Architecture
Major modules for a search engine:
1. Crawler
   - Page download & refresh
2. Indexer
   - Index construction
   - PageRank computation
3. Query Processor
   - Page ranking
   - Query logging

Slide 3: General Architecture (architecture diagram; not reproduced in this transcript)

Slide 4: Scale of Web Search
- Number of pages indexed: ~10B pages
- Index refresh interval: once per month → ~1,200 pages/sec
- Number of queries per day: 290M in Dec 2008 → ~3,000 queries/sec
- Services often run on commodity Intel-Linux boxes

Slide 5: Other Statistics
- Average page size: 15KB
- Average query size: 40B
- Average result size: 5KB
- Average number of links per page: 10

Slide 6: Size of Dataset (1)
- Total raw HTML data size: 10B pages × 15KB/page = 150 TB!
- The inverted index is roughly the same size as the raw corpus: another 150 TB for the index itself
- With appropriate compression at a 3:1 ratio (150 TB raw + 150 TB index = 300 TB, ÷ 3): ~100 TB of data residing on disk
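A quick back-of-envelope check of these figures in Python; the 10B-page and 15KB numbers come straight from the slides, and the 3:1 compression ratio is the slide's assumption:

```python
pages = 10e9             # ~10B pages indexed
page_size = 15e3         # 15 KB average page size, in bytes

raw = pages * page_size          # raw HTML corpus
index = raw                      # inverted index roughly the same size as the corpus
on_disk = (raw + index) / 3      # 3:1 compression ratio

print(f"raw corpus : {raw / 1e12:.0f} TB")            # ~150 TB
print(f"with index : {(raw + index) / 1e12:.0f} TB")  # ~300 TB
print(f"compressed : {on_disk / 1e12:.0f} TB")        # ~100 TB on disk
```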

Slide 7: Size of Dataset (2)
- Number of disks necessary for one copy: (100 TB) / (1 TB per disk) = 100 disks

Slide 8: Data Size and Crawling
- An efficient crawler is very important
  - At 1 page/sec → 1,200 machines just for crawling
  - Parallelization through threads/event queues is necessary
  - Complex crawling algorithms: no, no!
- A well-optimized crawler: ~100 pages/sec (10 ms/page) → ~12 machines for crawling
- Bandwidth consumption: 1,200 pages/sec × 15KB × 8 bits ≈ 150 Mbps
  - One dedicated OC3 line (155 Mbps) for crawling: ~$400,000 per year
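The same machine-count and bandwidth arithmetic, sketched in Python; the per-machine crawl rates are the slide's assumptions:

```python
crawl_rate = 1200            # pages/sec, the deck's working figure for a monthly refresh
page_bits = 15e3 * 8         # 15 KB/page, in bits

machines_naive = crawl_rate / 1          # 1 page/sec per machine   -> 1,200 machines
machines_optimized = crawl_rate / 100    # 100 pages/sec per machine -> 12 machines
bandwidth_mbps = crawl_rate * page_bits / 1e6   # ~144 Mbps, roughly one OC3 line (155 Mbps)

print(machines_naive, machines_optimized, f"{bandwidth_mbps:.0f} Mbps")
```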

Slide 9: Data Size and Indexing
- Efficient indexing is very important: 1,200 pages/sec
- Indexing steps:
  - Load pages, extract words: network/disk intensive
  - Sort words into postings: CPU intensive
  - Write sorted postings: disk intensive
- Pipeline the indexing steps across page batches P1, P2, P3, ... so that the Load (L), Sort (S), and Write (W) stages overlap
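A minimal sketch of this pipelining idea using Python threads and queues; the stage functions and the toy batches are hypothetical placeholders, not the actual indexer:

```python
import queue, sys, threading

load_q, sort_q = queue.Queue(maxsize=4), queue.Queue(maxsize=4)

def load(batches):
    # Network/disk-intensive stage: fetch pages, extract (word, doc_id) postings.
    for batch in batches:
        load_q.put([(word, doc_id) for doc_id, page in batch for word in page.split()])
    load_q.put(None)                      # signal end of input

def sort():
    # CPU-intensive stage: sort each batch's postings by word.
    while (postings := load_q.get()) is not None:
        sort_q.put(sorted(postings))
    sort_q.put(None)

def write(out):
    # Disk-intensive stage: append each sorted run of postings.
    while (run := sort_q.get()) is not None:
        out.write("\n".join(f"{w}\t{d}" for w, d in run) + "\n")

if __name__ == "__main__":
    batches = [[(1, "the quick fox"), (2, "the lazy dog")],
               [(3, "quick brown dog")]]
    # Each stage runs in its own thread, so batch P2 can be sorted
    # while P1 is being written and P3 is being loaded.
    threads = [threading.Thread(target=load, args=(batches,)),
               threading.Thread(target=sort),
               threading.Thread(target=write, args=(sys.stdout,))]
    for t in threads: t.start()
    for t in threads: t.join()
```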

Slide 10: Simplified Indexing Model
- Model: copy 50TB of data from disks in S to disks in D through the network
  - S: crawling machines
  - D: indexing machines
  - 50TB of crawled data, 50TB of index
  - Ignore the actual processing
- 1TB disk per machine → ~50 machines in S and D each
- 8GB RAM per machine

Slide 11: Data Flow
- Disk → RAM → Network → RAM → Disk
- No hardware is error free
  - Disk (undetected) error rate: ~1 per 10^13
  - Network error rate: ~1 bit per 10^12
  - Memory soft error rate: ~1 bit error per month per 1GB
- These errors typically go unnoticed for small data

Slide 12: Data Flow
- Assuming a 1Gbit/s link between machines
- 1TB per machine at a ~30MB/s transfer rate → half a day just for the data transfer
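Checking the transfer-time estimate; 30 MB/s is the slide's assumed effective throughput, well below the raw 1 Gbit/s link speed:

```python
data = 1e12      # 1 TB per machine, in bytes
rate = 30e6      # ~30 MB/s effective transfer rate

hours = data / rate / 3600
print(f"{hours:.1f} hours")   # ~9.3 hours, i.e. roughly half a day
```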

Slide 13: Errors from Disk
- Undetected disk error rate: ~1 per 10^13
- 5 x 10^13 bytes read in total, 5 x 10^13 bytes written in total
- → ~10 byte errors from disk reads/writes

Slide 14: Errors from Memory
- 1 bit error per month per 1GB
- 100 machines with 8GB each → 8 x 100 = 800 bit errors per month
- ~15 bit errors per half day → ~15 byte errors from memory corruption

Slide 15: Errors from Network
- 1 error per 10^12 bits
- 5 x 8 x 10^13 = 4 x 10^14 bits transferred
- → ~400 bit errors scattered around the stream, i.e. ~400 byte errors
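The three error estimates from slides 13-15, reproduced as one small Python calculation using the error rates given above:

```python
data_bytes = 50e12        # 50 TB copied from S to D

disk_errors = 2 * data_bytes / 1e13   # read once + written once, ~1 error per 10^13 bytes -> ~10
net_errors  = data_bytes * 8 / 1e12   # ~1 bit error per 10^12 bits transferred            -> ~400
mem_errors  = 100 * 8 / 60            # 800 bit errors/month, over ~half a day (1/60th)    -> ~13

print(f"disk ~{disk_errors:.0f}, network ~{net_errors:.0f}, memory ~{mem_errors:.0f} corrupted bytes")
```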

Slide 16: Data Size and Errors (1)
- During index construction/copy, something always goes wrong
  - ~400 byte errors from the network, ~10 byte errors from disk, ~15 byte errors from memory
- Very difficult to trace and debug, particularly the disk and memory errors
  - No OS/application assumes such errors yet
  - These are pure hardware errors, but it is very difficult to tell a hardware error from a software bug
  - Software bugs may also cause similar errors

Slide 17: Data Size and Errors (2)
- Very difficult to trace and debug
  - Data corruption in the middle of, say, a sort completely screws up the sort
- Need a data-verification step after every operation
- Algorithms and data structures must be resilient to data corruption
  - Checkpoints, etc.
- ECC RAM is a must: it can detect (and correct) most single-bit errors
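One common form of the data-verification step is to checksum the data before and after each copy or transform; a minimal sketch, where the file names are hypothetical stand-ins for a postings file and its copy:

```python
import hashlib

def checksum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so corruption anywhere in it is detectable."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Verify a copy: recompute the checksum on the destination and compare.
if checksum("postings.src") != checksum("postings.dst"):
    raise IOError("copy corrupted in transit; retry the transfer")
```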

Slide 18: Data Size and Reliability
- Disk mean time to failure: ~3 years
  - (3 x 365 days) / 100 disks ≈ 10 days → one disk failure every ~10 days
  - Remember, this is just for one copy
- Data organization should be very resilient to disk failure
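The failure-interval arithmetic as a one-liner, assuming failures are spread evenly across the fleet (the slide's implicit simplification):

```python
mttf_days, disks = 3 * 365, 100
print(mttf_days / disks)   # ~11 days between disk failures; the slide rounds to ~10
```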

Slide 19: Data Size and Query Processing
- Index size: 50TB → 50 disks
  - Potentially a 50-machine cluster to answer each query
  - If one machine goes down, the cluster goes down
- A multi-tier index structure can be helpful
  - Tier 1: index of popular (high-PageRank) pages
  - Tier 2: index of less popular pages
  - Most queries can be answered by the tier-1 cluster (with far fewer machines)
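A minimal sketch of the tier-1 / tier-2 fallback idea; the index objects and their `lookup` method are hypothetical stand-ins for the per-tier clusters:

```python
def search(query, tier1, tier2, k=10):
    """Answer from the small popular-page tier when possible; only fall back
    to the much larger tier-2 cluster when tier 1 cannot fill the result page."""
    results = tier1.lookup(query)              # high-PageRank pages only
    if len(results) >= k:
        return results[:k]                     # most queries stop here
    return (results + tier2.lookup(query))[:k]  # rare queries pay the full cost
```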

Slide 20: Implications of Query Load
- 3,000 queries/sec
- Rule of thumb: 1 query/sec per CPU
  - Depends on the number of disks, memory size, etc.
- → ~3,000 machines just to answer queries
- 5KB per answer page → 3,000 x 5KB x 8 bits ≈ 120 Mbps
  - Roughly half a dedicated OC3 line (155 Mbps): ~$300,000
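The query-side arithmetic in Python; the 290M queries/day figure is from slide 4, and the 1 query/sec-per-CPU and 5KB-per-result figures are the rule-of-thumb assumptions above:

```python
queries_per_sec = 290e6 / 86400          # ~3,350/sec; the deck rounds to ~3,000
machines = 3000 / 1                      # rule of thumb: 1 query/sec per CPU -> ~3,000 machines
bandwidth_mbps = 3000 * 5e3 * 8 / 1e6    # ~120 Mbps of result pages

print(f"{queries_per_sec:.0f} q/s, ~{machines:.0f} machines, {bandwidth_mbps:.0f} Mbps")
```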

Slide 21: Hardware at Google
- ~10K-machine Intel-Linux cluster
- Assuming 99.9% uptime (about 8 hours of downtime per year per machine): ~10 machines are always down
  - A nightmare for system administrators
- Assuming a 3-year hardware replacement cycle: set up, replace, and retire ~10 machines every day
  - Heterogeneity is unavoidable
- From a job posting at Google: "Position Requirements: Able to lift/move 20-30 lbs equipment on a daily basis."
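The two fleet-management numbers above, checked in Python under the slide's own assumptions:

```python
machines = 10_000
down_now = machines * (1 - 0.999)           # 99.9% uptime -> ~10 machines down at any moment
replaced_per_day = machines / (3 * 365)     # 3-year replacement cycle -> ~9 machines a day

print(f"~{down_now:.0f} down at any time, ~{replaced_per_day:.0f} replaced per day")
```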

