2/25/2004 The Google Cluster Architecture February 25, 2004
Assignments
  Work on the Registrar Assignment
  Study for your quiz!
Web Crawling
  Start with a seed URL
  Follow all URLs in each page, then the URLs in those pages, and so on
  Store the documents
  Create an index
    – a mapping from each word to the documents that contain it
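The crawl-and-index steps can be sketched in a few lines. This is a minimal illustration, not Google's crawler: the `PAGES` dictionary is a made-up in-memory stand-in for the web, and `crawl` and `build_index` are hypothetical helper names.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("google cluster architecture", ["b.example", "c.example"]),
    "b.example": ("cheap pcs in a cluster", ["a.example"]),
    "c.example": ("web search architecture", []),
}

def crawl(seed, fetch):
    """Breadth-first crawl from the seed URL; returns {url: text}."""
    seen, queue, docs = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:          # unknown URL: nothing to store or follow
            continue
        text, links = page
        docs[url] = text          # store the document
        for link in links:        # follow every URL found in the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return docs

def build_index(docs):
    """Inverted index: word -> set of URLs whose text contains the word."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

docs = crawl("a.example", PAGES.get)
index = build_index(docs)
```

Here `index["architecture"]` holds the two pages that contain that word, which is the word-to-document mapping the slide describes.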
Web Search
Properties of Web Search
  Embarrassingly parallel
    – Stateless
    – Read-only
  Requires lots of storage
  Requires lots of computation
  Requires low response time
Google Design Goals
  Energy efficiency
  Price/performance ratio
Software Architecture
  Reliability in software
    – Fault tolerance, not prevention
    – Cheap PCs
  High degree of replication
Load Distribution/Balancing
  Geographically distributed clusters
    – Increased fault tolerance
  DNS-based load balancing
    – Select the closest cluster to minimize round-trip time (RTT)
  Hardware-based local load balancing
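The cluster-selection rule can be sketched as picking the minimum-RTT cluster. This is only an illustration of the idea: the cluster names, the RTT figures, and the `healthy` set are all assumptions, and a real DNS-based balancer would work from the resolver's address rather than measured RTTs handed to a function.

```python
def pick_cluster(rtt_ms, healthy):
    """Return the healthy cluster with the lowest round-trip time.

    rtt_ms:  hypothetical estimates, cluster name -> RTT in milliseconds
    healthy: set of cluster names currently able to serve traffic
    """
    candidates = {name: rtt for name, rtt in rtt_ms.items() if name in healthy}
    if not candidates:
        raise RuntimeError("no healthy cluster available")
    return min(candidates, key=candidates.get)

# Made-up RTT estimates for one client.
rtts = {"us-east": 20, "eu-west": 95, "asia": 180}
closest = pick_cluster(rtts, {"us-east", "eu-west", "asia"})
fallback = pick_cluster(rtts, {"eu-west", "asia"})  # us-east is down
```

When the closest cluster is unavailable, the client is sent to the next closest one, which is where the geographic distribution adds fault tolerance.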
Query Execution
Query Execution
  1. Look up each query term in the index
  2. Compute a relevance score across the results
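Steps 1 and 2 can be sketched against a toy inverted index. The `INDEX` contents are made up, and the score used here (the sum of term frequencies) is a stand-in chosen for brevity; it is not the relevance function the real system uses.

```python
# Hypothetical inverted index: term -> {doc_id: term frequency}.
INDEX = {
    "google": {1: 3, 2: 1},
    "cluster": {1: 2, 3: 4},
    "architecture": {1: 1, 2: 2, 3: 1},
}

def search(query, index):
    """Return ids of documents containing every query term, best score first."""
    terms = query.lower().split()
    postings = [index.get(term, {}) for term in terms]   # step 1: index lookup
    if not postings:
        return []
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)                          # keep docs with all terms
    # Step 2: toy relevance score = summed term frequency across query terms.
    scored = [(sum(p[doc] for p in postings), doc) for doc in matching]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in scored]

results = search("cluster architecture", INDEX)
```

Documents 1 and 3 contain both terms; document 3 ranks first because its summed term frequency (5) is higher than document 1's (3).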
Index Shards
  A pool of machines serves each shard
  A request goes to one machine in the pool
  If a machine goes down, capacity is only marginally reduced
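A small sketch of one shard's machine pool shows why losing a machine costs capacity but not availability. The `ShardPool` class, the machine names, and the random selection policy are illustrative assumptions; the slides do not say how a machine is chosen within a pool.

```python
import random

class ShardPool:
    """Replica machines that all hold a copy of the same index shard."""

    def __init__(self, machines):
        # machine name -> True while the machine is up
        self.status = {name: True for name in machines}

    def mark_down(self, name):
        self.status[name] = False

    def pick(self):
        """Choose one live machine to serve a request for this shard."""
        live = [name for name, up in self.status.items() if up]
        if not live:
            raise RuntimeError("no live machine left for this shard")
        return random.choice(live)

    def capacity(self):
        """Fraction of the pool that is still serving."""
        return sum(self.status.values()) / len(self.status)

pool = ShardPool(["m1", "m2", "m3", "m4"])
pool.mark_down("m2")
```

With four machines, one failure leaves 75% of the shard's capacity, and requests keep being served by the remaining three. The larger the pool, the smaller the share lost per failure.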
Query Execution
  1. Look up each query term in the index
  2. Compute a relevance score across the results
  3. Retrieve the documents
    – Highlight keywords
  4. Generate and return HTML
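Steps 3 and 4 can be sketched as a highlighter plus a tiny HTML renderer. The function names, the `<b>` and `<ol>` markup, and the sample document are assumptions made for illustration, not the format the real result page uses.

```python
import html
import re

def highlight(text, terms):
    """HTML-escape text and wrap whole-word matches of the query terms in <b>."""
    if not terms:
        return html.escape(text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    # With one capturing group, re.split alternates non-match, match, non-match...
    parts = pattern.split(text)
    out = []
    for i, part in enumerate(parts):
        escaped = html.escape(part)
        out.append("<b>" + escaped + "</b>" if i % 2 else escaped)
    return "".join(out)

def render_results(doc_ids, docs, terms):
    """Retrieve each document (title, body) and return an HTML result list."""
    items = []
    for doc_id in doc_ids:
        title, body = docs[doc_id]
        items.append("<li>" + html.escape(title) + ": " + highlight(body, terms) + "</li>")
    return "<ol>" + "".join(items) + "</ol>"

snippet = highlight("A cluster of cheap PCs & disks", ["cluster", "pcs"])
page = render_results([1], {1: ("Intro", "Web search")}, ["search"])
```

Escaping each piece after splitting, rather than before matching, keeps a query term from accidentally matching inside an HTML entity such as `&amp;`.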
Replication
  No consistency issues (the data is read-only)
  Nearly linear speedup as replicas are added
Discussion
  For which other applications would this architecture be useful, and for which would it not?