Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster
Majeed Kassis
Case Study: Google Cluster
The Google cluster architecture up to 2010, and the improvements introduced when Caffeine replaced the original system.
- How does the Google search engine work?
- How fast can it index and update web pages?
- How does it produce search results in less than a second?
Architecture Requirements
The architecture needs to be economical
- Energy efficient
- Price-performance ratio matters
Queries must be answered fast
- Architecture geared toward high throughput
- Support for a massive number of queries per second: Google handles about 3.5 billion searches a day, roughly 40,000 per second
Fault tolerance
- A node failure must not affect the performance of the system
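A quick back-of-the-envelope check of the 40,000 queries/second figure (a minimal Java sketch; the only inputs are the numbers quoted above):

```java
// Sanity check: 3.5 billion searches per day over 86,400 seconds per day.
public class ThroughputCheck {
    public static void main(String[] args) {
        double searchesPerDay = 3.5e9;
        double secondsPerDay = 24 * 60 * 60;                 // 86,400
        System.out.printf("~%.0f queries/second%n", searchesPerDay / secondsPerDay);
        // Prints ~40509 queries/second, i.e. roughly the 40,000/s cited above.
    }
}
```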
Key Decisions in Google Cluster
Focus on reliability in software instead of hardware
- Use low-cost commodity PCs to build the cluster
- Use redundancy and failure detection to improve fault tolerance (see the sketch below)
- Services are replicated over multiple machines: no single point of failure
Design for best total throughput
- Response time can be improved by using parallelism
Focus on the price-performance ratio
- Use cost-efficient consumer-grade CPUs
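To make the "redundancy and failure detection" decision concrete, here is a minimal Java sketch of a service that tries replicas in turn and skips failed nodes. The names (SearchReplica, ReplicatedService) are illustrative, not Google's actual interfaces.

```java
import java.util.List;

// Reliability in software: any replica may answer, and a dead node is simply skipped.
interface SearchReplica {
    String query(String q) throws Exception;   // may fail: commodity hardware
}

class ReplicatedService {
    private final List<SearchReplica> replicas;

    ReplicatedService(List<SearchReplica> replicas) { this.replicas = replicas; }

    String query(String q) {
        for (SearchReplica r : replicas) {     // redundancy: try each replica in turn
            try {
                return r.query(q);
            } catch (Exception failed) {
                // failure detection: this node is down, move on to the next replica
            }
        }
        throw new IllegalStateException("all replicas failed");
    }
}
```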
Key Characteristics
- Instead of looking up matching results in one large index, look up many times, in parallel, into smaller indices (shards) of the index; once the partial results are received, they are merged into one (see the scatter-gather sketch below).
Query division
- Depending on location, send the query to geographically near servers.
- Manage load balancing within the cluster to avoid slowdowns.
Results
- The more shards there are, the better the performance.
- This supports a massive increase in the number of machines.
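The query-division idea can be sketched as a scatter-gather step in Java: the same query is sent to every shard in parallel, and the partial results are merged into one ranked list. Shard and Result are hypothetical stand-ins for the real shard servers and hit records.

```java
import java.util.*;
import java.util.concurrent.*;

public class ScatterGather {
    record Result(int docId, double score) {}

    interface Shard { List<Result> search(String query); }

    static List<Result> search(String query, List<Shard> shards) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<Result>>> futures = new ArrayList<>();
            for (Shard s : shards) {
                futures.add(pool.submit(() -> s.search(query)));   // scatter: one lookup per shard
            }
            List<Result> merged = new ArrayList<>();
            for (Future<List<Result>> f : futures) {
                merged.addAll(f.get());                            // gather the partial results
            }
            merged.sort(Comparator.comparingDouble(Result::score).reversed());
            return merged;                                         // merge into one ranked list
        } finally {
            pool.shutdown();
        }
    }
}
```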
How is a query handled? The query is sent to the index servers:
- Each word is mapped to a list of documents (the inverted indexes are generated using MapReduce).
- Intersect the lists for each word found in the query.
- Compute a relevance score for each document.
- Return the list of documents sorted by relevance (a toy example follows).
The inverted index is tens of petabytes in size, so searching is parallelized across many machines.
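As an illustration of these lookup steps, here is a toy Java inverted index that maps words to posting lists and intersects them for a query. Relevance scoring is reduced to nothing here; real ranking (term weights, link analysis, and so on) is far more involved.

```java
import java.util.*;

public class InvertedIndex {
    // word -> sorted posting list of document ids
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
    }

    List<Integer> search(String query) {
        Set<Integer> result = null;
        for (String word : query.toLowerCase().split("\\W+")) {
            Set<Integer> docs = postings.getOrDefault(word, new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);               // intersect the posting lists
        }
        return result == null ? List.of() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "google cluster architecture");
        idx.add(2, "cluster computing with commodity hardware");
        System.out.println(idx.search("cluster architecture"));   // prints [1]
    }
}
```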
Index Shards
- Due to the size of the index set, it is divided into "index shards".
- Each index shard is built from a randomly chosen subset of documents.
- A pool of machines serves the requests for each shard; the pools are load balanced to avoid slowdowns (see the sketch below).
- The result is an ordered list of document ids.
- For each id, the page title, URL, and description are returned; this is done by the document servers.
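A minimal sketch of the shard/pool structure, assuming round-robin load balancing within a pool and hash-based routing of documents to shards (the slide says documents were assigned randomly; hashing is used here only to keep the example deterministic). All names are illustrative.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// One pool of machines serves one index shard; requests rotate over the pool.
class ShardPool {
    private final List<String> machines;
    private final AtomicLong next = new AtomicLong();

    ShardPool(List<String> machines) { this.machines = machines; }

    String pickMachine() {                            // trivial load balancing
        int i = (int) (next.getAndIncrement() % machines.size());
        return machines.get(i);
    }
}

class ShardedIndex {
    private final List<ShardPool> shards;

    ShardedIndex(List<ShardPool> shards) { this.shards = shards; }

    // Route a document (or a per-shard lookup) to one machine in the owning pool.
    String routeDocument(int docId) {
        ShardPool pool = shards.get(Math.floorMod(docId, shards.size()));
        return pool.pickMachine();
    }
}
```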
Final Steps
Each query is also sent, in parallel, to:
- the spell checker
- the advertisement system
Finally:
- the HTML result is generated
- advertisements are added
- spelling corrections are suggested
A sketch of this fan-out-and-assemble step follows.
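A small Java sketch of the fan-out using CompletableFuture: the document search, the spell checker, and the ad system are queried concurrently, and the HTML page is built once all three answers arrive. The service calls are stand-ins, not real APIs.

```java
import java.util.concurrent.CompletableFuture;

public class ResultAssembly {
    public static void main(String[] args) {
        String query = "concurency";   // deliberately misspelled

        // The three services are queried in parallel.
        CompletableFuture<String> docs  = CompletableFuture.supplyAsync(() -> "10 documents matching '" + query + "'");
        CompletableFuture<String> spell = CompletableFuture.supplyAsync(() -> "Did you mean: concurrency?");
        CompletableFuture<String> ads   = CompletableFuture.supplyAsync(() -> "2 relevant ads");

        // The HTML result is assembled only when every answer is available.
        String html = docs
                .thenCombine(spell, (d, s) -> "<p>" + d + "</p><p>" + s + "</p>")
                .thenCombine(ads, (page, a) -> page + "<p>" + a + "</p>")
                .join();

        System.out.println(html);
    }
}
```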
Google Caffeine - 2010
Old system:
- Based on MapReduce and GFS (Google File System) to build the indexes.
- Batch processing: web crawling -> MapReduce -> propagation (to the servers).
- Indexing took a month and a propagation cycle took about 10 days; this caused different users to receive different results!
New system:
- Reduced the use of MapReduce, allowing dynamic updates of the indexed data.
- Improved GFS even further.
- Allowed indexing more pages (tens of petabytes).
- Allowed identifying and updating frequently changing pages even faster.
The old index had several layers, some refreshed at a faster rate than others; the main layer updated only every couple of weeks. Refreshing a layer of the old index required analyzing the entire web, so there was a significant delay between when a page was found and when it was made available. With Caffeine, the web is analyzed in small portions and the search index is updated continuously, globally. Caffeine provides 50 percent fresher results for web searches than the older indexing scheme! A schematic comparison of the two indexing styles follows.
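The batch-versus-incremental difference can be shown schematically in Java, with a plain map standing in for the index. This is only an illustration of the idea, not Google's implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IndexingStyles {
    // Old style: crawl everything, rebuild the whole index, then publish it.
    static Map<String, String> batchRebuild(Map<String, String> crawledPages) {
        Map<String, String> freshIndex = new ConcurrentHashMap<>();
        freshIndex.putAll(crawledPages);       // the whole web is processed per cycle
        return freshIndex;                     // visible only when the cycle completes
    }

    // Caffeine style: update the live index continuously, page by page.
    static void incrementalUpdate(Map<String, String> liveIndex, String url, String content) {
        liveIndex.put(url, content);           // visible to searches immediately
    }
}
```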
Caffeine vs Old System
Dynamic updates of the indexing tables:
- Old: works only in cycles, because MapReduce must finish before the tables can be updated; this delayed updates to the results.
- New: allows dynamic updates of the tables.
Storage:
- Old: GFS uses a single "master node", which holds all the metadata, and many "chunk servers", which physically store the data itself, with a 64 MB minimum chunk size. This caused high latency, the master became a bottleneck as the data grew, and a chunk-server failure caused further delays.
- New: GFS2 uses distributed masters and allows smaller file sizes (1 MB). This greatly reduces latency and allows storing even more files (more than tens of petabytes).
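To illustrate why a single metadata master becomes a bottleneck, here is a schematic Java sketch contrasting one master with partitioned (distributed) masters. The interface and the hash-based partitioning are assumptions made for the example, not GFS/GFS2 internals.

```java
import java.util.List;

// Metadata lookup: which chunk servers hold a given file?
interface MetadataServer {
    List<String> chunkServersFor(String filePath);
}

class SingleMasterFS {
    private final MetadataServer master;             // every lookup hits this one node
    SingleMasterFS(MetadataServer master) { this.master = master; }

    List<String> locate(String filePath) {
        return master.chunkServersFor(filePath);     // single point of contention
    }
}

class DistributedMasterFS {
    private final List<MetadataServer> masters;      // metadata partitioned across masters
    DistributedMasterFS(List<MetadataServer> masters) { this.masters = masters; }

    List<String> locate(String filePath) {
        int i = Math.floorMod(filePath.hashCode(), masters.size());
        return masters.get(i).chunkServersFor(filePath);   // load spreads over the masters
    }
}
```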