Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster. Majeed Kassis.

Case Study: Google Cluster The Google cluster architecture until 2010, and the improvements introduced when Caffeine replaced the original design. How does the Google search engine work? How fast can it index and update web pages? How does it produce search results in less than a second?

Architecture Requirements The architecture needs to be economical: energy efficient, with the price-performance ratio mattering more than peak performance. Queries must be answered fast, so the architecture is geared toward high throughput and must support a massive number of queries per second; Google handles about 3.5 billion searches a day, i.e. roughly 3.5 × 10^9 / 86,400 s ≈ 40,000 per second. Fault tolerance: a node failure must not affect the performance of the system.

Key Decisions in Google Cluster Focus on software reliability instead of hardware reliability: the cluster is built from low-cost commodity PCs, and redundancy plus failure detection improve fault tolerance. Services are replicated over multiple machines, so there is no single point of failure. The design aims for the best total throughput, while response time is improved by using parallelism (a small sketch follows below). The focus on the price-performance ratio leads to cost-efficient, consumer-grade CPUs.
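As a rough illustration of "no single point of failure" combined with parallelism for latency, here is a minimal Java sketch; it is not Google's code, and Replica, queryReplicas and the timeout values are invented for the example. The query is sent to every replica of a service at once and the first successful answer is used, so one crashed or slow replica does not hurt the response:

```java
import java.util.List;
import java.util.concurrent.*;

// Hypothetical sketch: fan the query out to all replicas, take the first success.
public class ReplicatedQuery {

    interface Replica {
        String handle(String query) throws Exception;
    }

    static String queryReplicas(List<Replica> replicas, String query,
                                ExecutorService pool) throws Exception {
        CompletableFuture<String> first = new CompletableFuture<>();
        for (Replica r : replicas) {
            CompletableFuture
                .supplyAsync(() -> {
                    try {
                        return r.handle(query);
                    } catch (Exception e) {
                        throw new CompletionException(e);  // a failed replica is simply ignored
                    }
                }, pool)
                .thenAccept(first::complete);              // first successful result wins
        }
        return first.get(200, TimeUnit.MILLISECONDS);      // overall deadline for the query
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<Replica> replicas = List.of(
            q -> { throw new RuntimeException("replica down"); },            // simulated failure
            q -> { Thread.sleep(50); return "result from replica 2 for: " + q; },
            q -> { Thread.sleep(120); return "result from replica 3 for: " + q; }
        );
        System.out.println(queryReplicas(replicas, "distributed systems", pool));
        pool.shutdown();
    }
}
```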

Key Characteristics Instead of looking up matching results in one large index, the query is looked up many times, in parallel, in smaller shards of the index; once the partial results are received, they are merged into one (a scatter-gather sketch follows below). Query division: depending on the user's location, the query is sent to geographically near servers, and load balancing within the cluster avoids slowdowns. Result: the more shards there are, the better the performance, which supports a massive increase in the number of machines.
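A minimal scatter-gather sketch of the "look up in many small shards in parallel, then merge" idea, using made-up types (IndexShard, Hit) rather than Google's actual interfaces:

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch: the same query goes to every shard in parallel,
// and the partial hit lists are merged into one list sorted by score.
public class ScatterGather {

    record Hit(long docId, double score) {}

    interface IndexShard {
        List<Hit> search(String query);
    }

    static List<Hit> search(List<IndexShard> shards, String query,
                            ExecutorService pool) throws Exception {
        // Scatter: one task per shard.
        List<Future<List<Hit>>> partials = new ArrayList<>();
        for (IndexShard shard : shards) {
            partials.add(pool.submit(() -> shard.search(query)));
        }
        // Gather: merge all partial results and sort by descending score.
        List<Hit> merged = new ArrayList<>();
        for (Future<List<Hit>> f : partials) {
            merged.addAll(f.get());
        }
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return merged;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<IndexShard> shards = List.of(
            q -> List.of(new Hit(1, 0.7), new Hit(5, 0.2)),
            q -> List.of(new Hit(9, 0.9)),
            q -> List.of()                         // shard with no matches
        );
        System.out.println(search(shards, "google cluster", pool));
        pool.shutdown();
    }
}
```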

How is a query handled? The query is sent to the indexing servers: each word is mapped to a list of documents (an inverted index, generated using MapReduce); the lists for the words found in the query are intersected; a relevance score is computed for each remaining document; and the list of documents is returned sorted by relevance (a minimal lookup sketch follows below). The inverted index is tens of petabytes in size, so searching is parallelized among many machines.
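A toy, in-memory version of the lookup step described above, assuming a tiny hand-built inverted index; the real index is sharded across many machines and the real scoring function is far more elaborate:

```java
import java.util.*;

// Minimal sketch: intersect the posting lists of the query words and order the result.
public class InvertedIndexLookup {

    // word -> set of document ids containing that word
    static final Map<String, Set<Long>> INDEX = Map.of(
        "google",  Set.of(1L, 2L, 3L),
        "cluster", Set.of(2L, 3L, 4L),
        "file",    Set.of(3L, 5L)
    );

    static List<Long> query(String... words) {
        // Intersect the posting lists of all query words.
        Set<Long> result = null;
        for (String w : words) {
            Set<Long> postings = INDEX.getOrDefault(w.toLowerCase(), Set.of());
            if (result == null) {
                result = new HashSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        if (result == null) return List.of();

        // Toy "relevance" ordering: just the document id here; a real engine
        // would combine term frequency, link analysis, freshness, etc.
        List<Long> ranked = new ArrayList<>(result);
        ranked.sort(Comparator.naturalOrder());
        return ranked;
    }

    public static void main(String[] args) {
        System.out.println(query("google", "cluster"));   // prints [2, 3]
    }
}
```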

Index Shards Due to the size of the index, it is divided into “index shards”. Each index shard is built from a randomly chosen subset of the documents, a pool of machines serves requests for each shard, and the pools are load balanced to avoid slowdowns. The result of the index lookup is an ordered list of document ids; for each id, the page title, URL and description are then returned by the document servers (sketched below).
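A small sketch of the document-server step, with invented names (DocServer, Snippet): the ranked document ids coming out of the index shards are turned into the title, URL and description shown to the user, preserving the ranking order:

```java
import java.util.*;

// Hypothetical sketch of turning ranked document ids into user-visible snippets.
public class DocServerLookup {

    record Snippet(long docId, String title, String url, String description) {}

    interface DocServer {
        Snippet lookup(long docId);
    }

    // Keep the order produced by the index servers while attaching snippets.
    static List<Snippet> attachSnippets(List<Long> rankedIds, DocServer docs) {
        List<Snippet> page = new ArrayList<>();
        for (long id : rankedIds) {
            page.add(docs.lookup(id));
        }
        return page;
    }

    public static void main(String[] args) {
        DocServer fake = id -> new Snippet(id, "Title " + id,
                "https://example.com/" + id, "Description of document " + id);
        System.out.println(attachSnippets(List.of(2L, 3L), fake));
    }
}
```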

Final Steps Each query is also sent in parallel to the spell checker and to the advertisement system. Finally, the HTML result is generated, advertisements are added, and spelling corrections are suggested (a parallel fan-out sketch follows below).
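A sketch of the parallel fan-out in the final steps, using Java's CompletableFuture; the service calls are stubbed with made-up strings, since the point is only that spell checking and ad selection run concurrently with result generation and the page is assembled once all three are done:

```java
import java.util.concurrent.*;

// Hypothetical sketch: three independent services run in parallel, then the page is assembled.
public class FinalAssembly {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        String query = "distrbuted systems";   // deliberately misspelled

        CompletableFuture<String> results = CompletableFuture.supplyAsync(
                () -> "<ol><li>result 1</li><li>result 2</li></ol>", pool);
        CompletableFuture<String> spelling = CompletableFuture.supplyAsync(
                () -> "Did you mean: distributed systems?", pool);
        CompletableFuture<String> ads = CompletableFuture.supplyAsync(
                () -> "<aside>sponsored link</aside>", pool);

        // Assemble the HTML only when search results, spelling and ads are all ready.
        String html = results
                .thenCombine(spelling, (r, s) -> "<p>" + s + "</p>" + r)
                .thenCombine(ads, (page, a) -> page + a)
                .get();

        System.out.println(html);
        pool.shutdown();
    }
}
```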

Google Caffeine - 2010
Old system: based on MapReduce and GFS (the Google File System) to build the indexes. Batch processing: web crawling -> MapReduce -> propagation (to the servers). Indexing took a month and the propagation cycle used to take 10 days, which caused different users to receive different results!
New system: reduced the use of MapReduce, allowing dynamic updates of the indexed data; improved GFS even further; allowed indexing more pages (tens of petabytes); and allowed identifying and updating frequently changing pages even faster.
The old index had several layers, some of which were refreshed at a faster rate than others; the main layer would update every couple of weeks. To refresh a layer of the old index, the entire web had to be analyzed, which meant a significant delay between when a page was found and when it was made available. With Caffeine, the web is analyzed in small portions and search index updates are done on a continuous basis, globally. Caffeine provides 50 percent fresher results for web searches than the older indexing scheme! (A toy batch-versus-incremental sketch follows below.)
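A toy contrast between the two indexing styles, with invented data structures: the old pipeline rebuilds the whole inverted index in one batch pass over every crawled page, while the Caffeine-style path re-indexes a single page as soon as it is (re)crawled:

```java
import java.util.*;

// Toy sketch: batch rebuild of the whole index vs. incremental update of one page.
public class BatchVsIncremental {

    // inverted index: word -> set of page urls containing it
    static Map<String, Set<String>> batchBuild(Map<String, String> crawledPages) {
        Map<String, Set<String>> index = new HashMap<>();
        for (var page : crawledPages.entrySet()) {          // full pass over everything
            indexPage(index, page.getKey(), page.getValue());
        }
        return index;
    }

    // Incremental path: only the changed page is re-indexed.
    static void incrementalUpdate(Map<String, Set<String>> index,
                                  String url, String newContent) {
        index.values().forEach(urls -> urls.remove(url));    // drop stale postings for this page
        indexPage(index, url, newContent);
    }

    private static void indexPage(Map<String, Set<String>> index,
                                  String url, String content) {
        for (String word : content.toLowerCase().split("\\W+")) {
            index.computeIfAbsent(word, w -> new HashSet<>()).add(url);
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index =
            batchBuild(Map.of("a.html", "google cluster", "b.html", "file system"));
        incrementalUpdate(index, "b.html", "google file system");
        System.out.println(index);
    }
}
```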

Caffeine vs Old System
Indexing tables: the old system works only in cycles, because MapReduce must finish its work before the tables can be updated, which delayed updating the results; the new system allows dynamic updates of the tables.
Storage: the old GFS uses a single “master node”, which holds all the metadata, and many “chunk servers”, which physically store the data itself, with a 64 MB minimum chunk size. This caused high latency, the master became a bottleneck as the data grew, and a chunk server failure caused further delays. The new GFS2 uses distributed masters and allows smaller chunk sizes (1 MB), which reduces latency greatly and allows storing even more files (more than tens of petabytes). (A minimal metadata/chunk lookup sketch follows below.)
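An illustrative sketch of the GFS-style read path mentioned above, with all class and method names invented for the example: the client asks the master only for metadata (which chunks a file has and where the replicas live) and then reads the bytes directly from a chunk server, which is why a single master with very large metadata becomes a bottleneck:

```java
import java.util.*;

// Illustrative sketch of a GFS-style read: metadata from the master, data from a chunk server.
public class GfsStyleRead {

    static final int CHUNK_SIZE = 64 * 1024 * 1024;   // 64 MB chunks in the old GFS

    record ChunkLocation(String chunkId, List<String> replicaServers) {}

    // Master: metadata only (file name -> ordered chunk locations).
    static class Master {
        private final Map<String, List<ChunkLocation>> metadata = new HashMap<>();

        void register(String file, List<ChunkLocation> chunks) {
            metadata.put(file, chunks);
        }

        List<ChunkLocation> locate(String file) {
            return metadata.getOrDefault(file, List.of());
        }
    }

    // Chunk server: holds the actual bytes, addressed by chunk id.
    interface ChunkServer {
        byte[] read(String chunkId);
    }

    static byte[] readFirstChunk(Master master, Map<String, ChunkServer> servers,
                                 String file) {
        ChunkLocation loc = master.locate(file).get(0);        // metadata from the master
        String replica = loc.replicaServers().get(0);          // pick any replica
        return servers.get(replica).read(loc.chunkId());       // data from the chunk server
    }

    public static void main(String[] args) {
        Master master = new Master();
        master.register("/web/index-shard-0",
                List.of(new ChunkLocation("chunk-42", List.of("cs1", "cs2"))));
        Map<String, ChunkServer> servers =
                Map.of("cs1", id -> ("data of " + id).getBytes(),
                       "cs2", id -> ("data of " + id).getBytes());
        System.out.println(new String(readFirstChunk(master, servers, "/web/index-shard-0")));
    }
}
```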