Introduction to Data Center Computing Derek Murray October 2010.

Slides:



Advertisements
Similar presentations
Incremental Recomputations in MapReduce
Advertisements

Cluster Computing with Dryad Mihai Budiu, MSR-SVC LiveLabs, March 2008.
A walk in cloud (and look for databases) Jian Xu DMM DB-talk, Feb 2010.
Distributed Data-Parallel Computing Using a High-Level Programming Language Yuan Yu Michael Isard Joint work with: Andrew Birrell, Mihai Budiu, Jon Currey,
Data Management in the Cloud Paul Szerlip. The rise of data Think about this o For the past two decades, the largest generator of data was humans -- now.
Map/Reduce in Practice Hadoop, Hbase, MongoDB, Accumulo, and related Map/Reduce- enabled data stores.
Chuang, Sang, Yoo, Gu, Killian and Kulkarni, “EventWave” SoCC ‘13 EventWave: Programming Model and Runtime Support for Tightly-Coupled Elastic Cloud Applications.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Homework 1: Common Mistakes Memory Leak Storing of memory pointers instead of data.
Optimus: A Dynamic Rewriting Framework for Data-Parallel Execution Plans Qifa Ke, Michael Isard, Yuan Yu Microsoft Research Silicon Valley EuroSys 2013.
1/19 Presented by: Maedeh Tashakkorian Supervisor: Hadi Salimi Mazandaran University of Science and Technology February, 2011.
Lecture 6 – Google File System (GFS) CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of this presentation.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
David Gibbs and Govardhan Tanniru Georgia State University Department of Computer Science P.O. Box 3965 Atlanta, GA
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Author: Murray Stokely Presenter: Pim van Pelt Distributed Computing at Google.
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google∗
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
MapReduce.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
CSC 456 Operating Systems Seminar Presentation (11/13/2012) Leon Weingard, Liang Xin The Google File System.
1 Large-scale Incremental Processing Using Distributed Transactions and Notifications Written By Daniel Peng and Frank Dabek Presented By Michael Over.
Cloud Distributed Computing Environment Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
UC Berkeley Scaleable Structured Datastorage for Web 2.0 Michael Armbrust, David Patterson October, 2007.
Software Architecture
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Introduction to Hadoop and HDFS
Apache Cassandra - Distributed Database Management System Presented by Jayesh Kawli.
1 Dryad Distributed Data-Parallel Programs from Sequential Building Blocks Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly of Microsoft.
Pregel: A System for Large-Scale Graph Processing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and.
MapReduce How to painlessly process terabytes of data.
Google Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber.
Google Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber.
Large-scale Incremental Processing Using Distributed Transactions and Notifications Daniel Peng and Frank Dabek Google, Inc. OSDI Feb 2012 Presentation.
MATRIX MULTIPLY WITH DRYAD B649 Course Project Introduction.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
CS 347Lecture 9B1 CS 347: Parallel and Distributed Data Management Notes 13: BigTable, HBASE, Cassandra Hector Garcia-Molina.
The Google File System by S. Ghemawat, H. Gobioff, and S-T. Leung CSCI 485 lecture by Shahram Ghandeharizadeh Computer Science Department University of.
Introduction to Hbase. Agenda  What is Hbase  About RDBMS  Overview of Hbase  Why Hbase instead of RDBMS  Architecture of Hbase  Hbase interface.
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
Khoj: A Highly Scalable and Available Search Harneet Singh, Avinaash Gupta and Krishna Gayatri Kuchimanchi System Overview Load BalancingFailure Detection.
Dryad and DryaLINQ. Dryad and DryadLINQ Dryad provides automatic distributed execution DryadLINQ provides automatic query plan generation Dryad provides.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
CSC590 Selected Topics Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A.
Toward Efficient and Simplified Distributed Data Intensive Computing IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 6, JUNE 2011PPT.
History & Motivations –RDBMS History & Motivations (cont’d) … … Concurrent Access Handling Failures Shared Data User.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-1 Large Graph Mining: Power Tools and a Practitioner’s guide Task 8: hadoop and Tera/Peta byte graphs.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Percolator: Incrementally Indexing the Web OSDI’10.
CS239-Lecture 3 DryadLINQ Madan Musuvathi Visiting Professor, UCLA
Some slides adapted from those of Yuan Yu and Michael Isard
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
CSCI5570 Large Scale Data Processing Systems
NOSQL.
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
Distributed Systems CS
Ch 4. The Evolution of Analytic Scalability
Hadoop Technopoints.
Interpret the execution mode of SQL query in F1 Query paper
Distributed Systems CS
CMSC Cluster Computing Basics
CS639: Data Management for Data Science
Presentation transcript:

Introduction to Data Center Computing Derek Murray October 2010

What we’ll cover Techniques for handling “big data” –Distributed storage –Distributed computation Focus on recent papers describing real systems

Example: web search WWW CrawlingIndexingQuerying

A system architecture? Computers Network Storage

Data Center architecture Server Rack switch Server Rack switch Server Rack switch Core switch

Distributed storage High volume of data High volume of read/write requests Fault tolerance

Brewer’s CAP theorem (2000) Consistency Partition Tolerance Availability

The Google file system (2003) GFS Master Chunk server Client

Dynamo (2007) Client

Distributed computation Parallel distributed processing Single Program, Multiple Data (SPMD) Fault tolerance Applications

Task farming Master Worker Storage

MapReduce (2004)

Dryad (2007) Arbitrary directed acyclic graph (DAG) Vertices and channels Topological ordering

DryadLINQ (2008) Language Integrated Query (LINQ) var table = PartitionedTable.Get (“…”); var result = from x in table select x * x; int sumSquares = result.Sum();

Scheduling issues Heterogeneous performance Sharing a cluster fairly Data locality

Percolator (2010) Built on Google BigTable Transactions via snapshot isolation Per-column notifications (triggers)

Skywriting and C IEL (2010) Universal distributed execution engine Script language for distributed programs Opportunities for student projects…

References Storage –Ghemawat et al., “The Google File System”, Proceedings of SOSP 2003 –DeCandia et al., “Dynamo: Amazon’s Highly-Available Key-value Store”, Proceedings of SOSP 2007 Computation –Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Proceedings of OSDI 2004 –Isard et al., “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks”, Proceedings of EuroSys 2007 –Yu et al., “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language”, Proceedings of OSDI 2008 –Olston et al., “Pig Latin: A Not-So-Foreign Language for Data Processing”, Proceedings of SIGMOD 2008 –Murray and Hand, “Scripting the Cloud with Skywriting”, Proceedings of HotCloud 2010 Scheduling –Zaharia et al., “Improving MapReduce Performance in Heterogeneous Environments”, Proceedings of OSDI 2008 –Isard et al., “Quincy: Fair Scheduling for Distributed Computing Clusters”, Proceedings of SOSP 2009 –Zaharia et al., “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling”, Proceedings of EuroSys 2010 Transactions –Peng and Dabek, “Large-Scale Incremental Processing using Distributed Transactions and Notifications”, Proceedings of OSDI 2010

Conclusions Data centers achieve high performance with commodity parts Efficient storage requires application- specific trade-offs Data-parallelism simplifies distributed computation on the data

Questions Now or after the lecture – –Web