CSCI-2950u :: Data-Intensive Scalable Computing
Special Topics on Networking and Distributed Systems
Rodrigo Fonseca (rfonseca)

Welcome!
Course: TuTh, 10:30-11:50, 506 CIT
Me: Rodrigo Fonseca
- Have been here for 2 years
- Interested in distributed systems, operating systems, networking
- You can find me at or in 329 CIT (with an

Data-Intensive Scalable Computing
What does this mean?
- Sometimes known as DISC
- Data-Intensive: data deluge!
- Scalable: computing should keep up as the data grows
- Our focus:
  - Systems: how to build these scalable systems
  - Applications: what applications/algorithms are suitable for these systems

The Deluge, John Martin. Image from Wikimedia Commons.

How much data?

How much data?
- LHC: 10-15 PB/year
- Google: 20 PB/day processed (2008)
- Facebook: 36 PB of user data, 90 TB/day (2010)
- Economist: mankind produced 1,200 exabytes (billion GB) in 2010
- DNA sequencing throughput: increasing 5x per year
- LSST: 3 TB/day of image data
- Walmart, Amazon, NYSE, …

So much data
Easy to get
- Complex, automated sensors
- Lots of people (transactions, Internet)
Cheap to keep
- A 2 TB disk costs $80 today (4c/GB!)
Potentially very useful
- Both for science and business
Large scale: doesn't fit in memory

How do we process this data?
Current minute-sort world record
- External sort: 2 reads, 2 writes from/to disk (optimal)
- TritonSort (UCSD), ~1 TB/minute
One disk: say 2 TB, 100 MB/s (sequential)
- How long to read it once?
- ~2,000,000 MB / 100 MB/s ~ 20,000 s ~ 5.5 h! (see the back-of-envelope sketch below)
- TritonSort: 52 nodes × 16 disks
Latency: say processing 1 item takes 1 ms
- CPUs aren't getting much faster lately
- Only 1,000 items per second!
Have to use many machines
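A back-of-envelope sketch of the arithmetic on this slide; the disk size, sequential throughput, and per-item latency are the slide's assumed round numbers, not measurements:

```python
# Rough single-machine limits, using the assumed numbers from the slide.
DISK_SIZE_MB = 2_000_000      # 2 TB disk, in MB
SEQ_READ_MB_S = 100           # sequential read throughput
ITEM_LATENCY_S = 0.001        # 1 ms to process one item

read_time_s = DISK_SIZE_MB / SEQ_READ_MB_S
print(f"Full sequential read: {read_time_s:,.0f} s (~{read_time_s / 3600:.1f} h)")
# -> ~20,000 s, i.e. roughly the 5.5 h quoted on the slide

print(f"Serial processing rate: {1 / ITEM_LATENCY_S:,.0f} items/s")
# -> 1,000 items/s, no matter how large the dataset is
```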

Challenges
Traditional parallel supercomputers are not the right fit for many problems (given their cost)
- Optimized for fine-grained parallelism with a lot of communication
- Cost does not scale linearly with capacity
=> Clusters of commodity computers
- Even more accessible with pay-as-you-go cloud computing

Parallel computing is hard!
[Figure: message passing among processes P1-P5 vs. shared memory; master/slaves, producer/consumer, and work-queue patterns]
Different programming models
- Message passing, shared memory
Different programming constructs
- Mutexes, condition variables, barriers, …
- Masters/slaves, producers/consumers, work queues, …
Fundamental issues
- Scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, …
Common problems
- Livelock, deadlock, data starvation, priority inversion, …
- Dining philosophers, sleeping barbers, cigarette smokers, …
Architectural issues
- Flynn's taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth
- UMA vs. NUMA, cache coherence
The reality: the programmer shoulders the burden of managing concurrency… (a small work-queue sketch follows this slide)
Slide from Jimmy Lin, University of Maryland
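To make the "programmer shoulders the burden" point concrete, below is a minimal producers/consumers work-queue sketch (illustrative Python, not course material): even this toy needs explicit threads, a lock around shared state, and sentinel-based shutdown.

```python
# Work-queue pattern: a producer feeds tasks to worker threads by hand.
import threading
import queue

tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    """Consume items until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:              # sentinel: time to stop
            tasks.task_done()
            break
        with results_lock:            # shared state needs explicit locking
            results.append(item * item)
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(100):                  # producer side
    tasks.put(i)
for _ in workers:                     # one sentinel per worker
    tasks.put(None)

tasks.join()
for w in workers:
    w.join()
print(len(results))                   # 100
```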

DISC Main Ideas
Scale "out", not "up"
- See Barroso and Hölzle, Chapter 3
Assume failures are common
- Probability of "no machine down" decreases rapidly with scale…
Move processing to the data
- Bandwidth is scarce
Process data sequentially
- Seeks are *very* expensive
Hide system-level details from the application developer
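Two of these ideas lend themselves to quick numeric checks; the sketch below uses assumed round numbers (99.9% per-machine availability, 100 MB/s sequential reads, 10 ms per seek) purely for illustration.

```python
# 1. "Assume failures are common": even with 99.9% per-machine availability,
#    the probability that every machine in a large cluster is up drops fast.
p_up = 0.999
for n in (1, 100, 1_000, 10_000):
    print(f"{n:>6} machines: P(all up) = {p_up ** n:.3g}")

# 2. "Seeks are *very* expensive": reading 1 GB sequentially vs. as random
#    4 KB reads that each pay a ~10 ms seek.
sequential_s = 1_000 / 100                 # 1,000 MB at 100 MB/s
num_random_reads = (1_000 * 1_024) // 4    # 4 KB reads in ~1 GB
random_s = num_random_reads * 0.010        # 10 ms per read
print(f"Sequential: {sequential_s:.0f} s; random 4 KB reads: {random_s:,.0f} s")
```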

Examples
Google's MapReduce
- Many implementations: the most widely adopted is Hadoop
- Used in many applications
Many extensions
- Dryad, CIEL, iterative versions
Other frameworks
- GraphLab, Piccolo, TritonSort
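As a taste of the MapReduce programming model (the assigned paper covers the real thing), here is a minimal word-count sketch; `map_fn`, `reduce_fn`, and the single-process `run_local` driver are illustrative stand-ins, not the Hadoop or Google API.

```python
# Word count in MapReduce style: the user supplies map and reduce functions;
# run_local is a toy stand-in for the distributed runtime, which would
# partition the input, shuffle by key, and run these across many machines.
from collections import defaultdict

def map_fn(doc_id, text):
    for word in text.split():
        yield word.lower(), 1          # emit (key, value) pairs

def reduce_fn(word, counts):
    return word, sum(counts)           # combine all values for one key

def run_local(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():          # "map" phase
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)          # "shuffle": group by key
    return [reduce_fn(k, v) for k, v in groups.items()]   # "reduce" phase

print(run_local({"d1": "the data deluge", "d2": "the data grows"}))
# [('the', 2), ('data', 2), ('deluge', 1), ('grows', 1)]
```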

This course
We will study current DISC solutions
- Systems challenges
- Programming models
- Dealing with failures
We will look at several applications
- Information retrieval, data mining, graph mining, computational biology, finance, machine learning, …
Possibly
- Identify shortcomings, limitations
- Address these!

Course Mechanics
3 components
- Reading papers (systems and applications)
  - Reviews of assigned papers (everyone)
  - Lead discussion in class (one or two people per paper)
- A few programming assignments
  - For example, PageRank on a collection of documents
  - Department cluster, possibly on Amazon Cloud
  - Get your hands dirty
- A mini research project
  - Systems and/or applications
  - Can be related to your research
  - Preferably in small groups (2 ideal)

Web and Group
Site will be up later today
Google group will be where you submit your reviews
- Due by midnight, the day before class

Next class
Skim Barroso and Hölzle, Chapters 1 and 3
Skim Lin and Dyer, Chapter 1
Read the MapReduce paper
- Dean, J., and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (2008), 107-113.
- No review necessary
Homework 0
- www.cs.brown.edu/courses/csci2950-u/f11/homework0.html