
1 Christopher Moretti – University of Notre Dame 4/30/2008 High Level Abstractions for Data-Intensive Computing Christopher Moretti, Hoang Bui, Brandon Rich, and Douglas Thain University of Notre Dame

2 Christopher Moretti – University of Notre Dame 4/30/2008 Computing’s central challenge, “How not to make a mess of it,” has not yet been met. -Edsger Dijkstra

3 Christopher Moretti – University of Notre Dame 4/30/2008 Overview • Many systems today give end users access to hundreds or thousands of CPUs. • But it is far too easy for the naive user to create a big mess in the process. • Our Solution: Deploy high-level abstractions that describe both data and computation needs. • Some examples of current work: • All-Pairs: An abstraction for biometric workloads. • Distributed Ensemble Classification. • DataLab: A system and language for data-parallel computation.

4 Christopher Moretti – University of Notre Dame 4/30/2008 Three Examples of Work at Notre Dame [Figure: the All-Pairs result matrix over sets S1 and S2; the ensemble classification pipeline (training data, partitioning/sampling, algorithms, classifiers, voting on a test instance); and DataLab chirp servers applying a function F to the members of set S to produce set T.]

5 Christopher Moretti – University of Notre Dame 4/30/2008 Distributed Computing is Hard! How do I fit my workload into jobs? Which resources? How many? What happens when things fail? What is Condor? What do I do with the results? How can I measure job stats? What about job input data? How long will it take?

6 Christopher Moretti – University of Notre Dame 4/30/2008 Distributed Computing is Hard! How do I fit my workload into jobs? Which resources? How many? What happens when things fail? What is Condor? What do I do with the results? How can I measure job stats? What about job input data? How long will it take? ARGH!

7 Christopher Moretti – University of Notre Dame 4/30/2008 Domain Experts are not Distributed Experts Clouds Clusters OSG TeraGrid

8 Christopher Moretti – University of Notre Dame 4/30/2008 Abstractions – Compiler

#include <iostream>
using namespace std;

int main() {
    int i, j;
    for(i=0; i<100; i++)
        for(j=0; j<100; j++)
            cout << i+j << endl;
    return 0;
}

[Figure: the compiler and glibc turn this source file into MyProg.exe.]

9 Christopher Moretti – University of Notre Dame 4/30/2008 Abstractions – Map-Reduce Sample Application: Identify all unique nouns and verbs in 1M documents. [Figure: each map task reads a document and emits its nouns and verbs; a reduce task combines them into the unique nouns and unique verbs. Inputs: (file, word); intermediates: (word, count); output: (word, count).]
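To make the shape of the two user-supplied functions concrete, here is a minimal, self-contained C++ sketch of the map and reduce steps for this sample application (my illustration, not code from the talk); the is_noun_or_verb test is a toy stand-in for a real part-of-speech tagger.

#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
using namespace std;

// Toy stand-in for a real part-of-speech tagger.
bool is_noun_or_verb(const string &w) {
    static const set<string> words = {"cluster", "run", "data", "compare"};
    return words.count(w) > 0;
}

// Map: scan one document, emit an intermediate (word, 1) for each noun or verb.
void map_document(istream &doc, multimap<string, int> &intermediate) {
    string w;
    while (doc >> w)
        if (is_noun_or_verb(w))
            intermediate.insert({w, 1});
}

// Reduce: collapse the intermediate (word, count) pairs into the unique words.
set<string> reduce_unique(const multimap<string, int> &intermediate) {
    set<string> unique;
    for (const auto &kv : intermediate)
        unique.insert(kv.first);
    return unique;
}

int main() {
    istringstream doc("please run the cluster and compare the data");
    multimap<string, int> intermediate;
    map_document(doc, intermediate);
    for (const string &w : reduce_unique(intermediate))
        cout << w << endl;
    return 0;
}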

10 Christopher Moretti – University of Notre Dame 4/30/2008 Abstractions – Map-Reduce • Map-Reduce is a distributed abstraction that encapsulates the data and computation needs of a workload. • So can Map-Reduce solve an All-Pairs problem? Not efficiently. • AllPairs(A,B,F) becomes Map(F,S) where S = ((A1,B1), (A1,B2), … ) • So we have a large workload with one job per comparison, with no attempt to run computations where the data lies, or to prestage data to the location at which it will be used. • This is our motivating (bad!) example!
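To see why, here is a small sketch (mine, not from the talk) of what that encoding implies: every pair (Ai, Bj) becomes an independent map input, so two n-element sets yield n-squared separate jobs, each of which must fetch both of its input files from wherever they happen to be stored.

#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

int main() {
    // Two small example sets; a real biometric workload has ~60,000 files each.
    vector<string> A, B;
    for (int i = 1; i <= 3; i++) {
        A.push_back("A" + to_string(i) + ".img");
        B.push_back("B" + to_string(i) + ".img");
    }

    // Naive encoding: the map input set S is the full cross product A x B,
    // so |S| = |A| * |B| independent jobs and no opportunity for data locality.
    vector< pair<string, string> > S;
    for (size_t i = 0; i < A.size(); i++)
        for (size_t j = 0; j < B.size(); j++)
            S.push_back(make_pair(A[i], B[j]));

    cout << S.size() << " map inputs (jobs) for "
         << A.size() << " x " << B.size() << " files" << endl;
    return 0;
}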

11 Christopher Moretti – University of Notre Dame 4/30/2008 The All-Pairs Problem All-Pairs( Set S1, Set S2, Function F ) yields a matrix M where M[i,j] = F(S1[i], S2[j]). [Figure: the result matrix over two sets of 60K images of 20KB each (>1GB of input); 3.6B comparisons at 50/s = 2.3 CPU-years; x 8B of output per cell = 29GB of results.]
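Spelling out the numbers on the slide: 60,000 x 60,000 images means 3.6 billion comparisons; at roughly 50 comparisons per second on a single CPU that is about 2.3 CPU-years of computation; and at 8 bytes per matrix cell the output alone is about 29 GB.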

12 Christopher Moretti – University of Notre Dame 4/30/2008 Biometric All-Pairs Comparison F

13 Christopher Moretti – University of Notre Dame 4/30/2008 Naïve Mistakes Computing Problem: Even expert users don’t know how to tune jobs optimally, and can make 100 CPUs even slower than one by overloading the file server, network, or resource manager. [Figure: every CPU in the batch system runs "For all $X: For all $Y: cmp $X to $Y" against a single file server, so each CPU reads 10TB!]

14 Christopher Moretti – University of Notre Dame 4/30/2008 Consequences of Naïve Mistakes

15 Christopher Moretti – University of Notre Dame 4/30/2008 All-Pairs Abstraction [Figure: the user supplies a set S of files and a binary function F; the invocation M = AllPairs(F,S) yields the result matrix.]
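For concreteness, here is a minimal sketch (my illustration, not code from the talk) of the kind of standalone comparison program a user might supply as F: it takes two file names, compares their contents, and prints a single score. The exact interface All-Pairs expects is not given on the slide, so treat this command-line signature as an assumption; a real biometric F would compare iris templates or face images rather than raw bytes.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
using namespace std;

// Toy F: the fraction of positions at which the two files hold the same byte.
double compare_files(const string &a, const string &b) {
    ifstream fa(a.c_str(), ios::binary), fb(b.c_str(), ios::binary);
    vector<char> da((istreambuf_iterator<char>(fa)), istreambuf_iterator<char>());
    vector<char> db((istreambuf_iterator<char>(fb)), istreambuf_iterator<char>());
    size_t n = min(da.size(), db.size());
    if (n == 0) return 0.0;
    size_t same = 0;
    for (size_t i = 0; i < n; i++)
        if (da[i] == db[i]) same++;
    return double(same) / double(n);
}

int main(int argc, char *argv[]) {
    if (argc != 3) {
        cerr << "usage: " << argv[0] << " fileA fileB" << endl;
        return 1;
    }
    // Print one cell of the result matrix: F(fileA, fileB).
    cout << compare_files(argv[1], argv[2]) << endl;
    return 0;
}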

16 Christopher Moretti – University of Notre Dame 4/30/2008 Avoiding Consequences of Naïveté Approach: Create data intensive abstractions that allow the system at runtime to distribute data, partition jobs, exploit locality, and hide errors. All-Pairs(F,S) = F(Si,Sj) for all elements in S. [Figure: the user hands "Here is F(x,y)" and "Here is set S" to the All-Pairs Portal, which distributes files to the CPUs by spanning tree and adds additional fault tolerance.]

17 Christopher Moretti – University of Notre Dame 4/30/2008 All-Pairs Production System at Notre Dame Web portal, All-Pairs engine, and 300 active storage units (500 CPUs, 40TB of disk). 1 - Upload F and S into web portal. 2 - AllPairs(F,S). 3 - O(log n) distribution by spanning tree. 4 - Choose optimal partitioning and submit batch jobs. 5 - Collect and assemble results. 6 - Return result matrix to user.

18 Christopher Moretti – University of Notre Dame 4/30/2008

19 Christopher Moretti – University of Notre Dame 4/30/2008

20 Christopher Moretti – University of Notre Dame 4/30/2008

21 Christopher Moretti – University of Notre Dame 4/30/2008 Returning the Result Matrix [Figure: fragments of the result matrix, e.g. a cell holding 0.98] • Too many files: hard to do prefetching. • Too large a file: must scan the entire file, and it is row- or column-ordered. • How can we build it?

22 Christopher Moretti – University of Notre Dame 4/30/2008 Result Storage by Abstraction • Chirp_array allows users to create, manage, and modify large arrays without having to be aware of the underlying physical form. • Operations on chirp_array: • create a chirp_array • open a chirp_array • set value A[i,j] • get value A[i,j] • get row A[i] • get column A[j] • set row A[i] • set column A[j]
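A minimal sketch of how client code might use such an interface, with function names modeled on the operations listed above; this is a toy in-memory stand-in of my own (the real chirp_array API is not shown on the slide), whereas a real implementation would route each get/set to the chirp server holding that part of the array.

#include <cstdio>
#include <vector>

// Toy in-memory stand-in for the chirp_array interface described on the slide.
struct chirp_array {
    int rows, cols;
    std::vector<double> data;
};

chirp_array *chirp_array_create(int rows, int cols) {
    chirp_array *a = new chirp_array;
    a->rows = rows;
    a->cols = cols;
    a->data.assign((size_t)rows * cols, 0.0);
    return a;
}

void chirp_array_set(chirp_array *a, int i, int j, double v) {
    a->data[(size_t)i * a->cols + j] = v;
}

double chirp_array_get(chirp_array *a, int i, int j) {
    return a->data[(size_t)i * a->cols + j];
}

void chirp_array_delete(chirp_array *a) { delete a; }

int main() {
    chirp_array *M = chirp_array_create(1000, 1000);   // small demo matrix
    chirp_array_set(M, 0, 0, 0.98);                    // store one comparison result
    std::printf("M[0,0] = %.2f\n", chirp_array_get(M, 0, 0));
    chirp_array_delete(M);
    return 0;
}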

23 Christopher Moretti – University of Notre Dame 4/30/2008 Result Storage with chirp_array • chirp_array_get(i,j) [Figure: several CPU/disk storage nodes]

24 Christopher Moretti – University of Notre Dame 4/30/2008 Result Storage with chirp_array • chirp_array_get(i,j) [Figure: several CPU/disk storage nodes]

25 Christopher Moretti – University of Notre Dame 4/30/2008 Result Storage with chirp_array • chirp_array_get(i,j) [Figure: several CPU/disk storage nodes]

26 Christopher Moretti – University of Notre Dame 4/30/2008 Data Mining on Large Data Sets Problem: Supercomputers are expensive, and not all scientists have access to them for very large memory problems. Classification on large data sets without sufficient memory can degrade throughput, degrade accuracy, or fail outright.

27 Christopher Moretti – University of Notre Dame 4/30/2008 Data Mining Using Ensembles (From Steinhaeuser and Chawla, 2007) [Figure: the training data is partitioned/sampled (optional); algorithms 1..n each build a classifier; the classifiers each label the test instance and the labels are combined by voting into the final classification.]

28 Christopher Moretti – University of Notre Dame 4/30/2008 Data Mining Using Ensembles (From Steinhaeuser and Chawla, 2007) [Figure: the training data is partitioned/sampled (optional); algorithms 1..n each build a classifier; the classifiers each label the test instance and the labels are combined by voting into the final classification.]
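As a sketch of the final voting step (my illustration, not code from the talk): each classifier trained on a partition returns a predicted label for the test instance, and the label with the most votes becomes the final classification.

#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

// Combine the local predictions returned by the n classifiers by majority vote.
string majority_vote(const vector<string> &local_votes) {
    map<string, int> counts;
    for (const string &label : local_votes)
        counts[label]++;
    string winner;
    int best = -1;
    for (const auto &kv : counts)
        if (kv.second > best) { best = kv.second; winner = kv.first; }
    return winner;
}

int main() {
    // Hypothetical local votes collected from classifiers trained on partitions.
    vector<string> votes = {"genuine", "impostor", "genuine", "genuine"};
    cout << "final classification: " << majority_vote(votes) << endl;
    return 0;
}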

29 Christopher Moretti – University of Notre Dame 4/30/2008 Abstraction for Ensembles Using Natural Parallelism [Figure: the user tells the abstraction engine "Here are my algorithms. Here is my data set. Here is my test set."; the engine chooses an optimal partitioning and submits batch jobs to the CPUs, then the local votes are returned for tabulation and the final prediction.]

30 Christopher Moretti – University of Notre Dame 4/30/2008 DataLab Abstractions [Figure: chirp servers, each backed by a Unix filesystem, expose three layers of abstraction: a file system (reached by ordinary tools such as tcsh, emacs, and perl through parrot), distributed data structures (a set S holding files A, B, C and a function file F), and function evaluation (Y = F(X), driven by job_start, job_commit, job_wait, job_remove).]
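A sketch of how the function-evaluation layer might look from client code, using the four job operations named on the slide; the signatures and the purely local stand-in behavior here are my assumptions, not the actual DataLab interface.

#include <cstdio>
#include <cstdlib>
#include <string>
using namespace std;

// Toy local stand-ins for the job operations named on the slide. In DataLab
// these would ask a chirp server to run F next to the data it stores; here
// the job simply runs as a local command.
struct job { string command; int status; bool done; };

job  job_start(const string &command) { return job{command, -1, false}; }
void job_commit(job &j) { j.status = system(j.command.c_str()); j.done = true; }
int  job_wait(const job &j) { return j.done ? j.status : -1; }
void job_remove(job &) { /* nothing to clean up in this toy version */ }

int main() {
    // Evaluate Y = F(X): run program F on input file X, producing output file Y.
    job j = job_start("./F X > Y");
    job_commit(j);
    printf("job finished with status %d\n", job_wait(j));
    job_remove(j);
    return 0;
}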

31 Christopher Moretti – University of Notre Dame 4/30/2008 DataLab Language Syntax apply F on S into T [Figure: the apply statement runs F on each member A, B, C of set S on the chirp server where that member lives, producing the corresponding members of set T.]
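For instance, with hypothetical names substituted into the statement form shown above (only the apply syntax itself comes from the slide), a biometric preprocessing step might be written as:

    apply convert_iris on irises into templates

so the system runs convert_iris on every image in the set irises, on whichever chirp server holds it, and collects the outputs as the set templates.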

32 Christopher Moretti – University of Notre Dame 4/30/2008 For More Information • Christopher Moretti • Douglas Thain • Cooperative Computing Lab