1 Christopher Moretti – University of Notre Dame 4/30/2008
High Level Abstractions for Data-Intensive Computing
Christopher Moretti, Hoang Bui, Brandon Rich, and Douglas Thain
University of Notre Dame

2 Computing’s central challenge, “How not to make a mess of it,” has not yet been met. - Edsger Dijkstra

3 Overview
• Many systems today give end users access to hundreds or thousands of CPUs.
• But it is far too easy for the naive user to create a big mess in the process.
• Our solution: deploy high-level abstractions that describe both data and computation needs.
• Some examples of current work:
  • All-Pairs: an abstraction for biometric workloads.
  • Distributed ensemble classification.
  • DataLab: a system and language for data-parallel computation.

4 Three Examples of Work at Notre Dame
[Figure: three diagrams. (1) The All-Pairs matrix comparing elements S1 1 … S1 7 against S2 1 … S2 7. (2) Ensemble classification: training data, partitioning/sampling (optional), algorithm 1 … algorithm n, classifier 1 … classifier n, test instance, voting, classification. (3) DataLab: F applied to set S (A, B, C) on chirp servers to produce set T (A, B, C).]

5 Distributed Computing is Hard!
• How do I fit my workload into jobs?
• Which resources?
• What happens when things fail?
• How many?
• What is Condor?
• What do I do with the results?
• How can I measure job stats?
• What about job input data?
• How long will it take?

6 Distributed Computing is Hard! ARGH!

7 Domain Experts are not Distributed Experts
[Figure labels: Clouds, Clusters, OSG, TeraGrid]

8 Abstractions – Compiler
    #include <iostream>
    using namespace std;

    int main() {
        int i, j;
        // Print i+j for every pair (i, j) with 0 <= i, j < 100.
        for (i = 0; i < 100; i++)
            for (j = 0; j < 100; j++)
                cout << i + j << endl;
        return 0;
    }
[Figure: the compiler combines this source with glibc to produce MyProg.exe]

9 Abstractions – Map-Reduce
Sample application: identify all unique nouns and verbs in 1M documents.
[Figure: documents feed several map tasks, each emitting nouns and verbs; a reduce step produces the unique nouns and unique verbs. Inputs: (file, word). Intermediates: (word, count). Output: (word, count).]

10 Abstractions – Map-Reduce
• Map-Reduce is a distributed abstraction that encapsulates the data and computation needs of a workload.
• So can Map-Reduce solve an All-Pairs problem?
  • Not efficiently.
• AllPairs(A,B,F) can be expressed as Map(F,S), where S = ((A1,B1), (A1,B2), …).
• The result is a large workload with one job per comparison, with no attempt to run computations where the data lies or to prestage data to the location at which it will be used.
• This is our motivating (bad!) example!
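
The reduction on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: `all_pairs_as_map` and the toy string inputs are hypothetical names, and the built-in `map` stands in for a distributed Map step. It shows that flattening All-Pairs into a pair list S produces one independent work item per comparison, with nothing that groups work by where the data lives.

```python
from itertools import product


def all_pairs_as_map(A, B, F):
    """Express AllPairs(A, B, F) as Map(F, S) over a flattened pair list S."""
    # S = ((A1,B1), (A1,B2), ...): every pair becomes its own work item.
    S = list(product(A, B))
    # One "job" per comparison; no pair is grouped by data location.
    results = list(map(lambda pair: F(*pair), S))
    return S, results


# Toy inputs (hypothetical): strings stand in for files, concatenation for F.
A = ["a1", "a2", "a3"]
B = ["b1", "b2"]
S, results = all_pairs_as_map(A, B, lambda x, y: x + y)
jobs = len(S)  # grows as len(A) * len(B)
```

Because every element of S is scheduled on its own, each input file may be fetched again for every pair it appears in, which is the inefficiency the slide calls out.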

11 The All-Pairs Problem
All-Pairs( Set S1, Set S2, Function F ) yields a matrix M: M_{ij} = F(S1_i, S2_j)
• 60K 20KB images: >1GB
• 3.6B comparisons @ 50/s = 2.3 CPU-years
• x 8B output = 29GB
[Figure: matrix with rows S1 1 … S1 7 and columns S2 1 … S2 7]
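
The figures on this slide follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative, and decimal units (1KB = 1000 bytes, 1GB = 10^9 bytes) and a 365-day year are assumptions.

```python
n_images = 60_000          # images in the set
image_bytes = 20 * 1000    # 20KB each (decimal units assumed)
rate = 50                  # comparisons per second on one CPU
result_bytes = 8           # bytes of output per comparison

input_gb = n_images * image_bytes / 1e9      # total input size in GB
comparisons = n_images ** 2                  # every image against every image
cpu_seconds = comparisons / rate
cpu_years = cpu_seconds / (365 * 24 * 3600)  # 365-day year assumed
output_gb = comparisons * result_bytes / 1e9

print(f"{input_gb:.1f} GB input, {comparisons:.1e} comparisons, "
      f"{cpu_years:.1f} CPU-years, {output_gb:.1f} GB output")
```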

12 Biometric All-Pairs Comparison
[Figure: function F applied to pairs of images, filling a matrix of comparison scores.]

13 Naïve Mistakes
Computing problem: even expert users don’t know how to tune jobs optimally, and can make 100 CPUs even slower than one by overloading the file server, network, or resource manager.
    For all $X :
      For all $Y :
        cmp $X to $Y
[Figure: CPUs launched through the batch system all read from one file server. Each CPU reads 10TB!]
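
A rough cost model shows why the nested loop overloads the file server. This is a sketch under stated assumptions, not a measurement: it assumes every `cmp` job fetches both inputs from the central file server, borrows the 60K x 20KB workload from the All-Pairs Problem slide, and does not try to reproduce the 10TB figure in the diagram. Both function names are hypothetical.

```python
def naive_bytes_read(n_files, file_bytes):
    """Bytes pulled from the file server when each comparison fetches both inputs."""
    return n_files * n_files * 2 * file_bytes


def prestaged_bytes_read(n_files, file_bytes, n_cpus):
    """Bytes pulled when each CPU copies the whole set once and then works locally."""
    return n_cpus * n_files * file_bytes


naive = naive_bytes_read(60_000, 20_000)               # total for the nested loop
prestaged = prestaged_bytes_read(60_000, 20_000, 100)  # total for 100 CPUs
print(f"naive: {naive / 1e12:.0f} TB, prestaged: {prestaged / 1e9:.0f} GB")
```

Under these assumptions the naive loop moves about a thousand times more data through the file server than staging the set onto each CPU first.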

14 Consequences of Naïve Mistakes

15 All Pairs Abstraction
• set S of files
• binary function F
• invocation: M = AllPairs(F,S)

16 Avoiding Consequences of Naïveté
Approach: create data-intensive abstractions that allow the system at runtime to distribute data, partition jobs, exploit locality, and hide errors.
All-Pairs(F,S) = F(Si,Sj) for all elements in S.
[Figure: the user tells the All-Pairs Portal "Here is F(x,y)" and "Here is set S." The portal distributes files from the file server to the CPUs by spanning tree and provides additional fault tolerance.]
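
From the user's side the abstraction is a single call. The sketch below shows only those semantics, evaluated serially in memory; `all_pairs` is a hypothetical name, and none of the data distribution, partitioning, or fault handling that the real system performs at runtime is modeled.

```python
def all_pairs(F, S):
    """Return M with M[i][j] = F(S[i], S[j]) for every pair of elements in S."""
    return [[F(si, sj) for sj in S] for si in S]


# Toy example: integers stand in for files, absolute difference for F.
M = all_pairs(lambda x, y: abs(x - y), [1, 4, 6])
```

The user supplies only F and S; how the work is split and where it runs is left to the system.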

17 All-Pairs Production System at Notre Dame
1 - Upload F and S into web portal.
2 - AllPairs(F,S)
3 - O(log n) distribution by spanning tree.
4 - Choose optimal partitioning and submit batch jobs.
5 - Collect and assemble results.
6 - Return result matrix to user.
[Figure: web portal, All-Pairs engine, and 300 active storage units (500 CPUs, 40TB disk) holding S and T and running F]
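
Step 3 can be illustrated with a small simulation of why tree-shaped distribution takes O(log n) rounds. The doubling scheme below (every node that holds the data forwards it to one new node per round) is an assumption chosen for illustration; the slides do not specify the exact tree shape, and `spanning_tree_rounds` is a hypothetical name.

```python
def spanning_tree_rounds(n_nodes):
    """Count rounds needed to copy data from one source to n_nodes targets.

    In each round every node that already holds the data sends it to one
    node that does not, so the number of holders doubles per round.
    """
    holders = 1            # the source
    remaining = n_nodes    # targets still waiting for the data
    rounds = 0
    while remaining > 0:
        sent = min(holders, remaining)
        holders += sent
        remaining -= sent
        rounds += 1
    return rounds


rounds_for_300 = spanning_tree_rounds(300)  # 300 storage units, as on the slide
```

A single source sending to each target in turn would need one transfer per target, so the gap widens quickly as the number of storage units grows.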


21 Returning the Result Matrix
• Too many files.
  • Hard to do prefetching.
• Too large files.
  • Must scan entire file.
• Row/column ordered.
  • How can we build it?
[Figure: sample result values (4.37, 6.01, 2.22, 7.13, 8.94, 6.72, 1.34, …, 0.98) shown in different file layouts]

22 Result Storage by Abstraction
• Chirp_array allows users to create, manage, and modify large arrays without having to know their underlying form.
• Operations on chirp_array:
  • create a chirp_array
  • open a chirp_array
  • set value A[i,j]
  • get value A[i,j]
  • get row A[i]
  • get column A[j]
  • set row A[i]
  • set column A[j]
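
The operation list above can be mirrored by a small in-memory class. This is a hypothetical stand-in for illustration only: `ArraySketch` and its method names are not the chirp_array API, and the flat row-major list is an assumed layout. The point is that callers use element, row, and column operations without seeing how values are stored.

```python
class ArraySketch:
    """Hypothetical in-memory stand-in for the listed operations (not chirp_array)."""

    def __init__(self, rows, cols, fill=0.0):
        self.rows, self.cols = rows, cols
        # Flat row-major storage; callers never depend on this layout.
        self._data = [fill] * (rows * cols)

    def get(self, i, j):
        return self._data[i * self.cols + j]

    def set(self, i, j, value):
        self._data[i * self.cols + j] = value

    def get_row(self, i):
        return self._data[i * self.cols:(i + 1) * self.cols]

    def get_col(self, j):
        # Every cols-th element starting at j is column j.
        return self._data[j::self.cols]

    def set_row(self, i, values):
        assert len(values) == self.cols
        self._data[i * self.cols:(i + 1) * self.cols] = list(values)

    def set_col(self, j, values):
        assert len(values) == self.rows
        self._data[j::self.cols] = list(values)


A = ArraySketch(2, 3)
A.set(0, 1, 4.37)
A.set_row(1, [7.13, 8.94, 6.72])
A.set_col(0, [1.0, 2.0])
```

A distributed version could change the storage behind these methods, for example by spreading rows across servers, without changing the calls a user makes.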

23 Result Storage with chirp_array
• chirp_array_get(i,j)
[Figure: three nodes, each with a CPU and a disk]


26 Data Mining on Large Data Sets
Problem: supercomputers are expensive, and not all scientists have access to them for solving very large memory problems. Classification on large data sets without sufficient memory can degrade throughput, degrade accuracy, or fail outright.

27 Data Mining Using Ensembles (From Steinhaeuser and Chawla, 2007)
[Figure: training data, partitioning/sampling (optional), algorithm 1 … algorithm n, classifier 1 … classifier n; a test instance is given to the classifiers, and voting produces the classification.]
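
The pipeline in this diagram can be sketched as follows. Everything here is illustrative and assumed: the two toy learners, the round-robin partitioning, and the simple majority vote are stand-ins, not the methods used in the cited work.

```python
from collections import Counter


def train_ensemble(algorithms, training_data, n_partitions):
    """Train one classifier per (algorithm, partition) pair."""
    # Round-robin partitioning; the diagram marks this step as optional.
    partitions = [training_data[k::n_partitions] for k in range(n_partitions)]
    return [algo(part) for algo in algorithms for part in partitions]


def classify(classifiers, instance):
    """Each classifier casts one vote; the most common label wins."""
    votes = Counter(clf(instance) for clf in classifiers)
    return votes.most_common(1)[0][0]


def threshold_learner(rows):
    """Toy algorithm: split at the mean feature value of the partition."""
    cut = sum(x for x, _ in rows) / len(rows)
    return lambda x: "high" if x > cut else "low"


def nearest_learner(rows):
    """Toy algorithm: copy the label of the closest training point."""
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]


data = [(1, "low"), (2, "low"), (3, "low"),
        (8, "high"), (9, "high"), (10, "high")]
ensemble = train_ensemble([threshold_learner, nearest_learner], data, n_partitions=3)
prediction = classify(ensemble, 9.5)
```

Each classifier sees only its own partition, so no single training job needs the whole data set in memory, and the training jobs are independent of one another.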


29 Abstraction for Ensembles Using Natural Parallelism
• User to abstraction engine: "Here are my algorithms." "Here is my data set." "Here is my test set."
• Choose optimal partitioning and submit batch jobs.
• Return local votes for tabulation and final prediction.
[Figure: the abstraction engine dispatches jobs to CPUs, which send back local votes]

30 DataLab Abstractions
[Figure: three abstractions built on chirp servers. File system: unix filesystems exported by chirp servers and used through parrot by tcsh, emacs, and perl. Distributed data structures: set S (A, B, C) and file F spread across chirp servers. Function evaluation: Y = F(X), with job_start, job_commit, job_wait, job_remove.]

31 DataLab Language Syntax
    apply F on S into T
[Figure: F runs on the chirp servers that hold members A, B, C of set S and produces the corresponding members A, B, C of set T]
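
The meaning of `apply F on S into T` can be sketched in Python, assuming a set is modeled as a mapping from server name to the items that server holds. The function name, the dictionary model, and the server names are all hypothetical; this is not DataLab's implementation.

```python
def apply_on_into(F, S):
    """Sketch of `apply F on S into T` with a set modeled as {server: [items]}."""
    # F runs once per item on the server that already holds it, so T ends up
    # partitioned across the same servers as S.
    return {server: [F(item) for item in items] for server, items in S.items()}


# Toy example: lowercasing stands in for F.
S = {"server1": ["A"], "server2": ["B"], "server3": ["C"]}
T = apply_on_into(str.lower, S)
```

Keeping each result on the server that held its input means no item has to cross the network in order to be processed.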

32 For More Information
• Christopher Moretti: cmoretti@cse.nd.edu
• Douglas Thain: dthain@cse.nd.edu
• Cooperative Computing Lab: http://cse.nd.edu/~ccl

