CX: A Scalable, Robust Network for Parallel Computing

CX: A Scalable, Robust Network for Parallel Computing Peter Cappello & Dimitrios Mourloukos Computer Science UCSB

Outline: Introduction · Related work · API · Architecture · Experimental results · Current & future work

Introduction
“Listen to the technology!” (Carver Mead)
What is the technology telling us?
- The Internet’s idle cycles/sec are growing rapidly.
- Bandwidth is increasing & getting cheaper.
- Communication latency is not decreasing.
- Human technology is getting neither cheaper nor faster.

Introduction
Project Goals:
- Minimize job completion time despite large communication latency.
- Jobs complete with high probability despite faulty components.
- The application program is oblivious to:
  - the number of processors
  - inter-process communication
  - hardware faults

Introduction
Fundamental Issue: Heterogeneity
[Figure: machines M1–M5 running different operating systems OS1–OS5.]
Heterogeneous machines/OSs become functionally homogeneous under the JVM.

Outline: Introduction · Related work · API · Architecture · Experimental results · Current & future work

Related work
Cilk / Cilk-NOW / Atlas:
- DAG computational model
- Work stealing

Related work
Linda / Piranha / JavaSpaces:
- Space-based coordination
- Decoupled communication

Related work
Charlotte (Milan project / Calypso prototype):
- High performance: no distributed transactions
- Fault tolerance via eager scheduling

Related work
SuperWeb / Javelin / Javelin++:
- Architecture: client, broker, host

Outline: Introduction · Related work · API · Architecture · Experimental results · Current & future work

API
DAG computational model:
int f( int n ) {
    if ( n < 2 ) return n;
    else return f( n-1 ) + f( n-2 );
}

DAG Computational Model
int f( int n ) {
    if ( n < 2 ) return n;
    else return f( n-1 ) + f( n-2 );
}
Method invocation tree for f(4): f(4) invokes f(3) and f(2); f(3) invokes f(2) and f(1); each f(2) invokes f(1) and f(0).

DAG Computational Model / API
execute( ) {                        // task f(n)
    if ( n < 2 )
        setArg( ArgAddr, n );
    else {
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

execute( ) {                        // task +
    setArg( ArgAddr, in[0] + in[1] );
}

Spawning unfolds the same tree as the method invocations of f(4), with a + task composing the results of f(n-1) and f(n-2).
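
To make the API concrete, here is a minimal Java sketch of these two tasks. The Task base class and the spawn/setArg signatures are assumptions made for illustration; the actual CX API may differ.

// Hypothetical CX-style Fibonacci tasks (illustrative, not the real CX classes).
abstract class Task {
    Object[] in = new Object[2];            // input slots, filled via setArg
    abstract void execute();
    void spawn(Task t) { /* submit t to the task server (omitted) */ }
    void setArg(Task successor, int slot, int value) {
        successor.in[slot] = value;         // in CX this would go through the server
    }
}

class F extends Task {                      // decompose: f(n) -> +, f(n-1), f(n-2)
    int n; Task successor; int slot;
    F(int n, Task successor, int slot) {
        this.n = n; this.successor = successor; this.slot = slot;
    }
    void execute() {
        if (n < 2) setArg(successor, slot, n);   // base case: forward n
        else {
            Add add = new Add(successor, slot);  // the “+” task
            spawn(add);
            spawn(new F(n - 1, add, 0));
            spawn(new F(n - 2, add, 1));
        }
    }
}

class Add extends Task {                    // compose: forward in[0] + in[1]
    Task successor; int slot;
    Add(Task successor, int slot) { this.successor = successor; this.slot = slot; }
    void execute() {
        setArg(successor, slot, (Integer) in[0] + (Integer) in[1]);
    }
}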

Outline: Introduction · Related work · API · Architecture · Experimental results · Current & future work

Architecture: Basic Entities
A Consumer’s session with the Production Network follows the pattern: register ( spawn | getResult )* unregister.
[Figure: Consumer, Production Network, and Cluster Network.]
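
The session pattern above maps onto a small consumer-facing interface. A hypothetical sketch, with illustrative names rather than the actual CX signatures:

// register ( spawn | getResult )* unregister, expressed as Java interfaces.
interface ProductionNetwork {
    Session register(String consumerName);  // start a session
}

interface Session {
    void spawn(Task rootTask);              // submit work
    Object getResult();                     // block until a result is available
    void unregister();                      // end the session
}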

Architecture: Cluster
[Figure: one Task Server serving several Producers.]

A Cluster at Work
[Animation: the task f(4) arrives on the task server’s READY list and is picked up by a producer.]

Decompose
execute( ) {
    if ( n < 2 )
        setArg( ArgAddr, n );
    else {
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

A Cluster at Work
[Animation: executing f(4) and then f(3) spawns + tasks, which sit on the server’s WAITING list, and subtasks f(3), f(2), f(1), which producers take from the READY list.]

Compute Base Case
execute( ) {
    if ( n < 2 )
        setArg( ArgAddr, n );    // base case
    else {
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

A Cluster at Work
[Animation: base-case tasks f(1) and f(0) execute on producers; their results are forwarded to the waiting + tasks.]

Compose
execute( ) {
    setArg( ArgAddr, in[0] + in[1] );
}

A Cluster at Work
[Animation: as inputs arrive, + tasks move from WAITING to READY, execute, and feed their successors, until the final result R is produced.]
The Result object is sent to the Production Network, which returns it to the Consumer.

Task Server Proxy: Overlap Communication with Computation
[Figure: each producer runs a task server proxy with a priority queue, computation (COMP) and communication (COMM) components, and an inbox/outbox to its task server’s READY/WAITING lists.]

Architecture
Work stealing & eager scheduling:
- A task is removed from the server only after a completion signal is received.
- A task may be assigned to multiple producers:
  - balances the task load among producers of varying processor speeds
  - tasks on failed/retreating producers are re-assigned
A sketch of this policy appears below.
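
A minimal sketch of that policy, assuming a single ready queue and a map of outstanding tasks; the real server also manages waiting tasks and proxies:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical eager-scheduling core of a task server.
class TaskServer {
    static class Task { final String id; Task(String id) { this.id = id; } }

    private final Deque<Task> ready = new ArrayDeque<>();
    private final Map<String, Task> outstanding = new LinkedHashMap<>();

    synchronized void add(Task t) { ready.add(t); }

    // A producer asks for work. If the ready queue is empty, re-issue an
    // already-assigned task; duplicates are detected via TaskIDs, so an idle
    // fast producer can finish work held by a slow or failed one.
    synchronized Task steal() {
        Task t = ready.poll();
        if (t == null && !outstanding.isEmpty())
            t = outstanding.values().iterator().next();
        if (t != null) outstanding.put(t.id, t);
        return t;
    }

    // A task is removed only after its completion signal arrives.
    synchronized void complete(String taskId) { outstanding.remove(taskId); }
}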

Architecture: Scalability
A cluster tolerates producer retreat and failure. One task server, however, is both a bottleneck and a single point of failure. The remedy: use a network of task servers/clusters.

Scalability: Class Loading
1. The CX class loader loads classes (the Consumer JAR) into each server’s class cache.
2. Each producer loads classes from its server.
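
A plausible producer-side class loader for step 2, sketched under the assumption of a simple fetch-bytes call to the server; the connection type and method name are illustrative:

// Loads classes from the producer’s task server, which caches the Consumer JAR.
class ServerClassLoader extends ClassLoader {
    private final TaskServerConnection server;  // assumed RPC stub

    ServerClassLoader(TaskServerConnection server) { this.server = server; }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = server.fetchClassBytes(name);  // hits the server’s class cache
        if (bytes == null) throw new ClassNotFoundException(name);
        return defineClass(name, bytes, 0, bytes.length);
    }
}

interface TaskServerConnection {
    byte[] fetchClassBytes(String className);
}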

Scalability: Fault Tolerance
- Replicate a server’s tasks on its sibling.
- When a server fails, its sibling restores the state to a replacement server.

Architecture
Production network of clusters:
- The network tolerates a single server failure.
- It then restores its ability to tolerate a single server failure.
- It thus tolerates a sequence of single server failures.

Outline: Introduction · Related work · API · Architecture · Experimental results · Current & future work

Preliminary experiments
Experiments were run on a Linux cluster:
- 100-port Lucent P550 Cajun Gigabit Switch
- 2 Intel EtherExpress Pro 100 Mb/s Ethernet cards per machine
- Red Hat Linux 6.0, JDK 1.2.2_RC3
- Heterogeneous processor speeds and processors/machine

Fibonacci Tasks with Synthetic Load
execute( ) {                        // task f(n)
    if ( n < 2 ) {
        syntheticWorkload();
        setArg( ArgAddr, n );
    } else {
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

execute( ) {                        // task +
    syntheticWorkload();
    setArg( ArgAddr, in[0] + in[1] );
}
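
The slides do not show syntheticWorkload(); a plausible stand-in is a timed busy loop that gives each task a controlled compute cost:

// Burn CPU for roughly the given number of seconds.
static void syntheticWorkload(double seconds) {
    long deadline = System.nanoTime() + (long) (seconds * 1e9);
    long sink = 0;
    while (System.nanoTime() < deadline) sink++;  // value intentionally unused
}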

TSEQ vs. T1 (seconds), computing F(8):

Workload    TSEQ      T1        Efficiency
4.522       497.420   518.816   0.96
3.740       415.140   436.897   0.95
2.504       280.448   297.474   0.94
1.576       179.664   199.423   0.90
0.914       106.024   120.807   0.88
0.468        56.160    65.767   0.85
0.198        24.750    29.553   0.84
0.058         8.120    11.386   0.71

Here Efficiency = TSEQ / T1; e.g., 497.420 / 518.816 ≈ 0.96.

Average task time:
- Workload 1 = 1.8 sec
- Workload 2 = 3.7 sec
Parallel efficiency for F(13) = 0.77
Parallel efficiency for F(18) = 0.99

Outline: Introduction · Related work · API · Architecture · Experimental results · Current & future work

Current work
- Implement a CX market maker (broker), a Jini service that solves the discovery problem between Consumers & Production Networks.
- Enhance the Producer with Lea’s Fork/Join Framework (see gee.cs.oswego.edu).
[Figure: Consumers and Production Networks connected through the Market Maker.]

Current work
Enhance the computational model with branch & bound: propagate new bounds through the production network in 3 steps.
[Animation: a branch of the search tree yields a new bound, which propagates through the production network and terminates work on pruned subtrees.]

Current work
Investigate computations that appear ill-suited to adaptive parallelism:
- SOR
- N-body

Thanks!

End of CX Presentation
www.cs.ucsb.edu/research/cx
Next release: end of June, includes source.
E-mail: cappello@cs.ucsb.edu

Introduction
Fundamental Issues:
- Communication latency: long latency → overlap computation with communication.
- Robustness: massive parallelism → faults.
- Scalability: massive parallelism → login privileges cannot be required.
- Ease of use: Jini → easy upgrade of system components.

Related work
Market mechanisms: Huberman; Waldspurger; Malone; Miller & Drexler; Newhouse & Darlington.

Related work
CX integrates:
- a DAG computational model
- a work-stealing scheduler
- space-based, decoupled communication
- fault tolerance via eager scheduling
- market mechanisms (an incentive to participate)

Architecture
Task identifier:
- The DAG has a spawn tree; a TaskID is a path id in that tree.
- Root.TaskID = 0.
- TaskIDs are used to detect duplicate tasks and duplicate results.
[Figure: spawn tree of F(4) with child edges labeled 1 and 2.]
A sketch of such path ids appears below.
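
A sketch of path-style task identifiers, assuming (as the figure suggests) that the k-th child spawned by a task appends k to its parent’s path; hypothetical, but consistent with Root.TaskID = 0:

// Path ids: root is "0"; its children are "0.1", "0.2"; their children "0.1.1", …
final class TaskId {
    private final String path;
    private int children = 0;

    private TaskId(String path) { this.path = path; }

    static TaskId root() { return new TaskId("0"); }

    TaskId nextChild() { return new TaskId(path + "." + (++children)); }

    // Equal ids identify duplicate tasks (and duplicate results).
    @Override public boolean equals(Object o) {
        return o instanceof TaskId && ((TaskId) o).path.equals(path);
    }
    @Override public int hashCode() { return path.hashCode(); }
    @Override public String toString() { return path; }
}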

Architecture: Basic Entities
- Consumer: seeks computing resources.
- Producer: offers computing resources.
- Task Server: coordinates task distribution among its producers.
- Production Network: a network of task servers & their associated producers.

Defining Parallel Efficiency
Scalar (homogeneous set of P machines):
    parallel efficiency = (T1 / P) / TP
Vector (heterogeneous set of P machines):
    P = [ P1, P2, …, Pd ], where there are P1 machines of type 1, P2 machines of type 2, …, Pd machines of type d:
    parallel efficiency = ( P1/T1 + P2/T2 + … + Pd/Td )^(-1) / TP
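
As a numeric check, a small helper that evaluates both formulas; Ti is the time for one type-i machine to run the job alone, and TP is the measured parallel time:

// Evaluate the scalar and vector parallel-efficiency formulas.
final class Efficiency {
    // Homogeneous: (T1 / P) / TP
    static double scalar(double t1, int p, double tp) {
        return (t1 / p) / tp;
    }

    // Heterogeneous: (P1/T1 + … + Pd/Td)^(-1) / TP, since 1/Ti is the
    // work rate of one type-i machine and the sum is the ideal total rate.
    static double vector(int[] p, double[] t, double tp) {
        double rate = 0.0;
        for (int i = 0; i < p.length; i++) rate += p[i] / t[i];
        return (1.0 / rate) / tp;
    }

    public static void main(String[] args) {
        // With one machine both formulas reduce to sequential/parallel time,
        // matching the table’s Efficiency column: 497.420 / 518.816 ≈ 0.96.
        System.out.println(scalar(497.420, 1, 518.816));
    }
}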

Future work
Support special hardware / data via inter-server task movement:
- Diffusion model: tasks are homogeneous gas atoms diffusing through the network.
- N-body model: each kind of atom (task) has its own:
  - mass (resistance to movement: code size, input size, …)
  - attraction/repulsion to different servers, or to other “massive” entities such as special processors or a large database.
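
A toy illustration of the diffusion idea, under the assumption that each server periodically pushes a fraction of its surplus load toward less-loaded neighbors; entirely hypothetical, since this is proposed future work:

import java.util.List;

// Tasks diffuse down the load gradient, like gas atoms down a concentration gradient.
class DiffusingServer {
    int load;                             // number of queued tasks
    List<DiffusingServer> neighbors;

    void diffuseStep() {
        for (DiffusingServer n : neighbors) {
            int surplus = load - n.load;
            int moved = surplus / 4;      // move a fraction of the surplus
            if (moved > 0) { load -= moved; n.load += moved; }
        }
    }
}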

Future Work
A CX preprocessor to simplify the API.