
1 CX: A Scalable, Robust Network for Parallel Computing Peter Cappello & Dimitrios Mourloukos Computer Science UCSB

2 Outline: 1. Introduction 2. Related work 3. API 4. Architecture 5. Experimental results 6. Current & future work

3–8 Introduction (progressive build): "Listen to the technology!" (Carver Mead). What is the technology telling us? – The Internet's idle cycles/sec are growing rapidly – Bandwidth is increasing & getting cheaper – Communication latency is not decreasing – Human technology is getting neither cheaper nor faster.

9–11 Introduction: Project Goals (progressive build): 1. Minimize job completion time despite large communication latency. 2. Jobs complete with high probability despite faulty components. 3. The application program is oblivious to: the number of processors, inter-process communication, and fault tolerance.

12–13 Introduction: Fundamental Issue: Heterogeneity. Diagram: machines M1…M5 each run a different OS (OS1…OS5); the solution is a functionally homogeneous JVM.

14 Outline: 1. Introduction 2. Related work 3. API 4. Architecture 5. Experimental results 6. Current & future work

15 Related work: Cilk → Cilk-NOW → Atlas – DAG computational model – Work stealing

16 Related work: Linda → Piranha → JavaSpaces – Space-based coordination – Decoupled communication

17 Related work: Charlotte (Milan project / Calypso prototype) – High performance – Fault tolerance not achieved via transactions – Fault tolerance via eager scheduling

18 Related work: SuperWeb → Javelin → Javelin++ – Architecture: client, broker, host

19 Outline: 1. Introduction 2. Related work 3. API 4. Architecture 5. Experimental results 6. Current & future work

20 API: DAG Computational Model. int f( int n ) { if ( n < 2 ) return n; else return f( n-1 ) + f( n-2 ); }

21–24 DAG Computational Model (animation): the method invocation tree for f(4) unfolds: f(4) calls f(3) and f(2); f(3) calls f(2) and f(1); f(2) calls f(1) and f(0).

25 DAG Computational Model / API. Task f(n): execute( ) { if ( n < 2 ) setArg( ArgAddr, n ); else { spawn( + ); spawn( f(n-1) ); spawn( f(n-2) ); } } Task +: execute( ) { setArg( ArgAddr, in[0] + in[1] ); } (Diagram: f(n) decomposes into f(n-1), f(n-2), and a + node.)

26–28 DAG Computational Model / API (animation): the same task code shown while the DAG for f(4) unfolds down to the base cases f(1) and f(0), with a + composition node joining each pair of children.
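To make the API slides concrete, here is a minimal Java rendering of the two task types for the Fibonacci DAG. It is a sketch against an assumed CX-style base class: Task, ArgAddr, spawn, and setArg follow the names on the slides, but every signature below is an assumption, not the actual CX API.

// Minimal sketch of the slide-25 tasks against an assumed CX-style API.
abstract class Task {
    protected int[] in = new int[2];                 // input slots filled by child tasks
    abstract void execute();
    protected void spawn(Task child) { /* hand the child to the task server (stub) */ }
    protected void setArg(ArgAddr where, int value) { /* deliver a value to a waiting task's slot (stub) */ }
}

class ArgAddr { }                                    // identifies a waiting task and one of its input slots

class Fib extends Task {                             // f(n): decompose, or return n when n < 2
    private final int n;
    private final ArgAddr argAddr;                   // where this task's value must go
    Fib(int n, ArgAddr argAddr) { this.n = n; this.argAddr = argAddr; }
    void execute() {
        if (n < 2) {
            setArg(argAddr, n);                      // base case
        } else {
            Sum sum = new Sum(argAddr);              // "+" task waits for two inputs
            spawn(sum);
            spawn(new Fib(n - 1, sum.slot(0)));      // children feed the "+" task
            spawn(new Fib(n - 2, sum.slot(1)));
        }
    }
}

class Sum extends Task {                             // "+": compose the two child results
    private final ArgAddr argAddr;
    Sum(ArgAddr argAddr) { this.argAddr = argAddr; }
    ArgAddr slot(int i) { return new ArgAddr(); }    // placeholder: address of input slot i
    void execute() { setArg(argAddr, in[0] + in[1]); }
}

In the real API the framework would route each setArg to the correct input slot of the waiting task; the ArgAddr stub above only marks where that plumbing belongs.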

29 Outline: 1. Introduction 2. Related work 3. API 4. Architecture 5. Experimental results 6. Current & future work

30 Architecture: Basic Entities (diagram): CONSUMER ↔ PRODUCTION NETWORK (a cluster network). The consumer session follows the pattern register ( spawn | getResult )* unregister.
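A minimal Java sketch of that consumer session protocol; the interface and marker types below are illustrative assumptions, not the actual CX classes.

// Sketch of the consumer session: register ( spawn | getResult )* unregister.
interface Task { }                    // marker: the root of a computation DAG
interface Result { }                  // marker: a result object returned to the consumer

interface ProductionNetworkSession {
    void register();                  // open a session with the production network
    void spawn(Task root);            // submit a computation
    Result getResult();               // block until a result is available
    void unregister();                // close the session
}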

31 Architecture: Cluster (diagram): one TASK SERVER serving its PRODUCERs.

32–35 A Cluster at Work (animation): the DAG for f(4) is shown beside the Task Server's WAITING and READY queues and a Producer; the spawned task f(4) enters the READY queue and is fetched by a Producer.

36 Decompose. execute( ) { if ( n < 2 ) setArg( ArgAddr, n ); else { spawn( + ); spawn( f(n-1) ); spawn( f(n-2) ); } }

37–44 A Cluster at Work (animation): the Producer decomposes f(4): a + task enters the Task Server's WAITING queue while f(3) and f(2) enter the READY queue; Producers fetch f(3) and f(2) and decompose them in turn, adding more + tasks to WAITING and the tasks f(1) and f(0) to READY.

45 Compute Base Case. execute( ) { if ( n < 2 ) setArg( ArgAddr, n ); else { spawn( + ); spawn( f(n-1) ); spawn( f(n-2) ); } }

46–54 A Cluster at Work (animation): Producers compute the base cases f(1) and f(0) and feed each result to its waiting + task via setArg; once a + task has received both of its arguments it moves from the WAITING queue to the READY queue.

55 Compose. execute( ) { setArg( ArgAddr, in[0] + in[1] ); }

56–76 A Cluster at Work (animation): ready + tasks are fetched and computed; each result is fed to the + task waiting above it in the DAG, which then becomes ready, until the root + task produces the final result R. Slide 76: 1. The Result object is sent to the Production Network. 2. The Production Network returns it to the Consumer.

77 Task Server Proxy: Overlap Communication with Computation (diagram): the Producer holds a Task Server Proxy with an OUTBOX, an INBOX, and a PRIORITY Q; COMM (communication with the Task Server's WAITING/READY queues) proceeds concurrently with COMP (computation).
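A sketch of how such a proxy might overlap communication with computation: a COMP thread drains a local inbox of prefetched ready tasks while a COMM thread ships finished work from the outbox and refills the inbox. Only the OUTBOX/INBOX/COMM/COMP structure comes from the slide; the class, queue types, and stub methods below are assumptions.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: a producer-side task server proxy that overlaps communication (COMM)
// with computation (COMP). Names and queue types are illustrative assumptions.
class TaskServerProxy {
    private final BlockingQueue<Runnable> inbox  = new LinkedBlockingQueue<>(); // prefetched ready tasks
    private final BlockingQueue<Object>   outbox = new LinkedBlockingQueue<>(); // finished tasks / spawned work

    void start() {
        new Thread(this::communicate, "COMM").start();   // talks to the task server
        new Thread(this::compute, "COMP").start();       // runs tasks locally
    }

    private void communicate() {
        try {
            while (true) {
                inbox.put(fetchReadyTask());   // prefetch so COMP rarely waits on the network
                sendToServer(outbox.take());   // ship the previous result / spawned work back
            }
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    private void compute() {
        try {
            while (true) {
                Runnable task = inbox.take();  // a task is usually already here: latency is hidden
                task.run();
                outbox.put(task);              // report completion (triggers the next round trip)
            }
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    private void sendToServer(Object work) { /* network call to the task server (stub) */ }
    private Runnable fetchReadyTask()      { return () -> { /* next ready task from the server (stub) */ }; }
}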

78 Architecture: Work stealing & eager scheduling. A task is removed from the server only after a completion signal is received. A task may be assigned to multiple producers: – balances the task load among producers of varying processor speeds – tasks on failed/retreating producers are re-assigned.
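A sketch of the eager-scheduling bookkeeping this implies: a task id stays registered until its completion signal arrives, an idle producer may be handed an already-assigned task, and tasks of lost producers become assignable again. Class and method names are assumptions, not the CX implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of eager scheduling: a task is dropped only on its completion signal,
// and an uncompleted task may be handed to several producers.
class EagerScheduler {
    private final Deque<String> ready = new ArrayDeque<>();           // ids of tasks not yet assigned
    private final Map<String, Integer> outstanding = new HashMap<>(); // id -> number of producers holding it

    synchronized void addReady(String taskId) { ready.addLast(taskId); }

    // Hand a task to a producer. If nothing new is ready, re-issue an outstanding task:
    // a fast producer then duplicates the work of a slow (or failed) one.
    synchronized String assign() {
        String taskId = ready.pollFirst();
        if (taskId == null && !outstanding.isEmpty()) {
            taskId = outstanding.keySet().iterator().next();
        }
        if (taskId != null) outstanding.merge(taskId, 1, Integer::sum);
        return taskId;   // null means nothing left to do
    }

    // Only a completion signal removes the task; later duplicate completions are no-ops.
    synchronized void complete(String taskId) { outstanding.remove(taskId); }

    // A producer retreated or failed: its still-uncompleted tasks become assignable again.
    synchronized void producerLost(Iterable<String> itsTaskIds) {
        for (String id : itsTaskIds) {
            if (outstanding.containsKey(id)) ready.addLast(id);
        }
    }
}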

79 Architecture: Scalability. A cluster tolerates producer retreat and failure. A single task server, however, is a bottleneck and a single point of failure, so we introduce a network of task servers.

80 Scalability: Class loading. 1. The CX class loader loads classes (from the Consumer's JAR) into each server's class cache. 2. A Producer loads classes from its server.
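A minimal sketch of step 2 as a producer-side class loader that pulls class bytes from its server's class cache; ProducerClassLoader, ServerClassCache, and fetchClassBytes are assumed names used only for illustration.

// Sketch of producer-side class loading: classes cached at the task server
// are fetched and defined on demand in the producer's JVM.
class ProducerClassLoader extends ClassLoader {

    interface ServerClassCache { byte[] fetchClassBytes(String className); }

    private final ServerClassCache server;

    ProducerClassLoader(ServerClassCache server, ClassLoader parent) {
        super(parent);
        this.server = server;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = server.fetchClassBytes(name);       // ask the server's class cache
        if (bytes == null) throw new ClassNotFoundException(name);
        return defineClass(name, bytes, 0, bytes.length);  // define it in the producer's JVM
    }
}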

81–83 Scalability: Fault tolerance (animation): replicate a server's tasks on its sibling. When a server fails, its sibling restores the state to a replacement server.
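A sketch of one way such sibling replication could be wired: each state update a server applies is mirrored on its sibling, and on failure the sibling pushes the replica to the replacement. The class below is an assumption for illustration, not the CX code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of sibling replication: every state update is mirrored on the sibling,
// which can restore the replica to a replacement server after a failure.
class ReplicatedTaskServer {
    private final Map<String, Object> ownState       = new ConcurrentHashMap<>(); // this server's tasks
    private final Map<String, Object> siblingReplica = new ConcurrentHashMap<>(); // copy of the sibling's tasks
    private volatile ReplicatedTaskServer sibling;

    void setSibling(ReplicatedTaskServer sibling) { this.sibling = sibling; }

    // Apply an update locally and mirror it on the sibling.
    void update(String taskId, Object state) {
        ownState.put(taskId, state);
        ReplicatedTaskServer s = sibling;
        if (s != null) s.replicate(taskId, state);
    }

    void replicate(String taskId, Object state) { siblingReplica.put(taskId, state); }

    // Our sibling failed: push its replicated state to the replacement server.
    void onSiblingFailure(ReplicatedTaskServer replacement) {
        siblingReplica.forEach(replacement::restore);
        sibling = replacement;
    }

    void restore(String taskId, Object state) { ownState.put(taskId, state); }
}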

84 Architecture: Production network of clusters. The network tolerates a single server failure, then restores its ability to tolerate a single failure; this yields tolerance of a sequence of failures.

85 Outline: 1. Introduction 2. Related work 3. API 4. Architecture 5. Experimental results 6. Current & future work

86 Preliminary experiments. Experiments run on a Linux cluster: – 100-port Lucent P550 Cajun Gigabit Switch – Per machine: 2 Intel EtherExpress Pro 100 Mb/s Ethernet cards – Red Hat Linux 6.0 – JDK 1.2.2_RC3 – Heterogeneous processor speeds – processors/machine

87 Fibonacci Tasks with Synthetic Load (same DAG as slide 25). f(n) task: execute( ) { if ( n < 2 ) { synthetic workload(); setArg( ArgAddr, n ); } else { synthetic workload(); spawn( + ); spawn( f(n-1) ); spawn( f(n-2) ); } } + task: execute( ) { synthetic workload(); setArg( ArgAddr, in[0] + in[1] ); }

88 T_SEQ vs. T_1 (seconds), computing F(8):
Workload   T_SEQ     T_1       Efficiency
4.522      497.420   518.816   0.96
3.740      415.140   436.897   0.95
2.504      280.448   297.474   0.94
1.576      179.664   199.423   0.90
0.914      106.024   120.807   0.88
0.468       56.160    65.767   0.85
0.198       24.750    29.553   0.84
0.058        8.120    11.386   0.71

89 Parallel efficiency for F(13) = 0.87; parallel efficiency for F(18) = 0.99. Average task time: Workload 1 = 1.8 sec, Workload 2 = 3.7 sec.

90 Outline: 1. Introduction 2. Related work 3. API 4. Architecture 5. Experimental results 6. Current & future work

91 Current work. Implement a CX market maker (broker): it solves the discovery problem between Consumers & Production Networks and is realized as a Jini service (diagram: Consumers and Production Networks connected through the MARKET MAKER). Enhance the Producer with Lea's Fork/Join Framework (see gee.cs.oswego.edu).

92–93 Current work. Enhance the computational model: branch & bound. Propagate new bounds through the production network in 3 steps (diagram: search tree with BRANCH and TERMINATE! steps).

94 Current work. Investigate computations that appear ill-suited to adaptive parallelism: – SOR – N-body.

95 End of CX Presentation. www.cs.ucsb.edu/research/cx Next release: end of June, includes source. E-mail: cappello@cs.ucsb.edu

96 Introduction: Fundamental Issues. Communication latency: long latency ⇒ overlap computation with communication. Robustness: massive parallelism ⇒ faults. Scalability: massive parallelism ⇒ login privileges cannot be required. Ease of use: Jini ⇒ easy upgrade of system components.

97 Related work: Market mechanisms – Huberman, Waldspurger, Malone, Miller & Drexler, Newhouse & Darlington.

98 Related work. CX integrates: – DAG computational model – Work-stealing scheduler – Space-based, decoupled communication – Fault tolerance via eager scheduling – Market mechanisms (incentive to participate).

99 Architecture: Task identifier. The DAG has a spawn tree; TaskID = path id; Root.TaskID = 0. The TaskID is used to detect duplicate tasks and results. (Diagram: the F(4) spawn tree with each node labeled by its path from the root.)
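A sketch of a path-based task id of this kind: the root is 0, each spawned child appends its child index, and equal ids flag duplicate tasks or results. TaskId and ResultFilter are illustrative names, not the CX classes.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a path-based task id: Root.TaskID = 0, and each spawned child extends
// its parent's path with the child's index. Equal ids identify duplicates.
final class TaskId {
    private final List<Integer> path;

    static final TaskId ROOT = new TaskId(List.of(0));

    private TaskId(List<Integer> path) { this.path = List.copyOf(path); }

    TaskId child(int childIndex) {              // id of this task's childIndex-th spawn
        List<Integer> p = new ArrayList<>(path);
        p.add(childIndex);
        return new TaskId(p);
    }

    @Override public boolean equals(Object o) {
        return o instanceof TaskId && path.equals(((TaskId) o).path);
    }
    @Override public int hashCode() { return path.hashCode(); }
    @Override public String toString() { return path.toString(); }
}

// Server-side duplicate detection: a task or result whose id was already seen is dropped.
class ResultFilter {
    private final Set<TaskId> seen = new HashSet<>();
    boolean isDuplicate(TaskId id) { return !seen.add(id); }
}

For example, TaskId.ROOT.child(1).child(0) names the first spawn of the root's second spawn; presenting the same id twice makes ResultFilter.isDuplicate return true.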

100 Architecture: Basic Entities. Consumer: seeks computing resources. Producer: offers computing resources. Task Server: coordinates task distribution among its producers. Production Network: a network of task servers & their associated producers.

101 Defining Parallel Efficiency. Scalar: for a homogeneous set of P machines, parallel efficiency = (T_1 / P) / T_P. Vector: for a heterogeneous set of machines P = [ P_1, P_2, …, P_d ], where there are P_1 machines of type 1, P_2 machines of type 2, …, P_d machines of type d, parallel efficiency = (P_1/T_1 + P_2/T_2 + … + P_d/T_d)^(−1) / T_P.
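The same two definitions restated in LaTeX for readability (no new quantities introduced):

% Scalar case: P identical machines
\text{parallel efficiency} = \frac{T_1 / P}{T_P}

% Vector case: P = [P_1, \dots, P_d], with P_i machines of type i and single-machine time T_i
\text{parallel efficiency} = \frac{\left( P_1/T_1 + P_2/T_2 + \cdots + P_d/T_d \right)^{-1}}{T_P}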

102 Future work. Support special hardware / data: inter-server task movement. – Diffusion model: tasks are homogeneous gas atoms diffusing through the network. – N-body model: each kind of atom (task) has its own mass (resistance to movement: code size, input size, …) and its own attraction/repulsion to different servers or to other "massive" entities, such as special processors or a large database.

103 Future Work. CX preprocessor to simplify the API.

