Architectural Interactions in High Performance Clusters

1 Architectural Interactions in High Performance Clusters
RTPP 98
David E. Culler, Computer Science Division, University of California, Berkeley

2 Run-Time Framework
[Diagram: on each of many nodes, a Parallel Program sits on a RunTime layer, which sits on the Machine Architecture; the nodes are connected by a Network.]

3 Two Example RunTime Layers
Split-C: a thin global address space abstraction over Active Messages (get, put, read, write)
MPI: a thicker message passing abstraction over Active Messages (send, receive)

4 Split-C over Active Messages
Read, Write, Get, Put built on small Active Message request/reply (RPC)
Bulk transfer (store & get)
[Diagram: paired request handler and reply handler on the two communicating nodes.]
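The request/reply pattern above can be sketched in plain C. This is a minimal single-address-space model of an Active Message layer, loosely in the spirit of the GAM interface used later in the talk; all names (`am_send`, `am_poll`, `split_c_read`) are illustrative, not the real API. A request carries its handler; the request handler issues the reply, whose handler deposits the result at the requester.

```c
#include <assert.h>

/* Illustrative Active Message sketch: a message names the handler that
 * will run on delivery. In a real layer the queue is the network. */
typedef void (*am_handler_t)(int src, long a0, long a1);
typedef struct { am_handler_t h; int src; long a0, a1; } am_msg_t;

#define QLEN 64
static am_msg_t queue[QLEN];
static int q_head, q_tail;

/* "Inject" a message (stands in for crossing the network). */
static void am_send(am_handler_t h, int src, long a0, long a1) {
    queue[q_tail] = (am_msg_t){ h, src, a0, a1 };
    q_tail = (q_tail + 1) % QLEN;
}

/* Drain pending messages, running each handler; request handlers
 * may enqueue replies, which are drained in the same pass. */
static void am_poll(void) {
    while (q_head != q_tail) {
        am_msg_t m = queue[q_head];
        q_head = (q_head + 1) % QLEN;
        m.h(m.src, m.a0, m.a1);
    }
}

/* A Split-C style "read" of remote memory as request/reply RPC. */
static long memory[8] = { 0, 11, 22, 33 };   /* "remote" data */
static long read_result;
static int  read_done;

static void reply_handler(int src, long value, long unused) {
    (void)src; (void)unused;
    read_result = value;          /* deposit value at requester */
    read_done = 1;
}

static void request_handler(int src, long index, long unused) {
    (void)unused;
    am_send(reply_handler, src, memory[index], 0);  /* issue reply */
}

long split_c_read(long index) {
    read_done = 0;
    am_send(request_handler, /*src=*/0, index, 0);
    while (!read_done) am_poll();   /* spin until the reply lands */
    return read_result;
}
```

The key property the slide relies on: the requester makes progress only by polling, so handler execution is interleaved with computation rather than interrupt-driven.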

5 Model Framework: LogP
L (latency): time to send a (small) message between modules
o (overhead): time the processor spends sending or receiving a message
g (gap): minimum interval between successive sends or receives (1/rate)
P: number of processor/memory modules
Limited-volume interconnection network: at most L/g messages in flight per processor
Round-trip time: 2 x (2o + L)
[Diagram: P processor/memory modules attached to the network, with o at each endpoint, g between injections, and L across the network.]

6 LogP Summary of Current Machines
[Chart: measured LogP parameters and peak bandwidth (max MB/s) for current machines.]

7 Methodology
Apparatus: 35 Ultra 170s (64 MB, 0.5 MB L2, Solaris 2.5); M2F Lanai + Myricom network in a fat-tree variant; GAM + Split-C
Modify the Active Message layer to inflate L, o, g, or G independently
Execute a diverse suite of applications and observe the effect
Evaluate against natural performance models

8 Adjusting L, o, and g (and G) in situ
o: stall the Ultra on message write and on message read
L: defer marking a message as valid until Rx time + L
g: delay the Lanai after message injection (after each fragment for bulk transfers)
[Diagram: two host workstations running the AM library, each attached through a Lanai interface to the Myrinet.]

9 Calibration

10 Application Characteristics
Message frequency
Write-based vs. read-based
Short vs. bulk messages
Synchronization
Communication balance

11 Applications used in the Study

12 Baseline Communication

13 Application Sensitivity to Communication Performance

14 Sensitivity to Overhead

15 Sensitivity to gap (1/msg rate)

16 Sensitivity to Latency

17 Sensitivity to bulk BW (1/G)

18 Modeling Effects of Overhead
Tpred = Torig + 2 x max #msgs x o request / response proc with most msgs limits overall time Why does this model under-predict?
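The slide's prediction is a one-liner; a sketch, with the function name and the interpretation of `o` as the overhead added to the baseline run chosen for illustration:

```c
#include <assert.h>

/* Slide 18 model: each message on the critical path pays the added
 * overhead o twice (request and response), and the processor sending
 * the most messages (max_msgs) sets the overall run time. */
double predict_with_overhead(double t_orig, long max_msgs, double o) {
    return t_orig + 2.0 * (double)max_msgs * o;
}
```

So a 10-second baseline run whose busiest processor sends 1000 messages is predicted to take 1010 seconds if 0.5 s of overhead is added per message event; the slide asks why the real slowdown exceeds even this.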

19 Modeling Effects of gap
Uniform communication model:
Tpred = Torig, if g < I (I = average message interval)
Tpred = Torig + m (g - I), otherwise
Bursty communication:
Tpred = Torig + m g
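Both gap models above can be written down directly. A sketch, with `m` the message count and `interval` the application's average message interval `I` (function names are illustrative):

```c
#include <assert.h>

/* Uniform communication: the inflated gap g only costs anything once
 * it exceeds the application's natural message interval I; beyond
 * that, each of the m messages is delayed by (g - I). */
double predict_uniform(double t_orig, long m, double g, double interval) {
    if (g < interval) return t_orig;
    return t_orig + (double)m * (g - interval);
}

/* Bursty communication: messages arrive back-to-back, so every one
 * of the m messages pays the full gap g. */
double predict_bursty(double t_orig, long m, double g) {
    return t_orig + (double)m * g;
}
```

The contrast explains the measured sensitivities: a uniformly communicating program tolerates gap inflation up to its own message interval, while a bursty one slows immediately.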

20 Extrapolating to Low Overhead

21 MPI over AM: ping-pong bandwidth

22 MPI over AM: start-up

23 NPB2 Speedup: NOW vs SP2

24 NOW vs. Origin

25 Single Processor Performance

26 Understanding Speedup
SpeedUp(p) = T / MAXp (Tcompute + Tcomm + Twait)
Tcompute = (work/p + extra) x efficiency
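The decomposition above can be sketched as follows, taking per-processor time arrays and interpreting "efficiency" as a per-unit-of-work cost factor; all names and that interpretation are assumptions for illustration:

```c
#include <assert.h>

/* Slide 26 model: speedup is the base time t divided by the total
 * time on the slowest processor (compute + communicate + wait). */
double speedup(double t, const double *t_compute, const double *t_comm,
               const double *t_wait, int p) {
    double max_total = 0.0;
    for (int i = 0; i < p; i++) {
        double total = t_compute[i] + t_comm[i] + t_wait[i];
        if (total > max_total) max_total = total;
    }
    return t / max_total;
}

/* Compute-time model: work is divided p ways, plus any extra work
 * introduced by parallelization, scaled by a per-unit cost factor. */
double t_compute_model(double work, int p, double extra, double cost) {
    return (work / p + extra) * cost;
}
```

Note how the max over processors makes imbalance visible: one slow processor drags the whole speedup down even when the average per-processor time looks ideal, which is the point of the "much can lurk beneath a perfect speedup curve" conclusion.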

27 Performance Tools for Clusters
Independent data collection on every node: timing, sampling, tracing
Little perturbation of global effects

28 Where the Time Goes: LU-a

29 Where the Time Goes: BT-a

30 Constant Problem Size Scaling

31 Communication Scaling
[Plots: normalized messages per processor; average message size]

32 Communication Scaling: Volume

33 Extra Work

34 Cache Working Sets: LU
8-fold reduction in miss rate from 4 to 8 processors

35 Cache Working Sets: BT

36 Cycles per Instruction

37 MPI Internal Protocol
[Diagram: sender/receiver message timeline]

38 Revised Protocol
[Diagram: sender/receiver message timeline]

39 Sensitivity to Overhead

40 Conclusions
Run-time systems for parallel programs must deal with a host of architectural interactions: communication, computation, and the memory system.
Build a performance model of your RTPP; it is the only way to recognize anomalies.
Build tools along with the run-time to reflect its characteristics and sensitivity back to the parallel program.
Much can lurk beneath a perfect speedup curve.

