1
Architectural Interactions in High Performance Clusters
RTPP 98 David E. Culler Computer Science Division University of California, Berkeley
2
Run-Time Framework
(Diagram: parallel programs layered over per-node run-time systems, which sit on the machine architectures; the nodes are connected by a network.)
3
Two Example RunTime Layers
Split-C: a thin global-address-space abstraction over Active Messages (get, put, read, write)
MPI: a thicker message-passing abstraction over Active Messages (send, receive)
4
Split-C over Active Messages
Read, Write, Get, and Put are built on small Active Message request/reply pairs (RPC); bulk transfer (store & get) likewise uses a request handler and a reply handler.
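As a sketch of how a global read reduces to an Active Message request/reply pair, here is a toy Python model; the class, handler names, and direct handler calls are invented for illustration and elide the real asynchrony:

```python
# Toy model of Split-C "read" as an Active Message request/reply (RPC).
# Real AM handlers run asynchronously; this sketch calls them directly.
class Node:
    def __init__(self, memory):
        self.memory = memory       # simulated local address space
        self.pending = {}          # replies delivered by tag

    def read_request_handler(self, requester, addr, tag):
        # Runs on the remote node: fetch the word, fire the reply.
        requester.read_reply_handler(tag, self.memory[addr])

    def read_reply_handler(self, tag, value):
        # Runs back on the requester: deposit the value.
        self.pending[tag] = value

    def read(self, remote, addr, tag=0):
        remote.read_request_handler(self, addr, tag)
        return self.pending.pop(tag)

a, b = Node({}), Node({0x10: 42})
print(a.read(b, 0x10))  # 42
```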
5
Model Framework: LogP
P processor/memory modules connected by an interconnection network of limited volume (at most L/g messages in flight per processor).
L (latency): time to send a small message between modules
o (overhead): time the processor is busy on sending or receiving a message
g (gap): minimum time between successive sends or receives (1/rate)
P: number of processors
Round-trip time: 2 × (2o + L)
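The LogP costs above can be turned into a few lines of arithmetic; the parameter values below are hypothetical, not measurements from any machine in the study:

```python
# LogP small-message costs (parameter values are hypothetical, in microseconds).
def one_way_time(L, o):
    # Send overhead + network latency + receive overhead.
    return o + L + o

def round_trip_time(L, o):
    # Request/reply RPC = two one-way messages: 2 * (2o + L).
    return 2 * one_way_time(L, o)

def max_outstanding(L, g):
    # Limited network volume: at most L/g messages in flight per processor.
    return L / g

L_us, o_us, g_us = 5.0, 3.0, 4.0
print(round_trip_time(L_us, o_us))   # 22.0
print(max_outstanding(L_us, g_us))   # 1.25
```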
6
LogP Summary of Current Machines
(Table: LogP parameters and max MB/s for each machine.)
7
Methodology Apparatus:
35 Ultra 170s (64 MB memory, 0.5 MB L2 cache, Solaris 2.5)
M2F LANai + Myricom network in a fat-tree variant
GAM + Split-C
Modify the Active Message layer to inflate L, o, g, or G independently
Execute a diverse suite of applications and observe the effect
Evaluate against natural performance models
8
Adjusting L, o, and g (and G) in situ
(Diagram: host workstations running the AM library on either side of the Myrinet, with LANai network interfaces in between.)
o: stall the Ultra on message write (send) and on message read (receive)
L: defer marking a message as valid until Rx + L
g: delay the LANai after each message injection (after each fragment for bulk transfers)
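The g knob can be sketched as a delay inserted after each injection; a hedged model (function name and parameter values are illustrative) of the resulting send schedule:

```python
# Emulating a larger gap in situ: the NIC stalls for g_added after each
# injection, so back-to-back sends are spaced g_base + g_added apart.
def effective_send_times(n_msgs, g_base_us, g_added_us):
    times, t = [], 0.0
    for _ in range(n_msgs):
        times.append(t)
        t += g_base_us + g_added_us
    return times

print(effective_send_times(4, 5.0, 10.0))  # [0.0, 15.0, 30.0, 45.0]
```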
9
Calibration
10
Applications Characteristics
Message frequency
Write-based vs. read-based
Short vs. bulk messages
Synchronization
Communication balance
11
Applications used in the Study
12
Baseline Communication
13
Application Sensitivity to Communication Performance
14
Sensitivity to Overhead
15
Sensitivity to gap (1/msg rate)
16
Sensitivity to Latency
17
Sensitivity to bulk BW (1/G)
18
Modeling Effects of Overhead
T_pred = T_orig + 2 × (max #msgs on any processor) × o: each message pays the added overhead on both the request and the response, and the processor with the most messages limits overall time. Why does this model under-predict?
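A minimal check of the overhead model, with hypothetical message counts and an assumed 10 µs of added overhead (times in microseconds so the arithmetic is exact):

```python
# Overhead model: T_pred = T_orig + 2 * (max msgs on any processor) * delta_o,
# since each message pays the added overhead on both request and response.
def predict_overhead(t_orig_us, msgs_per_proc, delta_o_us):
    return t_orig_us + 2 * max(msgs_per_proc) * delta_o_us

# Hypothetical run: 1 s baseline; busiest processor handles 50,000 messages;
# overhead inflated by 10 us per message.
print(predict_overhead(1_000_000, [42_000, 50_000, 47_000], 10))  # 2000000
```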
19
Modeling Effects of gap
Uniform communication model: T_pred = T_orig if g < I (the average message interval); T_pred = T_orig + m(g − I) otherwise.
Bursty communication: T_pred = T_orig + m·g
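The two gap models side by side, again with invented numbers (times in microseconds):

```python
# Gap models for m messages, times in microseconds (all values invented).
def predict_gap_uniform(t_orig_us, m, g_us, interval_us):
    # Extra time appears only once the gap exceeds the average message
    # interval I; below that, sends were already spaced wider than g.
    if g_us < interval_us:
        return t_orig_us
    return t_orig_us + m * (g_us - interval_us)

def predict_gap_bursty(t_orig_us, m, g_us):
    # Bursts are fully serialized by the gap: every message pays g.
    return t_orig_us + m * g_us

m = 100_000                                       # messages in the run
print(predict_gap_uniform(1_000_000, m, 20, 50))  # 1000000 (g < I: no change)
print(predict_gap_uniform(1_000_000, m, 80, 50))  # 4000000
print(predict_gap_bursty(1_000_000, m, 20))       # 3000000
```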
20
Extrapolating to Low Overhead
21
MPI over AM: ping-pong bandwidth
22
MPI over AM: start-up
23
NPB2 Speedup: NOW vs SP2
24
NOW vs. Origin
25
Single Processor Performance
26
Understanding Speedup
Speedup(p) = T(1) / max_p(T_compute + T_comm + T_wait)
T_compute = (work/p + extra) × efficiency
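The speedup decomposition can be evaluated directly; the per-processor times below are made up for illustration:

```python
# Speedup(p) = T(1) / max over processors of (compute + comm + wait);
# per the slide, T_compute = (work/p + extra) * an efficiency factor.
def t_compute(work, p, extra, efficiency_factor):
    return (work / p + extra) * efficiency_factor

def speedup(t1, per_proc_times):
    # The slowest processor (largest compute+comm+wait sum) sets the time.
    return t1 / max(sum(t) for t in per_proc_times)

# Hypothetical 4-processor run; tuples are (compute, comm, wait) seconds,
# chosen to be exactly representable in binary floating point.
times = [(2.0, 0.25, 0.25), (2.0, 0.5, 0.0), (1.75, 0.25, 0.5), (2.0, 0.25, 0.0)]
print(speedup(8.0, times))  # 8.0 / 2.5 = 3.2
```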
27
Performance Tools for Clusters
Independent data collection on every node: timing, sampling, tracing
Little perturbation of global effects
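A sketch of the per-node collection idea: each node buffers timestamped events locally, so the run itself needs no global coordination, and the traces are merged offline. The class and function names are invented:

```python
# Per-node tracing: each node buffers (timestamp, node, event) locally,
# so data collection perturbs only that node; traces merge offline.
import time

class NodeTracer:
    def __init__(self, node_id):
        self.node_id = node_id
        self.events = []                     # local buffer, no global locks

    def record(self, event):
        self.events.append((time.perf_counter(), self.node_id, event))

def merge(tracers):
    # Offline merge of all per-node buffers by timestamp.
    return sorted(e for t in tracers for e in t.events)

t0, t1 = NodeTracer(0), NodeTracer(1)
t0.record("send"); t1.record("recv"); t0.record("barrier")
print([event for _, _, event in merge([t0, t1])])
```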
28
Where the Time Goes: LU-a
29
Where the Time Goes: BT-a
30
Constant Problem Size Scaling
(Chart: constant-problem-size scaling at processor counts 4, 8, 16, 32, 64, 128, 256.)
31
Communication Scaling
(Charts: normalized messages per processor; average message size.)
32
Communication Scaling: Volume
33
Extra Work
34
Cache Working Sets: LU (8-fold reduction in miss rate from 4 to 8 processors)
35
Cache Working Sets: BT
36
Cycles per Instruction
37
MPI Internal Protocol (diagram: sender/receiver message exchange)
38
Revised Protocol (diagram: sender/receiver message exchange)
39
Sensitivity to Overhead
40
Conclusions
Run-time systems for parallel programs must deal with a host of architectural interactions: communication, computation, and the memory system.
Build a performance model of your RTPP: it is the only way to recognize anomalies.
Build tools along with the run time to reflect characteristics and sensitivity back to the parallel program.
Much can lurk beneath a perfect speedup curve.