1 Scalable Reconfigurable Interconnects Ali Pinar Lawrence Berkeley National Laboratory joint work with Shoaib Kamil, Lenny Oliker, and John Shalf CSCAPES Workshop, Santa Fe, June 11, 2008

2 Ultra-scale systems rely on increased concurrency. Huge increases in concurrency since 2004. How to connect huge numbers of processors?

3 What is a good interconnect for ultra-scale systems?
- Mesh/torus networks provide limited performance.
- Fat-trees are widely used due to their flexibility.
  - 94 of the top 100 systems on the Top500 list in 2004
  - 72 of the top 100 systems on the Top500 list in 2007
- The cost of a fat-tree scales as O(P log P), where P is the number of processors.
- For large numbers of processors, the cost of the interconnect dominates the cost of compute power.
[Figure: fat-tree and torus topologies]
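A minimal sketch of what the O(P log P) scaling implies, assuming the cost is modeled as exactly P * log2(P) in arbitrary units (the constant factor and the function names are illustrative, not taken from the slides). It shows that the interconnect cost per processor keeps growing with machine size, while the compute cost per processor stays flat.

```python
import math

def fat_tree_cost(p: int) -> float:
    """Relative fat-tree cost for p processors, modeled as p * log2(p) (assumed model)."""
    return p * math.log2(p)

def cost_per_processor(p: int) -> float:
    """Interconnect cost attributed to each processor under the same model."""
    return fat_tree_cost(p) / p

for p in (1_024, 32_768, 1_048_576):
    print(f"P={p:>9,}  interconnect cost per processor ~ {cost_per_processor(p):.0f}")
```

Under this model, doubling P more than doubles the interconnect cost, which is why the interconnect eventually dominates.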

4 Step-by-step approach
- Characterize the communication requirements of applications.
  - Replaces theoretical metrics with practical ones.
- Minimize the interconnection requirements.
  - Choice of subdomains
  - Task-to-processor mapping
  - Scheduling of messages
- Design alternative interconnects.
  - Static networks: fit-trees
  - Reconfigurable networks

5 Static Applications
Name       | Lines | Discipline       | Problem & Method                                | Structure
Cactus     | 84k   | Astrophysics     | Einstein’s Theory of GR via Finite Differencing | Grid
LBMHD      | 1500  | Plasma Physics   | Magneto-Hydrodynamics via Lattice-Boltzmann     | Lattice/Grid
GTC        | 5000  | Magnetic Fusion  | Vlasov-Poisson Equation via Particle-in-Cell    | Particle/Grid
MADbench   | 5000  | Cosmology        | CMB Analysis via Newton-Raphson                 | Dense Matrix
ELBM3D     | 3000  | Fluid Dynamics   | Fluid Dynamics via Lattice-Boltzmann            | Lattice/Grid
BeamBeam3D | 23k   | Particle Physics | Poisson’s Equation via Particle-in-Cell and FFT | Particle/Grid

6 Static Applications

7 Most messages are small
- Employ a separate network for low-bandwidth messages.
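A small sketch of how a message trace could be split between the two networks. The 2048-byte threshold, the function name, and the trace values are all hypothetical placeholders chosen for illustration; the slides do not specify them.

```python
def split_by_size(message_bytes, threshold_bytes=2048):
    """Partition message sizes into (small, large) around an assumed threshold."""
    small = [m for m in message_bytes if m <= threshold_bytes]
    large = [m for m in message_bytes if m > threshold_bytes]
    return small, large

# Made-up trace of message sizes in bytes, for illustration only.
trace = [8, 64, 128, 512, 1024, 4096, 1_048_576]
small, large = split_by_size(trace)

count_share = len(small) / len(trace)    # fraction of messages that are small
volume_share = sum(small) / sum(trace)   # fraction of bytes they carry
print(f"small messages: {count_share:.0%} of count, {volume_share:.2%} of volume")
```

In a trace shaped like this one, small messages are most of the count but little of the volume, which is the situation where a separate low-bandwidth network is attractive.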

8 Most fat-tree ports are not utilized
- More than 50% of the ports of a fat-tree are not used.

9 Clever task-to-processor allocation yields better results.
- Hops are reduced by an average of 25%, which improves latency.
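To illustrate how placement affects hop counts, here is a minimal sketch that maps a 4x4 nearest-neighbor task grid onto a 4x4 processor mesh and improves a shuffled placement with greedy pairwise swaps. The mesh topology, the communication pattern, and the swap heuristic are assumptions made for this example; they are not the allocation method behind the 25% figure.

```python
from itertools import combinations
import random

def mesh_hops(a, b, width):
    """Manhattan distance between processors a and b on a width-wide 2D mesh."""
    return abs(a % width - b % width) + abs(a // width - b // width)

def total_hops(edges, placement, width):
    """Sum of hops over all communicating task pairs; placement[task] = processor."""
    return sum(mesh_hops(placement[u], placement[v], width) for u, v in edges)

def greedy_swap(edges, placement, width):
    """Swap the processors of two tasks whenever that lowers total hops."""
    placement = list(placement)
    best = total_hops(edges, placement, width)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(placement)), 2):
            placement[i], placement[j] = placement[j], placement[i]
            cost = total_hops(edges, placement, width)
            if cost < best:
                best, improved = cost, True
            else:
                # Undo the swap: it did not help.
                placement[i], placement[j] = placement[j], placement[i]
    return placement, best

WIDTH = 4
TASKS = WIDTH * WIDTH
# Nearest-neighbor communication on a logical 4x4 task grid.
edges = [(t, t + 1) for t in range(TASKS) if (t + 1) % WIDTH]
edges += [(t, t + WIDTH) for t in range(TASKS - WIDTH)]

start = list(range(TASKS))
random.Random(0).shuffle(start)          # a deliberately poor starting placement
before = total_hops(edges, start, WIDTH)
_, after = greedy_swap(edges, start, WIDTH)
print(f"total hops: {before} before, {after} after greedy swaps")
```

The greedy search can stop at a local minimum; the identity placement, where every message travels one hop, is the lower bound for this pattern.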

10 Do we need the fat-tree bandwidth?
- We need the flexibility of a fat-tree, but not the full bandwidth.
- The bandwidth requirement can be decreased with careful placement of tasks.
- Proposed alternative: fit-trees
- Idea: analyze the communication requirements of applications and design the interconnect for what is really needed.

11 Even all-to-all communication does not need a fat-tree.
- All-to-all communication is the bottleneck for FFT.
- Clever scheduling of messages reduces the bandwidth requirement.
- Conventional algorithms for all-to-all communication do not distribute communication evenly.
- The savings are even more pronounced in FFT with 2D decomposition.
[Figure: panels Standard, Randomized, Optimal; axes: communication step, level]
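The uneven load of a conventional schedule can be sketched as follows. The assumptions here are mine: processors are the leaves of a binary tree split into two contiguous halves, and the "standard" schedule is the common shift pattern in which rank i sends to rank (i + s) mod P at step s. The code counts how many messages cross the top of the tree at each step and compares the peak with the load an evenly balanced schedule would need. It does not implement the randomized or optimal schedules from the slide.

```python
def top_level_load(p, step):
    """Messages crossing between the two halves when rank i sends to (i + step) % p."""
    half = p // 2
    return sum(
        1 for src in range(p)
        if (src < half) != (((src + step) % p) < half)
    )

P = 8
loads = [top_level_load(P, s) for s in range(1, P)]
peak = max(loads)
# If the same total traffic were spread evenly over all steps (ceiling division).
even = -(-sum(loads) // len(loads))

print("top-level load per step:", loads)
print("peak:", peak, " evenly spread:", even)
```

The top-level links must be provisioned for the peak step, so flattening the per-step load lowers the bandwidth that the upper tree levels need.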

12 Fit-trees: the network should fit the application
- Key observation: the scalability of an application is related to the locality of its computation.
- Implication: required bandwidth decreases as we go higher in the tree.
- Fitness ratio (f): ratio of the bandwidth between two successive layers
  - 2D domains: f ≈ 1.4
  - 3D domains: f ≈ 1.2
[Figure: fit-tree and fat-tree diagrams, with levels labeled in terms of N and f]
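A rough sketch of how the fitness ratio changes the bandwidth that has to be provisioned, assuming the lowest level carries bandwidth N and each level above it carries the previous level's bandwidth divided by f. Under that assumption f = 1 reproduces a fat-tree. The function names, N = 1024, and the 10-level depth are illustrative choices.

```python
def level_bandwidths(n, f, levels):
    """Bandwidth at each tree level, starting from n at the bottom (assumed model)."""
    return [n / f**level for level in range(levels)]

def total_bandwidth(n, f, levels):
    """Total bandwidth provisioned across all levels."""
    return sum(level_bandwidths(n, f, levels))

N, LEVELS = 1024, 10
fat = total_bandwidth(N, 1.0, LEVELS)    # fat-tree: full bandwidth at every level
for label, f in (("2D domains", 1.4), ("3D domains", 1.2)):
    fit = total_bandwidth(N, f, LEVELS)
    print(f"{label}: f={f}, fit-tree needs {fit / fat:.0%} of fat-tree bandwidth")
```

The larger the fitness ratio, the faster the upper levels shrink, so 2D domains would need less provisioned bandwidth than 3D domains in this model.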

13 Fit-trees provide scalability

14 HFAST
- Hybrid Flexibly-Assignable Switch Topology
- Use Layer-1 (circuit) switches to configure Layer-2 (packet) switches at run-time (reconfiguration costs O(10-100 ms)).
- Hardware to do so exists (optical networks).
- Layer-1 switches are cheaper per port (no dynamic decisions, like a telephone switchboard).
- Collective communication uses a separate low-latency, low-bandwidth tree network (like IBM BlueGene).

15 How to use HFAST
- Improved task-to-processor assignments
- Even at runtime
- Migrate processes with little overhead
- Adapt to changing communication requirements
- Avoid defragmentation at the system level
- Build an interconnect for each application
- Avoid overprovisioning the communication resources

16 Processor allocation for adaptive applications
- We obtain 41% and 53% of the ideal hop savings.

17 Conclusions
- Massive concurrencies of ultrascale machines will require new interconnects.
- We cannot afford to overprovision the resources.
- There is no magic solution that is good for all applications.
- Flexibility or reconfigurability is necessary.
- The technology for reconfigurable networks is available.
- We need to
  - reduce the resource requirements
  - design networks for typical workloads
  - design methods to build networks for a given application.

