Scalable Reconfigurable Interconnects
Ali Pinar, Lawrence Berkeley National Laboratory
Joint work with Shoaib Kamil, Lenny Oliker, and John Shalf
CSCAPES Workshop, Santa Fe, June 11, 2008

Ultra-scale systems rely on increased concurrency.
 Concurrency has increased enormously in recent years.
 How do we connect such huge numbers of processors?

What is a good interconnect for ultra-scale systems?
 Mesh/torus networks provide limited performance.
 Fat-trees are widely used due to their flexibility:
 94 of the first 100 Top500 systems in 2004,
 72 of the first 100 in 2007.
 The cost of a fat-tree scales as O(P log P).
 For large processor counts, the cost of the interconnect dominates the cost of the compute power.
(Figure: fat-tree and torus topologies.)
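To see why the O(P log P) scaling matters, here is a minimal sketch (a hypothetical cost model, not from the talk) comparing the growth of total link bandwidth for a full-bisection fat-tree against a 3D torus, counting cost in units of one processor link:

```python
# Hypothetical cost model: compare interconnect cost growth for a
# full-bisection fat-tree vs. a 3D torus, in units of one processor link.
import math

def fat_tree_cost(p: int) -> float:
    # A full-bisection fat-tree carries p links' worth of bandwidth at each
    # of its ~log2(p) levels, so total cost grows as O(p log p).
    return p * math.log2(p)

def torus_cost(p: int) -> float:
    # A 3D torus has six neighbor links per node, i.e. 3p unique links: O(p).
    return 3 * p

for p in [2**10, 2**14, 2**18]:
    print(f"P={p:>7}: fat-tree ~{fat_tree_cost(p):,.0f}, torus ~{torus_cost(p):,.0f}")
```

At roughly a quarter-million processors the fat-tree carries about six times the total bandwidth of the torus in this model, which is exactly the overprovisioning the talk targets.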

Step-by-step approach
 Characterize the communication requirements of applications.
 Replace theoretical metrics with practical ones.
 Minimize the interconnect requirements:
 choice of subdomains,
 task-to-processor mapping,
 scheduling of messages.
 Design alternative interconnects:
 static networks: fit-trees,
 reconfigurable networks.

Static applications

Name        Lines  Discipline        Problem & Method                                  Structure
Cactus      84k    Astrophysics      Einstein's theory of GR via finite differencing   Grid
LBMHD       1500   Plasma Physics    Magneto-hydrodynamics via lattice-Boltzmann       Lattice/Grid
GTC         5000   Magnetic Fusion   Vlasov-Poisson equation via particle-in-cell      Particle/Grid
MADbench    5000   Cosmology         CMB analysis via Newton-Raphson                   Dense Matrix
ELBM3D      3000   Fluid Dynamics    Fluid dynamics via lattice-Boltzmann              Lattice/Grid
BeamBeam3D  23k    Particle Physics  Poisson's equation via particle-in-cell and FFT   Particle/Grid

Static applications (figure: measured communication patterns)

Most messages are small.
 Employ a separate network for low-bandwidth messages.

Most fat-tree ports are not utilized.
 More than 50% of the ports of a fat-tree go unused.

Clever task-to-processor allocation yields better results.
 Hops are reduced by an average of 25%, improving latency; a toy illustration follows.
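The following sketch (my toy example, assuming a 2D nearest-neighbor task graph mapped onto a 2D processor mesh, with hop count measured as Manhattan distance) shows why locality-aware placement beats a random one:

```python
# Toy model: total hop count for a locality-preserving vs. random mapping of
# an N x N nearest-neighbor task grid onto an N x N processor mesh.
import random

N = 16  # N x N tasks on an N x N mesh

def total_hops(placement):
    # Sum Manhattan distances over all nearest-neighbor task pairs.
    hops = 0
    for x in range(N):
        for y in range(N):
            for dx, dy in ((1, 0), (0, 1)):
                nx, ny = x + dx, y + dy
                if nx < N and ny < N:
                    (px, py), (qx, qy) = placement[(x, y)], placement[(nx, ny)]
                    hops += abs(px - qx) + abs(py - qy)
    return hops

identity = {(x, y): (x, y) for x in range(N) for y in range(N)}
coords = list(identity)
random_map = dict(zip(coords, random.sample(coords, len(coords))))

print("locality-aware mapping:", total_hops(identity))   # every message: 1 hop
print("random mapping:       ", total_hops(random_map))  # ~2N/3 hops per message
```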

Do we need the full fat-tree bandwidth?
 We need the flexibility of a fat-tree, but not its full bandwidth.
 The bandwidth requirement can be decreased with careful placement of tasks.
 Proposed alternative: fit-trees.
 Idea: analyze the communication requirements of applications and design the interconnect for what is really needed.

Even all-to-all communication does not need a fat-tree.
 All-to-all communication is the bottleneck for FFT.
 Clever scheduling of messages reduces the bandwidth requirement.
 Conventional algorithms for all-to-all communication do not distribute communication evenly.
 The savings are even more pronounced in FFT with 2D decomposition.
(Figure: link load per tree level at each communication step for the standard, randomized, and optimal schedules.)
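For intuition, here is a minimal sketch of one classic evenly-balanced all-to-all schedule (the XOR pairwise exchange; an illustration of balanced scheduling, not necessarily the schedule used in the talk). In step t, rank p exchanges with rank p XOR t, so every step is a perfect matching and no rank or link is oversubscribed:

```python
# XOR pairwise-exchange all-to-all schedule: P-1 steps, each a perfect
# matching of the P ranks, so the load is spread evenly across steps.
P = 8  # must be a power of two for the XOR schedule

for t in range(1, P):
    pairs = sorted({tuple(sorted((p, p ^ t))) for p in range(P)})
    print(f"step {t}: " + "  ".join(f"{a}<->{b}" for a, b in pairs))
```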

Fit-trees: the network should fit the application.
 Key observation: the scalability of an application is related to the locality of its computation.
 Implication: the required bandwidth decreases as we go higher in the tree.
 Fitness ratio f: the ratio of the bandwidths of two successive layers.
 2D domains: f ≈ 1.4
 3D domains: f ≈ 1.2
(Figure: a fat-tree provides bandwidth N at every level, while a fit-tree's bandwidth shrinks by a factor of f per level: N, N/f, N/f², ...)
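The cost argument behind the fitness ratio can be sketched as follows (my derivation, assuming bandwidth shrinks geometrically by f > 1 per level and N denotes the aggregate leaf-level bandwidth, proportional to P):

```latex
% Cost sketch: fat-tree vs. fit-tree over L = log P levels,
% assuming a constant fitness ratio f > 1 between successive levels.
\begin{align*}
  \text{Fat-tree cost} &= \sum_{\ell=0}^{L-1} N = N \log P = O(P \log P),\\
  \text{Fit-tree cost} &= \sum_{\ell=0}^{L-1} \frac{N}{f^{\ell}}
                        < N \sum_{\ell=0}^{\infty} f^{-\ell}
                        = \frac{f}{f-1}\, N = O(P).
\end{align*}
```

The geometric sum converges, so a fit-tree's total bandwidth stays within a constant factor f/(f-1) of its leaf level, eliminating the log P factor of the fat-tree.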

Fit-trees provide scalability

HFAST
 Hybrid Flexibly-Assignable Switch Topology.
 Use layer-1 (circuit) switches to configure layer-2 (packet) switches at run time; reconfiguration costs O(10-100 ms).
 The hardware to do this exists (optical networks).
 Layer-1 switches are cheaper per port (no dynamic decisions; like a telephone switchboard).
 Collective communication uses a separate low-latency, low-bandwidth tree network (as in IBM BlueGene).
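A minimal sketch (hypothetical, not the HFAST implementation) of how a layer-1 circuit switch could be programmed from an application's measured communication graph: greedily dedicate circuits to the heaviest flows, subject to a per-node port budget d, and let the remaining light traffic fall back to the low-bandwidth tree network:

```python
# Hypothetical circuit-provisioning policy: admit the heaviest flows first,
# respecting a per-node port budget d on the circuit switch.
from collections import defaultdict

def provision_circuits(traffic, d):
    """traffic: dict {(src, dst): volume}; returns the circuits to establish."""
    degree = defaultdict(int)
    circuits = []
    for (src, dst), volume in sorted(traffic.items(), key=lambda kv: -kv[1]):
        if degree[src] < d and degree[dst] < d:
            circuits.append((src, dst))
            degree[src] += 1
            degree[dst] += 1
    return circuits

traffic = {(0, 1): 900, (1, 2): 850, (0, 2): 40, (2, 3): 800, (0, 3): 10}
print(provision_circuits(traffic, d=2))
# Flows not admitted here would use the separate low-bandwidth tree network.
```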

How to use HFAST
 Improve task-to-processor assignments:
 even at run time,
 migrate processes with little overhead,
 adapt to changing communication requirements,
 avoid the need for defragmentation at the system level.
 Build an interconnect for each application:
 avoid overprovisioning the communication resources.

Processor allocation for adaptive applications
 We obtain 41% and 53% of the ideal hop savings.

Conclusions
 The massive concurrency of ultrascale machines will require new interconnects.
 We cannot afford to overprovision resources.
 There is no magic solution that is good for all applications; flexibility or reconfigurability is necessary.
 The technology for reconfigurable networks is available.
 We need to:
 reduce resource requirements,
 design networks for typical workloads,
 design methods to build networks for a given application.