
1 Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. Emerging Challenges & Opportunities in Parallel Computing: The Cretaceous Redux? Bruce Hendrickson Senior Manager for Math & Computer Science Sandia National Laboratories, Albuquerque, NM University of New Mexico, Computer Science Dept.

2 The Relationship Between Theory & Practice in Parallel Computing: Plus a Silly Metaphor Bruce Hendrickson Senior Manager for Math & Computer Science Sandia National Laboratories, Albuquerque, NM University of New Mexico, Computer Science Dept.

3 Outline Theory and practice in parallel computing are estranged Emerging applications will challenge the status quo Architectural changes will add further disruption These forces will create rich opportunities for the theory community

4 Parallel Computing Theory is Robust Theoretical foundations –P-Completeness [Cook’73] –Boolean Circuits [Borodin’77] –PRAMs [Fortune and Wyllie’78] –NC and P-Completeness [Pippenger/Cook’79] Technology-informed theoretical models –Fixed interconnection machines, e.g. hypercubes [many] –LOGP [Culler, et al.’93] –Bulk Synchronous Parallel [Gerbessiotis & Valiant’92] “Practical” ideas with strong theoretical underpinnings –PGAS Languages [several] –CILK [Leiserson’s group’95] 21 years of SPAA, 28 years of PODC, etc…

5 Sandia is a Leader in Parallel Computing [Figure: timeline, 1987 to 2009, of Sandia parallel machines (CM-2, nCUBE-2, iPSC-860, Paragon, ASCI Red, Red Storm, Cplant) and recognitions: Gordon Bell Prizes; world records of 143 GFlops, 281 GFlops and Teraflops; R&D 100 awards (Parallel Software, Dense Solvers, Aztec, Meshing, Signal Processing, Storage, Salvo, Trilinos, Allocator, Catamount, Xyce); patents (Parallel Software, Meshing, Partitioning, Paving, Data Mining); Karp Challenge; SC96 Gold Medal Networking; Mannheim SuParCup; Fernbach Award. Designed by Rolf Riesen, July 2005.]

6 Theory at Sandia Sandia designs, procures, programs, runs & treasures big parallel computers Sandia has at least 200 PhDs working on parallel computing –Mostly physics & engineering degrees –But many computer scientists as well Very few of these practitioners could define a PRAM –Let alone explain NC! None use CILK or UPC What’s wrong with this picture!?

7 Elements of Parallel Computing Practice Clusters –“Killer micros” enable commodity-based parallel computing –Attractive price and price/performance –Stable model for algorithms & software MPI –Portable and stable programming model and language –Allowed for huge investment in software Bulk-Synchronous Parallel Programming (BSP) –Basic approach to almost all successful MPI programs –Compute locally; communicate; repeat –Excellent match for clusters+MPI –Good fit for many scientific applications Algorithms –Stability of the above allows for sustained algorithmic research
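
The "compute locally; communicate; repeat" pattern on this slide can be sketched in a few lines. What follows is a minimal single-process simulation, not MPI code and not anything taken from the talk: the function name `bsp_jacobi`, the block partitioning, and the choice of 1-D Jacobi smoothing as the local computation are all assumptions made for illustration. It also assumes the array has at least as many entries as simulated processes. Each pass of the loop is one superstep: neighbours exchange a single boundary ("halo") value, then every process updates only the data it owns.

```python
def bsp_jacobi(values, num_procs, supersteps):
    """Simulate BSP-style Jacobi smoothing of a 1-D array (end values held fixed).

    Illustrative only: 'processes' are plain Python lists, and the exchange
    step stands in for the message passing a real MPI program would do.
    Assumes len(values) >= num_procs so that no process owns an empty chunk.
    """
    n = len(values)
    # Block partition: each simulated process owns one contiguous chunk.
    bounds = [(p * n // num_procs, (p + 1) * n // num_procs)
              for p in range(num_procs)]
    chunks = [list(values[lo:hi]) for lo, hi in bounds]

    for _ in range(supersteps):
        # Communicate: every process receives one halo value per neighbour.
        halos = []
        for p in range(num_procs):
            left = chunks[p - 1][-1] if p > 0 else None
            right = chunks[p + 1][0] if p < num_procs - 1 else None
            halos.append((left, right))

        # Compute locally: each process touches only its chunk plus halos.
        new_chunks = []
        for p, chunk in enumerate(chunks):
            left, right = halos[p]
            padded = [left] + chunk + [right]
            updated = []
            for i in range(1, len(padded) - 1):
                lo_val, hi_val = padded[i - 1], padded[i + 1]
                if lo_val is None or hi_val is None:
                    updated.append(padded[i])  # global boundary stays fixed
                else:
                    updated.append(0.5 * (lo_val + hi_val))
            new_chunks.append(updated)
        chunks = new_chunks  # implicit barrier: all processes finish the step

    return [x for chunk in chunks for x in chunk]
```

Because each superstep needs only one value from each neighbour, the ratio of computation to communication grows with chunk size, which is one way to read the slide's claim that this style is an excellent match for clusters plus MPI.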

8 A Virtuous Circle… Architectures Programming Models Algorithms Software Commodity Clusters Explicit Message Passing Bulk Synchronous Parallel MPI …or a vicious noose?

9 MPI Applications LAMMPS PETSc Linpack Trilinos

10 CILK UPC PRAM LOGP

11 Existing Applications Are Evolving Leading edge scientific applications increasingly include: –Adaptive, unstructured data structures –Complex, multiphysics simulations –Multiscale computations in space and time –Complex synchronizations (e.g. discrete events) These raise significant parallelization challenges –Limited by memory, not processor performance –Unsolved micro-load balancing problems –Finite degree of coarse-grained parallelism –Bulk synchronous parallel not always appropriate These changes will stress existing approaches to parallelism

12 New Applications are Emerging: E.g. Network Science Graphs are ideal for representing entities and relationships Rapidly growing use in biological, social, environmental, and other sciences Zachary’s karate club (|V|=34) The way it was … Twitter social network (|V|≈200M) The way it is now …

13 Emerging New Scientific Questions New algorithms –Community detection, centrality, graph generation, etc. –Right set of questions and concepts still unknown. –Statistics, machine learning, anomaly detection, etc. New issues –Noisy, error-filled data. What can we conclude robustly? –Temporal evolution of networks. New science –Social dynamics and ties to technology & media –Large economic, social, political consequences Parallel computing needed for big data and/or fast response

14 Computational Challenges for Network Science Minimal computation to hide access time Runtime is dominated by latency –Random accesses to global address space –Parallelism is very fine grained and dynamic Access pattern is data dependent –Prefetching unlikely to help –Usually only want small part of cache line Potentially abysmal locality at all levels of memory hierarchy Many algorithms are not bulk synchronous Approaches based on virtuous circle don’t work!

15 Locality Challenges What we traditionally care about What industry cares about Emerging Codes From: Murphy and Kogge, On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications, IEEE T. on Computers, July 2007

16 A Renaissance in Architecture Research Good news –Moore’s Law marches on –Real estate on a chip is essentially free Major paradigm change – huge opportunity for innovation Bad news –Power considerations limit the improvement in clock speed –Parallelism is only viable route to improve performance Current response, multicore processors –Computation/Communication ratio will get worse Makes life harder for applications Long-term consequences unclear

17 Example: AMD Opteron

18 L2 Cache L1 I-Cache L1 D-Cache Memory (Latency Avoidance) Example: AMD Opteron

19 L2 Cache L1 I-Cache L1 D-Cache Memory (Lat. Avoidance) Memory Controller I-Fetch Scan Align Load/Store Unit Out-of-Order Exec Load/Store Mem/Coherency (Latency Tolerance) Example: AMD Opteron

20 L2 Cache L1 I-Cache L1 D-Cache Memory (Latency Avoidance) Memory Controller I-Fetch Scan Align Load/Store Unit Out-of-Order Exec Load/Store Mem/Coherency (Lat. Toleration) Bus DDR HT Memory and I/O Interfaces Example: AMD Opteron

21 L2 Cache L1 I-Cache L1 D-Cache Memory (Latency Avoidance) Memory Controller I-Fetch Scan Align Load/Store Unit Out-of-Order Exec Load/Store Mem/Coherency (Lat. Tolerance) FPU Execution Int Execution Bus DDR HT Memory and I/O Interfaces COMPUTER Example: AMD Opteron Thanks to Thomas Sterling

22 Architectural Wish List for Graphs Low latency / high bandwidth –For small messages! Latency tolerant Light-weight synchronization mechanisms for fine-grained parallelism Global address space –No graph partitioning required –Avoid memory-consuming profusion of ghost-nodes –No local/global numbering conversions One machine with these properties is the Tera MTA-2 –And its successor the Cray XMT
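
The ghost-node cost mentioned above can be estimated with a short sketch. This is an assumption-laden illustration rather than anything from the talk: `ghost_sets` is a hypothetical helper, and `owner` is an arbitrary vertex-to-partition map. In a distributed-memory code, each partition keeps a local copy (a ghost) of every remote vertex adjacent to a vertex it owns.

```python
def ghost_sets(edges, owner):
    """Return {partition: set of ghost vertices} for an undirected edge list.

    A vertex is a ghost on partition p if p does not own it but owns one
    of its neighbours. 'owner' maps each vertex to its partition id.
    """
    ghosts = {p: set() for p in set(owner.values())}
    for u, v in edges:
        pu, pv = owner[u], owner[v]
        if pu != pv:
            ghosts[pu].add(v)  # pu needs a local copy of v
            ghosts[pv].add(u)  # pv needs a local copy of u
    return ghosts
```

For a high-degree hub, every partition that owns one of its neighbours ends up holding a copy of it, and the hub's owner holds copies of all its remote neighbours. A global address space removes that replication, along with the local/global index translation that goes with it.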

23 How Does the MTA/XMT Work? Latency tolerance via massive multi-threading –Context switch every tick –Global address space, hashed to reduce hot-spots –No cache or local memory. –Multiple outstanding loads Remote memory request doesn’t stall processor –Other streams work while your request gets fulfilled Light-weight, word-level synchronization –Minimizes conflicts, enables parallelism Flexible dynamic load balancing –Thread virtualization –Futures
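
A back-of-the-envelope model shows why massive multithreading tolerates latency. The sketch below assumes a simplified machine, not the actual MTA/XMT design: each thread alternates `work_cycles` of computation with `latency_cycles` of waiting on memory, and the processor can switch to any ready thread at no cost. The function names and the numbers in the usage note are illustrative assumptions.

```python
import math


def utilization(threads, work_cycles, latency_cycles):
    """Fraction of cycles the processor is busy under the simple model.

    Each thread is runnable for work_cycles out of every
    (work_cycles + latency_cycles); threads overlap perfectly until the
    processor saturates at 1.0.
    """
    return min(1.0, threads * work_cycles / (work_cycles + latency_cycles))


def threads_to_saturate(work_cycles, latency_cycles):
    """Smallest thread count that keeps the processor fully busy."""
    return math.ceil((work_cycles + latency_cycles) / work_cycles)
```

With one cycle of work per memory reference and a hypothetical 99-cycle latency, a single thread keeps the processor 1% busy, and about 100 concurrent threads are needed to saturate it. That is why this approach depends on applications exposing very fine-grained parallelism.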

24 Case Study: Single Source Shortest Path [Figure: SSSP time (s) versus number of processors, for PBGL and MTA] Parallel Boost Graph Library (PBGL) –Lumsdaine, et al., on Opteron cluster –Some graph algorithms can scale on some inputs PBGL - MTA Comparison on SSSP –Erdős-Rényi random graph (|V|=2^28) –PBGL SSSP can scale on non-power law graphs –Order of magnitude speed difference –2 orders of magnitude efficiency difference Big difference in power consumption –[Lumsdaine, Gregor, H., Berry, 2007]
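
For readers unfamiliar with the problem being benchmarked, here is a sequential reference sketch of single-source shortest paths using Dijkstra's algorithm. It is a baseline for understanding the problem only: it is not the PBGL or MTA implementation, the slide does not say which parallel algorithm either one used, and the name `sssp` and the adjacency-dictionary input format are assumptions. Edge weights are assumed non-negative.

```python
import heapq


def sssp(adj, source):
    """Dijkstra's algorithm with a binary heap.

    adj maps a vertex to a list of (neighbour, weight) pairs; weights must
    be non-negative. Returns {vertex: distance} for reachable vertices.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

The priority queue imposes a strict global ordering on vertex settlement, which is part of what makes SSSP hard to express in a bulk-synchronous style and a useful stress test for the architectures compared on this slide.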

26 Multicore Disruptive Architectures New Apps

29 What Happens Next? Virtuous circle will not survive the coming disruptions New programming models, languages, algorithms and abstractions will be needed But MPI cannot die –Billions of dollars in investment in software –“I don’t know what the parallel programming language of the future will look like, but I know it will be called MPI” Luckily, theory is forever …

30 Rebuilding the Foundations Applied parallel computing will need new ideas to continue moving forward Ideas and tools from theory community can: –Provide abstractions to manage hardware complexity –Underlie robust algorithm development and analysis –Suggest new programming models and abstractions –Point towards new architectural features –Support efficient utilization of resources –Provide underpinnings for the future of applied parallel computing

32 Conclusions Applied parallel computing is facing unprecedented challenges –Multi-core processors –Disruptive architectural innovations –Demands of emerging applications Theory can provide reliable light in the coming darkness –Theoretical insights are resilient to technology changes Theory community will have new opportunities –Provide robust foundation for future progress –Become central to applied parallel computing This is a great time to be doing parallel computing!

33 Thanks Cevdet Aykanat, Michael Bender, Jon Berry, Rob Bisseling, Erik Boman, Bill Carlson, Ümit Çatalürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Vitus Leung, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Michael Mahoney, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.

