SOS7, Durango CO, 4-Mar-2003
Scaling to New Heights Retrospective
IEEE/ACM SC2002 Conference, Baltimore, MD
[Trimmed & Distilled for SOS7 by M. Levine, 4-March-2003, Durango]

Contacts and References
David O’Neal
John Urbanic
Sergiu Sanielevici
Workshop materials:

Introduction
More than 80 researchers from universities, research centers, and corporations around the country attended the first "Scaling to New Heights" workshop, May 20 and 21, 2002, at the PSC, Pittsburgh. Sponsored by the NSF leading-edge centers (NCSA, PSC, SDSC) together with the Center for Computational Sciences (ORNL) and NERSC, the workshop included a poster session, invited and contributed talks, and a panel. Participants examined the issues involved in adapting and developing research software to effectively exploit systems comprising thousands of processors. [Fred/Neil’s Q1.] The following slides collect ideas from the workshop.

Basic Concepts
All application components must scale
Control granularity; virtualize
Incorporate latency tolerance
Reduce dependency on synchronization
Maintain per-process load; facilitate balance
The only new aspect, at larger scale, is the degree to which these things matter

Poor Scalability? (Keep your eye on the ball)
[Chart: speedup vs. processors]

Good Scalability? (Keep your eye on the ball)
[Chart: speedup vs. processors]

Performance is the Goal! (Keep your eye on the ball)
[Chart: speedup vs. processors]

Issues and Remedies
Granularity [Q2a]
Latencies [Q2b]
Synchronization
Load Balancing [Q2c]
Heterogeneous Considerations

Granularity
Define the problem in terms of a large number of small objects, independent of the process count [Q2a]
Object design considerations
– Caching and other local effects
– Communication-to-computation ratio
Control granularity through virtualization (sketch below)
– Maintain per-process load level
– Manage communications within virtual blocks, e.g. Converse
– Facilitate dynamic load balancing
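A minimal C++ sketch of the over-decomposition idea above: the problem is split into many more small objects than there are processes, so a runtime can later remap them for balance. All names here (Chunk, numChunks, and so on) are illustrative, not taken from Converse or any specific library.

    // Over-decomposition sketch: many small work objects ("chunks"),
    // defined independently of the process count.
    #include <cstdio>
    #include <vector>

    struct Chunk {            // one small unit of work and its data
        int id;
        double load;          // measured or estimated cost
        void compute() { /* work on this chunk's local data */ }
    };

    int main() {
        const int numProcs  = 64;     // physical processes
        const int numChunks = 4096;   // many more chunks than processes
        std::vector<std::vector<Chunk>> owned(numProcs);

        // Initial mapping: simple round-robin virtualization layer.
        for (int c = 0; c < numChunks; ++c)
            owned[c % numProcs].push_back(Chunk{c, 1.0});

        // Each process iterates over its own chunks; a Charm++-style
        // runtime would interleave communication between executions.
        for (auto& ch : owned[0]) ch.compute();
        std::printf("chunks per process: %zu\n", owned[0].size());
    }

With 4,096 chunks on 64 processes, each process holds 64 virtual objects, which gives a balancer room to move work without redefining the problem.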

Latencies
Network
– Latency reduction lags improvements in flop rates; it is much easier to grow bandwidth
– Overlap communications and computations; pipeline larger messages (sketch below)
– Don’t wait – speculate! [Q2b]
Software overheads
– Can be more significant than network delays
– NUMA architectures
Scalable designs must accommodate latencies
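One standard way to tolerate network latency, as the slide suggests, is to post non-blocking sends and receives and compute on interior data while messages are in flight. The sketch below uses standard MPI non-blocking calls; the buffer sizes and ring-neighbor ranks are illustrative.

    // Latency hiding: post the halo exchange early, compute on interior
    // data while messages travel, finish boundary work after the wait.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        std::vector<double> halo(1024), sendbuf(1024, 1.0);
        int left  = (rank - 1 + size) % size;   // ring neighbors
        int right = (rank + 1) % size;

        MPI_Request reqs[2];
        MPI_Irecv(halo.data(), 1024, MPI_DOUBLE, left, 0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf.data(), 1024, MPI_DOUBLE, right, 0,
                  MPI_COMM_WORLD, &reqs[1]);

        // ... compute on interior points that do not need the halo ...

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        // ... now compute on boundary points that use halo[] ...
        MPI_Finalize();
    }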

Synchronization
Cost increases with the process count
– Synchronization doesn’t scale well
– Latencies come into play here too
Distributed resources exacerbate the problem
– Heterogeneity is another significant obstacle
Regular communication patterns are often characterized by many synchronizations
– Best suited to homogeneous, co-located clusters
Transition to asynchronous models? (sketch below)
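As one possible step toward the asynchronous models asked about above, a blocking collective can be replaced with its non-blocking MPI-3 counterpart so that independent work overlaps the synchronization. A minimal sketch:

    // Softening a synchronization point: MPI_Iallreduce (MPI-3) lets
    // useful work proceed while the reduction completes.
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        double local = 1.0, global = 0.0;
        MPI_Request req;

        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);
        // ... independent computation overlaps with the reduction ...
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        // 'global' is now valid on every rank.
        MPI_Finalize();
    }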

Load Balancing
Static load balancing
– Reduces to the granularity problem
– Differences between processors and network segments are determined a priori
Dynamic process management requires distributed monitoring capabilities [Q2c] (sketch below)
– Must be scalable
– The system maps objects to processes
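A minimal sketch of one classic dynamic strategy, greedy rebalancing from measured loads: sort objects heaviest-first and repeatedly assign each to the currently least-loaded process. The loads and process count below are invented for illustration; production runtimes such as Charm++ use richer, topology-aware strategies.

    // Greedy measurement-based load balancing (classic heuristic).
    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<double> objLoad = {9, 7, 6, 5, 4, 4, 3, 2, 2, 1};
        const int numProcs = 3;

        std::sort(objLoad.rbegin(), objLoad.rend());   // heaviest first
        // Min-heap of (current load, process id).
        using P = std::pair<double, int>;
        std::priority_queue<P, std::vector<P>, std::greater<P>> heap;
        for (int p = 0; p < numProcs; ++p) heap.push({0.0, p});

        for (double load : objLoad) {
            auto [sum, p] = heap.top(); heap.pop();
            heap.push({sum + load, p});                // give to lightest
        }
        while (!heap.empty()) {
            std::printf("process %d load %.1f\n",
                        heap.top().second, heap.top().first);
            heap.pop();
        }
    }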

Heterogeneous Considerations
Similar but different processors or network components configured within a single cluster
– Different clock rates, NICs, etc.
Distinct processors, network segments, and operating systems operating at a distance
– Grid resources
Elevates the significance of dynamic load balancing; data-driven objects are immediately adaptable (sketch below)
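On heterogeneous resources, a balancer should equalize time rather than raw work. A toy sketch, with invented relative speeds, of sizing each process's share in proportion to its measured speed:

    // Heterogeneity-aware partitioning: a 2x-faster processor gets
    // roughly 2x the work, so all finish at about the same time.
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> speed = {1.0, 1.0, 2.0, 0.5}; // measured rates
        double totalWork = 900.0, totalSpeed = 0.0;
        for (double s : speed) totalSpeed += s;

        for (size_t p = 0; p < speed.size(); ++p) {
            double share = totalWork * speed[p] / totalSpeed;
            std::printf("process %zu gets %.1f units of work\n", p, share);
        }
    }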

Tools [Q2d?]
Automated algorithm selection and performance tuning by empirical means, e.g. ATLAS (sketch below)
– Generate a space of algorithms and search for the fastest implementations by running them
Scalability prediction, e.g. PMaC Lab
– Develop performance models (machine profiles; application signatures) and trending patterns
Identify/fix bottlenecks; choose new methods?
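A toy illustration of the ATLAS-style empirical approach: enumerate a small candidate space (here, loop block sizes for a trivial kernel), time each variant on the actual machine, and keep the fastest. The kernel and candidate set are placeholders; ATLAS itself searches a far richer space of real linear-algebra kernels.

    // Empirical autotuning sketch: run candidates, keep the fastest.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    static void blockedSum(const std::vector<double>& a, int blk, double& out) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); i += blk)       // blocked traversal
            for (size_t j = i; j < i + blk && j < a.size(); ++j)
                s += a[j];
        out = s;
    }

    int main() {
        std::vector<double> a(1 << 20, 1.0);
        int best = 0; double bestTime = 1e30, sink = 0.0;

        for (int blk : {16, 32, 64, 128, 256}) {         // candidate space
            auto t0 = std::chrono::steady_clock::now();
            blockedSum(a, blk, sink);
            double dt = std::chrono::duration<double>(
                            std::chrono::steady_clock::now() - t0).count();
            if (dt < bestTime) { bestTime = dt; best = blk; }
        }
        std::printf("fastest block size: %d (%.3gs), sum=%g\n",
                    best, bestTime, sink);
    }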

Topics for Discussion
How should large, scalable computational science problems be posed?
Should existing algorithms and codes be modified, or should new ones be developed?
Should agencies explicitly fund collaborations to develop industrial-strength, efficient, scalable codes?
What should cyberinfrastructure builders and operators do to help scientists develop and run good applications?

Summary Comments (MJL)
Substantial progress, with scientific payoff, is being made. It is hard work, without magic bullets.
>>> Dynamic load balancing <<<
– Big payoff, both homogeneous and heterogeneous
– Requires considerable people-work to implement
– Runtime overhead is very small

Case Study: NAMD Scalable Molecular Dynamics
Three-dimensional object-oriented code
Message-driven execution capability
Fixed problem sizes determined by biomolecular structures
Embedded PME electrostatics processor
Asynchronous communications

Case Study: Summary
As more processes are used to solve the given fixed-size problems, benchmark times decrease to a few milliseconds
– PME communication times and operating-system loads are significant in this range
Scaling to many thousands of processes is almost certainly achievable now, given a large enough problem
– 700 atoms/process × 3,000 processes = 2.1M atoms