
Towards Future Programming Languages and Models for High-Productivity Computing
Hans P. Zima
University of Vienna, Austria, and JPL, California Institute of Technology, Pasadena, CA
IFIP WG 10.3 Seminar, September 7, 2004

Contents
1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
5 Conclusion

Abstraction in Programming
Programming models and languages bridge the gap between “reality” and hardware at different levels of abstraction, e.g.:
• assembly languages
• general-purpose procedural languages
• functional languages
• very high-level domain-specific languages
Abstraction implies loss of information: a gain in simplicity, clarity, verifiability, and portability versus a potential performance degradation.

The Emergence of High-Level Sequential Languages
The designers of the very first high-level programming language were aware that their success depended on acceptable performance of the generated target programs.
John Backus (1957): “… It was our belief that if FORTRAN … were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger …”
High-level algorithmic languages became generally accepted standards for sequential programming because their advantages outweighed any performance drawbacks.
For parallel programming, no similar development has taken place.

The Crisis of High Performance Computing
• Current HPC hardware: large clusters at reasonable cost
  - commodity clusters or custom MPPs
  - off-the-shelf processors and memory components, built for the mass market
  - latency and bandwidth problems
• Current HPC software:
  - application efficiency sometimes in single digits
  - low-productivity programming models dominate
  - focus on programming in the small
  - inadequate programming environments and tools
• Resemblance to the pre-Fortran era of sequential programming
  - assembly language programming vs. MPI
  - wide gap between the domain of the scientist and the programming language

Domination of Low-Level Paradigms
• Exploiting performance on current parallel architectures requires low-level control and “heroic” programmers
• This has led to “local view” programming models, as exemplified by MPI-based programming (as well as CoArray Fortran, etc.)
  - explicit processors: local views of data
  - program state associated with memory regions
  - explicit communication intertwined with the algorithm

The High Performance Fortran (HPF) Effort
• The 1990s saw a strong (mainly, but not exclusively, academic) effort to improve this situation
• Focus on higher-level paradigms
  - CMFortran
  - Fortran-D
  - Vienna Fortran
  - High Performance Fortran (HPF)
  - and many others…
• A wealth of research on related compiler technology
• These languages have not been broadly accepted, for a number of reasons…

Example: Parallel Jacobi with MPI

Assuming a block column distribution:

    ! local computation
    do while (.not. converged)
      do J = 1, M
        do I = 1, N
          B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
        end do
      end do
      A(1:N,1:M) = B(1:N,1:M)
      …
    end do

    ! communication (after initializing MPI)
    if (MOD(myrank,2) .eq. 1) then
      call MPI_SEND(B(1,1),   N, …, myrank-1, …)
      call MPI_RECV(A(1,0),   N, …, myrank-1, …)
      if (myrank .lt. s-1) then
        call MPI_SEND(B(1,M),   N, …, myrank+1, …)
        call MPI_RECV(A(1,M+1), N, …, myrank+1, …)
      endif
    else
      …
    endif

Example: Parallel Jacobi with HPF

    !HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS)
    !HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: A, B    ! block column distribution

    ! global computation
    do while (.not. converged)
      do J = 1, N
        do I = 1, N
          B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
        end do
      end do
      A(1:N,1:N) = B(1:N,1:N)
      …
    end do

• Communication is automatically generated by the compiler
• A change of distribution requires modifying just one line of the source program; the algorithm itself is unchanged

Example: Observations
• The HPF code is far simpler than the MPI code
• Compiler-generated object code is as good as the MPI code
  - parallelization of the loop
  - static analysis of access patterns and communication
  - aggregation and fusion of communication
• The data distribution directives could even be generated automatically from the sequential code
So where is the problem with HPF?

HPF Problems
• HPF-1 lacked important functionality
  - data distributions do not support irregular algorithms
  - lack of flexibility for processor mapping and data/thread affinity
  - focus on support for array types and SPMD data parallelism
• HPF includes features that are difficult to implement
• Current architectures do not provide good support for the dynamic and irregular features of advanced applications
However:
• An HPF+/JA plasma code reached 12.5 TFLOPS on the Earth Simulator
  - explicit high-level formulation of communication patterns
  - explicit high-level control of communication schedules
  - explicit high-level control of “halos”

The Effect of Schedule Reuse and Halo Management
[Performance chart: FEM crash simulation kernel using the HPF/Earth Simulator language extensions: schedule reuse, halo control, pure procedures]

Contents
1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
5 Conclusion

State-of-the-Art
• Current parallel programming language, compiler, and tool technologies are unable to support high-productivity computing
• New programming models, languages, compiler, and tool technologies are necessary to address the productivity demands of future systems

Global Goal
• Make scientists and engineers more productive: provide a higher level of abstraction
• Support “abstraction without guilt” [Ken Kennedy]: increase programming language usability without sacrificing performance

Productivity
• The major goal must be productivity
  - performance
  - user productivity (time to solution)
  - portability
  - robustness
• This requires significant progress in many fields
  - parallel programming languages
  - compiler technology
  - runtime systems
  - intelligent programming environments
  - …
• …and better support from hardware architectures

Challenges of Peta-Scale Architectures
• Large-scale parallelism
  - hundreds of thousands of processors
  - component failures will occur at relatively short intervals
• Non-uniform data access
  - deep memory hierarchies
  - severe differences in latency between accesses to local memory and to remote memory
  - severe differences in bandwidth

Application Challenges
• Long-lived applications, surviving many generations of hardware
• Applications are becoming larger and more complex
  - multi-disciplinary
  - multi-language
  - multi-paradigm
• Legacy codes pose a big problem
  - from F77 to F90, C/C++, MPI, Coarray Fortran, etc.
  - intellectual content must be preserved
  - automatic rewriting under the constraint of performance portability is a difficult research problem
• Many advanced applications are dynamic, irregular, and adaptive

Opportunities for Peta-Scale Architectures
• High-performance networks and multithreading can contribute to tolerating memory latency and improving memory bandwidth
• Hardware support for locality-aware programming can avoid serious performance problems of current architectures
• Abandoning global cache coherence in favor of a more flexible scheme can eliminate a major source of bottlenecks
• Lightweight processors (possibly in the millions) can be used both as a computational fabric and as a service layer
  - exploitation of spatial locality
  - lightweight synchronization
  - an introspection infrastructure for fault tolerance, performance tuning, and intrusion detection

Programming Models for High Productivity Computing
[Diagram: a programming model (a conceptual view of data and control) is realized by a programming language, a programming language plus directives, a library, or a command-line interface; each realization has its own semantics (S) and productivity (P) characteristics and is mapped onto an execution model (abstract machine)]

Programming Model Issues
• Programming models and their realizations can be characterized along (at least) three dimensions: semantics, user productivity, and performance
• Semantics: a mapping from programs to functions specifying the input/output behavior of the program:
  S: P → F, where each f in F is a function f: I → O
• User productivity: a mapping from programs to a characterization of structural complexity:
  U: P → N
• Performance: a mapping from programs to functions specifying the complexity of the program in terms of its execution on a real or abstract target machine:
  C: P → G, where each g in G is a function g: I → N*

Contents
1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
  3.1 General-Purpose Languages
  3.2 Domain-Specific Languages
  3.3 Programming Environments and Tools
4 Cascade and the Chapel Language Design
5 Conclusion

Key Issues in General-Purpose Languages
Critical functionality:
• global address space
• high-level features for explicit concurrency
• high-level features for locality control
• high-level support for distributed collections
• high-level support for programming-in-the-large
Orthogonal language issues:
• safety
• object orientation
• performance transparency
• generic programming
• extensibility

Locality Awareness: Distribution, Alignment, Affinity
[Diagram: work is aligned with data, and data is distributed onto the processor/memory (PM) units of the HPCS architecture]

Requirements for Data Distribution Features
• Data distributions must be able to reflect the partitioning of physical domains: exploiting locality in physical space
• Data distributions must be able to reflect dynamic reshaping of data domains (e.g., Gaussian elimination): a load-balancing issue
• Affinity, the co-location of threads and the data they operate on, is an important performance issue
• Language features are needed for expressing user-defined distribution and alignment

Domain-Specific Languages (DSLs)
• DSLs provide high-level abstractions tailored to a specific domain
• DSLs separate domain expertise from parallelization knowledge
• DSLs allow a domain expert to work closer to the scientific process (removed from resource management), a potentially easier route to user productivity than general-purpose languages
• Potential problems include fragility with respect to changes in the application domain, and target code performance
• DSLs can be built on top of general-purpose languages, providing a transition path for applications

Exploiting Domain-Specific Information
• Building domain-specific skeletons for applications (with plug-ins for user-defined functions)
• Using domain knowledge for extending and improving standard compiler analysis
• Developing domain-specific libraries
• Detecting and optimizing domain-specific patterns in serial legacy codes
• Using domain knowledge for verification, validation, and fault tolerance
• Telescoping: automatic generation of optimizing compilers for domain-specific languages defined by libraries

Legacy Code Migration
• Rewriting legacy codes
  - preservation of intellectual content
  - opportunity for exploiting new hardware, including new algorithms
  - code size may preclude the practicality of a rewrite
• Language, compiler, tool, and runtime support
• (Semi-)automatic tools for migrating code
  - incremental porting
  - transition of performance-critical sections requires highly sophisticated software for automatic adaptation: high-level analysis, pattern matching and concept comprehension, optimization and specialization

Issues in Programming Environments and Tools
• Reliability challenges
  - massive parallelism poses new problems
  - fault prognostics, detection, recovery
  - data distribution may cause vital data to be spread across all nodes
• (Semi-)automatic tuning
  - closed-loop adaptive control: measurement, decision making, actuation
  - information exposure: users, compilers, runtime systems
  - learning from experience: databases, data mining, reasoning systems
• Introspection
  - a technology for the support of validation, fault detection, and performance tuning

Example: Offline Performance Tuning
[Diagram: a parallelizing compiler with a transformation system translates the source program into a target program; execution on the target machine feeds a program/execution knowledge base, which an expert advisor uses to steer further transformations until the end of the tuning cycle]

Contents
1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
  4.1 The DARPA-Sponsored HPCS Program
  4.2 An Overview of the Cascade Architecture
  4.3 Design Aspects of the Chapel Programming Language
5 Conclusion

HPCS Program Focus Areas
Goals: provide a new generation of economically viable high-productivity computing systems for the national security and industrial user community (2007 – 2010)
Impact:
• Performance (efficiency): improve critical national security applications by a factor of 10X to 40X
• Productivity (time to solution)
• Portability (transparency): insulate research and operational application software from the system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology
Fill the critical technology and capability gap from today's late-1980s HPC technology to future quantum/bio computing

The Cascade Project
• One-year concept study, July 2002 – June 2003
• Three-year prototyping phase, July 2003 – June 2006
• Led by Cray Inc. (Burton Smith)
• Partners:
  - Caltech/JPL
  - University of Notre Dame
  - Stanford University

Cascade Architecture: Key Elements
• Hierarchical architecture: two levels of processing elements
• Lightweight processors in “smart memory”
• Shared address space
• Program-controlled selection of UMA or NUMA data access
• Hybrid UMA/NUMA programming paradigm

Using Bandwidth Wisely
• Tolerate latency with processor concurrency
  - vectors provide concurrency within a thread
  - multithreading provides concurrency between threads
• Exploit locality to reduce bandwidth demand
  - “heavyweight” processors (HWPs) to exploit temporal locality
  - “lightweight” processors (LWPs) to exploit spatial locality
• Use other techniques to reduce network traffic
  - atomic memory operations
  - efficient remote thread creation

A Simplified Global View of the Cascade Architecture
[Diagram: a set of locales, each containing a heavyweight processor (HWP) with a software-controlled cache and smart memory, connected by a global interconnection network]

A Cascade Locale
[Diagram: a heavyweight processor (vector, multithreaded, streaming, with a compiler-assisted cache) connected through the locale interconnect and a network router to other locales and to multiple lightweight processor chips, each combining DRAM, a multithreaded cache, and several multithreaded lightweight processors (LWPs)]
Source: David Callahan, Cray Inc.

Lightweight Processors
Optimized to start, stop, and move threads:
• Multiple processors per die
• Heavily multithreaded
  - local memory available to hold thread state
• Large on-chip DRAM cache
• Specialized instruction set
  - fork, quit
  - spawn
  - MTA-style full/empty (F/E) bits for synchronization
Source: David Callahan, Cray Inc.

Lightweight Processors and Threads
• Lightweight processors
  - co-located with memory
  - focus on availability
  - full exploitation is not a primary system goal
• Lightweight threads
  - minimal state, allowing high-rate context switches
  - spawned by sending a parcel to memory
• Exploiting spatial locality
  - fine grain: reductions, prefix operations, search
  - coarse grain: data distribution and alignment
• Saving bandwidth by migrating threads to data (see the sketch below)
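To make the idea of migrating computation to data concrete, here is a minimal, hypothetical sketch in present-day Chapel, the language that grew out of this design; the coforall/on constructs and the Locales, here, and numLocales identifiers are current Chapel features and are not taken from these slides.

    // Minimal sketch (present-day Chapel, not the 2004 draft):
    // instead of moving data to a thread, spawn a task on the locale that owns the data.
    coforall loc in Locales do    // one task per locale
      on loc do                   // run the task body on that locale's processors and memory
        writeln("running on locale ", here.id, " of ", numLocales);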

Uses of Lightweight Threads
• Fine-grain application parallelism
• Implementation of a service layer
• Components of agent systems that asynchronously monitor the computation (see the sketch below), performing introspection and dealing with:
  - dynamic program validation
  - validation of user directives
  - fault tolerance
  - intrusion prevention and detection
  - performance analysis and tuning
  - support of feedback-oriented compilation
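As an illustration of an asynchronous monitoring agent of the kind listed above, the following is a minimal, hypothetical sketch in present-day Chapel; the begin statement, atomic variables, and the Time module are current Chapel features, not part of the 2004 design.

    use Time;

    var finished: atomic bool;          // flag set by the main computation when it completes

    begin {                             // spawn an asynchronous monitoring task
      while !finished.read() {
        writeln("monitor: computation still in progress");
        sleep(1.0);                     // sample the computation roughly once per second
      }
    }

    // ... main computation would run here ...
    finished.write(true);               // signal the monitor to stop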

Introspection
Introspection can be defined as a system's ability to:
• explore its own properties
• reason about its internal state
• make decisions about appropriate state changes where necessary
It is an enabling technology for building “autonomic” systems (among other purposes).

A Case Study: Automatic Performance Tuning
• Performance tuning is critical for distributed high-performance applications on a massively parallel architecture
• Trace-based offline analysis may not be practical due to the volume of data
• The complexity of the machine architecture makes manual control difficult
• Automatic support for online tuning is essential for dealing with the problem efficiently

Elements of an Approach to On-the-Fly Performance Tuning
• Program/performance database: stores information about the program, its executions (inputs, performance data), and the system; it provides the central interface in the system
• Asynchronous parallel agents for various functions, including:
  - simplification: data reduction and filtering
  - invariant checking
  - assertion checking
  - local problem analysis
• Performance exception handler: performs a model-based analysis of performance problems communicated by the agents and decides on an appropriate action if necessary; such an action may trigger:
  - new method selection
  - re-instrumentation
  - re-compilation

Example: A Society of Agents for Performance Analysis and Feedback-Oriented Tuning
[Diagram: the compiler/instrumenter produces an instrumented application program; during its execution, data collection feeds simplification agents (data reduction and filtering), analysis agents, and invariant-checking agents, which interact with the program/performance database and the performance exception handler]

Cascade Programming Model
• Base model
  - unbounded thread parallelism in a flat shared memory
  - explicit synchronization for access to shared data
  - no consideration of temporal or spatial locality, no “processors”
• Extended model
  - explicit memory locales, data distribution, data/thread affinity
  - explicit cache management

Language Design Criteria
• Global name space
  - even in the context of a NUMA model
  - avoid the “local view” programming model of MPI, CoArray Fortran, and UPC
• Multiple models of parallelism
• Provide support for:
  - explicit parallel programming
  - locality-aware programming
  - interoperability with legacy codes (MPI, Coarray Fortran, UPC, etc.)
  - generic programming

Language Design Highlights
• The “concrete language” enhances HPF and ZPL
  - generalized HPF-type data distributions
  - domains are first-class objects: a central repository for an index space, its distribution, and an associated set of arrays
  - support for automatic partitioning of dynamic graph-based data structures
  - high-level control for communication (halos, …)?
• The “abstract language” supports generic programming (see the sketch below)
  - abstraction of types: type inference from context
  - abstraction of iteration: generalization of the CLU iterator
  - data structure inference: system-selected implementation for programmer-specified object categories
Note: The “concrete” and “abstract” languages do not in reality refer to different languages; the terms are only used here to structure this discussion.
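To illustrate the abstraction of types and of iteration mentioned above, here is a minimal, hypothetical sketch in present-day Chapel; the iter/yield and implicitly generic proc syntax is that of current Chapel, which differs in detail from the 2004 draft.

    iter evens(n: int): int {      // a CLU-style iterator: yields a stream of values
      for i in 0..n by 2 do
        yield i;
    }

    proc double(x) {               // generic procedure: the type of x is inferred at each call site
      return 2 * x;
    }

    for e in evens(10) do
      writeln(double(e));          // prints 0, 4, 8, ..., 20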

Basics
• A modern base language
  - strongly typed
  - Fortran-like array features
  - object-oriented
  - module structure for name space management
  - optional automatic storage management
• High-performance features
  - abstractions for parallelism: data parallelism (domains, forall) and task parallelism (cobegin/coend) — see the sketch below
  - locality management via data distributions and affinity
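A minimal, hypothetical sketch of the two parallelism abstractions listed above (data parallelism over a domain, task parallelism with cobegin), written in present-day Chapel syntax rather than the 2004 draft notation:

    config const n = 4;
    const D = {1..n, 1..n};        // a two-dimensional domain: a first-class index set
    var A: [D] real;               // an array declared over the domain

    forall (i, j) in D do          // data parallelism across the domain's indices
      A[i, j] = i + j;

    cobegin {                      // task parallelism: the two statements run concurrently
      writeln("sum of A = ", + reduce A);
      writeln("max of A = ", max reduce A);
    }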

Type Tree
[Diagram: types are primitive, domain, collection, or function types; collections are homogeneous (sequence, set, array) or heterogeneous (class, record); function types comprise iterators and other functions]

Locality Management
• Low-level tools direct object and thread placement at creation time
• Natural extensions to existing partitioned address space languages
• Evolve a higher-level concept from HPF and ZPL:
  - introduce domains as first-class program entities
  - sets of names (indices) for data
  - centralization of distribution, layout, and iteration ordering

Domains
• index sets: Cartesian products, sparse, opaque
• locale view: a logical view of a set of locales
• distribution: a mapping of an index set to a locale view
• array: a map from an index set to a collection of variables

Example: Matrix-Vector Multiplication V1 (Separation of Concerns: Definition)

    var Mat:    domain(2) = [1..m, 1..n];
    var MatCol: domain(1) = Mat(2);
    var MatRow: domain(1) = Mat(1);

    var A: array [Mat]    of float;
    var v: array [MatCol] of float;
    var s: array [MatRow] of float;

    s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);

Example: Matrix-Vector Multiplication V2

    var L: array[1..p1,1..p2] of locale;

    var Mat:    domain(2) dist(block,block) to L = [1..m,1..n];
    var MatCol: domain(1) align(*,Mat(2)) = Mat(2);
    var MatRow: domain(1) align(Mat(1),*) = Mat(1);

    var A: array [Mat]    of float;
    var v: array [MatCol] of float;
    var s: array [MatRow] of float;

    s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);
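As with the HPF example earlier, centralizing the distribution in the domain declaration means that redistributing the data requires changing only that one line, while the algorithm statement stays untouched. A hypothetical variant in the same draft notation (the distribution name cyclic is assumed for illustration and does not appear on these slides):

    var Mat: domain(2) dist(cyclic,block) to L = [1..m,1..n];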

Sparse Matrix Distribution
[Diagram: a sparse matrix distribution, showing the D, C, and R components partitioned across locales]

Example: Matrix-Vector Multiplication V3

    var L: array[1..p1,1..p2] of locale;

    var Mat:    domain(sparse2) dist(mysdist) to L layout(myslay)
                = [1..m,1..n] where enumeratenonzeroes();
    var MatCol: domain(1) align(*,Mat(2)) = Mat(2);
    var MatRow: domain(1) align(Mat(1),*) = Mat(1);

    var A: array [Mat]    of float;
    var v: array [MatCol] of float;
    var s: array [MatRow] of float;

    s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);

Language Summary
• Global name space
• High-level control features supporting explicit parallelism
• High-level locality management
• High-level support for collections
• Static typing
• Support for generic programming

Conclusion
• Today's programming languages, models, and tools are not adequate to deal with the challenges of 2010 architectures and application requirements
• Peta-scale architectures will pose additional problems, but may also provide better structural support for high-level languages and compilers
• A focused and persistent research and development effort, along a number of different paths, will be needed to create viable language systems and tool support for economically feasible and robust high-productivity computing in the future