Towards Future Programming Languages and Models for High-Productivity Computing
Hans P. Zima
University of Vienna, Austria, and JPL, California Institute of Technology, Pasadena, CA
IFIP WG 10.3 Seminar, September 7, 2004
Contents
1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
5 Conclusion
Abstraction in Programming
- Programming models and languages bridge the gap between "reality" and hardware, at different levels of abstraction, e.g.:
  - assembly languages
  - general-purpose procedural languages
  - functional languages
  - very high-level domain-specific languages
- Abstraction implies loss of information: a gain in simplicity, clarity, verifiability, and portability versus potential performance degradation
The Emergence of High-Level Sequential Languages
- The designers of the very first high-level programming language were aware that their success depended on acceptable performance of the generated target programs:
  John Backus (1957): "… It was our belief that if FORTRAN … were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger …"
- High-level algorithmic languages became generally accepted standards for sequential programming since their advantages outweighed any performance drawbacks
- For parallel programming, no similar development took place
The Crisis of High Performance Computing
- Current HPC hardware: large clusters at reasonable cost
  - commodity clusters or custom MPPs
  - off-the-shelf processors and memory components, built for the mass market
  - latency and bandwidth problems
- Current HPC software:
  - application efficiency sometimes in the single digits
  - low-productivity programming models dominate
  - focus on programming-in-the-small
  - inadequate programming environments and tools
- Resemblance to the pre-Fortran era of sequential programming
  - assembly language programming vs. MPI
  - wide gap between the domain of the scientist and the programming language
Domination of Low-Level Paradigms
- Exploiting performance on current parallel architectures requires low-level control and "heroic" programmers
- This has led to "local view" programming models, as exemplified by MPI-based programming (as well as Co-Array Fortran etc.):
  - explicit processors: local views of data
  - program state associated with memory regions
  - explicit communication intertwined with the algorithm
The High Performance Fortran (HPF) Effort
- The 1990s saw a strong (mainly, but not exclusively, academic) effort to improve this situation
- Focus on higher-level paradigms:
  - CM Fortran
  - Fortran D
  - Vienna Fortran
  - High Performance Fortran (HPF)
  - and many others …
- A wealth of research on related compiler technology
- These languages have not been broadly accepted, for a number of reasons …
Example: Parallel Jacobi with MPI
Assuming a block-column distribution (M local columns per process):

Local computation:

    do while (.not. converged)
      do J = 1, M
        do I = 1, N
          B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
        end do
      end do
      A(1:N,1:M) = B(1:N,1:M)
      …
    end do

Communication (after initializing MPI):

    if (MOD(myrank,2) .eq. 1) then
      call MPI_SEND(B(1,1),N,…,myrank-1,..)
      call MPI_RECV(A(1,0),N,…,myrank-1,..)
      if (myrank .lt. s-1) then
        call MPI_SEND(B(1,M),N,…,myrank+1,..)
        call MPI_RECV(A(1,M+1),N,…,myrank+1,..)
      endif
    else
      …
    endif
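For reference, a minimal self-contained version of the same block-column Jacobi exchange is sketched below. It is an illustration, not the code behind the slide: the convergence test is replaced by a fixed iteration count, N is assumed to be divisible by the number of processes, and MPI_Sendrecv is used instead of the parity-ordered send/receive pairs above.

    ! Illustrative sketch only: block-column Jacobi sweep with halo exchange.
    ! Assumes N divisible by the number of processes; convergence test and
    ! boundary conditions omitted for brevity.
    program jacobi_mpi
      use mpi
      implicit none
      integer, parameter :: N = 1024, MAXITER = 100
      integer :: ierr, myrank, nprocs, M, left, right, iter, i, j
      real, allocatable :: A(:,:), B(:,:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      M = N / nprocs                                ! local number of columns
      allocate(A(0:N+1, 0:M+1), B(1:N, 1:M))
      A = 0.0

      left  = myrank - 1; if (left  <  0)      left  = MPI_PROC_NULL
      right = myrank + 1; if (right >= nprocs) right = MPI_PROC_NULL

      do iter = 1, MAXITER
         ! exchange halo columns with the left and right neighbors
         call MPI_Sendrecv(A(1,1),   N, MPI_REAL, left,  0,        &
                           A(1,M+1), N, MPI_REAL, right, 0,        &
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
         call MPI_Sendrecv(A(1,M),   N, MPI_REAL, right, 1,        &
                           A(1,0),   N, MPI_REAL, left,  1,        &
                           MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

         ! local relaxation sweep over the owned columns
         do j = 1, M
            do i = 1, N
               B(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
            end do
         end do
         A(1:N, 1:M) = B
      end do

      call MPI_Finalize(ierr)
    end program jacobi_mpi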
Example: Parallel Jacobi with HPF
Block-column distribution:

    !HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS())
    !HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: A, B

Global computation:

    do while (.not. converged)
      do J = 1, N
        do I = 1, N
          B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
        end do
      end do
      A(1:N,1:N) = B(1:N,1:N)
    end do

- Communication is automatically generated by the compiler
- A change of distribution requires modifying just one line of the source program; the algorithm itself is unchanged
Example: Observations
- The HPF code is far simpler than the MPI code
- Compiler-generated object code is as good as the MPI code:
  - parallelization of the loop
  - static analysis of access patterns and communication
  - aggregation and fusion of communication
- The data distribution directives could even be generated automatically from the sequential code
- So where is the problem with HPF?
HPF Problems
- HPF-1 lacked important functionality
  - data distributions do not support irregular algorithms
  - lack of flexibility for processor mapping and data/thread affinity
  - focus on support for array types and SPMD data parallelism
- HPF includes features that are difficult to implement
- Current architectures do not support the dynamic and irregular features of advanced applications well
However:
- HPF+/JA: an HPF plasma code reached 12.5 TFLOPS on the Earth Simulator
  - explicit high-level formulation of communication patterns
  - explicit high-level control of communication schedules
  - explicit high-level control of "halos"
The Effect of Schedule Reuse and Halo Management
[Performance chart: FEM crash simulation kernel, using the HPF/Earth Simulator language extensions: schedule reuse, halo control, PUREST procedures]
Contents
1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
5 Conclusion
State-of-the-Art
- Current parallel programming language, compiler, and tool technologies are unable to support high-productivity computing
- New programming model, language, compiler, and tool technologies are necessary to address the productivity demands of future systems
Global Goal
- Make scientists and engineers more productive: provide a higher level of abstraction
- Support "abstraction without guilt" [Ken Kennedy]: increase programming language usability without sacrificing performance
Productivity
- The major goal must be productivity:
  - performance
  - user productivity (time to solution)
  - portability
  - robustness
- This requires significant progress in many fields:
  - parallel programming languages
  - compiler technology
  - runtime systems
  - intelligent programming environments
- … and better support from hardware architectures
Challenges of Peta-Scale Architectures
- Large-scale parallelism
  - hundreds of thousands of processors
  - component failures will occur at relatively short intervals
- Non-uniform data access
  - deep memory hierarchies
  - severe differences in latency: accessing data in local memory already costs many cycles; remote memory latency is larger still
  - severe differences in bandwidth
Application Challenges
- Long-lived applications, surviving many generations of hardware
- Applications are becoming larger and more complex:
  - multi-disciplinary, multi-language, multi-paradigm
- Legacy codes pose a big problem
  - from F77 to F90, C/C++, MPI, Co-Array Fortran, etc.
  - intellectual content must be preserved
  - automatic rewriting under the constraint of performance portability is a difficult research problem
- Many advanced applications are dynamic, irregular, and adaptive
Opportunities for Peta-Scale Architectures
- High-performance networks and multithreading can contribute to tolerating memory latency and improving memory bandwidth
- Hardware support for locality-aware programming can avoid serious performance problems of current architectures
- Abandoning global cache coherence in favor of a more flexible scheme can eliminate a major source of bottlenecks
- Lightweight processors (possibly in the millions) can be used both as a computational fabric and as a service layer:
  - exploitation of spatial locality
  - lightweight synchronization
  - an introspection infrastructure for fault tolerance, performance tuning, and intrusion detection
Programming Models for High Productivity Computing
[Diagram: a programming model provides a conceptual view of data and control, characterized by its semantics and a productivity model; it is realized through programming languages, languages plus directives, libraries, or command-line interfaces, which map onto an execution model (an abstract machine and its realizations)]
Programming Model Issues
Programming models and their realizations can be characterized along (at least) three dimensions: semantics, user productivity, and performance.
- Semantics: a mapping from programs to functions specifying the input/output behavior of the program: S: P → F, where each f in F is a function f: I → O
- User productivity: a mapping from programs to a characterization of structural complexity: U: P → N
- Performance: a mapping from programs to functions specifying the complexity of the program in terms of its execution on a real or abstract target machine: C: P → G, where each g in G is a function g: I → N*
Contents
1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
  1 General-Purpose Languages
  2 Domain-Specific Languages
  3 Programming Environments and Tools
4 Cascade and the Chapel Language Design
5 Conclusion
Key Issues in General-Purpose Languages
Critical functionality:
- global address space
- high-level features for explicit concurrency
- high-level features for locality control
- high-level support for distributed collections
- high-level support for programming-in-the-large
Orthogonal language issues:
- safety
- object orientation
- performance transparency
- generic programming
- extensibility
Locality Awareness: Distribution, Alignment, Affinity
[Diagram: work is aligned with data, data are aligned with each other, and data are distributed onto the processor/memory (PM) units of an HPCS architecture]
Requirements for Data Distribution Features
- Data distributions must be able to reflect the partitioning of physical domains: exploiting locality in physical space
- Data distributions must be able to reflect dynamic reshaping of data domains (e.g., Gaussian elimination): a load-balancing issue
- Affinity, the co-location of threads and the data they operate upon, is an important performance issue
- Language features for expressing user-defined distribution and alignment are needed (a baseline sketch in standard HPF directives follows)
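As a baseline, standard HPF already provides directives for user-defined distribution and alignment; the sketch below uses standard HPF 1.x syntax (the processor array P and the arrays A, B, W are chosen here purely for illustration). The requirements listed above (irregular distributions, dynamic reshaping, and explicit affinity) go beyond what these directives can express.

    program distrib_example
      real A(1000,1000), B(1000,1000), W(1000)
    ! Declare a logical 4x4 processor grid
    !HPF$ PROCESSORS P(4,4)
    ! Partition A into contiguous blocks over the grid
    !HPF$ DISTRIBUTE A(BLOCK,BLOCK) ONTO P
    ! Co-locate B element-wise with A
    !HPF$ ALIGN B(I,J) WITH A(I,J)
    ! Align W with the rows of A, replicated across the columns
    !HPF$ ALIGN W(I) WITH A(I,*)
      A = 0.0
      B = 0.0
      W = 1.0
    end program distrib_example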
Domain-Specific Languages (DSLs)
- DSLs provide high-level abstractions tailored to a specific domain
- DSLs separate domain expertise from parallelization knowledge
- DSLs allow a domain expert to work closer to the scientific process (removed from resource management), a potentially easier route to user productivity than general-purpose languages
- Potential problems include fragility with respect to changes in the application domain, and target code performance
- DSLs can be built on top of general-purpose languages, providing a transition path for applications
Exploiting Domain-Specific Information
- Building domain-specific skeletons for applications (with plug-ins for user-defined functions)
- Using domain knowledge to extend and improve standard compiler analysis
- Developing domain-specific libraries
- Detecting and optimizing domain-specific patterns in serial legacy codes
- Using domain knowledge for verification, validation, and fault tolerance
- Telescoping: automatic generation of optimizing compilers for domain-specific languages defined by libraries
Legacy Code Migration
- Rewriting legacy codes
  - preservation of intellectual content
  - opportunity for exploiting new hardware, including new algorithms
  - code size may make a rewrite impractical
- Language, compiler, tool, and runtime support
  - (semi-)automatic tools for migrating code
  - incremental porting
  - transition of performance-critical sections requires highly sophisticated software for automatic adaptation:
    - high-level analysis
    - pattern matching and concept comprehension
    - optimization and specialization
Issues in Programming Environments and Tools
- Reliability challenges
  - massive parallelism poses new problems
  - fault prognostics, detection, recovery
  - data distribution may cause vital data to be spread across all nodes
- (Semi-)automatic tuning
  - closed-loop adaptive control: measurement, decision making, actuation
  - information exposure: users, compilers, runtime systems
  - learning from experience: databases, data mining, reasoning systems
- Introspection
  - a technology supporting validation, fault detection, and performance tuning
Example: Offline Performance Tuning
[Diagram: a tuning cycle in which a parallelizing compiler translates the source program into a target program, the target program is executed on the target machine, and an expert advisor, drawing on a program/execution knowledge base, drives a transformation system that feeds changes back to the source until the end of the tuning cycle]
Contents
1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
  1 The DARPA-Sponsored HPCS Program
  2 An Overview of the Cascade Architecture
  3 Design Aspects of the Chapel Programming Language
5 Conclusion
HPCS Program Focus Areas
High Productivity Computing Systems
- Goal: provide a new generation of economically viable high-productivity computing systems for the national security and industrial user community (2007 – 2010)
- Impact:
  - Performance (efficiency): speed up critical national security applications by a factor of 10X to 40X
  - Productivity (time to solution)
  - Portability (transparency): insulate research and operational application software from the system
  - Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
- Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology
- Fill the critical technology and capability gap between today (late-1980s HPC technology) and the future (quantum/bio computing)
The Cascade Project
- One-year concept study, July 2002 – June 2003
- Three-year prototyping phase, July 2003 – June 2006
- Led by Cray Inc. (Burton Smith)
- Partners:
  - Caltech/JPL
  - University of Notre Dame
  - Stanford University
Cascade Architecture: Key Elements
- Hierarchical architecture: two levels of processing elements
- Lightweight processors in "smart memory"
- Shared address space
- Program-controlled selection of UMA or NUMA data access
- Hybrid UMA/NUMA programming paradigm
Using Bandwidth Wisely
- Tolerate latency with processor concurrency
  - vectors provide concurrency within a thread
  - multithreading provides concurrency between threads
- Exploit locality to reduce bandwidth demand
  - "heavyweight" processors (HWPs) exploit temporal locality
  - "lightweight" processors (LWPs) exploit spatial locality
- Use other techniques to reduce network traffic
  - atomic memory operations
  - efficient remote thread creation
A Simplified Global View of the Cascade Architecture
[Diagram: a set of locales, each consisting of a heavyweight processor (HWP) with a software-controlled cache and smart memory, connected by a global interconnection network]
A Cascade Locale
[Diagram: a locale combines a heavyweight processor (vector, multithreaded, streaming, with a compiler-assisted cache) and multiple lightweight processor chips, each containing DRAM, a multithreaded cache, and several multithreaded LWPs; the locale interconnect attaches to a network router leading to other locales]
Source: David Callahan, Cray Inc.
Lightweight Processors
- Multiple processors per die
- Heavily multithreaded; local memory available to hold thread state
- Large on-chip DRAM cache
- Specialized instruction set
  - fork, quit, spawn
  - MTA-style full/empty (F/E) bits for synchronization
- Optimized to start, stop, and move threads
Source: David Callahan, Cray Inc.
Lightweight Processors and Threads
- Lightweight processors co-located with memory
  - focus on availability
  - full exploitation is not a primary system goal
- Lightweight threads
  - minimal state, high-rate context switch
  - spawned by sending a parcel to memory
- Exploiting spatial locality
  - fine-grain: reductions, prefix operations, search
  - coarse-grain: data distribution and alignment
- Saving bandwidth by migrating threads to data
Uses of Lightweight Threads
- Fine-grain application parallelism
- Implementation of a service layer
- Components of agent systems that asynchronously monitor the computation, performing introspection and dealing with:
  - dynamic program validation
  - validation of user directives
  - fault tolerance
  - intrusion prevention and detection
  - performance analysis and tuning
  - support for feedback-oriented compilation
Introspection
Introspection can be defined as a system's ability to:
- explore its own properties
- reason about its internal state
- make decisions about appropriate state changes where necessary
It is an enabling technology for building "autonomic" systems (among other purposes).
A Case Study: Automatic Performance Tuning
- Performance tuning is critical for distributed high-performance applications on a massively parallel architecture
- Trace-based offline analysis may not be practical due to the volume of data
- The complexity of the machine architecture makes manual control difficult
- Automatic support for online tuning is essential for dealing with the problem efficiently
Elements of an Approach to On-the-Fly Performance Tuning
- Program/performance database: stores information about the program, its executions (inputs, performance data), and the system; it provides the central interface of the system
- Asynchronous parallel agents for various functions, including:
  - simplification: data reduction and filtering
  - invariant checking
  - assertion checking
  - local problem analysis
- Performance exception handler: performs a model-based analysis of performance problems communicated by the agents and decides on an appropriate action if necessary; such an action may trigger:
  - selection of a new method
  - re-instrumentation
  - re-compilation
Example: A Society of Agents for Performance Analysis and Feedback-Oriented Tuning
[Diagram: the compiler/instrumenter produces an instrumented application program; during execution, data collection feeds the program/performance database; simplification agents (data reduction and filtering), invariant-checking agents, and analysis agents operate on the database and report to the performance exception handler, which feeds results back to the compiler]
Cascade Programming Model
- Base model
  - unbounded thread parallelism in a flat shared memory
  - explicit synchronization for access to shared data
  - no consideration of temporal or spatial locality; no "processors"
- Extended model
  - explicit memory locales, data distribution, data/thread affinity
  - explicit cache management
Language Design Criteria
- Global name space, even in the context of a NUMA model
  - avoid the "local view" programming model of MPI, Co-Array Fortran, UPC
- Multiple models of parallelism
- Provide support for:
  - explicit parallel programming
  - locality-aware programming
  - interoperability with legacy codes (MPI, Co-Array Fortran, UPC, etc.)
  - generic programming
Language Design Highlights
- The "concrete language" enhances HPF and ZPL
  - generalized HPF-type data distributions
  - domains are first-class objects: a central repository for an index space, its distribution, and the associated set of arrays
  - support for automatic partitioning of dynamic graph-based data structures
  - high-level control of communication (halos, …)?
- The "abstract language" supports generic programming
  - abstraction of types: type inference from context
  - abstraction of iteration: a generalization of the CLU iterator
  - data structure inference: system-selected implementation for programmer-specified object categories
Note: the "concrete" and "abstract" languages do not in reality refer to different languages; the terms are used here only to structure the discussion.
Basics
- A modern base language
  - strongly typed
  - Fortran-like array features
  - object-oriented
  - module structure for name-space management
  - optional automatic storage management
- High-performance features
  - abstractions for parallelism:
    - data parallelism (domains, forall)
    - task parallelism (cobegin/coend)
  - locality management via data distributions and affinity
Type Tree
- primitive
- domain
- collection
  - homogeneous collection: sequence, set, array
  - heterogeneous collection: class, record
- function
  - iterator
  - other function
Locality Management
- Low-level tools direct object and thread placement at creation time
- Natural extensions to existing partitioned-address-space languages
- Evolve a higher-level concept from HPF and ZPL: introduce domains as first-class program entities
  - sets of names (indices) for data
  - centralization of distribution, layout, and iteration ordering
Domains
- index sets: Cartesian products, sparse, opaque
- locale view: a logical view of a set of locales
- distribution: a mapping of an index set to a locale view
- array: a map from an index set to a collection of variables
Example: Matrix-Vector Multiplication, V1
Separation of concerns: definition of domains and arrays versus the computation.

    var Mat:    domain(2) = [1..m, 1..n];   // 2-D index domain of the matrix
    var MatCol: domain(1) = Mat(2);         // column index domain (1..n)
    var MatRow: domain(1) = Mat(1);         // row index domain (1..m)
    var A: array [Mat]    of float;         // the matrix
    var v: array [MatCol] of float;         // input vector
    var s: array [MatRow] of float;         // result vector

    s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);   // reduce products over the column dimension
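For comparison, the reduction expression in V1 computes an ordinary matrix-vector product; a small, self-contained Fortran 90 reference (with sizes and test data chosen here purely for illustration) looks like this:

    ! Sequential Fortran 90 reference for the matrix-vector product above
    program matvec_reference
      implicit none
      integer, parameter :: m = 4, n = 3    ! small sizes chosen for illustration
      real :: A(m,n), v(n), s(m)
      integer :: i
      A = 1.0; v = 2.0                      ! arbitrary test data
      do i = 1, m
         s(i) = sum(A(i,:) * v)             ! reduce products over the column dimension
      end do
      print *, s                            ! same result as s = matmul(A, v)
    end program matvec_reference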
Example: Matrix-Vector Multiplication, V2 (adding distribution and alignment)

    var L: array[1..p1,1..p2] of locale;                        // logical p1 x p2 locale grid
    var Mat: domain(2) dist(block,block) to L = [1..m,1..n];    // block-distributed matrix domain
    var MatCol: domain(1) align(*,Mat(2)) = Mat(2);             // columns aligned with Mat's 2nd dimension
    var MatRow: domain(1) align(Mat(1),*) = Mat(1);             // rows aligned with Mat's 1st dimension
    var A: array [Mat]    of float;
    var v: array [MatCol] of float;
    var s: array [MatRow] of float;

    s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);                       // computation is unchanged from V1
Sparse Matrix Distribution
[Diagram: the component arrays of a sparse matrix representation (labeled D, C, and R) partitioned into corresponding pieces and distributed across four locales]
Example: Matrix-Vector Multiplication, V3 (sparse matrix)

    var L: array[1..p1,1..p2] of locale;
    var Mat: domain(sparse2) dist(mysdist) to L layout(myslay)
             = [1..m,1..n] where enumeratenonzeroes();          // user-defined distribution and layout
    var MatCol: domain(1) align(*,Mat(2)) = Mat(2);
    var MatRow: domain(1) align(Mat(1),*) = Mat(1);
    var A: array [Mat]    of float;
    var v: array [MatCol] of float;
    var s: array [MatRow] of float;

    s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);                       // computation is still unchanged
Language Summary
- Global name space
- High-level control features supporting explicit parallelism
- High-level locality management
- High-level support for collections
- Static typing
- Support for generic programming
Conclusion
- Today's programming languages, models, and tools are not adequate to deal with the challenges of 2010 architectures and application requirements
- Peta-scale architectures will pose additional problems, but may also provide better structural support for high-level languages and compilers
- A focused and persistent research and development effort, along a number of different paths, will be needed to create viable language systems and tool support for economically feasible and robust high-productivity computing in the future