
Slide 1: Towards Future Programming Languages and Models for High-Productivity Computing
Hans P. Zima
University of Vienna, Austria, and JPL, California Institute of Technology, Pasadena, CA
IFIP WG 10.3 Seminar, September 7, 2004

Slide 2: Contents
1. Introduction
2. Programming Models for High Productivity Computing Systems (HPCS)
3. Issues in HPCS Languages and Paradigms
4. Cascade and the Chapel Language Design
5. Conclusion

Slide 3: Abstraction in Programming
Programming models and languages bridge the gap between “reality” and hardware, at different levels of abstraction, e.g.:
- assembly languages
- general-purpose procedural languages
- functional languages
- very high-level domain-specific languages
Abstraction implies loss of information: a gain in simplicity, clarity, verifiability, and portability versus potential performance degradation.

Slide 4: The Emergence of High-Level Sequential Languages
The designers of the very first high-level programming language were aware that their success depended on acceptable performance of the generated target programs.
John Backus (1957): “… It was our belief that if FORTRAN … were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger …”
High-level algorithmic languages became generally accepted standards for sequential programming since their advantages outweighed any performance drawbacks.
For parallel programming, no similar development took place.

Slide 5: The Crisis of High Performance Computing
- Current HPC hardware: large clusters at reasonable cost
  - commodity clusters or custom MPPs
  - off-the-shelf processors and memory components, built for the mass market
  - latency and bandwidth problems
- Current HPC software:
  - application efficiency sometimes in single digits
  - low-productivity programming models dominate
  - focus on programming in the small
  - inadequate programming environments and tools
- Resemblance to the pre-Fortran era in sequential programming:
  - assembly language programming vs. MPI
  - wide gap between the domain of the scientist and the programming language

Slide 6: Domination of Low-Level Paradigms
- Exploiting performance on current parallel architectures requires low-level control and “heroic” programmers
- This has led to “local view” programming models, as exemplified by MPI-based programming (as well as CoArray Fortran etc.):
  - explicit processors: local views of data
  - program state associated with memory regions
  - explicit communication intertwined with the algorithm

Slide 7: The High Performance Fortran (HPF) Effort
- The 1990s saw a strong (mainly, but not exclusively, academic) effort to improve this situation
- Focus on higher-level paradigms:
  - CM Fortran
  - Fortran D
  - Vienna Fortran
  - High Performance Fortran (HPF)
  - and many others…
- A wealth of research on related compiler technology
- These languages have not been broadly accepted, for a number of reasons…

Slide 8: Example: Parallel Jacobi with MPI
Assuming a block column distribution (M local columns of N elements per process):

  ! local computation
  do while (.not. converged)
     do J = 1, M
        do I = 1, N
           B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
        end do
     end do
     A(1:N,1:N) = B(1:N,1:N)
     ...
  end do

  ! initialize MPI
  ...
  ! exchange boundary columns with the neighboring processes
  if (MOD(myrank,2) .eq. 1) then
     call MPI_SEND(B(1,1), N, …, myrank-1, ..)
     call MPI_RECV(A(1,0), N, …, myrank-1, ..)
     if (myrank .lt. s-1) then
        call MPI_SEND(B(1,M), N, …, myrank+1, ..)
        call MPI_RECV(A(1,M+1), N, …, myrank+1, ..)
     endif
  else
     …
  endif
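Below is a hedged sketch, not taken from the slide, of how the abbreviated exchange might be completed. Using MPI_SENDRECV together with MPI_PROC_NULL at the outer boundaries removes the need for the even/odd rank ordering shown above. The subroutine name exchange_halo, the double-precision element type, the single message tag, and the exact array shapes are illustrative assumptions.

  subroutine exchange_halo(A, B, N, M)
    use mpi
    integer, intent(in)             :: N, M
    double precision, intent(inout) :: A(N, 0:M+1)   ! local columns plus halo columns 0 and M+1
    double precision, intent(in)    :: B(N, M)       ! local columns only
    integer :: myrank, s, left, right, ierr
    integer :: status(MPI_STATUS_SIZE)

    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, s, ierr)
    left  = merge(MPI_PROC_NULL, myrank-1, myrank == 0)    ! no neighbor at the left edge
    right = merge(MPI_PROC_NULL, myrank+1, myrank == s-1)  ! no neighbor at the right edge

    ! shift left: send the first local column to the left neighbor,
    ! receive the right neighbor's first column into the right halo column
    call MPI_SENDRECV(B(1,1),   N, MPI_DOUBLE_PRECISION, left,  0, &
                      A(1,M+1), N, MPI_DOUBLE_PRECISION, right, 0, &
                      MPI_COMM_WORLD, status, ierr)

    ! shift right: send the last local column to the right neighbor,
    ! receive the left neighbor's last column into the left halo column
    call MPI_SENDRECV(B(1,M), N, MPI_DOUBLE_PRECISION, right, 0, &
                      A(1,0), N, MPI_DOUBLE_PRECISION, left,  0, &
                      MPI_COMM_WORLD, status, ierr)
  end subroutine exchange_halo

Every process issues the same two calls, so the exchange stays symmetric and deadlock-free for any number of processes; it is one way to package the bookkeeping the slide elides, and it still has to be written by hand, which is the point the slide is making.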

Slide 9: Example: Parallel Jacobi with HPF
Block column distribution:

  processors P(NUMBER_OF_PROCESSORS)
  distribute(*,BLOCK) onto P :: A, B

  ! global computation
  do while (.not. converged)
     do J = 1, N
        do I = 1, N
           B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
        end do
     end do
     A(1:N,1:N) = B(1:N,1:N)
  end do

- Communication is automatically generated by the compiler
- A change of distribution requires modification of just one line in the source program; no change of the algorithm
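For reference, a hedged, self-contained spelling of the same kernel with the standard HPF directive sentinels. The array bounds (N = 1024), the REAL element type, the initialization, and the placeholder convergence test are illustrative assumptions, not taken from the slide. The one-line change mentioned above amounts to editing the DISTRIBUTE directive, e.g. replacing (*,BLOCK) with (BLOCK,BLOCK).

  program jacobi_hpf
    integer, parameter :: N = 1024
    real    :: A(0:N+1, 0:N+1), B(N, N)
    logical :: converged
    integer :: I, J
  !HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS())
  !HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: A, B
    A = 0.0
    converged = .false.
    do while (.not. converged)
       do J = 1, N
          do I = 1, N
             B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
          end do
       end do
       A(1:N,1:N) = B(1:N,1:N)
       converged = .true.   ! the slide elides the convergence test; stop after one sweep here
    end do
  end program jacobi_hpf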

Slide 10: Example: Observations
- The HPF code is far simpler than the MPI code
- The compiler-generated object code is as good as the MPI code:
  - parallelization of the loop
  - static analysis of access patterns and communication
  - aggregation and fusion of communication
- The data distribution directives could even be generated automatically from the sequential code
So where is the problem with HPF?

Slide 11: HPF Problems
- HPF-1 lacked important functionality:
  - data distributions do not support irregular algorithms
  - lack of flexibility for processor mapping and data/thread affinity
  - focus on support for array types and SPMD data parallelism
- HPF includes features that are difficult to implement
- Current architectures do not support well the dynamic and irregular features of advanced applications
However:
- An HPF+/JA plasma code reached 12.5 TFLOPS on the Earth Simulator:
  - explicit high-level formulation of communication patterns
  - explicit high-level control of communication schedules
  - explicit high-level control of “halos”

Slide 12: The Effect of Schedule Reuse and Halo Management
[Performance chart: an FEM crash simulation kernel on 2, 4, 8, 16, 32, 64, and 128 processors, using the HPF/Earth Simulator language extensions (schedule reuse, halo control, purest procedures).]

Slide 13: Contents
1. Introduction
2. Programming Models for High Productivity Computing Systems (HPCS)
3. Issues in HPCS Languages and Paradigms
4. Cascade and the Chapel Language Design
5. Conclusion

Slide 14: State-of-the-Art
- Current parallel programming language, compiler, and tool technologies are unable to support high-productivity computing
- New programming models, languages, compiler, and tool technologies are necessary to address the productivity demands of future systems

Slide 15: Global Goal
- Make scientists and engineers more productive: provide a higher level of abstraction
- Support “Abstraction without Guilt” [Ken Kennedy]: increase programming language usability without sacrificing performance

Slide 16: Productivity
- The major goal must be productivity:
  - performance
  - user productivity (time to solution)
  - portability
  - robustness
- This requires significant progress in many fields:
  - parallel programming languages
  - compiler technology
  - runtime systems
  - intelligent programming environments
- …and better support from hardware architectures

Slide 17: Challenges of Peta-Scale Architectures
- Large-scale parallelism:
  - hundreds of thousands of processors
  - component failures will occur in relatively short intervals
- Non-uniform data access:
  - deep memory hierarchies
  - severe differences in latency: 1000+ cycles for accessing data from memory; larger latency for remote memory
  - severe differences in bandwidths

Slide 18: Application Challenges
- Long-lived applications, surviving many generations of hardware
- Applications are becoming larger and more complex:
  - multi-disciplinary
  - multi-language
  - multi-paradigm
- Legacy codes pose a big problem:
  - from F77 to F90, C/C++, MPI, Coarray Fortran, etc.
  - intellectual content must be preserved
  - automatic re-write under the constraint of performance portability is a difficult research problem
- Many advanced applications are dynamic, irregular, and adaptive

Slide 19: Opportunities for Peta-Scale Architectures
- High-performance networks and multithreading can contribute to tolerating memory latency and improving memory bandwidth
- Hardware support for locality-aware programming can avoid serious performance problems in current architectures
- Abandoning global cache coherence in favor of a more flexible scheme can eliminate a major source of bottlenecks
- Lightweight processors (possibly in the millions) can be used both as a computational fabric and a service layer:
  - exploitation of spatial locality
  - lightweight synchronization
  - an introspection infrastructure for fault tolerance, performance tuning, and intrusion detection

Slide 20: Programming Models for High Productivity Computing
[Diagram: a programming model, a conceptual view of data and control, is paired with an execution model (abstract machine); its realizations include a programming language, a programming language plus directives, a library, and a command-line interface, each characterized by its semantics (S) and productivity (P).]

Slide 21: Programming Model Issues
- Programming models and their realizations can be characterized along (at least) three dimensions: semantics, user productivity, and performance
- Semantics: a mapping from programs to functions specifying the input/output behavior of the program: S: P → F, where each f in F is a function f: I → O
- User productivity: a mapping from programs to a characterization of structural complexity: U: P → N
- Performance: a mapping from programs to functions specifying the complexity of the program in terms of its execution on a real or abstract target machine: C: P → G, where each g in G is a function g: I → N*
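The same three mappings in display form, using only the sets named on the slide (P for programs, I for inputs, O for outputs) and writing N for the natural numbers:

  \[
    S : P \to F \quad (\text{each } f \in F \text{ is a function } f : I \to O), \qquad
    U : P \to \mathbb{N}, \qquad
    C : P \to G \quad (\text{each } g \in G \text{ is a function } g : I \to \mathbb{N}^{*}).
  \]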

Slide 22: Contents
1. Introduction
2. Programming Models for High Productivity Computing Systems (HPCS)
3. Issues in HPCS Languages and Paradigms
   3.1 General-Purpose Languages
   3.2 Domain-Specific Languages
   3.3 Programming Environments and Tools
4. Cascade and the Chapel Language Design
5. Conclusion

Slide 23: Key Issues in General-Purpose Languages
Critical functionality:
- global address space
- high-level features for explicit concurrency
- high-level features for locality control
- high-level support for distributed collections
- high-level support for programming-in-the-large
Orthogonal language issues:
- safety
- object orientation
- performance transparency
- generic programming
- extensibility

Slide 24: Locality Awareness: Distribution, Alignment, Affinity
[Diagram: work is aligned with data, data are aligned with each other, and both are distributed onto the processor/memory (PM) units of an HPCS architecture.]

Slide 25: Requirements for Data Distribution Features
- Data distributions must be able to reflect the partitioning of physical domains: exploiting locality in physical space
- Data distributions must be able to reflect dynamic reshaping of data domains (e.g., Gaussian elimination): a load-balancing issue
- Affinity --- the co-location of threads and the data they are operating upon --- is an important performance issue
- Language features for expressing user-defined distribution and alignment

Slide 26: Domain-Specific Languages (DSLs)
- DSLs provide high-level abstractions tailored to a specific domain
- DSLs separate domain expertise from parallelization knowledge
- DSLs allow a domain expert to work closer to the scientific process (removed from resource management): a potentially easier route to user productivity than general-purpose languages
- Potential problems include fragility with respect to changes in the application domain, and target code performance
- DSLs can be built on top of general-purpose languages, providing a transition path for applications

Slide 27: Exploiting Domain-Specific Information
- Building domain-specific skeletons for applications (with plug-ins for user-defined functions)
- Using domain knowledge for extending and improving standard compiler analysis
- Developing domain-specific libraries
- Detecting and optimizing domain-specific patterns in serial legacy codes
- Using domain knowledge for verification, validation, and fault tolerance
- Telescoping: automatic generation of optimizing compilers for domain-specific languages defined by libraries

Slide 28: Legacy Code Migration
- Rewriting legacy codes:
  - preservation of intellectual content
  - opportunity for exploiting new hardware, including new algorithms
  - code size may preclude the practicality of a rewrite
- Language, compiler, tool, and runtime support
- (Semi-)automatic tools for migrating code:
  - incremental porting
  - transition of performance-critical sections requires highly sophisticated software for automatic adaptation:
    - high-level analysis
    - pattern matching and concept comprehension
    - optimization and specialization

Slide 29: Issues in Programming Environments and Tools
- Reliability challenges:
  - massive parallelism poses new problems
  - fault prognostics, detection, recovery
  - data distribution may cause vital data to be spread across all nodes
- (Semi-)automatic tuning:
  - closed-loop adaptive control: measurement, decision-making, actuation
  - information exposure: users, compilers, runtime systems
  - learning from experience: databases, data mining, reasoning systems
- Introspection:
  - a technology for the support of validation, fault detection, and performance tuning

Slide 30: Example: Offline Performance Tuning
[Diagram: a parallelizing compiler and transformation system translate the source program into a target program, which is executed on the target machine; an expert advisor, backed by a program/execution knowledge base, feeds observations back into further transformations until the end of the tuning cycle.]

Slide 31: Contents
1. Introduction
2. Programming Models for High Productivity Computing Systems (HPCS)
3. Issues in HPCS Languages and Paradigms
4. Cascade and the Chapel Language Design
   4.1 The DARPA-Sponsored HPCS Program
   4.2 An Overview of the Cascade Architecture
   4.3 Design Aspects of the Chapel Programming Language
5. Conclusion

Slide 32: HPCS Program Focus Areas (High Productivity Computing Systems)
Goal: provide a new generation of economically viable high-productivity computing systems for the national security and industrial user community (2007–2010).
Impact:
- Performance (efficiency): speed up critical national security applications by a factor of 10X to 40X
- Productivity (time-to-solution)
- Portability (transparency): insulate research and operational application software from the system
- Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
Fill the critical technology and capability gap between today (late 80’s HPC technology) and the future (quantum/bio computing).

Slide 33: The Cascade Project
- One-year concept study, July 2002 – June 2003
- Three-year prototyping phase, July 2003 – June 2006
- Led by Cray Inc. (Burton Smith)
- Partners:
  - Caltech/JPL
  - University of Notre Dame
  - Stanford University

Slide 34: Cascade Architecture: Key Elements
- Hierarchical architecture: two levels of processing elements
- Lightweight processors in “smart memory”
- Shared address space
- Program-controlled selection of UMA or NUMA data access
- Hybrid UMA/NUMA programming paradigm

Slide 35: Using Bandwidth Wisely
- Tolerate latency with processor concurrency:
  - vectors provide concurrency within a thread
  - multithreading provides concurrency between threads
- Exploit locality to reduce bandwidth demand:
  - “heavyweight” processors (HWPs) to exploit temporal locality
  - “lightweight” processors (LWPs) to exploit spatial locality
- Use other techniques to reduce network traffic:
  - atomic memory operations
  - efficient remote thread creation

Slide 36: A Simplified Global View of the Cascade Architecture
[Diagram: a set of locales connected by a global interconnection network; each locale contains a heavyweight processor (HWP) with a software-controlled cache and smart memory.]

Slide 37: A Cascade Locale (source: David Callahan, Cray Inc.)
[Diagram: a locale couples a heavyweight processor (vector, multithreaded, streaming, with a compiler-assisted cache) to several smart-memory modules, each consisting of DRAM and a multithreaded lightweight processor chip with multiple lightweight processors (LWPs) and a cache; the locale interconnect and a router connect the locale to other locales.]

Slide 38: Lightweight Processors (source: David Callahan, Cray Inc.)
- Multiple processors per die
- Heavily multithreaded: local memory available to hold thread state
- Large on-chip DRAM cache
- Specialized instruction set, optimized to start, stop, and move threads:
  - fork, quit
  - spawn
  - MTA-style F/E bits for synchronization

Slide 39: Lightweight Processors and Threads
- Lightweight processors:
  - co-located with memory
  - focus on availability
  - full exploitation is not a primary system goal
- Lightweight threads:
  - minimal state – high-rate context switch
  - spawned by sending a parcel to memory
- Exploiting spatial locality:
  - fine-grain: reductions, prefix operations, search
  - coarse-grain: data distribution and alignment
- Saving bandwidth by migrating threads to data

Slide 40: Uses of Lightweight Threads
- Fine-grain application parallelism
- Implementation of a service layer
- Components of agent systems that asynchronously monitor the computation, performing introspection and dealing with:
  - dynamic program validation
  - validation of user directives
  - fault tolerance
  - intrusion prevention and detection
  - performance analysis and tuning
  - support of feedback-oriented compilation

Slide 41: Introspection
Introspection can be defined as a system’s ability to:
- explore its own properties
- reason about its internal state
- make decisions about appropriate state changes where necessary
It is an enabling technology for building “autonomic” systems (among other purposes).

Slide 42: A Case Study: Automatic Performance Tuning
- Performance tuning is critical for distributed high-performance applications on a massively parallel architecture
- Trace-based offline analysis may not be practical due to the volume of data
- The complexity of the machine architecture makes manual control difficult
- Automatic support for online tuning is essential for dealing with the problem efficiently

Slide 43: Elements of an Approach to On-the-Fly Performance Tuning
- Program/Performance Data Base: stores information about the program, its executions (inputs, performance data), and the system; it provides the central interface in the system
- Asynchronous parallel agents for various functions, including:
  - simplification: data reduction and filtering
  - invariant checking
  - assertion checking
  - local problem analysis
- Performance Exception Handler: performs a model-based analysis of performance problems communicated by the agents and decides on an appropriate action if necessary; such an action may trigger:
  - new method selection
  - re-instrumentation
  - re-compilation

Slide 44: Example: A Society of Agents for Performance Analysis and Feedback-Oriented Tuning
[Diagram: the compiler and instrumenter produce an instrumented application program; during its execution, data collection (with data reduction and filtering) feeds a program/performance data base; simplification, analysis, and invariant-checking agents work on these data and report to a performance exception handler, which closes the feedback loop to the compiler.]

Slide 45: Cascade Programming Model
- Base model:
  - unbounded thread parallelism in a flat shared memory
  - explicit synchronization for access to shared data
  - no consideration of temporal or spatial locality, no “processors”
- Extended model:
  - explicit memory locales, data distribution, data/thread affinity
  - explicit cache management

Slide 46: Language Design Criteria
- Global name space:
  - even in the context of a NUMA model
  - avoid the “local view” programming model of MPI, CoArray Fortran, UPC
- Multiple models of parallelism
- Provide support for:
  - explicit parallel programming
  - locality-aware programming
  - interoperability with legacy codes (MPI, Coarray Fortran, UPC, etc.)
  - generic programming

Slide 47: Language Design Highlights
- The “Concrete Language” enhances HPF and ZPL:
  - generalized HPF-type data distributions
  - domains are first-class objects: a central repository for index space, distribution, and associated set of arrays
  - support for automatic partitioning of dynamic graph-based data structures
  - high-level control for communication (halos, …)?
- The “Abstract Language” supports generic programming:
  - abstraction of types: type inference from context
  - abstraction of iteration: generalization of the CLU iterator
  - data structure inference: system-selected implementation for programmer-specified object categories
Note: The “concrete” and “abstract” languages as used above do not in reality refer to different languages but are only used here to structure this discussion.

Slide 48: Basics
- A modern base language:
  - strongly typed
  - Fortran-like array features
  - object-oriented
  - module structure for name space management
  - optional automatic storage management
- High-performance features:
  - abstractions for parallelism:
    - data parallelism (domains, forall)
    - task parallelism (cobegin/coend)
  - locality management via data distributions and affinity

Slide 49: Type Tree
[Diagram: types are divided into primitive, domain, collection, and function types; collections split into homogeneous collections (sequence, set, array) and heterogeneous collections (class, record); functions split into iterators and other functions.]

Slide 50: Locality Management
- Low-level tools direct object and thread placement at time of creation
- Natural extensions to existing partitioned address space languages
- Evolve a higher-level concept from HPF and ZPL:
  - introduce domains as first-class program entities
  - sets of names (indices) for data
  - centralization of distribution, layout, and iteration ordering

Slide 51: Domains
- index sets: Cartesian products, sparse, opaque
- locale view: a logical view for a set of locales
- distribution: a mapping of an index set to a locale view
- array: a map from an index set to a collection of variables

Slide 52: Example: Matrix Vector Multiplication V1
2) Separation of Concerns: Definition

  var Mat:    domain(2) = [1..m, 1..n];
  var MatCol: domain(1) = Mat(2);
  var MatRow: domain(1) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);
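For readers who think in Fortran, a hedged equivalent of what this computes (the names m, n, A, v, s follow the slide; the subroutine wrapper and the REAL type are illustrative assumptions): the reduction sum(dim=2) [i,j:Mat] A(i,j)*v(j) is simply the matrix-vector product, with the index sets factored out of the computation into the domains Mat, MatRow, and MatCol.

  subroutine matvec(m, n, A, v, s)
    integer, intent(in) :: m, n
    real, intent(in)    :: A(m,n), v(n)
    real, intent(out)   :: s(m)
    ! the reduction over the second dimension written on the slide as
    !   s = sum(dim=2) [i,j:Mat] A(i,j)*v(j)
    s = matmul(A, v)
  end subroutine matvec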

Slide 53: Example: Matrix Vector Multiplication V2

  var L: array[1..p1,1..p2] of locale;

  var Mat:    domain(2) dist(block,block) to L = [1..m,1..n];
  var MatCol: domain(1) align(*,Mat(2)) = Mat(2);
  var MatRow: domain(1) align(Mat(1),*) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);

Slide 54: Sparse Matrix Distribution
[Figure: a sparse matrix with 16 nonzeros is partitioned across four locales; each locale stores its local nonzeros in compressed form as three arrays, D (nonzero values), C (column indices), and R (row pointers), e.g., D0 = 53 19 17 93, C0 = 2 1 4 5, R0 = 1 2 3 4 5.]

Slide 55: Example: Matrix Vector Multiplication V3

  var L: array[1..p1,1..p2] of locale;

  var Mat:    domain(sparse2) dist(mysdist) to L layout(myslay)
              = [1..m,1..n] where enumeratenonzeroes();
  var MatCol: domain(1) align(*,Mat(2)) = Mat(2);
  var MatRow: domain(1) align(Mat(1),*) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);
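To connect this to the compressed per-locale storage in the figure on slide 54, here is a hedged sketch of the local piece of work the compiler and runtime would have to arrange on each locale: a sparse matrix-vector product over a compressed-sparse-row block, with D holding the local nonzero values, C their column indices, and R the local row pointers, as in the figure. The subroutine name, the argument types, and the assumption that the required elements of v are already available locally are illustrative.

  subroutine local_spmv(nrows, R, C, D, v, s)
    integer, intent(in)           :: nrows        ! number of local rows
    integer, intent(in)           :: R(nrows+1)   ! row pointers into C and D
    integer, intent(in)           :: C(*)         ! column indices of local nonzeros
    double precision, intent(in)  :: D(*)         ! local nonzero values
    double precision, intent(in)  :: v(*)         ! locally available part of the input vector
    double precision, intent(out) :: s(nrows)     ! local rows of the result
    integer :: i, k

    do i = 1, nrows
       s(i) = 0.0d0
       do k = R(i), R(i+1) - 1                    ! nonzeros of local row i
          s(i) = s(i) + D(k) * v(C(k))
       end do
    end do
  end subroutine local_spmv

None of this bookkeeping appears in the source above: the sparse domain Mat, its distribution mysdist, and its layout myslay carry that information, and the sum expression is unchanged from V1 and V2.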

Slide 56: Language Summary
- Global name space
- High-level control features supporting explicit parallelism
- High-level locality management
- High-level support for collections
- Static typing
- Support for generic programming

Slide 57: Conclusion
- Today’s programming languages, models, and tools are not adequate to deal with the challenges of 2010 architectures and application requirements
- Peta-scale architectures will pose additional problems but may also provide better structural support for high-level languages and compilers
- A focused and persistent research and development effort -- along a number of different paths -- will be needed to create viable language systems and tool support for economically feasible and robust high-productivity computing of the future

