GLAF: A Visual Programming and Auto-Tuning Framework for Parallel Computing
Konstantinos Krommydas, Ruchira Sasanka (Intel), Wu-chun Feng
Motivation
-High-performance computing is crucial in a broad range of scientific domains: engineering, math, physics, biology, …
-The parallel programming revolution has made high-performance computing accessible to the broader masses
High-performance computing is of high importance in a broad range of scientific domains, with many applications in engineering, math, physics, and biology. Examples include molecular modeling, computational fluid dynamics, DNA sequencing, and cosmology. By exploiting parallel computing, complex computations, or computations on huge amounts of data, can be performed very quickly, often with the added advantage of being able to obtain more accurate results. Until the not-so-distant past, high-performance computing was a privilege of government labs, universities, or large corporations that could afford to build and maintain expensive (and power-hungry) supercomputers. With the parallel programming revolution of the mid-2000s, high-performance computing, at least at a lower yet still adequate level, has become more mainstream and available at reasonable cost to a broader audience.
Image credits: Molecular modeling: http://www.ks.uiuc.edu/Research/gpu/ DNA: Wikipedia Commons CFD: surface pressure on wing and tail of the Common Research Model (http://aaac.larc.nasa.gov/tsab/cfdlarc/aiaa-dpw/Workshop5/workshop5.html) PC: http://www.tutorialspoint.com/computer_fundamentals/images/personal_computer.jpg
Challenges
-Many different computing platforms
-Many different programming languages
-Many different architectures
-Many different optimization strategies
-Domain experts should not (need to) know all these details but rather focus on their science
At the heart of accessible high-performance computing lie different types of cores: on top of the "traditional" multi-core CPUs, which can now reach up to 48 logical cores and include vector capabilities for even more data parallelism, we have graphics processing units (GPUs). Moreover, co-processors, like the Intel Xeon Phi, and reconfigurable platforms, i.e., FPGAs, provide even more opportunities for parallelism and customization, respectively. This broad heterogeneous landscape presents users with many challenges, exposing them to a number of different underlying architectures and, consequently, to the architecture-aware optimization strategies needed to make a given algorithm run efficiently on them. The problem is exacerbated by the number of programming languages that may be used to program these devices and by the need to switch to a parallel state of mind when programming for them. Programmers themselves are not always good at parallel programming, or may be "specialized" in only one or a few of these platforms and/or languages. What's more, domain experts (i.e., engineers, scientists, researchers) are even further removed from that. As such, there is a distance between such users – for whom high-performance/parallel computing is very beneficial – and the ideal state. Domain experts do not know, and should not need to know, all these details that steer them away from what they are good at: their respective science.
Challenges
-Domain experts need to collaborate with computer scientists: pseudocode or unoptimized/naïve serial code → highly optimized parallel code
-Need to exchange domain-specific/programming knowledge
-Communication overhead & errors
-Limited access to parallel computing
-Innovation slow-down
One of the ways the problem has traditionally been addressed is via collaboration of domain experts with computer scientists. Specifically, domain experts provide computer scientists with pseudocode, or an unoptimized/naïve version of the serial code in a language they are fairly comfortable with. Computer scientists then optimize the code and look for parallelization opportunities. This interdisciplinary, collaborative approach entails exchange of domain-specific knowledge between the domain experts and computer scientists. Such exchange is prone to errors and misunderstandings and entails a non-negligible communication overhead, which overall slows down the innovation potential. While collaboration and interdisciplinary research are not bad per se, ideally domain experts and computer scientists should each be able to focus on their respective areas and innovate accordingly.
Contributions
-Realize a programming abstraction & development framework for domain experts to provide a balance between performance and programmability, i.e., obtain fast performance from algorithms that have been programmed easily
-Desired features of GLAF*: intuitive, familiar, minimalistic syntax; data-visual and interactive; auto-parallelizable, optimizable, tunable; able to integrate with existing legacy code
*Grid-based Language and Auto-tuning Framework
It is this problem that we try to address with our work: provide a means for domain experts to express their algorithms in an easy and fast way, and a means to generate efficient parallel code, obviating the need to involve computer scientists in the process. In this context, we realize a programming abstraction and development framework for domain experts that provides a good balance between performance and programmability: in other words, a means to easily write fast parallel programs related to their domain science. Our idea of such an abstraction is geared towards a set of features that we deem important for the intended user base:
-It has to be auto-parallelizable, and tunable/optimizable for a specific architecture, yet platform-agnostic.
-It needs to be intuitive and familiar to the user, easy, with minimalistic syntax, yet able to handle complex real-world problems.
-It should be data-visual and interactive, to allow easy understanding of the algorithm being developed and easy debugging.
-Finally, it should be able to integrate with existing legacy code, since there is a huge code base for various scientific domains, written in various languages (Fortran, C, etc.).
Graphical User Interface
Callouts: module name; function name; step number (within function); comments (in step, statements, grids, …); click-based interface.
This is an example screen of GLAF's GUI. The GUI is implemented in the form of a web page and written in a combination of HTML5 and JavaScript. In general, a GLAF program is organized in one or more MODULES (click), each module includes one or more FUNCTIONS (click), and, in turn, each function contains one or more STEPS (click). Specifically, this screen represents a STEP of a GLAF program. A step includes a number of input and output grids on which computations are performed through a series of STATEMENTS. GLAF supports all necessary programming constructs, such as LOOPS (click), CONDITIONAL STATEMENTS (click), and FORMULAS (click), which can include grid cells, arithmetic/logical operations on them, as well as user-defined or library function calls. As you can see, the "code" of GLAF coexists with the data variables (or grids – we will talk about them in the next slide). Data visualization is important, and we will discuss it further shortly. Programming using GLAF differs from programming in a typical free-text programming language (e.g., C or Java) in that users work with a visual programming interface that keeps typing to a minimum (e.g., for naming variables, functions, or comments). Rather, most development is based on an intuitive point-and-click visual interface. The code you can see in the statement boxes is not written manually but populated through a series of clicks on the appropriate grid positions and buttons (click). Finally, the user can insert comments in many places to make development easier: in a function, a step, the data variables (grids), and each statement of the program. As we shall see, these comments are also placed in the appropriate position in the automatically generated code.
Programming Using GLAF: a Simple Example
This is a simple (animated) example of how we program using GLAF. It is an example where we simply scale the values of a matrix based on a scaling matrix. It shows the point-and-click way of creating loops, selecting variables and operators, how computation flows from input grids to output grids, and the parallelism meter at the top – green indicates a parallelism opportunity, and the number represents the number of iterations that can be parallelized across each dimension/variable. NOTE: the "code" in the boxes is AUTOMATICALLY placed based on clicking on the buttons or grid cells, as seen in the animation.
Grid-Based Data Structures
-GLAF variables are based on the concept of grids: a familiar abstraction (e.g., images, matrices, spreadsheets) with a regular format that facilitates code generation, optimizations, and parallelism detection
-Grid-based programming puts the focus on the relation, rather than the implementation details
-Example: a scalar variable is a 0D grid; a 1D array is a one-dimensional grid
Callouts: grid name; dimension title; data type; different data types across a dimension.
GLAF variables are based upon the concept of grids. Grids are simple, yet powerful, data structures that can be used to represent a variety of real-world problems. A scalar variable is a 0D grid with one element, and a 1D array is a 1D grid with multiple elements. Similarly, we can generalize to higher-dimensional data structures. Grids of this type can contain a single data type and be indexed by corresponding index variables (x, y, z, …). In this example (click) we allow dimensions to have titles. This way we can represent tables, in which case a specific cell is addressed by the combination of title and index (in a similar way to a spreadsheet). More complex structures (click) can be described using the grid abstraction through the use of multiple data types across one of the titled dimensions. Such grids can represent what would be a struct in C. This specific example represents a surface point in a molecular modeling algorithm, where q represents the charge and x, y, z are the coordinates in 3D space. Another example (click), with different titled dimensions and different types per dimension, is shown here. This grid represents a database of student records in different departments of different colleges. In practice, any math relation that is discrete and finite can be represented with grids (e.g., trees can be represented by matrices indicating parent-child relationships, graphs by adjacency matrices, and sparse matrices in the CSR format).
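As a concrete, hypothetical illustration of the mapping described above, the minimal sketch below shows how a grid whose titled row dimension holds {q, x, y, z} could correspond to a C struct plus an array over the untitled dimension; the type and variable names are assumptions for illustration, not GLAF's actual generated code.

    /* Minimal sketch (assumed names, not GLAF output): a 2D grid whose titled
     * row dimension is {q, x, y, z} maps naturally to a C struct, while the
     * remaining untitled dimension becomes an array indexed by an index variable. */
    #include <stdio.h>

    #define NUM_PTS 4                      /* illustrative size of the untitled dimension */

    typedef struct {
        float q;                           /* charge       (title "q") */
        float x, y, z;                     /* coordinates  (titles "x", "y", "z") */
    } surface_pt_t;

    int main(void) {
        surface_pt_t surface_pts[NUM_PTS]; /* grid caption: "surface_pts" */

        for (int i = 0; i < NUM_PTS; i++) {
            surface_pts[i].q = 1.0f;
            surface_pts[i].x = surface_pts[i].y = surface_pts[i].z = (float)i;
        }
        printf("point 2: q=%g, x=%g\n", surface_pts[2].q, surface_pts[2].x);
        return 0;
    }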
Internal Representation
-All GLAF basic elements have a corresponding internal representation
-JavaScript objects modeled after constructor functions (object-oriented JavaScript)
-Used to drive GLAF front- and back-ends: data visualization, code generation, parallelism analysis
All elements that constitute GLAF, as we saw in the previous examples (i.e., modules, functions, steps, statements, comments, grids, …), have a corresponding internal representation in JavaScript. As the user develops the algorithm using GUI buttons and the keyboard, event listeners activate appropriate JavaScript functions to populate JS objects modeled after constructor functions (in support of object-oriented programming in JavaScript). These objects drive all the functionality of GLAF through the corresponding front- and back-ends: creating and navigating the GUI screens, automatic code generation, parallelism analysis, etc. We are going to discuss these features throughout the presentation.
Internal Representation Example: Grid Object (simplified)
caption = "surface_pts"
num_dims = 2
dataTypes[RowDim] = {T_REAL, T_REAL, T_REAL, T_REAL}
dimTitles[RowDim] = {q, x, y, z}
comment = "Represents a surface point of a biomolecule"
Here, you can see an example of the internal representation (of course, a simplified version) of a GRID object. It records details such as the name of the grid (caption), the number of dimensions, an array of data types for each dimension (if the grid has multiple data types per dimension, i.e., a struct), an array of titles per dimension (if it has titles in said dimension), and a comment the user may have entered for the grid (in this example we are using a data structure/grid from a molecular modeling algorithm; this specific grid stores details of the surface points of a biomolecule). This information (and more) is used during code generation, as we will see shortly, where, for example, the data type as encoded internally generates the data-type string that corresponds to the language of choice (e.g., C or Fortran).
GLAF Infrastructure
Browser side: GLAF programming GUI; data visualization; Fortran/C/JS/OpenCL code generation; auto-parallelization; compilation/auto-tuning script generation.
Web/Cloud side: compilation; auto-tuning; execution; data storage.
GLAF is currently implemented as a web service, but it can easily be used offline, depending on the user's needs and software/hardware availability. Users build their algorithms using a familiar browser interface and can visualize the results for easy debugging (for tractable program sizes). Automatic code generation, auto-parallelization, and auto-tuning script generation take place on the host side, while the resulting files are transferred to the cloud, where they are compiled, auto-tuned, and executed. Results are also stored in the cloud (but users can download them to their machines). The web-service choice was made to also allow "on-the-go" rapid code prototyping on touch-based devices, like tablets, without affecting the generality of the tool. We will now discuss each of the above procedures and GLAF in more detail.
Automatic Code Generation: C
-Readable, indented code
-Comments auto-included in the appropriate places
-OpenMP directives (where applicable)
One of the main features of GLAF is the automatic code-generation back-end. From a single GLAF implementation, GLAF generates code in the supported programming languages. Currently, we support C and Fortran, and this picture shows a kernel in C (corresponding to the GUI screen from slide 6) as automatically generated. As one can notice, the code is very readable, with appropriate indentation; comments entered in the GUI appear in the appropriate places in the generated code; and OpenMP directives are placed where the auto-parallelization back-end has identified parallelization opportunities. Such code can either be used directly by novice programmers or be taken by "ninja" programmers and further developed manually, if needed.
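To give a flavor of the style described above, here is a minimal, hypothetical sketch of what GLAF-emitted C might look like for a small kernel (the matrix-scaling example of slide 7); the function name, signature, and pragma placement are assumptions for illustration, not the actual generated kernel from the figure.

    /* Hypothetical sketch of GLAF-style generated C (not the actual output):
     * readable indentation, the user's GUI comment carried into the code, and
     * an OpenMP directive on the loop found parallelizable by the analysis. */
    void fun_scale(float *out, const float *in, const float *scale, int rows, int cols)
    {
        /* Step 2: scale every element of 'in' by the matching element of 'scale' */
        #pragma omp parallel for
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                out[row * cols + col] = in[row * cols + col] * scale[row * cols + col];
            }
        }
    }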
Automatic Code Generation: Fortran
Here you can see the Fortran code automatically generated by GLAF's code-generation back-end, using the very same GLAF program as before. As before, the user need not know C or Fortran. Simply programming at a higher level using the GLAF GUI can produce code that will run on multi-core CPUs.
Automatic Code Generation
-To generate code, GLAF parses the structured internal representation of the GLAF program: Module(s) → Function(s) → Step(s) → Statement(s)
-Based on the correspondence of the internal representation to C/Fortran language constructs, we generate the corresponding code
Let's see the automatic code-generation back-end in more detail. Recall the structure of a GLAF program, which is reflected in the structure/connection of the internal JavaScript objects that model this program abstraction: a module contains one or more functions, each function contains one or more steps, and each step contains one or more statements ("lines of code"). Accordingly, there are module objects, one of whose elements is an array of function objects. A function object, modeling a function, contains an array of step objects, and each step object contains an array of statement objects.
Automatic Code Generation: Procedure (flowchart)
START → for each FUNCTION: generate the function header (type, name, arguments) → generate code for grids declared in a STEP → generate code for a LOOP in the step (declaration of the loop's start/end variables, assignment of start/end values, creation of DO constructs, generation of the corresponding loop-end code) → parse each box: if conditional, generate conditional-statement code; otherwise, generate formula code (operations/function calls); repeat while there are more boxes → generate code for the STEP (index-range code + formula(s) code + loop-end code); repeat while there are more steps in the function → generate code for the FUNCTION (concatenate: function header, all declarations and initializations, all steps' code, return-value assignment, function closing); repeat while there are more functions → generate code for the PROGRAM (concatenate: derived types for complex grids (if used), library function code (if used), code for all functions, call to the main() routine) → END.
Here, you can see the procedure that takes place in the code-generation back-end. At first, we parse the internal representation of each function object, obtaining details like the return data type, the name, the arguments and their types, etc., and generate the function header. Subsequently, we generate the code for declaring the variables that correspond to the function variables of the first step of the function, which we obtain from the grid objects associated with the first step object of the first function object (i.e., main()). If the current step includes a "foreach" statement (i.e., a loop), we create the appropriate loop code and proceed to parse the rest of the boxes/statements of the step. If a box is a conditional statement, we generate the appropriate if/else if/else statement; otherwise, we generate the code for a formula statement, a single function call, or a mix. If there are more boxes in the current step we repeat; otherwise, we concatenate the code and finalize the current step's code. If the function has more than one step we repeat; otherwise, we generate the code for the function by concatenating the appropriate sub-codes and generating the code that finalizes the function. If there are more functions we repeat the whole process; otherwise, we concatenate the code for any structs (C) or derived types (Fortran), any library functions (GLAF library functions for which we generate code), the concatenated code for all functions, and any boilerplate code/the initialization call to the main function.
Auto-Parallelization
-GLAF parallelism-analysis back-end parses the internal representation and fills additional JavaScript objects used in analyzing loops for parallelization opportunities (loop-level dependence analysis)
-E.g., scalar grids are parsed to build a control flow graph used to identify scalar dependencies
-Non-scalar grids are parsed (name, read/write access, index per dimension) for use in the cross-dependence loop analysis (see algorithms in the paper)
The auto-parallelization features of GLAF are based on its parallelism-analysis back-end. Specifically, the auto-parallelization back-end parses the GLAF program and forms an additional collection of objects in the internal representation. These additional objects include information on scalar grids (scalar variables) and non-scalar grids: the name of the variable, the instances where it is written/read (i.e., the "line" number), and, in the case of non-scalar grids, information about the indices accessed in each dimension of each instance, etc. Information about scalar grids is used to build a control flow graph, which is then parsed to analyze scalar dependencies. Information about non-scalar grids is fed to the loop parallelism-detection algorithm (if interested, refer to the paper, where we provide pseudocode of the whole process in more detail).
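The sketch below illustrates, in C terms, the kind of loop-level distinction such a dependence analysis draws; it is only an illustrative example, not GLAF's internal algorithm (which is given as pseudocode in the paper).

    /* Illustration of loop-level dependence analysis (example only). */
    #define N 1000

    void dependence_examples(float a[N], const float b[N])
    {
        /* Iteration i writes only a[i] and reads only b[i]: no cross-iteration
         * dependence, so this loop dimension can be flagged as parallelizable. */
        for (int i = 0; i < N; i++)
            a[i] = 2.0f * b[i];

        /* Iteration i reads a[i-1], written by the previous iteration: a
         * loop-carried dependence, so this loop would not be marked parallel. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];
    }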
Auto-Parallelization
-GLAF parallel OpenMP-directive code generation: based on the parallelism analysis, generates appropriate OpenMP directives (including reductions)
-GLAF parallelism meter: visual information for the user about parallelizable dimensions & the amount of parallelism
The code-generation part of the auto-parallelization back-end emits the appropriate OpenMP directives at the appropriate places in the code, based on the aforementioned analysis. Finally, a visual aspect of auto-parallelization is the "parallelism meter", which provides visual information to the user about the loop dimensions that are parallelizable and the amount of available parallelism across each (if known).
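As a small, hypothetical example of a directive with a reduction clause that such a back-end might emit (variable and loop names are illustrative):

    /* Sketch: every iteration accumulates into the same scalar, so a plain
     * "parallel for" would race; the emitted directive adds a reduction clause. */
    #include <stdio.h>

    int main(void)
    {
        float x[1000], sum = 0.0f;
        for (int i = 0; i < 1000; i++) x[i] = 0.5f;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += x[i] * x[i];

        printf("sum = %g\n", sum);
        return 0;
    }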
Auto-Tuning
-Generates platform-specific binaries & optimizations
-Selects the languages in which to auto-generate code
-Selects one or more code "starting points" for each language
-Selects optimizations for each combination of language and code "starting point"
Here you can see the main auto-tuning options screen, which lets the user select the code implementations that will ultimately be generated and/or compiled and/or executed and timed. At the highest level, the user selects the target platform, which determines the platform-specific binaries (and optimizations, if any) that are generated. At the second level, the user selects the target languages in which code will be generated. At the third level, the user picks among serial or parallel code, where the latter can be a GLAF auto-parallelized version of the serial code or a compiler auto-parallelized version of the serial code (as we see in the results, the compiler is not always better, but we still want this option for the cases where it is). At the last level, we provide optimizations for each combination of the previous user selections. An important one is the automatic generation of code for data structures (where applicable) as arrays of structures (AoS) or structures of arrays (SoA). Data layout can have a very important effect on performance (e.g., better cache locality or more auto-vectorization opportunities). (Note: the code snippets correspond to the molecular-modeling data structure from slide 8.)
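A minimal sketch of the AoS versus SoA alternatives for the surface-point grid of slide 8 is shown below; the type names and the charge-summing loops are illustrative assumptions, not the snippets from the options screen.

    /* Sketch (assumed names): the two data layouts the auto-tuner can generate. */
    #define NUM_PTS 1024

    /* Array of Structures (AoS): the fields of one point are contiguous. */
    typedef struct { float q, x, y, z; } surface_pt_aos_t;
    static surface_pt_aos_t pts_aos[NUM_PTS];

    /* Structure of Arrays (SoA): each field is its own contiguous array,
     * enabling aligned, unit-stride vector loads in the inner loop. */
    typedef struct {
        float q[NUM_PTS];
        float x[NUM_PTS], y[NUM_PTS], z[NUM_PTS];
    } surface_pts_soa_t;
    static surface_pts_soa_t pts_soa;

    float sum_charge_aos(void)
    {
        float s = 0.0f;
        for (int i = 0; i < NUM_PTS; i++)
            s += pts_aos[i].q;             /* stride of sizeof(surface_pt_aos_t) */
        return s;
    }

    float sum_charge_soa(void)
    {
        float s = 0.0f;
        for (int i = 0; i < NUM_PTS; i++)
            s += pts_soa.q[i];             /* unit stride */
        return s;
    }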
Multi-level Auto-Tuning Approach
GLAF program → {Fortran, C, OpenCL} × {Serial, GLAF-Parallel, Compiler-Parallel} × {data-layout transforms, loop-interchange transforms, loop-collapse transforms, …}
One of the important paradigms that GLAF wants to stress is writing a program once, in a single higher-level language that is easier for domain experts to program in, and producing optimized, parallel code in many "traditional" programming languages. From a single GLAF visual program we can get code in Fortran, C, and (under development) OpenCL, in three basic forms: serial code, parallel code with OpenMP pragmas as identified and generated by GLAF, and the binary for the parallel code as generated by the compiler's auto-parallelization option (the latter is based on the corresponding GLAF-auto-generated serial version). (NOTE: these three basic categories will be used in the results slides later.) Each of these basic versions undergoes further optimizations, either by GLAF or by the respective compiler, such as the loop-collapse transform sketched below. Code generated by GLAF can run on CPUs and Xeon Phi (and on GPUs using OpenCL, under development).
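As an example of one such knob, the sketch below contrasts a collapsed and a non-collapsed OpenMP triple loop, corresponding to the col(3)/col(1) variants discussed in the results; the kernel body is a trivial placeholder, not the actual 3D finite-difference code.

    /* Sketch of the loop-collapse auto-tuning knob (placeholder kernel body). */

    /* col(3): all three loop dimensions are merged into one parallel space. */
    void kernel_col3(float *out, const float *in, int nx, int ny, int nz)
    {
        #pragma omp parallel for collapse(3)
        for (int z = 0; z < nz; z++)
            for (int y = 0; y < ny; y++)
                for (int x = 0; x < nx; x++)
                    out[(z * ny + y) * nx + x] = 2.0f * in[(z * ny + y) * nx + x];
    }

    /* col(1): only the outer loop is parallel; the inner, unit-stride x loop
     * stays simple enough for the compiler to vectorize. */
    void kernel_col1(float *out, const float *in, int nx, int ny, int nz)
    {
        #pragma omp parallel for
        for (int z = 0; z < nz; z++)
            for (int y = 0; y < ny; y++)
                for (int x = 0; x < nx; x++)
                    out[(z * ny + y) * nx + x] = 2.0f * in[(z * ny + y) * nx + x];
    }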
Visualization
-Data visualization facilitates: understanding the algorithm being developed; revealing bugs at an early stage
-Modes: "Show Data", "Colorize", "Image Map"
Data visualization alongside development, as GLAF supports, facilitates understanding the algorithms being developed, as well as revealing bugs, in an intuitive, visual way. Visualization is very important, especially in light of the big-data era. Currently, GLAF supports a few fundamental and fairly basic ways of visualizing data. The first one ("Show Data") calculates the values of each grid cell up to the step from which it is invoked. The second ("Colorize") paints each grid cell on the greyscale color spectrum according to the magnitude of the corresponding cell value, together with its value. A third one ("Image Map") is a straightforward extension that shows only the color. Techniques based on color make it easy to spot outlier values or specific PATTERNS in relevant problems, especially in the presence of huge amounts of data, thus facilitating result observation and interpretation. In the future we plan to support more complex visualization schemes by using specialized visualization libraries for frequently used data structures: e.g., sparse matrices in CSR format could be visualized automatically as single (sparse) matrices (rather than the helper 1D arrays of the CSR format). Similarly, graphs could be visualized as such, while being internally represented via adjacency matrices.
Results: 3D Finite Difference Algorithm
-C performs better than Fortran overall in the parallel implementations
-Simple "mistakes" can take a large toll on performance (e.g., unneeded pointer aliasing)
-Nested OpenMP loop pragmas (loop collapse) may void vectorization opportunities
Here we present the results of the codes automatically generated for a 3D finite difference algorithm based on a GLAF GUI-based implementation. The vertical axis shows the speedup over a reference (Fortran serial CPU) implementation, and the horizontal axis presents clusters of implementations: serial (as generated by GLAF), compiler-parallelized (the executable from compiling the serial auto-generated GLAF code with the compiler's auto-parallelization/optimization flags), and GLAF-parallelized (code with OpenMP pragmas as identified and placed by the tool). The latter appears as "col(3)" and "col(1)", which represent yet another auto-tuning option with respect to collapsing (3) or not collapsing (1) the loops. As you can see, NOT collapsing the loops (collapsing is a mistake that a novice programmer could easily make) yields better speedups, mainly because collapsing takes away vectorization opportunities for the inner loops. Another important observation is that the GLAF-parallelized (col(1)) implementations are overall faster than the compiler-parallelized ones (and, of course, both are faster than the serial implementations). Within each cluster, one can see implementations across languages and across platforms (CPU/Xeon Phi). Overall, C gives the best speedup in all parallel implementations, which shows that language selection can be important – in fact, Fortran was better in another example (see backup slides). Of course, this also has to do with the maturity and efficacy of the compiler used for each language. The second and third columns of each cluster show how our tool takes care of mistakes that novice programmers could make and how much of a toll these mistakes would take on performance. The specific example (_norestr vs. _restr) shows how a compiler can fail to parallelize a C program with functions in which arrays are passed via pointers. C enforces strict aliasing rules, and as such the compiler will NOT parallelize unless the user annotates the pointers with the restrict keyword and adds the appropriate "-restrict" compiler flag. Of course, one needs to verify that there is no aliasing in order to do this. These are things that novice programmers, like domain experts, would likely not know; GLAF takes care of these issues automatically. This also shows how language selection can be important: in Fortran, arrays are passed by reference and are assumed not to alias by default. Unnecessarily conservative aliasing assumptions may affect not only auto-parallelization but vectorization too. This is prominent in the Xeon Phi executions, where failing to vectorize (and thus use its WIDER vector units vs. the CPU's) causes a 16x gap (vs. a 3x gap on the CPU) between the _norestr and _restr implementations.
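The aliasing point above can be illustrated with the following hypothetical C signatures (not the generated 3DFD functions): without restrict, the compiler must assume the pointers may overlap.

    /* Illustration of the _norestr vs. _restr distinction (assumed names). */

    /* Without restrict: 'out' and 'in' may alias, so the compiler typically
     * refuses to auto-parallelize or vectorize this loop. */
    void axpy_norestr(float *out, const float *in, float a, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] += a * in[i];
    }

    /* With restrict (plus the matching compiler flag, e.g. "-restrict" for the
     * Intel compiler): the no-aliasing promise frees the compiler to optimize. */
    void axpy_restr(float * restrict out, const float * restrict in, float a, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] += a * in[i];
    }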
Results: N-body Algorithm
-SoA/AoS automatic transformations in GLAF
-Code transformations in GLAF to allow parallelization (e.g., reduction on non-scalar variables)
-Common programming pitfalls (e.g., pow() vs. powf() in C)
These are the results for an n-body type of algorithm that calculates the surface electrostatic potential due to interactions with a set of bodies within a biomolecule (recall the kernel shown in the GLAF step on slide 7). The main thing that stands out is how important data-layout transformations are: laying out the data as SoA provides a big performance benefit overall, and especially in the parallel versions of the code. A wrong selection of data layout, which is something a naïve programmer can easily fall for, can yield performance more than twice as bad as the "right" data layout. The SoA layout is cache-friendly and amenable to more efficient vectorization, offering the possibility of aligned, unmasked, unit-stride loads, as opposed to the more expensive strided and unaligned loads and gather operations generally needed with AoS. The best overall performance is obtained using Fortran (in the GLAF-parallelized version), showing again that the best language choice may change for different problem types (recall that C was found to be the best in our 3DFD example) and again showcasing the utility of being able to automatically obtain implementations in many languages (as we can do with GLAF). Another issue we encountered and tackled concerns the reduction step in the code (sum[i] += …). In this case, C does not allow reductions on dynamically allocated arrays, so even the compiler failed to auto-parallelize the respective loop. GLAF detects such cases and introduces a temporary variable (when allowed), thereby enabling reduction and parallelization (the "*_restr_tmp" implementations, versus "*_restr"). (The "*_restr" and "*_norestr" versions have the same meaning as in the previous slide.) Finally, the "*_powf" versions (as opposed to "*_restr_tmp") showcase yet another way a simple programmer's oversight can take a huge toll on performance: a novice programmer may use the pow() function for power calculations without taking into account the input data type (in our case, float). In Fortran, the exponentiation intrinsic (**) is overloaded with the appropriate version according to its arguments; in C, however, pow() is the double-precision version. Using the right version (powf()) increased C serial CPU performance by 1.94x (SoA case). Argument-type detection is inherent in GLAF, and detecting such cases of potential unwanted performance degradation is important for steering users away from such pitfalls.
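The two fixes described above can be sketched as follows; the kernel, names, and distance array are illustrative assumptions rather than the actual n-body code.

    /* Sketch of the scalar-temporary reduction and the powf() pitfall
     * (illustrative kernel, not the actual electrostatic-potential code). */
    #include <math.h>

    void potential(float *sum, const float *q, const float *dist,
                   int npts, int nbodies)
    {
        for (int i = 0; i < npts; i++) {
            /* Accumulating directly into sum[i] blocks the reduction clause, so a
             * scalar temporary is introduced to enable it ("*_restr_tmp" versions). */
            float tmp = 0.0f;
            #pragma omp parallel for reduction(+:tmp)
            for (int j = 0; j < nbodies; j++)
                /* powf(), not pow(): keeps the computation in single precision */
                tmp += q[j] / powf(dist[i * nbodies + j], 2.0f);
            sum[i] = tmp;
        }
    }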
Related Work
Parallelization:
-Implicit: compilers (gcc, Intel, Cray, PGI, …). + Includes loop-level parallelism & data-level parallelism (vectorization), and potential optimizations. - Requires certain programming knowledge. - (Often) conservative nature.
-Explicit: extensions/libraries (Pthreads, OpenMP, OpenACC, Chapel, …).
Problem Solving Environments / Domain-Specific Languages:
-Domain-specific: computational biology, physics, dense linear algebra, fast Fourier transforms, …. + High-performance auto-tuning. - Restrictive in nature. - Restrictive in terms of target language and/or platform.
Our work towards making high-performance, parallel computing accessible to domain experts is related to methods that allow code parallelization. Such methods attempt to parallelize and optimize serial code in a transparent (implicit) way, as is the case with optimizing/parallelizing compilers that attempt to auto-parallelize and vectorize appropriate code segments. The problem is that taking full advantage of the above requires programming knowledge and an understanding of how compilers try to parallelize code. Even so, compilers often fail because they need to be conservative to prevent errors. Another method entails the use of extensions and libraries, which requires knowledge of these extensions and a parallel mindset when programming with them. Our work complements the former category by automatically generating code that is more amenable to parallelization by the compiler and that automatically takes advantage of extensions like OpenMP. On the other hand, a different path targets domain experts with PSEs, with examples from many different application areas. Such PSEs indeed facilitate domain experts and provide auto-tuned, fast implementations. The problem with PSEs is their lack of general applicability, as they tend to focus on very function-specific codes. Moreover, they are restrictive in terms of target language and/or architecture (e.g., they may only target GPUs and/or CUDA). We attempt to provide a programming environment that facilitates programming in a more general sense, yet is still adequately fast, both in terms of programming effort and the speed of the generated code, for multiple languages and architectures.
Future Work
-Improve the tool's robustness
-Enable more languages/extensions: OpenCL, OpenACC
-Support distributed programming (MPI)
-Dynamic feedback/advice on parallelism issues
-Extend auto-tuning, auto-parallelization/auto-vectorization capabilities
-Implement more dwarfs and provide back-end support for common programming pitfalls in the code generated for supported languages
Conclusion
-GLAF targets domain experts and provides a fine balance between performance and programmability: auto-parallelization, optimization, and auto-tuning; helps avoid common programming pitfalls
-GLAF allows systematic generation of multiple starting points for different languages/platforms/optimizations, leading to overall better performance: analogous to different seeds in a state-space search algorithm (global vs. local minimum)
In this work we presented GLAF, an all-encompassing code-development environment for non-programmers or novice programmers, like the majority of domain experts. GLAF seeks to provide a fine balance between performance and programmability by offering a set of features that includes automatic parallelization, optimization, and auto-tuning, and that helps steer programmers away from common programming pitfalls. We showed how automatically generating multiple versions of code, derived from the SAME GLAF implementation, in different languages and with different optimizations, can facilitate quickly obtaining the best-performing code. Our findings reveal that the traditional coding paradigm, where a single implementation is written, can be sub-optimal for novice or even average programmers. Rather, GLAF allows MULTIPLE starting points for different optimizations that lead to overall better performance. This is analogous to different seeds in a state-space search algorithm that eventually lead to the global, as opposed to a local, minimum.
"GLAF: A Visual Programming and Auto-Tuning Framework for Parallel Computing" (Krommydas, Sasanka, Feng)