Presentation transcript:

SKA – Static Kernel Analysis using LLVM IR
Kartik Ramkrishnan and Ben Bergen, Applied Computer Science (CCS-7), Los Alamos National Laboratory
Operated by Los Alamos National Security, LLC for DOE/NNSA. DC Reviewed by Kei Davis.

Slide 2: What is SKA
- SKA – Static Kernel Analyzer
- SKA is a tool for improving the kernel development process.
- Performs static, architecture-aware analysis of kernels.
- Outputs code metrics during the development process.
- Visualizes code execution on the specified pipeline.

Slide 3: SKA-Enhanced Development Cycle

Slide 4: Example kernel – saxpy.ll

    define %argc, i8** nocapture %argv) nounwind uwtable readnone {
    entry:
      %a1 = alloca [32 x float], align 4
      %b2 = alloca [32 x float], align 4
      %c3 = alloca [32 x float], align 4
      br label %"3"

    "3":                                    ; preds = %"3", %entry
      %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %"3" ]
      %0 = getelementptr [32 x float]* %a1, i64 0, i64 %indvars.iv
      %1 = load float* %0, align 4
      %2 = getelementptr [32 x float]* %b2, i64 0, i64 %indvars.iv
      %3 = load float* %2, align 4
      %4 = getelementptr [32 x float]* %c3, i64 0, i64 %indvars.iv
      %5 = load float* %4, align 4
      %6 = fmul float %3, %5
      %7 = fadd float %1, %6
      store float %7, float* %4, align 4
      %indvars.iv.next = add i64 %indvars.iv, 1
      %lftr.wideiv = trunc i64 %indvars.iv.next to i32
      %exitcond = icmp eq i32 %lftr.wideiv, 32
      br i1 %exitcond, label %"5", label %"3"

    "5":                                    ; preds = %"3"
      ret i32 0
    }

Slide 5: Register allocation support for SKA
- LLVM IR is in SSA (static single assignment) form, which assumes an unlimited number of virtual registers.
- ISAs (instruction set architectures) have a limited number of registers.
- We improve SKA’s fidelity by allocating registers to the IR based on the target ISA.

Slide 6: Register Allocation algorithm
- Simple register allocation algorithm.

Slide 7: Build Liveness Tables

Slide 8: Build Liveness Tables
- SKA takes an LLVM IR module as input and builds a liveness table.
Figure: Partial liveness table for saxpy.ll

Slide 9: Build Liveness Tables
Steps shown: top-level loop; single basic block (BB) liveness calculation; populate liveness table.
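The single-basic-block liveness calculation named on this slide can be sketched as a backward scan over the block. This is a minimal Python sketch, not SKA's implementation: the `(dest, operands)` tuple encoding and the name `block_liveness` are hypothetical, and the instruction list is a simplified excerpt of the saxpy.ll loop body with the address computations left out.

```python
def block_liveness(instrs, live_out=frozenset()):
    """Return, for each instruction, the set of values live just before it.

    instrs:   list of (dest, operands); dest is None for stores and branches.
    live_out: values still needed after the block ends.
    """
    live = set(live_out)
    table = [None] * len(instrs)
    # Walk backwards: a definition ends a live range, a use starts one.
    for i in range(len(instrs) - 1, -1, -1):
        dest, operands = instrs[i]
        live.discard(dest)
        live.update(operands)
        table[i] = frozenset(live)
    return table


# Simplified excerpt of the saxpy.ll loop body (hypothetical encoding).
saxpy_body = [
    ("%1", ["%0"]),        # %1 = load float* %0
    ("%3", ["%2"]),        # %3 = load float* %2
    ("%5", ["%4"]),        # %5 = load float* %4
    ("%6", ["%3", "%5"]),  # %6 = fmul float %3, %5
    ("%7", ["%1", "%6"]),  # %7 = fadd float %1, %6
    (None, ["%7", "%4"]),  # store float %7, float* %4
]

liveness = block_liveness(saxpy_body)
for (dest, operands), live in zip(saxpy_body, liveness):
    print(dest, sorted(live))
```

Each row of the result plays the role of one row of a liveness table: the values that must be held in registers at that point in the block.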

Slide 10: Build Interference Graph

Slide 11: Build Interference Graph
- Traverse the liveness table to create the interference graph.
Figure: Partial interference graph (igraph) for saxpy.ll

Slide 12: Build Interference Graph
Steps shown: top-level loop; populate igraph.
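Traversing a liveness table to populate an interference graph can be sketched as follows. This is an illustrative Python sketch under a simplifying assumption: two values interfere whenever they appear in the same row of the table. The function name `build_interference_graph` and the hard-coded table (live sets for a simplified saxpy.ll loop body) are hypothetical and are not taken from SKA.

```python
from itertools import combinations


def build_interference_graph(liveness_table):
    """Map each value to the set of values that are live at the same time."""
    graph = {}
    for live in liveness_table:
        for value in live:
            graph.setdefault(value, set())
        # Every pair of simultaneously live values needs distinct registers.
        for a, b in combinations(sorted(live), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical liveness rows for a simplified saxpy.ll loop body.
table = [
    {"%0", "%2", "%4"},
    {"%1", "%2", "%4"},
    {"%1", "%3", "%4"},
    {"%1", "%3", "%4", "%5"},
    {"%1", "%4", "%6"},
    {"%4", "%7"},
]

igraph = build_interference_graph(table)
for node in sorted(igraph):
    print(node, sorted(igraph[node]))
```

In this example `%4` (the address of the `c3` element) interferes with every other value because it stays live until the final store, while `%0` and `%1` never interfere and could share a register.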

Slide 13: Simplify Interference Graph

Slide 14: Simplify Interference Graph
- Populate a stack that records whether each register (node) is simple or not.
Figure: Partial node stack for saxpy.ll

Slide 15: Simplify Interference Graph
Steps shown: populate simple node stack.
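The simplify step can be sketched in the style of classic graph-coloring allocators. The assumption here, which the slides do not confirm, is that a node is "simple" when it has fewer than `k` neighbours, where `k` is the number of available registers. When no simple node remains, this sketch pushes the highest-degree node as a potential spill. The name `simplify` and the four-node graph are hypothetical.

```python
def simplify(graph, k):
    """Return a stack of (node, is_simple) in removal order."""
    work = {node: set(adj) for node, adj in graph.items()}  # private copy
    stack = []
    while work:
        simple = sorted(n for n, adj in work.items() if len(adj) < k)
        if simple:
            node, is_simple = simple[0], True
        else:
            # No node is trivially colorable: pick a potential spill.
            node = max(sorted(work), key=lambda n: len(work[n]))
            is_simple = False
        stack.append((node, is_simple))
        # Remove the node and its edges from the working graph.
        for neighbour in work.pop(node):
            work[neighbour].discard(node)
    return stack


# Hypothetical graph: a triangle a-b-c plus d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

stack = simplify(graph, k=2)
print(stack)
```

With two registers the triangle cannot be fully simplified, so one of its nodes is pushed with `is_simple` set to `False`; with three registers every node is simple.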

Slide 16: Assign ISA Registers to IR

Slide 17: Assign ISA Registers to IR
- Assign ISA registers to IR values if there is no true spill.
- We choose between int, float and vector registers.
Figure: Partial register allocation for saxpy.ll

Slide 18: Assign ISA Registers to IR
Steps shown: assign register if no true spill.
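The assignment step can be sketched as popping the node stack and giving each value the first register of its class that no already-assigned neighbour uses. A value with no free register is recorded as a true spill. This is a hedged Python sketch: the function `assign_registers`, the register names (`r0`, `f0`, `v0`, ...) and the example graph are all hypothetical, and SKA's actual selection policy is not shown in the slides.

```python
def assign_registers(graph, stack, reg_files, value_class):
    """Pop the simplify stack and assign one register per value.

    reg_files:   register class -> list of register names (assumed names).
    value_class: value -> "int", "float" or "vector".
    Returns (assignment, spilled), where spilled lists the true spills.
    """
    assignment, spilled = {}, []
    for node, _is_simple in reversed(stack):
        taken = {assignment[n] for n in graph[node] if n in assignment}
        free = [r for r in reg_files[value_class[node]] if r not in taken]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)  # true spill: no register of this class left
    return assignment, spilled


# Hypothetical inputs: a triangle a-b-c plus d attached to a.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
stack = [("d", True), ("a", False), ("b", True), ("c", True)]
reg_files = {"int": ["r0", "r1"], "float": ["f0", "f1"], "vector": ["v0"]}
value_class = {"a": "int", "b": "int", "c": "int", "d": "float"}

assignment, spilled = assign_registers(graph, stack, reg_files, value_class)
print(assignment, spilled)
```

Here the potential spill `a` becomes a true spill because both integer registers are held by its neighbours, while `d` draws from the separate float register file.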

Slide 19: Rewrite IR

Slide 20: Rewrite IR
- The live range of %a1 is shown in red. It shrinks after rewriting the IR.

Slide 21: Rewrite IR
Steps shown: store instruction into stack; load, use and store.
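Rewriting the IR around a spilled value can be sketched as inserting a store right after the definition and a reload right before each use, so the original long live range is replaced by several short ones. This is a minimal Python sketch on a hypothetical `(dest, op, operands)` encoding; the `.slot` and `.reload` naming and the function `rewrite_for_spill` are assumptions for illustration, not SKA's output format.

```python
def rewrite_for_spill(instrs, spilled):
    """Insert spill code for one value and return the new instruction list."""
    out, reloads = [], 0
    slot = f"{spilled}.slot"  # hypothetical name for the stack slot
    for dest, op, operands in instrs:
        if spilled in operands:
            # Reload into a fresh short-lived temporary just before the use.
            reloads += 1
            tmp = f"{spilled}.reload{reloads}"
            out.append((tmp, "load", [slot]))
            operands = [tmp if o == spilled else o for o in operands]
        out.append((dest, op, operands))
        if dest == spilled:
            # Store to the stack slot right after the definition.
            out.append((None, "store", [spilled, slot]))
    return out


# Simplified excerpt of the saxpy.ll loop body; %1 is the spilled value.
body = [
    ("%1", "load", ["%0"]),
    ("%6", "fmul", ["%3", "%5"]),
    ("%7", "fadd", ["%1", "%6"]),
]

rewritten = rewrite_for_spill(body, "%1")
for instr in rewritten:
    print(instr)
```

After rewriting, `%1` is live only between its definition and the store, and the reloaded temporary is live only across the `fadd`, which is the kind of live-range reduction the slide describes.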

Slide 22: Register allocation done!

Slide 23: Virtual architecture specification
- Specified in an XML file.
- Specifies logical units, the instructions they process, latencies, issue width, …
Figure: Partial architecture example
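To make the idea of an XML architecture description concrete, the sketch below parses a small made-up specification with Python's standard `xml.etree.ElementTree`. The element and attribute names (`architecture`, `unit`, `op`, `latency`, `issue_width`) are invented for illustration and are not SKA's actual schema, which the transcript does not show.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification; not SKA's real file format.
SPEC = """
<architecture name="toy" issue_width="2">
  <unit name="alu" latency="1"><op>add</op><op>icmp</op></unit>
  <unit name="fpu" latency="4"><op>fmul</op><op>fadd</op></unit>
  <unit name="lsu" latency="3"><op>load</op><op>store</op></unit>
</architecture>
"""


def load_spec(text):
    """Return (issue_width, {opcode: (unit_name, latency)})."""
    root = ET.fromstring(text)
    ops = {}
    for unit in root.findall("unit"):
        for op in unit.findall("op"):
            ops[op.text] = (unit.get("name"), int(unit.get("latency")))
    return int(root.get("issue_width")), ops


issue_width, op_table = load_spec(SPEC)
print(issue_width, op_table["fmul"])
```

A table like `op_table` is the kind of lookup a pipeline model needs: which logical unit handles an instruction and how long its result takes.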

Slide 24: Pipeline simulation
Figure: Pipeline simulation of saxpy.ll
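A static pipeline simulation can be illustrated with a very small in-order issue model. An instruction issues once its operands are ready and an issue slot is free, and waiting cycles are counted as stalls. This sketch is an assumption-laden stand-in and is not SKA's simulator: the latencies, the `(dest, op, operands)` encoding and the function `simulate` are all hypothetical.

```python
def simulate(instrs, latency, issue_width=1):
    """In-order issue model returning instruction, cycle and stall counts."""
    ready = {}             # value -> cycle at which its result is available
    cycle = issued = stalls = finish = 0
    for dest, op, operands in instrs:
        if issued == issue_width:        # issue slots used up this cycle
            cycle, issued = cycle + 1, 0
        earliest = max((ready.get(o, 0) for o in operands), default=0)
        if earliest > cycle:             # wait for an operand: data stall
            stalls += earliest - cycle
            cycle, issued = earliest, 0
        issued += 1
        done = cycle + latency[op]
        finish = max(finish, done)
        if dest is not None:
            ready[dest] = done
    return {"instructions": len(instrs), "cycles": finish, "stalls": stalls}


# Assumed latencies and a simplified saxpy.ll loop body.
latency = {"load": 3, "fmul": 4, "fadd": 4, "store": 3}
body = [
    ("%1", "load", ["%0"]),
    ("%3", "load", ["%2"]),
    ("%5", "load", ["%4"]),
    ("%6", "fmul", ["%3", "%5"]),
    ("%7", "fadd", ["%1", "%6"]),
    (None, "store", ["%7", "%4"]),
]

stats = simulate(body, latency)
print(stats)
```

With single issue, the dependent `fmul`, `fadd` and `store` chain accounts for all of the stall cycles in this example.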

Slide 25: Skaview
Figure: Graphical visualization of saxpy.ll

Slide 26: Code metrics
- SKA outputs metrics about the code.
- Primitive statistics include basic performance counters, such as instructions, cycles and stalls.
- Derived statistics are obtained from primitive statistics.
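As an example of deriving statistics from primitive counters, cycles per instruction (CPI) is the cycle count divided by the instruction count. The sketch below computes CPI and two other ratios. The counter values are made up for illustration, and the choice of derived metrics beyond CPI is an assumption, since the slides do not list which ones SKA reports.

```python
def derived_stats(primitive):
    """Compute ratio metrics from instruction, cycle and stall counters."""
    instructions = primitive["instructions"]
    cycles = primitive["cycles"]
    return {
        "cpi": cycles / instructions,               # cycles per instruction
        "ipc": instructions / cycles,               # instructions per cycle
        "stall_fraction": primitive["stalls"] / cycles,
    }


# Made-up counter values, not SKA results.
primitive = {"instructions": 200, "cycles": 300, "stalls": 60}
derived = derived_stats(primitive)
print(derived)
```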

Slide 27: Results for residual.ll
- CPI prediction is better after register allocation.

Slide 28: Results for ef_operator.ll
- No change in CPI prediction. Why?

Slide 29: Results for KNC (Knights Corner)
- Predicts CPI > 1.0 for KNC for single-threaded workloads.

Slide 30: Conclusion
- SKA now supports register allocation.
- Register allocation improves SKA’s fidelity by 5-10% across three architectures for a compute-intensive benchmark.
- Dynamic scheduling and cache models can further improve SKA fidelity.

Slide 31
- Questions? Thank you!