Memory Systems Performance Workshop 2004 © David Ryan Koes 2004. MSP 2004: Programmer Specified Pointer Independence. David Koes, Mihai Budiu, Girish Venkataramani.

Presentation transcript:

Programmer Specified Pointer Independence
David Koes, Mihai Budiu, Girish Venkataramani, Seth Copen Goldstein
MSP 2004

Outline
- Motivation
- #pragma independent
- Automated Annotation
- Evaluation
- Conclusion

Problem
- Potentially aliasing pointers inhibit compiler optimization.
- Fully determining pointer aliasing may be infeasible or expensive.
- How to get the benefit without paying the cost?

Memory Dependencies
- Memory dependencies inhibit optimization
  - Introduce edges into the dependence graph
  - Limit parallelization
  - Inhibit code motion
    – instruction scheduling
    – loop invariant code motion
    – partial redundancy elimination
    – register promotion
- Breaking memory dependencies is difficult
  - compile-time analysis is infeasible or expensive
  - run-time analysis is limited to a local window

Examples

    while(len--) {
        *p++ = *q++;
    }

There is a real data dependence between the load and store within a single iteration. Unroll loop to exploit parallelism.

Itanium assembly from gcc:

    .L26:
        mov r24 = r33
        mov r17 = r32
        adds r22 = 8, r33
        adds r19 = 8, r32
        adds r20 = 12, r33
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r24], 4
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r17] = r14, 4
        ld4 r23 = [r24]
        ;;
        st4 [r17] = r23
        ld4 r18 = [r22]
        ;;
        st4 [r19] = r18
        ld4 r16 = [r20]
        ;;
        st4 [r21] = r16
        br.cloop .L26
        ;;

Without memory dependence:

    .L26:
        mov r18 = r33
        mov r23 = r32
        adds r25 = 8, r33
        adds r24 = 12, r33
        adds r22 = 8, r32
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r18], 4
        ld4 r19 = [r25]
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r23] = r14, 4
        ld4 r16 = [r18]
        ld4 r20 = [r24]
        ;;
        .mmb
        st4 [r23] = r16
        st4 [r22] = r19
        st4 [r21] = r20
        br.cloop .L26
        ;;

Examples

Original loop:

    for(i = 0; i < len; i++) {
        ... = *q;
        ...
        *p = ...
    }

Loop invariant code motion:

    t0 = *q;    /* if loop will be executed */
    for(i = 0; i < len; i++) {
        ... = t0;
        ...
        *p = ...
    }

Register promotion:

    t0 = *q;
    for(i = 0; i < len; i++) {
        ... = t0;
        ...
        t1 = ...
    }
    *p = t1;    /* if loop was executed */

Hardware can't do this.

Pointer Analysis
- Memory disambiguation is important
- Hardware can't do everything, so have the compiler figure it out...

Easy!

    int p[10];
    foo() {
        int q[10];
        ...
    }

Harder.. need precise dataflow analysis:

    foo() {
        int *p, *q;
        int a, b;
        if(...) {
            p = &a; q = &b;
        } else {
            p = &b; q = &a;
        }
        ...
    }

Requires inter-procedural information:

    foo(int *p, int *q) {
        ...
    }

Inter-procedural Pointer Analysis
- Just apply the same techniques as used for intraprocedural analysis
  - may not be possible
    – gcc -c foo.c
  - may not be feasible
    – n^2 analysis on source code of Microsoft Office?
- Use less precise analysis
  - still might not be possible (separate compilation, libraries)
  - still takes time (every time you compile, or at least link)
  - less precise => less optimization

Alternative: Have Programmer Do It
- Programmer annotates source code
  - informs compiler of pointer relationships
- Previous work
  - ANSI C99 restrict keyword
    – difficult for compiler and programmer to reason about
    – non-local semantics
  - MIPSpro #pragma ivdep
    – break loop carried dependence in inner loop

Outline
- Motivation
- #pragma independent
- Automated Annotation
- Evaluation
- Conclusion

#pragma independent

Syntax:

    #pragma independent ptr1 ptr2

Example:

    int x[100];
    int y;
    void foo(int *a, int *b) {
    #pragma independent a b
        int arr[50];
        …
    }

[Figure: memory objects x, y, malloc_site_1, arr, malloc_site_2]

The pointers are guaranteed to always point to different objects.

Examples

    void f(int len, int * p, int * q) {
    #pragma independent p q
        while (len--)
            *p++ = *q++;
    }

    void example(int *a, int *b, int *c) {
    #pragma independent a b
    #pragma independent a c
        (*b)++;
        *a = *b;
        *a = *a + *c;
    }

The pragmas allow the compiler to eliminate a store to *a.

#pragma independent
- Advantages
  - more flexible and powerful than restrict
  - relationships between pointers are explicit
  - easy to reason about
    – affects only listed pointers
  - easy to implement in compiler
    – fewer than 100 lines of code
- Possible disadvantage
  - could take programmer a lot of time to annotate existing source

Outline
- Motivation
- #pragma independent
- Automated Annotation
- Evaluation
- Conclusion

Automated Annotation Toolflow

[Toolflow diagram. Components: source files (*.c, *.h), compiler, executable with runtime checks, inputs, execution, script, programmer, pragma aware compiler. Data passed between them: candidate pointer pairs, static scores, invalid pointer pairs, execution frequencies, pragma annotations ranked by score, source code with verified pragmas, faster executable.]

- Compiler finds interesting pointer pairs
  - pairs which inhibit optimization
  - pairs whose aliasing is unknown
- Inserts profiling code and checks

Automated Annotation Toolflow

[Toolflow diagram]

- Instrumented executable run on input
  - records pointers which conflict
  - counts number of pointer uses

Automated Annotation Toolflow

[Toolflow diagram]

- Script combines static and dynamic info
  - eliminates conflicting pairs
  - assigns score to each pair

Automated Annotation Toolflow

[Toolflow diagram]

- Programmer verifies pointer pairs
  - can verify high scoring pairs only

Example Output

    void summer(int *p, int *q, int n, int *result) {
    #pragma independent p q      /* score: 1100 */
    #pragma independent p result /* score: 15 */
    #pragma independent q result /* score: 12 */
        int i, sum = 0;
        for(i = 0; i < n; i++) {
            *p += *q;
            sum += *q;
        }
        *result = sum;
    }

[Figure: Sample Score Distribution]

Outline
- Motivation
- #pragma independent
- Automated Annotation
- Evaluation
- Conclusion

Targets & Benchmarks
- Targets
  - Itanium
    – EPIC/VLIW architecture
    – instruction scheduling important for good performance
  - ASH (Application Specific Hardware)
    – can take full advantage of parallelism
- Benchmarks
  - Mediabench
    – small, multimedia applications
    – can't time accurately on Itanium
  - Spec95, Spec2000
    – general purpose integer
    – longer running (sometimes days for ASH simulation)

Compilers
- gcc
  - not very sophisticated optimizations
  - -funroll-loops -O2
- CASH
  - more sophisticated optimizations
  - memory dependencies are first class objects
    – token edge
    – pragma independent removes edge

Questions
- Do we find a reasonable number of potential annotations? Yes!
- Do the annotations result in faster code? Yes!
- Does our scoring mechanism find the pointer pairs with the biggest impact on performance? Yes!
- How much time does the programmer have to spend verifying pragmas? Not a lot!

[Figure: Annotations Found]

Do the annotations result in faster code?

[Figure: Itanium Speedup]

Of 19 Spec benchmarks, these were the only ones to demonstrate measurable speedup.

Do the annotations result in faster code?

[Figure: CASH Speedup]

Does our scoring mechanism work?

[Figure: mpeg2_e]

How much time does the programmer have to spend?

[Figure: Verified Speedup]

Conclusions
- We've performed a limit study of pointer analysis
  - gcc doesn't fully exploit the results of pointer analysis
  - CASH and ASH can fully exploit parallelism
- Programmer specified annotations are effective
  - faster and more flexible than inter-procedural analysis
- Annotations can be automatically generated
  - automatic score successfully focuses programmer's attention
  - manual verification does not take long


ANSI C99 restrict keyword

"An object that is accessed through a restrict-qualified pointer has a special association with that pointer. This association, defined in 6.7.3.1 below, requires that all accesses to that object use, directly or indirectly, the value of that particular pointer. The intended use of the restrict qualifier (like the register storage class) is to promote optimization, and deleting all instances of the qualifier from all preprocessing translation units composing a conforming program does not change its meaning (i.e., observable behavior)."

ISO/IEC 9899, Second edition

restrict Example

    void f(int len, int * restrict p, int * restrict q) {
        while (len--)
            *p++ = *q++;
    }

restrict tells the compiler that p and q refer to different objects, enabling optimizations.

Problems with restrict

gcc's restrict Implementation
- No two restricted pointers can alias
- A restricted pointer and an unrestricted pointer may alias
- This definition is intuitive for both the programmer and compiler
- But not the C99 definition!