Memory Systems Performance Workshop (MSP) 2004
© David Ryan Koes

Programmer Specified Pointer Independence
David Koes, Mihai Budiu, Girish Venkataramani, Seth Copen Goldstein

Outline
  Motivation
  #pragma independent
  Automated Annotation
  Evaluation
  Conclusion

Problem
  Potentially aliasing pointers inhibit compiler optimization.
  Fully determining pointer aliasing may be infeasible or expensive.
  How to get the benefit without paying the cost?

Memory Dependencies
  Memory dependencies inhibit optimization
    Introduce edges into the dependence graph
    Limit parallelization
    Inhibit code motion
      – instruction scheduling
      – loop invariant code motion
      – partial redundancy elimination
      – register promotion
  Breaking memory dependencies is difficult
    compile-time analysis is infeasible or expensive
    run-time analysis is limited to a local window

Examples

    while(len--) {
        *p++ = *q++;
    }

  There is a real data dependence between the load and store within a single iteration.
  Unroll loop to exploit parallelism.

  Itanium assembly from gcc:

    .L26:
        mov r24 = r33
        mov r17 = r32
        adds r22 = 8, r33
        adds r19 = 8, r32
        adds r20 = 12, r33
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r24], 4
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r17] = r14, 4
        ld4 r23 = [r24]
        ;;
        st4 [r17] = r23
        ld4 r18 = [r22]
        ;;
        st4 [r19] = r18
        ld4 r16 = [r20]
        ;;
        st4 [r21] = r16
        br.cloop .L26
        ;;

  Without memory dependence:

    .L26:
        mov r18 = r33
        mov r23 = r32
        adds r25 = 8, r33
        adds r24 = 12, r33
        adds r22 = 8, r32
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r18], 4
        ld4 r19 = [r25]
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r23] = r14, 4
        ld4 r16 = [r18]
        ld4 r20 = [r24]
        ;;
        .mmb
        st4 [r23] = r16
        st4 [r22] = r19
        st4 [r21] = r20
        br.cloop .L26
        ;;

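The scheduling freedom shown in the second assembly listing can be sketched at the C level. This is a minimal illustration, not the compiler's actual output: copy_unrolled and ranges_disjoint are hypothetical names, and size_t replaces the slide's int len. Once the two ranges are known not to overlap, all four loads of an unrolled iteration may be issued before any of the stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when [p, p+n) and [q, q+n) do not overlap.
 * Comparing addresses through uintptr_t is implementation-defined but
 * behaves as expected on common flat-memory platforms. */
static int ranges_disjoint(const int *p, const int *q, size_t n)
{
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    size_t bytes = n * sizeof(int);
    return a + bytes <= b || b + bytes <= a;
}

/* Copy n ints from q to p, four per iteration, with every load of an
 * iteration hoisted above every store. This reordering is only legal
 * when the ranges are independent, which the assert checks. */
static void copy_unrolled(int *p, const int *q, size_t n)
{
    size_t i = 0;
    assert(ranges_disjoint(p, q, n));
    for (; i + 4 <= n; i += 4) {
        int t0 = q[i], t1 = q[i + 1], t2 = q[i + 2], t3 = q[i + 3];
        p[i] = t0;
        p[i + 1] = t1;
        p[i + 2] = t2;
        p[i + 3] = t3;
    }
    for (; i < n; i++)          /* leftover elements */
        p[i] = q[i];
}
```

If p and q could overlap (for example p == q + 1), hoisting the loads would read values that the original loop had already overwritten, which is why the compiler must keep the load/store ordering in the first listing.
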
Examples

  Original loop:

    for(i = 0; i < len; i++) {
        ... = *q;
        ...
        *p = ...
    }

  Loop invariant code motion:

    t0 = *q;                /* if loop will be executed */
    for(i = 0; i < len; i++) {
        ... = t0;
        ...
        *p = ...
    }

  Register promotion:

    t0 = *q;
    for(i = 0; i < len; i++) {
        ... = t0;
        ...
        t1 = ...
    }
    *p = t1;                /* if loop was executed */

  Hardware can’t do this

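A concrete instance of these two transformations, as a sketch: the slide elides the loop body, so the body `*p += *q` and both function names below are assumptions chosen for illustration.

```c
#include <assert.h>

/* Untransformed loop: *q is reloaded and *p is stored on every iteration,
 * because the compiler must assume p and q may point to the same int. */
static void accumulate_naive(int *p, const int *q, int len)
{
    for (int i = 0; i < len; i++)
        *p += *q;
}

/* After loop invariant code motion (the load of *q) and register
 * promotion (the running value of *p). Equivalent to the naive loop
 * only when p and q are independent. */
static void accumulate_promoted(int *p, const int *q, int len)
{
    assert(p != q);
    if (len > 0) {              /* only touch memory if the loop runs */
        int t0 = *q;            /* hoisted invariant load */
        int t1 = *p;            /* *p kept in a register */
        for (int i = 0; i < len; i++)
            t1 += t0;
        *p = t1;                /* single store after the loop */
    }
}
```

When the pointers do alias, the naive loop doubles the value on every iteration, so the promoted version would compute a different result. That is the dependence the compiler has to respect without extra information.
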
Pointer Analysis
  Memory disambiguation is important
    hardware can’t do everything
    so have compiler figure it out...

  easy!

    int p[10];
    foo() {
        int q[10];
        ...
    }

  harder.. need precise dataflow analysis

    foo() {
        int *p, *q;
        int a, b;
        if(...) { p = &a; q = &b; }
        else { p = &b; q = &a; }
        ...
    }

  requires inter-procedural information

    foo(int *p, int *q) {
        ...
    }

Inter-procedural Pointer Analysis
  Just apply same techniques as used for intraprocedural
    may not be possible
      – gcc -c foo.c
    may not be feasible
      – n^2 analysis on source code of Microsoft Office?
  Use less precise analysis
    still might not be possible (separate compilation, libraries)
    still takes time (every time you compile, or at least link)
    less precise => less optimization

Alternative: Have Programmer Do It
  Programmer annotates source code
    informs compiler of pointer relationships
  Previous Work
    ANSI C99 restrict keyword
      – difficult for compiler and programmer to reason about
      – non-local semantics
    MIPSpro #pragma ivdep
      – break loop carried dependence in inner loop

#pragma independent
  Syntax

    #pragma independent ptr1 ptr2

  Example

    int x[100];
    int y;
    void foo(int *a, int *b) {
    #pragma independent a b
        int arr[50];
        …
    }

  [Diagram: objects x, y, malloc_site_1, arr, malloc_site_2]
  pointers guaranteed to always point to different objects

Examples

    void f(int len, int * p, int * q) {
    #pragma independent p q
        while (len--)
            *p++ = *q++;
    }

    void example(int *a, int *b, int *c) {
    #pragma independent a b
    #pragma independent a c
        (*b)++;
        *a = *b;
        *a = *a + *c;
    }

  pragmas allow compiler to eliminate a store to *a

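To make the eliminated store concrete, here is a hand-written sketch of what a compiler could emit for example once a is known to be independent of b and c. The names example_reference and example_optimized are hypothetical, and the pragmas appear only in comments so the code compiles with any C99 compiler.

```c
#include <assert.h>

/* The function from the slide, without annotations. */
static void example_reference(int *a, int *b, int *c)
{
    (*b)++;
    *a = *b;
    *a = *a + *c;
}

/* Assumes: #pragma independent a b
 *          #pragma independent a c
 * The first store to *a is dead, because neither *b nor *c can observe
 * it. Note that b and c may still alias each other, so *c is read after
 * the store to *b. */
static void example_optimized(int *a, int *b, int *c)
{
    assert(a != b && a != c);
    int tb = *b + 1;
    *b = tb;
    *a = tb + *c;               /* one store to *a instead of two */
}
```

Both versions agree whenever the asserted independence holds, including the case b == c, which the pragmas do not rule out.
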
#pragma independent
  Advantages
    more flexible and powerful than restrict
    relationships between pointers explicit
    easy to reason about
      – affects only listed pointers
    easy to implement in compiler
      – fewer than 100 lines of code
  Possible Disadvantage
    could take programmer a lot of time to annotate existing source

Automated Annotation Toolflow
  [Toolflow diagram. Boxes: compiler, execution, script, programmer, pragma aware compiler. Data: *.c *.h, inputs, executable with runtime checks, invalid pointer pairs, execution frequencies, candidate pointer pairs, static scores, pragma annotations ranked by score, source code with verified pragmas, faster executable]
  Compiler finds interesting pointer pairs
    pairs which inhibit optimization
    pairs whose aliasing is unknown
  Inserts profiling code and checks

Automated Annotation Toolflow
  Instrumented executable run on input
    records pointers which conflict
    counts number of pointer uses

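One way such instrumentation might look, as a sketch only: the slides do not describe the runtime checks, so pair_profile, profile_init, note_access, pair_conflicts and f_instrumented are all hypothetical. This version records the address range touched through each pointer of a candidate pair and counts the accesses. Comparing ranges is conservative and may report a conflict for interleaved accesses that never touch the same address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical profile record for one candidate pointer pair. */
struct pair_profile {
    uintptr_t lo[2], hi[2];     /* lowest/highest address seen per pointer */
    unsigned long uses;         /* number of recorded pointer uses */
};

static void profile_init(struct pair_profile *r)
{
    r->lo[0] = r->lo[1] = UINTPTR_MAX;
    r->hi[0] = r->hi[1] = 0;
    r->uses = 0;
}

/* Record one access through pointer number `which` (0 or 1) of the pair. */
static void note_access(struct pair_profile *r, int which, const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    assert(which == 0 || which == 1);
    if (a < r->lo[which]) r->lo[which] = a;
    if (a > r->hi[which]) r->hi[which] = a;
    r->uses++;
}

/* True if the two recorded address ranges overlap. */
static int pair_conflicts(const struct pair_profile *r)
{
    return r->lo[0] <= r->hi[1] && r->lo[1] <= r->hi[0];
}

/* The copy loop from the earlier example with the checks inserted. */
static void f_instrumented(struct pair_profile *r, int len, int *p, int *q)
{
    while (len--) {
        note_access(r, 0, p);
        note_access(r, 1, q);
        *p++ = *q++;
    }
}
```

A pair that conflicts on any input is invalid and cannot be annotated; a pair that never conflicts is only a candidate, since other inputs might still expose aliasing.
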
Automated Annotation Toolflow
  Script combines static and dynamic info
    eliminates conflicting pairs
    assigns score to each pair

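The script's job can be sketched as a filter followed by a sort. The slides do not give the scoring formula, so the product of static score and execution count used here is purely an assumption, as are the struct candidate layout and the function names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record combining compiler output and profile data. */
struct candidate {
    const char *first, *second;     /* names of the two pointers */
    unsigned long static_score;     /* from the compiler */
    unsigned long exec_count;       /* from the instrumented run */
    int conflicted;                 /* nonzero if the pair aliased at run time */
    unsigned long score;            /* filled in by rank_candidates */
};

/* qsort comparator: highest score first. */
static int by_score_desc(const void *x, const void *y)
{
    const struct candidate *a = x, *b = y;
    return (a->score < b->score) - (a->score > b->score);
}

/* Drop conflicting pairs, score the rest, and sort them in place.
 * Returns the number of surviving candidates.
 * ASSUMPTION: score = static_score * exec_count; the real formula is
 * not stated in the slides. */
static size_t rank_candidates(struct candidate *c, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (c[i].conflicted)
            continue;
        c[kept] = c[i];
        c[kept].score = c[kept].static_score * c[kept].exec_count;
        kept++;
    }
    qsort(c, kept, sizeof c[0], by_score_desc);
    return kept;
}
```

Sorting by score lets the programmer verify only the highest-ranked pairs and stop once the remaining scores become negligible.
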
Automated Annotation Toolflow
  Programmer verifies pointer pairs
    can verify high scoring pairs only

Example Output

    void summer(int *p, int *q, int n, int *result) {
    #pragma independent p q /* score: 1100 */
    #pragma independent p result /* score: 15 */
    #pragma independent q result /* score: 12 */
        int i, sum = 0;
        for(i = 0; i < n; i++) {
            *p += *q;
            sum += *q;
        }
        *result = sum;
    }

Sample Score Distribution

Targets & Benchmarks
  Targets
    Itanium
      EPIC/VLIW architecture
      instruction scheduling important for good performance
    ASH (Application Specific Hardware)
      can take full advantage of parallelism
  Benchmarks
    Mediabench
      small, multimedia applications
      can’t time accurately on Itanium
    Spec95, Spec2000
      general purpose integer
      longer running
        – sometimes days for ASH simulation

Compilers
  gcc
    not very sophisticated optimizations
    -funroll-loops -O2
  CASH
    more sophisticated optimizations
    memory dependencies are first class objects
      – token edge
      – pragma independent removes edge

Questions
  Do we find a reasonable number of potential annotations?
    Yes!
  Do the annotations result in faster code?
    Yes!
  Does our scoring mechanism find the pointer pairs with the biggest impact on performance?
    Yes!
  How much time does the programmer have to spend verifying pragmas?
    Not a lot!

Annotations Found

Do the annotations result in faster code?
  Itanium Speedup
  Of 19 Spec benchmarks, these were the only ones to demonstrate measurable speedup

Do the annotations result in faster code?
  CASH Speedup

Does our scoring mechanism work?
  mpeg2_e

How much time does the programmer have to spend?

Verified Speedup

Conclusions
  We’ve performed a limit study of pointer analysis
    gcc doesn’t fully exploit the results of pointer analysis
    CASH and ASH can fully exploit parallelism
  Programmer specified annotations are effective
    faster and more flexible than inter-procedural analysis
  Annotations can be automatically generated
    automatic score successfully focuses programmer’s attention
    manual verification does not take long

ANSI C99 restrict keyword

  "An object that is accessed through a restrict-qualified pointer has a special association with that pointer. This association, defined in 6.7.3.1 below, requires that all accesses to that object use, directly or indirectly, the value of that particular pointer. The intended use of the restrict qualifier (like the register storage class) is to promote optimization, and deleting all instances of the qualifier from all preprocessing translation units composing a conforming program does not change its meaning (i.e., observable behavior)."

  ISO/IEC 9899, Second edition

restrict Example

    void f(int len, int * restrict p, int * restrict q) {
        while (len--)
            *p++ = *q++;
    }

  restrict tells the compiler that p and q refer to different objects, enabling optimizations

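Because restrict is a promise made by the caller, calling f on overlapping ranges is undefined behavior. A defensive wrapper can be sketched as follows; copy_checked is a hypothetical name, and the address comparison through uintptr_t is implementation-defined, although it behaves as expected on common flat-memory platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The restrict-qualified copy loop from the slide (requires C99). */
static void f(int len, int *restrict p, int *restrict q)
{
    while (len--)
        *p++ = *q++;
}

/* Hypothetical wrapper: use the restrict version only when the two
 * ranges are provably disjoint, and fall back to memmove otherwise. */
static void copy_checked(int len, int *p, int *q)
{
    assert(len >= 0);
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    size_t bytes = (size_t)len * sizeof(int);
    if (a + bytes <= b || b + bytes <= a)
        f(len, p, q);               /* promise to the compiler holds */
    else
        memmove(p, q, bytes);       /* overlap-safe fallback */
}
```

The extra comparison is cheap relative to the copy, and it keeps the optimized loop from ever running on inputs that violate the restrict contract.
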
Problems with restrict

gcc’s restrict Implementation
  No two restricted pointers can alias
  A restricted pointer and an unrestricted pointer may alias
  This definition is intuitive for both the programmer and compiler
  But not the C99 definition!