Machine-Independent Optimization

Slides:

Advertisements

Similar presentations

Advertisements

Code Optimization and Performance Chapter 5 CS 105 Tour of the Black Holes of Computing.

Program Optimization (Chapter 5)

ECE 454 Computer Systems Programming Compiler and Optimization (I) Ding Yuan ECE Dept., University of Toronto

Programming and Data Structure

1 Today’s lecture  Last lecture we started talking about control flow in MIPS (branches)  Finish up control-flow (branches) in MIPS —if/then —loops —case/switch.

1 Code Optimization(II). 2 Outline Understanding Modern Processor –Super-scalar –Out-of –order execution Suggested reading –5.14,5.7.

Winter 2015 COMP 2130 Intro Computer Systems Computing Science Thompson Rivers University Program Optimization.

Fall 2011SYSC 5704: Elements of Computer Systems 1 SYSC 5704 Elements of Computer Systems Optimization to take advantage of hardware.

Carnegie Mellon 1 Program Optimization : Introduction to Computer Systems 25 th Lecture, Nov. 23, 2010 Instructors: Randy Bryant and Dave O’Hallaron.

Code Optimization I September 24, 2007 Topics Machine-Independent Optimizations Basic optimizations Optimization blockers class08.ppt F’07.

1 Program Optimization Professor Jennifer Rexford

Code Optimization I: Machine Independent Optimizations Sept. 26, 2002 Topics Machine-Independent Optimizations Code motion Reduction in strength Common.

ARRAYS AND POINTERS Although pointer types are not integer types, some integer arithmetic operators can be applied to pointers. The affect of this arithmetic.

Code Optimization 1. Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.1 ~

IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.

1 Code Optimization. 2 Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.2 ~ 5.6.

Recitation 7: 10/21/02 Outline Program Optimization –Machine Independent –Machine Dependent Loop Unrolling Blocking Annie Luo

University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 1 Computer Systems Optimizing program performance.

University of Amsterdam Computer Systems – optimizing program performance Arnoud Visser 1 Computer Systems Optimizing program performance.

RUN-Time Organization Compiler phase— Before writing a code generator, we must decide how to marshal the resources of the target machine (instructions,

Introduction to ECE 454 Computer Systems Programming Topics: Lecture topics and assignments Profiling rudiments Lab schedule and rationale Cristiana Amza.

Computer Organization and Design Pointers, Arrays and Strings in C Montek Singh Sep 18, 2015 Lab 5 supplement.

Compiler Optimizations ECE 454 Computer Systems Programming Topics: The Role of the Compiler Common Compiler (Automatic) Code Optimizations Cristiana Amza.

Code Optimization II: Machine Dependent Optimization Topics Machine-Dependent Optimizations Unrolling Enabling instruction level parallelism.

Optimization of C Code The C for Speed

Machine Independent Optimizations Topics Code motion Reduction in strength Common subexpression sharing.

Copyright 2014 – Noah Mendelsohn Code Tuning Noah Mendelsohn Tufts University Web:

Code Optimization and Performance CS 105 “Tour of the Black Holes of Computing”

Programming for Performance CS 740 Oct. 4, 2000 Topics How architecture impacts your programs How (and how not) to tune your code.

Machine-Dependent Optimization CS 105 “Tour of the Black Holes of Computing”

1 Code Optimization. 2 Outline Machine-Independent Optimization –Code motion –Memory optimization Suggested reading –5.2 ~ 5.6.

Code Optimization and Performance I Chapter 5 perf01.ppt CS 105 “Tour of the Black Holes of Computing”

Computer Organization and Design Pointers, Arrays and Strings in C

Code Optimization.

Introduction To Computer Systems

Machine-Level Programming IV: Data

Machine-Level Programming IV: Data

Today’s Instructor: Phil Gibbons

Optimization Code Optimization ©SoftMoore Consulting.

CS 3114 Many of the following slides are taken with permission from

Code Optimization I: Machine Independent Optimizations

Instructors: Dave O’Hallaron, Greg Ganger, and Greg Kesden

Program Optimization CENG331 - Computer Organization

Instructors: Pelin Angin and Erol Sahin

Program Optimization CSCE 312

Machine-Dependent Optimization

Code Optimization /18-213/14-513/15-513: Introduction to Computer Systems 10th Lecture, September 27, 2018.

Introduction to Computer Systems

Machine-Level Programming IV: Data

Code Optimization and Performance

Code Optimization I: Machine Independent Optimizations Feb 11, 2003

Roadmap C: Java: Assembly language: OS: Machine code: Computer system:

Machine-Level Programming IV: Data

Code Optimization(II)

Code Optimization April 6, 2000

Code Optimization I Nov. 25, 2008

Machine-Level Programming 5 Structured Data

Introduction to Computer Systems

Machine-Level Programming 5 Structured Data

COMP 2130 Intro Computer Systems Thompson Rivers University

Optimizing program performance

Cache Memories Lecture, Oct. 30, 2018

Program Optimization CSE 238/2038/2138: Systems Programming

CS 3114 Many of the following slides are taken with permission from

Instructor: Fatma CORUT ERGİN

Lecture 11: Machine-Dependent Optimization

Machine-Level Programming VIII: Data Comp 21000: Introduction to Computer Systems & Assembly Lang Spring 2017 Systems book chapter 3* * Modified slides.

Code Optimization and Performance

Presentation transcript:

Machine-Independent Optimization

Machine-Independent Optimization Outline Machine-Independent Optimization Code motion Memory optimization Optimizing Blockers Memory alias Side effect in function call Suggested reading 5.3，5.2，5.4 ~ 5.6, 5.1

Motivation Optimization? easily see 10:1 performance range depending on how code is written can optimize at multiple levels algorithm, data representations, procedures, and loops Constant factors matter too

Understand system to optimize performance Motivation Understand system to optimize performance how programs are compiled and executed how to measure program performance and identify bottlenecks how to improve performance without destroying code modularity and generality ……

Vector ADT typedef struct { int len ; data_t *data ; length data    1 2 length–1 typedef struct { int len ; data_t *data ; } vec_rec, *vec_ptr ; typedef int data_t ;

vec_ptr new_vec(int len) int vec_length(vec_ptr) Procedures vec_ptr new_vec(int len) Create vector of specified length int vec_length(vec_ptr) Return length of vector data_t *get_vec_start(vec_ptr v) Return pointer to start of vector data int get_vec_element(vec_ptr v, int index, int *dest) Retrieve vector element, store at *dest Return 0 if out of bounds, 1 if successful Similar to array implementations in Pascal, Java E.g., always do bounds checking

Vector ADT vec_ptr new_vec(int len) { /* allocate header structure */ vec_ptr result = (vec_ptr) malloc(sizeof(vec_rec)) ; if ( !result ) return NULL ; result->len = len ;

Vector ADT /* allocate array */ if ( len > 0 ) { data_t *data = (data_t *)calloc(len, sizeof(data_t)) ; if ( !data ) { free( (void *)result ) ; return NULL ; /* couldn’t allocte stroage */ } result->data = data } else result->data = NULL return result ;

Vector ADT /* * Retrieve vector element and store at dest. * Return 0 (out of bounds) or 1 (successful) */ int get_vec_element(vec_ptr v, int index, data_t *dest) { if ( index < 0 || index >= v->len) return 0 ; *dest = v->data[index] ; return 1; }

Vector ADT /* Return length of vector */ int vec_length(vec_ptr) { return v->len ; } /* Return pointer to start of vector data */ data_t *get_vec_start(vec_ptr v) return v->data ;

Optimization Example #ifdef ADD #define IDENT 0 #define OP + typedef int data_t; #else #define IDENT 1 #define OP * typedef double data_t #endif

Optimization Example void combine1(vec_ptr v, data_t *dest) { long int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; } Compute sum (product) of all elements of vector Store result at destination location

Time Scales Absolute Time Clock Cycles Typically use nanoseconds: 10–9 seconds Time scale of computer instructions Clock Cycles Typical Range 100 MHz 108 cycles per second, clock period = 10ns 2 GHz 2 X 109 cycles per second, clock period = 0.5ns

Cycles Per Element Convenient way to express performance of program that operators on vectors or lists Length = n T = CPE*n + Overhead

CPE 1 void psum1(float a[], float p[], long int n) 2 { 3 long int i; 4 5 p[0] = a[0] ; 6 for (i = 1; i < n; i++) 7 p[i] = p[i-1] + a[i]; 8 } 9

CPE 10 void psum2(float a[], float p[]; long int n) 11 { 12 long int i; 13 p[0] = a[0] ; 14 for (i = 1; i < n-1; i+=2) { 15 float mid_val = p[i-1] + a[i] ; 16 p[i] = mid_val ; 17 p[i+1] = mid_val + a[i+1]; 18 } /* For odd n, finish remaining element */ if ( i < n ) p[i] = p[i-1] + a[i] ; }

Cycles Per Element

Time Scales

Understanding Loop void combine1(vec_ptr v, data_t *dest) { long int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; }

Procedure vec_length called every iteration Understanding Loop void combine1-goto(vec_ptr v, data_t *dest) { long int i = 0; data_t val; *dest = 0; if (i >= vec_length(v)) goto done; loop: get_vec_element(v, i, &val); *dest += val; i++; if (i < vec_length(v)) goto loop done: } Procedure vec_length called every iteration Even though result always the same 1 iteration

Code Motion void combine2(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); *dest = IDENT; for (i = 0; i < length; i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; }

Code Motion Optimization Move call to vec_length out of inner loop Value does not change from one iteration to next Code motion

Code Motion 1 /* Convert string to lowercase: slow */ 2 void lower1(char *s) 3 { 4 int i; 5 6 for (i = 0; i < strlen(s); i++) 7 if (s[i] >= ’A’ && s[i] <= ’Z’) 8 s[i] -= (’A’ - ’a’); 9 } 10

Code Motion 11 /* Convert string to lowercase: faster */ 12 void lower2(char *s) 13 { 14 int i; 15 int len = strlen(s); 16 17 for (i = 0; i < len; i++) 18 if (s[i] >= ’A’ && s[i] <= ’Z’) 19 s[i] -= (’A’ - ’a’); 20 } 21

Code Motion

Code Motion 22 /* Sample implementation of library function strlen */ 23 /* Compute length of string */ 24 size_t strlen(const char *s) 25 { 26 int length = 0; 27 while (*s != ’\0’) { 28 s++; 29 length++; 30 } 31 return length; 32 }

Code Motion (before further opt) void combine2(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); *dest = IDENT; for (i = 0; i < length; i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; }

Reduction in Strength void combine3(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); *dest = IDENT; for ( i = 0 ; i < length ; i++ ) { *dest = *dest OP data[i] ; }

Reduction in Strength Optimization Avoid procedure call to retrieve each vector element Get pointer to start of array before loop Within loop just do pointer reference Not as clean in terms of data abstraction

Eliminate Unneeded Memory References combine3: data_t = float, OP = * i in %rdx, data in %rax, dest in %rbp 1 .L498: loop: 2 movss (%rbp), %xmm0 Read product from dest 3 mulss (%rax,%rdx,4), %xmm0 Multiply product by data[i] 4 movss %xmm0, (%rbp) Store product at dest 5 addq $1, %rdx Increment i 6 cmpq %rdx, %r12 Compare i:limit 7 jg .L498 If >, goto loop

Eliminate Unneeded Memory References void combine4(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); data_t acc = IDENT; for (i = 0; i < length; i++) acc = acc OP data[i]; *dest = acc; }

Eliminate Unneeded Memory References combine4: data_t = float, OP = * i in %rdx, data in %rax, limit in %rbp, acc in %xmm0 1 .L488: loop: 2 mulss (%rax,%rdx,4), %xmm0 Multiply acc by data[i] 3 addq $1, %rdx Increment i 4 cmpq %rdx, %rbp Compare limit:i 5 jg .L488 If >, goto loop

Eliminate Unneeded Memory References Optimization Don’t need to store in destination until end Local variable sum held in register Avoids 1 memory read, 1 memory write per cycle

Machine Independent Opt. Results Optimizations Reduce function calls and memory references within loop

Provide efficient mapping of program to machine Optimizing Compilers Provide efficient mapping of program to machine register allocation code selection and ordering eliminating minor inefficiencies ……

Don’t (usually) improve asymptotic efficiency Optimizing Compilers Don’t (usually) improve asymptotic efficiency up to programmer to select best overall algorithm big-O savings are (often) more important than constant factors but constant factors also matter Have difficulty overcoming “optimization blockers” potential memory aliasing potential procedure side-effects

Optimization Blockers  Memory aliasing void twiddle1(int *xp, int *yp) { *xp += *yp ; } void twiddle2(int *xp, int *yp) *xp += 2* *yp ;

Optimization Blockers  Function call and side effect int f(int) ; int func1(x) { return f(x)+f(x)+f(x)+f(x) ; } int func2(x) return 4*f(x) ;

Optimization Blockers  Function call and side effect int counter = 0 ; int f(int x) { return counter++ ; }

Reduction in Strength void combine3(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); *dest = IDENT; for ( i = 0 ; i < length ; i++ ) { *dest = *dest OP data[i] ; }

Eliminate Unneeded Memory References void combine4(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); data_t acc = IDENT; for (i = 0; i < length; i++) acc = acc OP data[i]; *dest = acc; }

Optimization Blocker: Memory Aliasing Two different memory references specify single location Example v: [2, 3, 5] combine3(v, get_vec_start(v)+2) --> ? combine4(v, get_vec_start(v)+2) --> ? Function Intial Before loop i=0 i=1 i=2 Final Combine3 [2,3,5] [2,3,1] [2,3,2] [2,3,6] [2,3,36] combine4 [2,3,30]

Optimization Blocker: Memory Aliasing Observations Easy to have happen in C Since allowed to do address arithmetic Direct access to storage structures Get in habit of introducing local variables Accumulating within loops Your way of telling compiler not to check for aliasing

Limitations of Optimizing Compilers Operate Under Fundamental Constraint Must not cause any change in program behavior under any possible condition Often prevents it from making optimizations when would only affect behavior under pathological conditions. When in doubt, the compiler must be conservative

Example void combine1(vec_ptr v, data_t *dest) { long int i; *dest = IDENT; for (i = 0; i < vec_length(v); i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; }

Example void combine2(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); *dest = IDENT; for (i = 0; i < length; i++) { data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; }

Example void combine3(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); *dest = IDENT; for (i = 0; i < length; i++) { *dest = *dest OP data[i]; }

Example void combine4(vec_ptr v, data_t *dest) { long int i; long int length = vec_length(v); data_t *data = get_vec_start(v); data_t x = IDENT; for (i = 0; i < length; i++) x = x OP data[i]; *dest = x; }

Effectiveness of the Optimization Integer Floating point Function Method + * F* D* combine 1 Abstract -O1 12.00 combine 4 Accumulate in temporary 2.00 3.00 4.00 5.00