Machine Programming – Procedures and IA32 Stack. CENG331: Introduction to Computer Systems, 6th Lecture. Instructor: Erol Sahin. Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant and D.R. O’Hallaron of Carnegie Mellon University.
IA32 Stack Stack “Bottom” Region of memory managed with stack discipline Grows toward lower addresses Register %esp contains lowest stack address = address of “top” element Increasing Addresses Stack Grows Down Stack Pointer: %esp Stack “Top”
IA32 Stack: Push Stack “Bottom” pushl Src Stack Pointer: %esp Fetch operand at Src Decrement %esp by 4 Write operand at address given by %esp Increasing Addresses Stack Grows Down -4 Stack Pointer: %esp Stack “Top”
IA32 Stack: Pop Stack “Bottom” popl Dest Stack Pointer: %esp Read operand at address %esp Increment %esp by 4 Write operand to Dest Increasing Addresses Stack Grows Down +4 Stack Pointer: %esp Stack “Top”
Procedure Control Flow Use stack to support procedure call and return Procedure call: call label Push return address on stack Jump to label Return address: Address of instruction beyond call Example from disassembly 804854e: e8 3d 06 00 00 call 8048b90 <main> 8048553: 50 pushl %eax Return address = 0x8048553 Procedure return: ret Pop address from stack Jump to address
Procedure Call Example 804854e: e8 3d 06 00 00 call 8048b90 <main> 8048553: 50 pushl %eax call 8048b90 0x110 0x110 0x10c 0x10c 0x108 123 0x108 123 0x104 0x8048553 %esp 0x108 %esp 0x104 0x108 %eip 0x804854e %eip 0x8048b90 0x804854e %eip: program counter
Procedure Return Example 8048591: c3 ret ret 0x110 0x110 0x10c 0x10c 0x108 123 0x108 123 0x104 0x8048553 0x8048553 %esp 0x104 %esp 0x108 0x104 %eip 0x8048591 %eip 0x8048553 0x8048591 %eip: program counter
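The push, pop, call, and ret steps described above can be sketched as a toy model in C. This is only an illustration under stated assumptions: the `Machine` struct, the `MEM_BASE` constant, and the helper names are hypothetical and do not come from the slides; only the addresses and values used in the usage note are taken from the example.

```c
#include <stdint.h>

/* Toy model of the IA32 stack. 'mem' backs the addresses 0x100..0x11c,
   'esp' and 'eip' stand in for the real registers. All names here are
   illustrative assumptions, not part of the lecture material. */
enum { MEM_BASE = 0x100, MEM_WORDS = 8 };

typedef struct {
    uint32_t mem[MEM_WORDS];
    uint32_t esp;   /* address of the current "top" element */
    uint32_t eip;   /* program counter */
} Machine;

static uint32_t *word_at(Machine *m, uint32_t addr) {
    return &m->mem[(addr - MEM_BASE) / 4];
}

/* pushl Src: decrement %esp by 4, then write the operand at %esp. */
void push(Machine *m, uint32_t value) {
    m->esp -= 4;
    *word_at(m, m->esp) = value;
}

/* popl Dest: read the operand at %esp, then increment %esp by 4. */
uint32_t pop(Machine *m) {
    uint32_t value = *word_at(m, m->esp);
    m->esp += 4;
    return value;
}

/* call label: push the address of the instruction beyond the call,
   then jump to the target. 'call_length' is the size of the call
   instruction in bytes (5 for the e8 form in the example). */
void call(Machine *m, uint32_t target, uint32_t call_length) {
    push(m, m->eip + call_length);
    m->eip = target;
}

/* ret: pop the return address from the stack and jump to it. */
void ret(Machine *m) {
    m->eip = pop(m);
}
```

Starting with `esp = 0x108` and `eip = 0x804854e`, `call(&m, 0x8048b90, 5)` leaves `0x8048553` at address `0x104` and `esp = 0x104`; a later `ret` restores `esp = 0x108` and sets `eip = 0x8048553`, matching the two example slides.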
Stack-Based Languages Languages that support recursion e.g., C, Pascal, Java Code must be “Reentrant” Multiple simultaneous instantiations of single procedure Need some place to store state of each instantiation Arguments Local variables Return pointer Stack discipline State for given procedure needed for limited time From when called to when return Callee returns before caller does Stack allocated in Frames state for single procedure instantiation
Call Chain Example Example Call Chain yoo(…) { • who(); } yoo who(…) { • • • amI(); } who amI(…) { • amI(); } amI amI amI amI Procedure amI is recursive
Stack Frames Previous Frame Contents Frame for proc Management Local variables Return information Temporary space Management Space allocated when enter procedure “Set-up” code Deallocated when return “Finish” code Frame Pointer: %ebp Frame for proc Stack Pointer: %esp Stack “Top”
Example Stack yoo(…) { • who(); } yoo %ebp yoo who %esp amI amI amI
Example Stack who(…) { • • • amI(); } yoo yoo who %ebp who amI amI %esp amI amI
Example Stack amI(…) { • amI(); } yoo yoo who who amI amI %ebp amI amI %esp amI
Example Stack amI(…) { • amI(); } yoo yoo who who amI amI amI amI amI %ebp amI %esp
Example Stack amI(…) { • amI(); } yoo yoo who who amI amI %ebp amI amI %esp amI
Example Stack who(…) { • • • amI(); } yoo yoo who %ebp who amI amI %esp amI amI
Example Stack amI(…) { • } yoo yoo who who amI amI %ebp amI amI %esp
Example Stack who(…) { • • • amI(); } yoo yoo who %ebp who amI amI %esp amI amI
Example Stack yoo(…) { • who(); } yoo %ebp yoo who %esp amI amI amI
IA32/Linux Stack Frame Current Stack Frame (“Top” to Bottom) “Argument build:” Parameters for function about to call Local variables If can’t keep in registers Saved register context Old frame pointer Caller Stack Frame Return address Pushed by call instruction Arguments for this call Caller Frame Arguments Frame pointer %ebp Return Addr Old %ebp Saved Registers + Local Variables Argument Build Stack pointer %esp
Calling swap from call_swap Revisiting swap Calling swap from call_swap int zip1 = 15213; int zip2 = 91125; void call_swap() { swap(&zip1, &zip2); } call_swap: • • • pushl $zip2 # Global Var pushl $zip1 # Global Var call swap • Resulting Stack void swap(int *xp, int *yp) { int t0 = *xp; int t1 = *yp; *xp = t1; *yp = t0; } &zip2 &zip1 Rtn adr %esp
Revisiting swap swap: pushl %ebp movl %esp,%ebp pushl %ebx movl 12(%ebp),%ecx movl 8(%ebp),%edx movl (%ecx),%eax movl (%edx),%ebx movl %eax,(%edx) movl %ebx,(%ecx) movl -4(%ebp),%ebx movl %ebp,%esp popl %ebp ret Set Up void swap(int *xp, int *yp) { int t0 = *xp; int t1 = *yp; *xp = t1; *yp = t0; } Body Finish
swap Setup #1 Entering Stack Resulting Stack • %ebp • %ebp &zip2 yp xp Rtn adr %esp Rtn adr Old %ebp %esp swap: pushl %ebp movl %esp,%ebp pushl %ebx
swap Setup #1 Entering Stack • %ebp • %ebp &zip2 yp &zip1 xp Rtn adr %esp Rtn adr Old %ebp %esp swap: pushl %ebp movl %esp,%ebp pushl %ebx
swap Setup #1 Entering Stack Resulting Stack • %ebp • &zip2 yp &zip1 xp Rtn adr %esp Rtn adr Old %ebp %ebp %esp swap: pushl %ebp movl %esp,%ebp pushl %ebx
swap Setup #1 Entering Stack • %ebp • &zip2 yp &zip1 xp Rtn adr %esp Old %ebp %ebp %esp swap: pushl %ebp movl %esp,%ebp pushl %ebx
swap Setup #1 12 8 4 Entering Stack Resulting Stack • %ebp • Offset relative to %ebp &zip2 12 yp &zip1 8 xp Rtn adr 4 %esp Rtn adr Old %ebp %ebp Old %ebx %esp movl 12(%ebp),%ecx # get yp movl 8(%ebp),%edx # get xp . . .
swap Finish #1 swap’s Stack Resulting Stack • • yp yp xp xp Rtn adr Rtn adr Old %ebp %ebp Old %ebp %ebp Old %ebx %esp Old %ebx %esp movl -4(%ebp),%ebx movl %ebp,%esp popl %ebp ret Observation: Saved and restored register %ebx
swap Finish #2 swap’s Stack • • yp yp xp xp Rtn adr Rtn adr Old %ebp Old %ebx %esp Old %ebx %esp movl -4(%ebp),%ebx movl %ebp,%esp popl %ebp ret
swap Finish #2 swap’s Stack Resulting Stack • • yp yp xp xp Rtn adr Old %ebp %ebp Old %ebp %ebp %esp Old %ebx %esp movl -4(%ebp),%ebx movl %ebp,%esp popl %ebp ret
swap Finish #2 swap’s Stack • • yp yp xp xp Rtn adr Rtn adr Old %ebp %esp Old %ebx %esp movl -4(%ebp),%ebx movl %ebp,%esp popl %ebp ret
swap Finish #3 swap’s Stack Resulting Stack • • %ebp yp yp xp xp Rtn adr Rtn adr %esp Old %ebp %ebp Old %ebx %esp movl -4(%ebp),%ebx movl %ebp,%esp popl %ebp ret
swap Finish #4 swap’s Stack • • %ebp yp yp xp xp Rtn adr Rtn adr %esp Old %ebp %ebp Old %ebx %esp movl -4(%ebp),%ebx movl %ebp,%esp popl %ebp ret
swap Finish #4 Observation swap’s Stack Resulting Stack • • %ebp yp yp xp xp %esp Rtn adr Old %ebp %ebp Old %ebx %esp movl -4(%ebp),%ebx movl %ebp,%esp popl %ebp ret Observation Saved & restored register %ebx Didn’t do so for %eax, %ecx, or %edx
Disassembled swap Calling Code 080483a4 <swap>: 80483a4: 55 push %ebp 80483a5: 89 e5 mov %esp,%ebp 80483a7: 53 push %ebx 80483a8: 8b 55 08 mov 0x8(%ebp),%edx 80483ab: 8b 4d 0c mov 0xc(%ebp),%ecx 80483ae: 8b 1a mov (%edx),%ebx 80483b0: 8b 01 mov (%ecx),%eax 80483b2: 89 02 mov %eax,(%edx) 80483b4: 89 19 mov %ebx,(%ecx) 80483b6: 5b pop %ebx 80483b7: c9 leave 80483b8: c3 ret Calling Code 8048409: e8 96 ff ff ff call 80483a4 <swap> 804840e: 8b 45 f8 mov 0xfffffff8(%ebp),%eax
Register Saving Conventions When procedure yoo calls who: yoo is the caller who is the callee Can Register be used for temporary storage? Contents of register %edx overwritten by who yoo: • • • movl $15213, %edx call who addl %edx, %eax ret who: • • • movl 8(%ebp), %edx addl $91125, %edx ret
Register Saving Conventions When procedure yoo calls who: yoo is the caller who is the callee Can register be used for temporary storage? Conventions “Caller Save” Caller saves temporary in its frame before calling “Callee Save” Callee saves temporary in its frame before using
IA32/Linux Register Usage %eax, %edx, %ecx Caller saves prior to call if values are used later %eax also used to return integer value %ebx, %esi, %edi Callee saves if wants to use them %esp, %ebp special %eax Caller-Save Temporaries %edx %ecx %ebx Callee-Save Temporaries %esi %edi %esp Special %ebp
Recursive Factorial

    int rfact(int x)
    {
        int rval;
        if (x <= 1)
            return 1;
        rval = rfact(x-1);
        return rval * x;
    }

        .globl rfact
        .type rfact,@function
    rfact:
        pushl %ebp
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%ebx
        cmpl $1,%ebx
        jle .L78
        leal -1(%ebx),%eax
        pushl %eax
        call rfact
        imull %ebx,%eax
        jmp .L79
        .align 4
    .L78:
        movl $1,%eax
    .L79:
        movl -4(%ebp),%ebx
        movl %ebp,%esp
        popl %ebp
        ret

Registers
    %eax used without first saving
    %ebx used, but saved at beginning & restored at end
Pointer Code

Recursive Procedure

    void s_helper (int x, int *accum)
    {
        if (x <= 1)
            return;
        else {
            int z = *accum * x;
            *accum = z;
            s_helper (x-1,accum);
        }
    }

Top-Level Call

    int sfact(int x)
    {
        int val = 1;
        s_helper(x, &val);
        return val;
    }

Pass pointer to update location
Creating & Initializing Pointer

    int sfact(int x)
    {
        int val = 1;
        s_helper(x, &val);
        return val;
    }

Variable val must be stored on stack, because we need to create a pointer to it
    Compute pointer as -4(%ebp)
    Push on stack as second argument

Initial part of sfact

    _sfact:
        pushl %ebp          # Save %ebp
        movl %esp,%ebp      # Set %ebp
        subl $16,%esp       # Add 16 bytes
        movl 8(%ebp),%edx   # edx = x
        movl $1,-4(%ebp)    # val = 1

Stack after setup (offsets relative to %ebp)

      8   x
      4   Rtn adr
      0   Old %ebp                 <- %ebp
     -4   val = 1
     -8   Unused (temp. space)
    -12   Unused (temp. space)
    -16   Unused (temp. space)     <- %esp
Passing Pointer

    int sfact(int x)
    {
        int val = 1;
        s_helper(x, &val);
        return val;
    }

Calling s_helper from sfact

        leal -4(%ebp),%eax  # Compute &val
        pushl %eax          # Push on stack
        pushl %edx          # Push x
        call s_helper       # call
        movl -4(%ebp),%eax  # Return val
        • • •               # Finish

Stack at time of call (offsets relative to %ebp)

      8   x
      4   Rtn adr
      0   Old %ebp          <- %ebp
     -4   val = 1           (holds x! once s_helper returns)
     -8 to -16   Unused
          &val
          x                 <- %esp
IA 32 Procedure Summary The Stack Makes Recursion Work Private storage for each instance of procedure call Instantiations don’t clobber each other Addressing of locals + arguments can be relative to stack positions Managed by stack discipline Procedures return in inverse order of calls IA32 Procedures Combination of Instructions + Conventions Call / Ret instructions Register usage conventions Caller / Callee save %ebp and %esp Stack frame organization conventions Caller Frame Arguments Return Addr %ebp Old %ebp Saved Registers + Local Variables Argument Build %esp
Today Arrays Structures One-dimensional Multi-dimensional (nested) Multi-level Structures
Basic Data Types Integral Floating Point Stored & operated on in general (integer) registers Signed vs. unsigned depends on instructions used Intel GAS Bytes C byte b 1 [unsigned] char word w 2 [unsigned] short double word l 4 [unsigned] int quad word q 8 [unsigned] long int (x86-64) Floating Point Stored & operated on in floating point registers Single s 4 float Double l 8 double Extended t 10/12/16 long double
Array Allocation Basic Principle T A[L]; Array of data type T and length L Contiguously allocated region of L * sizeof(T) bytes char string[12]; x x + 12 int val[5]; x x + 4 x + 8 x + 12 x + 16 x + 20 double a[3]; x + 24 x x + 8 x + 16 char *p[3]; x x + 4 x + 8 x + 12 IA32 x x + 8 x + 16 x + 24 x86-64
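The allocation rule (an array `T A[L]` occupies `L * sizeof(T)` contiguous bytes) can be checked directly with `sizeof`. The sketch below reuses the four declarations from the slide; the macro and function names are assumptions introduced for illustration. The pointer array is where IA32 and x86-64 differ, because `sizeof(char *)` is 4 on one and 8 on the other.

```c
#include <stddef.h>

/* Bytes occupied by an array and its element count, both derived from
   the array type itself. Macro names are illustrative. */
#define ARRAY_BYTES(arr) (sizeof(arr))
#define ARRAY_LEN(arr)   (sizeof(arr) / sizeof((arr)[0]))

char   string[12];
int    val[5];
double a[3];
char  *p[3];

/* Byte distance between consecutive elements of val. Because the array
   is contiguous, this equals sizeof(int). */
size_t val_stride(void) {
    return (size_t)((char *)&val[1] - (char *)&val[0]);
}
```

On a typical IA32 or x86-64 system `ARRAY_BYTES(val)` is 20 and `ARRAY_BYTES(a)` is 24, while `ARRAY_BYTES(p)` is 12 on IA32 and 24 on x86-64.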
Array Access

Basic Principle
    T A[L];
    Array of data type T and length L
    Identifier A can be used as a pointer to array element 0: Type T*

int val[5]; holds 1, 5, 2, 1, 3 at addresses x, x + 4, x + 8, x + 12, x + 16 (the array ends at x + 20)

    Reference    Type     Value
    val[4]       int      3
    val          int *    x
    val+1        int *    x + 4
    &val[2]      int *    x + 8
    val[5]       int      ??
    *(val+1)     int      5
    val + i      int *    x + 4 i
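The reference table above can be reproduced with ordinary pointer arithmetic. In this sketch the array contents `{1, 5, 2, 1, 3}` are an assumption chosen to match the values the table reports (`val[4]` is 3, `*(val+1)` is 5), and the function names are hypothetical.

```c
#include <stddef.h>

/* Contents assumed to match the table: val[4] == 3, *(val+1) == 5. */
int val[5] = { 1, 5, 2, 1, 3 };

/* Byte offset of the element that (val + i) points to, measured from
   the start of val. Pointer arithmetic scales i by sizeof(int). */
ptrdiff_t byte_offset(int i) {
    return (char *)(val + i) - (char *)val;
}

/* val[i] is defined as *(val + i), so both spellings read the same
   element. */
int read_subscript(int i) { return val[i]; }
int read_pointer(int i)   { return *(val + i); }
```

With 4-byte `int`, `byte_offset(1)` is 4 and `byte_offset(2)` is 8, the `x + 4` and `x + 8` entries of the table. Calling either reader with index 5 would go past the end of the array, which is why the table lists `??` for `val[5]`.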
Array Example

    typedef int zip_dig[5];
    zip_dig cmu = { 1, 5, 2, 1, 3 };
    zip_dig mit = { 0, 2, 1, 3, 9 };
    zip_dig ucb = { 9, 4, 7, 2, 0 };

    Array          Elements     Element addresses    End
    zip_dig cmu    1 5 2 1 3    16 20 24 28 32       36
    zip_dig mit    0 2 1 3 9    36 40 44 48 52       56
    zip_dig ucb    9 4 7 2 0    56 60 64 68 72       76

Declaration “zip_dig cmu” equivalent to “int cmu[5]”
Example arrays were allocated in successive 20 byte blocks
    Not guaranteed to happen in general
Array Accessing Example zip_dig cmu; 1 5 2 3 16 20 24 28 32 36 int get_digit (zip_dig z, int dig) { return z[dig]; } Register %edx contains starting address of array Register %eax contains array index Desired digit at 4*%eax + %edx Use memory reference (%edx,%eax,4) IA32 # %edx = z # %eax = dig movl (%edx,%eax,4),%eax # z[dig]
Referencing Examples

    Array          Elements     Element addresses    End
    zip_dig cmu    1 5 2 1 3    16 20 24 28 32       36
    zip_dig mit    0 2 1 3 9    36 40 44 48 52       56
    zip_dig ucb    9 4 7 2 0    56 60 64 68 72       76

    Reference    Address              Value    Guaranteed?
    mit[3]       36 + 4* 3 = 48       3        Yes
    mit[5]       36 + 4* 5 = 56       9        No
    mit[-1]      36 + 4*-1 = 32       3        No
    cmu[15]      16 + 4*15 = 76       ??       No

No bound checking
Out of range behavior implementation-dependent
No guaranteed relative allocation of different arrays
Array Loop Example

Original

    int zd2int(zip_dig z)
    {
        int i;
        int zi = 0;
        for (i = 0; i < 5; i++) {
            zi = 10 * zi + z[i];
        }
        return zi;
    }

Transformed (as generated by GCC)
    Eliminate loop variable i
    Convert array code to pointer code
    Express in do-while form (no test at entrance)

    int zd2int(zip_dig z)
    {
        int zi = 0;
        int *zend = z + 4;
        do {
            zi = 10 * zi + *z;
            z++;
        } while (z <= zend);
        return zi;
    }
Array Loop Implementation (IA32)

    int zd2int(zip_dig z)
    {
        int zi = 0;
        int *zend = z + 4;
        do {
            zi = 10 * zi + *z;
            z++;
        } while(z <= zend);
        return zi;
    }

Registers
    %ecx  z
    %eax  zi
    %ebx  zend
Computations
    10*zi + *z implemented as *z + 2*(zi+4*zi)
    z++ increments by 4

    # %ecx = z
        xorl %eax,%eax            # zi = 0
        leal 16(%ecx),%ebx        # zend = z+4
    .L59:
        leal (%eax,%eax,4),%edx   # 5*zi
        movl (%ecx),%eax          # *z
        addl $4,%ecx              # z++
        leal (%eax,%edx,2),%eax   # zi = *z + 2*(5*zi)
        cmpl %ebx,%ecx            # z : zend
        jle .L59                  # if <= goto loop
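The multiply-free computation of `10*zi + *z` can be mirrored step by step in C. This is a sketch rather than GCC's actual output: `zd2int_lea` imitates the two `leal` instructions, and `zd2int_ref` is the plain indexed loop for comparison. Both function names are hypothetical.

```c
/* Mirrors the two leal steps of the loop body:
   t = zi + 4*zi (5*zi), then zi = *z + 2*t (*z + 10*zi). */
int zd2int_lea(const int *z) {
    int zi = 0;
    const int *zend = z + 4;      /* address of the last digit */
    do {
        int t = zi + 4 * zi;      /* leal (%eax,%eax,4),%edx */
        zi = *z + 2 * t;          /* leal (%eax,%edx,2),%eax */
        z++;                      /* addl $4,%ecx */
    } while (z <= zend);
    return zi;
}

/* Reference version: the original indexed for loop. */
int zd2int_ref(const int *z) {
    int zi = 0;
    for (int i = 0; i < 5; i++)
        zi = 10 * zi + z[i];
    return zi;
}
```

For the digits `{1, 5, 2, 1, 3}` both functions return 15213.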
Nested Array Example “zip_dig pgh[4]” equivalent to “int pgh[4][5]” #define PCOUNT 4 zip_dig pgh[PCOUNT] = {{1, 5, 2, 0, 6}, {1, 5, 2, 1, 3 }, {1, 5, 2, 1, 7 }, {1, 5, 2, 2, 1 }}; 1 5 2 6 1 5 2 3 1 5 2 7 1 5 2 zip_dig pgh[4]; 76 96 116 136 156 “zip_dig pgh[4]” equivalent to “int pgh[4][5]” Variable pgh: array of 4 elements, allocated contiguously Each element is an array of 5 int’s, allocated contiguously “Row-Major” ordering of all elements guaranteed
Multidimensional (Nested) Arrays Declaration T A[R][C]; 2D array of data type T R rows, C columns Type T element requires K bytes Array Size R * C * K bytes Arrangement Row-Major Ordering A[0][0] A[0][C-1] A[R-1][0] • • • A[R-1][C-1] • int A[R][C]; • • • A [0] [C-1] [1] [R-1] • • • 4*R*C Bytes
Nested Array Row Access Row Vectors A[i] is array of C elements Each element of type T requires K bytes Starting address A + i * (C * K) int A[R][C]; • • • A [0] [C-1] A[0] • • • A [i] [0] [C-1] A[i] • • • A [R-1] [0] [C-1] A[R-1] • • • • • • A A+i*C*4 A+(R-1)*C*4
Nested Array Row Access Code int *get_pgh_zip(int index) { return pgh[index]; } #define PCOUNT 4 zip_dig pgh[PCOUNT] = {{1, 5, 2, 0, 6}, {1, 5, 2, 1, 3 }, {1, 5, 2, 1, 7 }, {1, 5, 2, 2, 1 }}; # %eax = index leal (%eax,%eax,4),%eax # 5 * index leal pgh(,%eax,4),%eax # pgh + (20 * index) Row Vector pgh[index] is array of 5 int’s Starting address pgh+20*index IA32 Code Computes and returns address Compute as pgh + 4*(index+4*index)
Nested Array Element Access

Array Elements
    A[i][j] is element of type T, which requires K bytes
    Address A + i * (C * K) + j * K = A + (i * C + j) * K

For int A[R][C]: row A[0] starts at A, row A[i] starts at A+i*C*4, row A[R-1] starts at A+(R-1)*C*4, and element A[i][j] is at A+i*C*4+j*4
Nested Array Element Access Code int get_pgh_digit (int index, int dig) { return pgh[index][dig]; } # %ecx = dig # %eax = index leal 0(,%ecx,4),%edx # 4*dig leal (%eax,%eax,4),%eax # 5*index movl pgh(%edx,%eax,4),%eax # *(pgh + 4*dig + 20*index) Array Elements pgh[index][dig] is int Address: pgh + 20*index + 4*dig IA32 Code Computes address pgh + 4*dig + 4*(index+4*index) movl performs memory reference
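The address formula `A + (i * C + j) * K` can be applied by hand and compared with what the compiler does for `pgh[index][dig]`. The sketch below reuses the `pgh` declaration from the slides; `element_offset` and `get_pgh_digit_manual` are hypothetical helper names.

```c
#include <stddef.h>

typedef int zip_dig[5];
#define PCOUNT 4
zip_dig pgh[PCOUNT] = {{1, 5, 2, 0, 6}, {1, 5, 2, 1, 3},
                       {1, 5, 2, 1, 7}, {1, 5, 2, 2, 1}};

/* Byte offset of A[i][j] from A, for an array with C columns of
   K-byte elements stored in row-major order. */
size_t element_offset(size_t i, size_t j, size_t C, size_t K) {
    return (i * C + j) * K;
}

/* Reads pgh[index][dig] by applying the formula to the raw bytes of
   pgh instead of using the double subscript. */
int get_pgh_digit_manual(int index, int dig) {
    const char *base = (const char *)pgh;
    size_t off = element_offset((size_t)index, (size_t)dig,
                                5, sizeof(int));
    return *(const int *)(base + off);
}
```

With 4-byte `int`, `element_offset(3, 3, 5, 4)` is 72, so with `pgh` at address 76 the element `pgh[3][3]` sits at address 148.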
Strange Referencing Examples

zip_dig pgh[4]; rows start at addresses 76, 96, 116, 136; the array ends at 156

    pgh[0]   1 5 2 0 6
    pgh[1]   1 5 2 1 3
    pgh[2]   1 5 2 1 7
    pgh[3]   1 5 2 2 1

    Reference      Address                  Value    Guaranteed?
    pgh[3][3]      76+20*3+4*3 = 148        2        Yes
    pgh[2][5]      76+20*2+4*5 = 136        1        Yes
    pgh[2][-1]     76+20*2+4*-1 = 112       3        Yes
    pgh[4][-1]     76+20*4+4*-1 = 152       1        Yes
    pgh[0][19]     76+20*0+4*19 = 152       1        Yes
    pgh[0][-1]     76+20*0+4*-1 = 72        ??       No

Code does not do any bounds checking
Ordering of elements within array guaranteed
Multi-Level Array Example Variable univ denotes array of 3 elements Each element is a pointer 4 bytes Each pointer points to array of int’s zip_dig cmu = { 1, 5, 2, 1, 3 }; zip_dig mit = { 0, 2, 1, 3, 9 }; zip_dig ucb = { 9, 4, 7, 2, 0 }; #define UCOUNT 3 int *univ[UCOUNT] = {mit, cmu, ucb}; cmu 1 5 2 3 16 20 24 28 32 36 36 160 16 56 164 168 univ mit 2 1 3 9 36 40 44 48 52 56 ucb 9 4 7 2 56 60 64 68 72 76
Element Access in Multi-Level Array int get_univ_digit (int index, int dig) { return univ[index][dig]; } # %ecx = index # %eax = dig leal 0(,%ecx,4),%edx # 4*index movl univ(%edx),%edx # Mem[univ+4*index] movl (%edx,%eax,4),%eax # Mem[...+4*dig] Computation (IA32) Element access Mem[Mem[univ+4*index]+4*dig] Must do two memory reads First get pointer to row array Then access element within array
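The two memory reads can be made explicit in C by splitting the double subscript into separate steps. This is a sketch using the `univ` declarations from the earlier slide; the function name `get_univ_digit_steps` is hypothetical.

```c
typedef int zip_dig[5];
zip_dig cmu = { 1, 5, 2, 1, 3 };
zip_dig mit = { 0, 2, 1, 3, 9 };
zip_dig ucb = { 9, 4, 7, 2, 0 };

#define UCOUNT 3
int *univ[UCOUNT] = {mit, cmu, ucb};

/* Same result as univ[index][dig], written as two separate reads. */
int get_univ_digit_steps(int index, int dig) {
    int *row = univ[index];   /* first read: pointer to the row array */
    return row[dig];          /* second read: element within that row */
}
```

Because `univ[0]` points to `mit`, `get_univ_digit_steps(0, 4)` returns 9. For a nested array such as `pgh`, the row address is computed arithmetically and only one memory read is needed.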
Array Element Accesses Nested array Multi-level array int get_pgh_digit (int index, int dig) { return pgh[index][dig]; } int get_univ_digit (int index, int dig) { return univ[index][dig]; } Access looks similar, but element: Mem[pgh+20*index+4*dig] Mem[Mem[univ+4*index]+4*dig]
Strange Referencing Examples

    univ (at 160, 164, 168) holds the pointers 36 (mit), 16 (cmu), 56 (ucb)

    Array    Elements     Element addresses    End
    cmu      1 5 2 1 3    16 20 24 28 32       36
    mit      0 2 1 3 9    36 40 44 48 52       56
    ucb      9 4 7 2 0    56 60 64 68 72       76

    Reference       Address            Value    Guaranteed?
    univ[2][3]      56+4*3 = 68        2        Yes
    univ[1][5]      16+4*5 = 36        0        No
    univ[2][-1]     56+4*-1 = 52       9        No
    univ[3][-1]     ??                 ??       No
    univ[1][12]     16+4*12 = 64       7        No

Code does not do any bounds checking
Ordering of elements in different arrays not guaranteed
Using Nested Arrays Strengths Limitation #define N 16 typedef int fix_matrix[N][N]; Strengths C compiler handles doubly subscripted arrays Generates very efficient code Avoids multiply in index computation Limitation Only works for fixed array size /* Compute element i,k of fixed matrix product */ int fix_prod_ele (fix_matrix a, fix_matrix b, int i, int k) { int j; int result = 0; for (j = 0; j < N; j++) result += a[i][j]*b[j][k]; return result; } a b j-th column x i-th row
Dynamic Nested Arrays

Strength
    Can create matrix of any size
Programming
    Must do index computation explicitly
Performance
    Accessing single element costly
    Must do multiplication

    #include <stdlib.h>

    /* Allocate a zero-initialized n x n matrix of int */
    int * new_var_matrix(int n)
    {
        return (int *) calloc(n*n, sizeof(int));
    }

    /* Element (i, j) of an n x n matrix stored in row-major order */
    int var_ele (int *a, int i, int j, int n)
    {
        return a[i*n+j];
    }

        movl 12(%ebp),%eax        # i
        movl 8(%ebp),%edx         # a
        imull 20(%ebp),%eax       # n*i
        addl 16(%ebp),%eax        # n*i+j
        movl (%edx,%eax,4),%eax   # Mem[a+4*(i*n+j)]
Dynamic Array Multiplication

Without Optimizations (operations per loop iteration)
    Multiplies: 3
        2 for subscripts
        1 for data
    Adds: 4
        2 for array indexing
        1 for loop index
        1 for data

    /* Compute element i,k of variable matrix product */
    int var_prod_ele (int *a, int *b, int i, int k, int n)
    {
        int j;
        int result = 0;
        for (j = 0; j < n; j++)
            result += a[i*n+j] * b[j*n+k];
        return result;
    }
Optimizing Dynamic Array Multiplication

Original function body

    {
        int j;
        int result = 0;
        for (j = 0; j < n; j++)
            result += a[i*n+j] * b[j*n+k];
        return result;
    }

Optimizations
    Performed when optimization level is set to -O2
Code Motion
    Expression i*n can be computed outside loop
Strength Reduction
    Incrementing j has effect of incrementing j*n+k by n
Operations count
    4 adds, 1 mult per loop iteration
Compiler can optimize regular access patterns

Optimized function body

    {
        int j;
        int result = 0;
        int iTn = i*n;
        int jTnPk = k;
        for (j = 0; j < n; j++) {
            result += a[iTn+j] * b[jTnPk];
            jTnPk += n;
        }
        return result;
    }
Today Structures Alignment Unions Floating point
Structures Memory Layout Concept Accessing Structure Member struct rec { int i; int a[3]; int *p; }; Memory Layout i a p 4 16 20 Concept Contiguously-allocated region of memory Refer to members within structure by names Members may be of different types Accessing Structure Member void set_i(struct rec *r, int val) { r->i = val; } IA32 Assembly # %eax = val # %edx = r movl %eax,(%edx) # Mem[r] = val
Generating Pointer to Structure Member struct rec { int i; int a[3]; int *p; }; r r+4+4*idx i a p 4 16 20 Generating Pointer to Array Element Offset of each structure member determined at compile time int *find_a (struct rec *r, int idx) { return &r->a[idx]; } # %ecx = idx # %edx = r leal 0(,%ecx,4),%eax # 4*idx leal 4(%eax,%edx),%eax # r+4*idx+4
Structure Referencing (Cont.) C Code struct rec { int i; int a[3]; int *p; }; i a p 4 16 20 i a void set_p(struct rec *r) { r->p = &r->a[r->i]; } 4 16 20 Element i What does it do? # %edx = r movl (%edx),%ecx # r->i leal 0(,%ecx,4),%eax # 4*(r->i) leal 4(%edx,%eax),%eax # r+4+4*(r->i) movl %eax,16(%edx) # Update r->p
Alignment

Aligned Data
    Primitive data type requires K bytes
    Address must be multiple of K
    Required on some machines; advised on IA32
        treated differently by IA32 Linux, x86-64 Linux, and Windows!
Motivation for Aligning Data
    Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
        Inefficient to load or store datum that spans quad word boundaries
        Virtual memory very tricky when datum spans 2 pages
Compiler
    Inserts gaps in structure to ensure correct alignment of fields
Specific Cases of Alignment (IA32)

1 byte: char, …
    no restrictions on address
2 bytes: short, …
    lowest 1 bit of address must be 0₂
4 bytes: int, float, char *, …
    lowest 2 bits of address must be 00₂
8 bytes: double, …
    Windows (and most other OS’s & instruction sets): lowest 3 bits of address must be 000₂
    Linux: lowest 2 bits of address must be 00₂, i.e., treated the same as a 4-byte primitive data type
12 bytes: long double
    Windows, Linux:
Satisfying Alignment with Structures Within structure: Must satisfy element’s alignment requirement Overall structure placement Each structure has alignment requirement K K = Largest alignment of any element Initial address & structure length must be multiples of K Example (under Windows or x86-64): K = 8, due to double element struct S1 { char c; int i[2]; double v; } *p; c 3 bytes i[0] i[1] 4 bytes v p+0 p+4 p+8 p+16 p+24 Multiple of 4 Multiple of 8 Multiple of 8 Multiple of 8
Different Alignment Conventions struct S1 { char c; int i[2]; double v; } *p; x86-64 or IA32 Windows: K = 8, due to double element IA32 Linux K = 4; double treated like a 4-byte data type c 3 bytes i[0] i[1] 4 bytes v p+0 p+4 p+8 p+16 p+24 c 3 bytes i[0] i[1] v p+0 p+4 p+8 p+12 p+20
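The layout a particular compiler chooses for `S1` can be inspected with `offsetof`, `sizeof`, and C11 `_Alignof`. The sketch below is a hedged illustration: the accessor function names are hypothetical, and the numbers it reports depend on the platform, which is the point of the slide.

```c
#include <stddef.h>

struct S1 {
    char   c;
    int    i[2];
    double v;
};

/* Offsets and total size as laid out by the compiler in use. */
size_t s1_offset_i(void) { return offsetof(struct S1, i); }
size_t s1_offset_v(void) { return offsetof(struct S1, v); }
size_t s1_size(void)     { return sizeof(struct S1); }

/* Alignment requirement K of the structure and of its double member. */
size_t s1_align(void)     { return _Alignof(struct S1); }
size_t double_align(void) { return _Alignof(double); }
```

On x86-64 Linux one would expect offsets 4 and 16 and a size of 24, as in the first diagram; under the IA32 Linux convention described above, `v` would instead start at offset 12 and the size would be 20. In either case the offset of `v` is a multiple of the alignment of `double`, and the structure size is a multiple of K.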
Saving Space

Put large data types first

    struct S1 { char c; int i[2]; double v; } *p;
    struct S2 { double v; int i[2]; char c; } *p;

Effect (example x86-64, both have K=8)
    S1: c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, 4 bytes padding, v at p+16, end at p+24
    S2: v at p+0, i[0] at p+8, i[1] at p+12, c at p+16
Arrays of Structures

Satisfy alignment requirement for every element

    struct S2 { double v; int i[2]; char c; } a[10];

Array layout: a[0] at a+0, a[1] at a+24, a[2] at a+48, a[3] at a+72, • • •
Element a[1]: v at a+24, i[0] at a+32, i[1] at a+36, c at a+40, 7 bytes padding, end at a+48
Accessing Array Elements

    struct S3 { short i; float v; short j; } a[10];

Compute array offset 12i (each struct S3 occupies 12 bytes)
Compute offset 8 within structure
Assembler gives offset a+8
    Resolved during linking

Array layout: a[0] at a+0, • • •, a[i] at a+12i, • • •
Element a[i]: i at a+12i, 2 bytes padding, v at a+12i+4, j at a+12i+8, 2 bytes padding

    short get_j(int idx)
    {
        return a[idx].j;
    }

    # %eax = idx
        leal (%eax,%eax,2),%eax   # 3*idx
        movswl a+8(,%eax,4),%eax
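The offset computation behind `a[idx].j` can be written out explicitly: the size of one structure times the index, plus the offset of `j` inside the structure. This sketch reuses `struct S3` from the slide; `offset_of_j` and `get_j_manual` are hypothetical names.

```c
#include <stddef.h>

struct S3 {
    short i;
    float v;
    short j;
};

struct S3 a[10];

/* Byte offset of a[idx].j from the start of a. */
size_t offset_of_j(size_t idx) {
    return sizeof(struct S3) * idx + offsetof(struct S3, j);
}

/* Reads a[idx].j through the manually computed byte offset. */
short get_j_manual(int idx) {
    const char *base = (const char *)a;
    return *(const short *)(base + offset_of_j((size_t)idx));
}
```

With 2-byte `short` and 4-byte `float`, `sizeof(struct S3)` is 12 and `offsetof(struct S3, j)` is 8, so `offset_of_j(idx)` is `12*idx + 8`. The assembly computes the same value as `a+8` plus `4 * (3*idx)`.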
Union Allocation

Allocate according to largest element
Can only use one field at a time

    union U1 { char c; int i[2]; double v; } *up;

    All members start at up+0: c occupies 1 byte, i[0] and i[1] occupy up+0 to up+8, v occupies up+0 to up+8

    struct S1 { char c; int i[2]; double v; } *sp;

    c at sp+0, 3 bytes padding, i[0] at sp+4, i[1] at sp+8, 4 bytes padding, v at sp+16, end at sp+24
Using Union to Access Bit Patterns typedef union { float f; unsigned u; } bit_float_t; u f 4 float bit2float(unsigned u) { bit_float_t arg; arg.u = u; return arg.f; } unsigned float2bit(float f) { bit_float_t arg; arg.f = f; return arg.u; } Same as (float) u ? Same as (unsigned) f ?
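The two questions on the slide can be answered by running both forms side by side. The sketch below keeps the slide's union and functions and adds a hypothetical `cast_to_float` helper for comparison. It assumes a 32-bit `unsigned` and IEEE single-precision `float`, which holds on IA32 and x86-64.

```c
typedef union {
    float f;
    unsigned u;
} bit_float_t;

/* Reinterprets the bits of u as a float; no numeric conversion. */
float bit2float(unsigned u) {
    bit_float_t arg;
    arg.u = u;
    return arg.f;
}

/* Reinterprets the bits of f as an unsigned; no numeric conversion. */
unsigned float2bit(float f) {
    bit_float_t arg;
    arg.f = f;
    return arg.u;
}

/* Ordinary cast: converts the numeric value, changing the bit pattern. */
float cast_to_float(unsigned u) {
    return (float)u;
}
```

For `u = 0x3f800000`, `bit2float(u)` is 1.0 (that bit pattern encodes 1.0 in single precision), while `cast_to_float(u)` is 1065353216.0, the integer value of the same bits. So neither function is the same as the corresponding cast.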
Byte Ordering Revisited Idea Short/long/quad words stored in memory as 2/4/8 consecutive bytes Which is most (least) significant? Can cause problems when exchanging binary data between machines Big Endian Most significant byte has lowest address PowerPC, Sparc Little Endian Least significant byte has lowest address Intel x86
Byte Ordering Example union { unsigned char c[8]; unsigned short s[4]; unsigned int i[2]; unsigned long l[1]; } dw; c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7] s[0] s[1] s[2] s[3] i[0] i[1] l[0]
Byte Ordering Example (Cont). int j; for (j = 0; j < 8; j++) dw.c[j] = 0xf0 + j; printf("Characters 0-7 == [0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x]\n", dw.c[0], dw.c[1], dw.c[2], dw.c[3], dw.c[4], dw.c[5], dw.c[6], dw.c[7]); printf("Shorts 0-3 == [0x%x,0x%x,0x%x,0x%x]\n", dw.s[0], dw.s[1], dw.s[2], dw.s[3]); printf("Ints 0-1 == [0x%x,0x%x]\n", dw.i[0], dw.i[1]); printf("Long 0 == [0x%lx]\n", dw.l[0]);
Byte Ordering on IA32 Little Endian Output on IA32: f0 f1 f2 f3 f4 f5 f6 f7 c[3] s[1] i[0] LSB MSB c[2] c[1] s[0] c[0] c[7] s[3] i[1] c[6] c[5] s[2] c[4] Print l[0] Output on IA32: Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7] Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6] Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4] Long 0 == [0xf3f2f1f0]
Byte Ordering on Sun Big Endian Output on Sun: c[3] s[1] i[0] LSB MSB c[2] c[1] s[0] c[0] c[7] s[3] i[1] c[6] c[5] s[2] c[4] f0 f1 f2 f3 f4 f5 f6 f7 Print l[0] Output on Sun: Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7] Shorts 0-3 == [0xf0f1,0xf2f3,0xf4f5,0xf6f7] Ints 0-1 == [0xf0f1f2f3,0xf4f5f6f7] Long 0 == [0xf0f1f2f3]
Byte Ordering on x86-64 Little Endian Output on x86-64: c[3] s[1] i[0] LSB MSB c[2] c[1] s[0] c[0] c[7] s[3] i[1] c[6] c[5] s[2] c[4] f0 f1 f2 f3 f4 f5 f6 f7 Print l[0] Output on x86-64: Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7] Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6] Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4] Long 0 == [0xf7f6f5f4f3f2f1f0]
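The same union trick gives a run-time test for byte order. This is a sketch assuming 2-byte `short` and 4-byte `int`; the type name `dw_t` and the function names are hypothetical, and the `long` member is left out because its size differs between IA32 and x86-64.

```c
union dw_t {
    unsigned char  c[8];
    unsigned short s[4];
    unsigned int   i[2];
};

/* Fills the eight bytes with 0xf0..0xf7, as in the slides. */
static union dw_t make_dw(void) {
    union dw_t dw;
    for (int j = 0; j < 8; j++)
        dw.c[j] = (unsigned char)(0xf0 + j);
    return dw;
}

/* On a little-endian machine c[0] = 0xf0 is the least significant byte
   of s[0], so s[0] reads back as 0xf1f0. */
int is_little_endian(void) {
    union dw_t dw = make_dw();
    return dw.s[0] == 0xf1f0;
}

unsigned first_short(void) { union dw_t dw = make_dw(); return dw.s[0]; }
unsigned first_int(void)   { union dw_t dw = make_dw(); return dw.i[0]; }
```

On IA32 and x86-64 `first_int()` yields `0xf3f2f1f0`; on a big-endian machine such as the Sun example it would yield `0xf0f1f2f3`.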
Summary

Arrays in C
    Contiguous allocation of memory
    Aligned to satisfy every element’s alignment requirement
    Pointer to first element
    No bounds checking
Structures
    Allocate bytes in order declared
    Pad in middle and at end to satisfy alignment
Unions
    Overlay declarations
    Way to circumvent type system
Today Structures Alignment Unions Floating point x87 (available with IA32, becoming obsolete) SSE3 (available with x86-64)
IA32 Floating Point (x87)
History: 8086: first computer to implement IEEE FP, with a separate 8087 FPU (floating point unit); 486: merged FPU and Integer Unit onto one chip; becoming obsolete with x86-64
Summary: Hardware to add, multiply, and divide; floating point data registers; various control & status registers
Floating Point Formats: single precision (C float): 32 bits; double precision (C double): 64 bits; extended precision (C long double): 80 bits
[Figure: block diagram with instruction decoder and sequencer, Integer Unit, FPU, and memory]
FPU Data Register Stack (x87)
FPU register format (80 bit extended precision): bit 79: s (sign); bits 78–64: exp; bits 63–0: frac
FPU registers: 8 registers %st(0) - %st(7); logically form a stack; top: %st(0); the bottom entry disappears (drops out) after too many pushes
FPU instructions (x87)
Large number of floating point instructions and formats: ~50 basic instruction types; load, store, add, multiply; sin, cos, tan, arctan, and log (often slower than math lib)
Sample instructions:
Instruction    Effect                           Description
fldz           push 0.0                         Load zero
flds Addr      push Mem[Addr]                   Load single precision real
fmuls Addr     %st(0) ← %st(0)*Mem[Addr]        Multiply
faddp          %st(1) ← %st(0)+%st(1); pop      Add and pop
FP Code Example (x87)
Compute inner product of two vectors; single precision arithmetic; common computation
    float ipf (float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++)
            result += x[i]*y[i];
        return result;
    }
Compiled code:
    pushl %ebp                # setup
    movl %esp,%ebp
    pushl %ebx
    movl 8(%ebp),%ebx         # %ebx=&x
    movl 12(%ebp),%ecx        # %ecx=&y
    movl 16(%ebp),%edx        # %edx=n
    fldz                      # push +0.0
    xorl %eax,%eax            # i=0
    cmpl %edx,%eax            # if i>=n done
    jge .L3
  .L5:
    flds (%ebx,%eax,4)        # push x[i]
    fmuls (%ecx,%eax,4)       # st(0)*=y[i]
    faddp                     # st(1)+=st(0); pop
    incl %eax                 # i++
    cmpl %edx,%eax            # if i<n repeat
    jl .L5
  .L3:
    movl -4(%ebp),%ebx        # finish
    movl %ebp, %esp
    popl %ebp
    ret                       # st(0) = result
Inner Product Stack Trace (%eax = i, %ebx = x, %ecx = y; x and y are the array base addresses)
Initialization
1. fldz: %st(0) = 0.0
Iteration 0
2. flds (%ebx,%eax,4): %st(0) = x[0], %st(1) = 0.0
3. fmuls (%ecx,%eax,4): %st(0) = x[0]*y[0], %st(1) = 0.0
4. faddp: %st(0) = 0.0+x[0]*y[0]
Iteration 1
5. flds (%ebx,%eax,4): %st(0) = x[1], %st(1) = x[0]*y[0]
6. fmuls (%ecx,%eax,4): %st(0) = x[1]*y[1], %st(1) = x[0]*y[0]
7. faddp: %st(0) = x[0]*y[0]+x[1]*y[1]
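The trace can be replayed with a toy model of the register stack. This is only a sketch under simplifying assumptions, not how the hardware is implemented: the type `fpu_stack` and the function `ipf_model` are hypothetical names, the helpers borrow the instruction names purely for readability, and values are kept as C `float` rather than the 80-bit extended format the real registers use.

```c
/* Toy model of the x87 register stack: st[0] plays the role of %st(0). */
typedef struct {
    float st[8];
} fpu_stack;

/* Push: every entry moves one slot down; the old st[7] drops out. */
static void fpu_push(fpu_stack *f, float v)
{
    for (int k = 7; k > 0; k--)
        f->st[k] = f->st[k - 1];
    f->st[0] = v;
}

static void fldz(fpu_stack *f)                     { fpu_push(f, 0.0f); }
static void flds(fpu_stack *f, const float *addr)  { fpu_push(f, *addr); }
static void fmuls(fpu_stack *f, const float *addr) { f->st[0] *= *addr; }

/* st(1) += st(0), then pop: every entry moves one slot up. */
static void faddp(fpu_stack *f)
{
    f->st[1] += f->st[0];
    for (int k = 0; k < 7; k++)
        f->st[k] = f->st[k + 1];
}

/* Inner product written as the same instruction sequence as the trace. */
float ipf_model(const float x[], const float y[], int n)
{
    fpu_stack f = { {0} };
    fldz(&f);                 /* step 1: running sum starts at 0.0 */
    for (int i = 0; i < n; i++) {
        flds(&f, &x[i]);      /* push x[i] */
        fmuls(&f, &y[i]);     /* st(0) = x[i]*y[i] */
        faddp(&f);            /* fold the product into the running sum */
    }
    return f.st[0];           /* result is left on top of the stack */
}
```

For x = {1, 2, 3} and y = {4, 5, 6}, `ipf_model(x, y, 3)` returns 32, and the stack never holds more than two live entries, as in the trace.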
Machine Programming – x86-64 extensions CENG331: Introduction to Computer Systems
x86-64 Integer Registers
64-bit   low 32 bits     64-bit   low 32 bits
%rax     %eax            %r8      %r8d
%rbx     %ebx            %r9      %r9d
%rcx     %ecx            %r10     %r10d
%rdx     %edx            %r11     %r11d
%rsi     %esi            %r12     %r12d
%rdi     %edi            %r13     %r13d
%rsp     %esp            %r14     %r14d
%rbp     %ebp            %r15     %r15d
Twice the number of registers
Accessible as 8, 16, 32, 64 bits
x86-64 Integer Registers
Register   Use               Register   Use
%rax       Return value      %r8        Argument #5
%rbx       Callee saved      %r9        Argument #6
%rcx       Argument #4       %r10       Caller saved
%rdx       Argument #3       %r11       Used for linking
%rsi       Argument #2       %r12       C: Callee saved
%rdi       Argument #1       %r13       Callee saved
%rsp       Stack pointer     %r14       Callee saved
%rbp       Callee saved      %r15       Callee saved
x86-64 Registers
Arguments passed to functions via registers: if there are more than 6 integral parameters, the rest are passed on the stack; the argument registers can be used as caller-saved as well
All references to stack frame via stack pointer: eliminates need to update %ebp/%rbp
Other registers: 6 callee saved (%rbx, %rbp, %r12–%r15); 2 or 3 have special uses
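A function with more than six integer arguments illustrates the split between registers and stack. The function `sum8` below is a hypothetical example, not from the slides. Assuming the Linux (System V AMD64) convention described above, a compiler would typically receive a1 through a6 in %rdi, %rsi, %rdx, %rcx, %r8, and %r9, and find a7 and a8 on the stack above the return address. The C source is the same either way; only the generated code differs.

```c
/* Eight integer arguments: under the convention on the slide, the first
 * six are expected in registers and the last two on the caller's stack. */
long sum8(long a1, long a2, long a3, long a4,
          long a5, long a6, long a7, long a8)
{
    long in_registers = a1 + a2 + a3 + a4 + a5 + a6;  /* register arguments */
    long on_stack     = a7 + a8;                      /* stack arguments */
    return in_registers + on_stack;
}
```

Compiling with `gcc -O1 -S` and reading the assembly is a quick way to check which arguments arrive where on a particular system.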
x86-64 Long Swap
    void swap(long *xp, long *yp)
    {
        long t0 = *xp;
        long t1 = *yp;
        *xp = t1;
        *yp = t0;
    }
Compiled code:
    movq (%rdi), %rdx
    movq (%rsi), %rax
    movq %rax, (%rdi)
    movq %rdx, (%rsi)
    ret
Operands passed in registers: first (xp) in %rdi, second (yp) in %rsi; 64-bit pointers
No stack operations required (except ret)
Avoiding stack: can hold all local information in registers
x86-64 Locals in the Red Zone
    /* Swap, using local array */
    void swap_a(long *xp, long *yp)
    {
        volatile long loc[2];
        loc[0] = *xp;
        loc[1] = *yp;
        *xp = loc[1];
        *yp = loc[0];
    }
Compiled code:
  swap_a:
    movq (%rdi), %rax
    movq %rax, -24(%rsp)
    movq (%rsi), %rax
    movq %rax, -16(%rsp)
    movq -16(%rsp), %rax
    movq %rax, (%rdi)
    movq -24(%rsp), %rax
    movq %rax, (%rsi)
    ret
Avoiding Stack Pointer Change: can hold all information within small window beyond stack pointer
Stack layout relative to %rsp: rtn Ptr at 0; unused at −8; loc[1] at −16; loc[0] at −24
x86-64 NonLeaf without Stack Frame
    long scount = 0;
    /* Swap a[i] & a[i+1] */
    void swap_ele_se (long a[], int i)
    {
        swap(&a[i], &a[i+1]);
        scount++;
    }
Compiled code:
  swap_ele_se:
    movslq %esi,%rsi             # Sign extend i
    leaq (%rdi,%rsi,8), %rdi     # &a[i]
    leaq 8(%rdi), %rsi           # &a[i+1]
    call swap                    # swap()
    incq scount(%rip)            # scount++;
    ret
No values held while swap being invoked
No callee save registers needed
x86-64 Call using Jump
    long scount = 0;
    /* Swap a[i] & a[i+1] */
    void swap_ele(long a[], int i)
    {
        swap(&a[i], &a[i+1]);
    }
Compiled code:
  swap_ele:
    movslq %esi,%rsi             # Sign extend i
    leaq (%rdi,%rsi,8), %rdi     # &a[i]
    leaq 8(%rdi), %rsi           # &a[i+1]
    jmp swap                     # swap()
When swap executes ret, it will return from swap_ele
Possible since swap is a “tail call” (no instructions afterwards)
x86-64 Stack Frame Example
    long sum = 0;
    /* Swap a[i] & a[i+1] */
    void swap_ele_su (long a[], int i)
    {
        swap(&a[i], &a[i+1]);
        sum += a[i];
    }
Compiled code:
  swap_ele_su:
    movq %rbx, -16(%rsp)
    movslq %esi,%rbx
    movq %r12, -8(%rsp)
    movq %rdi, %r12
    leaq (%rdi,%rbx,8), %rdi
    subq $16, %rsp
    leaq 8(%rdi), %rsi
    call swap
    movq (%r12,%rbx,8), %rax
    addq %rax, sum(%rip)
    movq (%rsp), %rbx
    movq 8(%rsp), %r12
    addq $16, %rsp
    ret
Keeps values of a and i in callee save registers
Must set up stack frame to save these registers
Understanding x86-64 Stack Frame
  swap_ele_su:
    movq %rbx, -16(%rsp)         # Save %rbx
    movslq %esi,%rbx             # Extend & save i
    movq %r12, -8(%rsp)          # Save %r12
    movq %rdi, %r12              # Save a
    leaq (%rdi,%rbx,8), %rdi     # &a[i]
    subq $16, %rsp               # Allocate stack frame
    leaq 8(%rdi), %rsi           # &a[i+1]
    call swap                    # swap()
    movq (%r12,%rbx,8), %rax     # a[i]
    addq %rax, sum(%rip)         # sum += a[i]
    movq (%rsp), %rbx            # Restore %rbx
    movq 8(%rsp), %r12           # Restore %r12
    addq $16, %rsp               # Deallocate stack frame
    ret
Stack before subq (offsets from %rsp): rtn addr at 0; saved %r12 at −8; saved %rbx at −16
Stack after subq (offsets from %rsp): rtn addr at +16; saved %r12 at +8; saved %rbx at 0
Interesting Features of Stack Frame
Allocate entire frame at once: done by decrementing the stack pointer; all stack accesses can be relative to %rsp; allocation can be delayed, since it is safe to temporarily use the red zone
Simple deallocation: increment stack pointer
No base/frame pointer needed
Interesting Features of Stack Frame
Many compiled functions do not require a stack frame other than saving their return address.
A function does not require a stack frame if: all local variables can be held in registers, and the function does not call other functions (such functions are referred to as leaf procedures)
A function requires a stack frame if it: has too many local variables to hold in registers; has local variables that are arrays or structures; uses the & operator to compute the address of a local variable; must pass some arguments on the stack to another function; needs to save the state of a callee-save register
General Conditional Expression Translation C Code val = Test ? Then-Expr : Else-Expr; Example: val = x>y ? x-y : y-x; Test is an expression returning an integer: = 0 interpreted as false, ≠ 0 interpreted as true Create separate code regions for then & else expressions Execute appropriate one Goto Version nt = !Test; if (nt) goto Else; val = Then-Expr; Done: . . . Else: val = Else-Expr; goto Done;
Conditionals: x86-64
    int absdiff(int x, int y)
    {
        int result;
        if (x > y) {
            result = x-y;
        } else {
            result = y-x;
        }
        return result;
    }
Compiled code:
  absdiff:                   # x in %edi, y in %esi
    movl %edi, %eax          # eax = x
    movl %esi, %edx          # edx = y
    subl %esi, %eax          # eax = x-y
    subl %edi, %edx          # edx = y-x
    cmpl %esi, %edi          # x:y
    cmovle %edx, %eax        # eax=edx if <=
    ret
Conditional move instruction cmovC src, dest: move value from src to dest if condition C holds
More efficient than conditional branching (simple control flow)
But overhead: both branches are evaluated
General Form with Conditional Move C Code val = Test ? Then-Expr : Else-Expr; Conditional Move Version val1 = Then-Expr; val2 = Else-Expr; val1 = val2 if !Test; Both values get computed Overwrite then-value with else-value if condition doesn’t hold Don’t use when: the then or else expression has side effects; the then or else expression is too expensive to compute
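The two restrictions can be made concrete in C. The sketch below is illustrative only: all names (`bump`, `select_branch`, `select_cmov_style`, `cread`, `eval_count`, `reset_count`) are hypothetical, and the "conditional move" version simply mimics in C what the compiled code would do by evaluating both arms first. A counter makes the side effect visible, and `cread` shows a case where evaluating both arms would be unsafe, not merely wasteful.

```c
#include <stddef.h>

static int calls = 0;              /* number of arm evaluations so far */

static long bump(long v)           /* an "expression" with a side effect */
{
    calls++;
    return v;
}

/* Branching form: exactly one arm is evaluated. */
long select_branch(int test, long a, long b)
{
    long val;
    if (test)
        val = bump(a);
    else
        val = bump(b);
    return val;
}

/* Conditional-move form: both arms are evaluated, then one is kept. */
long select_cmov_style(int test, long a, long b)
{
    long val1 = bump(a);           /* val1 = Then-Expr */
    long val2 = bump(b);           /* val2 = Else-Expr */
    if (!test)
        val1 = val2;               /* val1 = val2 if !Test */
    return val1;
}

/* Must stay a branch: computing *xp before the test would
 * dereference a null pointer when xp == NULL. */
long cread(const long *xp)
{
    return xp ? *xp : 0;
}

int  eval_count(void)  { return calls; }
void reset_count(void) { calls = 0; }
```

Both selection functions return the same value, but `eval_count()` reports one evaluation for the branching form and two for the conditional-move form, which is why a compiler has to fall back to branching when an arm has side effects.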
Specific Cases of Alignment (x86-64)
1 byte: char, …: no restrictions on address
2 bytes: short, …: lowest 1 bit of address must be 0₂
4 bytes: int, float, …: lowest 2 bits of address must be 00₂
8 bytes: double, char *, …: Windows & Linux: lowest 3 bits of address must be 000₂
16 bytes: long double: Linux: lowest 4 bits of address must be 0000₂ (16-byte alignment under the System V AMD64 ABI)
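The "lowest k bits must be zero" rule is a bit-mask test, and the same mask gives the padding a compiler inserts between structure fields. The sketch below is an illustration, not part of the slides: `struct S1`, `is_aligned`, and `round_up` are hypothetical names, and `_Alignof` requires a C11 compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Example structure: padding is expected after c and possibly before v. */
struct S1 {
    char   c;
    int    i[2];
    double v;
};

/* Nonzero if address p is a multiple of align (align must be a power of
 * two), i.e. if the lowest log2(align) bits of the address are all zero. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Smallest multiple of align that is >= offset (align a power of two).
 * This is the offset at which a field with that alignment can be placed. */
size_t round_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}
```

For instance, `round_up(1, 4)` is 4, which on typical systems matches `offsetof(struct S1, i)`: three bytes of padding follow the one-byte field `c` so that `i` starts on a 4-byte boundary.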
Vector Instructions: SSE Family
SIMD (single-instruction, multiple data) vector instructions: new data types, registers, operations; parallel operation on small (length 2-8) vectors of integers or floats
Example: “4-way” addition, where one instruction adds four pairs of elements
Floating point vector instructions: available with Intel’s SSE (streaming SIMD extensions) family; SSE starting with Pentium III: 4-way single precision; SSE2 starting with Pentium 4: 2-way double precision; all x86-64 processors have at least SSE2 (a superset of SSE), and all but the earliest also have SSE3
SSE3 Registers
All caller saved
%xmm0 for floating point return value
128 bit = 2 doubles = 4 singles
%xmm0   Argument #1     %xmm8
%xmm1   Argument #2     %xmm9
%xmm2   Argument #3     %xmm10
%xmm3   Argument #4     %xmm11
%xmm4   Argument #5     %xmm12
%xmm5   Argument #6     %xmm13
%xmm6   Argument #7     %xmm14
%xmm7   Argument #8     %xmm15
SSE3 Registers
Different data types and associated instructions (each register is 128 bits)
Integer vectors: 16-way byte; 8-way 2 bytes; 4-way 4 bytes
Floating point vectors: 4-way single; 2-way double
Floating point scalars: single; double (held in the low-order bits of the register)
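SSE3 Instructions: Examples
Single precision 4-way vector add: addps %xmm0, %xmm1 (each of the four single-precision elements of %xmm0 is added to the corresponding element of %xmm1; result in %xmm1)
Single precision scalar add: addss %xmm0, %xmm1 (only the lowest elements are added; the upper three elements of %xmm1 are unchanged)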
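The difference between the vector and scalar forms can be modelled in portable C. This is a sketch of the semantics only and does not use the actual instructions or compiler intrinsics: `vec4`, `addps_model`, and `addss_model` are hypothetical names, with `f[0]` standing for the low-order element of a 128-bit register and `dst` playing the role of %xmm1.

```c
/* Four single-precision lanes of one 128-bit register; f[0] is the low lane. */
typedef struct {
    float f[4];
} vec4;

/* Model of addps src, dst: all four lanes are added. */
vec4 addps_model(vec4 dst, vec4 src)
{
    for (int i = 0; i < 4; i++)
        dst.f[i] += src.f[i];
    return dst;
}

/* Model of addss src, dst: only the low lane is added;
 * the upper three lanes of dst pass through unchanged. */
vec4 addss_model(vec4 dst, vec4 src)
{
    dst.f[0] += src.f[0];
    return dst;
}
```

With dst = {10, 20, 30, 40} and src = {1, 2, 3, 4}, the vector form yields {11, 22, 33, 44} while the scalar form yields {11, 20, 30, 40}.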
Extending to x86-64 Pointers and long ints are 64 bits long. Integer arithmetic operations support 8, 16, 32 and 64-bit data types. The set of general purpose registers is expanded from 8 to 16. Much of the program state is held in registers rather than on the stack. Integer and pointer arguments (up to 6) to procedures are passed via registers. Some procedures do not need to access the stack at all. Conditional operations are implemented using conditional move instructions when possible, yielding better performance than traditional branching. Floating point operations are implemented using register-oriented SSE2, rather than stack-based x87.
Procedures (x86-64): Optimizations
No base/frame pointer
Passing arguments to functions through registers (if possible)
Sometimes: Writing into the “red zone” (below stack pointer)
Sometimes: Function call using jmp (instead of call)
Reason: Performance; use stack as little as possible while obeying rules (e.g., caller/callee save registers)
Red zone layout relative to %rsp: rtn Ptr at 0; unused at −8; loc[1] at −16; loc[0] at −24