Machine Programming – Procedures and IA32 Stack
CENG331: Introduction to Computer Systems, 6th Lecture
Instructor: Erol Sahin
Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant and D.R. O’Hallaron of Carnegie-Mellon Univ.

IA32 Stack
Region of memory managed with stack discipline
Grows toward lower addresses
Register %esp contains the lowest stack address, which is the address of the “top” element
[Figure: stack “bottom” at the highest address and stack “top” at the lowest; addresses increase upward, the stack grows down, and the stack pointer %esp points at the top element.]

IA32 Stack: Push
pushl Src
    Fetch operand at Src
    Decrement %esp by 4
    Write operand at address given by %esp
[Figure: the stack pointer %esp moves down by 4 as the stack grows toward lower addresses.]

IA32 Stack: Pop
popl Dest
    Read operand at address %esp
    Increment %esp by 4
    Write operand to Dest
[Figure: the stack pointer %esp moves up by 4 as the top element is removed.]
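The push and pop steps can be mimicked in C on a small simulated memory. This is only a sketch: `toy_machine`, `toy_init`, `toy_pushl`, and `toy_popl` are hypothetical names invented for illustration, and the 64-byte memory size is an arbitrary assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { STACK_BYTES = 64 };     /* assumed size of the simulated memory */

struct toy_machine {
    uint8_t  mem[STACK_BYTES]; /* simulated memory; array index = address */
    uint32_t esp;              /* simulated stack pointer */
};

void toy_init(struct toy_machine *m) {
    memset(m->mem, 0, sizeof m->mem);
    m->esp = STACK_BYTES;      /* empty stack: %esp just past the "bottom" */
}

/* pushl Src: decrement %esp by 4, then write the operand at the new %esp. */
void toy_pushl(struct toy_machine *m, uint32_t src) {
    assert(m->esp >= 4);       /* otherwise the toy stack would overflow */
    m->esp -= 4;
    memcpy(&m->mem[m->esp], &src, 4);
}

/* popl Dest: read the operand at %esp, then increment %esp by 4. */
uint32_t toy_popl(struct toy_machine *m) {
    uint32_t dest;
    assert(m->esp + 4 <= STACK_BYTES); /* otherwise the toy stack is empty */
    memcpy(&dest, &m->mem[m->esp], 4);
    m->esp += 4;
    return dest;
}
```

Pushing 123 and then 456 onto a fresh machine leaves the simulated %esp at 56; two pops return 456 and then 123 and bring %esp back to 64.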

Procedure Control Flow
Use stack to support procedure call and return
Procedure call: call label
    Push return address on stack
    Jump to label
Return address: address of the instruction immediately after the call
Example from disassembly
    804854e: e8 3d 06 00 00    call 8048b90 <main>
    8048553: 50                pushl %eax
    Return address = 0x8048553 (the call instruction is 5 bytes long)
Procedure return: ret
    Pop address from stack
    Jump to address

Procedure Call Example
    804854e: e8 3d 06 00 00    call 8048b90 <main>
    8048553: 50                pushl %eax

                  Before call    After call 8048b90
    Mem[0x108]    123            123
    Mem[0x104]    (unused)       0x8048553
    %esp          0x108          0x104
    %eip          0x804854e      0x8048b90

%eip: program counter

Procedure Return Example
    ret    (single-byte opcode c3)

                  Before ret            After ret
    Mem[0x108]    123                   123
    Mem[0x104]    0x8048553             0x8048553
    %esp          0x104                 0x108
    %eip          address of the ret    0x8048553

%eip: program counter
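The effect of call and ret on %esp and %eip can be sketched in C with the addresses from the example. The names `toy_cpu`, `toy_word`, `toy_call`, and `toy_ret` are hypothetical, the memory window starting at 0x100 is an assumption made to keep the model small, and the 5-byte length of the `e8` call encoding is passed in explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy state: a small window of memory plus %esp and %eip. */
struct toy_cpu {
    uint32_t mem[8];   /* mem[k] models the 4-byte word at address 0x100 + 4*k */
    uint32_t esp;
    uint32_t eip;
};

/* Map a simulated address in [0x100, 0x120) to its slot in mem. */
uint32_t *toy_word(struct toy_cpu *c, uint32_t addr) {
    assert(addr >= 0x100u && addr < 0x120u && addr % 4u == 0);
    return &c->mem[(addr - 0x100u) / 4u];
}

/* call: push the address of the next instruction, then jump to target. */
void toy_call(struct toy_cpu *c, uint32_t target, uint32_t call_len) {
    uint32_t ret_addr = c->eip + call_len;
    c->esp -= 4;
    *toy_word(c, c->esp) = ret_addr;
    c->eip = target;
}

/* ret: pop the return address from the stack and jump to it. */
void toy_ret(struct toy_cpu *c) {
    c->eip = *toy_word(c, c->esp);
    c->esp += 4;
}
```

Starting with %esp = 0x108 and %eip = 0x804854e, `toy_call(&c, 0x8048b90, 5)` stores 0x8048553 at 0x104, and `toy_ret(&c)` then restores %esp to 0x108 and sets %eip to 0x8048553.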

Stack-Based Languages
Languages that support recursion
    e.g., C, Pascal, Java
    Code must be “reentrant”: multiple simultaneous instantiations of a single procedure
    Need some place to store the state of each instantiation: arguments, local variables, return pointer
Stack discipline
    State for a given procedure is needed for a limited time: from when it is called to when it returns
    Callee returns before caller does
Stack allocated in frames
    Each frame holds the state for a single procedure instantiation
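A small C sketch can make the "one frame per instantiation" idea concrete. The function name `frames_are_private` and the depth of 4 are assumptions chosen for illustration; the check runs in the deepest call, while every earlier instantiation is still live.

```c
enum { DEPTH = 4 };   /* assumed recursion depth for the demonstration */

/* Returns 1 if every live instantiation has its own copy of `local`. */
int frames_are_private(int level, const int *seen[DEPTH]) {
    int local = level;                 /* stored in this instantiation's frame */
    seen[level] = &local;
    if (level + 1 < DEPTH)
        return frames_are_private(level + 1, seen);

    /* Deepest call: all DEPTH frames are still live, so inspect them. */
    for (int i = 0; i < DEPTH; i++) {
        if (*seen[i] != i)
            return 0;                  /* some frame was clobbered */
        for (int j = i + 1; j < DEPTH; j++)
            if (seen[i] == seen[j])
                return 0;              /* two instantiations shared storage */
    }
    return 1;
}
```

Called as `frames_are_private(0, seen)` with a caller-provided array `const int *seen[DEPTH]`, the function reports that each instantiation kept a distinct, unclobbered local.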

Call Chain Example
    yoo(…)
    {
        • • •
        who();
        • • •
    }

    who(…)
    {
        • • •
        amI();
        • • •
    }

    amI(…)
    {
        • • •
        amI();
        • • •
    }
Procedure amI is recursive
[Figure: example call chain yoo → who → amI → amI → amI, with one further amI node shown in the call tree.]

Stack Frames
Contents
    Local variables
    Return information
    Temporary space
Management
    Space allocated when procedure is entered: “set-up” code
    Deallocated on return: “finish” code
[Figure: the previous frame sits above the frame for proc; frame pointer %ebp marks the start of the current frame and stack pointer %esp marks the stack “top”.]

Example Stack
[Figure sequence: the stack as the call chain runs. Each call adds a frame below its caller's frame, each return removes the bottom frame, and %ebp and %esp always bracket the frame of the procedure currently executing.]
    yoo running:       yoo
    yoo calls who:     yoo, who
    who calls amI:     yoo, who, amI
    amI calls amI:     yoo, who, amI, amI
    amI calls amI:     yoo, who, amI, amI, amI
    amI returns:       frames are removed one at a time, in reverse order of the calls
    who returns:       yoo

IA32/Linux Stack Frame
Current stack frame (“top” to bottom)
    “Argument build”: parameters for the function about to be called
    Local variables, if they can’t be kept in registers
    Saved register context
    Old frame pointer
Caller stack frame
    Return address, pushed by call instruction
    Arguments for this call
[Figure, from high to low addresses: caller frame ending with arguments and return address; then old %ebp (frame pointer %ebp points here), saved registers + local variables, and argument build (stack pointer %esp points at its end).]

Calling swap from call_swap
    int zip1 = 15213;
    int zip2 = 91125;

    void call_swap()
    {
        swap(&zip1, &zip2);
    }

    void swap(int *xp, int *yp)
    {
        int t0 = *xp;
        int t1 = *yp;
        *xp = t1;
        *yp = t0;
    }

Calling swap from call_swap
    call_swap:
        • • •
        pushl $zip2    # Global Var
        pushl $zip1    # Global Var
        call swap
        • • •

Resulting stack (high to low addresses): &zip2, &zip1, Rtn adr, with %esp pointing at Rtn adr

Revisiting swap
    void swap(int *xp, int *yp)
    {
        int t0 = *xp;
        int t1 = *yp;
        *xp = t1;
        *yp = t0;
    }

    swap:
        pushl %ebp             # Set Up
        movl %esp,%ebp
        pushl %ebx

        movl 12(%ebp),%ecx     # Body
        movl 8(%ebp),%edx
        movl (%ecx),%eax
        movl (%edx),%ebx
        movl %eax,(%edx)
        movl %ebx,(%ecx)

        movl -4(%ebp),%ebx     # Finish
        movl %ebp,%esp
        popl %ebp
        ret

swap Setup
    swap:
        pushl %ebp
        movl %esp,%ebp
        pushl %ebx

Entering stack (high to low addresses): &zip2, &zip1, Rtn adr, with %esp at Rtn adr and %ebp still pointing into the caller's frame
#1 After pushl %ebp: Old %ebp is pushed below Rtn adr and %esp points at it
#2 After movl %esp,%ebp: %ebp and %esp both point at Old %ebp
#3 After pushl %ebx: Old %ebx is pushed below Old %ebp and %esp points at it

Resulting stack, with offsets relative to %ebp
    Offset    Contents
    12        yp (&zip2)
     8        xp (&zip1)
     4        Rtn adr
     0        Old %ebp    <- %ebp
    -4        Old %ebx    <- %esp

    movl 12(%ebp),%ecx    # get yp
    movl 8(%ebp),%edx     # get xp
    . . .

swap Finish
    movl -4(%ebp),%ebx
    movl %ebp,%esp
    popl %ebp
    ret

swap's stack before finishing (high to low addresses): yp, xp, Rtn adr, Old %ebp (at %ebp), Old %ebx (at %esp)
#1 movl -4(%ebp),%ebx: restores the saved %ebx; the stack itself is unchanged
#2 movl %ebp,%esp: %esp now points at Old %ebp, discarding the slot that held Old %ebx
#3 popl %ebp: restores the caller's %ebp; %esp points at Rtn adr
#4 ret: pops Rtn adr and jumps to it; %esp points at xp

Observation
    Saved & restored register %ebx
    Didn’t do so for %eax, %ecx, or %edx

Disassembled swap
    080483a4 <swap>:
     80483a4: 55          push %ebp
     80483a5: 89 e5       mov  %esp,%ebp
     80483a7: 53          push %ebx
     80483a8: 8b 55 08    mov  0x8(%ebp),%edx
     80483ab: 8b 4d 0c    mov  0xc(%ebp),%ecx
     80483ae: 8b 1a       mov  (%edx),%ebx
     80483b0: 8b 01       mov  (%ecx),%eax
     80483b2: 89 02       mov  %eax,(%edx)
     80483b4: 89 19       mov  %ebx,(%ecx)
     80483b6: 5b          pop  %ebx
     80483b7: c9          leave
     80483b8: c3          ret

Calling Code
     8048409: e8 96 ff ff ff    call 80483a4 <swap>
     804840e: 8b 45 f8          mov  0xfffffff8(%ebp),%eax

Register Saving Conventions
When procedure yoo calls who:
    yoo is the caller
    who is the callee
Can a register be used for temporary storage?
    yoo:
        • • •
        movl $15213, %edx
        call who
        addl %edx, %eax
        • • •
        ret

    who:
        • • •
        movl 8(%ebp), %edx
        addl $91125, %edx
        • • •
        ret
Contents of register %edx are overwritten by who

Register Saving Conventions
When procedure yoo calls who:
    yoo is the caller
    who is the callee
Can a register be used for temporary storage?
Conventions
    “Caller save”: caller saves temporary in its frame before calling
    “Callee save”: callee saves temporary in its frame before using

IA32/Linux Register Usage
%eax, %edx, %ecx
    Caller saves prior to call if values are used later
%eax
    Also used to return integer value
%ebx, %esi, %edi
    Callee saves if it wants to use them
%esp, %ebp
    Special

    Caller-save temporaries:    %eax, %edx, %ecx
    Callee-save temporaries:    %ebx, %esi, %edi
    Special:                    %esp, %ebp

Recursive Factorial
    int rfact(int x)
    {
        int rval;
        if (x <= 1)
            return 1;
        rval = rfact(x-1);
        return rval * x;
    }

        .globl rfact
        .type rfact,@function
    rfact:
        pushl %ebp
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%ebx
        cmpl $1,%ebx
        jle .L78
        leal -1(%ebx),%eax
        pushl %eax
        call rfact
        imull %ebx,%eax
        jmp .L79
        .align 4
    .L78:
        movl $1,%eax
    .L79:
        movl -4(%ebp),%ebx
        movl %ebp,%esp
        popl %ebp
        ret
Registers
    %eax used without first saving
    %ebx used, but saved at beginning & restored at end

Pointer Code
Recursive procedure
    void s_helper(int x, int *accum)
    {
        if (x <= 1)
            return;
        else {
            int z = *accum * x;
            *accum = z;
            s_helper(x-1, accum);
        }
    }
Top-level call
    int sfact(int x)
    {
        int val = 1;
        s_helper(x, &val);
        return val;
    }
Pass pointer to update location

Creating & Initializing Pointer
    int sfact(int x)
    {
        int val = 1;
        s_helper(x, &val);
        return val;
    }
Variable val must be stored on stack
    Because: need to create pointer to it
Compute pointer as -4(%ebp)
Push on stack as second argument
Initial part of sfact
    _sfact:
        pushl %ebp           # Save %ebp
        movl %esp,%ebp       # Set %ebp
        subl $16,%esp        # Allocate 16 bytes
        movl 8(%ebp),%edx    # edx = x
        movl $1,-4(%ebp)     # val = 1
Stack after this code, with offsets relative to %ebp
    Offset       Contents
      8          x
      4          Rtn adr
      0          Old %ebp    <- %ebp
     -4          val = 1
     -8 to -16   Unused temp. space    <- %esp at -16

Passing Pointer
    int sfact(int x)
    {
        int val = 1;
        s_helper(x, &val);
        return val;
    }
Calling s_helper from sfact
    leal -4(%ebp),%eax    # Compute &val
    pushl %eax            # Push on stack
    pushl %edx            # Push x
    call s_helper         # call
    movl -4(%ebp),%eax    # Return val
    • • •                 # Finish
Stack at time of call, with offsets relative to %ebp
    Offset       Contents
      8          x
      4          Rtn adr
      0          Old %ebp    <- %ebp
     -4          val = 1 (holds x! once s_helper returns)
     -8 to -16   Unused
                 &val
                 x           <- %esp

IA32 Procedure Summary
The stack makes recursion work
    Private storage for each instance of procedure call
        Instantiations don’t clobber each other
        Addressing of locals + arguments can be relative to stack positions
    Managed by stack discipline
        Procedures return in inverse order of calls
IA32 procedures: combination of instructions + conventions
    Call / ret instructions
    Register usage conventions
        Caller / callee save
        %ebp and %esp
    Stack frame organization conventions
[Figure, from high to low addresses: caller frame with arguments and return address; old %ebp (at %ebp); saved registers + local variables; argument build (ending at %esp).]

Today
    Arrays
        One-dimensional
        Multi-dimensional (nested)
        Multi-level
    Structures

Basic Data Types
Integral
    Stored & operated on in general (integer) registers
    Signed vs. unsigned depends on instructions used

    Intel          GAS    Bytes    C
    byte           b      1        [unsigned] char
    word           w      2        [unsigned] short
    double word    l      4        [unsigned] int
    quad word      q      8        [unsigned] long int (x86-64)
Floating point
    Stored & operated on in floating point registers

    Intel          GAS    Bytes       C
    Single         s      4           float
    Double         l      8           double
    Extended       t      10/12/16    long double

Array Allocation
Basic principle
    T A[L];
    Array of data type T and length L
    Contiguously allocated region of L * sizeof(T) bytes

    Declaration         Element addresses           End of array
    char string[12];    x, x+1, ..., x+11           x + 12
    int val[5];         x, x+4, x+8, x+12, x+16     x + 20
    double a[3];        x, x+8, x+16                x + 24
    char *p[3];         x, x+4, x+8 (IA32)          x + 12
                        x, x+8, x+16 (x86-64)       x + 24
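The rule "L * sizeof(T) bytes, contiguously allocated" can be checked directly with `sizeof`. The sketch below declares the four arrays from the slide; `val_stride` is a hypothetical helper name, and the actual byte counts depend on the platform's `int`, `double`, and pointer sizes.

```c
#include <stddef.h>

/* The four declarations from the slide. */
char   string[12];
int    val[5];
double a[3];
char  *p[3];

/* Byte distance between consecutive elements of val, measured with pointers. */
size_t val_stride(void) {
    return (size_t)((char *)&val[1] - (char *)&val[0]);
}
```

On IA32, `sizeof val` is 20 and `sizeof p` is 12; on x86-64, `sizeof p` is 24 because pointers are 8 bytes.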

Array Access
Basic principle
    T A[L];
    Array of data type T and length L
    Identifier A can be used as a pointer to array element 0: type T*

    int val[5] holds 1, 5, 2, 1, 3 at addresses x, x+4, x+8, x+12, x+16 (array ends at x+20)

    Reference    Type     Value
    val[4]       int      3
    val          int *    x
    val+1        int *    x + 4
    &val[2]      int *    x + 8
    val[5]       int      ??
    *(val+1)     int      5
    val + i      int *    x + 4 i
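The reference table can be reproduced in C. This is a sketch using the same `val` contents; `byte_offset` is a hypothetical helper that reports how far a pointer lies from the start of the array, which corresponds to the "x + ..." column.

```c
#include <stddef.h>

int val[5] = {1, 5, 2, 1, 3};

/* Byte offset of a pointer into val from the start of the array. */
ptrdiff_t byte_offset(const int *ptr) {
    return (const char *)ptr - (const char *)val;
}
```

With 4-byte `int`, `byte_offset(val + 1)` is 4 and `byte_offset(&val[2])` is 8, while `*(val + 1)` reads the value 5. The out-of-range reference `val[5]` is deliberately left out because reading it is undefined.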

Array Example
    typedef int zip_dig[5];

    zip_dig cmu = { 1, 5, 2, 1, 3 };
    zip_dig mit = { 0, 2, 1, 3, 9 };
    zip_dig ucb = { 9, 4, 7, 2, 0 };

    Array    Elements     Element addresses    End
    cmu      1 5 2 1 3    16 20 24 28 32       36
    mit      0 2 1 3 9    36 40 44 48 52       56
    ucb      9 4 7 2 0    56 60 64 68 72       76
Declaration “zip_dig cmu” equivalent to “int cmu[5]”
Example arrays were allocated in successive 20 byte blocks
    Not guaranteed to happen in general

Array Accessing Example
    zip_dig cmu: elements 1 5 2 1 3 at addresses 16 20 24 28 32 (ends at 36)

    int get_digit(zip_dig z, int dig)
    {
        return z[dig];
    }
IA32
    # %edx = z
    # %eax = dig
    movl (%edx,%eax,4),%eax    # z[dig]
Register %edx contains starting address of array
Register %eax contains array index
Desired digit at 4*%eax + %edx
Use memory reference (%edx,%eax,4)

Referencing Examples
    Array    Elements     Element addresses    End
    cmu      1 5 2 1 3    16 20 24 28 32       36
    mit      0 2 1 3 9    36 40 44 48 52       56
    ucb      9 4 7 2 0    56 60 64 68 72       76

    Reference    Address              Value    Guaranteed?
    mit[3]       36 + 4* 3 = 48       3        Yes
    mit[5]       36 + 4* 5 = 56       9        No
    mit[-1]      36 + 4*-1 = 32       3        No
    cmu[15]      16 + 4*15 = 76       ??       No
No bound checking
Out of range behavior implementation-dependent
    No guaranteed relative allocation of different arrays

Array Loop Example
Original
    int zd2int(zip_dig z)
    {
        int i;
        int zi = 0;
        for (i = 0; i < 5; i++) {
            zi = 10 * zi + z[i];
        }
        return zi;
    }
Transformed
    As generated by GCC
    Eliminate loop variable i
    Convert array code to pointer code
    Express in do-while form (no test at entrance)

    int zd2int(zip_dig z)
    {
        int zi = 0;
        int *zend = z + 4;
        do {
            zi = 10 * zi + *z;
            z++;
        } while (z <= zend);
        return zi;
    }

Array Loop Implementation (IA32)
Registers
    %ecx    z
    %eax    zi
    %ebx    zend
Computations
    10*zi + *z implemented as *z + 2*(zi+4*zi)
    z++ increments by 4

    int zd2int(zip_dig z)
    {
        int zi = 0;
        int *zend = z + 4;
        do {
            zi = 10 * zi + *z;
            z++;
        } while (z <= zend);
        return zi;
    }

        # %ecx = z
        xorl %eax,%eax             # zi = 0
        leal 16(%ecx),%ebx         # zend = z+4
    .L59:
        leal (%eax,%eax,4),%edx    # 5*zi
        movl (%ecx),%eax           # *z
        addl $4,%ecx               # z++
        leal (%eax,%edx,2),%eax    # zi = *z + 2*(5*zi)
        cmpl %ebx,%ecx             # z : zend
        jle .L59                   # if <= goto loop

Nested Array Example
    #define PCOUNT 4
    zip_dig pgh[PCOUNT] =
        {{1, 5, 2, 0, 6},
         {1, 5, 2, 1, 3},
         {1, 5, 2, 1, 7},
         {1, 5, 2, 2, 1}};

    Row       Elements     Starting address
    pgh[0]    1 5 2 0 6    76
    pgh[1]    1 5 2 1 3    96
    pgh[2]    1 5 2 1 7    116
    pgh[3]    1 5 2 2 1    136    (array ends at 156)
“zip_dig pgh[4]” equivalent to “int pgh[4][5]”
    Variable pgh: array of 4 elements, allocated contiguously
    Each element is an array of 5 int’s, allocated contiguously
“Row-major” ordering of all elements guaranteed

Multidimensional (Nested) Arrays
Declaration
    T A[R][C];
    2D array of data type T
    R rows, C columns
    Type T element requires K bytes
Array size
    R * C * K bytes
Arrangement
    Row-major ordering

    int A[R][C];
    Memory order: A[0][0] ... A[0][C-1], A[1][0] ... A[1][C-1], ..., A[R-1][0] ... A[R-1][C-1]
    Total: 4*R*C bytes

Nested Array Row Access
Row vectors
    A[i] is array of C elements
    Each element of type T requires K bytes
    Starting address A + i * (C * K)

    int A[R][C];

    Row       Elements                         Starting address
    A[0]      A[0][0] ... A[0][C-1]            A
    A[i]      A[i][0] ... A[i][C-1]            A+i*C*4
    A[R-1]    A[R-1][0] ... A[R-1][C-1]        A+(R-1)*C*4

Nested Array Row Access Code
    #define PCOUNT 4
    zip_dig pgh[PCOUNT] =
        {{1, 5, 2, 0, 6},
         {1, 5, 2, 1, 3},
         {1, 5, 2, 1, 7},
         {1, 5, 2, 2, 1}};

    int *get_pgh_zip(int index)
    {
        return pgh[index];
    }

    # %eax = index
    leal (%eax,%eax,4),%eax    # 5 * index
    leal pgh(,%eax,4),%eax     # pgh + (20 * index)
Row vector
    pgh[index] is array of 5 int’s
    Starting address pgh+20*index
IA32 code
    Computes and returns address
    Compute as pgh + 4*(index+4*index)

Nested Array Element Access
Array elements
    A[i][j] is element of type T, which requires K bytes
    Address A + i * (C * K) + j * K = A + (i * C + j) * K

    int A[R][C];

    Location                 Address
    Row A[0]                 A
    Row A[i]                 A+i*C*4
    Element A[i][j]          A+i*C*4+j*4
    Row A[R-1]               A+(R-1)*C*4
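The address formula A + (i * C + j) * K can be compared against the addresses a C compiler actually uses. In this sketch the dimensions R = 4 and C = 5 are arbitrary assumptions, and `formula_offset` and `measured_offset` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

enum { R = 4, C = 5 };   /* assumed dimensions for the demonstration */

int A[R][C];

/* Byte offset of A[i][j] from A, using the formula (i*C + j)*K. */
size_t formula_offset(size_t i, size_t j) {
    const size_t K = sizeof(int);
    assert(i < R && j < C);
    return (i * C + j) * K;
}

/* The same offset, measured from the element's real address. */
size_t measured_offset(size_t i, size_t j) {
    assert(i < R && j < C);
    return (size_t)((const char *)&A[i][j] - (const char *)&A[0][0]);
}
```

For every valid pair (i, j) the two functions agree, and `sizeof A` equals R * C * sizeof(int), as the array size rule predicts.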

Nested Array Element Access Code
    int get_pgh_digit(int index, int dig)
    {
        return pgh[index][dig];
    }

    # %ecx = dig
    # %eax = index
    leal 0(,%ecx,4),%edx          # 4*dig
    leal (%eax,%eax,4),%eax       # 5*index
    movl pgh(%edx,%eax,4),%eax    # *(pgh + 4*dig + 20*index)
Array elements
    pgh[index][dig] is int
    Address: pgh + 20*index + 4*dig
IA32 code
    Computes address pgh + 4*dig + 4*(index+4*index)
    movl performs memory reference

Strange Referencing Examples
    zip_dig pgh[4];

    Row       Elements     Starting address
    pgh[0]    1 5 2 0 6    76
    pgh[1]    1 5 2 1 3    96
    pgh[2]    1 5 2 1 7    116
    pgh[3]    1 5 2 2 1    136    (array ends at 156)

    Reference     Address                   Value    Guaranteed?
    pgh[3][3]     76+20*3+4*3  = 148        2        Yes
    pgh[2][5]     76+20*2+4*5  = 136        1        Yes
    pgh[2][-1]    76+20*2+4*-1 = 112        3        Yes
    pgh[4][-1]    76+20*4+4*-1 = 152        1        Yes
    pgh[0][19]    76+20*0+4*19 = 152        1        Yes
    pgh[0][-1]    76+20*0+4*-1 = 72         ??       No
Code does not do any bounds checking
Ordering of elements within array guaranteed

Multi-Level Array Example
    zip_dig cmu = { 1, 5, 2, 1, 3 };
    zip_dig mit = { 0, 2, 1, 3, 9 };
    zip_dig ucb = { 9, 4, 7, 2, 0 };

    #define UCOUNT 3
    int *univ[UCOUNT] = {mit, cmu, ucb};
Variable univ denotes array of 3 elements
Each element is a pointer
    4 bytes
Each pointer points to array of int’s

    Array    Elements     Element addresses    End
    cmu      1 5 2 1 3    16 20 24 28 32       36
    mit      0 2 1 3 9    36 40 44 48 52       56
    ucb      9 4 7 2 0    56 60 64 68 72       76
    univ     36 16 56     160 164 168

Element Access in Multi-Level Array
    int get_univ_digit(int index, int dig)
    {
        return univ[index][dig];
    }

    # %ecx = index
    # %eax = dig
    leal 0(,%ecx,4),%edx       # 4*index
    movl univ(%edx),%edx       # Mem[univ+4*index]
    movl (%edx,%eax,4),%eax    # Mem[...+4*dig]
Computation (IA32)
    Element access Mem[Mem[univ+4*index]+4*dig]
    Must do two memory reads
        First get pointer to row array
        Then access element within array

Array Element Accesses
Nested array
    int get_pgh_digit(int index, int dig)
    {
        return pgh[index][dig];
    }
    Element at Mem[pgh+20*index+4*dig]
Multi-level array
    int get_univ_digit(int index, int dig)
    {
        return univ[index][dig];
    }
    Element at Mem[Mem[univ+4*index]+4*dig]
Access looks similar in C, but the address computations are very different
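The difference between the two access patterns can be spelled out in C. In this sketch the `_explicit` functions are hypothetical rewrites that perform the address arithmetic by hand: one memory read for the nested array, two for the multi-level array. The data matches the pgh and univ examples from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int zip_dig[5];

zip_dig pgh[4] = {{1, 5, 2, 0, 6}, {1, 5, 2, 1, 3},
                  {1, 5, 2, 1, 7}, {1, 5, 2, 2, 1}};

zip_dig cmu = {1, 5, 2, 1, 3};
zip_dig mit = {0, 2, 1, 3, 9};
zip_dig ucb = {9, 4, 7, 2, 0};
int *univ[3] = {mit, cmu, ucb};

int get_pgh_digit(int index, int dig)  { return pgh[index][dig]; }
int get_univ_digit(int index, int dig) { return univ[index][dig]; }

/* Nested array: one read at pgh + sizeof(int)*(5*index + dig). */
int get_pgh_digit_explicit(int index, int dig) {
    int value;
    const char *base = (const char *)pgh;
    assert(index >= 0 && index < 4 && dig >= 0 && dig < 5);
    memcpy(&value, base + sizeof(int) * (size_t)(5 * index + dig), sizeof value);
    return value;
}

/* Multi-level array: first read the row pointer, then read the element. */
int get_univ_digit_explicit(int index, int dig) {
    assert(index >= 0 && index < 3 && dig >= 0 && dig < 5);
    int *row = *(univ + index);   /* first memory read */
    return *(row + dig);          /* second memory read */
}
```

Both explicit versions return the same values as the subscript versions; for example, `get_pgh_digit_explicit(1, 4)` is 3 and `get_univ_digit_explicit(0, 4)` is 9.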

Strange Referencing Examples
    Array    Elements     Element addresses    End
    cmu      1 5 2 1 3    16 20 24 28 32       36
    mit      0 2 1 3 9    36 40 44 48 52       56
    ucb      9 4 7 2 0    56 60 64 68 72       76
    univ     36 16 56     160 164 168

    Reference      Address             Value    Guaranteed?
    univ[2][3]     56+4*3  = 68        2        Yes
    univ[1][5]     16+4*5  = 36        0        No
    univ[2][-1]    56+4*-1 = 52        9        No
    univ[3][-1]    ??                  ??       No
    univ[1][12]    16+4*12 = 64        7        No
Code does not do any bounds checking
Ordering of elements in different arrays not guaranteed

Using Nested Arrays
    #define N 16
    typedef int fix_matrix[N][N];

    /* Compute element i,k of fixed matrix product */
    int fix_prod_ele(fix_matrix a, fix_matrix b, int i, int k)
    {
        int j;
        int result = 0;
        for (j = 0; j < N; j++)
            result += a[i][j]*b[j][k];
        return result;
    }
Strengths
    C compiler handles doubly subscripted arrays
    Generates very efficient code
        Avoids multiply in index computation
Limitation
    Only works for fixed array size
[Figure: the i-th row of a is multiplied by the k-th column of b, with j stepping along both.]

Dynamic Nested Arrays
Strength
    Can create matrix of any size
Programming
    Must do index computation explicitly
Performance
    Accessing single element costly
    Must do multiplication

    int *new_var_matrix(int n)
    {
        return (int *) calloc(sizeof(int), n*n);
    }

    int var_ele(int *a, int i, int j, int n)
    {
        return a[i*n+j];
    }

    movl 12(%ebp),%eax         # i
    movl 8(%ebp),%edx          # a
    imull 20(%ebp),%eax        # n*i
    addl 16(%ebp),%eax         # n*i+j
    movl (%edx,%eax,4),%eax    # Mem[a+4*(i*n+j)]

Dynamic Array Multiplication
Without optimizations
    Multiplies: 3
        2 for subscripts
        1 for data
    Adds: 4
        2 for array indexing
        1 for loop index
        1 for data

    /* Compute element i,k of variable matrix product */
    int var_prod_ele(int *a, int *b, int i, int k, int n)
    {
        int j;
        int result = 0;
        for (j = 0; j < n; j++)
            result += a[i*n+j] * b[j*n+k];
        return result;
    }

Optimizing Dynamic Array Multiplication
Original body of var_prod_ele
    {
        int j;
        int result = 0;
        for (j = 0; j < n; j++)
            result += a[i*n+j] * b[j*n+k];
        return result;
    }
Optimizations
    Performed when optimization level is set to -O2
Code motion
    Expression i*n can be computed outside loop
Strength reduction
    Incrementing j has effect of incrementing j*n+k by n
Operations count
    4 adds, 1 mult
Compiler can optimize regular access patterns
Optimized body
    {
        int j;
        int result = 0;
        int iTn = i*n;
        int jTnPk = k;
        for (j = 0; j < n; j++) {
            result += a[iTn+j] * b[jTnPk];
            jTnPk += n;
        }
        return result;
    }
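The two loop bodies can be placed side by side as complete functions to confirm they compute the same value. This is a sketch: `var_prod_ele_opt` is a hypothetical name for the hand-optimized version, and the `const` qualifiers are an addition not present on the slides.

```c
/* Straightforward version: two multiplies for subscripts in every iteration. */
int var_prod_ele(const int *a, const int *b, int i, int k, int n) {
    int result = 0;
    for (int j = 0; j < n; j++)
        result += a[i * n + j] * b[j * n + k];
    return result;
}

/* Hand-applied code motion and strength reduction. */
int var_prod_ele_opt(const int *a, const int *b, int i, int k, int n) {
    int result = 0;
    int iTn = i * n;      /* code motion: i*n computed once, outside the loop */
    int jTnPk = k;        /* strength reduction: tracks j*n + k by repeated adds */
    for (int j = 0; j < n; j++) {
        result += a[iTn + j] * b[jTnPk];
        jTnPk += n;
    }
    return result;
}
```

For the 3x3 matrices a = {1,...,9} and b = {9,...,1} stored row by row, both functions return 30 for element (0, 0), since 1*9 + 2*6 + 3*3 = 30.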

Today
    Structures
    Alignment
    Unions
    Floating point

Structures
    struct rec {
        int i;
        int a[3];
        int *p;
    };
Memory layout
    Member    Offset
    i         0
    a         4
    p         16
    (total size 20 bytes)
Concept
    Contiguously-allocated region of memory
    Refer to members within structure by names
    Members may be of different types
Accessing structure member
    void set_i(struct rec *r, int val)
    {
        r->i = val;
    }
IA32 assembly
    # %eax = val
    # %edx = r
    movl %eax,(%edx)    # Mem[r] = val

Generating Pointer to Structure Member
    struct rec {
        int i;
        int a[3];
        int *p;
    };
    Layout: i at offset 0, a at offset 4, p at offset 16, total 20 bytes; element a[idx] is at r+4+4*idx
Generating pointer to array element
    Offset of each structure member determined at compile time

    int *find_a(struct rec *r, int idx)
    {
        return &r->a[idx];
    }

    # %ecx = idx
    # %edx = r
    leal 0(,%ecx,4),%eax      # 4*idx
    leal 4(%eax,%edx),%eax    # r+4*idx+4

Structure Referencing (Cont.)
C code
    struct rec {
        int i;
        int a[3];
        int *p;
    };
    Layout: i at offset 0, a at offset 4, p at offset 16, total 20 bytes

    void set_p(struct rec *r)
    {
        r->p = &r->a[r->i];
    }
    Sets r->p to point at element number r->i of the array a

    # %edx = r
    movl (%edx),%ecx          # r->i
    leal 0(,%ecx,4),%eax      # 4*(r->i)
    leal 4(%edx,%eax),%eax    # r+4+4*(r->i)
    movl %eax,16(%edx)        # Update r->p

Alignment
Aligned data
    Primitive data type requires K bytes
    Address must be multiple of K
    Required on some machines; advised on IA32
        Treated differently by IA32 Linux, x86-64 Linux, and Windows!
Motivation for aligning data
    Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
        Inefficient to load or store datum that spans quad word boundaries
        Virtual memory very tricky when datum spans 2 pages
Compiler
    Inserts gaps in structure to ensure correct alignment of fields

Specific Cases of Alignment (IA32)
1 byte: char, …
    No restrictions on address
2 bytes: short, …
    Lowest 1 bit of address must be 0 (binary)
4 bytes: int, float, char *, …
    Lowest 2 bits of address must be 00 (binary)
8 bytes: double, …
    Windows (and most other OS’s & instruction sets): lowest 3 bits of address must be 000 (binary)
    Linux: lowest 2 bits of address must be 00 (binary), i.e., treated the same as a 4-byte primitive data type
12 bytes: long double
    Windows, Linux: lowest 2 bits of address must be 00 (binary), i.e., treated the same as a 4-byte primitive data type

Satisfying Alignment with Structures
Within structure
    Must satisfy element’s alignment requirement
Overall structure placement
    Each structure has alignment requirement K
        K = largest alignment of any element
    Initial address & structure length must be multiples of K
Example (under Windows or x86-64)
    K = 8, due to double element

    struct S1 {
        char c;
        int i[2];
        double v;
    } *p;

    Offset    Contents
    p+0       c, followed by 3 bytes of padding    (p is a multiple of 8)
    p+4       i[0]                                 (multiple of 4)
    p+8       i[1], followed by 4 bytes of padding
    p+16      v                                    (multiple of 8)
    p+24      end of structure                     (multiple of 8)

Different Alignment Conventions
    struct S1 {
        char c;
        int i[2];
        double v;
    } *p;
x86-64 or IA32 Windows
    K = 8, due to double element
    Layout: c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, 4 bytes padding, v at p+16, end at p+24
IA32 Linux
    K = 4; double treated like a 4-byte data type
    Layout: c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, v at p+12, end at p+20

Saving Space
Put large data types first
    struct S1 {
        char c;
        int i[2];
        double v;
    } *p;

    struct S2 {
        double v;
        int i[2];
        char c;
    } *p;
Effect (example x86-64, both have K=8)
    S1 layout: c at p+0, 3 bytes padding, i[0] at p+4, i[1] at p+8, 4 bytes padding, v at p+16, end at p+24
    S2 layout: v at p+0, i[0] at p+8, i[1] at p+12, c at p+16, with no padding between members
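The layouts can be inspected on any compiler with `offsetof` and `sizeof`. This sketch declares both structures from the slide; `padding_S1` and `padding_S2` are hypothetical helpers, and the concrete numbers they return depend on the platform's alignment rules.

```c
#include <stddef.h>

struct S1 { char c; int i[2]; double v; };
struct S2 { double v; int i[2]; char c; };

/* Bytes actually needed by the members of either structure. */
static size_t member_bytes(void) {
    return sizeof(char) + 2 * sizeof(int) + sizeof(double);
}

/* Padding = total size minus the bytes the members need. */
size_t padding_S1(void) { return sizeof(struct S1) - member_bytes(); }
size_t padding_S2(void) { return sizeof(struct S2) - member_bytes(); }
```

On x86-64 Linux with GCC, `offsetof(struct S1, v)` is 16 and `offsetof(struct S2, c)` is 16. In S2 all padding sits after the last member, where it only serves to keep the structure length a multiple of K.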

Arrays of Structures
Satisfy alignment requirement for every element
    struct S2 {
        double v;
        int i[2];
        char c;
    } a[10];

    Elements: a[0] at a+0, a[1] at a+24, a[2] at a+48, • • •
    Inside a[1]: v at a+24, i[0] at a+32, i[1] at a+36, c at a+40, then 7 bytes of padding up to a+48

Accessing Array Elements
    struct S3 {
        short i;
        float v;
        short j;
    } a[10];
Compute array offset 12i, where i is the array index
Compute offset 8 within structure
Assembler gives offset a+8
    Resolved during linking

    Elements: a[0] at a+0, • • •, a[i] at a+12i, • • •
    Inside a[i]: i at a+12i, 2 bytes padding, v at a+12i+4, j at a+12i+8, 2 bytes padding

    short get_j(int idx)
    {
        return a[idx].j;
    }

    # %eax = idx
    leal (%eax,%eax,2),%eax    # 3*idx
    movswl a+8(,%eax,4),%eax
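The two-part offset (array offset plus offset within the structure) can be written out with `sizeof` and `offsetof`. In this sketch `offset_of_j` and `measured_offset_of_j` are hypothetical helper names; on IA32 they both yield 12*idx + 8.

```c
#include <assert.h>
#include <stddef.h>

struct S3 { short i; float v; short j; };
struct S3 a[10];

/* Array offset (idx * element size) plus member offset within the structure. */
size_t offset_of_j(size_t idx) {
    assert(idx < 10);
    return idx * sizeof(struct S3) + offsetof(struct S3, j);
}

/* The same offset, measured from the member's real address. */
size_t measured_offset_of_j(size_t idx) {
    assert(idx < 10);
    return (size_t)((const char *)&a[idx].j - (const char *)a);
}
```

The compiler performs the `offsetof` part at compile time, which is why the assembly only needs the constant a+8 and a scaled index.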

Union Allocation
Allocate according to largest element
Can only use one field at a time
    union U1 {
        char c;
        int i[2];
        double v;
    } *up;
    Layout: c, i[0], and v all start at up+0; i[1] is at up+4; the union ends at up+8

    struct S1 {
        char c;
        int i[2];
        double v;
    } *sp;
    Layout: c at sp+0, 3 bytes padding, i[0] at sp+4, i[1] at sp+8, 4 bytes padding, v at sp+16, end at sp+24

Using Union to Access Bit Patterns
    typedef union {
        float f;
        unsigned u;
    } bit_float_t;
    Layout: u and f share the same 4 bytes

    float bit2float(unsigned u)
    {
        bit_float_t arg;
        arg.u = u;
        return arg.f;
    }
    Same as (float) u ?

    unsigned float2bit(float f)
    {
        bit_float_t arg;
        arg.f = f;
        return arg.u;
    }
    Same as (unsigned) f ?
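The two questions can be answered by running the functions next to ordinary casts. This sketch assumes that `float` and `unsigned` are both 32 bits and that `float` uses the IEEE 754 single-precision format, which holds on IA32 and x86-64 but is not guaranteed by C itself.

```c
typedef union {
    float f;
    unsigned u;
} bit_float_t;

/* Assumption checked at compile time: both members occupy the same 4 bytes. */
_Static_assert(sizeof(float) == sizeof(unsigned), "float and unsigned differ in size");

/* Reinterpret the bits of u as a float; no numeric conversion happens. */
float bit2float(unsigned u) {
    bit_float_t arg;
    arg.u = u;
    return arg.f;
}

/* Reinterpret the bits of f as an unsigned; no numeric conversion happens. */
unsigned float2bit(float f) {
    bit_float_t arg;
    arg.f = f;
    return arg.u;
}
```

Under those assumptions `float2bit(1.0f)` is 0x3f800000, whereas the cast `(unsigned) 1.0f` is 1. The union copies the bit pattern, while the cast converts the numeric value, so the answer to both questions is no.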

Byte Ordering Revisited
Idea
    Short/long/quad words stored in memory as 2/4/8 consecutive bytes
    Which is most (least) significant?
    Can cause problems when exchanging binary data between machines
Big endian
    Most significant byte has lowest address
    PowerPC, Sparc
Little endian
    Least significant byte has lowest address
    Intel x86

Byte Ordering Example
    union {
        unsigned char c[8];
        unsigned short s[4];
        unsigned int i[2];
        unsigned long l[1];
    } dw;

    All members overlay the same bytes:
    c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
    s[0]      s[1]      s[2]      s[3]
    i[0]                i[1]
    l[0]

Byte Ordering Example (Cont.)
    int j;
    for (j = 0; j < 8; j++)
        dw.c[j] = 0xf0 + j;

    printf("Characters 0-7 == [0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x,0x%x]\n",
           dw.c[0], dw.c[1], dw.c[2], dw.c[3],
           dw.c[4], dw.c[5], dw.c[6], dw.c[7]);

    printf("Shorts 0-3 == [0x%x,0x%x,0x%x,0x%x]\n",
           dw.s[0], dw.s[1], dw.s[2], dw.s[3]);

    printf("Ints 0-1 == [0x%x,0x%x]\n",
           dw.i[0], dw.i[1]);

    printf("Long 0 == [0x%lx]\n",
           dw.l[0]);

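Byte Ordering on IA32
Little endian
    Byte     c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
    Value    f0   f1   f2   f3   f4   f5   f6   f7
    s[0] = c[0..1], s[1] = c[2..3], s[2] = c[4..5], s[3] = c[6..7]
    i[0] = c[0..3], i[1] = c[4..7]
    l[0] = c[0..3]
    In each value the least significant byte (LSB) is at the lowest address and the most significant byte (MSB) at the highest
Output on IA32:
    Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
    Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
    Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4]
    Long 0 == [0xf3f2f1f0]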

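Byte Ordering on Sun
Big endian
    Byte     c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
    Value    f0   f1   f2   f3   f4   f5   f6   f7
    s[0] = c[0..1], s[1] = c[2..3], s[2] = c[4..5], s[3] = c[6..7]
    i[0] = c[0..3], i[1] = c[4..7]
    l[0] = c[0..3]
    In each value the most significant byte (MSB) is at the lowest address and the least significant byte (LSB) at the highest
Output on Sun:
    Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
    Shorts 0-3 == [0xf0f1,0xf2f3,0xf4f5,0xf6f7]
    Ints 0-1 == [0xf0f1f2f3,0xf4f5f6f7]
    Long 0 == [0xf0f1f2f3]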

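Byte Ordering on x86-64
Little endian
    Byte     c[0] c[1] c[2] c[3] c[4] c[5] c[6] c[7]
    Value    f0   f1   f2   f3   f4   f5   f6   f7
    s[0] = c[0..1], s[1] = c[2..3], s[2] = c[4..5], s[3] = c[6..7]
    i[0] = c[0..3], i[1] = c[4..7]
    l[0] = c[0..7]
    In each value the least significant byte (LSB) is at the lowest address and the most significant byte (MSB) at the highest
Output on x86-64:
    Characters 0-7 == [0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7]
    Shorts 0-3 == [0xf1f0,0xf3f2,0xf5f4,0xf7f6]
    Ints 0-1 == [0xf3f2f1f0,0xf7f6f5f4]
    Long 0 == [0xf7f6f5f4f3f2f1f0]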
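Because the printed values differ between machines, code that exchanges binary data usually detects or sidesteps the host byte order. The sketch below is one hedged way to do both; `is_little_endian` and `load_le32` are hypothetical helper names, and the probe only distinguishes pure little-endian from pure big-endian layouts.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian machine, 0 on a big-endian one. */
int is_little_endian(void) {
    const uint32_t probe = 0xf3f2f1f0u;
    unsigned char bytes[sizeof probe];
    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0xf0;   /* LSB stored at the lowest address? */
}

/* Assemble a 32-bit value from 4 bytes stored in little-endian order,
   giving the same result on any host. */
uint32_t load_le32(const unsigned char b[4]) {
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

Given the bytes f0 f1 f2 f3, `load_le32` always produces 0xf3f2f1f0, which matches what IA32 and x86-64 print for i[0], while reading the same bytes directly as an integer gives 0xf0f1f2f3 on a big-endian machine.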

Summary
Arrays in C
    Contiguous allocation of memory
    Aligned to satisfy every element’s alignment requirement
    Pointer to first element
    No bounds checking
Structures
    Allocate bytes in order declared
    Pad in middle and at end to satisfy alignment
Unions
    Overlay declarations
    Way to circumvent type system

Today
    Structures
    Alignment
    Unions
    Floating point
        x87 (available with IA32, becoming obsolete)
        SSE3 (available with x86-64)

IA32 Floating Point (x87)
History
    8086: first processor to implement IEEE FP, through the separate 8087 FPU (floating point unit)
    486: merged FPU and integer unit onto one chip
    Becoming obsolete with x86-64
Summary
    Hardware to add, multiply, and divide
    Floating point data registers
    Various control & status registers
Floating point formats
    Single precision (C float): 32 bits
    Double precision (C double): 64 bits
    Extended precision (C long double): 80 bits
[Figure: instruction decoder and sequencer feeding the integer unit and the FPU, both connected to memory.]

FPU Data Register Stack (x87)
FPU register format (80 bit extended precision)
    Bit 79: s (sign); bits 78 to 64: exp; bits 63 to 0: frac
FPU registers
    8 registers %st(0) - %st(7)
    Logically form stack
    Top: %st(0)
    Bottom disappears (drops out) after too many pushes
[Figure: register stack with “top” at %st(0), followed by %st(1), %st(2), %st(3).]

FPU instructions (x87)
Large number of floating point instructions and formats
    ~50 basic instruction types
    load, store, add, multiply
    sin, cos, tan, arctan, and log
        Often slower than math lib
Sample instructions:
    Instruction    Effect                             Description
    fldz           push 0.0                           Load zero
    flds Addr      push Mem[Addr]                     Load single precision real
    fmuls Addr     %st(0) ← %st(0)*M[Addr]            Multiply
    faddp          %st(1) ← %st(0)+%st(1); pop        Add and pop

FP Code Example (x87)
Compute inner product of two vectors
    Single precision arithmetic
    Common computation

    float ipf(float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++)
            result += x[i]*y[i];
        return result;
    }

        pushl %ebp               # setup
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%ebx        # %ebx=&x
        movl 12(%ebp),%ecx       # %ecx=&y
        movl 16(%ebp),%edx       # %edx=n
        fldz                     # push +0.0
        xorl %eax,%eax           # i=0
        cmpl %edx,%eax           # if i>=n done
        jge .L3
    .L5:
        flds (%ebx,%eax,4)       # push x[i]
        fmuls (%ecx,%eax,4)      # st(0)*=y[i]
        faddp                    # st(1)+=st(0); pop
        incl %eax                # i++
        cmpl %edx,%eax           # if i<n repeat
        jl .L5
    .L3:
        movl -4(%ebp),%ebx       # finish
        movl %ebp, %esp
        popl %ebp
        ret                      # st(0) = result

Inner Product Stack Trace
%eax = i, %ebx = x, %ecx = y (x and y are the array base addresses)
Initialization
    1. fldz                     %st(0) = 0.0
Iteration 0
    2. flds (%ebx,%eax,4)       %st(1) = 0.0,        %st(0) = x[0]
    3. fmuls (%ecx,%eax,4)      %st(1) = 0.0,        %st(0) = x[0]*y[0]
    4. faddp                    %st(0) = 0.0+x[0]*y[0]
Iteration 1
    5. flds (%ebx,%eax,4)       %st(1) = x[0]*y[0],  %st(0) = x[1]
    6. fmuls (%ecx,%eax,4)      %st(1) = x[0]*y[0],  %st(0) = x[1]*y[1]
    7. faddp                    %st(0) = x[0]*y[0]+x[1]*y[1]

Machine Programming – x86-64 extensions CENG331: Introduction to Computer Systems Instructor: Erol Sahin Acknowledgement: Most of the slides are adapted from the ones prepared by R.E. Bryant, D.R. O’Hallaron of Carnegie-Mellon Univ.

x86-64 Integer Registers
  64-bit   Low 32 bits     64-bit   Low 32 bits
  %rax     %eax            %r8      %r8d
  %rbx     %ebx            %r9      %r9d
  %rcx     %ecx            %r10     %r10d
  %rdx     %edx            %r11     %r11d
  %rsi     %esi            %r12     %r12d
  %rdi     %edi            %r13     %r13d
  %rsp     %esp            %r14     %r14d
  %rbp     %ebp            %r15     %r15d
Twice the number of registers
Accessible as 8, 16, 32, 64 bits

x86-64 Integer Registers
  Register   Usage             Register   Usage
  %rax       Return value      %r8        Argument #5
  %rbx       Callee saved      %r9        Argument #6
  %rcx       Argument #4       %r10       Caller saved
  %rdx       Argument #3       %r11       Caller saved (used for linking)
  %rsi       Argument #2       %r12       Callee saved
  %rdi       Argument #1       %r13       Callee saved
  %rsp       Stack pointer     %r14       Callee saved
  %rbp       Callee saved      %r15       Callee saved

x86-64 Registers
Arguments passed to functions via registers
  If more than 6 integral parameters, then pass rest on stack
  These registers can be used as caller-saved as well
All references to stack frame via stack pointer
  Eliminates need to update %ebp/%rbp
Other Registers
  6 callee saved (%rbx, %rbp, %r12-%r15)
  2 or 3 have special uses
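To make the six-register limit concrete, here is a small sketch with eight integral parameters. The function names `sum8` and `call_sum8` are made up for illustration, and the register assignment in the comments assumes the Linux x86-64 convention from the register table (Argument #1 in %rdi through Argument #6 in %r9).

```c
/* Eight integral arguments. Assumed placement:
 *   a1 -> %rdi, a2 -> %rsi, a3 -> %rdx, a4 -> %rcx, a5 -> %r8, a6 -> %r9
 *   a7, a8 -> passed on the stack by the caller */
long sum8(long a1, long a2, long a3, long a4,
          long a5, long a6, long a7, long a8) {
    return a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8;
}

/* The caller fills the six argument registers and stores the
 * remaining two arguments on the stack before the call. */
long call_sum8(void) {
    return sum8(1, 2, 3, 4, 5, 6, 7, 8);
}
```

Compiling with `gcc -O1 -S` and reading the output is one way to check the placement; at higher optimization levels the compiler may inline the call and the argument passing disappears.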

x86-64 Long Swap

  void swap(long *xp, long *yp)
  {
      long t0 = *xp;
      long t1 = *yp;
      *xp = t1;
      *yp = t0;
  }

      movq (%rdi), %rdx
      movq (%rsi), %rax
      movq %rax, (%rdi)
      movq %rdx, (%rsi)
      ret

Operands passed in registers
  First (xp) in %rdi, second (yp) in %rsi
  64-bit pointers
No stack operations required (except ret)
Avoiding stack
  Can hold all local information in registers

x86-64 Locals in the Red Zone
  /* Swap, using local array */
  void swap_a(long *xp, long *yp)
  {
      volatile long loc[2];
      loc[0] = *xp;
      loc[1] = *yp;
      *xp = loc[1];
      *yp = loc[0];
  }

  swap_a:
      movq (%rdi), %rax
      movq %rax, -24(%rsp)
      movq (%rsi), %rax
      movq %rax, -16(%rsp)
      movq -16(%rsp), %rax
      movq %rax, (%rdi)
      movq -24(%rsp), %rax
      movq %rax, (%rsi)
      ret

Avoiding Stack Pointer Change
  Can hold all information within small window beyond stack pointer

  Offset from %rsp   Contents
    0  (%rsp)        rtn Ptr
   −8                unused
  −16                loc[1]
  −24                loc[0]

x86-64 NonLeaf without Stack Frame
  long scount = 0;

  /* Swap a[i] & a[i+1] */
  void swap_ele_se (long a[], int i)
  {
      swap(&a[i], &a[i+1]);
      scount++;
  }

  swap_ele_se:
      movslq %esi,%rsi              # Sign extend i
      leaq   (%rdi,%rsi,8), %rdi    # &a[i]
      leaq   8(%rdi), %rsi          # &a[i+1]
      call   swap                   # swap()
      incq   scount(%rip)           # scount++;
      ret

No values held while swap being invoked
No callee save registers needed

x86-64 Call using Jump

  long scount = 0;

  /* Swap a[i] & a[i+1] */
  void swap_ele(long a[], int i)
  {
      swap(&a[i], &a[i+1]);
  }

  swap_ele:
      movslq %esi,%rsi              # Sign extend i
      leaq   (%rdi,%rsi,8), %rdi    # &a[i]
      leaq   8(%rdi), %rsi          # &a[i+1]
      jmp    swap                   # swap()

When swap executes ret, it will return from swap_ele
Possible since swap is a “tail call” (no instructions afterwards)

x86-64 Stack Frame Example

  long sum = 0;

  /* Swap a[i] & a[i+1] */
  void swap_ele_su (long a[], int i)
  {
      swap(&a[i], &a[i+1]);
      sum += a[i];
  }

  swap_ele_su:
      movq   %rbx, -16(%rsp)
      movslq %esi,%rbx
      movq   %r12, -8(%rsp)
      movq   %rdi, %r12
      leaq   (%rdi,%rbx,8), %rdi
      subq   $16, %rsp
      leaq   8(%rdi), %rsi
      call   swap
      movq   (%r12,%rbx,8), %rax
      addq   %rax, sum(%rip)
      movq   (%rsp), %rbx
      movq   8(%rsp), %r12
      addq   $16, %rsp
      ret

Keeps values of a and i in callee save registers
Must set up stack frame to save these registers

Understanding x86-64 Stack Frame
  swap_ele_su:
      movq   %rbx, -16(%rsp)        # Save %rbx
      movslq %esi,%rbx              # Extend & save i
      movq   %r12, -8(%rsp)         # Save %r12
      movq   %rdi, %r12             # Save a
      leaq   (%rdi,%rbx,8), %rdi    # &a[i]
      subq   $16, %rsp              # Allocate stack frame
      leaq   8(%rdi), %rsi          # &a[i+1]
      call   swap                   # swap()
      movq   (%r12,%rbx,8), %rax    # a[i]
      addq   %rax, sum(%rip)        # sum += a[i]
      movq   (%rsp), %rbx           # Restore %rbx
      movq   8(%rsp), %r12          # Restore %r12
      addq   $16, %rsp              # Deallocate stack frame
      ret

Understanding x86-64 Stack Frame
  swap_ele_su:
      movq   %rbx, -16(%rsp)        # Save %rbx
      movslq %esi,%rbx              # Extend & save i
      movq   %r12, -8(%rsp)         # Save %r12
      movq   %rdi, %r12             # Save a
      leaq   (%rdi,%rbx,8), %rdi    # &a[i]
      subq   $16, %rsp              # Allocate stack frame
      leaq   8(%rdi), %rsi          # &a[i+1]
      call   swap                   # swap()
      movq   (%r12,%rbx,8), %rax    # a[i]
      addq   %rax, sum(%rip)        # sum += a[i]
      movq   (%rsp), %rbx           # Restore %rbx
      movq   8(%rsp), %r12          # Restore %r12
      addq   $16, %rsp              # Deallocate stack frame
      ret

  Stack before subq $16, %rsp        Stack after subq $16, %rsp
    Offset      Contents               Offset      Contents
      0 (%rsp)  rtn addr               +16         rtn addr
     −8         %r12                    +8         %r12
    −16         %rbx                     0 (%rsp)  %rbx

Interesting Features of Stack Frame
Allocate entire frame at once
  All stack accesses can be relative to %rsp
  Do by decrementing stack pointer
  Can delay allocation, since safe to temporarily use red zone
Simple deallocation
  Increment stack pointer
  No base/frame pointer needed

Interesting Features of Stack Frame
Many compiled functions do not require a stack frame other than saving their return address.
A function does not require a stack frame if:
  All local variables can be held in registers
  The function does not call other functions (such functions are referred to as leaf procedures)
A function requires a stack frame if it:
  Has too many local variables to hold in registers
  Has local variables that are arrays or structures
  Uses the & operator to compute the address of a local variable
  Must pass some arguments on the stack to another function
  Needs to save the state of a callee-save register
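A few small C functions illustrate the cases listed above. They are hypothetical examples, not taken from the lecture, and whether a given compiler actually allocates a frame depends on the optimization level (for instance, inlining can remove the need for one).

```c
/* Leaf function with few locals: typically needs no stack frame,
 * because t can live in a register. */
long leaf_sum(long a, long b) {
    long t = a + b;
    return t * 2;
}

/* Helper that writes through a pointer. */
static void store_twice(long *p, long v) {
    *p = 2 * v;
}

/* Taking &local means local must have a memory address, so it
 * generally ends up on the stack (unless the call is inlined). */
long addr_of_local(long v) {
    long local = 0;
    store_twice(&local, v);
    return local + 1;
}

/* A local array indexed at run time also needs addressable storage. */
long local_array(int i) {
    long buf[4] = {10, 20, 30, 40};
    return buf[i & 3];
}
```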

General Conditional Expression Translation
C Code
  val = Test ? Then-Expr : Else-Expr;
  Example: val = x>y ? x-y : y-x;

Test is expression returning integer
  = 0 interpreted as false
  ≠ 0 interpreted as true
Create separate code regions for then & else expressions
Execute appropriate one

Goto Version
      nt = !Test;
      if (nt) goto Else;
      val = Then-Expr;
  Done:
      . . .
  Else:
      val = Else-Expr;
      goto Done;
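Applied to the example `val = x>y ? x-y : y-x`, the goto form can be written as valid C. This is a sketch for illustration; the function name `absdiff_goto` is not from the slides.

```c
/* val = x > y ? x - y : y - x, written in the goto form. */
int absdiff_goto(int x, int y) {
    int val;
    int nt = !(x > y);      /* nt = !Test */
    if (nt) goto Else;
    val = x - y;            /* Then-Expr */
Done:
    return val;
Else:
    val = y - x;            /* Else-Expr */
    goto Done;
}
```

Only one of the two assignments to `val` runs on any call, which is the key difference from the conditional move translation.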

Conditionals: x86-64

  int absdiff( int x, int y)
  {
      int result;
      if (x > y) {
          result = x-y;
      } else {
          result = y-x;
      }
      return result;
  }

  absdiff:                  # x in %edi, y in %esi
      movl   %edi, %eax     # eax = x
      movl   %esi, %edx     # edx = y
      subl   %esi, %eax     # eax = x-y
      subl   %edi, %edx     # edx = y-x
      cmpl   %esi, %edi     # x:y
      cmovle %edx, %eax     # eax=edx if <=
      ret

Conditional move instruction
  cmovC src, dest
  Move value from src to dest if condition C holds
  More efficient than conditional branching (simple control flow)
  But overhead: both branches are evaluated

General Form with Conditional Move
C Code
  val = Test ? Then-Expr : Else-Expr;

Conditional Move Version
  val1 = Then-Expr;
  val2 = Else-Expr;
  val1 = val2 if !Test;

Both values get computed
Overwrite then-value with else-value if condition doesn’t hold
Don’t use when:
  Then or else expression has side effects
  Then and else expressions are too expensive
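The sketch below writes the conditional move form out in C for the absolute-difference example, followed by two hypothetical functions (`deref_or_zero` and `bump_if_positive`, not from the slides) where evaluating both arms up front would be wrong, so a branch is the appropriate translation.

```c
#include <stddef.h>

/* Conditional move form: both candidate values are computed,
 * then one is selected. */
int absdiff_cmov(int x, int y) {
    int val1 = x - y;       /* Then-Expr */
    int val2 = y - x;       /* Else-Expr */
    if (!(x > y))           /* val1 = val2 if !Test */
        val1 = val2;
    return val1;
}

/* Unsafe to evaluate both arms: *p must not be read when p is NULL. */
int deref_or_zero(const int *p) {
    return p ? *p : 0;
}

/* Side effect in one arm: the counter may only change when x > 0. */
int bump_if_positive(int x, int *counter) {
    return x > 0 ? ++(*counter) : 0;
}
```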

Specific Cases of Alignment (x86-64)
1 byte: char, …
  no restrictions on address
2 bytes: short, …
  lowest 1 bit of address must be 0 (binary)
4 bytes: int, float, …
  lowest 2 bits of address must be 00 (binary)
8 bytes: double, char *, …
  Windows & Linux: lowest 3 bits of address must be 000 (binary)
16 bytes: long double
  Linux: lowest 4 bits of address must be 0000 (binary), i.e., 16-byte aligned under the System V AMD64 ABI
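These rules show up as padding inside structures. The following sketch uses a made-up `struct Mixed` and a hypothetical helper `is_aligned`; the offsets in the comments assume a typical x86-64 Linux compiler such as GCC.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout on x86-64 Linux:
 *   c at offset 0, 3 bytes padding, i at 4, d at 8, s at 16,
 *   6 bytes tail padding, total size 24. */
struct Mixed {
    char   c;   /* 1 byte, any address */
    int    i;   /* 4 bytes, address must be a multiple of 4 */
    double d;   /* 8 bytes, address must be a multiple of 8 */
    short  s;   /* 2 bytes, address must be a multiple of 2 */
};

/* Nonzero if addr is a multiple of alignment (a power of two),
 * i.e. the low log2(alignment) bits of the address are all 0. */
int is_aligned(const void *addr, size_t alignment) {
    return ((uintptr_t)addr & (alignment - 1)) == 0;
}
```

The tail padding keeps every element of an array of `struct Mixed` aligned for its `double` member.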

Vector Instructions: SSE Family
SIMD (single-instruction, multiple data) vector instructions
  New data types, registers, operations
  Parallel operation on small (length 2-8) vectors of integers or floats
  Example: [Figure: “4-way” element-wise addition and multiplication of two vectors]
Floating point vector instructions
  Available with Intel’s SSE (streaming SIMD extensions) family
  SSE starting with Pentium III: 4-way single precision
  SSE2 starting with Pentium 4: 2-way double precision
  All x86-64 processors support SSE2; nearly all also support SSE3 (superset of SSE2, SSE)

SSE3 Registers
All caller saved
%xmm0 for floating point return value
128 bit = 2 doubles = 4 singles

  Register   Usage           Register
  %xmm0      Argument #1     %xmm8
  %xmm1      Argument #2     %xmm9
  %xmm2      Argument #3     %xmm10
  %xmm3      Argument #4     %xmm11
  %xmm4      Argument #5     %xmm12
  %xmm5      Argument #6     %xmm13
  %xmm6      Argument #7     %xmm14
  %xmm7      Argument #8     %xmm15
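A short sketch of how floating point and integer arguments are assigned independently. The function `scale_add` is hypothetical, and the register placement in the comment assumes the Linux x86-64 convention shown in the register tables.

```c
/* Mixed integer and floating point arguments. Assumed placement:
 *   a -> %xmm0 (floating point argument #1)
 *   n -> %rdi  (integer argument #1)
 *   b -> %xmm1 (floating point argument #2)
 * The double result is returned in %xmm0. */
double scale_add(double a, long n, double b) {
    return a * (double)n + b;
}
```

The integer and floating point arguments are numbered separately, so `b` is the second floating point argument even though it is the third parameter.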

SSE3 Registers
Different data types and associated instructions (each register is 128 bits)
  Integer vectors:
    16-way byte
    8-way 2 bytes
    4-way 4 bytes
  Floating point vectors:
    4-way single
    2-way double
  Floating point scalars (stored in the least significant bits):
    single
    double

SSE3 Instructions: Examples
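Single precision 4-way vector add: addps %xmm0, %xmm1
  Each of the four lanes of %xmm1 becomes %xmm0 + %xmm1 for that lane
Single precision scalar add: addss %xmm0, %xmm1
  Only the lowest lane of %xmm1 becomes %xmm0 + %xmm1; the other lanes of %xmm1 are unchanged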
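The difference between the two instructions can be modeled in portable C. This is only a model of the semantics, not real SSE code: the type `Xmm` and the function names are hypothetical, and lane 0 stands for the least significant 32 bits of the register.

```c
/* Model of one 128-bit XMM register as four single precision lanes. */
typedef struct {
    float lane[4];
} Xmm;

/* addps src, dst: dst[i] = dst[i] + src[i] for all four lanes. */
void addps_model(const Xmm *src, Xmm *dst) {
    for (int i = 0; i < 4; i++)
        dst->lane[i] += src->lane[i];
}

/* addss src, dst: only lane 0 is added; lanes 1-3 of dst are unchanged. */
void addss_model(const Xmm *src, Xmm *dst) {
    dst->lane[0] += src->lane[0];
}
```

In AT&T syntax the second operand is the destination, so `addps %xmm0, %xmm1` corresponds to `addps_model(&xmm0, &xmm1)`.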

Extending to x86-64
Pointers and long ints are 64 bits long.
Integer arithmetic operations support 8, 16, 32 and 64-bit data types.
The set of general purpose registers is expanded from 8 to 16.
Much of the program state is held in registers rather than on the stack.
  Integer and pointer arguments (up to 6) to procedures are passed via registers.
  Some procedures do not need to access the stack at all.
Conditional operations are implemented using conditional move instructions when possible, yielding better performance than traditional branching.
Floating point operations are implemented using register-oriented SSE2, rather than stack-based x87.

Procedures (x86-64): Optimizations
No base/frame pointer
Passing arguments to functions through registers (if possible)
Sometimes: Writing into the “red zone” (below stack pointer)
Sometimes: Function call using jmp (instead of call)
Reason: Performance
  use stack as little as possible
  while obeying rules (e.g., caller/callee save registers)

  Offset from %rsp   Contents
    0  (%rsp)        rtn Ptr
   −8                unused
  −16                loc[1]
  −24                loc[0]
