
Presentation transcript:

III:1 X86 Assembly – Buffer Overflow

III:2 Administrivia
HW3: skip the farmer game portion (moved to HW5)
  Due today
HW4: endian swap in ASM and C
Quiz2: please do your corrections in RED
  Do not erase original answers
Quiz3: due next Monday
How am I doing?
Please know that there is no way we will cover everything you need to know in the book. So read.

III:3 IA32 Linux Memory Layout
Stack
  Runtime stack (8MB limit)
Heap
  Dynamically allocated storage
  Allocated by calls to malloc(), calloc(), or C++ new
Data
  Statically allocated data
  E.g., arrays & strings declared in code
Text
  Executable machine instructions
  Read-only
[Figure: address space from FF (top) to 00 (bottom) with Stack (8MB), Heap, Data, and Text regions; Text starts at 08; labels are the upper 2 hex digits = 8 bits of address; not drawn to scale]

III:4 Memory Allocation Example
    #include <stdlib.h>      /* for malloc() */

    char big_array[1<<24];   /* 16 MB */
    char huge_array[1<<28];  /* 256 MB */
    int beyond;
    char *p1, *p2, *p3, *p4;

    int useless() { return 0; }

    int main()
    {
        p1 = malloc(1 << 28);  /* 256 MB */
        p2 = malloc(1 << 8);   /* 256 B */
        p3 = malloc(1 << 28);  /* 256 MB */
        p4 = malloc(1 << 8);   /* 256 B */
        /* Some print statements... */
    }
Where does everything go?
[Figure: address space from FF to 00 with Stack, Heap, Data, and Text regions; Text starts at 08; not drawn to scale]

III:5 IA32 Example Addresses
Address range ~2^32
    $esp            0xffffbcd0
    p3              0x
    p1              0x
    p4              0x1904a110
    p2              0x1904a008
    &p2             0x
    beyond          0x
    big_array       0x
    huge_array      0x
    main()          0x080483c6
    useless()       0x
    final malloc()  0x006be166
malloc() is dynamically linked; its address is determined at runtime.
[Figure: address space from FF to 00 with Stack, Heap, Data, and Text regions; not drawn to scale]

III:6 How about some pointers to deal with the new boss? Sure. Try 0x0000A4F5, 0x00008EEF and 0x0000B32A. Pointers

III:7 C operators
-> has very high precedence
() has very high precedence
monadic * just below

    Operators                            Associativity
    () [] -> .                           left to right
    ! ~ ++ -- + - * & (type) sizeof      right to left
    * / %                                left to right
    + -                                  left to right
    << >>                                left to right
    < <= > >=                            left to right
    == !=                                left to right
    &                                    left to right
    ^                                    left to right
    |                                    left to right
    &&                                   left to right
    ||                                   left to right
    ?:                                   right to left
    = += -= *= /= %= &= ^= |= <<= >>=    right to left
    ,                                    left to right

In the second row, + - * & are the unary (monadic) forms.
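To see how the top rows of the precedence table affect pointer code, here is a minimal sketch. The helper names (read_and_advance, bump_target, first_of) and the struct are made up for illustration and are not from the slides.

```c
/* Illustration only: all names here are hypothetical. */

/* *p++ parses as *(p++): postfix ++ binds tighter than unary *.
   Returns the current element and advances the caller's pointer. */
static int read_and_advance(int **pp) {
    int *p = *pp;
    int value = *p++;   /* read *p, then move p to the next element */
    *pp = p;
    return value;
}

/* (*p)++ increments the pointed-to int and leaves the pointer alone.
   Returns the value before the increment. */
static int bump_target(int *p) {
    return (*p)++;
}

struct node { int *data; };

/* *n->data parses as *(n->data): -> binds tighter than unary *. */
static int first_of(struct node *n) {
    return *n->data;
}
```

With int a[2] = {10, 20} and p pointing at a[0], read_and_advance(&p) yields 10 and leaves p at a[1]; bump_target(p) then yields 20 and changes a[1] to 21.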

III:8 C Pointer Declarations: Test Yourself!
    int *p                 p is a pointer to int
    int *p[13]             p is an array[13] of pointer to int
    int *(p[13])           p is an array[13] of pointer to int
    int **p                p is a pointer to a pointer to an int
    int (*p)[13]           p is a pointer to an array[13] of int
    int *f()               f is a function returning a pointer to int
    int (*f)()             f is a pointer to a function returning int
    int (*(*f())[13])()    f is a function returning ptr to an array[13] of pointers to functions returning int
    int (*(*x[3])())[5]    x is an array[3] of pointers to functions returning pointers to array[5] of ints
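The difference between int *p[13] and int (*p)[13] can be checked with sizeof. The following is a small sketch; the typedef and function names are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>

/* Illustration only: the names below are hypothetical. */

typedef int *ptr_array_t[13];     /* same shape as: int *p[13]   */
typedef int (*array_ptr_t)[13];   /* same shape as: int (*p)[13] */

/* Size of an array of 13 pointers to int. */
static size_t size_of_ptr_array(void) {
    return sizeof(ptr_array_t);
}

/* Size of what a pointer-to-array[13] points at: 13 ints. */
static size_t size_of_pointed_array(void) {
    array_ptr_t p = NULL;
    return sizeof(*p);   /* sizeof does not evaluate *p here */
}

/* Sum one row through a pointer to array[13] of int. */
static int sum_row(int (*row)[13]) {
    int total = 0;
    for (int i = 0; i < 13; i++)
        total += (*row)[i];   /* parentheses needed: [] binds tighter than * */
    return total;
}
```

For a matrix declared as int m[2][13], the expression &m[1] has type int (*)[13] and can be passed straight to sum_row.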


III:14 Avoiding Complex Declarations
Use typedef to build up the declaration
Instead of int (*(*x[3])())[5]:
    typedef int fiveints[5];
    typedef fiveints* p5i;
    typedef p5i (*f_of_p5is)();
    f_of_p5is x[3];
x is an array of 3 elements, each of which is a pointer to a function returning a pointer to an array of 5 ints
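A short sketch of the typedef chain in use. The typedef names are the slide's; table, get_table, and element are hypothetical additions, and the function type is written with (void) rather than the slide's empty parentheses so that it is a full prototype.

```c
typedef int fiveints[5];
typedef fiveints *p5i;            /* pointer to array[5] of int */
typedef p5i (*f_of_p5is)(void);   /* pointer to function returning p5i */

/* Hypothetical data and function to populate the array. */
static int table[5] = {1, 2, 3, 4, 5};

static p5i get_table(void) {
    return &table;                /* address of the whole array */
}

/* Same type as: int (*(*x[3])(void))[5] */
static f_of_p5is x[3] = { get_table, get_table, get_table };

/* Call the function at x[which], then index into the array it points to. */
static int element(int which, int index) {
    return (*x[which]())[index];
}
```

Reading element's return expression inside out mirrors the declaration: index x, call the function, dereference the pointer to the array, then index the array.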

III:15 Internet Worm and IM War
November 1988
  Internet Worm attacks thousands of Internet hosts.
  How did it happen?
July 1999
  Microsoft launches MSN Messenger (instant messaging system).
  Messenger clients can access popular AOL Instant Messaging Service (AIM) servers.
[Figure: AIM server, AIM clients, MSN client, MSN server]

III:16 Internet Worm and IM War (cont.)
August 1999
  Mysteriously, Messenger clients can no longer access AIM servers.
  Microsoft and AOL begin the IM war:
    AOL changes server to disallow Messenger clients
    Microsoft makes changes to clients to defeat AOL changes.
    At least 13 such skirmishes.
  How did it happen?
The Internet Worm and AOL/Microsoft War were both based on stack buffer overflow exploits!
  Many Unix functions do not check argument sizes.
  This allows target buffers to overflow.

III:17 String Library Code
Implementation of Unix function gets()
  No way to specify limit on number of characters to read
    /* Get string from stdin */
    char *gets(char *dest)
    {
        int c = getchar();
        char *p = dest;
        while (c != EOF && c != '\n') {
            *p++ = c;
            c = getchar();
        }
        *p = '\0';
        return dest;
    }
Similar problems with other Unix functions
  strcpy: copies string of arbitrary length
  scanf, fscanf, sscanf, when given %s conversion specification
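For contrast with the gets() loop above, here is a sketch of the same loop with a size limit added. gets_bounded is a hypothetical helper, not a standard library function; it takes the destination size and the input stream as extra parameters and discards characters that do not fit.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: same loop as gets(), but never writes more than
   size bytes (including the terminating '\0') into dest. */
static char *gets_bounded(char *dest, size_t size, FILE *in) {
    if (size == 0)
        return NULL;              /* no room even for the terminator */
    size_t used = 0;
    int c = getc(in);
    while (c != EOF && c != '\n') {
        if (used + 1 < size)
            dest[used++] = (char)c;   /* store only while space remains */
        c = getc(in);                 /* excess characters are dropped */
    }
    dest[used] = '\0';
    return dest;
}
```

Truncating silently is one design choice among several; a caller that must detect over-long lines would need an extra return value or flag.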

III:18 Vulnerable Buffer Code
    int main()
    {
        printf("Type a string:");
        echo();
        return 0;
    }

    /* Echo Line */
    void echo()
    {
        char buf[4];  /* Way too small! */
        gets(buf);
        puts(buf);
    }

    unix>./bufdemo
    Type a string:
    unix>./bufdemo
    Type a string:
    Segmentation Fault
    unix>./bufdemo
    Type a string: ABC
    Segmentation Fault

III:19 Buffer Overflow Disassembly
    f0 :
    80484f0: 55              push %ebp
    80484f1: 89 e5           mov %esp,%ebp
    80484f3: 53              push %ebx
    80484f4: 8d 5d f8        lea 0xfffffff8(%ebp),%ebx
    80484f7: 83 ec 14        sub $0x14,%esp
    80484fa: 89 1c 24        mov %ebx,(%esp)
    80484fd: e8 ae ff ff ff  call 80484b
    :        89 1c 24        mov %ebx,(%esp)
    :        e8 8a fe ff ff  call
    a:       83 c4 14        add $0x14,%esp
    d:       5b              pop %ebx
    e:       c9              leave
    f:       c3              ret

    80485f2: e8 f9 fe ff ff  call 80484f
    f7:      8b 5d fc        mov 0xfffffffc(%ebp),%ebx
    80485fa: c9              leave
    80485fb: 31 c0           xor %eax,%eax
    80485fd: c3              ret

III:20 Buffer Overflow Stack
    /* Echo Line */
    void echo()
    {
        char buf[4];  /* Way too small! */
        gets(buf);
        puts(buf);
    }

    echo:
        pushl %ebp            # Save %ebp on stack
        movl %esp, %ebp
        pushl %ebx            # Save %ebx
        leal -8(%ebp),%ebx    # Compute buf as %ebp-8
        subl $20, %esp        # Allocate stack space
        movl %ebx, (%esp)     # Push buf on stack
        call gets             # Call gets
        ...
[Figure: before call to gets. Stack frame for main on top; the stack frame for echo holds the return address, the saved %ebp (where %ebp points), and buf as bytes [3][2][1][0]]

III:21 Buffer Overflow Stack Example
    unix> gdb bufdemo
    (gdb) break echo
    Breakpoint 1 at 0x
    (gdb) run
    Breakpoint 1, 0x in echo ()
    (gdb) print /x $ebp
    $1 = 0xffffc638
    (gdb) print /x *(unsigned *)$ebp
    $2 = 0xffffc658
    (gdb) print /x *((unsigned *)$ebp + 1)
    $3 = 0x80485f

    f2: call 80484f
    f7: mov 0xfffffffc(%ebp),%ebx   # Return Point
[Figure: before call to gets. Stack frames for main and echo; %ebp = 0xffffc638 points at the saved %ebp (0xffffc658) with the return address above it; the bytes of buf are not yet set (xx)]

III:22 Buffer Overflow Example #1
Overflow buf, but no problem
[Figure: stack frames for main and echo (0xffffc658, 0xffffc638), shown before the call to gets and after the input is read]

III:23 Buffer Overflow Example #2
Base pointer corrupted
[Figure: stack frames for main and echo (0xffffc658, 0xffffc638), shown before the call to gets and after the input is read]
    a: 83 c4 14  add $0x14,%esp   # deallocate space
    d: 5b        pop %ebx         # restore %ebx
    e: c9        leave            # movl %ebp, %esp; popl %ebp
    f: c3        ret              # Return

III:24 Buffer Overflow Example #3
Return address corrupted
[Figure: stack frames for main and echo (0xffffc658, 0xffffc638), shown before the call to gets and after the input is read]
    f2: call 80484f
    f7: mov 0xfffffffc(%ebp),%ebx   # Return Point

III:25 Malicious Use of Buffer Overflow
    void foo(){
        bar();
        ...
    }

    int bar() {
        char buf[64];
        gets(buf);
        ...
        return ...;
    }
Input string contains byte representation of executable code
Overwrite return address with address of buffer
When bar() executes ret, will jump to exploit code
[Figure: stack after call to gets(). The foo stack frame holds return address A; in the bar stack frame, the data written by gets() is the exploit code (starting at address B), pad, and B in place of the return address]

III:26 Exploits Based on Buffer Overflows
Buffer overflow bugs allow remote machines to execute arbitrary code on victim machines
Internet worm
  Early versions of the finger server (fingerd) used gets() to read the argument sent by the client:
    finger
  Worm attacked fingerd server by sending phony argument:
    finger "exploit-code padding new-return-address"
  Exploit code: executed a root shell on the victim machine with a direct TCP connection to the attacker.

III:27 Exploits Based on Buffer Overflows
Buffer overflow bugs allow remote machines to execute arbitrary code on victim machines
IM War
  AOL exploited existing buffer overflow bug in AIM clients
  Exploit code: returned 4-byte signature (the bytes at some location in the AIM client) to server.
  When Microsoft changed code to match signature, AOL changed signature location.

III:28 Example Virus: Code Red Worm
History
  June 18, 2001: Microsoft announces buffer overflow vulnerability in IIS Internet server
  July 19, 2001: over 250,000 machines infected by new virus in 9 hours
  White House must change its IP address. Pentagon shut down public WWW servers for a day
When CS:APP Web Site was set up
  Received strings of form
    GET /default.ida?NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN....NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u9090%u8190%u00c3%u0003%u8b00%u531b%u53ff%u0078%u0000%u00=a HTTP/1.0" "-" "-"

III:29 Code Red Exploit Code
Starts 100 threads running
Spread self
  Generate random IP addresses & send attack string
  Between 1st & 19th of month
Attack
  Send 98,304 packets; sleep for 4-1/2 hours; repeat
    Denial of service attack
  Between 21st & 27th of month
Deface server's home page
  After waiting 2 hours

III:30 Avoiding Overflow Vulnerability
    /* Echo Line */
    void echo()
    {
        char buf[4];  /* Way too small! */
        fgets(buf, 4, stdin);
        puts(buf);
    }
Use library routines that limit string lengths
  fgets instead of gets
  strncpy instead of strcpy
  Don't use scanf with %s conversion specification
    Use fgets to read the string
    Or use %ns where n is a suitable integer
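The strncpy and %ns suggestions can be sketched as follows. copy_bounded and parse_word4 are hypothetical helper names; the sketch assumes a 4-byte destination like the slide's buf, so the %s width is 3 to leave room for the terminator.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy at most size-1 characters and always terminate.
   strncpy alone does not write a '\0' when the source is too long. */
static void copy_bounded(char *dest, size_t size, const char *src) {
    if (size == 0)
        return;
    strncpy(dest, src, size - 1);
    dest[size - 1] = '\0';
}

/* Hypothetical helper: read one whitespace-delimited word into a 4-byte
   buffer. The width 3 in %3s caps the characters stored; sscanf then
   appends the '\0' as the fourth byte. Returns the number of conversions. */
static int parse_word4(const char *input, char out[4]) {
    return sscanf(input, "%3s", out);
}
```

Both helpers truncate over-long input rather than rejecting it, which may or may not be what a given program wants.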

III:31 System-Level Protections
Randomized stack offsets
  At start of program, allocate random amount of space on stack
  Makes it difficult for hacker to predict beginning of inserted code
    unix> gdb bufdemo
    (gdb) break echo
    (gdb) run
    (gdb) print /x $ebp
    $1 = 0xffffc638
    (gdb) run
    (gdb) print /x $ebp
    $2 = 0xffffbb08
    (gdb) run
    (gdb) print /x $ebp
    $3 = 0xffffc6a8
Non-executable code segments
  In traditional x86, segments are either "read-only" or "writeable"
  Can execute anything readable
  Better: add explicit "execute" permission
Stack corruption detection
  Compiler inserts "canary" just beyond array and code to check it before function return
  Code aborts with error if canary changed
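The canary idea can be imitated in plain C. This is only a simulation under stated assumptions, not what the compiler actually emits: the struct, the CANARY_VALUE constant, and write_and_check are all hypothetical, and the sketch assumes unsigned is 4 bytes so the canary sits directly after the 4-byte buffer.

```c
#include <stddef.h>
#include <string.h>

#define CANARY_VALUE 0xdeadbeefu   /* arbitrary marker chosen for this sketch */

/* Simulated frame: a small buffer with a guard word placed just beyond it. */
struct guarded_frame {
    char buf[4];
    unsigned canary;
};

/* Copies len bytes to the start of the frame without checking them against
   sizeof buf, then reports whether the guard word survived.
   Returns 1 if the canary is intact, 0 if it was overwritten. */
static int write_and_check(const char *src, size_t len) {
    struct guarded_frame frame;
    frame.canary = CANARY_VALUE;
    if (len > sizeof frame)
        len = sizeof frame;        /* keep the simulation inside the struct */
    memcpy(&frame, src, len);      /* bytes past buf[3] land on the canary */
    return frame.canary == CANARY_VALUE;
}
```

A write of 4 bytes or fewer leaves the canary intact; anything longer changes it, which is the condition real stack protectors check before the function returns.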

III:32 Web-aside: Assembly & C
Requirements to have asm + C integrated together?
  In assembly, follow register usage, parameter passing, and return value conventions
Problems with inline assembly
  Not portable
  Knowing what registers are safe to use
Advantages of GCC's extended asm
  You inform compiler of source values required, results your code generates, and registers that will be overwritten
  GCC generates code to correctly set up source values, execute desired instructions, and correctly use computed results
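As a small sketch of GCC's extended asm, here is an endian swap built on the x86 bswap instruction. swap_endian is a hypothetical name; the asm path is only compiled on GCC-compatible compilers targeting x86, and a plain C fallback covers everything else.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit value. */
static uint32_t swap_endian(uint32_t v) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r"(v) tells the compiler that v lives in a general register of its
       choosing and is both read and written by the instruction. No other
       registers are touched, so the clobber list is empty. */
    __asm__("bswap %0" : "+r"(v));
    return v;
#else
    /* Portable fallback with shifts and masks. */
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
#endif
}
```

Because the operand constraint describes the register use, the compiler picks the register and moves v in and out as needed, which is the advantage the slide attributes to extended asm.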

III:33 Administrivia
Quiz 3: due Feb 7
Quiz 4: due Feb 14
Quiz 5: due Feb 21
Midterm: available after class on Feb 28
  No class Mar 1, work on the exam
  Due 5PM Monday Mar 5
  Take home, open book, open notes, open lecture slides
  NO COLLABORATION

III:34 BONUS SLIDES

III:35 IA32 Floating Point (x87)
History
  8086: first computer to implement IEEE FP
    separate 8087 FPU (floating point unit)
  486: merged FPU and Integer Unit onto one chip
  Becoming obsolete with x86-64
Summary
  Hardware to add, multiply, and divide
  Floating point data registers
  Various control & status registers
Floating Point Formats
  single precision (C float): 32 bits
  double precision (C double): 64 bits
  extended precision (C long double): 80 bits
[Figure: instruction decoder and sequencer feeding the FPU and the Integer Unit, both connected to Memory]

III:36 FPU Data Register Stack (x87)
FPU register format (80 bit extended precision)
    s | exp | frac
FPU registers
  8 registers %st(0) - %st(7)
  Logically form stack
  Top: %st(0)
  Bottom disappears (drops out) after too many pushes
[Figure: register stack with "Top" at %st(0), followed by %st(1), %st(2), %st(3)]

III:37 FPU instructions (x87)
Large number of floating point instructions and formats
  ~50 basic instruction types
  load, store, add, multiply
  sin, cos, tan, arctan, and log
    Often slower than math lib (!)
Sample instructions:
    Instruction   Effect                           Description
    fldz          push 0.0                         Load zero
    flds Addr     push Mem[Addr]                   Load single precision real
    fmuls Addr    %st(0) ← %st(0) * M[Addr]        Multiply
    faddp         %st(1) ← %st(0) + %st(1); pop    Add and pop

III:38 FP Code Example (x87)
Compute inner product of two vectors
  Single precision arithmetic
  Common computation
    float ipf (float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++)
            result += x[i]*y[i];
        return result;
    }

        pushl %ebp               # setup
        movl %esp,%ebp
        pushl %ebx
        movl 8(%ebp),%ebx        # %ebx=&x
        movl 12(%ebp),%ecx       # %ecx=&y
        movl 16(%ebp),%edx       # %edx=n
        fldz                     # push +0.0
        xorl %eax,%eax           # i=0
        cmpl %edx,%eax           # if i>=n done
        jge .L3
    .L5:
        flds (%ebx,%eax,4)       # push x[i]
        fmuls (%ecx,%eax,4)      # st(0)*=y[i]
        faddp                    # st(1)+=st(0); pop
        incl %eax                # i++
        cmpl %edx,%eax           # if i<n repeat
        jl .L5
    .L3:
        movl -4(%ebp),%ebx       # finish
        movl %ebp, %esp
        popl %ebp
        ret                      # st(0) = result

III:39 Inner Product Stack Trace
eax = i, ebx = *x, ecx = *y
Initialization
  1. fldz                   %st(0) = 0.0
Iteration 0
  2. flds (%ebx,%eax,4)     %st(1) = 0.0, %st(0) = x[0]
  3. fmuls (%ecx,%eax,4)    %st(1) = 0.0, %st(0) = x[0]*y[0]
  4. faddp                  %st(0) = 0.0+x[0]*y[0]
Iteration 1
  5. flds (%ebx,%eax,4)     %st(1) = x[0]*y[0], %st(0) = x[1]
  6. fmuls (%ecx,%eax,4)    %st(1) = x[0]*y[0], %st(0) = x[1]*y[1]
  7. faddp                  %st(0) = x[0]*y[0]+x[1]*y[1]

III:40 x87 Floating Point Problems
One of "least elegant features" of x86
  Remnants of original co-processor design
  Compiler technology is much better today
Stack registers are problematic
  All must be treated as caller-saved, too much memory traffic
  Can't treat as true stack over multiple function calls
Separate FP status word is problematic
  Set by FP comparison instructions, outcome used by branches
  Must transfer to int word, test bits explicitly
"Even experienced programmers find this code arcane and difficult to read."

III:41 SSE Floating Point Architecture
Better compiler target than x87
Includes new registers and instructions
In Intel CPUs since Pentium III
Default for x86-64, but not IA32

III:42 Vector Instructions: SSE Family
SIMD (single-instruction, multiple data) vector instructions
  New data types, registers, operations
  Parallel operation on small (length 2-8) vectors of integers or floats
  Example: [Figure: "4-way" vector add (+) and multiply (x)]
Floating point vector instructions
  Available with Intel's SSE (streaming SIMD extensions) family
  SSE starting with Pentium III: 4-way single precision
  SSE2 starting with Pentium 4: 2-way double precision
  All x86-64 have SSE3 (superset of SSE2, SSE)

III:43 Intel Architectures (Focus Floating Point)
[Figure: timeline in three columns]
  Architectures: x86-16, x86-32, x86-64 / em64t
  Processors: Pentium, Pentium MMX, Pentium III, Pentium 4, Pentium 4E, Pentium 4F, Core 2 Duo
  Features: MMX, SSE (4-way single precision FP), SSE2 (2-way double precision FP), SSE3, SSE
Our focus: SSE3 used for scalar (non-vector) floating point

III:44 SSE3 Registers
All caller saved
%xmm0 for floating point return value
128 bit = 2 doubles = 4 singles
    %xmm0    Argument #1
    %xmm1    Argument #2
    %xmm2    Argument #3
    %xmm3    Argument #4
    %xmm4    Argument #5
    %xmm5    Argument #6
    %xmm6    Argument #7
    %xmm7    Argument #8
    %xmm8 - %xmm15

III:45 SSE3 Registers
Different data types and associated instructions (128 bit registers)
Integer vectors:
  16-way byte
  8-way 2 bytes
  4-way 4 bytes
Floating point vectors:
  4-way single
  2-way double
Floating point scalars (held at the LSB end of the register):
  single
  double

III:46 SSE3 Instructions: Examples
Single precision 4-way vector add: addps %xmm0, %xmm1
  [Figure: each of the four single-precision slots of %xmm0 is added to the matching slot of %xmm1]
Single precision scalar add: addss %xmm0, %xmm1
  [Figure: only the lowest slot of %xmm0 is added to the lowest slot of %xmm1]

III:47 SSE3 Instruction Names
                        packed (vector)    single slot (scalar)
    single precision    addps              addss
    double precision    addpd              addsd
Our focus: single slot (scalar)

III:48 SSE3 Basic Instructions
Moves
  Usual operand form: reg → reg, reg → mem, mem → reg
    Single   Double   Effect
    movss    movsd    D ← S
Arithmetic
    Single   Double   Effect
    addss    addsd    D ← D + S
    subss    subsd    D ← D - S
    mulss    mulsd    D ← D x S
    divss    divsd    D ← D / S
    maxss    maxsd    D ← max(D,S)
    minss    minsd    D ← min(D,S)
    sqrtss   sqrtsd   D ← sqrt(S)

III:49 x86-64 FP Code Example
Compute inner product of two vectors
  Single precision arithmetic
  Uses SSE3 instructions
    float ipf (float x[], float y[], int n)
    {
        int i;
        float result = 0.0;
        for (i = 0; i < n; i++)
            result += x[i]*y[i];
        return result;
    }

    ipf:
        xorps %xmm1, %xmm1           # result = 0.0
        xorl %ecx, %ecx              # i = 0
        jmp .L8                      # goto middle
    .L10:                            # loop:
        movslq %ecx,%rax             # icpy = i
        incl %ecx                    # i++
        movss (%rsi,%rax,4), %xmm0   # t = y[icpy]
        mulss (%rdi,%rax,4), %xmm0   # t *= x[icpy]
        addss %xmm0, %xmm1           # result += t
    .L8:                             # middle:
        cmpl %edx, %ecx              # i:n
        jl .L10                      # if < goto loop
        movaps %xmm1, %xmm0          # return result
        ret

III:50 SSE3 Conversion Instructions
Conversions
  Same operand forms as moves
    Instruction   Description
    cvtss2sd      single → double
    cvtsd2ss      double → single
    cvtsi2ss      int → single
    cvtsi2sd      int → double
    cvtsi2ssq     quad int → single
    cvtsi2sdq     quad int → double
    cvttss2si     single → int (truncation)
    cvttsd2si     double → int (truncation)
    cvttss2siq    single → quad int (truncation)
    cvttsd2siq    double → quad int (truncation)
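In C these conversions are written as casts. The sketch below shows the truncation behavior; the function names are hypothetical, and the instruction named in each comment is what an x86-64 compiler would typically choose, which is an assumption about code generation rather than a guarantee.

```c
/* Hypothetical wrappers around C casts. */

/* double -> int discards the fraction (rounds toward zero);
   typically compiled to cvttsd2si. */
static int to_int_truncating(double d) {
    return (int)d;
}

/* int -> double; typically compiled to cvtsi2sd. */
static double to_double(int i) {
    return (double)i;
}

/* float -> double; typically compiled to cvtss2sd. */
static double widen(float f) {
    return (double)f;
}
```

Truncation means 2.9 becomes 2 and -2.9 becomes -2, not -3.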

III:51 x86-64 FP Code Example
    double funct(double a, float x, double b, int i)
    {
        return a*x - b/i;
    }

    Argument   Register   Type
    a          %xmm0      double
    x          %xmm1      float
    b          %xmm2      double
    i          %edi       int

    funct:
        cvtss2sd %xmm1, %xmm1   # %xmm1 = (double) x
        mulsd %xmm0, %xmm1      # %xmm1 = a*x
        cvtsi2sd %edi, %xmm0    # %xmm0 = (double) i
        divsd %xmm0, %xmm2      # %xmm2 = b/i
        movsd %xmm1, %xmm0      # %xmm0 = a*x
        subsd %xmm2, %xmm0      # return a*x - b/i
        ret

III:52 Constants
    double cel2fahr(double temp)
    {
        return 1.8 * temp + 32.0;
    }

    # Constant declarations
    .LC2:
        .long          # Low order four bytes of 1.8
        .long          # High order four bytes of 1.8
    .LC4:
        .long 0        # Low order four bytes of 32.0
        .long          # High order four bytes of 32.0
    # Code
    cel2fahr:
        mulsd .LC2(%rip), %xmm0   # Multiply by 1.8
        addsd .LC4(%rip), %xmm0   # Add 32.0
        ret
Here: constants in decimal format
  Compiler decision
  Hex more readable
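The two .long words of a double constant can be computed by splitting its bit pattern. This is a sketch assuming double is a 64-bit IEEE 754 value; split_double is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: split a double's 64-bit pattern into the low-order
   and high-order 32-bit words, as the compiler does for .long constants. */
static void split_double(double d, uint32_t *low, uint32_t *high) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);            /* reinterpret without aliasing issues */
    *low  = (uint32_t)(bits & 0xffffffffu);    /* low order four bytes */
    *high = (uint32_t)(bits >> 32);            /* high order four bytes */
}
```

For 32.0 this gives low = 0 and high = 0x40400000; for 1.8 it gives low = 0xCCCCCCCD and high = 0x3FFCCCCC. Printing the words with %u shows the decimal form the assembler listing uses.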

III:53 Comments
SSE3 floating point
  Uses lower 1/2 (double) or 1/4 (single) of vector
  Finally a departure from awkward x87
  Assembly very similar to integer code
x87 still supported
  Even mixing with SSE3 possible
  Not recommended
For highest floating point performance
  Vectorization a must (but not in this course)
  See next slide

III:54 Vector Instructions
Starting with version 4.1.1, GCC can auto-vectorize to some extent
  -O3 or -ftree-vectorize
  No speed-up guaranteed; limited capability
  Intel's C++ compiler (icc) currently much better
    Spice machines have installed (May 2010)
For highest performance vectorize yourself using intrinsics
  Intrinsics = C interface to vector instructions
Future
  Intel AVX announced: 4-way double, 8-way single
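A minimal sketch of what "vectorize yourself using intrinsics" looks like for a 4-way single precision add. add4 is a hypothetical function name; the intrinsics come from Intel's xmmintrin.h header, and a scalar fallback is included for targets without SSE.

```c
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3. */
static void add4(const float *a, const float *b, float *out) {
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);               /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));    /* one addps does all four adds */
#else
    for (int i = 0; i < 4; i++)                /* scalar fallback without SSE */
        out[i] = a[i] + b[i];
#endif
}
```

The unaligned load and store variants are used so the sketch works on any float array; aligned variants exist but require 16-byte aligned data.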