Optimizing ARM Assembly Computer Organization and Assembly Languages Yung-Yu Chuang with slides by Peng-Sheng Chen.

Slides:

Advertisements

Similar presentations

ARM Assembly Programming Computer Organization and Assembly Languages Yung-Yu Chuang with slides by Peng-Sheng Chen.

Advertisements

ARM versions ARM architecture has been extended over several versions.

Passing by-value vs. by-reference in ARM by value C code equivalent assembly code int a;.section since a is not assigned an a:.skip initial.

Instruction Set Design

COMP3221 lec16-function-II.1 Saeid Nooshabadi COMP 3221 Microprocessors and Embedded Systems Lectures 16 : Functions in C/ Assembly - II

1 ARM Movement Instructions u MOV Rd, ; updates N, Z, C Rd = u MVN Rd, ; Rd = 0xF..F EOR.

Run-time Environment for a Program different logical parts of a program during execution stack – automatically allocated variables (local variables, subdivided.

Data Dependencies Describes the normal situation that the data that instructions use depend upon the data created by other instructions, or data is stored.

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.

Embedded System Design Center Sai Kumar Devulapalli ARM7TDMI Microprocessor Load and store instruction.

1 Procedure Calls, Linking & Launching Applications Lecture 15 Digital Design and Computer Architecture Harris & Harris Morgan Kaufmann / Elsevier, 2007.

1 Lecture 4: Procedure Calls Today’s topics:  Procedure calls  Large constants  The compilation process Reminder: Assignment 1 is due on Thursday.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.

Lecture Objectives: 1)Define pipelining 2)Calculate the speedup achieved by pipelining for a given number of instructions. 3)Define how pipelining improves.

Computer Architecture CSCE 350

Lecture 8: MIPS Instruction Set

COMP3221 lec-12-mem-II.1 Saeid Nooshabadi COMP 3221 Microprocessors and Embedded Systems Lecture 12: Memory Access - II

Inline Assembly Section 1: Recitation 7. In the early days of computing, most programs were written in assembly code. –Unmanageable because No type checking,

S. Barua – CPSC 440 CHAPTER 2 INSTRUCTIONS: LANGUAGE OF THE COMPUTER Goals – To get familiar with.

Choice for the rest of the semester New Plan –assembler and machine language –Operating systems Process scheduling Memory management File system Optimization.

Chapters 5 - The LC-3 LC-3 Computer Architecture Memory Map

Topics covered: ARM Instruction Set Architecture CSE 243: Introduction to Computer Architecture and Hardware/Software Interface.

Cortex-M3 Programming. Chapter 10 in the reference book.

Topic 8: Data Transfer Instructions CSE 30: Computer Organization and Systems Programming Winter 2010 Prof. Ryan Kastner Dept. of Computer Science and.

Lecture 15: Pipelining and Hazards CS 2011 Fall 2014, Dr. Rozier.

MIPS coding. SPIM Some links can be found such as:

IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.

ITEC 352 Lecture 12 ISA(3). Review Buses Memory ALU Registers Process of compiling.

Computer Organization CS224 Fall 2012 Lesson 28. Pipelining Analogy  Pipelined laundry: overlapping execution l Parallelism improves performance §4.5.

More on Assembly 1 CSE 2312 Computer Organization and Assembly Language Programming Vassilis Athitsos University of Texas at Arlington.

M. Mateen Yaqoob The University of Lahore Spring 2014.

5/13/99 Ashish Sabharwal1 Pipelining and Hazards n Hazards occur because –Don’t have enough resources (ALU’s, memory,…) Structural Hazard –Need a value.

ARM Assembly Programming II Computer Organization and Assembly Languages Yung-Yu Chuang 2007/11/26 with slides by Peng-Sheng Chen.

Arrays and Strings in Assembly

ISA's, Compilers, and Assembly

More on Assembly 1 CSE 2312 Computer Organization and Assembly Language Programming Vassilis Athitsos University of Texas at Arlington.

ARM-7 Assembly: Example Programs 1 CSE 2312 Computer Organization and Assembly Language Programming Vassilis Athitsos University of Texas at Arlington.

Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.

Lecture 8: Loading and Storing to Memory CS 2011 Fall 2014, Dr. Rozier.

The Stack. ARMSim memory space: 0x Unused 0x x11400 Stack 0x x09400 Heap 0x?????-0x01400 Data 0x x????? Text 0x x01000.

Intel Xscale® Assembly Language and C. The Intel Xscale® Programmer’s Model (1) (We will not be using the Thumb instruction set.) Memory Formats –We will.

Embedded Systems Programming Writing Optimised C code for ARM.

Intel Xscale® Assembly Language and C. The Intel Xscale® Programmer’s Model (1) (We will not be using the Thumb instruction set.) Memory Formats –We will.

Data in Memory variables have multiple attributes symbolic name

Lecture 6: Assembly Programs

ECE 3430 – Intro to Microcomputer Systems

William Stallings Computer Organization and Architecture 8th Edition

CSCI206 - Computer Organization & Programming

ARM Registers Register – internal CPU hardware device that stores binary data; can be accessed much more rapidly than a location in RAM ARM has.

ECE 3430 – Intro to Microcomputer Systems

The compilation process

Assembly Language Assembly Language

Chapter 12 Variables and Operators

Single Clock Datapath With Control

High-Level Language Interface

Chapter 5 The LC-3.

ARM Assembly Programming

Passing by-value vs. by-reference in ARM

CSCI206 - Computer Organization & Programming

EECE.3170 Microprocessor Systems Design I

EECE.3170 Microprocessor Systems Design I

Computer Instructions

Optimizing ARM Assembly

Lecture 6: Assembly Programs

CSC3050 – Computer Architecture

ARM ORGANISATION.

Lecture 4: Instruction Set Design/Pipelining

Introduction to Assembly Chapter 2

An Introduction to the ARM CORTEX M0+ Instructions

Presentation transcript:

Optimizing ARM Assembly Computer Organization and Assembly Languages Yung-Yu Chuang with slides by Peng-Sheng Chen

Optimization Compilers do perform optimization, but they have blind sites. There are some optimization tools that you can’t explicitly use by writing C, for example. –Instruction scheduling –Register allocation –Conditional execution You have to use hand-written assembly to optimize critical routines. Use ARM9TDMI as the example, but the rules apply to all ARM cores. Note that the codes are sometimes in armasm format, not gas.

ARM optimization Utilize ARM ISA’s features –Conditional execution –Multiple register load/store –Scaled register operand –Addressing modes

Instruction scheduling ARM9 pipeline Hazard/Interlock: If the required data is the unavailable result from the previous instruction, then the process stalls. load/store load/store 8/16-bit data

Instruction scheduling No hazard, 2 cycles One-cycle interlock stall bubble

Instruction scheduling One-cycle interlock, 4 cycles ; no effect on performance

Instruction scheduling Brach takes 3 cycles due to stalls

Scheduling of load instructions Load occurs frequently in the compiled code, taking approximately 1/3 of all instructions. Careful scheduling of loads can avoid stalls.

Scheduling of load instructions 2-cycle stall. Total 11 cycles for a character. It can be avoided by preloading and unrolling. The key is to do some work when awaiting data.

Load scheduling by preloading Preloading: loads the data required for the loop at the end of the previous loop, rather than at the beginning of the current loop. Since loop i is loading data for loop i+1, there is always a problem with the first and last loops. For the first loop, insert an extra load outside the loop. For the last loop, be careful not to read any data. This can be effectively done by conditional execution.

Load scheduling by preloading 9 cycles. 11/9~1.22

Load scheduling by unrolling Unroll and interleave the body of the loop. For example, we can perform three loops together. When the result of an operation from loop i is not ready, we can perform an operation from loop i+1 that avoids waiting for the loop i result.

Load scheduling by unrolling

21 cycles. 7 cycle/character 11/7~1.57 More than doubling the code size Only efficient for a large data size.

Register allocation ATPCS requires called to save R4~R11 and to keep the stack 8-byte aligned. We stack R12 only for making the stack 8-byte aligned. Do not use sp(R13) and pc(R15) Total 14 general-purpose registers.

Register allocation Assume that K<=32 and N is large and a multiple of 256 k32-k

Register allocation Unroll the loop to handle 8 words at a time and to use multiple load/store

Register allocation

What variables do we have? We still need to assign carry and kr, but we have used 13 registers and only one remains. –Work on 4 words instead –Use stack to save least-used variable, here N –Alter the code argumentsread-inoverlap Register allocation

We notice that carry does not need to stay in the same register. Thus, we can use yi for it.

Register allocation This is often an iterative process until all variables are assigned to registers.

More than 14 local variables If you need more than 14 local variables, then you store some on the stack. Work outwards from the inner loops since they have more performance impact.

More than 14 local variables

Packing Pack multiple (sub-32bit) variables into a single register.

Packing When shifting by a register amount, ARM uses bits 0~7 and ignores others. Shift an array of 40 entries by shift bits.

Packing

Simulate SIMD (single instruction multiple data). Assume that we want to merge two images X and Y to produce Z by

30 Example X* α +Y*(1- α ) XY

31 α=0.75

32 α=0.5

33 α=0.25

Packing Load 4 bytes at a time Unpack it and promote to 16-bit data Work on 176x144 images

Packing

Conditional execution By combining conditional execution and conditional setting of the flags, you can implement simple if statements without any need of branches. This improves efficiency since branches can take many cycles and also reduces code size.

Conditional execution

Block copy example void bcopy(char *to, char *from, int n) { while (n--) *to++ = *from++; }

Block copy arguments: R0: to, R1: from, R2: n bcopy: TEQ R2, #0 BEQ end loop: SUB R2, R2, #1 LDRB R3, [R1], #1 STRB R3, [R0], #1 B bcopy end: MOV PC, LR

Block copy arguments: R0: to, R1: from, R2: rewrite “n–-” as “-–n>=0” bcopy: SUBS R2, R2, #1 LDRPLB R3, [R1], #1 STRPLB R3, [R0], #1 BPL bcopy MOV PC, LR

Block copy arguments: R0: to, R1: from, R2: assume n is a multiple of 4; loop unrolling bcopy: SUBS R2, R2, #4 LDRPLB R3, [R1], #1 STRPLB R3, [R0], #1 LDRPLB R3, [R1], #1 STRPLB R3, [R0], #1 LDRPLB R3, [R1], #1 STRPLB R3, [R0], #1 LDRPLB R3, [R1], #1 STRPLB R3, [R0], #1 BPL bcopy MOV PC, LR

Block copy arguments: R0: to, R1: from, R2: n is a multiple of 16; bcopy: SUBS R2, R2, #16 LDRPL R3, [R1], #4 STRPL R3, [R0], #4 LDRPL R3, [R1], #4 STRPL R3, [R0], #4 LDRPL R3, [R1], #4 STRPL R3, [R0], #4 LDRPL R3, [R1], #4 STRPL R3, [R0], #4 BPL bcopy MOV PC, LR

Block copy arguments: R0: to, R1: from, R2: n is a multiple of 16; bcopy: SUBS R2, R2, #16 LDMPL R1!, {R3-R6} STMPL R0!, {R3-R6} BPL bcopy MOV PC, could be extend to copy 40 byte at a if not multiple of 40, add a copy_rest loop

Search example int main(void) { int a[10]={7,6,4,5,5,1,3,2,9,8}; int i; int s=4; for (i=0; i<10; i++) if (s==a[i]) break; if (i>=10) return -1; else return i; }

Search.section.rodata.LC0:.word 7.word 6.word 4.word 5.word 1.word 3.word 2.word 9.word 8

Search.text.globalmain.typemain, %function main: sub sp, sp, #48 adr r4, =.LC0 add r5, sp, #8 ldmia r4!, {r0, r1, r2, r3} stmia r5!, {r0, r1, r2, r3} ldmia r4!, {r0, r1, r2, r3} stmia r5!, {r0, r1, r2, r3} ldmia r4!, {r0, r1} stmia r5!, {r0, r1} : a[9] s i a[0] low high stack

Search mov r3, #4 str r3, [sp, s=4 mov r3, #0 str r3, [sp, i=0 loop: ldr r0, [sp, r0=i cmp r0, i<10? bge end ldr r1, [sp, r1=s mov r2, #4 mul r3, r0, r2 add r3, r3, #8 ldr r4, [sp, r4=a[i] : a[9] s i a[0] low high stack

Search teq r1, test if s==a[i] beq end add r0, r0, i++ str r0, [sp, update i b loop end: str r0, [sp, #4] cmp r0, #10 movge r0, #-1 add sp, sp, #48 mov pc, lr : a[9] s i a[0] low high stack

Optimization Remove unnecessary load/store Remove loop invariant Use addressing mode Use conditional execution

Search (remove load/store) mov r3, #4 str r3, [sp, s=4 mov r3, #0 str r3, [sp, i=0 loop: ldr r0, [sp, r0=i cmp r0, i<10? bge end ldr r1, [sp, r1=s mov r2, #4 mul r3, r0, r2 add r3, r3, #8 ldr r4, [sp, r4=a[i] : a[9] s i a[0] low high stack r1, r0,

Search (remove load/store) teq r1, test if s==a[i] beq end add r0, r0, i++ str r0, [sp, update i b loop end: str r0, [sp, #4] cmp r0, #10 movge r0, #-1 add sp, sp, #48 mov pc, lr : a[9] s i a[0] low high stack

Search (loop invariant/addressing mode) mov r3, #4 str r3, [sp, s=4 mov r3, #0 str r3, [sp, i=0 loop: ldr r0, [sp, r0=i cmp r0, i<10? bge end ldr r1, [sp, r1=s mov r2, #4 mul r3, r0, r2 add r3, r3, #8 ldr r4, [sp, r4=a[i] : a[9] s i a[0] low high stack r1, r0, add r2, sp, #8 ldr r4, [r2, r0, LSL #2]

Search (conditional execution) teq r1, test if s==a[i] beq end add r0, r0, i++ str r0, [sp, update i b loop end: str r0, [sp, #4] cmp r0, #10 movge r0, #-1 add sp, sp, #48 mov pc, lr : a[9] s i a[0] low high stack addeq beq

Optimization Remove unnecessary load/store Remove loop invariant Use addressing mode Use conditional execution From 22 words to 13 words and execution time is greatly reduced.