ARM Organization and Implementation Aleksandar Milenkovic Web: http://www.ece.uah.edu/~milenka

ARM Organization and Implementation Aleksandar Milenkovic E-mail: milenka@ece.uah.edu Web: http://www.ece.uah.edu/~milenka

2 Outline
- ARM Architecture
- ARM Organization and Implementation
- ARM Instruction Set
- Architectural Support for High-level Languages
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications

3 ARM organization
- Register file: 2 read ports and 1 write port, plus 1 read port and 1 write port reserved for r15 (pc)
- Barrel shifter: shifts or rotates one operand by any number of bits
- ALU: performs the arithmetic and logic functions required
- Memory address register + incrementer
- Memory data registers
- Instruction decoder and associated control logic
[Figure: ARM datapath block diagram showing the address register and incrementer, register bank (with PC), multiplier, barrel shifter, ALU, data in and data out registers, and instruction decode & control, connected by the A bus, B bus and ALU bus, with external buses A[31:0] and D[31:0]]

4 Three-stage pipeline
- Fetch
  - the instruction is fetched from memory and placed in the instruction pipeline
- Decode
  - the instruction is decoded and the datapath control signals are prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
- Execute
  - the instruction owns the datapath; the register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register

5 ARM single-cycle instruction pipeline

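6 [Figure: pipeline timing for three single-cycle instructions; time runs left to right and each instruction starts one cycle after the previous one]
1 add r0,r1,#5: fetch, decode, execute add
2 sub r2,r3,r6: fetch, decode, execute sub
3 cmp r2,#3: fetch, decode, execute cmp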

7 ARM multi-cycle instruction pipeline
- Decode logic is always generating the control signals for the datapath to use in the next cycle

8 ARM multi-cycle LDMIA (load multiple) instruction
[Figure: pipeline timing; time runs left to right]
ldmia r0,{r2,r3}: fetch, decode, ex ld r2, ex ld r3
sub r2,r3,r6: fetch, decode, ex sub
cmp r2,#3: fetch, decode, ex cmp
- Decode stage occupied, since ldmia must continue to remember the decoded instruction
- sub fetched at normal time but not decoded until LDMIA is finishing
- Instruction delayed

9 Control stalls: due to branches
- Branches often introduce stalls (branch penalty)
  - Stall time may depend on whether the branch is taken
- May have to squash instructions that have already started executing
- Don’t know what to fetch until the condition is evaluated

10 ARM pipelined branch
[Figure: pipeline timing; time runs left to right]
bne foo: fetch, decode, ex bne, ex bne, ex bne
sub r2,r3,r6: fetch, decode
foo add r0,r1,r2: fetch, decode, ex add
- Decision not made until the third clock cycle
- Two cycles of work thrown away if bne is taken

11 Pipeline: how it works
- All instructions occupy the datapath for one or more adjacent cycles
- For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
- During the first datapath cycle each instruction issues a fetch for the next instruction but one
- Branch instructions flush and refill the instruction pipeline
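The first two rules above can be sketched in a few lines of Python. This is a minimal illustration, not a model of any particular core: the function name `schedule`, the choice of cycle 3 as the first execute cycle, and the per-instruction execute counts are assumptions made for the example (the two execute cycles for `ldmia r0,{r2,r3}` follow the simplified load-multiple diagram in this deck). Fetch timing and branches are deliberately left out.

```python
def schedule(instrs):
    """Return (name, decode_cycles, execute_cycles) for straight-line code.

    instrs is a list of (name, n_execute_cycles) pairs. Assumption: the first
    instruction is fetched in cycle 1, decoded in cycle 2 and starts executing
    in cycle 3; there are no taken branches.
    """
    timeline = []
    next_free = 3  # first cycle in which the datapath is free
    for name, n_exec in instrs:
        # Rule 1: the instruction occupies the datapath for adjacent cycles.
        execute = list(range(next_free, next_free + n_exec))
        # Rule 2: it occupies the decode logic in each preceding cycle.
        decode = [cycle - 1 for cycle in execute]
        timeline.append((name, decode, execute))
        next_free += n_exec
    return timeline


program = [("ldmia r0,{r2,r3}", 2), ("sub r2,r3,r6", 1), ("cmp r2,#3", 1)]
for name, decode, execute in schedule(program):
    print(f"{name:18} decode {decode} execute {execute}")
```

With these assumptions, `sub` cannot enter decode until cycle 4, because the load multiple still holds the decode logic in cycle 3.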

12 ARM9TDMI 5-stage pipeline
- Fetch
- Decode
  - instruction is decoded
  - register operands read (3 read ports)
- Execute
  - an operand is shifted and the ALU result generated, or
  - address is computed
- Buffer/data
  - data memory is accessed (load, store)
- Write-back
  - write to register file

13 ARM9TDMI Data Forwarding
Sequence 1:
  ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
  ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
Sequence 2:
  ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
  ADD r8, r9, r10           ; r8 := r9 + r10
  ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
Sequence 3:
  LDR r3, [r2]              ; r3 := mem[r2]
  ADD r1, r2, r3            ; r1 := r2 + r3
For each sequence: data forwarding, or stall?
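One way to reason about the question is a textbook-style hazard check. The sketch below assumes a generic 5-stage pipeline with full forwarding, in which an ALU result can be forwarded to the very next instruction while loaded data arrives one stage later. That timing model, the `Instr` class and the `stall_cycles` function are assumptions for illustration only; they are not taken from the slides and do not describe the exact ARM9TDMI interlock rules (for example, register-specified shifts are not treated specially).

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str       # mnemonic, e.g. "ADD" or "LDR"
    dst: str      # destination register
    srcs: tuple   # source registers


def stall_cycles(producer, consumer, distance=1):
    """Stall cycles needed by `consumer`, issued `distance` instructions
    after `producer` (1 means immediately after)."""
    if producer.dst not in consumer.srcs:
        return 0  # no read-after-write dependence
    # Assumed model: ALU results are forwardable after 1 cycle,
    # loaded data after 2 cycles (it returns from the memory stage).
    ready_after = 2 if producer.op == "LDR" else 1
    return max(0, ready_after - distance)


add_r3 = Instr("ADD", "r3", ("r2", "r1"))
use_r3 = Instr("ADD", "r5", ("r5", "r3", "r2"))
load_r3 = Instr("LDR", "r3", ("r2",))
add_r1 = Instr("ADD", "r1", ("r2", "r3"))

print(stall_cycles(add_r3, use_r3))              # adjacent ALU ops
print(stall_cycles(add_r3, use_r3, distance=2))  # one instruction in between
print(stall_cycles(load_r3, add_r1))             # load followed by its use
```

Under this assumed model, only the load followed immediately by a dependent instruction needs a stall; the two ALU-to-ALU cases are covered by forwarding.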

14 ARM9TDMI PC generation
- 3-stage pipeline
  - PC behavior: operands are read in the execute stage, so r15 = PC + 8
- 5-stage pipeline
  - operands are read in the decode stage, so r15 = PC + 4?
  - incompatibilities between 3-stage and 5-stage implementations => unacceptable
  - to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs
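The practical consequence of the PC + 8 rule can be shown with a little arithmetic. The sketch below assumes ARM (32-bit) state and uses hypothetical helper names; `branch_target` decodes the signed 24-bit word offset of a B/BL instruction, which is shifted left by two bits and added to the value read from r15.

```python
MASK32 = 0xFFFFFFFF


def r15_as_read(instr_addr):
    """Value an ARM-state instruction at instr_addr sees when it reads r15."""
    return (instr_addr + 8) & MASK32


def branch_target(instr_addr, imm24):
    """Target of a B/BL at instr_addr with a 24-bit offset field imm24."""
    if imm24 & 0x800000:          # sign-extend the 24-bit field
        imm24 -= 1 << 24
    return (r15_as_read(instr_addr) + (imm24 << 2)) & MASK32


print(hex(r15_as_read(0x8000)))              # address of the instruction + 8
print(hex(branch_target(0x8000, 0xFFFFFE)))  # offset of -2 words: branch to itself
```

Because every implementation presents the same r15 value, a compiler or assembler can compute PC-relative offsets without knowing the pipeline depth of the target core.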

15 Data processing instruction datapath activity
[Figure: datapath activity for (a) register-register operations and (b) register-immediate operations, where the immediate comes from bits [7:0] of the instruction pipeline]
- Reg-Reg
  - Rd = Rn op Rm
  - r15 = AR + 4
  - AR = AR + 4
- Reg-Imm
  - Rd = Rn op Imm
  - r15 = AR + 4
  - AR = AR + 4

16 STR (store register) datapath activity
[Figure: datapath activity for (a) 1st cycle: compute address, with the displacement taken from bits [11:0] of the instruction pipeline, and (b) 2nd cycle: store data & auto-index]
- Compute address
  - AR = Rn op Disp
  - r15 = AR + 4
- Store data
  - AR = PC
  - mem[AR] = Rd
  - If autoindexing => Rn = Rn +/- 4

17 The first two (of three) cycles of a branch instruction
[Figure: datapath activity for (a) 1st cycle: compute branch target, with the offset taken from bits [23:0] of the instruction pipeline and shifted by lsl #2, and (b) 2nd cycle: save return address in r14]
- Compute target address
  - AR = PC + Disp, lsl #2
- Save return address (if required)
  - r14 = PC
  - AR = AR + 4
- Third cycle: a small correction is made to the value stored in the link register so that it points directly at the instruction which follows the branch

18 ARM Implementation
- Datapath
- Control unit (FSM)

19 2-phase non-overlapping clock scheme
- Most ARMs do not operate on edge-sensitive registers
- Instead the design is based around 2-phase non-overlapping clocks, which are generated internally from a single clock signal
- Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

20 ARM datapath timing
- Register read
  - Register read buses: dynamic, precharged during phase 2
  - During phase 1, selected registers discharge the read buses, which become valid early in phase 1
- Shift operation
  - second operand passes through the barrel shifter
- ALU operation
  - ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
  - ALU processes the operands during phase 2, producing a valid output towards the end of the phase
  - the result is latched in the destination register at the end of phase 2

21 ARM datapath timing (cont’d)
Minimum datapath delay =
  register read time
  + shifter delay
  + ALU delay
  + register write set-up time
  + phase 2 to phase 1 non-overlap time

22 The original ARM1 ripple-carry adder
- Carry logic: use a CMOS AOI (And-Or-Invert) gate
- Even bits use the circuit shown below
- Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
- Worst case path: 32 gates long
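A bit-level sketch of ripple-carry addition makes the 32-gate worst case easy to see: each bit's carry depends on the carry from the bit below it. The function name and the plain AND-OR form of the carry are choices made for this illustration; the alternating inverted (AOI) gate polarity of the real circuit is not modelled.

```python
def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two width-bit values one bit at a time; return (sum, carry_out)."""
    result = 0
    carry = carry_in
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        # Carry out of bit i: generated (x AND y) or passed on from below.
        carry = (x & y) | (carry & (x | y))
    return result, carry


# Worst case: a carry generated at bit 0 ripples through all 32 bits.
print(ripple_carry_add(0xFFFFFFFF, 0x00000001))
```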

23 ARM2 4-bit carry look-ahead scheme
- Carry Generate (G), Carry Propagate (P)
- Cout[3] = Cin[0].P + G
- Use AOI and alternate AND/OR gates
- Worst case: 8 gates long
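The group carry equation can be checked with a short sketch. The per-bit generate and propagate definitions used here (g = a AND b, p = a OR b) are the standard textbook ones and are an assumption about how G and P are formed; the function names are hypothetical. Chaining eight 4-bit groups gives the carry out of a 32-bit addition in eight group steps instead of 32 bit steps.

```python
def group_generate_propagate(a, b):
    """Return (G, P) for a 4-bit group with operands a and b."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # bit i generates a carry
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]  # bit i passes a carry on
    G = (g[3]
         | (p[3] & g[2])
         | (p[3] & p[2] & g[1])
         | (p[3] & p[2] & p[1] & g[0]))
    P = p[3] & p[2] & p[1] & p[0]
    return G, P


def group_carry_out(a, b, cin):
    """Cout[3] = Cin[0].P + G for one 4-bit group."""
    G, P = group_generate_propagate(a, b)
    return (cin & P) | G


def carry_out_32(a, b, cin=0):
    """Carry out of a 32-bit add, rippling through eight 4-bit groups."""
    carry = cin
    for k in range(8):
        carry = group_carry_out((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF, carry)
    return carry


print(carry_out_32(0xFFFFFFFF, 0x00000001))
```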

24 The ARM2 ALU logic for one result bit
- ALU functions
  - data operations (add, sub, ...)
  - address computations for memory accesses
  - branch target computations
  - bit-wise logical operations
  - ...

25 ARM2 ALU function codes

26 The ARM6 carry-select adder scheme
- Compute sums of various fields of the word for carry-in of zero and carry-in of one
- Final result is selected by using the correct carry-in value to control a multiplexor
- Worst case: O(log2[word width]) gates long
- Note: Be careful! Fan-out on some of these gates is high, so direct comparison with previous schemes is not applicable.
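The carry-select idea can be sketched as follows. This is a simplified illustration: it uses equal 4-bit fields and selects them one after another, whereas the logarithmic depth quoted on the slide relies on a different arrangement of field sizes and selection logic. The function names and the `field` parameter are assumptions made for the example.

```python
def add_field(a, b, cin, width):
    """Add two width-bit fields; return (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, cin=0, width=32, field=4):
    """Add a and b by precomputing each field for both carry-in values."""
    mask = (1 << field) - 1
    result, carry = 0, cin
    for lo in range(0, width, field):
        fa, fb = (a >> lo) & mask, (b >> lo) & mask
        # Both candidate results are computed regardless of the real carry.
        sum0, carry0 = add_field(fa, fb, 0, field)
        sum1, carry1 = add_field(fa, fb, 1, field)
        # The actual carry-in only drives the multiplexor.
        field_sum, carry = (sum1, carry1) if carry else (sum0, carry0)
        result |= field_sum << lo
    return result, carry


print(carry_select_add(0x12345678, 0x9ABCDEF0))
```

In hardware the two candidate sums for every field are produced in parallel, so the arrival of the true carry only has to pass through multiplexors rather than through another adder.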

27 The ARM6 ALU organization
- Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output

28 ARM9 carry arbitration encoding
- Carry arbitration adder

29 The cross-bar switch barrel shifter
- Shifter delay is critical since it contributes directly to the datapath cycle time
- Cross-bar switch matrix (32 x 32)
- Principle for 4x4 matrix
[Figure: 4 x 4 cross-bar switch matrix with inputs in[0] to in[3], outputs out[0] to out[3], and switch diagonals labelled no shift, right 1, right 2, right 3, left 1, left 2, left 3]

30 The cross-bar switch barrel shifter (cont’d)
- Precharged logic is used => each switch is a single NMOS transistor
- Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
- For rotate right, the right shift diagonal is enabled + the complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)
- Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
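The diagonal-selection principle can be modelled behaviourally. In this sketch each enabled diagonal connects output bit j to input bit j + d (positive d for a right shift, negative d for a left shift), and any output left unconnected keeps a fill value that stands in for the precharged level. The names `crossbar`, `lsl`, `lsr`, `ror` and `asr`, and the `fill` parameter that models the separate sign-extension logic, are assumptions for illustration; no transistor-level behaviour is represented.

```python
WIDTH = 32


def crossbar(value, diagonals, fill=0):
    """Route input bits to outputs along the enabled diagonals."""
    out = 0
    for j in range(WIDTH):                 # output bit j
        bit = fill                         # value kept if no switch connects
        for d in diagonals:
            i = j + d                      # input bit on this diagonal
            if 0 <= i < WIDTH:
                bit = (value >> i) & 1
        out |= bit << j
    return out


def lsr(value, n):
    return crossbar(value, [n])


def lsl(value, n):
    return crossbar(value, [-n])


def ror(value, n):
    # Right diagonal plus the complementary left diagonal.
    return crossbar(value, [n, -(WIDTH - n)])


def asr(value, n):
    # Unconnected top outputs take the sign bit instead of zero.
    return crossbar(value, [n], fill=(value >> (WIDTH - 1)) & 1)


print(hex(ror(0x12345678, 8)), hex(asr(0x80000000, 4)))
```

For the shift amounts used here (0 to 31) the two rotate diagonals never drive the same output, which is why a single pass-transistor per switch is enough.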

31 The 2-bit multiplication algorithm, Nth cycle

32 Carry-propagate (a) and carry-save (b) adder structures

33 ARM high-speed multiplier organization

34 ARM2 register cell circuit

35 ARM register bank floorplan

36 ARM core datapath buses

37 ARM control logic structure
