Published by Clyde Richards. Modified over 9 years ago.
ARM Organization and Implementation
Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka
2 Outline
- ARM Architecture
- ARM Organization and Implementation
- ARM Instruction Set
- Architectural Support for High-level Languages
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications
3 ARM organization
- Register file: 2 read ports and 1 write port, plus 1 read port and 1 write port reserved for r15 (pc)
- Barrel shifter: shifts or rotates one operand by any number of bits
- ALU: performs the arithmetic and logic functions required
- Memory address register + incrementer
- Memory data registers
- Instruction decoder and associated control logic
[Figure: ARM datapath organization. Blocks: address register, incrementer, register bank (with PC), multiply register, barrel shifter, ALU, data out register, data in register, instruction decode & control. Buses: A[31:0], D[31:0], A bus, B bus, ALU bus.]
4 Three-stage pipeline
- Fetch: the instruction is fetched from memory and placed in the instruction pipeline
- Decode: the instruction is decoded and the datapath control signals are prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
- Execute: the instruction owns the datapath; the register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register
5 ARM single-cycle instruction pipeline
6 [Figure: pipeline timing, time running left to right]
add r0,r1,#5   fetch    decode   execute
sub r2,r3,r6            fetch    decode   execute
cmp r2,#3                        fetch    decode   execute
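The staggered overlap in the timing diagram above can be reproduced with a few lines of Python. This is only a sketch under the idealised assumption that every instruction spends exactly one cycle in each of the three stages and that a new instruction is fetched every cycle; the function names `pipeline_schedule` and `render` are illustrative and do not come from the slides.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_schedule(instructions):
    """Return {instruction text: {cycle: stage}} for an ideal 3-stage pipeline.

    Assumption: one cycle per stage, no stalls, cycles numbered from 1.
    Instruction texts are used as dictionary keys, so they must be unique.
    """
    schedule = {}
    for index, text in enumerate(instructions):
        # Instruction number `index` enters the fetch stage in cycle index + 1.
        schedule[text] = {index + 1 + s: stage for s, stage in enumerate(STAGES)}
    return schedule


def render(schedule):
    """Format the schedule as a plain-text timing table, one row per instruction."""
    last_cycle = max(max(cycles) for cycles in schedule.values())
    width = max(len(text) for text in schedule)
    lines = []
    for text, cycles in schedule.items():
        cells = [cycles.get(c, "").ljust(8) for c in range(1, last_cycle + 1)]
        lines.append(text.ljust(width) + "  " + "".join(cells).rstrip())
    return "\n".join(lines)


program = ["add r0,r1,#5", "sub r2,r3,r6", "cmp r2,#3"]
table = pipeline_schedule(program)
print(render(table))
```

Once the pipeline is full, one instruction completes per cycle, which is why the three instructions finish in five cycles rather than nine.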
7 ARM multi-cycle instruction pipeline
- The decode logic is always generating the control signals for the datapath to use in the next cycle
8 ARM multi-cycle LDMIA (load multiple) instruction
[Figure: pipeline timing, time running left to right; stage sequence per instruction]
ldmia r0,{r2,r3}: fetch, decode, ex ld r2, ex ld r3
sub r2,r3,r6: fetch, decode, ex sub
cmp r2,#3: fetch, decode, ex cmp
- The decode stage stays occupied because ldmia must continue to remember the decoded instruction
- sub is fetched at the normal time but not decoded until ldmia is finishing
- The instructions that follow are delayed as a result
9 Control stalls: due to branches
- Branches often introduce stalls (branch penalty)
- Stall time may depend on whether the branch is taken
- May have to squash instructions that have already started executing
- The processor does not know what to fetch until the condition is evaluated
10 ARM pipelined branch
[Figure: pipeline timing, time running left to right; stage sequence per instruction]
bne foo: fetch, decode, ex bne, ex bne, ex bne
sub r2,r3,r6: fetch, decode
foo add r0,r1,r2: fetch, decode, ex add
- The decision is not made until the third clock cycle
- Two cycles of work are thrown away if bne is taken
11 Pipeline: how it works
- All instructions occupy the datapath for one or more adjacent cycles
- For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
- During its first datapath cycle, each instruction issues a fetch for the next instruction but one
- Branch instructions flush and refill the instruction pipeline
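The rules above imply a simple cycle count for a sequence of instructions: two cycles to fetch and decode the first instruction, then one cycle for every datapath cycle any instruction needs. The sketch below assumes that idealised model only; `total_cycles` is a hypothetical helper, and the per-instruction datapath cycle counts (2 for the two-register ldmia, 3 for a taken branch) are read off the timing diagrams in this deck rather than taken from a processor manual.

```python
def total_cycles(datapath_cycles):
    """Cycles needed to run a sequence on an idealised 3-stage pipeline.

    datapath_cycles: list with the number of adjacent execute (datapath)
    cycles each instruction occupies. The first instruction needs one fetch
    and one decode cycle before it reaches the datapath; after that the
    datapath is assumed to be busy in every cycle.
    """
    if not datapath_cycles:
        return 0
    pipeline_fill = 2  # fetch + decode of the first instruction
    return pipeline_fill + sum(datapath_cycles)


single_cycle = total_cycles([1, 1, 1])   # add, sub, cmp
with_ldmia = total_cycles([2, 1, 1])     # ldmia r0,{r2,r3}; sub; cmp
taken_branch = total_cycles([3, 1])      # taken bne, then add at the target
print(single_cycle, with_ldmia, taken_branch)
```

In this model a taken branch costs the same as any other three-cycle instruction: the two extra datapath cycles are exactly the time spent refetching and decoding the instruction at the branch target.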
12 ARM9TDMI 5-stage pipeline
- Fetch
- Decode: the instruction is decoded and register operands are read (3 read ports)
- Execute: an operand is shifted and the ALU result generated, or an address is computed
- Buffer/data: data memory is accessed (load, store)
- Write-back: results are written to the register file
13 ARM9TDMI Data Forwarding
Data forwarding? Stall?
Case 1:
ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
Case 2:
ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
ADD r8, r9, r10           ; r8 := r9 + r10
ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
Case 3:
LDR r3, [r2]              ; r3 := mem[r2]
ADD r1, r2, r3            ; r1 := r2 + r3
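One way to reason about the forwarding and stall questions on this slide is a textbook 5-stage model: ALU results are forwarded to the next instruction without delay, while a loaded value is not available until after the memory stage, so an instruction that uses it immediately must wait one cycle. The sketch below encodes only that generic rule. It is an assumption-laden illustration, not a description of the real ARM9TDMI interlock logic (which has further cases, such as register-specified shift amounts); `Instr` and `stall_cycles` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    text: str                  # assembly text, for readability only
    dest: str                  # register written by the instruction
    sources: Tuple[str, ...]   # registers read by the instruction
    is_load: bool = False      # True if the result comes from data memory


def stall_cycles(program):
    """Count load-use stalls, assuming full forwarding of ALU results.

    A stall is charged only when an instruction reads the destination of
    the load that immediately precedes it.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev.is_load and prev.dest in curr.sources:
            stalls += 1
    return stalls


back_to_back_alu = [
    Instr("ADD r3, r2, r1, LSL #3", "r3", ("r2", "r1")),
    Instr("ADD r5, r5, r3, LSL r2", "r5", ("r5", "r3", "r2")),
]
load_then_use = [
    Instr("LDR r3, [r2]", "r3", ("r2",), is_load=True),
    Instr("ADD r1, r2, r3", "r1", ("r2", "r3")),
]
print(stall_cycles(back_to_back_alu), stall_cycles(load_then_use))
```

Under this model the back-to-back ADD pair is handled by forwarding alone, while the load followed by a dependent ADD needs one stall cycle.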
14 ARM9TDMI PC generation
- 3-stage pipeline PC behavior: operands are read in the execute stage, so r15 = PC + 8
- 5-stage pipeline: operands are read in the decode stage, so r15 = PC + 4?
- Incompatibilities between 3-stage and 5-stage implementations would be unacceptable
- To avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs
15 Data processing instruction datapath activity
[Figure: datapath activity for (a) register-register operations and (b) register-immediate operations. Blocks: address register, increment, registers (Rd, Rn, Rm, PC), mult, shifter, ALU, data out, data in, instruction pipe; in (b) the immediate comes from instruction bits [7:0].]
- Reg-Reg: Rd = Rn op Rm; r15 = AR + 4; AR = AR + 4
- Reg-Imm: Rd = Rn op Imm; r15 = AR + 4; AR = AR + 4
16 STR (store register) datapath activity
[Figure: datapath activity for (a) 1st cycle, compute address, with the offset taken from instruction bits [11:0], and (b) 2nd cycle, store data and auto-index.]
- 1st cycle, compute address: AR = Rn op Disp; r15 = AR + 4
- 2nd cycle, store data: AR = PC; mem[AR] = Rd
- If auto-indexing: Rn = Rn +/- 4
17 The first two (of three) cycles of a branch instruction
[Figure: datapath activity for (a) 1st cycle, compute branch target, using instruction bits [23:0] shifted by lsl #2, and (b) 2nd cycle, save return address in r14.]
- 1st cycle, compute target address: AR = PC + Disp, lsl #2
- 2nd cycle, save return address (if required): r14 = PC; AR = AR + 4
- Third cycle: make a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch
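The address arithmetic of the branch can be checked with a small calculation. The sketch below assumes ARM-state B/BL encoding: a signed 24-bit word offset added to the PC value seen by the instruction, which reads as the branch's own address plus 8. The helper names `branch_target` and `link_value` are illustrative only.

```python
def branch_target(branch_address, offset24):
    """Target of an ARM-state B/BL located at branch_address.

    offset24 is the raw 24-bit offset field from the instruction.
    """
    # Sign-extend the 24-bit field to a Python integer.
    if offset24 & 0x800000:
        offset24 -= 1 << 24
    # r15 reads as branch_address + 8; the field is a word offset,
    # hence the shift left by 2 (the "lsl #2" on the slide).
    return (branch_address + 8 + (offset24 << 2)) & 0xFFFFFFFF


def link_value(branch_address):
    """Return address left in r14 after the third-cycle correction."""
    return (branch_address + 4) & 0xFFFFFFFF


print(hex(branch_target(0x8000, 0x000000)))  # offset 0 skips one instruction
print(hex(branch_target(0x8000, 0xFFFFFE)))  # offset -2 branches to itself
print(hex(link_value(0x8000)))
```

The +8 bias is why an offset field of zero lands two instructions ahead, and why the value first copied into r14 needs a correction before it points at the instruction after the branch.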
18 ARM Implementation
- Datapath
- Control unit (FSM)
19 2-phase non-overlapping clock scheme
- Most ARMs do not operate on edge-sensitive registers
- Instead the design is based around 2-phase non-overlapping clocks, which are generated internally from a single clock signal
- Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2
20 ARM datapath timing
- Register read: the register read buses are dynamic and precharged during phase 2; during phase 1 the selected registers discharge the read buses, which become valid early in phase 1
- Shift operation: the second operand passes through the barrel shifter
- ALU operation: the ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid; the latches close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
- The ALU processes the operands during phase 2, producing a valid output towards the end of the phase
- The result is latched in the destination register at the end of phase 2
21 ARM datapath timing (cont’d)
Minimum datapath delay =
  register read time
  + shifter delay
  + ALU delay
  + register write set-up time
  + phase 2 to phase 1 non-overlap time
22 The original ARM1 ripple-carry adder
- Carry logic: uses a CMOS AOI (And-Or-Invert) gate
- Even bits use the circuit shown in the figure
- Odd bits use the dual circuit, with inverted inputs and outputs and the AND and OR gates swapped around
- Worst-case path: 32 gates long
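A bit-level software model shows why the worst-case path is as long as the word: each bit's carry depends on the carry from the bit below it. The sketch below models only the logic function of a ripple-carry adder (carry = A.B + C.(A + B)), not the inverting AOI gates or the alternating even/odd circuits of the ARM1; `ripple_carry_add` is an illustrative name.

```python
def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two width-bit values one bit at a time; return (sum, carry_out)."""
    result = 0
    carry = carry_in
    for bit in range(width):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        result |= (x ^ y ^ carry) << bit
        # Carry out of this bit: x.y + carry.(x + y).
        # The next iteration cannot start until this value is known,
        # which is the serial dependency that makes the path 32 stages long.
        carry = (x & y) | (carry & (x | y))
    return result, carry


print(ripple_carry_add(5, 7))
print(ripple_carry_add(0xFFFFFFFF, 1))
```

Adding 1 to 0xFFFFFFFF is the worst case: the carry generated at bit 0 has to ripple through every one of the 32 stages.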
23 ARM2 4-bit carry look-ahead scheme
- Carry Generate (G), Carry Propagate (P)
- Cout[3] = Cin[0].P + G
- Uses AOI and alternate AND/OR gates
- Worst case: 8 gates long
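The group equation Cout = Cin.P + G can be exercised in software. The sketch below computes a group generate G and group propagate P for each 4-bit field and passes the carry from group to group, so a 32-bit add needs only 8 carry steps. It assumes the standard textbook definitions (per-bit generate = a AND b, per-bit propagate = a OR b); the gate-level ARM2 circuit is not reproduced, and the function names are illustrative.

```python
def group_generate_propagate(a4, b4):
    """Return (G, P) for one 4-bit group of operands a4 and b4."""
    g = [(a4 >> i) & (b4 >> i) & 1 for i in range(4)]     # bit generates a carry
    p = [((a4 >> i) | (b4 >> i)) & 1 for i in range(4)]   # bit propagates a carry
    # The group generates a carry if some bit generates one and every
    # higher bit in the group propagates it.
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    # The group propagates a carry-in only if all four bits propagate.
    P = p[3] & p[2] & p[1] & p[0]
    return G, P


def lookahead_add(a, b, carry_in=0, width=32):
    """Add using 4-bit look-ahead groups; width must be a multiple of 4."""
    result = 0
    carry = carry_in
    for low in range(0, width, 4):
        a4 = (a >> low) & 0xF
        b4 = (b >> low) & 0xF
        G, P = group_generate_propagate(a4, b4)
        # Sum bits inside the group (computed arithmetically for brevity).
        result |= ((a4 + b4 + carry) & 0xF) << low
        carry = G | (P & carry)   # Cout = G + P.Cin, one step per group
    return result, carry


print(lookahead_add(0xFFFFFFFF, 1))
```

Compared with the bit-serial ripple scheme, the carry now crosses four bits per step, which is where the reduction from 32 to 8 stages comes from.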
24 The ARM2 ALU logic for one result bit
ALU functions:
- data operations (add, sub, ...)
- address computations for memory accesses
- branch target computations
- bit-wise logical operations
- ...
25 ARM2 ALU function codes
26 The ARM6 carry-select adder scheme
- Compute sums of various fields of the word for a carry-in of zero and a carry-in of one
- The final result is selected by using the correct carry-in value to control a multiplexor
- Worst case: O(log2[word width]) gates long
- Note: fan-out on some of these gates is high, so a direct comparison with the previous schemes is not applicable
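The select principle can be sketched in a few lines: for each field, both candidate sums are prepared without knowing the carry-in, and the real carry only chooses between them. The version below uses uniform 4-bit fields and chains the selection from field to field for simplicity, so it is an assumption made for illustration and does not show the logarithmic-depth arrangement of field sizes used in the ARM6; `carry_select_add` is an illustrative name.

```python
def carry_select_add(a, b, carry_in=0, width=32, block=4):
    """Add using carry-select fields; width must be a multiple of block."""
    mask = (1 << block) - 1
    result = 0
    carry = carry_in
    for low in range(0, width, block):
        x = (a >> low) & mask
        y = (b >> low) & mask
        # Both candidates depend only on the operands, so in hardware they
        # can be computed in parallel for every field.
        sum_if_carry0 = x + y
        sum_if_carry1 = x + y + 1
        # The multiplexor: the actual carry-in picks one candidate.
        chosen = sum_if_carry1 if carry else sum_if_carry0
        result |= (chosen & mask) << low
        carry = chosen >> block
    return result, carry


print(carry_select_add(0xFFFFFFFF, 1))
```

Only the multiplexor selection sits on the carry path; the additions themselves are off the critical path.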
27 The ARM6 ALU organization
- It is not easy to merge the arithmetic and logic functions, so a separate logic unit runs in parallel with the adder and a multiplexor selects the output
28 ARM9 carry arbitration encoding
- Carry arbitration adder
29 The cross-bar switch barrel shifter
- Shifter delay is critical since it contributes directly to the datapath cycle time
- Cross-bar switch matrix (32 x 32)
[Figure: principle for a 4 x 4 matrix. Inputs in[0] to in[3], outputs out[0] to out[3]; diagonals labelled no shift, right 1, right 2, right 3, left 1, left 2, left 3.]
30 The cross-bar switch barrel shifter (cont’d)
- Precharged logic is used, so each switch is a single NMOS transistor
- Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
- Rotate right: the right-shift diagonal is enabled together with the complementary left-shift diagonal (e.g., ‘right 1’ + ‘left 3’)
- Arithmetic shift right: uses sign extension, so separate logic decodes the shift amount and discharges those outputs appropriately
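The switch-matrix behaviour described above can be modelled directly: every output starts at 0, and a shift amount closes the switches along one diagonal (two for a rotate). The sketch below is a functional model only, assuming shift amounts in the range 0 to width - 1; it does not model precharge timing or transistor-level behaviour, and `crossbar_shift` is an illustrative name.

```python
def crossbar_shift(value, amount, mode, width=32):
    """Shift or rotate value using a width x width switch-matrix model.

    mode is 'lsl', 'lsr', 'asr' or 'ror'; amount must satisfy 0 <= amount < width.
    """
    ins = [(value >> i) & 1 for i in range(width)]
    outs = [0] * width  # precharge: unconnected outputs stay at logic 0
    for o in range(width):
        for i in range(width):
            if mode == "lsl":
                closed = (o - i == amount)          # one left diagonal
            elif mode in ("lsr", "asr"):
                closed = (i - o == amount)          # one right diagonal
            elif mode == "ror":
                # right diagonal plus the complementary left diagonal
                closed = (i - o == amount) or (o - i == width - amount)
            else:
                raise ValueError("unknown mode: " + mode)
            if closed:
                outs[o] = ins[i]
    if mode == "asr":
        # Separate sign-extension logic drives the vacated top outputs.
        for o in range(width - amount, width):
            outs[o] = ins[width - 1]
    return sum(bit << o for o, bit in enumerate(outs))


print(bin(crossbar_shift(0b1001, 1, "ror", width=4)))
print(hex(crossbar_shift(0x80000000, 4, "asr")))
```

With width=4 and a rotate right by 1, the model closes exactly the 'right 1' and 'left 3' diagonals named on the slide.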
31 The 2-bit multiplication algorithm, Nth cycle
32 Carry-propagate (a) and carry-save (b) adder structures
33 ARM high-speed multiplier organization
34 ARM2 register cell circuit
35 ARM register bank floorplan
36 ARM core datapath buses
37 ARM control logic structure