ARM ORGANISATION.


Computer architecture is an abstract model: the attributes that are visible to the programmer, such as the instruction set, the number of bits used for data, and the addressing techniques. A computer's organization is the realization of that architecture, that is, how the features are implemented: which registers, which data paths, and which connection to memory. Computer organization covers the ALU, the CPU, memory, and memory organization.

Computer architecture refers to those attributes of a system that are visible to a programmer and that have a direct impact on the logical execution of a program. Computer organisation refers to the operational units and their interconnections that realize the architectural specification.

EXAMPLE 1: Suppose you work at a company that manufactures cars. The design and overall specification of the car correspond to computer architecture (the abstract, programmer's view), while making its parts piece by piece and connecting the components together, keeping the basic design in mind, corresponds to computer organization (the physical realization).

EXAMPLE 2: Both Intel and AMD processors implement the same x86 architecture, but how the two companies implement that architecture (their computer organizations) is usually very different. The same programs run correctly on both because the architecture is the same, but they may run at different speeds because the organizations are different.

Pipeline stages (for different families of ARM processors)

3-stage pipeline ARM organization
The principal components are:
Register bank: stores the processor state.
Barrel shifter: can shift or rotate one operand by any number of bits.
ALU: performs the arithmetic and logic functions required by the instruction set.
Address register and incrementer: select and hold all memory addresses and generate sequential addresses when required.
Data registers: hold data passing to and from memory.

In a single-cycle data processing instruction, two register operands are accessed. The value on the B bus is shifted and then combined in the ALU with the value on the A bus, and the result is written back into the register bank. The program counter value is held in the address register, from where it is fed into the incrementer. The incremented value is copied back into r15 in the register bank and also into the address register, to be used as the address for the next instruction fetch if needed.
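The data flow described above can be sketched in Python. This is only an illustrative model, not a description of the actual hardware: the function names `lsl`, `execute_add` and `next_fetch_address` are hypothetical, the shift is restricted to a logical shift left, and the incrementer is assumed to add 4 (one 32-bit ARM instruction).

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit values


def lsl(value, amount):
    """Barrel shifter, restricted here to logical shift left (an assumption)."""
    return (value << amount) & MASK32


def execute_add(regs, rd, rn, rm, shift_amount=0):
    """Model of ADD rd, rn, rm, LSL #shift_amount in one execute cycle."""
    a_bus = regs[rn]                      # first operand goes out on the A bus
    b_bus = lsl(regs[rm], shift_amount)   # second operand is shifted on the B bus
    regs[rd] = (a_bus + b_bus) & MASK32   # ALU result is written back


def next_fetch_address(regs):
    """Incrementer: PC + 4 is copied to r15 and reused as the fetch address."""
    regs[15] = (regs[15] + 4) & MASK32
    return regs[15]


regs = [0] * 16
regs[1], regs[2], regs[15] = 10, 3, 0x8000
execute_add(regs, rd=0, rn=1, rm=2, shift_amount=2)  # r0 = r1 + (r2 << 2)
address = next_fetch_address(regs)
```

With these inputs r0 ends up holding 22 and the next fetch address is 0x8004.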

ARM processors up to the ARM7 employ a simple 3-stage pipeline with the following stages:
Fetch: the instruction is fetched from memory and placed in the instruction pipeline.
Decode: the instruction is decoded and the datapath control signals are prepared for the next cycle. In this stage the instruction "owns" the decode logic but not the datapath.
Execute: the instruction "owns" the datapath. The register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register.
At any one time, three different instructions may occupy these stages, one per stage, so the hardware in each stage has to be capable of independent operation.

ARM single-cycle instruction 3-stage pipeline operation
When the processor is executing simple data processing instructions the pipeline enables one instruction to be completed every clock cycle. An individual instruction takes three clock cycles to complete, so it has three-cycle latency, but the throughput is one instruction per cycle.
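A small Python sketch can make the latency and throughput figures concrete by listing which instruction occupies each stage in every cycle. The function name `pipeline_chart` and the instruction mnemonics used below are illustrative assumptions, and the model covers single-cycle instructions only.

```python
STAGES = ("Fetch", "Decode", "Execute")


def pipeline_chart(instructions):
    """Return one dict per clock cycle mapping stage name -> instruction.

    Assumes every instruction spends exactly one cycle in each stage.
    """
    n_cycles = len(instructions) + len(STAGES) - 1
    chart = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            index = cycle - depth  # instruction that has reached this stage
            in_range = 0 <= index < len(instructions)
            row[stage] = instructions[index] if in_range else None
        chart.append(row)
    return chart


chart = pipeline_chart(["ADD", "SUB", "MOV"])
for number, row in enumerate(chart, start=1):
    print(number, row)
```

The first instruction reaches Execute in cycle 3 (three-cycle latency), and from then on one instruction leaves the Execute stage in every cycle.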

ARM multi-cycle instruction 3-stage pipeline operation
When a multi-cycle instruction is executed the flow is less regular, as illustrated in the figure. It shows a sequence of single-cycle ADD instructions with a data store instruction, STR, occurring after the first ADD. The cycles that access main memory are shown with light shading, so it can be seen that memory is used in every cycle. The datapath is likewise used in every cycle, being involved in all the execute cycles, the address calculation and the data transfer. The decode logic is always generating the control signals for the datapath to use in the next cycle, so in addition to the explicit decode cycles it also generates the control for the data transfer during the address calculation cycle of the STR.

Multiple register data transfer instructions
Example of ldmia (load multiple, increment after):
    ldmia r9, {r0-r3}    @ register 9 holds the base address
This has the same effect as four separate ldr instructions:
    ldr r0, [r9]
    ldr r1, [r9, #4]
    ldr r2, [r9, #8]
    ldr r3, [r9, #12]
Note: at the end of the ldmia instruction, register r9 has not been changed. If you want r9 to be updated, add the writeback suffix "!" to the base register, for example:
    ldmia r9!, {r0,r2,r5}

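Multiple register data transfer instructions
Example 2:
    ldmia r9, {r0-r3, r12}
Loads the words addressed by r9 into r0, r1, r2, r3, and r12. The transfer address is incremented after each load; r9 itself is left unchanged because no "!" is given.
Example 3:
    ldmia r9, {r5, r3, r0-r2, r14}
Loads the words addressed by r9 into r0, r1, r2, r3, r5, and r14. The order in which the registers are written in the list does not matter: the lowest-numbered register is always loaded from the lowest address.
ldmib, ldmda, and ldmdb work similarly to ldmia. Store multiple instructions work in an analogous manner to the loads.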
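The behaviour of ldmia can be sketched with a short Python model. The function name `ldmia`, the dictionary used as word-addressed memory, and the sample addresses and values are all assumptions made for illustration; the case where the base register also appears in the register list is not modelled.

```python
def ldmia(regs, memory, base, reg_list, writeback=False):
    """Model of LDMIA base{!}, {reg_list}.

    regs is a list of 16 register values, memory maps byte address -> word.
    """
    address = regs[base]
    # The lowest-numbered register is loaded from the lowest address,
    # whatever order the registers were written in.
    for reg in sorted(set(reg_list)):
        regs[reg] = memory[address]
        address += 4                  # increment after each transfer
    if writeback:                     # corresponds to the "!" suffix
        regs[base] = address


memory = {0x1000 + 4 * i: 100 + i for i in range(6)}  # sample data
regs = [0] * 16
regs[9] = 0x1000

ldmia(regs, memory, base=9, reg_list=[5, 3, 0, 1, 2, 14])
loaded = [regs[n] for n in (0, 1, 2, 3, 5, 14)]
base_after_plain = regs[9]

ldmia(regs, memory, base=9, reg_list=[0, 2, 5], writeback=True)
base_after_writeback = regs[9]
```

After the first call the registers r0, r1, r2, r3, r5 and r14 hold consecutive words and r9 is still 0x1000; after the second call, which models `ldmia r9!, {r0,r2,r5}`, r9 has advanced by three words to 0x100C.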

Store Multiples

Load and Store Multiples
[Figure: memory layout, in order of increasing address, for LDMxx r10, {r0,r1,r4} and STMxx r10, {r0,r1,r4} in the IA, IB, DA and DB addressing modes, with r10 as the base register (Rb).]
Several aliases are allowed for stack usage, for instance:
LDMFD -> LDMIA
STMFD -> STMDB

The mapping between the stack and block copy views of the load and store multiple instructions
LDMFD == restore registers from the stack
STMFD == save registers onto the stack
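The stack view can be illustrated with a Python sketch of a full descending stack. The alias table lists the standard ARM stack mnemonics and their block copy equivalents; the helper names `stmfd` and `ldmfd`, the dictionary used as memory, and the sample values are hypothetical, and both helpers model the writeback forms (`STMFD sp!, {...}` and `LDMFD sp!, {...}`).

```python
# Stack-oriented mnemonic -> equivalent block copy mnemonic
STACK_ALIASES = {
    "LDMFD": "LDMIA", "STMFD": "STMDB",   # full descending
    "LDMFA": "LDMDA", "STMFA": "STMIB",   # full ascending
    "LDMED": "LDMIB", "STMED": "STMDA",   # empty descending
    "LDMEA": "LDMDB", "STMEA": "STMIA",   # empty ascending
}


def stmfd(regs, memory, reg_list, sp=13):
    """Push: behaves like STMDB sp!, {reg_list}."""
    ordered = sorted(set(reg_list))
    address = regs[sp] - 4 * len(ordered)  # decrement before storing
    regs[sp] = address                     # sp ends at the lowest address used
    for reg in ordered:                    # lowest register at lowest address
        memory[address] = regs[reg]
        address += 4


def ldmfd(regs, memory, reg_list, sp=13):
    """Pop: behaves like LDMIA sp!, {reg_list}."""
    address = regs[sp]
    for reg in sorted(set(reg_list)):
        regs[reg] = memory[address]
        address += 4                       # increment after loading
    regs[sp] = address


memory = {}
regs = [0] * 16
regs[0], regs[1], regs[4], regs[13] = 11, 22, 44, 0x2000

stmfd(regs, memory, [0, 1, 4])             # save r0, r1, r4
sp_after_push = regs[13]
regs[0] = regs[1] = regs[4] = 0            # clobber the saved registers
ldmfd(regs, memory, [0, 1, 4])             # restore them
```

After the push the stack pointer has moved down by three words; after the pop the three registers and the stack pointer are back to their original values.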

Because of the limitations of the 3-stage pipeline, higher-performance ARM cores employ a 5-stage pipeline and have separate instruction and data memories. Breaking instruction execution down into five components rather than three reduces the maximum work that must be completed in a clock cycle, and hence allows a higher clock frequency to be used. The separate instruction and data memories allow a significant reduction in the core's CPI (clock cycles per instruction).

Recall - ARM family 7 and 9

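5-stage pipeline ARM organization
The time T required to execute a given program is given by:
T = (Ninst × CPI) / fclk
where Ninst is the number of instructions executed, CPI is the average number of clock cycles per instruction, and fclk is the clock frequency. Since Ninst is constant for a given program (compiled with a given compiler using a given set of optimizations, and so on), there are only two ways to increase performance.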

1. Increase the clock rate, fclk. This requires the logic in each pipeline stage to be simplified and, therefore, the number of pipeline stages to be increased.
2. Reduce the average number of clock cycles per instruction, CPI. This requires either that instructions which occupy more than one pipeline slot in a 3-stage pipeline ARM are re-implemented to occupy fewer slots, or that pipeline stalls caused by dependencies between instructions are reduced, or a combination of both.
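The effect of the two options can be tried out numerically with the relation T = Ninst × CPI / fclk. The sketch below is only an illustration: the function name `execution_time` is hypothetical, and the instruction count, CPI values and clock rates are made-up numbers, not measurements of any ARM core.

```python
def execution_time(n_inst, cpi, f_clk_hz):
    """T = Ninst * CPI / fclk, in seconds."""
    return n_inst * cpi / f_clk_hz


N_INST = 1_000_000  # hypothetical program size, fixed for both cases

# Hypothetical baseline: CPI of 2.0 at 20 MHz
t_baseline = execution_time(N_INST, cpi=2.0, f_clk_hz=20e6)

# Hypothetical improved core: lower CPI and a doubled clock rate
t_improved = execution_time(N_INST, cpi=1.5, f_clk_hz=40e6)

speedup = t_baseline / t_improved
print(f"baseline {t_baseline:.4f} s, improved {t_improved:.4f} s, "
      f"speedup {speedup:.2f}x")
```

Because Ninst cancels out in the ratio, the speedup depends only on the change in CPI and in clock rate.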

Instruction Execution

Store Instruction

Branch Instruction

Write the instructions required, and the pipeline stages they pass through, to perform the following operation: a = b + c

Running this code segment needs some forwarding. However, a load (LW) immediately followed by an ALU instruction (ADD or SUB) that uses the loaded value creates a hazard that forwarding alone cannot resolve, so the pipeline must stall. Observe that in time steps 4, 5, and 6 there are two forwards from the data memory unit to the ALU in the EX stage of the ADD instruction.
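The load-use stall can be counted with a few lines of Python. This is a simplified sketch: it assumes a classic 5-stage pipeline with full forwarding, one stall cycle for every load whose result is used by the very next instruction, and no other hazards. The four-instruction sequence for a = b + c, the register names, and the function names are assumptions made for the example.

```python
def load_use_stalls(program):
    """Count stalls caused by a load followed directly by a user of its result.

    Each instruction is a tuple (opcode, destination, sources).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, destination, _ = previous
        if opcode == "LW" and destination in current[2]:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    """Ideal pipelined time (stages + n - 1) plus load-use stall cycles."""
    return stages + len(program) - 1 + load_use_stalls(program)


# Assumed code for a = b + c
program = [
    ("LW",  "r1", ()),            # r1 <- b
    ("LW",  "r2", ()),            # r2 <- c
    ("ADD", "r3", ("r1", "r2")),  # r3 <- r1 + r2, needs r2 straight after its load
    ("SW",  None, ("r3",)),       # a <- r3
]

stalls = load_use_stalls(program)
cycles = total_cycles(program)
```

For this sequence the sketch finds one stall (the second LW followed by ADD) and a total of 9 cycles instead of the ideal 8. Placing an independent instruction between the second load and the ADD would remove the stall.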

Write a program to add 32-bit numbers.
Find the one's complement of a given number (use the MVN instruction, which acts as a NOT instruction).
Swapping: if the value is 4E (only 8 bits, the remaining bits being 0), the result should be E4.
Find the sum of n numbers.
Find the smallest/largest of 2 numbers.
Find the smallest of n numbers.

Eg 1: Consider a processor in which each instruction passes through 3 stages and each stage takes 1 minute. How long does it take to finish 3 instructions on a non-pipelined processor? What is the average time per instruction on a non-pipelined processor? Answer the same questions for a pipelined processor.

ANS: Non-pipelined processor = 9 mins; average time per instruction = 3 mins
Pipelined processor = 5 mins; average time per instruction = 5/3 ≈ 1.67 mins
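The two totals follow from the usual timing relations for an ideal pipeline without stalls. The symbols n (number of instructions), k (number of stages) and τ (time per stage) are notation introduced here for the worked arithmetic:

```latex
\begin{align*}
T_{\text{non-pipelined}} &= n \cdot k \cdot \tau = 3 \cdot 3 \cdot 1\,\text{min} = 9\,\text{min} \\
T_{\text{pipelined}} &= (k + n - 1) \cdot \tau = (3 + 3 - 1) \cdot 1\,\text{min} = 5\,\text{min}
\end{align*}
```

The first instruction needs k stage times to pass through the pipeline, and each of the remaining n - 1 instructions completes one stage time after its predecessor.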

Eg 2: A 5-stage pipelined processor has Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), MEM and Write Operand (WO) stages. The IF, ID, MEM and WO stages take 1 clock cycle each for any instruction. The EX stage takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles for the MUL instruction and 6 clock cycles for the DIV instruction.

For the instruction sequence below:
What is the number of clock cycles required if it is a non-pipelined processor?
What is the number of clock cycles required if it is a pipelined processor without forwarding?
What is the number of clock cycles required if it is a pipelined processor with forwarding?

Instruction sequence
I1: MUL R2, R0, R1
I2: DIV R5, R3, R4
I3: ADD R2, R5, R2
I4: SUB R5, R2, R6
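One way to check an answer is a small timing model in Python. The result depends on modelling assumptions that the problem statement leaves open, so this is a sketch rather than the definitive solution. Assumed here: strictly in-order issue with one instruction per stage and no buffering between stages; with forwarding, a result can be used by an EX stage starting in the cycle after the producer's EX finishes; without forwarding, a dependent instruction cannot enter EX until the cycle after the producer's WO stage. The function names are hypothetical.

```python
EX = {"ADD": 1, "SUB": 1, "MUL": 3, "DIV": 6}  # EX-stage latency per opcode

PROGRAM = [  # (opcode, destination, sources)
    ("MUL", "R2", ("R0", "R1")),
    ("DIV", "R5", ("R3", "R4")),
    ("ADD", "R2", ("R5", "R2")),
    ("SUB", "R5", ("R2", "R6")),
]


def non_pipelined_cycles(program):
    # IF, ID, MEM and WO take one cycle each; EX depends on the opcode.
    return sum(4 + EX[opcode] for opcode, _, _ in program)


def pipelined_cycles(program, forwarding):
    previous = None  # cycle in which the previous instruction entered each stage
    ready = {}       # register -> first cycle in which a consumer may enter EX
    finish = 0
    for opcode, destination, sources in program:
        duration = [1, 1, EX[opcode], 1, 1]    # IF, ID, EX, MEM, WO
        enter = [0] * 5
        for stage in range(5):
            t = 1 if stage == 0 else enter[stage - 1] + duration[stage - 1]
            if previous is not None:
                # The stage is free once the previous instruction has moved on.
                if stage < 4:
                    t = max(t, previous[stage + 1])
                else:
                    t = max(t, previous[4] + 1)
            if stage == 2:                      # operands are needed to start EX
                for register in sources:
                    t = max(t, ready.get(register, 0))
            enter[stage] = t
        if forwarding:
            ready[destination] = enter[2] + EX[opcode]  # cycle after EX ends
        else:
            ready[destination] = enter[4] + 1           # cycle after WO
        finish = enter[4]
        previous = enter
    return finish


print(non_pipelined_cycles(PROGRAM),
      pipelined_cycles(PROGRAM, forwarding=False),
      pipelined_cycles(PROGRAM, forwarding=True))
```

Under these assumptions the model gives 27 cycles for the non-pipelined processor, 19 cycles for the pipeline without forwarding and 15 cycles with forwarding. A different convention for when the register file can be read after a write would change the figure for the case without forwarding.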
