Published by Norman Small. Modified over 9 years ago.
1
ARM for Wireless Applications
ARM11 Microarchitecture on the ARMv6
Connie Wang
2
Advanced RISC Machines
>75% of the market for 32-bit RISC microprocessors
ARM11 design led by Ian Devereux
3
Demands of Wireless Applications
High performance
Low power
Small size
Cost
4
RISC for Wireless
Strengths:
–Clock rate
–Pipelining
Weaknesses:
–Low code density
–Power consumption
5
ARM11 for Wireless
Strengths enhanced:
–Clock rate
    Optimized interrupt and exception handling
    Minimized context-switch cost
    Instruction set extensions for media
–Pipelining
    Decoupled for high bandwidth
    Out-of-order completion: instructions can retire ahead of earlier, slower instructions
Weaknesses reduced:
–Low code density
    ISA extensions
    Optional application-specific and/or VFP coprocessors
–Power consumption
    Architecture and instructions reduce the clock rate needed for a given workload
    Clock gate control
6
ARM11 Microarchitecture
First implementation of the ARMv6 architecture
8-stage pipeline
64-bit datapaths
Frequency: up to 750 MHz; 350 – 500+ MHz worst case
400 – 1,200 Dhrystone MIPS
Power: 0.4 mW/MHz worst case (0.13 µm, 1.2 V)
Will be released to licensees in Q4 2002
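The power figure on this slide is quoted per MHz, so a rough worst-case power budget follows by multiplying by the clock frequency. The sketch below works that arithmetic through for the frequencies the slide mentions. It assumes power scales linearly with frequency and that the 0.4 mW/MHz figure holds across the whole range; the function name is hypothetical.

```python
def worst_case_power_mw(freq_mhz, mw_per_mhz=0.4):
    """Estimate worst-case power in mW from the slide's mW/MHz figure.

    Assumption: power scales linearly with clock frequency.
    """
    return freq_mhz * mw_per_mhz


# Frequencies quoted on the slide (worst-case range and the stated maximum).
for freq in (350, 500, 750):
    print(f"{freq} MHz -> about {worst_case_power_mw(freq):.0f} mW")
```

Under these assumptions the core stays in the low hundreds of milliwatts even at the top quoted frequency, which is the point the slide makes for battery-powered devices.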
7
ARMv6
Media support: SIMD extensions
Improved interrupt latency
ISA extensions: THUMB, DSP, Jazelle
100% backwards compatible with ARMv5
8
THUMB Instruction Set
32-bit performance for 16-bit systems
32-bit instructions re-coded as 16-bit opcodes
A 32-bit ROM word stores 2 THUMB instructions
Decompressed in the pipeline to their ARM instruction equivalents
Improves code density by 35%
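To make the "2 THUMB instructions per word" point concrete, here is a minimal sketch of packing two 16-bit opcodes into one 32-bit ROM word and counting the words a routine needs. The opcode values are arbitrary placeholders rather than real THUMB encodings, the low-halfword-first ordering is an assumption, and the function names are hypothetical.

```python
def pack_thumb_pair(first, second):
    """Pack two 16-bit opcodes into one 32-bit word (first in the low halfword)."""
    for opcode in (first, second):
        if not 0 <= opcode <= 0xFFFF:
            raise ValueError("opcode must fit in 16 bits")
    return (second << 16) | first


def unpack_thumb_pair(word):
    """Split a 32-bit word back into its two 16-bit opcodes."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


def rom_words_needed(instruction_count, instruction_bits):
    """Number of 32-bit ROM words needed, rounding up to a whole word."""
    return (instruction_count * instruction_bits + 31) // 32


word = pack_thumb_pair(0x1234, 0xABCD)       # placeholder opcodes
print(hex(word), [hex(op) for op in unpack_thumb_pair(word)])
print(rom_words_needed(10, 32), "words as ARM,", rom_words_needed(10, 16), "as THUMB")
```

The halving shown here is an upper bound. The slide's 35% figure is smaller, presumably because a routine often needs more 16-bit instructions than 32-bit ones to do the same work.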
9
DSP Instruction Set
Application accelerator for Digital Signal Processor performance
Can load/store registers in pairs
16x16 or 32x16 MAC in one cycle
Utilized in the MAC pipeline
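A behavioural sketch of a 16x16 multiply-accumulate may help show what the MAC unit computes in one cycle. It is loosely modelled on a signed 16x16 multiply added to a 32-bit accumulator that wraps on overflow and reports the overflow separately. The function names are hypothetical, and the model says nothing about timing.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac16x16(acc, a, b):
    """Return (acc + a*b wrapped to signed 32 bits, overflow flag).

    a and b are treated as signed 16-bit values, acc as signed 32-bit.
    """
    total = to_signed(acc, 32) + to_signed(a, 16) * to_signed(b, 16)
    overflow = not (-(1 << 31) <= total < (1 << 31))
    return to_signed(total, 32), overflow


# Dot product of two short sample vectors, one MAC per element pair.
acc = 0
for sample, coeff in zip([1, 2, 3], [4, 5, 6]):
    acc, _ = mac16x16(acc, sample, coeff)
print(acc)  # 1*4 + 2*5 + 3*6 = 32
```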
10
Jazelle Instruction Set
Support for entering/exiting Java applications
Fetches/decodes Java bytecodes and maintains a Java operand stack
Creates a state that imitates a Java processor
OS controls a low-cost switch between Java and ARM/THUMB states
11
SIMD Instruction Set
Parallel processing of 2x16-bit or 4x8-bit operands
Four new Greater-than-or-Equal status bits (GE[3:0]), set per lane by the parallel add/subtract instructions
Eliminates the need for very high clock frequencies and hardware accelerators
2 – 4x performance improvement for multimedia applications
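The 4x8-bit case can be sketched as a behavioural model of an unsigned parallel byte add, in the style of the ARMv6 UADD8 instruction, together with a byte select driven by the GE bits, in the style of SEL. Each lane wraps independently, and a lane's GE bit is set when its sum reaches 0x100. This is an illustrative model in Python, not ARM code, and the function names are hypothetical.

```python
def uadd8(a, b):
    """Add four unsigned byte lanes in parallel; return (result, ge_bits).

    GE bit i is set when lane i carries out (sum >= 0x100).
    """
    result, ge = 0, 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift
        if lane_sum >= 0x100:
            ge |= 1 << lane
    return result, ge


def sel(a, b, ge):
    """Pick each byte lane from a when its GE bit is set, otherwise from b."""
    result = 0
    for lane in range(4):
        source = a if (ge >> lane) & 1 else b
        result |= source & (0xFF << (8 * lane))
    return result


total, ge = uadd8(0x01FF0280, 0x01010180)
print(hex(total), bin(ge))
print(hex(sel(0xAABBCCDD, 0x11223344, ge)))
```

Processing four bytes per instruction is where the slide's 2 – 4x multimedia speed-up comes from: one 32-bit operation does the work of several scalar ones.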
12
Synchronization and Sharing Data
Load-/Store-Exclusive instructions (LDREX/STREX) support semaphores
–Consolidates the old Swap instruction and the semaphore implementation it required
Virtual Memory System Architecture v6 ID’s separate caches
–Cache hierarchy and ordering rules
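The way LDREX/STREX support semaphores can be sketched with a toy model of an exclusive monitor. A load-exclusive tags an address for the requesting CPU, and a store-exclusive succeeds (returning 0) only if that tag is still in place, otherwise it returns 1 and writes nothing. The class, its single-address monitor per CPU, and the `try_lock` helper are all simplifying assumptions for illustration, not a description of the ARM11 hardware.

```python
class ExclusiveMemory:
    """Toy shared memory with one exclusive-monitor tag per CPU."""

    def __init__(self):
        self.mem = {}
        self.monitors = {}  # cpu id -> address tagged by its last ldrex

    def ldrex(self, cpu, addr):
        """Load a value and tag the address for exclusive access by cpu."""
        self.monitors[cpu] = addr
        return self.mem.get(addr, 0)

    def strex(self, cpu, addr, value):
        """Store only if cpu still holds the tag; return 0 on success, 1 on failure."""
        if self.monitors.get(cpu) != addr:
            return 1
        self.mem[addr] = value
        # A successful store clears every CPU's tag on this address.
        for other in [c for c, a in self.monitors.items() if a == addr]:
            del self.monitors[other]
        return 0


def try_lock(memory, cpu, addr):
    """Attempt to take a lock at addr (0 = free, 1 = taken)."""
    if memory.ldrex(cpu, addr) != 0:
        return False
    return memory.strex(cpu, addr, 1) == 0


shared = ExclusiveMemory()
shared.ldrex(0, 0x100)                 # CPU 0 reads the free lock
shared.ldrex(1, 0x100)                 # CPU 1 reads it too
print(shared.strex(1, 0x100, 1))       # CPU 1 wins the race
print(shared.strex(0, 0x100, 1))       # CPU 0 must retry
```

Unlike a swap, the load and the store are separate instructions, so the bus is not locked between them; the losing CPU simply sees a failed store and loops.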
13
Bit/Byte Order Support
E-bit holds the current endian setting of the core
–Set/cleared with the SETEND instruction
REV* instructions reverse bytes for mixed-endian data support
–REV – reverses the bytes of a word
–REV16 – reverses the bytes within each halfword
–REVSH – reverses the bytes of the low-order halfword and sign-extends the result to 32 bits
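The three byte-reversal instructions are easy to confuse, so a small behavioural model may help. The functions below mimic the results of REV, REV16 and REVSH on 32-bit register values. Following the architectural definition, REVSH here operates on the low-order halfword and sign-extends the swapped result. The Python function names are hypothetical stand-ins for the instructions.

```python
MASK32 = 0xFFFFFFFF


def rev(x):
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")


def rev16(x):
    """Reverse the bytes within each 16-bit halfword independently."""
    x &= MASK32
    return (((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)) & MASK32


def revsh(x):
    """Reverse the bytes of the low halfword, then sign-extend to 32 bits."""
    low = x & 0xFFFF
    swapped = ((low & 0xFF) << 8) | (low >> 8)
    return swapped | 0xFFFF0000 if swapped & 0x8000 else swapped


print(hex(rev(0x11223344)))    # 0x44332211
print(hex(rev16(0x11223344)))  # 0x22114433
print(hex(revsh(0x00001280)))  # 0xffff8012
```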
14
Exception and Interrupt Improvement
Imperative for real-time tasks in which low latency is critical
FI bit in CP15 register 1 designates:
–0: maximum performance mode, or
–1: low interrupt latency mode, to allow interrupts
VE bit enables vectored interrupts to the core
–Direct vs. external -> system -> vector address
A-bit aborts all unaligned accesses
U-bit (with the A-bit clear) allows unaligned accesses in hardware
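As a rough illustration of how these control bits combine, the sketch below decodes a CP15 register 1 value into the settings the slide describes, using FI for the fast-interrupt configuration bit. The bit positions used (A = 1, FI = 21, U = 22, VE = 24) are my reading of the ARMv6 control register layout and should be checked against the architecture manual. The function names and description strings are hypothetical.

```python
# Assumed bit positions in the CP15 c1 control register (verify against the manual).
CONTROL_BITS = {"A": 1, "FI": 21, "U": 22, "VE": 24}


def decode_control(value):
    """Return a dict mapping each bit name to True/False for a register value."""
    return {name: bool((value >> bit) & 1) for name, bit in CONTROL_BITS.items()}


def describe(value):
    """Summarise the interrupt and alignment behaviour selected by value."""
    bits = decode_control(value)
    if bits["A"]:
        alignment = "unaligned accesses abort"
    elif bits["U"]:
        alignment = "unaligned accesses handled in hardware"
    else:
        alignment = "legacy alignment behaviour"
    return {
        "latency": "low interrupt latency" if bits["FI"] else "maximum performance",
        "vectors": "vectored interrupts" if bits["VE"] else "fixed vector address",
        "alignment": alignment,
    }


print(describe((1 << 21) | (1 << 22)))
```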
15
Mode Changing and Stack Improvements
CPSID/CPSIE instructions allow changing modes while disabling/enabling interrupts
Save Return State (SRS) saves the return state (LR and SPSR) of the current mode onto the stack of a target mode
Return From Exception (RFE) loads the PC and CPSR from a saved return state
Reduces exception handling overhead
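The SRS/RFE pairing can be sketched as a toy model. SRS pushes the current mode's return state (link register and saved status register) onto another mode's stack, and RFE later pops that pair back into the program counter and current status register. The class, its field names, and the use of Python lists as stacks are illustrative assumptions. Real hardware works on banked registers and memory addresses.

```python
class ToyCore:
    """Minimal model of the state touched by SRS and RFE."""

    def __init__(self):
        self.pc = 0
        self.cpsr = 0
        self.lr = 0      # return address of the current exception mode
        self.spsr = 0    # saved status of the interrupted code
        self.stacks = {"svc": [], "irq": []}

    def srs(self, target_mode):
        """Save the return state (LR, SPSR) onto target_mode's stack."""
        self.stacks[target_mode].extend([self.lr, self.spsr])

    def rfe(self, mode):
        """Restore CPSR and PC from the return state saved on mode's stack."""
        stack = self.stacks[mode]
        self.cpsr = stack.pop()
        self.pc = stack.pop()


core = ToyCore()
core.lr, core.spsr = 0x8004, 0x10   # state captured on exception entry
core.srs("svc")                     # handler saves it on the SVC stack
core.lr, core.spsr = 0, 0           # registers get reused by nested work
core.rfe("svc")                     # return to the interrupted code
print(hex(core.pc), hex(core.cpsr))
```

Doing the save and the restore in one instruction each is what trims the handler prologue and epilogue the slide refers to.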
16
8-Stage Pipeline
Single-issue
Dynamic branch prediction uses a 64-entry direct-mapped BTB
64-bit data paths: read 2 registers in 1 clock
Loads/stores done in the background
Out-of-order completion: instructions can retire ahead of earlier, slower instructions
ALU processing runs in parallel with data cache access
MAC processed in lock-step with the ALU
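A direct-mapped branch target buffer gives each branch address exactly one slot, so two branches that map to the same slot evict each other. The sketch below models a 64-entry buffer of that kind. Indexing by the word address modulo 64 and storing the full branch address as the tag are assumptions made for illustration, and the class and method names are hypothetical. The real ARM11 indexing scheme is not given on the slide.

```python
class BranchTargetBuffer:
    """Toy 64-entry direct-mapped BTB: one (tag, target) pair per slot."""

    ENTRIES = 64

    def __init__(self):
        self.slots = [None] * self.ENTRIES

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two address bits are dropped.
        return (pc >> 2) % self.ENTRIES

    def predict(self, pc):
        """Return the predicted target for pc, or None on a miss."""
        entry = self.slots[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target):
        """Record a taken branch, evicting whatever shared its slot."""
        self.slots[self._index(pc)] = (pc, target)


btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)
print(btb.predict(0x1000))   # hit: target recorded above
print(btb.predict(0x1100))   # same slot, different tag: miss
btb.update(0x1100, 0x3000)   # evicts the 0x1000 entry
print(btb.predict(0x1000))   # now a miss
```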
17
Prefetch
L1 memory access requires 2 cycles
18
Decode
Decode instruction bits and allocate stack
19
Issue Instruction
Load operands from registers
20
ALU and MAC
ALU pipeline
–Shift bits
–Arithmetic and logical operations
–Save state and registers
3-stage MAC
–Can issue a 16x16 operation per cycle
–Processed with the ALU pipeline
21
Data Cache Access
Map memory address
Data cache load/store requires 2 cycles
22
Writeback
Write results of instructions to the designated memory, cache, or register
23
8-Stage Pipeline
Diagram by Devereux:7
24
Power-saving features
>95% of registers are clock gated
WFI (wait for interrupt) instruction can disable the entire clock network
Reduced clock cycles and use of transistors
25
Conclusions
ARM11 will be implemented as a family of cores
–Designed for maximum performance in wireless multimedia
–A new standard in efficiency and power for embedded applications