An Approach for Implementing Efficient Superscalar CISC Processors


1 An Approach for Implementing Efficient Superscalar CISC Processors
Shiliang Hu, Ilhyun Kim, Mikko Lipasti, James Smith
Presented by Ilhyun Kim
The 12th Int'l Symp. on High Performance Computer Architecture (HPCA 2006, Austin, TX)

2 Processor Design Challenges
CISC challenges:
- Suboptimal internal micro-ops: complex decoders and obsolete features/instructions.
- Instruction count expansion (40% to 50%), which increases instruction management and communication.
- Redundancy and inefficiency in the cracked micro-ops.
- Solution: dynamic optimization.
Other current challenges (CISC and RISC):
- Efficiency: nowadays, less performance gain per transistor.
- Power consumption has become acute.
- Solution: novel, efficient microarchitectures.
Introduction: problems -- The motivation for this paper is that more software has been developed for the x86 CISC instruction set than for any other ISA. CISC processors are harder to design than RISC processors: multi-operation instructions, irregular instruction encoding, implicit register operands, etc. These issues can be addressed by hardware decoders at the pipeline front end; the more challenging issue is that such decoders can hardly generate optimal micro-op code for the pipeline back end.

3 Solution: Architecture Innovations
Software in the architected ISA (e.g. x86): OS, drivers, library code, applications.
Conventional HW design: architected ISA -> hardware pipeline decoders -> pipeline.
Co-designed VM paradigm: architected ISA -> software binary translator (dynamic translation) -> implementation ISA (e.g. the fusible ISA) -> pipeline + code cache.
HW implementation: processors, memory system, I/O devices.
ISA mapping:
- Hardware: simple translation, good for startup performance.
- Software: dynamic optimization, good for hotspots.
Can we combine the advantages of both?
- Startup: fast, simple hardware translation.
- Steady state: intelligent software translation/optimization for hotspots.

4 Microarchitecture: Macro-op Execution
Enhanced out-of-order superscalar microarchitecture: process and execute fused macro-ops as single instructions throughout the entire pipeline.
Overview of macro-op execution: Fetch -> Align/Fuse (fuse bit) -> Decode -> Rename -> Dispatch -> Wakeup -> Select -> RF -> EXE (3-1 ALUs, cache ports) / MEM -> WB -> Retire.
Analogy: car-pooling on all highway lanes reduces congestion at high throughput, AND raises the speed limit from 65 mph to 80 mph.

5 Related Work: x86 processors
AMD K7/K8 microarchitecture: macro-operations; high-performance, efficient pipeline.
Intel Pentium M: micro-op fusion, stack manager; high performance, low power.
Transmeta x86 processors: co-designed x86 VM; VLIW engine + code morphing software.

6 Related Work
Co-designed VMs: IBM DAISY, BOA -- full-system translator on tree regions + a VLIW engine; other research projects, e.g. DBT for ILDP.
Macro-op execution: ILDP, Dynamic Strands, Dataflow Mini-graph, CCG; Fill Unit, SCISM, rePLay, PARROT.
Dynamic binary translation / optimization:
- SW-based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL; Java and .NET HLL VM runtime systems.
- HW-based: trace cache fill units, rePLay, PARROT, etc.

7 Co-designed x86 processor architecture
Co-designed virtual machine paradigm with a dual-mode pipeline:
- Startup (path 1): x86 code goes through a simple vertical hardware x86 decoder that cracks it into micro-ops, for fast translation.
- Steady state (path 2): hotspot x86 code is translated/optimized by VM software into macro-ops held in a code cache in the memory hierarchy, then decoded horizontally.
Both paths share the rename/dispatch pipeline, issue buffer, and execution back end.

8 Fusible Instruction Set
Fusible ISA instruction formats -- RISC-ops with unique features:
- A fusible bit (F) per instruction for fusing.
- Dense encoding: 16/32-bit ISA.
- Special features to support x86: condition codes, addressing modes, awareness of long immediate values.
Core 32-bit instruction formats:
- F | 10b opcode | 21-bit immediate/displacement
- F | 10b opcode | 16-bit immediate/displacement | 5b Rds
- F | 10b opcode | 11b immediate/displacement | 5b Rsrc | 5b Rds
- F | 16-bit opcode | 5b Rsrc | 5b Rds
Add-on 16-bit instruction formats for code density:
- F | 5b opcode | 10b immediate/displacement
- F | 5b opcode | 5b immediate | 5b Rds
- F | 5b opcode | 5b Rsrc | 5b Rds
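To make the field layout concrete, here is a minimal sketch of packing and unpacking the three-operand core 32-bit format (F | 10b opcode | 11b imm/disp | 5b Rsrc | 5b Rds). The field widths come from the slide; the exact bit positions are an assumption for illustration.

```python
# Pack/unpack one fusible-ISA word. Bit positions are assumed:
# [31] F, [30:21] opcode, [20:10] imm/disp, [9:5] Rsrc, [4:0] Rds.

def encode(f, opcode, imm11, rsrc, rds):
    assert f < 2 and opcode < 1 << 10 and imm11 < 1 << 11
    assert rsrc < 32 and rds < 32
    return (f << 31) | (opcode << 21) | (imm11 << 10) | (rsrc << 5) | rds

def decode(word):
    return ((word >> 31) & 0x1,    # fusible bit
            (word >> 21) & 0x3FF,  # opcode
            (word >> 10) & 0x7FF,  # immediate / displacement
            (word >> 5) & 0x1F,    # Rsrc
            word & 0x1F)           # Rds

# round-trip a sample instruction word
word = encode(1, 0x12, 0x7C, 17, 3)
assert decode(word) == (1, 0x12, 0x7C, 17, 3)
```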

9 Macro-op Fusing Algorithm
Objectives: maximize fused dependent pairs; keep the algorithm simple and fast.
Heuristics:
- Pipelined scheduler: only single-cycle ALU ops can be a head.
- Minimize non-fused single-cycle ALU ops.
- Criticality: fuse instructions that are "close" in the original sequence; ALU-op criticality is easier to estimate.
- Simplicity: at most 2 distinct register operands per fused pair.
Solution -- a two-pass fusing algorithm:
- The 1st pass (forward scan) prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head.
- The 2nd pass considers all kinds of RISC-ops as tail candidates.
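The two-pass scan can be sketched roughly as below. The instruction model, field names, and the pairing check are assumptions for illustration; the real DBT additionally rejects pairs that would create dependence cycles (the anti-scan heuristic) and handles register allocation and precise-trap details.

```python
# Hedged sketch of the two-pass macro-op fusing heuristic.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Op:
    dest: Optional[str]
    srcs: List[str]
    single_cycle_alu: bool          # only these may serve as a pair's head
    fused_with: Optional[int] = None

def distinct_srcs(head: Op, tail: Op) -> int:
    # a fused pair may name at most 2 distinct source registers; the
    # head->tail value is forwarded inside the pair, so it is not counted
    return len(set(head.srcs) | {s for s in tail.srcs if s != head.dest})

def fuse(ops: List[Op]) -> None:
    for pass_no in (1, 2):
        for t, tail in enumerate(ops):
            if tail.fused_with is not None:
                continue
            if pass_no == 1 and not tail.single_cycle_alu:
                continue            # pass 1: only single-cycle ALU tails
            # backward scan: nearest unfused single-cycle ALU producer wins
            for h in range(t - 1, -1, -1):
                head = ops[h]
                if (head.fused_with is None and head.single_cycle_alu
                        and head.dest in tail.srcs
                        and distinct_srcs(head, tail) <= 2):
                    head.fused_with, tail.fused_with = t, h
                    break

# the six RISC-ops from the example on the next slide
ops = [
    Op("Reax", ["Redi"], True),           # 1. ADD Reax, Redi, 1
    Op(None, ["Reax", "R22"], False),     # 2. ST Reax, mem[R22]
    Op("Rebx", ["Rebp", "Recx"], False),  # 3. LD.zx Rebx, mem[...]
    Op("Reax", ["Reax"], True),           # 4. AND Reax, Reax, 007f
    Op("R17", ["Reax", "Resi"], True),    # 5. ADD R17, Reax, Resi
    Op("Redx", ["R17"], False),           # 6. LD Redx, mem[R17 + 0x7c]
]
fuse(ops)  # pairs (1,4) and (5,6); the store and loads stay unfused or tail-only
```

Pass 1 fuses the ADD/AND pair; pass 2, which admits memory ops as tails, fuses the address-generating ADD with the dependent load.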

10 Fusing Algorithm: Example
x86 asm:
1. lea eax, DS:[edi + 01]
2. mov [DS:080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and eax, 0000007f
5. mov edx, DS:[eax + esi << 0 + 0x7c]
RISC-ops:
1. ADD Reax, Redi, 1
2. ST Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND Reax, Reax, 007f
5. ADD R17, Reax, Resi
6. LD Redx, mem[R17 + 0x7c]
After fusing -- macro-ops:
1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
2. ST R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD R17, Reax, Resi :: LD Redx, mem[R17 + 0x7c]

11 Instruction Fusing Profile
55+% of RISC-ops are fused, increasing effective ILP by about 1.4x.
Only 6% of single-cycle ALU ops are left un-fused.
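One way to see where the ~1.4x figure comes from: if a fraction f of RISC-ops is fused into pairs, each pair occupies one pipeline slot, so the number of instruction entities shrinks to 1 - f/2 per original op. The 0.56 value below is an assumption consistent with the slide's "55+%".

```python
f = 0.56                   # assumed fraction of RISC-ops fused into pairs
entities = 1 - f / 2       # instruction entities per original RISC-op
ilp_gain = 1 / entities    # same work in fewer slots -> effective ILP gain
print(round(ilp_gain, 2))  # -> 1.39, consistent with the ~1.4x claim
```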

12 Macro-op Pipeline vs. x86 Pipeline
x86 pipeline: Fetch - Align - x86 Decode1 - Decode2 - Decode3 - Rename - Dispatch - Wakeup - Select - Payload - RF - EXE - WB - Retire.
Macro-op pipeline: Fetch - Align/Fuse - Decode - Rename - Dispatch - Wakeup - Select - Payload - RF - EXE - WB - Retire.
Macro-op pipeline for efficient hotspot execution:
- Executes macro-ops: higher IPC and higher clock-speed potential.
- Shorter pipeline front end.
- Reduced instruction traffic throughout; reduced forwarding.
- Pipelined 2-cycle issue logic (pipelined scheduler).

13 Co-designed x86 pipeline front-end
Front-end stages: Fetch (16 bytes) - Align/Fuse - Decode - Rename - Dispatch, shown across three pipeline slots (slot 0 - slot 2).

14 Co-designed x86 pipeline backend
Back-end stages: Wakeup - Select - Payload - RF - EXE - WB/Mem, built around a 2-cycle macro-op scheduler.
Each issue lane (lane 0 - lane 2, issue ports 0 - 2) holds dual-entry issue slots with 2 register read ports; execution ports 0 and 1 feed the ALUs, each backed by a collapsed 3-1 ALU (3-1ALU0, 3-1ALU1, 3-1ALU2).
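Functionally, a collapsed 3-1 ALU takes the head's two operands plus the tail's remaining operand and produces the fused pair's results in a single execution stage, so the head-to-tail value never crosses the forwarding network. The sketch below is illustrative; operator names and the dictionary are assumptions, not the hardware's actual operation set.

```python
import operator

# illustrative subset of single-cycle ALU operations
OPS = {"add": operator.add, "and": operator.and_, "sub": operator.sub}

def collapsed_alu_3to1(head_op, a, b, tail_op, c):
    head_result = OPS[head_op](a, b)             # first ALU level
    tail_result = OPS[tail_op](head_result, c)   # second level, same stage
    return head_result, tail_result

# e.g. the fused ADD :: AND pair from the earlier example, with Redi = 0x80:
# ADD R18, Redi, 1 :: AND Reax, R18, 0x7f
r18, reax = collapsed_alu_3to1("add", 0x80, 1, "and", 0x7f)
```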

15 Experimental Evaluation
x86vm: an experimental framework for exploring the co-designed x86 virtual machine paradigm; the proposed co-designed x86 processor is a specific instantiation of the framework.
Software components: VMM -- DBT, code caches, VM runtime control and resource management system (some source code extracted from BOCHS 2.2).
Hardware components: microarchitecture timing simulators -- baseline OoO superscalar, macro-op execution, etc.
Benchmarks: SPEC2000 integer.

16 Performance Evaluation: SPEC2000
Experimental Evaluation: IPC trend

17 Performance Contributors
Many factors contribute to the IPC improvement: code straightening, macro-op fusing and execution, a shorter pipeline front end (reduced branch penalty), and collapsed 3-1 ALUs (branches and addresses resolve sooner).
Besides the baseline and macro-op models, we model three intermediate configurations:
- M0: baseline + code cache.
- M1: M0 + macro-op fusing.
- M2: M1 + shorter pipeline front end (macro-op mode).
- Macro-op: M2 + collapsed 3-1 ALUs.

18 Performance Contributors: SPEC2000
Experimental Evaluation: IPC contributors.

19 Conclusions
Architecture enhancement: the hardware/software co-designed paradigm enables novel designs and more desirable system features; fusing dependent instruction pairs collapses the dataflow graph to increase ILP.
Complexity effectiveness: pipelined 2-cycle instruction scheduler; significantly reduced ALU value-forwarding network; DBT software reduces hardware complexity.
Power consumption implications: reduced pipeline width; reduced inter-instruction communication and instruction management.

20 Finale – Questions & Answers
Suggestions and comments are welcome. Thank you!

21 Outline
Motivation & Introduction
Processor Microarchitecture Details
Evaluation & Conclusions

22 Performance Simulation Configuration

23 Fuse Macro-ops: An Illustrative Example

24 Translation Framework
Dynamic binary translation framework:
1. Form a hotspot superblock; crack x86 instructions into RISC-style micro-ops.
2. Perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
3. Generate RISC-ops (in IR form) in the implementation ISA.
4. Construct the DDG (data dependence graph) for the superblock.
5. Fusing algorithm: scan for dependent pairs to fuse -- forward scan, backward pairing, two passes to prioritize ALU ops.
6. Assign registers; reorder fused dependent pairs together; extend live ranges for precise traps; use consistent state mapping at superblock exits.
7. Generate code into the code cache.
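The seven steps can be outlined as a function pipeline. The stage names and bodies below are illustrative stubs, not the VMM's actual code; each stage just tags its input so the data flow is visible and testable.

```python
# Illustrative stubs for the seven-step hotspot translation flow.
def crack(x86_insns):               # step 1: superblock -> micro-ops
    return [("uop", i) for i in x86_insns]
def assign_long_immediates(uops):   # step 2: cluster long immediates
    return uops
def to_risc_ir(uops):               # step 3: RISC-ops in IR form
    return [("ir",) + u for u in uops]
def build_ddg(ir):                  # step 4: data dependence graph
    return {n: [] for n in range(len(ir))}
def fuse_pairs(ir, ddg):            # step 5: two-pass fusing scan
    return ir
def allocate_and_reorder(pairs):    # step 6: regs, live ranges, exits
    return pairs
def emit_to_code_cache(code):       # step 7: code generation
    return list(code)

def translate_superblock(x86_insns):
    uops = crack(x86_insns)
    uops = assign_long_immediates(uops)
    ir = to_risc_ir(uops)
    ddg = build_ddg(ir)
    pairs = fuse_pairs(ir, ddg)
    code = allocate_and_reorder(pairs)
    return emit_to_code_cache(code)
```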

25 Other DBT Software Profile
Of all fused macro-ops: 50% are ALU-ALU pairs; 30% are fused condition-test & conditional-branch pairs; the rest are mostly ALU-MEM pairs.
70+% are inter-x86-instruction fusions.
46% access two distinct source registers; only 15% (6% of all instruction entities) write two distinct destination registers.
Translation overhead profile: about … instructions per translated hotspot instruction.
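As a sanity check on the "15% of macro-ops = 6% of all instruction entities" relation: assume, consistent with the fusing-profile slide, that about 56% of RISC-ops are fused, so pairs make up 0.28 of the original ops and instruction entities 0.72. The 0.56 fraction is an assumption; only the slide's percentages are from the source.

```python
fused = 0.56              # assumed fraction of RISC-ops fused (slide: 55+%)
pairs = fused / 2         # macro-op pairs per original RISC-op
entities = 1 - fused / 2  # pipeline entities per original RISC-op
two_dest = 0.15 * pairs   # 15% of macro-ops write two destinations
print(round(two_dest / entities, 2))  # -> 0.06, i.e. ~6% of all entities
```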

26 Dependence Cycle Detection
All cases generalize to case (c) due to the anti-scan fusing heuristic.

27 HST back-end profile
Light-weight optimizations (ProcLongImm, DDG setup, encode): tens of instructions each.
Overhead per x86 instruction -- initial load from disk.
Heavy-weight optimizations (uop translation, fusing, codegen): none dominates.

28 Hotspot Coverage vs. runs

29 Hotspots Detected vs. runs

30 Performance Evaluation: SPEC2000
Experimental Evaluation: IPC trend

31 Performance evaluation (WSB2004)

32 Performance Contributors (WSB2004)

33 Future Directions
Co-designed virtual machine technology:
- Confidence: more realistic benchmark studies -- important for whole-workload behavior such as hotspot behavior and the impact of context switches.
- Enhancement: more synergistic, complexity-effective HW/SW co-design techniques.
- Application: specific enabling techniques for specific novel computer architectures of the future.
Example co-designed x86 processor design:
- Confidence study as above.
- Enhancement: HW microarchitecture -- reduce register write ports; VMM -- more dynamic optimizations in HST, e.g. CSE, a software stack manager, SIMDification.

