Out-of-Order OpenRISC 2 semesters project Semester B: OR1200 ISA Extension Final B Presentation By: Vova Menis-Lurie Sonia Gershkovich Advisor: Mony Orbach.

Presentation transcript:

Out-of-Order OpenRISC 2 semesters project Semester B: OR1200 ISA Extension Final B Presentation By: Vova Menis-Lurie Sonia Gershkovich Advisor: Mony Orbach Spring 2013

Content:
1. Project Overview
   a. Background
   b. Goals
2. The System: OR1200
3. Project Flow
   a. Simulation Environment
   b. Out-of-Order Implementation
   c. Super Scalar Implementation
   d. ISA Extension
4. Conclusions

Project Overview: Background
OpenRISC 1200 (OR1200) is an open-source Verilog implementation of the OR1000 ISA.
In part A, we created a basic working environment on the XUPV5 board and an SoC with the OR1200 CPU.

Project Overview: Project Goal
Initial goal: an out-of-order execution processor based on the OR1200 implementation.
Changed goal: a superscalar processor based on the OR1200 implementation.
Final goal: an ISA extension for the OR1200.

CPU

[Block diagram: OR1200 top containing CPU, QMEM, MMU (IMMU, DMMU), Cache (ICache, DCache), Store Buffer and WBI (Instruction WBIU, Data WBIU), with links labeled 32 leading to the WB bus]

Project Flow
1. Cache initialization function in assembly to enable the cache. (The WB interface protocol requires 3 cycles per transaction, which is not effective for RTL analysis and implementation improvements.)
2. Simulation environment creation (testbench)
3. Out-of-Order implementation (attempt)
4. Super-Scalar implementation (attempt)
5. ISA extension of the current implementation

Simulation Environment
Environment features:
UART interface emulation
Waveform generation
One Makefile for:
   RTL compilation
   Testbench instantiation
   C program compilation
   Running the simulation
   Assembly code file creation
   XILINX RAM initialization file

Simulation Environment
Environment features:
Advanced monitor:
   Monitors all data and control transactions of the SoC
   Monitors states and SPRS values
   Creates log files with the desired information:
      State of the register file after each command
      Execution time analysis

Out-of-Order Implementation (attempt)
Fundamental statements (based on the Tomasulo algorithm):
Execution parallelism should be implemented.
Non-architectural shadow registers should be implemented.
In-order commitment (software executes in order).
[Block diagram: OR1200 top CPU with GenPC, IF, CTRL, RF, Operand MUX, ALU, MAC, LSU, FPU, SPRS, CFGR, Except, Freeze and WB MUX; signals PC and Next PC]
LSU instruction parallelism requires multi-port memory and a wider bus: multi-port Cache, QMEM and MMU.
Branch prediction is not necessary: the delay slot is handled at compiler level.
Multiple ALUs are not an effective solution: ALU instructions execute in one cycle.

Super Scalar Implementation (attempt)
Fundamental statements:
Still in-order commitment: multiple execution should not affect in-order software execution.
Non-parallel Fetch and Decode to avoid instruction dependencies.
The Fetch and Decode units would have to be completely rewritten based on the current implementation.
The exception engine should support 2 pipes, which requires a complete redesign of the exception unit.
Not all dependencies can be seen at the fetch/decode stage; LSU results may be required.
A multi-port SPRS should be implemented.
Parallel LSU instruction execution in 2 pipes requires multi-port memories and a wider bus.

ISA Extension (final goal)
The gcc OR1000 compiler and assembler support empty slots for custom ISA extensions:
7 non-parameterized commands: l.cust1, l.cust2, l.cust3, l.cust4, l.cust6, l.cust7, l.cust8
1 highly parameterized command: l.cust5 Rd, Ra, Rb, L immediate[5:0], K immediate[4:0]
The 11 immediate bits of l.cust5 allow 2048 commands, each operating on 3 registers.
The compiler will not use the ISA extension when generating assembly from C code, but gcc allows assembly commands to be used alongside C code.
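The figure of 2048 follows from the two immediates: 6 bits of L plus 5 bits of K give 2^11 combinations. The arithmetic can be sketched in C as below. The packing helper `cust5_encode` is hypothetical, and the bit layout it uses (opcode 0x3c in bits 31:26, followed by rD, rA, rB, L and K) is an assumption recalled from the OpenRISC 1000 architecture manual rather than something stated on the slide, so it should be checked against the manual before use.

```c
#include <assert.h>
#include <stdint.h>

/* Immediate widths as given on the slide: L is 6 bits, K is 5 bits. */
enum { L_BITS = 6, K_BITS = 5 };

/* Number of distinct (L, K) pairs: 2^(6+5) = 2048. */
static inline uint32_t cust5_variant_count(void)
{
    return 1u << (L_BITS + K_BITS);
}

/* Hypothetical helper: pack one l.cust5 instruction word.
 * ASSUMED layout (not from the slides):
 *   opcode[31:26]=0x3c  rD[25:21]  rA[20:16]  rB[15:11]  L[10:5]  K[4:0] */
static inline uint32_t cust5_encode(unsigned rd, unsigned ra, unsigned rb,
                                    unsigned l, unsigned k)
{
    assert(rd < 32 && ra < 32 && rb < 32);
    assert(l < (1u << L_BITS) && k < (1u << K_BITS));
    return (0x3cu << 26) | (rd << 21) | (ra << 16) | (rb << 11) | (l << 5) | k;
}
```

For example, `cust5_encode(3, 4, 5, 2, 0xe)` would build the word for K=0xe with L=2 acting on r4 and r5 with the result in r3, under the assumed layout.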

l.cust Commands Implementation
4 non-parameterized commands:
l.cust1: set flag (unconditional)
l.cust2: unset flag (unconditional)
l.cust3: set carry (unconditional)
l.cust4: unset carry (unconditional)

l.cust Commands Implementation
l.cust5 parameterized command: the K immediate defines the command, the L immediate defines its options.
K=0x1: replace A[L_byte] with B[0_byte] and put the result in D
K=0x2: set bit A[L] (result in D)
K=0x3: unset bit A[L] (result in D)
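A software reference model can make these three commands easier to test against the RTL. The following is a minimal C sketch, not the project's implementation: the function names are hypothetical, and it assumes byte 0 is the least significant byte and bit 0 the least significant bit, which the slide does not state.

```c
#include <assert.h>
#include <stdint.h>

/* K=0x1: replace byte L of A with byte 0 of B.
 * ASSUMPTION: byte 0 is the least significant byte. */
static inline uint32_t cust5_replace_byte(uint32_t a, uint32_t b, unsigned l)
{
    assert(l < 4);
    unsigned shift = 8u * l;
    return (a & ~(0xFFu << shift)) | ((b & 0xFFu) << shift);
}

/* K=0x2: set bit L of A. ASSUMPTION: bit 0 is the least significant bit. */
static inline uint32_t cust5_set_bit(uint32_t a, unsigned l)
{
    assert(l < 32);
    return a | (1u << l);
}

/* K=0x3: clear (unset) bit L of A. */
static inline uint32_t cust5_unset_bit(uint32_t a, unsigned l)
{
    assert(l < 32);
    return a & ~(1u << l);
}
```

Each function returns the value that would be written to D; A and B are left unchanged, matching the three-register form of the instruction.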

l.cust Commands Implementation
l.cust5 parameterized command: the K immediate defines the command, the L immediate defines its options.
K=0x4: slice A (MSBs) and B (LSBs) and put the result in D >> D = {A[32-L:L], B[L-1:0]}
K=0x5: slice B (MSBs) and A (LSBs) and put the result in D >> D = {B[32-L:L], A[L-1:0]}
K=0x6: rotate A >> D = A[0:31]
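The slice and rotate commands can be modelled in C as follows. This is a sketch under two assumptions: the slice is read as "upper bits from one operand down to bit L, lower L bits from the other" (that is, `{A[31:L], B[L-1:0]}`, since the range written on the slide does not add up to 32 bits for every L), and `D = A[0:31]` is read as a full bit reversal. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* K=0x4 / K=0x5: take bits 31..L from `hi` and bits L-1..0 from `lo`.
 * ASSUMPTION: the intended result is {hi[31:L], lo[L-1:0]}.
 * K=0x4 corresponds to cust5_slice(a, b, l), K=0x5 to cust5_slice(b, a, l). */
static inline uint32_t cust5_slice(uint32_t hi, uint32_t lo, unsigned l)
{
    assert(l < 32);                      /* keeps the shift well defined */
    uint32_t low_mask = (1u << l) - 1u;  /* L low bits set */
    return (hi & ~low_mask) | (lo & low_mask);
}

/* K=0x6: D = A[0:31], modelled as reversing the order of all 32 bits. */
static inline uint32_t cust5_bit_reverse(uint32_t a)
{
    uint32_t d = 0;
    for (int i = 0; i < 32; i++) {
        d = (d << 1) | (a & 1u);  /* lowest bit of a becomes next bit of d */
        a >>= 1;
    }
    return d;
}
```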

l.cust Commands Implementation
l.cust5 parameterized command: the K immediate defines the command, the L immediate defines its options.
K=0x7: rotate A bit-wise within each half-word >> D = {A[16:31], A[0:15]}
K=0x8: rotate A bit-wise within each byte >> D = {A[24:31], A[16:23], A[8:15], A[0:7]}
K=0xa: check if A is even. If true, D=1 and the flag is set; else D=0
K=0xb: check if A is odd. If true, D=1 and the flag is set; else D=0
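Read literally, the concatenations for K=0x7 and K=0x8 reverse the bit order inside each 16-bit or 8-bit group while leaving the groups in place. A C sketch of that reading, together with the parity checks, is shown below. The names and the `cust5_result` struct are hypothetical, and the model only records whether the flag is set in the true case, because the slide does not say what happens to the flag otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order inside each group of `width` bits.
 * K=0x7 is modelled with width 16, K=0x8 with width 8. */
static uint32_t reverse_within_groups(uint32_t a, unsigned width)
{
    assert(width == 8 || width == 16 || width == 32);
    uint32_t d = 0;
    for (unsigned i = 0; i < 32; i++) {
        unsigned offset = i % width;                     /* position in group */
        unsigned mirrored = (i - offset) + (width - 1u - offset);
        if (a & (1u << i))
            d |= 1u << mirrored;
    }
    return d;
}

/* Hypothetical result type: value written to D, and whether the flag is set. */
typedef struct {
    uint32_t d;
    int sets_flag;
} cust5_result;

/* K=0xa: D=1 and flag set when A is even, else D=0. */
static cust5_result cust5_is_even(uint32_t a)
{
    cust5_result r;
    r.sets_flag = (a & 1u) == 0u;
    r.d = (uint32_t)r.sets_flag;
    return r;
}

/* K=0xb: D=1 and flag set when A is odd, else D=0. */
static cust5_result cust5_is_odd(uint32_t a)
{
    cust5_result r;
    r.sets_flag = (a & 1u) != 0u;
    r.d = (uint32_t)r.sets_flag;
    return r;
}
```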

l.cust Commands Implementation
l.cust5 parameterized command: the K immediate defines the command, the L immediate defines its options.
K=0xe:
   L=2: rotate A, swapping the 2 MSB bytes with the 2 LSB bytes >> D = {A[15:0], A[31:16]}
   L=4: rotate A byte-wise >> D = {A[7:0], A[15:8], A[23:16], A[31:24]}
   L=8: rotate A half-byte-wise >> D = {A[3:0], A[7:4], A[11:8], A[15:12], A[19:16], A[23:20], A[27:24], A[31:28]}
K=0xf:
   L=0: mirror LSBs >> D = {A[0:15], A[15:0]}
   L=1: mirror MSBs >> D = {A[31:16], A[16:31]}
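For K=0xe the L value matches the number of equal-width groups whose order is reversed (2 half-words, 4 bytes, 8 half-bytes), and for K=0xf one half-word is kept while its bit-reversed copy fills the other half. The C sketch below follows the concatenations on the slide; the function names are hypothetical and the code is a reference model, not the project's RTL.

```c
#include <assert.h>
#include <stdint.h>

/* K=0xe: reverse the order of `groups` equal-width fields of A.
 * groups=2 swaps half-words, 4 swaps bytes, 8 swaps half-bytes. */
static uint32_t cust5_reverse_group_order(uint32_t a, unsigned groups)
{
    assert(groups == 2 || groups == 4 || groups == 8);
    unsigned width = 32u / groups;
    uint32_t mask = (1u << width) - 1u;   /* width is at most 16 here */
    uint32_t d = 0;
    for (unsigned g = 0; g < groups; g++) {
        uint32_t field = (a >> (g * width)) & mask;
        d |= field << ((groups - 1u - g) * width);
    }
    return d;
}

/* Reverse the bit order of a 16-bit value held in the low half of h. */
static uint32_t reverse16(uint32_t h)
{
    uint32_t d = 0;
    for (int i = 0; i < 16; i++) {
        d = (d << 1) | (h & 1u);
        h >>= 1;
    }
    return d;
}

/* K=0xf, L=0: D = {A[0:15], A[15:0]} */
static uint32_t cust5_mirror_lsbs(uint32_t a)
{
    uint32_t lo = a & 0xFFFFu;
    return (reverse16(lo) << 16) | lo;
}

/* K=0xf, L=1: D = {A[31:16], A[16:31]} */
static uint32_t cust5_mirror_msbs(uint32_t a)
{
    uint32_t hi = a >> 16;
    return (hi << 16) | reverse16(hi);
}
```

With groups=4 the first function behaves like a 32-bit endianness swap, which is one plausible use of the instruction in assembly code.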

ISA Extension – FPGA proven Test C program

ISA Extension – FPGA proven UART output

FPGA Utilization: Old RTL, New RTL

Conclusions
The given implementation is not suitable for any significant micro-architecture improvements.
Out-of-Order / Super-Scalar OR1200 implementations are possible but should be done from scratch.
Software written in assembly can easily be optimized for a specific application using the l.cust instructions (2048 instructions with 5 operands).

Thank you!