1
ARMOR: Asynchronous RISC Microprocessor
Technion - Israel Institute of Technology
Faculty of Electrical Engineering, High-Speed Digital Systems Laboratory
Submitted by: Tziki Oz-Sinay, Ori Lempel
Supervised by: Rony Mitleman
Mid-Semester Presentation
2
Milestones Reached
Development platform selected
–Balsa over Petrify+VHDL
Micro-Architecture Specification (MAS) completed
–functional block partition, datapath interface defined
–asynchronous handshaking protocol defined
Detailed asynchronous pseudo-code implementation written
Balsa code writing, dynamic simulation and synthesis started
3
Development Platform Selection
Two development environments were examined:
Balsa: a language for synthesising large asynchronous circuits and systems. It compiles to a small, parametric set of handshake components.
–Balsa flow
–Balsa initial flow
Petrify: a synthesis tool for Petri nets and asynchronous controllers. It reads a Petri net and generates another Petri net, which is simpler than the original description but behaviorally similar.
4
Development Platform Selection (cont.)
Balsa’s Advantages
–One development environment: easier debugging and integration
–Synthesis produces a delay-insensitive circuit: the implementation is transparent to the developer (no need for timing analysis)
–Control channels are created automatically at compilation
–High-level language: easier to learn
5
Development Platform Selection (cont.)
Petrify’s Advantages
–A more mature environment than Balsa
–When using Petrify, the core of the system is written in VHDL: all of the tools/flows are well known and supported in the lab
–Petrify’s output is translated to Verilog, while Balsa only supports EDIF synthesis: higher-level output, compatible with Altera
6
The Balsa Environment Was Chosen
This introduces new hardware requirements:
–A simplified design, comprising an in-order pipeline and no external memory, will be synthesized on a Xilinx Spartan FPGA
–The complete design will later be implemented on a Xilinx Virtex-II Pro
7
Handshake Protocol
4 Phase Protocol (waveform signals: REQ, ACK, DATA)
Push Channel (signals: REQ, ACK, DATA[n])
Pull Channel (signals: REQ, ACK, DATA[n])
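The four-phase (return-to-zero) handshake on a push channel can be sketched as a sequence of signal events. This is a minimal software illustration, assuming the usual convention that on a push channel the sender drives REQ and DATA and the receiver drives ACK; the names `PushChannel` and `four_phase_push` are hypothetical and do not come from the ARMOR design.

```python
class PushChannel:
    """Wires of one push channel: the sender drives req and data, the receiver drives ack."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None


def four_phase_push(channel, value):
    """Move one value across the channel; return (received value, list of signal events)."""
    events = []
    # Phase 1: the sender places valid data on the bus, then raises REQ.
    channel.data = value
    channel.req = 1
    events.append("REQ+")
    # Phase 2: the receiver latches the data, then raises ACK.
    received = channel.data
    channel.ack = 1
    events.append("ACK+")
    # Phase 3: the sender sees ACK high and lowers REQ (data may now change).
    channel.req = 0
    events.append("REQ-")
    # Phase 4: the receiver sees REQ low and lowers ACK; the channel is idle again.
    channel.ack = 0
    events.append("ACK-")
    return received, events
```

On a pull channel the roles of initiator and data source are swapped: the receiver raises REQ and the sender returns the data together with ACK.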
8
ARMOR Pipestages
Blocks: Instruction Cache, Fetch, Decode, Rename, Data Cache, Write Back, Execute, Retire, Out Of Order Engine
Signals: PC[15:0], Inst[15:0], VInst[15:0], Op[3:0], LDst[3:0], LSrc[3:0], Imm[11:0], Op[3:0], PDst[3:0], SrcVal1[15:0], SrcVal2[15:0], Imm[11:0], DataIn[15:0], PDst[3:0], Addr[15:0], ReadWrite#, ALU0PDst[3:0], ALU0Res[15:0], ALU1PDst[3:0], ALU1Res[15:0], MemPDst[3:0], DataOut[15:0], LDst[3:0], Val[15:0], Op[3:0], PDst[3:0], SrcVal1[15:0], SrcVal2[15:0], Imm[11:0], BranchDecision
9
Instruction Fetch Unit (IFU)
Function:
–Fetch the instruction pointed to by the PC register from the instruction cache.
–Execute the jump instruction.
–Calculate branch addresses, speculatively fetch branch target instructions and stall the pipeline pending the branch decision.
Diagram labels: adder (+), PC+2, branch offset, branch instruction, next instruction, to instruction cache, to ID, branch decision
10
Instruction Decoder (ID)
Function:
–Tag instructions by type (REGREG, REGIMM, MEM, BRANCH).
–Queue up to 4 issue-pending instructions, thus allowing continuous instruction fetching in case instruction issue stalls.
Diagram labels: V, Inst, head, tail
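The 4-entry decode queue can be modeled as a bounded FIFO. The sketch below is illustrative only: the class and method names are hypothetical, and the mapping from opcodes to instruction types is not given on the slide, so the caller supplies the tag.

```python
from collections import deque
from enum import Enum


class InstType(Enum):
    """The four instruction tags named on the slide."""
    REGREG = 0
    REGIMM = 1
    MEM = 2
    BRANCH = 3


class DecodeQueue:
    """Bounded FIFO holding up to DEPTH issue-pending instructions."""
    DEPTH = 4

    def __init__(self):
        self._entries = deque()

    def can_accept(self):
        """True while the fetch unit may keep delivering instructions."""
        return len(self._entries) < self.DEPTH

    def push(self, inst, inst_type):
        """Append a tagged instruction at the tail; return False if the queue is full."""
        if not self.can_accept():
            return False
        self._entries.append((inst, inst_type))
        return True

    def pop(self):
        """Remove and return the oldest entry (the head), or None if the queue is empty."""
        return self._entries.popleft() if self._entries else None
```

While issue is stalled, `push` keeps succeeding until four instructions are waiting, which is what lets fetching continue for a few more instructions.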
11
Register Alias Table (RAT)
Block diagram: ROB, RRF, RAT, RS0, RS1, ALU0, ALU1, Data Cache; Inst from ID; BranchDecision to IFU; paths labeled branches, non-branch inst, mem inst, non-mem inst
12
Function:
–Register renaming: map logical sources/destinations to physical registers (ROB/RRF entries):
  • Allocate physical destination (PDst) pointers during instruction issue
  • Reset pointers during retirement (CAM-match logic)
–Monitor data-readiness of physical sources/destinations:
  • Reset ready-bit during instruction issue
  • Set ready-bit during writeback (CAM-match logic)
Table diagram: rows R0 to R7, columns PDst and Ready
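The renaming and ready-bit bookkeeping can be sketched in software as follows. This is an assumption-laden model, not the ARMOR implementation: it uses 8 logical registers as in the R0 to R7 diagram, represents "value lives in the RRF" as `None`, and models the CAM match as a loop over all entries. All names are hypothetical.

```python
class RegisterAliasTable:
    """Maps each logical register to a physical destination (PDst) and a ready bit."""

    def __init__(self, num_logical=8):
        # None means the latest value of the register is in the Real Register File.
        self.pdst = [None] * num_logical
        self.ready = [True] * num_logical

    def issue(self, ldst, pdst):
        """Allocate a PDst for the logical destination and reset its ready bit."""
        self.pdst[ldst] = pdst
        self.ready[ldst] = False

    def writeback(self, pdst):
        """CAM match: set the ready bit of every entry that points at pdst."""
        for reg, mapped in enumerate(self.pdst):
            if mapped == pdst:
                self.ready[reg] = True

    def retire(self, pdst):
        """CAM match: reset the pointer of every entry that still points at pdst."""
        for reg, mapped in enumerate(self.pdst):
            if mapped == pdst:
                self.pdst[reg] = None
                self.ready[reg] = True
```

The match on `pdst` rather than on the logical register matters when a register is renamed twice before the first instruction retires: the older writeback and retirement must not disturb the newer mapping.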
13
ReOrder Buffer (ROB)
Block diagram (repeated): ROB, RRF, RAT, RS0, RS1, ALU0, ALU1, Data Cache
14
Structure:
Circular buffer of 24 entries (PDsts), each one holding all relevant data for a single instruction:
–Op Code and Op Type
–LDst
–PSrc1: pointer, value and status
–PSrc2: pointer, value and status (if needed)
–Immediate (if needed)
–Writeback Result
–Dispatched, Valid bits
Large register file: 24 entries * 71 bits/entry = 1704 bits
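As a quick check of the storage figure and an illustration of circular allocation, here is a minimal sketch. It assumes that the tail points at the oldest entry and the head at the next free entry, and that an entry's index doubles as its PDst; the class name and the `count` field are hypothetical helpers, and the per-field bit widths are not modeled because the slide does not list them.

```python
ROB_ENTRIES = 24
BITS_PER_ENTRY = 71
ROB_TOTAL_BITS = ROB_ENTRIES * BITS_PER_ENTRY  # 24 * 71 = 1704


class CircularROB:
    """Circular buffer whose entry index serves as the physical destination (PDst)."""

    def __init__(self, size=ROB_ENTRIES):
        self.size = size
        self.valid = [False] * size
        self.head = 0   # next entry to allocate (newest end)
        self.tail = 0   # oldest entry still in flight
        self.count = 0  # number of valid entries

    def allocate(self):
        """Claim the entry at the head for a newly issued instruction; None if full."""
        if self.count == self.size:
            return None
        pdst = self.head
        self.valid[pdst] = True
        self.head = (self.head + 1) % self.size
        self.count += 1
        return pdst

    def retire(self):
        """Free the oldest entry (in-order retirement); None if the buffer is empty."""
        if self.count == 0:
            return None
        pdst = self.tail
        self.valid[pdst] = False
        self.tail = (self.tail + 1) % self.size
        self.count -= 1
        return pdst
```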
15
Function:
–Hold all instructions currently in the execution window (from issue to retirement).
–Determine data-readiness of each instruction by CAM-matching the WB buses against the entry’s PSrc pointers.
–Dispatch data-ready instructions out-of-order to the appropriate RS (to be explained…☺).
–Retire PDsts of executed instructions in-order to the Real Register File (RRF).
16
Dispatch Algorithm:
–3 independent iterators, scanning the ROB from tail to head:
  • BranchRS Iterator: searches for the oldest branch instruction yet to be dispatched.
  • MemRS Iterator: searches for the oldest memory instruction yet to be dispatched.
  • RegOpRS Iterator: searches for the oldest data-ready non-branch/memory instruction yet to be dispatched.
–The iterators’ independence does not cause conflicts, so there is no need for arbitration!
–Problem: unbalanced dispatching can clog one ALU and starve the other, leading to diminished performance.
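The tail-to-head scan shared by the three iterators can be sketched as one helper with a different predicate per iterator. The entry representation (dictionaries with `kind`, `ready` and `dispatched` keys) and all function names are hypothetical, chosen only to make the example self-contained.

```python
def scan_oldest(entries, tail, count, predicate):
    """Walk the circular buffer from the tail (oldest) toward the head.

    Returns the index of the first entry satisfying predicate, or None.
    """
    size = len(entries)
    for offset in range(count):
        index = (tail + offset) % size
        if predicate(entries[index]):
            return index
    return None


def is_pending_branch(entry):
    """BranchRS iterator: a branch that has not been dispatched yet."""
    return entry["kind"] == "BRANCH" and not entry["dispatched"]


def is_pending_mem(entry):
    """MemRS iterator: a memory op that has not been dispatched yet."""
    return entry["kind"] == "MEM" and not entry["dispatched"]


def is_ready_regop(entry):
    """RegOpRS iterator: a data-ready non-branch/memory op not yet dispatched."""
    return entry["kind"] == "REGOP" and entry["ready"] and not entry["dispatched"]
```

Because each predicate selects a disjoint class of instructions, the three scans never pick the same entry, which is why no arbitration between them is needed.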
17
Dispatch Algorithm (cont.):
–Solution: the ROB maintains a load-balance counter, ranging from -4 to 3:
  • incremented upon branch issue and memory dispatch
  • decremented upon memory issue and branch dispatch
–The RegOpRS Iterator dispatches data-ready instructions according to the following rules:
  RS0, RS1 available:
    LoadBalancer < 0: dispatch to RS0; continue
    LoadBalancer > -1: dispatch to RS1; continue
  RS0 available, RS1 busy:
    LoadBalancer < 0: dispatch to RS0; return to Tail
    LoadBalancer > -1: if no branch ops are ready, dispatch to RS0; return to Tail
  RS0 busy, RS1 available: if no memory ops are ready, dispatch to RS1; return to Tail
  RS0, RS1 busy: return to Tail
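The load-balance counter can be sketched as a small bounded counter. Two points here are assumptions rather than facts from the slide: the counter is clamped at its limits (the slide gives only the range -4 to 3, not the overflow behavior), and only the rule for the case where both reservation stations are available is modeled. The class and method names are hypothetical.

```python
class LoadBalancer:
    """Counter tracking the balance between pending branch and memory work."""
    MIN_VALUE = -4
    MAX_VALUE = 3

    def __init__(self):
        self.value = 0

    def _adjust(self, delta):
        # Assumption: saturate at the range limits instead of wrapping around.
        self.value = max(self.MIN_VALUE, min(self.MAX_VALUE, self.value + delta))

    def branch_issued(self):
        self._adjust(+1)

    def memory_dispatched(self):
        self._adjust(+1)

    def memory_issued(self):
        self._adjust(-1)

    def branch_dispatched(self):
        self._adjust(-1)

    def preferred_rs(self):
        """With both stations available: RS0 if the counter is negative, else RS1."""
        return 0 if self.value < 0 else 1
```

A non-negative counter means at least as many branches as memory ops are waiting, so ALU0 is expected to be busy with branches and register ops are steered toward RS1; a negative counter steers them toward RS0.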
18
Reservation Stations (RS)
Block diagram (repeated): ROB, RRF, RAT, RS0, RS1, ALU0, ALU1, Data Cache
19
Reservation Stations (RS)
Function:
–Buffer data-ready instructions for both ALUs, so as to minimize (or even eliminate!) execution idle time
–Sort instructions according to type/priority for each ALU:
  • ALU0: branch ops vs. non-branch/memory ops
  • ALU1: memory ops vs. non-branch/memory ops
Diagram labels: RS0 holds a Branch Op slot and a Non Branch/Mem Op slot; RS1 holds a Mem Op slot and a Non Branch/Mem Op slot; slot fields: V, Op, PDst, Src1, Src2, Imm
20
ALUs
Block diagram (repeated): ROB, RRF, RAT, RS0, RS1, ALU0, ALU1, Data Cache
21
Function:
–Continuously execute instructions from the respective RS and drive their associated PDsts and results on the WB buses.
–Prioritize instructions:
  • branch ops have precedence over other ops on ALU0; the result (branch decision) is driven to the IFU
  • memory ops have precedence over other ops on ALU1; the result (address) is driven to the DCache
22
Data Cache
Block diagram (repeated): ROB, RRF, RAT, RS0, RS1, ALU0, ALU1, Data Cache
23
Data Cache
Function:
–Read/write memory operands in-order (according to the address and RdWr# signal from ALU1) and drive their PDsts (and results, for LW ops) on the WB buses.
–Queue up to 4 pending memory access instructions, thus allowing ALU1 to execute successive LW/SW ops without stalling.
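The in-order behavior of the data cache front end can be sketched as a 4-entry FIFO in front of a memory array. This is a behavioral model under stated assumptions: RdWr# is taken to mean high for a read (LW) and low for a write (SW), unwritten addresses read as 0, and the class, method and parameter names are hypothetical.

```python
from collections import deque


class DataCacheModel:
    """Services queued memory accesses strictly in arrival order."""
    QUEUE_DEPTH = 4

    def __init__(self):
        self.memory = {}        # address -> 16-bit value
        self.pending = deque()  # queued (pdst, addr, is_read, data_in) requests

    def enqueue(self, pdst, addr, is_read, data_in=None):
        """Accept a request from ALU1; return False if all 4 slots are taken."""
        if len(self.pending) >= self.QUEUE_DEPTH:
            return False
        self.pending.append((pdst, addr, is_read, data_in))
        return True

    def step(self):
        """Service the oldest request; return (pdst, result) as driven on the WB bus.

        The result is the loaded value for LW and None for SW.
        Returns None when no request is pending.
        """
        if not self.pending:
            return None
        pdst, addr, is_read, data_in = self.pending.popleft()
        if is_read:
            return (pdst, self.memory.get(addr, 0))
        self.memory[addr] = data_in & 0xFFFF
        return (pdst, None)
```

Servicing requests in arrival order guarantees that a load queued after a store to the same address observes the stored value.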
24
Timeline
ASAP (bureaucracy…)
–Install Balsa 3.3, including netlist technology, on the Lion server
–Increase Linux user quotas
–Install Exceed terminal server in the lab so that we can connect remotely to the Lion server
4/3/04 (final report, first semester):
–Asynchronous simulation of a complete data-path flow through the pipeline:
  mov R0, 1
  add R0, 1
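A tiny reference model can state what the two-instruction test program is expected to produce, for comparison against the asynchronous simulation. It assumes that `mov Rd, imm` loads an immediate, that `add Rd, imm` adds an immediate to Rd, that registers are 16 bits wide, and that there are 8 of them; the function name and tuple encoding are hypothetical.

```python
def run_reference(program, num_regs=8):
    """Execute (op, dst, imm) tuples sequentially and return the register file."""
    regs = [0] * num_regs
    for op, dst, imm in program:
        if op == "mov":
            regs[dst] = imm & 0xFFFF
        elif op == "add":
            regs[dst] = (regs[dst] + imm) & 0xFFFF
        else:
            raise ValueError(f"unsupported op: {op}")
    return regs


# The milestone program: mov R0, 1 followed by add R0, 1.
final_regs = run_reference([("mov", 0, 1), ("add", 0, 1)])
```

Under these assumptions R0 ends up holding 2, which is the value the retired result in the RRF should match.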
25
Balsa Initial Flow
26
Balsa Flow