1
Team Antelope Final Presentation
What doesn’t kill you makes you stronger
"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."
James Zirkle, John Lange, Peter Johnson, Chris
2
Processor Overview
Despite all my rage I'm still just a rat in a cage --Bullet With Butterfly Wings
5 stage pipeline
10 nanosecond clock
128 bit memory
Split Caches
Write back policy
CLZ and Multiply simplified to 1 clock cycle
MicroSequencer used to handle complex operations
3
Who did what
James: Register File, Integration
Jack: Cache, Memory, ALU
Peter: Shifter, hazard detection unit
Chris: Multiplier, CLZ, interrupts
4
ALU
“Quidquid latine dictum sit, altum videtur” (Whatever is said in Latin seems profound.)
Handles all 16 data processing instructions
Determines PSR flag values
4 bit carry look ahead units, combined into 16 blocks
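The carry look-ahead idea can be sketched in software. This is a minimal model, not the team's circuit: the function names `cla4` and `add32` are hypothetical, and chaining eight 4-bit blocks with a rippled block carry is an assumption, since the slide does not say how the blocks are combined.

```python
def cla4(a, b, cin):
    """Add two 4-bit values with carry look-ahead; returns (sum4, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate: both bits set
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate: exactly one bit set
    c = [cin, 0, 0, 0, 0]
    for i in range(4):
        # Written as a loop here; in hardware each carry expands to a flat
        # expression in g, p and cin, so no carry waits on a full adder.
        c[i + 1] = g[i] | (p[i] & c[i])
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]


def add32(a, b, cin=0):
    """32-bit add built from eight 4-bit blocks (assumed arrangement)."""
    result, carry = 0, cin
    for blk in range(8):
        s, carry = cla4((a >> (4 * blk)) & 0xF, (b >> (4 * blk)) & 0xF, carry)
        result |= s << (4 * blk)
    return result, carry
```

The final carry returned by `add32` is the value that would feed the C flag for an addition.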
5
Shifter
32 bit Barrel Shifter
Logical Shift Left/Right, Arithmetic Shift Right, Rotate Right, Rotate Right Extended
Special Cases (LSR #0 encodes LSR #32, etc)
Generates result by combining individual bit shifters
6
Barrel Shifter
Added 32-bit Shifters
Result propagated through bit shifters
7
16 Bit-Right Shifter
8
32-Bit Barrel Shifter
Carry In / Carry Out
-Carry in only used in RRX (rotate right extended) operations
-Carry out always computed, even though not needed in rotate operations
9
Carry Out Logic: Two Options
Separate logic computes Cout early using input and shift amount
Pros:
-Cout signal ready much earlier, no need for propagation
-Simpler bit shifter designs
Cons:
-Many more gates needed
10
Carry Out Logic: Two Options
Individual bit shifters compute and propagate Cout signal
Pros:
-Simpler overall design
-Fewer logic gates
Cons:
-Takes longer for Cout to be ready (propagation delay)
-More complicated bit shifters
11
Carry out: Conclusion
Went ahead and implemented Cout logic in the bit shifters
-Don’t really need the signal to be ready any earlier than the rest of the shifter output, especially not at the additional gate cost
-Each shifter computes Cout for its own shift amount and passes it on, or leaves Cout alone if it is disabled
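The chosen carry-out scheme can be illustrated for a logical shift right. This is a sketch under assumptions: the stage sizes (16, 8, 4, 2, 1) and their order are guesses, and `lsr_staged` is a hypothetical name. Each enabled stage records the last bit it shifts out; a disabled stage passes both value and Cout through untouched.

```python
def lsr_staged(value, amount, carry_in=0):
    """Logical shift right of a 32-bit value by 0..31, one stage per bit of amount."""
    value &= 0xFFFFFFFF
    cout = carry_in
    for stage in (16, 8, 4, 2, 1):
        if amount & stage:
            # Enabled stage: Cout is the last bit this stage shifts out.
            cout = (value >> (stage - 1)) & 1
            value >>= stage
        # Disabled stage: value and Cout are left alone.
    return value, cout
```

Whichever enabled stage runs last leaves Cout equal to bit `amount - 1` of the original input, which is the expected carry for the whole shift. A shift by zero returns the incoming carry unchanged.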
12
Complete Shifter
13
Multiplier (MUL/MLA)
32 additions in parallel
Logarithmic time result
2^5 = 32, so time equals 5 adds
Multiply w/accumulate inserted at the end with a multiplexer
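The "5 adds" figure follows from summing 32 partial products in a balanced tree. The sketch below models that structure; `mul_tree` and its parameters are hypothetical names, and the exact adder arrangement in the real design is not given on the slide.

```python
def mul_tree(a, b, acc=0, accumulate=False):
    """Low 32 bits of a*b (plus acc for MLA), with the adder-tree depth."""
    mask = 0xFFFFFFFF
    # One partial product per multiplier bit: a shifted left by i, or zero.
    terms = [((a << i) & mask) if (b >> i) & 1 else 0 for i in range(32)]
    levels = 0
    while len(terms) > 1:
        # Additions within one level are independent, so hardware can run
        # them in parallel; only the number of levels costs time.
        terms = [(terms[i] + terms[i + 1]) & mask for i in range(0, len(terms), 2)]
        levels += 1
    # A multiplexer at the end selects the accumulate operand (MLA) or zero (MUL).
    result = (terms[0] + (acc if accumulate else 0)) & mask
    return result, levels
```

For example, `mul_tree(6, 7)` returns `(42, 5)`: the product, reached after five levels of addition.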
15
Count leading zeros (CLZ)
Output equals number of leading zeros on the input (Ex: )
First step:
Then, add one:
Lastly, convert to binary.
With a 32-bit input, output will have a 6-digit maximum.
Timing: Only four gate delays.
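The intermediate steps of the slide's method did not survive in this transcript, so the sketch below is not that method. It is a common binary-search formulation of count-leading-zeros, shown only to pin down the expected input/output behaviour; `clz32` is a hypothetical name.

```python
def clz32(x):
    """Number of leading zero bits in a 32-bit value (32 when x is zero)."""
    x &= 0xFFFFFFFF
    if x == 0:
        return 32
    n = 0
    for width in (16, 8, 4, 2, 1):
        if x >> (32 - width) == 0:        # top `width` bits are all zero
            n += width
            x = (x << width) & 0xFFFFFFFF  # discard them and keep looking
    return n
```

The result ranges from 0 to 32, and 32 needs six binary digits, which matches the 6-digit maximum noted above.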
17
Register File
37 Total Registers
Different modes select between different registers.
Registers r0-r7 and the PC (r15) are common to all modes
PSR Mode bits select between different register banks
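The bank selection can be sketched as a lookup from (mode, register number) to a physical register. This assumes the standard ARM banking scheme (r8-r14 banked for FIQ, r13-r14 banked for IRQ, Supervisor, Abort and Undefined), which the slide does not spell out; `physical_reg` and the bank labels are hypothetical.

```python
# CPSR mode-bit encodings from the ARM architecture.
USR, FIQ, IRQ, SVC, ABT, UND, SYS = (
    0b10000, 0b10001, 0b10010, 0b10011, 0b10111, 0b11011, 0b11111)

BANK_NAMES = {IRQ: "irq", SVC: "svc", ABT: "abt", UND: "und"}


def physical_reg(mode, r):
    """Map architectural register r (0..15) in a mode to a (bank, number) label."""
    if r <= 7 or r == 15:
        return ("common", r)              # shared by every mode
    if mode == FIQ:
        return ("fiq", r)                 # FIQ has its own r8-r14
    if r in (13, 14) and mode in BANK_NAMES:
        return (BANK_NAMES[mode], r)      # private SP and LR
    return ("usr", r)                     # User and System share these
```

Enumerating every mode and register yields 31 distinct general-purpose registers; adding the CPSR and five SPSRs gives the 37 total.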
18
Register File, Continued
3 normal (r0-r15) register outputs.
1 input that can access r0-r15
An input and an output dedicated to the PC
An input and an output dedicated to the SPSR
19
Pipeline Design and Component Integration
“The manual for a Ferrari 250 states that replacing the timing chain is a five-step process. Step one is the simple (?) instruction: ‘Invert motor on bench.’”
20
Pipeline Selection
Selected a 5 stage pipeline design
Fetch: Instruction is retrieved from memory
Decode: Instruction is processed, control signals sent
Execute: ALU, Shift, Multiply and CLZ operations
Memory: Data cache/memory access
Writeback: Results are written back to the register file
Fetch->Decode->Execute->Memory->Writeback
21
Advantages
Breaks datapath into logical operational blocks.
Slower stages can be broken up to increase the clock speed.
Results in higher throughput
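A back-of-the-envelope model makes the throughput claim concrete. It assumes an ideal pipeline (one instruction completes per cycle once the pipe is full) and compares it with a hypothetical unpipelined datapath whose cycle is as long as all five stages combined; both function names are illustrative.

```python
def pipelined_time_ns(n_instr, stages=5, clock_ns=10, stall_cycles=0):
    """Fill the pipeline once, then retire one instruction per cycle."""
    return (stages + n_instr - 1 + stall_cycles) * clock_ns


def unpipelined_time_ns(n_instr, stages=5, clock_ns=10):
    """Hypothetical single-stage datapath: every instruction takes all stage delays."""
    return n_instr * stages * clock_ns
```

With the 10 nanosecond clock, 100 instructions take 1040 ns pipelined against 5000 ns in the unpipelined model; stalls eat into that gap.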
22
Disadvantages
More time consuming to implement.
Data hazards appear, so must implement forwarding and stalls in certain circumstances. This further complicates the design.
23
Fetch Decode Execute Memory Writeback
24
Pipelined Datapath Construction
“Purpose--to drive you to insanity”
Implemented simple single stage datapath first.
Used D flip-flops to break up the datapath into the 5 different stages.
Added memory and cache.
Stall the pipeline by holding the clock.
25
Fetch Stage
Consists of the Instruction Cache
Runs almost every cycle.
Stalled independently of the rest of the stages while the Sequencer is running.
26
Execute Stage
Contains:
Shifter
ALU
Multiplier
CLZ unit
Conditional Execution unit
PSR and PSR control
27
Memory and Writeback Stages
Memory: contains the interface to the Data Cache
Writeback: writes results back to the registers
28
Decode Stage, Continued
Stage contains:
Register File
Sequencer
Branching logic
32 bit shift extender
32 bit full adder
PC is output from the register file straight into the Instruction Cache address
29
Decode Stage
Modular design, each instruction type has one module that is connected to a mux
PLA takes instruction and outputs a 4 bit select signal that selects between all modules.
Control is contained in a 32 bit bus that is piped through the entire processor.
30
Current Processor Implementation
31
Hazards
Read after Write:
1. FETCH DEC EXEC DATA WB
2. FETCH DEC STALL
32
Hazards
Branch:
[Pipeline timing diagram: instructions 1 to 4 followed by the branch target, each passing through FETCH DEC EXEC DATA WB]
33
Hazard Checking Logic Checks to see if Rd (destination register) is read from in next 2 commands
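The check described above can be sketched over a list of instructions. The `Instr` record and `raw_hazards` function are hypothetical; the real logic compares register fields held in pipeline latches, not a program listing.

```python
from collections import namedtuple

# dest: destination register number or None; sources: registers that are read.
Instr = namedtuple("Instr", ["dest", "sources"])


def raw_hazards(program, window=2):
    """Return (producer, consumer, register) for reads within `window` instructions."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].sources:
                hazards.append((i, j, producer.dest))
    return hazards
```

Each reported triple marks a spot where the result must be forwarded or the pipeline stalled; a read three or more instructions later is not flagged in this model.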
34
Data Forwarding FETCH DECODE EXEC Result DATA BUFFER Data WRITEBACK
35
Overview
36
CONDITION EVALUATE CPSR Flags
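Condition evaluation combines the 4-bit condition field of an instruction with the N, Z, C and V flags from the CPSR. The table below follows the standard ARM condition codes; `condition_passed` is a hypothetical name, and rejecting code 1111 is a choice made for this sketch.

```python
def condition_passed(cond, n, z, c, v):
    """True if an instruction with condition field `cond` should execute."""
    if not 0 <= cond <= 14:
        raise ValueError("condition code 1111 is not modelled in this sketch")
    checks = [
        z,                     # 0000 EQ
        not z,                 # 0001 NE
        c,                     # 0010 CS
        not c,                 # 0011 CC
        n,                     # 0100 MI
        not n,                 # 0101 PL
        v,                     # 0110 VS
        not v,                 # 0111 VC
        c and not z,           # 1000 HI
        (not c) or z,          # 1001 LS
        n == v,                # 1010 GE
        n != v,                # 1011 LT
        (not z) and n == v,    # 1100 GT
        z or n != v,           # 1101 LE
        True,                  # 1110 AL
    ]
    return bool(checks[cond])
```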
37
Interrupt Handler
Component must handle the following seven cases:
Reset (Highest Priority)
Data Abort
FIQ
IRQ
Prefetch Abort
Undefined Instruction
Software Interrupt (SWI) (Lowest Priority)
38
Implementation
One ROM file handles memory addresses.
3-bit input leads to 32-bit address for PC.
Second ROM file handles CPSR alterations.
4-bit input leads to lower 8 bits of CPSR.
Priorities of the interrupts are handled with CLZ functionality.
Lastly, no interrupts leads to “Active = 0”.
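Using count-leading-zeros as a priority encoder can be sketched as follows. The pending requests are packed into a 7-bit word with the highest priority in the most significant bit, and the leading-zero count becomes the 3-bit index into the address ROM. The priority order is the one listed on the earlier slide and the addresses are the standard ARM exception vectors; the packing and the names `SOURCES` and `select_interrupt` are assumptions.

```python
# (name, vector address), highest priority first.
SOURCES = [
    ("reset", 0x00), ("data_abort", 0x10), ("fiq", 0x1C), ("irq", 0x18),
    ("prefetch_abort", 0x0C), ("undefined", 0x04), ("swi", 0x08),
]


def select_interrupt(pending):
    """pending: set of source names. Returns (active, name, vector)."""
    word = 0
    for i, (name, _) in enumerate(SOURCES):
        if name in pending:
            word |= 1 << (6 - i)          # highest priority lands in bit 6
    if word == 0:
        return (0, None, None)            # no interrupts: Active = 0
    index = 7 - word.bit_length()         # leading zeros of the 7-bit word
    name, vector = SOURCES[index]         # plays the role of the address ROM
    return (1, name, vector)
```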
40
Memory
"Memory is like an orgasm. It's a lot better if you don't have to fake it." -- Seymour Cray
128 bit wide Main Memory
32 bit Split cache system
Data and Instruction
Data Cache operates with Write Back Policy
2 State Machines in charge of Memory Control
41
Main Memory Control
It wasn't very sporting, but what the hell. - Chuck Yeager on shooting down a landing Me-262
Simulates memory latency with a delay component
Implemented with a state machine
Enters a wait state while holding for memory to finish
Operation order: Data first, Instruction second
Signals when data is valid, and when operation is finished
42
Memory State Machine
43
Caches
“I'm just here for moral support. Ignore the gun.”
128 bit lines separated into 32 bit blocks
Hits determined by using high address bits, as well as a valid bit
Write strategy uses Dirty bit to signal when to write to memory
On reset valid and dirty bits are cleared
Can operate in 128, 32, and 8 bit modes
Necessary for memory and processor interface
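The hit test and write-back behaviour can be modelled in a few lines. This sketch assumes a direct-mapped organisation and a caller-chosen number of lines, neither of which is stated on the slide; the class and method names are hypothetical, and only 32-bit accesses are modelled.

```python
class DirectMappedCache:
    """Write-back cache with 128-bit lines of four 32-bit words (assumed direct-mapped)."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory              # dict: line number -> list of 4 words
        self.reset()

    def reset(self):
        # On reset the valid and dirty bits are cleared.
        self.valid = [False] * self.num_lines
        self.dirty = [False] * self.num_lines
        self.tag = [0] * self.num_lines
        self.data = [[0, 0, 0, 0] for _ in range(self.num_lines)]

    def access(self, addr, write_value=None):
        """Read (or write, if write_value is given) one word; returns (hit, word)."""
        word = (addr >> 2) & 0x3
        index = (addr >> 4) % self.num_lines
        tag = (addr >> 4) // self.num_lines   # the high address bits
        hit = self.valid[index] and self.tag[index] == tag
        if not hit:
            if self.valid[index] and self.dirty[index]:
                # Dirty line is written to memory only when it is evicted.
                old_line = self.tag[index] * self.num_lines + index
                self.memory[old_line] = list(self.data[index])
            line = tag * self.num_lines + index
            self.data[index] = list(self.memory.get(line, [0, 0, 0, 0]))
            self.tag[index], self.valid[index], self.dirty[index] = tag, True, False
        if write_value is not None:
            self.data[index][word] = write_value & 0xFFFFFFFF
            self.dirty[index] = True
        return hit, self.data[index][word]
```

A write leaves main memory untouched until a conflicting address forces the dirty line out.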
44
Cache Reset
"A day without killing... is like a day without sunshine" -John Wayne
Cache reset controlled by two signals RESET and MEM_CLEAR
When MEM_CLEAR is pulsed a sequencer is engaged
Adder attached to a flip-flop
Cycles through addresses, setting values to 0
Asserts pipeline hold signal while running
RESET clears all the state machines back to initial state
45
Memory System Control
"He spoke, I had no clue, it was a mutual relationship."
Implemented with a state machine
Interfaces I-Cache, D-Cache, Main Memory, and Pipeline
During operation, pipeline hold signal is asserted
Autonomous operation, requires no special datapath control
Took so much time that it made my girlfriend jealous
46
Memory Control Overview
47
Memory Control FSMs
48
Memory Control FSMs
49
Memory Control FSMs
50
Interrupts “The nice thing about standards is that there are so many of them to choose from.”
51
Sequencer
Built to handle complex operations
Interrupts, block load/store
Is basically a clocked ROM file.
Has a start address and a start signal
Runs through a sequence of instructions in the ROM file until sequence signals it is done.
One instruction per cycle is injected into instruction stream, Fetch stage is stalled.
52
Instructions: Data Processing, Multiply and CLZ
These instructions move linearly through the pipeline, and don’t require stalls as they are all single cycle in our implementation.
Present some data hazard problems, but hazard detection and forwarding logic maintains linear execution.
53
Branch
On decode, branch immediately adds the PC to the shifted offset and updates the PC.
No stall necessary, since PC is updated before the next instruction is fetched.
Branch w/link has r14 updated when branch finishes moving through the entire pipeline.
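The target calculation can be written out directly. This sketch follows the standard ARM branch encoding (a signed 24-bit offset, shifted left by two) and assumes `pc` is the value the PC register reads at that point, which on ARM is the branch's own address plus 8; `branch_target` is a hypothetical name.

```python
def branch_target(pc, instr):
    """Compute the new PC for a B/BL instruction word."""
    imm24 = instr & 0x00FFFFFF
    if imm24 & 0x00800000:
        imm24 -= 1 << 24                  # sign-extend the 24-bit offset
    return (pc + (imm24 << 2)) & 0xFFFFFFFF   # shift, then 32-bit add
```

For instance, the word `0xEAFFFFFE` at address `0x1000` reads the PC as `0x1008` and branches back to `0x1000`, i.e. to itself.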
54
LDR, STR
Used asynchronous logic to make LDR and STR single cycle.
During the first part of the clock cycle, the updated base register is written, the writeback register is changed, and the value is loaded from memory into that register.
Simplifies load and store logic greatly.
55
Multicycle Instructions
Multiple Register Transfer
Swap
Implemented with our sequencer:
Each of these instructions translates into a sequence of single cycle instructions.
These instructions are modified to correspond with the specific multicycle instruction.
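As an illustration of that translation, a load-multiple can be expanded into one single-register load per set bit of its register list. The micro-operation tuples and the name `expand_ldm` are hypothetical, only the increment-after addressing mode is modelled, and the real sequencer reads its templates from ROM.

```python
def expand_ldm(reg_list, base_addr):
    """Expand a load-multiple (increment after) into single-register loads."""
    ops = []
    addr = base_addr
    for r in range(16):
        if (reg_list >> r) & 1:
            # The lowest-numbered register is loaded from the lowest address.
            ops.append(("LDR", r, addr))
            addr += 4
    return ops
```

Each tuple stands for one instruction injected into the stream per cycle while Fetch is stalled.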
56
Where are we now?
"Time commitment--eternity." --CTEC
All 5 stages and Memory/Cache integrated.
Data Processing, Multiply, CLZ, Shifting, Load, Store, Branch, MRS, MSR
Not yet fully functional:
Load/Store Multiple
Swap
Conditional execution (in regards to branch)
Interrupts