Binu Mathew and Al Davis, School of Computing, University of Utah. Presented by Deepika Ranade and Sharanna Chowdhury.


"Super" DSP
- Clustered execution units: VLSI advances make wide issue possible, but ILP is limited
- Goal: improve throughput by delivering data to the execution units on time
- Constraints: limited memory and power
- Approach: VLIW with a specialized on-chip memory system
  - scratch-pad memory, cache banks
  - bit-reversal and auto-increment addressing modes
  - software control

Authors' solution: sneak peek
- Multiple SRAMs
- Distributed address generators
- Loop acceleration unit
- Array variable renaming

VLIW processor architecture (block diagram): SRAM banks 0..n, each served by two address generators; an interconnect feeding eight function units; a loop unit; and microcode memory/I-cache plus a decode stage driven by the microcode PC.

Issues with load/store ports
- A limited number of load/store ports limits performance and data availability
- A large number of SRAM ports is needed to feed the function units efficiently
- But many ports degrade access time and increase power consumption
- Solution: bank the storage into multiple small, software-managed scratch SRAMs and power down the unused ones

Operation
1. A VLIW instruction bundle is fetched
2. Load/store operations are decoded
3. They are issued to the address generators
4. The compiler configures the loop unit (a VLIW execution unit) before entering loop-intensive code
5. The loop unit then works autonomously
6. Each time the PC passes the first instruction of the loop body, the loop count increments; its values are used by the address generators

Instructions needed
- write_context: transfers loop parameters and data access patterns into context registers in the loop unit and address generators
- load.context and store.context: enhanced versions of load/store whose immediate constant field carries two extra fields, context index and modulo period; the context index controls address calculation
- push_loop: used by the compiler to inform the hardware that a nested inner loop is being entered

Modulo scheduling
- The loop unit offers hardware support for modulo scheduling, a software-pipelining technique for high loop performance
- A new instance of the loop body starts every II (initiation interval) cycles
- A non-modulo-scheduled loop is the special case II = N, where N is the length of the loop body in cycles; modulo scheduling achieves II < N

Modulo scheduling (cont.)
- The original loop body is converted to a modulo-scheduled one by replicating instructions
- Conceptually, a new copy of the loop body is pasted over the original at intervals of II cycles, wrapping around all instructions that appear after cycle N
- An instruction scheduled for cycle n appears in the kernel at cycle n mod II; n div II is its modulo period
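The replication above can be sketched in a few lines of Python. This is an illustrative model, not the paper's implementation: `flat_schedule` and `build_kernel` are invented names, and the schedule is a toy example.

```python
def build_kernel(flat_schedule, ii):
    """Fold a flat N-cycle schedule into a modulo kernel of II cycles.

    flat_schedule: dict mapping cycle -> list of instruction names.
    An instruction scheduled at cycle n lands in kernel slot n % II,
    tagged with its modulo period n // II (how many iterations "old"
    the instance it belongs to is).
    """
    kernel = [[] for _ in range(ii)]
    for cycle, insts in flat_schedule.items():
        stage = cycle // ii  # modulo period of this operation
        for inst in insts:
            kernel[cycle % ii].append((inst, stage))
    return kernel

# A toy 6-cycle loop body folded with II = 2: slot 0 collects ops from
# cycles 0, 2, 4 and slot 1 collects ops from cycles 1, 3, 5.
flat = {0: ["load"], 1: ["mul"], 3: ["add"], 5: ["store"]}
kernel = build_kernel(flat, ii=2)
```

The kernel then repeats every II cycles, with each operation's modulo period recording which overlapped iteration it serves.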

Modulo scheduling: context registers
- Static parameters written by the compiler: II and the loop count limits
- Loop counter register file: holds the dynamic values of the loop variables
- Updates are triggered as the PC passes through the loop body

Loop unit datapath (block diagram): loop context registers; a loop stack whose top entry selects the current loop; an II count register with comparator; a four-entry loop count register file; an incrementer and 2x1 muxes. Each context holds loop_type, II, start_count, end_count, and increment; control signals include write enable, push_loop, pop_loop, and clear.

Loop unit, before a loop-intensive section of code: the compiler writes the loop parameters (loop_type, II, start_count, end_count, increment) into the loop context registers with write_context.

Loop unit, on push_loop: the index of the new loop's context register is pushed onto the loop stack; the top of stack becomes the current loop body, and its counter is reset to start_count.

Loop unit, when the end count of the loop is reached: the innermost loop has completed, so pop_loop pops the top entry off the stack.
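The push/step/pop behavior of the loop unit described in the last few slides can be captured in a minimal behavioral sketch. This is an illustrative Python model, not RTL from the paper; the class and method names are assumptions.

```python
class LoopUnit:
    """Behavioral sketch of the loop unit (illustrative, not the paper's RTL)."""

    def __init__(self):
        self.contexts = {}  # context index -> (start_count, end_count, increment)
        self.stack = []     # stack of [context_index, current_count]; top = current loop

    def write_context(self, idx, start, end, inc):
        # compiler configures loop parameters before the loop-intensive region
        self.contexts[idx] = (start, end, inc)

    def push_loop(self, idx):
        # entering a nested inner loop: its counter is reset to start_count
        start, _, _ = self.contexts[idx]
        self.stack.append([idx, start])

    def step(self):
        # PC passed the first instruction of the loop body: increment the
        # counter; when end_count is reached, pop the innermost loop
        idx, count = self.stack[-1]
        _, end, inc = self.contexts[idx]
        count += inc
        if count >= end:
            self.stack.pop()  # innermost loop completed
            return None
        self.stack[-1][1] = count
        return count

unit = LoopUnit()
unit.write_context(0, start=0, end=3, inc=1)
unit.push_loop(0)
counts = [unit.step(), unit.step(), unit.step()]
# counter runs 1, 2, then the loop pops when it reaches end_count
```

The address generators would read the current counts off the stack each cycle; nesting deeper loops is just further push_loop calls.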

Stream address generator
- Inputs: the loop unit counters and an address context register file (base address, row size)
- The compiler generates word-oriented addresses for C's row-major layout
- For an array of Complex structs: elem_size = sizeof(Complex); row_size = elem_size * N
- Base_imag = Base_A + the offset of imag within the struct
- A load into t1 of A[i][j].imag computes Base_imag + i * row_size + j * elem_size
- A vector (1-D array) is the case row_size = 0
- Muxes select the correct loop variables
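The address formula on this slide can be checked with a short Python sketch. The concrete sizes, field offset, and base address below are assumptions chosen for illustration, not values from the paper.

```python
# Illustrative parameters for a row-major array of Complex structs:
ELEM_SIZE = 8                    # sizeof(Complex): two 4-byte fields, real and imag
N = 16                           # elements per row
ROW_SIZE = ELEM_SIZE * N         # row size in bytes
BASE_A = 0x1000                  # base address of the array (assumed)
OFFSET_IMAG = 4                  # offset of the imag field within the struct
BASE_IMAG = BASE_A + OFFSET_IMAG

def addr_imag(i, j):
    """Address of A[i][j].imag, as the stream address generator computes it."""
    return BASE_IMAG + i * ROW_SIZE + j * ELEM_SIZE

# For a 1-D vector, row_size is simply 0 and only the j term contributes.
```

The loop unit supplies i and j; everything else comes from the address context register selected by the load's context index.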

Stream address generator (2)
- Address calculation: an access pattern P*i + Q is handled by setting Q as the base address and P as the row size
- A column walk A[j][i] simply swaps the roles of the loop variables
- Data packing: if row size and element size are powers of two, address = Base + ((i << x) | (j << y))
- Sizes that are sums of powers of two are handled with multiple access expressions
- Higher-dimensional arrays: the first N-2 base terms are recalculated in software; the last two dimensions are handled by the hardware
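The shift-and-OR form is equivalent to the multiply-and-add form as long as the column index stays in range, so the OR'd bit fields cannot overlap. A small sketch verifies this; the shift amounts and base are illustrative assumptions.

```python
X, Y = 7, 3      # row_size = 2**7 = 128 bytes, elem_size = 2**3 = 8 bytes (assumed)
BASE = 0x2000    # array base address (assumed)

def addr_shift(i, j):
    # hardware-friendly form: two shifts and an OR instead of multiplies
    return BASE + ((i << X) | (j << Y))

def addr_mul(i, j):
    # reference form: explicit row/element multiplies
    return BASE + i * (1 << X) + j * (1 << Y)

# The OR is safe because j < row_size // elem_size = 16 keeps j << Y
# entirely below bit X, so the two fields never collide.
```

This is why power-of-two row and element sizes are the cheap case: the address generator needs only shifters and an adder.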

Stream address generator (3): partner ALU
- Indirect access streams such as A[B[i]] cannot be computed by the address generator itself
- The address is computed instead by a partner ALU; many address generators share one ALU in a cluster
- write_context reconfigures an address generator by loading a new address context, adding one cycle of latency to the load/store; it is issued by the compiler, so reconfiguration overhead is low
- Each array access pattern maps to an address context register holding the row and element sizes, the base address, and the loop index selection
- With 4 context entries per address generator, 24 parallel access patterns are supported

Address generation steps
- The compiler generates the array access code, placing an address generator index and an address context register index in the immediate field of the load/store
- The array parameters are fetched from the selected context entry
- Muxes select the loop variables, which are then shifted
- The shifted terms are combined and added to the base address; a pipeline register in the middle gives a better cycle time

Address generator datapath (block diagram): an address context register file (fields i_sel, j_sel, const_sel, alu_sel, x, y, base address) selected by the context register index in the opcode; a 4x1 mux over the loop counters, an opcode constant, and the ALU address; a subtractor for the modulo period (from the opcode); shifters by x and y; a pipeline register with clock enable; and a final adder with the base address producing the address.

Special cases
- Loop unrolling: the address is formed from one loop variable plus an unroll offset carried in the immediate field of the instruction, which feeds the upper 2x1 mux
- ALU-computed addresses are selected by the lower 2x1 mux
- Accesses with 0 or 1 loop variables use a loop counter hardwired to 0
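The unrolling case can be illustrated with a sketch: each unrolled copy of an access shares the single loop variable and adds its own constant offset from the immediate field. The element size, base, and function name are assumptions for illustration.

```python
ELEM = 4       # element size in bytes (assumed)
BASE = 0x3000  # array base address (assumed)

def unrolled_addrs(i, unroll):
    """Addresses touched by `unroll` copies of an access in one unrolled iteration.

    The shared loop variable i advances once per unrolled iteration; each copy
    carries its own offset u in the instruction's immediate field.
    """
    return [BASE + (i * unroll + u) * ELEM for u in range(unroll)]

# Iteration i = 2 with unroll factor 4 touches elements 8..11.
```

Only one loop counter is needed regardless of the unroll factor, which is what makes this a cheap special case for the hardware.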

Modulo scheduling: the catch
- A nonzero modulo period in load.context/store.context overlaps the execution of multiple instances of the inner loop
- Example: a loop body with a latency of 30 cycles can start a new instance every 5 cycles if there are no resource conflicts or data dependences, giving 6 instances in flight
- But the loop variable increments every 5 cycles, so older instances read the wrong value, and dependences break for loads/stores near the end of the loop. Problem!

Modulo scheduling: the solution
- Traditional approach: a rotating register file with a separate rotating register for each array; multiple copies of the loop variable and multiple address calculations lengthen the loop body
- Array variable renaming: keep a single copy of the loop variable and compensate for the increments by subtracting the modulo period, carried in the immediate field of the load/store
- This preserves throughput and lets the code use the entire scratch pad
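The compensation is a single subtraction, which a short sketch makes concrete. The function name and numbers are illustrative, building on the slide's example of II = 5.

```python
def compensated_index(shared_counter, modulo_period):
    """Recover the iteration index an in-flight instance should use.

    With a single shared loop counter, the counter has advanced once per II
    cycles since this instance's load/store was scheduled. The instruction's
    immediate field carries that distance (the modulo period), and the
    address generator subtracts it before forming the address.
    """
    return shared_counter - modulo_period

# An operation 3 stages deep in the software pipeline (issued 3*II cycles
# ago) sees the shared counter at 7 but belongs to iteration 4.
```

Compare this with a rotating register file, which would instead keep a live copy of the counter per overlapped instance.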

Evaluation: comparison points
- Software running on a 400 MHz Intel XScale embedded processor; it has no floating-point instructions, so they were replaced by integer instructions, giving a lower bound on performance and energy
- A 2.4 GHz Intel Pentium 4: compiler-optimized but with no energy optimization and no perception-task support
- Microcode running on the proposed VLIW architecture
- A custom ASIC

Evaluation (2)
- Benchmarks: a speech recognizer (data-dependent branches, if-converted), visual feature recognition, image segmentation, neural networks (vectorizable), plus DSP and encryption kernels (floating-point and integer) for generality
- Metrics: throughput, energy to process one input block, performance vs. energy, work per unit energy, and energy*delay
- The comparison covers architecture plus process, so feature size is normalized using constant field scaling: delay scales with lambda, energy with lambda^3
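The feature-size normalization is simple enough to sketch. The function and the 250 nm to 130 nm example are illustrative assumptions; the slide only states the scaling exponents (delay proportional to lambda, energy to lambda cubed).

```python
def normalize(delay, energy, lam_from, lam_to):
    """Rescale a (delay, energy) measurement to a common feature size.

    Under constant field scaling, delay is proportional to the feature
    size lambda and energy is proportional to lambda cubed.
    """
    s = lam_to / lam_from
    return delay * s, energy * s ** 3

# Scaling a hypothetical 250 nm measurement to 130 nm:
d, e = normalize(delay=10.0, energy=8.0, lam_from=250, lam_to=130)
# delay shrinks linearly; energy shrinks with the cube of the ratio
```

This is what lets numbers from processors fabricated at different nodes be compared on one energy*delay axis.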

Evaluation (3): methodology
- SPICE simulation of the supply current gives instantaneous power; energy is the integral of power over time
- SRAMs: every read, write, and idle cycle is logged; clock trees and loads come from the netlists
- For the Pentium, XScale, and DSP, average processor current was measured with a current probe and a digital oscilloscope after PCB modification

Results
- IPC is 3.3x better than optimized code (compiled with SGI MIPSPro) on an out-of-order processor
- The gain comes from the on-chip memory system: high functional unit utilization and operand flow, the loop accelerator, and the scratch pad

Results (2): throughput
- Real-time performance on stream computations; throughput = number of input packets processed per second, normalized to the Pentium 4
- The VLIW architecture is 1.75x better than the Pentium 4 and reaches a substantial fraction of the ASIC's performance (note: log scale on the Y axis)

Results (3): this throughput is achieved at energy levels 13.5x better than the XScale (an ASIC implementation was available for 4 of the benchmarks).

Results (4): energy*delay
- 159x better than the XScale, 12x worse than the ASIC implementation
- Advantages: better energy and performance than a general-purpose processor, yet programmable, unlike an ASIC

Related work
- Low-power scratch-pad memories
- IBM eLite DSP: vector extensions that operate on disjoint data
- Reconfigurable Streaming Vector Processor: accelerates base-stride vectors; vector register files with automatic address generation, but less autonomy and hardcoded hardware scheduling
- DSP processors: bit-reversed and auto-increment addressing modes
- The authors' approach: generic address generation with software scheduling, enhanced scalar loads, VLIW, and multiple access patterns

Conclusion
- Multiple SRAMs, distributed address generation, loop acceleration, and array variable renaming
- Versus an ASIC: competitive power and energy efficiency at lower cost; versus a CPU: generality
- 159x better energy*delay than the Intel XScale; 1.75x the performance of the Pentium 4
- Applications: speech recognition, vision, and cryptography, in real time and within an embedded energy budget