Lecture 15: Reconfigurable Coprocessors (ECE 636 Reconfigurable Computing, October 31, 2013)

Overview
- Differences between reconfigurable coprocessors, reconfigurable functional units, and soft processors
- Motivation
- Compute models: how reconfigurable logic fits into the computation
- Interaction with memory is a key issue
- Acknowledgment: DeHon

Overview
- Processors are efficient at sequential code and regular arithmetic operations
- FPGAs are efficient at fine-grained parallelism and unusual bit-level operations
- Tight coupling is important: it allows sharing of data and control
- Efficiency is an issue:
  - Context switches
  - Memory coherency
  - Synchronization

Compute Models
- I/O pre/post-processing
- Application-specific operation
- Reconfigurable coprocessor:
  - Coarse-grained
  - Mostly independent
- Reconfigurable functional unit:
  - Tightly integrated with the processor pipeline
  - Register file sharing becomes an issue

Instruction Augmentation
- A processor can only describe a small number of basic computations in a cycle: I opcode bits select at most 2^I operations
- Many more operations could be performed on two W-bit words
- ALU implementations restrict execution of some simple operations, e.g. bit reversal
- [Figure: bit reversal swaps bit positions, a31 a30 ... a0 -> b31 ... b0]
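As a concrete illustration (not from the slides), here is the classic shift-and-mask bit reversal in C. What a reconfigurable functional unit could implement as pure wiring in one operation costs a conventional ALU a chain of roughly 15 dependent instructions:

```c
#include <stdint.h>

/* Reverse the bits of a 32-bit word using shift/mask steps. */
uint32_t bit_reverse32(uint32_t x) {
    x = ((x >> 1)  & 0x55555555u) | ((x & 0x55555555u) << 1);  /* swap adjacent bits */
    x = ((x >> 2)  & 0x33333333u) | ((x & 0x33333333u) << 2);  /* swap bit pairs     */
    x = ((x >> 4)  & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* swap nibbles       */
    x = ((x >> 8)  & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* swap bytes         */
    return (x >> 16) | (x << 16);                              /* swap half-words    */
}
```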

Instruction Augmentation
What's required?
- A way to augment the processor's instruction set for an application
- Avoid a mismatch between hardware and software
- Fit the augmented instructions into the data and control stream
- A functional unit to execute the augmented instructions
- Compiler techniques to identify and use the new functional unit

PRISC
- Architecture: coupled into the register file as a "superscalar" functional unit
- Flow-through array (no state)
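A minimal software model of the programming interface this implies (function names and the example functions are hypothetical stand-ins, not PRISC's actual configurations): expfu is a stateless, single-cycle op taking two register operands, with a function number selecting the loaded combinational logic.

```c
#include <stdint.h>

/* Hypothetical PFU functions standing in for loaded configurations. */
static uint32_t pfu_hamming(uint32_t a, uint32_t b) {
    uint32_t x = a ^ b, count = 0;
    while (x) { x &= x - 1; count++; }   /* count differing bit positions */
    return count;
}
static uint32_t pfu_interleave_low(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 16; i++)         /* interleave the low 16 bits */
        r |= ((a >> i) & 1u) << (2 * i) | ((b >> i) & 1u) << (2 * i + 1);
    return r;
}

/* expfu rd, rs1, rs2 with function number fnum: flow-through, no state.
   Real hardware would stall here on a configuration miss. */
uint32_t expfu(unsigned fnum, uint32_t rs1, uint32_t rs2) {
    switch (fnum) {
        case 0:  return pfu_hamming(rs1, rs2);
        case 1:  return pfu_interleave_low(rs1, rs2);
        default: return 0;               /* a real system would trap */
    }
}
```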

PRISC Results
- All results compiled, working from the MIPS binary
- Fewer than 200 4-LUTs (64x3 array)
- 200 MHz MIPS base processor
- [Razdan, MICRO-27]

Chimaera
- Starts from the PRISC idea:
  - Integrated as a functional unit
  - No state
  - RFU ops (like expfu)
  - Stall the processor on an instruction miss
- Adds:
  - Multiple instructions at a time
  - More than two inputs possible
- Hauck: University of Washington

Chimaera Architecture
- A live copy of the register file values feeds into the array
- Each row of the array may compute from register values or intermediates
- A tag on the array indicates the RFUOP

Chimaera Architecture
- The array can operate on values as soon as they are placed in the register file
- Logic is combinational
- When an RFUOP matches:
  - Stall until the result is ready
  - Drive the result from the matching row

Chimaera Timing
- If an operand (e.g. R1) is presented late, the processor stalls
- Might be helped by instruction reordering
- Physical implementation is an issue
- Relies on considerable processor interaction for support
- [Figure: timing of operands R5, R3, R2, R1 reaching the array]

Chimaera Results
- Three SPEC92 benchmarks:
  - compress: 1.11x speedup
  - eqntott: 1.8x
  - life: 2.06x
- Small arrays with limited state yield small speedups
- Perhaps focus on the global router rather than local optimization

Garp
- Integrated as a coprocessor:
  - Bandwidth to the processor similar to a functional unit's
  - Own access to memory
- Supports multi-cycle operation:
  - State is allowed
  - A cycle counter tracks the operation
- Configuration cache and a direct path to memory

Garp (UC Berkeley)
- ISA: coprocessor operations
  - Issue gaconfig to make a particular configuration present
  - Explicitly move data to/from the array
  - Processor suspends during the coprocessor operation
  - A cycle counter tracks progress
- The array may directly access memory:
  - Processor and array share memory
  - Exploits streaming data operations
  - Cache/MMU maintains data consistency
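A toy software model of this call pattern, to make the coprocessor flow concrete (all names are invented stand-ins for Garp's MIPS coprocessor instructions such as gaconfig and the data-move ops; the datapath is a placeholder):

```c
#include <stdint.h>
#include <stdio.h>

static const void *active_cfg;     /* currently resident configuration   */
static uint32_t    ga_reg[4];      /* data registers at the array edge   */
static uint32_t    ga_counter;     /* cycle counter the interlock watches */

static void garp_config(const void *cfg) {
    if (active_cfg != cfg) active_cfg = cfg;  /* real hw: stall and (re)load */
}
static void     garp_put(int r, uint32_t v) { ga_reg[r] = v; }
static uint32_t garp_get(int r)             { return ga_reg[r]; }

/* Start the array for `cycles` cycles. With the interlock set, the
   processor suspends until the counter reaches zero; here we just run
   a stand-in accumulate as "the array's datapath". */
static void garp_run(uint32_t cycles) {
    for (ga_counter = cycles; ga_counter > 0; ga_counter--)
        ga_reg[1] += ga_reg[0];
}

int main(void) {
    static const int demo_cfg = 0;        /* stand-in configuration blob */
    garp_config(&demo_cfg);               /* gaconfig                     */
    garp_put(0, 3);                       /* move operand to the array    */
    garp_put(1, 0);
    garp_run(8);                          /* 8 array cycles, CPU waits    */
    printf("result = %u\n", garp_get(1)); /* prints 24                    */
    return 0;
}
```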

Garp Instructions
- An interlock bit indicates whether the processor waits for the array to count down to zero
- The last three instructions are useful for a context swap
- The processor's decode hardware is augmented to recognize the new instructions

Garp Array
- Row-oriented logic
- Dedicated paths for the processor and for memory
- The processor does not have to be involved in the array-memory path

Garp Results
- General results: roughly 10-20x improvement on streaming, feed-forward operation
- 2-3x when data dependencies limit pipelining
- [Hauser, FCCM '97]

PRISC/Chimaera vs. Garp
- PRISC/Chimaera:
  - Basic op is single-cycle: expfu
  - No state
  - Could have multiple PFUs
  - Fine-grained parallelism
  - Not effective for deep pipelines
- Garp:
  - Basic op is multi-cycle: gaconfig
  - Effective for deep pipelining
  - Single array
  - Requires consideration of state swapping

Common Theme
To overcome instruction expression limits:
- Defining new array instructions makes the decode hardware slower and more complicated
- Many bits of configuration mean long swap times, an issue (recall the techniques for dynamic reconfiguration)
- Give each array configuration a short "name" which the processor can call out
- Store multiple configurations in the array and access them as needed (DPGA); see the sketch below
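A minimal sketch of the "short name" idea (all structures invented for illustration): the instruction carries only a small configuration ID; a cache of resident configurations, DPGA-style contexts here, decides whether the op issues immediately or the array stalls to reload.

```c
#include <stdbool.h>

/* Toy configuration cache with four DPGA-style resident contexts. */
#define NUM_CONTEXTS 4

static int resident[NUM_CONTEXTS] = { -1, -1, -1, -1 };  /* -1 = empty       */
static int next_victim;                                  /* FIFO replacement */

/* Returns true on a hit (the op can issue now); on a miss, evicts a
   context and swaps the configuration in -- where the many-cycle
   reconfiguration cost is paid. */
static bool ensure_resident(int cfg_id) {
    for (int i = 0; i < NUM_CONTEXTS; i++)
        if (resident[i] == cfg_id) return true;   /* hit           */
    resident[next_victim] = cfg_id;               /* miss: reload  */
    next_victim = (next_victim + 1) % NUM_CONTEXTS;
    return false;                                 /* caller stalls */
}
```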

Observation
- All of these coprocessors have been single-threaded:
  - Performance improvement is limited by the application's parallelism
- Potential for task/thread parallelism:
  - DPGA
  - Fast context switch
- Concurrent threads were seen in the discussion of the I/O/stream processor
- The added complexity needs to be addressed in software

Parallel Computation: Processor and FPGA
What would it take to let the processor and FPGA run in parallel? Modern processors deal with:
- Variable data delays
- Data dependencies
- Multiple heterogeneous functional units
via:
- Register scoreboarding
- Runtime dataflow (Tomasulo); a minimal scoreboard sketch follows
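A hedged sketch of register scoreboarding in its simplest form (not from the slides, and heavily simplified): each register gets a busy bit that is set while it is the pending destination of an in-flight unit, whether ALU or FPGA, and an instruction issues only when none of its registers are busy.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t busy;  /* 32 registers -> one busy bit each */

/* An op may issue only if both sources are ready and the destination
   is not already the pending result of an in-flight unit. */
static bool can_issue(int rs1, int rs2, int rd) {
    uint32_t need = (1u << rs1) | (1u << rs2) | (1u << rd);
    return (busy & need) == 0;
}
static void issue(int rd)    { busy |=  (1u << rd); }  /* result now pending  */
static void complete(int rd) { busy &= ~(1u << rd); }  /* writeback clears it */
```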

OneChip
- Want the array to have direct memory-to-memory operations
- Want to fit into the programming model/ISA without forcing exclusive processor/FPGA operation, allowing decoupled processor/array execution
- Key idea: the FPGA operates on memory-to-memory regions
  - Make the regions explicit to the processor's issue logic
  - Scoreboard the memory blocks
- [Jacob and Chow: Toronto]

OneChip Pipeline [figure]

OneChip Instructions
- Basic operation: FPGA MEM[Rsource] -> MEM[Rdst]
  - Block sizes are powers of 2
- Supports 14 "loaded" functions; DPGA-style contexts let 4 be cached
- Fits well into the soft-core processor model
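A hedged C sketch of this memory-to-memory operation (the instruction fields and the loaded function are invented for illustration): the base addresses come from registers, and the block size is encoded as a power-of-two exponent so a narrow instruction field covers large regions.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of a OneChip-style FPGA MEM[Rsource] -> MEM[Rdst] op. */
static void fpga_memop(const uint32_t *src, uint32_t *dst, unsigned log2_words) {
    size_t n = (size_t)1 << log2_words;  /* block size: 2^log2_words words   */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 3 + 1;         /* stand-in for the loaded function */
}

int main(void) {
    uint32_t in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    fpga_memop(in, out, 3);              /* operate on a 2^3 = 8-word block */
    return out[0] == 4 ? 0 : 1;
}
```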

OneChip
- Basic op is FPGA MEM -> MEM
- No state between these ops
- Coherence means the ops appear sequential
- Could have multiple parallel FPGA compute units that scoreboard with the processor and with each other
- Open questions: single-source operations only? Can't chain FPGA operations?

OneChip Extensions
- The FPGA operates on certain memory regions only
- The regions are made explicit to the processor's issue logic
- Memory blocks are scoreboarded
- [Figure: memory map with boundaries at 0x0, 0x1000, and 0x10000, regions tagged FPGA and Proc]
- Usage of data pages is indicated much like in a virtual memory system; a sketch of the check follows
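A hedged sketch of the block scoreboarding (structures invented for illustration): before the processor issues a load or store, it checks whether the address falls inside a region an in-flight FPGA op has claimed; a hit forces a stall, which is what keeps decoupled execution looking sequential.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy memory-block scoreboard: in-flight FPGA ops claim [base, base+len). */
#define MAX_CLAIMS 8

typedef struct { uint32_t base, len; bool live; } claim_t;
static claim_t claims[MAX_CLAIMS];

/* A processor access that hits a claimed region must wait until the
   FPGA op completes, so the ops still appear sequential. */
static bool proc_access_must_stall(uint32_t addr) {
    for (int i = 0; i < MAX_CLAIMS; i++)
        if (claims[i].live &&
            addr - claims[i].base < claims[i].len)  /* unsigned range test */
            return true;                            /* conflict: stall     */
    return false;                                   /* safe to issue       */
}
```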

Model Roundup
- Interfacing
- I/O processor (asynchronous)
- Instruction augmentation:
  - PFU (like a functional unit, no state)
  - Synchronous coprocessor
  - VLIW
  - Configurable vector
- Asynchronous coroutine/coprocessor
- Memory-to-memory coprocessor

Shadow Registers for Functional Units
- Reconfigurable functional units require tight integration with the register file
- Many reconfigurable operations require more than two operands at a time
- [Cong, 2005]

Multi-operand Operations
- What is the best speedup that could be achieved?
  - Provides an upper bound
  - Assumes all operands are available when needed
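One standard way to frame such a bound (an Amdahl-style argument, not a formula taken from the cited paper): if a fraction f of the program's cycles can be collapsed into multi-operand custom operations that run s times faster, then

```latex
\text{speedup} \;\le\; \frac{1}{(1-f) + f/s},
\qquad \text{approaching } \frac{1}{1-f} \text{ as } s \to \infty .
```

Assuming all operands are available exactly when needed corresponds to taking the largest achievable s, which is why the result is an upper bound.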

Providing Additional Register File Access
- Dedicated link to move data as needed:
  - Adds latency
- Extra register port:
  - Consumes resources
  - May not be used often
- Replicate the whole (or most) of the register file:
  - Can be wasteful

Shadow Register Approach
- Only a small number of shadow registers is needed (3 or 4)
- Extra bits in each instruction select them
- Can be scaled to the necessary port size
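A hedged sketch of the mechanism (encoding invented for illustration): a few extra bits in each instruction tell writeback to also copy the result into one of a handful of shadow registers feeding the reconfigurable unit, so a 3-input custom op never needs a third main register-file read port.

```c
#include <stdint.h>

#define NUM_SHADOW 4

static uint32_t regfile[32];
static uint32_t shadow[NUM_SHADOW];   /* small file feeding the RFU */

/* Normal writeback, extended with an optional shadow copy. A shadow
   field of 0 means "no copy"; values 1..NUM_SHADOW select a slot. */
static void writeback(int rd, uint32_t value, unsigned shadow_field) {
    regfile[rd] = value;
    if (shadow_field != 0)
        shadow[shadow_field - 1] = value;
}

/* A 3-input RFU op: two operands from the usual register ports, the
   third from a shadow register -- no extra main read port required.
   The multiply-add is just a placeholder datapath. */
static uint32_t rfu_op3(int rs1, int rs2, unsigned shadow_slot) {
    return regfile[rs1] + regfile[rs2] * shadow[shadow_slot];
}
```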

Shadow Register Approach
- The approach comes within 89% of ideal for 3-input functions
- The paper also presents supporting algorithms

Summary
- Many different models for coprocessor implementation:
  - Functional unit
  - Stand-alone coprocessor
- The programming model for these systems is a key issue
- Recent compiler advances open the door for future development
- A tie-in with applications is needed