ECE 636 Reconfigurable Computing
Lecture 20: Exam 2 Review
November 21, 2013

PRISC
°Architecture: couple into register file as “superscalar” functional unit
°Flow-through array (no state)

PRISC Results
°All compiled
°Working from MIPS binary
°<200 4LUTs ? 64x3
°200MHz MIPS base
Razdan/Micro27

Chimaera
Start from PRISC idea.
-Integrate as a functional unit
-No state
-RFU Ops (like expfu)
-Stall processor on instruction miss
Add
-Multiple instructions at a time
-More than 2 inputs possible
Hauck: University of Washington

Chimaera Architecture
Live copy of register file values feeds into array
Each row of array may compute from registers or intermediates
Tag on array to indicate RFUOP

Chimaera Architecture
Array can operate on values as soon as they are placed in register file.
Logic is combinational
When RFUOP matches
-Stall until result ready
-Drive result from matching row

Chimaera Results
Three Spec92 benchmarks
-Compress 1.11 speedup
-Eqntott 1.8
-Life 2.06
Small arrays with limited state
Small speedup
Perhaps focus on global router rather than local optimization.

Garp
Integrate as coprocessor
-Similar bandwidth to processor as functional unit
-Own access to memory
Support multi-cycle operation
-Allow state
-Cycle counter to track operation
Configuration cache, path to memory

Garp (UC Berkeley)
ISA: coprocessor operations
-Issue gaconfig to make particular configuration present.
-Explicitly move data to/from array
-Processor suspension during coproc operation
-Use cycle counter to track progress
Array may directly access memory
-Processor and array share memory
-Exploits streaming data operations
-Cache/MMU maintains data consistency

Garp Instructions
Interlock indicates if processor waits for array to count to zero.
Last three instructions useful for context swap
Processor decode hardware augmented to recognize new instructions.

Garp Array
Row-oriented logic
Dedicated path for processor/memory
Processor does not have to be involved in array-memory path

Garp Results
General results
-X improvement on stream, feed-forward operation
-2-3x when data dependencies limit pipelining
-[Hauser-FCCM97]

PRISC/Chimaera vs. Garp
PRISC/Chimaera
-Basic op is single cycle: expfu
-No state
-Could have multiple PFUs
-Fine grained parallelism
-Not effective for deep pipelines
Garp
-Basic op is multi-cycle: gaconfig
-Effective for deep pipelining
-Single array
-Requires state swapping consideration

Common Theme
To overcome instruction expression limits:
-Define new array instructions. Make decode hardware slower / more complicated.
-Many bits of configuration… swap time. An issue -> recall tips for dynamic reconfiguration.
Give array configuration short “name” which processor can call out.
Store multiple configurations in array. Access as needed (DPGA)

Observation
All coprocessors have been single-threaded
-Performance improvement limited by application parallelism
Potential for task/thread parallelism
-DPGA
-Fast context switch
Concurrent threads seen in discussion of IO/stream processor
Added complexity needs to be addressed in software.

FPGA Power Reduction Goals
Dynamic power goals
-Reduce Vdd along non-critical paths
-Low swing signalling
-Use CAD approaches to limit long high-toggle paths
-$P_{dynamic} = 0.5 \cdot C \cdot V_{dd}^{2} \cdot f$
Static power goals
-Cut-off Vdd for unused transistors
-Use high Vt transistors for SRAM cells
-Various other voltage biasing techniques
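To make the dynamic power relation concrete, the following is a minimal sketch that evaluates P_dynamic = 0.5 * C * Vdd^2 * f for two supply voltages. The capacitance, frequency, and voltage values are illustrative assumptions, not figures from the lecture.

```python
def dynamic_power(c_farads, vdd_volts, freq_hz):
    """Dynamic power in watts: P = 0.5 * C * Vdd^2 * f."""
    return 0.5 * c_farads * vdd_volts ** 2 * freq_hz


# Assumed example values (hypothetical, chosen only for illustration).
SWITCHED_CAPACITANCE = 10e-12   # 10 pF
CLOCK_FREQUENCY = 200e6         # 200 MHz

p_nominal = dynamic_power(SWITCHED_CAPACITANCE, 1.2, CLOCK_FREQUENCY)
p_reduced = dynamic_power(SWITCHED_CAPACITANCE, 1.0, CLOCK_FREQUENCY)

# Because power scales with Vdd squared, the fractional saving
# depends only on the ratio of the two supply voltages.
savings = 1.0 - p_reduced / p_nominal

print(f"nominal: {p_nominal * 1e3:.3f} mW")
print(f"reduced: {p_reduced * 1e3:.3f} mW")
print(f"saving:  {savings:.1%}")
```

The quadratic dependence on Vdd is why lowering the supply on non-critical paths is listed first among the dynamic power goals.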

Traditional Routing Switch
[Figure: routing switch with level-restoring buffer. Courtesy: Anderson]

Proposed Switch Designs: Anderson
°Based on 3 observations:
Routing switch inputs tolerant to weak-1 signals (level-restoring buffers).
Considerable slack in FPGA designs -> many switches can be slowed down.
Most routing switches feed other routing switches.
-Can produce weak-1 logic signals.

“Basic” Switch Design
MODE OPERATION:
high-speed: MNX & MPX ON
low-power: MNX ON, MPX OFF
sleep: MNX OFF, MPX OFF
[Figure: switch circuit, node label V_VD]

High-Speed Mode
MODE OPERATION:
high-speed: MNX & MPX ON
output swing: rail-to-rail.
V_VD = V_DD

Low-Power Mode
MODE OPERATION:
low-power: MNX ON, MPX OFF
output swing: GND-to-(V_DD - V_TH).
V_VD = V_DD - V_TH

Sleep Mode
MODE OPERATION:
sleep: MNX OFF, MPX OFF
[Figure: switch circuit, node label V_VD]

Leakage Power Results: Anderson
[Figure: % leakage power reduction vs. high-speed mode. Labels: Basic; Traditional switch; LP mode; Sleep mode; LP mode (+unused fanout); LP mode (+used fanout)]

FPGA Embedded Memory Blocks
°Embedded memory blocks (EMBs) are important parts of FPGAs
°Consume roughly 14% of Altera Stratix II dynamic power *
°Increasing in recent designs
* Stratix II Low Power Applications Note, 2005

Embedded Memory Block Port Internal View
[Figure: EMB port internals. Labels: Write Data, Write Enable, Write Buffers, Column Mux, Sense Amps, Row Decode, Read Data, Read Enable, Latch, Address, RAM cell, BIT, Bit Line Pre-charge, MClk, Clk, Clk Enable]
Reducing clocking saves dynamic power

Power Optimization #1
°Convert EMB read enable/write enable signals to associated read/write clock enable signals
°Limitations
-Each port has read or write enable control signal
-Embedded memory block has read enable input
[Figure: before and after schematics. Labels: Clock, Wren, Rden, Data, Write Address, Read Address, Q, Write enable, Read enable, Vcc, Wr clk enable, Rd clk enable]

Implementation
°Conversion mode
-Ties off R/W enable to RAM clock enables
-Doesn’t make transform if CE already present on port
°Combining mode
-AND user RAM clock enables with derived R/W clock
-Could impact performance
[Figure: Write Enable and User-defined Write Clk Enable feeding Combined Write Clk Enable]
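The two modes can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the tool's actual implementation: the function names, the `mode` strings, and the example traces are hypothetical, and the baseline assumes a port without any clock enable is clocked on every cycle.

```python
def port_clock_enable(rw_enable, user_ce=None, mode="convert"):
    """Clock enable seen by one RAM port in a given cycle.

    rw_enable: the port's read or write enable.
    user_ce:   user-defined clock enable, or None if the port has none.
    mode:      "convert" or "combine" (hypothetical names).
    """
    if user_ce is None:
        # No clock enable on the port: the R/W enable becomes the clock enable.
        return rw_enable
    if mode == "combine":
        # Combining mode: AND the user clock enable with the R/W enable.
        return user_ce and rw_enable
    # Conversion mode leaves a port that already has a clock enable untouched.
    return user_ce


def clocked_cycles(rw_trace, ce_trace=None, mode="convert"):
    """Count cycles in which the port is actually clocked."""
    if ce_trace is None:
        ce_trace = [None] * len(rw_trace)
    return sum(
        bool(port_clock_enable(rw, ce, mode))
        for rw, ce in zip(rw_trace, ce_trace)
    )


# Hypothetical eight-cycle traces.
write_enable = [True, False, False, True, False, False, False, True]
user_ce = [True, True, True, True, False, False, True, True]

baseline = len(write_enable)  # assumed: clocked every cycle without the transform
converted = clocked_cycles(write_enable)
untouched = clocked_cycles(write_enable, user_ce, mode="convert")
combined = clocked_cycles(write_enable, user_ce, mode="combine")

print(baseline, converted, untouched, combined)
```

Fewer clocked cycles is the source of the dynamic power saving. The extra AND gate in combining mode sits on the clock enable path, which is one way the "could impact performance" caveat can arise.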

FPGA RAM Processing
°FIFOs and shift registers converted into logical RAMs
°Logical RAMs mapped to RAM blocks
[Figure: flow. FIFO, Shift Register, RAM specification -> Create Logical Memory -> Logical RAMs/logic -> Logical-to-physical RAM processing -> RAM blocks/logic -> Memory/logic placement -> Placed Memory]

Mapping RAM to EMBs
°Implementation choice can impact design area, performance, and power.
°Some mappings may require multiple EMBs
[Figure: user-defined (logical) memory, 4k deep x 4 wide (16K bits), and physical (EMB) memory: M4K (4K bits), 512K MRAM]

Memory Organization
°Each EMB can be configured to have different depth and width (e.g. Stratix II M4K)
°All hold 4K bits
°Slightly lower power consumption for wider EMB configurations (not including routing)
[Figure: example configurations. 4K words deep x 1 bit wide; 512 words deep x 8 bits wide; 128 words deep x 32 bits wide]
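The depth and width trade-off follows directly from the fixed 4K-bit capacity. The sketch below enumerates configurations whose depth times width equals 4K bits. The set of widths is an assumption (powers of two only); the widths a real device supports may differ.

```python
EMB_BITS = 4 * 1024  # capacity of one block, per the slide


def emb_configurations(widths=(1, 2, 4, 8, 16, 32)):
    """Return (depth_words, width_bits) pairs holding exactly EMB_BITS.

    The default widths are an illustrative assumption.
    """
    return [(EMB_BITS // width, width) for width in widths]


for depth, width in emb_configurations():
    print(f"{depth:5d} words deep x {width:2d} bits wide")
```

The three configurations named on the slide (4K x 1, 512 x 8, 128 x 32) all appear in this list.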

Area and Delay Optimal Mapping
°Configure each EMB to be as deep as possible
°Number of address bits on each EMB same as on logical memory
°Area and performance efficient: no external logic needed
°Power inefficient: All EMBs must be active during each logical RAM access
[Figure: EMB vertical slicing. Logical memory 4k words deep and 4 bits wide, Addr[0:11], Data[0:3], built from 4 EMBs each 4k words deep and 1 bit wide; 4 EMBs active during access]

Alternative Mapping
°Configure EMB to have width of logical RAM (e.g. 1Kx4)
-Allows shutdown of some RAMs each cycle
-But adds some logic
°Saves RAM power, adds combinational logic and register power
[Figure: horizontal slicing, more power efficient. Logical memory 4k words deep and 4 bits wide built from 4 EMBs each 1K deep x 4 wide; labels: Addr Decoder, Addr[0:9], Addr[10:11], Data[0:3]; 1 EMB active during access]
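The contrast between vertical and horizontal slicing can be checked with a few lines of arithmetic. This is a minimal sketch: the function and key names are hypothetical, and it assumes that EMBs in unselected rows are fully shut down during an access.

```python
import math


def slice_mapping(logical_depth, logical_width, emb_depth, emb_width):
    """Count EMBs for a logical RAM built from identically configured EMBs."""
    # EMBs stacked along the address space (selected by upper address bits).
    rows = math.ceil(logical_depth / emb_depth)
    # EMBs placed side by side to cover the data width.
    cols = math.ceil(logical_width / emb_width)
    return {
        "total_embs": rows * cols,
        # Assumption: only the addressed row of EMBs is enabled per access.
        "active_per_access": cols,
        # Upper address bits that must be decoded outside the EMBs.
        "decode_bits": math.ceil(math.log2(rows)) if rows > 1 else 0,
    }


# 4k words deep x 4 bits wide logical memory, as in the lecture example.
vertical = slice_mapping(4096, 4, emb_depth=4096, emb_width=1)
horizontal = slice_mapping(4096, 4, emb_depth=1024, emb_width=4)

print("vertical:  ", vertical)
print("horizontal:", horizontal)
```

Both mappings use four EMBs, but the horizontal one activates a single EMB per access at the cost of decoding two address bits (Addr[10:11]) in external logic.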

RAM Slicing - Example
°Power reduction available with different slicing
[Figure: 4kx32 dynamic power (mW) vs. maximum depth. Annotations: Best range; Multiplexer Power Increasing; EMB Power Increasing]

Power Optimization #2: Power-aware RAM Partitioning
°Algorithm considers possible logical to physical RAM mappings
[Figure: flow. Labels: FIFO, Shift Register; Create Logical Memory; Power-aware Physical RAM processing; Power Library; Insert Decode and Mux Logic; Memory/Logic Placement; Completed placement]
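One way to picture a power-aware search over mappings is an exhaustive loop over candidate EMB depths that keeps the lowest estimate. The sketch below is not the algorithm from the lecture. The cost model and every number in the "power library" are made-up placeholders, chosen only so that EMB power grows with depth and multiplexer power grows with the number of slices.

```python
import math

EMB_BITS = 4096  # bits per EMB

# Hypothetical power library: per-access EMB power (mW) by configured depth.
# All values are invented for illustration.
EMB_ACCESS_POWER_MW = {
    4096: 1.30, 2048: 1.20, 1024: 1.10, 512: 1.05, 256: 1.02, 128: 1.00,
}
# Hypothetical cost of decode/mux logic per extra row, per data bit (mW).
MUX_POWER_PER_ROW_PER_BIT_MW = 0.02


def estimate_power(logical_depth, logical_width, emb_depth):
    """Estimated power (mW) of one mapping under the made-up model above."""
    emb_width = EMB_BITS // emb_depth
    rows = math.ceil(logical_depth / emb_depth)
    cols = math.ceil(logical_width / emb_width)
    emb_power = cols * EMB_ACCESS_POWER_MW[emb_depth]  # one row active
    mux_power = (rows - 1) * logical_width * MUX_POWER_PER_ROW_PER_BIT_MW
    return emb_power + mux_power


def best_mapping(logical_depth, logical_width):
    """Return the candidate EMB depth with the lowest estimated power."""
    candidates = [d for d in EMB_ACCESS_POWER_MW if d <= logical_depth]
    return min(
        candidates,
        key=lambda d: estimate_power(logical_depth, logical_width, d),
    )


for depth in sorted(EMB_ACCESS_POWER_MW, reverse=True):
    print(f"EMB depth {depth:4d}: {estimate_power(4096, 32, depth):6.2f} mW")
print("lowest estimate at depth", best_mapping(4096, 32))
```

With these placeholder numbers the estimate for a 4kx32 memory is lowest at an intermediate depth: very deep EMBs keep many blocks active, while very shallow ones pay for a large multiplexer.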

Experimental Approach
°40 designs evaluated
°Quartus 5.1
°Mapped to smallest possible device and target max frequency
°Simulation with test vectors
°Power analysis with PowerPlay

Memory Power
°21.0% average reduction for all techniques (9.7% with convert/combine)

Overall Core Dynamic Power
°6.8% average power reduction for all techniques (2.6% with convert/combine)
[Figure: % dynamic power reduction per design. Series: Enable convert/combine; Enable convert/combine + mem partition]

Design Performance
°1.0% average performance loss for all techniques (0.1% for enable convert/combine)
[Figure: average design clock frequency, % frequency improvement per design. Series: Enable Convert/Combine; Enable Convert/Combine + Mem Partition]

Results Summary
°Almost 7% core dynamic power reduction across all designs
-Some designs benefit more than others
°Minimal clock frequency hit for most designs

                        Enable convert   Enable convert/combine   Enable convert/combine + Mem partition
Core dynamic power      -1.8%            -2.6%                    -6.8%
Memory dynamic power    -6.3%            -9.7%                    -21.0%
Max clk freq            -0.1%            -0.2%                    -1.0%
LUT count               0.0%             0.1%                     0.7%

Other material
°Lecture 17: Reconfigurable Memory Security
°Lecture 18: Hardware Monitors to Protect Network Processors
°Lecture 19 is not covered on the exam