Self-Improving Configurable IC Platforms
Frank Vahid, UC Riverside

Presentation transcript:

Slide 1: Self-Improving Configurable IC Platforms
Frank Vahid, Associate Professor, Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
Co-PI: Walid Najjar, Professor, CS&E, UCR

Slide 2: Goal: Platform Self-Tunes to Executing Application
- Download standard binary
- Platform adjusts to executing application
- Result is better speed and energy
- Why and how?
[Figure labels: Application, Platform]

Slide 3: Platforms
- Pre-designed programmable platforms
  - Reduce NRE cost, time-to-market, and risk
  - Platform designer amortizes design cost over large volumes
- Many (if not most) will include FPGA
  - Today: Triscend, Altera, Xilinx, Atmel
  - More sure to come, as FPGA vendors license to SoC makers
- Modern IC costs are feasible mostly in very high volumes
[Figure: sample platform with processor, L1 cache, memory, FPGA, Periph 1, and JPEG blocks]

Slide 4: Hardware/Software Partitioning Improves Speed and Energy
- But requires partitioning CAD tool
  - O.K. in some flows
  - In mainstream software flows, hard to integrate
[Figure: platform with processor, L1 cache, memory, FPGA, Periph 1, and JPEG blocks; tool flow with Standard Sw Tools and Hw/Sw Partitioner; timeline showing uP idle and active periods alongside the FPGA]

Slide 5: Idea: Perform Partitioning Dynamically (and hence Transparently)
- Add components on-chip:
  - Profile
  - Decompile frequent loops
  - Optimize
  - Synthesize
  - Place and route onto FPGA
  - Update Sw to call FPGA
- Transparent
  - No impact on tool flow
  - Dynamic software optimization, software binary updating, and dynamic binary translation are proven technologies
- But how can you profile, decompile, optimize, synthesize, and p&r on-chip?
[Figure: platform with processor, L1 cache, memory, FPGA, DAG & LC, Profiler, Explorer, and Dynamic Partitioning Module (Decompiler, Optimizer, Synthesis, Place and Route)]

Slide 6: Dynamic Partitioning Requires Lean Tools
- How can you run Synopsys/Cadence/Xilinx tools on-chip, when they currently run on powerful workstations?
- Key: our tools need only be good enough to speed up critical loops
  - Most time is spent in small loops (e.g., MediaBench, NetBench, EEMBC)
- Created ultra-lean versions of the tools
  - Quality not necessarily as good, but good enough
  - Run on a 60 MHz ARM7
[Figure label: Loop]

Slide 7: Dynamic Hw/Sw Partitioning Tool Chain
- Architecture targeted for loop speedup, simple P&R
- We've developed efficient profiler Hw
- We're continuing to extend these tools to handle more benchmarks
[Figure: tool chain on the platform (processor, L1 cache, memory, FPGA, DAG & LC, Profiler, Explorer, Partitioner). Stages shown: Binary, Loop Profiler, Small Frequent Loops, Loop Decompilation, Hw Synthesis, Tech. Mapping, Place & Route, Bitfile Creation, DMA Configuration, Binary Modification, Updated Binary]
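The loop profiler stage named on this slide can be illustrated with a small software sketch. This is not the on-chip profiler hardware from the talk; it only assumes that a small loop shows up as a frequently taken, short backward branch. The trace format, the function names, and the 256-byte loop-size threshold are all hypothetical.

```python
from collections import Counter


def profile_loops(trace, max_loop_bytes=256):
    """Count taken short backward branches in a branch trace.

    trace: iterable of (branch_pc, target_pc) pairs, one per taken branch
           (hypothetical format, not taken from the slides).
    A branch whose target is at or before its own address, and no more than
    max_loop_bytes away, is treated as the closing branch of a small loop.
    """
    counts = Counter()
    for branch_pc, target_pc in trace:
        offset = branch_pc - target_pc
        if 0 <= offset <= max_loop_bytes:
            counts[(target_pc, branch_pc)] += 1
    return counts


def most_frequent_loop(trace, max_loop_bytes=256):
    """Return ((loop_start, loop_end), count) for the hottest loop, or None."""
    counts = profile_loops(trace, max_loop_bytes)
    if not counts:
        return None
    return counts.most_common(1)[0]
```

The hottest (start, end) address pair would be the candidate handed to the decompilation and synthesis stages; forward branches and long backward jumps are ignored.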

Slide 8: Dynamic Hw/Sw Partitioning Results
[Figure: platform with processor, L1 cache, memory, FPGA, DAG & LC, Profiler, Explorer, and Partitioner (Decompiler, Optimizer, Synthesis, Place and Route)]

Slide 9: Dynamic Hw/Sw Partitioning Results
- Powerstone, NetBench, and EEMBC examples, single most frequent loop only
- Average speedup very close to ideal speedup of 2.4
  - Not much left on the table in these examples
- Dynamically speeding up inner loops on FPGAs is feasible using on-chip tools
- ICCAD'02 (Stitt/Vahid): binary-level partitioning in general is very effective
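One way to read the "ideal speedup" figure is through Amdahl's law: if only one loop moves to hardware, the best possible overall speedup is reached when that loop takes zero time. The sketch below assumes this interpretation, which the slides do not spell out; the function names are illustrative.

```python
def ideal_speedup(loop_fraction):
    """Overall speedup if the loop's execution time drops to zero."""
    if not 0.0 <= loop_fraction < 1.0:
        raise ValueError("loop_fraction must be in [0, 1)")
    return 1.0 / (1.0 - loop_fraction)


def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: only the loop's share of time is accelerated."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


def loop_fraction_for_ideal(speedup):
    """Fraction of time the loop must account for to give this ideal speedup."""
    return 1.0 - 1.0 / speedup


# Under this reading, an ideal speedup of 2.4 corresponds to a loop that
# accounts for roughly 58% of execution time.
print(round(loop_fraction_for_ideal(2.4), 3))
```

Under the same assumption, an average measured speedup close to the ideal means the hardware version of the loop contributes very little of the remaining run time.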

Slide 10: Configurable Cache: Why?
- ARM920T: caches consume half of total processor system power (Segars 01)
- M*CORE: unified cache consumes half of total processor system power (Lee/Moyer/Arends 99)
[Figure: platform with processor, L1 cache, memory, FPGA, DAG & LC, Profiler, Explorer, and Dynamic Partitioning Module (Decompiler, Synthesis, Place and Route)]

Slide 11: Best Cache for Embedded Systems?
- Diversity of associativity, line size, total size

Slide 12: Cache Design Dilemmas
- Associativity
  - Low: low power, good performance for many programs
  - High: better performance on more programs
- Total size
  - Small: lower power if working set small (less area)
  - Big: better performance/power if working set large
- Line size
  - Small: better when poor spatial locality
  - Big: better when good spatial locality
- Most caches are a compromise for many programs
  - Work best on average
  - But embedded systems run one/few programs
  - Want best cache for that one program

Slide 13: Solution to the Cache Design Dilemma
- Configurable cache
  - Design physical cache that can be reconfigured
- 1-way, 2-way, or 4-way
  - Way concatenation: new technique, ISCA'03 (Zhang/Vahid/Najjar)
  - Four 2K ways, plus concatenation logic
- 8K, 4K, or 2K byte total size
  - Way shutdown, ISCA'03
  - Gates Vdd, saves both dynamic and static power, some performance overhead (5%)
- 16, 32, or 64 byte line size
  - Variable line fetch size, ISVLSI'03
  - Physical 16 byte line; one, two, or four physical line fetches
- Note: this is a single physical cache, not a synthesizable core
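The parameters on this slide define a small configuration space, which can be enumerated as a sketch. The enumeration assumes that way shutdown turns off whole 2K ways, so the associativity can never exceed the number of active ways; the slide does not state this constraint explicitly, and the names below are illustrative.

```python
WAY_BYTES = 2 * 1024        # four physical 2K ways (from the slide)
LINE_SIZES = (16, 32, 64)   # bytes (from the slide)


def configurations():
    """List (total_bytes, associativity, line_bytes) tuples.

    Assumption: shutting down ways leaves 4, 2, or 1 active ways (8K, 4K, 2K),
    and way concatenation can only reduce associativity below the number of
    active ways, halving it at each step.
    """
    configs = []
    for active_ways in (4, 2, 1):
        total_bytes = active_ways * WAY_BYTES
        assoc = active_ways
        while assoc >= 1:
            for line_bytes in LINE_SIZES:
                configs.append((total_bytes, assoc, line_bytes))
            assoc //= 2
    return configs


if __name__ == "__main__":
    for total_bytes, assoc, line_bytes in configurations():
        print(f"{total_bytes // 1024}K, {assoc}-way, {line_bytes}-byte line")
```

Under these assumptions the space has 18 points, which is small enough to search exhaustively.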

Slide 14: Configurable Cache Design: Way Concatenation (4, 2 or 1 way)
- Trivial area overhead, no performance overhead
[Figure: way-concatenation circuit. Labels: configuration circuit with registers reg0 and reg1 and address bits a11 and a12 producing way-enable signals c0 to c3; address fields a31..a13 (tag address), a12, a11, a10..a5 (index), a4..a0 (line offset); 6x64 decoder; tag part; data array; bitline; column mux; sense amps; mux driver; data output; critical path]
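A rough software model can show what way concatenation does with the address bits labeled in the figure. It assumes the row index is bits a10..a5 (matching the 6x64 decoder), that a11 picks one pair of ways in 2-way mode, and that a12 and a11 together pick a single way in 1-way mode. The exact pairing of ways is a guess made for illustration, not taken from the circuit.

```python
OFFSET_BITS = 5  # a4..a0 line offset, as labeled in the figure
ROW_BITS = 6     # a10..a5 index feeding the 6x64 decoder


def row_index(addr):
    """Row selected inside every way (bits a10..a5)."""
    return (addr >> OFFSET_BITS) & ((1 << ROW_BITS) - 1)


def enabled_ways(addr, assoc):
    """Ways (0..3) activated for this access under the assumed mapping."""
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1
    if assoc == 4:
        return (0, 1, 2, 3)               # all four ways searched in parallel
    if assoc == 2:
        return (2 * a11, 2 * a11 + 1)     # a11 picks one pair of ways
    if assoc == 1:
        return (2 * a12 + a11,)           # a12 and a11 pick exactly one way
    raise ValueError("associativity must be 1, 2, or 4")
```

In this model, lowering the associativity activates fewer ways per access while the same four ways still hold the full capacity.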

Slide 15: Configurable Cache Design Metrics
- We computed power, performance, energy and size using:
  - CACTI models
  - Our own layout (0.13 µm TSMC CMOS), Cadence tools
- Energy: considered cache, memory, bus, and CPU stall
- Powerstone, MediaBench, and SPEC benchmarks
- Used SimpleScalar for simulations
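The four energy components listed here (cache, memory, bus, CPU stall) can be combined in a simple additive model. The sketch below is one plausible form, not the model used in the study: every parameter is a hypothetical per-event energy that would have to come from layout data or CACTI-style estimates, and the access and miss counts from simulation.

```python
def total_energy(accesses, misses, e_cache_access, e_mem_access,
                 e_bus_transfer, e_stall_cycle, miss_penalty_cycles):
    """Sum cache, memory, bus, and CPU-stall energy for one program run.

    All per-event energies are hypothetical inputs in the same unit (e.g. nJ).
    Every access pays the cache energy; every miss additionally pays for a
    memory access, a bus transfer, and the CPU stalling for the miss penalty.
    """
    cache = accesses * e_cache_access
    memory = misses * e_mem_access
    bus = misses * e_bus_transfer
    stall = misses * miss_penalty_cycles * e_stall_cycle
    return cache + memory + bus + stall
```

A model of this shape captures the trade-off behind cache tuning: a larger or more associative cache raises the per-access term but can lower the three miss-related terms.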

Slide 16: Configurable Cache Energy Benefits
- 40%-50% energy savings on average
  - Compared to conventional 4-way and 1-way assoc., 32-byte line size
- AND, best for every example (remember, conventional is compromise)

Slide 17: Future Work
- Dynamic cache tuning
- More advanced dynamic partitioning
  - Automatic frequent loop detection
  - On-chip exploration tool
  - Better decompilation, synthesis
  - Better FPGA fabric, place and route
  - Approach: continue to extend to support more benchmarks
- Extend to platforms with multiple processors
  - Scales well: processors can share on-chip partitioning tools

Slide 18: Conclusions
- Self-improving configurable ICs
  - Provide excellent speed and energy improvements
  - Require no modification to existing software flows
  - Can thus be widely adopted
- We've shown the idea is practical
  - Lean on-chip tools are possible
- Now need to make them even better
  - Extensive research into algorithms, designs and architecture is needed