Frank Vahid Associate Professor

Presentation transcript:

Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs
Frank Vahid, Associate Professor, Dept. of Computer Science and Engineering, University of California, Riverside. Also with the Center for Embedded Computer Systems at UC Irvine.
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend.
Frank Vahid, UC Riverside

General Purpose vs. Special Purpose
Amazing to think this came from wolves.
Standard tradeoff.
Oct. 14, 2002, Cincinnati, Ohio: physicians at Cincinnati Children’s Hospital Medical Center report duct tape effective at treating warts.

General Purpose vs. Single Purpose Processors
Example computation: total = 0; for i = 1 to N loop total += M[i] end loop
Designers have long known that general-purpose processors are flexible and single-purpose processors are fast.
[Figure: general purpose OR single purpose. General-purpose processor: controller (control logic and state register, IR, PC), datapath (register file, general ALU), program memory holding the assembly code for "total = 0 for i = 1 to …", and data memory. Single-purpose processor: controller (control logic, state register), datapath (i, total, +), and data memory.]
ENIAC, 1940s: its flexibility was the big deal.
Compared on: flexibility, design cost, time-to-market, performance, power efficiency, size.
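The tradeoff on this slide can be made concrete with a small sketch. Everything below is illustrative: the four-instruction program, the `run_general_purpose` interpreter, and the one-element-per-cycle model in `run_single_purpose` are assumptions made for this example, not something the slides specify. The sketch runs the slide's summation both ways and counts the steps each takes.

```python
# Illustrative sketch only: the instruction set and the cycle model are assumptions.

# Hypothetical program for "total = 0; for i = 1 to N: total += M[i]" (0-based here).
SUM_PROGRAM = [
    ("bge", "i", 4),             # 0: exit (jump past the end) once i >= N
    ("load_add", "total", "i"),  # 1: total += M[i]
    ("inc", "i", None),          # 2: i += 1
    ("jmp", None, 0),            # 3: back to the loop test
]


def run_general_purpose(program, memory):
    """Interpret the program; return (total, number of instructions executed)."""
    regs = {"total": 0, "i": 0}
    pc = 0
    executed = 0
    while pc < len(program):
        op, reg, arg = program[pc]
        executed += 1
        pc += 1
        if op == "bge":
            if regs[reg] >= len(memory):
                pc = arg
        elif op == "load_add":
            regs[reg] += memory[regs[arg]]
        elif op == "inc":
            regs[reg] += 1
        elif op == "jmp":
            pc = arg
    return regs["total"], executed


def run_single_purpose(memory):
    """Dedicated datapath: one accumulate per cycle; return (total, cycles)."""
    total = 0
    cycles = 0
    for value in memory:
        total += value
        cycles += 1
    return total, cycles


data = [3, 1, 4, 1, 5]
print(run_general_purpose(SUM_PROGRAM, data))  # (14, 21)
print(run_single_purpose(data))                # (14, 5)
```

The interpreter can run any program written in its instruction set (flexibility), while the dedicated version does only this one job but needs far fewer steps under the assumed cost model.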

Mixing General and Single Purpose Processors
A.k.a. hardware/software partitioning.
Hardware: single-purpose processors (coprocessor, accelerator, peripheral, etc.).
Software: general-purpose processors (though hardware underneath!).
Especially important for embedded systems: computers embedded in devices (cameras, cars, toys, even people). Speed, cost, time-to-market, power, size, … demands are tough.
[Figure: digital camera chip, with lens and CCD, containing a microcontroller, CCD preprocessor, pixel coprocessor, A2D, D2A, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD control, display control, and multiplier/accumulator.]

How is Partitioning Done for Embedded Systems?
Partitioning into hw and sw blocks is done early, during the conceptual stage.
Sw design is done separately from hw design.
Attempts since the late 1980s to automate partitioning have not yet been successful: partitioning manually is reasonably straightforward; the spec is informal and not machine readable; sw algorithms may differ from hw algorithms; there is no compelling need for tools.
[Figure: informal spec -> system partitioning -> sw spec, hw spec -> sw design, hw design -> processor, ASIC.]

New Platforms Invite New Efforts in Hw/Sw Partitioning
New single-chip platforms contain both a general-purpose processor and an FPGA.
FPGA: field-programmable gate array. Programmable just like software, and therefore flexible. Intended largely to implement single-purpose processors.
Can we perform a later partitioning to improve the software too?
[Figure: informal spec -> system partitioning -> sw spec, hw spec -> sw design, hw design -> partitioning -> processor + FPGA, ASIC.]

Commercial Single-Chip Microprocessor/FPGA Platforms
Triscend E5 (2000): based on an 8-bit 8051 CISC core; 10 Dhrystone MIPS at 40 MHz; up to 40K logic gates; costs only about $4.
[Figure: Triscend E5 chip, showing configurable logic, the 8051 processor plus other peripherals, and memory.]

Single-Chip Microprocessor/FPGA Platforms
Atmel FPSLIC (Field-Programmable System-Level IC): based on an AVR 8-bit RISC core; 20 Dhrystone MIPS; 5k-40k logic gates; $5-$10.
Courtesy of Atmel.

Single-Chip Microprocessor/FPGA Platforms
Triscend A7 (2001): based on an ARM7 32-bit RISC processor; 54 Dhrystone MIPS at 60 MHz; up to 40k logic gates; $10-$20 in volume.
Courtesy of Triscend.

Single-Chip Microprocessor/FPGA Platforms
Altera’s Excalibur EPXA 10 (2002): ARM (922T) hard core; 200 Dhrystone MIPS at 200 MHz; ~200k to ~2 million logic gates.
Source: www.altera.com

Single-Chip Microprocessor/FPGA Platforms
Xilinx Virtex II Pro (2002): PowerPC based; 420 Dhrystone MIPS at 300 MHz; 1 to 4 PowerPCs; 4 to 16 gigabit transceivers; 12 to 216 multipliers; millions of logic gates; 200k to 4M bits RAM; 204 to 852 I/O; $100-$500 (>25,000 units).
[Figure: chip showing the PowerPCs, config. logic, and up to 16 serial transceivers, 622 Mbps to 3.125 Gbps. Courtesy of Xilinx.]

Single-Chip Microprocessor/FPGA Platforms
Why wouldn’t future microprocessor chips include some amount of on-chip FPGA?
One argument against: area. Lots of silicon area is taken up by the FPGA; an FPGA is about 20-30 times less area efficient than custom logic. FPGAs used to be for prototyping, too big for final products.
But chip trends imply that FPGAs will be O.K. in final products…

How Much is Enough? Perhaps a bit small.

How Much is Enough? Reasonably sized.

How Much is Enough? Probably plenty big for most of us.

How Much is Enough? More than typically necessary.

How Much Custom Logic is Enough?
1993: ~1 million logic transistors. Perhaps a bit small.
For scale: 8-bit processor: 50,000 tr.; Pentium: 3 million tr.; MPEG decoder: several million tr.
[Figure: IC package and IC.]

How Much Custom Logic is Enough?
1996: ~5-8 million logic transistors. Reasonably sized.

How Much Custom Logic is Enough?
1999: ~10-50 million logic transistors. Probably plenty big for most of us.

How Much Custom Logic is Enough?
2002: ~100-200 million logic transistors. More than typically necessary.

How Much Custom Logic is Enough?
2008: >1 BILLION logic transistors. Perhaps very few people could design this.

Very Few Companies Can Design High-End ICs
[Figure: design productivity gap, 1981 to 2009. One axis: logic transistors per chip (in millions); other axis: productivity (K transistors per staff-month); both logarithmic. IC capacity follows Moore’s Law, productivity grows more slowly, and the gap between the two curves widens. Source: ITRS’99.]
Designer productivity is growing at a slower rate.
1981: 100 designer-months -> ~$1M
2002: 30,000 designer-months -> ~$300M
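As a quick check on the two cost figures, both imply the same rate of roughly $10K per designer-month, so the rise in cost comes entirely from the rise in effort. The helper name `cost_per_designer_month` is made up for this sketch; the numbers are the ones on the slide.

```python
def cost_per_designer_month(total_cost_dollars, designer_months):
    """Implied cost of one designer-month."""
    return total_cost_dollars / designer_months


rate_1981 = cost_per_designer_month(1_000_000, 100)        # 10000.0
rate_2002 = cost_per_designer_month(300_000_000, 30_000)   # 10000.0
effort_growth = 30_000 / 100                               # 300.0 times more designer-months
```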

Single-Chip Platforms with On-Chip FPGAs
So, big FPGAs on-chip are O.K., because mainstream designers couldn’t have used all that silicon area anyway (it is becoming out of reach of mainstream designers).
But couldn’t designers use custom logic instead of FPGAs to make smaller chips and save costs?

Shrinking Chips
Yes, but there’s a limit: chips are becoming pin limited. A football huddle can only get so small.
[Figure: a shrinking chip ringed by pads connecting to external pins. This area will exist whether we use it all or not.]

Trend Towards Pre-Fabricated Platforms: ASSPs
ASSP: application specific standard product. A domain-specific pre-fabricated IC, e.g., a digital camera IC.
ASIC: application specific IC.
ASSP revenue > ASIC.
ASSP design starts > ASIC. (Design start: a unique IC design; ignores the quantity of the same IC.)
ASIC design starts are decreasing, due to the strong benefits of using pre-fabricated devices.
Source: Gartner/Dataquest, September ’01.

Microprocessor/FPGA Platforms
Trends point towards such platforms increasing in popularity.
Can we automatically partition the software to utilize the FPGA, for improved speed and energy?

Automatic Hardware/Software Partitioning
Since the late 1980s, the goal has been spec in, hw/sw out. But no successful commercial tool yet. Why?
    // From MediaBench’s JPEG codec
    GLOBAL(void)
    jpeg_fdct_ifast (DCTELEM * data)
    {
      DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
      DCTELEM tmp10, tmp11, tmp12, tmp13;
      DCTELEM z1, z2, z3, z4, z5, z11, z13;
      DCTELEM *dataptr;
      int ctr;
      SHIFT_TEMPS

      /* Pass 1: process rows. */
      dataptr = data;
      for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
        tmp0 = dataptr[0] + dataptr[7];
        tmp7 = dataptr[0] - dataptr[7];
        tmp1 = dataptr[1] + dataptr[6];
        …
    // Thousands of lines like this in dozens of files
[Figure: the ideal flow. “Spec” -> partitioner -> software (compilation -> processor) and hardware (synthesis -> ASIC/FPGA).]

Why No Successful Tool Yet?
Most research has focused on extensive exploration, with roots in VLSI CAD: decompose the problem into fine-grained operations, then apply sophisticated partitioning algorithms.
Examples: min-cut, dynamic programming, simulated annealing, tabu-search, genetic evolution, etc.
Is this overkill?
[Figure: “Spec” with 1000s of nodes (like circuit partitioning) -> partitioner.]

We Really Only Need Consider a Few Loops, Due to the 90-10 Rule
Recent appearance of embedded benchmark suites enables analysis -> understanding of the real problem.
We’ve examined UCLA’s MediaBench, Netbench, and Motorola’s Powerstone. Currently examining EEMBC (embedded equivalent of SPEC).
UCR loop analysis tools are based on SimpleScalar and Simics.
[Code: the jpeg_fdct_ifast excerpt from MediaBench’s JPEG codec, as on the “Automatic Hardware/Software Partitioning” slide.]
[Plot: each loop was assigned a number (1 to 10), sorted by fraction of contribution to total execution time.]
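The ranking described on this slide (number the loops, sort by contribution to total execution time) could be sketched as below. This is not the UCR loop analysis tool; the function names and the cycle counts in `profile` are hypothetical values chosen only to show the shape of the computation.

```python
def rank_loops(loop_cycles, total_cycles):
    """Return (loop name, fraction of total time) pairs, largest contributor first."""
    fractions = [(name, cycles / total_cycles) for name, cycles in loop_cycles.items()]
    return sorted(fractions, key=lambda item: item[1], reverse=True)


def cumulative_coverage(ranked):
    """Running total of execution-time fraction covered by the first k loops."""
    covered = 0.0
    coverage = []
    for _, fraction in ranked:
        covered += fraction
        coverage.append(covered)
    return coverage


# Hypothetical profile: cycles spent in each loop out of 1000 total cycles.
profile = {"loop_d": 20, "loop_a": 500, "loop_c": 150, "loop_b": 250}
ranked = rank_loops(profile, total_cycles=1000)
coverage = cumulative_coverage(ranked)
```

With these made-up numbers the first three loops already cover about 90% of the run time, which is the pattern the 90-10 rule describes.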

The 90-10 Rule Holds for Embedded Systems
In fact, the most frequent loop alone took 50% of the time, using 1% of the code.

So Need We Only Consider the First Few Loops? Not Necessarily
What if programs were self-similar w.r.t. the 90-10 rule? Remove the most frequent loop: does the 90-10 rule still hold? Intuition might say yes: remove the loop, and we have another program.
[Plots: % remaining execution time per loop (loops 1 to 10); speedup per loop (loops 1 to 10).]
So we need only speed up the first few loops. After that, speedups are limited. Good from a tool perspective!
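One way to read the self-similarity question is to ask what share of the remaining execution time each successive loop takes once the earlier loops are removed. The sketch below computes that share; the function name and both example profiles are assumptions for illustration, not data from the benchmarks.

```python
def share_of_remaining(loop_fractions):
    """For each loop (sorted most frequent first), its share of the time that
    remains after all earlier loops have been removed."""
    shares = []
    remaining = 1.0
    for fraction in loop_fractions:
        shares.append(fraction / remaining)
        remaining -= fraction
    return shares


# Hypothetical self-similar program: every loop takes half of what is left.
self_similar = share_of_remaining([0.5, 0.25, 0.125])

# Hypothetical program where later loops matter less and less.
tapering = share_of_remaining([0.5, 0.125, 0.0625])
```

In the self-similar case every loop stays equally worth moving to hardware; in the tapering case the payoff drops quickly after the first loop, which matches the slide's conclusion that only the first few loops need attention.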

Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips
Used multimeter and timer to measure performance and power.
Obtained good speedups and energy savings by partitioning software among microprocessor and on-chip FPGA.
[Figure: E5 IC; Triscend A7 development board.]

Simulation-Based Results for More Benchmarks
(Quicker than physical implementation; results matched reasonably well.)
Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg).

Looking at Multiple Loops per Benchmark
Manually created several partitioned versions of each benchmark.
Most speedup gained with first 20,000 gates. Surprisingly few gates!
Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002.
Stitt and Vahid, IEEE Design and Test, Dec. 2002.
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).

Ideal Speedups for Different Architectures
Varied loop speedup ratio (sw time / hw time of loop itself) to see impact of faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5 and 2.
Loop speedups of 5 or more work fine for the first few loops, and are not hard to achieve.
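The effect of the loop speedup ratio on whole-program speedup can be sketched with Amdahl's law. This is a simplified model, not the authors' simulation: it assumes a single loop is moved to hardware and ignores communication overhead. The loop fraction of 0.5 reuses the earlier observation that the most frequent loop took 50% of the time; the function name is made up.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: the loop's share of time shrinks by loop_speedup,
    the rest of the program is unchanged."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


# Loop speedup ratios from the slide, applied to a loop taking half the run time.
ratios = [30, 20, 10, 5, 2]
speedups = {ratio: overall_speedup(0.5, ratio) for ratio in ratios}
```

Under this model the overall speedup can never exceed 2 for such a loop, and a ratio of 5 already reaches about 1.67, which is consistent with the slide's remark that ratios of 5 or more work fine.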

Ideal Energy Savings for Different Architectures
Varied loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), 1.0.
Energy savings are quite resilient to variations.
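A rough model shows why energy savings can be resilient to the power ratio. The assumptions here are mine, not the slides': one loop moves to the FPGA, the microprocessor draws no power while the FPGA runs the loop, and energy is simply power times time. The loop fraction (0.5) and loop speedup (10, the base case of the preceding slide) are reused for illustration.

```python
def energy_savings(loop_fraction, loop_speedup, power_ratio):
    """Fraction of energy saved relative to the all-software version.

    power_ratio is FPGA power / microprocessor power. The loop runs
    loop_speedup times faster on the FPGA but at power_ratio times the power.
    """
    relative_energy = (1.0 - loop_fraction) + power_ratio * loop_fraction / loop_speedup
    return 1.0 - relative_energy


# Power ratios from the slide.
savings = {ratio: energy_savings(0.5, 10, ratio) for ratio in [2.5, 2.0, 1.5, 1.0]}
```

Because the loop's run time shrinks tenfold, even a 2.5x higher FPGA power only moves the savings in this model from about 45% to about 37.5%.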

How is Automated Partitioning Done?
Previous data obtained manually.
[Figure: informal spec -> system partitioning -> sw spec, hw spec -> sw design, hw design -> partitioning -> processor + FPGA, ASIC.]

Source-Level Partitioning
Flow: SW source -> compiler front-end -> hw/sw partitioning -> compiler back-end (assembly & object files) and hw source -> assembler & linker (binary) and synthesis (netlists) -> processor and FPGA.
Front-end converts code into an intermediate format, such as SUIF (Stanford University Intermediate Format).
Intermediate format is explored for hardware candidates.
Binary is generated from assembling and linking. Hw source is generated and synthesized into a netlist.

Problems with Source-Level Partitioning
Though technically superior, source-level partitioning:
Disrupts the standard commercial tool flow significantly.
Requires a special compiler (ouch!).
Faces multiple source languages and changing source languages.
Leaves open how to deal with library code, assembly code, and object code.
[Figure: C source, C++ source, and Java source feeding a compiler front-end (?); C SUIF compiler, C++ SUIF compiler.]

Binary Partitioning
Flow: SW source -> compilation (assembly & object files) -> assembler & linker -> binary -> hw/sw partitioning -> updated binary and hw source -> synthesis (netlists) -> processor and FPGA.
Source code is first compiled and linked in order to create a binary.
Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning.
HDL is generated and synthesized, and the binary is updated to use the hardware.

Binary-Level Partitioning Results (ICCAD’02)
Source-level: average speedup 1.5; average energy savings 27%; average 4,361 gates.
Binary-level: average speedup 1.4; average energy savings 13%; large area overhead, averaging 10,325 gates.

Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning
Dynamic software optimization is gaining interest, e.g., HP’s Dynamo. What better optimization than moving to FPGA?
Add a component on-chip that: (1) detects the most frequent sw loops; (2) decompiles a loop; (3) performs compiler optimizations; (4) synthesizes to a netlist; (5) places and routes the netlist onto a (simple) FPGA; (6) updates the sw to call the FPGA.
Self-improving IC: can be invisible to the designer; appears as an efficient processor.
HARD! Much future work.
[Figure: chip with processor, I$, D$, mem, DMA, config. logic, and profiler proc.]
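The first step, detecting the most frequent software loops, might look like the sketch below. The slides do not say how the on-chip profiler works; this example assumes loops show up as taken backward branches, and the trace format (pairs of branch address and target address) and the function name are hypothetical.

```python
from collections import Counter


def most_frequent_loops(branch_trace, top_n=2):
    """Count taken backward branches and return the hottest loops.

    branch_trace: iterable of (branch_address, target_address) pairs.
    Returns [((loop_start, loop_end), count), ...], most frequent first.
    """
    counts = Counter()
    for branch_addr, target_addr in branch_trace:
        if target_addr < branch_addr:  # backward branch: likely a loop back-edge
            counts[(target_addr, branch_addr)] += 1
    return counts.most_common(top_n)


# Hypothetical trace: one hot loop, one cooler loop, and some forward branches.
trace = [(0x40, 0x20)] * 100 + [(0x90, 0x80)] * 10 + [(0x10, 0x50)] * 5
hot_loops = most_frequent_loops(trace)
```

Only the loops this step reports would then go through the remaining, much harder steps (decompilation, optimization, synthesis, place and route).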

Conclusions
Hardware/software partitioning can significantly improve software speed and energy.
Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive.
Successful commercial tool still on the horizon. Binary-level partitioning may help in some cases. Source-level can yield massive parallelism (Profs. Najjar/Payne).
Future dynamic hw/sw partitioning possible?
Distinction between sw/hw continually being blurred!
Many people involved: Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others…
Support from NSF, Triscend, and soon SRC…
Exciting new directions!