Parameterized Embedded Systems Platforms Frank Vahid Students: Tony Givargis, Roman Lysecky, Susan Cotterell Dept. of Computer Science and Engineering.

Presentation transcript:

Parameterized Embedded Systems Platforms Frank Vahid Students: Tony Givargis, Roman Lysecky, Susan Cotterell Dept. of Computer Science and Engineering University of California, Riverside Member, Center for Embedded Computer Systems, UC Irvine The Dalton Project Supported by: NSF, NEC

2 Outline Introduction Parameterized SOC platforms Exploring parameter configurations Future direction: self-optimizing platforms Conclusions

3 Introduction: Advent of system-on-a-chip
[Figure labels: a board carrying a microprocessor IC, memory IC, peripheral IC, and FPGA IC; a single IC containing a microprocessor core (aka “IP”) and a peripheral core]

4 System-on-a-chip (SOC) Introduction

5 The Productivity Gap [ITRS99]

6 Programmable Platforms (ITRS99)
Pre-fabricated IC, synthesizable HDL, or both
– “reference designs” (VLSI), “silicon platforms” (Philips), “fig chips” (Vahid/Givargis99)
[Figure: programmable platform with microprocessor, cache, memory, DMA, bridge, FPGA, and peripherals on a system bus and a peripheral bus]

7 Targeted to Embedded Systems May drive future architecture design [Patterson98] Varied power/performance/size constraints –Programmable platforms must adapt Introduction

8 Adapting platforms to constraints
One solution: architectural parameters
[Figure: the programmable platform (microprocessor, cache, memory, DMA, bridge, FPGA, peripherals, system bus, peripheral bus) beside Application1 and Application2, each a main() containing a while (…) { … } loop and each shown with its own cache]

9 Related work
Pleiades project [Rabaey97]
VLSI’s Velocity
Philips’ Y-Chart approach
Microprocessor + FPGA
Microcontrollers
[Figure labels: Architecture, Applications, Mapping, Analysis, Numbers; “Our focus” marked]

10 Outline Introduction Parameterized SOC platforms Exploring parameter configurations Future direction: self-optimizing platforms Conclusions

11 Basic parameters -- cache
[Figure: the programmable platform (microprocessor, cache, memory, DMA, bridge, FPGA, peripherals, system bus, peripheral bus) with the cache called out]

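12 Basic parameters -- cache
[Figure: cache datapath with the address split into Tag, Index, and Offset; two ways of V, T, D entries; tag comparators (==) feeding a mux that selects the data. Parameters labeled: associativity, cache size, line size]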

13 Basic parameters -- bus
[Figure: the programmable platform (microprocessor, cache, memory, DMA, bridge, FPGA, peripherals) with the system bus and peripheral bus called out]

14 Basic parameters -- bus
Change bus width [Givargis98]
[Figure: two buses, each with a mux and demux at its ends, labeled C1 and C2, with C1 > C2]

15 Basic parameters -- bus
Encode data to reduce switching (Bus Invert) [Stan95]
[Figure: an encoder and decoder at each end of the bus plus an invert_ctrl line; an example comparing binary encoding with bus-invert encoding, labeled “Hamming Dist = ” and “Hamming Dist = 3”]
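The bus-invert idea on this slide can be illustrated with a minimal sketch. It is a simplified variant, not the exact scheme of [Stan95]: the decision compares only the data lines against the previous bus value, and the bus is assumed to start at all zeros. The function names are illustrative.

```python
def bus_invert_encode(words, width):
    """Encode data words for a `width`-bit bus using a simplified bus-invert rule.

    Returns a list of (bus_value, invert_ctrl) pairs. Assumption: the bus
    starts at 0, and only the data lines are compared when deciding to invert.
    """
    mask = (1 << width) - 1
    prev_bus = 0
    encoded = []
    for word in words:
        word &= mask
        distance = bin(prev_bus ^ word).count("1")  # Hamming distance
        if distance > width // 2:
            bus, invert = (~word) & mask, 1  # fewer lines flip if we invert
        else:
            bus, invert = word, 0
        encoded.append((bus, invert))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_ctrl) pairs."""
    mask = (1 << width) - 1
    return [((~bus) & mask) if invert else bus for bus, invert in encoded]


def count_transitions(values):
    """Total bit flips between consecutive values, starting from 0."""
    prev, total = 0, 0
    for value in values:
        total += bin(prev ^ value).count("1")
        prev = value
    return total


words = [0x00, 0xFF, 0x00, 0xF0]
encoded = bus_invert_encode(words, 8)
# Fold invert_ctrl in as a ninth line so its own toggles are counted too.
encoded_lines = [(invert << 8) | bus for bus, invert in encoded]
plain_flips = count_transitions(words)
encoded_flips = count_transitions(encoded_lines)
```

On this short sequence the plain binary bus flips 20 lines in total, while the encoded bus (including invert_ctrl) flips 6.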

16 Parameter definitions
Parameter
– An architectural feature that can be varied, with a small set of possible values, without changing the application’s essential functionality.
Configuration
– A selection of a particular value for every architecture parameter.
Static vs. dynamic parameter
– Static: Value is set before fabricating the IC.
– Dynamic: Value is set after fabricating the IC.

17 Potential tradeoffs experiment [ICCAD99]
[Figure: platform with microprocessor, I-cache, D-cache, memory, DMA, bridge, FPGA, and peripheral on a system bus and a peripheral bus]
Parameters                   Possible values
I-cache    Size              32k, 16k, 8k, 4k, 2k, 1k, 512, 256, 128
           Line              8, 16, 32
           Associativity     2, 4, 8
D-cache    Size              32k, 16k, 8k, 4k, 2k, 1k, 512, 256, 128
           Line              8, 16, 32
           Associativity     2, 4, 8
Mp-c bus   Data bus width    4, 8, 16, 32
           Data bus invert   on or off
Sys. bus   Data bus width    4, 8, 16, 32
           Data bus invert   on or off
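As a rough illustration of how a configuration space like this can be enumerated, the sketch below builds the full cross product of the listed values. The names are illustrative. The raw product comes to 419,904 combinations, which is larger than the 45,568 configurations the deck reports for the experiment, so the authors evidently restricted the space in some way that the transcript does not record; no such rule is guessed at here.

```python
from itertools import product

# Values as listed on the slide (cache sizes in bytes is an assumption).
CACHE_SIZES = [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
LINE_SIZES = [8, 16, 32]
ASSOCIATIVITIES = [2, 4, 8]
BUS_WIDTHS = [4, 8, 16, 32]
BUS_INVERT = [False, True]


def cache_configs():
    """All (size, line, associativity) combinations for one cache."""
    return list(product(CACHE_SIZES, LINE_SIZES, ASSOCIATIVITIES))


def bus_configs():
    """All (width, invert) combinations for one bus."""
    return list(product(BUS_WIDTHS, BUS_INVERT))


def all_configurations():
    """Yield (icache, dcache, mp_cache_bus, system_bus) tuples lazily."""
    caches, buses = cache_configs(), bus_configs()
    return product(caches, caches, buses, buses)


def raw_configuration_count():
    """Size of the unrestricted cross product."""
    return len(cache_configs()) ** 2 * len(bus_configs()) ** 2
```

Because `all_configurations()` is lazy, an exploration loop can filter or sample it without materializing every tuple.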

18 Potential tradeoffs experiment [ICCAD99]
[Figure: a C program drives an instruction set simulator, cache simulator, memory simulator, and bus simulator for the microprocessor, cache, and memory; each reports power, which is summed into total power]
Cache: Dinero [Edler, Hill]
ISS: [Tiwari96]

19 Potential tradeoffs experiment
[Plot: X-axis: execution time (sec); Y-axis: power (watt)]
Tradeoff between performance and power
Computed power for all 45,568 configurations
– For each of four C applications
– Used microprocessor, cache, and bus simulators (1 wk CPU)

20 Potential tradeoffs experiment
Narrower bus required a larger cache size
Bus: 8-1/32-1 I: 32k, 8, 8 D: 16k, 8, sec, 3.4 W, 30K
Bus: 16-1/32-1 I: 16k, 8, 16 D: 32k, 8, sec, 11.4 W, 21kG
Bus: 32-1/32-0 I: 16k, 4, 4 D: 16k, 4, sec, 43.6 W, 20kG

21 Potential tradeoffs experiment Performance varied by 11x Power varied by 13x Area varied by 1x Energy consumption varied by 2x Parameterized Systems-on-a-chip

22 Potential tradeoffs experiment
Bus: 8-1/4-0 I: 1k, 2, 4 D: 512, 2, 4 5 ms, .02 W, 18kG
Bus: 16-1/32-1 I: 1k, 4, 4 D: 512, 4, 8 3 ms, .07 W, 17kG
Bus: 32-1/32-1 I: 1k, 4, 4 D: 512, 4, 8 2 ms, .19 W, 15kG

23 Potential tradeoffs experiment Performance varied by 2.5x Power varied by 9.5x Area varied by 1x Energy consumption varied by 4x Parameterized Systems-on-a-chip

24 Potential tradeoffs experiment
How much variation in total system power and performance can we obtain just by varying the cache and bus parameters?
– 9 to 14x improvement in power/performance
How interdependent are these two types of parameters?
– Fixing cache parameter values, then selecting bus parameter values, results in non-optimal solutions

25 Many more parameters possible
Some examples include:
– Code compression (Henkel/Wolf)
– Address bus encoding
– Multiple levels of memory hierarchy
– CPU parameters (e.g., voltage scale, DP width)
– Peripheral core parameters (our current focus)
– Fertile research area
Can yield even larger tradeoffs if we:
– Create parameter-aware compiler
– Adapt OS?

26 Outline Introduction Parameterized SOC platforms Exploring parameter configurations Future direction: self-optimizing platforms Conclusions

27 Exploring parameter configurations
Low-level simulation
– Gate-level simulation: far too slow, days per configuration
– RT-level simulation: still slow, hours per configuration
Our approach
– System-level simulation: minutes per configuration
– System-level trace simulation: seconds per configuration
– System-level trace analysis: milliseconds per configuration

28 Evaluation by gate-level simulation
Capture each core in HDL, synthesize, simulate
[Figure: the programmable platform passes through HDL synthesis and HDL simulation to produce total power; a reconfigure step loops back]
Hours (often tens) per configuration

29 Evaluation by system-level simulation
[Figure: a C program and trace generator feed OO models, one simulator per core (instruction set, cache, memory, bus, DMA, bridge, peripheral); each reports power, which is summed into total power; a reconfigure step loops back]
Minutes per configuration, in contrast with hours per configuration

30 Evaluation by trace-simulation
Note that the cache simulator is non-functional
Same approach for others
– Get traces from a small number of system simulations
[Figure: a C program and trace generator produce instruction, address, and bus traces; OO non-functional models (instruction, cache, memory, bus, DMA, bridge, and peripheral trace simulators) report power, which is summed into total power; a reconfigure step loops back]
Seconds per configuration

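31 System simulation vs. trace simulation
[Figure: in system simulation, the system-level model (uP, DMA, UART) is executed to obtain power for each parameter evaluation; in trace simulation, the system-level model is executed to obtain traces, and trace simulators (uP, DMA, UART) then compute power for each parameter evaluation]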

32 Evaluation by trace-analysis
Further speedup
– Statistically characterize traces
– Still only a small number of system simulations
[Figure: a C program and trace generator produce instruction, address, and bus statistics; trace analyzers (instruction, cache, memory, DMA, bridge, peripheral) and a bus trace simulator apply equations to report power, which is summed into total power; a reconfigure step loops back]
Milliseconds per configuration

33 Trace-analysis approach for cache
Given
– A trace of memory refs
– Cache parameters: size (S), line/block-size (L), associativity (A)
Compute
– # of misses (N)
[Plot: # of misses (N) versus size (S)]
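For reference, the quantity N for one (S, L, A) point can be obtained by replaying the trace through a small cache model, as in the sketch below. This is plain trace simulation, not the deck's equation-based analysis, and it assumes LRU replacement and byte addresses, neither of which the slide specifies. The function name is illustrative.

```python
from collections import OrderedDict


def count_misses(trace, size, line, assoc):
    """Count misses for a byte-address trace on a set-associative cache.

    Assumptions (not from the slides): LRU replacement, cold cache at start,
    and `size` and `line` given in bytes.
    """
    num_sets = size // (line * assoc)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line
        index = block % num_sets
        tag = block // num_sets
        ways = sets[index]
        if tag in ways:
            ways.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(ways) >= assoc:
                ways.popitem(last=False)   # evict least recently used
            ways[tag] = True
    return misses


trace = [0, 4, 8, 64, 0, 128, 0, 64]
misses_small = count_misses(trace, size=128, line=8, assoc=2)
misses_large = count_misses(trace, size=256, line=8, assoc=2)
```

Sweeping `size` over the candidate values while holding `line` and `assoc` fixed gives the points of an N-versus-S curve for that trace.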

34 Trace-analysis approach for cache Exploring Parameter Configurations

35 Trace-analysis approach for cache Capture improvements obtainable by: –changing line-size at small/large values of cache-size –changing associativity at small/large values of cache-size Exploring Parameter Configurations

36 Trace-analysis approach for bus Exploring Parameter Configurations Items/second Bus width Num transfers per item Random data capacitance

37 Trace-analysis approach for bus
Bus equation, in terms of:
– m items/second (denotes the traffic N on the bus)
– n bits/item
– k-bit-wide bus
– bus-invert encoding
– random data assumption
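The equation itself did not survive in the transcript. The sketch below only shows how an estimate could be assembled from the listed quantities, and it should not be read as the authors' formula. It assumes each item takes ceil(n/k) transfers, that under random data every line flips with probability 1/2 per transfer, that bus-invert treats the k data lines plus one invert line as k+1 such lines and never flips more than half of them, and that each flip costs ½·C·V². The parameters `c_line` and `vdd` are hypothetical.

```python
from math import comb


def transfers_per_second(m, n, k):
    """m items/second, n bits/item, k-bit bus: ceil(n/k) transfers per item."""
    return m * -(-n // k)


def expected_toggles_per_transfer(k, bus_invert):
    """Expected line flips per transfer under the random-data assumption."""
    if not bus_invert:
        return k / 2
    lines = k + 1  # data lines plus the invert line
    weighted = sum(comb(lines, d) * min(d, lines - d) for d in range(lines + 1))
    return weighted / 2 ** lines


def bus_power(m, n, k, bus_invert, c_line, vdd):
    """Estimated switching power in watts (c_line in farads, vdd in volts)."""
    toggles = transfers_per_second(m, n, k) * expected_toggles_per_transfer(k, bus_invert)
    return 0.5 * c_line * vdd ** 2 * toggles
```

Under these assumptions an 8-bit bus averages 4 flips per transfer with binary encoding and about 3.27 with bus-invert.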

38 Trace-analysis experiments
[Figure labels: CPU, I-Cache, D-Cache, Memory, Bus A, Bus B, Peripheral Bus, Bridge, Peripheral 1, Peripheral 2, Peripheral n]
Cache parameters
– size: 128, 256, 512, 1k, 2k, 4k, 8k, 16k, 32k
– assoc: 2, 4, 8
– line: 8, 16, 32
Bus parameters
– width: 4, 8, 16, 32
– code: binary/bus-invert
Analyzed 45K sets exhaustively for each of 4 examples.

39 Experiment Results Diesel application’s performance Blue (light-gray) is system-simulation-based Red (dark-gray) is trace-analysis-based 4% error 320x faster Exploring Parameter Configurations

40 Experiment Results Diesel application’s energy consumption Blue (light-gray) is obtained using full simulation Red (dark-gray) is obtained using our equations 2% error 420x faster Exploring Parameter Configurations

41 Experiment Results CKey application’s performance Blue (light-gray) is obtained using full simulation Red (dark-gray) is obtained using our equations 8% error 125x faster Exploring Parameter Configurations

42 Experiment Results CKey application’s energy consumption Blue (light-gray) is obtained using full simulation Red (dark-gray) is obtained using our equations 3% error 125x faster Exploring Parameter Configurations

43 Experiment Results
x speedup
1-18% absolute error (power & performance)
2% average power error
[Plot axes: Time (hours); Power Error (%)]

44 Techniques for general cores
Earlier experiments were for uP/cache/bus
System simulation for other cores (ISSS’00)
– Isolate “instructions” in system-level model
– Gate-level simulation per instruction
– Back-annotate system-level model’s instructions
– Similar to technique for microprocessors, but must consider “power modes”

45 Trace approach for general cores
[Figure: the system-level model is executed to obtain traces; trace simulators (uP, DMA, UART) compute power for each parameter evaluation]
Full trace: Reset --; Quantize P1, P2, …, P64; IDCT P1, P2, …, P64; Quantize P1, P2, …, P64; IDCT P1, P2, …, P64
Reduced trace with characterized data: Reset --; Quantize .80; IDCT .72; Quantize .93; IDCT .63
Reduced trace with instructions only: Reset --; Quantize --; IDCT --; Quantize --; IDCT --
Reduced trace with instruction frequencies: Reset *1; Quantize *2; IDCT *2
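The last reduction on this slide, from a full trace to instruction frequencies, can be sketched as follows. The per-instruction energy values are hypothetical placeholders, not measurements from the deck, and the function names are illustrative.

```python
from collections import Counter


def reduce_to_frequencies(full_trace):
    """Collapse a full trace of (instruction, data) records into counts."""
    return Counter(instr for instr, _data in full_trace)


def energy_from_frequencies(freqs, energy_per_instr):
    """Estimate energy as the sum of count * per-instruction energy."""
    return sum(count * energy_per_instr[instr] for instr, count in freqs.items())


# Full trace in the shape shown on the slide: Reset carries no data,
# Quantize and IDCT each carry 64 values (dummy zeros here).
block = [0] * 64
full_trace = [
    ("Reset", None),
    ("Quantize", block),
    ("IDCT", block),
    ("Quantize", block),
    ("IDCT", block),
]
freqs = reduce_to_frequencies(full_trace)

# Hypothetical per-instruction energies in arbitrary units.
energy_per_instr = {"Reset": 1.0, "Quantize": 2.0, "IDCT": 3.0}
total_energy = energy_from_frequencies(freqs, energy_per_instr)
```

The counts reproduce the slide's "Reset *1, Quantize *2, IDCT *2". The reduced trace is far smaller than the full one, at the cost of ignoring how the data values affect each instruction's power.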

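46 Experiments with general cores: JPEG
[Table 1: trace file size (Kb) for ftrc, rtrc_cd, rtrc_i and CPU time for power evaluation (sec) for gate, sys, ftrc, rtrc_cd, rtrc_i, by pixel size (bits); the per-row values did not survive extraction]
average speedup: 6K, 12K, 62K, 67K
[Table 2: energy by pixel size (bits); the pixel sizes and the gate and ftrc values did not survive extraction]
           rtrc_cd          rtrc_i
           mJ     error     mJ     error
row 1      451    7%        491    17%
row 2      576    8%        632    19%
average error: ftrc 6%, rtrc_cd 7.5%, rtrc_i 18%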

47 Experiments with general cores: UART

48 Outline Introduction Parameterized SOC platforms Exploring parameter configurations Future direction: self-optimizing platforms Conclusions

49 Future directions
Earlier work
– Used software on workstation to explore parameter configurations
“Self-optimizing” platform
– Can we build the exploration ability into the platform itself?
– Transparent to the user: ease of use, more accurate metrics, wider acceptance
– “Embedded CAD”
[Figure: earlier flow, in which exploration software on the workstation sends a configuration to the platform; self-optimizing flow, in which the workstation sends a regular binary to a platform with built-in exploration ability]

50 Conclusions
Parameters can improve usefulness of programmable platforms
– By adapting platform to particular application and to power/performance constraints
Good tradeoff range even for basic parameters
Fast and accurate evaluation seems possible
Much work remains
– More parameters
– Better exploration
– Self-optimizing platforms