Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning
Frank Vahid, Associate Professor, Dept. of Computer Science and Engineering, UC Riverside

Presentation transcript:

Slide 1: Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning
Frank Vahid, Associate Professor, Dept. of Computer Science and Engineering, University of California, Riverside. Also with the Center for Embedded Computer Systems at UC Irvine.

Slide 2: Trend Towards Pre-Fabricated Platforms: ASSPs
ASSP: application-specific standard product, a domain-specific pre-fabricated IC (e.g., a digital camera IC).
ASIC: application-specific IC.
ASSP revenue > ASIC revenue.
ASSP design starts > ASIC design starts. A design start counts a unique IC design and ignores the quantity of the same IC.
ASIC design starts are decreasing, due to the strong benefits of using pre-fabricated devices.
Source: Gartner/Dataquest, September ’01.

Slide 3: Will High-End ICs Still Be Made?
Yes. The point is that mainstream designers likely won’t be making them; high-end ICs are becoming out of reach of mainstream designers.
They will be very-high-volume or very-high-cost products.
Platforms are one such high-volume product.
Platforms need to be highly configurable to adapt to different applications and constraints.

Slide 4: UCR Focus
Configurable cache.
Hardware/software partitioning.

Slide 5: UCR Focus
Configurable cache.
Hardware/software partitioning.

Slide 6: Configurable Cache: Why
[Figure: a pre-fabricated platform IC (a pre-designed system-level architecture) containing a uP with L1 cache, DSP, JPEG decoder, peripherals, and FPGA.]
ARM920T: caches consume half of total power (Segars 01).
M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99).

Slide 7: Best Cache for Embedded Systems?
Not clear: there is huge variety among popular embedded processors.
What’s the best associativity, line size, and total size?

Slide 8: Cache Associativity
Direct-mapped cache (1-way set associative): certain address bits “index” into the cache, and the remaining “tag” bits are compared.
Set-associative cache: multiple “ways”, with fewer index bits, more tag bits, and simultaneous comparisons. More expensive, but better hit rate.
[Figure: blocks A, B, C, D placed in a direct-mapped cache, where two blocks conflict on the same index, and in a 2-way set-associative cache, where C and D share a set.]
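
The index/tag split described on this slide can be sketched in a few lines. This is an illustrative sketch only: the function name `split_address` and the 8 Kbyte, 32-byte-line geometry in the usage note are assumptions chosen for the example, not values taken from this slide.

```python
def split_address(addr, cache_bytes, line_bytes, ways):
    """Split an address into (tag, index, offset) for a set-associative cache.

    Assumes cache_bytes, line_bytes and ways are powers of two.
    ways=1 gives a direct-mapped cache.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Two addresses 8 Kbytes apart collide on index 0 in a direct-mapped cache...
print(split_address(0x0000, 8192, 32, 1), split_address(0x2000, 8192, 32, 1))
# ...and still share set 0 in a 2-way cache, but two ways can hold both tags.
print(split_address(0x0000, 8192, 32, 2), split_address(0x2000, 8192, 32, 2))
```

Doubling the number of ways halves the number of sets, so one bit moves from the index to the tag, which matches "fewer index bits, more tag bits".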

Slide 9: Cache Associativity
Associativity reduces miss rate, thus improving performance.
What is the impact on power and energy? (Energy = Power * Time)

Slide 10: Associativity is Costly
Associativity improves hit rate, but at the cost of more power per access.
Are the power savings from reduced misses outweighed by the increased power per hit?
[Figures: energy access breakdown for an 8 Kbyte, 4-way set-associative cache (considering dynamic power only); energy per access for an 8 Kbyte cache.]

Slide 11: Associativity and Energy
The best-performing cache is not always the lowest-energy one.
[Figure callout: significantly poorer energy.]

Slide 12: Associativity Dilemma
Direct-mapped cache: good hit rate on most examples and low power per access, but poor hit rate on some examples, giving high power due to many misses.
Four-way set-associative cache: good hit rate on nearly all examples, but high power per access. Overkill for most examples, thus wasting energy.
Dilemma: design for the average or the worst case?
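
The dilemma can be made concrete with a toy energy model. The sketch below is only an illustration: the model (a fixed energy per access plus a fixed energy per miss) and every number in it are hypothetical, not measurements from the talk.

```python
def cache_energy(accesses, miss_rate, e_access, e_miss):
    """Toy dynamic-energy model: every access pays e_access, every miss pays e_miss extra."""
    misses = accesses * miss_rate
    return accesses * e_access + misses * e_miss


ACCESSES = 1_000_000
E_MISS = 50.0      # hypothetical cost of a miss (memory access plus CPU stall)
E_DIRECT = 1.0     # hypothetical energy per access, direct-mapped
E_4WAY = 2.0       # hypothetical energy per access, 4-way (more ways read per access)

# A typical program: both caches hit well, so the cheap direct-mapped cache wins.
typical_dm = cache_energy(ACCESSES, 0.010, E_DIRECT, E_MISS)
typical_4w = cache_energy(ACCESSES, 0.008, E_4WAY, E_MISS)

# A conflict-heavy program: direct-mapped misses dominate, so 4-way wins.
conflict_dm = cache_energy(ACCESSES, 0.100, E_DIRECT, E_MISS)
conflict_4w = cache_energy(ACCESSES, 0.010, E_4WAY, E_MISS)

print(typical_dm < typical_4w, conflict_4w < conflict_dm)
```

With these made-up numbers neither fixed design wins on both programs, which is the situation the slide describes.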

Slide 13: Associativity Dilemma
Obviously not a clear choice.
Previous work: Albonesi proposed a configurable cache with way-shutdown ability to save dynamic power. The Motorola M*CORE also supports way shutdown.

Slide 14: Our Solution: Way-Concatenatable Cache
Can be configured as 4, 2, or 1 way.
Ways can be concatenated.
[Figure callout: this bit selects the way.]

Slide 15: Configurable Cache Design: Way Concatenation (4, 2, or 1 way)
[Figure: cache datapath. Address fields: tag address a31..a13, bits a12 and a11, index a10..a5, line offset a4..a0. A configuration circuit with reg0 and reg1 produces signals c0..c3. Other labels: 6x64, tag part, data array, bitline, column mux, sense amps, mux driver, data output, critical path.]
Small area and performance overhead.
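
One way to read the address fields in this figure is sketched below. It is a guess at the mechanism, not the actual circuit: it assumes four banks, each indexed by a10..a5, with a11 and a12 acting either as tag bits or as bank-select bits depending on the configuration. The bank numbering and the function name `banks_to_probe` are hypothetical.

```python
LINE_OFFSET_BITS = 5   # a4..a0
INDEX_MASK = 0x3F      # six index bits, a10..a5


def banks_to_probe(addr, config_ways):
    """Return (row, banks) for an access under a 4-, 2-, or 1-way configuration.

    Assumption: in 4-way mode all banks are searched and a11/a12 are part of
    the tag; in 2-way mode a11 picks a pair of banks; in 1-way mode a12 and
    a11 together pick a single bank.
    """
    row = (addr >> LINE_OFFSET_BITS) & INDEX_MASK
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1
    if config_ways == 4:
        banks = [0, 1, 2, 3]
    elif config_ways == 2:
        banks = [2 * a11, 2 * a11 + 1]
    elif config_ways == 1:
        banks = [2 * a12 + a11]
    else:
        raise ValueError("config_ways must be 4, 2, or 1")
    return row, banks


print(banks_to_probe(0x1800, 4))  # all four banks searched
print(banks_to_probe(0x1800, 2))  # one pair of banks searched
print(banks_to_probe(0x1800, 1))  # a single bank searched
```

Under this reading, every configuration uses all four banks for storage, but lower associativity searches fewer banks per access.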

Slide 16: Way Concatenate Experiments
Experiment: Motorola PowerStone benchmark g3fax.
Considering dynamic power only: L1 access energy, CPU stall energy, and memory access energy.
Way concatenation outperforms 4-way and direct-mapped, and is just as good as way shutdown.

Slide 17: Way Concatenate Experiments
Considered 23 programs (Powerstone, MediaBench, and Spec2000).
Dynamic power only (L1 access energy, CPU stall energy, memory access energy).
Way concatenation is better than way shutdown (due to less performance penalty), saves energy over a conventional 4-way cache, and avoids the big penalties of 1-way on some programs.
[Chart legend: 100% = 4-way conventional cache.]

Slide 18: Way Concatenate Experiments
The best configuration varies.
The configuration needs to be tuned to a given program.

Slide 19: Normalized Execution Times
Way shutdown suffers a performance penalty, as does direct-mapped.
Way concatenation has almost no performance penalty, though its critical path is 3% longer than that of a conventional 4-way cache.

Slide 20: Way Shutdown for Static Power Savings
Albonesi and Motorola used logic to gate the clock, which reduced dynamic power but not static (leakage) power.
Way concatenation is clearly superior for reducing dynamic power.
Shutting down ways is still useful to save static power, but we’ll use another method (Agarwal DRG-cache).
[Figure: SRAM cell with gated-Vdd control between Vdd and Gnd, connected to the bitline.]

Slide 21: Way Concatenate Plus Way Shutdown
We set static power = 30% of dynamic power.
Way shutdown is now preferred in many examples, but way concatenation is still very helpful.

Slide 22: Configurable Line Size Too
The best line size also differs per example.
Our cache can be configured for a line size of 16, 32, or 64 bytes.
64 is usually best, but 16 is much better in a couple of cases.
[Chart legend: 100% = 4-way conventional cache; csb = concatenate plus shutdown cache.]
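
Tuning the configuration to a program can be pictured as a small search over associativity and line size. The sketch below is a simplification under stated assumptions: it uses a plain LRU cache simulator on an address trace, ignores way shutdown, and scores each configuration with a made-up energy formula. The names `count_misses` and `tune` and the cost constants are hypothetical.

```python
from collections import OrderedDict
from itertools import product


def count_misses(trace, cache_bytes, line_bytes, ways):
    """Count misses of an LRU set-associative cache on a list of byte addresses."""
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        lines = sets[block % num_sets]
        tag = block // num_sets
        if tag in lines:
            lines.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)   # evict the least recently used line
            lines[tag] = True
    return misses


def tune(trace, cache_bytes=8192):
    """Try every (ways, line size) pair and return (energy, ways, line_bytes) of the best."""
    best = None
    for ways, line_bytes in product((1, 2, 4), (16, 32, 64)):
        misses = count_misses(trace, cache_bytes, line_bytes, ways)
        # Hypothetical cost model: an access costs `ways` units,
        # a miss costs 20 units plus one unit per byte fetched.
        energy = len(trace) * ways + misses * (20 + line_bytes)
        if best is None or energy < best[0]:
            best = (energy, ways, line_bytes)
    return best


print(tune([0, 8192] * 100))            # two conflicting addresses
print(tune(list(range(0, 4096, 4))))    # a sequential sweep
```

On the conflicting trace the search picks a 2-way configuration with short lines, while on the sequential sweep it picks direct-mapped with long lines, illustrating why one fixed configuration does not suit every program.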

Slide 23: Configurable Cache
A configurable cache with way concatenation, way shutdown, and variable line size can save a lot of energy.
Well-suited for configurable devices like Triscend’s.

Slide 24: UCR Focus
Configurable cache.
Hardware/software partitioning.

Slide 25: Using On-Chip FPGA to Reduce Sw Energy
Hennessy/Patterson: “The best way to save power is to have less hardware” (pg 392).
Actually, the best way is to have less ACTIVE hw.
Paradoxically, MORE hw can actually REDUCE power, as long as overall activity is reduced.
How?

Slide 26: Using On-Chip FPGA to Reduce Sw Energy
[Figure: pre-fabricated platform IC with uP, L1 cache, DSP, JPEG decoder, peripherals, and FPGA.]
Move critical sw loops to the FPGA.
The loop executes in 1/10th the time.
Use this time to power down the system longer during the task period.
Alternatively, slow down the microprocessor using voltage scaling.
[Figure: timelines over one task period, showing uP active then idle without the FPGA, and uP plus FPGA active followed by a longer idle interval with it.]
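
The energy argument on this slide can be worked through with Energy = Power * Time. The sketch below is a back-of-the-envelope illustration: the function names and all power and time values are hypothetical, and it assumes the microprocessor sits idle while the FPGA runs the loop.

```python
def energy_sw_only(period, t_other, t_loop, p_active, p_idle):
    """Energy over one task period when the loop runs in software."""
    busy = t_other + t_loop
    return p_active * busy + p_idle * (period - busy)


def energy_with_fpga(period, t_other, t_loop, p_active, p_idle, p_fpga, speedup=10.0):
    """Energy over one task period when the loop runs on the FPGA.

    Assumes the uP draws idle power while the FPGA executes the loop.
    """
    t_hw = t_loop / speedup
    busy = t_other + t_hw
    return p_active * t_other + (p_fpga + p_idle) * t_hw + p_idle * (period - busy)


# Hypothetical numbers: the loop takes 9 of 10 time units in software, and the
# FPGA draws more power than the uP while it is running.
sw = energy_sw_only(period=10, t_other=1, t_loop=9, p_active=1.0, p_idle=0.1)
hw = energy_with_fpga(period=10, t_other=1, t_loop=9, p_active=1.0, p_idle=0.1, p_fpga=1.5)
print(sw, hw)
```

Even though the FPGA adds hardware and draws more power while active, the system spends far longer idle, so total energy per task period drops in this example.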

Slide 27: The 90-10 Rule (or ... rule)
Most software time is spent in a few small loops, e.g., in the MediaBench and NetBench benchmarks.
Known as the 90-10 rule: 10% of the code accounts for 90% of the execution time.
Move those loops to the FPGA.

Slide 28: Hardware/Software Partitioning Results
Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg).
Simulation based.

Slide 29: Analysis of Ideal Speedup
Each loop is 10x faster in hw (average based on observations).
Notice the leveling off after the first couple of loops (due to the 90-10 rule).
Thus, most speedup comes from the first few loops.
Good for us: a moderate amount of FPGA gives most of the speedup.
How much FPGA?
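
The ideal-speedup analysis follows Amdahl's law: if the loops moved to hardware account for a fraction f of execution time and each runs 10x faster, the overall speedup is 1 / ((1 - f) + f / 10). The sketch below applies this cumulatively; the per-loop time fractions are hypothetical, chosen only to resemble a program where a few loops dominate.

```python
def ideal_speedups(loop_fractions, hw_speedup=10.0):
    """Overall speedup after moving the first k loops to hardware, for k = 1..n.

    loop_fractions: fraction of total execution time spent in each loop,
    ordered from most to least frequent.
    """
    speedups = []
    moved = 0.0
    for fraction in loop_fractions:
        moved += fraction
        speedups.append(1.0 / ((1.0 - moved) + moved / hw_speedup))
    return speedups


# Hypothetical profile: the first two loops account for 80% of the time.
speedups = ideal_speedups([0.60, 0.20, 0.05, 0.03, 0.02])
print([round(s, 2) for s in speedups])
```

In this made-up profile the first two loops deliver most of the gain, and each later loop adds progressively less.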

Slide 30: Speedup Gained with Relatively Few Gates
Manually created several partitioned versions of each benchmark.
Most speedup is gained with the first 20,000 gates: surprisingly few gates.
References:
Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM), 2002.
Stitt and Vahid, IEEE Design and Test, Dec.
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).

Slide 31: Impact of Microprocessor/FPGA Clock Ratio
Previous data assumed equal clock frequencies.
A faster microprocessor has significant impact.
Analyzed 1:1, 2:1, 3:1, 4:1, and 5:1 ratios.
Planning additional such analyses: memory bandwidth, power ratios, and more.

Slide 32: Software Improvements Using On-Chip Configurable Logic: Verified Through Physical Measurement
Performed physical measurements on Triscend A7 and E5 devices.
Similar results (even a bit better).
[Figure: Triscend A7 development board, with the A7 IC labeled.]

Slide 33: Other Research Directions: Tiny Caches
Impact of tiny caches on instruction fetch power.
Filter caches, dynamic loop cache, preloaded loop cache.
References:
Gordon-Ross, Cotterell, Vahid, Comp. Arch. Letters, 2002.
Gordon-Ross, Vahid, ICCD.
Cotterell, Vahid, ISSS 2002 and ICCAD 2002.
Gordon-Ross, Cotterell, Vahid, IEEE TECS, 2002.
[Figure: processor fed through a mux by either a loop cache or the L1 cache / I-mem.]

Slide 34: Other Research Directions: Platform-Based CAD
Use the physical platform to aid the search of the configuration space.
Configure the cache and the hw/sw partition.
Configure, execute, and measure.
Goal: define the best cooperation between desktop CAD and the platform.
NSF grant (with N. Dutt at UC Irvine).

Slide 35: Other Research Directions: Dynamic Hw/Sw Partitioning
My favorite.
Add an on-chip component that:
detects the most frequent sw loops,
decompiles a loop,
performs compiler optimizations,
synthesizes to a netlist,
places and routes the netlist onto the FPGA,
updates the sw to call the FPGA.
Self-improving IC: can be invisible to the designer, and appears as an efficient processor.
Can also dynamically tune the cache configuration.
[Figure: processor with I$ and D$, memory, DMA, configurable logic, and a profiler with its own processor and memory.]
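
The first step of this flow, detecting the most frequent software loops, might be approximated by counting taken backward branches. This is only a software sketch of the idea, not the on-chip profiler itself: the trace format (pairs of branch address and target address) and the function name `hottest_loops` are assumptions made for illustration.

```python
from collections import Counter


def hottest_loops(taken_branches, top=2):
    """Return the most frequently executed loops as (loop_start, loop_end) pairs.

    taken_branches: iterable of (branch_pc, target_pc) for each taken branch.
    A taken branch whose target is at or before the branch is treated as the
    closing branch of a loop.
    """
    counts = Counter(
        (target_pc, branch_pc)
        for branch_pc, target_pc in taken_branches
        if target_pc <= branch_pc
    )
    return [loop for loop, _ in counts.most_common(top)]


# Hypothetical trace: one loop iterates 90 times, another 9 times,
# and one forward branch is taken once.
trace = [(0x40, 0x20)] * 90 + [(0x80, 0x60)] * 9 + [(0x10, 0x90)]
print(hottest_loops(trace))
```

The loops returned first would be the candidates handed to the later steps (decompilation, synthesis, place and route).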

Slide 36: Current Researchers Working in Embedded Systems at UCR
Prof. Frank Vahid: 5 Ph.D. students, 2 M.S.
Prof. Walid Najjar: 3 Ph.D. students, 1 M.S., working on hw/sw partitioning and on compiling C to FPGAs.
Prof. Tom Payne: 1 Ph.D. student, working on compiling C to FPGAs.
Prof. Jun Yang (new hire): working on low-power architectures (frequent value detection).
Prof. Harry Hsieh: 2 Ph.D. students, working on formal verification of system models.
Prof. Sheldon Tan (new hire): 1 Ph.D. student, working on physical design and analog synthesis.

Slide 37: Conclusions
Highly configurable platforms have a bright future. Cost equations just don’t justify ASIC production as much as before. Triscend parts are well situated; close collaboration desired.
Configurable cache improves memory energy. Tuning to a particular program is CRUCIAL to low energy. Way concatenation is effective at reducing dynamic power, way shutdown saves static power, and variable line size reduces traffic. All must be tuned to a particular program.
Configurable logic improves software energy, without requiring excessive amounts of hardware.
Many exciting avenues to investigate!