Analysis of Cache Tuner Architectural Layouts for Multicore Embedded Systems

Presentation transcript:

Analysis of Cache Tuner Architectural Layouts for Multicore Embedded Systems

Tosiron Adegbija 1, Ann Gordon-Ross 1+, and Marisha Rawlins 2
1 Department of Electrical and Computer Engineering, University of Florida, Gainesville, Florida, USA
2 Center for Information and Communication Technology, University of Trinidad and Tobago, Trinidad and Tobago
+ Also affiliated with the NSF Center for High-Performance Reconfigurable Computing

This work was supported by National Science Foundation (NSF) grant CNS

Introduction and Motivation

Embedded systems are ubiquitous and have stringent design constraints
– Increasing demand for high-performance embedded systems
Shift to multicore embedded systems
– Increases system and optimization complexity
Need optimizations that reduce power without increasing overheads
– Overheads: performance, area, etc.

Optimization – Cache Tuning

Caches are a good candidate for optimization
– Significant impact on power and performance
Different applications have different cache parameter value requirements
– Parameter values: cache size, line size, associativity
Cache tuning determines appropriate/optimal parameter values (cache configurations) to meet optimization goals (e.g., lowest energy)
[Figure: a microprocessor downloads an application; a cache tuner adjusts the tunable cache at runtime]
Cache tuning requires:
– a tunable/configurable cache
– tuning hardware (a cache tuner)

Configurable Cache Architecture

Configurable caches enable cache tuning
– Design space: combination of all possible configurations
A Highly Configurable Cache (Zhang '03), shown for an 8 KB, 4-way base cache built from 2 KB ways:
– Tunable associativity: way concatenation configures the base cache as 8 KB 4-way, 8 KB 2-way, or 8 KB direct-mapped
– Tunable size: way shutdown configures the cache down to 4 KB 2-way or 2 KB direct-mapped
– Tunable line size: logical line sizes are built from a 16-byte physical line size
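To make the notion of a design space concrete, the sketch below enumerates the configurations such a cache could expose. It is a minimal illustration, not the authors' tooling: the value lists and the legality rule (associativity bounded by the number of active 2 KB ways) are assumptions based on the 8 KB, 4-way base cache described above.

```python
from itertools import product

# Assumed tunable values for an 8 KB, 4-way base cache built from
# 2 KB ways with a 16-byte physical line size (illustrative only).
SIZES_KB = [2, 4, 8]          # way shutdown: 1, 2, or 4 active ways
ASSOCIATIVITIES = [1, 2, 4]   # way concatenation
LINE_SIZES = [16, 32, 64]     # logical lines built from 16-byte physical lines

def is_legal(size_kb, assoc):
    # Assumption: associativity cannot exceed the number of active 2 KB ways.
    return assoc <= size_kb // 2

design_space = [(s, a, l)
                for s, a, l in product(SIZES_KB, ASSOCIATIVITIES, LINE_SIZES)
                if is_legal(s, a)]

print(len(design_space), "configurations, e.g.", design_space[:3])
```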

Cache Tuners

The tuner evaluates different configurations to determine the best one
– Implements a tuning algorithm/heuristic to search the design space
– The best configuration satisfies the design objective (e.g., minimum energy or execution time)
– Can impose overheads (tuning delay, power, area) and complexity
Software cache tuners: intrusive to the application and the cache
– Can yield non-optimal, inferior configurations
Hardware cache tuners: non-intrusive to the application
– Must be low-overhead
Previous work targets single- and dual-core cache tuners
– More cores = more tuner overhead/complexity
[Figure: tuner overhead and complexity grow with the number of cores]
Need low-overhead cache tuners for multicore systems!
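As a rough software analogue of such a tuner, the sketch below searches the design space one parameter at a time, keeping the best value of each parameter before moving to the next. It is illustrative only: `measure_energy` is a hypothetical callback standing in for running the application (or an interval of it) under a configuration and returning its energy, and the parameter ordering is a common choice rather than the exact heuristic used in this work.

```python
def tune_cache(measure_energy):
    """Greedy one-parameter-at-a-time search (illustrative sketch)."""
    # Assumed parameter values and search order (size, then line size,
    # then associativity); a real tuner would also skip illegal combinations.
    search_order = {
        "size_kb":   [2, 4, 8],
        "line_size": [16, 32, 64],
        "assoc":     [1, 2, 4],
    }
    config = {"size_kb": 2, "line_size": 16, "assoc": 1}  # smallest config
    best_energy = measure_energy(config)

    for param, values in search_order.items():
        best_value = config[param]
        for value in values:
            trial = dict(config, **{param: value})
            energy = measure_energy(trial)
            if energy < best_energy:
                best_energy, best_value = energy, value
        config[param] = best_value  # fix this parameter, tune the next one

    return config, best_energy
```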

Cache Tuner Architectural Layouts – Global Tuner

Global tuner
– A single tuner serves all cores
– Shared-resource bottleneck increases tuning delay
– Little area/power overhead
[Figure: four cores, each with an L1 cache, sharing one tuner; overhead incurred: tuning delay]

Cache Tuner Architectural Layouts – Dedicated Tuners

Dedicated tuners
– A separate tuner for each core
– Resources are not shared, so tuning delay is lower
– More area/power overhead
[Figure: four cores, each with an L1 cache and its own tuner; overheads incurred: area and power]

Cache Tuner Architectural Layouts – Clustered Tuners

Clustered tuners
– A separate tuner for each subset (cluster) of cores
– Resources are shared within a cluster, incurring some tuning delay
– Trade off area/power against tuning delay (a first-order model is sketched below)
Cluster size must be carefully selected!
[Figure: four cores, each with an L1 cache, grouped into two 2-core clusters, each served by one tuner; overheads incurred: tuning delay, area, and power]
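The tradeoff referenced above can be seen with a first-order cost model. The per-tuner power, area, and per-core tuning delay below are arbitrary placeholder units, not measured values; the point is only that the tuner count shrinks while the serialized tuning delay grows as the cluster size increases.

```python
import math

def layout_costs(num_cores, cluster_size,
                 tuner_power=1.0, tuner_area=1.0, per_core_delay=1.0):
    """First-order model of a tuner layout (placeholder unit costs).

    cluster_size = 1         -> dedicated tuners
    cluster_size = num_cores -> a single global tuner
    """
    num_tuners = math.ceil(num_cores / cluster_size)
    return {
        "tuners": num_tuners,
        "power": num_tuners * tuner_power,   # grows with tuner count
        "area":  num_tuners * tuner_area,    # grows with tuner count
        # Cores sharing a tuner are tuned one after another.
        "worst_case_delay": cluster_size * per_core_delay,
    }

for cluster in (1, 2, 4, 8, 16):
    print(f"16 cores, {cluster} per cluster:", layout_costs(16, cluster))
```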

Contributions

Challenges
– Low-overhead tuners are required for systems with more than two cores
– Cluster sizes must be carefully selected to trade off power, area, and shared-resource contention
Our work
– Design custom cache tuners for multicore embedded systems: scalable and low-overhead
– Quantify tradeoffs of cache tuner architectural layouts
– Formulate essential design guidelines for cache tuners

Cache Tuner Hardware Implementation

Implemented cache tuners in multiple architectural layouts
– 2-, 4-, 8-, and 16-core systems
Each tuner consists of a state machine and a datapath
[Figure: global, dedicated, and clustered tuner layouts illustrated on a four-core system]

State Machine

[Figure: tuner state-machine diagram with three cooperating sub-machines driving the datapath via control signals]
– Parameter state (S0–S4): changes the parameter being tuned (adjust_parameter = none, size (KB), line_size (byte), or associativity); entered when start = 1
– Value state (V0–V4): changes the value of the parameter being tuned and sets the cache configuration bits (e.g., number of active ways); tune_again = 1 advances to the next parameter value, tune_again = 0 advances to the next parameter
– Calculation state (C0–C5): calculates energy consumption (C1 dynamic energy, C2 static energy, C3 write-back energy, C4 cache fill energy, C5 CPU stall energy); calc_start = 1 starts a calculation, calc_done = 1 signals its completion, and a busy bit indicates a calculation in progress

Datapath

[Figure: datapath block diagram]
– Inputs per core C(0)…C(n-1): accesses_p(n), total_cycles_p(n), miss_cycles_p(n), and write_backs_p(n), selected through a MUX under state-machine control
– A multiply-accumulate (MAC) unit and register compute dynamic_energy, static_energy, fill_energy, write_back_energy, and cpu_stall_energy using the configuration bits
– A comparator compares current_energy against previous_energy to decide whether the new configuration is better
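In software terms, the datapath evaluates an energy model of roughly the following form for each candidate configuration. This is a sketch under stated assumptions: the per-event energy constants are placeholders (in hardware they follow from the configuration bits), the `fills` counter stands in for however fill events are derived from the collected statistics, and the real datapath computes the sum with one time-multiplexed MAC unit rather than floating-point arithmetic.

```python
def total_cache_energy(stats, k):
    """Five-component energy model mirroring the datapath (sketch).

    stats: counters collected while the application runs under the
           candidate configuration (accesses, total_cycles, miss_cycles,
           write_backs, fills)
    k:     assumed per-event / per-cycle energy constants
    """
    dynamic    = stats["accesses"]     * k["per_access"]
    static     = stats["total_cycles"] * k["static_per_cycle"]
    write_back = stats["write_backs"]  * k["per_write_back"]
    fill       = stats["fills"]        * k["per_fill"]
    cpu_stall  = stats["miss_cycles"]  * k["stall_per_cycle"]
    return dynamic + static + write_back + fill + cpu_stall

def is_better(current_energy, previous_energy):
    """The comparator's role: keep the configuration with lower energy."""
    return current_energy < previous_energy
```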

Experimental Setup

Cache tuners evaluated
– Global and dedicated tuners: 2-, 4-, 8-, and 16-core systems
– Clustered tuners: 2-core clusters for 4-, 8-, and 16-core systems; 4-core clusters for 8- and 16-core systems; 8-core clusters for 16-core systems
Tuners modeled in synthesizable VHDL and synthesized with Synopsys Design Compiler
11 benchmarks from the SPLASH-2 benchmark suite
The SESC simulator provided cache statistics

Power and Area Trends

Power (average increase for each power-of-two increase in core count):
– Global tuner = 49%
– Dedicated tuners = 89%
– Clustered tuners = 111%
Area (average increase for each power-of-two increase in core count):
– Global tuner = 51%
– Dedicated tuners = 93%
– Clustered tuners = 100%
Linear increases: all layouts scale to future systems with more cores

Tuning Delay

Small clusters reduce the shared-tuner bottleneck in large systems
Average tuning delay reduction, normalized to the global tuner:
– Dedicated tuners: 3%, 20%, 21%, and 82% for 2-, 4-, 8-, and 16-core systems, respectively
– 2-core-cluster tuners: 17%, 18%, and 82% for 4-, 8-, and 16-core systems, respectively
– 4-core-cluster tuners: 1% and 78% for 8- and 16-core systems, respectively
– 8-core-cluster tuners: 77% for the 16-core system

Power/Area Compared to the Global Tuner

Power increase:
– Dedicated tuners = 149%
– 2-core-cluster tuners = 86%
– 4-core-cluster tuners = 56%
– 8-core-cluster tuners = 53%
Area increase:
– Dedicated tuners = 156%
– 2-core-cluster tuners = 87%
– 4-core-cluster tuners = 44%
– 8-core-cluster tuners = 23%
Clustered tuners can provide a good tradeoff in large systems

Overheads Imposed on the Microprocessor

Compared to a MIPS32 M14K 90 nm processor (12 mW power at 200 MHz; 0.21 mm2 area):
Power:
– Global tuner = 0.5%
– Dedicated tuners = 1.16%
– Clustered tuners = 0.6%
Area:
– Global tuner = 4.73%
– Dedicated tuners = 11.03%
– Clustered tuners = 5.17%
Our cache tuners constitute minimal overhead!
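To put these percentages in perspective, they can be converted back into absolute figures against the quoted MIPS32 M14K baseline. The snippet below is a simple arithmetic aid; the 5% inputs are example values, not results from the paper.

```python
BASE_POWER_MW = 12.0   # MIPS32 M14K at 200 MHz (90 nm)
BASE_AREA_MM2 = 0.21

def absolute_overhead(power_pct, area_pct):
    """Convert percentage overheads into absolute power (mW) and area (mm^2)."""
    return BASE_POWER_MW * power_pct / 100.0, BASE_AREA_MM2 * area_pct / 100.0

# Example: a 5% power overhead and a 5% area overhead
print(absolute_overhead(5.0, 5.0))  # (0.6, 0.0105)
```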

Conclusions

Cache tuning specializes a system's cache configuration to varying application requirements
The cache tuner must impose minimal power, performance, and area overhead
We presented low-overhead cache tuners
– Options: global, dedicated, and clustered tuners
– Scale to multiple cores
– Designers can select the appropriate layout based on design objectives
– Clustered tuners provide a power/area versus tuning-delay tradeoff
– The analysis is applicable to other tuning scenarios
Future work
– Evaluate cache tuners in systems with up to 128 cores
– Incorporate a lightweight communication network for the cache tuner, independent of the on-chip communication network

Questions?