Design Space Optimization of Embedded Memory Systems via Data Remapping
Presented by Xiang Mao
Department of Electrical & Computer Engineering, University of Florida

Outline
Introduction
Data Remapping and Design Space Exploration
Data Remapping Algorithm
Experimental Methodology
Result Analysis
Conclusions

Introduction
Memory in embedded systems is both a valuable resource and a power sink. Compile-time data remapping/reorganization improves spatial locality, and can be viewed in two ways: as a tool for design-space exploration, it lowers memory cost and power needs; as a conventional compiler optimization, it improves execution time and energy consumption. (A 45% improvement is reported in the context of two levels of cache.)

Introduction
Features: fully automated; applicable to pointer-based programming languages; running time linear in the size of the program.
All the work is based on hardware models and an instruction set architecture (ISA) for the ARM family of processors, using floating-point and integer benchmarks. The simulation environment models an ARM-like processor but also includes floating-point support.
Previous work in this area was (1) semi-automated and (2) restricted to statically allocated memory, whereas C programs, used extensively in the embedded-system domain, are usually pointer based.

Outline
Introduction
Data Remapping and Design Space Exploration
Data Remapping Algorithm
Experimental Methodology
Result Analysis
Conclusions

Design Space Exploration
The goal of design exploration, as illustrated in the figure, is to fix the program under consideration and to vary its performance via optimizations, in search of the best hardware configuration. In this paper, the authors focus on the cache subsystem and seek to optimize its energy and cost requirements.

Design Space Exploration
A negative value in execution-time reduction can occur because the cache size is reduced. SA110 = Intel StrongARM 110 processor; 179.art = floating-point benchmark; Perimeter and TreeAdd = integer benchmarks.

Outline
Introduction
Data Remapping and Design Space Exploration
Data Remapping Algorithm
Experimental Methodology
Result Analysis
Conclusions

Data Remapping Algorithm
Goal: a new layout that exhibits a better correlation with the application's reference sequence.
Target: record data types ubiquitous in real-world, pointer-heavy applications.
Record: a set of diverse data types grouped within a unique declaration. Field: an element of the set. Object: an instance of a record.

Record Model
(Figure: each record has a key field (K), a datum field (D), and a next field (N); objects are chained as K D N → K D N → K D N.)

Data Remapping Algorithm
The remapping optimization consists of three phases: a gathering phase, remapping of global data objects, and remapping of dynamic data objects.

Gathering Phase
NAP = Neighbor Affinity Probability. Only data types with a NAP lower than some threshold are marked for remapping.

Remapping of Global Data Objects

Remapping of Dynamic Data Objects
The need for cache-conscious data placement is even more important for dynamically allocated objects: traditional allocation strategies ignore the underlying memory hierarchy in favor of low run-time overhead, which results in poor interaction between data layout and program access pattern. Light-weight wrappers are automatically generated around traditional memory-allocation requests in the program: a large memory pool is allocated, and smaller portions within the pool are reassigned with successive allocation requests. The goal is to produce a field-allocation layout as in the figure.

Remapping of Dynamic Data Objects
Rely on a run-time comparison of the pointer value against the stack-pointer register to determine the proper offset.

Outline
Introduction
Data Remapping and Design Space Exploration
Data Remapping Algorithm
Experimental Methodology
Result Analysis
Conclusions

The Target Processor
Verilog model of an ARM-like processor. The core is synthesized using Synopsys Design Compiler targeted toward a TSMC 0.25μ library from LEDA Systems, Inc. System clock 100 MHz, 5-stage RISC; the processor core is about 250,000 NAND gates.

The Target Processor
The power consumption is constant. This is likely because, in a simple RISC processor with one ALU, the datapath is always busy, and thus the power variation is minimal.

Model of Cache Power Consumption
Assume the L1 and L2 caches to be SRAM and use the approach of Kamble and Ghose.
Drawbacks: the model needs runtime statistics such as hit/miss counts and the ratio of read/write requests, and it only accounts for dynamic power dissipation. (For 0.25μ technology, dynamic power ≈ 10² × static power.)

Outline
Introduction
Data Remapping and Design Space Exploration
Data Remapping Algorithm
Experimental Methodology
Result Analysis
Conclusions

Result Analysis
The benchmarks used here include floating-point and integer applications such as neural-network simulation, large database management, image matching, and scientific computation, drawn from the Data Intensive Systems (DIS), OLDEN, and SPEC2000 suites.

Result Analysis
Two benchmarks show no energy reduction; the others average 20–30%. One floating-point benchmark reaches 71%. Almost all show an execution-time reduction.

Result Analysis
The cache size is reduced by half.

Result Analysis
The cache size is reduced by half.

Result Analysis
Energy for the ARM-like core vs. L1+L2 cache.

Outline
Introduction
Data Remapping and Design Space Exploration
Data Remapping Algorithm
Experimental Methodology
Result Analysis
Conclusions

Conclusion
The paper proposes a novel compile-time data remapping algorithm that is applicable to pointer-intensive dynamic applications and leads to a 50% reduction in both the L1 and L2 cache sizes, yielding energy savings of 57%. It also improves the energy savings of the ARM-like core. Further work, such as adding in static (leakage) power, is still needed.

Thanks and Questions.