Simulation of Decode Filter Cache using SimpleScalar simulator Presented by Fei Hong.



Motivation & Goals
- Instruction fetches and decodes are the major on-chip power consumers.
- Reduce power consumption by reducing the number of instruction fetches and decodes.
- Simulate the Decode Filter Cache (DFC) architecture using SimpleScalar to test the performance of the DFC.

Prediction Mechanism
- Each sector in the DFC has the following fields: (tag, sector_valid, next_address).
- If A is not equal to C, a different control path will be taken: tag(A) != tag(C)   (1)
- A and B are consecutively accessed. If they belong to a small loop: tag(A) == tag(B)   (2)
- Based on (1) and (2), the prediction for the next fetch: tag(C) == tag(B)   (3)
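The sector fields and the small-loop test in (2) can be sketched in a few lines of Python. This is only an illustrative model, not the code added to SimpleScalar: the address split (4 offset bits, 5 index bits, taken from the 32-sector, 4-instruction, 4B-instruction parameters given later in the slides), the helper names, and the choice to predict a DFC hit when the recorded next_address shares the current tag are all assumptions.

```python
from dataclasses import dataclass

# Assumed address split: 4 instructions x 4 B = 16 B of code per sector,
# 32 direct-mapped sectors. These bit widths are an illustration only.
OFFSET_BITS = 4
INDEX_BITS = 5
NUM_SECTORS = 1 << INDEX_BITS


def sector_index(addr: int) -> int:
    """Direct-mapped sector number for an instruction address."""
    return (addr >> OFFSET_BITS) & (NUM_SECTORS - 1)


def sector_tag(addr: int) -> int:
    """Upper address bits stored in the sector's tag field."""
    return addr >> (OFFSET_BITS + INDEX_BITS)


@dataclass
class Sector:
    # Field names follow the slide: (tag, sector_valid, next_address).
    tag: int = 0
    sector_valid: bool = False
    next_address: int = 0


def record_fetch(sectors, addr, next_addr):
    """Hypothetical update: remember which address followed addr."""
    s = sectors[sector_index(addr)]
    s.tag = sector_tag(addr)
    s.sector_valid = True
    s.next_address = next_addr


def predict_next_in_dfc(sectors, addr):
    """Assumed heuristic: predict a DFC hit for the next fetch when the
    recorded next address has the same tag as the current one (small loop)."""
    s = sectors[sector_index(addr)]
    if not s.sector_valid or s.tag != sector_tag(addr):
        return False  # no usable history for this address
    return sector_tag(s.next_address) == s.tag


if __name__ == "__main__":
    dfc = [Sector() for _ in range(NUM_SECTORS)]
    record_fetch(dfc, 0x400100, 0x400110)
    print(predict_next_in_dfc(dfc, 0x400100))
```

With this split, two addresses share a tag only if they fall within the same 512B region of code, which is why a matching tag is taken here as a sign of a small loop.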

Working Process

The Platform
- Host computer: ACPI x86-based PC
- Host computer operating system: Microsoft Windows Vista Ultimate
- Virtual machine: VMware Workstation version 6.03
- Linux operating system: Fedora Core 6
- Simulator: SimpleScalar version 3.0

Work done so far
- Set up the platform.
- Read the source code of SimpleScalar.
- Applied my DFC structure and working process to SimpleScalar.
- Found benchmarks and compiled them on the platform.
- Ran simulations using the given memory hierarchy parameters.

MiBench
- dijkstra: constructs a large graph in an adjacency matrix representation, then calculates the shortest path between every pair of nodes using repeated applications of Dijkstra's algorithm.
- stringsearch: searches for given words in phrases using a case-insensitive comparison algorithm.
- rijndael encrypt/decrypt: Rijndael was selected as the National Institute of Standards and Technology's Advanced Encryption Standard (AES).
- CRC32: performs a 32-bit Cyclic Redundancy Check (CRC) on a file. CRC checks are often used to detect errors in data transmission.

Memory hierarchy parameters

Parameter      Value
Instr. size    4B
DFC            direct-mapped, 32 sectors, 4 decoded instr. per sector, 8B per decoded instr.
L1 I-cache     16KB, 2-way, 32B line, 1-cycle hit latency
L1 D-cache     8KB, 2-way, 32B line, 1-cycle hit latency
Memory         30-cycle latency
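As a quick check on what these parameters imply, the arithmetic below derives the DFC storage size and the number of sets in each L1 cache. The variable names are hypothetical. The final string follows SimpleScalar's `<name>:<nsets>:<bsize>:<assoc>:<repl>` cache configuration format; the LRU replacement policy (`l`) is an assumption, since the slides do not state one.

```python
# Values taken from the parameter table; names are illustrative only.
INSTR_BYTES = 4
DFC_SECTORS = 32
DECODED_PER_SECTOR = 4
DECODED_BYTES = 8


def num_sets(capacity_bytes: int, line_bytes: int, ways: int) -> int:
    """Sets in a set-associative cache: capacity / (line size * ways)."""
    return capacity_bytes // (line_bytes * ways)


# Storage for decoded instructions in the DFC (tags and valid bits excluded).
dfc_data_bytes = DFC_SECTORS * DECODED_PER_SECTOR * DECODED_BYTES

# Amount of original (undecoded) code the DFC can cover at once.
dfc_code_bytes = DFC_SECTORS * DECODED_PER_SECTOR * INSTR_BYTES

il1_sets = num_sets(16 * 1024, 32, 2)
dl1_sets = num_sets(8 * 1024, 32, 2)

# Assumed replacement policy 'l' (LRU); the slides do not specify one.
il1_option = f"il1:{il1_sets}:32:2:l"

if __name__ == "__main__":
    print(dfc_data_bytes, dfc_code_bytes, il1_sets, dl1_sets, il1_option)
```

Under these numbers the DFC holds 1KB of decoded instructions covering 512B of code, compared with a 16KB instruction cache.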

Simulation results
Figure: % reduction in instruction fetches and decodes

Simulation results
Figure: prediction hit rate

Simulation results
Table columns (benchmarks): dijkstra, stringsearch, rijndael, CRC32
Table rows (statistics): sim_num_insn, il1.accesses, il1.hits, il1.misses, il1.miss_rate, dfc.accesses, dfc.hits, dfc.misses, dfc.miss_rate

Conclusion
- The DFC stores decoded instructions and can be very small and energy-efficient.
- A hit in the DFC eliminates both the access to a much larger instruction cache and the entire decoding step.
- The simulation results show that most instruction fetches and decodes can be eliminated by using the DFC.
- Therefore, the DFC is a very efficient way to reduce the power consumption of embedded processors.

Thank you!