Simulation of Decode Filter Cache using SimpleScalar simulator Presented by Fei Hong.



Motivation & Goals
- Instruction fetches and decodes are the major on-chip power consumers
- Optimize power consumption by reducing instruction fetches and decodes
- Simulate the DFC architecture using SimpleScalar
- Test the performance of the DFC

Prediction Mechanism
- Each sector in the DFC has the following fields: (tag, sector_valid, next_address)
- If A is not equal to C, a different control path will be taken: tag(A) != tag(C) (1)
- A and B are consecutively accessed. If they belonged to a small loop: tag(A) == tag(B) (2)
- Based on (1) and (2), the prediction for the next fetch: tag(C) == tag(B) (3)
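The sector fields and the tag comparison in (3) can be sketched as below. This is only an illustration under stated assumptions, not the presenter's SimpleScalar modification: the address split (16 bytes of code per sector, 32 direct-mapped sectors) is borrowed from the memory hierarchy parameters slide, and the names `Sector`, `split_address`, and `predict_dfc_hit` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed geometry (from the parameters slide): 4-byte instructions,
# 4 instructions per sector, 32 direct-mapped sectors.
INSTR_SIZE = 4
INSTRS_PER_SECTOR = 4
NUM_SECTORS = 32
SECTOR_BYTES = INSTR_SIZE * INSTRS_PER_SECTOR  # 16 bytes of code per sector


@dataclass
class Sector:
    """One DFC sector with the three fields named on the slide."""
    tag: int = 0
    sector_valid: bool = False
    next_address: int = 0


def split_address(addr: int) -> Tuple[int, int]:
    """Return (tag, index) of a fetch address under the assumed geometry."""
    sector_number = addr // SECTOR_BYTES
    return sector_number // NUM_SECTORS, sector_number % NUM_SECTORS


def predict_dfc_hit(current_addr: int, predicted_next_addr: int) -> bool:
    """Condition (3): predict a DFC hit when the next fetch shares the tag."""
    next_tag, _ = split_address(predicted_next_addr)
    current_tag, _ = split_address(current_addr)
    return next_tag == current_tag
```

Under this assumed split, two addresses share a tag when they fall within the same 512-byte region of code, which is the case for fetches that stay inside a small loop.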

Working Process

The Platform
- Host computer: ACPI x86-based PC
- Host computer operating system: Microsoft Windows Vista Ultimate
- Virtual machine: VMware Workstation version 6.03
- Linux operating system: Fedora Core 6
- Simulator: SimpleScalar version 3.0

Work done so far
- Set up the platform
- Read the source code of SimpleScalar
- Apply my DFC structure and working process to SimpleScalar
- Find benchmarks and compile them on the platform
- Run simulations using the given memory hierarchy parameters

MiBench
- dijkstra: constructs a large graph in an adjacency matrix representation and then calculates the shortest path between every pair of nodes using repeated applications of Dijkstra's algorithm.
- stringsearch: searches for given words in phrases using a case-insensitive comparison algorithm.
- rijndael encrypt/decrypt: Rijndael was selected as the National Institute of Standards and Technology's Advanced Encryption Standard (AES).
- CRC32: performs a 32-bit Cyclic Redundancy Check (CRC) on a file. CRCs are often used to detect errors in data transmission.
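To give a feel for the kind of kernel the CRC32 benchmark exercises, here is a minimal bitwise CRC-32 sketch. It is not the MiBench source: it assumes the common reflected polynomial 0xEDB88320 (the variant that Python's `zlib.crc32` implements), and `crc32_bitwise` is a hypothetical name.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320, assumed)."""
    crc = 0xFFFFFFFF  # initial value: all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial if a 1 fell out.
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # final inversion


# Cross-check against the standard library on the usual check string.
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789")
```

The inner loop is short and executed for every input byte, which is the sort of tight loop whose instructions a small cache of decoded instructions could plausibly keep resident.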

Memory hierarchy parameters
Parameter     Value
Instr. size   4B
DFC           direct-mapped, 32 sectors, 4 decoded instr. per sector, 8B per decoded instr.
L1 I-cache    16KB, 2-way, 32B line, 1-cycle hit latency
L1 D-cache    8KB, 2-way, 32B line, 1-cycle hit latency
Memory        30-cycle latency
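The arithmetic behind these parameters can be worked through in a short sketch. It derives the DFC storage size and the number of sets in each L1 cache, and formats the L1 caches in SimpleScalar's `<name>:<nsets>:<bsize>:<assoc>:<repl>` cache specification form. The LRU replacement policy (`l`) is an assumption, since the slide does not state one, and the helper names are hypothetical.

```python
def num_sets(size_bytes: int, line_bytes: int, assoc: int) -> int:
    """Number of sets = capacity / (line size * associativity)."""
    sets, remainder = divmod(size_bytes, line_bytes * assoc)
    if remainder:
        raise ValueError("capacity is not a multiple of line size * associativity")
    return sets


def simplescalar_cache_spec(name: str, size_bytes: int, line_bytes: int,
                            assoc: int, repl: str = "l") -> str:
    """Format <name>:<nsets>:<bsize>:<assoc>:<repl>; repl='l' (LRU) is assumed."""
    return f"{name}:{num_sets(size_bytes, line_bytes, assoc)}:{line_bytes}:{assoc}:{repl}"


# DFC: 32 sectors x 4 decoded instructions x 8 B per decoded instruction.
DFC_STORAGE_BYTES = 32 * 4 * 8  # 1024 B of decoded-instruction storage
# The same sectors cover 32 x 4 instructions of 4 B each in the program image.
DFC_CODE_BYTES = 32 * 4 * 4     # 512 B of code

IL1_SPEC = simplescalar_cache_spec("il1", 16 * 1024, 32, 2)  # 256 sets
DL1_SPEC = simplescalar_cache_spec("dl1", 8 * 1024, 32, 2)   # 128 sets
```

So the DFC holds 1KB of decoded instructions covering 512B of code, compared with 16KB for the L1 I-cache it sits in front of.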

Simulation results
[Figure: % reduction in instruction fetches and decodes]

Simulation results
[Figure: prediction hit rate]

Simulation results
[Table: columns are the benchmarks dijkstra, stringsearch, rijndael, CRC32; rows are the statistics sim_num_insn, il1.accesses, il1.hits, il1.misses, il1.miss_rate, dfc.accesses, dfc.hits, dfc.misses, dfc.miss_rate]

Conclusion
- The DFC stores decoded instructions and can be very small and energy-efficient.
- A DFC hit eliminates both the access to a much larger instruction cache and the entire decoding step.
- The simulation results show that most instruction fetches and decodes can be eliminated by using the DFC.
- Therefore, it is a very efficient way to reduce the power consumption of embedded processors.

Thank you!