Presentation transcript:

Presenter: Shao-Jay Hou

In the multicore era, capturing execution traces of processors is indispensable to debugging complex software. The inability to transfer vast amounts of trace data off-chip without significant slow-down has impeded the debugging of such software, in both pre-silicon emulation and in real designs. We consider on-chip trace compression performed in hardware to reduce data volume, using techniques that exploit inherent higher-order redundancy in address trace data. While hardware trace compression is often restricted to poor or moderate performance due to area and memory constraints, we present a parameterizable scheme that leverages the resources already found on existing platforms. Harnessing resources such as existing trace buffers on CPUs, and unused embedded memory on FPGA emulation platforms, our trace compression scheme requires only a small additional hardware area to achieve superior compression ratios.

MPSoCs run multi-threaded programs.
- Traditional debug methods cannot be used.
- A non-invasive method (on-chip emulation) is a good approach, but it produces an immense amount of data that must be either stored on-chip or transferred off-chip in real time.
- Example: the trace of a 32-bit processor executing 1 instruction per clock at 100 MHz amounts to 400 MB/s of data.
- The data therefore needs to be compressed.
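The 400 MB/s figure follows directly from the numbers on the slide. A minimal worked calculation, assuming one 4-byte (32-bit) address is traced per executed instruction; the function and constant names are illustrative only:

```python
# Back-of-the-envelope trace bandwidth, using the figures from the slide.
ADDRESS_BYTES = 4            # one 32-bit address per instruction (assumption)
CLOCK_HZ = 100_000_000       # 100 MHz
INSTRUCTIONS_PER_CLOCK = 1   # 1 clock per instruction


def trace_bandwidth_bytes_per_s(address_bytes, clock_hz, ipc):
    """Raw, uncompressed trace volume in bytes per second."""
    return address_bytes * clock_hz * ipc


bandwidth = trace_bandwidth_bytes_per_s(ADDRESS_BYTES, CLOCK_HZ, INSTRUCTIONS_PER_CLOCK)
print(bandwidth / 1e6, "MB/s")  # prints "400.0 MB/s"
```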

[Figure: related work leading to this paper. Categories: compression methods, trace compression schemes, some example tools. Items: compression algorithms [5], combined MTF and LZ [1], DMTF [17], multi-stage compression [11], Lempel-Ziv (LZ) [18], MCDS [12], ARM ETM [2].]

Why?  instructions consecutively until a branch is reached  Branch target address How?  Divided into two part 。 address 。 length  Example:

Why?  Branch will be taken or not taken  Sequential locality How?  similar to a cache 。 miss the first time a set of instructions is encountered 。 hit for every subsequent encounter that matches the prediction

Why?  MTF 。 Increase the relevance  Prefix 。 Assist for differential compression How?  Input address and predicted address  Differential compression

Why?  Prefix byte compression  Probability of prefix How?  Huffman encoding

Why?  The input for data form MTF/AE stage is 5bytes  But the output to LZ stage is 1byte How?  Use a little buffer to save

Why?  The input data has high Repeatability How?  Use LZ compression 。 Create a dictionary to save the repeat part 。 But don’t output the dictionary 。 While decompression, create a same dictionary  Don’t output every cycle

Benchmark: MiBench
CPU: Apple PowerMac G4 with a 1.25 GHz PowerPC 7455 (32-bit processor with fixed instruction length), Linux SMP kernel
Simulation software: ModelSim SE-64 v6.5c

Logic utilization
Usage scenario
- JTAG
- Software fault 10^-3

This paper presented a parameterizable microarchitecture for address trace compression, suited to implementation on ASICs and modern FPGAs. It achieves a better compression ratio than other schemes.

The paper uses a dictionary-based, multi-stage compression method that can be used to improve our tracer. The paper also gives inspiration for future work on our tracer.
[Figure: tracer block diagram with labels CPU, GPU, Bus, B.T., P.T., T.M.]