Preeti Ranjan Panda, Anant Vishnoi, and M. Balakrishnan. Proceedings of the IEEE 18th VLSI System on Chip Conference (VLSI-SoC 2010), Sept. 2010.


Presenter: Chun-Hung Lai, 2016/5/31

Abstract
During post-silicon validation/debug of processors, it is common to alternate between two phases: processor execution and state dump. The state dump, in which the entire processor state is dumped off-chip to a logic analyzer for further processing, is a major bottleneck. We present a technique for improving debug efficiency by reducing the volume of cache data dumped off-chip, while still capturing the complete state. The reduction is achieved by introducing hardware mechanisms that transmit only the portion of the cache updated since the last dump. We propose two design alternatives based on whether or not the processor is permitted to continue execution during the dump: Blocking Incremental Cache Dumping (BICD) and Non-blocking Incremental Cache Dumping (NICD). We observe a 64% reduction in overall cache lines dumped, and the dump time reduces to an average of 16.8% and % of the original for BICD and NICD, respectively.

What's the Problem
The state dump is a major bottleneck during post-silicon debug of processors
- The processor state must be dumped off-chip
- The last-level cache forms the majority of the processor state
To improve debug efficiency
- Reduce the volume of cache data dumped, while still capturing the complete state
(Figure: large cache -> large cache dump size -> huge dump duration)

Related Works
Design for debug:
- Collection of selected signal traces
  - Trace compression [6][10][18]: expand a few traced signals to restore untraced signals
  - Trace signal selection [9][11][13][15]: reduce area overhead and dump time
- Scan-based debug for physical / logic probing [17][20]: halts real-time execution
- Iterative silicon debug with signatures [2]: captures only error data and zooms in on the interval of the error signature; only for repeatable errors
Compression of specific memory/cache data:
- For performance / energy
  - Conservative compression [1][4][12][21]: decompression without impacting processor execution, but compression is limited
  - Aggressive compression [14][18]: decompression performed off-line
- For debug
  - Online cache dump for processor debug [19]: dumps simultaneously with processor execution
This paper: incremental cache state dumping, to reduce the dump size

Incremental Cache Dumping
Goal: reduce the total amount of cache data to be transferred off-chip
- Dump only the cache lines that were updated since the last dump
- Use an Update History Table (UHT) to track all cache updates between two consecutive dumps
(Figure: over time, the first dump transfers all lines; each subsequent dump transfers only the lines updated by processor execution since the previous dump)
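The UHT mechanism above can be sketched as a small software model: one bit per cache line records whether the line was written since the last dump, and a dump transfers only the marked lines before clearing the table. This is a minimal illustration, not the paper's RTL; the class and method names (`CacheModel`, `write`, `incremental_dump`) are assumptions.

```python
# Minimal sketch of incremental cache dumping with an Update History Table.
# One UHT bit per cache line: set on write, cleared after each dump.

class CacheModel:
    def __init__(self, num_lines):
        self.lines = [0] * num_lines      # cache line contents
        self.uht = [False] * num_lines    # updated since the last dump?

    def write(self, index, value):
        self.lines[index] = value
        self.uht[index] = True            # record the update in the UHT

    def incremental_dump(self):
        # Transfer only the lines whose UHT bit is set, then clear the table.
        dumped = {i: self.lines[i] for i, hit in enumerate(self.uht) if hit}
        self.uht = [False] * len(self.lines)
        return dumped

cache = CacheModel(8)
cache.write(2, 42)
cache.write(5, 7)
first = cache.incremental_dump()    # only lines 2 and 5 go off-chip
cache.write(2, 99)
second = cache.incremental_dump()   # only line 2 this time
```

With eight lines and two writes, the first dump transfers two lines instead of eight, which is exactly the saving the slide describes.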

Two Methodologies for Incremental Cache Dumping – 1st: BICD
Blocking Incremental Cache Dumping (BICD)
- The processor is halted during the cache dump
- Dump the lines whose UHT entry is set (reduces the dump size by 56%)
Cost vs. dump-time trade-off
- Each UHT bit may represent more than one cache line (a "window")
  - May lead to extra dumps: lines in a marked window are dumped even if they were not updated
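The window trade-off can be made concrete with a short sketch: with one UHT bit per window of W lines, the table is W times smaller, but a single write marks the whole window, so unchanged neighbors get dumped too. The function name and parameters here are illustrative assumptions, not from the paper.

```python
# Sketch of the BICD cost vs. dump-time trade-off: one UHT bit covers a
# window of `window` consecutive cache lines. A write anywhere in a window
# marks the whole window, so untouched lines in it are dumped as well.

def dump_with_window(num_lines, writes, window):
    uht = [False] * ((num_lines + window - 1) // window)
    for idx in writes:
        uht[idx // window] = True         # one bit per window, not per line
    # Every line of a marked window is dumped, updated or not.
    return [i for i in range(num_lines) if uht[i // window]]

exact = dump_with_window(16, [0, 5], 1)     # per-line UHT: only lines 0 and 5
windowed = dump_with_window(16, [0, 5], 4)  # window of 4: 8 lines, 6 unchanged
```

Two writes with window size 1 dump exactly two lines; the same writes with window size 4 dump eight, illustrating the "don't update but dump" overhead against a 4x smaller UHT.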

Two Methodologies for Incremental Cache Dumping – 2nd: NICD
Non-Blocking Incremental Cache Dumping (NICD)
- The cache dump is performed simultaneously with processor execution
Two challenges with NICD
- (1) The cache state may be corrupted by the executing processor
  - Solution: dump a cache line before the cache attempts to update it, and reset the corresponding UHT entry after dumping

Two Methodologies for Incremental Cache Dumping – 2nd: NICD (Cont.)
Two challenges with NICD
- (2) Maintenance of the Update History Table (UHT)
  - A single UHT would be incorrectly updated by both the cache dump and the executing processor
  - Solution: use two UHTs
    - UHT-P (previous): cache updates since the last dump (indicates which lines to dump)
    - UHT-C (current): cache updates during the dump interval
    - The two tables swap roles at the start of the next dump
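The two-UHT scheme can be sketched in a few lines: UHT-P drives the current dump, UHT-C records writes that arrive while the dump is in progress, and a write to a not-yet-dumped line forces that line out first so its pre-update value is preserved. The class and its attribute names are illustrative assumptions.

```python
# Sketch of NICD's double-UHT bookkeeping. UHT-P marks lines still to be
# dumped; UHT-C captures updates that arrive during the dump. A write to a
# line whose UHT-P bit is set dumps the old value first ("dump before update").

class NICDModel:
    def __init__(self, num_lines):
        self.lines = [0] * num_lines
        self.uht_p = [False] * num_lines  # previous interval: lines to dump
        self.uht_c = [False] * num_lines  # current interval: updates during dump
        self.dumped = {}

    def write(self, index, value):        # processor keeps running during dump
        if self.uht_p[index]:             # not yet dumped: dump old value now,
            self.dumped[index] = self.lines[index]
            self.uht_p[index] = False     # then clear its UHT-P entry
        self.lines[index] = value
        self.uht_c[index] = True          # recorded for the *next* dump only

    def dump_step(self, index):           # background dump engine, one line
        if self.uht_p[index]:
            self.dumped[index] = self.lines[index]
            self.uht_p[index] = False

    def finish_dump(self):                # swap roles for the next interval
        self.uht_p, self.uht_c = self.uht_c, [False] * len(self.lines)
        out, self.dumped = self.dumped, {}
        return out

m = NICDModel(4)
m.lines = [10, 20, 30, 40]
m.uht_p = [True, True, False, False]  # lines 0 and 1 were updated last interval
m.dump_step(0)      # dump line 0 normally
m.write(1, 99)      # processor writes line 1 mid-dump: old value 20 dumped first
m.dump_step(1)      # UHT-P already cleared, nothing to do
snapshot = m.finish_dump()
```

The dumped snapshot holds the pre-update value of line 1 even though the processor overwrote it mid-dump, and the write is still remembered (via the swapped-in UHT) for the next dump.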

Illustration of Non-Blocking Incremental Cache Dumping (NICD)
- The UHT indicates the lines to be dumped; after a line is dumped, its UHT entry is reset
- Example: line F is updated while line B is being dumped
  1. Dump F first, then reset its UHT-P entry
  2. Perform the update to F, then set its UHT-C entry

Illustration of Non-Blocking Incremental Cache Dumping (NICD) (Cont.)
1. When the dump engine reaches line F, its UHT-P entry is already '0' (it was dumped early because of the update), so it is not dumped again
2. Line C is updated, but this does not affect the current dump (UHT-P = '0' -> not dumped)
3. Line H is updated, likewise without affecting the current dump
Ready for the next dump:
- UHT-P: captures further updates
- UHT-C: indicates the lines to be dumped

Hardware Implementation – NICD Architecture
- Counter: index of the line being dumped
- Mask: address of the updated window
- Signals exported from the cache: W_sel (updated way), Write (cache update), Dump (ready for dump)
- Two UHTs track cache updates: one is used for updates, the other for the dump

Hardware Implementation – Operation Flow
1. Dump_S: start the dump
2. Sense the Valid (from the UHT) and Dump signals, then move the dumped line's data to the buffer
3. On a cache update (Write): if the line's UHT entry is '1', dump the line in advance

Experimental Results – Lines Dumped at Various Dump Intervals / Window Sizes
For CHESS
- The number of lines dumped increases with the window size and the dump interval
For HMMER
- The difference with respect to window size is minimal
For window size 1, only 36% of the total lines are dumped on average

Experimental Results – Processor Stalls with NICD
Cache updates during the dumping of a window cause a stall; stalls increase with window size
For CHESS
- Average % stall overhead for window size 2
For HMMER
- Average % stall overhead for window size 2
  - Memory requests are spread over time with infrequent updates

Experimental Results – Dump Time Overhead for NICD
Total dump time overhead = processor stall overhead + dumping overhead (bus busy during the dump)
For CHESS
- % dump time for all dump intervals (window size 1), as a percentage of the original dump time
For HMMER
- % ~ 0.003% dump time for all dump intervals (window size 1)
Overall dump time follows the trend of processor stalls (increases with window size)

Experimental Results – Area / Access Time
Additional area / timing overhead (nm synthesis technology):
For BICD
- Requires one UHT (window size varied between 1 and 16)
  - Area: 0.24 ~ 0.03 mm2
  - Timing: no overhead (the UHT access time is smaller than the cache access time)
For NICD
- Dump logic: area is twice that of BICD, since two UHTs are required (no extra timing overhead)
- Cache modification for online dumping: area difference is 0.0002 mm2 (no extra timing overhead)

              Original cache controller | BICD          | NICD dump logic | NICD $ modification
  Area (mm2)  38.9                      | + (0.24~0.03) | + (0.48~0.06)   | + (0.0002)
  Timing (ns) 2.63                      | + (0)         | + (0)           | + (0)

Conclusions
This paper proposed incremental cache dumping
- Goal: reduce the transfer time and the logic analyzer space requirement
- Two hardware mechanisms:
  - Blocking Incremental Cache Dumping (BICD)
  - Non-blocking Incremental Cache Dumping (NICD)
The results show that
- Incremental dumping reduces the lines dumped by 64%
- BICD reduces the dump time to 16.2% of the original dump time
- NICD reduces the dump time to % of the original dump time

Comments for This Paper
Good points
- Shows how cache dumping can be used for debug
  - Signature-based debugging approach: map a sequence of events into a cache state dump
  - Identifies the factors behind the dump time overhead
Things that could be improved
- Why is the dump line's index not fed into the UHT?
- From the architecture, the design appears to use single-port SRAM
  - How are "cache line dump" and "normal cache access" achieved simultaneously?
- The environment for transferring data from the dump logic to the logic analyzer is not clearly described