Presenter : Shih-Tung Huang Tsung-Cheng Lin Kuan-Fu Kuo 2015/6/26 EICE team dIP: A Non-Intrusive Debugging IP for Dynamic Data Race Detection in Many-core.

Presentation transcript:

Presenter : Shih-Tung Huang Tsung-Cheng Lin Kuan-Fu Kuo 2015/6/26 EICE team dIP: A Non-Intrusive Debugging IP for Dynamic Data Race Detection in Many-core Chi-Neng Wen, Shu-Hsuan Chou and Tien-Fu Chen National Chung-Cheng University, Chia-Yi, Taiwan {wcn93, csh93, th International Symposium on Pervasive Systems, Algorithms, and Networks

2 Abstract (1): Traditional debug facilities fall short of the debugging requirements of multicore parallel programming. Synchronization problems and bugs caused by race conditions are particularly difficult to detect with software debugging tools. This work presents a fast and feasible hardware-assisted solution for non-intrusive many-core debugging. The key idea is to keep track of data accesses to shared memory areas, and of the associated lock synchronization activity, using dedicated data structures in the proposed debugging IP (dIP). A page-based shared variable cache keeps shared variables as long as possible, and an inexpensive pluggable off-chip RAM efficiently eliminates false positives.

3 Abstract (2): To reduce blocking caused by debugging traffic, this work provides a thread library that marks shared memory and lock events and transmits them to the dIP through a small hardware co-processor attached to each core (the eXtend dIP, XdIP). Our experimental results show how worst-case debugging traffic blocking grows with the number of cores, and that adding tolerance buffers in the XdIP eases it effectively. Moreover, real workloads (SPLASH-2, MPEG-4, and H.264) run under dIP's non-intrusive race detection with an average slowdown of only 4.7%~12.2%. Finally, the hardware cost of dIP stays low as the number of cores grows.

4 What's the problem: Data race detection in multi-cores. Software methods cause the probe effect. Hardware methods need a lot of memory (or hardware area) to log core behavior, and they produce false positives. The method proposed in this paper is not a software method: it uses related work [3] to avoid the probe effect, and it uses centralized race detection so that the hardware area does not grow hugely as cores are added.

5 Related work: The probe effect was introduced in related work [1]. This paper uses the lock-set algorithm of related work [4] for data race detection. Related work [3] separates the debugging data path from the usual data path to avoid the probe effect. Diagram: race detection (multi-core) divides into software methods [5][6] and hardware methods [7][8][9]; this paper's method combines the lock-set algorithm [4] with related work [3].

6 Proposed MPSoC framework: Every core has an XdIP, which acts as a co-processor for that core. The XdIP sends debug events to the dIP through the Debug I/F. The interconnection follows the standard of related work [3]: the Data I/F carries the usual data path and the Debug I/F carries the debug event path.

7 XdIP architecture: The architecture is quite simple. A filter selects debug events (lock and memory access information) and passes them to a buffer in the packet & send unit, where they wait to be sent to the dIP. The filter is configured by software. Each core thus has its own event monitor and transfer logic. When the buffer is full, the XdIP notifies the dIP to stall all cores while the events are transferred.
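The filter-and-buffer behaviour on this slide can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class name `XdIP`, the event kinds, the `shared_ranges` filter setting, and the buffer depth are hypothetical and do not come from the paper.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical event kinds; the slides only say "lock and memory access info".
LOCK, UNLOCK, MEM_READ, MEM_WRITE = "lock", "unlock", "read", "write"


@dataclass(frozen=True)
class DebugEvent:
    core: int
    kind: str
    addr: int


class XdIP:
    """Toy per-core event monitor: software-configured filter plus bounded buffer."""

    def __init__(self, core, shared_ranges, depth=4):
        self.core = core
        self.shared_ranges = shared_ranges  # assumed filter setting written by software
        self.depth = depth                  # assumed buffer depth
        self.buffer = deque()

    def _passes_filter(self, kind, addr):
        # Lock events always pass; memory events pass only for shared addresses.
        if kind in (LOCK, UNLOCK):
            return True
        return any(lo <= addr < hi for lo, hi in self.shared_ranges)

    def observe(self, kind, addr):
        """Record one core event.

        Returns True when the buffer is full afterwards, i.e. when the dIP
        should be asked to stall the cores until events have been sent.
        """
        if self._passes_filter(kind, addr):
            if len(self.buffer) >= self.depth:
                raise RuntimeError("buffer full: the core should be stalled")
            self.buffer.append(DebugEvent(self.core, kind, addr))
        return len(self.buffer) >= self.depth

    def send(self):
        """Hand the oldest buffered event to the dIP (None if the buffer is empty)."""
        return self.buffer.popleft() if self.buffer else None
```

In this sketch the caller is expected to call `send()` until `observe()` can succeed again, which stands in for the stall described on the slide.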

8 Data race detection flow: First, the table manager accepts debug events from the XdIPs and maintains the shared variable cache, the lock-set table, and the core-status table. Second, the rule logic checks whether a data race has happened; if so, an alert is raised to notify the exception handler so that the race can be fixed.
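The rule-logic step can be illustrated with a simplified lock-set check in Python. This is a sketch of the general lock-set idea (a shared variable should stay guarded by at least one lock common to all accesses), not the paper's exact rule: the class `RuleLogic` and its methods are hypothetical, and the sketch ignores the read/write distinction.

```python
class RuleLogic:
    """Simplified lock-set check, assuming one candidate lock set per variable."""

    def __init__(self):
        self.held = {}        # thread id -> set of locks currently held
        self.candidates = {}  # variable address -> locks held on every access so far
        self.accessors = {}   # variable address -> threads that have touched it

    def lock(self, thread, lock_id):
        self.held.setdefault(thread, set()).add(lock_id)

    def unlock(self, thread, lock_id):
        self.held.setdefault(thread, set()).discard(lock_id)

    def access(self, thread, addr):
        """Return True (raise the alert) if this access looks like a data race."""
        held = self.held.get(thread, set())
        if addr not in self.candidates:
            self.candidates[addr] = set(held)   # first access: start with held locks
        else:
            self.candidates[addr] &= held       # keep only locks common to all accesses
        self.accessors.setdefault(addr, set()).add(thread)
        # Alert only once a second thread is involved and no common lock remains.
        return len(self.accessors[addr]) > 1 and not self.candidates[addr]
```

For example, two threads that both access address `0x100` under lock `S1` raise no alert, while a later access to `0x100` with no lock held empties the candidate set and triggers the alert.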

9 dIP architecture: Parts 1~5 correspond to the data race detection flow. Part 6 orders debug events (SqID). Part 7 is the external RAM used on a cache miss.

10 Three tables: The page-based variable table records the latest access state of each variable. The lock-key table records how many lock-sets and how many lock-keys are available. The core-status table records the state of each core (thread, lock-set, SqID). Fully associative.
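As a rough picture of how a page-based, fully associative variable table might spill to the external RAM, here is a small Python sketch. The page size, the number of entries, and the least-recently-used eviction policy are all assumptions made for illustration; the slides do not state them, and `VariableCache` is a hypothetical name.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size; not given in the slides


class VariableCache:
    """Toy fully associative, page-indexed table with spill to external RAM."""

    def __init__(self, entries=2):
        self.entries = entries
        self.on_chip = OrderedDict()  # page number -> {offset: last access state}
        self.external_ram = {}        # pages evicted from the on-chip table

    def _split(self, addr):
        return addr // PAGE_SIZE, addr % PAGE_SIZE

    def _load(self, page):
        if page in self.on_chip:
            self.on_chip.move_to_end(page)  # mark as most recently used
            return self.on_chip[page]
        # Miss: fetch the page state from external RAM, or start with an empty page.
        state = self.external_ram.pop(page, {})
        if len(self.on_chip) >= self.entries:
            # Assumed policy: evict the least recently used page to external RAM.
            victim, victim_state = self.on_chip.popitem(last=False)
            self.external_ram[victim] = victim_state
        self.on_chip[page] = state
        return state

    def record(self, addr, access_state):
        page, offset = self._split(addr)
        self._load(page)[offset] = access_state

    def lookup(self, addr):
        page, offset = self._split(addr)
        return self._load(page).get(offset)
```

Because evicted pages keep their access state in `external_ram`, a later lookup still sees the earlier accesses, which is the property that lets an off-chip RAM avoid losing history.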

11 Overall proposed framework

12 Allocation/de-allocation of lock-keys: Allocation: thread A executes W_lock S1, and the XdIP sends the event to the dIP. The dIP allocates a lock-key to thread A, and thread A saves the lock-key number with S1. De-allocation: thread A executes W_unlock S1, and the lock-key is sent to the dIP along with the event so that it can be de-allocated.
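The allocation and de-allocation steps can be sketched in Python as a small key pool. The class `LockKeyTable`, the pool size, and the behaviour when no key is free are hypothetical; only the `W_lock`/`W_unlock` flow itself comes from the slide.

```python
class LockKeyTable:
    """Toy pool of lock-keys handed out on W_lock and reclaimed on W_unlock."""

    def __init__(self, num_keys=4):
        self.free = list(range(num_keys))  # keys still available
        self.owner = {}                    # key -> (thread, lock name)

    def w_lock(self, thread, lock_name):
        """Allocate a key; the thread stores the returned key with its lock."""
        if not self.free:
            # Assumed behaviour; the slides do not say what happens here.
            raise RuntimeError("no lock-key available")
        key = self.free.pop(0)
        self.owner[key] = (thread, lock_name)
        return key

    def w_unlock(self, thread, lock_name, key):
        """The thread sends its key back with the unlock event to free it."""
        if self.owner.get(key) != (thread, lock_name):
            raise ValueError("key does not match this thread and lock")
        del self.owner[key]
        self.free.append(key)
```

A thread would call `key = table.w_lock("A", "S1")` on lock and `table.w_unlock("A", "S1", key)` on unlock, mirroring the thread A example above.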

13 Data race detection rule (figure: example with core1 and core2)

14 Experiments: When an XdIP buffer is full, the dIP stalls all cores so that debugging stays non-intrusive. Stalling reduces system performance; an experiment with the SPLASH-2 benchmarks shows the stall ratio. Solution: add buffers in the XdIP.
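Why a deeper buffer lowers the stall ratio can be seen in a toy queue model in Python. This is not the paper's experiment and produces none of its numbers: the function `stall_ratio`, the burst pattern, and the drain rate of one event per cycle are assumptions chosen only to show the trend.

```python
def stall_ratio(arrivals, depth, drain_per_cycle=1):
    """Fraction of cycles in which the core is stalled, in a toy model.

    arrivals[i] is the number of debug events the core emits in its i-th
    active cycle. While stalled, the core emits nothing and the buffer drains.
    """
    occupancy = 0
    stalled = 0
    for n in arrivals:
        if n > depth:
            raise ValueError("a single burst cannot exceed the buffer depth")
        # Stall until the whole burst fits in the buffer.
        while occupancy + n > depth:
            occupancy = max(0, occupancy - drain_per_cycle)
            stalled += 1
        occupancy += n
        # Normal drain towards the dIP during this active cycle.
        occupancy = max(0, occupancy - drain_per_cycle)
    total = len(arrivals) + stalled
    return stalled / total if total else 0.0
```

With the hypothetical burst `[2, 2, 2, 0, 0, 0]`, a buffer of depth 2 forces two stall cycles, while a buffer of depth 4 absorbs the burst without stalling.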

15 Experiments: Four different benchmarks. The worst-case slowdown is 12.25%. Compared with related work [9].