Accelerated Path-Based Timing Analysis with MapReduce

Accelerated Path-Based Timing Analysis with MapReduce
Tsung-Wei Huang and Martin D. F. Wong
Department of Electrical and Computer Engineering (ECE)
University of Illinois at Urbana-Champaign (UIUC), IL, USA
2015 ACM International Symposium on Physical Design (ISPD)

Outline
- Path-based timing analysis (PBA)
  - Static timing analysis
  - Performance bottleneck
  - Problem formulation
- Speeding up PBA
  - Distributed computing
  - MapReduce programming paradigm
- Experimental results
- Conclusion

Static Timing Analysis (STA)
- Verify the expected timing characteristics of integrated circuits
  - Keep track of path slacks and identify the critical paths with negative slack
- Increasing significance of variation
  - On-chip variation such as temperature change and voltage drop
  - Perform dual-mode (min-max) conservative analysis
(Speaker note: We all know STA is an important step in the design flow.)

Timing Tests: Verification of Setup/Hold Checks
- Sequential timing tests
  - Setup time check: "latest" arrival time (at) vs. "earliest" required arrival time (rat)
  - Hold time check: "earliest" arrival time (at) vs. "latest" required arrival time (rat)
(Figure: a timeline marked with the earliest rat (hold test) and the latest rat (setup test); arrivals between the two pass with positive slack and no violation, while arrivals outside fail with a hold or setup violation.)
(Speaker note: One important task of STA is the sequential timing test, which verifies the timing using setup/hold guards.)
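
As a concrete illustration of the two checks (my sketch, not from the slides; a positive result means the test passes):

    // Hypothetical helpers illustrating the slack convention above.
    // A positive slack means the test passes; a negative slack is a violation.

    // Setup test: the latest data arrival is checked against its required time.
    double setup_slack(double latest_at, double setup_rat) {
      return setup_rat - latest_at;   // < 0  =>  setup violation
    }

    // Hold test: the earliest data arrival is checked against its required time.
    double hold_slack(double earliest_at, double hold_rat) {
      return earliest_at - hold_rat;  // < 0  =>  hold violation
    }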

Two Fundamental Solutions to STA
- Block-based timing analysis
  - Linear topological propagation
  - Worst quantities kept at each point
  - Very fast, but pessimistic
- Path-based timing analysis
  - Analyzes the timing path by path instead of at single points
  - Common path pessimism removal (CPPR), advanced on-chip variation (AOCV), etc.
  - Reduces the pessimism margin
  - Very slow (exponential number of paths), but more accurate
*Source: Cadence Tempus white paper
(Speaker note: There are two fundamental solutions to STA. The first is the so-called block-based timing analysis, which performs a linear topological scan of the circuit and propagates the timing in topological order. During the propagation, we keep track of the "worst" timing quantities at each point; see the sketch below.)
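
A minimal sketch of that late-mode propagation (my illustration, not the authors' code), assuming the circuit is a DAG whose nodes are already listed in topological order:

    #include <algorithm>
    #include <limits>
    #include <vector>

    // Each node keeps only the worst (latest) arrival time over all fanin
    // paths; this single worst quantity per point is the source of the
    // pessimism that path-based analysis later removes.
    struct Edge { int from; double delay; };

    std::vector<double> propagate_late_at(
        const std::vector<std::vector<Edge>>& fanin,  // fanin[v] = edges into v
        const std::vector<int>& topo_order,
        const std::vector<double>& source_at) {       // arrivals at primary inputs
      std::vector<double> at(fanin.size(),
                             -std::numeric_limits<double>::infinity());
      for (int v : topo_order) {
        if (fanin[v].empty()) { at[v] = source_at[v]; continue; }
        for (const Edge& e : fanin[v])                // worst quantity per point
          at[v] = std::max(at[v], at[e.from] + e.delay);
      }
      return at;
    }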

CPPR Example – Data Path Slack with CPPR Off
Pre common-path-pessimism-removal (CPPR) slack:
- Data path 1: ((120 + (20+10+10)) - 30) - (25+30+40+50) = -15 (critical)
- Data path 2: ((120 + (20+10+10)) - 30) - (25+45+40+50) = -30 (critical)

CPPR Example – Data Path Slack with CPPR On
Post common-path-pessimism-removal (CPPR) slack:
- Data path 1: ((120 + (20+10+10)) - 30) - (25+30+40+50) + 5 = -10 (critical)
- Data path 2: ((120 + (20+10+10)) - 30) - (25+45+40+50) + 40 = 10
(Figure annotations: CPPR credit +5 on data path 1, +40 on data path 2.)
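
In other words (my summary of the two slides above), the post-CPPR slack simply adds the common-path credit back onto the pessimistic pre-CPPR slack:

    slack_post = slack_pre + CPPR credit
    Data path 1: -15 + 5  = -10   (still critical)
    Data path 2: -30 + 40 = +10   (no longer critical)

Removing the pessimism on the shared clock segment flips data path 2 from failing to passing.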

Example: Impact of Common-Path-Pessimism Removal (CPPR)

Problem Formulation of PBA
Consider the core computational block of PBA, after block-based timing propagation (early/late delays annotated on edges).
- Input
  - A given circuit graph G = (V, E)
  - A given test set T
  - A parameter k
- Output
  - The top-k critical paths in the design
- Goal & application
  - CPPR from the TAU 2014 CAD contest
  - Speed up the PBA time
(Figure: clock tree of a benchmark from the TAU 2014 CAD contest.)
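
For concreteness, the problem's inputs and outputs might be modeled as below (hypothetical types of mine, reused in the later sketches; not from the paper):

    #include <vector>

    // Hypothetical data model for the PBA problem statement above.
    struct EdgeTiming { int u, v; double early_delay, late_delay; };

    struct TimingTest { int endpoint; bool is_setup; };  // a setup or hold check

    struct Path {
      std::vector<int> nodes;  // pins along the data path
      double slack;            // post-CPPR slack of the path
    };

    struct PbaProblem {
      int num_nodes;                  // |V| of the circuit graph G = (V, E)
      std::vector<EdgeTiming> edges;  // E, annotated by block-based propagation
      std::vector<TimingTest> tests;  // the test set T
      int k;                          // report the top-k critical paths
    };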

Key Observation of PBA
PBA is a time-consuming process, but…
- Multiple timing tests (e.g., setup, hold, PO, etc.) are independent
- The graph-based abstraction isolates the processing of each timing test
- High parallelism
  - Multi-threading: shared-memory architecture; a single computing node with multiple cores
  - Distributed computing: distributed-memory architecture; multiple computing nodes, each with multiple cores (the goal of this paper!)

Conventional Distributed Programming Interface
- Advantages of distributed computing
  - High parallelism: multiple computing nodes with multiple cores
  - Performance typically scales up as the core count grows
- MPI programming library
  - Explicitly specifies the details of message passing
  - Annoying and error-prone
  - Very long development time and low productivity
  - Highly customized for performance tuning
(Speaker note: Distributed programming is advantageous in these respects; the conventional programming interface to distributed computing is the MPI library.)
(Figure: a cloud of MPI primitives: MPI_Init, MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Reduce, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Allreduce, MPI_Barrier, MPI_Finalize, MPI_Comm, …)
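
To make the verbosity concrete: even exchanging a single integer in raw MPI requires explicitly paired sends and receives (a minimal sketch of mine, not from the slides):

    #include <mpi.h>

    // Every transfer must name its peer rank, tag, type, and count by hand.
    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, value = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
      } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
      }
      MPI_Finalize();
      return 0;
    }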

MapReduce – A Programming Paradigm for Distributed Systems
- First introduced by Google in 2004
- Simplifies distributed computing for big-data processing
- Open-source libraries: Hadoop (Java), Scalar (Java), MR-MPI (C++), etc.
(Speaker note: Because of these problems with MPI, Google introduced the concept of MapReduce, a programming paradigm that simplifies distributed computing for big-data processing.)

Standard Form of a MapReduce Program
- Map operation
  - Partition the data set into pieces and assign the work to processors
  - Processors generate output data and assign each string a "key"
- Collate operation
  - Output data with the same key are collected at a unique processor
- Reduce operation
  - Derive the solution from each per-key data set
(Figure: a traditional MPI program (>1000 lines of MPI_Isend…, MPI_Irecv…, MPI_Send…, MPI_Recv…, MPI_Barrier…) next to a MapReduce program (<10 lines).)
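
A sketch of this three-phase structure using the MR-MPI C++ library that the authors later use (the calls follow MR-MPI's documented API; treat the include paths and exact signatures as my assumption):

    #include <mpi.h>
    #include "mapreduce.h"   // MR-MPI (Sandia) headers
    #include "keyvalue.h"

    using namespace MAPREDUCE_NS;

    // Map callback: process partition itask and emit key-value pairs.
    void my_map(int itask, KeyValue* kv, void* ptr) {
      // ... generate output for this piece of the data set, then:
      // kv->add(key, keybytes, value, valuebytes);
    }

    // Reduce callback: invoked once per unique key after the collate.
    void my_reduce(char* key, int keybytes, char* multivalue,
                   int nvalues, int* valuebytes, KeyValue* kv, void* ptr) {
      // ... derive the per-key solution from the nvalues packed values
    }

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      MapReduce* mr = new MapReduce(MPI_COMM_WORLD);
      int ntasks = 100;                  // number of map pieces (assumption)
      mr->map(ntasks, my_map, nullptr);  // Map: partition and emit K-V pairs
      mr->collate(nullptr);              // Collate: route same-key pairs together
      mr->reduce(my_reduce, nullptr);    // Reduce: solve each per-key data set
      delete mr;
      MPI_Finalize();
      return 0;
    }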

Example – Word Counting
- Count the frequency of each word across a document set
- 3288 TB data set
- 10 minutes to finish on a Google cluster
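
The classic word-count callbacks, plugged into the skeleton above (my illustration; it assumes ptr points at this task's null-terminated chunk of text):

    #include <cstdio>
    #include <sstream>
    #include <string>

    // Map: emit ("word", 1) for every word in this task's chunk.
    void wc_map(int itask, MAPREDUCE_NS::KeyValue* kv, void* ptr) {
      std::istringstream in(static_cast<const char*>(ptr));
      std::string word;
      int one = 1;
      while (in >> word)
        kv->add(const_cast<char*>(word.c_str()),
                static_cast<int>(word.size()) + 1,
                reinterpret_cast<char*>(&one), sizeof(int));
    }

    // Reduce: after the collate, all the 1s for a given word arrive
    // together, so the frequency is simply the number of values.
    void wc_reduce(char* key, int keybytes, char* multivalue, int nvalues,
                   int* valuebytes, MAPREDUCE_NS::KeyValue* kv, void* ptr) {
      std::printf("%s: %d\n", key, nvalues);
    }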

MapReduce Solution to PBA (I)
- Map: partition the test set across the available processors
  - Each processor generates the top-k critical paths for its tests
  - Each path is associated with a global key (identical across all paths)
- Collate: aggregate paths with the same key and combine them into a path string
- Reduce: sort the paths from the path string and output the top-k critical paths

Mapper(t):
  1. Generate the search graph for test t
  2. Find the top-k critical paths for t
  3. Emit a K-V pair for each path

Reducer(s):
  1. Parse the paths from path string s
  2. Sort the paths
  3. Output the top-k critical paths
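
These two boxes might translate into MR-MPI callbacks roughly as follows (my sketch; it reuses the hypothetical types from the problem-formulation sketch, and find_top_k_paths, serialize, and deserialize are stand-ins, the first for the path-generation routine of [Huang and Wong, ICCAD'14]):

    #include <algorithm>
    #include <string>
    #include <vector>

    std::vector<Path> find_top_k_paths(const PbaProblem& prob, int test_id);
    std::string serialize(const Path& p);
    Path deserialize(const char* bytes);

    static const char GLOBAL_KEY[] = "paths";  // identical key for all paths

    void pba_map(int itask, MAPREDUCE_NS::KeyValue* kv, void* ptr) {
      const PbaProblem& prob = *static_cast<PbaProblem*>(ptr);
      // itask indexes one timing test in T: build its search graph and
      // enumerate its k most critical paths.
      for (const Path& p : find_top_k_paths(prob, itask)) {
        std::string s = serialize(p);
        kv->add(const_cast<char*>(GLOBAL_KEY), sizeof(GLOBAL_KEY),
                const_cast<char*>(s.c_str()),
                static_cast<int>(s.size()) + 1);
      }
    }

    void pba_reduce(char* key, int keybytes, char* multivalue, int nvalues,
                    int* valuebytes, MAPREDUCE_NS::KeyValue* kv, void* ptr) {
      // Every path shares GLOBAL_KEY, so they all arrive at this one reducer.
      const PbaProblem& prob = *static_cast<PbaProblem*>(ptr);
      std::vector<Path> paths;
      char* v = multivalue;
      for (int i = 0; i < nvalues; ++i) {
        paths.push_back(deserialize(v));
        v += valuebytes[i];
      }
      std::sort(paths.begin(), paths.end(),
                [](const Path& a, const Path& b) { return a.slack < b.slack; });
      if (static_cast<int>(paths.size()) > prob.k) paths.resize(prob.k);
      // ... report paths: the globally top-k critical paths
    }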

MapReduce Solution to PBA (II)
- Mapper
  - Extract the search graph for each timing test
  - Find the k critical paths on each search graph [Huang and Wong, ICCAD'14]
- Reducer
  - Sort the paths according to their slacks and output the globally top-k critical paths
(Figure: input circuit graph -> Map (extraction of graph and paths) -> Reduce -> top-1 critical path.)

Reducing the Communication Overhead
Messaging latency to a remote node is expensive.
- Data locality
  - Each computing node holds a replica of the circuit graph
  - No graph copy between the master node and the slave nodes
- Hidden reduce (see the sketch below)
  - A reducer call on each processor before the collate step
  - Reduces the amount of path-string data passed between computing nodes
*Source: Intel clustered OpenMP white paper
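
One way the hidden reduce might be realized with MR-MPI, under my reading of its API (convert() groups same-key pairs locally with no communication, while collate() = aggregate() + convert() is the step that actually moves data between nodes); pba_local_prune is a hypothetical local reducer:

    // Reuses pba_map/pba_reduce from the previous sketch. pba_local_prune
    // is assumed to re-emit only its node's top-k paths as fresh K-V pairs,
    // so far less path-string data crosses the network.
    void pba_local_prune(char*, int, char*, int, int*,
                         MAPREDUCE_NS::KeyValue*, void*);

    void run_pba(MAPREDUCE_NS::MapReduce* mr, PbaProblem* prob, int ntests) {
      mr->map(ntests, pba_map, prob);     // every node emits its local paths
      mr->convert();                      // group same-key pairs locally only
      mr->reduce(pba_local_prune, prob);  // hidden reduce, before any collate
      mr->collate(nullptr);               // now ship the pruned path strings
      mr->reduce(pba_reduce, prob);       // global reduce: final top-k paths
    }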

Experimental Results
- Programming environment
  - C++ with the C++-based MapReduce library MR-MPI
  - 2.26 GHz 64-bit Linux machines
  - UIUC campus cluster (up to 500 computing nodes and 5000 cores)
- Benchmarks
  - TAU 2014 CAD contest on path-based CPPR
  - Million-scale circuit graphs

Experimental Results – Runtime (I)
- Parameters: path count K, core count C
- Performance
  - Only ~30 lines of MapReduce code
  - 2x to 9x speedup with 10 cores
  - Promising scalability

Experimental Results – Runtime (II)
- Runtime breakdown across Map, Collate, and Reduce
  - Map occupies the majority of the runtime
  - ~10% spent on inter-process communication
- Communication overhead
  - Grows as the path count increases
  - ~15% improvement with the hidden reduce

Experimental Results – Comparison with Multi-threading on a Single Node

Conclusion
- MapReduce-based solution to PBA
  - Coding ease, promising speedup, and high scalability
  - Analyzes million-scale graphs within a few minutes
- Future work
  - Investigate more EDA applications on cluster computing
  - GraphX, Spark, etc.