Store Recycling Function Experimental Results

Presentation transcript:

Store Recycling Function Experimental Results

Motivation
- Emergence of multi-/many-core architectures creates a need for effective parallel programming models.
- Dynamic task graph scheduling raises two issues: task scheduling (via work stealing) and memory management.
- Our approach: recycle the memory assigned to data-blocks among tasks via store recycling functions.

Background
- The computation is represented as a DAG: vertices are tasks, edges are dependences.
- Memory management alternatives:
  - single assignment: likely to run out of memory
  - garbage collection: requires a use count or last-use specification for each data-block

Store Recycling Function
- Maps a task T1 to another task T2 so that the output of T1 occupies the same memory as the output of T2. Ex: Recycle(B) = A.
- Challenges:
  - characterization of the sufficient conditions for a recycling function to be correct (e.g., Recycle(B) = D; Recycle(B) = A and Recycle(C) = A)
  - determination of the most memory-efficient recycling function given a set of candidates
  - efficient representation of recycling functions during runtime
  - guaranteeing correct execution for every possible schedule and problem instance

Recycling Constraints
- T1 can recycle T2 if both T2 and all of T2's uses causally precede T1 -> ensures no premature recycling.
- Two tasks T1 and T2 can recycle the same task T3 only if T1 can recycle T2 (or vice versa) -> ensures no concurrent recycling.
- Causality relationships between tasks are tracked via vector clocks.

Auto Exploration of Recycling Functions
- Recycling candidates for a task: its immediate and transitive predecessors.
- Ask the user for the dependence structure, then enumerate all traversal paths.

Overview: Verification Run and Production Run
- The verification run checks the correctness of a recycling function on a smaller problem instance.
- The production run executes the task graph with the verified recycling function on the actual problem instance. No concurrent recycling is allowed, and a data-block recycled too early is recomputed through re-execution.

Experimental Results

Setup
- Intel Xeon Phi: 61 cores (244 threads), 8 GB memory
- Benchmarks: Cholesky, FW, Hotspot, LU, Rician, Srad, SW

1) Comparison between single assignment and recycling; auto recycling overheads
- Fig. 1: Cholesky with small (left) and large (right) problem instances for a varying number of threads
- Fig. 2: Associated costs at 61 threads
- Tab. 1: Memory consumption in MBs

2) Recycling function verification costs
- Tab. 2: Number of recycling functions checked and verified as correct

3) Re-execution overheads (incorrect execution)
- Fig. 3: Re-execution overheads with a representative incorrect recycling function
- Fig. 4: Distribution of incorrect recycling functions to overhead bins
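The causality test behind the recycling constraints can be sketched with vector clocks. The following is a minimal illustrative sketch, not the actual runtime's code: the names `Task` and `can_recycle` are assumptions, and clocks are plain dicts mapping task names to counters.

```python
def leq(vc_a, vc_b):
    """Componentwise vector-clock comparison: vc_a <= vc_b (missing keys are 0)."""
    return all(v <= vc_b.get(k, 0) for k, v in vc_a.items())

class Task:
    """Illustrative task record (hypothetical; not from the paper's runtime)."""
    def __init__(self, name):
        self.name = name
        self.clock = {}   # vector clock, updated as the task graph executes
        self.uses = []    # tasks that consume this task's output data-block

def causally_precedes(t1, t2):
    return leq(t1.clock, t2.clock) and t1.clock != t2.clock

def can_recycle(t1, t2):
    """t1 may reuse t2's memory iff t2 and all of t2's uses causally
    precede t1 -- the 'no premature recycling' constraint."""
    return (causally_precedes(t2, t1)
            and all(causally_precedes(u, t1) for u in t2.uses))
```

With a chain A -> B -> C where B consumes A's output, C may recycle A, but B may not, since B itself still uses A's data-block.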
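The effect of a store recycling function on memory consumption can be illustrated with a toy sequential executor. This is a sketch under stated assumptions (the real system schedules tasks concurrently via work stealing; `execute` and its parameters are invented for illustration): with Recycle(t) = p, task t's output overwrites p's data-block instead of triggering a fresh allocation.

```python
def execute(order, recycle, block_size=4):
    """Run tasks in topological order; return the number of data-block
    allocations. `recycle` maps a task to the earlier task whose buffer
    it reuses (the store recycling function)."""
    buffers = {}        # task -> buffer holding its output data-block
    allocations = 0
    for t in order:
        target = recycle.get(t)
        if target is not None:
            buf = buffers[target]      # store recycling: reuse memory
        else:
            buf = [0] * block_size     # single assignment: fresh block
            allocations += 1
        buf[0] = t                     # stand-in for the real computation
        buffers[t] = buf
    return allocations
```

On a diamond DAG A -> {B, C} -> D, single assignment allocates four blocks, while Recycle(D) = A allocates three; the constraint check guarantees the reuse is safe because B and C, the uses of A, precede D.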
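The candidate-enumeration step of auto exploration can be sketched as a reachability walk over the dependence DAG: the recycling candidates for a task are its immediate and transitive predecessors. A minimal sketch, assuming `deps` maps each task to its immediate predecessors (the function name is illustrative):

```python
def recycling_candidates(deps, t):
    """All immediate and transitive predecessors of task t in the
    dependence DAG -- the set of candidate recycling targets for t."""
    seen = set()
    stack = list(deps.get(t, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(deps.get(p, []))   # follow edges transitively
    return seen
```

For the diamond DAG, the candidates for D are A, B, and C; the verification run then checks each enumerated recycling function for correctness on a smaller problem instance before the production run uses it.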