Better Speedups for Parallel Max-Flow. George C. Caragea, Uzi Vishkin. Dept. of Computer Science, University of Maryland, College Park, USA. June 4th, 2011.

Presentation transcript:

Better Speedups for Parallel Max-Flow
George C. Caragea, Uzi Vishkin
Dept. of Computer Science, University of Maryland, College Park, USA
June 4th, 2011

Experience with an Easy-to-Program Parallel Architecture
XMT (eXplicit Multi-Threading) platform
◦ Design goal: easy-to-program many-core architecture
◦ PRAM-based design, PRAM-On-Chip programming
◦ Ease of programming demonstrated by order-of-magnitude ease of teaching/learning
◦ 64-processor hardware, compiler, 20+ papers, 9 graduate degrees, 6 US patents
◦ Only one previous single-application paper (Dascal et al., 1999)
Parallel Max-Flow results
◦ [IPDPS 2010] 2.5x speedup vs. serial using CUDA
◦ [Caragea and Vishkin, SPAA 2011] up to 108.3x speedup vs. serial using XMT
 3-page paper

How to publish application papers on an easy-to-program platform?
The reward game is skewed: it is easier to publish on "hard-to-program" platforms
◦ Remember the STI Cell?
Application papers for easy-to-program architectures are considered "boring"
◦ Even when they show good results
Recipe for academic publication:
◦ Take a simple application (e.g., Breadth-First Search on a graph)
◦ Implement it on the latest (difficult-to-program) parallel architecture
◦ Discuss challenges and work-arounds

Parallel Programming Today
Current parallel programming: high-friction navigation, by implementation [walk/crawl]
◦ Initial program (1 week), then trial-and-error tuning begins (½ year; architecture dependent)
PRAM-On-Chip programming: low-friction navigation, by mental design and analysis [fly]
◦ No need to crawl: identify the most efficient algorithm, then advance to an efficient implementation

PRAM-On-Chip Programming
A high-school student comparing parallel programming approaches:
◦ "I was motivated to solve all the XMT programming assignments we got, since I had to cope with solving the algorithmic problems themselves, which I enjoy doing. In contrast, I did not see the point of programming other parallel systems available to us at school, since too much of the programming was effort getting around the way the systems were engineered, and this was not fun"

Maximum Flow in Networks
An extensively studied problem
◦ Numerous algorithms and implementations (general graphs)
◦ Application domains:
 Network analysis
 Airline scheduling
 Image processing
 DNA sequence alignment
Parallel Max-Flow algorithms and implementations
◦ The paper includes an overview
◦ SMPs and GPUs
◦ Difficult to obtain good speedups vs. serial
 e.g., 2.5x for a hybrid CPU-GPU solution
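[Editor's sketch] To fix notation for the algorithm slides that follow, here is a minimal C representation of a residual flow network of the kind max-flow codes typically use. The type and field names are illustrative assumptions, not code from the paper. Max-flow asks for the largest total flow from a source s to a sink t that respects every edge capacity.

/* Illustrative residual-network representation (names are assumptions,
 * not the authors' code). Storing each directed edge together with the
 * index of its reverse edge lets residual capacities be updated in O(1)
 * after a push. */
typedef struct {
    int to;    /* head vertex of this directed edge                   */
    int cap;   /* remaining (residual) capacity                       */
    int rev;   /* index of the reverse edge in the adjacency of 'to'  */
} Edge;

typedef struct {
    int n;       /* number of vertices                                */
    Edge **adj;  /* adj[v]: edges leaving vertex v                    */
    int *deg;    /* deg[v]: number of edges in adj[v]                 */
} Graph;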

XMT Max-Flow Parallel Solution
First stage: identify/design a parallel algorithm
◦ [Shiloach, Vishkin 1982] designed an O(n² log n)-time, O(nm)-space PRAM algorithm
◦ [Goldberg, Tarjan 1988] introduced distance labels into S-V: the Push-Relabel algorithm with O(m) space complexity
◦ [Anderson, Setubal 1992] observed poor practical performance for G-T and augmented it with an S-V-inspired Global Relabeling heuristic
◦ Solution: hybrid SV-GT PRAM algorithm
Second stage: write the PRAM-On-Chip implementation (see the sketch below)
◦ Relax PRAM lock-step synchrony by grouping several PRAM steps into an XMT spawn block
 Insert synchronization points (barriers) where needed for correctness
◦ Maintain an active node set instead of polling all graph nodes for work
◦ Use hardware-supported atomic operations to simplify reductions
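[Editor's sketch] Below is a minimal single-threaded C11 illustration of one push-relabel round over an active node set, in the spirit this slide describes. It is an assumption-laden sketch, not the authors' XMTC code: atomic_fetch_add stands in for XMT's hardware prefix-sum primitive, the comments mark where a spawn block and atomic adds would be used, and source/sink bookkeeping plus the global-relabeling heuristic are omitted.

#include <stdatomic.h>

typedef struct {
    int n;            /* number of vertices                             */
    const int *first; /* CSR layout: edges of v are first[v]..first[v+1]-1 */
    const int *head;  /* head[e]: target vertex of edge e               */
    const int *rev;   /* rev[e]: index of e's reverse edge              */
    int *res;         /* res[e]: residual capacity of edge e            */
} ResGraph;

/* One round: every active vertex pushes/relabels once. On XMT, the
 * outer loop would be a spawn block (one virtual thread per active
 * vertex) with an implicit barrier (join) at the end of the round.     */
static int push_relabel_round(ResGraph *g, int *excess, int *label,
                              const int *active, int n_active,
                              int *next_active)
{
    atomic_int n_next = 0;  /* atomic counter standing in for XMT ps() */
    for (int i = 0; i < n_active; i++) {   /* parallel on XMT           */
        int v = active[i];
        int min_label = 2 * g->n;          /* candidate relabel value   */
        for (int e = g->first[v]; e < g->first[v + 1] && excess[v] > 0; e++) {
            int w = g->head[e];
            if (g->res[e] <= 0) continue;            /* saturated edge  */
            if (label[v] == label[w] + 1) {          /* admissible: push */
                int d = excess[v] < g->res[e] ? excess[v] : g->res[e];
                g->res[e]         -= d;
                g->res[g->rev[e]] += d;
                excess[v] -= d;
                excess[w] += d;  /* would need an atomic add in parallel */
                if (excess[w] == d)        /* w just became active       */
                    next_active[atomic_fetch_add(&n_next, 1)] = w;
            } else if (label[w] + 1 < min_label) {
                min_label = label[w] + 1;
            }
        }
        if (excess[v] > 0) {               /* no admissible edge left:   */
            label[v] = min_label;          /* relabel and stay active    */
            next_active[atomic_fetch_add(&n_next, 1)] = v;
        }
    }
    return n_next;   /* size of the next round's active set             */
}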

Input Graph Families
Performance is highly dependent on the structure of the graph
◦ Graph structures proposed in the DIMACS challenge [DIMACS90]
◦ Used by virtually every Max-Flow publication
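[Editor's sketch] The DIMACS max-flow instances referred to here use a simple line-oriented text format: 'c' comment lines, a 'p max <nodes> <arcs>' problem line, 'n <id> s' / 'n <id> t' source/sink descriptors, and 'a <from> <to> <capacity>' arc lines. A minimal C reader, with graph construction and error handling elided:

#include <stdio.h>

int read_dimacs(FILE *f)
{
    char line[256];
    int n = 0, m = 0, src = -1, snk = -1;
    while (fgets(line, sizeof line, f)) {
        char tag;
        int u, v, cap;
        switch (line[0]) {
        case 'p':  /* problem line: "p max <nodes> <arcs>"            */
            sscanf(line, "p max %d %d", &n, &m);
            break;
        case 'n':  /* node descriptor: "n <id> s|t"                   */
            sscanf(line, "n %d %c", &u, &tag);
            if (tag == 's') src = u; else snk = u;
            break;
        case 'a':  /* arc descriptor: "a <from> <to> <capacity>"      */
            sscanf(line, "a %d %d %d", &u, &v, &cap);
            /* add_edge(u, v, cap);  hook into your graph structure   */
            break;
        default:   /* 'c' comment lines and blanks are skipped        */
            break;
        }
    }
    return (src > 0 && snk > 0) ? 0 : -1;
}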

Speed-Up Results
Compared to the "best serial implementation" running on a recent x86 processor [Goldberg 2006]
◦ Clock-cycle-count speedups
Two XMT configurations:
◦ XMT.64: 64-core FPGA prototype
◦ XMT.1024: 1024-core, cycle-accurate simulator (XMTSim)
Speedups: 1.56x to 108.3x for XMT
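[Editor's note] Cycle-count speedup compares processor cycles rather than wall-clock time, which factors out the clock-frequency gap between a lower-clocked FPGA prototype and a GHz-class production x86. A trivial illustration of the metric; the cycle counts below are placeholders, not measurements from the paper:

#include <stdio.h>

int main(void)
{
    /* Placeholder cycle counts -- NOT results from the paper. */
    double serial_cycles = 4.2e9;  /* hypothetical x86 run          */
    double xmt_cycles    = 3.9e7;  /* hypothetical XMT.1024 run     */
    printf("cycle-count speedup: %.1fx\n", serial_cycles / xmt_cycles);
    return 0;
}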

Conclusion
XMT aims to be an easy-to-program, general-purpose architecture
◦ Performance improvements on hard-to-parallelize applications such as Max-Flow
◦ Ease of programming: shown by an order-of-magnitude improvement in ease of teaching/learning
 Difficult speedups were achieved at a much earlier developmental stage (10th graders in high school versus graduate students). UCSB/UMD experiment, middle school, magnet HS, inner-city HS, freshman course, UIUC/UMD experiment: J. Sys. & SW '08, SIGCSE '10, EduPar '11
Current stage of the XMT project: develop more complex applications beyond benchmarks
◦ Max-Flow is a step in that direction
◦ More are needed
Without an easy-to-program many-core architecture, rejection of parallelism by mainstream programmers is all but certain
◦ Affirmative action: drive more researchers to work and seek publications on easy-to-program architectures
◦ This work should not be dismissed as "too easy"
Thank you!