Yuanrui Zhang, Mahmut Kandemir


Automated Parallelization of Non-Uniform Convolutions on Chip Multiprocessors
Yuanrui Zhang, Mahmut Kandemir (Computer Science and Engineering, The Pennsylvania State University)
Nikos Pitsianis, Xiaobai Sun (Computer Science, Duke University)

Non-Uniform FFT: Local Convolution
[Figure: local support of a target point over non-uniformly distributed source points]
While FFTs are well implemented by FFTW, we concentrate on accelerating the parallel convolution step on multicores. Geometric tiling can enhance data locality and reuse for the non-uniform local convolution, a matrix-vector product with an irregular and sparse matrix. Multicores have different on-chip memory hierarchies and characteristics.
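To make the convolution step concrete, here is a minimal C sketch of the local convolution restricted to a single geometric tile. The data layout, the Gaussian kernel, and all names (Tile, kernel, local_convolve_tile, 1-D coordinates) are illustrative assumptions rather than the paper's actual implementation; the point is only that each target accumulates contributions from the sources inside its local support, an irregular sparse matrix-vector product evaluated tile by tile.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical layout: targets and sources are pre-sorted into geometric
 * tiles, so each tile stores contiguous index ranges.  The names below
 * (Tile, kernel, local_convolve_tile) are illustrative, not the paper's API. */
typedef struct {
    size_t tgt_begin, tgt_end;   /* range of target indices in this tile    */
    size_t src_begin, src_end;   /* sources within the tile's local support */
} Tile;

/* Example convolution kernel with compact support of radius h (assumed). */
static double kernel(double d, double h) {
    return (fabs(d) < h) ? exp(-(d * d) / (h * h)) : 0.0;
}

/* Non-uniform local convolution over one geometric tile: every target
 * accumulates contributions only from sources inside its local support,
 * i.e., an irregular sparse matrix-vector product evaluated tile by tile. */
void local_convolve_tile(const Tile *t,
                         const double *tgt_pos, double *tgt_val,
                         const double *src_pos, const double *src_val,
                         double h) {
    for (size_t i = t->tgt_begin; i < t->tgt_end; ++i) {
        double acc = 0.0;
        for (size_t j = t->src_begin; j < t->src_end; ++j)
            acc += kernel(tgt_pos[i] - src_pos[j], h) * src_val[j];
        tgt_val[i] += acc;
    }
}
```

Sorting targets and sources into tiles before this loop is what turns the irregular access pattern into contiguous, cache-friendly ranges.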

Architecture-Aware Hierarchical Geometric Tiling and Parallel Scheduling
Hierarchical tiling according to the on-chip memory hierarchy and its sizes; neighborhood tile traversal order; block distribution of tiles across cores. The goal is to exploit data sharing and locality at all levels of the cache hierarchy, as sketched below.
[Figure: on-chip memory abstraction, with tiles numbered 1-19 illustrating the neighborhood traversal order]
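The scheduling idea can be illustrated with a small OpenMP sketch. It assumes a two-level tile grid and a snake-like neighborhood order; the macros, process_subtile, and the traversal are hypothetical stand-ins, not the paper's actual scheduler.

```c
#include <omp.h>
#include <stdio.h>

/* Hypothetical two-level tile grid: NTILES L2-sized tiles, each split into
 * NS x NS L1-sized sub-tiles.  The snake-like neighborhood traversal and
 * the block distribution below are illustrative assumptions. */
#define NTILES 16  /* tiles sized for the shared last-level cache      */
#define NS      4  /* sub-tiles per side, sized for each private cache */

/* Stand-in for the per-sub-tile convolution work, e.g. a call to
 * local_convolve_tile() from the previous sketch. */
static void process_subtile(int tile, int sx, int sy) {
    printf("thread %d: tile %d, sub-tile (%d,%d)\n",
           omp_get_thread_num(), tile, sx, sy);
}

int main(void) {
    /* Block distribution: whole L2 tiles are assigned to cores, so data
     * sharing across cores occurs only along tile boundaries. */
    #pragma omp parallel for schedule(static)
    for (int tile = 0; tile < NTILES; ++tile) {
        /* Neighborhood traversal: walk L1 sub-tiles in a snake-like order
         * so consecutive sub-tiles are geometric neighbors and the sources
         * near their shared boundary stay resident in the private cache. */
        for (int sy = 0; sy < NS; ++sy) {
            for (int k = 0; k < NS; ++k) {
                int sx = (sy % 2 == 0) ? k : (NS - 1 - k);
                process_subtile(tile, sx, sy);
            }
        }
    }
    return 0;
}
```

Compiled with -fopenmp, the static schedule gives each core a contiguous block of L2 tiles, while the inner snake order keeps consecutive sub-tiles geometrically adjacent.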

Preliminary Evaluation