
Presentation transcript:

Motivation
- “Every three minutes a woman is diagnosed with breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006)
- Explore the use of new technology for solving intensive computational problems

Objective
- Help to improve the efficiency of early breast cancer detection
- Minimize the processing cost of the Digital Breast Tomosynthesis Mammography technique

Tomosynthesis reconstruction process
- Reconstructs a 3D image from multiple x-ray radiograph images
- Detects and diagnoses breast cancer and abnormalities

NVIDIA GPU - GeForce 8800
- Data-parallel programming
- On-chip SIMD

Compute Unified Device Architecture (CUDA): a programming interface
- Execute C code on NVIDIA GPU
- CUDA libraries: FFT and BLAS

Porting Tomosynthesis reconstruction to the GPU

Evaluation environments

Tomosynthesis reconstruction
[Figure: execution time (sec) vs. number of iterations]

Simplicity
- All software development stages (design, implementation, testing and deployment) are done in a single environment
- Allows novice users to run and work with the Tomosynthesis algorithm on Windows
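The data-parallel model listed under the GeForce 8800 and CUDA headings can be illustrated with a small sketch. This is not the poster's code: it is a plain-Python emulation of how a CUDA-style launch assigns one data element to each thread, and the names `scale_kernel` and `launch` are hypothetical, chosen only for this example. On a real GPU the per-thread bodies would run concurrently rather than in nested loops.

```python
# Illustrative sketch only: emulates a CUDA-style grid of threads in plain Python.
# scale_kernel and launch are hypothetical names, not part of CUDA or the poster.

def scale_kernel(block_idx, block_dim, thread_idx, src, dst, factor):
    """Per-thread body: each thread processes exactly one element."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(src):                        # guard threads past the end of the data
        dst[i] = src[i] * factor


def launch(kernel, n_elements, block_dim, *args):
    """Call the kernel once per (block, thread) pair and return the block count.

    A GPU would execute these calls in parallel; here they run sequentially.
    """
    n_blocks = (n_elements + block_dim - 1) // block_dim  # ceiling division
    for block_idx in range(n_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)
    return n_blocks


pixels = [1.0, 2.0, 3.0, 4.0, 5.0]
out = [0.0] * len(pixels)
blocks = launch(scale_kernel, len(pixels), 4, pixels, out, 2.0)
```

Because no thread depends on another thread's result, the work can be spread over as many processors as the hardware provides, which is the property the reconstruction port relies on.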
Summary
- GPU performance is comparable to HPC
  - Exploit inherent parallelism in the algorithm
  - Reduce communication and synchronization
  - Launch a high number of threads per multiprocessor
  - Hide memory latency (the implementation is memory bound)
- First implementation of the algorithm
  - Further development can improve performance on both CPU and GPU
  - Improve memory allocation
  - Reduce CPU/GPU communication overhead
  - Optimize kernel threads (running on the GPU)

Future work
- Optimize threads running on the GPU and improve CPU/GPU interaction
- Current performance enables further development of the Tomosynthesis algorithm: reducing image noise
- Explore opportunities for speeding up additional applications using the GPU

"Acceleration of Digital Tomosynthesis Mammography using Graphics Processors"
Diego Rivera, Micha Moffie, Dana Schaa and David Kaeli
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
{drivera, mmoffie, dschaa,

Acknowledgement
This project is supported by the Gordon Center for Subsurface Sensing and Imaging Systems. Many thanks to Juemin Zhang (ECE NEU) and Leo Hill (ATS NEU) for their help during the early stages of this work. Gordon-CenSSIS is a National Science Foundation Engineering Research Center supported in part by the Engineering Research Centers Program of the National Science Foundation (Award # EEC ).
[Image taken from: National Cancer Institute]

[Figure: GeForce 8800 block diagram. Labels: Host, Input Assembler, Thread Execution Manager, eight groups of Thread Processors each with a Parallel Data Cache, Load/store, Device Memory]
- 128 Stream Processors
- 768 MB
- from $530
From the presentation “GeForce 8800 & NVIDIA CUDA: A New architecture for Computing on the GPU” by Ian Buck, NVIDIA Corporation, at the Supercomputing '06 Workshop "General-Purpose GPU Computing: Practice And Experience", November

[Figure: Tomosynthesis reconstruction flowchart. Labels: X-ray source, detector, X-ray projections, axes X, Y, Z]
1. Initialization: set 3D volume
2. Forward: compute projections
3. Backward: correct 3D volume
4. 3D volume satisfied? No: repeat from the forward step. Yes: exit.
Taken from the presentation “Acceleration of Maximum Likelihood for Tomosynthesis Mammography” by Juemin Zhang, Waleed Meleis, David Kaeli, Tao Wu. ICPADS’06

Hardware used for the evaluation:

Nvidia GTX8800 (GPU)
- 128 Stream Processors, 1.35 GHz
- 768 MB device memory (86.4 GB/sec)
- PCI-E x16

TeraCluster (Cluster)
- 33 servers
- 4 nodes per server (dual processor, dual core)
- Intel Xeon, 2.0 GHz (Pentium M)
- 8/16 GB RAM per server
- Gigabit Ethernet interconnect (among servers)

Opportunity (Cluster)
- 65 servers
- 2 nodes per server (dual processor)
- Xeon EMT 64, 3.2 GHz (Pentium IV)
- 4 GB RAM per server
- Gigabit Ethernet interconnect (among servers)

Workstation
- Intel Core2 CPU (using only 1), 1.86 GHz
- 3 GB RAM
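The reconstruction loop named in the flowchart labels (set 3D volume, compute projections, correct 3D volume, repeat until satisfied) can be sketched on a toy problem. The poster does not give the update rule, so the sketch below assumes a generic multiplicative expectation-maximization style correction, and a tiny hand-written system matrix `A` stands in for the real X-ray geometry. The function names, the matrix, and the stopping tolerance are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a toy iterative reconstruction loop.
# The EM-style update, the matrix A and all names are assumptions for this example.

def forward(A, x):
    """Forward step: compute the projections A @ x of the current volume."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def reconstruct(A, y, max_iterations=200, tol=1e-6):
    """Iterate forward/backward steps until the volume stops changing."""
    n_rays, n_vox = len(A), len(A[0])
    x = [1.0] * n_vox                                        # initialization: set volume
    sens = [sum(A[i][j] for i in range(n_rays)) for j in range(n_vox)]  # assumed > 0
    for _ in range(max_iterations):
        proj = forward(A, x)                                 # forward: compute projections
        ratio = [m / p if p > 0 else 0.0 for m, p in zip(y, proj)]
        new_x = [                                            # backward: correct volume
            x[j] * sum(A[i][j] * ratio[i] for i in range(n_rays)) / sens[j]
            for j in range(n_vox)
        ]
        change = max(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:                                     # "satisfied?" check
            break
    return x


# Two voxels observed by three rays; measurements generated from the volume [2, 3].
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
measured = forward(A, [2.0, 3.0])
volume = reconstruct(A, measured)
```

Each voxel's correction depends only on the current projections, so the forward and backward steps are natural candidates for the one-thread-per-element mapping that a GPU port would use.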