Implementation and Optimization of Applications on an OpenCL Framework Based on the Many-SC Architecture

Presentation transcript:

Implementation and Optimization of Applications on an OpenCL Framework Based on the Many-SC Architecture
2014. 9. 30
신영길, Computer Graphics & Image Processing Lab, Seoul National University

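Goals
- Implement and optimize applications for evaluating the performance of the Many-SC architecture
- Design the evaluation criteria and a benchmarking set for performance evaluation
- Identify the application types that suit the architecture through performance experiments with the benchmarking set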

First-Year Goals
- Implement and optimize OpenCL-based applications on a PC: Registration, Volume Rendering, CT Reconstruction
- Define the comparison experiments and evaluation criteria

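Progress
Implementation and optimization of OpenCL-based applications on a PC:
- Registration: CPU/GPU programs implemented
- Volume Rendering: CPU/GPU programs implemented, including three acceleration techniques
- CT Reconstruction: CPU/GPU programs implemented, based on voxel-driven backprojection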

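Registration
Registration is the task of finding the transformation that describes the positional correspondence between one image (the reference image) and another (the floating image).
[Figure: registration loop. Inputs: reference image and floating image. Apply transform parameters to the floating image → compute similarity measure → converge? If no, update the parameters and repeat; if yes, output the registered image.]
Computing the similarity measure is a per-pixel operation, so it can be parallelized.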

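Registration
Parallelization of rigid registration:
- Assign a set of transformation parameters (translation, rotation) to each core group (Parameter #1, Parameter #2, ..., Parameter #n)
- Each core group moves the floating image with its assigned parameters, then computes the similarity measure against the reference image
- The parameters with the highest similarity measure are selected as the result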
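
The parameter search on this slide can be illustrated with a minimal sketch. It is not the presenters' implementation: it assumes integer translations only (no rotation), uses the negative sum of squared differences as the similarity measure, and the names `translate`, `similarity`, and `register` are hypothetical. Each candidate is scored independently, which is what would allow one candidate per core group.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]

def translate(img: Image, dx: int, dy: int, fill: float = 0.0) -> Image:
    """Shift img by (dx, dy) pixels; pixels shifted in from outside get `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy  # source pixel that lands on (x, y)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out

def similarity(a: Image, b: Image) -> float:
    """Assumed measure: negative sum of squared differences (higher is more similar)."""
    return -sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def register(reference: Image, floating: Image,
             candidates: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Score every candidate independently and return the best one."""
    return max(candidates,
               key=lambda p: similarity(reference, translate(floating, p[0], p[1])))

# Usage: a single bright pixel, displaced by (-1, -2) in the floating image.
reference = [[0.0] * 6 for _ in range(6)]
reference[2][3] = 1.0
floating = translate(reference, -1, -2)
candidates = [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)]
best = register(reference, floating, candidates)
```

In this toy case the search recovers the shift that undoes the displacement. A real rigid registration would add rotation to the candidate set and usually a more robust similarity measure.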

Registration
Registration execution time and speedup
             Data
CPU (sec)    3891.22    3436.51
GPU (sec)    29.83      26.68
Speedup      130.46     128.82
CPU: Intel i7-2600 (4 cores@3.4GHz) / GPU: NVIDIA GTX680 2.0GB (1536 stream processors@2.0GHz)

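Volume Rendering
3D volume data → 2D rendered image
Ray casting-based volume rendering:
- One ray per pixel of the output image advances through the volume, computing as it goes
- The results of the rays are independent of one another → high parallelism
- Suitable for evaluating scalability (performance gain as the number of cores increases)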

Volume Rendering
Factors that affect parallelism:
- Thread divergence: threads follow different control flows
- Global/shared memory access
- Loop (instruction mix): loop unrolling enhances performance
The volume rendering acceleration techniques cause thread divergence and therefore affect parallelism:
- Technique 1: transparent voxel skipping
- Technique 2: early ray termination
- Technique 3: empty space leaping
[Figure: current GPU architecture, NVIDIA GTX680 2.0GB (1536 stream processors@2.0GHz). Compute Device with Compute Units 1..N; each compute unit has Work-Items 1..M with Private Memory and a Local Memory (48KB); Global / Constant Memory Data Cache (128KB); Compute Device Memory with Global Memory (2GB). An accompanying illustration marks Techniques 1, 2, and 3.]
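
To make the divergence argument concrete, here is a minimal sketch of front-to-back compositing along one ray with techniques 1 and 2. It is an illustration under assumptions, not the presenters' kernel: samples are taken to be already classified into (intensity, opacity) pairs, the 0.99 opacity threshold is a hypothetical choice, and empty space leaping (technique 3) is left out because it needs an auxiliary data structure.

```python
def cast_ray(samples, opacity_threshold=0.99):
    """Composite (intensity, opacity) samples front to back along one ray.

    Returns (color, alpha, shaded), where `shaded` counts the samples that
    were actually composited.
    """
    color, alpha, shaded = 0.0, 0.0, 0
    for intensity, opacity in samples:
        if opacity == 0.0:
            continue  # technique 1: transparent voxel skipping
        color += (1.0 - alpha) * opacity * intensity
        alpha += (1.0 - alpha) * opacity
        shaded += 1
        if alpha >= opacity_threshold:
            break  # technique 2: early ray termination
    return color, alpha, shaded

# Two neighbouring rays that end up doing different amounts of work.
ray_a = [(1.0, 0.0)] * 5 + [(0.5, 1.0)] + [(1.0, 1.0)] * 10
ray_b = [(1.0, 0.5), (1.0, 0.5)]
result_a = cast_ray(ray_a)
result_b = cast_ray(ray_b)
```

Ray A composites one sample and stops, while ray B composites two. When such rays run in lockstep on the same compute unit, the differing `continue` and `break` paths are the kind of control-flow difference the slide refers to.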

Volume Rendering
Performance elements: number of cores, global memory size
[Figure: input data distributed over Core 1, Core 2, ..., Core X, next to the OpenCL memory hierarchy: Private Memory per work-item, Local Memory (48KB) per compute unit, Global / Constant Memory Data Cache (128KB), Global Memory (2GB).]
Speedup comparison by acceleration technique (rendered image: 600x600; all volumes 16bit)
                          Engine block (512x512x512)       Metal plate (512x512x206)        Abdomen (512x512x86)
                          CPU (sec)  GPU (sec)  Speedup    CPU (sec)  GPU (sec)  Speedup    CPU (sec)  GPU (sec)  Speedup
Basic VR                  617.31     3.07       201.08     618.24     3.05       202.70     241.22     1.19       202.71
With technique 1          151.16     0.82       184.34     150.31     0.87       172.77     67.99      0.39       174.33
With technique 2          296.54     1.78       166.60     331.88     2.07       160.33     134.53     0.8        168.16
With technique 3          136.66     0.73       187.21     141.85     (not shown) 172.99    45.58      0.25       182.32
With techniques 1, 2, 3   17.88      0.22       81.15      24.42      0.18       135.67     8.78       0.08       109.75
Technique 1: transparent voxel skipping / Technique 2: early ray termination / Technique 3: empty space leaping
CPU: Intel i7-2600 (4 cores@3.4GHz) / GPU: NVIDIA GTX680 2.0GB (1536 stream processors@2.0GHz)

CT Reconstruction
Projection image set → volumetric data
- Input: projection image set acquired by CT scanning
- Output: volumetric data
- Reconstruction uses the CT scan configuration (distances, angles, detector resolution, etc.)
[Figure: cone-beam geometry. SDD: Source-to-Detector Distance; SOD: Source-to-Object Distance]

CT Reconstruction
CT reconstruction using filtered backprojection: apply a filter to each 1D projection before backprojection.
[Figure: plain backprojection compared with filtered backprojection (filter, then backprojection)]
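
The slide does not say which filter is applied. As one common choice, the sketch below assumes the discrete Ram-Lak (ramp) kernel and convolves it with a 1D projection in the spatial domain. The function names, the `spacing` parameter, and the zero-padding at the detector edges are assumptions made for illustration.

```python
import math

def ram_lak_kernel(half_width, spacing=1.0):
    """Discrete ramp-filter taps h[n] for n in [-half_width, half_width]."""
    kernel = []
    for n in range(-half_width, half_width + 1):
        if n == 0:
            kernel.append(1.0 / (4.0 * spacing ** 2))
        elif n % 2 == 0:
            kernel.append(0.0)  # even taps vanish
        else:
            kernel.append(-1.0 / (math.pi ** 2 * n ** 2 * spacing ** 2))
    return kernel

def filter_projection(projection, kernel, spacing=1.0):
    """Convolve one 1D projection with the kernel (zero outside the detector)."""
    half = len(kernel) // 2
    filtered = []
    for i in range(len(projection)):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = i - (k - half)  # detector index paired with this tap
            if 0 <= j < len(projection):
                acc += projection[j] * h
        filtered.append(acc * spacing)
    return filtered

# Usage: filtering an impulse reproduces the kernel around the impulse.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
filtered = filter_projection(impulse, ram_lak_kernel(3))
```

The negative side lobes around the positive centre tap are what offset the blur that plain backprojection would otherwise leave in the reconstruction.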

CT Reconstruction
Filtered backprojection for cone-beam CT: the FDK (Feldkamp-Davis-Kress) algorithm
[Figure: pipeline. Object → data acquisition (cone-beam projection) → projections → row-wise filtered projections → weighted backprojection → reconstruction → visualization. Callouts: "Parallelization!!" and "Approximation".]

CT Reconstruction
Parallelization methods: voxel-driven backprojection vs. ray-driven backprojection → the two differ in parallelism and in the reconstruction result.
[Figure: voxel-driven backprojection, where each voxel of the volume samples the projection image; ray-driven backprojection, where each ray from the projection image samples the volume and write conflicts occur.]

CT Reconstruction
Voxel-driven backprojection: execution time and speedup
Input: projection images, 512x384x680
Output volume data size      320x320x176    480x480x264
CPU (single thread, sec)     2944.76        8805.96
OpenCL (GPU, sec)            18.55          52.40
Speedup                      158.75         168.05
CPU: Intel i7-2600 (4 cores@3.4GHz) / GPU: NVIDIA GTX680 2.0GB (1536 stream processors@2.0GHz)

CT Reconstruction
Voxel-driven backprojection:
- The results for the voxels are independent of one another, so parallelism is high → suitable for evaluating scalability (performance gain as the number of cores increases)
- Speed improves when there is enough cache memory to hold the projection images → suitable for evaluating I/O throughput (performance gain depending on cache structure and size)
[Figure: a voxel of the volume sampling the projection images, next to the OpenCL memory hierarchy (Private Memory, Local Memory, Global / Constant Memory Data Cache, Global Memory), with the projection images and the volume data marked.]
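
The independence of voxels can be seen in a simplified sketch. It assumes a 2D parallel-beam geometry instead of the cone-beam FDK setting of the slides, unit detector spacing, and linear interpolation on the detector; all function names are hypothetical. The point it illustrates is that each output pixel only reads the projections and is written by exactly one iteration, so no write conflict can arise.

```python
import math

def sample_linear(row, t):
    """Linearly interpolate `row` at fractional index t; zero outside the detector."""
    if t < 0 or t > len(row) - 1:
        return 0.0
    i = min(int(t), len(row) - 2)
    f = t - i
    return (1.0 - f) * row[i] + f * row[i + 1]

def backproject_voxel_driven(projections, angles, size):
    """Accumulate, for every pixel, the detector sample it projects onto per angle."""
    center = (size - 1) / 2.0
    image = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):  # each (x, y) would map to one work-item
            acc = 0.0
            for row, theta in zip(projections, angles):
                det_center = (len(row) - 1) / 2.0
                t = ((x - center) * math.cos(theta)
                     + (y - center) * math.sin(theta) + det_center)
                acc += sample_linear(row, t)
            image[y][x] = acc  # the only write to this pixel
    return image

# Usage: two projections of a centred point, taken at 0 and 90 degrees.
projections = [[0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0]]
angles = [0.0, math.pi / 2.0]
image = backproject_voxel_driven(projections, angles, 5)
```

The two backprojected stripes cross at the centre pixel, which therefore receives the largest value. Because every pixel reads all projections, how well the projection data fits in cache plausibly dominates memory traffic, in line with the I/O throughput remark on the slide.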

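CT Reconstruction
Ray-driven backprojection:
- Compared with the voxel-driven method, it uses geometric information and allows a more accurate computation
- Adjacent rays try to update the same voxel at the same time → write conflict → difficult to parallelize
Implementation issues:
- Avoiding write conflicts requires OpenCL's atomic_add(float)
- It is supported in OpenCL 2.0, but no H/W currently supports it
- It can be replaced by fixed-point arithmetic using atomic_add(int), at the cost of reduced accuracy
[Figure: rays sampling the volume, with a write conflict on a shared voxel]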
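
The accuracy cost of the fixed-point workaround can be shown with a small host-side sketch. It does not use OpenCL: a plain Python integer stands in for the integer buffer that atomic_add(int) would update, and the 16 fractional bits are a hypothetical choice. Integer addition is order-independent, which is why concurrent updates could be applied safely, but every contribution is rounded before it is added.

```python
SCALE_BITS = 16          # hypothetical number of fractional bits
SCALE = 1 << SCALE_BITS

def to_fixed(value):
    """Round a float contribution to the nearest fixed-point integer."""
    return int(round(value * SCALE))

def from_fixed(fixed):
    """Convert an accumulated fixed-point integer back to a float."""
    return fixed / SCALE

def accumulate_fixed(contributions):
    """Sum contributions in fixed point; the result does not depend on order."""
    total = 0  # stands in for the int voxel updated with atomic_add(int)
    for c in contributions:
        total += to_fixed(c)
    return from_fixed(total)

# Usage: 1000 rays each add 0.1 to the same voxel.
result = accumulate_fixed([0.1] * 1000)
error = abs(result - 100.0)
```

Here the accumulated value drifts from 100 by a few thousandths, and a contribution as small as 1e-6 is rounded away entirely. A real 32-bit integer buffer would additionally cap the representable range, so the number of fractional bits trades precision against overflow.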

Common Operations That Need H/W Support
Based on the current applications (Volume Rendering, Registration, CT Reconstruction).
Operations available on GPUs that need to be supported:
Priority  Operation
1         Trilinear interpolation (can be replaced with bilinear interpolation)
2         Vector operations (e.g. vector normalization, length of vector, dot product)
3         Matrix operations (e.g. coordinate transformation)
4         Mathematical functions (e.g. sin, cos, log, pow, saturate)
Operations not available on GPUs that would be good to support:
Priority  Operation
1         Gradient (central difference: the computation is simple, but 3D data requires 6 samples)
More items will be added depending on the applications developed in the future.
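
Two of the operations listed on this slide are sketched below for reference: trilinear interpolation and the central-difference gradient with its six samples. This is a plain illustration, not hardware or library code; it assumes a volume stored as nested lists indexed `volume[z][y][x]`, non-negative coordinates inside the volume, and unit voxel spacing.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def trilinear(volume, x, y, z):
    """Interpolate volume[z][y][x] at a fractional position inside the volume."""
    x0, y0, z0 = int(x), int(y), int(z)
    x1 = min(x0 + 1, len(volume[0][0]) - 1)
    y1 = min(y0 + 1, len(volume[0]) - 1)
    z1 = min(z0 + 1, len(volume) - 1)
    fx, fy, fz = x - x0, y - y0, z - z0
    c00 = lerp(volume[z0][y0][x0], volume[z0][y0][x1], fx)
    c10 = lerp(volume[z0][y1][x0], volume[z0][y1][x1], fx)
    c01 = lerp(volume[z1][y0][x0], volume[z1][y0][x1], fx)
    c11 = lerp(volume[z1][y1][x0], volume[z1][y1][x1], fx)
    return lerp(lerp(c00, c10, fy), lerp(c01, c11, fy), fz)

def central_gradient(volume, x, y, z):
    """Central differences at an interior integer voxel: two samples per axis."""
    gx = (volume[z][y][x + 1] - volume[z][y][x - 1]) / 2.0
    gy = (volume[z][y + 1][x] - volume[z][y - 1][x]) / 2.0
    gz = (volume[z + 1][y][x] - volume[z - 1][y][x]) / 2.0
    return gx, gy, gz

# Usage: a 3x3x3 volume holding the linear field f = x + 2y + 3z.
volume = [[[x + 2 * y + 3 * z for x in range(3)] for y in range(3)] for z in range(3)]
value = trilinear(volume, 0.5, 0.5, 0.5)
gradient = central_gradient(volume, 1, 1, 1)
```

Trilinear interpolation reads 8 voxels per sample and the gradient reads 6, which gives a sense of why both are listed as candidates for hardware support.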

Future Schedule
- Implement OpenCL-based applications on a PC and optimize the code
- CT Reconstruction: implement CPU/GPU programs based on ray-driven backprojection
- Ongoing comparison experiments and definition of evaluation criteria