GPU [1]
Speaker: 高崇閔

Exceed limitation: error message at line 47, Exercise 2

CPU vs. GPU

Question: If we can speed up computation on the CPU, does that make the GPU useless?

Reply: In Table II we spend about twice as much time transferring data as computing. But that does not mean the CPU can replace the GPU: we transfer the data to the GPU only once, yet run the computation on it many times. As a result, the transfer time is a necessary cost.

Table II (Exercise 4). Experimental platform: GeForce 8800 GT; threads per block = 512; N = number of blocks × threads; size = N × sizeof(float) bytes. Columns: number of blocks, data size (KB to MB), GPU time, device-to-host transfer time, CPU time.

Computation

Matrix multiplication: A is m×n, B is n×p, so A·B is m×p.
- Data to transfer: n·(p+m)·sizeof(float) bytes.
- Operations: m·n·p multiplications and m·(n−1)·p additions.

Vector addition: A and B are 1×n, so A+B is 1×n.
- Data to transfer: 2n·sizeof(float) bytes.
- Operations: n additions.

Ratio of matrix multiplication to vector addition:
- Data transfer: n·(p+m) / (2n) = (p+m)/2.
- Computation: [m·n·p multiplications + m·(n−1)·p additions] / (n additions).

Remaining questions

- How can one thread handle several data elements when the number of threads is finite?
- How do we compute the product of a matrix and a vector (inner product and outer product formulations)?