Presentation is loading. Please wait.

Presentation is loading. Please wait.

Thrust & Curand Dr. Bo Yuan

Similar presentations


Presentation on theme: "Thrust & Curand Dr. Bo Yuan"— Presentation transcript:

1 Thrust & Curand Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

2 Outline Thrust Basic Work with Thrust General Types of Thrust Fancy Iterators of Thrust Curand Library 2

3 Thrust Basic What is Thrust? Containers Iterators Interaction with raw pointers Namespace 3

4 What is Thrust? C++ template library for CUDA –Mimics Standard Template Library (STL) Containers –thrust::host_vector –thrust::device_vector Algorithms –thrust::sort() –thrust::reduce() –thrust::inclusive_scan() –… 4

5 Containers Make common operations concise and readable –Hides cudaMalloc, cudaMemcpy and cudaFree // allocate host vector with two elements thrust::host_vector h_vec(2); // copy host vector to device thrust::device_vector d_vec = h_vec; // manipulate device values from the host d_vec[0] = 13; d_vec[1] = 27; std::cout << "sum: " << d_vec[0] + d_vec[1] << std::endl; // vector memory automatically released w/ free() or cudaFree() 5

6 Iterators Sequences defined by pair of iterators // allocate device vector thrust::device_vector d_vec(4); // returns iterator at first element of d_vec d_vec.begin(); // returns iterator one past the last element of d_vec d_vec.end() // [begin, end) pair defines a sequence of 4 elements 6 d_vec.begin() d_vec.end()

7 Iterators // initialize random values on host thrust::host_vector h_vec(1000); thrust::generate(h_vec.begin(), h_vec.end(), rand); // copy values to device thrust::device_vector d_vec = h_vec; // compute sum on host int h_sum = thrust::reduce(h_vec.begin(),h_vec.end()); // compute sum on device int d_sum = thrust::reduce(d_vec.begin(),d_vec.end()); 7

8 Convertible to Raw Pointers // allocate device vector thrust::device_vector d_vec(4); // obtain raw pointer to device vectors memory int * ptr = thrust::raw_pointer_cast(&d_vec[0]); // use ptr in a CUDA C kernel my_kernel >>(N, ptr); // Note: ptr cannot be dereferenced on the host! 8

9 Wrap Raw Pointers with device_ptr // raw pointer to device memory int * raw_ptr; cudaMalloc((void **) &raw_ptr, N * sizeof(int)); // wrap raw pointer with a device_ptr thrust::device_ptr dev_ptr(raw_ptr); // use device_ptr in thrust algorithms thrust::fill(dev_ptr, dev_ptr + N, (int) 0); // access device memory through device_ptr dev_ptr[0] = 1; 9

10 Namespaces C++ supports namespaces –Thrust uses thrust namespace thrust::device_vector thrust::copy –STL uses std namespace std::vector std::list Avoids collisions –thrust::sort() –std::sort() For brevity –using namespace thrust; 10

11 Usage of Thrust: Thrust provides many standard algorithms –Transformations –Reductions –Prefix Sums –Sorting Generic definitions –General Types Built-in types ( int, float, …) User-defined structures –General Operators reduce with plus operator scan with maximum operator 11

12 Transformations // allocate three device_vectors with 10 elements thrust::device_vector X(10); thrust::device_vector Y(10); thrust::device_vector Z(10); // initialize X to 0,1,2,3,.... thrust::sequence(X.begin(), X.end()); // compute Y = -X thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate ()); // fill Z with twos thrust::fill(Z.begin(), Z.end(), 2); // compute Y = X mod 2 thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus ()); // replace all the ones in Y with tens thrust::replace(Y.begin(), Y.end(), 1, 10); 12

13 Reductions int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus ()); int sum = thrust::reduce(D.begin(), D.end(), (int) 0); int sum = thrust::reduce(D.begin(), D.end()) thrust::count_if thrust::min_element thrust::max_element thrust::is_sorted thrust::inner_product 13

14 Sorting thrust::sort thrust::stable_sort thrust::sort_by_key thrust::stable_sort_by_key 14

15 General Types and Operators #include // declare storage device_vector i_vec =... device_vector f_vec =... // sum of integers (equivalent calls) reduce(i_vec.begin(), i_vec.end()); reduce(i_vec.begin(), i_vec.end(), 0, plus ()); // sum of floats (equivalent calls) reduce(f_vec.begin(), f_vec.end()); reduce(f_vec.begin(), f_vec.end(), 0.0f, plus ()); // maximum of integers reduce(i_vec.begin(), i_vec.end(), 0, maximum ()); 15

16 General Types and Operators struct negate_float2 { __host__ __device__ float2 operator()(float2 a) { return make_float2(-a.x, -a.y); } }; // declare storage device_vector input =... device_vector output =... // create functor negate_float2 func; // negate vectors transform(input.begin(), input.end(), output.begin(), func); 16

17 General Types and Operators // compare x component of two float2 structures struct compare_float2 { __host__ __device__ bool operator()(float2 a, float2 b) { return a.x < b.x; } }; // declare storage device_vector vec =... // create comparison functor compare_float2 comp; // sort elements by x component sort(vec.begin(), vec.end(), comp); 17

18 Advanced Thrust Behave like normal iterators –Algorithms don't know the difference –constant_iterator –counting_iterator –transform_iterator –zip_iterator 18

19 Constant // create iterators constant_iterator begin(10); constant_iterator end = begin + 3; begin[0] // returns 10 begin[1] // returns 10 begin[100] // returns 10 // sum of [begin, end) reduce(begin, end); // returns 30 (i.e. 3 * 10) 19

20 Counting // create iterators counting_iterator begin(10); counting_iterator end = begin + 3; begin[0] // returns 10 begin[1] // returns 11 begin[100] // returns 110 // sum of [begin, end) reduce(begin, end); // returns 33 (i.e. 10 + 11 + 12) 20

21 Transform –Yields a transformed sequence –Facilitates kernel fusion –Conserves memory capacity and bandwidth 21 XYZ F( x ) F( y )F( z )

22 Transform // initialize vector device_vector vec(3); vec[0] = 10; vec[1] = 20; vec[2] = 30; // create iterator (type omitted) begin = make_transform_iterator(vec.begin(), negate ()); end = make_transform_iterator(vec.end(), negate ()); begin[0] // returns -10 begin[1] // returns -20 begin[2] // returns -30 // sum of [begin, end) reduce(begin, end); // returns -60 (i.e. -10 + -20 + -30) 22

23 Zip –Looks like an array of structs (AoS) –Stored in structure of arrays (SoA) 23

24 Zip // initialize vectors device_vector A(3); device_vector B(3); A[0] = 10; A[1] = 20; A[2] = 30; B[0] = x; B[1] = y; B[2] = z; // create iterator (type omitted) begin = make_zip_iterator(make_tuple(A.begin(), B.begin())); end = make_zip_iterator(make_tuple(A.end(), B.end())); begin[0] // returns tuple(10, x) begin[1] // returns tuple(20, y) begin[2] // returns tuple(30, z) // maximum of [begin, end) maximum > binary_op; reduce(begin, end, begin[0], binary_op); // returns tuple(30, z) 24

25 Curand Library # include curandStatus_t curandGenerate(curandGenerator_t generator, unsigned int *outputPtr, size_t num) A given state may not be used by more than one block. A given block may generate random numbers using multiple states. 25

26 Distribution __device__ float curand_uniform (curandState_t *state) __device__ float curand_normal (curandState_t *state) __device__ unsigned int curand_poisson (curandState_t *state, double lambda) __device__ double curand_uniform_double (curandState_t *state) __device__ float2 curand_normal2 (curandState_t *state) 26

27 struct estimate_pi{ public thrust::unary_function { __device__ float operator()(unsigned int thread_id) { float sum = 0; unsigned int N = 10000; // samples per thread unsigned int seed = thread_id; curandState s; // seed a random number generator curand_init(seed, 0, 0, &s); // take N samples in a quarter circle for(unsigned int i = 0; i < N; ++i) { 27 Estimating π

28 // draw a sample from the unit square float x = curand_uniform(&s); float y = curand_uniform(&s); float dist = sqrtf(x*x + y*y); // measure distance from the origin // add 1.0f if (u0,u1) is inside the quarter circle if(dist <= 1.0f) sum += 1.0f; } // multiply by 4 to get the area of the whole circle sum *= 4.0f; // divide by N return sum / N; }}; 28 Estimating π

29 int main(void) { // use 30K independent seeds int M = 30000; float estimate = thrust::transform_reduce ( thrust::counting_iterator (0),thrust::c ounting_iterator (M), estimate_pi(), 0.0f, thrust::plus ()); estimate /= M; std::cout << std::setprecision(3); std::cout << "pi is approximately "; std::cout << estimate << std::endl; return 0; } 29 Estimating π

30 Review What are containers and iterators? How to transfer information between Host and Device via Thurst? How to use thrusts namespace? In which way can we make containers associated with raw pointers? How to use Thrust to perform transformation, reduce and sort? How to use general types in Thrust? 30

31 Review Why constant iterators can save storage? When do we need to use counting iterators, and how much memory does it require? How to use transfor iterators? Why do we need zip iterators? How to generate different distribution by Curand? How to use Curand with Thrust? How to use Curand in a kernel function? 31

32 References An Introduction to Thrust GTC 2010 –(Part 1) - High Productivity Development –(Part 2) - Thrust By Example Thrust_Quick_Start_Guide Thrust - A Productivity-Oriented Library for CUDA http://code.google.com/p/thrust/ CURAND_Library 32


Download ppt "Thrust & Curand Dr. Bo Yuan"

Similar presentations


Ads by Google