Programming with CUDA WS 08/09 Lecture 11 Thu, 27 Nov, 2008
Previously
Optimizing your code
–Instruction throughput
–Memory bandwidth
–#Threads per block
–Type of memory
–General guidelines
Today
Graded/ungraded course?
2 examples
–Matrix multiplication (straightforward)
–Parallel reduction
Final projects
Graded/ungraded
Matrix Multiplication
Inherently parallel problem
C = A * B
A: hA x wA, B: hB x wB, C: hA x wB
Each entry in C depends on one row in A and one column in B
–Assign each entry to a thread
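Written out, the dependence of each entry on one row of A and one column of B is the usual dot product. The index names i, j, k are introduced here for illustration, and the product is only defined when wA = hB:

```latex
C_{ij} = \sum_{k=1}^{w_A} A_{ik}\, B_{kj},
\qquad 1 \le i \le h_A,\quad 1 \le j \le w_B,\quad w_A = h_B
```

All hA * wB entries can be computed independently of one another, which is what makes the one-thread-per-entry assignment possible.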
Matrix Multiplication
C: hA x wB
First strategy: start a thread block with hA*wB threads
–Too many threads per block!
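To see why a single block of hA*wB threads fails for all but tiny matrices, the per-block thread limit can be queried from the device. The following is a minimal sketch: the helper name `fitsInOneBlock` is hypothetical, and it assumes device 0 is the GPU of interest. The limit is 512 or 1024 threads per block depending on the compute capability.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if a single block of hA * wB threads
// does not exceed the per-block thread limit of device 0.
bool fitsInOneBlock(int hA, int wB)
{
    assert(hA > 0 && wB > 0);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;  // no usable device: treat as "does not fit"

    // Use 64-bit arithmetic so that large dimensions cannot overflow.
    return (long long)hA * wB <= prop.maxThreadsPerBlock;
}
```

For example, a 1024 x 1024 result matrix would need over a million threads in one block, so `fitsInOneBlock(1024, 1024)` is false on any device.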
Matrix Multiplication
C: hA x wB
Better: block the problem
–Break C into tiles of BlockSize x BlockSize entries
–Assign each tile to a thread block
Recall: recommended #threads/block = 192, 256
A reasonable choice for BlockSize is 16 (16 x 16 = 256 threads per block)
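The blocked launch can be sketched as follows. This is an illustrative version of the straightforward approach (one thread per entry of C, reading A and B directly from global memory), not necessarily the code shown in the lecture. The names `matMulKernel` and `matMul` are hypothetical, matrices are assumed to be stored row-major, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // 16 x 16 = 256 threads per block

// Each thread computes one entry C[row][col] from one row of A and one
// column of B. Matrices are row-major; wA is also the height of B.
__global__ void matMulKernel(const float* A, const float* B, float* C,
                             int hA, int wA, int wB)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= hA || col >= wB) return;  // guard for partial tiles at the edges

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        sum += A[row * wA + k] * B[k * wB + col];
    C[row * wB + col] = sum;
}

// Hypothetical host wrapper: copies the inputs to the device, launches one
// thread block per BLOCK_SIZE x BLOCK_SIZE tile of C, and copies C back.
std::vector<float> matMul(const std::vector<float>& A,
                          const std::vector<float>& B,
                          int hA, int wA, int wB)
{
    assert(hA > 0 && wA > 0 && wB > 0);
    assert(A.size() == (size_t)hA * wA && B.size() == (size_t)wA * wB);

    std::vector<float> C((size_t)hA * wB);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Round up so that matrices whose sizes are not multiples of
    // BLOCK_SIZE are still fully covered.
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid((wB + BLOCK_SIZE - 1) / BLOCK_SIZE,
              (hA + BLOCK_SIZE - 1) / BLOCK_SIZE);
    matMulKernel<<<grid, block>>>(dA, dB, dC, hA, wA, wB);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

The bounds check in the kernel is a design choice that lets the same launch configuration handle arbitrary matrix sizes. A version that stages tiles of A and B in shared memory would reduce global memory traffic, but is beyond this sketch.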
Matrix Multiplication
Parallel Reduction
Reduction
–Reducing an array to a single value, e.g. sum, min, max
Slides
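A sum reduction can be sketched as below. This is one common shared-memory variant, not necessarily the one in the referenced slides. The names `reduceSumKernel` and `reduceSum` are hypothetical, the block size of 256 is an assumption that must be a power of two for this loop, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block; must be a power of two here

// Each block loads up to THREADS elements into shared memory, sums them
// by repeatedly halving the active range, and writes one partial sum to
// out[blockIdx.x].
__global__ void reduceSumKernel(const float* in, float* out, int n)
{
    __shared__ float s[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    s[tid] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();  // all threads reach this, inside or outside the if
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];
}

// Hypothetical host wrapper: launches the kernel repeatedly, each pass
// shrinking the array to one value per block, until one value remains.
float reduceSum(const std::vector<float>& data)
{
    assert(!data.empty());
    int n = (int)data.size();

    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    while (n > 1) {
        int blocks = (n + THREADS - 1) / THREADS;
        cudaMalloc(&dOut, blocks * sizeof(float));
        reduceSumKernel<<<blocks, THREADS>>>(dIn, dOut, n);
        cudaFree(dIn);  // the partial sums become the next pass's input
        dIn = dOut;
        n = blocks;
    }

    float result;
    cudaMemcpy(&result, dIn, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    return result;
}
```

Replacing `+=` and the padding value 0.0f with a minimum or maximum and its identity element gives the min and max reductions. Because floating-point addition is not associative, the result can differ slightly from a sequential sum.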
Final Projects
Time-line
–Thu, 20 Nov: Float write-ups on ideas of Jens & Waqar
–Tue, 25 Nov: Suggest groups and topics
–Thu, 27 Nov (today): Groups and topics assigned
–Tue, 2 Dec: Last chance to change groups/topics; groups and topics finalized
See you next week!