CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts.


1 CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts

2 Summary (1): Architecture
Modern architecture designs are driven by energy constraints.
Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput.
Some parallelism is implicit (out-of-order superscalar processing), but it has limits.
Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it.

3 Summary (2): Memory
Memory technologies trade off energy and cost for capacity, with SRAM registers on one end and spinning-platter hard disks on the other.
Locality (relationships between memory accesses) can help us get the best of all cases.
Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.).

4 Summary (3): Software
Want to fully occupy your hardware?
– Express locality (tiling)
– Vectorize (compiler or manual)
– Multithread (e.g. OpenMP)
– Accelerate (e.g. CUDA, OpenCL)
Take the cost into consideration. Unless you’re optimizing in your free time, your time isn’t free.

5 Research Perspective (2010)
Can we generalize and categorize the most important, generally applicable GPU Computing software optimizations?
– Across multiple architectures
– Across many applications
What kinds of performance trends are we seeing from successive GPU generations?
Conclusion
– GPUs aren’t special, and parallel programming is getting easier

6 Application Survey
Surveyed the GPU Computing Gems chapters
Studied the Parboil benchmarks in detail
Results: Eight (for now) major categories of optimization transformations
– Performance impact of individual optimizations on certain Parboil benchmarks included in the paper

7 1. (Input) Data Access Tiling
[Figure labels: DRAM, Cache, Scratchpad, Explicit Copy, Implicit Copy, Local Access]
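As a rough illustration of data access tiling, the sketch below multiplies two small square matrices one tile at a time. It stages each tile of the second matrix in a small local buffer, loosely mirroring the "explicit copy" into a scratchpad named in the figure labels. The function names, the flat row-major layout, and the tile size are assumptions made for this example; pure Python shows only the loop structure, not the performance benefit.

```python
def matmul_naive(a, b, n):
    """Reference n x n multiply on flat row-major lists."""
    return [sum(a[i * n + k] * b[k * n + j] for k in range(n))
            for i in range(n) for j in range(n)]


def matmul_tiled(a, b, n, tile=2):
    """Same product, computed tile by tile (hypothetical helper)."""
    c = [0.0] * (n * n)
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            j_end = min(j0 + tile, n)
            for k0 in range(0, n, tile):
                k_end = min(k0 + tile, n)
                # Explicit copy: stage one tile of b in a local buffer,
                # standing in for a scratchpad.
                b_tile = [b[k * n + j0:k * n + j_end] for k in range(k0, k_end)]
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, k_end):
                        a_ik = a[i * n + k]
                        row = b_tile[k - k0]
                        # Local access: every read of b hits the staged tile.
                        for j in range(j0, j_end):
                            c[i * n + j] += a_ik * row[j - j0]
    return c
```

With a cache, the same blocked loop order works without the staging step, because the hardware performs the copy implicitly.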

8 2. (Output) Privatization
Avoid contention by aggregating updates locally
Requires storage resources to keep copies of data structures
[Figure labels: Private Results, Local Results, Global Results]
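A minimal sketch of privatization, using a histogram as the contended output. Each worker accumulates into its own private copy, and the copies are merged at the end. The workers are simulated by a sequential loop, and the function name and chunking scheme are assumptions for illustration only.

```python
def histogram_privatized(values, num_bins, num_workers=4):
    """Histogram built from per-worker private copies (hypothetical helper)."""
    chunk = (len(values) + num_workers - 1) // num_workers
    privates = []
    for w in range(num_workers):
        # Private results: no other worker touches this copy, so no contention.
        local = [0] * num_bins
        for v in values[w * chunk:(w + 1) * chunk]:
            local[v] += 1
        privates.append(local)

    # Global results: one merge pass over the private copies.
    result = [0] * num_bins
    for local in privates:
        for b, count in enumerate(local):
            result[b] += count
    return result


print(histogram_privatized([0, 1, 1, 3, 3, 3, 2, 0], num_bins=4))  # [2, 2, 1, 3]
```

The storage cost noted on the slide shows up as `num_workers` copies of the histogram.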

9 Running Example: SpMV
Ax = v
[Figure labels: Row, Data, Col, A, v, x]

10 Running Example: SpMV
Ax = v
[Figure labels: Row, Data, Col, A, v, x]
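The Row, Data, and Col labels suggest a compressed sparse row (CSR) style of storage, though the slides do not say so explicitly. Under that assumption, a minimal sparse matrix-vector product Ax = v might look like the sketch below; the array names `row_ptr`, `col`, and `data` are chosen for this example.

```python
def spmv_csr(row_ptr, col, data, x):
    """Compute v = A x for a matrix A stored in CSR form (assumed layout)."""
    v = [0.0] * (len(row_ptr) - 1)
    for i in range(len(v)):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            v[i] += data[p] * x[col[p]]
    return v


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col = [0, 2, 1, 0, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col, data, [1.0, 2.0, 3.0]))  # [7.0, 6.0, 19.0]
```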

11 3. “Scatter to Gather” Transformation
Ax = v
[Figure labels: Row, Data, Col, A, v, x]

12 3. “Scatter to Gather” Transformation
Ax = v
[Figure labels: Row, Data, Col, A, v, x]
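One way to read the scatter-to-gather transformation on SpMV: in the scatter form, work is organized by input nonzero and each one writes into whichever output element it affects; in the gather form, work is organized by output element and each one reads the inputs it needs. The sketch below assumes the matrix arrives as unordered (row, col, value) triples; the function names are hypothetical.

```python
def spmv_scatter(n_rows, entries, x):
    """Input-driven: each nonzero updates the output it lands on."""
    v = [0.0] * n_rows
    for r, c, a in entries:
        # In a parallel setting many nonzeros may target the same v[r],
        # so this update would need atomics or locks.
        v[r] += a * x[c]
    return v


def spmv_gather(n_rows, entries, x):
    """Output-driven: each output element collects its own contributions."""
    by_row = [[] for _ in range(n_rows)]
    for r, c, a in entries:
        by_row[r].append((c, a))  # one-time regrouping by output row
    # Each v[r] now has a single writer that only reads shared inputs.
    return [sum((a * x[c] for c, a in row), 0.0) for row in by_row]


entries = [(2, 0, 4.0), (0, 0, 1.0), (1, 1, 3.0), (0, 2, 2.0), (2, 2, 5.0)]
x = [1.0, 2.0, 3.0]
print(spmv_scatter(3, entries, x))  # [7.0, 6.0, 19.0]
print(spmv_gather(3, entries, x))   # [7.0, 6.0, 19.0]
```

The regrouping step has its own cost, so the gather form pays off mainly when the grouped structure is reused across many products.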

13 4. Binning
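The slide gives only the title. A common reading of binning is to sort input data into spatial bins up front, so that each output only has to examine the bins near it instead of the whole input. The sketch below assumes 2D points and a uniform grid; `build_bins`, `neighbors_within`, and the `cell` parameter are names invented for this example.

```python
def build_bins(points, cell):
    """Map each grid cell to the indices of the points that fall inside it."""
    bins = {}
    for idx, (px, py) in enumerate(points):
        key = (int(px // cell), int(py // cell))
        bins.setdefault(key, []).append(idx)
    return bins


def neighbors_within(points, bins, cell, query, radius):
    """Indices of points within `radius` of `query`, checking nearby bins only."""
    qx, qy = query
    cx, cy = int(qx // cell), int(qy // cell)
    reach = int(radius // cell) + 1  # enough cells to cover the radius
    found = []
    for bx in range(cx - reach, cx + reach + 1):
        for by in range(cy - reach, cy + reach + 1):
            for idx in bins.get((bx, by), []):
                px, py = points[idx]
                if (px - qx) ** 2 + (py - qy) ** 2 <= radius ** 2:
                    found.append(idx)
    return sorted(found)


points = [(0.5, 0.5), (1.5, 0.5), (5.0, 5.0), (0.9, 1.1)]
bins = build_bins(points, cell=1.0)
print(neighbors_within(points, bins, 1.0, (1.0, 1.0), 1.0))  # [0, 1, 3]
```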

14 5. Regularization (Load Balancing)
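The slide gives only the title. One simple form of regularization, assumed here for illustration, is to split irregular work items into pieces of bounded size so that no single worker is stuck with an outsized share. The sketch takes a CSR-style row pointer array, where row i owns positions `row_ptr[i]` up to `row_ptr[i+1]`, and cuts long rows into chunks; the function name and the `max_len` parameter are hypothetical.

```python
def regularize(row_ptr, max_len):
    """Split rows into (row, start, end) work units of at most max_len items."""
    units = []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        for lo in range(start, end, max_len):
            units.append((row, lo, min(lo + max_len, end)))
    return units


# Row lengths 1, 7, and 2: the long middle row is split in two.
print(regularize([0, 1, 8, 10], max_len=4))
# [(0, 0, 1), (1, 1, 5), (1, 5, 8), (2, 8, 10)]
```

Units that share a row must later combine their partial results, which is the price of the more even distribution.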

15 6. Compaction
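The slide gives only the title. Compaction usually means packing the elements that still need work into a dense array so that later passes skip the rest. A standard way to do this in parallel is a prefix sum over keep-flags, which gives each surviving element its output slot. The sequential sketch below follows that structure; the function name and predicate argument are assumptions.

```python
from itertools import accumulate


def compact(items, keep):
    """Pack the items satisfying `keep` into a dense list, preserving order."""
    flags = [1 if keep(x) else 0 for x in items]
    # Exclusive prefix sum: offsets[i] counts the kept items before position i.
    offsets = [0] + list(accumulate(flags))[:-1]
    out = [None] * sum(flags)
    for x, flag, offset in zip(items, flags, offsets):
        if flag:
            # Each kept item has a unique slot, so the writes are independent.
            out[offset] = x
    return out


print(compact([3, 0, 0, 7, 0, 1], lambda x: x != 0))  # [3, 7, 1]
```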

16 7. Data Layout Transformation

17 7. Data Layout Transformation
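The slides give only the title. A familiar instance of data layout transformation is converting an array of structures (AoS) into a structure of arrays (SoA), so that neighboring workers reading the same field touch adjacent memory. The sketch below assumes a flat list with `num_fields` interleaved values per record; the function name is hypothetical.

```python
def aos_to_soa(aos, num_fields):
    """Reorder [x0, y0, z0, x1, y1, z1, ...] into [x0, x1, ..., y0, ..., z0, ...]."""
    n = len(aos) // num_fields  # number of records
    soa = [0] * len(aos)
    for i in range(n):
        for f in range(num_fields):
            # Field f of record i moves from stride-num_fields to unit stride.
            soa[f * n + i] = aos[i * num_fields + f]
    return soa


print(aos_to_soa([1, 10, 100, 2, 20, 200], num_fields=3))
# [1, 2, 10, 20, 100, 200]
```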

18 8. Granularity Coarsening
Parallel execution often requires redundant work and coordination work
– Merging multiple threads into one allows reuse of results, reducing redundancy
[Figure labels: Essential, Redundant, 4-way parallel, 2-way parallel, Time]
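A small sketch of the redundancy argument. In the fine-grained version, every (i, j) work item recomputes a value that depends only on i; in the coarsened version, one work item handles all j for a given i and computes that value once. The `expensive` function and the call counter are stand-ins invented for this example.

```python
def expensive(i, counter):
    """Stand-in for work shared by all outputs with the same i."""
    counter[0] += 1
    return i * i + 1


def fine_grained(n, m):
    """One work item per output: the shared value is recomputed every time."""
    counter = [0]
    out = [expensive(i, counter) * (j + 1) for i in range(n) for j in range(m)]
    return out, counter[0]


def coarsened(n, m):
    """One work item per i: the shared value is computed once and reused."""
    counter = [0]
    out = []
    for i in range(n):
        shared = expensive(i, counter)
        out.extend(shared * (j + 1) for j in range(m))
    return out, counter[0]


fine_out, fine_calls = fine_grained(3, 4)
coarse_out, coarse_calls = coarsened(3, 4)
print(fine_out == coarse_out, fine_calls, coarse_calls)  # True 12 3
```

The trade-off is less available parallelism: 3 work items instead of 12 in this example.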

19 How much faster do applications really get each hardware generation?

20 Unoptimized Code Has Improved Drastically
Orders of magnitude speedup in many cases
Hardware does not solve all problems
– Coalescing (lbm)
– Highly contentious atomics (bfs)

21 Optimized Code Is Improving Faster than “Peak Performance”
Caches capture locality that scratchpads can’t capture efficiently (spmv, stencil)
Increased local storage capacity enables extra optimization (sad)
Some benchmarks need atomic throughput more than flops (bfs, histo)

22 Optimization Still Matters
Hardware never changes algorithmic complexity (cutcp)
Caches do not solve layout problems for big data (lbm)
Coarsening still makes a big difference (cutcp, sgemm)
Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)

23 Stuff we haven’t covered
There are good tools for profiling code that go beyond timing (cache misses, etc.). If you can’t find why a particular piece of code is taking so long, look into hardware performance counters.
Patterns and practice
– We covered some of the major optimization patterns, but only the basic ones. Many optimization patterns are algorithmic.

24 Fill Out Evaluations!

