Agenda
Project discussion
Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10
Benchmarking guidelines
Regular vs. irregular parallel applications
Last time: Amdahl's law
$$\text{Speedup} = \frac{1}{(1 - F) + \frac{F}{N}}$$
where F is the parallel fraction and N is the number of cores.
Under what assumptions?
Code is infinitely parallelizable
No parallelization overheads
No synchronization
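The formula on this slide is easy to evaluate directly. The following is a minimal sketch, not part of the original slides; the function names `amdahl_speedup` and `amdahl_limit` are made up for illustration.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup over one core when a fraction f of the work runs on n cores."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Serial part takes (1 - f); parallel part is divided evenly over n cores.
    return 1.0 / ((1.0 - f) + f / n)


def amdahl_limit(f: float) -> float:
    """Upper bound on the speedup as n grows without bound: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    for n in (1, 16, 256, 4096):
        print(f"F=0.99, N={n:5d}: speedup = {amdahl_speedup(0.99, n):7.2f}")
    print(f"limit for F=0.99: {amdahl_limit(0.99):.1f}")
```

Even with 99% of the work parallel, the serial 1% caps the speedup near 100, which is why the assumptions listed above matter so much.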
Assuming multiple BCEs (base core equivalents)
Q: How to design a multicore for maximum speedup?
Assumed Perf(R) = $\sqrt{R}$
Two problems
–symmetric / asymmetric multicore chips
–area allocations
Example allocations of 16 BCEs:
(symmetric) Sixteen 1-BCE cores
(symmetric) Four 4-BCE cores
(symmetric) One 16-BCE core
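The three 16-BCE allocations can be compared numerically. The slide does not give the symmetric speedup formula, so the sketch below assumes one: the serial part runs on a single R-BCE core at rate Perf(R), and the parallel part runs on all N/R cores at aggregate rate Perf(R)·N/R. The function names and the choice F = 0.9 are illustrative only.

```python
import math


def perf(r: float) -> float:
    """Performance of a core built from r BCEs, assumed to be sqrt(r)."""
    return math.sqrt(r)


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Speedup of n/r identical r-BCE cores relative to one 1-BCE core.

    Assumed model (not stated on the slide): serial time (1 - f) / perf(r),
    parallel time f / (perf(r) * n / r).
    """
    if not 1 <= r <= n:
        raise ValueError("need 1 <= r <= n")
    serial_time = (1.0 - f) / perf(r)
    parallel_time = f * r / (perf(r) * n)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    n, f = 16, 0.9  # 16 BCEs as on the slide; F = 0.9 is an arbitrary example
    for r in (1, 4, 16):
        cores = n // r
        print(f"{cores:2d} cores of {r:2d} BCEs: speedup = {symmetric_speedup(f, n, r):.2f}")
```

With R = 1 this reduces to plain Amdahl's law, which is a quick sanity check on the assumed formula.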
For Asymmetric Multicore Chips
Serial Fraction 1-F same, so time = (1 – F) / Perf(R)
Parallel Fraction F
–One core at rate Perf(R)
–N-R cores at rate 1
–Parallel time = F / (Perf(R) + N - R)
Therefore, w.r.t. one base core:
$$\text{Asymmetric Speedup} = \frac{1}{\frac{1 - F}{\mathrm{Perf}(R)} + \frac{F}{\mathrm{Perf}(R) + N - R}}$$
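The asymmetric formula can be swept over the big-core size R to see where the speedup peaks. This is a minimal sketch that assumes Perf(R) = sqrt(R), as stated earlier in the slides; the function names and the example value F = 0.975 are illustrative.

```python
import math


def perf(r: float) -> float:
    """Performance of a core built from r BCEs, assumed to be sqrt(r)."""
    return math.sqrt(r)


def asymmetric_speedup(f: float, n: int, r: int) -> float:
    """Speedup of one r-BCE core plus (n - r) 1-BCE cores over one base core."""
    if not 1 <= r <= n:
        raise ValueError("need 1 <= r <= n")
    serial_time = (1.0 - f) / perf(r)        # only the big core runs
    parallel_time = f / (perf(r) + n - r)    # big core plus all small cores
    return 1.0 / (serial_time + parallel_time)


def best_big_core(f: float, n: int) -> tuple:
    """Exhaustively search r in 1..n; return (r, speedup) with the highest speedup."""
    candidates = ((r, asymmetric_speedup(f, n, r)) for r in range(1, n + 1))
    return max(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    n, f = 256, 0.975  # f is an arbitrary example value
    for r in (1, 4, 16, 64, 256):
        total_cores = 1 + n - r
        print(f"R={r:3d} ({total_cores:3d} cores): speedup = {asymmetric_speedup(f, n, r):6.1f}")
    print("best (R, speedup):", best_big_core(f, n))
```

Setting R = 1 gives N identical base cores, so the result should match plain Amdahl's law for N cores.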
[Figure: speedup for 256 BCEs. Annotated design points: 256 cores, 253 cores, 193 cores, 1 core, 241 cores.]
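Amdahl assumptions
Code is infinitely parallelizable
No parallelization overheads
No synchronization
–Add synchronization: critical sections, randomly entered (?!)
$f_{seq} + f_{par} = 1$ becomes $f_{seq} + f_{par,ncs} + f_{par,cs} = 1$
(ncs: parallel code outside critical sections; cs: parallel code inside critical sections)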
Average time in critical sections
The paper also derives an estimate for the maximum time in critical sections.
Execution time components on N cores:
$$T \propto f_{seq} + f_{par,cs} P_{cs} P_{ctn} + \frac{f_{par,cs}\,(1 - P_{cs} P_{ctn})}{N} + \frac{f_{par,ncs}}{N}$$
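The four terms on this slide can be added up to get an average-case time and speedup. The sketch below assumes the terms are summed and that single-core time is normalized to 1. It takes P_cs and P_ctn as given inputs rather than deriving them, and it does not model the paper's separate maximum-time estimate. All function and parameter names are illustrative.

```python
def avg_case_time(f_seq: float, f_par_ncs: float, f_par_cs: float,
                  p_cs: float, p_ctn: float, n: int) -> float:
    """Average-case normalized execution time on n cores (sum of the four terms)."""
    if abs(f_seq + f_par_ncs + f_par_cs - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    contended = p_cs * p_ctn
    serialized_cs = f_par_cs * contended             # critical-section time that serializes
    parallel_cs = f_par_cs * (1.0 - contended) / n   # critical-section time that overlaps
    parallel_ncs = f_par_ncs / n                     # parallel work outside critical sections
    return f_seq + serialized_cs + parallel_cs + parallel_ncs


def avg_case_speedup(f_seq: float, f_par_ncs: float, f_par_cs: float,
                     p_cs: float, p_ctn: float, n: int) -> float:
    """Speedup over one core under the average-case time above."""
    return 1.0 / avg_case_time(f_seq, f_par_ncs, f_par_cs, p_cs, p_ctn, n)


if __name__ == "__main__":
    # Example fractions chosen for illustration: 1% sequential, 20% in critical sections.
    for p_ctn in (0.0, 0.1, 0.5, 1.0):
        s = avg_case_speedup(0.01, 0.79, 0.20, p_cs=0.5, p_ctn=p_ctn, n=256)
        print(f"P_ctn={p_ctn:.1f}: speedup = {s:6.1f}")
```

With P_ctn = 0 the expression collapses to Amdahl's law, and with P_cs·P_ctn = 1 all critical-section time behaves like sequential code.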
Speedup for an asymmetric processor as a function of the big core size (b) and small core size (s) for different contention rates, assuming 256 BCEs. Fraction spent in sequential code 1%.
Design space exploration across symmetric, asymmetric and ACS multicore processors, varying the fraction of the time spent in critical sections and their contention rates. Fraction spent in sequential code equals 1%. ACS = Accelerated critical section.
If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)
David H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, August 1991
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.
Definitions
Regular applications
–key data structures are vectors, dense matrices
–simple access patterns, e.g. array indices are affine functions of for-loop indices
–examples: MMM, Cholesky & LU factorizations, stencil codes, FFT, …
Irregular applications
–key data structures are lists, priority queues, trees, DAGs, graphs, usually implemented using pointers or references
–complex access patterns
–examples: see next slide
Regular application example: Stencil computation
E.g., finite-difference method for solving PDEs
–discrete representation of domain: grid
Values at interior points are updated using values at neighbors
–values at boundary points are fixed
Data structure:
–dense arrays
Parallelism:
–values at next time step can be computed simultaneously
–parallelism is not dependent on runtime values
Compiler can find the parallelism
–spatial loops are DO-ALL loops

//Jacobi iteration with 5-point stencil
//initialize array A
for time = 1, nsteps
  for <i,j> in [2,n-1]x[2,n-1]
    temp(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
  for <i,j> in [2,n-1]x[2,n-1]:
    A(i,j) = temp(i,j)

[Figure: Jacobi iteration, 5-point stencil; arrays A and temp at time steps t_n and t_{n+1}]
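The Jacobi pseudocode on this slide translates almost line for line into runnable code. The following is a minimal pure-Python sketch using 0-based indexing and nested lists; the function names `jacobi_step` and `jacobi` are illustrative. Every interior update within one sweep reads only the old grid, so the updates are independent of each other, which is the DO-ALL property the slide refers to.

```python
def jacobi_step(a):
    """Return a new grid after one 5-point Jacobi sweep; boundary values stay fixed."""
    n_rows, n_cols = len(a), len(a[0])
    new = [row[:] for row in a]  # copy so boundary points carry over unchanged
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            # Each interior point becomes the average of its four neighbors
            # taken from the OLD grid, so iterations do not depend on each other.
            new[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1])
    return new


def jacobi(a, nsteps):
    """Apply nsteps Jacobi sweeps and return the final grid."""
    for _ in range(nsteps):
        a = jacobi_step(a)
    return a


if __name__ == "__main__":
    # 5x5 grid: top boundary held at 1.0, everything else starts at 0.0.
    grid = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
    for row in jacobi(grid, 50):
        print(" ".join(f"{v:5.3f}" for v in row))
```

Returning a fresh grid from each sweep plays the role of the `temp` array in the pseudocode.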
Delaunay Mesh Refinement
Iterative refinement to remove badly shaped triangles:
  while there are bad triangles do {
    Pick a bad triangle;
    Find its cavity;
    Retriangulate cavity; // may create new bad triangles
  }
Don't-care non-determinism:
–final mesh depends on order in which bad triangles are processed
–applications do not care which mesh is produced
Data structure:
–graph in which nodes represent triangles and edges represent triangle adjacencies
Parallelism:
–bad triangles with cavities that do not overlap can be processed in parallel
–parallelism is dependent on runtime values
–compilers cannot find this parallelism
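The point about runtime-dependent parallelism can be illustrated without any geometry. The sketch below is not a Delaunay refinement implementation. It assumes the cavities have already been computed and are given as sets of triangle ids, and it greedily picks bad triangles whose cavities do not overlap. The function name `select_independent` and the triangle ids are hypothetical.

```python
def select_independent(cavities):
    """Greedily pick bad triangles whose cavities share no triangle.

    cavities maps a bad-triangle id to the set of triangle ids in its cavity.
    The chosen triangles could be retriangulated in parallel in one round.
    """
    claimed = set()   # triangles already owned by a chosen cavity
    chosen = []
    for tri, cavity in cavities.items():
        if claimed.isdisjoint(cavity):
            chosen.append(tri)
            claimed |= cavity
        # Otherwise the cavity conflicts with one already chosen; defer it.
    return chosen


if __name__ == "__main__":
    cavities = {
        "t1": {"t1", "t2"},
        "t3": {"t2", "t3"},   # overlaps t1's cavity through t2
        "t5": {"t5", "t6"},
    }
    print(select_independent(cavities))
    # Visiting the same bad triangles in a different order picks a different set.
    reordered = {k: cavities[k] for k in ("t5", "t3", "t1")}
    print(select_independent(reordered))
```

Which triangles end up in the parallel set depends on the cavity contents and on the visiting order, both of which are only known at run time. That is why a compiler cannot find this parallelism statically.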