Processor Level Parallelism 2
How We Got Here: Developments in PC CPUs
Development: Single Core
Development: Single Core with Multithreading – 2002 Pentium 4 / Xeon
Development: Multiprocessor – Multiple processors coexisting in a system – PC space in ~1995
Development: Multicore – Multiple CPUs on one chip – PC space in ~2005
Power Density Prediction circa 2000 – Core 2 – Adapted from UC Berkeley "The Beauty and Joy of Computing"
Going Multi-core Helps Energy Efficiency – William Holt, Hot Chips 2005
Moore's Law Related Curves Adapted from UC Berkeley "The Beauty and Joy of Computing"
Development: Modern Complexity – Many cores – Private / shared cache levels
Homogeneous Multicore i7: homogeneous multicore – effectively 4 CPUs in one chip – separate L2 cache per core, shared L3
Heterogeneous Multicore Different cores for different jobs – Standard CPU – Low Power CPU – Graphics – Video
Coprocessors Coprocessor: assists the main CPU with some part of the work
Coprocessors Graphics Card: specialized for floating point – 100s–1000s of SIMD cores – i7 ~ 100 gigaflops – Kepler GPU ~ 1300 gigaflops
CUDA Compute Unified Device Architecture – Programming model for general purpose work on GPU hardware – Streaming Multiprocessors each with CUDA cores
CUDA Designed for 1000s of threads – threads are grouped into "warps" of 32 – an entire warp runs on an SM in lockstep – branch divergence cuts speed
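The cost of branch divergence can be illustrated with a toy Python model of lockstep execution (a simplification of my own, not actual CUDA semantics): when threads in a warp disagree on a branch, the warp executes each taken path serially, with non-participating threads masked off.

```python
# Toy model of SIMT warp execution (illustrative only, not real CUDA).
# All 32 threads in a warp run in lockstep; on a divergent branch the warp
# executes the 'then' path and the 'else' path one after the other.

WARP_SIZE = 32

def warp_cycles(branch_taken, then_cost, else_cost):
    """Cycles for one warp to execute an if/else, given which threads take the branch."""
    any_then = any(branch_taken)          # at least one thread takes the branch
    any_else = not all(branch_taken)      # at least one thread falls through
    cycles = 0
    if any_then:
        cycles += then_cost               # whole warp steps through 'then' (some masked)
    if any_else:
        cycles += else_cost               # then the whole warp steps through 'else'
    return cycles

uniform  = warp_cycles([True] * WARP_SIZE, 10, 10)                       # no divergence
diverged = warp_cycles([i % 2 == 0 for i in range(WARP_SIZE)], 10, 10)   # half diverge
print(uniform, diverged)   # 10 20
```

With these toy costs, a divergent warp takes twice as long even though each thread only executes one path, which is the speed penalty the slide refers to.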
Other Coprocessors CPUs used to have floating point coprocessors (Intel) – Audio cards – Crypto: SSL encryption for servers
Parallelism & Memory
Multiprocessing & Memory Memory demo…
Memory Access Multiple processes accessing the same memory = interactions – depending on how updates interleave, x may end up increased by 10, by 1, or by 11
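One way to see how +10, +1, or +11 can all happen: a deterministic Python simulation of the interleavings (the writer names A/B and the `run` helper are illustrative, not from the slides). Each update is really a read, a compute, and a write, so a writer can overwrite an update it never saw.

```python
# Deterministic simulation of the lost-update problem (names are illustrative).
# Two writers share x (initially 0): A adds 10, B adds 1.
# Each update is read -> compute -> write, so the interleaving matters.

def run(schedule):
    x = 0
    regs = {}                      # each writer's private "register"
    add = {"A": 10, "B": 1}
    for writer, step in schedule:
        if step == "read":
            regs[writer] = x       # load the shared value
        else:                      # "write": store the computed value back
            x = regs[writer] + add[writer]
    return x

both   = run([("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")])
lost_b = run([("A", "read"), ("B", "read"), ("B", "write"), ("A", "write")])
lost_a = run([("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")])
print(both, lost_b, lost_a)        # 11 10 1
```

Only the first schedule behaves as intended; in the other two, one writer's read happened before the other's write, so its write silently discards an update.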
UMA Uniform Memory Access – Every processor sees all of memory using the same addresses – Same access time for any CPU to any memory word
NUMA Non-Uniform Memory Access – Single memory address space visible to all CPUs – Some memory is local (fast) – Some memory is remote (accessed the same way, but slower)
NUMA & Cache Memory problems are compounded by caching – one core can hold a cached copy of X = 10 while another updates X to 15, leaving the cached copy stale
Cache Coherence Cores need to "snoop" other cores' reads – Cores need to broadcast their writes
MESI MESI: Cache Coherence Protocol – Modified: I have this cached and I have changed it – Exclusive: I have this cached and unmodified, and I am the only one with it – Shared: I and another core both have this cached – Invalid: I do not have this cached
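The four states and the transitions the next slides walk through can be captured as a small lookup table. This is a simplified sketch (writebacks and bus arbitration omitted; the event names are mine, not standard bus signals):

```python
# Simplified MESI transition table for one cache line in one cache.
# "own_*" = this core's action; "other_*" = a snooped action by another core.

MESI = {
    ("M", "own_read"):       "M",
    ("M", "own_write"):      "M",
    ("M", "other_read"):     "S",  # must write the modified line back first
    ("M", "other_write"):    "I",  # write back, then invalidate
    ("E", "own_read"):       "E",
    ("E", "own_write"):      "M",  # silent upgrade: no one else has the line
    ("E", "other_read"):     "S",
    ("E", "other_write"):    "I",
    ("S", "own_read"):       "S",
    ("S", "own_write"):      "M",  # broadcast an invalidate to other sharers
    ("S", "other_read"):     "S",
    ("S", "other_write"):    "I",
    ("I", "read_shared"):    "S",  # read miss; another cache holds the line
    ("I", "read_exclusive"): "E",  # read miss; no other cache holds it
    ("I", "own_write"):      "M",
    ("I", "other_read"):     "I",
    ("I", "other_write"):    "I",
}

def walk(state, events):
    """Apply a sequence of events to a line's state."""
    for e in events:
        state = MESI[(state, e)]
    return state

# A core loads a line alone, writes it, then another core writes it:
print(walk("I", ["read_exclusive", "own_write", "other_write"]))  # -> I
```

Tracing a sequence through the table is a quick way to check the "state change" examples on the following slides.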
State Change Changes based on OWN actions – e.g., a read fulfilled by another cache
State Change Changes based on OTHERS' actions – I have the only modified copy of this… write it out to memory and make the other core wait
State Change Sample CPU 2 broadcasts write message… CPU 1 invalidates
State Change Sample CPU 2 snoops a read… has to write the modified value to memory – CPU 2 snoops a write… has to write the modified value to memory
Parallelism Bad News
Parallel Speedup In Theory: N cores = N times speedup
Issues Not every part of a problem scales well – Parallel: can run at the same time – Serial: must run one at a time, in order
Amdahl’s Law In Practice: Amdahl's law applied to N processors on a task where P is the parallel portion: Speedup = 1 / ((1 − P) + P / N)
Amdahl’s Law 60% of a job can be made parallel. We use 2 processors: Speedup = 1 / ((1 − 0.6) + 0.6 / 2) = 1 / 0.7 ≈ 1.43x faster with 2 than 1
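The slide's arithmetic is easy to check in a couple of lines of Python, using Amdahl's formula Speedup = 1 / ((1 − P) + P/N) for parallel fraction P on N processors:

```python
# Amdahl's law: speedup on N processors when fraction P of the work is parallel.
def amdahl_speedup(P, N):
    return 1.0 / ((1.0 - P) + P / N)

print(round(amdahl_speedup(0.6, 2), 2))     # the slide's example: 1.43
print(round(amdahl_speedup(0.6, 1000), 2))  # even 1000 cores: under 2.5x
```

The second line previews the next slides: with 40% serial work, no number of cores can push the speedup past 1 / 0.4 = 2.5x.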
Speedup Issues: Amdahl’s Law Applications can almost never be completely parallelized; some serial code remains [chart: time vs. number of cores, split into parallel and serial portions]
Speedup Issues: Amdahl’s Law The serial portion becomes the limiting factor [chart: time vs. number of cores]
Ouch More processors only help when a high percentage of the code is parallelized
Amdahl's Law is Optimistic Each new processor means more – Load balancing – Scheduling – Communication – Etc…
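One way to make the "optimistic" point concrete is to add a per-processor coordination cost to the model. This extension is an illustration of my own, not a standard law, and the constants are arbitrary toy values:

```python
# Illustrative extension of Amdahl's law (my assumption, not a standard result):
# add a coordination cost c per processor, so extra cores eventually hurt.
def speedup_with_overhead(P, N, c):
    return 1.0 / ((1.0 - P) + P / N + c * N)

# With 90% parallel work and a small per-core overhead, find the best core count:
best = max(range(1, 65), key=lambda n: speedup_with_overhead(0.9, n, 0.002))
print(best)   # 21 -- the optimum is well below 64 cores
```

Under plain Amdahl's law more cores never hurt; once overhead grows with N, there is a finite optimum, which matches the slide's warning about load balancing, scheduling, and communication.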
Parallel Algorithms Some problems are highly parallel, others are not:
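A minimal sketch of the contrast (my own example, not from the slides): two loops with similar arithmetic, where one has independent iterations and the other has a serial dependency chain.

```python
# Two problems with similar arithmetic but different parallel structure.

data = list(range(8))

# Highly parallel: each result is independent, so chunks of the list
# could be handed to different cores in any order.
squares = [v * v for v in data]

# Inherently serial: each iteration consumes the previous iteration's x,
# so the steps must run one at a time, in order.
x = 0
chain = []
for v in data:
    x = x * 2 + v
    chain.append(x)

print(squares[-1], chain[-1])   # 49 247
```

The first loop is the "highly parallel" case; the second is limited by its dependency chain no matter how many cores are available.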