Multiprocessing
"Going Multi-core Helps Energy Efficiency" – William Holt, Hot Chips 2005. Adapted from UC Berkeley's "The Beauty and Joy of Computing".
Processor Parallelism Processor parallelism: the ability to run multiple instruction streams simultaneously
Flynn's Taxonomy Categorization of architectures based on – Number of simultaneous instructions – Number of simultaneous data items
SISD SISD : Single Instruction – Single Data – One instruction sent to one processing unit to work on one piece of data – May be pipelined or superscalar
SIMD Roots ILLIAC IV – One instruction issued to 64 processing units
SIMD Roots Cray I – Vector processor – One instruction applied to all elements of vector register
Modern SIMD x86 Processors – SSE units: Streaming SIMD Extensions – Operate on special 128-bit registers: 4 × 32-bit chunks, 2 × 64-bit chunks, 16 × 8-bit chunks, …
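The slides stop at the register layout; as a minimal host-side sketch of what "one instruction on 4 × 32-bit chunks" looks like in practice (array names and contents are made up for illustration), here is an SSE add using compiler intrinsics:

    #include <xmmintrin.h>   // SSE intrinsics (_mm_* operations on 128-bit __m128 registers)
    #include <cstdio>

    int main() {
        // Two small arrays of floats, 16-byte aligned so they can be loaded
        // directly into 128-bit XMM registers.
        alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        alignas(16) float c[4];

        __m128 va = _mm_load_ps(a);      // load 4 x 32-bit floats into one register
        __m128 vb = _mm_load_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  // ONE instruction adds all 4 lanes at once
        _mm_store_ps(c, vc);

        printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);  // 11 22 33 44
        return 0;
    }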
Modern SIMD Graphics cards (e.g., NVIDIA's Fermi architecture) – Becoming less and less "S"
CUDA Compute Unified Device Architecture – Programming model for general purpose work on GPU hardware – Streaming Multiprocessors each with CUDA cores
CUDA Designed for thousands of threads – Threads grouped into "warps" of 32 – An entire warp runs on an SM in lockstep – Branch divergence cuts speed
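None of this code is from the slides; the kernel below is a minimal CUDA sketch of the branch-divergence point: within a 32-thread warp, even and odd threads take different paths, so the hardware serializes the two paths (masking off inactive threads) and the warp runs at roughly half speed.

    #include <cuda_runtime.h>

    // Each thread takes one of two branches depending on its index.
    // Threads in the same warp (32 consecutive threads) that disagree on the
    // branch force the hardware to run both paths one after the other.
    __global__ void divergent(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            out[i] = i * 2.0f;      // path A: even threads
        else
            out[i] = i * 0.5f;      // path B: odd threads, serialized after A
    }

    int main() {
        const int n = 1 << 20;
        float *out;
        cudaMalloc((void**)&out, n * sizeof(float));
        divergent<<<(n + 255) / 256, 256>>>(out, n);   // thousands of threads, 256 per block
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }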
MISD MISD: Multiple Instruction – Single Data – Different instructions calculated on the same data – Rare – Space Shuttle: five processors handle fly-by-wire input and vote
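As a toy sketch of the voting idea only (nothing here is from the actual Shuttle flight software), several redundant channels compute on the same input and a majority vote masks a faulty one:

    #include <cstdio>
    #include <map>

    // Three redundant "channels" compute an answer from the same input.
    // In a real MISD-style system these would be independent processors.
    static int channelA(int x) { return x * x; }
    static int channelB(int x) { return x * x; }
    static int channelC(int x) { return x * x + 1; }   // a faulty channel

    // Majority vote over the channel outputs.
    static int vote(int a, int b, int c) {
        std::map<int, int> counts;
        ++counts[a]; ++counts[b]; ++counts[c];
        int best = a, bestCount = 0;
        for (auto &kv : counts)
            if (kv.second > bestCount) { best = kv.first; bestCount = kv.second; }
        return best;
    }

    int main() {
        int input = 7;
        int result = vote(channelA(input), channelB(input), channelC(input));
        printf("voted result = %d\n", result);   // 49: the faulty channel is outvoted
        return 0;
    }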
MIMD MIMD: Multiple Instruction – Multiple Data – Different instructions working on different data in different processing units – The most common form of parallel machine
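A minimal host-side sketch of the MIMD model (the two tasks are invented for illustration): two threads execute different instruction streams on different data at the same time.

    #include <cstdio>
    #include <thread>
    #include <vector>
    #include <numeric>
    #include <algorithm>

    int main() {
        std::vector<int> prices = {5, 3, 9, 1};
        std::vector<int> scores = {70, 85, 60};
        long total = 0;
        int best = 0;

        // Different instructions, different data, running simultaneously.
        std::thread t1([&] { total = std::accumulate(prices.begin(), prices.end(), 0L); });
        std::thread t2([&] { best = *std::max_element(scores.begin(), scores.end()); });
        t1.join();
        t2.join();

        printf("sum of prices = %ld, best score = %d\n", total, best);
        return 0;
    }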
Coprocessors Coprocessor : Assists main CPU with some part of work
Coprocessors Graphics processing: floating-point specialized – i7 ~ 100 gigaflops – Kepler GPU ~ 1300 gigaflops
Other Coprocessors CPUs used to have floating-point coprocessors (e.g., Intel's 8087) – Audio cards – PhysX physics accelerators – Crypto: SSL encryption offload for servers
Multiprocessing Multiprocessing : Many processors, shared memory – May have local cache/special memory
Homogeneous Multicore i7: homogeneous multicore – 4 identical cores on one chip – Separate L2 caches, shared L3
Heterogeneous Multicore Different cores for different jobs – Specialized media processing in mobile devices – Examples: NVIDIA Tegra, PS3 Cell
Multiprocessing & Memory Memory conflict demo…
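The demo itself isn't reproduced here; the sketch below shows the kind of conflict it presumably illustrates: two threads increment a shared counter with no synchronization, updates are lost, and the final count usually falls well short of 2,000,000 (a mutex or std::atomic<long> fixes it).

    #include <cstdio>
    #include <thread>

    long counter = 0;   // shared and unsynchronized: this is the bug

    void bump() {
        // Each ++ is really load / add / store; two threads can interleave
        // these steps and overwrite each other's updates.
        for (int i = 0; i < 1000000; ++i)
            ++counter;
    }

    int main() {
        std::thread t1(bump), t2(bump);
        t1.join();
        t2.join();
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }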
UMA Uniform Memory Access – Every processor sees all of memory through the same addresses – Same access time from any CPU to any memory word
NUMA Non-Uniform Memory Access – Single memory address space visible to all CPUs – Some memory is local: fast – Some memory is remote: accessed the same way, but slower
Connections Bus : One communication channel – Scales poorly
Connections Crossbar switch – Segmented memory – Any processor can directly link to any memory bank – Requires N² switches
Connections Other topologies – Balance complexity, flexibility and latency
BlueGene Major supercomputer player
BG/P Compute Cards – 4 processors per card – Fully coherent caches – Connected in a 3D torus to neighboring nodes
BG/P Full system : 72 x 32 x 32 torus of nodes
Titan The king: descendant of Red Storm
Distributed Systems No common memory space – Messages passed between processors
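The slides don't name an API; MPI is the standard message-passing interface for this model, so here is a minimal, illustrative sketch in which rank 0 sends an integer to rank 1 and no memory is shared between the two processes.

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?

        if (rank == 0) {
            int value = 42;                     // lives only in rank 0's memory
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value = 0;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }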
COW Cluster of Workstations
Grid Computing – Multicomputing at Internet scale – Resources owned by multiple parties
Parallel Algorithms Some problems are highly parallel, others are not
Speedup Issues: Amdahl's Law – Applications can almost never be completely parallelized; some serial code remains – Speedup is always limited by the serial part of the program – [Figure: execution time vs. number of cores, split into a parallel portion and a serial portion]
Speedup Issues: Amdahl's Law – Speedup = 1 / (s + (1 - s) / P), where s is the serial fraction of the program and P is the number of processors
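Plugging numbers into the formula makes the next slide's "ouch" concrete; the small calculation below (the serial fractions chosen are just examples) shows that with s = 0.1 the speedup never exceeds 10x no matter how many processors are added.

    #include <cstdio>

    // Amdahl's law: speedup(P) = 1 / (s + (1 - s) / P),
    // where s is the serial fraction and P is the number of processors.
    double speedup(double s, int P) {
        return 1.0 / (s + (1.0 - s) / P);
    }

    int main() {
        const double serialFractions[] = {0.5, 0.1, 0.01};
        const int processors[] = {1, 2, 4, 16, 256, 65536};

        for (double s : serialFractions) {
            printf("s = %.2f:", s);
            for (int P : processors)
                printf("  P=%-6d -> %6.1fx", P, speedup(s, P));
            printf("   (cap: %.0fx as P grows without bound)\n", 1.0 / s);
        }
        return 0;
    }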
Ouch More processors only help when a large fraction of the code is parallelized
Amdahl's Law is Optimistic Each new processor means more overhead – Load balancing – Scheduling – Communication – Etc.