Multiprocessing
Going Multi-core Helps Energy Efficiency
William Holt, Hot Chips 2005. Adapted from UC Berkeley's "The Beauty and Joy of Computing".
Processor Parallelism
Processor parallelism: the ability to run multiple instruction streams simultaneously.
Flynn's Taxonomy
Categorization of architectures based on:
– Number of simultaneous instruction streams
– Number of simultaneous data items
SISD
SISD: Single Instruction – Single Data
– One instruction sent to one processing unit to work on one piece of data
– May be pipelined or superscalar
SIMD Roots
ILLIAC IV
– One instruction issued to 64 processing units
SIMD Roots
Cray-1
– Vector processor
– One instruction applied to all elements of a vector register
Modern SIMD
x86 processors
– SSE units: Streaming SIMD Extensions
– Operate on special 128-bit registers:
  4 × 32-bit chunks
  2 × 64-bit chunks
  16 × 8-bit chunks
  …
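As a concrete illustration (a minimal host-side C sketch of our own, not from the slides), a single SSE instruction adds four packed 32-bit floats at once:

    // One SSE add works on a whole 128-bit register: four floats per instruction.
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);    // four floats in one register
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);                      // one instruction, four additions

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 11 22 33 44
        return 0;
    }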
Modern SIMD
Graphics cards
http://www.nvidia.com/object/fermi-architecture.html
– Becoming less and less "S"
Coprocessors
Graphics processing: floating-point specialized
– i7: ~100 gigaflops
– Kepler GPU: ~1300 gigaflops
CUDA
Compute Unified Device Architecture
– Programming model for general-purpose work on GPU hardware
– Streaming Multiprocessors (SMs), each with 16-48 CUDA cores
CUDA
Designed for 1000s of threads
– Broken into "warps" of 32 threads
– An entire warp runs on an SM in lock step
– Branch divergence cuts speed (see the sketch below)
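A minimal CUDA sketch (our own illustration; the kernel names are hypothetical) of why divergence hurts: when threads of one warp take different branches, the hardware runs the two paths one after the other.

    // Threads in the same 32-thread warp take different branches here,
    // so the warp executes both paths serially.
    __global__ void divergent(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            out[i] *= 2.0f;   // half the warp idles while this half runs...
        else
            out[i] += 1.0f;   // ...then the roles swap
    }

    // Branching on the warp index instead keeps all 32 lanes of a warp
    // on the same path, so nothing is serialized.
    __global__ void uniform(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((i / 32) % 2 == 0)
            out[i] *= 2.0f;
        else
            out[i] += 1.0f;
    }

Both kernels would be launched identically, e.g. divergent<<<(n + 255) / 256, 256>>>(d_out, n); only the branch pattern differs.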
MISD
MISD: Multiple Instruction – Single Data
– Different instructions applied to the same data
– Rare
– Space shuttle: five processors handle fly-by-wire input and vote on the result
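The voting step might look like the following host-side C sketch (our own illustration, not the shuttle's actual code): a majority vote over redundant results masks one faulty unit.

    // Hypothetical majority vote over five redundant results.
    #include <stdio.h>

    static int vote(const int r[5]) {
        for (int i = 0; i < 5; ++i) {
            int agree = 0;
            for (int j = 0; j < 5; ++j)
                if (r[j] == r[i]) agree++;
            if (agree >= 3) return r[i];    // majority wins
        }
        return r[0];    // no majority: real systems signal a fault here
    }

    int main(void) {
        int results[5] = {7, 7, 7, 9, 7};   // one faulty processor returned 9
        printf("voted result: %d\n", vote(results));  // prints 7
        return 0;
    }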
MIMD
MIMD: Multiple Instruction – Multiple Data
– Different instructions working on different data in different processing units
– The most common form of parallel architecture
Coprocessors
Coprocessor: assists the main CPU with some part of the work
Other Coprocessors
CPUs used to have floating-point coprocessors
– Intel 80386 & 80387
Audio cards
PhysX
Crypto
– SSL encryption for servers
Multiprocessing
Multiprocessing: many processors, shared memory
– May have local cache/special memory
Homogeneous Multicore
i7: homogeneous multicore
– Effectively four processors on one chip
– Separate L2 caches, shared L3
Heterogeneous Multicore
Different cores for different jobs
– Specialized media processing in mobile devices
Examples:
– Tegra
– PS3 Cell
Multiprocessing & Memory
Memory conflict demo… (see the sketch below)
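The slides don't include the demo itself; here is a minimal host-side C sketch of our own showing the classic conflict: two threads update shared memory without synchronization and lose updates.

    // Two threads increment a shared counter with no synchronization.
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;              // shared, unprotected

    static void *bump(void *arg) {
        for (int i = 0; i < 1000000; ++i)
            counter++;                    // unsynchronized read-modify-write
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, bump, NULL);
        pthread_create(&b, NULL, bump, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        // Expected 2000000; the race usually loses updates, so it prints less.
        printf("counter = %ld\n", counter);
        return 0;
    }

Protecting the increment with a mutex, or making the counter atomic, restores the expected total at the cost of serializing that access.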
UMA
Uniform Memory Access
– Every processor sees all of memory using the same addresses
– Same access time from any CPU to any memory word
NUMA
Non-Uniform Memory Access
– Single memory address space visible to all CPUs
– Some memory is local: fast
– Some memory is remote: accessed the same way, but slower
Connections
Bus: one communication channel
– Scales poorly
Connections
Crossbar switch
– Segmented memory
– Any processor can directly link to any memory
– N² switches (e.g., 16 CPUs and 16 memory banks need 256 crosspoints)
Connections
Other topologies
– Balance complexity, flexibility, and latency
BlueGene
Major supercomputer player
http://s.top500.org/static/lists/2012/11/TOP500_201211_Poster.png
BG/P Compute Cards
– 4 processors per card
– Fully coherent caches
– Connected in a double torus to neighbors
BG/P
Full system: 72 × 32 × 32 torus of nodes
Titan
The king: descendant of Red Storm
– http://www.olcf.ornl.gov/titan/
Distributed Systems
– No common memory space
– Pass messages between processors (see the sketch below)
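The slides don't name a message-passing library; as an assumption, here is a minimal MPI sketch in C of the idea: rank 0 sends a value, rank 1 receives it, and no memory is shared.

    // Minimal point-to-point message passing with MPI.
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // to rank 1
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }
        MPI_Finalize();
        return 0;
    }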
COW
Cluster of Workstations
Grid Computing
– Multicomputing at internet scale
– Resources owned by multiple parties
Examples: http://folding.stanford.edu/, SETI@home
Parallel Algorithms
Some problems are highly parallel, others are not:
Speedup Issues: Amdahl's Law
– Applications can almost never be completely parallelized; some serial code remains
– Speedup is always limited by the serial part of the program
[Figure: execution time vs. number of cores, divided into a shrinking parallel portion and a fixed serial portion]
Speedup Issues: Amdahl's Law
[Figure: the same time-vs-cores chart across 1-5 cores]
Amdahl's law:
    Speedup(P) = 1 / (s + (1 − s) / P)
– s is the serial fraction of the program, P is the number of processors
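A quick way to see the law's bite (a small host-side C sketch of our own, using the s and P defined above): even with only 10% serial code, speedup can never pass 10×.

    // Evaluate Amdahl's law for a few core counts.
    #include <stdio.h>

    static double amdahl(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        int cores[] = {1, 2, 4, 8, 16, 1024};
        for (int i = 0; i < 6; ++i)
            // With s = 0.10, speedup approaches 10x and never exceeds it.
            printf("P = %4d -> speedup = %.2f\n", cores[i], amdahl(0.10, cores[i]));
        return 0;
    }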
Ouch
More processors only help when a high percentage of the code is parallelized; as P grows without bound, speedup approaches 1/s.
Amdahl's Law is Optimistic
Each new processor also means more overhead:
– Load balancing
– Scheduling
– Communication
– Etc.