Multiprocessing
Going Multi-core Helps Energy Efficiency
William Holt, Hot Chips 2005. Adapted from UC Berkeley's "The Beauty and Joy of Computing".
Processor Parallelism
Processor parallelism: the ability to run multiple instruction streams simultaneously.
Flynn's Taxonomy
Categorization of architectures based on:
– Number of simultaneous instruction streams
– Number of simultaneous data items
SISD
SISD: Single Instruction – Single Data
– One instruction sent to one processing unit to work on one piece of data
– May be pipelined or superscalar
SIMD Roots
ILLIAC IV
– One instruction issued to 64 processing units
SIMD Roots
Cray-1
– Vector processor
– One instruction applied to all elements of a vector register
Modern SIMD
x86 processors
– SSE units: Streaming SIMD Extensions
– Operate on special 128-bit registers:
  4 × 32-bit chunks
  2 × 64-bit chunks
  16 × 8-bit chunks
  …
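As a concrete illustration (a minimal host-side C sketch of our own, not from the slides), a single SSE instruction adds four packed 32-bit floats at once:

    // One SSE add works on a whole 128-bit register: four floats per instruction.
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);    // four floats in one register
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);                      // one instruction, four additions

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 11 22 33 44
        return 0;
    }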
Modern SIMD
Graphics cards
http://www.nvidia.com/object/fermi-architecture.html
– Becoming less and less "S"
Coprocessors
Graphics processing: floating-point specialized
– i7: ~100 gigaflops
– Kepler GPU: ~1300 gigaflops
CUDA
Compute Unified Device Architecture
– Programming model for general-purpose work on GPU hardware
– Streaming Multiprocessors (SMs), each with 16-48 CUDA cores
CUDA
Designed for 1000s of threads
– Broken into "warps" of 32 threads
– An entire warp runs on an SM in lock step
– Branch divergence cuts speed (see the sketch below)
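A minimal CUDA sketch (our own illustration; the kernel names are hypothetical) of why divergence hurts: when threads of one warp take different branches, the hardware runs the two paths one after the other.

    // Threads in the same 32-thread warp take different branches here,
    // so the warp executes both paths serially.
    __global__ void divergent(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            out[i] *= 2.0f;   // half the warp idles while this half runs...
        else
            out[i] += 1.0f;   // ...then the roles swap
    }

    // Branching on the warp index instead keeps all 32 lanes of a warp
    // on the same path, so nothing is serialized.
    __global__ void uniform(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((i / 32) % 2 == 0)
            out[i] *= 2.0f;
        else
            out[i] += 1.0f;
    }

Both kernels would be launched identically, e.g. divergent<<<(n + 255) / 256, 256>>>(d_out, n); only the branch pattern differs.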
MISD
MISD: Multiple Instruction – Single Data
– Different instructions applied to the same data
– Rare
– Space shuttle: five processors handle fly-by-wire input and vote on the result
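The voting step might look like the following host-side C sketch (our own illustration, not the shuttle's actual code): a majority vote over redundant results masks one faulty unit.

    // Hypothetical majority vote over five redundant results.
    #include <stdio.h>

    static int vote(const int r[5]) {
        for (int i = 0; i < 5; ++i) {
            int agree = 0;
            for (int j = 0; j < 5; ++j)
                if (r[j] == r[i]) agree++;
            if (agree >= 3) return r[i];    // majority wins
        }
        return r[0];    // no majority: real systems signal a fault here
    }

    int main(void) {
        int results[5] = {7, 7, 7, 9, 7};   // one faulty processor returned 9
        printf("voted result: %d\n", vote(results));  // prints 7
        return 0;
    }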
MIMD
MIMD: Multiple Instruction – Multiple Data
– Different instructions working on different data in different processing units
– The most common form of parallel architecture
Coprocessors
Coprocessor: assists the main CPU with some part of the work
Other Coprocessors
CPUs used to have floating-point coprocessors
– Intel 80386 & 80387
Audio cards
PhysX
Crypto
– SSL encryption for servers
Multiprocessing
Multiprocessing: many processors, shared memory
– May have local cache/special memory
Homogeneous Multicore
i7: homogeneous multicore
– Effectively four processors on one chip
– Separate L2 caches, shared L3
Heterogeneous Multicore
Different cores for different jobs
– Specialized media processing in mobile devices
Examples:
– Tegra
– PS3 Cell
Multiprocessing & Memory
Memory conflict demo… (see the sketch below)
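The slides don't include the demo itself; here is a minimal host-side C sketch of our own showing the classic conflict: two threads update shared memory without synchronization and lose updates.

    // Two threads increment a shared counter with no synchronization.
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;              // shared, unprotected

    static void *bump(void *arg) {
        for (int i = 0; i < 1000000; ++i)
            counter++;                    // unsynchronized read-modify-write
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, bump, NULL);
        pthread_create(&b, NULL, bump, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        // Expected 2000000; the race usually loses updates, so it prints less.
        printf("counter = %ld\n", counter);
        return 0;
    }

Protecting the increment with a mutex, or making the counter atomic, restores the expected total at the cost of serializing that access.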
UMA
Uniform Memory Access
– Every processor sees all of memory using the same addresses
– Same access time from any CPU to any memory word
NUMA
Non-Uniform Memory Access
– Single memory address space visible to all CPUs
– Some memory is local: fast
– Some memory is remote: accessed the same way, but slower
Connections
Bus: one communication channel
– Scales poorly
Connections
Crossbar switch
– Segmented memory
– Any processor can directly link to any memory
– N² switches (e.g., 16 CPUs and 16 memory banks need 256 crosspoints)
Connections
Other topologies
– Balance complexity, flexibility, and latency
BlueGene
Major supercomputer player
http://s.top500.org/static/lists/2012/11/TOP500_201211_Poster.png
BG/P Compute Cards
– 4 processors per card
– Fully coherent caches
– Connected in a double torus to neighbors
BG/P
Full system: 72 × 32 × 32 torus of nodes
Titan
The king: descendant of Red Storm
– http://www.olcf.ornl.gov/titan/
Distributed Systems
– No common memory space
– Pass messages between processors (see the sketch below)
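The slides don't name a message-passing library; as an assumption, here is a minimal MPI sketch in C of the idea: rank 0 sends a value, rank 1 receives it, and no memory is shared.

    // Minimal point-to-point message passing with MPI.
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            int msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // to rank 1
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }
        MPI_Finalize();
        return 0;
    }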
COW
Cluster of Workstations
Grid Computing
– Multicomputing at internet scale
– Resources owned by multiple parties
Examples: http://folding.stanford.edu/, SETI@home
Parallel Algorithms
Some problems are highly parallel, others are not:
Speedup Issues: Amdahl's Law
– Applications can almost never be completely parallelized; some serial code remains
– Speedup is always limited by the serial part of the program
[Figure: execution time vs. number of cores, divided into a shrinking parallel portion and a fixed serial portion]
Speedup Issues: Amdahl's Law
[Figure: the same time-vs-cores chart across 1-5 cores]
Amdahl's law:
    Speedup(P) = 1 / (s + (1 − s) / P)
– s is the serial fraction of the program, P is the number of processors
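A quick way to see the law's bite (a small host-side C sketch of our own, using the s and P defined above): even with only 10% serial code, speedup can never pass 10×.

    // Evaluate Amdahl's law for a few core counts.
    #include <stdio.h>

    static double amdahl(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        int cores[] = {1, 2, 4, 8, 16, 1024};
        for (int i = 0; i < 6; ++i)
            // With s = 0.10, speedup approaches 10x and never exceeds it.
            printf("P = %4d -> speedup = %.2f\n", cores[i], amdahl(0.10, cores[i]));
        return 0;
    }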
Ouch
More processors only help when a high percentage of the code is parallelized; as P grows without bound, speedup approaches 1/s.
Amdahl's Law is Optimistic
Each new processor also means more overhead:
– Load balancing
– Scheduling
– Communication
– Etc.