Processor Level Parallelism 1
Parallelism Levels Levels at which we can attack parallelism:
Bit Level Parallelism Circuits process bits in parallel
Instruction Level Parallelism Organization level may process instructions in parallel
Higher levels Thread Level : Ability to run multiple simultaneous streams of instructions Task Level : Ability to run parts of a program on different chips Application Level : Run separate jobs on different machines
Process vs Thread Process : Program Own memory space Has at least one thread
Multi Tasking Multitasking Done on single cores running multiple programs OS handles switch "Large" chunks of time Flush cache on switch
Process vs Thread Thread : Instruction sequence Own registers/stack Share memory with other threads in process
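A minimal sketch of the thread side of this comparison, using Python's standard `threading` module. The names `worker`, `run`, and `counter` are made up for illustration. Each thread keeps its own local variable on its own stack, while all threads update one counter in the process's shared memory.

```python
import threading

counter = 0                  # shared: lives in the process's memory space
lock = threading.Lock()      # guards the shared counter


def worker(n_increments, results, index):
    """Count privately on this thread's stack and in the shared counter."""
    global counter
    local_count = 0          # private: each thread has its own copy
    for _ in range(n_increments):
        local_count += 1
        with lock:           # counter += 1 is a read-modify-write, so guard it
            counter += 1
    results[index] = local_count


def run(num_threads=4, n_increments=1000):
    """Start the threads, wait for them, return (shared total, per-thread counts)."""
    global counter
    counter = 0
    results = [0] * num_threads
    threads = [
        threading.Thread(target=worker, args=(n_increments, results, i))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, results


if __name__ == "__main__":
    print(run())
```

Every thread sees the same `counter`, which is why the lock is needed; separate processes would each get their own copy instead.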
Threaded Code Demo…
Resource Usage Four threads running in 4-wide pipeline Can't always fill all 4 issue slots Have bubbles from memory access, page faults, etc… Issue Slots
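One way to put a number on the unfilled issue slots is slot utilization: ops issued divided by slots available. The sketch below assumes a 4-wide pipeline as on the slide; the per-cycle issue counts are invented for illustration, not read off the figure.

```python
def slot_utilization(issued_per_cycle, width=4):
    """Fraction of issue slots actually used over the given cycles."""
    if not issued_per_cycle:
        raise ValueError("need at least one cycle")
    if any(n < 0 or n > width for n in issued_per_cycle):
        raise ValueError("a cycle cannot issue fewer than 0 or more than `width` ops")
    total_slots = width * len(issued_per_cycle)
    return sum(issued_per_cycle) / total_slots


if __name__ == "__main__":
    # Hypothetical trace: a full cycle, a partial one, a bubble, then recovery.
    print(slot_utilization([4, 2, 0, 1, 3]))
```

Every cycle that issues fewer than 4 ops lowers the ratio; a full-cycle bubble contributes 4 wasted slots.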
Multithreading Multithreading Alternate or combine threads to maximize use of processor Finer timescale Maintain cache Hardware required Multiple register sets Track "owner" of pipeline instructions
Multithreading Coarse Grained Multithreading Threads run for a number of cycles Must drain pipeline before switch
Multithreading Single Pipeline Coarse Grained Assumption: 1 cycle to retire after stall Threads to run Single Pipeline Time
Multithreading Dual Pipeline Coarse Grained Assumption: 1 cycle to retire after stall Threads to run Dual Pipeline Time
Latency vs Throughput Multithreading favors throughput over latency
Multithreading Fine Grained Multithreading Hardware can switch to a new thread each cycle without draining pipeline
Multithreading Single Pipeline Fine Grained Assumption: Switches every cycle Threads to run Single Pipeline Time
Multithreading Dual Pipeline Fine Grained Assumption: Switches every cycle Threads to run Dual Pipeline Time
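The coarse and fine grained policies can be compared with a toy single-pipeline model. Everything here is an assumption for illustration, not the slide figures' data: each thread is a string of ops, `C` is a 1-cycle op, `M` is a memory op that stalls its thread for `stall` extra cycles, and a coarse grained switch costs `switch_penalty` bubble cycles to stand in for draining the pipeline.

```python
def simulate(threads, policy, stall=2, switch_penalty=1):
    """Return a timeline: the thread index issued each cycle, or None for a bubble.

    threads: list of strings of 'C' (compute) and 'M' (memory, then stall).
    policy:  "coarse" keeps the current thread until it cannot run;
             "fine" rotates to the next ready thread every cycle.
    """
    if policy not in ("coarse", "fine"):
        raise ValueError("policy must be 'coarse' or 'fine'")
    n = len(threads)
    pc = [0] * n            # next op index for each thread
    ready_at = [0] * n      # first cycle at which each thread may issue again
    current = 0             # thread that has priority this cycle
    timeline = []           # len(timeline) is the current cycle number

    def runnable(i):
        return pc[i] < len(threads[i]) and ready_at[i] <= len(timeline)

    while any(pc[i] < len(threads[i]) for i in range(n)):
        order = [(current + k) % n for k in range(n)]
        chosen = next((i for i in order if runnable(i)), None)
        if chosen is None:
            timeline.append(None)                       # everyone is stalled
            continue
        if policy == "coarse" and chosen != current:
            timeline.extend([None] * switch_penalty)    # drain before switching
        cycle = len(timeline)
        op = threads[chosen][pc[chosen]]
        pc[chosen] += 1
        timeline.append(chosen)
        if op == "M":
            ready_at[chosen] = cycle + 1 + stall
        current = (chosen + 1) % n if policy == "fine" else chosen
    return timeline


if __name__ == "__main__":
    work = ["CMC", "CMC"]   # two hypothetical threads
    for policy in ("coarse", "fine"):
        t = simulate(work, policy)
        print(policy, len(t), t)
```

With these two hypothetical threads the model takes 9 cycles coarse grained and 7 fine grained, against 5 cycles for one thread alone. Each thread finishes later than it would by itself, but both finish sooner than the 10 cycles of running them back to back, which is the throughput-over-latency trade.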
SMT SMT : Simultaneous Multithreading AKA Hyper-Threading (Intel's name for it) Issue ops from multiple threads in one cycle Time
Multithreading SMT Try to start next thread early if spare pipeline Threads to run C gets to jump in early as B2 not ready Time
Multithreading SMT Otherwise switch like fine grained Threads to run C gets full turn, A up next Time
Multithreading SMT Still constrained by load delays Threads to run C5, B3 not ready until 8; A7 not ready until 9 Time
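The SMT behavior on these slides can be sketched as a toy scheduler for a dual pipeline. This is a simplified model under stated assumptions, not how any Intel or AMD part works: each thread issues at most one op per cycle, `C` is a 1-cycle op, `M` stalls its thread for `stall` extra cycles, and the op strings are invented rather than taken from the figures.

```python
def smt_schedule(threads, width=2, stall=2):
    """Return, for each cycle, the list of thread indices that issued an op."""
    n = len(threads)
    pc = [0] * n            # next op index for each thread
    ready_at = [0] * n      # first cycle at which each thread may issue again
    start = 0               # thread with first claim on a slot this cycle
    schedule = []
    while any(pc[i] < len(threads[i]) for i in range(n)):
        cycle = len(schedule)
        slots = []
        for k in range(n):
            if len(slots) == width:
                break                       # every pipeline is taken
            i = (start + k) % n
            if pc[i] < len(threads[i]) and ready_at[i] <= cycle:
                op = threads[i][pc[i]]
                pc[i] += 1
                if op == "M":
                    ready_at[i] = cycle + 1 + stall   # load delay still applies
                slots.append(i)
        if slots:
            start = (slots[-1] + 1) % n     # next thread in line goes first
        schedule.append(slots)
    return schedule


if __name__ == "__main__":
    # Hypothetical threads A, B, C (indices 0, 1, 2).
    for cycle, issued in enumerate(smt_schedule(["CMC", "MCC", "CCC"])):
        print(cycle, issued)
```

In this run a thread that is not ready is skipped and the next ready thread takes the spare pipeline, but a cycle can still go partly empty when too many threads are waiting on loads.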
SMT Challenges Resources must be duplicated or split Split too thin hurts performance… Duplicate everything and you aren't maximizing use of hardware…
Intel vs AMD Variations on SMT
Processor Level Parallelism Styles
Processor Parallelism Processor Parallelism : Run multiple instruction streams simultaneously
Flynn's Taxonomy Categorization of architectures based on Number of simultaneous instructions Number of simultaneous data items
SISD SISD : Single Instruction – Single Data One instruction One piece of data May be pipelined or superscalar
SIMD SIMD : Single Instruction – Multiple Data One instruction Multiple pieces of data
SIMD Roots ILLIAC IV One instruction issued to 64 processing units
SIMD Roots Cray I Vector processor One instruction applied to all elements of vector register
Modern SIMD x86 Processors SSE Units : Streaming SIMD Extensions Operate on special 128-bit registers: 4 32-bit chunks, 2 64-bit chunks, 16 8-bit chunks …
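The chunk counts follow from dividing the 128-bit register by the lane width. The sketch below simulates a packed add in plain Python to show the idea; it does not use SSE hardware, and the helper names (`lanes`, `pack`, `unpack`, `packed_add`) are made up for illustration.

```python
REGISTER_BITS = 128


def lanes(lane_bits):
    """How many chunks of `lane_bits` fit in one 128-bit register."""
    return REGISTER_BITS // lane_bits


def pack(values, lane_bits):
    """Pack unsigned values into one integer, lane 0 in the low bits."""
    if len(values) != lanes(lane_bits):
        raise ValueError("wrong number of values for this lane width")
    mask = (1 << lane_bits) - 1
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & mask) << (i * lane_bits)
    return reg


def unpack(reg, lane_bits):
    """Split a 128-bit integer back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(reg >> (i * lane_bits)) & mask for i in range(lanes(lane_bits))]


def packed_add(a, b, lane_bits):
    """One 'instruction': add every lane; overflow wraps inside its own lane."""
    sums = [x + y for x, y in zip(unpack(a, lane_bits), unpack(b, lane_bits))]
    return pack(sums, lane_bits)


if __name__ == "__main__":
    a = pack([1, 2, 3, 4], 32)
    b = pack([10, 20, 30, 40], 32)
    print(lanes(32), lanes(64), lanes(8))
    print(unpack(packed_add(a, b, 32), 32))
```

A single call to `packed_add` stands in for one instruction applied to all four 32-bit chunks at once; a carry out of one lane never spills into its neighbor.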
MISD MISD : Multiple Instruction – Single Data One piece of data Processed by multiple instructions Rare Space shuttle : Five processors handle fly by wire input, vote
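The voting idea can be sketched as a simple majority vote over the outputs that several processors computed from the same input. This is a generic illustration, not the shuttle's actual voting logic; `majority_vote` and the sample values are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value that more than half of the processors agree on."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among processor outputs")
    return value


if __name__ == "__main__":
    # Five hypothetical processors; one disagrees and is outvoted.
    print(majority_vote([10, 10, 10, 10, 7]))
```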
MIMD MIMD : Multiple Instruction – Multiple Data Multiple pieces of data, multiple instruction streams
MIMD MIMD : Multiple Instruction – Multiple Data Multi core processors Super computers Computational Grids
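The four categories of Flynn's taxonomy reduce to two questions: one instruction stream or many, one data stream or many. A small sketch of that lookup, with a hypothetical function name:

```python
def flynn_class(instruction_streams, data_streams):
    """Name the Flynn category for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("need at least one instruction stream and one data stream")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))   # a single scalar core
    print(flynn_class(1, 4))   # one instruction over 4 data chunks
    print(flynn_class(5, 1))   # five processors on one input
    print(flynn_class(8, 8))   # eight cores, each with its own data
```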
Coupling and Topologies MIMD differences How connected are nodes? How shared is memory?
BlueGene http://s.top500.org/static/lists/2012/11/TOP500_201211_Poster.png
BG/P Full system : 72 x 32 x 32 torus of nodes
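In a 3D torus every node links to two neighbors along each axis, and the links wrap around at the edges. A minimal sketch using the 72 x 32 x 32 dimensions from the slide; the coordinate scheme and function names are assumptions for illustration.

```python
DIMS = (72, 32, 32)   # full-system torus dimensions from the slide


def node_count(dims=DIMS):
    """Total nodes in the torus."""
    x, y, z = dims
    return x * y * z


def torus_neighbors(coord, dims=DIMS):
    """The six neighbors of a node, with wraparound on every axis."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]   # wrap at the edges
            neighbors.append(tuple(n))
    return neighbors


if __name__ == "__main__":
    print(node_count())
    print(torus_neighbors((0, 0, 0)))
```

The wraparound is what separates a torus from a plain mesh: a corner node such as (0, 0, 0) still has six neighbors.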
COW Cluster of Workstations