Slide 1: Multi-threading, Hyperthreading & Chip Multiprocessing (CMP)
Beyond ILP: thread-level parallelism (TLP)
Multithreaded microarchitectures
Slide 2: Locality and Parallelism Review
Large memories are slow; fast memories are small.
Storage hierarchies are large and fast on average.
Parallel processors, collectively, have a large, fast cache.
– The slow accesses to "remote" data are what we call "communication."
Algorithms should do most of their work on local data.
[Figure: a conventional storage hierarchy (Proc → cache → L2 → L3 → memory), replicated per processor, with potential interconnects between the copies]
Slide 3: Static ILP Hitting a Limit
In-order scheduling microarchitecture with perfect memory.
[Chart: issue width vs. IPC for the GCC benchmark]
Memory is not keeping pace with processors.
Chip density: ~2x every 2 years.
Clock speed: no increase.
Number of processor cores: doubling.
Power kept under control, no longer growing.
Slide 4: Memory Not Keeping Pace
Memory density is doubling every three years; processor logic every two.
Storage costs are dropping, but more slowly than logic costs.
[Chart: cost of computation vs. memory. Source: David Turek, IBM]
Slide 5: Power Density Limiting Serial Performance
Heat: scaling clock speed (business as usual) will not work.
High-performance serial processors waste power:
– speculation, dynamic dependence checking, etc. burn power
– implicit parallelism discovery
More transistors, but not faster serial processors.
Concurrent systems are more power efficient:
– dynamic power is proportional to V²fC
– increasing cores increases capacitance C, but lowering clock speed f (and hence voltage V) saves power (see the sketch below)
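A back-of-the-envelope sketch of the V²fC argument, in C. The voltages, frequencies, and relative capacitances below are illustrative assumptions, not measurements of any real chip:

```c
#include <stdio.h>

/* Dynamic power scales as P = k * V^2 * f * C. The numbers below are
 * illustrative assumptions: two half-speed cores (double capacitance)
 * can run at a lower supply voltage than one fast core. */
int main(void) {
    double V1 = 1.2, f1 = 3.0e9, C1 = 1.0;  /* one fast core               */
    double V2 = 1.0, f2 = 1.5e9, C2 = 2.0;  /* two cores at half the clock */

    double p_one = V1 * V1 * f1 * C1;
    double p_two = V2 * V2 * f2 * C2;

    /* Comparable aggregate throughput, ~30% less dynamic power. */
    printf("dual/single power ratio = %.2f\n", p_two / p_one);  /* 0.69 */
    return 0;
}
```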
Slide 6: Parallelism Today: Multicore
All processor vendors ship multicore chips.
– Every machine is a parallel machine.
– To double performance, double the parallelism.
– Can commercial applications use parallelism? Must they be rewritten from scratch?
Will all programmers become parallel programmers?
– New software models are needed to hide the complexity from most programmers.
– In the meantime, we need to understand it.
The computer industry is betting on parallelism, but it does not have all the answers.
– Berkeley's ParLab and Stanford's parallelism lab are working on it.
Slide 7: Finding Enough Parallelism
Only part of an application is parallel; the rest is sequential.
Amdahl's law:
– If s is the fraction of work that is sequential, (1−s) is the fraction parallelizable, and P is the number of processors:
Speedup(P) = Time(1)/Time(P) ≤ 1/(s + (1−s)/P) ≤ 1/s
– The serial part limits speedup: performance is limited by the sequential work even if the parallel part speeds up perfectly (worked example below).
Top500 list, Nov 2014: the fastest machine is Tianhe-2 (China); others near the top are from the US and Japan, with Europe a distant presence.
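A minimal worked example of Amdahl's law in C; the 5% serial fraction is an assumption chosen to show how quickly the 1/s ceiling is reached:

```c
#include <stdio.h>

/* Amdahl's law: Speedup(P) = 1 / (s + (1 - s) / P),
 * where s is the serial fraction and P the processor count. */
double amdahl(double s, int P) {
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void) {
    double s = 0.05;   /* illustrative serial fraction, not a benchmark */
    for (int P = 1; P <= 1024; P *= 4)
        printf("P = %4d  speedup = %6.2f\n", P, amdahl(s, P));
    printf("limit (P -> infinity) = %.2f\n", 1.0 / s);   /* 20.00 */
    return 0;
}
```

Even with 1024 processors the speedup is about 19.6x: the 5% serial fraction caps it at 20x no matter how many processors are added.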
Slide 8: TOP500 – Tianhe-2 (China) Ranked 1st, November 2014
[Chart: TOP500 list]
Slide 9: TOP500 – China's Tianhe-2 is 1st
[Chart: TOP500 list, continued]
Slide 10: Parallelism has Overhead
Parallelism overheads:
– starting a thread or process
– communicating shared data
– synchronizing (e.g., at a barrier)
Each can cost milliseconds (millions of flops).
Tradeoff: an algorithm needs large units of work to run fast in parallel (i.e., large granularity), but if the units are too large, there is not enough parallel work (see the sketch below).
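A toy model of the granularity tradeoff in C. The work size, overhead, and processor count are assumptions; the point is the U-shape: many small tasks drown in overhead, while too few large tasks leave processors idle:

```c
#include <stdio.h>

/* Toy model: each parallel task pays a fixed startup/sync overhead.
 * total_work, overhead, and P are illustrative assumptions. */
int main(void) {
    double total_work = 1e9;   /* useful operations                */
    double overhead   = 1e6;   /* per-task overhead, in operations */
    int    P          = 16;    /* processors                       */

    for (long tasks = 2; tasks <= 8192; tasks *= 8) {
        double grain = total_work / tasks;          /* work per task     */
        long   waves = (tasks + P - 1) / P;         /* rounds of P tasks */
        double time  = waves * (grain + overhead);
        printf("tasks=%5ld  grain=%9.0f  time=%.3g\n", tasks, grain, time);
    }
    return 0;   /* fastest around tasks ~= P: large grains, but not too few */
}
```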
Slide 11: Performance Beyond a Single Thread
TLP: natural parallelism in applications (e.g., database, scientific).
Explicit thread-level parallelism or data-level parallelism.
Thread: an instruction stream with its own PC and data.
– E.g., online transaction processing, scientific modeling of nature, …
– Each thread has everything (instructions, data, PC, register state, and so on) necessary to execute.
Data-level parallelism: e.g., multimedia; identical operations on data; vector processors were the predecessor.
Slide 12: Multithreaded Categories Overview
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; shading distinguishes Threads 1–5 and idle slots]
Slide 13: Multithreaded Execution
Multiple threads share processor functional units.
– The processor duplicates the independent state of each thread: e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table.
– Memory is shared through virtual-memory mechanisms.
– HW supports fast thread switches; much faster than a full process switch (100s to 1000s of clocks).
When to switch?
– Fine grain: alternate instructions per thread.
– Coarse grain: when a thread stalls, e.g., on a cache miss.
Slide 14: Coarse-Grained Multithreading
Switch on costly stalls, e.g., L2 cache misses.
Advantages:
– simple
– doesn't slow down an individual thread
Disadvantage: throughput loss from short stalls, due to pipeline start-up costs.
– The CPU issues instructions from 1 thread, and the pipeline is emptied on a stall.
– The new thread must refill the pipeline.
Coarse-grained multithreading is best for reducing the penalty of high-cost stalls, where pipeline refill time << stall time.
Used in the IBM eServer pSeries 680.
Slide 15: Fine-Grained Multithreading
Switch threads on each instruction, every clock, done round-robin, skipping stalled threads (see the sketch below).
Advantage: can hide both short and long stalls; instructions from other threads execute while a thread stalls.
Disadvantage: slows down individual threads; a thread is delayed by the other threads.
Used on Sun's Niagara.
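A minimal sketch, in C, of the per-cycle round-robin selection described above; the thread count and stall flags are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdio.h>

/* Fine-grained thread selection (assumed model): each cycle, pick the
 * next thread round-robin, skipping any thread that is stalled
 * (e.g., waiting on a cache miss). */
#define NTHREADS 4

int select_thread(const bool stalled[NTHREADS], int last) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last + i) % NTHREADS;
        if (!stalled[t])
            return t;           /* issue from this thread this cycle */
    }
    return -1;                  /* all threads stalled: idle cycle   */
}

int main(void) {
    bool stalled[NTHREADS] = { false, true, false, false };
    int last = 0;
    for (int cycle = 0; cycle < 6; cycle++) {
        last = select_thread(stalled, last);
        printf("cycle %d: thread %d\n", cycle, last);  /* 2,3,0,2,3,0 */
    }
    return 0;
}
```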
Slide 16: Most Execution Units in a Superscalar are Idle
Observation for an 8-way superscalar.
Source: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995.
Slide 17: Chip Multiprocessing (CMP) – i7, Power4
Without SMT: sending threads or processes to multiple processors
– reduces horizontal waste,
– but leaves vertical waste.
POWER5 uses SMT.
[Figure: issue width over time (processor cycles)]
Slide 18: IBM Power4 – 1st CMP, 2000
Two 64-bit cores; single-threaded predecessor to the Power5.
8 execution units in an out-of-order engine; each may issue an instruction every cycle.
Pipeline stage legend: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit.
Slide 19: Power4 Core
[Figure: Power4 core diagram]
Slide 20: Power4 Pipeline – Fetch, Group, Crack
Groups of up to 5 instructions:
– up to 8 instructions fetched from the cache per cycle
– instructions grouped 1 to 5 at a time
– complex instructions are broken into simpler internal ones:
– a cracked instruction is broken into 2 internal instructions (e.g., load multiple word)
– a millicoded instruction is broken into more than 2 internal instructions
Slide 21: Power4 Pipeline – Group Dispatch (GD)
Dispatch: send each instruction group to the issue queues in order.
– Instruction dependencies are determined.
– Internal resources are assigned: issue-queue slots, rename registers, load/store reorder queue entries (GD and MP stages).
– Group control information goes into the GCT, the global completion table (20 groups), which serves as the ROB.
Slide 22: Power4 Pipeline – Group Dispatch, One Group per Cycle
Groups go to separate issue queues for the floating-point, branch-execution, fixed-point, and load/store units.
The fixed-point (integer) and load/store units share common issue queues.
Issue stage (ISS): ready-to-execute instructions are pulled out of the issue queues.
Slide 23: Power4 Pipeline – Execution and Branch Prediction
Instruction execution (EX): speculation; rename resources (GPRs grow from 32 architected to 80 physical).
Branch prediction (BP):
– conditional branches are predicted; instructions are fetched and speculatively executed
– 3 history tables are used
– if the prediction is correct, processing continues; otherwise, instructions are flushed and instruction fetching is redirected (a sketch of a multi-table predictor follows)
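Power4's three branch-history tables suggest a combining (tournament) scheme. The following C sketch is a generic local + global + selector predictor in that spirit; the table sizes, indexing, and counter widths are illustrative assumptions, not Power4's actual design:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Three tables of 2-bit saturating counters: a local predictor, a
 * global (gshare-style) predictor, and a selector that learns which
 * of the two to trust for each branch. */
#define TBL 4096
static uint8_t  local_t[TBL], global_t[TBL], select_t[TBL];
static uint16_t ghist;                      /* global branch history */

static bool taken(uint8_t c) { return c >= 2; }
static void train(uint8_t *c, bool t) {
    if (t) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
}

bool predict(uint32_t pc) {
    uint32_t li = pc % TBL, gi = (pc ^ ghist) % TBL;
    bool l = taken(local_t[li]), g = taken(global_t[gi]);
    return taken(select_t[li]) ? g : l;     /* selector picks a predictor */
}

void update(uint32_t pc, bool outcome) {
    uint32_t li = pc % TBL, gi = (pc ^ ghist) % TBL;
    bool l = taken(local_t[li]), g = taken(global_t[gi]);
    if (l != g)                             /* train selector toward the */
        train(&select_t[li], g == outcome); /* predictor that was right  */
    train(&local_t[li], outcome);
    train(&global_t[gi], outcome);
    ghist = (uint16_t)((ghist << 1) | (outcome ? 1 : 0));
}

int main(void) {
    for (int i = 0; i < 1000; i++)          /* branch taken 3 of 4 times */
        update(0x400, (i % 4) != 3);
    printf("predict(0x400) = %d\n", predict(0x400));
    return 0;
}
```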
Slide 24: Power5 = SMT + Power4
Slide 25: Power4 vs. Power5
[Figure: Power4 and Power5 pipelines side by side; Power5 adds 2 fetch units (2 PCs), 2 initial decodes, and 2 commits (2 architected register sets)]
Slide 26: Power5 Data Flow
Why only 2 threads? With 4, shared resources (physical registers, cache, memory bandwidth) would become the bottleneck.
Slide 27: Simultaneous Multithreading
[Figure: issue slots per cycle (cycles 1–9) across 8 units, for one thread vs. two threads. M = load/store, FX = fixed point, FP = floating point, BR = branch, CC = condition codes]
Slide 28: Simultaneous Multithreading (SMT)
SMT: use a dynamically scheduled processor.
– The large register set can hold independent thread contexts.
– Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads (see the sketch below).
– Out-of-order completion allows the threads to execute out of order and achieve better utilization.
HW additions: a per-thread renaming table and separate PCs.
– Independent commit: logically keep a separate reorder buffer for each thread.
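A minimal sketch of per-thread renaming in C: each thread maps its architected registers onto one shared physical register file, so both threads' instructions can coexist in the datapath. The trivial allocator is an assumption (real hardware uses a free list); 240 echoes the Power5 rename-register count cited on a later slide:

```c
#include <stdio.h>

#define NTHREADS   2
#define NARCH     32
#define NPHYS    240            /* Power5-like physical register count */

static int rename_map[NTHREADS][NARCH];
static int next_free;           /* toy allocator; real HW uses a free list */

int rename_dest(int thread, int arch_reg) {
    int phys = next_free++ % NPHYS;     /* allocate a fresh physical reg */
    rename_map[thread][arch_reg] = phys;
    return phys;
}

int lookup_src(int thread, int arch_reg) {
    return rename_map[thread][arch_reg];
}

int main(void) {
    /* Both threads write architected r3; they get distinct physical regs,
     * so their instructions cannot clobber each other. */
    printf("thread 0 r3 -> p%d\n", rename_dest(0, 3));
    printf("thread 1 r3 -> p%d\n", rename_dest(1, 3));
    printf("thread 0 reads r3 from p%d\n", lookup_src(0, 3));
    return 0;
}
```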
Slide 29: Changes from Single Thread to SMT
A second program counter (PC) is added to fetch the 2nd thread.
The GPR/FPR rename mapper is expanded to map a second set of registers (a thread bit indicates the thread).
Completion logic is replicated to track two threads.
A thread bit is added to most address/tag buses.
Slide 30: Changes in Power5 to Support SMT
Increased associativity of the L1 instruction cache and the instruction address translation buffers (ITLB).
Added load/store queues per thread.
Increased L2 and L3 sizes (L2: 1.92 vs. 1.44 MB).
Separate instruction prefetch and buffering per thread.
Increased the number of rename (physical) registers from 152 to 240.
Increased the size of the issue queues.
The Power5 core is 24% larger than the Power4 core to support SMT.
Slide 31: SMT Design Issues
What is SMT's impact on single-thread performance?
A larger register file is needed to hold multiple contexts.
Clock cycle time is pressured, especially in:
– instruction issue, where more candidate instructions must be considered
– instruction completion, where choosing which instructions to commit is challenging
Cache and TLB conflicts generated by SMT degrade performance.
Slide 32: Resource Sharing – Effects
Threads share many resources: GCT, BHT, TLB, …
Resources must stay balanced across threads for high performance; drifting to extremes reduces performance.
Solution: dynamically adjust resource utilization.
Slide 33: Power5 Thread Performance / Priority
The relative priority of each thread is hardware controlled.
For balanced operation, both threads run slower than if each "owned" the machine.
Slide 34: Thread Priority Control, cont'd
Unbalanced execution is desirable if:
– there is no work for the opposite thread
– a thread is spin-waiting on a lock
– software determines a non-uniform balance
– power management requires it
Solution: control the instruction decode rate.
– Software/hardware controls 8 priority levels for each thread (see the sketch below).
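The 8-level priority scheme is Power5's, per the slide; the particular decode-slot allocation rule below is an illustrative assumption of how priority could throttle a thread's decode rate, not IBM's actual logic:

```c
#include <stdio.h>

/* Priority-weighted decode-slot allocation (assumed rule): out of
 * (prio0 + prio1) consecutive cycles, thread 0 gets prio0 of them. */
int decode_from(int prio0, int prio1, int cycle) {
    int total = prio0 + prio1;
    return (cycle % total) < prio0 ? 0 : 1;
}

int main(void) {
    int p0 = 6, p1 = 2;    /* thread 0 favored 3:1 */
    for (int c = 0; c < 8; c++)
        printf("cycle %d: decode thread %d\n", c, decode_from(p0, p1, c));
    return 0;
}
```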
Slide 35: Dynamic Thread Switching
Used if no task is ready for the second thread to run.
All machine resources are allocated to one thread.
Software initiated.
The dormant thread awakens on:
– an external interrupt
– a decrementer interrupt
– a special instruction from the active thread
Slide 36: Single-Thread Operation
For execution-unit-limited applications: floating-point- or fixed-point-intensive workloads.
Execution-unit-limited applications provide minimal performance leverage for SMT.
– Performance is higher when all resources are dedicated to a single thread.
Determined dynamically on a per-processor basis.
Slide 37: Initial Performance of SMT
The Pentium 4 Extreme's SMT yields a 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate.
– The Pentium 4 is a dual-threaded SMT.
– SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark.
Running each of the 26 SPEC benchmarks paired with every other on the Pentium 4 (26² runs): speedups ranged from 0.90 to 1.58, with an average of 1.20.
A Power5 8-processor server is 1.23x faster for SPECint_rate with SMT and 1.16x faster for SPECfp_rate.
Power5 running 2 copies of each app: speedups between 0.89 and 1.41.
– Most apps gained something.
– Floating-point apps had the most cache conflicts and the least gains.
Slide 38: Limits to ILP
Doubling issue rates above today's 3–6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to:
– issue 3 or 4 data memory accesses per cycle,
– resolve 2 or 3 branches per cycle,
– rename and access more than 20 registers per cycle, and
– fetch 12 to 24 instructions per cycle.
The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate.
– E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite consuming the most power!
Slide 39: Limits to ILP
Most techniques for increasing performance also increase power consumption.
The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
Multiple-issue processor techniques are all energy inefficient:
1. Issuing multiple instructions incurs overhead in logic that grows faster than the issue rate.
2. There is a growing gap between peak issue rates and sustained performance: the number of transistors switching scales with the peak issue rate while performance scales with the sustained rate, so the widening gap means increasing energy per unit of performance.
Slide 40: Commentary
The Itanium architecture does not represent a significant breakthrough in scaling ILP or in avoiding the power and complexity problems.
Instead of more ILP, architects are focusing on TLP implemented with CMP.
IBM announced the Power4, the 1st commercial CMP: 2 Power3 processors plus a shared L2 cache.
– Sun Microsystems and Intel have switched to CMP rather than more aggressive uniprocessors.
The right balance of ILP and TLP is not clear:
– servers can exploit more TLP,
– while on the desktop, single-thread performance remains the primary requirement.
Slide 41: And in Conclusion …
Limits to ILP (power efficiency, compilers, dependencies, …) seem to cap practical designs at 3 to 6 issue.
Explicit parallelism (data-level or thread-level) is the next step for performance.
Coarse-grained vs. fine-grained multithreading:
– switch only on big stalls vs. switch every clock cycle.
Simultaneous multithreading is fine-grained multithreading built on a superscalar microarchitecture.
– Instead of replicating registers, reuse the rename registers.
Slide 42: Power Storage Hierarchy
[Figure: Power storage hierarchy]
Slide 43: Power Storage Hierarchy – Hardware Data Prefetch
Hardware prefetches data from L2, L3, and memory: it hides memory latency and transparently loads the L1 data cache.
– Triggered by data-cache line misses.
L1 prefetches 1 cache line ahead; L2 prefetches 5 cache lines ahead; L3 prefetches 17 to 20 lines ahead (a software analogue is sketched below).
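The Power prefetch engines are pure hardware and need no code, but the idea of "running N lines ahead of the demand stream" can be sketched in software with the GCC/Clang __builtin_prefetch hint. The prefetch distance and 64-byte line size here are assumptions:

```c
#include <stddef.h>
#include <stdio.h>

#define LINE   64    /* assumed cache-line size in bytes */
#define AHEAD   8    /* assumed prefetch distance, in lines */

double sum_array(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Hint: fetch the line we'll need AHEAD lines from now.
         * Prefetch hints past the end of the array are harmless. */
        __builtin_prefetch((const char *)a + i * sizeof(double) + AHEAD * LINE, 0, 3);
        s += a[i];
    }
    return s;
}

int main(void) {
    static double a[100000];
    for (size_t i = 0; i < 100000; i++) a[i] = 1.0;
    printf("sum = %f\n", sum_array(a, 100000));
    return 0;
}
```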
Slide 44: Moore's Law Reinterpreted
The number of cores per chip will double every two years.
Clock speed will not increase (and may decrease).
We need to deal with systems with millions of concurrent threads.
We need to deal with inter-chip parallelism as well as intra-chip parallelism.
Slide 45: Intel's Hyper-Threading Technology is SMT
On the Pentium 4 (Xeon): executes two tasks simultaneously,
– two different applications, or
– two threads of the same application.
The CPU maintains architectural state for two processors.
– Two logical processors per physical processor.
Implemented on the Intel Xeon and most Pentium 4 parts:
– two logical processors for < 5% additional die area
– a power-efficient performance gain
Slide 46: Resources are Shared, Not Replicated
[Figure]
Slide 47: Multithreaded Microarchitecture
Dedicated local context per running thread.
Efficient resource sharing:
– time sharing
– space sharing
Fast thread synchronization / communication:
– explicit instructions
– implicit, via shared registers / cache / buffers
Slide 48: Changes Needed for Hyper-Threading the Pentium 4
Replicate:
– all per-CPU architectural state
– instruction pointers, renaming logic
– other: ITLB, return-stack predictor, …
Partition resources (share by splitting in half per thread; see the sketch below):
– several buffers: reorder buffer, load/store buffers, queues
Share:
– the out-of-order execution engine
– the caches
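A sketch of the "split in half per thread" idea in C: static partitioning caps each logical processor at half the entries, so a stalled thread cannot starve the other. The queue size is illustrative, not the Pentium 4's:

```c
#include <stdbool.h>
#include <stdio.h>

#define QSIZE 32                    /* illustrative buffer size */

static int occupancy[2];            /* entries held by each logical CPU */

bool try_allocate(int thread) {
    if (occupancy[thread] >= QSIZE / 2)
        return false;               /* this thread's half is full */
    occupancy[thread]++;
    return true;
}

void release(int thread) {
    if (occupancy[thread] > 0)
        occupancy[thread]--;
}

int main(void) {
    int granted = 0;
    while (try_allocate(0)) granted++;   /* thread 0 fills its half */
    printf("thread 0 got %d entries; thread 1 can still allocate: %d\n",
           granted, try_allocate(1));    /* 16 entries; 1 (true) */
    return 0;
}
```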
Slide 49: P4 Out-of-Order Execution Pipeline
[Figure: Pentium 4 out-of-order execution pipeline]
Slide 50: P4 Hyper-Threaded Pipeline
[Figure: Pentium 4 hyper-threaded pipeline]
Slide 51: Pentium 4 Hyper-Threading Front End
[Figure: front-end pipeline, marking which resources are divided between the logical CPUs and which are shared between them]
Slide 52: Thread Selection Points
[Figure: thread selection points in the pipeline]
Slide 53: ICOUNT Choosing Policy
Fetch from the thread with the fewest instructions in flight (see the sketch below).
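A minimal sketch of the ICOUNT heuristic in C: each cycle, favor the thread with the fewest instructions in flight in the front end and queues. The thread count and in-flight counts are made-up inputs:

```c
#include <stdio.h>

#define NTHREADS 4

/* Return the thread with the fewest instructions in flight. */
int icount_pick(const int inflight[NTHREADS]) {
    int best = 0;
    for (int t = 1; t < NTHREADS; t++)
        if (inflight[t] < inflight[best])
            best = t;
    return best;
}

int main(void) {
    int inflight[NTHREADS] = { 12, 3, 7, 9 };
    printf("fetch from thread %d\n", icount_pick(inflight)); /* thread 1 */
    return 0;
}
```

The intuition: a thread with few instructions in flight is moving quickly and not clogging the queues, so feeding it more instructions makes the best use of fetch bandwidth.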
Slide 54: All Caches are Shared
– execution trace cache
– L1 data
– L2 unified
– L3 unified
Slide 55: Data in Caches Can Be Shared
– L1 data
– L2 unified
– L3 unified
Slide 56: The Operating System Manages Tasks
It schedules tasks on the logical processors.
It executes HALT if a logical processor is idle.
Slide 57: Initial Performance of SMT (recap)
The Pentium 4 Extreme's SMT yields a 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate.
– The Pentium 4 is a dual-threaded SMT.
Running each of the 26 SPEC benchmarks paired with every other on the Pentium 4 (26² runs): speedups ranged from 0.90 to 1.58, with an average of 1.20.
A Power5 8-processor server is 1.23x faster for SPECint_rate with SMT and 1.16x faster for SPECfp_rate.
Power5 running 2 copies of each app: speedups between 0.89 and 1.41.
– Most apps gained something.
– Floating-point apps had the most cache conflicts and the least gains.
Slide 58: Hyper-Threading Technology
A significant new technology direction for Intel's future CPUs.
Exploits the parallelism in today's applications and usage:
– two logical processors on one physical processor.
Accelerates performance at low silicon and power cost.
Implemented in the Xeon MP, Pentium 4, and Itanium 2.
Slide 59: Multicore & Manycore – a Revolution is Needed
Neither software nor architecture alone can fix the parallel programming problem; we need innovations in both.
"Multicore": 2x cores per generation: 2, 4, 8, …
"Manycore": 100s of cores gives the highest performance per unit area and per watt, then 2x per generation: 64, 128, 256, 512, 1024, …
Multicore architectures and programming models that are good for 2 to 32 cores won't evolve into manycore systems of 1000s of processors.
We desperately need HW/SW models that work for manycore, or we will run out of steam (as ILP ran out of steam at 4 instructions).
Slide 60: Summary – Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; shading distinguishes Threads 1–5 and idle slots]
Slide 61: Cell Processor
Slide 62: Cell Processor Features
– 64-bit Power core and its L2 cache
– 8 SPEs: processing elements with local memory
– high-bandwidth interconnect bus
– memory interface controller
– 10 simultaneous threads: 8 on the SPEs + 2 on the Power core
– 234M transistors, 90 nm SOI, 8-level copper
– on-chip temperature is monitored and cooling adjusted
Slide 64: SPE
The SPE is optimized for compute-intensive applications.
Both types of processor cores share access to a common address space: main memory, plus address ranges corresponding to each SPE's local store, control registers, and I/O devices.
Simple, high-speed pipeline.
Pervasive parallel computing: SIMD data-level parallelism.
128 x 128-bit register file (scalar and vector).
Optimized scalar execution: uses the same hardware path as vector instructions.
256 KB local store (similar to, but not, a cache: no tags, etc.; see the sketch below).
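Because the local store has no tags or hardware miss handling, software must stage data into it explicitly (on Cell, via MFC DMA). This C sketch shows the classic double-buffering pattern; dma_get() and process() are hypothetical stand-ins, and the chunking assumes total is a multiple of CHUNK:

```c
#include <string.h>
#include <stddef.h>

#define CHUNK 4096                 /* floats per buffer; illustrative */

/* Hypothetical stand-in for an asynchronous MFC DMA transfer. */
static void dma_get(float *ls, const float *mem, size_t n) {
    memcpy(ls, mem, n * sizeof(float));
}

/* Hypothetical compute kernel stub. */
static void process(float *buf, size_t n) { (void)buf; (void)n; }

/* Double buffering: fetch the next chunk while computing on the
 * current one, overlapping transfer latency with useful work.
 * Assumes total is a multiple of CHUNK. */
void stream(const float *mem, size_t total) {
    static float ls[2][CHUNK];     /* two buffers in "local store" */
    size_t done = 0;
    int cur = 0;
    dma_get(ls[cur], mem, CHUNK);
    while (done < total) {
        int nxt = 1 - cur;
        if (done + CHUNK < total)  /* start fetching the next chunk */
            dma_get(ls[nxt], mem + done + CHUNK, CHUNK);
        process(ls[cur], CHUNK);   /* compute on the current chunk  */
        done += CHUNK;
        cur = nxt;
    }
}

int main(void) {
    static float data[CHUNK * 4];
    stream(data, CHUNK * 4);
    return 0;
}
```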
Slide 65: Cell Processor Die Photo
[Figure: die photo]
Slide 66: Synergistic Processor (SPE)
[Figure: SPE block diagram]
Slide 67: SPE Pipeline
[Figure: SPE pipeline]