Download presentation
Presentation is loading. Please wait.
1
Hyperthreading Technology
Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville
2
Outline What is hyperthreading? Trends in microarchitecture
Exploiting thread-level parallelism Hyperthreading architecture Microarchitecture choices and trade-offs 9/20/2018 A. Milenkovic
3
What is hyperthreading?
SMT - Simultaneous multithreading Make one physical processor appear as multiple logical processors to the OS and software Intel Xeon for the server market, early 2002 Pentium4 for the consumer market, November 2002 Motivation: boost performance for up to 25% at the cost of 5% increase in additional die area Hyperthreading brings the SMT concept to the Intel architecture. First introduced in the Intel Xeon processor (serves the server market), then in the Pentium 4 processor. Goal: improve performance at minimal cost. One physical processor appears as multiple logical processors to the operating system and software. 9/20/2018 A. Milenkovic
4
Trends in microarchitecture
Higher clock speeds To achieve high clock frequency make pipeline deeper (superpipelining) Events that disrupt pipeline (branch mispredictions, cache misses, etc) become very expensive in terms of lost clock cycles ILP: Instruction Level Parallelism Extract parallelism in a single program Superscalar processors have multiple execution units working in parallel Challenge to find enough instructions that can be executed concurrently Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order 9/20/2018 A. Milenkovic
5
Trends in microarchitecture
Cache hierarchies Processor-memory speed gap Use caches to reduce memory latency Multiple levels of caches: smaller and faster closer to the processor core Thread-level Parallelism Multiple programs execute concurrently Web-servers have an abundance of software threads Users: surfing the web, listening to music, encoding/decoding video streams, etc. 9/20/2018 A. Milenkovic
6
Exploiting thread-level parallelism
CMP – Chip Multiprocessing Multiple processors, each with a full set of architectural resources, reside on the same die Processors may share an on-chip cache or each can have its own cache Examples: HP Mako, IBM Power4 Challenges: Power, Die area (cost) Time-slice multithreading Processor switches between software threads after a predefined time slice Can minimize the effects of long lasting events Still, some execution slots are wasted 9/20/2018 A. Milenkovic
7
Exploiting thread-level parallelism
Switch-on-event multithreading Processor switches between software threads after an event (e.g., cache miss) Works well, but still coarse-grained parallelism (e.g., data dependences and branch mispredictions are still wasting cycles) SMT – Simultaneous multithreading Multiple software threads execute on a single processor without switching Have potential to maximize performance relative to the transistor count and power 9/20/2018 A. Milenkovic
8
Hyperthreading architecture
One physical processor appears as multiple logical processors HT implementation on NetBurst microarchitecture has 2 logical processors Architectural State Architectural State Architectural state: general purpose registers control registers APIC: advanced programmable interrupt controller Processor execution resources 9/20/2018 A. Milenkovic
9
Hyperthreading architecture
Main processor resources are shared caches, branch predictors, execution units, buses, control logic Duplicated resources register alias tables (map the architectural registers to physical rename registers) next instruction pointer and associated control logic return stack pointer instruction streaming buffer and trace cache fill buffers 9/20/2018 A. Milenkovic
10
Die Size and Complexity
9/20/2018 A. Milenkovic
11
Resources sharing schemes
Partition – dedicate equal resources to each logical processors Good when expect high utilization and somewhat unpredicatable Threshold – flexible resource sharing with a limit on maximum resource usage Good for small resources with bursty utilization and when the micro-ops stay in the structure for short predictable periods Full sharing – flexible with no limits Good for large structures, with variable working-set sizes 9/20/2018 A. Milenkovic
12
Shared vs. partitioned queues
Cycle 0: 2 light shaded (fast) and 2 dark shaded (slow) microoperations Cycle 1: send light shaded micro-op 0 down the pipeline In the shared queue the previous pipeline stage sends dark shaded micro-op 2, but in the partitioned queue the slow thread is already occupying two entries, so the previous stage will send light shaded micro-op 2. (c) Cycle 3: Light micro-op 1 is sent down the pipeline. Shared queue will accept light shaded micro-op 2, while the partitioned will accept the next light shaded micro-op 3. (d) Cycle 3: light shaded micro-op 2 is sent down the pipeline. In shared queue, dark shaded micro-op 3 is entering the queue, but in the partitioned we will have to accept another light shaded micro-op 4. (e) Finally, the shared queue keeps all dark shaded micro-ops (slow) and will block the other thread. With the partitioned queue we do not have such problems. shared partitioned 9/20/2018 A. Milenkovic
13
NetBurst Pipeline threshold partitioned 9/20/2018 A. Milenkovic
14
Shared vs. partitioned resources
E.g., major pipeline queues Threshold Puts a threshold on the number of resource entries a logical processor can have E.g., scheduler Fully shared resources E.g., caches Modest interference Benefit if we have shared code and/or data 9/20/2018 A. Milenkovic
15
Scheduler occupancy 9/20/2018 A. Milenkovic
16
Shared vs. partitioned cache
9/20/2018 A. Milenkovic
17
Performance Improvements
9/20/2018 A. Milenkovic
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.