Simultaneous Multithreading: Maximizing On-Chip Parallelism
Presented by: Daron Shrode, Shey Liggett


Introduction
Simultaneous multithreading (SM) is a technique that permits several independent threads to issue instructions to a superscalar processor's multiple functional units in a single cycle. The objective of SM is to substantially increase processor utilization in the face of both long memory latencies and limited available parallelism per thread.

Overview
- Introduce several SM models
- Evaluate the performance of those models relative to superscalar and fine-grain multithreading
- Show how to tune the cache hierarchy for SM processors
- Demonstrate the potential performance and real-estate advantages of SM architectures over small-scale, on-chip multiprocessors

Simulation Environment
- The authors developed a simulation environment that defines an implementation of an SM architecture
- Uses emulation-based, instruction-level simulation, similar to Tango and g88
- Models the execution pipelines, the memory hierarchy (both in terms of hit rates and bandwidths), the TLBs, and the branch prediction logic of a wide superscalar processor
- Based on the Alpha AXP 21164, augmented first for wider superscalar execution and then for multithreaded execution

Simulation Environment (cont.)
The typical simulated configuration contains 10 functional units of four types (four integer, two floating-point, three load/store, and one branch) and a maximum issue rate of 8 instructions per cycle.
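The configuration above can be sketched as a toy issue stage. This is a minimal illustration, not the authors' simulator: the unit counts and issue width come from the slide, while `issue_cycle`, the unit-type names, and the fixed thread priority order are assumptions made for the example.

```python
from collections import Counter

# Functional-unit counts and issue width taken from the slide.
FUNCTIONAL_UNITS = {"int": 4, "fp": 2, "ldst": 3, "branch": 1}
MAX_ISSUE = 8


def issue_cycle(ready_by_thread):
    """Pick the instructions to issue in one cycle.

    ready_by_thread holds one list per thread; each element names the
    unit type that one ready instruction needs. Threads are visited in
    list order, which is a hypothetical fixed-priority policy.
    Returns a list of (thread_index, unit_type) pairs.
    """
    free = Counter(FUNCTIONAL_UNITS)  # units still unclaimed this cycle
    issued = []
    for tid, ready in enumerate(ready_by_thread):
        for unit in ready:
            if len(issued) == MAX_ISSUE:
                return issued  # issue bandwidth exhausted
            if free[unit] > 0:
                free[unit] -= 1
                issued.append((tid, unit))
    return issued


if __name__ == "__main__":
    one_thread = [["int", "int", "fp"]]
    two_threads = [["int", "int", "ldst"],
                   ["int", "int", "int", "fp", "branch"]]
    print(len(issue_cycle(one_thread)))
    print(len(issue_cycle(two_threads)))
```

With a single thread only the slots that thread can fill are used; adding a second thread lets otherwise idle units do work in the same cycle, until either the issue width or a unit type runs out.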

Simulation Environment (cont.)

Superscalar Bottlenecks
- There is no dominant source of wasted issue bandwidth and, therefore, no dominant solution
- No single latency-tolerating technique will produce a dramatic increase in the performance of these programs if it attacks only specific types of latencies
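The paper behind these slides splits wasted issue bandwidth into vertical waste (cycles in which nothing issues) and horizontal waste (unused slots in cycles that issue at least one instruction). The sketch below shows that accounting under stated assumptions: an 8-wide machine and a made-up per-cycle trace. `issue_waste` and the trace values are illustrative and do not come from the slides.

```python
def issue_waste(issued_per_cycle, width=8):
    """Classify issue slots as used, vertically wasted, or horizontally wasted.

    issued_per_cycle: number of instructions issued in each cycle.
    width: issue slots available per cycle (8 on the simulated machine).
    """
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if n > 0)
    used = sum(issued_per_cycle)
    return {
        "total_slots": width * len(issued_per_cycle),
        "used": used,
        "vertical_waste": vertical,
        "horizontal_waste": horizontal,
    }


if __name__ == "__main__":
    # Hypothetical five-cycle trace, not measured data.
    print(issue_waste([0, 3, 8, 0, 1]))
```

Techniques that only hide long latencies mainly recover vertical waste; letting several threads issue in the same cycle can recover horizontal waste as well.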

SM Machine Models

SM Machine Models (cont.)

In summary, the results show that simultaneous multithreading surpasses limits on the performance attainable through either single-thread execution or fine-grain multithreading, when run on a wide superscalar. Simplified implementations of SM with limited per-thread capabilities can still attain high instruction throughput. These improvements come without any significant tuning of the architecture for multithreaded execution.

Cache Design
- Cache sharing caused performance degradation in SM processors
- Different cache configurations were simulated to determine the optimum configuration

Cache Design
Two configurations appear to be good choices:
- 64s.64s
- 64p.64s
Important note: cache sizes today are larger than those at the time of the paper (1995).
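The labels above appear to follow the paper's notation, [instruction cache size in KB][private or shared].[data cache size in KB][private or shared], so 64p.64s would mean private instruction caches totaling 64 KB and a shared 64 KB data cache. The slide does not spell this out, so treat that reading as an assumption. `parse_cache_config` is a hypothetical helper that decodes a label under that reading.

```python
import re

# Assumed meaning of the suffix letters in labels such as "64p.64s".
SHARING = {"s": "shared", "p": "private"}


def parse_cache_config(label):
    """Decode a label of the form <I KB><s|p>.<D KB><s|p>."""
    match = re.fullmatch(r"(\d+)([sp])\.(\d+)([sp])", label)
    if match is None:
        raise ValueError(f"unrecognized cache configuration: {label!r}")
    i_kb, i_mode, d_kb, d_mode = match.groups()
    return {
        "icache_kb": int(i_kb),
        "icache": SHARING[i_mode],
        "dcache_kb": int(d_kb),
        "dcache": SHARING[d_mode],
    }


if __name__ == "__main__":
    for label in ("64s.64s", "64p.64s"):
        print(label, parse_cache_config(label))
```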

Simultaneous Multithreading vs Single-Chip Multiprocessing
At the organizational level, the two are similar:
- Multiple register sets
- Multiple functional units
- High issue bandwidth on a single chip

Simultaneous Multithreading vs Single-Chip Multiprocessing (cont.)
The key difference is the way resources are partitioned and scheduled:
- MP statically partitions resources
- SM allows partitions to change every cycle
MP and SM were tested in similar configurations to compare performance.
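The partitioning difference can be illustrated with a toy per-cycle model. It is a sketch under stated assumptions, not the configurations tested in the paper: four threads, eight issue slots in total, and a made-up count of ready instructions per thread. Both function names are hypothetical.

```python
def throughput_static(demand, slots_per_core):
    """MP-style: each thread owns a fixed share of the issue slots."""
    return sum(min(ready, slots_per_core) for ready in demand)


def throughput_dynamic(demand, total_slots):
    """SM-style: all threads draw from one shared pool of issue slots."""
    return min(sum(demand), total_slots)


if __name__ == "__main__":
    # Instructions ready to issue this cycle, one entry per thread
    # (hypothetical values chosen to show an unbalanced cycle).
    demand = [6, 1, 0, 1]
    print(throughput_static(demand, slots_per_core=2))
    print(throughput_dynamic(demand, total_slots=8))
```

When demand is unbalanced, the static split leaves slots idle on the lightly loaded cores while the busy thread is capped at its own share. The shared pool can hand those slots to whichever thread has work in that cycle.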

Simultaneous Multithreading vs Single-Chip Multiprocessing (cont.)

Conclusion

Pentium 4 Product Features
- Available at 1.50, 1.60, 1.70, 1.80, 1.90 and 2 GHz
- Binary compatible with applications running on previous members of the Intel microprocessor line
- Intel® NetBurst™ micro-architecture
- System bus frequency of 400 MHz
- Rapid Execution Engine: Arithmetic Logic Units (ALUs) run at twice the processor core frequency
- Hyper Pipelined Technology
- Advanced Dynamic Execution
  - Very deep out-of-order execution
  - Enhanced branch prediction
- Level 1 Execution Trace Cache stores 12K micro-ops and removes decoder latency from main execution loops
- 8 KB Level 1 data cache
- 256 KB Advanced Transfer Cache (on-die, full-speed Level 2 (L2) cache) with 8-way associativity and Error Correcting Code (ECC)
- 144 new Streaming SIMD Extensions 2 (SSE2) instructions
- Enhanced floating-point and multimedia unit for enhanced video, audio, encryption, and 3D performance
- Power management capabilities
  - System Management mode
  - Multiple low-power states
- Optimized for 32-bit applications running on advanced 32-bit operating systems
- 8-way cache associativity provides improved cache hit rate on load/store operations

AMD Athlon
The AMD Athlon XP processor features a seventh-generation microarchitecture with an integrated, exclusive L2 cache, which supports the growing processor and system bandwidth requirements of emerging software, graphics, I/O, and memory technologies. The high-speed execution core of the AMD Athlon XP processor includes multiple x86 instruction decoders, a dual-ported 128-Kbyte split level-one (L1) cache, an exclusive 256-Kbyte L2 cache, three independent integer pipelines, three address calculation pipelines, and a superscalar, fully pipelined, out-of-order, three-way floating-point engine. The floating-point engine is capable of delivering outstanding performance on numerically complex applications.

AMD Athlon (cont.)
The following features summarize the AMD Athlon XP processor QuantiSpeed architecture:
- An advanced nine-issue, superpipelined, superscalar x86 processor microarchitecture designed for increased instructions per cycle (IPC) and high clock frequencies
- A fully pipelined floating-point unit that executes all x87 (floating-point), MMX, SSE and 3DNow! instructions
- Hardware data pre-fetch that increases and optimizes performance on high-end software applications utilizing high-bandwidth system capability
- Advanced two-level Translation Look-aside Buffer (TLB) structures for both enhanced data and instruction address translation. The AMD Athlon XP processor with QuantiSpeed architecture incorporates three TLB optimizations: the L1 DTLB increases from 32 to 40 entries, the L2 ITLB and L2 DTLB both use an exclusive architecture, and the TLB entries can be speculatively loaded.