EECC722 - Shaaban #1 Lec # 2 Fall 2001 9-10-2001 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996.


EECC722 - Shaaban #1 Lec # 2 Fall 2001 Simultaneous Multithreading (SMT) An evolutionary processor architecture, originally introduced in 1995 by Dean Tullsen at the University of Washington, that aims at reducing resource waste in wide-issue processors. SMT has the potential of greatly enhancing processor computational capabilities by: –Exploiting thread-level parallelism (TLP), simultaneously executing instructions from different threads during the same cycle. –Providing multiple hardware contexts, hardware thread scheduling, and context-switching capability.

EECC722 - Shaaban #2 Lec # 2 Fall 2001 SMT Issues SMT CPU performance gain potential. Modifications to superscalar CPU architecture necessary to support SMT. SMT performance evaluation vs. fine-grain multithreading, superscalar, and chip multiprocessor architectures. Hardware techniques to improve SMT performance: –Optimal level-one cache configuration for SMT. –SMT thread instruction fetch and issue policies. –Instruction recycling (reuse) of decoded instructions. Software techniques: –Compiler optimizations for SMT. –Software-directed register deallocation. –Operating system behavior and optimization. SMT support for fine-grain synchronization. SMT as a viable architecture for network processors.

EECC722 - Shaaban #3 Lec # 2 Fall Microprocessor Architecture Trends

EECC722 - Shaaban #4 Lec # 2 Fall Performance Increase of Workstation-Class Microprocessors Integer SPEC92 Performance

EECC722 - Shaaban #5 Lec # 2 Fall Microprocessor Logic Density Moore’s Law: 2X transistors/Chip Every 1.5 years Alpha 21264: 15 million Pentium Pro: 5.5 million PowerPC 620: 6.9 million Alpha 21164: 9.3 million Sparc Ultra: 5.2 million Moore’s Law

EECC722 - Shaaban #6 Lec # 2 Fall 2001 Increase of Capacity of VLSI Dynamic RAM Chips DRAM chip capacity (in Megabits) by year: roughly 1.55X/yr, or doubling every 1.6 years.
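The doubling-time figure can be checked with a quick calculation, assuming the commonly quoted ~1.55X/year DRAM capacity growth rate (the rate itself is an assumption; only the 1.6-year doubling time appears on the slide):

```python
import math

# At ~1.55X capacity growth per year (assumed rate), the doubling
# time in years is log(2) / log(1.55).
doubling_time = math.log(2) / math.log(1.55)
print(round(doubling_time, 1))  # 1.6
```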

EECC722 - Shaaban #7 Lec # 2 Fall CPU Architecture Evolution: Single Threaded Pipeline Traditional 5-stage pipeline. Increases Throughput: Ideal CPI = 1

EECC722 - Shaaban #8 Lec # 2 Fall CPU Architecture Evolution: Superscalar Architectures Fetch, decode, execute, etc. more than one instruction per cycle (CPI < 1). Limited by instruction-level parallelism (ILP).

EECC722 - Shaaban #9 Lec # 2 Fall Empty or wasted issue slots can be defined as either vertical waste or horizontal waste: –Vertical waste is introduced when the processor issues no instructions in a cycle. –Horizontal waste occurs when not all issue slots can be filled in a cycle. Superscalar Architectures: Issue Slot Waste Classification
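A minimal sketch of this classification (illustrative function and trace, not from the slides): given per-cycle issued-instruction counts on a fixed-width machine, an all-idle cycle contributes vertical waste and a partially filled cycle contributes horizontal waste.

```python
def classify_waste(trace, issue_width):
    """trace: list of per-cycle issued-instruction counts."""
    used = vertical = horizontal = 0
    for issued in trace:
        used += issued
        if issued == 0:
            vertical += issue_width              # whole cycle wasted
        else:
            horizontal += issue_width - issued   # partially filled cycle
    return used, vertical, horizontal

# 8-issue processor over 4 cycles: full, idle, half-full, idle
used, v, h = classify_waste([8, 0, 4, 0], issue_width=8)
print(used, v, h)  # 12 used slots, 16 vertical waste, 4 horizontal waste
```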

EECC722 - Shaaban #10 Lec # 2 Fall 2001 Sources of Unused Issue Cycles in an 8-issue Superscalar Processor. Processor busy represents the utilized issue slots; all others represent wasted issue slots. 61% of the wasted cycles are vertical waste; the remainder are horizontal waste. Workload: SPEC92 benchmark suite. Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

EECC722 - Shaaban #11 Lec # 2 Fall 2001 Superscalar Architectures: All possible causes of wasted issue slots, and the latency-hiding or latency-reducing techniques that can reduce the number of cycles wasted by each cause. Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

EECC722 - Shaaban #12 Lec # 2 Fall 2001 Advanced CPU Architectures: Fine-grain or Traditional Multithreaded Processors Multiple HW contexts (PC, SP, and registers). One context gets the CPU for x cycles at a time. Limited by thread-level parallelism (TLP): –Can reduce some of the vertical issue slot waste. –No reduction in horizontal issue slot waste. Example architectures: HEP, Tera.
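The time-slicing behavior above can be sketched in a few lines (a hypothetical model: x-cycle slices with round-robin rotation among contexts; only vertical waste from the idle context is hidden):

```python
def active_thread(cycle, n_threads, x):
    """Which hardware context owns the pipeline in a given cycle,
    under x-cycle time slicing with round-robin rotation."""
    return (cycle // x) % n_threads

# 2 threads, 4-cycle slices: cycles 0-3 -> thread 0, 4-7 -> thread 1, ...
print([active_thread(c, 2, 4) for c in range(10)])
# [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```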

EECC722 - Shaaban #13 Lec # 2 Fall 2001 Advanced CPU Architectures: VLIW: Intel/HP Explicitly Parallel Instruction Computing (EPIC) Strengths: –Allows for a high level of instruction-level parallelism (ILP). –Takes a lot of the dependency analysis out of HW and places the focus on smart compilers. Weaknesses: –Limited by the instruction-level parallelism (ILP) available in a single thread. –Keeping functional units (FUs) busy in the presence of control hazards. –Static FU scheduling limits performance gains.

EECC722 - Shaaban #14 Lec # 2 Fall 2001 Advanced CPU Architectures: Single Chip Multiprocessor Strengths: –Create a single processor block and duplicate. –Takes a lot of the dependency analysis out of HW and places the focus on smart compilers. Weakness: –Performance limited by individual thread performance (ILP).

EECC722 - Shaaban #15 Lec # 2 Fall Advanced CPU Architectures: Single Chip Multiprocessor

EECC722 - Shaaban #16 Lec # 2 Fall 2001 SMT: Simultaneous Multithreading Multiple hardware contexts running at the same time (HW context: registers, PC, and SP). Avoids both horizontal and vertical waste by having multiple threads keep functional units busy during every cycle. Builds on top of current time-proven advancements in CPU design: superscalar, dynamic scheduling, hardware speculation, dynamic HW branch prediction. Enabling technology: VLSI logic density on the order of hundreds of millions of transistors per chip.

EECC722 - Shaaban #17 Lec # 2 Fall 2001 SMT With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions are hidden: –Reduction of both horizontal and vertical waste, and thus an improved instructions-issued-per-cycle (IPC) rate. Pipelines are separate until the issue stage. Functional units are shared among all contexts during every cycle: –More complicated writeback stage. More threads issuing to functional units results in higher resource utilization.

EECC722 - Shaaban #18 Lec # 2 Fall SMT: Simultaneous Multithreading

EECC722 - Shaaban #19 Lec # 2 Fall 2001 The Power Of SMT Time (processor cycles) Superscalar / Traditional Multithreaded / Simultaneous Multithreading Rows of squares represent instruction issue slots. A box with number x: instruction issued from thread x. Empty box: slot is wasted.
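A rough model of what the issue-slot diagram shows, under assumed per-thread ready counts (all names here are illustrative): the superscalar fills slots only from its single thread, while SMT may fill the remaining slots from other threads in the same cycle.

```python
def fill_slots(ready_per_thread, width, smt):
    """ready_per_thread: {thread_id: ready instruction count this cycle}.
    Returns the thread id occupying each filled issue slot; fewer
    entries than `width` means horizontal waste this cycle."""
    slots = []
    threads = sorted(ready_per_thread) if smt else [0]
    for tid in threads:
        take = min(ready_per_thread[tid], width - len(slots))
        slots.extend([tid] * take)
        if len(slots) == width:
            break
    return slots

ready = {0: 2, 1: 3}
print(fill_slots(ready, 4, smt=False))  # [0, 0] -> 2 slots wasted
print(fill_slots(ready, 4, smt=True))   # [0, 0, 1, 1] -> no waste
```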

EECC722 - Shaaban #20 Lec # 2 Fall 2001 SMT Performance Example

Inst  Code             Description       Functional unit
A     LUI  R5,100      R5 = 100          Int ALU
B     FMUL F1,F2,F3    F1 = F2 x F3      FP ALU
C     ADD  R4,R4,8     R4 = R4 + 8       Int ALU
D     MUL  R3,R4,R5    R3 = R4 x R5      Int mul/div
E     LW   R6,R4       R6 = (R4)         Memory port
F     ADD  R1,R2,R3    R1 = R2 + R3      Int ALU
G     NOT  R7,R7       R7 = !R7          Int ALU
H     FADD F4,F1,F2    F4 = F1 + F2      FP ALU
I     XOR  R8,R1,R7    R8 = R1 XOR R7    Int ALU
J     SUBI R2,R1,4     R2 = R1 - 4       Int ALU
K     SW   ADDR,R2     (ADDR) = R2       Memory port

4 integer ALUs (1-cycle latency)
1 integer multiplier/divider (3-cycle latency)
3 memory ports (2-cycle latency, assume cache hit)
2 FP ALUs (5-cycle latency)
Assume all functional units are fully pipelined.

EECC722 - Shaaban #21 Lec # 2 Fall 2001 SMT Performance Example (continued) 2 additional cycles to complete program 2. Throughput: –Superscalar: 11 inst / 7 cycles = 1.57 IPC –SMT: 22 inst / 9 cycles = 2.44 IPC
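The throughput figures above follow directly from the instruction and cycle counts:

```python
# One thread: 11 instructions retire in 7 cycles.
superscalar_ipc = 11 / 7
# Two threads interleaved: 22 instructions retire in 9 cycles.
smt_ipc = 22 / 9
print(round(superscalar_ipc, 2), round(smt_ipc, 2))  # 1.57 2.44
```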

EECC722 - Shaaban #22 Lec # 2 Fall 2001 Changes to Superscalar CPUs Necessary to Support SMT Multiple program counters and some mechanism by which the fetch unit selects one each cycle (thread instruction fetch policy). A separate return stack for each thread for predicting subroutine return destinations. Per-thread instruction retirement, instruction queue flush, and trap mechanisms. A thread id with each branch target buffer entry to avoid predicting phantom branches. A larger register file, to support logical registers for all threads plus additional registers for register renaming (may require additional pipeline stages). A higher available main memory fetch bandwidth may be required. Improved caches to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality. –e.g., private per-thread vs. shared L1 cache.

EECC722 - Shaaban #23 Lec # 2 Fall 2001 A Base SMT Hardware Architecture. Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.

EECC722 - Shaaban #24 Lec # 2 Fall 2001 Example SMT Vs. Superscalar Pipeline The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines. Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.

EECC722 - Shaaban #25 Lec # 2 Fall 2001 SMT Performance Comparison Instruction throughput from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads: the multiprogramming workload compares superscalar, traditional multithreading, and SMT as the number of threads grows; the parallel workload adds 2- and 4-processor multiprocessors (MP2, MP4) to the comparison.

EECC722 - Shaaban #26 Lec # 2 Fall 2001 Simultaneous Vs. Fine-Grain Multithreading Performance Instruction throughput as a function of the number of threads. (a)-(c) show the throughput by thread priority for particular models, and (d) shows the total throughput for all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest-priority thread to the total throughput. Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

EECC722 - Shaaban #27 Lec # 2 Fall 2001 Simultaneous Multithreading Vs. Single-Chip Multiprocessing Results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons. The multiprocessor always has one functional unit of each type per processor. In most cases the SM processor has the same total number of each FU type as the MP. Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

EECC722 - Shaaban #28 Lec # 2 Fall 2001 Impact of Level 1 Cache Sharing on SMT Performance Results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the 64s.64p configuration. The caches are specified as: [total I-cache size in KB][private or shared].[D-cache size][private or shared]. For instance, 64p.64s has eight private 8 KB I-caches and a shared 64 KB data cache. Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

EECC722 - Shaaban #29 Lec # 2 Fall 2001 SMT Thread Instruction Fetch Scheduling Policies Round Robin: –Instructions from Thread 1, then Thread 2, then Thread 3, etc. (e.g., RR 1.8: each cycle one thread fetches up to eight instructions; RR 2.4: each cycle two threads fetch up to four instructions each). BR-Count: –Give highest priority to those threads that are least likely to be on a wrong path, by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches. MISS-Count: –Give priority to those threads that have the fewest outstanding data cache misses. I-Count: –Highest priority assigned to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues). IQPOSN: –Give lowest priority to those threads with instructions closest to the head of either the integer or floating-point instruction queues (the oldest instruction is at the head of the queue).
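The I-Count policy above amounts to a one-line selection rule; a minimal sketch (illustrative names, one fetching thread per cycle for simplicity):

```python
def icount_pick(pre_issue_counts):
    """pre_issue_counts: {thread_id: instructions currently in the
    static pipeline stages (decode, rename, instruction queues)}.
    The thread with the fewest such instructions fetches this cycle;
    ties broken by thread id for determinism."""
    return min(pre_issue_counts, key=lambda t: (pre_issue_counts[t], t))

print(icount_pick({0: 12, 1: 3, 2: 7}))  # thread 1 fetches this cycle
```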

EECC722 - Shaaban #30 Lec # 2 Fall Instruction throughput & Thread Fetch Policy

EECC722 - Shaaban #31 Lec # 2 Fall 2001 Possible SMT Instruction Issue Policies OLDEST FIRST: Issue the oldest instructions (those deepest into the instruction queue). OPT LAST and SPEC LAST: Issue optimistic and speculative instructions after all others have been issued. BRANCH FIRST: Issue branches as early as possible in order to identify mispredicted branches quickly. Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
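The OLDEST FIRST policy can be sketched as a sort by age over the ready instructions (an illustrative representation; real hardware selects by queue position rather than running an explicit sort):

```python
def oldest_first(ready, issue_width):
    """ready: list of (seq_no, thread_id) tuples for ready instructions,
    where a lower sequence number means an older instruction (deeper in
    the instruction queue). Issue the oldest up to the issue width."""
    return sorted(ready)[:issue_width]

ready = [(7, 1), (2, 0), (5, 1), (9, 0)]
print(oldest_first(ready, 2))  # [(2, 0), (5, 1)]
```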

EECC722 - Shaaban #32 Lec # 2 Fall 2001 RIT CE Simulator Execution-driven performance simulator. Derived from the SimpleScalar tool set. Simulates caches, branch prediction, and five pipeline stages. Flexible: –A configuration file controls cache size, buffer sizes, and the number of functional units. A cross compiler is used to generate SimpleScalar assembly language. Binary utilities, a compiler, and an assembler are available. The standard C library (libc) has been ported.

EECC722 - Shaaban #33 Lec # 2 Fall Simulator Memory Address Space

EECC722 - Shaaban #34 Lec # 2 Fall 2001 Alternate Functional Unit Configurations New functional unit configurations were attempted (by adding one of each type of FU): –+1 integer multiplier/divider: +2.8% IPC and issue rate, -74% times with no FU available. The simulator is very flexible (only one line in the configuration file required change).

EECC722 - Shaaban #35 Lec # 2 Fall 2001 Sim-SMT Simulator Limitations Does not keep precise exceptions. System call instructions are not tracked. Limited memory space: –Four test programs' memory spaces run in one simulator memory space. –Easy to run out of stack space.

EECC722 - Shaaban #36 Lec # 2 Fall 2001 Simulation Runs & Results Test programs used: –Newton interpolation. –Matrix solver using LU decomposition. –Integer test program. –FP test program. Simulations of a single program with 1, 2, and 4 threads. System simulations involve a combination of all programs simultaneously; several different combinations were run. From simulation results: –Performance increase: the biggest increase occurs when changing from one to two threads. –Higher issue rate and functional unit utilization.

EECC722 - Shaaban #37 Lec # 2 Fall Performance (IPC) Simulation Results:

EECC722 - Shaaban #38 Lec # 2 Fall Simulation Results: Simulation Time

EECC722 - Shaaban #39 Lec # 2 Fall Instruction Issue Rate Simulation Results:

EECC722 - Shaaban #40 Lec # 2 Fall Performance Vs. Issue BW Simulation Results:

EECC722 - Shaaban #41 Lec # 2 Fall Functional Unit Utilization Simulation Results:

EECC722 - Shaaban #42 Lec # 2 Fall No Functional Unit Available Simulation Results:

EECC722 - Shaaban #43 Lec # 2 Fall Horizontal Waste Rate Simulation Results:

EECC722 - Shaaban #44 Lec # 2 Fall Vertical Waste Rate Simulation Results:

EECC722 - Shaaban #45 Lec # 2 Fall 2001 SMT: Simultaneous Multithreading Strengths: –Overcomes the limitations imposed by low single-thread instruction-level parallelism. –Multiple running threads hide individual control hazards (branch mispredictions). Weaknesses: –Additional stress placed on the memory hierarchy. –Control unit complexity. –Sizing of resources (cache, branch prediction, etc.). –Accessing registers (32 integer + 32 FP for each HW context): some designs devote two clock cycles to both register reads and register writes.

EECC722 - Shaaban #46 Lec # 2 Fall 2001 SMT: Simultaneous Multithreading Kernel Code Many, if not all, benchmarks are based upon limited interaction with kernel code. How can the kernel overhead (context switching, process management, etc.) be minimized? –CHAOS (Context Hardware Accelerated Operating System). Introduce a lightweight dedicated kernel context to handle process management: –When there are 4 contexts, there is a good chance that one of them will continue to run, so why take an (expensive) chance on swapping it out when it will be brought right back in by the swapper (process management)?

EECC722 - Shaaban #47 Lec # 2 Fall 2001 SMT & Technology The SMT architecture has not been implemented in any existing commercial microprocessor yet (first 4-thread SMT CPU: Alpha EV8, ~2001). Current technology has the potential for 4-8 simultaneous threads: –Based on transistor count and design complexity.

EECC722 - Shaaban #48 Lec # 2 Fall 2001 RIT-CE SMT Project Goals Investigate performance gains from exploiting thread-level parallelism (TLP) in addition to current instruction-level parallelism (ILP) in processor design. Design and simulate an architecture incorporating Simultaneous Multithreading (SMT). Study operating system and compiler modifications needed to support SMT processor architectures. Define a standard interface for efficient SMT-processor/OS kernel interaction. Modify an existing OS kernel (Linux?) to take advantage of hardware multithreading capabilities. Long term: VLSI implementation of an SMT prototype.

EECC722 - Shaaban #49 Lec # 2 Fall Current Project Status Architecture/OS interface definition. Study of design alternatives and impact on performance. SMT Simulator Development: –System call development, kernel support, and compiler/assembler changes. Development of code (programs and OS kernel) is key to getting results.

EECC722 - Shaaban #50 Lec # 2 Fall Short-Term Project Chart

EECC722 - Shaaban #51 Lec # 2 Fall 2001 Current/Future Project Goals SMT simulator completion, refinement, and further testing. Development of an SMT-capable OS kernel. Extensive performance studies with various workloads using the simulator/OS/compiler: –Suitability for fine-grained parallel applications? –Effect on multimedia applications? Architectural changes based on benchmarks. Investigation of cache impact on SMT performance. Investigation of an in-order SMT processor (C or VHDL model). MOSIS Tiny Chip (partial/full) implementation. Investigate the suitability of SMT processors as building blocks for MPPs.