SYNAR Systems Networking and Architecture Group CMPT 886: Architecture of Niagara I Processor Dr. Alexandra Fedorova School of Computing Science SFU.

Presentation transcript:


Overview
– 8 cores
– 4 threads per core
– 3MB L2 cache (4 banks), 12-way, write-back
– One FPU per chip
[Figure: chip block diagram, © David Yen]

Memory Latency Limits Performance
[Figure © David Yen]

Hardware Multithreading
– While one thread is blocked on memory, the others continue computing, which raises the number of instructions completed per cycle
[Figure © David Yen]
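The effect described on this slide can be illustrated with a toy cycle-level model. This is only a sketch under stated assumptions, not a model of the real chip: the function name `simulate_ipc`, the burst length of 10 instructions, and the 30-cycle memory stall are all hypothetical values chosen for illustration. Each thread issues a burst of instructions, then blocks on memory; a single-issue core picks the least recently issued ready thread each cycle.

```python
def simulate_ipc(num_threads, run_cycles, stall_cycles, total_cycles=10_000):
    """Toy model: instructions per cycle of one single-issue core.

    Each thread issues `run_cycles` instructions, then blocks on memory
    for `stall_cycles` cycles. All parameters are illustrative assumptions.
    """
    ready_at = [0] * num_threads            # first cycle each thread may issue again
    left = [run_cycles] * num_threads       # instructions left before the next miss
    last_issue = [-1] * num_threads         # cycle of each thread's most recent issue
    issued = 0

    for cycle in range(total_cycles):
        ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
        if not ready:
            continue                        # every thread is blocked: wasted cycle
        t = min(ready, key=lambda i: last_issue[i])  # least recently issued thread
        last_issue[t] = cycle
        issued += 1
        left[t] -= 1
        if left[t] == 0:                    # burst finished: thread misses in cache
            ready_at[t] = cycle + 1 + stall_cycles
            left[t] = run_cycles

    return issued / total_cycles


if __name__ == "__main__":
    print("1 thread :", simulate_ipc(1, run_cycles=10, stall_cycles=30))
    print("4 threads:", simulate_ipc(4, run_cycles=10, stall_cycles=30))
```

With a single thread the core idles during every stall; with four threads some of those idle cycles are filled by other threads, so the measured IPC is higher. The exact numbers depend entirely on the assumed burst and stall lengths.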

Eight Multithreaded Cores
[Figure © David Yen]

Niagara Chip
[Figure © Poonacha Kongetira]

Niagara Core
– 4 threads per core
– Multithreading increases core area by 20%
– 6-stage, single-issue, in-order pipeline
– IFU: instruction fetch unit
– LSU: load/store unit
– EXU: execution unit
– L1 D-cache: 4-way, 8KB, 16-byte line
– L1 I-cache: 4-way, 16KB, 32-byte line
– Why a simple in-order core? Why small caches?
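To make the L1 parameters on this slide concrete, the number of sets and the address-bit split follow directly from size, associativity, and line size: sets = size / (ways × line size). The helper below is a small sketch; the name `cache_geometry` is hypothetical, and it assumes power-of-two line sizes and set counts.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes line_bytes and the resulting set count are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    return sets, offset_bits, index_bits


if __name__ == "__main__":
    # Parameters taken from the slide.
    print("L1 D-cache:", cache_geometry(8 * 1024, ways=4, line_bytes=16))
    print("L1 I-cache:", cache_geometry(16 * 1024, ways=4, line_bytes=32))
```

Both caches come out at 128 sets: the D-cache uses 4 offset bits and 7 index bits, the I-cache 5 offset bits and 7 index bits.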

Switching Threads
– Switch between available threads every cycle, giving priority to the least recently executed thread
– Fine-grained multithreading
– Threads become unavailable due to:
  – Long-latency ops such as loads, branches, mul, div
  – Pipeline stalls such as cache misses, traps, and resource conflicts
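The selection policy on this slide (each cycle, pick the least recently executed thread among those that are available) can be sketched in a few lines. This is an illustrative software model, not the hardware implementation; the class name `ThreadSelect` and its interface are assumptions made for the example.

```python
from typing import Optional, Set


class ThreadSelect:
    """Toy model of least-recently-executed thread selection for one core."""

    def __init__(self, num_threads: int = 4) -> None:
        # Thread ids ordered from least recently executed (front)
        # to most recently executed (back).
        self.order = list(range(num_threads))

    def pick(self, available: Set[int]) -> Optional[int]:
        """Return the thread to issue this cycle, or None if all are unavailable."""
        for tid in self.order:
            if tid in available:
                # The chosen thread becomes the most recently executed one.
                self.order.remove(tid)
                self.order.append(tid)
                return tid
        return None


if __name__ == "__main__":
    sel = ThreadSelect(4)
    print([sel.pick({0, 1, 2, 3}) for _ in range(5)])  # all threads available
    print([sel.pick({0, 2, 3}) for _ in range(3)])     # thread 1 stalled, e.g. on a load
    print(sel.pick({0, 1, 2, 3}))                      # thread 1 is available again
```

When every thread is available the policy degenerates to round-robin. A thread that was unavailable (for example, waiting on a load) drifts to the front of the order, so it is chosen first once it becomes available again.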