National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Evaluating the Tera MTA Allan Snavely, Wayne Pfeiffer et al.


Evaluating the Tera MTA

Architectural features:
- Massive hardware multithreading
- Flat, randomized memory (no data cache)
- Support for automatic parallelization
- Single programming model for one or many processors
- Designed to scale

Goals of the architecture:
- Cover memory and other operational latencies
- Ease the burden on the programmer
- Exploit multiple levels of parallelism
- Scale

Goals of the SDSC evaluation:
- Funded by NSF to evaluate the MTA for the purposes of scientific computing
- Wayne Pfeiffer and Larry Carter, PIs
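The first goal above, covering memory latency, depends on having enough concurrent hardware streams in flight. A back-of-the-envelope sketch of that relationship (a Little's-law-style estimate; the latency figure below is illustrative, not a measured MTA value):

```python
def streams_to_hide_latency(memory_latency_cycles, issue_slots_per_cycle=1):
    """Estimate how many ready instruction streams are needed to keep
    the pipeline busy: roughly one stream per cycle of memory latency,
    times the issue width, since each stream can have a reference
    outstanding while others issue."""
    return memory_latency_cycles * issue_slots_per_cycle

# Illustrative numbers only: if a memory reference takes ~150 cycles and
# the processor issues one instruction per cycle, on the order of 150
# concurrent streams are needed to fully hide the latency.
print(streams_to_hide_latency(150))  # -> 150
```

This is why the MTA interleaves many hardware streams per processor rather than relying on a data cache: with enough streams, every cycle can issue useful work even though each individual stream is stalled most of the time.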

Evaluating the Tera MTA: Executive Summary

A few kernels and applications have been found for which the MTA achieves higher performance than other SDSC machines. Such codes share these characteristics:
- They do not vectorize well.
- They are difficult to parallelize on conventional machines.
- They contain substantial parallelism.

Examples are codes that involve:
- Integer sorting.
- Dynamic, irregular meshes, or dynamic, non-uniform workloads within a regular mesh.
- Parallel operations (such as a general gather/scatter) with poor data locality.

Single-processor performance of the multithreaded Tera MTA (260-MHz clock) is typically lower than that of the vector Cray T90 (440-MHz clock):
- The T90 is faster than the MTA processor for 4 of 7 kernels and 2 of 3 applications compared.
- The MTA processor is appreciably faster for one kernel, which does an integer sort.

Single-processor performance of the MTA is typically higher than that of cache-based workstation processors:
- An MTA processor is substantially faster than a workstation processor for 8 of 9 applications compared, indicating the effectiveness of multithreading as compared to cache utilization.

Scalability on the MTA is good up to 8 processors in many instances, and better for kernels than for larger applications:
- Very good scalability (parallel efficiency between 0.80 and 1.00 on 8 processors) has been achieved for 6 of 7 kernels and 5 of 11 applications studied.
- Compared to kernels, the applications have more sections of code that must be tuned to achieve good performance.

Processor for processor, the MTA is faster than the IBM Blue Horizon (220-MHz clock):
- Scaling sometimes favors one machine and sometimes the other.
- Codes that put pressure on the instruction cache suffer degraded scaling on the MTA.
- Recall that the IBM has 1152 processors, an advantage for large problems that scale well.
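The parallel-efficiency figure quoted above is the standard ratio of speedup to processor count. A minimal sketch of the arithmetic (the 80 s / 12.5 s timings are invented for illustration, not measured results):

```python
def speedup(t1, tp):
    """Speedup of a p-processor run relative to a one-processor run."""
    return t1 / tp

def parallel_efficiency(t1, tp, p):
    """Efficiency = speedup / p; 1.0 is ideal linear scaling."""
    return speedup(t1, tp) / p

# Hypothetical example: a kernel taking 80 s on one processor and 12.5 s
# on 8 processors has speedup 6.4 and efficiency 0.80 -- the lower bound
# of the "very good scalability" range cited above.
print(parallel_efficiency(80.0, 12.5, 8))  # -> 0.8
```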

MTA vs. IBM Blue Horizon

MTA vs. T90

Scalability

Symbiosis and Congestion Pricing on the MTA

Allan Snavely's Ph.D. thesis (Fall 2000); advisor Larry Carter.

- Symbiosis: a term from biology meaning 'the living together of distinct organisms in close proximity.' We adapt the term to refer to the increase in throughput and job turnaround that can occur when jobs are coscheduled on a multithreaded machine.
- Congestion pricing: an area of economics dealing with how to price a 'congestion externality' so that users take cognizance of the impact their usage has on others.
- Key observation: resource sharing among coscheduled jobs on a multithreaded machine such as the MTA or an SMT processor is very intimate.
- Thesis: job schedulers that take symbiosis into account, when combined with principles of congestion pricing, deliver significant throughput and turnaround gains and maximize global user utility when deployed on multithreaded machines.
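One common way to quantify symbiosis in the coscheduling literature is to compare each job's throughput when coscheduled against its throughput when run alone. A sketch of that metric; the function name and the IPC figures are illustrative assumptions, not values from the thesis:

```python
def symbiotic_speedup(standalone_ipc, coscheduled_ipc):
    """Sum over jobs of (IPC when coscheduled) / (IPC when run alone).
    A value above 1.0 means the coschedule completes more total work
    per cycle than time-slicing the jobs one at a time would."""
    return sum(co / alone
               for alone, co in zip(standalone_ipc, coscheduled_ipc))

# Hypothetical pair of jobs: each retains 70% of its standalone
# throughput when coscheduled, so together they deliver a symbiotic
# speedup of 1.4 over running them back to back.
print(round(symbiotic_speedup([1.0, 0.8], [0.7, 0.56]), 3))  # -> 1.4
```

A symbiosis-aware scheduler would prefer job pairings that maximize this quantity, while congestion pricing charges each job for the throughput its presence costs the others.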