Performance Analysis of the Compaq ES40--An Overview Paper evaluates Compaq’s ES40 system, based on the Alpha 21264 Only concern is performance: no power.

Slides:



Advertisements
Similar presentations
1 Lecture 2: Metrics to Evaluate Performance Topics: Benchmark suites, Performance equation, Summarizing performance with AM, GM, HM Video 1: Using AM.
Advertisements

Branch prediction Titov Alexander MDSP November, 2009.
Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.
PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.
Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.
A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong.
CS 7810 Lecture 7 Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching E. Rotenberg, S. Bennett, J.E. Smith Proceedings of MICRO-29.
CS 7810 Lecture 23 Maximizing CMP Throughput with Mediocre Cores J. Davis, J. Laudon, K. Olukotun Proceedings of PACT-14 September 2005.
Glenn Reinman, Brad Calder, Department of Computer Science and Engineering, University of California San Diego and Todd Austin Department of Electrical.
WCED: June 7, 2003 Matt Ramsay, Chris Feucht, & Mikko Lipasti University of Wisconsin-MadisonSlide 1 of 26 Exploring Efficient SMT Branch Predictor Design.
1 Improving Branch Prediction by Dynamic Dataflow-based Identification of Correlation Branches from a Larger Global History CSE 340 Project Presentation.
1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
Csci4203/ece43631 Review Quiz. 1)It is less expensive 2)It is usually faster 3)Its average CPI is smaller 4)It allows a faster clock rate 5)It has a simpler.
1 Introduction Background: CS 3810 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design Text for CS/EE 6810: Hennessy and.
1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
EENG449b/Savvides Lec /25/05 March 24, 2005 Prof. Andreas Savvides Spring g449b EENG 449bG/CPSC 439bG.
1 Introduction Background: CS 3810 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design Text for CS/EE 6810: Hennessy and.
CS 7810 Lecture 9 Effective Hardware-Based Data Prefetching for High-Performance Processors T-F. Chen and J-L. Baer IEEE Transactions on Computers, 44(5)
1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.
Lecture 2: Technology Trends and Performance Evaluation Performance definition, benchmark, summarizing performance, Amdahl’s law, and CPI.
Performance & Benchmarking. What Matters? Which airplane has best performance:
DBMSs On A Modern Processor: Where Does Time Go? by A. Ailamaki, D.J. DeWitt, M.D. Hill, and D. Wood University of Wisconsin-Madison Computer Science Dept.
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
CSC 4250 Computer Architectures December 5, 2006 Chapter 5. Memory Hierarchy.
Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.
BİL 221 Bilgisayar Yapısı Lab. – 1: Benchmarking.
Memory/Storage Architecture Lab Computer Architecture Performance.
Recap Technology trends Cost/performance Measuring and Reporting Performance What does it mean to say “computer X is faster than computer Y”? E.g. Machine.
Alpha 21364: A Scalable Single-chip SMP
Statistical Simulation of Superscalar Architectures using Commercial Workloads Lieven Eeckhout and Koen De Bosschere Dept. of Electronics and Information.
Cosc 2150: Computer Organization Chapter 11: Performance Measurement.
[Tim Shattuck, 2006][1] Performance / Watt: The New Server Focus Improving Performance / Watt For Modern Processors Tim Shattuck April 19, 2006 From the.
CDA 3101 Fall 2013 Introduction to Computer Organization Computer Performance 28 August 2013.
Hardware Support for Compiler Speculation
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
ACMSE’04, ALDepartment of Electrical and Computer Engineering - UAH Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.
Abdullah Aldahami ( ) March 23, Introduction 2. Background 3. Simulation Techniques a.Experimental Settings b.Model Description c.Methodology.
1. 2 Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft.
From lecture slides for Computer Organization and Architecture: Designing for Performance, Eighth Edition, Prentice Hall, 2010 CS 211: Computer Architecture.
Computer Architecture
1 Seoul National University Performance. 2 Performance Example Seoul National University Sonata Boeing 727 Speed 100 km/h 1000km/h Seoul to Pusan 10 hours.
Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.
Spring 2003CSE P5481 Advanced Caching Techniques Approaches to improving memory system performance eliminate memory operations decrease the number of misses.
MEMORY SYSTEM CHARACTERIZATION OF COMMERCIAL WORKLOADS Authors: Luiz André Barroso (Google, DEC; worked on Piranha) Kourosh Gharachorloo (Compaq, DEC;
Chapter 5 Memory III CSE 820. Michigan State University Computer Science and Engineering Miss Rate Reduction (cont’d)
Performance Performance
CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara) Dimitris Kaseridis and Lizy K. John The University of Texas at Austin Laboratory for Computer.
1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.
Fetch Directed Prefetching - a Study
Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.
Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida.
4. Performance 4.1 Introduction 4.2 CPU Performance and Its Factors
1 Lecture: Metrics to Evaluate Performance Topics: Benchmark suites, Performance equation, Summarizing performance with AM, GM, HM  Video 1: Using AM.
Advanced Topics: Prefetching ECE 454 Computer Systems Programming Topics: UG Machine Architecture Memory Hierarchy of Multi-Core Architecture Software.
CMSC 611: Advanced Computer Architecture Performance & Benchmarks Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some.
On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.
CS203 – Advanced Computer Architecture Performance Evaluation.
VU-Advanced Computer Architecture Lecture 1-Introduction 1 Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 1.
ALPHA 21164PC. Alpha 21164PC High-performance alternative to a Windows NT Personal Computer.
Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.
CS203 – Advanced Computer Architecture
Computer Sciences Department University of Wisconsin-Madison
Lecture 2: Performance Evaluation
Architecture & Organization 1
Architecture & Organization 1
Presented by: Eric Carty-Fickes
Presentation transcript:

Performance Analysis of the Compaq ES40--An Overview Paper evaluates Compaq’s ES40 system, based on the Alpha Only concern is performance: no power or cost issues Compares the ES40 and systems from other vendors in terms of speed, throughput, and sustained memory bandwidth Primary focus of paper is on comparison of ES40 against previous-generation AlphaServer 4100 to explore the improvements in the new system –Larger L1 cache –Tournament branch predictor –Out-of-order execution –Larger number of execution pipelines –More advanced memory system

Methodology On-line transaction processing and technical/scientific workloads –TPC-C: transaction processing benchmark Measures throughput (transactions per minute) Exercises processor power, memory interface, and I/O –SPEC95 benchmark suite SPECint95/SPECfp95 measure speed; SPECint_rate95/SPECfp_rate95 measure throughput Exercises processor, memory hierarchy, and compiler performance (to be included in the suite, can not stress I/O or include networking or graphics) Measures geometric mean of normalized execution times, which can present a skewed perspective –Linpack: solves linear equations and least-squares problems Paper uses benchmark once to get a MFLOPS rating –STREAM: synthetic benchmark program that includes four vector kernels Measures sustainable memory bandwidth (in MB/s) ProfileMe and DCPI tools used to profile data Paper does not specify compiler used

I-Cache, Branch Prediction, and IPC Performance Instruction cache performance –21164 has 8 KB direct-mapped L1 cache, while has 64 KB two-way associative L1 cache –Most cache misses in appear to be compulsory misses for the given benchmarks; difficult to tell how the cache would handle larger programs Branch prediction performance –21164 uses a simple two-bit branch predictor; uses a tournament branch predictor that tracks local and global history –21264 typically has improved branch prediction over the 21164, but it does not do as well with transaction processing branch prediction Comparison of Instructions Per Cycle –Out-of-order execution, up to 80 in-flight instructions, additional functional units, and other advances lead to IPC improvements in the over the –IPC comparison could be skewed: IPC in measures number of retired instructions, which favors the 21164, but the ES40 results were obtained with code that targeted 21264, which favors that system

Instruction Retire and Memory System Performance Non-retired instructions –A large percentage of issued instructions in the are killed due to speculation –Little performance down-side due to speculative execution Memory bandwidth and latency –STREAM Copy benchmark shows big improvement in memory bandwidth in the over the 21164, as well as better scaling, due to several factors Larger, more associative L1 cache Crossbar interconnect New WH64 prefetch –21264 has impressive advantage (4x) over for “independent-load” memory latency, but only 40% improvement for “dependent-load” memory latency. Memory replay traps –21264 has significantly fewer memory traps than 21164, due to more highly- parallel memory system –Further reduction in number of memory traps can be obtained through compiler modification

Conclusions The based Compaq ES40 has many architectural features that improve performance over the based AlphaServer 4100 Authors claim that the ES40 has performance advantage over other vendor machines based on results of SPEC95 and STREAM benchmarks It is acknowledged in the paper that further study is required using a greater variety of applications--SPEC95 results are generally impressive, but TPC results show that there is room for improvement Paper does a good job of examining the architectural features of the ES40 and their effect on performance Other areas of interest--power consumption, area, cost