ACMSE’04, AL | Department of Electrical and Computer Engineering - UAH
Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++


Swathi Tanjore Gurumani, Aleksandar Milenkovic
Electrical and Computer Engineering Department
University of Alabama in Huntsville

Outline
• Objective
• Background
• Problem Overview
• Performance Evaluation - Overview
• Experimental Setup
• Results
• Conclusion and Future Research

Problem Objective
• Prove and stress the importance of designing architecture-aware compilers

Background - Application Performance
• Advancement in processor technology
  - Deep pipelining
  - Multi-level cache hierarchy
  - Improved branch predictors
  - Out-of-order execution engine
  - Advanced floating point, multimedia units
• Compilers
  - Optimization levels and switches
• Compilers should keep up with processor technology

Architecture-aware Compilers
• Compiler/hardware interaction can maximize application performance by
  - Exploiting advances in processor technology
  - Generating target-specific optimal code
• Path length reduction
• Efficient instruction selection
• Pipeline scheduling
• Instruction-level parallelism
• Memory penalty minimization

Performance Evaluation
• A systematic process of data collection and analysis to determine and evaluate the performance of a system
[Diagram: Benchmarks, Compile, Exe, Performance Metrics]
• Benchmark: a program that performs a strictly defined set of operations (a workload) and returns some form of result (a metric) describing how the tested computer performed.

Performance Evaluation - Previous Works
• Study underlying architecture and characterize workloads
  - Evaluation of Pentium Pro using SPEC 2000
  - Evaluation of Pentium II using multimedia applications
• Processor-centric optimization
  - Xeon vs. Pentium III
  - Pentium III vs. Pentium IV
• Compilers and optimization
  - Branch optimizations by different compilers

Problem Overview
• Objective
  - Prove and stress the importance of architecture-aware compilers
• How?
  - Compile the benchmarks using different compilers
  - Use the same optimization switches
  - Execute the binaries under a performance analyzer
  - Analyze and compare the performance metrics collected
• Same OS and hardware features, so any difference in metrics is due only to the compiler used
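The compare step above can be sketched in a few lines. This is only an illustration, not the authors' tooling: the function name `compare_metrics` and the event keys are hypothetical, and the counts are made-up placeholders rather than measured data.

```python
def compare_metrics(vc, ic):
    """Return the IC++/VC++ ratio for every event reported by both runs."""
    # Both binaries must be profiled for the same events, otherwise the
    # comparison is not like-for-like.
    if vc.keys() != ic.keys():
        raise ValueError("both runs must report the same events")
    return {event: ic[event] / vc[event] for event in vc}

# Hypothetical event counts for one benchmark (placeholders, not results).
vc_run = {"clockticks": 200.0, "instructions_retired": 100.0}
ic_run = {"clockticks": 180.0, "instructions_retired": 80.0}

ratios = compare_metrics(vc_run, ic_run)
```

A ratio below 1.0 means the IC++ binary needed fewer of that event than the VC++ binary on the same input.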

Experimental Setup
[Diagram: SPEC CPU2000 compiled with IC++ and with VC++; each Exe is run under VTune to collect Performance Metrics]
• Processor: Pentium IV
• Operating System: Windows 2000
• Optimization Level: /O2
• Input: Reference set from SPEC

SPEC CPU2000
• Represents real, computation-intensive user applications
• Can measure the performance of the processor, memory, and compiler
• Does not stress I/O devices, networking, or the OS
• Used CINT2000 and CFP2000

Name                Description
164.gzip (INT)      Data Compression written in C
176.gcc (INT)       C Programming Language Compiler
177.mesa (FP)       3-D Graphics Library written in C
181.mcf (INT)       Combinatorial Optimization written in C
186.crafty (INT)    Chess - Game Playing written in C
197.parser (INT)    Word Processing written in C
252.eon (INT)       Computer Visualization written in C++
perlbmk (INT)       PERL Programming Language written in C
254.gap (INT)       Group Theory, Interpreter written in C
255.vortex (INT)    Object Oriented Database written in C

VTune Performance Analyzer
• Samples multiple events simultaneously and displays them in real time using counter monitors and event-based sampling
• Supports time-based and event-based sampling (EBS)
  - Takes advantage of the Pentium IV’s EBS feature
• Has low intrusion
  - The samples collected give a closer representation of the application’s actual performance
• Events collected
  - Clockticks, instructions retired, loads retired, stores retired, branches retired, 1st-level cache misses, and mispredicted branches
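Raw event counts like these are usually turned into ratios before two binaries are compared. The sketch below shows a few standard derived metrics. The dictionary keys are hypothetical names, the counts are placeholders, and treating "other instructions" as whatever remains after loads, stores, and branches is an assumption, not something the slides define.

```python
def derived_metrics(c):
    """Compute common ratios from raw event counts (keys are hypothetical)."""
    instr = c["instructions_retired"]
    return {
        # Cycles per instruction: lower is better.
        "cpi": c["clockticks"] / instr,
        # Fraction of retired branches that were mispredicted.
        "mispredict_rate": c["mispredicted_branches"] / c["branches_retired"],
        # Cache misses per 1000 retired instructions.
        "misses_per_kilo_instr": 1000.0 * c["cache_misses"] / instr,
        # Assumption: "other" = everything that is not a load, store, or branch.
        "other_instructions": instr - c["loads_retired"]
                                    - c["stores_retired"]
                                    - c["branches_retired"],
    }

# Placeholder counts, not measured data.
counts = {
    "clockticks": 3000, "instructions_retired": 2000,
    "loads_retired": 600, "stores_retired": 200,
    "branches_retired": 400, "mispredicted_branches": 20,
    "cache_misses": 10,
}

metrics = derived_metrics(counts)
```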

Compiler Optimizations
• Both compilers were used with the /O2 option
• The option invokes the same switches and serves the same function in both
• Microsoft VC++ has special switches to target Pentium (/G5) and Pentium Pro (/G6)
• The Intel C++ compiler optimizes performance for applications running on Intel architecture-based computers

Option    Effect
/Od       Disable optimization
/O1       Minimize size
/O2       Maximize speed

• Performance gains from using IC++ are the result of
  - profile-guided optimization
  - pre-fetch instruction
  - support for Streaming SIMD Extensions (SSE)
  - data prefetching
  - inter-procedural optimization

Comparison of Clock Ticks
• On average, 10% performance gain with IC++
• The performance gain is more pronounced for the 3D graphics library and the computer visualization application
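One way to arrive at an average gain from clock tick counts is sketched below. The slides do not say how the 10% figure was averaged, so the arithmetic mean of per-benchmark reductions is an assumption here, and the benchmark names and counts are placeholders.

```python
def clocktick_gain(vc_ticks, ic_ticks):
    """Fractional reduction in clock ticks when moving from VC++ to IC++."""
    return 1.0 - ic_ticks / vc_ticks

# Hypothetical (VC++, IC++) clock tick counts; not the measured values.
samples = {"bench_a": (100.0, 90.0), "bench_b": (200.0, 170.0)}

gains = {name: clocktick_gain(vc, ic) for name, (vc, ic) in samples.items()}

# Assumption: plain arithmetic mean over benchmarks.
average_gain = sum(gains.values()) / len(gains)
```

With these placeholder counts the per-benchmark gains are 10% and 15%, so the mean is 12.5%. A geometric mean of the tick ratios would be an equally reasonable choice and gives a slightly different number.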

Comparison of Binaries

Code Size (in Bytes)
Benchmark    MSVC++       IC++
gzip         69,632       77,…
gcc          1,089,536    1,314,…
mesa         442,368      610,…
mcf          49,152       53,…
crafty       241,664      258,…
parser       118,784      131,…
eon          405,504      413,…
perlbmk      516,096      651,…
gap          356,352      413,…
vortex       417,792      454,656

• VC++ produced smaller binaries
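The size difference can be expressed as a percentage. The sketch below uses the vortex row, the only one whose IC++ size is complete in this transcript. The function name `size_growth` is illustrative.

```python
def size_growth(vc_bytes, ic_bytes):
    """Fractional growth of the IC++ binary relative to the VC++ binary."""
    return ic_bytes / vc_bytes - 1.0

# vortex code sizes in bytes, taken from the table above.
vortex_vc = 417_792
vortex_ic = 454_656

growth = size_growth(vortex_vc, vortex_ic)
print(f"vortex: IC++ binary is {growth:.1%} larger")
```

For vortex this works out to roughly 8.8%.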

Comparison of Instruction Count
• The 3D graphics and computer visualization applications show a much greater reduction in instruction count than the others

Comparison of Loads

Comparison of Stores

Comparison of Branches

Comparison of Other Instructions

Comparison of Cache Misses

Conclusion & Future Research
• Execution characteristics of the CPU2000 benchmarks were presented for VC++ and IC++
• IC++ performed better than VC++ for all considered applications; the gain was more pronounced for graphics applications
• The distribution of loads, stores, and branches was the same; only the absolute numbers differed
• No difference in branch prediction and memory references
• Use: identifying the strengths and weaknesses of compilers
• Future directions
  - Different optimization switches
  - Use of microbenchmarks for better control

Thank You!
Questions and Feedback…