ACMSE’04, AL
Department of Electrical and Computer Engineering - UAH

Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
Swathi Tanjore Gurumani, Aleksandar Milenkovic
Electrical and Computer Engineering Department
University of Alabama in Huntsville

Outline
- Objective
- Background
- Problem Overview
- Performance Evaluation - Overview
- Experimental Setup
- Results
- Conclusion and Future Research

Problem Objective
- Prove and stress the importance of designing architecture-aware compilers

Background - Application Performance
- Advancement in processor technology
  - Deep pipelining
  - Multi-level cache hierarchy
  - Improved branch predictors
  - Out-of-order execution engine
  - Advanced floating point
  - Multimedia units
- Compilers
  - Optimization levels and switches
- Compilers should keep up with processor technology

Architecture-aware Compilers
- Compiler/hardware interaction can maximize application performance by
  - Exploiting advances in processor technology
  - Generating target-specific optimal code
    - Path length reduction
    - Efficient instruction selection
    - Pipeline scheduling
    - Instruction-level parallelism
    - Memory penalty minimization

Performance Evaluation
- Systematic process of data collection and analysis to characterize and evaluate a system
- [Diagram: Benchmarks, Compile, Exe, Performance Metrics]
- Benchmark: a program that performs a strictly defined set of operations (a workload) and returns some form of result (a metric) describing how the tested computer performed

Performance Evaluation – Previous Works
- Study underlying architecture and characterize workloads
  - Evaluation of Pentium Pro using SPEC 2000
  - Evaluation of Pentium II using multimedia applications
- Processor-centric optimization
  - Xeon vs. Pentium III
  - Pentium III vs. Pentium IV
- Compilers and optimization
  - Branch optimizations by different compilers

Problem Overview
- Objective
  - Prove and stress the importance of architecture-aware compilers
- How?
  - Compile benchmarks using different compilers
  - Use the same optimization switches
  - Execute the binaries under a performance analyzer
  - Analyze and compare the performance metrics collected
- Same OS and hardware features, so any difference in metrics is due only to the compiler used

Experimental Setup
[Diagram: SPEC CPU2000 is compiled with IC++ and with VC++; each Exe runs under VTune, which collects Performance Metrics]
- Processor: Pentium IV
- Operating System: Windows 2000
- Optimization Level: /O2
- Input: Reference set from SPEC

SPEC CPU2000
- Represents real, computation-intensive user applications
- Can measure performance of processor, memory and compiler
- Does not stress I/O devices, networking or the OS
- Used CINT2000 and CFP2000

Name              Description
164.gzip (INT)    Data Compression written in C
176.gcc (INT)     C Programming Language Compiler
177.mesa (FP)     3-D Graphics Library written in C
181.mcf (INT)     Combinatorial Optimization written in C
186.crafty (INT)  Chess – Game Playing written in C
197.parser (INT)  Word Processing written in C
252.eon (INT)     Computer Visualization written in C++
perlbmk (INT)     PERL Programming Language written in C
254.gap (INT)     Group Theory, Interpreter written in C
255.vortex (INT)  Object Oriented database written in C

VTune Performance Analyzer
- Simultaneous sampling of multiple events and real-time display using counter monitors
- Supports time-based and event-based sampling (EBS)
- Event-based sampling
  - Takes advantage of Pentium IV’s EBS feature
  - Has low intrusion
  - Samples collected provide a closer representation of the application’s actual performance
- Events collected
  - Clockticks, instructions retired, loads retired, stores retired, branches retired, first-level cache misses and mispredicted branches
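
The raw event counts listed on the VTune slide are usually easier to compare once they are turned into ratios. The sketch below is only an illustration: the slides do not say which derived metrics the authors computed, the class and function names are hypothetical, and the sample numbers are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class EventCounts:
    """Event totals for one benchmark run (field names are hypothetical)."""
    clockticks: int
    instructions: int
    loads: int
    stores: int
    branches: int
    l1_misses: int
    mispredicted_branches: int


def derived_metrics(c: EventCounts) -> dict:
    """Turn raw event counts into commonly used ratios."""
    return {
        # Cycles per retired instruction.
        "cpi": c.clockticks / c.instructions,
        # Fraction of retired branches that were mispredicted.
        "mispredict_rate": c.mispredicted_branches / c.branches,
        # First-level cache misses per 1000 retired instructions.
        "l1_misses_per_kilo_instr": 1000 * c.l1_misses / c.instructions,
    }


# Placeholder values, NOT measurements from the study.
sample = EventCounts(
    clockticks=2_000_000,
    instructions=1_000_000,
    loads=300_000,
    stores=100_000,
    branches=150_000,
    l1_misses=20_000,
    mispredicted_branches=6_000,
)
metrics = derived_metrics(sample)
```

Ratios such as these make two builds of the same benchmark comparable even when their absolute instruction counts differ.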

Compiler Optimizations
- Both compilers were used with the /O2 option
  - Invokes the same switches and has the same function in both compilers
- Microsoft VC++ has special switches to target Pentium (/G5) and Pentium Pro (/G6)
- Intel C++ compiler optimizes performance for applications running on Intel architecture-based computers

Option  Effect
/Od     Disable optimization
/O1     Minimize size
/O2     Maximize speed

Performance gains by using IC++ are a result of
- profile-guided optimization
- pre-fetch instruction
- support for Streaming SIMD Extensions (SSE)
- data prefetching
- inter-procedural optimization

Comparison of Clock ticks
- On average, 10% performance gain with IC++
- Performance gain is more pronounced for the 3D graphics library (mesa) and the computer visualization application (eon)
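
One way to arrive at an average gain like the one quoted above is to compute, per benchmark, the fraction of VC++ clockticks saved by the IC++ build and then take the arithmetic mean. This is a sketch under assumptions: the slides do not state which averaging method was used, the function names are hypothetical, and the tick counts are placeholders rather than measured values.

```python
def relative_gain(ticks_vc: float, ticks_ic: float) -> float:
    """Fraction of the VC++ clockticks saved by the IC++ build."""
    return (ticks_vc - ticks_ic) / ticks_vc


def mean_gain(ticks_by_benchmark: dict) -> float:
    """Arithmetic mean of per-benchmark gains (averaging method is an assumption)."""
    gains = [relative_gain(vc, ic) for vc, ic in ticks_by_benchmark.values()]
    return sum(gains) / len(gains)


# Placeholder (VC++ ticks, IC++ ticks) pairs, NOT data from the study.
hypothetical_ticks = {
    "bench_a": (100, 95),
    "bench_b": (200, 170),
}
average = mean_gain(hypothetical_ticks)
print(f"average gain: {average:.0%}")
```

A geometric mean of tick ratios would be an equally defensible choice and can give a different figure when gains vary widely across benchmarks.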

Comparison of Binaries

Code Size (in Bytes)
Benchmark  MSVC++     IC++
gzip       69,632     77,
gcc        1,089,536  1,314,
mesa       442,368    610,
mcf        49,152     53,
crafty     241,664    258,
parser     118,784    131,
eon        405,504    413,
perlbmk    516,096    651,
gap        356,352    413,
vortex     417,792    454,656

- VC++ produced smaller binaries
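
The size difference can be expressed as relative growth of the IC++ binary over the VC++ binary. The sketch below uses the vortex row (417,792 vs. 454,656 bytes), the only row whose IC++ value is fully legible here; the function name is hypothetical.

```python
def size_growth(vc_bytes: int, ic_bytes: int) -> float:
    """Relative growth of the IC++ binary over the VC++ binary."""
    return (ic_bytes - vc_bytes) / vc_bytes


# vortex code sizes taken from the table above.
vortex_growth = size_growth(417_792, 454_656)
print(f"vortex: IC++ binary is {vortex_growth:.1%} larger")  # prints 8.8%
```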

Comparison of Instruction Count
- The 3D graphics and computer visualization applications show a much larger reduction in instruction count than the other benchmarks

Comparison of Loads

Comparison of Stores

Comparison of Branches

Comparison of Other Instructions
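
The load, store, branch and "other" comparisons can be read as an instruction mix. The sketch below assumes that "other instructions" means retired instructions that are not loads, stores or branches; the slides do not define the category, the function name is hypothetical, and the counts are placeholders. It shows how two builds can share the same mix while differing in absolute counts.

```python
def instruction_mix(instructions: int, loads: int, stores: int, branches: int) -> dict:
    """Share of each instruction class among all retired instructions."""
    # Assumption: "other" is whatever remains after loads, stores and branches.
    other = instructions - loads - stores - branches
    if other < 0:
        raise ValueError("class counts exceed total retired instructions")
    counts = {"loads": loads, "stores": stores, "branches": branches, "other": other}
    return {name: count / instructions for name, count in counts.items()}


# Placeholder counts, NOT measurements: the second build retires fewer
# instructions overall but keeps the same proportions.
mix_vc = instruction_mix(instructions=1000, loads=300, stores=100, branches=150)
mix_ic = instruction_mix(instructions=800, loads=240, stores=80, branches=120)
same_mix = all(abs(mix_vc[k] - mix_ic[k]) < 1e-12 for k in mix_vc)
```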

Comparison of Cache Misses

Conclusion & Future Research
- Execution characteristics of the SPEC CPU2000 benchmarks were presented for VC++ and IC++
- IC++ performed better than VC++ for all considered applications; the gain was more pronounced for graphics applications
- The distribution of loads, stores and branches was the same; only the absolute numbers differed
- No difference in branch prediction and memory references
- Use: identifying strengths and weaknesses of compilers
- Future directions
  - Different optimization switches
  - Use of microbenchmarks for better control

Thank You!
Questions and Feedback…