Baum, Boyett, & Garrison Comparing Intel C++ and Microsoft Visual C++ Compilers Michael Baum David Boyett Holly Garrison.


Agenda
– Problem Statement
– System Environment
– Programs Used for Comparison
– Matrix Processing Programs Results and Analysis
– SPEC Benchmark Results and Analysis
– Conclusion

Problem Statement
– The general purpose of our project is to verify Intel’s claim that its compiler is 10% better than the Microsoft Visual C++ compiler.
– Data will be gathered with the Intel VTune tool from both SPEC CPU 2000 benchmarks and simple matrix processing programs.

System Environment
– Programs were run on a single-processor system with an Intel P4 2.4 GHz processor and 512 MB RAM.
  – Windows 2000 operating system
– Microsoft Visual.NET compiler
– Intel C++ Compiler 7.1 for Windows
– Intel VTune Performance Analyzer 7.0

Programs Used for Comparison
– SPEC CPU 2000 Benchmark
  – 164.gzip
  – 300.twolf
– Simple Matrix Processing Programs
  – Array summation of elements
  – Matrix multiplication of 250x250 matrices
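
The slides do not include the source of the two simple programs, so the following is only a minimal sketch of what kernels of this kind might look like. The names `array_sum`, `Matrix`, and `matrix_multiply` are hypothetical, and the element types are an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two simple test programs; the original
// sources are not shown in the presentation.

// Sum every element of a one-dimensional array.
long long array_sum(const std::vector<int>& values) {
    long long total = 0;
    for (int v : values) {
        total += v;
    }
    return total;
}

using Matrix = std::vector<std::vector<double>>;

// Naive triple-loop product of two square n x n matrices
// (n = 250 for the matrices described in the slides).
Matrix matrix_multiply(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i][k] * b[k][j];
            }
            c[i][j] = acc;
        }
    }
    return c;
}
```

Loops like these are short and regular, which gives an optimizing compiler a lot of room to restructure them. That may be one reason small kernels react more strongly to the choice of compiler than large applications do.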

VTune Setup
Using Intel’s VTune application, the following events were measured:
  – Instruction Count
  – Clockticks and Clockticks per Instruction
  – Loads & Stores
  – Level 1 cache misses
  – Mispredicted Calls and Branches
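
Clockticks per Instruction is a derived metric: total clockticks divided by the instruction count. The sketch below shows that ratio, plus a rough conversion from clockticks to seconds that assumes a constant clock frequency. Both function names are hypothetical and are not part of VTune.

```cpp
#include <cstdint>

// Clockticks per instruction: total clockticks divided by the number of
// instructions counted. The caller must pass a non-zero instruction count.
double clockticks_per_instruction(std::uint64_t clockticks,
                                  std::uint64_t instructions) {
    return static_cast<double>(clockticks) / static_cast<double>(instructions);
}

// Approximate run time in seconds, assuming the clock runs at a constant
// frequency of clock_hz for the whole measurement.
double seconds_from_clockticks(std::uint64_t clockticks, double clock_hz) {
    return static_cast<double>(clockticks) / clock_hz;
}
```

For example, the 106,412,563,515 clockticks reported for 164.gzip (Intel) in the SPEC results correspond to roughly 44.3 seconds on the 2.4 GHz test machine, under the constant-frequency assumption.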

Matrix Processing Programs Results

Executable (*.exe) | Mispredicted Calls | Mispredicted Branches | 1st Level Cache Misses | Loads | Stores | Clockticks | Instruction Count | Clockticks per Instruction
Array Sum (Intel) | 1,518 | 22,285 | 49,890 | 1,268,145 | 844,962 | 18,995,295 | 981,… | …
Array Sum (VC++) | 4,536 | 39,123 | 186,760 | 863,772 | 1,162,239 | 13,069,242 | 1,462,… | …
Matrix Mult 250 (Intel) | 220 | 5,… | 0 | 0 | …,324 | 9,502,532 | 1,979,… | …
Matrix Mult 250 (VC++) | 289 | 68,354 | 18,640,249 | 31,728,270 | 657,328 | 88,513,594 | 54,242,… | …

Matrix Processing Analysis
– For simple matrix and array processing, the results verified Intel’s claim of a 10% better compiler.
  – With the exception of the number of Stores executed, the Intel compiler showed approximately a 50% savings in the measured operations.
– The Matrix Multiplication program showed one noteworthy result: the Intel compiler had zero events for both 1st Level Cache Misses and Loads.
  – Verified by multiple builds and runs

SPEC Benchmark Results

Executable (*.exe) | Mispredicted Calls | Mispredicted Branches | 1st Level Cache Misses | Loads | Stores | Clockticks | Instruction Count | Clockticks per Instruction
164.gzip (Intel) | 11,725 | 871,754,172 | 2,267,577,936 | 22,054,374,342 | 11,101,416,840 | 106,412,563,515 | 76,670,596,… | …
164.gzip (VC++) | 7,695 | 869,317,015 | 2,273,066,852 | 22,074,844,248 | 11,108,909,049 | 107,286,054,470 | 76,671,138,… | …
300.twolf (Intel) | 346 | 4,874,982 | 7,639,211 | 77,060,025 | 32,577,657 | 484,933,215 | 210,922,… | …
300.twolf (VC++) | 537 | 4,797,552 | 7,526,588 | 76,831,638 | 33,214,416 | 473,946,742 | 211,425,… | …

SPEC CPU 2000 Analysis
– SPEC CPU 2000 benchmarks did not show any significant difference between the two compilers.
– SPEC benchmarks were re-compiled and data sets were collected multiple times to verify the validity of the original data.
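
One way to check a claim such as "10% better" against raw event counts is to compute the relative savings of one build over the other. The sketch below assumes that "better" means fewer events (for example, fewer clockticks). The helper names `relative_savings` and `meets_claim` are hypothetical.

```cpp
#include <cstdint>

// Fraction by which the Intel count undercuts the VC++ count.
// Positive: the Intel build recorded fewer events. Negative: it recorded more.
double relative_savings(std::uint64_t intel, std::uint64_t msvc) {
    return (static_cast<double>(msvc) - static_cast<double>(intel)) /
           static_cast<double>(msvc);
}

// True when the Intel build is at least `threshold` better (0.10 for 10%).
bool meets_claim(std::uint64_t intel, std::uint64_t msvc, double threshold) {
    return relative_savings(intel, msvc) >= threshold;
}
```

Applied to the clocktick totals reported for the SPEC benchmarks, this gives about 0.8% savings for 164.gzip (106,412,563,515 vs. 107,286,054,470) and about -2.3% for 300.twolf (484,933,215 vs. 473,946,742). Both fall well short of a 10% threshold, which is consistent with the finding of no significant difference.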

Conclusions
– Even though our group saw significant improvements in performance for our small test programs, these same gains could not be duplicated for the benchmark applications.
– These variations might be the result of differences in program complexity.

Conclusions (cont.)
– The Intel C++ Compiler showed results that were equal to, or in some cases better than, those of Microsoft Visual C++.
– While Intel’s claim of 10% better results may not hold in all cases, the Intel compiler is still the superior compiler.