CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques to reduce cache miss rate

Presentation transcript:

CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques to reduce cache miss rate

Topics to Be Covered
• Part I: Using PAPI
  - Finding the best blocking factor to reduce the cache miss rate
  - Getting a complete picture of the system hardware
• Part II: Using SimpleScalar to find the best branch predictor size
• Part III: Also using SimpleScalar to find the best TLB configuration

What is PAPI?
• Performance Application Programming Interface
• Developed at the University of Tennessee’s Innovative Computing Laboratory
• Accesses the hardware performance counters found on most modern microprocessors
• Easy to use, well documented, and freely available

Events
• Occurrences of specific signals related to a processor’s function
• Hardware performance counters exist as a small set of registers that count events while the program executes on the processor, such as:
  - Cache misses
  - Floating point operations

C calling interface
• Function calls are defined in the header file “papi.h”
• Calls take the following form: return type PAPI_function_name(arg1, arg2, …)
• The return value can be a pointer to a structure or a value

PAPI timers
• Can be used to obtain both real and virtual time
• The real-time clock runs all the time (like a wall clock); the virtual-time clock runs only when the processor is running in user mode
• Real time can be acquired in clock cycles and in microseconds by calling the following low-level functions, respectively:
  - PAPI_get_real_cyc()
  - PAPI_get_real_usec()

System information
• Executable information: PAPI_get_executable_info()
  - Information about the executable’s address space:
    the beginning of the user program and the end of the user program
• Hardware information: PAPI_get_hardware_info()
  - Information about the system hardware:
    cycle time of the processor and number of processors in the system

Finding the best blocking factor on Bragg and getting system information
• Use PAPI to find the best block size (using matrix multiplication)
• Measure the number of clock cycles for each block size
• Choose the block size with the minimum number of clock cycles
• Report system hardware information such as the processor clock rate and the number of processors in the system

Results on Bragg system
Available hardware information:
  Vendor string and code: SUN unknown (-1)
  Model string and code: UltraSPARC I&II (1000)
  CPU revision:
  CPU Megahertz:
  CPU's in an SMP node: 8
  Nodes in the system: 1
  Total CPU's in the system:
Best block size: 8
  bfactor: 8 clock cycles
  bfactor: 16 clock cycles
  bfactor: 32 clock cycles
  bfactor: 64 clock cycles

Part II: branch predictor
• Modify the SimpleScalar parameters for the L1-I cache, L1-D cache, branch predictor, and branch target buffer
• Obtain 16 different configurations
• Run four integer and four floating point SPEC2000 benchmarks with these configurations
• Calculate the CPI for each benchmark and every configuration, and plot the results
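For reference, the CPI calculation and the comparison across configurations can be written out as below. The symbols are introduced here for illustration: b indexes the B benchmarks and c indexes the configurations. The use of an arithmetic mean is an assumption, since the slides do not say how the average was taken.

```latex
\mathrm{CPI}_{b,c} = \frac{\text{cycles}_{b,c}}{\text{instructions}_{b,c}},
\qquad
\overline{\mathrm{CPI}}_{c} = \frac{1}{B}\sum_{b=1}^{B}\mathrm{CPI}_{b,c},
\qquad
c^{*} = \arg\min_{c}\;\overline{\mathrm{CPI}}_{c}
```

If the runs use SimpleScalar's sim-outorder simulator (the slides do not name the simulator binary), the cycle and instruction counts correspond to the sim_cycle and sim_num_insn statistics it reports, and their ratio is reported as sim_CPI.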

CPI for integer benchmarks

CPI for floating point benchmarks

Average CPI for the integer and floating point benchmarks
Config. #14: branch predictor: 16 KB, branch target buffer: 4 KB, L1 instruction cache: 32 KB, and L1 data cache: 8 KB

Part III: TLB
• Vary the instruction TLB from 512 to 1024 entries and the data TLB from 512 to 1024 entries; also vary the L1I and L1D cache sizes
• Obtain 16 different configurations
• Run one integer and one floating point SPEC2000 benchmark for each of these configurations
• Find the number of clock cycles for each benchmark and every configuration, and plot the results

Number of clock cycles for the integer benchmark

Number of clock cycles for the floating point benchmark

Average number of clock cycles for the integer and floating point benchmarks
Configuration: 16 KB L1 instruction cache, 16 KB L1 data cache, 1024-entry instruction TLB, and 512-entry data TLB

Questions? Thank you …