Exploring P4 Trace Cache Features. Ed Carpenter, Marsha Robinson, Jana Wooten.

Problem Statement: Explore the characteristics of the P4 trace cache using microbenchmarks and the performance counters related to branching and the trace cache.

Approach: Determine the characteristics of the Pentium 4 processor that will help us evaluate its trace cache. Using a performance monitoring tool (Intel’s VTune Performance Analyzer), measure the data we need and analyze it to find the limitations of the trace cache.

Some P4 Characteristics: Like most high-performance processors, the P4 has special on-chip hardware for performance monitoring. This hardware typically includes event detectors and counters; qualification of event detection and counting by privilege mode and event characteristics; and support for event-based sampling.

P4 Characteristics, cont. Common performance-monitoring problems faced by modern processors: a small number of counters; inability to distinguish between speculative and nonspeculative events; imprecise event-based sampling. With 42 million transistors (compared to 28 million in the P3), the P4 addresses these problems: it has 48 event detectors and 18 event counters; it provides instruction tagging to enable counting of nonspeculative performance events; and it supports both imprecise event-based sampling (IEBS) and precise event-based sampling (PEBS).

Trace Cache: A special instruction cache for capturing long dynamic instruction sequences. Each line stores a snapshot, or trace, of the dynamic instruction stream. The P4 executes out of the trace cache whenever the needed code hits in this L1 cache (which is over 90% of the time).

Characteristics of Trace Cache: Stores instructions after they have already been decoded into μops (“micro-ops”), which are RISC-style instructions. Cache line size: 6 μops. Trace cache size: 12K μops. Branch prediction hardware is used so that the trace cache knows about any branch and can fetch the instructions that follow it. Conditional branches can cause problems: the processor won’t know a prediction was wrong until the branch condition is checked in ALU0.
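The two sizes quoted above imply a line count, which is worth working out once. The sketch below is only arithmetic on the slide's figures; it assumes "12K" means 12 × 1024 μops, and the function names are made up for illustration. It ignores associativity and any per-line packing restrictions, so it gives an upper bound on what fits rather than a model of the real cache.

```c
#include <assert.h>

/* Figures quoted on the slide. Reading "12K" as 12 * 1024 is an assumption. */
#define TRACE_CACHE_UOPS 12288u  /* total capacity in uops */
#define TRACE_LINE_UOPS  6u      /* uops per trace cache line */

/* Number of lines needed to hold `uops` micro-ops, rounding up. */
unsigned lines_for_uops(unsigned uops, unsigned uops_per_line)
{
    assert(uops_per_line > 0);
    return (uops + uops_per_line - 1) / uops_per_line;
}

/* Total number of lines implied by the quoted capacity and line size. */
unsigned trace_cache_lines(void)
{
    return TRACE_CACHE_UOPS / TRACE_LINE_UOPS;
}

/* Nonzero if `uops` micro-ops could fit at once, by capacity alone. */
int fits_by_capacity(unsigned uops)
{
    return lines_for_uops(uops, TRACE_LINE_UOPS) <= trace_cache_lines();
}
```

With these figures, trace_cache_lines() evaluates to 2048, and a straight-line body of 499 μops would occupy 84 lines if every line were packed full.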

Entering the Execution Pipeline: Pentium 4’s Trace Cache (Tom’s Hardware Guide)

Advantages of Trace Cache: More efficient use of limited cache space. Trace cache lines contain both the branch instruction and the code after the branch instruction. No extra latency for branches. Does not use a TLB check.

“Execute Mode” (when the needed code is in the L1 cache): The P4’s Critical Execution Path

Execute Mode vs. Trace Segment Build Mode. Execute Mode: The trace cache feeds stored traces to the execution logic to be executed. The trace cache normally runs in this mode. Trace Segment Build Mode: Used when there is an L1 cache miss. The front end fetches x86 code from the L2 cache, translates it into μops, builds a “trace segment” from them, and loads that segment into the trace cache to be executed.
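The switch between the two modes can be sketched as a tiny lookup routine. This is a toy model and not the P4's actual organization: the four-line, direct-mapped layout and all names (toy_trace_cache, tc_fetch, and so on) are assumptions chosen to keep the example short. A hit is served in execute mode; a miss stands in for the fetch, decode, and build steps and installs the new segment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TC_LINES 4  /* toy size, hypothetical; the real cache is far larger */

enum tc_mode { TC_EXECUTE, TC_BUILD };

struct toy_trace_cache {
    unsigned long tag[TC_LINES];  /* start address of the trace in each line */
    int valid[TC_LINES];
    unsigned long deliver_count;  /* fetches served in execute mode */
    unsigned long build_count;    /* fetches that forced build mode */
};

void tc_init(struct toy_trace_cache *tc)
{
    assert(tc != NULL);
    memset(tc, 0, sizeof *tc);
}

/* Look up the trace that starts at `addr` and report which mode served it. */
enum tc_mode tc_fetch(struct toy_trace_cache *tc, unsigned long addr)
{
    size_t idx = (size_t)(addr % TC_LINES);  /* direct-mapped: an assumption */

    if (tc->valid[idx] && tc->tag[idx] == addr) {
        tc->deliver_count++;      /* hit: stored trace goes to execution */
        return TC_EXECUTE;
    }

    /* Miss: stands in for fetching x86 code from L2, translating it into
       uops, building a trace segment, and loading it into the cache. */
    tc->tag[idx] = addr;
    tc->valid[idx] = 1;
    tc->build_count++;
    return TC_BUILD;
}
```

In this model, fetching the same address twice reports build mode the first time and execute mode the second, while two addresses that map to the same line keep evicting each other and stay in build mode.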

Branch Prediction: For x86 code with a branch in it, the trace cache builds a trace from the instructions up to and including the branch instruction, then picks the path it predicts the program will take and continues building the trace along that speculative path.
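Building a trace along a predicted path can be illustrated with a short routine. The sketch assumes one μop per instruction and a fixed per-branch prediction, both simplifications; the toy_insn structure and build_trace function are hypothetical names, not part of any Intel interface. The six-entry limit matches the line size quoted earlier.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_LINE_UOPS 6  /* line size from the slides */

struct toy_insn {
    int is_branch;      /* nonzero for a conditional branch */
    size_t target;      /* index of the taken target */
    int predict_taken;  /* what the (toy) predictor guesses */
};

/* Fill `out` with the indices of the instructions placed in one trace,
   starting at `start` and following the predicted direction of each
   branch. Returns the number of entries written. */
size_t build_trace(const struct toy_insn *prog, size_t n, size_t start,
                   size_t out[TRACE_LINE_UOPS])
{
    size_t pc = start;
    size_t len = 0;

    assert(prog != NULL);
    while (pc < n && len < TRACE_LINE_UOPS) {
        out[len++] = pc;                    /* the branch itself is included */
        if (prog[pc].is_branch && prog[pc].predict_taken)
            pc = prog[pc].target;           /* continue on the taken path */
        else
            pc = pc + 1;                    /* fall through */
    }
    return len;
}
```

For an eight-instruction program whose instruction 1 is a branch to instruction 5, a taken prediction gives the trace 0, 1, 5, 6, 7, while a not-taken prediction gives 0, 1, 2, 3, 4, 5. If the prediction turns out wrong, the instructions after the branch in the stored trace are the wrong ones.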

Microcode ROM: Used by the P4 to process longer instructions, which allows the regular hardware decoder to concentrate on decoding the smaller, faster instructions. The ROM stores a sequence of μops for each long instruction. When such an instruction is encountered, a tag is inserted into the trace segment that points to the section of the microcode ROM where its μop sequence is held. When the trace cache encounters a tag, it gives control to the microcode ROM until the proper sequence of μops has been produced. The execution engine does not care whether μops come from the trace cache or the microcode ROM.
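The tag mechanism can be pictured as a trace whose entries are either ordinary μops or references into a ROM table. The encoding below (top bit marks a tag, remaining bits are a table index) is purely an assumption for illustration, as are the names ucode_seq and deliver_uops; the slides do not describe the real format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding: the top bit marks a trace entry as a microcode
   tag, and the remaining bits index into the ROM table. */
#define UCODE_TAG 0x80000000u

struct ucode_seq {
    const unsigned *uops;  /* uop sequence stored in the ROM */
    size_t len;
};

/* Deliver the uops of one trace to `out`, expanding each tag into the
   ROM sequence it points to. Returns the number of uops written, which
   never exceeds `out_cap`. */
size_t deliver_uops(const unsigned *trace, size_t trace_len,
                    const struct ucode_seq *rom, size_t rom_entries,
                    unsigned *out, size_t out_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < trace_len; i++) {
        if (trace[i] & UCODE_TAG) {
            size_t idx = trace[i] & ~UCODE_TAG;
            assert(idx < rom_entries);
            /* The ROM supplies uops until its sequence is finished. */
            for (size_t j = 0; j < rom[idx].len && n < out_cap; j++)
                out[n++] = rom[idx].uops[j];
        } else if (n < out_cap) {
            out[n++] = trace[i];  /* ordinary uop straight from the trace */
        }
    }
    return n;
}
```

The consumer of `out` sees one flat stream of μops, which mirrors the point that the execution engine does not care where each μop came from.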

VTune Experiment (short loop body):
    for (i = 0; i < 1M; i++)    /* 1M iterations, as written on the slide */
        _asm {
            mov eax, 10
            mov eax, 20
        }

VTune Experiment (long loop body):
    for (i = 0; i < 1M; i++)    /* 1M iterations, as written on the slide */
        _asm {
            mov eax, 10
            …
            mov eax, 4990
        }
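Writing the long loop body by hand is tedious, so a small generator is one way to produce it. The sketch below assumes the immediates run from 10 up to the last value in steps of 10, which is a guess based on the first two instructions shown; the slides elide the middle of the sequence. The function name emit_mov_body is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write "mov eax, 10", "mov eax, 20", ... up to `last` into `buf`, one
   instruction per line. The step of 10 is an assumption about the elided
   part of the slide. Returns the number of instructions written, or -1
   if the buffer is too small. */
int emit_mov_body(char *buf, size_t cap, int last)
{
    size_t used = 0;
    int count = 0;

    assert(buf != NULL && cap > 0);
    buf[0] = '\0';
    for (int imm = 10; imm <= last; imm += 10) {
        int n = snprintf(buf + used, cap - used, "    mov eax, %d\n", imm);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;            /* output would not fit */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Under the step-of-10 assumption, a last value of 4990 yields 499 instructions and 5000 yields 500; the generated text can be pasted between the braces of the _asm block.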

VTune Results
    Last instruction    Trace Cache Misses    Trace Cache Delivery Mode
    mov eax, [missing]  [missing]             [missing],605,634
    mov eax, 4990       2,356                 173,879,264
    mov eax, 5000       3,945                 174,448,595

VTune Results cont. Columns: Distance; Run #; Spec microcode uops; Spec TC-built uops; Spec TC-delivered uops; TC Build Mode; TC Deliver Mode; TC Misses; uops Decoded; uops Retired.
, ,636, ,973,480 4, ,140, ,671, , ,233, , ,451,130 5, ,390, ,080, , , ,599, , ,215,964 10, ,918, ,939, , ,929, ,872,716 1, ,609, ,960, , , ,086, ,210,494 5, ,178, ,336, , ,424, , ,107,503 6, ,964, ,790, , ,461, ,471,452 1, ,074, ,907, ,108 82, ,650, ,759,410 5, ,827, ,866, , ,591, , ,811,048 12, ,118, ,147,504

VTune Results for P4m. Columns: Distance; Run #; Spec uops retired; Spec TC-built uops; Spec TC-delivered uops; TC Build Mode; TC Deliver Mode; TC Misses; uops Retired.
,706, ,600,752053,391,1824,248158,219, ,352, ,005,262383,18355,624,0162,957157,856, ,698, ,680,678055,166,3197,248158,195, ,311, ,421,964389,10155,592,7685,192157,215, ,841, ,760,210048,314, ,856, ,101, ,808,330342,95548,707,7959,054138,242, ,317, ,527,055360,10050,786, ,032,684

Sources:
M. Milenkovic, A. Milenkovic, J. Kulick, “Demystifying Intel Branch Predictors,” Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking (held in conjunction with the 29th ISCA), Anchorage, Alaska, May 2002.
E. Rotenberg, S. Bennett, J. E. Smith, “A Trace Cache Microarchitecture and Evaluation,” IEEE Transactions on Computers, Vol. 48, No. 2, February htm 5.htm