Access Region Locality for High-Bandwidth Processor Memory System Design. Sangyeun Cho (Samsung/U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee.

Presentation transcript:

Access Region Locality for High-Bandwidth Processor Memory System Design
Sangyeun Cho (Samsung/U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee (Iowa State U)
32nd Annual International Symposium on Microarchitecture

MICRO-32 November 17, 1999 Cho, Yew, and Lee
Big Picture

On-Chip D-Cache Bandwidth Problem

Wide-Issue Superscalar Processors
• Current Generation
  – Alpha
  – Intel’s Merced
• Future Generation (IEEE Computer, Sept. ’97)
  – Superspeculative Processors
  – Trace Processors

Multi-Ported Data Cache
• Replicated Cache
  – Alpha
• Time-Division Multiplexed Cache
  – Alpha
• Interleaved Cache
  – MIPS R10K

Window Logic Complexity
• Pointed out as the major hardware complexity (Palacharla et al., ISCA ’97)
• More severe for the memory window
  – Difficult to partition
  – Thick network needed to connect RSs and LSUs

Data Decoupling

Data Decoupling: What is it?
• A divide-and-conquer approach
  – Instruction stream partitioned before entering RS
  – Narrower networks
  – Fewer ports to each cache
  – Needs a mechanism for proper partitioning

Data Decoupling: Operating Issues
• Memory Stream Partitioning
  – Hardware classification
  – Compiler classification
• Load Balancing
  – Enough instructions in different groups?
  – Are they well interleaved?

Access Region Locality & Access Region Prediction

Access Region: Overview
• Access Region R
  – R = (L, U)
• L: lower bound on address
• U: upper bound on address
• For regions R = (A, B) and Q = (C, D), if (D < A) or (B < C),
  – Regions R and Q are said to be exclusive, or non-overlapping.
• Locations in exclusive regions are independent.
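
The exclusivity condition on this slide is an interval-disjointness test. A minimal Python sketch follows, assuming R = (A, B) and Q = (C, D) are closed address intervals (the slide's figure is not in the transcript, so this reading is an assumption). The names `Region` and `exclusive` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    lower: int  # L: lower bound on address
    upper: int  # U: upper bound on address


def exclusive(r: Region, q: Region) -> bool:
    """Return True when r and q share no address.

    Mirrors the slide's condition (D < A) or (B < C) under the
    assumption r = (A, B) and q = (C, D).
    """
    return q.upper < r.lower or r.upper < q.lower


# Two back-to-back regions are exclusive; regions sharing one address are not.
data = Region(0x1000, 0x1FFF)
stack = Region(0x2000, 0x2FFF)
print(exclusive(data, stack))
```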

Access Region and Mem. Instructions

Partitioning Memory Space
• One way of partitioning memory space into regions:
  – Data Region / Heap Region / Stack Region
• This work assumes this partitioning.
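
To make the Data / Heap / Stack split concrete, here is a rough Python sketch that maps an address to one of the three regions. The boundary constants `HEAP_START` and `STACK_LIMIT` are hypothetical values chosen only for illustration; the slides give no addresses, and real boundaries depend on the ABI, the OS, and how far the heap has grown at run time.

```python
from enum import Enum


class RegionKind(Enum):
    DATA = "data"
    HEAP = "heap"
    STACK = "stack"


# Hypothetical boundaries (assumptions, not taken from the slides).
HEAP_START = 0x1001_0000   # first address above static data
STACK_LIMIT = 0x7000_0000  # lowest address treated as stack


def classify(addr: int) -> RegionKind:
    """Map an address to a region under the assumed layout above."""
    if addr >= STACK_LIMIT:
        return RegionKind.STACK
    if addr >= HEAP_START:
        return RegionKind.HEAP
    return RegionKind.DATA


print(classify(0x7FFF_FFF0))
```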

Partitioning Memory Space, Cont’d
• Many accesses are toward Data and Stack regions.
• Some programs don’t access the Heap region at all.

Partitioning Memory Space, Cont’d
• Accesses to Data region are less bursty than others.
• Programs such as ijpeg have clustered region accesses.
• Window Size = 32

Partitioning Memory Space, Cont’d
• W/ a large window, Stack accesses become less bursty.
• Data and Stack regions have quite stable, constant demand.
• Window Size = 64

Partitioning Memory Space, Cont’d
[Chart: per-benchmark results for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, and mgrid, plus Int.Avg and FP.Avg. Data labels as extracted, not attributable to individual bars: 1.9%, 1.8%, 51.1%, 50.4%, 1.6%, 16.2%, 45.4%, 31.6%.]
• Many instructions access a single region (~98%).
• Multi-region-accessing instructions account for 0% to 9.6% of dynamic memory references.

Access Region Locality
• “A memory reference instruction typically accesses a single region at run time”
  – Only about 2% of all static memory instructions access more than a single region.
• “(Thus) the region it accesses is highly predictable”
  – Simple predictors with a small look-up table achieve high prediction accuracy.

Predicting Regions: Unlimited Case
• One predictor per memory instruction
• Predictor types:
  – 1-bit history saver (0: Data, 1: Stack)
  – 2-bit saturating counter
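
The two predictor types can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: both predictors start in state 0 (Data), the 2-bit counter predicts Stack when its value is 2 or 3, and the "unlimited" table is modeled as a dictionary keyed by PC. The class and function names are made up for this example.

```python
class OneBitPredictor:
    """1-bit history: predict whatever region was seen last (0: Data, 1: Stack)."""

    def __init__(self) -> None:
        self.state = 0  # assumed initial state

    def predict(self) -> int:
        return self.state

    def update(self, actual: int) -> None:
        self.state = actual


class TwoBitPredictor:
    """2-bit saturating counter: values 2 and 3 predict Stack (1)."""

    def __init__(self) -> None:
        self.counter = 0  # assumed initial state

    def predict(self) -> int:
        return 1 if self.counter >= 2 else 0

    def update(self, actual: int) -> None:
        if actual == 1:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(predictor_cls, trace) -> float:
    """trace: sequence of (pc, region) pairs; one predictor per static instruction."""
    table = {}
    hits = total = 0
    for pc, region in trace:
        if pc not in table:
            table[pc] = predictor_cls()
        predictor = table[pc]
        hits += int(predictor.predict() == region)
        predictor.update(region)
        total += 1
    return hits / total if total else 0.0


# One instruction that always touches the stack: the 1-bit scheme warms up faster.
trace = [(0x400, 1)] * 4
print(accuracy(OneBitPredictor, trace), accuracy(TwoBitPredictor, trace))
```

On this tiny trace the 1-bit predictor misses once and the 2-bit counter misses twice, which shows the shorter warm-up of the 1-bit scheme when an instruction sticks to one region.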

Predicting Regions: Adding Context
• Run-time context
  – Caller’s ID (CID): in Link Register
  – Global Branch History (GBH)
  – Hybrid of above

Predicting Regions: Utilizing Static Info.
• Some instructions’ access regions are revealed through architecture and compiler conventions:
  – Use of Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack.
  – Use of Global Pointer ($GP) suggests that the region is non-Stack.
  – For others, assume non-Stack.
• Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
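
The register conventions on this slide translate directly into a decode-time rule. The Python sketch below is illustrative only: the function name `static_region_hint`, the lowercase register spellings, and the extra `revealed` flag (which separates a real hint from the default guess) are assumptions added for this example.

```python
STACK, NON_STACK = 1, 0


def static_region_hint(base_reg: str) -> tuple:
    """Return (region, revealed) for a memory instruction's base register.

    revealed is True when the register convention itself exposes the region,
    and False when the slide's fallback ("assume non-Stack") is applied.
    """
    reg = base_reg.lower()
    if reg in ("$sp", "$fp"):
        return (STACK, True)
    if reg == "$gp":
        return (NON_STACK, True)
    return (NON_STACK, False)


print(static_region_hint("$sp"), static_region_hint("$t0"))
```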

Region Pred. Result: Unlimited Case
[Chart: region prediction results per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, plus Int.Avg and FP.Avg). Legend as extracted: Simple 1-bit w/ GBH w/ CID Static w/ Hybrid.]
• 1-bit predictors do better than 2-bit predictors (not shown).
• Hybrid context bits achieve the best prediction rate on average.

Predicting Regions: Limited-Size ARPT
• Low n bits of PC, XOR’ed with hybrid context bits, are used to index into the Access Region Prediction Table (ARPT):
  – Table entries initialized to 0’s
  – 1 to denote stack access
  – Decoding information exploited to save ARPT space
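
The indexing scheme described on this slide can be sketched as a small table class in Python. This is a hedged illustration: the slides do not say how the hybrid context bits are formed, so `context` is treated as an opaque integer supplied by the caller, and the class name `ARPT` with its `predict` and `update` methods is an assumed interface rather than the authors' design.

```python
class ARPT:
    """Access Region Prediction Table with 2**index_bits one-bit entries."""

    def __init__(self, index_bits: int) -> None:
        self.mask = (1 << index_bits) - 1
        # Entries start at 0 (non-stack); 1 denotes a stack access.
        # One byte per entry is used here for simplicity only.
        self.table = bytearray(1 << index_bits)

    def _index(self, pc: int, context: int) -> int:
        # Low n bits of the PC XOR'ed with the context bits.
        return (pc ^ context) & self.mask

    def predict(self, pc: int, context: int) -> int:
        return self.table[self._index(pc, context)]

    def update(self, pc: int, context: int, actual: int) -> None:
        self.table[self._index(pc, context)] = 1 if actual else 0


arpt = ARPT(index_bits=10)
arpt.update(0x400100, 0b101, 1)
print(arpt.predict(0x400100, 0b101))
```

Because only the low n bits are used, two instructions whose PCs differ above bit n share an entry when their context matches. That aliasing is the price of a limited-size table.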

Region Prediction Result: ARPT
[Chart: region prediction results per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, plus Int.Avg and FP.Avg) for ARPT sizes Unlimited, 8 KB, 4 KB, 2 KB, and 1 KB.]
• Over 99.9% accuracy w/ 4 KB or larger ARPT w/o compiler hints.
• Compiler hints relieve pressure due to smaller sizes.

Dynamic Data Decoupling

Dynamic Data Decoupling

Dynamic Data Decoupling, Cont’d
• Dynamically predicting access regions to classify memory instructions:
  – Utilize Access Region Prediction Table (ARPT).
  – Utilize any region information revealed through instruction decoding.
• Dispatching partitioned memory instructions into separate memory pipelines, connected to separate caches.
• Dynamically verifying region prediction
  – Let TLB (i.e., page table) contain verification information such that memory access is reissued on mispredictions.
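
The predict, dispatch, verify, and reissue flow can be walked through with a toy Python model. Everything here is a simplification made for illustration: the predictor is a plain PC-indexed 1-bit table standing in for the ARPT, the "verification" step just compares against the true region supplied in the trace (the slides place this information in the TLB), and `run_decoupled` is a hypothetical name.

```python
STACK, NON_STACK = 1, 0


def run_decoupled(trace, index_bits: int = 10):
    """trace: sequence of (pc, actual_region) pairs.

    Returns (issued, reissues), where issued counts accesses sent to each
    memory pipeline and reissues counts mispredicted accesses.
    """
    mask = (1 << index_bits) - 1
    table = [NON_STACK] * (1 << index_bits)  # stand-in region predictor
    issued = {STACK: 0, NON_STACK: 0}
    reissues = 0
    for pc, actual in trace:
        idx = pc & mask
        predicted = table[idx]
        issued[predicted] += 1       # dispatch to the predicted pipeline
        if predicted != actual:      # verification detects a misprediction
            reissues += 1
            issued[actual] += 1      # reissue to the correct pipeline
            table[idx] = actual      # train the predictor
    return issued, reissues


issued, reissues = run_decoupled([(0x10, STACK), (0x10, STACK), (0x20, NON_STACK)])
print(issued, reissues)
```

In this trace the first stack access is mispredicted once and reissued; after training, the same instruction goes straight to the stack pipeline.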

Base Machine Model

Overall Performance
[Chart: overall performance per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, plus Int.Avg and FP.Avg).]
• Over (2+0) conf.

Conclusions
• Access Region Locality says
  – Memory instructions access few regions at run time.
  – Accessed regions are accurately predictable.
• Access Region Locality leads to Access Region Prediction techniques.
• Access Region Prediction allows Dynamic Data Decoupling, shown to achieve comparable performance to very wide data caches.

Now Any Questions?

Impact of LVC Size
[Chart: LVC sizes 0.5K, 1K, 2K, and 4K.]
• 2KB and 4KB LVCs achieve high hit rates (~99.9%).
• Set associativity is less important if the LVC is 2KB or more.
• A small, simple LVC works well.