2nd Data Prefetching Championship Results and Awards

2nd Data Prefetching Championship Results and Awards
Seth Pugsley

Thanks!
Big thanks to Hyesoon Kim and the Program Committee: Babak Falsafi, Mike Ferdman, Aamer Jaleel, Daniel Jiménez, Calvin Lin, Moin Qureshi, Eric Rotenberg, Thomas Wenisch.
Thanks to Alaa Alameldeen and Chris Wilkerson at Intel Labs.
Thanks also to submission chair Hyojong Kim.

DPC2Sim Parameters
- Single core: 3.2 GHz, 6-wide, 256-entry ROB
- 3-level cache hierarchy: 16 KB L1D, 128 KB L2, 1 MB L3
- 1 channel of 64-bit 1600 MT/s DDR3
- All prefetching is done at the L2 level
- An L2 read event is the entry point into the prefetcher: void l2_prefetcher_operate(cpu_num, addr, PC, cache_hit); (a minimal sketch follows below)
- Prefetches are inserted into the L2 read queue
- The prefetcher can also query MSHR occupancy, read queue occupancy, and the current cycle time
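For context on how a submission plugs into those hooks, here is a minimal next-line prefetcher sketch in C. The l2_prefetcher_operate entry point comes from the slide above; the helper names (l2_prefetch_line, get_l2_mshr_occupancy, get_l2_read_queue_occupancy, FILL_L2), the header path, and the throttling thresholds are assumptions based on the public DPC-2 kit and may not match the release exactly.

```c
/* Minimal next-line prefetcher sketch for the DPC2Sim L2 hooks.
   Helper names and the header path are assumed from the DPC-2 kit;
   check the kit's interface header for the exact declarations. */
#include "prefetcher.h"   /* assumed DPC2Sim kit header */

void l2_prefetcher_initialize(int cpu_num)
{
    /* A stateless next-line prefetcher needs no setup. */
}

/* Called on every L2 read; this is the prefetcher's entry point. */
void l2_prefetcher_operate(int cpu_num, unsigned long long int addr,
                           unsigned long long int ip, int cache_hit)
{
    /* Back off when the L2 looks busy (illustrative thresholds). */
    if (get_l2_mshr_occupancy(cpu_num) > 8 ||
        get_l2_read_queue_occupancy(cpu_num) > 24)
        return;

    /* Request the next 64-byte line, filling into the L2. */
    unsigned long long int next_line = ((addr >> 6) + 1) << 6;
    l2_prefetch_line(cpu_num, addr, next_line, FILL_L2);
}
```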

Championship Scoring
- 4 configurations: Default (no knobs), Small LLC, Low bandwidth, Scrambled loads
- Score for each configuration: Geomean((Prefetcher IPC) / (No-prefetcher IPC)) (spelled out below)
- Final score is the sum of the 4 per-configuration scores
- (20 SPEC CPU 2006 workloads) x (3 traces/workload) x (1 B instructions/trace) x (4 configurations) = 240 B simulated instructions
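Spelled out, and assuming the geometric mean is taken per configuration over its 60 traces (20 workloads x 3 traces each), the scoring above amounts to:

```latex
\[
S_{\text{config}} \;=\; \left( \prod_{t=1}^{60} \frac{\mathrm{IPC}^{\text{prefetch}}_{t}}{\mathrm{IPC}^{\text{no prefetch}}_{t}} \right)^{1/60},
\qquad
S_{\text{final}} \;=\; S_{\text{Default}} + S_{\text{Small LLC}} + S_{\text{Low BW}} + S_{\text{Scrambled}}
\]
```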

Total Scores

Default Configuration Scores

Small LLC Scores

Low Bandwidth Scores

Scrambled Loads Scores

Effect of Scrambling Loads: IPC(scrambled)/IPC(default)

Idealized Total Score: Max(All Prefetchers)
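One plausible reading of "Max(All Prefetchers)", assuming the best submitted prefetcher is chosen independently for each trace before the geometric mean, is:

```latex
\[
S^{\text{ideal}}_{\text{config}} \;=\; \left( \prod_{t=1}^{60} \; \max_{p \,\in\, \text{prefetchers}} \frac{\mathrm{IPC}^{\,p}_{t}}{\mathrm{IPC}^{\text{no prefetch}}_{t}} \right)^{1/60}
\]
```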

Accepted Workshop Prefetchers Scores

Awards

3rd place - Prefetching On-time and When It Works (Ibrahim Burak Karsli, Mustafa Cavus, and Resit Sendag)
2nd place - Towards Bandwidth-Efficient Prefetching with Slim AMPM (Vinson Young and Ajit Krisshna)
1st place - A Best-Offset Prefetcher (Pierre Michaud)

Thanks for participating!