CISC 879 - Machine Learning for Solving Systems Problems Arch Explorer Lecture 5 John Cavazos Dept of Computer & Information Sciences University of Delaware.

Presentation transcript:

CISC Machine Learning for Solving Systems Problems Arch Explorer Lecture 5 John Cavazos Dept of Computer & Information Sciences University of Delaware

Motivation: need for systematic quantitative comparison [MICRO 2004, Gracia-Pérez et al.]

Computer Arch Research

Design space exploration: multiple objectives (time-to-market, area, power, execution time). Need more than intuition and experience!

ArchExplorer (archexplorer.org): fully automatic. Website: upload, daily update. Server-side infrastructure: database, simulation cluster (pick design points, test, add results).

How to compare?
1. Custom simulator
2. Hardware compatibility
3. Software compatibility
4. Upload
[Figure: a custom simulator (CPU pipeline, IL1, DL1, TLB, BP, L2, MEM) turned into a wrapped simulator with parameter ranges.]

Hardware compatibility: instruction caches, data caches, branch predictors, interconnects, main memory, accelerators, ...

Software compatibility: isolate the hardware block, possibly by moving from centralized control to distributed control.

Software compatibility (continued): self-configuration and parameter legality; models of computation; wrapping in SystemC, based on the UNISIM communication layer.

Case study: memory sub-system for the embedded processor PowerPC405. 8 different cache modules available; complex hierarchies automatically explored; designs ranked for performance, power, energy, area, ...
Cache modules:
1. Victim Cache
2. Timekeeping Victim Cache
3. Stride Prefetcher
4. Content-Directed Prefetcher
5. Stride + Content-Directed Prefetcher
6. Tag Prefetcher
7. Global History Prefetcher
8. Skewed associative cache

Accurate comparison needs compiler tuning as well. [Figure: the relative ranking of two designs, P1 and P2, under the baseline compiler can reverse once the compiler is tuned to P1 and tuned to P2.]
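
To make the point about compiler tuning concrete, here is a minimal Python sketch. The cycle counts are hypothetical, chosen only to show one way a ranking can flip; they are not results from the lecture, and the flag-set names (`baseline`, `tuned_to_P1`, `tuned_to_P2`) are illustrative labels.

```python
# Hypothetical cycle counts (lower is better). These are NOT measurements
# from the lecture; they only illustrate a ranking reversal.
cycles = {
    ("P1", "baseline"): 120,
    ("P2", "baseline"): 100,
    ("P1", "tuned_to_P1"): 85,
    ("P2", "tuned_to_P2"): 95,
}


def faster(a, b, flags_a, flags_b):
    """Return the design with the lower cycle count under the given flags."""
    return a if cycles[(a, flags_a)] < cycles[(b, flags_b)] else b


# Same compiler flags for both designs: P2 looks better.
baseline_winner = faster("P1", "P2", "baseline", "baseline")
# Each design gets flags tuned for it: P1 looks better.
tuned_winner = faster("P1", "P2", "tuned_to_P1", "tuned_to_P2")
print(baseline_winner, tuned_winner)
```

With these made-up numbers the baseline comparison and the tuned comparison disagree, which is why comparing designs under a single fixed set of compiler flags can mislead.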

Best data cache mechanisms per area
Conclusions:
1. Contrast to Gracia-Pérez et al. [MICRO 2004]
2. No clear winner
3. Close to tuned parametric cache

Composing cache hierarchies

Speedup and Energy Improvement

Check out this website: ARCHEXPLORER.ORG

Conclusion: permanent open competition(s). Future: superscalar processor, branch predictor repository, multi-cores. Open for your ideas: NoC, compiler extensions, ...

Statistical Exploration: Genetic Search Algorithm. Convergence. Permanently ranks all designs per area bucket (by speedup or power), assigning higher probability to better points; picks a point according to that distribution; mutations and crossover; natural selection. [Figure: CPU with caches ($), branch predictor (BP), and MEM.] Source: Veerle Desmet, Sylvain Girbal, Olivier Temam, 6th HiPEAC Industrial Workshop, Thales, Nov 26th, 2008.
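
The genetic search steps listed on this slide can be sketched in Python roughly as follows. This is an illustrative sketch, not ArchExplorer's implementation: the rank-based weighting, the uniform crossover, the mutation rate, and the parameter names `size_kb` and `assoc` are all assumptions made for the example.

```python
import random


def selection_weights(scores):
    """Rank-based weights: the best score in the bucket gets the largest weight."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # worst first
    weights = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        weights[i] = rank
    return weights


def pick(designs, scores, rng):
    """Pick one design according to the rank-based distribution."""
    return rng.choices(designs, weights=selection_weights(scores), k=1)[0]


def crossover(a, b, rng):
    """Uniform crossover: each parameter comes from one of the two parents."""
    return {k: rng.choice([a[k], b[k]]) for k in a}


def mutate(design, ranges, rng, rate=0.2):
    """With probability `rate`, resample a parameter from its legal range."""
    return {k: (rng.choice(ranges[k]) if rng.random() < rate else v)
            for k, v in design.items()}


def next_design(bucket, ranges, rng):
    """Produce the next design point to simulate for one area bucket."""
    designs = [d for d, _ in bucket]
    scores = [s for _, s in bucket]
    parent_a = pick(designs, scores, rng)
    parent_b = pick(designs, scores, rng)
    return mutate(crossover(parent_a, parent_b, rng), ranges, rng)


# Hypothetical parameter ranges and speedups for one area bucket.
ranges = {"size_kb": [4, 8, 16, 32], "assoc": [1, 2, 4]}
bucket = [
    ({"size_kb": 8, "assoc": 1}, 1.00),
    ({"size_kb": 16, "assoc": 2}, 1.10),
    ({"size_kb": 32, "assoc": 4}, 1.05),
]
child = next_design(bucket, ranges, random.Random(0))
print(child)
```

In this sketch "natural selection" is implicit: a new design only influences later picks once it has been simulated, scored, and ranked within its area bucket.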

Features for Systematic DSE
Overview: standardized interfaces; module repository; module parameter tuning; module exploration; compiler exploration; design space exploration.
Compatibility database: module category, module interfaces, known models, probing neighbors.
Parameter check and parameter introspection: configuration validity, ranges, parameter relationships. Example for DRAM: nBanks ∈ {2;4;8}, tRAS+tCD<tRCD.
Focused search algorithm: selection probability, fast convergence.
Compiler flag database: predictive modeling, compiler flags, machine description; benchmarks and datasets.
Modules shown in the repository: PPC, ARM, WB$, VC$, SP$, NB WB$, TVC$, CDP$, CDPSP$, TagP$, GHB$, BUS, DRAM.
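
The parameter check on this slide (ranges plus parameter relationships, with the DRAM example nBanks in {2;4;8} and tRAS+tCD<tRCD) could look roughly like the Python sketch below. Only the nBanks set and the form of the timing relationship come from the slide; the timing value ranges and all function names are hypothetical.

```python
from itertools import product

# Allowed values per parameter. Only nBanks in {2, 4, 8} comes from the slide;
# the timing ranges below are hypothetical placeholders.
RANGES = {
    "nBanks": [2, 4, 8],
    "tRAS": [2, 4],
    "tCD": [1, 2],
    "tRCD": [4, 6],
}

# Relationships between parameters, written as predicates over a configuration.
# The single rule is the relationship shown on the slide.
RELATIONSHIPS = [
    lambda c: c["tRAS"] + c["tCD"] < c["tRCD"],
]


def is_legal(config):
    """A configuration is legal if every value is in range and every rule holds."""
    in_range = all(config[name] in allowed for name, allowed in RANGES.items())
    return in_range and all(rule(config) for rule in RELATIONSHIPS)


def legal_configs():
    """Enumerate every legal point in the (small) design space."""
    names = list(RANGES)
    for values in product(*(RANGES[n] for n in names)):
        config = dict(zip(names, values))
        if is_legal(config):
            yield config


print(is_legal({"nBanks": 4, "tRAS": 2, "tCD": 1, "tRCD": 4}))
print(is_legal({"nBanks": 3, "tRAS": 2, "tCD": 1, "tRCD": 4}))
```

Filtering out illegal points before simulation keeps the search from spending cluster time on configurations a module would reject anyway.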