Faustino J. Gomez, Doug Burger, and Risto Miikkulainen


A Neuroevolution Method for Dynamic Resource Allocation on a Chip Multiprocessor
Faustino J. Gomez, Doug Burger, and Risto Miikkulainen
Presented by: Harman Patial, Dept. of Computer & Information Sciences, University of Delaware

Background
Multi-core chips are becoming the norm, and the latency gap between main memory and the fastest cache keeps growing. For best performance, on-chip resources such as the cache memory and off-chip resources such as memory bandwidth must be managed dynamically.

Solution
A controller that uses CMP state information to periodically reassign cache banks to the cores. The controller is evolved with a neuroevolution algorithm called Enforced Subpopulations (ESP), which extends the Symbiotic, Adaptive Neuroevolution algorithm (SANE).

Evolving a CMP Controller
The problem is restricted to evolving a recurrent neural network that manages the L2 cache of a CMP with C cores.
Evaluation – a set of u neurons is selected randomly, one from each subpopulation, and combined to form a neural network.
Recombination – each neuron's average fitness is calculated by dividing its cumulative fitness by the number of trials in which it participated.
The evaluation/recombination cycle is repeated until a network that performs sufficiently well is found.
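The evaluation phase above can be sketched as follows. This is a minimal illustration of the ESP credit-assignment idea, not the paper's code: the function names, the fixed trial count, and the list-based representation are all assumptions.

```python
import random

def esp_evaluation_phase(subpops, evaluate, trials=200):
    """One ESP evaluation phase (illustrative sketch). Each trial picks one
    neuron at random from each subpopulation, forms a network from them,
    and credits the network's fitness back to every participating neuron."""
    total = [[0.0] * len(pop) for pop in subpops]  # cumulative fitness
    count = [[0] * len(pop) for pop in subpops]    # trials participated in
    for _ in range(trials):
        picks = [random.randrange(len(pop)) for pop in subpops]
        team = [pop[i] for pop, i in zip(subpops, picks)]
        fitness = evaluate(team)                   # evaluate the whole network
        for s, i in enumerate(picks):
            total[s][i] += fitness
            count[s][i] += 1
    # a neuron's score is its cumulative fitness / number of trials joined
    return [[total[s][i] / count[s][i] if count[s][i] else 0.0
             for i in range(len(subpops[s]))]
            for s in range(len(subpops))]
```

A real recombination step would then crossover and mutate within each subpopulation using these per-neuron scores.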

CMP Control Network

Evolving a CMP Controller
The input layer receives instructions per cycle (IPC), L1 misses, and L2 misses for each core. Because the networks are recurrent, the input also includes the previous hidden-layer activations, for a total of 3C + u inputs. The output layer has one unit per core, whose activation value indicates the amount of cache desired by that core.
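One decision step of such a network can be sketched like this. The weight layout, activation function, and helper names are assumptions for illustration; the paper does not specify its exact architecture at this level.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def controller_step(w_in, w_out, core_stats, hidden_prev):
    """One decision step of the recurrent control network (sketch).
    core_stats: 3 values per core (IPC, L1 misses, L2 misses) -> length 3C.
    hidden_prev: previous hidden activations -> length u.
    The full input is therefore 3C + u values, matching the text above."""
    x = core_stats + hidden_prev
    hidden = [sigmoid(dot(row, x)) for row in w_in]        # u hidden units
    output = [sigmoid(dot(row, hidden)) for row in w_out]  # one unit per core
    return output, hidden
```

Each output activation is then interpreted as the amount of cache that core wants; the hidden activations are fed back in at the next step.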

Simulation Environment
Controllers were evolved in an approximation of the CMP environment built from traces collected with the SimpleScalar processor simulator. A set of traces was developed for the following SPEC2000 benchmarks: art, equake, gcc, gzip, parser, perlbmk, and vpr. Each benchmark's trace set consists of one trace for each possible L2 cache size.

Simulation Environment
Each trace records the IPC, L1 misses, and L2 misses of the simulated processor every 100,000 instructions, using the DEC Alpha 21264 configuration. By combining n traces, a CMP with n processing cores can be approximated.

Simulation Environment

Network Evaluation
Networks are evaluated by having them interact with the trace-based environment for a fixed number of control decisions. At the start, the environment is initialized by selecting a set of C benchmarks and allocating an equal amount of L2 cache to each core (total L2 / C).

Network Evaluation
When a trace runs out, the environment switches to the trace of a different benchmark at the same cache size. With 7 possible cache sizes and 7 benchmarks, a total of 49 traces were used to implement the environment. The fitness of a network is the chip's IPC averaged over the duration of the trial.

Network Evaluation
In a real CMP, reassigning a cache bank from core A to core B would make the entire caches of A and B unavailable for a significant time. The simulation ignores this overhead and simply reconfigures the chip by switching to the trace corresponding to the new cache size; the new trace resumes at the point where the previous one left off.
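The trace-driven evaluation loop described on the last few slides can be sketched as below. The interfaces are assumptions, not the paper's simulator: `traces[(bench, banks)]` is presumed to hold one (IPC, L1 misses, L2 misses) sample per 100,000-instruction interval, and `shares_to_banks` is a hypothetical quantization helper the paper does not specify.

```python
def shares_to_banks(desired, total_banks):
    """Quantize desired cache shares into whole banks summing to total_banks
    (hypothetical helper; assumes total_banks >= number of cores)."""
    s = sum(desired) or 1.0
    banks = [max(1, round(d / s * total_banks)) for d in desired]
    while sum(banks) > total_banks:             # fix rounding overshoot
        banks[banks.index(max(banks))] -= 1
    while sum(banks) < total_banks:             # fix rounding undershoot
        banks[banks.index(min(banks))] += 1
    return banks

def evaluate_network(controller, traces, benchmarks, total_banks, steps):
    """Trace-driven evaluation sketch. The controller maps the per-core
    stats to a desired cache share per core; fitness is average chip IPC."""
    C = len(benchmarks)
    alloc = [total_banks // C] * C      # start from an equal split
    pos = [0] * C
    ipc_sum, samples = 0.0, 0
    for _ in range(steps):
        stats = []
        for c in range(C):
            ipc, l1m, l2m = traces[(benchmarks[c], alloc[c])][pos[c]]
            stats += [ipc, l1m, l2m]
            ipc_sum += ipc
            samples += 1
        alloc = shares_to_banks(controller(stats), total_banks)
        pos = [p + 1 for p in pos]      # each trace resumes where it left off
    return ipc_sum / samples            # fitness: average chip IPC
```

Switching `alloc[c]` simply indexes a different pre-recorded trace at the same position, which is exactly how the simulation sidesteps the bank-reassignment overhead.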

Results
Five simulations were run on a 14-processor Sun Ultra Enterprise for approximately 1000 generations each. At the end of each run, the fitness of the best network was compared with the baseline performance. The networks showed an average improvement of 16% over the baseline.

Results

Results
The best network from each of the five simulations was subjected to a generalization test: 1000 trials in which the network controls the chip for 1 billion instructions under random initial conditions. The networks retained a 13% average performance advantage over the baseline and, more importantly, outperformed the baseline on every trial.

Questions?