1 Sampling-based Program Locality Approximation. Yutao Zhong, Wentao Chang. Department of Computer Science, George Mason University. June 8th, 2008.

Presentation transcript:

1 Sampling-based Program Locality Approximation. Yutao Zhong, Wentao Chang. Department of Computer Science, George Mason University. June 8th, 2008.

2 Outline. Background information; motivation; our sampling approach; experimental results.

3 Reuse distance and reuse signature. Example trace: a b c a a c b (the figure marks a starting point and an ending point of a reuse, with distances of 2). Reuse distance: the number of distinct data elements accessed between two consecutive uses of the same element. Reuse signature: a histogram of reuse distances, showing how the distances are distributed over different lengths.
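The two definitions on this slide can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration rather than the measurement method used in the talk; the function names `reuse_distances` and `reuse_signature` are hypothetical, and first accesses are reported as `None` by assumption.

```python
from collections import Counter

def reuse_distances(trace):
    """Return one entry per access: its reuse distance, or None for a first access."""
    last_pos = {}      # element -> index of its most recent access
    distances = []
    for i, x in enumerate(trace):
        if x in last_pos:
            # Count distinct elements strictly between the two uses of x.
            distances.append(len(set(trace[last_pos[x] + 1:i])))
        else:
            distances.append(None)
        last_pos[x] = i
    return distances

def reuse_signature(trace):
    """Histogram of reuse distances; first accesses are skipped."""
    return Counter(d for d in reuse_distances(trace) if d is not None)

trace = list("abcaacb")                    # the example trace from the slide
print(reuse_distances(trace))              # [None, None, None, 2, 0, 1, 2]
print(sorted(reuse_signature(trace).items()))  # [(0, 1), (1, 1), (2, 2)]
```

The two reuses with distance 2 (the second `a` and the final `b`) are consistent with the distances marked in the slide's figure.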

4 Reuse signature application. Relationship to cache behavior: a capacity miss occurs when reuse distance ≥ cache size, so reducing reuse distance improves cache effectiveness. Current applications: predict cache miss rate [Zhong+03] [Marin & Mellor-Crummey 04] [Fang+05] [Zhong+07]; reorganize data [Zhong+04]; provide caching hints [Beyls & D’Hollander 02]; evaluate program optimizations [Beyls & D’Hollander 01] [Ding 00].
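The relationship between reuse distance and capacity misses can be illustrated with a small sketch. It assumes a fully associative LRU cache whose size is counted in data elements, and it counts first accesses (marked `None`) as cold misses; the function name `predicted_miss_rate` is hypothetical and not taken from the cited work.

```python
def predicted_miss_rate(distances, cache_size):
    """Fraction of accesses predicted to miss in a fully associative LRU cache.

    distances:  one reuse distance per access, None for a first access.
    cache_size: capacity in data elements (an assumption of this sketch).
    """
    if not distances:
        return 0.0
    # A reuse misses when at least cache_size distinct elements intervened.
    misses = sum(1 for d in distances if d is None or d >= cache_size)
    return misses / len(distances)

# Reuse distances of the trace a b c a a c b.
distances = [None, None, None, 2, 0, 1, 2]
for size in (1, 2, 3):
    print(size, predicted_miss_rate(distances, size))
```

For this trace the sketch predicts 6, 5, and 3 misses out of 7 accesses for cache sizes 1, 2, and 3, which matches a hand simulation of LRU replacement.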

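5 Reuse distance measurement. Data structures: access time table, access trace, distance histogram. (Figure: get the accessed memory address; search and update the address in the access time table; search the access trace, count the distance, and update the last access; record the distance in the distance histogram.) Costs: ① large space and a long counting time are required to store traces and count memory accesses; ② the effort is enormous for memory-intensive programs. Example trace: a c a b b a (the figure marks a starting point, an ending point, and a distance of 1).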

6 Motivation. Sampling is generally effective at reducing the overhead of program behavior profiling. We aim to balance efficiency and accuracy: sample only 1% of memory accesses; improve measurement speed by 7.5 times on average; achieve over 99% accuracy.

7 Sampling algorithms. Use the common structure of bursty tracing [Hirzel & Chilimbi 01], which alternates sampling intervals I_S and hibernating intervals I_H. Sampling rate: r = |I_S| / (|I_S| + |I_H|). Naïve sampling: turn off profiling during hibernating intervals; no guarantee of accuracy.

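8 Naïve sampling. Memory access trace: .. c a b c a c a b c a c a b c d a .... (Figure: the trace is divided into alternating hibernating intervals I_H and sampling intervals I_S, with accesses ① to ⑤ marked; the figure labels distances of 1 and 3 and flags an inaccurate measurement.)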
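The interval structure and the naïve scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: intervals have fixed lengths measured in accesses, each period starts with the hibernating interval, and the names `is_sampled`, `sampling_rate`, and `naive_sampled_distances` are hypothetical.

```python
def is_sampled(t, hib_len, samp_len):
    """True if access index t falls in a sampling interval.

    Assumes each period is hib_len hibernating accesses
    followed by samp_len sampled accesses.
    """
    return t % (hib_len + samp_len) >= hib_len

def sampling_rate(hib_len, samp_len):
    """r = |I_S| / (|I_S| + |I_H|)."""
    return samp_len / (hib_len + samp_len)

def naive_sampled_distances(trace, hib_len, samp_len):
    """Reuse distances measured with profiling turned off while hibernating."""
    last_pos = {}   # element -> index into `seen` of its last observed access
    seen = []       # only the accesses the profiler actually observed
    out = []
    for t, x in enumerate(trace):
        if not is_sampled(t, hib_len, samp_len):
            continue                      # hibernating: the access is invisible
        if x in last_pos:
            out.append(len(set(seen[last_pos[x] + 1:])))
        last_pos[x] = len(seen)
        seen.append(x)
    return out

trace = list("xyabcdab")
print(sampling_rate(99, 1))                  # 0.01, i.e. a 1% sampling rate
print(naive_sampled_distances(trace, 2, 2))  # [1, 1]
```

In this made-up trace the true reuse distance of both the second `a` and the second `b` is 3, because `c` and `d` are accessed during hibernation. The naïve scheme never sees them and reports 1, which illustrates why it carries no accuracy guarantee.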

9 Biased sampling. Ignore any datum that has been referenced within the current hibernating period. The measured distance is always larger than or equal to the actual distance. The probability of being sampled is not uniform.

10 Biased sampling. Memory access trace: .. c a b c a f a b c a c a b f d a .... (Figure: the trace is divided into alternating hibernating intervals I_H and sampling intervals I_S, with accesses ① to ⑤ marked.)

11 History-preserved representative sampling. Add a tag to each address in the access trace. Mark references made within a sampling period as sampled in the tag. A reuse is sampled only when its starting point is marked as sampled.

12 History-preserved representative sampling. Memory access trace: .. c a b c a f a b c a c a b f d a .... (Figure: the trace is divided into alternating hibernating intervals I_H and sampling intervals I_S, with accesses ① to ⑤ marked.)
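The sampling rule for history-preserved representative sampling can be sketched as below. Several points are assumptions of this sketch rather than details confirmed by the slides: the full access history is kept so distances are exact, a reuse is recorded whenever its starting point lies in a sampling interval (wherever its ending point falls), intervals have fixed lengths with hibernation first, and the sampled tag is derived from the access index instead of being stored. The names `is_sampled` and `representative_distances` are hypothetical.

```python
def is_sampled(t, hib_len, samp_len):
    """True if access index t falls in a sampling interval (hibernation comes first)."""
    return t % (hib_len + samp_len) >= hib_len

def representative_distances(trace, hib_len, samp_len):
    """Exact reuse distances, kept only for reuses whose starting point was sampled."""
    last_pos = {}   # element -> index of its most recent access (history preserved)
    out = []
    for t, x in enumerate(trace):
        start = last_pos.get(x)
        # The "tag" of the starting point is recomputed from its index.
        if start is not None and is_sampled(start, hib_len, samp_len):
            out.append(len(set(trace[start + 1:t])))
        last_pos[x] = t   # every access updates the history, sampled or not
    return out

print(representative_distances(list("xyabcdab"), 2, 2))  # [3, 3]
print(representative_distances(list("abab"), 2, 2))      # []
```

On the first made-up trace the sketch reports the true distance 3 for both reuses, because accesses made during hibernation still count toward the distance. On the second trace nothing is recorded, since both starting points fall in a hibernating interval.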

13 Further improvements. Simplifying maintenance in hibernating intervals: the reference trace is implemented as a splay tree [Ding & Zhong 03]; in a sampling period, the tree is fully maintained; in a hibernating period, instead of adding a new leaf node for each access, we construct a single node for the whole hibernating period with a counter of the number of distinct accesses. Fast sample tag marking and checking: to save space, we fix the lengths of the sampling and hibernating periods and avoid the additional tag.

14 Experiments. Benchmarks from SPEC 2006, Olden, and Chaos. Floating-point programs: CactusADM, Milc, Soplex, Apsi, MolDyn. Integer programs: Bzip2, Gcc, Libquantum, Perimeter, TSP. Instrumentation tool: Valgrind. Sampling rate: 1%. We run each benchmark with 3 to 6 different inputs and repeat each run three times.

15 Experiments, cont’d. Comparison of accuracy and efficiency against: Ding and Zhong’s approximation method [Ding & Zhong 03]; time distance measurement [Shen+07]. Implementation of four algorithms: naïve sampling, biased sampling, and basic and optimized representative sampling.

16 Accuracy

17 Efficiency. Sampling even outperforms the lower bound, time distance measurement. Generally, the speedup is smaller when the input size is small.

18 Efficiency. Speedup of basic representative sampling: around 4 to 5 times in most cases. Speedup of optimized representative sampling: around 7 to 10 times in most cases and up to 33 times; the geometric mean is 7.5. Sampling rate effect (TSP): [figure].

19 Related work. Reuse signature collection: [Mattson+70] [Bennett & Kruskal 75] [Olken 81] [Kim+91] [Sugumar & Abraham 93] [Almasi+02] [Ding & Zhong 03] [Shen+07]. Selective monitoring, time sampling: [Zagha+96] [Anderson+97] [Burrows+00] [Whaley 00] [Arnold & Sweeney 00] [Arnold & Ryder 01] [Hirzel & Chilimbi 01] [Chilimbi & Hirzel 02] [Itzkowitz+03] [Arnold & Grove 05]. Selective monitoring, data sampling: [Larus 90] [Ding & Zhong 02] [Zhao+07]. Uses of efficient locality analysis: [Huang & Shen 96] [Li+96] [Ding 00] [Beyls & D’Hollander 01] [Almasi+02] [Beyls & D’Hollander 02] [Zhong+04] [Marin & Mellor-Crummey 04] [Fang+05] [Zhong+07].

20 Future work. Dynamically adjust the sampling and hibernating lengths. Store references in a temporary buffer and then process them in batches. Combine time sampling with data sampling.

21 Thank you! Questions?