Cache-Conscious Runtime Optimization for Ranking Ensembles Xun Tang, Xin Jin, Tao Yang Department of Computer Science University of California at Santa Barbara.

Presentation transcript:

Cache-Conscious Runtime Optimization for Ranking Ensembles
Xun Tang, Xin Jin, Tao Yang
Department of Computer Science, University of California at Santa Barbara
SIGIR'14

Focus and Contributions

Challenge in query processing
- Fast ranking score computation without accuracy loss in multi-tree ensemble models

Contributions
- Investigate data traversal methods for fast score calculation with large multi-tree ensemble models
- Propose a 2D blocking scheme for better cache utilization with a simple code structure

Motivation

Ranking ensembles are effective in web search and other data applications
- E.g., GBRT (gradient-boosted regression trees)

A large number of trees are used to improve accuracy
- Winning teams at the Yahoo! Learning-to-Rank Challenge used ensembles with 2K to 20K trees, or even 300K trees with bagging methods

Computing large ensembles is time-consuming
- Access to irregular document attributes impairs CPU cache reuse
  - Unorchestrated slow memory accesses incur significant cost
  - Memory access latency is about 200x that of the L1 cache
- Dynamic tree branching impairs instruction branch prediction

Previous Work

Approximate processing: trades off ranking accuracy for performance
- Early exit [Cambazoglu et al. WSDM'10]
- Tree trimming [Asadi and Lin, ECIR'13]

Architecture-conscious solution
- VPred [Asadi et al. TKDE'13]
  - Converts control dependence to data dependence
  - Loop unrolling with vectorization to reduce instruction branch misprediction and mask slow memory access latency

Our Key Idea: Optimize Data Traversal

Two existing solutions:
- Document-ordered Traversal (DOT)
- Scorer-ordered Traversal (SOT)

Our Proposal: 2D Block Traversal

Algorithm Pseudo Code

Why Better?

Total slow memory accesses in score calculation, compared across DOT, SOT, and 2D Block (the cost table is a figure in the original slides)
- 2D Block can be up to s times faster, but s is capped by the cache size

2D Block fully exploits cache capacity for better temporal locality

Block-VPred: a combined solution that applies 2D blocking on top of VPred

Evaluations

2D Block and Block-VPred implemented in C
- Compiled with GCC using optimization flag -O3
- Tree ensembles derived by jforests [Ganjisaffar et al. SIGIR'11] using LambdaMART [Burges et al. JMLR'11]

Experiment platforms
- 3.1GHz 8-core AMD Bulldozer FX8120 processors
- Intel X GHz 6-core dual processors

Benchmarks
- Yahoo! Learning-to-Rank, MSLR-30K, and MQ2007

Metrics
- Scoring time
- Cache miss ratios and branch misprediction ratios reported by the Linux perf tool

Scoring Time per Document per Tree in Nanoseconds

Query latency = scoring time * n * m
- n documents ranked with an m-tree model

Query Latency in Seconds

Block-VPred
- Up to 100% faster than VPred
- Faster than 2D blocking in some cases

2D blocking
- Up to 620% faster than DOT
- Up to 213% faster than SOT
- Up to 50% faster than VPred

The fastest algorithm is marked in gray.

Time & Cache Perf. as Ensemble Size Varies

- 2D blocking is up to 287% faster than DOT
- Time and cache performance are highly correlated
- Change of ensemble size affects DOT the most

Time & Cache Perf. as No. of Docs Varies

- 2D blocking is up to 209% faster than SOT
- Block-VPred is up to 297% faster than SOT
- SOT deteriorates the most as the number of documents grows
- 2D blocking combines the advantages of both DOT and SOT

2D Blocking: Time & Cache Perf. as Block Size Varies

- The fastest scoring time and lowest L3 cache miss ratio are achieved with block sizes s=1,000 and d=100, when these trees and documents fit in cache
- Scoring time can be 3.3x slower if the block size is not chosen properly

Impact of Branch Misprediction Ratios

For larger trees or a larger number of documents
- Branch misprediction has more impact
- Block-VPred outperforms 2D Block, with less misprediction and faster scoring

MQ2007 dataset:

                 DOT    SOT    VPred   2D Block   Block-VPred
  50-leaf tree   1.9%   3.0%   1.1%    2.9%       0.9%
  200-leaf tree  6.5%   4.2%   1.2%    9.0%       1.1%

Yahoo! dataset:

                n=1,000   n=5,000   n=10,000   n=100,000
  2D Block      1.9%      2.7%      4.3%       6.1%
  Block-VPred   1.1%      0.9%      0.84%      0.44%

Discussions

- When multi-tree score calculation per query is parallelized to reduce latency, 2D blocking still maintains its advantage
- For small n, multiple queries can be combined to fully exploit cache capacity
  - Combining leads to a 48.7% time reduction with the Yahoo! 150-leaf 8,051-tree ensemble when n=10

Future work
- Extend to non-tree ensembles by iteratively selecting a fixed number of base rank models that fit in fast cache

Acknowledgments
- U.S. NSF IIS
- Travel grant from ACM SIGIR