Forest Packing: Fast, Parallel Decision Forests


Forest Packing: Fast, Parallel Decision Forests
Author: James Browne
In collaboration with: Disa Mhembere, Tyler M. Tomita, Joshua T. Vogelstein, Randal Burns
17/11/2019

Agenda
What is Forest Packing?
Why is forest inference slow?
Inference Acceleration: Memory Layout, Traversal Methods
Results

Why do we need fast decisions?

Forest Inference
[Figure: a new observation is routed down each tree in the forest; the trees vote (Class A, Class B, Class A) and the majority class is returned]
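The voting scheme in the figure above can be sketched as follows. This is a minimal illustration, not the Forest Packing implementation; the `Node` layout and the `predictTree`/`predictForest` helpers are assumptions.

```cpp
#include <cassert>
#include <vector>

// One node of a binary decision tree stored in a flat array.
// A negative leftChild marks a leaf; the class label is then in `feature`.
struct Node {
    int feature;      // feature index to test (or class label at a leaf)
    double threshold; // go left if x[feature] <= threshold
    int leftChild;    // index of left child; right child is leftChild + 1
};

// Route one observation down a single tree and return its class vote.
int predictTree(const std::vector<Node>& tree, const std::vector<double>& x) {
    int i = 0;
    while (tree[i].leftChild >= 0) {
        i = (x[tree[i].feature] <= tree[i].threshold)
                ? tree[i].leftChild
                : tree[i].leftChild + 1;
    }
    return tree[i].feature; // leaf stores the class label
}

// Forest inference: every tree votes, and the majority class wins.
int predictForest(const std::vector<std::vector<Node>>& forest,
                  const std::vector<double>& x, int numClasses) {
    std::vector<int> votes(numClasses, 0);
    for (const auto& tree : forest) ++votes[predictTree(tree, x)];
    int best = 0;
    for (int c = 1; c < numClasses; ++c)
        if (votes[c] > votes[best]) best = c;
    return best;
}
```

With three trees voting (Class A, Class B, Class A) as in the figure, `predictForest` returns Class A.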

Standard Inference Reality
[Animation: legend — internal node, leaf node, processed node, cache miss, prefetch instruction. Tree 1, Tree 2, and Tree 3 are traversed one after another over time; each tree is walked to its leaf before the next begins, so each cache miss stalls the entire traversal.]

Inference Acceleration Methods
Model structure (reduced accuracy): make smaller trees, make full trees, use fewer trees
Reduce mispredictions (minimally effective): assume direction, predication
Batching (high latency)
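"Predication" in the list above replaces the hard-to-predict conditional branch at each tree node with straight-line arithmetic, so a wrong guess can never flush the pipeline. A minimal sketch of the two forms of a node step, assuming a hypothetical flat-array layout with the right child at `leftChild + 1`:

```cpp
#include <cassert>
#include <vector>

struct Node {
    int feature;
    double threshold;
    int leftChild; // right child at leftChild + 1; negative marks a leaf
};

// Branchy step: the CPU must predict which child will be taken.
int stepBranchy(const std::vector<Node>& tree, int i,
                const std::vector<double>& x) {
    if (x[tree[i].feature] <= tree[i].threshold)
        return tree[i].leftChild;
    else
        return tree[i].leftChild + 1;
}

// Predicated step: the comparison result (0 or 1) is folded directly
// into the child index, so there is no conditional branch to mispredict.
int stepPredicated(const std::vector<Node>& tree, int i,
                   const std::vector<double>& x) {
    int goRight = x[tree[i].feature] > tree[i].threshold; // 0 or 1
    return tree[i].leftChild + goRight;
}
```

Both functions compute the same child; mainstream compilers lower the second form to a compare plus add (or a conditional move) with no branch.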

Memory Optimizations
Breadth First (BF)
Depth First (DF): contiguous tree space
Combined Leaves (DF-): contiguous tree space, trees share leaves
Statistical Layout (Stat): contiguous likely path
Bin: several trees packed together
[Figure: node-numbering diagrams showing each layout's storage order]
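The "Combined Leaves (DF-)" idea above can be sketched as a flat depth-first array per tree whose child links may point into one leaf table shared by every tree in the forest. The encoding below (negative indices address the shared leaves) is an illustrative assumption, not Forest Packing's actual node format.

```cpp
#include <cassert>
#include <vector>

// Internal node in a depth-first flat layout. A negative child index
// points into the shared leaf table: leaf id = -(index) - 1.
struct DFNode {
    int feature;
    double threshold;
    int left, right;
};

// All trees share one leaf table, so a forest with C classes needs only
// C leaf entries in total instead of one leaf per root-to-leaf path.
struct PackedForest {
    std::vector<std::vector<DFNode>> trees;
    std::vector<int> sharedLeaves; // class label per shared leaf
};

int predictTree(const PackedForest& f, int t, const std::vector<double>& x) {
    const auto& tree = f.trees[t];
    int i = 0;
    for (;;) {
        const DFNode& n = tree[i];
        int next = (x[n.feature] <= n.threshold) ? n.left : n.right;
        if (next < 0) return f.sharedLeaves[-next - 1]; // shared leaf
        i = next;
    }
}
```

Sharing leaves shrinks each tree's footprint, which is what lets several trees fit together contiguously in one bin.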

Memory Optimization: Why Bins?
High-frequency nodes in a single page
Increases cache hits
Reduces cache pollution
[Figure: node access frequency halves at each tree level — 100%, 50%, 25%, 12.5%]

Traversal Optimization: Round-Robin
[Animation, with 2 line-fill buffers: legend — internal node, leaf node, processed node, cache miss, prefetch instruction. Trees 1-3 are advanced one node at a time in turn, so their cache misses overlap instead of serializing.]

Traversal Optimization: Prefetch
[Animation, with 2 line-fill buffers: legend — internal node, leaf node, processed node, cache miss, prefetch instruction. A prefetch is issued for each tree's next node while the other trees are processed, hiding miss latency behind useful work.]

Inference Execution
[Timeline: Standard traversal finishes Tree 1, then Tree 2, then Tree 3; Round-Robin interleaves the three trees; Prefetching overlaps their memory accesses and finishes earliest.]

Prediction Method Comparison
[Charts comparing the prediction methods]

Memory Optimization Comparisons
Forest Packing (FP) is 2x-5x faster than the other optimized memory layouts
[Chart]

Forest Packing: Inference Latency Comparison
Forest Packing (FP) is 10x faster
[Chart]

Forest Packing: Performance on Varying Forest Size
Forest Packing has higher throughput than batching
[Chart: throughput vs. number of trees in the forest, for Forest Packing and R-RerF]

Conclusion
What is Forest Packing?
Why is forest inference slow?
Inference Acceleration: Memory Layout, Traversal Methods
Results:
- Latency reduced by an order of magnitude
- Efficiently uses additional resources
- Comparable throughput to batched systems

Questions? Thank You
Source Code: https://github.com/jbrowne6/forestpacking