Random Forest in Distributed R. Arash Fard, Vishrut Gupta.

Similar presentations
Is Random Model Better? -On its accuracy and efficiency-

Random Forest Predrag Radenković 3237/10
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
SE263 Video Analytics Course Project Initial Report Presented by M. Aravind Krishnan, SERC, IISc X. Mei and H. Ling, ICCV’09.
Spark: Cluster Computing with Working Sets
Paper presentation for CSI5388 PENGCHENG XI Mar. 23, 2005
Decision Tree under MapReduce Week 14 Part II. Decision Tree.
Large-Scale Machine Learning Program For Energy Prediction CEI Smart Grid Wei Yin.
Reduce Instrumentation Predictors Using Random Forests Presented By Bin Zhao Department of Computer Science University of Maryland May
EECS 584, Fall: MapReduce: Simplified Data Processing on Large Clusters. Yunxing Dai, Huan Feng.
Distributed Iterative Training Kevin Gimpel Shay Cohen Severin Hacker Noah A. Smith.
FLANN Fast Library for Approximate Nearest Neighbors
The Chinese University of Hong Kong. Research on Private cloud : Eucalyptus Research on Hadoop MapReduce & HDFS.
The hybrid approach to programming clusters of multi-core architectures.
A Generic Approach for Image Classification Based on Decision Tree Ensembles and Local Sub-windows Raphaël Marée, Pierre Geurts, Justus Piater, Louis Wehenkel.
Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Background: MapReduce and FREERIDE Co-clustering on FREERIDE Experimental.
Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Summary of Contributions Background: MapReduce and FREERIDE Wavelet.
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
Data Mining Lecture 3: Decision Trees. Classification: Definition. Given a collection of records (training set), each record contains a set of attributes,
Performance Issues in Parallelizing Data-Intensive applications on a Multi-core Cluster Vignesh Ravi and Gagan Agrawal
Large-scale Hybrid Parallel SAT Solving Nishant Totla, Aditya Devarakonda, Sanjit Seshia.
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Tekin Bicer Gagan Agrawal 1.
DLS on Star (Single-level tree) Networks Background: A simple network model for DLS is the star network with a master-worker platform. It consists of a.
LOGO Ensemble Learning Lecturer: Dr. Bo Yuan
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Graduate Student Department Of CSE 1.
The τ - Synopses System Yossi Matias Leon Portman Tel Aviv University.
Shared Memory Parallelization of Decision Tree Construction Using a General Middleware Ruoming Jin Gagan Agrawal Department of Computer and Information.
Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.
MATRIX MULTIPLY WITH DRYAD B649 Course Project Introduction.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Combining multiple learners Usman Roshan. Bagging Randomly sample training data Determine classifier C i on sampled data Goto step 1 and repeat m times.
Autonomic scheduling of tasks from data parallel patterns to CPU/GPU core mixes Published in: High Performance Computing and Simulation (HPCS), 2013 International.
Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.
Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
David Adams ATLAS DIAL: Distributed Interactive Analysis of Large datasets David Adams BNL August 5, 2002 BNL OMEGA talk.
 The need for parallelization  Challenges towards effective parallelization  A multilevel parallelization framework for BEM: A compute intensive application.
Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Lecture Notes for Chapter 4 Introduction to Data Mining
High-level Interfaces for Scalable Data Mining Ruoming Jin Gagan Agrawal Department of Computer and Information Sciences Ohio State University.
Experimental Perspectives on Lasso-related Algorithms on Parallel Computing Frameworks
Latest Improvements in the PROOF system Bleeding Edge Physics with Bleeding Edge Computing Fons Rademakers, Gerri Ganis, Jan Iwaszkiewicz CERN.
Ensemble Learning, Boosting, and Bagging: Scaling up Decision Trees (with thanks to William Cohen of CMU, Michael Malohlava of 0xdata, and Manish Amde.
Mustafa Gokce Baydogan, George Runger and Eugene Tuv INFORMS Annual Meeting 2011, Charlotte A Bag-of-Features Framework for Time Series Classification.
1 Munther Abualkibash University of Bridgeport, CT.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi
Computer Science and Engineering Parallelizing Feature Mining Using FREERIDE Leonid Glimcher P. 1 ipdps’04 Scaling and Parallelizing a Scientific Feature.
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Data Driven Resource Allocation for Distributed Learning
Distributed Network Traffic Feature Extraction for a Real-time IDS
Spark Presentation.
PREGEL Data Management in the Cloud
Parallel Density-based Hybrid Clustering
Classification with Perceptrons Reading:
CS548 Fall 2017 Decision Trees / Random Forest Showcase by Yimin Lin, Youqiao Ma, Ran Lin, Shaoju Wu, Bhon Bunnag Showcasing work by Cano,
Exam #3 Review Zuyin (Alvin) Zheng.
Communication and Memory Efficient Parallel Decision Tree Construction
Replication-based Fault-tolerance for Large-scale Graph Processing
MapReduce.
Support for ”interactive batch”
MNIST Dataset Training with Tensorflow
Bin Ren, Gagan Agrawal, Brad Chamberlain, Steve Deitz
Decision Trees By Cole Daily CSCI 446.
Fast and Exact K-Means Clustering
Creative Activity and Research Day (CARD)
Classification with CART
Analysis for Predicting the Selling Price of Apartments Pratik Nikte
Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017
Presentation transcript:

1 Random Forest in Distributed R. Arash Fard, Vishrut Gupta

2 Distributed R
Distributed R is a scalable, high-performance platform for the R language that can leverage the resources of multiple machines.
Easy to use:
  library(distributedR)
  distributedR_start()
Github page:
Coming soon: CRAN installation

3 Distributed R
Standard master–worker framework.
Distributed data structures:
  darray – distributed array
  dframe – distributed data frame
  dlist – distributed list
Parallel execution:
  foreach – function executed remotely on the workers
The master is a normal R console and can run standard R packages (a minimal usage sketch follows below).
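A minimal sketch of how these pieces fit together. The darray constructor arguments and the splits(), update(), getpartition() and npartitions() helpers are taken from my reading of the Distributed R documentation and may differ by version; treat them as assumptions rather than a definitive API reference.

  library(distributedR)
  distributedR_start()                      # start the master and the workers

  # a 9x9 distributed array stored as 3x3 blocks spread across the workers
  da <- darray(dim = c(9, 9), blocks = c(3, 3), data = 1)

  # foreach ships the function to the workers, once per partition; splits()
  # hands each worker its local block and update() writes the result back
  foreach(i, 1:npartitions(da), function(dai = splits(da, i)) {
    dai <- dai * 2
    update(dai)
  })

  getpartition(da)                          # gather the whole array on the master
  distributedR_shutdown()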

4 Random Forest in Distributed R
hpdRF_parallelForest
  Great for small/medium-sized data
  Embarrassingly parallel: each worker builds a fraction of the trees
  Each worker needs the entire data
  Calls the randomForest package
  Very memory intensive
  Doesn't scale well
hpdRF_parallelTree
  Great for large data (1 GB+)
  Not embarrassingly parallel
  Doesn't require all data to be on each worker
  Scales better than hpdRF_parallelForest
  Smaller output model
  Larger Distributed R overhead
  Approximate algorithm
(An illustrative sketch of the embarrassingly parallel scheme follows below.)
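To illustrate the hpdRF_parallelForest idea, here is a sketch of the general scheme, not the package's actual source: each worker grows a fraction of the trees with the standard randomForest package and the master merges the partial forests. The lapply stands in for remote execution; in the real package each call would run on a separate worker via foreach.

  library(randomForest)
  data(iris)

  ntree_total <- 100
  nworkers    <- 4   # pretend each element below runs on a different worker

  # stand-in for remote execution: each "worker" builds ntree_total/nworkers trees
  # on its own copy of the full data set
  partial_forests <- lapply(1:nworkers, function(i) {
    randomForest(Species ~ ., data = iris, ntree = ntree_total / nworkers)
  })

  # the master merges the partial forests; combine() concatenates their trees
  model <- do.call(randomForest::combine, partial_forests)
  model$ntree   # 100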

5 hpdRF_parallelTree details
Distribute the data across machines, then recursively on the leaf nodes:
  1. Compute local histograms
  2. Combine them into global histograms and compute the optimal split
  3. Workers work together to find the best split
  4. Update the tree with the decision rule and create new leaf nodes
(See the illustrative sketch below.)
[Figure: scan feature 7 to create a histogram, compute the best split (a rule of the form X7 > threshold) from the histogram, and build the tree recursively.]
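To make steps 1 and 2 concrete, here is an illustrative histogram-based split search for one numeric feature using Gini impurity and a fixed number of bins, in the spirit of the nBins argument. This is not the package's actual code: in the distributed setting each worker would compute the bin counts on its local rows, and the master would add the per-worker count matrices together before scanning for the best threshold.

  best_split_from_histogram <- function(x, y, nBins = 256) {
    breaks <- seq(min(x), max(x), length.out = nBins + 1)
    bins   <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
    counts <- table(factor(bins, levels = 1:nBins), y)   # nBins x nClasses histogram

    gini <- function(cnt) { p <- cnt / sum(cnt); 1 - sum(p^2) }

    best  <- list(gain = -Inf, threshold = NA)
    total <- colSums(counts)
    for (b in 1:(nBins - 1)) {
      left  <- colSums(counts[1:b, , drop = FALSE])
      right <- total - left
      if (sum(left) == 0 || sum(right) == 0) next
      weighted <- (sum(left) * gini(left) + sum(right) * gini(right)) / sum(total)
      gain <- gini(total) - weighted
      if (gain > best$gain) best <- list(gain = gain, threshold = breaks[b + 1])
    }
    best
  }

  # example on a single machine:
  # best_split_from_histogram(iris$Petal.Length, iris$Species, nBins = 32)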

6 How to use Random Forest in Distributed R
The interface is extremely similar to the randomForest function.
Some additional arguments are required:
  nBins – default value of 256
  nExecutors – no default value (controls how much parallelism hpdRF_parallelForest uses)
  completeModel – default value FALSE (decides whether to calculate the OOB error)
Some output features are not there yet:
  Variable importance
  Proximity matrix
(A sketch of a call using these arguments follows below.)
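Put together, a call that exercises the extra arguments might look like the sketch below. The function name and argument names come from these slides; the data sets ('train', 'test') and the 'label' response column are placeholders, and whether all three extra arguments are meaningful in a single call (nExecutors applies to the parallelForest path, nBins to the parallelTree path) is an assumption.

  # hypothetical call showing the extra arguments
  model <- hpdrandomForest(label ~ ., train,
                           ntree = 100,
                           nBins = 256,           # histogram bins (hpdRF_parallelTree)
                           nExecutors = 8,        # parallelism (hpdRF_parallelForest)
                           completeModel = TRUE)  # also compute the OOB error
  predictions <- predict(model, test)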

7 MNIST dataset with 8.1M observations
  library(distributedR)
  library(HPdclassifier)
  distributedR_start()
  mnist_train <- read.csv("/mnt/mnist_train.csv", sep = "\t")
  mnist_test  <- read.csv("/mnt/mnist_test.csv", sep = "\t")
  model <- hpdrandomForest(response ~ ., mnist_train, ntree = 10)
  predictions <- predict(model, mnist_test)
  distributedR_shutdown()
Prediction accuracy of 99.7% with just 10 trees!
Not recommended to use read.csv: do this in parallel using Distributed R (see the sketch below).
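One way the parallel load could look. This is a hypothetical sketch, not an official loader: the pre-split chunk files, the dframe(npartitions = ...) constructor, and the splits()/update() calls are assumptions built on the primitives listed on slide 3.

  # assume the big CSV has been pre-split into one chunk file per partition,
  # /mnt/mnist_train_part1.csv ... /mnt/mnist_train_part8.csv
  library(distributedR)
  distributedR_start()

  nparts <- 8
  dtrain <- dframe(npartitions = nparts)   # empty distributed data frame (assumed constructor)

  foreach(i, 1:nparts, function(dfi = splits(dtrain, i), idx = i) {
    # each worker reads only its own chunk, so no single machine loads the full file
    dfi <- read.csv(sprintf("/mnt/mnist_train_part%d.csv", idx), sep = "\t")
    update(dfi)
  })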

8 Scalability of hpdRF_parallelTree
R's single-machine randomForest takes roughly 29 hours on the larger machine.
Testing conditions: 1M observations, 100 features, 12 cores per machine.

9 Accuracy of hpdRF_parallelTree

10 Conclusions
Distributed R: multi-core and distributed execution for R.
Random Forest in Distributed R: two parallel implementations, optimized for different scenarios.

11 Appendix: Comparison with Other Implementations
Self-reported results on MNIST (8.1M observations):
  wiseRF – 8 min
  H2O – 19 min
  Spark Sequoia Forest – 6 min
  Spark MLlib – crashed
  Distributed R – 10 min
Distributed R is competitive.
Disclaimer: these results were run on different machines.