© 2010 IBM Corporation, IBM Almaden Research Center
Ricardo: Integrating R and Hadoop
Sudipto Das (1), Yannis Sismanis (2), Kevin S. Beyer (2), Rainer Gemulla (2), Peter J. Haas (2), John McPherson (2)
(1) UC Santa Barbara  (2) IBM Almaden Research Center
Presented by: Luyuang Zhang, Yuguan Li

Outline
- Motivation & Background
- Architecture & Components
- Trading with Ricardo
  - Simple Trading
  - Complex Trading
- Evaluation
- Conclusion

Deep Analytics on Big Data
- Enterprises collect huge amounts of data
  - Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  - User interaction data and history
  - Click and transaction logs
- Deep analysis is critical for a competitive edge
  - Understanding and modeling the data
  - Recommendations to users
  - Ad placement
- Challenge: enable deep analysis and understanding over massive data volumes, exploiting the data to its full potential

Motivating Examples
- Data exploration / model evaluation / outlier detection
- Personalized recommendations
  - For each individual customer/product
  - Many applications: Netflix, Amazon, eBay, iTunes, …
- Difficulty: discerning particular customer preferences
  - Sampling loses the competitive advantage
- Application scenario: movie recommendations
  - Millions of customers
  - Hundreds of thousands of movies
  - Billions of movie ratings

Analyst's Workflow
- Data exploration: deal with the raw data
- Data modeling: deal with the processed data; use the chosen method to build a model that fits the data
- Model evaluation: deal with the built model; use data to test the model's accuracy

Big Data and Deep Analytics – The Gap
- R, SPSS, SAS – a statistician's toolbox
  - Rich statistical, modeling, and visualization functionality
  - Thousands of sophisticated add-on packages, developed by hundreds of statistical experts and available through CRAN
  - Operate on small amounts of data, entirely in memory, on a single server
  - Extensions for data handling are cumbersome
- Hadoop – scalable data management systems
  - Scalable, fault-tolerant, elastic, …
  - "Magnetic": easy to store data
  - Limited deep analytics: mostly descriptive analytics

Filling the Gap: Existing Approaches
- Reducing data size by sampling
  - Approximations might cost the competitive advantage
  - Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]
- Scaling out R
  - Efforts from the statistics community on parallel and distributed variants [SNOW, Rmpi]
  - Main-memory based in most cases
  - Re-implements DBMS and distributed-processing functionality
- Deep analysis within a DBMS
  - Ports statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  - Not sustainable – misses out on R's community development and rich libraries

Ricardo: Bridging the Gap
- David Ricardo, famous economist of the 19th century – "comparative advantage"
- Deep analytics decomposable into a "large part" and a "small part" [Chu et al., NIPS '06]
  - Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  - Recommender systems / latent factorization [our paper]
  - A key requirement for Ricardo is that the amount of data communicated between the two systems be sufficiently small
- The large part includes joins, group-bys, and distributive aggregations
  - Hadoop + Jaql: excellent scalability for large-scale data management
- The small part includes matrix/vector operations
  - R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
- Ricardo establishes "trade" between R and Hadoop/Jaql

Ricardo: Bridging the Gap – Trade
- R sends aggregation-processing queries (written in Jaql) to Hadoop
- Hadoop sends aggregated data to R for advanced statistical processing
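As a sketch of this trade, the driver loop below is written in Python purely for illustration (Ricardo's actual driver is R and the aggregation runs as a Jaql job on Hadoop; all function names and data here are hypothetical). The statistical side repeatedly ships a small parameter vector out and receives back only small aggregates, the squared error and its gradient:

```python
def aggregate_on_cluster(params, ratings):
    """Stand-in for the Hadoop/Jaql job: scans all ratings and returns
    only small aggregates (squared error and gradient) for the current
    parameters of a rank-1 factorization p[i] * q[j]."""
    p, q = params
    e = 0.0
    grad_p = [0.0] * len(p)
    grad_q = [0.0] * len(q)
    for i, j, r in ratings:
        diff = p[i] * q[j] - r
        e += diff * diff
        grad_p[i] += 2.0 * q[j] * diff
        grad_q[j] += 2.0 * p[i] * diff
    return e, grad_p, grad_q

def trade_loop(p, q, ratings, step=0.01, iters=100):
    """Stand-in for the R side: plain gradient descent, trading with
    the cluster once per iteration."""
    for _ in range(iters):
        e, gp, gq = aggregate_on_cluster((p, q), ratings)
        p = [pi - step * g for pi, g in zip(p, gp)]
        q = [qj - step * g for qj, g in zip(q, gq)]
    return p, q, e

# Hypothetical tiny data: (customer, movie, rating) triples.
ratings = [(0, 0, 2.0), (0, 1, 1.0), (1, 0, 1.0)]
p, q, e = trade_loop([1.0, 1.0], [1.0, 1.0], ratings)
```

Only the parameter vectors travel out and only the aggregates travel back; the (in practice, billions of) ratings never leave the cluster.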

R in a Nutshell
- R supports rich statistical functionality

Jaql in a Nutshell
- JSON view of the data
- Jaql example
- Scalable descriptive analysis using Hadoop
- Jaql is a representative declarative interface
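The slide's JSON and Jaql snippets did not survive transcription, so the records below are made up. As a rough illustration, Jaql pipes JSON records through operators such as filter, transform, and group by; the same kind of pipeline over Python dicts looks like:

```python
# Hypothetical JSON ratings records, as Jaql would see them.
ratings = [
    {"i": 1, "j": 10, "rating": 4.0},
    {"i": 1, "j": 11, "rating": 2.0},
    {"i": 2, "j": 10, "rating": 5.0},
]

# Jaql-style pipeline: ratings -> filter $.rating >= 4.0
#                              -> group by $.j into count($)
liked = [r for r in ratings if r["rating"] >= 4.0]
counts = {}
for r in liked:
    counts[r["j"]] = counts.get(r["j"], 0) + 1
```

On Hadoop, each such operator compiles to map and reduce steps, which is what lets the descriptive part of the analysis scale out.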

Ricardo: The Trading Architecture
Complexity of the trade between R and Hadoop:
- Simple trading: data exploration
- Complex trading: data modeling

Simple Trading: Exploratory Analytics
- Gain insights about the data
- Example: top-k outliers for a model
  - Identify the data items on which the model performed most poorly
- Helpful for improving the model's accuracy
- The trade:
  - Build complex statistical models using rich R functionality
  - Parallelize processing over the entire data using Hadoop/Jaql
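A minimal sketch of the top-k outlier trade, in Python for illustration (the names and data are hypothetical): the full scan over the ratings is what runs in parallel on Hadoop, and only the k worst records come back to R for inspection.

```python
import heapq

def topk_outliers(ratings, p, q, k):
    """Return the k (i, j, rating) triples on which the model
    p[i] * q[j] performs worst, i.e. has the largest squared
    residual."""
    return heapq.nlargest(
        k, ratings, key=lambda t: (p[t[0]] * q[t[1]] - t[2]) ** 2
    )

# Hypothetical tiny data set and rank-1 model.
ratings = [(0, 0, 1.0), (0, 1, 3.0), (1, 0, 2.5)]
p, q = [1.0, 2.0], [1.0, 1.0]
worst = topk_outliers(ratings, p, q, k=2)
```

The analyst can then feed the k outliers back into R's modeling functions to refine the model.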

Complex Trading: Latent Factors
- SVD-like matrix factorization
- Minimize the squared error: Σ_{i,j} (p_i · q_j − r_ij)²
- The trade:
  - Use complex statistical models in R
  - Parallelize aggregate computations using Hadoop/Jaql

Complex Trading: Latent Factors
- In the real world, however, there is a vector of factors for each customer and each item!

Latent Factor Models with Ricardo
- Goal
  - Minimize the squared error: e = Σ_{i,j} (p_i · q_j − r_ij)²
  - Numerical methods are needed (large, sparse matrix)
- Pseudocode
  1. Start with an initial guess of the parameters p_i and q_j.
  2. Compute the error and gradient, e.g. ∂e/∂p_i = Σ_j 2 · q_j · (p_i · q_j − r_ij).
  3. Update the parameters; R implements many different optimization algorithms.
  4. Repeat steps 2 and 3 until convergence.
- Data intensive, but parallelizable!
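The gradient formula in step 2 can be checked numerically. A small Python sketch (with made-up data) compares ∂e/∂p_0 against a finite difference of the squared error:

```python
def sq_error(p, q, ratings):
    # e = sum over ratings of (p_i * q_j - r_ij)^2
    return sum((p[i] * q[j] - r) ** 2 for i, j, r in ratings)

def grad_p(p, q, ratings, i):
    # de/dp_i = sum over j of 2 * q_j * (p_i * q_j - r_ij)
    return sum(2.0 * q[j] * (p[i] * q[j] - r)
               for ii, j, r in ratings if ii == i)

# Hypothetical (i, j, r_ij) triples and rank-1 parameters.
ratings = [(0, 0, 2.0), (0, 1, 1.0)]
p, q = [1.5, 1.0], [1.0, 0.5]

# Finite-difference estimate of de/dp_0.
h = 1e-6
fd = (sq_error([p[0] + h, p[1]], q, ratings) - sq_error(p, q, ratings)) / h
```

Both the analytic gradient and the finite difference evaluate to about −1.25 here, confirming the formula.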

The R Component
- Parameters
  - e: squared error
  - de: gradients
  - pq: concatenation of the latent factors for users and items
- R code: optim( c(p,q), f_e, f_de, method="L-BFGS-B" )
- Goal: keep updating pq until it reaches convergence
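The c(p, q) argument packs both factor vectors into a single parameter vector for the optimizer, which the error and gradient functions must unpack again. A Python sketch of that packing step (function names hypothetical; in R this is just c() and indexing):

```python
def pack(p, q):
    # c(p, q): concatenate user and item factors into one
    # parameter vector for the optimizer.
    return list(p) + list(q)

def unpack(pq, n_users):
    # Split the optimizer's parameter vector back into p and q;
    # the error and gradient callbacks do this before computing
    # anything.
    return pq[:n_users], pq[n_users:]

p, q = unpack(pack([0.1, 0.2], [0.3, 0.4, 0.5]), 2)
```

Packing lets a generic optimizer such as optim treat all latent factors as one unconstrained vector.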

The Hadoop and Jaql Component
- Dataset and goal (shown on slide)
- Calculate the squared errors
- Calculate the gradients

Computing the Model
- Three inputs: movie ratings (i, j, r_ij), customer parameters (i, p_i), and movie parameters (j, q_j)
- 3-way join to match r_ij, p_i, and q_j, then aggregate: e = Σ_{i,j} (p_i · q_j − r_ij)²
- The gradients are computed similarly

Aggregation in Jaql/Hadoop

res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p * $.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g = { $.i, $.j } into { g.*, gradient: sum($[*].value) }
")

Result in R (the row with i = null and j = null holds the total squared error; rows keyed by i or j hold the gradient entries; the values were lost in transcription):

  i     j     gradient
  null  null  …
  1     null  …
  2     null  …
  null  1     …
  null  2     …
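The pipeline can be mirrored step by step in Python over hypothetical tiny tables. Note that the i-keyed record carries q and the j-keyed record carries p, as the gradient formulas ∂e/∂p_i = −2 · Σ_j q_j · diff and ∂e/∂q_j = −2 · Σ_i p_i · diff require, where diff = rating − p·q:

```python
# Hypothetical tiny inputs for the join.
ratings = [{"i": 1, "j": 9, "rating": 4.0}]
custPars = {1: 2.0}    # i -> p_i
moviePars = {9: 1.5}   # j -> q_j

# hashJoin + transform + expand: one record for the squared error
# (i = j = null) and one per gradient entry.
expanded = []
for r in ratings:
    p, q = custPars[r["i"]], moviePars[r["j"]]
    diff = r["rating"] - p * q
    expanded.append({"i": None, "j": None, "value": diff ** 2})
    expanded.append({"i": r["i"], "j": None, "value": -2.0 * diff * q})
    expanded.append({"i": None, "j": r["j"], "value": -2.0 * diff * p})

# group by (i, j), summing the values: one row for e and one row
# per gradient component, exactly the small table shipped back to R.
result = {}
for rec in expanded:
    key = (rec["i"], rec["j"])
    result[key] = result.get(key, 0.0) + rec["value"]
```

The group-by is what Hadoop parallelizes; the result has one row per parameter plus one for the error, so it stays small no matter how many ratings there are.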

Integrating the Components
- Remember: we would be running optim( c(p,q), f_e, f_de, method="L-BFGS-B" ) in the R process.

Experimental Evaluation
- 50 nodes at EC2
- Each node: 8 cores, 7 GB memory, 320 GB disk
- Total: 400 cores, 320 GB memory, 70 TB disk space

  Number of Rating Tuples | Data Size in GB
  500 Million             | …
  … Billion               | …
  … Billion               | …
  … Billion               | …

Leveraging Hadoop's Scalability

Leveraging R's Rich Functionality
- optim( c(p,q), f_e, f_de, method="CG" )
- optim( c(p,q), f_e, f_de, method="L-BFGS-B" )

Conclusion
- Scaled latent factor models to terabytes of data
- Provided a bridge: other algorithms in summation form can be mapped and scaled the same way
  - Many algorithms have summation form
  - Decompose into a "large part" and a "small part"
  - [Chu et al., NIPS '06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural networks, PCA, ICA, EM, SVM
- Future & current work
  - Tighter language integration
  - More algorithms
  - Performance tuning

Questions? Comments?