© 2010 IBM Corporation IBM Almaden Research Center Ricardo: Integrating R and Hadoop Sudipto Das 1, Yannis Sismanis 2, Kevin S Beyer 2, Rainer Gemulla 2, Peter J. Haas 2, John McPherson 2 1 UC Santa Barbara 2 IBM Almaden Research Center Presented by: Luyuang Zhang Yuguan Li
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Outline Motivation & Background Architecture & Components Trading with Ricardo – Simple Trading – Complex Trading Evaluation Conclusion 2
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Deep Analytics on Big Data Enterprises collect huge amounts of data – Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, … – User interaction data and history – Click and Transaction logs Deep analysis critical for competitive edge – Understanding/Modeling data – Recommendations to users – Ad placement Challenge: Enable Deep Analysis and Understanding over massive data volumes – Exploiting data to its full potential 3
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 4 Motivating Examples Data Exploration/Model Evaluation/Outlier Detection Personalized Recommendations – For each individual customer/product – Many applications to Netflix, Amazon, eBay, iTunes, … Difficulty: Discern particular customer preferences – Sampling loses Competitive advantage Application Scenario: Movie Recommendations – Millions of Customers – Hundreds of thousands of Movies – Billions of Movie Ratings
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Analyst’s Workflow Data Exploration – Deal with raw data Data Modeling – Deal with processed data – Use assigned method to build model fits the data Model Evaluation – Deal with built model – Use data to test the accuracy of model 5
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Big Data and Deep Analytics – The Gap R, SPSS, SAS – A Statistician’s toolbox – Rich statistical, modeling, visualization functionality – Thousands of sophisticated add-on packages developed by hundreds of statistical experts and available through CRAN – Operate on small data amounts entirely in memory on a single server – Extensions for data handling cumbersome Hadoop – Scalable Data Management Systems – Scalable, Fault-Tolerant, Elastic, … – “Magnetic”: easy to store data – Limited deep analytics: mostly descriptive analytics 6
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Filling the Gap: Existing Approaches Reducing Data size by Sampling – Approximations might result in losing competitive advantage – Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009] Scaling out R – Efforts from statistics community to parallel and distributed variants [SNOW, Rmpi] – Main memory based in most cases – Re-implementing DBMS and distributed processing functionality Deep Analysis within a DBMS – Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout] – Not Sustainable – missing out from R’s community development and rich libraries 7
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 8 Ricardo: Bridging the Gap David Ricardo, famous economist from 19 th century – “Comparative Advantage” Deep Analytics decomposable in “large part” and “small part” [Chu et al., NIPS ‘06] – Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA – Recommender Systems/Latent Factorization [our paper] – A key requirement for Ricardo is that the amount of data that must be communicated between both systems be sufficiently small Large-part includes joins, group bys, distributive aggregations – Hadoop + Jaql: excellent scalability to large-scale data management Small-part includes matrix/vector operations – R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions,etc. Ricardo: Establishes “trade” between R and Hadoop/Jaql
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Ricardo: Bridging the Gap – Trade – R send aggregation-processing queries (written in Jaql) to Hadoop – Hadoop send aggregated data to R for advanced satistical processing 9
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das R in a Nutshell 10
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das R in a Nutshell 11 R supports Rich statistical functionality
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 12 Jaql in a Nutshell JSON View of the data: Jaql Example: Scalable Descriptive Analysis using Hadoop Jaql a representative declarative interface
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 13 Ricardo: The Trading Architecture Complexity of Trade between R and Hadoop ―Simple Trading: Data Exploration ―Complex Trading: Data Modeling
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Simple Trading: Exploratory Analytics Gain insights about data Example - top-k outliers for a model – Identify data items on which the model performed most poorly Helpful for improving accuracy of model The trade: – Use complex statistical models using rich R functionality – Parallelize processing over entire data using Hadoop/Jaql 14
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 15 Complex Trading: Latent Factors SVD-like matrix factorization Minimize Square Error: Σ i,j (p i q j - r ij ) 2 p q The trade: ―Use complex statistical models in R ―Parallelize aggregate computations using Hadoop/Jaql
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Complex Trading: Latent Factors 16 However, in real world……… A vector of factors for each customer and item!
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 17 Latent Factor Models with Ricardo Goal – Minimize Square Error: e = Σ i,j (p i q j - r ij ) 2 – Numerical methods needed (large, sparse matrix) Pseudocode 1.Start with initial guess of parameters p i and q j. 2.Compute error & gradient – E.g., de/dp i = Σ j 2q j (p i q j – r ij ) 3.Update parameters. – R implements many different optimization algorithms 4.Repeat steps 2 and 3 until convergence. p q Data intensive, but parallelizable!
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 18 The R Component Parameters e: squared error de:gradients pq:concatenation of the latent factors for users and items R code optim( c(p,q), f e, f de, method="L-BFGS-B" ) Goal Keeps updating pq until it reaches convergence
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 19 The Hadoop and Jaql Component Dataset Goal
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 20 The Hadoop and Jaql Component Calculate the squared errors Calculate the gradients
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 21 Computing the Model ijr ij ipipi jqjqj Movie Ratings Movie Parameters Customer Parameters 3 way join to match r ij, p i, and q j, then aggregate e = Σ i,j (p i q j - r ij ) 2 Similarly compute the gradients
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 22 Aggregation In Jaql/Hadoop res = jaqlTable(channel, " ratings hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } ) hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } ) transform { $.*, diff: $.rating - $.p*$.q } expand [ { value: pow($.diff, 2.0) }, { $.i, value: -2.0 * $.diff * $.p }, { $.j, value: -2.0 * $.diff * $.q } ] group by g={ $.i, $.j } into { g.*, gradient: sum($[*].value) } ") i j gradient null null null 21 2 null 357 … null 1 9 null 2 64 … Result in R
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 23 Integrating the Components Remember….. We would be running optim( c(p,q), f e, f de, method="L-BFGS-B" ) in R process.
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 24 Experimental Evaluation 50 nodes at EC2 Each node: 8 cores, 7GB Memory, 320GB Disk Total: 400 cores, 320GB Memory, 70TB Disk Space Number of Rating TuplesData Size in GB 500 Million Billion Billion Billion
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 25 Leveraging Hadoop’s Scalability
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 26 Leveraging R’s Rich Functionality – optim( c(p,q), f e, f de, method=“CG" ) – optim( c(p,q), f e, f de, method="L-BFGS-B" )
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das 27 Conclusion Scaled Latent Factor Models to Terabytes of data Provided a bridge for other algorithms with Summation Form can be mapped and scaled – Many Algorithms have Summation Form – Decompose into “large part” and “small part” – [Chu et al. NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM Future & Current Work – Tighter language integration – More algorithms – Performance tuning
IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das Questions?Comments?