Published by Godwin Alexander. Modified over 9 years ago.
1
Ricardo: Integrating R and Hadoop
Sudipto Das (1), Yannis Sismanis (2), Kevin S. Beyer (2), Rainer Gemulla (2), Peter J. Haas (2), John McPherson (2)
(1) UC Santa Barbara, (2) IBM Almaden Research Center
Presented by: Luyuang Zhang, Yuguan Li
© 2010 IBM Corporation, IBM Almaden Research Center
2
IBM Almaden Research, Ricardo: Integrating R and Hadoop, © 2010 IBM Corporation, Sudipto Das {sudipto@cs.ucsb.edu}
Outline
Motivation & Background
Architecture & Components
Trading with Ricardo
  – Simple Trading
  – Complex Trading
Evaluation
Conclusion
3
Deep Analytics on Big Data
Enterprises collect huge amounts of data
  – Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  – User interaction data and history
  – Click and transaction logs
Deep analysis is critical for a competitive edge
  – Understanding/modeling data
  – Recommendations to users
  – Ad placement
Challenge: enable deep analysis and understanding over massive data volumes
  – Exploiting data to its full potential
4
Motivating Examples
Data exploration / model evaluation / outlier detection
Personalized recommendations
  – For each individual customer/product
  – Many applications: Netflix, Amazon, eBay, iTunes, …
Difficulty: discerning particular customer preferences
  – Sampling loses competitive advantage
Application scenario: movie recommendations
  – Millions of customers
  – Hundreds of thousands of movies
  – Billions of movie ratings
5
Analyst's Workflow
Data exploration
  – Works with raw data
Data modeling
  – Works with processed data
  – Uses a chosen method to build a model that fits the data
Model evaluation
  – Works with the built model
  – Uses data to test the accuracy of the model
6
Big Data and Deep Analytics – The Gap
R, SPSS, SAS – a statistician's toolbox
  – Rich statistical, modeling, and visualization functionality
  – Thousands of sophisticated add-on packages developed by hundreds of statistical experts and available through CRAN
  – Operate on small amounts of data, entirely in memory on a single server
  – Extensions for data handling are cumbersome
Hadoop – scalable data management systems
  – Scalable, fault-tolerant, elastic, …
  – "Magnetic": easy to store data
  – Limited deep analytics: mostly descriptive analytics
7
Filling the Gap: Existing Approaches
Reducing data size by sampling
  – Approximations might result in losing competitive advantage
  – Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]
Scaling out R
  – Efforts from the statistics community toward parallel and distributed variants [SNOW, Rmpi]
  – Main-memory based in most cases
  – Re-implements DBMS and distributed processing functionality
Deep analysis within a DBMS
  – Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  – Not sustainable: misses out on R's community development and rich libraries
8
Ricardo: Bridging the Gap
David Ricardo, famous economist of the 19th century
  – "Comparative Advantage"
Deep analytics is decomposable into a "large part" and a "small part" [Chu et al., NIPS '06]
  – Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  – Recommender systems / latent factorization [our paper]
  – A key requirement for Ricardo is that the amount of data that must be communicated between the two systems be sufficiently small
The large part includes joins, group-bys, and distributive aggregations
  – Hadoop + Jaql: excellent scalability for large-scale data management
The small part includes matrix/vector operations
  – R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
Ricardo establishes "trade" between R and Hadoop/Jaql
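To make the "large part" / "small part" split concrete, here is a minimal Python sketch using simple linear regression, one of the summation-form algorithms listed above. It is only an illustration under stated assumptions: the data, the partitioning, and the function names (`partial_sums`, `combine`, `solve_line`) are hypothetical and do not come from Ricardo. Each partition is reduced to five numbers (the large part, a distributive aggregation), and only those five numbers are needed to solve for the model (the small part).

```python
# Hypothetical data: three "partitions" of (x, y) pairs, standing in for
# blocks of a large file that a system like Hadoop would scan in parallel.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]

def partial_sums(rows):
    """Large part: one pass over a partition, returning small sufficient statistics."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1.0
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Distributive aggregation: partial sums from two partitions simply add."""
    return tuple(u + v for u, v in zip(a, b))

def solve_line(stats):
    """Small part: solve the 2x2 normal equations for y = intercept + slope * x."""
    n, sx, sy, sxx, sxy = stats
    det = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / n
    return intercept, slope

total = (0.0, 0.0, 0.0, 0.0, 0.0)
for part in partitions:
    total = combine(total, partial_sums(part))

intercept, slope = solve_line(total)
print(intercept, slope)
```

Whatever the number of input rows, only the five aggregated values cross from the data-management side to the numerical side, which is the "sufficiently small" communication requirement stated above.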
9
Ricardo: Bridging the Gap – Trade
  – R sends aggregation-processing queries (written in Jaql) to Hadoop
  – Hadoop sends aggregated data to R for advanced statistical processing
10
R in a Nutshell
11
R in a Nutshell
R supports rich statistical functionality
12
Jaql in a Nutshell
JSON view of the data:
Jaql example:
Scalable descriptive analysis using Hadoop
Jaql: a representative declarative interface
13
Ricardo: The Trading Architecture
Complexity of trade between R and Hadoop
  – Simple trading: data exploration
  – Complex trading: data modeling
14
Simple Trading: Exploratory Analytics
Gain insights about data
Example: top-k outliers for a model
  – Identify data items on which the model performed most poorly
  – Helpful for improving the accuracy of the model
The trade:
  – Use complex statistical models using rich R functionality
  – Parallelize processing over the entire data using Hadoop/Jaql
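The top-k outlier example can be sketched in a few lines of Python. This is an illustrative sketch, not Ricardo's implementation: the scalar factors `p` and `q`, the ratings, and the two-way partitioning are all hypothetical. It assumes the model's error on a rating is the squared difference between the prediction and the observed rating, and it shows why the task parallelizes: each partition keeps only its own k worst items, and merging those short lists gives the global answer.

```python
import heapq

# Hypothetical fitted scalar factors per customer (p) and per movie (q),
# plus observed ratings as (customer i, movie j, rating) tuples.
p = {1: 1.0, 2: 2.0}
q = {10: 2.0, 11: 1.5}
ratings = [(1, 10, 2.0), (1, 11, 4.0), (2, 10, 4.5), (2, 11, 3.0)]

def squared_error(row):
    """How poorly the model predicts one rating."""
    i, j, r = row
    return (p[i] * q[j] - r) ** 2

def top_k(rows, k):
    """Return the k rows with the largest model error, worst first."""
    return heapq.nlargest(k, rows, key=squared_error)

# Each partition is scanned independently; only k rows per partition
# need to be sent on for the final merge.
partitions = [ratings[:2], ratings[2:]]
local_worst = [top_k(part, 2) for part in partitions]
worst = top_k([row for part in local_worst for row in part], 2)
print(worst)
```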
15
Complex Trading: Latent Factors
SVD-like matrix factorization
Minimize squared error: $\sum_{i,j} (p_i q_j - r_{ij})^2$
The trade:
  – Use complex statistical models in R
  – Parallelize aggregate computations using Hadoop/Jaql
16
Complex Trading: Latent Factors
However, in the real world…
A vector of factors for each customer and item!
17
Latent Factor Models with Ricardo
Goal
  – Minimize squared error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
  – Numerical methods needed (large, sparse matrix)
Pseudocode
1. Start with an initial guess of parameters $p_i$ and $q_j$.
2. Compute error & gradient
  – E.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$
3. Update parameters.
  – R implements many different optimization algorithms
4. Repeat steps 2 and 3 until convergence.
Data intensive, but parallelizable!
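The four pseudocode steps can be traced with a small Python sketch. It makes several assumptions that are not in the slides: the toy ratings are invented (and exactly rank one, so a perfect fit exists), and step 3 uses plain fixed-step gradient descent as a stand-in for the more sophisticated optimizers that R provides. The names `error_and_gradient` and `fit` are hypothetical.

```python
# Hypothetical toy data: (customer i, movie j, rating r_ij).
ratings = [(0, 0, 2.0), (0, 1, 4.0), (1, 0, 3.0), (1, 1, 6.0)]

def error_and_gradient(p, q):
    """Step 2: one pass over the ratings yields e and all partial derivatives."""
    e = 0.0
    dp = [0.0] * len(p)
    dq = [0.0] * len(q)
    for i, j, r in ratings:
        resid = p[i] * q[j] - r
        e += resid ** 2
        dp[i] += 2.0 * q[j] * resid   # contribution to de/dp_i
        dq[j] += 2.0 * p[i] * resid   # contribution to de/dq_j
    return e, dp, dq

def fit(step=0.01, tol=1e-12, max_iter=10000):
    p, q = [1.0, 1.0], [1.0, 1.0]                     # step 1: initial guess
    history = []
    for _ in range(max_iter):
        e, dp, dq = error_and_gradient(p, q)          # step 2: error and gradient
        history.append(e)
        if e < tol:                                   # step 4: stop at convergence
            break
        p = [pi - step * g for pi, g in zip(p, dp)]   # step 3: update parameters
        q = [qj - step * g for qj, g in zip(q, dq)]
    return p, q, history

p, q, history = fit()
print(len(history), history[-1])
```

Only `error_and_gradient` touches the ratings, and it is a sum over rows, which is the data-intensive but parallelizable part the slide refers to.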
18
The R Component
Parameters
  – e: squared error (computed by the function fe)
  – de: gradients (computed by the function fde)
  – pq: concatenation of the latent factors for users and items
R code
  optim( c(p,q), fe, fde, method="L-BFGS-B" )
Goal
  – Keeps updating pq until it reaches convergence
19
The Hadoop and Jaql Component
Dataset
Goal
20
The Hadoop and Jaql Component
Calculate the squared errors
Calculate the gradients
21
Computing the Model
Tables: Movie Ratings (i, j, r_ij), Customer Parameters (i, p_i), Movie Parameters (j, q_j)
3-way join to match $r_{ij}$, $p_i$, and $q_j$, then aggregate: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
Similarly compute the gradients
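The join-then-aggregate step can be illustrated with in-memory Python dictionaries. This is a minimal sketch under assumptions: the three tiny tables are invented, and `hash_join` is a hypothetical helper that mimics a hash join on a single key with unique keys on the build side. In Ricardo this work is expressed in Jaql and run on Hadoop.

```python
# Hypothetical tables, one dict per row.
ratings = [
    {"i": 1, "j": 10, "rating": 4.0},
    {"i": 1, "j": 11, "rating": 1.0},
    {"i": 2, "j": 10, "rating": 5.0},
    {"i": 2, "j": 11, "rating": 4.0},
]
cust_pars = [{"i": 1, "p": 1.0}, {"i": 2, "p": 2.0}]
movie_pars = [{"j": 10, "q": 3.0}, {"j": 11, "q": 1.0}]

def hash_join(left, right, key):
    """Join two row lists on `key`; assumes `key` is unique in `right`."""
    index = {row[key]: row for row in right}      # build side
    return [{**l, **index[l[key]]} for l in left if l[key] in index]  # probe side

# Three-way join: ratings with movie parameters, then with customer parameters.
joined = hash_join(hash_join(ratings, movie_pars, "j"), cust_pars, "i")

# Aggregate: e = sum over all ratings of (p_i * q_j - r_ij)^2.
e = sum((row["p"] * row["q"] - row["rating"]) ** 2 for row in joined)
print(e)
```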
22
Aggregation in Jaql/Hadoop
res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p*$.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g={ $.i, $.j }
     into { g.*, gradient: sum($[*].value) }
")

Result in R:
i     j     gradient
----  ----  --------
null  null  325235
1     null  21
2     null  357
…
null  1     9
null  2     64
…
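The expand and group-by stages can be emulated in Python to show how one pass produces both the error and every gradient component in a single table. This is a sketch with hypothetical data, not Jaql: `None` plays the role of `null`, and `expand` is an invented helper. The per-customer rows use -2·diff·q and the per-movie rows use -2·diff·p, following the gradient formula $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$ with diff = r_ij - p_i q_j.

```python
from collections import defaultdict

# Hypothetical output of the two joins: each rating with its matching p and q.
joined = [
    {"i": 1, "j": 10, "rating": 4.0, "p": 1.0, "q": 3.0},
    {"i": 1, "j": 11, "rating": 1.0, "p": 1.0, "q": 1.0},
    {"i": 2, "j": 10, "rating": 5.0, "p": 2.0, "q": 3.0},
    {"i": 2, "j": 11, "rating": 4.0, "p": 2.0, "q": 1.0},
]

def expand(row):
    """Turn one joined row into three (group key, value) pairs."""
    diff = row["rating"] - row["p"] * row["q"]
    return [
        ((None, None), diff ** 2),                       # squared-error term
        ((row["i"], None), -2.0 * diff * row["q"]),      # term of de/dp_i
        ((None, row["j"]), -2.0 * diff * row["p"]),      # term of de/dq_j
    ]

# Group by (i, j) and sum the values, as the final Jaql stage does.
totals = defaultdict(float)
for row in joined:
    for key, value in expand(row):
        totals[key] += value

for (i, j), value in totals.items():
    print(i, j, value)
```

The row keyed (None, None) carries the total squared error, rows keyed (i, None) carry the customer gradients, and rows keyed (None, j) carry the movie gradients, matching the layout of the result table above.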
23
Integrating the Components
Remember: we would be running
  optim( c(p,q), fe, fde, method="L-BFGS-B" )
in the R process.
24
Experimental Evaluation
50 nodes at EC2
Each node: 8 cores, 7GB memory, 320GB disk
Total: 400 cores, 320GB memory, 70TB disk space

Number of Rating Tuples   Data Size in GB
500 million               104.33
1 billion                 208.68
3 billion                 625.99
5 billion                 1043.23
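A quick arithmetic check on the table shows that data size grows linearly with the number of rating tuples. The short Python sketch below uses only the figures from the table; the one assumption is that 1 GB is taken as 10^9 bytes, which the slides do not specify.

```python
# (number of rating tuples, data size in GB), copied from the table above.
table = [
    (500_000_000, 104.33),
    (1_000_000_000, 208.68),
    (3_000_000_000, 625.99),
    (5_000_000_000, 1043.23),
]

BYTES_PER_GB = 1e9  # assumption: decimal gigabytes

bytes_per_tuple = [size_gb * BYTES_PER_GB / tuples for tuples, size_gb in table]
for (tuples, size_gb), b in zip(table, bytes_per_tuple):
    print(f"{tuples:>13,d} tuples  {size_gb:>8.2f} GB  {b:.2f} bytes/tuple")

# Core count implied by the cluster description: 50 nodes with 8 cores each.
total_cores = 50 * 8
print(total_cores)
```

Under that assumption every row works out to roughly 209 bytes per rating tuple, so the four datasets differ in scale but not in record size.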
25
Leveraging Hadoop's Scalability
26
Leveraging R's Rich Functionality
  – optim( c(p,q), fe, fde, method="CG" )
  – optim( c(p,q), fe, fde, method="L-BFGS-B" )
27
Conclusion
Scaled latent factor models to terabytes of data
Provided a bridge through which other algorithms with summation form can be mapped and scaled
  – Many algorithms have summation form
  – Decompose into "large part" and "small part"
  – [Chu et al., NIPS '06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM
Future & current work
  – Tighter language integration
  – More algorithms
  – Performance tuning
28
Questions? Comments?