© 2010 IBM Corporation IBM Almaden Research Center Ricardo: Integrating R and Hadoop Sudipto Das 1, Yannis Sismanis 2, Kevin S Beyer 2, Rainer Gemulla.


1 © 2010 IBM Corporation, IBM Almaden Research Center
Ricardo: Integrating R and Hadoop
Sudipto Das (1), Yannis Sismanis (2), Kevin S. Beyer (2), Rainer Gemulla (2), Peter J. Haas (2), John McPherson (2)
(1) UC Santa Barbara, (2) IBM Almaden Research Center
Presented by: Luyuang Zhang, Yuguan Li

2 IBM Almaden Research Ricardo: Integrating R and Hadoop © 2010 IBM Corporation Sudipto Das {sudipto@cs.ucsb.edu}
Outline
• Motivation & Background
• Architecture & Components
• Trading with Ricardo
  – Simple Trading
  – Complex Trading
• Evaluation
• Conclusion

3 Deep Analytics on Big Data
• Enterprises collect huge amounts of data
  – Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  – User interaction data and history
  – Click and transaction logs
• Deep analysis is critical for a competitive edge
  – Understanding/modeling data
  – Recommendations to users
  – Ad placement
• Challenge: enable deep analysis and understanding over massive data volumes
  – Exploiting data to its full potential

4 Motivating Examples
• Data Exploration/Model Evaluation/Outlier Detection
• Personalized Recommendations
  – For each individual customer/product
  – Many applications to Netflix, Amazon, eBay, iTunes, …
• Difficulty: discern particular customer preferences
  – Sampling loses competitive advantage
• Application Scenario: Movie Recommendations
  – Millions of customers
  – Hundreds of thousands of movies
  – Billions of movie ratings

5 Analyst’s Workflow
• Data Exploration
  – Deal with raw data
• Data Modeling
  – Deal with processed data
  – Use the chosen method to build a model that fits the data
• Model Evaluation
  – Deal with the built model
  – Use data to test the accuracy of the model

6 Big Data and Deep Analytics – The Gap
• R, SPSS, SAS
  – A statistician’s toolbox
  – Rich statistical, modeling, and visualization functionality
  – Thousands of sophisticated add-on packages developed by hundreds of statistical experts and available through CRAN
  – Operate on small amounts of data, entirely in memory on a single server
  – Extensions for data handling are cumbersome
• Hadoop
  – Scalable data management system
  – Scalable, fault-tolerant, elastic, …
  – “Magnetic”: easy to store data
  – Limited deep analytics: mostly descriptive analytics

7 Filling the Gap: Existing Approaches
• Reducing data size by sampling
  – Approximations might result in losing competitive advantage
  – Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]
• Scaling out R
  – Efforts from the statistics community toward parallel and distributed variants [SNOW, Rmpi]
  – Main-memory based in most cases
  – Re-implements DBMS and distributed processing functionality
• Deep analysis within a DBMS
  – Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  – Not sustainable: misses out on R’s community development and rich libraries

8 Ricardo: Bridging the Gap
• David Ricardo, famous economist from the 19th century
  – “Comparative Advantage”
• Deep analytics is decomposable into a “large part” and a “small part” [Chu et al., NIPS '06]
  – Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  – Recommender Systems/Latent Factorization [our paper]
  – A key requirement for Ricardo is that the amount of data that must be communicated between both systems be sufficiently small
• Large part includes joins, group bys, distributive aggregations
  – Hadoop + Jaql: excellent scalability to large-scale data management
• Small part includes matrix/vector operations
  – R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
• Ricardo: establishes “trade” between R and Hadoop/Jaql
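
The large-part/small-part split can be illustrated with a least-squares line fit. The sketch below is a minimal Python stand-in, not Ricardo code: the partitions, the data, and all function names are assumptions made for illustration. The per-partition sums play the role of the distributive aggregations that would run on Hadoop, and the final solve plays the role of the small numerical step that would run in R.

```python
# Illustrative sketch: fit y = intercept + slope * x in "summation form".
# Partition layout and function names are hypothetical.

def partial_sums(rows):
    """Large part: one pass over a data partition, returning five small sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """The sums are distributive, so partition results simply add up."""
    return tuple(u + v for u, v in zip(a, b))

def solve_line(stats):
    """Small part: solve the 2x2 normal equations from the combined sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Two partitions of points that lie on y = 1 + 2x.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
total = (0.0, 0.0, 0.0, 0.0, 0.0)
for part in partitions:
    total = combine(total, partial_sums(part))
intercept, slope = solve_line(total)
```

Only five numbers per partition cross the boundary between the two parts, however many rows each partition holds, which is the kind of small communication volume the slide asks for.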

9 Ricardo: Bridging the Gap
• Trade
  – R sends aggregation-processing queries (written in Jaql) to Hadoop
  – Hadoop sends aggregated data to R for advanced statistical processing

10 R in a Nutshell

11 R in a Nutshell
• R supports rich statistical functionality

12 Jaql in a Nutshell
• JSON view of the data:
• Jaql example:
• Scalable descriptive analysis using Hadoop
• Jaql: a representative declarative interface
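
The JSON view and the Jaql example from this slide did not survive in the transcript. As a rough idea of what "descriptive analysis" over JSON-like records looks like, here is a small Python sketch; the records, the field names, and the function are all hypothetical, and this is not Jaql syntax.

```python
# Illustrative only: hypothetical records and field names, not taken from the slides.
from collections import defaultdict

ratings = [
    {"customer": 1, "movie": "A", "rating": 4.0},
    {"customer": 2, "movie": "A", "rating": 2.0},
    {"customer": 1, "movie": "B", "rating": 5.0},
    {"customer": 3, "movie": "B", "rating": 1.0},
    {"customer": 3, "movie": "C", "rating": 3.0},
]

def average_rating_per_movie(records, min_ratings=1):
    """Group records by movie, then report the count and mean rating per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["movie"]].append(rec["rating"])
    return [
        {"movie": movie, "n": len(vals), "avg": sum(vals) / len(vals)}
        for movie, vals in sorted(groups.items())
        if len(vals) >= min_ratings
    ]

# Keep only movies that have at least two ratings.
summary = average_rating_per_movie(ratings, min_ratings=2)
```

A declarative interface would express the same filter, group, and aggregate steps as a query and leave the parallel execution to Hadoop.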

13 Ricardo: The Trading Architecture
Complexity of trade between R and Hadoop
– Simple Trading: Data Exploration
– Complex Trading: Data Modeling

14 Simple Trading: Exploratory Analytics
• Gain insights about data
• Example: top-k outliers for a model
  – Identify data items on which the model performed most poorly
• Helpful for improving accuracy of model
• The trade:
  – Use complex statistical models using rich R functionality
  – Parallelize processing over entire data using Hadoop/Jaql
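
A minimal Python sketch of the top-k outlier idea follows. It assumes a toy model that predicts a rating as p[i] * q[j]; the data, the partition layout, and the function names are hypothetical. Each partition keeps only its k worst-predicted items, so only k candidates per partition need to be merged afterwards.

```python
# Illustrative sketch: top-k items on which a model performs worst.
import heapq

def predict(i, j, p, q):
    """Toy model (an assumption): predicted rating is the product of two factors."""
    return p[i] * q[j]

def top_k_local(partition, p, q, k):
    """Per partition: keep the k ratings with the largest squared error."""
    scored = ((r - predict(i, j, p, q)) ** 2 for i, j, r in partition)
    rows = ((err, i, j, r) for err, (i, j, r) in zip(scored, partition))
    return heapq.nlargest(k, rows)

def top_k_outliers(partitions, p, q, k):
    """Merge the per-partition candidates and keep the overall k worst."""
    candidates = []
    for part in partitions:
        candidates.extend(top_k_local(part, p, q, k))
    return heapq.nlargest(k, candidates)

p = {1: 1.0, 2: 2.0}
q = {10: 1.0, 20: 2.0}
partitions = [
    [(1, 10, 1.0), (1, 20, 5.0)],
    [(2, 10, 2.5), (2, 20, 0.0)],
]
outliers = top_k_outliers(partitions, p, q, k=2)
```

Here the scoring pass over all ratings is the part that would be parallelized, while the model that produces the predictions would come from R.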

15 Complex Trading: Latent Factors
SVD-like matrix factorization
Minimize squared error: $\sum_{i,j} (p_i q_j - r_{ij})^2$
[Figure labels: p, q]
The trade:
– Use complex statistical models in R
– Parallelize aggregate computations using Hadoop/Jaql
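
For reference, the objective on this slide and the gradients that follow from it can be written out as below. The sums are assumed to run over the observed ratings only; the gradient with respect to $q_j$ is not given on the slides and is added here by symmetry.

```latex
e = \sum_{i,j} \left( p_i q_j - r_{ij} \right)^2,
\qquad
\frac{\partial e}{\partial p_i} = \sum_{j} 2\, q_j \left( p_i q_j - r_{ij} \right),
\qquad
\frac{\partial e}{\partial q_j} = \sum_{i} 2\, p_i \left( p_i q_j - r_{ij} \right)
```

Each quantity is a plain sum over rating tuples, which is what makes it suitable for a group-by aggregation.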

16 Complex Trading: Latent Factors
However, in the real world… a vector of factors for each customer and item!

17 Latent Factor Models with Ricardo
• Goal
  – Minimize squared error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
  – Numerical methods needed (large, sparse matrix)
• Pseudocode
  1. Start with an initial guess of the parameters $p_i$ and $q_j$.
  2. Compute error and gradient
     – E.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$
  3. Update parameters.
     – R implements many different optimization algorithms
  4. Repeat steps 2 and 3 until convergence.
• Data intensive, but parallelizable!
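
Steps 2 and 3 of the pseudocode can be sketched in a few lines of Python. The toy ratings, the scalar factors, and the fixed step size are assumptions made for illustration; the update shown is plain gradient descent, which is simpler than the optimizers the slides use in R. A finite-difference check is included to confirm that the gradient formula matches the error function.

```python
# Illustrative sketch of "compute error & gradient" and "update parameters".

def error_and_gradient(ratings, p, q):
    """One pass over the ratings: squared error plus de/dp_i and de/dq_j."""
    e = 0.0
    dp = {i: 0.0 for i in p}
    dq = {j: 0.0 for j in q}
    for i, j, r in ratings:
        diff = p[i] * q[j] - r
        e += diff * diff
        dp[i] += 2.0 * q[j] * diff
        dq[j] += 2.0 * p[i] * diff
    return e, dp, dq

ratings = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 2.0)]  # (i, j, r_ij), hypothetical
p = {1: 1.0, 2: 1.0}  # initial guess for customer factors
q = {1: 1.0, 2: 1.0}  # initial guess for movie factors

e, dp, dq = error_and_gradient(ratings, p, q)

# Finite-difference check of de/dp_1.
h = 1e-6
p_shift = dict(p)
p_shift[1] += h
e_shift, _, _ = error_and_gradient(ratings, p_shift, q)
numeric_dp1 = (e_shift - e) / h

# One plain gradient-descent update with an assumed step size of 0.1.
step = 0.1
p_new = {i: p[i] - step * dp[i] for i in p}
q_new = {j: q[j] - step * dq[j] for j in q}
e_new, _, _ = error_and_gradient(ratings, p_new, q_new)
```

The loop body touches each rating exactly once and only accumulates sums, which is why the slide calls this step data intensive but parallelizable.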

18 The R Component
Parameters
– e: squared error
– de: gradients
– pq: concatenation of the latent factors for users and items
R code
    optim( c(p,q), fe, fde, method="L-BFGS-B" )
Goal
– Keeps updating pq until it reaches convergence

19 The Hadoop and Jaql Component
Dataset
Goal

20 The Hadoop and Jaql Component
– Calculate the squared errors
– Calculate the gradients

21 Computing the Model
Movie Ratings: columns i, j, r_ij
Customer Parameters: columns i, p_i
Movie Parameters: columns j, q_j
3-way join to match r_ij, p_i, and q_j, then aggregate: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
Similarly compute the gradients
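
The join-then-aggregate step can be mimicked in Python with dictionary lookups. This is a sketch under assumptions: the three small tables are invented, and hash_join is a hypothetical helper that only imitates the idea of building a hash table on the parameter table and probing it with the ratings.

```python
# Illustrative sketch: 3-way join of ratings with both parameter tables, then aggregate.

ratings = [
    {"i": 1, "j": 1, "rating": 4.0},
    {"i": 1, "j": 2, "rating": 2.0},
    {"i": 2, "j": 1, "rating": 5.0},
    {"i": 2, "j": 2, "rating": 2.0},
]
cust_pars = [{"i": 1, "p": 1.0}, {"i": 2, "p": 2.0}]
movie_pars = [{"j": 1, "q": 3.0}, {"j": 2, "q": 1.0}]

def hash_join(left, right, key, field):
    """Build a lookup on `right`, probe it with `left`, and attach `field` to each match."""
    lookup = {row[key]: row[field] for row in right}
    return [{**row, field: lookup[row[key]]} for row in left if row[key] in lookup]

joined = hash_join(ratings, movie_pars, "j", "q")   # attach q_j to every rating
joined = hash_join(joined, cust_pars, "i", "p")     # attach p_i to every rating
squared_error = sum((row["p"] * row["q"] - row["rating"]) ** 2 for row in joined)
```

After the two joins every row carries r_ij, p_i, and q_j together, so the error becomes a single sum over rows.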

22 Aggregation in Jaql/Hadoop

    res = jaqlTable(channel, "
      ratings
      -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
      -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
      -> transform { $.*, diff: $.rating - $.p*$.q }
      -> expand [ { value: pow($.diff, 2.0) },
                  { $.i, value: -2.0 * $.diff * $.q },
                  { $.j, value: -2.0 * $.diff * $.p } ]
      -> group by g={ $.i, $.j }
         into { g.*, gradient: sum($[*].value) }
    ")

Result in R:

    i     j     gradient
    ----  ----  --------
    null  null  325235
    1     null  21
    2     null  357
    …
    null  1     9
    null  2     64
    …

The row with i = null and j = null holds the squared error e; rows with only i set hold the gradient with respect to p_i (each term is -2 * diff * q_j, since diff = r_ij - p_i q_j), and rows with only j set hold the gradient with respect to q_j.
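
The expand and group-by steps can be imitated in Python to show how a single pass yields both the error and all gradients. The joined rows below are invented, and the sketch follows the gradient from the pseudocode slide, de/dp_i = Σ_j 2 q_j (p_i q_j - r_ij), which with diff = r_ij - p_i q_j becomes -2 * diff * q_j per row. Python's None stands in for null.

```python
# Illustrative sketch: emit three keyed values per joined row, then sum per key.
from collections import defaultdict

joined = [  # ratings with p_i and q_j already attached (hypothetical values)
    {"i": 1, "j": 1, "rating": 4.0, "p": 1.0, "q": 3.0},
    {"i": 1, "j": 2, "rating": 2.0, "p": 1.0, "q": 1.0},
    {"i": 2, "j": 1, "rating": 5.0, "p": 2.0, "q": 3.0},
    {"i": 2, "j": 2, "rating": 2.0, "p": 2.0, "q": 1.0},
]

def expand(row):
    """Turn one joined row into three (key, value) pairs."""
    diff = row["rating"] - row["p"] * row["q"]
    yield (None, None), diff ** 2                     # contribution to e
    yield (row["i"], None), -2.0 * diff * row["q"]    # contribution to de/dp_i
    yield (None, row["j"]), -2.0 * diff * row["p"]    # contribution to de/dq_j

result = defaultdict(float)
for row in joined:
    for key, value in expand(row):
        result[key] += value  # group by (i, j) and sum
```

The resulting dictionary has one entry for the error, one per customer, and one per movie, mirroring the shape of the result table shown on the slide.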

23 Integrating the Components
Remember: we would be running
    optim( c(p,q), fe, fde, method="L-BFGS-B" )
in the R process.
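
How the pieces fit together can be sketched as an optimizer that only ever sees a flat parameter vector and two callbacks. Everything below is an assumption made for illustration: aggregate is a local stand-in for the query that would run on Hadoop, make_callbacks and minimize are hypothetical names, and the fixed-step gradient descent is a much simpler substitute for R's optim.

```python
# Illustrative sketch: optimizer driving two callbacks over a flat vector pq = p + q.

def aggregate(ratings, p, q):
    """Stand-in for the cluster-side query: returns e, de/dp, de/dq in one pass."""
    e = 0.0
    dp = [0.0] * len(p)
    dq = [0.0] * len(q)
    for i, j, r in ratings:
        diff = p[i] * q[j] - r
        e += diff * diff
        dp[i] += 2.0 * q[j] * diff
        dq[j] += 2.0 * p[i] * diff
    return e, dp, dq

def make_callbacks(ratings, n_customers):
    """Build fe and fde, which split pq back into p and q before aggregating."""
    def split(pq):
        return pq[:n_customers], pq[n_customers:]

    def fe(pq):
        p, q = split(pq)
        return aggregate(ratings, p, q)[0]

    def fde(pq):
        p, q = split(pq)
        _, dp, dq = aggregate(ratings, p, q)
        return dp + dq  # gradient in the same layout as pq

    return fe, fde

def minimize(pq, fe, fde, step=0.02, iterations=500, tol=1e-12):
    """Fixed-step gradient descent; stops early once the error is below tol."""
    for _ in range(iterations):
        if fe(pq) < tol:
            break
        gradient = fde(pq)
        pq = [x - step * g for x, g in zip(pq, gradient)]
    return pq

ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 2.0)]  # (i, j, r_ij), zero-based
start = [1.0, 1.0, 1.0, 1.0]                       # p0, p1, q0, q1
fe, fde = make_callbacks(ratings, n_customers=2)
fitted = minimize(start, fe, fde)
```

In this arrangement the optimizer never touches the ratings directly; it only exchanges the parameter vector and the small aggregated results with the callbacks, which is the trade described on the earlier slides.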

24 Experimental Evaluation
• 50 nodes at EC2
• Each node: 8 cores, 7 GB memory, 320 GB disk
• Total: 400 cores, 350 GB memory, 70 TB disk space

Number of Rating Tuples    Data Size in GB
500 Million                104.33
1 Billion                  208.68
3 Billion                  625.99
5 Billion                  1043.23

25 Leveraging Hadoop’s Scalability

26 Leveraging R’s Rich Functionality
– optim( c(p,q), fe, fde, method="CG" )
– optim( c(p,q), fe, fde, method="L-BFGS-B" )

27 Conclusion
• Scaled Latent Factor Models to terabytes of data
• Provided a bridge through which other algorithms with Summation Form can be mapped and scaled
  – Many algorithms have Summation Form
  – Decompose into “large part” and “small part”
  – [Chu et al., NIPS '06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM
• Future & Current Work
  – Tighter language integration
  – More algorithms
  – Performance tuning

28 Questions? Comments?

