Angel Trifonov, Yun Lu, Ying Wang. RICARDO: INTEGRATING R AND HADOOP


1 Angel Trifonov, Yun Lu, Ying Wang RICARDO: INTEGRATING R AND HADOOP

2 1. Introduction 2. Motivating Examples 3. Preliminaries 4. Ricardo Design 5. Experimental Study 6. Conclusion CONTENTS

3 INTRODUCTION

4  Enterprise datasets  Why are these datasets important?  Statistical analysis on datasets  Data analyst workflow  Explore/summarize data  Build a model  Use it to improve business practices  Need a statistical package DATA COLLECTION

5  R design  Single server  Main memory  Large data  FAIL!  Problem for analysts – they work with large datasets  Vertical scalability  Subsets  Neither is ideal!  Large-scale data management systems (DMS)  Example: Hadoop  Aggregation processing R AND DMS

6  Overview  Scalable platform for deep analytics  Part of eXtreme Analytics Platform (XAP) project  Named after economist David Ricardo  Facilitates trading between R and Hadoop  Previous work on Map-Reduce  Small data – combined approach success  Several advantages RICARDO

7  Familiar working environment – work within a statistical environment  Data attraction – Hadoop’s flexible data store together with the Jaql query language  Integration of data processing into the analytical workflow – handle large data by preprocessing and reducing it  Reliability and community support – built from open-source projects  Improved user code – facilitates better code  Deep analytics – can handle many kinds of advanced statistical analyses  No reinventing the wheel – combine existing statistical and DMS technology RICARDO ADVANTAGES

8 MOTIVATING EXAMPLES

9  Analyst workflow: exploration  Graph shows movie perception over time  How does an analyst get this data visualization?  R is good for the job, BUT…  Ricardo can help! EXAMPLE 1: SIMPLE TRADING

10  Analyst workflow: evaluation – already have a model  Analysis must be on all the data  Ricardo can help once again  What did we see?  Simple trading  First case: pass to R  Second case: pass to Hadoop  More complicated analyses? No problem! EXAMPLE 2: SIMPLE TRADING

11  Analyst workflow: modeling  How?  Simple-trading scheme: no good  Losing information  Ricardo permits complex trading  Data needs decomposition  Small parts: handled by R  Large parts: handled by Hadoop  Consider an example  Latent-factor model  Each piece of data must be taken into account  Simple trading won’t work EXAMPLE 3: COMPLEX TRADING

12 LATENT-FACTOR MODEL

13 PRELIMINARIES

14  Developed at the University of Auckland, New Zealand  Open-source language and statistical environment  Small maintenance team, but big popularity  Example of functionality: fit <- lm(df$mean ~ df$year); plot(df$year, df$mean); abline(fit)  Data frame equivalent THE R PROJECT

15  Enterprise data warehouses – dominant type of DMS  Designed for clean/structured data – not good  Analysts want their data dirty  What to do? Use Hadoop!  Hadoop method  Hadoop Distributed File System  Operates on raw data files  Process according to MapReduce  Map phase results fed to reducer  Used successfully on large-scale datasets  Appealing alternative LARGE-SCALE DMS
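
Note on the map and reduce phases: the flow named on this slide (map output fed to a reducer) can be sketched in plain Python. This is only an illustration of the idea, not Hadoop's API. The record layout, the function names, and the in-memory "shuffle" step are assumptions made for the example; a real cluster distributes each phase across worker nodes.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record, here (year, rating)."""
    for rec in records:
        yield rec["year"], rec["rating"]


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key, here the mean rating per year."""
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Hypothetical input records, invented for this sketch.
records = [
    {"year": 2002, "rating": 5},
    {"year": 2002, "rating": 3},
    {"year": 2003, "rating": 4},
]

mean_by_year = reduce_phase(shuffle(map_phase(records)))
print(mean_by_year)
```

The aggregate is much smaller than the raw data, which is what makes it cheap to hand over to a statistical package afterwards.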

16  Hadoop drawback – programming interface  Attempts to address this  Ricardo uses Jaql  Open-source dataflow language  Jaql scripts automatically compiled  Operates directly on data files  JSON view: [{ customer: "Michael", movie: { name: "About Schmidt", year: 2002}, rating: 5},... ]  Jaql query: read("ratings") -> group by year = $.movie.year into { year, mean: avg($[*].rating) } -> sort by [ $.year ] JAQL: A JSON QUERY LANGUAGE
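
Note on the Jaql query: for readers who do not know Jaql, the same group-by-year, average, sort-by-year pipeline can be sketched in Python over an in-memory list. Only the first record comes from the slide; the other records and the function name mean_rating_by_year are made up for illustration, and nothing here reflects how Jaql compiles the query for Hadoop.

```python
from itertools import groupby
from statistics import mean

ratings = [
    # Record shown on the slide.
    {"customer": "Michael", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 5},
    # Hypothetical records added for this sketch.
    {"customer": "Alice", "movie": {"name": "Sideways", "year": 2004}, "rating": 4},
    {"customer": "Bob", "movie": {"name": "Sideways", "year": 2004}, "rating": 2},
]


def movie_year(record):
    """Grouping key, corresponding to $.movie.year in the Jaql query."""
    return record["movie"]["year"]


def mean_rating_by_year(records):
    """Group by year, average the ratings, and return rows sorted by year."""
    ordered = sorted(records, key=movie_year)  # groupby needs sorted input
    return [
        {"year": year, "mean": mean(r["rating"] for r in group)}
        for year, group in groupby(ordered, key=movie_year)
    ]


result = mean_rating_by_year(ratings)
print(result)
```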

17 RICARDO DESIGN

18 PROBLEM STATEMENT  R  Advantages: statistical software, data analysis  Disadvantages: operates in main memory, limited data  Hadoop  Advantage: large-scale processing  Disadvantage: insufficient analytical functionality

19 RICARDO DESIGN

20  R driver:  Not memory-resident  Does R need memory to store some data?  Hadoop:  Performs operations  Stores data in HDFS  R-Jaql bridge:  Connects the R driver and the Hadoop cluster  Executes queries (what kind of query?)  Sends the results back to R as data frames  Allows Jaql queries to spawn R processes on Hadoop worker nodes RICARDO DESIGN

21  Components:  R package (Jaql in R) and a Jaql module (R in Jaql) R-JAQL BRIDGE  R  Hadoop  Hadoop  R  R  Hadoop

22  Analyst’s typical workflow  Data exploration  Preliminary observation  Simple trading  Model building  Deep analytics  Complex trading  Model evaluation  Quality of models  Simple trading RICARDO WORKFLOW Why is model building complex trading?

23  Movie recommendation REVIEW EXAMPLE Simple Trading: Linear Regression Complex Trading: Latent-Factor Model Data Exploration  Model Building

24 SIMPLE TRADING – LINEAR REGRESSION Get data from Hadoop Fit data
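
Note on "Fit data": once Hadoop has reduced the ratings to one mean per year, the fit itself is an ordinary least-squares line, which is what R's lm(df$mean ~ df$year) computes. The sketch below reimplements that fit in plain Python for illustration only. The function name fit_line and the sample numbers are assumptions, not values from the deck.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical aggregate returned by the data management side:
# one mean rating per movie year.
years = [2000, 2001, 2002, 2003]
means = [3.0, 3.2, 3.4, 3.6]

intercept, slope = fit_line(years, means)
print(intercept, slope)
```

Only four (year, mean) pairs cross from the cluster to the statistical side here, however many raw ratings were aggregated.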

25 SIMPLE TRADING – EVALUATE MODEL Fit data Select top 10 outliers
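
Note on "Select top 10 outliers": in this direction of simple trading the fitted model is small and the data is large, so the model parameters travel to the data. A minimal Python sketch of the selection step follows. It assumes the model is the fitted line (intercept and slope) and that an outlier is a record with a large absolute residual; the records, the function name top_outliers, and the parameter values are hypothetical.

```python
import heapq


def top_outliers(records, intercept, slope, k=10):
    """Return the k records whose rating is farthest from the model's prediction."""

    def abs_residual(rec):
        predicted = intercept + slope * rec["year"]
        return abs(rec["rating"] - predicted)

    return heapq.nlargest(k, records, key=abs_residual)


# Hypothetical raw records and model parameters, invented for this sketch.
records = [
    {"customer": "A", "year": 2002, "rating": 1},
    {"customer": "B", "year": 2002, "rating": 5},
    {"customer": "C", "year": 2003, "rating": 3},
    {"customer": "D", "year": 2003, "rating": 4},
]

outliers = top_outliers(records, intercept=3.5, slope=0.0, k=2)
print(outliers)
```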

26 COMPLEX TRADING Model Building  Objectives

27 MODEL BUILDING  Randomly initialize p and q  Set up the optimization method  Update p and q  Repeat until convergence  Compute:  Squared error (e)  Derivative of e with respect to p  Derivative of e with respect to q
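
Note on the three quantities: the formulas did not survive in this transcript. Under the simplest reading of the latent-factor model, with one scalar factor p_u per user and one scalar factor q_i per item, a predicted rating of p_u q_i, and no regularization term (all of these are assumptions here, not statements recovered from the slide), they would take the following form, with sums running over the observed ratings r_{ui}:

```latex
\begin{align}
e &= \sum_{(u,i)} \left( r_{ui} - p_u q_i \right)^2 \\
\frac{\partial e}{\partial p_u} &= -2 \sum_{i} \left( r_{ui} - p_u q_i \right) q_i \\
\frac{\partial e}{\partial q_i} &= -2 \sum_{u} \left( r_{ui} - p_u q_i \right) p_u
\end{align}
```

Each sum touches every observed rating, which is why these aggregates belong on the large-scale side, while the update of p and q can stay with the optimizer.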

28  Table r: stores ratings  Tables p and q: store latent factors MODEL BUILDING

Table r
User      Item
Alice     About Schmidt
Bob       Lost in Translation
Michael   Sideways

Table q
Schmidt               2.24
Lost in translation   1.92
Sideways              1.18

Table p
Alice     1.98
Bob       1.21
Michael   2.30

29 DETAILS Compute the gradient Compute the sum of squared errors
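
Note on the two computations: a plain Python sketch of the sum of squared errors and its gradient, using the factor values from tables p and q on slide 28. It assumes one scalar factor per user and item, a predicted rating of p[u] * q[i], and no regularization. The rating values are hypothetical because the transcript does not show them, "Schmidt" in table q is taken to mean "About Schmidt", and the fixed-step descent update is a stand-in for whatever optimization method the deck sets up.

```python
def sse(ratings, p, q):
    """Sum of squared errors: sum over ratings of (r - p[u] * q[i]) ** 2."""
    return sum((r - p[u] * q[i]) ** 2 for u, i, r in ratings)


def gradients(ratings, p, q):
    """Partial derivatives of the squared error with respect to each p[u] and q[i]."""
    grad_p = {u: 0.0 for u in p}
    grad_q = {i: 0.0 for i in q}
    for u, i, r in ratings:
        err = r - p[u] * q[i]
        grad_p[u] += -2.0 * err * q[i]
        grad_q[i] += -2.0 * err * p[u]
    return grad_p, grad_q


def descent_step(ratings, p, q, lr=0.01):
    """One fixed-step gradient descent update (illustrative stand-in optimizer)."""
    grad_p, grad_q = gradients(ratings, p, q)
    new_p = {u: p[u] - lr * grad_p[u] for u in p}
    new_q = {i: q[i] - lr * grad_q[i] for i in q}
    return new_p, new_q


# Factor values from tables p and q on slide 28.
p = {"Alice": 1.98, "Bob": 1.21, "Michael": 2.30}
q = {"About Schmidt": 2.24, "Lost in Translation": 1.92, "Sideways": 1.18}

# Hypothetical rating values: the transcript lists only the user/item pairs.
ratings = [
    ("Alice", "About Schmidt", 4.0),
    ("Bob", "Lost in Translation", 2.0),
    ("Michael", "Sideways", 3.0),
]

error_before = sse(ratings, p, q)
new_p, new_q = descent_step(ratings, p, q)
error_after = sse(ratings, new_p, new_q)
print(error_before, error_after)
```

Both functions are sums over all ratings, so they have the shape of an aggregation query; only the small tables p and q need to move between the two sides on each iteration.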

30  Principal component analysis (PCA)  Compute eigenvectors and eigenvalues  Eigenvectors are mutually perpendicular  Generalized linear models (GLM)  Compute response variable  Expressed as a nonlinear function  …… OTHER MODELS

31  Java Native Interface (JNI) as the bridge between C and Java  How to transfer data across JNI?  Naïve way  Better solution  Jaql wrapper handles data-representation incompatibilities  This is in the bridge  What is the component in the R-Jaql bridge now? IMPLEMENTATIONS

32 EXPERIMENTAL STUDY

36  Scaling out R  Low-level message passing  Task- and data-parallel computing systems  Automatic parallelization of high-level  Deepening a DMS RELATED WORK

37  Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R.  Future work  Identifying and integrating additional statistical analyses that are amenable to the Ricardo approach. CONCLUSION

38  S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: integrating R and Hadoop. In SIGMOD, pages 987-998, 2010. REFERENCES

