Published by Kimberly Cummings. Modified over 9 years ago.
1
RICARDO: INTEGRATING R AND HADOOP
Angel Trifonov, Yun Lu, Ying Wang
2
CONTENTS
1. Introduction
2. Motivating Examples
3. Preliminaries
4. Ricardo Design
5. Experimental Study
6. Conclusion
3
INTRODUCTION
4
DATA COLLECTION
- Enterprise datasets
- Why are these datasets important?
- Statistical analysis on datasets
- Data analyst workflow
  - Explore/summarize data
  - Build a model
  - Use the model to improve business practices
- Need a statistical package
5
R AND DMS
- R design
  - Single server
  - Main memory
- Large data: FAIL!
- Problem for analysts: they work with large datasets
  - Vertical scalability
  - Subsets
  - Neither is ideal!
- Large-scale data management systems (DMS)
  - Example: Hadoop
  - Aggregation processing
6
RICARDO
- Overview
  - Scalable platform for deep analytics
  - Part of the eXtreme Analytics Platform (XAP) project
  - Named after economist David Ricardo
  - Facilitates trading between R and Hadoop
- Previous work on Map-Reduce
  - Small data: combined approach success
- Several advantages
7
RICARDO ADVANTAGES
- Familiar working environment: work within a statistical environment
- Data attraction: Hadoop's flexible data store together with the Jaql query language
- Integration of data processing into the analytical workflow: handle large data by preprocessing and reducing it
- Reliability and community support: built from open-source projects
- Improved user code: facilitates better code
- Deep analytics: can handle many kinds of advanced statistical analyses
- No re-inventing of wheels: combine existing statistical and DMS technology
8
MOTIVATING EXAMPLES
9
EXAMPLE 1: SIMPLE TRADING
- Analyst workflow: exploration
- Graph shows movie perception over time
- How does an analyst get this data visualization?
- R is good for the job, BUT…
- Ricardo can help!
10
EXAMPLE 2: SIMPLE TRADING
- Analyst workflow: evaluation (already have a model)
- Analysis must be on all the data
- Ricardo can help once again
- What did we see?
  - Simple trading
  - First case: pass to R
  - Second case: pass to Hadoop
- More complicated analyses? No problem!
11
EXAMPLE 3: COMPLEX TRADING
- Analyst workflow: modeling
- How?
  - Simple-trading scheme no good
  - Losing information
- Ricardo permits complex trading
  - Data needs decomposition
  - Small parts handled by R
  - Large parts handled by Hadoop
- Consider an example
  - Latent-factor model
  - Each piece of data must be taken into account
  - Simple trading won't work
12
LATENT-FACTOR MODEL
13
PRELIMINARIES
14
THE R PROJECT
- Developed at the University of Auckland, New Zealand
- Open-source language and statistical environment
- Small maintenance team, but big popularity
- Example of functionality:
    fit <- lm(df$mean ~ df$year)
    plot(df$year, df$mean)
    abline(fit)
- Data frame equivalent
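As a rough illustration of what the lm call on this slide computes, the following Python sketch fits a straight line by ordinary least squares using the closed-form solution. It is not Ricardo's or R's implementation; the function name fit_line and the sample (year, mean) values are invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical (year, mean rating) pairs, invented for this sketch.
years = [2000, 2001, 2002, 2003, 2004]
means = [3.0, 3.1, 3.2, 3.3, 3.4]
intercept, slope = fit_line(years, means)
```

The returned intercept and slope play the role of the coefficients that abline(fit) draws in the R snippet.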
15
LARGE-SCALE DMS
- Enterprise data warehouses: dominant type of DMS
  - Designed for clean/structured data: not good
  - Analysts want their data dirty
- What to do? Use Hadoop!
- Hadoop method
  - Hadoop Distributed File System
  - Operates on raw data files
  - Process according to MapReduce
  - Map phase results fed to reducer
- Used successfully on large-scale datasets
- Appealing alternative
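The MapReduce flow named on this slide (map output grouped by key and fed to a reducer) can be sketched in a single Python process. This is only a toy model of the programming pattern, not Hadoop's API; the function names and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(record):
    # Emit one (key, value) pair per input record: (movie name, 1).
    yield record["movie"], 1


def shuffle(pairs):
    # Group all emitted values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse the values for one key into a single result.
    return key, sum(values)


def run_job(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical rating records, invented for this sketch.
records = [
    {"movie": "About Schmidt", "rating": 5},
    {"movie": "Sideways", "rating": 4},
    {"movie": "About Schmidt", "rating": 3},
]
counts = run_job(records)
```

On a real cluster the map and reduce calls run in parallel on different nodes; here they run sequentially so the data flow is easy to follow.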
16
JAQL: A JSON QUERY LANGUAGE
- Hadoop drawback: programming interface
- Several attempts to address this
- Ricardo uses Jaql
  - Open-source dataflow language
  - Jaql scripts automatically compiled
  - Operates directly on data files
- JSON view:
    [
      { customer: "Michael", movie: { name: "About Schmidt", year: 2002 }, rating: 5 },
      ...
    ]
- Jaql query:
    read("ratings")
    -> group by year = $.movie.year
       into { year, mean: avg($[*].rating) }
    -> sort by [ $.year ]
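To make the Jaql query on this slide easier to read, here is a Python sketch of the same three steps: group the rating records by movie year, average the ratings in each group, and sort by year. Only the first record comes from the slide; the other two records and the function name mean_rating_by_year are invented for illustration.

```python
import json
from collections import defaultdict

# First record is from the slide; the remaining two are hypothetical.
ratings_json = """
[
  {"customer": "Michael", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 5},
  {"customer": "Alice", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 4},
  {"customer": "Bob", "movie": {"name": "Lost in Translation", "year": 2003}, "rating": 3}
]
"""


def mean_rating_by_year(records):
    # group by year = $.movie.year
    by_year = defaultdict(list)
    for record in records:
        by_year[record["movie"]["year"]].append(record["rating"])
    # into { year, mean: avg($[*].rating) }
    rows = [{"year": year, "mean": sum(vals) / len(vals)}
            for year, vals in by_year.items()]
    # sort by [ $.year ]
    return sorted(rows, key=lambda row: row["year"])


rows = mean_rating_by_year(json.loads(ratings_json))
```

The result is a small list of {year, mean} rows, which is the kind of aggregated table that can be handed to R as a data frame.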
17
RICARDO DESIGN
18
PROBLEM STATEMENT
- R
  - Advantages: statistical software, data analysis
  - Disadvantages: operates in main memory, limited data
- Hadoop
  - Advantage: large-scale processing
  - Disadvantage: insufficient analytical functionality
19
RICARDO DESIGN
20
RICARDO DESIGN
- R driver
  - Not memory-resident
  - Does R need memory to store some data?
- Hadoop
  - Performance operations
  - Store data in HDFS
- R-Jaql bridge
  - Connects the R driver and the Hadoop cluster
  - Executes queries (what kind of query?)
  - Sends results back to R as data frames
  - Allows Jaql queries to spawn R processes on Hadoop worker nodes
21
R-JAQL BRIDGE
- Components: an R package (Jaql in R) and a Jaql module (R in Jaql)
- [Diagram labels: R, Hadoop]
22
RICARDO WORKFLOW
- Analyst's typical workflow
  - Data exploration: preliminary observation; simple trading
  - Model building: deep analytics; complex trading
  - Model evaluation: quality of models; simple trading
- Why is model building complex trading?
23
REVIEW EXAMPLE
- Movie recommendation
- Data exploration: simple trading (linear regression)
- Model building: complex trading (latent-factor model)
24
SIMPLE TRADING – LINEAR REGRESSION
- Get data from Hadoop
- Fit data
25
SIMPLE TRADING – EVALUATE MODEL
- Fit data
- Select top 10 outliers
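The "select top 10 outliers" step can be sketched as ranking points by the absolute residual of a fitted line. The Python below is a minimal, single-machine sketch under that assumption; the function name top_outliers, the line coefficients, and the sample points are hypothetical, and in Ricardo the evaluation over all the data would run on the Hadoop side.

```python
def top_outliers(points, intercept, slope, k=10):
    """Return the k points with the largest absolute residuals."""
    def residual(point):
        x, y = point
        # Observed value minus the value predicted by the fitted line.
        return y - (intercept + slope * x)

    return sorted(points, key=lambda p: abs(residual(p)), reverse=True)[:k]


# Hypothetical (x, y) observations and a hypothetical fitted line y = x.
points = [(1, 1.0), (2, 2.1), (3, 9.0), (4, 3.9), (5, -2.0)]
outliers = top_outliers(points, intercept=0.0, slope=1.0, k=2)
```

Only the k selected rows need to travel back to the analyst, which is what keeps this step a simple trade.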
26
COMPLEX TRADING
- Model building
- Objectives
27
MODEL BUILDING
- Randomly pick initial p and q
- Set up the optimization method
- Update p and q
- Repeat until convergence
- Compute:
  - Squared error (e)
  - The derivative of e with respect to p
  - The derivative of e with respect to q
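The three quantities listed on this slide can be written out as follows. This is a sketch under two assumptions not stated on the slide: each user i and item j has a single scalar factor (p_i and q_j), so the predicted rating is their product, and no regularization term is included. The sums run over the observed ratings r_{ij}.

```latex
\begin{align}
  e &= \sum_{(i,j)} \left( r_{ij} - p_i q_j \right)^2 \\
  \frac{\partial e}{\partial p_i} &= -2 \sum_{j} \left( r_{ij} - p_i q_j \right) q_j \\
  \frac{\partial e}{\partial q_j} &= -2 \sum_{i} \left( r_{ij} - p_i q_j \right) p_i
\end{align}
```

Each derivative only needs the ratings that involve user i (or item j), which is why the data can be decomposed and aggregated in parallel.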
28
MODEL BUILDING
- Table r: stores ratings
- Tables p and q: store latent factors

Table r
  User     Item
  Alice    About Schmidt
  Bob      Lost in Translation
  Michael  Sideways

Table q
  Schmidt              2.24
  Lost in translation  1.92
  Sideways             1.18

Table p
  Alice    1.98
  Bob      1.21
  Michael  2.30
29
DETAILS
- Compute the gradient
- Compute the sum of squared errors
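The two computations on this slide, the gradient and the sum of squared errors, can be sketched in plain Python for a latent-factor model with one scalar factor per user and per item. This is a single-machine illustration with plain gradient descent swapped in as the optimizer; the optimizer choice, the step size, the rating values, and all function names are assumptions, not taken from the slides.

```python
def squared_error(ratings, p, q):
    # Sum over observed ratings of (rating - p[user] * q[item])^2.
    return sum((r - p[u] * q[i]) ** 2 for u, i, r in ratings)


def gradients(ratings, p, q):
    # Partial derivatives of the squared error with respect to each factor.
    grad_p = {u: 0.0 for u in p}
    grad_q = {i: 0.0 for i in q}
    for u, i, r in ratings:
        err = r - p[u] * q[i]
        grad_p[u] += -2.0 * err * q[i]
        grad_q[i] += -2.0 * err * p[u]
    return grad_p, grad_q


def fit(ratings, p, q, step=0.01, iterations=500):
    # Plain gradient descent (an assumed optimizer); inputs are not mutated.
    for _ in range(iterations):
        grad_p, grad_q = gradients(ratings, p, q)
        p = {u: p[u] - step * grad_p[u] for u in p}
        q = {i: q[i] - step * grad_q[i] for i in q}
    return p, q


# Hypothetical (user, item, rating) triples, invented for this sketch.
ratings = [
    ("Alice", "About Schmidt", 4.0),
    ("Alice", "Sideways", 4.0),
    ("Bob", "Lost in Translation", 3.0),
    ("Michael", "Sideways", 5.0),
]
p0 = {"Alice": 1.0, "Bob": 1.0, "Michael": 1.0}
q0 = {"About Schmidt": 1.0, "Lost in Translation": 1.0, "Sideways": 1.0}
p, q = fit(ratings, p0, q0)
initial_error = squared_error(ratings, p0, q0)
final_error = squared_error(ratings, p, q)
```

In a complex-trading setup, the loops inside squared_error and gradients are the large, data-dependent parts that would be pushed to Hadoop, while the small update of p and q stays with the optimizer in R.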
30
OTHER MODELS
- Principal component analysis (PCA)
  - Compute eigenvectors and eigenvalues
  - Eigenvectors are mutually perpendicular
- GLM
  - Compute response variable
  - Expressed as a nonlinear function
- ……
31
IMPLEMENTATIONS
- Java Native Interface (JNI) as the bridge between C and Java
- How to transfer the data across JNI?
  - Naïve way
  - Better solution
- Jaql wrapper handles data-representation incompatibilities
  - This is in the bridge
  - What is the component in the R-Jaql bridge right now?
32
EXPERIMENTAL STUDY
36
RELATED WORK
- Scaling out R
  - Low-level message passing
  - Task- and data-parallel computing systems
  - Automatic parallelization of high-level
- Deepening a DMS
37
CONCLUSION
- Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R.
- Future work: identifying and integrating additional statistical analyses that are amenable to the Ricardo approach.
38
REFERENCES
- S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: integrating R and Hadoop. In SIGMOD, pages 987-998, 2010.