Published by Katrina Miles. Modified over 8 years ago.
2 Quick Introduction to Oracle R Enterprise ©2011 Oracle – All Rights Reserved
3 What is Oracle R Enterprise?
R packages, a database library, and SQL extensions that bring Oracle Database closer to advanced analytics users in an enterprise:
– Transparency framework: R package for transparent database table access and scalable in-database execution
– Statistics engine: database library that supports a wide range of statistical computations
– SQL extensions: enable in-database execution of R code, eliminating client data loading and result write-back to the database
Targets users whose data reside in Oracle Database, or who could benefit from in-database processing of R scripts.
4 Oracle R Enterprise Compute Engines
1. User R Engine on desktop (R Engine, other R packages, Oracle R Enterprise packages)
– R-SQL Transparency Framework intercepts R functions for scalable in-database execution
– Function intercept for data transforms, statistical functions and advanced analytics
– Interactive display of graphical results and flow control as in standard R
– Submit entire R scripts for execution by Oracle Database
2. Database Compute Engine (Oracle Database, user tables; SQL in, results out)
– Scale to large datasets
– Access tables, views, and external tables, as well as data through DB links
– Leverage database SQL parallelism
– Leverage new and existing in-database statistical and data mining capabilities
3. R Engine(s) spawned by Oracle DB (R Engine, other R packages, Oracle R Enterprise packages; R in, results out)
– Database can spawn multiple R engines for database-managed parallelism
– Efficient data transfer to spawned R engines
– Emulate map-reduce style algorithms and applications
– Enables "lights-out" execution of R scripts
5 Oracle R Enterprise – Data Sources
[Diagram: an R user on the desktop (R Engine, other R packages, Oracle R package) sends SQL to user tables in Oracle Database and receives results. Labeled paths into the database: bulk import and external tables (file systems), database links (other databases). Labeled paths into the R engine: direct R access to file systems, and direct R access to other databases through RODBC, DBI, etc.]
6 What R programmers should know…
Oracle R Enterprise…
– provides a familiar environment to operate on data in Oracle Database
– overloads base R functions for data in Oracle Database, to address big data as a single big data set problem
– enables embedded execution of existing R scripts in Oracle Database to address big data
– provides a database-controlled, data-parallel execution framework
– removes the need to manage data outside the database
– requires no knowledge of SQL
7 What statisticians should know…
Oracle R Enterprise…
– expands the kinds of problems that can be solved using readily available technologies through in-database statistical calculations
– removes the requirement for SQL skills among the programmers working for you
– reduces the time programmers spend building infrastructure to manage data outside of Oracle Database
– reduces the complexity of analytic solutions to be placed into production
– reduces costs by replacing SAS, and its high licensing fees, in the lab
– provides exhaustive support for statistical techniques and functions found in Base SAS and Base R
– implements select techniques from SAS libraries: STAT, ETS and OR
8 What a SQL user or DBA should know…
Oracle R Enterprise…
– is a secure advanced analytical engine that can be used in addition to SQL in Oracle Database
– can be used to put an R script into production through SQL invocation
– reduces or eliminates the need to move data outside Oracle Database
– facilitates auditing of data flows
– reduces the number of LOB help requests for SQL queries to obtain data of interest
9 Example Dataset: "ONTIME" Airline Data
On-time arrival data for non-stop domestic flights by major air carriers. Also provides departure and arrival delays, origin and destination airports, flight numbers, scheduled and actual departure and arrival times, cancelled or diverted flights, taxi-out and taxi-in times, air time, and non-stop distance.
Full data
– 123M records
– 22 years
– 29 airlines
Sample data (ONTIME_S)
– ~220K records
– ~10K / year
10 TRANSPARENCY FRAMEWORK
Oracle R Enterprise
11 Oracle R Enterprise: initial connection and basic operations
  R> ore.connect("rquser", "orcl", "machine-1")   # connect; can be invoked implicitly through R_PROFILE_USER
  R> ore.sync("rquser")      # sync database objects with the R client, e.g., tables/views added or removed via SQL*Plus
  R> ore.attach("rquser")    # add schema objects to the R environment search path
  R> ore.ls()                # list tables visible in schema rquser as ore.frame objects
  R> names(ONTIME_S)         # list the columns associated with table ONTIME_S
  R> dim(ONTIME_S)           # view the number of rows and columns of this ore.frame
12 RStudio in a Browser using Oracle R Enterprise
13 RStudio in a Browser using Oracle R Enterprise
14 Pull data into R
Goal: Compare an ore.frame to a data.frame by loading a table from Oracle Database into R memory.
  class(ONTIME_S)
  dim(ONTIME_S)
  ontime <- ore.pull(ONTIME_S)
  class(ontime)
  dim(ontime)
– ore.pull() returns a standard R data.frame
– SQL shown on the slide: select * from ONTIME_S
[Diagram: R user on desktop with a client R engine (other R packages, Oracle R package), the Transparency Framework, and Oracle Database with user tables.]
15 Invoke in-database aggregation function
  aggdata <- aggregate(ONTIME_S$DEST,
                       by = list(ONTIME_S$DEST),
                       FUN = length)
  class(aggdata)
  head(aggdata)
– Source data is an ore.frame, ONTIME_S, which resides in Oracle Database
– The aggregate() function has been overloaded to accept ore.frames
– aggregate() transparently switches between code that works with standard R data.frames and ore.frames
– Returns an ore.frame
– SQL shown on the slide (in-db stats): select DEST, count(*) from ONTIME_S group by DEST
[Diagram: R user on desktop with a client R engine (other R packages, Oracle R package), the Transparency Framework, and Oracle Database with user tables.]
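The aggregate() call on this slide can be tried without a database. The following is a minimal sketch that assumes a small, made-up local data.frame (`ontime_local`, not part of the original deck) in place of ONTIME_S; it only illustrates that `FUN = length` grouped by DEST yields the same counts as the `count(*) ... group by DEST` query, and says nothing about how Oracle R Enterprise generates its SQL.

```r
# Hypothetical local stand-in for ONTIME_S (the real table lives in Oracle Database).
ontime_local <- data.frame(
  DEST     = c("SFO", "BOS", "SFO", "JFK", "BOS", "SFO"),
  ARRDELAY = c(12, -3, 45, 0, 7, 30)
)

# Same call shape as on the slide: count rows per destination.
aggdata <- aggregate(ontime_local$DEST,
                     by  = list(ontime_local$DEST),
                     FUN = length)
names(aggdata) <- c("DEST", "COUNT")

class(aggdata)   # a plain data.frame here, since the input is not an ore.frame
print(aggdata)
```

With an ore.frame as input, the slide states that the result is an ore.frame instead, so the counts stay in the database until they are pulled.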
16 Manipulating Data
Column selection
  df <- ONTIME_S[,c("YEAR","DEST","ARRDELAY")]
  head(df)
  df <- df[,c(1,3)]
  head(df)
  plot(df)
Row selection
  df <- ONTIME_S[,c("YEAR","DEST","ARRDELAY")]
  head(df)
  df <- df[df$DEST=="SFO",]
  df <- df[,c(1,3)]
  head(df)
  plot(df)
Joining two tables (data frames)
  df1 <- data.frame(x1=1:5, y1=letters[1:5])
  df2 <- data.frame(x2=5:1, y2=letters[11:15])
  merge(df1, df2, by.x="x1", by.y="x2")
  ore.create(df1, table="TEST_DF1")
  ore.create(df2, table="TEST_DF2")
  merge(TEST_DF1, TEST_DF2, by.x="X1", by.y="X2")
Result:
    x1 y1 y2
  1  1  a  o
  2  2  b  n
  3  3  c  m
  4  4  d  l
  5  5  e  k
17 Formatting data – "SAS Data Step and Formats"
  diverted_fmt <- function(x) {
    ifelse(x==0, 'Not Diverted',
      ifelse(x==1, 'Diverted', ''))
  }
  cancellationCode_fmt <- function(x) {
    ifelse(x=='A', 'A CODE',
      ifelse(x=='B', 'B CODE',
        ifelse(x=='C', 'C CODE',
          ifelse(x=='D', 'D CODE', 'NOT CANCELLED'))))
  }
  delayCategory_fmt <- function(x) {
    ifelse(x>200, 'LARGE',
      ifelse(x>=30, 'MEDIUM', 'SMALL'))
  }
  zscore <- function(x) {
    (x-mean(x,na.rm=TRUE))/sd(x,na.rm=TRUE)
  }
  attach(ONTIME_S)
  ONTIME_S$DIVERTED <- diverted_fmt(DIVERTED)
  ONTIME_S$CANCELLATIONCODE <- cancellationCode_fmt(CANCELLATIONCODE)
  ONTIME_S$ARRDELAY <- delayCategory_fmt(ARRDELAY)
  ONTIME_S$DEPDELAY <- delayCategory_fmt(DEPDELAY)
  ONTIME_S$DISTANCE_ZSCORE <- zscore(DISTANCE)
  detach(ONTIME_S)
  head(ONTIME_S)
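To see how the nested ifelse() formats and the z-score behave at their boundaries, the two helpers from this slide can be run on ordinary R vectors. This is a sketch only: the input vectors `delays` and `distance` are invented for illustration and are not taken from the ONTIME data.

```r
# Helpers as defined on the slide.
delayCategory_fmt <- function(x) {
  ifelse(x > 200, "LARGE",
         ifelse(x >= 30, "MEDIUM", "SMALL"))
}
zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical delays in minutes, chosen to hit each boundary plus a missing value.
delays <- c(5, 30, 199, 200, 201, NA)
categories <- delayCategory_fmt(delays)
print(categories)   # 200 is still MEDIUM because the test is strictly > 200; NA stays NA

# Hypothetical distances with mean 200 and standard deviation 100.
distance <- c(100, 200, 300)
z <- zscore(distance)
print(z)
```

Note that a missing delay propagates as NA rather than falling into the SMALL bucket, which may or may not be the desired format for cancelled flights.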
18 Which days were the worst to fly for delays over the past 22 years?
19 Are select airlines getting better or worse?
Mean annual delay by year
20 Embedded R Script Execution
21 ore.doEval()
Goal: Build a single regression model using the transparency framework in the DB R engine.
  mod <- ore.doEval(
    function(param) {
      library(biglm)
      dat <- ore.pull(ONTIME_S)
      mod <- biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat)
      mod
    });
  mod_local <- ore.pull(mod)
  summary(mod_local[[1]])
– Package "biglm" is loaded into the DB R engine
– Data is explicitly loaded into R memory at the DB R engine using ore.pull()
– Result "mod" is returned as a model object
[Diagram, steps numbered 1 to 5: R user on desktop with a client R engine (other R packages, Oracle R package) and the Transparency Framework; Oracle Database with user tables and the rq*Apply() interface; an extproc DB R engine (other R packages, Oracle R package, Transparency Framework).]
22 ore.groupApply() – parallel execution
Goal: Build models in parallel on partitions of the dataset.
  modList <- ore.groupApply(
    X=ONTIME_S,
    INDEX=ONTIME_S$DEST,
    function(x, param) {
      library(biglm)
      biglm(ARRDELAY ~ DISTANCE + DEPDELAY, x)
    });
  modList_local <- ore.pull(modList)
  summary(modList_local$BOS) ## return model for BOS
– Function is loaded into the DB R engine
– Parallelism is enabled through the INDEX column: one extproc DB R engine per value
– Each data group subset is passed to extproc via an input cursor into R memory at the DB R engine
– Result "modList" is returned as a list of model objects, one per group
[Diagram, steps numbered 1 to 5: R user on desktop with a client R engine (other R packages, Oracle R package) and the Transparency Framework; Oracle Database with user tables and the rq*Apply() interface; multiple extproc DB R engines (other R packages, Oracle R package, Transparency Framework).]
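The group-wise modelling idea behind ore.groupApply() can be sketched on the desktop with base R's split() and lapply(). Everything below is an assumption made for illustration: the data frame `ontime_local` is synthetic, lm() is swapped in for biglm() so that no extra package is needed, and the loop runs sequentially in one R session rather than in database-spawned engines.

```r
set.seed(1)

# Synthetic stand-in for ONTIME_S with two destinations (hypothetical values).
ontime_local <- data.frame(
  DEST     = rep(c("BOS", "SFO"), each = 20),
  DISTANCE = runif(40, 200, 2500),
  DEPDELAY = rnorm(40, 10, 15)
)
ontime_local$ARRDELAY <- 2 + 0.001 * ontime_local$DISTANCE +
  0.9 * ontime_local$DEPDELAY + rnorm(40)

# One model per DEST value: split() plays the role of INDEX,
# and the function body mirrors the one passed to ore.groupApply().
modList_local <- lapply(split(ontime_local, ontime_local$DEST), function(x) {
  lm(ARRDELAY ~ DISTANCE + DEPDELAY, data = x)
})

names(modList_local)         # one list element per group
summary(modList_local$BOS)   # model for the BOS partition
```

The result has the same shape the slide describes for modList_local: a list of model objects, one per group, indexed by the group value.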
23 rqTableEval and rqRowEval – the SQL interface
rqTableEval: using the full table fish as input to the function, with no parameters, produce output that contains all input data plus the row sum of the values.
  select * from table(rqTableEval(
    cursor(select * from fish),
    NULL,
    'select t.*, 1 rowsum from fish t',
    'function(x, param) {
       dat <- data.frame(x, stringsAsFactors=F)
       cbind(dat, ROWSUM = apply(dat,1,sum))
     }'));
rqRowEval: providing one row at a time from table fish to the function (chunk size = 1), with no parameters, produce output similar to the above, adding 10 to the result for each row.
  select * from table(rqRowEval(
    cursor(select * from fish),
    NULL,
    'select t.*, 1 rowsum from fish t',
    1,
    'function(x, param) {
       dat <- data.frame(x, stringsAsFactors=F)
       cbind(dat, ROWSUM = apply(dat,1,sum)+10)
     }'));
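The difference between handing the function the whole table and handing it one row at a time can be imitated in plain R. This sketch assumes a tiny numeric data frame named `fish` with made-up columns A and B (the deck does not show the table's contents), and it emulates chunking with split(); it does not call the SQL interface itself.

```r
# Hypothetical contents for the fish table.
fish <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30))

# Function bodies as embedded in the two SQL statements on the slide.
table_fun <- function(x, param) {
  dat <- data.frame(x, stringsAsFactors = FALSE)
  cbind(dat, ROWSUM = apply(dat, 1, sum))
}
row_fun <- function(x, param) {
  dat <- data.frame(x, stringsAsFactors = FALSE)
  cbind(dat, ROWSUM = apply(dat, 1, sum) + 10)
}

# rqTableEval style: the function sees the full table in one call.
whole <- table_fun(fish, NULL)

# rqRowEval style with chunk size 1: one call per row, results stacked afterwards.
chunks <- split(fish, seq_len(nrow(fish)))
by_row <- do.call(rbind, lapply(chunks, row_fun, param = NULL))

print(whole)
print(by_row)
```

Because each call in the row-wise variant only sees its own chunk, it suits computations that are independent per row, which is what makes them easy to distribute.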
24 rqEval – generate XML string for graphic output
Execute the function that plots 100 random numbers and returns a vector with the values 1 to 10. No parameters are specified. Return the results as XML.
  SQL> set long 20000
  SQL> set pages 1000
  SQL> select value from table(rqEval(
         NULL, 'XML',
         'function(){
            res <- 1:10
            plot( 1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
            res
          }'));
View the XML VALUE returned, which can be consumed by OBIEE:
  VALUE
  ---------------------------------------------
  <variable name="result" type="numeric"> 1 2 3 4</value> 5 6 7 8 <value>9 10 <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAeAAAAHgCAIAAADytinCAAAgAElEQVR4nOzdeVxN+f8H8Ndt…
25 Users of statistical packages
26 Oracle R Enterprise Statistics Engine: Example Features
Special Functions
– Gamma function
– Natural logarithm of the Gamma function
– Digamma function
– Trigamma function
– Error function
– Complementary error function
Tests
– Chi-square, McNemar, Bowker
– Simple and weighted kappas
– Cochran-Mantel-Haenszel correlation
– Cramer's V
– Binomial, KS, t, F, Wilcoxon
Base SAS equivalents
– Freq, Summary, Sort
– Rank, Corr, Univariate
Density, Probability, and Quantile Functions
– Standard normal distribution
– Chi-square distribution
– Exponential distribution
– F-distribution
– Density Function
– Probability Function
– Quantile
– Gamma distribution
– Beta distribution
– Cauchy distribution
– Student's t distribution
– Weibull distribution
Functions listed on the slide:
IQR aggregate binom.test chisq.test cor cov fivenum get_all_vars ks.test mad median model.frame model.matrix na.omit quantile reorder rnorm sd t.test terms var var.test wilcox.test
add1.ore.lm drop1.ore.lm hatvalues.ore.lm logLik.ore.lm ore.as.matrix ore.corr ore.crosstab ore.crossval ore.cumsum ore.dbi.recover ore.dim ore.extend ore.extend.cumsum ore.extend.index ore.extend.mean ore.extend.sum ore.extend.total ore.extend.xval ore.freq ore.freq.all ore.group ore.groupmap ore.groups ore.index ore.is.crosstab ore.is.cumsum ore.is.extended ore.is.index ore.is.mean ore.is.sum ore.is.total ore.lm ore.mean ore.print ore.rank ore.sort ore.strata ore.stratas ore.sum ore.summary ore.total ore.univariate re.way plot.ore.lm predict.ore.lm print.summary.ore.lm summary.ore.lm vcov.ore.lm
Example:
  IRIS_TABLE$PETALBINS = ifelse(IRIS_TABLE$PETAL_LENGTH < 2, 1, 2)
  # Binomial test
  binom.test(IRIS_TABLE$PETALBINS)
  # Chi Square Test
  chisq.test(IRIS_TABLE$PETALBINS)
  # One sample K-S Test for given probabilities
  ks.test(IRIS_TABLE$PETAL_LENGTH, "pexp", rate=4)
  # Two sample K-S Test
  ks.test(IRIS_TABLE$PETAL_LENGTH, IRIS_TABLE$SEPAL_LENGTH)
  # T-test with different alternate hypothesis possibilities
  t.test(IRIS_TABLE$PETAL_LENGTH, alternative="two.sided", mu=0, conf.level=0.9)
  # F test to compare variances
  var.test(IRIS_TABLE$PETAL_LENGTH, IRIS_TABLE$SEPAL_LENGTH, ratio=0.75, alternative="two.sided", conf.level=0.9)
  # Wilcoxon signed rank test
  wilcox.test(IRIS_TABLE$PETAL_LENGTH-3.8, alternative="greater", mu=0)
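Several of these tests can be tried locally against R's built-in iris data frame, which is a reasonable stand-in for IRIS_TABLE if one assumes the table was created from it (column names differ: Petal.Length rather than PETAL_LENGTH). One difference to keep in mind: base R's binom.test() expects counts rather than a raw column, so this sketch tabulates the bins first; how the Oracle R Enterprise overload treats a column is not shown here.

```r
# Local stand-in for IRIS_TABLE: the built-in iris data frame.
petal_bins <- ifelse(iris$Petal.Length < 2, 1, 2)
bin_counts <- as.vector(table(petal_bins))   # counts for bin 1 and bin 2

# Binomial and chi-square tests on the bin counts
# (base R needs counts, unlike the column-based calls on the slide).
bt  <- binom.test(bin_counts)
chi <- chisq.test(bin_counts)

# t-test and F test with the same arguments as on the slide.
tt <- t.test(iris$Petal.Length, alternative = "two.sided",
             mu = 0, conf.level = 0.9)
vt <- var.test(iris$Petal.Length, iris$Sepal.Length, ratio = 0.75,
               alternative = "two.sided", conf.level = 0.9)

print(bin_counts)
print(chi)
print(tt$conf.int)
print(vt$estimate)   # ratio of the two sample variances
```

The Kolmogorov-Smirnov and Wilcoxon calls are left out of the sketch because iris contains tied values, which makes base R emit warnings about approximate p-values.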
27 ore.freq()
Compute the cross tabulation for the number of diverted flights for each airline. Compute the Pearson CHISQ for the results.
  ct <- ore.crosstab(UNIQUECARRIER~DIVERTED, data=ONTIME_S)
  ct
  freq <- ore.freq(ct)
  freq
For each airline, compute the cross tabulation for the number of diverted flights and the day of week. Compute the Pearson CHISQ for each result.
  ct <- ore.crosstab(UNIQUECARRIER~DIVERTED+DAYOFWEEK, data=ONTIME_S)
  ct
  freq <- ore.freq(ct)
  freq
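For readers without a database at hand, the first cross tabulation has a rough base R analogue in xtabs() followed by chisq.test(). This is a sketch under stated assumptions: the `flights` data frame is invented, and base R functions are swapped in for ore.crosstab() and ore.freq(), whose output layout is not reproduced here.

```r
# Hypothetical sample: three carriers, four flights each, DIVERTED coded 0/1.
flights <- data.frame(
  UNIQUECARRIER = rep(c("AA", "UA", "DL"), each = 4),
  DIVERTED      = c(0, 0, 0, 1,
                    0, 0, 1, 1,
                    0, 0, 0, 0)
)

# Carrier by diverted-flag contingency table.
ct <- xtabs(~ UNIQUECARRIER + DIVERTED, data = flights)
print(ct)

# Pearson chi-square test of independence; the counts are tiny, so the
# approximation warning is suppressed for this illustration only.
chi <- suppressWarnings(chisq.test(ct))
print(chi$statistic)
print(chi$parameter)   # degrees of freedom: (3 - 1) * (2 - 1)
```

On real data the warning should not be suppressed, since it signals that expected cell counts are too small for the chi-square approximation.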
28 Scatterplot Matrix
Airline, Arrival Delay, Departure Delay, Distance
29 Summary
R is…
– A statistical programming language and environment
– An open source software project with 2M+ users
– Exploding in functionality and popularity
Oracle R Enterprise enables…
– Transparent in-database data analytics using R
– R users to leverage Oracle Database, Exadata, and Big Data Appliance for enterprise-ready R analytics
– Writing map-reduce style R scripts and interfaces with Hadoop and HDFS
– Migration away from Base SAS, helping reduce SA$ Annual Usage Fees
Join us for the upcoming training sessions on ORE.