Big Data, Bigger Data & Big R Data

Birmingham R Users Meeting, 23rd April 2013, Andy Pryke

My Bias…
- I work in commercial data mining, data analysis and data visualisation
- Background in computing and artificial intelligence
- Use R to write programs which analyse data

What is Big Data?
Depends who you ask. Answers are often “too big to ….”
- …load into memory
- …store on a hard drive
- …fit in a standard database
Plus:
- “Fast changing”
- Not just relational

My “Big Data” Definition
“Data collections big enough to require you to change the way you store and process them.” Andy Pryke

Data Size Limits in R
- Standard R packages use a single thread, with data held in memory (RAM)
- See help("Memory-limits")
- Vectors limited to 2 billion items
- Memory limit of ~128 TB
- Servers with 1 TB+ memory are available
- Also, Amazon EC2 servers with up to 244 GB
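The figures above can be checked from an R session. This is a minimal sketch using base R only; the helper name `bytes_to_gib` is made up for illustration and is not from the talk. It assumes the "2 billion" figure refers to the classic maximum vector length of 2^31 - 1 elements (64-bit builds of R 3.0.0 and later also support longer "long vectors").

```r
# Largest value an R integer can hold; this is also the classic maximum vector length
max_len <- .Machine$integer.max   # 2^31 - 1

# Hypothetical helper (not from the slides): convert bytes to GiB
bytes_to_gib <- function(bytes) bytes / 1024^3

# A numeric (double) vector stores 8 bytes per element
full_vector_gib <- bytes_to_gib(max_len * 8)

# Measure a small vector to see the per-element cost in practice
x <- numeric(1e6)
per_element <- as.numeric(object.size(x)) / length(x)

cat("Classic maximum vector length:", max_len, "\n")
cat("RAM for a double vector of that length (GiB):", round(full_vector_gib, 1), "\n")
cat("Observed bytes per element:", round(per_element, 2), "\n")
```

So a single maximum-length numeric vector already needs roughly 16 GiB of RAM, before any copies made during analysis.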

Overview
- Problems using R with Big Data
- Processing data on disk
- Hadoop for parallel computation and Big Data storage / access
- “In Database” analysis
- What next for Birmingham R User Group?

Background: R matrix class
- Built in (package base)
- Stored in RAM
- “Dense” (takes up memory to store zero values)
Can be replaced by…

Sparse / Disk Based Matrices
- Matrix – Package Matrix. Sparse. In RAM
- big.matrix – Package bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
- Analysis – Packages irlba, bigalgebra, biganalytics (R-Forge list) etc.
More details? “Large-Scale Linear Algebra with R”, Bryan W. Lewis, Boston R Users Meetup
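To make the dense versus sparse difference concrete, here is a small sketch comparing the memory used by a base R matrix and a sparse matrix from package Matrix holding the same values. The size `n` and the variable names are illustrative choices, not from the talk, and only the in-RAM sparse case is covered (not big.matrix).

```r
library(Matrix)  # recommended package shipped with standard R installs

n <- 2000

# Dense base-R matrix: every cell is stored, zero or not
dense <- matrix(0, nrow = n, ncol = n)
diag(dense) <- 1

# Sparse equivalent: only the n non-zero entries are stored
sparse <- sparseMatrix(i = 1:n, j = 1:n, x = rep(1, n), dims = c(n, n))

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))

cat("Dense bytes: ", dense_bytes, "\n")
cat("Sparse bytes:", sparse_bytes, "\n")

# Both objects hold the same values
same_values <- all(as.matrix(sparse) == dense)
```

The saving depends entirely on how many cells are zero; for a matrix with few zeros the sparse form can be larger than the dense one.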

Commercial Versions of R
- Revolution Analytics have specialised versions of R for parallel execution & big data
- I believe many, if not most, components are also available under Free Open Source licences, including the RHadoop set of packages
- Plenty more info here

Background: Hadoop
- Parallel data processing environment based on Google’s “MapReduce” model
- “Map” – divide up the data and send it to multiple nodes for processing
- “Reduce” – combine the results
Plus:
- Hadoop Distributed File System (HDFS)
- HBase – distributed database like Google’s BigTable
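The map and reduce steps can be imitated on one machine in plain R, which may help before looking at Hadoop itself. This is only a single-process sketch of the idea, not Hadoop: the toy input and the names `map_line`, `mapped`, `grouped` and `counts` are made up for illustration, and the explicit "shuffle" step stands in for the grouping a MapReduce framework does between the two phases.

```r
# Toy input: each element plays the role of one line of text held on some node
lines <- c("in the beginning", "the word", "the end")

# "Map": turn each line into (word, 1) pairs, independently of the other lines
map_line <- function(line) {
  words <- unlist(strsplit(line, " ", fixed = TRUE))
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(lines, map_line))

# "Shuffle": group the values by key (done by the framework between the phases)
grouped <- split(mapped$value, mapped$key)

# "Reduce": combine the values for each key
counts <- vapply(grouped, sum, numeric(1))

print(counts)
```

Because each call to `map_line` only sees its own line, the map step could run on different nodes; only the grouping step needs to bring matching keys together.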

RHadoop – Revolution Analytics
- Packages: rmr2, rhbase, rhdfs
- Example code using RMR (R Map-Reduce)
- R and Hadoop – Step by Step Tutorials
- Install and Demo RHadoop (Google for more of these online)
- Data Hacking with RHadoop

RHadoop: word count example

    library(rmr2)

    wc.map <- function(., lines) {
      ## split "lines" of text into a vector of individual "words"
      words <- unlist(strsplit(x = lines, split = " "))
      keyval(words, 1)  ## each word occurs once
    }

    wc.reduce <- function(word, counts) {
      ## Add up the counts, grouping them by word
      keyval(word, sum(counts))
    }

    wordcount <- function(input, output = NULL) {
      mapreduce(
        input = input,
        output = output,
        input.format = "text",
        map = wc.map,
        reduce = wc.reduce,
        combine = TRUE)
    }

E.g. Function Output

    ## In, 1
    ## the, 1
    ## beginning, 1
    ##...
    ## the, 2345
    ## word, 987
    ## beginning, 123

Other Hadoop libraries for R
- Other packages: hive, segue, RHIPE…
- segue – easy way to distribute CPU-intensive work
  - Uses Amazon’s Elastic Map Reduce service, which costs money
  - Not designed for big data, but easy and fun
Example follows…

segue example

    # first, let's generate a 10-element list of
    # 999 random numbers + 1 NA:
    > myList <- getMyTestList()

    # Take the mean of each set of 999 numbers
    > outputLocal <- lapply(myList, mean, na.rm=T)
    > outputEmr <- emrlapply(myCluster, myList, mean, na.rm=T)
    RUNNING :16:57
    RUNNING :17:27
    RUNNING :17:58
    WAITING :18:29

    ## Check local and cluster results match
    > all.equal(outputEmr, outputLocal)
    [1] TRUE

    # The key is the emrlapply() function. It works just like lapply(),
    # but automagically spreads its work across the specified cluster
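The same "drop-in replacement for lapply()" pattern can be tried without an Amazon account using the parallel package that ships with R. This is a different technique from segue (local cores rather than an Elastic Map Reduce cluster), shown only as a rough analogue; the test list built here is a hypothetical stand-in for getMyTestList(), and the core count is an arbitrary choice.

```r
library(parallel)  # ships with base R

set.seed(42)

# Hypothetical stand-in for getMyTestList(): 10 vectors of 999 random numbers + 1 NA
myList <- replicate(10, c(rnorm(999), NA), simplify = FALSE)

# Forking is not available on Windows, so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Serial version, then the same call spread across local cores
outputLocal    <- lapply(myList, mean, na.rm = TRUE)
outputParallel <- mclapply(myList, mean, na.rm = TRUE, mc.cores = cores)

# Check serial and parallel results match
results_match <- isTRUE(all.equal(outputParallel, outputLocal))
print(results_match)
```

As with emrlapply(), the only change from the serial code is the function name plus an argument saying where the work should run.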

Oracle R Connector for Hadoop
- Integrates with Oracle Database, “Oracle Big Data Appliance” (sounds expensive!) & HDFS
- Map-Reduce is very similar to the rmr example
- Documentation lists examples for linear regression, k-means and working with graphs, amongst others
- Introduction to Oracle R Connector for Hadoop
- Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)

Teradata Integration
- Package: teradataR
- Teradata offer in-database analytics, accessible through R
- These include k-means clustering, descriptive statistics and the ability to create and call in-database user-defined functions

What Next?
I propose an informal “big data” Special Interest Group, where we collaborate to explore big data options within R, producing example code etc.
“R” you interested?
