James j.d. long december 6, 2011 london R user group.

Slides:



Advertisements
Similar presentations
AppFabric Caching Services:
Advertisements

I’ve Got Peace Like a River I've got peace like a river, I've got peace like a river in my soul. I've got peace like a river, I've got peace like a river.
Cloud Computing (Yes, It’s More Than Hype) Chris Murphy, editor InformationWeek.
Gold Sponsors Bronze Sponsors Silver Sponsors Taking SharePoint to the Cloud Aaron Saikovski Readify – Software Solution Specialist.
Cloud Computing at GES DISC Presented by: Long Pham Contributors: Aijun Chen, Bruce Vollmer, Ed Esfandiari and Mike Theobald GES DISC UWG May 11, 2011.
1 Cloud Computing with Amazon and Oracle Lewis Cunningham TUSC, Sr Datawarehouse Consultant
Amazon Web Services Justin DeBrabant CIS Advanced Systems - Fall 2013.
EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.
Cisco Intercloud Fabric John McDonough Technical Marketing Engineer January, 2015 Amazon AWS & Microsoft Azure – Cloud Access Keys.
Big Data: Analytics Platforms Donald Kossmann Systems Group, ETH Zurich 1.
CONNECTING REMOTE PC WITHOUT ANY SOFTWARE USING CHROME WEB BROWSER WITH ITS ADD-ON/EXTENSION FOR REMOTE ACCESS HASSLE FREE ACCESS USING A COMMON GMAIL.
TIME FRAME. Which time frame should you choose? Move down method -Start from the daily charts, once profitable move down to -H4 -H1 -30min -15min -5min.
Startup Technology Pitfalls and How to Avoid them.
Foxhole Technology Big Data Analytics Challenge Foxhole Confidential.
Web Crawler with Word Count – Single and Multi Threaded with GAE By, Vallisha Keshavamurthy Rajarshi Chakraborty CSE 587 Project 1, Dr. Bina Ramamurthy.
1/19 Presented by: Maedeh Tashakkorian Supervisor: Hadi Salimi Mazandaran University of Science and Technology February, 2011.
Cloud Don McGregor Research Associate MOVES Institute
Branding and Customizing My Sites with Microsoft SharePoint Server 2010 John Ross & Randy Drisgill MVPs Rackspace Hosting OSP337.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
CLOUD COMPUTING 2.0 Finally, the promise of the cloud has arrived v 1.8.
Scientific Computing at Amazon Disruptive Innovations in Distributed Computing Dave Ward, Principal Product Manager Adam Gray, Senior Product Manager.
CLOUD COMPUTING 黃政明 陳其偉 李佳融 廖柏威 吳宜憲 簡瑋男 朱思翰.
Apache Hadoop MapReduce What is it ? Why use it ? How does it work Some examples Big users.
Extreme scale parallel and distributed systems – High performance computing systems Current No. 1 supercomputer Tianhe-2 at petaflops Pushing toward.
Scaling to the Modern Internet CSCI 572: Information Retrieval and Search Engines Summer 2010.
Cansys West International Conference February , 2013Panama City, Panama An easier way to deliver APPX applications.
J. J. Rehr & R.C. Albers Rev. Mod. Phys. 72, 621 (2000) A “cluster to cloud” story: Naturally parallel Each CPU calculates a few points in the energy grid.
1 The Fast(est) Path to Building a Private/Hybrid Cloud October 25th, 2011 Paul Mourani RightScale.
Grid and Cloud Computing Globus Provision Dr. Guy Tel-Zur.
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
Cloud SharePoint-hosted SharePoint Autohosted Provider-hosted Host web App web (optional) Host web App web Host web App web (optional)
E X C E E D I N G E X P E C T A T I O N S VLIW-RISC CSIS Parallel Architectures and Algorithms Dr. Hoganson Kennesaw State University Instruction.
MAPS. Dr. John Snow’s Cholera Map of London
Apache Hadoop on the Open Cloud David Dobbins Nirmal Ranganathan.
Branding and Customizing My Sites with Microsoft SharePoint Server 2010 John Ross & Randy Drisgill MVPs Rackspace Hosting OSP337.
Searching Over Encrypted Data Charalampos Papamanthou ECE and UMIACS University of Maryland, College Park Research Supported By.
Hey, You, Get Off of My Cloud Thomas Ristenpart, Eran Tromer, Hovav Shacham, Stefan Savage Presented by Daniel De Graaf.
INFO 344 Web Tools And Development CK Wang University of Washington Spring 2014.
EuroSys Doctoral Workshop 2011 Resource Provisioning of Web Applications in Heterogeneous Cloud Jiang Dejun Supervisor: Guillaume Pierre
Distributed Process Discovery From Large Event Logs Sergio Hernández de Mesa {
Terraform at Adobe Kelvin Jasperson. Introduction 2 Systems Adobe Audience Manager (AAM) Been with Adobe for 18 months AAM was acquired by.
GETTING STARTED WITH AWS AND PYTHON. OUTLINE  Intro to Boto  Installation and configuration  Working with AWS S3 using Bot  Working with AWS SQS using.
Databricks What is Databricks ? Cloud services used Functionality Languages Spark Usage 3 rd Party Apps Architecture Books
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI-Engage Accounting Portal contribution Iván Díaz 2014/06/05.
Parallel and Distributed Programming: A Brief Introduction Kenjiro Taura.
Cloud Computing % of us use some form of cloud coumputing.
Cloud Services Today and the Future
Dag Toppe Larsen UiB/CERN CERN,
Microsoft /2/2018 3:42 PM BRK3129 Query Big Data using the Expanded T-SQL footprint with PolyBase in SQL Server 2016 Casey Karst Program Manager.
TECHNOLOGY.
Hire Toyota Innova in Delhi for Outstation Tour
Get Help-Gmail help number
PLOTr -KUSHAL MEHTA.
AWS DevOps Engineer - Professional dumps.html Exam Code Exam Name.
2018 Amazon AWS DevOps Engineer Professional Dumps - DumpsProfessor
رايانش ابري Cloud Computing
湖南大学-信息科学与工程学院-计算机与科学系
Development of the Nanoconfinement Science Gateway
Martin Swany Gregor von Laszewski Thomas Sterling Clint Whaley
Fintan The Amazing Fish of Knowledge…
In this session… Introduce what we’re talking about
  30 A 30 B 30 C 30 D 30 E 77 TOTALS ORIGINAL COUNT CURRENT COUNT
Module P3 Practical: Building a webapp in nodejs and
The Big 6 Research Model Step 3: Location and Access
ВОМР Подмярка 19.2 Възможности за финансиране
Споразумение за партньорство
Lecture 16B: Instructions on how to use Hadoop on Amazon Web Services
Big Data, Bigger Data & Big R Data
Building a Threat-Analytics Multi-Region Data Lake on AWS
2019 Summer Cloud Hosting Deal To Avail This Offer Go To CloudwaysCloudways.
Presentation transcript:

james j.d. long december 6, 2011 london R user group

SEGUE parallel R in the cloud two lines of code no kidding!

SEGUE why... so i've got this problem... reinsurance simulations updated frequently for one month on my laptop... each sim takes ~ 1 min 10k sims * 1 min = ~ 7 days no need for full map/reduce embarrassingly parallel

SEGUE you've seen ”word count” demos... segue has nothing to do with that big cpu, not big data

SEGUE my options... make the code faster build a cluster type snow mpi hadoop location self hosted amazon web services ec2 emr rackspace lowest startup costs

SEGUE syntax... require(segue) myCluster <- createCluster() contratulations. we've built a cluster!

SEGUE more syntax... parallel apply() on lists: base R: lapply( X, FUN, … ) segue: emrlapply( clusterObject, X, FUN, … )

SEGUE example... stochastic pi simulation (again)

SEGUE example... estimatePi <- function( seed ){ set.seed(seed) numDraws < r <-.5 x <- runif(numDraws, min=-r, max=r) y <- runif(numDraws, min=-r, max=r) inCircle <- ifelse( (x^2 + y^2)^.5 < r, 1, 0) return(sum(inCircle) / length(inCircle) * 4) } seedList <- as.list(1:1000) require(segue) myCluster <- createCluster(20) myEstimates <- emrlapply( myCluster, seedList, estimatePi ) stopCluster(myCluster) myPi <- Reduce(sum, myEstimates) / length(myEstimates) format(myPi, digits=10)

SEGUE howzit work? createCluster() cluster object: list of parameters bootstrap: update R update packages temp dirs: local S3 for EMR ~ minutes

SEGUE howzit work? emrlapply() list is serialized to CSV and uploaded to S3 – streaming input file EMR copies files to nodes – mapper.R picks them up function, arguments, r objects, etc are saved & uploaded CSV is input to mapper.R applies function to each list element output is serialized into emr part-xxxxx files on s3 part files are downloaded to R and deserialized deserialized results are reordered and put into a list object

SEGUE options? numInstances number of ec2 machines to fire up cranPackagescran packages to load on each cluster node filesOnNodesfiles to be loaded on each node rObjectsOnNodesR objects to put on the worker nodes enableDebugging start emr debugging instancesPerNode number of R instances per node masterInstanceType ec2 instance type for the master node slaveInstanceType ec2 instance type for the slave nodes location ec2 location name for the cluster ec2KeyName ec2 key used for logging into the main node copy.image copy the entire local environment to the nodes? otherBootstrapActionsother bootstrap actions to run sourcePackagesToInstallR source packages to be installed on each node

SEGUE when to use segue... embarrassingly parallel cpu bound apply on lists with many items object size: to / from s3 roundtrip each job has a fixed & marginal cost

SEGUE downside of segue... embarrassingly parallel failure

SEGUE reasons daddy drinks... (a.k.a things vendors never say) keep one eye on aws dashboard united nations considers debugging of segue jobs ”torture” under geneva convention

SEGUE more reasons daddy drinks... if you use segue you will see: unreproducable errors clusters that never start temp buckets in your s3 acct clusters left running i/o that takes longer than calcs but... i've never had a ”wrong” answer

SEGUE Immediate segue future... maintenance issues: R releases change emr changes vendor lock-in to amazon whirr as solution? foreach %dopar% backend?

SEGUE imagine the future... R objects backed by clusters as.hdfs.data.frame(data) operations converted to map reduce jobs transparently other abstractions...

SEGUE segue project page google groups see also... rhipe – program m/r in R revolution analytics rhadoop: