Scaling Compute with R in CyVerse


1 Scaling Compute with R in CyVerse
Blake Joyce – Science Analyst University of Arizona

2

3 Evolution of CyVerse: from plant science, to life science, and beyond…
Transforming Science Through Data-Driven Discovery
iPlant 2008: Empowering a New Plant Biology
iPlant 2013: Cyberinfrastructure for Life Science

4 Who We Serve

5 Who We Serve Space Object Behavioral Sciences

6 Who We Serve 1000s of researchers

7 But what is Cyberinfrastructure?
Software: platforms, tools, datasets
Hardware: storage and compute
People: expertise, support, training
CyVerse is people + cyberinfrastructure, empowering researchers

8 Past Science Informaticians
Who We Are: the SI Team
Ramona Walls: 4 years; 14 proposals, 5 grants, 14 publications; Ecology, Ontologies, Data Standards, Data Identifiers, Data Commons
Upendra Devisetty: 9 months; 3 publications; Genomics, Metagenomics, Transcriptomics, Docker Genius
Martha Narro: 8 years; 3 proposals, 3 publications; Image Management and Analysis, Project Management
Tyson Swetnam: Aug 2016; 1 proposal, 4 grants (continuing), 2 publications; GIS/drones, Remote sensing
Blake Joyce: Aug 2016; 1 proposal, 3 publications; Agriculture, Ecology, Software Carpentry, Hacky Hour

9 I’ve Only Recently Learned Computation
Background: ecology and plant secondary metabolism engineering
2 years ago (Nov 2014) I could not code
Took a Software Carpentry R course (Feb 2015)
Co-taught SWC course (Oct 2015)
Published first (Python) bioinformatic tool called FractBias (Aug 2016)

10 The Data Cycle

11 CyVerse Cyberinfrastructure
An Interoperable Ecosystem
iPlant Computational Resources
iPlant Data Store

12 Particular focus for this talk

13 It All Starts with the Data
Data Store
  Specific request: initial allocation of 100 GB
  Allocation can be increased (
  We need to report to NSF, so the allocation has to be fully filled out!
Data Commons
  Specific interest: “sequence data. data archiving/backup.”
  Issue DOIs to data sets
  Move data out to NCBI SRA and WGS through a form
  Projects are being created currently

14 Discovery Environment Overview
Hands-on demo: create a multiple alignment
  Find a file in the Community Data folder
  Download a small file of unaligned DNA sequences
  Upload a small file
  Use the MUSCLE App to align the sequences
  Monitor the job status and export its parameters
  View results

15 Atmosphere Overview: advantages...
Get it Done
  Work in an on-demand Linux environment (most bioinformatics)
  Collaborate with students and colleagues on the same instance
Reproducibility
  Make data, workflows, and analyses available in a public image
  Access previous software versions and images
Productivity
  Multicore high memory images to run multithreading applications
  Move your analyses from your laptop to the cloud

16 Integrate Apps and Make Workflows
We have switched to using virtual container technology
  Packages all the dependencies into a container
  You build your own GUI for people to use
Docker for the DE
  Docker files available on the CyVerse GitHub repo

17 Access RStudio Server (for free!)
Atmosphere RStudio image
  Use for development of R code
  Go to a browser and paste your IP address
  Add “:8787” -> RStudio Server listens on port 8787
  Get the compute right and then, when you’re happy, move to Jetstream
Jetstream
  Scale up the RStudio compute you perfected on Atmosphere
  Share your code, image, or the Docker container you developed with anyone
  Allow reviewers to access all your compute/code/data to rebuild it themselves
  (use R Markdown or Jupyter notebooks for style points)
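The "paste your IP address, add :8787" step can be sketched in a few lines of Python. This is only an illustration: the function name rstudio_url is made up here, the address 203.0.113.10 is a reserved documentation address rather than a real instance, and the plain http scheme is an assumption about how the instance is served.

```python
import ipaddress


def rstudio_url(ip, port=8787):
    """Build the browser URL for RStudio Server on an instance (hypothetical helper)."""
    # ip_address() raises ValueError if the string is not a valid IPv4/IPv6 address.
    address = ipaddress.ip_address(ip)
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{address}]" if address.version == 6 else str(address)
    # Assumption: the server is reached over plain HTTP on the given port.
    return f"http://{host}:{port}"


# Placeholder address from the documentation range, not a real instance.
print(rstudio_url("203.0.113.10"))
```

Paste the printed URL into a browser to reach the RStudio login page on the instance.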

18 Specific Needs Mentioned
“I would like to run specific DNA sequence assembly softwares on a Linux supercomputer through which I could have remote access.”
HPC and Cloud are 90% Linux computers
This can be done on cloud computing, the DE, or the HPC through the DE

19 Specific Needs Mentioned: Jupyter
“Parallel processing in r”
“Coding R in parallel, making R use less resources”
To be frank: it’s not easy in R, but it’s possible
  Purists are going to get angry and mention lapply(), etc.
  Python offers just-in-time (JIT) compiling (JIT packages like {jit} don’t work in R -> Ra)
  Apache Spark works in Jupyter really smoothly (SDSC)
Here’s my non-answer to the question: Jupyter == Julia + Python + R
  Execute all different kinds of code in the same place/interface
I know this is going to make me some enemies, but my job is to bring you the tools, so… researchers have to move away from using a single language
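The lapply() pattern mentioned on this slide, applying one function to every element of a list, is the usual starting point for parallel code. Since the slide lists Python among the Jupyter languages, here is a minimal Python sketch of that pattern spread across worker processes with the standard-library concurrent.futures module. The function names (count_primes, split_range, parallel_count_primes) and the prime-counting workload are illustrative assumptions, not something from the talk.

```python
import math
import os
from concurrent.futures import ProcessPoolExecutor


def count_primes(bounds):
    """Count primes in the half-open range [start, stop); a stand-in for CPU-heavy work."""
    start, stop = bounds
    count = 0
    for n in range(max(start, 2), stop):
        # n is prime if no d in 2..isqrt(n) divides it.
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def split_range(stop, parts):
    """Split [0, stop) into at most `parts` contiguous (start, stop) chunks."""
    step = max(1, -(-stop // parts))  # ceiling division, never zero
    return [(lo, min(lo + step, stop)) for lo in range(0, stop, step)]


def parallel_count_primes(stop, workers=None):
    """Map count_primes over the chunks in separate processes, then sum the results."""
    workers = workers or os.cpu_count() or 1
    chunks = split_range(stop, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, chunks))


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this demo on import.
    print(parallel_count_primes(100_000))
```

Each chunk runs in its own process, so the speedup is bounded by the number of cores on the instance you launch.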

20 Specific Needs Mentioned: Training
“Using R and Github in the university computing resources”
Software Carpentry and Data Carpentry
  A great way to get introduced to basic coding
  Can learn how to start with data -> analysis -> advanced graphing
  Advanced graphing example: FractBias
Research Bazaar Arizona
  Weekly events designed to complement SWC/DC
  For people that want hands-on peer-to-peer training
  Get help with specific problems, help others with specific problems
  Build a community at UA. Learn what other departments are doing.
  Drink tea or beer! (or don’t, it’s whatever)

21 A Shameless Plug: ResBazAZ

22 Specific Needs Mentioned: need more info
“I hope to learn how to run r scripts on the super computer.”
  Answer: cloud computing, A.K.A. “own your own ghost computer”
“be interested in MATLAB links to the HPC. I am also wondering if there is a Github interface with Cyverse.”
“I want to know the amount of storage we are allowed to use, the memory we can use, how many cores we can use, how to submit jobs for parallel computing.”

23 Coming Features of Interest
Bring your own compute to the DE
  Get an HPC allocation, Jetstream allocation, etc.
  Use pre-existing apps in the DE and point them at your own compute
  Contact me, Susan Miller, or Nirav Merchant if interested
Singularity on HPC
  Docker for HPC (though they hate that description)
  Virtualized environment that lets you install what you please and run it
  Doesn’t give root permissions, so it’s secure
R Shiny server
Integration with Jupyter (and RStudio Enterprise?)
GIS

24

