Scaling Compute with R in CyVerse

Blake Joyce – Science Analyst, University of Arizona
bjoyce3@cyverse.org

Evolution of CyVerse
From plant science, to life science, and beyond…
- iPlant 2008: Empowering a New Plant Biology
- iPlant 2013: Cyberinfrastructure for Life Science
- CyVerse: Transforming Science Through Data-Driven Discovery

Who We Serve
- Space Object Behavioral Sciences
- 1000s of researchers

But what is Cyberinfrastructure?
- Software: platforms, tools, datasets
- Hardware: storage and compute
- People: expertise, support, training
CyVerse is People + Cyberinfrastructure, empowering researchers
http://www.cyverse.org

Who We Are: the SI Team
- Ramona Walls: 4 years; 14 proposals, 5 grants, 14 publications; Ecology, Ontologies, Data Standards, Data Identifiers, Data Commons
- Upendra Devisetty: 9 months; 3 publications; Genomics, Metagenomics, Transcriptomics, Docker Genius
- Martha Narro: 8 years; 3 proposals, 3 publications; Image Management and Analysis, Project Management
- Tyson Swetnam: Aug 2016; 1 proposal, 4 grants (continuing), 2 publications; GIS/drones, Remote sensing
- Blake Joyce: Aug 2016; 1 proposal, 3 publications; Agriculture, Ecology, Software Carpentry, Hacky Hour
Past Science Informaticians

I’ve Only Recently Learned Computation
- Background: ecology and plant secondary metabolism engineering
- 2 years ago (Nov 2014) I could not code
- Took a Software Carpentry R course (Feb 2015)
- Co-taught an SWC course (Oct 2015)
- Published my first (Python) bioinformatics tool, FractBias (Aug 2016)

The Data Cycle

CyVerse Cyberinfrastructure: An Interoperable Ecosystem
[Diagram: CyVerse Cyberinfrastructure, iPlant Computational Resources, iPlant Data Store]

Particular focus for this talk

It All Starts with the Data
Data Store
- Specific request: initial allocation of 100 GB
- Allocation can be increased (http://www.cyverse.org/content/increase-your-data-store-allocation)
- We need to report to NSF, so the allocation request has to be fully filled out!
Data Commons
- Specific interest: “sequence data. data archiving/backup.”
- Issue DOIs to data sets
- Move data out to NCBI SRA and WGS through a form
- Projects are currently being created

Discovery Environment Overview
Hands-on demo: create a multiple alignment
- Find a file in the Community Data folder
- Download a small file of unaligned DNA sequences
- Upload a small file
- Use the MUSCLE App to align the sequences
- Monitor the job status and export its parameters
- View results

Atmosphere Overview: advantages
Get it Done
- Work in an on-demand Linux environment (most bioinformatics)
- Collaborate with students and colleagues on the same instance
Reproducibility
- Make data, workflows, and analyses available in a public image
- Access previous software versions and images
Productivity
- Multicore, high-memory images to run multithreaded applications
- Move your analyses from your laptop to the cloud

Integrate Apps and Make Workflows
Docker for the DE
- We have switched to virtual container technology
- Packages all the dependencies into a container
- You build your own GUI for people to use
- Dockerfiles available on the CyVerse GitHub repo

Access RStudio Server (for free!)
Atmosphere
- RStudio image
- Use for development of R code
- Go to a browser and paste your instance’s IP address
- Add “:8787”; RStudio Server listens on port 8787
- Get the compute right, and when you’re happy, move to Jetstream
Jetstream
- Scale up the RStudio compute you perfected on Atmosphere
- Share your code, image, or the Docker container you developed with anyone
- Allow reviewers to access all your compute/code/data to rebuild it themselves (use R Markdown or Jupyter notebooks for style points)
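The browser address described on this slide (instance IP plus “:8787”) can be sketched in a few lines of Python. This is only an illustration: `rstudio_url` is a hypothetical helper, the plain `http` scheme is an assumption (an instance may be configured differently), and `192.0.2.10` is a documentation-only placeholder address, not a real instance.

```python
import ipaddress

# Port named on the slide for RStudio Server.
RSTUDIO_PORT = 8787


def rstudio_url(ip: str, port: int = RSTUDIO_PORT) -> str:
    """Build the address to paste into a browser (hypothetical helper).

    Raises ValueError if `ip` is not a valid IPv4 or IPv6 address.
    """
    address = ipaddress.ip_address(ip)
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{address}]" if address.version == 6 else str(address)
    # Assumption: plain http; the slide does not name a scheme.
    return f"http://{host}:{port}"


if __name__ == "__main__":
    # Placeholder address from the documentation range, not a real instance.
    print(rstudio_url("192.0.2.10"))
```

Validating the address first catches the common mistake of pasting a hostname fragment or a truncated IP before the browser ever tries to connect.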

Specific Needs Mentioned
“I would like to run specific DNA sequence assembly softwares on a Linux supercomputer through which I could have remote access.”
- HPC and cloud are 90% Linux computers
- This can be done on cloud computing, the DE, or the HPC through the DE

Specific Needs Mentioned: Jupyter
“Parallel processing in r”
“Coding R in parallel, making R use less resources”
- To be frank: it’s not easy in R, but it’s possible
- Purists are going to get angry and mention lapply(), etc.
- Python offers just-in-time compiling (JIT packages like {jit} don’t work in standard R, only in Ra)
- Apache Spark works in Jupyter really smoothly (SDSC)
Here’s my non-answer to the question:
- Jupyter == Julia + Python + R
- Execute all different kinds of code in the same place/interface
- I know this is going to make me some enemies, but my job is to bring you the tools, so… researchers have to move away from using a single language
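Since the slide points toward Python in Jupyter as one route to parallel work, here is a minimal sketch of the map-style pattern that lapply() users already know, spread across worker processes. It uses only the standard library; `gc_content` and `parallel_map` are hypothetical names chosen for illustration, and the toy workload (GC fraction of short DNA strings) is an assumption, not something from the talk.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (0.0 if empty)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def parallel_map(func, items, workers=2, executor_cls=ProcessPoolExecutor):
    """Apply `func` to every item using a pool of workers.

    Results come back in the same order as the inputs, like lapply().
    Pass executor_cls=ThreadPoolExecutor for I/O-bound work; the default
    process pool suits CPU-bound work.
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

In a script, a call such as `parallel_map(gc_content, reads, workers=4)` belongs under an `if __name__ == "__main__":` guard, because process pools may re-import the main module on some platforms. For tiny inputs like these the cost of starting workers outweighs the gain; the pattern pays off when each item takes real compute time.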

Specific Needs Mentioned: Training
“Using R and Github in the university computing resources”
Software Carpentry and Data Carpentry
- A great way to get introduced to basic coding
- Can learn how to go from data -> analysis -> advanced graphing
- Advanced graphing example: FractBias
Research Bazaar Arizona
- Weekly events designed to complement SWC/DC
- For people who want hands-on, peer-to-peer training
- Get help with specific problems, help others with specific problems
- Build a community at UA. Learn what other departments are doing.
- Drink tea or beer! (or don’t, it’s whatever)

A Shameless Plug: ResBazAZ

Specific Needs Mentioned: need more info
“I hope to learn how to run r scripts on the super computer.”
- Answer: cloud computing, a.k.a. “own your own ghost computer”
“be interested in MATLAB links to the HPC. I am also wondering if there is a Github interface with Cyverse.”
“I want to know the amount of storage we are allowed to use, the memory we can use, how many cores we can use, how to submit jobs for parallel computing.”
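For the last question, one quick check once you are logged in to a cloud instance is to ask the machine itself what it has. The sketch below is an assumption-laden illustration, not a CyVerse tool: `describe_resources` is a hypothetical helper, it reports only what the current machine exposes (not your allocation limits), and it skips memory because the Python standard library has no portable call for it.

```python
import os
import shutil


def describe_resources(path: str = ".") -> dict:
    """Report CPU cores and disk space visible from this machine.

    `path` selects which filesystem to inspect, e.g. a mounted data volume.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "cpu_cores": os.cpu_count(),        # may be None if undetermined
        "disk_total_gb": usage.total / 1e9,
        "disk_free_gb": usage.free / 1e9,
    }


if __name__ == "__main__":
    for name, value in describe_resources().items():
        print(f"{name}: {value}")
```

The core count is a sensible upper bound for the number of workers to request when running code in parallel on that instance.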

Coming Features of Interest
Bring your own compute to the DE
- Get an HPC allocation, Jetstream allocation, etc.
- Use pre-existing apps in the DE and point them at your own compute
- Contact me, Susan Miller, or Nirav Merchant if interested
Singularity on HPC
- Docker for HPC (though they hate that description)
- Virtualized environment that lets you install what you please and run it
- Doesn’t give root permissions, so it’s secure
R Shiny server
Integration with Jupyter (and RStudio Enterprise?)
GIS