The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Data Demo and MAKER-P.



The iPlant Collaborative Getting Started Getting Data into iPlant

Working with Big Data
Challenges: the scope and scale of life sciences data continue to grow

"Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set." (Wikipedia)

Working with Big Data
Challenges (sequencing example): data generation is cheaper and faster

Working with Big Data
Challenges: biology encompasses more than sequence data
Biologists work with and require access to diverse data types:
• Advanced imaging
• Geospatial
• Network

Working with Big Data
Challenges: changes in data require changes in tools
Changes in scale introduce quantitative and qualitative complications:
• Difficult/slow transfers
• Expense for storage/backup
• Difficult to share and publish
• Analysis
• Metadata (what is metadata?)

iPlant Data Store Overview
The Data Store services all iPlant platforms
• Access your data from multiple iPlant services
• Automatic backup (redundant between University of Arizona and University of Texas)
• Default 100GB allocation; >1TB allocations available with justification

iPlant Data Store Overview
iRODS (integrated Rule-Oriented Data System) is an established, scalable, open-source data management system
• iRODS supports many data-intensive projects
• iRODS abstracts data services from data storage to facilitate executing services across heterogeneous, distributed storage systems
• Avoid reinventing the wheel
• Critical for effective data management
• Works under the hood
• Folder = Collection

iPlant Data Store Overview
Benefits
Get Science Done / Reproducibility / Productivity
• Store any type of files related to your research
• An evolving “Data Commons” lets you access important datasets
• Metadata captures information needed for reproducibility
• Automatic backup and accessibility support your project’s data management plan
• iRODS makes high-speed transfers possible (100GB in ~30min)*
• Share data instantly with collaborators within iPlant

Some important things we will not “see” in the demo

Data Backups
• Replication between Texas and Arizona
• Key component of your data management
• Worry free!

Data Transfers
Source              Destination    Copy Method    Time (seconds)
CD                  My Computer    cp             320
Berkeley Server     My Computer    scp            150
External Drive      My Computer    cp             36
USB2.0 Flash        My Computer    cp             30
iPlant Data Store   My Computer    iget           18
My Computer         My Computer    cp             15

Close to optimum conditions; transfer between Univ. of Arizona and UC Berkeley.
100GB: 29m15s (1 GB / 17.5 seconds)
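The timing quoted on this slide (100GB in 29m15s) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, and it assumes decimal units (1 GB = 1000 MB), which the slide does not specify.

```python
def seconds_per_gb(total_gb, minutes, seconds):
    """Average number of seconds needed to move one gigabyte."""
    total_seconds = minutes * 60 + seconds
    return total_seconds / total_gb


def throughput_mb_per_s(total_gb, minutes, seconds, mb_per_gb=1000):
    """Average throughput in MB/s (assumes 1 GB = 1000 MB unless overridden)."""
    total_seconds = minutes * 60 + seconds
    return total_gb * mb_per_gb / total_seconds


# Figures from the slide: 100 GB transferred in 29 minutes 15 seconds.
rate = seconds_per_gb(100, 29, 15)
speed = throughput_mb_per_s(100, 29, 15)
print(f"{rate:.2f} s per GB, about {speed:.0f} MB/s")
```

The result is 17.55 seconds per gigabyte, which agrees with the slide's rounded "1 GB / 17.5 seconds".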

iPlant Data Store Overview
Some important things we will not “see” in the demo
• Local connections and institutional policies limit data transfers

iPlant Data Store
Multiple ways to access
• Command line: iCommands
• Point-and-click: Cyberduck, Discovery Environment, iDrop Desktop

Discovery Environment
• Simple upload/download for small files
• Bulk upload files and folders (<10GB)
• Import from URL (no size limit)
Advantage (+): Covers most upload/download sharing needs
Disadvantage (-): Some size/speed limitations

Cyberduck
• Drag and drop files and folders
• No size limit, file editing/previews
• Easy Desktop functionality
Advantage (+): More like desktop file systems
Disadvantage (-): No permissions/metadata control

iDrop Desktop
• Drag and drop files and folders
• No size limit
• Synchronize folders with Data Store
Advantage (+): Upload/download large file sizes and numbers of files
Disadvantage (-): Sharing and permission features more complex

iCommands
• Full flexibility
• Ability to script and automate
• Access from terminal/server
Advantage (+): Customizability
Disadvantage (-): Requires some command line expertise
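As an illustration of the "ability to script and automate" point, here is a minimal sketch that wraps the iCommands `iput` upload in Python. It assumes iCommands are installed and that `iinit` has already been run to authenticate. The helper names, the local folder `reads/`, and the collection path `/iplant/home/your_username/reads` are placeholders invented for this example, not values from the slides.

```python
import shutil
import subprocess


def build_iput_command(local_path, collection, recursive=True, verify=True):
    """Build the argument list for an iput upload.

    -r uploads a directory recursively; -K verifies the checksum after transfer.
    """
    cmd = ["iput"]
    if recursive:
        cmd.append("-r")
    if verify:
        cmd.append("-K")
    cmd += [local_path, collection]
    return cmd


def upload(local_path, collection):
    """Run the upload; raises if iput is missing or exits with an error."""
    if shutil.which("iput") is None:
        raise RuntimeError("iput not found on PATH; install iCommands and run iinit first")
    subprocess.run(build_iput_command(local_path, collection), check=True)


if __name__ == "__main__":
    # Hypothetical paths; only prints the command so nothing is transferred.
    print(" ".join(build_iput_command("reads/", "/iplant/home/your_username/reads")))
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to inspect or log the exact command before any data moves.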

The iPlant Collaborative Getting Started Cyberduck and iCommands Demo

The iPlant Collaborative Getting Started Gene Annotation with MAKER

Available on Atmosphere and DE (Lonestar)

Quick Review: What Are Annotations?
Annotations are descriptions of features of the genome:
• Structural: exons, introns, UTRs, splice forms, etc.
• Coding & non-coding genes
• Expression, repeats, transposons
Annotations should include an evidence trail:
• Assists in quality control of genome annotations
Examples of evidence supporting a structural annotation:
• Ab initio gene predictions
• ESTs
• Protein homology

Secondary Annotations
• Protein domains: InterProScan (combines many HMM databases)
• GO and other ontologies
• Pathway mapping, e.g. BioCyc Pathway Tools

Challenges in Annotation
Particularly for plants…
• Genomes are BIG
• Highly repetitive
• Many pseudogenes
• Assembly contamination
• Incomplete evidence
• No method is 100% accurate

Options for Protein-coding Gene Annotation Yandell & Ence. Nature Reviews Genetics 13, (May 2012) | doi: /nrg3174

Typical Annotation Pipeline
• Contamination screening
• Repeat/TE masking
• Ab initio prediction
• Evidence alignment (cDNA, EST, RNA-seq, protein)
• Evidence-driven prediction
• Chooser/combiner
• Evaluation/filtering
• Manual curation

MAKER-P Automated Pipeline
• Repeat library
• Evidence
• Ab initio prediction
• MPI-enabled to allow parallel operation on large compute clusters
• Collaboration with Yandell Lab

What is a GFF File?
Generic Feature Format
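To make the format concrete: a GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The sketch below parses one such line in Python. The example record is invented for illustration and is not real MAKER-P output, and the parser handles only the basic case of one feature line at a time.

```python
from typing import NamedTuple
from urllib.parse import unquote


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line):
    """Parse one GFF3 feature line; return None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 escapes reserved characters with URL-style percent encoding.
            attributes[unquote(key)] = unquote(value)
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Hypothetical record, made up for this example.
example = "chr1\tmaker\tgene\t1000\t5000\t.\t+\t.\tID=gene0001;Name=hypothetical_gene"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["ID"])
```

Because start and end are both inclusive, the feature length is `end - start + 1`.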

MAKER-P at iPlant

TACC Lonestar Supercomputer: 22,656 CPU cores on 1,888 nodes

W559 - Annotation of the Loblolly Pine Megagenome (Jill Wegrzyn)
• Gb assembly, split into 40 jobs
• 216 CPU/job (8640 CPU total)
• 17 hours

P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species
• 10 rice species (each w/12 chromosome pseudomolecules)
• 96 CPU per chromosome (1152 CPU total)
• ~2hr per genome

Genome                 Assembly    Size (Mb)   CPU   Run Time
Arabidopsis thaliana   TAIR                          :44
Arabidopsis thaliana   TAIR                          :27
Zea mays               RefGen_v                      :53

Campbell et al. Plant Physiology. December 4, 2013, DOI: /pp
PAG 2014: Agave API
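The CPU totals on this slide follow from simple multiplication, which the short check below reproduces. The variable names are made up for this example. The core-hours figure is a rough upper bound that assumes every core was busy for the full 17 hours, which the slide does not state.

```python
# Loblolly pine run: 40 jobs at 216 CPUs each.
pine_jobs, pine_cpu_per_job, pine_hours = 40, 216, 17
pine_total_cpu = pine_jobs * pine_cpu_per_job

# Rice runs: 12 chromosome pseudomolecules per species at 96 CPUs each.
rice_chromosomes, rice_cpu_per_chromosome = 12, 96
rice_total_cpu = rice_chromosomes * rice_cpu_per_chromosome

# Lonestar: cores per node implied by the slide's totals.
lonestar_cores, lonestar_nodes = 22656, 1888
cores_per_node = lonestar_cores // lonestar_nodes

# Assumption: all cores in use for the whole wall-clock time.
pine_core_hours = pine_total_cpu * pine_hours

print(pine_total_cpu, rice_total_cpu, cores_per_node, pine_core_hours)
```

The pine run's 8640 CPUs amount to a little over a third of Lonestar's 22,656 cores.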

Keep asking: ask.iplantcollaborative.org

The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI ).