Presentation is loading. Please wait.

Presentation is loading. Please wait.

The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Data Demo and MAKER-P.

Similar presentations


Presentation on theme: "The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Data Demo and MAKER-P."— Presentation transcript:

1 The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Data Demo and MAKER-P

2 The iPlant Collaborative Getting Started Getting Data into iPlant

3 Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set. - Wikipedia - (http://en.wikipedia.org/wiki/Big_data) Challenges: the scope and scale of life sciences data continue to grow Working with Big Data

4 Challenges (sequencing example): data generation is cheaper and faster Working with Big Data

5 Biologists work with and require access to diverse data types Working with Big Data Challenges: biology encompasses more than sequence data Advanced ImagingGeospatialNetwork

6 Challenges: changes in data require changes in tools Working with Big Data Changes in scale introduce quantitative and qualitative complications Difficult/slow transfers Expense for storage/backup Difficult to share and publish Analysis Metadata (what is metadata?)

7 The Data Store services all iPlant platforms iPlant Data Store Overview Access your data from multiple iPlant services Automatic backup (redundant between University of Arizona and University of Texas) Default 100GB allocation, >1TB allocations available with justification

8 iRODS (integrated Rule-Oriented Data System) is an established, scalable, open-source data management system iRODS supports many data intensive projects iRODS abstracts data services from data storage to facilitate executing services across heterogeneous, distributed storage systems Avoid reinventing the wheel iPlant Data Store Overview Critical for effective data management Works under the hood Folder = Collection

9 Benefits Get Science Done Reproducibility Productivity Store any type of files related to your research An evolving “Data Commons” lets you access important datasets Metadata captures information needed for reproducibility Automatic backup and accessibility support your project’s data management plan IRODS makes high-speed transfers possible (100GB in ~30min)* Share data instantly with collaborators within iPlant iPlant Data Store Overview

10 Texas Replication Arizona Key component of your data management Worry Free! Some important things we will not “see” in the demo SourceDestination Copy Method Time (seconds) CDMy Computercp320 Berkeley ServerMy Computerscp150 External DriveMy Computercp36 USB2.0 FlashMy Computercp30 iPlant Data StoreMyComputeriget18 My Computer cp15 Close to optimum conditions; transfer between Univ. of Arizona and UC Berkeley 100GB: 29m15s, 1 GB / 17.5 seconds Data TransfersData Backups

11 iPlant Data Store Overview Some important things we will not “see” in the demo http://www.speedtest.net/ Local connections and institutional policies limit data transfers

12 iPlant Data Store Command linePoint-and-click iCommands Multiple ways to access Cyberduck Discovery Environment iDrop Desktop

13 Discovery Environment  Simple upload/download for small files  Bulk upload files and folders (<10GB)  Import from URL (no size limit) Advantage + Disadvantage - Covers most upload/download sharing needs Some size/speed limitations

14 Cyberduck  Drag and drop files and folders  No size limit, file editing/previews  Easy Desktop functionality Advantage + Disadvantage - More like desktop file systems No permissions/metadata control

15 iDrop Desktop  Drag and drop files and folders  No size limit  Synchronize folders with Data Store Advantage + Disadvantage - Upload/download large file sizes and numbers of files Sharing and permission features more complex

16 iCommands  Full flexibility  Ability to script and automate  Access from terminal/server Advantage + Disadvantage - Customizability Requires some command line expertise

17 The iPlant Collaborative Getting Started Cyberduck and iCommands Demo

18 The iPlant Collaborative Getting Started Gene Annotation with Maker

19 Available on Atmosphere and DE (Lonestar)

20 Quick Review What Are Annotations? Annotations are descriptions of features of the genome Structural: exons, introns, UTRs, splice forms etc. Coding & non-coding genes Expression, repeats, transposons Annotations should include evidence trail Assists in quality control of genome annotations Examples of evidence supporting a structural annotation: Ab initio gene predictions ESTs Protein homology

21 Secondary Annotations Protein Domains InterPro Scan: combines many HMM databases GO and other ontologies Pathway mapping E.g. BioCyc Pathway tools

22 Particularly for plants… Challenges in Annotation Genomes are BIG Highly repetitive Many pseudogenes Assembly contamination Incomplete evidence No method is 100% accurate

23 Options for Protein-coding Gene Annotation Yandell & Ence. Nature Reviews Genetics 13, 329-342 (May 2012) | doi:10.1038/nrg3174

24 Typical Annotation Pipeline Contamination screening Repeat/TE masking Ab initio prediction Evidence alignment (cDNA, EST, RNA-seq, protein) Evidence-driven prediction Chooser/combiner Evaluation/filtering Manual curation

25 MAKER-P Automated Pipeline Ab initio prediction Evidence MPI-enabled to allow parallel operation on large compute clusters Collaboration with Yandell Lab Repeat Library

26 Generic Feature Format What is a GFF File?

27 MAKER-P at iPlant W559 - Annotation of the Lobolly Pine Megagenome—Jill Wegrzyn 20.15 Gb assembly—split into 40 jobs—216 CPU/job (8640 CPU total)—17 hours P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species 10 rice species (each w/12 chromosome pseudomolecules) 96 CPU per chromosome (1152 CPU total) ~ 2hr per genome 22,656 CPU cores on1,888 nodes GenomeAssembly Size (Mb) CPU Run Time Arabidopsis thalianaTAIR101206002:44 Arabidopsis thalianaTAIR1012015001:27 Zea maysRefGen_v2206721722:53 TACC Lonestar Supercomputer Campbell et al. Plant Physiology. December 4, 2013, DOI:10.1104/pp.113.230144 PAG 2014: Agave API

28 Keep asking: ask.iplantcollabortive.org

29 The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI-0735191).


Download ppt "The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Data Demo and MAKER-P."

Similar presentations


Ads by Google