Download presentation
Presentation is loading. Please wait.
Published byChad Morgan Modified over 9 years ago
1
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Data Demo and MAKER-P
2
The iPlant Collaborative Getting Started Getting Data into iPlant
3
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set. - Wikipedia - (http://en.wikipedia.org/wiki/Big_data) Challenges: the scope and scale of life sciences data continue to grow Working with Big Data
4
Challenges (sequencing example): data generation is cheaper and faster Working with Big Data
5
Biologists work with and require access to diverse data types Working with Big Data Challenges: biology encompasses more than sequence data Advanced ImagingGeospatialNetwork
6
Challenges: changes in data require changes in tools Working with Big Data Changes in scale introduce quantitative and qualitative complications Difficult/slow transfers Expense for storage/backup Difficult to share and publish Analysis Metadata (what is metadata?)
7
The Data Store services all iPlant platforms iPlant Data Store Overview Access your data from multiple iPlant services Automatic backup (redundant between University of Arizona and University of Texas) Default 100GB allocation, >1TB allocations available with justification
8
iRODS (integrated Rule-Oriented Data System) is an established, scalable, open-source data management system iRODS supports many data intensive projects iRODS abstracts data services from data storage to facilitate executing services across heterogeneous, distributed storage systems Avoid reinventing the wheel iPlant Data Store Overview Critical for effective data management Works under the hood Folder = Collection
9
Benefits Get Science Done Reproducibility Productivity Store any type of files related to your research An evolving “Data Commons” lets you access important datasets Metadata captures information needed for reproducibility Automatic backup and accessibility support your project’s data management plan IRODS makes high-speed transfers possible (100GB in ~30min)* Share data instantly with collaborators within iPlant iPlant Data Store Overview
10
Texas Replication Arizona Key component of your data management Worry Free! Some important things we will not “see” in the demo SourceDestination Copy Method Time (seconds) CDMy Computercp320 Berkeley ServerMy Computerscp150 External DriveMy Computercp36 USB2.0 FlashMy Computercp30 iPlant Data StoreMyComputeriget18 My Computer cp15 Close to optimum conditions; transfer between Univ. of Arizona and UC Berkeley 100GB: 29m15s, 1 GB / 17.5 seconds Data TransfersData Backups
11
iPlant Data Store Overview Some important things we will not “see” in the demo http://www.speedtest.net/ Local connections and institutional policies limit data transfers
12
iPlant Data Store Command linePoint-and-click iCommands Multiple ways to access Cyberduck Discovery Environment iDrop Desktop
13
Discovery Environment Simple upload/download for small files Bulk upload files and folders (<10GB) Import from URL (no size limit) Advantage + Disadvantage - Covers most upload/download sharing needs Some size/speed limitations
14
Cyberduck Drag and drop files and folders No size limit, file editing/previews Easy Desktop functionality Advantage + Disadvantage - More like desktop file systems No permissions/metadata control
15
iDrop Desktop Drag and drop files and folders No size limit Synchronize folders with Data Store Advantage + Disadvantage - Upload/download large file sizes and numbers of files Sharing and permission features more complex
16
iCommands Full flexibility Ability to script and automate Access from terminal/server Advantage + Disadvantage - Customizability Requires some command line expertise
17
The iPlant Collaborative Getting Started Cyberduck and iCommands Demo
18
The iPlant Collaborative Getting Started Gene Annotation with Maker
19
Available on Atmosphere and DE (Lonestar)
20
Quick Review What Are Annotations? Annotations are descriptions of features of the genome Structural: exons, introns, UTRs, splice forms etc. Coding & non-coding genes Expression, repeats, transposons Annotations should include evidence trail Assists in quality control of genome annotations Examples of evidence supporting a structural annotation: Ab initio gene predictions ESTs Protein homology
21
Secondary Annotations Protein Domains InterPro Scan: combines many HMM databases GO and other ontologies Pathway mapping E.g. BioCyc Pathway tools
22
Particularly for plants… Challenges in Annotation Genomes are BIG Highly repetitive Many pseudogenes Assembly contamination Incomplete evidence No method is 100% accurate
23
Options for Protein-coding Gene Annotation Yandell & Ence. Nature Reviews Genetics 13, 329-342 (May 2012) | doi:10.1038/nrg3174
24
Typical Annotation Pipeline Contamination screening Repeat/TE masking Ab initio prediction Evidence alignment (cDNA, EST, RNA-seq, protein) Evidence-driven prediction Chooser/combiner Evaluation/filtering Manual curation
25
MAKER-P Automated Pipeline Ab initio prediction Evidence MPI-enabled to allow parallel operation on large compute clusters Collaboration with Yandell Lab Repeat Library
26
Generic Feature Format What is a GFF File?
27
MAKER-P at iPlant W559 - Annotation of the Lobolly Pine Megagenome—Jill Wegrzyn 20.15 Gb assembly—split into 40 jobs—216 CPU/job (8640 CPU total)—17 hours P157 - Disease Resistance Gene Analysis on Chromosome 11 Across Ten Oryza Species 10 rice species (each w/12 chromosome pseudomolecules) 96 CPU per chromosome (1152 CPU total) ~ 2hr per genome 22,656 CPU cores on1,888 nodes GenomeAssembly Size (Mb) CPU Run Time Arabidopsis thalianaTAIR101206002:44 Arabidopsis thalianaTAIR1012015001:27 Zea maysRefGen_v2206721722:53 TACC Lonestar Supercomputer Campbell et al. Plant Physiology. December 4, 2013, DOI:10.1104/pp.113.230144 PAG 2014: Agave API
28
Keep asking: ask.iplantcollabortive.org
29
The iPlant Collaborative is funded by a grant from the National Science Foundation Plant Cyberinfrastructure Program (#DBI-0735191).
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.