MANAGING, SHARING, AND PUBLISHING DATA WITH THE CYVERSE DATA STORE CyVerse Focus Forum http://www.cyverse.org/blog/events/webinar-managing-sharing-and-publishing-data-cyverse-data-store Ramona Walls, Tony Edgin, Nirav Merchant Sep. 15, 2017
Topics
  Overview of the CyVerse Data Store
  Uploading and downloading data
  Managing data
  Publishing data
  A future Focus Forum will cover accessing the Data Store via API, iRODS federation, and content delivery.
Uploading and downloading data to CyVerse using the Discovery Environment (DE, our scientific analysis and data management web interface), iCommands (a command line tool), CyberDuck (an open source desktop client), and FUSE (an open source tool for viewing and editing cloud-based file systems that works on Mac or Linux). We'll also cover working with data in Atmosphere virtual machines and connecting data to genome browsers.
Strategies and best practices for managing data using CyVerse tools for sharing and organizing data, including metadata in the DE.
How to publish data using CyVerse, including publishing sequence data to NCBI, how to request a DOI in the CyVerse Data Commons, and how to create a Community Released Data folder in the Data Commons, so others can find and re-use your data.
[Figure: CyVerse platform components: BisQue, Discovery Environment, Data Store, Data Commons, Atmosphere]
CyVerse Data Store
  ~2.5 PB of data
  ~90 million files
  Growing at about 600 GB/day
  Built on the open source iRODS platform
Moving data in and out of CyVerse
  Uploading and downloading data to CyVerse using:
    Discovery Environment (DE, web interface)
    CyberDuck (desktop client)
    iCommands (command line)
    Atmosphere (virtual machines)
    FUSE (cloud-based file system), not recommended for most uses
  Download from the Data Commons
Discovery Environment
  Home page: https://de.cyverse.org/de/
  Manual: https://wiki.cyverse.org/wiki/display/DEmanual/Table+of+Contents
  Managing data: https://wiki.cyverse.org/wiki/display/DEmanual/Managing+Data+Files+and+Folders
DE - uploads
  Simple upload: up to 5 files, each <1.9 GB
  Bulk upload: use another method
  Import from URL
    Example: ftp://ftp.gramene.org/pub/gramene/archives/PAST_RELEASES/release39/data/fasta/brassica_rapa/cdna/README
    For password-protected sites, you can include the username and password in the URL, but this is not recommended: ftp://username:password@hostname/$URL
      The URL being opened may be determinable by other users on the same machine on which you are browsing (as from a command line).
      The URL retrieved from the remote machine may be logged in some non-secure place on the remote machine.
      Your browser history would then also contain a copy of your password.
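The warning about embedded credentials can be checked programmatically before a URL is pasted into an import dialog. The following is a minimal sketch using only the Python standard library; the function names and the example path `/pub/data/README` are illustrative assumptions, not part of any CyVerse tool.

```python
from urllib.parse import urlsplit, urlunsplit


def has_embedded_credentials(url: str) -> bool:
    """Return True if the URL carries a username or password."""
    parts = urlsplit(url)
    return parts.username is not None or parts.password is not None


def strip_credentials(url: str) -> str:
    """Return the URL without any 'user:password@' part.

    Note: urlsplit lowercases the host name, which is harmless for DNS names.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Hypothetical URL in the same shape as the slide's template.
    risky = "ftp://username:password@hostname/pub/data/README"
    print(has_embedded_credentials(risky))
    print(strip_credentials(risky))
```

A check like this only flags the problem; the data still has to be fetched some other way, for example by downloading it locally and uploading it with one of the other methods.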
DE - downloads
  Simple download: up to 5 files, each <1.9 GB
  Bulk download: use another method
CyberDuck
  For Mac or Windows users
  Not developed by CyVerse, but works with iRODS
  We recommend using the latest version.
  Installation: see instructions at https://wiki.cyverse.org/wiki/display/DS/Using+Cyberduck+for+Uploading+and+Downloading+to+the+Data+Store
  Configuration
    Download the configuration file
    Enter connection details: keep the defaults, add your user name and password
    Choose "Open multiple connections"
    Can store multiple connections
Using CyberDuck
  Upload from your computer
  Download to your computer
  Anonymous data access, for public data on CyVerse
    Do not attempt to browse to iplant/home or iplant/! The large number of folders in these directories will cause CyberDuck to hang.
  Accessing shared data
  A paid version of CyberDuck that mounts the remote storage as a drive is available: Mountain Duck.
iCommands Command line access to iRODS https://wiki.cyverse.org/wiki/display/DS/Using+iCommands iCommands documentation for each command: https://docs.irods.org/4.2.0/icommands/user/
Using iCommands Logging in (iinit) Browsing (icd) Uploading (iput) Downloading (iget) Sharing/permissions (ichmod)
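As a rough illustration of the upload, download, and sharing commands listed above, here is a minimal Python sketch that builds iCommands argument lists and can hand them to `subprocess`. It assumes iCommands is installed and that `iinit` has already been run; the helper names and the example paths are hypothetical placeholders. Only the basic argument order and the `-r` (recursive) flag of `iput`, `iget`, and `ichmod` are used.

```python
import subprocess
from typing import List

# Permission levels accepted by ichmod.
PERMISSIONS = ("null", "read", "write", "own")


def iput_command(local_path: str, remote_path: str, recursive: bool = False) -> List[str]:
    """Build an upload command: iput [-r] <local> <remote>."""
    cmd = ["iput"]
    if recursive:
        cmd.append("-r")
    return cmd + [local_path, remote_path]


def iget_command(remote_path: str, local_path: str, recursive: bool = False) -> List[str]:
    """Build a download command: iget [-r] <remote> <local>."""
    cmd = ["iget"]
    if recursive:
        cmd.append("-r")
    return cmd + [remote_path, local_path]


def ichmod_command(permission: str, user: str, remote_path: str,
                   recursive: bool = False) -> List[str]:
    """Build a sharing command: ichmod [-r] <permission> <user> <remote>."""
    if permission not in PERMISSIONS:
        raise ValueError(f"permission must be one of {PERMISSIONS}")
    cmd = ["ichmod"]
    if recursive:
        cmd.append("-r")
    return cmd + [permission, user, remote_path]


def run(cmd: List[str]) -> None:
    """Execute a command; raises CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths and user names; replace with your own before running.
    print(iput_command("results", "/iplant/home/your_username/results", recursive=True))
    print(ichmod_command("read", "collaborator", "/iplant/home/your_username/results",
                         recursive=True))
```

Building the argument list separately from running it makes it easy to print and review a command before anything is transferred.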
Atmosphere Use iCommands Mount a volume – a virtual hard drive that you attach to one or more instances. New tool: kanki: https://github.com/ilarik/kanki-irodsclient
Atmosphere - using volumes
  https://wiki.cyverse.org/wiki/display/atmman/Using+Volumes
  Steps:
    Create the volume (as part of a project)
    Click on the volume and attach it to an instance
    Grant users of the image access to the volume
    Save data generated on Atmosphere to your volume
    When finished, back up and detach the volume
  Before detaching a volume, be sure to back up your data to the Data Store!
  Data can be restored to the same or another instance
  https://wiki.cyverse.org/wiki/display/atmman/Backing+Up+and+Restoring+Your+Data+to+the+Data+Store
FUSE (Filesystem in Userspace)
  https://wiki.cyverse.org/wiki/display/DS/Using+FUSE+to+Mount+the+CyVerse+Data+Store
  Mounts a Data Store directory to a local directory.
  For most use cases, other methods are more efficient.
Download public data from the Data Commons
  http://datacommons.cyverse.org/
  Files <2 GB can be downloaded directly
  For larger files, use one of the methods described above
  Change any data browsing URL to a direct link to the data by substituting "download" for "browse":
    http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/VertNet_Traits/ReadMe.txt
    >>>
    http://datacommons.cyverse.org/download/iplant/home/shared/commons_repo/curated/VertNet_Traits/ReadMe.txt
  OR use the DE data service:
    https://de.cyverse.org/anon-files/iplant/home/shared/commons_repo/curated/VertNet_Traits/ReadMe.txt
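The "browse" to "download" substitution is easy to script when many links need converting. This is a minimal sketch using the Python standard library; the function name `browse_to_download` is an illustrative assumption, and it rewrites only the leading path segment so that a folder or file that happens to contain the word "browse" is left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def browse_to_download(url: str) -> str:
    """Turn a Data Commons browsing URL into a direct download URL."""
    parts = urlsplit(url)
    prefix = "/browse/"
    if not parts.path.startswith(prefix):
        raise ValueError(f"not a browse URL: {url}")
    # Swap only the first path segment; keep the rest of the path untouched.
    new_path = "/download/" + parts.path[len(prefix):]
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


if __name__ == "__main__":
    browse_url = (
        "http://datacommons.cyverse.org/browse/iplant/home/shared/"
        "commons_repo/curated/VertNet_Traits/ReadMe.txt"
    )
    print(browse_to_download(browse_url))
```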
Special topic: connecting data to genome browsers
  https://wiki.cyverse.org/wiki/display/DEmanual/Viewing+Genome+Files+in+a+Genome+Browser
  File types: bam, vcf, gff, gtf, bed, bigBed, and bigWig
  Browsers: Ensembl, UCSC, IGV, GBrowse, JBrowse, and WashU EpiGenome Browser
  Bam and vcf files require a matching index file (bam.bai or vcf.vci)
  Gff, gtf, bed, bigBed, and bigWig files require that the name of the reference genome's fasta header match the gene name in the genome file.
  Files must be tagged with the correct info type
  Fasta info type files can also be viewed in CoGe (https://genomevolution.org/)
  If you have an issue, you may need to change https:// to http:// in the URL
  Demo:
    use iCommands to copy the genome file to my directory (the bai file is already copied, because it is large)
    view the info type
    send to browser; notice that this creates a public link
Common problems with data transfer
  University firewalls block access for CyberDuck or iCommands
    Contact CyVerse support and your university's IT department
  Uploading 1000s of files at one time
    Bundle them up before upload using the tar command
    The ibun command extracts the tar file in place on the Data Store
    ibun can also be used to bundle files within the Data Store
  Time out for very large files
    Usually a network error
  Other random problems
    Make sure the name is unix friendly! (no spaces or special characters)
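Two of the fixes above, bundling many files into one tar archive and checking for unix-friendly names, can be sketched in Python with the standard library. The helper names are hypothetical, and the definition of "unix friendly" used here (letters, digits, dot, underscore, hyphen) is an assumption that is stricter than what the file system actually requires.

```python
import re
import tarfile
from pathlib import Path
from typing import List

# Assumed definition of "unix friendly": letters, digits, '.', '_' and '-'.
_UNIX_FRIENDLY = re.compile(r"[A-Za-z0-9._-]+")


def is_unix_friendly(name: str) -> bool:
    """Return True if a single file or folder name has no spaces or special characters."""
    return _UNIX_FRIENDLY.fullmatch(name) is not None


def unfriendly_names(directory: Path) -> List[str]:
    """List paths (relative to directory) whose final component is not unix friendly."""
    return sorted(
        str(path.relative_to(directory))
        for path in directory.rglob("*")
        if not is_unix_friendly(path.name)
    )


def bundle_directory(directory: Path, archive: Path) -> Path:
    """Write an uncompressed tar archive of directory.

    Keep the archive outside the directory being bundled, so that the
    archive does not try to include itself.
    """
    with tarfile.open(archive, "w") as tar:
        tar.add(directory, arcname=directory.name)
    return archive
```

After uploading the single archive with any of the transfer methods, it can be unpacked on the Data Store with ibun, as noted in the list above.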
Publishing Data Make your data FAIR Publish sequence data to NCBI Request a DOI in the CyVerse Data Commons Create a Community Released Data folder in the Data Commons
[Figure: spectrum of data sharing, from private/single user to public/many users: default data allocation, lab group on PI's allocation, Community Folder, Public Folder, Published to Repository]
Good metadata is key! Follow relevant data and metadata standards Use open source formats Available via web services, or directly from a URL
Publish Data to NCBI
  SRA (Sequence Read Archive) for raw sequences and alignments (NGS data): https://goo.gl/163Z9L
  WGS (Whole Genome Shotgun) for incomplete assemblies: https://goo.gl/9mJb3N
  Tutorials walk you through creating a submission package, including BioProject, BioSamples, and data
Request a DOI
  Is the CyVerse Data Commons right for you?
  CyVerse Curated data are:
    Stable
    "Permanent"
    linked to a permanent identifier (DOI or ARK)
    managed by CyVerse staff
    described using DataCite metadata, plus scientific metadata
  DOI: Digital Object Identifier
    Can be used to cite your data
    Points to the dataset landing page, even if the data moves
  Go to DC home page and show link
  See http://datacommons.cyverse.org/
Community Released Data Folders
  Community Released data are:
    managed by community members
    publicly available
    possibly evolving
    not permanent
    described using Dublin Core metadata
    scientific metadata recommended
  See http://datacommons.cyverse.org/
Data Management Tips Strategies and best practices for managing data using CyVerse tools: Sharing data Organizing data Using metadata in the DE.
Sharing Data
  You can share any file or folder with any CyVerse user
    You don't need to know their user name
    Grant read, write, or own permission
  Create a public link to a file, to share with non-CyVerse users
  https://wiki.cyverse.org/wiki/display/DEmanual/Sharing+Data+Files+and+Folders
  To share using iCommands, use ichmod
  Demo sharing in DE
Organizing Data
  https://pods.iplantcollaborative.org/wiki/display/DC/Using+CyVerse+for+a+Shared+Project
  Coming soon: teams
  Use search to create "smart folders" from metadata
  https://www.dataone.org/best-practices
  Using CyVerse in your data management plans:
    4. Data dissemination
    5. Policies for data sharing, public access, and re-use
    6. Plans for archiving data, samples, software, and other research products
Using metadata in the DE
  Add and edit metadata
  Apply a metadata template (needed to publish data)
  Copy metadata from one object to another
  Apply metadata in bulk; video tutorial: https://goo.gl/7EmhP9
CyVerse is supported by the National Science Foundation under Grants No. DBI-0735191 and DBI-1265383.