Published by Wendy Riley. Modified over 6 years ago.
1
MANAGING, SHARING, AND PUBLISHING DATA WITH THE CYVERSE DATA STORE
CyVerse Focus Forum Ramona Walls, Tony Edgin, Nirav Merchant Sep. 15, 2017
2
Topics
Overview of the CyVerse Data Store
Uploading and downloading data
Managing data
Publishing data
A future Focus Forum will cover accessing the Data Store via API, iRODS federation, and content delivery.
Uploading and downloading data to CyVerse using the Discovery Environment (DE, our scientific analysis and data management web interface), iCommands (a command line tool), CyberDuck (an open source desktop client), and FUSE (an open source tool for viewing and editing cloud-based file systems that works on Mac or Linux). We’ll cover working with data in Atmosphere virtual machines and connecting data to genome browsers.
Strategies and best practices for managing data using CyVerse tools for sharing and organizing data, including metadata in the DE.
How to publish data using CyVerse, including publishing sequence data to NCBI, how to request a DOI in the CyVerse Data Commons, and how to create a Community Released Data folder in the Data Commons, so others can find and re-use your data.
3
BisQue Discovery Environment Data Store Data commons Atmosphere
4
CyVerse Data Store
~2.5 PB of data
~90 million files
Growing at about 600 GB/day
Built on the open source iRODS platform
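To put the figures on this slide in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 PB = 1,000,000 GB) and a constant growth rate, neither of which is stated on the slide, so treat the results as order-of-magnitude only.

```python
# Rough arithmetic on the figures quoted on the slide.
# Assumptions (not from the slide): decimal units and a constant daily growth rate.
GB_PER_PB = 1_000_000
MB_PER_GB = 1_000

total_gb = 2.5 * GB_PER_PB      # ~2.5 PB stored
file_count = 90_000_000         # ~90 million files
growth_gb_per_day = 600         # ~600 GB/day


def average_file_size_mb(total_gb, file_count):
    """Mean file size in MB for a given total volume and file count."""
    return total_gb * MB_PER_GB / file_count


def days_to_add(extra_gb, growth_gb_per_day):
    """Days needed to add `extra_gb` of data at a constant daily rate."""
    return extra_gb / growth_gb_per_day


print(round(average_file_size_mb(total_gb, file_count), 1))  # 27.8 (MB per file)
print(round(days_to_add(total_gb, growth_gb_per_day)))       # 4167 (days to add another 2.5 PB)
```

At a constant 600 GB/day, adding another 2.5 PB would take roughly 11 years; real growth is unlikely to stay constant.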
5
Moving data in and out of CyVerse
Uploading and downloading data to CyVerse using:
  Discovery Environment (DE, web interface)
  CyberDuck (desktop client)
  iCommands (command line)
  Atmosphere (virtual machines)
  FUSE (cloud-based file system) – not recommended for most uses
Download from the Data Commons
6
Discovery Environment
Home page: Manual: of+Contents ing+Data+Files+and+Folders
7
DE - uploads
Simple upload – up to 5 files, each <1.9 GB
Bulk upload – use another method
Import from URL
  example: ftp://ftp.gramene.org/pub/gramene/archives/PAST_RELEASES/release39/data/fasta/brassica_rapa/cdna/README
  For password-protected sites, you can include the username and password in the URL, but this is not recommended:
    The URL being opened may be determinable by other users on the same machine on which you are browsing (as from a command line).
    The URL retrieved from the remote machine may be logged in some non-secure place on the remote machine.
    Your browser history would then also contain a copy of your password.
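As an illustration of the warning about passwords in URLs, the sketch below uses Python's standard `urllib.parse` to detect and strip credentials embedded in a URL before it is pasted or logged anywhere. The function names and the `ftp.example.org` address are made up for this example; nothing here is part of the DE itself.

```python
from urllib.parse import urlsplit, urlunsplit


def has_embedded_credentials(url: str) -> bool:
    """Return True if the URL contains a user name or password (user:pass@host)."""
    parts = urlsplit(url)
    return parts.username is not None or parts.password is not None


def strip_credentials(url: str) -> str:
    """Return the URL with any user:pass@ prefix removed from the host part.

    Note: this simple version does not handle bracketed IPv6 hosts.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# Hypothetical URL with credentials embedded in it
risky = "ftp://user:secret@ftp.example.org/pub/README"
print(has_embedded_credentials(risky))  # True
print(strip_credentials(risky))         # ftp://ftp.example.org/pub/README
```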
8
DE - downloads
Simple download – up to 5 files, each <1.9 GB
Bulk download – use another method
9
CyberDuck
For Mac or Windows users
Not developed by CyVerse, works with iRODS
Recommend using the latest version.
Installation - see instructions at: Uploading+and+Downloading+to+the+Data+Store
Configuration
  Download the configuration file
  Enter connection details – keep defaults, add user name and password
  Choose “Open multiple connections”
  Can store multiple connections
10
Using CyberDuck
Upload from your computer
Download to your computer
Anonymous data access – for public data on CyVerse
  Do not attempt to browse to iplant/home or iplant/! The large number of folders in these directories will cause CyberDuck to hang.
Accessing shared data
A paid, mounted version of CyberDuck is available – Mountain Duck.
11
iCommands Command line access to iRODS
iCommands documentation for each command:
12
Using iCommands
Logging in (iinit)
Browsing (icd)
Uploading (iput)
Downloading (iget)
Sharing/permissions (ichmod)
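A minimal sketch of what a login, browse, upload, and download session might look like. It only builds and prints the command lines rather than running them, so it works without iCommands installed. The user name `myuser` and the `results` directory are hypothetical placeholders, and the helper functions are invented for this example; `-r` is the recursive option of `iput` and `iget`.

```python
import shlex

# Hypothetical placeholder; replace with your own CyVerse user name.
USERNAME = "myuser"
HOME = f"/iplant/home/{USERNAME}"


def iput_cmd(local_path, collection, recursive=False):
    """Build an iput (upload) command as an argument list."""
    cmd = ["iput"]
    if recursive:
        cmd.append("-r")  # -r uploads a directory and its contents
    return cmd + [local_path, collection]


def iget_cmd(remote_path, local_path, recursive=False):
    """Build an iget (download) command as an argument list."""
    cmd = ["iget"]
    if recursive:
        cmd.append("-r")  # -r downloads a collection and its contents
    return cmd + [remote_path, local_path]


def example_session(home=HOME):
    """Commands for a typical session, in the order listed on the slide."""
    return [
        ["iinit"],                                   # log in (prompts for details)
        ["icd", home],                               # move to your home collection
        iput_cmd("results", home, recursive=True),   # upload a local directory
        iget_cmd(f"{home}/results", "results_copy", recursive=True),  # download it
    ]


if __name__ == "__main__":
    for cmd in example_session():
        print(shlex.join(cmd))
```

To actually run a command, each list could be passed to `subprocess.run(cmd, check=True)` on a machine where iCommands is installed and configured.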
13
Atmosphere
Use iCommands
Mount a volume – a virtual hard drive that you attach to one or more instances.
New tool: kanki:
14
Atmosphere – using volumes
Steps:
1. Create the volume (as part of a project)
2. Click on the volume and attach it to an instance
3. Grant users of the image access to the volume
4. Save data generated on Atmo to your volume
5. When finished, back up and detach the volume
Before detaching a volume, be sure to back up your data to the Data Store! Data can be restored to the same or another instance.
ng+Your+Data+to+the+Data+Store
15
FUSE (Filesystem in Userspace)
unt+the+CyVerse+Data+Store
Mounts a Data Store directory to a local directory.
For most use cases, other methods are more efficient.
16
Download public data from the Data Commons
Files <2GB can be downloaded directly
For larger files, use one of the methods described above
Change any data browsing URL to a direct link to the data by substituting “download” for “browse”:
  ated/VertNet_Traits/ReadMe.txt >>> curated/VertNet_Traits/ReadMe.txt
OR use the DE data service:
  files/iplant/home/shared/commons_repo/curated/VertNet_Traits/ReadMe.txt
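The “browse” to “download” substitution can be sketched in a few lines of Python. The host name `datacommons.example.org` is a hypothetical stand-in (the real URLs are truncated in this transcript), and the function name is invented for this example. Only the path segment that is exactly `browse` is replaced, so file names that merely contain the word are left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def browse_to_download(url: str) -> str:
    """Turn a data browsing URL into a direct download link.

    Replaces the first path segment that is exactly "browse" with "download".
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    if "browse" not in segments:
        raise ValueError(f"no 'browse' segment in URL path: {url}")
    segments[segments.index("browse")] = "download"
    return urlunsplit(parts._replace(path="/".join(segments)))


# Hypothetical host; only the path layout follows the slide.
browse_url = (
    "https://datacommons.example.org/browse/iplant/home/shared/"
    "commons_repo/curated/VertNet_Traits/ReadMe.txt"
)
print(browse_to_download(browse_url))
```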
17
Special topic: connecting data to genome browsers
n+a+Genome+Browser
File types: bam, vcf, gff, gtf, bed, bigBed, and bigWig
Browsers: Ensembl, UCSC, IGV, GBrowse, JBrowse, and WashU EpiGenome Browser
Bam and vcf files require a matching index file (bam.bai or vcf.vci)
Gff, gtf, bed, bigBed, and bigWig files require that the name of the reference genome's fasta header match the gene name in the genome file.
Files must be tagged with the correct info type
Fasta infotype files can also be viewed in CoGe
If you have an issue, you may need to change to in the URL
Demo:
  use iCommands to copy genome file to my dir (bai file is already copied, because it is large)
  view the info type
  send to browser – notice that this creates a public link
18
Common problems with data transfer
University firewalls block access for CyberDuck or iCommands
  Contact CyVerse support and your university’s IT department
Uploading 1000s of files at one time
  bundle them up before upload using the tar command
  the ibun command extracts the tar file in place on the Data Store
  ibun can also be used to bundle files within the Data Store
Time out for very large files
  Usually a network error
Other random problems
  Make sure the name is unix friendly! (no spaces or special chars)
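A minimal sketch of the “bundle before upload” and “unix friendly names” advice, using only the Python standard library. The definition of “unix friendly” used here (letters, digits, dot, underscore, hyphen) is an assumption made for this example, not an official CyVerse rule, and the function names are invented.

```python
import re
import tarfile
from pathlib import Path

# Assumption: "unix friendly" means only letters, digits, '.', '_' and '-'.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def is_unix_friendly(name: str) -> bool:
    """True if the name has no spaces or special characters."""
    return bool(name) and _UNSAFE.search(name) is None


def make_unix_friendly(name: str) -> str:
    """Replace each space or special character with an underscore."""
    return _UNSAFE.sub("_", name)


def bundle_directory(src_dir, tar_path) -> Path:
    """Bundle a directory into one uncompressed tar file for upload.

    Refuses to bundle if any file or folder name inside is not unix friendly.
    """
    src = Path(src_dir)
    unfriendly = [p.name for p in src.rglob("*") if not is_unix_friendly(p.name)]
    if unfriendly:
        raise ValueError(f"rename before upload: {unfriendly}")
    with tarfile.open(str(tar_path), "w") as tar:
        tar.add(str(src), arcname=src.name)
    return Path(tar_path)


print(is_unix_friendly("my file (1).txt"))    # False
print(make_unix_friendly("my file (1).txt"))  # my_file__1_.txt
```

After a bundle such as `reads.tar` is uploaded with `iput`, something like `ibun -x reads.tar reads` (paths hypothetical) would extract it in place on the Data Store; check the `ibun` help for the options available in your version.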
19
Publishing Data
Make your data FAIR
Publish sequence data to NCBI
Request a DOI in the CyVerse Data Commons
Create a Community Released Data folder in the Data Commons
20
[Diagram: data sharing spectrum, from Private/Single user to Public/Many users]
Default data allocation; Lab group on PI’s allocation; Community Folder; Public Folder; Published to Repository
21
Good metadata is key!
Follow relevant data and metadata standards
Use open source formats
Available via web services, or directly from a URL
22
Publish Data to NCBI
SRA (Sequence Read Archive) for raw sequences and alignments (NGS data)
WGS (Whole Genome Shotgun) for incomplete assemblies
Tutorials walk you through creating a submission package, including BioProject, BioSamples, and data
23
Request a DOI
Is the CyVerse Data Commons right for you?
CyVerse Curated data are:
  Stable
  “Permanent”
  linked to a permanent identifier (DOI or ARK)
  managed by CyVerse staff
  described using DataCite metadata, plus scientific metadata
DOI – Digital Object Identifier
  Can be used to cite your data
  Points to the dataset landing page, even if the data moves
Go to DC home page and show link
See
24
Community Released Data Folders
Community Released data are:
  managed by community members
  publicly available
  possibly evolving
  not permanent
  described using Dublin Core metadata
  scientific metadata recommended
See
25
Data Management Tips
Strategies and best practices for managing data using CyVerse tools:
  Sharing data
  Organizing data
  Using metadata in the DE
26
Sharing Data
Can share any file or folder with any CyVerse user
  Don’t need to know their user name
  Grant read, write, or own permission
Create a public link to a file, to share with non-CyVerse users
les+and+Folders
To share using iCommands, use ichmod
Demo sharing in DE
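For sharing from the command line, a small sketch of how an `ichmod` call is put together. Unlike sharing in the DE, `ichmod` takes the other person's user name as an argument. The user name `collaborator` and the path are hypothetical, and the helper only builds the command rather than running it.

```python
import shlex

# Permission levels named on the slide.
PERMISSIONS = ("read", "write", "own")


def ichmod_cmd(permission, user, path, recursive=False):
    """Build an ichmod command granting `permission` on `path` to `user`."""
    if permission not in PERMISSIONS:
        raise ValueError(f"permission must be one of {PERMISSIONS}")
    cmd = ["ichmod"]
    if recursive:
        cmd.append("-r")  # -r applies the change to a folder and its contents
    return cmd + [permission, user, path]


# Hypothetical user name and path
cmd = ichmod_cmd("read", "collaborator", "/iplant/home/myuser/results", recursive=True)
print(shlex.join(cmd))  # ichmod -r read collaborator /iplant/home/myuser/results
```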
27
Organizing Data
rse+for+a+Shared+Project
Coming soon: teams
Use search to create “smart folders” from metadata
Using CyVerse in your data management plans:
  4. Data dissemination
  5. Policies for data sharing, public access, and re-use.
  6. Plans for archiving data, samples, software, and other research products.
28
Using metadata in the DE
Add and edit metadata
Apply a metadata template – need to publish data
Copy metadata from one object to another
Apply metadata in bulk – video tutorial:
29
CyVerse is supported by the National Science Foundation under Grants No. DBI and DBI