Issues with creating Genome Browsers for Whole Genome Assemblies
G-OnRamp Beta Users Workshop, Wilson Leung, 06/2017

Outline
Obtain genome assemblies from NCBI
Transfer large genomics datasets to Galaxy
Obtain RNA-Seq data from NCBI SRA
Identifying and masking repeats
Obtain protein sequences for tblastn searches
Obtain RNA GenBank files for translated BLAT searches

Types of evidence tracks on a Genome Browser
Protein alignments (SPALN)
Geneid
N-SCAN
PASA-EST
Augustus (with RNA-Seq)
RNA PolII ChIP-Seq (MACS2)
RNA-Seq Coverage
TopHat junctions
StringTie + TransDecoder
RepeatMasker

Obtaining the genome assembly from NCBI BioProject
Use case: create a genome browser for closely related species as reference
Entry point to all genomic datasets (e.g., genome assembly, transcriptome) that pertain to a study

Data from the 1000 Genomes Project available through NCBI BioProject
SRA = Sequence Read Archive
Database of high-throughput sequencing data
Data available through NCBI, EBI, and DDBJ

Access genome assemblies from the NCBI Assembly database
Use case: create a genome browser for closely related species as reference
Download data files for GenBank and RefSeq whole genome assemblies

Types of genome assemblies
RefSeq categories:
  Reference genome
    High quality assembly
    Standard for comparison: other data are compared against this standard
    Example: D. melanogaster
  Representative genome
    Best genome assembly available within a clade
    Example: D. miranda

Obtain genome assemblies from the NCBI FTP site
Download genome sequence, predicted transcript and protein products
Consistent primary sequence IDs (accession.version) for both GFF and FASTA files

Naming conventions for GenBank assemblies
<accession.version>_<assembly name>_<content type>.<format>
Content type   Description
genomic        Genome assembly (repeats identified by WindowMasker are in lower-case)
rm             Transposons identified by RepeatMasker (eukaryotes only)
Other content types: _protein and _rna
The .run file describes the parameters used in the RepeatMasker analysis
See the README.txt file within the directory for details
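
The naming convention above can be unpacked programmatically. The following is a minimal Python sketch, assuming the accession starts with GCA_ or GCF_ and that the content type is the last underscore-separated token before the first dot of the format. The helper name parse_assembly_filename and the example file name (with a made-up accession and assembly name) are hypothetical, and content types that themselves contain underscores are not handled.

```python
import re

# Hypothetical helper: split an NCBI assembly file name into its parts.
# Assumes <accession.version>_<assembly name>_<content type>.<format>
_PATTERN = re.compile(
    r"^(?P<accession>GC[AF]_\d+\.\d+)_"   # accession.version
    r"(?P<assembly>.+)_"                  # assembly name (may contain underscores)
    r"(?P<content>[A-Za-z]+)\."           # content type: genomic, rm, protein, rna
    r"(?P<format>.+)$"                    # format, e.g. fna.gz or gbff.gz
)

def parse_assembly_filename(name):
    """Return a dict with accession, assembly, content and format."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return match.groupdict()

# Made-up file name used only for illustration
parts = parse_assembly_filename("GCA_000000000.1_ExampleAsm_1.0_genomic.fna.gz")
```

Here parts["content"] is "genomic" and parts["format"] is "fna.gz", which is enough to pick out the genome FASTA file from a directory listing.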

Common data formats used by GenBank assemblies
Large data files are compressed by gzip
  File suffix = .gz
  Supported by Galaxy
  Built-in support in macOS
  Use 7-Zip on MS Windows
Format   Description
fna      Nucleotide sequence in FASTA format
faa      Protein sequence in FASTA format
gbff     GenBank flat file format
gff      General Feature Format Version 3
GFF3 file is
Genome assembly in FASTA format: _genomic.fna.gz
Example: GCA_ _DroMir_2.2_genomic.fna.gz
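
Gzip-compressed FASTA files do not have to be decompressed before they are inspected. As a rough sketch, the Python standard library can read a .fna.gz file directly; the function name fasta_lengths and the tiny demo file written below are made up for illustration.

```python
import gzip
import os
import tempfile

def fasta_lengths(path):
    """Return {sequence ID: length} for a gzip-compressed FASTA file."""
    lengths = {}
    current = None
    with gzip.open(path, "rt") as handle:   # "rt" = read as text
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # The ID is the first word of the header line
                current = line[1:].split()[0]
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths

# Tiny made-up example file, written only to demonstrate the function
demo_path = os.path.join(tempfile.mkdtemp(), "demo_genomic.fna.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(">scaffold_1 example\nACGTACGT\nacgt\n>scaffold_2\nNNNN\n")

demo_lengths = fasta_lengths(demo_path)
```

The same loop could be pointed at a downloaded _genomic.fna.gz file to list scaffold sizes before uploading the assembly.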

DEMO: Access the D. miranda genome assembly from the NCBI FTP site

Benefits of using FTP to transfer large files to Galaxy
Problems with standard file upload:
  Most servers have a 2 GB file upload size limit
  Cannot monitor progress of file upload
  Cannot resume interrupted file upload
Galaxy Main and G-OnRamp support FTP file upload
  Support transfer of large gzip, bzip2, and zip files
  G-OnRamp instances support FTP file uploads
  Galaxy Main =
Limitations of file upload can be addressed by HTML5 file upload; feature not yet available

Overview of the File Transfer Protocol (FTP)
Data transfer protocol between a client and a server
  May allow anonymous access
  Insecure connection
Partial built-in support in most operating systems
  macOS: Go ➜ Connect to Server
  MS Windows: File Explorer
Other graphical clients: Cyberduck, FileZilla, Fugu, …
RFC 114 released in 1971
Might need to switch to passive mode (PASV) if client is behind a firewall

Use FTP to upload files to Galaxy
Use an FTP client to initiate an FTP connection to Galaxy
  Galaxy Main FTP server: ftp://usegalaxy.org
  Use your Galaxy account credentials to authenticate
Transfer files to the Galaxy FTP server
Use the Upload File tool to import contents of the FTP directory into Galaxy
  Files available through the “Choose FTP file” button
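
The upload step can also be scripted instead of using a graphical client. The following is a minimal sketch with Python's standard ftplib module, assuming a plain FTP login with Galaxy account credentials; the function names are hypothetical, the host, user and password arguments are placeholders, and some servers may require ftplib.FTP_TLS instead of FTP.

```python
import os
from ftplib import FTP

def upload_with_progress(ftp, local_path, report=print):
    """Upload one file over an open FTP connection and report progress."""
    total = os.path.getsize(local_path)
    sent = 0

    def on_block(block):
        # ftplib calls this after each block of data is sent
        nonlocal sent
        sent += len(block)
        report(f"{sent}/{total} bytes")

    with open(local_path, "rb") as handle:
        ftp.storbinary(f"STOR {os.path.basename(local_path)}", handle,
                       callback=on_block)
    return sent

def upload_to_galaxy(host, user, password, local_path):
    # host, user and password are placeholders for your own Galaxy server
    # and account; nothing here is specific to a particular Galaxy instance.
    with FTP(host) as ftp:
        ftp.login(user, password)
        return upload_with_progress(ftp, local_path)
```

The callback gives the progress monitoring that a standard browser upload lacks. Once the transfer finishes, the file still has to be imported into a history through the Upload File tool.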

Directly transfer files from the NCBI FTP site to Galaxy
“Open Connection” to Galaxy Main in Cyberduck
  Server: usegalaxy.org
  Enter the username and password for your Galaxy account
File ➜ New Browser
Copy the FTP link to the GenBank assembly at NCBI
Paste link into the “Quick Connect” textbox and press “Enter”
Select and drag files from the NCBI connection window to the Galaxy connection window
Versions and has a regression that prevents direct file transfer
Compatible with version of Cyberduck

DEMO: Use FTP to upload the D. miranda genome assembly to Galaxy
Use Finder to upload file to Galaxy through the FXP protocol

Transfer high-throughput sequencing data from the SRA to Galaxy
Second and third generation sequencing data available through the Sequence Read Archive (SRA)
NCBI SRA stores sequencing data in sra format
  Use the SRA Toolkit to convert files to fastq (fastq-dump)
  Paired-end reads might split at the wrong position:
European Nucleotide Archive (ENA) at EBI
  SRA sequencing data in fastq format
  Import data into Galaxy using “Get Data” ➜ “EBI SRA”
Use sff-dump to recover the original sequencing files produced by 454
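
For users who convert sra files locally, the fastq-dump call can be assembled as follows. This is only a sketch: it assumes the SRA Toolkit is installed and on the PATH, the accession SRR0000000 is a placeholder, and the function names are hypothetical. The --split-files option asks fastq-dump to write each read of a spot to its own file, which is the usual choice for paired-end runs.

```python
import subprocess

def fastq_dump_command(accession, outdir=".", paired=True):
    """Build a fastq-dump command line for one SRA run accession."""
    command = ["fastq-dump", "--gzip", "--outdir", outdir]
    if paired:
        # Write the forward and reverse reads to separate files
        command.append("--split-files")
    command.append(accession)
    return command

def run_fastq_dump(accession, outdir="."):
    # Requires the SRA Toolkit to be installed and on the PATH
    subprocess.run(fastq_dump_command(accession, outdir), check=True)

# Placeholder accession, used only to show the resulting command
example = fastq_dump_command("SRR0000000", outdir="reads")
```

It is worth checking the first few records of the output, since the slide above notes that paired-end reads might be split at the wrong position.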

Goals of repeat analysis
Improve G-OnRamp workflow:
  Improve performance of tblastn and BLAT searches
  Reduce number of false positives in gene predictions
Survey of the repetitive contents of a genome:
  Estimate total repeat density
  Types and distributions of transposons
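
One crude way to estimate total repeat density is to count soft-masked bases, assuming repeats are written in lower-case as in the GenBank genomic FASTA files masked by WindowMasker. The sketch below works on plain sequence strings; the function name repeat_density and the toy sequences are made up for illustration.

```python
def repeat_density(sequences):
    """Fraction of bases that are lower-case (soft-masked) across sequences."""
    masked = 0
    total = 0
    for seq in sequences:
        masked += sum(1 for base in seq if base.islower())
        total += len(seq)
    return masked / total if total else 0.0

# Toy sequences: 8 of the 20 bases are lower-case
density = repeat_density(["ACGTacgtAC", "ggggTTTTTT"])
```

The estimate depends entirely on which tool produced the masking, so densities from different repeat finders are not directly comparable.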

Develop repeat pipeline to handle genome assemblies with different sizes and quality
Assembly sizes: 111 Mb - 2.8 Gb
Number of scaffolds: ,501

Strategies used to identify repeats in five genome assemblies
k-mer based: WindowMasker, Tallymer
tRNA derived SINEs: tRNAscan-SE
Structure based: LTRharvest + LTRdigest, TRF, TanTan
Conserved domains within transposons: transposonPSI
Species-specific repeat library:
  RepBase repeats from closely-related species (if available)
  RepeatScout
  MUMmer + PILER
  RepeatModeler
Repeat classification: RepeatClassifier
LTR retrotransposon contains a primer binding site (PBS) for reverse transcription (tRNA) near the 5’ LTR

Repeat tracks available on the G-OnRamp Assembly Hubs
WindowMasker
Tallymer
TRF
RepeatMasker
Nested repeats
LTRHarvest
TransposonPSI

Accurate repeat identification requires the use of multiple techniques
Figure: Arabidopsis thaliana repeatome, repeat libraries
Maumus F, Quesneville H. PLoS One Apr 7;9(4):e94101.

RepeatScout run time vs. genome size
Plot: run time (seconds) vs. genome size (Mb)
Takes ~1 hour to process 200 Mb genome
Takes ~5 days to process the A. vittata genome
Schaeffer CE et al. Bioinformatics Jun 15;32(12):i209-i215.

High memory requirement of k-mer based repeat finders
Plot: memory required (GB) vs. genome size (Mb)
RepeatScout: 7 GB of memory to process 200 Mb genome
Schaeffer CE et al. Bioinformatics Jun 15;32(12):i209-i215.

Partition genome assembly into smaller batches
Shuffle scaffolds in genome assembly
  Scaffolds in the original assembly are often ordered by size
Batch size optimization criteria:
  Avoids memory errors (i.e., segmentation faults)
  Can be processed in a “reasonable” amount of time
Batch size for RepeatScout and PILER: 100 Mb per batch
  Compare only within each batch
  Comparison across batches did not improve results
Random sample of 600 Mb for the X. laevis genome
  Determine batch size that can be completed in a “reasonable” amount of time
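
The shuffle-then-batch idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the G-OnRamp implementation: the function name make_batches, the fixed random seed, and the greedy rule for closing a batch are all choices made for this example. A scaffold longer than the batch size ends up alone in its own batch.

```python
import random

def make_batches(scaffold_lengths, batch_size=100_000_000, seed=0):
    """Shuffle scaffolds, then group them into batches of about batch_size bases.

    scaffold_lengths maps scaffold ID -> length in bases.
    """
    names = sorted(scaffold_lengths)       # fixed starting order
    random.Random(seed).shuffle(names)     # reproducible shuffle
    batches, current, current_size = [], [], 0
    for name in names:
        length = scaffold_lengths[name]
        # Start a new batch when adding this scaffold would exceed the limit
        if current and current_size + length > batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += length
    if current:
        batches.append(current)
    return batches

# Toy lengths (in bases) with a toy batch size of 100
toy_lengths = {"s1": 60, "s2": 50, "s3": 40, "s4": 30, "s5": 20}
toy_batches = make_batches(toy_lengths, batch_size=100)
```

Shuffling first matters because assemblies are often ordered by size, so unshuffled batches would group the largest scaffolds together and leave the last batches full of short fragments.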

Use tandem repeat masked genome assembly to improve performance
Some genomes (e.g., C. reinhardtii) contain a high density of tandem repeats
  Degrades performance of many repeat finding algorithms
  Results in a large number of spurious matches
RepeatModeler analysis of C. reinhardtii (111 Mb)
  Requires ~130 hours to process unmasked genome
  Requires ~90 hours to process tandem repeat masked genome
  Requires ~30 hours to process A. vittata genome (1.2 Gb)
RepeatModeler is not deterministic
Use tandem repeat masked assembly in the RepeatModeler and PILER analyses

Recent changes to RepeatMasker and RepeatModeler
New Dfam_consensus database:
  Creative Commons CC0 1.0 public domain license
Support searches using profile Hidden Markov Models
  HMMER + Dfam

Obtain protein sequences for tblastn searches
Species-specific databases
  FlyBase: dmel-all-translation-r6.15.fasta.gz
Swiss-Prot
  High quality, manually annotated section of UniProtKB
NCBI RefSeq
  Use only curated RefSeq records (accession prefix = NP_)
  Protein sequences from RefSeq reference genomes

Misannotations in public databases
Figure: average % misannotation in public databases; color circles correspond to individual families within the superfamily
Legend (# sequences in family): > 50, 11-50, ≤ 10, X None
Swiss-Prot has the fewest sequences but is much more accurate
Schnoes AM, et al. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol Dec;5(12):e

Obtain Swiss-Prot protein sequences
UniProt download page:
  Entire Swiss-Prot database
  Swiss-Prot sequences separated by taxonomic divisions
    Human, invertebrates, mammals, plants, rodents, vertebrates, …
  Download files with the uniprot_sprot prefix
  Use the seqret EMBOSS tool in Galaxy to create FASTA file
Search for “reviewed:yes” entries in UniProtKB
  Filter protein sequences by taxonomy, keywords, gene ontology, enzyme class or pathways

DEMO: Download Swiss-Prot protein sequences from UniProt

NCBI Reference Sequence database
More comprehensive than Swiss-Prot
Two major types of RefSeq records:
  Known RefSeq: NP_
  Model RefSeq: XP_
Model RefSeq records are based on results from computational pipelines
  More likely to propagate annotation errors
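
Because the record type is encoded in the accession prefix, a protein FASTA file can be restricted to Known RefSeq records with a simple filter. The sketch below assumes each header line begins with the accession directly after ">"; the function name keep_known_refseq and the records shown are made up for illustration.

```python
def keep_known_refseq(fasta_lines, prefix="NP_"):
    """Yield only the FASTA records whose ID starts with the given prefix."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Decide once per record, based on the header line
            keep = line[1:].startswith(prefix)
        if keep:
            yield line

# Made-up records: two NP_ entries and one XP_ entry
records = [
    ">NP_000001.1 made-up curated protein", "MKT",
    ">XP_000002.1 made-up model protein", "MAA", "LLV",
    ">NP_000003.2 another made-up entry", "MGG",
]
filtered = list(keep_known_refseq(records))
```

The XP_ record and both of its sequence lines are dropped, leaving only the two NP_ records for the tblastn database.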

Obtain protein sequences from the NCBI RefSeq database
Download from the NCBI Genome database
Search the NCBI Protein database with the “RefSeq” and “reviewed” filters

Obtain RNA GenBank files for translated BLAT searches
Available through the NCBI FTP server
File with the “_rna.gbff.gz” suffix
Obtain the RNA GenBank file for D. melanogaster

Summary
Obtain genome assemblies from NCBI
Use FTP to transfer large genome assemblies to Galaxy
Use EBI SRA to transfer fastq files to Galaxy
Use different approaches to identify repetitive sequences in a genome
Obtain transcript and protein sequences from NCBI and UniProtKB for sequence similarity searches

https://flic.kr/p/bhyT8B
Questions?
