Web Databases for Drosophila

Slides:



Advertisements
Similar presentations
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Advertisements

Web Apollo Resources at the National Agricultural Library Christopher Childers NAL ARS USDA i5k.nal.usda.gov.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Sequence Analysis MUPGRET June workshops. Today What can you do with the sequence? What can you do with the ESTs? The case of SNP and Indel.
Copyright OpenHelix. No use or reproduction without express written consent1 Organization of genomic data… Genome backbone: base position number sequence.
Visualization of genomic data Genome browsers. How many have used a genome browser ? UCSC browser ? Ensembl browser ? Others ? survey.
How to access genomic information using Ensembl August 2005.
Sequence Analysis. Today How to retrieve a DNA sequence? How to search for other related DNA sequences? How to search for its protein sequence? How to.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Doug Brutlag 2011 Genome Databases Doug Brutlag Professor Emeritus of Biochemistry & Medicine Stanford University School of Medicine Genomics, Bioinformatics.
The Ensembl Gene set The “Genebuild” 21 April 2008.
Arabidopsis Genome Annotation TAIR7 Release. Arabidopsis Genome Annotation  Overview of releases  Current release (TAIR7)  Where to find TAIR7 release.
Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
Copyright OpenHelix. No use or reproduction without express written consent 2 Overview of Genome Browsers Materials prepared by Warren C. Lathe, Ph.D.
Annotation of Drosophila GEP Workshop – August 2015 Wilson Leung and Chris Shaffer.
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Copyright OpenHelix. No use or reproduction without express written consent1.
Welcome to DNA Subway Classroom-friendly Bioinformatics.
Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
Professional Development Course 1 – Molecular Medicine Genome Biology June 12, 2012 Ansuman Chattopadhyay, PhD Head, Molecular Biology Information Services.
1 Transcript modeling Brent lab. 2 Overview Of Entertainment  Gene prediction Jeltje van Baren  Improving gene prediction with tiling arrays Aaron Tenney.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Sackler Medical School
Biological databases Exercises. Discovery of distinct sequence databases using ensembl.
Introduction to ab initio and evidence-based gene finding Wilson Leung08/2015.
Searching for Transcription Start Sites in Drosophila
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
Overview of the Drosophila modENCODE hybrid assemblies Wilson Leung01/2014.
Annotation of Drosophila primer
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer.
UCSC Genome Browser Zeevik Melamed & Dror Hollander Gil Ast Lab Sackler Medical School.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
What is BLAST? Basic BLAST search What is BLAST?
Gene Finding in Chimpanzee Evidence based improvement of ab initio gene predictions Chris Shaffer06/2009.
Genomes at NCBI. Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools lists 57 databases.
Welcome to the combined BLAST and Genome Browser Tutorial.
Visualization of genomic data Genome browsers. How many have used a genome browser ? UCSC browser ? Ensembl browser ? Others ? survey.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
What is BLAST? Basic BLAST search What is BLAST?
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
Annotation of Drosophila
Annotation for D. virilis
bacteria and eukaryotes
Introduction to Genes and Genomes with Ensembl
Basics of BLAST Basic BLAST Search - What is BLAST?
Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.
TSS Annotation Workflow
GEP Annotation Workflow
Visualization of genomic data
Visualization of genomic data
Genome Center of Wisconsin, UW-Madison
Bioinformatics and BLAST
BLAST.
Ensembl Genome Repository.
Identify D. melanogaster ortholog
Follow-up from last night: XSEDE credits
Problems from last section
Introduction to Alternative Splicing and my research report
Basic Local Alignment Search Tool
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Presentation transcript:

Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung 08/2017

Agenda GEP annotation project overview Web databases for Drosophila annotation UCSC Genome Browser NCBI / BLAST FlyBase Gene Record Finder

Start codon Coding region Stop codon Intron donor Intron acceptor UTR AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Intron donor Intron acceptor UTR

Annotation – adding labels to a sequence 6/25/12 Annotation – adding labels to a sequence Genes: Novel or known genes, pseudogenes Regulatory Elements: Promoters, enhancers, silencers Non-coding RNA: tRNAs, miRNAs, siRNAs, snoRNAs Repeats: Transposable elements, simple repeats Structural: Origins of replication Experimental Results: DNase I Hypersensitive sites ChIP-chip and ChIP-Seq datasets (e.g. modENCODE) Use D. melanogaster as our reference

GEP Drosophila annotation projects 6/25/12 GEP Drosophila annotation projects D. melanogaster D. simulans D. sechellia Reference D. yakuba D. erecta Published D. ficusphila D. eugracilis D. biarmipes TSS projects for Fall 2017 / Spring 2018 D. takahashii D. elegans D. rhopaloa Annotation projects for Fall 2017 / Spring 2018 D. kikkawai D. bipectinata D. ananassae D. pseudoobscura New species sequenced by modENCODE D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project

Gene annotation workflow 6/25/12 Visualize a genomic region with evidence tracks GEP UCSC Genome Browser Identify interesting features and putative orthologs NCBI BLAST Learn about the putative D. melanogaster ortholog NCBI / FlyBase feature = some interesting region in the genome Understand the gene and isoform structure Gene Record Finder

UCSC Genome Browser Provide graphical view of genomic regions Sequence conservation Gene and splice site predictions RNA-Seq and splice junction predictions BLAT – BLAST-Like Alignment Tool Map protein or nucleotide sequences against an assembly Faster but less sensitive than BLAST Table Browser Access data used to create the graphical browser

UCSC Genome Browser overview 6/25/12 Genomic sequence BLASTX alignments Gene predictions Evidence tracks RNA-seq Repeats Comparative genomics

Control how evidence tracks are displayed on the Genome Browser Most evidence tracks have five display modes: Hide: track is hidden Dense: all features (including overlapping features) are displayed on a single line Squish: overlapping features are drawn on separate lines, features are half the height compared to full mode Pack: overlapping features are drawn on separate lines, features are the same height as full mode Full: Each feature is displayed on its own line Some annotation tracks (e.g., RepeatMasker) only have a subset of these display modes

Two different versions of the UCSC Genome Browser Official UCSC Version http://genome.ucsc.edu Published data, lots of species, whole genomes; used for “Chimp Chunks” GEP Version http://gander.wustl.edu GEP projects, parts of genomes; used for annotation of Drosophila species

Additional training resources Training section on the UCSC web site http://genome.ucsc.edu/training/index.html Videos User guides Mailing lists OpenHelix tutorials and training materials http://www.openhelix.com/ucsc Pre-recorded tutorials Quick reference cards

UCSC Genome Browser demo

Use BLAST to detect sequence similarity BLAST = Basic Local Alignment Search Tool Why is BLAST popular? Provide statistical significance for each match Good balance of sensitivity and speed Find local regions of similarity irrespective of where they are in the sequence

Common types of BLAST programs Except for BLASTN, all alignments are based on comparisons of protein sequences Decide which BLAST program to use based on the type of query and subject sequences: Program Query Database (Subject) BLASTN Nucleotide BLASTP Protein BLASTX Nucleotide → Protein TBLASTN TBLASTX BLASTN compares nucleotide sequences directly. Other BLAST programs compare protein sequences

Common BLAST programs use cases BLASTN: Search for similar nucleotide sequences Map contigs to genome, mRNAs/ESTs to genome BLASTP: Search for proteins similar to predicted genes BLASTX: Map protein / exons against genomic sequence TBLASTN: Map protein against genomic assemblies TBLASTX: Identify genes in unannotated sequences See the “BLAST Homepage and Selected Search Pages” document for details: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf

NCBI BLAST nucleotide databases GenBank Non-Redundant Nucleotide Database (nr/nt) Most comprehensive but some entries are low quality Exclude sequences from whole genome assemblies, ESTs RefSeq RNA Database mRNA and non-coding RNA entries from the NCBI Reference Sequence Project Include real and computationally predicted sequences Expressed Sequence Tag (EST) Database ESTs are short single reads of cDNA clones High error rate but useful for identifying transcribed loci

NCBI BLAST protein databases GenBank Non-Redundant Protein Database (nr) Most comprehensive but some entries are low quality Include sequences from both RefSeq and UniProtKB RefSeq Protein Database Sequences from the NCBI Reference Sequence Project Higher quality than the nr database Include real and computationally predicted sequences UniProtKB / Swiss-Prot Protein Database Manually curated proteins from literature Real proteins with known functions Much smaller database than either RefSeq or nr

Where can I run BLAST? NCBI BLAST web service EBI BLAST web service https://blast.ncbi.nlm.nih.gov/Blast.cgi EBI BLAST web service http://www.ebi.ac.uk/Tools/sss/ FlyBase BLAST (Drosophila and other insects) http://flybase.org/blast/

Include welcome to the Mac NCBI BLAST demo

National Center for Biotechnology Information (NCBI) https://www.ncbi.nlm.nih.gov

Key features of NCBI Strengths Weaknesses Use cases Most comprehensive among publicly available databases PubMed for literature searches Comprehensive BLAST web service Weaknesses Web site is large and complex Quality of GenBank records may vary Use cases Perform BLAST searches against Refseq, nr/nt databases Use BLAST (bl2seq) to align two or more sequences

FlyBase - Database for the Drosophila research community http://flybase.org/

Key Features of FlyBase Lots of ancillary data for each gene in Drosophila Curation of literature for each gene Reference Drosophila annotations for all the other databases (including NCBI) Fast release cycle (6-8 releases per year) Use cases Species-specific BLAST searches Genome browsers (GBrowse/JBrowse), RNA-Seq search, and access datasets for 20 Drosophila species 20 species – where WGS data is available

Web databases and tools Many genome databases available Be aware of different annotation releases Use FlyBase as the canonical reference Web databases are being updated constantly Update GEP materials before semester starts Discrepancies in exercise screenshots Minor changes in search results Let us know about errors or revisions Feel free to use the database you feel most comfortable with FlyBase release updates 6 times each year

Include welcome to the Mac FlyBase demo

Gene Record Finder http://gander.wustl.edu/~wilson/dmelgenerecord/index.html

Key features of the Gene Record Finder List of unique coding and non-coding exons for each gene in D. melanogaster CDS and exon usage maps for each isoform Optimized for exon-by-exon annotation strategy Slower update release cycle than FlyBase Database is updated every semester Use cases: Get amino acid sequences and nucleotide sequences of each exon for BLAST 2 Sequences (bl2seq) searches Seen in the Introduction to NCBI BLAST exercise

Gene Record Finder demo

Summary GEP annotation project seeks to generate high quality manually curated gene models for multiple Drosophila species Use BLAST to characterize a genomic sequence Use web databases to gather information on a gene UCSC Genome Browser NCBI FlyBase Gene Record Finder

Questions? http://www.flickr.com/photos/jac_opo/240254763/sizes/l/

Ensembl Metazoa: Databases for 12 Drosophila species http://metazoa.ensembl.org/index.html

Key features of Ensembl Metazoa Lots of ancillary data for each gene Data for 12 Drosophila species available Detailed information on each gene available at the transcript, peptide, and exon level Not always up-to-date Annotations are from FlyBase Release 6.02 Use cases Get amino acid sequences and nucleotide sequences of each exon for bl2seq searches Perform species-specific BLAST searches FlyBase is up to release 6.13 Has data on the Drosophila population genome project (SNPs in 50 Drosophila genomes)