Presentation is loading. Please wait.

Presentation is loading. Please wait.

Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.

Similar presentations


Presentation on theme: "Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015."— Presentation transcript:

1 Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015

2 Agenda GEP annotation project overview Web databases for Drosophila annotation – UCSC Genome Browser – NCBI / BLAST – FlyBase – Gene Record Finder

3 AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATC GTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAAT AGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAG GACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACA TCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGA TGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCC AGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTG GAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAA ACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATA CGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCC GATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTG GTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCC CTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAA CTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGG CTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCG GAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATC GCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCG GGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCC AAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAA TGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTAT GTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAA ATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTT TGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Intron donorIntron acceptor UTR

4 Annotation – adding labels to a sequence Genes: Novel or known genes, pseudogenes Regulatory Elements: Promoters, enhancers, silencers Non-coding RNA: tRNAs, miRNAs, siRNAs, snoRNAs Repeats: Transposable elements, simple repeats Structural: Origins of replication Experimental Results: – DNase I Hypersensitive sites – ChIP-chip and ChIP-Seq datasets (e.g. modENCODE)

5 GEP Drosophila annotation projects D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. ananassae D. bipectinata D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Reference Published Annotation projects for Fall 2015 / Spring 2016 Species in the Four Genomes Paper New species sequenced by modENCODE Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project Manuscript in progress

6 Gene annotation workflow Visualize a genomic region with evidence tracks GEP UCSC Genome Browser Visualize a genomic region with evidence tracks GEP UCSC Genome Browser Learn about the putative D. melanogaster ortholog NCBI / FlyBase Learn about the putative D. melanogaster ortholog NCBI / FlyBase Identify interesting features and putative orthologs NCBI BLAST Identify interesting features and putative orthologs NCBI BLAST Understand the gene and isoform structure Gene Record Finder Understand the gene and isoform structure Gene Record Finder

7 UCSC Genome Browser Provide graphical view of genomic regions – Sequence conservation – Gene and splice site predictions – RNA-Seq and splice junction predictions BLAT – BLAST-Like Alignment Tool – Map protein or nucleotide sequences against an assembly – Faster but less sensitive than BLAST Table Browser – Access data used to create the graphical browser

8 UCSC Genome Browser overview Genomic sequence Evidence tracks BLASTX alignments Gene predictions RNA-seq Repeats Comparative genomics

9 Control how evidence tracks are displayed on the Genome Browser Most evidence tracks have five display modes: – Hide: track is hidden – Dense: all features (including overlapping features) are displayed on a single line – Squish: overlapping features are drawn on separate lines, features are half the height compared to full mode – Pack: overlapping features are drawn on separate lines, features are the same height as full mode – Full: Each feature is displayed on its own line Some annotation tracks (e.g. RepeatMasker) only have a subset of these display modes

10 Two different versions of the UCSC Genome Browser http://genome.ucsc.edu http://gander.wustl.edu Official UCSC Version GEP Version Published data, lots of species, whole genomes, used for “Chimp Chunks” GEP data, parts of genomes, used for annotation of Drosophila species

11 Additional training resources Training section on the UCSC web site – http://genome.ucsc.edu/training.html http://genome.ucsc.edu/training.html – User guides – Mailing lists OpenHelix tutorials and training materials – http://www.openhelix.com/ucsc http://www.openhelix.com/ucsc – Pre-recorded tutorial – Reference cards

12 UCSC GENOME BROWSER DEMO

13 Use BLAST to detect sequence similarity BLAST = Basic Local Alignment Search Tool Why is BLAST popular? – Provide statistical significance for each match – Good balance of sensitivity and speed Find local regions of similarity irrespective of where they are in the sequence

14 Common types of BLAST programs ProgramQueryDatabase (Subject) BLASTNNucleotide BLASTPProtein BLASTXNucleotide -> ProteinProtein TBLASTNProteinNucleotide -> Protein TBLASTXNucleotide -> Protein Except for BLASTN, all alignments are based on comparisons of protein sequences Decide which BLAST program to use based on the type of query and subject sequences:

15 Common BLAST programs use cases BLASTN: Search for similar nucleotide sequences – Map contigs to genome, mRNAs/ESTs to genome BLASTP: Search for proteins similar to predicted genes BLASTX: Map protein / exons against genomic sequence TBLASTN: Map protein against genomic assemblies TBLASTX: Identify genes in unannotated sequences See the BLAST Homepage and Selected Search Pages document for details: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_BLASTGuide.pdf

16 NCBI BLAST nucleotide databases GenBank Non-Redundant Nucleotide Database (nr/nt) – Most comprehensive but some entries are low quality – Exclude sequences from whole genome assembly, ESTs RefSeq RNA Database – mRNA and non-coding RNA entries from the NCBI Reference Sequence Project – Include real and computationally predicted sequences Expressed Sequence Tag (EST) Database – ESTs are short single reads of cDNA clones – High error rate but useful for identifying transcribed loci

17 NCBI BLAST protein databases GenBank Non-Redundant Protein Database (nr) – Most comprehensive but some entries are low quality – Include sequences from both RefSeq and UniProtKB RefSeq Protein Database – Sequences from the NCBI Reference Sequence Project – Higher quality than the nr database – Include real and computationally predicted sequences UniProtKB / Swiss-Prot Protein Database – Manually curated proteins from literature – Real proteins with known functions – Much smaller database than either RefSeq or nr

18 Where can I run BLAST? NCBI BLAST web service – http://blast.ncbi.nlm.nih.gov/Blast.cgi http://blast.ncbi.nlm.nih.gov/Blast.cgi EBI BLAST web service – http://www.ebi.ac.uk/Tools/sss/ http://www.ebi.ac.uk/Tools/sss/ FlyBase BLAST (Drosophila and other insects) – http://flybase.org/blast/ http://flybase.org/blast/

19 NCBI BLAST DEMO

20 National Center for Biotechnology Information (NCBI) http://www.ncbi.nlm.nih.gov

21 Key features of NCBI Strengths – Most comprehensive among publicly available databases – PubMed for literature searches – Comprehensive BLAST web service Weaknesses – Web site is large and complex – Quality of GenBank records may vary Use cases – Perform BLAST searches against Refseq, nr/nt databases – Compare one sequence against another (bl2seq)

22 FlyBase - Database for the Drosophila research community http://flybase.org/

23 Key Features of FlyBase Lots of ancillary data for each gene in Drosophila Curation of literature for each gene Reference Drosophila annotations for all the other databases (including NCBI) Fast release cycle (6-8 releases per year) Use cases – Species-specific BLAST searches – Genome browser (GBrowse) and access datasets for the 20 sequenced Drosophila species

24 Web databases and tools Many genome databases available – Be aware of different annotation releases – Use FlyBase as the canonical reference Web databases are being updated constantly – Update GEP materials before semester starts – Discrepancies in exercise screenshots – Minor changes in search results – Let us know about errors or revisions

25 FLYBASE DEMO

26 Gene Record Finder http://gander.wustl.edu/~wilson/dmelgenerecord/index.html

27 Key features of the Gene Record Finder List of unique coding and non-coding exons for each gene in D. melanogaster CDS and exon usage maps for each isoform Optimized for exon-by-exon annotation strategy Slower update release cycle than FlyBase – Database is updated every semester Use cases: – Get amino acid sequences and nucleotide sequences of each exon for BLAST 2 Sequences (bl2seq) searches

28 GENE RECORD FINDER DEMO

29 Summary GEP annotation project seeks to generate high quality manually curated gene models for multiple Drosophila species Use BLAST to characterize a genomic sequence Use web databases to gather information on a gene – UCSC Genome Browser – NCBI – FlyBase – Gene Record Finder

30 Questions? http://www.flickr.com/photos/jac_opo/240254763/sizes/l/

31

32 Brief overview of the BLAST algorithm BLAST is based on the Smith-Waterman algorithm but use the following heuristics: – Build word list with the query and subject sequences Word size = 3 for protein, 11 for nucleotide – Generate a list of high-scoring words Determine by scoring system or scoring matrix – Scan database for exact matches to high-scoring words – Extend matches to generate High-scoring Segment Pairs – Merge multiple HSP’s into a longer alignment – Calculate E-value for the alignments – Report alignments below E-value threshold

33 Ensembl Metazoa: Databases for 12 Drosophila species http://metazoa.ensembl.org

34 Key features of Ensembl Metazoa Lots of ancillary data for each gene Data for 12 Drosophila species available Detailed information on each gene available at the transcript, peptide, and exon level Not always up-to-date – Annotations are from FlyBase Release 6.02 Use cases – Get amino acid sequences and nucleotide sequences of each exon for bl2seq searches – Perform species-specific BLAST searches


Download ppt "Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015."

Similar presentations


Ads by Google