Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer.

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Web Apollo Resources at the National Agricultural Library Christopher Childers NAL ARS USDA i5k.nal.usda.gov.
Peter Tsai, Bioinformatics Institute.  University of California, Santa Cruz (UCSC)  A rapid and reliable display of any requested portion of genomes.
1 Computational Molecular Biology MPI for Molecular Genetics DNA sequence analysis Gene prediction Gene prediction methods Gene indices Mapping cDNA on.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Sequence Analysis MUPGRET June workshops. Today What can you do with the sequence? What can you do with the ESTs? The case of SNP and Indel.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
Copyright OpenHelix. No use or reproduction without express written consent1 Organization of genomic data… Genome backbone: base position number sequence.
Sequence Analysis. Today How to retrieve a DNA sequence? How to search for other related DNA sequences? How to search for its protein sequence? How to.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Transcription Transcription is the synthesis of mRNA from a section of DNA. Transcription of a gene starts from a region of DNA known as the promoter.
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Arabidopsis Genome Annotation TAIR7 Release. Arabidopsis Genome Annotation  Overview of releases  Current release (TAIR7)  Where to find TAIR7 release.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
 GEP Digital Laboratory Notebook Nick Reeves, Mt. San Jacinto Community College.
Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
Annotation of Drosophila GEP Workshop – August 2015 Wilson Leung and Chris Shaffer.
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Welcome to DNA Subway Classroom-friendly Bioinformatics.
Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011.
Organizing information in the post-genomic era The rise of bioinformatics.
Professional Development Course 1 – Molecular Medicine Genome Biology June 12, 2012 Ansuman Chattopadhyay, PhD Head, Molecular Biology Information Services.
1 Transcript modeling Brent lab. 2 Overview Of Entertainment  Gene prediction Jeltje van Baren  Improving gene prediction with tiling arrays Aaron Tenney.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.
 GEP Implementation at Mt. San Jacinto Community College Nick Reeves, Ph.D.
Introduction to ab initio and evidence-based gene finding Wilson Leung08/2015.
Searching for Transcription Start Sites in Drosophila
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
Overview of the Drosophila modENCODE hybrid assemblies Wilson Leung01/2014.
Annotation of Drosophila primer
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Genomics Education Partnership: a flexible approach to implement Genomic teachings and research in the classroom Matthew W. Wadsworth and Consuelo J. Alvarez,
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
Web Apollo Resources at the National Agricultural Library Christopher Childers NAL ARS USDA i5k.nal.usda.gov.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
Finding genes in the genome
Annotation of eukaryotic genomes
What is BLAST? Basic BLAST search What is BLAST?
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Gene Finding in Chimpanzee Evidence based improvement of ab initio gene predictions Chris Shaffer06/2009.
Genomes at NCBI. Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools lists 57 databases.
Primer on Reading Frames and Phase Wilson Leung08/2012.
Visualization of genomic data Genome browsers. How many have used a genome browser ? UCSC browser ? Ensembl browser ? Others ? survey.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Web Databases for Drosophila
What is BLAST? Basic BLAST search What is BLAST?
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
Annotation of Drosophila
Annotation for D. virilis
bacteria and eukaryotes
Primer on Reading Frames and Phase
Basics of BLAST Basic BLAST Search - What is BLAST?
Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.
TSS Annotation Workflow
GEP Annotation Workflow
Visualization of genomic data
Visualization of genomic data
Ab initio gene prediction
Ensembl Genome Repository.
Identify D. melanogaster ortholog
Determine CDS Coordinates
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Presentation transcript:

Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer

Outline Overview of the GEP annotation projects GEP annotation workflow Practice applying the GEP annotation strategy

AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATC GTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAAT AGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAG GACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACA TCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGA TGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCC AGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTG GAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAA ACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATA CGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCC GATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTG GTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCC CTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAA CTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGG CTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCG GAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATC GCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCG GGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCC AAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAA TGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTAT GTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAA ATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTT TGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Splice donorSplice acceptor UTR

GEP Drosophila annotation projects D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. ananassae D. bipectinata D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi Reference Published Annotation projects for Fall 2015 / Spring 2016 Species in the Four Genomes Paper New species sequenced by modENCODE Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project Manuscript in progress

Muller element nomenclature Schaeffer SW et al, Polytene Chromosomal Maps of 11 Drosophila Species: The Order of Genomic Scaffolds Inferred From Genetic and Physical Maps. Genetics Jul;179(3): X2L2R3L3R4 X45326

Gene structure nomenclature Gene span Primary mRNA Protein Exons Exon UTR’s UTR CDS’s CDS

GEP annotation goals Identify and annotate all genes in your project For each gene, identify and precisely map (accurate to the base pair) all Coding DNA Sequence ( CDS ) Do this for ALL isoforms Annotate the initial transcribed exon and transcription start site (TSS) Optional curriculum not submitted to GEP Clustal analysis (protein, promoter regions) Repeats analysis Synteny analysis Non-coding genes analysis

Evidence for gene models (in general order of importance) 1.Conservation Sequence similarity to genes in D. melanogaster Sequence similarity to other Drosophila species (Multiz) 2.Expression data RNA-Seq, EST, cDNA 3.Computational predictions Gene and splice site predictions 4.Tie-breakers of last resort See the “Annotation Instruction Sheet”

Basic annotation workflow 1. Identify the likely D. melanogaster ortholog 2. Observe the gene structure of the ortholog 3. Map each CDS to the project sequence 4. Determine the exact coordinates of each CDS 5. Verify the model using the Gene Model Checker 6. Repeat steps 2-5 for each additional isoform

Four main web sites used by the GEP annotation strategy 1.GEP UCSC Genome Browser ( ) 2.FlyBase ( ) Tools  Genomic/Map Tools  BLAST Jump to Gene  Genomic Location  GBrowse 3.Gene Record Finder ( ) Projects  Annotation Resources 4.NCBI BLAST ( ) BLASTX  select the checkbox:

1. Identify the likely D. melanogaster ortholog 2. Observe the gene structure of the ortholog 3. Map each CDS to the project sequence 4. Determine the exact coordinates of each CDS 5. Verify the model using the Gene Model Checker 6. Repeat steps 2-5 for each additional isoform Annotation workflow: Step 1

Two different versions of the UCSC Genome Browser Official UCSC Version Published data, lots of species, whole genomes; used for “Chimp Chunks” GEP Version GEP data, parts of genomes, used for annotation of Drosophila species

GEP UCSC Genome Browser overview Genomic sequence Evidence tracks

Control how evidence tracks are displayed on the Genome Browser Five different display modes: Hide : track is hidden Dense : all features appear on a single line Squish : overlapping features appear on separate lines Features are half the height compared to full mode Pack: overlapping features appear on separate lines Features are the same height as full mode Full: each feature is displayed on its own line Set “Base Position” track to “Full” to see the amino acid translations Some evidence tracks ( e.g., RepeatMasker ) only have a subset of these display modes

DEMO : GEP UCSC Genome Browser Examine contig10 in the D. biarmipes Aug (GEP/Dot) assembly

GEP annotation strategy Use D. melanogaster as reference D. melanogaster is very well annotated Use sequence similarity to infer homology Minimize changes compared to the D. melanogaster gene model (parsimony) Coding sequences evolve slowly Exon structure changes very slowly

FlyBase – Database for the Drosophila research community Lots of ancillary data for each gene in D. melanogaster Curation of literature for each gene Reference for D. melanogaster annotations for all other databases Including NCBI, EBI, and DDBJ Fast release cycle (6-8 releases per year)

Overview of NCBI BLAST Detect local regions of significant sequence similarity between two sequences Decide which BLAST program to use based on the type of query and subject sequences: ProgramQueryDatabase (Subject) BLASTN Nucleotide BLASTP Protein BLASTX Nucleotide -> ProteinProtein TBLASTN ProteinNucleotide -> Protein TBLASTX Nucleotide -> Protein

Where can I run BLAST ? NCBI BLAST web service EBI BLAST web service FlyBase BLAST ( Drosophila and other insects)

DEMO : Ortholog assignment for the N-SCAN prediction contig Feature in contig10 of the D. biarmipes Aug (GEP/Dot) assembly

1. Identify the likely D. melanogaster ortholog 2. Observe the gene structure of the ortholog 3. Map each CDS to the project sequence 4. Determine the exact coordinates of each CDS 5. Verify the model using the Gene Model Checker 6. Repeat steps 2-5 for each additional isoform Annotation workflow: Step 2

Gene Record Finder – Observe the structure of D. melanogaster genes Retrieves CDS and exon sequences for each gene in D. melanogaster CDS and exon usage maps for each isoform List of unique CDS Designed for the exon-by-exon annotation strategy

Nomenclature for Drosophila genes Drosophila gene names are case-sensitive Lowercase initial letter = recessive mutant phenotype Uppercase initial letter = dominant mutant phenotype Every D. melanogaster gene has an annotation symbol Begins with the prefix CG ( C omputed G ene) Some genes have a different gene symbol ( e.g., ey ) Suffix after the gene symbol denotes different isoforms mRNA = -R ; protein = -P ey-RA = Transcript for the A isoform of ey ey-PA = Protein product for the A isoform of ey

Be aware of different annotation releases D. melanogaster Release 6 genome assembly First change of the assembly since late 2006 Most modENCODE analysis used the Release 5 assembly Gene annotations change much more frequently Use FlyBase as the canonical reference GEP data freeze: GEP materials are updated before the start of semester; potential discrepancies in exercise screenshots corrected; minor differences in search results corrected. Let us know about major errors or discrepancies.

DEMO : Determine the gene structure of the D. melanogaster gene CG31997

Annotation workflow: Step 3 1. Identify the likely D. melanogaster ortholog 2. Observe the gene structure of the ortholog 3. Map each CDS to the project sequence 4. Determine the exact coordinates of each CDS 5. Verify the model using the Gene Model Checker 6. Repeat steps 2-5 for each additional isoform

BLAST parameters for CDS mapping Select the “Align two or more sequences” checkbox Settings in the “Algorithm parameters” section Verify the Word size is set to 3 Turn off compositional adjustments Turn off the low complexity filter

Strategies for finding small CDS Examine RNA-Seq coverage and TopHat junctions Small CDS is typically part of a larger transcribed exon Use Query subrange to restrict the search region Increase the Expect threshold and try again Keep increasing the Expect threshold until you get matches Also try decreasing the word size Use the Small Exon Finder Minimize changes in CDS size Available under Projects  Annotation Resources See the “Annotation Strategy Guide” for details

DEMO : Map CDS 3_10861_1 of CG31997 against contig10 with BLASTX

EXERCISE : Map CDS 1_10861_0 and 2_10861_2 of CG31997 against contig10

1. Identify the likely D. melanogaster ortholog 2. Observe the gene structure of the ortholog 3. Map each CDS to the project sequence 4. Determine the exact coordinates of each CDS 5. Verify the model using the Gene Model Checker 6. Repeat steps 2-5 for each additional isoform Annotation workflow: Step 4

Basic biological constraints (inviolate rules*) Coding regions start with a methionine Coding regions end with a stop codon Gene should be on only one strand of DNA Exons appear in order along the DNA (collinear) Intron sequences should be at least 40 bp Intron starts with a GT (or rarely GC) Intron ends with an AG * There are known exceptions to each rule

modENCODE RNA-Seq data RNA-Seq evidence tracks: RNA-Seq coverage (read depth) TopHat splice junction predictions Assembled transcripts ( Cufflinks, Oases ) Positive results very helpful Negative results less informative Lack of transcription ≠ no gene GEP curriculum: RNA-Seq Primer Browser-Based Annotation and RNA-Seq Data

Overview of RNA-Seq (Illumina) Processed mRNA 5’ capPoly-A tail AAAAAA RNA fragments (~250bp) Library with adapters 5’3’5’3’5’3’5’3’ Paired end sequencing 5’3’ ~125bp RNA-Seq reads Reverse Forward Wang Z et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.

DEMO : Use RNA-Seq coverage to support the placement of the start codon

EXERCISE : Confirm the placement of the stop codon for CDS 3_10861_1

Can use the TopHat splice junction predictions to identify splice sites Processed mRNA AAAAAA M RNA-Seq reads 5’ capPoly-A tail * TopHat junctions Contig Intron

A genomic sequence has 6 different reading frames Frame : Base to begin translation relative to the start of the sequence Frames

Splice donor and acceptor phases Phase : Number of bases between the complete codon and the splice site Donor phase: Number of bases between the end of the last complete codon and the splice donor site (GT/GC) Acceptor phase: Number of bases between the splice acceptor site (AG) and the start of the first complete codon Phase is dependent on the reading frame of the CDS

Phase depends on the reading frame Phase of donor site: Phase 2 relative to frame +1 Phase 1 relative to frame +2 Phase 0 relative to frame +3 Splice donor

Phase of the donor and acceptor sites must be compatible Extra nucleotides from donor and acceptor phases form an additional codon Donor phase + acceptor phase = 0 or 3 GTAG… … … CT G AG A G TTT CC G AT LRDFP Translation: CT G AG A G TTT CC G GAT CT G AG A TTT CC G AT

Incompatible donor and acceptor phases result in a frame shift Phase 0 donor is incompatible with phase 2 acceptor CT G GTGT AG… AG A G TTT CC G AT LRGIF Translation: GT CT G AG A TT C CGGGTGGT ATT

DEMO : Use RNA-Seq to annotate the intron between CDS 1_10861_0 and 2_10861_2 of the CG31997 ortholog

EXERCISE : Determine the coordinates for CDS 2_10861_2 and 3_10861_1 of the CG31997 ortholog

1. Identify the likely D. melanogaster ortholog 2. Observe the gene structure of the ortholog 3. Map each CDS to the project sequence 4. Determine the exact coordinates of each CDS 5. Verify the model using the Gene Model Checker 6. Repeat steps 2-5 for each additional isoform Annotation workflow: Step 5

Verify the final gene model using the Gene Model Checker Gene model should satisfy biological constraints Explain errors or warnings in the GEP Annotation Report Compare model against the D. melanogaster ortholog Dot plot and protein alignment See “ How to do a quick check of student annotations ” View your gene model as a custom track in the genome browser Generate files require for project submission

DEMO : Verify the proposed gene model for the ortholog of CG31997

1. Identify the likely D. melanogaster ortholog 2. Observe the gene structure of the ortholog 3. Map each CDS to the project sequence 4. Determine the exact coordinates of each CDS 5. Verify the model using the Gene Model Checker 6. Repeat steps 2-5 for each additional isoform Annotation workflow: Step 6

Next step: practice annotation Annotation of a Drosophila Gene onecut on contig35 ey on contig40 CG1909 on contig35 Arl4 and CG33978 on contig10 Difficulty

Questions?