Annotation of Drosophila Primer

Slides:



Advertisements
Similar presentations
Request Dispatching for Cheap Energy Prices in Cloud Data Centers
Advertisements

SpringerLink Training Kit
Luminosity measurements at Hadron Colliders
From Word Embeddings To Document Distances
Choosing a Dental Plan Student Name
Virtual Environments and Computer Graphics
Chương 1: CÁC PHƯƠNG THỨC GIAO DỊCH TRÊN THỊ TRƯỜNG THẾ GIỚI
THỰC TIỄN KINH DOANH TRONG CỘNG ĐỒNG KINH TẾ ASEAN –
D. Phát triển thương hiệu
NHỮNG VẤN ĐỀ NỔI BẬT CỦA NỀN KINH TẾ VIỆT NAM GIAI ĐOẠN
Điều trị chống huyết khối trong tai biến mạch máu não
BÖnh Parkinson PGS.TS.BS NGUYỄN TRỌNG HƯNG BỆNH VIỆN LÃO KHOA TRUNG ƯƠNG TRƯỜNG ĐẠI HỌC Y HÀ NỘI Bác Ninh 2013.
Nasal Cannula X particulate mask
Evolving Architecture for Beyond the Standard Model
HF NOISE FILTERS PERFORMANCE
Electronics for Pedestrians – Passive Components –
Parameterization of Tabulated BRDFs Ian Mallett (me), Cem Yuksel
L-Systems and Affine Transformations
CMSC423: Bioinformatic Algorithms, Databases and Tools
Some aspect concerning the LMDZ dynamical core and its use
Bayesian Confidence Limits and Intervals
实习总结 (Internship Summary)
Current State of Japanese Economy under Negative Interest Rate and Proposed Remedies Naoyuki Yoshino Dean Asian Development Bank Institute Professor Emeritus,
Front End Electronics for SOI Monolithic Pixel Sensor
Face Recognition Monday, February 1, 2016.
Solving Rubik's Cube By: Etai Nativ.
CS284 Paper Presentation Arpad Kovacs
انتقال حرارت 2 خانم خسرویار.
Summer Student Program First results
Theoretical Results on Neutrinos
HERMESでのHard Exclusive生成過程による 核子内クォーク全角運動量についての研究
Wavelet Coherence & Cross-Wavelet Transform
yaSpMV: Yet Another SpMV Framework on GPUs
Creating Synthetic Microdata for Higher Educational Use in Japan: Reproduction of Distribution Type based on the Descriptive Statistics Kiyomi Shirakawa.
MOCLA02 Design of a Compact L-­band Transverse Deflecting Cavity with Arbitrary Polarizations for the SACLA Injector Sep. 14th, 2015 H. Maesaka, T. Asaka,
Hui Wang†*, Canturk Isci‡, Lavanya Subramanian*,
Fuel cell development program for electric vehicle
Overview of TST-2 Experiment
Optomechanics with atoms
داده کاوی سئوالات نمونه
Inter-system biases estimation in multi-GNSS relative positioning with GPS and Galileo Cecile Deprez and Rene Warnant University of Liege, Belgium  
ლექცია 4 - ფული და ინფლაცია
10. predavanje Novac i financijski sustav
Wissenschaftliche Aussprache zur Dissertation
FLUORECENCE MICROSCOPY SUPERRESOLUTION BLINK MICROSCOPY ON THE BASIS OF ENGINEERED DARK STATES* *Christian Steinhauer, Carsten Forthmann, Jan Vogelsang,
Particle acceleration during the gamma-ray flares of the Crab Nebular
Interpretations of the Derivative Gottfried Wilhelm Leibniz
Advisor: Chiuyuan Chen Student: Shao-Chun Lin
Widow Rockfish Assessment
SiW-ECAL Beam Test 2015 Kick-Off meeting
On Robust Neighbor Discovery in Mobile Wireless Networks
Chapter 6 并发:死锁和饥饿 Operating Systems: Internals and Design Principles
You NEED your book!!! Frequency Distribution
Y V =0 a V =V0 x b b V =0 z
Fairness-oriented Scheduling Support for Multicore Systems
Climate-Energy-Policy Interaction
Hui Wang†*, Canturk Isci‡, Lavanya Subramanian*,
Ch48 Statistics by Chtan FYHSKulai
The ABCD matrix for parabolic reflectors and its application to astigmatism free four-mirror cavities.
Measure Twice and Cut Once: Robust Dynamic Voltage Scaling for FPGAs
Online Learning: An Introduction
Factor Based Index of Systemic Stress (FISS)
What is Chemistry? Chemistry is: the study of matter & the changes it undergoes Composition Structure Properties Energy changes.
THE BERRY PHASE OF A BOGOLIUBOV QUASIPARTICLE IN AN ABRIKOSOV VORTEX*
Quantum-classical transition in optical twin beams and experimental applications to quantum metrology Ivano Ruo-Berchera Frascati.
The Toroidal Sporadic Source: Understanding Temporal Variations
FW 3.4: More Circle Practice
ارائه یک روش حل مبتنی بر استراتژی های تکاملی گروه بندی برای حل مسئله بسته بندی اقلام در ظروف
Decision Procedures Christoph M. Wintersteiger 9/11/2017 3:14 PM
Limits on Anomalous WWγ and WWZ Couplings from DØ
Presentation transcript:

Annotation of Drosophila Primer August 2017 Wilson Leung and Chris Shaffer

Outline Overview of the GEP annotation projects GEP annotation workflow Practice applying the GEP annotation strategy

Start codon Coding region Stop codon Splice donor Splice acceptor UTR AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC Start codon Coding region Stop codon Splice donor Splice acceptor UTR

GEP Drosophila annotation projects D. melanogaster D. simulans D. sechellia D. yakuba Reference D. erecta D. ficusphila D. eugracilis Published D. biarmipes D. takahashii TSS projects for Fall 2017 / Spring 2018 D. elegans D. rhopaloa D. kikkawai Annotation projects for Fall 2017 / Spring 2018 D. bipectinata D. eugracilis F element has a higher frequency of consensus errors D. ananassae D. pseudoobscura D. persimilis New species sequenced by modENCODE D. willistoni D. mojavensis D. virilis D. grimshawi Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project

Muller element nomenclature X 2L 2R 3L 3R 4 X 4 5 3 2 6 Schaeffer SW et al, 2008. Polytene Chromosomal Maps of 11 Drosophila Species: The Order of Genomic Scaffolds Inferred From Genetic and Physical Maps. Genetics. 2008 Jul;179(3):1601-55

Gene structure nomenclature Primary mRNA Protein Gene span CDS = coding exons Exons Exon UTR’s UTR CDS’s CDS

GEP annotation goals Identify and annotate all genes in your project For each gene, identify and precisely map (accurate to the base pair) all Coding DNA Sequences (CDS) Do this for ALL isoforms Annotate the initial transcribed exon and transcription start site (TSS) Optional analyses not submitted to GEP Clustal analysis (proteins, promoter regions) Transposons and other repeats Synteny Non-coding genes

Evidence for gene models (in general order of importance) Conservation Sequence similarity to genes in D. melanogaster Sequence similarity to other Drosophila species (Multiz) Expression data RNA-Seq, EST, cDNA Computational predictions Open reading frames; gene and splice site predictions Tie-breakers of last resort See the “Annotation Instruction Sheet” RNA-Seq is more definitive than conservation if it is available

Basic annotation workflow Identify the likely D. melanogaster ortholog Observe the gene structure of the ortholog Map each CDS to the project sequence Determine the exact coordinates of each CDS Verify the model using the Gene Model Checker Repeat steps 2-5 for each additional isoform Annotation workflows available under the “Introducing Genes” section of the GEP web site

Four main web sites used by the GEP annotation strategy GEP UCSC Genome Browser (http://gander.wustl.edu) FlyBase (http://flybase.org) Tools  Genomic/Map Tools  BLAST Jump to Gene  Genomic Location  GBrowse Gene Record Finder (http://gep.wustl.edu) Projects  Annotation Resources NCBI BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) BLASTX  select the checkbox:

Annotation workflow: Step 1 Identify the likely D. melanogaster ortholog Observe the gene structure of the ortholog Map each CDS to the project sequence Determine the exact coordinates of each CDS Verify the model using the Gene Model Checker Repeat steps 2-5 for each additional isoform

Two different versions of the UCSC Genome Browser Official UCSC Version http://genome.ucsc.edu Published data, lots of species, whole genomes; used for “Chimp Chunks” GEP Version http://gander.wustl.edu GEP projects, parts of genomes; used for annotation of Drosophila species

GEP UCSC Genome Browser overview Genomic sequence Evidence tracks

Control how evidence tracks are displayed on the Genome Browser Five different display modes: Hide: track is hidden Dense: all features appear on a single line Squish: overlapping features appear on separate lines Features are half the height compared to full mode Pack: overlapping features appear on separate lines Features are the same height as full mode Full: each feature is displayed on its own line Set “Base Position” track to “Full” to see the amino acid translations Some evidence tracks (e.g., RepeatMasker) only have a subset of these display modes

DEMO: Examine contig10 in the D. biarmipes Aug DEMO: Examine contig10 in the D. biarmipes Aug. 2013 (GEP/Dot) assembly with the GEP UCSC Genome Browser

GEP annotation strategy Use D. melanogaster as reference D. melanogaster is very well annotated Use sequence similarity to infer homology Minimize changes compared to the D. melanogaster gene model (parsimony) Coding sequences evolve slowly Exon structure changes very slowly D. melanogaster gene models supported by large amount of experimental data (from modENCODE and other projects) Higher frequency of gene structure changes in D. ficusphila compared to D. biarmipes and D. elegans

FlyBase – Database for the Drosophila research community Lots of ancillary data for each gene in D. melanogaster Curation of literature for each gene Reference for D. melanogaster annotations for all other databases Including NCBI, EBI, and DDBJ Fast release cycle (6-8 releases per year)

Overview of NCBI BLAST Detect local regions of significant sequence similarity between two sequences Decide which BLAST program to use based on the type of query and subject sequences: Program Query Database (Subject) BLASTN Nucleotide BLASTP Protein BLASTX Nucleotide → Protein TBLASTN TBLASTX See “Basics of BLAST” for a more detailed overview

Where can I run BLAST? NCBI BLAST web service EBI BLAST web service https://blast.ncbi.nlm.nih.gov/Blast.cgi EBI BLAST web service http://www.ebi.ac.uk/Tools/sss/ FlyBase BLAST (Drosophila and other insects) http://flybase.org/blast/

Accessing TBLASTX at NCBI tblastx is no longer listed on the new NCBI BLAST front page

DEMO: Ortholog assignment for the N-SCAN prediction contig10.001.1 TODO DEMO: Ortholog assignment for the N-SCAN prediction contig10.001.1

Annotation workflow: Step 2 Identify the likely D. melanogaster ortholog Observe the gene structure of the ortholog Map each CDS to the project sequence Determine the exact coordinates of each CDS Verify the model using the Gene Model Checker Repeat steps 2-5 for each additional isoform

Gene Record Finder – Observe the structure of D. melanogaster genes Retrieves CDS and exon sequences for each gene in D. melanogaster CDS and exon usage maps for each isoform List of unique CDS Designed for the exon-by-exon annotation strategy Mention that this is based on an older release

Nomenclature for Drosophila genes Drosophila gene names are case-sensitive Lowercase initial letter = recessive mutant phenotype Uppercase initial letter = dominant mutant phenotype Every D. melanogaster gene has an annotation symbol Begins with the prefix CG (Computed Gene) Some genes have a different gene symbol (e.g., ey) Suffix after the gene symbol denotes different isoforms mRNA = -R; protein = -P ey-RA = Transcript for the A isoform of ey ey-PA = Protein product for the A isoform of ey Dl = Delta; dl = dorsal ey is the gene symbol, CG1901 is the annotation symbol

Be aware of different annotation releases D. melanogaster Release 6 genome assembly First change of the assembly since late 2006 Most modENCODE analysis used the Release 5 assembly Gene annotations change much more frequently Use FlyBase as the canonical reference GEP data freeze GEP materials are updated before the start of semester Potential discrepancies in results and screenshots See the archived BLAST results in the exercise package Let us know about major errors or discrepancies Lifted release 5 datasets required by the GEP to release 6

DEMO: Determine the gene structure of the D. melanogaster gene CG31997

Annotation workflow: Step 3 Identify the likely D. melanogaster ortholog Observe the gene structure of the ortholog Map each CDS to the project sequence Determine the exact coordinates of each CDS Verify the model using the Gene Model Checker Repeat steps 2-5 for each additional isoform

BLAST parameters for CDS mapping Select the “Align two or more sequences” checkbox Settings in the “Algorithm parameters” section Verify the Word size is set to 3 Turn off compositional adjustments Turn off the low complexity filter Default word size for database searches = 6, bl2seq searches = 3

Strategies for CDS mapping Start by mapping the largest CDS The first and last CDS tend to be smaller than internal CDS in Drosophila Continue mapping CDS by size in descending order Defer mapping small or weakly conserved CDS Use placements of adjacent CDS to define the search region Use the splice donor and acceptor phases of adjacent CDS as additional constraints In most cases, avoid mapping CDS from 5’ to 3’ of the gene because they tend to be smaller

Strategies for finding small CDS Examine RNA-Seq coverage and TopHat junctions Small CDS is typically part of a larger transcribed exon Use Query subrange to restrict the search region Increase the Expect threshold and try again Keep increasing the Expect threshold until you get matches Also try decreasing the word size Use the Small Exon Finder Minimize changes in CDS size Available under Projects  Annotation Resources See the “Annotation Instruction Sheet” for details

DEMO: Map CDS 3_10803_1 of CG31997 against contig10 with BLASTX

EXERCISE: Map each CDS to the project sequence Use BLASTX to determine the approximate locations for the three CDS of CG31997 on contig10 Consult with each other The “Annotation of a Drosophila Gene” document provides a step-by-step walkthrough

Discussions and coffee break Carolina Ponce: https://flic.kr/p/otHbqV duncan c: https://flic.kr/p/nSfe14

Annotation workflow: Step 4 Identify the likely D. melanogaster ortholog Observe the gene structure of the ortholog Map each CDS to the project sequence Determine the exact coordinates of each CDS Verify the model using the Gene Model Checker Repeat steps 2-5 for each additional isoform

Basic biological constraints (inviolate rules*) Coding regions start with a methionine Coding regions end with a stop codon Gene should be on only one strand of DNA Exons appear in order along the DNA (collinear) Intron sequences should be at least 40 bp Intron starts with a GT (or rarely GC) Intron ends with an AG Only break these rules if they are found in D. melanogaster or supported by experimental evidence ~1% of the genes in D. melanogaster have a GC donor site * There are known exceptions to each rule

modENCODE RNA-Seq data RNA-Seq evidence tracks: RNA-Seq coverage (read depth) TopHat splice junction predictions Assembled transcripts (Cufflinks, Oases) Positive results very helpful Negative results less informative Lack of transcription ≠ no gene GEP curriculum: RNA-Seq Primer Browser-Based Annotation and RNA-Seq Data modENCODE RNA-Seq data: mixed embryos, adult males, adult females

Overview of RNA-Seq (Illumina) 5’ cap Poly-A tail Processed mRNA AAAAAA RNA fragments (~250bp) Library with adapters 5’ 3’ Paired end sequencing 5’ 3’ ~125bp RNA-Seq reads Reverse Forward Wang Z et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.

DEMO: Use RNA-Seq coverage to support the placement of the start codon

EXERCISE: Confirm the placement of the stop codon for CDS 3_10803_1

Can use the TopHat splice junction predictions to identify splice sites 5’ cap M * Poly-A tail Processed mRNA AAAAAA RNA-Seq reads Contig Intron TopHat junctions

A genomic sequence has 6 different reading frames 3 1 2 2 Frames Place start codon at 25673 Frame: Base to begin translation relative to the start of the sequence

A codon could be derived from nucleotides in adjacent exons TTT CCG GT AG … … … CTG AGA GA T G AT TTT CCG GT AG … … … CTG AGA GT AG … … … CTG AGA GAT TTT CCG CTG AGA GAT GAT TTT CCG Spliced mRNA Phase 0 Phase 1 Phase 2 move label intron Phase 2 Phase 1 Donor Acceptor Intron

Splice donor and acceptor phases Phase: Number of bases between the complete codon and the splice site Donor phase: Number of bases between the end of the last complete codon and the splice donor site (GT/GC) Acceptor phase: Number of bases between the splice acceptor site (AG) and the start of the first complete codon Phase depends on the reading frame of the CDS

Phase depends on the reading frame Splice donor CDS 1_10803_0 ends at 25835 with a phase 1 donor site Phase of donor site: Phase 2 relative to frame +1 Phase 1 relative to frame +2 Phase 0 relative to frame +3

Phase of the donor and acceptor sites must be compatible Extra nucleotides from donor and acceptor phases form an additional codon Donor phase + acceptor phase = 0 or 3 CTG AGA G CTG AGA G GT … … … AG TTT CCG AT TTT CCG AT By definition, blastx cannot detect the additional codon that results from donor and acceptor phases TTT CCG GAT CTG AGA L R D F P Translation:

Incompatible donor and acceptor phases result in a frame shift CTG AGA G GT GT … … AG AT TTT CCG CTG AGA TTC CG GGT ATT L R G I F Translation: Phase 0 donor is incompatible with phase 2 acceptor

DEMO: Use RNA-Seq to annotate the intron between CDS 1_10803_0 and 2_10803_2 of the CG31997 ortholog

EXERCISE: Determine the coordinates for CDS 2_10803_2 and 3_10803_1 of the CG31997 ortholog

Annotation workflow: Step 5 Identify the likely D. melanogaster ortholog Observe the gene structure of the ortholog Map each CDS to the project sequence Determine the exact coordinates of each CDS Verify the model using the Gene Model Checker Repeat steps 2-5 for each additional isoform

Verify the final gene model using the Gene Model Checker Gene model should satisfy biological constraints Explain errors or warnings in the GEP Annotation Report Compare model against the D. melanogaster ortholog Dot plot and protein alignment See “Quick check of student annotations” View your gene model as a custom track in the genome browser Generate files require for project submission

DEMO: Verify the proposed gene model for the ortholog of CG31997

Annotation workflow: Step 6 Identify the likely D. melanogaster ortholog Observe the gene structure of the ortholog Map each CDS to the project sequence Determine the exact coordinates of each CDS Verify the model using the Gene Model Checker Repeat steps 2-5 for each additional isoform In this case, isoform A and B of CG31997 have the same set of coding exons

Next step: practice annotation D. biarmipes Aug Next step: practice annotation D. biarmipes Aug. 2013 (GEP/Dot) assembly Annotation of a Drosophila Gene onecut on contig35 ey on contig40 CG1909 on contig35 Arl4 and CG33978 on contig10 D. biarmipes (Aug. 2013 GEP/Dot assembly) onecut – Similar to CG31997 but the feature is on the minus strand; partial alignment for CDS 3, acceptor placement based on RNA-Seq and TopHat ey – member of a gene family but with large difference in E-value, alternative splicing in multiple isoforms CG1909 – Change in gene structure: split CDS 4_9560_0; consists of 9 CDS, gene with conserved domains Arl4 – B isoform does not exist in D. biarmipes CG33978 – Partial gene missing 5’ end; missing isoforms C and E; consensus error within CDS 6_11558_0 Difficulty

Questions?