Published by Paulina Shepherd. Modified over 9 years ago.
1
Gene prediction in flies ● Background ● Gene prediction pipeline ● Resources
2
Background
3
Genome quality
4
Genes in Drosophila melanogaster
● high gene density
● at least 20% with alternative transcripts
● can be nested, on the same strand or on different strands
● can be di-cistronic
● can involve trans-splicing of exons from a different strand
5
Gene prediction pipeline
● Gene prediction by homology
    no ab-initio predictions
    not using genomic alignments
● TBLASTN/Genewise process
    quick genome scan to find putative gene-containing regions
    aligning peptide sequence to genomic fragment using a gene model
    ● cds
    ● introns
    ● splice-sites
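The genome-scan step can be pictured as grouping nearby hits into candidate regions that are then handed to the peptide-to-genome aligner. The following is only a minimal sketch of that idea: the function name `merge_hits`, the tuple layout, and the `max_gap` and `padding` values are hypothetical and are not taken from the pipeline itself.

```python
def merge_hits(hits, max_gap=10_000, padding=1_000):
    """Group scan hits into putative gene-containing regions.

    hits: iterable of (contig, strand, start, end) tuples.
    max_gap and padding are illustrative values, not pipeline settings.
    """
    regions = []
    # Sorting tuples orders hits by contig, then strand, then start.
    for contig, strand, start, end in sorted(hits):
        last = regions[-1] if regions else None
        same_place = last and last[0] == contig and last[1] == strand
        if same_place and start - last[3] <= max_gap:
            # Close enough to the previous hit: extend that region.
            last[3] = max(last[3], end)
        else:
            regions.append([contig, strand, start, end])
    # Pad each region so that weakly matching terminal exons are included.
    return [(c, s, max(0, a - padding), b + padding) for c, s, a, b in regions]


hits = [
    ("2L", "+", 5000, 5300),
    ("2L", "+", 9000, 9400),
    ("2L", "+", 50000, 50200),
    ("2L", "-", 6000, 6200),
]
regions = merge_hits(hits)
```

Only the resulting regions, rather than whole chromosome arms, would be passed on to the slower alignment step.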
7
Sensitivity - Selectivity - Speed
● Genome scan
    strict trade-off between sensitivity and memory/time
● Transcript prediction
    t = O(MN)
    ● N: length of peptide sequence = quite short
    ● M: length of DNA sequence = large
    you want to minimize
    ● the length of the genomic sequence to search
    ● the number of fragments you align
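To make the t = O(MN) argument concrete, the sketch below counts dynamic-programming cells as a stand-in for run time. The peptide and fragment lengths are invented for illustration and are not measurements from the pipeline.

```python
def alignment_cells(peptide_len, dna_len):
    """Number of dynamic-programming cells; run time scales with N * M."""
    return peptide_len * dna_len


def total_cost(peptide_len, fragment_lengths):
    """Total cells when one peptide is aligned against several fragments."""
    return sum(alignment_cells(peptide_len, m) for m in fragment_lengths)


# Hypothetical numbers: a 500-residue peptide against a 20 Mb sequence,
# versus the same peptide against two short candidate fragments.
whole_sequence = total_cost(500, [20_000_000])
targeted = total_cost(500, [15_000, 12_000])
speedup = whole_sequence / targeted
```

Both quantities the slide says to minimize appear here: the length of each fragment and the number of fragments in the list.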
8
Solutions
● ENSEMBL: Minigenes
    cut out putative introns
● My pipeline:
    priority lists
    gene structure conservation
9
Difficulties
● Terminal exons
    short and thus alignment signal is weak
● Spindly genes
    there is no length penalty on introns
10
Concepts
● Predict in three passes
    1) Predict clear-cut cases
    2) Predict dubious cases only if they don't overlap with a previous prediction
    3) Predict alternative transcripts
● Iteratively search for duplications
● Accept a prediction with conserved exon boundaries
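The first two passes can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the `Prediction` record, the `clear_cut` flag, and the choice to test overlap per contig and strand are all hypothetical, and the third pass (alternative transcripts) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    name: str
    contig: str
    strand: str
    start: int       # 0-based, inclusive
    end: int         # exclusive
    clear_cut: bool  # hypothetical flag: True for unambiguous cases


def overlaps(a, b):
    """Assumption: overlap is tested per contig and per strand."""
    return (a.contig == b.contig and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def two_pass_select(candidates):
    # Pass 1: accept every clear-cut case.
    accepted = [p for p in candidates if p.clear_cut]
    # Pass 2: accept a dubious case only if it overlaps nothing accepted so far.
    for p in candidates:
        if not p.clear_cut and not any(overlaps(p, q) for q in accepted):
            accepted.append(p)
    return accepted


candidates = [
    Prediction("a", "2L", "+", 100, 500, True),
    Prediction("b", "2L", "+", 400, 800, False),   # overlaps "a"
    Prediction("c", "2L", "+", 900, 1200, False),
    Prediction("d", "2L", "-", 150, 300, False),   # opposite strand
]
selected = [p.name for p in two_pass_select(candidates)]
```

Testing overlap per strand lets a gene nested on the opposite strand survive the second pass; a stricter pipeline might treat any positional overlap as a conflict.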
11
Conservation of gene structure
[Figure: query and prediction exon structures compared for five cases: conserved, partially conserved, single exon, retrotransposed, unconserved]
(exon boundaries of query/prediction mapped on query protein)
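One way to turn these categories into a decision procedure is to compare exon boundary positions once both gene structures are expressed in query-protein coordinates. The slide names the categories but not the rules, so every rule below, along with the function name and the `tolerance` parameter, is an assumption made for illustration.

```python
def classify_structure(query_bounds, pred_bounds, tolerance=0):
    """Compare exon boundaries given as residue positions on the query protein.

    The decision rules are illustrative assumptions, not the pipeline's.
    """
    if not query_bounds and not pred_bounds:
        return "single exon"
    if query_bounds and not pred_bounds:
        # Query has introns, prediction has none.
        return "retrotransposed"
    if not query_bounds:
        return "unconserved"
    # Count query boundaries that have a prediction boundary close by.
    matched = sum(
        1 for q in query_bounds
        if any(abs(q - p) <= tolerance for p in pred_bounds)
    )
    if matched == len(query_bounds) and len(pred_bounds) == len(query_bounds):
        return "conserved"
    if matched > 0:
        return "partially conserved"
    return "unconserved"
```

For example, `classify_structure([50, 120], [50, 200])` shares one of two boundaries and would be labelled partially conserved under these rules.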
12
Quality control
● Classify predictions into categories
    Full length or fragment
    Gene or pseudogene
    Conserved or not conserved gene structure
● Heuristically remove predictions
    that are redundant
    that are in conflict
    ● nested genes
    ● good predictions take precedence over bad predictions
13
Results ● http://wwwfgu.anat.ox.ac.uk:8080/cgi-bin/gbrowse
14
Number of predicted genes
15
Orthology assignments
Genes in D. melanogaster with orthologs
16
Technical details
● Hardware:
    28 dual-CPU nodes with 2 GB memory
    Sun Grid Engine (SGE)
● Pipeline logic
    gmake
● Tasks
    Python scripts (and Perl scripts)
    Bash/awk scripts
● Database
    Postgres
17
Downstream analysis
● Pairwise orthology assignment
    PhyOP pipeline (Leo Goodstadt (2006))
● Multiple orthology assignment
    My own concoction based on graph clustering with some consistency criteria
● Multiple alignment of cds
    Dialign (<50 sequences)
    Muscle (<500 sequences)
18
Phylogenetic analysis
● 14,000 Gblocks-cleaned multiple alignments
● Calculation of Ka and Ks with PAML
● Phylogenetic trees
    Genome trees
    Gene trees built with Fitch/Kitsch
19
Odds and bits
● Mapping of PDB -> UniProt -> dmel proteins
● Mapping of InterPro domains onto predictions
    not up-to-date
● Codon bias analysis
    ENC, CAI, information theoretic measures
    GC3, GC3_4D
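Of the codon bias measures listed, GC3 is the simplest to write down: the fraction of codons whose third base is G or C. The sketch below also restricts the count to four-fold degenerate codons, on the assumption that this is what GC3_4D denotes; the slide does not define it. For simplicity the code counts every complete codon, including start and stop codons.

```python
# Codon families of the standard genetic code in which any third base gives
# the same amino acid: Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN, Ala GCN,
# Arg CGN, Gly GGN.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def codons(cds):
    """Split a coding sequence into complete codons, dropping a trailing partial one."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def gc3(cds, fourfold_only=False):
    """Fraction of third codon positions that are G or C, or None if undefined."""
    third = [
        c[2] for c in codons(cds)
        if set(c) <= set("ACGT")  # skip codons with ambiguous bases
        and (not fourfold_only or c[:2] in FOURFOLD_PREFIXES)
    ]
    if not third:
        return None
    return sum(base in "GC" for base in third) / len(third)


example = "ATGGCCCTGAAATAA"  # ATG GCC CTG AAA TAA
overall = gc3(example)
fourfold = gc3(example, fourfold_only=True)
```

In the example sequence three of five codons end in G or C, and both four-fold degenerate codons (GCC and CTG) do.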
20
Comparison of measures
[Figure: comparison of codon bias measures: Experimental CAI, Computational CAI, ENC, GC3, Encoding | bias, Encoding | unbiased, Encoding | uniform, Ribosomal CAI]
21
Other groups
● see http://rana.lbl.gov/drosophila/wiki/index.php/Main_Page
● Gene predictions by others
    Don Gilbert: SNAP
    Lior Pachter: GeneMapper (genomic alignments)
    Eisen Lab: TBLastN + Genewise/Exonerate, GeneMapper
    Batzoglou Lab: CONTRAST
    Brent Lab: N-Scan
    Guigo: geneid and SGP2
22
http://insects.eugenes.org/species/news/genome-summaries/genepredictions.html
23
Consensus predictions
● Gbrowser comparison of all gene predictions
    http://rana.lbl.gov/drosophila/gbrowse/cgi-bin/gbrowse
● Mike Eisen's group: GLEAN consensus set
● Don Gilbert: http://insects.eugenes.org/species/
● Other resources
    tRNA predictions
    genome alignments