Gene prediction in flies
● Background
● Gene prediction pipeline
● Resources
Background
Genome quality
Genes in Drosophila melanogaster
● high gene density
● at least 20% with alternative transcripts
● can be nested
    on the same strand
    on different strands
● can be di-cistronic
● can involve trans-splicing
    exons from a different strand
Gene prediction pipeline
● Gene prediction by homology
    no ab initio predictions
    not using genomic alignments
● TBLASTN/Genewise process
    quick genome scan to find putative gene-containing regions
    aligning peptide sequence to genomic fragment using a gene model
        ● CDS
        ● introns
        ● splice sites
Sensitivity – Selectivity – Speed
● Genome scan
    strict trade-off between sensitivity and memory/time
● Transcript prediction
    t = O(MN)
        ● N: length of peptide sequence (quite short)
        ● M: length of DNA sequence (large)
    you want to minimize
        ● the length of the genomic sequence to search
        ● the number of fragments you align
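To make the O(MN) argument concrete, here is a minimal sketch of how the alignment cost scales with the length of the genomic fragment. The function name, the example lengths, and the assumption that cost is simply proportional to M × N dynamic-programming cells are illustrative and not taken from the pipeline.

```python
def alignment_cells(peptide_len, dna_len):
    """Number of dynamic-programming cells for one peptide-to-DNA alignment,
    assuming cost proportional to M * N (constant factors ignored)."""
    return peptide_len * dna_len


# Hypothetical example: a 400-residue peptide aligned against a 20 Mb
# stretch of genome versus a 50 kb fragment selected by the genome scan.
peptide_len = 400
whole_region = alignment_cells(peptide_len, 20_000_000)
fragment = alignment_cells(peptide_len, 50_000)

print(f"whole region: {whole_region:.2e} cells")
print(f"fragment:     {fragment:.2e} cells")
print(f"reduction:    {whole_region // fragment}x")
```

Because N is fixed by the query peptide, the only levers are M (how much genomic sequence each alignment sees) and the number of alignments run, which is why the scan step tries to narrow both.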
Solutions
● ENSEMBL: Minigenes
    cut out putative introns
● My pipeline
    priority lists
    gene structure conservation
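The minigene idea (cutting out putative introns so that the aligner sees a much shorter DNA sequence) could be sketched as follows. This is an illustration only and not ENSEMBL's actual code: the function names, the `flank` padding, and the representation of scan hits as half-open `(start, end)` intervals are all assumptions.

```python
def merge_intervals(hits, flank):
    """Pad each hit by `flank` bases and merge hits that touch or overlap."""
    padded = sorted((max(0, start - flank), end + flank) for start, end in hits)
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Extend the previous block instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_minigene(genome, hits, flank=10):
    """Concatenate padded hit regions, dropping the sequence between them.

    Returns the minigene sequence and the genomic blocks it was built from,
    so that coordinates can be mapped back after alignment.
    """
    blocks = [(s, min(e, len(genome))) for s, e in merge_intervals(hits, flank)]
    minigene = "".join(genome[s:e] for s, e in blocks)
    return minigene, blocks


# Hypothetical usage: two scan hits on a 100-base toy "genome".
toy_genome = "A" * 100
minigene, blocks = build_minigene(toy_genome, [(20, 30), (60, 70)], flank=5)
print(len(minigene), blocks)
```

Keeping the block list alongside the sequence matters because predictions made on the minigene have to be translated back to genomic coordinates.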
Difficulties
● Terminal exons
    short, and thus the alignment signal is weak
● Spindly genes
    there is no length penalty on introns
Concepts
● Predict in three passes
    1) Predict clear-cut cases
    2) Predict dubious cases only if they don't overlap with a previous prediction
    3) Predict alternative transcripts
● Iteratively search for duplications
● Accept a prediction with conserved exon boundaries
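The first two passes can be sketched as a simple priority scheme. This is a minimal illustration under stated assumptions: the `Prediction` record, the `clear_cut` flag, and the same-strand overlap test are hypothetical, and the third pass (alternative transcripts) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    query: str
    contig: str
    strand: str
    start: int       # 0-based, inclusive
    end: int         # exclusive
    clear_cut: bool  # hypothetical flag: passes strict quality thresholds


def overlaps(a, b):
    """True if two predictions share bases on the same contig and strand."""
    return (a.contig == b.contig and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def select_in_passes(candidates):
    # Pass 1: accept every clear-cut case.
    accepted = [p for p in candidates if p.clear_cut]
    # Pass 2: accept a dubious case only if it does not overlap
    # anything accepted so far.
    for p in candidates:
        if not p.clear_cut and not any(overlaps(p, a) for a in accepted):
            accepted.append(p)
    return accepted


# Hypothetical usage with three candidates.
a = Prediction("q1", "chr2L", "+", 100, 500, True)
b = Prediction("q2", "chr2L", "+", 400, 900, False)  # dubious, overlaps a
c = Prediction("q3", "chr2L", "-", 400, 900, False)  # dubious, other strand
print([p.query for p in select_in_passes([b, a, c])])
```

Note that a plain overlap test like this one would also reject genes nested on the same strand, so a real implementation would need a more careful rule.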
Conservation of gene structure
[Figure: Query/Prediction pairs illustrating five categories: Conserved, Partially conserved, Single exon, Retrotransposed, Unconserved (exon boundaries of query/prediction mapped on query protein)]
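One way the five categories could be assigned is by comparing exon boundary positions once both gene structures are mapped onto the query protein. The definitions below are an assumed reading of the category names, not definitions given in the slides, and the function name and input format (lists of residue indices at which an exon boundary falls) are hypothetical.

```python
def classify_structure(query_bounds, pred_bounds):
    """Compare exon boundaries of query and prediction.

    Both arguments are collections of positions on the query protein
    at which an exon boundary occurs; an empty collection means a
    single-exon gene.
    """
    q, p = set(query_bounds), set(pred_bounds)
    if not q and not p:
        return "single exon"          # neither gene has introns
    if q and not p:
        return "retrotransposed"      # query has introns, prediction has none
    shared = q & p
    if shared == q and shared == p:
        return "conserved"            # every boundary matches
    if shared:
        return "partially conserved"  # some boundaries match
    return "unconserved"              # no boundary matches


# Hypothetical usage.
print(classify_structure([50, 120], [50, 120]))
print(classify_structure([50, 120], [50]))
print(classify_structure([50, 120], []))
```

This sketch requires exact position matches; in practice some tolerance for small alignment shifts around a boundary would probably be needed.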
Quality control
● Classify predictions into categories
    Full length or fragment
    Gene or pseudogene
    Conserved or not conserved gene structure
● Heuristically remove predictions
    that are redundant
    that are in conflict
        ● nested genes
        ● good predictions take precedence over bad predictions
Results
Number of predicted genes
Orthology assignments
Genes in D. melanogaster with orthologs
Technical details
● Hardware
    28 dual-CPU nodes with 2 GB memory
    Sun Grid Engine (SGE)
● Pipeline logic
    gmake
● Tasks
    Python scripts (and Perl scripts)
    Bash/awk scripts
● Database
    Postgres
Downstream analysis
● Pairwise orthology assignment
    PhyOP pipeline (Leo Goodstadt, 2006)
● Multiple orthology assignment
    My own concoction based on graph clustering with some consistency criteria
● Multiple alignment of CDS
    Dialign (<50 sequences)
    Muscle (<500 sequences)
Phylogenetic analysis
● 14,000 Gblocks-cleaned multiple alignments
● Calculation of Ka and Ks with PAML
● Phylogenetic trees
    Genome trees
    Gene trees built with Fitch/Kitsch
Odds and bits
● Mapping of PDB -> UniProt -> dmel proteins
● Mapping of InterPro domains onto predictions
    not up-to-date
● Codon bias analysis
    ENC, CAI, information-theoretic measures
    GC3, GC3_4D
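Of the codon bias measures listed, GC3 is the simplest to show: the fraction of G or C at third codon positions. The sketch below is a minimal illustration, not the script used for the analysis. The function name is hypothetical, and it assumes an in-frame coding sequence and counts every codon, including any stop codon present.

```python
def gc3(cds):
    """Fraction of G/C at third codon positions of an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # ignore a trailing partial codon
    third = [cds[i] for i in range(2, usable, 3)]
    if not third:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in third) / len(third)


# Hypothetical usage: codons ATG GCA AAA TTT have third bases G, A, A, T.
print(gc3("ATGGCAAAATTT"))  # 0.25
```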
Comparison of measures
[Figure: comparison of codon bias measures; labels: Experimental CAI, Computational CAI, ENC, GC3, Encoding | bias, Encoding | unbiased, Encoding | uniform, Ribosomal CAI]
Other groups
● see summaries/genepredictions.html
● Gene predictions by others
    Don Gilbert: SNAP
    Lior Pachter: GeneMapper (genomic alignments)
    Eisen Lab: TBLastN + Genewise/Exonerate, GeneMapper
    Batzoglou Lab: CONTRAST
    Brent Lab: N-Scan
    Guigo: geneid and SGP2
Consensus predictions
● GBrowse comparison of all gene predictions
● Mike Eisen's group: GLEAN consensus set
● Don Gilbert:
● Other resources
    tRNA predictions
    genome alignments