Gene prediction in flies ● Background ● Gene prediction pipeline ● Resources.


Background

Genome quality

Genes in Drosophila melanogaster
● high gene density
● at least 20% have alternative transcripts
● can be nested
  - on the same strand
  - on different strands
● can be di-cistronic
● can involve trans-splicing
  - exons from a different strand

Gene prediction pipeline
● Gene prediction by homology
  - no ab-initio predictions
  - not using genomic alignments
● TBLASTN/Genewise process
  - quick genome scan to find putative gene-containing regions
  - aligning peptide sequence to genomic fragment using a gene model
    - cds
    - introns
    - splice-sites

Sensitivity - Selectivity - Speed
● Genome scan
  - strict trade-off between
    - sensitivity versus memory/time
● Transcript prediction
  - t = O(MN)
    - N: length of peptide sequence = quite short
    - M: length of DNA sequence = large
  - you want to minimize
    - the length of the genomic sequence to search
    - the number of fragments you align
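
The t = O(MN) point can be made concrete with a small cost estimate. This is only a sketch: the function name and the example lengths are hypothetical, and it counts dynamic-programming cells as a stand-in for run time rather than measuring Genewise itself.

```python
from typing import Iterable


def alignment_cells(peptide_len: int, fragment_lens: Iterable[int]) -> int:
    """Count DP cells (N * M per fragment) for aligning one peptide
    of length N against each genomic fragment of length M."""
    return sum(peptide_len * m for m in fragment_lens)


# Hypothetical numbers: a 400-residue peptide against a whole 1 Mb scaffold,
# versus two short candidate regions picked out by the genome scan.
whole = alignment_cells(400, [1_000_000])
pruned = alignment_cells(400, [20_000, 15_000])
print(f"{whole / pruned:.1f}x fewer cells after pruning")
```

With these made-up lengths the pruned search needs roughly 29 times fewer cells, which is why both the fragment length and the fragment count are worth minimizing.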

Solutions
● ENSEMBL: Minigenes
  - cut out putative introns
● My pipeline:
  - priority lists
  - gene structure conservation
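
The minigene idea (cutting out putative introns before alignment) might look roughly like the following. This is an illustrative sketch, not ENSEMBL's code: the function names, the padding parameter, and the half-open interval convention are all assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the genomic sequence


def merge_padded(hits: List[Interval], pad: int, seq_len: int) -> List[Interval]:
    """Pad each hit region and merge regions that touch or overlap."""
    padded = sorted((max(0, s - pad), min(seq_len, e + pad)) for s, e in hits)
    merged: List[Interval] = []
    for s, e in padded:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def build_minigene(genome: str, hits: List[Interval], pad: int):
    """Concatenate the padded hit regions, dropping everything in between."""
    blocks = merge_padded(hits, pad, len(genome))
    return "".join(genome[s:e] for s, e in blocks), blocks


def to_genomic(pos: int, blocks: List[Interval]) -> int:
    """Map a minigene coordinate back onto the original genomic sequence."""
    offset = 0
    for s, e in blocks:
        if pos < offset + (e - s):
            return s + (pos - offset)
        offset += e - s
    raise ValueError("position outside minigene")
```

The block list is kept alongside the minigene so that exon coordinates found on the shortened sequence can be projected back to the genome.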

Difficulties
● Terminal exons
  - short and thus alignment signal is weak
● Spindly genes
  - there is no length penalty on introns

Concepts
● Predict in three passes
  1) Predict clear-cut cases
  2) Predict dubious cases
     - only if they don't overlap with a previous prediction
  3) Predict alternative transcripts
● Iteratively search for duplications
● Accept a prediction with conserved exon boundaries
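
The three passes can be sketched as below. The slides do not say how clear-cut and dubious cases are separated or how alternative transcripts are recognised, so the score cutoff, the `primary` flag, and the same-query-gene rule in pass 3 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    name: str
    contig: str
    start: int      # half-open interval [start, end)
    end: int
    score: float    # hypothetical alignment quality, higher is better
    gene: str       # query gene the candidate was predicted from
    primary: bool   # True if predicted from the gene's main transcript


def overlaps(a: Candidate, b: Candidate) -> bool:
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def predict_in_passes(cands: List[Candidate], cutoff: float = 0.8) -> List[Tuple[int, str]]:
    accepted: List[Candidate] = []
    result: List[Tuple[int, str]] = []
    ranked = sorted(cands, key=lambda c: c.score, reverse=True)
    # Pass 1: clear-cut cases are accepted outright.
    for c in ranked:
        if c.primary and c.score >= cutoff:
            accepted.append(c)
            result.append((1, c.name))
    # Pass 2: dubious cases, only if they do not overlap a previous prediction.
    for c in ranked:
        if c.primary and c.score < cutoff and not any(overlaps(c, a) for a in accepted):
            accepted.append(c)
            result.append((2, c.name))
    # Pass 3: alternative transcripts must sit on an accepted prediction of the same gene.
    for c in ranked:
        if not c.primary and any(overlaps(c, a) and a.gene == c.gene for a in accepted):
            result.append((3, c.name))
    return result
```

Processing candidates best-first within each pass means that, among overlapping dubious cases, the better-scoring one claims the locus.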

Conservation of gene structure
[Figure: query/prediction pairs illustrating five categories: conserved, partially conserved, single exon, retrotransposed, unconserved. Exon boundaries of query and prediction are mapped onto the query protein.]
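
One way to assign the five categories is to compare the exon boundaries of query and prediction once both are expressed as positions on the query protein. The exact rules are not given on the slide, so the definitions below (for example, calling an intronless prediction of a multi-exon query "retrotransposed") and the `tolerance` parameter are assumptions.

```python
from typing import Iterable


def classify_structure(query_bounds: Iterable[int],
                       pred_bounds: Iterable[int],
                       tolerance: int = 0) -> str:
    """Classify gene structure conservation from exon boundaries
    given as residue positions on the query protein."""
    q, p = sorted(query_bounds), sorted(pred_bounds)
    if not q and not p:
        return "single exon"        # neither gene has an intron
    if q and not p:
        return "retrotransposed"    # multi-exon query, intronless prediction
    # Count boundaries on each side that have a partner within the tolerance.
    matched_q = sum(any(abs(b - x) <= tolerance for x in p) for b in q)
    matched_p = sum(any(abs(b - x) <= tolerance for x in q) for b in p)
    if matched_q == len(q) and matched_p == len(p):
        return "conserved"
    if matched_q > 0:
        return "partially conserved"
    return "unconserved"
```

A non-zero tolerance would let boundaries that shifted by a few residues in the alignment still count as matching.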

Quality control
● Classify predictions into categories
  - Full length or fragment
  - Gene or pseudogene
  - Conserved or not conserved gene structure
● Heuristically remove predictions
  - that are redundant
  - that are in conflict
    - nested genes
    - good predictions take precedence over bad predictions

Results

Number of predicted genes

Orthology assignments
Genes in D. melanogaster with orthologs

Technical details
● Hardware:
  - 28 dual-CPU nodes with 2 GB memory
  - Sun Grid Engine (SGE)
● Pipeline logic
  - gmake
● Tasks
  - Python scripts (and Perl scripts)
  - Bash/awk scripts
● Database
  - Postgres

Downstream analysis
● Pairwise orthology assignment
  - PhyOP pipeline (Leo Goodstadt, 2006)
● Multiple orthology assignment
  - My own concoction based on graph clustering with some consistency criteria
● Multiple alignment of cds
  - Dialign (<50 sequences)
  - Muscle (<500 sequences)

Phylogenetic analysis
● 14,000 GBlocks-cleaned multiple alignments
● Calculation of Ka and Ks with PAML
● Phylogenetic trees
  - Genome trees
  - Gene trees
  - built with Fitch/Kitsch

Odds and bits
● Mapping of PDB -> UniProt -> dmel proteins
● Mapping of InterPro domains onto predictions
  - not up-to-date
● Codon bias analysis
  - ENC, CAI, information theoretic measures
  - GC3, GC3_4D
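
Of the codon bias measures listed, GC3 is the simplest to show: the fraction of G or C at third codon positions. The sketch below also reads GC3_4D as GC3 restricted to four-fold degenerate codons of the standard genetic code, which is an assumption about what the slide means; the function names are illustrative.

```python
# First two bases of the eight four-fold degenerate codon boxes (standard code):
# Leu (CT), Val (GT), Ser (TC), Pro (CC), Thr (AC), Ala (GC), Arg (CG), Gly (GG).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def _gc_fraction(bases: str) -> float:
    """Fraction of G/C among unambiguous bases; NaN if there are none."""
    valid = [b for b in bases if b in "ACGT"]
    if not valid:
        return float("nan")
    return sum(b in "GC" for b in valid) / len(valid)


def gc3(cds: str) -> float:
    """GC content at third codon positions of an in-frame coding sequence."""
    return _gc_fraction(cds.upper()[2::3])


def gc3_4d(cds: str) -> float:
    """GC content at third positions of four-fold degenerate codons only."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return _gc_fraction("".join(c[2] for c in codons if c[:2] in FOURFOLD_PREFIXES))


print(gc3("ATGGCCAAATAA"), gc3_4d("ATGGCCAAATAA"))
```

Restricting to four-fold degenerate sites removes positions where the third base is constrained by the amino acid, so the two measures can differ noticeably on the same sequence.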

Comparison of measures
[Figure: comparison of codon bias measures. Labels: Experimental CAI, Computational CAI, Ribosomal CAI, ENC, GC3, Encoding | bias, Encoding | unbiased, Encoding | uniform.]

Other groups
● see
● Gene predictions by others
  - Don Gilbert: SNAP
  - Lior Pachter: GeneMapper (genomic alignments)
  - Eisen Lab: TBLastN + Genewise/Exonerate, GeneMapper
  - Batzoglou Lab: CONTRAST
  - Brent Lab: N-Scan
  - Guigo: geneid and SGP2

summaries/genepredictions.html

Consensus predictions
● Gbrowser comparison of all gene predictions
● Mike Eisen's group: GLEAN consensus set
● Don Gilbert:
● Other resources
  - tRNA predictions
  - genome alignments