Gene Prediction (cont’d)

Presentation transcript:

Gene Prediction (cont’d) For CISC 889 (Bioinformatics) Ozcan KOC March 26, 2002

Outline
- Evaluation of gene prediction algorithms
- Promoter prediction in prokaryotes
- Scoring matrices
- Neural networks
- Prediction of less conserved regions
- Promoter prediction in eukaryotes
- Otto (Celera)
- Summary

Evaluation of Gene Prediction Methods (1)
What to consider when comparing methods:
- Type of analysis (neural network, linear discriminant, etc.)
- Number and types of sequences used for training and testing
- Parameter settings, which also affect the predictions
An ideal method should use:
- A known set of gene structures for training
- A different set for testing
Evaluation is more stringent when the test set includes a gene and its neighboring sequence, rather than only the sequence between the first and the last exons.

Evaluation of Gene Prediction Methods (2)
- Number of actual positives: AP = TP + FN
- Number of actual negatives: AN = FP + TN
- Predicted number of positives: PP = TP + FP
- Predicted number of negatives: PN = TN + FN
- Sensitivity: SN = TP/AP = TP/(TP + FN)
- Specificity: SP = TP/PP = TP/(TP + FP)
- Correlation coefficient: takes values in [-1, 1]

              GeneParser  GenID    Grail    Otto (RefSeq)  Otto (homology)
Sensitivity   .68-.75     .65-.67  .48-.65  .94            .60
Specificity   .68-.78     .74-.78  .86-.87  .97            .88
Corr. coef.   .66-.69     .66-.67  .61-.72
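The measures on this slide can be computed directly from the four counts. The sketch below is illustrative only: the slide gives just the range of the correlation coefficient, so the formula used here, (TP·TN − FP·FN) / sqrt(AP·AN·PP·PN), is the standard correlation coefficient for binary predictions and is an assumption about what the slide intends. The counts in the example are made up.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient) from raw counts."""
    ap = tp + fn          # actual positives
    an = fp + tn          # actual negatives
    pp = tp + fp          # predicted positives
    pn = tn + fn          # predicted negatives
    sn = tp / ap          # sensitivity, as defined on the slide
    sp = tp / pp          # specificity, as defined on the slide
    denom = math.sqrt(ap * an * pp * pn)
    # Assumed formula for the correlation coefficient (not spelled out on the slide).
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc

# Made-up counts, for illustration only.
sn, sp, cc = evaluate(tp=80, fp=20, tn=880, fn=20)
print(f"SN={sn:.2f} SP={sp:.2f} CC={cc:.2f}")
```

Note that "specificity" here is TP/(TP+FP), following the slide, which is the quantity often called precision elsewhere.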

Evaluation of Gene Prediction Methods (3)
In a later study (Zhang ‘97):

              Grail II  FGENEH  MZEF
Sensitivity   .79       .93     .95
Corr. coef.   .83       .85     .89

Specificity: .92 (the only specificity value present in the transcript)

- Programs including protein sequence DB searches (GeneID+, GeneParser3) achieved substantially greater accuracy (Burset ’96)
- Gene prediction programs reliably locate genomic regions, but provide only an approximation of gene structure

Exons Predicted in an Arabidopsis Genomic Sequence
Note: the Arabidopsis UVH1 gene (with approx. 250 bp upstream from the first exon and 200 bp downstream from the last exon) was used. These results are NOT to be taken as a measure of the reliability of these programs.
Columns: cDNA, NetGene, GeneMark, FgeneP, GeneScan, Mzeff. Each row below starts with the cDNA exon, followed by the predictions that survive in the transcript:
345-1210: x-1210, 276-1210
1290-1513: x-1513, 1242-1513
1611-1696: x-x
1880-2029: 1880-2034
2143-2880: x-2880, x-2880
3143-3253: x-3253
3339-3599
3698-3921
4010-4217: 4010-x, 4010-4220*, 4010-4220*
x: not predicted
*: includes the termination codon

Promoter Prediction in E. coli
- Align a set of promoter sequences by the position that marks the known TSS (transcription start site)
- Search for conserved regions
E. coli promoters have 3 conserved sequence features:
- A 6-bp region with consensus TATAAT (at position –10)
- A 6-bp region with consensus TTGACA (at position –35)
- A distance of about 17 bp between them
A weaker conserved region exists around +1, and an AT-rich region exists upstream of the –35 region.

Promoter Prediction in E. coli (2)
- FINDPATTERNS and PatScan can be used to search for matches to a consensus sequence.
- Sequence positions in aligned regions vary to some extent, but some regions are less variable than others.
- Alternative: use the search features of FINDPATTERNS/PatScan that allow repeats, gaps, inverted repeats, etc. E.g. GAT(TG,T,G){1,4} (for FINDPATTERNS)
- Advantage: useful for locating complex regulatory patterns
- Disadvantage: takes no account of how often each residue occurs at each pattern position
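A pattern search of this kind can be imitated with an ordinary regular expression. The sketch below does not call FINDPATTERNS or PatScan. It assumes the example pattern means "GAT followed by one to four repeats of TG, T or G" and translates that reading into Python's `re` syntax. The test sequence is made up.

```python
import re

# Assumed reading of the slide's pattern GAT(TG,T,G){1,4} as a Python regex.
PATTERN = re.compile(r"GAT(?:TG|T|G){1,4}")

def find_matches(seq):
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(seq.upper())]

hits = find_matches("ccGATTGTGaaGATGcc")  # made-up sequence
print(hits)
```

As the slide notes, a match here is all-or-nothing: every string that fits the pattern counts equally, whatever the base frequencies at each position.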

Promoter Prediction in E. coli (3)
Use a scoring matrix. Example: how to prepare the matrix for the –10 region of E. coli promoters:
- N sequences are aligned by their –10 regions
- The count of each base in each column is made and converted to a frequency
- Frequencies are converted into log odds scores
Alternative formula (Hertz and Stormo ‘99):
$w_{i,j} = \ln\left[\frac{n_{i,j} + P_i}{(N+1)\,P_i}\right] \approx \ln\left(\frac{f_{i,j}}{P_i}\right)$
where n_{i,j} is the count of base i in column j, f_{i,j} = n_{i,j}/N is the corresponding frequency, P_i is the background frequency of base i, and N is the total number of sequences.
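The matrix-building steps can be sketched as follows, using the count-plus-background form of the weight, (n + P) / ((N + 1) P). Two things here are assumptions rather than part of the slide: the four aligned sites are invented, and base-2 logarithms are used so that the scores come out in bits, as on the later slides.

```python
import math

BASES = "ACGT"

def log_odds_matrix(sites, background=None):
    """Build one {base: score} dict per column from equal-length aligned sites."""
    if background is None:
        background = {b: 0.25 for b in BASES}   # uniform background (assumption)
    n_seqs = len(sites)
    matrix = []
    for column in zip(*sites):                  # iterate over alignment columns
        row = {}
        for b in BASES:
            n_ij = column.count(b)              # count of base b in this column
            p_i = background[b]
            row[b] = math.log2((n_ij + p_i) / ((n_seqs + 1) * p_i))
        matrix.append(row)
    return matrix

sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]  # made-up aligned -10 sites
m = log_odds_matrix(sites)
for pos, row in enumerate(m, start=1):
    print(pos, {b: round(s, 2) for b, s in row.items()})
```

Adding the background frequency to each count keeps a base that never occurs in a column from producing a score of minus infinity.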

A Scoring Matrix for E. coli Promoters (–10 position)
Fraction of each base at each column of the aligned promoters in the –10 region:

Position  A     C     G     T
1         0.02  0.09  0.10  0.79
2         0.94, 0.01, 0.03 (row incomplete)
3..6      ...

Log odds score = log2(frequency observed / frequency expected), where the expected frequency is the background frequency; e.g. log2(0.79/0.25) for T at position 1.

Position  A      C      G      T
1         -3.80  -1.49  -1.34   1.67
2          1.92  -3.81  -4.81  -3.22
3         -0.06  -0.81  -0.66   0.81
4          1.24  -1.00  -0.72  -0.89
5          1.02, -0.35, -0.56 (row incomplete)
6          1.95 (row incomplete)

Locating –10 Promoter Sites in E. coli (1)
[Figure: the –10 log odds matrix applied to a 6-bp window of the sequence]
Log odds score = -1.49 - 4.81 + 0.81 + 1.24 - 0.56 - 4.81 = -9.62 bits
Odds = 2^(-9.62) ≈ 1/786

Locating –10 Promoter Sites in E. coli (2)
[Figure: the –10 log odds matrix applied to the next 6-bp window]
Log odds score = -1.34 - 3.22 - 0.06 - 0.89 + 1.02 - 4.81 = -9.30 bits
Odds = 2^(-9.30) ≈ 1/630

Locating –10 Promoter Sites in E. coli (3)
[Figure: the –10 log odds matrix applied to a window matching the consensus]
Log odds score = 1.67 + 1.92 + 0.81 + 1.24 + 1.02 + 1.95 = 8.61 bits
Odds = 2^(8.61) ≈ 391/1
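The arithmetic on these three slides can be checked with a few lines of code. The per-position scores below are copied from the slides. The only step added is the conversion from a log odds score in bits to an odds ratio, odds = 2^score.

```python
def window_score(position_scores):
    """Sum per-position log odds scores (bits) and convert the total to odds."""
    total_bits = sum(position_scores)
    return total_bits, 2 ** total_bits

examples = {
    "window 1": [-1.49, -4.81, 0.81, 1.24, -0.56, -4.81],
    "window 2": [-1.34, -3.22, -0.06, -0.89, 1.02, -4.81],
    "window 3": [1.67, 1.92, 0.81, 1.24, 1.02, 1.95],
}
results = {name: window_score(scores) for name, scores in examples.items()}
for name, (bits, odds) in results.items():
    print(f"{name}: {bits:.2f} bits, odds = {odds:.4g}")
```

Because the scores are logarithms, adding them across positions is the same as multiplying the per-position odds.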

Locating –10 Promoter Sites in E. coli (4)
- Scoring matrices are applied for the –35 (35 bp), –10 (10 bp) and +1 (12 bp) regions on both strands
- Each matrix provides a distribution of odds scores
- Matches are examined for the spacing characteristic of promoters
Result: the log odds score represents an overall likelihood that the regions match the characteristics of E. coli promoters with correct spacing.
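Scanning both strands with a scoring matrix can be sketched as below. This is a toy version under stated assumptions: the matrix is a made-up one (+2 for the consensus base, -2 otherwise) rather than the matrix from the slides, the sequence is invented, and the spacing check between the –35 and –10 hits is left out.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def toy_matrix(consensus, match=2.0, mismatch=-2.0):
    """Hypothetical scoring matrix: one {base: score} dict per position."""
    return [{b: (match if b == c else mismatch) for b in "ACGT"} for c in consensus]

def scan(seq, matrix):
    """Yield (strand, start, score) for every window on both strands.

    For the '-' strand, start is an index into the reverse-complemented sequence.
    """
    width = len(matrix)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            yield strand, i, sum(col[b] for col, b in zip(matrix, window))

def best_hit(seq, matrix):
    return max(scan(seq, matrix), key=lambda hit: hit[2])

hit = best_hit("GGCCATTATAGGCC", toy_matrix("TATAAT"))
print(hit)
```

In this example the only perfect match lies on the reverse strand, which is why both strands have to be scanned.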

Problems with Matrix Method
- Scores for each sequence position are added. Assumption: the contributions of matching positions with separate functions are additive. In reality they are not: one position in the –10 region may play a role in one stage of transcription (i.e. promoter recognition), and another position in elongation of mRNA, etc.
- Promoters are treated as being in the same class. In reality, different RNA polymerases may have a preference for different regions within the promoter region.
- The promoter sequence is treated as a zero-order Markov chain (i.e. each position is independent of the others). This assumption is usually true, but there may be cases where a correlation between sequence positions exists that is not just due to chance.

Neural Networks for E. coli Promoters
- Use a neural network trained to distinguish E. coli promoter sequences from non-promoter sequences (Pedersen et al. ‘96)
- Horton and Kanehisa ’92 used a neural network lacking a hidden layer (a perceptron)
- Scan the sequence to be analyzed using a sliding window
- Sequence characters are given a simple encoding scheme to avoid any bias (e.g. A is 1000, G is 0100, etc.)
- Perceptron: no more effective than the matrix method
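The sliding window and the unbiased encoding can be sketched as follows. The codes for A and G are the ones given on this slide. The codes for C and T are assumptions chosen to complete the scheme, and the example sequence and window width are made up.

```python
# A and G as on the slide; C and T are assumed to complete the one-hot scheme.
ENCODING = {
    "A": (1, 0, 0, 0),
    "G": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def encode(window):
    """Concatenate the 4-bit code of every base in the window."""
    bits = []
    for base in window:
        bits.extend(ENCODING[base])
    return bits

def sliding_windows(seq, width):
    """Return every window of the given width, moving one base at a time."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]

windows = sliding_windows("TATAATGC", 6)      # made-up sequence
vectors = [encode(w) for w in windows]        # one input vector per window
print(windows)
print(vectors[0])
```

Giving each base its own bit, rather than numbering the bases 1 to 4, avoids implying that some bases are numerically closer to each other than others.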

Perceptron Model for Locating E. coli Promoters
[Figure: the perceptron. Encoded input bases (e.g. T [0100], A [1000]) are multiplied by the weights w1 to w6 and summed. An output of approximately 1 indicates function; 0 indicates no function.]
Scoring matrix equivalent:
Position  1     2     3     4     5     6
Weight    0.19  0.22  0.09  0.14  0.12  0.24
SUM = 0.19 + 0.22 + 0.09 + 0.14 + 0.12 + 0.24 = 1.00, which indicates function
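The equivalence between a perceptron without a hidden layer and a scoring matrix can be made concrete with a small sketch. The six position weights are taken from the slide. Everything else is an assumption made for illustration: that the weight at each position belongs to the consensus base of TATAAT, that all other bases have weight zero, the A, C, G, T bit order, and the 0.5 decision threshold.

```python
BASES = "ACGT"                                          # assumed bit order
POSITION_WEIGHTS = [0.19, 0.22, 0.09, 0.14, 0.12, 0.24]  # from the slide
FAVOURED = "TATAAT"                                     # assumed weighted base per position

def one_hot(window):
    """Encode each base as 4 bits, in BASES order."""
    return [1 if base == b else 0 for base in window for b in BASES]

# One weight per (position, base) input; zero for every non-favoured base.
WEIGHTS = [w if b == fav else 0.0
           for w, fav in zip(POSITION_WEIGHTS, FAVOURED) for b in BASES]

def perceptron(window, threshold=0.5):
    """Return (weighted sum of inputs, whether it reaches the assumed threshold)."""
    activation = sum(x * w for x, w in zip(one_hot(window), WEIGHTS))
    return activation, activation >= threshold

print(perceptron("TATAAT"))
print(perceptron("GGGGGG"))
```

The weighted sum is exactly what a scoring matrix computes when its per-position scores are added, which is consistent with the slide's remark that the perceptron does no better than the matrix method.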

Finding Less-conserved Binding Sites (1)
- In E. coli the sequences could be aligned by the TSS, –10 and –35 regions.
- In many cases, it is not possible to find a conserved binding site by aligning the sequences.
- The problem is similar to finding patterns common to a set of protein sequences that cannot be aligned, but more difficult: proteins have 20 amino acids, whereas DNA has only 4 nucleotides, so it is harder to distinguish a pattern from noise.
Methods:
- Expectation maximization: guess an initial scoring matrix of estimated length. Scan each sequence, calculate the probability of a match at each position, update the scoring matrix (each sequence position weighted by its probability), then repeat until there is no change.
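The expectation maximization loop can be sketched in a few dozen lines. This is a simplified illustration, not the algorithm of any particular program: it assumes exactly one site per sequence, a uniform background, a fixed pseudocount, and an initial matrix seeded from a guessed site. The sequences (each with TTGACA planted) and the slightly wrong initial guess TTGTCA are made up.

```python
BASES = "ACGT"

def em_motif(seqs, init_site, iterations=50, pseudo=0.1, bg=0.25):
    """Simplified EM for one motif occurrence per sequence.

    Returns (frequency matrix, most probable start position in each sequence).
    """
    width = len(init_site)
    # Initial guess: a frequency matrix favouring the bases of init_site.
    matrix = [{b: ((1 + pseudo) if b == c else pseudo) / (1 + 4 * pseudo)
               for b in BASES} for c in init_site]
    best = []
    for _ in range(iterations):
        counts = [{b: pseudo for b in BASES} for _ in range(width)]
        best = []
        for s in seqs:
            # E-step: probability that the site starts at each position.
            ratios = []
            for i in range(len(s) - width + 1):
                r = 1.0
                for j in range(width):
                    r *= matrix[j][s[i + j]] / bg
                ratios.append(r)
            total = sum(ratios)
            probs = [r / total for r in ratios]
            best.append(max(range(len(probs)), key=probs.__getitem__))
            # M-step (accumulate): add each window's bases, weighted by probability.
            for i, p in enumerate(probs):
                for j in range(width):
                    counts[j][s[i + j]] += p
        new = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
        change = max(abs(new[j][b] - matrix[j][b]) for j in range(width) for b in BASES)
        matrix = new
        if change < 1e-6:        # repeat until (almost) no change
            break
    return matrix, best

seqs = ["GGCTTGACAGCG", "ACGTTGACATGC", "CCTTGACAGGAC", "GATCTTGACACC"]
matrix, positions = em_motif(seqs, init_site="TTGTCA")
consensus = "".join(max(col, key=col.get) for col in matrix)
print(consensus, positions)
```

On real data the result depends strongly on the initial guess, so such a search is normally restarted from several starting matrices.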

Finding Less-conserved Binding Sites (2)
Methods (cont’d):
- Hidden Markov models
- Statistical method of finding patterns: a dinucleotide analysis is performed to reduce background noise, and a Gibbs sampling method that considers inverted repeats (e.g. for lexA) is applied
- Hertz, Stormo and Hartzell method
Example: how the algorithm compares a fixed window of sequence (4 bases) in a set of sequences
Objective: find the 4-mer in each sequence that is most nearly shared by ALL the sequences

Hertz, Stormo and Hartzell Method (for DNA-binding Sites)
[Figure: 4-mers taken from Sequence 1, Sequence 2 and Sequence 3 are combined step by step into A/C/T/G count matrices, each labelled with its information content. Sequence letters as extracted: A C T G A T A G C G C T T G C.]
- Seq1 alone: 8 bits
- Seq1 + Seq2: 4 bits or 6 bits, depending on the 4-mer chosen
- Seq1 + Seq3: 6 bits or 4 bits, depending on the 4-mer chosen
- Seq1 + Seq2 + Seq3: 4.6 bits or 3.0 bits
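The bit values in the figure are consistent with the information content of an alignment matrix, summed over columns as f · log2(f / p). The sketch below computes that quantity under a uniform background (p = 0.25), which is an assumption. The 4-mers are made up and were chosen only so that the results reproduce the figure's 8, 6, 4 and 4.6 bit values; they are not the sequences from the slide.

```python
import math
from collections import Counter

def information_content(kmers, background=0.25):
    """Sum over columns and bases of f * log2(f / background), in bits."""
    n = len(kmers)
    total = 0.0
    for column in zip(*kmers):                 # one alignment column at a time
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f / background)
    return total

one = information_content(["ACTG"])                        # single 4-mer
two_half = information_content(["ACTG", "ACGT"])           # 2 columns agree
two_none = information_content(["ACTG", "TGAC"])           # no column agrees
three = information_content(["ACTG", "ACTA", "AGCT"])      # mixed agreement
print(one, two_half, two_none, round(three, 1))
```

The method keeps the combinations of k-mers with the highest information content as more sequences are added, which is why the figure shows competing matrices at each step.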

Promoter Prediction in Eukaryotes (1)
Transcriptional regulation in eukaryotes (transcription here means transcription of protein-encoding genes by RNA polymerase II):
- Transcription involves the interaction of TFs (transcription factors, which are protein complexes) with each other and with DNA-binding sites in the promoter region
- The degree of expression of a gene is influenced by the region upstream from the transcription start point and by the region downstream
- A TATA box is present in most eukaryotes (75% in vertebrates)
- A TATA box HMM trained for vertebrates has the consensus sequence TATAWDR starting at –17 bp from the TSS (W: A/T, D: not C, R: G/A)
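The ambiguity codes in the consensus TATAWDR can be expanded into a regular expression for a simple exact-match search. This is only a consensus match, not the HMM mentioned on the slide, and the test sequence is made up.

```python
import re

# IUPAC codes used on the slide: W = A/T, D = not C, R = G/A.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "D": "[AGT]", "R": "[AG]",
}

def consensus_to_regex(consensus):
    """Translate a consensus with IUPAC ambiguity codes into a compiled regex."""
    return re.compile("".join(IUPAC[c] for c in consensus))

tata = consensus_to_regex("TATAWDR")
hits = [m.start() for m in tata.finditer("GGCTATAAAAGGC")]  # made-up sequence
print(tata.pattern, hits)
```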

Promoter Prediction in Eukaryotes (2)
- The INR (initiator) also influences the start position of transcription: it is a loosely defined sequence around the TSS that may be recognized by other protein subunits of TFIID (a TF that recognizes and binds to the promoter DNA)
- CCAAT and GC boxes have also been found around the TSS (at variable distances)
- Many different TFs may be involved in the regulation of a particular eukaryotic gene
- DNA-binding sites for many of these TFs are unknown, which limits promoter prediction

Promoter Prediction in Eukaryotes (3)
- Gene expression is also influenced by the region upstream of the core promoter and by other enhancer sites.
- Eukaryotic sequences show variation not only between species but also among genes within a species. Hence, a set of promoters in an organism that share a common regulatory response is analyzed.
- The programs predicted 13-54% of the TSSs correctly, but each program also predicted a number of false-positive TSSs.

Prediction Methods for RNA PolII Promoters (1)
- Neural network trained on TATA and Inr sites, allowing a variable spacing between the sites.
- NN-GA (neural network-genetic algorithm) approach to identify conserved patterns in RNA PolII promoters and conserved spacing among them (PROMOTER2.0).
- TATA box recognition using a weight matrix, together with density analysis of TF sites: a promoter recognition profile is produced using the density of TF sites at least 50 bp apart in known sequences of the EPD (Eukaryotic Promoter DB) and in non-promoter primate sequences from GenBank (PromoterScan).

Prediction Methods for RNA PolII Promoters (2)
Methods (cont’d):
- Use of a linear (TSSD and TSSW) or quadratic (CorePromoter) discriminant function. The function is based on: the TATA box score, triplet base-pair frequencies around the TSS, hexamer frequencies in consecutive 100-bp upstream regions, and TF binding site prediction.
- Searches of weight matrices for different organisms against a test sequence (TFSearch/TESS).
- MatInspector and ConInspector allow user-provided limits on the type of weight matrix, generation of new matrices, etc.
- Testing for the presence of clustered groups (or modules) of TF binding sites that are characteristic of a given pattern of gene regulation.
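One of the discriminant inputs listed above, hexamer frequencies in consecutive 100-bp upstream regions, can be computed as sketched below. This shows only the feature extraction, not the discriminant function of TSSD, TSSW or CorePromoter. The number of windows, the function names and the test sequence are all assumptions made for illustration.

```python
from collections import Counter

def upstream_windows(seq, tss, size=100, n_windows=3):
    """Return consecutive windows of `size` bp upstream of index `tss`, nearest first."""
    windows = []
    end = tss
    for _ in range(n_windows):
        start = end - size
        if start < 0:            # not enough upstream sequence left
            break
        windows.append(seq[start:end])
        end = start
    return windows

def hexamer_frequencies(region, k=6):
    """Relative frequency of every overlapping k-mer in the region."""
    counts = Counter(region[i:i + k] for i in range(len(region) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

seq = "ACGT" * 100                                   # made-up 400-bp sequence
profiles = [hexamer_frequencies(w) for w in upstream_windows(seq, tss=350)]
print(len(profiles), sorted(profiles[0]))
```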

An Expert System: Otto (1)
- A rule-based expert system to identify and characterize genes in the human genome
- Simulates a human annotator
- A human annotator looks for different patterns, e.g. homology to a number of ESTs, and evaluates whether they can be connected into a longer virtual mRNA
- A human annotator puts different levels of confidence in different types of evidence, according to the strength and contiguity of the match
- This type of annotation is used for the Drosophila genome

An Expert System: Otto (2)
- Otto can promote observed evidence to a gene annotation, either by checking for a high-quality match to a known gene, or by evaluating a broad spectrum of evidence and determining whether it is sufficient for a gene annotation.
- It first partitions the genome to identify likely gene boundaries, using BLAST matches.
- Partitions are checked for matches against DB sequences and grouped.

An Expert System: Otto (3)
- Next, known genes (exact matches of cDNA to the genome) are identified.
- For the remaining genes, more complex rules are used, such as: if a RefSeq transcript matches the genome assembly over at least 50% of its length with >92% identity, then the SIM4 alignment of the transcript is promoted to a gene.
- Otto identified a total of 6538 genes.
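The RefSeq rule quoted on this slide is simple enough to write as a predicate. The sketch below is not Otto's code: the function name and arguments are hypothetical, and only the two thresholds (at least 50% of the transcript length, more than 92% identity) come from the slide.

```python
def promote_refseq_match(matched_length, transcript_length, percent_identity):
    """Return True if a RefSeq match meets the slide's rule for promotion to a gene.

    Hypothetical helper: the thresholds are from the slide, the rest is illustrative.
    """
    coverage = matched_length / transcript_length
    return coverage >= 0.50 and percent_identity > 92.0

# Made-up matches for illustration.
print(promote_refseq_match(1200, 2000, 97.5))   # 60% coverage, high identity
print(promote_refseq_match(800, 2000, 99.0))    # only 40% coverage
print(promote_refseq_match(1500, 2000, 92.0))   # identity not above 92%
```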

Summary (1)
- Sensitivity, specificity and correlation analyses are performed when evaluating gene prediction algorithms
- Promoter prediction is relatively easier in prokaryotes than in eukaryotes
- Scoring matrices and perceptrons (NN) can be used to predict promoters in prokaryotes
- Finding less-conserved binding sites is more difficult. Expectation maximization, HMMs, statistical methods and the method of Hertz et al. can be employed in this case

Summary (2)
- A number of methods are available for promoter prediction in eukaryotes: NN-GAs, weight matrices, linear/quadratic discriminant functions, testing for TF binding sites, and TATA box recognition
- Otto is an expert system used to identify genes.
Thanks!