Gene Prediction (cont’d)

For CISC 889 (Bioinformatics) Ozcan KOC March 26, 2002

Outline
- Evaluation of Gene Prediction Algorithms
- Promoter prediction in prokaryotes
  - Scoring matrices
  - Neural networks
- Prediction of less conserved regions
- Promoter prediction in eukaryotes
- Otto (Celera)
- Summary

Evaluation of Gene Prediction Methods (1)
What to consider when comparing methods:
- Type of analysis (neural network, linear discriminant, etc.)
- Number and types of sequences used for training and testing
- Parameter settings, which also affect the predictions
An ideal method should use:
- A known set of gene structures for training
- A different set for testing
Evaluation is more stringent when the test set includes a gene and its neighboring sequence, rather than only the sequence between the first and the last exons.

- Number of actual positives: AP = TP + FN
- Number of actual negatives: AN = FP + TN
- Number of predicted positives: PP = TP + FP
- Number of predicted negatives: PN = TN + FN
- Sensitivity: SN = TP/AP = TP/(TP+FN)
- Specificity: SP = TP/PP = TP/(TP+FP)
- Correlation coefficient: ranges over [-1, 1]

Programs compared: GeneParser, GeneID, Grail, Otto (RefSeq), Otto (homology)
Sensitivity: .94 .60
Specificity: .97 .88
Corr. coef.:
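The measures above can be computed directly from the four counts. The sketch below is illustrative only: the slide gives just the range of the correlation coefficient, so the formula used here is the standard correlation (Matthews) form, which is an assumption about what the slide intended, and the example counts are made up.

```python
import math

def prediction_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and correlation coefficient.

    Specificity follows the gene-prediction convention on the slide
    (TP / predicted positives), not the TN-based definition.
    """
    ap = tp + fn          # actual positives
    an = fp + tn          # actual negatives
    pp = tp + fp          # predicted positives
    pn = tn + fn          # predicted negatives
    sn = tp / ap
    sp = tp / pp
    # Assumed formula: standard correlation coefficient over the 2x2 table.
    cc = (tp * tn - fp * fn) / math.sqrt(ap * an * pp * pn)
    return {"SN": sn, "SP": sp, "CC": cc}

# Hypothetical nucleotide-level counts, for illustration only.
metrics = prediction_metrics(tp=90, fp=10, tn=880, fn=20)
print(metrics)
```

A perfect prediction gives CC = 1 and a completely inverted prediction gives CC = -1, which matches the [-1, 1] range stated on the slide.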

In a later study (Zhang '97):
Programs: Grail II, FGENEH, MZEF
Sensitivity: .79 .93 .95
Specificity: .92
Corr. coef.: .83 .85 .89

- Programs that include protein sequence DB searches (GeneID+, GeneParser3) achieved substantially greater accuracy (Burset '96)
- Gene prediction programs reliably locate genomic regions, but provide only an approximation of gene structure

Exons Predicted in an Arabidopsis Genomic Sequence
Note: the Arabidopsis UVH1 gene (with approx. 250 bp upstream from the first exon and 200 bp downstream from the last exon) was used. NOT to be taken as a measure of the reliability of these programs.
Columns: cDNA, NetGene, GeneMark, FgeneP, GeneScan, MZEF
Recovered entries: x - 1210, x - 1513, x - x, x - 2880, x - 2880, x - 3253, 4010 - x * *
x: not predicted
*: includes the termination codon

Promoter Prediction in E.coli
- Align a set of promoter sequences by the position that marks the known TSS (transcription start site)
- Search for conserved regions
- E. coli promoters have 3 conserved sequence features:
  - a 6-bp region with consensus TATAAT (at position -10)
  - a 6-bp region with consensus TTGACA (at position -35)
  - a 17-bp distance between them
- A weaker region exists around +1 and an AT-rich region exists around -35

Promoter Prediction in E.coli (2)
- FINDPATTERNS and PatScan can be used to search for matches to the consensus
- Sequence positions in aligned regions vary to some extent, but some regions are less variable
- Alternative: use the search features of FINDPATTERNS/PatScan, which allow repeats, gaps, inverted repeats, etc.
  - E.g. GAT (TG, T,G) {1,4} (for FINDPATTERNS)
- Advantage: useful for locating complex regulatory patterns
- Disadvantage: no consideration of the frequency of each residue at each pattern position
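A consensus search of this kind can be sketched with an ordinary regular expression. This is not FINDPATTERNS or PatScan syntax; it is a plain Python stand-in, and the 15 to 19 bp spacer tolerance around the 17-bp distance is an assumption made for illustration.

```python
import re

# -35 consensus, a variable spacer, then the -10 consensus.
# The lookahead lets overlapping candidates be reported.
# Spacer range 15-19 is an assumed tolerance around the 17 bp on the slide.
PROMOTER = re.compile(r"(?=(TTGACA)([ACGT]{15,19})(TATAAT))")

def find_consensus_promoters(seq):
    """Return (start of -35 box, spacer length) for every exact consensus hit."""
    hits = []
    for m in PROMOTER.finditer(seq.upper()):
        hits.append((m.start(1), len(m.group(2))))
    return hits

# Made-up test sequence with a 17-bp spacer.
example = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
print(find_consensus_promoters(example))
```

Because only exact matches to both hexamers are reported, this illustrates the disadvantage noted above: a promoter with one mismatch in either box is missed entirely.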

Promoter Prediction in E.coli (3)
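Use a scoring matrix.
Example: how to prepare the matrix (for the -10 region of E. coli promoters)
- N sequences are aligned by their -10 regions
- The count of each base in each column is made and converted to a frequency
- Frequencies are converted into log odds scores
Alternative formula (Hertz and Stormo '99):
$w_{i,j} = \log\left[\dfrac{n_{i,j} + P_i}{(N+1)\,P_i}\right] \approx \ln\left(\dfrac{f_{i,j}}{P_i}\right)$
- $n_{i,j}$: count of base $i$ in column $j$
- $f_{i,j}$: frequency of base $i$ in column $j$ ($n_{i,j}/N$)
- $P_i$: background frequency of base $i$
- $N$: total number of sequences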
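The matrix preparation can be sketched as follows, using the pseudocount form of the Hertz and Stormo weight. The aligned hexamers are hypothetical, a uniform background of 0.25 is assumed, and log base 2 is used so that scores come out in bits.

```python
import math

BASES = "ACGT"

def log_odds_matrix(aligned, background=None):
    """Build a log-odds matrix: one dict of base -> score per column.

    w[j][i] = log2((n_ij + P_i) / ((N + 1) * P_i))
    """
    if background is None:
        background = {b: 0.25 for b in BASES}   # assumed uniform background
    n_seqs = len(aligned)
    width = len(aligned[0])
    matrix = []
    for j in range(width):
        column = [s[j] for s in aligned]
        row = {}
        for b in BASES:
            n_ij = column.count(b)
            p = background[b]
            row[b] = math.log2((n_ij + p) / ((n_seqs + 1) * p))
        matrix.append(row)
    return matrix

# Hypothetical aligned -10 hexamers, for illustration only.
aligned_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
matrix = log_odds_matrix(aligned_sites)
for position, row in enumerate(matrix, start=1):
    print(position, {b: round(v, 2) for b, v in row.items()})
```

Adding $P_i$ to each count keeps the score finite for bases that never occur in a column, which a plain $\log(f_{i,j}/P_i)$ would not.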

A Scoring Matrix for E. coli promoters (-10 position)
Fraction of each base at each column of the aligned promoters in the -10 region (observed frequency):

Position   A      C      G      T
1          0.02   0.09   0.10   0.79
2          0.94 0.01 0.03 (one value missing)
3..6       ...

Log odds score = log2(observed frequency / expected (background) frequency), e.g. log2(0.79/0.25) for T at position 1.

Position   A      C      G      T
1          -3.80  -1.49  -1.34  1.67
2          1.92   -3.81  -4.81  -3.22
3          -0.06  -0.81  -0.66  0.81
4          1.24   -1.00  -0.72  -0.89
5          1.02 -0.35 -0.56 (one value missing)
6          1.95 (three values missing)

Locating –10 Promoter Sites in E. coli(1)
Log odds scores of three example sequence windows scored with the -10 matrix:
- Log odds score = -9.62 bits; odds = 2^-9.62 = 1/786
- Log odds score = -9.30 bits; odds = 2^-9.30 = 1/630
- Log odds score = 8.61 bits; odds = 2^8.61 = 391/1

- Scoring matrices are applied for the regions -35 (35 bp), -10 (10 bp) and +1 (12 bp) on both strands
- Each matrix provides a distribution of odds scores
- Matches are examined for the spacing characteristic of promoters
- Result: the log odds score represents an overall likelihood that the regions match the characteristics of E. coli promoters with correct spacing
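Scanning a sequence with a log-odds matrix amounts to summing one matrix entry per window position and sliding the window by one base. The sketch below uses a small hypothetical 3-column matrix, not the slide's -10 matrix, and also shows the bits-to-odds conversion.

```python
def score_window(window, matrix):
    """Sum the log-odds (bits) of each base at its position in the window."""
    return sum(matrix[j][base] for j, base in enumerate(window))

def scan(seq, matrix):
    """Score every window of the matrix width; return (start, bits) pairs."""
    width = len(matrix)
    return [(i, score_window(seq[i:i + width], matrix))
            for i in range(len(seq) - width + 1)]

def bits_to_odds(bits):
    """Convert a log2 odds score back to an odds ratio."""
    return 2.0 ** bits

# Hypothetical matrix favoring the word TAT (values are made up).
toy_matrix = [
    {"A": -2.0, "C": -2.0, "G": -2.0, "T": 1.5},
    {"A": 1.5, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": -2.0, "G": -2.0, "T": 1.5},
]
scores = scan("GTATC", toy_matrix)
best = max(scores, key=lambda pair: pair[1])
print(scores, best)
print(round(bits_to_odds(8.61)))   # about 391 to 1
```

A positive score means the window is more likely under the promoter model than under the background; a negative score means the reverse.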

Problems with Matrix Method
Assumptions made by the method, and what happens in reality:
- Scores are added across sequence positions, i.e. matching positions with separate functions are assumed to contribute additively. In reality, not necessarily: one position in the -10 region may play a role in one stage of transcription (e.g. promoter recognition) and another position in a different stage (e.g. elongation of the mRNA).
- All promoters are treated as belonging to the same class. In reality, different RNA polymerases may prefer different regions within the promoter.
- The promoter sequence is treated as a zero-order Markov chain (i.e. each position is independent of the others). Usually this assumption holds, but there may be cases where a correlation between sequence positions exists that is not just due to chance.

Neural Networks for E. coli Promoters
- Use a neural network trained to distinguish E. coli promoter sequences from non-promoter sequences (Pedersen et al. '96)
- Horton and Kanehisa '92 used a neural network lacking a hidden layer (a perceptron)
- Scan the sequence to be analyzed using a sliding window
- Sequence characters are given a simple identification scheme to avoid any bias (e.g. A is 1000, G is 0100, etc.)
- Perceptron: no more efficient than the matrix method

Perceptron Model for Locating E.coli promoters
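[Figure: the perceptron. Input bases are encoded as 4-bit vectors (T [0100], A [1000]) and combined through the weights w1 to w6. An output of approximately 1 indicates function; 0 indicates no function.]

Scoring matrix equivalent (columns A, C, G, T; one value recovered per position):
Position 1: 0.19
Position 2: 0.22
Position 3: 0.09
Position 4: 0.14
Position 5: 0.12
Position 6: 0.24
SUM = ... indicates function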
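A perceptron over one-hot encoded bases can be sketched in a few lines. The codes for A and G follow the slide text; the codes for C and T, the weights, the bias and the sigmoid output are assumptions chosen for illustration, not trained values.

```python
import math

# A and G as on the slide; C and T codes are assumed to complete the scheme.
ONE_HOT = {"A": [1, 0, 0, 0], "G": [0, 1, 0, 0],
           "C": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def encode(window):
    """Concatenate the 4-bit code of every base in the window."""
    return [bit for base in window for bit in ONE_HOT[base]]

def weights_for_consensus(consensus, match=2.0, mismatch=-2.0):
    """Hypothetical weights: reward the consensus base, penalize the others."""
    weights = []
    for base in consensus:
        weights.extend(match if bit else mismatch for bit in ONE_HOT[base])
    return weights

def perceptron_output(window, weights, bias=0.0):
    """Weighted sum of the inputs passed through a sigmoid (no hidden layer)."""
    x = encode(window)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

weights = weights_for_consensus("TATAAT")
print(perceptron_output("TATAAT", weights, bias=-6.0))  # close to 1
print(perceptron_output("GGGGGG", weights, bias=-6.0))  # close to 0
```

Since each one-hot input selects exactly one weight per position, the weighted sum is the same computation as a scoring-matrix lookup, which is why the perceptron is not expected to outperform the matrix method.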

Finding Less-conserved Binding Sites (1)
- In E. coli, the sequences could be aligned by the TSS, -10 and -35 regions
- In many cases, it is not possible to find a conserved binding site by aligning the sequences
- The problem is similar to finding patterns common to a set of protein sequences that cannot be aligned, but more difficult: proteins have 20 amino acids while DNA has only 4 nucleotides, so it is harder to distinguish a pattern from noise
Methods:
- Expectation maximization: guess an initial scoring matrix of estimated length. Scan each sequence, calculate the probability of a match at each position, update the scoring matrix (each sequence position weighted by its probability), then repeat until there is no change.
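The expectation maximization loop described above can be sketched as follows. This is a simplified version that assumes exactly one site per sequence and a uniform background; the seed word, pseudocount and test sequences are all hypothetical.

```python
BASES = "ACGT"

def em_motif(seqs, seed, n_iter=100, pseudo=0.1, bg=0.25, tol=1e-6):
    """Simplified EM motif search (assumes one site per sequence)."""
    width = len(seed)
    # Initial guess: a frequency matrix leaning toward the seed word.
    pwm = [{b: (0.7 if b == seed[j] else 0.1) for b in BASES}
           for j in range(width)]
    posteriors = []
    for _ in range(n_iter):
        counts = [{b: pseudo for b in BASES} for _ in range(width)]
        posteriors = []
        for s in seqs:
            # E-step: odds of the motif starting at each position.
            odds = []
            for i in range(len(s) - width + 1):
                o = 1.0
                for j in range(width):
                    o *= pwm[j][s[i + j]] / bg
                odds.append(o)
            total = sum(odds)
            post = [o / total for o in odds]
            posteriors.append(post)
            # Accumulate base counts weighted by site probability.
            for i, p in enumerate(post):
                for j in range(width):
                    counts[j][s[i + j]] += p
        # M-step: renormalize weighted counts into a new matrix.
        new_pwm = [{b: col[b] / sum(col.values()) for b in BASES}
                   for col in counts]
        change = max(abs(new_pwm[j][b] - pwm[j][b])
                     for j in range(width) for b in BASES)
        pwm = new_pwm
        if change < tol:      # "repeat until no change"
            break
    return pwm, posteriors

# Made-up sequences, each containing TATAAT at a different offset.
seqs = ["GCGCTATAATGCGC", "CCTATAATGGCGCG", "GGCGCGTATAATCC"]
pwm, posteriors = em_motif(seqs, seed="TATAAT")
best_sites = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
print(best_sites)
```

In practice the result depends on the initial guess, so the procedure is usually restarted from several different seeds.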

Finding Less-conserved Binding Sites (2)
Methods, cont'd:
- Hidden Markov models
- Statistical method of finding patterns
  - A dinucleotide analysis is performed to reduce background noise
  - A Gibbs sampling method that considers inverted repeats (e.g. for lexA) is applied
- Hertz, Stormo and Hartzell method
  - Example: how the algorithm compares a fixed window of sequence (width 4) in a set of sequences
  - Objective: find the 4-mer in each sequence such that the chosen 4-mers are as nearly identical as can be found across ALL sequences

Hertz, Stormo and Hartzell Method (for DNA-binding Sites)
[Figure: count matrices (columns A, C, T, G) built from 4-mers of Sequence 1, Sequence 2 and Sequence 3, each labeled with its information content I.]
- Seq1 alone: I = 8 bits
- Seq1 + Seq2: I = 4 bits or I = 6 bits, depending on the 4-mer chosen
- Seq1 + Seq3: I = 6 bits or I = 4 bits, depending on the 4-mer chosen
- Seq1 + Seq2 + Seq3: I = 4.6 bits or I = 3.0 bits, depending on the 4-mers chosen
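The information content values in this example are consistent with summing, over columns, f × log2(f / 0.25) for each observed base. The sketch below computes that quantity for a set of aligned words; the uniform background and the example 4-mers are assumptions, and only the scoring step is shown, not the method's search over candidate alignments.

```python
import math

def information_content(words, bg=0.25):
    """Information content (bits) of a gap-free alignment of equal-length words."""
    n = len(words)
    total = 0.0
    for j in range(len(words[0])):
        column = [w[j] for w in words]
        for base in set(column):
            f = column.count(base) / n
            total += f * math.log2(f / bg)
    return total

# Hypothetical 4-mers chosen to reproduce the kinds of values on the slide.
print(information_content(["ACTG"]))                  # one sequence
print(information_content(["ACTG", "CAGT"]))          # no column agrees
print(information_content(["ACTG", "ACGT"]))          # two columns agree
print(round(information_content(["ACTG", "ACGT", "AGTC"]), 1))
```

A higher value means the selected words agree more closely, so the method keeps the combinations of windows with the highest information content as it adds sequences.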

Promoter Prediction in Eukaryotes (1)
Transcriptional Regulation in Eukaryotes
- Transcription involves the interaction of TFs (transcription factors, which are protein complexes) with
  - each other
  - DNA-binding sites in the promoter region
- The degree of expression of a gene is influenced by
  - the region upstream from the transcription start point
  - the region downstream
- A TATA box is present in most eukaryotic promoters (75% in vertebrates)
- A TATA box HMM trained for vertebrates has the consensus sequence TATAWDR starting at -17 bp from the TSS
  - W: A/T
  - D: not C
  - R: G/A
Note: "transcription" here means transcription of protein-encoding genes by RNA polymerase II.
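The ambiguity codes in TATAWDR can be expanded into a regular expression for a quick consensus scan. This is a much cruder stand-in for the HMM mentioned above (it matches the consensus exactly and ignores position relative to the TSS), and the test sequences are made up.

```python
import re

# Ambiguity codes used in the consensus, expanded to character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "D": "[AGT]", "R": "[AG]"}

def consensus_to_regex(consensus):
    """Translate a consensus string with ambiguity codes into a regex."""
    return re.compile("".join(IUPAC[c] for c in consensus))

TATA = consensus_to_regex("TATAWDR")

def find_tata(seq):
    """Return start positions of exact matches to the TATAWDR consensus."""
    return [m.start() for m in TATA.finditer(seq.upper())]

print(TATA.pattern)
print(find_tata("GGCTATAAAAGGC"))
```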

- The Inr (initiator) also influences the start position of transcription: a loosely defined sequence around the TSS that may be recognized by other protein subunits of TFIID (a TF that recognizes and binds to the promoter DNA)
- CCAAT and GC boxes have also been found around the TSS (at variable distances)
- Many different TFs may be involved in the regulation of a particular eukaryotic gene
- DNA-binding sites for many of these TFs are unknown, which limits promoter prediction

- Gene expression is also influenced by the region upstream of the core promoter and by other enhancer sites
- Eukaryotic sequences show variation not only between species but also among genes within a species
- Hence, a set of promoters in one organism that share a common regulatory response is analyzed
- The programs predict 13-54% of the TSSs correctly, but each program also predicts a number of false-positive TSSs

Prediction Methods for RNA PolII Promoters (1)
- Neural network trained on TATA and Inr sites, allowing a variable spacing between the sites
- NN-GA (neural network-genetic algorithm) approach to identify conserved patterns in RNA Pol II promoters and the conserved spacing among them (PROMOTER2.0)
- TATA box recognition using a weight matrix, combined with density analysis of TF sites (PromoterScan): a promoter recognition profile is produced using the density of TF sites at least 50 bp apart in known sequences of the EPD (Eukaryotic Promoter DB) and in non-promoter primate sequences from GenBank

Prediction Methods for RNA PolII Promoters (2)
Methods, cont'd:
- Use of a linear (TSSD and TSSW) or quadratic (CorePromoter) discriminant function. The function is based on:
  - TATA box score
  - Base-pair (triplet) frequencies around the TSS
  - Hexamer frequencies in consecutive 100-bp upstream regions
  - TF binding site prediction
- Searches of weight matrices for different organisms against a test sequence (TFSearch / TESS)
- MatInspector and ConInspector allow user-provided limits on the type of weight matrix, generation of new matrices, etc.
- Testing for the presence of clustered groups (or modules) of TF binding sites that are characteristic of a given pattern of gene regulation

An Expert System: Otto (1)
- A rule-based expert system to identify and characterize genes in the human genome
- Simulates a human annotator
- A human annotator
  - looks for different patterns, e.g. homology to a number of ESTs, and evaluates whether they can be connected into a longer virtual mRNA
  - puts different levels of confidence in different types of evidence, based on the strength and contiguity of the match
- This type of annotation was used for the Drosophila genome

- Otto can promote observed evidence to a gene annotation
  - either by finding a high-quality match to a known gene
  - or by evaluating a broad spectrum of evidence and determining whether it is enough for a gene annotation
- It first partitions the genome to identify likely gene boundaries, using BLAST matches
- Partitions are checked for matches against DB sequences and grouped

- Next, known genes (exact matches of cDNA to the genome) are identified
- For the remaining genes, more complex rules are used, such as: if a RefSeq transcript matches the genome assembly for at least 50% of its length with >92% identity, then the SIM4 alignment of the transcript is promoted to a gene
- Otto identified a total of 6538 genes
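A rule of this kind is easy to express as a predicate over an alignment summary. The sketch below only illustrates the thresholds quoted above; the record fields and function name are hypothetical and do not reflect Otto's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class TranscriptMatch:
    """Hypothetical summary of one RefSeq-to-genome alignment."""
    transcript_id: str
    transcript_length: int   # bases in the transcript
    matched_length: int      # bases aligned to the genome assembly
    percent_identity: float  # identity over the aligned region

def promote_to_gene(match, min_coverage=0.5, min_identity=92.0):
    """Apply the quoted rule: at least 50% of length matched, >92% identity."""
    coverage = match.matched_length / match.transcript_length
    return coverage >= min_coverage and match.percent_identity > min_identity

# Made-up examples.
good = TranscriptMatch("tx1", 2000, 1500, 98.5)
short = TranscriptMatch("tx2", 2000, 800, 99.0)
print(promote_to_gene(good), promote_to_gene(short))
```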

Summary (1)
- Sensitivity, specificity and correlation analyses are performed when evaluating gene prediction algorithms
- Promoter prediction is relatively easier in prokaryotes than in eukaryotes
- Scoring matrices and perceptrons (NN) can be used to predict promoters in prokaryotes
- Finding less-conserved binding sites is more difficult. Expectation maximization, HMMs, statistical methods and the Hertz et al. method can be employed in this case

Summary (2)
- A number of methods are available for promoter prediction in eukaryotes:
  - NN-GAs
  - Weight matrices
  - Linear/quadratic discriminant functions
  - Testing for TF binding sites
  - TATA box recognition
- Otto is an expert system used to identify genes
Thanks!
