Copyright  2003 limsoon wong Recognition of Gene Features Limsoon Wong Institute for Infocomm Research BI6103 guest lecture on ?? February 2004 For written notes, please read chapters 3, 4, and 7 of The Practical Bioinformatician,

Copyright  2003 limsoon wong Lecture Plan experiment design, result interpretation central dogma recognition of translation initiation sites recognition of transcription start sites survey of some ANN-based systems for recognizing gene features

Copyright  2003 limsoon wong What is Accuracy?

Copyright  2003 limsoon wong What is Accuracy? Accuracy = (No. of correct predictions) / (No. of predictions) = (TP + TN) / (TP + TN + FP + FN)

Copyright  2003 limsoon wong Examples (Balanced Population) Clearly, B, C, D are all better than A Is B better than C, D? Is C better than B, D? Is D better than B, C? Accuracy may not tell the whole story

Copyright  2003 limsoon wong Examples (Unbalanced Population) Clearly, D is better than A Is B better than A, C, D? high accuracy is meaningless if population is unbalanced

Copyright  2003 limsoon wong What is Sensitivity (aka Recall)? Sensitivity (wrt positives) = (No. of correct positive predictions) / (No. of positives) = TP / (TP + FN) Sometimes sensitivity wrt negatives is termed specificity

Copyright  2003 limsoon wong What is Precision? Precision (wrt positives) = (No. of correct positive predictions) / (No. of positive predictions) = TP / (TP + FP)
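
To make these measures concrete, here is a minimal Python sketch that computes them from the four confusion-matrix counts; the function names and example counts are illustrative only.

  def accuracy(tp, tn, fp, fn):
      # fraction of all predictions that are correct
      return (tp + tn) / (tp + tn + fp + fn)

  def sensitivity(tp, fn):
      # aka recall: fraction of actual positives predicted as positive
      return tp / (tp + fn)

  def precision(tp, fp):
      # fraction of positive predictions that are correct
      return tp / (tp + fp)

  # e.g., 80 TP, 90 TN, 10 FP, 20 FN
  print(accuracy(80, 90, 10, 20))   # 0.85
  print(sensitivity(80, 20))        # 0.80
  print(precision(80, 10))          # ~0.89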

Copyright  2003 limsoon wong Precision-Recall Trade-off A predicts better than B if A has better recall and precision than B There is a trade-off between recall and precision In some applications, once you reach a satisfactory precision, you optimize for recall In some applications, once you reach a satisfactory recall, you optimize for precision [trade-off curve: precision vs. recall]

Copyright  2003 limsoon wong Comparing Prediction Performance Accuracy is the obvious measure –But it conveys the right intuition only when the positive and negative populations are roughly equal in size Recall and precision together form a better measure –But what do you do when A has better recall than B and B has better precision than A?

Copyright  2003 limsoon wong Adjusted Accuracy Weigh by the importance of the classes: Adjusted accuracy = α * Sensitivity + β * Specificity, where α + β = 1; typically, α = β = 0.5 But people can’t always agree on values for α, β

Copyright  2003 limsoon wong ROC Curves By changing thresholds, get a range of sensitivities and specificities of a classifier A predicts better than B if A has better sensitivities than B at most specificities Leads to ROC curve that plots sensitivity vs. (1 – specificity) Then the larger the area under the ROC curve, the better
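
The ROC construction can be sketched in a few lines of Python; the scores below are hypothetical classifier outputs, and each threshold yields one (1 – specificity, sensitivity) point on the curve.

  def roc_points(scores, labels):
      # sweep the decision threshold over all observed scores
      points = []
      for t in sorted(set(scores), reverse=True):
          tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
          fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
          fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
          tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
          points.append((fp / (fp + tn), tp / (tp + fn)))
      return points

  # hypothetical scores with true labels (1 = positive, 0 = negative)
  print(roc_points([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))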

Copyright  2003 limsoon wong What is Cross Validation?

Copyright  2003 limsoon wong Construction of a Classifier [diagram: Training samples → Build Classifier → Classifier; Test instance + Classifier → Apply Classifier → Prediction]

Copyright  2003 limsoon wong Estimate Accuracy: Wrong Way [diagram: Training samples → Build Classifier → Classifier; the same Training samples → Apply Classifier → Predictions → Estimate Accuracy] Why is this way of estimating accuracy wrong?

Copyright  2003 limsoon wong Recall... the abstract model of a classifier: Given a test sample S Compute scores p(S), n(S) Predict S as negative if p(S) < t * n(S) Predict S as positive if p(S) ≥ t * n(S) t is the decision threshold of the classifier

Copyright  2003 limsoon wong K-Nearest Neighbour Classifier (k-NN) Given a sample S, find the k observations S_i in the known data that are “closest” to it, and average their responses. Assume S is well approximated by its neighbours: p(S) = Σ_{S_i ∈ N_k(S) ∩ D_P} 1 and n(S) = Σ_{S_i ∈ N_k(S) ∩ D_N} 1, i.e., the counts of positive and negative training samples among the neighbours, where N_k(S) is the neighbourhood of S defined by the k nearest samples to it. Assume distance between samples is Euclidean distance for now
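
A minimal Python sketch of this abstract model, assuming D_P and D_N are lists of numeric feature vectors for the positive and negative training samples and using plain Euclidean distance; the function names and default threshold are illustrative, not the lecture's implementation.

  import math

  def knn_scores(S, D_P, D_N, k):
      # p(S), n(S): counts of positive/negative samples among the k nearest neighbours
      labelled = [(x, 1) for x in D_P] + [(x, 0) for x in D_N]
      labelled.sort(key=lambda xy: math.dist(S, xy[0]))
      p = sum(y for _, y in labelled[:k])
      return p, k - p

  def knn_predict(S, D_P, D_N, k=3, t=1.0):
      p, n = knn_scores(S, D_P, D_N, k)
      return "positive" if p >= t * n else "negative"

  # toy usage with 2-D points
  D_P = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
  D_N = [(3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
  print(knn_predict((1.1, 1.0), D_P, D_N))   # positive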

Copyright  2003 limsoon wong Estimate Accuracy: Wrong Way [diagram: Training samples → Build 1-NN → 1-NN; the same Training samples → Apply 1-NN → Predictions → Estimate Accuracy → 100% Accuracy] For sure k-NN (k = 1) has 100% accuracy in the “accuracy estimation” procedure above. But does this accuracy generalize to new test instances?

Copyright  2003 limsoon wong Estimate Accuracy: Right Way Testing samples are NOT to be used during “Build Classifier” [diagram: Training samples → Build Classifier → Classifier; Testing samples → Apply Classifier → Predictions → Estimate Accuracy]

Copyright  2003 limsoon wong How Many Training and Testing Samples? No fixed ratio between training and testing samples; but typically 2:1 ratio Proportion of instances of different classes in testing samples should be similar to proportion in training samples What if there are insufficient samples to reserve 1/3 for testing? Ans: Cross validation

Copyright  2003 limsoon wong Cross Validation Divide samples into k roughly equal parts Each part has a similar proportion of samples from the different classes In turn, hold out each part for testing and train on the remaining parts Total up the accuracy [diagram: 5 folds, each part used once as Test while the other parts are used as Train]
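
A compact Python sketch of the procedure, with the stratification step kept deliberately simple; train_fn and predict_fn stand in for whatever classifier is being evaluated and are assumptions of this example.

  def k_fold_accuracy(samples, labels, k, train_fn, predict_fn):
      # hold out each part in turn, train on the rest, total up accuracy
      folds = [list(range(i, len(samples), k)) for i in range(k)]
      correct = 0
      for test_idx in folds:
          test_set = set(test_idx)
          train_idx = [i for i in range(len(samples)) if i not in test_set]
          model = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
          correct += sum(1 for i in test_idx
                         if predict_fn(model, samples[i]) == labels[i])
      return correct / len(samples)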

Copyright  2003 limsoon wong How Many Folds? If samples are divided into k parts, we call this k-fold cross validation Choose k so that –the k-fold cross validation accuracy does not change much from (k–1)-fold –each part within the k-fold cross validation has similar accuracy k = 5 or 10 are popular choices for k [learning curve: accuracy vs. size of training set]

Copyright  2003 limsoon wong Bias-Variance Decomposition Suppose classifiers C_j and C_k were trained on different sets S_j and S_k of 1000 samples each Then C_j and C_k might have different accuracy What is the expected accuracy of a classifier C trained this way? Let Y = f(X) be what C is trying to predict The expected squared error at a test instance x, averaging over all such training samples, is E[(f(x) – C(x))²] = E[(C(x) – E[C(x)])²] + (E[C(x)] – f(x))², where the first term is the variance and the second the squared bias

Copyright  2003 limsoon wong Bias-Variance Trade-Off In k-fold cross validation, –small k tends to underestimate accuracy (i.e., large bias downwards) –large k has smaller bias, but can have high variance [learning curve: accuracy vs. size of training set]

Copyright  2003 limsoon wong Curse of Dimensionality

Copyright  2003 limsoon wong Curse of Dimensionality How much of each dimension is needed to cover a proportion r of the total sample space? Calculate by e_p(r) = r^(1/p) So, to cover just 1% of a 15-D space, you need 0.01^(1/15) ≈ 74% of the range of each dimension!
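
A quick check of the formula in Python; the helper name edge_fraction is just for illustration.

  def edge_fraction(r, p):
      # e_p(r) = r^(1/p): fraction of each dimension's range needed
      # to cover a proportion r of a p-dimensional unit cube
      return r ** (1.0 / p)

  print(edge_fraction(0.01, 15))   # ~0.74
  print(edge_fraction(0.01, 3))    # ~0.22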

Copyright  2003 limsoon wong Consequence of the Curse Suppose the number of samples given to us in the total sample space is fixed Let the dimension increase Then the distance to the k nearest neighbours of any point increases Then the k nearest neighbours are less and less useful for prediction, and can confuse the k-NN classifier

Copyright  2003 limsoon wong What is Feature Selection?

Copyright  2003 limsoon wong Tackling the Curse Given a sample space of p dimensions It is possible that some dimensions are irrelevant Need to find ways to separate those dimensions (aka features) that are relevant (aka signals) from those that are irrelevant (aka noise)

Copyright  2003 limsoon wong Signal Selection (Basic Idea) Choose a feature w/ low intra-class distance Choose a feature w/ high inter-class distance

Copyright  2003 limsoon wong Signal Selection (eg., t-statistics)
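
The formula on this slide is not preserved in the transcript; the sketch below scores a single feature with the usual two-sample t-statistic (a larger absolute value means better separation between the two classes), which is the standard form of this criterion.

  import math

  def t_statistic(values_class1, values_class2):
      # difference of class means scaled by the per-class standard errors
      n1, n2 = len(values_class1), len(values_class2)
      m1, m2 = sum(values_class1) / n1, sum(values_class2) / n2
      v1 = sum((x - m1) ** 2 for x in values_class1) / (n1 - 1)
      v2 = sum((x - m2) ** 2 for x in values_class2) / (n2 - 1)
      return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

  # rank features by |t_statistic| and keep the top-scoring ones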

Copyright  2003 limsoon wong Signal Selection (eg., MIT-correlation)

Copyright  2003 limsoon wong Signal Selection (eg., entropy)

Copyright  2003 limsoon wong Signal Selection (eg., χ²)

Copyright  2003 limsoon wong Signal Selection (eg., CFS) Instead of scoring individual signals, how about scoring a group of signals as a whole? CFS –Correlation-based Feature Selection –A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
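
The transcript keeps only the idea, not Hall's merit formula. In CFS the merit of a group of k signals is k·r_cf / sqrt(k + k(k–1)·r_ff), where r_cf is the average signal–class correlation and r_ff the average signal–signal correlation; the sketch below just evaluates that merit for a candidate group, leaving the correlation measure itself abstract.

  import math

  def cfs_merit(feature_class_corrs, feature_feature_corrs):
      # high average feature-class correlation, low average feature-feature correlation
      k = len(feature_class_corrs)
      r_cf = sum(feature_class_corrs) / k
      r_ff = (sum(feature_feature_corrs) / len(feature_feature_corrs)
              if feature_feature_corrs else 0.0)
      return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

  # e.g., three signals each correlated 0.6 with the class, 0.2 with each other
  print(cfs_merit([0.6, 0.6, 0.6], [0.2, 0.2, 0.2]))   # ~0.88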

Copyright  2003 limsoon wong Self-fulfilling Oracle Construct an artificial dataset with 100 samples, each with 100,000 randomly generated features and randomly assigned class labels Select 20 features with the best t-statistics (or other methods) Evaluate accuracy by cross validation using only the 20 selected features The resultant estimated accuracy can be ~90% But the true accuracy should be 50%, as the data were derived randomly

Copyright  2003 limsoon wong What Went Wrong? The 20 features were selected from the whole dataset Information in the held-out testing samples has thus been “leaked” to the training process The correct way is to re-select the 20 features at each fold; better still, use a totally new set of samples for testing
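
A sketch of the correct protocol, assuming placeholder helpers select_top_features, train_fn and predict_fn (they are not code from the lecture): feature selection is redone inside each fold, on the training part only, so nothing leaks from the held-out samples.

  def cv_with_selection(samples, labels, k, n_feat,
                        select_top_features, train_fn, predict_fn):
      folds = [list(range(i, len(samples), k)) for i in range(k)]
      correct = 0
      for test_idx in folds:
          test_set = set(test_idx)
          train_idx = [i for i in range(len(samples)) if i not in test_set]
          X_tr = [samples[i] for i in train_idx]
          y_tr = [labels[i] for i in train_idx]
          feats = select_top_features(X_tr, y_tr, n_feat)   # selection sees training data only
          model = train_fn([[x[j] for j in feats] for x in X_tr], y_tr)
          correct += sum(1 for i in test_idx
                         if predict_fn(model, [samples[i][j] for j in feats]) == labels[i])
      return correct / len(samples)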

Copyright  2003 limsoon wong Short Break

Copyright  2003 limsoon wong Central Dogma of Molecular Biology

Copyright  2003 limsoon wong What is a gene?

Copyright  2003 limsoon wong Central Dogma

Copyright  2003 limsoon wong Transcription: DNA → nRNA

Copyright  2003 limsoon wong Splicing: nRNA → mRNA

Copyright  2003 limsoon wong Translation: mRNA → protein [genetic code wheel mapping codons to amino acids and stop]

Copyright  2003 limsoon wong What does DNA data look like? A sample GenBank record from NCBI md=Retrieve&db=nucleotide&list_uids= &dopt=GenBank

Copyright  2003 limsoon wong What does protein data look like? A sample GenPept record from NCBI md=Retrieve&db=protein&list_uids= &dopt=GenPept

Copyright  2003 limsoon wong Recognition of Translation Initiation Sites An introduction to the World’s simplest TIS recognition system A simple approach to accuracy and understandability

Copyright  2003 limsoon wong Translation Initiation Site

Copyright  2003 limsoon wong A Sample cDNA 299 HSU CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE What makes the second ATG the TIS?

Copyright  2003 limsoon wong Approach Training data gathering Signal generation → k-grams, distance, domain know-how,... Signal selection → Entropy, χ², CFS, t-test, domain know-how... Signal integration → SVM, ANN, PCL, CART, C4.5, kNN,...

Copyright  2003 limsoon wong Training & Testing Data Vertebrate dataset of Pedersen & Nielsen [ISMB’97]: 3312 sequences; of the ATG sites in these sequences, 3312 (24.5%) are TIS and the rest (75.5%) are non-TIS Used for 3-fold x-validation expts

Copyright  2003 limsoon wong Signal Generation K-grams (ie., k consecutive letters) –K = 1, 2, 3, 4, 5, … –Window size vs. fixed position –Up-stream, downstream vs. anywhere in window –In-frame vs. any frame

Copyright  2003 limsoon wong Signal Generation: An Example 299 HSU CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT Window = ±100 bases In-frame, downstream –GCT = 1, TTT = 1, ATG = 1… Any-frame, downstream –GCT = 3, TTT = 2, ATG = 2… In-frame, upstream –GCT = 2, TTT = 0, ATG = 0,...
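
A Python sketch of this kind of counting, restricted to in-frame downstream k-grams; the window handling is simplified and the function name is illustrative.

  def inframe_downstream_kgrams(seq, atg_pos, k=3, window=100):
      # count k-grams downstream of a candidate ATG, stepping in codon frame
      counts = {}
      end = min(len(seq), atg_pos + 3 + window)
      for i in range(atg_pos + 3, end - k + 1, 3):   # start just after the ATG
          kgram = seq[i:i + k]
          counts[kgram] = counts.get(kgram, 0) + 1
      return counts

  seq = "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
  print(inframe_downstream_kgrams(seq, seq.find("ATG")))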

Copyright  2003 limsoon wong Too Many Signals For each value of k, there are 4^k * 3 * 2 k-grams If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features! This is too many for most machine learning algorithms

Copyright  2003 limsoon wong Sample k-grams Selected by CFS –Position –3 (Kozak consensus) –In-frame upstream ATG (leaky scanning) –In-frame downstream stop codons TAA, TAG, TGA –In-frame downstream CTG, GAC, GAG, and GCC (codon bias?)

Copyright  2003 limsoon wong Signal Integration kNN Given a test sample, find the k training samples that are most similar to it. Let the majority class win. SVM Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes. Naïve Bayes, ANN, C4.5,...

Copyright  2003 limsoon wong Results (3-fold x-validation)

Copyright  2003 limsoon wong Performance Comparisons * result not directly comparable due to different dataset and ribosome-scanning model

Copyright  2003 limsoon wong Improvement by Scanning Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS. Naïve Bayes & SVM models were trained using TIS vs. Up-stream ATG
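
A sketch of the scanning rule, assuming classify(window) returns True when the ATG-centred window is predicted positive; classify stands in for the trained Naïve Bayes or SVM model and is not defined here.

  def scan_for_tis(seq, classify, flank=100):
      # left to right: the first ATG whose window is predicted positive is the TIS
      for i in range(len(seq) - 2):
          if seq[i:i + 3] == "ATG":
              window = seq[max(0, i - flank):i + 3 + flank]
              if classify(window):
                  return i
      return None   # no TIS predicted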

Copyright  2003 limsoon wong Technique Comparisons Pedersen&Nielsen [ISMB’97] –Neural network –No explicit features Zien [Bioinformatics’00] –SVM+kernel engineering –No explicit features Hatzigeorgiou [Bioinformatics’02] –Multiple neural networks –Scanning rule –No explicit features Our approach –Explicit feature generation –Explicit feature selection –Use any machine learning method w/o any form of complicated tuning –Scanning rule is optional

Copyright  2003 limsoon wong Can we do even better? [genetic code wheel] How about using k-grams from the translation?

Copyright  2003 limsoon wong Amino-Acid Features

Copyright  2003 limsoon wong Amino-Acid Features

Copyright  2003 limsoon wong Amino Acid K-grams Discovered by Entropy

Copyright  2003 limsoon wong Results (on Pedersen & Nielsen’s mRNA) Performance based on the top 100 amino-acid features is better than performance based on DNA sequence features

Copyright  2003 limsoon wong Independent Validation Sets A. Hatzigeorgiou: –480 fully sequenced human cDNAs –188 left after eliminating sequences similar to training set (Pedersen & Nielsen’s) –3.42% of ATGs are TIS Our own: –well characterized human gene sequences from chromosome X (565 TIS) and chromosome 21 (180 TIS)

Copyright  2003 limsoon wong Validation Results (on Hatzigeorgiou’s) –Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

Copyright  2003 limsoon wong Validation Results (on Chr X and Chr 21) –Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset [chart comparing ATGpr and our method]

Copyright  2003 limsoon wong Recognition of Transcription Start Sites An introduction to the World’s best TSS recognition system A heavy tuning approach

Copyright  2003 limsoon wong Transcription Start Site

Copyright  2003 limsoon wong Approach taken in Dragon Multi-sensor integration via ANNs Multi-model system structure –for different sensitivity levels –for GC-rich and GC-poor promoter regions

Copyright  2003 limsoon wong Structure of Dragon Promoter Finder -200 to +50 window size Model selected based on desired sensitivity

Copyright  2003 limsoon wong Each model has two submodels based on GC content: a GC-rich submodel and a GC-poor submodel, where (C+G) = (#C + #G) / window size
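
A small sketch of this split; the 0.5 cut-off used to route a window to the GC-rich or GC-poor submodel is an assumed value for illustration, not a figure given in the lecture.

  def gc_content(window):
      # (C+G) = (#C + #G) / window size
      return (window.count("C") + window.count("G")) / len(window)

  def choose_submodel(window, cutoff=0.5):
      # cutoff is illustrative; Dragon's actual split value is not given here
      return "GC-rich" if gc_content(window) >= cutoff else "GC-poor"

  print(choose_submodel("GCGCGCATATGCGC"))   # GC-rich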

Copyright  2003 limsoon wong Data Analysis Within Submodel K-gram (k = 5) positional weight matrices for the promoter, exon, and intron sensors

Copyright  2003 limsoon wong Promoter, Exon, Intron Sensors These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers) They are calculated using promoter, exon, and intron data respectively: for an input window, the sensor sums, over each position i, the frequency (in the corresponding training windows) of the pentamer observed at position i of the input, normalised by the window size
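
A Python sketch of such a sensor, reconstructed from the labels on the slide (pentamer at position i of the input, frequency of that pentamer at position i in the training windows, normalised by window size); details such as smoothing are omitted, so treat it as an approximation rather than Dragon's actual code.

  def build_pwm(training_windows, k=5):
      # freq[i][kgram]: frequency of each pentamer at position i across the training windows
      length = len(training_windows[0]) - k + 1
      freq = [{} for _ in range(length)]
      for w in training_windows:
          for i in range(length):
              kgram = w[i:i + k]
              freq[i][kgram] = freq[i].get(kgram, 0.0) + 1.0 / len(training_windows)
      return freq

  def sensor_score(window, pwm, k=5):
      # sum the positional frequencies of the input's pentamers, per window size
      total = sum(pwm[i].get(window[i:i + k], 0.0) for i in range(len(pwm)))
      return total / len(pwm)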

Copyright  2003 limsoon wong Data Preprocessing & ANN The sensor outputs (s_P, s_E, s_I) are combined with tuned weights w_i as net = Σ s_i * w_i by a simple feedforward ANN trained by the Bayesian regularisation method; the output tanh(net), where tanh(x) = (e^x – e^–x) / (e^x + e^–x), is compared against a tuned threshold
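
A minimal sketch of that final combination step only; the weights and threshold would come from training and tuning, which is not shown, and the numbers below are placeholders.

  import math

  def dragon_output(sensor_scores, weights, threshold):
      # weighted sum of sensor scores squashed by tanh, then thresholded
      net = sum(s * w for s, w in zip(sensor_scores, weights))
      return math.tanh(net) >= threshold

  # placeholder promoter/exon/intron sensor scores, weights, and threshold
  print(dragon_output([0.8, 0.1, 0.05], [1.5, -1.0, -1.0], 0.5))   # True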

Copyright  2003 limsoon wong Accuracy Comparisons [charts: performance without C+G submodels vs. with C+G submodels]

Copyright  2003 limsoon wong Training Data Criteria & Preparation Contain both positive and negative sequences Sufficient diversity, resembling different transcription start mechanisms Sufficient diversity, resembling different non-promoters Sanitized as much as possible TSS taken from –793 vertebrate promoters from EPD –-200 to +50 bp of TSS non-TSS taken from –GenBank, –800 exons –4000 introns, –250 bp, –non-overlapping, –<50% identities

Copyright  2003 limsoon wong Tuning Data Preparation To tune adjustable system parameters in Dragon, we need a separate tuning data set TSS taken from –20 full-length gene seqs with known TSS –-200 to +50 bp of TSS –no overlap with EPD Non-TSS taken from –1600 human 3’UTR seqs –500 human exons –500 human introns –250 bp –no overlap

Copyright  2003 limsoon wong Testing Data Criteria & Preparation Seqs should be from the training or evaluation of other systems (no bias!) Seqs should be disjoint from training and tuning data sets Seqs should have TSS Seqs should be cleaned to remove redundancy, <50% identities 159 TSS from 147 human and human virus seqs, cumulative length of more than 1.15 Mbp Taken from GENSCAN, GeneId, Genie, etc.

Copyright  2003 limsoon wong Survey of Neural Network Based Systems for Recognizing Gene Features

Copyright  2003 limsoon wong NNPP (TSS Recognition) NNPP2.1 –use 3 time-delayed ANNs –recognize TATA-box, Initiator, and their mutual distance –Dragon is 8.82 times more accurate Makes about 1 prediction per 550 nt at 0.75 sensitivity

Copyright  2003 limsoon wong Promoter 2.0 (TSS Recognition) Promoter 2.0 –use ANN –recognize 4 signals commonly present in eukaryotic promoters: TATA-box, Initiator, GC-box, CCAAT-box, and their mutual distances –Dragon is 56.9 times more accurate

Copyright  2003 limsoon wong Promoter Inspector (TSS Recognition) Statistics-based The most accurate reported system for finding promoter regions Uses sensors for promoters, exons, introns, 3’UTRs Strong bias for CpG-related promoters Dragon is 6.88 times better –to compare with Dragon, we consider Promoter Inspector to have made a correct prediction if the TSS falls within a promoter region predicted by Promoter Inspector

Copyright  2003 limsoon wong Grail’s Promoter Prediction Module Makes about 1 prediction per nt at 0.66 sensitivity

Copyright  2003 limsoon wong LVQ Networks for TATA Recognition Achieves 0.33 sensitivity at 47 FP on Fickett & Hatzigeorgiou 1997

Copyright  2003 limsoon wong Hatzigeorgiou’s DIANA-TIS Get local TIS score of ATG and -7 to +5 bases flanking Get coding potential of 60 in-frame bases up-stream and down-stream Get coding score by subtracting down-stream from up-stream ATG may be TIS if product of two scores is > 0.2 Choose the 1st one

Copyright  2003 limsoon wong Pedersen & Nielson’s NetStart Predict TIS by ANN -100 to +100 bases as input window feedforward 3 layer ANN 30 hidden neurons sensitivity = 78% specificity = 87%

Copyright  2003 limsoon wong Notes

Copyright  2003 limsoon wong References (expt design, result interpretation) John A. Swets, “Measuring the accuracy of diagnostic systems”, Science 240: , June 1988 Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, Chapters 1, 7 Lance D. Miller et al., “Optimal gene expression analysis by microarrays”, Cancer Cell 2: , November 2002

Copyright  2003 limsoon wong References (TIS recognition) A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5: , 1997 L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13: , 2002 A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16: , 2000 A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18: , 2002

Copyright  2003 limsoon wong References (TSS Recognition) V.B.Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21: , 2003 V.B.Bajic et al., “Dragon Promoter Finder: Recognition of vertebrate RNA polymerase II promoters”, Bioinformatics 18: , V.B.Bajic et al., “Intelligent system for vertebrate promoter recognition”, IEEE Intelligent Systems 17:64--70, 2002.

Copyright  2003 limsoon wong References (TSS Recognition) J.W.Fickett, A.G.Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7: , 1997 Y.Xu, et al., “GRAIL: A multi-agent neural network system for gene identification”, Proc. IEEE 84: , 1996 M.G.Reese, “Application of a time-delay neural network to promoter annotation in the D. melanogaster genome”, Comp. & Chem. 26:51--56, 2001 A.G.Pedersen et al., “The biology of eukaryotic promoter prediction---a review”, Comp. & Chem. 23: , 1999

Copyright  2003 limsoon wong References (TSS Recognition) S.Knudsen, “Promoter 2.0 for the recognition of Pol II promoter sequences”, Bioinformatics 15: , H.Wang, “Statistical pattern recognition based on LVQ ANN: Application to TATA-box motif”, M.Tech Thesis, Technikon Natal, South Africa M.Scherf, et al., “Highly specific localisation of promoter region in large genome sequences by Promoter Inspector: A novel context analysis approach”, JMB 297: , 2000

Copyright  2003 limsoon wong References (feature selection) M. A. Hall, “Correlation-based feature selection for machine learning”, PhD thesis, Dept of Comp. Sci., Univ. of Waikato, New Zealand, 1998 U. M. Fayyad, K. B. Irani, “Multi-interval discretization of continuous-valued attributes”, IJCAI 13: , 1993 H. Liu, R. Setiono, “Chi2: Feature selection and discretization of numeric attributes”, IEEE Intl. Conf. Tools with Artificial Intelligence 7: , 1995

Copyright  2003 limsoon wong Acknowledgements (for TIS) A.G. Pedersen H. Nielsen Roland Yap Fanfan Zeng Jinyan Li Huiqing Liu

Copyright  2003 limsoon wong Acknowledgements (for TSS)

Copyright  2003 limsoon wong The lecture will be on Thursday, 20th, from pm at LT07 (which is very close to the entrance of the school of computer engineering, 2nd floor). If you come to my office Blk N4 - 2a05, which is close to the General Office of SCE, I am happy to usher you to the lecture theater.