Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician.

Slides:



Advertisements
Similar presentations
An Introduction to Bioinformatics Finding genes in prokaryotes.
Advertisements

Bioinformatics as Hard Disk Investigation Assuming you can read all the bits on a 1000 year old hard drive Can you figure out what does what? - Distinguish.
CSCE555 Bioinformatics Lecture 3 Gene Finding Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Finding Eukaryotic Open reading frames.
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
Gene Identification Lab
Gene Finding Charles Yan.
CSE182-L12 Gene Finding.
Comparative ab initio prediction of gene structures using pair HMMs
Lecture 12 Splicing and gene prediction in eukaryotes
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Biological Motivation Gene Finding in Eukaryotic Genomes
Finding prokaryotic genes and non intronic eukaryotic genes
Exciting Bioinformatics Adventures Limsoon Wong Institute for Infocomm Research.
International Livestock Research Institute, Nairobi, Kenya. Introduction to Bioinformatics: NOV David Lynn (M.Sc., Ph.D.) Trinity College Dublin.
Knowledge Discovery in Biomedicine Limsoon Wong Institute for Infocomm Research.
Genome Annotation BBSI July 14, 2005 Rita Shiang.
Copyright © 2004 by Limsoon Wong Gene Finding & Gene Feature Recognition by Computational Analysis Limsoon Wong Institute for Infocomm Research November.
BME 110L / BIOL 181L Computational Biology Tools October 29: Quickly that demo: how to align a protein family (10/27)
Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.
BME 110L / BIOL 181L Computational Biology Tools February 19: In-class exercise: a phylogenetic tree for that.
Selection of Patient Samples and Genes for Disease Prognosis Limsoon Wong Institute for Infocomm Research Joint work with Jinyan Li & Huiqing Liu.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
Copyright  2004 limsoon wong A Practical Introduction to Bioinformatics Limsoon Wong Institute for Infocomm Research Lecture 2, May 2004 For written notes.
Bertinoro, Nov 2005 Some Data Mining Challenges Learned From Bioinformatics & Actions Taken Limsoon Wong National University of Singapore.
1 Transcript modeling Brent lab. 2 Overview Of Entertainment  Gene prediction Jeltje van Baren  Improving gene prediction with tiling arrays Aaron Tenney.
Copyright  2003 limsoon wong From Informatics to Bioinformatics: The Knowledge Discovery Perspective Limsoon Wong Institute for Infocomm Research Singapore.
Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.
Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Predicting protein degradation rates Karen Page. The central dogma DNA RNA protein Transcription Translation The expression of genetic information stored.
Copyright  2003 limsoon wong Recognition of Gene Features Limsoon Wong Institute for Infocomm Research BI6103 guest lecture on ?? February 2004 For written.
A Biology Primer Part III: Transcription, Translation, and Regulation Vasileios Hatzivassiloglou University of Texas at Dallas.
Background & Motivation Problem & Feature Construction Experiments Design & Results Conclusions and Future Work Exploring Alternative Splicing Features.
Mark D. Adams Dept. of Genetics 9/10/04
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
From Genomes to Genes Rui Alves.
Genome Annotation Haixu Tang School of Informatics.
Eukaryotic Gene Prediction Rui Alves. How are eukaryotic genes different? DNA RNA Pol mRNA Ryb Protein.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
CS5238 Combinatorial methods in bioinformatics
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
Splice Site Recognition in DNA Sequences Using K-mer Frequency Based Mapping for Support Vector Machine with Power Series Kernel Dr. Robertas Damaševičius.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine 朱林娇 14S
Applied Bioinformatics
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
(H)MMs in gene prediction and similarity searches.
Finding genes in the genome
Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics.
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong For written notes on this lecture, please read chapter 3 of The Practical Bioinformatician, CS2220:
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Copyright  2004 limsoon wong CS2220: Computation Foundation in Bioinformatics Limsoon Wong Institute for Infocomm Research Lecture slides for 13 January.
Introduction to molecular biology Data Mining Techniques.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Limsoon Wong Laboratories for Information Technology Singapore From Informatics to Bioinformatics.
Bacterial infection by lytic virus
bacteria and eukaryotes
Bacterial infection by lytic virus
An Artificial Intelligence Approach to Precision Oncology
Fanfan Zeng & Roland Yap National University of Singapore Limsoon Wong
Gene architecture and sequence annotation
Recitation 7 2/4/09 PSSMs+Gene finding
DNA Crash course…..
The Toy Exon Finder.
Presentation transcript:

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician CS2220: Computation Foundation in Bioinformatics Limsoon Wong Institute for Infocomm Research Lecture slides for 4 February 2005

Course Plan

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Central Dogma of Molecular Biology

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong What is a gene?

Central Dogma

Transcription: DNA  nRNA

Splicing: nRNA  mRNA

Translation: mRNA  protein F L I M V S P T A Y H Q N K D E C W R G A T E L R S stop

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong What does DNA data look like? A sample GenBank record from NCBI rieve&db=nucleotide&list_uids= &dopt=GenB ank

What does protein data look like? A sample GenPept record from NCBI rieve&db=protein&list_uids= &dopt=GenPept

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Recognition of Translation Initiation Sites An introduction to the World’s simplest TIS recognition system A simple approach to accuracy and understandability

Translation Initiation Site

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong A Sample cDNA 299 HSU CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE What makes the second ATG the TIS?

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Approach Training data gathering Signal generation  k-grams, distance, domain know-how,... Signal selection  Entropy,  2, CFS, t-test, domain know-how... Signal integration  SVM, ANN, PCL, CART, C4.5, kNN,...

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Training & Testing Data Vertebrate dataset of Pedersen & Nielsen [ISMB’97] 3312 sequences ATG sites 3312 (24.5%) are TIS (75.5%) are non-TIS Use for 3-fold x-validation expts

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Generation K-grams (ie., k consecutive letters) –K = 1, 2, 3, 4, 5, … –Window size vs. fixed position –Up-stream, downstream vs. any where in window –In-frame vs. any frame

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Generation: An Example 299 HSU CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT Window =  100 bases In-frame, downstream –GCT = 1, TTT = 1, ATG = 1… Any-frame, downstream –GCT = 3, TTT = 2, ATG = 2… In-frame, upstream –GCT = 2, TTT = 0, ATG = 0,...

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Too Many Signals For each value of k, there are 4 k * 3 * 2 k-grams If we use k = 1, 2, 3, 4, 5, we have = 8184 features! This is too many for most machine learning algorithms

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Selection (Basic Idea) Choose a signal w/ low intra-class distance Choose a signal w/ high inter-class distance

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Selection (eg., t-statistics)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Selection (eg., MIT-correlation)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Selection (eg.,  2)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Selection (eg., CFS) Instead of scoring individual signals, how about scoring a group of signals as a whole? CFS –Correlation-based Feature Selection –A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Sample k-grams Selected by CFS Position – 3 in-frame upstream ATG in-frame downstream –TAA, TAG, TGA, –CTG, GAC, GAG, and GCC Kozak consensus Leaky scanning Stop codon Codon bias?

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Integration kNN Given a test sample, find the k training samples that are most similar to it. Let the majority class win. SVM Given a group of training samples from two classes, determine a separating plane that maximises the margin of error. Naïve Bayes, ANN, C4.5,...

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Results (3-fold x-validation)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Improvement by Voting Apply any 3 of Naïve Bayes, SVM, Neural Network, & Decision Tree. Decide by majority.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Improvement by Scanning Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS. Naïve Bayes & SVM models were trained using TIS vs. Up-stream ATG

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Performance Comparisons * result not directly comparable

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Technique Comparisons Pedersen&Nielsen [ISMB’97] –Neural network –No explicit features Zien [Bioinformatics’00] –SVM+kernel engineering –No explicit features Hatzigeorgiou [Bioinformatics’02] –Multiple neural networks –Scanning rule –No explicit features Our approach –Explicit feature generation –Explicit feature selection –Use any machine learning method w/o any form of complicated tuning –Scanning rule is optional

mRNA  protein F L I M V S P T A Y H Q N K D E C W R G A T E L R S stop How about using k-grams from the translation?

Amino-Acid Features

Amino Acid K-grams Discovered (by entropy)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Independent Validation Sets A. Hatzigeorgiou: –480 fully sequenced human cDNAs –188 left after eliminating sequences similar to training set (Pedersen & Nielsen’s) –3.42% of ATGs are TIS Our own: –well characterized human gene sequences from chromosome X (565 TIS) and chromosome 21 (180 TIS)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Validation Results (on Hatzigeorgiou’s) –Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Validation Results (on Chr X and Chr 21) –Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s ATGpr Our method

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Recognition of Transcription Start Sites An introduction to the World’s best TSS recognition system A heavy tuning approach

Transcription Start Site

Structure of Dragon Promoter Finder -200 to +50 window size Model selected based on desired sensitivity

Each model has two submodels based on GC content GC-rich submodel GC-poor submodel (C+G) = #C + #G Window Size

Data Analysis Within Submodel K-gram (k = 5) positional weight matrix pp ee ii

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Promoter, Exon, Intron Sensors These sensors are positional weight matrices of k- grams, k = 5 (aka pentamers) They are calculated as  below using promoter, exon, intron data respectively Pentamer at i th position in input j th pentamer at i th position in training window Frequency of jth pentamer at ith position in training window Window size 

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Data Preprocessing & ANN Tuning parameters tanh(x) = e x  e -x e x  e -x s IE sIsI sEsE tanh(net) Simple feedforward ANN trained by the Bayesian regularisation method wiwi net =  s i * w i Tuned threshold

Accuracy Comparisons without C+G submodels with C+G submodels

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Training Data Criteria & Preparation Contain both positive and negative sequences Sufficient diversity, resembling different transcription start mechanisms Sufficient diversity, resembling different non- promoters Sanitized as much as possible TSS taken from –793 vertebrate promoters from EPD –-200 to +50 bp of TSS non-TSS taken from –GenBank, –800 exons –4000 introns, –250 bp, –non-overlapping, –<50% identities

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Tuning Data Preparation To tune adjustable system parameters in Dragon, we need a separate tuning data set TSS taken from –20 full-length gene seqs with known TSS –-200 to +50 bp of TSS –no overlap with EPD Non-TSS taken from –1600 human 3’UTR seqs –500 human exons –500 human introns –250 bp –no overlap

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Testing Data Criteria & Preparation Seqs should be from the training or evaluation of other systems (no bias!) Seqs should be disjoint from training and tuning data sets Seqs should have TSS Seqs should be cleaned to remove redundancy, <50% identities 159 TSS from 147 human and human virus seqs cummulative length of more than 1.15Mbp Taken from GENESCAN, GeneId, Genie, etc.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Accuracy on Human Chromosome 22

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Other Gene Features

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Notes

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Exercises Wrt slide 13, what are the possible k-grams (k = 3) for that cDNA? wrt slide 13, what are the first 10 amino acids of that cDNA What is the purpose of submodels based on GC- content in Dragon Promoter Finder? Find out how does CFS work Wrt slide 45, how should we choose the pentamers to train the 3 sensors? Summarise key points of this lecture

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong References (TIS Recognition) A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5: , 1997 L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13: , 2002 A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16: , 2000 A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18: , 2002

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong References (TSS Recognition) V.B.Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21: , 2003 J.W.Fickett, A.G.Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7: , 1997 A.G.Pedersen et al., “The biology of eukaryotic promoter prediction---a review”, Computer & Chemistry 23: , 1999 M.Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297: , 2000

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong References (Feature Selection) M. A. Hall, “Correlation-based feature selection machine learning”, PhD thesis, Dept of Comp. Sci., Univ. of Waikato, New Zealand, 1998 U. M. Fayyad, K. B. Irani, “Multi-interval discretization of continuous-valued attributes”, IJCAI 13: , 1993 H. Liu, R. Sentiono, “Chi2: Feature selection and discretization of numeric attributes”, IEEE Intl. Conf. Tools with Artificial Intelligence 7: , 1995

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong References (Misc.) C. P. Joshi et al., “Context sequences of translation initiation codon in plants”, PMB 35: , 1997 D. J. States, W. Gish, “Combined use of sequence similarity and codon bias for coding region identification”, JCB 1: , 1994 G. D. Stormo et al., “Use of Perceptron algorithm to distinguish translational initiation sites in E. coli”, NAR 10: , 1982 J. E. Tabaska, M. Q. Zhang, “Detection of polyadenylation signals in human DNA sequences”, Gene 231:77--86, 1999

Acknowledgements (for TIS) A.G. Pedersen H. Nielsen Roland Yap Fanfan Zeng Jinyan Li Huiqing Liu

Acknowledgements (for TSS)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong 12-2pm every Tuesday at LT33 of SOC