GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences
Authors: Michael M. Yin and Jason T. L. Wang
Source: Information Sciences, 163(1-3), pp
Advisor: Min-Shiang Hwang
Speaker: Chun-Ta Li
2 Outline
Introduction
Related work
The proposed approach
Experiments and results
Conclusion
Comments
3 Introduction – 1/4
Data mining – knowledge discovery from data
Data mining in life sciences:
–Finding clustering rules for gene expressions
–Discovering classification rules for proteins
–Detecting associations between metabolic pathways
–Predicting genes in genomic DNA sequences
4 Introduction – 2/4
A genomic DNA sequence
–Four types of nucleotides (A, C, G, T)
The basic structure for a vertebrate gene
A sequence fragment containing an exon of 296 nucleotides
Terms: codon, introns, exons (coding sequences), donor
5 Introduction – 3/4
[Figure: coding region]
6 Introduction – 4/4
A number of programs have been developed for locating gene coding regions (exons).
Why they are insufficient:
–The vertebrate DNA sequence signals involved in gene determination are usually ill-defined.
–Automated interpretation of genomic data without experimental validation is still a myth.
Motivation:
–GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures.
–Exon: start sites, junction donor sites, acceptor sites
7 Related work – 1/2
NN-based techniques (neural networks)
–Gene structure prediction
–Training
8 Related work – 2/2
HMM-based techniques (hidden Markov models)
–Describe sequential data or processes
–Use a number of states
–Probabilistic state transitions
–Example: rolling a die that is in one of two hidden states, Normal or Fake
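The die example can be made concrete with a small sketch. It assumes two hidden states, Normal and Fake, and decodes the most probable state sequence with the Viterbi algorithm. All probabilities are illustrative assumptions (the slides give none), and the function name viterbi is hypothetical.

```python
import math

STATES = ("Normal", "Fake")

# Illustrative assumptions only: none of these numbers come from the slides.
START = {"Normal": 0.5, "Fake": 0.5}
TRANS = {
    "Normal": {"Normal": 0.95, "Fake": 0.05},
    "Fake": {"Normal": 0.10, "Fake": 0.90},
}
EMIT = {
    "Normal": {face: 1 / 6 for face in range(1, 7)},          # fair die
    "Fake": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},  # loaded toward 6
}


def viterbi(rolls):
    """Return the most probable hidden state sequence for the observed rolls."""
    # score[s] = best log-probability of any state path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][roll]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these assumed numbers, a long run of sixes is decoded as Fake and a run of mixed low faces as Normal. This is the same kind of inference an HMM performs when labelling positions of a DNA sequence.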
9 The proposed approach – 1/4
HMM models for predicting functional sites
–Start Site Model
[Figure labels: start codon, 11]
10 The proposed approach – 2/4
An HMM model for computing coding potentials
–The Codon Model
If the first state is base T and the second state is base A or G, the third state can only be C or T (A and G are not defined).
This excludes TAA, TAG, TGA and TGG; the first three are the stop codons, while TGG normally codes for tryptophan.
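The codon rule can be checked by enumerating all 64 codons. The sketch below only encodes the rule as stated on the slide; it is not the HMM itself, and the name allowed_by_codon_rule is hypothetical. It also shows that the rule rejects four codons, of which only TAA, TAG and TGA are stop codons in the standard genetic code.

```python
from itertools import product

# The three stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def allowed_by_codon_rule(codon):
    """Slide rule: if base 1 is T and base 2 is A or G, base 3 must be C or T."""
    codon = codon.upper()
    if len(codon) != 3 or any(base not in "ACGT" for base in codon):
        raise ValueError(f"not a DNA codon: {codon!r}")
    if codon[0] == "T" and codon[1] in "AG":
        return codon[2] in "CT"
    return True


# Enumerate all 64 codons and collect the ones the rule rejects.
ALL_CODONS = ["".join(bases) for bases in product("ACGT", repeat=3)]
REJECTED = sorted(c for c in ALL_CODONS if not allowed_by_codon_rule(c))
```

REJECTED contains the three stop codons plus TGG, so the rule is slightly stricter than "no in-frame stop codon".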
11 The proposed approach – 3/4
Graph representation of the gene detection problem
–DNA sequence → directed acyclic graph → dynamic programming algorithm → optimal path
–Candidate exon, candidate intron, candidate gene
[Figure legend: intron, exon]
12 The proposed approach – 4/4
A dynamic programming algorithm
–Weight of the vertex v: W(v)
–Weight of the edge (v1, v2): W(v1, v2)
[Figure labels: stop, acceptor, start, acceptor, donor, acceptor]
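A minimal sketch of dynamic programming over a weighted directed acyclic graph follows. It rests on assumptions the slides do not state: a path's score is the sum of its vertex weights W(v) and edge weights W(v1, v2), the optimal path is the one with the maximum score, and vertices are supplied in topological order (for example, sorted by sequence position). The function name best_path and the example weights are hypothetical.

```python
def best_path(vertex_weight, edge_weight):
    """Return (score, path) of the maximum-weight path in a DAG.

    vertex_weight: dict v -> W(v); insertion order must be a topological order.
    edge_weight:   dict (v1, v2) -> W(v1, v2).
    """
    order = list(vertex_weight)
    incoming = {}
    for (u, v), w in edge_weight.items():
        incoming.setdefault(v, []).append((u, w))

    # best[v] = score of the best path ending at v; a path may start anywhere.
    best = {v: vertex_weight[v] for v in order}
    prev = {v: None for v in order}
    for v in order:
        for u, w in incoming.get(v, []):
            candidate = best[u] + w + vertex_weight[v]
            if candidate > best[v]:
                best[v], prev[v] = candidate, u

    # Trace back from the best-scoring end vertex.
    end = max(order, key=best.get)
    path, v = [], end
    while v is not None:
        path.append(v)
        v = prev[v]
    return best[end], path[::-1]


# Hypothetical candidate sites and weights, for illustration only.
vertex_weight = {"start": 1.0, "donor_a": 2.0, "donor_b": 0.5,
                 "acceptor": 1.0, "stop": 1.0}
edge_weight = {("start", "donor_a"): 1.0, ("start", "donor_b"): 3.0,
               ("donor_a", "acceptor"): 0.5, ("donor_b", "acceptor"): 0.5,
               ("acceptor", "stop"): 2.0}
score, path = best_path(vertex_weight, edge_weight)
```

Because each vertex is processed after all of its predecessors, every edge is examined once, so the running time is linear in the number of vertices plus edges.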
13 Experiments and results – 1/3
Data:
–GenBank
  570 vertebrate sequences, 28,992,149 nucleotides
  2649 exons, 444,498 nucleotides
–Start codon: ATG
–Donor site: GT
–Acceptor site: AG
Evaluation method:
–10-way cross-validation
–570 sequences → 10 sets: 9 sets of training data, 1 set of test data
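The 10-way split can be sketched as follows. The slides do not say how the 570 sequences were assigned to the 10 sets, so the seeded random shuffle is an assumption, and the function name cross_validation_splits is hypothetical.

```python
import random


def cross_validation_splits(items, k=10, seed=0):
    """Yield (train, test) pairs: each of the k sets is the test set once."""
    items = list(items)
    random.Random(seed).shuffle(items)        # assumption: random assignment
    folds = [items[i::k] for i in range(k)]   # k sets of near-equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# 570 placeholder sequence IDs: each round trains on 513 and tests on 57.
splits = list(cross_validation_splits(range(570)))
```

Every sequence appears in exactly one test set, so the reported accuracy covers the whole data set without testing on training data.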
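14 Experiments and results – 2/3
Evaluation measures:
–Rate of correctly recognized nucleotides
–Rate of correctly recognized nucleotides relative to the rate of nucleotides mistakenly recognized
–Overall prediction accuracy at the nucleotide level (ranges from 1 to -1)
–Rate of correctly recognized exons
–Rate of correctly recognized exons relative to the rate of exons mistakenly recognized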
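The nucleotide-level measures read like the ones commonly used in gene-prediction evaluation: sensitivity, specificity (defined in that literature as TP / (TP + FP)), and a correlation coefficient ranging from 1 to -1. Assuming that is what the slide intends, they could be computed as in the sketch below; the function name and the example counts are hypothetical.

```python
import math


def nucleotide_measures(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    tp: coding nucleotides predicted as coding
    fp: non-coding nucleotides predicted as coding
    tn: non-coding nucleotides predicted as non-coding
    fn: coding nucleotides predicted as non-coding
    Assumes the counts make every denominator non-zero.
    """
    sensitivity = tp / (tp + fn)   # share of true coding nucleotides found
    specificity = tp / (tp + fp)   # share of predicted coding that is correct
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sensitivity, specificity, cc


# Made-up counts, for illustration only.
sn, sp, cc = nucleotide_measures(tp=80, fp=20, tn=880, fn=20)
```

The correlation coefficient is 1 for a perfect prediction and -1 when every nucleotide is labelled wrongly, which matches the range given on the slide.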
15 Experiments and results – 3/3
For 8 sequences, GeneScout correctly detected about 85% of the nucleotides, but GeneScan did not correctly predict any coding nucleotide.
GeneScout runs much faster than GeneScan.
16 Conclusion
GeneScout uses hidden Markov models to detect functional sites.
A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path
Experimental results show that GeneScout can detect 51% of the exons in the data set.
17 Comments
Enhancing the accuracy of detection in DNA sequences:
–More models or rules
–Association rules: known exons → rules
–Rules + DNA sequences → candidate exons