GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences Authors: Michael M. Yin and Jason T. L. Wang Sources: Information.

1 GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences Authors: Michael M. Yin and Jason T. L. Wang Sources: Information Sciences, 163(1-3), pp. 201-218, 2004. Advisor: Min-Shiang Hwang Speaker: Chun-Ta Li

2 Outline: Introduction; Related work; The proposed approach; Experiments and results; Conclusion; Comments

3 Introduction – 1/4
Data mining: knowledge discovery from data
Data mining in life sciences:
–Finding clustering rules for gene expressions
–Discovering classification rules for proteins
–Detecting associations between metabolic pathways
–Predicting genes in genomic DNA sequences

4 Introduction – 2/4
A genomic DNA sequence
–Four types of nucleotides (A, C, G, T)
The basic structure for a vertebrate gene
A sequence fragment containing an exon of 296 nucleotides
Terms: codon, introns, exons (coding sequences), donor

5 Introduction – 3/4
[Figure: coding region]

6 Introduction – 4/4
A number of programs have been developed for locating gene coding regions (exons).
Why they are insufficient:
–The vertebrate DNA sequence signals involved in gene determination are usually ill defined.
–Automated interpretation of genomic data without experimental validation is still a myth.
Motivation:
–GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures.
–Exon: start sites, junction donor sites, and acceptor sites

7 Related work – 1/2
NN-based techniques (neural networks)
–Gene structure prediction
–Training

8 Related work – 2/2
HMM-based techniques (hidden Markov models)
–Describe sequential data or processes
–Use a number of states
–Probabilistic state transitions
–Example: casting a die (states: Normal, Fake)
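The dice example can be made concrete with a small sketch. It assumes the two hidden states are a fair ("Normal") die and a loaded ("Fake") die, and uses the Viterbi algorithm to recover the most likely state sequence from the observed rolls. Every probability below is an illustrative assumption, not a value from the paper or the slides.

```python
import math

# Hypothetical parameters: all probabilities are illustrative assumptions.
STATES = ("Normal", "Fake")
START = {"Normal": 0.5, "Fake": 0.5}
TRANS = {
    "Normal": {"Normal": 0.95, "Fake": 0.05},
    "Fake": {"Normal": 0.10, "Fake": 0.90},
}
EMIT = {
    "Normal": {face: 1 / 6 for face in range(1, 7)},
    # The loaded die is assumed to show a six half of the time.
    "Fake": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},
}


def viterbi(rolls):
    """Return the most probable hidden-state path for a list of die rolls."""
    if not rolls:
        return []
    # score[s] = best log-probability of any state path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # one back-pointer table per roll after the first
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][roll]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these assumed parameters, a long run of sixes is decoded as the Fake state, while a run without sixes is decoded as Normal.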

9 The proposed approach – 1/4
HMM models for predicting functional sites
–Start Site Model
[Figure: Start Site Model; labels: Start codon, 11]

10 The proposed approach – 2/4
An HMM model for computing coding potentials
–The Codon Model
When the first state is base T and the second state is base A or G, the third state can only be C or T (A and G are not defined).
Excluded codons: the stop codons TAA, TAG, and TGA, and also TGG
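The constraint on this slide can be sketched as a simple check. This is only an illustration of the rule as stated here, under the assumption that it restricts nothing except codons beginning with T followed by A or G; the function name is hypothetical and the real Codon Model is a probabilistic HMM, which this sketch does not reproduce.

```python
BASES = "ACGT"


def allowed_by_codon_model(codon):
    """Check a codon against the constraint as stated on the slide.

    Assumption: the rule only restricts codons whose first base is T and
    whose second base is A or G; every other codon is treated as allowed.
    """
    if len(codon) != 3 or any(base not in BASES for base in codon):
        raise ValueError(f"not a codon: {codon!r}")
    first, second, third = codon
    if first == "T" and second in "AG":
        return third in "CT"
    return True


# Enumerate all 64 codons and collect the ones the rule rejects.
excluded = [a + b + c for a in BASES for b in BASES for c in BASES
            if not allowed_by_codon_model(a + b + c)]
```

Enumerating all 64 codons this way yields exactly the four codons TAA, TAG, TGA, and TGG.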

11 The proposed approach – 3/4
Graph representation of the gene detection problem
–DNA sequence → directed acyclic graph → dynamic programming algorithm → optimal path
–Candidate exon, candidate intron, candidate gene
[Figure legend: intron, exon]
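Building such a graph starts from candidate functional sites. The sketch below assumes candidates are found by plain consensus matching, using the signals the experiments slide lists (ATG for start codons, GT for donor sites, AG for acceptor sites). GeneScout itself scores sites with HMMs, which this sketch does not attempt; the function name is hypothetical.

```python
def find_candidate_sites(seq):
    """Return 0-based positions of consensus matches in a DNA string."""
    seq = seq.upper()
    patterns = {"start": "ATG", "donor": "GT", "acceptor": "AG"}
    sites = {name: [] for name in patterns}
    for name, pattern in patterns.items():
        pos = seq.find(pattern)
        while pos != -1:
            sites[name].append(pos)
            pos = seq.find(pattern, pos + 1)  # step by one to allow overlaps
    return sites


# Made-up sequence used only to demonstrate the call.
sites = find_candidate_sites("CCATGGTAAGTCAGG")
```

Consensus matching alone produces many false candidates on real genomic DNA, which is why a scoring model is needed to weight them.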

12 The proposed approach – 4/4
A dynamic programming algorithm
–Weight of the vertex v: W(v)
–Weight of the edge (v1, v2): W(v1, v2)
[Figure labels: stop, acceptor, start, acceptor, donor, acceptor]
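A minimal sketch of the dynamic programming step, assuming the optimal path is the one that maximizes the sum of vertex weights W(v) and edge weights W(v1, v2), and that the vertices are supplied in topological order. The function name, vertex names, and weights are made up for illustration and are not taken from the paper.

```python
def best_path(order, vertex_w, edge_w):
    """Maximum-weight path in a directed acyclic graph.

    order    -- vertices in topological order
    vertex_w -- dict: vertex -> W(v)
    edge_w   -- dict: (v1, v2) -> W(v1, v2), one entry per directed edge
    Returns (total weight, list of vertices on the path).
    """
    preds = {v: [] for v in order}
    for (u, v) in edge_w:
        preds[v].append(u)
    best = {}    # best[v] = weight of the heaviest path ending at v
    parent = {}  # parent[v] = previous vertex on that path, or None
    for v in order:
        best[v], parent[v] = vertex_w[v], None
        for u in preds[v]:
            cand = best[u] + edge_w[(u, v)] + vertex_w[v]
            if cand > best[v]:
                best[v], parent[v] = cand, u
    # Trace back from the vertex with the heaviest path.
    node = max(order, key=lambda v: best[v])
    total = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return total, path[::-1]


# Hypothetical toy graph: four sites with made-up weights.
order = ["start", "donor", "acceptor", "stop"]
vertex_w = {"start": 2.0, "donor": 1.0, "acceptor": 1.5, "stop": 1.0}
edge_w = {
    ("start", "donor"): 3.0,
    ("donor", "acceptor"): 0.5,
    ("acceptor", "stop"): 2.0,
    ("start", "stop"): 1.0,
}
total, path = best_path(order, vertex_w, edge_w)
```

Because every vertex is visited once in topological order and every edge is examined once, the running time is linear in the size of the graph.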

13 Experiments and results – 1/3
Data:
–GenBank → 570 vertebrate sequences → 28,992,149 nucleotides → 2649 exons → 444,498 nucleotides
–Start codon: ATG
–Donor site: GT
–Acceptor site: AG
Evaluation method:
–10-way cross-validation
–570 sequences → 10 sets; 9 sets → training data, 1 set → test data
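The 10-way cross-validation scheme can be sketched as follows. How the 570 sequences were actually assigned to the 10 sets is not stated on the slide; the seeded random shuffle here is an assumption, and the function name is hypothetical.

```python
import random


def cross_validation_splits(items, k=10, seed=0):
    """Yield (training, test) lists for k-way cross-validation.

    Assumption: items are shuffled with a fixed seed and dealt into k sets
    of near-equal size; each set serves as the test data exactly once.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


# 570 sequence indices stand in for the real sequences.
splits = list(cross_validation_splits(range(570)))
```

With 570 sequences and 10 sets, each round trains on 513 sequences and tests on the remaining 57.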

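14 Experiments and results – 2/3
Evaluation measures:
–The proportion of nucleotides correctly identified
–The proportion of nucleotides correctly identified relative to the proportion mistakenly identified as nucleotides
–The overall prediction accuracy at the nucleotide level (from 1 to -1)
–The proportion of exons correctly identified
–The proportion of exons correctly identified relative to the proportion mistakenly identified as exons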
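The nucleotide-level measures on this slide appear to match the sensitivity, specificity, and correlation coefficient commonly used to evaluate gene predictors. The sketch below assumes those standard definitions, with specificity taken as TP / (TP + FP) as is usual in gene prediction; the slide's own formulas are not available, and the function name is hypothetical.

```python
import math


def nucleotide_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Counts are nucleotides:
      tp -- coding, predicted coding
      fp -- non-coding, predicted coding
      tn -- non-coding, predicted non-coding
      fn -- coding, predicted non-coding
    """
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc
```

Under these definitions the correlation coefficient is 1 for a perfect prediction and -1 for a completely inverted one, which agrees with the range given on the slide.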

15 Experiments and results – 3/3
On 8 sequences, GeneScout correctly detected about 85% of the nucleotides, whereas GeneScan did not correctly predict any coding nucleotide.
GeneScout runs much faster than GeneScan.

16 Conclusion
GeneScout uses hidden Markov models to detect functional sites.
A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path
Experimental results show that GeneScout can detect 51% of the exons in the data set.

17 Comments
Enhancing the accuracy of detection in DNA sequences:
–More models or rules
–Association rules → known exons → rules
–Rules → DNA sequences → candidate exons

