GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences Authors: Michael M. Yin and Jason T. L. Wang Sources: Information.

Presentation transcript:

GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences
Authors: Michael M. Yin and Jason T. L. Wang
Source: Information Sciences, 163(1-3)
Advisor: Min-Shiang Hwang
Speaker: Chun-Ta Li

2 Outline
Introduction
Related work
The proposed approach
Experiments and results
Conclusion
Comments

3 Introduction – 1/4
Data mining: knowledge discovery from data
Data mining in life sciences:
  –Finding clustering rules for gene expressions
  –Discovering classification rules for proteins
  –Detecting associations between metabolic pathways
  –Predicting genes in genomic DNA sequences

4 Introduction – 2/4
A genomic DNA sequence
  –Four types of nucleotides (A, C, G, T)
The basic structure of a vertebrate gene
[Figure: a sequence fragment containing an exon of 296 nucleotides]
Terms shown on the slide: codon, introns, exons (coding sequences), donor

5 Introduction – 3/4
[Figure: coding region]

6 Introduction – 4/4
A number of programs have been developed for locating gene coding regions (exons).
They remain insufficient:
  –The vertebrate DNA sequence signals involved in gene determination are usually ill defined.
  –Automated interpretation of genomic data without experimental validation is still a myth.
Motivation:
  –GeneScout: developing accurate methods for automatically detecting vertebrate genomic DNA structures.
  –Exon boundaries: start sites, donor sites, and acceptor sites

7 Related work – 1/2
NN-based techniques (neural networks)
  –Gene structure prediction
  –Training

8 Related work – 2/2
HMM-based techniques (hidden Markov models)
  –Describe sequential data or processes
  –Use a number of states
  –Probabilistic state transitions
  –Example: rolling a die that is either in a "Normal" state or a "Fake" state
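
The die example can be made concrete with a small two-state HMM. The following is only a minimal sketch: the state names come from the slide, but every probability (start, transition, and emission) is an illustrative assumption, and the function `viterbi` is a hypothetical helper rather than anything taken from GeneScout. It recovers the most probable hidden state sequence for a list of observed rolls.

```python
import math

STATES = ("Normal", "Fake")

# All probabilities below are illustrative assumptions, not values from the slides.
START = {"Normal": 0.5, "Fake": 0.5}
TRANS = {
    "Normal": {"Normal": 0.95, "Fake": 0.05},
    "Fake": {"Normal": 0.10, "Fake": 0.90},
}
EMIT = {
    "Normal": {face: 1 / 6 for face in range(1, 7)},        # fair die
    "Fake": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},  # loaded toward 6
}


def viterbi(rolls):
    """Return the most probable hidden state sequence for a list of die faces."""
    # score[s] = best log-probability of any state path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for face in rolls[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][face])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi([1, 3, 6, 6, 6, 6, 6, 2, 4]))
```

The same idea carries over to DNA: the observed symbols become nucleotides and the hidden states become positions in or around a functional site.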

9 The proposed approach – 1/4
HMM models for predicting functional sites
  –Start Site Model
[Figure: Start Site Model, with the start codon labeled]

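10 The proposed approach – 2/4
An HMM model for computing coding potentials
  –The Codon Model
  –If the first state is base T and the second state is base A or G, the third state can only be C or T (A and G are not defined)
  –This excludes the stop codons TAA, TAG, and TGA; it also excludes TGG, which codes for tryptophan and is not a stop codon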
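
The third-base restriction can be checked mechanically. The sketch below is an illustration of the rule as stated on the slide, not the Codon Model itself; the function name `allowed_by_codon_rule` is hypothetical. Enumerating all 64 codons shows exactly which ones the rule rules out.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def allowed_by_codon_rule(codon):
    """Apply the slide's rule: after T then A or G, the third base must be C or T."""
    if codon[0] == "T" and codon[1] in "AG":
        return codon[2] in "CT"
    return True


# Every codon the rule rejects, out of all 64 possible triplets.
excluded = {
    "".join(bases)
    for bases in product("ACGT", repeat=3)
    if not allowed_by_codon_rule("".join(bases))
}

if __name__ == "__main__":
    print(sorted(excluded))
    print("Non-stop codons also excluded:", sorted(excluded - STOP_CODONS))
```

Running the enumeration makes the trade-off visible: the rule guarantees that no in-frame stop codon is generated, at the cost of also dropping one sense codon.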

11 The proposed approach – 3/4
Graph representation of the gene detection problem
  –DNA sequence → directed acyclic graph → dynamic programming algorithm → optimal path
  –Candidate exon, candidate intron, candidate gene
[Figure legend: intron, exon]

12 The proposed approach – 4/4
A dynamic programming algorithm
  –Weight of the vertex v: W(v)
  –Weight of the edge (v1, v2): W(v1, v2)
[Figure: graph with vertices labeled start, donor, acceptor, and stop]
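
A maximum-weight path through a directed acyclic graph with both vertex weights W(v) and edge weights W(v1, v2) can be found in one pass over the vertices in topological order. The sketch below is a generic version of that technique under stated assumptions: the vertex names, all weights, and the function `best_path` are hypothetical, and the slides do not say how GeneScout actually scores sites or candidate exons and introns.

```python
def best_path(vertices, vertex_weight, edge_weight):
    """Find a maximum-weight path in a DAG.

    vertices      -- vertex names in topological order (e.g. by sequence position)
    vertex_weight -- dict: v -> W(v)
    edge_weight   -- dict: (v1, v2) -> W(v1, v2)
    """
    # Group incoming edges by their target vertex.
    incoming = {}
    for (u, v), w in edge_weight.items():
        incoming.setdefault(v, []).append((u, w))

    # best[v] = weight of the best path ending at v (a path may start anywhere).
    best = {v: vertex_weight[v] for v in vertices}
    parent = {v: None for v in vertices}
    for v in vertices:
        for u, w in incoming.get(v, []):
            candidate = best[u] + w + vertex_weight[v]
            if candidate > best[v]:
                best[v], parent[v] = candidate, u

    # Trace back from the best end vertex.
    end = max(vertices, key=lambda v: best[v])
    score = best[end]
    path = []
    while end is not None:
        path.append(end)
        end = parent[end]
    return path[::-1], score


# Hypothetical toy graph: sites are vertices, candidate exons/introns are edges.
SITES = ["start", "donor", "acceptor", "stop"]
SITE_W = {"start": 2.0, "donor": 1.5, "acceptor": 1.0, "stop": 2.0}
EDGE_W = {
    ("start", "donor"): 3.0,     # candidate exon
    ("donor", "acceptor"): 0.5,  # candidate intron
    ("acceptor", "stop"): 2.5,   # candidate exon
    ("start", "stop"): 4.0,      # single-exon candidate gene
}

if __name__ == "__main__":
    print(best_path(SITES, SITE_W, EDGE_W))
```

Because each edge is examined once, the running time is linear in the number of vertices plus edges, which is what makes the graph formulation attractive for long sequences.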

13 Experiments and results – 1/3
Data:
  –GenBank: 570 vertebrate sequences (28,992,149 nucleotides) containing 2,649 exons (444,498 nucleotides)
  –Start codon: ATG
  –Donor site: GT
  –Acceptor site: AG
Evaluation method:
  –10-way (10-fold) cross-validation
  –The 570 sequences are divided into 10 sets; in each round, 9 sets serve as training data and 1 set serves as test data
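
The cross-validation scheme can be sketched as follows. This is a generic 10-fold split over sequence indices; the shuffling, the seed, and the function name `ten_fold_splits` are assumptions for illustration, since the slides do not describe how the 570 sequences were assigned to sets.

```python
import random


def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to sets
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


if __name__ == "__main__":
    for round_no, (train, test) in enumerate(ten_fold_splits(570), start=1):
        print(f"round {round_no}: {len(train)} training, {len(test)} test sequences")
```

With 570 sequences each set holds 57 sequences, so every round trains on 513 sequences and tests on the remaining 57, and each sequence is used for testing exactly once.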

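14 Experiments and results – 2/3
Evaluation measures:
  –Nucleotide level: the proportion of nucleotides that are correctly identified
  –Nucleotide level: the proportion of correctly identified nucleotides relative to those wrongly identified as nucleotides
  –Nucleotide level: the overall prediction accuracy (ranging from 1 to -1)
  –Exon level: the proportion of exons that are correctly identified
  –Exon level: the proportion of correctly identified exons relative to those wrongly identified as exons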
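
The nucleotide-level measures can be illustrated with the conventional gene-finding definitions: sensitivity TP/(TP+FN), specificity TP/(TP+FP), and a correlation coefficient that ranges from 1 to -1. It is an assumption that the slide's measures correspond to these standard definitions, because the symbols did not survive in the transcript; the function `nucleotide_metrics` and the toy labels are hypothetical.

```python
import math


def nucleotide_metrics(actual, predicted):
    """Compare two label strings: 'C' = coding nucleotide, 'N' = non-coding.

    Assumes every denominator is non-zero (both classes occur in both strings).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)  # coding, predicted coding
    tn = sum(a == "N" and p == "N" for a, p in pairs)  # non-coding, predicted non-coding
    fp = sum(a == "N" and p == "C" for a, p in pairs)  # non-coding, predicted coding
    fn = sum(a == "C" and p == "N" for a, p in pairs)  # coding, predicted non-coding

    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    correlation = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sensitivity, specificity, correlation


if __name__ == "__main__":
    print(nucleotide_metrics("CCCCNNNNNN", "CCCNCNNNNN"))
```

The exon-level measures follow the same pattern, except that a predicted exon usually counts as correct only when both of its boundaries match an actual exon.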

15 Experiments and results – 3/3
On 8 sequences, GeneScout correctly detected about 85% of the coding nucleotides, whereas GeneScan did not correctly predict any coding nucleotide
GeneScout runs much faster than GeneScan

16 Conclusion
GeneScout uses hidden Markov models to detect functional sites.
A vertebrate genomic DNA sequence → a directed acyclic graph → a dynamic programming algorithm → optimal path
Experimental results show that GeneScout can detect 51% of the exons in the data set.

17 Comments
Improving the accuracy of detection in DNA sequences:
  –More models or rules
  –Association rules → known exons → rules
  –Rules → DNA sequences → candidate exons