An MLE-based clustering method on DNA barcode Ching-Ray Yu Statistics Department Rutgers University, USA 07/07/2006.

Slides:



Advertisements
Similar presentations
Quick Lesson on dN/dS Neutral Selection Codon Degeneracy Synonymous vs. Non-synonymous dN/dS ratios Why Selection? The Problem.
Advertisements

Genetic Research Using Bioinformatics: LESSON 2:
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
CSCE555 Bioinformatics Lecture 3 Gene Finding Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Materials & Methods Collections Specimens of Themiste pyroides and Phascolosoma agassizii were collected from Peter the Great Bay (Figure 3) in the Sea.
.. . Parameter Estimation using likelihood functions Tutorial #1 This class has been cut and slightly edited from Nir Friedman’s full course of 12 lectures.
Natural Selection on the Olfactory Receptor Gene Family in Humans and Chimpanzee Chloe Lee.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Parameter Estimation using likelihood functions Tutorial #1
JYC: CSM17 BioinformaticsCSM17 Week 10: Summary, Conclusions, The Future.....? Bioinformatics is –the study of living systems –with respect to representation,
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Molecular Evolution Revised 29/12/06
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Selection Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Maximum Likelihood. Historically the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community.
DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers Bogdan Paşaniuc, Sotirios Kentros and Ion.
This presentation has been cut and slightly edited from Nir Friedman’s full course of 12 lectures which is available at Changes.
Molecular Evolution with an emphasis on substitution rates Gavin JD Smith State Key Laboratory of Emerging Infectious Diseases & Department of Microbiology.
Scientific Data Mining: Emerging Developments and Challenges F. Seillier-Moiseiwitsch Bioinformatics Research Center Department of Mathematics and Statistics.
. Maximum Likelihood (ML) Parameter Estimation with applications to reconstructing phylogenetic trees Comput. Genomics, lecture 6b Presentation taken from.
Lecture 4 BNFO 235 Usman Roshan. IUPAC Nucleic Acid symbols.
Molecular Evolution, Part 2 Everything you didn’t want to know… and more! Everything you didn’t want to know… and more!
Class 3: Estimating Scoring Rules for Sequence Alignment.
DNA Barcode Data Analysis Boosting Accuracy by Combining Simple Classification Methods CSE 377 – Bioinformatics - Spring 2006 Sotirios Kentros Univ. of.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Model Selection Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
Mutations Section 12–4 This section describes and compares gene mutations and chromosomal mutations.
BINF6201/8201 Molecular phylogenetic methods
A MOLECULAR APPROACH TO INVESTIGATE TUBERCULOSIS CASES IN A GOTHIC POPULATION FROM GHERĂSENI NECROPOLIS, BUZĂU COUNTY 1 Molecular Biology Center, Interdisciplinary.
.. . Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 6a Presentation taken from.
Molecular basis of evolution. Goal – to reconstruct the evolutionary history of all organisms in the form of phylogenetic trees. Classical approach: phylogenetic.
plants animals monera fungi protists protozoa invertebrates vertebrates mammals Five kingdom system (Haeckel, 1879)
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
DNA Structure & Function. Perspective They knew where genes were (Morgan) They knew what chromosomes were made of Proteins & nucleic acids They didn’t.
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
Construction of Substitution Matrices
Calculating branch lengths from distances. ABC A B C----- a b c.
Identifying and Modeling Selection Pressure (a review of three papers) Rose Hoberman BioLM seminar Feb 9, 2004.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
Maximum Likelihood Estimation
A B C D E F G H I J K FigS1. Supplemental Figure S1. Evolutionary relationships of Arabidopsis and tomato Aux/IAA proteins. The evolutionary history was.
MODELLING EVOLUTION TERESA NEEMAN STATISTICAL CONSULTING UNIT ANU.
NEW TOPIC: MOLECULAR EVOLUTION.
Construction of Substitution matrices
Modelling evolution Gil McVean Department of Statistics TC A G.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Lynette.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Bioinformatics Overview
Introduction to Bioinformatics Resources for DNA Barcoding
Fig. 1. Genomic structure of the csd gene in A
MtActinopterygii: Analysing evolution of mitogenomes belonging to the most dominant class of vertebrates Sevgi Kaynar1, Esra Mine Ünal1, Tuğçe Aygen1,
Figure 1. substitution rate of purine and pyrimidine
Linkage and Linkage Disequilibrium
Maximum likelihood (ML) method
Pipelines for Computational Analysis (Bioinformatics)
Molecular Evolutionary Analysis
Distances.
Models of Sequence Evolution
Tutorial #3 by Ma’ayan Fishelson
Goals of Phylogenetic Analysis
Biological Classification: The science of taxonomy
Molecular basis of evolution.
Explore Evolution: Instrument for Analysis
A. Papa, K. Dumaidi, F. Franzidou, A. Antoniadis 
Chapter 19 Molecular Phylogenetics
Maximum-likelihood phylogenetic tree of norovirus genomes belonging to genogroup GI, with the norovirus GII reference genome as an outlier. Maximum-likelihood.
Phylogenetic analysis of AquK2P.
Maximum Likelihood Estimation (MLE)
Functional Insights of Plant GSK3-like Kinases: Multi-Taskers in Diverse Cellular Signal Transduction Pathways  Ji-Hyun Youn, Tae-Wuk Kim  Molecular Plant 
Presentation transcript:

An MLE-based clustering method on DNA barcode Ching-Ray Yu Statistics Department Rutgers University, USA 07/07/2006

Rutgers University Outline Problem & Motivation Statistical modeling Preliminary result Future work

07/07/2006Rutgers University Problem & Motivation Species clustering problem is a very important issue on DNA barcode project. MEGA3 by S. Kumar, K. Tamura, and M. Nei (2004) has limited success when applying the true data set. A good method to cluster new biological specimens to the proper species is very important. This is also one of the goals of this DNA initiative.

07/07/2006Rutgers University Statistical modeling Statistical model. Assumption: Nucleotides are iid. random variables which are multinomial distributed with parameter, where i stands for the DNA (COI) sequence and j indicates the position.

07/07/2006Rutgers University MLE-based clustering method Maximum likelihood estimator (MLE) Suppose are the MLE of in species k. The clustering rule : A new unidentified specimen, where, is assigned to species k if

07/07/2006Rutgers University Preliminary result There are 150 species with 1623 COI sequences in the training datasets provide by DIMACS. In order to performance the misclassification rate of the MLE-based clustering method, I randomly select 200 COI sequences as testing set. The MLE of parameters is estimated from the rest 1423 COI sequences. The misclassification rate is about 16%.

07/07/2006Rutgers University Future work The misclassification rate could be improved if more complicated models are considered. For example: 1. gaps of the COI sequences. 2. Kimura-two-parameter (K2P) model (1980) considers the biological diversity with different evolution rate. 3. synonymous and non-synonymous substitutions. Remark: Synonymous substitutions will change the encoded amino acid.

07/07/2006Rutgers University Proposed deliverables If the MLE-based clustering method works well on barcode sequences, it can be written as a paper with R-code for public users of barcode data.

07/07/2006Rutgers University Acknowledgement NSF grant Data Analysis Working Group DIMACS at Rutgers My advisor: Dr. Javier Cabrera

07/07/2006Rutgers University References M. Kimura (1980) A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16: Sudhir Kumar, Koichior Tamura and Masatoshi Nei (2004) MEGA 3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Briefings in Bioinformatics. Vol 5. No