Characteristic Restriction Endonuclease cut order for Classification and analysis of DNA Sequences Rajib SenGupta College of Information Science and Technology,


Characteristic Restriction Endonuclease Cut Order for Classification and Analysis of DNA Sequences
Rajib SenGupta, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE, USA

Problem Statement
The motivation for this project is a long-standing holy grail of bioinformatics: sequence identification and classification.
Current Approaches
1. Computational approach – pairwise local alignment and multiple sequence alignment
2. Laboratory method – RFLP, Southern blotting

Existing Methods - Limitations
Pairwise or Multiple Alignment
1. Alignment is a ‘fine-grained’ approach
2. Computationally intensive: optimal multiple sequence alignment is NP-hard, so it scales poorly to large datasets
3. Introduces gaps: gaps are interpreted as evolutionary events in molecular phylogeny, and misaligned sequences carry no useful biological information
4. Heuristics such as BLAST are employed in practice
Laboratory Methods (RFLP)
1. Only feasible for a few sequences
2. Prone to human and procedural error
3. In-silico RFLP methods (TRFLP program) require alignment as a second step for sequence identification
Ideation
Utilize the ‘coarse-grain features’ of RFLP/restriction enzymes in silico, as opposed to the ‘fine-grain features’ of computational alignment.

Restriction Endonuclease
• Proteins that recognize a particular nucleotide sequence (called the restriction site, generally 4 to 8 bases long) and cut the double-stranded DNA molecule at that site

RFLP
• Restriction Fragment Length Polymorphism (RFLP)
• Widely used laboratory method in molecular identification and phylogenetic studies.
• This approach requires the sequences to be cut into several fragments with the help of restriction endonucleases.
• Variation in the position of the restriction sites along the DNA, among the sequences being analyzed, leads to digested products of varying lengths.
• Following high-resolution gel electrophoresis of the digested products, the fragment patterns are compared visually to determine the similarity between the sequences.

RFLP

Proposed Concept
• New idea
• Uses Enzyme Cut Order (ECO) information from DNA for evaluation
• Definition:
  – The ECO of a DNA sequence (S) for a particular set of restriction enzymes {Ez} is a string (array) of enzyme names (represented as numeric ids) in the order in which each enzyme (ez ∈ Ez) cuts the sequence.
• The ECO may also include the position of each cut, counted in nucleotides from the start of the sequence.
  – Thus, the ECO is a string (array) of tuples, each consisting of an enzyme id and a cut position.
  – Example:
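The ECO definition above can be illustrated with a minimal Python sketch. It is not the author's implementation. Several choices are assumptions made for illustration: the `ENZYMES` table and the function name `enzyme_cut_order` are hypothetical, enzyme names are used in place of numeric ids for readability, and the reported position is the 1-based start of the recognition site rather than the exact bond that is cleaved.

```python
# Hypothetical enzyme table: name -> recognition site (5'->3').
# All five sites are palindromic and contain only A/C/G/T, so scanning
# one strand is enough for this sketch.
ENZYMES = {
    "TaqI": "TCGA",
    "HaeIII": "GGCC",
    "AluI": "AGCT",
    "RsaI": "GTAC",
    "MspI": "CCGG",
}

def enzyme_cut_order(sequence, enzymes=ENZYMES):
    """Return the ECO as a list of (enzyme, position) tuples.

    Assumption: the position is the 1-based index of the first base of
    each recognition site, not the exact cleavage point.
    """
    seq = sequence.upper()
    hits = []
    for name, site in enzymes.items():
        start = seq.find(site)
        while start != -1:
            hits.append((name, start + 1))
            start = seq.find(site, start + 1)  # allow overlapping sites
    # Sort by position; ties are broken by enzyme name for determinism.
    hits.sort(key=lambda hit: (hit[1], hit[0]))
    return hits

eco = enzyme_cut_order("AAGGCCTTTCGAAAGCTCCGG")
print(eco)  # [('HaeIII', 3), ('TaqI', 9), ('AluI', 14), ('MspI', 18)]
print([name for name, _ in eco])  # the ECO without positions
```

Dropping the positions gives the plain ordered list of enzymes, which is the form compared when two sequences are scored against each other.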

GenBank Classification

Concept Contd..
• Closely related organisms have similar Enzyme Cut Orders.
Table 1: The ECO for ‘ITS’ sequences from closely and distantly related fungi. The closely related Nectria species (Nectria haematococca and Nectria mauritiicola) show a high level of ECO similarity.

Quantifying ECO
• Enzyme Cut Order (ECO) Similarity Score
  – The similarity score between two ECOs reflects
    • the number of enzymes they share, and
    • the order in which these enzymes cut the sequence
1. The similarity score is higher when a larger number of shared enzymes appear in the same order in the two Enzyme Cut Orders.
2. This is captured by the Longest Common Subsequence (LCS) of two strings, where the strings are the ECOs.
3. The length of the LCS between two ECOs (E1 and E2) of two corresponding sequences (S1 and S2) is taken as the Enzyme Cut Order Similarity Score between E1 and E2.
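As a sketch of the similarity score described above, the LCS length can be computed with the standard dynamic-programming table. The function name `lcs_length` and the example ECOs are hypothetical; the ECOs are written as lists of numeric enzyme ids, as in the definition.

```python
def lcs_length(eco_a, eco_b):
    """Length of the longest common subsequence of two ECOs."""
    n, m = len(eco_a), len(eco_b)
    # table[i][j] = LCS length of eco_a[:i] and eco_b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if eco_a[i - 1] == eco_b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]

e1 = [2, 1, 3, 5]  # hypothetical enzyme ids in cut order
e2 = [2, 3, 4, 5]
print(lcs_length(e1, e2))  # 3, from the shared ordered ids 2, 3, 5
```

The table costs O(|E1| x |E2|) time per pair, which is the same order as pairwise alignment, but an ECO is typically far shorter than the nucleotide sequence it was derived from.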

Hypothesis
Organisms closer to each other in the phylogenetic tree have highly similar Enzyme Cut Orders. Similarity is defined by the Enzyme Cut Order Similarity Score, which is the length of the LCS between the corresponding Enzyme Cut Orders of the organisms' DNA sequences.

Preliminary Result Summary
• Enzyme Cut Order is a distinguishing characteristic of DNA sequences
• The similarity between two sequences can be defined by the Enzyme Cut Order Similarity Score
• The ECO similarity score can be measured as the length of the LCS between the corresponding Enzyme Cut Orders of the organisms' DNA sequences

Overall Method Diagram
[Figure: flow diagram. Labels recoverable from the slide: Sequence DB, Enzyme Cut Order, Similarity Score Algorithm, Clustering Algorithm, Res Enz DB, Analysis of Clusters, Taxon DB, Cluster DB, Array of Enzyme Cut Orders, Similarity Matrix, Report, Graph, Genetic Algorithms, Optimal Enzyme Set]

Step 1 – Sequence Data Collection and Curation
• Create a local database of GenBank sequences obtained in FASTA or XML format
• Reference these sequences against the taxon database
• Create a curated taxonomy database for these sequences using user-defined taxonomical rules
• Fungal ITS sequences from GenBank
  – Taxonomy taken from the “Organism description” of the GenBank entries (or OrgName_Lineage in XML format)
  – Classification categories included Kingdom, Division, Class, Order, Family, Genus, Species
  – Use a simple suffix rule and the position in the lineage to decide the category of each name
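One way the suffix-and-position rule could work is sketched below. The slides do not spell out the actual rules, so everything here is an assumption: the suffix table follows the standard fungal nomenclature endings (-mycota, -mycetes, -ales, -aceae), the genus is taken to be the last lineage entry, and the function name and the lineage string are hypothetical.

```python
# Assumed suffix rules, based on standard fungal rank endings.
SUFFIX_TO_RANK = [
    ("mycota", "Division"),
    ("mycetes", "Class"),
    ("ales", "Order"),
    ("aceae", "Family"),
]

def classify_lineage(lineage, organism):
    """Map a semicolon-separated lineage string to rank -> name."""
    names = [n.strip() for n in lineage.rstrip(".").split(";") if n.strip()]
    ranks = {}
    last_matched = False
    for name in names:
        last_matched = False
        if name == "Fungi":
            ranks["Kingdom"] = name
            last_matched = True
            continue
        for suffix, rank in SUFFIX_TO_RANK:
            if name.lower().endswith(suffix):
                ranks.setdefault(rank, name)  # keep the first match per rank
                last_matched = True
                break
    # Position rule (assumption): an unranked final entry is the genus.
    if names and not last_matched:
        ranks["Genus"] = names[-1]
    ranks["Species"] = organism
    return ranks

# Illustrative lineage in GenBank's semicolon-separated style.
lineage = ("Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; "
           "Sordariomycetes; Hypocreomycetidae; Hypocreales; Nectriaceae; Nectria.")
print(classify_lineage(lineage, "Nectria haematococca"))
```

Intermediate ranks such as subdivision (-mycotina) and subclass (-mycetidae) are deliberately ignored, since the slide lists only seven categories.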

Step 2 – Enzyme Data Collection
• Create a database of restriction enzymes obtained from REBASE
• Add further relevant information about these restriction enzymes (isoschizomers, commercial availability, reverse cut site) for later use
• Recognition sequences containing bases other than A, T, G and C were interpreted as per the IUB ambiguity code (Eur. J. Biochem. 150: 1-5, 1985).
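The ambiguity-code handling above might look like the following sketch, which expands a recognition site into a regular expression and computes the reverse-complement site. The names `IUB_CODES`, `site_to_regex` and `reverse_site` are hypothetical, and treating the "reverse cut site" as the reverse complement of the recognition sequence is an assumption.

```python
import re

# IUB/IUPAC nucleotide ambiguity codes as regex character classes.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

# Complement of each code (e.g. R = A/G pairs with Y = T/C).
COMPLEMENT = str.maketrans("ACGTRYMKSWBDHVN", "TGCAYRKMSWVHDBN")

def site_to_regex(site):
    """Compile a recognition site with ambiguity codes into a regex."""
    return re.compile("".join(IUB_CODES[base] for base in site.upper()))

def reverse_site(site):
    """Reverse complement of a recognition site, ambiguity codes included."""
    return site.upper().translate(COMPLEMENT)[::-1]

hinf1 = site_to_regex("GANTC")      # HinfI recognition site
print(hinf1.pattern)                # GA[ACGT]TC
print(bool(hinf1.search("TTGATTCAA")))
print(reverse_site("GANTC"))        # palindromic, so unchanged
```

For a non-palindromic site, both the site and its reverse complement would need to be searched to find cuts on either strand.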

Step 3 – Enzyme Cut Order DB Build
• Obtain the enzyme cut order using a user-defined set of restriction enzymes {Ez}.
• The enzyme cut order is obtained for every test sequence and every enzyme in {Ez}.
• Evaluate the effect of the size and type of the restriction endonuclease set. Different sets {Ez} were chosen with the following properties:
1. Enzymes that cut at least one of the sequences in the given sequence data
2. Enzymes that cut 50% of the sequences in the given sequence data
3. Enzymes that cut all the sequences at least once
4. Random enzyme set (consisting of a mixture from the sets listed previously)
5. Restriction enzymes commonly used in a biology laboratory working with RFLP of fungi
6. Restriction enzyme set obtained by using a genetic algorithm
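The first three enzyme-set properties above amount to filtering enzymes by the fraction of sequences they cut. A minimal sketch, under the assumptions that "cut 50%" means "cut at least 50%" and that sites contain only A/C/G/T; the function names and the toy sequences are hypothetical.

```python
def cut_fraction(site, sequences):
    """Fraction of sequences that contain the recognition site."""
    cut = sum(1 for seq in sequences if site in seq.upper())
    return cut / len(sequences)

def select_enzymes(enzymes, sequences, min_fraction):
    """Names of enzymes cutting at least min_fraction of the sequences."""
    return sorted(
        name for name, site in enzymes.items()
        if cut_fraction(site, sequences) >= min_fraction
    )

enzymes = {"HaeIII": "GGCC", "TaqI": "TCGA", "RsaI": "GTAC"}
sequences = ["AAGGCCTT", "TTTCGAAGGCC", "GGCCTCGA"]  # toy data

at_least_one = select_enzymes(enzymes, sequences, 1 / len(sequences))
half = select_enzymes(enzymes, sequences, 0.5)
every = select_enzymes(enzymes, sequences, 1.0)
print(at_least_one, half, every)
```

On this toy data RsaI cuts nothing and is excluded from every set, while HaeIII is the only enzyme that cuts all three sequences.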

Step 4 – Similarity Matrix Based on LCS Score
• Create a similarity matrix, or equivalently a complete weighted graph, for each enzyme set {Ez}
  – Each node represents the enzyme cut order of a sequence, and the weight of the edge between two nodes is the similarity score (SS = LCS length) between the two corresponding enzyme cut orders
  – G_Ez = (K_n)_Ez = (V, E), where each v ∈ V is the enzyme cut order of a sequence and each edge e ∈ E has weight |e| = SS
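Building the matrix is a pairwise loop over the ECOs, as in this sketch. A compact two-row LCS helper is included so the block runs on its own; the function names and the three example ECOs (lists of numeric enzyme ids) are hypothetical.

```python
def lcs_length(a, b):
    """LCS length using two rows of the dynamic-programming table."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity_matrix(ecos):
    """Symmetric matrix whose entry (i, j) is the LCS length of ECO i and j."""
    n = len(ecos)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = lcs_length(ecos[i], ecos[j])
    return matrix

ecos = [[2, 1, 3, 5], [2, 3, 4, 5], [4, 4, 1]]
for row in similarity_matrix(ecos):
    print(row)
```

The diagonal holds each ECO's own length, and the off-diagonal entries are the edge weights of the complete graph K_n.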

Step 5 – Clustering
The similarity matrix is clustered and the resulting clusters are analyzed for phylogenetic accuracy.
• Clustering algorithms employed:
  – Maximum gap based exclusive clustering
  – Hierarchical clustering
  – Similarity clustering
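Of the three algorithms listed, only hierarchical clustering is standard enough to sketch without guessing at the details. The version below is a plain agglomerative procedure on a similarity matrix. The average-linkage rule, the stopping criterion (a target number of clusters `k`) and the function name are all assumptions, since the slides do not say which variant was used.

```python
def agglomerative(sim, k):
    """Merge the two most similar clusters until k clusters remain.

    sim is a symmetric similarity matrix (higher = more alike);
    cluster-to-cluster similarity is the average over member pairs.
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None  # (similarity, index_x, index_y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Toy similarity matrix for four sequences (hypothetical LCS scores).
sim = [
    [5, 4, 1, 1],
    [4, 5, 1, 0],
    [1, 1, 5, 3],
    [1, 0, 3, 5],
]
print(agglomerative(sim, 2))  # [[0, 1], [2, 3]]
```

Recording the order of merges instead of stopping at `k` would give the nested structure usually drawn as a dendrogram.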

Step 5 – Clustering
• Sensitivity and positive predictive value were the two key evaluation parameters for cluster analysis, defined as follows.
• For a particular taxon in a group X:
  S = Sensitivity = TP / (TP + FN), where
  TP = True Positive = count of the taxon's sequences in X
  FN = False Negative = count of the taxon's sequences in DB1 that are not in X
  TP + FN = total count of the taxon's sequences in the entire DB1
  (S)tax,x = count of the taxon in X / total count of the taxon in the database
• Similarly, for a particular taxon in a group X:
  PP = Positive Predictive Value = TP / (TP + FP), where
  TP = True Positive = count of the taxon's sequences in X
  FP = False Positive = count of sequences in X that belong to other taxa
  TP + FP = total count of sequences in the group X
  (PP)tax,x = count of the taxon in X / total count of sequences in X
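Both measures reduce to simple counts, as the following sketch shows. The function name is hypothetical. The database composition mirrors the seven-sequence test set of Experiment 1, while the cluster contents are invented purely for illustration.

```python
def sensitivity_and_ppv(taxon, cluster_labels, database_labels):
    """Return (sensitivity, PPV) of one taxon with respect to one cluster X.

    cluster_labels: taxon label of every sequence in X
    database_labels: taxon label of every sequence in the database
    """
    tp = cluster_labels.count(taxon)             # taxon's sequences in X
    total_taxon = database_labels.count(taxon)   # TP + FN
    total_cluster = len(cluster_labels)          # TP + FP
    sensitivity = tp / total_taxon if total_taxon else 0.0
    ppv = tp / total_cluster if total_cluster else 0.0
    return sensitivity, ppv

database = ["Nectria"] * 3 + ["Lirula"] * 2 + ["Oligoporus"] * 2
cluster_x = ["Nectria", "Nectria", "Nectria", "Lirula"]  # hypothetical cluster

print(sensitivity_and_ppv("Nectria", cluster_x, database))  # (1.0, 0.75)
print(sensitivity_and_ppv("Lirula", cluster_x, database))   # (0.5, 0.25)
```

A perfectly clustered taxon scores 1.0 on both measures: every one of its sequences is in X, and X contains nothing else.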

Step 6 – Genetic Algorithm
• Find the optimal enzyme set for a particular dataset using a genetic algorithm.
• The optimal enzyme set is defined as the smallest enzyme set that shows the highest phylogenetic resolution.
• The fitness function is based on the expected and actual counts of an organism in the cluster; the score is determined quantitatively in terms of sensitivity and positive predictive value.
• Selection is by roulette-wheel, tournament, or random selection.
• Uniform, single-point, or two-point crossover is used, with a user-specified crossover rate.
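The selection and crossover operators named above are standard and can be sketched generically. The chromosome encoding is an assumption: a list of 0/1 flags, one per candidate enzyme, with 1 meaning the enzyme is in the set. The fitness function is left out because the slides do not give its exact form, and all function names are hypothetical.

```python
import random

def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitness, k=1)[0]

def tournament_select(population, fitness, rng, size=2):
    """Pick the fittest of `size` randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitness[i])]

def single_point_crossover(a, b, rng):
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def two_point_crossover(a, b, rng):
    i, j = sorted(rng.sample(range(1, len(a)), 2))  # needs len(a) >= 3
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform_crossover(a, b, rng):
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:  # swap this gene between the children
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return child1, child2

def crossover(a, b, rng, rate, operator):
    """Apply `operator` with probability `rate`, else copy the parents."""
    if rng.random() < rate:
        return operator(a, b, rng)
    return a[:], b[:]

rng = random.Random(0)
parent_a, parent_b = [1] * 6, [0] * 6
print(crossover(parent_a, parent_b, rng, 0.9, two_point_crossover))
```

A full run would repeat selection, crossover and (typically) mutation over many generations, scoring each enzyme set by clustering the sequences with it.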

Experiment -1
• Sequence (Set-1)
  – Type = Internal Transcribed Spacer
  – Size = 7
  – Taxonomy
    • Ascomycota = 5
      – Nectria sp. = 3
      – Lirula sp. = 2
    • Basidiomycota = 2
      – Oligoporus sp. = 2
• Enzyme (Set1 - TaqI, HaeIII, HinfI, AluI, RsaI, MspI)
  – Size = 6
  – Property = Frequent cutter

Result-1

Result-1
• All sequences are perfectly clustered
  – The similarity gap is small, as reflected in the highlighted samples

Sample Test Set 1 – Enz Set 2
1. Used 57 enzymes on the same test set 1
2. Obtained a better similarity matrix (higher similarity gap)
3. A larger enzyme set may give better clustering results
4. All sequences are perfectly clustered

Result species are perfectly clustered out of 26 with 65 Enzymes

Experiment -3 (Find Optimal Enzyme Set using GA)
• Sequence (AspCan)
  – Type = Internal Transcribed Spacer
  – Size = 78
  – Taxonomy – Aspergillus and Candida
• Sequence (All9Genus)
  – Type = Internal Transcribed Spacer
  – Size = 97
  – Taxonomy – 9 genera

Result -3

Conclusion
• Restriction enzyme data can be modeled and used for computational analysis.
• Introduced a new property of DNA sequences, the Enzyme Cut Order, based on the order in which multiple restriction enzymes cut the sequence.
• This property can be quantified as a similarity score: the length of the Longest Common Subsequence between two enzyme cut orders.
• The resulting similarity matrix shows high phylogenetic resolution when clustered.
• Can be considered an alternative “coarse-grain” method for sequence identification and classification, compared with computationally intensive alignment methods.