Rényi entropic profiles of DNA sequences and statistical significance of motifs


Susana Vinga (a,b), Jonas S. Almeida (a,c)

a) Biomathematics Group, ITQB/UNL, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Oeiras, Portugal
b) INESC-ID, Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento, Lisboa, Portugal
c) Dept. Biostatistics, Bioinformatics and Epidemiology, Medical Univ. South Carolina, Charleston SC 29425, USA

1. Abstract

In a recent report [1] the authors presented a new measure of Rényi continuous entropy for DNA sequences, which allows the estimation of their randomness level. The definition explored there was based on the Rényi entropy of the probability density function (pdf) estimated with the Parzen window method and applied to Chaos Game Representation/Universal Sequence Maps (CGR/USM). This work extends those concepts of continuous entropy by defining DNA sequence entropic profiles from the pdf estimates obtained. These profiles are applied to a dataset of artificial and real DNA sequences, and a new fractal kernel function, better suited to the estimation, is explored in place of the Gaussian functions previously used. This work shows that the entropic profiles are directly related to the statistical significance of motifs, allowing the study of under- and over-representation of substrings. Furthermore, by spanning the parameters of the fractal kernel function, it is possible to extract important information about the scale of each DNA region, which may have future applications in the recognition of biologically significant segments of the genome.

Keywords: Rényi entropy, DNA, Information Theory, kernel functions, CGR/USM

2. Methods and Algorithms

CGR/USM representation of DNA

Chaos Game Representation/Universal Sequence Map (CGR/USM) maps discrete sequences onto continuous maps. The CGR/USM mapping of an N-length DNA sequence is $x_i = x_{i-1} + \frac{1}{2}(y_i - x_{i-1})$, $i = 1, \dots, N$, where $y_i$ is the corner of the map that represents the $i$-th symbol.

[Figure: 2D-CGR/USM representation of DNA, with corners labeled A, T, C, G. Annotation: -ATC- motif detected.]

Each point $x_i$ corresponds to one symbol in its context.
Each iteration goes half the distance towards the corner representing the next symbol.
Suffix property: strings ending in a specific suffix fall in the sub-square labeled with that suffix.

CGR/USM estimation

Rényi continuous entropy of DNA sequences

The DNA entropy is defined from CGR/USM and the Parzen method, with a single parameter: the variance of the Gaussian function used. The definition involves all pairwise squared Euclidean distances between the CGR/USM coordinates $x_i$.
Simplification: the integral reduces to a sum, because the convolution of two Gaussians is a Gaussian.

3. Results

DNA testset

[Table: DNA test set.]

Rényi continuous quadratic entropy for the sequence DNA dataset

[Figure: Representation of entropies for the dataset described in the table above as a function of the logarithm of the Gaussian kernel variance used in the Parzen method.] The lower the value of the entropy $H_2$, the less random (more structured) the sequence is. The graph has theoretically demonstrated asymptotes for [...] given by line [...] and for [...] by line [...].

Rényi entropic profiles

[Figure: Gaussian kernel vs. fractal kernel, example for the motif ATC; x axis from 0 to 1. Annotation: -ATC- motif detected.]

4. Conclusions and Future work

Rényi entropic profiles provide local information about motifs and their statistical significance.
Continuous quadratic entropy $H_2$ is a good measure of DNA sequence randomness.
The method provides new tools for the study of motifs and repeatability in biological sequences.
Future work: explore theoretical properties of the entropic profiles; optimize the algorithm to accommodate longer sequences.

Acknowledgments

S. Vinga and J. S. Almeida thankfully acknowledge the financial support by grants SFRH/BPD/24254/2005 and POCTI/BIO/48333/2002 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência, Tecnologia e Ensino Superior.

References

[1] Vinga, S. and Almeida, J. S. (2004) Rényi continuous entropy of DNA sequences. J Theor Biol, 231(3):
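The two steps described under Methods and Algorithms (the CGR/USM mapping and the Gaussian-kernel simplification of the quadratic entropy) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `cgr_coordinates` and `renyi_quadratic_entropy` are hypothetical, the assignment of nucleotides to corners, the starting point (0.5, 0.5), and the use of the natural logarithm are assumptions, since none of these survive in the transcript. The fractal kernel is not covered because its definition is not given here.

```python
import math

# Assumed corner assignment (illustrative only): the poster's figure labels
# the corners A, T, C, G, but the transcript does not give their coordinates.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(seq, start=(0.5, 0.5)):
    """Map a DNA string to 2D CGR/USM points, one point per symbol.

    The starting point is an assumption made for this sketch.
    """
    x, y = start
    points = []
    for symbol in seq.upper():
        cx, cy = CORNERS[symbol]
        # Move half the distance towards the corner of the next symbol.
        x, y = x + 0.5 * (cx - x), y + 0.5 * (cy - y)
        points.append((x, y))
    return points


def renyi_quadratic_entropy(points, variance):
    """Quadratic Renyi entropy H2 of a Gaussian Parzen density estimate.

    With f(x) = (1/N) * sum_i G(x - x_i; variance), the integral of f(x)^2
    equals (1/N^2) * sum_ij G(x_i - x_j; 2 * variance), because the
    convolution of two Gaussians is a Gaussian. H2 = -ln(integral).
    """
    n = len(points)
    s2 = 2.0 * variance                # variance of the convolved kernel
    norm = 1.0 / (2.0 * math.pi * s2)  # 2D isotropic Gaussian normalisation
    total = 0.0
    for xi, yi in points:
        for xj, yj in points:
            # Squared Euclidean distance between two CGR/USM coordinates.
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += norm * math.exp(-d2 / (2.0 * s2))
    return -math.log(total / (n * n))


if __name__ == "__main__":
    for name, seq in [("repeat", "A" * 16), ("mixed", "ACGTTGCAAGCTTACG")]:
        h2 = renyi_quadratic_entropy(cgr_coordinates(seq), variance=0.01)
        print(name, round(h2, 3))
```

The double loop visits all N² pairs, which is adequate for short sequences; the poster itself lists optimizing the algorithm for longer sequences as future work. In this sketch a lower H2 for the repeated sequence than for the mixed one is what the poster's interpretation of H2 would lead one to expect.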