FireμSat: An Algorithm to Detect Tandem Repeats in DNA

Presentation transcript:

1 FireμSat : An Algorithm to Detect Tandem Repeats in DNA

2 Introduction What are tandem repeats in DNA? How are we going to detect tandem repeats in DNA? Why would anybody want to detect tandem repeats in DNA?

3 Genetic sequences DNA consists of four different nucleotides, namely Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). Genetic databanks, e.g. GenBank, Emboss and Entrez, store DNA sequences as concatenated single-letter codes in FASTA format.

4 Tandem Repeats (TR’s) in genome sequences DNA molecules are subject to numerous mutational events. One consequence of these events that can be detected by computationally analyzing genome sequences is tandem duplication. A TR or TR-zone is a string of nucleotides characterized by a motif that introduces the string, contiguously followed by a number of ‘copies’ of the motif, e.g. ACGACGACGACGACG.

5 Tandem Repeats A TR is a perfect tandem repeat (PTR) if the copies are exact, e.g. ACGACGACGACGACG, which consists of five copies of the motif ACG. A TR is an approximate tandem repeat (ATR) if the copies of the motif include non-exact copies, meaning that mutational events have most likely occurred, e.g. ACGACACGAGGACGAG. In the absence of further qualification, reference to a tandem repeat should be construed as a reference to either a PTR or an ATR.
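The PTR definition can be illustrated with a minimal Python sketch. The function names `count_perfect_copies` and `is_perfect_tandem_repeat` are hypothetical and are not part of FireμSat; the sketch only counts contiguous exact copies of a motif that is already known, and requiring at least two copies is an assumption made here.

```python
def count_perfect_copies(seq: str, motif: str, start: int = 0) -> int:
    """Count contiguous exact copies of `motif` in `seq`, beginning at `start`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    copies = 0
    pos = start
    # Advance one motif length at a time while the next element is an exact copy.
    while seq.startswith(motif, pos):
        copies += 1
        pos += len(motif)
    return copies


def is_perfect_tandem_repeat(seq: str, motif: str) -> bool:
    """True if `seq` is made up entirely of two or more exact copies of `motif`.

    Requiring at least two copies is an assumption of this sketch.
    """
    copies = count_perfect_copies(seq, motif)
    return copies >= 2 and copies * len(motif) == len(seq)
```

For the two example strings on this slide, the first is a PTR of ACG, while the second breaks off after its first exact copy and so is not.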

6 Tandem Repeat Elements A PTR element (PTRE) is a TR element that matches the motif. If the motif is for example ACG then the PTRE will also be ACG. An ATR element (ATRE) is a TR element similar to the motif but not an exact copy thereof. If the motif is ACG then an ATRE may for example be AC.

7 Microsatellites TR’s are classed by the length of their PTRE’s as satellites, minisatellites or microsatellites. Microsatellites are a subset of TR’s (conforming to Benson, Delgrange, Rivals & Abajian).

8 Formal problem statement A PTR whose motif ρ is repeated p times, where p > 1, is denoted by ρ^p. An ATR u that is derived from this PTR ρ^p must always have the motif ρ as its prefix. It therefore has the form ρ u_2 … u_p, where each ATRE u_k (k = 2…p) is the result of at most ε mutations on ρ. Here ε is the so-called motif error. Besides the restrictions on the motif error, threshold values are introduced that constrain the attributes of the detected TR.

9 Tolerated error types Errors regarding the motif or PTRE (motif errors): deletions, mismatches, insertions. Errors related to the detected TR (TR errors): the ratio between PTRE’s and ATRE’s; the minimum number of TRE’s to be reported; the maximum number of consecutive ATRE’s.

10 Motif errors At most 50% of the motif may be in error. If |ρ| = 2 or |ρ| = 3 then ε = 0 or ε = 1 (default = 1). If |ρ| = 4 or |ρ| = 5 then ε = 0, ε = 1 or ε = 2 (default = 2). Consider the motif ACGTT: ACT is then an ATRE in which two deletions have occurred.
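The motif error rule above can be written down as a small lookup. This is a sketch only: the function names are hypothetical, and motif lengths outside 2 to 5 are rejected here simply because the slide gives no values for them.

```python
def allowed_motif_errors(motif_length: int) -> tuple:
    """Return the values of the motif error (epsilon) listed for a motif length."""
    if motif_length in (2, 3):
        return (0, 1)
    if motif_length in (4, 5):
        return (0, 1, 2)
    # Assumption: lengths not covered by the slide are treated as unsupported.
    raise ValueError("no motif error values given for this motif length")


def default_motif_error(motif_length: int) -> int:
    """Return the default epsilon, which is the largest allowed value."""
    return max(allowed_motif_errors(motif_length))
```

In every listed case the default stays within the 50% bound: 1/2, 1/3, 2/4 and 2/5.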

11 Motif errors: Types of Mutations Deletion: the absence of a base pair in the motif. Insertion: up to ε base pairs inserted at any position of the PTRE. Mismatch: the replacement of a base pair in the motif by another.
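The three mutation types can be made concrete by enumerating every string reachable from a motif by at most ε single-base edits. This brute-force enumeration is an illustration only and is not how FireμSat works (the slides describe finite automata instead); `one_edit` and `atre_candidates` are hypothetical names.

```python
ALPHABET = "ACGT"


def one_edit(word: str) -> set:
    """All strings obtained from `word` by exactly one deletion, mismatch or insertion."""
    results = set()
    for i in range(len(word)):
        # Deletion: drop the base at position i.
        results.add(word[:i] + word[i + 1:])
        # Mismatch: replace the base at position i by a different base.
        for base in ALPHABET:
            if base != word[i]:
                results.add(word[:i] + base + word[i + 1:])
    for i in range(len(word) + 1):
        # Insertion: add a base before position i (or at the end).
        for base in ALPHABET:
            results.add(word[:i] + base + word[i:])
    return results


def atre_candidates(motif: str, eps: int) -> set:
    """Strings that differ from `motif` by between one and `eps` edits."""
    seen = {motif}
    frontier = {motif}
    for _ in range(eps):
        frontier = {v for w in frontier for v in one_edit(w)} - seen
        seen |= frontier
    # The motif itself is a PTRE, and the empty string is not a usable element.
    seen.discard(motif)
    seen.discard("")
    return seen
```

With the motif ACG and ε = 1, the candidates include AC (a deletion) and AGG (a mismatch).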

12 Detected TR errors: the substring error The substring error must not exceed the maximum substring error allowed. It is computed as (n_d × p_d) + (n_i × p_i) + (n_m × p_m) − n_ptre, where n_d is the number of deletions, n_i the number of insertions, n_m the number of mismatches, n_ptre the number of PTRE’s, p_d the penalty allocated to deletions, p_i the penalty allocated to insertions, and p_m the penalty allocated to mismatches.
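The substring error formula translates directly into code. In the sketch below the function names and the `max_error` parameter are hypothetical, and the penalty values used in any call are chosen by the caller, since the slide does not state defaults.

```python
def substring_error(n_d: int, n_i: int, n_m: int, n_ptre: int,
                    p_d: float, p_i: float, p_m: float) -> float:
    """Penalised count of deletions, insertions and mismatches, offset by the PTRE count."""
    return (n_d * p_d) + (n_i * p_i) + (n_m * p_m) - n_ptre


def within_substring_error(error: float, max_error: float) -> bool:
    """True if the substring error does not exceed the allowed maximum."""
    return error <= max_error
```

For example, one deletion and two mismatches with penalties p_d = 2 and p_m = 1, offset by three PTRE’s, give (1 × 2) + (0 × 2) + (2 × 1) − 3 = 1.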

13 Detected TR errors: the minimum number of TRE’s The total number of TRE’s is tn_tre = tn_ptre + tn_atre. The default minimum value for tn_tre is 2, to prevent the output of unwanted data.

14 Detected TR errors: the maximum number of consecutive ATRE’s The counter tn_atreC is incremented for every ATRE read and is set to zero whenever a PTRE is read. The default of tn_atreC is 0.
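The counters tn_ptre, tn_atre, tn_tre and tn_atreC can be sketched as a single pass over a list of already-separated TR elements. Splitting a sequence into elements is assumed to have been done elsewhere; `tally_elements` and the `longest_atre_run` field are hypothetical names added for illustration.

```python
def tally_elements(elements, motif: str) -> dict:
    """Update the TR counters for each element, treating exact motif copies as PTRE's."""
    tn_ptre = 0
    tn_atre = 0
    tn_atreC = 0          # current run of consecutive ATRE's
    longest_atre_run = 0  # largest value tn_atreC reached (illustrative extra)
    for element in elements:
        if element == motif:
            tn_ptre += 1
            tn_atreC = 0  # a PTRE resets the consecutive-ATRE counter
        else:
            tn_atre += 1
            tn_atreC += 1  # every ATRE read increments the counter
            longest_atre_run = max(longest_atre_run, tn_atreC)
    return {
        "tn_tre": tn_ptre + tn_atre,
        "tn_ptre": tn_ptre,
        "tn_atre": tn_atre,
        "tn_atreC": tn_atreC,
        "longest_atre_run": longest_atre_run,
    }
```

A detector would compare the running tn_atreC against its allowed maximum after each element, and tn_tre against the minimum number of TRE’s before reporting.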

15 Deletion Refers to the absence of a base pair in the motif. Example automaton: FA_D(ACG,1)

16 Mismatch Refers to the replacement of a base pair in the motif by another. Example automaton: FA_m(ACG,1)

17 High-level Description of FireμSat generateWords(ρ,ε) generates the set of all words of length ρLength over the alphabet Σ = {A,C,G,T}. createFA_TR(ρ,ε) returns FA_TR(ρ,ε) as discussed. findIndices(gSeq, FA_TR, τ, α, β, p_m, p_d, p_i) returns a set of index pairs in gSeq, each delimiting an identified TR that complies with the constraints specified by τ, α and β. Various counters have to be updated to ensure correct output.
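The word-generation step can be sketched with the standard library. The function below is a simplification: it takes the word length directly rather than the (ρ,ε) arguments of generateWords, and the name `generate_words` is hypothetical.

```python
from itertools import product

SIGMA = ("A", "C", "G", "T")


def generate_words(length: int) -> set:
    """Return every word of the given length over the alphabet {A, C, G, T}."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # product() yields tuples of symbols; join each tuple into a string.
    return {"".join(symbols) for symbols in product(SIGMA, repeat=length)}
```

The number of words grows as 4 to the power of the length, which stays small for the short motifs of microsatellites.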

18 Why does anybody want to detect TR’s in DNA? The cause of several human diseases can be traced to having too many copies of a certain nucleotide triplet. TR’s play a role in the development of immune system cells. TR’s serve as genetic markers in plant and animal species. Tandem repeats play a role in gene regulation and contribute to the breeding of disease-resistant cultivars.

19 Conclusion A new theoretical approach to detect TR’s in DNA has been introduced. The time complexity of FireμSat is linear in |gSeq|. The practical implementation of FireμSat is in progress. The following matters constitute a future research agenda: the performance of FireμSat, and the possibility of reducing FA_TR. If the reduction is successful, its results could suggest ways of adapting FireμSat to detect minisatellites and satellites as well.