Improved Models and Algorithms for Universal DNA Tag Systems continued … a.k.a. what did we do?


Nucleation Model
When do two tags form a match?
1. sum of scores of matches ≥ c? (not a stable complex!)
2. score of heaviest match ≥ c? (as in [BKSY])
3. score of heaviest match with e errors ≥ c! (we propose)
[Figure: the tag pair AAGCTGCA / ACCCTGTA, drawn four times]

Score of a single match (recap)
May be computed via either of:
2-4 Rule
- easy approximation: A-T = 2, G-C = 4
- sum gives melting temperature
Nearest Neighbor Rule
- sum energies due to contiguous A-T & C-G pairs
- A-T different from T-A different from A-G etc.
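The 2-4 rule above can be sketched in a few lines of Python. The function name `two_four_score` is made up for illustration and is not from the slides. The weights used in the later slides (A/T = 1, C/G = 2) appear to be this score halved.

```python
# Per-base contributions under the 2-4 rule from the slide: A/T = 2, G/C = 4.
TWO_FOUR = {"A": 2, "T": 2, "G": 4, "C": 4}


def two_four_score(seq: str) -> int:
    """Sum the 2-4 rule contributions over a matched stretch.

    The sum is the rough melting-temperature estimate the slide refers to.
    """
    return sum(TWO_FOUR[base] for base in seq.upper())
```

For example, `two_four_score("CGTA")` adds 4 + 4 + 2 + 2. The Nearest Neighbor Rule is not sketched here because it needs an energy table that the slides do not give.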

Is this a realistic model? It's an improvement.
Pair 1: CGTAGCACGAA / AACTCGTATCA, (6,0) match, mfold predicts Tm = 3.2°C
Pair 2: ACAGCAATGGA / GATCGGTACTA, (9,1) match, mfold predicts Tm = 13.8°C
[BKSY] would predict: pair 1 > pair 2
We predict: pair 1 < pair 2

Definitions
Two strings s1 and s2 have a (c,e)-match if they have substrings t1 and t2 such that:
1. w(t1) = w(t2) ≥ c
2. t1 and t2 differ in ≤ e places
A tag system is an (h,c,e)-code if
1. every tag has weight at least h
2. no two tags have a (c,e)-match
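A brute-force sketch of these two definitions, for illustration only. It assumes the weights A/T = 1 and C/G = 2, and it assumes that "differ in ≤ e places" means Hamming distance between substrings of equal length. The function names are hypothetical, and this is not the tree-search algorithm of [MPT].

```python
from itertools import combinations

# Assumed weights: A/T = 1, C/G = 2.
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def weight(s: str) -> int:
    return sum(WEIGHT[ch] for ch in s)


def hamming(t1: str, t2: str) -> int:
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(t1, t2))


def has_ce_match(s1: str, s2: str, c: int, e: int) -> bool:
    """True if s1 and s2 contain equal-length substrings t1, t2 with
    w(t1) = w(t2) >= c that differ in at most e places."""
    for length in range(1, min(len(s1), len(s2)) + 1):
        for i in range(len(s1) - length + 1):
            t1 = s1[i:i + length]
            w1 = weight(t1)
            if w1 < c:
                continue
            for j in range(len(s2) - length + 1):
                t2 = s2[j:j + length]
                if weight(t2) == w1 and hamming(t1, t2) <= e:
                    return True
    return False


def is_hce_code(tags: list[str], h: int, c: int, e: int) -> bool:
    """Check both conditions of the (h,c,e)-code definition by exhaustive search."""
    if any(weight(t) < h for t in tags):
        return False
    return not any(has_ce_match(a, b, c, e) for a, b in combinations(tags, 2))
```

The search is cubic in the tag length per pair, which is fine for checking small examples but not for designing large codes.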

Design of (h,c,e)-code with large size
Outline of upper bound on size
- How? Via upper bound on number of c-tokens (the substrings t that have weight ≈ c)
- Choosing one c-token in a tag knocks out a sphere of nearby c-tokens from further use in any other tag.
- Similar to sphere packing bound in coding theory.
Algorithms for generating optimal codes
- Modify alphabetic tree-search algorithm of [MPT]

c-tokens (recap)
- strings with weight ≥ c
- no proper suffix of weight ≥ c
- have weight either c or c+1
- length ranges from c/2 (all C/G) to c (all A/T)
- can't use tailweight method of [BKSY]
nucleation complexes = two c-tokens differing in at most e symbols
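The c-token conditions can be checked directly. Below is a small sketch, assuming the weights A/T = 1 and C/G = 2; `is_c_token` and `c_tokens` are illustrative names, and the enumeration is plain brute force that is only practical for small c.

```python
from itertools import product

# Assumed weights: A/T = 1, C/G = 2.
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def weight(s: str) -> int:
    return sum(WEIGHT[ch] for ch in s)


def is_c_token(s: str, c: int) -> bool:
    """s has weight >= c and no proper suffix of weight >= c.

    Only the longest proper suffix s[1:] needs checking, because every
    shorter suffix weighs no more than it does.
    """
    return weight(s) >= c and weight(s[1:]) < c


def c_tokens(c: int) -> list[str]:
    """Enumerate all c-tokens by brute force over the possible lengths."""
    tokens = []
    # Shortest token: all C/G, length ceil(c/2). Longest: all A/T, length c.
    for length in range((c + 1) // 2, c + 1):
        for letters in product("ACGT", repeat=length):
            s = "".join(letters)
            if is_c_token(s, c):
                tokens.append(s)
    return tokens
```

Because the first symbol adds at most 2 to a suffix of weight at most c-1, every string returned has weight c or c+1, matching the recap above.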

A sphere around CGCA
CGCA is a 6-token of weight 7, length 4
how many 4-length codewords at distance 1?
position 1: TGCA, GGCA, AGCA
position 2: CACA, CCCA, CTCA
position 3: CGGA, CGTA, CGAA
position 4: CGCC, CGCT, CGCG
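The twelve strings listed above are all single-symbol substitutions of CGCA. The sketch below regenerates them and then keeps only those with the same weight as CGCA, since a (c,e)-match requires equal weights. It assumes the weights A/T = 1 and C/G = 2, and the function names are illustrative.

```python
# Assumed weights: A/T = 1, C/G = 2.
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def weight(s: str) -> int:
    return sum(WEIGHT[ch] for ch in s)


def neighbors_at_distance_1(s: str) -> list[str]:
    """All strings obtained from s by substituting exactly one symbol."""
    out = []
    for i, ch in enumerate(s):
        for repl in "ACGT":
            if repl != ch:
                out.append(s[:i] + repl + s[i + 1:])
    return out


def same_weight_neighbors(s: str) -> list[str]:
    """Distance-1 neighbors whose weight equals that of s."""
    w = weight(s)
    return [t for t in neighbors_at_distance_1(s) if weight(t) == w]
```

For CGCA only the swaps C↔G and A↔T preserve the weight, so `same_weight_neighbors("CGCA")` keeps one string per position.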

How many such spheres pack the whole space?
Now look at spheres around codewords of optimum code
- must be disjoint!
- Σ vol(s), summed over red spheres s, ≤ total number of c-tokens
- size of code × vol(sphere) ≤ total number of c-tokens

Size of a sphere
Suppose string s has a A/T and b C/G symbols
- weight = a + 2b, length = a + b
Introduce e errors into s to get t
- weight of t same as weight of s, so e1 = e2
- for errors of type 1, pick positions in C(a, e1) ways and 2^e1 options to change to

REPLACE                         WEIGHT   NUMBER
A → G, A → C, T → G, T → C      +1       e1
G → A, C → A, G → T, C → T      -1       e2
A → T, T → A, C → G, G → C       0       e3
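The counting argument above can be turned into a closed-form count. This is a sketch derived from the replacement table, not necessarily the exact formula on the original slide: it counts strings of the same length and weight as s within Hamming distance e, includes s itself, and does not check the suffix condition for c-tokens. The name `sphere_size` is hypothetical.

```python
from math import comb


def sphere_size(a: int, b: int, e: int) -> int:
    """Count same-length, same-weight strings within Hamming distance e of a
    string with a A/T symbols and b C/G symbols (the string itself included)."""
    total = 0
    # k = e1 = e2: weight-raising and weight-lowering errors must balance.
    for k in range(e // 2 + 1):
        # Choose k of the a A/T positions (2 replacements each) and
        # k of the b C/G positions (2 replacements each).
        balanced = comb(a, k) * 2**k * comb(b, k) * 2**k
        # Remaining errors are weight-neutral swaps (A<->T, C<->G),
        # one option per chosen position among the untouched symbols.
        for e3 in range(e - 2 * k + 1):
            total += balanced * comb(a + b - 2 * k, e3)
    return total
```

For CGCA (a = 1, b = 3) and e = 1, this gives the center plus the four weight-preserving neighbors.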

One tag of weight h uses (h-c+1) tokens
So size of code ≤ (total number of c-tokens) / ((h-c+1) × size of sphere)
Size of sphere = [formula missing]
Substitute a = 2l - c and b = c - l
- l varies from c/2 to c, c-tokens of weight c or c+1
- number of strings of length l = [formula missing]

Can tighten the bound further
- our sphere knocked out only c-tokens of the same length
- we should also remove similar c-tokens of other lengths
- reduce bound by factor e?
In comparison to [BKSY] bound
- h = 30, c = 12, e = 0: [value missing] ≥ #tags ≥ [value missing]
- h = 30, c = 12, e = 2: #tags ≤ 1268
- if nucleation does occur with errors then we can't assume so many tags

[Figure: plot of upper bound vs. c, e (h = 50). Axes: upper bound on number of codewords; e, number of errors; c, weight of nucleation complex]

Open Problems & Remarks
- design, analyze efficient algorithms for model
- can we use random deBruijn sequences to generate codewords?
- analyze using mixing techniques on Markov chain of [KMUW]?
- exciting new question for coding theory: alphabets with weighted Hamming distances!