
Linear-Time Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications
Tuan Thanh Nguyen, Nanyang Technological University (NTU), Singapore
Joint work with: Yeow Meng Chee, Han Mao Kiah, Johan Chrisnata

Our motivation
Applications that store data in living organisms. Shipman et al. (2017) used CRISPR-Cas to encode a digital movie into the genomes of a population of living bacteria.

Our motivation
Errors arise from biological mutations: substitution, deletion, insertion, duplication, inversion, and translocation.
Duplication is one of the two common types of repeats found in the human genome (repeats account for more than 50% of it). Duplication plays an important role in determining an individual's inherited traits and is believed to be the cause of several disorders.

Problem Classification
We classify by the duplication length and by the number of errors:
1. Fixed-length duplications: 1.1 bounded number of errors; 1.2 unbounded number of errors.
2. Variable-length duplications (length at most $k$): 2.1 bounded; 2.2 unbounded.
Example of tandem duplications with $k = 3$: ACGAGCAT -> ACGACGAGCAT -> ACGACGAGCAGCAT (first ACG, then AGC, is duplicated in place).
We focus on the worst-case scenario: an unbounded number of duplications of length at most $k$.

Notation
Given the alphabet $\Sigma_q = \{0, 1, \dots, q-1\}$ and an integer $k \ge 1$. A tandem duplication of length at most $k$ copies a substring of length at most $k$ and inserts the copy directly after the original; for example, $012 \to 0012$ (length 1) and $012 \to 012012$ (length 3).
A word is $\le k$-irreducible if it contains no tandem repeat of length at most $k$, i.e., it cannot be obtained from any shorter word by such duplications.
The $\le k$-descendant cone of $x$, denoted $D^{\le k}(x)$, is the set of all $\le k$-descendants of $x$: the words obtainable from $x$ by a sequence of tandem duplications of length at most $k$. For example, $x = 0112$ has descendants such as 01122 and 011212.
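
To make these definitions concrete, here is a minimal Python sketch, not from the paper: an irreducibility test and a bounded enumeration of the descendant cone. The function names and the length cap `max_len` are choices of this sketch.

```python
def is_irreducible(word, k):
    """True if `word` contains no tandem repeat ww with 1 <= |w| <= k."""
    n = len(word)
    for w_len in range(1, k + 1):
        for i in range(n - 2 * w_len + 1):
            if word[i:i + w_len] == word[i + w_len:i + 2 * w_len]:
                return False
    return True

def descendants(x, k, max_len):
    """All words in D^{<=k}(x) of length at most `max_len`."""
    seen, frontier = {x}, [x]
    while frontier:
        y = frontier.pop()
        for w_len in range(1, k + 1):
            for i in range(len(y) - w_len + 1):
                # duplicate the substring y[i : i + w_len] in place
                child = y[:i + w_len] + y[i:i + w_len] + y[i + w_len:]
                if len(child) <= max_len and child not in seen:
                    seen.add(child)
                    frontier.append(child)
    return seen

print(is_irreducible("012", k=3))     # True: 012 is <=3-irreducible
print(is_irreducible("012012", k=3))  # False: tandem repeat of 012
print("01122" in descendants("0112", k=3, max_len=6))  # True
```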

Problem Formulation
Goal: given $k, n, q \in \mathbb{N}$, construct a code $\mathcal{C}^{\le k}(n, q)$ such that for all distinct $x, y \in \mathcal{C}^{\le k}(n, q)$ we have $D^{\le k}(x) \cap D^{\le k}(y) = \emptyset$.
Previous works:
- Optimal codes are known for $k \le 2$ (Jain et al. 2017).
- A method to construct codes for $k = 3$ is provided (Jain et al. 2017); the main idea is to use irreducible words.
- There is no known result for $k \ge 4$.

Previous Work
For $k \le 3$, different irreducible words generate different descendants, i.e., their descendant cones are disjoint.
For $k \le 2$, the code of all irreducible words of length $n$ is optimal. (The slide's figure shows the irreducible words 0120, 0121, 1201, 1210 as roots of four disjoint cones.)
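
This disjointness claim can be probed by brute force on small cases, reusing `descendants` from the sketch above; note that such a search only certifies that the cones share no word up to the chosen length bound.

```python
def cones_disjoint_up_to(x, y, k, max_len):
    """True if D^{<=k}(x) and D^{<=k}(y) share no word of length <= max_len."""
    return not (descendants(x, k, max_len) & descendants(y, k, max_len))

words = ["0120", "0121", "1201", "1210"]  # the slide's irreducible words
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        assert cones_disjoint_up_to(words[i], words[j], k=3, max_len=8)
```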

Previous Work
For $k = 3$, we can choose more than one codeword in each cone! Irreducible words therefore form an "almost optimal" code.

Our Main Results
- A detailed analysis of the codes constructed from irreducible words for $k = 3$; such codes are denoted $\mathrm{Irr}^{\le k}(n, q)$.
- An explicit formula for their size and asymptotic rate.
- An upper bound on the size of the optimal code, from which we conclude that $\mathrm{Irr}^{\le k}(n, q)$ is almost optimal.
- A linear-time encoder for $\mathrm{Irr}^{\le k}(n, q)$; an extension of this encoder provides the first known encoder for the previously constructed codes.
Publication: IEEE International Symposium on Information Theory (ISIT) 2018.

Encoder of $\mathrm{Irr}^{\le k}(n; q)$ for $k = 2, 3$ (sketched idea)
Pipeline: input $x$ -> encoder -> irreducible word $y$ -> duplication channel -> $y'$ -> error-decoder -> $y$ -> Irr-decoder -> $x$.
The input $x$ is split into blocks $x_1 x_2 \cdots x_s$, each of length $\ell$, and each block is mapped to an irreducible block $y_i$ of length $m$.
For $z \in \mathrm{Irr}^{\le k}(m; q)$, define the neighbours of $z$ as $N(z) \triangleq \{z' \in \mathrm{Irr}^{\le k}(m; q) : z z' \in \mathrm{Irr}^{\le k}(2m; q)\}$, and let $\Delta^{\le k}(m; q) \triangleq \min\{|N(z)| : z \in \mathrm{Irr}^{\le k}(m; q)\}$.
The construction requires $\Delta^{\le k}(m; q) \ge q^{\ell}$, so that every length-$\ell$ block can select a distinct neighbour of the previous output block; since a tandem repeat of length at most $2k$ spans at most two consecutive blocks (as $m \ge 2k - 1$), the neighbour condition guarantees that the whole output $y = y_1 y_2 \cdots y_s$ is irreducible. A worked sketch follows the example below.
For any $\epsilon > 0$, to achieve an encoding rate of at least the optimal rate minus $\epsilon$, we only require $\ell = \Theta(1/\epsilon)$ and $m = \Theta(1/\epsilon)$.

Example: $k = 2$, $q = 3$, $m = 3$, $\ell = 1$.
Input $x = 20101$; the five ternary symbols $2, 0, 1, 0, 1$ are mapped to the irreducible blocks 212, 010, 210, 120, 120.
Output $y = 212\,010\,210\,120\,120 = 212010210120120$, which is $\le 2$-irreducible.
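
Here is a small Python sketch of this block encoder, reusing `is_irreducible` from the earlier snippet. The lexicographic ordering of $N(z)$ and the treatment of the first block as a fixed seed (taken here to be 212, the slide's first output block) are illustrative assumptions rather than the paper's exact conventions; with these choices, the sketch reproduces the last four blocks of the slide's example.

```python
from itertools import product

def irr_words(m, q, k):
    """All <=k-irreducible words of length m over {0, ..., q-1}."""
    return ["".join(map(str, t)) for t in product(range(q), repeat=m)
            if is_irreducible("".join(map(str, t)), k)]

def neighbours(z, m, q, k):
    """N(z): irreducible blocks z' such that z z' is also irreducible."""
    return sorted(zp for zp in irr_words(m, q, k) if is_irreducible(z + zp, k))

def encode(message, m, q, k, seed):
    """Encode q-ary symbols (l = 1): each symbol indexes into N(prev)."""
    blocks, prev = [], seed
    for s in message:
        nbrs = neighbours(prev, m, q, k)
        assert len(nbrs) >= q        # needs Delta^{<=k}(m; q) >= q^l = q
        prev = nbrs[int(s)]
        blocks.append(prev)
    return "".join(blocks)

# k = 2, q = 3, m = 3, l = 1: with seed block 212, encoding the
# remaining symbols 0101 reproduces the rest of the slide's output.
print(encode("0101", m=3, q=3, k=2, seed="212"))  # -> 010210120120
```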

Recent Work: Special Attention on $q = 4$
The GC-content of a DNA string is the fraction of its nucleotides that are G or C; DNA strings whose GC-content is too high or too low are more prone to both synthesis and sequencing errors. Many recent works therefore use DNA strings whose GC-content is close to 50% or exactly 50%; this is referred to as the "GC-balanced constraint".
Our updated encoder produces words that are both irreducible and GC-balanced (the slide's pipeline: AAAA -> ATACTA -> ATGCTACG).
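
As a trivial check of the constraint, the final string in the slide's pipeline is exactly GC-balanced:

```python
def gc_content(dna):
    """Fraction of nucleotides in `dna` that are G or C."""
    return sum(base in "GC" for base in dna) / len(dna)

print(gc_content("ATGCTACG"))  # 0.5: exactly GC-balanced
```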

Knuth Balancing Method vs. Modified Knuth Method (irreducible + GC-balanced)
Knuth's method (binary): complement the first $t$ bits of the input, choosing $t$ so that the result is balanced. Slide example: input 01000100; with $t = 4$, output 10110100. Redundancy: $\log n$ bits (to encode the index $t$, plus a look-up table).
Modified Knuth method (DNA, irreducible + GC-balanced): slide example: input ATCATGAT; with $t = 4$, output GCTGTGAT. Redundancy: $2 \log n + O(1)$ bits, with linear-time encoding and no need for a look-up table.
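
A minimal sketch of the classical binary Knuth balancing step described above (the encoding of the index $t$ and the look-up table are omitted). Such a $t$ always exists for even length $n$: as $t$ runs from $0$ to $n$, the weight changes by $\pm 1$ at each step and moves from $w$ to $n - w$, so it must pass through $n/2$.

```python
def knuth_balance(bits):
    """Flip a prefix of `bits` (even length) to make it balanced.

    Returns (t, balanced), where complementing the first t bits of
    the input yields the balanced word `balanced`.
    """
    n = len(bits)
    word = list(bits)
    for t in range(n + 1):
        if sum(word) == n // 2:
            return t, "".join(map(str, word))
        if t < n:
            word[t] ^= 1  # complement one more prefix position
    raise ValueError("no balancing index found (odd length?)")

t, balanced = knuth_balance([0, 1, 0, 0, 0, 1, 0, 0])
print(t, balanced)  # 4 10110100 -- matches the slide's example
```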

Recent Work: Designing Codes for $k \ge 4$
The size of our code is at least $\left( (q-1)(q-2) \cdots (q-k) \right)^{n/k}$.
In terms of rate: $\frac{\log_q(q-1) + \log_q(q-2) + \cdots + \log_q(q-k)}{k} \le R \le \log_q(q-1)$.
Upper and lower bounds on the code rate when $k = 4$:
q  | Upper bound | Lower bound
5  | 0.8280      | 0.4936
10 | 0.9542      | 0.8701
15 | 0.9745      | 0.9311
20 | 0.9828      | 0.9547
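
The lower-bound column can be recomputed directly from the left-hand side of the rate inequality above; a quick check (the printed values agree with the table up to rounding of the last digit):

```python
from math import log

def rate_lower_bound(q, k):
    """(log_q(q-1) + ... + log_q(q-k)) / k, from the bound above."""
    return sum(log(q - i, q) for i in range(1, k + 1)) / k

for q in (5, 10, 15, 20):
    print(q, round(rate_lower_bound(q, 4), 4))
# 5 0.4937 | 10 0.8701 | 15 0.9312 | 20 0.9547
```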

Summary
Goal: given $k, n, q \in \mathbb{N}$, construct the largest code whose codewords have length $n$ over a $q$-ary alphabet and can correct an unbounded number of tandem duplications of length at most $k$.
Previous works:
- Optimal codes for $k \le 2$ (Jain et al. 2017).
- A method to construct codes for $k = 3$.
Our work:
- Upper and lower bounds on the size of codes for $k = 3$.
- A linear-time encoder for the known tandem-duplication (TD) codes with $k \le 3$ (IEEE ISIT 2018).
- A linear-time encoder for the TD GC-balanced code.
- A method to construct codes for $k \ge 4$.
Further work:
- Find optimal codes for $k = 3$.
- Reduce the redundancy of the encoder for the TD GC-balanced code.
- Design better codes for $k \ge 4$.