Tuan Thanh Nguyen, Nanyang Technological University (NTU), Singapore


1 Linear-Time Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications
Tuan Thanh Nguyen, Nanyang Technological University (NTU), Singapore
Joint work with: Yeow Meng Chee, Han Mao Kiah, Johan Chrisnata

2 Our motivation
Applications that store data in living organisms.
Shipman et al. (2017): CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria.

3 Our motivation
Errors due to biological mutations: deletion, insertion, substitution, duplication, inversion, translocation.
Duplication is one of the two common repeats found in the human genome (more than 50%), plays an important role in determining an individual's inherited traits, and is believed to be the cause of several disorders.
(Figure: short strands over A, C, G, T illustrating the mutation types.)

4 Problem Classification
Tandem duplication errors are classified by the duplication length and by the number of errors.
1. Fixed-length duplications: 1.1 bounded, 1.2 unbounded. Example, given $k = 3$: ACGAGCAT becomes ACGACGAGCAGCAT.
2. Variable-length duplications: 2.1 bounded, 2.2 unbounded. Example, given $k = 3$: ACGAGCAT becomes AACGCGAGCAGCATT.
We focus on the worst-case scenario (variable-length, unbounded)!
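The two examples on this slide can be replayed with a small sketch of the tandem duplication operation. The function name `tandem_duplicate` and its argument order are illustrative choices, not notation from the talk.

```python
def tandem_duplicate(word: str, start: int, length: int) -> str:
    """Insert a second copy of word[start:start+length] right after the first."""
    if length < 1 or start < 0 or start + length > len(word):
        raise ValueError("duplicated block must lie inside the word")
    end = start + length
    return word[:end] + word[start:end] + word[end:]


# Fixed-length duplications (every duplicated block has length 3).
fixed = tandem_duplicate("ACGAGCAT", 0, 3)   # duplicate ACG
fixed = tandem_duplicate(fixed, 6, 3)        # duplicate AGC

# Variable-length duplications (block lengths 1, 2, 3, 1, all at most 3).
variable = tandem_duplicate("ACGAGCAT", 0, 1)   # duplicate A
variable = tandem_duplicate(variable, 2, 2)     # duplicate CG
variable = tandem_duplicate(variable, 6, 3)     # duplicate AGC
variable = tandem_duplicate(variable, 13, 1)    # duplicate T

print(fixed)     # ACGACGAGCAGCAT
print(variable)  # AACGCGAGCAGCATT
```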

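5 Notation
Given the alphabet $\Sigma_q = \{0, 1, \ldots, q-1\}$ and an integer $k \ge 1$.
A word is $\le k$-irreducible if it contains no tandem repeat $uu$ with $|u| \le k$. Example: 012 is $\le k$-irreducible, while 0012 and 012012 arise from it by duplications of length 1 and 3.
For a word $x$, $D_{\le k}(x)$ denotes the $\le k$-descendant cone of $x$, the set of all $\le k$-descendants of $x$. Example: for $x = 0112$, the word 01122 is a $\le k$-descendant of $x$.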
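As a minimal sketch of this notation, the snippet below tests irreducibility and lists the words reachable by a single duplication. The helper names `is_irreducible` and `one_step_descendants` are made up for illustration.

```python
def is_irreducible(word: str, k: int) -> bool:
    """True if word contains no tandem repeat uu with 1 <= len(u) <= k."""
    n = len(word)
    for length in range(1, k + 1):
        for i in range(n - 2 * length + 1):
            if word[i:i + length] == word[i + length:i + 2 * length]:
                return False
    return True


def one_step_descendants(word: str, k: int) -> set:
    """All words obtained from word by one tandem duplication of length <= k."""
    out = set()
    for length in range(1, k + 1):
        for i in range(len(word) - length + 1):
            end = i + length
            out.add(word[:end] + word[i:end] + word[end:])
    return out


print(is_irreducible("012", 3))                   # True
print(is_irreducible("0012", 3))                  # False
print("01122" in one_step_descendants("0112", 3)) # True
```

Note that 012012 is reducible only when $k \ge 3$; with $k = 2$ the check still returns True for it.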

6 Problem Formulation
Goal: Given $k, n, q \in \mathbb{N}$, construct a code $\mathcal{C}_{\le k}(n,q)$ such that for all distinct $x, y \in \mathcal{C}_{\le k}(n,q)$, we have $D_{\le k}(x) \cap D_{\le k}(y) = \emptyset$.
Example: 0112 and 0122 share the descendant 01122, so they cannot both be codewords.
Previous Works
Optimal codes are found when $k \le 2$ (Jain et al. 2017).
A method to construct codes when $k = 3$ is provided (Jain et al. 2017). Main idea: using "irreducible words".
There is no known result when $k \ge 4$.
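The disjointness condition can be probed by brute force on short words. The sketch below enumerates a descendant cone truncated at a maximum length (assuming the cone contains the word itself) and checks whether two truncated cones meet. The names `cone_up_to` and `cones_intersect` are hypothetical, and an empty intersection up to `max_len` is only evidence, not a proof, of disjointness.

```python
from collections import deque


def one_step(word: str, k: int):
    """Yield every word obtained from word by one duplication of length <= k."""
    for length in range(1, k + 1):
        for i in range(len(word) - length + 1):
            end = i + length
            yield word[:end] + word[i:end] + word[end:]


def cone_up_to(x: str, k: int, max_len: int) -> set:
    """Descendants of x (including x) whose length does not exceed max_len."""
    seen = {x}
    queue = deque([x])
    while queue:
        w = queue.popleft()
        for d in one_step(w, k):
            # Duplications only lengthen words, so pruning here loses nothing.
            if len(d) <= max_len and d not in seen:
                seen.add(d)
                queue.append(d)
    return seen


def cones_intersect(x: str, y: str, k: int, max_len: int) -> bool:
    """True if x and y share a descendant of length at most max_len."""
    return not cone_up_to(x, k, max_len).isdisjoint(cone_up_to(y, k, max_len))


print(cones_intersect("0112", "0122", 3, 5))  # True: both reach 01122
print(cones_intersect("0120", "0121", 3, 8))  # False: last symbols differ
```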

7 Previous Work
$k \le 2$: the code is optimal.
For $k \le 3$, different irreducible words generate disjoint sets of descendants!
(Figure: the descendant cones, labelled a, b, c, d, of the irreducible words 0120, 0121, 1201, 1210.)
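One way to read this property: for $k \le 3$ a received word can be traced back to a single irreducible word by repeatedly deleting one copy of a tandem repeat. The sketch below does this with a simple scan; the function name `root` and the leftmost-shortest deletion order are illustrative choices, and for $k \ge 4$ the result may depend on that order.

```python
def root(word: str, k: int) -> str:
    """Delete one copy of a tandem repeat uu (len(u) <= k) until none is left."""
    changed = True
    while changed:
        changed = False
        for length in range(1, k + 1):
            for i in range(len(word) - 2 * length + 1):
                if word[i:i + length] == word[i + length:i + 2 * length]:
                    # Keep the first copy of u, drop the second.
                    word = word[:i + length] + word[i + 2 * length:]
                    changed = True
                    break
            if changed:
                break
    return word


print(root("ACGACGAGCAGCAT", 3))   # ACGAGCAT
print(root("AACGCGAGCAGCATT", 3))  # ACGAGCAT
print(root("01122", 3))            # 012
```

This naive scan restarts after every deletion, so it is not linear-time; it only illustrates the reduction.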

8 Previous Work
For $k = 3$, we can choose more than one codeword in each cone!
Irreducible words form an "almost optimal" code!
(Figure: the cones of 0120, 0121, 1201, 1210 again, now with additional points A, C, D marked besides a, b, c, d.)

9 Our Main Results
Detailed analysis of the codes constructed from irreducible words when $k = 3$; such codes are denoted by $\mathrm{Irr}_{\le k}(n,q)$.
Provide an explicit formula to compute the size and the asymptotic rate.
Provide an upper bound for the optimal code and hence conclude that $\mathrm{Irr}_{\le k}(n,q)$ is almost optimal.
Linear-time encoder for $\mathrm{Irr}_{\le k}(n,q)$.
The extension of this encoder provides the first known encoder for the previously constructed codes.
Publication: IEEE International Symposium on Information Theory 2018.
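The explicit size formula is not reproduced on the slide, so the sketch below simply counts irreducible words by exhaustive enumeration for small parameters and reports the resulting rate $\log_q |\mathrm{Irr}_{\le k}(n,q)| / n$. The names `count_irreducible` and `rate` are illustrative, and the approach is only feasible for small $n$ and $q$.

```python
from itertools import product
from math import log


def is_irreducible(word, k: int) -> bool:
    """True if word (a tuple or string) has no tandem repeat uu, len(u) <= k."""
    n = len(word)
    for length in range(1, k + 1):
        for i in range(n - 2 * length + 1):
            if word[i:i + length] == word[i + length:i + 2 * length]:
                return False
    return True


def count_irreducible(n: int, q: int, k: int) -> int:
    """Number of <=k-irreducible words of length n over {0, ..., q-1}."""
    return sum(1 for w in product(range(q), repeat=n) if is_irreducible(w, k))


def rate(n: int, q: int, k: int) -> float:
    """Rate of the set of irreducible words, measured in q-ary symbols."""
    return log(count_irreducible(n, q, k), q) / n


print(count_irreducible(4, 3, 2))  # 18
print(round(rate(8, 4, 3), 4))
```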

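10 Encoder of $\mathrm{Irr}_{\le k}(n;q)$ for $k = 2, 3$ (sketched idea)
Pipeline: input x, encoder, y, duplication channel, y', decoder (error-decoder followed by Irr-decoder), output x.
(Figure: x is split into blocks $x_1, x_2, \ldots, x_s$, each of length $\ell$; each block is mapped by Irr to a block of $y = y_1 y_2 \cdots y_s$, each an irreducible word of length $m$.)
For $\epsilon > 0$, to achieve encoding rates of at least (optimal rate $- \epsilon$), we only require $\ell = \Theta(1/\epsilon)$ and $m = \Theta(1/\epsilon)$.
For $z \in \mathrm{Irr}_{\le k}(m;q)$, we define the neighbours of $z$: $N(z) \triangleq \{ z' \in \mathrm{Irr}_{\le k}(m;q) : z z' \in \mathrm{Irr}_{\le k}(2m;q) \}$.
$\Delta_{\le k}(m;q) \triangleq \min \{ |N(z)| : z \in \mathrm{Irr}_{\le k}(m;q) \}$.
$\Delta_{\le k}(m;q) \ge q^{\ell}$.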
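The neighbour sets and the minimum neighbour count can be computed by brute force for small block lengths. The sketch below follows the definitions above; the function names are illustrative, and reading the final inequality as "each block of $\ell$ input symbols can pick a distinct neighbour" is an interpretation, not a statement from the slide.

```python
from itertools import product


def is_irreducible(word, k: int) -> bool:
    """True if word (a tuple) has no tandem repeat uu with 1 <= len(u) <= k."""
    n = len(word)
    for length in range(1, k + 1):
        for i in range(n - 2 * length + 1):
            if word[i:i + length] == word[i + length:i + 2 * length]:
                return False
    return True


def irreducible_words(m: int, q: int, k: int) -> list:
    """All <=k-irreducible words of length m, in lexicographic order."""
    return [w for w in product(range(q), repeat=m) if is_irreducible(w, k)]


def neighbours(z, m: int, q: int, k: int) -> list:
    """N(z): irreducible words z2 of length m such that z + z2 is irreducible."""
    return [z2 for z2 in irreducible_words(m, q, k) if is_irreducible(z + z2, k)]


def min_neighbours(m: int, q: int, k: int) -> int:
    """Delta: the smallest |N(z)| over all irreducible z of length m."""
    return min(len(neighbours(z, m, q, k)) for z in irreducible_words(m, q, k))


print(neighbours((0, 1, 0), 3, 3, 2))  # [(2, 0, 1), (2, 1, 0), (2, 1, 2)]
print(min_neighbours(3, 3, 2))         # 3
```

For $k = 2$, $q = 3$, $m = 3$ the minimum is 3, so the inequality holds with $\ell = 1$.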

11 Example: $k = 2$, $q = 3$, $m = 3$, $\ell = 1$
(Figure: worked encoding example. The tokens shown on the slide are 20101, 2, 1, 1 and the irreducible blocks 010, 212, 210, 120.)

12 Recent Work: Special attention on $q = 4$
The GC-content of a DNA string is the number of nucleotides that are G or C. DNA strings whose GC-content is too high or too low are more prone to both synthesis and sequencing errors.
Many recent works use DNA strings whose GC-content is close to 50% or exactly 50%. This is referred to as the "GC-balanced constraint".
Our updated encoder: AAAA, then ATACTA (irreducible), then ATGCTACG (irreducible and GC-balanced).

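13 Knuth Balancing Method and Modified Knuth Method (Irreducible + GC-balanced)
Knuth Balancing Method: input ATCATGAT; flip the first $t = 4$ symbols to obtain GCTGTGAT; append $\log n$ symbols to form the output codeword. Redundancy: $\log n$ (to encode the index $t$, plus a look-up table).
Modified Knuth Method: input ATCATGAT; flip the first $t = 4$ symbols to obtain GCTGTGAT; append $2 \log n$ symbols to form the output codeword. Redundancy: $2 \log n + O(1)$ (linear-time encoding, no look-up table needed).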
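The flipping step can be sketched as follows. The flip map A↔G, T↔C is inferred from the slide's example (ATCA becomes GCTG) and should be treated as an assumption; the function names are illustrative, and the sketch covers only the classical balancing step, not the index encoding or the irreducibility-preserving modification.

```python
# Assumed flip map, read off the slide's example ATCA -> GCTG.
FLIP = {"A": "G", "G": "A", "T": "C", "C": "T"}


def gc_count(s: str) -> int:
    """Number of G or C symbols in s."""
    return sum(ch in "GC" for ch in s)


def flip_prefix(s: str, t: int) -> str:
    """Flip the first t symbols of s and leave the rest unchanged."""
    return "".join(FLIP[ch] for ch in s[:t]) + s[t:]


def balancing_indices(s: str) -> list:
    """All t for which flipping the first t symbols gives GC-content n/2."""
    if len(s) % 2:
        raise ValueError("exact GC-balance needs an even length")
    return [t for t in range(len(s) + 1)
            if 2 * gc_count(flip_prefix(s, t)) == len(s)]


print(flip_prefix("ATCATGAT", 4))     # GCTGTGAT
print(balancing_indices("ATCATGAT"))  # [2, 4, 6]
```

Each flip changes the GC-count by exactly one and flipping the whole word turns a count of $c$ into $n - c$, so for even $n$ at least one balancing index always exists. For this input several indices work, including the $t = 4$ used on the slide.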

14 Recent Work: Design codes when $k \ge 4$
The size of our code is at least $\left( (q-1)(q-2) \cdots (q-k) \right)^{n/k}$.
In terms of rate: $\dfrac{\log_q(q-1) + \log_q(q-2) + \cdots + \log_q(q-k)}{k} \le R \le \log_q(q-1)$.
Upper bound and lower bound of the code rate when $k = 4$:
q     Upper bound   Lower bound
5     0.8280        0.4936
10    0.9542        0.8701
15    0.9745        0.9311
20    0.9828        0.9547
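The two sides of the rate inequality are easy to evaluate numerically. The sketch below assumes the reading of the formulas given above (lower bound as the average of $\log_q(q-i)$ for $i = 1, \ldots, k$, upper bound $\log_q(q-1)$); the function names are illustrative.

```python
from math import log


def lower_bound_rate(q: int, k: int) -> float:
    """(log_q(q-1) + ... + log_q(q-k)) / k; requires q > k."""
    if q <= k:
        raise ValueError("the bound needs q > k")
    return sum(log(q - i, q) for i in range(1, k + 1)) / k


def upper_bound_rate(q: int) -> float:
    """log_q(q-1)."""
    return log(q - 1, q)


for q in (5, 10, 15, 20):
    print(q, round(upper_bound_rate(q), 4), round(lower_bound_rate(q, 4), 4))
```

The computed values agree with the table to within rounding in the last digit for all lower bounds and for the upper bounds at $q = 10, 15, 20$. For $q = 5$ the expression $\log_q(q-1)$ evaluates to about 0.8614, so the 0.8280 in the table may come from a different, tighter bound or from a transcription slip.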

15 Summary
Goal: "Given $k, n, q \in \mathbb{N}$, construct the largest code where each codeword is of length $n$ over a $q$-ary alphabet that can correct unbounded tandem duplications of length at most $k$."
Previous Works
Optimal codes when $k \le 2$ (Jain et al. 2017).
A method to construct codes when $k = 3$.
Our work
Provide an upper bound and a lower bound for codes when $k = 3$.
Linear-time encoder for known TD codes ($k \le 3$) (IEEE ISIT 2018).
Linear-time encoder for TD GC-balanced codes.
A method to construct codes when $k \ge 4$.
Further work
Find optimal codes when $k = 3$.
Reduce the redundancy of the encoder for TD GC-balanced codes.
Design better codes when $k \ge 4$.
