Arun Ganesh (UC Berkeley)

Presentation transcript:

Optimal Sequence Length Requirements for Phylogenetic Tree Reconstruction with Indels
Arun Ganesh (UC Berkeley), with Qiuyi (Richard) Zhang (UC Berkeley, Google)

Setup
- Start with a “model” binary tree whose $n$ leaves are the extant species.
- Sample a uniformly random bitstring (DNA) for the root: $\sigma_r \sim \{0,1\}^k$, e.g. 011001010.
- DNA is inherited down the tree with mutations. On edge $e$, each bit:
  - substitutes w.p. $p_s(e)$ (e.g. 011001010 becomes 011000010)
  - inserts a random bit w.p. $p_i(e)$ (e.g. 011000010 becomes 0110000101)
  - deletes w.p. $p_d(e)$ (e.g. 0110000101 becomes 110000101)
- The algorithm is given the leaf bitstrings (e.g. 001000011, 011011010, 111001101, 000100101, 001100110) and must reconstruct the tree with high probability.
(Image source: Bulbapedia)
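The generative model above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `mutate` and `simulate_leaves` are hypothetical, the tree is assumed to be a complete binary tree with the same probabilities on every edge, and the order in which deletion, substitution and insertion are applied to a bit is an assumption the slides do not specify.

```python
import random

def mutate(bits, p_s, p_i, p_d, rng):
    """Pass a bitstring (list of 0/1) along one edge of the tree.

    Assumed order per bit: delete w.p. p_d; otherwise flip w.p. p_s,
    keep the bit, then insert one uniformly random bit after it w.p. p_i.
    """
    child = []
    for bit in bits:
        if rng.random() < p_d:
            continue  # the bit is deleted
        if rng.random() < p_s:
            bit = 1 - bit  # substitution
        child.append(bit)
        if rng.random() < p_i:
            child.append(rng.randrange(2))  # insertion of a random bit
    return child

def simulate_leaves(k, depth, p_s, p_i, p_d, seed=0):
    """Sample a uniform root string of length k and evolve it down a
    complete binary tree of the given depth; return the 2**depth leaves."""
    rng = random.Random(seed)
    level = [[rng.randrange(2) for _ in range(k)]]
    for _ in range(depth):
        # every node in the current level gets two independently mutated children
        level = [mutate(parent, p_s, p_i, p_d, rng)
                 for parent in level for _child in range(2)]
    return level

leaves = simulate_leaves(k=9, depth=3, p_s=0.05, p_i=0.02, p_d=0.02)
for leaf in leaves:
    print("".join(map(str, leaf)))
```

A reconstruction algorithm would see only the printed leaf strings, whose lengths can differ once indels are present.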

Goal
Reconstruct the tree with high probability, using as short a sequence length $k$ as possible (as a function of $n$), while tolerating mutation probabilities $p_s$, $p_i$, $p_d$ that are as large as possible.

Motivation
Many practical applications:
- Reconstructing paths of migration
- Linking mutations to disease
- Determining origins of pathogens and likely paths of contamination
- Informing policy on conservation of species
(Image source: Tim Lohrentz)

Prior work
One method: try to align sequences, reducing to the substitution-only case.
(Image source: Wikipedia)

Prior work
Theorem [DMR05]: Can reconstruct the tree in the substitution-only case using $O(\log n)$ bits. With $o(\log n)$-length bitstrings, the problem is impossible. Requires $2(1 - 2p_s)^2 > 1$ (Kesten-Stigum threshold).
[BRZ95, Iof96, EKP00, BKM01, MSW04]: If $2(1 - 2p_s)^2 < 1$, need $n^{\Omega(1)}$ bits for reconstruction.
If we have good alignment methods, we've solved the problem to optimality!

Prior work
Unfortunately, multiple sequence alignment (MSA) is NP-hard, and heuristics used in practice may induce problematic biases.

Prior work
GWEJ07, LG08, WSH08 give empirical evidence for biases in MSA. DR10, ABH12 provide some guarantees for reconstruction with indels, but require polynomial sequence lengths or very small $p_i, p_d$.
What $p_i, p_d$ can we handle with $k = \log^{O(1)} n$? We show the answer is $O(1)$.
[Figure: prior results DR10, GZ18, ABH12, ESSW99, DMR05 placed by tolerated $p_i, p_d$ (axis marks $O(1)$, $O(1/\log^2 n)$) against sequence length $k$ (axis marks $n^{O(1)}$, $\log^{O(1)} n$, $O(\log^2 n)$, $O(\log n)$).]

Our contribution
In particular, if for all edges $e$:
- $2 (1 - 2p_s(e))^2 (1 - p_d(e))^2 (1 + p_i(e) - p_d(e))^{-1} > 1$ (Kesten-Stigum threshold),
- $D_{\max} \le \log^{\alpha} n$,
- $|p_i(e) - p_d(e)| \le \beta \log\log n / D_{\max}$ (needed to avoid empty leaf strings),
then we can reconstruct the tree with $k = \log^{\kappa(\alpha,\beta)} n$.
$\kappa(\alpha,\beta)$ is optimal up to a small multiplicative constant, and otherwise this result is optimal in every possible sense!
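As a quick sanity check on the threshold condition, the left-hand side can be evaluated numerically. The sketch below assumes the reading $2(1-2p_s)^2(1-p_d)^2/(1+p_i-p_d)$ of the slide's formula; the function name `kesten_stigum_value` is hypothetical. With $p_i = p_d = 0$ it reduces to the substitution-only quantity $2(1-2p_s)^2$, which exceeds 1 exactly when $p_s < (1 - 1/\sqrt{2})/2 \approx 0.146$.

```python
import math

def kesten_stigum_value(p_s, p_i=0.0, p_d=0.0):
    """Left-hand side of the threshold condition for a single edge.

    Assumed form: 2 (1 - 2 p_s)^2 (1 - p_d)^2 / (1 + p_i - p_d).
    The condition on the slide asks for this value to exceed 1 on every edge.
    """
    return 2 * (1 - 2 * p_s) ** 2 * (1 - p_d) ** 2 / (1 + p_i - p_d)

# Largest substitution probability allowed when there are no indels.
P_S_THRESHOLD = (1 - 1 / math.sqrt(2)) / 2

for p_s in (0.10, 0.20):
    value = kesten_stigum_value(p_s)
    print(p_s, value, "below threshold" if value > 1 else "above threshold")
```

Adding deletions shrinks the value further, so the admissible range of $p_s$ narrows as $p_d$ grows.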

Roadmap
1. Estimating distances using bitwise correlation (ESSW99): substitution only, $k = n^{O(1)}$
2. Using block signatures instead of bits (DR10): with indels, $k = n^{O(1)}$
3. Reconstructing bitstrings of internal nodes (Roc08): substitution only, $k = \log^{O(1)} n$
4. Reconstructing signatures of internal nodes (GZ18): with indels, $k = \log^{O(1)} n$
5. Future directions


Estimating distances using bitwise correlation: distance estimation
Well known: distance estimates that concentrate well suffice to reconstruct the tree. High-level algorithm:
1. Use distances to identify siblings
2. Use distances to compute distances from parents
3. Recurse on parents
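The three steps can be sketched for the idealised case where the pairwise leaf distances are exactly additive (no estimation noise). The sketch uses the standard four-point condition to find siblings and the identity $d(p,c) = (d(a,c) + d(b,c) - d(a,b))/2$ for the parent $p$ of siblings $a, b$; this is one textbook way to realise the outline, not necessarily the procedure used in the talk, and the names `is_cherry` and `reconstruct` are hypothetical.

```python
from itertools import combinations

def is_cherry(dist, a, b, nodes):
    """Step 1: a and b are siblings iff, for every other pair c, d,
    d(a,b) + d(c,d) is the smallest of the three pairings (four-point condition)."""
    others = [x for x in nodes if x != a and x != b]
    return all(
        dist[a][b] + dist[c][d]
        <= min(dist[a][c] + dist[b][d], dist[a][d] + dist[b][c])
        for c, d in combinations(others, 2)
    )

def reconstruct(leaf_dist):
    """Return the unrooted topology as nested tuples.

    leaf_dist[x][y] must hold an exactly additive distance for all x != y.
    With noisy estimates no pair may pass the test and next() would raise.
    """
    dist = {a: dict(row) for a, row in leaf_dist.items()}  # do not mutate the input
    nodes = list(dist)
    while len(nodes) > 3:
        a, b = next(p for p in combinations(nodes, 2) if is_cherry(dist, *p, nodes))
        parent = (a, b)
        others = [x for x in nodes if x != a and x != b]
        dist[parent] = {}
        for c in others:
            # Step 2: distance from the new parent to every remaining node.
            d_pc = (dist[a][c] + dist[b][c] - dist[a][b]) / 2
            dist[parent][c] = d_pc
            dist[c][parent] = d_pc
        nodes = others + [parent]  # Step 3: recurse with the parent as a leaf
    return tuple(nodes)

# Quartet with siblings (A, B) and (C, D); edge lengths A:1, B:2, C:1, D:1, middle edge 3.
leaf_dist = {
    "A": {"B": 3, "C": 5, "D": 5},
    "B": {"A": 3, "C": 6, "D": 6},
    "C": {"A": 5, "B": 6, "D": 2},
    "D": {"A": 5, "B": 6, "C": 2},
}
print(reconstruct(leaf_dist))
```

In practice the distances are estimates, which is why the rest of the talk is about how well those estimates concentrate.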

Bitwise correlation
To estimate distance in the substitution-only case, we can use bitwise correlation (a linear rescaling of Hamming distance).
Think of bits as $\pm 1$ instead of 0-1. Let $\sigma_{a,j}$ be the $j$th bit of $a$'s bitstring. Bitwise correlation is $\frac{1}{k} \sum_{j=1}^{k} \sigma_{a,j} \sigma_{b,j}$.
Siblings have very similar bitstrings; faraway nodes have dissimilar bitstrings (example bitstrings on the slide: 10101011, 00111010, 00101110).

Bitwise correlation
Claim: If we define edge lengths as $d(e) = -\ln(1 - 2p_s(e))$, then
$E\left[\frac{1}{k} \sum_{j=1}^{k} \sigma_{a,j} \sigma_{b,j}\right] = \prod_{e \in P_{a,b}} (1 - 2p_s(e)) = e^{-d(a,b)}$
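The claim suggests a simple distance estimator: compute the bitwise correlation and take $-\ln$ of it. The following sketch checks this on a single simulated edge in the substitution-only model; the helper names are hypothetical, and the mapping of bit 0 to $+1$ and bit 1 to $-1$ is an arbitrary choice.

```python
import math
import random

def bitwise_correlation(bits_a, bits_b):
    """Average of sigma_a[j] * sigma_b[j], with bits 0/1 mapped to +1/-1."""
    k = len(bits_a)
    return sum((1 - 2 * x) * (1 - 2 * y) for x, y in zip(bits_a, bits_b)) / k

def estimate_distance(bits_a, bits_b):
    """Plug-in estimate of d(a, b) = -ln E[correlation].

    Raises ValueError when the empirical correlation is not positive,
    which is exactly the regime where the estimate is unreliable.
    """
    return -math.log(bitwise_correlation(bits_a, bits_b))

# One edge with substitution probability 0.1, so d(e) = -ln(0.8).
rng = random.Random(0)
k, p_s = 20000, 0.1
parent = [rng.randrange(2) for _ in range(k)]
child = [bit ^ 1 if rng.random() < p_s else bit for bit in parent]

print("estimated:", estimate_distance(parent, child))
print("true:     ", -math.log(1 - 2 * p_s))
```

Because edge lengths add along a path, the same estimator applied to two leaves targets the total path length between them.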

Concentration of bitwise correlation
How well does it concentrate? Rough analysis: bitwise correlation has standard deviation $\approx 1/\sqrt{k}$. For the correlation to concentrate at distance $O(\log n)$, we need $e^{-O(\log n)} > 1/\sqrt{k}$, which forces $k = n^{\Omega(1)}$.

Roadmap: 2. Using block signatures instead of bits (DR10)

Handling indels
Problem with bitwise correlation when indels are introduced: indels make bits move around a lot between bitstrings.
What bit in $b$'s bitstring does bit $j$ of $a$'s bitstring correspond to? In the substitution-only case it is simply bit $j$ of $b$. With indels it may have shifted, or it may not appear at all.

Blockwise correlation
To handle shifts due to insertions and deletions, split bitstrings into blocks.
Assume $p_i = p_d$ everywhere (not too hard to generalize). Any bit shifts by $< \sqrt{k}$ positions in expectation, and by at most $\sqrt{k} \log n$ positions with high probability on one edge. If we split bitstrings into blocks of length, say, $l = k^{3/4}$, most bits will stay within a block throughout the tree.

Blockwise correlation
Define the signature $s_{a,i}$ of block $i$ in bitstring $a$ as the sum of the bits in block $i$, divided by $\sqrt{l}$.
Signatures $s_{a,i}$ are robust to shifts, so they behave like bits in the substitution-only case, i.e. $s_{a,i} s_{b,i}$ behaves like bitwise correlation:
$s_{a,i} s_{b,i} = \frac{1}{l} \sum_{j} \sum_{j'} \sigma_{a,j} \sigma_{b,j'}$
Fixing any series of indels, $\sigma_{a,j} \sigma_{b,j'}$ is non-zero in expectation only if bits $j$ and $j'$ correspond to each other.

Blockwise correlation
Define $d(e) = -\ln\left[(1 - 2p_s(e))(1 - p_d(e))\right]$.
Lemma: $E[s_{a,i} s_{b,i}] = (1 \pm o(1)) \prod_{e \in P_{a,b}} (1 - 2p_s(e))(1 - p_d(e)) \approx e^{-d(a,b)}$
- The $(1 \pm o(1))$ error term accounts for the tiny fraction of bits that move in or out of blocks.
- The $(1 - p_d(e))$ factors capture the decay in the number of corresponding bits between the blocks due to deletions.
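Computing signatures and the blockwise correlation is mechanical once a block length is fixed. The sketch below assumes bits are already mapped to $\pm 1$ and that each block sum is normalised by $\sqrt{l}$, which is the normalisation implied by the $1/l$ prefactor in the product $s_{a,i} s_{b,i}$; the function names are hypothetical, and trailing bits that do not fill a whole block are simply ignored.

```python
import math

def signatures(signs, block_len):
    """Split a +1/-1 sequence into consecutive blocks of length block_len
    and return each block's sum divided by sqrt(block_len).
    Bits after the last full block are dropped."""
    n_blocks = len(signs) // block_len
    return [
        sum(signs[i * block_len:(i + 1) * block_len]) / math.sqrt(block_len)
        for i in range(n_blocks)
    ]

def blockwise_correlation(sig_a, sig_b):
    """Average of s_a[i] * s_b[i] over the blocks both sequences share.
    Sequences of different lengths (after indels) may have different block counts."""
    m = min(len(sig_a), len(sig_b))
    return sum(x * y for x, y in zip(sig_a, sig_b)) / m

a = [1, 1, -1, -1, 1, 1, 1, 1]
sig = signatures(a, block_len=4)
print(sig)
print(blockwise_correlation(sig, sig))
```

Taking $-\ln$ of the blockwise correlation then plays the role that $-\ln$ of the bitwise correlation plays in the substitution-only case.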

Challenges with concentration
$s_{a,i} s_{b,i}$ includes lots of non-corresponding bitwise products that have expectation 0, but the total variance may be high.
But each $s_{a,i}$ is $O(\log n)$ with high probability, and we have $k^{1/4}$ products, so we can boost $k$ to get concentration via Azuma's inequality if the $s_{a,i} s_{b,i}$ are independent.
Unfortunately $s_{a,i} s_{b,i}$ and $s_{a,i'} s_{b,i'}$ are not independent. Knowing $s_{a,i} s_{b,i}$ gives you information about the indel process between $a$ and $b$, and thus about $s_{a,i'} s_{b,i'}$ (since indels affect many blocks simultaneously).

Challenges with concentration
Our fix: Suppose we could define blocks based on the actual alignment. Let $s^*_{a,i}$ be the signatures of these blocks.
Indels can only affect one “aligned” signature, so the $s^*_{a,i} s^*_{b,i}$ are independent and concentrate nicely.
We don't know the actual alignment, so we can't compute $s^*_{a,i}$, but we can show $s_{a,i} s_{b,i}$ is very close to $s^*_{a,i} s^*_{b,i}$.
(Figure: the blocks our estimator uses versus the aligned blocks, for bitstrings $a$ and $b$.)

Roadmap: 3. Reconstructing bitstrings of internal nodes (Roc08)

Challenges with $k = \log^{O(1)} n$
With $k = \log^{O(1)} n$, correlation only concentrates at distance $O(\log\log n)$.
We can show that far-apart nodes have low correlation with high probability (no false positives), so we can still reconstruct the first $\Omega(\log\log n)$ levels based only on leaf distance estimates.

Challenges with $k = \log^{O(1)} n$
For nodes $a, b$ with $d(a,b) = O(\log\log n)$ and descendants $a_1, \dots, a_{2^h}$ and $b_1, \dots, b_{2^h}$ at height $h = \Omega(\log\log n)$ below them, use the estimator
$\hat{\sigma}_{a,m} = \frac{1}{2^h} \sum_{j} e^{d(a,a_j)} \sigma_{a_j,m}$
Correct in expectation! But the variance of the correlation of $\hat{\sigma}_a$ and $\hat{\sigma}_b$ is large.
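Why the estimator is correct in expectation can be spelled out in two lines. The derivation below is a sketch for the substitution-only model with bits in $\{+1, -1\}$, writing $\hat{\sigma}_{a,m}$ for the estimate of bit $m$ at the internal node $a$ (the hat is added here to separate the estimate from the true bit) and using $d(e) = -\ln(1 - 2p_s(e))$ from the earlier claim.

```latex
% Assumption: each site evolves independently by symmetric bit flips,
% so a descendant's bit is the ancestor's bit times independent +1/-1 factors.
\begin{align*}
\mathbb{E}\left[\sigma_{a_j,m} \mid \sigma_{a,m}\right]
  &= \sigma_{a,m} \prod_{e \in P_{a,a_j}} \bigl(1 - 2p_s(e)\bigr)
   = e^{-d(a,a_j)}\, \sigma_{a,m}, \\
\mathbb{E}\left[\hat{\sigma}_{a,m} \mid \sigma_{a,m}\right]
  &= \frac{1}{2^h} \sum_{j=1}^{2^h} e^{d(a,a_j)}\, e^{-d(a,a_j)}\, \sigma_{a,m}
   = \sigma_{a,m}.
\end{align*}
```

The weights $e^{d(a,a_j)}$ undo the decay of the signal exactly, but they also inflate the noise, which is where the large variance comes from.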

Challenges with $k = \log^{O(1)} n$
To get better concentration, use a median-of-means approach.
For each pair of descendants $a_j, b_j$ at height $h = \log\log n$ below $a$ and $b$ (where $d(a,b) = O(\log\log n)$), compute $\hat{\sigma}_{a_j}, \hat{\sigma}_{b_j}$, and use their correlation to get an estimate of $d(a_j, b_j)$ (and thus of $d(a,b)$).

Challenges with $k = \log^{O(1)} n$
We get $\log n$ estimates of $d(a,b)$, one for each $a_j, b_j$ pair.
We can condition on having “good” bitstrings at all $a_j, b_j$.
The $\log n$ estimators of $d(a,b)$ are conditionally independent, so their median concentrates well.
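The generic median-of-means idea behind this step can be sketched as follows. In the talk each "mean" is a correlation-based estimate of $d(a,b)$ obtained from one descendant pair $a_j, b_j$; the sketch below instead works on a plain list of numbers, and the function name `median_of_means` is hypothetical.

```python
from statistics import mean, median

def median_of_means(values, n_groups):
    """Split values into n_groups consecutive groups of equal size,
    average each group, and return the median of the group averages.
    Values left over after the last full group are ignored."""
    size = len(values) // n_groups
    if size == 0:
        raise ValueError("need at least one value per group")
    group_means = [
        mean(values[g * size:(g + 1) * size]) for g in range(n_groups)
    ]
    return median(group_means)

samples = [1, 1, 1, 1, 1, 1, 1, 1, 1000]  # one wildly wrong estimate
print("plain mean:     ", mean(samples))
print("median of means:", median_of_means(samples, n_groups=3))
```

A single bad group cannot move the median, which is why independence between the groups is the property the analysis needs.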

Roadmap: 4. Reconstructing signatures of internal nodes (GZ18)

Reconstructing signatures
Signatures are robust to indels, so they behave similarly to bits in the substitution-only case.
This suggests our algorithm: apply the reconstruction scheme to signatures.
There are some technical challenges in the analysis that we need to overcome.

Challenges with reconstruction with indels
In reconstructing signatures, bits appearing in blocks of ancestors but not children, or vice versa, may induce noise that is non-zero in expectation in the recursive estimator.
We show that since the noise also “decays”, it is tiny in expectation, so misalignment does not ruin the reconstructed signatures.
(Figure: the ancestor $a$, which we condition on, carries signal plus noise; the descendant $a_j$ carries decayed signal plus noise.)

Challenges with reconstruction with indels
We again have to deal with $s_{a_j,i} s_{b_j,i}$ and $s_{a_j,i'} s_{b_j,i'}$ not being independent when analyzing the variance of our recursive estimator for $d(a,b)$.
We show that the covariance of the reconstructed blockwise correlations is small, i.e. $s_{a,i} s_{b,i}$ and $s_{a,i'} s_{b,i'}$ are almost completely independent.

Roadmap: 5. Future directions

Open questions/future directions
- Our reconstruction guarantees are optimal up to some constants. What are the right constants?
- We only use $O(\sqrt{k})$ bits of information per $k$-bit sequence, so there should be room for improvement.
- Can we remove some of the strong assumptions in the model?
  - What if there isn't sitewise independence of mutations?
  - What if the root bitstring isn't chosen uniformly at random?
  - What if different parts of each sequence are generated using different trees?

Thank you! Questions?