Arun Ganesh (UC Berkeley)

Presentation on theme: "Arun Ganesh (UC Berkeley)". Presentation transcript:

1 Optimal Sequence Length Requirements for Phylogenetic Tree Reconstruction with Indels
Arun Ganesh (UC Berkeley), with Qiuyi (Richard) Zhang (UC Berkeley and Google)

2 Setup
Start with a "model" binary tree; its $n$ leaves are the extant species.
Image source: Bulbapedia

3 Setup
Start with a "model" binary tree.
Sample a uniformly random bitstring (DNA) for the root: $\sigma_r \sim \{0,1\}^k$.

9 Setup
Start with a "model" binary tree.
Sample a uniformly random bitstring (DNA) for the root, e.g. 011001010.
DNA is inherited down the tree with mutations. On edge $e$, each bit:
- substitutes w.p. $p_s(e)$
- inserts a random bit w.p. $p_i(e)$
- deletes w.p. $p_d(e)$

12 Setup
Start with a "model" binary tree.
Sample a uniformly random bitstring (DNA) for the root.
DNA is inherited down the tree with mutations. On edge $e$, each bit:
- substitutes w.p. $p_s(e)$
- inserts a random bit w.p. $p_i(e)$
- deletes w.p. $p_d(e)$
The algorithm is given the leaf bitstrings and must reconstruct the tree with high probability.
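
A minimal Python sketch of the mutation process on a single edge may make the model concrete. The function name `mutate_along_edge`, the order in which the three events are checked, and the choice to place an inserted bit immediately after the current bit are assumptions for illustration; the slides only give the three per-bit probabilities.

```python
import random

def mutate_along_edge(bits, p_s, p_i, p_d, rng):
    """Apply one edge's mutations to a list of 0/1 bits; return the child's bits.

    Assumed event order per bit (not specified in the talk):
    deletion first, then substitution, then insertion to the right.
    """
    child = []
    for bit in bits:
        if rng.random() < p_d:
            continue  # the bit is deleted and leaves no descendant
        if rng.random() < p_s:
            bit = 1 - bit  # a substitution flips the bit
        child.append(bit)
        if rng.random() < p_i:
            child.append(rng.randrange(2))  # insert a uniformly random bit
    return child
```

Applying this function along every edge from the root down to the leaves would produce the leaf bitstrings that the reconstruction algorithm receives as input.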

13 Goal
Reconstruct the tree with high probability, using as short a sequence length $k$ as possible (as a function of $n$), while tolerating mutation probabilities $p_s$, $p_i$, $p_d$ that are as large as possible.

14 Motivation
Many practical applications:
- Reconstructing paths of migration
- Linking mutations to disease
- Determining origins of pathogens and likely paths of contamination
- Informing policy on conservation of species
Image source: Tim Lohrentz

15 Prior work
One method: try to align sequences, reducing to the substitution-only case.
Image source: Wikipedia

16 Prior work
Theorem [DMR05]: Can reconstruct the tree in the substitution-only case using $O(\log n)$ bits. With $o(\log n)$-length bitstrings, the problem is impossible. Requires $2(1 - 2p_s)^2 > 1$ (Kesten-Stigum threshold).
[BRZ95, Iof96, EKP00, BKM01, MSW04]: If $2(1 - 2p_s)^2 < 1$, need $n^{\Omega(1)}$ bits for reconstruction.
If we have good alignment methods, we've solved the problem to optimality!

17 Prior work
Unfortunately, multiple sequence alignment (MSA) is NP-hard, and heuristics used in practice may induce problematic biases.

18 Prior work
GWEJ07, LG08, WSH08 give empirical evidence for biases in MSA.
DR10, ABH12 provide some guarantees for reconstruction with indels, but require polynomial sequence lengths or very small $p_i$, $p_d$.
What $p_i$, $p_d$ can we handle with $k = \log^{O(1)} n$? We show the answer is $O(1)$.
[Figure: prior results plotted by sequence length $k$ (axis ticks $O(\log n)$, $O(\log^2 n)$, $\log^{O(1)} n$, $n^{O(1)}$) against tolerated $p_i$, $p_d$ (axis ticks $O(1/\log^2 n)$, $O(1)$); points labeled ESSW99, DMR05, ABH12, DR10, GZ18.]

19 Our contribution
In particular, if:
- [expression in $p_s(e)$, $p_d(e)$, $p_i(e)$, not recoverable from the transcript] $> 1$ for all edges $e$ (Kesten-Stigum threshold)
- $D_{\max} \le \log^{\alpha} n$
- $p_i(e) - p_d(e) \le \beta \log\log n / D_{\max}$
then we can reconstruct the tree with $k = \log^{\kappa(\alpha,\beta)} n$.
$\kappa(\alpha,\beta)$ is optimal up to a small multiplicative constant, and otherwise this result is optimal in every possible sense!
Slide annotation: needed to avoid empty leaf strings.

20 Roadmap
Sequence length $k = n^{O(1)}$:
1. Estimating distances using bitwise correlation (ESSW99), substitution only
2. Using block signatures instead of bits (DR10), with indels
Sequence length $k = \log^{O(1)} n$:
3. Reconstructing bitstrings of internal nodes (Roc08), substitution only
4. Reconstructing signatures of internal nodes (GZ18), with indels
5. Future directions

25 Estimating distances using bitwise correlation
Distance estimation
Well known: distance estimates that concentrate well suffice to reconstruct the tree.
High-level algorithm:
1. Use distances to identify siblings
2. Use distances to compute distances from parents
3. Recurse on parents
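
Step 2 can be illustrated with the standard three-point identity for additive tree distances. This is a sketch of one common way to carry out the step, assuming exact additive distances; it is not necessarily the exact procedure the talk has in mind, and `parent_distances` is a hypothetical helper name.

```python
def parent_distances(d_ab, d_ac, d_bc):
    """Given siblings a, b with parent p and any third leaf c,
    return (d(a,p), d(b,p), d(p,c)) from the three pairwise distances.

    Assumes the distances are additive along tree paths, so that
    d(a,b) = d(a,p) + d(b,p), d(a,c) = d(a,p) + d(p,c), and
    d(b,c) = d(b,p) + d(p,c).
    """
    d_ap = (d_ab + d_ac - d_bc) / 2
    d_bp = (d_ab + d_bc - d_ac) / 2
    d_pc = (d_ac + d_bc - d_ab) / 2
    return d_ap, d_bp, d_pc
```

Once the distances from the parent to every other node are known, the parent can stand in for the sibling pair and the procedure can recurse, as in step 3.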

26 Estimating distances using bitwise correlation
To estimate distance in the substitution-only case, we can use bitwise correlation (a linear rescaling of Hamming distance).
Think of bits as $\pm 1$ instead of 0/1. Let $\sigma_{a,j}$ be the $j$th bit of $a$'s bitstring. The bitwise correlation is $\frac{1}{k} \sum_{j=1}^{k} \sigma_{a,j} \sigma_{b,j}$.
Faraway nodes have dissimilar bitstrings; siblings have very similar bitstrings.

27 Estimating distances using bitwise correlation
Claim: If we define edge lengths as $d(e) = -\ln(1 - 2p_s(e))$, then
$\mathbb{E}\left[\frac{1}{k} \sum_{j=1}^{k} \sigma_{a,j} \sigma_{b,j}\right] = \prod_{e \in P_{a,b}} (1 - 2p_s(e)) = e^{-d(a,b)}$.
Here $P_{a,b}$ is the path between $a$ and $b$, and $d(a,b)$ is the sum of the edge lengths along it.
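
The claim suggests a simple distance estimator for the substitution-only case: compute the bitwise correlation and take its negative logarithm. The sketch below assumes bitstrings are given as equal-length lists of 0/1 values; the function names are hypothetical, and returning infinity for a non-positive correlation is a convention chosen here, not something stated in the talk.

```python
import math

def bitwise_correlation(a_bits, b_bits):
    """Average of sigma_{a,j} * sigma_{b,j} after mapping bits 0/1 to -1/+1."""
    if len(a_bits) != len(b_bits) or not a_bits:
        raise ValueError("bitstrings must be non-empty and of equal length")
    total = sum((2 * x - 1) * (2 * y - 1) for x, y in zip(a_bits, b_bits))
    return total / len(a_bits)

def estimated_distance(a_bits, b_bits):
    """Estimate d(a,b) as -ln(correlation); infinite if the correlation is <= 0."""
    corr = bitwise_correlation(a_bits, b_bits)
    return -math.log(corr) if corr > 0 else math.inf

def expected_correlation(path_sub_probs):
    """Product of (1 - 2 p_s(e)) over the edges on the path between a and b."""
    return math.prod(1 - 2 * p for p in path_sub_probs)
```

For a path with substitution probabilities 0.1 and 0.2, the expected correlation is the product of 0.8 and 0.6, and its negative logarithm equals the sum of the two edge lengths.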

28 Concentration of bitwise correlation
How well does it concentrate? Rough analysis:
Bitwise correlation has standard deviation $\approx 1/\sqrt{k}$.
For the correlation to concentrate at distance $O(\log n)$, need $e^{-O(\log n)} > 1/\sqrt{k}$, i.e. $k = n^{\Omega(1)}$.
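
The rough analysis can be written out as a short calculation, under the assumption (made here only for illustration) that the distance is $d = c \ln n$ for some constant $c > 0$:

```latex
\[
e^{-d} > \frac{1}{\sqrt{k}}
\iff \sqrt{k} > e^{d} = e^{c \ln n} = n^{c}
\iff k > n^{2c},
\]
```

so the signal $e^{-d}$ only rises above the noise level $1/\sqrt{k}$ once $k$ is polynomial in $n$.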

29 Using block signatures
Roadmap, part 2: Using block signatures instead of bits (DR10).

30 Using block signatures
Handling indels
Problem with bitwise correlation when indels are introduced: indels make bits move around a lot between bitstrings.
What bit in $b$'s bitstring does bit $j$ of $a$'s bitstring correspond to? What if it doesn't appear at all?
[Figure: bit $j$ of $a$ and its counterpart in $b$, shown for the substitution-only case and for the case with indels.]

31 Blockwise correlation
To handle shifts due to insertions and deletions, split bitstrings into blocks.
Assume $p_i = p_d$ everywhere (not too hard to generalize).
If we split bitstrings into blocks of length, say, $l = k^{3/4}$, most bits will stay within a block throughout the tree.
Any bit shifts by $< \sqrt{k}$ positions in expectation, and by at most $\sqrt{k} \log n$ positions with high probability, on one edge.
[Figure: bit $j$ of $a$ inside a block of length $k^{3/4}$, shifted by $< \sqrt{k} \log n$ positions in $b$.]

32 Blockwise correlation
Define the signature $s_{a,i}$ of block $i$ in bitstring $a$ as the sum of the ($\pm 1$) bits in block $i$, divided by $\sqrt{l}$.
Signatures $s_{a,i}$ are robust to shifts, so they behave like bits in the substitution-only case, i.e. $s_{a,i} s_{b,i}$ behaves like bitwise correlation.
$s_{a,i} s_{b,i} = \frac{1}{l} \sum_{j} \sum_{j'} \sigma_{a,j} \sigma_{b,j'}$, where $j$ and $j'$ range over block $i$ of $a$ and of $b$ respectively.
Fixing any series of indels, $\sigma_{a,j} \sigma_{b,j'}$ is non-zero in expectation only if bits $j$ and $j'$ correspond to each other.
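
As a sketch of the definition, the code below computes block signatures and a blockwise correlation for two bitstrings of possibly different lengths. The helper names are hypothetical; dropping an incomplete trailing block and averaging the products over the blocks both strings share are assumptions made here for illustration.

```python
import math

def block_signatures(bits, block_len):
    """Signature of each full block: sum of its +/-1 bits divided by sqrt(block_len).

    Any incomplete trailing block is dropped (an assumption of this sketch).
    """
    signs = [2 * b - 1 for b in bits]
    num_blocks = len(signs) // block_len
    scale = math.sqrt(block_len)
    return [
        sum(signs[i * block_len:(i + 1) * block_len]) / scale
        for i in range(num_blocks)
    ]

def blockwise_correlation(a_bits, b_bits, block_len):
    """Average of s_{a,i} * s_{b,i} over the blocks present in both strings."""
    sig_a = block_signatures(a_bits, block_len)
    sig_b = block_signatures(b_bits, block_len)
    shared = min(len(sig_a), len(sig_b))
    if shared == 0:
        raise ValueError("no full block is present in both bitstrings")
    return sum(x * y for x, y in zip(sig_a, sig_b)) / shared
```

Deleting the first bit of `[1, 1, 1, 1, 0, 0, 0, 0]` shifts every later bit by one position, yet with blocks of length 4 the first signature only moves from 2.0 to 1.0, which is the robustness to shifts that the slide refers to.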

33 Blockwise correlation
Define $d(e) = -\ln\left[(1 - 2p_s(e))(1 - p_d(e))\right]$.
Lemma: $\mathbb{E}[s_{a,i} s_{b,i}] = (1 \pm o(1)) \prod_{e \in P_{a,b}} (1 - 2p_s(e))(1 - p_d(e)) \approx e^{-d(a,b)}$.
- The $(1 \pm o(1))$ factor is an error term that accounts for the tiny fraction of bits that move in or out of blocks.
- The $(1 - p_d(e))$ factors capture the decay in the number of corresponding bits between the blocks due to deletions.
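
The edge length with deletions can be computed directly from the definition. The sketch below uses hypothetical helper names and assumes $p_s(e) < 1/2$ and $p_d(e) < 1$ so that the logarithm is defined.

```python
import math

def edge_length(p_s, p_d):
    """d(e) = -ln[(1 - 2 p_s(e)) (1 - p_d(e))]; requires p_s < 1/2 and p_d < 1."""
    return -math.log((1 - 2 * p_s) * (1 - p_d))

def path_distance(edges):
    """Sum of edge lengths over a path given as a list of (p_s, p_d) pairs."""
    return sum(edge_length(p_s, p_d) for p_s, p_d in edges)
```

Exponentiating the negated path distance recovers the product in the lemma, up to the $(1 \pm o(1))$ factor that this sketch ignores.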

34 Challenges with concentration
$s_{a,i} s_{b,i}$ includes lots of non-corresponding bitwise products that are 0 in expectation, but the total variance may be high.
But each $s_{a,i}$ is $O(\log n)$ with high probability, and we have $k^{1/4}$ products, so we can boost $k$ to get concentration via Azuma's inequality if the $s_{a,i} s_{b,i}$ are independent.
Unfortunately, $s_{a,i} s_{b,i}$ and $s_{a,i'} s_{b,i'}$ are not independent. Knowing $s_{a,i} s_{b,i}$ gives you information about the indel process between $a$ and $b$, and thus about $s_{a,i'} s_{b,i'}$ (since indels affect many blocks simultaneously).

35 Challenges with concentration
Our fix: Suppose we could define blocks based on the actual alignment. Let $s^*_{a,i}$ be the signatures of these blocks.
Indels can only affect one "aligned" signature, so the $s^*_{a,i} s^*_{b,i}$ are independent and concentrate nicely.
We don't know the actual alignment, so we can't compute $s^*_{a,i}$, but we can show $s_{a,i} s_{b,i}$ is very close to $s^*_{a,i} s^*_{b,i}$.
[Figure: bitstrings $a$ and $b$, showing the blocks our estimator uses versus the aligned blocks.]

36 Reconstructing bitstrings
Roadmap, part 3: Reconstructing bitstrings of internal nodes (Roc08).

37 Challenges with $k = \log^{O(1)} n$
With $k = \log^{O(1)} n$, correlation only concentrates at distance $O(\log\log n)$.
Can show far-apart nodes have low correlation with high probability (no false positives), so we can still reconstruct the first $\Omega(\log\log n)$ levels based only on leaf distance estimates.

38 Challenges with $k = \log^{O(1)} n$
Use the estimator $\sigma_{a,m} = \frac{1}{2^h} \sum_{j} e^{d(a, a_j)} \sigma_{a_j,m}$. Correct in expectation!
But the variance of the correlation of $\sigma_a$ and $\sigma_b$ is large.
[Figure: nodes $a$ and $b$ with $d(a,b) = O(\log\log n)$, each with descendants $a_1, a_2, \dots, a_{2^h}$ and $b_1, b_2, \dots, b_{2^h}$ at depth $h = \Omega(\log\log n)$ below.]

39 Challenges with $k = \log^{O(1)} n$
To get better concentration, use a median-of-means approach.
For each pair of descendants $a_j$, $b_j$ at height $h$ below, compute $\sigma_{a_j}$, $\sigma_{b_j}$, and use their correlation to get an estimate of $d(a_j, b_j)$ (and thus of $d(a, b)$).
[Figure: nodes $a$ and $b$ with $d(a,b) = O(\log\log n)$, descendants $a_1, a_2, \dots, a_{2^h}$ and $b_1, b_2, \dots, b_{2^h}$ at depth $h = \log\log n$, with subtrees labeled $A_1$ and $B_1$.]

40 Challenges with $k = \log^{O(1)} n$
We get $\log n$ estimates of $d(a, b)$, one for each $a_j$, $b_j$ pair.
We can condition on having "good" bitstrings at all $a_j$, $b_j$.
The $\log n$ estimators of $d(a, b)$ are conditionally independent, so the median concentrates well.
[Figure: as on the previous slide, with $d(a,b) = O(\log\log n)$ and $h = \log\log n$.]
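
The median step can be sketched as follows, assuming each descendant pair comes with reconstructed $\pm 1$ strings and a known offset $d(a, a_j) + d(b, b_j)$ to subtract. The function names and the input layout are hypothetical, and treating a non-positive correlation as an infinite distance is a convention chosen here.

```python
import math
import statistics

def correlation(x, y):
    """Mean of the coordinate-wise products of two equal-length +/-1 lists."""
    return sum(u * v for u, v in zip(x, y)) / len(x)

def median_distance(pairs, offsets):
    """Median over descendant pairs of the estimate of d(a, b).

    pairs: list of (sigma_aj, sigma_bj), each a list of +/-1 values.
    offsets: list of d(a, a_j) + d(b, b_j), one per pair.
    """
    estimates = []
    for (x, y), offset in zip(pairs, offsets):
        corr = correlation(x, y)
        d_desc = -math.log(corr) if corr > 0 else math.inf  # estimate of d(a_j, b_j)
        estimates.append(d_desc - offset)  # subtract the two downward legs
    return statistics.median(estimates)
```

Each per-pair correlation is a mean over bit positions, and the median over pairs discards the occasional badly wrong estimate, which is the usual reason for preferring a median of means to a single overall mean.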

41 Reconstructing signatures
Roadmap, part 4: Reconstructing signatures of internal nodes (GZ18).

42 Reconstructing signatures
Signatures are robust to indels, so they behave similarly to bits in the substitution-only case.
This suggests our algorithm: apply the reconstruction scheme to signatures.
There are some technical challenges in the analysis that we need to overcome.

43 Challenges with reconstruction with indels
In reconstructing signatures, bits appearing in blocks of ancestors but not children (or vice versa) may induce noise that is non-zero in expectation in the recursive estimator.
We show that since the noise also "decays", it is tiny in expectation, so misalignment does not ruin the reconstructed signatures.
[Figure: block of the ancestor $a$ (which we condition on), split into signal and noise; block of the descendant $a_j$, split into decayed signal and noise.]

44 Challenges with reconstruction with indels
We again have to deal with $s_{a_j,i} s_{b_j,i}$ and $s_{a_j,i'} s_{b_j,i'}$ not being independent when analyzing the variance of our recursive estimator for $d(a, b)$.
We show that the covariance of the reconstructed blockwise correlations is small, i.e. $s_{a,i} s_{b,i}$ and $s_{a,i'} s_{b,i'}$ are almost completely independent.

45 Roadmap
Part 5: Future directions.

46 Open questions/future directions
Our reconstruction guarantees are optimal up to some constants. What are the right constants?
- We only use $O(\sqrt{k})$ bits of information per $k$-bit sequence, so there should be room for improvement.
Can we remove some of the strong assumptions in the model?
- What if there isn't sitewise independence of mutations?
- What if the root bitstring isn't chosen uniformly at random?
- What if different parts of each sequence are generated using different trees?

47 Thank You! Questions?

