Gene Tree Estimation Through Affinity Propagation
Vladimir Smirnov

Context
- The gene tree estimation problem
- The Affinity Propagation clustering algorithm
- Can we somehow combine the two?

Quick Reminder - Affinity Propagation
- We have a set of data points with a notion of “similarity”
- Each point chooses a representative
- The algorithm (approximately) optimizes the total similarity

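As a reminder of the mechanics, here is a minimal pure-Python sketch of the message-passing updates from Frey & Dueck (2007). It is an illustration, not the code used for this work. The damping value, the iteration count, the toy 1-D points, and the choice of the median similarity as the self-preference are all assumptions made for the example.

```python
import statistics

def affinity_propagation(S, damping=0.5, iterations=200):
    """Return, for each point i, the index of the representative it chooses.

    S is an n x n similarity matrix (n >= 2); S[k][k] is the "preference",
    i.e. how willing point k is to act as a representative.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k), i' not in {i, k})
        # a(k, k) = sum of positive r(i', k), i' != k
        for k in range(n):
            support = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(support) - support[k]
            for i in range(n):
                if i == k:
                    new_a = total
                else:
                    new_a = min(0.0, R[k][k] + total - support[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new_a
    # Each point picks the k that maximizes a(i, k) + r(i, k).
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

# Toy data (assumed): two well-separated groups on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
S = [[-(a - b) ** 2 for b in points] for a in points]
preference = statistics.median(
    S[i][k] for i in range(len(points)) for k in range(len(points)) if i != k
)
for i in range(len(points)):
    S[i][i] = preference
exemplars = affinity_propagation(S)
```

Here `exemplars[i]` is the index of the representative chosen by point `i`; points that share a representative form one cluster.
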
Intuition - Adapting to the Gene Tree Problem
- “Data points” → Tree nodes
- “Similarity” → Branch length
- “Representative” → Parent
- “Optimizing total similarity” → Optimizing total branch lengths

Informal Algorithm
- Begin with a star topology tree
- While the tree is not binary:
  - Augment existing nodes with a pool of candidate nodes
  - Run Affinity Propagation over this set
  - Candidate nodes chosen as representatives become new internal nodes
- Clean up and return the result

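The loop condition and the "representatives become new internal nodes" step can be sketched on a toy tree. This is only an illustration under stated assumptions: the tree is stored as a rooted child-list dictionary, "binary" is taken to mean at most two children per node, and the helper names `polytomies` and `insert_internal_node` are hypothetical. The groupings below are hard-coded; in the algorithm above they would come from the Affinity Propagation run.

```python
def polytomies(children):
    """Return the nodes that still have more than two children."""
    return [node for node, kids in children.items() if len(kids) > 2]

def insert_internal_node(children, parent, new_node, grouped):
    """Move the `grouped` children of `parent` under a new internal node."""
    grouped = list(grouped)
    remaining = [c for c in children[parent] if c not in grouped]
    children[parent] = remaining + [new_node]
    children[new_node] = grouped

# Star topology on four leaves (leaves have no entry of their own).
tree = {"root": ["a", "b", "c", "d"]}
before = polytomies(tree)

# Pretend candidate "x" was chosen as representative by a and b,
# and candidate "y" by c and d.
insert_internal_node(tree, "root", "x", ["a", "b"])
insert_internal_node(tree, "root", "y", ["c", "d"])
after = polytomies(tree)
```

The outer loop would keep going while `polytomies(tree)` is non-empty.
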
The Main Design Questions
- How do we label and select candidate internal nodes?
  - Best solution: label with a probability distribution over sequences
  - Pick a probability distribution somewhere between parent and child
- How do we correctly prepare the similarity matrix?
  - Best solution: “intersect” the probability distributions at each site, then sum up the logs
- How do we ensure that the tree becomes binary?
  - Best solution: retain all candidates and revisit unresolved polytomies as often as needed

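One possible reading of the first two answers, sketched in Python. The slide does not define “intersect”, so the overlap used here (the probability that the two site distributions agree, Σ p(x)·q(x)) is an assumption, as are the linear mixing used for “somewhere between parent and child”, the four-letter DNA alphabet, the `floor` that guards against log(0), and every function name.

```python
import math

STATES = "ACGT"  # assumed alphabet

def one_hot(sequence):
    """Label a node with a degenerate per-site distribution for a known sequence."""
    return [{x: 1.0 if x == c else 0.0 for x in STATES} for c in sequence]

def between(parent, child, t=0.5):
    """Candidate label: per-site mixture, t=0 gives the parent, t=1 the child."""
    return [
        {x: (1 - t) * p[x] + t * c[x] for x in STATES}
        for p, c in zip(parent, child)
    ]

def site_overlap(p, q):
    """Assumed 'intersection' of two site distributions: sum of p(x) * q(x)."""
    return sum(p[x] * q[x] for x in STATES)

def log_similarity(node_a, node_b, floor=1e-12):
    """Sum over sites of the log overlap; larger (closer to 0) means more similar."""
    if len(node_a) != len(node_b):
        raise ValueError("nodes must cover the same number of sites")
    return sum(
        math.log(max(site_overlap(p, q), floor)) for p, q in zip(node_a, node_b)
    )

parent = one_hot("ACGT")
child = one_hot("ACGA")
candidate = between(parent, child, t=0.5)

same = log_similarity(parent, parent)
to_candidate = log_similarity(parent, candidate)
to_child = log_similarity(parent, child)
```

Under these assumptions the candidate sits between the parent and the child in similarity, and these values would fill the matrix handed to Affinity Propagation.
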
Conclusion
- Didn’t work
- Neighbor Joining is philosophically similar, but does it better
- Why?
  - Error rate starts very low, but grows nonlinearly with the number of nodes inserted
  - Effectively captures coarse distinctions at the outermost layers of the tree, but gets confused in the interior
  - Reliance on distributions at existing internal nodes to anchor subsequent optimization causes error to compound

References
- Desper, R., & Gascuel, O. (2002, September). Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. In International Workshop on Algorithms in Bioinformatics (pp. 357-374). Springer, Berlin, Heidelberg.
- Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814), 972-976.
- Lefort, V., Desper, R., & Gascuel, O. (2015). FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular Biology and Evolution, 32(10), 2798-2800.
- Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., & Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324(5934), 1561-1564.
- Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406-425.