Linear Least Squares and its applications in distance matrix methods. Presented by Shai Berkovich, June 2007. Seminar in Phylogeny, CS236805. Based on the paper by Olivier Gascuel.

Contents: Background and Motivation; LS in general; LS in phylogeny; the UNJ algorithm; the LS sense of UNJ.

Distance Matrix Methods. A major family of phylogenetic methods are the distance matrix methods. The general idea: calculate a measure of the distance between each pair of species, and then find a tree that predicts the observed set of distances as closely as possible.

Distance Matrix Methods. This leaves out all information from higher-order combinations of character states, reducing the data matrix to a simple table of pairwise distances; however, computer simulation studies show that the amount of information about the phylogeny that is lost is remarkably small. (As we already saw: Neighbor-Joining and its robustness to noise.)

Additivity. Definition: a distance matrix D is additive if there exists a tree with positive edge weights such that D_ij = Σ_k w(v_k), where the v_k are the edges on the path between species i and j. Theorem [Waterman et al., 1977]: given an additive n×n distance matrix D, there is a unique edge-weighted tree (without nodes of degree 2) in which n nodes are labeled s_1, s_2, …, s_n so that the length of the path between s_i and s_j equals D_ij. Furthermore, this unique tree consistent with D can be reconstructed in O(n²) time.

Distance-Based reconstruction. Input: distance matrix D. Output: edge-weighted tree T (if D is additive, then D_T = D; otherwise, return a tree best 'fitting' the input D). (The slide shows an example 5×5 distance matrix over species A–E; note that the matrix itself carries no topology!)

Approximation. In practice, the distance matrix between molecular sequences will not be additive, so we want to find a tree T whose distance matrix approximates the given one. The algorithms give exact results when operating on an additive matrix, but their behavior is less clear when a real (non-additive) matrix is handled.

LS Overview. Linear least squares is a mathematical optimization technique for finding an approximate solution to a system of linear equations that has no exact solution. This typically happens when the number of equations (m) is larger than the number of variables (n).

LS Overview. In mathematical terms, we want to find a solution to the equation Ax = b, where A is a known m×n matrix (usually with m > n), x is an unknown n-dimensional parameter vector, and b is a known m-dimensional measurement vector.

LS Overview. Euclidean norm: on R^n the notion of the length of a vector is captured by ||x|| = √(x_1² + … + x_n²), the ordinary distance from the origin to the point x. More precisely, we want to minimize the squared Euclidean norm of the residual Ax − b, that is, the quantity ||Ax − b||² = Σ_i ([Ax]_i − b_i)², where [Ax]_i denotes the i-th component of the vector Ax. Hence the name "least squares".

LS Overview. Fact: the squared norm of v is vᵀv, so the quantity to minimize is ||Ax − b||² = (Ax − b)ᵀ(Ax − b). What do we do when we want to minimize? Differentiate with respect to x and set the gradient to zero, which yields the normal equations AᵀAx = Aᵀb.

LS Overview. Note that the normal equations AᵀAx = Aᵀb are themselves a system of linear equations. The matrix AᵀA on the left-hand side is a square matrix, which is invertible if A has full column rank (that is, if the rank of A is n). In that case, the solution of the system is unique and given by x = (AᵀA)⁻¹Aᵀb.
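As a sanity check, the normal-equations recipe above can be coded directly. This is a minimal sketch in plain Python (the data points are made up for illustration): it fits a line to four points, an m = 4, n = 2 overdetermined system.

```python
# Sketch: solve an overdetermined system Ax = b in the least-squares sense
# via the normal equations A^T A x = A^T b, with no external libraries.
# The example data (a line fit) is hypothetical.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def solve(M, y):
    """Solve the square system M x = y by Gaussian elimination with pivoting."""
    n = len(M)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def least_squares(A, b):
    At = transpose(A)
    AtA = matmul(At, A)                                  # n x n
    Atb = [sum(At[i][k] * b[k] for k in range(len(b))) for i in range(len(At))]
    return solve(AtA, Atb)                               # x = (A^T A)^{-1} A^T b

# Fit y = c0 + c1*t to four noisy points (m = 4 equations, n = 2 unknowns)
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 2.9, 5.1, 7.0]
x = least_squares(A, b)
```

The fitted line comes out close to y = 1 + 2t, as expected for this data.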

LS in phylogeny. Input: 1. a distance matrix D; 2. a tree topology. We expect to recover the same tree when the dissimilarity matrix is additive. (The slide repeats the 5×5 example matrix over species A–E.)

LS in phylogeny: intuition. The measure that we use is the discrepancy between the observed and expected distances: Q = Σ_{i<j} w_ij (D_ij − d_ij)², where the w_ij are weights that differ between LS methods: w_ij = 1 (ordinary LS), w_ij = 1/D_ij, or w_ij = 1/D_ij² (Fitch–Margoliash).

LS in phylogeny. (The slide shows a five-species tree with edges v1, …, v7.) Introduce an indicator variable x_{ij,k}, which is 1 if branch k lies on the path from species i to species j and 0 otherwise; the expected distances are then d_ij = Σ_k x_{ij,k} v_k, which is linear in the edge lengths v_k.
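To make the indicator variables concrete, here is a small sketch; the 4-taxon topology ((A,B),(C,D)) and the edge lengths are hypothetical. Each row of X marks the edges on one leaf-to-leaf path, and d = Xv reproduces the path distances.

```python
# Sketch: the indicator ("topology") matrix X for a hypothetical 4-taxon
# tree ((A,B),(C,D)) with edges v1..v5:
#   A -v1- u,  B -v2- u,  u -v5- w,  C -v3- w,  D -v4- w.
# Row (i,j) of X has a 1 in column k iff edge k lies on the path i -> j,
# so the expected distances are d = X v.

# paths between leaf pairs, as sets of 0-based edge indices
paths = {
    ("A", "B"): {0, 1},
    ("A", "C"): {0, 4, 2},
    ("A", "D"): {0, 4, 3},
    ("B", "C"): {1, 4, 2},
    ("B", "D"): {1, 4, 3},
    ("C", "D"): {2, 3},
}

edge_count = 5
X = [[1 if k in p else 0 for k in range(edge_count)] for p in paths.values()]

v = [2.0, 3.0, 4.0, 5.0, 1.0]          # edge lengths v1..v5
d = [sum(X[r][k] * v[k] for k in range(edge_count)) for r in range(len(X))]
```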

LS in phylogeny. In matrix form the system reads d = Xv, and minimizing Q leads to the normal equations XᵀXv = XᵀD (prop. 1).

LS in phylogeny. The normal equations give as many equations as there are edges, so the system has a unique solution if the matrix has full column rank. What matrix? The topology matrix X whose rows are the indicator vectors x_ij.

LS in phylogeny. Example: three species A, B, C joined to a central node by edges v1, v2, v3, with distance matrix:

       A   B   C
   A   0  10  12
   B  10   0   8
   C  12   8   0
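For this three-species example the system d_AB = vA + vB, d_AC = vA + vC, d_BC = vB + vC is square (three equations, three unknowns), so LS reduces to an exact solve with a closed form:

```python
# Sketch: for three taxa the LS system is exactly determined, so the
# edge lengths have a closed form. The matrix is the one on the slide.
d_AB, d_AC, d_BC = 10.0, 12.0, 8.0

vA = (d_AB + d_AC - d_BC) / 2
vB = (d_AB + d_BC - d_AC) / 2
vC = (d_AC + d_BC - d_AB) / 2
```

This gives vA = 7, vB = 3, vC = 5, and these lengths reproduce the distance matrix exactly.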

LS in phylogeny. With weighted LS, the previous equations become XᵀWXv = XᵀWD, i.e. v = (XᵀWX)⁻¹XᵀWD, where W is a diagonal matrix with the distance weights on its main diagonal. Simulations usually show that LS methods perform better than NJ.

LS in phylogeny. One could imagine an LS method that, for each tree topology, forms the matrix, inverts it, and obtains the estimates. This can be done, but it is computationally burdensome, even if not all topologies are examined: inverting the matrix costs O(n³) for a tree with n tips, and in principle each tree topology should be considered.

UNJ algorithm. Recall the NJ algorithm:
1. Begin with a star tree and all sequences as nodes in L.
2. Find the pair of nodes with minimum Q_A,B.
3. Create and insert a new node K with branch lengths d_A,K = ½(d_A,B + r_A − r_B) and d_B,K = ½(d_A,B + r_B − r_A).
4. For the remaining nodes, update the distance to K as d_K,C = ½(d_A,C + d_B,C − d_A,B).
5. Insert K and remove A, B from L.
6. Repeat steps 2–5 until only two nodes are left.
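The NJ recipe above can be sketched in plain Python. The slide leaves Q_A,B and r_A undefined, so this sketch assumes the standard choices: Q(A,B) = (r − 2)·d(A,B) − R_A − R_B, minimized, with r_A = R_A/(r − 2), where R_A is the sum of distances from A to all live nodes; the example matrix is an additive toy case.

```python
# Sketch of the NJ loop from the slide. Assumptions (not on the slide):
# Q(A,B) = (r-2)*d(A,B) - R_A - R_B, minimized; r_A = R_A/(r-2).

def neighbor_joining(names, D):
    """names: list of labels; D: dict {(a,b): dist} with a before b.
    Returns a log of merges: (a, b, new_label, len_a, len_b)."""
    d = dict(D)
    d.update({(b, a): v for (a, b), v in D.items()})
    live = list(names)
    log = []
    while len(live) > 2:
        r = len(live)
        R = {a: sum(d[(a, b)] for b in live if b != a) for a in live}
        # step 2: pair minimizing Q
        a, b = min(((p, q) for i, p in enumerate(live) for q in live[i+1:]),
                   key=lambda pr: (r - 2) * d[pr] - R[pr[0]] - R[pr[1]])
        # step 3: branch lengths to the new node
        la = 0.5 * d[(a, b)] + (R[a] - R[b]) / (2 * (r - 2))
        lb = d[(a, b)] - la
        k = a + b                      # fresh label for the join
        # step 4: reduction formula from the slide
        for c in live:
            if c not in (a, b):
                d[(k, c)] = d[(c, k)] = 0.5 * (d[(a, c)] + d[(b, c)] - d[(a, b)])
        live = [c for c in live if c not in (a, b)] + [k]
        log.append((a, b, k, la, lb))
    # step 6: connect the last two nodes
    a, b = live
    log.append((a, b, None, d[(a, b)], 0.0))
    return log

# Additive toy input generated by the tree ((A:2,B:3):1,(C:4,D:5)):
names = ["A", "B", "C", "D"]
D = {("A","B"): 5, ("A","C"): 7, ("A","D"): 8,
     ("B","C"): 8, ("B","D"): 9, ("C","D"): 9}
merges = neighbor_joining(names, D)
```

On this additive input NJ recovers the generating tree ((A:2,B:3):1,(C:4,D:5)) exactly, illustrating the exactness-on-additive-matrices claim from the earlier slide.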

UNJ algorithm. Although the NJ algorithm is widely used and has yielded satisfactory simulation results, certain questions remain: the proof of correctness of the selection criterion (Saitou & Nei) was contested, and a complete proof had still not been provided; and the NJ reduction formula gives identical importance to nodes x and y, even if one corresponds to a group of several objects and the other is a single object.

UNJ algorithm. The manner in which the edge lengths are estimated is inexact in LS terms when the agglomerated nodes represent not individual objects but groups of objects. The paper provides answers to these questions, but we will concentrate on the last one. (The weighted/unweighted distinction is often misunderstood.)

UNJ algorithm. Definitions: E = {1, 2, …, n} is a set of n objects (leaves), with a dissimilarity matrix Δ = (δ_ij) over E. Removing an edge from T constitutes a bipartition (X, X̄), where X may be viewed in two ways: as a subset of E, or as a rooted subtree of T whose root is situated at the extremity of the removed edge. T denotes any valued tree; T` denotes its structure.

UNJ algorithm. Definitions: the cardinality of X is the number of leaves in the subtree X, also denoted n_x; S = (s_ij) is the adjusted tree generated by LS, and S` is the tree structure associated with the adjusted tree; X and Y below denote subtrees (bipartitions) of S`.

UNJ algorithm. Definitions: δ_XY and s_XY denote the average dissimilarity and the average adjusted distance between subtrees X and Y, e.g. δ_XY = (1/(n_x n_y)) Σ_{i∈X, j∈Y} δ_ij; f_X denotes the flow of a rooted subtree X, the average adjusted length from the root of X to its leaves (so f_i = 0 for a single leaf i).

Our Model: statistics. Estimates are unbiased, i.e., for every i, j we have δ_ij = d_ij + ε_ij with E[ε_ij] = 0, where the noise variables ε_ij are i.i.d. (the result of real observations and measurements). The paper states that it is then coherent to use an unweighted approach, which allocates the same level of importance to each of the initial objects. Furthermore, within this model it is justified to use the "ordinary" LS criterion as opposed to the "generalized" one, which takes into account variances and covariances of the estimates.

UNJ algorithm.
1. Initialize the running matrix: Δ ← (δ_ij).
2. Initialize the number of remaining nodes: r ← n.
3. Initialize the numbers of objects per node: n_i ← 1 for every i.
4. While the number of nodes r is greater than 3: compute the sums; find the pair {x, y} to be agglomerated by maximizing the selection criterion Q_xy (1); create the node u and set n_u ← n_x + n_y; estimate the lengths of the edges (x, u) and (y, u) using (2); reduce the running matrix using (3); decrease the number of nodes: r ← r − 1.
5. Create a central node and compute the last three edge lengths using (2).
6. Output the tree found.
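The exact reduction formula (3) did not survive transcription, so the sketch below follows the commonly cited UNJ reduction: the new node u averages the distances of x and y weighted by their object counts n_x, n_y, after subtracting the estimated edge lengths. With singletons it collapses to NJ's reduction, which the usage below checks.

```python
# Sketch of UNJ's unweighted reduction step (formula (3) in the talk;
# the slide's own formula was lost, so this follows the commonly cited
# count-weighted UNJ reduction).

def unj_reduce(d_xz, d_yz, d_xy, n_x, n_y, l_xu, l_yu):
    """Distance from the new node u (joining x and y) to a remaining node z."""
    lam = n_x / (n_x + n_y)            # weight of x by its object count
    return lam * (d_xz - l_xu) + (1 - lam) * (d_yz - l_yu)

# With singletons (n_x = n_y = 1) and l_xu + l_yu = d_xy, this collapses
# to NJ's reduction ½(d_xz + d_yz − d_xy):
d_xz, d_yz, d_xy = 7.0, 8.0, 5.0
l_xu, l_yu = 2.0, 3.0
u1 = unj_reduce(d_xz, d_yz, d_xy, 1, 1, l_xu, l_yu)
u2 = 0.5 * (d_xz + d_yz - d_xy)
```

Unlike NJ's reduction, the count-weighted form gives a subtree of several objects proportionally more influence on the new distances, which is exactly the asymmetry issue raised two slides earlier.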

UNJ algorithm (vs. NJ): (1) the selection criterion; (2) the estimation formula (d_yu is obtained by symmetry); (3) the reduction formula. We won't prove (1) and (3).

Conservation Property 1. Given a dissimilarity matrix, an adjusted tree S, and its structure S`, we have (Vach, 1989): for every bipartition (X, X̄) of S`, the average dissimilarity equals the average adjusted distance, δ_XX̄ = s_XX̄. Proof: we saw that XᵀXv = XᵀD; let's have a closer look at these matrices.

Conservation Property 1. Let n be the number of leaves, q = 2n − 3 the number of edges, and m = n(n−1)/2 the number of distances. Then Xv is the m×1 vector of tree path lengths between the leaves, and D is the m×1 vector of dissimilarity distances; XᵀXv is the q×1 vector that, for each edge, sums the path lengths over all leaf pairs whose path passes over that edge, and XᵀD is the q×1 vector of the corresponding sums of dissimilarities (slide no. 16). Equating them edge by edge, the property is established.

Conservation Property 2. For all ternary nodes u of S`, and for every pair X, Y of subtrees associated with this node, we have δ_XY = s_XY (and similarly for the other pairs X, Z and Y, Z). Proof: according to prop. 1, applied to the bipartitions at u, we have (*).

Conservation Property 2 Property is established.

Formula (2) is correct. Using the definition of s_XY we can rewrite the conservation equalities; using prop. 2 we can write them for each of the pairs of subtrees X, Y, Z around the node u; by solving these equations we obtain expression (4) for the edge length between X and u.

Formula (2) is correct. Let us consider the agglomerative procedure described in the algorithm: at the p-th step it remains to resolve r = n − p + 1 nodes, some of which are subtrees. After choosing x and y, Z can be viewed as the join of r − 2 subtrees, one of which contains the root.

Formula (2) is correct. Thus, we may rewrite expression (4) as (5), expanding the averages over the r − 2 subtrees that compose Z (*).

Formula (2) is correct. We now prove by induction the following two statements, for each iteration of the algorithm: (a) the conservation equalities hold for every pair I, J of resolved subtrees; (b) formula (2) and equation (5) are equal. Important: at each step, the evaluation of (b) is based on the result of (a) from the previous step, and the evaluation of (a) is based on the result of (b) at the current step.

Formula (2) is correct. Base: at the first step the "weight" of each node is 1; thus, for each node i, f_i = 0, so (a) holds, and (b) is also achieved in the first iteration. Step: assume that (a) and (b) are maintained at step p; we'll show that they are also maintained at step p + 1.

Formula (2) is correct. (a) We must check that the hypothesis is maintained for the new node u; carrying out the computation, (a) is maintained.

Formula (2) is correct. (b) We prove the correctness of (b) for step p + 2; thus (b) is correct.

UNJ algorithm - implications. The time complexity of UNJ is O(n³). The property we proved may be exploited within an O(n²) algorithm that performs the LS estimation of edge lengths for any binary tree of fixed structure, i.e., finding the tree topology in O(n³) and then the LS edge estimates in O(n²).

UNJ algorithm - implications. This new version derives from the original version of Saitou & Nei (1987) (the weighted version), and also from Vach (1989) (concerning length estimation). Simulations show that UNJ surpasses NJ when the data closely follow the chosen model: for certain tree structures we obtain up to a 50% error reduction in terms of the ability to recover the true tree structure.