Published by Eileen Douglas. Modified over 8 years ago.

Université d’Ottawa / University of Ottawa 2003 Bio 8102A Applied Multivariate Biostatistics

L4.1 Lecture 4: Multivariate distance measures
• The concept of distance
• Multivariate distance metrics & matrices
• Distances between observations, samples and populations
• Distances based on proportions or presence/absence data
• Comparison of multivariate distance matrices

L4.2 Distance and distance metrics
• The characteristics of different objects define their locations (positions) in some appropriate space.
• The geometry of the space endows it with a number of possible distance measures (metrics)...
• …which can be used to calculate the distance d_ij between two objects i and j in the space.
[Figure: a uniform tree. Objects: terminal branches {A, B, C, D}. Variables: branch points {B_1, B_2, B_3}. d_ij: number of intervening branch points between i and j.]

L4.3 Distance matrices
• Once a distance measure (metric) is defined, we can calculate the distance between objects. These objects could be individual observations, groups of observations (samples) or populations of observations.
• For N objects, we then have a symmetric N × N distance matrix D whose element d_ij is the distance between objects i and j.
[Figure: objects 1, 2, 3.]
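
The construction of D can be sketched in a few lines of Python. This is only an illustration: the function name `distance_matrix`, the choice of Euclidean distance as the default metric, and the three data points are assumptions made here, not part of the lecture.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def distance_matrix(objects, metric=dist):
    """Return the symmetric N x N matrix of pairwise distances."""
    n = len(objects)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # d_ij = d_ji, so each pair is computed once and mirrored.
            D[i][j] = D[j][i] = metric(objects[i], objects[j])
    return D


# Hypothetical data: three objects measured on two variables.
objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(objects)
for row in D:
    print(row)
```

Any function of two objects that returns a number can be passed as `metric`, so the same loop serves for the other distance measures in this lecture.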

L4.4 Euclidean distance
• A possible distance measure for spaces equipped with a Euclidean metric.
• For two dimensions (variables), this is just the hypotenuse of a right-angle triangle…
• …while for p dimensions, it is the hypotenuse of a hyper-triangle.
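
A minimal sketch of the Euclidean distance for any number of variables follows. The function name `euclidean` and the example points are hypothetical; the formula is the usual square root of the summed squared differences.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two observations on the same p variables."""
    if len(x) != len(y):
        raise ValueError("both observations need the same number of variables")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two variables: the hypotenuse of a 3-4-5 right-angle triangle.
d_2 = euclidean((1.0, 2.0), (4.0, 6.0))

# Four variables: the same formula, summed over p = 4 squared differences.
d_4 = euclidean((1.0, 0.0, 2.0, 0.0), (0.0, 2.0, 0.0, 4.0))

print(d_2, d_4)
```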

L4.5 Multivariate distances between populations: the Euclidean distance
• Calculates the Euclidean distance between two “points” defined by the multivariate means of two samples based on p variables.
• Does not take into account differences among populations in within-population variability, nor correlations among variables.
[Figure: Sample 1 and Sample 2 plotted on axes X_1 and X_2.]

L4.6 Multivariate distances between populations: the Penrose distance
• Distances are computed based on means, variances and covariances for each of g populations (samples) based on p variables.
• Takes into account within-population variation by weighting each variable by the inverse of its variance, but does not account for correlations among variables.
[Equation labels: means of variable k in samples i and j; variance of variable k, assumed the same for all samples.]
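
The slide's equation did not survive in this transcript. The sketch below assumes the form in which the Penrose distance is usually written, P_ij = Σ_k (m_ki − m_kj)² / (p V_k), where m_ki is the mean of variable k in sample i and V_k is the common variance of variable k. Estimating V_k by pooling the within-sample variances, the helper names and the data are all assumptions made for illustration.

```python
from statistics import mean, variance


def column(sample, k):
    """Values of variable k across the observations of one sample."""
    return [row[k] for row in sample]


def pooled_variances(samples):
    """Pooled within-sample variance of each variable (assumed common to all samples)."""
    p = len(samples[0][0])
    total_df = sum(len(s) - 1 for s in samples)
    return [
        sum((len(s) - 1) * variance(column(s, k)) for s in samples) / total_df
        for k in range(p)
    ]


def penrose(sample_i, sample_j, pooled_var):
    """Penrose distance: squared mean differences weighted by 1 / (p * V_k)."""
    p = len(pooled_var)
    return sum(
        (mean(column(sample_i, k)) - mean(column(sample_j, k))) ** 2
        / (p * pooled_var[k])
        for k in range(p)
    )


# Hypothetical data: two samples, three observations each, two variables.
sample_1 = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0)]
sample_2 = [(4.0, 20.0), (5.0, 22.0), (6.0, 24.0)]
V = pooled_variances([sample_1, sample_2])
P_12 = penrose(sample_1, sample_2, V)
print(V, P_12)
```

Because each variable is divided by its own variance, rescaling a variable (for example from metres to millimetres) leaves the distance unchanged, which is not true of the plain Euclidean distance.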

L4.7 Multivariate distances between populations: the Mahalanobis distance
• Distances are computed based on means, variances and covariances for each of g samples (populations) based on p variables.
• Mahalanobis distance “weights” the contribution of each pair of variables by the corresponding element of the inverse of the covariance matrix.
[Equation labels: means of variables r and s in samples i and j; element (r, s) of the inverse of the covariance matrix, with the covariance matrix assumed equal for all samples.]
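
As a rough illustration, the squared Mahalanobis distance can be written as the double sum D²_ij = Σ_r Σ_s (m_ri − m_rj) v_rs (m_si − m_sj), where v_rs is element (r, s) of the inverse covariance matrix. The sketch below is restricted to p = 2 variables so that the inverse can be written explicitly; the function names, the covariance matrix and the means are hypothetical.

```python
def inverse_2x2(C):
    """Inverse of a 2 x 2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = C
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(mean_i, mean_j, cov):
    """Squared Mahalanobis distance between two mean vectors (p = 2 only)."""
    V = inverse_2x2(cov)
    diff = [mi - mj for mi, mj in zip(mean_i, mean_j)]
    # Double sum over every pair of variables (r, s).
    return sum(diff[r] * V[r][s] * diff[s] for r in range(2) for s in range(2))


# Hypothetical common covariance matrix with positively correlated variables.
cov = [[2.0, 1.0], [1.0, 2.0]]
d_sq = mahalanobis_sq((0.0, 0.0), (3.0, 0.0), cov)

# With an identity covariance matrix the result is the squared Euclidean distance.
d_sq_identity = mahalanobis_sq((0.0, 0.0), (3.0, 4.0), [[1.0, 0.0], [0.0, 1.0]])
print(d_sq, d_sq_identity)
```

For more than two variables the same double sum applies, but the inverse would normally come from a linear algebra library rather than a hand-written formula.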

L4.8 Scale
• Since distance measures are generally based on the sum over variables, variables measured on large scales will contribute disproportionately to the measure.
• So if variables are measured on different scales, use standardized values before computing distances.
• For g samples (groups, populations), variables should be standardized to zero mean and unit variance over all g groups.
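
One way to read the last point is to pool the observations from all g groups, compute each variable's overall mean and standard deviation, and rescale every observation with those values. The sketch below takes that reading as an assumption; the function name `standardize` and the data are hypothetical.

```python
from statistics import mean, stdev


def standardize(groups):
    """Z-score each variable using its mean and SD over all groups combined."""
    pooled = [row for group in groups for row in group]
    p = len(pooled[0])
    means = [mean([row[k] for row in pooled]) for k in range(p)]
    sds = [stdev([row[k] for row in pooled]) for k in range(p)]
    return [
        [tuple((row[k] - means[k]) / sds[k] for k in range(p)) for row in group]
        for group in groups
    ]


# Hypothetical data: variable 1 is on a scale of thousands, variable 2 near one.
groups = [
    [(1200.0, 0.8), (1500.0, 1.1), (1350.0, 0.9)],
    [(2100.0, 1.0), (1900.0, 1.4), (2300.0, 1.2)],
]
z_groups = standardize(groups)
for group in z_groups:
    print(group)
```

After this step both variables have the same spread over the pooled data, so neither dominates a sum of squared differences simply because of its units.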

L4.9 Distances between observations and objects
• We can also calculate a distance between an individual observation and some object, where the object may be another observation or a group mean.
• The distance between an observation and a group can be used to define the probability that the observation belongs to the group.
[Figure: an observation and the group means of Group 1 and Group 2, plotted on axes X_1 and X_2.]
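
The slide does not say how the probability is obtained. One common approach, assumed here, uses the squared Mahalanobis distance D² from the observation to the group mean: if the group is bivariate normal with known mean and covariance, D² follows a chi-square distribution with 2 degrees of freedom, whose upper tail is exp(−D²/2). The result is the chance that a member of the group lies at least this far from its mean. Function names and data below are hypothetical.

```python
from math import exp


def mahalanobis_sq_2d(x, centre, cov):
    """Squared Mahalanobis distance from observation x to a group centre (p = 2)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - centre[0], x[1] - centre[1]
    # diff' * inverse(cov) * diff, written out for the 2 x 2 case.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def tail_probability_2d(d_sq):
    """P(chi-square with 2 df >= d_sq): assumes bivariate normality."""
    return exp(-d_sq / 2.0)


# Hypothetical observation and two groups with identity covariance matrices.
observation = (2.0, 1.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
d_sq_1 = mahalanobis_sq_2d(observation, (0.0, 0.0), identity)  # group 1 mean
d_sq_2 = mahalanobis_sq_2d(observation, (3.0, 1.0), identity)  # group 2 mean
p_1 = tail_probability_2d(d_sq_1)
p_2 = tail_probability_2d(d_sq_2)
print(d_sq_1, p_1, d_sq_2, p_2)
```

In this made-up case the observation is closer to the mean of group 2, so it is the more plausible member of that group.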

L4.10 Distances between samples: inferences to populations
• Since population means, variances, etc. are usually not known, we can compute distances based only on sample estimators.
• But bear in mind the problems associated with inferring population parameters from estimators.

L4.11 Distances between groups based on proportions
• The p variables used to measure distance are proportions, i.e. p_ki is the proportion of observations in sample k that are in class i.
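
The formulas on this slide are missing from the transcript. As an illustration only, the sketch below shows two measures often used for proportion data: half the sum of absolute differences, and one minus the cosine of the angle between the two proportion vectors. Whether these are the measures the slide showed is an assumption, and the function names and data are hypothetical.

```python
from math import sqrt


def proportion_distance_abs(p, q):
    """Half the summed absolute differences: 0 for identical, 1 for disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def proportion_distance_cos(p, q):
    """One minus the cosine of the angle between the two proportion vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return 1.0 - dot / sqrt(sum(a * a for a in p) * sum(b * b for b in q))


# Hypothetical proportions of observations in three classes for two samples.
sample_1 = (0.5, 0.5, 0.0)
sample_2 = (0.0, 0.5, 0.5)
d_abs = proportion_distance_abs(sample_1, sample_2)
d_cos = proportion_distance_cos(sample_1, sample_2)
print(d_abs, d_cos)
```

Both measures run from 0 (identical proportions) to 1 (no classes in common), which makes them easy to compare across data sets.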

L4.12 Distances based on binary data
• Similarity between two objects based on a list of binary attributes, e.g. comparison of p species based on presence/absence at n different sites.

L4.13 Binary response dichotomy coefficients
Coefficients: positive matching, Jaccard’s, simple matching, Anderberg’s, Tanimoto’s.

L4.14 More binary response similarity measures
Measures: Ochiai index, Dice index.
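
The coefficient formulas are not in the transcript, so the sketch below uses the standard definitions of four of the named measures, written in terms of the usual 2 × 2 counts: a (present in both), b (first only), c (second only) and d (absent from both). The presence/absence vectors are made up for illustration and are not the data from the lecture's example. The other named coefficients are left out because their definitions vary between sources.

```python
from math import sqrt


def match_counts(x, y):
    """Counts a, b, c, d from two equal-length presence/absence (1/0) vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)


def jaccard(a, b, c, d):
    return a / (a + b + c)  # joint absences (d) are ignored


def ochiai(a, b, c, d):
    return a / sqrt((a + b) * (a + c))


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)


# Hypothetical presence/absence of two species at 10 sites.
species_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
species_2 = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
counts = match_counts(species_1, species_2)
print(counts)
print(simple_matching(*counts), jaccard(*counts), ochiai(*counts), dice(*counts))
```

Simple matching counts joint absences as agreement, whereas the Jaccard, Ochiai and Dice measures ignore them, which matters when most species are absent from most sites.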

L4.15 Example: distribution of 3 species of plants at 10 sites
[Table: Ochiai index and Dice index.]

L4.16 Comparing two distance matrices
• Used in situations where two different distance matrices (D_1, D_2) have been computed for the same set of objects using two different sets of variables.
• Question: what is the correlation between D_1 and D_2?
[Figure: two sets of variables, {X_1, ..., X_p} and {X_p+1, ..., X_M}.]

L4.17 The Mantel test
• Calculate the observed Z value (Z*) and compare it with the distribution of Z obtained under random ordering of objects in one or the other matrix (say, D_2).
• If r = 0, then D_2 will be just like a randomized matrix, so the observed Z will be a typical randomized value.
[Figure: frequency histogram of Z from N randomizations, with the observed Z marked.]

L4.18 The Mantel test (cont’d)
• From the distribution of randomized Z, we can compute a test statistic g which is approximately distributed as a standard normal variate, or simply use the randomized distribution itself to calculate p.
[Figure: frequency histogram of randomized Z.]
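
The randomization procedure can be sketched as follows. Several details are assumptions made here rather than taken from the slides: Z is taken to be the sum of products of corresponding off-diagonal elements of the two matrices, the test is one-tailed (large Z indicates positive association), g is standardized with the mean and standard deviation of the randomized values, and the matrices are invented for illustration.

```python
import random
from statistics import mean, stdev


def mantel_z(D1, D2):
    """Sum of products of corresponding elements above the diagonal."""
    n = len(D1)
    return sum(D1[i][j] * D2[i][j] for i in range(n) for j in range(i + 1, n))


def permute(D, order):
    """Reorder the objects of D: rows and columns are permuted together."""
    return [[D[a][b] for b in order] for a in order]


def mantel_test(D1, D2, n_perm=999, seed=0):
    """Return the observed Z, a one-tailed permutation p-value, and g."""
    rng = random.Random(seed)
    z_obs = mantel_z(D1, D2)
    order = list(range(len(D1)))
    z_rand = []
    for _ in range(n_perm):
        rng.shuffle(order)  # random ordering of the objects in D2
        z_rand.append(mantel_z(D1, permute(D2, order)))
    # The observed value counts as one member of the reference distribution.
    p_value = (1 + sum(z >= z_obs for z in z_rand)) / (n_perm + 1)
    sd = stdev(z_rand)
    g = (z_obs - mean(z_rand)) / sd if sd > 0 else float("nan")
    return z_obs, p_value, g


# Hypothetical matrices: five objects on a line, and D2 exactly twice D1.
points = [0, 1, 2, 3, 4]
D1 = [[abs(a - b) for b in points] for a in points]
D2 = [[2 * d for d in row] for row in D1]
z_obs, p_value, g = mantel_test(D1, D2)
print(z_obs, p_value, g)
```

Permuting rows and columns together keeps D_2 a valid distance matrix for the same objects under a new labelling, which is what "random ordering of objects" requires; shuffling individual cells would not.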