
1 Université d’Ottawa / University of Ottawa 2003 Bio 8102A Applied Multivariate Biostatistics L4.1 Lecture 4: Multivariate distance measures
• The concept of distance
• Multivariate distance metrics & matrices
• Distances between observations, samples and populations
• Distances based on proportions or presence/absence data
• Comparison of multivariate distance matrices

2 L4.2 Distance and distance metrics
• The characteristics of different objects define their locations (positions) in some appropriate space.
• The geometry of the space endows it with a number of possible distance measures (metrics)…
• …which can be used to calculate the distance d_ij between two objects in the space.
Example (figure): objects are the terminal branches of a uniform tree, {A, B, C, D}; variables are the branch points {B1, B2, B3}; d_ij is the number of intervening branch points between i and j.

3 L4.3 Distance matrices
• Once a distance measure (metric) is defined, we can calculate the distance between objects. These objects could be individual observations, groups of observations (samples) or populations of observations.
• For N objects, we then have a symmetric distance matrix D whose elements are the distances between objects i and j.
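As a rough illustration of the slide above, the sketch below builds the symmetric N x N matrix D for any metric passed in. The function names, the absolute-difference metric, and the data are hypothetical choices made for this example, not part of the lecture.

```python
def distance_matrix(objects, metric):
    """Return the N x N symmetric matrix D with D[i][j] = metric(objects[i], objects[j])."""
    n = len(objects)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Symmetry: compute each pair once and mirror it.
            D[i][j] = D[j][i] = float(metric(objects[i], objects[j]))
    return D


def abs_difference(a, b):
    """A simple one-variable metric (assumed here only for illustration)."""
    return abs(a - b)


# Hypothetical data: three objects measured on a single variable.
objects = [1.0, 4.0, 6.0]
D = distance_matrix(objects, abs_difference)
for row in D:
    print(row)
```

Only the N(N-1)/2 pairs above the diagonal are computed; the diagonal stays at zero because each object is at distance zero from itself.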

4 L4.4 Euclidean distance
• A possible distance measure for spaces equipped with a Euclidean metric
• For two dimensions (variables), this is just the hypotenuse of a right-angle triangle…
• …while for p dimensions, it is the hypotenuse of a hyper-triangle.
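The slide's equation did not survive in the transcript. The standard Euclidean distance between observations i and j over p variables is d_ij = sqrt(sum over k of (x_ik - x_jk)^2), and a minimal sketch of it follows. The function name and the example values are illustrative assumptions.

```python
import math


def euclidean_distance(x_i, x_j):
    """d_ij = sqrt(sum_k (x_ik - x_jk)^2) over p variables."""
    if len(x_i) != len(x_j):
        raise ValueError("both observations need the same p variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


# p = 2: the hypotenuse of a right-angle triangle with sides 3 and 4.
d_2 = euclidean_distance((1.0, 2.0), (4.0, 6.0))

# p = 3: the same formula with one more squared difference.
d_3 = euclidean_distance((1.0, 2.0, 3.0), (3.0, 4.0, 4.0))

print(d_2, d_3)
```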

5 L4.5 Multivariate distances between populations: the Euclidean distance
• Calculates the Euclidean distance between two “points” defined by the multivariate means of two samples based on p variables.
• Does not take into account differences among populations in within-population variability nor correlations among variables.
[Figure: Sample 1 and Sample 2 plotted on axes X1 and X2]
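A small sketch of this idea, under the assumption that each sample is a list of observations on the same p variables. The helper names and the data are hypothetical. The second comparison shows the limitation noted above: a much more variable sample with the same mean gives exactly the same distance.

```python
import math


def mean_vector(sample):
    """Mean of each of the p variables over the observations in a sample."""
    n = len(sample)
    p = len(sample[0])
    return [sum(row[k] for row in sample) / n for k in range(p)]


def euclidean_between_means(sample_1, sample_2):
    """Euclidean distance between the multivariate means of two samples."""
    m1, m2 = mean_vector(sample_1), mean_vector(sample_2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))


# Hypothetical samples on p = 2 variables.
sample_1 = [(1.0, 2.0), (3.0, 4.0)]         # mean (2, 3)
sample_2 = [(4.0, 6.0), (6.0, 8.0)]         # mean (5, 7), small spread
sample_2_wide = [(0.0, 2.0), (10.0, 12.0)]  # mean (5, 7), large spread

d_tight = euclidean_between_means(sample_1, sample_2)
d_wide = euclidean_between_means(sample_1, sample_2_wide)
print(d_tight, d_wide)
```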

6 L4.6 Multivariate distances between populations: the Penrose distance
• Distances are computed based on means and variances for each of g populations (samples) based on p variables.
• Takes into account within-population variation by weighting each variable by the inverse of its variance, but does not account for correlations among variables.
[Equation annotations: means of variable k in samples i and j; variance of variable k, assumed the same for all samples]
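The slide's formula is missing from the transcript. The sketch below assumes the commonly cited form of the Penrose distance, P_ij = sum over k of (mean_ki - mean_kj)^2 / (p * V_k), which matches the surviving annotations; the division by p is part of that assumed form. The function name and the numbers are hypothetical.

```python
def penrose_distance(mean_i, mean_j, variances):
    """Penrose distance between two samples (assumed form, see lead-in).

    mean_i, mean_j: means of the p variables in samples i and j.
    variances: variance V_k of each variable, assumed equal across samples.
    """
    p = len(variances)
    if not (len(mean_i) == len(mean_j) == p):
        raise ValueError("means and variances must all have length p")
    return sum((a - b) ** 2 / (p * v) for a, b, v in zip(mean_i, mean_j, variances))


# Hypothetical values: the second variable differs by 10 units but is also
# far more variable, so both variables contribute equally.
P = penrose_distance((2.0, 10.0), (4.0, 20.0), (1.0, 25.0))
print(P)
```

Each squared difference is divided by that variable's variance, so a variable measured on a large, noisy scale no longer dominates the sum.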

7 L4.7 Multivariate distances between populations: the Mahalanobis distance
• Distances are computed based on means, variances and covariances for each of g samples (populations) based on p variables.
• Mahalanobis distance “weights” the contribution of each pair of variables by the corresponding element of the inverse of the covariance matrix.
[Equation annotations: means of variables r and s in samples i and j; element (r, s) of the inverse covariance matrix, assumed equal for all samples]
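As the equation itself is lost, here is a sketch of the standard squared Mahalanobis distance, D^2_ij = (m_i - m_j)' V^-1 (m_i - m_j), written in plain Python. The linear solver, the function names and the example covariance matrix are assumptions made for illustration. Solving V w = (m_i - m_j) avoids forming the inverse explicitly.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def mahalanobis_sq(mean_i, mean_j, cov):
    """Squared Mahalanobis distance between two mean vectors."""
    diff = [a - b for a, b in zip(mean_i, mean_j)]
    w = solve(cov, diff)  # w = V^-1 (m_i - m_j)
    return sum(d * wi for d, wi in zip(diff, w))


# Hypothetical covariance matrix: two variables with correlation 0.5.
cov = [[1.0, 0.5], [0.5, 1.0]]
d_along = mahalanobis_sq((1.0, 1.0), (0.0, 0.0), cov)    # along the correlation
d_across = mahalanobis_sq((1.0, -1.0), (0.0, 0.0), cov)  # against it
print(d_along, d_across)
```

Both mean differences have the same Euclidean length, but the one that runs against the positive correlation comes out as the larger Mahalanobis distance. With an identity covariance matrix the result reduces to the squared Euclidean distance.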

8 L4.8 Scale
• Since distance measures are generally based on the sum over variables, variables measured on large scales will contribute disproportionately to the measure.
• So if variables are measured on different scales, use standardized values before computing distances.
• For g samples (groups, populations), variables should be standardized to zero mean and unit variance over all g groups.
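A minimal sketch of the standardization step, assuming the data are rows of observations and that the groups are pooled before the mean and standard deviation of each variable are computed. The function name and the data are hypothetical; the sample (n - 1) standard deviation is used here as one reasonable choice.

```python
import statistics


def standardize_columns(rows):
    """Standardize each variable (column) to zero mean and unit variance,
    using all rows, i.e. with every group pooled together."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # sample SD (n - 1)
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]


# Hypothetical data: the second variable is on a scale thousands of times
# larger than the first and would dominate any raw distance.
group_1 = [(1.0, 1000.0), (2.0, 3000.0)]
group_2 = [(3.0, 5000.0), (4.0, 7000.0)]

z = standardize_columns(group_1 + group_2)  # standardize over all g groups
for row in z:
    print(row)
```

After standardization both variables carry equal weight in a sum of squared differences, regardless of their original units.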

9 L4.9 Distances between observations and objects
• We can also calculate a distance between an individual observation and some object, where the object may be another observation or a group mean.
• The distance between an observation and a group can be used to define the probability that the observation belongs to the group.
[Figure: Group 1 and Group 2 on axes X1 and X2, showing a group mean and an observation]
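One common way to turn such a distance into a probability, sketched here as an assumption rather than as the lecture's own method: if a group is multivariate normal with known mean and covariance, the squared Mahalanobis distance of one of its members from the group mean follows a chi-square distribution with p degrees of freedom. For p = 2 the upper tail has the closed form exp(-D^2 / 2), which keeps the example free of external libraries. All names and numbers are hypothetical.

```python
import math


def mahalanobis_sq_2d(x, mean, cov):
    """Squared Mahalanobis distance of observation x from a group mean (p = 2)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def tail_probability_2d(d_sq):
    """P(chi-square with 2 df >= d_sq): the chance that a true group member
    lies at least this far from the group mean."""
    return math.exp(-d_sq / 2.0)


# Hypothetical groups sharing an identity covariance matrix.
cov = [[1.0, 0.0], [0.0, 1.0]]
group_1_mean, group_2_mean = (0.0, 0.0), (4.0, 0.0)
observation = (1.0, 0.0)

p_1 = tail_probability_2d(mahalanobis_sq_2d(observation, group_1_mean, cov))
p_2 = tail_probability_2d(mahalanobis_sq_2d(observation, group_2_mean, cov))
print(p_1, p_2)
```

The observation is far more plausible as a member of Group 1. These are tail probabilities for each group taken separately, not posterior membership probabilities, which would also need prior group frequencies.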

10 L4.10 Distances between samples: inferences to populations
• Since population means, variances, etc. are usually not known, we can compute distances based only on sample estimators.
• But bear in mind the problems associated with inferring population parameters from estimators.

11 L4.11 Distances between groups based on proportions
• The p variables used to measure distance are proportions, i.e. p_ki is the proportion of observations in sample k that are in class i.
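The distance formulas on this slide did not survive, so the sketch below is not a reconstruction of them. It shows one simple index often used for proportion data, half the sum of absolute differences between the two proportion vectors, which ranges from 0 (identical) to 1 (no class in common). The function name and the proportions are hypothetical.

```python
def proportion_distance(p_k, p_l):
    """Half the sum of |p_ki - p_li| over classes i (an assumed index)."""
    if len(p_k) != len(p_l):
        raise ValueError("both samples need proportions for the same classes")
    for props in (p_k, p_l):
        if abs(sum(props) - 1.0) > 1e-9:
            raise ValueError("proportions within a sample must sum to 1")
    return 0.5 * sum(abs(a - b) for a, b in zip(p_k, p_l))


# Hypothetical proportions of observations in three classes.
sample_k = (0.5, 0.5, 0.0)
sample_l = (0.0, 0.5, 0.5)

d_kl = proportion_distance(sample_k, sample_l)
print(d_kl)
```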

12 L4.12 Distances based on binary data
• Similarity between two objects based on a list of binary attributes, e.g. comparison of p species based on presence/absence at n different sites

13 L4.13 Binary response dichotomy coefficients
• Positive matching
• Jaccard’s
• Simple matching
• Anderberg’s
• Tanimoto’s

14 L4.14 More binary response similarity measures
• Ochiai index
• Dice index
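The formulas for these coefficients are missing from the transcript. The sketch below uses the standard definitions of four of them in terms of the usual 2 x 2 counts: a (present in both), b (first only), c (second only) and d (absent from both). The positive matching, Anderberg and Tanimoto coefficients are left out because their exact form on the slide cannot be confirmed. The presence/absence vectors are hypothetical, not the lecture's data.

```python
import math


def match_counts(x, y):
    """Return (a, b, c, d) for two equal-length presence/absence vectors."""
    a = sum(1 for u, v in zip(x, y) if u and v)          # present in both
    b = sum(1 for u, v in zip(x, y) if u and not v)      # first only
    c = sum(1 for u, v in zip(x, y) if not u and v)      # second only
    d = sum(1 for u, v in zip(x, y) if not u and not v)  # absent from both
    return a, b, c, d


def jaccard(a, b, c, d):
    return a / (a + b + c)            # ignores joint absences


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)  # counts joint absences as matches


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)    # double weight on joint presences


def ochiai(a, b, c, d):
    return a / math.sqrt((a + b) * (a + c))


# Hypothetical presence (1) / absence (0) of two species at 10 sites.
species_1 = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
species_2 = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]

counts = match_counts(species_1, species_2)
for index in (jaccard, simple_matching, dice, ochiai):
    print(index.__name__, round(index(*counts), 4))
```

The functions do not guard against empty denominators, which occur when a species is absent from every site. The main practical difference among the indices is whether joint absences (d) count as evidence of similarity.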

15 L4.15 Example: distribution of 3 species of plants at 10 sites
[Table: Ochiai index and Dice index for the example; values not preserved in the transcript]

16 L4.16 Comparing two distance matrices
• Used in situations where two different distance matrices (D_1, D_2) have been computed for the same set of objects using two different sets of variables.
• Question: what is the correlation between D_1 and D_2?
[Figure: the two sets of variables, {X_1, …, X_p} and {X_{p+1}, …, X_M}]
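A sketch of the correlation in question, assuming it is the ordinary Pearson correlation between corresponding off-diagonal elements of the two matrices. The function names and the small matrices are hypothetical. Because each object contributes to several distances, the elements are not independent, so the usual significance test for a correlation coefficient does not apply to this value.

```python
import math


def upper_triangle(D):
    """Elements above the diagonal, in a fixed order shared by both matrices."""
    n = len(D)
    return [D[i][j] for i in range(n) for j in range(i + 1, n)]


def matrix_correlation(D1, D2):
    """Pearson correlation between corresponding distances in D1 and D2."""
    x, y = upper_triangle(D1), upper_triangle(D2)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical distance matrices for the same three objects.
D1 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
D2 = [[0, 3, 6], [3, 0, 3], [6, 3, 0]]  # same pattern as D1, rescaled
D3 = [[0, 2, 1], [2, 0, 2], [1, 2, 0]]  # opposite pattern

r_12 = matrix_correlation(D1, D2)
r_13 = matrix_correlation(D1, D3)
print(r_12, r_13)
```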

17 L4.17 The Mantel test
• Calculate observed Z value (Z*) and compare with distribution of Z obtained under random ordering of objects in one or the other matrix (say, D_2)
• If r = 0, then D_2 will be just like a randomized matrix, so Z will be a typical randomized value.
[Figure: frequency histogram of Z from N randomizations, with the observed Z marked]

18 L4.18 The Mantel test (cont’d)
• From the distribution of randomized Z, we can compute a test statistic g which is distributed as a standard normal variate, or simply use the randomized distribution itself to calculate p.
[Figure: frequency distribution of randomized Z]
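The procedure on the last two slides can be sketched as follows. The sketch takes Z to be the sum of products of corresponding distances, the usual Mantel statistic, since the slide's own formula did not survive. Here g is computed from the mean and standard deviation of the randomized Z values, which is an assumption about how the lecture defines it. The function names, the one-sided p-value and the data are hypothetical choices.

```python
import math
import random


def mantel_z(D1, D2, order=None):
    """Z = sum over pairs i < j of D1[i][j] * D2[order[i]][order[j]]."""
    n = len(D1)
    order = list(range(n)) if order is None else order
    return sum(D1[i][j] * D2[order[i]][order[j]]
               for i in range(n) for j in range(i + 1, n))


def mantel_test(D1, D2, n_perm=999, seed=0):
    """Return (observed Z, g, one-sided permutation p-value)."""
    rng = random.Random(seed)
    z_obs = mantel_z(D1, D2)
    order = list(range(len(D1)))
    z_rand = []
    for _ in range(n_perm):
        rng.shuffle(order)  # random ordering of the objects in D2
        z_rand.append(mantel_z(D1, D2, order))
    # The observed arrangement is counted as one of the possible orderings.
    p = (1 + sum(z >= z_obs for z in z_rand)) / (n_perm + 1)
    mean = sum(z_rand) / n_perm
    sd = math.sqrt(sum((z - mean) ** 2 for z in z_rand) / (n_perm - 1))
    g = (z_obs - mean) / sd
    return z_obs, g, p


def distances_1d(values):
    """Absolute-difference distance matrix for one variable."""
    return [[abs(a - b) for b in values] for a in values]


# Hypothetical data: six objects measured on two perfectly related variables.
x = [0, 1, 2, 3, 4, 5]
y = [0, 2, 4, 6, 8, 10]
D1, D2 = distances_1d(x), distances_1d(y)

z_obs, g, p = mantel_test(D1, D2, n_perm=999, seed=1)
print(z_obs, g, p)
```

Rows and columns of D2 are permuted together, which is what keeps the randomized matrix a valid distance matrix for the same objects. Shuffling individual elements independently would not be a Mantel test.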

