Université d’Ottawa / University of Ottawa 2003 Bio 8102A Applied Multivariate Biostatistics. Lecture 4: Multivariate distance measures

Presentation transcript:

L4.1 Lecture 4: Multivariate distance measures
• The concept of distance
• Multivariate distance metrics & matrices
• Distances between observations, samples and populations
• Distances based on proportions or presence/absence data
• Comparison of multivariate distance matrices

L4.2 Distance and distance metrics
• The characteristics of different objects define their locations (positions) in some appropriate space.
• The geometry of the space endows it with a number of possible distance measures (metrics)...
• ...which can be used to calculate the distance d_ij between two objects i and j in the space.
Example (figure): objects are the terminal branches of a uniform tree {A, B, C, D}; variables are the branch points {B1, B2, B3}; d_ij is the number of intervening branch points between i and j.

L4.3 Distance matrices
• Once a distance measure (metric) is defined, we can calculate the distance between objects. These objects could be individual observations, groups of observations (samples) or populations of observations.
• For N objects, we then have a symmetric distance matrix D whose elements d_ij are the distances between objects i and j.
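A minimal sketch of building such a matrix, assuming the objects are stored as tuples of variable values and that the metric is passed in as a function. The helper name `distance_matrix` and the sample points are hypothetical, and the Euclidean metric (`math.dist`) is used only as a stand-in for whichever metric has been chosen.

```python
import math

def distance_matrix(objects, metric):
    """Return the N x N matrix D with D[i][j] = metric(objects[i], objects[j])."""
    n = len(objects)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(objects[i], objects[j])
            D[i][j] = d
            D[j][i] = d  # symmetry: d_ij = d_ji
    return D

# Hypothetical objects, each described by two variables.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(points, math.dist)
```

Only the N(N-1)/2 pairs above the diagonal are computed; the diagonal stays at zero because every object is at distance zero from itself.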

L4.4 Euclidean distance
• A possible distance measure for spaces equipped with a Euclidean metric
• For two dimensions (variables), this is just the hypotenuse of a right-angle triangle...
• ...while for p dimensions, it is the hypotenuse of a hyper-triangle.
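The slide's formula did not survive in the transcript. The standard definition is d_ij = sqrt(sum over k of (x_ik - x_jk)^2), and the sketch below implements that definition; the function name `euclidean` and the example values are illustrative assumptions.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two observations with p variables each."""
    if len(x) != len(y):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two variables: the hypotenuse of a 3-4-5 right-angle triangle.
d_two = euclidean((0.0, 0.0), (3.0, 4.0))

# Three variables: differences of 2, 3 and 6 give sqrt(4 + 9 + 36).
d_three = euclidean((1.0, 2.0, 3.0), (3.0, 5.0, 9.0))
```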

L4.5 Multivariate distances between populations: the Euclidean distance
• Calculates the Euclidean distance between two “points” defined by the multivariate means of two samples based on p variables.
• Does not take into account differences among populations in within-population variability nor correlations among variables.
[Figure: Sample 1 and Sample 2 plotted against variables X1 and X2]
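As a rough illustration, the distance between two samples can be sketched as the Euclidean distance between their mean vectors. The helper names and the two small samples below are hypothetical and are not data from the lecture.

```python
import math

def mean_vector(sample):
    """Mean of each variable for a sample given as a list of observations."""
    n = len(sample)
    return [sum(column) / n for column in zip(*sample)]

def euclidean_between_samples(sample_i, sample_j):
    """Euclidean distance between the multivariate means of two samples."""
    mean_i, mean_j = mean_vector(sample_i), mean_vector(sample_j)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_i, mean_j)))

sample_1 = [(1.0, 2.0), (3.0, 4.0)]              # mean vector (2, 3)
sample_2 = [(4.0, 6.0), (5.0, 7.0), (6.0, 8.0)]  # mean vector (5, 7)
d_means = euclidean_between_samples(sample_1, sample_2)
```

The spread of observations within each sample never enters the calculation, which is the limitation noted in the second bullet.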

L4.6 Multivariate distances between populations: the Penrose distance
• Distances are computed based on means, variances and covariances for each of g populations (samples) based on p variables.
• Takes into account within-population variation by weighting each variable by the inverse of its variance, but does not account for correlations among variables.
Equation annotations: means of variable k in samples i and j; variance of variable k, assumed the same for all samples.
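The equation itself is missing from the transcript. A commonly given form of the Penrose distance, consistent with the annotations above, is P_ij = sum over k of (mean_ki - mean_kj)^2 / (p * V_k). The sketch below assumes that form, including the division by p; the function name and numbers are hypothetical.

```python
def penrose_distance(mean_i, mean_j, variances):
    """Penrose distance between two samples.

    mean_i, mean_j: means of the p variables in samples i and j.
    variances: variance of each variable, assumed common to all samples.
    """
    p = len(variances)
    return sum((a - b) ** 2 / (p * v)
               for a, b, v in zip(mean_i, mean_j, variances))

# Hypothetical means and common variances for p = 2 variables.
mean_1 = [10.0, 2.0]
mean_2 = [14.0, 3.0]
common_var = [4.0, 0.25]

# Variable 1: 16 / (2 * 4) = 2; variable 2: 1 / (2 * 0.25) = 2.
P_12 = penrose_distance(mean_1, mean_2, common_var)
```

In this example the two variables contribute equally even though their raw mean differences are 4 and 1, because each squared difference is divided by that variable's variance.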

L4.7 Multivariate distances between populations: the Mahalanobis distance
• Distances are computed based on means, variances and covariances for each of g samples (populations) based on p variables.
• Mahalanobis distance “weights” the contribution of each pair of variables by the corresponding element of the inverse of the covariance matrix.
Equation annotations: means of variables r and s in samples i and j; element (r, s) of the inverse of the covariance matrix, assumed equal for all samples.
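The equation is missing from the transcript. The usual squared Mahalanobis distance is the double sum over r and s of (mean_ri - mean_rj) * v^rs * (mean_si - mean_sj), where v^rs is element (r, s) of the inverse covariance matrix. The sketch below assumes that form and, to stay short, handles only p = 2 variables with an explicit 2 x 2 inverse; all names and values are hypothetical.

```python
def inverse_2x2(V):
    """Inverse of a 2 x 2 covariance matrix given as nested lists."""
    (a, b), (c, d) = V
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_sq(mean_i, mean_j, V):
    """Squared Mahalanobis distance between two mean vectors (p = 2)."""
    V_inv = inverse_2x2(V)
    diff = [a - b for a, b in zip(mean_i, mean_j)]
    return sum(diff[r] * V_inv[r][s] * diff[s]
               for r in range(2) for s in range(2))

mean_1, mean_2 = [2.0, 2.0], [0.0, 0.0]
V_correlated = [[4.0, 2.0], [2.0, 4.0]]    # covariance 2 between the variables
V_independent = [[4.0, 0.0], [0.0, 4.0]]   # same variances, no covariance

d2_correlated = mahalanobis_sq(mean_1, mean_2, V_correlated)
d2_independent = mahalanobis_sq(mean_1, mean_2, V_independent)
```

With positively correlated variables, a difference that lies along the direction of the correlation counts for less (4/3 here) than the same difference would with uncorrelated variables (2 here).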

L4.8 Scale
• Since distance measures are generally based on the sum over variables, variables measured on large scales will contribute disproportionately to the measure.
• So if variables are measured on different scales, use standardized values before computing distances.
• For g samples (groups, populations), variables should be standardized to zero mean and unit variance over all g groups.
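A minimal sketch of the standardization step, assuming the observations from all g groups have been pooled into one list of rows and that "unit variance" refers to the sample variance (n - 1 denominator). The function name and data are hypothetical.

```python
import statistics

def standardize_columns(rows):
    """Rescale each variable to zero mean and unit sample variance over all rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical pooled data: the second variable is on a much larger scale.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
z = standardize_columns(rows)
```

After this step both variables vary over the same range, so neither dominates a distance computed as a sum over variables.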

L4.9 Distances between observations and objects
• We can also calculate a distance between an individual observation and some object, where the object may be another observation or a group mean.
• The distance between an observation and a group can be used to define the probability that the observation belongs to the group.
[Figure: Group 1 and Group 2 plotted against X1 and X2, showing the group means and an individual observation]
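The slide does not say how the probability is obtained. One common approach, assumed here, is to treat each group as multivariate normal with known mean and covariance and refer the squared Mahalanobis distance to a chi-square distribution with p degrees of freedom. For p = 2 the upper-tail probability has the closed form exp(-D^2 / 2), which the sketch below uses. Strictly, this is the probability of a distance at least this large for a true member of the group. All names and values are hypothetical.

```python
import math

def mahalanobis_sq_2d(x, mean, V):
    """Squared Mahalanobis distance from observation x to a group mean (p = 2)."""
    (a, b), (c, d) = V
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form diff' V^{-1} diff written out for a 2 x 2 matrix.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def tail_probability_2d(d_sq):
    """P(chi-square with 2 df >= d_sq); valid only for p = 2 variables."""
    return math.exp(-d_sq / 2.0)

V = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical common covariance matrix
group_means = {"Group 1": (0.0, 0.0), "Group 2": (4.0, 0.0)}
observation = (1.0, 1.0)

probs = {name: tail_probability_2d(mahalanobis_sq_2d(observation, m, V))
         for name, m in group_means.items()}
```

Here the observation is much closer to the mean of Group 1, so its tail probability for that group is far larger than for Group 2.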

L4.10 Distances between samples: inferences to populations
• Since population means, variances etc. are usually not known, we can compute distances based only on sample estimators.
• But bear in mind the problems associated with inferring population parameters from estimators.

L4.11 Distances between groups based on proportions
• The p variables used to measure distance are proportions, i.e. p_ki is the proportion of observations in sample k that are in class i.
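The distance formula on this slide is not recoverable from the transcript. Purely as an illustration, the sketch below uses one simple choice, half the sum of absolute differences between the two samples' class proportions, which runs from 0 (identical proportions) to 1 (no classes shared). This measure, the function name, and the data are assumptions rather than the lecture's own formula.

```python
def proportion_distance(p_k, p_l):
    """Half the sum of absolute differences between two proportion vectors.

    p_k[i] is the proportion of observations in sample k that fall in class i.
    """
    if abs(sum(p_k) - 1.0) > 1e-9 or abs(sum(p_l) - 1.0) > 1e-9:
        raise ValueError("proportions in each sample must sum to 1")
    return 0.5 * sum(abs(a - b) for a, b in zip(p_k, p_l))

# Hypothetical class proportions for two samples over three classes.
sample_k = [0.5, 0.3, 0.2]
sample_l = [0.2, 0.3, 0.5]
d_kl = proportion_distance(sample_k, sample_l)
```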

L4.12 Distances based on binary data
• Similarity between two objects based on a list of binary attributes, e.g. comparison of p species based on presence/absence at n different sites
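Binary similarity measures are usually built from four counts taken over the attributes: a (present in both objects), b (present in the first only), c (present in the second only) and d (absent from both). The sketch below computes these counts for two species recorded at 10 sites; the letter convention, function name, and presence/absence data are assumptions, not the lecture's example.

```python
def match_counts(x, y):
    """Return (a, b, c, d) for two equal-length presence/absence vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must cover the same sites")
    a = sum(1 for u, v in zip(x, y) if u and v)          # present in both
    b = sum(1 for u, v in zip(x, y) if u and not v)      # first only
    c = sum(1 for u, v in zip(x, y) if not u and v)      # second only
    d = sum(1 for u, v in zip(x, y) if not u and not v)  # absent from both
    return a, b, c, d

# Hypothetical presence (1) / absence (0) of two species at 10 sites.
species_1 = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
species_2 = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
counts = match_counts(species_1, species_2)
```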

L4.13 Binary response dichotomy coefficients
• Positive matching
• Jaccard’s
• Simple matching
• Anderberg’s
• Tanimoto’s
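The formulas on this slide were lost in extraction. Two of the listed coefficients have widely used standard definitions in terms of the counts a (joint presences), b and c (mismatches) and d (joint absences): Jaccard's is a / (a + b + c) and simple matching is (a + d) / (a + b + c + d). The sketch below covers only those two; the function names and counts are hypothetical.

```python
def jaccard(a, b, c):
    """Jaccard coefficient: joint presences over sites where either is present."""
    return a / (a + b + c)

def simple_matching(a, b, c, d):
    """Simple matching coefficient: all agreements over all sites."""
    return (a + d) / (a + b + c + d)

# Hypothetical counts for two species over 10 sites.
a, b, c, d = 3, 2, 1, 4
s_jaccard = jaccard(a, b, c)
s_simple = simple_matching(a, b, c, d)
```

The two coefficients differ in how they treat joint absences: simple matching counts d as agreement, whereas Jaccard's ignores it.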

L4.14 More binary response similarity measures
• Ochiai index
• Dice index
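The slide's formulas are missing from the transcript. The standard definitions are Ochiai = a / sqrt((a + b)(a + c)) and Dice = 2a / (2a + b + c), where a is the number of joint presences and b and c are the two kinds of mismatch. The sketch below assumes those definitions; the function names and counts are hypothetical.

```python
import math

def ochiai(a, b, c):
    """Ochiai index: joint presences over the geometric mean of the two totals."""
    return a / math.sqrt((a + b) * (a + c))

def dice(a, b, c):
    """Dice index: joint presences counted twice, joint absences ignored."""
    return 2 * a / (2 * a + b + c)

# Hypothetical counts: 3 joint presences, 2 and 1 mismatches.
a, b, c = 3, 2, 1
s_ochiai = ochiai(a, b, c)
s_dice = dice(a, b, c)
```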

L4.15 Example: distribution of 3 species of plants at 10 sites
[Table: Ochiai index and Dice index for the example data]

L4.16 Comparing two distance matrices
• Used in situations where two different distance matrices (D1, D2) have been computed for the same set of objects using two different sets of variables.
• Question: what is the correlation between D1 and D2?
[Figure: D1 computed from variables {X_1, ..., X_p}; D2 computed from variables {X_{p+1}, ..., X_M}]
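One way to sketch this correlation, assuming it is the ordinary Pearson correlation between corresponding off-diagonal elements of the two matrices. Only the upper triangle is used because the matrices are symmetric. The function names and the two small matrices are hypothetical.

```python
import math

def upper_triangle(D):
    """Elements above the diagonal of a square matrix, row by row."""
    n = len(D)
    return [D[i][j] for i in range(n) for j in range(i + 1, n)]

def matrix_correlation(D1, D2):
    """Pearson correlation between the off-diagonal elements of D1 and D2."""
    x, y = upper_triangle(D1), upper_triangle(D2)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

D1 = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
D2 = [[0.0, 2.0, 4.0], [2.0, 0.0, 6.0], [4.0, 6.0, 0.0]]  # D2 = 2 * D1
r = matrix_correlation(D1, D2)
```

Because the elements of a distance matrix are not independent of one another, the usual significance test for r does not apply, which is why a randomization approach is needed.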

L4.17 The Mantel test
• Calculate the observed Z value (Z*) and compare it with the distribution of Z obtained under random ordering of objects in one or the other matrix (say, D2).
• If r = 0 (no correlation between D1 and D2), then D2 will be just like a randomized matrix, so Z will be a typical randomized value.
[Figure: frequency distribution of Z from N randomizations, with the observed Z marked]
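The definition of Z is not in the transcript. The sketch below assumes the usual Mantel statistic, the sum over pairs i < j of the products of corresponding elements of D1 and D2, and randomizes by applying the same random reordering to the rows and columns of D2. The function names, the number of permutations, the seed, and the data are all hypothetical, and the p-value shown is one-sided (large Z).

```python
import random

def mantel_z(D1, D2, order=None):
    """Sum of products of corresponding distances, with D2's objects reordered."""
    n = len(D1)
    if order is None:
        order = list(range(n))
    return sum(D1[i][j] * D2[order[i]][order[j]]
               for i in range(n) for j in range(i + 1, n))

def mantel_test(D1, D2, n_perm=999, seed=0):
    """Return observed Z, the randomized Z values, and a one-sided p-value."""
    rng = random.Random(seed)
    z_obs = mantel_z(D1, D2)
    order = list(range(len(D1)))
    z_rand = []
    for _ in range(n_perm):
        rng.shuffle(order)  # random ordering of the objects in D2
        z_rand.append(mantel_z(D1, D2, order))
    # Share of values (the observed one included) at least as large as observed.
    p = (1 + sum(z >= z_obs for z in z_rand)) / (n_perm + 1)
    return z_obs, z_rand, p

# Hypothetical data: five objects on a line; D2 is exactly twice D1.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
D1 = [[abs(a - b) for b in x] for a in x]
D2 = [[2.0 * d for d in row] for row in D1]
z_obs, z_rand, p = mantel_test(D1, D2)
```

Rows and columns of D2 are permuted together, so each randomized matrix is still a valid distance matrix for the relabelled objects.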

L4.18 The Mantel test (cont’d)
• From the distribution of randomized Z, we can compute a test statistic g which is approximately distributed as a standard normal variate, or simply use the randomized distribution itself to calculate p.
[Figure: frequency distribution of randomized Z]
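A sketch of both options, assuming g is the observed Z standardized by the mean and standard deviation of the randomized values (an analytic mean and variance under permutation could be used instead). The function names and the small set of randomized Z values are hypothetical, and both p-values are one-sided.

```python
import math
import statistics

def standardized_z(z_obs, z_rand):
    """g = (observed Z - mean of randomized Z) / SD of randomized Z."""
    return (z_obs - statistics.mean(z_rand)) / statistics.stdev(z_rand)

def upper_tail_normal(g):
    """P(standard normal >= g), via the complementary error function."""
    return 0.5 * math.erfc(g / math.sqrt(2.0))

def randomization_p(z_obs, z_rand):
    """Share of Z values (observed included) at least as large as observed."""
    return (1 + sum(z >= z_obs for z in z_rand)) / (len(z_rand) + 1)

# Hypothetical randomized Z values and observed Z.
z_rand = [10.0, 12.0, 14.0, 16.0, 18.0]
z_obs = 20.0

g = standardized_z(z_obs, z_rand)
p_normal = upper_tail_normal(g)
p_random = randomization_p(z_obs, z_rand)
```

With so few randomizations the two p-values can differ a lot; the randomization p-value can never be smaller than 1 / (number of randomizations + 1).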