Transforms and other prestidigitations—or new twists in imputation. Albert R. Stage.

Imputation:
– To use what we know about "everywhere" that may be useful, but not very interesting: the X's.
– To fill in detail that is prohibitively expensive to obtain, except on a sample: the Y's.
– By finding surrogates based on similarity of the X's.
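As a minimal sketch of the idea in Python (the data and function names here are hypothetical, not the MSN program itself): each target observation, for which only the X's are known, receives the Y's of the reference observation whose X's are most similar.

```python
import numpy as np

def impute_nearest(X_ref, Y_ref, X_target):
    """Impute Y for each target row by copying Y from the most
    similar reference row, judged by Euclidean distance on X."""
    imputed = []
    for x in X_target:
        d2 = np.sum((X_ref - x) ** 2, axis=1)  # squared distances to all references
        imputed.append(Y_ref[np.argmin(d2)])   # copy Y from the nearest reference
    return np.array(imputed)

# References have both X (cheap, known everywhere) and Y (expensive, sampled);
# targets have only X.
X_ref = np.array([[1.0, 2.0], [3.0, 1.0], [5.0, 4.0]])
Y_ref = np.array([[10.0], [20.0], [30.0]])
X_target = np.array([[2.8, 1.2]])
print(impute_nearest(X_ref, Y_ref, X_target))  # -> [[20.]]
```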

Topics:
– Measures of similarity (a few in particular)
– An alternative MSN distance function leading to some improved estimates
– Transformations that improve resolution:
   – On the X-side (known everywhere)
   – On the Y-side (known for the sample only)

Distance measures for interval- and ratio-scale variables (Podani 2000): Euclidean/Mahalanobis, Chord, Angular, Geodesic, Manhattan, Canberra, Clark, Bray-Curtis, Marczewski-Steinhaus, 1-Kulczynski, Pinkham-Pearson, Gleason, Ellenberg, Pandeya, Chi-square, 1-Correlation, 1-similarity ratio, Kendall difference, Faith intermediate, Uppsala coefficient. A few of these are written out in code below.
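For concreteness, a sketch of a few of these measures in Python; the definitions follow the usual textbook forms, and the helper names are my own.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def canberra(x, y):
    # each coordinate's difference scaled by the coordinate magnitudes
    return np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y)))

def bray_curtis(x, y):
    # for nonnegative abundance-type data
    return np.sum(np.abs(x - y)) / np.sum(x + y)

def chord(x, y):
    # Euclidean distance after scaling each vector to unit length
    return euclidean(x / np.linalg.norm(x), y / np.linalg.norm(y))

x = np.array([4.0, 1.0, 0.0])
y = np.array([1.0, 3.0, 2.0])
for f in (euclidean, manhattan, canberra, bray_curtis, chord):
    print(f.__name__, round(float(f(x, y)), 3))
```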

Distance measures for binary variables (Podani 2000):
– Symmetric for 0/1: Simple matching, Euclidean, Rogers-Tanimoto, Sokal-Sneath, Anderberg I, Anderberg II, Correlation, Yule I, Yule II, Hamann
– Asymmetric for 0/1: Baroni-Urbani-Buser I, Baroni-Urbani-Buser II, Russell-Rao, Faith I, Faith II
– Ignore 0/0 matches: Jaccard, Sorenson, Chord, Kulczynski, Sokal-Sneath II, Mountford
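The three groups differ in how joint absences (0/0 matches) are treated. A sketch contrasting one coefficient from the symmetric group with one that ignores joint absences; both are written as similarities, with 1 minus the similarity serving as the distance (hypothetical helper names):

```python
import numpy as np

def simple_matching(x, y):
    # symmetric: joint absences (0/0) count as agreement
    return np.mean(x == y)

def jaccard(x, y):
    # ignores joint absences entirely
    either = (x == 1) | (y == 1)
    return np.sum((x == 1) & (y == 1)) / np.sum(either)

x = np.array([1, 0, 0, 0, 1])
y = np.array([1, 0, 0, 1, 0])
print(simple_matching(x, y))  # 0.6 -- the two 0/0 matches inflate agreement
print(jaccard(x, y))          # 1/3 -- only species present somewhere count
```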

Distance function in matrix notation:

$$D^2_{iu} = \min_i \left[ (X_i - X_u)\, W\, (X_i - X_u)' \right]$$

where, for:
– Euclidean distance: $W = I$ (the identity matrix)
– Mahalanobis distance: $W = \Sigma^{-1}$ (the inverse covariance matrix)
– MSN (1995): $W = \Gamma \Lambda^2 \Gamma'$, with $\Gamma$ = the matrix of coefficients of the canonical variates and $\Lambda$ = the diagonal matrix of canonical correlations
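A sketch of this generalized distance in Python with the three choices of W. In practice Γ and Λ would come from a canonical correlation analysis of the reference data; here they are stand-in placeholders, and all variable names are mine.

```python
import numpy as np

def gen_dist2(xi, xu, W):
    """Squared generalized distance (xi - xu) W (xi - xu)'."""
    d = xi - xu
    return d @ W @ d

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # reference X data

W_euclid = np.eye(3)                    # Euclidean: W = I
W_mahal  = np.linalg.inv(np.cov(X.T))   # Mahalanobis: W = inverse covariance

# MSN (1995): W = Gamma Lambda^2 Gamma', using stand-in canonical results
Gamma = rng.normal(size=(3, 2))         # canonical coefficients (placeholder)
Lam   = np.diag([0.9, 0.4])             # canonical correlations (placeholder)
W_msn = Gamma @ Lam**2 @ Gamma.T

xi, xu = X[0], X[1]
for name, W in [("Euclidean", W_euclid), ("Mahalanobis", W_mahal), ("MSN", W_msn)]:
    print(name, round(float(gen_dist2(xi, xu, W)), 3))
```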

Why Weight with Canonical Analysis? Not degraded by non-informative X's.

Why Weight with Canonical Analysis? Not affected by non-informative Y's if the number of canonical pairs is determined by a test of significance on the rank (sketched below).
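The slide does not name the test, but a common choice for this rank decision is Bartlett's sequential chi-square test on the canonical correlations; a sketch under that assumption:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_rank_test(rho, n, p, q, alpha=0.05):
    """Number of canonical pairs for which Bartlett's chi-square test
    rejects 'all remaining canonical correlations are zero'.
    rho: canonical correlations; n: sample size; p, q: numbers of X
    and Y variables."""
    rho = np.asarray(rho)
    k = 0
    for k0 in range(len(rho)):
        stat = -(n - 1 - (p + q + 1) / 2.0) * np.sum(np.log(1 - rho[k0:] ** 2))
        df = (p - k0) * (q - k0)
        if chi2.sf(stat, df) < alpha:
            k = k0 + 1     # pair k0+1 is significant; test the remainder
        else:
            break
    return k

# Hypothetical correlations: the third pair is noise and is dropped.
print(bartlett_rank_test([0.85, 0.40, 0.05], n=200, p=3, q=3))  # -> 2
```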

Comparison of MSN distance functions:
Moeur and Stage 1995
– Assumes the Y's are "true"
– Searches for the closest linear combination of the Y's
– Set of near neighbors is sensitive to lower-order canonical correlations
Stage 2003
– Assumes the Y's include measurement error
– Searches for the closest linear combination of the predicted Y's
– Set of near neighbors is less sensitive to random elements "swept" into the lower-order canonical correlations

Distance function of the new alternative: the distance is computed in the r-space spanned by the expected values of the canonical variates conditioned on X. As in Mahalanobis distance, each variate is weighted by the inverse of its standard error about its expectation, $\sqrt{1/(1-\rho_k^2)}$.

New regression alternative:

$$d_{ij}^2 = (X_i - X_j)\, \Gamma \Lambda \left[(I - \Lambda^2)\right]^{-1} \Lambda\, \Gamma'\, (X_i - X_j)'$$

where $\Lambda$ is the diagonal matrix of canonical correlations $\rho_k$ for $k = 1, \ldots, r$, the retained canonical pairs.
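To see the effect, note that the 1995 function weights canonical pair k in proportion to $\rho_k^2$, while the regression alternative weights it in proportion to $\rho_k^2/(1-\rho_k^2)$; the ratio new/old is therefore $1/(1-\rho_k^2)$, which grows with $\rho_k$. A quick numeric sketch (the example correlations are hypothetical):

```python
import numpy as np

rho = np.array([0.9, 0.6, 0.3])   # example canonical correlations

w_old = rho**2                    # MSN 1995 weight per canonical pair
w_new = rho**2 / (1 - rho**2)     # regression-alternative weight

for k, (r, ratio) in enumerate(zip(rho, w_new / w_old), start=1):
    print(f"pair {k}: rho={r:.1f}, relative weight new/old = {ratio:.2f}")
# pair 1: 5.26, pair 2: 1.56, pair 3: 1.10 -- the most highly
# correlated pair gains the most weight
```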

Effect of the change:
– No change if only the first canonical pair is used.
– The regression alternative gives more relative weight to the more highly correlated pairs.
– Effects on the root-mean-square error of imputation are mixed: e.g., the following three data sets.

Statistics for three data sets

[Table: change in relative weights depends on ρ². For each canonical pair and each of three data sets (Utah, Tally Lake, User's Guide), the slide tabulated ρ² and the relative weight new/old, plus a total row; the numeric values did not survive in this transcript.]

Transforming X-variables:
– To predict discrete classes of modal species composition (MSC) with Euclidean or Mahalanobis distance.
– To predict continuous variables of species composition.

[Figure: Euclidean vs. cosine (spectral angle) distance. A target observation and two reference observations, Ref. A and Ref. B, plotted against Variable 1 and Variable 2; the Euclidean distance and the spectral angle can rank the two references differently.]

Euclidean distance function with cosine transformation. Let each observation vector be scaled to unit length, $\tilde{x}_k = x_k / \sqrt{\sum_j x_j^2}$; Euclidean distance computed on the transformed values is then a monotone function of the spectral angle between observations.
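A sketch of the transformation and its effect (the function names and numbers are mine): after scaling each vector to unit length, Euclidean distance follows the spectral angle, so a distant reference that points in the same direction as the target becomes the nearest one.

```python
import numpy as np

def unit_scale(X):
    """Scale each row of X to unit length (the cosine transformation)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def spectral_angle(x, y):
    """Angle (radians) between two observation vectors."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0))

target = np.array([2.0, 1.0])
ref_a  = np.array([1.0, 2.5])   # close in Euclidean terms, different direction
ref_b  = np.array([6.0, 3.0])   # far in Euclidean terms, same direction

for name, ref in [("Ref. A", ref_a), ("Ref. B", ref_b)]:
    u = unit_scale(np.vstack([target, ref]))
    print(name,
          "Euclidean:", round(float(np.linalg.norm(target - ref)), 2),
          "transformed:", round(float(np.linalg.norm(u[0] - u[1])), 3),
          "angle:", round(float(spectral_angle(target, ref)), 3))
# Euclidean picks Ref. A as nearer; the transformed distance (and the
# spectral angle) picks Ref. B.
```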

[Table: effect of using the cosine transformation of TM data on classification accuracy, measured by kappa statistics. Rows: Plant Association Group (Oregon; TM data; Mahalanobis distance), Modal Species Composition (Oregon; TM data; Mahalanobis distance), and Modal Species Composition (Minnesota; TM + enhanced data; Euclidean distance). Columns: untransformed vs. cosine-transformed; the kappa values themselves did not survive in this transcript.]
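For reference, the kappa statistic discounts the agreement expected by chance from the class margins; a sketch of the computation on a hypothetical confusion matrix (not the study's data):

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix
    (rows = observed class, columns = imputed class)."""
    confusion = np.asarray(confusion, dtype=float)
    n = confusion.sum()
    p_observed = np.trace(confusion) / n                 # raw agreement rate
    p_chance = np.sum(confusion.sum(axis=0) *
                      confusion.sum(axis=1)) / n**2      # chance agreement
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical 3-class confusion matrix
print(round(cohens_kappa([[30, 5, 2], [4, 25, 6], [1, 7, 20]]), 3))  # -> 0.623
```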

Transforming the Y-variables:
– Variance considerations (want homogeneity) and a logical functional form for Y = f(X).
– Transformations of species composition: logarithm of species basal area; percent basal area by species; cosine spectral angle; logistic.
– Evaluated by predicting the discrete Plant Association Group (PAG) with the Users' Guide example data (Oregon).

Composition transformations:
– Logistic: $\ln[(\text{Total BA} - \text{spp BA})/\text{spp BA}] = \ln(\text{Total BA} - \text{spp BA}) - \ln(\text{spp BA})$, represented in MSN by two separate variables.
– Cosine spectral angle: $\text{spp BA} \,/\, \sqrt{\sum (\text{spp BA})^2}$.
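A sketch of both transformations applied to the basal areas of the species on one plot (the numbers are hypothetical; the logistic form assumes every species basal area is strictly positive):

```python
import numpy as np

ba = np.array([12.0, 6.0, 2.0])   # basal area by species on one plot
total = ba.sum()

# Logistic transform, carried in MSN as two separate variables:
logit_a = np.log(total - ba)      # ln(Total BA - spp BA)
logit_b = np.log(ba)              # ln(spp BA); the logistic value is logit_a - logit_b

# Cosine spectral-angle transform: scale the composition vector to unit length
cosine = ba / np.sqrt(np.sum(ba**2))

print(logit_a - logit_b)          # the logistic transform itself
print(cosine, np.sum(cosine**2))  # unit length: squares sum to 1
```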

Implications of transforming:
– The imputed value is derived from the neighbor, not directly from the model as in regression.
– Neighbor selection may be improved by transforming the Y's and X's.
– Multivariate Y's can resolve some indeterminacies from functions having extreme-value points (maxima or minima).

MSN software now includes alternative distance functions:
– Both canonical-correlation-based distance functions.
– Euclidean distance on normalized X's.
– Mahalanobis distance on normalized X's.
– A weight matrix of your own derivation.
– k-nearest-neighbor identification.
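A sketch of the k-nearest-neighbor option combined with a user-supplied weight matrix; this mirrors the idea, not the program's actual interface:

```python
import numpy as np

def k_nearest(X_ref, x_target, W, k=3):
    """Indices of the k reference observations closest to x_target
    under the generalized distance d^2 = (x - x_ref) W (x - x_ref)'."""
    diffs = X_ref - x_target
    d2 = np.einsum('ij,jk,ik->i', diffs, W, diffs)  # one d^2 per reference
    return np.argsort(d2)[:k]

rng = np.random.default_rng(1)
X_ref = rng.normal(size=(20, 4))
W = np.linalg.inv(np.cov(X_ref.T))   # e.g. Mahalanobis; any W you derive works
print(k_nearest(X_ref, X_ref[0] + 0.1, W, k=3))
```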

So?? Of the many methods available for imputation of attributes, no single alternative is clearly superior for all data sets.

Software Availability
On the Web:
In print: Crookston, N.L., Moeur, M., and Renner, D.L. 2002. User's guide to the Most Similar Neighbor Imputation Program Version 2. Gen. Tech. Rep. RMRS-GTR-96. Ogden, UT: USDA Forest Service, Rocky Mountain Research Station. 35 p.