Measurements and Data. Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality.

Slides:



Advertisements
Similar presentations
Different types of data e.g. Continuous data:height Categorical data ordered (nominal):growth rate very slow, slow, medium, fast, very fast not ordered:fruit.
Advertisements

Chapter 3 Properties of Random Variables
Clustering Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data. The example below demonstrates.
Computational Statistics. Basic ideas  Predict values that are hard to measure irl, by using co-variables (other properties from the same measurement.
Mean, Proportion, CLT Bootstrap
Uncertainty in fall time surrogate Prediction variance vs. data sensitivity – Non-uniform noise – Example Uncertainty in fall time data Bootstrapping.
CHAPTER 21 Inferential Statistical Analysis. Understanding probability The idea of probability is central to inferential statistics. It means the chance.
CLUSTERING PROXIMITY MEASURES
Regression Analysis Module 3. Regression Regression is the attempt to explain the variation in a dependent variable using the variation in independent.
Regression Analysis Once a linear relationship is defined, the independent variable can be used to forecast the dependent variable. Y ^ = bo + bX bo is.
Introduction to Data Analysis
Terminology species data = the measured variables we want to explain (response or dependent variables) environmental data = the variables we use for explaining.
1 Lecture 2: ANOVA, Prediction, Assumptions and Properties Graduate School Social Science Statistics II Gwilym Pryce
Statistics. The usual course of events for conducting scientific work “The Scientific Method” Reformulate or extend hypothesis Develop a Working Hypothesis.
BHS Methods in Behavioral Sciences I April 18, 2003 Chapter 4 (Ray) – Descriptive Statistics.
Statistical Methods Chichang Jou Tamkang University.
Correlation Patterns. Correlation Coefficient A statistical measure of the covariation or association between two variables. Are dollar sales.
BHS Methods in Behavioral Sciences I April 21, 2003 Chapter 4 & 5 (Stanovich) Demonstrating Causation.
Statistical Analysis SC504/HS927 Spring Term 2008 Week 17 (25th January 2008): Analysing data.
Distance Measures Tan et al. From Chapter 2.
Statistical Background
Statistical Treatment of Data Significant Figures : number of digits know with certainty + the first in doubt. Rounding off: use the same number of significant.
Basic Mathematics for Portfolio Management. Statistics Variables x, y, z Constants a, b Observations {x n, y n |n=1,…N} Mean.
1 Seventh Lecture Error Analysis Instrumentation and Product Testing.
Introduction to Linear and Logistic Regression. Basic Ideas Linear Transformation Finding the Regression Line Minimize sum of the quadratic residuals.
Lecture II-2: Probability Review
Correlation and Regression
1 DATA DESCRIPTION. 2 Units l Unit: entity we are studying, subject if human being l Each unit/subject has certain parameters, e.g., a student (subject)
BPS - 3rd Ed. Chapter 211 Inference for Regression.
PTP 560 Research Methods Week 8 Thomas Ruediger, PT.
Sullivan – Fundamentals of Statistics – 2 nd Edition – Chapter 11 Section 1 – Slide 1 of 34 Chapter 11 Section 1 Random Variables.
Data Mining & Knowledge Discovery Lecture: 2 Dr. Mohammad Abu Yousuf IIT, JU.
Tuesday August 27, 2013 Distributions: Measures of Central Tendency & Variability.
Correlations. Outline What is a correlation? What is a correlation? What is a scatterplot? What is a scatterplot? What type of information is provided.
UNDERSTANDING RESEARCH RESULTS: DESCRIPTION AND CORRELATION © 2012 The McGraw-Hill Companies, Inc.
METHODS IN BEHAVIORAL RESEARCH NINTH EDITION PAUL C. COZBY Copyright © 2007 The McGraw-Hill Companies, Inc.
Applied Quantitative Analysis and Practices LECTURE#11 By Dr. Osman Sadiq Paracha.
Correlation Patterns.
Slide 1 Copyright © 2004 Pearson Education, Inc..
Various topics Petter Mostad Overview Epidemiology Study types / data types Econometrics Time series data More about sampling –Estimation.
Correlation & Regression Chapter 15. Correlation It is a statistical technique that is used to measure and describe a relationship between two variables.
Descriptive statistics Petter Mostad Goal: Reduce data amount, keep ”information” Two uses: Data exploration: What you do for yourself when.
Math Sunshine State Standards Wall poster. MAA Associates verbal names, written word names, and standard numerals with integers, rational numbers,
LECTURE 3: ANALYSIS OF EXPERIMENTAL DATA
Chapter 2 Statistical Background. 2.3 Random Variables and Probability Distributions A variable X is said to be a random variable (rv) if for every real.
Chapter 2: Getting to Know Your Data
Kin 304 Correlation Correlation Coefficient r Limitations of r
Correlations. Distinguishing Characteristics of Correlation Correlational procedures involve one sample containing all pairs of X and Y scores Correlational.
Chapter 8: Simple Linear Regression Yang Zhenlin.
INTRODUCTORY LECTURE 3 Lecture 3: Analysis of Lab Work Electricity and Measurement (E&M)BPM – 15PHF110.
Copyright © 2010 Pearson Education, Inc Chapter Seventeen Correlation and Regression.
1 PAUF 610 TA 1 st Discussion. 2 3 Population & Sample Population includes all members of a specified group. (total collection of objects/people studied)
More on regression Petter Mostad More on indicator variables If an independent variable is an indicator variable, cases where it is 1 will.
Chapter 8 Estimation ©. Estimator and Estimate estimator estimate An estimator of a population parameter is a random variable that depends on the sample.
Biostatistics Regression and Correlation Methods Class #10 April 4, 2000.
BPS - 5th Ed. Chapter 231 Inference for Regression.
Estimating standard error using bootstrap
Lecture 2-2 Data Exploration: Understanding Data
Data Mining: Concepts and Techniques
Understanding Research Results: Description and Correlation
Basic Statistical Terms
Daniela Stan Raicu School of CTI, DePaul University
Introduction to Statistical Methods for Measuring “Omics” and Field Data PCA, PcoA, distance measure, AMOVA.
Matrix Algebra and Random Vectors
What Is Good Clustering?
4/4/2019 Correlations.
15.1 The Role of Statistics in the Research Process
Descriptive Statistics
Correlation and Covariance
DIGITAL IMAGE PROCESSING Elective 3 (5th Sem.)
Presentation transcript:

Measurements and Data

Topics Types of Data Distance Measurement Data Transformation Forms of Data Data Quality

Types of Measurement Ordinal, – e.g., excellent=5, very good=4, good=3… Nominal – e.g., color, religion, profession – Need non-metric methods Ratio – e.g., weight – has concatenation property, two weights add to balance a third: 2+3 = 5 Interval – e.g., temperature, calendar time

Examples of Metrics Euclidean Distance d E –Standardized (divide by variance) –Weighted d WE Minkowski measure –Manhattan Distance Mahanalobis Distance d M –Use of Covariance Binary data Distances

Use of Covariance in Distance Similarities between cups Suppose we measure cup-height 100 times and diameter only once – height will dominate although 99 of the height measurements are not contributing anything They are very highly correlated To eliminate redundancy we need a data- driven method – approach is to not only to standardize data in each direction but also to use covariance between variables

   A scalar value to measure how x and y vary together  Covariance between two Scalar Variables Large positive value – if large values of x tend to be associated with large values of y and small values of x with small values of y Large negative value – if large values of x tend to be associated with small values of y With d variables can construct a d x d matrix of covariances

Correlation Coefficient Value of Covariance is dependent upon ranges of x and y Dependency is removed by dividing values of x by their standard deviation and values of y by their standard deviation

Correlation Matrix Housing related variables across city suburbs (d=11) 11 x 11 pixel image (White 1, Black -1) Columns have values -1,0,1 for pixel intensity reference Remaining represent corrrelation matrix Reference for -1, 0,+1 Variables 3 and 4 are highly negatively correlated with Variable 2 Variable 5 is positively correlated with Variable 11 Variables 8 and 9 are highly correlated

Generalizing Euclidean Distance Minkowski or L λ metric λ = 2 gives the Euclidean metric λ = 1 gives the Manhattan or City-block metric λ = ∞ yields

Distance Measures for Binary Data Most obvious measure is Hamming Distance normalized by number of bits Proportion of variables on which objects have same value If we don’t care about irrelevant properties had by neither object we have Jaccard Coefficient Example: two documents do not have certain terms Dice Coefficient extends this argument – If 00 matches are irrelevant then 10 and 01 matches should have half relevance

Transforming the Data Model depends on form of data If Y is a function of X 2 then we could use quadratic function or choose U= X 2 and use a linear fit

V 1 is non- linearly Related to V 2 V 1 V 3 =1/V 2 is linearly related to V 1 V2V2

Variance increases Square root transformation keeps the variance constant

Forms of Data Standard Data (Data Matrix) Multirelational Data String Sequence of symbols from a finite alphabet Event Sequence Sequence of pairs of the form {event, occurrence time}

NameDepartment Age Salary Department BudgetManager Multirelational Data (multiple data matrices) Payroll Database Name Department Table Name Can be combined together to form a data matrix with fields name, department-name, age, salary, budget, manager Or create as many rows as department-names Flattening requires needless replication (Storage issues)

Data Quality for Individual Measurements Data Mining Depends on Quality of data Many interesting patterns discovered may result from measurement inaccuracies. Sources of error –Errors in measurement –Carelessness –Instrumentation failure

Precision and Accuracy Precise Measurement –Small variability (measured by variance) –Repeated measurements yield same value –Many digits of precision is not necessarily accurate (results of calculations give many digits) Accurate –Not only small variability but close to true value

Data Quality for Collections of Data Collections of Data –Much of statistics is concerned with inference from a sample to a population –How to infer things from a fraction about entire population –Two sources of error: sample size and bias