Automatic Data Transformation Qing Feng Joint work with Dr. Jan Hannig, Dr. J. S. Marron Mar. 22th, 2016.

Slides:



Advertisements
Similar presentations
Chapter 3 Properties of Random Variables
Advertisements

Human Detection Phanindra Varma. Detection -- Overview  Human detection in static images is based on the HOG (Histogram of Oriented Gradients) encoding.
The Normal Distribution
Transformations & Data Cleaning
Aggregating local image descriptors into compact codes
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
An Introduction of Support Vector Machine
Week 3. Logistic Regression Overview and applications Additional issues Select Inputs Optimize complexity Transforming Inputs.
Robust Multi-Kernel Classification of Uncertain and Imbalanced Data
What is Statistical Modeling
Pattern Recognition and Machine Learning
The Normal Curve Z Scores, T Scores, and Skewness.
T T Population Sampling Distribution Purpose Allows the analyst to determine the mean and standard deviation of a sampling distribution.
Regionalized Variables take on values according to spatial location. Given: Where: A “structural” coarse scale forcing or trend A random” Local spatial.
Object Detection using Histograms of Oriented Gradients
T T07-01 Sample Size Effect – Normal Distribution Purpose Allows the analyst to analyze the effect that sample size has on a sampling distribution.
1 Re-expressing Data  Chapter 6 – Normal Model –What if data do not follow a Normal model?  Chapters 8 & 9 – Linear Model –What if a relationship between.
Multivariate Distributions. Distributions The joint distribution of two random variables is f(x 1,x 2 ) The marginal distribution of f(x 1 ) is obtained.
MEASURES OF DISPERSION: SPREAD AND VARIABILITY. DATA SETS FOR PROJECT NES2000.sav States.sav World.sav.
Bayesian Learning, cont’d. Administrivia Homework 1 returned today (details in a second) Reading 2 assigned today S. Thrun, Learning occupancy grids with.
Descriptive Statistics for Numeric Variables Types of Measures: measures of location measures of spread measures of shape measures of relative standing.
1 Measures of Central Tendency Greg C Elvers, Ph.D.
Review of normal distribution. Exercise Solution.
EE513 Audio Signals and Systems Statistical Pattern Classification Kevin D. Donohue Electrical and Computer Engineering University of Kentucky.
Chapter 3 – Descriptive Statistics
WFM 5201: Data Management and Statistical Analysis © Dr. Akm Saiful IslamDr. Akm Saiful Islam WFM 5201: Data Management and Statistical Analysis Akm Saiful.
Quantile Regression. The Problem The Estimator Computation Properties of the Regression Properties of the Estimator Hypothesis Testing Bibliography Software.
A P STATISTICS LESSON 2 – 2 STANDARD NORMAL CALCULATIONS.
CHAPTER 7: Exploring Data: Part I Review
Descriptive Statistics
10.1: Confidence Intervals – The Basics. Introduction Is caffeine dependence real? What proportion of college students engage in binge drinking? How do.
Fall 2002Biostat Nonparametric Tests Nonparametric tests are useful when normality or the CLT can not be used. Nonparametric tests base inference.
TODAY we will Review what we have learned so far about Regression Develop the ability to use Residual Analysis to assess if a model (LSRL) is appropriate.
Dr. Serhat Eren 1 CHAPTER 6 NUMERICAL DESCRIPTORS OF DATA.
MMSI – SATURDAY SESSION with Mr. Flynn. Describing patterns and departures from patterns (20%–30% of exam) Exploratory analysis of data makes use of graphical.
T06-02.S - 1 T06-02.S Standard Normal Distribution Graphical Purpose Allows the analyst to analyze the Standard Normal Probability Distribution. Probability.
Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis Kei Hashimoto, Yoshihiko Nankaku, and Keiichi.
A generalized bivariate Bernoulli model with covariate dependence Fan Zhang.
Measures of Association: Pairwise Correlation
1 G Lect 7a G Lecture 7a Comparing proportions from independent samples Analysis of matched samples Small samples and 2  2 Tables Strength.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 6-1 The Normal Distribution.
Scales of Measurement n Nominal classificationlabels mutually exclusive exhaustive different in kind, not degree.
Participant Presentations Please Prepare to Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early,
A Kernel Approach for Learning From Almost Orthogonal Pattern * CIS 525 Class Presentation Professor: Slobodan Vucetic Presenter: Yilian Qin * B. Scholkopf.
Copyright © 2015, 2012, and 2009 Pearson Education, Inc. 1 Chapter Normal Probability Distributions 5.
Participant Presentations Please Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early, Oct.,
JIVE Integration of HCP Data Qunqun Yu Dr. Steve Marron, Dr. Kai Zhang & Dr. Ben Risk University of North Carolina at Chapel Hill.
Selecting Input Probability Distributions. 2 Introduction Part of modeling—what input probability distributions to use as input to simulation for: –Interarrival.
AP Statistics. Chapter 1 Think – Where are you going, and why? Show – Calculate and display. Tell – What have you learned? Without this step, you’re never.
Introduction to Normal Distributions
CHAPTER 10 Comparing Two Populations or Groups
Predicting Brain Size Samantha Stanley Jasmine Dumas Christopher Lee.
Chapter 5 Normal Probability Distributions.
Statistical Data Analysis - Lecture 05 12/03/03
Statistical Data Analysis - Lecture10 26/03/03
Elementary Statistics: Picturing The World
Outline Announcement Texture modeling - continued Some remarks
Chapter 6 Confidence Intervals.
I271b Quantitative Methods
Support Vector Machine
ASV Chapters 1 - Sample Spaces and Probabilities
Data Transformation, T-Tools and Alternatives
CHAPTER 10 Comparing Two Populations or Groups
Introduction to Normal Distributions
Chapter 5 Normal Probability Distributions.
Chapter 6 Confidence Intervals.
Chapter 5 Normal Probability Distributions.
Introduction to Normal Distributions
Probabilistic Surrogate Models
Participant Presentations
Presentation transcript:

Automatic Data Transformation Qing Feng Joint work with Dr. Jan Hannig, Dr. J. S. Marron Mar. 22th, 2016

Transformations  Useful Method for Data Analysts  Apply to Marginal Distributions (i.e. to Individual Variables)  Idea o Data close to normality

Transformations Example: Melanoma Data Analysis Miedema, et al (2012)

Transformations Statistical Challenge: Strong skewness Some Features Spread Over Several Orders of Magnitude Few Big Values Can Dominate Analysis

Transformations Few Large Values Can Dominate Analysis

Transformations Log Scale Gives All More Appropriate Weight Note: Orders Of Magnitude Log 10 most Interpretable

Transformations Another Feature with Large Values Problem For This Feature: 0 Values (no log)

Transformations

Much Nicer Distribution

Transformations Different Direction (Negative) of Skewness

Transformations Use Log Difference Transformation

Call for automation! Automation Challenges: Tune the shift parameter for each variable ( + ): Independent of data magnitude Handle both positive and negative skewness Address influential data points Transformations

Goal Automatically select shift log transformation Automatically detect and address influential values Apply to individual variables Automatic Transformations

Outline A new parameterized transformation function ( + ): re-parametrize Standardization and winsorization Parameter value selection Automatic Transformations

Need valid inputs!

Automatic Transformations

Family of transformation functions Comments One family for address both positive and negative skewness Parameter value selection is independent of data magnitude Automatic Transformations

Transformation scheme  Initialization: construct a grid of parameter values  Step 1: apply the transformation function to the feature vector for each parameter value  Step 2: standardization and winsorization  Step 3: calculate the Anderson–Darling test statistic. Automatic Transformations

Melanoma Data Much Nicer Distribution

Melanoma Data Much Nicer Distribution

Melanoma Data Much Nicer Distribution

Takeaway for Analysis Chemometric Data from Alex Tropsha Lab n = 262, d = 800, Non-binary Response

Takeaway for Analysis Start with MargDistPlot for variable screen Sort on Skewness, Equally Spaced Have Substantial Skewness (Consider Transform’n)

Takeaway for Analysis Perform automatic transformation Sort on Skewness Better Symmetry !

Discussion Limitation –Binary variable –Symmetric but highly kurtotic distribution Software available or/miscellaneous/marron/Matlab7Software/General/ or/miscellaneous/marron/Matlab7Software/General/