Statistical Methods Chichang Jou Tamkang University.

Statistical Methods Chichang Jou Tamkang University

2 Chapter Objectives
Explain methods of statistical inference in data mining
Identify different statistical parameters for assessing differences in data sets
Describe the Naïve Bayesian Classifier and the logistic regression method
Introduce log-linear models using correspondence analysis of contingency tables
Discuss ANOVA and linear discriminant analysis of multidimensional samples

3 Background
Statistics collects and organizes data and draws conclusions from data sets
Descriptive Statistics:
– Organization and description of the general characteristics of data sets
Statistical Inference:
– Draws conclusions from data
– Main focus of this chapter

4 5.1 Statistical Inference
We are interested in arriving at conclusions concerning a population when it is impossible or impractical to observe the entire set of observations that make up the population
Sample in Statistics
– Describes a finite data set of n-dimensional vectors
– Will be called data set
Biased
– Any sampling procedure that produces inferences that consistently overestimate or underestimate some characteristics of the population

5 5.1 Statistical Inference
Statistical Inference is the main form of reasoning relevant to data analysis
Statistical Inference methods are categorized as
– Estimation:
  – Goal: make the expected prediction error close to 0
  – Regression vs. classification
– Tests of hypothesis
  – Null hypothesis $H_0$: any hypothesis we wish to test
  – The rejection of $H_0$ leads to the acceptance of an alternative hypothesis

6 5.2 Assessing Difference in data sets
Mean
Median: better for skewed data
Mode: the value that occurs most frequently
– For unimodal frequency curves that are moderately asymmetrical, the following empirical relation is useful: mean – mode = 3 × (mean – median)
Standard deviation $\sigma$ (variance: $\sigma^2$)
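The measures on this slide can be computed with Python's standard `statistics` module. The sketch below uses a small made-up sample (the numbers are purely illustrative) and also evaluates both sides of the empirical relation, which is only an approximation and is not expected to hold exactly.

```python
import statistics

# Hypothetical, slightly right-skewed sample (illustrative values only).
data = [2, 3, 3, 4, 5, 7, 11]

mean = statistics.mean(data)          # arithmetic mean
median = statistics.median(data)      # middle value, robust to skew
mode = statistics.mode(data)          # most frequent value
variance = statistics.pvariance(data) # population variance, sigma^2
std = statistics.pstdev(data)         # population standard deviation, sigma

# Both sides of the empirical relation: mean - mode vs. 3 * (mean - median).
lhs = mean - mode
rhs = 3 * (mean - median)

print(mean, median, mode, std)
print(lhs, rhs)
```

Because the sample is skewed to the right, the mean lies above the median, which is why the median is often the better summary for such data.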

7 5.3 Bayesian Inference
Prior distribution: given probability distribution for the analyzed data set
Let X be a data sample whose class label is unknown. Hypothesis H: X belongs to a specific class C.
$P(H \mid X) = \dfrac{P(X \mid H) \cdot P(H)}{P(X)}$
See p.97 for an example of the Naïve Bayesian Classifier
$P(C_i \mid X) = \dfrac{P(X \mid C_i) \cdot P(C_i)}{P(X)}$
$P(X \mid C_i) = \prod_{t} P(x_t \mid C_i)$, where the $x_t$ are the attribute values of X, assumed conditionally independent given the class
The Bayesian classifier has the minimum error rate in theory. In practice, this is not always true because of inaccuracies in the assumptions of attribute and class-conditional independence.
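A minimal sketch of a naïve Bayesian classifier for categorical attributes may help make the formulas concrete. It is not the book's example from p.97: the training data, the function names, and the choice to skip smoothing are all assumptions made for illustration. Since $P(X)$ is the same for every class, the sketch compares only the numerators $P(X \mid C_i) \cdot P(C_i)$.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(class, attribute index)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for t, v in enumerate(x):
            value_counts[(c, t)][v] += 1
    return class_counts, value_counts

def posterior_scores(x, class_counts, value_counts):
    """Return P(X | C_i) * P(C_i) for every class (unnormalized posteriors)."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                          # prior P(C_i)
        for t, v in enumerate(x):
            score *= value_counts[(c, t)][v] / n_c   # P(x_t | C_i), no smoothing
        scores[c] = score
    return scores

def classify(x, class_counts, value_counts):
    """Pick the class with the largest unnormalized posterior."""
    scores = posterior_scores(x, class_counts, value_counts)
    return max(scores, key=scores.get)

# Hypothetical training set: two binary attributes, two classes.
samples = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 1), (0, 1)]
labels = ["A", "A", "B", "B", "A", "B"]

model = train_naive_bayes(samples, labels)
scores = posterior_scores((1, 1), *model)
predicted = classify((1, 1), *model)
print(scores, predicted)
```

Without smoothing, a single attribute value never seen in a class drives that class's score to zero, which is one practical reason the theoretical minimum error rate is not always reached.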

8 5.4 Predictive Regression
Common reasons for performing regression analysis
– The output is expensive to measure
– The values of the inputs are known before the output is known, and a working prediction of the output is required
– Controlling the input values to predict the behavior of corresponding outputs
– To identify the causal link between some of the inputs and the output

9 Linear regression
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$
Applied to each sample
– $y_j = \alpha + \beta_1 x_{1j} + \beta_2 x_{2j} + \dots + \beta_n x_{nj} + \varepsilon_j$
Example with one input variable (p.99 – p.100)
– $Y = \alpha + \beta X$
– The sum of squares of errors (SSE)
– Differentiate SSE w.r.t. $\alpha$ and $\beta$, and set the derivatives to 0
– Equations for $\alpha$ and $\beta$
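For the one-input case, setting the derivatives of SSE to zero gives the standard least-squares solution $\beta = \sum_j (x_j - \bar{x})(y_j - \bar{y}) / \sum_j (x_j - \bar{x})^2$ and $\alpha = \bar{y} - \beta \bar{x}$. The sketch below applies these formulas to made-up data; the function names and the sample values are assumptions for illustration, not the example from p.99 – p.100.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates of alpha and beta for Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def sum_squared_errors(xs, ys, alpha, beta):
    """SSE: sum of squared differences between observed and fitted outputs."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

alpha, beta = fit_simple_regression(xs, ys)
sse = sum_squared_errors(xs, ys, alpha, beta)
print(alpha, beta, sse)
```

With noise-free data the fit recovers the line exactly and the SSE is zero; with real data the same formulas give the line that minimizes SSE.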

10 For real-world data mining, the number of samples may be several million. Because the computational cost of linear regression grows rapidly with the size of the data, it is necessary to find modifications/approximations in the regression, or to use totally different regression methods.
Example: Polynomial regression can be modeled by adding polynomial terms to the basic linear model. (p. 102)
The major effort of a user is in identifying the relevant independent variables and in selecting the regression model.
– Sequential search approach
– Combinatorial approach
General Linear Model

11 Quality of linear regression
Correlation coefficient r
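The slide's formula for r did not survive in the transcript. The sketch below assumes the usual Pearson correlation coefficient, $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, and evaluates it on made-up data; the function name and values are illustrative.

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical samples: one noisy relationship and one exact line.
xs = [1, 2, 3, 4, 5]
r_noisy = correlation_coefficient(xs, [2, 4, 5, 4, 5])
r_exact = correlation_coefficient(xs, [3, 5, 7, 9, 11])
print(r_noisy, r_exact)
```

Values of r near 1 or -1 indicate that a straight line describes the data well, while values near 0 indicate that it does not.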

Analysis of Variance (ANOVA)
ANOVA is a method of identifying which of the $\beta$'s in a linear regression model are non-zero.
Residuals:
– $R_i = y_i - f(x_i)$
The variance $\sigma^2$ is estimated from the residuals by a statistic $S^2$
– $S^2$ allows us to compare different linear models
Only if the fitted model does not include inputs that it ought to will $S^2$ tend to be significantly larger than $\sigma^2$

13 ANOVA algorithm
First start with all inputs and compute $S^2$
Omit inputs from the model one by one (this means forcing the corresponding $\beta_i$ to 0)
– If we omit a useful input, the new estimate $S^2$ will significantly increase
– If we omit a redundant input, the new estimate $S^2$ will not change much
– F-ratio (example on p.105)
Multivariate analysis: The output is a vector. Allow correlation between outputs. (MANOVA)
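The omit-one-input procedure can be sketched in plain Python. Several things here are assumptions rather than the book's method: the data are made up so that the output depends on the first input only, the model is fitted through the normal equations, and $S^2$ is taken as SSE divided by the number of samples minus the number of fitted coefficients (the slide's own formula is not in the transcript). The printed ratio is a simplified comparison of variance estimates, not the exact F-ratio of the example on p.105.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def residual_variance(rows, ys, used):
    """Fit y = alpha + sum(beta_i * x_i) over the inputs in `used`; return S^2."""
    design = [[1.0] + [row[i] for i in used] for row in rows]
    k = len(design[0])  # number of fitted coefficients, including alpha
    # Normal equations: (D^T D) coef = D^T y
    dtd = [[sum(d[p] * d[q] for d in design) for q in range(k)] for p in range(k)]
    dty = [sum(d[p] * y for d, y in zip(design, ys)) for p in range(k)]
    coef = solve_linear_system(dtd, dty)
    sse = sum((y - sum(c * v for c, v in zip(coef, d))) ** 2
              for d, y in zip(design, ys))
    return sse / (len(ys) - k)  # assumed estimator: SSE / (m - k)

# Hypothetical data: y depends on input 0 only; input 1 is redundant.
rows = [(1, 5), (2, 1), (3, 4), (4, 2), (5, 3), (6, 5), (7, 1), (8, 4)]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
ys = [1 + 2 * row[0] + e for row, e in zip(rows, noise)]

s2_full = residual_variance(rows, ys, [0, 1])
s2_without = {
    i: residual_variance(rows, ys, [j for j in (0, 1) if j != i])
    for i in (0, 1)
}
for i, s2 in s2_without.items():
    print(f"omit input {i}: S^2 = {s2:.4f}, ratio to full model = {s2 / s2_full:.1f}")
```

Omitting the useful input inflates $S^2$ by orders of magnitude, while omitting the redundant one leaves it essentially unchanged, which is the behavior the two sub-bullets describe.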

Logistic Regression
Linear regression is used to model continuous-valued functions. Generalized regression models try to apply linear regression to model categorical response variables.
Logistic regression models the probability of some (YES/NO) event occurring through a linear function of a set of predictor (input) variables.
– It tries to estimate the probability p that the dependent (output) variable will have a given value.
– If p is greater than 0.5, then the prediction is closer to YES
– It supports a more general input data set by allowing both categorical and quantitative inputs

15 Logistic Regression
$P(y_j = 1) = p_j$, $P(y_j = 0) = 1 - p_j$
The linear logistic model
– $\log\left(\dfrac{p_j}{1 - p_j}\right) = \alpha + \beta_1 x_{1j} + \beta_2 x_{2j} + \dots + \beta_n x_{nj}$
– This is to prevent $p_j$ from going out of range
Example (p. 107)
– Suppose logit(p) is a given linear function of $x_1$, $x_2$, $x_3$ (coefficients on p. 107)
– With $(x_1, x_2, x_3) = (1, 0, 1)$
– p = 0.35
– Y = 1 is less probable than Y = 0
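A short sketch shows how a logit value is turned back into a probability, using $p = 1 / (1 + e^{-z})$ with $z = \text{logit}(p)$. The coefficients below are hypothetical, chosen only for illustration; they are not the ones from the example on p. 107, which are not legible in the transcript.

```python
import math

def logistic_probability(alpha, betas, xs):
    """Invert the logit: p = 1 / (1 + exp(-z)), z = alpha + sum(beta_i * x_i)."""
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (assumed for illustration only).
alpha = -1.0
betas = (0.5, 0.3, -0.1)

p = logistic_probability(alpha, betas, (1, 0, 1))
prediction = "YES" if p > 0.5 else "NO"
print(round(p, 4), prediction)

# However extreme the linear score, p stays strictly between 0 and 1.
p_low = logistic_probability(-30.0, betas, (0, 0, 0))
p_high = logistic_probability(30.0, betas, (0, 0, 0))
print(p_low, p_high)
```

Because the linear score is mapped through the logistic function, the estimated probability can never leave the interval (0, 1), which is the point of modeling the logit rather than p directly.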

Log-Linear Models
Log-linear modeling is a generalized linear model where the output $Y_j$ is assumed to have a Poisson distribution, with expected value $\mu_j$
– It is used to analyze the relationship between categorical (or quantitative) variables
– It approximates discrete, multi-dimensional probability distributions

17 Log-Linear Models
– $\log(\mu_j)$ is assumed to be a linear function of the inputs
– We need to find which $\beta$'s are 0
  – If $\beta_i$ is 0, then $X_i$ is not related to other input variables
Correspondence Analysis:
– Log-linear models when no output variable is defined
– Use contingency tables to answer the question: is there any relationship between the attributes?

18 Correspondence Analysis
1. Transform a given contingency table into a table with expected values, under the assumption that the input variables are independent
2. Compare these two tables using the squared distance measure and the chi-square test
Example: p. 108, p. 111
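The two steps can be sketched as follows. Under independence, the expected count of a cell is (row total × column total) / grand total, and the chi-square statistic sums (observed − expected)² / expected over all cells. The 2 × 2 table is made up for illustration and is not the example from p. 108 or p. 111.

```python
def expected_table(observed):
    """Expected counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_table(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical 2 x 2 contingency table of counts.
observed = [[30, 10],
            [20, 40]]

expected = expected_table(observed)
statistic = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected, statistic, degrees_of_freedom)
```

For one degree of freedom the 5% critical value of the chi-square distribution is about 3.84, so a statistic this large would lead to rejecting the hypothesis that the two attributes are independent.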

Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is for classification problems where the dependent variable is categorical (nominal or ordinal) and the independent variables are metric
LDA constructs a discriminant function that yields different scores when computed with data from different output classes
Fig. 5.3

20 Linear Discriminant Analysis
LDA tries to find a set of weight values $w_i$ that maximizes the ratio of the between-class to the within-class variance of the discriminant score for a pre-classified set of samples. The resulting function is then used to predict the class of new samples.
Cutting scores serve as the criteria against which each individual discriminant score is judged. Their choice depends on the distribution of samples in classes.
– Let $z_A$ and $z_B$ be the mean discriminant scores of pre-classified samples from classes A and B.
– If the two classes of samples are of equal size and are uniformly distributed
– If the two classes of samples are not of equal size
– Multiple discriminant analysis (p. 113)
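The slide's cutting-score formulas are missing from the transcript, so the sketch below rests on assumptions: for equal class sizes it uses the midpoint $(z_A + z_B)/2$, and for unequal sizes a size-weighted rule found in common textbook treatments, $(n_A z_B + n_B z_A)/(n_A + n_B)$. Function names and numbers are illustrative.

```python
def cutting_score(z_a, z_b, n_a=None, n_b=None):
    """Cutting score between two class means (assumed textbook rules).

    Equal or unspecified class sizes: midpoint of the two mean scores.
    Unequal sizes: size-weighted score, which shifts the cut toward the
    mean of the smaller class.
    """
    if n_a is None or n_b is None or n_a == n_b:
        return (z_a + z_b) / 2
    return (n_a * z_b + n_b * z_a) / (n_a + n_b)

def classify(score, z_a, z_b, cut):
    """Assign a discriminant score to the class on the same side of the cut."""
    if z_a < z_b:
        return "A" if score < cut else "B"
    return "A" if score > cut else "B"

# Hypothetical mean discriminant scores for classes A and B.
z_a, z_b = -1.0, 2.0

cut_equal = cutting_score(z_a, z_b)
cut_unequal = cutting_score(z_a, z_b, n_a=30, n_b=10)
print(cut_equal, cut_unequal)
print(classify(1.0, z_a, z_b, cut_equal), classify(1.0, z_a, z_b, cut_unequal))
```

With class A three times as large as class B, the cut moves from 0.5 to 1.25, so a sample scoring 1.0 switches from B to A; this reflects the slide's remark that the choice of cutting score depends on the distribution of samples in the classes.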