Slide 1: Statistical Methods
Chichang Jou, Tamkang University
Slide 2: Chapter Objectives
Explain methods of statistical inference in data mining
Identify different statistical parameters for assessing differences in data sets
Describe the Naïve Bayesian classifier and the logistic regression method
Introduce log-linear models using correspondence analysis of contingency tables
Discuss ANOVA and linear discriminant analysis of multidimensional samples
Slide 3: Background
Statistics is the science of collecting and organizing data and drawing conclusions from data sets
Descriptive statistics:
–Organization and description of the general characteristics of data sets
Statistical inference:
–Drawing conclusions from data
–Main focus of this chapter
Slide 4: 5.1 Statistical Inference
We are interested in arriving at conclusions about a population when it is impossible or impractical to observe the entire set of observations that make up the population
Sample (in statistics):
–Describes a finite data set of n-dimensional vectors
–Will be called a data set
Biased:
–Any sampling procedure that produces inferences that consistently overestimate or underestimate some characteristic of the population
Slide 5: 5.1 Statistical Inference
Statistical inference is the main form of reasoning relevant to data analysis
Statistical inference methods are categorized as:
–Estimation
Goal: make the expected prediction error close to 0
Regression vs. classification
–Tests of hypotheses
Null hypothesis H0: any hypothesis we wish to test
The rejection of H0 leads to the acceptance of an alternative hypothesis
Slide 6: 5.2 Assessing Differences in Data Sets
Mean
Median: better for skewed data
Mode: the value that occurs most frequently
–For unimodal frequency curves that are moderately asymmetrical, the following empirical relation is useful: mean – mode ≈ 3 × (mean – median)
Standard deviation σ (variance: σ²)
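The three central-tendency measures and the empirical relation above can be checked directly; a minimal sketch, with a small right-skewed sample invented for illustration:

```python
# Mean, median, mode, and the empirical relation
# mean - mode ≈ 3 * (mean - median) for moderately skewed data.
from statistics import mean, median, mode

data = [1, 2, 2, 2, 3, 3, 4, 5, 7, 9]   # right-skewed sample (illustrative)

m = mean(data)      # 3.8
md = median(data)   # 3.0
mo = mode(data)     # 2 (most frequent value)

print(round(m - mo, 2))        # -> 1.8  (observed mean - mode)
print(round(3 * (m - md), 2))  # -> 2.4  (empirical estimate)
```

The two printed values agree only roughly, which is the point: the relation is an empirical rule of thumb, not an identity.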
Slide 7: 5.3 Bayesian Inference
Prior distribution: the given probability distribution for the analyzed data set
Let X be a data sample whose class label is unknown, and let hypothesis H be: X belongs to a specific class C
P(H|X) = [P(X|H) · P(H)] / P(X)
See p. 97 for an example of the Naïve Bayesian classifier
P(Ci|X) = [P(X|Ci) · P(Ci)] / P(X)
–Under the class-conditional independence assumption: P(X|Ci) = P(x1|Ci) · P(x2|Ci) · … · P(xn|Ci)
In theory, the Bayesian classifier has the minimum error rate. In practice this is not always true, because of inaccuracies in the assumptions about the attributes and about class-conditional independence.
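The formulas above can be sketched as a tiny Naïve Bayesian classifier for categorical attributes: estimate P(Ci) and each P(xk|Ci) from counts, then pick the class maximizing P(X|Ci)·P(Ci). The training samples below are invented for illustration (this is not the book's p. 97 example):

```python
# Minimal Naive Bayes sketch: P(Ci|X) ∝ P(Ci) * Π P(xk|Ci),
# with all probabilities estimated by simple relative frequencies.
from collections import Counter, defaultdict

def train(samples, labels):
    n = len(labels)
    counts = Counter(labels)
    prior = {c: k / n for c, k in counts.items()}        # P(Ci)
    # cond[c][j][v] = number of class-c samples with value v in attribute j
    cond = defaultdict(lambda: defaultdict(Counter))
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            cond[c][j][v] += 1
    return prior, cond, counts

def classify(x, prior, cond, counts):
    best, best_p = None, -1.0
    for c, p in prior.items():
        for j, v in enumerate(x):
            p *= cond[c][j][v] / counts[c]               # P(xj | Ci)
        if p > best_p:
            best, best_p = c, p
    return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
y = ["no", "no", "yes", "no"]
model = train(X, y)
print(classify(("rain", "mild"), *model))   # -> yes
```

Note the sketch uses raw frequency estimates; a practical implementation would add Laplace smoothing so that an unseen attribute value does not zero out the whole product.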
Slide 8: 5.4 Predictive Regression
Common reasons for performing regression analysis:
–The output is expensive to measure
–The values of the inputs are known before the output is known, and a working prediction of the output is required
–Controlling the input values lets us predict the behavior of the corresponding outputs
–To identify the causal links between some of the inputs and the output
Slide 9: Linear Regression
Y = α + β1·X1 + β2·X2 + … + βn·Xn
Applied to each sample:
–yj = α + β1·x1j + β2·x2j + … + βn·xnj + εj, where εj is the error term
Example with one input variable (pp. 99–100):
–Y = α + βX
–Form the sum of squared errors (SSE)
–Differentiate SSE with respect to α and β, and set the derivatives to 0
–This yields the equations for α and β
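The steps above lead to the standard closed-form least-squares estimates; a minimal sketch for the one-input case, with a toy data set chosen so the fit is exact:

```python
# Simple linear regression Y = α + βX: setting the derivatives of the
# SSE with respect to α and β to zero gives
#   β = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² ,  α = ȳ - β·x̄
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)
    alpha = my - beta * mx
    return alpha, beta

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]            # exactly y = 1 + 2x
alpha, beta = fit_line(xs, ys)
print(alpha, beta)            # -> 1.0 2.0
```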
Slide 10: General Linear Model
For real-world data mining, the number of samples may be several million. Because of the greatly increased complexity of linear regression on such data, it is necessary to find modifications/approximations of the regression, or to use entirely different regression methods.
Example: polynomial regression can be modeled by adding polynomial terms to the basic linear model (p. 102)
The major effort of a user is in identifying the relevant independent variables and in selecting the regression model:
–Sequential search approach
–Combinatorial approach
Slide 11: Quality of Linear Regression
Correlation coefficient r
–r measures the strength of the linear relationship between input and output; values near ±1 indicate a good linear fit
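A minimal sketch of the correlation coefficient, computed from the usual sums of products of deviations (the same toy data as the regression example above):

```python
# Pearson correlation coefficient r as a quality measure for a
# linear fit: r = Sxy / sqrt(Sxx * Syy); |r| near 1 means a strong
# linear relationship.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))   # -> 1.0 (perfectly linear)
```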
Slide 12: 5.5 Analysis of Variance (ANOVA)
ANOVA is a method of identifying which of the β's in a linear regression model are non-zero
Residuals:
–Ri = yi – f(xi)
The variance σ² is estimated from the residuals by:
–S² = (∑ Ri²) / (n – p), where p is the number of estimated parameters
–S² allows us to compare different linear models
Only if the fitted model omits inputs that it ought to include will S² tend to be significantly larger than σ²
Slide 13: ANOVA Algorithm
First start with all inputs and compute S²
Omit inputs from the model one by one (this means forcing the corresponding βi to 0):
–If we omit a useful input, the new estimate S² will increase significantly
–If we omit a redundant input, the new estimate S² will not change much
–The comparison uses the F-ratio (example on p. 105)
Multivariate analysis of variance (MANOVA): the output is a vector, and correlation between outputs is allowed
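The omit-one-input step above can be sketched in the simplest setting: fit a one-input model, then force its β to 0 (leaving only the mean), and compare the residual sums of squares via an F-ratio. The data are invented for illustration, and this is not the book's p. 105 example:

```python
# ANOVA model-comparison sketch: full model y = α + βx vs. reduced
# model with β forced to 0 (mean only).
#   F = ((SSE_reduced - SSE_full) / 1) / (SSE_full / (n - 2))
# A large F suggests the omitted input was useful.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)
    return my - beta * mx, beta

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

alpha, beta = fit_line(xs, ys)
sse_full = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
my = sum(ys) / len(ys)
sse_reduced = sum((y - my) ** 2 for y in ys)     # β forced to 0

n = len(xs)
f_ratio = ((sse_reduced - sse_full) / 1) / (sse_full / (n - 2))
print(round(f_ratio, 2))   # -> 4.5
```

Whether 4.5 is "significant" would be judged against an F-distribution quantile for (1, n−2) degrees of freedom; the sketch only computes the ratio.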
Slide 14: 5.6 Logistic Regression
Linear regression is used to model continuous-valued functions. Generalized regression models try to apply linear-regression ideas to categorical response variables.
Logistic regression models the probability of some (YES/NO) event occurring as a linear function of a set of predictor (input) variables:
–It estimates the probability p that the dependent (output) variable will have a given value
–If p is greater than 0.5, then the prediction is closer to YES
–It supports a more general input data set by allowing both categorical and quantitative inputs
Slide 15: Logistic Regression
P(yj = 1) = pj, P(yj = 0) = 1 – pj
The linear logistic model: logit(pj) = α + β1·x1j + β2·x2j + … + βn·xnj
–The logit transformation prevents pj from going out of the [0, 1] range
Example (p. 107):
–Suppose logit(p) = 1.5 – 0.6·x1 + 0.4·x2 – 0.3·x3
–With (x1, x2, x3) = (1, 0, 1), p = 0.35
–Y = 1 is less probable than Y = 0
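The example's stated result p = 0.35 for a linear-combination value of 0.6 implies that logit here is taken as log((1−p)/p); under the other common convention, log(p/(1−p)), the same numbers would give p ≈ 0.65. A sketch under the first assumption, reproducing the example:

```python
# Reproducing the slide's example: logit(p) = 1.5 - 0.6*x1 + 0.4*x2 - 0.3*x3
# with (x1, x2, x3) = (1, 0, 1). Assuming logit(p) = log((1-p)/p),
# which matches the stated result p ≈ 0.35, inverting gives
# p = 1 / (1 + e^z).
from math import exp

def p_from_logit(z):
    return 1.0 / (1.0 + exp(z))     # inverts log((1-p)/p) = z

z = 1.5 - 0.6 * 1 + 0.4 * 0 - 0.3 * 1    # = 0.6
p = p_from_logit(z)
print(round(p, 2))    # -> 0.35, so Y=1 is less probable than Y=0
```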
Slide 16: 5.7 Log-Linear Models
Log-linear modeling is a generalized linear model where the output Yj is assumed to have a Poisson distribution with expected value μj
–It is used to analyze the relationships between categorical (or quantitative) variables
–It approximates discrete, multidimensional probability distributions
Slide 17: Log-Linear Models
–log(μj) is assumed to be a linear function of the inputs
–We need to find which β's are 0
If βi is 0, then Xi is not related to the other input variables
Correspondence analysis:
–Log-linear models where no output variable is defined
–Uses contingency tables to answer the question: is there any relationship between the attributes?
Slide 18: Correspondence Analysis
1. Transform a given contingency table into a table of expected values, computed under the assumption that the input variables are independent
2. Compare the two matrices using the squared-distance measure and the chi-square test
Examples: p. 108, p. 111
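The two steps above can be sketched on a small 2×2 contingency table; the counts are invented for illustration (this is not the book's p. 108/p. 111 example):

```python
# Step 1: expected counts under independence, E[i][j] = row_i * col_j / N.
# Step 2: chi-square statistic, Σ (O - E)² / E over all cells.
observed = [[30, 10],
            [20, 40]]

row = [sum(r) for r in observed]           # row totals: [40, 60]
col = [sum(c) for c in zip(*observed)]     # column totals: [50, 50]
total = sum(row)                           # 100

expected = [[row[i] * col[j] / total for j in range(2)] for i in range(2)]
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(expected)           # -> [[20.0, 20.0], [30.0, 30.0]]
print(round(chi2, 2))     # -> 16.67
```

A value this large, compared against the chi-square distribution with (rows−1)·(cols−1) = 1 degree of freedom, would indicate a strong relationship between the two attributes.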
Slide 19: 5.8 Linear Discriminant Analysis
Linear discriminant analysis (LDA) is for classification problems where the dependent variable is categorical (nominal or ordinal) and the independent variables are metric
LDA constructs a discriminant function that yields different scores when computed with data from different output classes
See Fig. 5.3
Slide 20: Linear Discriminant Analysis
LDA tries to find a set of weight values wi that maximizes the ratio of the between-class to the within-class variance of the discriminant score for a pre-classified set of samples. The resulting discriminant function is then used for prediction.
Cutting scores serve as the criteria against which each individual discriminant score is judged. Their choice depends on the distribution of samples in the classes:
–Let zA and zB be the mean discriminant scores of the pre-classified samples from classes A and B
–If the two classes of samples are of equal size and are uniformly distributed, the cutting score is the midpoint (zA + zB)/2
–If the two classes of samples are not of equal size, a weighted average of the mean scores is used instead
Multiple discriminant analysis extends this to more than two classes (p. 113)
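The cutting-score rules above can be sketched directly. The weighted form used here, z_cut = (nA·zB + nB·zA)/(nA + nB), is the standard one from discriminant-analysis texts; the mean scores and the test point are invented for illustration:

```python
# Cutting-score sketch for two-class LDA: given mean discriminant
# scores zA, zB of the pre-classified samples, classify a new score z
# by which side of the cutting score it falls on.
def cutting_score(z_a, z_b, n_a, n_b):
    if n_a == n_b:                                   # equal class sizes
        return (z_a + z_b) / 2                       # midpoint
    return (n_a * z_b + n_b * z_a) / (n_a + n_b)     # size-weighted form

def predict(z, z_cut, z_a, z_b):
    # assign the class whose mean score lies on the same side of z_cut
    return "A" if (z > z_cut) == (z_a > z_cut) else "B"

z_a, z_b = 2.0, 6.0          # hypothetical mean discriminant scores
z_cut = cutting_score(z_a, z_b, 10, 10)    # equal sizes -> 4.0
print(predict(5.1, z_cut, z_a, z_b))       # -> B (score above the cut)
```

Note that in the weighted form the larger class's mean gets the smaller weight, which pulls the cutting score toward the smaller class and balances the misclassification rates.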