Data Survey: Chapters 11.5-11.9 of Data Preparation for Data Mining by Dorian Pyle
Martti Kesäniemi

Surveying the data
The goal
– to find the problem areas in the data, so that the mining can be planned optimally
Main tools
– Cluster analysis
– Distribution analysis
– Confidence analysis
– Entropy analysis
– Analysis of sparsity and variability

Sampling Bias
Sampling bias is one of the most common sources of error in data analysis.
Sampling bias arises when
– data points that should be included are left out of the analysis (omission)
– data points that should be excluded are taken into the analysis (commission).
Analysis of the clusters and variable distributions reveals the possible problems.

Cluster Analysis
The states of the system can be studied by clustering the data.
Clustering may help to detect possible problems in the data.

Clusters represent the likely system states
– Finding an explanation for the data clusters helps to understand the data.
Clusters may also reveal a sampling bias
– Clusters can be created by an omission or a commission error.

In general, the input clusters should map to the output clusters
– If knowing the input cluster doesn’t help in predicting the output cluster, problems are to be expected.
Knowing the possible strict dependencies between the input and output clusters allows the miner to focus on the more problematic areas of the data.
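The mapping check between input and output clusters could be sketched roughly as follows. This is only an illustration, not a method from the book: the function name `cluster_mapping` and the cluster labels are hypothetical, and the labels are assumed to come from whatever clustering step was run beforehand. For each input cluster it reports the most common output cluster and the share of members that fall into it; a share close to 1 suggests a strict dependency, a low share suggests the input cluster says little about the output.

```python
from collections import Counter, defaultdict

def cluster_mapping(input_labels, output_labels):
    """For each input cluster, return (most common output cluster, share of members in it)."""
    by_input = defaultdict(Counter)
    for i, o in zip(input_labels, output_labels):
        by_input[i][o] += 1
    mapping = {}
    for i, counts in by_input.items():
        top_output, top_count = counts.most_common(1)[0]
        mapping[i] = (top_output, top_count / sum(counts.values()))
    return mapping

# Hypothetical cluster labels for ten records.
input_clusters = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
output_clusters = ["x", "x", "x", "x", "y", "y", "y", "x", "z", "x"]

mapping = cluster_mapping(input_clusters, output_clusters)
for cluster, (target, share) in sorted(mapping.items()):
    print(f"input {cluster} -> output {target} (share {share:.2f})")
```

In this made-up data, input cluster A maps strictly to one output cluster, while B is spread over several, so B would be the area to look at more closely.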

Distribution Analysis
In general, if the data is unbiased, the shape of the distribution of the output variables should remain the same across different input variable values.
– Changing the input value changes the output value, but not the behavior of the system.

An example
– When estimating the number of potential restaurant customers among a concert hall audience by analyzing the dependence between the number of customers in the restaurant and the number of concert tickets sold, full-house hours may bias the results, because some of the potential customers can’t be served.
– This may be diagnosed as an omission (some potential customers are left out of the data) or as a commission (full-house hours should be left out of the analysis).
One explanation would be that a variable describing the number of vacant tables is missing.

Sampling bias may be observed as a change in the distribution of the dependent (output) variables
– When the number of concert tickets sold is high, the skewness of the distribution of the number of customers in the restaurant changes.
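One possible way to look for such a change is to split the records by input value and compare the skewness of the output within each group. The sketch below assumes made-up numbers throughout: the seat capacity, the ticket and demand figures, and the bin edges are all hypothetical, and moment-based skewness is just one of several shape measures that could be used.

```python
from statistics import fmean

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (0 for a symmetric sample)."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def skewness_by_bin(pairs, edges):
    """Group (input, output) pairs into bins [edges[k], edges[k+1]) and return output skewness per bin."""
    result = {}
    for lo, hi in zip(edges, edges[1:]):
        outputs = [y for x, y in pairs if lo <= x < hi]
        if len(outputs) >= 3:  # skip bins with too few records to say anything
            result[(lo, hi)] = skewness(outputs)
    return result

CAPACITY = 50  # hypothetical number of restaurant seats

# Hypothetical (tickets sold, true restaurant demand) records.
demand = [(200, 18), (220, 20), (240, 22), (260, 20), (280, 18), (290, 22),
          (800, 44), (820, 48), (840, 52), (860, 56), (880, 60), (890, 48)]

# What the data actually records: demand above capacity cannot be served.
observed = [(tickets, min(d, CAPACITY)) for tickets, d in demand]

report = skewness_by_bin(observed, [0, 500, 1000])
for (lo, hi), value in report.items():
    print(f"tickets {lo}-{hi}: skewness {value:+.2f}")
```

With these invented figures the low-ticket group is symmetric, while the high-ticket group becomes negatively skewed because the capacity limit cuts off the upper tail.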

Basic Data Survey Procedure
Estimate how well the data represents and covers the true population
Analyze the entropy of and between the variables
Try to explain the clusters
– Check the mapping between input and output clusters.
Check sparsity and uncertainty
Check variable distributions
– Try to explain the possible changes in the distributions.
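For the entropy step, a minimal sketch using Shannon entropy and mutual information on discrete variables might look like this. It is an assumption that the variables are already discrete (continuous variables would need binning first), and the variable names and values are invented for illustration; the slides do not say which entropy measures the book uses.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical discretized variables.
load = ["low", "low", "high", "high"]
alarm = ["no", "no", "yes", "yes"]    # fully determined by load
shift = ["day", "night", "day", "night"]  # unrelated to load

h_load = entropy(load)
mi_load_alarm = mutual_information(load, alarm)
mi_load_shift = mutual_information(load, shift)
print(h_load, mi_load_alarm, mi_load_shift)
```

A variable with entropy near zero carries little information on its own, and an input whose mutual information with the output is near zero is unlikely to help in predicting it.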

Additional Methods
Novelty detection
– mainly used when exploiting the mining results
– estimates the probability that a certain input is drawn from the same population as the training data
Tensegrity structures
Fractals (used as manifolds)
Chaotic attractors
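As a rough illustration of the idea behind novelty detection, the sketch below scores a new input by its largest per-variable z-score against the training data. This is a deliberately simple stand-in: it yields a distance-like score rather than a probability, it ignores correlations between variables, and the function names, data, and threshold are all hypothetical.

```python
from statistics import fmean, pstdev

def fit_reference(rows):
    """Per-column mean and standard deviation of the training rows."""
    columns = list(zip(*rows))
    return [(fmean(col), pstdev(col)) for col in columns]

def novelty_score(reference, row):
    """Largest absolute z-score across columns; larger means less like the training data."""
    scores = []
    for (mean, std), value in zip(reference, row):
        if std == 0:
            # A constant training column: any deviation at all is novel.
            scores.append(0.0 if value == mean else float("inf"))
        else:
            scores.append(abs(value - mean) / std)
    return max(scores)

# Hypothetical training data with two numeric variables.
training = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (2.0, 9.0), (2.0, 13.0)]
reference = fit_reference(training)

THRESHOLD = 3.0  # hypothetical cut-off for flagging an input as novel

familiar = novelty_score(reference, (2.0, 11.0))
novel = novelty_score(reference, (9.0, 11.0))
print(familiar > THRESHOLD, novel > THRESHOLD)
```

An input flagged this way lies outside the range the model was trained on, so predictions made for it deserve less trust.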