Data Preparation for Data Mining
Prepared by: Yuenho Leung



What to Do before Data Preparation

Before the data preparation stage, you should already have:
- Understood the domain of your problem
- Planned the solutions and approaches you are going to apply
- Gathered as much data as possible, including incomplete data

Data Representation Format

The first step of data preparation is to convert the raw data into a rows-and-columns format, from sources such as:
- XML
- Access
- SQL
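
A minimal sketch of this conversion step, flattening a raw XML export into rows and columns. The XML layout and field names here are assumptions for illustration, not from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw XML export of customer records (layout is assumed).
raw = """
<customers>
  <customer id="1"><name>Alan</name><city>Elk Grove</city></customer>
  <customer id="2"><name>Tom</name><city>Sacramento</city></customer>
</customers>
"""

def xml_to_rows(xml_text):
    """Flatten an XML document into a list of column-aligned rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for cust in root.findall("customer"):
        rows.append({
            "CusID": int(cust.get("id")),
            "Name": cust.findtext("name"),
            "City": cust.findtext("city"),
        })
    return rows

rows = xml_to_rows(raw)
print(rows[0])  # {'CusID': 1, 'Name': 'Alan', 'City': 'Elk Grove'}
```

Once every record is a row with the same columns, the validation steps that follow can be applied uniformly.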

Validate Data

To validate data, you need to:
- Check each value against its data type.
- Check the range of each variable.
- Compare the values with other instances (rows).
- Check columns by their relationships.
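
The first two checks (type and range) can be sketched as a small rule table. The columns and rules below are illustrative assumptions, not a fixed API.

```python
# Each column gets an expected type and a validity predicate; offending
# rows are reported rather than silently fixed.
RULES = {
    "Zip":   (str, lambda v: len(v) == 5 and v.isdigit()),
    "Month": (int, lambda v: 1 <= v <= 12),
}

def validate(rows, rules=RULES):
    """Return (row index, column, value) for every failed check."""
    errors = []
    for i, row in enumerate(rows):
        for col, (typ, ok) in rules.items():
            if col not in row:
                continue
            v = row[col]
            if not isinstance(v, typ) or not ok(v):
                errors.append((i, col, v))
    return errors

rows = [{"Zip": "95616", "Month": 7},
        {"Zip": "9561",  "Month": 13}]   # second row fails both checks
errors = validate(rows)
print(errors)  # [(1, 'Zip', '9561'), (1, 'Month', 13)]
```

Cross-instance and cross-column checks (the last two bullets) need domain knowledge, as the next slides illustrate.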

Validate Data (cont)

To validate this table, you can check the relationship among the city name, zip code, and area code. If you get the data from a normalized database, you can skip this step.

CusID | Name | Address       | City       | Zip | Phone
1     | Alan | 1800 Bon Ave. | Elk Grove  | …   | …
2     | Tom  | 600 Bender Rd | Sacramento | …   | …
3     | Sam  | 300 Tent St   | San Jose   | …   | …

Validate Data (cont)

From this table, you can tell the third instance is wrong. Why? Because there was no small earthquake on 1975/10/20.

Date       | Time    | Latitude | Longitude | Magnitude
1975/7/10  | 00:41:… | …        | …         | …
1975/9/5   | 00:41:… | …        | …         | …
1975/10/20 | 00:41:… | …        | …         | …
1975/11/18 | 00:41:… | …        | …         | …
1975/12/30 | 00:41:… | …        | …         | …

Validate Data (cont)

Fixing individual errors in each instance is not the main purpose of data validation. The main purpose is to find the cause of the errors. If you know the cause, you might figure out the pattern of the errors and then fix all of them globally.

For example, suppose we want to mine the pattern of wind speed from data generated by 5 sensors, and we find that 20% of the speed measurements are obviously wrong. We therefore check whether the sensors are working normally. If we find that a broken sensor always displays readings 10% higher than the correct readings, we can fix those measurements globally by scaling them back down (dividing by 1.10).
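
The global fix above can be sketched as follows. The sensor IDs and readings are made up; the point is that once the cause is known, every reading from the broken sensor is corrected in one pass instead of patching rows individually.

```python
# Readings tagged by sensor; sensor "s3" is known to read 10% high.
readings = [("s1", 12.0), ("s3", 11.0), ("s3", 22.0), ("s2", 9.5)]

BIAS = {"s3": 1.10}  # broken sensor multiplies the true speed by 1.10

# Divide by the bias factor to recover the true value.
corrected = [(sid, round(v / BIAS.get(sid, 1.0), 2)) for sid, v in readings]
print(corrected)  # [('s1', 12.0), ('s3', 10.0), ('s3', 20.0), ('s2', 9.5)]
```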

Dealing with Missing and Empty Values

There is no automated technique for differentiating between missing and empty values. Example:

CusID | Name | Sandwich | Sauce
1     | Alan | Turkey   | Sweet Onion
2     | Tom  | Ham      |
3     | Sam  | Beef     | Thousand Island

You cannot tell whether Tom didn't want any sauce (empty), or the salesperson forgot to input the sauce's name (missing).

Dealing with Missing and Incorrect Values

If you know a value is incorrect or missing, you can:
- Ignore the instance that contains the value (not recommended), or
- Assign a value by a reasonable estimate, or
- Use a default value.
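
The three options can be sketched on a single numeric column. The data and the choice of the mean as the "reasonable estimate" are illustrative.

```python
# Rows with one missing age value.
rows = [{"age": 34}, {"age": None}, {"age": 41}, {"age": 45}]

known = [r["age"] for r in rows if r["age"] is not None]

# Option 1: ignore the instance (not recommended).
dropped = [r for r in rows if r["age"] is not None]

# Option 2: assign a reasonable estimate (here, the mean of known ages).
estimate = sum(known) / len(known)
filled = [{"age": r["age"] if r["age"] is not None else round(estimate)}
          for r in rows]

# Option 3 would substitute a fixed default value instead of `estimate`.
print(estimate)  # 40.0
```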

Dealing with Missing and Incorrect Values (cont)

Example of a reasonable estimate:

CusID | Name | Address       | City       | Zip | Phone
1     | Alan | 1800 Bon Ave. | Elk Grove  | …   | …
2     | Tom  | 600 Bender Rd | Sacramento | …   | …
3     | Sam  | ???           | ???        | ??? | (408) …

From the area code 408, you may guess the city is San Jose, because San Jose owns over 50% of the phone numbers with this area code.

Dealing with Missing and Incorrect Values (cont)

Example (cont): You would then guess the missing zip code to be the zip code of central San Jose.

CusID | Name | Address       | City       | Zip | Phone
1     | Alan | 1800 Bon Ave. | Elk Grove  | …   | …
2     | Tom  | 600 Bender Rd | Sacramento | …   | …
3     | Sam  | ???           | San Jose   | ??? | (408) …
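
This kind of estimate can be sketched as mode imputation: fill a missing city with the most common city among records sharing the same area code. The records below are illustrative.

```python
from collections import Counter

rows = [{"area": "408", "city": "San Jose"},
        {"area": "408", "city": "San Jose"},
        {"area": "408", "city": "Campbell"},
        {"area": "408", "city": None}]

def impute_city(rows):
    """Fill missing cities with the majority city for the same area code."""
    for row in rows:
        if row["city"] is None:
            peers = [r["city"] for r in rows
                     if r["area"] == row["area"] and r["city"] is not None]
            row["city"] = Counter(peers).most_common(1)[0][0]
    return rows

print(impute_city(rows)[-1]["city"])  # San Jose
```

The same idea extends to the zip code, using the city as the grouping key.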

Reduce No. of Variables

More variables generate more relationships, and more data points are required. We are not interested only in the pattern of each variable; we are interested in the pattern of relationships among variables. With 10 variables, the 1st variable has to be compared with 9 neighbors, the 2nd with 8, and so on. The result is 9 × 8 × 7 × 6 × …, which is 362,880 relationships. With 13 variables, it is nearly 480 million relationships; with 15 variables, nearly 90 billion. Therefore, when preparing data sets, try to minimize the number of variables.
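
Under the slide's counting scheme (the first variable compared with n − 1 others, the next with n − 2, and so on, multiplied together), the count is (n − 1)!, which makes the explosive growth easy to verify:

```python
import math

def relationships(n_variables):
    """Relationship count per the slide's scheme: (n - 1)!."""
    return math.factorial(n_variables - 1)

print(relationships(10))  # 362880
print(relationships(13))  # 479001600  (~480 million)
print(relationships(15))  # 87178291200  (~87 billion)
```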

Reduce No. of Variables (cont)

There are no general strategies for reducing the number of variables. Before selecting variable sets, you must fully understand the role of each variable in the model.

Define Variable Range

Correct range – a variable range that contains only correct values. Example: the correct range of month is 1 – 12. Any data not in this range must be either repaired or removed from the dataset.

Project-required range – the variable range we want to analyze according to the project statement. Example: for summer sales, the project-required range for month is 7 – 9. Our goal is to find the pattern of data in this range. However, data not in this range may still be required by the model.
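
The two ranges can be sketched as two successive filters over the month column; the sample values are made up:

```python
# Raw month values; 14 is outside the correct range 1-12.
months = [7, 8, 2, 9, 14, 7]

# Step 1: enforce the correct range (repair or remove; here we remove).
clean = [m for m in months if 1 <= m <= 12]

# Step 2: select the project-required range (summer sales, months 7-9).
summer = [m for m in clean if 7 <= m <= 9]

print(summer)  # [7, 8, 9, 7]
```

Note that `clean` (not `summer`) is what the model may still need, since out-of-project data can carry context.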

Define Variable Range (cont)

In the following table, 'B' stands for business (Sam is a company's name). 'G' is outside the correct range; however, the data miner guesses it stands for "girl," so he replaces 'G' with 'F'. If he wants to mine people's shopping behavior, the input will be 'M' and 'F'.

CusID | Name  | Address        | City       | Zip | Phone | Gender
1     | April | 1800 Bon Ave.  | Elk Grove  | …   | …     | G
2     | Tom   | 600 Bender Rd  | Sacramento | …   | …     | M
3     | Sam   | 200 Tend St    | San Jose   | …   | …     | B
4     | May   | 237 Hello Blvd | San Jose   | …   | …     | F

Define Variable Range (cont)

Example of a variable range: You want to mine the shopping behavior of customers younger than 40 years old. In the age column, you find customers between 20 and 150 years old, so you select all records with ages between 20 and 40 as your input.

This example is wrong. Nobody in the world is over 130 years old, so you can conclude that the records with ages above 130 are wrong. However, your input should also contain the records with ages between 40 and 130. Why? Because the density and distribution of these ages directly relate to the records with ages below 40.

Conclusion: your input should be ages 20 – 130.
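
The distinction can be sketched in two lines: invalid ages (above 130) are dropped, but ages 40 – 130 stay in the model input even though the project targets under-40s. The ages below are made up.

```python
ages = [25, 37, 62, 150, 41, 29, 118]

# Correct range: drop impossible values, keep everything else.
model_input = [a for a in ages if 20 <= a <= 130]

# Project focus: the under-40 slice, analyzed against the full input.
target = [a for a in model_input if a < 40]

print(model_input)  # [25, 37, 62, 41, 29, 118]
print(target)       # [25, 37, 29]
```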

Choose a Sample

Data miners do not always use the entire data collection. Instead, they usually choose a sample set randomly to speed up the mining process. The sample size you pick should depend on:
- Number of records available
- Distribution and density of the data
- Number of variables
- Project-required ranges of the variables
- And more…

This sounds difficult, but there are strategies for building a sample dataset…

Choose a Sample (cont)

A strategy for building a sample dataset:
1. Select 10 instances randomly and put them into your sample set.
2. Create a distribution curve representing the sample set.
3. Add another 10 random instances to your sample set.
4. Create a distribution curve representing the new sample set.
5. Compare the new curve with the previous curve. Do they look almost the same? If no, go back to step 3. If yes, stop; that is your sample set.

A sample distribution curve is shown on the next slide.
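
The loop above can be sketched with a simple histogram standing in for the distribution curve. The population, bin count, and stability tolerance are all illustrative choices.

```python
import random

random.seed(0)
population = [random.gauss(50, 10) for _ in range(5000)]

def histogram(values, bins=10, lo=0, hi=100):
    """Normalized bin proportions: a crude stand-in for the curve."""
    counts = [0] * bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
        counts[idx] += 1
    return [c / len(values) for c in counts]

def choose_sample(pop, step=10, tol=0.02):
    """Grow the sample 10 rows at a time until the histogram stabilizes."""
    sample = random.sample(pop, step)          # step 1
    prev = histogram(sample)                   # step 2
    while True:
        sample += random.sample(pop, step)     # step 3
        curr = histogram(sample)               # step 4
        if max(abs(a - b) for a, b in zip(prev, curr)) < tol:  # step 5
            return sample
        prev = curr

sample = choose_sample(population)
print(len(sample) < len(population))  # a small, stable sample suffices
```

A real implementation would compare distributions more carefully (for example with a goodness-of-fit test), but the stopping logic is the same.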

Choose a Sample (cont)

The solid line represents the current sample set. The dotted line represents the previous sample set. Do they look alike?

Thank you for your attention!

Reference: Pyle, Dorian. Data Preparation for Data Mining. Morgan Kaufmann, 1999.