1
Data Mining Waqas Haider Bangyal
2
Data Preparation and Preprocessing
3
The extraction of knowledge from data is called data mining.
Data mining is the process of sorting through large amounts of data and picking out relevant information.
4
Database Processing vs. Data Mining Processing
* Database processing
  - Query: well defined, expressed in SQL
  - Output: precise, a subset of the database
* Data mining processing
  - Query: poorly defined, no precise query language
  - Output: fuzzy, not a subset of the database
5
Query Examples
* Databases
  - Find all credit applicants with the last name Smith.
  - Identify customers who have purchased more than $10,000 in the last month.
  - Find all customers who have purchased milk.
* Data mining
  - Find all credit applicants who are poor credit risks. (classification)
  - Identify customers with similar buying habits. (clustering)
  - Find all items that are frequently purchased with milk. (association rules)
6
Data Types and Forms
* A collection of data objects and their attributes is called data.
* An attribute is a property or characteristic of an entity.
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature.
* A collection of attributes describes an object.
  - An object is also known as a record, point, case, sample, entity, or instance.
7
Data Types and Forms
* Data types
  - numeric, categorical (see the hierarchy on the next slide for their relationship)
  - static, dynamic (temporal)
* Other data forms
  - distributed data
  - text, Web, metadata
  - images, audio/video
8
Data Types and Forms
9
TYPES OF ATTRIBUTES
There are different types of attributes:
* Nominal
  - Examples: ID numbers, eye color, zip codes
* Ordinal
  - Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
* Interval
  - Examples: calendar dates, temperatures in Celsius or Fahrenheit
* Ratio
  - Examples: temperature in Kelvin, length, time, counts
10
DISCRETE AND CONTINUOUS ATTRIBUTES
* Discrete attribute
  - Has only a finite or countably infinite set of values
  - Examples: zip codes, counts, or the set of words in a collection of documents
  - Often represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
* Continuous attribute
  - Has real numbers as attribute values
  - Examples: temperature, height, or weight
  - In practice, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables
11
Data Preparation
An important and time-consuming task in KDD
* Huge size (volume) of data
* Missing data
* Outliers
* Erroneous data (inconsistent, mis-recorded, distorted)
* Raw data
12
Data Preprocessing
* Why preprocess the data?
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary
13
Why Data Preprocessing?
* Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
14
Data Understanding: Relevance
* What data is available for the task?
* Is this data relevant?
* Is additional relevant data available?
* How much historical data is available?
* Who is the data expert?
15
Data Understanding: Quantity
* Number of instances (records, objects)
  - Rule of thumb: 5,000 or more desired
  - If fewer, results are less reliable; use special methods (boosting)
* Number of attributes (fields)
  - Rule of thumb: for each attribute, 10 or more instances
  - If there are more fields, use feature reduction and selection
* Number of targets
  - Rule of thumb: more than 100 for each class
  - If very unbalanced, use stratified sampling
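The stratified-sampling advice for unbalanced targets can be sketched as below. This is only one possible reading, assuming the goal is to draw the same number of records from every class; the function name `stratified_sample`, the `per_class` parameter, and the toy records are illustrative assumptions, not part of the slides.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, per_class, seed=0):
    """Draw up to `per_class` records from each class (equal allocation)."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[label_of(record)].append(record)
    sample = []
    for group in groups.values():
        k = min(per_class, len(group))  # a rare class may have fewer records
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical, very unbalanced data: 5 "rare" records and 95 "common" ones.
records = [(i, "rare") for i in range(5)] + [(i, "common") for i in range(95)]
picked = stratified_sample(records, label_of=lambda r: r[1], per_class=5)
counts = {label: sum(1 for r in picked if r[1] == label) for label in ("rare", "common")}
print(counts)
```

Other allocations (for example, proportional to class size) are equally valid forms of stratified sampling; equal allocation is used here only because it rebalances the classes.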
16
Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
* Accuracy: correct or wrong, accurate or not
* Completeness: not recorded, unavailable, …
* Consistency: some modified but some not, dangling, …
* Timeliness: is the data updated in a timely way?
* Believability: how far can the data be trusted to be correct?
* Interpretability: how easily can the data be understood?
17
Major Tasks in Data Preprocessing
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction but with particular importance, especially for numerical data
18
Forms of Data Preprocessing
19
Data Cleaning
Data cleaning tasks:
* Data acquisition and metadata
* Fill in missing values
* Unified date format
* Converting nominal to numeric
* Identify outliers and smooth out noisy data
* Correct inconsistent data
20
Data Cleaning: Acquisition
* Data can be in a DBMS
  - ODBC, JDBC protocols
* Data in a flat file
  - Fixed-column format
  - Delimited format: tab, comma “,”, other
  - E.g. C4.5 and Weka “arff” use comma-delimited data
  - Attention: convert field delimiters inside strings
* Verify the number of fields before and after
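The warning about delimiters inside strings, and the advice to verify the number of fields, can be illustrated with a small check. This is a minimal sketch using Python's standard `csv` module; the helper `field_count_report` and the sample text are hypothetical.

```python
import csv
import io

def field_count_report(text, delimiter=",", expected=None):
    """Return (expected_count, bad_rows) for delimited text.

    bad_rows lists (line_number, field_count) for every row whose field
    count differs from the expected one (the first row's by default).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return 0, []
    if expected is None:
        expected = len(rows[0])
    bad = [(i, len(r)) for i, r in enumerate(rows, start=1) if len(r) != expected]
    return expected, bad

# Hypothetical sample: row 2 has a comma inside a quoted string,
# row 3 really does have one field too many.
sample = 'id,name,city\n1,"Smith, John",Boston\n2,Lee,Austin,extra\n'

# A naive split miscounts the quoted row; a quote-aware parser does not.
naive_counts = [len(line.split(",")) for line in sample.splitlines()]
expected, bad = field_count_report(sample)
print(naive_counts)   # [3, 4, 4]
print(expected, bad)  # 3 [(3, 4)]
```

Running the same report before and after a format conversion is one way to confirm that no fields were split or merged along the way.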
21
Data Cleaning: Example
22
Data Cleaning: Reformatting
* Convert data to a standard format (e.g. arff or csv)
* Missing values
* Unified date format
* Binning of numeric data
* Fix errors and outliers
* Convert nominal fields whose values have order to numeric
23
Missing Data
* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred
24
How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
* Fill in the missing value manually: tedious, and possibly infeasible.
* Use a global constant to fill in the missing value: e.g., “unknown”, which may act as a new class.
* Imputation: use the attribute mean to fill in the missing value, or, smarter, use the attribute mean of all samples belonging to the same class.
* Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree.
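The two mean-based imputation options can be sketched as follows. This is a minimal illustration, assuming missing values are represented as `None`; the function names and the income and risk data are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_class_mean(values, labels):
    """Replace None with the mean of observed values that share the class label.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical incomes (in thousands) with a credit-risk class label.
income = [30, None, 50, 90, None, 110]
risk = ["poor", "poor", "poor", "good", "good", "good"]

overall = impute_mean(income)
per_class = impute_class_mean(income, risk)
print(overall)    # [30, 70, 50, 90, 70, 110]
print(per_class)  # [30, 40, 50, 90, 100, 110]
```

The class-conditional version fills each gap with a value typical of that record's own class, which is why the slide calls it the smarter option.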
25
Data Missing Example
Position  Original Sample  Position 11 Missing  Preserve Mean  Preserve Variance
1         0.0886           0.0886               0.0886         0.0886
2         0.0684           0.0684               0.0684         0.0684
3         0.3515           0.3515               0.3515         0.3515
4         0.9874           0.9874               0.9874         0.9874
5         0.4713           0.4713               0.4713         0.4713
6         0.6115           0.6115               0.6115         0.6115
7         0.2573           0.2573               0.2573         0.2573
8         0.2914           0.2914               0.2914         0.2914
9         0.1662           0.1662               0.1662         0.1662
10        0.4400           0.4400               0.4400         0.4400
11        0.6939           ?                    0.3731         0.6629
26
Data Cleaning: Unified Date Format
* We want to transform all dates to the same format internally.
* Some systems accept dates in many formats
  - e.g. “Sep 24, 2003”, 9/24/03, etc.
  - dates are transformed internally to a standard value
* Frequently, just the year (YYYY) is sufficient.
* For more detail, we may need the month, the day, the hour, etc.
* Representing a date as YYYYMM or YYYYMMDD can be OK, but has problems.
* Q: What are the problems with YYYYMMDD dates?
* Ans: Ignoring for now the looming Y10K (year 10,000) crisis …
  - YYYYMMDD does not preserve intervals: the numeric difference between two consecutive days is 1 within a month but much larger across a month or year boundary.
  - This can introduce bias into models.
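The interval problem with YYYYMMDD can be made concrete with a short sketch. The dates below and the choice of 1970-01-01 as the day-count origin are illustrative assumptions, not values taken from the slides.

```python
from datetime import date

def yyyymmdd(d):
    """Encode a date as the integer YYYYMMDD."""
    return d.year * 10000 + d.month * 100 + d.day

def days_since(d, origin=date(1970, 1, 1)):
    """Encode a date as a day count from an origin (an assumed choice)."""
    return (d - origin).days

# Hypothetical pairs of dates, each exactly one day apart.
within_month = (date(2004, 1, 30), date(2004, 1, 31))
across_month = (date(2004, 1, 31), date(2004, 2, 1))

gap_within = yyyymmdd(within_month[1]) - yyyymmdd(within_month[0])
gap_across = yyyymmdd(across_month[1]) - yyyymmdd(across_month[0])
print(gap_within, gap_across)  # 1 70

# A day count keeps both gaps equal to one day.
day_gap_within = days_since(within_month[1]) - days_since(within_month[0])
day_gap_across = days_since(across_month[1]) - days_since(across_month[0])
print(day_gap_within, day_gap_across)  # 1 1
```

Any encoding that is linear in elapsed time would preserve intervals in the same way; the day count is just the simplest one to show.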
27
Conversion: Nominal to Numeric
* Some tools can deal with nominal values internally.
* Other methods (neural nets, regression, nearest neighbor) require only numeric inputs.
* To use nominal fields in such methods, we need to convert them to numeric values.
* Q: Why not ignore nominal fields altogether?
* Ans: They may contain valuable information.
* There are different strategies for binary, ordered, and multi-valued nominal fields.
28
Conversion: Binary to Numeric
* Binary fields
  - E.g. Gender = M, F
* Convert to Field_0_1 with 0, 1 values
  - e.g. Gender = M → Gender_0_1 = 0
  - Gender = F → Gender_0_1 = 1
29
Conversion: Ordered to Numeric
* Ordered attributes (e.g. Grade) can be converted to numbers preserving natural order, e.g.
  - A+ → 4.0
  - A → 3.7
  - B+ → 3.3
  - B → 3.0
* Q: Why is it important to preserve natural order?
* A: To allow meaningful comparisons, e.g. Grade > 3.5
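An order-preserving conversion can be sketched as a simple lookup table. The mapping below uses the grade points as read from the slide (A+ = 4.0, A = 3.7, B+ = 3.3, B = 3.0); the student records and names are hypothetical.

```python
# Grade points as read from the slide; the order A+ > A > B+ > B is preserved.
GRADE_POINTS = {"A+": 4.0, "A": 3.7, "B+": 3.3, "B": 3.0}

def encode_grades(grades, mapping=GRADE_POINTS):
    """Map each ordered grade label to its numeric value."""
    return [mapping[g] for g in grades]

# Hypothetical student records: (name, grade).
students = [("s1", "A"), ("s2", "B"), ("s3", "A+")]
encoded = encode_grades(g for _, g in students)
print(encoded)  # [3.7, 3.0, 4.0]

# Because order is preserved, a numeric comparison is meaningful.
above = [name for name, g in students if GRADE_POINTS[g] > 3.5]
print(above)  # ['s1', 's3']
```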
30
Conversion: Nominal, Few Values
* Multi-valued, unordered attributes with a small number of values (rule of thumb: fewer than 20)
  - e.g. Color = Red, Orange, Yellow, …, Violet
  - for each value v, create a binary “flag” variable C_v, which is 1 if Color = v and 0 otherwise
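The flag-variable construction can be sketched in a few lines. The function name `flag_columns` and the sample colors are illustrative; the `C_v` naming follows the slide.

```python
def flag_columns(values, categories=None):
    """Build one 0/1 flag column C_v per distinct value v."""
    if categories is None:
        categories = sorted(set(values))  # only the values actually observed
    return {f"C_{v}": [1 if x == v else 0 for x in values] for v in categories}

# Hypothetical Color attribute for four records.
colors = ["Red", "Yellow", "Red", "Violet"]
flags = flag_columns(colors)
for name, column in flags.items():
    print(name, column)
# C_Red [1, 0, 1, 0]
# C_Violet [0, 0, 0, 1]
# C_Yellow [0, 1, 0, 0]
```

Passing an explicit `categories` list keeps the set of columns stable even when some values do not appear in a given batch of records.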
31
Noisy Data
* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
* Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
32
How to Handle Noisy Data?
* Binning method
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and have a human check them
* Regression
  - smooth by fitting the data to regression functions
33
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
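The worked example on this slide can be reproduced with a short sketch. The function names are illustrative. Rounding bin means to whole dollars and sending ties to the lower boundary are assumptions made here; the rounding matches the figures on the slide, and no ties occur in this data.

```python
def equi_depth_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to a whole number."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # sorted, from the slide
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```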
34
Cluster Analysis
35
Regression
36
QUESTIONS