Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.

Data Preprocessing

Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems with 3VL approach Special Values

Remember: Relational Databases Model entities and relationships Entities are the things in the real world Information about employees and the department they work for Employee and department are entities Relationships are the links between these entities Employee works for a department

Relation or Tables Relation: a table of data table = relation,(set theory, based on predicate logic) emplyeeIDnamejobdepartmentID 7513NoraProgrammer128 9842BenDBA42 6651AlexProgrammer128 9006ClaudiaSystem- Administrator 128

Columns and Rows Each column or attribute describes some piece of data that each record in the table has Each row in a table represents a record Rows, records or tupels

Keys A superkey is a column (or a set of columns) that can be used to identify a row in as table a key is a minimal superkey There are different possible keys candidate keys We chose form the candidate keys the primary key Primary key is used to identify a single row (record) Foreign keys represents links between tables

Keys primary key foreign key emplyeeIDnamejobdepartmentID 7513NoraProgrammer128 9842BenDBA42 6651AlexProgrammer128 9006ClaudiaSystem- Administrator 128

Functional Dependencies If there is a functional dependency between columns A and B in a given table which may be written Then the value of column A determines the value of column B employeeID functionally determines the name

Schema Database schema Structure or design of the database Database without any data in it employee(employeeID,name,job,departmentID)

Design Minimize redundancy Redundancy: data is repeated in different rows employee(employeeID,name,job,departmentID,departmentName) emplyeeIDnamejobdepartmentIDdepartamentName 7513NoraProgrammer128Research and Development 9842BenDBA42Finance 6651AlexProgrammer128Research and Development 9006ClaudiaSystem-Administrator128Research and Development

Reduce redudancy employee(employeeID,name,job,departmentID,departmentName) employee(employeeID,name,job,departmentID) employee(departmentID,name)

Insert Anomalies Insert data into flawed table Data does not match what is already in the table It is not obvious which of the rows in the database is correct

Deletion Anomalies Delete data from a flawed schema When we delete all the employees of Department 128, we no longer have any record, that the Department 128 exists

Update Anomalies Change data in a flawed schema We do not change the data for every row correctly

Null Values Avoid schema designs that have large numbers of empty attributes

Normalization Remove design flaws from a database Normal forms, which are a set of rules describing what we should and should not do in our table structures Breaking tables into smaller tables that form a better design

Normal Forms 1 Forma Normal 2 Forma Normal 3 Forma Normal 5 Forma Normal 4 Forma Normal Forma Normal Boyce Codd

First Normal Form (1NF) Each attribute or column value must be atomic Each attribute must contain a single value

emplyeeIDnamejobdepartment ID skills 7513NoraProgrammer128C, Perl, Java 9842BenDBA42DB2 6651AlexProgrammer128VB, Java 9006ClaudiaSystem-Administrator128NT, Linux

1NF emplyeeIDnamejobdepartmentIDskills 7513NoraProgrammer128C 7513NoraProgrammer128Perl 7513NoraProgrammer128java 9842BenDBA42DB2 6651AlexProgrammer128VB 6651AlexProgrammer128java 9006ClaudiaSystem-Administrator128NT 9006ClaudiaSystem-Administrator128Linux

Second Normal Form (2NF) All attributes that are no part of the primary key are fully dependent on the primary key Each non key attribute must be functionally dependent on the key Is already in 1NF

2NF ? employee(employeeID,name,job,departmentID,skill ) emplyeeIDnamejobdepartmentIDskills 7513NoraProgrammer128C 7513NoraProgrammer128Perl 7513NoraProgrammer128java 9842BenDBA42DB2 6651AlexProgrammer128VB 6651AlexProgrammer128java 9006ClaudiaSystem-Administrator128NT 9006ClaudiaSystem-Administrator128Linux

Functional dependencies employeeID,skill  name, job, deparmentID employeeID  name, job, deparmentID Partially functionally dependent on the primary key Not fully functionally dependent on the primary key

2NF Decompose the table into tables which all the non-key attributes are fully functionally dependent on the key Breaking the table into two tables employee(employeeID,name,job,departmentID) employeeSkills(employeeID,skill)

Third Normal Form (3NF) Remove all transitive dependencies Be in 2NF

employee(employeeID,name,job,departmentID,departmentName) employeeID  name,job,departmentID,departmentName departmentID  departmentName employeID namejobdepartmentIDdepartment Name 7513NoraProgrammer128Research 9842BenDBA42Finance 6651AjayProgrammer128Research 9006CandySYS128Research

Transitive dependency employeeID  departmentName employeeID  deparmtentID departmentID  departmentName

3NF Remove transitive dependency Decompose into multiple tables emploee(employeeID,name,jop,departmentID) deparment(deparmentID,deparmtentName)

3NF The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal) Boyce-Codd Normal Form or The right side of the functional dependency is a part of any key of the table

BCNF All attributes must be functionally determined by a superkey

Full normalization means lots of logically seperate relations Lots of logically separate relations means a lot of physically separate files Lots of physically separate files means a lot of I/O Difficulties in finding dimensions for dimensional schema, star schema (dimension tables, fact table)

What is Denormalization? Normalizing a relational variable R means replacing R by a set of projections R1,R2,..,Rn such that R is equal to the join R1,R2,..,Rn Reduce redundancy, each projections R1,R2,..,Rn is at the highest possible value of normalization Denormalizing the relational variables means replacing them by their join R Increase redundancy, by ensuring that R is a lower level of normalization than R1,R2,..,Rn Problems Once we start to denormalize, it is not clear when to stop?

Dimensional Schema Array cells often empty The more dimensions, there more empty cells Empty cell  Missing information How to treat not present information ? How does the system support Information is unknown Has been not captured Not applicable.... Solution?

Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“ ” noisy: containing errors or outliers e.g., Salary=“-10” inconsistent: containing discrepancies in codes or names e.g., Age=“42” Birthday=“03/07/1997” e.g., Was rating “1,2,3”, now rating “A, B, C” e.g., discrepancy between duplicate records

Why Is Data Dirty? Incomplete data may come from “Not applicable” data value when collected Different considerations between the time when the data was collected and when it is analyzed. Human/hardware/software problems Noisy data (incorrect values) may come from Faulty data collection instruments Human or computer error at data entry Errors in data transmission Inconsistent data may come from Different data sources Functional dependency violation (e.g., modify some linked data) Duplicate records also need data cleaning

Why Is Data Preprocessing Important? No quality data, no quality mining results! Quality decisions must be based on quality data e.g., duplicate or missing data may cause incorrect or even misleading statistics. Data warehouse needs consistent integration of quality data Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility

Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing

Data Cleaning Importance “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “Data cleaning is the number one problem in data warehousing”—DCI survey Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Resolve redundancy caused by data integration

Missing Data Data is not always available E.g., many tuples have no recorded value for several attributes, such as customer income in sales data Missing data may be due to equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time of entry not register history or changes of the data Missing data may need to be inferred

Missing Values The approach of the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL) null corresponds to UNK for unknown 3VL a mistake?

Boolean Operators In scalar comparison in which either of the compared is UNK evaluates the unknown truth value ANDtuf ttuf uuuf ffff ORtuf tttt utuu ftuf NOT tf uu ft

MAYBE Another important Boolean operator is MAYBE MAYBE tf ut ff

Example Consider the query “Get employees who may be- but are not definitely known to be- programmers born before January 18, 1971, with salary less then €40.000 EMP WHERE MAYBE ( JOB = ‘PROGRAMMER’ AND DOB < DATE (‘1971-1-18’) AND SALLARY < 40000 )

Without maybe we assume the existence of another operator called IS_UKN which takes a single scalar operand and returns true if operand evaluates UNK otherwise false EMP WHERE ( JOB = ‘PROGRAMMER’ OR IS_UKN (JOB) ) AND ( DOB < DATE (‘1971-1-18’) OR IS_UKN (DOB) ) AND ( SALLARY < 40000 OR IS_UKN (SALLARY) ) AND NOT ( JOB = ‘PROGRAMMER’ AND DOB < DATE (‘1971-1-18’) AND SALLARY < 40000 )

Numeric expression WEIGHT * 454 If WEIGHT is UKN, then the result is also UKN Any numeric expression is considered to evaluate UNK if any operands of that expression is itself UNK Anomalies WEIGHT-WEIGHT=UNK (0) WEIGHT/0=UNK (“zero divide”)

UNK is not u (unk) UNK (the value-unknown null) u (unk) (unknown truth value)...are not the same thing u is a value, UNK not a value at all! Suppose X is BOOLEAN Has tree values: t (true),f (false), u ukn X is ukn, X is known to be unk X is UKN, X is not known!

Some 3VL Consequences The comparison x=x does not give true In 3VL x is not equal to itself it is happens to be UNK The Boolean expression p OR NOT(p) does not give necessarily true unk OR NOT (unk) = unk

Example Get all suppliers in Porto and take the union with get all suppliers not in Porto We do not get all suppliers! We need to add maybe in Porto In 2 VL p OR NOT(p) corresponds to p OR NOT(p) OR MAYBE(p) in 3VL While two cases my exhaust full range of possibilities in the real world, the database does not contain the real world - instead it contains only knowledge about real world

Some 3VL Consequences The expression r JOIN r does not necessarily give r A=B and B=C together does not imply A=C.... Many equivalences that are valid in 2VL break down in 3VL We will get wrong answers

Special Values Drop the idea of null and UNK,unk 3VL Use special values instead to represent missing information Special values are used in the real world In the real world we might use the special value „?“ to denote hours worked by a certain employee if actual value is unknown

Special Values General Idea: Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used The special value must be of the applicable attribute is not just integers, but integers integers plus whatever the special value is Approach is not very elegant, but without 3VL problems, because it is in 2VL

How to Handle Missing Data? Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably. Fill in the missing value manually: tedious + infeasible? Fill in it automatically with a global constant : e.g., “unknown”, a new class?! the attribute mean the attribute mean for all samples belonging to the same class: smarter the most probable value: inference-based such as Bayesian formula or decision tree

Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems with 3VL approach Special Values

Next.. Data Preprocessing Visual inspection Noise Reduction Data Reduction Data Discretization Data Integration

Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.

Similar presentations

Presentation on theme: "Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.

Similar presentations

Presentation on theme: "Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems."— Presentation transcript:

Similar presentations

About project

Feedback