Normalization and Data Mining R&G Chapter 19 Lecture 27 Science is the knowledge of consequences, and dependence of one fact upon another. Thomas Hobbes.


Administrivia
Homework due a week from today
–RubyOnRails help session Wed, 5-7pm, 310 Soda
–(Thanks to Darren Lo & HKN)
Final exam 3 weeks from tomorrow

Review: Functional Dependencies
–Properties of the real world
–Decide when to decompose relations
–Help us find keys
–Help us evaluate design tradeoffs
  Want to reduce redundancy, avoid anomalies
  Want reasonable efficiency
  Must avoid lossy decompositions
–F+: closure, all dependencies that can be inferred from a set F
–A+: attribute closure, all attributes functionally determined by the set of attributes A
–G: minimal cover, a minimal set of FDs such that G+ = F+
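The attribute closure A+ from this slide can be computed with a simple fixpoint loop. The Python sketch below is illustrative only: the function names and the toy relation R(A,B,C,D) with A → B and B → C are assumptions, not part of the lecture.

```python
def attribute_closure(attrs, fds):
    """Return every attribute functionally determined by `attrs`.

    `fds` is a list of (lhs, rhs) pairs; each side is a string of
    one-letter attribute names, e.g. ("AB", "C") for AB -> C.
    """
    closure = set(attrs)
    changed = True
    while changed:  # repeat until no FD adds anything new
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


def is_superkey(attrs, relation, fds):
    """A set of attributes is a superkey if its closure is the whole relation."""
    return attribute_closure(attrs, fds) == set(relation)


# Hypothetical example: R(A, B, C, D) with A -> B and B -> C.
toy_fds = [("A", "B"), ("B", "C")]
print(sorted(attribute_closure("A", toy_fds)))  # ['A', 'B', 'C']
print(is_superkey("AD", "ABCD", toy_fds))       # True
```

Checking whether X → Y is in F+ then reduces to testing whether Y is contained in the closure of X, which avoids enumerating F+ itself.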

Review: Normal Forms
A property of a single relation
Tells us something about redundancy in the relation
Reln R with FDs F is in BCNF if, for all X → A in F+:
–A ∈ X (called a trivial FD), or
–X is a superkey for R.
Reln R with FDs F is in 3NF if, for all X → A in F+:
–A ∈ X (called a trivial FD), or
–X is a superkey of R, or
–A is part of some candidate key (not superkey!) for R (sometimes stated as “A is prime”).
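The two definitions differ only in the extra "A is prime" escape clause, which a small checker makes concrete. This is a sketch under assumptions: the helper names are invented here, candidate keys are found by brute force, and only the FDs given in F are tested (enough for the relation the FDs are stated over, but not for a sub-relation whose FDs would first have to be projected from F+).

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under the (lhs, rhs) pairs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(relation, fds):
    """Brute-force search for minimal attribute sets that determine everything."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == set(relation):
                keys.append(candidate)
    return keys


def violations(relation, fds, normal_form="BCNF"):
    """List (lhs, attribute) pairs from `fds` that break the normal form."""
    prime = set().union(*candidate_keys(relation, fds))
    bad = []
    for lhs, rhs in fds:
        if closure(lhs, fds) == set(relation):
            continue  # lhs is a superkey
        for attr in sorted(set(rhs) - set(lhs)):  # ignore the trivial part
            if normal_form == "3NF" and attr in prime:
                continue  # attr is part of some candidate key
            bad.append((lhs, attr))
    return bad


# Relation RCTZ with RCT -> Z and Z -> CT (street, city, state, zip).
address_fds = [("RCT", "Z"), ("Z", "CT")]
print(violations("RCTZ", address_fds, "BCNF"))  # [('Z', 'C'), ('Z', 'T')]
print(violations("RCTZ", address_fds, "3NF"))   # []
```

Here Z → CT breaks BCNF because Z is not a superkey, yet it is acceptable in 3NF because C and T both belong to the candidate key RCT.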

Review: Decomposition
If reln violates normal form, decompose
–but must have lossless decomposition
Lossless decomposition:
–decomposition of R into X and Y is lossless if and only if X ∩ Y is a key for either X or Y
–If W → Z holds over R and (W ∩ Z) is empty, then decomposition of R into R-Z and WZ is lossless.
Algorithm:
–For each FD W → Z in R that violates normal form, decompose R into R-Z and WZ. Repeat as needed.
–Any order of FDs works, but different orders can produce very different results
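Both the lossless-join test and the single decomposition step can be sketched in a few lines of Python. The function names are assumptions made for illustration; the sample FDs are the ones from the Students exercise in this lecture.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the (lhs, rhs) pairs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless(x, y, fds):
    """X and Y join losslessly iff X ∩ Y determines all of X or all of Y."""
    common_closure = closure(set(x) & set(y), fds)
    return set(x) <= common_closure or set(y) <= common_closure


def split_on(relation, lhs, rhs):
    """One decomposition step for a violating FD W -> Z: return (R - Z, WZ)."""
    z = set(rhs) - set(lhs)  # make sure W and Z do not overlap
    return set(relation) - z, set(lhs) | z


student_fds = [("D", "DSNRCTZ"), ("S", "DSNRCTZ"), ("RCT", "Z"), ("Z", "CT")]
left, right = split_on("DSNRCTZ", "RCT", "Z")
print(sorted(left), sorted(right))
print(is_lossless(left, right, student_fds))  # True
```

The split is lossless by construction, because the shared attributes W determine every attribute of the WZ side.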

Review: Dependency Preservation
–decompose too much, and it might be necessary to join tables to check FDs
–decomposition of R into X and Y is dependency preserving if (F_X ∪ F_Y)+ = F+
  F_X is all FDs in F+ involving only attributes in X
  F_Y is all FDs in F+ involving only attributes in Y
–Not always obvious
  ABC, A → B, B → C, C → A, decomposed into AB and BC. Is this dependency preserving? Is C → A preserved?
–note: F+ contains F ∪ {A → C, B → A, C → B}, so…
  F_AB contains A → B and B → A; F_BC contains B → C and C → B
  So, (F_AB ∪ F_BC)+ contains C → A
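Dependency preservation can be tested without materializing F_X and F_Y. The sketch below uses a standard textbook technique, not necessarily the one intended in the lecture: it grows the left-hand side using only what each schema can check locally. Function names are assumptions.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the (lhs, rhs) pairs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_preserved(lhs, rhs, schemas, fds):
    """Check X -> Y using only FDs that can be tested inside single schemas."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for schema in schemas:
            inside = set(schema)
            # What the attributes known so far determine *within* this schema.
            gained = closure(result & inside, fds) & inside
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result


def preserves_dependencies(schemas, fds):
    return all(is_preserved(lhs, rhs, schemas, fds) for lhs, rhs in fds)


# The slide's example: ABC with A -> B, B -> C, C -> A, split into AB and BC.
cycle_fds = [("A", "B"), ("B", "C"), ("C", "A")]
print(is_preserved("C", "A", ["AB", "BC"], cycle_fds))      # True
print(preserves_dependencies(["AB", "BC"], cycle_fds))      # True
```

For C → A the loop first learns C → B inside BC, then B → A inside AB, which mirrors the argument on the slide.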

Exercise
Consider a database about Students: (StudentID, SS#, Name, Street Addr, City, State, Zip), abbreviated as (D,S,N,R,C,T,Z), where D and S are keys
D → DSNRCTZ, S → DSNRCTZ, RCT → Z, Z → CT
Is DSNRCTZ in BCNF? If not, decompose it until it is. Is the final decomposition dependency-preserving?
Is DSNRCTZ in 3NF? If not, decompose it until it is. Is the final decomposition dependency-preserving?

Exercise (Cont.)
Is DSNRCTZ in BCNF? If not, decompose it until it is. Is the final decomposition dependency-preserving?
–no, RCT → Z violates BCNF (RCT is not a key); decompose to DSNRCT & RCTZ.
–still no, Z → CT holds in RCTZ and Z is not a key; decompose to DSNRCT, ZCT & RZ, which is in BCNF
–but a join is required to test RCT → Z, so the decomposition is not dependency-preserving

Exercise (Cont.)
Is DSNRCTZ in 3NF? If not, decompose it until it is. Is the final decomposition dependency-preserving?
–no, RCT → Z violates 3NF (RCT is not a key and Z is not part of a key); decompose to DSNRCT & RCTZ.
–yes, now in 3NF: Z → CT is allowed in RCTZ because CT is part of a key (RCT is a key, since RCT → Z)
–is D → Z preserved? Yes, transitively, since D → RCT (1st relation), and RCT → Z (2nd relation)

Minimal Cover for a Set of FDs
G: minimal cover, a minimal set of FDs such that G+ = F+
–Closure of F = closure of G.
–Right hand side of each FD in G is a single attribute.
–If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes.
Every FD in G is needed, and “as small as possible” in order to get the same closure as F.
e.g., F+ = {A → B, B → C, C → A, B → A, C → B, A → C}
–several minimal covers: {A → B, B → A, C → B, B → C} (AB + BC)
–or {A → C, C → A, B → C, C → B} (AC + BC)
–or {A → B, B → A, C → A, A → C} (AB + AC)
e.g., A → B, ABCD → E, EF → GH, ACDF → EG has minimal cover:
–A → B, ACD → E, EF → G and EF → H
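The three properties suggest a three-step procedure: split right-hand sides, shrink left-hand sides, then drop redundant FDs. The Python sketch below follows that usual textbook recipe; the function names are assumptions, and the result depends on the order in which attributes and FDs are tried, which is why several minimal covers can exist.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the (lhs, rhs) pairs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def minimal_cover(fds):
    """Return one minimal cover as a list of (frozenset lhs, single attribute)."""
    # Step 1: put a single attribute on every right-hand side.
    work = [(frozenset(lhs), attr) for lhs, rhs in fds for attr in rhs]
    # Step 2: drop left-hand attributes that are not needed.
    for i, (lhs, attr) in enumerate(work):
        for extra in sorted(lhs):
            smaller = work[i][0] - {extra}
            if attr in closure(smaller, work):
                work[i] = (smaller, attr)
    # Step 3: drop duplicates and FDs implied by the remaining ones.
    cover = list(dict.fromkeys(work))
    for fd in list(cover):
        others = [other for other in cover if other != fd]
        if fd[1] in closure(fd[0], others):
            cover = others
    return cover


# Second example from the slide.
slide_fds = [("A", "B"), ("ABCD", "E"), ("EF", "GH"), ("ACDF", "EG")]
for lhs, attr in minimal_cover(slide_fds):
    print("".join(sorted(lhs)), "->", attr)
```

On the slide's second example this yields A → B, ACD → E, EF → G and EF → H, matching the cover given above.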

BCNF and Dependency Preservation
In general, there may not be a dependency-preserving decomposition into BCNF.
But you can always find a dependency-preserving decomposition into 3NF
–Top down:
  Decompose until it is in 3NF
  Compute minimal cover for FDs
  If minimal cover contains an FD X → Y that is not preserved, add reln XY
–Bottom up:
  Compute minimal cover
  For each FD X → Y in minimal cover, create reln XY
–Why does this work?
  Minimal cover doesn’t include redundant transitive dependencies, which don’t need to be preserved
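The bottom-up approach is short enough to sketch in Python. Several things here are assumptions rather than lecture content: the function names, the merging of FDs that share a left-hand side, the removal of schemas contained in another schema, and the final step that adds a key relation when no schema contains a key (a step from the standard synthesis algorithm that keeps the join lossless). The minimal cover for the Students exercise was worked out by hand for this example.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the (lhs, rhs) pairs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def some_key(relation, fds):
    """Shrink the full attribute set until nothing more can be removed."""
    key = set(relation)
    for attr in sorted(relation):
        if closure(key - {attr}, fds) == set(relation):
            key.discard(attr)
    return frozenset(key)


def synthesize_3nf(relation, cover):
    """Bottom-up 3NF design from a minimal cover of (lhs, rhs) pairs."""
    grouped = {}
    for lhs, rhs in cover:  # one schema XY per distinct left-hand side X
        grouped.setdefault(frozenset(lhs), set(lhs)).update(rhs)
    schemas = {frozenset(attrs) for attrs in grouped.values()}
    # A schema strictly contained in another adds nothing.
    schemas = {s for s in schemas if not any(s < t for t in schemas)}
    # Assumed extra step: make sure some schema contains a key of R.
    if not any(closure(s, cover) == set(relation) for s in schemas):
        schemas.add(some_key(relation, cover))
    return schemas


# Hand-derived minimal cover for the Students exercise (an assumption).
student_cover = [("D", "S"), ("D", "N"), ("D", "R"), ("D", "C"), ("D", "T"),
                 ("S", "D"), ("RCT", "Z"), ("Z", "C"), ("Z", "T")]
for schema in sorted("".join(sorted(s)) for s in synthesize_3nf("DSNRCTZ", student_cover)):
    print(schema)
```

For the Students exercise this reproduces the two relations DSNRCT and RCTZ (printed with attributes in alphabetical order).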

Summary of FDs and Normalization
FDs are properties of the real world
–FDs tell us if a relation is in a Normal Form
Normal forms tell us if there is any redundancy
–but zero redundancy may mean inefficiency
BCNF: each field contains information that cannot be inferred using only FDs.
–ensuring BCNF is a good heuristic.
Not in BCNF? Try decomposing into BCNF relations.
–Must consider whether all FDs are preserved!
Lossless-join, dependency-preserving decomposition into BCNF impossible? Consider 3NF.
Decompositions should be carried out while keeping performance requirements in mind.
Note: even more restrictive Normal Forms exist (we don’t cover them in this course, but some are in the book.)

New Topic: Data Mining

Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%
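A percentage attached to a rule like the one above is typically its confidence: the fraction of records matching the "if" part that also match the "then" part. The Python sketch below computes support and confidence on a handful of made-up records; the records, the field values and the function name are all assumptions, and the real census figure cannot be reproduced from them.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    Both sides are dicts of field -> required value.
    """
    def matches(record, condition):
        return all(record.get(field) == value for field, value in condition.items())

    covered = [r for r in records if matches(r, antecedent)]
    hits = [r for r in covered if matches(r, consequent)]
    support = len(hits) / len(records)
    confidence = len(hits) / len(covered) if covered else 0.0
    return support, confidence


# Tiny invented sample, for illustration only.
sample = [
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "male"},
    {"relationship": "wife", "gender": "female"},
    {"relationship": "own-child", "gender": "male"},
    {"relationship": "husband", "gender": "female"},  # a data-entry error
]
support, confidence = rule_stats(
    sample, {"relationship": "husband"}, {"gender": "male"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule with confidence just below 100% can point at data-quality problems as much as at real-world facts.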

Definition (Cont.)
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.

Why Use Data Mining?
Human analysis skills are inadequate:
–Volume and dimensionality of the data
–High data growth rate
Availability of:
–Data
–Storage
–Computational power
–Off-the-shelf software
–Expertise

An Abundance of Data
Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
ATM machines
Demographic data
Sensor networks
Cameras
Web server logs
Customer web site trails

More Computational Power
Moore’s Law: In 1965, Intel Corporation cofounder Gordon Moore predicted that the density of transistors in an integrated circuit would double every year. (Later changed to reflect 18 months progress.)
Experts on ants estimate that there are to ants on earth. In the year 1997, we produced one transistor per ant.

Much Commercial Support
Many data mining tools
Database systems with data mining support
Visualization tools
Data mining process support
Consultants

Why Use Data Mining Today?
Competitive pressure!
“The secret of success is to know something that nobody else knows.” Aristotle Onassis
Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
Personalization, CRM
The real-time enterprise
“Systemic listening”
Security, homeland defense

The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes

Data Mining Step in Detail
2.1 Data preprocessing
–Data selection: Identify target datasets and relevant fields
–Data cleaning
  Remove noise and outliers
  Data transformation
  Create common units
  Generate new fields
2.2 Data mining model construction
2.3 Model evaluation
Models can describe existing data
Make predictions about new data

Preprocessing and Mining
[Figure: Original Data → Target Data → Preprocessed Data → Patterns → Knowledge, via the stages Data Integration and Selection, Preprocessing, Model Construction, Interpretation]

Examples
Insurance: which claims are likely to be fraud?
Banks: which customers are likely to repay loans?
Stores: which products do people buy together?

Data Mining Techniques
Supervised learning
–Classification and regression, describe correlative factors, predict values for new data
Unsupervised learning
–Clustering
–Dependency modeling
  Associations, summarization, causality
–Outlier and deviation detection
–Trend analysis and change detection
Visual Data Mining
–Present the information in a visual form, offload the analysis onto the human perceptual system

Supervised learning
Need training data set with known outcome
–e.g. here is a set of loans that were not repaid, and other loans that were repaid
Model is generated from the training set, tested on a separate test data set to determine accuracy
Model can predict outcomes on new data
–can also explain predictive factors
Examples include Decision Trees, Regression Trees, Naïve Bayesian networks
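The train-then-test workflow can be shown with the smallest possible decision tree: a single threshold on one numeric field (a "decision stump"). Everything concrete below is an assumption made for illustration: the income figures, the repaid labels and the function names.

```python
def predict(stump, value):
    """A stump is (threshold, positive_above); predict True or False."""
    threshold, positive_above = stump
    return (value >= threshold) == positive_above


def train_stump(rows):
    """Pick the threshold and direction that classify the most training rows."""
    best, best_correct = None, -1
    for threshold in sorted({value for value, _ in rows}):
        for positive_above in (True, False):
            stump = (threshold, positive_above)
            correct = sum(predict(stump, value) == label for value, label in rows)
            if correct > best_correct:
                best, best_correct = stump, correct
    return best


def accuracy(stump, rows):
    return sum(predict(stump, value) == label for value, label in rows) / len(rows)


# Invented loans: (annual income in thousands, was the loan repaid?)
training = [(20, False), (25, False), (30, False), (45, True), (50, True), (60, True)]
held_out = [(22, False), (35, False), (48, True), (70, True)]

model = train_stump(training)
print("learned rule:", model)
print("test accuracy:", accuracy(model, held_out))
```

The learned threshold doubles as the explanation of the predictive factor, and accuracy is measured only on rows the model never saw during training.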

Unsupervised Learning Give data to the algorithm, it does the rest Output might include clustered data, association rules, etc.

E.g. Agglomerative Clustering
Algorithm:
–Put each item in its own cluster (all singletons)
–Find all pairwise distances between clusters
–Merge the two closest clusters
–Repeat until everything is in one cluster
Observations:
–Results in a hierarchical clustering
–Yields a clustering for each possible number of clusters
–Greedy clustering: Result is not necessarily “optimal” for any given number of clusters
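The four algorithm steps translate almost directly into Python. The slide does not say how the distance between two clusters is measured, so this sketch assumes single linkage (the distance between their two closest members) and Euclidean distance; the points and the function name are invented for the example.

```python
import math


def agglomerative(points):
    """Return the clustering after every merge, from singletons to one cluster."""
    clusters = [[p] for p in points]  # step 1: all singletons
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Step 2: compare every pair of clusters (single-link distance).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        # Step 3: merge the closest pair, then repeat (step 4).
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        history.append([list(c) for c in clusters])
    return history


sample_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for level in agglomerative(sample_points):
    print(len(level), "clusters:", level)
```

The returned history is the hierarchy: entry k holds the clustering with len(points) - k clusters, which is the "clustering for each possible number of clusters" noted in the observations.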

Agglomerative Clustering Example

Density-Based Clustering A cluster is defined as a connected dense component. Density is defined in terms of number of neighbors of a point. We can find clusters of arbitrary shape
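One well-known way to turn this definition into an algorithm is DBSCAN; the slide does not name a specific method, so the Python sketch below is a simplified DBSCAN-style illustration under assumptions. The parameters `eps` (neighborhood radius) and `min_pts` (neighbors needed for a point to count as dense, including itself), the sample points and the function name are all chosen for the example.

```python
import math
from collections import deque


def density_clusters(points, eps, min_pts):
    """Label each point with a cluster number, or -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not dense; may later become a border point
            continue
        labels[i] = cluster
        frontier = deque(seeds)
        while frontier:  # grow the connected dense component
            j = frontier.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                frontier.extend(reachable)
        cluster += 1
    return labels


sample_points = [(0, 0), (0, 1), (1, 0), (1, 1),
                 (10, 10), (10, 11), (11, 10), (11, 11),
                 (50, 50)]
print(density_clusters(sample_points, eps=1.5, min_pts=3))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Because clusters grow by following dense neighbors rather than by distance to a center, they can take any shape; isolated points are reported as noise instead of being forced into a cluster.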

Demo

Conclusions
Data mining is very useful for understanding large data sets
Several approaches
–Supervised
–Unsupervised
Can describe patterns, make predictions
Many commercial packages
Many free algorithms