Mid3 Revision 2. Prof. Sin Min Lee, Department of Computer Science, San Jose State University.

Functional Dependencies
FDs are defined over two sets of attributes X, Y ⊆ R.
Notation: X → Y reads as “X determines Y”.
If X → Y, then all tuples that agree on X must also agree on Y.
Example relation R:
X | Y | Z
--|---|--
1 | 2 | 3
2 | 4 | 5
1 | 2 | 4
1 | 2 | 7
2 | 4 | 8
3 | 7 | 9
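The "agree on X, agree on Y" condition can be checked mechanically on a relation instance. The sketch below is only an illustration: the helper name `fd_holds` is hypothetical, and the rows are the six X, Y, Z tuples read from the slide's example relation.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if all rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first row with this lhs value fixes the expected rhs value.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Rows of the example relation R(X, Y, Z).
R = [dict(zip("XYZ", t)) for t in
     [(1, 2, 3), (2, 4, 5), (1, 2, 4), (1, 2, 7), (2, 4, 8), (3, 7, 9)]]

print(fd_holds(R, ["X"], ["Y"]))  # True: X -> Y holds on this instance
print(fd_holds(R, ["X"], ["Z"]))  # False: X = 1 appears with Z = 3, 4 and 7
```

Note that an instance can only refute an FD; an FD that holds on one instance is not thereby guaranteed by the schema.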

Q6. (1 point) Given the relation Supplies:
Snumber | Pnumber | Qty
--------|---------|-----
101 | 1 | 20
101 | 2 | 30
102 | 1 | 14
103 | 4 | 21
104 | 4 | 10
105 | 1 | 5
what will be returned by the SQL query:
Select Pnumber From Supplies Group By Pnumber Having Count(*) = (Select Max(Count(*)) From Supplies Group By Pnumber)
(a) 1 (b) 2 (c) 3 (d) 4
Answer: a
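The query groups by Pnumber and keeps the groups whose row count equals the largest group count. A minimal Python sketch of the same computation (not a SQL engine; the variable names are illustrative) is:

```python
from collections import Counter

# (Snumber, Pnumber, Qty) rows of Supplies.
supplies = [(101, 1, 20), (101, 2, 30), (102, 1, 14),
            (103, 4, 21), (104, 4, 10), (105, 1, 5)]

# Count(*) per Pnumber group.
counts = Counter(pnumber for _, pnumber, _ in supplies)
# Max(Count(*)) over all groups.
top = max(counts.values())
# Groups satisfying the Having clause.
result = sorted(p for p, c in counts.items() if c == top)
print(result)  # [1]
```

Part 1 is supplied three times, part 4 twice, and part 2 once, so only Pnumber 1 is returned.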

Q4. (1 point) Consider the relation R(ABCDE) with FDs: FD1. AB -> D, FD2. AB -> E, FD3. D -> A, and FD4. D -> B. The number of keys of R is: (a) 1 (b) 2 (c) 3 (d) 10
Answer: (b). There are 2 candidate keys: {A,B,C} and {C,D}. C appears on no right-hand side, so it must belong to every key.
There are 10 superkeys: {CD}, {ABC}, {ACD}, {BCD}, {CDE}, {ABCD}, {ABCE}, {ACDE}, {BCDE}, {ABCDE}.
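For a schema this small, keys can be found by brute force: a set of attributes is a superkey when its closure is the whole schema, and a candidate key is a superkey with no superkey as a proper subset. The sketch below assumes single-letter attribute names; the function names are hypothetical.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def superkeys(schema, fds):
    """All attribute sets whose closure is the full schema."""
    return [frozenset(c)
            for n in range(1, len(schema) + 1)
            for c in combinations(schema, n)
            if closure(c, fds) == set(schema)]

def candidate_keys(schema, fds):
    """Superkeys that contain no smaller superkey."""
    sks = superkeys(schema, fds)
    return [k for k in sks if not any(other < k for other in sks)]

fds = [("AB", "D"), ("AB", "E"), ("D", "A"), ("D", "B")]
sks = superkeys("ABCDE", fds)
keys = candidate_keys("ABCDE", fds)
print(sorted("".join(sorted(k)) for k in keys))  # ['ABC', 'CD']
print(len(sks))  # 10
```

The enumeration is exponential in the number of attributes, which is acceptable only for exam-sized schemas.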

2nd Normal Form
A relation schema R is in 2NF if it is in 1st Normal Form and each attribute A in R meets one of the following criteria:
- It appears in a candidate key.
- It is not partially dependent on a candidate key.
Notes:
- No need to check if the primary key has only one attribute.
- Create a new relation for each partial key and its dependent attributes.

Partial dependency
A functional dependency α → β is called a partial dependency if there is a proper subset γ of α such that γ → β. We say that β is partially dependent on α.

2NF example
A | B | C | D
--|---|---|--
1 | 1 | 3 | 2
2 | 2 | 3 | 1
3 | 2 | 4 | 4
4 | 1 | 4 | 3
1 | 2 | 1 | 2

2nd Normal Form (cont.)
Lots(Property-Id#, County-name, Lot#, Area, Price, Tax-Rate)
Tax-Rate is partially dependent on the candidate key {County-name, Lot#}.

2NF (cont.)
Lots is decomposed into:
(Property-Id#, County-name, Lot#, Area, Price)
(County-name, Tax-Rate)

3rd Normal Form
The relation is in 2nd Normal Form, and no non-key attribute is functionally dependent on another non-key attribute.

3NF (cont.)
[Figure: the relation (Property-Id#, County-name, Lot#, Area, Price) and its 3NF decomposition over the same attributes]

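Inventory(PartNbr, {Warehouse, Location}, QOH, Weight, PartColor)
PartNbr --> Weight, PartColor
PartNbr + Warehouse --> QOH (QOH is Quantity On Hand)
Warehouse --> Location
Sample Data
PartNbr | Warehouse | Location | QOH | Weight | PartColor
--------|-----------|----------|-----|--------|----------
01 | 500 | NW | 135 | 11.75 | Blue
01 | 600 | SW | 210 | 11.75 | Blue
01 | 800 | East | 192 | 11.75 | Blue
02 | 500 | NW | 75 | 2.50 | Red
02 | 800 | East | 45 | 2.50 | Red
03 | 500 | NW | 290 | 21.35 | Green
03 | 600 | SW | 83 | 21.35 | Green
Which Normal form is the Inventory table in?
Answer: the key is {PartNbr, Warehouse}. The table is in 1NF but not 2NF, because Weight and PartColor depend on PartNbr alone and Location depends on Warehouse alone.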

Q2. (1 point) Hospital(Patient, Insurance, Doctor, {Test, Result})
Patient --> Insurance, Doctor
Patient + Test --> Result
Sample Data
Patient | Insurance | Doctor | Test | Result
--------|-----------|--------|------|-------
Tweety | Red Cross | Livingston | Brain Scan | Not Found
Tweety | Red Cross | Livingston | Blood work | Yes and red
Sylvester | Red Shield | Kilder | Cat Scan | Yes he is a Cat
Sylvester | Red Shield | Kilder | X Rays | No broken bones
Sylvester | Red Shield | Kilder | Flea check | None
Which Normal form is the Hospital table in?

Q7. (1 point) Given the following table (a) Draw the functional dependency graph of this table. (b) Can D be in 3NF?

Closure of F
Let F be a set of functional dependencies. The closure of F, denoted by F+, is the set of all functional dependencies logically implied by F.

Armstrong’s Axioms
Reflexivity rule. If α is a set of attributes and β ⊆ α, then α → β holds.
Augmentation rule. If α → β holds and γ is a set of attributes, then γα → γβ holds.
Transitivity rule. If α → β holds and β → γ holds, then α → γ holds.

Q3. (1 point) Suppose we have R(A,B,C,D) with FD1. A,B → C, FD2. A,C → B, FD3. B,D → A. Identify all the candidate keys.

Decompositions in General
R(A1, ..., An, B1, ..., Bm, C1, ..., Cp) is decomposed into R1(A1, ..., An, B1, ..., Bm) and R2(A1, ..., An, C1, ..., Cp).
If A1, ..., An → B1, ..., Bm, then the decomposition is lossless.
Example: name → price, hence the first decomposition is lossless.
Note: we do not necessarily need A1, ..., An → C1, ..., Cp.

BCNF Decomposition Algorithm
Repeat
  choose A1, …, Am → B1, …, Bn that violates the BCNF condition
  split R into R1(A1, …, Am, B1, …, Bn) and R2(A1, …, Am, [others])
  continue with both R1 and R2
Until no more violations
[Figure: R1 covers the A’s and the B’s; R2 covers the A’s and the others]
Is there a 2-attribute relation that is not in BCNF?

Summary of BCNF Decomposition
Find a dependency that violates the BCNF condition: A1, A2, …, An → B1, B2, …, Bm
Heuristics: choose B1, B2, …, Bm “as large as possible”.
Decompose: R1 gets the A’s and the B’s; R2 gets the A’s and the others.
Continue until there are no BCNF violations left.
2-attribute relations are BCNF.

Example Decomposition
Person(name, SSN, age, hairColor, phoneNumber)
SSN → name, age
age → hairColor
Decompose in BCNF (in class):
Step 1: find all keys (How? Compute S+, for various sets S)
Step 2: now decompose

Other Example
R(A,B,C,D) with A → B, B → C
Key: AD
Violations of BCNF: A → B, A → C, A → BC
Pick A → BC: split into R1(A,B,C) and R2(A,D)
What happens if we pick A → B first?

Q5. (1 point) Given the FDs {B->D, AB->C, D->B} and the relation R(A, B, C, D), give two distinct lossless-join decompositions to BCNF, indicating the keys of each of the resulting relations.
Answer:
First lossless-join decomposition: R1(A, B, C) with key {A, B}; R2(B, D) with keys {B} and {D}.
Second lossless-join decomposition: R1(A, C, D) with key {A, D}; R2(B, D) with keys {B} and {D}.

Lossless Decompositions
A decomposition is lossless if we can recover the original relation:
Decompose: R(A,B,C) into R1(A,B) and R2(A,C)
Recover: join R1 and R2 into R’(A,B,C)
R’(A,B,C) should be the same as R(A,B,C). R’ is in general larger than R. Must ensure R’ = R.
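The decompose-then-recover round trip can be tried on small instances. The sketch below uses two made-up instances of R(A,B,C) (the data is hypothetical, not from the slides): one where A identifies each tuple, and one where it does not, so the join produces spurious tuples.

```python
def project(rows, attrs):
    """Project rows onto attrs; each tuple is kept as (attribute, value) pairs."""
    return {tuple((a, r[a]) for a in attrs) for r in rows}

def natural_join(r1, r2):
    """Natural join of two projected relations on their shared attributes."""
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(tuple(sorted({**d1, **d2}.items())))
    return out

def recover(rows):
    """Decompose into R1(A,B) and R2(A,C), then join them back into R'."""
    return natural_join(project(rows, ["A", "B"]), project(rows, ["A", "C"]))

def as_set(rows):
    """Original relation in the same representation as the join result."""
    return {tuple(sorted(r.items())) for r in rows}

# Hypothetical instances of R(A,B,C).
good = [{"A": 1, "B": "x", "C": "p"}, {"A": 2, "B": "y", "C": "q"}]
bad = [{"A": 1, "B": "x", "C": "p"}, {"A": 1, "B": "y", "C": "q"}]

print(recover(good) == as_set(good))        # True: R' = R
print(len(recover(bad)), len(as_set(bad)))  # 4 2: R' has spurious tuples
```

In the second instance neither A → B nor A → C holds, which is why the recovered relation is strictly larger than the original.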

Q8. (2 points) Consider the relation schema R(A,B,C,D) with FDs F = {AB → C; BC → D; A → B}. Which FD has an extraneous attribute on the left hand side?
a. AB → C
b. BC → D
c. Both (b) and (a)
d. None of the above
Answer: a
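An attribute on the left-hand side of α → β is extraneous if the remaining left-hand attributes already determine β under F. A hedged sketch of that test for Q8 follows; it assumes single-letter attribute names, and `extraneous_lhs` is a hypothetical helper.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def extraneous_lhs(fds):
    """List (lhs, rhs, attribute) triples where the attribute is extraneous in lhs."""
    found = []
    for lhs, rhs in fds:
        for a in lhs:
            rest = set(lhs) - {a}
            # a is extraneous if lhs without a still determines rhs under fds.
            if rest and set(rhs) <= closure(rest, fds):
                found.append((lhs, rhs, a))
    return found

F = [("AB", "C"), ("BC", "D"), ("A", "B")]
print(extraneous_lhs(F))  # [('AB', 'C', 'B')]
```

B is extraneous in AB → C because A+ = ABCD already contains C; neither B nor C alone determines D, so BC → D has no extraneous attribute.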

Multivalued Dependencies (MVDs)
Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency α →→ β holds on R if in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that:
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t3[R - β] = t2[R - β]
t4[β] = t2[β]
t4[R - β] = t1[R - β]

MVD (Cont.)
Tabular representation of α →→ β

X ->> Y is trivial if (a) Y ⊆ X or (b) Y ∪ X = R

Multivalued Dependencies
There are database schemas in BCNF that do not seem to be sufficiently normalized.
Consider a database classes(course, teacher, book) such that (c,t,b) ∈ classes means that t is qualified to teach c, and b is a required textbook for c.
The database is supposed to list for each course the set of teachers any one of which can be the course’s instructor, and the set of books, all of which are required for the course (no matter who teaches it).

Multivalued Dependencies
classes:
course | teacher | book
-------|---------|-----
database | Avi | DB Concepts
database | Avi | Ullman
database | Hank | DB Concepts
database | Hank | Ullman
database | Sudarshan | DB Concepts
database | Sudarshan | Ullman
operating systems | Avi | OS Concepts
operating systems | Avi | Shaw
operating systems | Jim | OS Concepts
operating systems | Jim | Shaw
There are no non-trivial functional dependencies and therefore the relation is in BCNF.
Insertion anomalies: if Sara is a new teacher that can teach database, two tuples need to be inserted: (database, Sara, DB Concepts) and (database, Sara, Ullman).

Multivalued Dependencies
Therefore, it is better to decompose classes into:
teaches:
course | teacher
-------|--------
database | Avi
database | Hank
database | Sudarshan
operating systems | Avi
operating systems | Jim
text:
course | book
-------|-----
database | DB Concepts
database | Ullman
operating systems | OS Concepts
operating systems | Shaw
We shall see that these two relations are in Fourth Normal Form (4NF).
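The MVD definition can be tested directly on the classes relation: for every pair of tuples that agree on α, the tuple that takes β from the first and everything else from the second must also be present. The sketch below is an illustration only; `mvd_holds` is a hypothetical helper, and the rows are the ten (course, teacher, book) tuples of classes.

```python
def mvd_holds(rows, schema, alpha, beta):
    """Check alpha ->> beta on an instance, following the t1..t4 definition."""
    present = {tuple(r[a] for a in schema) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in alpha):
                # t3 takes beta from t1 and all other attributes from t2.
                # t4 is covered when the same pair is visited in the other order.
                t3 = {**t2, **{a: t1[a] for a in beta}}
                if tuple(t3[a] for a in schema) not in present:
                    return False
    return True

schema = ["course", "teacher", "book"]
classes = [dict(zip(schema, t)) for t in [
    ("database", "Avi", "DB Concepts"), ("database", "Avi", "Ullman"),
    ("database", "Hank", "DB Concepts"), ("database", "Hank", "Ullman"),
    ("database", "Sudarshan", "DB Concepts"), ("database", "Sudarshan", "Ullman"),
    ("operating systems", "Avi", "OS Concepts"), ("operating systems", "Avi", "Shaw"),
    ("operating systems", "Jim", "OS Concepts"), ("operating systems", "Jim", "Shaw"),
]]

print(mvd_holds(classes, schema, ["course"], ["teacher"]))      # True
print(mvd_holds(classes[1:], schema, ["course"], ["teacher"]))  # False
```

Dropping a single tuple breaks course ->> teacher, which is the same effect that forces two insertions for each new teacher.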

Example: F = {A → B, B → C}
A+ = ABC
B+ = BC
C+ = C
AB+ = ABC
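These closures can be reproduced with the usual fixed-point computation: start from the given attributes and keep adding right-hand sides of FDs whose left-hand side is already covered. A minimal sketch, assuming single-letter attribute names:

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire the FD when its left side is covered and it adds something new.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

F = [("A", "B"), ("B", "C")]
for x in ["A", "B", "C", "AB"]:
    print(x, "->", "".join(sorted(closure(x, F))))
# A -> ABC
# B -> BC
# C -> C
# AB -> ABC
```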

First Normal Form: every field contains only atomic values (no lists or sets). Implicit in our definition of the relational model.
Second Normal Form: every non-key attribute is fully functionally dependent on the ENTIRE primary key. Mainly of historical interest.

FDs in a BCNF Relation
– Intuitively, in a BCNF relation, the only nontrivial dependencies are those in which a key determines some attributes.
– Each tuple can be thought of as an entity or relationship, identified by a key and described by the remaining attributes.
[Figure: Key | Nonkey attr_1 | Nonkey attr_2 | … | Nonkey attr_k]

[Figure: Partial Dependencies and Transitive Dependencies, drawn over a Key, Attributes X, and Attributes A, with the cases “A not in a key” and “A in a key”]

Motivation of 3NF
By making an exception for certain dependencies involving key attributes, we can ensure that every relation schema can be decomposed into a collection of 3NF relations using only decompositions that are lossless-join and dependency-preserving. Such a guarantee does not exist for BCNF relations.
3NF weakens the BCNF requirements just enough to make this guarantee possible.
Unlike BCNF, some redundancy is possible with 3NF. The problems associated with partial and transitive dependencies persist if there is a nontrivial dependency X → A and X is not a superkey, even if the relation is in 3NF because A is part of a key.

Reserves
Assume: sid → cardno (a sailor uses a unique credit card to pay for reservations).
Reserves is not in 3NF: sid is not a key and cardno is not part of a key. In fact, (sid, bid, day) is the only key.
(sid, cardno) pairs are stored redundantly.

Reserves
Assume: sid → cardno, and cardno → sid (we know that credit cards also uniquely identify the owner).
Reserves is in 3NF: (cardno, bid, day) is also a key for Reserves, so cardno is part of a key and sid → cardno does not violate 3NF.

1. Suppose that in our banking example, we had an alternative design including the schema: BC-schema = (loan#, cname, street, ccity). We can see this is not BCNF, as the functional dependency cname → street ccity holds on this schema, and cname is not a superkey.

2. If we have customers who have several addresses, though, then we no longer wish to enforce this functional dependency, and the schema is in BCNF. 3. However, we now have the repetition of information problem. For each address, we must repeat the loan numbers for a customer, and vice versa.

4. Figure 1 shows a tabular representation of α →→ β. It looks horrendously complicated, but is really rather simple. A simple example is a table with the schema (name, address, car), as shown in Figure 2.
Figure 1:
   | α | β | R - α - β
---|---|---|----------
t1 | a1 … ai | ai+1 … aj | aj+1 … an
t2 | a1 … ai | bi+1 … bj | bj+1 … bn
t3 | a1 … ai | ai+1 … aj | bj+1 … bn
t4 | a1 … ai | bi+1 … bj | aj+1 … an

Figure 2: (name, address, car) where name →→ address and name →→ car
The table lists Tom with the addresses North Rd. and Oak St. and the cars Toyota and Honda, pairing each address with each car.

What is a Decomposition?
Let R be a relation schema. A set of relation schemas {R1, R2, …, Rn} is a decomposition of R if R = R1 ∪ R2 ∪ … ∪ Rn.
That is, {R1, R2, …, Rn} is a decomposition of R if, for i = 1, 2, …, n, each Ri is a subset of R, and every attribute in R appears in at least one Ri.

Normalization Using Functional Dependencies
Desirable properties of Decomposition
1. Lossless-Join Decomposition
Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:
R1 ∩ R2 → R1
R1 ∩ R2 → R2

2. Dependency Preservation When an update is made to the database, the system should be able to check if it satisfies all the given functional dependencies. If we want to check updates efficiently, we should design relational-database schemas that allow update validation without the computation of joins. To decide whether joins must be computed we need to determine what functional dependencies may be tested by checking each relation individually.

Cont.
Let F be a set of functional dependencies on a schema R, and let R1, R2, …, Rn be a decomposition of R. The restriction of F to Ri is the set Fi of all functional dependencies in F+ that include only attributes of Ri.
Let F’ = F1 ∪ F2 ∪ … ∪ Fn. F’ is a set of functional dependencies on schema R; in general, F’ ≠ F. However, it may be that F’+ = F+. If the latter is true, then every dependency in F is logically implied by F’, and if we verify that F’ is satisfied, we have verified that F is satisfied.
We say that a decomposition having the property F’+ = F+ is a dependency preserving decomposition.

Algorithm to test dependency preservation
compute F+;
for each schema Ri in D do
begin
  Fi := the restriction of F+ to Ri;
end
F’ := ∅;
for each restriction Fi do
begin
  F’ := F’ ∪ Fi;
end
compute F’+;
if (F’+ = F+) then return (true) else return (false);
Note: since the first step, the computation of F+, takes exponential time, it is often easier not to apply the algorithm.
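A commonly taught alternative avoids computing F+ altogether. This is not the algorithm on the slide but a substitute technique: for each α → β in F, grow a result set from α using only attribute closures restricted to one Ri at a time, and check whether β ends up inside it. The sketch below assumes single-letter attribute names, and the decompositions are made-up examples over F = {A → B, B → C}.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserved(fd, decomposition, fds):
    """Test whether one FD can be enforced using the fragments alone."""
    lhs, rhs = fd
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            ri = set(ri)
            # Attributes derivable from the part of result visible inside Ri.
            gained = (closure(result & ri, fds) & ri) - result
            if gained:
                result |= gained
                changed = True
    return set(rhs) <= result

def dependency_preserving(decomposition, fds):
    """A decomposition preserves F if every FD in F is preserved."""
    return all(preserved(fd, decomposition, fds) for fd in fds)

F = [("A", "B"), ("B", "C")]
print(dependency_preserving(["AB", "BC"], F))  # True
print(dependency_preserving(["AB", "AC"], F))  # False: B -> C is lost
```

Each closure call is polynomial, so the whole test stays polynomial in the size of F and the decomposition.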

Boyce-Codd Normal Form
A relation schema R is in BCNF with respect to a set F of functional dependencies if for all functional dependencies in F+ of the form α → β, where α ⊆ R and β ⊆ R, at least one of the following holds:
- α → β is a trivial functional dependency (β ⊆ α).
- α is a superkey for schema R.

Cont.
A database design is in BCNF if each member of the set of relation schemas that constitutes the design is in BCNF.
To determine whether these schemas are in BCNF, we need to determine what functional dependencies apply to them.
Note: examples are available in text P225-226

BCNF Decomposition Algorithm
result := {R};
done := false;
compute F+;
while (not done) do
  if (there is a schema Ri in result that is not in BCNF) then
  begin
    let α → β be a nontrivial functional dependency that holds on Ri such that α → Ri is not in F+, and α ∩ β = ∅;
    result := (result - Ri) ∪ (Ri - β) ∪ (α, β);
  end
  else done := true;
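The loop above can be sketched in Python for small schemas. This version is an illustration under stated assumptions, not the textbook implementation: it finds violations by brute force over subsets of Ri, always picks the largest β for a given α, and uses single-letter attribute names. It is run on the R(A,B,C,D) example with A → B and B → C.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(ri, fds):
    """Return (alpha, beta) violating BCNF on ri, or None if ri is in BCNF."""
    ri = set(ri)
    for n in range(1, len(ri)):
        for alpha in combinations(sorted(ri), n):
            alpha = set(alpha)
            # Largest beta inside ri that alpha determines, disjoint from alpha.
            beta = (closure(alpha, fds) & ri) - alpha
            # Violation: nontrivial FD whose left side is not a superkey of ri.
            if beta and not ri <= alpha | beta:
                return alpha, beta
    return None

def bcnf_decompose(schema, fds):
    """Split schemas on violating FDs until every fragment is in BCNF."""
    result, todo = [], [set(schema)]
    while todo:
        ri = todo.pop()
        violation = find_violation(ri, fds)
        if violation is None:
            result.append(ri)
        else:
            alpha, beta = violation
            todo.append(alpha | beta)  # the (alpha, beta) schema
            todo.append(ri - beta)     # the Ri - beta schema
    return result

fds = [("A", "B"), ("B", "C")]
parts = bcnf_decompose("ABCD", fds)
print(sorted("".join(sorted(p)) for p in parts))  # ['AB', 'AD', 'BC']
```

The first split uses A → BC, giving (A,B,C) and (A,D); the fragment (A,B,C) is then split again on B → C.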

Cont. Not every BCNF decomposition is dependency preserving We can not always satisfy all three design goals: 1. BCNF 2. Lossless join 3. Dependency preservation

Cont.
Example: Banker-schema = (branch-name, customer-name, banker-name)
This banker-schema indicates that a customer has a "personal banker" in a particular branch.
The set F of functional dependencies that we require to hold on the banker-schema is:
banker-name → branch-name
branch-name customer-name → banker-name
Banker-schema is not in BCNF because banker-name is not a superkey.

Third Normal Form
A relation schema R is in 3NF with respect to a set F of functional dependencies if, for all functional dependencies in F+ of the form α → β, where α ⊆ R and β ⊆ R, at least one of the following holds:
- α → β is a trivial functional dependency.
- α is a superkey for R.
- Each attribute A in β - α is contained in a candidate key for R.

Transitive Dependencies
The definition of 3NF allows certain functional dependencies that are not allowed in BCNF.
A dependency α → β that satisfies only the third condition of the 3NF definition is not allowed in BCNF, but is allowed in 3NF.
These dependencies are examples of transitive dependencies.

Cont.
If a relation schema is in BCNF, then all functional dependencies are of the form “superkey determines a set of attributes,” or the dependency is trivial. So a BCNF schema cannot have any transitive dependencies.
Every BCNF schema is also in 3NF, and BCNF is therefore a more restrictive constraint than is 3NF.

Algorithm for Dependency-preserving, lossless-join decomposition into 3NF
Let Fc be a canonical cover for F;
i := 0;
for each functional dependency α → β in Fc do
  if none of the schemas Rj, j = 1, 2, …, i contains αβ then
  begin
    i := i + 1;
    Ri := αβ;
  end
if none of the schemas Rj, j = 1, 2, …, i contains a candidate key for R then
begin
  i := i + 1;
  Ri := any candidate key for R;
end
return (R1, R2, …, Ri)

Comparison of BCNF and 3NF
3NF has the advantage that it is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation. So it is generally preferable to choose 3NF.

Conclusion Now we have three design goals for a relational-database design: 1. BCNF 2. Lossless join 3. Dependency preservation If we cannot achieve all three, we can do 1. 3NF 2. Lossless join 3. Dependency preservation

Testing for Lossless Join
Fortunately, there is a simple test to determine if a decomposition into two schemas is lossless.
Let R1 and R2 be a decomposition of R.
Let F be the set of FDs of R.
If either (R1 ∩ R2) → (R1 - R2) or (R1 ∩ R2) → (R2 - R1) belongs to F+, the decomposition is lossless.
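Membership in F+ does not require computing F+: it is enough to check whether the closure of R1 ∩ R2 covers R1 - R2 or R2 - R1. The sketch below applies this to the FDs {B->D, AB->C, D->B} on R(A,B,C,D) from Q5; `lossless` is a hypothetical helper name, and the third decomposition is a made-up lossy example.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def lossless(r1, r2, fds):
    """Binary lossless-join test: the common attributes determine one side."""
    r1, r2 = set(r1), set(r2)
    common_closure = closure(r1 & r2, fds)
    return r1 - r2 <= common_closure or r2 - r1 <= common_closure

F = [("B", "D"), ("AB", "C"), ("D", "B")]
print(lossless("ABC", "BD", F))  # True: B -> D
print(lossless("ACD", "BD", F))  # True: D -> B
print(lossless("AB", "CD", F))   # False: no common attributes
```

The test applies only to decompositions into two schemas; longer decompositions need it applied step by step or a more general method.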

Putting the results in practical use Data Mining and KDD

What is Data Mining? “the automated extraction of hidden predictive information from large databases” Algorithms produce patterns, rules Predict future trends/behavior Used to make business decisions

Classification Items belong to classes Given past items’ classification, predict class of new item Example: Issuing credit cards Use information: income, educational background, age, current debts Credit worthiness: Bad, good, excellent

Decision Tree Classifiers Internal Node has predicate Leaf node is class To classify instance Start at root node Traverse tree until reach leaf node Each internal node, make decision

Credit Risk Decision Tree

Decision Tree Construction
Some Definitions
Purity: the more instances in each leaf that belong to only one class, the greater the purity.
Best Split: the split giving the maximum information gain ratio (info gain / info content).
Choose the attribute and condition resulting in maximum purity.

Decision Tree Construction

Association Rules
antecedent → consequent (if antecedent then consequent)
beer → diaper (Walmart)
economy bad → higher unemployment
higher unemployment → higher unemployment benefits cost
Rules are associated with a population, support, and confidence.

Association Rules Population: instances such as grocery store purchases Support % of population satisfying antecedent and consequent Confidence % consequent true when antecedent true

Association Rules
Population: MS, MSA, MSB, MA, MB, BA (M = Milk, S = Soda, A = Apple, B = Beer)
Support(M → S) = 3/6: (MS, MSA, MSB) / (MS, MSA, MSB, MA, MB, BA)
Confidence(M → S) = 3/5: (MS, MSA, MSB) / (MS, MSA, MSB, MA, MB)
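The two ratios can be recomputed from the population directly. In this small sketch each purchase is a string of item letters, as on the slide; the function names are illustrative.

```python
from fractions import Fraction

# Each purchase is a string of item letters: M=Milk, S=Soda, A=Apple, B=Beer.
population = ["MS", "MSA", "MSB", "MA", "MB", "BA"]

def support(antecedent, consequent, baskets):
    """Fraction of all baskets containing both antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    hits = sum(1 for b in baskets if both <= set(b))
    return Fraction(hits, len(baskets))

def confidence(antecedent, consequent, baskets):
    """Fraction of baskets with the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if set(antecedent) <= set(b)]
    hits = sum(1 for b in with_antecedent if set(consequent) <= set(b))
    return Fraction(hits, len(with_antecedent))

print(support("M", "S", population))     # 1/2, i.e. 3/6
print(confidence("M", "S", population))  # 3/5
```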

Clustering “The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to all available variables.”

Clustering Birch Algorithm points inserted into multidimensional tree items guided to leaf nodes "near" representative internal nodes nearby points clustered into one leaf node

Clustering Example of Clustering predict what new movies a person is interested in 1) a person’s past movie preferences 2) others with similar preferences 3) preferences of those in the pool for new movies

Clustering 1) cluster people with similar movie preferences 2) given a new movie goer, find a cluster of similar movie goers 3) then predict the cluster's new movie preferences

Amazon Examples
