Fundamentals/ICY: Databases 2012/13 WEEK 10 (maths, normalization) John Barnden Professor of Artificial Intelligence School of Computer Science University.


1 Fundamentals/ICY: Databases 2012/13 WEEK 10 (maths, normalization) John Barnden Professor of Artificial Intelligence School of Computer Science University of Birmingham, UK

2 Reminder of Week 9 on Mathematical Background

3 Subsets and Supersets
- A ⊆ B means that A is a “subset” of B (and that B is a “superset” of A). I.e., every member of A is also a member of B.
  - Carefully distinguish between subset-of and member-of!
  - The symbol ⊂ means the same as ⊆; ⊂ does NOT mean that there cannot be equality.
- Examples:
  - ∅ ⊆ {4,5}
  - {5} ⊆ {4,5,6}, {6,4} ⊆ {4,5,6,7}, {6,4,7,5} ⊆ {4,5,6,7}
  - {n | n is an even whole number} ⊆ {n | n is a whole number}

4 Subsets and Supersets
- ∅ ⊆ A for any set A.
- A ⊆ A for any set A. (Reflexivity)
- If A ⊆ B and B ⊆ A then A = B. (Antisymmetry)
- If A ⊆ B and B ⊆ C then A ⊆ C. (Transitivity)
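As a rough illustration, the subset properties above can be tried out with Python's built-in sets, where `<=` tests subset-of and `in` tests member-of. The sets `A`, `B` and `C` are made up for this sketch and are not from the slides.

```python
# Hypothetical example sets, chosen only for illustration.
A = {5}
B = {4, 5, 6}
C = {4, 5, 6, 7}

# Subset-of (<=) and member-of (in) are different questions.
print(A <= B)               # is every member of A also a member of B?
print(5 in B)               # is the number 5 a member of B?
print(frozenset({5}) in B)  # is the set {5} itself a member of B?

# The empty set is a subset of any set.
empty_is_subset = set() <= A

# Reflexivity, antisymmetry and transitivity, checked on these examples only.
reflexive = A <= A
antisymmetric = (not (A <= B and B <= A)) or A == B
transitive = (not (A <= B and B <= C)) or A <= C
print(empty_is_subset, reflexive, antisymmetric, transitive)
```

Checking a few examples like this is only a sanity check, not a proof of the properties.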

5 New for Week 10 on Mathematical Background

6 Some Operations on Sets
- Union of sets A and B: A ∪ B = the set of things that are in A or B (or both). NB: no repetitions created.
- Intersection of sets A and B: A ∩ B = the set of things that are in both A and B.
- Difference of sets A and B: A − B = the set of things that are in A but not B. Note: also notated by a backslash instead of a minus sign, i.e. A \ B.
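A minimal sketch of the three operations using Python sets, where `|`, `&` and `-` stand for union, intersection and difference. The example sets are assumptions made for illustration.

```python
# Hypothetical example sets.
A = {1, 2, 3}
B = {3, 4}

union = A | B          # things in A or B (or both); 3 is not repeated
intersection = A & B   # things in both A and B
difference = A - B     # things in A but not in B

print(union, intersection, difference)
```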

7 Some Properties of those Operations
- Union and intersection are commutative (“can switch”):
  - A ∪ B = B ∪ A
  - A ∩ B = B ∩ A
- Union and intersection are associative (“can group differently”):
  - A ∪ (B ∪ C) = (A ∪ B) ∪ C
  - A ∩ (B ∩ C) = (A ∩ B) ∩ C
- Because of associativity, we can omit parentheses:
  - A ∪ B ∪ C ∪ D
  - A ∩ B ∩ C ∩ D

8 Two Other Properties
- Union distributes over intersection: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
- Intersection distributes over union: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
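One way to build confidence in the commutativity, associativity and distributivity laws is to test them on every combination of subsets of a small universe. The sketch below does that; the universe {1, 2, 3} and the helper names `all_subsets` and `holds_for_all` are assumptions of this example. An exhaustive check on a small universe is a sanity check, not a proof.

```python
from itertools import combinations, product

def all_subsets(universe):
    """Return every subset of the universe as a frozenset."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

SUBSETS = all_subsets({1, 2, 3})  # the 8 subsets of a 3-element universe

def holds_for_all(law):
    """True if law(a, b, c) holds for every triple of subsets."""
    return all(law(a, b, c) for a, b, c in product(SUBSETS, repeat=3))

commutative = holds_for_all(
    lambda a, b, c: (a | b) == (b | a) and (a & b) == (b & a))
associative = holds_for_all(
    lambda a, b, c: (a | (b | c)) == ((a | b) | c)
    and (a & (b & c)) == ((a & b) & c))
union_over_intersection = holds_for_all(
    lambda a, b, c: (a | (b & c)) == ((a | b) & (a | c)))
intersection_over_union = holds_for_all(
    lambda a, b, c: (a & (b | c)) == ((a & b) | (a & c)))

print(commutative, associative,
      union_over_intersection, intersection_over_union)
```

The same `holds_for_all` helper can be pointed at a candidate law for the difference operation to see whether it survives the check.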

9 Same Difference?
- Exercises for bath-time: Is the difference operation commutative or associative? And does it take part in any distributivity with the other operations?

10 Tuples in a Table
The tuples are
<‘9568876A’, ‘Chopples’, 37>
<‘2544799Z’, ‘Blurp’, NULL>
<‘1698674F’, ‘Rumpel’, 88>

People
PERS-ID   NAME      AGE
9568876A  Chopples  37
2544799Z  Blurp
1698674F  Rumpel    88

11 “Tuples”
- A “tuple” is an ordered sequence of items of any sort. We will only deal with finite tuples. Items CAN be duplicated.
  - Can also be called a “vector” in other CS terminology.
- Notation: <6, JAB, 5, “JAB”, 5, , 9>
- Singleton and empty tuples are allowed; the empty tuple is written <>.
- The concatenation (∘) of two tuples is just the result of putting them end to end to get one tuple.
  - Concatenating any tuple with <> gives that same tuple back.
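Python's own tuples behave much like the tuples described above, with `+` playing the role of concatenation. The values below are invented for this sketch.

```python
# Hypothetical tuples; items may be of any sort and may be duplicated.
t1 = (6, "JAB", 5)
t2 = ("JAB", 5)

singleton = (9,)   # a one-item tuple needs the trailing comma in Python
empty = ()         # the empty tuple

concatenated = t1 + t2        # put the two tuples end to end
unchanged = t1 + empty        # concatenating with the empty tuple

print(concatenated)
print(unchanged == t1)
```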

12 Table Rows are “Tuples”
- In a table, each attribute has a “domain”: the set of values that the attribute can have. E.g., the set of integers, the set of all character strings of any length, or the set of character strings of a specific format and length.
- If the attribute allows NULL values, we include NULL in the value domain as well.
- The values in a row form a tuple of values from the respective value domains. Just a list of the values, one for each attribute.

13 “Cartesian Products” and “Relations”
- The set of all possible tuples formed from some sets is called the Cartesian product of the sets. Notation, e.g.: D × E × F × G × H if D, E, F, G, H are the sets (not necessarily different).
- Any subset at all of that Cartesian product is called a relation on the sets in question (D, E, …)
  - even the whole of the product (even if infinite)
  - and even the empty set.
- I.e., a relation on D, E, …, H is just some set of tuples that are each of form <d, e, …, h> where d ∈ D, e ∈ E, …, h ∈ H.

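14 Examples
- Let A = {3, 8, 2} and B = {‘jjj’, ‘bb’}. Then:
  - A × B = {<3,‘jjj’>, <3,‘bb’>, <8,‘jjj’>, <8,‘bb’>, <2,‘jjj’>, <2,‘bb’>}
  - B × B = {<‘jjj’,‘jjj’>, <‘jjj’,‘bb’>, <‘bb’,‘jjj’>, <‘bb’,‘bb’>}
  - A × ∅ = ∅ = ∅ × A
  - A × {TRUE} = {<3,TRUE>, <8,TRUE>, <2,TRUE>}
- Some relations on A and B:
  - any set of three of the tuples in A × B
  - any set containing just one of those tuples
  - A × B itself
  - ∅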
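These products can be reproduced with `itertools.product` from the Python standard library. The sets `A` and `B` are the ones from the slide; the particular `relation` chosen at the end is an assumption made for illustration.

```python
from itertools import product

A = {3, 8, 2}
B = {"jjj", "bb"}

AxB = set(product(A, B))             # all pairs <a, b> with a in A, b in B
BxB = set(product(B, B))
A_x_empty = set(product(A, set()))   # a product with the empty set is empty
A_x_true = set(product(A, {True}))

# A relation on A and B is any subset of A x B (hypothetical choice here).
relation = {(3, "jjj"), (8, "bb")}
is_relation = relation <= AxB

print(len(AxB), len(BxB), A_x_empty, A_x_true, is_relation)
```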

15 Rows as forming a Relation
- So, for a given table, the set of all possible rows that you could create using values from the value domains, considered as tuples, forms the Cartesian product of the value domains of the table.
- And, provided the table does not have repeated rows: AT ANY MOMENT the actual set of rows, considered as tuples, is a relation on the table’s value domains.
  - NB: crucial here that no row is exactly repeated, because a mathematical set cannot have repeated elements.

16 Relation from a Table
The relation at the moment is
{ <‘9568876A’, ‘Chopples’, 37>,
  <‘2544799Z’, ‘Blurp’, NULL>,
  <‘1698674F’, ‘Rumpel’, 88> }

People
PERS-ID   NAME      AGE
9568876A  Chopples  37
2544799Z  Blurp
1698674F  Rumpel    88

17 A Table as a Relation?
- People loosely talk about tables being relations. This is mathematically inaccurate for several reasons:
  1) The table properly speaking includes not just the rows but also the attribute names themselves, their domains, specification of primary and foreign keys, etc.
  2) It’s only the rows at any given moment that form a relation. When a value in the table changes or a row is added or deleted, the mathematical relation is replaced by a different one.
  3) Relations do not cater for tables with repeated rows. ((But there is a more advanced notion of relation, based on “bags” rather than sets, that does cater for repeated rows.))
- But OK if you know what you (and those people) mean.

18 ((Aside: “Bags” in Maths))
- A variant of sets called “bags” (or “multisets”) is used in maths (and CS) and allows repeated members. There are union, etc. operations that respect the repetitions.
- So bags and their operations are a better fit to DB tables and notably their repetition-respecting operations (e.g. UNION ALL) than sets and their operations are.
- But bags are non-standard and they’re not normally covered at an introductory level.
- See Garcia-Molina et al 2009 for bags and their use in the DB area.
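The difference between sets and bags can be sketched with Python's `collections.Counter`, which records how many times each item occurs. The rows reuse the People example, but the repeated row and the second table are assumptions added for this illustration.

```python
from collections import Counter

# Hypothetical rows; the first row is deliberately repeated.
rows = [
    ("9568876A", "Chopples", 37),
    ("2544799Z", "Blurp", None),
    ("9568876A", "Chopples", 37),
]
other_rows = [
    ("1698674F", "Rumpel", 88),
    ("2544799Z", "Blurp", None),
]

as_set = set(rows)        # a set silently drops the repeated row
as_bag = Counter(rows)    # a bag remembers that the row occurs twice

# Adding two Counters adds the multiplicities, in the spirit of UNION ALL.
union_all = as_bag + Counter(other_rows)

# The | operator keeps the larger multiplicity of each row instead.
bag_union = as_bag | Counter(other_rows)

print(len(as_set), sum(as_bag.values()))
print(sum(union_all.values()), sum(bag_union.values()))
```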

19 — Back to Database Design — NORMALIZATION

20 Reminder of Week 9

21 Partial and Transitive Dependencies

22 Second Normal Form (2NF) Conversion results on example on previous slide

23 New for Week 10

24 In 2NF but not in 3NF because of a “transitive” dependency

25 Third Normal Form
- A table is in third normal form (3NF) if:
  - It is in 2NF and
  - It contains no transitive dependencies

26 Transitive Dependencies
- A prime attribute is one that is within some candidate key (not necessarily the primary key).
- A transitive dependency is where the determinant D is at least partially outside the PK and is not a superkey, and the determined attribute X is non-prime (and therefore in particular is not inside the PK; the reason for this restriction is on a later slide).
  - E.g.: previous Figure for simple case of a simple (= one-attribute) determinant.
  - Above definition is partly based on Garcia-Molina, Ullman & Widom 2009 – see later ref. More general than the account in our textbook.
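The definition above can be turned into a small check, sketched here under stated assumptions. Functional dependencies are given as (determinant, determined-attributes) pairs; the table, its attribute names (`EMP_NUM`, `JOB_CLASS`, and so on) and its dependencies are hypothetical, and the helper names are this example's own. Candidate keys are found by brute force, which is only sensible for small tables.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(all_attrs, fds):
    """Minimal attribute sets whose closure is the whole table."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            attrs = set(combo)
            is_superkey = closure(attrs, fds) == all_attrs
            if is_superkey and not any(k <= attrs for k in keys):
                keys.append(attrs)
    return keys

def transitive_dependencies(all_attrs, pk, fds):
    """Dependencies D -> X with D partly outside the PK, D not a
    superkey, and X non-prime (the definition used on the slide)."""
    prime = set().union(*candidate_keys(all_attrs, fds))
    found = []
    for lhs, rhs in fds:
        outside_pk = not lhs <= pk
        is_superkey = closure(lhs, fds) == all_attrs
        for x in sorted(rhs - lhs):
            if outside_pk and not is_superkey and x not in prime:
                found.append((set(lhs), x))
    return found

# Hypothetical table and dependencies, for illustration only.
ATTRS = {"EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR"}
PK = {"EMP_NUM"}
FDS = [
    (frozenset({"EMP_NUM"}),
     frozenset({"EMP_NAME", "JOB_CLASS", "CHG_HOUR"})),
    (frozenset({"JOB_CLASS"}), frozenset({"CHG_HOUR"})),
]

print(candidate_keys(ATTRS, FDS))
print(transitive_dependencies(ATTRS, PK, FDS))
```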

27 Third Normal Form (3NF) Conversion Results on previous example

28 Conversion to 3NF
- For each determinant D involved in a transitive dependency in the original table T, use D as the PK for a new table NT(D) and move out the attributes X transitively determined by D into NT(D).
- NB: the determinants themselves stay in T as well.
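The conversion step can be sketched as a small function that takes the attributes of T and a list of transitive dependencies and returns the reduced T together with the new tables NT(D). The function name `convert_to_3nf`, the input format and the example table are assumptions of this sketch; it does not handle awkward cases such as a moved attribute that is itself part of another determinant.

```python
def convert_to_3nf(table_attrs, transitive_deps):
    """Split out transitively determined attributes.

    transitive_deps is a list of (determinant_set, determined_attribute).
    Returns (attributes left in T, {determinant: attributes of NT(D)}).
    """
    remaining = set(table_attrs)
    new_tables = {}
    for determinant, x in transitive_deps:
        key = frozenset(determinant)
        # NT(D) starts with D as its PK and gains each attribute moved out.
        new_tables.setdefault(key, set(determinant)).add(x)
        # X leaves T, but the determinant D stays in T as well.
        remaining.discard(x)
    return remaining, new_tables

# Hypothetical table with one transitive dependency JOB_CLASS -> CHG_HOUR.
T = {"EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR"}
deps = [({"JOB_CLASS"}, "CHG_HOUR")]

reduced_T, new_tables = convert_to_3nf(T, deps)
print(reduced_T)
print(new_tables)
```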

29 The Boyce-Codd Normal Form (BCNF)
- Determinants of partial and transitive functional dependencies are not superkeys. So the corresponding normalization gets rid of some non-superkey determinants of functional dependencies.
- Normalization into BCNF gets rid of all such determinants. A table is in BCNF if it’s in 1NF and every determinant of a functional dependency is a superkey
  - i.e., every attribute-set that determines any other attribute determines all the attributes, so there’s no redundancy problem.
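The BCNF condition lends itself to a direct test: compute what each determinant determines and see whether that is the whole table. The sketch below assumes a hypothetical table with attributes A, B, C, D and the dependencies {A, B} → {C, D} and C → B; the helper names are this example's own. Trivial dependencies, where the determined attributes already lie inside the determinant, are skipped.

```python
def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(all_attrs, fds):
    """Non-trivial dependencies whose determinant is not a superkey."""
    violations = []
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue  # trivial dependency, never a problem
        if closure(lhs, fds) != all_attrs:
            violations.append((set(lhs), set(rhs)))
    return violations

# Hypothetical table: PK is (A, B), and C determines B.
ATTRS = {"A", "B", "C", "D"}
FDS = [
    (frozenset({"A", "B"}), frozenset({"C", "D"})),
    (frozenset({"C"}), frozenset({"B"})),
]

violations = bcnf_violations(ATTRS, FDS)
in_bcnf = not violations
print(in_bcnf, violations)
```

In this made-up table C is not a superkey, so the dependency C → B is reported even though B is prime.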

30 A Table in 3NF but not in BCNF The dependency is NOT TRANSITIVE since B is prime

31 Decomposition to BCNF The middle diagram shows that changing the PK so as to include C instead of B changes the dependency into a partial one, which can then be removed in the usual way.

32 ((Note: A Simple Form of BCNF))
- Any simple (= one-attribute) superkey is a candidate key. So BCNF requires all simple determinants to be candidate keys.
- Some books (incl. our textbook) define BCNF to mean that “all [simple] determinants are candidate keys”.
- This is a simpler, less general form of BCNF.
- A table could be in simple-BCNF but not be in full BCNF.
- My definition of (full) BCNF is from Garcia-Molina, Ullman & Widom, Database Systems: The Complete Book, 2nd ed., Pearson, 2009. This book also gives a process for conversion to full BCNF.

33 BCNF versus 3NF
- BCNF implies that there are no partial or transitive dependencies, so a table that is in BCNF is also in 3NF.
- ((If a table is in 3NF but not BCNF then each of the non-superkey determinants D is partly outside the PK and determines only prime attributes.
  - If also the PK is the only candidate key, then: the attributes determined by each D must all be in the PK; but they cannot cover all of the PK (otherwise D would be a superkey). So the PK must be composite.))

34 ((Reason for Prime-X Exclusion in Transitive Dependencies))
- Earlier we said that in a transitive dependency the determined attribute X is non-prime (i.e. not within a candidate key). The reason is:
- In removing a transitive dependency, we delete the dependent attribute X from the original table. If X were within the primary key (special case of candidate key), that key would therefore be disrupted, and this would affect other tables referencing the table. But non-primary candidate keys are also sometimes used for such referencing, and are then called secondary keys. So if X were in such a key, the conversion to 3NF would disrupt the referencing.
- So, to keep things simple for the purposes of 3NF, all prime Xs are banned from being transitively dependent.

35 ((Inter-Table Reference Disruption contd.))
- NB: Conversion to 2NF can, and from 3NF to BCNF does, remove dependent prime attributes, so is potentially disruptive of inter-table reference. However, it’s relatively unlikely to be a problem in conversion to 2NF, because, in partial dependencies, the dependent attributes are not normally prime at all.
- If a 3NF table is not in BCNF then the troublesome dependencies necessarily involve prime Xs, because if the X is non-prime then a dependency with a non-superkey determinant must either be partial or transitive.

36 ((3NF and Reference Disruption contd.))
- Some textbooks (e.g., Connolly and Begg, Database Systems, Pearson, 2010) only require transitive dependencies to avoid non-primary-key attributes, rather than to avoid all prime attributes. In that case, conversion to 3NF can disrupt references using a secondary key. But at least the cases of 2NF and 3NF are now more similar to each other.
- I haven’t seen a version of 2NF that is only concerned with non-prime Xs. But don’t be too surprised if you come across that!

37 Optional Material on 4NF: in Week 11 if there’s time

38 Normal Forms Overall
- Let “<” mean “provides less protection than”. Then: 1NF < 2NF < 3NF < BCNF ((and 3NF < 4NF)). ((Also BCNF < 4NF under the second definition of 4NF.))
- BCNF and 4NF guard against relatively unusual situations. BCNF is more disruptive to achieve than 2NF or 3NF. Merely requiring 2NF is now unusual. So 3NF is the normal target.

39 Normal Forms Overall, contd
- BCNF is more disruptive to achieve than 2NF or 3NF:
  1) BCNF may require the PK to be changed, but conversion to 2NF or 3NF never does so.
  2) Conversion from 3NF to BCNF always removes prime attributes, including possibly some PK attributes, perhaps disrupting inter-table reference. Conversion to 2NF only sometimes removes prime attributes, and can only do so if they are non-PK, so it has less danger of disrupting inter-table reference. Conversion from 2NF to 3NF has no such danger.

40 Non-Normalization/Denormalization
- Normalization leads to more tables.
- Joining a larger number of tables takes additional disk input/output (I/O) operations, additional manipulation complexity, and possibly substantial communication delays.
- Conflicts among design principles, information requirements, and processing speed are often resolved through compromises that may include ending up with some non-normalized tables.

41 Denormalization (continued)
- Unnormalized tables in a production database tend to have these defects:
  - Data updates are less efficient because programs that read and update tables must deal with larger tables
  - Indexing is much more cumbersome
  - Unnormalized tables yield no simple strategies for creating virtual tables known as views

42 Database Modifications and Redesign
- Many real-world databases have been improperly designed, or burdened with anomalies because of being improperly modified over time.
- You may be asked to redesign and modify existing databases.

43 Summary: Normalization and Database Design
- Normalization helps eliminate data redundancies and some other aspects of poor structure.
- Normalization focusses on problems in individual entity types.
- Difficult to separate normalization from overall ER modelling process.
- Normalization cannot, by itself, guarantee good designs.
- 3NF is often enough, but BCNF, 4NF etc. may also need to be considered.
- Non-normalized tables may be desirable in some cases, to increase processing speed and/or reduce conceptual complexity of operations.

