By Josh, Spencer, and Lacey

Chapters 10, 11, and 22

10.1 Security and User Authorization

10.1 Security and User Authorization
Privileges
Establishing Ownership
The Privilege-Checking Process
Granting Privileges
Grant Diagrams
Revoking Privileges

10.1 User Authorization
SQL uses authorization IDs (user names) to grant privileges.
There is a special authorization ID, PUBLIC, which includes any user.
Much as a UNIX file system can grant read, write, or execute privileges to a user, SQL can grant various privileges to an authorization ID.
Databases are much more complex than file systems, so they have more complex privileges.

10.1.1 Privileges
There are 9 types of SQL privileges: SELECT, INSERT, DELETE, UPDATE, REFERENCES, USAGE, TRIGGER, EXECUTE, and UNDER.

10.1.1 Privileges
SELECT, INSERT, UPDATE, and DELETE all apply to relations.
They correspond to the identically named SQL statements.
SELECT, INSERT, and UPDATE can optionally have a list of attributes associated with them.
For example, a user whose only privilege on PC is SELECT(model, price) can write only queries that select model or price from PC.

10.1.1 Privileges
The REFERENCES privilege on a relation is the right to refer to that relation in an integrity constraint (Chapter 7).
REFERENCES may also have attributes associated with it.
The USAGE privilege applies to several kinds of schema elements other than relations and assertions.
USAGE is the right to use that element in one’s own declarations.

10.1.1 Privileges
The TRIGGER privilege on a relation is the right to define triggers on that relation.
The EXECUTE privilege is the right to execute a piece of code, such as a PSM procedure or function.
The UNDER privilege is the right to create subtypes of a given type.

10.1.2 Establishing Ownership
SQL elements such as schemas or modules have an owner.
The owner of an element has all privileges associated with that element.
There are 3 ways ownership is established in SQL:
1. When a schema is created, it and all the tables and other schema elements in it are owned by the user who created it. This user thus has all possible privileges on elements of the schema.

10.1.2 Establishing Ownership
2. When a session is initiated by a CONNECT statement, there is an opportunity to indicate the user with an AUTHORIZATION clause:
CONNECT TO sql-server AS conn1 AUTHORIZATION steve;
3. An AUTHORIZATION clause can also be used in a module-creation statement. It is also acceptable to specify no owner for a module, in which case the module is publicly executable, but the privileges necessary for executing any operations in the module must come from some other source.

10.1.3 The Privilege-Checking Process
As we just discussed, each module, schema, and session has an associated user (an authorization ID, in SQL terms).
Any SQL operation has two parties:
The database elements upon which the operation is performed
The agent that causes the operation

10.1.3 The Privilege-Checking Process
The privileges available to the agent derive from a particular authorization ID called the current authorization ID.
That ID is one of two things:
The module authorization ID, if the module that the agent is executing has an authorization ID
The session authorization ID, if not

10.1.3 The Privilege-Checking Process
We may execute the SQL operation only if the current authorization ID possesses all the privileges needed to carry out the operation on the database elements involved.
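The check described on these slides can be sketched in a few lines of Python. This is only an illustration, not how any particular DBMS implements it: the privilege strings, the `held` table, and both function names are hypothetical, and privileges are treated as opaque labels.

```python
from typing import Dict, Optional, Set

def current_authorization_id(module_id: Optional[str], session_id: str) -> str:
    """Module authorization ID if the module has one, else the session's."""
    return module_id if module_id is not None else session_id

def may_execute(required: Set[str], held: Dict[str, Set[str]],
                module_id: Optional[str], session_id: str) -> bool:
    """True only if the current authorization ID holds every required privilege."""
    current = current_authorization_id(module_id, session_id)
    # PUBLIC includes any user, so its privileges are available to everyone.
    available = held.get(current, set()) | held.get("PUBLIC", set())
    return required <= available

# Hypothetical privilege table: authorization ID -> privileges held.
held = {
    "steve": {"SELECT ON PC", "INSERT ON PC"},
    "fred": {"SELECT ON PC"},
    "PUBLIC": set(),
}

can_select = may_execute({"SELECT ON PC"}, held, None, "fred")
can_insert = may_execute({"INSERT ON PC"}, held, None, "fred")
via_module = may_execute({"INSERT ON PC"}, held, "steve", "fred")
```

Here fred can select on his own but can insert only when running a module whose authorization ID is steve.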

10.1.4 Granting Privileges
SQL provides a GRANT statement that lets one user “copy” a privilege to another user.
An optional grant option allows the recipient to grant the privilege to other users in turn.
The GRANT statement takes the following general form:
GRANT <privileges> ON <db element> TO <users> WITH GRANT OPTION (optional)

10.1.4 Granting Privileges
GRANT <privileges> ON <db element> TO <users> WITH GRANT OPTION (optional)
<privileges> - list of the privilege(s) to be granted
<db element> - typically a relation; if it is another type of element, the name is preceded by the type
<users> - list of the user(s) to whom the privileges are granted

10.1.4 Granting Privileges - Example
Suppose user steve is the owner of ComputerSchema, which contains the tables PC and Product.
steve wants to grant the INSERT and SELECT privileges on PC and the SELECT privilege on Product to users tom and jerry, with the grant option so that they can pass the privileges on.
GRANT SELECT, INSERT ON PC TO tom, jerry WITH GRANT OPTION;
GRANT SELECT ON Product TO tom, jerry WITH GRANT OPTION;

10.1.4 Granting Privileges - Example
Now jerry wants to grant user fred the same privileges, this time without the grant option.
GRANT SELECT, INSERT ON PC TO fred;
GRANT SELECT ON Product TO fred;

10.1.4 Granting Privileges - Example
At the same time, tom grants fred the minimal privileges needed to insert new models from Product into PC.
GRANT SELECT, INSERT(model) ON PC TO fred;
GRANT SELECT ON Product TO fred;

10.1.5 Grant Diagrams
Because a sequence of grants can produce a complex web of grants and overlapping privileges, it is useful to represent grants by a graph called a grant diagram.
Each node of a grant diagram represents a user/privilege combination.
The same privilege with and without the grant option counts as two separate nodes.
We draw lines connecting nodes to show who the privilege comes from and who it is granted to.
A single * shows that the privilege has the grant option.
A double ** shows that the privilege derives from ownership.

10.1.5 Grant Diagrams - Example

10.1.6 Revoking Privileges
A granted privilege can be revoked at any time.
Revoking privileges may require cascading: when a privilege held with the grant option has been passed on to other users, revoking it may require revoking those users’ privileges as well.
The simple form of a revoke statement begins:
REVOKE <privileges> ON <db element> FROM <users>

10.1.6 Revoking Privileges
REVOKE <privileges> ON <db element> FROM <users>
The statement must end with either CASCADE or RESTRICT.
CASCADE - also revokes any privileges that were granted only because of the revoked privileges
RESTRICT - cancels the revoke statement if it would require privileges to be cascaded
It is possible to replace REVOKE by REVOKE GRANT OPTION FOR, in which case the core privileges themselves remain, but the option to grant them to others is removed.

10.1.6 Revoking Privileges - Example
Suppose steve revokes the privileges he granted to jerry earlier.
REVOKE SELECT, INSERT ON PC FROM jerry CASCADE;
REVOKE SELECT ON Product FROM jerry CASCADE;
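To make the effect of CASCADE concrete, here is a rough Python sketch of a grant diagram for the PC privileges in this example. It is a simplification under stated assumptions: the class `GrantDiagram`, its methods, and the privilege strings are all hypothetical, and a privilege is considered held only if its node is still reachable from an ownership (**) node.

```python
from collections import defaultdict

class GrantDiagram:
    """Nodes are (user, privilege, level); level is 'owner' (**), 'grant' (*), or 'plain'."""

    def __init__(self):
        self.owner_nodes = set()
        self.arcs = defaultdict(set)  # grantor node -> set of grantee nodes

    def own(self, user, privilege):
        self.owner_nodes.add((user, privilege, "owner"))

    def live_nodes(self):
        """Nodes reachable from an ownership node; only these privileges are held."""
        seen = set(self.owner_nodes)
        stack = list(self.owner_nodes)
        while stack:
            node = stack.pop()
            for nxt in self.arcs.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def holds(self, user, privilege):
        return any(u == user and p == privilege for u, p, _ in self.live_nodes())

    def grant(self, grantor, grantee, privilege, grant_option=False, source=None):
        """`source` is the grantor's (possibly more general) privilege; defaults to `privilege`."""
        source = source or privilege
        sources = [n for n in self.live_nodes()
                   if n[0] == grantor and n[1] == source and n[2] in ("owner", "grant")]
        if not sources:
            raise PermissionError(f"{grantor} cannot grant {privilege}")
        target = (grantee, privilege, "grant" if grant_option else "plain")
        for node in sources:
            self.arcs[node].add(target)

    def revoke_cascade(self, grantor, grantee, privilege):
        """Remove the grantor's arcs; unreachable nodes then stop counting as held."""
        for node in list(self.arcs):
            if node[0] == grantor:
                self.arcs[node] = {t for t in self.arcs[node]
                                   if not (t[0] == grantee and t[1] == privilege)}

def build_example():
    d = GrantDiagram()
    for p in ("SELECT ON PC", "INSERT ON PC"):
        d.own("steve", p)
        for user in ("tom", "jerry"):
            d.grant("steve", user, p, grant_option=True)
        d.grant("jerry", "fred", p)
    d.grant("tom", "fred", "SELECT ON PC")
    d.grant("tom", "fred", "INSERT(model) ON PC", source="INSERT ON PC")
    return d

d = build_example()
d.revoke_cascade("steve", "jerry", "SELECT ON PC")
d.revoke_cascade("steve", "jerry", "INSERT ON PC")
```

In this sketch, after the revocation jerry holds nothing on PC, while fred keeps SELECT and INSERT(model) on PC because tom granted them independently; fred's unrestricted INSERT, which came only from jerry, is gone.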

10.1.6 Revoking Privileges - Example

11.1 Semistructured-Data Model

11.1.1 The Model’s Special Role
The semistructured-data model plays a special role in database systems:
It serves as a model suitable for integration of databases, that is, for describing the data contained in two or more databases that contain similar data with different schemas.
It serves as the underlying model for notations such as XML that are being used to share information on the Web. XML is covered in Section 11.2, which this presentation does not go into.

11.1.1 Attributes of the Model
Unlike the relational model, which relies on a rigid schema, the semistructured model is used primarily for its flexibility.
This model is sometimes called “schemaless”, though it is more accurate to call the data “self-describing”.
The data carries information about what its schema is, and that schema can vary arbitrarily, both over time and within a single database.
The advantage of this flexibility is that relationships can be added without having to change the schema or represent the relationship in more than one attribute.

11.1.2 Semistructured-Data Representation
A database of this model is a collection of nodes.
Each node is either a leaf or an interior node.
Leaf nodes have associated data, which can be of any atomic type, such as numbers and strings.
Interior nodes have one or more arcs out.
Each arc has a label, which describes the relationship between the two nodes it connects.
One interior node, called the root, has no arcs entering and represents the entire database.
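A minimal Python sketch of this representation follows. The `Node` class, the `follow` helper, and the sample data (a star, a movie, and the labels on their arcs) are hypothetical and chosen only to illustrate leaves, interior nodes, labeled arcs, and the root.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class Node:
    value: Any = None  # atomic data; meaningful only for leaf nodes
    arcs: List[Tuple[str, "Node"]] = field(default_factory=list)  # (label, target)

    def add(self, label: str, child: "Node") -> "Node":
        """Add a labeled arc from this node to `child` and return the child."""
        self.arcs.append((label, child))
        return child

    def is_leaf(self) -> bool:
        return not self.arcs

def follow(start: Node, *labels: str) -> List[Node]:
    """Return all nodes reached from `start` by following arcs with these labels in order."""
    nodes = [start]
    for label in labels:
        nodes = [child for n in nodes for (l, child) in n.arcs if l == label]
    return nodes

# The root has no arcs entering it and stands for the whole database.
root = Node()
star = root.add("star", Node())
star.add("name", Node("Ann Example"))
movie = root.add("movie", Node())
movie.add("title", Node("Some Film"))
movie.add("year", Node(1999))
star.add("starsIn", movie)  # a new relationship is just one more labeled arc

titles = [n.value for n in follow(root, "star", "starsIn", "title")]
```

Adding the `starsIn` arc required no schema change, which is the flexibility the slides describe.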

11.1.3 Information Integration
Independently developed databases are very hard to combine with one another, even when their schemas are similar.
Integration is harder still when one of them is an older, more complex database used by several applications, so that it cannot be turned off or copied into a new database. This is often referred to as the legacy-database problem.
One possible solution uses the semistructured-data model, chosen for its flexibility, in the interface that performs the integration.
That interface could translate the data of a source into semistructured data, or pass queries on to the source itself.

Data Mining: Frequent Itemsets and Clustering of Data (Sections 22.1 & 22.5)

Section 22.1 Frequent-Itemset Mining
“There is a family of problems that arise from attempts by marketers to use large databases of customer purchases to extract information about buying patterns.”
This family of problems is referred to as “frequent itemsets”.
Question: What sets of items are often bought together?
Example: Amazon

Example

22.1.1 The Market-Basket Model
In important applications, the data involves a set of items and a set of baskets.
Items: for example, all the items that a supermarket sells
Baskets: each basket is a subset of the set of items, namely a set of items that someone has bought together
Examples of baskets: a supermarket checkout, an on-line purchase

22.1.2 Basic Definition
“Suppose we are given a set of items I and a set of baskets B. Each basket b in B is a subset of I. To talk about frequent sets of items, we need a support threshold s, which is an integer. We say a set of items J ⊆ I is frequent if there are at least s baskets that contain all the items in J (perhaps along with other items). Optionally, we can express the support s as a percentage of |B|, the number of baskets in B.”
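The definition can be tried out directly on a toy dataset. The following Python sketch counts every itemset of a given size by brute force, which is only feasible for small data; the function name and the sample baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, s, size):
    """Return {itemset: count} for itemsets of `size` items found in at least s baskets."""
    counts = Counter()
    for basket in baskets:
        # Every size-element subset of the basket is contained in that basket.
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= s}

# Hypothetical baskets; each is a subset of the set of items.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"beer", "diapers"},
    {"milk", "beer", "diapers"},
]

frequent_pairs = frequent_itemsets(baskets, s=2, size=2)
```

With support threshold s = 2, the frequent pairs here are {bread, milk}, {beer, milk}, and {beer, diapers}, each contained in exactly two baskets.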

22.1.3 Association Rules
“We want to find pairs of items such that people buying the first are likely to buy the second as well.”
Association rule: a statement of the form {i1, i2, …, in} → j, where the i’s and j are items.

22.1.3 Association Rules
By itself, such a rule does not say much. Three properties determine whether it is worth knowing.
High Support: the support of this association rule is the support of the itemset {i1, i2, …, in, j}.
High Confidence: the probability of finding item j in a basket that has all of {i1, i2, …, in} is above a certain threshold, e.g., 50%. Example: “at least 50% of the people who buy diapers buy beer”

22.1.3 Association Rules
Interest: the probability of finding item j in a basket that has all of {i1, i2, …, in} is significantly higher or lower than the probability of finding j in a random basket.
In statistical terms, j correlates with {i1, i2, …, in} either positively or negatively.
The alleged relationship between diapers and beer is really a claim that the association rule {diapers} → beer has high interest in the positive direction.
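The three properties can be computed for a rule on a small dataset. In the Python sketch below, the function names and baskets are hypothetical, and interest is measured as the difference between the rule's confidence and the fraction of all baskets containing j; that particular formula is one common choice and an assumption here, since the slides only say "significantly higher or lower".

```python
def support(baskets, items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)

def confidence(baskets, lhs, j):
    """Fraction of the baskets containing all of `lhs` that also contain j."""
    return support(baskets, lhs | {j}) / support(baskets, lhs)

def interest(baskets, lhs, j):
    """Confidence minus the fraction of all baskets containing j (assumed definition)."""
    return confidence(baskets, lhs, j) - support(baskets, {j}) / len(baskets)

# Hypothetical baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"beer", "diapers"},
    {"milk", "beer", "diapers"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, "beer")
rule_interest = interest(baskets, {"diapers"}, "beer")
```

For the rule {diapers} → beer on this data, both baskets with diapers also contain beer, so the confidence is 1.0, while beer appears in 3 of the 4 baskets overall, giving a positive interest of 0.25.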

22.1.4 The Computation Model for Frequent Itemsets
In relational terms, the data could be stored as a relation: Baskets(basket, item)
However, when the Baskets relation is very large, processing it this way is too time-consuming.

22.1.4 The Computation Model for Frequent Itemsets
Compare a small local grocery store with an online store like Amazon.
The small-town store has relatively few baskets to process.
Amazon, with very many customers buying at any given time, has far more baskets than can be handled this way.

22.1.4 The Computation Model for Frequent Itemsets
Because of this issue, it is better not to treat the data as a relation.
“It is far more efficient to put the data in a file or files consisting of the baskets, in some order.”

Section 22.5 Clustering of Large-Scale Data
Clustering is the problem of taking a dataset consisting of “points” and grouping the points into some number of clusters.
Points in the same cluster should be near each other in some sense, while points in different clusters should be farther apart.

Section 22.5 Clustering of Large-Scale Data
Euclidean Distance: a distance based on the location of points within a space.
Not all distances are Euclidean, so we must also deal with points that don’t “live” anywhere in a space.

Section 22.5 Clustering of Large-Scale Data
There are two main approaches to clustering.
Agglomerative: start with each point in its own cluster and repeatedly merge nearby clusters.
Point assignment: initialize the clusters in some way and then assign each point to its best cluster.

22.5.1 Applications of Clustering
Collaborative Filtering
Cluster products (points) into groups of similar products.
Cluster together customers who have similar tastes (people who like classical music, for example).
Clustering Documents by Topic
Group together points (documents) based on their topics.
Clustering DNA Sequences
“DNA is a sequence of base-pairs, represented by the letters C, G, A, and T.”
There is a natural edit distance between DNA sequences.
These strands can sometimes change by letter substitution, insertion, or deletion.

22.5.2 Distance Measures
A distance measure on a set of points is a function d(x,y) that satisfies:
d(x,y) ≥ 0 for all points x and y
d(x,y) = 0 if and only if x = y
d(x,y) = d(y,x) (symmetry)
d(x,y) ≤ d(x,z) + d(z,y) for any points x, y, and z (triangle inequality)

22.5.2 Distance Measures
“The distance from a point to itself is 0, and the distance between any two different points is positive. The distance between points does not depend on which way you travel (symmetry), and it never reduces the distance if you force yourself to go through a particular third point (the triangle inequality).”

22.5.2 Distance Based on Norms
For any r, we can define the distance between points x = (x1, …, xn) and y = (y1, …, yn) as d(x,y) = (|x1 - y1|^r + … + |xn - yn|^r)^(1/r)
This distance is derived from the Lr-norm.
“The conventional Euclidean distance is the case r = 2, and is often called the L2-norm.”

22.5.2 Distance Based on Norms
Another distance in this family is the L1-norm distance.
It is the sum of the distances along the coordinates of the space.
It is also often called the Manhattan distance, “because it is the distance one has to travel along a rectangular grid of streets found in many cities such as Manhattan.”
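As a small illustration, the distance derived from the Lr-norm can be written as one Python function, with r = 2 giving the Euclidean distance and r = 1 the Manhattan distance. The function name `lr_distance` is made up for this sketch.

```python
def lr_distance(x, y, r):
    """Distance derived from the Lr-norm: (sum over i of |x_i - y_i|^r)^(1/r)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

p, q = (0, 0), (3, 4)
euclidean = lr_distance(p, q, 2)  # straight-line distance
manhattan = lr_distance(p, q, 1)  # distance along a rectangular grid
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 3 + 4 = 7.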

22.5.2 Other Distances
Jaccard Distance: for points x and y that are sets, d(x,y) = 1 - |x ∩ y| / |x ∪ y|
Cosine Distance: the cosine distance between two points is the angle between the vectors from the origin to those points.
Edit Distance: the edit distance between two strings is the minimum number of insertions and deletions needed to turn one into the other.
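These three distances can each be sketched in a few lines of Python. The function names are hypothetical, the Jaccard distance of two empty sets is taken to be 0 by convention (an assumption), and the edit distance allows only insertions and deletions, which lets it be computed from the length of the longest common subsequence.

```python
import math

def jaccard_distance(x, y):
    """1 - |x ∩ y| / |x ∪ y| for sets x and y (0 for two empty sets, by convention)."""
    union = x | y
    return 1.0 - len(x & y) / len(union) if union else 0.0

def cosine_distance(x, y):
    """Angle in radians between nonzero vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / norms)))  # clamp against rounding error

def edit_distance(x, y):
    """Minimum insertions plus deletions to turn x into y: len(x) + len(y) - 2 * LCS."""
    prev = [0] * (len(y) + 1)  # LCS lengths for the previous prefix of x
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return len(x) + len(y) - 2 * prev[-1]

jd = jaccard_distance({1, 2, 3}, {2, 3, 4})
cd = cosine_distance((1, 0), (0, 1))
ed = edit_distance("abcde", "acfe")
```

In the edit-distance example, "abcde" becomes "acfe" by deleting b and d and inserting f, three edits in total.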

22.5.3 Agglomerative Clustering
Agglomerative or Hierarchical Clustering
It involves starting “with every point in its own cluster”, then repeatedly finding the closest pair of clusters and merging them until a condition is met.
We need to answer two questions:
How do we measure the “closeness” of clusters?
How do we decide when to stop merging?

22.5.3 Defining Closeness
Two ways to define the closeness of two clusters C and D:
Find the minimum distance between any pair of points, one from C and one from D.
Take the average distance over all pairs of points, one from C and one from D.

22.5.3 Stopping the Merger
One approach is to “pick a number of clusters k, and keep merging until you are down to k clusters.”
This is only good if you have an idea of how many clusters there should be.
Cohesion: the degree to which the merged cluster consists of points that are all close.
“We decline to merge two clusters whose combination fails to meet the cohesion condition that we have chosen”

22.5.3 Stopping the Merger
Ways to define a cohesion score for a cluster:
Let the cohesion of a cluster be the average distance of each point to the centroid. This definition only makes sense in a Euclidean space.
Let the cohesion be the diameter, the largest distance between any pair of points in the cluster.
Let the cohesion be the average distance between pairs of points in the cluster.
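A minimal Python sketch of agglomerative clustering is given below. It assumes Euclidean points, uses the minimum-distance definition of closeness, and stops when k clusters remain; the function names are hypothetical, and the brute-force search over all cluster pairs is suitable only for small inputs.

```python
import math
from itertools import combinations

def closeness(c, d):
    """Minimum distance between any pair of points, one from c and one from d."""
    return min(math.dist(p, q) for p in c for q in d)

def agglomerate(points, k):
    """Start with every point in its own cluster; merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair of cluster indices (i < j) whose clusters are closest.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: closeness(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters.pop(j)  # j > i, so index i is unaffected by the pop
        clusters[i].extend(merged)
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
two_clusters = agglomerate(points, 2)
```

Swapping `closeness` for the average pairwise distance, or replacing the `len(clusters) > k` test with a cohesion condition on the candidate merged cluster, gives the other variants described on these slides.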

22.5.4 K-Means Algorithm
Outline of a k-means algorithm:
1. Start by choosing k initial clusters in some way. These clusters might be single points, or small sets of points.
2. For each unassigned point, place it in the “nearest” cluster.
3. Optionally, after all points are assigned to clusters, fix the centroid of each cluster (assuming the points are in a Euclidean space). Then reassign all points to the k clusters. Occasionally, some of the earliest points to be assigned will thus wind up in another cluster.
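The outline can be sketched in Python as follows. This is one simple variant under stated assumptions, not the only way to realize the outline: the initial clusters are single points supplied by the caller, the space is Euclidean, and the reassignment step is repeated until nothing changes or a round limit (`max_rounds`, a made-up parameter) is reached.

```python
import math

def kmeans(points, initial_centroids, max_rounds=10):
    """Return (assignment, centroids); assignment[i] is the cluster index of points[i]."""
    centroids = [tuple(float(c) for c in centroid) for centroid in initial_centroids]
    assignment = None
    for _ in range(max_rounds):
        # Place every point in the cluster with the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster
        assignment = new_assignment
        # Fix the centroid of each non-empty cluster as the mean of its points.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centroids = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
```

With these inputs the first two points end up in one cluster and the last two in the other, with centroids (0, 0.5) and (10, 10.5).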

22.5.5 K-Means for Large-Scale Data
BFR Algorithm
The idea is not necessarily to assign every point to a cluster, but to figure out where the centroids of the clusters are.
If we also wanted to know the cluster of every point, a further pass over the data would be needed to assign each point to its nearest centroid.

22.5.5 K-Means for Large-Scale Data
A cluster of points in an n-dimensional space is represented by 2n + 1 summary statistics:
N, the number of points in the cluster
For each dimension i, the sum of the ith coordinates of the points in the cluster, denoted SUMi
For each dimension i, the sum of the squares of the ith coordinates of the points in the cluster, denoted SUMSQi
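These statistics are enough to recover a cluster's centroid and its variance in each dimension without keeping the points themselves: the centroid's ith coordinate is SUMi / N, and the variance in dimension i is SUMSQi / N - (SUMi / N)^2. The Python sketch below shows this; the class `ClusterSummary` and its method names are hypothetical.

```python
class ClusterSummary:
    """Holds N, SUM_i, and SUMSQ_i for a cluster in n_dims dimensions (2n + 1 numbers)."""

    def __init__(self, n_dims):
        self.n = 0
        self.sums = [0.0] * n_dims
        self.sumsq = [0.0] * n_dims

    def add(self, point):
        """Fold one point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sums[i] += x
            self.sumsq[i] += x * x

    def merge(self, other):
        """Combine two summaries by adding their statistics component-wise."""
        self.n += other.n
        self.sums = [a + b for a, b in zip(self.sums, other.sums)]
        self.sumsq = [a + b for a, b in zip(self.sumsq, other.sumsq)]

    def centroid(self):
        return [s / self.n for s in self.sums]

    def variance(self):
        return [q / self.n - (s / self.n) ** 2 for s, q in zip(self.sums, self.sumsq)]

summary = ClusterSummary(n_dims=2)
summary.add((1, 2))
summary.add((3, 6))
center = summary.centroid()
spread = summary.variance()
```

Because the statistics are plain sums, merging two groups of points only requires adding their summaries, which is what makes them convenient when the points are not held in main memory.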

22.5.5 K-Means for Large-Scale Data
“During the running of the algorithm, points are divided into three classes:”
Discard Set
Points that are assigned to a cluster
These points do not appear in main memory.
They are represented only by the summary statistics for their cluster.

22.5.5 K-Means for Large-Scale Data
Compressed Set
There may be groups of points that are close to each other, and so may belong to the same cluster, but that are not close to any cluster’s current centroid.
It is therefore not yet clear which cluster these points belong to.
Like the clusters, each group is represented by its summary statistics, and its points do not appear in main memory.

22.5.5 K-Means for Large-Scale Data
Retained Set
These points are outliers.
Eventually these points will be assigned to a cluster, but for now they are retained in main memory.

Sources
Database Systems: The Complete Book, 2nd Edition, Garcia-Molina, Ullman, Widom
http://www.amazon.com