By Josh, Spencer, and Lacey

Chapters 10, 11, and 22

10.1 Security and User Authorization

Topics: Privileges, Establishing Ownership, The Privilege-Checking Process, Granting Privileges, Grant Diagrams, Revoking Privileges

User Authorization
SQL uses authorization IDs (user names) to grant privileges.
There is a special authorization ID, PUBLIC, which includes any user.
Much like a UNIX file system can grant read, write, or execute privileges to a user, SQL can grant various privileges to an authorization ID.
Databases are much more complex than file systems, so they have a more complex set of privileges.

10.1.1 Privileges
There are nine types of SQL privileges: SELECT, INSERT, DELETE, UPDATE, REFERENCES, USAGE, TRIGGER, EXECUTE, and UNDER.

The SELECT, INSERT, UPDATE, and DELETE privileges all apply to relations, and each corresponds to the SQL statement of the same name.
SELECT, INSERT, and UPDATE can optionally have a list of attributes associated with them.
For example, a user whose only privilege on PC is SELECT(model, price) can only write queries that select model or price from PC.

The REFERENCES privilege on a relation is the right to refer to that relation in an integrity constraint (Chapter 7). REFERENCES may also have attributes associated with it.
The USAGE privilege applies to several kinds of schema elements other than relations and assertions. It is the right to use that element in one’s own declarations.

The TRIGGER privilege on a relation is the right to define triggers on that relation.
The EXECUTE privilege is the right to execute a piece of code, such as a PSM procedure or function.
The UNDER privilege is the right to create subtypes of a given type.

10.1.2 Establishing Ownership
SQL elements such as schemas or modules have an owner. The owner of an element has all privileges associated with that element.
There are three ways ownership is established in SQL.
First, when a schema is created, it and all the tables and other schema elements in it are owned by the user who created it. This user thus has all possible privileges on elements of the schema.

Second, when a session is initiated by a CONNECT statement, there is an opportunity to indicate the user with an AUTHORIZATION clause:
    CONNECT TO sql-server AS conn1 AUTHORIZATION steve;
Third, an AUTHORIZATION clause can also be used in a module-creation statement. It is also acceptable to specify no owner for a module, in which case the module is publicly executable, but the privileges necessary for executing any operations in the module must come from some other source.

10.1.3 The Privilege-Checking Process
As we just discussed, each module, schema, and session has an associated user (an authorization ID, in SQL terms).
Any SQL operation has two parties:
The database elements upon which the operation is performed
The agent that causes the operation

The privileges available to the agent derive from a particular authorization ID called the current authorization ID. That ID is one of two things:
The module authorization ID, if the module that the agent is executing has an authorization ID
The session authorization ID, if not

We may execute the SQL operation only if the current authorization ID possesses all the privileges needed to carry out the operation on the database elements involved.
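
The check described above can be sketched in Python. This is only an illustration of the rule, not SQL itself: the function names, the `privileges_held` table, and the way privileges granted to PUBLIC are folded in are all assumptions made for the example.

```python
def current_authorization_id(module_auth_id, session_auth_id):
    """Module ID if the module has one, otherwise the session ID."""
    return module_auth_id if module_auth_id is not None else session_auth_id


def may_execute(needed, privileges_held, module_auth_id, session_auth_id):
    """True only if the current authorization ID holds every needed privilege."""
    current = current_authorization_id(module_auth_id, session_auth_id)
    # Assumption: privileges granted to PUBLIC are available to every user.
    held = privileges_held.get(current, set()) | privileges_held.get("PUBLIC", set())
    return set(needed) <= held


# Hypothetical privilege table: authorization ID -> privileges it holds.
privileges_held = {
    "steve": {"SELECT ON PC", "INSERT ON PC"},
    "fred": {"SELECT ON PC"},
}

# A module owned by steve runs with steve's privileges, whoever the session user is.
print(may_execute({"INSERT ON PC"}, privileges_held, "steve", "fred"))  # True
# A module with no owner falls back to the session authorization ID.
print(may_execute({"INSERT ON PC"}, privileges_held, None, "fred"))     # False
```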

Granting Privileges
SQL provides a GRANT statement that lets one user “copy” a privilege to another user. An optional grant option allows the recipient to grant the privilege to further users.
The GRANT statement takes the following general form:
    GRANT <privileges> ON <db element> TO <users> WITH GRANT OPTION
<privileges>: list of the privilege(s) to be granted
<db element>: typically a relation; if it is another type of element, the name is preceded by the type
<users>: list of the user(s) to whom the privileges are granted
WITH GRANT OPTION: optional

10.1.4 Granting Privileges - Example
Let’s say user steve is the owner of ComputerSchema, which contains the tables PC and Product. steve wants to grant the INSERT and SELECT privileges on PC and the SELECT privilege on Product to users tom and jerry, with the grant option so that they can pass the privileges on:
    GRANT SELECT, INSERT ON PC TO tom, jerry WITH GRANT OPTION;
    GRANT SELECT ON Product TO tom, jerry WITH GRANT OPTION;

Now jerry wants to grant user fred the same privileges, this time without the grant option:
    GRANT SELECT, INSERT ON PC TO fred;
    GRANT SELECT ON Product TO fred;

At the same time, tom grants fred the minimal privileges needed to insert new models from Product into PC:
    GRANT SELECT, INSERT(model) ON PC TO fred;
    GRANT SELECT ON Product TO fred;

Grant Diagrams
Because a sequence of grants can produce a complex web of grants and overlapping privileges, it is useful to represent grants by a graph called a grant diagram.
Each node of a grant diagram represents a user/privilege combination. The same privilege with and without the grant option counts as two separate nodes.
We draw an arc from one node to another to show who the privilege comes from and who it is granted to.
A single * shows that the privilege has the grant option.
A double ** shows that the privilege derives from ownership.

10.1.5 Grant Diagrams - Example
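
As a rough sketch, the grants from the 10.1.4 example can be recorded as nodes and arcs in Python. The representation is an assumption made for illustration: a privilege is a hypothetical `(action, columns, element)` tuple, and `covers` is a simplified rule that lets a holder of INSERT on PC grant INSERT(model) on PC. The second GRANT by steve is assumed to carry the grant option, since tom and jerry later pass SELECT on Product to fred.

```python
OWNER, GRANT_OPTION, PLAIN = "**", "*", ""

nodes = []  # (user, privilege, marker), in insertion order
arcs = []   # (grantor_node, grantee_node)


def covers(held, wanted):
    """True if `held` includes `wanted`: same action and element, and `held`
    is on all columns (None) or on a superset of the wanted columns."""
    (h_action, h_cols, h_elem), (w_action, w_cols, w_elem) = held, wanted
    if (h_action, h_elem) != (w_action, w_elem):
        return False
    return h_cols is None or (w_cols is not None and set(w_cols) <= set(h_cols))


def add_node(n):
    if n not in nodes:
        nodes.append(n)
    return n


def grant(grantor, grantees, privileges, with_grant_option=False):
    """Add an arc for each (grantee, privilege); the grantor needs * or **."""
    marker = GRANT_OPTION if with_grant_option else PLAIN
    for wanted in privileges:
        source = next((n for n in nodes
                       if n[0] == grantor and n[2] != PLAIN and covers(n[1], wanted)),
                      None)
        if source is None:
            raise PermissionError(f"{grantor} may not grant {wanted}")
        for grantee in grantees:
            arc = (source, add_node((grantee, wanted, marker)))
            if arc not in arcs:
                arcs.append(arc)


SEL_PC = ("SELECT", None, "PC")
INS_PC = ("INSERT", None, "PC")
INS_PC_MODEL = ("INSERT", ("model",), "PC")
SEL_PRODUCT = ("SELECT", None, "Product")

# steve owns the schema, so his nodes carry the ** marker.
for p in (SEL_PC, INS_PC, SEL_PRODUCT):
    add_node(("steve", p, OWNER))

grant("steve", ["tom", "jerry"], [SEL_PC, INS_PC, SEL_PRODUCT], with_grant_option=True)
grant("jerry", ["fred"], [SEL_PC, INS_PC, SEL_PRODUCT])
grant("tom", ["fred"], [SEL_PC, INS_PC_MODEL, SEL_PRODUCT])

for src, dst in arcs:
    print(src, "->", dst)
```

Note that fred ends up with a single SELECT on PC node that has two incoming arcs, one from jerry and one from tom.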

Revoking Privileges
A granted privilege can be revoked at any time. Revocation may need to cascade: if the revoked privilege was held with the grant option and has been passed on to other users, those privileges may have to be revoked as well.
The REVOKE statement has the form:
    REVOKE <privileges> ON <db element> FROM <users>
The statement must end with either CASCADE or RESTRICT.
CASCADE: also revokes any privileges that were granted only because of the revoked privileges
RESTRICT: the revoke statement is not carried out if it would require other privileges to be revoked as well
It is possible to replace REVOKE by REVOKE GRANT OPTION FOR, in which case the core privileges themselves remain, but the option to grant them to others is removed.

10.1.6 Revoking Privileges - Example
Suppose steve revokes the privileges he granted to jerry earlier:
    REVOKE SELECT, INSERT ON PC FROM jerry CASCADE;
    REVOKE SELECT ON Product FROM jerry CASCADE;
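
The effect of the cascade can be sketched in Python on a hand-built grant diagram for this example. The sketch assumes the usual rule that a node survives only if it is still reachable from an ownership (**) node; the function names and the string encoding of privileges are hypothetical.

```python
def revoke(arcs, revoker, user, privileges):
    """Remove the arcs by which `revoker` granted `privileges` to `user`."""
    return [(src, dst) for src, dst in arcs
            if not (src[0] == revoker and dst[0] == user and dst[1] in privileges)]


def surviving_nodes(owner_nodes, arcs):
    """Nodes still reachable from an ownership (**) node along grant arcs."""
    reachable = set(owner_nodes)
    frontier = list(owner_nodes)
    while frontier:
        current = frontier.pop()
        for src, dst in arcs:
            if src == current and dst not in reachable:
                reachable.add(dst)
                frontier.append(dst)
    return reachable


privs = ["SELECT ON PC", "INSERT ON PC", "SELECT ON Product"]
owners = [("steve", p, "**") for p in privs]

arcs = []
for p in privs:
    for user in ("tom", "jerry"):            # steve grants with the grant option
        arcs.append((("steve", p, "**"), (user, p, "*")))
    arcs.append((("jerry", p, "*"), ("fred", p, "")))   # jerry passes them on to fred
# tom grants fred SELECT on both tables and INSERT on the model column only
arcs.append((("tom", "SELECT ON PC", "*"), ("fred", "SELECT ON PC", "")))
arcs.append((("tom", "INSERT ON PC", "*"), ("fred", "INSERT(model) ON PC", "")))
arcs.append((("tom", "SELECT ON Product", "*"), ("fred", "SELECT ON Product", "")))

remaining = revoke(arcs, "steve", "jerry", privs)
alive = surviving_nodes(owners, remaining)

print(sorted(p for user, p, _ in alive if user == "jerry"))
# []
print(sorted(p for user, p, _ in alive if user == "fred"))
# ['INSERT(model) ON PC', 'SELECT ON PC', 'SELECT ON Product']
```

Under these assumptions jerry loses everything, and fred loses only the full INSERT on PC, which came from jerry alone. The privileges fred also received from tom remain.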


11.1 Semistructured-Data Model

11.1.1 The Model’s Special Role
The semistructured-data model plays a special role in database systems:
It serves as a model suitable for integration of databases, that is, for describing the data contained in two or more databases that contain similar data with different schemas.
It serves as the underlying model for notations such as XML that are being used to share information on the Web. XML is covered later in Chapter 11 and is not discussed here.

11.1.1 Attributes of the Model
Unlike the relational model, which imposes a rigid schema, the semistructured model is used primarily for its flexibility. The model is sometimes called “schemaless”, though it is more accurate to call the data “self-describing”: the data carries information about what its schema is, and that schema can vary arbitrarily, both over time and within a single database. The advantage of this flexibility is that relationships can be added without having to change the schema or represent the relationship in more than one attribute.

11.1.2 Semistructured-Data Representation
A database of this model is a collection of nodes. Each node is either a leaf or interior node. Leaf nodes have associated data and the type can be any atomic type, like numbers and strings. Interior nodes have one or more arcs out. Each arc has a label, which describes the relation between the two connected nodes. One interior node, called the root, has no arcs entering and represents the entire database.
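
A minimal Python sketch of this representation follows. The `Node` class and the PC data values are hypothetical and chosen only to show leaves, labeled arcs, and a root. Note that the two `pc` nodes do not carry the same set of arcs, which is the flexibility the model is used for.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    value: object = None                      # data, set only on leaf nodes
    arcs: list = field(default_factory=list)  # (label, child) pairs leaving this node

    def add(self, label, child):
        self.arcs.append((label, child))
        return child

    def is_leaf(self):
        return not self.arcs

    def follow(self, label):
        """Children reached by arcs carrying the given label."""
        return [child for arc_label, child in self.arcs if arc_label == label]


root = Node()                   # no arcs enter the root; it stands for the whole database
pc1 = root.add("pc", Node())
pc1.add("model", Node(1001))
pc1.add("price", Node(2114))
pc2 = root.add("pc", Node())
pc2.add("model", Node(1002))
pc2.add("speed", Node(2.10))    # this pc has a "speed" arc and no "price" arc

models = [leaf.value for pc in root.follow("pc") for leaf in pc.follow("model")]
print(models)  # [1001, 1002]
```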

11.1.3 Information Integration
Databases are often hard to combine with other databases, even when their schemas are similar. Integration is harder still when one of them is a much older and more complex database that is used by several applications, so that it cannot be turned off or copied into a new database. This is often referred to as the legacy-database problem. One possible solution uses the semistructured-data model, for its flexibility, in the interface that integrates the sources. The interface can translate the data of a source into semistructured data, or pass queries on to the source itself.

22.1 & 22.5 Data Mining Frequent Item-sets and Clustering of Data

Section 22.1 Frequent-Itemset Mining
“There is a family of problems that arise from attempts by marketers to use large databases of customer purchases to extract information about buying patterns.”
This family of problems is referred to as “frequent itemsets”.
Question: What sets of items are often bought together?
Example: Amazon


22.1.1 The Market-Basket Model
In important applications, the data involves a set of items and a set of baskets.
Items: for example, all the items that a supermarket sells
Baskets: each basket is a subset of the set of items, that is, a set of items that someone has bought together, for example at a supermarket checkout or in an on-line purchase

Basic Definition
“Suppose we are given a set of items I and a set of baskets B. Each basket b in B is a subset of I. To talk about frequent sets of items, we need a support threshold s, which is an integer. We say a set of items J ⊆ I is frequent if there are at least s baskets that contain all the items in J (perhaps along with other items). Optionally, we can express the support s as a percentage of |B|, the number of baskets in B.”
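
The definition can be tried out with a brute-force Python sketch. The baskets below are made-up data, and `frequent_itemsets` is a hypothetical helper that counts every itemset of a given size; it illustrates the definition only and is not an efficient mining algorithm.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, s, size):
    """Itemsets of `size` items contained in at least `s` baskets."""
    counts = Counter()
    for basket in baskets:
        # Each basket contributes once to every itemset it contains.
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= s}


baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"milk", "diapers"},
]

print(frequent_itemsets(baskets, s=3, size=2))
# {('beer', 'diapers'): 3}
```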

Association Rules
“We want to find pairs of items such that people buying the first are likely to buy the second as well.”
An association rule is a statement of the form {i1, i2, …, in} → j, where the i’s and j are items.

By itself, such a rule doesn’t say much. Three properties make a rule worth knowing: high support, high confidence, and interest.
High Support: the support of the association rule, which is the support of the itemset {i1, i2, …, in, j}, is high.
High Confidence: the probability of finding item j in a basket that has all of {i1, i2, …, in} is above a certain threshold, e.g., 50%. For example, “at least 50% of the people who buy diapers buy beer”.

22.1.3 Association Rules: Interest
The probability of finding item j in a basket that has all of {i1, i2, …, in} is significantly higher or lower than the probability of finding j in a random basket. In statistical terms, j correlates with {i1, i2, …, in} either positively or negatively. The alleged relationship between diapers and beer is really a claim that the association rule {diapers} → beer has high interest in the positive direction.
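
Support, confidence, and interest can be computed directly from a list of baskets. In the Python sketch below the baskets are made-up data and the helper names are hypothetical. Interest is measured here as the difference between the rule's confidence and the fraction of all baskets that contain j, which is one simple way to quantify "significantly higher or lower".

```python
def support(baskets, itemset):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= set(basket))


def confidence(baskets, left, j):
    """Fraction of the baskets containing all of `left` that also contain j.
    Assumes at least one basket contains all of `left`."""
    return support(baskets, set(left) | {j}) / support(baskets, left)


def interest(baskets, left, j):
    """Confidence of the rule minus the probability of j in a random basket."""
    return confidence(baskets, left, j) - support(baskets, {j}) / len(baskets)


baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"milk", "diapers"},
]

# Rule {diapers} -> beer
print(support(baskets, {"diapers", "beer"}))     # 3
print(confidence(baskets, {"diapers"}, "beer"))  # 0.75
print(interest(baskets, {"diapers"}, "beer"))    # positive: 0.75 versus 0.6 overall
```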

22.1.4 The Computation Model for Frequent Itemsets
In relational terms, the data could be stored as a relation Baskets(basket, item). The problem with this view is that the Baskets relation may be very large, and processing it as a relation would be too time-consuming.

Compare a small local grocery store with an online store like Amazon. The small-town store has relatively few baskets to worry about. Amazon, with huge numbers of people using it at any given time, has far too many baskets to handle this way.

Because of this issue, it is better to not treat it as a relation. “It is far more efficient to put the data in a file or files consisting of the baskets, in some order.”

Section 22.5 Clustering of Large-Scale Data
Clustering is the problem of taking a dataset consisting of “points” and grouping the points into some number of clusters. Points in the same cluster should be near each other in some sense, while points in different clusters should be farther apart.

Euclidean Distance
A Euclidean distance is based on the location of points within a space. Not all distances are Euclidean, so we sometimes have to deal with points that don’t “live” anywhere in space.

There are two broad approaches to clustering:
Agglomerative: start with each point in its own cluster and repeatedly merge nearby clusters
Point assignment: initialize the clusters in some way and then assign each point to its best cluster

22.5.1 Applications of Clustering
Collaborative Filtering:
  Cluster products (points) into groups of similar products
  Cluster together customers who have similar tastes (people who like classical music, for example)
Clustering Documents by Topic:
  Group together points (documents) based on their topics
Clustering DNA Sequences:
  “DNA is a sequence of base-pairs, represented by the letters C, G, A, and T.”
  There is a natural edit distance between DNA sequences, since these strands can sometimes change by letter substitution, insertion, or deletion

Distance Measures
A distance measure on a set of points is a function d(x,y) that satisfies:
d(x,y) ≥ 0 for all points x and y
d(x,y) = 0 if and only if x = y
d(x,y) = d(y,x) (symmetry)
d(x,y) ≤ d(x,z) + d(z,y) for any points x, y, and z (triangle inequality)

“The distance from a point to itself is 0, and the distance between any two different points is positive. The distance between points does not depend on which way you travel (symmetry), and it never reduces the distance if you force yourself to go through a particular third point (the triangle inequality).”

22.5.2 Distance Based on Norms
One way to define the distance between points x = (x1, …, xn) and y = (y1, …, yn) in an n-dimensional space is, for any r:
    d(x,y) = (|x1 − y1|^r + |x2 − y2|^r + … + |xn − yn|^r)^(1/r)
This distance is derived from the Lr-norm. “The conventional Euclidean distance is the case r = 2, and is often called the L2-norm.”
Another common case is r = 1, the L1-norm. It is the sum of the distances along the coordinates of the space. It is also often called the Manhattan distance, “because it is the distance one has to travel along a rectangular grid of streets found in many cities such as Manhattan.”
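
As a quick check of the two cases, here is a small Python sketch of the distance derived from the Lr-norm, using the standard definition (the r-th root of the sum of the r-th powers of the coordinate differences). The function name `lr_distance` is hypothetical.

```python
def lr_distance(x, y, r):
    """Distance between points x and y derived from the Lr-norm."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)


x, y = (0, 0), (3, 4)
print(lr_distance(x, y, 1))  # 7.0, Manhattan distance: 3 along one axis plus 4 along the other
print(lr_distance(x, y, 2))  # 5.0, Euclidean distance: the hypotenuse of a 3-4-5 triangle
```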

22.5.2 Other Distances
Jaccard Distance: for points x and y that are sets, d(x,y) = 1 − |x ∩ y| / |x ∪ y|
Cosine Distance: the cosine distance between two points, viewed as vectors, is the angle between the vectors
Edit Distance: the edit distance between two strings is the number of insertions and deletions needed to turn one string into the other
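
The three measures can be sketched in Python as follows. The function names are hypothetical. The edit distance here counts insertions and deletions only, computed through the longest common subsequence; other variants of edit distance also count substitutions as a single operation.

```python
import math


def jaccard_distance(x, y):
    """1 - |x ∩ y| / |x ∪ y| for two sets (0.0 if both are empty)."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0
    return 1 - len(x & y) / len(x | y)


def cosine_distance(x, y):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def edit_distance(s, t):
    """Insertions plus deletions needed to turn s into t:
    len(s) + len(t) - 2 * (length of the longest common subsequence)."""
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0]
        for j, b in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return len(s) + len(t) - 2 * prev[-1]


print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
print(cosine_distance((1, 0), (0, 1)))         # pi/2, the vectors are perpendicular
print(edit_distance("ACGT", "AGT"))            # 1, delete the C
```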

22.5.3 Agglomerative Clustering
Agglomerative or Hierarchical Clustering
It involves starting “with every point in its own cluster”, then repeatedly finding the closest pair of clusters and merging them until a stopping condition is met. We need to answer two questions:
How do we measure the “closeness” of clusters?
How do we decide when to stop merging?

Defining Closeness
Two ways to define the closeness of two clusters C and D:
Take the minimum distance between any pair of points, one from C and one from D
Average the distance over all pairs of points, one from C and one from D

Stopping the Merger
One approach is to “pick a number of clusters k, and keep merging until you are down to k clusters.” This only works well if you have an idea of how many clusters there should be.
Another approach uses cohesion, the degree to which the merged cluster consists of points that are all close. “We decline to merge two clusters whose combination fails to meet the cohesion condition that we have chosen”.
Ways to define a cohesion score for a cluster:
Let the cohesion of a cluster be the average distance of each point to the centroid. This definition only makes sense in a Euclidean space.
Let the cohesion be the diameter, the largest distance between any pair of points in the cluster.
Let the cohesion be the average distance between pairs of points in the cluster.
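
A small Python sketch of agglomerative clustering under two of the choices above: closeness is the minimum distance between a point of one cluster and a point of the other, and merging stops when k clusters remain. The function names and the sample points are hypothetical, and the pairwise search is the naive version, suitable only for small inputs. `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations


def cluster_distance(c, d):
    """Minimum Euclidean distance between a point of c and a point of d."""
    return min(math.dist(p, q) for p in c for q in d)


def agglomerate(points, k):
    """Start with one cluster per point; merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # combinations yields i < j, so popping j leaves index i valid.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters


points = [(0, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
print(agglomerate(points, k=2))
# [[(0, 0), (0, 1)], [(5, 5), (6, 5), (5, 6)]]
```

Swapping in a cohesion test would only change the loop condition: instead of counting clusters, the loop would stop once the best candidate merge fails the chosen cohesion condition.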

22.5.4 K-Means Algorithm
Outline of a k-means algorithm:
1. Start by choosing k initial clusters in some way. These clusters might be single points, or small sets of points.
2. For each unassigned point, place it in the “nearest” cluster.
3. Optionally, after all points are assigned to clusters, fix the centroid of each cluster (assuming the points are in a Euclidean space). Then reassign all points to the k clusters. Occasionally, some of the earliest points to be assigned will thus wind up in another cluster.
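
The outline can be sketched in Python as follows. This is a simplified reading, not a definitive implementation: the initial clusters are single seed points, "nearest" means Euclidean distance to a cluster's seed or centroid, and the first pass measures distance to the fixed seeds rather than to centroids that move as points arrive. The function names and sample points are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centers):
    """Place each point in the cluster whose center is nearest."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def k_means_outline(points, seeds):
    # Steps 1 and 2: the seeds are the initial clusters; assign every point.
    clusters = assign(points, seeds)
    # Step 3: fix the centroids (keep the seed if a cluster is empty), then reassign.
    centers = [centroid(c) if c else seed for c, seed in zip(clusters, seeds)]
    return centers, assign(points, centers)


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers, clusters = k_means_outline(points, seeds=[(0, 0), (10, 10)])
print(centers)   # [(0.0, 1.0), (10.0, 11.0)]
print(clusters)  # [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
```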

22.5.5 K-Means for Large-Scale Data
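The idea is not necessarily to assign every point to a cluster, but to figure out where the centroids of the clusters are. If we did want to know the cluster of every point, that would take a further pass over the data.
BFR Algorithm
BFR is the extension of k-means described here for data that is too large to fit in main memory.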

For points in an n-dimensional space, each cluster is represented by 2n + 1 summary statistics:
N, the number of points in the cluster
For each dimension i, the sum of the ith coordinates of the points in the cluster, denoted SUMi
For each dimension i, the sum of the squares of the ith coordinates of the points in the cluster, denoted SUMSQi
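
The Python sketch below shows why these statistics are enough to describe a cluster without keeping its points. The centroid in dimension i is SUMi / N, and the variance in dimension i is SUMSQi / N − (SUMi / N)². The class name `ClusterSummary` and its methods are hypothetical.

```python
class ClusterSummary:
    """Holds N, SUM_i and SUMSQ_i for one cluster: 2n + 1 numbers in n dimensions."""

    def __init__(self, dimensions):
        self.n = 0
        self.sum = [0.0] * dimensions
        self.sumsq = [0.0] * dimensions

    def add(self, point):
        """Fold a point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]


summary = ClusterSummary(dimensions=2)
for point in [(1, 2), (3, 6)]:
    summary.add(point)

print(summary.n, summary.sum, summary.sumsq)  # 2 [4.0, 8.0] [10.0, 40.0]
print(summary.centroid())                     # [2.0, 4.0]
print(summary.variance())                     # [1.0, 4.0]
```

Two summaries can also be merged by adding their N, SUM, and SUMSQ values component by component, which is what makes this representation convenient for large data.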

“During the running of the algorithm, points are divided into three classes:”
Discard Set
Points that have been assigned to a cluster. These points do not appear in main memory; they are represented only by the summary statistics for their cluster.

Compressed Set
There may be groups of points that are close to each other, and so may belong to the same cluster, but that are not close to any cluster’s current centroid. It is therefore unclear which cluster these points actually belong to. Each group is represented by its summary statistics, just like the clusters, and its points do not appear in main memory.

Retained Set
These points are outliers: they are not close to any cluster or to any compressed group. Eventually they will be assigned to a cluster, but for now they are retained in main memory.

Sources
Database Systems: The Complete Book, 2nd Edition, Garcia-Molina, Ullman, Widom
