1 Review

2 Topics to review for the final exam
Evaluation of classification: predicting performance, confidence intervals, ROC analysis, precision, recall, F-measure
Association analysis: APRIORI, FP-Tree/FP-Growth, maximal and closed frequent itemsets, cross-support, h-measure, confidence vs. interestingness, mining sequences, mining graphs
Cluster analysis: K-means, bisecting K-means, SOM, DBSCAN, hierarchical clustering
Web search: IR, reputation ranking
A single-side help sheet is allowed.

3 Mining Association Rules
Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup (these itemsets are called frequent itemsets).
2. Rule generation: generate high-confidence rules from each frequent itemset.
The computational requirements of frequent itemset generation are more expensive than those of rule generation. Candidate itemsets are generated and then tested against the database to see whether they are frequent.
Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
Stated contrapositively: if an itemset is infrequent, then all of its supersets must be infrequent too.

4 Illustrating Apriori Principle
(Figure: itemset lattice; once an itemset is found to be infrequent, all of its supersets are pruned.)

5 Apriori Algorithm
Method:
Let k = 1. Generate frequent itemsets of length 1.
Repeat until no new frequent itemsets are identified:
  k = k + 1
  Generate length-k candidate itemsets from the length-(k-1) frequent itemsets.
  Prune candidate itemsets containing subsets of length k-1 that are infrequent.
  Count the support of each candidate by scanning the DB and eliminate candidates that are infrequent, leaving only those that are frequent.
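A minimal Python sketch of this loop, assuming transactions are given as sets of items; it also includes the Fk-1 × Fk-1 candidate-generation step described on the next slides. The toy transactions, the minsup value, and the function name are illustrative assumptions, not from the slides.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        k += 1
        # F(k-1) x F(k-1) generation: merge itemsets whose first k-2 items are identical
        prev = sorted(tuple(sorted(s)) for s in frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 2] == prev[j][:k - 2]:
                    candidates.add(frozenset(prev[i]) | frozenset(prev[j]))
        # Candidate pruning: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
        # Support counting: one scan of the DB, then eliminate infrequent candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(frequent)
    return all_frequent

# Illustrative transactions over the items mentioned in the slides
transactions = [{"Bread", "Milk"},
                {"Bread", "Diapers", "Beer", "Eggs"},
                {"Milk", "Diapers", "Beer", "Cola"},
                {"Bread", "Milk", "Diapers", "Beer"},
                {"Bread", "Milk", "Diapers", "Cola"}]
print(apriori(transactions, minsup=3))
```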

6 Fk-1Fk-1 Method Merge a pair of frequent (k-1)­itemsets only if their first k-2 items are identical. E.g. frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3­itemset {Bread, Diapers, Milk}. {Bread, Diapers, Milk}

7 Fk-1Fk-1 Completeness We don’t merge {Beer, Diapers} with {Diapers, Milk} because the first item in both itemsets is different. Do we loose {Beer, Diapers, Milk}? Prunning Before checking a candidate against the DB, a candidate pruning step is needed to ensure that the remaining subsets of k-1 elements are frequent. Counting Finally, the surviving candidates are tested (counted) on the DB.

8 Rule Generation
Computing the confidence of an association rule does not require additional scans of the transactions. Consider {1, 2} → {3}. The rule confidence is σ({1, 2, 3}) / σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent too, and we already know the supports of all frequent itemsets.
Initially, all high-confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidate rules. For example, if {acd} → {b} and {abd} → {c} are high-confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules. The candidate rules are then checked for confidence.
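A hedged sketch of the first step (rules with a single item in the consequent), reusing the itemset supports returned by the Apriori sketch above; minconf and the function name are assumptions for illustration.

```python
def one_item_consequent_rules(frequent, minconf):
    """Rules X -> {y} from frequent itemsets; `frequent` maps frozenset -> support count,
    so all confidences come from already-known supports (no extra DB scan)."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            antecedent = itemset - {y}
            confidence = support / frequent[antecedent]   # sigma(X u {y}) / sigma(X)
            if confidence >= minconf:
                rules.append((antecedent, frozenset([y]), confidence))
    return rules

# e.g. one_item_consequent_rules(apriori(transactions, minsup=3), minconf=0.6)
```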

9 Other concepts and algorithms
FP-Tree/FP-Growth (see the corresponding slide set and the Assignment 2 solution)
Maximal frequent itemsets
Closed itemsets
Interest factor
Mining sequences

10 Maximal Frequent Itemsets
An itemset is maximal frequent if none of its immediate supersets is frequent. Maximal frequent itemsets form the smallest set of itemsets from which all frequent itemsets can be derived.
(Figure: itemset lattice with the border separating frequent from infrequent itemsets; the maximal frequent itemsets lie just inside the border.)

11 Closed Itemsets
Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets. An additional pass over the data set is needed to determine the support counts of the non-maximal frequent itemsets. It may be desirable to have a minimal representation of frequent itemsets that preserves the support information. Such a representation is the set of closed frequent itemsets.
An itemset is closed if none of its immediate supersets has the same support as the itemset. An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

12 Maximal vs Closed Frequent Itemsets
(Figure: itemset lattice annotated with transaction ids, minimum support = 2; itemsets are marked as closed but not maximal, closed and maximal, or not supported by any transaction. # Closed = 9, # Maximal = 4.)

13 Deriving Frequent Itemsets From Closed Frequent Itemsets
E.g., consider the frequent itemset {a, d}. Because the itemset is not closed, its support count must be identical to one of its immediate supersets. The key is to determine which superset among {a, b, d}, {a, c, d}, or {a, d, e} has exactly the same support count as {a, d}. By the Apriori principle, the support of {a, d} must equal the largest support among its supersets. So the support of {a, d} is identical to the support of {a, c, d}.
(Figure: itemset lattice over items A-E, with the transaction-id list of each itemset.)

14 Support counting using closed frequent itemsets
Let C denote the set of closed frequent itemsets.
Let kmax denote the maximum length of a closed frequent itemset.
Fkmax = {f | f ∈ C, |f| = kmax}        {frequent itemsets of size kmax}
for k = kmax − 1 downto 1 do
    Set Fk to be all sub-itemsets of length k from the frequent itemsets in Fk+1
    for each f ∈ Fk do
        if f ∉ C then
            f.support = max{f'.support | f' ∈ Fk+1, f ⊂ f'}
        end if
    end for
end for
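A small Python sketch of this derivation, assuming the closed frequent itemsets and their support counts are given as a dictionary; the function name and the made-up example data are illustrative.

```python
from itertools import combinations

def supports_from_closed(closed):
    """closed: {frozenset: support}. Return the support of every frequent itemset."""
    support = dict(closed)
    kmax = max(len(f) for f in closed)
    level = {f for f in closed if len(f) == kmax}   # frequent itemsets of size kmax
    for k in range(kmax - 1, 0, -1):
        # size-k subsets of the frequent itemsets one level up, plus closed itemsets of size k
        level_k = {frozenset(s) for f in level for s in combinations(f, k)}
        level_k |= {f for f in closed if len(f) == k}
        for f in level_k:
            if f not in closed:
                # support of a non-closed itemset = max support among its frequent supersets
                support[f] = max(support[sup] for sup in level if f < sup)
        level = level_k
    return support

# Made-up example: three closed frequent itemsets and their support counts
closed = {frozenset("acd"): 2, frozenset("abd"): 3, frozenset("ad"): 4}
print(supports_from_closed(closed))
```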

15 Contingency Table
Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:
         Y      ¬Y
X       f11    f10    f1+
¬X      f01    f00    f0+
        f+1    f+0    |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

16 Pitfall of Confidence

          Coffee   ¬Coffee   Total
Tea          150        50     200
¬Tea         750       150     900
Total        900       200    1100

Consider the association rule Tea → Coffee.
Confidence = P(Coffee, Tea) / P(Tea) = P(Coffee | Tea) = 150/200 = 0.75 (seems quite high).
But P(Coffee) = 900/1100 ≈ 0.82.
Thus, knowing that a person is a tea drinker actually decreases his/her probability of being a coffee drinker from about 82% to 75%! Although the confidence is high, the rule is misleading.
In fact, P(Coffee | ¬Tea) = P(Coffee, ¬Tea) / P(¬Tea) = 750/900 ≈ 0.83.

17 Interest Factor
A measure that takes statistical dependence into account:
Interest(A, B) = (N × f11) / (f1+ × f+1) = P(A, B) / (P(A) × P(B))
f11/N is an estimate of the joint probability P(A, B); f1+/N and f+1/N are estimates of P(A) and P(B), respectively. If A and B are statistically independent, then P(A, B) = P(A) × P(B), and thus the Interest is 1.

18 Example: Interest

          Coffee   ¬Coffee   Total
Tea          150        50     200
¬Tea         750       150     900
Total        900       200    1100

Association rule: Tea → Coffee.
Interest = (150 × 1100) / (200 × 900) ≈ 0.92 (< 1, therefore Tea and Coffee are negatively correlated).
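A quick check of this computation in Python, using the contingency-table counts above (the function name is just for illustration):

```python
def interest(f11, f1_plus, f_plus1, n):
    """Interest factor (lift) computed from contingency-table counts."""
    return (n * f11) / (f1_plus * f_plus1)

# Tea -> Coffee, using the table above
print(interest(f11=150, f1_plus=200, f_plus1=900, n=1100))   # ~0.92: negatively correlated
```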

19 Cross-support patterns
Cross-support patterns relate a high-frequency item, such as milk, to a low-frequency item, such as caviar. They are likely to be spurious because their correlations tend to be weak. E.g., the confidence of {caviar} → {milk} is likely to be high, yet the pattern is spurious, since there is probably no correlation between caviar and milk.
Observation: on the other hand, the confidence of {milk} → {caviar} is very low.
Cross-support patterns can be detected and eliminated by examining the lowest-confidence rule that can be extracted from a given itemset: this confidence must be above a certain level for the pattern not to be considered a cross-support pattern.

20 Finding lowest confidence
Recall the anti-monotone property of confidence:
conf({i1, i2} → {i3, i4, …, ik}) ≥ conf({i1, i2, i3} → {i4, …, ik})
This property suggests that confidence never increases as we shift items from the left-hand side to the right-hand side of an association rule. Hence, the lowest-confidence rule that can be extracted from a frequent itemset contains only one item on its left-hand side.

21 Finding lowest confidence
Given a frequent itemset {i1, i2, i3, i4, …, ik}, the rule
{ij} → {i1, i2, …, ij-1, ij+1, …, ik}
has the lowest confidence if s(ij) = max{s(i1), s(i2), …, s(ik)}.
This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent.

22 Finding lowest confidence
Summarizing, the lowest confidence attainable from a frequent itemset {i1, i2, i3, i4, …, ik} is
s({i1, i2, …, ik}) / max{s(i1), s(i2), …, s(ik)}
This is also known as the h-confidence or all-confidence measure.
Cross-support patterns can be eliminated by ensuring that the h-confidence values of the patterns exceed some user-specified threshold hc.
h-confidence is anti-monotone, i.e., h-confidence({i1, i2, …, ik}) ≥ h-confidence({i1, i2, …, ik+1}), and can thus be incorporated directly into the mining algorithm.
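A minimal sketch of an h-confidence filter, assuming itemset supports are available, e.g. from the Apriori sketch earlier; the function names and threshold handling are illustrative assumptions.

```python
def h_confidence(itemset, support):
    """support: {frozenset: support count}. h-confidence of the given itemset."""
    items = frozenset(itemset)
    return support[items] / max(support[frozenset([i])] for i in items)

def drop_cross_support(frequent, hc):
    """Keep only itemsets whose h-confidence meets the user-specified threshold hc."""
    return {f: s for f, s in frequent.items()
            if len(f) == 1 or h_confidence(f, frequent) >= hc}
```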

23 Examples of Sequences
Web sequence:
<{Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping}>
Purchase history of a given customer:
<{Java in a Nutshell, Intro to Servlets} {EJB Patterns} …>
Sequence of classes taken by a computer science major:
<{Algorithms and Data Structures, Introduction to Operating Systems} {Database Systems, Computer Architecture} {Computer Networks, Software Engineering} {Computer Graphics, Parallel Programming} …>

24 Formal Definition of a Sequence
A sequence is an ordered list of elements (transactions): s = <e1 e2 e3 …>
Each element contains a collection of events (items): ei = {i1, i2, …, ik}
Each element is attributed to a specific time or location.
A k-sequence is a sequence that contains k events (items).
(Figure: a sequence of elements (transactions), each containing events (items) such as E1, E2, E3, E4.)

25 Formal Definition of a Subsequence
A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin.

Data sequence                    Subsequence      Contained?
<{2,4} {3,5,6} {8}>              <{2} {3,5}>      Yes
<{1,2} {3,4}>                    <{1} {2}>        No
<{2,4} {2,4} {2,5}>              <{2} {4}>        Yes

The support of a subsequence w is the fraction of data sequences that contain w.
A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).
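A short Python sketch of this containment test, representing a sequence as a list of sets; the representation and function name are assumptions for illustration.

```python
def contains(data_seq, sub_seq):
    """True if sub_seq (a list of sets) is contained in data_seq (a list of sets)."""
    i = 0  # current position in data_seq
    for a in sub_seq:
        # greedily find the next element of data_seq that is a superset of a
        while i < len(data_seq) and not a <= data_seq[i]:
            i += 1
        if i == len(data_seq):
            return False
        i += 1
    return True

print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))   # True
print(contains([{1, 2}, {3, 4}], [{1}, {2}]))              # False
print(contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}]))      # True
```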

26 APRIORI-like Algorithm
Make the first pass over the sequence database to yield all the 1-element frequent sequences.
Repeat until no new frequent sequences are found:
  Candidate generation: merge pairs of frequent subsequences found in the (k-1)-th pass to generate candidate sequences that contain k items.
  Candidate pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences.
  Support counting: make a new pass over the sequence database to find the support of these candidate sequences.
  Eliminate candidate k-sequences whose actual support is less than minsup.

27 Candidate Generation
Base case (k=2): merging two frequent 1-sequences <{i1}> and <{i2}> will produce four candidate 2-sequences: <{i1} {i2}>, <{i2} {i1}>, <{i1, i2}>, <{i2, i1}>.
General case (k>2): a frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2. The resulting candidate is the sequence w1 extended with the last event of w2:
  If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1.
  Otherwise, the last event in w2 becomes a separate element appended to the end of w1.

28 Candidate Generation Examples
Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> produces the candidate sequence <{1} {2 3} {4 5}>, because the last two events in w2 (4 and 5) belong to the same element.
Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> produces the candidate sequence <{1} {2 3} {4} {5}>, because the last two events in w2 (4 and 5) do not belong to the same element.
Finally, the sequences <{1} {2} {3}> and <{1} {2 5}> do not have to be merged. Why? Because removing the first event from the first sequence does not give the same subsequence as removing the last event from the second sequence. If <{1} {2 5} {3}> is a viable candidate, it will be generated by merging a different pair of sequences, <{1} {2 5}> and <{2 5} {3}>.
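A rough Python sketch of this merge rule, representing a sequence as a list of lists with items kept in order within each element; the representation and names are assumptions, and the three calls reproduce the examples above.

```python
def drop_first_event(seq):
    head, rest = seq[0], seq[1:]
    return ([head[1:]] if len(head) > 1 else []) + rest

def drop_last_event(seq):
    body, tail = seq[:-1], seq[-1]
    return body + ([tail[:-1]] if len(tail) > 1 else [])

def merge(w1, w2):
    """Candidate k-sequence from merging (k-1)-sequences w1 and w2, or None if they don't merge."""
    if drop_first_event(w1) != drop_last_event(w2):
        return None
    last_event = w2[-1][-1]
    if len(w2[-1]) > 1:
        # last two events of w2 share an element: extend w1's last element
        return w1[:-1] + [w1[-1] + [last_event]]
    # otherwise append the last event of w2 as a separate element
    return w1 + [[last_event]]

print(merge([[1], [2, 3], [4]], [[2, 3], [4, 5]]))    # [[1], [2, 3], [4, 5]]
print(merge([[1], [2, 3], [4]], [[2, 3], [4], [5]]))  # [[1], [2, 3], [4], [5]]
print(merge([[1], [2], [3]], [[1], [2, 5]]))          # None (no merge)
```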

29 Example: Candidate Generation and Pruning
Frequent 3-sequences: <{1} {2} {3}>, <{1} {2 5}>, <{1} {5} {3}>, <{2} {3} {4}>, <{2 5} {3}>, <{3} {4} {5}>, <{5} {3 4}>
Candidate generation: <{1} {2} {3} {4}>, <{1} {2 5} {3}>, <{1} {5} {3 4}>, <{2} {3} {4} {5}>, <{2 5} {3 4}>
After candidate pruning: <{1} {2 5} {3}>

30 Timing Constraints
{A B} {C} {D E}: the gap between consecutive elements must be ≤ max-gap, and the overall span of the pattern must be ≤ max-span.
With max-gap = 2 and max-span = 4:

Data sequence                            Subsequence      Contained?
<{2,4} {3,5,6} {4,7} {4,5} {8}>          <{6} {5}>        Yes
<{1} {2} {3} {4} {5}>                    <{1} {4}>        No
<{1} {2,3} {3,4} {4,5}>                  <{2} {3} {5}>    Yes
<{1,2} {3} {2,3} {3,4} {2,4} {4,5}>      <{1,2} {5}>      No

31 Mining Sequential Patterns with Timing Constraints
Approach 1: Mine sequential patterns without timing constraints Postprocess the discovered patterns Approach 2: Modify algorithm to directly prune candidates that violate timing constraints Question: Does APRIORI principle still hold?

32 APRIORI Principle for Sequence Data
Suppose max-gap = 1 and max-span = 5. It can then happen that <{2} {5}> has support = 40% while <{2} {3} {5}> has support = 60%, so the APRIORI principle does not hold. The problem exists because of the max-gap constraint; it can be avoided by using the concept of a contiguous subsequence.

33 Contiguous Subsequences
s is a contiguous subsequence of w = <e1 e2 … ek> if any of the following conditions holds:
1. s is obtained from w by deleting an item from either e1 or ek
2. s is obtained from w by deleting an item from any element ei that contains at least 2 items
3. s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition)
Examples:
s = <{1} {2}> is a contiguous subsequence of <{1} {2 3}>, <{1 2} {2} {3}>, and <{3 4} {1 2} {2 3} {4}>
s = <{1} {2}> is not a contiguous subsequence of <{1} {3} {2}> and <{2} {1} {3} {2}>

34 Modified Candidate Pruning Step
Modified APRIORI principle: if a k-sequence is frequent, then all of its contiguous (k-1)-subsequences must also be frequent.
Candidate generation does not change; only pruning changes.
Without the max-gap constraint: a candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.
With the max-gap constraint: a candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent.
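A small Python sketch that enumerates the contiguous (k-1)-subsequences used in this modified pruning step, with sequences represented as tuples of tuples; the representation and function name are illustrative assumptions.

```python
def contiguous_subsequences(seq):
    """All contiguous (k-1)-subsequences of seq, obtained by deleting one item
    from the first or last element, or from any element with at least 2 items."""
    subs = set()
    for i, elem in enumerate(seq):
        if len(elem) < 2 and 0 < i < len(seq) - 1:
            continue  # deleting from a singleton interior element is not allowed
        for item in elem:
            rest = tuple(x for x in elem if x != item)
            subs.add(seq[:i] + ((rest,) if rest else ()) + seq[i + 1:])
    return subs

# Example: contiguous 3-subsequences of <{1} {2 3} {4}>
for s in contiguous_subsequences(((1,), (2, 3), (4,))):
    print(s)
# yields ((2, 3), (4,)), ((1,), (3,), (4,)), ((1,), (2,), (4,)), ((1,), (2, 3))
```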

35 Cluster Analysis
Find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups: intra-cluster distances are minimized while inter-cluster distances are maximized.

36 K-means Clustering
Partitional clustering approach.
Each cluster is associated with a centroid (center point), typically the mean of the points in the cluster.
Each point is assigned to the cluster with the closest centroid.
The number of clusters, K, must be specified.
The basic algorithm is very simple.
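A minimal NumPy sketch of the basic algorithm; the random data, the random-initialization scheme, and the convergence test are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: X is an (n, d) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # assign each point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on random 2-D points
X = np.random.default_rng(1).normal(size=(200, 2))
labels, centroids = kmeans(X, k=3)
```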

37 Importance of Choosing Initial Centroids

38 Importance of Choosing Initial Centroids

39 Solutions to Initial Centroids Problem
Multiple runs: helps, but probability is not on your side.
Bisecting K-means: not as susceptible to initialization issues.

40 Bisecting K-means
A straightforward extension of the basic K-means algorithm. Simple idea: to obtain K clusters, split the set of points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
Algorithm:
Initialize the list of clusters to contain the cluster consisting of all points.
repeat
  Remove a cluster from the list of clusters.
  // Perform several "trial" bisections of the chosen cluster.
  for i = 1 to number of trials do
    Bisect the selected cluster using basic K-means (i.e., 2-means).
  end for
  Select the two clusters from the bisection with the lowest total SSE.
  Add these two clusters to the list of clusters.
until the list of clusters contains K clusters.
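A rough sketch of this loop using scikit-learn's KMeans for the 2-means bisection; choosing the cluster with the largest SSE to split and the number of trials are assumptions, since the slide only says to select one of the clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Return a list of index arrays, one per cluster."""
    clusters = [np.arange(len(X))]          # start with one cluster holding all points
    while len(clusters) < k:
        # assumption: bisect the cluster with the largest SSE
        sse = [((X[c] - X[c].mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        best = None
        for trial in range(n_trials):
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + trial).fit(X[target])
            if best is None or km.inertia_ < best.inertia_:   # keep the lowest total SSE
                best = km
        clusters.append(target[best.labels_ == 0])
        clusters.append(target[best.labels_ == 1])
    return clusters

X = np.random.default_rng(2).normal(size=(300, 2))
for c in bisecting_kmeans(X, k=4):
    print(len(c))
```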

41 Bisecting K-means Example

42 Limitations of K-means
K-means has problems when clusters have differing sizes, differing densities, or non-globular shapes.
K-means also has problems when the data contains outliers.

43 Exercise
For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth?
Only for (b) and (d). For (b), K-means would find the nose, eyes, and mouth, but the lower-density points would also be included. For (d), K-means would find the nose, eyes, and mouth straightforwardly as long as the number of clusters was set to 4.
What limitation does clustering have in detecting all the patterns formed by the points in figure (c)?
Clustering techniques can only find patterns of points, not of empty spaces.

44 Agglomerative Clustering Algorithm
Basic algorithm:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.
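A brief sketch using SciPy's hierarchical clustering, which follows this agglomerative scheme; the linkage method 'single' corresponds to the MIN criterion discussed next ('complete' = MAX, 'average' = group average). The data and the cut into three clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(3).normal(size=(30, 2))

# 'single' = MIN, 'complete' = MAX, 'average' = group average
Z = linkage(X, method="single")                   # the merge history (the dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
# dendrogram(Z) would draw the tree with matplotlib
```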

45 Cluster Similarity: MIN
Similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points.

46 Hierarchical Clustering: MIN
(Figure: nested clusters and the corresponding dendrogram for six points, using MIN.)

47 Strength of MIN
Can handle non-globular shapes.
(Figure: original points and the two clusters found by MIN.)

48 Limitations of MIN
Sensitive to noise and outliers.
(Figure: original points, a four-cluster result, and a three-cluster result in which the yellow points are wrongly merged with the red ones instead of with the green one.)

49 Cluster Similarity: MAX
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters; it is determined by all pairs of points in the two clusters.

50 Hierarchical Clustering: MAX
(Figure: nested clusters and the corresponding dendrogram for six points, using MAX.)

51 Strengths of MAX
Robust with respect to noise and outliers.
(Figure: original points, a four-cluster result, and a three-cluster result in which the yellow points are now merged with the green one.)

52 Cluster Similarity: Group Average
Proximity of two clusters is the average of the pairwise proximities between points in the two clusters:
proximity(Ci, Cj) = ( sum of proximity(p, q) over all p in Ci and q in Cj ) / ( |Ci| × |Cj| )

53 Hierarchical Clustering: Group Average
(Figure: nested clusters and the corresponding dendrogram for six points, using group average.)

54 DBSCAN
DBSCAN is a density-based algorithm: it locates regions of high density that are separated from one another by regions of low density.
Density = number of points within a specified radius (Eps).
A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points in the interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
A noise point is any point that is neither a core point nor a border point.

55 DBSCAN Algorithm Any two core points that are close enough---within a distance Eps of one another---are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point. Ties may need to be resolved if a border point is close to core points from different clusters. Noise points are discarded.
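A short usage sketch with scikit-learn's DBSCAN, whose eps and min_samples parameters play the roles of Eps and MinPts; the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(4).normal(size=(200, 2))

db = DBSCAN(eps=0.3, min_samples=4).fit(X)
labels = db.labels_                          # cluster id per point; -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # True for core points
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0),
      "noise points:", int((labels == -1).sum()))
```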

56 When DBSCAN Works Well
Resistant to noise; can handle clusters of different shapes and sizes.
(Figure: original points and the clusters found.)

57 When DBSCAN Does NOT Work Well

58 DBSCAN: Determining EPS and MinPts
Look at the behavior of the distance from a point to its k-th nearest neighbor, called the k-dist. For points that belong to some cluster, the value of k-dist will be small [if k is not larger than the cluster size]. However, for points that are not in a cluster, such as noise points, the k-dist will be relatively large. So, if we compute the k-dist for all data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps. If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points for which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.
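A sketch of the sorted k-dist plot used to pick Eps, built with scikit-learn's NearestNeighbors; the value of k, the data, and the plotting choices are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(5).normal(size=(300, 2))
k = 4                                    # MinPts; the original DBSCAN used k = 4

# distance from each point to its k-th nearest neighbor (excluding the point itself)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])            # sorted k-dist values

plt.plot(k_dist)
plt.xlabel("points sorted by k-dist")
plt.ylabel(f"{k}-dist")
plt.show()                               # Eps ~ the value where the curve bends sharply
```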

59 DBSCAN: Determining EPS and MinPts
The Eps determined in this way depends on k, but does not change dramatically as k changes.
If k is too small, then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters.
If k is too large, then small clusters (of size less than k) are likely to be labeled as noise.
The original DBSCAN used k = 4, which appears to be a reasonable value for most two-dimensional data sets.

60 IR-Web queries
Keyword queries
Boolean queries (using AND, OR, NOT)
Phrase queries
Proximity queries
Full document queries
Natural language questions
From: Bing Liu. Web Data Mining. 2007

61 Vector space model
Documents are also treated as a "bag" of words or terms. Each document is represented as a vector.
Term Frequency (TF) scheme: the weight of a term ti in document dj is the number of times that ti appears in dj, denoted by fij. Normalization may also be applied.
A shortcoming of the TF scheme is that it does not consider the situation where a term appears in many documents of the collection; such a term may not be discriminative.
From: Bing Liu. Web Data Mining. 2007

62 TF-IDF term weighting scheme
The best-known weighting scheme:
TF: (normalized) term frequency, e.g. tfij = fij / max{f1j, …, f|V|j}
IDF: inverse document frequency, idfi = log(N / dfi)
N: total number of documents; dfi: the number of documents in which ti appears.
The more documents a word (term) appears in, the less discriminative it is, and thus the less weight we should give it.
The final TF-IDF term weight is wij = tfij × idfi.
From: Bing Liu. Web Data Mining. 2007

63 Retrieval in vector space model
A query q is represented in the same way as a document. The weight wiq of each term ti in q can be computed in the same way as in a normal document.
Relevance of dj to q: compare the similarity of query q and document dj. For this, use the cosine similarity (the cosine of the angle between the two vectors):
cosine(dj, q) = (dj · q) / (||dj|| × ||q||)
From: Bing Liu. Web Data Mining. 2007
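A compact Python sketch of TF-IDF weighting and cosine-similarity ranking; the toy corpus, the normalization choices, and the function names are illustrative assumptions rather than the book's exact formulation.

```python
import math
from collections import Counter

docs = ["web data mining", "mining frequent itemsets from data", "web search and page rank"]
N = len(docs)

# document frequency of each term
df = Counter(t for d in docs for t in set(d.split()))

def tf_idf(text):
    """TF-IDF vector (term -> weight) for one document or query."""
    counts = Counter(text.split())
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log(N / df[t]) for t, f in counts.items() if t in df}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

doc_vecs = [tf_idf(d) for d in docs]
query = tf_idf("web mining")
ranking = sorted(range(N), key=lambda j: cosine(query, doc_vecs[j]), reverse=True)
print(ranking)   # document indices ordered by relevance to the query
```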

64 Page Rank (PR)
Intuitively, we solve the recursive definition of "importance": a page is important if important pages link to it. PageRank is the estimated page importance.
In short, PageRank is a "vote" by all the other pages on the Web about how important a page is. A link to a page counts as a vote of support. If there is no link, there is no support (but it is an abstention from voting rather than a vote against the page).
From: Jeff Ullman's lecture

65 Page Rank Formula
PR(A) = PR(T1)/C(T1) + … + PR(Tn)/C(Tn)
PR(Tn): each page has a notion of its own self-importance, which is, say, 1 initially.
C(Tn): count of outgoing links from page Tn. Each page spreads its vote out evenly among all of its outgoing links.
PR(Tn)/C(Tn): so if our page (say page A) has a backlink from page Tn, the share of the vote that page A gets from Tn is PR(Tn)/C(Tn).

66 Web Matrix
Capture the formula with the web matrix WebM:
the (i, j)-th entry is 1/n if page i is one of the n successors of page j, and 0 otherwise.
Then the importance vector containing the rank of each page is calculated by
Ranknew = WebM · Rankold
starting with Rank = (1, 1, …).
Observe that this matrix-vector product conforms to the PageRank formula on the previous slide.
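A small NumPy sketch of this iteration on the three-page example of the next slide; the link structure (Netscape links to itself and to Amazon, Microsoft links to Amazon, Amazon links to Netscape and to Microsoft) is inferred from the iterates shown there, and the fixed iteration count is arbitrary.

```python
import numpy as np

# Columns = pages (Netscape, Microsoft, Amazon); entry (i, j) = 1/n if page j
# has n successors and page i is one of them.
WebM = np.array([[0.5, 0.0, 0.5],
                 [0.0, 0.0, 0.5],
                 [0.5, 1.0, 0.0]])

rank = np.ones(3)                # start with Rank = (1, 1, 1)
for _ in range(50):
    rank = WebM @ rank           # Rank_new = WebM . Rank_old
print(rank)                      # converges to about [1.2, 0.6, 1.2], i.e. (6/5, 3/5, 6/5)
```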

67 Example
In 1839, the Web consisted of only three pages: Netscape, Microsoft, and Amazon. The first four iterations give the following estimates:

n = 1   1     5/4   9/8   5/4
m = 1   1/2   3/4   1/2   11/16
a = 1   3/2   1     11/8  17/16

In the limit, the solution is n = a = 6/5; m = 3/5.
From: Jeff Ullman's lecture

68 Problems With Real Web Graphs
Dead ends: a page that has no successors has nowhere to send its importance. Eventually, all importance will "leak out of" the Web.
Example: suppose Microsoft tries to claim that it is a monopoly by removing all links from its site. The new Web and the rank vectors for the first 4 iterations are:

n = 1   1     3/4   5/8   1/2
m = 1   1/2   1/4   1/4   3/16
a = 1   1/2   1/2   3/8   5/16

Eventually, each of n, m, and a becomes 0; i.e., all the importance leaked out.
From: Jeff Ullman's lecture

69 Problems With Real Web Graphs
Spider traps: a group of one or more pages that have no links out of the group will eventually accumulate all the importance of the Web.
Example: angered by the decision, Microsoft decides it will link only to itself from now on. Microsoft has now become a spider trap. The new Web and the rank vectors for the first 4 iterations are:

n = 1   1     3/4   5/8   1/2
m = 1   3/2   7/4   2     35/16
a = 1   1/2   1/2   3/8   5/16

Now m converges to 3, and n = a = 0.
From: Jeff Ullman's lecture

70 Google Solution to Dead Ends and Spider Traps
Stop any one group of pages from having too much influence: the total vote is "damped down" by multiplying it by a factor.
Example: if we use a 20% damp-down, the equation of the previous example becomes
Ranknew = 0.8 · WebM · Rankold + 0.2 · (1, 1, 1)
The solution to this equation is n = 7/11, m = 21/11, a = 5/11.
From: Jeff Ullman's lecture

71 Hubs and Authorities
Intuitively, we define "hub" and "authority" in a mutually recursive way: a hub links to many authorities, and an authority is linked to by many hubs.
Authorities turn out to be pages that offer information about a topic; hubs are pages that do not provide the information themselves, but tell you where to find it.

72 Matrix formulation
Use a matrix formulation similar to that of PageRank, but without the stochastic restriction: we count each link as 1, regardless of how many successors or predecessors a page has.
Namely, define a matrix A whose rows and columns correspond to Web pages, with entry Aij = 1 if page i links to page j, and 0 if not.
Notice that AT, the transpose of A, looks like the matrix used for computing PageRank, but AT has 1's where the PageRank matrix has fractions.

73 Authority and Hubbiness Vectors
Let a and h be vectors whose i-th components correspond to the degrees of authority and hubbiness of the i-th page. Let λ and μ be suitable scaling factors. Then we can state:
(1) h = λ A a: the hubbiness of each page is the sum of the authorities of all the pages it links to, scaled by λ.
(2) a = μ AT h: the authority of each page is the sum of the hubbiness of all the pages that link to it, scaled by μ.

74 Simple substitutions
From (1) and (2), using simple substitution, we can derive two equations that relate the vectors a and h only to themselves:
a = λμ ATA a
h = λμ A AT h
As a result, we can compute h and a by iteration.
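A small NumPy sketch of this iteration on a made-up four-page adjacency matrix; rescaling after each step plays the role of the scaling factors λ and μ.

```python
import numpy as np

# A[i, j] = 1 if page i links to page j (made-up 4-page example)
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

h = np.ones(4)                   # initial hubbiness
a = np.ones(4)                   # initial authority
for _ in range(50):
    a = A.T @ h                  # authority = sum of hubbiness of pages linking in
    h = A @ a                    # hubbiness = sum of authorities of pages linked to
    a /= a.max()                 # rescale (the lambda/mu factors) to avoid blow-up
    h /= h.max()
print("authorities:", a.round(3))
print("hubs:       ", h.round(3))
```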

75 Example
If we use λ = μ = 1 and assume the initial vectors h = [hn, hm, ha] = [1, 1, 1] and a = [an, am, aa] = [1, 1, 1], the first three iterations of the equations for a and h are:
(Figure: table of the first three iterations of a and h.)
From: Jeff Ullman's lecture

