1
DIVISIVE ALGORITHMS

A simple way to identify communities in a graph is to detect the edges that connect vertices of different communities and remove them, so that the clusters become disconnected from each other. This is the philosophy of divisive algorithms. The crucial point is to find a property of intercommunity edges that allows their identification. With divisive hierarchical clustering one removes inter-cluster edges, rather than edges between pairs of vertices with low similarity, and there is no guarantee, a priori, that inter-cluster edges connect vertices with low similarity. In some cases vertices (with all their adjacent edges) or whole subgraphs may be removed, instead of single edges. Being hierarchical clustering techniques, it is customary to represent the resulting partitions by means of dendrograms.

The most popular algorithm is that proposed by Girvan and Newman (Girvan and Newman, 2002; 2004). The method is historically important, because it marked the beginning of a new era in the field of community detection. Edges are selected based on a measure of edge centrality (importance with respect to a property or process running on the graph):
1. Computation of the centrality of all edges;
2. Removal of the edge with largest centrality (in case of ties with other edges, one is picked at random);
3. Recalculation of centralities on the running graph;
4. Iteration of the cycle from step 2.

Girvan and Newman used betweenness, expressing the frequency of the participation of edges in a process. They used three alternative definitions: geodesic edge betweenness, random-walk edge betweenness and current-flow edge betweenness.

Edge betweenness is the number of shortest paths (geodesics) between all vertex pairs that run along the edge. It extends site betweenness (Freeman, 1977) and expresses the importance of edges in processes like information spreading, where information usually flows through shortest paths. It is intuitive that intercommunity edges have a large value of edge betweenness, because many shortest paths connecting vertices of different communities will pass through them. As in the calculation of site betweenness, if there are two or more geodesic paths with the same endpoints that run through an edge, the contribution of each of them to the betweenness of the edge must be divided by the multiplicity of the paths, as one assumes that the signal/information propagates equally along each geodesic path. The betweenness of all edges of the graph can be calculated in time that scales as O(|E||V|), or O(|V|^2) on a sparse graph, with techniques based on breadth-first search.
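As a concrete illustration of the four-step loop above, here is a minimal sketch in Python using networkx (my choice of library -- the slides themselves work with pTrees, not networkx); it recomputes geodesic edge betweenness after every removal and reports each new partition produced when the graph splits.

```python
import networkx as nx

def girvan_newman_partitions(G):
    """Yield successively finer partitions by repeatedly removing the
    edge with the largest (recomputed) geodesic edge betweenness."""
    g = G.copy()
    n_comp = nx.number_connected_components(g)
    while g.number_of_edges() > 0:
        # Steps 1-2: compute betweenness for all edges, remove the max
        # (networkx breaks ties deterministically; random choice also works).
        eb = nx.edge_betweenness_centrality(g)
        g.remove_edge(*max(eb, key=eb.get))
        # Steps 3-4: if the graph split, report the new partition,
        # then iterate with betweenness recomputed on the running graph.
        c = nx.number_connected_components(g)
        if c > n_comp:
            n_comp = c
            yield [sorted(comp) for comp in nx.connected_components(g)]

G = nx.karate_club_graph()
first_split = next(girvan_newman_partitions(G))
print(first_split)
```

Calling `next()` on the generator returns the first split, which for Zachary's karate club comes close to the known two-faction partition.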
2
In information spreading, signals may flow along random paths rather than geodesics, so one can also measure edge betweenness by the frequency of the passages of a random walker across the edge (random-walk betweenness). A random walker moving from a vertex follows each adjacent edge with equal probability. A pair of vertices, s and t, is chosen at random; the walker starts at s and keeps moving until it hits t, where it stops. One computes the probability that each edge was crossed by the walker, and averages over all possible choices of s and t. It is meaningful to compute the net crossing probability, which is proportional to the number of times the walk crossed the edge in one direction: one neglects back-and-forth passages, which are accidents of the random walk and tell nothing about the centrality of the edge. Calculation of random-walk betweenness requires the inversion of an n x n matrix (once), followed by obtaining and averaging the flows for all pairs of nodes. The first task requires time O(n^3), the second O(mn^2), for a total complexity O((m+n)n^2), i.e., O(n^3) on a sparse graph.

Current-flow betweenness is defined by considering the graph as a resistor network, with edges having unit resistance. If a voltage difference is applied between two vertices, each edge carries some amount of current, which can be calculated by solving Kirchhoff's equations. The procedure is repeated for all possible vertex pairs: the current-flow betweenness of an edge is the average value of the current carried by the edge. It is possible to show that this measure is equivalent to random-walk betweenness, as the voltage differences and the net flows of the random walks across the edges satisfy the same equations (Newman, 2005); therefore, the calculation of current-flow betweenness has the same complexity. Calculating geodesic edge betweenness is faster than current-flow or random-walk betweenness (O(n^2) versus O(n^3) on sparse graphs), and in practice Girvan-Newman edge betweenness gives better results.
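A hedged side-by-side sketch of two of the centrality variants in networkx (again my choice of library, not the authors'; `edge_current_flow_betweenness_centrality` requires scipy and a connected graph):

```python
import networkx as nx

G = nx.karate_club_graph()

# Geodesic edge betweenness: Brandes-style BFS accumulation, O(mn).
geo = nx.edge_betweenness_centrality(G)

# Current-flow edge betweenness: edges as unit resistors; equivalent to
# random-walk betweenness (Newman 2005), but costlier to compute.
flow = nx.edge_current_flow_betweenness_centrality(G)

print("highest geodesic betweenness:    ", max(geo, key=geo.get))
print("highest current-flow betweenness:", max(flow, key=flow.get))
```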
Tyler et al. proposed a modification of the Girvan-Newman algorithm to improve the speed of the calculation (Tyler et al., 2003; Wilkinson and Huberman, 2004). The gain in speed was required for graphs of gene co-occurrences (gene-gene interaction graphs), which are too large for the full algorithm. Algorithms computing site/edge betweenness start from a vertex, taken as center, and compute the betweenness contributions from all paths originating at that vertex; the procedure is then repeated for all vertices (Brandes, 2001; Girvan and Newman, 2004; Zhou et al., 2006). Tyler et al. proposed to calculate the contribution to edge betweenness only from a limited number of centers, chosen at random, deriving a sort of Monte Carlo estimate (see the sketch at the end of this slide). Numerical tests indicate that, for each connected subgraph, it suffices to pick a number of centers growing as the logarithm of the number of vertices of the component. For a given choice of the centers, the algorithm proceeds just like Girvan-Newman, but the stopping condition does not require computing the modularity of the resulting partition; it relies instead on a particular definition of community: a connected subgraph of n0 vertices is a community if the edge betweenness of none of its edges exceeds n0 - 1. Indeed, if the subgraph consists of two parts connected by a single edge, the betweenness of that edge is at least n0 - 1, with equality holding only if one of the two parts is a single vertex. The condition on edge betweenness therefore excludes such situations, although other types of cluster structures might still be compatible with it.

Edges are removed until all connected components of the partition are communities in this sense. Monte Carlo sampling of edge betweenness induces statistical errors; as a consequence, the partitions are in general different for different choices of the set of center vertices. However, by repeating the calculation many times, the method gives good results on a network of gene co-occurrences (Wilkinson and Huberman, 2004), with a substantial gain of computer time. The technique has also been applied to a network of people corresponding via e-mail (Tyler et al., 2003). In practical examples, only vertices lying at the boundary between communities may not be clearly classified, being assigned sometimes to one group, sometimes to another. This is a nice feature, since it identifies overlaps between communities, as well as the degree of membership of overlapping vertices in the clusters they belong to; the original Girvan-Newman method cannot do this.

Another fast version of the Girvan-Newman algorithm makes a quick approximation of the edge betweenness values by using a network structure index, which consists of a set of vertex annotations combined with a distance measure (Rattigan et al., 2006). Basically, one divides the graph into regions and computes the distances of every vertex from each region. In this way Rattigan et al. showed that it is possible to lower the complexity of the algorithm to O(m), while keeping fair accuracy in the estimate of the edge betweenness values. This version gives good results on the benchmark graphs proposed by Brandes et al. (Brandes et al., 2003), as well as on a collaboration network of actors and on a citation network.

Chen and Yuan have pointed out that counting all possible shortest paths in the calculation of edge betweenness may lead to unbalanced partitions, with communities of very different size, and proposed to count only non-redundant paths, i.e. paths whose endpoints are all different from each other; the resulting betweenness yields better results than standard edge betweenness for mixed clusters on the benchmark graphs of Girvan and Newman (Chen and Yuan, 2006).

Holme et al. have used a modified version of the algorithm in which vertices, rather than edges, are removed (Holme et al., 2003). A centrality measure for the vertices, proportional to their site betweenness and inversely proportional to their in-degree, is used to identify boundary vertices, which are then iteratively removed along with all their edges. This modification, applied to study the hierarchical organization of biochemical networks, is motivated by the need to account for reaction kinetic information that simple site betweenness does not include. The in-degree of a vertex is used because it indicates the number of substrates of a metabolic reaction involving that vertex; for the purpose of clustering, the graph is considered undirected, as usual.
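Returning to Tyler et al.'s Monte Carlo speed-up described above: networkx exposes the same sampling idea through the `k` parameter of `edge_betweenness_centrality` (a sketch under that assumption, not their gene co-occurrence code; the logarithmic choice of sample count follows the slide's remark):

```python
import math
import networkx as nx

def approx_edge_betweenness(G, seed=None):
    """Monte Carlo estimate of edge betweenness: accumulate shortest-path
    contributions from only ~log2(n) randomly chosen source vertices,
    in the spirit of Tyler et al.'s numerical tests."""
    n = G.number_of_nodes()
    k = max(1, int(math.log2(n)))      # number of sampled centers
    return nx.edge_betweenness_centrality(G, k=k, seed=seed)

G = nx.karate_club_graph()
est = approx_edge_betweenness(G, seed=42)
print(max(est, key=est.get))           # edge with largest estimated value
```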
3
The algorithm of Girvan and Newman is unable to find overlapping communities, as each vertex is assigned to a single cluster. Pinney and Westhead have proposed a modification of the algorithm in which vertices can be split between communities. To do that, they also compute the betweenness of all vertices of the graph. Unfortunately the values of edge and site betweenness cannot be directly compared, due to their different normalization, but the authors remarked that the two end-vertices of an inter-cluster edge should have similar betweenness values, as the shortest paths crossing one of them are likely to reach the other one as well through the edge. So they take the edge with largest betweenness and remove it only if the ratio of the betweenness values of its end-vertices lies within a given range (the bounds are elided in the original); otherwise, the vertex with highest betweenness (with all its adjacent edges) is temporarily removed. When a subgraph is split by vertex or edge removal, all deleted vertices are copied into each subcomponent, along with all their edges.

Gregory has proposed a similar approach, named CONGA, in which vertices are split among clusters if their site betweenness exceeds the maximum value of the edge betweenness. A vertex is split by assigning some of its edges to one of its duplicates and the rest to the other. There are several possibilities to do that; Gregory proposed to choose the split that maximizes a new centrality measure, called split betweenness, which is the number of shortest paths that would run between the two parts of a vertex if the latter were split. The method has a worst-case complexity O(m^3), or O(n^3) on a sparse graph. Code is available online.

Another promising track to detect inter-cluster edges is related to the presence of cycles, i.e. closed non-intersecting paths, in the graph. Communities are characterized by a high density of edges, so it is reasonable to expect that such edges form cycles; on the contrary, edges lying between communities will hardly be part of cycles. Based on this intuitive idea, Radicchi et al. proposed a new measure, the edge clustering coefficient, such that low values of the measure are likely to correspond to intercommunity edges. The edge clustering coefficient generalizes to edges the notion of clustering coefficient (see the sketch below). The measure is (anti)correlated with edge betweenness: edges with low edge clustering coefficient usually have high betweenness and vice versa, although the correlation is not perfect. The method works like the algorithm of Girvan and Newman: at each iteration, the edge with smallest clustering coefficient is removed, the measure is recalculated, and so on. If the removal of an edge leads to a split of a subgraph in two parts, the split is accepted only if both clusters are LS-sets ("strong" communities) or "weak" communities. The verification of the community condition on the clusters is performed on the full adjacency matrix of the initial graph. If the condition were satisfied by only one of the two clusters, the initial subgraph might be a random graph, as it can easily be seen that, by cutting a random graph in two parts, the larger of them is a strong (or weak) community with very high probability, whereas the smaller part is not. By enforcing the community condition on both clusters, it is more likely that the subgraph to be split indeed has a cluster structure. The algorithm therefore stops when all clusters produced by the edge removals are communities in the strong or weak sense, and further splits would violate this condition.
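For Radicchi et al.'s measure, a minimal sketch of the triangle-based edge clustering coefficient, C(i,j) = (z_ij + 1) / min(k_i - 1, k_j - 1), where z_ij is the number of triangles containing the edge; the function names here are my own:

```python
import networkx as nx

def edge_clustering_coefficient(G, u, v):
    """Triangle-based edge clustering coefficient of Radicchi et al.:
    (triangles through the edge + 1) / max possible such triangles.
    Low values suggest inter-community edges."""
    z = len(list(nx.common_neighbors(G, u, v)))   # triangles on (u, v)
    denom = min(G.degree(u) - 1, G.degree(v) - 1)
    return float("inf") if denom == 0 else (z + 1) / denom

G = nx.karate_club_graph()
ecc = {(u, v): edge_clustering_coefficient(G, u, v) for u, v in G.edges()}
# The divisive loop would repeatedly remove the minimum-coefficient edge
# and recompute the (local, hence cheap) measure.
print(min(ecc, key=ecc.get))
```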
The authors suggested using the same stopping criterion for the algorithm of Girvan and Newman, to get structurally well-defined clusters. Since the edge clustering coefficient is a local measure, involving at most an extended neighborhood of the edge, it can be calculated very quickly. The running time of the algorithm to completion is O(m^4/n^2), or O(n^2) on a sparse graph, if g (the order of the cycles considered) is small, so it is much shorter than the running time of the Girvan-Newman method. Software for the algorithm is available online. The algorithm may give poor results when the graph has few cycles, as happens in some social and many non-social networks; in this case the edge clustering coefficient is small and fairly similar for most edges, and the algorithm may fail to identify the bridges between communities.

An alternative measure of centrality for edges is information centrality. It is based on the concept of efficiency (Latora and Marchiori, 2001), which estimates how easily information travels on a graph according to the length of shortest paths between vertices. The efficiency of a network is defined as the average of the inverse distances between all pairs of vertices: if the vertices are "close" to each other, the efficiency is high. The information centrality of an edge is the relative variation of the efficiency of the graph if the edge is removed. In the algorithm by Fortunato et al. (Fortunato et al., 2004), edges are removed in order of decreasing information centrality; the method is otherwise analogous to that of Girvan and Newman. Computing the information centrality of an edge requires the calculation of the distances between all pairs of vertices, which can be done with breadth-first search in time O(mn), so computing the information centrality of all edges requires time O(m^2 n). At this point one removes the edge with the largest value of information centrality and recalculates the information centrality of all remaining edges with respect to the running graph. Since the procedure is iterated until there are no more edges in the network, the final complexity is O(m^3 n), or O(n^4) on a sparse graph. The partition with the largest value of modularity is chosen as most representative of the community structure of the graph. The method is much slower than the algorithm of Girvan and Newman. Partitions obtained with the two techniques are rather consistent, mainly because information centrality has a strong correlation with edge betweenness. The algorithm by Fortunato et al. gives better results when communities are mixed, i.e. have a high degree of interconnectedness, but it tends to isolate leaf vertices and small, loosely bound subgraphs.
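A brute-force sketch of information centrality (the relative drop in Latora-Marchiori efficiency when an edge is removed); this mirrors the expensive all-pairs recomputation the text describes, so it is meant for small graphs only:

```python
import networkx as nx

def information_centrality(G, e):
    """Relative variation of network efficiency if edge e is removed."""
    base = nx.global_efficiency(G)       # avg inverse distance, all pairs
    H = G.copy()
    H.remove_edge(*e)
    return (base - nx.global_efficiency(H)) / base

G = nx.karate_club_graph()
ic = {e: information_centrality(G, e) for e in G.edges()}
# The divisive algorithm removes the edge with the largest value,
# then recomputes everything on the running graph.
print(max(ic, key=ic.get))
```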
4
Edge pTree (E), PathTree (PT), ShortestPathvTree (SPT), AcyclicPathTree (APT) and CycleList (CL) of G1.

[Figure: the E, PT, SPT, APT and CL bitmaps for the 4-vertex graph G1; keys and bit patterns omitted.]

First, construct the stride=|V|, 2-level Edge pTree (predicate: NotPureZero); all the other structures are constructed concurrently from it. To extend E to PT: for each k, list E_h and set PT2_hk = E_k after zeroing the h bit of E_k; for each k, list PT2_hj and set PT3_hjk = E_k after zeroing the j bit of E_k; for each k, list PT3_hij and set PT4_hijk = E_k after zeroing the i and j bits of E_k. PT uses the predicate (NPZ) | (PZ & AcyclicPathEnd); APT uses the predicate (NPZ & NotCycleEnd) | (PZ & AcyclicPathEnd). E, PT, SPT and APT represent the paths of the graph as predicate trees on E(MaxPathLength).

All cycles of G1 are 3-hop cycles (1341, 1431, 3413, 3143, 4134, 4314): each cycle has 3 start points and 2 directions, so each repeats 6 times, giving 6/6 = 1 distinct 3-hop cycle.

SPT is initialized with E_k = SP_k,1 and completed level by level, keeping a mask of the Shortest Paths So Far for each vertex k, SPSF_k: with each new SP bitmap, SPB, set SPSF_k <- SPSF_k | SPB and SP_k,h+1 <- SPB & SPSF_k' (the complement keeps only destinations not already reached). For big graphs one could stop early (e.g., Friends has ~1B vertices but a diameter of 4, so we would only need to build PT 4-hop paths), possibly expressed as a tree of lists rather than a tree of bitmaps. For sparse big graphs, E could be leveled further and/or kept as a tree of lists (and then APT and SPT would be also).

SPT_k (with bit k turned on) is a mask (>0 means "yes") for the connectivity component containing v_k, COMP_k: bit-slice SPT as SPT_k,h .. SPT_k,0 for k=1..|V|; then COMP_k = OR over j=h..0 of SPT_k,j. The SPT structure may also be useful as separate "categorical" bitmaps of shortest path length, SP_k,h, h=1..H. SPT gives the connectivity component partition, and maximal cliques can be sought by going across SP_k,1 and then looking within subsets of those k's for commonality. Note that cliques are 0-plexes; each mask SP_k,1 masks a 1-plex, and each SP_k,1 & SP_k,2 masks a 2-plex (which is SPSF_k,2? So if we save each SPSF instead of overwriting, we have k-plex masks without further work?), etc. Next, construct predicates for each path-related data structure (PT, APT, SPT, SPSF) to make them into pTrees on a k-path table: E, E2, E3, …
5
SG Clique Mining.

[Figure: edge pTrees E for graphs G2, G3 (7 vertices) and G4 (8 vertices), with worked candidate-clique walkthroughs for k = 2..5; keys, bit patterns and most intermediate PE(i,j) checks omitted. In the examples, 1234 turns out to be the only 4-clique of one graph, and 12348 a 5-clique of G4.]

k=2 (2-cliques, i.e. edges): find the endpoints of each edge from its bit position n as (Int((n-1)/7)+1, Mod(n-1,7)+1).

EdgeCount theorem/algorithm (EC): a candidate C on k+1 vertices is a clique iff the count of 1s in its restricted edge mask equals (k+1)!/((k-1)! 2!) = comb(k+1, 2). E.g., on C = {1,2,3,4} with CU = C & EU: C is a clique since ct(CU) = comb(4, 2) = 4!/(2!2!) = 6.

For k >= 3, EC requires counting 1's in the mask pTree of each subgraph (or candidate clique, if one takes the time to generate the candidate clique sets -- but then clearly the fastest way to finish up is simply to look up the single bit position in E, i.e., use EC). The SG algorithm only needs the edge mask pTree, E, and a fast way to find those pairs of subgraphs in CSk that share k-1 vertices (then check E to see whether the two differing k-th vertices form an edge in G; a sketch follows below). This is a standard part of the Apriori ARM algorithm and has therefore been optimized and engineered ad infinitum!
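A sketch of the SG join in plain Python: CSk is a set of frozensets, pairs sharing k-1 vertices are joined Apriori-style, and the edge-mask check becomes a set lookup (pTrees replaced here by a Python edge set; names are my own):

```python
from itertools import combinations

def grow_cliques(cliques_k, edges):
    """Apriori-style join: two k-cliques sharing k-1 vertices extend to a
    (k+1)-clique iff their two differing vertices form an edge in G."""
    out = set()
    for a, b in combinations(cliques_k, 2):
        if len(a & b) == len(a) - 1:
            u, = a - b
            v, = b - a
            if frozenset((u, v)) in edges:   # the single-bit lookup in E
                out.add(a | b)
    return out

# Toy graph: 1234 is the only 4-clique, plus triangle 567.
edge_list = [(1,2),(1,3),(1,4),(2,3),(2,4),(3,4),(5,6),(5,7),(6,7)]
edges = {frozenset(e) for e in edge_list}
cs = edges                                   # CS2: the edges themselves
k = 2
while cs:
    print(k, sorted(tuple(sorted(c)) for c in cs))
    cs = grow_cliques(cs, edges)
    k += 1
```

On the toy graph the loop prints CS2 (the edges), the triangles at k=3, and the single 4-clique {1,2,3,4}. The single-edge check suffices because the two parent cliques already cover every other vertex pair of the union.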
6
The Edge pTree (E), PathTree (PT), ShortestPathvTree (SPT), AcyclicPathTree (APT) and CycleList (CL) of a graph, G5 (8 vertices; adjacency: 1-{2,5,7}, 2-{1,4}, 3-{6,8}, 4-{2}, 5-{1,7}, 6-{3,8}, 7-{1,5}, 8-{3,6}).

[Figure: the 2-level (stride=8) E, PT, APT and SPT bitmaps and the CycleList CL of G5 (3-cycles 1571, 1751, 5175, 5715, 7157, 7517 and 3683, 3863, 6386, 6836, 8368, 8638); bit patterns omitted.]

PT Clique Miner algorithm -- a clique is all cycles. A k-cycle is a k-clique iff it is found in CLk PERM(k-1, k-1)/2 = (k-1)!/2 times (vertices repeat in CL: for 3-cycles, 2!/2 = 1; 4-cycles, 3!/2 = 3; 5-cycles, 4!/2 = 12; 6-cycles, 5!/2 = 60). Downward closure: once a 4-cycle is established as a 4-clique (by the fact that, e.g., {1,2,3,4} occurs 3!/2 = 3 times in CL), all its 3-vertex subsets are 3-cliques ({1,2,3}, {1,2,4}, {1,3,4}), so there is no need to check further. PT (= APT + CL) and SPT are powerful data mining tools with closure properties (to eliminate branches). Open questions: extend to a k-plex (missing <= k edges) or k-core mining algorithm? A density (internal edge density >> external/average) mining algorithm? A degree (internal vertex degree >> external/average) mining algorithm?

Diameter(G5) = max over k of {Diameter_k} = max{2,2,1,3,2,1,3,1} = 3. The connected component containing V1 is COMP1 = {1,2,4,5,7}. Pick the first vertex not in COMP1, namely 3: COMP3 = {3,6,8}. Done: the partition is {{1,2,4,5,7}, {3,6,8}}. To pick the first vertex not in COMP1, mask off COMP1 with SPTv1' (its complement) and then pick the first vertex in this complement; a sketch follows below.
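The component-partition procedure just described, sketched with Python integers playing the role of the vertex bitmasks (hypothetical names; `reach_mask` stands in for the OR of the SPT bit-slices, and the G5 edge set is as read off the slide):

```python
def reach_mask(adj, v):
    """Bitmask of all vertices reachable from v (BFS over bitmasks)."""
    seen = frontier = 1 << v
    while frontier:
        nxt = 0
        f = frontier
        while f:
            u = (f & -f).bit_length() - 1    # lowest set bit = next vertex
            f &= f - 1
            nxt |= adj[u]
        frontier = nxt & ~seen
        seen |= frontier
    return seen

# G5: vertices 1..8 -> bits 0..7.
edges = [(1,2),(1,5),(1,7),(2,4),(3,6),(3,8),(5,7),(6,8)]
adj = [0] * 8
for a, b in edges:
    adj[a-1] |= 1 << (b-1)
    adj[b-1] |= 1 << (a-1)

remaining = (1 << 8) - 1                     # all vertices
while remaining:
    v = (remaining & -remaining).bit_length() - 1  # first unused vertex
    comp = reach_mask(adj, v)
    print(sorted(i + 1 for i in range(8) if comp >> i & 1))
    remaining &= ~comp                       # mask off COMP, as on the slide
```

This prints {1,2,4,5,7} and then {3,6,8}, matching the partition above.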
7
[Figure: the shortest-path pTrees SP1 through SP6 and the adjacency-power pTrees A1Ps-A6Ps of the 16-vertex graph G6 (vertices labelled 1-9, a-g, with E = A1Ps), built until SP1&2&3&4&5 is COMPLETE; cycles shown in blue (not in APT). Keys and bit patterns omitted.]
8
All Shortest Path pTrees for a unipartite undirected graph, G7 (SP1, SP2, SP3, SP4, SP5)
[Figure: the all-shortest-path pTrees SP1-SP5 of G7, drawn alongside the graph; bit patterns and the degree legend (=1deg .. =5deg) omitted. In SP4 only vertex 17 is on.]
9
Trying Hamming similarity to detect communities on G7 and G8. [Figure: drawing of G8; vertex labels omitted.]
Zachary's karate club, a standard benchmark in community detection (the figure shows the best partition found by optimizing the modularity of Newman and Girvan).

[Figure: G7 (Zachary's karate club, 34 vertices) and G8, vertices shaded by degree (=1deg .. =5deg); labels and degree tables omitted.]

Hamming similarity: S(S1, S2) = Sum over k of DegkDif(S1, S2).

To produce an [all?] actual shortest path[s] between x and y -- Thm: to produce a [all?] shortest 2-path[s], take a [all?] middle vertex[es], x1, from SP1x & SP1y, producing x-x1-y; for shortest 3-path[s], take a [all?] vertex[es], x1, from SP1x and a [all?] vertex[es], x2, from S2P(x1, y), producing x-x1-x2-y; etc. (a sketch of the 2-path case follows below). Is it productive to actually produce (one time) a tree of [all?] shortest paths? I think it is not!

One can see that Hamming similarity works poorly at vertex 1 of G7 -- not working! On the other hand, our standard community mining techniques (for k-plexes) worked well on G7. On the next slide, let's try Hamming on G8.
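The length-2 case of the theorem above, as a minimal sketch: the middle vertices of the shortest 2-paths between x and y are exactly the vertices of SP1x & SP1y, i.e. the intersection of the two neighborhoods (function names are my own):

```python
import networkx as nx

def two_step_paths(G, x, y):
    """All shortest paths of length 2 between x and y: each middle vertex
    comes from SP1x & SP1y, the intersection of the neighborhoods."""
    middles = set(G[x]) & set(G[y])
    return [(x, m, y) for m in sorted(middles)]

G = nx.karate_club_graph()
print(two_step_paths(G, 0, 33))   # paths 0 -> m -> 33, if any exist
```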
10
G9: agglomerative clustering of ESP2 using Hamming similarity.
In ESP2, using Hamming similarity, we get three event clusters, clustering events iff their pTrees are [Hamming-]identical (a sketch follows below): EventCluster1 = {1,2,3,4,5}, EventCluster2 = {6,7,8,9}, EventCluster3 = {10,11,12,13,14}.

[Figure: the W (18 women) x E (14 events) incidence pTrees WSP and ESP of G9, with the table giving each woman's percent affiliation with the R, G, B event clusters; bit patterns and table rows garbled in extraction, omitted.]

ESP3 = ESP1' and ESP4 = ESP2', so in this case all information is already available in ESP1 and ESP2 (all shortest paths are of length 1 or 2; we don't need ESPk for k > 2). Likewise WSP3 = WSP1' and WSP4 = WSP2', so all information is already available in WSP1 and WSP2 (we don't need WSPk for k > 2).

Clustering the women by their degree-% RGB affiliation: WomenClusterR = {1,2,4,5}, WomenClusterG = {3,6,7,8,9,10,11,16,17,18}, WomenClusterB = {12,13,14,15}. This clustering seems fairly close to the authors'. Other methods are possible, and if another method puts event 6 with {1,2,3,4,5}, then everything changes and the result seems even closer to the authors' intent.
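A sketch of the "cluster events iff their pTrees are identical" step, with each event's column treated as a bit tuple and equal tuples grouped (toy incidence matrix; the real ESP2 bitmaps are in the figure above):

```python
from collections import defaultdict

def cluster_identical_columns(matrix):
    """Group column indices whose bit columns are identical
    (Hamming distance 0), as in the ESP2 event clustering."""
    groups = defaultdict(list)
    n_rows, n_cols = len(matrix), len(matrix[0])
    for j in range(n_cols):
        col = tuple(matrix[i][j] for i in range(n_rows))
        groups[col].append(j + 1)            # 1-based event labels
    return [sorted(g) for g in groups.values()]

# Toy 4x6 incidence matrix: events 1-3 identical, 4-5 identical, 6 alone.
m = [[1,1,1,0,0,1],
     [1,1,1,0,0,0],
     [0,0,0,1,1,1],
     [0,0,0,1,1,0]]
print(cluster_identical_columns(m))          # [[1, 2, 3], [4, 5], [6]]
```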
11
K-plex search on G9 (a k-plex is a subgraph missing at most k edges).
If H is a k-plex and F is an induced subgraph (ISG) of H, then F is a k-plex. A graph (V, E) is a k-plex iff |V|(|V|-1)/2 - |E| <= k (a sketch of this test, and of the min-degree peeling below, follows after this slide's text).

[Figure: the ESP2 and WSP2 bitmaps of G9 and the step-by-step peeling runs, with vertex-degree strings and edge counts at each stage; details garbled in extraction, omitted.]

Walkthrough on the events of ESP2: the full event set has 14*13/2 = 91 vertex pairs and |Edges| = 66, so it is a k-plex only for k >= 25. Repeatedly remove a minimum-degree vertex (not recalculating k until it gets lower); the peeling eventually reaches the 9 events {6,7,8,9,a,b,c,d,e} with 9*8/2 = 36 = |Edges|, i.e. k = 0: a 9-clique! So take out {6789abcde} and start over; the remaining events {1,2,3,4,5} have 5*4/2 = 10 = |Edges|, k = 0: a 5-clique! On the women of WSP2, the same peeling from 18*17/2 = 153, |Edges| = 139 (k >= 14) reaches a 15-clique, and the leftover pair {h,i} is a 2-clique.

If we had used the full algorithm, which pursues each minimum-degree tie path, one of those paths would start by eliminating vertex 14 instead of vertex 1; that also results in the 9-clique and the 5-clique. All the other 8 ties would result in one of these two situations. How can we know that ahead of time and avoid all those unproductive minimum-degree tie paths?

We get no information from applying our k-plex search algorithm to WSP2. Again, how could we know this ahead of time to avoid all the work? Possibly by noticing the very high 1-density of its pTrees (only 28 zeros)?

Every ISG of a clique is a clique, so 6789 and 789 are cliques (which seems to be the authors' intent?). If the goal is to find all maximal cliques, how do we know that a found clique CA is maximal? If it weren't, there would be at least one outside vertex which, when added to CA, would result in a larger clique. Checking such a vertex a: PCA & Pa would have to have count 9 (it doesn't -- it has count 5) and PCA(a) would have to be 1 (it isn't; it's 0); the same holds for the other candidates, and the same type of analysis shows 6789abcde is maximal. I think one can prove that any clique obtained by our algorithm is maximal (without the above expensive check), since we start with the whole vertex set and throw out one vertex at a time until we get a clique, so it has to be maximal?

The women associated strongly with the blue event clique, abcde, are {...}, and associated but loosely are {...}. The women associated strongly with the green event clique are {...}, and associated but loosely are {6, 7, 9}.
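The slide's k-plex test and min-degree peeling, sketched under the slide's own definition (|V|(|V|-1)/2 - |E| <= k); this greedy version follows a single tie path, picking an arbitrary minimum-degree vertex, rather than pursuing every tie:

```python
import networkx as nx

def kplex_defect(G):
    """Missing-edge count: 0 means clique; the subgraph is a k-plex
    (slide's definition) for any k >= this value."""
    n = G.number_of_nodes()
    return n * (n - 1) // 2 - G.number_of_edges()

def peel_to_clique(G):
    """Repeatedly drop a minimum-degree vertex until no edges are missing."""
    H = G.copy()
    while kplex_defect(H) > 0:
        v = min(H.nodes, key=H.degree)       # one tie path only
        H.remove_node(v)
    return sorted(H.nodes)

# Toy graph: a 4-clique with a pendant vertex hanging off it.
G = nx.complete_graph(4)
G.add_edge(3, 4)
print(kplex_defect(G))        # 3 missing edges -> a 3-plex on 5 vertices
print(peel_to_clique(G))      # peeling the pendant leaves the 4-clique
```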
12
G10 E=SP1 2level pTrees LevelOneStride=19 (labelled 0-i), Level0Stride=10 (labelled 0-9)
Note: SP1 should be called S1PDV, for "Shortest 1-Path Destination Vertices", because each one, e.g. S1PDV(v1), marks all destination vertices reachable by a shortest 1-path from the given starting vertex, v1.

G10 is the web graph of the pages of a website and the hyperlinks between them; communities are indicated by color (Girvan-Newman algorithm). |V| = 180 (labelled 1-i0) and |E| = 266. Vertices with OutDegree = 0 (leaves) do not have pTrees shown, because the pTrees display only out-edges, so those with OD = 0 have a pure-0 pTree.

[Figure: the 2-level E = SP1 pTrees, with out-degrees, for the first 44 vertices of G10; keys and bit patterns omitted.]
13
G10 E=SP1 2-level pTrees, LevelOneStride=19 (labelled 0-i), Level0Stride=10 (labelled 0-9).

[Figure: the level-1 and level-0 bitmaps with out-degrees for G10; keys and bit patterns omitted.]

G10 leaves (OutDegree=0): a3 a6 a8 a9 b0 b7 b8 b9 e7 e8 e9 f0 f1 f2 f3 f4 f5 f8 f9 g0 g8 g9 h7
14
G10 SP1 lists and SP2 lists. [Data slide: the per-vertex shortest-path destination lists of G10; garbled in extraction, omitted.]
15
Association Rule Mining (ARM)
Given any relationship between two entities, T (e.g., a set of customer transactions involved in those relationship instances) and I (e.g., a set of items involved in those relationship instances), the itemset T(I) associated with (or related to) a particular transaction T is the subset of items found in the shopping cart or market basket that the customer brings through checkout at that time. An Association Rule, A -> C, associates two disjoint itemsets (A = antecedent, C = consequent).

Horizontal transaction table:

T  | T(I)
t1 | i1
t2 | i1, i2, i4
t3 | i1, i3
t4 | i1, i2, i4
t5 | i3, i4

[Figure: the bipartite graph of this T-I relationship; omitted.]

The support [ratio] of itemset A, supp(A), is the fraction of Ts such that A is a subset of T(I). E.g., if A = {i1, i2} and C = {i4}, then supp(A) = |{t2, t4}| / |{t1, t2, t3, t4, t5}| = 2/5. (Note: |.| means set size.) The support [ratio] of rule A -> C, supp(A -> C), is the support of A union C: |{t2, t4}| / |{t1, t2, t3, t4, t5}| = 2/5. The confidence of rule A -> C, conf(A -> C), is supp(A -> C) / supp(A) = (2/5) / (2/5) = 1.

Data miners typically want to find all STRONG RULES, A -> C, with supp(A -> C) >= minsupp and conf(A -> C) >= minconf (minsupp and minconf are threshold levels). Note that conf(A -> C) is also just the conditional probability of t being related to C, given that t is related to A. Given a two-entity relationship, we can do ARM with either entity taking the role of the transaction set.

APRIORI Association Rule Mining: given a Transaction-Item relationship, the APRIORI algorithm for finding all strong I-rules can be done by:
- processing a Horizontal Transaction Table (HTT) through vertical scans to find all frequent I-sets (e.g., I-sets "frequently" found in baskets), or
- processing a Vertical Transaction Table (VTT) through horizontal operations to find all frequent I-sets.
Then each frequent I-set found is analyzed to determine whether it is the support set of a strong rule.

Finding all frequent I-sets is the hard part. To do this efficiently, the APRIORI algorithm takes advantage of the "downward closure" property of frequent I-sets: if an I-set is frequent, then all its subsets are also frequent. E.g., in the market basket example, if A is an I-subset of B and all of B is in a given transaction's basket, then certainly all of A is in that basket too. Therefore supp(A) >= supp(B) whenever A is a subset of B.

First, APRIORI scans to determine all frequent 1-item I-sets (they contain 1 item and are therefore called 1-itemsets); next it uses downward closure to efficiently find candidates for frequent 2-itemsets; next it scans to determine which of those candidate 2-itemsets are actually frequent; next it uses downward closure to efficiently find candidates for frequent 3-itemsets; next it scans to determine which of those candidate 3-itemsets are actually frequent; ... until there are no candidates remaining. (On the next slide we walk through an example using both an HTT and a VTT.)
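The definitions above as executable checks, using this slide's five-transaction table (a set-arithmetic sketch only):

```python
# Transactions from the horizontal table on this slide.
D = {"t1": {"i1"},
     "t2": {"i1", "i2", "i4"},
     "t3": {"i1", "i3"},
     "t4": {"i1", "i2", "i4"},
     "t5": {"i3", "i4"}}

def supp(itemset):
    """Fraction of transactions whose basket contains the whole itemset."""
    return sum(itemset <= basket for basket in D.values()) / len(D)

def conf(A, C):
    """Confidence of A -> C: supp(A union C) / supp(A)."""
    return supp(A | C) / supp(A)

A, C = {"i1", "i2"}, {"i4"}
print(supp(A))        # 2/5 = 0.4
print(supp(A | C))    # 2/5 = 0.4
print(conf(A, C))     # 1.0 -> strong for minsupp <= 0.4, minconf <= 1
```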
16
Example ARM using uncompressed ItemPtrees
(The 1-count at the root of each pTree gives the support count directly.)

Scan D to build the item pTrees. The transaction table D:

TID | items
100 | 1, 3, 4
200 | 2, 3, 5
300 | 1, 2, 3, 5
400 | 2, 5

Item pTrees (root 1-count, then the bit pattern over TIDs 100-400): P1 = 2 (1010), P2 = 3 (0111), P3 = 3 (1110), P4 = 1 (1000), P5 = 3 (0111).

Scan D -> C1 -> F1 = L1 = {1}, {2}, {3}, {5}, counts 2, 3, 3, 3 (minsupp = 2 prunes {4}).
C2 -> AND the pTrees: P1^P2 = 1 (0010), P1^P3 = 2 (1010), P1^P5 = 1 (0010), P2^P3 = 2 (0110), P2^P5 = 3 (0111), P3^P5 = 2 (0110) -> F2 = L2 = {13}, {23}, {25}, {35}, counts 2, 2, 3, 2.
C3 = {235}, since {123} is pruned ({12} not frequent) and {135} is pruned ({15} not frequent); P2^P3^P5 = 2 (0110) -> F3 = L3 = {235}, count 2.

It seems the pruning step above is unnecessary here, since the root count would show up below the threshold anyway, and that root count (using PopCount) is almost free??? All we need to do ARM are these frequent-itemset tables with counts.
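The same walkthrough as a sketch, with Python integers standing in for the (uncompressed) item pTrees: each pTree is a 4-bit mask over the TIDs, the root count is a popcount, and candidate counts come from bitwise AND:

```python
from itertools import combinations

# Item pTrees: bit i = 1 iff the item appears in transaction i.
# Transactions: 100:{1,3,4}  200:{2,3,5}  300:{1,2,3,5}  400:{2,5}
P = {1: 0b1010, 2: 0b0111, 3: 0b1110, 4: 0b1000, 5: 0b0111}
minsupp = 2

def count(mask):
    return bin(mask).count("1")              # the root PopCount

F = [{frozenset([i]) for i in P if count(P[i]) >= minsupp}]
while F[-1]:
    # Apriori join, then count each candidate by ANDing its item pTrees.
    cands = {a | b for a, b in combinations(F[-1], 2)
             if len(a | b) == len(a) + 1}
    level = set()
    for c in cands:
        mask = 0b1111                        # all four transactions
        for i in c:
            mask &= P[i]
        if count(mask) >= minsupp:
            level.add(c)
    F.append(level)

for k, level in enumerate(F[:-1], 1):
    print(k, sorted(tuple(sorted(s)) for s in level))
```

Note how the explicit subset-pruning step never fires: an infrequent candidate like {1,2,3} is eliminated anyway because its ANDed root count falls below the threshold, echoing the remark above.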
17
1-ItemSets don’t support Association Rules (They will have no antecedent or no consequent). 2-Itemsets do support ARs. Are there any Strong Rules supported by Frequent=Large 2-ItemSets (at minconf=.75)? {1,3} conf({1}{3}) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75 STRONG conf({3}{1}) = supp{1,3}/supp{3} = 2/3 = .67 < .75 {2,3} conf({2}{3}) = supp{2,3}/supp{2} = 2/3 = .67 < .75 conf({3}{2}) = supp{2,3}/supp{3} = 2/3 = .67 < .75 {2,5} conf({2}{5}) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75 STRONG! conf({5}{2}) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75 STRONG! {3,5} conf({3}{5}) = supp{3,5}/supp{3} = 2/3 = .67 < .75 conf({5}{3}) = supp{3,5}/supp{5} = 2/3 = .67 < .75 Are there any Strong Rules supported by Frequent or Large 3-ItemSets? {2,3,5} conf({2,3}{5}) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75 STRONG! conf({2,5}{3}) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75 No subset antecedent can yield a strong rule either (i.e., no need to check conf({2}{3,5}) or conf({5}{2,3}) since both denominators will be at least as large and therefore, both confidences will be at least as low. conf({3,5}{2}) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75 No need to check conf({3}{2,5}) or conf({5}{2,3}) DONE!
18
Using ARM to find k-plexes on the bipartite graph G9 -- does it work?
[Figure: the W x E incidence pTrees (WSP, ESP) and the WSP2/ESP2 bitmaps of G9, together with the frequent 1-, 2- and 3-itemset tables and their co-attendance counts; bit patterns and most table entries garbled in extraction, omitted.]

WomenSet ARM: MinSup = 6, MinConf = .75. EventSet ARM: MinSup = 9, MinConf = .75. The frequent women-sets and event-sets (1-, 2- and 3-sets with their frequencies) are tabulated in the figure; e.g., the candidate 3-women-set cde is excluded since ce is infrequent, and candidate 3-event-sets are excluded whenever a 2-subset is not frequent.

Strong E-rules include 3->5, 5->3, 5->6, 6->5, 5->7, 5->8, 6->8, 7->8, 9->8 and the rules among {5,6,7} (which says 567 is a strong event community?). Strong W-rules include 2->1, 1->2, 1->3, 3->1, 1->4, 4->1, 2->3, 3->2, 2->4, 4->2, 3->4, 4->3 and the rules among {1,3,4} (which says 1234 is a strong women community? But 134 is a very strong women community?).

Note: when I did this ARM analysis, I had several degrees miscounted. Nonetheless, I think the same general negative result is expected. Next we try using the WSP2 and ESP2 relationships for ARM.
19
G9 ESP2 EventSet ARM: MinSup = 9, MinConf = .75. WSP2 WomenSet ARM: MinSup = 18, MinConf = .75.

[Figure: the WSP2 and ESP2 bitmaps and the frequent k-event-set tables for k = 1..9, with frequencies (all either 9 or e = 14); mostly garbled in extraction, omitted.]

Frequent 1-women-sets: 1, 3, 4, 9, a, d, e, f, each with frequency 18; all candidate 2-women-sets are frequent as well. This is not interesting! Go to ESP2 E-set ARM.

In the event-set tables, all rule confidences are either 100% (9/9 or e/e) or 9/e = 64%. ARM on either SP1 or SP2 (for W or E) does not seem to help much in identifying communities.
20
APPENDIX: G10 SP3 lists. [Data slide: the per-vertex shortest-3-path destination lists; garbled in extraction, omitted.]
22
[Data slide: continuation of the G10 SP lists; garbled in extraction, omitted.]
23
[Appendix table: Start/End vertex pairs of the G10 edges; garbled in extraction, omitted.]
24
G10 (v = 1-44) SP2 2-level pTrees, LevelOneStride=19 (labelled 0-i), Level0Stride=10 (labelled 0-9).

[Figure: the SP1 and SP2 bitmaps with out-degrees; keys and bit patterns omitted.]

G10 leaves (OutDegree=0): a3 a6 a8 a9 b0 b7 b8 b9 e7 e8 e9 f0 f1 f2 f3 f4 f5 f8 f9 g0 g8 g9 h7