1
Shortest Path Trees Construction
(We don't need the Path Trees to get the Shortest Path Trees, because a subpath of a shortest path is itself a shortest path.)
S1P = E (the edge pTrees). For start vertex i at step k:
S(k+1)P_i = SPSFk'_i & (OR_{j∈SkP_i} E_j)
For example, S2P_1 = SPSF1'_1 & (OR_{j∈S1P_1} E_j), and likewise S2P_2 and S2P_3; then S3P_1 = SPSF2'_1 & (OR_{j∈S2P_1} E_j), S3P_3 = SPSF2'_3 & (OR_{j∈S2P_3} E_j), S4P_1 = SPSF3'_1 & (OR_{j∈S3P_1} E_j) and S4P_3 = SPSF3'_3 & (OR_{j∈S3P_3} E_j).
[Figure: graph G6 on vertices 1-9, a, b, c, with the bit columns SkP, SPSFk and SPSFk' for start vertices 1, 2 and 3 at steps k = 1 to 4; one trace is cut short with the note "Identical to ... from here on".]
Done with Vertex 1 Shortest Paths: Diam(1)=4. Done with Vertex 3 Shortest Paths.
What is the cost of creating the SPs? For each v∈V there are about Avg{Diam(v) : v∈V} steps, and each step costs:
1 complement of SPSF (cost = compl)
an OR of about Avg|E_k| pTrees (cost = OR*Avg|E_k|)
1 AND of SPSF' with that OR result (cost = AND)
1 OR to update SPSF (cost = OR)
Cost = |V|*AvgDiam*(compl + OR*AD + AND + OR), so O(|V|), i.e., linear in the number of vertices, assuming AD = AvgDeg and AvgDiam are small. This is a one-time construction, parallelizable over the vertices.
For Friends, it is B*4*(3*pTOP + AD*pTOP) = 4B*(3+AD)*pTOP = B*pTOP*(12+4AD), where pTOP is the cost of one pTree operation (complement, &, OR) and B = billion. Parallelized over an n-node cluster, this one-time Shortest Path Tree construction cost would be B*pTOP*(12+4*AvgDeg)/n.
The SnP's capture only the shortest path lengths between all pairs of vertices. We could also have captured the actual shortest paths (all shortest paths? all paths in PTs?), since we construct (but do not retain) that information along the way. How to structure it, index it, or residualize it?
Vertices 4-c SPs are done the same way:
SPSF1_i = S1P_i OR M_i, where M_i has a 1 only at position i
S(k+1)P_i = SPSFk'_i & (OR_{j∈SkP_i} E_j)
SPSF(k+1)_i = SPSFk_i OR S(k+1)P_i
"The mask pTree of the shortest (k+1)-paths starting at vertex i is the complement of the Shortest Paths So Far ANDed with the OR of the edge pTrees E_j over all j in the Shortest k Path List of i."
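The recurrence above can be sketched in Python with plain integers standing in for pTrees (bit v of a mask is vertex v). This is only an illustration under assumptions: the function name `shortest_path_masks` and the 0-based vertex numbering are hypothetical, and a Python int is an uncompressed bit vector, not a real pTree.

```python
def shortest_path_masks(edge_masks, i):
    """Return [S1P_i, S2P_i, ...] as integer bitmasks for start vertex i.

    edge_masks[j] has bit v set iff (j, v) is an edge, i.e. it plays the role of E_j.
    """
    n = len(edge_masks)
    full = (1 << n) - 1
    skp = edge_masks[i]            # S1P_i = E_i
    spsf = skp | (1 << i)          # SPSF1_i = S1P_i OR M_i
    levels = []
    while skp:
        levels.append(skp)
        reach = 0
        for j in range(n):         # OR of E_j over all j in SkP_i
            if (skp >> j) & 1:
                reach |= edge_masks[j]
        skp = (full & ~spsf) & reach   # S(k+1)P_i = SPSFk'_i & (OR ...)
        spsf |= skp                    # SPSF(k+1)_i = SPSFk_i OR S(k+1)P_i
    return levels
```

The number of masks returned is the eccentricity of the start vertex (what the slides call Diam(i)), so the loop runs exactly that many times.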
2
[Figure: G7 base shortest-path pTrees SP1 to SP5, each with its per-vertex count column (1deg to 5deg). Slide annotations: "10,25,26,28,29,33,34 not shown (only 17 on, 1=4dg)" and "15,16,19,21,23,24,27,30 only 17 on, 5deg=1".]
17 is an outlier. Try clustering by SPdeg from 17. The SPk17 pTrees mask the clustering (next slide).
BASE Shortest Path Trees
3
17 is an outlier. Try clustering by SPdeg from 17
The SPk17 pTrees mask the clustering.
[Figure: G7 clustered by SPdeg_k(17), one mask per value SPdeg=1 to SPdeg=5; SPdeg=1 is {6, 7}.]
Now we would want to make this divisive and recursive. The maroon cluster could be broken apart into white and blue. Then one could use DegreeDifference within clusters to trade vertices among clusters and improve the DegDif quality measure. Maybe an agglomerative or divisive approach using SPdeg? Agglomerate two pieces together iff the SPdegdif is improved (or still exceeds a threshold?). One could also use genetic-algorithm hill climbing on the SPdeg arrays to optimize the clustering. The bottom line is that there is a wealth of value in ShortestPathDegrees. One can easily mask subsets and recalculate SPdeg.
4
[Figure: G7 base shortest-path pTrees SP1 to SP5 with their 1deg to 5deg count columns.]
1 and 34 have the highest SP1deg (most siblings) at 16. Start with clusters S(1), S(34) of siblings. Break ties with the DegreeDiffs defined below.
intdeg_S(x) = # edges from x to S-vertices.
extdeg_S(x) = # edges from x to S'-vertices.
DegDif_S(x) = intdeg_S(x) - extdeg_S(x) (or intdeg_S(x)/(1+extdeg_S(x))?)
Start with S (and T, U, ... if there are ties) = the siblings of the x of highest SP1degree. So for G7, S=Sibl(1) and T=Sibl(34). Add y∈(S'-T) to S iff DegDif_S(y) > thresh1, and subtract z∈S from S iff DegDif_S(z) < thresh2.
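A minimal sketch of the DegDif add/subtract rule, assuming the graph is given as a dict of neighbor sets. The names `deg_dif` and `refine`, and the optional competing cluster `T`, are hypothetical helpers for illustration, and the thresholds are free parameters as on the slide.

```python
def deg_dif(adj, S, x):
    """DegDif_S(x) = intdeg_S(x) - extdeg_S(x)."""
    internal = len(adj[x] & S)
    external = len(adj[x] - S)
    return internal - external

def refine(adj, S, thresh1, thresh2, T=frozenset()):
    """One pass: add outside vertices (not in T) above thresh1, drop members below thresh2."""
    S = set(S)
    added = {y for y in adj
             if y not in S and y not in T and deg_dif(adj, S, y) > thresh1}
    dropped = {z for z in S if deg_dif(adj, S, z) < thresh2}
    return (S | added) - dropped
```

Both tests are evaluated against the same starting S, so the order in which vertices are visited does not change the result of a single pass.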
5
K-plex Search on G6: A k-plex is a Subgraph missing k edges
All subgraphs are induced subgraphs (ISGs) defined by their vertex set. A subgraph S has |E_S| = s edges and |V_S| = v vertices. S is a kplex iff C(v,2) - s = v(v-1)/2 - s ≤ k. If S is a kplex and S' adds one vertex x to S (V(S') = V(S) ∪ {x}), then S' is a kplex iff (v+1)v/2 - (deg(x,S') + s) ≤ k.
Edges are 1-plexes. |E{123}| = |PE123| = 3, so 123 is a 0plex (clique) and a 1plex. |E{124}| = |PE124| = 3, so 124 is a 0plex (clique).
If H is an ISG with |V_H| = h, edge count |E_H|, and Hmax = h(h-1)/2, then H is a kplex iff Hmax - |E_H| ≤ k. If H is a kplex and F is an ISG of H, then F is a kplex: if F is missing an edge then H is missing that edge also, since F inherits all H edges involving its vertices, so F cannot be missing more edges than H.
If G isn't a kplex, let F1 be the ISG of G with a vertex of least degree removed. If F1 isn't a kplex, let F2 be the ISG of F1 with a vertex of least degree removed, and so on, until some Fj is a kplex. Remove Fj. Repeat until all vertices are removed. We did a k-plex search of G6 by simply calculating edge counts (which are just 1-counts of ANDed pTrees) using only SP1=E.
[Figure: graph G6 on vertices 1-9, a, b, c, with the pTrees SP1=E, SP2, SP3 and SP4.]
In the trace, each step gives the vertex set, the vertex just removed with its degree, Hmax, the edge count |E|, the k bound, and the degree string where shown; ? marks a value missing on the slide.
Pass 1
G: Hmax=12*11/2=66, |E|=?, kplex for k ≥ ?
H1=ISG{?abc} (deg5=2): Hmax=11*10/2=55, |E|=17, kplex for k ≥ 38
H2=ISG{?abc} (deg6=2): Hmax=10*9/2=45, |E|=15, kplex for k ≥ 30
H3=ISG{123489abc} (deg7=1): Hmax=9*8/2=36, |E|=14, kplex for k ≥ 22
H4=ISG{12389abc} (deg4=2): Hmax=8*7/2=28, |E|=12, kplex for k ≥ 16
H5=ISG{1239abc} (deg8=2): Hmax=7*6/2=21, |E|=10, kplex for k ≥ 11
H6=ISG{239abc} (deg1=2): Hmax=6*5/2=15, |E|=?, kplex for k ≥ 7
H7=ISG{39abc} (deg2=1): Hmax=5*4/2=10, |E|=?, kplex for k ≥ 3
H8=ISG{9abc} (deg3=1): Hmax=4*3/2=6, |E|=?, kplex for k ≥ ?
So take out {9abc} and start over.
Pass 2
G={?}: Hmax=8*7/2=28, |E|=?, kplex for k ≥ 18, deg=?
H1=ISG{?} (deg8=1): Hmax=7*6/2=21, |E|=9, kplex for k ≥ 12, deg=?
H2=ISG{234567} (deg1=2): Hmax=6*5/2=15, |E|=6, kplex for k ≥ 9, deg=112223
H3=ISG{34567} (deg2=1): Hmax=5*4/2=10, |E|=4, kplex for k ≥ 6, deg=01222
H4=ISG{4567} (deg3=0): Hmax=4*3/2=6, |E|=4, kplex for k ≥ 2, deg=1222
H5=ISG{567} (deg4=1): Hmax=3*2/2=3, |E|=3, kplex for k ≥ 0, deg=222
So take out {567} and start over.
Pass 3
G={12348}: Hmax=5*4/2=10, |E|=?, kplex for k ≥ 5, deg=33220
H1=ISG{1234} (deg8=0): Hmax=4*3/2=6, |E|=5, kplex for k ≥ 1, deg=3322
H2=ISG{124} (deg3=2): Hmax=3*2/2=3, |E|=3, kplex for k ≥ 0, deg=222
This is exactly what we want! {1234} is a 1plex (missing only 1 edge) and 124 was determined to be a clique (0plex, missing no edges). It would have been great if 123 had revealed itself as a clique also, and if 89abc had been detected as a 2plex before 9abc was detected as a clique. How might we make progress in these directions? Try returning to remove all degree ties before moving on? We will try that on the next slide.
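The least-degree peeling loop described above can be sketched as follows. This assumes a dict of neighbor sets rather than pTrees, uses the slide's meaning of kplex (at most k missing edges), and breaks degree ties by smallest vertex label, which is an arbitrary choice; `edge_count`, `missing_edges` and `peel_to_kplex` are hypothetical names.

```python
from itertools import combinations

def edge_count(adj, S):
    """Number of edges of the subgraph induced by vertex set S."""
    return sum(1 for u, v in combinations(sorted(S), 2) if v in adj[u])

def missing_edges(adj, S):
    """C(v,2) - s: how many edges S lacks compared to a clique."""
    v = len(S)
    return v * (v - 1) // 2 - edge_count(adj, S)

def peel_to_kplex(adj, S, k):
    """Remove a least-degree vertex until at most k edges are missing."""
    S = set(S)
    while S and missing_edges(adj, S) > k:
        x = min(sorted(S), key=lambda u: len(adj[u] & S))
        S.remove(x)
    return S
```

Calling `peel_to_kplex(adj, V, 0)` returns one clique; removing it from V and calling again mirrors the "take out and start over" passes of the trace.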
6
1 4 2 3 5 6 7 c 9 b a 8 K-plex search on G6 continued G6
k-plex = subgraph missing k edges. If H is a kplex and F is an ISG of H, then F is a kplex. If H is an ISG with |V_H| = h, edge count |E_H| and Hmax = h(h-1)/2, then H is a kplex iff Hmax - |E_H| ≤ k. If F is missing an edge, H is missing that edge too (F inherits all H edges), so F can't be missing more edges than H.
k-core = subgraph containing k edges. If F is a kcore and an ISG of H, then H is a kcore.
Mining all kplexes and kcores. At each step, we [potentially] branch to each of the lowest-degree vertices (note: many of them are skipped in this illustration). ? marks a value missing on the slide.
H0=G={?abc}: Hmax=12*11/2=66, |E|=?, kplex for k ≥ 47, kcore for k ≤ 19
H1=ISG{?abc} (deg5=2): Hmax=11*10/2=55, |E|=?, kplex for k ≥ 38, kcore for k ≤ 17
H26=ISG{?abc} (deg6=2): Hmax=10*9/2=45, |E|=?, kplex for k ≥ 30, kcore for k ≤ 15
H27=ISG{?abc} (deg7=2): Hmax=10*9/2=45, |E|=?, kplex for k ≥ 30, kcore for k ≤ 15
(H26 and H27 specify removal of 7 and 6 resp. Thus remove both.)
H2=ISG{123489abc}: Hmax=9*8/2=36, |E|=?, kplex for k ≥ 22, kcore for k ≤ 14
H34=ISG{12389abc}: Hmax=8*7/2=28, |E|=?, kplex for k ≥ 16, kcore for k ≤ 12
H38=ISG{12349abc}: Hmax=8*7/2=28, |E|=?, kplex for k ≥ 15, kcore for k ≤ 13
H348=ISG{1239abc}: Hmax=7*6/2=21, |E|=10, kplex for k ≥ 11, kcore for k ≤ 10
H341=ISG{2389abc}: Hmax=7*6/2=21, |E|=10, kplex for k ≥ 11, kcore for k ≤ 10
H342=ISG{1389abc}: Hmax=7*6/2=21, |E|=10, kplex for k ≥ 11, kcore for k ≤ 10
(H341, H342, H38 specify removal of 1, 2. Thus remove both.)
H4=ISG{389abc}: Hmax=?, |E|=?, kplex for k ≥ 6, kcore for k ≤ 9
H5=ISG{89abc}: Hmax=5*4/2=10, |E|=?, kplex for k ≥ 2, kcore for k ≤ 8
H6=ISG{9abc} (deg8=2): Hmax=?, |E|=?, kplex for k ≥ 0, kcore for k ≤ 6
This is what we want: 89abc is a 2plex; 9abc is a 0plex.
H0=G={?}: kplex for k ≥ 11, kcore for k ≤ 9
H03=G={124567}: kplex for k ≥ 7, kcore for k ≤ 8
H05=G={123467}: kplex for k ≥ 7, kcore for k ≤ 8
H06=G={123457}: kplex for k ≥ 7, kcore for k ≤ 8
H035=G={12467}: kplex for k ≥ 7, kcore for k ≤ 8
H036=G={12457}: kplex for k ≥ 7, kcore for k ≤ 8
H0356=G={1247}: kplex for k ≥ 2, kcore for k ≤ 4
H03567=G={124}: kplex for k ≥ 0, kcore for k ≤ 3
This is what we want. Remove 12489abc.
H7={3567}: Hmax=6, |E|=?, kplex for k ≥ 3, kcore for k ≤ 3
H7={567}: Hmax=3, |E|=?, kplex for k ≥ 0, kcore for k ≤ 3
We might want the kplex and/or kcore structure around a particular vertex. Use SP1, SP2, .... E.g., find the kplex and kcore structure around v=1: SPL1(1)=234, SPL2(1)=7c, SPL3(1)=569abc, SPL4(1)=8.
To check 1234 kplex/core status, check whether the edges among 2, 3, 4 exist: (y,y,n). Thus 123 and 124 are 0plexes and 3cores; 134 and 234 are 1plexes and 2cores; 1234 is a 1plex and a 5core.
To check 12347c kplex/core status, check edges 17 1c 27 2c 37 3c 47 4c 7c: (n n n n n y y n n). So 12347c is a (Comb(6,2)-7)plex = 8plex, and a 7core.
[Figure: graph G6 with the pTrees SP1, SP2, SP3 and SP4.]
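The kplex and kcore bounds in the trace both come from one edge count, which the slides describe as a 1-count of ANDed pTrees. A small sketch of that count with integer bitmasks follows; the function name `plex_core_bounds` and the 0-based vertex numbering are assumptions made for illustration.

```python
def plex_core_bounds(edge_masks, vertices):
    """Return (smallest k for which the ISG is a kplex, largest k for which it is a kcore).

    edge_masks[v] has bit u set iff (v, u) is an edge. The kcore here follows the
    slide's usage: a subgraph containing k edges.
    """
    mask = 0
    for v in vertices:
        mask |= 1 << v
    # Each internal edge is counted once from each endpoint.
    twice_edges = sum(bin(edge_masks[v] & mask).count("1") for v in vertices)
    edges = twice_edges // 2
    h = len(vertices)
    return h * (h - 1) // 2 - edges, edges
```

For a triangle with one pendant vertex, the triangle gives (0, 3) and the whole graph gives (2, 4), matching the "kplex for k ≥ ..., kcore for k ≤ ..." reading of each trace row.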
7
K-Degree-Difference Community Search on G6: A kDegreeDifference Community of a graph G is a subgraph H such that dd_H = IntDeg_H - ExtDeg_H ≥ k, where IntDeg_H and ExtDeg_H are the sums of the internal degrees id_h and external degrees ed_h over h∈H.
Theorem: if h∈H, then dd_{H-h} = dd_H - (3*id_h - ed_h). Removing h takes 2*id_h off IntDeg (id_h for h itself and 1 for each of its neighbors in H) and changes ExtDeg by id_h - ed_h. For example, removing 7 (id=1, ed=2) from {1247} takes dd from 4 to 4 - (3-2) = 3. The trace ranks vertices by the score 2id_h - ed_h and removes a vertex with the minimum score. ? marks a value missing on the slide.
Pass 1
H=G={?abc}: id=?, ed=?, dd=38, dd/|V_H| = 38/12 = 3.16. Remove 5.
H={?abc}: id=?, ed=?, dd=34, dd/|V_H| = 34/11 = 3.09, 2id-ed=?. Remove 6,7.
H={123489abc}: id=?, ed=?, dd=26, dd/|V_H| = 26/9 = 2.88, 2id-ed=?. Remove 4,8.
H={1239abc}: id=?, ed=?, dd=16, dd/|V_H| = 16/7 = 2.28, 2id-ed=?. Remove 1,2.
H={39abc}: id=13334, ed=?, dd=10, dd/|V_H| = 10/5 = 2.0, 2id-ed=05568. Remove 3.
H={9abc}: id=3333, ed=?, dd=9, dd/|V_H| = 9/4 = 2.25, 2id-ed=5565. Clique, so start over.
Pass 2
H={?}: id=?, ed=?, dd=17, dd/|V_H| = 17/8 = 2.13, 2id-ed=?. Remove 8.
H={?}: id=?, ed=?, dd=16, dd/|V_H| = 16/7 = 2.28, 2id-ed=?. Remove 3,6.
H={12457}: id=22312, ed=?, dd=6, dd/|V_H| = 6/5 = 1.2, 2id-ed=33613. Remove 5.
H={1247}: id=2231, ed=?, dd=4, dd/|V_H| = 4/4 = 1.0, 2id-ed=3360. Remove 7.
H={124}: id=222, ed=111, dd=3, dd/|V_H| = 3/3 = 1.0, 2id-ed=333. Clique, so start over with 35678.
Pass 3
H={?}: id=02321, ed=?, dd=2, dd/|V_H| = 2/5 = 0.4, 2id-ed=-3,4,6,3,0. Remove 3.
H={?}: id=2321, ed=?, dd=5, dd/|V_H| = 5/4 = 1.25, 2id-ed=4630. Remove 8.
H={567}: id=222, ed=011, dd=4, dd/|V_H| = 4/3 = 1.33, 2id-ed=433. Clique, so remove 567 and start over with 38 (but it has 0 id).
[Figure: graph G6 with the pTree SP1.]
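A short sketch for experimenting with the degree-difference peeling on a small graph. It assumes a dict of neighbor sets and uses the removal score 2id - ed tabulated in the trace. Counting directly, each internal edge of the removed vertex lowers IntDeg by 2 and raises ExtDeg by 1, so the exact change in dd is 3*id - ed, which the test below checks numerically. All function names are hypothetical.

```python
def id_ed(adj, H, x):
    """Internal and external degree of x with respect to vertex set H."""
    return len(adj[x] & H), len(adj[x] - H)

def dd(adj, H):
    """Degree difference of H: sum of internal degrees minus sum of external degrees."""
    return sum(i - e for i, e in (id_ed(adj, H, x) for x in H))

def removal_score(adj, H, x):
    """The 2id - ed score used in the trace to pick the next vertex to remove."""
    i, e = id_ed(adj, H, x)
    return 2 * i - e

def peel_once(adj, H):
    """Remove one vertex of minimum score (ties broken by smallest label)."""
    H = set(H)
    x = min(sorted(H), key=lambda u: removal_score(adj, H, u))
    return x, H - {x}
```

Since the 3*id - ed and 2*id - ed rankings differ by id, the two scores can pick different vertices when a low-score vertex has many internal edges.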
8
Very Simple Weighted SP1 and SP2 K-plex Search on G6
Weighting: 0,1-path nbrs of x times 3; 2-path nbrs of x times 2, until all degrees are weighted; then back to actual subgraph (SG) degrees.
[Figure: graph G6 with the pTrees SP1, SP2, SP3, SP4 and the weighted and UNWEIGHTED degree strings for each start vertex x; only fragments of the strings survive (cc68 for x=8, cc9c for x=9, a, b, ccpc for x=c).]
Results per start vertex x (Hmax = h(h-1)/2, |E| = edge count):
x=1: after cutting 2,3,4: Hmax=15, |E|=7, kplex k ≥ 8; after cut 23468: Hmax=6, |E|=5, kplex k ≥ 1
x=2: after cutting 2,3,4: Hmax=15, |E|=7, kplex k ≥ 8; after cut 23468: Hmax=6, |E|=5, kplex k ≥ 1
x=3: after cut 2368: Hmax=6, |E|=4, 2plex; after cut 1 (actual subgraph degrees): Hmax=3, |E|=3, 0plex
x=4: after cut 2346: Hmax=3, |E|=3, 0plex
x=5: after cut 34: Hmax=10, |E|=5, 5plex; after cut 1 from SG degs: Hmax=3, |E|=3, 0plex
x=6: after cut 34, then after cut 12 (SG degs 211): Hmax=3, |E|=2, 1plex
x=7: after cut 34, then after cut 1 (SG degs): Hmax=3, |E|=3, 0plex
x=8: after cut 34, then after cut 12 (SG degs): result not shown
x=9: after cutting 2,3,6: Hmax=10, |E|=8, kplex k ≥ 2
x=a: after cut 2,3,6: Hmax=10, |E|=8, kplex k ≥ 2
x=b: after cut 2,3,6: Hmax=6, |E|=6, kplex k ≥ 0
x=c: after cut 2,3,6: Hmax=6, |E|=6, kplex k ≥ 0
By weighting the initial round we have gotten nearly perfect information for this example (G6). The weightings, 3 and 2, were arbitrarily chosen but worked here. In general, one should devise a formula to determine them. We could also weight SP3 etc. If we have paid the price of constructing SPk for k>1, this is a much simpler way to do it than the Clique Percolation method of Palla (next slide).
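One possible reading of the weighted cut, sketched in Python. The slide does not spell the rule out, so everything here is an assumption: a vertex's degree is multiplied by `w_near` if it is x or a neighbor of x, by `w_two` if it is at distance 2, and by `w_far` otherwise, and vertices whose weighted degree falls below `threshold` are cut. The names `bfs_dist` and `weighted_cut` are hypothetical.

```python
from collections import deque

def bfs_dist(adj, x):
    """Shortest-path distance from x to every reachable vertex."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def weighted_cut(adj, x, w_near, w_two, threshold, w_far=1):
    """Keep the vertices whose distance-weighted degree is at least threshold."""
    dist = bfs_dist(adj, x)
    keep = set()
    for v in adj:
        d = dist.get(v)
        if d is not None and d <= 1:
            w = w_near
        elif d == 2:
            w = w_two
        else:
            w = w_far
        if w * len(adj[v]) >= threshold:
            keep.add(v)
    return keep
```

After the first weighted cut, the slide returns to actual subgraph degrees, which corresponds to continuing with an unweighted least-degree peel on the kept set.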
9
G7 Very Simple Weighted SP1 k-plex Search on G7 Weighting:
0,1-path nbrs of x times 1; 2-path nbrs of x times 0.
[Figure: graph G7 with its SP1 pTree (vertices 1-34) and degree strings D, written with symbols for base 65.]
Trace (Hmax = h(h-1)/2, |E| = edge count; values as shown on the slide):
H=G: Hmax=561, |E|=77, kplex k ≥ 484, kcore k ≤ 77
Cut 123: Hmax=120, |E|=38, kplex k ≥ 82, kcore k ≤ 38
Cut 23: Hmax=55, |E|=26, kplex k ≥ 24, kcore k ≤ 26
Cut 24: Hmax=15, |E|=12, kplex k ≥ 3, kcore k ≤ 12
Cut 2: Hmax=10, |E|=10, kplex k ≥ 0, kcore k ≤ 10. {1,2,3,4,14} is a clique. {1,2,3,4,9,14} is a 3plex.
Cut 0: Hmax=21, |E|=4, kplex k ≥ 17, kcore k ≤ 4. Cut 1 leaves 25 only.
Cut 012: Hmax=55, |E|=19, kplex k ≥ 36, kcore k ≤ 19
H: Hmax=19, |E|=4, kplex k ≥ 15, kcore k ≤ 4
Cut 03: Hmax=6, |E|=4, kplex k ≥ 2, kcore k ≤ 6. {24,32,33,34} is a 2plex.
Cut 0: Hmax=19, |E|=4, kplex k ≥ 15, kcore k ≤ 4. Cut 0 leaves {9,31} as a 0plex.
H: Hmax=17, |E|=2, kplex k ≥ 15, kcore k ≤ 2. Cut 0 leaves {27,30} as a 0plex.
Cut 01: Hmax=15, |E|=6, kplex k ≥ 9, kcore k ≤ 6
Cut 0: Hmax=10, |E|=6, kplex k ≥ 4, kcore k ≤ 6. {5,6,7,11,17} is a 4plex.
H: Hmax=14, |E|=0, kplex k ≥ 14, kcore k ≤ 0: no edges left.
Cut 0: Hmax=21, |E|=4, kplex k ≥ 17, kcore k ≤ 4
The expected communities are mostly not detected as kplexes or kcores.
10
ISG EdgeCount kplex Search Alg on G8
G8 is a graph of word associations starting from the word BRIGHT, using the USF Free Association data. An edge AB means some people associate the word B with the word A. We try to determine the 4 categories: Intelligence, Astronomy, Light, Colors.
[Figure: graph G8 (54 vertices) with its SP1 pTree and per-vertex degree column.]
Attempt 1 (Hmax = h(h-1)/2, |E| = edge count; values as shown on the slide):
H=G: Hmax=1431, |E|=197, kplex k ≥ 1234, kcore k ≤ 197; Deg (partial): 44444bb5656h9747c3c864fag4a386e j
Cut: Hmax=45, |E|=22, kplex k ≥ 13, kcore k ≤ 22
Cut: Hmax=10, |E|=8, kplex k ≥ 2, kcore k ≤ 8
So {12,24,25,31,54} = {sun, yellow, color, red, bright} is a 2plex.
Attempt 2: Remove bright, and double the weight of the nbrs of 12 (the vertex of max degree).
H: Hmax=1431, |E|=197, kplex k ≥ 1234; Deg (partial): 44444ba5645g9746b2b864f9f49386d
Cut: Hmax=1431, |E|=197, kplex k ≥ 1234; Deg (partial): 44484mka68agie4cm2b8c4fif49386d e356a349c5
Cut: Hmax=1431, |E|=197, kplex k ≥ 1234
Word key: 1 Scientist, 2 Science, 3 Astronomy, 4 Earth, 5 Space, 6 Moon, 7 Star, 8 Ray, 9 Intelligent, 10 Golden, 11 Glare, 12 Sun, 13 Sky, 14 Moonlight, 15 Eyes, 16 Sunshine, 17 Light, 18 Lit, 19 Dark, 20 Brown, 21 Tan, 22 Orange, 23 Blue, 24 Yellow, 25 Color, 26 Gray, 27 Black, 28 Race, 29 White, 30 Green, 31 Red, 32 Crayon, 33 Pink, 34 Velvet, 35 Flashlight, 36 Glow, 37 Dim, 38 Gifted, 39 Genius, 40 Smart, 41 Inventor, 42 Einstein, 43 Brilliant, 44 Shine, 45 Laser, 46 Telescope, 47 Horizon, 48 Sunset, 49 Ribbon, 50 Violet, 51 Purple, 52 Beam, 53 Night, 54 Bright
11
SP1 and SP2 for G8
[Figure: bit matrices of the SP1 and SP2 pTrees for G8, with per-vertex degree columns.]
12
Very Simple Weighted SP1 and SP2 K-plex Search on G8
[Figure: graph G8.]
x=1. Weighting: 0,1-path neighbors (12012) times 5; 2-path nbrs (39893) times 3. Next cut<18; instead cut<19.
This gives C0={1,2,9,39,40,41,42,43}, which is exactly the Intelligence class except that v=38 (gifted) is missing. It is a kplex for k ≥ 8 (not that strong a community!).
Within the Intelligence class there is the 1plex C1={1,2,40,41,42} (the only edge missing is (2,40)). Thus, if we cut next using C1-degrees (cut 2, 40), that leaves the clique (0plex) C2={1,41,42}.
Cutting C0 and starting over on G-C0 with x=3. Weighting: 0,1-path neighbors (367) times 5; 2-path nbrs ( ) times 3. Next cut<10, then next cut<12.
This gives C2={3,4,5,6,7, ,13,14,15,17,23,25,31,44, ,53}, whereas Astronomy is 3,4,5,6,7,8,10,11,12,13,14,16,17, ,45,46,47,48,52,53. So, not a good fit! On the next slide we try again with replacement, but using as starting vertex the remaining vertex of highest degree.
13
Simple Weighted SP1 and SP2 K-plex Search on G8 Cont.
[Figure: graph G8.]
With replacement, but using as starting vertex the remaining vertex of highest degree (first, v=12). ? marks a weight missing on the slide.
x=12 (Astronomy). Weighting: 0,1 SP nbrs times ?; ? SP nbrs times 3. cut<20.
Astronomy. Weighting: 0,1 SP nbrs times ?; ? SP nbrs times 3. cut<30.
Astronomy. Weighting: 0,1 SP nbrs times ?; ? SP nbrs times 1. 5 astronomy vertices are missing {3,5,45,46,53} and 2 non-astronomy vertices are included {21,24}.
x=25 (Colors). Weighting: 0,1 SP nbrs times 6. 4 colors are missing but zero non-colors are included.
Next, try straight Agglomerative Clustering using the similarity measure DegDif, where DegDif(v) = 0 - deg(v) (since intdeg(v) = 0).
14
Agglomerative Clustering with similarity=DegDif DegDif(v)=0-deg(v) since 0=intdeg(v)
[Figure: graph G8 with its SP1 pTree; the agglomerated clusters are annotated with DegDif values -10, -8 and -21.]
DegDif(18) = -3 (max). Agglomerate with siblings {17,18,19,54}: DegDif = 6-16 = -10.
DegDif(28) = -3 (max). Agglomerate with siblings { }: DegDif = 6-14 = -8.
DegDif(37) = -3 (max). Agglomerate with siblings { }: DegDif = 10-31 = -21.
DegDif(45) = -3 (max). Agglomerate with siblings { }. (Note that we have linked up with the "Light" cluster from the Astronomy cluster.)
It occurs to me that using an agglomerative method on an example that is known to have overlapping clusters is a bad idea (agglomerative methods always produce a partition, with no overlapping clusters). Therefore, let's start over, applying the agglomerative method to a different example that is not expected to have overlapping clusters.
15
Agglomerative Clustering with similarity=DegDif DegDif(v)=0-deg(v) since 0=intdeg(v)
DegDif(12) = -1 (max). Agglomerate with siblings {1,12}: DegDif = -14.
DegDif(10) = -2 (max). Agglomerate with siblings { }: DegDif = -23.
DegDif(13) = -2 (max). Agglomerate with siblings { }: DegDif = -12.
DegDif(15,16) = -2 (max). Agglomerate with siblings { }: DegDif = -25.
DegDif(17) = -2 (max). Agglomerate with siblings {6 7 17}: DegDif = 3-1 = 2.
DegDif(6 7 17) = 2 (max). Agglomerate with siblings { }: DegDif = 6-4 = 2.
DegDif(18) = -2 (max). Agglomerate with siblings { }: DegDif = -9.
DegDif(19,21) = -2 (max). Agglomerate with siblings { }: DegDif = -13.
DegDif(22) = -2 (max). Agglomerate with siblings { }: DegDif = -5.
DegDif(23,27,30) = -2 (max). Agglomerate with siblings { }: DegDif = -13.
DegDif(25) = -3. Agglomerate with siblings { }: DegDif = 4-7 = -3.
[Figure: SP1 pTree (1deg) with the clusters annotated by DegDif values -12, -14, -5, -9, -23, 2, -25, -13, -3.]
Even though there is no cluster overlap here, our method does not follow the usual agglomeration methodology, in which there is a similarity measure between pairs. Starting out, all subclusters are points, so the initial similarity is between points; later it involves similarity between a point and a subset, and between two subsets. There needs to be a consistent definition of similarity across all these types of pairs, which we do not have here. Therefore, let's start over and try to define a correct similarity.
Given a similarity, there are two standard clustering approaches, k-means and agglomerative. Agglomerative requires the complete similarity above (between pairs of subsets, one or both of which can be singletons), while k-means simply requires a similarity between pairs of points.
One similarity we might consider is a weighted sum of common cousins. E.g., let c0 be the # of common 0th cousins (siblings), c1 the # of common 1st cousins, etc. If we sum the common cousin counts with weights w0, w1, ... (presumably decreasing), then we have a similarity measure which is complete. We try this similarity on the next slide, first for agglomeration, then for k-means.
16
Agglomerative Clustering with similarity=DegDif DegDif(v)=0-deg(v) since 0=intdeg(v)
DegDif(12) = -1 (max). Agglomerate with siblings {1,12}: DegDif = -14.
DegDif(10) = -2 (max). Agglomerate with siblings { }: DegDif = -22.
DegDif(12) = -1 (max). Agglomerate with siblings {1,4,12,13}: DegDif = -9.
DegDif(15) = -2 (max). Agglomerate with siblings { }: DegDif = -18.
DegDif(16) = -1 (max). Agglomerate with siblings { }: DegDif = -14.
DegDif(19) = -1 (max). Agglomerate with siblings { }: DegDif = -10.
DegDif(21) = -1 (max). Agglomerate with siblings { }: DegDif = 10-16 = -6.
DegDif(23) = -1. Agglomerate with siblings { }: DegDif = -2.
DegDif(27,30) = -2. Agglomerate with siblings { }: DegDif = 6.
DegDif(17) = -2. Agglomerate with siblings {6 7 17}: DegDif = 3-2 = 1.
DegDif(22) = -2 (max). Agglomerate with siblings {1,4,12,13,22}: DegDif = -7.
DegDif(18) = -2 (max). Agglomerate with siblings {1,4,12, }: DegDif = -5.
DegDif(29) = -2 (max). Agglomerate with siblings {29 32}: DegDif = -4.
DegDif(31) = -3. Agglomerate with siblings { }: DegDif = 16.
DegDif( ) = -2. Agglomerate with siblings { }: DegDif = 1.
[Figure: SP1 pTree (1deg) with the resulting clusters.]
17
Similarity Clustering: sim(x,y) = W + SUM_{k=0..n} w_k(x,y)*c_k(x,y)
c0 = # common siblings (other than themselves), c1 = # common 1st cousins, c2 = # common 2nd cousins, .... W=5 iff x and y are siblings; w0=2, w1=1, else 0. Agglomerative first. Calculate the initial similarities:
[Table: initial similarities for the 34 vertices of G7, shown next to the SP1 (1dg) and SP2 (2dg) pTrees.]
18
Similarity Clustering: sim(x,y) = W + SUM_{h=1..n; k=1..n} w_hk(x,y)*c_hk(x,y)
c_hk = count(SPh(x) & SPk(y)); W=6, w11=3, w12=w21=2, w22=1, else 0.
[Table: similarities for the 34 vertices of G7, shown next to the SP1 (1dg) and SP2 (2dg) pTrees.]
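A minimal sketch of this similarity using integer bitmasks for the SPk masks. The layout `sp[x] = [SP1(x), SP2(x), ...]`, the function name `similarity`, and the choice to add W only when x and y are siblings (as on the previous slide) are assumptions for illustration; the default weights are the ones listed above.

```python
def similarity(sp, x, y, W=6, w=None):
    """sim(x,y) = W (if siblings) + sum over (h,k) of w_hk * count(SPh(x) & SPk(y))."""
    if w is None:
        w = {(1, 1): 3, (1, 2): 2, (2, 1): 2, (2, 2): 1}
    siblings = bool(sp[x]) and (sp[x][0] >> y) & 1
    total = W if siblings else 0
    for (h, k), weight in w.items():
        if h <= len(sp[x]) and k <= len(sp[y]):
            common = sp[x][h - 1] & sp[y][k - 1]
            total += weight * bin(common).count("1")
    return total
```

Because SPh(x) never contains x and SPk(y) never contains y, the AND automatically leaves both endpoints out of the common-cousin counts.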
19
A Divisive Method 2 (two centroids are the max and the max non-nbr, so 1, 17)
No vertex is in neither; 6 and 7 are in both S1 and S17. Decide by the count of siblings shared with 1 and with 17 (s), then cousins (c), then 2nd cousins (d), then 3rd cousins (e). S6: 2,1 so 6 goes in 1. S7: 2,1 so 7 goes in 1.
[Figure: graph G7 with the SP1 (1deg) and SP2 (2deg) pTrees.]
20
A Divisive Method 3 (two centroids are the max and the max non-nbr, so 33, 34)
No vertex is in neither; 6 and 7 are in both S1 and S17. Decide by the count of siblings shared with 1 and with 17 (s), then cousins (c), then 2nd cousins (d), then 3rd cousins (e). S6: 2,1 so 6 goes in 1. S7: 2,1 so 7 goes in 1.
[Figure: graph G7 with the SP1 and SP2 pTrees.]
21
[Figure: G7 base shortest-path pTrees SP1 to SP5 with their 1deg to 5deg count columns.]
22
APPENDIX: G8
Word key: 1 Scientist, 2 Science, 3 Astronomy, 4 Earth, 5 Space, 6 Moon, 7 Star, 8 Ray, 9 Intelligent, 10 Golden, 11 Glare, 12 Sun, 13 Sky, 14 Moonlight, 15 Eyes, 16 Sunshine, 17 Light, 18 Lit, 19 Dark, 20 Brown, 21 Tan, 22 Orange, 23 Blue, 24 Yellow, 25 Color, 26 Gray, 27 Black, 28 Race, 29 White, 30 Green, 31 Red, 32 Crayon, 33 Pink, 34 Velvet, 35 Flashlight, 36 Glow, 37 Dim, 38 Gifted, 39 Genius, 40 Smart, 41 Inventor, 42 Einstein, 43 Brilliant, 44 Shine, 45 Laser, 46 Telescope, 47 Horizon, 48 Sunset, 49 Ribbon, 50 Violet, 51 Purple, 52 Beam, 53 Night, 54 BRIGHT
[Figure: graph G8 with its SP1 pTree.]
Fortunato: a graph of word associations starting from BRIGHT, built on the University of South Florida Free Association data. An edge between words A and B indicates that some people associate B with the word A. There are 4 categories: Intelligence, Astronomy, Light, Colors. "Bright" is related to all of them; other words overlap too, e.g. "dark" is in Colors and Light. Overlapping communities introduce a further variable, the membership of vertices in different communities, which enormously increases the number of possible covers with respect to standard partitions.
Clique Percolation Method (Palla): based on the idea that the internal edges of a community are likely to form cliques due to their high density, while it is unlikely that intercommunity edges form cliques. Palla uses the term k-clique for a complete graph with k vertices (this k-clique is different from the n-clique). Two k-cliques are adjacent if they share k-1 vertices. The union of adjacent k-cliques is a k-clique chain. Two k-cliques are connected if they are part of a k-clique chain. Finally, a k-clique community is the largest connected subgraph obtained by the union of a k-clique and of all k-cliques which are connected to it. (A k-clique community is identified by making a k-clique roll over adjacent k-cliques, where rolling means rotating a k-clique about the k-1 vertices it shares with any adjacent k-clique.)
k-clique communities can share vertices, so they can be overlapping. There may be vertices belonging to non-adjacent k-cliques, which are reached by different paths and end up in different clusters. Unfortunately, there are also vertices that cannot be reached by any k-clique, e.g. vertices with degree one.
To find k-clique communities, one first searches for maximal cliques. Then a clique-clique overlap matrix O is built, which is an nc by nc matrix (nc = # of cliques); O_ij is the number of vertices shared by cliques i and j. Keep the entries of O that are ≥ k-1, set the others to 0, and find the connected components of the resulting matrix.
Detecting maximal cliques is known to require a running time that grows exponentially with the size of the graph. However, the authors found that, for the real networks they analyzed, the procedure is quite fast, due to the fairly limited number of cliques, and that (sparse) graphs with up to 10^5 vertices can be analyzed in a short time.
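The overlap-matrix procedure can be sketched on a small graph as below. This is an illustration under assumptions, not Palla's implementation: maximal cliques are found with a plain Bron-Kerbosch recursion (a swapped-in standard technique), cliques with fewer than k vertices are dropped because they cannot contain a k-clique, and connected components of the thresholded overlap relation are found with a small union-find instead of an explicit matrix. The function names are hypothetical.

```python
from itertools import combinations

def maximal_cliques(adj):
    """All maximal cliques of a graph given as a dict of neighbor sets."""
    out = []

    def expand(r, p, x):
        if not p and not x:
            out.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return out

def k_clique_communities(adj, k):
    """Merge maximal cliques (size >= k) whose overlap is at least k-1 vertices."""
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in combinations(range(len(cliques)), 2):
        if len(cliques[a] & cliques[b]) >= k - 1:   # thresholded entry O_ab
            parent[find(a)] = find(b)
    groups = {}
    for idx, c in enumerate(cliques):
        groups.setdefault(find(idx), set()).update(c)
    return sorted(sorted(g) for g in groups.values())
```

Vertices that sit in no clique of size k, such as degree-one vertices when k=3, appear in no community, as noted above.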
23
SG Clique Mining
[Figure: graphs G2 and G3 (vertices 1-7) and G4 (vertices 1-8), each with its keyed edge pTree (E or PE, keys 1,1 through 7,7 or 8,8), the upper-triangular edge mask EU, a candidate vertex mask C, and CU.]
k=2: the 2-cliques (2 vertices) are the edges. Find the endpoints of the edge at bit position n as (Int((n-1)/7)+1, Mod(n-1,7)+1).
Trace on the 7-vertex example (CSk = set of k-cliques):
k=3: PE(2,3)=1, so 123∈CS3. PE(2,4)=1, so 124∈CS3. PE(1,4)=1, so 134∈CS3. PE(2,3)=1, so 234∈CS3. PE(6,7)=1, so 567∈CS3. PE(2,6)=0, PE(1,7)=0, PE(1,5)=0.
k=4: PE(2,4)=1, so 1234∈CS4. 123, 124, 134, 234 are cliques, and 1234 is the only 4-clique.
Using the EdgeCount theorem on C={1,2,3,4}: CU = C & EU, and C is a clique since ct(CU) = comb(4,2) = 4!/(2!2!) = 6.
EdgeCount Algorithm (EC): if |PU_C| = (k+1)!/((k-1)!2!) then C∈CS(k+1). EC requires counting the 1's in the mask pTree of each subgraph (or of each candidate clique, if we take the time to generate the candidate sets; but then clearly the fastest way to finish up is simply to look up the single bit position in E).
The SG algorithm only needs the edge mask pTree, E, and a fast way to find those pairs of subgraphs in CSk that share k-1 vertices (then check E to see whether the two differing vertices form an edge in G). Again, this is a standard part of the Apriori ARM algorithm and has therefore been optimized and engineered ad infinitum!
Trace on G4 (8 vertices):
k=3: PE(2,3)=1, so 123∈CS3. PE(2,4)=1, so 124∈CS3. PE(2,8)=1, so 128∈CS3. PE(1,4)=1, so 134∈CS3. PE(3,8)=1, so 138∈CS3. PE(4,8)=1, so 148∈CS3. PE(2,3)=1, so 234∈CS3. PE(3,8)=1, so 238∈CS3. PE(4,8)=1, so 248∈CS3. PE(4,8)=1, so 348∈CS3. PE(6,7)=1, so 567∈CS3. PE(2,6)=0, PE(1,5)=0, PE(1,7)=0, PE(6,8)=0.
k=4: PE(2,4)=1, so 1234∈CS4. PE(3,8)=1, so 1238∈CS4. PE(4,8)=1, so 1248∈CS4. PE(3,8)=1, so 1348∈CS4. PE(4,8)=1, so 2348∈CS4.
k=5: PE(4,8)=1, so 12348∈CS5.
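The Apriori-style join can be sketched as follows: two k-cliques that share k-1 vertices are merged when their two differing vertices are joined by an edge. This is an illustrative sketch with a dict of neighbor sets standing in for the edge mask E; `next_cliques` and `all_cliques` are hypothetical names, and no candidate-pruning optimizations from Apriori are included.

```python
from itertools import combinations

def next_cliques(adj, cs_k):
    """Build CS(k+1) from CS(k): join pairs sharing k-1 vertices, then check one edge."""
    out = set()
    for a, b in combinations(cs_k, 2):
        diff = a ^ b                  # the two vertices not shared, if any
        if len(diff) == 2:
            u, v = diff
            if v in adj[u]:           # the single lookup in E
                out.add(a | b)
    return out

def all_cliques(adj):
    """Return {k: set of k-cliques} for k = 2, 3, ... until no cliques remain."""
    level = {frozenset((u, v)) for u in adj for v in adj[u] if u < v}
    levels = {}
    k = 2
    while level:
        levels[k] = level
        level = next_cliques(adj, level)
        k += 1
    return levels
```

On a graph made of a 4-clique {1,2,3,4} and a triangle {5,6,7}, this yields four 3-cliques from the 4-clique plus 567, and 1234 as the only 4-clique, which matches the first trace above.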
24
A kDensityDifference Community (kDenDif) of a graph, G, is a subgraph, H, such that dendif_H = IntDen_H - ExtDen_H ≥ k.
Notation: |V_H| = h, |E_H| = H and, by analogy, |V_G| = g, |E_G| = G.
IntDen_H = H/Comb(H,2) = H/(H(H-1)/2) = 2/(H-1)
ExtDen_H = (G-H)/(h(g-h))
So dendif_H = 2/(H-1) - (G-H)/(h(g-h)).
For x∈H, with deg_H x the degree of x within H:
dendif_(H-x) = 2/((H-deg_H x)-1) - (G-(H-deg_H x))/((h-1)(g-h+1))
= 2/(H-(deg_H x+1)) - (G-H+deg_H x)/(hg-hh+2h-g-1)
= [(2hg-2hh+4h-2g-2) - (G-H+deg_H x)(H-deg_H x-1)] / [(H-deg_H x-1)(hg-hh+2h-g-1)]
Theorem: If h∈H, dendif_(H-h) = dendif_H - (2id_h - ed_h). So we want to remove the h such that (2id_h - ed_h) is minimum.
[Figure: graph G6 on vertices 1-c with its SP1 pTrees.]
25
ELEMENTS OF COMMUNITY DETECTION (Fortunato):
The identification of structural clusters is possible only if graphs are sparse, i.e., if the number of edges m is of the order of the number of nodes n of the graph. If m>>n, the distribution of edges among the nodes is too homogeneous for communities to make sense. In this case the problem turns into something rather different, close to data clustering, which requires concepts and methods of a different nature. The main difference: while communities in graphs are related to the concept of edge density (inside versus outside the community), in data clustering communities are sets of points which are "close" to each other, with respect to a measure of distance or similarity defined for each pair of points.
We can relax the notion of cliques to subgraphs which are still clique-like by using properties related to reachability, i.e., to the existence (and length) of paths between vertices. An n-clique is a maximal subgraph such that the distance between each pair of its vertices is not larger than n. For n=1 it is a clique, so each geodesic (shortest path) has length 1.
This definition, more flexible than that of a clique, still has some limitations, deriving from the fact that the geodesic paths need not run on the vertices of the subgraph under study. The consequences: first, the diameter of the subgraph may exceed n, even if in principle each vertex of the subgraph is less than n steps away from any of the others. Second, the subgraph may be disconnected, which is not consistent with the notion of cohesion one tries to enforce.
There are two possible solutions, the n-clan and the n-club. An n-clan is an n-clique whose diameter is not larger than n, i.e., a subgraph such that the distance between any two of its vertices, computed over shortest paths within the subgraph, does not exceed n. An n-club is a maximal subgraph of diameter n. The difference: an n-clan is maximal under the constraint of being an n-clique, while an n-club is maximal under the constraint imposed by the length of the diameter.
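The limitation noted above (geodesics of an n-clique may leave the subgraph) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the 5-cycle and star graphs, the helper names `bfs_dist` and `max_pair_distance`, and the choice of n=2. The sketch only checks the distance condition, not maximality.

```python
from collections import deque
from itertools import combinations

def bfs_dist(adj, src, allowed):
    """Shortest-path lengths from src using only vertices in `allowed`."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in allowed and w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def max_pair_distance(adj, subset, allowed):
    """Largest pairwise distance within `subset`, with paths restricted to
    `allowed`; infinity if some pair is not connected."""
    worst = 0
    for u, v in combinations(sorted(subset), 2):
        worst = max(worst, bfs_dist(adj, u, allowed).get(v, float("inf")))
    return worst

# Hypothetical 5-cycle 0-1-2-3-4-0 and the vertex subset {0, 1, 2, 3}.
CYCLE = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
S = {0, 1, 2, 3}
dist_in_graph = max_pair_distance(CYCLE, S, set(CYCLE))   # geodesics may use vertex 4
dist_in_subgraph = max_pair_distance(CYCLE, S, S)         # geodesics must stay inside S

# Hypothetical star: leaves a and b are 2 apart in G but disconnected on their own.
STAR = {"c": {"a", "b"}, "a": {"c"}, "b": {"c"}}
leaves_in_graph = max_pair_distance(STAR, {"a", "b"}, set(STAR))
leaves_alone = max_pair_distance(STAR, {"a", "b"}, {"a", "b"})
```

In the cycle, every pair in S is within distance 2 in the full graph, yet the induced subgraph is the path 0-1-2-3 with diameter 3, which is the case the n-clan condition rules out. The star shows the second consequence: a subset that satisfies the distance bound in G while being disconnected as a subgraph.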
The example below is a network of karate club members, a well-known graph used as a benchmark to test community detection algorithms. It consists of 34 vertices, the members of a karate club in the United States, who were observed during a period of three years. Edges connect individuals who were observed to interact outside the activities of the club. A conflict between the president and the instructor led to the fission of the club into two separate groups (indicated by squares and circles). The question is whether the composition of the two groups can be inferred from the original network structure. One can distinguish two aggregations, one around vertices 33 and 34 (34 is the president), the other around vertex 1 (the instructor). One can also identify several vertices lying between the two main structures, like 3, 9 and 10; such vertices are often misclassified by community detection methods.
[Figure: G7, the karate club graph, with its SP1 pTrees for vertices 1-34.]
26
SP3
[Figure: SP3 pTrees for a graph on vertices 1-53.]
27
SP2, SP1
[Figure: SP2 and SP1 pTrees for a graph on vertices 1-53.]
28
Text Mining using pTrees
[Figure: the DTPe data cube (Doc, Term, Position), with the DT tf and DT SR DocTerm cubes.]
DTPe Data Cube: dimensions Doc (D=1..3), Term (T=1..9) and Position (P=1..7). Example terms: all, always, an, and, apple, April, are, AAPL, buy.
DT tf (DocTerm termfreq) Data Cube: tf is the +rollup of the DTPe data cube along the position dimension.
One can use any measurement or data structure of measurements, e.g., DT tfidf, in which each cell has a decimal tfidf. This can be bitsliced directly into whole-number bitslices plus fractional bitslices (one for each binary digit to the right of the binary point; no need to shift!) using MOD(INT(x/2^k), 2). E.g., a tfidf of 3.5 is 11.1 in binary, so its bits for k = 1, 0, -1 are 1, 1, 1.
DT SR (DocTerm StockRating) Cube: rating of T=stock at doc date close: 1=sell, 2=hold, 3=buy, 0=non-stock.
Tables:
DTPe Term Table: Term | P1D1 P1D2 P1D3 ... P7D1 ... P7D3
DTPe Position Table: Pos | T1D1 T1D2 T1D3 ... T9D1 ... T9D3
DTPe Document Table: Doc | T1P1 ... T1P7 ... T9P1 ... T9P7
Classical Document Table: Doc | Auth ... Date ... Subj1 ... Subjm
DTPe Term Usage Table: cells hold the part of speech (noun, verb, adj, adv, ...)
Cards: TD cards (TD Rolodex cards) for P=k, k=1..7; PD cards for T=k, k=1..9; PT cards for D=k, k=1,2,3.
pTree sets:
DTPe TpTreeSet, indexed by (D,P)
DTPe PpTreeSet, indexed by (T,D)
DTPe DocTbl DpTreeSet, indexed by (T,P)
DT tfidf DpTreeSet: T1k1, T1k0, T1k-1, T1k-2
DT SR bitmap DpTreeSet: T2,R=buy; T2,R=hold; T2,R=sell
DT SR bitslice DpTreeSet: T2k2, T2k1
Classical DocTbl DpTreeSet: Auth, Date, Subj1, ..., Subjm
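The bitslicing rule MOD(INT(x/2^k), 2) and the tf rollup can be sketched in Python as below. This is only an illustration on uncompressed lists: the function names `bitslice`, `bitslices` and `rollup_tf`, the sample tfidf values, and the toy 2x2x3 cube are all made up for the example, and no pTree compression is modelled.

```python
import math

def bitslice(x, k):
    """Bit of non-negative x at binary position k (k < 0 is right of the point)."""
    return math.floor(x / 2**k) % 2

def bitslices(column, ks):
    """One bit vector per position k for a column of measurements."""
    return {k: [bitslice(x, k) for x in column] for k in ks}

def rollup_tf(dtpe):
    """tf[d][t] = sum over positions p of the existence bit dtpe[d][t][p]."""
    return [[sum(positions) for positions in doc] for doc in dtpe]

# tfidf = 3.5 is 11.1 in binary.
bits_of_3_5 = [bitslice(3.5, k) for k in (2, 1, 0, -1, -2)]

# A hypothetical tfidf column for two documents, sliced at k = 1, 0, -1, -2.
slices = bitslices([3.5, 2.25], (1, 0, -1, -2))

# A hypothetical DTPe cube: 2 docs x 2 terms x 3 positions.
toy_dtpe = [[[1, 0, 1], [0, 0, 0]],
            [[0, 1, 0], [1, 1, 1]]]
toy_tf = rollup_tf(toy_dtpe)
```

Dividing by 2^k with a negative k multiplies instead, which is why the fractional digits come out of the same formula without a separate shifting step.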
29
Shortest paths on G5 (vertices 1-g)
[Figure: graph G5 with its SP1=E, SP2, SP3, SP4 and SP5 pTrees and the running SPSF2, SPSF3, SPSF4 and SPSF5 pTrees.]
How do we know we don't have to go further (that Diam(G5)=5)? We really should have continued one more step and then noticed that SP6 = all pure0 pTrees. Then we could conclude that, since there are no shortest 6-paths, no extension of a 6-path can be a shortest path. Done!
SP6 = all pure0 since:
SP5(2)=5,7. E5 and E7 have only 6, already in SPSF5(2).
SP5(5)=2,8. E2 and E8 have only 4, already in SPSF5(5).
SP5(7)=2,8. E2 and E8 have only 4, already in SPSF5(7).
SP5(8)=5,7. E5 and E7 have only 6, already in SPSF5(8).
SPSF5 gives the connectivity partition and can be formed at any time as OR(k=1..5) SPk. Since we've formed it already, we should retain it for that use. Others? I don't see value in SPSF1-4.
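The stopping rule discussed here (continue until the next SP level is all pure0) can be sketched with Python integers standing in for uncompressed bit vectors. This is an assumption-laden illustration: the function name `shortest_path_masks` and the 6-vertex example graph are hypothetical (G5's edges are not recoverable from the slide), and the recurrence follows the one stated earlier in the deck, SP(k+1)_i = SPSFk'_i & (OR of E_j over j in SPk_i).

```python
def shortest_path_masks(edges, n):
    """Return (sp, spsf): sp[k-1][i] is the bitmask of vertices at shortest-path
    distance k from i, and spsf[i] is the final 'shortest paths so far' mask."""
    E = [0] * n
    for u, v in edges:
        E[u] |= 1 << v
        E[v] |= 1 << u
    sp = [E[:]]                                    # SP1 = E
    spsf = [E[i] | (1 << i) for i in range(n)]     # SPSF1_i = SP1_i OR M_i
    while True:
        nxt = []
        for i in range(n):
            reach, frontier, j = 0, sp[-1][i], 0
            while frontier:                        # OR the edge masks of the frontier
                if frontier & 1:
                    reach |= E[j]
                frontier >>= 1
                j += 1
            nxt.append(reach & ~spsf[i])           # keep only newly reached vertices
        if not any(nxt):                           # next level is all zero: done
            break
        sp.append(nxt)
        spsf = [s | x for s, x in zip(spsf, nxt)]
    return sp, spsf

# Hypothetical graph: path 0-1-2-3 plus a separate edge 4-5.
sp, spsf = shortest_path_masks([(0, 1), (1, 2), (2, 3), (4, 5)], 6)
diameter = len(sp)                  # longest finite shortest path
components = sorted(set(spsf))      # distinct SPSF masks = connectivity partition
```

The loop ends at the first level whose masks are all zero, so the number of levels kept is the largest finite shortest-path length. The final SPSF masks are identical for vertices in the same component, which is the connectivity partition the slide suggests retaining.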