Published by Derick Blankenship. Modified over 6 years ago.
1
Graph Clustering Algorithms: Divisive
Girvan and Newman delete edges with maximum “betweenness”, i.e., maximum participation in shortest paths (of all lengths).
CS0 Algorithm: Delete any edge with zero Common Sibling (CS0) co-participation. The pTree calculation of CS(h,k) = E(h) & E(k) is instantaneous. We use CS0 on S1P = E only.
[Figures: example graphs G1 and G1_1 through G1_7, each with its S1P pTrees and pairwise ANDs.]
Annotations on the example graphs, in slide order: CS0 picks 24. Correct. / CS0 says all edges are equal (seems correct). / CS0 says all edges are equal (seems correct). / CS0 says all edges are equal (correct?). / CS0 picks correctly. / CS0 picks 23 correctly. / CS0 says all edges are equal. / CS0 picks correctly.
Note: If we delete ALL edges with zero Common Siblings, only 3Cliques survive.
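The CS0 test above can be sketched with ordinary integers standing in for pTrees. This is only an illustration under stated assumptions: the function names and the example graph are hypothetical, and a Python int bitmap is a stand-in for a real (compressed, multi-level) pTree.

```python
def s1p_bitmaps(n, edges):
    """Bit v of E[h] is 1 iff {h, v} is an edge (vertices are labelled 1..n)."""
    E = [0] * (n + 1)
    for h, k in edges:
        E[h] |= 1 << k
        E[k] |= 1 << h
    return E


def common_sibling_counts(n, edges):
    """CS(h,k) = E(h) & E(k); return the 1-count of that AND for every edge."""
    E = s1p_bitmaps(n, edges)
    return {(h, k): bin(E[h] & E[k]).count("1") for h, k in edges}


def cs0_edges(n, edges):
    """Edges whose endpoints share no neighbour: the CS0 deletion candidates."""
    counts = common_sibling_counts(n, edges)
    return [e for e in edges if counts[e] == 0]


# Hypothetical graph: triangles {1,2,3} and {4,5,6} joined by the bridge (3,4).
example_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(cs0_edges(6, example_edges))  # [(3, 4)]
```

On this toy graph the bridge is the only edge with an empty AND, so it is the only CS0 candidate.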
2
Divisive Graph Clustering
[Figure: G7 with its S1P pTrees and the table of S1P pairwise AND counts.]
CS0 on G7: Delete every edge with zero Common Siblings. CS0 says to delete the edges with zero counts in the pairwise AND table:
(1,12), because 12 is only connected to 1.
(1,32) correct. (2,31) correct. (3,10) correct. (3,28) correct. (3,29) correct. (14,34) correct. (20,34) correct.
(10,34), because 10 is now only connected to 34.
(24,26) and (25,28) are incorrect. But recall that 24 and 28 are ambiguous with respect to cluster.
[vertex list lost in extraction] and 23, with 33 and 34, get deleted because they have only 33 and 34 as neighbors, and 33 and 34 are not neighbors (i.e., they are friends with two enemies). They should not be deleted! Solution?
We can solve the “delete if only connected to 1 point” problem by checking the neighbor count.
The first round goes a long way toward splitting white-blue from green-yellow.
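The neighbor-count check suggested above could look like the following sketch. It is an assumption-laden illustration: `cs0_round` and the toy graph are hypothetical, only singleton isolation is guarded against (not doubletons), and the result depends on the order in which edges are scanned because degrees are updated as edges are deleted.

```python
def cs0_round(n, edges):
    """One CS0 pass that skips any deletion leaving an endpoint with no neighbours."""
    E = [0] * (n + 1)
    for h, k in edges:
        E[h] |= 1 << k
        E[k] |= 1 << h
    degree = [bin(bitmap).count("1") for bitmap in E]
    kept, deleted = [], []
    for h, k in edges:
        zero_cs = (E[h] & E[k]) == 0            # counts taken at the start of the round
        would_isolate = degree[h] == 1 or degree[k] == 1
        if zero_cs and not would_isolate:
            deleted.append((h, k))
            degree[h] -= 1                      # track remaining neighbours as we delete
            degree[k] -= 1
        else:
            kept.append((h, k))
    return kept, deleted


# Hypothetical graph: two triangles, a bridge (3,4) and a pendant vertex 7 hanging off 1.
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (1, 7)]
kept_edges, deleted_edges = cs0_round(7, toy_edges)
print(deleted_edges)  # [(3, 4)]
```

The pendant edge (1,7) also has zero common siblings, but it survives because deleting it would isolate vertex 7, which mirrors the (1,12) case on G7.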
3
Divisive Graph Clustering: What can be combined with CS0?
CS0-CC0 on G7: Delete all edges with Common Siblings = 0 (CS0) and all edges with Common Cousins = 0 (CC0), unless the deletion results in an isolated singleton or doubleton (so keep 1,12).
CC0: Delete edge(s) with zero Common 1st Cousins, CC(h,k) ≡ S2P(h) & S2P(k).
[Figure: vertices h and k with neighbors a, b, c, d, e, f, g, i, j; S2P(h) = blue and orange, S2P(k) = red and green.]
Del CC0: (1,5) (1,6) (1,11)
Del CS0: 1,32 2,31 3,10 3,28 3,29 14,34 20,34 15,33 16,33 19,33 21,33 23,33 24,26 25,28
This is CS0-CC0.
So do the one-time SiblingANDs (S1P(h) & S1P(k)) and CousinANDs (S2P(h) & S2P(k)). Then, in one pass reading the counts, CS0-CC0 deletes 12 edges (whereas Girvan-Newman makes 1 pass per edge deletion and recalculates on each new pass).
Next we could delete more edges with our current counts, or recalculate the counts and redo CS0-CC0.
Use DelThresh=1 on Siblings (recalculating nothing). Delete additionally: 1,9 1,13 1,18 1,20 1,22 3,33 6,11 6,17 9,34 24,28 24,33 25,26 27,30 29, ,33 31, ,34 (but not 2,18 2,20 2,22 4,13 5,7 5,11 7,17 25,32 26,32 27,34 28,34 29,34, due to the DONOT ISOLATE rule). This is CS1-CC0.
Use DelThresh=1 on Cousins: del 1,4 (but not 7, , , , ,34 27,34, due to the DONOT ISOLATE rule). This is CS1-CC1.
Likely, in the next round (after recalculating CS and CC), 1,7 and 3,9 will delete. Note: { } has already separated as a component. Then the other clusters would be: { } TheGreens TheYellows
[Figure: S1P pairwise ANDs; S2P pairwise AND counts (S2P-AND-OP-1, S2P-AND-OP-2).]
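As a rough sketch of the one-time SiblingANDs and CousinANDs, the code below builds S1P and S2P bitmaps and returns both counts per edge. It assumes S2P(h) means the vertices at shortest-path distance exactly 2 from h; the function names and the small graph are hypothetical.

```python
def popcount(bitmap):
    return bin(bitmap).count("1")


def s1p_s2p(n, edges):
    """S1P[h]: neighbours of h. S2P[h]: vertices at distance exactly 2 from h."""
    S1 = [0] * (n + 1)
    for h, k in edges:
        S1[h] |= 1 << k
        S1[k] |= 1 << h
    S2 = [0] * (n + 1)
    for h in range(1, n + 1):
        reach = 0
        for v in range(1, n + 1):
            if S1[h] >> v & 1:
                reach |= S1[v]                     # everything one hop beyond a neighbour
        S2[h] = reach & ~S1[h] & ~(1 << h)         # drop h itself and its direct neighbours
    return S1, S2


def sibling_and_cousin_counts(n, edges):
    """Return {(h,k): (|S1P(h)&S1P(k)|, |S2P(h)&S2P(k)|)} for every edge."""
    S1, S2 = s1p_s2p(n, edges)
    return {(h, k): (popcount(S1[h] & S1[k]), popcount(S2[h] & S2[k]))
            for h, k in edges}


# Hypothetical graph: triangle {1,2,3} with pendant vertices 4 (on 1) and 5 (on 2).
small_edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 5)]
cs_cc = sibling_and_cousin_counts(5, small_edges)
print(cs_cc[(1, 2)], cs_cc[(1, 3)], cs_cc[(1, 4)])  # (1, 0) (1, 1) (0, 0)
```

Both count tables are computed once; any CS/CC threshold combination can then be read off in a single pass over `cs_cc`.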
4
Divisive Graph Clustering: CS2-CC0
Delete edges with CommonCousins = 0 and edges with CommonSiblings ≤ 2, unless a singleton/doubleton would be isolated.
Del CC0: (1,5) (1,6) (1,11); (1,12) is saved by the DNI rule.
Del CS2: 1:5,6,7,9,11,12,13,18,20,22,32 2:18,20,22,31 3:9,10,28,29, : :7, :11, : : : : :33, :33, :33, : :33, :33, :26,28, :26,28, : :30, : : : : :34
We get Yellow, Green(-20), {20, 24, 28, 29, 10, 15, 16, 19, 21, 23, 27, 30, 34} and {9, 31, 33, 25, 26, 32}. So again Black and Blue are confused, but Yellow and Green are almost perfect.
At this point we have looked at several threshold combinations for siblings and cousins. I think CS0-CC0 followed by a recalculation and then a reapplication of CS0-CC0 might be best.
[Figure: S1P pairwise ANDs; S2P pairwise AND counts (S2P-AND-OP-1, S2P-AND-OP-2).]
5
Divisive Graph Clustering: CS0 and DONOT ISOLATE
Note: CS=0 deletion (CS0) ensures that we never break up a clique! Why? Every k-clique is made up of COMB(k,3) 3cliques, and we never break up a 3clique {g,f,k} because we never delete its edge gf: S1P(g) contains k,f and S1P(f) contains g,k, so CS(g,f) = S1P(g) & S1P(f) contains k and therefore CS(g,f) ≠ 0.
To ensure we never break up cliques, for Round 1 we use “CS0 with the DONOT ISOLATE rule”, since it is quick and has this nice clique preservation guarantee.
Common Siblings and 3cliques Theorem: An edge (h,k) has no Common Siblings (i.e., CS(h,k) ≡ S(h) ∩ S(k) = ∅, iff E(h) & E(k) is pure0) iff that edge is not involved in any 3clique.
The proof is very simple: an edge (g,f) has a common sibling, k, iff (g,f,k) is a 3clique.
Thus, removing all edges with zero Common Siblings leaves only 3cliques (of course, if the DONOT ISOLATE rule is in place, it also leaves isolates).
Thus, instead of turning to CommonCousins (as we do on the next slide), maybe we ought to select pertinent 3cliques to break as a next step (which we do 2 slides ahead)?
[Figure: G7 with a 3clique on vertices k, f, g; S1P pairwise ANDs.]
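The theorem can be checked by brute force on a small graph: the edges with a non-empty sibling AND should be exactly the edges that lie in some 3clique. A minimal sketch, with hypothetical function names and a made-up example graph:

```python
from itertools import combinations


def edges_with_common_sibling(n, edges):
    """Edges (h,k) for which E(h) & E(k) is not pure0."""
    E = [0] * (n + 1)
    for h, k in edges:
        E[h] |= 1 << k
        E[k] |= 1 << h
    return {frozenset((h, k)) for h, k in edges if E[h] & E[k]}


def edges_in_some_3clique(n, edges):
    """Edges found by explicitly enumerating every vertex triple."""
    edge_set = {frozenset(e) for e in edges}
    found = set()
    for a, b, c in combinations(range(1, n + 1), 3):
        sides = [frozenset(p) for p in ((a, b), (a, c), (b, c))]
        if all(side in edge_set for side in sides):
            found.update(sides)
    return found


# Hypothetical graph: two triangles, a bridge (3,4) and a pendant edge (1,7).
thm_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (1, 7)]
print(edges_with_common_sibling(7, thm_edges) == edges_in_some_3clique(7, thm_edges))  # True
```

The AND-based test needs one bitmap operation per edge, while the enumeration is cubic in the number of vertices; the two agree, which is the content of the theorem.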
6
Divisive Graph Clustering: CS0 and DONOT ISOLATE with CC=0 for round 2
[Figure: G7 after round 1, E(rd2), with S1Prd2 pairwise ANDs.]
In this 2nd round there will be no CS=0 deletions (since we have nothing but singletons and 3cliques), so we could look at CC=0. If we do, we would delete 1&5, 1&6, 1&11, 1&9.
4. Then, during the next round of CS=0 deletions, 1&7 and 5&7 will have no common siblings and will delete.
Note: In 3. we break 4 3cliques! Should we? Also, the remaining white-green connections form 3cliques. Should they be broken? If k-cliques (k≥3) are not to be preserved, what kind of communities are we going to end up with? By what measure is Fortunato’s white, blue, green, yellow partition considered a good one? (Certainly not by any measure which values cliques.)
7
Divisive Graph Clustering: CS0 and DONOT ISOLATE with 3CLIQUE deletion for round 2
[Figure: G7 after round 1, E(rd2), with S1Prd2 pairwise ANDs and a 4-vertex example on m, f, g, k.]
Common Siblings and 3cliques Thm: An edge (h,k) has no Common Siblings (i.e., CS(h,k) ≡ S(h) ∩ S(k) = ∅, iff E(h) & E(k) is pure0) iff that edge is not involved in a 3clique.
Proof: An edge (g,f) has a common sibling, k, iff (g,f,k) is a 3clique. Thus, removing all edges with no Common Siblings leaves only 3cliques (of course, if the DONOT ISOLATE rule is in place, it also leaves the isolates).
1. Keep a list of vertices with 1 or 2 remaining siblings (the edges they participate in are DO NOT DELETE): 5 11 6 7 31 9
2. Take the S1Prd2 pairwise AND (of the two vertices of an edge). If count=2 and the two common siblings do not form an edge themselves (and thus the 4 vertices form a 4-vertex 1plex = two 3cliques with a common edge, namely the original pair), delete the edge of that original pair. If count=1, delete the edge of that original pair.
CS(1,5)={7,11} not an edge, so delete 1,5
CS(1,11)={5,6} not an edge, so delete 1,11
CS(9,31)={33,34} not an edge, so delete 9,31
CS(1,6)={7,11} not an edge, so delete 1,6
CS(3,9)={1,33} not an edge, so delete 3,9
CS(24,30)={33,34} not an edge, so delete 24,30
CS(1,7)={5,6} not an edge, so delete 1,7
CS(3,33)={9}, so delete 3,33
CS(30,34)={27}, so delete 30,34
CS(1,9)={3}, so delete 1,9
CS(6,7)={17}, so delete 6,7
CS(32,34)={29}, so delete 32,34
That ends Round-2. If we did a Round-3 of CS=0 again, (29,34) would delete since it has no common siblings. The result is very, very close to GN!
8
Divisive Graph Clustering: CS0 with DONOT ISOLATE rule on G10
CS0 alone separates all 8 colored communities. It may also delete other edges.
G10: Web graph of the pages of a website and their hyperlinks. Communities by color (Girvan-Newman algorithm). |V|=180 (labelled 1 to i0) and |E|=266. Vertices with OutDeg=0 (leaves) do not have pTrees shown, because pTrees display only OutEdges and thus those with OD=0 have a pure0 pTree.
[Figure: G10 with vertices labelled 1 to 99, a0 to i0.]
9
Divisive Graph Clustering: Shortest Path Partic >=50% (on G7)
[Figure: G7 with edge pTree E, SkP for k=2,3,4 for vertices 1, 2, 3, 33, ..., and the resulting SPPC table.]
Count only Shortest Path Participations emanating from vertices with S1P-counts ≥ 50% of the maxS1Pcount=16 (i.e., ≥ 8). This specifies the starting vertices.
10
Divisive Graph Clustering: Shortest Path Partic >=75% (on G7)
[Figure: G7 with edge pTree E and the SkP pTrees for vertices 1 and 34.]
Count only Shortest Path Participations emanating from vertices with S1P-counts ≥ 75% of the maxS1Pcount=16 (i.e., ≥ 12). This specifies starting vertices of 1 and 34 only.
11
Connectivity Graph Clustering using Shortest Paths (on G5)
[Figure: G5 (vertices 1 to 8) with E, SP2, SP3, SP4 and their counts, and the SPPC (Shortest Path Participation Counts) table.]
SP gives the connectivity component partition: CC(1)={1,2,4,5,7} is a 5plex since EdgeCt=5=COMBO(5,2)-5. CC(3)={3,6,8} is a 0plex since EdgeCt=3=COMBO(3,2)-0.
Delete (1,2) and {3,6,8} and do over.
SP gives the connectivity component partition: CC(1)={1,5,7} is a 0plex since EdgeCt=3=COMBO(3,2)-0. CC(2)={2,4} is a 0plex since EdgeCt=1=COMBO(2,2)-0.
12
Connectivity Graph Clustering using Shortest Paths (on G6)
[Figure: G6 (vertices 1 to 9, a to g) with E, SP2 through SP6, and the SPPC (Shortest Path Participation Counts) table, repeated for each round.]
Round 1. SP gives the connectivity component partition: CC(1)={ } is a 20plex since EdgeCt=8=COMBO(8,2)-20. CC(9)={9 a b c} is a 3plex since EdgeCt=3=COMBO(4,2)-3. CC(d)={d f g} is a 0plex since EdgeCt=3=COMBO(3,2). CC(e)={e}.
Delete (1,3) (SPPC=16, the max) and delete {d f g}, {e} and do over. Also delete {9 a b c} as a 4VertexHubSpoke 3plex.
Round 2 (SP3 all pure0). SP gives the connectivity components: CC(1)={ } is a 2plex since EdgeCt=4=COMBO(4,2)-2. CC(2)={ } is a 3plex since EdgeCt=3=COMBO(4,2)-3 (a 4VertexHubSpoke).
Delete { } (4VertexHubSpoke 3plex) and (1,6).
Round 3 (SP2 all pure0). SP gives the connectivity components: CC(1)={1}, CC(5)={5 6 7} is a 0plex since EdgeCt=3=COMBO(3,2)-0. Done!
13
Agglomerative weighted SPk Clustering
1 and 34 are centers. Then, among their individual neighbors, select their communities with a threshold on the weighted sum (≥ -20), giving light green “1comm” and black “34comm” (overlapping). Next, excise and iterate. Then do a k-means reshuffle to improve?
Rule: If (WtSum >= -20 & Nbr(1)) then 1 else 0.
[Tables: per-vertex WeightSum and Nbrs masks, with weights 2, -1, -1, -1, -1 for SP1 through SP5.]
Using weights of 0,1,2,4,6 for SP1,2,3,4,5 respectively. [Table: WeightSum, SP1|2(17).]
Iterate again on the remaining vertices, using weights of 5,5,1,1,0 for SP1,2,3,4,5 respectively. [Table: WeightSum, SP1|2(8), SP1|2(33).]
This method uses site betweenness, not edge betweenness.
[Figure: G7 with E and SP1 through SP5.]
14
Basic SP pTree Construction on G1
Edge (E), Path (PT), ShortestPathv (SPT), AcyclicPath (APT) pTrees and CycleList (CL) of G1
[Figure: G1 (vertices 1 to 4) with key tables E1key, E2key, E3key and the pTrees PE1, PE2, PE3 (predicate is NotPureZero); EG1 as a one-level pTree and as a 2-level pTree with stride=|V|=4; PTG1 (extension of EG1), APTG1, SPTG1, CLG1 and the SPSFk masks.]
First, construct the stride=|V|, 2-level Edge pTree; all others are constructed concurrently from it.
CLG1: 1341 1431 3413 3143 4134 4314. All are 3-hop cycles. Each has 3 start points and 2 directions, so each repeats 6 times: 6/6 = 1 3-hop cycle (1341).
SPTG1: initially E1=SP1,1, E2=SP2,1, E3=SP3,1, E4=SP4,1.
SPT is completed. For Big Graphs, we could stop here (e.g., Friends has ~1B vertices but a diameter of 4, so we would only need to build PT 4-hop paths), possibly expressed as a tree of lists rather than a tree of bitmaps. For sparse BigGraphs, E could be leveled further and/or be a tree of lists (then APT and SPT will be also).
SPT(G)k (with k turned on) is a mask (>0 is “yes”) for the connectivity component COMP(G)k containing vk. For a bitmap of COMPk, bitslice SPT (SPTk,h..SPTk,0, k=1..|V|); then COMPk ≡ OR(j=h..0) SPTk,j.
The SPT structure may also be useful as separate “categorical” bitmaps, one per Shortest Path Length (SPk,h, h=1..H). Also keep a mask of the Shortest Paths So Far, SPSFk, for each vertex k. With each new SP bitmap, SPB: SPk,h+1 ← SPB & SPSFk’ (using SPSFk before it is updated), then SPSFk ← SPSFk | SPB.
To extend E to PT:
∀k ∈ ListEh: PT2hk = Ek after zeroing the h bit of Ek.
∀k ∈ ListPT2hj: PT3hjk = Ek after zeroing the j bit of Ek.
∀k ∈ ListPT3hij: PT4hijk = Ek after zeroing the i and j bits of Ek.
E, PT, SPT, APT of a graph as predicate Trees on E(MaxPathLength):
EG1: E 1-level, pred=NPZ; E 2-level, stride=4, pred=NPZ.
PTG1: E3 pred = (NPZ) | (PZ & AcyclicPathEnd).
APTG1: E3 predicate = (NPZ & NotCycleEnd) | (PZ & AcyclicPathEnd).
[Figure: SP1,1 SP2,1 SP3,1 SP4,1, SP1,2 SP2,2 (e.g., SPVertex=3, Len=2) and SP1,1|2 SP2,1|2 SP3,1|2 SP4,1|2.]
SPT gives the Connectivity Component Partition and Maximal Cliques (go across SPk,1, then look within subsets of those k’s for commonality). Note: Cliques are 0-plexes. Each mask SPk,1 masks a 1-plex. Each SPk,1&SPk,2 masks a 2-plex (which is SPSFk,2? So if we save each SPSF instead of overwriting it, we have k-plex masks without further work?), etc.
Next, construct predicates for each of the Path related data structures, PT, APT, SPT, SPSF, to make them into pTrees on a k-path table, E, E2, E3, …
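The SPSF bookkeeping described above can be sketched with integer bitmaps. This is an illustration only: it assumes the new layer is the freshly reached vertices ANDed with the complement of SPSF (taken before SPSF is updated), the names `sp_layers` and `bits` are hypothetical, and the 4-vertex graph is made up rather than read from the slide.

```python
def bits(mask):
    """List the vertex numbers whose bit is set in mask."""
    return [v for v in range(mask.bit_length()) if mask >> v & 1]


def sp_layers(n, edges, k):
    """Return ([SP_{k,1}, SP_{k,2}, ...], SPSF_k) for start vertex k."""
    E = [0] * (n + 1)
    for a, b in edges:
        E[a] |= 1 << b
        E[b] |= 1 << a
    spsf = (1 << k) | E[k]            # shortest paths so far; k itself is turned on
    layers = [E[k]]                   # SP_{k,1} = E_k
    while layers[-1]:
        spb = 0
        for v in range(1, n + 1):
            if layers[-1] >> v & 1:
                spb |= E[v]           # one more hop from the previous layer
        new_layer = spb & ~spsf       # keep only vertices not reached earlier
        spsf |= new_layer
        layers.append(new_layer)
    layers.pop()                      # drop the trailing empty layer
    return layers, spsf


# Hypothetical graph: triangle {1,3,4} plus the edge (2,4).
demo_edges = [(1, 3), (1, 4), (3, 4), (2, 4)]
demo_layers, demo_spsf = sp_layers(4, demo_edges, 1)
print([bits(layer) for layer in demo_layers], bits(demo_spsf))  # [[3, 4], [2]] [1, 2, 3, 4]
```

The final SPSF mask doubles as the connectivity component of the start vertex, which is how the text uses SPT to obtain COMPk.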
15
Edge LookUp Clique Mining Algorithm on G2, G3, G4
[Figures: G2 and G3 (7 vertices each) and G4 (8 vertices), each with its key table, edge pTrees E and PE, and the masks EU, C, CU.]
k=2: the 2Cliques (2 vertices) are the edges. Find the endpoints of each edge from its raster position n: (Int((n-1)/7)+1, Mod(n-1,7)+1).
Using the EdgeCount thm on C={1,2,3,4}: CU = C & EU. C is a clique since ct(CU) = comb(4,2) = 4!/(2!2!) = 6.
Edge lookups for the 7-vertex example (in slide order): PE(1,4)=1, 134∈CS3; PE(2,3)=1, 234∈CS3; PE(2,3)=1, so 123∈CS3; PE(2,4)=1, 124∈CS3; PE(2,6)=0; PE(6,7)=1, 567∈CS3; PE(1,7)=0; PE(1,5)=0.
k=4: PE(2,4)=1, 1234∈CS4. 1234 is the only 4-clique (obtained from 3cliques such as 123, 134, 234).
EC requires counting 1’s in the mask pTree of each Subgraph (or candidate Clique, if we take the time to generate the CCSs; but then clearly the fastest way to finish up is simply to look up the single bit position in E, i.e., use EC).
EdgeCount Algorithm (EC): if |PUC| = (k+1)!/((k-1)!2!) then the candidate C is in CS(k+1).
The SG algorithm only needs the Edge Mask pTree, E, and a fast way to find those pairs of subgraphs in CSk that share k-1 vertices (then check E to see whether the two different kth vertices are an edge in G). Again, this is a standard part of the Apriori ARM algorithm and has therefore been optimized and engineered ad infinitum!
Edge lookups on G4 (in slide order):
k=3: PE(2,3)=1, 123∈CS3; PE(2,4)=1, 124∈CS3; PE(2,8)=1, 128∈CS3; PE(2,6)=0; PE(3,8)=1, 138∈CS3; PE(4,8)=1, 148∈CS3; PE(1,5)=0; PE(1,7)=0; PE(6,8)=0; PE(3,8)=1, 238∈CS3; PE(2,3)=1, 234∈CS3; PE(1,4)=1, 134∈CS3; PE(4,8)=1, 248∈CS3; PE(4,8)=1, 348∈CS3; PE(6,7)=1, 567∈CS3.
k=4: PE(3,8)=1, 1238∈CS4; PE(4,8)=1, 1248∈CS4; PE(3,8)=1, 1348∈CS4; PE(2,4)=1, 1234∈CS4; PE(4,8)=1, 2348∈CS4.
k=5: PE(4,8)=1, 12348∈CS5. = CS5.
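The edge-lookup idea (join two k-cliques that share k-1 vertices, then look up a single edge bit) can be sketched as an Apriori-style level-wise search, next to the EdgeCount check. The sketch uses a plain Python set in place of the edge pTree; the names `mine_cliques` and `is_clique_by_edge_count` and the example graph are hypothetical, not taken from the slide.

```python
from itertools import combinations
from math import comb


def mine_cliques(edges):
    """Level-wise clique mining: levels[k] lists all k-cliques as sorted tuples."""
    edge_set = {frozenset(e) for e in edges}
    level = {tuple(sorted(e)) for e in edges}          # CS2 is the edge set
    levels, k = {}, 2
    while level:
        levels[k] = sorted(level)
        nxt = set()
        for a, b in combinations(sorted(level), 2):
            # a and b share their first k-1 vertices; one edge lookup decides the join
            if a[:-1] == b[:-1] and frozenset((a[-1], b[-1])) in edge_set:
                nxt.add(a + (b[-1],))
        level, k = nxt, k + 1
    return levels


def is_clique_by_edge_count(candidate, edges):
    """EdgeCount check: C is a clique iff it contains comb(|C|, 2) edges."""
    edge_set = {frozenset(e) for e in edges}
    inside = sum(frozenset(p) in edge_set for p in combinations(sorted(candidate), 2))
    return inside == comb(len(candidate), 2)


# Hypothetical graph: 4-clique {1,2,3,4}, triangle {5,6,7} and the edge (4,5).
clique_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
                (5, 6), (5, 7), (6, 7), (4, 5)]
clique_levels = mine_cliques(clique_edges)
print(clique_levels[3])  # [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (5, 6, 7)]
print(clique_levels[4])  # [(1, 2, 3, 4)]
```

Each join costs one membership test, whereas the EdgeCount check recounts all internal edges of the candidate, which is the trade-off the text points at.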
16
Basic SP pTree Construction on G5 and a basic Clique, kplex and kcore algorithm
The EdgepTree (E), PathTree (PT), ShortestPathvTree (SPT), AcyclicPathTree (APT) and CycleList (CL) of a graph, G5
[Figure: G5 (vertices 1 to 8) with EG5 (2-level, stride=8), PTG5, APTG5 and SPTG5.]
CLG5: 1571 1751 3683 3863 5175 5715 6386 6836 7157 7517 8368 8638
PT Clique Miner Algorithm: a clique is all cycles. Extend to a k-plex (k-core) mining algorithm? PT (=APT+CL) and SPT are powerful datamining tools with closure properties (to eliminate branches).
Max clique mining: A kCycle is a kClique iff it is found in CLk as PERM(k-1,k-1)/2 = (k-1)!/2 kCycles (e.g., vertices are repeated in CL for 3cycles 2!/2=1 time; 4cycles, 3!/2=3; 5cycles, 4!/2=12; 6cycles, 5!/2=60).
Downward closure: Once a 4cycle is established as a 4clique (by the fact that {1,2,3,4} occurs 3!/2=3 times in CL), all 3-vertex subsets are 3cliques, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, so there is no need to check further.
Open questions: a k-plex (missing k edges) mining algorithm? A k-core (has k edges) mining algorithm? A density (internal edge density >> external|avg) mining algorithm? A degree (internal vertex degree >> external|avg) mining algorithm?
DiameterG5 is max{Diameterk} = max{2,2,1,3,2,1,3,1} = 3.
The connected component containing V1 is COMP1={1,2,4,5,7}. Pick the 1st vertex not in COMP1, which is 3: COMP3={3,6,8}. Done. The partition is { {1,2,4,5,7}, {3,6,8} }. To pick the first vertex not in COMP1, mask off COMP1 with SPTv1’ and then pick the first vertex in this complement.
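The component partition and diameter computation can be sketched as below. The G5 edge list used here is an assumption reconstructed from the component and edge-count statements on the slides (5 edges in {1,2,4,5,7}, a triangle on {3,6,8}); the function names are hypothetical.

```python
def distance_layers(E, n, k):
    """Bitmaps of the vertices at distance 1, 2, ... from k (empty list if k is isolated)."""
    seen, frontier, layers = 1 << k, E[k], []
    while frontier:
        layers.append(frontier)
        seen |= frontier
        reach = 0
        for v in range(1, n + 1):
            if frontier >> v & 1:
                reach |= E[v]
        frontier = reach & ~seen          # only vertices not reached by a shorter path
    return layers


def components_and_diameter(n, edges):
    E = [0] * (n + 1)
    for a, b in edges:
        E[a] |= 1 << b
        E[b] |= 1 << a
    covered, components, diameter = 0, [], 0
    for k in range(1, n + 1):
        layers = distance_layers(E, n, k)
        diameter = max(diameter, len(layers))      # number of layers = eccentricity of k
        if not covered >> k & 1:                   # first vertex outside earlier components
            comp = 1 << k
            for layer in layers:
                comp |= layer
            covered |= comp
            components.append([v for v in range(1, n + 1) if comp >> v & 1])
    return components, diameter


# Assumed edge list for G5 (reconstructed, not copied from a clean source).
g5_edges = [(1, 2), (1, 5), (1, 7), (2, 4), (5, 7), (3, 6), (3, 8), (6, 8)]
g5_components, g5_diameter = components_and_diameter(8, g5_edges)
print(g5_components, g5_diameter)  # [[1, 2, 4, 5, 7], [3, 6, 8]] 3
```

With this edge list the sketch reproduces the partition { {1,2,4,5,7}, {3,6,8} } and the diameter of 3 stated above.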
17
Basic SP pTree Construction on G6
[Figure: G6 (vertices 1 to 9, a to g) with E = A1Ps, the acyclic path lists A2Ps through A6Ps (cycles in blue, not in APT), the shortest path pTrees SP1 through SP6, and the cumulative masks SP1&2, SP1&2&3, SP1&2&3&4 and SP1&2&3&4&5 (COMPLETE).]
18
All Shortest Path pTrees for a unipartite undirected graph, G7 (SP1, SP2, SP3, SP4, SP5)
[Figure: G7 with SP1 (=1deg), SP2 (=2deg), SP3 (=3deg), SP4 (=4deg) and SP5 (=5deg). In SP2, vertices 10, 25, 26, 28, 29, 33, 34 are not shown. In SP4, vertices 15, 16, 19, 21, 23, 24, 27, 30 have only 17 turned on. In SP5, only 17 and 8 appear.]
19
Trying Hamming Similarity to detect communities on G7 and G8
Zachary's karate club, a standard benchmark in community detection (best partition found by optimizing the modularity of Newman and Girvan).
[Figure: G7 and G8 with vertices colored by degree (=1deg through =5deg) and their Deg tables.]
Hamming similarity: S(S1,S2) = DegkDif(S1,S2)
To produce an [all?] actual shortest path[s] between x and y:
Thm: To produce a [all?] S2P[s], take a [all?] middle vertex[es], x1, from SP1x & SP1y, and produce x x1 y. To produce a [all?] S3P[s], take a [all?] vertex[es], x1, from SP1x and a [all?] vertex[es], x2, from S2P(x1,y), and produce x x1 x2 y; etc.
Is it productive to actually produce (one time) a tree of [all?] shortest paths? I think it is not!
One can see that this works poorly at 1. Not working! On the other hand, our standard community mining techniques (for kplexes) worked well on G7. Next slide, let’s try Hamming on G8.
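The theorem about producing an actual shortest path can be sketched by peeling distance layers back from y: at each step the next vertex is taken from the neighbors of the current vertex ANDed with the layer one step closer to y. This returns a single path (the lowest-numbered choice at each step); the function name and the example graph are hypothetical.

```python
def one_shortest_path(n, edges, x, y):
    """Return one shortest x..y path as a vertex list, or None if y is unreachable."""
    E = [0] * (n + 1)
    for a, b in edges:
        E[a] |= 1 << b
        E[b] |= 1 << a
    layers, seen = [1 << y], 1 << y           # layers[h]: vertices at distance h from y
    while not (layers[-1] >> x & 1):
        reach = 0
        for v in range(1, n + 1):
            if layers[-1] >> v & 1:
                reach |= E[v]
        nxt = reach & ~seen
        if nxt == 0:
            return None                       # x and y lie in different components
        seen |= nxt
        layers.append(nxt)
    path = [x]
    for h in range(len(layers) - 2, -1, -1):
        candidates = E[path[-1]] & layers[h]  # neighbours that are one step closer to y
        path.append((candidates & -candidates).bit_length() - 1)  # lowest set bit
    return path


# Hypothetical graph: triangle {1,3,4}, the edge (2,4) and an isolated vertex 5.
path_edges = [(1, 3), (1, 4), (3, 4), (2, 4)]
print(one_shortest_path(5, path_edges, 1, 2))  # [1, 4, 2]
print(one_shortest_path(5, path_edges, 1, 5))  # None
```

Collecting every set bit of `candidates` instead of only the lowest one would enumerate all shortest paths, which is the "[all?]" variant the slide asks about.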
20
Agglomerative clustering of S2P using Hamming Similarity on G9
[Figure: G9, a bipartite graph of 18 Women (W) and 14 Events (E), with the pTrees ESP and WSP.]
In ESP2, using Hamming similarity, we get three Event clusters, clustering events iff their pTrees are [Hamming] identical: EventCluster1={1,2,3,4,5} EventCluster2={6,7,8,9} EventCluster3={10,11,12,13,14}
The Degree % of affiliation of Women with the R, G, B events is (R entries marked "lost" did not survive extraction):
W     R      G      B
1     100%   75%    0%
2     lost   75%    0%
3     lost   100%   0%
4     lost   75%    0%
5     60%    25%    0%
6     lost   50%    0%
7     lost   75%    0%
8     lost   75%    0%
9     lost   75%    0%
10    lost   75%    20%
11    0%     50%    40%
12    0%     50%    80%
13    0%     75%    80%
14    0%     75%    100%
15    0%     50%    60%
16    0%     50%    0%
17    0%     25%    20%
18    0%     25%    20%
ESP3=ESP1’ and ESP4=ESP2’, so again, in this case, all information is already available in ESP1 and ESP2 (all shortest paths are of length 1 or 2). We don’t need ESPk for k>2.
WSP3=WSP1’ and WSP4=WSP2’, so, in this case, all information is already available in WSP1 and WSP2 (all shortest paths are of length 1 or 2). We don’t need WSPk for k>2.
Clustering Women using Degree% RGB affiliation: WomenClusterR={1,2,4,5} WomenClusterG={3,6,7,8,9,10,11,16,17,18} WomenClusterB={12,13,14,15}
This clustering seems fairly close to the authors’. Other methods are possible, and if another method puts event 6 with 12345, then everything changes and the result seems even closer to the authors’ intent.
21
K-plex search on G9 (a k-plex is a SG missing k edges)
If H is a k-plex and F is an ISG of H, then F is a kplex. A graph (V,E) is a k-plex iff |V|(|V|-1)/2 – |E| ≤ k.
[Figure: G9 with ESP2 (Events 1 to 14) and WSP2 (Women 1 to 18), degrees written in base-19 digits (a=10, ..., h=17).]
Search on ESP2 (event labels partly lost in extraction):
Events abcde: 14*13/2=91, degs=88888dddd88888, |Edges|=66, kplex k≥25
Events abcde: not calculating k until it gets lower, degs= 7777cccc
Events abcde: 14*13/2=91, degs= 666bbbb88888, |Edges|=66, kpl
Events456789abcde: 14*13/2=91, degs= 55aaaa88888, |Edges|=66, kplex k≥25
Events56789abcde: 14*13/2=91, degs= , |Edges|=66, kplex k≥25
Events6789abcde: *8/2= , degs= , |Edges|=36, kplex k≥0. A 9Clique!
So take out {6789abcde} and start over.
Events : *4/2=10, |Edges|=10, kplex k≥0, degs: 44444. A 5clique!
If we had used the full algorithm, which pursues each minimum degree tie path, one of them would start by eliminating 14 instead of 1. That would result in the 9Clique and the 5Clique abcde. All the other 8 ties would result in one of these two situations. How can we know that ahead of time and avoid all those unproductive minimum degree tie paths?
Search on WSP2 (women labels partly lost in extraction):
Women abcdefghi: 18*17/2=153, degs=hfhfbffghhgghhhgcc, |Edges|=139, kplex k≥14
Women abcdefgh: 18*17/2=153, degs=gfgfbfffggffgggfc, |Edges|=139, kplex k≥14
Women abcdefg: 18*17/2=153, degs=ffffbffeffeefffe, |Edges|=139, kplex k≥14
Women abcdefg: 15*14/2=105, degs=eeeeeeeeeeeeeee, |Edges|= , kplex k≥0. A 15Clique.
So take out { abcdefg} and start over.
Women5hi: 3*2/2=3, degs=011, |Edges|=1, kplex k≥2
Womenhi: 2*1/2=1, degs=11, |Edges|=1, kplex k≥0. Clique.
We get no information from applying our kplex search algorithm to WSP2. Again, how could we know this ahead of time to avoid all the work? Possibly by noticing the very high 1-density of the pTrees (only 28 zeros)?
Every ISG of a Clique is a Clique, so 6789 and 789 are Cliques (which seems to be the authors’ intent?).
If the goal is to find all maximal Cliques, how do we know that CA= is maximal? If it weren’t, then there would be at least one of abcde which, when added to CA= , would result in a 10Clique. Checking a: PCA&Pa would have to have count=9 (it doesn’t; it has count=5) and PCA(a) would have to be 1 (it isn’t; it’s 0). The same is true for b, c, d, e. The same type of analysis shows 6789abcde is maximal.
I think one can prove that any Clique obtained by our algorithm would be maximal (without the above expensive check), since we start with the whole vertex set and throw out one vertex at a time until we get a clique, so it has to be maximal?
The Women associated strongly with the blue EventClique, abcde, are { } and those associated but loosely are { }. The Women associated strongly with the green EventClique are { } and those associated but loosely are {6 7 9}.
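The search above (start from the whole vertex set and discard a minimum-degree vertex until no edges are missing) can be sketched as follows, together with the maximality check described in the text. Assumptions: ties are broken by lowest label, only one tie path is followed, and the function names and example graph are hypothetical.

```python
def peel_to_clique(vertices, edges):
    """Drop a minimum-degree vertex until the remaining induced subgraph is a clique."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    alive, trace = set(vertices), []
    while alive:
        deg = {v: len(adj[v] & alive) for v in alive}
        # missing = |V|(|V|-1)/2 - |E| for the induced subgraph on the live vertices
        missing = len(alive) * (len(alive) - 1) // 2 - sum(deg.values()) // 2
        trace.append((sorted(alive), missing))
        if missing == 0:
            break
        alive.remove(min(sorted(alive), key=deg.get))   # lowest label among the ties
    return sorted(alive), trace


def is_maximal_clique(clique, vertices, edges):
    """True iff no outside vertex is adjacent to every vertex of the clique."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return not any(set(clique) <= adj[v] for v in vertices if v not in clique)


# Hypothetical graph: 4-clique {1,2,3,4}, triangle {5,6,7} and the edge (4,5).
plex_vertices = [1, 2, 3, 4, 5, 6, 7]
plex_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
              (5, 6), (5, 7), (6, 7), (4, 5)]
found_clique, peel_trace = peel_to_clique(plex_vertices, plex_edges)
print(found_clique, [missing for _, missing in peel_trace])  # [1, 2, 3, 4] [11, 7, 3, 0]
```

The explicit `is_maximal_clique` call plays the role of the "expensive check" in the text; the sketch does not settle whether the peeling result is always maximal.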
22
Basic S1P pTree Construction on G10
[Figure: G10 E = SP1 as 2-level pTrees, with LevelOneStride=19 (labelled 0-i) and Level0Stride=10 (labelled 0-9), showing the OutDeg column, the tens-digit level and the units-digit level for each vertex.]
Note: SP1 should be called S1PDV, for “Shortest 1 Path Destination Vertices”, because each one, e.g. S1PDV(v1), maps all such destination vertices from that given starting vertex, v1.
23
G10 leaves (OutDegree=0):
[Figure: G10 E = SP1 as 2-level pTrees, LevelOneStride=19 (labelled 0-i), Level0Stride=10 (labelled 0-9), with the OD column and the L1 and L0 levels for each vertex.]
G10 leaves (OutDegree=0): a3 a6 a8 a9 b0 b7 b8 b9 e7 e8 e9 f0 f1 f2 f3 f4 f5 f8 f9 g0 g8 g9 h7
24
G10 E=SP1 Lists and SP2 Lists
[Table: for each vertex of G10, its E = SP1 list and its SP2 list; the column layout is not recoverable from the extraction.]
OD=0: a3 a6 a8 a9 b0 b7 b8 b9 e7 e8 e9 f0 f1 f2 f3 f4 f5 f8 f9 g0 g8 g9 h7
25
G10 Edge Matrix. [Figure: raster plot of the G10 edge matrix; only scattered marks and a few vertex numbers survive.] Raster ordering of the Edge Matrix (EM) gives the E table; cardinality(E) = 180*180 = 32,400.
26
G7 MCFC: Delete the edge(s) with the Minimum # of Common First Cousins, where CFC(h,k) = S2P(h) & S2P(k). [Figure: example graph on vertices a through k showing all paths; S2P(h) is drawn in blue and orange, S2P(k) in red and green.]
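The MCFC rule can be sketched with plain integers standing in for pTrees. This is only an illustration under stated assumptions: S2P(h) is read here as the set of vertices at shortest-path distance exactly 2 from h, and the function names and the five-vertex example graph are hypothetical, not taken from the slides.

```python
def neighbor_masks(n, edges):
    """One integer bitmask per vertex; bit v of masks[u] is 1 iff (u, v) is an edge."""
    masks = [0] * n
    for u, v in edges:
        masks[u] |= 1 << v
        masks[v] |= 1 << u
    return masks


def s2p(masks, h):
    """Assumed reading of S2P(h): vertices at shortest-path distance exactly 2 from h."""
    reach, rest, v = 0, masks[h], 0
    while rest:
        if rest & 1:
            reach |= masks[v]          # OR together the neighbor masks of h's neighbors
        rest >>= 1
        v += 1
    return reach & ~masks[h] & ~(1 << h)   # drop h itself and its direct neighbors


def cfc_counts(n, edges):
    """CFC(h,k) = 1-count of S2P(h) & S2P(k) for every edge (h, k)."""
    masks = neighbor_masks(n, edges)
    two_step = [s2p(masks, h) for h in range(n)]
    return {(u, v): bin(two_step[u] & two_step[v]).count("1") for u, v in edges}


def min_cfc_edges(n, edges):
    """Edges with the minimum number of common first cousins (deletion candidates)."""
    counts = cfc_counts(n, edges)
    lowest = min(counts.values())
    return [e for e in edges if counts[e] == lowest]


# Hypothetical example: a triangle 0-1-2 with two pendant vertices hanging off vertex 2.
example_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4)]
example_counts = cfc_counts(5, example_edges)
example_min = min_cfc_edges(5, example_edges)
```

In this example the edge (0, 1) shares both pendant vertices as first cousins, while every edge touching vertex 2 has a count of zero and would be a deletion candidate.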
27
Divisive Graph Clustering: Girvan and Newman delete edges with max “betweenness”, i.e., max participation in shortest paths (of all lengths).

Girvan and Newman (Girvan and Newman, 02; 04). Edges are deleted based on a measure of edge betweenness: 1. Computation of the edge betweenness for all edges; 2. Removal of the edge with largest betweenness: in case of ties with other edges, one is picked at random; 3. Recalculation of betweenness on the running graph; 4. Iteration of the cycle from step 2.

We look for situations where pTrees give us an advantage. Can SPPC (Shortest Path Participation Count) be constructed with pTrees more efficiently? What other measure can pTrees compute much more efficiently that can help choose the best edge to delete? Later we will try finding the edge with maximum “Fore-Aft” Shortest Path Participation Difference in S1P, S2P, S3P, ... (or some combination). pTrees should provide great advantage in the calculation of FAD(h,k). The other important question to answer is: Does it create a good clustering?

While constructing Shortest Path pTrees, SP2, ..., record the Shortest Path Participation Count (SPPC) of each edge. The edge(s) with max SPPC should be the best candidates for removal?

[Tables: E, SP2, SP3, SP and SPPC bit columns with their counts (ct), keyed by vertex pair 1,1 through 7,7; the cell values did not survive extraction.]

SP gives the connectivity component partition: CC(1)={1,2,3,4}, a 0plex since EdgeCt=12=2*COMBO(4,2); CC(5)={5,6,7}, a 1plex since EdgeCt=4=2*(COMBO(3,2)-1).

We will try FAD(h,k) = |S1P(h)&S1P(k)| / |S1P(h)|*|S1P(k)|. Or use S2P? Or both? Or S3P?

G2 (vertices 1 to 7). [Tables: E, SP2, SP3, SP=SP1 | SP2 | SP3 and SPPC columns with counts; values not recoverable.] SP gives the connectivity component partition: CC(1) = {1} ∪ List(SP(1)) = {1,2,3,4,5,6,7}, a 12plex since EdgeCt=9=COMBO(7,2)-12. Delete (1,6) and do over.
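The four Girvan and Newman steps listed on this slide can be sketched as follows. This is a minimal illustration, not the pTree method the slides are looking for: edge betweenness is computed with a standard breadth-first accumulation (Brandes style), ties are broken by taking the first maximum instead of a random pick, and the function names and the example graph (two triangles joined by one edge) are assumptions made for the sketch.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph given as {vertex: set(neighbors)}."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:                               # BFS, counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                  # accumulate from the farthest vertices back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2.0 for e, b in score.items()}  # every pair was seen from both ends


def components(adj):
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_split(adj):
    """Steps 1-4: delete max-betweenness edges, recomputing each time, until a component splits."""
    adj = {v: set(ns) for v, ns in adj.items()}
    start, removed = len(components(adj)), []
    while len(components(adj)) == start and any(adj.values()):
        bet = edge_betweenness(adj)
        edge = max(bet, key=bet.get)               # first maximum; the slide picks ties at random
        u, v = edge
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(edge)
    return removed, components(adj)


two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
betweenness = edge_betweenness(two_triangles)
removed_edges, parts = girvan_newman_split(two_triangles)
```

Recomputing betweenness from scratch after every deletion is what makes the method expensive, which is the cost the slide hopes pTrees can reduce.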
28
GN Delete max SPPC edge. Recalc SPPCs. Repeat.
Divisive Graph Clustering. [Figures and tables: example graphs G1, G1_1, G1_2, G1_3 and G1_4, each with its edge key (Ekey), E bit column, S1P/S2P/S3P pTrees, 2-path key (2Pkey) and SPPC counts; the cell values did not survive extraction.]

To construct SPPC(hk) = SPPC(kh) (Shortest Path Participation Count): if (hk)∈E, count 1 + OneCountS2P(hk) + OneCountS2P(kh) + OneCountS3P(hkg) + OneCountS3P(ghk) ∀g + OneCountS4P(hkfm) + OneCountS4P(fhkm) + OneCountS4P(fmhk) ∀f,m. Etc.

Check SPPC(34)=SPPC(43) (verify that SPs backwards from hk get counted): (34)∈E so ct=1; + CountS2P(34)=1 + CountS2P(43)=1, so ct=3; + CtS3P(34g)=0 + CtS3P(g34)=1 (g=1), so ct=4.

GN results noted beside the example graphs (which graph each note belongs to is not recoverable): GN says delete (3,4)! GN says delete any edge! GN says delete 12 | 25 | 34 | 36. GN: delete 12 | 23 | 25, not 34, 45. Not 23, 16, 45. GN edge betweenness specifies pruning (2,4).

SPPC recalculation and repeat steps? Anyone see a shortcut? Or do we just start the calculation over on the reduced graph? Do the pointers help? In S2P(hk) one has to search out S2P(kh), and in S3P(hk) one has to find all S3P(hkg) and S3P(ghk) ∀g. In the appendix I begin work on uniquely representing shortest k-paths using both a fore and an aft pTree. Consider that in G1_4, S3P(16)=2.

Notes: If any OneCount=0, no subsequence exists. It might be useful to use pointers to make this procedure easier.
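The SPPC construction above can be cross-checked with a small counting routine. This sketch assumes that SPPC(hk) is the number of shortest paths, of every length and between every pair of vertices, that contain the edge (h,k), with each path counted once regardless of direction. It uses breadth-first search in place of the S2P, S3P, ... pTrees, and the names and example graphs are hypothetical.

```python
from collections import deque


def sppc(adj):
    """Shortest Path Participation Count per edge for {vertex: set(neighbors)}."""
    total = {frozenset((u, v)): 0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, order = {s: 0}, {s: 1}, []
        queue = deque([s])
        while queue:                                # BFS from s, counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
        tails = {}                                  # shortest-path continuations starting at v
        for v in reversed(order):
            children = [w for w in adj[v] if dist[w] == dist[v] + 1]
            tails[v] = 1 + sum(tails[w] for w in children)
            for w in children:
                # paths s..v (sigma[v] of them) followed by v-w and any continuation from w
                total[frozenset((v, w))] += sigma[v] * tails[w]
    return {e: c // 2 for e, c in total.items()}    # each path was counted from both endpoints


def max_sppc_edges(adj):
    counts = sppc(adj)
    top = max(counts.values())
    return sorted(tuple(sorted(e)) for e, c in counts.items() if c == top)


path_graph = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
path_counts = sppc(path_graph)
path_best = max_sppc_edges(path_graph)

four_cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
cycle_counts = sppc(four_cycle)
```

On the path 1-2-3-4 the middle edge participates in four shortest paths (23, 123, 234, 1234) and is the single max-SPPC edge, while on a 4-cycle every edge gets the same count, the "delete any edge" situation.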
29
McS0: “McS0 only with the DONOT ISOLATE rule” round 2.
G7. [Tables: S1P pTrees and S1P pairwise ANDs for G7; values not recoverable.] Next round the minimum is one. Note that we no longer preserve cliques when the minimum is one. Next round, (3 9) have no common siblings and will be deleted.
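One round of the minimum-common-sibling deletion with the DONOT ISOLATE rule might look like the sketch below. It is an assumption-laden illustration: Python sets replace the S1P pTrees, "common siblings" is taken to mean common neighbors of the two endpoints, and the do-not-isolate check is implemented as the neighbor-count test mentioned earlier in the deck (never delete an edge whose endpoint has only that one neighbor left). The function name and the example graph are hypothetical.

```python
def mcs0_round(adj):
    """Delete every edge whose common-sibling count is the current minimum,
    skipping deletions that would leave a vertex with no neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}          # work on a copy
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    if not edges:
        return adj, []
    counts = {(u, v): len(adj[u] & adj[v]) for u, v in edges}
    lowest = min(counts.values())
    deleted = []
    for u, v in edges:
        # DONOT ISOLATE: both endpoints must keep at least one other neighbor
        if counts[(u, v)] == lowest and len(adj[u]) > 1 and len(adj[v]) > 1:
            adj[u].discard(v)
            adj[v].discard(u)
            deleted.append((u, v))
    return adj, deleted


# Hypothetical example: two triangles joined by the edge (2, 3), plus vertex 6 hanging off vertex 0.
example = {
    0: {1, 2, 6}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}, 6: {0},
}
after_round, deleted_edges = mcs0_round(example)
```

Both (0, 6) and (2, 3) have zero common siblings here, but only (2, 3) is deleted because removing (0, 6) would isolate vertex 6.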
30
Divisive Graph Clustering
Delete edge with zero Common Siblings co-participation. Calculating CS(h,k) is fast with pTrees, but is the resulting clustering a good one?

[Figures: example graphs G1, G1_1, G1_2, G1_3, G1_4 and G1_5 with their S1P pTrees, S1P pairwise ANDs and fore/aft (A F) pTrees; the values did not survive extraction.]

Define CS2(1,2) = S1P(1) & S1P(2) | S1P(1) & S2P(2) | S2P(1) & S1P(2) | S2P(1) & S2P(2), where S2P(h) = OR_k S2P(hk). Alternatively, define CS2(hk) = S1P(h) & S1P(k) | S2P(hk) & S2P(kh).

Results noted beside the example graphs: CS0 picks 24. Correct. CS0 picks 23, correctly. CS0 says all edges are equal (seems correct) on two of the graphs, and CS0 says all edges are equal (correct?) on another. CS2 says [edges not recoverable] are the best to delete (more sensitive!). CS2: [edges not recoverable] are best, [edges not recoverable] are 2nd best, 23 worst. I like it: 4cycle with 2 1hairs is best, 4cycle with 1 2hair is 2nd best, 6cycle is worst.
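A bit-vector sketch of CS(h,k) and the CS0 selection, with Python integers standing in for the S1P pTrees. The helper names and the example graph (two triangles joined by one edge) are hypothetical; only the definition CS(h,k) = E(h) & E(k) comes from the slides.

```python
def s1p_masks(n, edges):
    """S1P(h) as an integer bitmask: bit k is 1 iff (h, k) is an edge."""
    masks = [0] * n
    for u, v in edges:
        masks[u] |= 1 << v
        masks[v] |= 1 << u
    return masks


def common_sibling_counts(n, edges):
    """1-count of S1P(h) & S1P(k) for every edge (h, k)."""
    masks = s1p_masks(n, edges)
    return {(u, v): bin(masks[u] & masks[v]).count("1") for u, v in edges}


def cs0_edges(n, edges):
    """Edges whose endpoints have no common sibling: the CS0 deletion candidates."""
    counts = common_sibling_counts(n, edges)
    return [e for e in edges if counts[e] == 0]


bridge_graph = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
bridge_counts = common_sibling_counts(6, bridge_graph)
bridge_picks = cs0_edges(6, bridge_graph)
```

Every edge inside a triangle has at least one common sibling, so only the joining edge (2, 3) is picked, which matches the earlier note that only 3cliques survive when all zero-count edges are deleted.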
31
Analyst TickerSymbol matrix w/o labels (1 = “recommends”)
[Tables: the W (women, 1 to 18) by E (events, 1 to 14) bit matrix with its WSP and ESP pTrees and the WSP2 and ESP2 relationships; the cell values and frequency rows did not survive extraction.]

WomenSet ARM: MinSup=6, Mincnf=.75. EventSet ARM: MnSp=9, Mncf=.75.

WomenSets (Freq = # of events attended or co-attended):
Frequent 1WomenSets: [list not recoverable].
Candidate 2WomenSets (as extracted, some entries lost): c 1d 1e c 2d 2e 34 3c 3d 3e 4c 4d 4e cd ce de.
Frequent 2WomenSets: listed over the same entries; which ones are frequent is not recoverable.
Cand 3WomenSets: cde is excluded since ce is infrequent.
Frequent 3WomenSets: [list not recoverable].
Strong W rules (arrows lost in extraction): 21 12 13 31 14 41 23 32 24 42 34 43 134 314 413 134 143 341. Says 1234 is a strong Women community? By confidence, 134 is a very strong Women community?

EventSets (Freq = # attended):
Freq 1EventSets: [list not recoverable].
Cand 2EventSets (as extracted): c c c c 89 8c 9c.
Freq 2EventSets (as extracted): c c c c 89.
Cand 3EventSets: all others are excluded because a 2-subset is not frequent.
Freq 3EventSets: 567.
Strong E rules (arrows lost in extraction): 35 53 56 65 57 58 68 78 98 567 657 567 576 675. (Says 567 is a strong Event community?)

Note: When I did this ARM analysis, I had several degrees miscounted. Nonetheless, I think the same general negative result is expected. Next we try using the WSP2 and ESP2 relationships for ARM?
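The candidate/frequent/strong-rule procedure used on this slide follows the usual Apriori pattern, sketched below. The function names and the five toy transactions are made up for illustration; they are not the women/events data, whose values are lost. In pTree terms, the support of an itemset would be the 1-count of the AND of its bit columns, which the sketch replaces with a direct subset test.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Level-wise search: keep candidates with support >= min_sup, then join and prune."""
    items = sorted({i for t in transactions for i in t})
    support = {}
    level = [frozenset([i]) for i in items]
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: n for c, n in counts.items() if n >= min_sup}
        support.update(kept)
        size = len(next(iter(level))) + 1
        joined = {a | b for a in kept for b in kept if len(a | b) == size}
        # prune: a candidate is excluded as soon as one of its subsets is not frequent
        level = [c for c in joined
                 if all(frozenset(s) in kept for s in combinations(c, size - 1))]
    return support


def strong_rules(support, min_conf):
    """Rules antecedent -> consequent with confidence = sup(itemset) / sup(antecedent)."""
    rules = []
    for itemset, sup in support.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                conf = sup / support[frozenset(ante)]
                if conf >= min_conf:
                    rest = tuple(sorted(itemset - frozenset(ante)))
                    rules.append((ante, rest, conf))
    return rules


# Toy data: each transaction is the set of items that co-occur (e.g., women at one event).
toy = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
toy_support = frequent_itemsets(toy, min_sup=3)
toy_rules = strong_rules(toy_support, min_conf=0.75)
```

The pruning line is the step the slide describes as "cde is excluded since ce is infrequent".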
32
G7. [Figure: graph G7 on vertices 1 to 34, with its E table and the 3Cliques listed as a set of vertex triples.] A Kclique and a 3clique that share an edge (and thus 2 vertices) form a (K+1)clique iff the K-2 edges between the non-shared 3clique vertex and each of the K-2 non-shared Kclique vertices exist. The 1st time no 3clique shares an edge with a Kclique, the Kclique is maximal. Find a Maximal Maximal Clique for each v (a MaxClique containing v with the max # of vertices). The remaining pairwise ANDs after removal of PURE0 pairwise ANDs (i.e., after CS0) are the 3cliques in pTree form.
33
Find Maximal Cliques
1. If a 3clique shares nothing with any other 3clique, then it is maximal, else:
2. A 3clique that shares an edge with a 3clique forms a 4clique iff the 6th unknown edge exists.
3. A 4clique and a 3clique that share an edge form a 5clique iff the 9th and 10th edges exist.
4. A 5clique and a 3clique that share an edge form a 6clique iff the 13th, 14th and 15th unknown edges exist.
5. A Kclique and a 3clique that share an edge (and thus 2 vertices) form a (K+1)clique iff the K-2 edges from the one non-shared 3clique vertex to the K-2 non-shared Kclique vertices exist.
6. The 1st time no 3clique shares an edge with a Kclique, the Kclique is maximal. Remove participating 3cliques from the list and start over?

[Figure: graph G7 on vertices 1 to 34, with its E table and the unique 3Cliques (as sets). Remaining edges after CS0 (removal of PURE0 pairwise ANDs): these are the 3cliques in pTree form.]

4clique checks: 1 2 3 4 y; 1 2 3 8 y; 1 2 3 e y; 1 2 4 8 y; 1 2 4 e y.
... After finishing the 4clique search, do 5clique: with 1234 and 128, check existence of 38 and 48: y y, so 12348 is a 5clique. With 1234 and 12e, check 3e and 4e: y y, so 1234e is a 5clique.
... After finishing the 5clique search, do 6clique: with 12348 and 12e, check existence of 3e 4e 8e: y y n, so 12348e is not a 6clique; with 1234e and 128, check existence of 3e 4e 8e: y y n, so 12348e is not a 6clique. And no other 6clique contains 12?

We haven’t used pTrees! One needs to study the literature on how maximal cliques are typically mined!

Notes: One cannot remove a clique just because it is maximal? So what do we do once we discover 12348 and 1234e as maximal 5cliques? Do we have to retain all the 3cliques and start over? Or? Would it suffice to find one maximal clique containing each vertex? Or find the maximal maximal clique (a maximal clique containing v that has the maximum number of vertices) containing each vertex?
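The clique-growing rule (a Kclique and a 3clique sharing an edge form a (K+1)clique iff the K-2 missing edges exist) can be sketched as follows. The function names are hypothetical, and the small example graph only reproduces the adjacency implied by the y/n checks on this slide (vertices 1, 2, 3, 4 mutually adjacent, 8 and e=14 each adjacent to all four, 8 and 14 not adjacent); it is not the full G7.

```python
from itertools import combinations


def make_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangles(adj):
    """All 3cliques as frozensets, in sorted vertex order."""
    return [frozenset(t) for t in combinations(sorted(adj), 3)
            if t[1] in adj[t[0]] and t[2] in adj[t[0]] and t[2] in adj[t[1]]]


def extend_clique(adj, clique, triangle):
    """Return the (K+1)clique if the triangle shares an edge and the K-2 other edges exist."""
    shared = clique & triangle
    if len(shared) != 2:
        return None
    (new,) = triangle - clique                   # the one non-shared 3clique vertex
    if all(new in adj[v] for v in clique - shared):
        return clique | {new}
    return None


def grow_maximal(adj, seed, tris):
    """Grow a 3clique until no 3clique extends it; the result is then maximal."""
    clique, grown = frozenset(seed), True
    while grown:
        grown = False
        for t in tris:
            bigger = extend_clique(adj, clique, t)
            if bigger:
                clique, grown = bigger, True
                break
    return clique


edges = list(combinations([1, 2, 3, 4], 2))
edges += [(v, 8) for v in (1, 2, 3, 4)] + [(v, 14) for v in (1, 2, 3, 4)]
adj = make_adj(edges)
tris = triangles(adj)
five = extend_clique(adj, frozenset({1, 2, 3, 4}), frozenset({1, 2, 8}))
six = extend_clique(adj, five, frozenset({1, 2, 14}))
grown = grow_maximal(adj, {1, 2, 3}, tris)
```

Starting from one seed finds only one maximal clique containing it (12348 here, not 1234e), which is exactly the open question in the notes about whether participating 3cliques can be removed.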
34
Analyst TickerSymbol Relationship w/o labels (1 = “recommends”)
[Tables: the AN (Analyst) by TS (Ticker Symbol) bit matrix; the cell values and frequency rows did not survive extraction.]

AnalystSet ARM: MinSup=6, Mincnf=.75. StockRecommendedSet ARM: MnSp=6, Mncf=90%.

AnalystSets (Frequency = # of stocks recommended):
Large 1AnalystRecommendedSets (as extracted): c d e.
Candidate 2AnalystSets (as extracted, some entries lost): c 1d 1e c 2d 2e 34 3c 3d 3e 4c 4d 4e cd ce de.
Frequent 2AnalystSets: listed over the same entries; which ones are frequent is not recoverable.
Cand 3AnalystSets: cde is excluded since ce is infrequent.
Frequent 3AnalystSets: [list not recoverable].
Strong rules (arrows and confidences lost in extraction): 21 12 13 31 14 41 23 32 24 42 34 43 134 314 413 134 143 341. Analysts 1, 3, 4 seem to be most in synch.

StockSets (Frequency = # of Analysts recommending):
Frequent 1StockRecommendedSets: [list not recoverable].
Cand 2StockSets (as extracted): c c c c 89 8c 9c.
Freq 2StockSets (as extracted): c c c c 89.
Candidate 3StockSets: all others are excluded due to a 2-subset not being frequent.
Frequent 3StockSets: 568.
Rule table (columns: Rule, conf%, Supp (# of AN), AntecedentSize; surviving rule entries, arrows lost): 3 5, 5 6, 5 7, 5 8, 6 8, 7 8, 8 9, 5 6 8, 56, 58, 68.

I think Antecedent Size is important. We can think of these as rule labels. My favorite rule is 568 since it has high confidence and high Antecedent Size (+ decent Support). We could rate Analysts and use weighted counts as Frequency (a vertex label). We can use Sentiment as stock weights and build it into confidence (or use antecedent SA as a column).
35
Analyst TickerSymbol Relationship with labels (1 = “recommends”)
[Tables: the AN by TS bit matrix bordered by the label columns A, B, C, D (on Ticker Symbols) and E, F, G (on Analysts); the cell values did not survive extraction.]

(A,B,C,D) is a 4 attribute Vertex Label (on the Ticker Symbol part only). Any features of Ticker Symbols can be included here. A is the 1-counts of the Analyst pTrees (the number recommending that stock, in hexadecimal; these 1-counts are also given in the blue row in the other matrix). B is a character column categorizing stocks by type. C is a decimal column which gives a 1-9 rating of stock performance during the previous week. D is a binary column indicating whether the stock trades on the Nikkei (yes | no). Conditions on these label columns (e.g., expressed in SQL) give us a pTree mask to implement the condition.

Likewise (E,F,G) is a 3 attribute Vertex Label (on the Analyst part). Any Analyst (or Investor) feature can be included. E is the 1-counts of the stock pTrees (# of stocks recommended by that Analyst; also listed in the other matrix). F is binary, indicating whether the Analyst is male (yes | no). G is decimal, giving the Analyst's yearly salary in billions. Conditions on these labels (e.g., expressed in SQL) give us a mask to apply to our pTrees to implement the condition.

Probably the simplest implementation language for this recommender would be [PL] SQL or MySQL. We would have only two tables. Stock Table: with the first 18 columns being the AN pTrees and the final 4 columns being A, B, C and D. Analyst Table: with the first 14 columns being the TS pTrees and the final 3 columns being E, F and G. We can call for ANDs, ORs, COMPLEMENTs, etc., from SQL! Anyone can program SQL, right? Maybe R would be a good language so we could have one table that can be rotated?
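The idea that a condition on label columns yields a mask that is then ANDed with a pTree can be illustrated with integers as bit vectors. Everything concrete here is invented for the sketch: the four stock rows, their B, C and D values, the analyst's recommendation column, and the helper names. The slides suggest SQL for the real implementation; Python is used only to show the mask logic.

```python
def mask_where(rows, predicate):
    """Bitmask with bit i set iff row i satisfies the label condition."""
    mask = 0
    for i, row in enumerate(rows):
        if predicate(row):
            mask |= 1 << i
    return mask


def one_count(bits):
    return bin(bits).count("1")


# Hypothetical Stock Table label columns: B = type, C = weekly rating (1-9), D = trades on Nikkei.
stocks = [
    {"B": "t", "C": 8, "D": True},
    {"B": "s", "C": 3, "D": False},
    {"B": "t", "C": 9, "D": False},
    {"B": "u", "C": 7, "D": True},
]

# Hypothetical analyst pTree over the same four stocks: bit i is 1 iff the analyst recommends stock i.
analyst_column = 0b1011                      # recommends stocks 0, 1 and 3

condition = mask_where(stocks, lambda r: r["C"] >= 7 and r["D"])
selected = analyst_column & condition        # AND the mask with the pTree
recommended_and_matching = one_count(selected)
```

The same mask can be reused against every analyst column, so the label condition is evaluated once and each per-analyst query reduces to one AND and one count.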
36
My ideas on what the book paper might contain that would push it beyond the workshop paper.
Expansion Idea 1: The reviewers asked for more charts and graphs, i.e., more performance studies comparing to the competition.

Expansion Idea 2: Expand to include a labelled bipartite graph: the disjoint vertex sets are "Ticker Symbol" and "Analyst", with an edge connecting, e.g., AAPL with Buffett iff a Buffett tweet was sentiment rich (positive or negative) regarding AAPL. (So an undirected edge connects an Analyst with a Ticker Symbol iff there is a pertinent tweet from the Analyst on that stock on that day/week.) Analysts (AN) would be labelled (vertex label) by "Respect Level". Stock Ticker Symbols (TS) would be labelled (vertex label) by "buy-sell" value (a positive number if buy and a negative number if sell) as recommended by various stock rating entities (e.g., there are known raters that label stocks as Strong Buy, Buy, Hold, Sell, Strong Sell, which would be 2, 1, 0, -1, -2 respectively). Tweets are labelled (edge label) by Sentiment (LoValue for very negative sentiment and HiValue for very positive sentiment), as already produced by various software products (e.g., MS Azure S?).

This setup (vertex labelled and edge labelled bipartite graphs) would allow us to try lots of pTree tools: ARM using pTrees (just like MBR, except instead of customers and products we have Analysts and Ticker Symbols); Clustering using pTrees (what we are currently looking at in our Saturday meetings); Outlier Analysis using pTrees (also related to current Saturday topics); Community Detection using pTrees (related to Clustering and Outlier Analysis).

From Arijit, Sept 25, 2015: I am leveraging Microsoft's Azure Machine Learning Service for the Sentiment Analysis and have worked to build a better sentiment analyzer on Tweet data, but the Service endpoint is public and can be used for research. So far:

I have pulled the tweets from Twitter of all the investors whom we would like to track over the last 5 years. I have coded around the Twitter platform limitations, and the code is parallelized with multiple configurations. This data is stored in an Azure SQL Database.

I have the Azure Sentiment Analysis Service running on these pulled tweets and have the sentiment score for every tweet of all these individuals. (This was not done in the workshop paper.)

I have a separate service which queries the Yahoo Finance API and has pulled the historical data for every ticker symbol. This data is also stored in a separate table.

I have added the average sentiment score (SScore) for a particular day for a particular ticker symbol and have mapped this information to the other fundamentals of the stock, like Open Price, Closed Price, Volume, P/E ratio. (This was also not done in the workshop paper.)

With this data we can plug in various algorithms and test. My main hypothesis in naïve form is: Social Sentiments have an effect on stock fluctuations. We can do all sorts of variations and then test different algorithms. One approach I have been thinking about is how to weigh the investors: rather than just having an Average Sentiment Score, have a Weighted Average Sentiment Score. I also have the code for the Exponential Moving Sentiment Score (EMScore) calculated over various time periods, like 15 day Moving Average, 200 day MA, 1 year and so on, so that I can see which of these measures is the best indicator of the volatility of the tickers.

I have attached a spreadsheet which gives an idea of what data is being captured now. This is just a sample sheet which I was using to test the EMScore on Tickers NFLX and AAPL between 07/01 and 07/30. On Sheet 3 you will see a column with the DailyAvgSentimentScore values included. The Azure Sentiment Analyzer service consumes the data on Sheet 4 for every tweet from every user and generates a sentiment score. The average of all these Sentiment Scores is calculated and the Daily Avg Sentiment Score field is populated on Sheet 3. Sheet 2 shows the EMScore calculation, and you can see a chart which shows how the Sentiment Score varies over the period and how it roughly follows the same trend as the Ticker Closing prices.
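The Weighted Average Sentiment Score and the Exponential Moving Sentiment Score described above could be computed along these lines. This is a guess at the arithmetic, not Arijit's code: the investor weights, the sample scores, the function names, and the common smoothing convention alpha = 2 / (span + 1) are all assumptions, since the message does not state the formulas used.

```python
def weighted_average_sentiment(scores, weights):
    """Average of per-investor sentiment scores, weighted by a per-investor weight."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def exponential_moving_score(daily_scores, span):
    """Exponential moving average of daily scores; alpha = 2 / (span + 1) is an assumed convention."""
    alpha = 2.0 / (span + 1)
    smoothed, current = [], None
    for score in daily_scores:
        # the first day seeds the average; later days blend the new score with the running value
        current = score if current is None else alpha * score + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical values: two investors on one day, then three daily averages for one ticker.
day_scores = {"investor_a": 1.0, "investor_b": 0.0}
respect = {"investor_a": 3, "investor_b": 1}
weighted = weighted_average_sentiment(day_scores, respect)

em_scores = exponential_moving_score([0.0, 1.0, 1.0], span=3)
```

A longer span (15 days, 200 days) gives a smaller alpha and therefore a smoother score that reacts more slowly to a single day's tweets.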
37
MCS + DND2 (MCS= Minimum Common Cousins)
G7. DONOT DELETE-2: As a first step, create (and keep current) a DoNotDelete (DND) list. Include all vertices with a 1 or a 2 count. Keep DND current by adding vertices as soon as they exhibit the “Delete” condition. Applying DND first means we only AND edge endpoints that are both off the DND list (reducing the AND burden considerably).

[Tables: the DND list for each round, e.g., "DND 5 6 7 11 24 30 31" and "DND 10" (Sum 12); the later "DND is all but ..." entries lost their vertex lists.]

That also deletes (1 6) (1 7) (and may delete others as well). This is a good clustering! 1234 is a 4clique! Nothing will change after this round.

Notes: The only way (3 10), (29 34) will get deleted is with a final round that deletes any edge whose endpoints satisfy CS=CC=0 (+DNI).
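A sketch of one deletion round with a DoNotDelete list. Several points are my reading of the slide, not statements from it: the "1 or a 2 count" is taken to be the vertex degree, the AND count is the number of common neighbors of the two endpoints (the slide's cousin count may be defined differently), and a vertex joins the list as soon as a deletion drops it to degree 2 or less. Names and the example graph are hypothetical.

```python
def dnd_round(adj, protected_degree=2):
    """Delete the minimum-count edges, ANDing only endpoints that are both off the DND list."""
    adj = {v: set(ns) for v, ns in adj.items()}          # work on a copy
    dnd = {v for v, ns in adj.items() if 1 <= len(ns) <= protected_degree}
    candidates = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]
                         if u not in dnd and v not in dnd})
    if not candidates:
        return adj, dnd, []
    counts = {(u, v): len(adj[u] & adj[v]) for u, v in candidates}
    lowest = min(counts.values())
    deleted = []
    for u, v in candidates:
        if counts[(u, v)] == lowest and u not in dnd and v not in dnd:
            adj[u].discard(v)
            adj[v].discard(u)
            deleted.append((u, v))
            for x in (u, v):                             # keep DND current
                if len(adj[x]) <= protected_degree:
                    dnd.add(x)
    return adj, dnd, deleted


# Hypothetical example: two 4cliques joined by the edge (3, 4), plus vertex 8 hanging off vertex 0.
example = {
    0: {1, 2, 3, 8}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6}, 8: {0},
}
new_adj, dnd_list, deleted_edges = dnd_round(example)
```

Vertex 8 goes on the DND list up front, so its only edge is never ANDed, and the round deletes just the joining edge (3, 4), leaving both 4cliques intact.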