The vertex-labelled, edge-labelled graph


1 The vertex-labelled, edge-labelled graph
[Figure: ANalyst (AN) by TickerSymbol (TS) relationship matrix with labels. Each AN row and each TS column is a pTree with a count (Ct). AN attributes shown: SA, Ct, C, Sal. TS attributes shown: Dow?, Ct, Buy-Hold-Sell (BHS), SA.]

We can interpret this structure in many ways:
1. as a relationship with entity tables;
2. as an AN[alyst] Table with attributes: the AN attributes (SA, Ct, C, Sal) plus each TickerSymbol pTree as an additional attribute (the TS attributes (Dow?, Ct, BHS, SA) are not captured in this interpretation);
3. as a T[icker] S[ymbol] or Stock Table with attributes: the TS attributes (Dow?, Ct, BHS, SA) plus each Analyst pTree as an additional attribute (the AN attributes (SA, Ct, F, Sal) are not captured in this interpretation).

[Figure: the same structure in full pTree form, with columns including SA0, SA1, C0, C1, C2, C3 and Dow?.]

We can include this relationship with other relationships sharing entities by using the RoloDex Model (next slide). The graph could be 3D, 4D, etc. (i.e., edges are triples, quadruples, and so on). The graph could also be edge labelled. A convenient way to capture edge labels is to make the content of each matrix cell the label structure rather than just a yes/no bit. As a simple but pertinent example, suppose we have a 0-3 rating of each Analyst-Stock pair which measures how much that Analyst knows about that Stock. We just change each bit to a decimal number in [0,3] (or bitslice those numbers using two bits instead of one, so that the matrix columns are 2-bit pTreeSets rather than just one pTree).
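The two-bit bitslicing of a 0-3 rating column can be sketched as follows. This is a minimal illustration, not the slides' pTree implementation: the `ratings` matrix is made up, and each bit slice is held as a plain Python integer bitmask standing in for an uncompressed pTree.

```python
# Hypothetical 0-3 ratings: rows are Analysts, columns are Stocks.
ratings = [
    [3, 0, 2],
    [1, 2, 0],
    [0, 3, 3],
]

def bitslice_column(column, bits=2):
    """Split a column of small integers into `bits` bit vectors, high bit first.

    Each bit vector is a Python int used as a bitmask, where bit i belongs to
    row i. This stands in for an uncompressed pTree (an assumption).
    """
    slices = []
    for b in reversed(range(bits)):
        mask = 0
        for row, value in enumerate(column):
            if (value >> b) & 1:
                mask |= 1 << row
        slices.append(mask)
    return slices

def rebuild_column(slices, n_rows):
    """Invert bitslice_column to confirm that no information was lost."""
    bits = len(slices)
    column = []
    for row in range(n_rows):
        value = 0
        for i, mask in enumerate(slices):
            value |= ((mask >> row) & 1) << (bits - 1 - i)
        column.append(value)
    return column

# One 2-bit "pTreeSet" per Stock column.
columns = [list(col) for col in zip(*ratings)]
sliced = [bitslice_column(col) for col in columns]
```

Calling `rebuild_column(sliced[0], 3)` returns the first Stock column unchanged, which is a quick way to check that the two slices together carry the full rating.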
If C measures the "Correctness Level" of the Analyst over recent days or weeks across all stocks (e.g., based on backward analysis of previous sentiment analysis and the actual performance of the stock), and the cell numbers measure the correctness of that Analyst on that Stock, then a signal might be: mask the Analysts with C>=2; for those Analysts, find the average Correctness for each Stock; then select those Stocks for which the number of Analysts is between two thresholds (we want a high average, and more than one analyst, but not too many).
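A minimal sketch of that signal, assuming made-up analysts, stocks and scores. The threshold names (`c_min`, `lo`, `hi`, `avg_min`) and their defaults are hypothetical; the slide does not specify values.

```python
from statistics import mean

# Hypothetical data: analyst -> overall Correctness Level C in [0, 3].
analyst_c = {"A1": 3, "A2": 1, "A3": 2, "A4": 2}

# Hypothetical per-pair correctness. A missing stock means the analyst
# has no rating for it.
cell = {
    "A1": {"IBM": 3, "GE": 2},
    "A2": {"IBM": 3, "GE": 3, "XOM": 3},
    "A3": {"IBM": 2, "XOM": 1},
    "A4": {"IBM": 3},
}

def signal(analyst_c, cell, c_min=2, lo=2, hi=3, avg_min=2.5):
    """Return {stock: average correctness} for stocks that pass the filter."""
    # Step 1: mask the analysts with C >= c_min.
    kept = [a for a, c in analyst_c.items() if c >= c_min]
    # Step 2: collect the per-stock scores of the kept analysts.
    per_stock = {}
    for a in kept:
        for stock, score in cell.get(a, {}).items():
            per_stock.setdefault(stock, []).append(score)
    # Step 3: keep stocks whose analyst count lies in [lo, hi] and whose
    # average is high enough.
    return {
        s: mean(v)
        for s, v in per_stock.items()
        if lo <= len(v) <= hi and mean(v) >= avg_min
    }

result = signal(analyst_c, cell)
```

With this data only IBM survives: GE and XOM are each covered by a single kept analyst, so they fail the lower count threshold.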

2 The Multi-Relationship Model
Every Entity (Gene, Term, Experiment, Person, Document, Item, Stock, Course, Movie) has an EntityTable of many descriptive attributes (columns). They aren't shown here. For example, the previous slide shows the descriptive columns of Stocks(Dow?, Count, BHS, SA) and Analysts(SA, Count, Female?, SalaryInBillions).

[Figure: relationship matrices sharing entities. Panels: Stock-Investor; Friends (People × People); ItemSet-ItemSet (antecedent); Customer-Item (BUYS); customer rates movie; customer rates movie as 5; Author-Doc; Term-Document; doc-doc; gene-gene (ppi); term-term (ShareStem, CellLabel=stem); Exp-PI; Exp-gene; Course-Enroll.]

Tweets are Documents, so the Tweet-Tweeter relationship is a Document-Author relationship (Tweetee, hashtag, etc. are Edge Labels). In looking for signals that no one else uses: What if an Investor BUYS an island in the Mediterranean? What if an Investor's best friend buys lots of stock in an Online University?

Supp(A) = CusFreq(ItemSet)
Conf(A→B) = Supp(A∪B) / Supp(A)
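The support and confidence formulas can be illustrated with a short sketch. The baskets below are invented, and support is taken as a raw customer count, which is one reading of CusFreq.

```python
# Hypothetical customer baskets for the Customer-Item (BUYS) relationship.
baskets = {
    "c1": {"milk", "bread"},
    "c2": {"milk", "bread", "eggs"},
    "c3": {"milk"},
    "c4": {"bread", "eggs"},
}

def supp(itemset, baskets):
    """Number of customers whose basket contains every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for items in baskets.values() if itemset <= items)

def conf(antecedent, consequent, baskets):
    """Conf(A -> B) = Supp(A u B) / Supp(A); 0.0 if A never occurs."""
    a = set(antecedent)
    denom = supp(a, baskets)
    if denom == 0:
        return 0.0
    return supp(a | set(consequent), baskets) / denom

milk_support = supp({"milk"}, baskets)
milk_to_bread = conf({"milk"}, {"bread"}, baskets)
```

Here three customers buy milk and two of them also buy bread, so the rule milk→bread has confidence 2/3.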

3 G7 Breadth-First Inductive Clique Search Algorithm:
Let CLQK be the set of all Kcliques. First find CLQ3 using CS0. Induction Step: CLQK+1 is obtained by applying the ECKCT (below) to CLQK and CLQ3.

[Figure: graph G7 on 34 vertices.]

Breadth-First Edge-Check K-Clique Thm (ECKCT): A Kclique and a 3clique that share an edge form a (K+1)clique iff all K-2 edges from the non-shared Kclique vertices to the non-shared 3clique vertex exist in the graph.

On G7 (List Version):
{1,2,3} with {1,2,18}, {1,2,20}, {1,2,22}: not in CLQ4, since (3,18), (3,20), (3,22) ∉ E.
{1,2,3} with {1,2,4}: 1 2 3 4 ∈ CLQ4, since (3,4) ∈ E.
{1,2,3} with {1,2,8} and {1,2,14}: 1 2 3 8 and 1 2 3 14 ∈ CLQ4.
Note the checkback. Is it required? (No: if 1 3 2 were in a 4CLQ it would have shown up already.) Candidates already in CLQ4 are skipped.

UCLQ4: 1 2 3 4; 1 2 3 8; 1 2 3 14; 1 2 4 8; 1 2 4 14; 1 3 4 8; 1 3 4 14; 2 3 4 8; 2 3 4 14. UCLQ4 done.

UCLQ5: 1 2 3 4 8; 1 2 3 4 14.

The slide lists these cliques under the labels UCLQ3 (unique 3cliques as lists) and MUCLQs: 1 2 18; 1 2 20; 1 2 22; 1 3 9; 1 4 13; 1 5 7; 1 5 11; 1 6 7; 1 6 11; 3 9 33; 6 7 17; 9 31 33; 9 31 34; 24 28 34; 24 30 33; 24 30 34; 25 26 32; 27 30 34; 29 32 34; together with the two 5cliques 1 2 3 4 8 and 1 2 3 4 14.

[Figure: CLQ3 (as pTrees). These are the edges remaining after CS0 (removal of the PURE0 edge-endpoint-pair ANDs).]

Is there a pTree version of this algorithm? Is it faster?
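The list version of the induction step can be sketched in a few lines. This is an illustration under assumptions, not the slides' code: the 6-vertex graph below is a hypothetical subgraph that mirrors the example above (two 5cliques sharing {1,2,3,4}, with no edge between 8 and 14), and the function names are invented.

```python
from itertools import combinations

def make_graph(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def triangles(adj):
    """All 3cliques, each stored once as a frozenset."""
    return {
        frozenset(t)
        for t in combinations(sorted(adj), 3)
        if t[1] in adj[t[0]] and t[2] in adj[t[0]] and t[2] in adj[t[1]]
    }

def next_level(adj, clq_k, clq_3):
    """Edge-check step: combine each Kclique with each 3clique sharing an edge."""
    out = set()
    for kc in clq_k:
        for tri in clq_3:
            shared = kc & tri
            if len(shared) != 2:
                continue
            (new,) = tri - kc  # the non-shared 3clique vertex
            # Check the K-2 edges from the non-shared Kclique vertices.
            if all(new in adj[u] for u in kc - shared):
                out.add(kc | {new})
    return out

def all_cliques_by_size(adj):
    """Return {K: set of Kcliques} for K = 3, 4, ... until a level is empty."""
    levels = {3: triangles(adj)}
    k = 3
    while levels[k]:
        levels[k + 1] = next_level(adj, levels[k], levels[3])
        k += 1
    del levels[k]  # drop the final empty level
    return levels

# Hypothetical subgraph: two 5cliques sharing {1,2,3,4}; (8,14) is not an edge.
five_a, five_b = (1, 2, 3, 4, 8), (1, 2, 3, 4, 14)
edges = set(combinations(five_a, 2)) | set(combinations(five_b, 2))
levels = all_cliques_by_size(make_graph(edges))
```

On this subgraph the sketch yields nine 4cliques and two 5cliques, the same counts as the UCLQ4 and UCLQ5 lists above. Storing cliques as frozensets means duplicates found through different shared edges collapse automatically, which plays the role of the "already in CLQ4" check.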

4 G7 Depth-First kClique Thm (pTree version): Find a Largest Maximal Clique containing v
[Figure: graph G7, with the 4clique lists 1 2 3 4; 1 2 3 8; 1 2 3 14; 1 2 4 8; 1 2 4 14; 1 3 4 8; 1 3 4 14; 2 3 4 8; 2 3 4 14 and the 5clique lists 1 2 3 4 8; 1 2 3 4 14.]

Let (x,y) ∈ CLQ3pTree(v,w), where w produces the largest count. If (x,y) ∈ E and CLQ3pTree(x,y) is the largest such, and Count(NewPtSet(v,w,x,y) ≡ CLQ3pTree(v,w) & CLQ3pTree(x,y)) is:
0, the 4 vertices form a maximal 4Clique (i.e., v,w,x,y).
1, the 5 vertices form a maximal 5Clique (i.e., v,w,x,y and the NewPt).
2, the 6 vertices form a maximal 6Clique if the NewPair is an edge, else they form 2 maximal 5Cliques.
3, the 7 vertices form a maximal 7Clique if each of the 3 NewPairs is an edge; else if 1 or 2 of the NewPairs are edges, then each of the 6VertexSets (the 4 original vertices and 2 EdgeEndpoints) forms a maximal 6Clique; else if 0 of the NewPairs is an edge, then each 5VertexSet (original 4 plus 1 NewVertex) forms a maximal 5Clique.
…

Theorem: ∀ hClique ⊆ NewPointSet that is maximal within the NewPointSet, those h vertices together with v,w,x,y form a maximal (h+4)Clique, where NPS(v,w,x,y) = CLQ3(v,w) & CLQ3(x,y).

With each maximal kClique found, we can determine whether it is a "Largest" by examining counts (or we can find them all and then pick out a "Largest"), but determining "Largest" early can result in significant time savings (we can move on to another v immediately), e.g., if there aren't enough siblings left or there is no large enough 1-count among the CLQ3pTrees.

CLQ3pTrees: ∀ edge (u,v), CLQ3(u,v) ≡ pTree(u) & pTree(v). Removing those with Ct=0 gives all 3Cliques, each listed thrice. (Each 1-bit of CLQ3(u,v) is the 3rd vertex of a 3Clique formed with u and v. Every edge is uniquely listed as the header of a pTree.)
CLQ4 pTrees (CLQ3pTree(v,w) & CLQ3pTree(x,y), for a pair (x,y) taken from CLQ3pTree(v,w)):
1 2 & 3 4: {8, 14}
1 5 & 7 11: not an edge
1 6 & 7 11: not an edge
1 7 & 5 6: not an edge
1 11 & 5 6: not an edge
3 9 & 1 33: not an edge
6 7 & 1 17: not an edge
9 31 & 33 34: not an edge
9 33 & 3 31: not an edge
24 30 & 33 34: not an edge
24 34 & 28 30: not an edge
30 34 & 24 27: not an edge
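The CLQ3 pTree construction on this slide (AND the two endpoint pTrees of each edge, then drop the results with Ct=0) can be sketched with integer bitmasks. This assumes an uncompressed representation in which bit i of a mask stands for vertex i; real pTrees are compressed trees, and the small graph is a hypothetical subgraph mirroring the example (two 5cliques on {1,2,3,4,8} and {1,2,3,4,14}).

```python
from itertools import combinations

def neighbor_masks(edges):
    """Map each vertex to an int bitmask of its neighbours (bit i <-> vertex i)."""
    masks = {}
    for u, v in edges:
        masks[u] = masks.get(u, 0) | (1 << v)
        masks[v] = masks.get(v, 0) | (1 << u)
    return masks

def clq3_masks(edges):
    """For every edge (u,v), CLQ3(u,v) = pTree(u) & pTree(v), dropping Ct=0."""
    masks = neighbor_masks(edges)
    out = {}
    for u, v in edges:
        m = masks[u] & masks[v]
        if m:  # a zero mask means the edge lies in no 3clique
            out[(min(u, v), max(u, v))] = m
    return out

def bits(mask):
    """List the vertices whose bit is set in `mask`."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

# Hypothetical subgraph: two 5cliques sharing {1,2,3,4}; (8,14) is not an edge.
edges = set(combinations((1, 2, 3, 4, 8), 2)) | set(combinations((1, 2, 3, 4, 14), 2))
clq3 = clq3_masks(edges)

# NewPointSet for v,w = 1,2 and x,y = 3,4.
nps_1234 = bits(clq3[(1, 2)] & clq3[(3, 4)])
```

`bits(clq3[(1, 2)])` lists the third vertices of every 3clique on edge (1,2), and `nps_1234` comes out as the two new points 8 and 14. Summing the 1-bits over all masks counts each 3clique three times, once per edge, which matches the "each listed thrice" remark.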

5 G7 Depth-First kClique Thm (pTree): worked example [vw=12, xy=34]
[Figure: graph G7. Vertex labels above 9 are abbreviated on this slide, e.g. b=11 and e=14.]

∀ edge (v,w), find a largest maximal clique (LMC). Let (x,y) ∈ CLQ3pTree(v,w) with the largest count. If CLQ3(x,y) consists of a single vertex z, then xyz is an LMC. If (x,y) ∉ E, then vwx and vwy are maximal; else let NewPtSet(v,w,x,y) ≡ CLQ3pTree(v,w) & CLQ3pTree(x,y). If Ct(NPS(vwxy)) =
0, the 4 vertices form a maximal 4Clique (i.e., v,w,x,y).
1, the 5 vertices form a maximal 5Clique (i.e., v,w,x,y and the NewPt).
2, the 6 vertices form a maximal 6Clique if the NewPair is an edge, else they form 2 maximal 5Cliques.
3, the 7 vertices form a maximal 7Clique if each of the 3 NewPairs is an edge; else if 1 or 2 of the NewPairs are edges, then each of the 6VertexSets (the 4 original vertices and 2 EdgeEndpoints) forms a maximal 6Clique; else if 0 of the NewPairs is an edge, then each 5VertexSet (original 4 plus 1 NewVertex) forms a maximal 5Clique.
…

1 2 & 3 4: NPS(1234) = CLQ3p(12) & CLQ3p(34) = {8,14}. (8,14) ∉ E, so 12348 and 1234e are max5cliques.

Theorem: ∀ hClique ⊆ NewPointSet that is maximal within the NewPointSet, those h vertices together with v,w,x,y form a maximal (h+4)Clique, where NPS(v,w,x,y) = CLQ3(v,w) & CLQ3(x,y).

Determine whether a maximal kClique is Largest from counts: both 12348 and 1234e are largest. They use 3, 4, 8, e; any other clique must use the 7-4=3 remaining points, so it is at most a 5Clique. So 12348 and 1234e are LMCs for the edges they contain (…, 48, 4e).

Start over with:
15? (7,b) ∉ E, so 157 and 15b are LMCs.
16? (7,b) ∉ E, so 167 and 16b are LMCs.
(3,9,33): LMC(3,9,33).
(24,28,34): LMC(24,28,34).
(24,30)? (33,34) ∉ E, so (24,30,33) and (24,30,34) are LMCs.
(25,26,32): LMC(25,26,32).
(27,30,34): LMC(27,30,34).
(29,32,34): LMC(29,32,34).

UCLQ3pTrees: ∀ edge (u,v), CLQ3(u,v) ≡ pTree(u) & pTree(v), diagonalized. Removing those with Ct=0 gives all 3Cliques uniquely.

6 G7 Maximal Clique Theorem
[Figure: graph G7 and its UCLQ3 pTrees.]

Maximal Clique Theorem: ∀ hClique ⊆ NPS(v,w,x,y) = UCLQ3{v,w} & UCLQ3{x,y} that is maximal within NPS(v,w,x,y), those h vertices together with v,w,x,y form a maximal (h+4)Clique.

Recursive Depth-First Theorem (pTree) to find the clique structure around an edge (v,w): If UCLQ3vw has no edges, then ∀ x ∈ UCLQ3vw, vwx is a MCLQ; else pick a UCLQ3vw edge xy [with maximal UCLQ3-count?]. Find the clique structure of NPS(vwxy) (recursively). Then apply the Theorem to get the MaxCliques of {v,w,x,y} ∪ UCLQ3vw.

Example: Pick vw=12 from UCLQ3. Pick edge xy=34 from UCLQ3vw. The clique structure of NPS(1234)={8,14} has two 1Cliques, {8} and {14}. Thus {1,2,3,4} ∪ UCLQ312 has 2 MaxCliques: {1,2,3,4,8} and {1,2,3,4,14}.
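The recursion around an edge can be sketched as below. This is one possible reading of the slide, using plain Python sets in place of pTrees; the function names and the small graph are hypothetical. At each step the sketch picks the edge whose endpoints have the most common neighbours inside the current point set, which is one interpretation of the bracketed "maximal UCLQ3-count" suggestion.

```python
from itertools import combinations

def make_graph(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def extend(adj, base, common):
    """Grow the clique `base` into maximal cliques along one chosen edge.

    base   : frozenset of vertices that already form a clique
    common : vertices adjacent to every vertex of base (the NewPointSet)
    """
    if not common:
        return [base]  # nothing can be added, so base is maximal
    inner = [(x, y) for x, y in combinations(sorted(common), 2) if y in adj[x]]
    if not inner:
        # No edges among the new points: each one closes a maximal clique.
        return [base | {x} for x in sorted(common)]
    # Pick the edge with the largest common neighbourhood inside `common`.
    x, y = max(inner, key=lambda e: len(common & adj[e[0]] & adj[e[1]]))
    return extend(adj, base | {x, y}, common & adj[x] & adj[y])

def around_edge(adj, v, w):
    """Maximal cliques found by the depth-first step starting at edge (v,w)."""
    return extend(adj, frozenset({v, w}), adj[v] & adj[w])

# Hypothetical graph: two 5cliques sharing {1,2,3,4}, plus the 3clique {1,5,7}.
edges = set(combinations((1, 2, 3, 4, 8), 2)) | set(combinations((1, 2, 3, 4, 14), 2))
edges |= {(1, 5), (1, 7), (5, 7)}
adj = make_graph(edges)

found_12 = around_edge(adj, 1, 2)
found_15 = around_edge(adj, 1, 5)
```

Starting from edge (1,2), the sketch picks (3,4), finds the new point set {8,14} with no internal edge, and returns the two 5cliques. Every clique it returns is maximal, but because it follows a single chosen edge at each level it does not enumerate every maximal clique through (v,w).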

7 No two share an edge so there are no 4Cliques.
[Figure: graph G10, vertices labelled 1 to i0.]

G10: Web graph of pages of a website and hyperlinks. Communities by color (Girvan Newman Algorithm). |V|=180 (1-i0) and |E|=478. We have unPTrees (undirected graph), inPTrees (showing all incoming edges and where they come in from) and outPTrees.

UCLQ3pTrees, checked in order of decreasing count:
Max Ct=26, vertex=91: all &s with 91 have Ct=0, so 91 is part of no 3cliques.
Ct=24, vertex=D2: all &s with D2 have Ct=0, so D2 is part of no 3cliques.
Ct=23, vertex=38: all &s with 38 have Ct=0, so 38 is part of no 3cliques.
Ct=14, vertex=52: & Ct=0, so 52 is part of no 3clique.
Ct=13, vertex=174 (h4): part of the 3cliques H0 H2 H4 and H3 H4 I0.
Ct=13, vertex=46: & Ct=0, so 46 is part of no 3clique.
Ct(B2)=9: part of a 3clique.
Ct(45)=9: & cts=0.
Ct(78)=9: & cts=0.
Ct(49)=8: all 0s.
Ct(81)=8: all 0s.
Ct(C4)=7: all 0s.
Ct(A7)=5: all 0s.
Ct(H9)=5: all 0s.

There are only three 3Cliques: {H0 H2 H4}, {H3 H4 I0}, {45 76 B2} (I quickly checked the rest). No two share an edge, so there are no 4Cliques. The fact that there are so few cliques may be a characteristic of web page link graphs. Was it worthwhile doing the Clique analysis? Yes! The 8 vertices involved in the three 3Cliques (and the three cliques themselves) are outliers! We can examine each to try to determine what's unique about them.
What does it mean that the three vertices {H3 H4 I0} are a 3Clique in the undirected graph of page references? In this case, after close examination, we see that they form a cycle (in the directed graph sense). Should there ever be circular references like that in web pages? The 3Clique {45 76 B2} appears to be a mistake (no edge from 45 to 76). The clique {H0 H2 H4} does not appear to be a cycle.
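Checking whether an undirected 3Clique is also a directed cycle can be sketched as follows. The `out_links` map plays the role of the outPTrees, and the page names and links in it are invented for illustration; they are not the G10 data.

```python
def is_undirected_triangle(out_links, a, b, c):
    """True if every pair among a, b, c is linked in at least one direction."""
    def linked(u, v):
        return v in out_links.get(u, set()) or u in out_links.get(v, set())
    return linked(a, b) and linked(b, c) and linked(a, c)

def is_directed_3cycle(out_links, a, b, c):
    """True if the links a->b->c->a or a->c->b->a are all present."""
    def has(u, v):
        return v in out_links.get(u, set())
    forward = has(a, b) and has(b, c) and has(c, a)
    backward = has(a, c) and has(c, b) and has(b, a)
    return forward or backward

# Hypothetical pages: p, q, r reference each other in a circle,
# while x, y, z are pairwise linked without forming a circle.
out_links = {
    "p": {"q"},
    "q": {"r"},
    "r": {"p"},
    "x": {"y", "z"},
    "y": {"z"},
}

pqr_is_cycle = is_directed_3cycle(out_links, "p", "q", "r")
xyz_is_cycle = is_directed_3cycle(out_links, "x", "y", "z")
```

Both triples are 3Cliques in the undirected sense, but only p, q, r form a circular reference, which is the distinction drawn above between {H3 H4 I0} and {H0 H2 H4}.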

8 APPENDIX: APRIORI Clique Search Algorithm
[Figure: graph G7, with the remaining pairwise ANDs after removal of PURE0s (i.e., after CS0). These are the 3cliques in pTree form.]

The APRIORI Clique Search Algorithm may be faster since, e.g., a candidate 4clique which survives the "all sub3sets are 3cliques" test is automatically a 4clique.

[Table residue, G7: starting from the Unique 3Cliques (in set form), the Cand4Cliques are tested and most are rejected ("no"), leaving the Survivor 4Cliqs. The Cand5Cliqs built from them give Surviv5Cliqs 1 2 3 4 8 and 1 2 3 4 14. The only Cand6Clq, 1 2 3 4 8 14, is rejected.]

The Clique Search Algorithms are:
1 This list APRIORI method.
2 The pTree version of this APRIORI method.
3 The Induction Clique Search Alg (list version).
4 The Induction Clique Search Alg (pTree version).
Which is fastest? Simplest? (Accuracy should be the same at 100%.) Do we need 100%, or can we get great time savings by relaxing that? How do these methods perform on a Big Graph? On Friends?

On G7 there are 2 max5cliques, 12348 and 1234e. All 4cliques are subsets of these 5cliques.

One can choose at random from v's 3Cliques (1st?):
MSMC(1,2,3,4,8) = {1,2,3,4,8}
MSMC(14) = {1,2,3,4,14}
MSMC(5,7) = {1,5,7}
MSMC(9,31,33) = {9,31,33}
MSMC(10) = {3,10}
MSMC(11) = {1,5,11}
MSMC(12) = {1,12}
MSMC(13) = {1,4,13}
MSMC(*) = {33,*}, * = 15, 16, 19, 21, 23
MSMC(17) = {6,7,17}
MSMC(18) = {1,4,18}
MSMC(20) = {1,2,20}
MSMC(22) = {1,2,22}
MSMC(24,28,34) = {24,28,34}
MSMC(25,26,32) = {25,26,32}
MSMC(27,30) = {27,30,34}
MSMC(28) = {3,28}
MSMC(29) = {29,32,34}
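A list-based sketch of the APRIORI method, under assumptions: the join step below (union two Kcliques that differ in one vertex) is a standard Apriori-style candidate generator and may not be the exact join used on the slide, and the 6-vertex graph is a hypothetical subgraph mirroring the G7 example. The prune step is the "all sub-K-sets are Kcliques" test; a survivor needs no further edge check because every pair of its vertices already lies in some surviving Kclique.

```python
from itertools import combinations

def apriori_cliques(edges):
    """Return {K: set of Kcliques} for K >= 3, built level by level."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Level 3: all 3cliques, found by a direct edge check.
    level = {
        frozenset(t)
        for t in combinations(sorted(adj), 3)
        if t[1] in adj[t[0]] and t[2] in adj[t[0]] and t[2] in adj[t[1]]
    }
    levels = {}
    k = 3
    while level:
        levels[k] = level
        # Join: two Kcliques differing in one vertex give a (K+1)-candidate.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Prune: keep a candidate only if all of its K-subsets are Kcliques.
        level = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        k += 1
    return levels

# Hypothetical subgraph: two 5cliques sharing {1,2,3,4}; (8,14) is not an edge.
edges = set(combinations((1, 2, 3, 4, 8), 2)) | set(combinations((1, 2, 3, 4, 14), 2))
levels = apriori_cliques(edges)
```

On this subgraph the only 6-candidate, {1,2,3,4,8,14}, is pruned because {1,2,3,8,14} is not a 5clique, so the search stops with the two 5cliques, in line with the rejected Cand6Clq above.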

