Efficient Subgraph Search over Large Uncertain Graphs
Ye Yuan 1, Guoren Wang 1, Haixun Wang 2, Lei Chen 3
1. Northeastern University, China  2. Microsoft Research Asia  3. HKUST
Outline
Ⅰ Background
Ⅱ Problem Definition
Ⅲ Query Processing Framework
Ⅳ Solutions
Ⅴ Conclusions
Background
The graph is a rich data structure used in many real applications.
Bioinformatics: yeast PPI networks, gene regulatory networks
Background
Chemistry: compound databases, e.g., compounds containing a benzene ring
Background
Social networks: EntityCube, Web 2.0
Background
In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs.
The STRING database (http://string-db.org) is a data source containing PPIs whose uncertain edges come from biological experiments.
In visual pattern recognition, uncertain graphs are used to model visual objects.
In social networks, uncertain links represent possible relationships or the strength of influence between people.
Therefore, it is important to study query processing on large uncertain graphs.
Problem Definition
Probabilistic subgraph search
Uncertain graph: vertex uncertainty (existence probability) and edge uncertainty (existence probability given the edge's two endpoints)
Problem Definition
Probabilistic subgraph search
Possible worlds: all combinations of the uncertain edges and vertices; each world is a certain graph that occurs with some probability.
Problem Definition
Probabilistic subgraph search
Given: an uncertain graph database G = {g1, g2, …, gn}, a query graph q, and a probability threshold ε.
Query: find all gi ∈ G such that the subgraph-isomorphism probability between q and gi is not smaller than ε.
Subgraph-isomorphism probability (SIP): the SIP between q and gi is the sum of the probabilities of gi's possible worlds to which q is subgraph-isomorphic.
Problem Definition
Probabilistic subgraph search
Subgraph-isomorphism probability (SIP): in the worked example, summing the probabilities of the possible worlds of g that contain q gives Pr(q ⊑ g) = 0.27.
Computing the SIP is #P-complete.
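The possible-world semantics of the SIP can be made concrete with a brute-force sketch. This is purely illustrative (it ignores vertex uncertainty and is exponential in the number of edges, exactly the cost the paper avoids); the graphs, edge names, and helper functions are hypothetical:

```python
from itertools import permutations, product

def world_prob(world, edge_probs):
    """Probability of one possible world: each uncertain edge is
    independently present (prob p) or absent (prob 1 - p)."""
    p = 1.0
    for e, pe in edge_probs.items():
        p *= pe if e in world else (1.0 - pe)
    return p

def contains_subgraph(world_edges, query_edges):
    """Brute-force subgraph-isomorphism test for tiny graphs: try every
    injective mapping of the query vertices into the world vertices."""
    wv = sorted({v for e in world_edges for v in e})
    qv = sorted({v for e in query_edges for v in e})
    ws = {frozenset(e) for e in world_edges}
    for target in permutations(wv, len(qv)):
        m = dict(zip(qv, target))
        if all(frozenset((m[a], m[b])) in ws for a, b in query_edges):
            return True
    return False

def sip(edge_probs, query_edges):
    """SIP = sum of the probabilities of the possible worlds to which the
    query is subgraph-isomorphic (enumerates all 2^|E| worlds)."""
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        world = {e for e, keep in zip(edges, mask) if keep}
        if contains_subgraph(world, query_edges):
            total += world_prob(world, edge_probs)
    return total
```

For a triangle whose three edges each exist with probability 0.5, a single-edge query has SIP 1 - 0.5³ = 0.875, which matches the enumeration.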
Query Processing Framework
Naïve method: sequentially scan G and decide, for each gi, whether the SIP between q and gi is not smaller than the threshold ε.
Deciding whether g1 is subgraph-isomorphic to g2: NP-complete.
Calculating the SIP: #P-complete.
The naïve method is therefore very costly and infeasible in practice.
Query Processing Framework
Filter-and-verification:
Query q → Filtering over {g1, g2, …, gn} → Candidates {g′1, g′2, …, g′m} → Verification → Answers {g″1, g″2, …, g″k}
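The pipeline above can be sketched as a generic skeleton. The filter and verifier are passed in as callables because the deck presents them later; the function and parameter names are illustrative, not from the paper:

```python
def filter_and_verify(database, q, eps, filter_fn, verify_fn):
    """Filter-and-verification skeleton.
    filter_fn(g, q, eps) returns 'pruned', 'answer', or 'candidate'
    using cheap bounds; verify_fn(g, q) computes the exact (#P-hard)
    SIP and is called only on the surviving candidates."""
    answers, candidates = [], []
    for g in database:
        status = filter_fn(g, q, eps)
        if status == 'answer':
            answers.append(g)       # accepted without exact computation
        elif status == 'candidate':
            candidates.append(g)    # needs expensive verification
        # 'pruned' graphs are dropped immediately
    answers += [g for g in candidates if verify_fn(g, q) >= eps]
    return answers
```

The point of the design is that the exact SIP computation is confined to the (hopefully small) candidate set.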
Solutions
Filtering: structural pruning
Principle: if we remove all the uncertainty from g and the resulting certain graph g^c still does not contain q, then the original uncertain graph cannot contain q.
Theorem: if q is not subgraph-isomorphic to g^c, then Pr(q ⊑ g) = 0.
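The structural-pruning principle can be sketched with cheap necessary conditions. A real implementation would run a full subgraph-isomorphism test on g^c; the quick degree-sequence check below is only an illustrative pre-filter (its names and thresholds are assumptions, not the paper's algorithm):

```python
from collections import Counter

def certain_graph(edge_probs):
    """g^c: drop all uncertainty, keeping every edge whose existence
    probability is nonzero."""
    return [e for e, p in edge_probs.items() if p > 0]

def may_contain(gc_edges, q_edges):
    """Cheap necessary conditions for q to be a subgraph of g^c:
    g^c must have at least as many edges and vertices as q, and its
    sorted degree sequence must dominate q's.  If this returns False,
    Pr(q ⊑ g) = 0 and g can be pruned without any probability work."""
    def degs(edges):
        c = Counter()
        for a, b in edges:
            c[a] += 1
            c[b] += 1
        return sorted(c.values(), reverse=True)
    dg, dq = degs(gc_edges), degs(q_edges)
    return (len(gc_edges) >= len(q_edges)
            and len(dg) >= len(dq)
            and all(g >= q for g, q in zip(dg, dq)))
```

Graphs surviving this test would still go through an exact isomorphism check on g^c before any probabilistic pruning.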
Solutions
Probabilistic pruning: let f be a feature of g^c, i.e., f ⊆ g^c.
Rule 1: if f ⊆ q and UpperB(Pr(f ⊑ g)) < ε, then g is pruned.
(Since f ⊆ q, Pr(q ⊑ g) ≤ Pr(f ⊑ g) ≤ UpperB(Pr(f ⊑ g)) < ε.)
Solutions
Rule 2: if q ⊆ f and LowerB(Pr(f ⊑ g)) ≥ ε, then g is an answer.
(Since q ⊆ f, Pr(q ⊑ g) ≥ Pr(f ⊑ g) ≥ LowerB(Pr(f ⊑ g)) ≥ ε.)
Two main issues for probabilistic pruning:
How to derive lower and upper bounds of the SIP?
How to select features with great pruning power?
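The two rules combine into a single per-graph decision, sketched below. The function is a direct transcription of the rules, with illustrative names; the bounds themselves are computed separately (Technique 1):

```python
def classify(feature_rel, upper, lower, eps):
    """Apply the probabilistic pruning rules for one uncertain graph g
    and one indexed feature f.
    feature_rel: 'f_sub_q' if f is a subgraph of q,
                 'q_sub_f' if q is a subgraph of f.
    upper/lower: UpperB / LowerB on Pr(f ⊑ g)."""
    if feature_rel == 'f_sub_q' and upper < eps:
        return 'pruned'     # Rule 1: Pr(q⊑g) ≤ Pr(f⊑g) ≤ upper < eps
    if feature_rel == 'q_sub_f' and lower >= eps:
        return 'answer'     # Rule 2: Pr(q⊑g) ≥ Pr(f⊑g) ≥ lower ≥ eps
    return 'candidate'      # undecided: needs exact verification
```

Only the 'candidate' outcome triggers the expensive #P-complete verification step.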
Solutions
Technique 1: calculation of lower and upper bounds
Lemma: let Bf1, …, Bf|Ef| be all embeddings of f in g^c; then Pr(f ⊑ g) = Pr(Bf1 ∨ … ∨ Bf|Ef|).
UpperB(Pr(f ⊑ g)) is derived from this union of embedding events.
Solutions
Technique 1: calculation of lower and upper bounds
LowerB(Pr(f ⊑ g)): computing the tightest lower bound reduces to finding a maximum-weight clique in an auxiliary graph bG, which is NP-hard; a cheaper bound is used in practice.
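The simplest valid instances of such bounds can be sketched from the lemma alone. The paper derives tighter bounds; the pair below (any single embedding as a lower bound, the union bound as an upper bound) only illustrates the baseline idea, and the names are assumptions:

```python
def embedding_prob(embedding_edges, edge_probs):
    """Pr(one embedding Bf_i exists) = product of its edge probabilities,
    assuming independent edges."""
    p = 1.0
    for e in embedding_edges:
        p *= edge_probs[e]
    return p

def bounds(embeddings, edge_probs):
    """Simple valid bounds on Pr(f ⊑ g) = Pr(Bf_1 ∨ … ∨ Bf_k):
    - lower: the union is at least as likely as its likeliest member;
    - upper: Boole's union bound, capped at 1."""
    ps = [embedding_prob(emb, edge_probs) for emb in embeddings]
    return max(ps), min(1.0, sum(ps))
```

For two disjoint single-edge embeddings with probability 0.5 each, the exact union probability 0.75 falls between the computed bounds (0.5, 1.0).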
Solutions
Technique 1: calculation of lower and upper bounds
[Charts: exact value vs. the upper and lower bounds, comparing both the values and the computing time.]
Solutions
Technique 2: optimal feature selection
If we index all features, the index has the greatest pruning power, but querying such an index is very costly. We therefore select a small number of features with the greatest pruning power.
Cost model: maximize gain = sequential-scan cost - index-query cost.
Maximum set coverage is NP-complete; we use the greedy algorithm to approximate it.
Solutions
Technique 2: optimal feature selection
Maximum coverage: the greedy algorithm runs over the feature matrix and produces the probabilistic index.
The greedy algorithm approximates the optimal index within a factor of 1 - 1/e.
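The greedy maximum-coverage step can be sketched directly. The feature-matrix representation below (feature → set of graph ids it can prune) is an illustrative simplification of the paper's cost model:

```python
def greedy_max_coverage(features, k):
    """Greedy feature selection for maximum coverage.
    features: dict mapping feature -> set of graph ids it can prune.
    Repeatedly pick the feature covering the most not-yet-covered
    graphs; this classically achieves a (1 - 1/e) approximation of
    the optimal k-feature cover."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(features, key=lambda f: len(features[f] - covered),
                   default=None)
        if best is None or not (features[best] - covered):
            break               # no feature adds new coverage
        chosen.append(best)
        covered |= features[best]
    return chosen, covered
```

Note that after picking 'f1' below, 'f3' beats 'f2' because it covers two still-uncovered graphs rather than one.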
Solutions
Probabilistic index:
Construct a string for each feature.
Construct a prefix tree over all the feature strings.
Construct an inverted list for each leaf node.
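The three construction steps can be sketched together. How a feature is serialized into a string (e.g., a canonical code) is not shown on the slide, so the sketch takes the string as given; class and method names are assumptions:

```python
class PIndexNode:
    def __init__(self):
        self.children = {}    # prefix-tree edges, one per character
        self.graph_ids = []   # inverted list at a feature's end node

class PIndex:
    """Prefix tree over feature strings; the node where a feature string
    ends stores an inverted list of the uncertain graphs containing it."""
    def __init__(self):
        self.root = PIndexNode()

    def insert(self, feature_string, graph_id):
        node = self.root
        for ch in feature_string:
            node = node.children.setdefault(ch, PIndexNode())
        node.graph_ids.append(graph_id)

    def lookup(self, feature_string):
        node = self.root
        for ch in feature_string:
            if ch not in node.children:
                return []     # feature not indexed
            node = node.children[ch]
        return node.graph_ids
```

Shared prefixes between feature strings are stored once, which is the usual space argument for a trie over canonical codes.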
Solutions
Verification: iterative bound pruning
Lemma: Pr(q ⊑ g) = Pr(Bq1 ∨ … ∨ Bq|Eq|), where Bq1, …, Bq|Eq| are the embeddings of q in g^c.
Unfolding this union with the inclusion-exclusion principle yields successively tighter upper and lower bounds, which drive the iterative bound pruning.
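The iterative idea can be sketched via the Bonferroni inequalities: truncating the inclusion-exclusion expansion after an odd number of levels gives an upper bound, after an even number a lower bound, so the threshold test can often be decided before the full (exponential) expansion. Embeddings are represented as frozensets of edge keys, and all names are illustrative:

```python
from itertools import combinations

def joint_prob(embs, edge_probs):
    """Pr(all the given embeddings exist simultaneously) = product over
    the union of their edges (independent edges)."""
    edges = set().union(*embs)
    p = 1.0
    for e in edges:
        p *= edge_probs[e]
    return p

def iterative_bound_pruning(embeddings, edge_probs, eps):
    """Expand Pr(Bq_1 ∨ … ∨ Bq_k) level by level; stop as soon as the
    current Bonferroni bound decides the comparison against eps."""
    total, sign = 0.0, 1.0
    for r in range(1, len(embeddings) + 1):
        total += sign * sum(joint_prob(c, edge_probs)
                            for c in combinations(embeddings, r))
        if sign > 0 and total < eps:    # upper bound already below eps
            return 'pruned'
        if sign < 0 and total >= eps:   # lower bound already above eps
            return 'answer'
        sign = -sign
    return 'answer' if total >= eps else 'pruned'
```

With two single-edge embeddings of probability 0.5 each, the exact probability is 0.75, so the graph is an answer for ε = 0.5 and pruned for ε = 0.9.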
Solutions
Performance evaluation
Real dataset: uncertain PPI (1,500 uncertain graphs; 332 vertices and 584 edges on average; average probability 0.367).
Synthetic dataset: AIDS dataset (10k uncertain graphs; 24.3 vertices and 26.5 edges on average; probabilities generated from a Gaussian distribution).
Solutions
Performance evaluation: results on the real dataset [charts]
Solutions
Performance evaluation: further results on the real dataset [charts]
Solutions
Performance evaluation: response time and index-construction time [charts]
Solutions
Performance evaluation: results on the synthetic dataset, varying the mean and variance of the generated probabilities [charts]
Conclusion
We propose the first efficient solution for threshold-based probabilistic subgraph search over uncertain graph databases.
We employ a filter-and-verification framework and develop probability bounds for filtering.
We design a cost model to select a minimum number of features with the greatest pruning power.
We demonstrate the effectiveness of our solution through experiments on real and synthetic datasets.
Thanks!