Efficient Subgraph Search over Large Uncertain Graphs

1 Efficient Subgraph Search over Large Uncertain Graphs. Ye Yuan (1), Guoren Wang (1), Haixun Wang (2), Lei Chen (3). 1. Northeastern University, China; 2. Microsoft Research Asia; 3. HKUST

2 Outline: Ⅰ. Background; Ⅱ. Problem Definition; Ⅲ. Query Processing Framework; Ⅳ. Solutions; Ⅴ. Conclusions

3 Background: A graph is a complex data structure that is used in many real applications. Bioinformatics: yeast PPI networks, gene regulatory networks.

4 Background: Chemical compounds: benzene ring, compound database.

5 Background: Social networks: EntityCube, Web 2.0.

6 Background: In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs. The STRING database (http://string-db.org) is a data source that contains PPIs with uncertain edges derived from biological experiments. In visual pattern recognition, uncertain graphs are used to model visual objects. In social networks, uncertain links are used to represent possible relationships or the strength of influence between people. Therefore, it is important to study query processing on large uncertain graphs.

7 Outline: Ⅰ. Background; Ⅱ. Problem Definition; Ⅲ. Query Processing Framework; Ⅳ. Solutions; Ⅴ. Conclusions

8 Problem Definition: Probabilistic subgraph search. Uncertain graph: vertex uncertainty (the existence probability of a vertex) and edge uncertainty (the existence probability of an edge given that its two endpoints exist).

9 Problem Definition: Probabilistic subgraph search. Possible worlds: each combination of present and absent uncertain edges and vertices yields one possible world of the uncertain graph.
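
To make the possible-world idea concrete, here is a minimal Python sketch. It assumes a simplified model that the slides do not state: vertices always exist and edges exist independently. The function name `possible_worlds` and the toy graph are hypothetical.

```python
from itertools import product

def possible_worlds(edge_probs):
    """Yield (edge_set, probability) for every possible world.

    edge_probs maps an undirected edge (u, v) to its existence probability.
    Assumption: vertices are certain and edges exist independently.
    """
    edges = list(edge_probs)
    # Each edge is either absent (False) or present (True): 2^|E| worlds.
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        world = set()
        for edge, present in zip(edges, mask):
            p = edge_probs[edge]
            prob *= p if present else 1.0 - p
            if present:
                world.add(edge)
        yield frozenset(world), prob

# Hypothetical toy uncertain graph: a triangle with three uncertain edges.
toy_graph = {("a", "b"): 0.5, ("b", "c"): 0.8, ("a", "c"): 0.4}
worlds = list(possible_worlds(toy_graph))
total_probability = sum(p for _, p in worlds)
```

The number of worlds doubles with every uncertain edge, which is why enumeration only works for very small graphs.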

10 Problem Definition: Probabilistic subgraph search. Given: an uncertain graph database G = {g_1, g_2, …, g_n}, a query graph q, and a probability threshold ε. Query: find all g_i ∈ G such that the subgraph isomorphic probability is not smaller than ε. Subgraph isomorphic probability (SIP): the SIP between q and g_i is the sum of the probabilities of g_i's possible worlds to which q is subgraph isomorphic.

11 Problem Definition: Probabilistic subgraph search. Subgraph isomorphic probability (SIP): [figure: uncertain graph g, query q, and the possible worlds of g that contain q, whose probabilities sum to 0.27]. Calculating the SIP is #P-complete.
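
The SIP definition can be checked by brute force on a tiny example. The sketch below is illustrative only and is not the authors' algorithm: it assumes certain vertices and independent edges, and the names `is_subgraph` and `sip` are hypothetical. Its exponential cost is consistent with the #P-completeness noted on the slide.

```python
from itertools import permutations, product

def is_subgraph(q_edges, g_edges, g_vertices):
    """True if q maps injectively into g with every q edge landing on a g edge."""
    q_vertices = sorted({v for e in q_edges for v in e})
    present = {frozenset(e) for e in g_edges}
    for image in permutations(g_vertices, len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in present for u, v in q_edges):
            return True
    return False

def sip(q_edges, edge_probs):
    """Sum the probabilities of all possible worlds that contain q."""
    vertices = sorted({v for e in edge_probs for v in e})
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        world = []
        for edge, on in zip(edges, mask):
            prob *= edge_probs[edge] if on else 1.0 - edge_probs[edge]
            if on:
                world.append(edge)
        if is_subgraph(q_edges, world, vertices):
            total += prob
    return total

# Hypothetical toy data: an uncertain triangle and two small queries.
toy_graph = {("a", "b"): 0.5, ("b", "c"): 0.8, ("a", "c"): 0.4}
path_query = [("x", "y"), ("y", "z")]   # path with two edges
edge_query = [("x", "y")]               # a single edge
path_sip = sip(path_query, toy_graph)
edge_sip = sip(edge_query, toy_graph)
```

For the triangle, the two-edge path is contained in every world with at least two edges, so `path_sip` comes to 0.6, and the single edge is missing only from the empty world, so `edge_sip` comes to 0.94.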

12 Outline: Ⅰ. Background; Ⅱ. Problem Definition; Ⅲ. Query Processing Framework; Ⅳ. Solutions; Ⅴ. Conclusions

13 Query Processing Framework: Probabilistic subgraph query processing framework. Naïve method: sequentially scan the database G and decide, for each g_i, whether the SIP between q and g_i is not smaller than the threshold ε. Deciding whether g_1 is subgraph isomorphic to g_2 is NP-complete, and calculating the SIP is #P-complete, so the naïve method is very costly and infeasible.

14 Query Processing Framework: Probabilistic subgraph query processing framework. Filter-and-verification: query q and database {g_1, g_2, …, g_n} → filtering → candidates {g'_1, g'_2, …, g'_m} → verification → answers {g''_1, g''_2, …, g''_k}.
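
The two-stage flow can be sketched as a small driver function. This is a schematic under assumptions, not the authors' implementation: `upper_bound` and `exact_sip` are hypothetical callables supplied by the caller, and the toy database simply stores precomputed numbers in place of real graphs.

```python
def filter_and_verify(database, query, eps, upper_bound, exact_sip):
    """Return (candidates, answers) for a threshold query.

    Filtering keeps a graph only if a cheap upper bound on its SIP reaches eps.
    Verification runs the expensive exact computation on the survivors only.
    """
    candidates = [gid for gid, g in database.items()
                  if upper_bound(query, g) >= eps]
    answers = [gid for gid in candidates
               if exact_sip(query, database[gid]) >= eps]
    return candidates, answers

# Hypothetical toy data: each "graph" holds a precomputed bound and SIP.
toy_db = {
    "g1": {"upper": 0.9, "sip": 0.7},
    "g2": {"upper": 0.3, "sip": 0.2},
    "g3": {"upper": 0.8, "sip": 0.4},
}
candidates, answers = filter_and_verify(
    toy_db, query=None, eps=0.5,
    upper_bound=lambda q, g: g["upper"],
    exact_sip=lambda q, g: g["sip"],
)
```

Here `g2` is discarded by the filter without any exact computation, and `g3` survives filtering but fails verification.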

15 Outline: Ⅰ. Background; Ⅱ. Problem Definition; Ⅲ. Query Processing Framework; Ⅳ. Solutions; Ⅴ. Conclusions

16 Solutions: Filtering: structural pruning. Principle: if we remove all the uncertainty from g and the resulting graph g^c still does not contain q, then the original uncertain graph cannot contain q. Theorem: if q ⊈ g^c, then Pr(q ⊆ g) = 0. [figure: uncertain graph g and query q]
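
A minimal sketch of structural pruning follows. It assumes an uncertain graph is given as a mapping from edges to probabilities, and it uses a brute-force containment test that only suits toy inputs. The names `contains` and `structurally_pruned` are hypothetical.

```python
from itertools import permutations

def contains(q_edges, g_edges):
    """Brute-force test whether q is subgraph isomorphic to the certain graph g."""
    q_vertices = sorted({v for e in q_edges for v in e})
    g_vertices = sorted({v for e in g_edges for v in e})
    present = {frozenset(e) for e in g_edges}
    for image in permutations(g_vertices, len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        if all(frozenset((mapping[u], mapping[v])) in present for u, v in q_edges):
            return True
    return False

def structurally_pruned(q_edges, edge_probs):
    """Drop the probabilities to obtain g^c; prune g if g^c does not contain q."""
    certain_graph = list(edge_probs)  # keep every edge, ignore its probability
    return not contains(q_edges, certain_graph)

# Hypothetical toy data: an uncertain path a-b-c.
toy_path = {("a", "b"): 0.9, ("b", "c"): 0.2}
triangle_query = [("x", "y"), ("y", "z"), ("x", "z")]
edge_query = [("x", "y")]
pruned_triangle = structurally_pruned(triangle_query, toy_path)
pruned_edge = structurally_pruned(edge_query, toy_path)
```

The path can never contain a triangle in any possible world, so it is pruned without computing a probability, while the single-edge query must go on to the later stages.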

17 Solutions: Probabilistic pruning: let f be a feature of g^c, i.e., f ⊆ g^c. Rule 1: if f ⊆ q and UpperB(Pr(f ⊆ g)) < ε, then g is pruned. ∵ f ⊆ q, ∴ Pr(q ⊆ g) ≤ Pr(f ⊆ g) < ε. [figure: uncertain graph, feature, query and ε]

18 Solutions: Rule 2: if q ⊆ f and LowerB(Pr(f ⊆ g)) ≥ ε, then g is an answer. ∵ q ⊆ f, ∴ Pr(q ⊆ g) ≥ Pr(f ⊆ g) ≥ ε. Two main issues for probabilistic pruning: How to derive lower and upper bounds of the SIP? How to select features with great pruning power? [figure: uncertain graph, feature, query and ε]
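
The two pruning rules amount to a three-way decision per graph. The sketch below assumes the bounds have already been computed for the relevant features. The function name and its parameters are hypothetical.

```python
def probabilistic_filter(eps, sub_feature_upper_bounds, super_feature_lower_bounds):
    """Apply Rule 1 and Rule 2 to one uncertain graph g.

    sub_feature_upper_bounds: UpperB(Pr(f ⊆ g)) for features f with f ⊆ q.
    super_feature_lower_bounds: LowerB(Pr(f ⊆ g)) for features f with q ⊆ f.
    """
    # Rule 1: Pr(q ⊆ g) <= Pr(f ⊆ g) < eps, so g cannot be an answer.
    if any(ub < eps for ub in sub_feature_upper_bounds):
        return "pruned"
    # Rule 2: Pr(q ⊆ g) >= Pr(f ⊆ g) >= eps, so g is an answer without verification.
    if any(lb >= eps for lb in super_feature_lower_bounds):
        return "answer"
    # Neither rule fires: g must be verified.
    return "candidate"

decision_a = probabilistic_filter(0.5, [0.9, 0.4], [])
decision_b = probabilistic_filter(0.5, [0.9], [0.6])
decision_c = probabilistic_filter(0.5, [0.9], [0.3])
```

Only graphs that end up as `"candidate"` need the expensive exact verification.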

19 Solutions: Technique 1: calculation of lower and upper bounds. Lemma: let Bf_1, …, Bf_|Ef| be all embeddings of f in g^c; then Pr(f ⊆ g) = Pr(Bf_1 ∨ … ∨ Bf_|Ef|). UpperB(Pr(f ⊆ g)): [formula not captured in the transcript]

20 Solutions: Technique 1: calculation of lower and upper bounds. LowerB(Pr(f ⊆ g)): [formula not captured in the transcript]. Tightest LowerB(f): finding it reduces to computing the maximum weight clique of a graph bG, which is NP-hard.
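
The lower-bound formula itself did not survive in the transcript, so the sketch below is not the authors' bound. It shows one standard way to obtain a valid lower bound, assuming independent edges: pairwise edge-disjoint embeddings are independent events, so the probability that at least one of them exists is 1 - ∏(1 - Pr(B_i)), and this cannot exceed the probability that any embedding exists. The greedy selection stands in for the NP-hard optimal choice. All names and data are hypothetical.

```python
from itertools import product
from math import prod

def embedding_prob(embedding, edge_probs):
    """Probability that every edge of one embedding exists (independent edges)."""
    return prod(edge_probs[e] for e in embedding)

def greedy_lower_bound(embeddings, edge_probs):
    """Pick pairwise edge-disjoint embeddings greedily and combine them."""
    chosen, used = [], set()
    ranked = sorted(embeddings,
                    key=lambda b: embedding_prob(b, edge_probs), reverse=True)
    for emb in ranked:
        if used.isdisjoint(emb):
            chosen.append(emb)
            used |= emb
    return 1.0 - prod(1.0 - embedding_prob(b, edge_probs) for b in chosen)

def exact_union_prob(embeddings, edge_probs):
    """Exact Pr(B_1 ∨ … ∨ B_k) by enumerating all worlds (toy sizes only)."""
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        world = {e for e, on in zip(edges, mask) if on}
        p = prod(edge_probs[e] if on else 1.0 - edge_probs[e]
                 for e, on in zip(edges, mask))
        if any(b <= world for b in embeddings):
            total += p
    return total

# Hypothetical toy data: four uncertain edges and three overlapping embeddings.
toy_edges = {"e1": 0.5, "e2": 0.8, "e3": 0.4, "e4": 0.9}
toy_embeddings = [frozenset({"e1", "e2"}), frozenset({"e2", "e3"}),
                  frozenset({"e3", "e4"})]
lower = greedy_lower_bound(toy_embeddings, toy_edges)
exact = exact_union_prob(toy_embeddings, toy_edges)
```

On this toy input the greedy choice keeps {e1, e2} and {e3, e4}, and the resulting bound stays below the exact value. A better selection of disjoint embeddings would tighten it, which is where the maximum weight clique formulation on the slide comes in.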

21 Solutions: Technique 1: calculation of lower and upper bounds. Exact value vs. upper and lower bounds [figure: panels for value and computing time]

22 Solutions: Technique 2: optimal feature selection. If we index all features, the index has the most pruning power, but it is also very costly to query. We therefore select a small number of features with the greatest pruning power. Cost model: maximize gain = sequential scan cost - index query cost. This is a maximum set coverage problem, which is NP-complete, so we use a greedy algorithm to approximate it.

23 Solutions: Technique 2: optimal feature selection. Maximum coverage: the greedy algorithm selects features from the feature matrix to build the probabilistic index, which approximates the optimal index within a factor of 1 - 1/e.
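
The greedy step for maximum coverage can be sketched as follows. The feature matrix is modelled here as a mapping from each feature to the set of items it covers. What an "item" is in the authors' cost model is not recoverable from the transcript, so the integer items, feature names, and budget `k` are hypothetical.

```python
def greedy_max_coverage(coverage, k):
    """Select up to k features, each time taking the largest marginal gain.

    coverage maps a feature name to the set of items it covers.
    This is the classic greedy with the 1 - 1/e approximation guarantee.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best = max(coverage, key=lambda f: len(coverage[f] - covered), default=None)
        # Stop early when no feature adds anything new.
        if best is None or not coverage[best] - covered:
            break
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical feature matrix: which items each feature can cover.
toy_coverage = {
    "f1": {1, 2, 3, 4},
    "f2": {3, 4, 5},
    "f3": {5, 6},
    "f4": {1, 6},
}
chosen, covered = greedy_max_coverage(toy_coverage, k=2)
```

The first round picks `f1` because it covers four items. The second round picks `f3`, the only feature that still adds two new items.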

24 Solutions: Probabilistic index. Construct a string for each feature. Construct a prefix tree over all feature strings. Construct an inverted list for each leaf node.
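
The three construction steps suggest a trie whose terminal nodes carry inverted lists. The sketch below is one possible reading. The feature strings are made-up placeholders rather than a real canonical graph encoding, and storing a (graph id, lower bound, upper bound) triple in each posting is an assumption.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.postings = None  # inverted list, set only where a feature string ends

def build_index(feature_postings):
    """Build a prefix tree over feature strings with an inverted list per feature.

    feature_postings maps a feature string to a list of
    (graph_id, lower_bound, upper_bound) entries (hypothetical layout).
    """
    root = TrieNode()
    for feature_string, postings in feature_postings.items():
        node = root
        for symbol in feature_string:
            node = node.children.setdefault(symbol, TrieNode())
        node.postings = list(postings)
    return root

def lookup(root, feature_string):
    """Return the inverted list of a feature, or [] if it is not indexed."""
    node = root
    for symbol in feature_string:
        node = node.children.get(symbol)
        if node is None:
            return []
    return node.postings or []

# Hypothetical feature strings and bounds.
toy_index = build_index({
    "abc": [("g1", 0.2, 0.6), ("g3", 0.5, 0.9)],
    "abd": [("g2", 0.1, 0.3)],
})
hits = lookup(toy_index, "abc")
misses = lookup(toy_index, "abx")
```

Features that share a prefix share a path in the tree, so a lookup costs time proportional to the length of the feature string rather than the number of indexed features.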

25 Solutions: Verification: iterative bound pruning. Lemma: Pr(q ⊆ g) = Pr(Bq_1 ∨ … ∨ Bq_|Eq|). Unfolding: [formulas not captured in the transcript]. The unfolding is based on the inclusion-exclusion principle and yields iterative bound pruning.
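
The unfolded formulas are missing from the transcript, so the following is a hedged sketch of how inclusion-exclusion can drive iterative bounds, not the authors' exact procedure. It relies on the Bonferroni inequalities: truncating the inclusion-exclusion sum after an odd number of terms gives an upper bound and after an even number a lower bound. It also assumes independent edges. All names and data are hypothetical.

```python
from itertools import combinations
from math import prod

def iterative_bound_decision(embeddings, edge_probs, eps):
    """Decide whether Pr(B_1 ∨ … ∨ B_n) >= eps, stopping as early as possible.

    Returns (decision, number_of_inclusion_exclusion_terms_used).
    """
    partial = 0.0
    n = len(embeddings)
    for k in range(1, n + 1):
        # S_k: sum over all k-subsets of the probability that all k embeddings exist,
        # which is the product of the probabilities of the edges in their union.
        s_k = sum(
            prod(edge_probs[e] for e in frozenset().union(*combo))
            for combo in combinations(embeddings, k)
        )
        partial += s_k if k % 2 == 1 else -s_k
        if k % 2 == 1 and partial < eps:
            return False, k   # upper bound already below the threshold
        if k % 2 == 0 and partial >= eps:
            return True, k    # lower bound already reaches the threshold
    return partial >= eps, n  # all terms used: partial is the exact probability

# Hypothetical toy data: four uncertain edges and three overlapping embeddings.
toy_edges = {"e1": 0.5, "e2": 0.8, "e3": 0.4, "e4": 0.9}
toy_embeddings = [frozenset({"e1", "e2"}), frozenset({"e2", "e3"}),
                  frozenset({"e3", "e4"})]
accept = iterative_bound_decision(toy_embeddings, toy_edges, eps=0.4)
reject = iterative_bound_decision(toy_embeddings, toy_edges, eps=0.7)
```

With a threshold of 0.4 the second-order lower bound already settles the query, so the third-order term is never computed. With a threshold of 0.7 all three terms are needed before the graph can be rejected.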

26 Solutions: Performance evaluation. Real dataset: uncertain PPI networks; 1,500 uncertain graphs; 332 vertices and 584 edges on average; average probability 0.367. Synthetic dataset: AIDS dataset with probabilities generated from a Gaussian distribution; 10k uncertain graphs; 24.3 vertices and 26.5 edges on average.

27 Solutions: Performance evaluation. Results on the real dataset [figure]

28 Solutions: Performance evaluation. Results on the real dataset, continued [figure]

29 Solutions: Performance evaluation. Response and construction time [figure]

30 Solutions: Performance evaluation. Results on the synthetic dataset [figure: panels for mean and variance]

31 Outline: Ⅰ. Background; Ⅱ. Problem Definition; Ⅲ. Query Processing Framework; Ⅳ. Solutions; Ⅴ. Conclusions

32 Conclusion: We propose the first efficient solution to answer threshold-based probabilistic subgraph search over uncertain graph databases. We employ a filter-and-verification framework and develop probability bounds for filtering. We design a cost model to select a minimum number of features with the largest pruning ability. We demonstrate the effectiveness of our solution through experiments on real and synthetic data sets.

33 Thanks!

