Efficient Subgraph Search over Large Uncertain Graphs
Ye Yuan 1, Guoren Wang 1, Haixun Wang 2, Lei Chen 3
1. Northeastern University, China  2. Microsoft Research Asia  3. HKUST
Outline
Ⅰ Background
Ⅱ Problem Definition
Ⅲ Query Processing Framework
Ⅳ Solutions
Ⅴ Conclusions
Background
The graph is a rich data structure used in many real applications.
Bioinformatics: yeast PPI networks, gene regulatory networks
Background
Chemistry: compound databases, e.g., compounds containing a benzene ring
Background
Social networks: EntityCube, Web 2.0
Background
In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs.
The STRING database (http://string-db.org) is a data source containing PPIs whose uncertain edges come from biological experiments.
In visual pattern recognition, uncertain graphs are used to model visual objects.
In social networks, uncertain links represent possible relationships or the strength of influence between people.
Therefore, it is important to study query processing on large uncertain graphs.
Problem Definition
Probabilistic subgraph search
Uncertain graph: vertex uncertainty (existence probability) and edge uncertainty (existence probability given the edge's two endpoints)
Problem Definition
Probabilistic subgraph search
Possible worlds: all combinations of the uncertain edges and vertices; each world is a certain graph that occurs with some probability.
Problem Definition
Probabilistic subgraph search
Given: an uncertain graph database G = {g1, g2, …, gn}, a query graph q, and a probability threshold ε.
Query: find all gi ∈ G such that the subgraph-isomorphism probability between q and gi is not smaller than ε.
Subgraph-isomorphism probability (SIP): the SIP between q and gi is the sum of the probabilities of gi's possible worlds to which q is subgraph-isomorphic.
Problem Definition
Probabilistic subgraph search
Subgraph-isomorphism probability (SIP): in the worked example, summing the probabilities of the possible worlds of g that contain q gives Pr(q ⊑ g) = 0.27.
Computing the SIP is #P-complete.
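The possible-world semantics of the SIP can be made concrete with a brute-force sketch. This is purely illustrative (it ignores vertex uncertainty and is exponential in the number of edges, exactly the cost the paper avoids); the graphs, edge names, and helper functions are hypothetical:

```python
from itertools import permutations, product

def world_prob(world, edge_probs):
    """Probability of one possible world: each uncertain edge is
    independently present (prob p) or absent (prob 1 - p)."""
    p = 1.0
    for e, pe in edge_probs.items():
        p *= pe if e in world else (1.0 - pe)
    return p

def contains_subgraph(world_edges, query_edges):
    """Brute-force subgraph-isomorphism test for tiny graphs: try every
    injective mapping of the query vertices into the world vertices."""
    wv = sorted({v for e in world_edges for v in e})
    qv = sorted({v for e in query_edges for v in e})
    ws = {frozenset(e) for e in world_edges}
    for target in permutations(wv, len(qv)):
        m = dict(zip(qv, target))
        if all(frozenset((m[a], m[b])) in ws for a, b in query_edges):
            return True
    return False

def sip(edge_probs, query_edges):
    """SIP = sum of the probabilities of the possible worlds to which the
    query is subgraph-isomorphic (enumerates all 2^|E| worlds)."""
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        world = {e for e, keep in zip(edges, mask) if keep}
        if contains_subgraph(world, query_edges):
            total += world_prob(world, edge_probs)
    return total
```

For a triangle whose three edges each exist with probability 0.5, a single-edge query has SIP 1 - 0.5³ = 0.875, which matches the enumeration.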
Query Processing Framework
Naïve method: sequentially scan G and decide, for each gi, whether the SIP between q and gi is not smaller than the threshold ε.
Deciding whether g1 is subgraph-isomorphic to g2: NP-complete.
Calculating the SIP: #P-complete.
The naïve method is therefore very costly and infeasible in practice.
Query Processing Framework
Filter-and-verification:
Query q → Filtering over {g1, g2, …, gn} → Candidates {g′1, g′2, …, g′m} → Verification → Answers {g″1, g″2, …, g″k}
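The pipeline above can be sketched as a generic skeleton. The filter and verifier are passed in as callables because the deck presents them later; the function and parameter names are illustrative, not from the paper:

```python
def filter_and_verify(database, q, eps, filter_fn, verify_fn):
    """Filter-and-verification skeleton.
    filter_fn(g, q, eps) returns 'pruned', 'answer', or 'candidate'
    using cheap bounds; verify_fn(g, q) computes the exact (#P-hard)
    SIP and is called only on the surviving candidates."""
    answers, candidates = [], []
    for g in database:
        status = filter_fn(g, q, eps)
        if status == 'answer':
            answers.append(g)       # accepted without exact computation
        elif status == 'candidate':
            candidates.append(g)    # needs expensive verification
        # 'pruned' graphs are dropped immediately
    answers += [g for g in candidates if verify_fn(g, q) >= eps]
    return answers
```

The point of the design is that the exact SIP computation is confined to the (hopefully small) candidate set.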
Solutions
Filtering: structural pruning
Principle: if we remove all the uncertainty from g and the resulting certain graph g^c still does not contain q, then the original uncertain graph cannot contain q.
Theorem: if q is not subgraph-isomorphic to g^c, then Pr(q ⊑ g) = 0.
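The structural-pruning principle can be sketched with cheap necessary conditions. A real implementation would run a full subgraph-isomorphism test on g^c; the quick degree-sequence check below is only an illustrative pre-filter (its names and thresholds are assumptions, not the paper's algorithm):

```python
from collections import Counter

def certain_graph(edge_probs):
    """g^c: drop all uncertainty, keeping every edge whose existence
    probability is nonzero."""
    return [e for e, p in edge_probs.items() if p > 0]

def may_contain(gc_edges, q_edges):
    """Cheap necessary conditions for q to be a subgraph of g^c:
    g^c must have at least as many edges and vertices as q, and its
    sorted degree sequence must dominate q's.  If this returns False,
    Pr(q ⊑ g) = 0 and g can be pruned without any probability work."""
    def degs(edges):
        c = Counter()
        for a, b in edges:
            c[a] += 1
            c[b] += 1
        return sorted(c.values(), reverse=True)
    dg, dq = degs(gc_edges), degs(q_edges)
    return (len(gc_edges) >= len(q_edges)
            and len(dg) >= len(dq)
            and all(g >= q for g, q in zip(dg, dq)))
```

Graphs surviving this test would still go through an exact isomorphism check on g^c before any probabilistic pruning.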
Solutions
Probabilistic pruning: let f be a feature of g^c, i.e., f ⊆ g^c.
Rule 1: if f ⊆ q and UpperB(Pr(f ⊑ g)) < ε, then g is pruned.
(Since f ⊆ q, Pr(q ⊑ g) ≤ Pr(f ⊑ g) ≤ UpperB(Pr(f ⊑ g)) < ε.)
Solutions
Rule 2: if q ⊆ f and LowerB(Pr(f ⊑ g)) ≥ ε, then g is an answer.
(Since q ⊆ f, Pr(q ⊑ g) ≥ Pr(f ⊑ g) ≥ LowerB(Pr(f ⊑ g)) ≥ ε.)
Two main issues for probabilistic pruning:
How to derive lower and upper bounds of the SIP?
How to select features with great pruning power?
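The two rules combine into a single per-graph decision, sketched below. The function is a direct transcription of the rules, with illustrative names; the bounds themselves are computed separately (Technique 1):

```python
def classify(feature_rel, upper, lower, eps):
    """Apply the probabilistic pruning rules for one uncertain graph g
    and one indexed feature f.
    feature_rel: 'f_sub_q' if f is a subgraph of q,
                 'q_sub_f' if q is a subgraph of f.
    upper/lower: UpperB / LowerB on Pr(f ⊑ g)."""
    if feature_rel == 'f_sub_q' and upper < eps:
        return 'pruned'     # Rule 1: Pr(q⊑g) ≤ Pr(f⊑g) ≤ upper < eps
    if feature_rel == 'q_sub_f' and lower >= eps:
        return 'answer'     # Rule 2: Pr(q⊑g) ≥ Pr(f⊑g) ≥ lower ≥ eps
    return 'candidate'      # undecided: needs exact verification
```

Only the 'candidate' outcome triggers the expensive #P-complete verification step.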
Solutions
Technique 1: calculation of lower and upper bounds
Lemma: let Bf1, …, Bf|Ef| be all embeddings of f in g^c; then Pr(f ⊑ g) = Pr(Bf1 ∨ … ∨ Bf|Ef|).
UpperB(Pr(f ⊑ g)) is derived from this union of embedding events.
Solutions
Technique 1: calculation of lower and upper bounds
LowerB(Pr(f ⊑ g)): computing the tightest lower bound reduces to finding a maximum-weight clique in an auxiliary graph bG, which is NP-hard; a cheaper bound is used in practice.
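The simplest valid instances of such bounds can be sketched from the lemma alone. The paper derives tighter bounds; the pair below (any single embedding as a lower bound, the union bound as an upper bound) only illustrates the baseline idea, and the names are assumptions:

```python
def embedding_prob(embedding_edges, edge_probs):
    """Pr(one embedding Bf_i exists) = product of its edge probabilities,
    assuming independent edges."""
    p = 1.0
    for e in embedding_edges:
        p *= edge_probs[e]
    return p

def bounds(embeddings, edge_probs):
    """Simple valid bounds on Pr(f ⊑ g) = Pr(Bf_1 ∨ … ∨ Bf_k):
    - lower: the union is at least as likely as its likeliest member;
    - upper: Boole's union bound, capped at 1."""
    ps = [embedding_prob(emb, edge_probs) for emb in embeddings]
    return max(ps), min(1.0, sum(ps))
```

For two disjoint single-edge embeddings with probability 0.5 each, the exact union probability 0.75 falls between the computed bounds (0.5, 1.0).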
Solutions
Technique 1: calculation of lower and upper bounds
[Charts: exact value vs. the upper and lower bounds, comparing both the values and the computing time.]
Solutions
Technique 2: optimal feature selection
If we index all features, the index has the greatest pruning power, but querying such an index is very costly. We therefore select a small number of features with the greatest pruning power.
Cost model: maximize gain = sequential-scan cost - index-query cost.
Maximum set coverage is NP-complete; we use the greedy algorithm to approximate it.
Solutions
Technique 2: optimal feature selection
Maximum coverage: the greedy algorithm runs over the feature matrix and produces the probabilistic index.
The greedy algorithm approximates the optimal index within a factor of 1 - 1/e.
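The greedy maximum-coverage step can be sketched directly. The feature-matrix representation below (feature → set of graph ids it can prune) is an illustrative simplification of the paper's cost model:

```python
def greedy_max_coverage(features, k):
    """Greedy feature selection for maximum coverage.
    features: dict mapping feature -> set of graph ids it can prune.
    Repeatedly pick the feature covering the most not-yet-covered
    graphs; this classically achieves a (1 - 1/e) approximation of
    the optimal k-feature cover."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(features, key=lambda f: len(features[f] - covered),
                   default=None)
        if best is None or not (features[best] - covered):
            break               # no feature adds new coverage
        chosen.append(best)
        covered |= features[best]
    return chosen, covered
```

Note that after picking 'f1' below, 'f3' beats 'f2' because it covers two still-uncovered graphs rather than one.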
Solutions
Probabilistic index:
Construct a string for each feature.
Construct a prefix tree over all the feature strings.
Construct an inverted list for each leaf node.
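The three construction steps can be sketched together. How a feature is serialized into a string (e.g., a canonical code) is not shown on the slide, so the sketch takes the string as given; class and method names are assumptions:

```python
class PIndexNode:
    def __init__(self):
        self.children = {}    # prefix-tree edges, one per character
        self.graph_ids = []   # inverted list at a feature's end node

class PIndex:
    """Prefix tree over feature strings; the node where a feature string
    ends stores an inverted list of the uncertain graphs containing it."""
    def __init__(self):
        self.root = PIndexNode()

    def insert(self, feature_string, graph_id):
        node = self.root
        for ch in feature_string:
            node = node.children.setdefault(ch, PIndexNode())
        node.graph_ids.append(graph_id)

    def lookup(self, feature_string):
        node = self.root
        for ch in feature_string:
            if ch not in node.children:
                return []     # feature not indexed
            node = node.children[ch]
        return node.graph_ids
```

Shared prefixes between feature strings are stored once, which is the usual space argument for a trie over canonical codes.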
Solutions
Verification: iterative bound pruning
Lemma: Pr(q ⊑ g) = Pr(Bq1 ∨ … ∨ Bq|Eq|), where Bq1, …, Bq|Eq| are the embeddings of q in g^c.
Unfolding this union with the inclusion-exclusion principle yields successively tighter upper and lower bounds, which drive the iterative bound pruning.
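The iterative idea can be sketched via the Bonferroni inequalities: truncating the inclusion-exclusion expansion after an odd number of levels gives an upper bound, after an even number a lower bound, so the threshold test can often be decided before the full (exponential) expansion. Embeddings are represented as frozensets of edge keys, and all names are illustrative:

```python
from itertools import combinations

def joint_prob(embs, edge_probs):
    """Pr(all the given embeddings exist simultaneously) = product over
    the union of their edges (independent edges)."""
    edges = set().union(*embs)
    p = 1.0
    for e in edges:
        p *= edge_probs[e]
    return p

def iterative_bound_pruning(embeddings, edge_probs, eps):
    """Expand Pr(Bq_1 ∨ … ∨ Bq_k) level by level; stop as soon as the
    current Bonferroni bound decides the comparison against eps."""
    total, sign = 0.0, 1.0
    for r in range(1, len(embeddings) + 1):
        total += sign * sum(joint_prob(c, edge_probs)
                            for c in combinations(embeddings, r))
        if sign > 0 and total < eps:    # upper bound already below eps
            return 'pruned'
        if sign < 0 and total >= eps:   # lower bound already above eps
            return 'answer'
        sign = -sign
    return 'answer' if total >= eps else 'pruned'
```

With two single-edge embeddings of probability 0.5 each, the exact probability is 0.75, so the graph is an answer for ε = 0.5 and pruned for ε = 0.9.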
Solutions
Performance evaluation
Real dataset: uncertain PPI (1,500 uncertain graphs; 332 vertices and 584 edges on average; average probability 0.367).
Synthetic dataset: AIDS dataset (10k uncertain graphs; 24.3 vertices and 26.5 edges on average; probabilities generated from a Gaussian distribution).
Solutions
Performance evaluation: results on the real dataset [charts]
Solutions
Performance evaluation: further results on the real dataset [charts]
Solutions
Performance evaluation: response time and index-construction time [charts]
Solutions
Performance evaluation: results on the synthetic dataset, varying the mean and variance of the generated probabilities [charts]
Conclusion
We propose the first efficient solution for threshold-based probabilistic subgraph search over uncertain graph databases.
We employ a filter-and-verification framework and develop probability bounds for filtering.
We design a cost model to select a minimum number of features with the greatest pruning power.
We demonstrate the effectiveness of our solution through experiments on real and synthetic datasets.
Thanks!