Power-Law Based Estimation of Set Similarity Join Size Hongrae Lee, University of British Columbia Raymond T. Ng, University of British Columbia Kyuseok.

Presentation transcript:

Power-Law Based Estimation of Set Similarity Join Size Hongrae Lee, University of British Columbia Raymond T. Ng, University of British Columbia Kyuseok Shim, Seoul National University

Highly Similar, but Not the Same, Data
Nearly word-for-word copy: the duplicate does not cite the original
Similar news articles

Introduction
Finding all pairs of similar objects is a very common task
– Near duplicate detection
– Data integration
– Record linkage
– Web search
Examples of similar records
– "Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD Conference 2004:" vs. "S. Sarawagi, A. Kirpal, Efficient set joins on similarity predicates. SIGMOD"
– "sp|P45680|YFMU_COXBU HYPOTHETICAL 15.8 KD PROTEIN IN FMU-RP..." vs. "sp|P45680|YFMU_COXBU H 15.8 KD PROTEIN IN FMU-RP..."

Set Similarity Join (SSJoin)
SSJoin is proposed as a general framework for finding similar objects
Input
– two collections of sets, R and S
– similarity function sim
– similarity threshold τ
Output
– all pairs (r, s), r ∈ R, s ∈ S, such that sim(r, s) ≥ τ
Example (documents represented as sets of word n-grams)
– {bolt, destroy, 200, meter, record} and {bolt, smashes, 200, meter, world, record, berlin}: Jaccard similarity = 0.5

Estimation of SSJoin Size
SSJoin in RDBMS
– SSJoin operator as a primitive operator [Chaudhuri, Ganti, Kaushik 06]
– Data cleaning as a repetitive operation [Fuxman, Fazli, Miller 05]
Efficient and accurate estimation of SSJoin size is crucial in query optimization
– Poor size estimations can result in sub-optimal plans
[Figure: two alternative query plans over S.A, R.A and T.B, one with SSJ, NL and Seek, the other with SSJ, HM and Scan; different optimal plans depending on SSJ size]

Problem Statement
Input
– a collection of sets R (self-join)
– threshold τ on Jaccard similarity J_S
Output
– SSJ(τ): the number of pairs (r, s) such that J_S(r, s) ≥ τ, r, s ∈ R and r ≠ s
Jaccard similarity J_S
– J_S(r, s) = |r ∩ s| / |r ∪ s|
– e.g., J_S({1,2,3}, {2,3,4}) = |{2,3}| / |{1,2,3,4}| = 0.5
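The problem statement can be made concrete with a brute-force baseline. The sketch below is not the estimator proposed in the talk; it is the exhaustive pair-wise count that an estimator tries to avoid. The names `jaccard` and `ssjoin_size` and the third set in `R` are illustrative assumptions.

```python
from itertools import combinations


def jaccard(r, s):
    """Jaccard similarity |r ∩ s| / |r ∪ s| of two sets."""
    r, s = set(r), set(s)
    union = r | s
    return len(r & s) / len(union) if union else 0.0


def ssjoin_size(collection, tau):
    """Exhaustively count unordered pairs (r, s), r != s, with J_S(r, s) >= tau."""
    return sum(1 for r, s in combinations(collection, 2) if jaccard(r, s) >= tau)


# The first two sets are the slide's example; the third is made up for illustration.
R = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}]
print(jaccard(R[0], R[1]))  # 0.5
print(ssjoin_size(R, 0.5))  # 1
```

The exhaustive count needs a number of comparisons quadratic in |R|, which is why a query optimizer needs a cheaper estimate.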

Related Work
Set similarity join (or selection)
– [Sarawagi, Kirpal 04], [Chaudhuri, Ganti, Kaushik 06], [Arasu, Ganti, Kaushik 06], [Bayardo, Ma, Srikant 07], [Xiao, Wang, Lin 08], [Xiao, Wang, Lin, Yu 08], [Hadjieleftheriou, Chandel, Koudas, Srivastava 08], [Xiao, Wang, Lin, Shang 09]
Hashed Samples: selectivity estimation of set similarity selection queries
– [Hadjieleftheriou, Yu, Koudas, Srivastava 08]
Estimation of the number of frequent patterns
– [Chuang, Huang, Chen 08], [Jin, McCallen, Breitbart, Fuhry, Wang 09], [Boley, Grosskreutz 09]

Outline
Introduction
Signature pattern & Lattice counting
Power-law based estimation
Correction of the estimation
Experimental results

Min-Hash Signature [Cohen 97] [Broder, Glassman, Manasse, Zweig 97]
Min-wise hash function
– Prob[h(r) = h(s)] = |r ∩ s| / |r ∪ s|
Min-Hash signature
– Use M min-wise hash functions, h_1, …, h_M
– J_S(r, s) ≈ fraction of signature positions at which the Min-Hash values agree
Example (M = 4)
– r = {1,3,53,55,23,534,…}, sig(r) = [4,3,5,2]
– s = {2,3,50,51,52,53,…}, sig(s) = [4,3,3,5]
– J_S(r, s) ≈ 2/4
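A minimal sketch of building Min-Hash signatures and estimating J_S from them. The hash family used here, random linear functions (a*x + b) mod p over integer elements, is an assumption made for illustration: it is only approximately min-wise and is not necessarily what the authors used. All function names are hypothetical.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, larger than any element hashed below


def make_hash_functions(m, seed=0):
    """Draw m random (a, b) pairs defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(m)]


def minhash_signature(items, hash_functions):
    """Signature = minimum hash value of the set under each hash function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_functions]


def estimate_jaccard(sig_r, sig_s):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for u, v in zip(sig_r, sig_s) if u == v)
    return matches / len(sig_r)


# The slide's signatures agree on 2 of M = 4 positions.
print(estimate_jaccard([4, 3, 5, 2], [4, 3, 3, 5]))  # 0.5

# Signatures computed from integer sets (values depend on the random seed).
hs = make_hash_functions(200, seed=42)
r, s = {1, 2, 3, 4, 5, 6}, {4, 5, 6, 7, 8, 9}
print(estimate_jaccard(minhash_signature(r, hs), minhash_signature(s, hs)))
```

A larger M reduces the variance of the estimate at the cost of longer signatures.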

Min-Hash Representation of Sets
We work on Min-Hash signatures of sets
– Succinct representation enables faster analysis
– Min-Hash signatures preserve Jaccard similarity between original sets
– Might be readily available
Example, M (signature size) = 4
DB                               Sig(DB)
r1  {7,10,19,52,67}              sig(r1)  [4,3,5,2]
r2  {10,19,43,52}                sig(r2)  [4,3,3,5]
r3  {10,13,43,52,67,85}          sig(r3)  [4,3,2,2]
r4  {10,38,43,49,80,94}          sig(r4)  [3,3,3,2]
r5  {3,25,29,47,50,66,73,75}     sig(r5)  [1,1,1,2]

Signature Pattern
Define a signature pattern to represent frequently co-occurring signature values
Signature pattern
– A Min-Hash signature, possibly with 'X' ('X': don't-care position)
– A signature (set) matches a pattern if it (its signature) agrees with the pattern on all non-X positions
  e.g., [4,3,5,2] matches patterns [4,3,X,X] and [X,3,5,2] (and many more), but does not match [4,3,2,X] (position matters)
– length: # non-X positions
– freq (support count): # matching signatures in the DB

An Example Signature Pattern
Sig(DB)
sig(r1)  [4,3,5,2]
sig(r2)  [4,3,3,5]
sig(r3)  [4,3,2,2]
sig(r4)  [3,3,3,2]
sig(r5)  [1,1,1,2]
Signature pattern [4,3,X,X]: pattern length 2, pattern freq 3 (r1, r2, r3)
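The definitions of match, length and freq translate directly into code. In this sketch the don't-care symbol 'X' is represented by `None`; that encoding and the function names are assumptions for illustration.

```python
X = None  # don't-care position

SIG_DB = {
    "r1": [4, 3, 5, 2],
    "r2": [4, 3, 3, 5],
    "r3": [4, 3, 2, 2],
    "r4": [3, 3, 3, 2],
    "r5": [1, 1, 1, 2],
}


def matches(signature, pattern):
    """A signature matches if it agrees with the pattern on every non-X position."""
    return all(p is X or p == v for v, p in zip(signature, pattern))


def pattern_length(pattern):
    """Number of non-X positions."""
    return sum(1 for p in pattern if p is not X)


def pattern_freq(pattern, sig_db):
    """Support count: number of signatures in the DB that match the pattern."""
    return sum(1 for sig in sig_db.values() if matches(sig, pattern))


print(matches([4, 3, 5, 2], [4, 3, X, X]))  # True
print(matches([4, 3, 5, 2], [4, 3, 2, X]))  # False
print(pattern_length([4, 3, X, X]))         # 2
print(pattern_freq([4, 3, X, X], SIG_DB))   # 3
```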

# Similar Pairs by Pattern Frequency
Pattern with freq f and length i → C(f, 2) pairs have at least i matching positions in their signatures (J_S ≥ i/M)
– pattern length → J_S (estimated)
– pattern frequency → # pairs
Example: signature pattern [4,3,X,X], pattern length 2, pattern freq 3 (r1, r2, r3)
– C(3, 2) = 3 signature pairs match on at least 2 positions
– J_S(r1,r2), J_S(r2,r3), J_S(r3,r1) ≥ 2/4 (est.)

SSJoin Size By Pattern Frequency
Given threshold τ, we find all patterns with length ≥ τ*M
For each pattern, C(freq, 2) pairs satisfy τ
Example, τ = 0.5
Signature pattern    Length  Matching set  Freq  Matching pair set            # pairs
sig1=[4, 3, X, X]    2       r1, r2, r3    3     {(r1,r2),(r1,r3),(r2,r3)}    3
sig2=[4, X, X, 2]    2       r1, r3        2     {(r1,r3)}                    1
sig3=[X, 3, 3, X]    2       r2, r4        2     {(r2,r4)}                    1
sig4=[X, 3, X, 2]    2       r1, r3, r4    3     {(r1,r3),(r1,r4),(r3,r4)}    3
sig5=[4, 3, X, 2]    3       r1, r3        2     {(r1,r3)}                    1
Naïve approach for SSJoin size: sum # pairs from all patterns, ∑ = 9
There are overlaps in pattern frequency and thus in # pairs
We need the cardinality of the union of the matching pair sets
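The gap between the naïve sum and the union can be checked by enumerating every pattern of length ≥ τ*M with freq ≥ 2 over the example Sig(DB). This exhaustive enumeration is only an illustration; it is exactly the expensive step that the talk replaces with pattern counts. `mine_patterns` is a hypothetical helper and `None` stands for 'X'.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

SIG_DB = {
    "r1": [4, 3, 5, 2],
    "r2": [4, 3, 3, 5],
    "r3": [4, 3, 2, 2],
    "r4": [3, 3, 3, 2],
    "r5": [1, 1, 1, 2],
}


def mine_patterns(sig_db, min_length, min_freq=2):
    """Map each pattern (tuple, None = don't care) to the ids of matching signatures."""
    m = len(next(iter(sig_db.values())))
    found = defaultdict(list)
    for rid, sig in sig_db.items():
        for length in range(min_length, m + 1):
            for positions in combinations(range(m), length):
                pattern = tuple(sig[i] if i in positions else None for i in range(m))
                found[pattern].append(rid)
    return {p: ids for p, ids in found.items() if len(ids) >= min_freq}


min_length = 2  # tau * M with tau = 0.5 and M = 4
patterns = mine_patterns(SIG_DB, min_length)

# Naive approach: add C(freq, 2) over all patterns, counting some pairs repeatedly.
naive_sum = sum(comb(len(ids), 2) for ids in patterns.values())

# Union of matching pair sets: each pair counted once.
pair_union = set()
for ids in patterns.values():
    pair_union.update(combinations(ids, 2))

print(len(patterns), naive_sum, len(pair_union))  # 5 9 6
```

The pair (r1, r3) is contributed by four of the five patterns, which is where the overcount comes from.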

Computing the Union Size
We can compute the union size with the Inclusion-Exclusion (IE) formula
– Combinatorial # operations!
Signature pattern    Matching pair set
sig1=[4, 3, X, X]    S1 = {(r1,r2),(r1,r3),(r2,r3)}
sig2=[4, X, X, 2]    S2 = {(r1,r3)}
sig3=[X, 3, X, 2]    S3 = {(r1,r3),(r1,r4),(r3,r4)}
|S1 ∪ S2 ∪ S3| = |S1| + |S2| + |S3| − (|S1 ∩ S2| + |S2 ∩ S3| + |S3 ∩ S1|) + |S1 ∩ S2 ∩ S3|

Efficient Evaluation of IE-Formula
SSJ(0.5) = |S1 ∪ S2 ∪ S3|
= |S1| + |S2| + |S3| − (|S1 ∩ S2| + |S2 ∩ S3| + |S3 ∩ S1|) + |S1 ∩ S2 ∩ S3|
= 1*(|S1| + |S2| + |S3|) + (−3 + 1)*|S4|
Pattern lattice: [4,3,X,X], [4,X,X,2], [X,3,X,2] at level 2; [4,3,X,2] at level 3
Matching pair lattice: S1 = {(r1,r2),(r1,r3),(r2,r3)}, S2 = {(r1,r3)}, S3 = {(r1,r3),(r1,r4),(r3,r4)} at level 2; S4 = {(r1,r3)} at level 3
Patterns and matching pairs exhibit lattice structure
– layer nodes according to the pattern length (= level)
– edges: inclusion relationship
– patterns of length < τ*M are not shown
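The equality between the full IE expansion and the lattice form can be verified on the three pair sets of the example. This sketch evaluates the generic IE formula over all subsets, which is the combinatorial cost the lattice form avoids; `union_size_ie` is a hypothetical name.

```python
from itertools import combinations


def union_size_ie(sets):
    """Inclusion-Exclusion: alternate-sign sum of all k-wise intersection sizes."""
    total = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(sets, k):
            total += sign * len(set.intersection(*group))
    return total


S1 = {("r1", "r2"), ("r1", "r3"), ("r2", "r3")}  # pairs matching [4,3,X,X]
S2 = {("r1", "r3")}                              # pairs matching [4,X,X,2]
S3 = {("r1", "r3"), ("r1", "r4"), ("r3", "r4")}  # pairs matching [X,3,X,2]
S4 = S1 & S2 & S3                                # pairs matching [4,3,X,2]

ie = union_size_ie([S1, S2, S3])
lattice_form = 1 * (len(S1) + len(S2) + len(S3)) + (-3 + 1) * len(S4)
print(ie, lattice_form, len(S1 | S2 | S3))  # 5 5 5
```

In this example every pairwise intersection and the triple intersection equal S4, which is why the four intersection terms collapse into the single coefficient (−3 + 1).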

Lattice Counting
Compute SSJoin size from the 'pattern distribution' (# patterns for each length and frequency)
Basically a simplified IE-formula computation using lattices
Does not store actual matching sets or pair sets, only counts!
Signature pattern    Length  Matching set  Freq  Matching pair set            # pairs
sig1=[4, 3, X, X]    2       r1, r2, r3    3     {(r1,r2),(r1,r3),(r2,r3)}    3
sig2=[4, X, X, 2]    2       r1, r3        2     {(r1,r3)}                    1
sig3=[X, 3, 3, X]    2       r2, r4        2     {(r2,r4)}                    1
sig4=[X, 3, X, 2]    2       r1, r3, r4    3     {(r1,r3),(r1,r4),(r3,r4)}    3
sig5=[4, 3, X, 2]    3       r1, r3        2     {(r1,r3)}                    1
Pattern distribution (counted from the table above)
Length  Frequency  # patterns
2       3          2
2       2          2
3       2          1
Please see the paper for details

Outline
Introduction
Signature pattern & Lattice counting
Power-law based estimation
Correction of the estimation
Experimental results

Pattern Distribution
[Table: Length, Frequency, # patterns]
[Plot: # of patterns (pattern count) vs. pattern frequency, one series per level: level 2 (pattern length = 2) and level 3 (pattern length = 3)]
There are 2 patterns that match 3 sets and whose length is 2, i.e., sig1=[4,3,X,X] → (r1,r2,r3) and sig4=[X,3,X,2] → (r1,r3,r4)
If we have the exact pattern distribution, we can estimate the SSJoin size exactly
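A pattern distribution is just a count of patterns per (length, frequency) bucket. The sketch below derives it exhaustively from the five example signatures; the function name and the use of `None` for 'X' are assumptions, and a real system would obtain these counts from a frequent pattern miner rather than by enumeration.

```python
from collections import Counter
from itertools import combinations

SIG_DB = {
    "r1": [4, 3, 5, 2],
    "r2": [4, 3, 3, 5],
    "r3": [4, 3, 2, 2],
    "r4": [3, 3, 3, 2],
    "r5": [1, 1, 1, 2],
}


def pattern_distribution(sig_db, min_length, min_freq=2):
    """Return Counter {(length, freq): # patterns} without keeping matching sets."""
    m = len(next(iter(sig_db.values())))
    freq = Counter()
    for sig in sig_db.values():
        for length in range(min_length, m + 1):
            for positions in combinations(range(m), length):
                freq[tuple(sig[i] if i in positions else None for i in range(m))] += 1
    dist = Counter()
    for pattern, f in freq.items():
        if f >= min_freq:
            length = sum(1 for v in pattern if v is not None)
            dist[(length, f)] += 1
    return dist


dist = pattern_distribution(SIG_DB, min_length=2)
for (length, f), count in sorted(dist.items()):
    print(length, f, count)
# 2 2 2
# 2 3 2
# 3 2 1
```

The bucket (2, 3) holds the two patterns named on the slide, [4,3,X,X] and [X,3,X,2].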

Exact Pattern Distribution
Computing the exact pattern distribution is infeasible
– We need the pattern distribution for freq ≥ 2 (min freq for generating a pair) → minimum support threshold = 2
– Most frequent pattern mining algorithms are not designed to handle such a low support threshold
– Even if they could, it would take too long to be used for query optimization purposes

Power-Law Distribution of Pattern Count
A Power-law distribution is observed in the # patterns vs. frequency relationship (or pattern count vs. support count) [Chuang, Huang, Chen 08]
Power law: count = β * frequency^(−α)
[Plot: pattern count vs. frequency; the mined pattern distribution lies above the minimum support threshold, the missing pattern distribution below it]
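One common way to fit count = β * frequency^(−α) is ordinary least squares on the log-log points, since log(count) = log(β) − α*log(frequency) is a straight line. The slides do not state which fitting procedure the authors used, so this is a sketch under that assumption; the data points are synthetic and the function names hypothetical.

```python
import math


def fit_power_law(points):
    """Fit count = beta * freq**(-alpha) by least squares in log-log space.

    points: iterable of (frequency, count) pairs with positive values.
    """
    xs = [math.log(f) for f, _ in points]
    ys = [math.log(c) for _, c in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    alpha = -slope
    beta = math.exp(my - slope * mx)
    return alpha, beta


def extrapolate(alpha, beta, frequency):
    """Predicted pattern count at a frequency below the mined range."""
    return beta * frequency ** (-alpha)


# Synthetic "mined" distribution for frequencies 5..9 (made-up data, alpha=2, beta=1000).
mined = [(f, 1000 * f ** -2) for f in range(5, 10)]
alpha, beta = fit_power_law(mined)
print(round(alpha, 6), round(beta, 3))
print(round(extrapolate(alpha, beta, 2), 3))  # predicted count at frequency 2
```

The fitted curve is what lets the method fill in the pattern counts for frequencies between 2 and the minimum support threshold actually mined.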

SSJoin Size Estimation
1. Find frequent patterns with minimum support threshold ξ > 2
2. Estimate the parameters of the Power-law distribution at each level with the acquired patterns
3. Compute the full pattern distribution based on the estimated parameters
4. Compute SSJoin size with the Lattice Counting formula

Outline
Introduction
Signature pattern & Lattice counting
Power-law based estimation
Correction of the estimation
Experimental results

Systematic Overestimation By Min-Hash
Big overestimation is observed, e.g., relative error
– J_S = 0.4: 10332%
– J_S = 0.5: 2614%
– J_S = 0.6: 573%
[Plot: # pairs vs. similarity, from exhaustive pair-wise comparison]

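Effect of Skewed Distribution
Assume 10% of pairs have +1 or −1 more matching positions in their Min-Hash signatures
[Table: # matching positions vs. T(i), the # pairs with J_S = i/M; most entries were lost in extraction, surviving fragments: 10,000, 1,…, and a worked sum "… – 2*… = 181"]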
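The effect can be reproduced with a small calculation. The counts below are hypothetical: only "10,000", a value starting with "1," and the result "181" survive in the transcript, so the levels 1,000, 100 and 10 are assumptions chosen to be consistent with those fragments. The model, also an assumption, is that 10% of the pairs at each level are observed one position higher and another 10% one position lower.

```python
def observed_at(true_counts, level, shift_percent=10):
    """Observed # pairs at an interior level when shift_percent% of every level
    moves one position up and the same share moves one position down."""
    own = true_counts[level]
    neighbours = true_counts[level - 1] + true_counts[level + 1]
    leaving = 2 * own * shift_percent // 100   # pairs that drift away from this level
    arriving = neighbours * shift_percent // 100  # pairs that drift in from both sides
    return own - leaving + arriving


# Hypothetical skewed counts for consecutive numbers of matching positions.
T = [10_000, 1_000, 100, 10]
print(observed_at(T, 2))  # 100 - 2*10 + 100 + 1 = 181
```

Because the lower-similarity level holds ten times more pairs, the inflow from it dominates the outflow, so the observed count (181) is far above the true count (100).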

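Probabilistic Modeling
Example: r = {1,2,4,5,7}, s = {1,2,3,4,5,6,8}, J_S(r, s) = 4/8 = 0.5
Pr(J = j | I = i) ≡ Prob(j matching positions when J_S = i/M)
E[# matching positions] = 2
Prob(3 matching positions)? [formula lost in extraction; surviving factor: (1 − 0.5)]
[Figure: sig(r) and sig(s); signature values lost in extraction]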

Considering All # Pairs
T(i): # pairs with J_S = i/M (true size)
O(j): # pairs with j matching positions in their signatures (observed size by Min-Hash)
P(j|i): transition probability
O(2) = T(0)*P(2|0) + T(1)*P(2|1) + T(2)*P(2|2) + T(3)*P(2|3) + T(4)*P(2|4)
In matrix form, AT = O:
| P(0|0) P(0|1) P(0|2) P(0|3) P(0|4) |   | T(0) |   | O(0) |
| P(1|0) P(1|1) P(1|2) P(1|3) P(1|4) |   | T(1) |   | O(1) |
| P(2|0) P(2|1) P(2|2) P(2|3) P(2|4) | * | T(2) | = | O(2) |
| P(3|0) P(3|1) P(3|2) P(3|3) P(3|4) |   | T(3) |   | O(3) |
| P(4|0) P(4|1) P(4|2) P(4|3) P(4|4) |   | T(4) |   | O(4) |
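A sketch of the forward model O = AT. It assumes that the M hash functions behave independently, so the number of matching positions for a pair with J_S = i/M is binomial with parameters M and i/M; the slides imply but do not spell out this form, so treat the formula for P(j|i) as an assumption. The function names and the vector T are hypothetical.

```python
from math import comb


def transition_matrix(m):
    """A[j][i] = P(j | i): probability of j matching positions when J_S = i/m,
    under a binomial model with success probability i/m."""
    return [
        [comb(m, j) * (i / m) ** j * (1 - i / m) ** (m - j) for i in range(m + 1)]
        for j in range(m + 1)
    ]


def observe(a, t):
    """Matrix-vector product O = A T."""
    return [sum(a_ji * t_i for a_ji, t_i in zip(row, t)) for row in a]


M = 4
A = transition_matrix(M)
print(A[3][2])  # P(3 | 2): 3 matching positions when J_S = 0.5 -> 0.25

# Hypothetical true distribution: 100 pairs, all with J_S = 2/4.
T = [0, 0, 100, 0, 0]
print(observe(A, T))  # [6.25, 25.0, 37.5, 25.0, 6.25]
```

Even when every pair has J_S exactly 0.5, fewer than half of them are observed with exactly 2 matching positions, which is why the observed vector has to be corrected back towards T.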

NNLS Optimization
AT = O, so T = A^(−1) O (A is non-singular)
– T may have negative values
– We actually have an estimated vector O', not the exact O
– O is highly skewed and lower entries make higher entries negligible
We solve a non-negative least squares (NNLS) constrained optimization problem
– Scale the matrix by a weight matrix W, with W_{i,i} = 1/O(i) and W_{i,j} = 0 for i ≠ j
– Minimize ∥WAX − WO∥ subject to X ≥ 0

SSJoin Size Estimation Algorithm
Min-Hash signatures of DB
→ freq. pattern mining algorithm (no need for actual patterns, only count # patterns) → partial pattern distribution (# patterns for each length)
→ Power-law parameter estimation → est. full pattern distribution
→ Lattice Counting → SSJoin size
→ NNLS optimization → error correction

Outline
Introduction
Signature pattern & Lattice counting
Power-law based estimation
Correction of the estimation
Experimental results

Experimental Setup
Dataset
– DBLP, 800K
– IBM Quest synthetic data, 50K
Compared algorithms
– LC(ξ): the proposed solution with a minimum support threshold of ξ
– Independent Sum (IS): without lattice counting
– LCNC(ξ): LC without the error correction step
– HS(ρ): Hashed Samples [Hadjieleftheriou, Yu, Koudas, Srivastava 08] adapted to SSJoin, Opt_Merge [Sarawagi, Kirpal 04]; ρ: sampling ratio
Evaluation metric
– Accuracy: actual count, relative error
– Runtime: pre-processing time, estimation time

Accuracy
[Plots: accuracy on DBLP and on synthetic data]
LC delivers accurate estimations for high similarity thresholds
HS: random samples will miss many highly similar pairs
HS: accurate enough for very low similarity thresholds

Runtime
[Plots: estimation time and pre-processing time, DBLP 40K]
LC is faster (with better accuracy)
LC's pre-processing time is smaller

Effect of Error Correction Step
[Plots: accuracy and runtime]
Huge overestimation without considering the overlaps
Error correction step effectively reduces the overestimation
Computational overhead of the error correction step is negligible

Scalability
[Plots: estimation time and pre-processing time]
Much slower increase in runtime and pre-processing time than HS and random sampling

Summary
Proposed an SSJoin size estimation algorithm based on Min-Hash signatures and frequent pattern mining techniques, with error correction
Evaluated the proposed algorithm with synthetic and real-world databases
Future work
– Apply recent developments in estimating the number of frequent patterns: random sampling

Thank you

Lattice Structure in Patterns
Patterns and corresponding matching pair sets have lattice structure
– Partial order by inclusion relationship, lub and glb by intersection and union
– E.g., if a set matches [4,3,X,2], it matches all of its children
– If a set matches both [4,3,X,X] and [4,X,X,2], it also matches [4,3,X,2]
We can compute the union size by the Inclusion-Exclusion (IE) formula
Lattice structures greatly simplify the IE-formula computation
Pattern lattice: [4,3,X,X], [4,X,X,2], [X,3,X,2] at level 2; [4,3,X,2] at level 3
Matching pair lattice: S1 = {(r1,r2),(r1,r3),(r2,r3)}, S2 = {(r1,r3)}, S3 = {(r1,r3),(r1,r4),(r3,r4)} at level 2; S4 = {(r1,r3)} at level 3

Lattice Counting
Lattice Counting (LC)
– Efficient computation of the IE-formula exploiting the underlying lattices
– LC(t) = ∑_{t ≤ i ≤ M} C_i * F_i, t = τ*M
– Level sum F_i: # pairs with i matching values in their signatures
– Coefficient C_i for level i collapses repeated computation of the same results into a single operation [Lee, Ng, Shim 07]
Only needs # patterns of freq f and length i
– e.g., if τ = 0.5 and M = 4, SSJ(0.5) = LC(2), and LC needs # patterns of length 2, 3 and 4 for each frequency
– Does not need actual patterns

Parameter Estimation Might Fail for Longer Patterns
There are in general fewer higher-level (longer) patterns
We may not have enough points for parameter estimation
LC(t) requires the pattern distributions for all levels t through M
– LC(t) = ∑_{t ≤ i ≤ M} C_i * F_i
Our solutions
– Approximate Lattice Counting
– Interpolation
[Plots: enough points for parameter estimation at i = 3; not enough points for parameter estimation at i = 9]

Approximate Lattice Counting
Full lattice (levels t, …, M): LC(t) = ∑_{t ≤ i ≤ M} C_i * F_i
Partial lattice (levels t, …, t+k): LC_k(t) = ∑_{t ≤ i ≤ t+k} C_{k,i} * F_i
Partial independence assumption: ignore high-level nodes, considering only nodes up to level t+k
t = τ*M, k: approximation constant
[Figure: full pattern lattice for M = 4 with levels 2, 3 and 4, next to the partial lattice with levels 2 and 3 only]

Estimation with Limited Pattern Distribution
An observation
– SSJoin size is highly skewed, and the pair count vs. Jaccard similarity relationship follows a Power law
Used for interpolation in case of very low support thresholds or NNLS optimization failure
[Plots: pair count (SSJoin size) vs. Jaccard similarity, DBLP]

Power Hypothesis
[Plots: DBLP and synthetic data]