Byung-Won On (Penn State Univ.) Nick Koudas (Univ. of Toronto) Dongwon Lee (Penn State Univ.) Divesh Srivastava (AT&T Labs – Research) Group Linkage ICDE.

Presentation transcript:

Byung-Won On (Penn State Univ.) Nick Koudas (Univ. of Toronto) Dongwon Lee (Penn State Univ.) Divesh Srivastava (AT&T Labs – Research) Group Linkage ICDE

Outline: Introduction; Matching; Bipartite Graph; Group Linkage; Bipartite matching; Pre-processing step to speed up (Greedy matching, Heuristic measure); Experiment & Result; Conclusion

Introduction: Poor quality data in databases comes from transcription errors, lack of standards for recording, and poor database design. How can we identify whether two entities are approximately the same? This is the group linkage problem. Ex: “J.Ullman”, “J.D.Ullman”, “Ullman, Jeffrey”.

Group Linkage Problem: Ex (figure): authors in the ACM and DBLP digital libraries (Lily Hsueh, K.L.Hsueh, Davy Jones, Peter Pan), each with a list of papers (Paper1 to Paper5). Group: each author. Records: a list of citations per author.

Matching: A matching in a graph G is a set of non-loop edges with no shared endpoints. A maximum matching is a matching that contains the largest possible number of edges.
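The two definitions can be checked mechanically on a tiny graph. The sketch below is only an illustration: the function names and the example path graph are assumptions, and the brute-force search is meant for a handful of edges, not for real data.

```python
from itertools import combinations

def is_matching(edges):
    """True if no edge is a loop and no two edges share an endpoint."""
    seen = set()
    for u, v in edges:
        if u == v or u in seen or v in seen:
            return False
        seen.update((u, v))
    return True

def maximum_matching(edges):
    """Brute force: try edge subsets from largest to smallest (tiny graphs only)."""
    edges = list(edges)
    for size in range(len(edges), 0, -1):
        for subset in combinations(edges, size):
            if is_matching(subset):
                return list(subset)
    return []

# Hypothetical path graph a - b - c - d.
path = [("a", "b"), ("b", "c"), ("c", "d")]
print(is_matching(path))       # the three edges share endpoints b and c
print(maximum_matching(path))  # a largest subset of pairwise disjoint edges
```

On the path graph, all three edges together are not a matching, while the two outer edges form a maximum matching.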

Bipartite Graph: A graph G is bipartite if its vertex set V is the union of two disjoint independent sets, called the partite sets of G. (Figure: bipartite matching.)

Group Linkage(1): The Jaccard similarity measure between two sets s1 and s2 is sim(s1, s2) = |s1 ∩ s2| / |s1 ∪ s2|. Records from the two groups can be put into a matching when they are identical.
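As a small illustration of the Jaccard measure applied to groups, the sketch below treats each group as a set of record identifiers, so only identical records count as shared. The paper labels and the convention for two empty sets are assumptions made for the example.

```python
def jaccard(s1, s2):
    """Jaccard similarity |s1 ∩ s2| / |s1 ∪ s2| of two collections, treated as sets."""
    s1, s2 = set(s1), set(s2)
    if not s1 and not s2:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(s1 & s2) / len(s1 | s2)

# Hypothetical groups: citation lists of two author entries.
g1 = {"Paper1", "Paper2", "Paper3"}
g2 = {"Paper2", "Paper3", "Paper4", "Paper5"}
print(jaccard(g1, g2))  # 2 shared records out of 5 distinct ones
```

Because records must match exactly, two citations that differ by a single character contribute nothing to the intersection, which is the limitation a record-level similarity function addresses.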

Group Linkage(2):
Notation        Description
D               Relation of multi-attribute records
g1, g2, ...     Groups of records in D
r1, r2, ...     Records in D
sim(ri, rj)     Arbitrary record-level similarity function
θ               Group-level similarity threshold
ρ               Record-level similarity threshold
M               Maximum weight bipartite matching
BM              Bipartite matching based group linkage

Group Linkage(3): (Figure: bipartite graph between records r11 to r14 of group g1 and records r21 to r25 of group g2.) Similar records are matched, and the group similarity is normalized. Ex: groups “K.L.Hsueh” and “Lily.Hsueh” with the similar records “Register Allocation & Spilling via graph coloring” and “Register Allocation and Spilling via graph coloring”.
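A minimal sketch of a bipartite matching based group similarity follows. Several parts are assumptions rather than the authors' implementation: `token_sim` is a hypothetical record-level similarity, the matching is found by exhaustive search (fine only for very small groups), and the normalization by |g1| + |g2| - |M| is chosen here because it reduces to the Jaccard measure when records must be identical; the transcript does not preserve the exact formula.

```python
def token_sim(a, b):
    """Hypothetical record-level similarity: Jaccard over lowercase word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def max_weight_matching(g1, g2, sim, rho):
    """Exhaustive search for a maximum weight matching over pairs with sim >= rho."""
    def search(i, used):
        if i == len(g1):
            return 0.0, []
        best_w, best_m = search(i + 1, used)  # option: leave g1[i] unmatched
        for j, rj in enumerate(g2):
            s = sim(g1[i], rj)
            if j not in used and s >= rho:
                w, m = search(i + 1, used | {j})
                if w + s > best_w:
                    best_w, best_m = w + s, [(i, j)] + m
        return best_w, best_m
    return search(0, frozenset())

def bm_sim(g1, g2, sim, rho):
    """Matching weight normalized by |g1| + |g2| - |M| (assumed normalization)."""
    if not g1 and not g2:
        return 0.0
    weight, m = max_weight_matching(g1, g2, sim, rho)
    return weight / (len(g1) + len(g2) - len(m))

g1 = ["Register Allocation & Spilling via graph coloring", "Group Linkage"]
g2 = ["Register Allocation and Spilling via graph coloring", "Group Linkage",
      "Text Joins in an RDBMS"]
print(bm_sim(g1, g2, token_sim, rho=0.5))
```

Two groups would then be linked when the returned value is at least the group-level threshold θ.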

Bipartite Matching: Record-level similarity measure: [5] S. Chaudhuri, V. Ganti, and R. Kaushik. “A Primitive Operator for Similarity Joins in Data Cleaning”. In IEEE ICDE, 2006. Maximum weight bipartite matching (BM): [10] S. Guha, N. Koudas, A. Marathe, and D. Srivastava. “Merging the Results of Approximate Match Operations”. In VLDB. Applying this strategy to every pair of groups is infeasible, hence a pre-processing step: greedy matching and a heuristic measure.

Greedy Matching(1): S1: For each record ri ∈ g1, find a record rj ∈ g2 with the highest record-level similarity among those with sim(ri, rj) ≥ ρ. S2: Same as S1, with the roles of g1 and g2 swapped. (Figure: records r11 to r14 of g1 and r21 to r25 of g2.) The result may not be a matching!
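The greedy step can be sketched as follows. The similarity values in `TABLE` are invented for illustration, and S2 is assumed here to be the same procedure run from g2 towards g1. The example shows why the greedy result may not be a matching: two records of g1 can both pick the same record of g2.

```python
def greedy_pairs(src, dst, sim, rho):
    """For each src index, pick the most similar dst index among those with sim >= rho."""
    pairs = set()
    for i, ri in enumerate(src):
        candidates = [(sim(ri, rj), j) for j, rj in enumerate(dst)]
        candidates = [(s, j) for s, j in candidates if s >= rho]
        if candidates:
            _, j = max(candidates, key=lambda c: c[0])
            pairs.add((i, j))
    return pairs

def is_bipartite_matching(pairs):
    """True if no left index and no right index is used more than once."""
    left = [i for i, _ in pairs]
    right = [j for _, j in pairs]
    return len(set(left)) == len(left) and len(set(right)) == len(right)

# Hypothetical record-level similarities (illustrative numbers only).
TABLE = {("r11", "r21"): 0.9, ("r12", "r21"): 0.8}

def table_sim(a, b):
    return TABLE.get((a, b), TABLE.get((b, a), 0.0))

g1, g2 = ["r11", "r12"], ["r21"]
rho = 0.5
s1 = greedy_pairs(g1, g2, table_sim, rho)                        # from g1 to g2
s2 = {(i, j) for j, i in greedy_pairs(g2, g1, table_sim, rho)}   # from g2 to g1

print(is_bipartite_matching(s1 | s2))  # r21 is picked by both r11 and r12
print(is_bipartite_matching(s1 & s2))  # pairs chosen from both sides
```

Pairs chosen from both directions always form a matching, since S1 gives each record of g1 at most one partner and S2 gives each record of g2 at most one partner.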

Greedy Matching(2): Upper and lower bounds to BMsim,ρ. (Figure: records r11 to r14 of g1 and r21 to r25 of g2.)

Greedy Matching(2): BMsim,ρ is bounded: LBsim,ρ ≤ BMsim,ρ ≤ UBsim,ρ. Only when LBsim,ρ < θ ≤ UBsim,ρ would the more expensive computation of BMsim,ρ be needed.
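A sketch of how cheap bounds could act as a filter in front of the exact measure. It assumes `lb` and `ub` are already computed and valid (lb ≤ BM ≤ ub); `link_decision` and `compute_bm` are hypothetical names, and the bound formulas themselves are not reproduced here.

```python
def link_decision(lb, ub, theta, compute_bm):
    """Decide whether two groups are linked, calling compute_bm only when needed."""
    if ub < theta:
        return False          # even the upper bound misses the threshold
    if lb >= theta:
        return True           # the lower bound already reaches the threshold
    return compute_bm() >= theta  # undecided: fall back to the expensive measure

calls = []

def expensive_bm():
    """Stand-in for the exact BM computation; records that it was invoked."""
    calls.append("bm")
    return 0.6

print(link_decision(0.1, 0.3, 0.5, expensive_bm))  # pruned by the upper bound
print(link_decision(0.7, 0.9, 0.5, expensive_bm))  # accepted by the lower bound
print(link_decision(0.4, 0.8, 0.5, expensive_bm))  # needs the exact computation
print(len(calls))                                  # how often the exact measure ran
```

Passing the exact computation as a callable keeps it lazy, so pairs decided by the bounds never pay for the matching.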

Heuristic Measure: In practice, pairs of groups with a high value of BMsim,ρ will share at least one pair of records with a high record-level similarity. MAXsim,ρ is a simpler and faster measure.
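One way to read the heuristic is as the best record-level similarity over all cross-group record pairs. The sketch below follows that reading; `token_sim` is a hypothetical similarity function, and returning 0 when no pair reaches ρ is an assumption about how the threshold is applied.

```python
def token_sim(a, b):
    """Hypothetical record-level similarity: Jaccard over lowercase word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def max_sim(g1, g2, sim, rho):
    """Highest record-level similarity between the groups, or 0.0 if below rho."""
    best = max((sim(a, b) for a in g1 for b in g2), default=0.0)
    return best if best >= rho else 0.0

g1 = ["Register Allocation & Spilling via graph coloring"]
g2 = ["Register Allocation and Spilling via graph coloring",
      "Text Joins in an RDBMS"]
print(max_sim(g1, g2, token_sim, rho=0.5))
```

The measure needs no matching at all, which is why it is cheaper than BM, but a single similar record pair is enough to make two otherwise unrelated groups look alike.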

Implementation: Implemented UBsim,ρ, LBsim,ρ, and MAXsim,ρ in SQL. (We only discuss UB.)
Notation:
group                    author
record in a group        citations of an author
group linkage problem    linkage between authors
key to link              author names

Experiment: Real data sets from the ACM and DBLP citation digital libraries.
R1: uniform data sets
  R1a: average # of citations: left=41, right=25
  R1b: average # of citations: left=40, right=55
R2: skewed data sets
  R2DB: average # of citations: left=30, right=9
  R2AI: average # of citations: left=31, right=10
  R2Net: average # of citations: left=22, right=6

Experiment: Synthetic data sets.
S1a and S1b: same as R1a, but dummy authors are injected into the right data set
  S1a: # of citations: 1/3
  S1b: # of citations: 3
S2: the “dbgen” tool is used to generate dummy authors with varying levels of errors, which are inserted into the right data set

Experiment: Evaluation metric: average recall. If a2 is included in the top-k answer window for a1, then recall is 1, and 0 otherwise. Compared methods: A(k1)|B(k2), where step 1 applies A with window size k1 and step 2 applies B with window size k2. Environment: Microsoft SQL Server 2000 on a Pentium III 3GHz/512MB machine.
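The recall metric can be sketched as below. The data layout is an assumption made for the example: `rankings` maps each query author a1 to its ranked candidate list, and `truth` maps a1 to its true counterpart a2.

```python
def average_recall(rankings, truth, k):
    """Fraction of query authors whose true match appears in the top-k window."""
    hits = [1 if truth[a1] in ranked[:k] else 0 for a1, ranked in rankings.items()]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical ranked answers for two query authors.
rankings = {
    "a1": ["x", "a2", "y"],
    "b1": ["y", "z", "b2"],
}
truth = {"a1": "a2", "b1": "b2"}

print(average_recall(rankings, truth, k=2))  # only a1 finds its match in the window
print(average_recall(rankings, truth, k=3))  # both matches fall inside the window
```

Widening the window can only keep or raise the recall, which is why the window sizes k1 and k2 of the two steps are reported alongside the methods.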

Results: uniform data set (R1, real data set). (Figure.)

Results: S1 and S2 synthetic data sets. JA incorrectly selects dummy authors. JA and BM are directly applied to S2. BM outperforms JA by 16-17%.

Results: R2 real data set. Pre-processing using UB and MAX: UB outperforms MAX in recall.

Results: Record-level similarity measure: cosine similarity with TF/IDF weighting. Running time against R2 (in sec).
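For readers unfamiliar with the record-level measure, here is a minimal sketch of cosine similarity over TF/IDF-weighted token vectors. Whitespace tokenization and the IDF variant log(1 + N/df) are assumptions; the exact weighting used in the experiments is not given in the transcript.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF/IDF vectors (dicts) with the assumed IDF log(1 + N/df)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * math.log(1 + n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "Register Allocation & Spilling via graph coloring",
    "Register Allocation and Spilling via graph coloring",
    "Text Joins in an RDBMS",
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]))  # near-duplicate titles
print(cosine(v[0], v[2]))  # titles with no shared tokens
```

Rare tokens receive a larger weight than tokens shared by many records, so a mismatch on a distinctive word lowers the score more than a mismatch on a common one.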

Results: window size. (Figure.)

Conclusion: Proposed a bipartite matching based group similarity measure to solve the group linkage problem. Proved that upper and lower bounds of BM can be used for speed-up. BM is a more robust group similarity measure than the others.