Web Data Integration Using Approximate String Join

Slides:



Advertisements
Similar presentations
PEBL: Web Page Classification without Negative Examples Hwanjo Yu, Jiawei Han, Kevin Chen- Chuan Chang IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,
Advertisements

Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Overcoming the L 1 Non- Embeddability Barrier Robert Krauthgamer (Weizmann Institute) Joint work with Alexandr Andoni and Piotr Indyk (MIT)
PARTITIONAL CLUSTERING
1 Efficient Record Linkage in Large Data Sets Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003.
Distributed Approximate Spectral Clustering for Large- Scale Datasets FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA PRESENTED BY : BITA KAZEMI ZAHRANI 1.
1 RegionKNN: A Scalable Hybrid Collaborative Filtering Algorithm for Personalized Web Service Recommendation Xi Chen, Xudong Liu, Zicheng Huang, and Hailong.
25/11/2013 A method to automatically identify road centerlines from georeferenced smartphone data XIV Brazilian Symposium on GeoInformatics (GEOINFO 2013)
Self Organization of a Massive Document Collection
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Person Name Disambiguation by Bootstrapping Presenter: Lijie Zhang Advisor: Weining Zhang.
“Random Projections on Smooth Manifolds” -A short summary
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
A shot at Netflix Challenge Hybrid Recommendation System Priyank Chodisetti.
RBF Neural Networks x x1 Examples inside circles 1 and 2 are of class +, examples outside both circles are of class – What NN does.
Dimensionality Reduction and Embeddings
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Dimensionality Reduction
Approximate Nearest Neighbors and the Fast Johnson-Lindenstrauss Transform Nir Ailon, Bernard Chazelle (Princeton University)
Detecting Image Region Duplication Using SIFT Features March 16, ICASSP 2010 Dallas, TX Xunyu Pan and Siwei Lyu Computer Science Department University.
Visual Querying By Color Perceptive Regions Alberto del Bimbo, M. Mugnaini, P. Pala, and F. Turco University of Florence, Italy Pattern Recognition, 1998.
Techniques and Data Structures for Efficient Multimedia Similarity Search.
Nearest Neighbor Retrieval Using Distance-Based Hashing Michalis Potamias and Panagiotis Papapetrou supervised by Prof George Kollios A method is proposed.
A Global Geometric Framework for Nonlinear Dimensionality Reduction Joshua B. Tenenbaum, Vin de Silva, John C. Langford Presented by Napat Triroj.
Dimensionality Reduction
Information Retrieval
Dimensionality Reduction. Multimedia DBs Many multimedia applications require efficient indexing in high-dimensions (time-series, images and videos, etc)
A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes Bilal Hawashin, Farshad Fotouhi Traian Marius Truta Department.
Approximation algorithms for large-scale kernel methods Taher Dameh School of Computing Science Simon Fraser University March 29 th, 2010.
Manifold learning: Locally Linear Embedding Jieping Ye Department of Computer Science and Engineering Arizona State University
Overview of Kernel Methods Prof. Bennett Math Model of Learning and Discovery 2/27/05 Based on Chapter 2 of Shawe-Taylor and Cristianini.
Fundamentals from Linear Algebra Ghan S. Bhatt and Ali Sekmen Mathematical Sciences and Computer Science College of Engineering Tennessee State University.
Surface Simplification Using Quadric Error Metrics Michael Garland Paul S. Heckbert.
Record Linkage: A 10-Year Retrospective Chen Li and Sharad Mehrotra UC Irvine 1.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Adaptive nonlinear manifolds and their applications to pattern.
SISAP’08 – Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding Ahmet Sacan and I. Hakki Toroslu
Data Reduction. 1.Overview 2.The Curse of Dimensionality 3.Data Sampling 4.Binning and Reduction of Cardinality.
Approximate XML Joins Huang-Chun Yu Li Xu. Introduction XML is widely used to integrate data from different sources. Perform join operation for XML documents:
Computer Vision Lab. SNU Young Ki Baik Nonlinear Dimensionality Reduction Approach (ISOMAP, LLE)
Automatic Detection of Social Tag Spams Using a Text Mining Approach Hsin-Chang Yang Associate Professor Department of Information Management National.
Data Mining over Hidden Data Sources Tantan Liu Depart. Computer Science & Engineering Ohio State University July 23, 2012.
Event retrieval in large video collections with circulant temporal encoding CVPR 2013 Oral.
Query Sensitive Embeddings Vassilis Athitsos, Marios Hadjieleftheriou, George Kollios, Stan Sclaroff.
Manifold learning: MDS and Isomap
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality Piotr Indyk, Rajeev Motwani The 30 th annual ACM symposium on theory of computing.
1 Latent Concepts and the Number Orthogonal Factors in Latent Semantic Analysis Georges Dupret
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Supervisor: Nakhmani Arie Semester: Winter 2007 Target Recognition Harmatz Isca.
A Multiresolution Symbolic Representation of Time Series Vasileios Megalooikonomou Qiang Wang Guo Li Christos Faloutsos Presented by Rui Li.
Progressive Sampling Instance Selection and Construction for Data Mining Ch 9. F. Provost, D. Jensen, and T. Oates 신수용.
Data Visualization Fall The Data as a Quantity Quantities can be classified in two categories: Intrinsically continuous (scientific visualization,
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Incremental Reduced Support Vector Machines Yuh-Jye Lee, Hung-Yi Lo and Su-Yun Huang National Taiwan University of Science and Technology and Institute.
6 6.1 © 2016 Pearson Education, Ltd. Orthogonality and Least Squares INNER PRODUCT, LENGTH, AND ORTHOGONALITY.
Spectral Methods for Dimensionality
Optimizing Parallel Algorithms for All Pairs Similarity Search
Computing and Compressive Sensing in Wireless Sensor Networks
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.
Research in Computational Molecular Biology , Vol (2008)
Mean Value Analysis of a Database Grid Application
Outline Nonlinear Dimension Reduction Brief introduction Isomap LLE
Predicting Traffic Dmitriy Bespalov.
Text Joins in an RDBMS for Web Data Integration
Efficient Record Linkage in Large Data Sets
Mashup Service Recommendation based on User Interest and Service Network Buqing Cao ICWS2013, IJWSR.
Text Joins for Data Cleansing and Integration in an RDBMS
Dimension versus Distortion a.k.a. Euclidean Dimension Reduction
Lecture 15: Least Square Regression Metric Embeddings
Group 9 – Data Mining: Data
Latent Semantic Analysis
Presentation transcript:

Web Data Integration Using Approximate String Join Yingping Huang and Gregory Madey Computer Science and Engineering University of Notre Dame WWW2004, New York, 5/19/2004

Introduction Web data integration is an important preprocessing step for web mining and data analysis. Approximate string processing is a fundamental step in many existing data cleansing algorithms. Approximate string join seeks to identify (almost) all pairs of strings whose distances are less than a certain threshold. Typical string distances include edit distance, q-gram distance and vector cosine similarity.

Related Work Li (2003) proposed a mapping algorithm where each string is mapped to a point in a high dimensional euclidean space using FastMap. Then a similarity join algorithm proposed by Hjaltason and Sanel (1998) is used to identify close points. Gravano (2003) presented a sampling approach for performing text join where each string is represented by a sparse vector in a high dimensional space. Then a join is performed on the resulting vector space.

Drawbacks of Previous Approach In Li (2003), the similarity join algorithms is computationally sensitive to the dimensionality of the hosting space. When the dimensionality gets large, the similarity join algorithms becomes very inefficient. In Gravano (2003), the sampling method uses a lower dimensional subspace for join. The accuracy of this approach depends on the dimensionality of the subspace. Usually, to obtain a high accuracy, the dimensionality of the subspace is close to the dimensionality of the original space.

Our Approach We first form the database of strings to be a (1,2)-B metric space and then map the (1,2)-B metric space into a high dimensional grid space. Pairs of points with distance 1 are identified in the grid space. Any two points in the grid space have distance 0, 1, or 2. A post join process is performed to remove false positives.

(1,2)-B Metric Space A metric space M=(X,D) is called a (1,2)-B metric space, if the distance between any two points is either 0, 1, or 2, and for any point in X, there are no more than B points within distance 1. For any two strings s and t, if their string distance is less than k, then we define their new distance to be 1, otherwise, we define their distance to be 2. We also assume that each string has at most B other strings that have string distance less than k.

Lemma Guruswami (2003) proved that a (1,2)-B metric space can be isometrically embedded into a high dimensional grid space, with dimensionality O(BlogN) where N is the size of the string database. An approximate matrix multiplication method is used to construct the actual mapping.

Results The precision and recall are both reasonably good.

Summary The previous figure shows that our approach achieves good precision and recall. It has some potential advantage over the algorithms presented by Li (2003) and Gravano (2003). The execution time is almost linear to the dimensionality of the hosting grid space.

References Li (2003): L. Jin, C. Li and S. Mehrotra. Efficient record linkage in large datasets. In Proc. 8th international conference on database systems for advanced applications. Gravano (2003): L. Gravano and P. Ipeirotis. Text join in an rdbms for web data integration. Proc. 12th international WWW conference. Guruswami (2003): V. Guruswami and P. Indyk. Embeddings and non-approximability of geometric problems. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms.