Soft Joins with TFIDF: Why and What

Slides:

Advertisements

Similar presentations

Introduction to Information Retrieval Introduction to Information Retrieval Lecture 7: Scoring and results assembly.

Advertisements

Matching and Clustering Croduct Descriptions using Learned Similarity Metrics William W. Cohen Google & CMU 2009 IJCAI Workshop on Information Integration.

Section 4.1 – Vectors (in component form)

Web Information Retrieval

Rocchio’s Algorithm 1. Motivation Naïve Bayes is unusual as a learner: – Only one pass through data – Order doesn’t matter 2.

Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.

Introduction to Information Retrieval

Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.

Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.

Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.

Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.

Uncertainty Representation. Gaussian Distribution variance Standard deviation.

Language Model based Information Retrieval: University of Saarland 1 A Hidden Markov Model Information Retrieval System Mahboob Alam Khalid.

Hinrich Schütze and Christina Lioma

MANISHA VERMA, VASUDEVA VARMA PATENT SEARCH USING IPC CLASSIFICATION VECTORS.

TFIDF-space  An obvious way to combine TF-IDF: the coordinate of document in axis is given by  General form of consists of three parts: Local weight.

Evaluating the Performance of IR Sytems

1 Basic Text Processing and Indexing. 2 Document Processing Steps Lexical analysis (tokenizing) Stopwords removal Stemming Selection of indexing terms.

INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,

Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.

Geometry of R 2 and R 3 Vectors in R 2 and R 3. NOTATION RThe set of real numbers R 2 The set of ordered pairs of real numbers R 3 The set of ordered.

CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.

Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.

In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest.

Matching and Clustering Croduct Descriptions using Learned Similarity Metrics William W. Cohen Google & CMU 2009 IJCAI Workshop on Information Integration.

Group Recommendations with Rank Aggregation and Collaborative Filtering Linas Baltrunas, Tadas Makcinskas, Francesco Ricci Free University of Bozen-Bolzano.

Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.

Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas W. Oard Association for Computational Linguistics,

Querying Structured Text in an XML Database By Xuemei Luo.

Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.

In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest.

Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.

Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Richa Varshney.

Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.

MAP-REDUCE ABSTRACTIONS 1. Abstractions On Top Of Hadoop We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps) –

Methods of data fusion in information retrieval rank vs. score combination D. Frank Hsu Department of Computer and Information Science Fordham University.

Vectors in Space. 1. Describe the set of points (x, y, z) defined by the equation (Similar to p.364 #7-14)

Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington.

Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.

Vector Space Models.

Searching Specification Documents R. Agrawal, R. Srikant. WWW-2002.

ISchool, Cloud Computing Class Talk, Oct 6 th Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective Tamer Elsayed,

Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.

Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

CS791 - Technologies of Google Spring A Webbased Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.

1 VLDB, Background What is important for the user.

Distance functions and IE - 3 William W. Cohen CALD.

EFFICIENT ALGORITHMS FOR APPROXIMATE MEMBER EXTRACTION By Swapnil Kharche and Pavan Basheerabad.

The Vector Space Models (VSM)

Indexing & querying text

A paper on Join Synopses for Approximate Query Answering

Information Retrieval and Web Search

IST 516 Fall 2011 Dongwon Lee, Ph.D.

Probabilistic Data Management

Implementation Issues & IR Systems

Martin Rajman, Martin Vesely

Workflows and Abstractions for Map-Reduce

Relational Algebra Chapter 4, Part A

Lecture 12: Data Wrangling

Rank Aggregation.

MatchCatcher: A Debugger for Blocking in Entity Matching

Implementation Based on Inverted Files

Section 3.1 – Vectors in Component Form

6. Implementation of Vector-Space Retrieval

Copyright © Cengage Learning. All rights reserved.

Boolean and Vector Space Retrieval Models

Workflows and Abstractions for Map-Reduce

VECTOR SPACE MODEL Its Applications and implementations

Presentation transcript:

Soft Joins with TFIDF: Why and What

Sim Joins on Product Descriptions Surface similarity can be high for descriptions of distinct items: AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ... AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop .. Surface similarity can be low for descriptions of identical items: Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust ... CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.

Motivation Integrating data is important Data from different sources may not have consistent object identifiers Especially automatically-constructed ones But databases will have human-readable names for the objects But names are tricky….

One solution: Soft (Similarity) joins A similarity join of two sets A and B is an ordered list of triples (sij,ai,bj) such that ai is from A bj is from B sij is the similarity of ai and bj the triples are in descending order the list is either the top K triples by sij or ALL triples with sij>L … or sometimes some approximation of these….

Example: soft joins/similarity joins Input: Two Different Lists of Entity Names … …

TFIDF similarity works well for names Low weight for mismatches on common abbreviations (NPRES) High weight for mismatches on unusual terms (Appomattox, Aniakchak, …) … …

Example: soft joins/similarity joins Output: Pairs of Names Ranked by Similarity identical … similar … less similar

How well does TFIDF work?

There are refinements to TFIDF distance – eg ones that extend with soft matching at the token level (e.g., softTFIDF)

Soft Joins with TFIDF: How?

TFIDF similarity

Soft TFIDF joins A similarity join of two sets of TFIDF-weighted vectors A and B is an ordered list of triples (sij,ai,bj) such that ai is from A bj is from B sij is the dot product of ai and bj the triples are in descending order the list is either the top K triples by sij or ALL triples with sij>L … or sometimes some approximation of these…. do not compare similarity of all pairs!

Parallel Soft JOINS

SIGMOD 2010 Adapted to Guinea Pig….

Parallel Inverted Index Softjoin - 1 want this to work for long documents or short ones…and keep the relations simple sumSquareWeights Statistics for computing TFIDF with IDFs local to each relation

Parallel Inverted Index Softjoin - 2 This join can be very big for common terms What’s the algorithm? Step 1: create document vectors as (Cd, d, term, weight) tuples Step 2: join the tuples from A and B: one sort and reduce Gives you tuples (a, b, term, w(a,term)*w(b,term)) Step 3: group the common terms by (a,b) and reduce to aggregate the components of the sum

Making the algorithm smarter….

Inverted Index Softjoin - 2 we should make a smart choice about which terms to use

Adding heuristics to the soft join - 1

Adding heuristics to the soft join - 2

Adding heuristics Parks: input 40k data 60k docvec 102k softjoin 539k tokens 508k documents 0 errors in top 50 w/ heuristics: input 40k data 60k docvec 102k softjoin 32k tokens 24k documents 3 errors in top 50 < 400 useful terms

Adding heuristics SO vs Wikipedia: input 612M docvec 1050M softjoin 3.7G join time: 54.30m with heuristics max DF=50 softjoin 2.5G join time 27:30m with heuristics max DF=20 softjoin 1.2G join time 11:50m includes 70% of the pairs with score>=0.95 Single-threaded Java baseline: 2-3 days