Matching and Clustering Product Descriptions using Learned Similarity Metrics William W. Cohen Google & CMU 2009 IJCAI Workshop on Information Integration.

Similar presentations
Introduction to Monte Carlo Markov chain (MCMC) methods

Google News Personalization: Scalable Online Collaborative Filtering
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Complex Networks for Representation and Characterization of Images For CS790g Project Bingdong Li 9/23/2009.
Text Categorization.
Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.
Context-based object-class recognition and retrieval by generalized correlograms by J. Amores, N. Sebe and P. Radeva Discussion led by Qi An Duke University.
Classification Classification Examples
PODC 2007 © 2007 IBM Corporation Constructing Scalable Overlays for Pub/Sub With Many Topics Problems, Algorithms, and Evaluation G. Chockler, R. Melamed,
+ Multi-label Classification using Adaptive Neighborhoods Tanwistha Saha, Huzefa Rangwala and Carlotta Domeniconi Department of Computer Science George.
Albert Gatt Corpora and Statistical Methods Lecture 13.
Introduction to Information Retrieval
Clustering Categorical Data The Case of Quran Verses
Integrated Instance- and Class- based Generative Modeling for Text Classification Antti PuurulaUniversity of Waikato Sung-Hyon MyaengKAIST 5/12/2013 Australasian.
Word Spotting DTW.
Imbalanced data David Kauchak CS 451 – Fall 2013.
Linked data: P redicting missing properties Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik,
Large-Scale Entity-Based Online Social Network Profile Linkage.
Contextual Advertising by Combining Relevance with Click Feedback D. Chakrabarti D. Agarwal V. Josifovski.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Video Google: Text Retrieval Approach to Object Matching in Videos Authors: Josef Sivic and Andrew Zisserman ICCV 2003 Presented by: Indriyati Atmosukarto.
Video Google: Text Retrieval Approach to Object Matching in Videos Authors: Josef Sivic and Andrew Zisserman University of Oxford ICCV 2003.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
Evaluating the Quality of Image Synthesis and Analysis Techniques Matthew O. Ward Computer Science Department Worcester Polytechnic Institute.
Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University
Recommender systems Ram Akella November 26 th 2008.
Scalable Text Mining with Sparse Generative Models
Face Recognition Using Neural Networks Presented By: Hadis Mohseni Leila Taghavi Atefeh Mirsafian.
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
More Machine Learning Linear Regression Squared Error L1 and L2 Regularization Gradient Descent.
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Querying Structured Text in an XML Database By Xuemei Luo.
윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.
In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest.
Today Ensemble Methods. Recap of the course. Classifier Fusion
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Data Reduction via Instance Selection Chapter 1. Background KDD  Nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable.
Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova , Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.
Powerpoint Templates Page 1 Powerpoint Templates Scalable Text Classification with Sparse Generative Modeling Antti PuurulaWaikato University.
Fast Effective Clustering for Graphs and Document Collections William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer.
Analysis of Social Media MLD , LTI William Cohen
Analysis of Social Media MLD , LTI William Cohen
Chapter 13 (Prototype Methods and Nearest-Neighbors )
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.
Unsupervised Learning on Graphs. Spectral Clustering: Graph = Matrix A B C F D E G I H J ABCDEFGHIJ A111 B11 C1 D11 E1 F111 G1 H111 I111 J11.
Contextual Search and Name Disambiguation in Using Graphs Einat Minkov, William W. Cohen, Andrew Y. Ng Carnegie Mellon University and Stanford University.
Spectral Clustering Shannon Quinn (with thanks to William Cohen of Carnegie Mellon University, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Wrapper Learning: Cohen et al 2002; Kushmeric 2000; Kushmeric & Frietag 2000 William Cohen 1/26/03.
Document Clustering with Prior Knowledge Xiang Ji et al. Document Clustering with Prior Knowledge. SIGIR 2006 Presenter: Suhan Yu.
Support Vector Machines
Semi-Supervised Clustering
Optimizing Parallel Algorithms for All Pairs Similarity Search
Constrained Clustering -Semi Supervised Clustering-
Soft Joins with TFIDF: Why and What
Learning with information of features
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Ensemble learning.
Support Vector Machines
Statistical Models and Machine Learning Algorithms --Review
Spectral clustering methods
“Traditional” image segmentation
Memory-Based Learning Instance-Based Learning K-Nearest Neighbor
MIRA, SVM, k-NN Lirong Xia. MIRA, SVM, k-NN Lirong Xia.
Learning and Memorization
Presentation transcript:

Matching and Clustering Product Descriptions using Learned Similarity Metrics William W. Cohen Google & CMU 2009 IJCAI Workshop on Information Integration and the Web Joint work with Frank Lin, John Wong, Natalie Glance, Charles Schafer, Roy Tromble

Scaling up Information Integration
Small scale integration
– Few relations, attributes, information sources, …
– Integrate using knowledge-based approaches
Medium scale integration
– More relations, attributes, information sources, …
– Statistical approaches work for entity matching (e.g., TFIDF) …
Large scale integration
– Many relations, attributes, information sources, …
– Statistical approaches appropriate for more tasks
– Scalability issues are crucial

Scaling up Information Integration
Outline:
– Product search as a large-scale II task
– Issue: determining identity of products with context-sensitive similarity metrics
– Scalable clustering techniques
– Conclusions

Google Product Search: A Large-Scale II Task

The Data
Feeds from merchants
– Attribute/value data where attribute & data can be any strings
The web
– Merchant sites, their content and organization
– Review sites, their content and organization
– Images, videos, blogs, links, …
User behavior
– Searches & clicks

Challenges: Identifying Bad Data
Spam detection
Duplicate merchants
Porn detection
Bad merchant names
Policy violations

Challenges: Structured data from the web
Offers from merchants
Merchant reviews
Product reviews
Manufacturer specs
...

Challenges: Understanding Products
Catalog construction
– Canonical description, feature values, price ranges, …
Taxonomy construction
– Nerf gun is a kind of toy, not a kind of gun
Opinion and mentions of products on the web
Relationships between products
– Accessories, compatible replacements, identity

Google Product Search: A Large-Scale II Task

Challenges: Understanding Offers
Identity
Category
Brand name
Model number
Price
Condition
...
Plausible baseline for determining if two products are identical:
1) pick a feature set
2) measure similarity with cosine/IDF, ...
3) threshold appropriately

Challenges: Understanding Offers
Identity
Category
Brand name
Model number
Price
Condition
...
Plausible baseline for determining if two products are identical:
1) pick a feature set
2) measure similarity with cosine/IDF, ...
3) threshold appropriately
Advantages of cosine/IDF:
– Robust: works well for many types of entities
– Very fast to compute sim(x,y)
– Very fast to find y: sim(x,y) > θ using inverted indices
– Extensive prior work on similarity joins
– Setting IDF weights requires no labeled data, requires only one pass over the unlabeled data, and is easily parallelized
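To make the baseline concrete, here is a minimal sketch of cosine/IDF matching over tokenized descriptions; the tokenizer, the use of raw token features, and the threshold value are illustrative assumptions, not the feature set actually used in the talk.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # crude lowercase word/number tokenizer; the real feature set is richer
    return re.findall(r"[a-z0-9]+", text.lower())

def idf_weights(docs):
    # one pass over the (unlabeled) corpus to get document frequencies
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    n = len(docs)
    return {t: math.log(n / df_t) for t, df_t in df.items()}

def idf_vector(doc, idf):
    counts = Counter(tokenize(doc))
    return {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def same_product(x, y, idf, threshold=0.8):
    # step 3 of the baseline: declare a match above a tuned threshold
    return cosine(idf_vector(x, idf), idf_vector(y, idf)) >= threshold
```

In practice the threshold θ would be tuned on held-out matches, and the inverted-index trick mentioned above would be used to avoid scoring all pairs.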

Product similarity: challenges
Similarity can be high for descriptions of distinct items:
o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge...
o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop...
Similarity can be low for descriptions of identical items:
o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust...
o CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.

Product similarity: challenges
Linguistic diversity and domain-dependent technical specs:
o "Camera angle finder" vs "right angle finder", "dioptric adjustment", "Aerospec Designed", "V countertop edge", ...
Labeled training data is not easy to produce for subdomains
Imperfect and/or poorly adopted standards for identifiers
Different levels of granularity in descriptions
o Brands, manufacturer, …
o Product vs. product series
o Reviews of products vs. offers to sell products
Each merchant is different: intra-merchant regularities can dominate the intra-product regularities

Clustering objects from many sources
Possible approaches
– 1) Model the inter- and intra-source variability directly (e.g., Bagnell, Blei, McCallum UAI 2002; Bhattacharya & Getoor SDM 2006); latent variable for source-specific effects
– Problem: model is larger and harder to evaluate

Clustering objects from many sources
Possible approaches
– 1) Model the inter- and intra-source variability directly
– 2) Exploit background knowledge and use constrained clustering:
Each merchant's catalog is duplicate-free
If x and y are from the same merchant, constrain the clustering so that CANNOT-LINK(x,y)
– More realistically: locally dedup each catalog and use a soft constraint on clustering
E.g., Oyama & Tanaka: distance metric learned from cannot-link constraints only, using quadratic programming
Problem: expensive for very large datasets
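As an illustration of the second approach, a small sketch of greedy clustering under hard CANNOT-LINK constraints derived from merchant identity; the greedy single-link merge rule and the `merchant` field name are assumptions made for the example, not the algorithm evaluated later in the talk.

```python
def cannot_link(x, y):
    # items from the same merchant (source) are assumed to be distinct products
    return x["merchant"] == y["merchant"]

def greedy_constrained_match(items, sim, threshold):
    """Greedy grouping that respects CANNOT-LINK(x, y).

    `sim` is any pairwise similarity (e.g., cosine/IDF); a cluster may not
    contain two offers from the same merchant.
    """
    clusters = []
    for x in items:
        best, best_sim = None, threshold
        for c in clusters:
            if any(cannot_link(x, y) for y in c):
                continue  # merging would violate the within-merchant constraint
            s = max(sim(x, y) for y in c)
            if s > best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([x])
        else:
            best.append(x)
    return clusters
```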

Scaling up Information Integration
Outline:
– Product search as a large-scale II task
– Issue: determining identity of products
Merging many catalogs to construct a larger catalog
Issues arising from having many source catalogs
Possible approaches based on prior work
A simple scalable approach to exploiting many sources
– Learning a distance metric
Experiments with the new distance metric
– Scalable clustering techniques
– Conclusions

Clustering objects from many sources
Here: adjust the IDF importance weights for f using an easily-computed statistic CX(f).
– c_i is the source (context) of item x_i (the selling merchant)
– D_f is the set of items with feature f
– x_i ~ D_f is a uniform draw
– n_{c,f} is the number of items from c with feature f, plus smoothing

Clustering objects from many sources
Here: adjust the IDF importance weights for f using an easily-computed statistic CX(f).
– c_i is the source of item x_i
– D_f is the set of items with feature f
– x_i ~ D_f is a uniform draw
– n_{c,f} is the number of items from c with feature f, plus smoothing

Clustering objects from many sources Here: adjust the IDF importance weights for f using an easily-computed statistic CX(f).
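The slide's formula for CX(f) is an image and does not survive transcription. The sketch below assumes, consistently with the quantities defined above and with the pair-classification view on the next slide, that CX(f) estimates how likely two items drawn uniformly from D_f are to come from different sources, using smoothed per-source counts n_{c,f}; the `features` and `source` field names are also made up. Treat this as a hypothetical reading, not the exact definition.

```python
from collections import defaultdict

def cx_statistic(items, smoothing=0.5):
    """Sketch of a per-feature cross-source statistic CX(f).

    Assumed reading: CX(f) is the probability that two items drawn uniformly
    from D_f (the items containing feature f) come from different sources.
    """
    # n[f][c] = number of items from source c that contain feature f
    n = defaultdict(lambda: defaultdict(float))
    for x in items:                      # one pass over the unlabeled data
        for f in set(x["features"]):
            n[f][x["source"]] += 1.0

    cx = {}
    for f, by_source in n.items():
        counts = [k + smoothing for k in by_source.values()]
        total = sum(counts)
        same_source = sum((k / total) ** 2 for k in counts)
        cx[f] = 1.0 - same_source        # chance a random pair of D_f items crosses sources
    return cx
```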

Motivations
Theoretical: CX(f) is related to naïve Bayes weights for a classifier of pairs of items (x,y):
– Classification task: is the pair intra- or inter-source?
– Eliminating intra-source pairs enforces CANNOT-LINK constraints; using a naïve Bayes classifier approximates this
– Features of the pair (x,y) are all common features of item x and item y
– Training data: all intra- and inter-source pairs
Don't need to enumerate them explicitly
Experimental: coming up!

Smoothing the CX(f) weights
1. When estimating Pr( _ | x_i, x_j ), use a Beta distribution with (α,β)=(½,½).
2. When estimating Pr( _ | x_i, x_j ) for f, use a Beta distribution with (α,β) computed from (μ,σ)
– derived empirically using variant (1) on features like f, i.e., from the same dataset, same type, …
3. When computing cosine distance, add correction γ
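A small sketch of the two smoothing variants, assuming a beta-binomial point estimate and the standard method-of-moments conversion from (μ,σ) to (α,β); the particular counts and moments plugged in are placeholders, not values from the experiments.

```python
def smoothed_rate(successes, trials, alpha=0.5, beta=0.5):
    # posterior mean of a Bernoulli rate under a Beta(alpha, beta) prior
    return (successes + alpha) / (trials + alpha + beta)

def beta_from_moments(mu, sigma):
    # method-of-moments (alpha, beta) for a Beta with mean mu and std sigma;
    # (mu, sigma) would be estimated from related features in the same dataset
    common = mu * (1.0 - mu) / (sigma ** 2) - 1.0
    return mu * common, (1.0 - mu) * common

# variant 1: uninformed prior, (alpha, beta) = (1/2, 1/2)
p1 = smoothed_rate(successes=3, trials=4)

# variant 2: informed prior derived from (mu, sigma) of similar features
a, b = beta_from_moments(mu=0.2, sigma=0.1)
p2 = smoothed_rate(successes=3, trials=4, alpha=a, beta=b)
```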

Efficiency of setting CX.IDF
Traditional IDF:
– One pass over the dataset to derive weights
Estimation with (α,β)=(½,½):
– One pass over the dataset to derive weights
– Map-reduce can be used
– Correcting with fixed γ adds no learning overhead
Smoothing with informed priors:
– Two passes over the dataset to derive weights
– Map-reduce can be used

Computing CX.IDF

Scaling up Information Integration
Outline:
– Product search as a large-scale II task
– Issue: determining identity of products
Merging many catalogs to construct a larger catalog
Issues arising from having many source catalogs
Possible approaches based on prior work
A simple scalable approach to exploiting many sources
– Learning a distance metric
Experiments with the new distance metric
– Scalable clustering techniques
– Conclusions

Warmup: Experiments with k-NN classification
Classification vs matching:
– better-understood problem with fewer moving parts
Nine small classification datasets
– from Cohen & Hirsh, KDD 1998
– instances are short, name-like strings
Use class label as context (metric learning)
– equivalent to MUST-LINK constraints
– stretch same-context features in the other direction: heavier weight for features that co-occur in same-context pairs
– CX⁻¹.IDF weighting (aka IDF/CX)

Experiments with k-NN classification
Procedure:
– learn the similarity metric from (labeled) training data
– for test instances, find the closest k=30 items in the training data and predict the distance-weighted majority class
– predict the majority class of the training data if no neighbors have similarity > 0
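A compact sketch of this procedure, with a generic `sim` function standing in for the learned IDF or IDF/CX metric; the structure and names are illustrative.

```python
from collections import Counter

def knn_predict(x, train, sim, k=30):
    """Distance-weighted k-NN vote, as in the procedure above.

    `train` is a list of (item, label) pairs; `sim` is the learned similarity.
    """
    scored = sorted(((sim(x, y), label) for y, label in train),
                    key=lambda p: p[0], reverse=True)
    votes = Counter()
    for s, label in scored[:k]:
        if s > 0:
            votes[label] += s          # weight each neighbor by its similarity
    if not votes:
        # no neighbor with similarity > 0: fall back to the majority class
        return Counter(label for _, label in train).most_common(1)[0][0]
    return votes.most_common(1)[0][0]
```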

Experiments with k-NN classification Ratio of k-NN error to baseline k-NN error Lower is better * Statistically significantly better than baseline (α,β)=(½,½)(α,β) from (μ,σ)

Experiments with k-NN classification
IDF/CX improves over IDF
– nearly 20% lower error
Smoothing is important for IDF/CX
– no real difference for IDF
– simpler smoothing techniques work well
(Results shown for both (α,β)=(½,½) and (α,β) from (μ,σ).)

Experiments Matching Bibliography Data
Scraped LaTeX *.bib files from the web:
– 400+ files with 100,000+ bibentries
– all contain the phrase "machine learning"
– generated 3,000,000 weakly similar pairs of bibentries
– scored and ranked the pairs with IDF, CX.IDF, …
Used paper URLs and/or DOIs to assess precision
– about 3% have useful identifiers
– pairings between these 3% can be assessed as right/wrong
Smoothing done using informed priors
– unsmoothed weights averaged over all tokens in a specific bibliography entry field (e.g., author)
Data is publicly available

Matching performance for bibliography entries. Figure: interpolated precision versus rank (γ=10, R<10k) for baseline IDF, (α,β)=(½,½), and (α,β) from (μ,σ).

Figure: known errors versus rank (γ=10, R<10k) for baseline IDF, (α,β)=(½,½), and (α,β) from (μ,σ).

Matching performance for bibliography entries at higher recall. Figure: errors versus rank (γ=10, R>10k).

Experiments Matching Product Data
Data from >700 web sites, merchants, hand-built catalogs
Larger number of instances: >40M
Scored and ranked >50M weakly similar pairs
Hand-tuned feature set
– but tuned on an earlier version of the data
Used hard identifiers (ISBN, GTIN, UPC) to assess accuracy
– more than half have useful hard identifiers
– most hard identifiers appear only once or twice

Experiments Matching Product Data. Figure: matching performance for baseline IDF, (α,β)=(½,½), and (α,β) from (μ,σ).

Experiments with product data. Figure: results for baseline IDF, (α,β)=(½,½), and (α,β) from (μ,σ).

Scaling up Information Integration
Outline:
– Product search as a large-scale II task
– Issue: determining identity of products with context-sensitive similarity metrics
– Scalable clustering techniques (w/ Frank Lin)
Background on spectral clustering techniques
A fast approximate spectral technique
Theoretical justification
Experimental results
– Conclusions

Spectral Clustering: Graph = Matrix. Figure: an example graph with nodes A–J and its adjacency matrix (rows and columns A–J, with 1s marking edges).

Spectral Clustering: Graph = Matrix. Transitively Closed Components = Blocks. Figure: the same adjacency matrix with nodes grouped so the transitively closed components appear as blocks. Of course we can't see the blocks unless the nodes are sorted by cluster…

Spectral Clustering: Graph = Matrix. Vector = Node Weight. Figure: the matrix M alongside a vector v assigning a weight to each node (e.g., A=3, B=2, C=3, …).

Spectral Clustering: Graph = Matrix. M*v1 = v2 propagates weights from neighbors. Figure: multiplying M by the node-weight vector v1; each entry of v2 sums the weights of that node's neighbors (e.g., the entry for A is 2·1 + 3·1 + 0·1).

Spectral Clustering: Graph = Matrix. W*v1 = v2 propagates weights from neighbors, where W is M normalized so columns sum to 1. Figure: the same propagation with the normalized weights (e.g., the entry for A is 2·0.5 + 3·0.5 + 0·0.3).
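The matrix figures above do not survive transcription, so here is a tiny numeric stand-in for the propagation step: a made-up 3-node graph, W column-normalized as on the slide, and one multiplication W*v1.

```python
import numpy as np

# adjacency matrix of a toy 3-node path graph (made up for illustration)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

# normalize so each column sums to 1, as on the slide
W = A / A.sum(axis=0, keepdims=True)

v1 = np.array([3., 2., 3.])   # node weights
v2 = W @ v1                   # each node collects normalized weight from its neighbors
print(v2)                     # [1. 6. 1.]
```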

Spectral Clustering: Graph = Matrix. W*v1 = v2 propagates weights from neighbors [Shi & Meila, 2002]. Figure: the eigenvalue spectrum of W (λ1, λ2, λ3, λ4, λ5,6,7,…), showing an eigengap after the leading eigenvalues, together with the corresponding eigenvectors e1, e2, e3.

Spectral Clustering: Graph = Matrix. W*v1 = v2 propagates weights from neighbors [Shi & Meila, 2002]. Figure: plotting each node by its entries in eigenvectors e2 and e3; the nodes of the three clusters (marked x, y, z) fall into well-separated groups.

Spectral Clustering: Graph = Matrix. W*v1 = v2 propagates weights from neighbors. If W is connected but roughly block diagonal with k blocks, then:
– the top eigenvector is a constant vector
– the next k eigenvectors are roughly piecewise constant, with pieces corresponding to blocks

Spectral Clustering: Graph = Matrix. W*v1 = v2 propagates weights from neighbors. If W is connected but roughly block diagonal with k blocks, then:
– the top eigenvector is a constant vector
– the next k eigenvectors are roughly piecewise constant, with pieces corresponding to blocks
Spectral clustering:
– find the top k+1 eigenvectors v_1, …, v_{k+1}
– discard the top one
– replace every node a with the k-dimensional vector x_a = ⟨v_2(a), …, v_{k+1}(a)⟩
– cluster with k-means
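A minimal sketch of this recipe with numpy and scikit-learn; it uses the row-stochastic (random-walk) normalization so that the top eigenvector is constant as described above, which is an assumption about the exact normalization intended on the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Top k+1 eigenvectors of the normalized matrix, drop the constant one,
    embed each node as a k-dimensional vector x_a, then run k-means."""
    W = A / A.sum(axis=1, keepdims=True)   # row-normalized random-walk matrix
    vals, vecs = np.linalg.eig(W)
    order = np.argsort(-vals.real)         # eigenvalues, largest first
    X = vecs[:, order[1:k + 1]].real       # discard the top (constant) eigenvector
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```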

Spectral Clustering: Pros and Cons
Elegant, and well-founded mathematically
Works quite well when relations are approximately transitive (like similarity)
Very noisy datasets cause problems
– informative eigenvectors need not be in the top few
– performance can drop suddenly from good to terrible
Expensive for very large datasets
– computing eigenvectors is the bottleneck
There is a very scalable way to compute the top eigenvector

Aside: power iteration to compute the top eigenvector
Let v_0 be almost any vector
Repeat until convergence (c is a normalizer):
– v_t = c·W·v_{t-1}
This is how PageRank is computed
– for a different W
This converges to the top eigenvector
– which in this case is constant
– but …
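The loop itself is only a few lines; in this sketch the normalizer c is a 1-norm rescaling and convergence is a simple change threshold, both illustrative choices.

```python
import numpy as np

def power_iteration(W, tol=1e-8, max_iter=1000):
    """v_t = c * W @ v_{t-1}, iterated until the vector stops changing."""
    v = np.random.rand(W.shape[0])     # "almost any" starting vector
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()   # c: keep the vector on a fixed scale
        if np.abs(v_new - v).sum() < tol:
            return v_new
        v = v_new
    return v
```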

Convergence of PI for a clustering problem. Figure: the intermediate vectors v_t plotted for increasing t, from small to larger (each box rescaled to the same vertical range).

Explanation: write v_0 in the eigenbasis of W, v_0 = a_1 e_1 + a_2 e_2 + … + a_n e_n, where e_i is the i-th eigenvector.

Explanation

After t iterations, v_t = c_t (a_1 λ_1^t e_1 + a_2 λ_2^t e_2 + … + a_n λ_n^t e_n): relative to the leading constant term, the cluster-indicating components along e_2, …, e_k converge to zero quickly, and the components past the eigengap, along e_{k+1}, …, e_n, converge to zero even more quickly.

Explanation: the eigenvectors e_2, …, e_k are piecewise constant across the clusters, and for each pair of clusters some pair of those constants is really different.

Explanation: the signal approximates spectral clustering's distance… but all the pic(a,b) distances are within a small radius:

PIC: Power Iteration Clustering
Details:
– run k-means 10 times and pick the best output by intra-cluster similarity
– stopping condition: the acceleration of the updates falls below a small threshold scaled by 1/n
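Putting the pieces together, a sketch of power iteration clustering as described above: early-stopped power iteration on the normalized affinity matrix yields a one-dimensional embedding, which is then clustered with k-means (restarted 10 times). The exact stopping constant is not legible on this slide, so `eps` is a placeholder value, and the row-wise normalization and starting vector are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def pic(A, k, eps=1e-5, max_iter=1000):
    """Power iteration clustering sketch: early-stopped power iteration on the
    normalized affinity matrix gives a 1-D embedding for k-means to cluster."""
    n = A.shape[0]
    W = A / A.sum(axis=1, keepdims=True)   # row-normalized affinity (random walk)
    v = A.sum(axis=1) / A.sum()            # a convenient non-uniform start
    delta_prev = np.full(n, np.inf)
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()
        delta = np.abs(v_new - v)
        accel = np.abs(delta - delta_prev)  # "acceleration" of the update
        v, delta_prev = v_new, delta
        if accel.max() < eps / n:           # stop well before full convergence
            break
    # k-means on the 1-D embedding, restarted 10 times; best run is kept
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```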

Experimental Results

Experimental results: best-case assignment of class labels to clusters

Experiments: run time and scalability. Figure: run time in milliseconds.

Experiments: run time and scalability

Summary
Large-scale integration:
– new statistical approaches, assuming huge numbers of objects, relations, and sources of information
– simplicity and scalability are crucial
CX.IDF is an extension of IDF weighting
– exploits statistics in data merged from many locally-deduped sources, a very common integration scenario
– weights can be learned without labeling
– weight learning requires 2-3 passes over the data
Errors are reduced significantly relative to IDF
– 20% lower error on average for classification
– up to 65% lower error in matching tasks at high recall levels
– very high precision possible at lower recall levels

Summary
Large-scale integration:
– new statistical approaches, assuming huge numbers of objects, relations, and sources of information
– simplicity and scalability are crucial
CX.IDF is an extension of IDF weighting
– simple, scalable, parallelizable
PIC is a very scalable clustering method
– formally, works when spectral techniques work
– experimentally, often better than traditional spectral methods
– based on power iteration on a normalized matrix with early stopping
– experimentally, linear time
– easy to implement and efficient
– very easily parallelized

Questions...?