Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra JCDL ’05.

Presentation transcript:

Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra JCDL ’05

Abstract They consider the problem of ambiguous author names in bibliographic citations, proposing a scalable two-step framework: –Reduce the number of candidates via blocking (four methods) –Measure the distance between two names via coauthor information (seven measures)

Introduction Citation records are important resources for academic communities. Keeping citations correct and up-to-date has proved to be a challenging task at a large scale. They focus on the problem of ambiguous author names: it is difficult to get the complete list of publications of some authors. –e.g., “John Doe” published 100 articles, but the digital library keeps two separate purported author names, “John Doe” and “J. D. Doe”, each containing 50 citations.

Problem –Problem definition –The baseline approach

Solution Rather than comparing each pair of author names to find similar names, they advocate a scalable two-step name disambiguation framework. –Partition all author-name strings into blocks –Visit each block and compare all possible pairs of names within the block
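The two-step framework can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the blocking key and distance function are pluggable so that any of the four blocking methods and seven distance measures could be dropped in.

```python
from collections import defaultdict

def two_step_disambiguation(names, block_key, distance, k):
    """Partition names into blocks, then compare pairs only within each block."""
    # Step 1: blocking -- group names that share the same blocking key
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    # Step 2: within each block, rank the other names by distance (top-k closest)
    result = {}
    for members in blocks.values():
        for name in members:
            others = [m for m in members if m != name]  # assumes unique name strings
            result[name] = sorted(others, key=lambda m: distance(name, m))[:k]
    return result
```

Pairs in different blocks are never compared, which is the source of the scalability gain over the all-pairs baseline.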

Solution Overview

Blocking (1/3) The goal of step 1 is to put similar records into the same group by some criteria. They examine four representative blocking methods –heuristics, token-based, n-gram, sampling

Blocking (2/3) Spelling-based heuristics –Group author names based on name spellings –Heuristics: iFfL, iFiL, fL, and combinations –iFfL: e.g., “Jeffrey Ullman” and “J. Ullman” fall into the same block Token-based –Author names sharing at least one common token are grouped into the same block –e.g., “Jeffrey D. Ullman” and “Ullman, Jason”
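The spelling-based heuristics amount to simple key functions. A sketch of two of them, assuming a "First [Middle] Last" name layout (the handling of punctuation and name order is an assumption for illustration):

```python
def iFfL(name):
    """Blocking key: initial of the first name + full last name."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())

def iFiL(name):
    """Blocking key: initial of the first name + initial of the last name."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1][0].lower())
```

Two names land in the same block exactly when their keys are equal, so iFiL produces fewer, larger blocks than iFfL.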

Blocking (3/3) N-gram –Names sharing a character N-gram (N=4) are grouped together. –This method puts the largest number of author names into the same block. –e.g., “David R. Johnson” and “F. Barr-David” Sampling –Sampling-based join approximation –Each token from all author names has a TF-IDF weight. –Each author name has its token weight vector. –All pairs of names with similarity of at least θ are put into the same block.
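A sketch of the N-gram criterion with N=4 (the lowercasing and whitespace handling are assumptions; the paper does not specify its exact normalization):

```python
def ngrams(name, n=4):
    """Character n-grams of the lowercased name, whitespace removed."""
    s = "".join(name.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def same_ngram_block(a, b, n=4):
    """Two names fall into the same block if they share any character n-gram."""
    return bool(ngrams(a, n) & ngrams(b, n))
```

This shows why the blocks are the largest of the four methods: “David R. Johnson” and “F. Barr-David” share the 4-grams of "david" even though no heuristic key or token matches.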

Measuring Distances The goal of step 2 is, for each block, to identify the top-k author names that are the closest. Supervised methods –Naïve Bayes Model, Support Vector Machine Unsupervised methods –String-based Distance, Vector-based Cosine Distance

Supervised Methods (1) Naïve Bayes Model Training: –The collection of coauthors of x is randomly split, and only half is used for training. –They estimate each coauthor’s conditional probability P(Aj|x). Testing:
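A minimal sketch of coauthor-based Naïve Bayes scoring along these lines. The function names and the Laplace smoothing are assumptions for illustration, not the authors' exact estimator:

```python
import math
from collections import Counter

def train_coauthor_model(coauthor_lists, smoothing=1.0):
    """Estimate P(Aj | x) from the training half of x's citations, with Laplace smoothing."""
    counts = Counter(c for coauthors in coauthor_lists for c in coauthors)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen coauthors
    return lambda c: (counts[c] + smoothing) / (total + smoothing * vocab)

def log_likelihood(model, coauthors):
    """Log-probability that a held-out coauthor list was generated by x's model."""
    return sum(math.log(model(c)) for c in coauthors)
```

At test time, a candidate name is scored by how likely its coauthors are under each author's trained model.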

Supervised Methods (2) Support Vector Machine –All coauthor information of an author in a block is transformed into a vector-space representation. –Author names in a block are randomly split; 50% is used for training and the other 50% for testing. –The SVM creates a maximum-margin hyperplane that separates the YES and NO training examples. –In testing, the SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space (Radial Basis Function kernel).
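The RBF kernel the SVM relies on can be illustrated directly on sparse coauthor-count vectors. The dict representation is an assumption for illustration; the paper does not describe its feature encoding in this detail:

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2) on sparse coauthor-count dicts."""
    keys = set(u) | set(v)
    sq_dist = sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys)
    return math.exp(-gamma * sq_dist)
```

Identical coauthor vectors score 1.0, and the kernel decays toward 0 as the coauthor sets diverge; the SVM's decision function is a weighted sum of such kernel values against the support vectors.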

Unsupervised Methods (1) String-based Distance –The distance between two author names is measured by the “distance” between their coauthor lists. –Two token-based string distances –Two edit-distance-based string distances
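Illustrative members of the two families, applied to coauthor tokens: a Jaccard-style token distance and a Levenshtein edit distance. These are sketches of the families, not necessarily the paper's exact four measures:

```python
def jaccard_distance(a, b):
    """Token-based distance: 1 - |A ∩ B| / |A ∪ B| over coauthor sets."""
    sa, sb = set(a), set(b)
    if not (sa | sb):
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]
```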

Unsupervised Methods (2) Vector-based Cosine Distance –They model the coauthor lists as vectors in the vector space and compute the distances between the vectors, using the simple cosine distance.
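A sketch of the cosine distance on coauthor-frequency vectors (raw counts are an assumption; the paper may weight the vectors differently):

```python
import math
from collections import Counter

def cosine_distance(coauthors_a, coauthors_b):
    """1 - cosine similarity of coauthor-frequency vectors."""
    u, v = Counter(coauthors_a), Counter(coauthors_b)
    dot = sum(u[k] * v[k] for k in u)  # Counter returns 0 for missing keys
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return 1.0 - dot / norm if norm else 1.0
```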

Experiment

Data Sets They gathered real citation data from four different domains: –DBLP, e-Print, BioMed, EconPapers Different disciplines appear to have slightly different citation policies, and citation conventions also vary: –Number of coauthors per article –Use of the initial of the first name instead of the full name

Artificial name variants Given the large number of citations, it is neither possible nor practical to find a “real” solution set. They pick the top-100 author names from Y by number of citations and artificially generate 100 corresponding new name variants. –e.g., for “Grzegorz Rozenberg”, with 344 citations and 114 coauthors in DBLP, they create a new name like “G. Rozenberg” or “Grzegorz Rozenbergg”. –The original 344 citations are split into halves, so each name carries 172 citations. –They test whether the algorithm is able to find the corresponding artificial name variant in Y.

Artificial name variants Error types: e.g., for “Ji-Woo K. Li” –Abbreviation: “J. K. Li” –Name alternation: “Li, Ji-Woo K.” –Typo: “Ji-Woo K. Lee” or “Jee-Woo K. Li” –Contraction: “Jiwoo K. Li” –Omission: “Ji-Woo Li” –Combinations The effect of each error type on the accuracy of name disambiguation is quantified.
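Two of the listed error types can be sketched as simple string transformations (hypothetical helpers, not the authors' variant generator; both assume a "First [Middle] Last" layout):

```python
def abbreviate(name):
    """'Abbreviation' error: reduce the first name to its initial."""
    parts = name.split()
    return parts[0][0] + ". " + " ".join(parts[1:])

def alternate(name):
    """'Name alternation' error: 'First [Middle] Last' -> 'Last, First [Middle]'."""
    parts = name.split()
    return parts[-1] + ", " + " ".join(parts[:-1])
```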

Two error distributions are tested: (1) mixed error types of abbreviation (30%), alternation (30%), typo (12% each in first/last name), contraction (2%), omission (4%), and combination (10%); (2) abbreviation of the first name (85%) and typo (15%).

Evaluation metrics Scalability –Size of blocks generated in step 1 –Time taken to process both steps 1 and 2 Accuracy –They measured the top-k accuracy: whether the artificial name variant appears among the k closest names.
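The accuracy metric can be sketched as follows, assuming a predictions dict mapping each name to its top-k candidate list (the function name and data layout are assumptions):

```python
def top_k_accuracy(predictions, truth):
    """Fraction of names whose true variant appears among their top-k candidates."""
    hits = sum(1 for name, candidates in predictions.items()
               if truth[name] in candidates)
    return hits / len(predictions)
```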

Scalability –The average # of authors in each block –Processing time for steps 1 and 2

Accuracy Four blocking methods combined with seven distance metrics for the four data sets with k = 5 (the EconPapers data set is omitted).

Conclusion They compared various configurations (four blocking methods in step 1, seven distance metrics via “coauthor” information in step 2) against four data sets. A combination of token-based or N-gram blocking (step 1) and SVM as a supervised method or the cosine metric as an unsupervised method (step 2) gave the best scalability/accuracy trade-off. The accuracy of simple name-spelling-based heuristics was shown to be quite sensitive to the error types. Edit-distance-based metrics such as Jaro or Jaro-Winkler proved inadequate for the large-scale name disambiguation problem because of their slow processing time.