1
Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework. Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra. JCDL '05.
2
Abstract They consider the problem of ambiguous author names in bibliographic citations. Scalable two-step framework: –Reduce the number of candidates via blocking (four methods) –Measure the distance of two names via coauthor information (seven measures)
3
Introduction Citation records are important resources for academic communities. Keeping citations correct and up-to-date has proved to be a challenging task at a large scale. They focus on the problem of ambiguous author names. It is difficult to get the complete list of publications of some authors. –E.g., “John Doe” published 100 articles, but the digital library keeps two separate purported author names, “John Doe” and “J. D. Doe”, each containing 50 citations.
5
Problem Problem definition: given a collection of author names, find, for each name, the other names in the collection that refer to the same author. The baseline approach: compare every pair of author names, which does not scale to large citation collections.
6
Solution Rather than comparing each pair of author names to find similar names, they advocate a scalable two-step name disambiguation framework. –Partition all author-name strings into blocks –Visit each block and compare all possible pairs of names within the block
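A minimal sketch of the two-step loop (hypothetical helper names; block_key and distance stand in for whichever blocking method and distance measure are plugged in):

```python
from collections import defaultdict
from itertools import combinations

def disambiguate(names, block_key, distance, threshold):
    """Step 1: partition names into blocks; Step 2: compare only the
    pairs that land in the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)   # blocking criterion

    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):    # pairs within one block only
            if distance(a, b) <= threshold:
                matches.append((a, b))
    return matches
```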
7
Solution Overview
8
Blocking (1/3) The goal of step 1 is to put similar records into the same group by some criteria. They examine four representative blocking methods –heuristics, token-based, n-gram, sampling
9
Blocking (2/3) Spelling-based heuristics –Group author names based on name spellings –Heuristics: iFfL, iFiL, fL, combination –iFfL (initial of the first name + full last name): e.g., “Jeffrey Ullman” and “J. Ullman” fall into the same block Token-based –Author names sharing at least one common token are grouped into the same block –e.g., “Jeffrey D. Ullman” and “Ullman, Jason”
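A minimal sketch of these two blocking criteria, assuming plain "First [Middle] Last" name strings (the exact key construction in the paper may differ):

```python
from collections import defaultdict

def iffl_key(name):
    # iFfL: initial of the first name + full last name, so
    # "Jeffrey Ullman" and "J. Ullman" share the key "j ullman".
    tokens = name.replace(".", "").lower().split()
    return f"{tokens[0][0]} {tokens[-1]}"

def token_blocks(names):
    # Token-based: names sharing at least one token meet in a common block.
    blocks = defaultdict(set)
    for name in names:
        for tok in name.replace(",", " ").replace(".", "").lower().split():
            blocks[tok].add(name)
    return blocks

print(iffl_key("Jeffrey Ullman") == iffl_key("J. Ullman"))             # True
print(token_blocks(["Jeffrey D. Ullman", "Ullman, Jason"])["ullman"])  # both names
```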
10
Blocking (3/3) N-gram –N=4 –Among the four methods, N-gram blocking puts the largest number of author names into the same block. –e.g., “David R. Johnson”, “F. Barr-David” Sampling –Sampling-based join approximation –Each token from all author names has a TF-IDF weight. –Each author name has its token weight vector. –All pairs of names with similarity of at least θ are put into the same block.
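A minimal sketch of the N-gram criterion with N=4 (the sampling-based method, with TF-IDF token weights and threshold θ, is not shown):

```python
from collections import defaultdict

def ngram_blocks(names, n=4):
    """Names sharing any character 4-gram fall into a common block,
    which is why this method produces the largest blocks."""
    blocks = defaultdict(set)
    for name in names:
        s = name.lower()
        for i in range(max(1, len(s) - n + 1)):
            blocks[s[i:i + n]].add(name)
    return blocks

b = ngram_blocks(["David R. Johnson", "F. Barr-David"])
print(b["davi"])   # both names share the 4-gram "davi"
```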
11
Measuring Distances The goal of step 2 is, for each block, to identify the top-k closest author names. Supervised methods –Naïve Bayes Model, Support Vector Machine Unsupervised methods –String-based Distance, Vector-based Cosine Distance
12
Supervised Methods (1) Naïve Bayes Model Training: –The collection of coauthors of x is randomly split, and only one half is used for training. –They estimate each coauthor's conditional probability P(Aj|x) Testing: –Candidate names in a block are scored by the probability their models assign to the held-out coauthors (see the sketch below).
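A minimal sketch of the training/testing split; the add-one smoothing and the exact scoring rule are assumptions here, not taken from the slide:

```python
import math
import random
from collections import Counter

def split_coauthors(coauthors, rng=random):
    """Randomly split x's coauthor list; only one half is used for training."""
    shuffled = coauthors[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def train(train_coauthors):
    """Estimate P(A_j | x) from the training half (add-one smoothing assumed)."""
    counts = Counter(train_coauthors)
    total, vocab = sum(counts.values()), len(counts)
    return lambda a: (counts[a] + 1) / (total + vocab + 1)

def log_score(model, test_coauthors):
    """Testing: log-probability of the held-out coauthors under x's model."""
    return sum(math.log(model(a)) for a in test_coauthors)
```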
13
Supervised Methods (2) Support Vector Machine –All coauthor information of an author in a block is transformed into a vector-space representation. –Author names in a block are randomly split: 50% is used for training and the other 50% for testing. –The SVM creates a maximum-margin hyperplane that separates the YES and NO training examples. –In testing, the SVM classifies vectors by mapping them, via the kernel trick, into a high-dimensional space; a Radial Basis Function (RBF) kernel is used.
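A minimal sketch using scikit-learn (an assumption; the paper does not name a library): coauthor lists become binary term vectors, and an RBF-kernel SVM decides whether a citation belongs to the candidate author.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Training half: the coauthor list of each citation, labelled YES (1) / NO (0)
# depending on whether the citation belongs to the candidate author.
train_docs = ["widom garcia-molina", "motwani widom",
              "gupta smith", "smith jones"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer(binary=True)        # vector-space representation
X_train = vec.fit_transform(train_docs)

clf = SVC(kernel="rbf")                   # Radial Basis Function kernel
clf.fit(X_train, train_labels)

# Testing half: classify the remaining citations.
print(clf.predict(vec.transform(["garcia-molina motwani"])))
```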
14
Unsupervised Methods (1) String-based Distance –The distance between two author names is measured by the “distance” between their coauthor lists. –Two token-based string distances –Two edit-distance-based string distances
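A minimal sketch of the two flavours; Jaccard and plain Levenshtein are used here as stand-ins (the paper's conclusion names Jaro and Jaro-Winkler among the edit-distance-based measures):

```python
def jaccard_distance(coauthors_a, coauthors_b):
    """Token-based: 1 - |A ∩ B| / |A ∪ B| over the two coauthor sets."""
    a, b = set(coauthors_a), set(coauthors_b)
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0

def levenshtein(s, t):
    """Edit-distance-based: minimum number of single-character edits
    turning one coauthor string into the other."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]
```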
15
Unsupervised Methods (2) Vector-based Cosine Distance –They model the coauthor lists as vectors in a vector space and compute the distances between the vectors. –They use the simple cosine distance.
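A minimal sketch, assuming plain term-frequency weights for the coauthor vectors:

```python
import math
from collections import Counter

def cosine_distance(coauthors_a, coauthors_b):
    """1 - cosine similarity between the two coauthor-frequency vectors."""
    va, vb = Counter(coauthors_a), Counter(coauthors_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values())) *
            math.sqrt(sum(c * c for c in vb.values())))
    return 1.0 - dot / norm if norm else 1.0
```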
16
Experiment
17
Data Sets They gathered real citation data from four different domains. –DBLP, e-Print, BioMed, EconPapers Different disciplines appear to have slightly different citation policies, and citation conventions also vary: –Number of coauthors per article –Whether the initial of the first name is used instead of the full name
18
Artificial name variants Given the large number of citations, it is neither possible nor practical to build a “real” solution set. They pick the top-100 author names from Y by number of citations and generate 100 corresponding new name variants artificially. E.g., for “Grzegorz Rozenberg” with 344 citations and 114 coauthors in DBLP, they create a new name such as “G. Rozenberg” or “Grzegorz Rozenbergg”. The original 344 citations are split into halves, so each name carries 172 citations. They test whether the algorithm is able to find the corresponding artificial name variant in Y.
19
Artificial name variants Error types, e.g., for “Ji-Woo K. Li”: –Abbreviation: “J. K. Li” –Name alternation: “Li, Ji-Woo K.” –Typo: “Ji-Woo K. Lee” or “Jee-Woo K. Li” –Contraction: “Jiwoo K. Li” –Omission: “Ji-Woo Li” –Combinations They quantify the effect of each error type on the accuracy of name disambiguation.
20
Two error-mix configurations are used: (1) mixed error types of abbreviation (30%), alternation (30%), typo (12% each in first/last name), contraction (2%), omission (4%), and combination (10%); (2) abbreviation of the first name (85%) and typo (15%).
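A minimal sketch of injecting errors according to mix (1); only a few error types are implemented, and the helper is hypothetical rather than the paper's actual generator:

```python
import random

# Error mix (1) from the slide; typo is 12% each for first/last name, 24% total.
ERROR_MIX = [("abbreviation", 0.30), ("alternation", 0.30), ("typo", 0.24),
             ("contraction", 0.02), ("omission", 0.04), ("combination", 0.10)]

def make_variant(name, rng=random):
    """Create one artificial variant of a "First [Middle] Last" name."""
    first, *middle, last = name.split()
    kind = rng.choices([k for k, _ in ERROR_MIX],
                       weights=[w for _, w in ERROR_MIX])[0]
    if kind == "abbreviation":            # "Ji-Woo K. Li" -> "J. K. Li"
        return f"{first[0]}. {' '.join(middle)} {last}".replace("  ", " ")
    if kind == "alternation":             # "Ji-Woo K. Li" -> "Li, Ji-Woo K."
        return f"{last}, {first} {' '.join(middle)}".rstrip()
    if kind == "omission":                # "Ji-Woo K. Li" -> "Ji-Woo Li"
        return f"{first} {last}"
    return name                           # typo/contraction/combination omitted
```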
21
Evaluation metrics Scalability –Size of the blocks generated in step 1 –Time taken to process both step 1 and step 2 Accuracy –They measure top-k accuracy: whether the correct artificial name variant appears among the k closest names returned for each query (see the sketch below).
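A minimal sketch of the top-k accuracy computation (hypothetical helper names; candidates_of(q) returns the other names in q's block):

```python
def top_k_accuracy(queries, true_variant, candidates_of, distance, k=5):
    """Fraction of query names whose artificial variant appears among
    the k closest names in the same block."""
    hits = 0
    for q in queries:
        ranked = sorted(candidates_of(q), key=lambda c: distance(q, c))
        if true_variant[q] in ranked[:k]:
            hits += 1
    return hits / len(queries)
```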
22
Scalability –The average number of author names in each block –Processing time for steps 1 and 2
23
Accuracy Four blocking methods combined with seven distance metrics, for all four data sets with k = 5 (results for the EconPapers data set are omitted).
24
Conclusion They compared various configurations (four blocking methods in step 1, seven distance metrics over “coauthor” information in step 2) against four data sets. A combination of token-based or N-gram blocking (step 1) with SVM as a supervised method or the cosine metric as an unsupervised method (step 2) gave the best scalability/accuracy trade-off. The accuracy of the simple name-spelling-based heuristics was shown to be quite sensitive to the error types. Edit-distance-based metrics such as Jaro or Jaro-Winkler proved inadequate for the large-scale name disambiguation problem because of their slow processing time.