1
Author Name Disambiguation in Medline
Vetle I. Torvik and Neil R. Smalheiser August 31, 2006
4
A Statistically Based Model:
Hypothesis: an individual tends to publish papers with similar attributes, similar enough that the attributes alone suffice to disambiguate the authors.
The case of Dr. Tom Jobe?
Training sets: large, automatically generated sets of article pairs matching on (Last Name, First Initial), one set written by the same person and one written by different individuals.
5
Name Attributes
Suffix, if present (III, Jr.)
Middle initial match
Original model got very good performance without using first names at all!
First name (if available in Medline, or if it can be scraped from online papers)
Name spelling variants
Name frequency
6
Article Attributes
Journal name
Number of co-author names in common
Affiliation words in common (may not be given for all co-authors)
Name correlations with affiliations (e.g. Ito and Japan are correlated)
Language of the article
Title words in common
E-mail addresses, if given (assign to the right author)
MeSH headings in common
7
A Monotone Model
Each pair of papers yields a vector of 10 dimensions, each holding a matching score for one attribute.
Assume monotonicity: the more attributes two papers have in common, the more likely they were written by the same person.
Monotonicity still allows for nonlinear and interactive effects across dimensions.
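The match vector and the monotonicity assumption can be sketched as follows. This is only an illustration: the `Article` fields, the five dimensions (the slide says ten), and the scoring by simple overlap counts are all assumptions, not the authors' actual feature definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical record layout; the real model uses 10 attributes.
    journal: str
    language: str
    coauthors: frozenset = frozenset()
    title_words: frozenset = frozenset()
    mesh: frozenset = frozenset()


def match_vector(a: Article, b: Article) -> tuple:
    """Score each attribute of a pair of articles (higher = more in common)."""
    return (
        int(a.journal == b.journal),
        int(a.language == b.language),
        len(a.coauthors & b.coauthors),
        len(a.title_words & b.title_words),
        len(a.mesh & b.mesh),
    )


def dominates(x: tuple, y: tuple) -> bool:
    """True if x matches at least as well as y on every dimension.

    Under the monotonicity assumption, a dominating vector should never
    be assigned a lower same-author likelihood than the vector it dominates.
    """
    return all(xi >= yi for xi, yi in zip(x, y))
```

Two vectors that each win on a different dimension are simply incomparable, which is what leaves room for nonlinear and interactive effects.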
8
Estimate Pairwise Frequencies
For a given pair of articles, compute the match vector, then look up its frequencies in the positive vs. negative training sets; their ratio is the R value.
For a given name, estimate the a priori probability P that any two papers were written by the same person. This is a whole story in itself...
Probability of a match = 1 / [1 + (1 - P)/(P × R)]
The Author-ity Site at (
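As a worked illustration of the formula on this slide, a minimal sketch follows. The function and parameter names are made up here, and the handling of zero frequencies is an assumption, since the slide does not say how empty cells are treated.

```python
def r_value(freq_pos: float, freq_neg: float) -> float:
    """Ratio of a match vector's frequency in the positive vs. negative set."""
    if freq_neg <= 0:
        # Assumption: the caller smooths or skips vectors unseen in the negative set.
        raise ValueError("freq_neg must be positive")
    return freq_pos / freq_neg


def match_probability(r: float, prior: float) -> float:
    """Probability of a match = 1 / [1 + (1 - P) / (P * R)]."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if r < 0:
        raise ValueError("r must be non-negative")
    if r == 0:
        return 0.0  # vector never seen among same-author pairs
    return 1.0 / (1.0 + (1.0 - prior) / (prior * r))
```

With R = 1 the evidence is neutral and the result equals the prior P; with P = 0.5 and R = 9 the probability is 0.9.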
9
Beyond Pairwise Comparisons
A and B share titles, journals.
B and C share co-authors, affiliations.
But A and C share nothing!
Yet p(AC) must be ≥ p(AB) + p(BC) - 1.
Using this triangle inequality on probabilities, detect and correct anomalies due to missing data or higher-order correlations.
Catch uncharacteristic papers by an author.
Another long story to optimize the methods!
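The triplet check can be sketched as below. The detection step follows the inequality on the slide; the correction step (raising the offending probability to its lower bound) is only the simplest possible choice and is an assumption, since the slide leaves the actual optimization as "another long story". All names are hypothetical.

```python
from itertools import permutations


def find_violations(p: dict, tol: float = 1e-12) -> list:
    """Return (a, b, c, bound) wherever p(ac) < p(ab) + p(bc) - 1.

    `p` maps frozenset({x, y}) to a pairwise probability and must
    contain every pair of the items involved.
    """
    items = sorted({x for pair in p for x in pair})
    violations = []
    for a, b, c in permutations(items, 3):
        if a < c:  # (a, b, c) and (c, b, a) describe the same triangle side
            bound = p[frozenset((a, b))] + p[frozenset((b, c))] - 1.0
            if p[frozenset((a, c))] < bound - tol:
                violations.append((a, b, c, bound))
    return violations


def raise_to_bound(p: dict) -> dict:
    """One naive correction pass: lift each violating p(ac) to its lower bound."""
    fixed = dict(p)
    for a, _b, c, bound in find_violations(p):
        key = frozenset((a, c))
        fixed[key] = max(fixed[key], bound)
    return fixed
```

For example, with p(AB) = p(BC) = 0.9 and p(AC) = 0.1, the pair A, C is flagged because the bound is 0.8.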
10
Clustering “all” papers in Medline according to author-individuals
First, compute all pairwise probabilities for each (last name, first initial), modified with the triplet correction.
Then form clusters at p = 0.95 (high precision) and at p = 0.5 (high recall), i.e. a paper joins a cluster if the chance is greater than 0.5 that it belongs to some cluster; otherwise it stays a singleton.
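A minimal sketch of threshold clustering follows. The slide does not say which clustering algorithm is used; this example assumes plain single-linkage (connected components over pairs at or above the threshold), implemented with union-find, and the function name and input layout are hypothetical.

```python
def cluster(n_papers: int, pair_probs: dict, threshold: float) -> list:
    """Group papers 0..n_papers-1 by linking pairs with probability >= threshold.

    `pair_probs` maps (i, j) index tuples to pairwise match probabilities.
    Papers with no link at the threshold remain singletons.
    """
    parent = list(range(n_papers))

    def find(x: int) -> int:
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), prob in pair_probs.items():
        if prob >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for paper in range(n_papers):
        groups.setdefault(find(paper), []).append(paper)
    return sorted(groups.values())
```

Running the same probabilities at 0.95 and at 0.5 gives the high-precision and high-recall partitions respectively; the looser threshold can only merge clusters, never split them.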
11
First-Pass Disambiguation is Complete!
Except for several hundred names with more than ~3000 papers each, which reach the memory limit. Will assess whether the model is reliable for the biggest names.
For now, proceed for papers giving first names.
Monitoring for over-clustering and under-clustering
Summarizing global statistics
12
Immediate Next Steps
Evaluate the clustering performance:
Old vs. new papers
Importance of missing data
Very frequent names
Singletons, least confident assignments
Update the web interface
16
Upcoming Grant Renewal Aim 1: Special Cases
Name reversal, hyphenated names, spelling errors, Gerald vs. Jerry, Rick vs. A. Rick
Use the assignment of one co-author to help disambiguate another co-author
Compute a confidence level for the assignment of each paper; identify the least confident assignments
17
Upcoming Grant Renewal Aim 2: Update the Model
Original model covers 1966-present, but new papers carry different information: MeSH, e-mails, online information
Modify training sets with recent papers
Journal name partial match
Abstract words match?
Affiliations matched to each author in PMC, online papers
References-cited information taken from PMC
18
Upcoming Grant Renewal Aim 3: Web Interface
Update the pairwise interface (given a name and a particular paper, list all others in order of match probability)
Show clusters: given a name, show all clusters of author-individuals, link to Community of Science, searchable by attributes, can summarize and explore further (Anne O’Tate tool)
Author profile/collaboration finder tools
Data made available for bibliometrics and collaboration network research
19
Upcoming Grant Renewal Aim 4: Curation
Curator to identify errors and least-confident assignments:
manually
by machine methods (e.g. wobble in clustering)
Change the database and alter the model as needed
Wiki Authors: will monitor postings to the Wiki and change the database as verified and warranted (e.g. maiden name to married name)