Author Name Disambiguation in Medline
Vetle I. Torvik and Neil R. Smalheiser
August 31, 2006
A Statistically Based Model
Hypothesis: an individual tends to publish papers with similar attributes, consistently enough that these attributes suffice to disambiguate authors.
The case of Dr. Tom Jobe?
Large, automatically generated training sets: pairs of articles matching on (Last Name, First Initial), written by the same person vs. by different individuals.
Name Attributes
Suffix, if present (III, Jr.)
Middle initial match
The original model performed very well without using first names at all!
First name (if available in Medline, or if it can be scraped from online papers)
Name spelling variants
Name frequency
Article Attributes
Journal name
Number of co-author names in common
Affiliation words in common (may not be given for all co-authors)
Name correlations with affiliations (e.g. Ito and Japan are correlated)
Language of the article
Title words in common
Email addresses, if given (assign to the right author)
MeSH headings in common
A Monotone Model
Each pair of papers yields a 10-dimensional vector; each dimension holds a matching score.
Assume monotonicity: the more attributes in common, the more likely the papers were written by the same person.
Allows for nonlinear and interactive effects across dimensions.
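As a rough illustration of the match vector and the monotonicity assumption, here is a minimal Python sketch. It uses only four of the ten dimensions. The field names (`journal`, `coauthors`, `title_words`, `mesh`) and the scoring rules are assumptions chosen for the example, not the definitions used in the actual model.

```python
from typing import Dict, Sequence, Tuple


def match_vector(a: Dict, b: Dict) -> Tuple[int, ...]:
    """Toy match vector: one score per attribute, higher means more similar.

    Hypothetical fields and scores; the real model uses 10 dimensions.
    """
    return (
        int(a["journal"] == b["journal"]),                  # same journal?
        len(set(a["coauthors"]) & set(b["coauthors"])),     # shared co-authors
        len(set(a["title_words"]) & set(b["title_words"])), # shared title words
        len(set(a["mesh"]) & set(b["mesh"])),               # shared MeSH headings
    )


def dominates(x: Sequence[int], y: Sequence[int]) -> bool:
    """True if x is at least as large as y in every dimension.

    Under monotonicity, a dominating vector must not receive a lower
    same-author score than the vector it dominates.
    """
    return len(x) == len(y) and all(xi >= yi for xi, yi in zip(x, y))


paper_1 = {"journal": "J Neurosci", "coauthors": ["Lee K", "Ito M"],
           "title_words": ["synaptic", "plasticity"], "mesh": ["Neurons", "Synapses"]}
paper_2 = {"journal": "J Neurosci", "coauthors": ["Ito M"],
           "title_words": ["synaptic", "vesicles"], "mesh": ["Synapses"]}
paper_3 = {"journal": "Lancet", "coauthors": [],
           "title_words": ["vesicles"], "mesh": ["Synapses"]}

v12 = match_vector(paper_1, paper_2)
v13 = match_vector(paper_1, paper_3)
```

Because the model only requires that a dominating vector never score lower, it does not force any particular functional form, which is what leaves room for nonlinear and interactive effects.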
Estimate Pairwise Frequencies
For a given pair of articles, compute the match vector, then look up its frequencies in the positive vs. negative training sets: ratio = R value
For a given name, estimate the a priori probability P that any two papers will be written by the same person
This is a whole story in itself...
Probability of a match = 1 / [1 + (1 - P) / (P R)]
The Author-ity site: http://arrowsmith.psych.uic.edu
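The match probability is Bayes' rule in odds form: posterior odds = prior odds x R, that is P R / (1 - P), which rearranges to 1 / [1 + (1 - P) / (P R)]. The Python sketch below computes it. The function names and the toy counts are assumptions for illustration; how R and P are estimated in practice is not shown here.

```python
def likelihood_ratio(pos_count: int, pos_total: int,
                     neg_count: int, neg_total: int) -> float:
    """R: relative frequency of a match vector in the positive training
    set divided by its relative frequency in the negative training set."""
    if neg_count == 0:
        raise ValueError("vector unseen in negative set; R is undefined here")
    return (pos_count / pos_total) / (neg_count / neg_total)


def match_probability(r: float, p: float) -> float:
    """Posterior probability that two papers share an author.

    r: likelihood ratio R for the observed match vector (r >= 0)
    p: prior probability P that two papers under this name match (0 < p < 1)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("prior p must lie strictly between 0 and 1")
    if r < 0.0:
        raise ValueError("likelihood ratio r must be non-negative")
    if r == 0.0:
        return 0.0
    return 1.0 / (1.0 + (1.0 - p) / (p * r))


# Toy numbers: the vector is seen in 90 of 1000 positive pairs
# and in 10 of 1000 negative pairs, so R is about 9.
r_value = likelihood_ratio(90, 1000, 10, 1000)
prob_common_name = match_probability(r_value, 0.1)  # low prior: frequent name
prob_rare_name = match_probability(r_value, 0.9)    # high prior: rare name
```

The same evidence R gives a much lower match probability for a frequent name (low P) than for a rare one (high P).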
Beyond Pairwise Comparisons
A and B share titles, journals
B and C share co-authors, affiliations
But A and C share nothing!
Yet p(AC) must be >= p(AB) + p(BC) - 1
Triangle inequality using probabilities: detect and correct anomalies due to missing data or higher-order correlations
Catch uncharacteristic papers by an author
Another long story to optimize the methods!
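The consistency check on a triplet can be sketched in Python as follows. This only detects violations of the bound p(AC) >= p(AB) + p(BC) - 1 for each of the three pairs. The function names are hypothetical, and how a violation is then corrected is not specified in the slides, so no correction step is shown.

```python
from typing import List


def triplet_lower_bound(p_xy: float, p_yz: float) -> float:
    """Smallest value p(XZ) can take if p(XY) and p(YZ) are both right."""
    return max(0.0, p_xy + p_yz - 1.0)


def find_violations(p_ab: float, p_bc: float, p_ac: float) -> List[str]:
    """Return the pairs whose probability falls below the bound implied
    by the other two pairs of the triplet."""
    checks = [
        ("AB", p_ab, triplet_lower_bound(p_bc, p_ac)),
        ("BC", p_bc, triplet_lower_bound(p_ab, p_ac)),
        ("AC", p_ac, triplet_lower_bound(p_ab, p_bc)),
    ]
    return [name for name, prob, bound in checks if prob < bound]


# A-B and B-C look like strong matches, but A and C share nothing.
violations = find_violations(p_ab=0.95, p_bc=0.90, p_ac=0.10)
consistent = find_violations(p_ab=0.95, p_bc=0.90, p_ac=0.90)
```

In the first call the bound on p(AC) is about 0.85, so the estimate of 0.10 is flagged as an anomaly, the kind that missing data can produce.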
Clustering “all” papers in Medline according to author-individuals
First we compute all pairwise probabilities for each (last name, first initial), modified with the triplet correction
Then we form clusters at p = 0.95 (high precision) and at p = 0.5 (high recall)
i.e. a paper joins a cluster if the chance is greater than 0.5 that it belongs to some cluster; otherwise it stays as a singleton
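To show how the two thresholds trade precision against recall, here is a minimal Python sketch. It links any two papers whose pairwise probability exceeds the threshold and takes connected components (single linkage with union-find). That linkage rule is an assumption made for the example; the slides do not say which clustering procedure is used.

```python
from typing import Dict, List, Tuple


def cluster(papers: List[str],
            pair_prob: Dict[Tuple[str, str], float],
            threshold: float) -> List[List[str]]:
    """Group papers into connected components of links with prob > threshold.

    Papers with no link above the threshold remain singletons.
    """
    parent = {p: p for p in papers}

    def find(x: str) -> str:
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), prob in pair_prob.items():
        if prob > threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for p in papers:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


papers = ["A", "B", "C", "D"]
pair_prob = {("A", "B"): 0.97, ("B", "C"): 0.70, ("A", "C"): 0.60,
             ("A", "D"): 0.10, ("B", "D"): 0.05, ("C", "D"): 0.20}

high_precision = cluster(papers, pair_prob, 0.95)
high_recall = cluster(papers, pair_prob, 0.5)
```

At 0.95 only the A-B link survives, so C and D stay singletons; at 0.5 paper C joins the A-B cluster while D remains alone.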
First-Pass Disambiguation is Complete!
Except for several hundred names having more than ~3000 papers each, which reach the memory limit; we will assess whether the model is reliable for the biggest names
For now, proceed for papers giving first names.
Monitoring for over-clustering and under-clustering
Summarizing global statistics
Immediate Next Steps
Evaluate the clustering performance
Old vs. new papers
Importance of missing data
Very frequent names
Singletons, least confident assignments
Update the web interface
Upcoming Grant Renewal
Aim 1: Special Cases
Name reversal, hyphenated names, spelling errors, Gerald vs. Jerry, Rick vs. A. Rick
Use co-author assignment to help disambiguate another co-author
Compute confidence level of assignment for each paper; identify least confident assignments
Upcoming Grant Renewal
Aim 2: Update the Model
Original model covers 1966-present, but new papers have different information: MeSH, emails, online information
Modify training sets with recent papers.
Journal name partial match
Abstract words match?
Affiliations matched to each author in PMC, online papers
References Cited information taken from PMC
Upcoming Grant Renewal
Aim 3: Web Interface
Update the pairwise interface (given a name and a particular paper, list all others in order of match probability)
Show clusters: given a name, show all clusters of author-individuals, link to Community of Science, searchable by attributes, can summarize and explore further (Anne O’Tate tool)
Author profile/collaboration finder tools
Data made available for bibliometrics and collaboration network research
Upcoming Grant Renewal
Aim 4: Curation
Curator to identify errors and least-confident assignments, manually and by machine methods (e.g. wobble in clustering)
Change the database and alter the model as needed
Wiki Authors: will monitor postings to the Wiki and change the database as verified and warranted (e.g. maiden name to married name)