Author Name Disambiguation in Medline

Presentation transcript:

Author Name Disambiguation in Medline Vetle I. Torvik and Neil R. Smalheiser August 31, 2006

A Statistically Based Model. Hypothesis: an individual tends to publish papers with similar attributes, consistently enough that these attributes suffice to disambiguate authors. The case of Dr. Tom Jobe? Training data: large, automatically generated sets of article pairs matching on (Last Name, First Initial), written by the same person vs. by different individuals.

Name Attributes: suffix, if present (III, Jr.); middle initial match; first name (if available in Medline, or if it can be scraped from online papers); name spelling variants; name frequency. The original model got very good performance without using first names at all!

Article Attributes: journal name; number of co-author names in common; affiliation words in common (affiliation may not be given for all co-authors); name correlations with affiliations (e.g. Ito and Japan are correlated); language of the article; title words in common; email addresses, if given (must be assigned to the right author); MeSH headings in common.
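As a rough illustration of how a pair of articles could be compared on some of these attributes, here is a minimal Python sketch. The `Article` record, its field names, and the choice of six dimensions are assumptions made for illustration only; the slides do not specify the data layout or how each attribute is scored.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical record holding a few of the attributes listed above."""
    journal: str
    language: str
    title_words: set = field(default_factory=set)
    coauthors: set = field(default_factory=set)
    affiliation_words: set = field(default_factory=set)
    mesh_headings: set = field(default_factory=set)


def match_vector(a, b):
    """Return one matching score per attribute for a pair of articles.

    Exact-match attributes score 0 or 1; set-valued attributes score
    the number of elements the two articles have in common.
    """
    return (
        int(a.journal == b.journal),
        int(a.language == b.language),
        len(a.title_words & b.title_words),
        len(a.coauthors & b.coauthors),
        len(a.affiliation_words & b.affiliation_words),
        len(a.mesh_headings & b.mesh_headings),
    )


paper_1 = Article(
    journal="J Neurosci", language="eng",
    title_words={"dendritic", "protein", "synthesis"},
    coauthors={"smith j", "lee k"},
    affiliation_words={"chicago", "psychiatry"},
    mesh_headings={"Neurons", "Rats"},
)
paper_2 = Article(
    journal="J Neurosci", language="eng",
    title_words={"dendritic", "spines"},
    coauthors={"lee k"},
    affiliation_words={"chicago"},
    mesh_headings={"Neurons", "Mice"},
)
vector = match_vector(paper_1, paper_2)
```

Counting shared elements rather than testing for any overlap keeps more information in each dimension, which matters if more shared attributes are taken as stronger evidence of a common author.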

A Monotone Model: Each pair of papers yields a 10-dimensional vector, with one matching score per dimension. Assume monotonicity: the more attributes in common, the more likely the papers were written by the same person. This allows for nonlinear and interactive effects across dimensions.
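The monotonicity assumption can be read as a partial order on match vectors: if one vector is at least as large as another in every dimension, its score should not be lower. The sketch below only checks a table of scores for violations of that ordering. The function names and the toy two-dimensional vectors are hypothetical, and how the actual model enforces monotonicity is not described in the slides.

```python
def dominates(x, y):
    """True if x is at least as large as y in every dimension."""
    return len(x) == len(y) and all(xi >= yi for xi, yi in zip(x, y))


def monotonicity_violations(scores):
    """List pairs (x, y) where x dominates y but has a lower score.

    `scores` maps a match vector (tuple) to an estimated score, for
    example the ratio of its frequencies in the two training sets.
    """
    violations = []
    items = list(scores.items())
    for x, score_x in items:
        for y, score_y in items:
            if x != y and dominates(x, y) and score_x < score_y:
                violations.append((x, y))
    return violations


# Toy example: (1, 1) shares more attributes than (1, 0) but scores lower.
toy_scores = {(1, 0): 2.0, (1, 1): 1.5, (0, 0): 0.5}
found = monotonicity_violations(toy_scores)
```

Vectors that do not dominate each other, such as (1, 0) and (0, 1), are left unconstrained, which is what leaves room for nonlinear and interactive effects.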

Estimate Pairwise Frequencies: For a given pair of articles, compute the match vector, then look up its frequencies in the positive vs. negative training sets; their ratio is the R value. For a given name, estimate the a priori probability P that any two papers were written by the same person (this is a whole story in itself...). Probability of a match = 1/[1 + (1-P)/(PR)]. The Author-ity Site is at (http://arrowsmith.psych.uic.edu)
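The formula on this slide is Bayes' rule written in terms of the prior P and the ratio R. A minimal Python sketch follows; the function names and the handling of the edge cases (a zero frequency, a prior of 0 or 1) are assumptions, since the slides give only the formula itself.

```python
def r_value(pos_frequency, neg_frequency):
    """Ratio of a match vector's relative frequency in the positive
    (same person) training set to that in the negative set."""
    if neg_frequency == 0:
        # Assumption: treat a vector never seen in the negative set
        # as unbounded evidence; a real system would likely smooth this.
        return float("inf")
    return pos_frequency / neg_frequency


def match_probability(prior, r):
    """Probability that two papers share an author: 1/[1 + (1-P)/(PR)]."""
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must lie in [0, 1]")
    if r < 0:
        raise ValueError("r must be non-negative")
    if prior == 0.0 or r == 0.0:
        return 0.0  # avoids dividing by zero; the limit of the formula
    return 1.0 / (1.0 + (1.0 - prior) / (prior * r))


# A rare-name prior of 0.5 with neutral evidence leaves the prior unchanged.
neutral = match_probability(0.5, 1.0)
# A common-name prior of 0.1 needs R = 9 just to reach even odds.
even_odds = match_probability(0.1, 9.0)
```

The second example shows why the name-specific prior matters: the same match vector gives very different probabilities for a rare name and a very frequent one.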

Beyond Pairwise Comparisons: A and B share titles, journals. B and C share co-authors, affiliations. But A and C share nothing! Yet p(AC) must be > (p(AB) + p(BC) - 1). Using this triangle inequality on probabilities, detect and correct anomalies due to missing data or higher-order correlations. Catch uncharacteristic papers by an author. Another long story to optimize the methods!
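The detection half of this idea can be sketched in a few lines of Python. The function names and the dictionary layout for pairwise probabilities are hypothetical, and the sketch only flags triplets that break the bound; the slides do not say how the flagged values are corrected, so no correction step is shown.

```python
from itertools import permutations


def pair_key(i, j):
    """Store each unordered pair under a single, sorted key."""
    return (i, j) if i <= j else (j, i)


def triangle_violations(p, items, tol=1e-12):
    """Find triplets (a, b, c) where p(ac) < p(ab) + p(bc) - 1.

    `p` maps sorted pairs of paper ids to match probabilities.
    Returns tuples (a, b, c, bound), with b the linking paper.
    """
    violations = []
    for a, b, c in permutations(items, 3):
        if a > c:
            continue  # visit each outer pair (a, c) once per middle b
        bound = p[pair_key(a, b)] + p[pair_key(b, c)] - 1.0
        if p[pair_key(a, c)] < bound - tol:
            violations.append((a, b, c, bound))
    return violations


# A and B match strongly, B and C match strongly, A and C share nothing.
pairwise = {("A", "B"): 0.99, ("B", "C"): 0.98, ("A", "C"): 0.10}
flagged = triangle_violations(pairwise, ["A", "B", "C"])
```

In this example the bound on p(AC) is 0.99 + 0.98 - 1 = 0.97, so the observed value of 0.10 is flagged as an anomaly.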

Clustering “all” papers in Medline according to author-individuals: First we compute all pairwise probabilities for each (last name, first initial), modified with the triplet correction. Then we form clusters at p = 0.95 (high precision) and at p = 0.5 (high recall); i.e., a paper joins a cluster if the chance is greater than 0.5 that it belongs to some cluster, otherwise it stays as a singleton.
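To show how the two thresholds trade precision against recall, here is a minimal Python sketch that links any two papers whose pairwise probability meets the threshold and reports the connected groups. This single-linkage rule with union-find is an assumption chosen for brevity; the slides do not state which clustering algorithm is actually used.

```python
def cluster_at_threshold(items, p, threshold):
    """Group papers connected by pairs with probability >= threshold.

    `p` maps pairs of paper ids to match probabilities. Papers with no
    qualifying link stay as singletons.
    """
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), prob in p.items():
        if prob >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())


papers = ["A", "B", "C", "D"]
pairwise = {
    ("A", "B"): 0.99, ("B", "C"): 0.70, ("A", "C"): 0.60,
    ("A", "D"): 0.10, ("B", "D"): 0.20, ("C", "D"): 0.30,
}
high_precision = cluster_at_threshold(papers, pairwise, 0.95)
high_recall = cluster_at_threshold(papers, pairwise, 0.5)
```

At 0.95 only A and B are merged, while at 0.5 paper C joins them and D remains a singleton at both thresholds.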

First-Pass Disambiguation is Complete! Except for several hundred names having more than ~3000 papers each, which reach the memory limit; we will assess whether the model is reliable for the biggest names. For now, proceed with papers giving first names. Monitoring for over-clustering and under-clustering. Summarizing global statistics.

Immediate Next Steps: Evaluate the clustering performance: old vs. new papers; importance of missing data; very frequent names; singletons and least confident assignments. Update the web interface.

Upcoming Grant Renewal, Aim 1: Special Cases. Name reversal, hyphenated names, spelling errors, Gerald vs. Jerry, Rick vs. A. Rick. Use co-author assignment to help disambiguate another co-author. Compute a confidence level of assignment for each paper; identify the least confident assignments.

Upcoming Grant Renewal, Aim 2: Update the Model. The original model covers 1966-present, but new papers carry different information: MeSH, emails, online information. Modify training sets with recent papers. Journal name partial match. Abstract words match? Affiliations matched to each author in PMC and online papers. References-cited information taken from PMC.

Upcoming Grant Renewal, Aim 3: Web Interface. Update the pairwise interface (given a name and a particular paper, list all others in order of match probability). Show clusters: given a name, show all clusters of author-individuals, link to Community of Science, searchable by attributes, can summarize and explore further (Anne O’Tate tool). Author profile/collaboration finder tools. Data made available for bibliometrics and collaboration network research.

Upcoming Grant Renewal, Aim 4: Curation. Curator to identify errors and least-confident assignments, manually and by machine methods (e.g. wobble in clustering), and to change the database and alter the model as needed. Wiki Authors: will monitor postings to the Wiki and change the database as verified and warranted (e.g. maiden name to married name).