NICTA Copyright 2013. From imagination to impact. Identifying Publication Types Using Machine Learning. BioASQ Challenge Workshop. A. Jimeno Yepes, J.G. Mork, A. R. Aronson

Presentation transcript:

Identifying Publication Types Using Machine Learning. BioASQ Challenge Workshop. A. Jimeno Yepes, J.G. Mork, A. R. Aronson.

Publication Types: Publication Types (PTs) define the genre of an article, e.g. Review. They are a special type of MeSH heading used to indicate what an article is rather than what it is about. Citations can be indexed with more than one PT. There are 61 PTs in the four top-level sub-trees of the MeSH Publication Characteristics (V) tree that the indexers typically use.

Publication Type: Review Example. "This review attempts to highlight..." PMID: (September 12, 2013)

Publication Types: PubMed allows queries that include publication type fields, e.g. Review[pt]. PTs are available in the MEDLINE citation XML and ASCII formats. Example PTs: Clinical Trial, Phase II; Journal Article; Randomized Controlled Trial; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't.
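
The two access routes mentioned on this slide can be sketched as follows. This is a minimal illustration, not part of the original talk: it assumes the public NCBI E-utilities ESearch endpoint for the query, and it parses a hand-written fragment that only imitates the PublicationTypeList element of MEDLINE citation XML. The helper names `pt_query_url` and `publication_types` are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (an assumption; the slide only
# mentions the PubMed query syntax, not a specific API).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pt_query_url(publication_type: str, retmax: int = 20) -> str:
    """Build an ESearch URL restricted to one publication type via the [pt] tag."""
    params = {"db": "pubmed", "term": f"{publication_type}[pt]", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# Hand-written fragment in the style of MEDLINE citation XML (illustrative only).
SAMPLE_XML = """
<Article>
  <PublicationTypeList>
    <PublicationType>Journal Article</PublicationType>
    <PublicationType>Randomized Controlled Trial</PublicationType>
  </PublicationTypeList>
</Article>
"""


def publication_types(xml_text: str) -> list[str]:
    """Return the text of every PublicationType element in the fragment."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("PublicationType")]


url = pt_query_url("Review")            # built only, no network call is made
pts = publication_types(SAMPLE_XML)
```

The URL is only constructed here; fetching it and handling the response is left out on purpose.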

Motivation: Indexing of citations with Publication Types (PTs) as part of the Indexing Initiative at the US NLM. Recommend PTs as part of the MTI (Medical Text Indexer) support tool. MTI performed poorly on PTs in previous attempts and stopped suggesting PTs altogether on November 10,

MTI in a nutshell

Machine learning motivation: MTI showed poor results in Publication Type (PT) indexing in previous work. Indexing of PTs can be seen as a text categorization task. We treat it as a binary problem: for a given PT, the citations indexed with it are the positives and all other citations are the negatives.
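
The binary framing above can be sketched as a labelling step that is repeated once per PT. The citation records and the `binary_labels` helper are hypothetical and only illustrate the positive/negative split; the PMIDs are made up.

```python
def binary_labels(citations, target_pt):
    """Label each citation 1 if it is indexed with target_pt, else 0."""
    return [1 if target_pt in citation["pts"] else 0 for citation in citations]


# Toy citations with made-up PMIDs; a citation may carry more than one PT.
citations = [
    {"pmid": 1, "pts": {"Journal Article", "Review"}},
    {"pmid": 2, "pts": {"Journal Article", "Case Reports"}},
    {"pmid": 3, "pts": {"Journal Article"}},
]

review_labels = binary_labels(citations, "Review")
case_report_labels = binary_labels(citations, "Case Reports")
```

Because a citation can have several PTs, the same citation may be a positive for one classifier and a negative for another.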

Data set development: Indexing policy changes over time, so we consider only the most recent indexing. Selected citations with a Date Completed (the date indexing was applied to the citation) ranging from January 1, 2009 to December 31,

Data set development: The data set was obtained from the 2012 MEDLINE Baseline Repository (MBR) Query Tool. MBR allows us to randomly divide the list of PMIDs into Training (2/3) and Testing (1/3) sets. 1,784,061 randomly selected PMIDs for Training and 878,718 for Testing.
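
A random 2/3 to 1/3 division of a PMID list can be sketched as below. This is not the MBR Query Tool's actual procedure, which the slides do not describe; `split_pmids`, the seed, and the toy PMID range are assumptions made for illustration.

```python
import random


def split_pmids(pmids, train_fraction=2 / 3, seed=13):
    """Shuffle the PMIDs reproducibly and cut them into training and testing lists."""
    shuffled = list(pmids)
    random.Random(seed).shuffle(shuffled)   # local RNG, global state untouched
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy PMIDs 1..30 stand in for the real list.
train, test = split_pmids(range(1, 31))
```

Fixing the seed keeps the split reproducible, so every classifier is trained and evaluated on the same partitions.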

Data set development: Filter out articles requiring special handling: OLDMEDLINE, PubMed-not-MEDLINE, articles with no indexing, CommentOn, RetractionOf, PartialRetractionOf, UpdateIn, RepublishedIn, ErratumFor, and ReprintOf. Final data set: 1,321,512 articles for Training and 651,617 articles for Testing.
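
The filtering step can be sketched as a predicate over citation records. The record layout (`status`, `mesh_headings`, `links`) is a hypothetical schema chosen for this example; only the category names come from the slide.

```python
# Category names taken from the slide; how they are stored is an assumption.
EXCLUDED_STATUSES = {"OLDMEDLINE", "PubMed-not-MEDLINE"}
EXCLUDED_LINKS = {
    "CommentOn", "RetractionOf", "PartialRetractionOf", "UpdateIn",
    "RepublishedIn", "ErratumFor", "ReprintOf",
}


def needs_special_handling(record):
    """Return True if the record falls into any category the slide filters out."""
    if record["status"] in EXCLUDED_STATUSES:
        return True
    if not record["mesh_headings"]:          # article with no indexing
        return True
    return any(link in EXCLUDED_LINKS for link in record["links"])


records = [
    {"pmid": 1, "status": "MEDLINE", "mesh_headings": ["Humans"], "links": []},
    {"pmid": 2, "status": "OLDMEDLINE", "mesh_headings": ["Humans"], "links": []},
    {"pmid": 3, "status": "MEDLINE", "mesh_headings": [], "links": []},
    {"pmid": 4, "status": "MEDLINE", "mesh_headings": ["Mice"], "links": ["ErratumFor"]},
]

kept = [record["pmid"] for record in records if not needs_special_handling(record)]
```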

Test set statistics: Citations in test set: 651,617. Imbalance between positives and negatives.

Publication Type              Occurs   Abbrev   Baseline F1
Case Reports                  51,037   CR       -
Clinical Trial                 6,165   CT       -
Congresses                     1,954   CO
Controlled Clinical Trial      1,727   CC       -
Editorial                     11,519   ED       -
English Abstract              46,471   EA
In Vitro                       4,284   IV
Meta-Analysis                  3,467   MA
Randomized Controlled Trial   17,356   RC       -
Review                        75,298   RV       -
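
The imbalance can be quantified directly from the counts on this slide. The sketch below only divides each PT's occurrence count by the test set size; the helper name `positive_rate` is an assumption.

```python
TEST_SET_SIZE = 651_617

# Occurrence counts copied from the test set statistics slide.
OCCURRENCES = {
    "Case Reports": 51_037,
    "Clinical Trial": 6_165,
    "Congresses": 1_954,
    "Controlled Clinical Trial": 1_727,
    "Editorial": 11_519,
    "English Abstract": 46_471,
    "In Vitro": 4_284,
    "Meta-Analysis": 3_467,
    "Randomized Controlled Trial": 17_356,
    "Review": 75_298,
}


def positive_rate(publication_type):
    """Fraction of test citations that are positives for the given PT."""
    return OCCURRENCES[publication_type] / TEST_SET_SIZE


rates = {pt: positive_rate(pt) for pt in OCCURRENCES}
most_frequent = max(rates, key=rates.get)
least_frequent = min(rates, key=rates.get)
```

Even the most frequent PT, Review, is a positive for only about 12% of the test citations, and Controlled Clinical Trial for under 0.3%, so every binary task is dominated by negatives.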

Machine learning algorithms: MTI ML: Support Vector Machine with Stochastic Gradient Descent based on Hinge Loss (Sgd); Modified Huber Loss (Yeganova et al., 2011) (Mhl); AdaBoostM1 with C4.5 as the base method (Ada). Mallet: Naïve Bayes (NB) and Logistic Regression (LR).
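
The two loss functions named on this slide can be sketched as below, together with one stochastic gradient step on the hinge loss. The formulas are the standard textbook definitions and may differ in detail from the variant in Yeganova et al. (2011) and from the MTI ML implementation; all function names are assumptions, and regularization is left out for brevity.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a real-valued classifier score."""
    return max(0.0, 1.0 - y * score)


def modified_huber_loss(y, score):
    """Standard modified Huber loss: quadratic near the margin, linear far below it."""
    margin = y * score
    if margin >= -1.0:
        return max(0.0, 1.0 - margin) ** 2
    return -4.0 * margin


def sgd_hinge_step(weights, features, y, learning_rate=0.1):
    """One SGD update on the hinge loss for a sparse example (no regularization)."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    if y * score < 1.0:                      # example violates the margin
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) + learning_rate * y * value
    return weights


weights = sgd_hinge_step({}, {"review": 1.0, "literature": 1.0}, y=1)
```

The linear tail of the modified Huber loss makes it less sensitive than a squared loss to badly misclassified examples, which is one common reason to try it alongside the hinge loss.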

Features: Title and abstract text (Base). Base + additional features, i.e. Journal Unique Identifier, Author affiliations, Author Names, and Grant Agencies (F). Base + bigrams (B). Base + additional features + bigrams (BF). AdaBoostM1 was not trained with bigrams due to time constraints.
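
The four feature configurations (Base, F, B, BF) can be sketched with two switches on a single extraction function. The tokenizer, the feature prefixes (`JID=`, `AU=`, `AFF=`, `GR=`), and the citation fields are assumptions made for this example; the slides do not say how the features were encoded.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_features(citation, use_bigrams=False, use_metadata=False):
    """Base = title and abstract tokens; switches add bigrams (B) and metadata (F)."""
    toks = tokens(citation["title"] + " " + citation["abstract"])
    features = set(toks)
    if use_bigrams:
        features.update(f"{a}_{b}" for a, b in zip(toks, toks[1:]))
    if use_metadata:
        features.add("JID=" + citation["journal_id"])
        features.update("AU=" + author for author in citation["authors"])
        features.update("AFF=" + tok for tok in tokens(citation["affiliation"]))
        features.update("GR=" + agency for agency in citation["grant_agencies"])
    return features


# Made-up citation; the field values are placeholders, not real identifiers.
citation = {
    "title": "Statins: a review",
    "abstract": "This review attempts to highlight recent trials.",
    "journal_id": "J0000001",
    "authors": ["Doe J"],
    "affiliation": "Example University",
    "grant_agencies": ["Example Agency"],
}

base = extract_features(citation)
bf = extract_features(citation, use_bigrams=True, use_metadata=True)
```

Prefixing the metadata features keeps, for example, an author surname from colliding with the same word in the abstract.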

Results (F1 measure): Table with one column per PT (CR, CT, CO, CC, ED, EA, IV, MA, RC, RV) and one row per method and feature set: Mhl, Mhl-F, Mhl-B, Mhl-BF, Sgd, Sgd-F, Sgd-B, Sgd-BF, NB, NB-F, NB-B, NB-BF, LR, LR-F, LR-B, LR-BF, Ada, Ada-F.
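
For reference, the F1 measure reported in the results is the harmonic mean of precision and recall on the positive class. A minimal sketch with made-up labels (the `f1_score` helper is written here for illustration and is not the evaluation code used in the study):

```python
def f1_score(gold, predicted):
    """F1 on the positive class (label 1) for two equal-length label lists."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == 1 and p == 1)
    false_pos = sum(1 for g, p in pairs if g == 0 and p == 1)
    false_neg = sum(1 for g, p in pairs if g == 1 and p == 0)
    if true_pos == 0:
        return 0.0                       # avoids division by zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
score = f1_score(gold, predicted)
```

Because F1 ignores true negatives, it is a more informative measure than accuracy when positives are as rare as they are for most PTs.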

Methods/features comparison: No single method works best for all of the Publication Types, echoing the findings for MeSH indexing. Logistic Regression provides the highest F1 measures for six of the ten PTs in our study. Adding bigrams and additional features tends to perform better than using just title and abstract tokens.

Naïve Bayes performance: Naïve Bayes is far behind all of the other methods. This effect, already known (Rennie et al., 2003), is more dramatic when there is an imbalance between the classes. It is also more dramatic with a larger set of dependent features.

ML performance indexing PTs: Case Reports, Congresses, English Abstract, Meta-Analysis, Randomized Controlled Trial, and Review all have F1 measures above 0.7, making them promising candidates for future integration into the indexing process. The remaining PTs (Clinical Trial, Controlled Clinical Trial, Editorial, and In Vitro) all have F1 measures too low for consideration at this time, but they provide the kernel for further research into improving their performance.

English Abstract PT: ML performance is already high (F1: ). An indexing rule is already in place: if an article has a title in brackets (meaning it was translated into English) and contains an abstract, it should receive the English Abstract Publication Type. This PT is already assigned automatically using this rule, and ML algorithms need to add more features explicitly.
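
The indexing rule described on this slide can be sketched as a simple predicate. This is not the NLM implementation: the function name is hypothetical, the example titles are made up, and the handling of a trailing period after the closing bracket is an assumption about how bracketed titles are stored.

```python
def is_english_abstract(title, abstract):
    """Rule sketch: bracketed (translated) title plus a non-empty abstract."""
    stripped = title.strip()
    bracketed = stripped.startswith("[") and stripped.rstrip(".").endswith("]")
    has_abstract = bool((abstract or "").strip())
    return bracketed and has_abstract


translated = is_english_abstract(
    "[Treatment of hypertension in the elderly].", "We studied 120 patients."
)
original = is_english_abstract("Treatment of hypertension", "We studied 120 patients.")
no_abstract = is_english_abstract("[Treatment of hypertension in the elderly].", "")
```

A learned classifier only sees this signal if the brackets survive tokenization, which is one way to read the remark that ML algorithms need such features added explicitly.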

In Vitro PT: In Vitro is one of the low-performing terms. In our error analysis, we found that in almost all of the false negatives we manually reviewed, the information for designating the article as In Vitro was located in the Methods section of the full text of the article.

Conclusions: Evaluated the automatic assignment of PTs to MEDLINE articles based on machine learning. For the majority (6 of 10) of PTs the performance is quite good with F1 measures above

Conclusions: In addition to the title and abstract text, further information from other fields in the MEDLINE article results in improved performance. Extend the current work to include most of the remaining frequently used PTs, and explore the use of openly available full text from PubMed Central to see its impact on terms like In Vitro.

Questions? MTI ML package. Publication Types data set.