©2013 MFMER | An Incremental Approach to MEDLINE MeSH Indexing. Presenter: Hongfang Liu. BioASQ 2013.

Presentation transcript:

©2013 MFMER | slide-1 An Incremental Approach to MEDLINE MeSH Indexing. Presenter: Hongfang Liu. BioASQ 2013. Team members: Mayo Clinic: Wu Stephen, James Masanz, and Hongfang Liu; University of Delaware: Dongqing Zhu, Ben Carterette.

©2013 MFMER | slide-2 Outline: Motivation & Task; Incremental Systems (MetaMap-based, Search-based, LLDA-based); Experiment Setup; Evaluation; Conclusion.

©2013 MFMER | slide-3 Motivation of BioASQ Task. Reduce human effort in MeSH indexing: increasing number of new articles; low consistency among annotators [Funk and Reid]. Automatic MeSH indexing: suggest MeSH terms for a given new article.

©2013 MFMER | slide-4 Motivation of Mayo’s Participation. Information retrieval (IR)-based ontology annotation; the traditional approach has been information extraction-based. Three levels of intelligence in artificial intelligence: knowledge-base intelligence, data intelligence, user intelligence. Goal: explore the use of topic modeling and distant supervision for ontology annotation.

©2013 MFMER | slide-5 Proposed Approaches: MetaMap-based, Search-based, LLDA-based. The three approaches can work either independently or together in an incremental way to produce DUI.

©2013 MFMER | slide-6 MetaMap-based System. Example input: Title: Age-period-cohort effect on mortality from cervical cancer. Abstract: to estimate the effect of age, period and birth cohort … MetaMap, restricted to the MeSH ontology, returns CUI candidates with scores. A ranked list of CUI => a ranked list of DUI.

©2013 MFMER | slide-7 MetaMap-based System, Parameter Tuning: Title concepts are more important. Low threshold roughly leads to high precision/recall. Tradeoff between P/R.

©2013 MFMER | slide-8 Search-based System. Retrieval model: each query returns a ranked list of docs (D01, D02, D03 …; D08, D03, D01 …; D02, D03, D01 …). DUI aggregation: DUI ranked by tf * score(Q, D).
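
The aggregation step can be sketched roughly as follows. This is only one reading of "tf * score(Q, D)": it assumes each retrieved training document carries its MeSH DUIs and a positive retrieval score, and that a DUI's weight is the sum of the scores of the top-k documents annotated with it. The function name `aggregate_duis`, the data layout, and the identifiers in the example are hypothetical; the default of 20 top documents follows the value reported on slide 22.

```python
from collections import defaultdict

def aggregate_duis(ranked_docs, top_k=20):
    """Rank candidate DUIs from the top_k retrieved documents.

    ranked_docs: list of (score, [dui, ...]) pairs, best document first.
    Assumption: a DUI's weight is the sum of the retrieval scores of the
    documents annotated with it, so DUIs that occur often in highly scored
    documents rise to the top.
    """
    weights = defaultdict(float)
    for score, duis in ranked_docs[:top_k]:
        for dui in set(duis):  # count each DUI once per document
            weights[dui] += score
    # Sort by weight (descending), then by DUI for a deterministic order.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical retrieval result: three documents with placeholder DUIs.
docs = [
    (0.9, ["D000001", "D000002"]),
    (0.5, ["D000002", "D000003"]),
    (0.1, ["D000003"]),
]
ranking = aggregate_duis(docs, top_k=2)
```

If the search engine returns log-probabilities (negative values), the scores would need to be mapped to positive values before summing, otherwise frequent DUIs would be penalized rather than rewarded.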

©2013 MFMER | slide-9 Search-based System. Example queries:
#weight(2.0 examination 2.0 cow 2.0 ultrasonographic 3.0 navel 3.0 urachal 3.0 extra-abdominal 2.0 pathologic 2.0 abscess)
#weight(3.5 #uw2(hiv-1 infection) 4.5 #uw2(differential susceptibility) 2.0 #uw2(actin dynamics) 2.0 actin 4.5 #uw2(cortical actin) 4.5 #uw3(naive t cells) 2.5 dichotomy 3.5 #uw2(human memory) 3.5 #uw3(chemotactic actin activity) 2.0 cd45ro)

©2013 MFMER | slide-10 Search-based System, Parameter Tuning: Less smoothing => better performance. A small set of highly relevant documents works best. Tradeoff between P/R.

©2013 MFMER | slide-11 LLDA-based System. LDA process: each document is a mixture of topics; each topic is a multinomial word distribution. Labeled LDA: incorporates label information.

©2013 MFMER | slide-12 LLDA-based System. Top categories in MeSH: the top-level categories directly under the root are used as topics (e.g., Anatomy Category, Chemicals and Drugs Category, etc.). Each label below them is converted to its corresponding top-level labels.
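
As a rough illustration of converting a label to its top-level labels, the sketch below assumes each descriptor is available with its MeSH tree numbers, whose first letter identifies the top-level category. The lookup table lists only four of the categories, and the function name and the tree numbers in the example are illustrative assumptions rather than the authors' code.

```python
# Subset of the MeSH top-level categories, keyed by tree-number prefix.
TOP_LEVEL = {
    "A": "Anatomy",
    "B": "Organisms",
    "C": "Diseases",
    "D": "Chemicals and Drugs",
}

def top_level_labels(tree_numbers):
    """Map tree numbers (e.g. 'C04.588') to their top-level categories.

    A descriptor may sit in several branches, so the result is a sorted
    list of distinct category names. Prefixes missing from TOP_LEVEL are
    ignored in this sketch.
    """
    return sorted({TOP_LEVEL[t[0]] for t in tree_numbers if t and t[0] in TOP_LEVEL})

# Illustrative tree numbers only.
labels = top_level_labels(["C04.588.945", "A05.360", "C13.351"])
```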

©2013 MFMER | slide-13 LLDA-based System. DUI candidate list pruning: for each doc, the search-based system supplies candidate DUI and the LLDA-based system supplies categories; combining the two gives a pruned ranked list.
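
One simple way to realize this pruning, assuming a candidate is kept when it shares at least one top-level category with the LLDA prediction, is sketched below. The exact pruning rule is not given on the slide, so the rule, the function name `prune_candidates`, and the example data are all assumptions.

```python
def prune_candidates(ranked_duis, dui_categories, predicted_categories):
    """Filter a ranked DUI list by predicted top-level categories.

    ranked_duis: DUIs from the search-based system, best first.
    dui_categories: dict mapping each DUI to its top-level categories.
    predicted_categories: categories suggested by the LLDA-based system.
    The original rank order is preserved; DUIs with no known category
    are dropped in this sketch.
    """
    allowed = set(predicted_categories)
    return [d for d in ranked_duis if allowed & set(dui_categories.get(d, ()))]

# Hypothetical candidates and category assignments.
candidates = ["D000001", "D000002", "D000003"]
categories = {
    "D000001": ["Diseases"],
    "D000002": ["Organisms"],
    "D000003": ["Anatomy", "Diseases"],
}
pruned = prune_candidates(candidates, categories, ["Diseases"])
```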

©2013 MFMER | slide-14 Data Training -- Testing -- input: output:

©2013 MFMER | slide-15 Evaluation. MM: MetaMap-based system; Mi: micro; LCA: lowest common ancestor.
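
For readers unfamiliar with the "micro" measures, the following sketch shows micro-averaged precision, recall and F-measure over a set of articles: true positives, false positives and false negatives are pooled across all articles before the ratios are taken. It is a generic illustration with made-up labels, not the official BioASQ evaluation code, and it does not cover the hierarchical (LCA-based) measures.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over parallel label lists.

    gold, predicted: lists of label collections, one entry per article.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels correctly suggested
        fp += len(p - g)   # suggested but not in the gold annotation
        fn += len(g - p)   # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for two articles.
gold = [{"a", "b"}, {"c"}]
pred = [{"a"}, {"c", "d"}]
p, r, f = micro_prf(gold, pred)
```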

©2013 MFMER | slide-16 Conclusion and Future Work. Three systems: MetaMap-based, search-based, LLDA-based. Research findings: explored the impact of various parameters on performance; promising results from search-based labeling. Future directions: better concept weighting strategies (e.g., corpus-level statistics, external resources); comprehensive comparisons with existing methods; a better strategy for incorporating hierarchical information into LLDA.

©2013 MFMER | slide-17 Questions & Discussion

©2013 MFMER | slide-18 Baseline: MetaMap-based Labeling. CONCEPT DETECTION: 1. Concepts (K): phrases or terms mapping to UMLS CUI. 2. List (L) of CUI (c) with confidence scores (S_c). 3. Negation information for each K. CONCEPT WEIGHTING: 1. Select non-negated CUI (c) with score higher than threshold h. 2. Merge and rank c with weighted scores, where α is the weight assigned to T(itle) and β the weight assigned to A(bstract). 3. β is fixed to 1.0 while optimizing α. Finally, convert the highly ranked list of c to MeSH Descriptor Unique Identifiers (DUI).
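
The weighting formula itself is not preserved in the transcript. A minimal sketch of one plausible form, assuming the merged score of a CUI is α times the sum of its title scores plus β times the sum of its abstract scores, is given below. The additive form, the function name `rank_cuis`, the value of α, and the placeholder CUIs are assumptions; the threshold of 600 follows the value reported on slide 21, and β = 1.0 follows the slide.

```python
from collections import defaultdict

def rank_cuis(title_hits, abstract_hits, alpha=2.0, beta=1.0, threshold=600):
    """Merge and rank CUIs found in the title and the abstract.

    title_hits, abstract_hits: lists of (cui, score, negated) tuples,
    as might be parsed from concept-detection output.
    Assumed merge rule: alpha * (title scores) + beta * (abstract scores),
    counting only non-negated concepts scoring above the threshold.
    """
    totals = defaultdict(float)
    for weight, hits in ((alpha, title_hits), (beta, abstract_hits)):
        for cui, score, negated in hits:
            if not negated and score > threshold:
                totals[cui] += weight * score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Placeholder CUIs and scores for illustration.
title = [("C0000001", 1000, False)]
abstract = [
    ("C0000001", 800, False),
    ("C0000002", 900, False),
    ("C0000003", 950, True),   # negated, so skipped
    ("C0000004", 500, False),  # below threshold, so skipped
]
cui_ranking = rank_cuis(title, abstract)
```

The resulting ranked CUI list would then be mapped to DUIs, a step that depends on a CUI-to-DUI lookup not shown here.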

©2013 MFMER | slide-19 Incremental Labeling: Search-based Labeling 1. Pipeline: index the training set with Indri; retrieve MeSH for the testing set. Indexing: filter out words with a medical stoplist; extract stems with the Porter stemmer; indexed fields include titles and abstracts. Retrieval model: w_i is the weight for the ith matched query term q_i; f(q_i, D) is the query term matching function; |D| and |C| are the lengths of the document and the collection; tf_{q_i,D} and tf_{q_i,C} are the document and collection term frequencies of q_i; μ is the Dirichlet smoothing parameter. Remaining steps: query formulation and result aggregation.
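
The symbols defined on this slide match the standard Dirichlet-smoothed query-likelihood model, score(Q, D) = Σ_i w_i · f(q_i, D) with f(q_i, D) = log((tf_{q_i,D} + μ · tf_{q_i,C} / |C|) / (|D| + μ)). The sketch below implements that standard form on the assumption that it is what the slide showed; the function names, the default value of μ, and the toy statistics are illustrative, and the weights are used as given without normalization.

```python
import math

def term_score(tf_d, doc_len, tf_c, coll_len, mu=2500.0):
    """Dirichlet-smoothed log-probability f(q_i, D) of one query term."""
    return math.log((tf_d + mu * tf_c / coll_len) / (doc_len + mu))

def score_document(weighted_terms, doc_tf, doc_len, coll_tf, coll_len, mu=2500.0):
    """Weighted sum of term scores: sum_i w_i * f(q_i, D).

    weighted_terms: list of (term, weight) pairs from the query.
    doc_tf / coll_tf: term-frequency dicts for the document / collection.
    Terms never seen in the collection are skipped to avoid log(0).
    """
    total = 0.0
    for term, weight in weighted_terms:
        tf_c = coll_tf.get(term, 0)
        if tf_c == 0:
            continue
        total += weight * term_score(doc_tf.get(term, 0), doc_len, tf_c, coll_len, mu)
    return total

# Toy statistics: one document mentions "abscess" twice, the other never.
query = [("abscess", 2.0), ("cow", 2.0)]
coll_tf = {"abscess": 10, "cow": 20}
with_term = score_document(query, {"abscess": 2}, 100, coll_tf, 10000)
without_term = score_document(query, {}, 100, coll_tf, 10000)
```

A smaller μ lets the document's own term counts dominate the estimate, which is the "less smoothing" setting discussed in the parameter tuning slides.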

©2013 MFMER | slide-20 Search-based Labeling 2. Pipeline: index the training set with Indri; retrieve MeSH for the testing set. Steps: retrieval model, query formulation, result aggregation. Query formulation: K_T are the terms in the title extracted by MetaMap, and K_A the terms in the abstract likewise. Three query types: Term Query (TQ); Phrase Query (PQ), which considers collocations; Long Query (LQ), a longer query than the phrase query in which order and proximity are considered.
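
A small sketch of how term and phrase queries of the kind shown on slide 9 could be assembled is given below. It assumes single words are emitted as plain terms and multi-word phrases as unordered windows `#uwN(...)` with N equal to the number of words, which matches the examples in the transcript. How the weights were derived is not stated, so they are simply passed in; the function name `indri_weight_query` is hypothetical.

```python
def indri_weight_query(weighted_items):
    """Build a '#weight(...)' query string from (weight, text) pairs.

    Single words become plain terms; multi-word phrases become unordered
    windows '#uwN(...)' where N is the number of words in the phrase.
    Empty texts are skipped.
    """
    parts = []
    for weight, text in weighted_items:
        words = text.lower().split()
        if not words:
            continue
        if len(words) == 1:
            expr = words[0]
        else:
            expr = "#uw{}({})".format(len(words), " ".join(words))
        parts.append("{} {}".format(weight, expr))
    return "#weight({})".format(" ".join(parts))

# Weights and phrases taken from the phrase-query example on slide 9.
phrase_query = indri_weight_query(
    [(3.5, "HIV-1 infection"), (2.0, "actin"), (4.5, "naive T cells")]
)
```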

©2013 MFMER | slide-21 Parameter Explorations 2. Parameter setting for MetaMap-based labeling: a) Figure a shows that the higher the weight for the title, the better the results. b) Figure b shows that the best CI threshold is 600. c) Figure c shows that recall increases with the number of DUI returned while precision decreases.

©2013 MFMER | slide-22 Parameter Explorations 3. Parameter setting for search-based labeling: a) Figure d: more smoothing hurts performance. b) Figure e: the best results are obtained when the number of top documents is 20. c) Figure f: as in Figure c, recall increases with the number of DUI returned while precision decreases.

©2013 MFMER | slide-23 Incremental filtering with Labeled Latent Dirichlet Allocation (LLDA). [Plate diagram with variables θ_d, L_mesh, w, α, z, γ, ψ and plates N and D.] Generative story: 1) A generative topic model. 2) Both α and ψ play the role of priors for topic generation. 3) θ_d generates document topics, tuned by both α and the MeSH labels L. 4) The word-topic distribution γ and the document topic z_d generate word w_i. Training and testing: Training: parameter estimation with Gibbs sampling for θ and γ using 10% of the provided PubMed corpus. Testing: the trained model suggests multiple MeSH terms for the testing data. Filtering: the suggested MeSH term sets are used to filter the results obtained from search-based labeling.