Ke Liu1, Junqiu Wu2, Shengwen Peng1, Chengxiang Zhai3, Shanfeng Zhu1


The Fudan-UIUC Participation in the BioASQ Challenge Task 2a: The Antinomyra System
Ke Liu1, Junqiu Wu2, Shengwen Peng1, Chengxiang Zhai3, Shanfeng Zhu1 (zhusf@fudan.edu.cn)
1 Fudan University, 2 Central South University, 3 University of Illinois at Urbana-Champaign

Outline: Introduction, Related Work, Our Methods, Experimental Results, Conclusion

Introduction: MeSH Terms
(Background reading: The NLM Indexing Initiative: Current Status and Role in Improving Access to Biomedical Information)
Each year, around 0.8 million biomedical documents are added to MEDLINE.

MeSH is Important
- Indexing all documents in MEDLINE
- Indexing many books and collections in NLM
- Improving retrieval performance by query expansion using MeSH
- Improving clustering performance by integrating MeSH information [Zhu et al. 2009 IP&M] [Zhe et al. 2009 Bioinformatics] [Gu et al. 2013 IEEE TSMCB]
- Improving biomedical text mining performance

Automatic MeSH annotation is a challenging problem
- More than 26,000 MeSH headings organized in a hierarchical structure
- Quickly approaching 1,000,000 articles indexed per year
- ~$9.40 to index an article

- The number of distinct MeSH headings is large (almost 27,000)
- Large variations of MeSH frequencies in MEDLINE
- Large variations in the number of MeSH terms per document

BioASQ (Large-Scale Biomedical Semantic Indexing Competition)
Batch 3, week 1: 4342 docs
Batch 3, week 2: 8840 docs
Batch 3, week 3: 3702 docs
Batch 3, week 4: 4726 docs
Batch 3, week 5: 4533 docs

Label-Based Micro F1-measure (MiF)
L denotes the label set, and |L| the number of labels. Because the counts are pooled over all labels before averaging, frequent labels are weighted more heavily in the evaluation.
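The formula image on this slide did not survive the transcript; the standard label-based micro-averaged definitions it refers to are:

```latex
\mathrm{MiP} = \frac{\sum_{i=1}^{|L|} TP_i}{\sum_{i=1}^{|L|} (TP_i + FP_i)}, \qquad
\mathrm{MiR} = \frac{\sum_{i=1}^{|L|} TP_i}{\sum_{i=1}^{|L|} (TP_i + FN_i)}, \qquad
\mathrm{MiF} = \frac{2 \cdot \mathrm{MiP} \cdot \mathrm{MiR}}{\mathrm{MiP} + \mathrm{MiR}}
```

Here TP_i, FP_i, and FN_i are the true positives, false positives, and false negatives for label i; pooling these counts across labels is what makes frequent labels dominate the measure.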

On Batch 3, week 5 (4533 docs), we achieved around 10% improvement over the current NLM MTI solution (result of June 2014). [Chart: Fudan University vs. NLM current solution]

Outline: Introduction, Related Work, Our Methods, Experimental Results, Conclusion

NLM approach: MTI (Medical Text Indexer)
Two sources:
- MetaMap Indexing: maps UMLS concepts, restricted to MeSH
- PubMed Related Citations
Reference: http://ii.nlm.nih.gov/MTI/history.shtml
Advanced machine learning algorithms are not utilized.

MetaLabeler (Tsoumakas et al. 2013)
- First, for each MeSH heading, a binary classification model is trained using a linear SVM.
- Second, a regression model is trained to predict the number of MeSH headings for each citation.
- Finally, given a target citation, MeSH headings are ranked by the prediction scores of their classifiers, and the top K headings are returned as suggestions, where K is the number predicted by the regression model.
Problem: uses only global information, and the scores from different classifiers are not comparable.
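The rank-then-cut step described above can be sketched as follows (a minimal illustration, not the authors' code; `scores` stands in for the per-heading SVM outputs):

```python
def metalabeler_predict(scores, k):
    """MetaLabeler-style prediction: rank MeSH headings by their per-label
    classifier score and keep the top K, where K comes from the separate
    regression model. `scores` maps heading -> raw SVM score (illustrative)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

Because each score comes from a different binary classifier, ranking across headings like this is exactly where the comparability problem arises.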

NCBI's learning to rank (LTR) (Huang et al., 2011; Mao et al., 2013)
- Each citation is treated as a query, and each MeSH heading as a document.
- An LTR method ranks candidate MeSH headings with respect to the target citation.
- The candidate MeSH headings come from similar citations (nearest neighbors).
Problem: uses only local information, and similar citations might be rare.

Outline: Introduction, Related Work, Our Methods, Experimental Results, Conclusion

Our solution: a Learning to Rank (LTR) framework
Pipeline for a target document:
1. Retrieve similar documents (PRA).
2. Obtain an initial list of candidate main headings (MH-0 ... MH-n).
3. Generate features for the main headings (e.g., Logistic Regression scores).
4. Rank the main headings with the ranking model (LambdaMART) to produce the ranked list (MH-0 ... MH-m), then evaluate.

Main idea: various kinds of information (features) are integrated in the Learning to Rank (LTR) framework.
Given a target document, for each candidate MeSH heading we obtain prediction scores from several sources:
(1) Logistic Regression (global information)
(2) KNN (local information)
(3) Pattern matching
(4) MTI result (KNN + patterns + rules)

Logistic Regression
Train a binary Logistic Regression model for each label; in total, more than 25,000 binary models.
Question: the prediction scores come from different classifiers. How can these scores be made comparable?

Key idea: we have a huge validation set, the whole of MEDLINE. Use the precision at prediction score K as the normalized score. [Liu et al., in preparation]
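A minimal sketch of this normalization, assuming per-classifier validation scores and gold labels are available (function and variable names are illustrative, not the authors' implementation):

```python
from bisect import bisect_left

def precision_at_score(val_scores, val_labels):
    """Build a per-classifier lookup: raw score s -> precision of validation
    predictions whose score is >= s. That precision serves as the normalized
    score, making outputs of different classifiers comparable.

    val_scores / val_labels are hypothetical validation outputs for ONE
    label's classifier (labels are 1 for correct predictions, 0 otherwise).
    """
    pairs = sorted(zip(val_scores, val_labels))       # ascending by raw score
    scores = [s for s, _ in pairs]
    # suffix sums: correct[i] = number of positives with score >= scores[i]
    correct = [0] * (len(pairs) + 1)
    for i in range(len(pairs) - 1, -1, -1):
        correct[i] = correct[i + 1] + pairs[i][1]

    def normalize(s):
        i = bisect_left(scores, s)                    # first index with score >= s
        n = len(scores) - i
        return correct[i] / n if n else 0.0
    return normalize
```

With a large enough validation set (here, all of MEDLINE), the precision estimate at each score level is stable, which is what makes this normalization viable.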

Performance comparison on LR between default prediction scores and our normalized scores [Liu et al., in preparation]:
Method              MiP     MiR     MiF
Default scores      0.5576  0.5614  0.5595
Normalized scores   0.5734  0.5774  0.5754

KNN
Given a target citation, we use NCBI efetch to find its similar (neighbor) citations. For a candidate MeSH heading, we compute a score from the neighbors to represent its confidence. Specifically, over the top 25 documents most similar to the target citation, the score is the sum of Sk divided by the sum of Si, where Si is the similarity score of a document in the top 25, and Sk is the similarity score of a document that is in the top 25 and is also annotated with the candidate MeSH heading.
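A sketch of this neighbor-based score, assuming each neighbor comes with a similarity score and its set of MeSH annotations (the data layout is illustrative):

```python
def knn_mesh_confidence(neighbors, candidate_mesh, k=25):
    """Confidence that candidate_mesh applies to the target citation.

    neighbors: list of (similarity_score, mesh_annotations) pairs, most
    similar first, e.g. assembled from NCBI efetch results.
    Score = (sum of similarity scores of top-k neighbors annotated with the
    candidate MeSH) / (sum of all top-k similarity scores).
    """
    top = neighbors[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return 0.0
    matched = sum(s for s, mesh in top if candidate_mesh in mesh)
    return matched / total
```

The score is 1.0 when every top neighbor carries the candidate heading and 0.0 when none does, giving a bounded local-evidence feature for the ranker.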

Pattern matching
Use direct string matching to find the MeSH term, its synonyms, and its entry terms in the text.
MTI
Whether the candidate MeSH heading appears in the default results of MTI.

The number of MeSH labels
A Support Vector Regression model predicts the number of labels per citation, using features such as:
- Journal information
- The number of labels in the nearest neighbors
- The number of labels predicted by MTI
- The number of labels predicted by MetaLabeler
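The slide's model is a Support Vector Regression over several features; the dependency-free stand-in below substitutes ordinary least squares on a single illustrative feature, purely to show the shape of the "predict K, then keep the top K headings" step:

```python
def fit_label_count_model(xs, ys):
    """Fit y ~ a*x + b by least squares, as a stand-in for the slide's SVR.

    xs: one illustrative feature per training citation (e.g. the average
    label count among its nearest neighbors); ys: true label counts.
    Returns a predictor that rounds to a whole number of headings.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: max(1, round(a * x + b))   # always suggest at least one heading
```

The predicted count K is then used to cut the LTR-ranked heading list, replacing any fixed global threshold.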

Outline: Introduction, Related Work, Our Methods, Experimental Results, Conclusion

Evaluation & Experiment
Server: 4 x Intel Xeon E5-4650 2.7 GHz CPUs, 128 GB RAM.
Training the LR classifiers took 5 days; all other training tasks took 1 day.
Annotating 10,000 citations takes 2 hours.

Evaluation & Experiment

Outline: Introduction, Related Work, Our Methods, Experimental Results, Conclusion

Conclusion & Future Work
- The superior performance of our method comes from integrating many kinds of information in the LTR framework: MTI, KNN, LR, as well as direct matching.
- The big data of MEDLINE makes prediction score normalization possible and improves performance significantly.
- More information could be used, such as full text and indexing rules.
- How do we minimize the gap between a good competition system and real applications?

Acknowledgement
Dr. Hongning Wang (UIUC), Mr. Mingjie Qian (UIUC), Mr. Jieyao Deng (Fudan), Mr. Tianyi Peng (Tsinghua)