Towards Improving Classification of Real World Biomedical Articles Kostas Fragos TEI of Athens Christos Skourlas TEI of Athens


Summary We propose a method to improve performance in biomedical article classification. We use Naïve Bayes and Maximum Entropy classifiers to classify real-world biomedical articles derived from the dataset used in the BioCreative II.5 (BC2.5) classification competition task. To improve classification performance, we use two merging operators, Max and Harmonic Mean, to combine the results of the two classifiers. The results show that we can improve classification performance on real-world biomedical data.

Introduction From the biomedical point of view there are many challenges in classifying biomedical information [3]. Even the most sophisticated solutions often overfit the training data and do not perform as well on real-world data [4]. In this paper we try to devise a method that makes real-world biomedical data classification more robust. First, we parse the documents, applying a keyword extraction algorithm to extract keywords from the full text. Second, we apply a chi-square feature selection strategy to identify the most relevant keywords. Finally, we apply Naïve Bayes and Maximum Entropy classifiers to classify the documents and then combine their results using two merging operators to improve performance.

THE CLASSIFICATION METHOD Naïve Bayes Classifiers: A text classifier can be defined as a function that maps a document d of n words (features), d = (x1, x2, x3, ..., xn), to a confidence that the document d belongs to a text category. The Naïve Bayes classifier [1] is often used to estimate the probability of each category. Bayes' theorem can be used to estimate the probabilities: Pr(c|d) = Pr(d|c) · Pr(c) / Pr(d) [6]
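Under the naïve independence assumption, Pr(d|c) factorizes into the product of the per-word probabilities Pr(xi|c). The slides do not include the authors' implementation, so the following is only a minimal sketch of a multinomial Naïve Bayes text classifier with Laplace smoothing; all function names are ours:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log Pr(c) and, with Laplace smoothing, log Pr(w|c)."""
    classes = set(labels)
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for words, c in zip(docs, labels):
        counts[c].update(words)
    vocab = {w for cnt in counts.values() for w in cnt}
    loglik = {
        c: {w: math.log((counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab)))
            for w in vocab}
        for c in classes
    }
    return prior, loglik

def classify_nb(words, prior, loglik):
    """Return the class maximizing log Pr(c) + sum of log Pr(w|c).
    Words outside the training vocabulary are simply skipped."""
    scores = {
        c: prior[c] + sum(loglik[c].get(w, 0.0) for w in words)
        for c in prior
    }
    return max(scores, key=scores.get)
```

Since Pr(d) is identical for every class it can be dropped, and log-probabilities are summed rather than multiplying raw probabilities, to avoid floating-point underflow on long documents.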

THE CLASSIFICATION METHOD Maximum Entropy Classifiers: Entropy was introduced by Shannon (Shannon, 1948) in communication theory. The entropy H measures the average uncertainty of a single random variable X: H(p) = H(X) = -Σx p(x) log2 p(x) [2]. The maximum entropy model can be specially adjusted for text classification. This can be done using the improved iterative scaling (IIS) algorithm and a hill-climbing algorithm for estimating the parameters of the maximum entropy model [6].
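The entropy formula itself is straightforward to compute; a small helper (ours, not from the paper) illustrates it:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) * log2 p(x), in bits.
    Terms with p(x) = 0 contribute nothing (0 * log 0 is taken as 0)."""
    return -sum(px * math.log2(px) for px in p if px > 0)
```

For a fair coin, entropy([0.5, 0.5]) is 1 bit, the maximum for two outcomes; a deterministic outcome, entropy([1.0]), is 0, the minimum uncertainty.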

Merging Classifiers We use two operators to combine the results of the Naïve Bayes Classifier (NBC) and the Maximum Entropy Classifier (MEC) to improve classification performance: the Maximum and the Harmonic Mean of the results of the two classifiers:
MaxC(d) = Max{NBC(d), MEC(d)}
HarmC(d) = 2.0 × NBC(d) × MEC(d) / (NBC(d) + MEC(d))
The MaxC(d) operator chooses the maximum of the two classifiers' results. The HarmC(d) operator computes the Harmonic Mean of the results of the two classifiers.
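The two operators translate directly into code. A sketch (function names ours), assuming each classifier returns a confidence score in [0, 1]:

```python
def max_merge(nbc_score, mec_score):
    """MaxC: take the more confident of the two classifier scores."""
    return max(nbc_score, mec_score)

def harmonic_merge(nbc_score, mec_score):
    """HarmC: harmonic mean of the two scores (0 if both are 0)."""
    if nbc_score + mec_score == 0:
        return 0.0
    return 2.0 * nbc_score * mec_score / (nbc_score + mec_score)
```

Note the difference in behavior: MaxC lets either classifier alone push a document toward the positive class, while the harmonic mean is dominated by the lower of the two scores, so it only assigns high confidence when both classifiers agree.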

BioCreAtIvE challenge Description: The BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge evaluation consists of a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain.

BioCreative II.5 challenge Evaluation library: This is the current version of the BioCreative evaluation library, including a command line tool to use it; current, official version: 3.2 (use the command line option --version to see the version of the script you have installed: bc-evaluate --version). If you have reason to believe that there is a bug in the tool or the library, or have any other questions related to it, please contact the author, Florian Leitner.
ii5/evaluation-library/

BioCreative II.5 challenge Task 2: Protein-Protein Interactions. This task is organized as a collaboration between the IntAct and MINT protein interaction databases and the CNIO Structural Bioinformatics and Biocomputing group.
protein-protein-interac/

Preparing the Data. For experimentation purposes we used the data from the BC2.5 article classification competition task [4]. This classification task was based on a training data set comprising 61 full-text articles relevant to protein-protein interaction and 558 irrelevant ones. For training we chose the first 60 relevant articles and randomly sampled 60 irrelevant ones; for testing we used the BioCreative 2.5 test data set, consisting of 63 full-text articles relevant to protein-protein interaction and 532 irrelevant ones.

Preparing the Data. Before using the data for training and testing, we pre-processed all articles by filtering out stop words and Porter-stemming the remaining words/keywords. Finally, we ranked the keywords extracted from the BC2.5 training articles according to the chi-square scoring formula to identify the top most relevant keywords [6].
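The slides do not show the chi-square scoring step itself. One common formulation scores each term against the relevant/irrelevant split using a 2×2 contingency table of document counts, as in this sketch (ours, not necessarily the authors' exact variant):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 contingency table:
      n11: relevant documents containing the term
      n10: irrelevant documents containing the term
      n01: relevant documents without the term
      n00: irrelevant documents without the term
    A high score means the term's presence depends on the class."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0
```

Terms are then sorted by this score in descending order and the top k (e.g. the 500 or 700 top-ranked keywords used in the experiments) are kept as features.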

Experiments The experiments consist of the following phases: First, we collect five sets of top relevant keywords using the chi-square feature selection strategy. Second, we compare the performance of the two classifiers, Naïve Bayes and Maximum Entropy, for each set of word features. Third, we use the merging operators to combine the results of the two classifiers to improve performance. In each experiment we calculate the Precision, Recall, True Negative Rate and Accuracy measures.
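All four measures follow from the confusion matrix of the binary (relevant/irrelevant) decision; as a reminder (helper name ours):

```python
def evaluate(tp, fp, fn, tn):
    """Precision, Recall, True Negative Rate and Accuracy from the
    confusion matrix of a binary classifier (tp = true positives,
    fp = false positives, fn = false negatives, tn = true negatives)."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct among predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0      # positives actually found
    tnr = tn / (tn + fp) if tn + fp else 0.0         # negatives actually found
    accuracy = (tp + tn) / (tp + fp + fn + tn)       # overall correct decisions
    return precision, recall, tnr, accuracy
```

On a heavily imbalanced test set like BC2.5 (63 relevant vs. 532 irrelevant), accuracy alone is misleading, which is why all four measures are reported.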

Results The Maximum Entropy classifier shows the best Precision, Recall and Accuracy (0.186, 0.857 and 0.589, respectively) at 500 top-ranked keywords, while its best True Negative Rate (0.565) is obtained at 700 top-ranked keywords. We combine the results of the two classifiers using the two merging operators described above to improve performance, especially the Recall rate. The merging operators do improve performance: Precision 0.189, Recall 0.873, True Negative Rate 0.560 and Accuracy 0.591.

Conclusion The results show that the Maximum Entropy classifier achieves the best performance at 500 top relevant keywords. By combining the results of the two classifiers we can improve classification performance on real-world biomedical data.

References
1. Fuhr, N., Hartmann, S., Lustig, G., Schwantner, M., and Tzeras, K. 1991. AIR/X – A rule-based multi-stage indexing system for large subject fields. RIAO'91.
2. Galathiya, A. S., Ganatra, A. P., and Bhensdadia, K. C. An improved decision tree induction algorithm with feature selection, cross validation, model complexity & reduced error pruning. IJSCIT, March.
3. Feldman, R., and Sanger, J. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
4. Krallinger, M., et al. 2009. The BioCreative II.5 challenge overview. In: Proc. BioCreative II.5 Workshop 2009 on Digital Annotations, pp. 7–9.

References
5. Fragos, K., and Maistros, I. A Goodness of Fit Test Approach in Information Retrieval. Information Retrieval, Springer, Volume 9, Number 3, pp. 331–.
6. McCallum, A., and Nigam, K. 1998. A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization.
7. Fragos, K., Maistros, I., and Skourlas, C. A χ2-Weighted Maximum Entropy Model for Text Classification. 2nd International Conference on N.L.U.C.S., Miami, Florida.

Questions…