Biomedical Information Extraction using Inductive Logic Programming
Mark Goadrich and Louis Oliphant
Advisor: Jude Shavlik
Acknowledgements to NLM training grant 5T15LM007359-02
Abstract
Automated methods are needed for finding relevant information in the large amount of biomedical literature. Information extraction (IE) is the process of finding facts in unstructured text, such as biomedical journals, and putting those facts into an organized system. Our research mines facts about a relationship (e.g., protein localization) from PubMed abstracts. We use Inductive Logic Programming (ILP) to learn a set of logical rules that explain when and where a relationship occurs in a sentence. We build rules by finding patterns in syntactic as well as semantic information for each sentence in a training corpus that has been previously marked with the relationship. These rules can then be used on unmarked text to find new instances of the relation. The major research issues in this approach include handling unbalanced data, searching the enormous space of clauses, learning probabilistic logical rules, and incorporating expert background knowledge.
The Central Dogma
Discoveries:
- protein-protein interactions
- protein localizations
- genetic diseases
Most knowledge is stored in articles. Just Google it?
[Image of the central dogma courtesy of the National Human Genome Research Institute]
World of Publishing
Current:
- authors write articles in Word, LaTeX, etc., and publish in conferences, journals, and so on
- humans index and extract the relevant information (time- and cost-intensive)
Future?
- all published articles available on the Web
- the semantic web: an extension of HTML for content
- articles automatically annotated and indexed into searchable databases
Information Extraction
Given: a set of abstracts tagged with biological relationships between phrases
Do: learn a theory (e.g., a set of inference rules) that accurately extracts these relations
Training Data
Why Use ILP?
- KDD Cup 2002: handcrafted logical rules did best on the IE task
- Hypotheses are comprehensible: they are written in first-order predicate calculus (FOPC) and aim to cover only positive examples
- Background knowledge is easily incorporated: expert advice, linguistic knowledge of English (parse trees), and biomedical knowledge (e.g., MeSH)
ILP Example: Family Tree
Positive examples: daughter(mary, ann), daughter(eve, tom)
Negative examples: daughter(tom, ann), daughter(eve, ann), daughter(ian, tom), daughter(ian, ann), …
Background knowledge: mother(ann, mary), mother(ann, tom), father(tom, eve), father(tom, ian), female(ann), female(mary), female(eve), male(tom), male(ian)
[Figure: family tree, with Ann the mother of Mary and Tom, and Tom the father of Eve and Ian]
Possible rules:
- daughter(A,B) if male(A) and father(B,A)
- daughter(A,B) if mother(B,A)
- daughter(A,B) if female(A) and male(B)
- daughter(A,B) if female(A) and mother(B,A)
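To make the rule-scoring step concrete, here is a toy Python sketch, not the actual ILP system, that checks each candidate rule above against the positive and negative examples and counts coverage; all facts and rules come from the slide, while the scoring loop is an illustrative assumption.

```python
# Toy coverage check for the family-tree example above.
# A sketch of ILP-style rule scoring, not the real learner.

mother = {("ann", "mary"), ("ann", "tom")}
father = {("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}
male = {"tom", "ian"}

positives = {("mary", "ann"), ("eve", "tom")}
negatives = {("tom", "ann"), ("eve", "ann"), ("ian", "tom"), ("ian", "ann")}

# Each candidate rule body is a predicate over (A, B), mirroring the slide.
rules = {
    "male(A), father(B,A)":    lambda a, b: a in male and (b, a) in father,
    "mother(B,A)":             lambda a, b: (b, a) in mother,
    "female(A), male(B)":      lambda a, b: a in female and b in male,
    "female(A), mother(B,A)":  lambda a, b: a in female and (b, a) in mother,
}

for name, body in rules.items():
    pos = sum(body(a, b) for a, b in positives)  # positives covered
    neg = sum(body(a, b) for a, b in negatives)  # negatives covered
    print(f"daughter(A,B) :- {name:26s} covers {pos} pos, {neg} neg")
```

Running the loop shows why the learner prefers the later rules: the first covers a negative and no positives, while a generalization such as female(A) and parent(B,A) would cover both positives.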
Sundance Parsing
Example sentence: "smf1 and smf2 are mitochondrial membrane_proteins"
[Parse tree: the sentence splits into an NP-Conj segment ("smf1 and smf2", with smf1 and smf2 tagged unk and "and" tagged conj), a VP segment ("are", tagged cop), and an NP segment ("mitochondrial membrane_proteins")]
Sentence-structure predicates: parent(smf1, np-conj seg), parent(np-conj seg, sentence), child(np-conj seg, smf1), child(sentence, np-conj seg), next(smf1, and), next(np-conj seg, vp seg), after(np-conj seg, np seg), …
Part-of-speech predicates: noun(membrane_proteins), verb(are), unk(smf1), noun_phrase(np seg), verb_phrase(vp seg), …
Lexical word predicates: novelword(smf1), novelword(smf2), alphabetic(and), alphanumeric(smf1), …
Biomedical knowledge predicates: in_med_dict(mitochondrial), go_mitochondrial_membrane(smf1), go_mitochondrion(smf1), …
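As a rough illustration of how a parse could be flattened into relational facts like those above, the hypothetical sketch below walks a miniature tree and emits parent/child/next facts; the real Sundance output and the exact predicate set differ.

```python
# Hypothetical flattening of a parse tree into ILP background facts.
# The structure follows the slide; Sundance's actual representation differs.

tree = ("sentence", [
    ("np-conj_seg", ["smf1", "and", "smf2"]),
    ("vp_seg", ["are"]),
    ("np_seg", ["mitochondrial", "membrane_proteins"]),
])

def facts(tree):
    root, segments = tree
    out = []
    for i, (seg, words) in enumerate(segments):
        out.append(("parent", seg, root))
        out.append(("child", root, seg))
        if i + 1 < len(segments):
            out.append(("next", seg, segments[i + 1][0]))  # segment order
        for w, nxt in zip(words, words[1:]):
            out.append(("next", w, nxt))                   # word order
        for w in words:
            out.append(("parent", w, seg))
            out.append(("child", seg, w))
    return out

for f in facts(tree):
    print(f)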
Sample Learned Rule
gene_disease(E,A) :-
    isa_np_segment(E),
    isa_np_segment(A),
    prev(A,B), pp_segment(B),
    child(A,C), novelword(C),
    next(C,D), alphabetic(D),
    child(E,F), alphanumeric(F).
In English: E and A are noun phrases; A is immediately preceded by a prepositional phrase B; A contains a novel word C followed by an alphabetic word D; and E contains an alphanumeric word F.
[Figure: the rule drawn over a sentence diagram, labeling the noun phrases, the prepositional phrase, and the novel, alphabetic, and alphanumeric words]
Ensembles for Rules
N heads are better than one:
- learn multiple (sets of) rules from the training data
- aggregate the results by voting on the classification of testing data
Bagging (Breiman '96): each rule set gets one vote
Boosting (Freund and Schapire '96): each rule gets a weighted vote
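A minimal sketch of the bagging variant, assuming a hypothetical learn_rules function that stands in for one ILP run: each bootstrap replicate contributes one unweighted vote, and the vote fraction doubles as a confidence score of the kind used to draw PR curves below.

```python
import random

# Toy sketch of bagging rule sets: each bootstrap replicate trains one
# rule set, and each rule set gets one (unweighted) vote at test time.
# learn_rules() is an assumed placeholder for the ILP learner.

def bagged_predict(train, test_example, learn_rules, n_bags=25, seed=0):
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_bags):
        replicate = [rng.choice(train) for _ in train]  # sample with replacement
        rule_set = learn_rules(replicate)               # one ILP run per bag
        votes += 1 if rule_set(test_example) else 0
    return votes / n_bags   # fraction of votes, usable as a confidence
```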
Drawing a PR Curve
Examples sorted by confidence (5 positives in total):

Conf   Class   Precision   Recall
0.98   +       1.00        0.2
0.97   +       1.00        0.4
0.84   -       0.66        0.4
0.78   +       0.75        0.6
0.55   +       0.80        0.8
0.43   -       0.66        0.8
0.23   -       0.57        0.8
0.22   -       0.50        0.8
0.12   +       0.55        1.0

[Plot: precision (y-axis) versus recall (x-axis), traced by thresholding down this ranking]
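The precision and recall columns can be reproduced by thresholding down the ranked list; the short sketch below does this for the nine examples shown (the slide truncates 2/3 to 0.66 where Python rounds to 0.67).

```python
# Recompute the PR table above from (confidence, label) pairs,
# thresholding down the confidence-ranked list.

preds = [(0.98, 1), (0.97, 1), (0.84, 0), (0.78, 1), (0.55, 1),
         (0.43, 0), (0.23, 0), (0.22, 0), (0.12, 1)]

total_pos = sum(label for _, label in preds)
tp = 0
for n, (conf, label) in enumerate(sorted(preds, reverse=True), start=1):
    tp += label   # true positives among the top-n predictions
    print(f"conf >= {conf:.2f}  precision = {tp / n:.2f}  recall = {tp / total_pos:.2f}")
```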
Testset Results
[Chart: test-set precision-recall results comparing the Craven group's system, boosting, rule quality, and bagging]
Handling Large Skewed Data
5-fold cross-validation:
- train: 1,007 positive / 240,874 negative
- test: 284 positive / 243,862 negative
With a 95%-accurate rule set:
- 270 true positives
- 12,193 false positives!
- recall = 270 / 284 = 95.0%
- precision = 270 / 12,463 ≈ 2.2%
Handling Large Skewed Data
Ways to handle the skew:
- assign different costs to each class: it is much more important to not cover negatives
- under-sample with bagging: negatives are under-represented in each bag, and the key is to pick good negatives (see the sketch below)
- filter the data to restore an equal ratio in the testing data: use naïve Bayes to learn the relational parts
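A minimal sketch of the under-sampling idea, assuming positives and negatives are simple lists of examples; picking "good" negatives rather than uniformly random ones is the refinement the slide points at.

```python
import random

# Under-sample the negatives before learning; ratio=1.0 restores a 1:1
# balance. Choosing informative negatives (not uniform draws) is the
# refinement discussed on the slide.

def undersample(positives, negatives, ratio=1.0, seed=0):
    rng = random.Random(seed)
    k = min(len(negatives), int(ratio * len(positives)))
    return positives + rng.sample(negatives, k)
```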
Filters to Reduce Negatives
[Diagram: positive/negative examples pass through a noun-phrase filter, are split into parts (genes and diseases), pass through a naïve Bayes filter, and are joined back into positive/negative examples; the positive-to-negative ratios marked at stages of the pipeline are 1:485, 1:1,979, and 1:39]
Probabilistic Rules
- Logical rules are too strict and often overfit
- Add a probabilistic weight to each rule, based on its accuracy on a tuning set
- Learn the parameters: make each rule a binary feature, use any standard machine-learning algorithm (naïve Bayes, perceptron, logistic regression, …) to learn the weights, and assign a probability to each example based on the weights (see the sketch below)
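A sketch of the weighting step under the stated setup: each learned rule becomes a binary feature, and a simple logistic-regression update (stochastic gradient on log-loss) fits the weights. The rule and data representations here are assumptions, not the actual system's interfaces.

```python
import math

# Treat each learned rule as a binary feature over an example and learn
# one weight per rule; predict() then gives the relation probability.

def features(example, rules):
    return [1.0 if rule(example) else 0.0 for rule in rules]

def predict(weights, feats, bias=0.0):
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))      # probability the relation holds

def train(data, rules, epochs=50, lr=0.1):
    weights = [0.0] * len(rules)
    for _ in range(epochs):
        for example, label in data:        # label is 1 (pos) or 0 (neg)
            feats = features(example, rules)
            p = predict(weights, feats)
            for i, f in enumerate(feats):  # gradient step on log-loss
                weights[i] += lr * (label - p) * f
    return weights
```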
Weighted Exponential Model
$P(\text{pos} \mid x) = \frac{1}{Z}\exp\Big(\sum_i w_i f_i(x)\Big)$, where $w_i$ is a weight for each binary rule feature $f_i$ and $Z$ is a normalizing constant.
Taking logs we get $\log P(\text{pos} \mid x) = \sum_i w_i f_i(x) - \log Z$.
We need to set the weights $w_i$ to maximize the log probability of the tuning set.
Incorporating Background Knowledge Creation of predicates that capture salient features endsIn(word, ‘ase’) occursInAbstractNtimes(word, 5) Incorporation of prior knowledge into the learning system protein(word) if endsIn(word, ‘ase’) and occursInAbstractNtimes(word, 5).
Searching in Large Spaces
- Probabilistic bottom clause: probabilistically remove the least significant predicates from the "bottom clause"
- Random rule generation: in place of hill-climbing, randomly select rules of a given length from the bottom clause and retain only those that do well on a tuning set (see the sketch below)
- Learn the coverage of clauses: neural networks, Bayesian learning, etc.
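A sketch of the random rule generation idea, assuming the bottom clause is a list of literals and score is some tune-set measure such as precision; hill-climbing is replaced by drawing fixed-length clause bodies at random and keeping the best scorers.

```python
import random

# Draw fixed-length subsets of the bottom clause's literals at random and
# keep the bodies that score well on a tuning set. score() and the literal
# representation are assumed placeholders, not the actual system's API.

def random_rules(bottom_clause, length, n_draws, score, tune_set, keep=10, seed=0):
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_draws):
        body = rng.sample(bottom_clause, length)   # one random clause body
        candidates.append((score(body, tune_set), body))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [body for _, body in candidates[:keep]]
```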
References
- Stuart J. Nelson, Tammy Powell, and Betsy L. Humphreys. The Unified Medical Language System (UMLS) Project. In: Encyclopedia of Library and Information Science. Forthcoming.
- Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
- Ellen Riloff. The Sundance Sentence Analyzer. 2002.
- Inês de Castro Dutra et al. An Empirical Evaluation of Bagging in Inductive Logic Programming. In Proceedings of the International Conference on Inductive Logic Programming, Sydney, Australia, 2002.
- Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), 2000.
- Soumya Ray and Mark Craven. Representing Sentence Structure in Hidden Markov Models for Information Extraction. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001), 2001.
- Tina Eliassi-Rad and Jude Shavlik. A Theory-Refinement Approach to Information Extraction. In Proceedings of the 18th International Conference on Machine Learning, 2001.
- M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77-86, Germany, 1999.
- Leo Breiman. Bagging Predictors. Machine Learning, 24(2):123-140, 1996.
- Yoav Freund and Robert E. Schapire. Experiments with a New Boosting Algorithm. In Proceedings of the International Conference on Machine Learning, pages 148-156, 1996.