
1 An Inductive Logic Programming Approach to Biomedical Information Extraction
Mark Rich & Louis Oliphant
Acknowledgements to NLM training grant 1T15LM

2 Abstract
Automated methods for finding relevant information in the large amount of biomedical literature are needed. Information extraction (IE) is the process of finding facts in unstructured text, such as biomedical journal articles, and putting those facts into an organized system. Our research mines facts about a relationship (e.g. protein-location) from PubMed abstracts. We use inductive logic programming (ILP) to learn a set of logical rules that explain when and where a relationship occurs in a sentence. We build rules by finding patterns in syntactic as well as semantic information for each sentence in a training corpus that has been previously marked with the relationship. These rules can then be used on unmarked text to find new instances of the relation. This research shows how modifications to the Aleph ILP system can improve precision without sacrificing recall. Modifications include boosted wrapper induction and bagging techniques.

3 Information Extraction
Given: a set of abstracts tagged with protein-location relationships between phrases
Do: learn a theory that extracts only these relations from the set of abstracts and performs well on unseen abstracts

4 Training Data

5 Training Data
925 abstracts taken from PubMed
Abstracts marked with 645 relations from the Yeast Protein Database
Relation: protein - cell location, e.g. protein_location(smf1, mitochondrial)
335 unique relations

6 Machine Learning
Machine learning provides techniques to classify data into categories
We divide the data into train and test sets
We generate hypotheses on the train set and then measure their performance on the test set

7 Inductive Logic Programming
Hypotheses are written in first-order predicate calculus (FOPC), and aim to cover only positive examples
Background knowledge is incorporated through the use of relations in FOPC
Head of clause is the relation to be learned; the search is through all possible combinations of background predicates

8 ILP Example
Background Knowledge:
mother(ann, mary)   mother(ann, tom)
father(tom, eve)    father(tom, ian)
female(ann)   female(mary)   female(eve)
male(tom)   male(ian)
Positive examples:
daughter(mary, ann)
daughter(eve, tom)
Negative examples:
daughter(tom, ann)
daughter(eve, ann)
daughter(ian, tom)
daughter(ian, ann)
etc…
Possible Rules:
daughter(A,B) :- male(A), father(B,A)
daughter(A,B) :- mother(B,A)
daughter(A,B) :- female(A), male(B)
daughter(A,B) :- female(A), mother(B,A)
[Family tree diagram: Ann, Tom, Mary, Eve, Ian]
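
A minimal sketch (assumed, not from the slides) of how coverage of these candidate rules can be checked: the background facts, examples, and rule bodies are hard-coded from the table above, and each rule is scored by how many positive and negative examples it covers.

```python
# Hedged sketch: count positive/negative coverage of the candidate daughter/2 rules.

background = {
    "mother": {("ann", "mary"), ("ann", "tom")},
    "father": {("tom", "eve"), ("tom", "ian")},
    "female": {("ann",), ("mary",), ("eve",)},
    "male": {("tom",), ("ian",)},
}

positives = {("mary", "ann"), ("eve", "tom")}
negatives = {("tom", "ann"), ("eve", "ann"), ("ian", "tom"), ("ian", "ann")}

# Each rule body is a list of (predicate, argument-pattern) literals over the
# head variables A and B.
candidate_rules = {
    "daughter(A,B) :- male(A), father(B,A)":   [("male", ("A",)), ("father", ("B", "A"))],
    "daughter(A,B) :- mother(B,A)":            [("mother", ("B", "A"))],
    "daughter(A,B) :- female(A), male(B)":     [("female", ("A",)), ("male", ("B",))],
    "daughter(A,B) :- female(A), mother(B,A)": [("female", ("A",)), ("mother", ("B", "A"))],
}

def covers(body, a, b):
    """True if every body literal holds under the binding A=a, B=b."""
    env = {"A": a, "B": b}
    return all(tuple(env[v] for v in args) in background[pred] for pred, args in body)

for rule, body in candidate_rules.items():
    pos = sum(covers(body, a, b) for a, b in positives)
    neg = sum(covers(body, a, b) for a, b in negatives)
    print(f"{rule:45s}  pos={pos}  neg={neg}")
```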

9 Sundance Parsing
Parse of the sentence "smf1 and smf2 are mitochondrial membrane_proteins":
Sentence -> NP-Conj segment, VP segment, NP segment, ...
Part-of-speech tags: smf1/unk, and/conj, smf2/unk, are/cop, mitochondrial/unk, membrane_proteins/noun
Sentence Structure Predicates:
parent(smf1, np-conj seg)
parent(np-conj seg, sentence)
child(np-conj seg, smf1)
child(sentence, np-conj seg)
next(smf1, and)
next(np-conj seg, vp seg)
after(np-conj seg, np seg)
Part of Speech Predicates:
noun(membrane_proteins)
verb(are)
unk(smf1)
noun_phrase(np seg)
verb_phrase(vp seg)
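
A minimal sketch (assumed) of how sentence-structure predicates of this kind could be derived from a parse. The nested-list representation below is an illustrative stand-in for Sundance's actual output; predicate names follow the slide, and the non-adjacent after/2 predicate is omitted for brevity.

```python
# Hedged sketch: emit parent/child/next facts from a toy parse tree.

parse = ("sentence", [
    ("np-conj seg", ["smf1", "and", "smf2"]),
    ("vp seg", ["are"]),
    ("np seg", ["mitochondrial", "membrane_proteins"]),
])

def structure_predicates(node):
    """Emit parent/child facts for the tree and next facts between siblings."""
    label, children = node
    facts = []
    child_labels = [c[0] if isinstance(c, tuple) else c for c in children]
    for c in child_labels:
        facts.append(f"parent({c},{label})")
        facts.append(f"child({label},{c})")
    for left, right in zip(child_labels, child_labels[1:]):
        facts.append(f"next({left},{right})")
    for c in children:
        if isinstance(c, tuple):
            facts.extend(structure_predicates(c))
    return facts

for fact in structure_predicates(parse):
    print(fact)
```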

10 Kullback-Leibler Divergence
p: probability that word x occurs in relation phrases
q: probability that word x occurs outside relation phrases
Measures how different the two probability distributions are (it is not symmetric; think of it as how far q is from p).
Example:
  Number of relation phrases: 523
  Total number of phrases: 2305
  # "mitochondrial" in relation phrases: 200
  # "mitochondrial" outside relation phrases: 205
  D(p||q) = 0.137
Select the top N% of words, ranked by K-L score
Kullback-Leibler Predicates:
kl-location-5%(mitochondrial)
kl-location-10%(membrane_proteins)
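
The formula itself is not reproduced in this transcript. For reference, the standard K-L divergence between the two distributions, treating each word x as a binary occurs/does-not-occur event, would be as below; exactly how p and q are estimated from the counts above to give the reported 0.137 is not shown here.

```latex
D(p \,\|\, q) = p(x)\,\log\frac{p(x)}{q(x)} + \bigl(1 - p(x)\bigr)\,\log\frac{1 - p(x)}{1 - q(x)}
```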

11 Lexical Word Types
Novel-word: does not occur in /usr/dict/words
Alphabetic: contains only alphabetic characters
Alphanumeric: contains both numbers and letters
Single character word: contains only a single character
Lexical Word Predicates:
novelword(smf1)
novelword(smf2)
alphabetic(and)
alphanumeric(smf1)
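
A minimal sketch (assumed) of how these lexical predicates could be generated. The dictionary path is taken from the slide (on many modern systems the word list lives at /usr/share/dict/words instead), and the single-character predicate name is an assumption since the slide does not show one.

```python
# Hedged sketch: emit lexical word-type facts for a token.

def lexical_predicates(word, dictionary):
    facts = []
    if word.lower() not in dictionary:
        facts.append(f"novelword({word})")
    if word.isalpha():
        facts.append(f"alphabetic({word})")
    if word.isalnum() and any(c.isdigit() for c in word) and any(c.isalpha() for c in word):
        facts.append(f"alphanumeric({word})")
    if len(word) == 1:
        facts.append(f"singlecharword({word})")   # predicate name assumed
    return facts

with open("/usr/dict/words") as fh:               # path from the slide
    dictionary = {line.strip().lower() for line in fh}

for w in ["smf1", "and", "mitochondrial"]:
    print(w, lexical_predicates(w, dictionary))
```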

12 Biomedical Knowledge (currently implementing)
Medical Dictionary: word occurs in a medical dictionary
Medical Subject Headings: MeSH heading tree number
Gene Ontology: nodes in the GO tree that contain the word
Biomedical Knowledge Predicates:
in_med_dict(mitochondrial)
mesh_a11_284_195_190_875(mitochondrial)
mesh_a11_284_195_190(mitochondrial)
go_mitochondrial_membrane(smf1)
go_mitochondrion(smf1)
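
A minimal sketch (assumed) of how a MeSH tree number could be turned into facts like the mesh_* predicates above, one per ancestor node in the hierarchy. The word-to-tree-number mapping is a hard-coded stand-in for a real MeSH lookup (the tree number is copied from the slide's example), and the slide only shows the node and its parent, so how far up the tree the real system walks is an assumption.

```python
# Hedged sketch: emit one predicate per MeSH tree-number prefix (node + ancestors).

mesh_tree_numbers = {
    "mitochondrial": "A11.284.195.190.875",   # tree number taken from the slide's example
}

def mesh_predicates(word):
    tree = mesh_tree_numbers.get(word)
    if tree is None:
        return []
    parts = tree.lower().split(".")
    facts = []
    for depth in range(len(parts), 1, -1):     # node itself, then each ancestor
        facts.append("mesh_" + "_".join(parts[:depth]) + f"({word})")
    return facts

print(mesh_predicates("mitochondrial"))
# ['mesh_a11_284_195_190_875(mitochondrial)', 'mesh_a11_284_195_190(mitochondrial)',
#  'mesh_a11_284_195(mitochondrial)', 'mesh_a11_284(mitochondrial)']
```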

13 Aleph
Implementation of ILP in Prolog
Aleph uses an example as a seed, i.e. the rule learned must cover this example
Parameters:
  search method
  heuristic function
  minimum accuracy of learned rules
  maximum clauses searched

14 Handling Large Skewed Data
5-fold cross-validation
  train: 1007 positive / 240,874 negative
  test: 284 positive / 243,862 negative
Ways to handle the data:
  initial breadth-first search, then heuristic search
  heuristic score perturbation
  random open list pruning (see the sketch after this list)
  prior knowledge to reduce negatives
  ensemble methods
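
A minimal sketch (assumed, not the authors' implementation) of what random open list pruning could look like: when the frontier of candidate clauses grows past a cap, keep only a random sample of it. The cap value is an illustrative assumption.

```python
import random

# Hedged sketch: randomly drop frontier nodes so the open list never exceeds max_size.
def prune_open_list(open_list, max_size=1000, rng=random):
    if len(open_list) <= max_size:
        return open_list
    return rng.sample(open_list, max_size)
```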

15 Bagging
N heads are better than one…
learn multiple rule-sets with disjoint training data
aggregate the results by voting on the classification of testing data
The confidence of a prediction is equal to the percentage of rule-sets with a positive vote.
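
A minimal sketch (assumed) of the voting scheme above. Each rule-set is taken to be any object with a predict(example) -> bool method, learned on its own disjoint split of the training data; the function names are illustrative.

```python
# Hedged sketch: bagged confidence = fraction of rule-sets voting positive.

def bagged_confidence(rule_sets, example):
    positive_votes = sum(1 for rs in rule_sets if rs.predict(example))
    return positive_votes / len(rule_sets)

def bagged_classify(rule_sets, example, threshold=0.5):
    """Positive if at least `threshold` of the rule-sets vote positive."""
    return bagged_confidence(rule_sets, example) >= threshold
```

Varying the threshold on this confidence gives the precision/recall trade-off measured on slide 18.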

16 Boosting
Focus learning on hard examples
assign each example a weight
find the best rule based on these weights
calculate a confidence for this rule
scale weights inversely to correctness of classification
The confidence of an example is the sum of the confidences of the rules that fire for that example.
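
A minimal sketch (assumed) of the loop above, using the usual AdaBoost-style weight update and rule confidence; the slide does not give the exact formulas, and learn_rule is a hypothetical stand-in for the weighted rule search.

```python
import math

# Hedged sketch of boosting: reweight examples each round, keep (rule, confidence) pairs.
def boost(examples, labels, learn_rule, rounds):
    """learn_rule(examples, labels, weights) -> rule object with a predict(x) -> bool method."""
    n = len(examples)
    weights = [1.0 / n] * n
    rules = []
    for _ in range(rounds):
        rule = learn_rule(examples, labels, weights)          # find best rule on weighted data
        preds = [rule.predict(x) for x in examples]
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        alpha = 0.5 * math.log(max(1 - err, 1e-10) / max(err, 1e-10))   # rule confidence
        rules.append((rule, alpha))
        # scale weights inversely to correctness: up-weight mistakes, down-weight hits
        weights = [w * math.exp(alpha if p != y else -alpha)
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return rules

def example_confidence(rules, x):
    # confidence of an example = sum of confidences of the rules that fire on it
    return sum(alpha for rule, alpha in rules if rule.predict(x))
```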

17 Sample Learned Rules
protein_location(A,B) :-
    np_conj_segment(A), after(A,D), child(D,E), n(E), isa_np_segment(B).
[Pos cover = 577  Neg cover = 5  Confidence = 2.33]

protein_location(A,B) :-
    isa_np_segment(B), previous(B,D), pp_segment(D), different_phrases(D,A), before(A,E), different_phrases(D,E), before(B,F), next(F,D), different_phrases(B,A), different_phrases(E,B), different_phrases(F,E), child(B,G), n(G), child(B,H), kl_location_05p(H), isa_np_segment(A), child(A,I), novelword(I), child(A,J), alphanumeric(J).
[Pos cover = 32  Neg cover = 0  Confidence = 2.14]

18 Performance Measures
rank examples by confidence
threshold at boundary points

Conf   Class   Precision   Recall
0.98     +       1.00       0.20
0.97     +       1.00       0.40
0.84     -       0.66       0.40
0.78     +       0.75       0.60
0.55     +       0.80       0.80
0.43     -       0.66       0.80
0.23     -       0.57       0.80
0.22     -       0.50       0.80
0.12     +       0.55       1.00
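
A minimal sketch (assumed) of how such a table is produced: rank the test examples by confidence and, at each possible threshold, count true positives to get precision and recall. The confidence/class pairs below are taken from the table above.

```python
# Hedged sketch: precision/recall at every confidence threshold.

ranked = [(0.98, True), (0.97, True), (0.84, False), (0.78, True), (0.55, True),
          (0.43, False), (0.23, False), (0.22, False), (0.12, True)]

total_pos = sum(label for _, label in ranked)
tp = 0
for i, (conf, label) in enumerate(ranked, start=1):
    tp += label
    precision, recall = tp / i, tp / total_pos
    print(f"threshold {conf:.2f}: precision {precision:.2f}  recall {recall:.2f}")
```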

19 Results: Comparison

20 Future Work
Find novel ways to search for rules
Use more biological background knowledge to inform our search
Use "prior knowledge" rules from an expert as our initial hypothesis
Apply to additional information extraction tasks

21 References
Nelson, Stuart J.; Powell, Tammy; Humphreys, Betsy L. The Unified Medical Language System (UMLS) Project. In: Encyclopedia of Library and Information Science. Forthcoming.
Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press.
Ellen Riloff. The Sundance Sentence Analyzer. 2002.
Ines de Castro Dutra et al. An Empirical Evaluation of Bagging in Inductive Logic Programming. In Proceedings of the International Conference on Inductive Logic Programming, Sydney, Australia.
Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In Proceedings of the American Association for Artificial Intelligence conference (AAAI-2000).
Soumya Ray and Mark Craven. Representing Sentence Structure in Hidden Markov Models for Information Extraction. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).
Tina Eliassi-Rad and Jude Shavlik. A Theory-Refinement Approach to Information Extraction. In Proceedings of the 18th International Conference on Machine Learning.
M. Craven and J. Kumlien. Constructing biological knowledge-bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages , Germany.

