Gleaning Relational Information from Biomedical Text
Mark Goadrich
Computer Sciences Department, University of Wisconsin - Madison
Joint work with Jude Shavlik and Louis Oliphant
CIBM Seminar - Dec 5th 2006
Outline
The Vacation Game
Formalizing with Logic
Biomedical Information Extraction
Evaluating Hypotheses
Gleaning Logical Rules
Experiments
Current Directions
The Vacation Game
Positive
Negative
The Vacation Game
Positive
–Apple
–Feet
–Luggage
–Mushrooms
–Books
–Wallet
–Beekeeper
Negative
–Pear
–Socks
–Car
–Fungus
–Novel
–Money
–Hive
The Vacation Game
My Secret Rule
–The word must have two adjacent letters which are the same letter.
Found by using inductive logic
–Positive and Negative Examples
–Formulating and Eliminating Hypotheses
–Evaluating Success and Failure
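The secret rule is simple enough to check mechanically against the examples from the game. A minimal Python sketch; the function name `has_double_letter` is an illustrative choice and does not come from the talk:

```python
def has_double_letter(word: str) -> bool:
    """Return True if two adjacent letters in the word are the same letter."""
    letters = word.lower()
    return any(a == b for a, b in zip(letters, letters[1:]))


positives = ["Apple", "Feet", "Luggage", "Mushrooms", "Books", "Wallet", "Beekeeper"]
negatives = ["Pear", "Socks", "Car", "Fungus", "Novel", "Money", "Hive"]

# A correct hypothesis accepts every positive example and rejects every negative one.
covers_all_positives = all(has_double_letter(w) for w in positives)
covers_no_negatives = not any(has_double_letter(w) for w in negatives)
print(covers_all_positives, covers_no_negatives)
```

A hypothesis that failed either check would be eliminated, which is the "formulating and eliminating hypotheses" step of the game.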
Inductive Logic Programming
Machine Learning
–Classify data into categories
–Divide data into train and test sets
–Generate hypotheses on train set and then measure performance on test set
In ILP, data are Objects …
–person, block, molecule, word, phrase, …
and Relations between them
–grandfather, has_bond, is_member, …
Formalizing with Logic
[Figure: the word "apple" as object w2169 with letter objects w2169_1, w2169_2, w2169_3, w2169_4, w2169_5, shown against the alphabet a to z; labels: Objects, Relations]
Formalizing with Logic
word(w2169).
letter(w2169_1).
has_letter(w2169, w2169_2).
has_letter(w2169, w2169_3).
next(w2169_2, w2169_3).
letter_value(w2169_2, 'p').
letter_value(w2169_3, 'p').

pos(X) :-
    has_letter(X, A),
    has_letter(X, B),
    next(A, B),
    letter_value(A, C),
    letter_value(B, C).

The head of the rule is pos(X), the body is everything after ":-", and X, A, B, C are Variables.
[Figure: the word 'apple' as object w2169 with letter objects w2169_1 to w2169_5, shown against the alphabet a to z]
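To make the relational encoding concrete, the facts and the rule above can be mirrored in Python. This is only a sketch of the idea, not how an ILP engine evaluates clauses; the helper names `word_to_facts` and `pos` and the set-of-tuples representation are assumptions made for illustration:

```python
from itertools import product


def word_to_facts(word_id, text):
    """Build has_letter, next and letter_value facts for one word."""
    letters = [f"{word_id}_{i}" for i in range(1, len(text) + 1)]
    return {
        "has_letter": {(word_id, letter) for letter in letters},
        "next": set(zip(letters, letters[1:])),
        "letter_value": set(zip(letters, text)),
    }


def pos(word_id, facts):
    """Mirror of: pos(X) :- has_letter(X,A), has_letter(X,B), next(A,B),
    letter_value(A,C), letter_value(B,C)."""
    value = dict(facts["letter_value"])
    letters = [l for (w, l) in facts["has_letter"] if w == word_id]
    # Try every binding of A and B; the shared variable C forces equal values.
    return any(
        (a, b) in facts["next"] and value[a] == value[b]
        for a, b in product(letters, repeat=2)
    )


apple = word_to_facts("w2169", "apple")
pear = word_to_facts("w1", "pear")  # "w1" is a made-up identifier
print(pos("w2169", apple), pos("w1", pear))
```

The brute-force loop over all (A, B) pairs stands in for the unification a Prolog system would perform.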
Biomedical Information Extraction
[Figure labels: Structured, Database]
*image courtesy of SEER Cancer Training Site
Biomedical Information Extraction
NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.
Biomedical Information Extraction
The dog running down the street tackled and bit my little sister.
Biomedical Information Extraction
NPL3 encodes a nuclear protein with …
[Figure: parse of the sentence, with part-of-speech tags on the words (noun, verb, article, adj, noun, prep) grouped into noun phrases, a verb phrase and a prep phrase under the sentence]
MedDict Background Knowledge
MeSH Background Knowledge
GO Background Knowledge
Some Prolog Predicates
Biomedical Predicates
–phrase_contains_medDict_term(Phrase, Word, WordText)
–phrase_contains_mesh_term(Phrase, Word, WordText)
–phrase_contains_mesh_disease(Phrase, Word, WordText)
–phrase_contains_go_term(Phrase, Word, WordText)
Lexical Predicates
–internal_caps(Word)
–alphanumeric(Word)
Look-ahead Phrase Predicates
–few_POS_in_phrase(Phrase, POS)
–phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
–phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases
–protein_before_location(ExampleID)
–word_pair_in_between_target_phrases(ExampleID, W1, W2)
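The slides name the lexical predicates but do not define them. The Python sketch below shows one plausible reading, and both definitions are assumptions: `internal_caps` is taken to mean an upper-case letter after the first character, and `alphanumeric` is taken to mean a mix of letters and digits.

```python
def internal_caps(word: str) -> bool:
    """Assumed meaning: an upper-case letter appears after the first character."""
    return any(ch.isupper() for ch in word[1:])


def alphanumeric(word: str) -> bool:
    """Assumed meaning: the word contains both letters and digits."""
    has_alpha = any(ch.isalpha() for ch in word)
    has_digit = any(ch.isdigit() for ch in word)
    return has_alpha and has_digit


# Words taken from the example sentences in the talk.
for w in ["NPL3", "ykuD", "COL3A1", "protein"]:
    print(w, internal_caps(w), alphanumeric(w))
```

Features of this kind are cheap to compute and tend to fire on gene and protein names, which is presumably why they appear in the background knowledge.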
Still More Predicates
High-scoring words in protein phrases
–bifunction, repress, pmr1, …
High-scoring words in location phrases
–golgi, cytoplasm, er
High-scoring words BETWEEN protein & location
–across, cofractionate, inside, …
Biomedical Information Extraction
Given: Medical journal abstracts tagged with biological relations
Do: Construct system to extract related phrases from unseen text
Our Gleaner Approach
–Develop fast ensemble algorithms focused on recall and precision evaluation
Using Modes to Chain Relations
[Figure: the types Phrase, Sentence and Word, chained by the predicates alphanumeric(…), internal_caps(…), verb(…), phrase_child(…, …), long_sentence(…), phrase_parent(…, …), noun_phrase(…)]
Growing Rules From Seed
NPL3 encodes a nuclear protein with …
prot_loc(ab _sen7_ph0, ab _sen7_ph2, ab _sen7).
phrase_contains_novelword(ab _sen7_ph0, ab _sen7_ph0_w0).
phrase_next(ab _sen7_ph0, ab _sen7_ph1).
…
noun_phrase(ab _sen7_ph2).
word_child(ab _sen7_ph2, ab _sen5_ph11_w3).
…
avg_length_sentence(ab _sen7).
…
[Figure labels: Phrase, Sentence, Phrase, Word]
Growing Rules From Seed
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
Rule Evaluation
Prediction vs Actual
–Positive or Negative
–True or False

                      actual positive   actual negative
prediction positive         TP                FP
prediction negative         FN                TN

Focus on positive examples
$\text{Recall} = R = \frac{TP}{TP + FN}$
$\text{Precision} = P = \frac{TP}{TP + FP}$
$\text{F1 Score} = \frac{2PR}{P + R}$
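The three measures can be computed directly from the confusion-matrix counts. A minimal Python sketch; the function name and the choice to return 0.0 when a denominator is zero are assumptions, since the slides do not say how empty cases are handled:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from counts of TP, FP and FN.

    True negatives are deliberately unused: all three measures focus on
    the positive examples.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: a rule that covers 8 positives and 2 negatives,
# and misses 8 other positives.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f, 3))
```

Because TN never enters the formulas, these scores stay informative when negatives vastly outnumber positives, as they do in the extraction task.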
Protein Localization Rule 1
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
Recall 0.51
Precision 0.23
F1 Score
Protein Localization Rule 2
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_marked_up_arg2(Location,C),
    phrase_contains_some_internal_cap_word(Protein,_),
    word_previous(C,_).
Recall 0.12
Precision 0.21
F1 Score
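Given the recall and precision reported for the two rules, the F1 score follows from the definition F1 = 2PR / (P + R). The short Python sketch below recomputes it from those rounded figures, so the results are approximate and may differ slightly from what the original slides displayed:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported for the two protein localization rules.
rules = {"Rule 1": (0.23, 0.51), "Rule 2": (0.21, 0.12)}
f1_by_rule = {name: f1_score(p, r) for name, (p, r) in rules.items()}

for name, value in f1_by_rule.items():
    print(name, round(value, 3))
```

Rule 1 trades precision for much higher recall, while Rule 2 is weak on both measures; keeping rules at different recall levels is the motivation for the bins Gleaner uses.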
Precision-Focused Search
Recall-Focused Search
F1-Focused Search
Aleph - Learning
Aleph learns theories of rules (Srinivasan, v4, 2003)
–Pick positive seed example
–Use heuristic search to find best rule
–Pick new seed from uncovered positives and repeat until threshold of positives covered
Learning theories is time-consuming
Can we reduce time with ensembles?
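The seed-and-cover loop described above can be sketched in a few lines of Python. This is not Aleph's implementation: the heuristic search is replaced by picking the best rule from a small fixed list of hypothetical candidates, the score (positives covered minus negatives covered) is an assumed stand-in, and all names are illustrative.

```python
def score(rule, positives, negatives):
    """Assumed heuristic: positives covered minus negatives covered."""
    return sum(map(rule, positives)) - sum(map(rule, negatives))


def learn_theory(positives, negatives, candidate_rules, coverage_threshold=0.9):
    """Greedy covering: pick a seed, take the best rule that covers it, repeat."""
    theory, covered = [], set()
    seeds = list(positives)
    while seeds and len(covered) < coverage_threshold * len(positives):
        seed = seeds.pop(0)
        if seed in covered:
            continue  # only uncovered positives may act as seeds
        options = [rule for rule in candidate_rules if rule(seed)]
        if not options:
            continue  # no candidate explains this seed; try the next one
        best = max(options, key=lambda rule: score(rule, positives, negatives))
        theory.append(best)
        covered.update(p for p in positives if best(p))
    return theory


# Hypothetical candidate rules over lower-case words from the Vacation Game.
def has_double_letter(w):
    return any(a == b for a, b in zip(w, w[1:]))

def starts_with_b(w):
    return w.startswith("b")

def longer_than_five(w):
    return len(w) > 5

positives = ["apple", "feet", "luggage", "mushrooms", "books", "wallet", "beekeeper"]
negatives = ["pear", "socks", "car", "fungus", "novel", "money", "hive"]
theory = learn_theory(positives, negatives,
                      [starts_with_b, longer_than_five, has_double_letter])
print([rule.__name__ for rule in theory])
```

In the real system each call to the search explores a very large clause space, which is why learning a full theory is time-consuming.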
Gleaner
Definition of Gleaner
–One who gathers grain left behind by reapers
Key Ideas of Gleaner
–Use Aleph as underlying ILP rule engine
–Search rule space with Rapid Random Restart
–Keep wide range of rules usually discarded
–Create separate theories for diverse recall
Gleaner - Learning
Create B Bins
Generate Clauses
Record Best per Bin
[Figure axes: Precision, Recall]
Gleaner - Learning
[Figure: one set of recall bins per seed: Seed 1, Seed 2, Seed 3, …, Seed K]
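The bin bookkeeping for a single seed might look like the Python sketch below. The slides say only "Record Best per Bin", so the criterion used here (highest precision among clauses whose recall falls in the bin) is an assumption, as are the function names and the toy clause tuples.

```python
def bin_index(recall: float, num_bins: int) -> int:
    """Map a recall value in [0, 1] to one of num_bins equal-width bins."""
    return min(int(recall * num_bins), num_bins - 1)


def record_best_per_bin(clauses, num_bins=20):
    """clauses: iterable of (name, precision, recall) seen during search.

    Returns a list with one slot per recall bin holding the retained
    clause, or None if no clause landed in that bin.
    """
    best = [None] * num_bins
    for name, precision, recall in clauses:
        b = bin_index(recall, num_bins)
        # Assumed criterion: keep the most precise clause in each bin.
        if best[b] is None or precision > best[b][1]:
            best[b] = (name, precision, recall)
    return best


# Hypothetical clauses generated while searching from one seed.
seen = [
    ("rule_a", 0.23, 0.51),
    ("rule_b", 0.21, 0.12),
    ("rule_c", 0.40, 0.52),
    ("rule_d", 0.90, 1.00),
]
bins = record_best_per_bin(seen, num_bins=20)
print([(i, entry[0]) for i, entry in enumerate(bins) if entry is not None])
```

Repeating this for each of the K seeds yields K candidate clauses per bin, which is the material the ensemble step works with.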
Gleaner - Ensemble
Rules from bin 5
pos1: prot_loc(…) 12
pos2: prot_loc(…) 47
pos3: prot_loc(…) 55
neg1: prot_loc(…) 5
neg2: prot_loc(…) 14
neg3: prot_loc(…) 2
neg4: prot_loc(…)
Gleaner - Ensemble
[Figure: table with columns Examples, Score, Precision, Recall, listing pos3, neg28, pos2, neg4, neg475, pos9, neg15, …, next to a plot with axes Recall and Precision (up to 1.0)]
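Turning scored examples into a recall-precision curve amounts to sorting by score and sweeping a threshold down the list. A minimal Python sketch, assuming that a higher score means "more likely positive" and treating every example as its own threshold (ties are not merged); the scores reuse the numbers shown for the bin 5 rules, leaving out neg4, which has no number on the slide:

```python
def pr_points(scored_examples):
    """scored_examples: list of (is_positive, score) with at least one positive.

    Returns one (recall, precision) point per cutoff position in the
    ranking, from the highest score downward.
    """
    ranked = sorted(scored_examples, key=lambda e: e[1], reverse=True)
    total_pos = sum(1 for is_pos, _ in ranked if is_pos)
    points, tp = [], 0
    for k, (is_pos, _) in enumerate(ranked, start=1):
        tp += int(is_pos)
        points.append((tp / total_pos, tp / k))
    return points


examples = [
    (True, 12), (True, 47), (True, 55),   # pos1, pos2, pos3
    (False, 5), (False, 14), (False, 2),  # neg1, neg2, neg3
]
curve = pr_points(examples)
for recall, precision in curve:
    print(round(recall, 2), round(precision, 2))
```

Each bin produces one such curve, since each bin has its own set of rules scoring the examples.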
Gleaner - Overlap
For each bin, take the topmost curve
[Figure axes: Recall, Precision]
How to use Gleaner
Generate Test Curve
User Selects Recall Bin
Return Classifications Ordered By Their Score
[Figure axes: Precision, Recall; selected point: Recall = 0.50, Precision = 0.70]
Aleph Ensembles
We compare to ensembles of theories
Algorithm (Dutra et al., ILP 2002)
–Use K different initial seeds
–Learn K theories containing C rules
–Rank examples by the number of theories that cover them
Need to balance C for high performance
–Small C leads to low recall
–Large C leads to converging theories
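The ranking step of the ensemble baseline is a vote count. In the Python sketch below, a theory is a list of rules and covers an example when any of its rules does; the toy word rules and all names are hypothetical, and the sketch covers only the voting, not how the K theories are learned.

```python
def theory_covers(theory, example) -> bool:
    """A theory (list of rules) covers an example if any rule matches."""
    return any(rule(example) for rule in theory)


def rank_by_votes(theories, examples):
    """Score each example by how many theories cover it, highest first."""
    votes = {ex: sum(theory_covers(t, ex) for t in theories) for ex in examples}
    ranked = sorted(examples, key=lambda ex: votes[ex], reverse=True)
    return ranked, votes


# K = 3 hypothetical theories; the first contains C = 2 rules.
theories = [
    [lambda w: "pp" in w, lambda w: "ee" in w],
    [lambda w: "e" in w],
    [lambda w: len(w) > 4],
]
ranked, votes = rank_by_votes(theories, ["pear", "car", "apple", "feet"])
print(ranked, votes)
```

With very large C the theories tend to contain the same rules, so their votes stop carrying independent information; that is the convergence problem noted above.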
Evaluation Metrics
Area Under Recall-Precision Curve (AURPC)
–All curves standardized to cover full recall range
–Averaged AURPC over 5 folds
Number of clauses considered
–Rough estimate of time
[Figure axes: Recall, Precision (up to 1.0)]
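As a rough illustration of the metric, the area under a curve given as (recall, precision) points can be approximated with the trapezoid rule and then averaged over folds. This sketch makes two simplifying assumptions that may not match the authors' procedure: it uses linear interpolation between points, and it expects each curve to already span the recall range it should be scored on.

```python
def aurpc(points):
    """Trapezoid-rule area under a curve of (recall, precision) points."""
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


def mean_aurpc(fold_curves):
    """Average the per-fold areas, e.g. over five cross-validation folds."""
    return sum(aurpc(curve) for curve in fold_curves) / len(fold_curves)


# Two hypothetical fold curves covering recall 0.0 to 1.0.
fold_a = [(0.0, 1.0), (1.0, 1.0)]
fold_b = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
print(aurpc(fold_a), aurpc(fold_b), mean_aurpc([fold_a, fold_b]))
```

Linear interpolation in recall-precision space can overestimate the area between distant points, so the numbers from this sketch should be read as an approximation.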
YPD Protein Localization
Hand-labeled dataset (Ray & Craven ’01)
–7,245 sentences from 871 abstracts
–Examples are phrase-phrase combinations: 1,810 positive & 279,154 negative
1.6 GB of background knowledge
–Structural, Statistical, Lexical and Ontological
–In total, 200+ distinct background predicates
Experimental Methodology
Performed five-fold cross-validation
Variation of parameters
–Gleaner (20 recall bins)
  # seeds = {25, 50, 75, 100}
  # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K}
–Ensembles (0.75 minacc, 1K and 35K nodes)
  # theories = {10, 25, 50, 75, 100}
  # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
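The parameter settings above define two grids. A short Python sketch that enumerates them, assuming every combination within each grid was run (the slides list the values but do not state this explicitly); the variable names are illustrative:

```python
from itertools import product

# Gleaner settings (20 recall bins throughout).
gleaner_seeds = [25, 50, 75, 100]
gleaner_clauses = [1_000, 10_000, 25_000, 50_000, 100_000, 250_000, 500_000]

# Aleph ensemble settings.
ensemble_theories = [10, 25, 50, 75, 100]
ensemble_clauses_per_theory = [1, 5, 10, 15, 20, 25, 50]

gleaner_grid = list(product(gleaner_seeds, gleaner_clauses))
ensemble_grid = list(product(ensemble_theories, ensemble_clauses_per_theory))

print(len(gleaner_grid), "Gleaner settings")
print(len(ensemble_grid), "ensemble settings per node limit")
```

Each setting would then be evaluated with five-fold cross-validation, and the ensemble grid applies once for each of the two node limits listed.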
PR Curves - 100,000 Clauses
PR Curves - 1,000,000 Clauses
Protein Localization Results
Genetic Disorder Results
Current Directions
Learn diverse rules across seeds
Calculate probabilistic scores for examples
Directed Rapid Random Restarts
Cache rule information to speed scoring
Transfer learning across seeds
Explore Active Learning within ILP
Take-Home Message
Biology, Gleaner and ILP
–Challenging problems in biology can be naturally formulated for Inductive Logic Programming
–Many rules are constructed and evaluated during the ILP hypothesis search
–Gleaner also makes use of the rules that are not the highest scoring, for improved speed and performance
Acknowledgements
USA DARPA Grant F
USA Air Force Grant F
USA NLM Grant 5T15LM
USA NLM Grant 1R01LM
UW Condor Group
David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni