1 Gleaning Relational Information from Biomedical Text
Mark Goadrich
Computer Sciences Department, University of Wisconsin - Madison
Joint work with Jude Shavlik and Louis Oliphant
CIBM Seminar, December 5, 2006

2 Outline
– The Vacation Game
– Formalizing with Logic
– Biomedical Information Extraction
– Evaluating Hypotheses
– Gleaning Logical Rules
– Experiments
– Current Directions

3 The Vacation Game
Positive        Negative

4 The Vacation Game
Positive        Negative
Apple           Pear
Feet            Socks
Luggage         Car
Mushrooms       Fungus
Books           Novel
Wallet          Money
Beekeeper       Hive

5 The Vacation Game
My Secret Rule
– The word must contain two adjacent letters that are the same.
Found by using inductive logic
– Positive and Negative Examples
– Formulating and Eliminating Hypotheses
– Evaluating Success and Failure

6 Inductive Logic Programming
Machine Learning
– Classify data into categories
– Divide data into train and test sets
– Generate hypotheses on the train set and then measure performance on the test set
In ILP, data are Objects …
– person, block, molecule, word, phrase, …
… and Relations between them
– grandfather, has_bond, is_member, …

7 Formalizing with Logic
[Diagram: the word 'apple' is modeled as the object w2169 and its five letters as the objects w2169_1 through w2169_5 (Objects); links connect each letter object to its value among a through z (Relations)]

8 Formalizing with Logic
% Facts describing the word 'apple' (object w2169):
word(w2169).
letter(w2169_1).
has_letter(w2169, w2169_2).
has_letter(w2169, w2169_3).
next(w2169_2, w2169_3).
letter_value(w2169_2, 'p').
letter_value(w2169_3, 'p').

% Rule (head :- body); X, A, B, C are variables:
pos(X) :- has_letter(X, A),
          has_letter(X, B),
          next(A, B),
          letter_value(A, C),
          letter_value(B, C).
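For readers who don't know Prolog, here is a minimal Python sketch of the same rule (the function name pos and its string input are illustrative, not part of the original system):

```python
def pos(word):
    # A word is positive if some letter (A) and the next letter (B)
    # share the same letter value (C), i.e. two adjacent letters match.
    return any(a == b for a, b in zip(word, word[1:]))

# 'apple' contains the adjacent pair 'pp'; 'pear' has no such pair.
assert pos("apple") and not pos("pear")
```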

9 Biomedical Information Extraction
[Figure: turning unstructured text into a structured database; image courtesy of the SEER Cancer Training Site]

10 Biomedical Information Extraction http://www.geneontology.org

11
– NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
– ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
– Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.

12 Biomedical Information Extraction The dog running down the street tackled and bit my little sister.

13 Biomedical Information Extraction
[Parse tree for "NPL3 encodes a nuclear protein with …": part-of-speech tags (verb, noun, article, adj, prep) under a sentence node built from noun phrases, a verb phrase, and a prepositional phrase]

14 MedDict Background Knowledge http://cancerweb.ncl.ac.uk/omd/

15 MeSH Background Knowledge http://www.nlm.nih.gov/mesh/MBrowser.html

16 GO Background Knowledge http://www.geneontology.org

17 Some Prolog Predicates
Biomedical Predicates
– phrase_contains_medDict_term(Phrase, Word, WordText)
– phrase_contains_mesh_term(Phrase, Word, WordText)
– phrase_contains_mesh_disease(Phrase, Word, WordText)
– phrase_contains_go_term(Phrase, Word, WordText)
Lexical Predicates
– internal_caps(Word)
– alphanumeric(Word)
Look-ahead Phrase Predicates
– few_POS_in_phrase(Phrase, POS)
– phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
– phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases
– protein_before_location(ExampleID)
– word_pair_in_between_target_phrases(ExampleID, W1, W2)

18 Still More Predicates
High-scoring words in protein phrases
– bifunction, repress, pmr1, …
High-scoring words in location phrases
– golgi, cytoplasm, er
High-scoring words BETWEEN protein & location phrases
– across, cofractionate, inside, …

19 Biomedical Information Extraction
Given: Medical journal abstracts tagged with biological relations
Do: Construct a system to extract related phrases from unseen text
Our Gleaner Approach
– Develop fast ensemble algorithms focused on recall and precision evaluation

20 Using Modes to Chain Relations
[Diagram: predicates link the three object types; long_sentence(…) on Sentence; noun_phrase(…), phrase_parent(…, …), phrase_child(…, …) on Phrase; alphanumeric(…), internal_caps(…), verb(…) on Word]

21 Growing Rules From Seed
NPL3 encodes a nuclear protein with …

% Ground facts for this seed example (objects: phrases, sentences, words):
prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2, ab1392078_sen7).
phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0).
phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1).
…
noun_phrase(ab1392078_sen7_ph2).
word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3).
…
avg_length_sentence(ab1392078_sen7).
…

22 Growing Rules From Seed
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).

23 Rule Evaluation
Prediction vs Actual (True or False, Positive or Negative):

                 actual +   actual -
  predicted +    TP         FP
  predicted -    FN         TN

Focus on positive examples:
  Recall    = TP / (TP + FN)
  Precision = TP / (TP + FP)
  F1 Score  = 2PR / (P + R)
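As a concrete companion to these formulas, a small Python sketch (assuming raw TP, FP, and FN counts; true negatives are not needed):

```python
def rule_metrics(tp, fp, fn):
    # Precision and recall focus on the positive examples;
    # TN never appears in either formula.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Rule 1 on the next slide: roughly precision 0.51, recall 0.15, F1 0.23.
```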

24 Protein Localization Rule 1
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).

Recall = 0.15, Precision = 0.51, F1 Score = 0.23

25 Protein Localization Rule 2
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_marked_up_arg2(Location,C),
    phrase_contains_some_internal_cap_word(Protein,_),
    word_previous(C,_).

Recall = 0.86, Precision = 0.12, F1 Score = 0.21

26 Precision-Focused Search

27 Recall-Focused Search

28 F1-Focused Search

29 Aleph - Learning
Aleph learns theories of rules (Srinivasan, v4, 2003)
– Pick a positive seed example
– Use heuristic search to find the best rule
– Pick a new seed from the uncovered positives and repeat until a threshold of positives is covered
Learning theories is time-consuming
Can we reduce time with ensembles?
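A schematic sketch of this covering loop (an illustration of the idea only, not Aleph's actual code; find_best_rule stands in for Aleph's heuristic clause search):

```python
def learn_theory(positives, negatives, find_best_rule, coverage_threshold):
    # Repeatedly pick an uncovered positive seed, search for the best
    # rule explaining it, and remove the positives that rule covers.
    theory, uncovered = [], set(positives)
    while uncovered and len(positives) - len(uncovered) < coverage_threshold:
        seed = next(iter(uncovered))                       # positive seed example
        rule = find_best_rule(seed, uncovered, negatives)  # heuristic search
        theory.append(rule)
        uncovered = {p for p in uncovered if not rule(p)}  # drop covered positives
    return theory
```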

30 Gleaner
Definition of Gleaner
– One who gathers grain left behind by reapers
Key Ideas of Gleaner
– Use Aleph as the underlying ILP rule engine
– Search rule space with Rapid Random Restart
– Keep a wide range of rules usually discarded
– Create separate theories for diverse recall

31 Gleaner - Learning
[Precision-recall plot illustrating the search]
– Create B bins
– Generate clauses
– Record best per bin
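One way to picture "record best per bin" in code (a sketch; bins partition the recall range, and clauses are scored here by plain precision, which may differ from Gleaner's actual heuristic):

```python
def record_best_per_bin(bins, clause, recall, precision):
    # bins: one slot per equal-width recall interval, each holding the
    # best (precision, clause) pair seen so far, or None.
    B = len(bins)
    b = min(int(recall * B), B - 1)   # recall bin this clause falls into
    if bins[b] is None or precision > bins[b][0]:
        bins[b] = (precision, clause)

# Usage: bins = [None] * 20, updated once per clause generated
# during the Rapid Random Restart search.
```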

32 Gleaner - Learning
[Diagram: the bin-filling search is repeated for each of K seeds (Seed 1, Seed 2, Seed 3, …, Seed K) across the recall range]

33 Gleaner - Ensemble
Rules from bin 5 are applied to the examples; each example's score is the number of rules that match it:

pos1: prot_loc(…)   12
pos2: prot_loc(…)   47
pos3: prot_loc(…)   55
neg1: prot_loc(…)    5
neg2: prot_loc(…)   14
neg3: prot_loc(…)    2
neg4: prot_loc(…)   18

34 Gleaner - Ensemble
Examples sorted by score, with the precision and recall obtained by thresholding at that score:

Example               Score   Precision   Recall
pos3: prot_loc(…)     55      1.00        0.05
neg28: prot_loc(…)    52      0.50        0.05
pos2: prot_loc(…)     47      0.66        0.10
…
neg4: prot_loc(…)     18      0.12        0.85
neg475: prot_loc(…)   17      0.13        0.90
pos9: prot_loc(…)     16      0.12        0.90
…
neg15: prot_loc(…)    …       …           …
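A sketch of how such scores become recall-precision points (hypothetical helper; each rule is treated as a boolean predicate over examples, and labels marks the positives):

```python
def pr_points(rules, examples, labels):
    # Score = number of the bin's rules (one per seed) matching the example;
    # sweeping a threshold over the scores traces out the table above.
    scores = [sum(bool(rule(ex)) for rule in rules) for ex in examples]
    total_pos = sum(labels)                  # assumes at least one positive
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((tp / total_pos, tp / (tp + fp)))   # (recall, precision)
    return points
```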

35 Gleaner - Overlap
For each bin, take the topmost curve
[Precision-recall plot overlaying the per-bin curves]

36 How to use Gleaner
– Generate test curve
– User selects recall bin
– Return classifications ordered by their score
[Precision-recall plot; the selected point gives Recall = 0.50, Precision = 0.70]

37 Aleph Ensembles
We compare to ensembles of theories
Algorithm (Dutra et al., ILP 2002)
– Use K different initial seeds
– Learn K theories containing C rules
– Rank examples by the number of theories that classify them as positive
Need to balance C for high performance
– Small C leads to low recall
– Large C leads to converging theories

38 Evaluation Metrics
Area Under Recall-Precision Curve (AURPC)
– All curves standardized to cover the full recall range
– Averaged AURPC over 5 folds
Number of clauses considered
– Rough estimate of time
[Illustrative recall-precision plot]
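The transcript does not show how AURPC is computed; average precision over the ranked test examples is a closely related quantity and gives the flavor (a sketch, not the paper's standardized, interpolated AURPC):

```python
def average_precision(ranked_labels):
    # ranked_labels: 1 for a positive example, 0 for a negative,
    # sorted by descending classifier score.
    tp, ap, total_pos = 0, 0.0, sum(ranked_labels)
    for rank, y in enumerate(ranked_labels, start=1):
        if y:
            tp += 1
            ap += tp / rank        # precision at this recall step
    return ap / total_pos if total_pos else 0.0
```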

39 YPD Protein Localization
Hand-labeled dataset (Ray & Craven '01)
– 7,245 sentences from 871 abstracts
– Examples are phrase-phrase combinations: 1,810 positive & 279,154 negative
1.6 GB of background knowledge
– Structural, Statistical, Lexical and Ontological
– In total, 200+ distinct background predicates

40 Experimental Methodology
Performed five-fold cross-validation
Variation of parameters
– Gleaner (20 recall bins)
  - # seeds = {25, 50, 75, 100}
  - # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K}
– Ensembles (0.75 minacc, 1K and 35K nodes)
  - # theories = {10, 25, 50, 75, 100}
  - # clauses per theory = {1, 5, 10, 15, 20, 25, 50}

41 PR Curves - 100,000 Clauses

42 PR Curves - 1,000,000 Clauses

43 Protein Localization Results

44 Genetic Disorder Results

45 Current Directions
– Learn diverse rules across seeds
– Calculate probabilistic scores for examples
– Directed Rapid Random Restarts
– Cache rule information to speed scoring
– Transfer learning across seeds
– Explore Active Learning within ILP

46 Take-Home Message
Biology, Gleaner and ILP
– Challenging problems in biology can be naturally formulated for Inductive Logic Programming
– Many rules are constructed and evaluated in an ILP hypothesis search
– Gleaner makes use of rules that are not the highest scoring for improved speed and performance

47 Acknowledgements
– USA DARPA Grant F30602-01-2-0571
– USA Air Force Grant F30602-01-2-0571
– USA NLM Grant 5T15LM007359-02
– USA NLM Grant 1R01LM07050-01
– UW Condor Group
– David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni

