Gleaning Relational Information from Biomedical Text Mark Goadrich Computer Sciences Department University of Wisconsin - Madison Joint Work with Jude Shavlik and Louis Oliphant.


Gleaning Relational Information from Biomedical Text
Mark Goadrich
Computer Sciences Department, University of Wisconsin - Madison
Joint Work with Jude Shavlik and Louis Oliphant
CIBM Seminar - Dec 5th 2006

Outline
The Vacation Game
Formalizing with Logic
Biomedical Information Extraction
Evaluating Hypotheses
Gleaning Logical Rules
Experiments
Current Directions

The Vacation Game
Positive
Negative

The Vacation Game
Positive      Negative
Apple         Pear
Feet          Socks
Luggage       Car
Mushrooms     Fungus
Books         Novel
Wallet        Money
Beekeeper     Hive

The Vacation Game
My Secret Rule
–The word must have two adjacent letters which are the same letter.
Found by using inductive logic
–Positive and Negative Examples
–Formulating and Eliminating Hypotheses
–Evaluating Success and Failure
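The secret rule can be checked mechanically against the game's examples. The following is a minimal sketch, not part of the original talk; the function name `has_double_letter` is an illustrative assumption, and the word lists are the positive and negative examples from the game.

```python
def has_double_letter(word: str) -> bool:
    """Return True if the word has two adjacent letters that are the same."""
    w = word.lower()
    return any(a == b for a, b in zip(w, w[1:]))

# Examples from the Vacation Game slides.
positives = ["Apple", "Feet", "Luggage", "Mushrooms", "Books", "Wallet", "Beekeeper"]
negatives = ["Pear", "Socks", "Car", "Fungus", "Novel", "Money", "Hive"]

covered_pos = [w for w in positives if has_double_letter(w)]
covered_neg = [w for w in negatives if has_double_letter(w)]

print(len(covered_pos), "of", len(positives), "positives covered")
print(len(covered_neg), "of", len(negatives), "negatives covered")
```

A hypothesis that covers every positive and no negative, as this one does, is what the "formulating and eliminating hypotheses" step is searching for.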

Inductive Logic Programming
Machine Learning
–Classify data into categories
–Divide data into train and test sets
–Generate hypotheses on train set and then measure performance on test set
In ILP, data are Objects …
–person, block, molecule, word, phrase, …
and Relations between them
–grandfather, has_bond, is_member, …

Formalizing with Logic
[Figure: the word 'apple' represented as the object w2169, with letter objects w2169_1 through w2169_5 linked to the letters a to z; the two kinds of element are labeled Objects and Relations]

Formalizing with Logic
word(w2169).
letter(w2169_1).
has_letter(w2169, w2169_2).
has_letter(w2169, w2169_3).
next(w2169_2, w2169_3).
letter_value(w2169_2, 'p').
letter_value(w2169_3, 'p').
pos(X) :- has_letter(X, A), has_letter(X, B), next(A, B), letter_value(A, C), letter_value(B, C).
[Figure: the word 'apple' as object w2169 with letter objects w2169_1 through w2169_5; the rule is annotated with its head, its body, and its Variables]
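To make the relational encoding concrete, here is a small Python sketch that builds has_letter, next, and letter_value facts for a word and then evaluates the pos(X) rule over them by searching for bindings of A and B. It only illustrates how the rule's body is satisfied and is not how a Prolog or ILP engine is implemented; the helper names `word_facts` and `pos` and the identifier `w0001` are assumptions made for this example.

```python
from itertools import product

def word_facts(word_id: str, text: str) -> dict:
    """Build has_letter, next, and letter_value facts for one word."""
    letters = [f"{word_id}_{i}" for i in range(1, len(text) + 1)]
    return {
        "has_letter": {(word_id, letter) for letter in letters},
        "next": set(zip(letters, letters[1:])),
        "letter_value": dict(zip(letters, text)),
    }

def pos(word_id: str, facts: dict) -> bool:
    """Mirror of: pos(X) :- has_letter(X,A), has_letter(X,B), next(A,B),
    letter_value(A,C), letter_value(B,C)."""
    letters = [l for (w, l) in facts["has_letter"] if w == word_id]
    for a, b in product(letters, repeat=2):        # candidate bindings for A and B
        same_value = facts["letter_value"][a] == facts["letter_value"][b]
        if (a, b) in facts["next"] and same_value:  # shared variable C
            return True
    return False

apple = word_facts("w2169", "apple")
pear = word_facts("w0001", "pear")   # hypothetical identifier for a negative example

print(pos("w2169", apple))
print(pos("w0001", pear))
```

The shared variable C is what forces the two adjacent letters to carry the same value, which is exactly the secret rule of the game.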

Biomedical Information Extraction
[Figure: from text to a Structured Database; image courtesy of SEER Cancer Training Site]

Biomedical Information Extraction

NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.

Biomedical Information Extraction The dog running down the street tackled and bit my little sister.

Biomedical Information Extraction
[Figure: parse tree of the sentence "NPL3 encodes a nuclear protein with …"; words are tagged with parts of speech (noun, verb, article, adj, prep) and grouped into noun phrases, a verb phrase, and a prep phrase under the sentence node]

MedDict Background Knowledge

MeSH Background Knowledge

GO Background Knowledge

Some Prolog Predicates
Biomedical Predicates
–phrase_contains_medDict_term(Phrase, Word, WordText)
–phrase_contains_mesh_term(Phrase, Word, WordText)
–phrase_contains_mesh_disease(Phrase, Word, WordText)
–phrase_contains_go_term(Phrase, Word, WordText)
Lexical Predicates
–internal_caps(Word)
–alphanumeric(Word)
Look-ahead Phrase Predicates
–few_POS_in_phrase(Phrase, POS)
–phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
–phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)
Relative Location of Phrases
–protein_before_location(ExampleID)
–word_pair_in_between_target_phrases(ExampleID, W1, W2)

Still More Predicates
High-scoring words in protein phrases
–bifunction, repress, pmr1, …
High-scoring words in location phrases
–golgi, cytoplasm, er
High-scoring words BETWEEN protein & location
–across, cofractionate, inside, …

Biomedical Information Extraction
Given: Medical Journal abstracts tagged with biological relations
Do: Construct system to extract related phrases from unseen text
Our Gleaner Approach
–Develop fast ensemble algorithms focused on recall and precision evaluation

Using Modes to Chain Relations
[Figure: the types Word, Phrase, and Sentence chained together by predicates: alphanumeric(…), internal_caps(…), verb(…), phrase_child(…, …), phrase_parent(…, …), noun_phrase(…), long_sentence(…)]

Growing Rules From Seed
NPL3 encodes a nuclear protein with …
prot_loc(ab _sen7_ph0, ab _sen7_ph2, ab _sen7).
phrase_contains_novelword(ab _sen7_ph0, ab _sen7_ph0_w0).
phrase_next(ab _sen7_ph0, ab _sen7_ph1).
…
noun_phrase(ab _sen7_ph2).
word_child(ab _sen7_ph2, ab _sen5_ph11_w3).
…
avg_length_sentence(ab _sen7).
…
[Figure: the arguments of the facts are annotated with their types: Phrase, Sentence, Word]

Growing Rules From Seed
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).

Rule Evaluation
Prediction vs Actual
–Positive or Negative
–True or False

                      actual positive   actual negative
prediction positive   TP                FP
prediction negative   FN                TN

Focus on positive examples
–Recall: $R = \frac{TP}{TP + FN}$
–Precision: $P = \frac{TP}{TP + FP}$
–F1 Score: $F_1 = \frac{2PR}{P + R}$
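As a quick illustration of these metrics, the sketch below computes recall, precision, and F1 from confusion-matrix counts using the standard definitions (recall = TP/(TP+FN), precision = TP/(TP+FP), F1 = 2PR/(P+R)). The counts are made up for the example and do not come from the talk.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts.

    True negatives are deliberately not used: all three metrics
    focus on the positive examples.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because TN never appears in these formulas, the scores are unaffected by how many negatives a rule correctly rejects, which matters when negatives vastly outnumber positives.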

Protein Localization Rule 1
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
Recall 0.51
Precision 0.23
F1 Score [value missing]

Protein Localization Rule 2
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_marked_up_arg2(Location,C),
    phrase_contains_some_internal_cap_word(Protein,_),
    word_previous(C,_).
Recall 0.12
Precision 0.21
F1 Score [value missing]
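The F1 values for the two rules did not survive in this transcript. Assuming the standard definition of F1 as the harmonic mean of precision and recall, they can be approximated from the rounded recall and precision shown for each rule. The figures below are a reader's calculation from those rounded inputs and may differ slightly from the values on the original slides.

```latex
% Rule 1: R = 0.51, P = 0.23
F_1 = \frac{2PR}{P + R} = \frac{2(0.23)(0.51)}{0.23 + 0.51} = \frac{0.2346}{0.74} \approx 0.32

% Rule 2: R = 0.12, P = 0.21
F_1 = \frac{2PR}{P + R} = \frac{2(0.21)(0.12)}{0.21 + 0.12} = \frac{0.0504}{0.33} \approx 0.15
```

The two rules have similar precision, so the difference in F1 comes almost entirely from Rule 1's higher recall.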

Precision-Focused Search

Recall-Focused Search

F1-Focused Search

Aleph - Learning
Aleph learns theories of rules (Srinivasan, v4, 2003)
–Pick positive seed example
–Use heuristic search to find best rule
–Pick new seed from uncovered positives and repeat until threshold of positives covered
Learning theories is time-consuming
Can we reduce time with ensembles?
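The covering loop described above can be sketched in a few lines. This is a toy illustration under strong assumptions, not Aleph itself: the "heuristic search" is replaced by picking the best rule from a fixed list of hypothetical candidate rules, and the score (positives covered minus negatives covered) is an assumed heuristic for the example.

```python
def has_double(word):            # hypothetical candidate rule
    return any(a == b for a, b in zip(word, word[1:]))

def ends_with_s(word):           # hypothetical candidate rule
    return word.endswith("s")

def longer_than_three(word):     # hypothetical candidate rule
    return len(word) > 3

def learn_theory(positives, negatives, candidates, coverage_threshold=1.0):
    """Greedy covering: pick a seed, keep the best rule that covers it,
    drop the positives that rule covers, and repeat."""
    remaining = list(positives)  # positives not yet covered
    covered = 0
    theory = []
    while remaining and covered < coverage_threshold * len(positives):
        seed = remaining[0]
        usable = [rule for rule in candidates if rule(seed)]
        if not usable:
            remaining.pop(0)     # no candidate explains this seed; skip it
            continue
        best = max(
            usable,
            key=lambda rule: sum(map(rule, remaining)) - sum(map(rule, negatives)),
        )
        covered += sum(1 for x in remaining if best(x))
        remaining = [x for x in remaining if not best(x)]
        theory.append(best)
    return theory

positives = ["apple", "feet", "books", "wallet"]
negatives = ["pear", "socks", "car"]
theory = learn_theory(positives, negatives, [has_double, ends_with_s, longer_than_three])
print([rule.__name__ for rule in theory])
```

In a real ILP system each pass of the loop runs a search over a large clause space, which is why learning a full theory is time-consuming.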

Gleaner
Definition of Gleaner
–One who gathers grain left behind by reapers
Key Ideas of Gleaner
–Use Aleph as underlying ILP rule engine
–Search rule space with Rapid Random Restart
–Keep wide range of rules usually discarded
–Create separate theories for diverse recall

Gleaner - Learning
Create B Bins
Generate Clauses
Record Best per Bin
[Figure: plot with axes Precision and Recall]
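The three steps above (create B bins, generate clauses, record the best per bin) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say how "best" is measured, so the precision × recall score used here is an assumption, and clauses `c3` and `c4` are invented values (`c1` and `c2` reuse the recall and precision reported for the two protein localization rules).

```python
def bin_index(recall: float, num_bins: int) -> int:
    """Map a recall value in [0, 1] to one of num_bins equal-width bins."""
    return min(int(recall * num_bins), num_bins - 1)

def glean(clauses, num_bins=20, score=lambda p, r: p * r):
    """Keep, for every recall bin, the best-scoring clause seen so far.

    clauses: iterable of (name, precision, recall) measured on the training set.
    score:   ASSUMED ranking function; the slide only says "best per bin".
    """
    bins = [None] * num_bins
    for name, precision, recall in clauses:
        b = bin_index(recall, num_bins)
        current = bins[b]
        if current is None or score(precision, recall) > score(current[1], current[2]):
            bins[b] = (name, precision, recall)
    return bins

clauses = [
    ("c1", 0.23, 0.51),   # recall/precision of Protein Localization Rule 1
    ("c2", 0.21, 0.12),   # recall/precision of Protein Localization Rule 2
    ("c3", 0.40, 0.52),   # hypothetical clause
    ("c4", 0.90, 0.03),   # hypothetical clause
]
bins = glean(clauses)
for i, entry in enumerate(bins):
    if entry is not None:
        print(i, entry)
```

A low-recall, high-precision clause such as `c4` would normally be discarded by a search that keeps a single best rule; here it survives because it only competes within its own recall bin.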

Gleaner - Learning
[Figure: bins along the Recall axis, shown for Seed 1, Seed 2, Seed 3, …, Seed K]

Gleaner - Ensemble
Rules from bin 5
Example              Score
pos1: prot_loc(…)    12
pos2: prot_loc(…)    47
pos3: prot_loc(…)    55
neg1: prot_loc(…)    5
neg2: prot_loc(…)    14
neg3: prot_loc(…)    2
neg4: prot_loc(…)
[Figure: the example pos2: prot_loc(…), score 47, is passed through the rules, whose individual outputs are labeled Pos or Neg]

Gleaner - Ensemble
[Table: examples ranked by score (pos3, neg28, pos2, neg4, neg475, …, pos9, neg15) with columns Score, Examples, Precision, Recall; the values are not preserved]
[Figure: plot with axes Recall and Precision, up to 1.0]
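One way to read these two slides is that each example receives a score from the rules in a bin, the examples are ranked by that score, and precision and recall are then computed at each score threshold. The sketch below follows that reading, which is an interpretation rather than something the transcript states outright. It uses the six example scores that survive in the transcript (neg4 is omitted because its score is missing); the function name `pr_at_threshold` is an assumption.

```python
# Scores from the "Rules from bin 5" slide; neg4 has no score in the transcript.
scores = {"pos1": 12, "pos2": 47, "pos3": 55, "neg1": 5, "neg2": 14, "neg3": 2}
labels = {name: name.startswith("pos") for name in scores}

def pr_at_threshold(scores, labels, threshold):
    """Predict positive when score >= threshold; return (precision, recall)."""
    predicted = [name for name, s in scores.items() if s >= threshold]
    tp = sum(labels[name] for name in predicted)
    total_pos = sum(labels.values())
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / total_pos
    return precision, recall

# Sweep thresholds from the highest score down to trace out a curve.
for t in sorted(set(scores.values()), reverse=True):
    precision, recall = pr_at_threshold(scores, labels, t)
    print(f"threshold={t:>2}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold trades precision for recall, so a single scored ranking yields a whole segment of the recall-precision curve.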

Gleaner - Overlap
For each bin, take the topmost curve
[Figure: overlapping curves on axes Recall and Precision]

How to use Gleaner
Generate Test Curve
User Selects Recall Bin
Return Classifications Ordered By Their Score
[Figure: curve on axes Precision and Recall, with the selected point Recall = 0.50, Precision = 0.70]

Aleph Ensembles
We compare to ensembles of theories
Algorithm (Dutra et al ILP 2002)
–Use K different initial seeds
–Learn K theories containing C rules
–Rank examples by the number of theories
Need to balance C for high performance
–Small C leads to low recall
–Large C leads to converging theories

Evaluation Metrics
Area Under Recall-Precision Curve (AURPC)
–All curves standardized to cover full recall range
–Averaged AURPC over 5 folds
Number of clauses considered
–Rough estimate of time
[Figure: plot with axes Recall and Precision, up to 1.0]
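A rough sketch of the AURPC metric is given below. It uses plain trapezoidal integration over (recall, precision) points, which is a simplification chosen for brevity: straight-line interpolation between points in recall-precision space is known to be optimistic, and the talk does not say which interpolation was used. The curves are hypothetical, except that the point (0.50, 0.70) echoes the example recall and precision shown on the "How to use Gleaner" slide.

```python
def aurpc(points):
    """Trapezoidal area under a recall-precision curve.

    points: (recall, precision) pairs that already span recall 0.0 to 1.0.
    """
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area

# Two hypothetical per-fold curves, each standardized to the full recall range.
fold_curves = [
    [(0.0, 1.0), (0.5, 0.7), (1.0, 0.1)],
    [(0.0, 1.0), (1.0, 0.0)],
]
fold_areas = [aurpc(curve) for curve in fold_curves]
mean_aurpc = sum(fold_areas) / len(fold_areas)
print(fold_areas, mean_aurpc)
```

Standardizing every curve to the full recall range is what makes the per-fold areas comparable before they are averaged.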

YPD Protein Localization
Hand-labeled dataset (Ray & Craven ’01)
–7,245 sentences from 871 abstracts
–Examples are phrase-phrase combinations: 1,810 positive & 279,154 negative
1.6 GB of background knowledge
–Structural, Statistical, Lexical and Ontological
–In total, 200+ distinct background predicates
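The example counts imply a heavily skewed dataset, which helps explain the focus on recall and precision rather than accuracy. The short calculation below is a reader's illustration derived only from the two counts on the slide.

```python
positives = 1_810
negatives = 279_154

total = positives + negatives
pos_rate = positives / total              # fraction of examples that are positive
neg_per_pos = negatives / positives       # negatives for every positive
all_negative_accuracy = negatives / total # accuracy of always predicting "negative"

print(f"total examples:        {total:,}")
print(f"positive rate:         {pos_rate:.2%}")
print(f"negatives per positive: {neg_per_pos:.1f}")
print(f"accuracy of all-negative baseline: {all_negative_accuracy:.2%}")
```

A classifier that labels everything negative would score above 99% accuracy while extracting nothing, whereas its recall would be zero.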

Experimental Methodology
Performed five-fold cross-validation
Variation of parameters
–Gleaner (20 recall bins)
    # seeds = {25, 50, 75, 100}
    # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K}
–Ensembles (0.75 minacc, 1K and 35K nodes)
    # theories = {10, 25, 50, 75, 100}
    # clauses per theory = {1, 5, 10, 15, 20, 25, 50}
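To give a sense of the size of this parameter sweep, the sketch below enumerates the settings listed above. It assumes the full cross product of each parameter set was run and that "1K and 35K nodes" denotes two separate ensemble settings; neither assumption is confirmed by the slide.

```python
from itertools import product

# Gleaner settings (20 recall bins throughout).
gleaner_seeds = [25, 50, 75, 100]
gleaner_clauses = [1_000, 10_000, 25_000, 50_000, 100_000, 250_000, 500_000]
gleaner_grid = list(product(gleaner_seeds, gleaner_clauses))

# Aleph ensemble settings (minacc fixed at 0.75).
ensemble_nodes = [1_000, 35_000]          # assumed to be two separate settings
ensemble_theories = [10, 25, 50, 75, 100]
ensemble_clauses_per_theory = [1, 5, 10, 15, 20, 25, 50]
ensemble_grid = list(product(ensemble_nodes, ensemble_theories, ensemble_clauses_per_theory))

folds = 5
print("Gleaner configurations: ", len(gleaner_grid))
print("Ensemble configurations:", len(ensemble_grid))
print("Total runs with 5 folds:", folds * (len(gleaner_grid) + len(ensemble_grid)))
```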

PR Curves - 100,000 Clauses

PR Curves - 1,000,000 Clauses

Protein Localization Results

Genetic Disorder Results

Current Directions
Learn diverse rules across seeds
Calculate probabilistic scores for examples
Directed Rapid Random Restarts
Cache rule information to speed scoring
Transfer learning across seeds
Explore Active Learning within ILP

Take-Home Message
Biology, Gleaner and ILP
–Challenging problems in biology can be naturally formulated for Inductive Logic Programming
–Many rules constructed and evaluated in ILP hypothesis search
–Gleaner makes use of those rules that are not the highest scoring ones for improved speed and performance

Acknowledgements
USA DARPA Grant F
USA Air Force Grant F
USA NLM Grant 5T15LM
USA NLM Grant 1R01LM
UW Condor Group
David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni