Accomplishments and Challenges in Literature Data Mining for Biology L. Hirschman et al. Presented by Jing Jiang CS491CXZ Spring, 2004.


1 Accomplishments and Challenges in Literature Data Mining for Biology L. Hirschman et al. Presented by Jing Jiang CS491CXZ Spring, 2004

2 Outline Accomplishments – Natural Language Processing Perspective – Biomedical Applications Challenges – Organizing A Challenge Evaluation – Sample Challenge Problems: Extraction of Biological Pathways Automated Database Curation and Ontology Development

3 Early Work: to Identify Protein Names Fukuda et al. (1998) Challenges encountered: – Long compound names – Different names for the same protein – Common English words as protein names Solutions proposed: – Uppercase letters (Src homology 2 domains) – Numerals (p54 SAP kinase) – Special endings (EGF receptor)

4 Recent Work: to Recognize Interactions between Proteins and Other Molecules Statistical Approach – Stapley & Benoit (2000): co-occurrences of gene names to predict connections – Ding et al. (2002): co-occurrences when the unit is an abstract, a sentence, or a phrase NLP Approach – Ng & Wong (1999): templates with linguistic structures to recognize interactions – Others: extended Ng & Wong’s work – All based on grammars
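
A minimal sketch of the co-occurrence idea, showing how the choice of text unit (abstract versus sentence) changes which pairs are counted. The function name `cooccurrence_counts`, the placeholder gene names, and the naive substring matching are all assumptions made for illustration; this is not the method of Stapley & Benoit or Ding et al.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(units, names):
    """Count how often each pair of names appears together in one text unit."""
    counts = Counter()
    for unit in units:
        # Naive substring match (assumption); a real system needs name recognition.
        present = sorted(n for n in names if n in unit)
        counts.update(combinations(present, 2))
    return counts

names = ["geneA", "geneB", "geneC"]  # placeholder names, not real genes
abstract = "geneA binds geneB. geneC is expressed in the same tissue."

# Unit = whole abstract: every pair of names in the abstract co-occurs.
by_abstract = cooccurrence_counts([abstract], names)
# Unit = sentence: only names sharing a sentence co-occur.
by_sentence = cooccurrence_counts(abstract.split(". "), names)

print(dict(by_abstract))
print(dict(by_sentence))
```

With the abstract as the unit, all three pairs are counted once; with the sentence as the unit, only the geneA/geneB pair is counted.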

5 NLP in Biological Applications To capture specific relations in databases – To learn ontological relations – To extract biological pathways To improve retrieval and clustering in searching large collections – Homology search using sequence similarity – Clustering MEDLINE abstracts For classification

6 Problem I Reported results: – Yakushiji et al. (2001): 60 – 80%; MEDLINE abstracts; extracted argument structures – Friedman et al. (2001): 96% precision/specificity, 63% recall/sensitivity; 8000-word article from Cell; extracted a broad set of biological relations – Pustejovsky & Castaño (2002): 90% precision/specificity, 57% recall/sensitivity; MEDLINE; extracted the “inhibit” relations How to compare different approaches?

7 Problem II How well does a system have to perform to be useful? – What does 90% specificity at 57% sensitivity mean to the user? – Need user-centered evaluations.

8 Challenge Evaluation [Diagram: Task Definition; Building System; Identification of Challenge Problem; Training Data; Evaluation; Evaluation Methodology; Test Data; Participants; Evaluator; Funding]

9 Sample Challenge Problem I: Extraction of Biological Pathways What are biological pathways? A network of interactions and events between proteins, drugs, and other molecules. E.g. the Glycolytic Pathway

10 Challenge Problem Three layers of challenges: To recognize names of proteins, drugs, and other molecules To recognize basic interaction events between molecules To recognize the relationships between the basic interaction events

11 Task Definition db = {(t_1, F_1), (t_2, F_2), …, (t_m, F_m)}: set of records t_i: texts (sentences, abstracts, or whole articles) F_i = {f_{i,1}, f_{i,2}, …, f_{i,n_i}}: set of expected facts (short sentences in highly standardized forms, e.g. “P_1 activate P_2”)
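
One way the record structure above could be represented in code, as a rough sketch. The class name `Record`, its field names, and the two example sentences are hypothetical; only the shape (a text paired with a set of standardized facts) comes from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    text: str         # t_i: a sentence, abstract, or whole article
    facts: frozenset  # F_i: expected facts in standardized form

# db: the set of records (the example texts are invented for illustration)
db = [
    Record("P1 was shown to activate P2.",
           frozenset({"P1 activate P2"})),
    Record("P2 inhibits P3, and P3 activates P4.",
           frozenset({"P2 inhibit P3", "P3 activate P4"})),
]

m = len(db)                         # number of records
n = [len(r.facts) for r in db]      # n_i: number of expected facts per record
print(m, n)
```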

12 Evaluation Methodology recall(E) = TP(E)/[TP(E) + FN(E)] precision(E) = TP(E)/[TP(E) + FP(E)] E: information extractor TP: true positive FN: false negative FP: false positive
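
The two formulas can be computed directly when expected and extracted facts are treated as sets. This is a small sketch under that assumption; the function names and the example facts are invented for illustration.

```python
def confusion_counts(expected, extracted):
    """Return (TP, FN, FP) for one extractor output against the expected facts."""
    tp = len(expected & extracted)   # facts correctly extracted
    fn = len(expected - extracted)   # expected facts that were missed
    fp = len(extracted - expected)   # extracted facts that were not expected
    return tp, fn, fp

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

expected = {"P1 activate P2", "P2 inhibit P3", "P3 phosphorylate P4"}
extracted = {"P1 activate P2", "P2 inhibit P3", "P1 inhibit P4", "P4 activate P1"}

tp, fn, fp = confusion_counts(expected, extracted)
print(tp, fn, fp)
print(recall(tp, fn), precision(tp, fp))
```

Here two of the three expected facts are found (recall 2/3) and two of the four extracted facts are correct (precision 1/2).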

13 Evaluation Methodology At the record level At the database level Question: which one is the more effective measure?
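
The slide does not define the two levels, so the sketch below rests on an assumed reading: record level means scoring each record separately and averaging, while database level means pooling the counts over all records before scoring. The function names and the facts `f1` to `f5` are hypothetical.

```python
def recall(expected, extracted):
    return len(expected & extracted) / len(expected) if expected else 0.0

def record_level_recall(records):
    """Assumed reading: average the per-record recall values."""
    return sum(recall(exp, ext) for exp, ext in records) / len(records)

def database_level_recall(records):
    """Assumed reading: pool true positives and expected facts over all records."""
    tp = sum(len(exp & ext) for exp, ext in records)
    total = sum(len(exp) for exp, _ in records)
    return tp / total if total else 0.0

# Each record is a pair (expected facts, extracted facts).
records = [
    ({"f1", "f2", "f3", "f4"}, {"f1", "f2", "f3", "f4"}),  # all four facts found
    ({"f5"}, set()),                                       # the single fact missed
]

print(record_level_recall(records))
print(database_level_recall(records))
```

Under these assumptions the two levels disagree when records differ in size: the record-level average is 0.5, while the pooled database-level value is 0.8.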

14 Test Data Appendix of Kohn (1999) – 200 statements of interaction events – Sentences of a fairly complex form MEDLINE abstracts on “Topoisomerase inhibitors” – 150 – 200 new abstracts each year – Fewer than 1000 names and fewer than 200 interaction events each year

15 Sample Challenge Problem II: Automated Database Curation and Ontology Development Importance: – proteins are referred to by names The nomenclature problem for proteins: – A newly discovered protein may be named based on its functions, sequence features, gene name, cellular location, molecular weight, etc. NLP technologies in information extraction, classification, and ontology induction can be applied here

16 An Example 3 fields from the entry for Appl+P130kD in FlyBase: (1) Protein size (kD): 130 (Luo et al., 1990) (2) Cell location: axon (Luo et al., 1990) (3) Expression pattern (Luo et al., 1990), by Stage and Tissue/Position: Embryo, Embryonic Central Nervous System; Embryo, Peripheral Nervous System The abstract of Luo et al. (1990): (1) APPL … is converted to a 130-kDa secreted form … (2) APPL … was observed in … axonal tracts, … (3) In the embryo, APPL proteins are expressed exclusively in the CNS and PNS neurons …

17 Knowledge Discovery and Data Mining Challenge Cup 2002 Participants are given – A collection of journal articles – Each labeled with genes mentioned in the article Participants are required to answer – Does the article contain any experimental results about gene expression that should be put in the database? – If so, for each gene in the article, is there experimental evidence for any transcripts (RNA), protein, or polypeptide products of that gene?

18 Protein Knowledge Base

19 Evaluation of Ontologies Challenging: – no established metric for measuring knowledge in terms of content or value Two levels: – Intrinsic: compare terms and ontological relations discovered by the system against those found by humans – Extrinsic: evaluate ontology’s usefulness in manual query expansion

20 Summary Contributions of this paper: Summarized the work done so far in the field of literature data mining for biology Identified the important ingredients for a successful evaluation Gave concrete evaluation examples

21 End of the Talk

22 Identifying Protein Names from Biological Papers (Fukuda et al.) Capital letters, numerical figures, and special symbols (core-terms) – Src homology (SH) 2 and SH3 domains – p54 SAP kinase Key-words (feature-terms) – EGF receptor – Ras GTPase-activating protein (GAP) IE system: – Core-term extraction from tokenized texts – Concatenation of core-terms and feature-terms
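
The two IE steps can be illustrated with a rough sketch: label tokens as core-terms or feature-terms, then concatenate adjacent labelled tokens into candidate names. The rules, the `FEATURE_TERMS` list, and the function names below are simplifying assumptions for illustration, not the actual system of Fukuda et al.

```python
FEATURE_TERMS = {"receptor", "kinase", "protein", "domain", "domains"}  # assumed list

def classify(word, sentence_initial):
    """Label a token 'core', 'feature', or None using simple surface clues."""
    if any(c.isdigit() for c in word):
        return "core"
    # Skip the uppercase clue for the first word, which is capitalized anyway.
    if not sentence_initial and any(c.isupper() for c in word):
        return "core"
    if word.lower() in FEATURE_TERMS:
        return "feature"
    return None

def candidate_names(sentence):
    """Concatenate runs of adjacent labelled tokens that contain a core-term."""
    names, run, has_core = [], [], False
    for i, token in enumerate(sentence.split()):
        word = token.strip(".,;:()")
        label = classify(word, i == 0) if word else None
        if label:
            run.append(word)
            has_core = has_core or label == "core"
        else:
            if run and has_core:
                names.append(" ".join(run))
            run, has_core = [], False
    if run and has_core:
        names.append(" ".join(run))
    return names

print(candidate_names("We found that p54 SAP kinase binds the EGF receptor."))
```

On this invented sentence the sketch yields the candidates "p54 SAP kinase" and "EGF receptor". It would split a name such as "Src homology 2 domains", which is one reason real systems need richer rules.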

23 Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts (Ng & Wong) Key function words: – Inhibitor: {inhibit, suppress, negatively regulate} – Activator: {activate, transactivate, induce, upregulate, positively regulate} Pattern matching rules: – … … – … of … – … by …
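
A hedged sketch of function-word pattern matching. The rules on the slide are elided, so the single pattern used here ("A, function word, B", with A and B as single tokens) is a hypothetical stand-in, and `extract_relations` is an invented name. The activator list uses "upregulate".

```python
import re

FUNCTION_WORDS = {
    "inhibitor": ["inhibit", "suppress", "negatively regulate"],
    "activator": ["activate", "transactivate", "induce", "upregulate",
                  "positively regulate"],
}

def extract_relations(sentence):
    """Return (A, role, B) triples for the hypothetical pattern 'A <function word> B'."""
    relations = []
    for role, verbs in FUNCTION_WORDS.items():
        for verb in verbs:
            # Allow the base form or a third-person ending ("inhibits", "suppresses").
            pattern = rf"(\w+) {re.escape(verb)}(?:es|s)? (\w+)"
            for a, b in re.findall(pattern, sentence):
                relations.append((a, role, b))
    return relations

print(extract_relations("P1 inhibits P2"))
print(extract_relations("P3 positively regulates P4"))
```

Passive and nominal constructions (the "… of …" and "… by …" rules) are not covered by this sketch.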

24 Evaluation Methodology Simple Matching Coefficient (SMC) – SMC(E) = TP(E)/[TP(E) + FN(E) + FP(E)] Satisfies two conditions: – To distinguish the ideal information extractor from the worst one – To show a gradual monotonic change in value when the information extractor is changed from the worst to the best
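
The SMC formula and its first condition can be checked with a few lines. This is a sketch that treats expected and extracted facts as sets; the function name `smc`, the example facts, and the choice to return 0.0 when both sets are empty are assumptions.

```python
def smc(expected, extracted):
    """SMC(E) = TP / (TP + FN + FP), with facts compared as sets."""
    tp = len(expected & extracted)
    fn = len(expected - extracted)
    fp = len(extracted - expected)
    total = tp + fn + fp
    return tp / total if total else 0.0  # 0.0 for two empty sets is an assumption

expected = {"P1 activate P2", "P2 inhibit P3", "P3 phosphorylate P4"}

ideal = smc(expected, set(expected))                 # everything right, nothing extra
worst = smc(expected, {"P9 activate P8"})            # nothing right
partial = smc(expected, {"P1 activate P2", "P2 inhibit P3",
                         "P1 inhibit P4", "P4 activate P1"})
print(ideal, partial, worst)
```

The ideal extractor scores 1.0, the worst scores 0.0, and the partial one (TP = 2, FN = 1, FP = 2) scores 0.4.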

25 Three Tasks To recognize names: obvious To recognize interaction events: grammar PosEvent ::= P phosphorylate P [on T] [at L] | P dephosphorylate P [on T] [at L] … Event ::= PosEvent [mediated-by P+] [independent-of P+] … To recognize relationships: grammar Relationship ::= Event [is-caused-by Event+] [provided Event+] …
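
As an illustration only, the first two PosEvent alternatives can be recognized with a regular expression. The grammar on the slide is elided, so this sketch covers just those two productions; it treats P, T, and L as opaque whitespace-free tokens, which is an assumption, and `parse_pos_event` is a hypothetical name.

```python
import re

# PosEvent ::= P phosphorylate P [on T] [at L] | P dephosphorylate P [on T] [at L]
POS_EVENT = re.compile(
    r"^(?P<agent>\S+) (?P<verb>phosphorylate|dephosphorylate) (?P<target>\S+)"
    r"(?: on (?P<on>\S+))?"   # optional [on T]
    r"(?: at (?P<at>\S+))?$"  # optional [at L]
)

def parse_pos_event(statement):
    """Return the parts of a standardized PosEvent statement, or None if no match."""
    match = POS_EVENT.match(statement)
    return match.groupdict() if match else None

print(parse_pos_event("P1 phosphorylate P2 on T1 at L1"))
print(parse_pos_event("P1 dephosphorylate P2"))
print(parse_pos_event("P1 binds P2"))
```

Optional parts that are absent come back as None, and statements outside the two productions return None.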

