Semantic Relation Detection in Bioscience Text Marti Hearst SIMS, UC Berkeley Supported by NSF DBI-0317510 and a gift from Genentech.

Presentation transcript:

Semantic Relation Detection in Bioscience Text Marti Hearst SIMS, UC Berkeley Supported by NSF DBI and a gift from Genentech

BioText Project Goals: Provide flexible, intelligent access to information for use in biosciences applications. Focus on textual information from journal articles, tightly integrated with other resources: ontologies and record-based databases.

Project Team
Project Leaders: PI: Marti Hearst; Co-PI: Adam Arkin
Computational Linguistics: Barbara Rosario, Preslav Nakov
Database Research: Ariel Schwartz, Gaurav Bhalotia (graduated)
User Interface / IR: Adam Newberger, Dr. Emilia Stoica
Bioscience: Dr. TingTing Zhang, Janice Hamerja
Supported primarily by NSF DBI and a gift from Genentech

BioText Architecture Sophisticated Text Analysis Annotations in Database Improved Search Interface

The Nature of Bioscience Text Claim: Bioscience semantics are simultaneously easier and harder than general text. Easier: fewer subtleties, fewer ambiguities, “systematic” meanings. Harder: enormous terminology, complex sentence structure.

Sample Sentence “Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1-p53 complex formation [70].”

BioScience Researchers Read A LOT! Cite A LOT! Curate A LOT! Are interested in specific relations, e.g.: What is the role of this protein in that pathway? Show me articles in which a comparison between two values is significant.

This Talk Discovering semantic relations Between nouns in noun compounds Between entities in sentences Acquiring labeled data: Idea: use text surrounding citations to documents to identify paraphrases A new direction; preliminary work only

Noun Compound Relation Recognition

Noun Compounds (NCs) Technical text is rich with NCs, e.g.: “Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.” An NC is any sequence of nouns that itself functions as a noun: asthma hospitalizations, health care personnel, hand wash.

NCs: 3 computational tasks. Identification. Syntactic analysis (attachments): [Baseline [headache frequency]], [[Tension headache] patient]. Our goal: semantic analysis: Headache treatment → treatment for headache; Corticosteroid treatment → treatment that uses corticosteroid.

Descent of Hierarchy Idea: Use the top levels of a lexical hierarchy to identify semantic relations Hypothesis: A particular semantic relation holds between all 2-word NCs that can be categorized by a lexical category pair.

Related work ( Semantic analysis of NCs ) Rule-based Finin (1980) Detailed AI analysis, hand-coded Vanderwende (1994) automatically extracts semantic information from an on-line dictionary, manipulates a set of handwritten rules. 13 classes, 52% accuracy Probabilistic Lauer (1995): probabilistic model, 8 classes, 47% accuracy Lapata (2000) classifies nominalizations into subject/object. 2 classes, 80% accuracy

Related work ( Semantic analysis of NCs ) Lexical Hierarchy Barrett et al. (2001) WordNet, heuristics to classify a NC given the similarity to a known NC Rosario and Hearst (2001) Relations pre-defined MeSH, Neural Network. 18 classes, 60% accuracy

Linguistic Motivation Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure. (used-in): kitchen knife (made-of): steel knife (instrument-for): carving knife (used-on): putty knife (used-by): butcher’s knife

The lexical Hierarchy: MeSH 1. Anatomy [A] 2. Organisms [B] 3. Diseases [C] 4. Chemicals and Drugs [D] 5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E] 6. Psychiatry and Psychology [F] 7. Biological Sciences [G] 8. Physical Sciences [H] 9. Anthropology, Education, Sociology and Social Phenomena [I] 10. Technology and Food and Beverages [J] 11. Humanities [K] 12. Information Science [L] 13. Persons [M] 14. Health Care [N] 15. Geographic Locations [Z]

Descending the Hierarchy
1. Anatomy [A]
   Body Regions [A01]
      Abdomen [A01.047]
      Back [A01.176]
      Breast [A01.236]
      Extremities [A01.378]
      Head [A01.456]
      Neck [A01.598]
      ….
   Musculoskeletal System [A02]
   Digestive System [A03]
   Respiratory System [A04]
   Urogenital System [A05]
   ……
2. [B]  3. [C]  4. [D]  5. [E]  6. [F]  7. [G]
8. Physical Sciences [H]
   Electronics
      Amplifiers
      Electronics, Medical
      Transducers
   Astronomy
   Nature
   Time
   Weights and Measures
      Calibration
      Metric System
      Reference Standard
   ….
9. [I]  10. [J]  11. [K]  12. [L]  13. [M]
Slide labels: Homogeneous, Heterogeneous

Mapping Nouns to MeSH Concepts headache recurrence C C headache pain C G

Levels of Description headache pain Level 0: C.23 G.11 Level 1: C G Level 1: C G … Original: C G
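The idea of describing a noun at different levels can be sketched as truncating its MeSH tree number. This is only an illustration: it assumes that level 0 keeps the letter plus the first number group (as in the category pairs such as A02 C04 on later slides) and that each further level keeps one more dot-separated group. The function names `truncate` and `category_pair` are hypothetical, and the pairing of Head with Abdomen is for demonstration only, not an NC from the talk.

```python
def truncate(tree_number, level):
    """Keep the first (level + 1) dot-separated groups of a MeSH tree number.

    Level 0 keeps only the top group (e.g. "A01"); each further level keeps
    one more group. A number shallower than the requested level is returned whole.
    """
    parts = tree_number.split(".")
    return ".".join(parts[: level + 1])


def category_pair(modifier_code, head_code, modifier_level=0, head_level=0):
    """Describe a two-word NC by the truncated codes of its two nouns."""
    return (truncate(modifier_code, modifier_level),
            truncate(head_code, head_level))


# Tree numbers taken from the "Descending the Hierarchy" slide.
head = "A01.456"     # Head
abdomen = "A01.047"  # Abdomen

print(truncate(head, 0))                    # A01
print(category_pair(head, abdomen, 0, 1))   # ('A01', 'A01.047')
```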

Descent of Hierarchy Idea: Words falling in homogeneous MeSH subhierarchies behave “similarly” with respect to relation assignment. Hypothesis: A particular semantic relation holds between all 2-word NCs that can be categorized by a MeSH category pair.

Grouping the NCs CP: A02 C04 (Musculoskeletal System, Neoplasms) skull tumors, bone cysts, bone metastases, skull osteosarcoma… CP: C04 M01 (Neoplasms, Person) leukemia survivor, lymphoma patients, cancer physician, cancer nurses…

Distribution of Category Pairs

Collection ~70,000 NCs extracted from titles and abstracts of Medline 2,627 CPs at level 0 (with at least 10 unique NCs) We analyzed 250 CPs with Anatomy (A) 21 CPs with Natural Science (H01) 3 CPs with Neoplasm (C04) This represents 10% of total CPs and 20% of total NCs

Classification Method For each CP: Divide its NCs into “training” and “testing” sets. “Training”: inspect NCs by hand. Start from level 0 0. While NCs are not all similar, descend one level of the hierarchy. Repeat until all NCs for that CP are similar.
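The descent loop above can be sketched as follows. This is a simplified illustration, not the procedure as actually carried out: the similarity check was done by hand in the talk, so here it is passed in as a callable (`all_similar`), and the sketch descends only on the second noun. The tree numbers in the usage note are hypothetical placeholders, since the real ones are not legible in this transcript.

```python
from collections import defaultdict


def truncate(code, level):
    """Keep the first (level + 1) dot-separated groups of a tree number."""
    return ".".join(code.split(".")[: level + 1])


def descend(ncs, all_similar, max_level=3):
    """Split the NCs of one category pair until each group is homogeneous.

    ncs: list of (compound, modifier_code, head_code) tuples.
    all_similar: callable taking a list of compounds; stands in for the
        manual inspection step.
    Returns {(modifier_prefix, head_prefix): [compounds]}.
    """
    decisions = {}
    pending = [(0, ncs)]
    while pending:
        level, group = pending.pop()
        buckets = defaultdict(list)
        for nc in group:
            # Assumption: only the head (second) noun is descended.
            key = (truncate(nc[1], 0), truncate(nc[2], level))
            buckets[key].append(nc)
        for key, members in buckets.items():
            names = [m[0] for m in members]
            if all_similar(names) or level >= max_level:
                decisions[key] = names
            else:
                pending.append((level + 1, members))
    return decisions
```

For example, with made-up codes `M01.1` and `M01.2` for two kinds of person, and a lookup table standing in for the human judgment, the C04 M01 group splits into one group for "leukemia survivor" and "lymphoma patients" and another for "cancer nurses".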

Classification Decisions A02 C04 B06 B06 C04 M01 C04 M C04 M A01 H01 A01 H A01 H A01 H A01 H A01 M01 A01 M A01 M A01 M01.898

Classification Decisions + Relations A02 C04 → Location of Disease B06 B06 → Kind of Plants C04 M01 C04 M → Person afflicted by Disease C04 M → Person who treats Disease A01 H01 A01 H A01 H A01 H A01 H A01 M01 A01 M → Person afflicted by Disease A01 M A01 M01.898

Classification Decision Levels Anatomy: 250 CPs 187 (75%) remain first level 56 (22%) descend one level 7 (3%) descend two levels Natural Science (H01): 21 CPs 1 ( 4%) remain first level 8 (39%) descend one level 12 (57%) descend two levels Neoplasms (C04) 3 CPs: 3 (100%) descend one level

Evaluation Test the decisions on the “testing” set: count how many NCs that fall in the groups defined in the classification decisions are similar to each other. Accuracy (for 2nd noun): Anatomy: 91%; Natural Science: 79%; Neoplasm: 100%. Total accuracy: 90.8%. Generalization: our 415 classification decisions cover ~46,000 possible CP pairs.

Ambiguity – Two Types Lexical ambiguity: mortality state of being mortal death rate Relationship ambiguity: bacteria mortality death of bacteria death caused by bacteria

Four Cases
Single MeSH sense, only one possible relationship: abdomen radiography, aciclovir treatment
Single MeSH sense, multiple relationships (ambiguity of relationship): hospital databases, education efforts, kidney metabolism
Multiple MeSH senses, only one possible relationship: alcoholism treatment
Multiple MeSH senses, multiple relationships: bacteria mortality (most problematic cases … but rare!)

Conclusions on NN Relation Classification Very simple method for assigning semantic relations to two-word technical NCs 90.8% accuracy Lexical resource (MeSH) useful for this task Probably works because of the relative lack of ambiguity in this kind of technical text.

Entity-Entity Relation Recognition

Problem: Which relations hold between 2 entities? Between Treatment and Disease: Cure? Prevent? Side Effect?

Hepatitis Examples Cure: These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135. Prevent: A two-dose combined hepatitis A and B vaccine would facilitate immunization programs. Vague: Effect of interferon on hepatitis B.

Two tasks Relationship Extraction: Identify the several semantic relations that can occur between the entities disease and treatment in bioscience text Entity extraction: Related problem: identify such entities

The Approach Data: MEDLINE abstracts and titles Graphical models Combine in one framework both relation and entity extraction Both static and dynamic models Simple discriminative approach: Neural network Lexical, syntactic and semantic features

Related Work We allow several DIFFERENT relations between the same entities Thus differs from the problem statement of other work on relations Many find one relation which holds between two entities (many based on ACE) Agichtein and Gravano (2000), lexical patterns for location of Zelenko et al. (2002) SVM for person affiliation and organization-location Hasegawa et al. (ACL 2004) Person-Organization -> President “relation” Craven (1999, 2001) HMM for subcellular-location and disorder-association Doesn’t identify the actual relation

Related work: Bioscience Many hand-built rules Feldman et al. (2002), Friedman et al. (2001) Pustejovsky et al. (2002) Saric et al.; this conference

Data and Relations MEDLINE, abstracts and titles 3662 sentences labeled Relevant: 1724 Irrelevant: 1771 e.g., “Patients were followed up for 6 months” 2 types of Entities, many instances treatment and disease 7 Relationships between these entities

Semantic Relationships 810: Cure Intravenous immune globulin for recurrent spontaneous abortion 616: Only Disease Social ties and susceptibility to the common cold 166: Only Treatment Fluticasone propionate is safe in recommended doses 63: Prevent Statins for prevention of stroke

Semantic Relationships 36: Vague Phenylbutazone and leukemia 29: Side Effect Malignant mesodermal mixed tumor of the uterus following irradiation 4: Does NOT cure Evidence for double resistance to permethrin and malathion in head lice

Features Word Part of speech Phrase constituent Orthographic features ‘is number’, ‘all letters are capitalized’, ‘first letter is capitalized’ … MeSH (semantic features) Replace words, or sequences of words, with generalizations via MeSH categories Peritoneum -> Abdomen
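A minimal sketch of the orthographic features and the MeSH generalization step listed above. The function names and the tiny lookup table are hypothetical; only the Peritoneum → Abdomen example and the three feature descriptions come from the slide, and a real system would consult the MeSH hierarchy rather than a hand-written dictionary.

```python
def orthographic_features(token):
    """Binary surface features of a single token."""
    return {
        # Digits with at most one decimal point count as a number.
        "is_number": token.replace(".", "", 1).isdigit(),
        # True when the token has letters and none of them is lower case.
        "all_caps": token.isupper(),
        "init_cap": token[:1].isupper(),
    }


# Toy stand-in for a MeSH lookup: maps a term to a more general category.
MESH_GENERALIZATION = {"peritoneum": "abdomen"}


def generalize(token):
    """Replace a word with its more general MeSH category when one is known."""
    return MESH_GENERALIZATION.get(token.lower(), token)


print(orthographic_features("NGF"))
print(generalize("Peritoneum"))  # abdomen
```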

Models 2 static generative models 3 dynamic generative models 1 discriminative model (neural network)

Static Graphical Models S1: observations dependent on Role but independent from Relation given roles. S2: observations dependent on both Relation and Role. [Figure: models S1 and S2]

Dynamic Graphical Models D1, D2 as in S1, S2 D3: only one observation per state is dependent on both the relation and the role D1 D2 D3

Graphical Models Relation node: Semantic relation (cure, prevent, none..) expressed in the sentence

Graphical Models Role nodes: 3 choices: treatment, disease, or none

Graphical Models Feature nodes (observed): word, POS, MeSH…

Graphical Models Different dependencies between the features and the relation nodes D3 D1 S1 D2 S2

Graphical Models For Dynamic Model D1: Joint probability distribution over relation, roles and features nodes Parameters estimated with maximum likelihood and absolute discounting smoothing

Neural Network Feed-forward network (MATLAB) Training with conjugate gradient descent One hidden layer (hyperbolic tangent function) Logistic sigmoid function for the output layer representing the relationships Same features Discriminative approach
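The network shape described above (one tanh hidden layer, logistic sigmoid outputs, one output per relationship) can be sketched as a forward pass in plain Python. This shows only the forward computation; the conjugate gradient training mentioned on the slide is not shown, and the weight layout (one row of weights per unit) and all names are assumptions made for the illustration.

```python
import math


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer feed-forward network.

    w_hidden: one weight row per hidden unit; w_out: one row per output unit.
    Returns one sigmoid activation per output (relationship) unit.
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
        for row, b in zip(w_out, b_out)
    ]


# Two input features, one hidden unit, two output units; arbitrary weights.
scores = forward([1.0, 0.0], [[0.4, -0.2]], [0.1], [[1.5], [-1.5]], [0.0, 0.0])
predicted = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted)
```

Taking the index of the largest output as the predicted relationship is one common choice; the slide does not say how the outputs were decoded.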

Role extraction Results in terms of F-measure Graphical models Junction tree algorithm (BNT) Relation hidden and marginalized over Neural Net Couldn’t run it (features vectors too large) (Graphical models can do role extraction and relationship classification simultaneously)
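For reference, a token-level F-measure for role labels could be computed as below. This is a sketch under assumptions: it uses the balanced F-measure (harmonic mean of precision and recall) and scores each token independently, neither of which is confirmed by the slides; the function name and the `none` label string are illustrative.

```python
def f_measure(gold, predicted, none_label="none"):
    """Balanced F-measure over tokens labeled with a role (not `none_label`)."""
    true_positives = sum(
        1 for g, p in zip(gold, predicted) if p != none_label and g == p
    )
    predicted_roles = sum(1 for p in predicted if p != none_label)
    gold_roles = sum(1 for g in gold if g != none_label)
    if true_positives == 0:
        return 0.0
    precision = true_positives / predicted_roles
    recall = true_positives / gold_roles
    return 2 * precision * recall / (precision + recall)


gold = ["treatment", "none", "disease", "disease"]
predicted = ["treatment", "disease", "disease", "none"]
print(round(f_measure(gold, predicted), 3))
```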

Role Extraction: Results F-measures D1 best when no smoothing

Role Extraction: Results F-measures D2 best with smoothing, but doesn’t boost scores as much as in relation classification

Role Extraction: Results Static models better than Dynamic for Note: No Neural Networks

Relation classification: Results With Smoothing and Roles, D1 best GM

Features impact: Role Extraction (rel. + irrel.)
Most important features: 1) Word, 2) MeSH
Models          D1      D2
All features
No word         %       -14.1%
No MeSH         %       -8.4%

Features impact: Relation classification (rel. + irrel.)
Most important features: Roles
Accuracy:                     D1      D2      NN
All feat. + roles
All feat. – roles             %       -8.7%   -17.8%
All feat. + roles – Word      %       -2.8%   -0.5%
All feat. + roles – MeSH      %       3.1%    0.4%

Relation extraction Results in terms of classification accuracy (with and without irrelevant sentences) 2 cases: Roles hidden Roles given Graphical models NN: simple classification problem

Relation classification: Results Neural Net always best

Relation classification: Results With Smoothing and No Roles, D2 best GM

Relation classification: Results Dynamic models always outperform Static

Relation classification: Results With no smoothing, D1 best Graphical Model

Relation classification: Confusion Matrix Computed for the model D2, “rel + irrel.”, “only features”

Features impact: Relation classification (rel. + irrel.)
Most realistic case: Roles not known
Most important features: 1) MeSH 2) Word for D1 and NN (but vice versa for D2)
Accuracy:                     D1      D2      NN
All feat. – roles
All feat. – roles – Word      %       -11.8%  -4.3%
All feat. – roles – MeSH      %       -3.2%   -6.9%

Relation Recognition: Conclusions Classification of subtle semantic relations in bioscience text Discriminative model (neural network) achieves high classification accuracy Graphical models for the simultaneous extraction of entities and relationships Importance of lexical hierarchy Next Step: Different entities/relations Semi-supervised learning to discover relation types

Acquiring Labeled Data using Citances

A discovery is made … A paper is written …

That paper is cited … and cited … … as the evidence for some fact(s) F.

Each of these in turn is cited for some fact(s) … … until it is the case that all important facts in the field can be found in citation sentences alone!

Citances Nearly every statement in a bioscience journal article is backed up with a cite. It is quite common for papers to be cited times. The text around the citation tends to state biological facts. (Call these citances.) Different citances will state the same facts in different ways … … so can we use these for creating models of language expressing semantic relations?

Using Citances Potential uses of citation sentences (citances) creation of training and testing data for semantic analysis, synonym set creation, database curation, document summarization, and information retrieval generally. Some preliminary results: Citances to a document align well with a hand-built curation. Citances are good candidates for paraphrase creation.

Citances for Acquiring Examples of Semantic Relations A relationship type R between entities of type A and B can be expressed in many ways. Use citances to build a model of the different ways to express the relationship: Seed learning algorithms with examples that mention A and B, for which relation R holds. Train a model to recognize R when the relation is not known. Results may extend to sentences that are not citances as well.

Issues for Processing Citances Text span Identification of the appropriate phrase, clause, or sentence that constructs a citance. Correct mapping of citations when shown as lists or groups (e.g., “[22-25]”). Grouping citances by topic Citances that cite the same document should be grouped by the facts they state. Normalizing or paraphrasing citances For IR, summarization, learning synonyms, relation extraction, question answering, and machine translation.
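One of the issues above, mapping grouped citation markers such as “[22-25]” to individual references, can be illustrated with a small helper. This is a sketch that assumes numeric, bracketed markers with commas and hyphenated ranges; other citation styles (author-year, superscripts) would need different handling, and the function name is hypothetical.

```python
import re


def expand_citations(marker):
    """Expand a bracketed citation marker into the reference numbers it covers."""
    inner = marker.strip().strip("[]")
    numbers = []
    for part in inner.split(","):
        part = part.strip()
        # A range such as "22-25" (hyphen or en dash) covers both endpoints.
        match = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", part)
        if match:
            low, high = int(match.group(1)), int(match.group(2))
            numbers.extend(range(low, high + 1))
        elif part.isdigit():
            numbers.append(int(part))
    return numbers


print(expand_citations("[22-25]"))  # [22, 23, 24, 25]
print(expand_citations("[67,68]"))  # [67, 68]
```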

Related Work Traditional citation analysis dates back to the 1960’s (Garfield). Includes: Citation categorization, Context analysis, Citer motivation. Citation indexing systems, such as ISI’s SCI, and CiteSeer. Mercer and Di Marco (2004) propose to improve citation indexing using citation types. Bradshaw (2003) introduces Reference Directed Indexing (RDI), which indexes documents using the terms in the citances citing them.

Related Work (cont.) Teufel and Moens (2002) identify citances to improve summarization of the citing paper. Nanba et al. (2000) use citances as features for classifying papers into topics. A field related to citation indexing is the use of link structure and anchor text of Web pages. Applications include: IR, classification, Web crawlers, and summarization.

Example: protein-protein

Early results: Paraphrase Creation from Citances

Sample Sentences NGF withdrawal from sympathetic neurons induces Bim, which then contributes to death. Nerve growth factor withdrawal induces the expression of Bim and mediates Bax dependent cytochrome c release and apoptosis. The proapoptotic Bcl-2 family member Bim is strongly induced in sympathetic neurons in response to NGF withdrawal. In neurons, the BH3 only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by nerve growth factor deprivation.

Their Paraphrases NGF withdrawal induces Bim. Nerve growth factor withdrawal induces the expression of Bim. Bim has been shown to be upregulated following nerve growth factor withdrawal. Bim implicated in apoptosis caused by nerve growth factor deprivation. They all paraphrase: Bim is induced after NGF withdrawal.

Paraphrase Creation Algorithm
1. Extract the sentences that cite the target.
2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize.
3. Dependency parse (MiniPar).
4. For each parse, for each pair of NEs of interest:
   i. Extract the path between them.
   ii. Create a paraphrase from the path.
5. Rank the candidates for a given pair of NEs.
6. Select only the ones above a threshold.
7. Generalize.
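Step 4.i above, extracting the path between two named entities in a dependency parse, can be sketched on a toy tree. The parse here is written by hand as a token-index-to-head map; it is not MiniPar output, and the function names are hypothetical. The path runs from one entity up to the lowest common ancestor and down to the other.

```python
def path_to_root(node, head):
    """List the nodes from `node` up to the root (whose head is None)."""
    path = [node]
    while head[node] is not None:
        node = head[node]
        path.append(node)
    return path


def dependency_path(a, b, head):
    """Return the token indices on the tree path from a to b, inclusive."""
    up_a = path_to_root(a, head)
    up_b = path_to_root(b, head)
    ancestors_of_a = set(up_a)
    # Lowest common ancestor: first node above b that is also above a.
    lca = next(n for n in up_b if n in ancestors_of_a)
    return up_a[: up_a.index(lca) + 1] + list(reversed(up_b[: up_b.index(lca)]))


tokens = ["NGF", "withdrawal", "induces", "Bim"]
# Hand-written parse: NGF -> withdrawal -> induces (root) <- Bim
head = {0: 1, 1: 2, 2: None, 3: 2}

path = dependency_path(0, 3, head)
print([tokens[i] for i in path])  # ['NGF', 'withdrawal', 'induces', 'Bim']
```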

Creating a Paraphrase Given the path from the dependency parse: Restore the original word order. Add words to improve grammaticality. Bim … shown … be … following nerve growth factor withdrawal. Bim [has] [been] shown [to] be [upregulated] following nerve growth factor withdrawal.

2-word Heuristic Demonstration NGF withdrawal induces Bim. Nerve growth factor withdrawal induces [the] expression of Bim. Bim [has] [been] shown [to] be [upregulated] following nerve growth factor withdrawal. Bim [is] induced in [sympathetic] neurons in response to NGF withdrawal. member Bim implicated in apoptosis caused by nerve growth factor deprivation.
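The two steps named above (restore the original word order, then add words for grammaticality) can be sketched as follows. The gap-filling rule is an assumption: it reads the “2-word heuristic” as “fill any gap of at most two words between consecutive path words”, which reproduces the bracketed examples on the slide but is not confirmed by the transcript. The path indices are also illustrative, and the function names are hypothetical.

```python
def restore_order(path_indices):
    """Put the tokens on a dependency path back into sentence order."""
    return sorted(set(path_indices))


def fill_small_gaps(ordered, tokens, max_gap=2):
    """Insert skipped words (in brackets) when the gap is at most max_gap."""
    out = []
    for prev, cur in zip([None] + ordered[:-1], ordered):
        if prev is not None:
            gap = cur - prev - 1
            if 0 < gap <= max_gap:
                out.extend("[" + tokens[i] + "]" for i in range(prev + 1, cur))
            elif gap > max_gap:
                out.append("...")  # larger gaps are left elided
        out.append(tokens[cur])
    return " ".join(out)


tokens = ("Bim has been shown to be upregulated following "
          "nerve growth factor withdrawal").split()
# Indices in the order a tree traversal might return them (assumed).
path = [0, 3, 5, 7, 11, 8, 9, 10]

ordered = restore_order(path)
print(" ".join(tokens[i] for i in ordered))
print(fill_small_gaps(ordered, tokens))
```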

Evaluation (1) An influential journal paper from Neuron: J. Whitfield, S. Neame, L. Paquet, O. Bernard, and J. Ham. Dominant-negative c-jun promotes neuronal survival by reducing bim expression and inhibiting mitochondrial cytochrome c release. Neuron, 29:629–643, journal papers citing it 203 citances in total 36 different types of important biological factoids But we concentrated on one model sentence: “Bim is induced after NGF withdrawal.”

Evaluation (2) Set 1: 67 citances pointing to the target paper and manually found to contain a good or acceptable paraphrase (do not necessarily contain Bim or NGF); (Ideal conditions) Set 2: 65 citances pointing to the target paper and containing both Bim and NGF; Set 3: 102 sentences from the 99 texts, containing both Bim and NGF (Do citances do better than arbitrarily chosen sentences?)

Correctness (Judgments) Bad (0.0), if: different relation (often phosphorylation aspect); opposite meaning; vagueness (wording not clear enough). Acceptable (0.5), If it was not Bad and: contains additional terms (e.g., DP5 protein) or topics (e.g., PPs like in sympathetic neurons); the relation was suggested but not definitely. Else Good (1.0)

Results Obtained 55, 65 and 102 paraphrases for sets 1, 2 and 3 Only one paraphrase from each sentence comparison of the dependency path to that of the model sentence % - good (1.0) or acceptable (0.5)

Correctness (Recall) Calculated on Set 1 60 paraphrases (out of 67 citances) 5 citances produced 2 paraphrases system recall: 55/67, i.e % 10 of the 67 relevant in Set 1 initially missed by the human annotator 8 good, 2 acceptable. human recall is 57/67, i.e %
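The percentages are missing from this transcript, but the fractions are given, so the two recall figures can be recomputed directly. The helper name is illustrative.

```python
def recall(found, relevant):
    """Fraction of the relevant items that were found."""
    return found / relevant


print(f"system recall: {recall(55, 67):.2%}")  # 82.09%
print(f"human recall:  {recall(57, 67):.2%}")  # 85.07%
```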

Misses Sample system miss (no NGF): Growth factor withdrawal was shown to cause increased Bim expression in various populations of neuronal cell types. Sample human miss: The precise targets of c-Jun necessary for the induction of apoptosis have been the subject of intense interest and recently, Bim and Dp5, both “BH3-domain only” family members, have been identified as pro-apoptotic genes induced in a c-Jun-dependent manner in both sympathetic neurons subjected to NGF withdrawal and in cerebellar granule cells deprived of KCl.

Grammaticality Missing coordinating “and”: “Hrk/DP5 Bim [have] [been] found [to] be upregulated after NGF withdrawal” Verb subcategorization “caused by NGF role for Bim” Extra subject words member Bim implicated in apoptosis caused by NGF deprivation sentence: “In neurons, the BH3-only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by NGF deprivation.”

Related Work Word-level paraphrases: Grefenstette uses a semantic parser to compare the distributional similarity of local contexts for synonym extraction. Phrase-level paraphrases: Barzilay & McKeown use POS information from the local context and co-training. Template paraphrases: Lin & Pantel apply the idea of Grefenstette to dependency tree paths; later refined by Shinyama et al. Sentence-level paraphrases: Barzilay & Lee use multiple sequence alignment; Pang et al. merge parse trees into a transducer.

Relevant Papers Citances: Citation Sentences for Semantic Analysis of Bioscience Text, Preslav Nakov, Ariel Schwartz, and Marti Hearst, in the SIGIR'04 workshop on Search and Discovery in Bioinformatics. Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti Hearst, in ACL The Descent of Hierarchy, and Selection in Relational Semantics, Barbara Rosario, Marti Hearst, and Charles Fillmore, in ACL 2002.

Thank you! Marti Hearst SIMS, UC Berkeley

Additional slides

Our D1 Thompson et al Frame classification and role labeling for FrameNet sentences Target word must be observed More relations and roles

Smoothing: absolute discounting Lower the probability of seen events by subtracting a constant from their count (ML estimate: ) The remaining probability is evenly divided by the unseen events
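Absolute discounting as described above can be sketched for a single discrete distribution. This is an illustration under assumptions: the ML estimate is taken to be count / total, the discount `delta` is a constant between 0 and 1, and the freed mass is split evenly over the unseen events. The function name and the default `delta` are hypothetical; the talk's actual smoothing factors are not given in this transcript.

```python
def absolute_discounting(counts, vocabulary, delta=0.5):
    """Smooth a count table by subtracting `delta` from every seen count.

    counts: {event: observed count}; vocabulary: all possible events.
    Returns {event: probability}, summing to 1.
    """
    total = sum(counts.get(e, 0) for e in vocabulary)
    if total == 0:
        raise ValueError("no observed events to estimate from")
    seen = [e for e in vocabulary if counts.get(e, 0) > 0]
    unseen = [e for e in vocabulary if counts.get(e, 0) == 0]
    if not unseen:
        # Nothing to redistribute to: fall back to the ML estimate.
        return {e: counts[e] / total for e in seen}
    probs = {e: (counts[e] - delta) / total for e in seen}
    freed = delta * len(seen) / total  # mass removed from seen events
    for e in unseen:
        probs[e] = freed / len(unseen)
    return probs


print(absolute_discounting({"a": 3, "b": 1}, ["a", "b", "c", "d"]))
```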

F-measures for role extraction in function of smoothing factors

Relation accuracies in function of smoothing factors