Semantic Relation Detection in Bioscience Text Marti Hearst SIMS, UC Berkeley Supported by NSF DBI-0317510 and a gift from Genentech.


Semantic Relation Detection in Bioscience Text Marti Hearst SIMS, UC Berkeley Supported by NSF DBI and a gift from Genentech

BioText Project Goals
- Provide flexible, intelligent access to information for use in biosciences applications.
- Focus on textual information from journal articles.
- Tightly integrated with other resources:
  - Ontologies
  - Record-based databases

Project Team
- Project Leaders: PI: Marti Hearst; Co-PI: Adam Arkin
- Computational Linguistics: Barbara Rosario (graduated), Preslav Nakov
- Database Research: Ariel Schwartz, Gaurav Bhalotia (graduated)
- User Interface / IR: Rowena Luk, Dr. Emilia Stoica
- Bioscience: Dr. TingTing Zhang, Janice Hamer
Supported primarily by NSF DBI and a gift from Genentech.

BioText Architecture Sophisticated Text Analysis Annotations in Database Improved Search Interface

The Nature of Bioscience Text
Claim: Bioscience semantics are simultaneously easier and harder than general text.
- Easier: fewer subtleties, fewer ambiguities, “systematic” meanings
- Harder: enormous terminology, complex sentence structure

Entity-Entity Relation Recognition

Two tasks
- Relationship extraction: identify the several semantic relations that can occur between two entities (in this case, protein names) in bioscience text.
- Entity extraction (related problem): identify the entities.

The Approach
- Data: MEDLINE abstracts and titles
- Graphical models
  - Combine both relation and entity extraction in one framework
  - Both static and dynamic models
- Simple discriminative approach: neural network
- Lexical, syntactic, and semantic features

Protein-Protein interactions
Tasks: given sentences from Paper ID, and/or citation sentences to ID:
- Predict the interaction type given in the HIV database for Paper ID
- Extract the proteins involved
10-way classification problem

Protein-Protein interactions: Models
- Dynamic graphical model
- Naïve Bayes
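To make the Naïve Bayes entry concrete, here is a minimal sketch of a bag-of-words multinomial Naïve Bayes classifier with add-one smoothing. The slides do not say which features or smoothing the project used, so the tokenization, the smoothing, the function names, and the toy training data below are all assumptions for illustration rather than the authors' model.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """Collect counts from (tokens, label) pairs. Names and data are illustrative."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the label with the highest log-probability under add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(count / total_docs)  # log prior
        for token in tokens:
            if token in vocabulary:  # words never seen in training are ignored
                smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (invented for illustration, not taken from the HIV database).
model = train_naive_bayes([
    (["tat", "inhibits", "cdk9"], "inhibits"),
    (["tat", "binds", "cxcr4"], "binds"),
])
label = classify(["protein", "inhibits"], model)
```

In a 10-way setting the same functions apply unchanged; only the set of labels in the training data grows.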

Graphical Models

Evaluation
- Evaluation at document level:
  - All (sentences from papers + citations)
  - Papers (only sentences from papers)
  - Citations (only citation sentences)
- “Trigger word” approach:
  - List of keywords (e.g., for inhibits: “inhibitor”, “inhibition”, “inhibit”, etc.)
  - If a keyword is present, assign the corresponding interaction
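The trigger-word approach can be sketched as follows. Only the keywords for "inhibits" come from the slide; the "binds" list, the function name, and the rule of letting sentences vote when several interactions match are assumptions added for illustration.

```python
# Keyword lists per interaction type. Only the "inhibits" entries are from the slide.
TRIGGERS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],
    "binds": ["bind", "binding", "bound"],  # assumed for illustration
}

def trigger_word_label(sentences, triggers=TRIGGERS, default=None):
    """Assign a document-level interaction from the sentences that mention a keyword."""
    votes = {label: 0 for label in triggers}
    for sentence in sentences:
        text = sentence.lower()
        for label, keywords in triggers.items():
            # A sentence votes for a label if any of its keywords occurs in it.
            if any(keyword in text for keyword in keywords):
                votes[label] += 1
    best = max(votes, key=votes.get)  # ties go to the first label listed
    return best if votes[best] > 0 else default
```

For example, `trigger_word_label(["Tat is an inhibitor of X."])` yields `"inhibits"`, while a document with no keyword at all falls back to `default`.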

Results
Accuracies on interaction classification
Columns: Model, All, Papers, Citations
Models: Markov Model, Naïve Bayes
Baselines: Most freq. inter., TriggerW, TriggerW + BO
(Roles hidden)

Results: confusion matrix For All. Overall accuracy: 60.5%

Hiding the protein names
Replaced protein names with the token PROT_NAME:
- Original: Selective CXCR4 antagonism by Tat
- Masked: Selective PROT_NAME antagonism by PROT_NAME
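A minimal sketch of this masking step, assuming the protein names in a sentence are already known (for example from the labeled data). The function name and the whole-word matching rule are assumptions, not details given in the slides.

```python
import re

def mask_proteins(sentence, protein_names, token="PROT_NAME"):
    """Replace each whole-word occurrence of a known protein name with a placeholder."""
    if not protein_names:
        return sentence
    # Try longer names first so a long name is not split by a shorter one.
    alternatives = sorted(protein_names, key=len, reverse=True)
    pattern = r"(?<!\w)(?:" + "|".join(re.escape(name) for name in alternatives) + r")(?!\w)"
    return re.sub(pattern, token, sentence)

masked = mask_proteins("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"])
# masked == "Selective PROT_NAME antagonism by PROT_NAME"
```

Matching is case-sensitive and restricted to whole words, so a name such as "Tat" does not fire inside a longer word.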

Results with no protein names

Model          Papers           Citations
Markov Model   44.4 (-23.1%)    52.3 (-2.0%)
Naïve Bayes    46.7 (-19.2%)    53.4 (-4.1%)

Protein extraction (protein name tagging, role extraction)
The identification of all the proteins present in the sentence that are involved in the interaction. Example sentences:
- These results suggest that Tat-induced phosphorylation of serine 5 by CDK9 might be important after transcription has reached the +36 position, at which time CDK7 has been released from the complex.
- Tat might regulate the phosphorylation of the RNA polymerase II carboxyl-terminal domain in pre-initiation complexes by activating CDK7.

Protein extraction: results
Columns: Recall, Precision, F-measure
Rows: All, Papers, Citations
No dictionary used
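For reference, the three measures named on this slide can be computed as below. This sketch assumes the balanced F-measure (F1) and set-based matching of extracted names against gold names; the slides do not state which variant was used, and the example sets are invented for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and balanced F-measure (F1)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented example: two of three extracted names are correct, one gold name is missed.
p, r, f = precision_recall_f1({"Tat", "CDK9", "RNA"}, {"Tat", "CDK9", "CDK7"})
```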

Conclusions of protein-protein interaction project
- Encouraging results for the automatic classification of protein-protein interactions
- Use of an existing database for gathering labeled data
- Use of citations

Acquiring Labeled Data using Citances

BioScience Researchers
- Read A LOT!
- Cite A LOT!
- Curate A LOT!
- Are interested in specific relations, e.g.:
  - What is the role of this protein in that pathway?
  - Show me articles in which a comparison between two values is significant.


A discovery is made … A paper is written …

That paper is cited … and cited … … as the evidence for some fact(s) F.

Each of these in turn is cited for some fact(s) … until it is the case that all important facts in the field can be found in citation sentences alone!

Citances
- Nearly every statement in a bioscience journal article is backed up with a cite.
- It is quite common for papers to be cited times.
- The text around the citation tends to state biological facts. (Call these citances.)
- Different citances will state the same facts in different ways …
- … so can we use these for creating models of language expressing semantic relations?

Using Citances
- Potential uses of citation sentences (citances):
  - creation of training and testing data for semantic analysis,
  - synonym set creation,
  - database curation,
  - document summarization,
  - and information retrieval generally.
- Some preliminary results:
  - Citances to a document align well with a hand-built curation.
  - Citances are good candidates for paraphrase creation.

Issues for Processing Citances
- Text span
  - Identification of the appropriate phrase, clause, or sentence that constitutes a citance.
  - Correct mapping of citations when shown as lists or groups (e.g., “[22-25]”).
- Grouping citances by topic
  - Citances that cite the same document should be grouped by the facts they state.
- Normalizing or paraphrasing citances
  - For IR, summarization, learning synonyms, relation extraction, question answering, and machine translation.
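The mapping issue for grouped citations such as "[22-25]" can be illustrated with a small helper that expands a marker into individual reference numbers. The function name and the accepted marker syntax (comma-separated numbers and hyphenated ranges inside square brackets) are assumptions; real articles use many other citation styles.

```python
import re

def expand_citation_group(marker):
    """Expand a marker such as "[22-25]" or "[3, 7-9]" into a list of reference numbers."""
    inner = marker.strip().strip("[]")
    numbers = []
    for part in re.split(r"\s*,\s*", inner):
        match = re.fullmatch(r"(\d+)\s*-\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            numbers.extend(range(start, end + 1))  # inclusive range
        elif part.isdigit():
            numbers.append(int(part))
        # Anything else (author-year styles, footnote symbols) is skipped here.
    return numbers

refs = expand_citation_group("[22-25]")
# refs == [22, 23, 24, 25]
```

Once expanded, the citance can be attached to each of the cited documents separately.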

Early results: Paraphrase Creation from Citances

Sample Sentences
- NGF withdrawal from sympathetic neurons induces Bim, which then contributes to death.
- Nerve growth factor withdrawal induces the expression of Bim and mediates Bax dependent cytochrome c release and apoptosis.
- The proapoptotic Bcl-2 family member Bim is strongly induced in sympathetic neurons in response to NGF withdrawal.
- In neurons, the BH3 only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by nerve growth factor deprivation.

Their Paraphrases
- NGF withdrawal induces Bim.
- Nerve growth factor withdrawal induces the expression of Bim.
- Bim has been shown to be upregulated following nerve growth factor withdrawal.
- Bim implicated in apoptosis caused by nerve growth factor deprivation.
They all paraphrase: Bim is induced after NGF withdrawal.

Paraphrase Creation Algorithm
1. Extract the sentences that cite the target.
2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize.
3. Dependency parse (MiniPar).
4. For each parse, for each pair of NEs of interest:
   i. Extract the path between them.
   ii. Create a paraphrase from the path.
5. Rank the candidates for a given pair of NEs.
6. Select only the ones above a threshold.
7. Generalize.
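Step 4.i, extracting the path between two named entities in a dependency parse, can be sketched as a breadth-first search over the parse treated as an undirected graph. The edge list below is hand-written for one short sentence and is not MiniPar output; the function name and the representation of tokens as plain strings (which assumes each token occurs once) are also assumptions.

```python
from collections import deque

def dependency_path(edges, source, target):
    """Shortest token path between two nodes of a parse given as (head, dependent) pairs."""
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two tokens are not connected

# Assumed parse of "NGF withdrawal induces Bim" (illustrative only).
edges = [("induces", "withdrawal"), ("withdrawal", "NGF"), ("induces", "Bim")]
path = dependency_path(edges, "NGF", "Bim")
# path == ["NGF", "withdrawal", "induces", "Bim"]
```

Reading the tokens along the returned path gives a candidate paraphrase for the entity pair, which the later steps would then rank and filter.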

Relevant Papers
- Citances: Citation Sentences for Semantic Analysis of Bioscience Text, Preslav Nakov, Ariel Schwartz, and Marti Hearst, in the SIGIR'04 workshop on Search and Discovery in Bioinformatics.
- Classifying Semantic Relations in Bioscience Text, Barbara Rosario and Marti Hearst, in ACL
- The Descent of Hierarchy, and Selection in Relational Semantics, Barbara Rosario, Marti Hearst, and Charles Fillmore, in ACL 2002.

Thank you! Marti Hearst SIMS, UC Berkeley