BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and.

Slides:



Advertisements
Similar presentations
Bio-Medical Interaction Extractor Syed Toufeeq Ahmed ASU.
Advertisements

Social networks, in the form of bibliographies and citations, have long been an integral part of the scientific process. We examine how to leverage the.
1 Relational Learning of Pattern-Match Rules for Information Extraction Presentation by Tim Chartrand of A paper bypaper Mary Elaine Califf and Raymond.
Mining External Resources for Biomedical IE Why, How, What Malvina Nissim
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
TEMPLATE DESIGN © Identifying Noun Product Features that Imply Opinions Lei Zhang Bing Liu Department of Computer Science,
Recognizing Implicit Discourse Relations in the Penn Discourse Treebank Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng Department of Computer Science National.
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
Automatic Discovery of Technology Trends from Patent Text Youngho Kim, Yingshi Tian, Yoonjae Jeong, Ryu Jihee, Sung-Hyon Myaeng School of Engineering Information.
Predicting protein functions from redundancies in large-scale protein interaction networks Speaker: Chun-hui CAI
Mining the Medical Literature Chirag Bhatt October 14 th, 2004.
The Parts of the Sentence.  Every complete sentence must have at least one subject and one verb.  Although it is not necessary to have one in a sentence,
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
The chapter will address the following questions:
Memory Strategy – Using Mental Images
Mining and Summarizing Customer Reviews
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Probabilistic Parsing Reading: Chap 14, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
Survey of Semantic Annotation Platforms
Chapter 7. BEAT: the Behavior Expression Animation Toolkit
Using Text Mining and Natural Language Processing for Health Care Claims Processing Cihan ÜNAL
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Secure Systems Research Group - FAU Classifying security patterns E.B.Fernandez, H. Washizaki, N. Yoshioka, A. Kubo.
Finding High-frequent Synonyms of a Domain- specific Verb in English Sub-language of MEDLINE Abstracts Using WordNet Chun Xiao and Dietmar Rösner Institut.
Towards Building A Database of Phosphorylate Interactions Extracting Information from the Literature M. Narayanaswamy & K. E. Ravikumar AU-KBC Center,
1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan.
Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.
Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
LOGIC AND ONTOLOGY Both logic and ontology are important areas of philosophy covering large, diverse, and active research projects. These two areas overlap.
S1: Chapter 1 Mathematical Models Dr J Frost Last modified: 6 th September 2015.
A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart
Playing Biology ’ s Name Game: Identifying Protein Names In Scientific Text Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer Pac Symp.
Opinion Holders in Opinion Text from Online Newspapers Youngho Kim, Yuchul Jung and Sung-Hyon Myaeng Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
PIRSF Classification System PIRSF: Evolutionary relationships of proteins from super- to sub-families Homeomorphic Family: Homologous proteins sharing.
Indirect Supervision Protocols for Learning in Natural Language Processing II. Learning by Inventing Binary Labels This work is supported by DARPA funding.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
Part4 Methodology of Database Design Chapter 07- Overview of Conceptual Database Design Lu Wei College of Software and Microelectronics Northwestern Polytechnical.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
C. Lawrence Zitnick Microsoft Research, Redmond Devi Parikh Virginia Tech Bringing Semantics Into Focus Using Visual.
Programming Languages and Design Lecture 3 Semantic Specifications of Programming Languages Instructor: Li Ma Department of Computer Science Texas Southern.
E BERHARD- K ARLS- U NIVERSITÄT T ÜBINGEN SFB 441 Coordinate Structures: On the Relationship between Parsing Preferences and Corpus Frequencies Ilona Steiner.
NLP pipeline for protein mutation knowledgebase construction Jonas B. Laurila, Nona Naderi, René Witte, Christopher J.O. Baker.
Distribution of information in biomedical abstracts and full- text publications M. J. Schuemie et al. Dept. of Medical Informatics, Erasmus University.
Bioinformatics and Computational Biology
Finding frequent and interesting triples in text Janez Brank, Dunja Mladenić, Marko Grobelnik Jožef Stefan Institute, Ljubljana, Slovenia.
Communicative and Academic English for the EFL Professional.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
Opportunities for Text Mining in Bioinformatics (CS591-CXZ Text Data Mining Seminar) Dec. 8, 2004 ChengXiang Zhai Department of Computer Science University.
1 Evaluating High Accuracy Retrieval Techniques Chirag Shah,W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
Levels of Linguistic Analysis
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Discovering Relations among Named Entities from Large Corpora Takaaki Hasegawa *, Satoshi Sekine 1, Ralph Grishman 1 ACL 2004 * Cyberspace Laboratories.
5/6/04Biolink1 Integrated Annotation for Biomedical IE Mining the Bibliome: Information Extraction from the Biomedical Literature NSF ITR grant EIA
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
Extracting and Ranking Product Features in Opinion Documents Lei Zhang #, Bing Liu #, Suk Hwan Lim *, Eamonn O’Brien-Strain * # University of Illinois.
Chunk Parsing. Also called chunking, light parsing, or partial parsing. Method: Assign some additional structure to input over tagging Used when full.
DISCUSSION Using a Literature-based NMF Model for Discovering Gene Functional Relationships Using a Literature-based NMF Model for Discovering Gene Functional.
Literature Mining and Database Annotation of Protein Phosphorylation Using a Rule-based System Z. Z. Hu 1, M. Narayanaswamy 2, K. E. Ravikumar 2, K. Vijay-Shanker.
1 GAPSCORE: Finding Gene and Protein Names one Word at a Time Jeffery T. Chang 1, Hinrich Schutze 2 & Russ B. Altman 1 1 Department of Genetics, Stanford.
Use of Concordancers A corpus (plural corpora) – a large collection of texts, written or spoken, stored on a computer. A concordancer – a computer programme.
COP Introduction to Database Structures
Kenneth Baclawski et. al. PSB /11/7 Sa-Im Shin
Extracting Semantic Concept Relations
Levels of Linguistic Analysis
Presentation transcript:

BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and See-Kiong Ng 2 1 Computer Science Division & AlTrc, Korea Advanced Institute of Science and Technology 2 Knowledge Discovery Department, Institute for Infocomm Research, Singapore (Bioinformatics, Vol. 22, 5, 2006, p )

2/27 Abstract Contrastive information between proteins can reveal what similarities, divergences and relations there are of the two proteins. BioContrasts extracts protein – protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web-application for exploitation. A total of 799,169 pairs of contrastive expressions were extracted from 2.5 million MEDLINE abstracts. Using grounding of contrastive protein names to Swiss- Prot entries, they produce 41,471 pieces of contrasts between Swiss-Prot protein entries.

3/27 1. Introduction Biologists have expended much efforts to compile various protein and interaction databases (e.g., Swiss-Prot, BIND, DIP and KEGG) from experimental data and the published literature. But they are typically positive facts such as “ protein A binds to B ”. BioContrasts identifies contrastive expressions in the MEDLINE abstracts that contain typical contrastive negation patterns such as “ A but not B ”. They also link proteins to Swiss-Prot entries.

4/27 2. Background (1/3) “ not ” often encodes contrasts between biomedical objects. (1) NAT1 binds eIF4A but not eIF4E and inhibits both cap-dependent and cap-independent translation. (2) Truncated N-terminal mutant huntingtin repressed transcription, whereas the corresponding wild-type fragment did not repress transcription. Knowledge representation a. NAT1 binds to X b. X=eIF4A, X  eIF4E.

5/27 2. Background (2/3) Each piece of contrastive information is made up of two parts: (i) A contrastive pair of two or more objects that are so contrasted (e.g. {eIF4A, eIF4E}, {wild-type huntingtin, mutant huntingtin}), called focused objects. (ii) A biological property or process that the contrast is based on (e.g. binding to NAT1, transcription repression), called presupposed property.

6/27 2. Background (3/3) For contrasts to be meaningful, there should also exist a certain level of implicit similarity between the contrasted objects in addition to the explicit differences being communicated. For example, the focused objects of the contrast encoded in (1) are all eukaryotic translation initiation factors.  A contrast is highly informative that not only explicitly represents a difference between its focused objects but also implicitly indicates that the focused objects are semantically similar.

7/27 3. Methodology First, contrastive expressions are identified from the text based on a two-step approach: (i) Matching a subclausal coordination pattern (e.g. ‘ A but not B ’ ) to an input sentence. (ii) identifying parallelism between a negated clause and an affirmative clause in the sentence or adjacent sentences. Then, identifying protein – protein contrasts from the contrastive information to create a useful database of contrastive relationships between Swiss-Prot protein entries. Last, the extracted information are presented via a user-friendly interactive web-portal for exploitation.

8/27

9/ Identifying subclausal coordinations (1/6) Given a sentence that contains a ‘ not ’, the system first tries to identify contrastive expressions using subclausal coordination patterns as defined in Table 1.

10/ Identifying subclausal coordinations (2/6) Authors developed (a) a novel POS tagger that assigns to each word its most frequent POS tag by looking up a manually curated POS dictionary together with some domain specific correction rules and (b) a novel noun phrase recognizer that looks for noun phrases that begin or end at predetermined words or POS, since the NP variables in the patterns are adjacent to a predetermined word (e.g. ‘ but ’, ‘ not ’ ) or a POS (e.g. PREP, V).

11/ Identifying subclausal coordinations (3/6) The system then analyzes semantic similarity between the coordinated expressions by analyzing word-level similarity between the expressions. E.g. ‘ not V NP but V NP ’ : In contrast, IFN-gamma priming did not affect the expression of p105 transcripts but enhanced the expression of p65 mRNA (2-fold).

12/ Identifying subclausal coordinations (4/6) Word-level similarity: checking whether the variable-matching phrases are semantically identical or at least in a subsumption relation. Verb and adjective: Utilize the synonymy and hypernymy relations in WordNet and also in a hand- made resource (constructed by examining 166 ‘ not ’ - containing abstracts, e.g. Table 2).

13/ Identifying subclausal coordinations (5/6) Noun phrase: Analyze the similarity between their headwords. (a) Whether the two nouns denote the same kind of biological objects (e.g. ‘ eIF4A ’ and ‘ eIF4E ’ in example (1) are protein names) by looking up well- known biomedical databases such as Swiss-Prot. (b) Whether they are semantically identical [e.g. ‘ transcription ’ in example (2)] by utilizing WordNet and their resource for semantic similarity. E.g. “ affect ” is a hypernym of “ enhance ”, and the two NPs have the same headword “ expression ”.

14/ Identifying subclausal coordinations (6/6) Next, the system determines the presupposed property for the focused proteins by extracting the subject phrase and the verb whose object phrases correspond to the focused proteins (i.e. ‘ NAT1 binds CONTRAST_OBJ ’, where CONTRAST_OBJ indicates the variable for focused objects.).

15/ Identifying clause-level parallelisms (1/3) If none of the subclausal coordination patterns can be matched to the input sentences in the previous step, the system then attempts to identify any parallelism expressed with the sentences. Parallelism for a contrast refers to a pair of a negated clause and an affirmative clause that are either semantically identical or at least in a subsumption relation.

16/27

17/ Identifying clause-level parallelisms (3/3) (2) Truncated N-terminal mutant huntingtin repressed transcription, whereas the corresponding wild-type fragment did not repress transcription. First locate verbs “ repress ” in the subordinate clause which is negated by “ not ” and the main clause. Next identify subject phrases and object phrases in the two clauses. Then check their semantic identity.

18/ Extracting contrastive protein names The two preceding steps produce contrastive expressions that match the variables for focused objects. Now, the system checks whether the contrastive expressions can be resolved in Swiss-Prot protein entries. Here, the system extract individual protein names as well coordinations between protein names.

19/ Grounding protein names (1/2) Manually constructed pattern dealing with variations of protein names concerning special characters, indices, plurals, discardable words and acronyms.

20/ Grounding protein names (2/2) For ambiguous protein names Match the full names. Discard protein names that are easily confused with other terms such as normal English words of lower case (e.g. mass, stellate) and two character words with one digit (e.g. S1, T1).

21/27 4. Implementation (1/2) A corpus of 2.5 million ‘ not ’ -containing MEDLINE abstracts to extract protein – protein contrasts reported in the literature. Extract 799,169 pairs of contrastive expressions from this corpus. 11,284 pairs of contrastive protein names. 41,471 contrasts between Swiss-Prot (S – P) entries — the larger number in the resulting protein – protein contrasts owes to the fact that a protein name may be grounded with multiple Swiss-Prot entries.

22/27 4. Implementation (2/2)

23/27 5. Evaluation (1/2) The performance in extracting contrastive proteins: Examine a set of 100 pairs of contrastive proteins randomly selected from the BioContrasts database. Correctly extract 97 pairs (97.0% precision) [cf. the performance of the earlier system: 85.7% precision and 61.5% recall). In terms of the system ’ s pattern usage, the main contrastive pattern ‘ A but not B ’ was used to extract 91 pairs (all correct), while the parallelism patterns were used 5 times (2/5 correct or 40% precision) in this evaluation.

24/27 5. Evaluation (2/2) The performance in grounding the contrastive proteins with Swiss-Prot entries: Out of the 200 focused protein names in the test protein contrasts, 182 names were expressed with adequate clarifying descriptions in their respective abstracts for grounding. Among these 182 protein names, the system correctly grounded 164 (90.1% precision) and linked 6 additional protein names with partially correct entries (3.3% partial precision).

25/27 6. Applications (1/2) Refine pathway roles of similar proteins These information can be used for distinguishing between the finer biological roles of seemingly similar proteins.

26/27 6. Applications (2/2) Identify implicitly similar proteins Such implicit similarity information can be useful in understanding the interactome (e.g. functional annotation of proteins). E.g. Out of the 5407 protein contrasts that involved functionally annotated proteins, there are 4948 contrasts (91.7%) between proteins that shared at least one GO code at level 2. This suggested that there are implicit functional similarities between the focused proteins being contrasted.

27/27 7. Conclusions BioContrasts system captures useful contrastive and nonpositive biological relationships between proteins by extracting contrastive expressions in the MEDLINE abstracts containing the contrastive negation patterns. Future work: Extract contrastive information expressed in linguistic structures other than ‘ A but not B ’. Other ways for exploiting the extracted contrasts for biological knowledge discovery. E.g., it may be possible that the presupposed properties (biological activities) of the same pairs of focused proteins are also related to each other.