1
Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature
Deyu Zhou, Yulan He and Chee Keong Kwoh
School of Computer Engineering, Nanyang Technological University, Singapore
30 August 2007
2
Outline
–Protein-protein interactions (PPIs) extraction
–Hidden Vector State (HVS) model for PPIs extraction
–Reranking approaches
–Experimental results
–Conclusions
3
Protein-Protein Interactions Extraction
Schema: Protein, Interact, Protein
Example sentence: Spc97p interacts with Spc98 and Tub4 in the two-hybrid system
Extracted interactions:
–Spc97p interact Spc98
–Spc97p interact Tub4
4
Existing Approaches (from simple to complicated)
–Statistics methods
–Pattern matching
–Parsing-based
5
An example
6
Statistics-Based Approaches
7
Pattern Matching Approaches
8
Parsing-Based Approaches
9
Semantic Parser
For each candidate word string $W_n$, we need to compute the most likely set of embedded concepts:
$\hat{C} = \arg\max_{C} P(C \mid W_n) = \arg\max_{C} P(C)\, P(W_n \mid C)$
where $P(C)$ is the semantic model and $P(W_n \mid C)$ is the lexical model.
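The step from the posterior to the product of the semantic and lexical models follows from Bayes' rule. A short derivation in the slide's own symbols, given as a sketch rather than as the authors' wording:

```latex
\begin{align*}
\hat{C} &= \arg\max_{C} P(C \mid W_n) \\
        &= \arg\max_{C} \frac{P(C)\, P(W_n \mid C)}{P(W_n)} \\
        &= \arg\max_{C} P(C)\, P(W_n \mid C)
\end{align*}
```

The denominator $P(W_n)$ does not depend on $C$, so dropping it leaves the maximizing concept sequence unchanged.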
10
We could use a simple finite state tagger (diagram labels: $P(W_n \mid C)$, $P(C)$) … it can be robustly trained using EM, but the model is too weak to represent embeddings in natural language
11
Perhaps use some form of hierarchical HMM in which each state is a terminal or a nested HMM (diagram labels: $P(W_n \mid C)$, $P(C)$) … but when using EM, models rarely converge on good solutions and, in practice, direct maximum-likelihood estimation from “tree-bank” data is needed to train the models
12
Hidden Vector State Model
13
The HVS model is an HMM in which the states correspond to the stack of a push-down automaton with a bounded stack size (diagram labels: $P(W_n \mid C)$, $P(C)$) … this is a very convenient framework for applying constraints
14
HVS model transition constraints:
–finite stack depth $D$
–push only one non-terminal semantic concept onto the stack at each step
… the model is defined by three simple probability tables:
$\hat{C} = \arg\max_{C,N} \prod_{t} P(n_t \mid C_{t-1})\, P(C_t[1] \mid C_t[2..D_t])\, P(W_t \mid C_t)$
15
Parsing with the HVS model
1) Pop $n_t$ elements from the previous stack state (here $n_t = 1$): $P(n_t \mid C_{t-1})$
2) Push one pre-terminal semantic concept onto the stack: $P(C_t[1] \mid C_t[2..D_t])$
3) Generate the next word: $P(W_t \mid C_t)$
Stack states shown in the diagram for the words “… with Spc98 and Tub4 …”:
–PROTEIN INTERACT PROTEIN SS
–INTERACT PROTEIN SS
–DUMMY INTERACT PROTEIN SS
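The pop-then-push step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `hvs_transition`, the top-first tuple representation of a stack, and the choice of which stack precedes which in the example are all assumptions made here.

```python
from typing import Tuple

Stack = Tuple[str, ...]  # concepts listed top-first, as written on the slide


def hvs_transition(prev_stack: Stack, n_pop: int, new_concept: str,
                   max_depth: int) -> Stack:
    """Apply one HVS stack shift: pop n_pop concepts, then push exactly one."""
    if not 0 <= n_pop <= len(prev_stack):
        raise ValueError("cannot pop more concepts than the stack holds")
    remaining = prev_stack[n_pop:]           # step 1: pop from the top
    new_stack = (new_concept,) + remaining   # step 2: push one concept
    if len(new_stack) > max_depth:
        raise ValueError("stack depth bound D exceeded")
    return new_stack


# Assumed ordering: the DUMMY stack is the previous state, and PROTEIN is pushed.
prev = ("DUMMY", "INTERACT", "PROTEIN", "SS")
print(hvs_transition(prev, n_pop=1, new_concept="PROTEIN", max_depth=4))
```

Because each step pushes exactly one concept and the depth is bounded, the set of reachable stack states stays finite, which is what lets the model be treated as an HMM.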
16
Train using EM and apply constraints
Training text: CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, and SKR-7 in yeast two-hybrid system
Abstract semantic annotation: PROTEIN ( INTERACT ( PROTEIN ) )
Diagram components: Training text; Data Constraints; EM Parameter Estimation; HVS Model Parameters; Parse Statistics
Limit the forward-backward search to only include states which are consistent with the constraints
17
Reranking Methodology
–Reranking approaches attempt to improve upon an existing probabilistic parser by reranking the output of the parser.
–They have benefited applications such as named-entity extraction, semantic parsing and semantic labeling.
–Here, reranking is applied to the parses generated by the HVS model for protein-protein interactions extraction.
18
Architecture
19
Abstract Annotation
An abstract annotation needs to be provided for each sentence in the training set.
An example
–Sentence: CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, SKR-7, SKR-8 and SKR-10 in yeast two-hybrid system.
–Annotation: PROTEIN NAME(ACTIVATE (PROTEIN NAME))
Resulting PPIs:
–CUL-1#SKR-1#ACTIVATE
–CUL-1#SKR-2#ACTIVATE
–CUL-1#SKR-3#ACTIVATE
–CUL-1#SKR-7#ACTIVATE
–CUL-1#SKR-8#ACTIVATE
–CUL-1#SKR-10#ACTIVATE
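The result format pairs the first protein with each later protein name and appends the interaction category. A minimal sketch of that expansion, assuming each word has already been tagged with a pre-terminal concept; the function `extract_ppis`, the tag spelling `PROTEIN_NAME`, and the word-level tagging itself are hypothetical and not taken from the slides:

```python
from typing import List, Tuple


def extract_ppis(tagged: List[Tuple[str, str]]) -> List[str]:
    """Expand a tagged sentence into 'proteinA#proteinB#CATEGORY' strings."""
    proteins = [word for word, tag in tagged if tag == "PROTEIN_NAME"]
    # Any tag that is neither a protein nor DUMMY is treated as the category.
    categories = [tag for _, tag in tagged if tag not in ("PROTEIN_NAME", "DUMMY")]
    if len(proteins) < 2 or not categories:
        return []
    first, partners = proteins[0], proteins[1:]
    return [f"{first}#{partner}#{categories[0]}" for partner in partners]


tagged_sentence = [
    ("CUL-1", "PROTEIN_NAME"), ("was", "DUMMY"), ("found", "DUMMY"),
    ("to", "DUMMY"), ("interact", "ACTIVATE"), ("with", "DUMMY"),
    ("SKR-1", "PROTEIN_NAME"), ("SKR-2", "PROTEIN_NAME"),
]
for ppi in extract_ppis(tagged_sentence):
    print(ppi)
```

This flat pairing only covers the one-to-many pattern in the example; nested or multiple interactions in one sentence would need the full stack structure of the parse.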
20
Reranking approaches: Features for Reranking
Suppose sentence $S_i$ has its corresponding parse set $C_i = \{C_{ij},\ j = 1, \dots, N\}$. Features are drawn from:
–Parsing Information
–Structure Information
–Complexity Information
21
Reranking approaches
The score of each candidate parse is defined using one of:
–a log-linear regression model
–a neural network
–support vector machines
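As an illustration of the log-linear option, a candidate's score can be taken as a weighted sum of its feature values, with the highest-scoring parse selected. The slide's exact scoring formula did not survive extraction, so the form below, the feature order (parsing, structure, complexity), and all numeric values are assumptions chosen for the example:

```python
from typing import Dict, List, Sequence


def log_linear_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of feature values (assumed form of the log-linear score)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def rerank(candidates: List[Dict], weights: Sequence[float]) -> List[Dict]:
    """Return the candidate parses sorted by descending score."""
    return sorted(candidates,
                  key=lambda c: log_linear_score(c["features"], weights),
                  reverse=True)


# Hypothetical candidates: [parse log-probability, structure, complexity]
candidates = [
    {"parse": "A", "features": [-10.0, 0.0, 4.0]},
    {"parse": "B", "features": [-11.0, 1.0, 2.0]},
]
weights = [1.0, 2.0, -0.5]  # illustrative values, not learned
best = rerank(candidates, weights)[0]
print(best["parse"])
```

With these made-up weights, parse B overtakes parse A even though A has the higher parser log-probability, which is the effect reranking is meant to achieve.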
22
Experiments Setup
–Corpus I comprises 300 abstracts randomly retrieved from the GENIA corpus
–GENIA is a collection of research abstracts selected from the search results of the MEDLINE database with keywords (MeSH terms) “human, blood cells and transcription factors”
–The corpus is split into two parts:
  –Part I contains 1500 sentences (training data)
  –Part II consists of 1000 sentences (test data)
23
Experimental Results Figure 1: F-measure vs number of candidate parses.
24
Experimental Results (cont’d)
Experiments   Recall (%)   Precision (%)   F-Score (%)
Baseline      55.8         55.6            55.7
SVM           59.1         60.2            59.7
NN            57.9         61.8            59.8
LLR           58.5         61.2
Table 3: Results based on the interaction category.
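The F-score column can be related to recall and precision through the balanced F-measure (their harmonic mean). The slides do not state which F-measure variant was used, so the balanced form is an assumption; the sketch below checks it against the baseline row:

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline row of the table: recall 55.8, precision 55.6
baseline_f = f_measure(55.8, 55.6)
print(round(baseline_f, 1))  # 55.7
```

The baseline value reproduces the reported 55.7, which is consistent with the balanced form.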
25
Conclusions
–Three reranking methods were presented for the HVS model in the application of extracting protein-protein interactions from biomedical literature.
–Experimental results show that a 4% relative improvement in F-measure can be obtained through reranking on the semantic parse results.
–Incorporating other semantic or syntactic information might give further gains.