Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature Deyu Zhou, Yulan He and Chee Keong Kwoh School of Computer Engineering.

Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature Deyu Zhou, Yulan He and Chee Keong Kwoh School of Computer Engineering Nanyang Technological University, Singapore 30 August 2007

Outline
– Protein-protein interactions (PPIs) extraction
– Hidden Vector State (HVS) model for PPIs extraction
– Reranking approaches
– Experimental results
– Conclusions

Protein-Protein Interactions Extraction
Sentence: Spc97p interacts with Spc98 and Tub4 in the two-hybrid system
Pattern: Protein Interact Protein
Extracted interactions:
– Spc97p interact Spc98
– Spc97p interact Tub4

Existing Approaches (from simple to complicated)
– Statistics Methods
– Pattern Matching
– Parsing-Based

An example

Statistics-Based Approaches

Pattern Matching Approaches

Parsing-Based Approaches

Semantic Parser
For each candidate word string $W_n$, we need to compute the most likely set of embedded concepts:
$\hat{C} = \arg\max_{C} P(C \mid W_n) = \arg\max_{C} P(C)\, P(W_n \mid C)$
where $P(C)$ is the semantic model and $P(W_n \mid C)$ is the lexical model.
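The step between the two argmax expressions is not spelled out on the slide. A short derivation, assuming it is the usual application of Bayes' rule with the word string $W_n$ held fixed:

```latex
\begin{align*}
\hat{C} &= \arg\max_{C} P(C \mid W_n) \\
        &= \arg\max_{C} \frac{P(C)\, P(W_n \mid C)}{P(W_n)} \\
        &= \arg\max_{C} P(C)\, P(W_n \mid C)
\end{align*}
```

The denominator $P(W_n)$ does not depend on $C$, so dropping it leaves the maximizing concept sequence unchanged.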

We could use a simple finite state tagger for $P(W_n \mid C)$ and $P(C)$ … it can be robustly trained using EM, but the model is too weak to represent embeddings in natural language

Perhaps use some form of hierarchical HMM for $P(W_n \mid C)$ and $P(C)$, in which each state is a terminal or a nested HMM … but when using EM, models rarely converge on good solutions and, in practice, direct maximum-likelihood estimates from “tree-bank” data are needed to train the models

Hidden Vector State Model

The HVS model is an HMM for $P(W_n \mid C)$ and $P(C)$ in which the states correspond to the stack of a push-down automaton with a bounded stack size … this is a very convenient framework for applying constraints

HVS model transition constraints:
– finite stack depth $D$
– push only one non-terminal semantic concept onto the stack at each step
… the model is defined by three simple probability tables:
$\hat{C} = \arg\max_{C,N} \prod_{t} P(n_t \mid C_{t-1})\, P(C_t[1] \mid C_t[2..D_t])\, P(W_t \mid C_t)$

Parsing with the HVS model
1) Pop 1 element from the previous stack state ($n = 1$): $P(n_t \mid C_{t-1})$
2) Push 1 pre-terminal semantic concept onto the stack: $P(C_t[1] \mid C_t[2..D_t])$
3) Generate the next word: $P(W_t \mid C_t)$
Example stack states for “… with Spc98 and Tub4 …”:
– PROTEIN INTERACT PROTEIN SS
– INTERACT PROTEIN SS
– DUMMY INTERACT PROTEIN SS
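The pop-then-push stack update in steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the transition mechanics, not the authors' implementation: the function name `hvs_transition`, the tuple representation (top of stack first), and the depth limit of 4 are assumptions made for the example, and the probabilities are left out entirely.

```python
from typing import Tuple

Stack = Tuple[str, ...]  # element 0 is the top of the stack


def hvs_transition(stack: Stack, n_pop: int, new_concept: str, max_depth: int) -> Stack:
    """Pop n_pop concepts from the top of the stack, then push exactly one
    new concept, rejecting any result deeper than max_depth."""
    if not 0 <= n_pop <= len(stack):
        raise ValueError("cannot pop more elements than the stack holds")
    remaining = stack[n_pop:]
    new_stack = (new_concept,) + remaining
    if len(new_stack) > max_depth:
        raise ValueError("transition would exceed the bounded stack depth")
    return new_stack


# Hypothetical step: pop PROTEIN (n = 1), then push DUMMY.
previous = ("PROTEIN", "INTERACT", "PROTEIN", "SS")
current = hvs_transition(previous, n_pop=1, new_concept="DUMMY", max_depth=4)
print(" ".join(current))
```

Because exactly one concept is pushed per step and the depth is bounded, the set of reachable stack states stays finite, which is what lets the model be treated as an HMM.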

Train using EM and apply constraints
Abstract semantic annotation: PROTEIN ( INTERACT ( PROTEIN ) )
Training text: CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, and SKR-7 in yeast two-hybrid system
Diagram components: Training Data, Constraints, EM Parameter Estimation, Parse Statistics, HVS Model Parameters
Limit the forward-backward search to only include states which are consistent with the constraints

Reranking Methodology
Reranking approaches attempt to improve upon an existing probabilistic parser by reranking the output of the parser. Reranking has benefited applications such as named-entity extraction, semantic parsing and semantic labeling. Here, the aim is to rerank parses generated by the HVS model for protein-protein interactions extraction.

Architecture

Abstract Annotation
An abstract annotation needs to be provided for each sentence in the training set.
An example
– Sentence: CUL-1 was found to interact with SKR-1, SKR-2, SKR-3, SKR-7, SKR-8 and SKR-10 in yeast two-hybrid system.
– Annotation: PROTEIN NAME(ACTIVATE (PROTEIN NAME))
– Resulting PPIs:
  CUL-1#SKR-1#ACTIVATE
  CUL-1#SKR-2#ACTIVATE
  CUL-1#SKR-3#ACTIVATE
  CUL-1#SKR-7#ACTIVATE
  CUL-1#SKR-8#ACTIVATE
  CUL-1#SKR-10#ACTIVATE
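The `A#B#KEYWORD` result lines above follow a regular pattern: one line per pair of the first protein with each following protein. A minimal sketch of that expansion, assuming the protein names and the interaction keyword have already been identified; the helper name `expand_interactions` is hypothetical and does not come from the slides.

```python
from typing import List


def expand_interactions(agent: str, targets: List[str], keyword: str) -> List[str]:
    """Build one 'agent#target#keyword' record per target protein."""
    return [f"{agent}#{target}#{keyword}" for target in targets]


records = expand_interactions(
    "CUL-1",
    ["SKR-1", "SKR-2", "SKR-3", "SKR-7", "SKR-8", "SKR-10"],
    "ACTIVATE",
)
for record in records:
    print(record)
```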

Reranking approaches: Features for Reranking
Suppose sentence $S_i$ has its corresponding parse set $C_i = \{C_{ij},\ j = 1, \ldots, N\}$
– Parsing Information
– Structure Information
– Complexity Information

Reranking approaches
A score is defined for each candidate parse, using one of three models:
– Log-linear regression model
– Neural Network
– Support Vector Machines
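The scoring formula itself did not survive in this transcript. As a rough illustration of the general idea only, the sketch below scores each candidate parse with a weighted sum of its features and keeps the highest-scoring one. The linear form, the three-element feature vectors, and the weight values are all assumptions made for the example; they are not the authors' model or data.

```python
from typing import List, Sequence


def score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of feature values (an assumed linear scoring form)."""
    return sum(w * f for w, f in zip(weights, features))


def rerank(candidates: List[Sequence[float]], weights: Sequence[float]) -> int:
    """Return the index j of the candidate parse with the highest score."""
    return max(range(len(candidates)), key=lambda j: score(candidates[j], weights))


# Hypothetical feature vectors: [parsing info, structure info, complexity info]
candidates = [
    [-2.0, 1.0, 3.0],  # top parse according to the base parser
    [-2.5, 2.0, 1.0],  # lower base-parser score, but simpler structure
]
weights = [1.0, 0.5, -0.2]  # illustrative values, not learned weights

best = rerank(candidates, weights)
print("selected candidate:", best)
```

In practice the weights would be learned from training data, and a neural network or SVM would replace `score` with a different function over the same features.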

Experiments Setup
– Corpus I comprises 300 abstracts randomly retrieved from the GENIA corpus
– GENIA is a collection of research abstracts selected from the search results of the MEDLINE database with keywords (MeSH terms) “human, blood cells and transcription factors”
– The corpus is split into two parts:
  – Part I contains 1500 sentences (training data)
  – Part II consists of 1000 sentences (test data)

Experimental Results Figure 1: F-measure vs number of candidate parses.

Experimental Results (cont’d)
Table 3: Results based on the interaction category.
Experiments | Recall (%) | Precision (%) | F-Score (%)
Baseline | 55.8 | 55.6 | 55.7
SVM, NN, LLR | 59.1 57.9 58.5 60.2 61.8 61.2 59.7 59.8
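The slides do not state how the F-Score column is computed. Assuming it is the standard balanced F-measure (the harmonic mean of precision and recall), the baseline row can be checked as follows; the function name `f_measure` is chosen for this example.

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline row of Table 3: recall 55.8 %, precision 55.6 %
baseline_f = f_measure(55.8, 55.6)
print(round(baseline_f, 1))
```

Under that assumption the baseline recall and precision reproduce the reported baseline F-Score of 55.7 after rounding to one decimal place.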

Conclusions
– Three reranking methods were investigated for the HVS model in the application of extracting protein-protein interactions from biomedical literature.
– Experimental results show that a 4% relative improvement in F-measure can be obtained through reranking on the semantic parse results.
– Incorporating other semantic or syntactic information might be able to give further gains.
