Published by Ferdinand Terence Hamilton. Modified over 6 years ago.
1
Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Decoding
Ariel Schwartz, Anna Divoli, and Marti Hearst. University of California, Berkeley. Supported in part by NSF DBI
2
Bioscience literature
Rich, complex and fast growing. Online full text enables new forms of automatic document analysis, including caption search and citation sentence analysis.
Citances
Nearly every statement in a bioscience journal article is backed up by a citation. It is common for papers to be cited […] times. The text around the citation tends to state biological facts from the target paper. We term these citation sentences, or citances. Different citances state similar facts in different ways.
3
Papers are cited for some fact(s) …
… until it is the case that many important facts in the field can be found in citation sentences alone!
4
Using citances
Potential applications of citances: creation of training and testing data for semantic analysis, synonym set creation, database curation, document summarization, and information retrieval generally (Nakov, Schwartz and Hearst. Citances: Citation Sentences for Semantic Analysis of Bioscience Text, in the SIGIR'04 Workshop on Search and Discovery in Bioinformatics).
All these applications require citance word alignments: aligning together concepts that are semantically related in the context of the target paper. Related concepts can be expressed in several different ways in the citances. We focus here on the multiple citance alignment (MCA) problem.
5
Example of unaligned citances
“In response to genotoxic stress, Chk1 and Chk2 phosphorylate Cdc25A on N-terminal sites and target it rapidly for ubiquitin-dependent degradation (Mailand et al, 2000, 2002; Molinari et al, 2000; Falck et al, 2001; Shimuta et al, 2002; Busino et al, 2003), which is thought to be central to the S and G2 cell cycle checkpoints (Bartek and Lukas, 2003; Donzelli and Draetta, 2003).”
“Given that Chk1 promotes Cdc25A turnover in response to DNA damage in vivo (Falck et al. 2001; Sorensen et al. 2003) and that Chk1 is required for Cdc25A ubiquitination by SCFβ-TRCP in vitro, we explored the role of Cdc25A phosphorylation in the ubiquitination process.”
“Since activated phosphorylated Chk2-T68 is involved in phosphorylation and degradation of Cdc25A (Falck et al., 2001, Falck et al., 2002; Bartek and Lukas, 2003), we also examined the levels of Cdc25A in 2fTGH and U3A cells exposed to γ-IR.”
6
Goal: Align similar concepts
C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A N terminal sites target rapidly ubiquitin dependent degradation thought central S G2 cell cycle checkpoints
C2: Given Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 required Cdc25A ubiquitination SCF beta TRCP vitro explored role Cdc25A phosphorylation ubiquitination process
C3: activated phosphorylated Chk2 T68 involved phosphorylation degradation Cdc25A examined levels Cdc25A 2fTGH U3A cells exposed gamma IR
7
Multiple citance alignment (MCA)
Goal: Partition the citances’ words/phrases into equivalence classes based on “semantic homology”.
Orthographic similarity is important but does not always entail semantic homology:
“phosphorylate” ≈ “phosphorylation”
“cell cycle” ≉ “U3A cells”
“genotoxic stress” ≈ “DNA damage”
Related problems: multiple sequence alignment (MSA) in genomics; pairwise word alignment in statistical machine translation (SMT).
8
Formal definition of MCA
A pairwise citance alignment of citances Ci and Cj is an equivalence relation ≈_ij. c_i^k ≈_ij c_j^l means that the kth word in the ith citance is aligned to the lth word in the jth citance. Multiple citance alignment (MCA) is an equivalence relation ~, defined as the transitive closure of the union of all pairwise citance alignments. The transitive closure ensures that the equivalence classes (colors) are consistent across all pairwise citance alignments.
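The definition can be written compactly as follows. This is a sketch of the notation only: the superscript + for transitive closure and the symbol \approx_{ij} for the pairwise relation are assumptions, since the slide's own equation did not survive extraction.

```latex
% Pairwise alignment: word k of citance i is aligned to word l of citance j
c_i^k \approx_{ij} c_j^l

% MCA: transitive closure (+) of the union of all pairwise alignments
{\sim} \;=\; \Bigl(\,\bigcup_{i \neq j} {\approx_{ij}}\Bigr)^{+}
```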
9
Algorithm outline
We developed an MCA algorithm based on:
An extension of our posterior decoding algorithm for MSA (AMAP, Schwartz and Pachter, ECCB 2006).
A modified version of the SMT pairwise word alignment model of Blunsom & Cohn (ACL 2006), used to calculate posterior probabilities.
10
Algorithm outline
[Pipeline diagram: Unaligned citances to a target paper → Feature extraction → Posterior probabilities calculation (CRF) → Expected utility maximization (posterior decoding), guided by the Utility function → Multiple citance alignment (MCA)]
12
Utility function for MCA
Requirements for a good utility function: Correlated with the accuracy measure used for evaluation. Easily decomposable, for direct optimization using posterior decoding. Metric-based (optional): captures an intuitive notion of distance, and the triangle inequality provides bounds on the search space. Alignment error rate (AER) and F-measure do not satisfy these criteria.
13
Alignment Metric Accuracy (AMA)
We extend AMA (Schwartz et al. 2006), a utility function for one-to-one MSA, to many-to-many MCA. Intuitively, U_AMA measures the average word-level agreement between the predicted and reference MCAs. U_set_agreement is a “score” assigned to each word position based on the overlap between the sets of word positions that the two alignments align to it. Dice, Jaccard, or Hamming could be used, for example; we use the Braun-Blanquet coefficient.
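To make the set-agreement score concrete, here is a minimal sketch of three of the coefficients named above, applied to the sets of positions that two alignments attach to one word. The function names, the tuple encoding of word positions, and the choice to score two empty sets as full agreement are assumptions for illustration, not details taken from the slides.

```python
from typing import Hashable, Set


def braun_blanquet(a: Set[Hashable], b: Set[Hashable]) -> float:
    """|A ∩ B| / max(|A|, |B|)."""
    if not a and not b:
        return 1.0  # assumption: unaligned in both alignments counts as agreement
    return len(a & b) / max(len(a), len(b))


def dice(a: Set[Hashable], b: Set[Hashable]) -> float:
    """2 |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # same assumption as above
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """|A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # same assumption as above
    return len(a & b) / len(a | b)


# Hypothetical example: positions aligned to one word of C1,
# encoded as (citance id, word index).
predicted = {("C2", 4), ("C3", 3)}
reference = {("C2", 4)}

bb_score = braun_blanquet(predicted, reference)
dice_score = dice(predicted, reference)
jaccard_score = jaccard(predicted, reference)
```

With one shared position out of two predicted and one reference position, Braun-Blanquet and Jaccard both give 0.5 and Dice gives 2/3.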
14
Example of AMA for MCA
Every word gets a score between 0 and 1 based on its level of agreement with the reference alignment. AMA is the average word score. In this example AMA = 13.83 / 20 ≈ 0.69. Sum of pairs is used for multiple alignments.
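The averaging step can be sketched as below for a single pair of citances. The data layout (a dictionary from each word position to the set of positions aligned to it), the function names, and the handling of unaligned words are assumptions; the slides do not show the actual implementation. For multiple alignments, the same computation would be summed over all citance pairs.

```python
from typing import Dict, Hashable, Set

# Hypothetical representation: word position -> positions aligned to it.
Alignment = Dict[Hashable, Set[Hashable]]


def braun_blanquet(a: Set[Hashable], b: Set[Hashable]) -> float:
    if not a and not b:
        return 1.0  # assumption: unaligned in both alignments counts as agreement
    return len(a & b) / max(len(a), len(b))


def ama(predicted: Alignment, reference: Alignment) -> float:
    """Average per-word agreement between two alignments."""
    positions = set(predicted) | set(reference)
    if not positions:
        return 1.0
    total = sum(
        braun_blanquet(predicted.get(p, set()), reference.get(p, set()))
        for p in positions
    )
    return total / len(positions)


# Toy pair of two-word citances: the reference aligns only the first words,
# while the prediction also aligns the second words.
reference = {
    ("C1", 0): {("C2", 0)}, ("C2", 0): {("C1", 0)},
    ("C1", 1): set(), ("C2", 1): set(),
}
predicted = {
    ("C1", 0): {("C2", 0)}, ("C2", 0): {("C1", 0)},
    ("C1", 1): {("C2", 1)}, ("C2", 1): {("C1", 1)},
}

toy_ama = ama(predicted, reference)
```

Two of the four words agree fully and two do not agree at all, so the toy score is 2 / 4 = 0.5.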
15
Controlling the recall/precision tradeoff
In addition, two free parameters (match-factor and gap-factor) are added to provide control of the recall/precision tradeoff. The result is the following utility function:
17
Motivation for using a CRF model
Small annotated sets for training, development, and testing. The main challenge is to perform well on unseen words. This requires a discriminative model that can use different overlapping features, can incorporate contextual information, and allows for computation of posterior probabilities.
18
CRF-based SMT word alignment
Blunsom and Cohn (ACL 2006) developed a CRF-based pairwise word alignment model for SMT. It is a directional model: every source word can be mapped to zero or one target words. It uses Viterbi decoding. Features are functions of the implied source-target word pairs.
We modified the program to support MCA: we compute the directional marginal posterior probabilities using the forward-backward algorithm, use modified features, and implement a posterior-decoding algorithm for MCA in place of the Viterbi decoding used for pairwise SMT word alignment.
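As an illustration of how forward-backward yields marginal posteriors on a linear chain, here is a minimal sketch. It assumes one chain position per source word and one state per candidate target word (plus, in practice, a null state), with non-negative node and transition scores already computed from the features. These names and this layout are hypothetical, the sketch works in probability space without the log-space scaling a real implementation would need, and it is not the Blunsom and Cohn code.

```python
from typing import List


def posterior_marginals(node: List[List[float]],
                        trans: List[List[float]]) -> List[List[float]]:
    """node[t][s]: score of state s at position t.
    trans[r][s]: score of moving from state r to state s.
    Returns post[t][s] = P(state at position t is s | whole sequence)."""
    T, S = len(node), len(node[0])
    fwd = [[0.0] * S for _ in range(T)]
    bwd = [[0.0] * S for _ in range(T)]

    # Forward pass: total score of all prefixes ending in state s at t.
    fwd[0] = list(node[0])
    for t in range(1, T):
        for s in range(S):
            fwd[t][s] = node[t][s] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(S))

    # Backward pass: total score of all suffixes following state r at t.
    bwd[T - 1] = [1.0] * S
    for t in range(T - 2, -1, -1):
        for r in range(S):
            bwd[t][r] = sum(
                trans[r][s] * node[t + 1][s] * bwd[t + 1][s] for s in range(S))

    z = sum(fwd[T - 1])  # partition function
    return [[fwd[t][s] * bwd[t][s] / z for s in range(S)] for t in range(T)]


# Toy check: with uniform transitions the marginals reduce to the
# normalized node scores at each position.
toy_node = [[1.0, 3.0], [2.0, 2.0]]
toy_trans = [[1.0, 1.0], [1.0, 1.0]]
toy_post = posterior_marginals(toy_node, toy_trans)
```

Unlike Viterbi, which returns only the single best path, these marginals give a probability for every source-target pair, which is what the posterior-decoding step consumes.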
20
Posterior decoding algorithm for MCA
For every pair of citances, compute the directional posterior probabilities using a CRF. For every target word w, compute the combination of source words that maximizes the expected utility of w. The (undirected) multiple word alignment is produced by taking the transitive closure of the union of the individual words' optimal alignments.
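The final step, taking the transitive closure of the union of word-level alignments, can be sketched with a union-find structure. The function name and the (citance, word) encoding are assumptions for illustration; the per-word expected-utility maximization that produces the links is not shown, because the utility formula is not recoverable from the slides.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def equivalence_classes(
    words: Sequence[Hashable],
    links: Sequence[Tuple[Hashable, Hashable]],
) -> List[Set[Hashable]]:
    """Group words into classes: the transitive closure of the given links."""
    parent: Dict[Hashable, Hashable] = {w: w for w in words}

    def find(w: Hashable) -> Hashable:
        # Walk to the class representative, halving the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two classes

    classes: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for w in words:
        classes[find(w)].add(w)
    return list(classes.values())


# Hypothetical links chosen by the per-word decoding step.
words = [
    ("C1", "phosphorylate"), ("C2", "phosphorylation"), ("C3", "phosphorylation"),
    ("C1", "Chk1"), ("C2", "Chk1"), ("C3", "T68"),
]
links = [
    (("C1", "phosphorylate"), ("C2", "phosphorylation")),
    (("C2", "phosphorylation"), ("C3", "phosphorylation")),
    (("C1", "Chk1"), ("C2", "Chk1")),
]
classes = equivalence_classes(words, links)
```

Here the C1 and C3 words end up in the same class even though no link joins them directly, which is the consistency the transitive closure provides.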
21
Decoding example
Target C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A
Source C2: Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1
Source C3: Chk2 T68 involved phosphorylation degradation Cdc25A
Later on in the decoding process …
Source C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A
Source C2: Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1
Target C3: Chk2 T68 involved phosphorylation degradation Cdc25A
22
Data sets
3 sets of citances annotated by a PhD with biological training:
Training set: 4 groups, 10 citances each (180 pairs).
Development set: 51 citances (1275 pairs).
Test set: 45 citances (990 pairs).
Feature engineering used the training and development sets. Final results are based on a model trained on the training and development sets combined, and tested on the test set.
Baseline: normalized edit distance only, with a simple cutoff.
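The pair counts follow from counting unordered citance pairs. The short check below assumes that training pairs are formed only within each group of 10, which is an inference from the figure of 180 rather than something the slide states.

```python
from math import comb

# Training: 4 groups of 10 citances, pairs counted within each group (assumption).
training_pairs = 4 * comb(10, 2)

# Development and test: all unordered pairs within one set.
development_pairs = comb(51, 2)
test_pairs = comb(45, 2)

print(training_pairs, development_pairs, test_pairs)  # 180 1275 990
```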
23
Features for MCA
Orthographic features: exact string match, normalized edit distance, prefix and suffix match, word lengths, capitalization.
Local contextual features: distance between target words of adjacent source words, word-specific tendency to align like the previous/next word, transitions to, from, and between (un)aligned words.
Biological ontology based features: Medical Subject Headings (MeSH), gene synonyms (Entrez Gene, UniProt, OMIM).
Lexical features: WordNet similarity (Lin, 1998).
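A few of the orthographic features could be computed along these lines. This is a sketch under stated assumptions: the feature names, the affix length of 3, and normalizing the edit distance by the longer word's length are illustrative choices, not the definitions used in the actual system.

```python
from typing import Dict, Union


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def orthographic_features(a: str, b: str,
                          affix_len: int = 3) -> Dict[str, Union[bool, int, float]]:
    """Pairwise features for a candidate word pair (names are hypothetical)."""
    return {
        "exact_match": a == b,
        "norm_edit_distance": normalized_edit_distance(a, b),
        "prefix_match": a[:affix_len].lower() == b[:affix_len].lower(),
        "suffix_match": a[-affix_len:].lower() == b[-affix_len:].lower(),
        "len_a": len(a),
        "len_b": len(b),
        "both_capitalized": a[:1].isupper() and b[:1].isupper(),
    }


features = orthographic_features("phosphorylate", "phosphorylation")
```

For this pair the edit distance is 3 over a longest length of 15, giving a normalized distance of 0.2. A baseline of the kind described in the data sets slide would align two words whenever this value falls below a chosen cutoff.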
24
Results on pairwise alignments
Unlike Viterbi decoding, posterior decoding (PD) enables refined control of the recall/precision tradeoff. Viterbi_Union (0.531 recall at […] precision) is comparable to PD with match-factor and gap-factor set to 1 (0.540 recall at […] precision). However, PD allows recall to be increased significantly by increasing match-factor and decreasing gap-factor (0.636 recall at […] precision for match-factor = 1.2 and gap-factor = 0.1, or […] recall at […] precision for match-factor = 1.5 and gap-factor = 0.05).
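For reference, recall and precision over alignment links can be computed as in the sketch below. Representing each undirected link as a frozenset of two word positions is an assumption for illustration, and the toy numbers are invented, not results from the talk.

```python
from typing import FrozenSet, Hashable, Set, Tuple

Link = FrozenSet[Hashable]  # an undirected link between two word positions


def precision_recall_f(predicted: Set[Link],
                       reference: Set[Link]) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F-measure over sets of links."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(reference) if reference else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def link(a: Hashable, b: Hashable) -> Link:
    return frozenset((a, b))


# Invented toy data: 4 reference links, 3 predicted links, 2 of them correct.
reference = {link("a1", "b1"), link("a2", "b2"), link("a3", "b3"), link("a4", "b4")}
predicted = {link("b1", "a1"), link("a2", "b2"), link("a3", "b4")}

precision, recall, f_measure = precision_recall_f(predicted, reference)
```

Here precision is 2/3, recall is 1/2, and the F-measure is 4/7. Sweeping the two decoding parameters and recomputing these values is one way to trace out a recall/precision curve.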
25
Results on MCA
The two curves overlap in the range between 0.52 and 0.55 recall (0.84 and 0.9 precision). Orthographic similarity is the dominant feature in this range. Unlike the baseline, the CRF+PD system keeps improving recall without a sharp drop in precision, up to […] recall at […] precision. This is due to the incorporation of multiple overlapping features. The CRF+PD system also achieves better precision than the baseline (0.982 precision at […] recall vs. […] precision at […] recall).
26
Error analysis
Performed error analysis on the MCA with the best F-measure (0.690). Out of 1400 unique errors, 1194 (85.3%) are false negatives and 206 (14.7%) are false positives. Most errors are due to misalignment of subtypes (cdc, cdc6, cdc25A), opposites (phosphorylated and unphosphorylated), and complex entities (cell cycle and cell line). Many FN errors are due to not aligning entities in only 4 equivalence classes (e.g., 97 FN in the class of motif, site and domain). Other types of errors: not aligning plural and singular forms of the same entities, aligning only part of multi-word entities, and incorrectly aligning orthographically similar entities.
27
Contributions
Defined the MCA problem.
Developed a posterior-decoding algorithm for MCA. Advantages of posterior decoding over Viterbi: it directly optimizes the expected (metric-based) utility, and it gives control of the recall/precision tradeoff.
Developed AMA for MCA: a metric-based accuracy measure for MCA that balances recall and precision in one measure. The expected AMA can be optimized directly with posterior decoding (unlike AER or F-measure). It can also be used for SMT alignments.