Ariel Schwartz, Anna Divoli, and Marti Hearst

Presentation transcript:

Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Decoding Ariel Schwartz, Anna Divoli, and Marti Hearst University of California, Berkeley Supported in part by NSF DBI 0317510

Bioscience literature Rich, complex, and fast growing. Online full text enables new forms of automatic document analysis, including caption search and citation sentence analysis. Citances Nearly every statement in a bioscience journal article is backed up by a citation. It is common for papers to be cited 30-100 times. The text around the citation tends to state biological facts from the target paper. We term these citation sentences, or citances. Different citances state similar facts in different ways.

Papers are cited for some fact(s) … … until it is the case that many important facts in the field can be found in citation sentences alone!

Using citances Potential applications of citances: creation of training and testing data for semantic analysis, synonym set creation, database curation, document summarization, and information retrieval generally. Nakov, Schwartz and Hearst. Citances: Citation Sentences for Semantic Analysis of Bioscience Text, in the SIGIR'04 Workshop on Search and Discovery in Bioinformatics. All these applications require citance word alignments: aligning together concepts that are semantically related in the context of the target paper. Related concepts can be expressed in several different ways in the citances. We focus here on the multiple citance alignment (MCA) problem.

Example of unaligned citances
“In response to genotoxic stress, Chk1 and Chk2 phosphorylate Cdc25A on N-terminal sites and target it rapidly for ubiquitin-dependent degradation (Mailand et al, 2000, 2002; Molinari et al, 2000; Falck et al, 2001; Shimuta et al, 2002; Busino et al, 2003), which is thought to be central to the S and G2 cell cycle checkpoints (Bartek and Lukas, 2003; Donzelli and Draetta, 2003).”
“Given that Chk1 promotes Cdc25A turnover in response to DNA damage in vivo (Falck et al. 2001; Sorensen et al. 2003) and that Chk1 is required for Cdc25A ubiquitination by SCFβ-TRCP in vitro, we explored the role of Cdc25A phosphorylation in the ubiquitination process.”
“Since activated phosphorylated Chk2-T68 is involved in phosphorylation and degradation of Cdc25A (Falck et al., 2001, Falck et al., 2002; Bartek and Lukas, 2003), we also examined the levels of Cdc25A in 2fTGH and U3A cells exposed to γ-IR.”

Goal: Align similar concepts
Citance 1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A N terminal sites target rapidly ubiquitin dependent degradation thought central S G2 cell cycle checkpoints
Citance 2: Given Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 required Cdc25A ubiquitination SCF beta TRCP vitro explored role Cdc25A phosphorylation ubiquitination process
Citance 3: activated phosphorylated Chk2 T68 involved phosphorylation degradation Cdc25A examined levels Cdc25A 2fTGH U3A cells exposed gamma IR

Multiple citance alignment (MCA) Goal: Partition the citances’ words/phrases into equivalence classes based on “semantic homology”. Orthographic similarity is important but does not always entail semantic homology: “phosphorylate” ~ “phosphorylation”; “cell cycle” ≁ “U3A cells”; “genotoxic stress” ~ “DNA damage”. Related problems: Multiple sequence alignment (MSA) in genomics. Pairwise word alignment in statistical machine translation (SMT).

Formal definition of MCA A pairwise citance alignment of citances $C_i$ and $C_j$ is an equivalence relation $\sim_{ij}$. $c_{ik} \sim_{ij} c_{jl}$ means that the kth word in the ith citance is aligned to the lth word in the jth citance. Multiple citance alignment (MCA) is an equivalence relation $\sim$, which is defined as the transitive closure of the union of all pairwise citance alignments: $\sim \;=\; \bigl(\bigcup_{i,j} \sim_{ij}\bigr)^{+}$. The transitive closure ensures that the equivalence classes (colors) are consistent across all pairwise citance alignments.
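The transitive closure in this definition can be sketched with a union-find structure. This is only an illustration of the definition, not the authors' implementation; the function name `transitive_closure` and the `(citance, word)` index pairs are assumptions made for the example.

```python
from collections import defaultdict


def transitive_closure(pairwise_links):
    """Merge pairwise alignment links into equivalence classes.

    pairwise_links: iterable of ((i, k), (j, l)) pairs, read as
    "word k of citance i is aligned to word l of citance j".
    Returns a sorted list of equivalence classes (each a sorted list).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two endpoints of every pairwise link.
    for a, b in pairwise_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = defaultdict(set)
    for x in list(parent):
        classes[find(x)].add(x)
    return sorted(sorted(members) for members in classes.values())


# Toy links (hypothetical indices): (1,5)~(2,9) and (2,9)~(3,3) imply (1,5)~(3,3).
links = [((1, 5), (2, 9)), ((2, 9), (3, 3)), ((1, 0), (2, 4))]
classes = transitive_closure(links)
```

In this toy input, the first two links end up in one class of three word positions even though citances 1 and 3 were never linked directly, which is the consistency property the slide describes.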

Algorithm outline We developed an MCA algorithm based on: Extension to our posterior decoding algorithm for MSA (AMAP, Schwartz and Pachter ECCB 2006). Modified version of the SMT pairwise word alignment model of Blunsom & Cohn (ACL 2006) for posterior probabilities calculation.

Algorithm outline [Pipeline diagram] Unaligned citances to a target paper → Feature extraction → Posterior probabilities calculation (CRF) → Expected utility maximization (posterior decoding), which also takes the utility function as input → Multiple citance alignment (MCA)

Utility function for MCA Requirements for a good utility function: Correlated to the accuracy measure used for evaluation. Easily decomposable, for direct optimization using posterior-decoding. Metric-based (optional): Captures intuitive notion of distance. Triangle inequality provides bounds on the search space. AER and F-measure do not satisfy these criteria.

Alignment Metric Accuracy (AMA) We extend AMA (Schwartz et al 2006), a utility function for one-to-one MSA, to many-to-many MCA. Intuitively, $U_{\mathrm{AMA}}$ measures the average word-level agreement between the predicted and reference MCAs. $U_{\mathrm{set\_agreement}}$ is a “score” assigned to each word position based on the overlap between the sets of word positions that the two alignments align to it. Dice, Jaccard, or Hamming could be used, for example. We use the Braun-Blanquet coefficient.
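A minimal sketch of this word-level agreement score follows, assuming the standard Braun-Blanquet coefficient |A ∩ B| / max(|A|, |B|). The treatment of words that both alignments leave unaligned (scored 1.0 here) and the names `braun_blanquet` and `ama` are assumptions for illustration. The sketch ignores the recall/precision control parameters and is not the authors' exact utility function.

```python
def braun_blanquet(a, b):
    """Braun-Blanquet set similarity: |A & B| / max(|A|, |B|)."""
    if not a and not b:
        # Assumption: both alignments leaving a word unaligned is full agreement.
        return 1.0
    return len(a & b) / max(len(a), len(b))


def ama(predicted, reference, positions):
    """Average per-word agreement between two alignments.

    predicted / reference map a word position to the set of positions
    aligned to it; positions missing from a map are treated as unaligned.
    """
    total = sum(
        braun_blanquet(predicted.get(p, set()), reference.get(p, set()))
        for p in positions
    )
    return total / len(positions)


# Toy example with hypothetical word positions "a".."d".
positions = ["a", "b", "c", "d"]
predicted = {"a": {"x"}, "b": {"x", "y"}, "c": set()}
reference = {"a": {"x"}, "b": {"x"}, "c": {"z"}}
score = ama(predicted, reference, positions)  # (1 + 0.5 + 0 + 1) / 4
```

Because the score is a plain average of per-word terms, it decomposes over word positions, which is the property the previous slide asks of a utility function.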

Example of AMA for MCA Every word gets a score between 0 and 1 based on its level of agreement with the reference alignment. AMA is the average word score. In this example AMA = 13.83 / 20 = 0.692. Sum of pairs is used for multiple alignments.

Controlling the recall/precision tradeoff In addition, two free parameters (match-factor α and gap-factor δ) are added in order to provide control of the recall/precision tradeoff. The result is the following utility function:

Motivation for using a CRF model Only small annotated sets are available for training, development, and testing. The main challenge is to perform well on unseen words. This requires a discriminative model that can use different overlapping features, can incorporate contextual information, and allows for computation of posterior probabilities.

CRF-based SMT word alignment Blunsom and Cohn (ACL 2006) developed a CRF-based pairwise word alignment model for SMT. It is a directional model: every source word can be mapped to zero or one target words. It uses Viterbi decoding. Features are functions of the implied source-target word pairs. We modified the program to support MCA: computation of the directional marginal posterior probabilities using the forward-backward algorithm; modified features; and implementation of a posterior-decoding algorithm for MCA instead of the Viterbi decoding used for pairwise SMT word alignment.
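To illustrate how forward-backward yields marginal posteriors rather than a single best path, here is a generic linear-chain sketch. It is not the Blunsom and Cohn model or its feature set. The potentials `node` and `trans` are made-up numbers, and reading positions as source words and states as candidate target words (or a null alignment) is only one possible interpretation.

```python
def forward_backward(node, trans):
    """Marginal posteriors P(y_t = s) for a linear-chain model.

    node[t][s]:  non-negative potential of state s at position t
    trans[r][s]: non-negative potential of moving from state r to state s
    """
    T, S = len(node), len(node[0])
    fwd = [[0.0] * S for _ in range(T)]
    bwd = [[0.0] * S for _ in range(T)]

    # Forward pass: total weight of all prefixes ending in state s at t.
    fwd[0] = list(node[0])
    for t in range(1, T):
        for s in range(S):
            fwd[t][s] = node[t][s] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(S)
            )

    # Backward pass: total weight of all suffixes following state r at t.
    bwd[T - 1] = [1.0] * S
    for t in range(T - 2, -1, -1):
        for r in range(S):
            bwd[t][r] = sum(
                trans[r][s] * node[t + 1][s] * bwd[t + 1][s] for s in range(S)
            )

    z = sum(fwd[T - 1])  # partition function
    return [[fwd[t][s] * bwd[t][s] / z for s in range(S)] for t in range(T)]


# Toy potentials (hypothetical): 3 positions, 2 states.
node = [[1.0, 2.0], [0.5, 1.5], [3.0, 1.0]]
trans = [[2.0, 1.0], [1.0, 2.0]]
post = forward_backward(node, trans)
```

Each row of `post` sums to 1. Unlike a Viterbi path, these per-position probabilities can be fed into an expected-utility computation. For long sequences a log-space or rescaled variant would be needed to avoid underflow.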

Posterior decoding algorithm for MCA For every pair of citances, compute the directional posterior probabilities using a CRF. For every target word w, compute the combination of source words that maximizes the expected utility of w. The (undirected) multiple word alignment is produced by taking the transitive closure of the union of the individual words' optimal alignments.

Decoding Example [Figure] First panel: C1 is the target citance, with C2 and C3 as source citances. Later on in the decoding process: C3 is the target citance, with C1 and C2 as source citances.
C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A
C2: Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1
C3: Chk2 T68 involved phosphorylation degradation Cdc25A

Data sets Three sets of citances annotated by a PhD with biological training: Training set: 4 groups, 10 citances each (180 pairs). Development set: 51 citances (1275 pairs). Test set: 45 citances (990 pairs). Feature engineering used the training and development sets. Final results are based on a model trained on the training and development sets combined, and tested on the test set. Baseline: using only normalized edit distance with a simple cutoff.
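A baseline of this kind could look roughly like the sketch below. The slides do not say how the edit distance is normalized or what cutoff was used. Dividing by the longer word's length, lower-casing, and the cutoff value 0.4 are therefore assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def baseline_align(source, target, cutoff=0.4):
    """Align every word pair whose normalized distance is within the cutoff."""
    return [(s, t) for s in source for t in target
            if normalized_edit_distance(s.lower(), t.lower()) <= cutoff]


source = ["phosphorylate", "Cdc25A", "stress"]
target = ["phosphorylation", "Cdc25A", "damage"]
pairs = baseline_align(source, target)
```

With these toy inputs the baseline links "phosphorylate" to "phosphorylation" and "Cdc25A" to itself, but it cannot link "stress" to "damage". That is the kind of semantically related but orthographically dissimilar pair the additional features are meant to capture.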

Features for MCA Orthographic features: exact string match, normalized edit distance, prefix and suffix match, word lengths, capitalization. Local contextual features: distance between target words of adjacent source words, word-specific tendency to align like the previous/next word, transitions to, from, and between (un)aligned words. Biological ontology based features: Medical Subject Headings (MeSH), gene synonyms (Entrez Gene, UniProt, OMIM). Lexical features: WordNet similarity (Lin, 1998).

Results on pairwise alignments Unlike Viterbi decoding, posterior-decoding (PD) enables refined control of the recall/precision tradeoff. Viterbi_Union (0.531 recall at 0.913 precision) is comparable to PD with α and δ set to 1 (0.540 recall at 0.909 precision). However, PD allows recall to be increased significantly by increasing α and decreasing δ (0.636 recall at 0.517 precision for α = 1.2 and δ = 0.1, or 0.742 recall at 0.198 precision for α = 1.5 and δ = 0.05).

Results on MCA The two curves overlap in the range between 0.52 and 0.55 recall (0.84 and 0.9 precision). Orthographic similarity is the dominant feature in this range. Unlike the baseline, the CRF+PD system keeps improving recall without a sharp drop in precision, up to 0.636 recall at 0.748 precision. This is due to the incorporation of multiple overlapping features. The CRF+PD system also achieves better precision than the baseline (0.982 precision at 0.381 recall vs. 0.937 precision at 0.346 recall).

Error analysis We performed error analysis on the MCA with the best F-measure (0.690). Out of 1400 unique errors, 1194 (85.3%) are false negatives and 206 (14.7%) are false positives. Most errors are due to misalignment of subtypes (cdc, cdc6, cdc25A), opposites (phosphorylated and unphosphorylated), and complex entities (cell cycle and cell line). Many FN errors are due to not aligning entities in only 4 equivalence classes (e.g., 97 FN in the class of motif, site and domain). Other types of errors: not aligning plural and singular forms of the same entities, aligning only part of multi-word entities, and incorrectly aligning orthographically similar entities.

Contributions Defined the MCA problem. Developed a posterior-decoding algorithm for MCA. Advantages of posterior-decoding over Viterbi: directly optimizes the expected (metric-based) utility; allows control of the recall/precision tradeoff. Developed AMA for MCA: a metric-based accuracy measure for MCA that balances recall and precision in one measure. The expected AMA can be optimized directly with posterior-decoding (unlike AER or F-measure). It can also be used for SMT alignments.