New Search Tools for Bioscience Journal Articles
Marti Hearst, UC Berkeley School of Information
UIUC Comp-Bio Seminar, February 12, 2007
Supported by NSF DBI-0317510 and a gift from Genentech

UC Berkeley Biotext Project
Outline
  Biotext Project introduction
  Simple abbreviation definition recognition
  Citances
  A new search interface idea

Double Exponential Growth in Bioscience Journal Articles
  Figure from Hunter & Cohen, Molecular Cell 21, 2006

BioText Project Goals
  Provide flexible, useful, appealing search for bioscientists.
  Focus on:
    Full-text journal articles
    New language analysis algorithms
    New search interfaces

Bioscience Text is Challenging
  Complex sentence structure
  Huge vocabulary, including LOTS of abbreviations
  Gene/protein name recognition is a major task
  Full-text documents have complex structure: which parts are key?

BioText Architecture
  Sophisticated text analysis
  Annotations in database
  Improved search interface

Project Team
  Project leaders: Marti Hearst (PI), Adam Arkin (Co-PI)
  Computational linguistics and databases: Preslav Nakov, Jerry Ye, Ariel Schwartz (alum), Brian Wolf (alum), Barbara Rosario (alum), Gaurav Bhalotia (alum)
  User interface / IR: Mike Wooldridge, Rowena Luk (alum), Dr. Emilia Stoica (alum)
  Bioscience: Dr. Anna Divoli, Janice Hamerja (alum), Dr. TingTing Zhang (alum)

The Problem: Identify Acronym Definitions
  methyl methanesulfonate sulfate (MMS)
  heat shock transcription factor (HSF)
  Gcn5-related N-acetyltransferase (GNAT)
  We investigated the redox regulation of the stress response and report here that in the human pre-monocytic line U937 cells, H2O2 induced a concentration-dependent transactivation and DNA-binding activity of heat-shock factor-1 (HSF-1)

Identifying Acronym Definitions
  To identify (short form, long form) pairs from biomedical text:
    The short form is an abbreviation of the long form
    There exists a character mapping from the short form to the long form
  Example: Gcn5-related N-acetyltransferase (GNAT)
  A non-trivial problem:
    Words in the long form may be skipped
    Internal letters in the long form may be used

Previous Work
  Machine learning approaches
    Linear regression (Chang et al.)
    Encoding and compression (Yeates et al.)
    Cubic time or worse
  Heuristic approach
    Rule-based (Park & Byrd)
    Factors considered include:
      Distance between definition and abbreviation
      Number of stop words
      Capitalization
    Can't reproduce this algorithm

Step 1: Identifying Candidates
  Consider only two cases:
    long form '(' short form ')'
    short form '(' long form ')'
  Short form:
    No more than 2 words
    Between 2 and 10 characters
    At least one letter
    First character alphanumeric
  Long form:
    Adjacent to the short form
    No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
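The candidate constraints above can be sketched as two small checks. This is a minimal illustration in Python, not the project's code: the function names are hypothetical, |A| is taken to be the character count of the short form, and counting an internal space toward the 2-to-10 character limit is an assumption.

```python
def is_candidate_short_form(text):
    """Apply the short-form constraints listed on the slide (hypothetical helper)."""
    words = text.split()
    return (
        1 <= len(words) <= 2                   # no more than 2 words
        and 2 <= len(text) <= 10               # between 2 and 10 characters (spaces counted: assumption)
        and any(ch.isalpha() for ch in text)   # at least one letter
        and text[0].isalnum()                  # first character alphanumeric
    )


def max_long_form_words(short_form):
    """Upper bound on long-form length in words: min(|A| + 5, |A| * 2)."""
    size = len(short_form)                     # |A| assumed to be the character count
    return min(size + 5, size * 2)
```

For a three-character short form such as HSF the bound is 6 words; for longer short forms the |A| + 5 term takes over.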

Step 2: Identifying Correct Long Forms
  heat shock transcription factor (HSF)
  [heat shock transcription factor] (HSF)

Step 2: Identifying Correct Long Forms
  Gcn5-related N-acetyltransferase (GNAT)

Step 2: Identifying Correct Long Forms
  From right to left, find the shortest long form that matches the short form:
    Each character in the short form must match a character in the long form
    The character at the beginning of the short form must match a character in the initial position of the first word in the long form
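The right-to-left matching rule can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the talk's own listing: the function name is hypothetical, matching is case-insensitive, non-alphanumeric characters in the short form are skipped, and the candidate text is assumed to be the words immediately before the parenthesis.

```python
def find_best_long_form(short_form, candidate_text):
    """Return the shortest suffix of candidate_text that matches short_form, or None."""
    l_index = len(candidate_text) - 1
    # Walk the short form from its last character to its first.
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # ignore punctuation such as hyphens in the short form
        # Move left until the character matches; for the first short-form
        # character, additionally require that it starts a word.
        while (l_index >= 0 and candidate_text[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate_text[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None  # ran out of text before matching every character
        l_index -= 1
    # Extend to the start of the word holding the first matched character.
    start = candidate_text.rfind(" ", 0, l_index + 1) + 1
    return candidate_text[start:]
```

On "activity of heat shock transcription factor" with the short form HSF, the h in "shock" is rejected because it is not word-initial, so the match extends back to "heat".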

Java Code for Finding the Best Long Form for a Given Short Form

Evaluation
  1000 randomly selected MEDLINE abstracts: 82% recall, 95% precision
  Medstract Gold Standard Evaluation Corpus: 82% recall, 96% precision
  Compared with:
    83% recall, 80% precision (Chang et al., linear regression)
    72% recall, 98% precision (Pustejovsky et al., heuristics)

Missing Pairs
  Skipped characters in short form
  No match
  Out of order
  Partial match

Other NLP Work
  Relation labeling (work primarily by Barbara Rosario)
  Protein-protein interactions: which ones are happening?
    "They also demonstrate that the GAG protein from membrane-containing viruses, such as HIV, binds to Alix / AIP1, thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface [cite]."
  Distinguished among 10 different relations
    Binds, degrades, synergizes with, upregulates …
  Simple supervised approach gets surprisingly high results (~60% accuracy)

Acquiring Labeled Data using Citances

A discovery is made … A paper is written …

That paper is cited … and cited … … as the evidence for some fact(s) F.

Each of these in turn is cited for some fact(s) … … until it is the case that all important facts in the field can be found in citation sentences alone!

Citances
  Nearly every statement in a bioscience journal article is backed up with a cite.
  It is quite common for papers to be cited 30-100 times.
  The text around the citation tends to state biological facts. (Call these citances.)
  Different citances will state the same facts in different ways …
  … so can we use these for creating models of language expressing semantic relations?

Using Citances
  Potential uses of citation sentences (citances):
    Creation of training and testing data for semantic analysis
    Synonym set creation
    Database curation
    Document summarization
    Information retrieval generally
  All of the above require citance word alignments.

Sample Citance
  “Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1-p53 complex formation [70].”

Related Work
  Traditional citation analysis dates back to the 1960s (Garfield). It includes:
    Citation categorization
    Context analysis
    Citer motivation
  Citation indexing systems, such as ISI’s SCI, and CiteSeer.
  Mercer and Di Marco (2004) propose to improve citation indexing using citation types.
  Bradshaw (2003) introduces Reference Directed Indexing (RDI), which indexes documents using the terms in the citances citing them.

Related Work (cont.)
  Teufel and Moens (2002) identify citances to improve summarization of the citing paper. They give lower weight to citances as candidate sentences for summarization.
  Nanba et al. (2000) use citances as features for classifying papers into topics.
  A field related to citation indexing is the use of link structure and anchor text of Web pages. Applications include IR, classification, Web crawlers, and summarization.
  See the full paper for references.

Issues for Processing Citances
  Text span
    Identification of the appropriate phrase, clause, or sentence that constructs a citance.
    Correct mapping of citations when shown as lists or groups (e.g., “[22-25]”).
  Grouping citances by topic
    Citances that cite the same document should be grouped by the facts they state.
  Normalizing or paraphrasing concepts in citances
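As one small illustration of the citation-mapping issue, a grouped marker such as “[22-25]” has to be expanded into individual reference numbers before each citance can be attached to the paper it cites. The helper below is a hypothetical sketch that assumes numeric, bracketed markers with comma-separated items and hyphen or en-dash ranges; real articles use many other citation styles.

```python
import re


def expand_citation_group(marker):
    """Expand a marker such as "[22-25]" or "[67,68]" into reference numbers."""
    inner = marker.strip().strip("[]")
    numbers = []
    for part in inner.split(","):
        part = part.strip()
        # A range is two integers joined by a hyphen or an en dash.
        match = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            numbers.extend(range(start, end + 1))
        elif part.isdigit():
            numbers.append(int(part))
    return numbers
```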

How Do Citances Differ From Abstracts?
  (This part primarily by Anna Divoli.)
  We did a detailed analysis of facts that appear in citances.
    6 target papers, molecular interactions domain
  We did the same for the abstracts of the target papers.

Distributions of Concept Types

How Do Citances Differ From Abstracts?
  Main results:
    All of the facts in the abstract are covered by the citances (collectively).
    However, some facts in citances do not appear in the abstracts.
      Mainly Entities and Experimental Methods
  This suggests there is important information in the full text that is not represented by the abstract, title, and metadata alone.

Paraphrasing Citances
  (This part primarily by Preslav Nakov.)
  Problem: many citances say the same thing in different ways.
  The sentence structure is very complex and contains irrelevant information.
  We want to first “normalize” those citances that talk about similar things, so we can then determine which sentences repeat the same information.
  This will then allow us to determine what the key points are and thus convert them into summaries.

Want to Normalize These:
  NGF withdrawal from sympathetic neurons induces Bim, which then contributes to death.
  Nerve growth factor withdrawal induces the expression of Bim and mediates Bax dependent cytochrome c release and apoptosis.
  Recently, Bim has been shown to be upregulated following both nerve growth factor withdrawal from primary sympathetic neurons, and serum and potassium withdrawal from granule neurons.
  The proapoptotic Bcl-2 family member Bim is strongly induced in sympathetic neurons in response to NGF withdrawal.
  In neurons, the BH3 only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by nerve growth factor deprivation.

The Resulting Paraphrases
  NGF withdrawal induces Bim.
  Nerve growth factor withdrawal induces the expression of Bim.
  Bim has been shown to be upregulated following nerve growth factor withdrawal.
  Bim is induced in sympathetic neurons in response to NGF withdrawal.
  Bim implicated in apoptosis caused by nerve growth factor deprivation.
  They all paraphrase: Bim is induced after NGF withdrawal.

Paraphrase Creation Algorithm
  1. Extract the sentences that cite the target.
  2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize.
  3. Dependency parse.
  4. For each parse, for each pair of NEs of interest:
     i. Extract the path between them.
     ii. Create a paraphrase from the path.
  5. Rank the candidates for a given pair of NEs.
  6. Select only the ones above a threshold.
  7. Generalize.
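Step 4(i), extracting the path between two named entities, can be illustrated on a toy dependency tree. The sketch below assumes the parse is given as a list of head indices (a hypothetical representation, with -1 marking the root) and uses a hand-built parse of "NGF withdrawal induces Bim"; it is not tied to any particular parser and is not the project's implementation.

```python
def dependency_path(heads, a, b):
    """Return token indices on the tree path from token a to token b.

    heads[i] is the index of token i's head, or -1 for the root.
    """
    def ancestors(i):
        chain = [i]
        while heads[i] != -1:
            i = heads[i]
            chain.append(i)
        return chain

    up_a, up_b = ancestors(a), ancestors(b)
    common = next(n for n in up_a if n in up_b)   # lowest common ancestor
    left = up_a[: up_a.index(common) + 1]         # a up to the common ancestor
    right = up_b[: up_b.index(common)]            # b up to just below it
    return left + right[::-1]


# Hand-built toy parse (an assumption for illustration):
# NGF -> withdrawal -> induces (root) <- Bim
tokens = ["NGF", "withdrawal", "induces", "Bim"]
heads = [1, 2, -1, 2]
path = dependency_path(heads, 0, 3)
path_words = [tokens[i] for i in path]
```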

Creating a Paraphrase
  Given the path from the dependency parse:
  1. Restore the original word order.
  2. Add words to improve grammaticality.
  Path: Bim … shown … be … following nerve growth factor withdrawal.
  Paraphrase: Bim [has] [been] shown [to] be [upregulated] following nerve growth factor withdrawal.

Creating a Paraphrase
  Given the path from the dependency parse:
  1. Restore the original word order.
  2. Add words to improve grammaticality.
     Complex verb forms: passive, infinitive, past, etc.
     Lin & Pantel and Ibrahim et al. manipulate the parser’s output.
     We use the “2-word heuristic”: if the path extracted from the dependency parse skips over either one or two words, those one or two words are inserted back into the paraphrase, unless those words are adverbs.
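A minimal sketch of the 2-word heuristic, under stated assumptions: the sentence is pre-tokenized, part-of-speech tags are supplied by hand using Penn Treebank labels, the path is given as token indices, and the adverb test is applied to each skipped word individually. The function name and the tag set are illustrative choices, not the project's code.

```python
ADVERB_TAGS = {"RB", "RBR", "RBS"}  # Penn Treebank adverb tags (assumed tag set)


def apply_two_word_heuristic(tokens, pos_tags, path_indices):
    """Rebuild a paraphrase from path tokens, re-inserting short skipped gaps."""
    kept = sorted(set(path_indices))  # step 1: restore the original word order
    output = []
    for position, index in enumerate(kept):
        if position > 0:
            gap = range(kept[position - 1] + 1, index)
            # Step 2: gaps of one or two words are put back, minus adverbs.
            if 1 <= len(gap) <= 2:
                output.extend(tokens[i] for i in gap if pos_tags[i] not in ADVERB_TAGS)
        output.append(tokens[index])
    return " ".join(output)


sentence = ("The proapoptotic Bcl-2 family member Bim is strongly induced in "
            "sympathetic neurons in response to NGF withdrawal")
tokens = sentence.split()
# Hand-assigned tags for illustration only.
pos_tags = "DT JJ NN NN NN NNP VBZ RB VBN IN JJ NNS IN NN TO NNP NN".split()
# Assumed dependency path: Bim, induced, in, neurons, in, response, to, NGF, withdrawal.
path = [5, 8, 9, 11, 12, 13, 14, 15, 16]
paraphrase = apply_two_word_heuristic(tokens, pos_tags, path)
```

Here "is" and "sympathetic" are restored, while the adverb "strongly" stays out, which reproduces the paraphrase "Bim is induced in sympathetic neurons in response to NGF withdrawal".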

2-word Heuristic Demonstration
  NGF withdrawal induces Bim.
  Nerve growth factor withdrawal induces [the] expression of Bim.
  Bim [has] [been] shown [to] be [upregulated] following nerve growth factor withdrawal.
  Bim [is] induced in [sympathetic] neurons in response to NGF withdrawal.
  member Bim implicated in apoptosis caused by nerve growth factor deprivation.

Evaluation (1)
  An influential journal paper from Neuron:
    J. Whitfield, S. Neame, L. Paquet, O. Bernard, and J. Ham. Dominant-negative c-jun promotes neuronal survival by reducing bim expression and inhibiting mitochondrial cytochrome c release. Neuron, 29:629–643, 2001.
  99 journal papers citing it
  203 citances in total
  36 different types of important biological factoids
  But we concentrated on one of them: “Bim is induced after NGF withdrawal.”

Evaluation (2)
  Set 1: 67 citances pointing to the target paper and manually found to contain a good or acceptable paraphrase (they do not necessarily contain Bim or NGF).
  Set 2: 65 citances pointing to the target paper and containing both Bim and NGF.
  Set 3: 102 sentences from the 99 texts, containing both Bim and NGF.
  Cluster: all 203 citances; spectral clustering with a polynomial kernel; clusters for which more than 80% of the citances include both NGF and Bim.
  Set 1: assess the system under ideal conditions.
  Set 2 vs. Set 3: do citances produce better paraphrases?

Results
  % = good (1.0) or acceptable (0.5)

The Citance Fact Extraction Problem
  (This part primarily by Ariel Schwartz.)
  Find groups of words/phrases that are semantically “similar” in the target paper’s context.
  Orthographic similarity is important but does not always entail semantic similarity.
  This is another step needed for normalizing the content.
  Can use the results of this algorithm to determine which entities to use in the paraphrasing just described.

Example of original citances

Entities Identified and Labeled as Equivalent to One Another
  response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A N terminal sites target rapidly ubiquitin dependent degradation thought central S G2 cell cycle checkpoints
  Given Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 required Cdc25A ubiquitination SCF beta TRCP vitro explored role Cdc25A phosphorylation ubiquitination process
  activated phosphorylated Chk2 T68 involved phosphorylation degradation Cdc25A examined levels Cdc25A 2fTGH U3A cells exposed gamma IR

Features for Citance Word Alignment
  Orthographic features
    Exact string match, normalized edit distance, prefix match, suffix match, word lengths, capitalization.
  Local contextual features
    Distance between target words of adjacent source words.
    Word-specific tendency to align like the previous/next word.
    Transition to, from, and between (un)aligned words.
  Biological ontology based features
    Medical Subject Headings (MeSH).
    Gene synonyms (Entrez Gene, Uniprot, OMIM).
  Lexical features
    Wordnet similarity (Lin, 1998).
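The orthographic features are simple enough to sketch directly. The code below is illustrative only: the function names and the feature dictionary are hypothetical, and dividing the Levenshtein distance by the longer word's length is one common normalization, assumed here because the slide does not say which one was used.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def common_prefix_length(a, b):
    count = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        count += 1
    return count


def orthographic_features(a, b):
    """Collect the orthographic features named on the slide for one word pair."""
    return {
        "exact_match": a == b,
        "normalized_edit_distance": normalized_edit_distance(a, b),
        "prefix_match": common_prefix_length(a, b),
        "suffix_match": common_prefix_length(a[::-1], b[::-1]),
        "lengths": (len(a), len(b)),
        "same_initial_capitalization": a[:1].isupper() == b[:1].isupper(),
    }
```

Chk1 and Chk2 differ by a single substitution out of four characters, which shows why orthographic closeness alone cannot decide whether two words should align.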

Approach: Posterior Decoding
  Use Conditional Random Fields.
  Compute posterior probabilities using EM.
  For every target word w, compute the combination of source words that maximizes the expected score of w.
  Take the union of individual word optimal alignments and produce a multiple alignment.
  Use a match-factor to reward/penalize a combination based on the number of words that align to the same target word.

Data Sets
  3 sets of citances annotated by a PhD with biological training (Anna Divoli):
    Training set: 4 groups, 10 citances each (360 pairs).
    Development set: 51 citances (2550 pairs).
    Test set: 45 citances (1980 pairs).
  Feature engineering was done using the training and development sets.
  Final results are based on a model trained on the training and development sets combined, and tested on the test set.
  Baseline: using only normalized edit distance with a simple cutoff.
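The pair counts are consistent with counting ordered pairs of distinct citances within each group, n × (n − 1). That reading is inferred from the numbers rather than stated on the slide; the quick check below uses hypothetical names.

```python
def ordered_pairs(n):
    """Number of ordered pairs of distinct items among n items."""
    return n * (n - 1)


training_pairs = 4 * ordered_pairs(10)   # 4 groups of 10 citances each
development_pairs = ordered_pairs(51)    # one group of 51 citances
test_pairs = ordered_pairs(45)           # one group of 45 citances
```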

Results

A Full Text Search Interface
  (This work in part by Mike Wooldridge and Jerry Ye.)

The Importance of Figures and Captions
  Observations of biologists’ reading habits:
    It has often been observed that biologists focus on figures + captions along with title and abstract.
  KDD Cup 2002:
    The objective was to extract only the papers that included experimental results regarding expression of gene products, and to identify, from all the genes mentioned in each document, the genes and products for which experimental results were provided.
    ClearForest + Celera did well in part by focusing on figure captions, which contain critical experimental evidence.

Our Idea
  Make a full-text search engine for journal articles that focuses on showing figures.
  Make it possible to search over caption text (and text that refers to captions).
  Try to group the figures intelligently.

BioFigure Search Interface
  We’ve indexed the open access journal article collection:
    ~130 journals
    ~20,000 articles
    ~80,000 figures
  We’ve built a figure/caption labeling tool to create training data:
    Image types
    Comparison or not?
  We’ve made a start at a search interface.
    Right now the figure grouping facility is very crude.
    We are going to add faceted navigation (Flamenco).

Interested in Helping?
  We need figure labeling help!
  We need user feedback!
  Contact me, or send email to: divoli@sims.berkeley.edu
  More information: biotext.berkeley.edu
