New Search Tools for Bioscience Journal Articles Marti Hearst, UC Berkeley School of Information UIUC Comp-Bio Seminar February 12, 2007 Supported by NSF.

New Search Tools for Bioscience Journal Articles
Marti Hearst, UC Berkeley School of Information
UIUC Comp-Bio Seminar, February 12, 2007
Supported by NSF DBI and a gift from Genentech

UC Berkeley Biotext Project
Outline
- Biotext Project Introduction
- Simple Abbreviation Definition Recognition
- Citances
- A New Search Interface Idea

Double Exponential Growth in Bioscience Journal Articles
From Hunter & Cohen, Molecular Cell 21, 2006

BioText Project Goals
Provide flexible, useful, appealing search for bioscientists.
Focus on:
- Full text journal articles
- New language analysis algorithms
- New search interfaces

Bioscience Text is Challenging
- Complex sentence structure
- Huge vocabulary, including LOTS of abbreviations
- Gene/protein name recognition is a major task
- Full text documents have complex structure: which parts are key?

BioText Architecture
- Sophisticated Text Analysis
- Annotations in Database
- Improved Search Interface

Project Team
- Project Leaders: PI: Marti Hearst; Co-PI: Adam Arkin
- Computational Linguistics and Databases: Preslav Nakov, Jerry Ye, Ariel Schwartz (alum), Brian Wolf (alum), Barbara Rosario (alum), Gaurav Bhalotia (alum)
- User Interface / IR: Mike Wooldridge, Rowena Luk (alum), Dr. Emilia Stoica (alum)
- Bioscience: Dr. Anna Divoli, Janice Hamerja (alum), Dr. TingTing Zhang (alum)

The Problem: Identify Acronym Definitions
- methyl methanesulfonate sulfate (MMS)
- heat shock transcription factor (HSF)
- Gcn5-related N-acetyltransferase (GNAT)
Example sentence: We investigated the redox regulation of the stress response and report here that in the human pre-monocytic line U937 cells, H2O2 induced a concentration-dependent transactivation and DNA-binding activity of heat-shock factor-1 (HSF-1)

Identifying Acronym Definitions
To identify (short form, long form) pairs from biomedical text:
- The short form is an abbreviation of the long form
- There exists a character mapping from the short form to the long form
Example: Gcn5-related N-acetyltransferase (GNAT)
A non-trivial problem:
- Words in the long form may be skipped
- Internal letters in the long form may be used

Previous Work
- Machine learning approaches
  - Linear regression (Chang et al.)
  - Encoding and compression (Yeates et al.)
  - Cubic time or worse
- Heuristic approach
  - Rule-based (Park & Byrd)
  - Factors considered include: distance between definition and abbreviation, number of stop words, capitalization
  - Can’t reproduce this algorithm

Step 1: Identifying Candidates
Consider only two cases:
- long form ‘(‘ short form ‘)’
- short form ‘(‘ long form ‘)’
Short form:
- No more than 2 words
- Between 2 and 10 chars
- At least one letter
- First char alphanumeric
Long form:
- Adjacent to short form
- No more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form
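The candidate filters above are simple enough to write down directly. The following Python sketch is only an illustration of those constraints, not the project's code: the function names are made up here, and |A| is taken to be the character count of the short form.

```python
def is_valid_short_form(short_form: str) -> bool:
    """Check the short-form constraints listed on the slide."""
    words = short_form.split()
    return (
        1 <= len(words) <= 2                         # no more than 2 words
        and 2 <= len(short_form) <= 10               # between 2 and 10 characters
        and any(ch.isalpha() for ch in short_form)   # at least one letter
        and short_form[0].isalnum()                  # first character alphanumeric
    )


def max_long_form_words(short_form: str) -> int:
    """Upper bound on the long form, in words: min(|A| + 5, |A| * 2)."""
    n = len(short_form)  # |A|, assumed here to be the number of characters
    return min(n + 5, n * 2)


def candidate_long_form(text_before_paren: str, short_form: str) -> str:
    """Keep only the words adjacent to '(' that fit within the length bound."""
    words = text_before_paren.split()
    return " ".join(words[-max_long_form_words(short_form):])
```

For a short form such as HSF the bound is min(8, 6) = 6 words, so only the six words immediately before the parenthesis are kept as the long-form candidate.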

Step 2: Identifying Correct Long Forms
heat shock transcription factor (HSF)
[heat shock transcription factor](HSF)

Step 2: Identifying Correct Long Forms
Gcn5-related N-acetyltransferase (GNAT)

Step 2: Identifying Correct Long Forms
Working from right to left, find the shortest long form that matches the short form:
- Each character in the short form must match a character in the long form
- The first character of the short form must match a character in the initial position of the first word of the long form

Java Code for Finding the Best Long Form for a Given Short Form
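The Java listing from this slide is not preserved in the transcript. As a stand-in, here is a Python sketch of the right-to-left matching described in Step 2. It should be read as an approximation of the idea, not the project's actual code; the function name is chosen here for illustration.

```python
from typing import Optional


def find_best_long_form(short_form: str, long_form: str) -> Optional[str]:
    """Return the shortest suffix of long_form that matches short_form, or None."""
    l_index = len(long_form) - 1
    # Walk the short form from its last character to its first.
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue  # punctuation in the short form is ignored
        # Move left in the long form until curr is found. For the first
        # short-form character, also require a word-initial position.
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None  # some short-form character has no match
        l_index -= 1
    # Extend left to the start of the word containing the last match.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]
```

Called with "HSF" and "the heat shock transcription factor", the sketch skips the "h" inside "shock" (not word-initial) and settles on "heat", so the leading "the" is dropped from the long form.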

Evaluation
- 1000 randomly selected MEDLINE abstracts: 82% recall, 95% precision
- Medstract Gold Standard Evaluation Corpus: 82% recall, 96% precision
- Compared with:
  - 83% recall, 80% precision (Chang et al., linear regression)
  - 72% recall, 98% precision (Pustejovsky et al., heuristics)
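One way to compare these precision/recall pairs on a single scale is the F1 score (the harmonic mean of the two). The slides do not report F1; the short Python sketch below only derives it from the figures quoted above, and the dictionary labels are informal names chosen here.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted on the slide.
reported = {
    "this work, MEDLINE abstracts": (0.95, 0.82),
    "this work, Medstract": (0.96, 0.82),
    "linear regression baseline": (0.80, 0.83),
    "heuristic baseline": (0.98, 0.72),
}

# Derived values, not results from the original talk.
scores = {name: round(f1(p, r), 3) for name, (p, r) in reported.items()}
```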

Missing Pairs
- Skipped characters in short form
- No match
- Out of order
- Partial match

Other NLP Work
Relation labeling (work primarily by Barbara Rosario)
- Protein-protein interactions: which ones are happening?
  - Example: They also demonstrate that the GAG protein from membrane-containing viruses, such as HIV, binds to Alix / AIP1, thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface [cite].
- Distinguished among 10 different relations
  - Binds, degrades, synergizes with, upregulates …
- Simple supervised approach gets surprisingly high results (~60% accuracy)

Acquiring Labeled Data using Citances

A discovery is made … A paper is written …

That paper is cited … and cited … … as the evidence for some fact(s) F.

Each of these in turn is cited for some fact(s) … … until it is the case that all important facts in the field can be found in citation sentences alone!

Citances
- Nearly every statement in a bioscience journal article is backed up with a cite.
- It is quite common for papers to be cited many times.
- The text around the citation tends to state biological facts. (Call these citances.)
- Different citances will state the same facts in different ways …
- … so can we use these for creating models of language expressing semantic relations?

Using citances
Potential uses of citation sentences (citances):
- creation of training and testing data for semantic analysis,
- synonym set creation,
- database curation,
- document summarization,
- and information retrieval generally.
All of the above require citance word alignments.

Sample Citance
“Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1-p53 complex formation [70].”

Related Work
- Traditional citation analysis dates back to the 1960s (Garfield). It includes citation categorization, context analysis, and citer motivation.
- Citation indexing systems, such as ISI’s SCI, and CiteSeer.
- Mercer and Di Marco (2004) propose to improve citation indexing using citation types.
- Bradshaw (2003) introduces Reference Directed Indexing (RDI), which indexes documents using the terms in the citances citing them.

Related Work (cont.)
- Teufel and Moens (2002) identify citances to improve summarization of the citing paper. They give lower weight to citances as candidate sentences for summarization.
- Nanba et al. (2000) use citances as features for classifying papers into topics.
- A field related to citation indexing is the use of link structure and anchor text of Web pages. Applications include IR, classification, Web crawlers, and summarization.
- See the full paper for references.

Issues for Processing Citances
- Text span
  - Identification of the appropriate phrase, clause, or sentence that constitutes a citance.
  - Correct mapping of citations when shown as lists or groups (e.g., “[22-25]”).
- Grouping citances by topic
  - Citances that cite the same document should be grouped by the facts they state.
- Normalizing or paraphrasing concepts in citances
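Mapping grouped citations such as “[22-25]” back to individual references is mostly a parsing step. The Python sketch below shows one plausible way to expand such markers. It assumes numeric, bracketed markers with commas and hyphenated ranges; the function name is hypothetical and not taken from the project.

```python
import re
from typing import List


def expand_citation_group(marker: str) -> List[int]:
    """Expand a marker such as "[22-25]" or "[67,68]" into reference numbers."""
    inner = marker.strip().strip("[]")
    numbers: List[int] = []
    for part in inner.split(","):
        part = part.strip()
        # A range like "22-25" (hyphen or en dash) covers both endpoints.
        match = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            numbers.extend(range(start, end + 1))
        elif part.isdigit():
            numbers.append(int(part))
    return numbers
```

Each expanded number can then be linked to the same citance text, so a sentence ending in “[22-25]” counts as a citance for all four cited papers.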

How Do Citances Differ From Abstracts?
(This part primarily by Anna Divoli.)
- We did a detailed analysis of facts that appear in citances.
  - 6 target papers, molecular interactions domain
- We did the same for the abstracts of the target papers.

Distributions of Concept Types

How Do Citances Differ From Abstracts?
Main results:
- All of the facts in the abstract are covered by the citances (collectively).
- However, some facts in citances do not appear in the abstracts.
  - Mainly Entities and Experimental Methods
- This suggests there is important information in the full text that is not represented by the abstract, title, and metadata alone.

Paraphrasing Citances
(This part primarily by Preslav Nakov.)
- Problem: many citances say the same thing in different ways.
- The sentence structure is very complex and contains irrelevant information.
- We want to first “normalize” those citances that talk about similar things, so we can then determine which sentences repeat the same information.
- This will then allow us to determine what the key points are and thus convert them into summaries.

Want to Normalize These:
- NGF withdrawal from sympathetic neurons induces Bim, which then contributes to death.
- Nerve growth factor withdrawal induces the expression of Bim and mediates Bax dependent cytochrome c release and apoptosis.
- Recently, Bim has been shown to be upregulated following both nerve growth factor withdrawal from primary sympathetic neurons, and serum and potassium withdrawal from granule neurons.
- The proapoptotic Bcl-2 family member Bim is strongly induced in sympathetic neurons in response to NGF withdrawal.
- In neurons, the BH3 only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by nerve growth factor deprivation.

The Resulting Paraphrases
- NGF withdrawal induces Bim.
- Nerve growth factor withdrawal induces the expression of Bim.
- Bim has been shown to be upregulated following nerve growth factor withdrawal.
- Bim is induced in sympathetic neurons in response to NGF withdrawal.
- Bim implicated in apoptosis caused by nerve growth factor deprivation.
They all paraphrase: Bim is induced after NGF withdrawal.

Paraphrase Creation Algorithm
1. Extract the sentences that cite the target.
2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize.
3. Dependency parse.
4. For each parse, for each pair of NEs of interest:
   i. Extract the path between them.
   ii. Create a paraphrase from the path.
5. Rank the candidates for a given pair of NEs.
6. Select only the ones above a threshold.
7. Generalize.
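Step 4.i, extracting the path between two named entities in a dependency parse, can be sketched as follows in Python. This is an illustration under assumptions: the parse is represented as a plain map from each token index to its head index (None for the root), and the example head map is hand-written here, not output from any particular parser.

```python
from typing import Dict, List, Optional


def path_to_root(node: int, head: Dict[int, Optional[int]]) -> List[int]:
    """Token indices from node up to the root of the dependency tree."""
    path = [node]
    while head[node] is not None:
        node = head[node]
        path.append(node)
    return path


def dependency_path(a: int, b: int, head: Dict[int, Optional[int]]) -> List[int]:
    """Token indices on the tree path from a to b, via their lowest common ancestor."""
    up_a = path_to_root(a, head)
    up_b = path_to_root(b, head)
    ancestors_of_a = set(up_a)
    # First node on b's way up that also lies on a's way up.
    common = next(n for n in up_b if n in ancestors_of_a)
    left = up_a[: up_a.index(common) + 1]
    right = up_b[: up_b.index(common)]
    return left + right[::-1]


# Hypothetical parse of "NGF withdrawal induces Bim" (heads chosen by hand).
tokens = ["NGF", "withdrawal", "induces", "Bim"]
head = {0: 1, 1: 2, 2: None, 3: 2}
path = dependency_path(0, 3, head)
```

Sorting the returned indices restores the original word order, which is the starting point for turning the path into a paraphrase.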

Creating a Paraphrase
Given the path from the dependency parse:
1. Restore the original word order.
2. Add words to improve grammaticality.
Example:
- Bim … shown … be … following nerve growth factor withdrawal.
- Bim [has] [been] shown [to] be [upregulated] following nerve growth factor withdrawal.

Creating a Paraphrase
Given the path from the dependency parse:
1. Restore the original word order.
2. Add words to improve grammaticality.
   - Complex verb forms: passive, infinitive, past, etc.
   - Lin & Pantel and Ibrahim et al. manipulate the parser’s output.
   - We use the “2-word heuristic”: if the path extracted from the dependency parse skips over either one or two words, those one or two words are inserted back into the paraphrase, unless those words are adverbs.
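A small Python sketch of the 2-word heuristic follows. It assumes the sentence is available as (word, part-of-speech tag) pairs and that adverbs carry Penn Treebank style tags (RB, RBR, RBS). Both the tag set and the function name are assumptions made for illustration. Adverbs are dropped one word at a time, so a gap such as "is strongly" still contributes "is".

```python
from typing import List, Sequence, Tuple


def apply_two_word_heuristic(
    tagged: Sequence[Tuple[str, str]],
    path: Sequence[int],
    adverb_tags: Tuple[str, ...] = ("RB", "RBR", "RBS"),
) -> List[str]:
    """Rebuild a paraphrase from path indices, re-inserting short gaps."""
    kept = sorted(path)  # restore the original word order
    if not kept:
        return []
    output: List[str] = []
    for prev, curr in zip(kept, kept[1:]):
        output.append(tagged[prev][0])
        skipped = range(prev + 1, curr)
        # Only gaps of one or two words are filled back in.
        if 1 <= len(skipped) <= 2:
            output.extend(
                tagged[i][0] for i in skipped if tagged[i][1] not in adverb_tags
            )
    output.append(tagged[kept[-1]][0])
    return output
```

With the path Bim … shown … be … following nerve growth factor withdrawal, the three gaps (has been, to, upregulated) are all one or two words long, so the full clause is restored.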

2-word Heuristic Demonstration
- NGF withdrawal induces Bim.
- Nerve growth factor withdrawal induces [the] expression of Bim.
- Bim [has] [been] shown [to] be [upregulated] following nerve growth factor withdrawal.
- Bim [is] induced in [sympathetic] neurons in response to NGF withdrawal.
- member Bim implicated in apoptosis caused by nerve growth factor deprivation.

Evaluation (1)
An influential journal paper from Neuron:
- J. Whitfield, S. Neame, L. Paquet, O. Bernard, and J. Ham. Dominant-negative c-jun promotes neuronal survival by reducing bim expression and inhibiting mitochondrial cytochrome c release. Neuron, 29:629–643.
- Cited by journal papers containing 203 citances in total
- 36 different types of important biological factoids
- But we concentrated on one of them: “Bim is induced after NGF withdrawal.”

Evaluation (2)
- Set 1: 67 citances pointing to the target paper and manually found to contain a good or acceptable paraphrase (they do not necessarily contain Bim or NGF)
- Set 2: 65 citances pointing to the target paper and containing both Bim and NGF
- Set 3: 102 sentences from the 99 texts that contain both Bim and NGF
- Cluster: all 203 citances
  - Spectral clustering
  - Polynomial kernel
  - Clusters for which more than 80% of the citances include both NGF and Bim
- Set 1: assess the system under ideal conditions.
- Set 2 vs. 3: do citances produce better paraphrases?

Results
Scores in %: each paraphrase is rated good (1.0) or acceptable (0.5)

The Citance Fact Extraction Problem
(This part primarily by Ariel Schwartz.)
- Find groups of words/phrases that are semantically “similar” in the target paper’s context.
- Orthographic similarity is important but does not always entail semantic similarity.
- This is another step needed for normalizing the content.
- Can use the results of this algorithm to determine which entities to use in the paraphrasing just described.

Example of original citances

Entities Identified and Labeled as Equivalent to One Another
response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A N terminal sites target rapidly ubiquitin dependent degradation thought central S G2 cell cycle checkpoints Given Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 required Cdc25A ubiquitination SCF beta TRCP vitro explored role Cdc25A phosphorylation ubiquitination process activated phosphorylated Chk2 T68 involved phosphorylation degradation Cdc25A examined levels Cdc25A 2fTGH U3A cells exposed gamma IR

Features for citance word alignment
- Orthographic features
  - exact string match, normalized edit distance, prefix and suffix match, word lengths, capitalization.
- Local contextual features
  - distance between target words of adjacent source words,
  - word-specific tendency to align like the previous/next word,
  - transition to, from, and between (un)aligned words.
- Biological ontology based features
  - Medical Subject Headings (MeSH),
  - gene synonyms (Entrez Gene, Uniprot, OMIM).
- Lexical features
  - WordNet similarity (Lin, 1998)
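To make the orthographic features concrete, here is a Python sketch that computes several of them for a pair of words. The details are assumptions made for illustration: edit distance is normalized by the longer word's length, and prefix/suffix matches compare three characters. The slides do not specify either choice.

```python
from typing import Dict, Union


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def orthographic_features(source: str, target: str) -> Dict[str, Union[bool, float, int]]:
    """A few orthographic features for one candidate word pair."""
    s, t = source.lower(), target.lower()
    return {
        "exact_match": source == target,
        "norm_edit_distance": normalized_edit_distance(s, t),
        "prefix_match": s[:3] == t[:3],      # 3-character prefix, assumed
        "suffix_match": s[-3:] == t[-3:],    # 3-character suffix, assumed
        "length_difference": abs(len(source) - len(target)),
        "same_capitalization": source[:1].isupper() == target[:1].isupper(),
    }
```

The pair Chk1/Chk2 illustrates why orthography alone is not enough: the two names differ by a single character, yet they denote different proteins.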

Approach: Posterior Decoding
- Use Conditional Random Fields
- Compute posterior probabilities using EM
- For every target word w, compute the combination of source words that maximizes the expected score of w
- Take the union of individual word optimal alignments and produce a multiple alignment
- Use a match-factor to reward/penalize a combination based on the number of words that align to the same target word

Data sets
3 sets of citances annotated by a PhD with biological training (Anna Divoli):
- Training set: 4 groups, 10 citances each (360 pairs).
- Development set: 51 citances (2550 pairs).
- Test set: 45 citances (1980 pairs).
Feature engineering was done using the training and development sets.
Final results are based on a model trained on the training and development sets combined, and tested on the test set.
Baseline: using only normalized edit distance with a simple cutoff.
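The pair counts above are consistent with counting ordered citance pairs within each group, n × (n − 1). That reading is an inference from the numbers rather than something the slides state. A quick Python check:

```python
def ordered_pairs(n: int) -> int:
    """Number of ordered (source, target) pairs among n citances."""
    return n * (n - 1)


# Assumed interpretation: pairs are formed only within a group.
training_pairs = 4 * ordered_pairs(10)   # 4 groups of 10 citances
development_pairs = ordered_pairs(51)
test_pairs = ordered_pairs(45)
```

Under this reading each citance is aligned against every other citance in its group in both directions.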

Results

A Full Text Search Interface
(This work in part by Mike Wooldridge and Jerry Ye.)

The Importance of Figures and Captions
- Observations of biologists’ reading habits: it has often been observed that biologists focus on figures + captions along with title and abstract.
- KDD Cup 2002
  - The objective was to extract only the papers that included experimental results regarding expression of gene products and to identify, from all the genes mentioned in each document, the genes and products for which experimental results were provided.
  - ClearForest+Celera did well in part by focusing on figure captions, which contain critical experimental evidence.

Our Idea
- Make a full text search engine for journal articles that focuses on showing figures
- Make it possible to search over caption text (and text that refers to captions)
- Try to group the figures intelligently
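Searching over caption text can be pictured with a toy inverted index that maps each caption word to the figures whose captions contain it. The Python sketch below is only a minimal illustration of that idea: the figure IDs and captions are made-up sample data, and nothing here reflects how the BioText search engine is actually implemented.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_caption_index(captions: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each caption token to the set of figure IDs containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for figure_id, caption in captions.items():
        for token in tokenize(caption):
            index[token].add(figure_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return figures whose captions contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical sample data for illustration only.
captions = {
    "fig1": "Western blot of Bim expression after NGF withdrawal",
    "fig2": "Cell cycle checkpoint activation by Chk1",
    "fig3": "Bim localization in sympathetic neurons",
}
index = build_caption_index(captions)
```

A query returns figure IDs, not article IDs, which matches the goal of showing figures first in the results.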

BioFigure Search Interface
- We’ve indexed the open access journal article collection
  - ~130 journals
  - ~20,000 articles
  - ~80,000 figures
- We’ve built a figure/caption labeling tool to create training data
  - Image types
  - Comparison or not?
- We’ve made a start at a search interface
  - Right now the figure grouping facility is very crude
  - We are going to add faceted navigation (Flamenco)

Interested in Helping?
- We need figure labeling help!
- We need user feedback!
- Contact me, or send to:
- More information: biotext.berkeley.edu