Literature Mining and Systems Biology Lars Juhl Jensen EMBL.


Why?

Overview
- Information retrieval: finding the papers
- Entity recognition: identifying the substance(s)
- Information extraction: formalizing the facts
- Text mining: finding nuggets in the literature
- Integration: combining text and biological data

Status
- IR, ER, and simple IE methods are fairly well established
- Advanced NLP-based IE systems are rapidly being improved
- Methods for text mining and text/data integration are still in their infancy

Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Information retrieval
- Ad hoc information retrieval
  - The user enters a query/a set of keywords
  - The system attempts to retrieve the relevant texts from a large text corpus (typically Medline)
- Text categorization
  - A training set of texts is created in which texts are manually assigned to classes (often only yes/no)
  - A machine learning method is trained to classify texts
  - This method can subsequently be used to classify a much larger text corpus

Ad hoc IR
- These systems are very useful since the user can provide any query
  - The query is typically Boolean (yeast AND cell cycle)
  - A few systems instead allow the relative weight of each search term to be specified by the user
- The art is to find the relevant papers even if they do not actually match the query
  - Ideally our example sentence should be retrieved by the query yeast cell cycle although none of these words are mentioned
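A minimal sketch of Boolean ad hoc retrieval over an inverted index may make the limitation concrete. The two toy documents and the function names `build_index` and `boolean_and` are assumptions made for illustration; no particular search system works exactly this way.

```python
def build_index(docs):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Toy corpus (hypothetical): the second document is about the yeast cell
# cycle but never uses those words, so a literal Boolean query misses it.
docs = [
    "Yeast cell cycle regulation",
    "Cdc28 directly phosphorylated Swe1",
]
index = build_index(docs)
hits = boolean_and(index, ["yeast", "cycle"])
```

Here `hits` contains only the first document, which is the gap that query expansion tries to close.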

Automatic query expansion
- In a typical query, the user will not have provided all relevant words and variants thereof
- By automatically expanding queries with additional search terms, recall can be improved
  - Stemming removes common endings (yeast / yeasts)
  - Thesauri can be used to expand queries with synonyms and/or abbreviations (yeast / S. cerevisiae)
  - The next logical step is to use ontologies to make complex inferences (yeast cell cycle / Cdc28)
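The first two expansion steps can be sketched as follows. The suffix-stripping rule is a deliberately crude stand-in for a real stemmer, and the one-entry `THESAURUS` is a hypothetical toy; the ontology-based step is not attempted here.

```python
# Hypothetical toy thesaurus; a real system would load a curated synonym list.
THESAURUS = {
    "yeast": ["s. cerevisiae", "saccharomyces cerevisiae"],
}


def stem(word):
    """Strip a plural 's' (crude assumption, not a full stemming algorithm)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def expand_query(terms):
    """Return the original terms plus their stems and thesaurus synonyms."""
    expanded = set()
    for term in terms:
        lowered = term.lower()
        base = stem(lowered)
        expanded.add(lowered)
        expanded.add(base)
        expanded.update(THESAURUS.get(base, []))
    return expanded


expanded = expand_query(["yeasts"])
```

Any of the expanded terms can then be OR-ed into the original query, trading some precision for recall.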

Document similarity
- The similarity of two documents can be defined based on their word content
  - Each document can be represented by a word vector
  - Words should be weighted based on their frequency and background frequency
  - The most commonly used scheme is tf*idf weighting
- Document similarity can be used in ad hoc IR
  - Rather than matching the query against each document only, the N most similar documents are also considered
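As an illustration of tf*idf weighting, the sketch below builds sparse word vectors and compares them with cosine similarity. The choice of raw term counts, a natural-log idf, and cosine as the similarity measure are assumptions; several variants of each are in common use.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: tf * idf} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents (hypothetical) to show that shared rare words drive similarity.
vecs = tfidf_vectors([
    "cdc28 phosphorylates swe1",
    "swe1 is phosphorylated by cdc28",
    "fish oil and blood viscosity",
])
```

A word that occurs in every document gets an idf of zero and therefore contributes nothing to the similarity.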

Document clustering
- Unsupervised clustering algorithms can be applied to a document similarity matrix
  - All pairwise document similarities are calculated
  - Clusters of “similar documents” can be constructed using one of numerous standard clustering methods
- Practical uses of document clustering
  - The “related documents” function in PubMed
  - Logical organization of the documents found by IR
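One of the simplest standard methods is single-linkage clustering cut at a fixed similarity threshold, which amounts to finding connected components in the thresholded similarity graph. The sketch below assumes a precomputed symmetric similarity matrix; the matrix values and the threshold are made up for illustration.

```python
def single_linkage_clusters(sim, threshold):
    """Group documents connected by any chain of similarities >= threshold."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical 4 x 4 document similarity matrix.
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
clusters = single_linkage_clusters(sim, threshold=0.5)
```

Single linkage tends to produce long chains of loosely related documents; average- or complete-linkage methods are common alternatives.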

Text categorization
- These systems are a lot less flexible than ad hoc systems but can attain better accuracy
  - Works on a pre-defined set of document classes
  - Each class is defined by manually assigning a number of documents to it
- Method
  - Rules may be manually crafted based on a very small set of manually classified documents
  - Statistical machine learning methods can be trained on a large number of classified documents

Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
- Hints in the text
  - Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”)
  - Weaker: mitotic cyclin, Clb2, and Cdk1 (“cell cycle”)

Machine learning
- Input features
  - Word content or bi-/tri-grams
  - Part-of-speech tags
  - Filtering (stop words, part-of-speech)
  - Singular value decomposition
- Training
  - Support vector machines are best suited
  - Choice of kernel function
  - Separate training and evaluation sets, cross validation
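To show the train-then-classify workflow without external libraries, the sketch below swaps in a multinomial naive Bayes classifier with add-one smoothing. This is explicitly not the support vector machine approach named on the slide, only a small stand-in; the stop-word list and the four training texts are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "and", "in", "a", "is", "by"}  # toy list (assumption)


def features(text):
    """Word-content features after stop-word filtering."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train_nb(examples):
    """Train on (text, label) pairs; returns counts needed for classification."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in features(text):
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for word in features(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("cdc28 phosphorylates swe1 in mitosis", "cell_cycle"),
    ("clb2 cyclin binds cdc28", "cell_cycle"),
    ("fish oil lowers blood viscosity", "other"),
    ("magnesium deficiency and migraine", "other"),
]
model = train_nb(training)
```

In practice the classifier would be evaluated on a held-out set or by cross validation rather than on hand-picked sentences.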

Entity recognition
- An important but boring problem
  - The genes/proteins/drugs mentioned within a given text must be identified
- Recognition vs. identification
  - Recognition: find the words that are names of entities
  - Identification: figure out which entities they refer to
  - Recognition without identification is of limited use

Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
- Entities identified
  - S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)

Recognition
- Features
  - Morphological: mixes letters and digits or ends in -ase
  - Context: followed by “protein” or “gene”
  - Grammar: should occur as a noun
- Methodologies
  - Manually crafted rule-based systems
  - Machine learning (SVMs)
- But what can it be used for?
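The morphological and context features can be computed with a few string tests, as in the sketch below. The function name and feature keys are hypothetical, and the grammar feature is omitted because it would need a part-of-speech tagger.

```python
import re


def recognition_features(tokens, i):
    """Compute simple name-recognition features for the token at position i."""
    word = tokens[i]
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return {
        # Morphological: mixes letters and digits
        "letters_and_digits": bool(
            re.search(r"[A-Za-z]", word) and re.search(r"\d", word)
        ),
        # Morphological: ends in -ase
        "ends_in_ase": word.lower().endswith("ase"),
        # Context: followed by a cue word
        "followed_by_cue": next_word in {"protein", "gene"},
    }


tokens = "The Cdc28 protein is a kinase".split()
cdc28_features = recognition_features(tokens, 1)
```

Such feature dictionaries can feed either hand-written rules or a trained classifier.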

Identification
- A good synonyms list is the key
  - Combine many sources
  - Curate to eliminate stop words
- Flexible matching to handle orthographic variation
  - Case variation: CDC28, Cdc28, and cdc28
  - Prefixes: myc and c-myc
  - Postfixes: Cdc28 and Cdc28p
  - Spaces and hyphens: cdc28 and cdc-28
  - Latin vs. Greek letters: TNF-alpha and TNFA
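A minimal sketch of flexible matching: each name is reduced to a normalized key, and a few alternative keys are tried against the synonyms list. The normalization rules are simplified assumptions, and the `EXAMPLE:` identifiers are placeholders rather than real database accessions; only YBR160W for Cdc28 comes from the example above.

```python
import re

GREEK_TO_LATIN = {"alpha": "a", "beta": "b", "gamma": "g"}


def normalize(name):
    """Lower-case, map spelled-out Greek letters to Latin, drop spaces and hyphens."""
    key = name.lower()
    for greek, latin in GREEK_TO_LATIN.items():
        key = key.replace(greek, latin)
    return re.sub(r"[\s\-]+", "", key)


def candidate_keys(name):
    """Return normalized keys for the name and its prefix/postfix variants."""
    lowered = name.lower()
    forms = {lowered}
    if lowered.startswith("c-"):         # prefix: c-myc -> myc
        forms.add(lowered[2:])
    if re.fullmatch(r".*\dp", lowered):  # postfix: Cdc28p -> Cdc28
        forms.add(lowered[:-1])
    return {normalize(form) for form in forms}


# Toy synonyms list keyed by normalized name (identifiers partly hypothetical).
SYNONYMS = {
    normalize("Cdc28"): {"YBR160W"},
    normalize("TNFA"): {"EXAMPLE:TNF"},
    normalize("myc"): {"EXAMPLE:MYC"},
}


def identify(name, synonyms=SYNONYMS):
    """Return all identifiers whose synonyms match any variant of the name."""
    ids = set()
    for key in candidate_keys(name):
        ids.update(synonyms.get(key, ()))
    return ids
```

Aggressive normalization raises recall but also creates collisions, which is why a curated synonyms list matters.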

Disambiguation
- The same word may mean many different things
  - Entity names may also be common English words (hairy) or technical terms (SDS)
  - Protein names may refer to related or unrelated proteins in other species (cdc2)
- The meaning can be resolved from the context
  - ER can distinguish between names and common words
  - Disambiguating non-unique names is a hard problem
  - Ambiguity between orthologs can safely be ignored

Co-occurrence extraction
- Relations are extracted for co-occurring entities
  - Relations are always symmetric
  - The type of relation is not given
- Scoring the relations
  - More co-occurrences → more significant
  - Ubiquitous entities → less significant
  - Same sentence vs. same paragraph
- Simple, good recall, poor precision
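One way to capture the first two scoring ideas is a Jaccard-style score: the number of sentences mentioning both entities divided by the number mentioning either. This particular formula is an assumption chosen for brevity, not the scheme of any specific system, and the sketch ignores the sentence-versus-paragraph distinction. The input sentences are toy data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(sentences):
    """Score entity pairs; input is one set of entity names per sentence."""
    mentions = Counter()
    pairs = Counter()
    for entities in sentences:
        mentions.update(entities)
        # Sorting makes each pair key order-independent (symmetric relations).
        for a, b in combinations(sorted(entities), 2):
            pairs[(a, b)] += 1
    return {
        (a, b): count / (mentions[a] + mentions[b] - count)
        for (a, b), count in pairs.items()
    }


# Hypothetical entity sets, one per sentence; Cdc28 is the ubiquitous entity.
sentences = [
    {"Clb2", "Cdc28"},
    {"Clb2", "Cdc28"},
    {"Cdc28", "Swe1"},
    {"Cdc28", "Cdc5"},
    {"Cdc28"},
]
scores = cooccurrence_scores(sentences)
```

The pair seen twice outscores the pairs seen once, and every pair involving the frequently mentioned Cdc28 is penalized by its large mention count.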

Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
- Relations
  - Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and Cdc5–Swe1
  - Wrong: Clb2–Cdc5 and Cdc28–Cdc5
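The same example worked through in code: pairing all four proteins found in the sentence yields six relations, of which the four listed as correct give a precision of 4/6. The variable names are illustrative only.

```python
from itertools import combinations

entities = ["Clb2", "Cdc28", "Swe1", "Cdc5"]

# Relations listed as correct for the example sentence.
correct = {
    frozenset(pair)
    for pair in [("Clb2", "Cdc28"), ("Clb2", "Swe1"), ("Cdc28", "Swe1"), ("Cdc5", "Swe1")]
}

# Co-occurrence extraction proposes every unordered pair in the sentence.
extracted = {frozenset(pair) for pair in combinations(entities, 2)}

wrong = extracted - correct
precision = len(extracted & correct) / len(extracted)
recall = len(extracted & correct) / len(correct)
```

Recall is complete for this sentence while a third of the extracted pairs are wrong, which matches the "good recall, poor precision" characterization.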

Categorization of relations
- Extracting specific types of relations
  - Text categorization methods can be used to identify sentences that mention a certain type of relation
  - Filtering can be done before or after relation extraction
- Well suited for database curation
  - Text categorization can be reused
  - High recall is most important
  - Curators can compensate for the lack of precision

Relation extraction by NLP
- Information is extracted based on parsing and interpreting phrases or full sentences
  - Good at extracting specific types of relations
  - Handles directed relations
- Complex, good precision, poor recall

Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
- Relations
  - Complex: Clb2–Cdc28
  - Phosphorylation: Clb2 → Swe1, Cdc28 → Swe1, and Cdc5 → Swe1

An NLP architecture
- Tokenization
  - Entity recognition with synonyms list
  - Word boundaries (multi words)
  - Sentence boundaries (abbreviations)
- Part-of-speech tagging
  - TreeTagger trained on GENIA
- Semantic labeling
  - Dictionary of regular expressions
- Entity and relation chunking
  - Rule-based system implemented in CASS

- Semantic labeling
  - Gene and protein names
  - Cue words for entity recognition
  - Cue words for relation extraction
- Named entity chunking
  - A CASS grammar recognizes noun chunks related to gene expression: [nxgene The GAL4 gene]
- Relation chunking
  - Our CASS grammar also extracts relations between entities: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

[expression_repression_active Btk regulates the IL-2 gene]
[dephosphorylation_nominal Dephosphorylation of Syk and Btk mediated by SHP-1]
[phosphorylation_nominal phosphorylation of Shc by the hematopoietic cell-specific tyrosine kinase Syk]
[phosphorylation_nominal the phosphorylation of the adapter protein SHC by the Src-related kinase Lyn]
[phosphorylation_active Lyn also participates in [phosphorylation the tyrosine phosphorylation and activation of syk]]
[phosphorylation_active Lyn, [negation but not Jak2] phosphorylated CrkL]
[expression_repression_active IL-10 also decreased [expression mRNA expression of IL-2 and IL18 cytokine receptors]]
[expression_activation_passive [expression IL-13 expression] induced by IL-2 + IL-18]
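To give a feel for pattern-based extraction of directed relations, the sketch below uses two regular expressions as a very rough stand-in for a chunking grammar. It is not CASS and not the grammar described in the slides; the patterns, the `NAME` definition, and the function name are all assumptions, and they cover only an active and a nominal phosphorylation form.

```python
import re

# Assumed shape of a protein name: capitalized token of letters, digits, hyphens.
NAME = r"[A-Z][A-Za-z0-9\-]*"

# Active form: "<kinase>[, but not <other>] [directly] phosphorylated <substrate>"
ACTIVE = re.compile(
    rf"\b({NAME})(?:, but not {NAME})? (?:directly )?phosphorylate[sd] ({NAME})\b"
)

# Nominal form: "phosphorylation of <substrate> by ... kinase <kinase>"
NOMINAL = re.compile(
    rf"\bphosphorylation of ({NAME}) by (?:[\w\-]+ )*?kinase ({NAME})\b"
)


def extract_phosphorylations(text):
    """Return directed (kinase, substrate) pairs matched by either pattern."""
    relations = [(m.group(1), m.group(2)) for m in ACTIVE.finditer(text)]
    relations += [(m.group(2), m.group(1)) for m in NOMINAL.finditer(text)]
    return relations


example = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation"
)
```

The patterns handle the short Lyn and Syk phrases but find nothing in the running example sentence, because the parenthetical between Cdc28 and the verb breaks the active pattern. That is the precision-for-recall trade-off in miniature.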

Mining text for nuggets
- New relations can be inferred from published ones
  - This can lead to actual discoveries if no person knows all the facts required for making the inference
  - Combining facts from disconnected literatures
- Swanson’s pioneering work
  - Fish oil and Raynaud's disease
  - Magnesium and migraine
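The inference pattern behind this kind of discovery is often summarized as: if A is linked to B and B is linked to C, but A and C are never linked directly, then A and C form a candidate hypothesis. The sketch below implements only that bare pattern; the function name is hypothetical, and the single intermediate term is a simplified illustration rather than a reconstruction of the original analysis.

```python
def infer_hidden_links(known):
    """Propose A-C links that share an intermediate B but are not yet published.

    `known` is a set of (a, b) relations treated as undirected. The result maps
    each proposed pair to the set of intermediates supporting it.
    """
    neighbours = {}
    for a, b in known:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    hypotheses = {}
    for a in neighbours:
        for b in neighbours[a]:
            for c in neighbours[b]:
                if c != a and c not in neighbours[a]:
                    hypotheses.setdefault(frozenset((a, c)), set()).add(b)
    return hypotheses


# Illustrative published relations from two literatures that do not cite each other.
known = {
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
}
hypotheses = infer_hidden_links(known)
```

Real systems have to rank the many candidate pairs this produces, for example by the number and specificity of the shared intermediates.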

Trends
- Most similar to existing data mining approaches
  - Although all the detailed data is in the text, people may have missed the big picture
- Temporal trends
  - Historical summaries
  - Forecasting
- Correlations
  - “Customers who bought this item also bought …”

Time

Buzzwords

Correlations
- “Customers who bought this item also bought …”
- Protein networks
  - “Proteins that regulate expression …”
  - “Proteins that control phosphorylation …”
  - “Proteins that are phosphorylated …”
- Co-author networks

Transcriptional networks
[Figure: Regulates vs. Regulated; P < 9 × 10^-9]

Signaling pathways
[Figure: Phosphorylates vs. Phosphorylated; P < 2 × 10^-7]

Integration
- Automatic annotation of high-throughput data
  - Loads of fairly trivial methods
- Protein interaction networks
  - Can unify many types of interactions
  - Powerful as exploratory visualization tools
- More creative strategies
  - Identification of candidate genes for genetic diseases
  - Linking genes to traits based on species distributions

RCCs

Disease candidate genes
- Rank the genes within a chromosomal region to which a disease has been mapped
- Methods
  - G2D: Gene → Function → Chemical → Phenotype → Disease (uses MEDLINE but not the text)
  - BITOLA: Gene → Words → Disease (similar to ARROWSMITH)
  - Hide and co-workers: Gene → Tissue → Disease

G2D

Genotype–phenotype
- Genes can be linked to traits by comparing the species distributions of both
  - Mainly works for prokaryotes
  - Traits are represented by keywords
- Finding the species profiles
  - Gene profiles are found by sequence similarity
  - Keyword profiles are based on co-occurrence with the species name in MEDLINE
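Comparing species distributions can be sketched as a set comparison between a gene's species profile and a keyword's species profile. The Jaccard index used here is an assumed similarity measure chosen for simplicity, and the gene names and species labels are placeholders, not real data.

```python
def profile_similarity(gene_species, trait_species):
    """Jaccard similarity between two sets of species."""
    union = gene_species | trait_species
    if not union:
        return 0.0
    return len(gene_species & trait_species) / len(union)


def rank_genes_for_trait(gene_profiles, trait_species):
    """Order genes by how well their species profile matches the trait's."""
    return sorted(
        gene_profiles,
        key=lambda gene: profile_similarity(gene_profiles[gene], trait_species),
        reverse=True,
    )


# Placeholder profiles: species in which each gene is found by sequence similarity.
gene_profiles = {
    "geneA": {"sp1", "sp2", "sp3"},
    "geneB": {"sp1", "sp4", "sp5", "sp6"},
    "geneC": {"sp4"},
}
# Placeholder profile: species whose names co-occur with the trait keyword.
trait_species = {"sp1", "sp2", "sp3"}

ranking = rank_genes_for_trait(gene_profiles, trait_species)
```

A statistical test on the overlap would be a natural refinement, since closely related species inflate apparent agreement.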

Annotation
- Many experiments result in groups of related genes
  - ER is used to find the associated abstracts
  - The frequency of each word is counted in the abstracts
  - Background frequencies of all words are pre-calculated
  - A statistical test is used to rank the words
- The same strategy can be applied to find MeSH terms associated with a gene cluster
- Most people prefer using GO annotation instead
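The ranking step can be sketched with a one-sided hypergeometric test, which is one common choice; the slides do not say which statistical test was used. The abstracts and background document frequencies below are toy values, and the background is assumed to include the cluster's own abstracts.

```python
from collections import Counter
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n abstracts are drawn from N, of which K contain the word."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def rank_words(cluster_abstracts, background_df, n_background):
    """Return (p-value, word) pairs sorted from most to least over-represented."""
    n = len(cluster_abstracts)
    doc_freq = Counter(
        word for abstract in cluster_abstracts for word in set(abstract.lower().split())
    )
    return sorted(
        (hypergeom_pvalue(k, n, background_df[word], n_background), word)
        for word, k in doc_freq.items()
    )


# Toy data: two abstracts for a gene cluster and background document frequencies.
cluster_abstracts = ["kinase cell", "kinase cycle"]
background_df = {"kinase": 2, "cell": 50, "cycle": 40}
ranked = rank_words(cluster_abstracts, background_df, n_background=100)
```

With many words tested at once, the raw p-values would normally be corrected for multiple testing before any cutoff is applied.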

Outlook
- Literature mining will not be made obsolete by
  - Repositories are always made too late
  - There will always be new types of relations
  - Semantically tagged XML may replace ER (hopefully!)
  - Semantically tagged XML will never tag everything
- Specific IE problems will become obsolete
  - Protein function
  - Physical protein interactions

Permission denied
- Open access
  - Literature mining methods cannot retrieve, extract, or correlate information from text unless it is accessible
  - Restricted access is already the primary problem
- Standard formats
  - Getting the text out of a PDF file is not trivial
  - Many journals now store papers in XML format
- Where do I get all the patent text?!

Innovation
- The basic tools are now in place for IR, ER, and IE
  - Development was driven by computational linguists
- Text- and data-mining
  - Biologists are needed
  - Collaboration with linguists
- Lack of innovation
  - Very few new ideas
  - Text should be combined with other data

Acknowledgments
- EML Research
  - Jasmin Saric
  - Isabel Rojas
- EMBL Heidelberg
  - Peer Bork
  - Miguel Andrade
  - Michael Kuhn
  - Rossitza Ouzounova
  - Jan Korbel
  - Tobias Doerks

Exercises Lars Juhl Jensen EMBL

Entity recognition
- iHOP
- Ideas
  - Compare iHOP vs. PubMed for finding papers related to a particular gene
  - Use iHOP to construct a small literature-based network

Information extraction
- Relation extraction
  - iProLINK
  - PreBIND
  - PubGene
- Ideas
  - Check how complex sentences iProLINK can handle
  - Check how well PreBIND can discriminate between physical and other interactions (other interactions can be found with PubGene, ProLinks, or STRING)

Text mining
- ARROWSMITH
- Ideas
  - Fish oil and Raynaud's disease
  - Magnesium and migraine
  - Arginine and somatomedin C
  - Estrogen and Alzheimer's disease

Integration 1
- Protein networks
  - STRING
  - ProLinks
- Ideas
  - Use both tools to find functions for proteins of known and unknown function
  - Use STRING to construct a network for a set of proteins
  - Try to reproduce the Ssn3–Msn2–Hsp104 link

Integration 2
- Finding candidate disease genes
  - G2D
  - BITOLA
- Ideas
  - Take a look at the G2D results for some diseases where you know which types of genes would be sensible to suggest
  - Compare the results with BITOLA (if you have the patience to figure out their interface!)