Literature Mining for the Biologists Santhosh J. Eapen

Present scenario
– Literature data are being generated on a large scale
– A researcher can no longer keep up to date with all the relevant literature manually

What is Literature Mining?
For an average biologist:
– Keyword search in PubMed/CeRa/CAB Abstracts
– ‘Maps of science’ that cluster papers together on the basis of how often they cite one another, or by similarities in the frequencies of certain keywords
Machine learning: the ability of a machine to learn from experience or extract knowledge from examples in a database. Artificial neural networks and support-vector machines are two commonly used types of machine-learning method.

Literature Mining
– To identify relevant articles (Information Retrieval, IR)
– To recognize biological entities mentioned in these articles (Entity Recognition, ER)
– To enable specific facts to be pulled out from papers (Information Extraction, IE)

Text mining or Data mining
– Integrate the literature with other large data sets such as genome sequences, microarray expression studies, or protein–protein interaction screens
– Dig out the deeper meaning that leads to biological discoveries

Current status of biological literature mining

IR (Information Retrieval): identify the text segments (full articles, abstracts, paragraphs or sentences) that pertain to a certain topic

Tools for IR

Problem setting: given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them

Problem setting: given a set of genes (and their literature), compute a representation, called a gene index, to retrieve, summarize, classify or cluster them

Vector space model
Document processing:
– Remove punctuation and grammatical structure (‘bag of words’)
– Define a vocabulary
– Identify multi-word terms, e.g. tumor suppressor (phrases)
– Eliminate low-content words, e.g. and, thus, gene (stopwords)
– Map words with the same meaning (synonyms)
– Strip plurals, conjugations, etc. (stemming)
– Define a weighting scheme and/or transformations (tf-idf, SVD, ...)
Compute an index of the textual resources.
[Figure: T1, T2, T3; axis labels: vocabulary, gene]
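The processing steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular tool: the stopword list, the phrase table, the crude plural-stripping stemmer and the three sample documents are all assumptions made for the example, and the weight is the plain tf × log(N/df) form of tf-idf.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list and multi-word term table, for illustration only.
STOPWORDS = {"and", "thus", "gene", "the", "a", "of", "is", "in"}
PHRASES = {"tumor suppressor": "tumor_suppressor"}

def stem(word):
    # Crude plural stripping; a real system would use a proper stemmer.
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def tokenize(text):
    # Lowercase, join multi-word terms, drop punctuation and stopwords, stem.
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    words = re.findall(r"[a-z0-9_]+", text)
    return [stem(w) for w in words if w not in STOPWORDS]

def tfidf_index(docs):
    # One sparse vector (term -> weight) per document.
    bags = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for bag in bags for term in bag)
    return [
        {t: tf * math.log(n / df[t]) for t, tf in bag.items()}
        for bag in bags
    ]

docs = [
    "The tumor suppressor gene p53 is mutated in many cancers.",
    "Yeast cell cycle kinases and cyclins.",
    "Cancers and the cell cycle.",
]
index = tfidf_index(docs)
```

In this toy index, a term that occurs in only one document (such as p53) gets a higher weight than a term shared by several documents (such as cancer).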

Biomedical Text Mining: Methods
Databases, Natural Language Processing, Information Retrieval, Information Extraction, Ontologies, Clustering, Classification, Visualization
Gene Ontology: a set of controlled vocabularies that are used to describe the molecular functions of a gene product, the biological processes in which it participates and the cellular components in which it can be found.
MeSH terms: a controlled vocabulary that is used for annotating Medline abstracts. Several classes of MeSH term exist, the most relevant for literature mining being ‘Chemicals and Drugs’ (MeSH-D) and ‘Diseases’ (MeSH-C).

Example Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Ad hoc IR
These systems are very useful since the user can provide any query
– The query is typically Boolean (yeast AND cell cycle)
– A few systems instead allow the user to specify the relative weight of each search term
The art is to find the relevant papers even if they do not actually match the query
– Ideally our example sentence should be retrieved by the query yeast cell cycle although none of these words is mentioned
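A Boolean AND query can be answered from an inverted index, as in the sketch below. The document identifiers and the first document's text are made up for illustration; the second document is the example sentence from the slides, and it is not returned because it contains none of the query words.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each lowercased word to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;()")].add(doc_id)
    return index

def boolean_and(index, terms):
    # Documents that contain every query term.
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical two-document collection.
docs = {
    "d1": "Regulation of the yeast cell cycle by cyclins.",
    "d2": "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
          "phosphorylated Swe1 and this modification served as a priming step "
          "to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation "
          "and degradation",
}
inverted = build_inverted_index(docs)
hits = boolean_and(inverted, ["yeast", "cell", "cycle"])
```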

Automatic query expansion
In a typical query, the user will not have provided all relevant words and variants thereof
By automatically expanding queries with additional search terms, recall can be improved
– Stemming removes common endings (yeast / yeasts)
– Thesauri can be used to expand queries with synonyms and/or abbreviations (yeast / S. cerevisiae)
– The next logical step is to use ontologies to make complex inferences (yeast cell cycle / Cdc28)
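The first two expansion steps (stemming and thesaurus lookup) might look like the sketch below. The thesaurus content and the plural-stripping rule are assumptions chosen to mirror the yeast example; ontology-based inference is not attempted here.

```python
# Hypothetical thesaurus: base term -> synonyms and abbreviations.
THESAURUS = {"yeast": ["s. cerevisiae", "saccharomyces cerevisiae"]}

def strip_plural(term):
    # Crude stemming: drop a trailing "s" from longer words.
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term

def expand_query(terms):
    # Return the stemmed term, its plural, and any thesaurus entries, in order.
    expanded = []
    for term in terms:
        base = strip_plural(term.lower())
        for variant in [base, base + "s", *THESAURUS.get(base, [])]:
            if variant not in expanded:
                expanded.append(variant)
    return expanded

expanded_terms = expand_query(["yeasts"])
```

The expanded terms would then be combined with OR when matching documents, which is what raises recall.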

Document similarity
The similarity of two documents can be defined based on their word content
– Each document can be represented by a word vector
– Words should be weighted based on their frequency and background frequency
– The most commonly used scheme is tf*idf weighting
Document similarity can be used in ad hoc IR
– Rather than matching the query against each document only, the N most similar documents are also considered
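One common way to compare two weighted word vectors is cosine similarity; the slide does not name a specific measure, so treat the choice here as an assumption. The vectors and their weights below are invented for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity of two sparse vectors stored as term -> weight dicts.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(doc_id, vectors, n):
    # The n documents closest to doc_id, most similar first.
    scores = [
        (cosine(vectors[doc_id], vec), other)
        for other, vec in vectors.items()
        if other != doc_id
    ]
    scores.sort(reverse=True)
    return [other for _, other in scores[:n]]

# Hypothetical tf*idf vectors.
vectors = {
    "d1": {"yeast": 1.1, "cycle": 0.4},
    "d2": {"cycle": 0.4, "cdc28": 1.1},
    "d3": {"migraine": 1.1, "magnesium": 1.1},
}
neighbours = most_similar("d1", vectors, 1)
```

In ad hoc IR, a document matched by the query could bring its nearest neighbours into the result list in this way.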

Document clustering
Unsupervised clustering algorithms can be applied to a document similarity matrix
– All pairwise document similarities are calculated
– Clusters of “similar documents” can be constructed using one of numerous standard clustering methods
Practical uses of document clustering
– The “related documents” function in PubMed
– Logical organization of the documents found by IR
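As one example of a standard method, the sketch below groups documents by single-linkage clustering with a fixed similarity threshold: any two documents whose similarity reaches the threshold end up in the same cluster. The similarity matrix and the threshold of 0.5 are made-up values, and other clustering methods would serve equally well.

```python
def single_link_clusters(sim, threshold):
    # sim is a symmetric matrix of pairwise document similarities.
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of documents that is similar enough.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical similarity matrix for four documents.
similarity = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
clusters = single_link_clusters(similarity, 0.5)
```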

Entity recognition
An important but boring problem
– The genes/proteins/drugs mentioned within a given text must be identified
Recognition vs. identification
– Recognition: find the words that are names of entities
– Identification: figure out which entities they refer to
– Recognition without identification is of limited use

Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Entities identified
– S. cerevisiae proteins: Clb2 (YPR119W), Cdc28 (YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
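The difference between recognition and identification can be shown with a dictionary lookup over the example sentence. The lexicon holds only the four name-to-identifier pairs given on the slide; real systems need far larger dictionaries and must handle synonyms and ambiguous names, which this sketch ignores.

```python
import re

# Name -> systematic identifier, taken from the example above.
LEXICON = {
    "Clb2": "YPR119W",
    "Cdc28": "YBR160W",
    "Swe1": "YJL187C",
    "Cdc5": "YMR001C",
}

SENTENCE = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation"
)

def recognize(text):
    # Recognition: find the words that are names of known entities.
    pattern = r"\b(" + "|".join(map(re.escape, LEXICON)) + r")\b"
    return re.findall(pattern, text)

def identify(names):
    # Identification: map each recognized name to the entity it refers to.
    return {name: LEXICON[name] for name in names}

mentions = recognize(SENTENCE)
entities = identify(mentions)
```

Cdk1 is not reported because it is absent from this small lexicon.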

Co-occurrence extraction
Relations are extracted for co-occurring entities
– Relations are always symmetric
– The type of relation is not given
Scoring the relations
– More co-occurrences → more significant
– Ubiquitous entities → less significant
– Same sentence vs. same paragraph
Simple, good recall, poor precision

Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Relations
– Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and Cdc5–Swe1
– Wrong: Clb2–Cdc5 and Cdc28–Cdc5
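Sentence-level co-occurrence extraction amounts to counting every unordered pair of entities that appear together. The sketch below applies this to the four proteins of the example sentence and yields all six pairs, the four correct ones and the two wrong ones alike, which is why precision is poor. The input format (one list of entity names per sentence) is an assumption, and no scoring beyond raw counts is shown.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(sentences_entities):
    # Count each unordered entity pair once per sentence in which both occur.
    counts = Counter()
    for entities in sentences_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

# Entities recognized in the example sentence (Swe1 is mentioned twice).
example = [["Clb2", "Cdc28", "Swe1", "Cdc5", "Swe1"]]
pairs = cooccurrences(example)
```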

Mining text for nuggets
New relations can be inferred from published ones
– This can lead to actual discoveries if no person knows all the facts required for making the inference
– Combining facts from disconnected literatures
Swanson’s pioneering work
– Fish oil and Raynaud's disease
– Magnesium and migraine
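The inference pattern can be sketched as follows: if A is linked to B in one literature and B to C in another, and no published relation connects A and C directly, the pair A and C is a candidate discovery. The two input relations are a simplified illustration using blood viscosity as the intermediate concept; they are not a complete account of the fish oil case.

```python
def infer_hidden_links(relations):
    # Build an undirected graph from the published relations.
    neighbors = {}
    for a, b in relations:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    known = {frozenset(r) for r in relations}

    # Two terms sharing an intermediate term, but not directly related,
    # form a candidate link; record the intermediates that support it.
    inferred = {}
    for b, linked in neighbors.items():
        for a in linked:
            for c in linked:
                if a < c and frozenset((a, c)) not in known:
                    inferred.setdefault((a, c), set()).add(b)
    return inferred

# Illustrative relations from two literatures that do not cite each other.
published = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
]
candidates = infer_hidden_links(published)
```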

Integration
Automatic annotation of high-throughput data
– Loads of fairly trivial methods
Protein interaction networks
– Can unify many types of interactions
– Powerful as exploratory visualization tools
More creative strategies
– Identification of candidate genes for genetic diseases
– Linking genes to traits based on species distributions

Tools for information retrieval
E-BioSci, EBIMed, Google Scholar, GoPubMed, MedMiner, PubMed, PubFinder, Textpresso, XplorMed

ER & IE Tools
Entity recognition: iHOP
Information extraction: iProLINK, JournalMine, PreBIND, PubGene

Text mining & integration tools
Text mining: Arrowsmith, LitInspector, CoPub, Genei, BeeSpace Navigator
Integration: BITOLA, G2D, ProLinks, STRING

Permission denied
Open access
– Literature mining methods cannot retrieve, extract, or correlate information from text unless it is accessible
– Restricted access is already the primary problem
Standard formats
– Getting the text out of a PDF file is not trivial
– Many journals now store papers in XML format
Where do I get all the patent text?!
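When a paper is available as XML, pulling out the running text is straightforward with a standard parser. The sketch below uses Python's built-in ElementTree on a made-up document; the element names (article, body, p, italic) are assumptions for illustration, and real journal XML follows its publisher's own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup, not taken from any real journal.
ARTICLE_XML = """<article>
  <title>Hypothetical example article</title>
  <body>
    <p>Cdc28 phosphorylates <italic>Swe1</italic> directly.</p>
    <p>Swe1 is then degraded.</p>
  </body>
</article>"""

def paragraphs(xml_text):
    # Return the plain text of every <p> element, ignoring inline markup.
    root = ET.fromstring(xml_text)
    return ["".join(p.itertext()).strip() for p in root.iter("p")]

text_paragraphs = paragraphs(ARTICLE_XML)
```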

Thank You