Mining text and data on chemicals Lars Juhl Jensen.

three parts

text mining

data integration

medical records

Part 1: text mining

exponential growth

some things are constant

~45 seconds per paper

information retrieval

find the relevant papers

still too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

identify the concepts

small molecules

proteins

diseases

comprehensive lexicon

synonyms

orthographic variation

“black list”

unfortunate names
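The ingredients listed above (a lexicon with synonyms, tolerance for orthographic variation, and a black list for unfortunate names) can be sketched as a small dictionary-based tagger. This is a minimal illustration, not the implementation behind any tool named in this talk: the lexicon entries, the black list, and the function names `normalize`, `build_index`, and `tag` are all hypothetical.

```python
import re

# Hypothetical toy lexicon: canonical name -> (entity type, synonyms).
LEXICON = {
    "aspirin": ("small molecule", ["aspirin", "acetylsalicylic acid"]),
    "TP53": ("protein", ["TP53", "p53", "tumor protein 53"]),
    "CAT": ("protein", ["CAT", "catalase"]),
    "schizophrenia": ("disease", ["schizophrenia"]),
}

# Hypothetical black list: names that collide with ordinary English words.
BLACK_LIST = {"cat"}

MAX_TOKENS = 4  # longest multi-word name the tagger will try


def normalize(name):
    """Collapse orthographic variation in case, hyphens, and spaces."""
    return re.sub(r"[\s\-]+", "", name.lower())


def build_index(lexicon, black_list):
    """Map every normalized, non-black-listed synonym to its entity."""
    index = {}
    for canonical, (entity_type, synonyms) in lexicon.items():
        for synonym in synonyms:
            key = normalize(synonym)
            if key not in black_list:
                index[key] = (canonical, entity_type)
    return index


def tag(text, index):
    """Return (surface form, canonical name, type) for each match, longest first."""
    tokens = list(re.finditer(r"[A-Za-z0-9]+", text))
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_TOKENS, len(tokens) - i), 0, -1):
            start, end = tokens[i].start(), tokens[i + n - 1].end()
            hit = index.get(normalize(text[start:end]))
            if hit:
                matches.append((text[start:end], hit[0], hit[1]))
                i += n
                break
        else:
            i += 1  # no name starts at this token
    return matches


index = build_index(LEXICON, BLACK_LIST)
print(tag("The cat study showed that Acetylsalicylic acid alters p-53.", index))
```

In this sketch "p-53" and "Acetylsalicylic acid" are recognized despite hyphenation and capitalization, while the word "cat" is skipped because the black list blocks that synonym.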

Reflect

augmented browsing

browser add-on

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009; O’Donoghue et al., Journal of Web Semantics, 2010

Firefox

Internet Explorer

Google Chrome

Safari

Utopia Documents

web services

collaboration

SciVerse

information extraction

formalize the facts

co-mentioning
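Co-mentioning can be sketched as counting how often two tagged entities appear in the same document and comparing that count with what chance alone would give. The observed/expected ratio below is an assumed, simplified scoring scheme chosen for illustration; the function name `comention_scores` and the example documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def comention_scores(documents):
    """documents: iterable of collections of entity names, one per abstract.

    Returns (pair_counts, scores), where each score is the observed number
    of co-mentions divided by the number expected if the two entities were
    mentioned independently of each other.
    """
    docs = [set(d) for d in documents]
    n_docs = len(docs)
    entity_counts = Counter(e for d in docs for e in d)
    pair_counts = Counter(
        pair for d in docs for pair in combinations(sorted(d), 2)
    )
    scores = {}
    for (a, b), observed in pair_counts.items():
        expected = entity_counts[a] * entity_counts[b] / n_docs
        scores[(a, b)] = observed / expected
    return pair_counts, scores


abstracts = [
    {"CYC1", "HAP1"},
    {"CYC1", "HAP1", "CYC7"},
    {"CYC7"},
    {"aspirin"},
]
counts, scores = comention_scores(abstracts)
print(counts[("CYC1", "HAP1")], scores[("CYC1", "HAP1")])
```

A ratio above 1 means the pair is mentioned together more often than expected by chance; it says nothing about the type of relationship.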

NLP (Natural Language Processing)

Gene and protein names, cue words for entity recognition, verbs for relation extraction:
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Part 2: data integration

STITCH

Kuhn et al., Nucleic Acids Research, 2012

~300,000 small molecules

~2.6 million proteins

1100+ genomes

experimental data

physical binding

chemical–protein

protein–protein

curated knowledge

drug targets

complexes

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

text mining

co-mentioning

NLP (Natural Language Processing)

many data types

many databases

different formats

different identifiers

variable quality

not comparable

spread over many genomes

quality scores

von Mering et al., Nucleic Acids Research, 2005

calibrate vs. gold standard

von Mering et al., Nucleic Acids Research, 2005

probabilistic scores
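One way to picture calibration against a gold standard: bin the raw scores of a data source, and in each bin measure the fraction of pairs that the gold standard confirms. That fraction then serves as the probabilistic score for new pairs falling in the same bin. The binning approach, the function names, and the example numbers are assumptions made for this sketch; the cited work may fit a smooth calibration curve instead.

```python
def calibrate(scored_pairs, bin_edges):
    """scored_pairs: list of (raw_score, is_in_gold_standard).

    Returns one (low, high, precision) tuple per non-empty bin, where
    precision is the fraction of pairs in [low, high) confirmed by the
    gold standard.
    """
    bins = []
    for low, high in zip(bin_edges, bin_edges[1:]):
        labels = [ok for score, ok in scored_pairs if low <= score < high]
        if labels:
            bins.append((low, high, sum(labels) / len(labels)))
    return bins


def to_probability(raw_score, bins):
    """Look up the calibrated probability for a raw score (None if no bin fits)."""
    for low, high, precision in bins:
        if low <= raw_score < high:
            return precision
    return None


# Hypothetical raw scores with gold-standard labels.
benchmark = [
    (0.1, False), (0.2, False), (0.3, True), (0.4, False),
    (0.6, True), (0.7, True), (0.8, False), (0.9, True),
]
bins = calibrate(benchmark, [0.0, 0.5, 1.0])
print(to_probability(0.65, bins))
```

Once every source is mapped onto the same probability scale, scores from different data types become comparable.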

orthology transfer

combine the evidence
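Combining the evidence can be illustrated with the standard rule for independent probabilities: an association is supported unless every evidence channel is wrong. This sketch assumes the channels are independent and leaves out any correction for prior probability; the function name `combine` and the example scores are hypothetical.

```python
from math import prod


def combine(scores):
    """Combine per-channel probabilities, assuming the channels are independent.

    The result is the probability that at least one channel is correct:
    1 - product(1 - s_i).
    """
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical scores for one chemical-protein pair from three channels.
print(combine([0.5, 0.5]))        # experiments + text mining
print(combine([0.9, 0.5, 0.2]))   # adding a weaker third channel
```

The combined score is never lower than the strongest single channel, and several weak channels together can add up to a confident association.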

Part 3: patient records

a hard problem

in Danish

by busy doctors

about psychiatric patients

no lexicon

acronyms

typos

delusions

domain specific system

patient record excerpt

annotation labels: F20, F200, Negation, Family
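The labels on this slide suggest a rule-based annotator that maps phrases to diagnosis codes and flags mentions that are negated or that refer to a family member. The sketch below is a toy version of that idea in English, whereas the records discussed here are in Danish. The term dictionary, the cue words, and the function name `annotate` are all hypothetical and are not taken from the actual system.

```python
import re

# Hypothetical phrase -> ICD-10 code dictionary.
DIAGNOSIS_TERMS = {
    "schizophrenia": "F20",
    "paranoid schizophrenia": "F200",
}
# Hypothetical cue words; a real system would need a curated, Danish list.
NEGATION_CUES = {"no", "not", "denies", "without"}
FAMILY_CUES = {"mother", "father", "sister", "brother", "family"}


def annotate(note):
    """Return one finding per sentence that mentions a known diagnosis term."""
    findings = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.lower()):
        # Try longer phrases first so the most specific code wins.
        for term in sorted(DIAGNOSIS_TERMS, key=len, reverse=True):
            position = sentence.find(term)
            if position == -1:
                continue
            # Only words before the term are treated as context.
            context = set(re.findall(r"[a-z]+", sentence[:position]))
            findings.append({
                "code": DIAGNOSIS_TERMS[term],
                "negated": bool(context & NEGATION_CUES),
                "family": bool(context & FAMILY_CUES),
            })
            break  # one finding per sentence keeps the sketch simple
    return findings


note = ("Patient diagnosed with paranoid schizophrenia. "
        "Sister does not have schizophrenia. "
        "Mother treated for schizophrenia.")
for finding in annotate(note):
    print(finding)
```

Only findings that are neither negated nor about a family member would count as diagnoses of the patient; in this example that leaves the single F200 mention.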

medication

adverse drug events

diagnoses

pharmacovigilance

patient stratification

Roque et al., PLoS Computational Biology, 2011

disease comorbidity

Roque et al., PLoS Computational Biology, 2011

DNA sequencing

genotype

phenotype

Acknowledgments

Reflect: Sune Frankild, Heiko Horn, Evangelos Pafilis, Juan-Carlos Silla-Castro, Michael Kuhn, Reinhardt Schneider, Sean O’Donoghue

STITCH: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Milan Simonovic, Alexander Roth, Pablo Minguez, Tobias Doerks, Manuel Stark, Christian von Mering, Peer Bork

EPJ-mining: Francisco S Roque, Peter B Jensen, Robert Eriksson, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Thomas Werge, Søren Brunak

larsjuhljensen