Mining text and data on chemicals
Lars Juhl Jensen
three parts
text mining
data integration
medical records
Part 1: text mining
exponential growth
some things are constant
~45 seconds per paper
information retrieval
find the relevant papers
still too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
identify the concepts
small molecules
proteins
diseases
comprehensive lexicon
synonyms
orthographic variation
“black list”
unfortunate names
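The named entity recognition steps above (lexicon, synonyms, orthographic variation, black list) can be sketched as a dictionary lookup. This is a minimal illustration, not the tagger used in the talk: the lexicon entries, the black list and the names `LEXICON`, `BLACK_LIST`, `normalize` and `tag` are all made up for the example.

```python
import re

# Hypothetical toy lexicon: canonical name -> (entity type, synonyms).
LEXICON = {
    "aspirin": ("small molecule", ["aspirin", "acetylsalicylic acid"]),
    "CDK1": ("protein", ["CDK1", "CDC2", "cyclin-dependent kinase 1"]),
    "WAS": ("protein", ["WAS"]),
    "asthma": ("disease", ["asthma"]),
}

# Hypothetical black list of unfortunate names that collide with common words.
BLACK_LIST = {"was"}

TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9\-]*")


def normalize(name):
    """Collapse orthographic variation: ignore case, spaces and hyphens."""
    return re.sub(r"[\s\-]+", "", name.lower())


# Map every normalized synonym to its canonical name and type.
INDEX = {
    normalize(synonym): (canonical, entity_type)
    for canonical, (entity_type, synonyms) in LEXICON.items()
    for synonym in synonyms
}


def tag(text, max_tokens=4):
    """Return (surface form, canonical name, type) for each lexicon match."""
    tokens = list(TOKEN.finditer(text))
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word names win.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            surface = text[tokens[i].start():tokens[i + n - 1].end()]
            key = normalize(surface)
            if key in INDEX and key not in BLACK_LIST:
                matches.append((surface, *INDEX[key]))
                i += n
                break
        else:
            i += 1
    return matches
```

Calling `tag("Acetylsalicylic acid did not affect Cdc-2, which was unexpected in asthma.")` finds the synonym, the hyphenated variant and the disease, while the black list stops the word "was" from being tagged as a protein.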
Reflect
augmented browsing
browser add-on
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
Firefox
Internet Explorer
Google Chrome
Safari
Utopia Documents
web services
collaboration
SciVerse
information extraction
formalize the facts
co-mentioning
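Co-mentioning can be sketched as counting how often two tagged entities appear in the same document and comparing that count with what independent mentions would give. The observed/expected ratio below is an assumption chosen for simplicity, not the scoring scheme from the talk; the function name `comention_scores` and the example documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def comention_scores(docs):
    """Score entity pairs by observed / expected co-mention counts.

    docs: iterable of entity collections, one per document (for example
    the output of a named entity recognition step).
    """
    docs = [set(entities) for entities in docs]
    n_docs = len(docs)
    single = Counter()
    pair = Counter()
    for entities in docs:
        single.update(entities)
        # Sort so that each unordered pair gets one consistent key.
        pair.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), observed in pair.items():
        # Expected count if a and b were mentioned independently.
        expected = single[a] * single[b] / n_docs
        scores[(a, b)] = observed / expected
    return scores
```

A ratio above 1 means the pair is mentioned together more often than chance would predict; a real system would also need to handle rare entities, where such ratios are unstable.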
NLP (Natural Language Processing)
Legend: gene and protein names; cue words for entity recognition; verbs for relation extraction
Example: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
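The bracketed example combines three ingredients: recognized gene names, a cue word ("expression") and a verb ("is controlled by"). A much cruder pattern-based sketch of the same idea is shown here. It does not reproduce the nested parse from the slide; the gene set, the verb list and the function names are assumptions for illustration.

```python
import re

# Assumed output of a prior named entity recognition step.
GENES = {"CYC1", "CYC7", "HAP1"}

# Hypothetical list of passive relation verbs.
PASSIVE = re.compile(r"\bis (controlled|regulated|activated|repressed) by\b")


def find_genes(fragment):
    """Return known gene names in the order they appear in the fragment."""
    return [t for t in re.findall(r"[A-Za-z0-9]+", fragment) if t in GENES]


def extract_relations(sentence):
    """Return (regulator, verb, target) triples from a passive sentence."""
    match = PASSIVE.search(sentence)
    if match is None:
        return []
    before, after = sentence[:match.start()], sentence[match.end():]
    # Require the cue word so that only expression statements are extracted.
    if "expression" not in before.lower():
        return []
    return [
        (regulator, match.group(1), target)
        for regulator in find_genes(after)
        for target in find_genes(before)
    ]
```

For the sentence on the slide this yields HAP1 as the regulator of both CYC1 and CYC7. A flat pattern like this breaks on negation, coordination and long-range dependencies, which is why a grammar-based parse is used in practice.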
Part 2: data integration
STITCH
Kuhn et al., Nucleic Acids Research, 2012
~300,000 small molecules
~2.6 million proteins
1100+ genomes
experimental data
physical binding
chemical–protein
protein–protein
curated knowledge
drug targets
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
text mining
co-mentioning
NLP (Natural Language Processing)
many data types
many databases
different formats
different identifiers
variable quality
not comparable
spread over many genomes
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
probabilistic scores
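Calibrating against a gold standard turns raw, source-specific quality scores into probabilities that are comparable across sources. The sketch below uses equal-size bins and the fraction of gold-standard positives per bin; binning is a simplification assumed here (a fitted smooth curve would be a common alternative), and the function name `calibrate` is hypothetical.

```python
from bisect import bisect_right


def calibrate(raw_scores, is_true, bin_size):
    """Build a raw-score -> probability lookup from a gold standard.

    raw_scores: raw quality scores of benchmarked interactions.
    is_true:    matching flags, True if the gold standard confirms the pair.
    bin_size:   number of interactions per bin.
    """
    ranked = sorted(zip(raw_scores, is_true))
    edges, precision = [], []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        edges.append(chunk[0][0])  # lowest raw score in this bin
        precision.append(sum(flag for _, flag in chunk) / len(chunk))

    def to_probability(score):
        # Pick the last bin whose lower edge is <= score.
        k = max(bisect_right(edges, score) - 1, 0)
        return precision[k]

    return to_probability
```

With `raw = [1, 2, 3, 4, 5, 6, 7, 8]`, `truth = [0, 0, 0, 1, 0, 1, 1, 1]` and `bin_size=4`, the returned function maps low raw scores to 0.25 and high raw scores to 0.75. Nothing here forces the mapping to increase with the raw score, so noisy bins may need smoothing.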
orthology transfer
combine the evidence
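Once every evidence type carries a probabilistic score, the scores can be combined. One common way, assuming the evidence channels are independent, is to take the probability that at least one channel is right. The sketch below shows that rule only; any correction for prior probability is left out, and the function name `combine` is hypothetical.

```python
from math import prod


def combine(probabilities):
    """Combine independent evidence: 1 - product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probabilities)
```

Two independent channels at 0.5 each combine to 0.75, and a channel with score 0 leaves the result unchanged, so missing evidence never lowers the combined score.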
Part 3: patient records
a hard problem
in Danish
by busy doctors
about psychiatric patients
no lexicon
acronyms
typos
delusions
domain specific system
patient record excerpt
annotation labels: F20, F200, Negation, Family
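The labels on the excerpt suggest three things being marked: diagnosis codes, negated mentions and mentions about family members. A minimal window-based sketch of that idea is given here. It is not the domain-specific system from the talk: the records are in Danish, whereas the term list, trigger words and the function name `annotate` below are English stand-ins invented for illustration.

```python
import re

# Hypothetical term -> code dictionary (illustrative entries only).
TERM_TO_CODE = {
    "schizophrenia": "F20",
    "paranoid schizophrenia": "F200",
}
# Hypothetical trigger words.
NEGATION = {"no", "not", "denies", "without"}
FAMILY = {"mother", "father", "brother", "sister", "family"}


def annotate(sentence, window=4):
    """Return (code, negated, family) for each term found in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    results, i = [], 0
    while i < len(words):
        for n in (2, 1):  # prefer the longer, more specific term
            term = " ".join(words[i:i + n])
            if i + n <= len(words) and term in TERM_TO_CODE:
                # Look for trigger words in the few words before the term.
                context = set(words[max(0, i - window):i])
                results.append((
                    TERM_TO_CODE[term],
                    bool(context & NEGATION),
                    bool(context & FAMILY),
                ))
                i += n
                break
        else:
            i += 1
    return results
```

For example, `annotate("No signs of paranoid schizophrenia.")` flags the code as negated, and `annotate("Mother was treated for schizophrenia.")` flags it as a family mention rather than a diagnosis of the patient. A fixed window is a blunt instrument, and real records add acronyms and typos that this sketch ignores.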
medication
adverse drug events
diagnoses
pharmacovigilance
patient stratification
Roque et al., PLoS Computational Biology, 2011
disease comorbidity
DNA sequencing
genotype
phenotype
Acknowledgments
Reflect: Sune Frankild, Heiko Horn, Evangelos Pafilis, Juan-Carlos Silla-Castro, Michael Kuhn, Reinhardt Schneider, Sean O’Donoghue
STITCH: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Milan Simonovic, Alexander Roth, Pablo Minguez, Tobias Doerks, Manuel Stark, Christian von Mering, Peer Bork
EPJ-mining: Francisco S Roque, Peter B Jensen, Robert Eriksson, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Thomas Werge, Søren Brunak
larsjuhljensen