Biomedical text mining
Lars Juhl Jensen
exponential growth
~45 seconds per paper
information retrieval
named entity recognition
augmented browsing
text corpora
information extraction
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
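Indexing for fast lookup can be sketched as an inverted index: a map from each word to the set of documents that contain it, so an AND query becomes a set intersection. This is a minimal illustration, not how PubMed is implemented; the three example documents and the function names are made up for the sketch, and multiword terms such as "cell cycle" are treated as a bag of words rather than a phrase.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_and(index, query):
    """Return ids of documents that contain every word of an AND query."""
    terms = [t for part in query.split(" AND ") for t in tokenize(part)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up example documents (assumption, not real abstracts).
docs = {
    1: "Cell cycle regulation in budding yeast",
    2: "Yeast two-hybrid screening of human proteins",
    3: "Cell cycle checkpoints in human cells",
}
index = build_index(docs)
hits = search_and(index, "yeast AND cell cycle")
print(hits)  # {1}
```

The index is built once; each query then only touches the posting sets of its own terms instead of scanning every document.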
stemming
word endings
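Stemming can be illustrated with a deliberately crude suffix stripper, so that different word endings map to the same index key. The suffix list and the minimum stem length below are assumptions chosen for the example; real retrieval systems typically use an established algorithm such as the Porter stemmer.

```python
# Suffixes are tried in this order; the list is an illustrative assumption.
SUFFIXES = ("ing", "ed", "es", "e", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


forms = ["phosphorylate", "phosphorylates", "phosphorylated", "phosphorylating"]
stems = {crude_stem(form) for form in forms}
print(stems)  # {'phosphorylat'}
```

Applying the same function to both the indexed text and the query lets a search for one form retrieve documents that use another.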
dynamic query expansion
MeSH terms
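Dynamic query expansion can be sketched as rewriting each query term into an OR over the term and its known variants before the search runs. The expansion table below is a hypothetical stand-in; a real system would look the variants up in MeSH rather than hard-code them, and the output syntax is only illustrative.

```python
# Hypothetical expansion table (assumption); not taken from MeSH itself.
EXPANSIONS = {
    "yeast": ["yeasts", "saccharomyces cerevisiae"],
    "cell cycle": ["cell division cycle"],
}


def expand_query(query):
    """Rewrite 'a AND b' so each term is OR-ed with its expansions."""
    clauses = []
    for term in query.split(" AND "):
        term = term.strip()
        variants = [term] + EXPANSIONS.get(term.lower(), [])
        quoted = ['"%s"' % variant for variant in variants]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


expanded = expand_query("yeast AND cell cycle")
print(expanded)
```

Terms without an entry in the table pass through unchanged, so the expansion never narrows the original query.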
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
no tool will find that
named entity recognition
computer
as smart as a dog
teach it specific tricks
identify the concepts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
comprehensive lexicon
proteins
chemicals
compartments
tissues
diseases
organisms
CDC2
cyclin dependent kinase 1
orthographic variation
upper- and lower-case
CDC2
Cdc2
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
prefixes and postfixes
CDC2
hCDC2
“black list”
SDS
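The pieces above (lexicon, orthographic variation, black list) can be combined into a small dictionary-based tagger. This is a sketch under stated assumptions: the three-entry lexicon, the identifiers, the single "h" prefix, and the function names are all placeholders for illustration, and a scalable implementation would need a far larger lexicon and a faster matching strategy.

```python
import re

# Toy lexicon (assumption): name -> placeholder identifier.
LEXICON = {
    "CDC2": "PROTEIN:CDC2",
    "cyclin dependent kinase 1": "PROTEIN:CDC2",
    "SDS": "PROTEIN:SDS",
}
BLACK_LIST = {"sds"}   # normalized names that are too ambiguous to tag
PREFIXES = ("h",)      # hypothetical species prefix, as in hCDC2


def normalize(name):
    """Ignore case, spaces, and hyphens."""
    return re.sub(r"[\s-]+", "", name.lower())


def build_table(lexicon, black_list):
    """Index the lexicon by normalized name, skipping black-listed names."""
    return {normalize(name): entity for name, entity in lexicon.items()
            if normalize(name) not in black_list}


def lookup(key, table):
    """Look up a normalized name, also trying it with a known prefix removed."""
    if key in table:
        return table[key]
    for prefix in PREFIXES:
        if key.startswith(prefix) and key[len(prefix):] in table:
            return table[key[len(prefix):]]
    return None


def tag(text, table, max_words=4):
    """Greedy longest-match tagging over windows of up to max_words tokens."""
    tokens = list(re.finditer(r"[A-Za-z0-9]+", text))
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i].start(), tokens[i + n - 1].end()
            entity = lookup(normalize(text[start:end]), table)
            if entity:
                matches.append((text[start:end], entity))
                i += n
                break
        else:
            i += 1  # no name starts at this token
    return matches


table = build_table(LEXICON, BLACK_LIST)
found = tag("Cdc2 (cyclin-dependent kinase 1, hCDC2) was run on an SDS gel.", table)
for surface, entity in found:
    print(surface, "->", entity)
```

In this example the three spellings of the same protein resolve to one identifier, while "SDS" is left untagged because of the black list.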
scalable implementation
text corpora
>10 km <10 hours
most use Medline
~22 million abstracts
few use full-text articles
no access
PDF files
layout-aware extraction
millions of full-text articles
information extraction
formalize the facts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
two approaches
co-mentioning
counting
within documents
within paragraphs
within sentences
co-mentioning score
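A co-mentioning score can be sketched as a weighted count of how often two names appear together in the same document, paragraph, or sentence. The weights, the blank-line paragraph convention, the naive sentence splitter, and the example text are all assumptions made for illustration; they are not the scoring scheme of any particular resource.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed weights: closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}


def mentioned(text, names):
    """Return the names that occur as whole words in the text."""
    return {name for name in names
            if re.search(r"\b" + re.escape(name) + r"\b", text)}


def comention_scores(documents, names, weights=WEIGHTS):
    """Sum weighted co-mention counts for every pair of names."""
    scores = Counter()
    for doc in documents:
        paragraphs = [p for p in doc.split("\n\n") if p.strip()]
        sentences = [s for p in paragraphs
                     for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
        units = {"document": [doc], "paragraph": paragraphs,
                 "sentence": sentences}
        for level, texts in units.items():
            for text in texts:
                for pair in combinations(sorted(mentioned(text, names)), 2):
                    scores[pair] += weights[level]
    return scores


# Made-up two-paragraph document (assumption).
documents = ["Clb2 binds Cdc28. Swe1 is degraded.\n\nCdc5 phosphorylates Swe1."]
names = ["Clb2", "Cdc28", "Swe1", "Cdc5"]
scores = comention_scores(documents, names)
for pair, score in scores.most_common():
    print(pair, score)
```

Pairs that share a sentence accumulate the document, paragraph, and sentence weights, so they rank above pairs that only share a document.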
NLP (Natural Language Processing)
grammatical analysis
part-of-speech tagging
multiword detection
semantic tagging
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
extract stated facts
high precision
poor recall
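The precision and recall trade-off can be illustrated with a toy pattern-based extractor. It replaces the full grammatical analysis with two hand-written regular expressions, which is a simplification; the name pattern, the two relation patterns, and the short test sentences are assumptions for the sketch.

```python
import re

# Crude stand-in for names that entity recognition would already have tagged.
NAME = r"[A-Z][A-Za-z]*\d+"

PATTERNS = [
    # Active voice: the agent comes before the verb.
    (re.compile(rf"(?P<agent>{NAME}) (?:directly )?phosphorylated (?P<target>{NAME})"),
     "phosphorylates"),
    # Passive voice: the agent follows "by".
    (re.compile(rf"(?P<target>{NAME}) is controlled by (?P<agent>{NAME})"),
     "controls"),
]


def extract_relations(sentence):
    """Return (agent, relation, target) triples that are stated explicitly."""
    facts = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(sentence):
            facts.append((match.group("agent"), relation, match.group("target")))
    return facts


simple = extract_relations("Cdc28 directly phosphorylated Swe1.")
passive = extract_relations("CYC1 is controlled by HAP1.")
complex_sentence = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation"
)
missed = extract_relations(complex_sentence)
print(simple)   # [('Cdc28', 'phosphorylates', 'Swe1')]
print(passive)  # [('HAP1', 'controls', 'CYC1')]
print(missed)   # []
```

The patterns return only facts that match them exactly, which keeps precision high, but the parenthetical "(Cdk1 homolog)" in the longer sentence is enough to make them miss the same fact, which is where recall suffers.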
Exercise
Go to
Find TYMS disease associations
Inspect the text-mining evidence
Look for examples of synonym usage
Find genes linked to colorectal cancer
thank you!