1
Textpresso: Search engine for Biomedical Literature. Eimear Kenny. GMOD Meeting, May 2005. Patent Pending, Caltech Proprietary.
3
Born out of frustration…
Search systems are effective at locating interesting papers, BUT you still have to read the paper to get to the facts.
Many data are not contained in the abstract or index; therefore, important papers can be missed by search engines.
4
The Perfect System
Type in a question and the search engine tells you the answer!
Full text
“Conceptual search”
5
Enter Textpresso
Searches full text
– returns any sentences that match your query
Provides two ways to query
– search raw data: Keyword search
– search meta-data: Category search
6
Example sentence: "… activation of let-7 RNA expression downregulates LIN-41 to relieve inhibition of lin-29."
Category labels marked on the sentence: Biological Process, Regulation, Gene, Molecular Function, Biological Process.
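The category markup on this slide can be illustrated with a minimal sketch of lexicon-based tagging. This is not Textpresso's own code: the `LEXICON` contents, the assignment of terms to categories, and the `tag_sentence` helper are all assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon; the terms and their category assignments are
# illustrative assumptions, not Textpresso's actual ontology.
LEXICON = {
    "Gene": ["let-7", "lin-41", "lin-29"],
    "Regulation": ["activation", "downregulates", "inhibition"],
}

def tag_sentence(sentence, lexicon=LEXICON):
    """Return a sorted list of (start_offset, matched_text, category) tuples."""
    hits = []
    for category, terms in lexicon.items():
        for term in terms:
            # Match whole terms only, case-insensitively; letters, digits and
            # hyphens are treated as part of a term.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                hits.append((m.start(), m.group(0), category))
    return sorted(hits)

sentence = ("activation of let-7 RNA expression downregulates LIN-41 "
            "to relieve inhibition of lin-29.")
tags = tag_sentence(sentence)
for start, text, category in tags:
    print(start, text, category)
```

Storing the character offset with each hit keeps the tags in sentence order, which is what a later query or display step would most likely need.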
7
Categories
Category names shown: GENE, PATHWAY, REGULATION, CELL
Example terms shown: Locus, let-60, eat-4, LIN-12; repress, enhanced, upregulate, inhibition; precursor, upstream, cascade, descendants; Neuron, EMS, HSN, AB, Vulva precursor
37 Categories!!!
10
lin-39 acts downstream of Ras
lin-25 acts indirectly via sur-2
eor-1 and eor-2 are closely involved in Ras signaling
12
Using Textpresso to expedite curation
Find sentences from the literature that describe genetic interaction!
>= 2 named “Gene” && (>= 1 “Association” || >= 1 “Regulation”)
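The curation query on this slide can be written as a small predicate over one annotated sentence. This is only a sketch: the `(term, category)` input format and the function name are assumptions, and counting distinct gene names (rather than gene mentions) is one possible reading of ">= 2 named Gene".

```python
from collections import Counter

def matches_interaction_query(annotations):
    """Check one sentence against the slide's query.

    annotations: list of (term, category) pairs for a single sentence.
    This input format is hypothetical, chosen only for the example.
    """
    # Assumption: ">= 2 named Gene" means two distinct gene names.
    genes = {term.lower() for term, category in annotations if category == "Gene"}
    counts = Counter(category for _, category in annotations)
    return len(genes) >= 2 and (counts["Association"] >= 1 or counts["Regulation"] >= 1)

# Illustrative annotations; the category assigned to each term is made up.
example = [("lin-39", "Gene"), ("Ras", "Gene"), ("downstream", "Association")]
print(matches_interaction_query(example))
```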
13
Sentences containing gene-gene interactions
Sampling 200 sentences…
Random: 1 (0.5%)
2 named genes: 13 (6.5%)
2 named genes + 1 category: 39 (19.5%)
Adding Textpresso category enriches 3-fold!
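The percentages and the 3-fold figure follow directly from the counts on the slide. The short check below uses only those counts; the variable names are illustrative.

```python
SAMPLE_SIZE = 200  # sentences sampled per search strategy, as stated on the slide

# Sentences found to contain gene-gene interactions, per strategy.
hits = {
    "Random": 1,
    "2 named genes": 13,
    "2 named genes + 1 category": 39,
}

# Hit rate as a percentage of the sample.
percent = {name: 100 * count / SAMPLE_SIZE for name, count in hits.items()}

# Enrichment from adding a category to the two-gene query.
enrichment = hits["2 named genes + 1 category"] / hits["2 named genes"]

for name, value in percent.items():
    print(f"{name}: {value}%")
print(f"enrichment: {enrichment}-fold")
```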
14
Installation and Adaptation of Textpresso for your Domain
15
Dependencies
Tested on Redhat 9.0 or Debian 3.1 (kernel 2.4.20 or higher)
– should work on any unix-based system
Apache (1.3.29), Perl (5.6.1 or higher)
Perl Modules:
– XML::Parser, XML::RegExp
– XML::XQL, XML::Checker
– XML::DOM, XML::Parser::PerlSAX
– PDF::Create
Brill Tagger (C compiler)
– parts of speech tagger (http://research.microsoft.com/~brill/)
XPDF
– pdftotext utility (http://www.foolabs.com/xpdf/)
16
Download
http://www.textpresso.org
http://www.gmod.org
17
Unpack and Install
18
Web-site
19
Web Scripts
20
Database
21
Build Scripts
[Flow diagram; labels: Electronic PDF, Raw Text, Parts-of-speech Text, Annotated Text, Abstracts, Keywords, Index Maker, PDF2Text, Preprocessor, Text2XML, Textpresso Database, Wormbase Database, Journal Web-sites, Textpresso Ontology, Collect Papers, Collect Abstracts]
22
Tailoring Pt 1 - Text Collection
Abstracts Collection:
– can be downloaded from a central resource such as PubMed
– PubFetch!
PDF Collection:
– limited to open access journals (PLoS Biology) or journals to which you subscribe
– inject_pmid script from Textpresso web-site (Allen Day)
– manual download from journal web-site
23
Tailoring Pt 2 – Adapting Ontology
24
Tailoring Pt 2 – Adapting Ontology
Almost all “Relationship and Description” and “Syntax and Grammar” categories, and some “Biological Concepts” categories, are generic to the biomedical domain.
Some new categories can use the existing category structure (yeast genes replace worm genes).
Some de novo categories would be useful (Cell Cycle, Chromosomal Aberrations, Disease, etc.).
25
Tailoring Pt 3 – Adapting Interface
26
Tailoring Pt 3 – Adapting Interface
27
Tailoring Pt 3 – Adapting Interface
28
Textpresso 2.0
29
Overhaul Code
Adding another layer of abstraction
– definition files and modules
use constant SY_ANNOTATION_FIELDS => { abstract => 'abstract/', body => 'body/', title => 'title/' };
… defines which fields are to be annotated during the build process
Advantages:
– easy to adapt software (no script tweaking)
– easy to add new modules
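To show how such a definition can drive a build step without touching the scripts, here is a Python analog of the Perl constant. It is a sketch only: the dictionary mirrors the slide, while `fields_to_annotate` and its behavior are hypothetical and not part of Textpresso.

```python
# Python analog of the Perl constant on the slide: field name -> subdirectory.
SY_ANNOTATION_FIELDS = {
    "abstract": "abstract/",
    "body": "body/",
    "title": "title/",
}

def fields_to_annotate(available_fields, definition=SY_ANNOTATION_FIELDS):
    """Return (field, subdirectory) pairs that a build step should annotate.

    Hypothetical helper: fields missing from the definition are skipped, so
    changing what gets annotated means editing the definition, not the script.
    """
    return [(name, definition[name]) for name in available_fields if name in definition]

print(fields_to_annotate(["title", "references", "body"]))
```

Passing the definition as a parameter lets a different domain supply its own mapping while reusing the same build code.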
30
New Features
31
Distributed Searches
32
Variable Scope
33
New Sort Modes
34
100 sentences per hour!
35
Search for patterns in sentences
The life-extension phenotype of old-1 was completely suppressed by daf-16 (m26) (Figure 1e).
Developed a hidden Markov model to identify common patterns of text that surrounds required entities.
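As a rough illustration of how a hidden Markov model can score a sentence, the sketch below runs the standard forward algorithm on a toy model. Everything in it is an assumption made for the example: the two states, the three-symbol alphabet (GENE, REG, OTHER), and all probabilities are invented, and the deck does not describe the actual model.

```python
import math

# Toy HMM; states, symbols and probabilities are invented for illustration.
STATES = ("context", "entity")
START = {"context": 0.8, "entity": 0.2}
TRANS = {
    "context": {"context": 0.6, "entity": 0.4},
    "entity": {"context": 0.7, "entity": 0.3},
}
EMIT = {
    "context": {"GENE": 0.05, "REG": 0.35, "OTHER": 0.60},
    "entity": {"GENE": 0.80, "REG": 0.10, "OTHER": 0.10},
}

def log_likelihood(symbols):
    """Forward algorithm: log P(symbols) under the toy HMM (non-empty input)."""
    # alpha[s] = P(symbols seen so far, current hidden state = s)
    alpha = {s: START[s] * EMIT[s][symbols[0]] for s in STATES}
    for sym in symbols[1:]:
        alpha = {
            s: EMIT[s][sym] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return math.log(sum(alpha.values()))

def per_symbol_score(symbols):
    """Length-normalized score, so sentences of different length are comparable."""
    return log_likelihood(symbols) / len(symbols)

# The example sentence from the slide, reduced by hand to the toy alphabet.
old1_daf16 = ["OTHER", "OTHER", "OTHER", "OTHER", "GENE",
              "OTHER", "OTHER", "REG", "OTHER", "GENE"]
print(per_symbol_score(old1_daf16))
```

Plain probabilities are adequate for sentence-length inputs; much longer sequences would call for log-space or scaled forward variables to avoid underflow.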
36
Hidden Markov Model
37
True test sentences have scores similar to those of training sentences
38
Textpresso Team
Developers: Eimear Kenny, Hans-Michael Müller
Code Contributors:
– Allen Day (many patches including inject_pmid)
– Robert Li (alternative pdf2text converter)
– Stan Dong and Christopher Lane (code optimization for speed)
– Juancarlos Chan (web-site scripting)
Information Extraction Analysis: Andrei Petcherski
Paper Collection: Daniel Wang
Principal Investigator: Paul Sternberg