GMOD Meeting, May 2005 Patent Pending, Caltech Proprietary Textpresso Search engine for Biomedical Literature ~Eimear Kenny~


1 GMOD Meeting, May 2005 Patent Pending, Caltech Proprietary Textpresso Search engine for Biomedical Literature ~Eimear Kenny~


3 Born out of frustration… Search systems are effective at locating interesting papers, BUT you still have to read the paper to get to the facts. Many data are not contained in the abstract or index; therefore, important papers can be missed by search engines.

4 The Perfect System: Type in a question and the search engine tells you the answer! Full text. “Conceptual search”.

5 Enter Textpresso: Searches full text and returns any sentences that match your query. Provides two ways to query: keyword search (searches the raw data) and category search (searches the meta-data).

6 Example sentence: “… activation of let-7 RNA expression downregulates LIN-41 to relieve inhibition of lin-29.” Category labels shown on the sentence: Biological Process, Regulation, Gene, Molecular Function, Biological Process.

7 Categories: GENE, PATHWAY, REGULATION, CELL. Example terms: Locus, let-60, eat-4, LIN-12; repress, enhanced, upregulate, inhibition; precursor, upstream, cascade, descendants; Neuron, EMS, HSN, AB, Vulva precursor. 37 categories!
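
A minimal sketch of what category markup might look like, written in Python rather than Textpresso's own Perl. The lexicon below is an assumption: it mixes terms listed on the category slide with the gene names and the verb from the let-7 example sentence, and the assignment of terms to categories is a guess. The function name `tag_sentence` is hypothetical and is not part of Textpresso.

```python
import re

# Hypothetical mini-lexicon (category -> terms). The grouping is an assumption.
LEXICON = {
    "GENE": ["let-60", "eat-4", "LIN-12", "let-7", "LIN-41", "lin-29"],
    "REGULATION": ["repress", "enhanced", "upregulate", "inhibition", "downregulates"],
    "CELL": ["Neuron", "EMS", "HSN", "Vulva precursor"],
}

def tag_sentence(sentence, lexicon=LEXICON):
    """Return (start_offset, term, category) for every lexicon term found, in text order."""
    hits = []
    for category, terms in lexicon.items():
        for term in terms:
            # The lookarounds stop "let-7" from matching inside a longer token such as "let-70".
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            for match in re.finditer(pattern, sentence):
                hits.append((match.start(), term, category))
    return sorted(hits)

sentence = ("activation of let-7 RNA expression downregulates LIN-41 "
            "to relieve inhibition of lin-29.")
tags = tag_sentence(sentence)
for offset, term, category in tags:
    print(offset, term, category)
```

Matching is case-sensitive here so that LIN-41 and lin-29 remain distinct tokens; a real lexicon would need a deliberate policy on case and on inflected forms.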


10 lin-39 acts downstream of Ras; lin-25 acts indirectly via sur-2; eor-1 and eor-2 are closely involved in Ras signaling.


12 Using Textpresso to expedite curation: Find sentences from the literature that describe genetic interaction! Query: >= 2 named “Gene” && (>= 1 “Association” || >= 1 “Regulation”)
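
The query above can be read as a boolean filter over per-sentence category tags. The following Python sketch illustrates that reading only; it is not Textpresso's query engine. It assumes that ">= 2 named Gene" means two distinct gene names, and the tagged sentences are hand-written illustrations rather than Textpresso output. The function name `describes_interaction` is hypothetical.

```python
def describes_interaction(tags):
    """Return True if a tagged sentence satisfies the slide's query.

    tags: list of (term, category) pairs for one sentence.
    Assumption: ">= 2 named Gene" is read as two *distinct* gene names.
    """
    genes = {term for term, category in tags if category == "GENE"}
    categories = {category for _, category in tags}
    return len(genes) >= 2 and bool(categories & {"ASSOCIATION", "REGULATION"})

# Hand-tagged illustrations (the tags are assumptions, not Textpresso output).
two_genes_with_regulation = [
    ("let-7", "GENE"), ("downregulates", "REGULATION"),
    ("LIN-41", "GENE"), ("inhibition", "REGULATION"), ("lin-29", "GENE"),
]
two_genes_only = [("eor-1", "GENE"), ("eor-2", "GENE")]
one_gene_with_regulation = [("lin-29", "GENE"), ("inhibition", "REGULATION")]

print(describes_interaction(two_genes_with_regulation))
print(describes_interaction(two_genes_only))
print(describes_interaction(one_gene_with_regulation))
```

Counting distinct names rather than mentions is a design choice: it keeps a sentence that repeats one gene twice from passing as a gene-gene interaction candidate.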

13 Sentences containing gene-gene interactions, sampling 200 sentences: Random, 1 (0.5%); 2 named genes, 13 (6.5%); 2 named genes + 1 category, 39 (19.5%). Adding a Textpresso category enriches 3-fold!
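
The percentages and the 3-fold figure follow directly from the counts on the slide. A short Python check of that arithmetic, using only the numbers reported above (the variable names are illustrative):

```python
SAMPLE_SIZE = 200  # sentences sampled per query, from the slide

# Sentences containing gene-gene interactions, as reported on the slide.
hits = {
    "random": 1,
    "2 named genes": 13,
    "2 named genes + 1 category": 39,
}

rates = {query: count / SAMPLE_SIZE for query, count in hits.items()}
fold_enrichment = hits["2 named genes + 1 category"] / hits["2 named genes"]

for query, rate in rates.items():
    print(f"{query}: {rate:.1%}")
print(f"fold enrichment from adding a category: {fold_enrichment:.0f}x")
```

The 3-fold figure compares the category query against the two-gene query (39 versus 13 hits), not against the random sample.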

14 Installation and Adaptation of Textpresso for your Domain

15 Dependencies: Tested on Redhat 9.0 or Debian 3.1 (kernel 2.4.20 or higher); should work on any Unix-based system. Apache (1.3.29), Perl (5.6.1 or higher). Perl modules: XML::Parser, XML::RegExp, XML::XQL, XML::Checker, XML::DOM, XML::Parser::PerlSAX, PDF::Create. Brill Tagger (needs a C compiler): part-of-speech tagger (http://research.microsoft.com/~brill/). XPDF: pdftotext utility (http://www.foolabs.com/xpdf/).

16 Download: http://www.textpresso.org http://www.gmod.org

17 Unpack and Install

18 Web-site

19 Web Scripts

20 Database

21 Build Scripts [pipeline diagram; labels: Electronic PDF, Raw Text, Parts-of-speech Text, Annotated Text, Abstracts, Keywords, Index Maker, PDF2Text, Preprocessor, Text2XML, Textpresso Database, Wormbase Database, Journal Web-sites, Textpresso Ontology, Collect Papers, Collect Abstracts]

22 Tailoring Pt 1 – Text Collection. Abstracts collection: can be downloaded from a central resource such as PubMed (PubFetch!). PDF collection: limited to open access journals (PLoS Biology) or journals to which you subscribe; inject_pmid script from the Textpresso web-site (Allen Day); manual download from journal web-site.

23 Tailoring Pt 2 – Adapting Ontology

24 Tailoring Pt 2 – Adapting Ontology: Almost all “Relationship and Description” and “Syntax and Grammar” categories and some “Biological Concepts” categories are generic to the biomedical domain. Some new categories can use the existing category structure (yeast genes replace worm genes). Some de novo categories would be useful (Cell Cycle, Chromosomal Aberrations, Disease, etc.).

25 Tailoring Pt 3 – Adapting Interface

26 Tailoring Pt 3 – Adapting Interface

27 Tailoring Pt 3 – Adapting Interface

28 Textpresso 2.0

29 Overhaul Code: Adding another layer of abstraction through definition files and modules. For example, use constant SY_ANNOTATION_FIELDS => { abstract => 'abstract/', body => 'body/', title => 'title/' }; defines which fields are to be annotated during the build process. Advantages: easy to adapt the software (no script tweaking); easy to add new modules.
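
To illustrate why a definition file removes the need for script tweaking, here is a small Python analogue of the Perl constant on the slide. It is a sketch of the idea only: the function `annotation_dirs` and the root path `/data/textpresso` are hypothetical and do not come from the Textpresso code base.

```python
from pathlib import PurePosixPath

# Mirrors the slide's Perl constant: field name -> subdirectory to annotate.
SY_ANNOTATION_FIELDS = {
    "abstract": "abstract/",
    "body": "body/",
    "title": "title/",
}

def annotation_dirs(data_root, fields=SY_ANNOTATION_FIELDS):
    """Map each field name to the directory a build step would annotate."""
    return {name: str(PurePosixPath(data_root) / subdir)
            for name, subdir in fields.items()}

# Hypothetical data root, for illustration only.
dirs = annotation_dirs("/data/textpresso")
for name, path in dirs.items():
    print(name, path)

# Adding a field means editing the definition, not the build code.
extended = annotation_dirs(
    "/data/textpresso",
    {**SY_ANNOTATION_FIELDS, "introduction": "introduction/"},
)
```

The build code only iterates over whatever the definition contains, so a new field such as the hypothetical "introduction" above is picked up without touching the loop.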

30 New Features

31 Distributed Searches

32 Variable Scope

33 New Sort Modes

34 100 sentences per hour!

35 Search for patterns in sentences. Example: “The life-extension phenotype of old-1 was completely suppressed by daf-16 (m26) (Figure 1e).” Developed a hidden Markov model to identify common patterns of text that surround the required entities.

36 Hidden Markov Model

37 True test sentences have scores similar to those of the training sentences.

38 Textpresso Team. Developers: Eimear Kenny, Hans-Michael Müller. Code Contributors: Allen Day (many patches including inject_pmid), Robert Li (alternative pdf2text converter), Stan Dong and Christopher Lane (code optimization for speed), Juancarlos Chan (web-site scripting). Information Extraction Analysis: Andrei Petcherski. Paper Collection: Daniel Wang. Principal Investigator: Paul Sternberg.

