
1 An Information Retrieval and Extraction System for C. elegans Literature

2 What does NF-kappaB regulate?
How do we currently perform literature searches? Typically we have a question in mind that we want to answer, such as “What does NF-kappaB regulate?”. Then we go to a literature search engine, such as the popular PubMed, and try to think of keywords that might return publications about NF-kappaB regulation. So we might enter, say, “NF-kappaB” and “regulate” as keywords, and then hit enter!

3 Eeeeekkkk!!!
Problem 1. The return is enormous!! Why?
- The majority of the returns are false positives! (They contain the keywords but have nothing to do with “NF-kappaB regulation”.)
Problem 2. Missing articles
- Many pertinent articles are missed because they don’t use the exact keywords, but synonyms of the keywords!
- Many pertinent articles are missed because the answer is not in the abstract but buried in the text of the article!
Problem 3. Time consuming
- Need to read the entire paper to discover whether it answers the question

4 System Specifications
Solution: Textpresso!!!!
Queries: article classification, keyword searches, semi-semantic queries, batch retrieval of facts
Return: citation, abstract, full text, paper sections
Target Users: researchers, curators, bioinformaticians/NLP
We wanted to build a system to make literature searching much less painful! System design (our wish list for Textpresso):
- allows a range of queries:
  - quickly categorize a large batch of papers according to what data they contain
  - simple Google-like keyword searches
  - sentence-like queries formulated by the user
  - expert queries that automatically extract facts (e.g. gene-gene interactions) from text
- allows a range of returns: not just citations or abstracts, but actual sentences from the text of an article
- intended users: anyone from keyworders to bioinformaticians
We also wanted to make the system technologically simple, so that it would be easy to use, easy to install, and we could implement it as soon as possible. Two years after its initial concept, we have accomplished most of these goals!

5 The textpresso homepage
- contains a search field: the user can enter a keyword and off they go …. very easy!
- contains a news section that notifies the user of updates and new features
- contains side menus for site navigation and info on database contents

6 A typical return to a query
A typical return for a query with keyword “let-7”, category “gene” and category “regulation”. A category is a group of words that have the same meaning (more on this later). The user can choose to search abstracts and/or the FULL TEXT of articles. The return lists citation info sorted by number of matches.
Links from the page:
- OT: link to on-line journal article text where available
- RA: link to PubMed “related articles”
- EA: download citation in EndNote format
Magic buttons!! The purple “view all matches” (and red VM) buttons lead you to …..

7 ….. The sentences from the journal article that answer your query!!!

8 There is also an extensive user guide and feedback forum

9 With loads of example searches to get you started

10 Word categories: Specific, Partially Generic, Generic
Biological Entities (specific; “Plugin Dictionaries”): transgene, allele, nucleic acid, organism, clone, strain, sex, entity feature, life stage, phenotype, drugs and small molecules, molecular function, cell and cell group, cellular component, mutant
Actions, Facts or Circumstances that Relate Two Entities (partially generic; “Common Sense”): method, consort, effect, purpose, pathway, regulation, action, physical association, comparison, spatial/time relation, localization, involvement, characterization, biological process, descriptor
Auxiliary (generic): bracket, determiner, conjunction, auxiliary, conjecture, negation, pronoun, preposition, punctuation
So what about those word categories mentioned earlier??? They are a new and powerful way to search the literature, and they overcome the problem of missing facts that are expressed using a synonym of the keyword you entered. Word categories are groups of words that have the same sense in a biological context. For example, the category “regulation” contains the words enhance, repressed, suppress, regulation, inhibits etc. The “gene” category contains all the C. elegans gene names, plus the words “gene” and “locus”. There are 39 such word categories, 3 of which incorporate the 3 GO categories, and they fall into 3 super categories:
- The Biological Entities super group is specific to the particular organism literature set and comprises mainly dictionaries of entity names (genes, proteins, transgenes etc.).
- The “Relationship” super group was developed from the worm literature and we expect it to port to other organism literature sets with a bit of tweaking. We are currently working to implement Textpresso for SGD.
- The “Semantic” super group is not seen by the user; rather, it is used for advanced fact extraction from text.

11 Labeling words by category
Sentence: “….. activation of let-7 RNA expression downregulates LIN-41 to relieve inhibition of lin-29.”
Category labels: activation = Biological Process; let-7 = Gene; expression = Biological Process; downregulates = Regulation; LIN-41 = Molecular Function; inhibition = Regulation; lin-29 = Gene
The word categories work by recognizing words and phrases in the text and labeling them appropriately. The underlying data is marked up in XML:
  <?xml version="1.0" encoding="ISO " standalone="no" ?>
  <!DOCTYPE article SYSTEM "/var/www/html/textpresso.dtd">
  <article>
  //
  <sentence id='s7'>
  //
    <process grammar ='NN' source='textpresso' type='general' biosynthesis='no'> activation</process>
    <pposition grammar ='IN' type='of'> of </pposition>
    <gene grammar ='JJ' reference='direct'> let-7 </gene>
    <text>RNA</text>
    <process grammar ='NN' source='textpresso' type='molecular' biosynthesis='expression'> expression</process>
    <regulation grammar ='NNS' type='negative'> down regulates</regulation>
    <function grammar ='NNP' reference='direct' source='textpresso' protein='yes'> LIN-41 </function>
    <pposition grammar ='TO' type='to'>to </pposition>
    <text>relieve</text>
    <regulation grammar ='NNS' type='negative'> inhibition </regulation>
    <pposition grammar ='IN' type='of'> of</pposition>
    <gene grammar ='NNP' reference='direct'> lin-29 </gene>
    <text>. </text>
  </sentence>
  //
  </article>
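To give a feel for how markup like this could be consumed, here is a minimal sketch that parses a simplified copy of the sentence with Python's standard xml.etree.ElementTree and lists each labelled term with its category. The trimmed XML string (fewer attributes, no DOCTYPE) and the function name tagged_terms are assumptions made for this example, not Textpresso's actual files or code.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the marked-up sentence (assumption: attributes trimmed,
# XML declaration and DOCTYPE omitted so the snippet is self-contained).
MARKUP = """
<article>
  <sentence id="s7">
    <process type="general">activation</process>
    <pposition type="of">of</pposition>
    <gene reference="direct">let-7</gene>
    <text>RNA</text>
    <process type="molecular">expression</process>
    <regulation type="negative">down regulates</regulation>
    <function protein="yes">LIN-41</function>
    <pposition type="to">to</pposition>
    <text>relieve</text>
    <regulation type="negative">inhibition</regulation>
    <pposition type="of">of</pposition>
    <gene reference="direct">lin-29</gene>
    <text>.</text>
  </sentence>
</article>
"""


def tagged_terms(xml_string):
    """Return (sentence id, category tag, term) for every labelled term."""
    root = ET.fromstring(xml_string)
    terms = []
    for sentence in root.iter("sentence"):
        for element in sentence:
            # <text> elements hold words that belong to no category.
            if element.tag != "text":
                terms.append(
                    (sentence.get("id"), element.tag, element.text.strip())
                )
    return terms


# All terms labelled as genes in the example sentence.
genes = [term for _, tag, term in tagged_terms(MARKUP) if tag == "gene"]
```

Because every word already carries its category as an element name, a category search reduces to filtering on the tag, which is the reason the markup makes category queries cheap.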

12 What genes does let-7 regulate?
Keyword: “let-7”
Category: “Regulation”
Category: “Gene”
Therefore, following the example from the beginning, if I wanted to search the literature for the answer to the question “What genes does let-7 regulate?”, it might make sense to choose the keyword “let-7” and the two categories “regulation” and “gene”.
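A keyword-plus-category query of this kind can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny CATEGORIES lexicon is made up for the example (the real categories are far larger, and “downregulates” is added here by hand), the tokenizer is a plain regular expression, and we assume a category must be matched by a word other than the keyword itself.

```python
import re

# Tiny illustrative lexicon (assumption: hand-picked words for this example).
CATEGORIES = {
    "regulation": {"enhance", "repressed", "suppress", "regulation",
                   "inhibits", "downregulates"},
    "gene": {"let-7", "lin-29", "lin-41", "gene", "locus"},
}


def tokenize(sentence):
    """Lower-case the sentence and keep hyphenated names such as let-7 whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def matches(sentence, keyword, categories):
    """True if the sentence has the keyword and a word from every category."""
    tokens = tokenize(sentence)
    keyword = keyword.lower()
    if keyword not in tokens:
        return False
    # Assumption: the keyword itself does not count as a category hit.
    others = [t for t in tokens if t != keyword]
    return all(any(t in CATEGORIES[c] for t in others) for c in categories)


sentences = [
    "Activation of let-7 RNA expression downregulates LIN-41 "
    "to relieve inhibition of lin-29.",
    "The let-7 locus was cloned.",
    "daf-2 inhibits daf-16.",
]
hits = [s for s in sentences if matches(s, "let-7", ["regulation", "gene"])]
```

Only the first sentence is kept: the second has no regulation word and the third does not mention let-7.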

13 Facts returned from Journal articles!
[Figure labels: Keyword, Categories] Here are some of the real returns to this query.

14 Automatic Classification of Papers
965 journal articles were read by human curators at Wormbase and flagged for different types of information. For example, 163 of the 965 papers contained data on antibodies and 327 contained information on gene expression. Three different searches were applied to try to flag this paper set automatically.
Recall is the percentage of the papers flagged by humans that the search returned (the recall column). Precision is the percentage of the papers returned by the search that were correct, i.e. also flagged by humans (the precision column).
- Automatic keyword searches on the abstracts of the papers in the test set retrieved only ~33% (1/3) of the papers for the different data types overall.
- The same keyword search on the full text of the papers retrieved > 90% of the papers for the different data types.
- Using the Textpresso Advanced Retrieval search engine (a mixture of keywords and/or ontologies), the precision of the full text search increased by ~10% compared to a keyword search on full text.
For some data types, such as Antibody data, Mapping data, Sequence Features and Transgenes, the information is rarely contained in the abstracts (0% - 9% of instances); therefore, full text is required to retrieve this data automatically! We hope to improve the precision of the Textpresso full text search by introducing a search-in-sub-sections feature in the near future.
A significant amount of biological information is contained only in the full text of a journal article and not in the abstract!
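For readers who want the two measures spelled out, here is a small sketch of how precision and recall could be computed for one data type from the set of papers a search returned and the set the curators flagged. The function name and the paper IDs are made up for illustration; they are not from the Wormbase test set.

```python
def precision_recall(returned, flagged):
    """Compute (precision, recall) for one data type.

    returned: paper IDs the automatic search returned
    flagged:  paper IDs the human curators flagged
    """
    returned, flagged = set(returned), set(flagged)
    correct = returned & flagged  # papers both returned and flagged
    precision = len(correct) / len(returned) if returned else 0.0
    recall = len(correct) / len(flagged) if flagged else 0.0
    return precision, recall


# Made-up example: curators flagged four papers, the search returned three,
# two of which were among the flagged ones.
flagged_by_curators = {"p1", "p2", "p3", "p4"}
returned_by_search = {"p1", "p2", "p5"}
precision, recall = precision_recall(returned_by_search, flagged_by_curators)
```

In this toy case two of the three returned papers are correct (precision 2/3) and two of the four flagged papers were found (recall 1/2).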

15 Extract C. elegans alleles from full text
e.g. vba-1(e2). We have conducted our first experiment in automatically extracting facts from the literature using Textpresso. Task: to extract gene<->allele associations from the literature. In C. elegans nomenclature an association is written as gene name, open parenthesis, allele name, close parenthesis.

16 Text extraction pattern:
<gene><bracket><allele><bracket>
This nomenclature converts to the text extraction pattern <gene><bracket><allele><bracket>. A curator reviews all the associations returned and decides whether to accept them.
Result:
Gene      Allele    Evidence
age-1     hx546     cgc3008
dpy-5     e61       cgc666
daf-16    mg51a     cgc5034
lon-2     e678      wbg14.1
unc-32    e189      wm97ab55
osm-3     p802      cgc2033
lin-29    n333      pmid31222
unc-5     e53       euwm2000
daf-2     e1370     cgc3012
Sentence (examples): “...age-1(hx546)...”, “...expressed in....”, “. osm-3(p802) was found to be......”
Accept y/n?
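As a rough stand-in for the pattern above, the sketch below uses a plain regular expression rather than Textpresso's category markup, so it should be read as an approximation, not as the system's method. The shapes assumed for gene names (three or four lowercase letters, a hyphen, a number) and allele names (one to three lowercase letters, a number, optional trailing letters) are simplifications, and the example sentence is made up.

```python
import re

# Assumed shapes: gene = letters-number, allele = short prefix + number.
GENE_ALLELE = re.compile(r"\b([a-z]{3,4}-\d+)\(([a-z]{1,3}\d+[a-z]*)\)")


def extract_gene_alleles(sentence, evidence):
    """Return one candidate record per gene(allele) mention in the sentence."""
    return [
        {"gene": gene, "allele": allele,
         "evidence": evidence, "sentence": sentence}
        for gene, allele in GENE_ALLELE.findall(sentence)
    ]


# Made-up sentence; the evidence ID is just a label carried along for the
# curator who decides whether to accept each candidate.
candidates = extract_gene_alleles(
    "Both age-1(hx546) and daf-2(e1370) mutants were examined.", "cgc3008"
)
pairs = [(c["gene"], c["allele"]) for c in candidates]
```

Each record keeps the sentence and evidence alongside the pair, mirroring the columns a curator sees before answering “Accept y/n?”.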

17 Allele extraction results
Total papers: ~2,000
gene<->allele<->reference associations retrieved: ~14,000 distinct
FILTER: ~99% could be uploaded to Wormbase straight away
~300 required manual resolution: in these cases the same allele was associated with more than one gene
- ~80 gene synonyms discovered
- typos, e.g. rol-2(e678) 160 hits, bli-2(e768) 17 hits, rol-2(e768) hits
~1,100 new gene-allele associations
Total time = a couple of curator days ;)
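The filtering step can be pictured with a short sketch: group the retrieved associations by allele and set aside any allele that appears with more than one gene for manual resolution. This is our guess at the simplest form such a filter could take, not a description of the actual code; the function name and the reference IDs ref1 to ref3 are placeholders.

```python
from collections import defaultdict


def split_by_conflict(associations):
    """Split (gene, allele, reference) tuples into clean and conflicting.

    An association is conflicting when its allele was seen with more than
    one gene anywhere in the input.
    """
    genes_by_allele = defaultdict(set)
    for gene, allele, _reference in associations:
        genes_by_allele[allele].add(gene)

    clean, conflicts = [], []
    for association in associations:
        allele = association[1]
        if len(genes_by_allele[allele]) > 1:
            conflicts.append(association)  # needs a curator's decision
        else:
            clean.append(association)      # can be uploaded straight away
    return clean, conflicts


# Placeholder data: e768 shows up with two different genes.
associations = [
    ("age-1", "hx546", "ref1"),
    ("bli-2", "e768", "ref2"),
    ("rol-2", "e768", "ref3"),
]
clean, conflicts = split_by_conflict(associations)
```

A conflict flagged this way may turn out to be a gene synonym or a typo in the paper, which is why those cases go to a curator rather than being dropped.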

18 Lots of work to do….. Increasing recall and precision
- Anaphora resolution (5%-8%)
- Synonym recognition
- Searching in sub-sections of the paper (i.e. method, results etc.)
- Develop Textpresso Ontology
- Integrating open source ontologies (MeSH, UMLS)
- Pilot study of other MODs (currently SGD)
- Package and release software
- Develop Fact Extraction (next target: gene-gene interaction)

