
Mining External Resources for Biomedical IE: Why, How, What (Malvina Nissim)


1 Mining External Resources for Biomedical IE: Why, How, What. Malvina Nissim, mnissim@inf.ed.ac.uk

2 Why
goal: Named Entity Recognition
method: supervised learning
feature extraction (text)
internal features: word shape, n-grams, ...
protein-indicative features:
- of shape a0a0a0a…
- followed by /bind/
- shorter than 5 characters
generalisations on training data might be incomplete
acquired evidence might be absent in test instance
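The three protein-indicative features on this slide could be computed roughly as follows. This is a minimal sketch, not the system described in the talk: the shape convention (letters become "a", digits become "0"), the reading of /bind/ as "next token starts with bind", and the names `word_shape` and `internal_features` are all assumptions made for illustration.

```python
import re


def word_shape(token: str) -> str:
    """Map letters to 'a' and digits to '0' (assumed convention), e.g. 'p53' -> 'a00'."""
    shape = []
    for ch in token:
        if ch.isalpha():
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append(ch)  # keep punctuation such as '-' unchanged
    return "".join(shape)


def internal_features(tokens, i):
    """Return the three slide features for tokens[i] (hypothetical helper)."""
    token = tokens[i]
    next_token = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    return {
        # alternating letter/digit shape such as a0a0a0a
        "alternating_shape": re.fullmatch(r"(a0)+a?", word_shape(token)) is not None,
        # assumption: /bind/ is read as "the next token starts with 'bind'"
        "followed_by_bind": next_token.startswith("bind"),
        "shorter_than_5": len(token) < 5,
    }


features = internal_features(["p53", "binds", "DNA"], 0)
```

Features like these are learned from the training data only, which is why the slide warns that they may generalise incompletely or be absent from a test instance.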

3 Getting Additional Evidence
internal features might be insufficient, but good evidence might be somewhere else...
Note: some systems (MaxEnt, for instance) can easily and successfully integrate a huge number of features
small and accurate lists of proteins (gazetteers)
- use as rules
- use as features
other texts might contain indicative n-grams
- how to use other texts
- which texts to use

4 How
patterns: “X gene/protein/DNA”, “X sequence/motif”
A. Create patterns (aim, method, input)
B. Search corpus for patterns and obtain counts
C. Use counts as appropriate
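Steps A and B can be sketched over a small in-memory corpus. The sketch below only illustrates the idea of instantiating patterns and counting matches; the talk queries Google and PubMed instead, and the names `PATTERN_TEMPLATES` and `pattern_counts` as well as the two example sentences are invented for illustration.

```python
import re

# Step A: pattern templates taken from the slide; {x} is the candidate term.
PATTERN_TEMPLATES = ["{x} gene", "{x} protein", "{x} DNA", "{x} sequence", "{x} motif"]


def pattern_counts(candidate, corpus, templates=PATTERN_TEMPLATES):
    """Step B: count case-insensitive occurrences of each instantiated pattern."""
    counts = {}
    for template in templates:
        phrase = template.format(x=candidate)
        regex = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        counts[phrase] = sum(len(regex.findall(doc)) for doc in corpus)
    return counts


# Toy corpus (made up); a real run would use search-engine hit counts.
corpus = [
    "The p53 gene is mutated in many tumours.",
    "p53 protein binds DNA; the P53 gene encodes it.",
]
counts = pattern_counts("p53", corpus)
```

Step C, how the counts are used, is left open here on purpose: they can feed a hard rule or become classifier features.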

5 Create Patterns (I)
1. AIM (granularity)
distinguish entities from non-entities: “X gene OR DNA OR protein”
+ bypass ambiguities and data sparseness
– less information
distinguish between entities: “X DNA”, “X gene”
+ more information
– ambiguities, data sparseness
further example pattern: “X binds”
1. AIM 2. METHOD 3. INPUT

6 Create Patterns (II)
2. METHOD
by hand (experts)
+ high precision, exact target
– time consuming, experts needed
automatically (collocations, clustering)
+ no human intervention
– lower precision, not necessarily interesting patterns

7 Create Patterns (III)
3. INPUT (“X gene”)
- low frequency words (as estimated from a non-specific corpus)
- first output of classifier
- NP chunks
- words not found in standard dictionary
increase precision but lower recall

              prec   rec    f-score
all features  .813   .861   .836
– web         .807   .864   .835
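The f-score column can be checked against the precision and recall columns. The slide does not say which F-measure is used; the sketch below assumes the balanced F1 (harmonic mean of precision and recall), which does reproduce both reported values.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed, not stated on the slide)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


all_features = round(f_score(0.813, 0.861), 3)
without_web = round(f_score(0.807, 0.864), 3)
```

Under that assumption the two rows differ by only about .001 in f-score: the web features trade a small gain in precision for a small loss in recall.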

8 What? Google vs PubMed
PubMed: searchable collection of over 12M biomedical abstracts, more sophisticated search options
Google: everything; searches over 8 billion pages, raw search, API
hits for “p53 gene”:
PubMed   5,843 documents
Google   ~165,000 pages

9 Google + PubMed
“anything you want” site:
“p53 gene” site:www.ncbi.nlm.nih.gov
Rob Futrelle has this function available on this webpage: http://www.ccs.neu.edu/home/futrelle/bionlp/search.html
comment: sometimes PubMed reports “Quoted phrase not found” even when Google finds the phrase. PubMed provides phrase search only on pre-indexed phrases.
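The site-restricted query shown on this slide is just a quoted phrase followed by a `site:` operator, so it can be assembled with plain string formatting. This sketch only builds and URL-encodes the query string; it does not call any search API, and the function name `site_restricted_query` is made up for illustration.

```python
from urllib.parse import quote_plus


def site_restricted_query(phrase, site="www.ncbi.nlm.nih.gov"):
    """Build a quoted-phrase query limited to one site, as on the slide."""
    return f'"{phrase}" site:{site}'


query = site_restricted_query("p53 gene")
# URL-encoded form, suitable as a query-string parameter value.
encoded = quote_plus(query)
```

Quoting the phrase matters: without the quotes the engine would match the two words independently, and the count would no longer measure the pattern “X gene”.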


12 PubMed > Google
query expansion
- PubMed uses the MeSH headings to match synonyms (it will expand “Pol II” to search for “DNA Polymerase II”)
- Google will only try to correct misspellings
field-specific search
- PubMed allows field-specific searches (e.g. year)
- Google cannot refine its search in this respect
timeliness
- PubMed is updated daily
- Google is slow in updating

13 PubMed > Google (cont’d)
ranking
- Google does a ‘vote’-based ranking: not necessarily good
- PubMed does not do any ranking (possibly bad too...)
truncation and flexibility
- PubMed accepts truncated entries and will look for all possible variations. It will try to break up phrases if no matches are found.
- Google has a rigid search
manual indexing
- PubMed’s MeSH headings contain keywords not necessarily contained in the abstract
- Google cannot find something that is not mentioned in the abstract

14 What to Use? (or How to Use the Evidence)
as a rule
+ sure identification of entities
– too powerful -> high risk of false positives
might be better to use PubMed: less info but precise
as a feature
+ fewer false positives
+ some systems (MaxEnt) can integrate a huge number of features
– might still not get used or provide enough evidence
might be OK to use Google: more info but not necessarily precise
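The difference between the two uses of a hit count can be made concrete. In this sketch the threshold of 1000 hits and the order-of-magnitude bucketing are arbitrary choices for illustration, not values from the talk, and `rule_label` and `count_feature` are hypothetical names.

```python
import math


def rule_label(hit_count, threshold=1000):
    """Rule: a high enough count decides the label outright (threshold is an assumption)."""
    return "ENTITY" if hit_count >= threshold else None


def count_feature(hit_count):
    """Feature: expose the count's order of magnitude and let the learner weigh it."""
    if hit_count <= 0:
        return "hits=0"
    return f"hits=1e{int(math.log10(hit_count))}"


pubmed_hits = 5843    # “p53 gene” count reported for PubMed in the talk
google_hits = 165000  # approximate count reported for Google in the talk

decision = rule_label(pubmed_hits)
feature = count_feature(google_hits)
```

A rule overrides the classifier, so any noise in the counts becomes a false positive directly; a feature can be ignored by the learner when it conflicts with other evidence, at the cost of sometimes carrying too little weight.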

15 iHOP (Information Hyperlinked Over Proteins)
A gene network for navigating the literature
http://www.pdg.cnb.uam.es/UniPub/iHOP
- uses genes and proteins as hyperlinks between sentences and abstracts
- each step through the network produces information about one single gene and its interactions
- information retrieved by connecting similar concepts
- precision of gene name and synonym identification: 87-99%
- readers can still check correctness of sentences when they are presented to them
- shortest path between any 2 genes is on average only 4 steps
Nature Genetics, Vol. 36(7), July 2004

