1
CSE 730 Information Retrieval of Biomedical Data
The use of medical lexicon in biomedical IR
2
Resources:
– MEDLINE (Medical Literature Analysis and Retrieval System Online) is the U.S. National Library of Medicine’s (NLM) premier bibliographic database. It contains over 12 million references to journal articles in the life sciences, with a concentration on biomedicine.
Time coverage: 1966 to present.
Updates: over 2,000 completed references are added daily.
Broad coverage: basic biomedical research and the clinical sciences.
Availability: through the NLM home page at www.nlm.nih.gov
3
– MeSH (Medical Subject Headings) is the National Library of Medicine’s controlled vocabulary thesaurus.
It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.
MeSH descriptors are arranged in both an alphabetic and a hierarchical structure. E.g. general level: broad headings such as “Anatomy” and “Mental Disorders”; narrow level: specific headings such as “Ankle” and “Conduct Disorder”.
21,973 descriptors and thousands of cross-references.
Used by NLM for indexing articles from 4,600 of the world’s leading biomedical journals for the MEDLINE database.
4
– UMLS (the Unified Medical Language System) Metathesaurus
Organized by concept, where a concept is defined as a cluster of terms representing the same meaning (synonyms, lexical variants, acronyms, translations).
Over 1.5 million unique English terms drawn from more than sixty families of medical vocabularies, organized into some 775,000 concepts.
Each concept is categorized by semantic types from the Semantic Network.
5
Application of these sources to dictionary-based retrieval
Indexing
– The classic model in information retrieval considers each document to be described by a set of representative keywords called index terms.
– An index term is a word whose semantics helps in remembering the document’s main themes, so index terms are used to index and summarize the document’s contents.
– In the simplest case, all distinct words in a corpus are treated as index terms. (Stopwords such as pronouns, prepositions, and conjunctions are removed, and feature selection may be applied.)
Corpus-based retrieval: each query (the user's information need) and each document is represented by index terms, usually weighted by TF-IDF. (For example, a word that appears in most documents of the collection, i.e. has a high DF, discriminates poorly and should receive a low weight.)
Dictionary-based retrieval: the query and the corpus are represented by the entries of a predefined dictionary or thesaurus. Only those entries are considered index terms.
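The TF-IDF weighting mentioned for corpus-based retrieval can be sketched as follows. This is a minimal illustration, assuming raw term counts for TF and log(N / DF) for IDF (one common variant among several); the stopword list, the `tokenize` and `tf_idf` helpers, and the three sample documents are all made up for this example.

```python
import math
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "of", "in", "a", "and", "is", "to", "for", "with"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = (tok.strip(".,;:()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = tf(term, doc) * log(N / df(term)), where tf is the raw count,
    N is the number of documents, and df is the number of documents
    that contain the term.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical mini-corpus.
docs = [
    "Sciatic nerve injury in the rat",
    "Chronic constriction injury of the sciatic nerve",
    "Injury prevention in clinical practice",
]
weights = tf_idf(docs)
```

In this toy corpus "injury" occurs in every document, so its weight is zero, while "rat" occurs in only one document and gets the highest weight in the first document.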
6
Comparison of the two methods
– Corpus-based:
  does not require a predefined thesaurus or dictionary
  relies solely on the statistics of the corpus
  includes non-discriminative terms, which weaken the discriminative power.
– Dictionary-based:
  requires a predefined thesaurus
  relies greatly on the quality of the thesaurus
  might miss discriminative terms
  does not adapt to different corpora.
If a good thesaurus is available, the dictionary-based approach is preferred; otherwise, the corpus-based approach must be used.
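To make the dictionary-based side of the comparison concrete, here is a minimal sketch in which only thesaurus entries become index terms and synonyms collapse to one concept. The `THESAURUS` entries and the `dictionary_index` helper are hypothetical; a real system would draw its entries from a resource such as MeSH or the UMLS Metathesaurus and would use a more robust matcher than this greedy longest-match loop.

```python
# Toy thesaurus: surface term -> concept label. Entries are hypothetical
# stand-ins for a real resource such as MeSH or the UMLS Metathesaurus.
THESAURUS = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarction": "Myocardial Infarction",
    "aspirin": "Aspirin",
    "acetylsalicylic acid": "Aspirin",
}
MAX_TERM_WORDS = max(len(term.split()) for term in THESAURUS)

def dictionary_index(text):
    """Return the thesaurus concepts found in text (greedy longest match)."""
    words = [w.strip(".,;:()") for w in text.lower().split()]
    words = [w for w in words if w]
    concepts = []
    i = 0
    while i < len(words):
        # Try the longest possible entry starting at position i first.
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in THESAURUS:
                concepts.append(THESAURUS[candidate])
                i += n
                break
        else:
            i += 1  # no entry starts here, so this word is not an index term
    return concepts
```

For example, `dictionary_index("Aspirin after a heart attack")` yields the two concepts "Aspirin" and "Myocardial Infarction", while a text with no thesaurus entries yields an empty list, which illustrates how this approach can miss discriminative terms.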
7
Problems with dictionary-based retrieval in the biomedical domain
The UMLS Metathesaurus does not capture all the concepts needed.
Dictionary-based retrieval does not make full use of the information encoded in biomedical text. Efforts should be extended from biomedical abstracts to full-text databases, especially when a specific subdomain is the focus.
The UMLS Metathesaurus might be too coarse-grained. A thesaurus tuned specifically to the subdomain is desirable.
How can the corpus be used to enhance the predefined thesaurus?
8
Corpus-based method for extending a biomedical terminology
Objectives: to automatically extend an existing biomedical terminology downwards, using a corpus and both lexical and terminological knowledge.
Methods: add to the UMLS Metathesaurus those noun phrases extracted from the MEDLINE corpus that satisfy the following three requirements:
– the phrase has the structure (adj+, noun*, head)
– a demodified term created from the phrase is found in the Metathesaurus
– the modifiers removed to create the demodified term also modify some other terms from the terminology in the same category as the demodified term.
Here, demodified terms are created by removing every possible combination of adjectival modifiers from the term, so the number of demodified terms is 2^m - 1, where m is the number of adjectival modifiers.
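The count 2^m - 1 follows from the fact that each demodified term corresponds to one non-empty subset of the m adjectival modifiers being removed. As a short worked derivation (k is introduced here as the number of modifiers removed):

```latex
\[
  \sum_{k=1}^{m} \binom{m}{k}
  \;=\; \sum_{k=0}^{m} \binom{m}{k} - \binom{m}{0}
  \;=\; 2^{m} - 1
\]
```

For m = 2 this gives 3 demodified terms, and for m = 3 it gives 7.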
9
Example
Noun phrase: chronic sciatic constriction injury
The resulting syntactic structure: [ [ mod ( [ chronic, adj ] ), mod ( [ sciatic, adj ] ), mod ( [ constriction, noun ] ), head ( [ injury, noun ] ) ] ]
Demodified term: sciatic constriction injury; modifier: chronic
Demodified term: chronic constriction injury; modifier: sciatic
Demodified term: constriction injury; modifier: chronic sciatic
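The demodification shown in this example can be sketched in a few lines. This is an illustration only, assuming the phrase has already been parsed into adjectival modifiers, noun modifiers, and a head; the `demodified_terms` function name and its signature are made up here.

```python
from itertools import combinations

def demodified_terms(adjectives, nouns, head):
    """Return (demodified_term, removed_modifiers) pairs.

    Every non-empty subset of the adjectival modifiers is removed in turn;
    noun modifiers and the head are always kept.
    """
    results = []
    for k in range(1, len(adjectives) + 1):
        # Work with positions so repeated adjectives are handled correctly.
        for removed_idx in combinations(range(len(adjectives)), k):
            kept = [a for i, a in enumerate(adjectives) if i not in removed_idx]
            removed = [adjectives[i] for i in removed_idx]
            term = " ".join(kept + list(nouns) + [head])
            results.append((term, " ".join(removed)))
    return results

pairs = demodified_terms(["chronic", "sciatic"], ["constriction"], "injury")
for term, modifier in pairs:
    print(f"demodified term: {term} | modifier: {modifier}")
```

With two adjectival modifiers the function returns three pairs, matching the three demodified terms listed in the example.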
10
Six-step procedure
Step 1. Mapping phrases to the UMLS: check whether the phrases exist in the UMLS.
Step 2. Identifying (adj+, noun*, head) phrases: check whether the phrases have this structure.
Step 3. Creating demodified terms.
Step 4. Searching for similarly modified terms in the Metathesaurus: checks the requirement that the removed modifiers also modify other terms in the same category.
Step 5. Searching for demodified terms in the Metathesaurus: checks the requirement that a demodified term is found in the Metathesaurus.
Step 6. Hooking candidate terms to the terminology: the candidate noun phrase is placed under its demodified term.
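The six steps can be strung together in a small sketch. It rests on several simplifying assumptions that are not from the slides: phrases arrive pre-parsed as (adjectives, nouns, head) tuples, the `TERMINOLOGY` dictionary and its category labels are toy data rather than real UMLS content, and "modifies another term" is approximated by a simple prefix test. All function names are hypothetical.

```python
from itertools import combinations

# Toy terminology: term -> category. Entries and category labels are
# illustrative stand-ins for Metathesaurus content, not real UMLS data.
TERMINOLOGY = {
    "constriction injury": "injury",
    "chronic brain injury": "injury",
    "kidney": "anatomy",
}

def demodify(adjectives, nouns, head):
    """Step 3: remove every non-empty subset of adjectival modifiers."""
    out = []
    for k in range(1, len(adjectives) + 1):
        for idx in combinations(range(len(adjectives)), k):
            kept = [a for i, a in enumerate(adjectives) if i not in idx]
            removed = " ".join(adjectives[i] for i in idx)
            out.append((" ".join(kept + nouns + [head]), removed))
    return out

def modifies_same_category(modifier, category, exclude):
    """Step 4 (simplified): another term in the category starts with the modifier."""
    return any(
        term != exclude and cat == category and term.startswith(modifier + " ")
        for term, cat in TERMINOLOGY.items()
    )

def extend_terminology(phrases):
    """Return {candidate phrase: parent term} for phrases passing all steps."""
    hooked = {}
    for adjectives, nouns, head in phrases:
        phrase = " ".join(adjectives + nouns + [head])
        if phrase in TERMINOLOGY:       # Step 1: already in the terminology
            continue
        if not adjectives:              # Step 2: must match (adj+, noun*, head)
            continue
        for term, removed in demodify(adjectives, nouns, head):  # Step 3
            category = TERMINOLOGY.get(term)                     # Step 5
            if category is None:
                continue
            if modifies_same_category(removed, category, term):  # Step 4
                hooked[phrase] = term                            # Step 6
                break
    return hooked

phrases = [
    (["chronic"], ["constriction"], "injury"),
    (["severe"], ["constriction"], "injury"),
    (["acute"], ["kidney"], "failure"),
    (["chronic"], ["brain"], "injury"),
    ([], ["kidney"], "stone"),
]
result = extend_terminology(phrases)
```

Because `demodify` tries the smallest removals first, a candidate is attached to the most specific demodified term that passes both checks; that ordering is a design choice of this sketch rather than something stated on the slide.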
11
Clinical medicine: Disorders and procedures
12
TREC-9 Filtering Track
TREC (Text REtrieval Conference): for each TREC, NIST provides a test set of documents and questions. Participants run their own retrieval systems on the data and return to NIST a list of the top-ranked retrieved documents. NIST pools the individual results, judges the retrieved documents for correctness, and evaluates the results.
TREC-9 (2000) is the only TREC using biomedical data. MEDLINE documents from 1987-1991 are used; the entire collection contains about 350,000 documents.
Evaluation results of the filtering track, TREC-9
13
Conclusion
Corpus-based retrieval that relies solely on the full-text corpus cannot obtain satisfactory results.
Dictionary-based retrieval might miss important terms, especially when applied to specific domains.
Combining the two approaches is promising.