Published by Eleanore Walsh. Modified over 9 years ago.
1
6/23/03 IndoUS DL 2003
Text Metadata Mining: Exploring its potential*
Padmini Srinivasan
School of Library & Information Science
The University of Iowa, Iowa City, IA
padmini-srinivasan@uiowa.edu
*Students: Aditya Sehgal, Xin Ying Qiu
2
Outline
1. Text Mining
2. Metadata-based Topic profiles
3. Function: Exploring topic characteristics via profiles
   Problem: Study disease research prevalence
4. Conclusions
3
1. Text Mining: Novelty and Usefulness
- Assist researchers with hypothesis generation, exploration, and testing.
- Discover knowledge that is ‘novel’, at least relative to the text collection.
- Discover knowledge that is potentially ‘useful’.
- Extract patterns, explore relationships.
- Propositions/hypotheses need follow-up verification.
4
Examples
- Of all 45 studies in Medline on chemical X, 80% have been done in the context of disease L, 10% in the context of disease M, and the remainder in the context of disease N.
- Gene A is known to be associated with disease X. The literature suggests that gene B shows some key ‘similarities’ to A, and therefore B may also be associated with X.
5
Metadata in Digital Libraries
- Support content organization and management
- Provide access to content
- Dublin Core Metadata Initiative
- RDF: Resource Description Framework
- Library of Congress Subject Headings (LCSH)
- Medical Subject Headings (MeSH)
Question: Can we use metadata for text mining and knowledge discovery?
Given a topic, e.g. ‘Toxic waste’, and a collection of texts such as Medline...
6
Metadata for Text Mining
Describe topics: topic profiles built from the text collection being mined ~ metadata profiles
- Compare topics via their profiles:
  a. topic similarity
  b. trends over specific features/characteristics
- Look for indirect links between topics
- Given a topic, look for related topics
7
Example MEDLINE Record
[Figure labels: MeSH Phrase, MeSH Qualifier]
8
MeSH Metadata (22,000) and Semantic Types (134)
[Diagram: example MeSH terms (Aldehydes, Formaldehyde, Protein Isoprenylation) linked to semantic types (Organic Chemical, Chemical, Genetic Function)]
9
2. Topic Profiles
A set of terms that characterize the topic, with weights assigned to represent their relative importance.
{Medline: A vector of MeSH term vectors, one for each of the 134 semantic types.}
10
Topic: “hip fractures in the elderly”
- Search against PubMed: (geriatrics or elderly) AND hip fractures
- Extract MeSH metadata terms from the retrieved documents
- Build a weighted profile: a vector of vectors, which can be limited to MeSH terms of particular semantic types
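The profile-building steps above can be sketched in Python. This is only an illustration under stated assumptions: the slides do not give the weighting scheme, so the weight here is assumed to be the fraction of retrieved documents indexed with a term. The function name `build_profile`, the `semantic_type_of` lookup, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict


def build_profile(records, semantic_type_of):
    """Build a topic profile as a 'vector of vectors'.

    records: one list of MeSH terms per retrieved document.
    semantic_type_of: hypothetical lookup from MeSH term to semantic type.
    Assumed weight: fraction of retrieved documents indexed with the term.
    """
    if not records:
        return {}
    counts = defaultdict(Counter)
    for terms in records:
        for term in set(terms):  # count each term once per document
            stype = semantic_type_of.get(term)
            if stype is not None:
                counts[stype][term] += 1
    n_docs = len(records)
    return {
        stype: {term: c / n_docs for term, c in counter.items()}
        for stype, counter in counts.items()
    }


# Made-up records standing in for documents retrieved by the PubMed search.
records = [
    ["Hip Fractures", "Aged", "Osteoporosis"],
    ["Hip Fractures", "Aged"],
]
# Illustrative term-to-type mapping, not taken from the slides.
semantic_type_of = {
    "Hip Fractures": "Injury or Poisoning",
    "Osteoporosis": "Disease or Syndrome",
    "Aged": "Age Group",
}
profile = build_profile(records, semantic_type_of)
# Limit the profile to one semantic type, as the slide suggests.
disease_vector = profile.get("Disease or Syndrome", {})
```

Keeping one inner dictionary per semantic type makes the "limit to particular semantic types" step a simple lookup.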
11
Example Profile: Raynaud's disease
12
Comparing topics via their profiles
[Diagram: Topic 1 and Topic 2, each via a PubMed search, yield documents and a MeSH Profile; the two profiles are compared by cosine similarity. Label: 13,000 genes]
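Cosine similarity between two profiles can be sketched as follows, assuming each profile (or each per-semantic-type vector within it) is stored as a dictionary from MeSH term to weight. The function name and the example weights are hypothetical, and the slides do not say how per-type similarities are combined, so this covers a single pair of vectors only.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # convention chosen here for empty profiles
    return dot / (norm_p * norm_q)


# Made-up weights for two topic profiles restricted to one semantic type.
topic1 = {"Aldehydes": 0.5, "Formaldehyde": 0.5}
topic2 = {"Formaldehyde": 1.0}
similarity = cosine_similarity(topic1, topic2)
```

Profiles that share no terms score 0, and a profile compared with itself scores 1.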
13
Comparing topics: studying particular characteristics in their profiles
Problem: To study the prevalence of disease research.
Characteristic studied: ‘geographical context’.
14
Topic: “cholera”
- Search against PubMed:
- Extract MeSH metadata terms from the retrieved documents
- Build weighted profile; vectors can be limited to MeSH terms in ‘Geographical Area’
  Cholera: {0.6 Nigeria, 0.1 Malaysia, ……}
  Breast Cancer: {0.1 Poland, 0.8 Italy, ……}
- Rank nations
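Ranking nations from a geographic profile amounts to sorting by weight. A minimal sketch, assuming the ‘Geographical Area’ vector is a dictionary from nation to weight; the function name `rank_nations` is hypothetical, and only the weights shown on the slide are used (the elided entries are left out).

```python
def rank_nations(geo_profile):
    """Return (nation, weight) pairs sorted by descending weight."""
    return sorted(geo_profile.items(), key=lambda item: item[1], reverse=True)


# Weights from the slide's examples; elided entries are omitted.
cholera = {"Nigeria": 0.6, "Malaysia": 0.1}
breast_cancer = {"Poland": 0.1, "Italy": 0.8}

cholera_ranking = rank_nations(cholera)
breast_cancer_ranking = rank_nations(breast_cancer)
```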
15
Research Prevalence: Mental Disorders (1961-2000)
Ranking nations.
16
Research Prevalence: Cholera (middle & low income; 1991-2000)
Ranking nations.
17
Research prevalence versus disease prevalence
Question: So how does the prevalence of research compare with the prevalence of the disease?
For each disease:
(a) Rank nations by Disease Prevalence (WHO epidemiological data), estimated by # of cases reported or # of deaths. Sources: Statistical Information System, weekly epidemiological records.
(b) Rank nations by Research Prevalence.
Compare the two rankings using Spearman’s rank correlation coefficient.
Analysis limited to the decade of the 90s.
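The comparison of the two rankings can be illustrated with a small standard-library implementation of Spearman's rank correlation: rank both lists, then compute the Pearson correlation of the ranks. Ties receive average ranks. The function names and the four-nation data are made up for illustration, and the sketch does not include the significance test reported in the study.

```python
import math


def average_ranks(values):
    """Rank values from 1 (largest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(vx * vy)


# Made-up data for four nations, listed in the same order in both lists.
reported_cases = [500, 300, 100, 50]      # disease prevalence
research_weights = [0.4, 0.1, 0.3, 0.2]   # research prevalence (profile weights)
rho = spearman(reported_cases, research_weights)
```

A value near +1 means the nations with the most cases are also the ones most represented in the research literature; a value near 0 means the two rankings are unrelated.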
18
19 diseases
Breast cancer        Colorectal cancer
Hodgkin's disease    Meningitis
Dengue               Tuberculosis
Liver neoplasms      Prostate cancer
Ovarian cancer       Esophagus cancer
Cholera              AIDS
Stomach cancer       Melanoma
Leprosy              Malaria
Yellow fever         Trypanosomiasis
Dracunculiasis
19
Disease         Income   N     CC (correlation coefficient)
Breast Cancer   All      168   0.645*
                High     35    0.856*
                Medium   71    0.709*
                Low      61    0.372*
Hodgkins        All      165   0.539*
                High     34    0.71*
                Medium   70    0.545*
                Low      61    0.386*
* significant at the 0.05 level
20
Observations:
- Diseases most prevalent in the high or middle income groups have a significant positive correlation between research prevalence and disease prevalence (9/10 diseases).
- For diseases most prevalent in the low income group, a significant positive correlation is less likely (4/9, 44%).
21
Temporal analysis on disease research
- Extract the top 3 ranked diseases studied in the context of each nation
- Pool these together
- How often does a disease rank in the top 3 positions?
22
Topic: Each nation
Sweden: {0.6 Breast Cancer, 0.1 Malaria, ……}
Nigeria: {0.1 Breast Cancer, 0.8 Malaria, ……}
Rank diseases
23
Pooling: (for each decade & each income group)
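The pooling step can be sketched as a count of how often each disease appears among a nation's top 3. This assumes each nation's profile is a dictionary from disease to weight and that the function is called once per decade and income group. The name `pool_top_diseases` is hypothetical. The Breast Cancer and Malaria weights come from the earlier Sweden/Nigeria example; the Tuberculosis and Cholera weights are made up to fill in the elided entries.

```python
from collections import Counter


def pool_top_diseases(nation_profiles, k=3):
    """Count how often each disease ranks in the top k across nations."""
    pooled = Counter()
    for profile in nation_profiles.values():
        top = sorted(profile, key=profile.get, reverse=True)[:k]
        pooled.update(top)
    return pooled


# One (decade, income group) subset; Tuberculosis and Cholera weights are invented.
nation_profiles = {
    "Sweden": {"Breast Cancer": 0.6, "Malaria": 0.1,
               "Tuberculosis": 0.2, "Cholera": 0.05},
    "Nigeria": {"Breast Cancer": 0.1, "Malaria": 0.8,
                "Tuberculosis": 0.05, "Cholera": 0.3},
}
pooled = pool_top_diseases(nation_profiles)
```

Repeating the call for each decade and income group gives the counts that can then be compared over time.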
25
Observations from the study:
- Collecting epidemiological data is extremely complicated.
- Collect it at a fine-grained level of analysis, e.g. different forms of Leishmaniasis; Plague.
- Complement existing efforts at collecting epidemiological data.
- Consider more complex phenomena such as the prevalence of Leishmania and HIV as co-infections.
- Research-based evidence to explore policy issues.
26
Conclusions:
- Metadata can be exploited for text mining
- MeSH ~ rich metadata scheme
- Importance of metadata for digital libraries
- Other text mining applications built on DL?
- Domain independent ~ accounting!
Thank you!