Enhancing Text Classifiers to Identify Disease Aspect Information
Rey-Long Liu
Dept. of Medical Informatics, Tzu Chi University, Taiwan
Outline
Research background
Problem definition
The proposed approach: IDAI
Empirical evaluation
Conclusion

Disease Aspect Classification
Research Background
Disease Aspect Information (DAI)
An example from MedlinePlus: a text with several passages about three aspects of kidney cancer (treatment, symptoms and signs, and etiology). It also contains several passages not related to any aspect.
You have two kidneys... Kidney cancer forms in the … Risk factors include smoking, having certain genetic conditions and ….
Often, kidney cancer doesn't have early symptoms. However, see your health care provider if you notice
Blood in your urine
A lump in your abdomen
…
Pain in your side
…
Treatment depends on your age, …. It might include surgery, radiation, chemotherapy …
Disease Knowledge Map: An Application of DAI
Identification of DAI
[Diagram labels: Healthcare professionals & consumers (Query & Aspect; Disease Info.); Medical texts for specific diseases; Disease Aspects Classifier; Disease aspect information (symptoms, diagnosis, treatment, etiology, prevention); Healthcare decision support system (Cross-disease query; Disease Info.); Medical information provider (Verified Info.; Aspect Info.)]
Problem Definition
Goals
Modeling the identification of DAI as a text classification problem
–Disease aspects are predefined categories of interest, not brief descriptions of information needs
Developing a technique to enhance various kinds of text classifiers
–Given medical texts, the enhanced classifier is more capable of identifying those that discuss aspects of diseases
Related Work
Text classification (TC)
–Weakness: multi-aspect information in a text introduces noise for text classifiers
Segment extraction for topic detection
–Weakness: designed for specific descriptions (not for categories)
Passage extraction for TC
–Weakness: determining the location and length of the passages relevant to a specific category becomes another TC problem
The Proposed Approach: IDAI
IDAI: Revising Term Frequency (TF) to Improve Classifiers
[Diagram labels: Categories (aspects); Training Texts; A text (d); Training; Testing; Underlying Text Classifier; IDAI; Classifier Development; Classification; Identifying Term-Category Correlation Type; Assessing Term Frequencies (TF); TF of terms w.r.t. each category]
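The slide names a step that identifies the correlation type between each term and each category, but it does not say how that type is computed. The following is a minimal sketch under one common assumption: a term counts as positively correlated to a category when the 2x2 document-count contingency table shows a positive association (A*D > B*C), and as negatively correlated otherwise. The function name `correlation_types` and the input layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def correlation_types(training_texts):
    """Label each (term, category) pair as 'positive' or 'negative'.

    training_texts: list of (tokens, category) pairs.
    Assumption (not from the slides): the label is the sign of A*D - B*C
    in the 2x2 table of document counts for the term and the category.
    """
    categories = {category for _, category in training_texts}
    docs_in = defaultdict(int)   # category -> number of training texts
    df = defaultdict(int)        # (term, category) -> texts in category containing term
    vocab = set()
    for tokens, category in training_texts:
        docs_in[category] += 1
        for term in set(tokens):
            df[(term, category)] += 1
            vocab.add(term)

    total = len(training_texts)
    result = {}
    for term in vocab:
        df_term = sum(df[(term, c)] for c in categories)
        for c in categories:
            a = df[(term, c)]              # in c, contains term
            b = df_term - a                # not in c, contains term
            c_without = docs_in[c] - a     # in c, lacks term
            d = total - docs_in[c] - b     # not in c, lacks term
            result[(term, c)] = "positive" if a * d > b * c_without else "negative"
    return result
```

Calling `correlation_types` on a list of tokenized training texts returns a dictionary keyed by `(term, category)`. Any other association measure could be substituted without changing the rest of the pipeline.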
Two Strategies for TF Revision
Underlying classifier: G; enhanced classifier: G+IDAI
Feature sets
–P: Set of positively correlated features (accepting relevant texts)
–Q: Set of negatively correlated features (rejecting irrelevant texts)
TF revision by IDAI
–Strategy I: TF of a feature f is amplified (reduced) if neighbors of f have the same (different) correlation type to the category
–Strategy II: TF of a feature f in Q is reduced if f appears in a text segment that mainly mentions features in P
\[
\mathrm{RevisedTF}(t,d,c) =
\begin{cases}
\mathrm{WindowTF}(t,d,c), & \text{if } t \text{ is positively correlated to } c \text{ (Strategy I)} \\
\max_{c' \neq c} \{\mathrm{WindowTF}(t,d,c')\} - \mathrm{InconsistencyTF}(t,d,c), & \text{if } t \text{ is negatively correlated to } c \text{ (Strategy II)}
\end{cases}
\]
\[
\mathrm{WindowTF}(t,d,c) = \sum_{k} \left(0.5 + P_{\mathrm{window},k}\right), \quad \text{for each occurrence of } t \text{ at position } k
\]
P_{window,k}: distance-based sum of the weights of other positively correlated terms in a window at k
\[
\mathrm{InconsistencyTF}(t,d,c) = \sum_{k} P_{\mathrm{inconsistency},k}, \quad \text{for each occurrence of } t \text{ at position } k
\]
P_{inconsistency,k}: based on the constant 0.5 and on how strongly the text segment before k is dominated by the terms positively correlated to c
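The revised TF definition above can be sketched in code. This is an illustration under stated assumptions, not the author's implementation: the slide does not give the window size, the distance weighting, or the exact form of P_inconsistency, so the sketch assumes a window of 5 tokens on each side (`half_width`), a weight of 1/distance, a preceding segment of 5 tokens (`segment_len`), and P_inconsistency equal to 0.5 times the fraction of the preceding segment made up of terms positively correlated to c. The `positive` dictionary (category to set of positively correlated terms) is also a hypothetical input format.

```python
def window_tf(tokens, term, positive_terms, half_width=5):
    """Sum over each occurrence k of `term` of (0.5 + P_window,k)."""
    total = 0.0
    for k, token in enumerate(tokens):
        if token != term:
            continue
        p_window = 0.0
        start = max(0, k - half_width)
        end = min(len(tokens), k + half_width + 1)
        for j in range(start, end):
            # Only *other* positively correlated terms contribute.
            if j != k and tokens[j] != term and tokens[j] in positive_terms:
                p_window += 1.0 / abs(j - k)  # assumed distance weighting
        total += 0.5 + p_window
    return total


def inconsistency_tf(tokens, term, positive_terms, segment_len=5):
    """Sum over each occurrence k of `term` of P_inconsistency,k.

    Assumed form: 0.5 * share of the preceding segment that consists of
    terms positively correlated to the category.
    """
    total = 0.0
    for k, token in enumerate(tokens):
        if token != term:
            continue
        segment = tokens[max(0, k - segment_len):k]
        if segment:
            dominance = sum(1 for s in segment if s in positive_terms) / len(segment)
            total += 0.5 * dominance
    return total


def revised_tf(tokens, term, category, positive):
    """positive: dict mapping each category to its positively correlated terms."""
    if term in positive[category]:
        # Strategy I: proximity-weighted TF for a positively correlated term.
        return window_tf(tokens, term, positive[category])
    # Strategy II: best WindowTF over the other categories, minus the penalty.
    best_other = max(
        (window_tf(tokens, term, positive[c]) for c in positive if c != category),
        default=0.0,
    )
    return best_other - inconsistency_tf(tokens, term, positive[category])
```

With these assumptions, a term surrounded by other terms positively correlated to the same category receives a larger TF for that category, while a negatively correlated term that follows a segment dominated by the category's positive terms has its TF reduced.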
Empirical Evaluation
Experimental Data
Top-10 fatal diseases and top-20 cancers in Taiwan
–Total # of diseases: 28
–Source: Web sites of hospitals, healthcare associations, and the department of health in Taiwan
–Disease aspects (categories): 5 aspects: etiology, diagnosis, treatment, prevention, and symptom
–Splitting the texts into aspects: 4669 texts about individual aspects
–Test data: Randomly sampling 10% of the 4669 texts and merging them into test texts of 1 to 5 aspects
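The test-data construction described above (sample 10% of the single-aspect texts, then merge them into test texts covering 1 to 5 aspects) can be sketched as follows. The slide does not specify how texts are grouped when merging, so this sketch assumes that each merged test text draws at most one passage per aspect and that the number of aspects is picked uniformly at random. The function name `build_test_texts` and its parameters are hypothetical.

```python
import random


def build_test_texts(texts, test_fraction=0.1, max_aspects=5, seed=0):
    """Split single-aspect texts into training texts and merged test texts.

    texts: list of (text, aspect) pairs.
    Returns (training, merged), where merged is a list of
    (merged_text, set_of_aspects) pairs.
    """
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    rng.shuffle(indices)
    n_test = round(len(texts) * test_fraction)
    training = [texts[i] for i in indices[n_test:]]

    # Pool the sampled test passages by aspect.
    pools = {}
    for i in indices[:n_test]:
        text, aspect = texts[i]
        pools.setdefault(aspect, []).append(text)

    merged = []
    while pools:
        # Assumed merging rule: 1..max_aspects distinct aspects per test text.
        n = rng.randint(1, min(max_aspects, len(pools)))
        chosen = rng.sample(sorted(pools), n)
        parts = []
        for aspect in chosen:
            parts.append(pools[aspect].pop())
            if not pools[aspect]:
                del pools[aspect]
        merged.append((" ".join(parts), set(chosen)))
    return training, merged
```

Each merged test text carries the set of aspects it contains, which is what a multi-aspect evaluation would compare classifier output against.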
Underlying Classifiers & Experimental Baselines
Underlying classifier
–The Support Vector Machine (SVM) classifier
Baseline enhancer
–CTFA (Liu, 2010), which employs Strategy I for better TC
–CTFA does not consider Strategy II
Results
Conclusion
Disease knowledge map (Dmap)
–Supporting evidence-based medicine, health education, and healthcare decision support
A key step to build a Dmap: Automatic identification of disease aspect information (DAI)
Identification of DAI as a text classification problem
Term proximity as key information to enhance existing classifiers to classify DAI