Identifying Disease Diagnosis Factors by Proximity-based Mining of Medical Texts Rey-Long Liu *, Shu-Yu Tung, and Yun-Ling Lu * Dept. of Medical Informatics Tzu Chi University Taiwan, R.O.C.
Outline Research background Problem definition The proposed approach: PDFI Empirical evaluation Conclusion
Research Background
Diagnosis Knowledge Map: Fundamental of Diagnosis Support & Education [Figure: diagnosis knowledge map with nodes r1 to r5, d1 to d3, and m1 to m5; panel labels: Risk Factors, Diseases, Symptoms & Signs (and examinations & tests)]
Basic Properties Diagnosis factors of a disease –Risk factors, symptoms, and signs of the disease A diagnosis knowledge map consists of many-to-many relationships between diseases and their diagnosis factors –Factors may differ in how well they discriminate the diseases, and may evolve Construction of a diagnosis knowledge map is essential but costly
Problem Definition
Goal Explore how text mining can support the identification of diagnosis factors Develop a technique, PDFI (Proximity-based Diagnosis Factors Identifier), that –Employs term proximity to improve diagnosis factor identifiers –Serves as a supplement to existing identifiers
Related Work Extract relationships by parsing or template matching –Weakness: Relationships between diseases and diagnosis factors are seldom expressed in individual sentences Select key features by text classification –Weakness: Term proximity is NOT considered Proximity-based retrieval –Weakness: NOT applicable to diagnosis factor identification
The Proposed Approach: PDFI
Basic Observation In a medical text talking about the diagnosis of a disease, the diagnosis factors often appear in a nearby area of the text
The Approach For a candidate diagnosis factor u, PDFI –Measures how other candidate diagnosis factors appear in the areas near u in the medical texts, and then –Encodes this term proximity information into the discriminating capability of u as measured by the underlying diagnosis factor identifier
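The first step above, measuring how other candidate factors appear near u, can be illustrated with a minimal sketch. It assumes each candidate term has already been mapped to its token positions in a text; the function name `neighbors_near` and the `window` parameter are hypothetical and are not part of PDFI as described here.

```python
from typing import Dict, List


def neighbors_near(positions: Dict[str, List[int]], u: str, window: int = 5) -> int:
    """Largest number of distinct other candidate terms found within
    `window` tokens of a single occurrence of u (hypothetical helper)."""
    best = 0
    for p in positions.get(u, []):
        # Collect every other candidate term with an occurrence close to p.
        nearby = {
            term
            for term, locs in positions.items()
            if term != u and any(abs(q - p) <= window for q in locs)
        }
        best = max(best, len(nearby))
    return best


# Toy token positions for candidate terms in one text (made-up data).
positions = {"diarrhea": [3, 40], "parasite": [5], "antigen": [7], "serology": [90]}
dense = neighbors_near(positions, "diarrhea")   # two other candidates near position 3
sparse = neighbors_near(positions, "serology")  # no other candidate near position 90
```

A term that occurs in a dense neighborhood of other candidates gets a higher count than a term that only occurs in isolation.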
System Overview [Figure: pipeline] Texts about individual diseases → Underlying identifier: measure discriminating strengths of candidate factors → Discriminating strengths of candidate factors → PDFI: encode term proximity contexts to revise the strengths of candidate factors → Ranked factors for individual diseases
Scoring for a Candidate Factor For a candidate diagnosis factor u for disease c: FinalScore(u, c) = ProximityScore(u, c) + IdentifierScore(u, c) –MinDist_{u,c} = minimum distance between u and n in the texts about disease c; α is set to 30 –Rank(u, c) = rank of u w.r.t. c by the underlying identifier
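The minimum distance between two terms can be computed from their token positions. The sketch below covers only that step: how MinDist and α combine into ProximityScore is not stated here, so it is not modeled. The function name `min_dist` and the token-position representation are assumptions for illustration.

```python
from typing import List


def min_dist(pos_u: List[int], pos_n: List[int]) -> float:
    """Minimum absolute token distance between any occurrence of u and
    any occurrence of n; infinity if either term never occurs."""
    if not pos_u or not pos_n:
        return float("inf")
    a, b = sorted(pos_u), sorted(pos_n)
    i = j = 0
    best = float("inf")
    # Two-pointer sweep over both sorted position lists.
    while i < len(a) and j < len(b):
        best = min(best, abs(a[i] - b[j]))
        # Advance the pointer that sits at the smaller position.
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best


closest = min_dist([3, 40], [5, 60])  # occurrences at 3 and 5 are the closest pair
```

The sweep runs in linear time after sorting, which matters when a term occurs many times in a long text.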
Empirical Evaluation
Experimental Data Medical dictionary: from MeSH –Each MeSH term and its retrieval equivalence terms, resulting in a dictionary of 164,354 medical terms Medical texts for disease: from MedlinePlus –All the diseases for which MedlinePlus tags diagnosis/symptoms texts, resulting in a text database of 420 medical texts for 131 diseases –Each medical text is manually read and cross-checked to extract target diagnosis factor terms from the texts, resulting in 2,797 target terms
Underlying Diagnosis Factor Identifier The chi-square feature scoring technique –Produces a discriminating strength for each feature (candidate factor) with respect to each disease, and –For each disease, all positively-correlated features are sent to PDFI for re-ranking
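For reference, the standard chi-square statistic for a term and a category can be computed from a 2x2 contingency table. The sketch below assumes document-level counts (a: texts of the disease containing the term, b: texts of other diseases containing it, c: texts of the disease without it, d: texts of other diseases without it); the exact counting unit used in the experiments is not stated here, and the function names are hypothetical.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a term w.r.t. a disease from a 2x2 table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def positively_correlated(a: int, b: int, c: int, d: int) -> bool:
    """True when the term is over-represented in the disease's texts."""
    return a * d > b * c


# Term appears in all 3 texts of the disease and in none of 3 other texts.
strength = chi_square(3, 0, 0, 3)
keep = positively_correlated(3, 0, 0, 3)
```

Because chi-square is symmetric in the direction of association, the sign check is what separates positively correlated features from negatively correlated ones with the same statistic.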
Evaluation Criteria Mean average precision (MAP) –Measures how highly the target diagnosis factors are ranked for the medical expert to check and validate –Examples: targets ranked 1st, 3rd, 5th: AP = (1/1 + 2/3 + 3/5)/3 = 0.76; targets ranked 1st, 2nd, 3rd: AP = (1/1 + 2/2 + 3/3)/3 = 1.00
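The two AP examples can be reproduced with a short sketch. It assumes the input is the list of 1-based ranks at which the target factors of one disease appear; the function names are illustrative.

```python
from typing import List


def average_precision(target_ranks: List[int]) -> float:
    """AP for one disease, given the 1-based ranks of its target factors."""
    ranks = sorted(target_ranks)
    if not ranks:
        return 0.0
    # The k-th target (1-based) found at rank r contributes precision k / r.
    return sum((k + 1) / r for k, r in enumerate(ranks)) / len(ranks)


def mean_average_precision(per_disease_ranks: List[List[int]]) -> float:
    """MAP: the mean of AP over all diseases."""
    if not per_disease_ranks:
        return 0.0
    return sum(average_precision(r) for r in per_disease_ranks) / len(per_disease_ranks)


ap_spread = average_precision([1, 3, 5])  # (1/1 + 2/3 + 3/5) / 3
ap_top = average_precision([1, 2, 3])     # (1/1 + 2/2 + 3/3) / 3
```

Averaging these per-disease values with `mean_average_precision` gives the single MAP figure used to compare identifiers.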
Results MAP: chi-square: ; chi-square+PDFI:
An Example Parasitic diseases –AP: chi-square: 0.3003; chi-square+PDFI: PDFI promotes the ranks of several target diagnosis factors (e.g., parasite, antigen, diarrhea, and MRI scan) –They appear at some place(s) where more other candidate terms occur in a nearby area PDFI lowers the ranks of a few target diagnosis factors (e.g., serology) –Serology only appears at one place where the author used lots of words to explain serology
Conclusion
Diagnosis factors to discriminate diseases are the fundamental basis for –Diagnosis decision support, diagnosis skill training, medical research, & health education Text mining is a good way to identify and maintain the large number of diagnosis factors for diseases By encoding term proximity information, PDFI may be a good supplement to existing techniques for identifying the diagnosis factors for individual diseases