Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
A Taxonomy of Similarity Mechanisms for Case-Based Reasoning
Pádraig Cunningham
TKDE, Vol. 21, 2009, pp. 1532–1543.
Presenter: Wei-Shen Tai
2009/11/17
Outline
  Introduction
  Representation
  Similarity measures
    Direct similarity mechanisms
    Transformation-based measures
    Information-theoretic measures
    Emergent measures
  Implications for CBR research
  Conclusion
  Comments
Motivation
  Similarity is central to CBR.
  More recently, a number of novel mechanisms have emerged that introduce interesting alternative perspectives on similarity.
Objective
  Review novel similarity measurement (SM) mechanisms.
  Present a taxonomy of similarity mechanisms that places these new techniques in the context of established CBR techniques.
Feature-value representation
  Cases are described in terms of the attributes (features) of each case or instance.
  Enhancement
    Discover word associations in a text corpus, then use these associations to add terms to the representation.
    Example: Bill Gates -> software, CEO, Microsoft
    This allows texts to be represented by more features.
Structural representations
  Hierarchical structure
    Feature values themselves reference nonatomic objects.
  Network structure
    Typically a semantic network.
    The Semantic Web describes the relationships between things (e.g., a tire is part of a car; John Lennon was a member of the Beatles) and the properties of things (e.g., size, weight, age, and price).
  Flow structure
    Shares many of the characteristics of hierarchical and network representations.
    For example, a work or job flow.
String and sequence representations
  The most straightforward representation for free text (unstructured data).
  A common way to support similarity assessment on free text is the bag-of-words strategy from information retrieval.
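As a rough illustration of the bag-of-words strategy mentioned on this slide, the sketch below turns each text into word counts and compares the counts. The tokenizer and the use of cosine similarity are assumptions made for the example; the slide names neither, and the function names `bag_of_words` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word occurrences; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two term-count vectors (assumed measure)."""
    dot = sum(count * bag_b[term] for term, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)
```

Because word order is thrown away, "the cat sat" and "sat the cat" get the same bag and are treated as identical, which is the main limitation of this representation.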
Direct similarity mechanisms
  Similarity and distance metrics
    k-NN
  Set-theoretic measures
    Jaccard index, Dice similarity
  Kullback-Leibler divergence and the χ² statistic
    Compare two images described as histograms.
  Symbolic attributes in taxonomies
    The feature values used in the case representation are organized into a taxonomy of is-a relationships.
    [Figure: example taxonomy. root -> tea (green tea, black tea), carbonated (Pepsi, Cola)]
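The set-theoretic and histogram measures listed on this slide could be sketched as follows. The function names are hypothetical, and the χ² form shown (each bin compared against the mean of the two histograms) is one common variant chosen as an assumption; the slide does not give a formula.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Dice similarity: twice the intersection over the sum of the set sizes."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def chi_square_distance(hist_h, hist_k):
    """Chi-square statistic between two histograms over the same bins.

    Assumed variant: sum of (h_i - m_i)^2 / m_i with m_i = (h_i + k_i) / 2.
    Bins that are empty in both histograms contribute nothing.
    """
    if len(hist_h) != len(hist_k):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for h, k in zip(hist_h, hist_k):
        mean = (h + k) / 2
        if mean > 0:
            total += (h - mean) ** 2 / mean
    return total
```

Jaccard and Dice return a similarity in [0, 1], whereas the χ² statistic is a distance: it is 0 for identical histograms and grows as they diverge.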
Transformation-based measures I
  Edit distance
    The minimum number of edit operations (insertions, deletions, substitutions) needed to transform one string into another.
    From "cat" to "rat" is 1; from "cats" to "cat" is 1.
  Alignment measures for biological sequences
    A variety of sequence alignment measures are used in biology (e.g., for DNA).
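The edit distance on this slide can be computed with the standard dynamic-programming recurrence. The sketch below assumes unit cost for each insertion, deletion, and substitution (the Levenshtein variant); the slide does not state the costs, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(source, target):
    """Minimum number of unit-cost edits turning `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting all i characters yields the empty prefix
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("cat", "rat"))   # one substitution
print(edit_distance("cats", "cat"))  # one deletion
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of the target string rather than to the product of both lengths.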
Transformation-based measures II
  Earth mover distance
    A transformation-based distance for image data.
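To give a feel for the earth mover distance, here is a minimal sketch of the one-dimensional special case, where the distance reduces to summing the absolute differences of the cumulative histograms. It assumes both histograms have the same bins and the same total mass, with a cost of 1 for moving one unit of mass to an adjacent bin. Image signatures generally need a full transportation solver, which this sketch does not attempt; the name `emd_1d` is hypothetical.

```python
from itertools import accumulate


def emd_1d(hist_a, hist_b):
    """Earth mover distance between two 1-D histograms of equal total mass."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(hist_a) - sum(hist_b)) > 1e-9:
        raise ValueError("histograms must have the same total mass")
    # The running surplus after each bin is the mass that still has to be
    # carried one bin further to the right (or left, if negative).
    carried = accumulate(a - b for a, b in zip(hist_a, hist_b))
    return sum(abs(mass) for mass in carried)
```

For example, moving a single unit of mass from the first bin to the third bin of a three-bin histogram costs two bin-steps under these assumptions.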
Transformation-based measures III
  Similarity for networks and graphs
    Structure mapping engine (SME)
    Identify the appropriate mapping between the two domains.
Information-theoretic measures
  These measures work directly on the raw case representation.
  Compression-based similarity for text
    If two documents are very similar, the compressed size of the two together will not be much greater than the compressed size of one alone.
  Information-based similarity for biological sequences
    Specialized algorithms are required to compress them.
  Similarity in a taxonomy
    Weight the is-a relationships between feature values differently.
    The information content of a node in the taxonomy is quantified as its negative log likelihood.
    The similarity of two nodes is the information content of their common parent node with the highest value.
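The compression-based idea on this slide can be tried with any off-the-shelf compressor. The sketch below uses Python's `zlib` and the normalized compression distance, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), as an assumed concrete formula; the slide describes only the intuition, and the names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance; smaller means more similar."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the two texts compressed together
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A general-purpose compressor only approximates the ideal: very short inputs are dominated by format overhead, and zlib's limited window means long documents may need a different compressor.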
Emergent measures I
  Random forests
    An ensemble of decision trees.
    For each ensemble member, build a decision tree on a sample of the training cases, considering only a small random subset of m features at each split (m << M, where M is the total number of features).
    Track the frequency with which cases are located at the same leaf node.
    The more frequently two cases share a leaf node, the more similar they are.
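The leaf-sharing step on this slide can be sketched independently of how the forest is trained. The function below assumes the forest already exists and that, for every case, we know which leaf it reaches in each tree. The input layout `leaf_ids[case][tree]` and the name `forest_proximity` are hypothetical choices for this example.

```python
def forest_proximity(leaf_ids):
    """Fraction of trees in which each pair of cases lands in the same leaf.

    leaf_ids[i][t] is the identifier of the leaf reached by case i in tree t
    (assumed to come from an already trained forest).
    """
    n_cases = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    proximity = [[0.0] * n_cases for _ in range(n_cases)]
    for i in range(n_cases):
        for j in range(n_cases):
            shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
            proximity[i][j] = shared / n_trees
    return proximity


# Three cases, four trees: made-up leaf assignments for illustration only.
example_leaf_ids = [
    [0, 2, 1, 3],
    [0, 2, 1, 0],
    [1, 0, 1, 2],
]
example_proximity = forest_proximity(example_leaf_ids)
```

The similarity here is emergent in the sense that no explicit metric is defined over the features; it falls out of how the trained trees partition the cases.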
Emergent measures II
  Cluster kernels
    Used in semi-supervised learning, where only some of the available data are labeled.
    Assumption: class labels do not change in regions of high density.
    Cluster kernels allow the unlabeled data to influence similarity.
    The kernel combines K_orig(x_i, x_j), a basic neighborhood kernel, with K_bag(x_i, x_j), a kernel derived from repeated clustering of all the data.
Emergent measures III
  Web-based kernel
    Measures the similarity of short text snippets using the documents returned for them in a Web search.
Implications for CBR research
  Vocabulary knowledge container
    In some circumstances (e.g., information-theoretic measures), the role of the similarity knowledge container increases.
  Speed-up techniques
    Because the new methods are typically computationally intensive, strategies for speeding up case retrieval become more important.
Conclusions
  Similarity measurement taxonomy
    Organizes the broad range of strategies for similarity assessment in CBR into a coherent taxonomy.
  Improve effectiveness of CBR
    An alternative metric may simply offer better accuracy because it embodies specific knowledge about the data.
Comments
  Advantage
    This paper introduces and discusses alternative metrics for similarity assessment in CBR.
  Drawback
  Application
    Similarity measurement.