1
Creating Concept Hierarchies in a Customer Self-Help System
Bob Wall
CS 535
04/29/05
2
Outline
- Introduction / motivation
- Background
- Algorithm
  - Feature selection / feature vector generation
  - Hierarchical agglomerative clustering (HAC)
  - Tree partitioning
- Results / conclusions
3
Introduction
- Application – customer self-help (FAQ) system
  - RightNow Technologies’ Customer Service module
- Need ways to organize Knowledge Base (KB)
- System already organizes documents (answers) using clustering
- Desirable to also organize user queries
4
Goals
- Create concept hierarchy from user queries
  - Domain-specific
  - Self-guided (no human intervention / guidance required)
- Present hierarchy to help guide users in navigating KB
- Demonstrate the types of queries that can be answered by system
- Automatically augment searches with related terms
5
Background
- Problem – cluster short text segments
  - Inadequate information in queries to provide context for clustering
  - Need some source of context
- Possible solution – use Web as source of info
  - Cilibrasi and Vitanyi proposed mechanism to extract meaning of words using Google searches
  - Chuang and Chien presented more detailed algorithm for clustering short segments by using text snippets returned by search engine
6
Algorithm
- Use each text segment as input query to search engine
- Process resulting text snippets using stemming, stop word lists to extract related terms (keywords)
- Select set of keywords, build feature vectors
- Cluster using Hierarchical Agglomerative Clustering (HAC)
- Compact tree using min-max partitioning
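The snippet-processing step can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the stop word list is a tiny hypothetical one, `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's, and the snippets are made-up strings rather than real search engine output.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "is", "for", "on", "how", "my"}


def crude_stem(word):
    """Naive suffix stripper, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(snippets):
    """Count stemmed, non-stop-word terms across a list of text snippets."""
    counts = Counter()
    for snippet in snippets:
        for token in re.findall(r"[a-z0-9]+", snippet.lower()):
            if token not in STOP_WORDS:
                counts[crude_stem(token)] += 1
    return counts


# Made-up snippets, as if returned for one query.
snippets = ["Resetting the password for an account", "How to reset passwords"]
keywords = extract_keywords(snippets)
```

The resulting counts are the raw material for the keyword selection and feature vector steps.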
7
KB-Specific Version – HAC-KB
- Choose set of user queries, corresponding answers
- Find list of keywords corresponding to those answers
- Trim down list to reasonable length
- Generate feature vectors
- HAC clustering
- Min-max partitioning
8
Available Data
- Answers
  - Documents forming the KB – actually question and answer, plus keywords and other information like product and category associations
- Ans_phrases
  - Extracted from answers, using stop word lists and stemming
  - One-, two-, and three-word phrases
  - Counts of occurrences in different parts of answer
- Keyword_searches
  - List of user queries – also filtered by stop word lists and stemmed
  - List of answers matching query
9
Feature Selection
- Select N most frequent user queries
- Select set of all answers matching those queries
- Select set of all keywords found in those answers
- Reduce to list of K keywords
  - Avoid removing all keywords associated with a query (would generate empty feature vector)
  - Try to eliminate keywords that provide little discrimination (ones associated with many queries)
  - Also eliminate keywords that only map to a single query
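One way the keyword-reduction rules could be expressed is sketched below. The function name, the `max_query_fraction` threshold, and the fallback rule for queries that would otherwise lose every keyword are all assumptions made for illustration; the slides do not say how these were actually chosen, and this sketch does not enforce a fixed target K.

```python
from collections import defaultdict


def select_keywords(query_keywords, max_query_fraction=0.5):
    """Filter candidate keywords per query.

    query_keywords: dict mapping query -> set of candidate keywords.
    Returns a dict mapping query -> set of kept keywords.
    """
    # Inverse map: keyword -> set of queries it is associated with.
    keyword_queries = defaultdict(set)
    for query, kws in query_keywords.items():
        for kw in kws:
            keyword_queries[kw].add(query)

    # Drop keywords tied to a single query or to too many queries.
    limit = max_query_fraction * len(query_keywords)
    kept = {kw for kw, qs in keyword_queries.items() if 1 < len(qs) <= limit}

    selected = {}
    for query, kws in query_keywords.items():
        filtered = kws & kept
        if not filtered and kws:
            # Assumed fallback: keep the least widely shared keyword so the
            # query does not end up with an empty feature vector.
            filtered = {min(kws, key=lambda kw: (len(keyword_queries[kw]), kw))}
        selected[query] = filtered
    return selected


# Made-up example data.
candidates = {
    "reset password": {"password", "account", "login"},
    "change password": {"password", "account", "email"},
    "billing date": {"invoice", "account"},
    "refund": {"invoice", "account", "refund"},
    "shipping": {"account", "carrier"},
}
selected = select_keywords(candidates)
```

Here "account" is dropped because it appears in every query, "login" and "email" are dropped as single-query keywords, and "shipping" keeps "carrier" only through the fallback.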
10
Feature Vector Generation
- Generate map from queries to keywords, and inverse map from keywords to queries
- Use the TF-IDF (term frequency / inverse document frequency) metric for weighting
  - v_{i,j} is weight of jth keyword for ith query
  - tf_{i,j} is the number of times that keyword j occurred in list of answers associated with query i
  - n_j is number of queries associated with keyword j
- Now have an N x K feature matrix
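The exact weighting formula did not survive in this transcript, so the sketch below assumes the common textbook form v_{i,j} = tf_{i,j} * log(N / n_j); the presentation may have used a normalized variant. The function name and sample counts are hypothetical.

```python
import math


def tfidf_matrix(tf, keywords):
    """Build an N x K matrix of TF-IDF weights.

    tf: dict mapping query -> dict mapping keyword -> occurrence count.
    keywords: ordered list of the K selected keywords.
    """
    queries = list(tf)
    n_queries = len(queries)
    # n_j: number of queries associated with keyword j.
    n = {kw: sum(1 for q in queries if tf[q].get(kw, 0) > 0) for kw in keywords}

    matrix = []
    for q in queries:
        row = []
        for kw in keywords:
            count = tf[q].get(kw, 0)
            if count == 0:
                row.append(0.0)
            else:
                # Assumed weighting: tf * log(N / n_j).
                row.append(count * math.log(n_queries / n[kw]))
        matrix.append(row)
    return queries, matrix


# Made-up counts for three queries and four keywords.
tf = {
    "reset password": {"password": 3, "login": 1},
    "change password": {"password": 2, "email": 1},
    "billing date": {"invoice": 4, "email": 1},
}
keywords = ["password", "login", "email", "invoice"]
queries, matrix = tfidf_matrix(tf, keywords)
```

Note that under this form a keyword associated with every query gets weight zero, which matches the earlier goal of discarding keywords that discriminate poorly.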
11
Standard HAC Algorithm
- Initialize clusters – one cluster per query
- Initialize similarity matrix
  - Using the average linkage similarity metric and cosine distance measure
  - Matrix is upper-triangular
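A minimal sketch of the initialization, assuming the feature vectors are plain lists of floats. Since similarity is symmetric, only pairs with i < j are stored, which corresponds to the upper triangle. The dictionary layout and the names used here are illustrative choices, not taken from the presentation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def initial_similarities(vectors):
    """Return {(i, j): similarity} for i < j only (the upper triangle)."""
    n = len(vectors)
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


# Made-up feature vectors for three queries.
vectors = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
sims = initial_similarities(vectors)
```

With one query per cluster, average linkage reduces to the plain pairwise cosine similarity, so no averaging is needed at this stage.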
12
HAC (cont.)
- For N – 1 iterations
  - Pick two root-node clusters with largest similarity
  - Combine into new root-node cluster
  - Add new cluster to similarity matrix – compute similarity with all other root-level clusters
- Generates tall binary tree of clusters
  - 2N – 1 nodes
  - Not particularly usable by humans
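The loop can be sketched as below. For brevity this version recomputes average-linkage similarities on every iteration instead of maintaining a similarity matrix, so it is slower than the approach the slide describes; the nested-tuple tree representation is also just an illustrative choice.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hac_average_linkage(vectors):
    """Return the merge tree as nested pairs; leaves are query indices."""
    # Each root cluster: id -> (subtree, list of member indices).
    clusters = {i: (i, [i]) for i in range(len(vectors))}
    next_id = len(vectors)

    def avg_sim(a, b):
        # Average linkage: mean similarity over all cross-cluster pairs.
        pairs = [(i, j) for i in clusters[a][1] for j in clusters[b][1]]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:  # runs N - 1 times
        a, b = max(combinations(sorted(clusters), 2), key=lambda p: avg_sim(*p))
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1

    (tree, _members), = clusters.values()
    return tree


def count_nodes(tree):
    """Count leaves plus internal nodes in the nested-pair tree."""
    if isinstance(tree, int):
        return 1
    return 1 + sum(count_nodes(child) for child in tree)


# Made-up feature vectors for four queries.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
tree = hac_average_linkage(vectors)
```

For these four vectors the result has 2N – 1 = 7 nodes, pairing queries 0 and 1 and queries 2 and 3 before joining the two pairs at the root.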
13
Min-Max Partitioning
- Need to combine nodes in cluster tree, produce a shallow, bushy multi-way tree
- Recursive partitioning algorithm
  - MinMaxPartition(Cluster sub-tree)
    - For each possible cut level in tree, compute quality of cut
    - Choose best-quality cut level
    - For each subtree cut off, recursively process
    - Stop at max depth or max cluster size
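The recursion might look roughly like this. Several things here are assumptions for illustration: internal nodes are represented as `(merge_step, left, right)` tuples with later merges having larger step numbers, a cut at level l is taken to mean undoing the l most recent merges, and the `quality` argument is a pluggable function where smaller values are better. The example uses a deliberately trivial placeholder quality function, not the measure from the presentation.

```python
def leaves(node):
    """List the leaf indices under a node (leaves are plain ints)."""
    if isinstance(node, int):
        return [node]
    return leaves(node[1]) + leaves(node[2])


def cut(node, level):
    """Undo the `level` most recent merges under node; return the subtrees."""
    parts = [node]
    for _ in range(level):
        internal = [p for p in parts if not isinstance(p, int)]
        if not internal:
            break
        latest = max(internal, key=lambda p: p[0])
        parts.remove(latest)
        parts.extend([latest[1], latest[2]])
    return parts


def min_max_partition(node, quality, max_size=2, max_depth=3, depth=0):
    """Recursively replace a binary subtree with a multi-way nested list."""
    members = leaves(node)
    if isinstance(node, int) or len(members) <= max_size or depth >= max_depth:
        return members
    # One candidate cut per level; keep the one with the best (lowest) quality.
    candidates = [cut(node, level) for level in range(1, len(members))]
    best = min(candidates, key=quality)
    return [
        min_max_partition(part, quality, max_size, max_depth, depth + 1)
        for part in best
    ]


# Made-up binary tree over 8 queries; the first tuple element is the merge step.
binary_tree = (
    7,
    (5, (1, 0, 1), (2, 2, 3)),
    (6, (3, 4, 5), (4, 6, 7)),
)


def placeholder_quality(parts):
    """Placeholder only: prefers cuts that yield four clusters."""
    return abs(len(parts) - 4)


result = min_max_partition(binary_tree, placeholder_quality)
```

With the placeholder, the three-level binary tree collapses to a single level with four children of two queries each.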
14
Cut Levels in Tree
15
Choosing Best Cut
- Goal is to maximize intra-cluster similarity, minimize inter-cluster similarity
- Quality = Q(C) / N(C)
  - Q(C) – cluster set quality (smaller is better)
  - N(C) – cluster size preference (gamma distribution)
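The definitions of Q(C) and N(C) were not preserved in this transcript, so the following is only one plausible reading. It assumes Q(C) is the mean ratio of inter-cluster to intra-cluster average similarity, and that N(C) is a gamma density evaluated at the number of clusters in the cut, with hypothetical shape and scale parameters `alpha` and `beta`. A cut is assumed to contain at least two clusters.

```python
import math


def avg_pair_sim(sim, group_a, group_b):
    """Mean similarity over pairs (i, j) with i in group_a, j in group_b, i != j."""
    pairs = [(i, j) for i in group_a for j in group_b if i != j]
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def cluster_set_quality(sim, clusters):
    """Assumed Q(C): mean inter/intra similarity ratio (smaller is better)."""
    everything = [i for c in clusters for i in c]
    total = 0.0
    for c in clusters:
        rest = [i for i in everything if i not in c]
        # A singleton has no internal pairs; treat it as perfectly cohesive.
        intra = avg_pair_sim(sim, c, c) if len(c) > 1 else 1.0
        inter = avg_pair_sim(sim, c, rest)
        total += inter / intra if intra > 0 else float("inf")
    return total / len(clusters)


def size_preference(k, alpha=3.0, beta=2.0):
    """Assumed N(C): gamma density at k, the number of clusters in the cut."""
    return k ** (alpha - 1) * math.exp(-k / beta) / (math.gamma(alpha) * beta ** alpha)


def cut_quality(sim, clusters, alpha=3.0, beta=2.0):
    """Quality = Q(C) / N(C); smaller is better."""
    return cluster_set_quality(sim, clusters) / size_preference(len(clusters), alpha, beta)


# Made-up symmetric similarity matrix for four queries.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
good_cut = [[0, 1], [2, 3]]
bad_cut = [[0, 2], [1, 3]]
```

Dividing by N(C) means a cut whose cluster count is close to the preferred one gets a lower, and therefore better, quality value. In the example both cuts have two clusters, so the difference comes entirely from Q(C): `good_cut` groups the similar queries together and scores lower than `bad_cut`.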
16
Issues / Further Work
- Resolve issues with data / implementation
- Outstanding problem – generating meaningful labels for clusters in hierarchy
- Means of measuring performance
- Incorporate other KB data, like relevance scores of search results, products/categories
- Better feature selection
- Fuzzy clustering – query can belong to multiple clusters (Frigui & Nasraoui)
17
References
- S.-L. Chuang and L.-F. Chien, “Towards Automatic Generation of Query Taxonomy: A Hierarchical Query Clustering Approach,” Proceedings of ICDM’02, Maebashi City, Japan, Dec. 9-12, 2002, pp. 75–82.
- S.-L. Chuang and L.-F. Chien, “A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments,” Proceedings of CIKM’04, Washington, DC, Nov. 2004, pp. 127–136.
- R. Cilibrasi and P. Vitanyi, “Automatic Meaning Discovery Using Google,” published on Web, available at http://arxiv.org/abs/cs/0412098.
- H. Frigui and O. Nasraoui, “Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents,” in Survey of Text Mining: Clustering, Classification, and Retrieval, Michael W. Berry, ed., Springer-Verlag, New York, 2004, pp. 45–72.