Creating Concept Hierarchies in a Customer Self-Help System
Bob Wall
CS 535, 04/29/05

Outline
- Introduction / motivation
- Background
- Algorithm
  - Feature selection / feature vector generation
  - Hierarchical agglomerative clustering (HAC)
  - Tree partitioning
- Results / conclusions

Introduction
- Application: customer self-help (FAQ) system
  - RightNow Technologies’ Customer Service module
- Need ways to organize the Knowledge Base (KB)
- System already organizes documents (answers) using clustering
- Desirable to also organize user queries

Goals
- Create a concept hierarchy from user queries
  - Domain-specific
  - Self-guided (no human intervention / guidance required)
- Present the hierarchy to help guide users in navigating the KB
  - Demonstrate the types of queries that can be answered by the system
- Automatically augment searches with related terms

Background
- Problem: cluster short text segments
  - Inadequate information in queries to provide context for clustering
  - Need some source of context
- Possible solution: use the Web as a source of information
  - Cilibrasi and Vitanyi proposed a mechanism to extract the meaning of words using Google searches
  - Chuang and Chien presented a more detailed algorithm for clustering short segments by using text snippets returned by a search engine

Algorithm
- Use each text segment as an input query to a search engine
- Process the resulting text snippets using stemming and stop word lists to extract related terms (keywords)
- Select a set of keywords, build feature vectors
- Cluster using Hierarchical Agglomerative Clustering (HAC)
- Compact the tree using min-max partitioning
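The keyword-extraction step can be sketched roughly as follows. This is a minimal illustration, not the processing used in the talk: the stop word list, the `naive_stem` suffix stripper, and the function names are all assumptions made for the example, and a real system would use a proper stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the list used in the talk).
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "for", "and", "or",
              "is", "are", "how", "do", "i", "my", "your", "you"}


def naive_stem(word):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(snippets):
    """Count stemmed, non-stop-word terms across a list of text snippets."""
    counts = Counter()
    for snippet in snippets:
        for token in re.findall(r"[a-z]+", snippet.lower()):
            if token not in STOP_WORDS:
                counts[naive_stem(token)] += 1
    return counts


keywords = extract_keywords(["Tracking orders", "track my order"])
```

Here both snippets reduce to the same two stems, so the two differently worded queries end up sharing keywords.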

KB-Specific Version: HAC-KB
- Choose a set of user queries and the corresponding answers
- Find the list of keywords corresponding to those answers
- Trim the list down to a reasonable length
- Generate feature vectors
- HAC clustering
- Min-max partitioning

Available Data
- Answers
  - Documents forming the KB: actually question and answer, plus keywords and other information like product and category associations
- Ans_phrases
  - Extracted from answers, using stop word lists and stemming
  - One-, two-, and three-word phrases
  - Counts of occurrences in different parts of the answer
- Keyword_searches
  - List of user queries, also filtered by stop word lists and stemmed
  - List of answers matching each query

Feature Selection
- Select the N most frequent user queries
- Select the set of all answers matching those queries
- Select the set of all keywords found in those answers
- Reduce to a list of K keywords
  - Avoid removing all keywords associated with a query (would generate an empty feature vector)
  - Try to eliminate keywords that provide little discrimination (ones associated with many queries)
  - Also eliminate keywords that only map to a single query
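One way these pruning rules might fit together is sketched below. The slides do not say how the rules are ordered or what threshold counts as "many queries", so the `max_fraction` cutoff, the preference for rarer keywords, and the function name `select_keywords` are assumptions for illustration only.

```python
from collections import defaultdict


def select_keywords(query_keywords, k, max_fraction=0.5):
    """Pick roughly k keywords from a {query: set_of_keywords} map."""
    # Inverse map: keyword -> set of queries it is associated with.
    keyword_queries = defaultdict(set)
    for query, keywords in query_keywords.items():
        for kw in keywords:
            keyword_queries[kw].add(query)

    n_queries = len(query_keywords)
    # Drop keywords tied to a single query, and those tied to too many queries.
    candidates = [
        kw for kw, qs in keyword_queries.items()
        if 2 <= len(qs) <= max_fraction * n_queries
    ]
    # Prefer rarer (more discriminating) keywords; break ties alphabetically.
    candidates.sort(key=lambda kw: (len(keyword_queries[kw]), kw))
    selected = set(candidates[:k])

    # Guard against empty feature vectors: give each uncovered query back its
    # least widely shared keyword, even if that pushes the total past k.
    for query, keywords in query_keywords.items():
        if keywords and not (keywords & selected):
            selected.add(min(keywords, key=lambda kw: (len(keyword_queries[kw]), kw)))
    return selected


example = {
    "reset password": {"password", "account", "login"},
    "change password": {"password", "account"},
    "billing address": {"billing", "account", "address"},
    "update billing": {"billing", "account"},
    "cancel order": {"order", "account"},
}
chosen = select_keywords(example, k=10)
```

In this example "account" is dropped because every query shares it, while "order" is kept despite mapping to a single query, because removing it would leave "cancel order" with an empty feature vector.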

Feature Vector Generation
- Generate a map from queries to keywords, and an inverse map from keywords to queries
- Use the TF-IDF (term frequency / inverse document frequency) metric for weighting
  - v_{i,j} is the weight of the jth keyword for the ith query
  - tf_{i,j} is the number of times keyword j occurred in the list of answers associated with query i
  - n_j is the number of queries associated with keyword j
- Now have an N x K feature matrix
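A small sketch of building the N x K matrix from these definitions follows. The exact weighting formula is not given in the text, so this assumes the common form v_{i,j} = tf_{i,j} * log(N / n_j); the talk may have used a different variant (for example, a log-scaled term frequency). The input layout and the name `tfidf_matrix` are likewise hypothetical.

```python
import math


def tfidf_matrix(tf, queries, keywords):
    """Build an N x K weight matrix.

    tf maps (query, keyword) -> count of the keyword in the answers
    associated with that query. Missing pairs count as zero.
    """
    n = len(queries)
    # n_j: number of queries associated with keyword j.
    n_j = {
        kw: sum(1 for q in queries if tf.get((q, kw), 0) > 0)
        for kw in keywords
    }
    matrix = []
    for q in queries:
        row = []
        for kw in keywords:
            count = tf.get((q, kw), 0)
            if count == 0:
                row.append(0.0)
            else:
                # Assumed weighting: raw term frequency times log(N / n_j).
                row.append(count * math.log(n / n_j[kw]))
        matrix.append(row)
    return matrix


queries = ["a", "b", "c", "d"]
keywords = ["x", "y"]
tf = {("a", "x"): 2, ("b", "x"): 1, ("c", "y"): 3}
weights = tfidf_matrix(tf, queries, keywords)
```

Note that under this form a keyword associated with every query gets weight zero, which is consistent with the feature-selection goal of discarding keywords that discriminate poorly.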

Standard HAC Algorithm
- Initialize clusters: one cluster per query
- Initialize the similarity matrix
  - Using the average linkage similarity metric and cosine distance measure
  - Matrix is upper-triangular
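The initialization step might look like the sketch below, which computes cosine similarity between feature vectors and stores only the upper triangle (pairs with i < j). Storing the triangle as a dictionary keyed by index pairs is an illustrative choice, not something the slides specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # An empty feature vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def initial_similarities(vectors):
    """Return {(i, j): similarity} for i < j, i.e. the upper triangle only."""
    n = len(vectors)
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


sims = initial_similarities([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Because each cluster starts with a single query, the average-linkage similarity between two initial clusters is just the cosine similarity of their two vectors.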

HAC (cont.)
- For N - 1 iterations
  - Pick the two root-node clusters with the largest similarity
  - Combine into a new root-node cluster
  - Add the new cluster to the similarity matrix; compute its similarity with all other root-level clusters
- Generates a tall binary tree of clusters
  - 2N - 1 nodes
  - Not particularly usable by humans
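The merge loop can be sketched as below, assuming a full symmetric N x N matrix of pairwise query similarities as input (N >= 1). For brevity this version recomputes average-linkage similarities from the leaf-level matrix on every iteration instead of maintaining an updated cluster similarity matrix, so it is a readable illustration rather than an efficient implementation. The nested-tuple tree representation and function names are assumptions.

```python
def hac(sim):
    """Average-linkage HAC over a symmetric N x N similarity matrix.

    Returns a binary tree as nested tuples whose leaves are query indices.
    """
    n = len(sim)
    # Root-level clusters: id -> (tree, list of member leaf indices).
    roots = {i: (i, [i]) for i in range(n)}
    next_id = n

    def avg_link(a, b):
        # Mean pairwise similarity between members of clusters a and b.
        members_a, members_b = roots[a][1], roots[b][1]
        total = sum(sim[x][y] for x in members_a for y in members_b)
        return total / (len(members_a) * len(members_b))

    for _ in range(n - 1):
        ids = sorted(roots)
        # Pick the pair of root-level clusters with the largest similarity.
        _, a, b = max(
            ((avg_link(a, b), a, b) for i, a in enumerate(ids) for b in ids[i + 1:]),
            key=lambda triple: triple[0],
        )
        tree_a, members_a = roots.pop(a)
        tree_b, members_b = roots.pop(b)
        roots[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1

    tree, _members = next(iter(roots.values()))
    return tree


def count_nodes(tree):
    """Count leaves plus internal nodes of a nested-tuple tree."""
    if isinstance(tree, int):
        return 1
    return 1 + sum(count_nodes(child) for child in tree)


sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
tree = hac(sim)
```

For these four queries the result is `((0, 1), (2, 3))`, which has 7 nodes, matching the 2N - 1 count.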

Min-Max Partitioning
- Need to combine nodes in the cluster tree to produce a shallow, bushy multi-way tree
- Recursive partitioning algorithm: MinMaxPartition(cluster sub-tree)
  - For each possible cut level in the tree, compute the quality of the cut
  - Choose the best-quality cut level
  - For each subtree cut off, recursively process
  - Stop at max depth or max cluster size
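The recursion might be structured as in the sketch below. Several details are assumptions: a cut at level l is taken to mean undoing the l most recent merges under a node, the quality function is supplied by the caller with smaller values treated as better, and recursion stops once a subtree has at most `max_size` leaves. The `Node` class and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    order: int                      # HAC merge step that created this node (0 for leaves)
    children: list = field(default_factory=list)
    label: str = ""


def leaves(node):
    """All leaf nodes under a node, left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def cut(node, level):
    """Undo the `level` most recent merges under node; returns the resulting subtrees."""
    parts = [node]
    for _ in range(level):
        splittable = [p for p in parts if p.children]
        if not splittable:
            break
        latest = max(splittable, key=lambda p: p.order)
        parts.remove(latest)
        parts.extend(latest.children)
    return parts


def min_max_partition(node, quality, max_depth, max_size, depth=0):
    """Recursively replace a binary subtree with a shallower multi-way tree."""
    n_leaves = len(leaves(node))
    if depth >= max_depth or n_leaves <= max(max_size, 1):
        return node
    # Evaluate every possible cut level and keep the best (smallest) score.
    best = min((cut(node, level) for level in range(1, n_leaves)), key=quality)
    children = [
        min_max_partition(part, quality, max_depth, max_size, depth + 1)
        for part in best
    ]
    return Node(node.order, children, node.label)


a, b, c, d = (Node(0, label=name) for name in "abcd")
left = Node(1, [a, b])
right = Node(2, [c, d])
root = Node(3, [left, right])

# Toy quality function for the example only: prefer cuts with four clusters.
flat = min_max_partition(root, lambda parts: abs(len(parts) - 4), max_depth=2, max_size=1)
```

With this toy quality function the binary tree collapses into a single node with four children. A real quality function would score each cut by its intra- and inter-cluster similarity.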

Cut Levels in Tree

Choosing Best Cut
- Goal is to maximize intra-cluster similarity and minimize inter-cluster similarity
- Quality = Q(C) / N(C)
  - Q(C): cluster set quality (smaller is better)
  - N(C): cluster size preference (gamma distribution)

Issues / Further Work
- Resolve issues with data / implementation
- Outstanding problem: generating meaningful labels for clusters in the hierarchy
- Means of measuring performance
- Incorporate other KB data, like relevance scores of search results, products/categories
- Better feature selection
- Fuzzy clustering: a query can belong to multiple clusters (Frigui & Nasraoui)

References
- S.-L. Chuang and L.-F. Chien, “Towards Automatic Generation of Query Taxonomy: A Hierarchical Query Clustering Approach,” Proceedings of ICDM’02, Maebashi City, Japan, Dec. 9-12, 2002, pp. 75–82.
- S.-L. Chuang and L.-F. Chien, “A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments,” Proceedings of CIKM’04, Washington, DC, Nov., 2004, pp
- R. Cilibrasi and P. Vitanyi, “Automatic Meaning Discovery Using Google,” published on Web, available at
- H. Frigui and O. Nasraoui, “Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents,” in Survey of Text Mining: Clustering, Classification, and Retrieval, Michael W. Berry, ed., Springer-Verlag, New York, 2004, pp