Published by June Hutchinson. Modified over 8 years ago.
1
A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments
CIKM 2004
Speaker: Yao-Min Huang
Date: 2005/03/10
2
Introduction
Text segments
– keywords in documents and natural language queries from users
– of many types, including word, phrase, named entity, natural language query, news event, product name, paper or book title, etc.
Address the problem of generating topic hierarchies for diverse text segments
– deals with the problem using the Web as an additional knowledge source
– focuses on how to link the clusters of text segments with close concepts, and how to decide appropriate levels and reasonable numbers of clusters
3
Introduction
Clustering short text segments is difficult
– they do not contain enough information to extract reliable features
– in practice, domain-specific corpora that describe the text segments are usually lacking
Propose using the Web
– Ex: the neighboring sentences of the given text segment
4
Hierarchies
– Traditional solutions mostly generate binary-tree hierarchies (narrow and deep)
– Propose a broad and shallow multi-way-tree representation, extending the Hierarchical Agglomerative Clustering (HAC) algorithm
5
Feature Extraction Using Search-Result Snippets
A text segment can be treated as a query expressing a certain search request, and its context is obtained directly from the highly ranked search-result snippets (analogous to pseudo-relevance feedback).
Adopt the vector space model as the data representation
– For each text segment p, collect up to a fixed maximum number of search-result entries
– Each text segment can then be converted into a bag of feature terms by applying standard text processing techniques (e.g., removing stop words and stemming) to the contents of the collected entries
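The conversion from snippets to a bag of feature terms can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list, the suffix-stripping `crude_stem` helper, and the sample snippets are all assumptions made for the example, and a real system would use a full stop-word list and a proper stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"}

def crude_stem(word):
    """Strip a few common English suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_terms(snippets):
    """Turn a list of snippet strings into a Counter of stemmed feature terms."""
    bag = Counter()
    for snippet in snippets:
        for token in re.findall(r"[a-z]+", snippet.lower()):
            if token not in STOP_WORDS:
                bag[crude_stem(token)] += 1
    return bag

# Hypothetical snippets, as if returned by a search engine for one text segment.
snippets = [
    "Clustering groups similar documents.",
    "Hierarchical clustering of documents in the Web.",
]
bag = bag_of_terms(snippets)
```

The resulting `bag` maps each stemmed term to its count over all collected snippets, which is the input a term-weighting step would need.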
6
Feature Extraction Using Search-Result Snippets
Adopt the vector space model as the data representation
– A text segment p is represented as a term vector with tf-idf weights
– The similarity between a pair of text segments is computed as the cosine of the angle between their term vectors
– The average similarity between two sets of vectors, Ci and Cj, is defined as the average of all pairwise similarities between a vector in Ci and a vector in Cj
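The three quantities on this slide can be sketched in a few lines. The formulas themselves did not survive in the transcript, so the tf-idf variant below (raw term frequency times log(n / df)) is an assumption; the cosine and the pairwise average follow the wording of the slide. The segment names `p1` to `p3` and their term counts are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(bags):
    """bags: dict name -> Counter of terms. Returns dict name -> {term: weight}."""
    n = len(bags)
    df = Counter()
    for bag in bags.values():
        df.update(bag.keys())  # document frequency: one count per segment
    vectors = {}
    for name, bag in bags.items():
        # Assumed weighting: raw tf times log(n / df).
        vectors[name] = {t: tf * math.log(n / df[t]) for t, tf in bag.items()}
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def average_similarity(cluster_i, cluster_j):
    """Average of all pairwise cosine similarities between two sets of vectors."""
    total = sum(cosine(u, v) for u in cluster_i for v in cluster_j)
    return total / (len(cluster_i) * len(cluster_j))

# Hypothetical bags of feature terms for three text segments.
bags = {
    "p1": Counter({"cluster": 2, "tree": 1}),
    "p2": Counter({"cluster": 1, "tree": 2}),
    "p3": Counter({"newton": 3}),
}
vectors = tfidf_vectors(bags)
```

Here `p1` and `p2` share both terms and come out similar, while `p3` shares nothing with them and has similarity zero.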
7
Feature Extraction Using Search-Result Snippets
8
Hierarchical Clustering Algorithm HAC+P
HAC-Based Binary-Tree Hierarchy Generation (Bottom-Up)
– The core of an HAC algorithm is a specific function used to measure the similarity between any pair of clusters Ci and Cj
– Inter-cluster similarity functions for HAC
  Single-Linkage (SL)
  Complete-Linkage (CL)
9
Hierarchical Clustering Algorithm HAC+P
HAC-Based Binary-Tree Hierarchy Generation
– Inter-cluster similarity functions for HAC
  Average-Linkage (AL)
  Centroid function (CE)
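The four inter-cluster similarity functions named on these two slides can be sketched as below. The slide formulas are missing from the transcript, so these are the textbook definitions expressed in terms of similarity (not distance), assumed here to match the talk: SL takes the most similar pair, CL the least similar pair, AL the mean over all pairs, and CE the similarity of the two centroids. Dense vectors and the sample clusters are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise(ci, cj):
    """All similarities between one member of ci and one member of cj."""
    return [cosine(u, v) for u in ci for v in cj]

def single_linkage(ci, cj):
    return max(pairwise(ci, cj))   # most similar pair

def complete_linkage(ci, cj):
    return min(pairwise(ci, cj))   # least similar pair

def average_linkage(ci, cj):
    sims = pairwise(ci, cj)
    return sum(sims) / len(sims)   # mean over all pairs

def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def centroid_linkage(ci, cj):
    return cosine(centroid(ci), centroid(cj))

# Two small illustrative clusters in a 2-term vector space.
ci = [[1.0, 0.0], [1.0, 1.0]]
cj = [[0.0, 1.0]]
```

An HAC loop would repeatedly merge the pair of clusters with the highest value of whichever function is chosen.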
10
Hierarchical Clustering Algorithm HAC+P
Min-Max Partitioning (Top-down)
Figure annotation: cutting at level 2 generates LC(2)
– Let the level between …
– Let LC(l) be the set of clusters produced after cutting the binary-tree hierarchy at level l; let CH(Ci) be the cluster hierarchy rooted at node Ci
  Ex: LC(2) = {C5, C6, C7}, CH(C8) = {C3, C4, C5, C6, C8}
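The LC and CH notation can be made concrete with a small sketch. The tree below is an assumption inferred from the slide's example (leaves C1 to C5, merges C6 = {C3, C4}, C7 = {C1, C2}, C8 = {C5, C6}, root C9 = {C7, C8}), and the reading of "cutting at level l" as undoing the l most recent merges is also an assumption, since the slide's definition of levels is truncated. Both choices reproduce the example values given on the slide.

```python
# Binary merge tree assumed from the slide's example. Keys are internal
# cluster ids, values are the two clusters merged to form them.
CHILDREN = {
    6: (3, 4),
    7: (1, 2),
    8: (5, 6),
    9: (7, 8),  # root
}

def cut(children, root, level):
    """Return LC(level) by undoing the `level` most recent merges.

    Assumes cluster ids increase with merge order, so the largest internal
    id in the current set is the most recently formed cluster.
    """
    clusters = {root}
    for _ in range(level):
        internal = [c for c in clusters if c in children]
        if not internal:
            break  # only leaves remain; nothing left to split
        newest = max(internal)
        clusters.remove(newest)
        clusters.update(children[newest])
    return clusters

def hierarchy(children, node):
    """Return CH(node): the node together with all of its descendants."""
    nodes = {node}
    for child in children.get(node, ()):
        nodes |= hierarchy(children, child)
    return nodes

lc2 = cut(CHILDREN, 9, 2)      # {5, 6, 7}, matching LC(2) on the slide
ch8 = hierarchy(CHILDREN, 8)   # {3, 4, 5, 6, 8}, matching CH(C8) on the slide
```

Min-max partitioning would evaluate each candidate LC(l), pick the best cut, and then recurse into the sub-hierarchies CH(Ci) of the selected clusters.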
11
Hierarchical Clustering Algorithm HAC+P
Min-Max Partitioning (Top-down)
– Two criteria are used to determine the best cut level
  Cluster Set Quality (minimize inter-cluster similarity, maximize intra-cluster similarity)
  Cluster-Number Preference
  – a simplified distribution function is used to measure the degree of preference for the number of clusters at each layer
12
Hierarchical Clustering Algorithm HAC+P
13
Cluster Naming
– Cluster naming has not been fully investigated at the current stage of the study. The cluster is simply named with the feature terms that co-occur most frequently across its member instances.
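A simple version of this naming heuristic might look like the sketch below. It assumes that "most frequent" means the highest total count summed over the cluster's instances; counting the number of instances that contain a term would be an equally plausible reading. The function name, the `top_k` parameter, and the sample bags are hypothetical.

```python
from collections import Counter

def name_cluster(instance_bags, top_k=2):
    """Pick the top_k most frequent feature terms as the cluster name.

    instance_bags: list of Counters, one bag of feature terms per text
    segment in the cluster.
    """
    totals = Counter()
    for bag in instance_bags:
        totals.update(bag)  # add counts term by term
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]

# Hypothetical bags for three instances in one cluster.
cluster_bags = [
    Counter({"network": 3, "protocol": 1}),
    Counter({"network": 2, "router": 2}),
    Counter({"protocol": 2, "network": 1}),
]
label = name_cluster(cluster_bags)
```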
14
Experiments
Different domains of text segments (with class information)
– YahooCS: the category names in the top three levels of the Yahoo! Computer Science directory were collected.
– People: the people names listed in the Yahoo!People/Scientist directory were collected.
– Paper
– QuizNLQ: a data set of general-domain natural language questions was collected from a Web site (http://www.coolquiz.com/trivia)
15
Experiments
Evaluation
– The F-measure of cluster j with respect to class i combines the precision and recall of cluster j for class i
– For the entire cluster hierarchy, the F-measure of any class is the maximum value it attains at any node in the tree, and an overall F-measure is computed as the weighted average of the F-measure values of all classes
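The evaluation measure can be sketched as follows. The formulas are missing from the transcript, so this uses the usual definition for hierarchical clustering: precision n_ij / n_j and recall n_ij / n_i combined by the harmonic mean, with each class weighted by its share of the instances. That weighting and the sample data are assumptions.

```python
def f_measure(n_ij, n_i, n_j):
    """F-measure of cluster j with respect to class i.

    n_ij: members of class i found in cluster j
    n_i:  size of class i
    n_j:  size of cluster j
    """
    if n_ij == 0:
        return 0.0
    recall = n_ij / n_i
    precision = n_ij / n_j
    return 2 * precision * recall / (precision + recall)

def overall_f(classes, clusters):
    """Weighted average over classes of the best F-measure at any tree node.

    classes:  dict class label -> set of instances
    clusters: list of sets, one per node of the cluster hierarchy
    """
    n = sum(len(members) for members in classes.values())
    total = 0.0
    for members in classes.values():
        best = max(
            f_measure(len(members & c), len(members), len(c)) for c in clusters
        )
        total += len(members) / n * best  # assumed weight: class size / n
    return total

# Hypothetical ground truth and the nodes of a small cluster tree.
classes = {"A": {1, 2, 3}, "B": {4, 5}}
tree_nodes = [{1, 2, 3, 4, 5}, {1, 2}, {3, 4, 5}, {4, 5}, {3}]
score = overall_f(classes, tree_nodes)
```

In this toy tree, class A is best matched by the node {1, 2} and class B is matched exactly by {4, 5}.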
16
Experiments
Baseline
– HAC, HKMeans
  The k-means method was modified to make it hierarchical (top-down).
  With HKMeans, all instances are first clustered into k clusters using k-means (random initialization), and the same procedure is recursively applied to each cluster until the specified depth e is reached.
  k is the nearest integer to the root of n (the number of instances to be clustered at each step).
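The HKMeans baseline described above can be sketched as a recursive k-means. This is a rough illustration under several assumptions: "root of n" is read as the square root, points are dense tuples compared by Euclidean distance (the talk works with cosine similarity on term vectors), and the iteration count and the dictionary-based tree layout are arbitrary choices.

```python
import math
import random

def kmeans(points, k, rng, iterations=20):
    """Plain k-means on tuples of floats; returns the non-empty clusters."""
    centers = rng.sample(points, k)  # random initialization
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Recompute each center; keep the old one if its group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [g for g in groups if g]

def hkmeans(points, depth, rng):
    """Recursively split `points` top-down; returns a nested dict tree."""
    node = {"points": points, "children": []}
    k = round(math.sqrt(len(points)))  # assumes "root of n" is the square root
    if depth == 0 or k < 2:
        return node
    clusters = kmeans(points, k, rng)
    if len(clusters) < 2:
        return node  # no real split happened; stop here
    node["children"] = [hkmeans(c, depth - 1, rng) for c in clusters]
    return node

# Two well-separated illustrative groups of points.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (10.0, 10.0), (10.1, 10.0), (10.0, 10.1),
]
tree = hkmeans(points, depth=1, rng=random.Random(0))
```

With six points, k is 2, so the root is split once into two children; a larger `depth` would keep splitting each child in the same way.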
17
Experiments
– HAC is the upper bound
– AL and CL are the better linkage functions
– Incorporating partitioning and constraining the hierarchy depth caused only a very small decrease in the F-measure score.
18
Experiments
– Cluster the Yahoo!CS category names using AL
– The generated structures were considered more natural and more helpful for observing the facts they contain.
– Figures 6 (HAC) and 7 (HAC+P) show the results.
19
Experiments
21
– Examine the effects of the ranking of search results returned by search engines and the number of snippets needed to achieve good performance.
– Using more snippets appeared to help achieve good performance.
22
User Evaluation
Test 1: Comprehension Test
Cohesiveness
– Judge whether the instances clustered together are semantically similar.
Isolation
– Judge whether the auto-generated clusters at the same level are distinguishable and their concepts do not subsume one another.
Hierarchy
– Judge whether the generated topic hierarchy is traversed from broader concepts at the higher levels to narrower concepts at the lower levels.
Navigation Balance
– Judge whether the fan-out at each level of the hierarchy is appropriate.
Readability
– Judge whether the concepts of clusters at all levels are easy to recognize with the composed clusters and instances.
23
User Evaluation
Test 1: Comprehension Test
– Five volunteers
– Ratings range from 0 to 7; a higher value indicates better quality.
– Yahoo!CS sets
24
User Evaluation
Test 2: Usability Test
– The purpose of the second test was to determine whether the proposed approach helps human experts reduce the time needed to construct a topic hierarchy and improve its accuracy.
– Four additional CS students formed two groups and constructed the Yahoo!CS hierarchy manually.
25
Conclusion and Future Work
This paper has proposed
– a practical Web-based approach to organizing text segments into a topic hierarchy
– a clustering algorithm for generating a natural multi-way-tree cluster hierarchy
Extensive experiments were conducted on different domains of text segments, and the results have shown the feasibility of the approach.
Future work
– investigate the applicability of the approach to more types of text segments. For example, dealing with polysemous text segments, such as "Newton" being both a physicist and a mathematician, is not well explored at the current stage of the study.
– provide a more sophisticated cluster naming technique