A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments CIKM2004 Speaker : Yao-Min Huang Date : 2005/03/10.

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Processing XML Keyword Search by Constructing Effective Structured Queries Jianxin Li, Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology,
Chapter 5: Introduction to Information Retrieval
Comparing Twitter Summarization Algorithms for Multiple Post Summaries David Inouye and Jugal K. Kalita SocialCom May 10 Hyewon Lim.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
CS347 Review Slides (IR Part II) June 6, 2001 ©Prabhakar Raghavan.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Creating Concept Hierarchies in a Customer Self-Help System Bob Wall CS /29/05.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
Clustering Unsupervised learning Generating “classes”
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.
1 Context-Aware Search Personalization with Concept Preference CIKM’11 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Exploiting Wikipedia as External Knowledge for Document Clustering Sakyasingha Dasgupta, Pradeep Ghosh Data Mining and Exploration-Presentation School.
Presented by Tienwei Tsai July, 2005
Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large- scale Data Collections Xuan-Hieu PhanLe-Minh NguyenSusumu Horiguchi GSIS,
1 LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora Chien-Chung Huang Shui-Lung Chuang Lee-Feng Chien Presented by: Vu LONG.
Intelligent Database Systems Lab Presenter : YAN-SHOU SIE Authors : JEROEN DE KNIJFF, FLAVIUS FRASINCAR, FREDERIK HOGENBOOM DKE Data & Knowledge.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Chapter 6: Information Retrieval and Web Search
Wei Feng , Jiawei Han, Jianyong Wang , Charu Aggarwal , Jianbin Huang
1 Automatic Classification of Bookmarked Web Pages Chris Staff Second Talk February 2007.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Algorithmic Detection of Semantic Similarity WWW 2005.
Extracting Keyphrases to Represent Relations in Social Networks from Web Junichiro Mori and Mitsuru Ishizuka Universiry of Tokyo Yutaka Matsuo National.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Improving the performance of personal name disambiguation.
Post-Ranking query suggestion by diversifying search Chao Wang.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
TO Each His Own: Personalized Content Selection Based on Text Comprehensibility Date: 2013/01/24 Author: Chenhao Tan, Evgeniy Gabrilovich, Bo Pang Source:
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
An evolutionary approach for improving the quality of automatic summaries Constantin Orasan Research Group in Computational Linguistics School of Humanities,
Hybrid Content and Tag-based Profiles for recommendation in Collaborative Tagging Systems Latin American Web Conference IEEE Computer Society, 2008 Presenter:
Multi-Criteria-based Active Learning for Named Entity Recognition ACL 2004.
Contextual Text Cube Model and Aggregation Operator for Text OLAP
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
1 Efficient Phrase-Based Document Similarity for Clustering IEEE Transactions On Knowledge And Data Engineering, Vol. 20, No. 9, Page(s): ,2008.
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
Ontology Engineering and Feature Construction for Predicting Friendship Links in the Live Journal Social Network Author:Vikas Bahirwani 、 Doina Caragea.
Clustering Machine Learning Unsupervised Learning K-means Optimization objective Random initialization Determining Number of Clusters Hierarchical Clustering.
Data Mining and Text Mining. The Standard Data Mining process.
Hierarchical Clustering & Topic Models
Queensland University of Technology
Clustering of Web pages
Block Matching for Ontologies
Chapter 5: Information Retrieval and Web Search
Text Categorization Berlin Chen 2003 Reference:
Presentation transcript:

A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments CIKM2004 Speaker : Yao-Min Huang Date : 2005/03/10

Text segments –keywords in documents and a natural language query from a user –of many types, including word, phrase, named entity, natural language query, news event, product name, paper or book title, etc. Address the problem of generating topic hierarchies for diverse text segments –deals with the problem using the Web as an additional knowledge source –focus on how to link the clusters of text segments with close concepts decide appropriate levels and reasonable numbers of clusters Introduction

Clustering short text segments is a difficult –do not contain enough information to extract reliable features –lack of domain-specific corpora to describe text segments is usually the case in reality Propose  Using Web –Ex: the neighboring sentences of the given text segment Introduction

Hierarchies –Traditional solution  mostly generate binary tree hierarchies (narrow and deep binary-tree) –Propose  broad and shallow multi-way-tree representation, Extend the Agglomerative Clustering algorithm (HAC)

Feature Extraction Using Search-Result Snippets A text segment could be treated as a query with a certain search request. And its contexts are then obtained directly from the highly ranked search-result snippets. (Analog to pseudo-relevance feedback) Adopt the vector space model as the data representation –For each text segment p  collect up to search-result entries, denoted as –Each text segment can then be converted into a bag of feature terms by applying normal text processing techniques (e.g., removing stop words and stemming) to the contents of

Feature Extraction Using Search-Result Snippets Adopt the vector space model as the data representation –A text segment p can be represented as a term vector (using tf-idf) –The similarity between a pair of text segments is computed as the cosine of the angle –The average similarity between two sets of vectors, Ci and Cj, as the average of all pairwise similarities

Feature Extraction Using Search-Result Snippets

Hierarchical Clustering Algorithm HAC+P HAC-Based Binary-Tree Hierarchy Generation (Button- Up) – –The core of an HAC algorithm is a specific function used to measure the similarity between any pair of clusters Ci and Cj. –The inter-cluster similarity function for HAC Single-Linkage (SL) Complete-Linkage (CL)

Hierarchical Clustering Algorithm HAC+P HAC-Based Binary-Tree Hierarchy Generation –The inter-cluster similarity function for HAC Average-linkage (AL) Centroid function (CE) –

Hierarchical Clustering Algorithm HAC+P Min-Max Partitioning (Top-down) cut level2 generate LC(2) –Let the level between –Let LC(l) be the set of clusters produced after cutting the binary- tree hierarchy at level l ; CH(Ci) be the cluster hierarchy rooted at node Ci Ex : LC(2)={C5,C6,C7}, CH(C8)={C3,C4,C5,C6,C8}

Hierarchical Clustering Algorithm HAC+P Min-Max Partitioning (Top-down) –Two criteria used to determine the best cut level Cluster Set Quality (max inter-similarity, min intra-similarity) Cluster-Number Preference –a simplified distribution function is used to measure the degree of preference on # of clusters at each layer

Hierarchical Clustering Algorithm HAC+P

Cluster Naming –The cluster naming is not fully investigated in our current stage of study. We simply take the most frequent co-occurred feature terms from the composed instances to name the cluster.

Experiments Different domains of text segments (have class information) –YahooCS The category names in the top three levels of the Yahoo! Computer Science directory were collected. - People Collect the people names listed in the Yahoo!People/Scientist directory. –Paper –QuizNLQ Collect a data set of general-domain natural language questions from a Web site (

Experiments Evaluation –F-measure of cluster j with respect to class I is defined as –For the entire cluster hierarchy, the F-measure of any class is the maximum value it attains at any node in the tree, and an overall F-measure is computed by taking the weighted average of all the F-measure values as follows:

Experiments Baseline –HAC 、 HKMeans K-means method was modified to make it hierarchical (top- down) By HKMeans, all instances are first clustered into k clusters using k-means (random initial), and the same procedure is recursively applied to each cluster until the specified depth e is reached. K is nearest integer of the root of n (# instances to be clustered at each step)

Experiments –HAC is the upper bound –AL 、 CL is the better –The incorporation of partitioning and the constraint of hierarchy depth only caused a very small decrement of the F-measure score.

Experiments –List Yahoo!CS category names using the AL –The generated structures were considered more natural and helpful in observing the facts contained. –Figure 6 (HAC) 、 7 (HAC+P) show the result.

Experiments

–E xamine the effects of the ranking of search results returned by search engines and the number of snippets needed to achieve good performance. –More snippets seemed more helpful in achieving good performance.

User Evaluation Test1 : Comprehension Test Cohesiveness –Judge whether the instances clustered together are semantically similar. Isolation –Judge whether the auto-generated clusters at the same level are distinguishable and their concepts do not subsume one another. Hierarchy –Judge whether the generated topic hierarchy is traversed from broader concepts at the higher levels to narrower concepts at the lower levels. Navigation Balance –Judge whether the fan-out at each level of the hierarchy is appropriate. Readability –Judge whether the concepts of clusters at all levels are easy to recognize with the composed clusters and instances.

User Evaluation Test1 : Comprehension Test –Five volunteers –Rang from 0~7 : high value indicates a better quality. –Yahoo!CS sets

User Evaluation Test2 : Usability Test –The purpose of the second test was to realize whether the proposed approach helps human experts in reducing the time to construct a topic hierarchy, and improving the accuracy. –Four additional CS students to form two groups and construct the Yahoo!CS hierarchy manually. –

Conclusion and Future Work This paper has proposed –a practical Web-based approach to organizing text segments into a topic hierarchy. –a clustering algorithm for generating a natural multi-way-tree cluster hierarchy is developed Extensive experiments were conducted on different domains of text segments, and the results have shown the feasibility of our approach. Future work –investigate the possibility of our approach on more types of text segments. For example,dealing with polysemous text segments, such as that “Newton” is both a physician and a mathematician, is not well explored in our current stage of study. –providing a more sophisticated cluster naming technique