Web Document Clustering: A Feasibility Demonstration Hui Han CSE dept. PSU 10/15/01.

Slides:



Advertisements
Similar presentations
CWS: A Comparative Web Search System Jian-Tao Sun, Xuanhui Wang, § Dou Shen Hua-Jun Zeng, Zheng Chen Microsoft Research Asia University of Illinois at.
Advertisements

Processing XML Keyword Search by Constructing Effective Structured Queries Jianxin Li, Chengfei Liu, Rui Zhou and Bo Ning Swinburne University of Technology,
Query Chain Focused Summarization Tal Baumel, Rafi Cohen, Michael Elhadad Jan 2014.
Sequential Minimal Optimization Advanced Machine Learning Course 2012 Fall Semester Tsinghua University.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Semantic Access to Data from the Web Raquel Trillo *, Laura Po +, Sergio Ilarri *, Sonia Bergamaschi + and E. Mena * 1st International Workshop on Interoperability.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Online Clustering of Web Search results
GROUPER: A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS Erdem Sarıgil O ğ uz Yılmaz
Web search results clustering Web search results clustering is a version of document clustering, but… Billions of pages Constantly changing Data mainly.
Modern Information Retrieval
IR Models: Structural Models
Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01.
Creating Concept Hierarchies in a Customer Self-Help System Bob Wall CS /29/05.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University.
1 MARG-DARSHAK: A Scrapbook on Web Search engines allow the users to enter keywords relating to a topic and retrieve information about internet sites (URLs)
MARS: Applying Multiplicative Adaptive User Preference Retrieval to Web Search Zhixiang Chen & Xiannong Meng U.Texas-PanAm & Bucknell Univ.
Latent Semantic Analysis (LSA). Introduction to LSA Learning Model Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Chapter 5: Information Retrieval and Web Search
Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body.
 Clustering of Web Documents Jinfeng Chen. Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation- based Document Clustering using.
RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.
Text mining.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng WWW 07.
Web Document Clustering By Sang-Cheol Seok. 1.Introduction: Web document clustering? Why ? Two results for the same query ‘amazon’ Google : currently.
Querying Structured Text in an XML Database By Xuemei Luo.
Laboratory for InterNet Computing CSCE 561 Social Media Projects Ryan Benton October 8, 2012.
南台科技大學 資訊工程系 A web page usage prediction scheme using sequence indexing and clustering techniques Adviser: Yu-Chiang Li Speaker: Gung-Shian Lin Date:2010/10/15.
Web Document Clustering: A Feasibility Demonstration Oren Zamir and Oren Etzioni, SIGIR, 1998.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
PMLAB Finding Similar Image Quickly Using Object Shapes Heng Tao Shen Dept. of Computer Science National University of Singapore Presented by Chin-Yi Tsai.
A New Suffix Tree Similarity Measure for Document Clustering
Improving Suffix Tree Clustering Base cluster ranking s(B) = |B| * f(|P|) |B| is the number of documents in base cluster B |P| is the number of words in.
SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS GROUPER : A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS MINAL.
Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong.
Chapter 6: Information Retrieval and Web Search
Graph Indexing: A Frequent Structure- based Approach Alicia Cosenza November 26 th, 2007.
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
1 Opinion Retrieval from Blogs Wei Zhang, Clement Yu, and Weiyi Meng (2007 CIKM)
DOCUMENT UPDATE SUMMARIZATION USING INCREMENTAL HIERARCHICAL CLUSTERING CIKM’10 (DINGDING WANG, TAO LI) Advisor: Koh, Jia-Ling Presenter: Nonhlanhla Shongwe.
Department of Software and Computing Systems Research Group of Language Processing and Information Systems The DLSIUAES Team’s Participation in the TAC.
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,
Information Retrieval
Advantages of Query Biased Summaries in Information Retrieval by A. Tombros and M. Sanderson Presenters: Omer Erdil Albayrak Bilge Koroglu.
2015/12/251 Hierarchical Document Clustering Using Frequent Itemsets Benjamin C.M. Fung, Ke Wangy and Martin Ester Proceeding of International Conference.
Post-Ranking query suggestion by diversifying search Chao Wang.
Date: 2013/9/25 Author: Mikhail Ageev, Dmitry Lagun, Eugene Agichtein Source: SIGIR’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Improving Search Result.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Clustering (Search Engine Results) CSE 454. © Etzioni & Weld To Do Lecture is short Add k-means Details of ST construction.
1 Efficient Phrase-Based Document Similarity for Clustering IEEE Transactions On Knowledge And Data Engineering, Vol. 20, No. 9, Page(s): ,2008.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
1 Unsupervised Learning from URL Corpora Deepak P*, IBM Research, Bangalore Deepak Khemani, Dept. of CS&E, IIT Madras *Work done while at IIT Madras.
Knowledge and Information Retrieval Dr Nicholas Gibbins 32/4037.
1 Query Directed Web Page Clustering Daniel Crabtree Peter Andreae, Xiaoying Gao Victoria University of Wellington.
Information Architecture
Clustering of Web pages
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.
Chapter 5: Information Retrieval and Web Search
Panagiotis G. Ipeirotis Luis Gravano
Zhixiang Chen & Xiannong Meng U.Texas-PanAm & Bucknell Univ.
Recuperação de Informação B
Information Retrieval and Web Design
Presentation transcript:

Web Document Clustering: A Feasibility Demonstration Hui Han CSE dept. PSU 10/15/01

Motivation  Low precision of Web search engines—hard for users to locate expected information quickly… Solutions: 1. Increase precision– by filtering methods? by advanced pruning options?… 2. Web Document Clustering - 2. Web Document Clustering - Cluster documents returned by search engine in response to a query and re-present them

Key Requirements for Web Document Clustering Relevance Browsable Summaries Overlap Snippet-tolerance – “snippet”: small piece of info. Or brief extract Speed Incrementality

Suffix Tree Clustering(STC) STC is a linear time clustering algorithm that is based on a suffix tree which efficiently identifies sets of documents that share common phrases. STC satisfies the key requirements: – STC treats a document as a string, making use of proximity information between words. – STC is novel, incremental, and O(n) time algorithm. – STC succinctly summarizes clusters’ contents for users. smaller set – Quick because of working on smaller set of documents, incremantality – …

Operating procedure of STC Step1: Document “cleaning” – Html -> plain text – Words stemming – Mark sentence boundaries – Remove non-word tokens Step 2: Identifying Base Clusters Step3: Combining Base Clusters

Step2: Identifying base Clusters — Suffix Tree * STC treats a document as a set of strings… Suffix tree of string S: a compact tree containing all the suffixes of S – Suffix of a word: lovely – Suffix of a string: “Friends” is a lovely show. Precise definition: – A suffix tree is a rooted, directed tree. – Each internal node has 2+ children. – Each edge is labeled with a non-empty sub-string of S. The label of a node is defined to be the concatenation of the edge-labels on the path from the root to that node – No two edges out of the same node can have edge- labels that begin with the same word—compact.

Ex. A Suffix Tree of Strings String1: “cat ate cheese”, String2: “mouse ate cheese too” String3: “cat ate mouse too”

Base clusters Base clusters corresponding to the suffix tree nodes

Cluster score s(B) = |B| * f(|P|) – |B| is the number of documents in base cluster B – |P| is the number of words in P that have a non- zero score zero score words: stopwords, too few( 40%)

Step 3:Combining Base Clusters Merge base clusters with a high overlap in their document sets – documents may share multiple phrases. Similarity of B m and B n (0.5 is paramter) 1 iff | B m  B n | / | B m | > 0.5 =and | B m  B n | / | B n | > otherwise

Base Cluster Graph Node: cluster Edge: similarity between two clusters > 1 What if “ate” is in the stop word list?

STC is Incremental As each document arrives from the web, we – “clean” it (linear with collection size) – Add it to the suffix tree. Each node that is updated/created as a result of this is tagged(linear) – Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of k highest scoring base clusters(linear) – Check any changes to the final clusters(linear) – Score and sort the final clusters, choose top 10...(linear)

STC allows cluster overlap… – Why overlap is reasonable? a document often has 1+ topics a document often has 1+ topics – STC allows a document to appear in 1+ clusters, since documents may share 1+ phrases with other documents – But not too similar to be merged into one cluster..

Experiments Cluster output of meta search engine, using STC alg. – Representative of Web search engines – WEB clustering, instead of “IR corpus”

Evaluation- Precision Precision of different Clustering algorithm

Cluster overlap & multi-word phrases are critical to STC’s success

Cluster overlap & multi-word phrases are specifically effective to STC’s success

Why? Allowing a document to appear in multiple clusters is only advantageous if that document is relevant; placing an irrelevant document in multiple clusters can only hurt cluster quality

Snippets versus Whole Document

Execution time Incremental – use “free” CPU time when the system is waiting for the search engine results to arrive over the web – speedy

Conclusion The identification of the unique requirements of document clustering of Web seach engine results The definition of STC – an incremental, o(n) time clustering algorithm that satisfies these requirements The first experimental evaluation of clustering algorithms on Web search engine results