Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong.

Presentation transcript:

Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong

2 Overview
• Introduction
• Previous Related Work
• SHOC Approach
• Prototype System
• Conclusion

3 Introduction
• Motivation
  - The Web is the largest data source.
  - Search engines are the most commonly used tool for Web information retrieval.
  - Their current state is far from satisfactory.
• Solution
  - Clustering Web search results would help considerably.
  - SHOC can generate clusters that are both reasonable and readable.

4 Basic requirements (clustering approach for Web search results)
• Semantic
  - Each cluster should correspond to a concept.
  - Avoid confining each Web page to only one cluster.
  - A label should describe the topic of its cluster well.
• Hierarchical
  - A tree structure that is easy to browse by eye.
  - Takes advantage of the relationships between clusters.
• Online
  - Provides fresh clustering results "just in time".

5 Previous Related Work
• Scatter/Gather system
  - Uses a traditional heuristic clustering algorithm.
  - It has some limitations.
• Hyperlink-based approaches
  - They need to download and parse the original Web pages.
  - They cannot cluster immediately.
• STC (Suffix Tree Clustering)
  - It is not appropriate for Oriental languages.
  - It extracts many meaningless partial phrases.
  - Synonymy and polysemy are not considered.

6 SHOC steps
1. Data acquisition
2. Data cleaning
3. Feature extraction
4. Identifying base clusters
5. Combining base clusters

7 Data acquisition
• The data acquisition task here is actually meta-search.
• Uses a two-level parallelization mechanism:
  1. Call several search engines simultaneously.
  2. Fetch all of each engine's search results simultaneously.
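
The two-level parallelization can be sketched with nested thread pools. This is only an illustration in Python: the function `meta_search`, the `engines` mapping, the `pages_per_engine` parameter and the stub `fake_engine` are all hypothetical, and "all of its search results" is read here as "all result pages of one engine". No real search engine API is called.

```python
from concurrent.futures import ThreadPoolExecutor


def meta_search(query, engines, pages_per_engine=2):
    """engines maps a name to a callable fetch(query, page) -> list of hits.

    Both the mapping and the callable signature are assumptions made for
    this sketch; they are not taken from the slides.
    """
    def fetch_all_pages(fetch):
        # Level 2: request every result page of one engine at the same time.
        with ThreadPoolExecutor(max_workers=pages_per_engine) as pool:
            pages = pool.map(lambda p: fetch(query, p), range(pages_per_engine))
            return [hit for page in pages for hit in page]

    # Level 1: query all engines at the same time.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fetch_all_pages, fetch)
                   for name, fetch in engines.items()}
        return {name: future.result() for name, future in futures.items()}


def fake_engine(name):
    """Stand-in for a real engine so the sketch runs without network access."""
    def fetch(query, page):
        return [f"{name}:{query}:p{page}:r{k}" for k in range(2)]
    return fetch


results = meta_search("object oriented",
                      {"A": fake_engine("A"), "B": fake_engine("B")})
```

Threads suit this step because the work is dominated by waiting on remote servers rather than by computation.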

8 Data cleaning
• Sentence boundaries are identified via:
  - punctuation marks (e.g. '.', ',', ';', '?')
  - HTML tags
• Non-word tokens (e.g. punctuation marks and HTML tags) are stripped.
• Redundant spaces are compressed.
• A stemming algorithm may be applied (for English text).
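
The cleaning steps can be sketched with regular expressions. This is a rough Python illustration under stated assumptions: the patterns `TAG` and `BOUNDARY` and the function `clean` are hypothetical, every HTML tag is treated as a boundary (the slide does not say which tags count), and the optional stemming step is left out.

```python
import re

TAG = re.compile(r"<[^>]+>")          # any HTML tag (assumption: all tags split sentences)
BOUNDARY = re.compile(r"[.,;?!]")     # punctuation treated as a sentence boundary


def clean(snippet):
    """Split a result snippet into cleaned sentences."""
    # Mark both kinds of boundary with a newline, then split on it.
    marked = BOUNDARY.sub("\n", TAG.sub("\n", snippet))
    sentences = []
    for part in marked.split("\n"):
        part = re.sub(r"[^\w\s]", " ", part)   # strip remaining non-word tokens
        part = " ".join(part.split())          # compress redundant spaces
        if part:
            sentences.append(part)
    return sentences


cleaned = clean("<b>Object  oriented</b> programming, analysis; design?")
```

Keeping sentence boundaries matters for the later steps, since a key phrase should not span two sentences.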

9 Feature extraction (Overview)
• Words
  - Most clustering algorithms treat a document as a "bag of words", ignoring word order and proximity.
• Key phrases
  - Advantages
    - Improve the quality of the clusters.
    - Useful in constructing labels.
  - Data structures (key phrase discovery)
    - Suffix tree: its cost depends on the alphabet size of the language.
    - Suffix array: scales well with alphabet size.

10 Feature extraction (key phrase discovery)
• Completeness
  - Left-completeness
  - Right-completeness
• Stability (mutual information)
  - $S = c_1 c_2 \cdots c_p$, $S_L = c_1 \cdots c_{p-1}$, $S_R = c_2 \cdots c_p$
• Significance
  - $se(S) = freq(S) \cdot g(|S|)$
  - $g(x) = \begin{cases} 0 & x = 1 \\ \log_2 x & 2 \le x \le 8 \\ 3 & x > 8 \end{cases}$
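
The significance score is simple enough to compute directly. The Python sketch below follows the piecewise definition on this slide; the function names `g` and `significance` are chosen for illustration, and the slide does not say whether the length |S| is counted in words or in characters, so the caller decides.

```python
import math


def g(x):
    """Length weight from the slide: 0 at x = 1, log2(x) up to 8, capped at 3."""
    if x <= 1:
        return 0.0
    if x <= 8:
        return math.log2(x)
    return 3.0


def significance(freq, length):
    """se(S) = freq(S) * g(|S|)."""
    return freq * g(length)


score = significance(5, 4)   # a phrase of length 4 seen 5 times
```

The cap at 3 equals log2(8), so the weight is continuous and very long phrases gain nothing further from their length.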

11 Feature extraction (Suffix array)
• Suffix array
  - An array of all N suffixes of the text, sorted alphabetically.
• LCP (Longest Common Prefix)
  - Used to accelerate searching in the text.
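
A small Python sketch may make the two structures concrete. It is an illustration only: the construction is the naive sort of all suffixes rather than an efficient algorithm, the function names are hypothetical, the sample text "to_be_or_not_to_be" is an assumption, and the lookup is a plain binary search that does not use the LCP values for acceleration.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted alphabetically (naive construction)."""
    return sorted(range(len(text)), key=lambda k: text[k:])


def build_lcp(text, sa):
    """lcp[i] = length of the common prefix of the i-th and (i+1)-th sorted suffixes."""
    def common(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k
    return [common(text[sa[i]:], text[sa[i + 1]:]) for i in range(len(sa) - 1)]


def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                               # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                               # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "to_be_or_not_to_be"
sa = build_suffix_array(text)
lcp = build_lcp(text, sa)
hits = find_occurrences(text, sa, "be")
```

Because suffixes sharing a prefix sit next to each other in the array, every repeated phrase shows up as a run of positive LCP values, which is what the discovery step exploits.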

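12 Feature extraction (Discover rcs)

    #include <stdio.h>

    typedef struct {
        int ID;         // index into lcp[] where the substring was first seen
        int frequency;  // number of suffixes that share the substring
    } RCSTYPE;

    // lcp[i] is the LCP of the i-th and (i+1)-th sorted suffixes (1-based);
    // lcp[N] must be 0 so that the stack is flushed at the end.
    void discover_rcs(const int *lcp, int N) {
        RCSTYPE rcs_stack[N];   // N is the document's length
        int sp = -1;            // the stack pointer
        int i = 1;
        while (i < N + 1) {
            if (sp < 0) {       // the stack is empty
                if (lcp[i] > 0) {
                    sp++;
                    rcs_stack[sp].ID = i;
                    rcs_stack[sp].frequency = 2;
                }
                i++;
            } else {
                int r = rcs_stack[sp].ID;
                if (lcp[r] < lcp[i]) {
                    sp++;
                    rcs_stack[sp].ID = i;
                    rcs_stack[sp].frequency = 2;
                    i++;
                } else if (lcp[r] == lcp[i]) {
                    rcs_stack[sp].frequency++;
                    i++;
                } else {
                    // Output ID & frequency
                    printf("%d %d\n", rcs_stack[sp].ID, rcs_stack[sp].frequency);
                    int f = rcs_stack[sp].frequency;
                    sp--;
                    if (sp >= 0 && lcp[rcs_stack[sp].ID] >= lcp[i]) {
                        rcs_stack[sp].frequency = rcs_stack[sp].frequency + f - 1;
                    } else if (lcp[i] > 0) {
                        // A shorter substring starts where the popped one did,
                        // so it covers those f suffixes plus suffix i+1.
                        sp++;
                        rcs_stack[sp].ID = i;
                        rcs_stack[sp].frequency = f + 1;
                        i++;
                    }
                }
            }
        }
    }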
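
The stack pass on this slide can be exercised end to end. The following Python sketch is an illustration, not the authors' code: the naive suffix-array and LCP construction, the function names, and the sample text "to_be_or_not_to_be" (chosen because it yields the IDs and frequencies listed in the rcs table on the next slide) are assumptions. It also assumes that when an entry is popped and the next LCP value is still positive but larger than the value at the new stack top, the new entry inherits the popped frequency; that rule is needed to obtain frequencies such as 5 for "_".

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda k: text[k:])


def lcp_array(text, sa):
    """1-based: lcp[i] compares the suffixes ranked i and i+1; lcp[n] = 0 is a sentinel."""
    n = len(sa)
    lcp = [0] * (n + 1)
    for i in range(1, n):
        a, b = text[sa[i - 1]:], text[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp


def discover_rcs(text):
    """Return (ID, frequency, substring) for every right-complete substring."""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    n = len(text)
    stack = []                       # entries are [ID, frequency]
    found = []
    i = 1
    while i <= n:
        if not stack:
            if lcp[i] > 0:
                stack.append([i, 2])
            i += 1
            continue
        top_id = stack[-1][0]
        if lcp[top_id] < lcp[i]:     # a longer shared prefix begins here
            stack.append([i, 2])
            i += 1
        elif lcp[top_id] == lcp[i]:  # one more suffix shares the same prefix
            stack[-1][1] += 1
            i += 1
        else:                        # the prefix on top of the stack ends here
            rid, f = stack.pop()
            start = sa[rid - 1]
            found.append((rid, f, text[start:start + lcp[rid]]))
            if stack and lcp[stack[-1][0]] >= lcp[i]:
                stack[-1][1] += f - 1        # enclosing prefix absorbs the count
            elif lcp[i] > 0:
                stack.append([i, f + 1])     # assumption: inherit popped suffixes
                i += 1
    return sorted(found)


rcs = discover_rcs("to_be_or_not_to_be")
```

Each reported ID is a rank in the sorted suffix list, and the substring is the first lcp[ID] characters of the suffix at that rank.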

13 Feature extraction (Intersect lcs_rcs)

    void intersect_lcs_rcs(sorted lcs array, sorted rcs array) {
        int i = 0, j = 0;   // L and R are the lengths of the lcs and rcs arrays
        while (i < L && j < R) {
            string str_l = the LCS denoted by lcs[i].ID;
            string str_r = the RCS denoted by rcs[j].ID;
            if (str_l == str_r) {
                Output lcs[i];   // complete on both sides
                i++;
                j++;
            } else if (str_l < str_r) {
                i++;
            } else {
                j++;
            }
        }
    }

rcs array
ID   frequency   RCS
1    2           _be
2    5           _
6    2           be
8    2           e
11   2           o_be
12   4           o
16   3           t
17   2           to_be

cs array
ID   frequency   CS
2    5           _
12   4           o
16   3           t
17   2           to_be
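
The intersection is a two-pointer merge over two lists sorted by substring. The Python sketch below is an illustration under assumptions: the text "to_be_or_not_to_be" is assumed as the input behind the tables, the helper `complete_substrings` is a brute-force stand-in for the suffix-array pass, and a substring is taken to be right-complete (left-complete) when the characters immediately after (before) its occurrences are not all the same, with the text boundary counted as a distinct neighbour.

```python
from collections import defaultdict


def complete_substrings(text, side):
    """Repeated substrings that are complete on `side` ('left' or 'right').

    Brute force over all substrings; returns (substring, frequency) pairs
    sorted by substring.
    """
    neighbours = defaultdict(set)    # substring -> adjacent characters seen
    freq = defaultdict(int)
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n + 1):
            s = text[start:end]
            freq[s] += 1
            if side == "right":
                adjacent = text[end] if end < n else None      # None = text boundary
            else:
                adjacent = text[start - 1] if start > 0 else None
            neighbours[s].add(adjacent)
    return sorted((s, f) for s, f in freq.items()
                  if f >= 2 and len(neighbours[s]) > 1)


def intersect_lcs_rcs(lcs, rcs):
    """Two-pointer merge: keep substrings present in both sorted lists."""
    i, j, complete = 0, 0, []
    while i < len(lcs) and j < len(rcs):
        if lcs[i][0] == rcs[j][0]:
            complete.append(lcs[i])
            i += 1
            j += 1
        elif lcs[i][0] < rcs[j][0]:
            i += 1
        else:
            j += 1
    return complete


text = "to_be_or_not_to_be"
cs = intersect_lcs_rcs(complete_substrings(text, "left"),
                       complete_substrings(text, "right"))
```

The merge only works when both lists are ordered by the substring itself, which is why the helper sorts its output before returning it.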

14 Identifying base clusters
• Terms (key phrases)
• Documents
• The association between terms and documents
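
The slide only names the association between terms and documents. As a minimal Python sketch of that association, assuming that a base cluster is simply the set of documents containing a key phrase (the orthogonal clustering mentioned in the conclusion is not attempted here), with the function name and sample data being hypothetical:

```python
def base_clusters(documents, key_phrases):
    """Map each key phrase to the set of indices of documents that contain it."""
    return {phrase: {idx for idx, doc in enumerate(documents) if phrase in doc}
            for phrase in key_phrases}


docs = [
    "object oriented programming in java",
    "object oriented analysis",
    "suffix array construction",
]
clusters = base_clusters(docs, ["object oriented", "suffix array"])
```

A document may appear in several base clusters, which matches the requirement that a Web page is not confined to one cluster.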

15 Combining base clusters
• Combining base clusters X and Y

    if ( |X ∩ Y| / |X ∪ Y| > t1 ) {
        X and Y are merged into one cluster;
    } else {
        if ( |X| > |Y| ) {
            if ( |X ∩ Y| / |Y| > t2 ) { let Y become X's child; }
        } else {
            if ( |X ∩ Y| / |X| > t2 ) { let X become Y's child; }
        }
    }

• Merging labels

    if ( label_x is a substring of label_y ) {
        label_xy = label_y;
    } else if ( label_y is a substring of label_x ) {
        label_xy = label_x;
    } else {
        label_xy = "label_x + label_y";
    }
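
The combination rule translates almost directly into code. In this Python sketch the function names, the returned strings, the default thresholds t1 = t2 = 0.5 and the space used to join two labels are assumptions made for illustration; the slide gives no threshold values.

```python
def combine(x, y, t1=0.5, t2=0.5):
    """Decide how base clusters x and y (non-empty sets of document ids) relate."""
    overlap = len(x & y)
    if overlap / len(x | y) > t1:
        return "merge"
    if len(x) > len(y):
        if overlap / len(y) > t2:
            return "y_child_of_x"
    elif overlap / len(x) > t2:
        return "x_child_of_y"
    return "separate"


def merge_label(label_x, label_y):
    """Keep the longer label when one contains the other, else join both."""
    if label_x in label_y:
        return label_y
    if label_y in label_x:
        return label_x
    return f"{label_x} {label_y}"


big = {1, 2, 3, 4, 5, 6, 7, 8}
small = {1, 2, 9}
relation = combine(big, small)
label = merge_label("object oriented", "object oriented programming")
```

The first test is symmetric (overlap relative to the union), while the second is asymmetric (overlap relative to the smaller cluster), which is what produces the parent and child links of the hierarchy.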

16 Prototype system
• Created a prototype system named WICE (Web Information Clustering Engine).
• It handles the special problems related to Chinese well.
• Output for the query "object oriented": object oriented programming, object oriented analysis, etc.

17 Conclusion
• Main contributions
  - The benefit of using key phrases.
  - A suffix-array-based method for key phrase discovery.
  - The concept of orthogonal clustering.
  - The WICE system, designed and implemented.
• Further work
  - Detailed analysis.
  - Further experiments.
  - Interpretation of the experimental results.
  - Comparison with other clustering algorithms.