1 Very Large-Scale Incremental Clustering Berk Berker Mumin Cebe Ismet Zeki Yalniz 27 March 2007.


Presentation transcript:

1 Very Large-Scale Incremental Clustering. Berk Berker, Mumin Cebe, Ismet Zeki Yalniz. 27 March 2007

2 Table of Contents: Why Clustering?; Why Incremental Clustering?; Related Work; Incremental C3M (C2ICM); A Former Implementation of C2ICM for Very Large Datasets; Conclusion

3 Why clustering? It is an effective tool to manage information overload: to browse large document collections quickly; to easily grasp the distinct topics and subtopics (concept hierarchies); to allow search engines to efficiently query large document collections.

4 Types of Clustering: Hierarchical vs. non-hierarchical; partitional vs. agglomerative; deterministic vs. probabilistic algorithms; incremental vs. batch algorithms.

5 Why Incremental Clustering? The current information explosion. Popular sources of informational text documents such as newswire and blogs. Delay would be unacceptable in several important areas.

6 Related Work: The cluster-splitting approach; adaptive clustering based on user queries; the Cobweb algorithm; hierarchical clustering in an incremental manner.

7 C2ICM Algorithm. C3M is known as an efficient, effective and robust algorithm for clustering documents. C3M is well developed for initial clustering, but cluster maintenance is also necessary.

8 C2ICM Algorithm. Like C3M, the C2ICM algorithm is based on the cover coefficient concept. C2ICM is suitable for dynamic environments where documents are added and deleted. With C2ICM, reclustering for each update is avoided.

9 C2ICM Algorithm Details. First, we compute the number of clusters and the cluster seed powers in the updated database. Then we determine the newly added documents and the falsified documents.
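The first step on this slide can be sketched as follows. The slides do not give the formulas, so this is a minimal sketch under assumptions: it uses the cover coefficient definitions as commonly stated for C3M (Can and Ozkarahan, 1990) with a binary document-term matrix, where c_ij = alpha_i * sum_k d_ik * beta_k * d_jk, the number of clusters is the sum of the diagonal entries, and the seed power of a document is delta_i * psi_i * (number of terms in the document). The function names, the rounding of the cluster count, the tie-breaking rule, and the toy matrix `D` are all hypothetical.

```python
import numpy as np

def cover_coefficient_matrix(D):
    """Cover coefficient matrix C for a document-term matrix D.

    Assumes every document has at least one term and every term
    occurs in at least one document (no zero row or column sums).
    """
    D = np.asarray(D, dtype=float)
    alpha = 1.0 / D.sum(axis=1)  # reciprocal row sums, one per document
    beta = 1.0 / D.sum(axis=0)   # reciprocal column sums, one per term
    # c_ij = alpha_i * sum_k d_ik * beta_k * d_jk
    return (alpha[:, None] * D) @ (beta[:, None] * D.T)

def seeds_and_powers(D):
    """Estimate the number of clusters, seed powers, and seed documents."""
    D = np.asarray(D, dtype=float)
    C = cover_coefficient_matrix(D)
    delta = np.diag(C)   # decoupling coefficient of each document
    psi = 1.0 - delta    # coupling coefficient of each document
    # Assumption: round the sum of decoupling coefficients to an integer.
    n_clusters = max(1, int(round(delta.sum())))
    power = delta * psi * D.sum(axis=1)
    # Assumption: pick the n_clusters documents with the highest power,
    # breaking ties by document index.
    order = np.argsort(-power, kind="stable")
    seeds = sorted(order[:n_clusters].tolist())
    return n_clusters, power, seeds

# Hypothetical toy collection: 4 documents, 6 terms.
D = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]
n_clusters, power, seeds = seeds_and_powers(D)
print(n_clusters, seeds)
```

In an incremental setting, the same computation would be rerun on the updated document-term matrix, and the resulting seed list compared with the previous one.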

10 C2ICM Algorithm Details. How do the clusters become false? A cluster becomes false when its seed document becomes a non-seed or is deleted, or when one or more non-seed documents of that cluster become seeds.
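The two falsification conditions can be sketched as a small set computation. This is an illustrative sketch, not the authors' implementation: the function name, the dictionary representation of clusters, and the `deleted` parameter are hypothetical, and the ragbag cluster is not modelled. The data mirrors the two cases of the example on slides 12 to 18.

```python
def documents_to_recluster(clusters, new_seeds, new_docs, deleted=()):
    """Split clusters into kept and falsified ones.

    clusters: dict mapping each old seed to the set of its members
              (the seed itself included).
    Returns (kept_clusters, documents_to_be_clustered).
    """
    new_seeds = set(new_seeds)
    deleted = set(deleted)
    to_cluster = set(new_docs)
    kept = {}
    for seed, members in clusters.items():
        # Condition 1: the seed is no longer a seed, or it was deleted.
        seed_lost = seed not in new_seeds or seed in deleted
        # Condition 2: some non-seed member has become a seed.
        member_promoted = any(d in new_seeds for d in members if d != seed)
        if seed_lost or member_promoted:
            to_cluster |= set(members) - deleted
        else:
            kept[seed] = set(members) - deleted
    return kept, to_cluster

clusters = {
    "d1": {"d1", "d2", "d3", "d4", "d5", "d7"},
    "d6": {"d6", "d8", "d9", "d10", "d11", "d15"},
    "d12": {"d12", "d13", "d14", "d16", "d17", "d18"},
}
new_docs = {"d19", "d20", "d21", "d22"}

# Case 1: seed d12 becomes a non-seed.
kept1, todo1 = documents_to_recluster(clusters, {"d1", "d6", "d13", "d19"}, new_docs)
# Case 2: non-seed d14 becomes a seed while d12 stays a seed.
kept2, todo2 = documents_to_recluster(clusters, {"d1", "d6", "d12", "d14"}, new_docs)
print(sorted(kept1), sorted(todo1))
```

In both cases the cluster of d12 is falsified, so its six members join the four new documents in the set to be clustered, while the clusters of d1 and d6 are left untouched.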

11 C2ICM Algorithm Details. We cluster these documents by assigning each one to the cluster of the seed that covers it most. Documents that do not belong to any cluster are grouped into the ragbag cluster.
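The assignment step can be sketched as below, again as a hedged illustration rather than the original code. It assumes that "covers it most" means the largest cover coefficient c_is between document i and seed s, computed with the C3M definition c_is = alpha_i * sum_k d_ik * beta_k * d_sk, and that a document whose coefficient is zero for every seed goes to the ragbag cluster. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def assign_to_seeds(D, seeds, docs):
    """Assign each document in docs to the seed that covers it most.

    Assumes no zero row or column sums in D. Ties between seeds are
    broken in favour of the seed listed first.
    """
    D = np.asarray(D, dtype=float)
    alpha = 1.0 / D.sum(axis=1)  # reciprocal row sums
    beta = 1.0 / D.sum(axis=0)   # reciprocal column sums
    clusters = {s: [s] for s in seeds}
    ragbag = []
    for i in docs:
        if i in clusters:
            continue  # seeds already head their own clusters
        # Cover coefficient of document i by each seed s.
        cover = {s: alpha[i] * np.sum(D[i] * beta * D[s]) for s in seeds}
        best = max(cover, key=cover.get)
        if cover[best] > 0:
            clusters[best].append(i)
        else:
            ragbag.append(i)  # shares no term with any seed
    return clusters, ragbag

# Hypothetical toy collection: 5 documents, 7 terms; document 4 shares
# no term with the others.
D = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
]
clusters, ragbag = assign_to_seeds(D, seeds=[0, 2], docs=range(5))
print(clusters, ragbag)
```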

12 C2ICM: An example. Current state of the clusters: {d5 d4 d3 d1 d7 d2}, {d8 d9 d15 d6 d10 d11}, {d18 d16 d17 d12 d13 d14}, and the ragbag cluster. Seed List: d1 d6 d12 d19

13 C2ICM: CASE 1. When a seed document becomes nonseed. Clusters: {d5 d4 d3 d1 d7 d2}, {d8 d9 d15 d6 d10 d11}, {d18 d16 d17 d12 d13 d14}. Old Seed List: d1 d6 d12. New documents arrived: d19 d20 d21 d22. New Seed List: d1 d6 d13 d19.

14 C2ICM: CASE 1. Seed document d12 becomes nonseed. Clusters: {d5 d4 d3 d1 d7 d2}, {d8 d9 d15 d6 d10 d11}. The set of documents to be clustered: d22 d13 d14 d12 d16 d17 d18 d19 d20 d21. New Seed List: d1 d6 d13 d19.

15 C2ICM: CASE 1. Final clusters: {d5 d4 d3 d1 d7 d2}, {d8 d9 d15 d6 d10 d11}, and the clusters formed from d20 d16 d12 d13 d18 d21 d14 d17 d19 d22. New Seed List: d1 d6 d13 d19. No elements remaining in the ragbag cluster.

16 C2ICM: CASE 2. When a nonseed document in a cluster becomes seed. Clusters: {d5 d4 d3 d1 d7 d2}, {d8 d9 d15 d6 d10 d11}, {d18 d16 d17 d12 d13 d14}. Old Seed List: d1 d6 d12. New documents arrived: d19 d20 d21 d22. New Seed List: d1 d6 d12 d14.

17 C2ICM: CASE 2. Nonseed document d14 becomes seed. Clusters: {d5 d4 d3 d1 d7 d2}, {d8 d9 d15 d6 d10 d11}. The set of documents to be clustered: d12 d13 d14 d16 d17 d18 d19 d20 d21 d22 (d14 becomes new seed). New Seed List: d1 d6 d12 d14.

18 C2ICM: CASE 2. Final clusters: {d5 d4 d3 d1 d7 d2}, {d8 d9 d15 d6 d10 d11}, and the clusters formed from d20 d16 d13 d12 d22 d18 d21 d19 d17 d14 (d14 becomes new seed). New Seed List: d1 d6 d12 d14. No elements remaining in the ragbag cluster.

19 A Former Implementation of C2ICM for Very Large Datasets. C2ICM is implemented as two programs (VS Pascal): Program I selects the seeds, and Program II clusters documents using the C2ICM algorithm. The programs communicate by exchanging files. Diagram labels: Program I (Seed Selector), Program II (C2ICM), text files, documents, clusters.

20 Former Experiments. C2ICM was tested with a subset of the MARIAN database (~43K documents). Each incremental update added ~6K documents to initially clustered collections of different sizes.

21 Results for the Former Experiments. C2ICM provides time savings. The clusters formed with C2ICM were very similar to the clusters formed with C3M.

22 Conclusion. The cluster maintenance problem is challenging. Our aim is to conduct experiments for C2ICM with a very large number of documents (i.e., millions of documents). The HARD dataset will be used for evaluation. Information retrieval performance will be measured. The implementation of C2ICM must be time and memory efficient.

23 References
Can, F., Ozkarahan, E. A. "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems, Vol. 15, No. 4 (December 1990).
Can, F. "Incremental clustering for dynamic information processing." ACM Transactions on Information Systems, Vol. 11, No. 2 (April 1993).
Can, F., Fox, E. A., Snavely, C. D., France, R. K. "Incremental clustering for very large document databases: initial MARIAN experience." Information Sciences, Vol. 84 (1995).
Jain, A. K., Murty, M. N., Flynn, P. J. "Data clustering: a review." ACM Computing Surveys, Vol. 31, No. 3 (September 1999).

24 Questions?