A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004.



Motivation and Underlying Concepts
- In a high-dimensional setting, not all dimensions should be considered when clustering.
- The fading cluster structure: a fading function f(t) gives each point a weight that decays with its age t.
- The half-life t0 of a point is defined as the time at which f(t0) = (1/2) f(0).
- A fading cluster structure is defined at time t for a set of d-dimensional points.
- The fading cluster structure has two properties: additivity and temporal multiplicity.
- The clustering process must simultaneously maintain the clusters and the set of dimensions associated with each cluster.
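The two properties named above can be illustrated with a minimal Python sketch. It assumes an exponential fading function f(t) = 2^(-lam * t), whose half-life is 1/lam; the slide only states that a fading function with a half-life is used. The class name `FadingCluster` and the fields `fc1`, `fc2`, `weight`, and `clock` are hypothetical names chosen for illustration, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


def fade(age: float, lam: float) -> float:
    """Assumed fading function f(t) = 2 ** (-lam * t); half-life is 1 / lam."""
    return 2.0 ** (-lam * age)


@dataclass
class FadingCluster:
    """Hypothetical summary of one cluster: faded sums plus a faded count."""
    d: int                      # number of dimensions
    lam: float                  # decay rate of the assumed fading function
    fc1: List[float] = field(default_factory=list)  # faded per-dimension sums
    fc2: List[float] = field(default_factory=list)  # faded per-dimension sums of squares
    weight: float = 0.0         # faded number of points
    clock: float = 0.0          # time of the last update

    def __post_init__(self) -> None:
        self.fc1 = self.fc1 or [0.0] * self.d
        self.fc2 = self.fc2 or [0.0] * self.d

    def decay_to(self, now: float) -> None:
        """Temporal multiplicity: ageing by dt scales every entry by f(dt)."""
        factor = fade(now - self.clock, self.lam)
        self.fc1 = [v * factor for v in self.fc1]
        self.fc2 = [v * factor for v in self.fc2]
        self.weight *= factor
        self.clock = now

    def add(self, point: Sequence[float], now: float) -> None:
        """Age the summary to `now`, then absorb a new point with weight 1."""
        self.decay_to(now)
        for j, x in enumerate(point):
            self.fc1[j] += x
            self.fc2[j] += x * x
        self.weight += 1.0

    def merged(self, other: "FadingCluster") -> "FadingCluster":
        """Additivity: summaries at the same time stamp add entry by entry."""
        if self.clock != other.clock:
            raise ValueError("decay both summaries to the same time first")
        return FadingCluster(
            self.d,
            self.lam,
            [a + b for a, b in zip(self.fc1, other.fc1)],
            [a + b for a, b in zip(self.fc2, other.fc2)],
            self.weight + other.weight,
            self.clock,
        )
```

Because every entry is scaled by the same factor, a summary never needs to revisit old points: it is aged lazily whenever a new point arrives.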

HPStream: High-Dimensional Projected Stream Clustering Method

HPStream Algorithm: Brief Explanation
- Set the parameters.
- Normalize the data.
- Perform the initial clustering using k-means and InitNumber.
- ComputeDimensions: this procedure chooses the dimensions so that the spread along the chosen dimensions is as small as possible.
- FindProjectedDist: this procedure determines the cluster closest to the incoming data point.
- FindLimitingRadius: this procedure determines the limiting radius.
- Finally, decide which cluster to add or delete.
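The dimension-selection and closest-cluster steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that the spread of each cluster along each dimension is given as a radius, that the k * l (cluster, dimension) pairs with the smallest radii are kept, and that the projected distance is the average absolute difference over a cluster's chosen dimensions. The function names and signatures are hypothetical.

```python
from typing import List, Sequence, Set


def compute_dimensions(radii: Sequence[Sequence[float]], avg_dims: int) -> List[Set[int]]:
    """Keep the len(radii) * avg_dims (cluster, dimension) pairs with the least spread.

    radii[c][j] is the assumed spread of cluster c along dimension j.
    """
    pairs = sorted(
        (r, c, j)
        for c, row in enumerate(radii)
        for j, r in enumerate(row)
    )
    chosen: List[Set[int]] = [set() for _ in radii]
    for _, c, j in pairs[: len(radii) * avg_dims]:
        chosen[c].add(j)
    return chosen


def projected_distance(point: Sequence[float], centroid: Sequence[float], dims: Set[int]) -> float:
    """Average absolute difference over the chosen dimensions (an assumed metric)."""
    if not dims:
        return float("inf")  # a cluster with no chosen dimensions cannot attract points
    return sum(abs(point[j] - centroid[j]) for j in dims) / len(dims)


def closest_cluster(
    point: Sequence[float],
    centroids: Sequence[Sequence[float]],
    dims_per_cluster: Sequence[Set[int]],
) -> int:
    """Index of the cluster with the smallest projected distance to the point."""
    return min(
        range(len(centroids)),
        key=lambda c: projected_distance(point, centroids[c], dims_per_cluster[c]),
    )
```

Selecting the pairs globally means that clusters may end up with different numbers of dimensions, and only the average per cluster equals `avg_dims`.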

Experimental Setup
- HPStream was compared with CluStream; both were implemented in MS VC++.
- Data: one synthetic data set and two real-world data sets (Network Intrusion and Forest Cover Type).
- Comparison criteria for judging the two algorithms:
  - accuracy: clustering quality
  - efficiency: stream processing rate
  - sensitivity: varying decay rate, l, and radius threshold
  - scalability: varying number of dimensions and clusters
- Parameters were initialized as follows: decay rate = 0.5, spread radius factor = 2, InitNumber = 2000, average projected dimensionality l > d/2.

Comparing accuracy: using clustering quality and cluster purity
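As a rough illustration of the purity measure, the Python sketch below assumes the common definition: for each cluster, take the fraction of its points that carry the cluster's dominant class label, then average over clusters. The slide does not spell out the formula, so this definition and the function name `cluster_purity` are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def cluster_purity(assignments: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Average, over clusters, of the share of points with the dominant label."""
    by_cluster: Dict[Hashable, Counter] = {}
    for cluster_id, label in zip(assignments, labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    if not by_cluster:
        raise ValueError("no points given")
    fractions = [
        max(counts.values()) / sum(counts.values())
        for counts in by_cluster.values()
    ]
    return sum(fractions) / len(fractions)
```

For example, with assignments `[0, 0, 0, 1, 1]` and labels `["a", "a", "b", "b", "b"]`, cluster 0 has purity 2/3 and cluster 1 has purity 1, so the average is 5/6.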

Accuracy comparison continued:

Efficiency comparison using stream processing rate:

Sensitivity: varying l

Sensitivity: varying radius threshold and decay rate

Scalability: varying dimensionality and number of clusters