

Stream Clustering CSE 902

Big Data

Stream analysis
Stream: Continuous flow of data
Challenges
◦ Volume: Not possible to store all the data
◦ One-time access: Not possible to process the data using multiple passes
◦ Real-time analysis: Certain applications need real-time analysis of the data
◦ Temporal locality: Data evolves over time, so the model should be adaptive

Stream Clustering Topic cluster Article Listings

Stream Clustering
◦ Online phase: Summarize the data into memory-efficient data structures
◦ Offline phase: Use a clustering algorithm to find the data partition

Stream Clustering Algorithms
Data structure: Examples
◦ Prototypes: Stream, Stream LSearch
◦ CF-Trees: Scalable k-means, Single pass k-means
◦ Microcluster Trees: ClusTree, DenStream, HP-Stream
◦ Grids: D-Stream, ODAC
◦ Coreset Tree: StreamKM++

Prototypes Stream, LSearch

CF-Trees
Summarize the data in each CF-vector
◦ Linear sum of data points
◦ Squared sum of data points
◦ Number of points
Examples: Scalable k-means, Single pass k-means
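The three quantities above are enough to recover a cluster's centroid and spread without storing the points, and two CF-vectors can be merged by adding their fields. A minimal sketch, assuming the squared sum is kept as a single scalar (the sum of squared norms); the class and method names are illustrative and do not come from the slides:

```python
import math


class CFVector:
    """Clustering feature: number of points, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # linear sum of the points, per dimension
        self.ss = 0.0          # sum of squared norms (assumed scalar form)

    def add(self, x):
        # Absorb one point in O(dim) time; the point itself is not stored.
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        # CF-vectors are additive, so merging is a field-wise sum.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # Root-mean-square distance to the centroid: sqrt(SS/N - ||LS/N||^2).
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))


left, right = CFVector(2), CFVector(2)
for p in [(0.0, 0.0), (2.0, 0.0)]:
    left.add(p)
for p in [(0.0, 2.0), (2.0, 2.0)]:
    right.add(p)
left.merge(right)  # left now summarizes all four points
```

After the merge, `left.centroid()` and `left.radius()` describe all four points even though none of them is kept in memory.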

Microclusters
CF-Trees with “time” element
◦ CluStream
  ◦ Linear sum and square sum of timestamps
  ◦ Delete old microclusters, or merge microclusters if their timestamps are close to each other
◦ Sliding Window Clustering
  ◦ Timestamp of the most recent data point added to the vector
  ◦ Maintain only the most recent T microclusters
◦ DenStream
  ◦ Microclusters are associated with weights based on recency
  ◦ Outliers detected by creating separate microclusters
◦ ClusTree
  ◦ Allows real-time clustering
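Recency weighting can be illustrated with an exponentially fading microcluster. This is a hedged sketch in the spirit of DenStream, assuming a fading factor of 2^(-decay * elapsed time); the class name, the `decay` parameter, and the time units are illustrative assumptions, not details from the slides:

```python
class DecayingMicroCluster:
    """Microcluster whose weight and linear sum fade with elapsed time."""

    def __init__(self, point, t, decay=0.5):
        self.weight = 1.0
        self.ls = list(point)   # weighted linear sum of absorbed points
        self.last_update = t
        self.decay = decay      # assumed fading rate (larger = forgets faster)

    def _fade(self, t):
        # Scale the summary by 2^(-decay * elapsed time).
        factor = 2.0 ** (-self.decay * (t - self.last_update))
        self.weight *= factor
        self.ls = [v * factor for v in self.ls]
        self.last_update = t

    def insert(self, point, t):
        # Fade the old content first, then add the new point with weight 1.
        self._fade(t)
        self.weight += 1.0
        self.ls = [a + b for a, b in zip(self.ls, point)]

    def weight_at(self, t):
        # Weight the microcluster would have at time t with no new points.
        return self.weight * 2.0 ** (-self.decay * (t - self.last_update))

    def center(self):
        return [v / self.weight for v in self.ls]


mc = DecayingMicroCluster((0.0, 0.0), t=0, decay=1.0)
mc.insert((3.0, 0.0), t=1)
```

The center is pulled toward the newer point because the older point's contribution has already been halved. A microcluster whose `weight_at(t)` falls below some chosen threshold could be treated as outdated or as an outlier candidate.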

Grids
◦ D-Stream
  ◦ Assign the data to grids
  ◦ Grids weighted by recency of the points added to them
  ◦ Each grid associated with a label
◦ DGClust
  ◦ Distributed clustering of sensor data
  ◦ Sensors maintain local copies of the grid and communicate updates to a central site
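A grid summary with recency weighting can be sketched as a dictionary of cell densities that decay between updates. The following is a simplified illustration rather than the D-Stream algorithm itself; `cell_size`, `decay`, and the class name are assumptions made for the example:

```python
import math


class DecayingGrid:
    """Maps each point to a grid cell and keeps a time-decayed density."""

    def __init__(self, cell_size, decay=0.9):
        self.cell_size = cell_size
        self.decay = decay   # assumed per-time-unit decay factor in (0, 1)
        self.cells = {}      # cell index -> (density, time of last update)

    def cell_of(self, point):
        # Integer cell coordinates of the point.
        return tuple(math.floor(v / self.cell_size) for v in point)

    def add(self, point, t):
        key = self.cell_of(point)
        density, last = self.cells.get(key, (0.0, t))
        # Decay the stored density to time t, then count the new point.
        density = density * self.decay ** (t - last) + 1.0
        self.cells[key] = (density, t)

    def density(self, key, t):
        density, last = self.cells.get(key, (0.0, t))
        return density * self.decay ** (t - last)


grid = DecayingGrid(cell_size=1.0, decay=0.5)
grid.add((0.2, 0.7), t=0)
grid.add((0.9, 0.1), t=1)   # same cell as the first point
grid.add((3.5, 3.5), t=1)   # a different, sparser cell
```

Cells can then be labeled (for example as dense or sparse) by comparing `grid.density(key, t)` with a threshold, and neighboring dense cells grouped into clusters in the offline phase.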

StreamKM++ (Coresets) StreamKM++: A Clustering Algorithm for Data Streams, Ackermann, Journal of Experimental Algorithmics 2012

Kernel-based Clustering

Kernel-based Stream Clustering
◦ Use non-linear distance measures to define similarity between data points in the stream
◦ Challenges
  ◦ Quadratic running time complexity
  ◦ Computationally expensive to compute centers using linear sums and squared sums (CF-vector approach will not work)
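The reason CF-vectors do not carry over can be seen from the standard kernel k-means distance. With a kernel function $\kappa$, an implicit feature map $\phi$, and a cluster $C_c$ (notation assumed here, not taken from the slides), the center lives in feature space and is never formed explicitly:

```latex
\[
\mu_c = \frac{1}{|C_c|} \sum_{x_j \in C_c} \phi(x_j),
\qquad
\|\phi(x) - \mu_c\|^2
  = \kappa(x, x)
  - \frac{2}{|C_c|} \sum_{x_j \in C_c} \kappa(x, x_j)
  + \frac{1}{|C_c|^2} \sum_{x_j \in C_c} \sum_{x_l \in C_c} \kappa(x_j, x_l)
\]
```

Every term is a kernel evaluation against stored points, and the last term ranges over all pairs in the cluster, which is where the quadratic cost comes from.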

Stream Kernel k-means (sKKM)
◦ Kernel k-means
◦ Weighted Kernel k-means
◦ History from only the preceding data chunk retained
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012

Statistical Leverage Scores Measures the influence of a point in the low-rank approximation
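One common definition, given here as an assumption since the slide's own formula is not preserved in the transcript: take the rank-$k$ eigendecomposition of the $n \times n$ kernel matrix $K$, with the top $k$ orthonormal eigenvectors as the columns of $U_k$. The leverage score of point $i$ is the squared norm of the $i$-th row of $U_k$:

```latex
\[
K \approx U_k \Lambda_k U_k^{\top},
\qquad
\ell_i = \sum_{j=1}^{k} (U_k)_{ij}^{2},
\qquad
\sum_{i=1}^{n} \ell_i = k,
\qquad
p_i = \frac{\ell_i}{k}
\]
```

Because the scores sum to $k$, the values $p_i$ form a probability distribution that can be used to sample points in proportion to their influence.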


Approximate Stream kernel k-means
◦ Uses statistical leverage scores to determine which data points in the stream are potentially “important”
◦ Retain the important points and discard the rest
◦ Use an approximate version of kernel k-means to obtain the clusters: linear time complexity
◦ Bounded amount of memory


Importance Sampling

Clustering Kernel k-means “Approximate” Kernel k-means

Clustering “Approximate” Kernel k-means

Updating eigenvectors
◦ Only the eigenvectors and eigenvalues of the kernel matrix are required for both sampling and clustering
◦ Update the eigenvectors and eigenvalues incrementally


Network Traffic Monitoring
◦ Clustering used to detect intrusions in the network
◦ Network Intrusion data set
  ◦ TCP dump data from seven weeks of LAN traffic
  ◦ 10 classes: 9 types of intrusions, 1 class of legitimate traffic
◦ Results table (values not preserved in the transcript): running time in milliseconds per data point and cluster accuracy (NMI) for approximate stream kernel k-means, StreamKM, and sKKM
◦ Around 200 points clustered per second
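For readers unfamiliar with the accuracy measure, normalized mutual information (NMI) compares cluster assignments with the true class labels and is unaffected by how clusters are numbered. A small sketch, assuming the geometric-mean normalization (other normalizations exist, and the slides do not say which one was used); the function names are illustrative:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, cluster_labels):
    """Mutual information divided by sqrt(H(true) * H(cluster))."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    true_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    # I(U;V) = sum over (a, b) of p(a,b) * log(p(a,b) / (p(a) * p(b)))
    mi = sum(
        (c / n) * math.log(c * n / (true_counts[a] * cluster_counts[b]))
        for (a, b), c in joint.items()
    )
    denom = math.sqrt(entropy(true_labels) * entropy(cluster_labels))
    return mi / denom if denom > 0 else 0.0


perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, labels renamed
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # partition independent of classes
```

Here `perfect` is 1 up to floating-point rounding and `unrelated` is 0, which are the two extremes of the measure.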

Summary
◦ Efficient kernel-based stream clustering algorithm with linear running time complexity
◦ Memory required is bounded
◦ Real-time clustering is possible
◦ Limitation: does not account for data evolution