
1 Stream Clustering CSE 902

2 Big Data

3 Stream analysis
Stream: Continuous flow of data
Challenges
◦ Volume: Not possible to store all the data
◦ One-time access: Not possible to process the data using multiple passes
◦ Real-time analysis: Certain applications need real-time analysis of the data
◦ Temporal locality: Data evolves over time, so the model should be adaptive

4 Stream Clustering
[Figure: topic clusters and article listings]

5 Stream Clustering
Online phase: Summarize the data into memory-efficient data structures
Offline phase: Use a clustering algorithm to find the data partition
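
The two phases can be sketched as follows. This is a minimal illustration, not one of the algorithms named in the slides: the online phase here keeps weighted prototypes, and the offline phase runs a plain weighted k-means over them. The names `online_update` and `offline_kmeans`, the `radius` threshold, and the toy stream are all assumptions made for the example.

```python
import math

def online_update(summaries, point, radius):
    """Online phase: absorb `point` into the nearest summary or open a new one.

    Each summary is [weight, centroid]; `radius` is a hypothetical threshold.
    """
    best, best_dist = None, float("inf")
    for s in summaries:
        d = math.dist(s[1], point)
        if d < best_dist:
            best, best_dist = s, d
    if best is not None and best_dist <= radius:
        w = best[0]
        # Incremental mean update: only O(1) state is kept per summary.
        best[1] = [(w * c + x) / (w + 1) for c, x in zip(best[1], point)]
        best[0] = w + 1
    else:
        summaries.append([1, list(point)])

def offline_kmeans(summaries, k, iters=10):
    """Offline phase: weighted k-means over the summaries, not the raw stream."""
    centers = [list(s[1]) for s in summaries[:k]]  # naive deterministic init
    for _ in range(iters):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        weights = [0.0] * k
        for w, c in summaries:
            j = min(range(k), key=lambda i: math.dist(centers[i], c))
            weights[j] += w
            sums[j] = [s + w * x for s, x in zip(sums[j], c)]
        centers = [[s / weights[j] for s in sums[j]] if weights[j] else centers[j]
                   for j in range(k)]
    return centers

stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1), (5.0, 5.1)]
summaries = []
for p in stream:  # single pass: each point is seen once and then dropped
    online_update(summaries, p, radius=1.0)
centers = offline_kmeans(summaries, k=2)
```

The offline step touches only the summaries, so its cost depends on the number of summaries rather than on the length of the stream.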

6 Stream Clustering Algorithms
Data Structures      Examples
Prototypes           Stream, Stream LSearch
CF-Trees             Scalable k-means, Single pass k-means
Microcluster Trees   ClusTree, DenStream, HP-Stream
Grids                D-Stream, ODAC
Coreset Tree         StreamKM++

7 Prototypes Stream, LSearch

8 CF-Trees
Summarize the data in each CF-vector
◦ Linear sum of data points
◦ Squared sum of data points
◦ Number of points
Scalable k-means, Single pass k-means
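
A CF-vector with these three components might look like the sketch below. It assumes the squared sum is stored as a single scalar (the sum of squared norms); some formulations keep it per dimension instead. The class name `CFVector` and its method names are illustrative only.

```python
import math

class CFVector:
    """Clustering feature (N, LS, SS) for a set of d-dimensional points."""

    def __init__(self, dim):
        self.n = 0              # number of points
        self.ls = [0.0] * dim   # linear sum, per dimension
        self.ss = 0.0           # sum of squared norms (scalar form, an assumption)

    def add(self, x):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        # CF-vectors are additive, so two summaries merge without the raw points.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # Root-mean-square distance to the centroid: sqrt(SS/N - ||LS/N||^2).
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

a = CFVector(2)
a.add((0.0, 0.0))
b = CFVector(2)
b.add((2.0, 0.0))
a.merge(b)
```

Because the centroid and radius are recovered from the three sums alone, the points themselves never need to be stored.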

9 Microclusters
CF-Trees with “time” element
CluStream
◦ Linear sum and squared sum of timestamps
◦ Delete old microclusters, or merge microclusters whose timestamps are close to each other
Sliding Window Clustering
◦ Timestamp of the most recent data point added to the vector
◦ Maintain only the most recent T microclusters
DenStream
◦ Microclusters are associated with weights based on recency
◦ Outliers detected by creating separate microclusters

10 Microclusters
CF-Trees with “time” element
DenStream
◦ Microclusters are associated with weights based on recency
◦ Outliers detected by creating separate microclusters
ClusTree
◦ Allows real-time clustering
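
Recency weighting can be illustrated with an exponentially fading microcluster. The sketch below assumes a fading function of the form 2^(-lam * age), in the style of damped-window models; the class `DecayedMicroCluster` and the parameter `lam` are hypothetical names, and the outlier handling mentioned in the slide is not shown.

```python
class DecayedMicroCluster:
    """Microcluster whose weight and linear sum fade as 2 ** (-lam * age)."""

    def __init__(self, dim, lam):
        self.lam = lam          # assumed decay rate; larger forgets faster
        self.weight = 0.0
        self.ls = [0.0] * dim   # decayed linear sum
        self.last_t = 0.0

    def _fade(self, t):
        # Bring the stored sums forward to time t before using them.
        factor = 2.0 ** (-self.lam * (t - self.last_t))
        self.weight *= factor
        self.ls = [v * factor for v in self.ls]
        self.last_t = t

    def add(self, x, t):
        self._fade(t)
        self.weight += 1.0
        self.ls = [a + b for a, b in zip(self.ls, x)]

    def weight_at(self, t):
        # Weight as it would be at time t, without modifying the state.
        return self.weight * 2.0 ** (-self.lam * (t - self.last_t))

    def center(self):
        return [v / self.weight for v in self.ls]

mc = DecayedMicroCluster(dim=1, lam=1.0)
mc.add([0.0], t=0.0)
mc.add([3.0], t=1.0)   # the older point now counts half as much
```

With `lam=1.0` the point from time 0 has weight 0.5 at time 1, so the center sits at 2.0 rather than at the unweighted mean 1.5.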

11 Grids
D-Stream
◦ Assign the data to grids
◦ Grids weighted by recency of the points added to them
◦ Each grid associated with a label
DGClust
◦ Distributed clustering of sensor data
◦ Sensors maintain local copies of the grid and communicate grid updates to a central site
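
A grid summary with recency weighting could be sketched as below. This is a simplified illustration of the idea and not the D-Stream or DGClust algorithm: cell labelling and distributed updates are omitted, and `DecayedGrid`, `width`, `decay`, and `threshold` are assumed names and parameters.

```python
import math

def cell_of(point, width):
    """Map a point to the integer index of the grid cell that contains it."""
    return tuple(math.floor(v / width) for v in point)

class DecayedGrid:
    """Grid whose cell densities decay by `decay` per time unit (0 < decay < 1)."""

    def __init__(self, width, decay):
        self.width = width
        self.decay = decay
        self.cells = {}  # cell index -> (density, time of last update)

    def add(self, point, t):
        key = cell_of(point, self.width)
        density, last_t = self.cells.get(key, (0.0, t))
        # Fade the old density to time t, then count the new point.
        self.cells[key] = (density * self.decay ** (t - last_t) + 1.0, t)

    def dense_cells(self, t, threshold):
        """Cells whose decayed density at time t reaches the threshold."""
        return sorted(k for k, (d, lt) in self.cells.items()
                      if d * self.decay ** (t - lt) >= threshold)

grid = DecayedGrid(width=1.0, decay=0.5)
grid.add((0.2, 0.3), t=0)
grid.add((0.7, 0.9), t=1)
grid.add((3.5, 3.5), t=1)
dense = grid.dense_cells(t=1, threshold=1.2)
```

Only occupied cells are stored, so memory grows with the number of non-empty cells rather than with the stream length.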

12 StreamKM++ (Coresets) StreamKM++: A Clustering Algorithm for Data Streams, Ackermann, Journal of Experimental Algorithmics 2012

13 Kernel-based Clustering

14 Kernel-based Stream Clustering
Use non-linear distance measures to define similarity between data points in the stream
Challenges
◦ Quadratic running time complexity
◦ Computationally expensive to compute centers using linear sums and squared sums (CF-vector approach will not work)
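
To see why the CF-vector approach breaks down, consider the squared distance from a point to a cluster center in the kernel-induced feature space. The center is never formed explicitly, so the distance is expanded into kernel evaluations against every cluster member. The sketch below assumes an RBF kernel with a hypothetical `gamma` parameter; the function names are illustrative.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel; the kernel choice and gamma are assumptions for this example."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def linear(x, y):
    """Linear kernel, for which the result reduces to Euclidean distance."""
    return sum(a * b for a, b in zip(x, y))

def kernel_dist_sq(x, cluster, kernel=rbf):
    """Squared feature-space distance from x to the mean of `cluster`.

    ||phi(x) - mu||^2 = k(x, x) - (2/m) sum_j k(x, x_j) + (1/m^2) sum_{j,l} k(x_j, x_l)
    """
    m = len(cluster)
    cross = sum(kernel(x, p) for p in cluster)
    # The within-cluster term needs O(m^2) kernel evaluations on the raw points.
    within = sum(kernel(p, q) for p in cluster for q in cluster)
    return kernel(x, x) - 2.0 * cross / m + within / (m * m)

cluster = [(0.0, 0.0), (2.0, 0.0)]
d_lin = kernel_dist_sq((1.0, 1.0), cluster, kernel=linear)
d_rbf = kernel_dist_sq((1.0, 1.0), cluster)
```

With the linear kernel the result is the ordinary squared distance to the mean (1, 0). With a non-linear kernel no fixed-size sum of the points stands in for the center, which is where the quadratic cost comes from.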

15 Stream Kernel k-means (sKKM)
◦ Kernel k-means
◦ Weighted Kernel k-means
◦ History from only the preceding data chunk retained
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012

16 Statistical Leverage Scores Measures the influence of a point in the low-rank approximation
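
One common definition, assumed here because the slide does not give a formula, takes the rank-r leverage score of point i to be the squared norm of row i of the matrix holding the top r eigenvectors of the kernel matrix. The sketch below uses NumPy; the function name `leverage_scores`, the RBF kernel, and the random toy data are all illustrative.

```python
import numpy as np

def leverage_scores(K, rank):
    """Rank-`rank` leverage scores of a symmetric kernel matrix K."""
    # eigh returns eigenvalues in ascending order, so the last columns
    # correspond to the largest eigenvalues.
    _, vecs = np.linalg.eigh(K)
    top = vecs[:, -rank:]
    return np.sum(top ** 2, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)            # RBF kernel matrix with unit bandwidth
scores = leverage_scores(K, rank=2)
```

Each score lies between 0 and 1 and the scores sum to the chosen rank, so after dividing by the rank they can be read as a probability distribution over the points.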

17 Statistical Leverage Scores

18

19 Approximate Stream kernel k-means
◦ Uses statistical leverage scores to determine which data points in the stream are potentially “important”
◦ Retain the important points and discard the rest
◦ Use an approximate version of kernel k-means to obtain the clusters (linear time complexity)
◦ Bounded amount of memory

20 Approximate Stream kernel k-means

21 Importance Sampling

22 Clustering Kernel k-means “Approximate” Kernel k-means

23 Clustering “Approximate” Kernel k-means

24 Updating eigenvectors
◦ Only the eigenvectors and eigenvalues of the kernel matrix are required for both sampling and clustering
◦ Update the eigenvectors and eigenvalues incrementally

25 Approximate Stream Kernel k-means

26 Network Traffic Monitoring
◦ Clustering used to detect intrusions in the network
◦ Network Intrusion data set
◦ TCP dump data from seven weeks of LAN traffic
◦ 10 classes: 9 types of intrusions, 1 class of legitimate traffic

Algorithm                           Running time (ms per data point)   Cluster accuracy (NMI)
Approximate stream kernel k-means   6.6                                14.2
StreamKM++                          0.8                                7.0
sKKM                                42.1                               13.3

Around 200 points clustered per second

27 Summary
◦ Efficient kernel-based stream clustering algorithm with linear running time complexity
◦ Memory required is bounded
◦ Real-time clustering is possible
◦ Limitation: does not account for data evolution

