Beyond Streams and Graphs: Dynamic Tensor Analysis
Jimeng Sun, Dacheng Tao, Christos Faloutsos
Speaker: Jimeng Sun
Motivation
Goal: incremental pattern discovery on streaming applications
Streams:
E1: Environmental sensor networks
E2: Cluster/data center monitoring
Graphs:
E3: Social network analysis
Tensors:
E4: Network forensics
E5: Financial auditing
E6: fMRI brain image analysis
How do we summarize streaming data efficiently and incrementally?
Statement: Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.
E3: Social network analysis
Traditional work focuses on static networks and finds community structures.
We monitor how the community structure changes over time and identify abnormal individuals.
E4: Network forensics
Directional network flows
A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]: 450 GB/hour with compression
Task: identify abnormal traffic patterns and find their causes
[Figure: source × destination traffic matrices for normal vs. abnormal traffic]
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
Static Data model
For a given timestamp, the stream measurements can be modeled as a tensor.
Dimension: a single stream, e.g., <Christos, "graph">
Mode: a group of dimensions of the same kind, e.g., source, destination, port
[Figure: a source × destination matrix at time = 0]
Static Data model (cont.)
Tensor: formally, X ∈ R^(N1 × N2 × … × NM), a generalization of matrices, represented as a multi-array (data cube).

Order | Correspondence
1st | vector
2nd | matrix
3rd | 3D array
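To make the order/correspondence table concrete, here is a minimal numpy sketch (illustrative only; not from the original slides):

```python
import numpy as np

# Order-1 tensor: a vector, e.g., one value per sensor
x = np.random.rand(5)          # shape (5,)

# Order-2 tensor: a matrix, e.g., author x keyword counts
A = np.random.rand(4, 6)       # shape (4, 6)

# Order-3 tensor: a 3D array, e.g., source x destination x port
T = np.random.rand(3, 3, 2)    # shape (3, 3, 2)

print(x.ndim, A.ndim, T.ndim)  # 1 2 3 -- the order of each tensor
```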
Dynamic Data model (our focus)
Streams come with structure: (time, source, destination, port), (time, author, keyword)
[Figure: a sequence of source × destination matrices arriving over time]
Dynamic Data model (cont.)
Tensor streams: a sequence of Mth-order tensors X_1 … X_n, where n increases over time.

Order | Correspondence
1st | multiple streams
2nd | time-evolving graphs
3rd | 3D arrays

[Examples shown as figures: e.g., author × keyword matrices arriving over time]
Dynamic tensor analysis
[Figure: old tensors and a new tensor (source × destination) summarized by projection matrices U_Source and U_Destination plus the old core tensors]
Roadmap
Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion
Background – Singular value decomposition (SVD)
A ≈ U Σ V^T, where A is m × n, U is m × k, and V^T is k × n: the best rank-k approximation in L2.
PCA is an important application of SVD.
[Figure: the SVD factorization, with Y = U^T A as the projected data]
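A minimal numpy sketch of the best rank-k approximation (via the Eckart–Young theorem); the matrix and k below are made up for illustration:

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A in the L2/Frobenius sense."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(8, 6)
A2 = rank_k_approx(A, 2)
print(np.linalg.norm(A - A2))  # reconstruction error of the rank-2 approximation
```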
Latent semantic indexing (LSI)
Singular vectors are useful for clustering and correlation detection.
[Figure: a document × term matrix factored into document-concept and concept-term matrices; a DB concept (query, cache) and a DM concept (frequent pattern, cluster) emerge as concept associations]
Tensor Operation: Matricize X_(d)
Unfold a tensor into a matrix along mode d.
[Figure: a small 3rd-order tensor and its unfolding; slide courtesy of Tammy Kolda]
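A minimal numpy sketch of mode-d unfolding; note that the exact column ordering is a convention (this one differs from Kolda's Fortran-order layout but is equivalent up to a column permutation):

```python
import numpy as np

def matricize(X, d):
    """Mode-d unfolding X_(d): mode-d fibers of X become the columns."""
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

X = np.arange(24).reshape(2, 3, 4)  # a small 3rd-order tensor
print(matricize(X, 1).shape)        # (3, 8): size of mode d by product of the rest
```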
Tensor Operation: Mode-product
Multiply a tensor by a matrix along one mode.
[Figure: a source × destination × port tensor multiplied along the source mode by a source × group matrix, producing a group × destination × port tensor]
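A minimal numpy sketch of the mode-d product, implemented through the unfolding from the previous slide (shapes are illustrative):

```python
import numpy as np

def mode_product(X, U, d):
    """Mode-d product X x_d U: multiply U along mode d of X.
    Mode d changes size from X.shape[d] to U.shape[0]."""
    Xd = np.moveaxis(X, d, 0).reshape(X.shape[d], -1)    # mode-d unfolding
    new_shape = (U.shape[0],) + X.shape[:d] + X.shape[d+1:]
    return np.moveaxis((U @ Xd).reshape(new_shape), 0, d)

X = np.random.rand(4, 5, 6)         # e.g., source x destination x port
G = np.random.rand(2, 4)            # map 4 sources onto 2 groups
print(mode_product(X, G, 0).shape)  # (2, 5, 6): sources grouped
```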
Related work
Low-rank approximation
PCA, SVD: orthogonal projection methods
Multilinear analysis
Tensor decompositions: Tucker, PARAFAC, HOSVD
Stream mining
Scan data once to identify patterns
Sampling: [Vitter85], [Gibbons98]
Sketches: [Indyk00], [Cormode03]
Graph mining
Explorative: [Faloutsos04], [Kumar99], [Leskovec05], …
Algorithmic: [Yan05], [Cormode05], …
Our work sits at the intersection of these areas.
Roadmap
Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion
Tensor analysis
Given a sequence of tensors X_1 … X_n, find the projection matrices U_1 … U_M such that the reconstruction error e is minimized (see the reconstruction below).
Note that this is a generalization of PCA when n is a constant.
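The formula itself was lost with the slide graphic; following the formulation in the KDD'06 DTA paper, the minimized error can be written as (a reconstruction, not verbatim from the slide):

```latex
e = \sum_{t=1}^{n} \Big\| \mathcal{X}_t - \mathcal{X}_t \prod_{i=1}^{M} \times_i \big( U_i U_i^{\mathsf{T}} \big) \Big\|_F^2
```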
Why do we care?
Anomaly detection: reconstruction-error driven, at multiple resolutions
Multiway latent semantic indexing (LSI)
[Figure: an author × keyword × time tensor with patterns and queries, highlighting authors such as Michael Stonebreaker and Philip Yu]
1st order DTA – problem
Given x_1 … x_n, where each x_i ∈ R^N, find U ∈ R^(N × R) such that the error e is small.
Note that Y = XU.
[Figure: an n × N stream matrix X (time × sensors, e.g., indoor and outdoor sensors) projected to Y]
1st order DTA
Input: new data vector x ∈ R^N, old variance matrix C ∈ R^(N × N)
Output: new projection matrix U ∈ R^(N × R)
Algorithm:
1. Update the variance matrix: C_new = x^T x + C
2. Diagonalize C_new = U Λ U^T
3. Determine the rank R and return U
[Figure: the old data X and the new row x folded into C_new]
Diagonalization has to be done for every new x!
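A minimal numpy sketch of this update (x is treated as a row vector, so x^T x is the N × N outer product; the stream below is synthetic):

```python
import numpy as np

def dta_update(C, x, R):
    """1st-order DTA step: fold the new vector into the variance matrix,
    then rediagonalize and keep the top-R eigenvectors as U."""
    C_new = C + np.outer(x, x)                # C_new = x^T x + C
    eigvals, eigvecs = np.linalg.eigh(C_new)  # diagonalize C_new = U Lambda U^T
    top = np.argsort(eigvals)[::-1][:R]       # indices of the R largest eigenvalues
    return C_new, eigvecs[:, top]

N, R = 10, 3
C = np.zeros((N, N))
for _ in range(100):                          # every new x triggers a diagonalization
    C, U = dta_update(C, np.random.rand(N), R)
print(U.shape)                                # (10, 3)
```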
1st order STA
Adjust U smoothly as new data arrive, without diagonalization [VLDB05].
For each new point x: project onto the current line, estimate the error, and rotate the line toward the error in proportion to its magnitude.
Concretely, for each new point x and for i = 1, …, k:
y_i := U_i^T x (projection onto U_i)
d_i ← d_i + y_i^2 (energy ∝ i-th eigenvalue)
e_i := x − y_i U_i (error)
U_i ← U_i + (1/d_i) y_i e_i (update estimate)
x ← x − y_i U_i (repeat with the remainder)
[Figure: the current direction U rotated toward a new point in sensor space]
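A minimal numpy sketch of the per-point update above (SPIRIT-style; the initialization and the synthetic stream are assumptions):

```python
import numpy as np

def sta_update(U, energy, x):
    """1st-order STA step: rotate each basis vector toward the residual
    error instead of rediagonalizing. U is N x k; energy holds the d_i."""
    x = x.copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x                  # y_i: projection onto U_i
        energy[i] += y * y               # d_i: energy ~ i-th eigenvalue
        e = x - y * U[:, i]              # e_i: error
        U[:, i] += (y / energy[i]) * e   # rotate toward the error
        x = x - y * U[:, i]              # repeat with the remainder
    return U, energy

N, k = 10, 3
U = np.linalg.qr(np.random.rand(N, k))[0]  # start from an orthonormal basis
energy = np.full(k, 1e-3)                  # small nonzero initial energies
for _ in range(100):
    U, energy = sta_update(U, energy, np.random.rand(N))
```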
Mth order DTA
[Figure: the DTA update applied along each mode of the new tensor; see the sketch below]
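The algorithm graphic did not survive extraction; based on the per-mode operations named on the next slide (diagonalizing each C plus the X_(d)^T X_(d) products), here is a hedged numpy sketch (our unfolding convention puts mode d on the rows, hence X_d X_d^T):

```python
import numpy as np

def matricize(X, d):
    """Mode-d unfolding with mode d on the rows."""
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

def mth_order_dta(X, Cs, ranks):
    """Mth-order DTA step (sketch): per mode d, fold the unfolded tensor
    into the variance matrix C_d, then rediagonalize to refresh U_d."""
    Us = []
    for d in range(X.ndim):
        Xd = matricize(X, d)
        Cs[d] = Cs[d] + Xd @ Xd.T               # update mode-d variance matrix
        eigvals, eigvecs = np.linalg.eigh(Cs[d])
        top = np.argsort(eigvals)[::-1][:ranks[d]]
        Us.append(eigvecs[:, top])              # top-R_d eigenvectors
    return Cs, Us

shape, ranks = (4, 5, 6), (2, 2, 3)
Cs = [np.zeros((n, n)) for n in shape]
for _ in range(10):                             # a stream of synthetic tensors
    Cs, Us = mth_order_dta(np.random.rand(*shape), Cs, ranks)
print([U.shape for U in Us])                    # [(4, 2), (5, 2), (6, 3)]
```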
Mth order DTA – complexity
Storage: O(∏ N_i), i.e., the size of an input tensor at a single timestamp
Computation: Σ N_i^3 (or Σ N_i^2) for diagonalizing the variance matrices C, plus Σ (N_i ∏ N_j) for the matrix multiplications X_(d)^T X_(d)
For low-order tensors (order < 3), diagonalization is the main cost.
For high-order tensors, matrix multiplication is the main cost.
Mth order STA
Run 1st-order STA along each mode (see the sketch below).
Complexity:
Storage: O(∏ N_i)
Computation: Σ (R_i ∏ N_j), which is smaller than DTA's
[Figure: a fiber of the new tensor updates U_1 via the 1st-order STA step: project to y_1, compute error e_1, rotate U_1]
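A hedged numpy sketch: treat every mode-d fiber of the new tensor as one point for the 1st-order STA update (repeated here so the block is self-contained):

```python
import numpy as np

def matricize(X, d):
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

def sta_update(U, energy, x):
    """The 1st-order STA step from the earlier sketch."""
    x = x.copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x
        energy[i] += y * y
        e = x - y * U[:, i]
        U[:, i] += (y / energy[i]) * e
        x = x - y * U[:, i]
    return U, energy

def mth_order_sta(X, Us, energies):
    """Mth-order STA step: run 1st-order STA on every mode-d fiber of X."""
    for d in range(X.ndim):
        for fiber in matricize(X, d).T:       # columns are mode-d fibers
            Us[d], energies[d] = sta_update(Us[d], energies[d], fiber)
    return Us, energies

shape, ranks = (4, 5, 6), (2, 2, 3)
Us = [np.linalg.qr(np.random.rand(n, r))[0] for n, r in zip(shape, ranks)]
energies = [np.full(r, 1e-3) for r in ranks]
Us, energies = mth_order_sta(np.random.rand(*shape), Us, energies)
```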
Roadmap
Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion
Experiment Objectives
Computational efficiency
Accurate approximation
Real applications: anomaly detection, clustering
Data set 1: Network data
TCP flows collected at the CMU backbone; raw data 500 GB with compression
Construct 3rd-order tensors over hourly windows with <source, destination, port>
Each tensor: 500 × 500 × 100; 1200 timestamps (hours)
Sparse data with a power-law value distribution
[Figure: flow-value distribution, 10AM to 11AM on 01/06/2005]
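As a hedged sketch of how such tensors might be built (the record format and bucketing are assumptions, not from the slides): bucket one hour of flow records into a sparse <source, destination, port> count tensor, stored as a coordinate map since 500 × 500 × 100 is mostly zeros:

```python
from collections import defaultdict

def hourly_tensor(flows, n_src=500, n_dst=500, n_port=100):
    """flows: iterable of (src_id, dst_id, port_id) tuples from one hour,
    already mapped to integer bucket ids (the mapping is assumed)."""
    counts = defaultdict(float)
    for s, d, p in flows:
        counts[(s % n_src, d % n_dst, p % n_port)] += 1.0
    return counts                      # sparse coordinate representation

flows = [(17, 42, 80), (17, 42, 80), (3, 9, 22)]
X = hourly_tensor(flows)
print(len(X), X[(17, 42, 80)])         # 2 nonzero entries; count 2.0
```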
Data set 2: Bibliographic data (DBLP)
Papers from VLDB and KDD conferences
Construct 2nd-order tensors over yearly windows with <author, keyword>
Each tensor: 4584 × 3741; 11 timestamps (years)
Computational cost
[Figure: CPU time (sec) for the 3rd-order network tensor and the 2nd-order DBLP tensor]
OTA is the offline tensor analysis.
Performance metric: CPU time (sec)
Observations:
DTA and STA are orders of magnitude faster than OTA.
The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time).
Accuracy comparison
[Figure: reconstruction-error ratio for the 3rd-order network tensor and the 2nd-order DBLP tensor]
Performance metric: the ratio of reconstruction error between DTA/STA and OTA, with the error of OTA fixed at 20%
Observations:
DTA performs very close to OTA on both datasets.
STA performs worse on DBLP due to the bigger changes.
Network anomaly detection
[Figure: reconstruction error over time, with normal and abnormal traffic marked]
Reconstruction error gives an indication of anomalies.
The prominent difference between the normal and abnormal periods is mainly due to unusual scanning activity (confirmed by the campus admin).
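A minimal sketch of the error-driven detection, reusing the mode product from earlier: project each incoming tensor onto the current subspaces and back, and flag timestamps whose relative error spikes (the threshold and data are synthetic):

```python
import numpy as np

def mode_product(X, U, d):
    Xd = np.moveaxis(X, d, 0).reshape(X.shape[d], -1)
    new_shape = (U.shape[0],) + X.shape[:d] + X.shape[d+1:]
    return np.moveaxis((U @ Xd).reshape(new_shape), 0, d)

def reconstruction_error(X, Us):
    """Relative error after projecting X onto the subspaces and back."""
    Xhat = X
    for d, U in enumerate(Us):
        Xhat = mode_product(Xhat, U @ U.T, d)   # X x_d (U_d U_d^T)
    return np.linalg.norm(X - Xhat) / np.linalg.norm(X)

X = np.random.rand(4, 5, 6)
Us = [np.linalg.qr(np.random.rand(n, 2))[0] for n in X.shape]
err = reconstruction_error(X, Us)
print(err, err > 0.9)                           # flag if above a chosen threshold
```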
Multiway LSI

Authors | Keywords | Year
michael carey, michael stonebreaker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995
surajit chaudhuri, mitch cherniack, michael stonebreaker, ugur etintemel | distribut, systems, view, storage, servic, process, cache | 2004
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | –

The first two rows form the DB group; the third forms the DM group.
Two groups are correctly identified: databases and data mining.
People and concepts drift over time.
Conclusion
Tensor streams are a general data model.
DTA/STA incrementally decompose tensors into core tensors and projection matrices.
The results of DTA/STA can be used in other applications:
Anomaly detection
Multiway LSI
Final word: Think structurally!
The world is not flat; neither should data mining be.
Contact: Jimeng Sun