Continuous Data Stream Processing
- Music Virtual Channel – extensions
- Data Stream Monitoring – tree pattern mining
- Continuous Query Processing – sequence queries
Date: 2005/10/21
Post-Excellence Project Subproject 6
Continuous Data Stream Management
Music Virtual Channel Extensions
[Architecture diagram. Components shown: music collections, music metadata, Internet, V.C. player, interface, clustering engine, filtering engine, XML filtering engine, music channel simulator, profile monitor, cluster monitor, channel monitor, favorite channel, cluster coordinator, peer search engine, profile database, MusicXML database; instances are numbered 1, 2, …, N]
An Extension on Virtual Channel
- After a player starts a range (or kNN) search:
  - It updates its profile periodically
  - The search results are continuously maintained
[Figure: range and kNN searches issued by a V.C. player (query) over the other V.C. players (peers)]
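The continuous search described on this slide can be illustrated with a minimal sketch. It assumes that a profile is a numeric feature vector and that similarity is Euclidean distance; the class and method names (`ContinuousKnn`, `update_profile`, `neighbours`) are hypothetical, and the brute-force scan stands in for whatever index the peer search engine would actually use.

```python
import math


class ContinuousKnn:
    """Naive continuous kNN over player profiles (illustrative sketch only)."""

    def __init__(self, k):
        self.k = k
        self.profiles = {}  # player id -> feature vector (assumed representation)

    def update_profile(self, player, vector):
        """Called periodically by each player with its latest profile."""
        self.profiles[player] = tuple(vector)

    def neighbours(self, player):
        """Return the k players whose profiles are currently closest to `player`."""
        query = self.profiles[player]
        ranked = sorted(
            (math.dist(query, vector), other)
            for other, vector in self.profiles.items()
            if other != player
        )
        return [other for _, other in ranked[: self.k]]


engine = ContinuousKnn(k=1)
engine.update_profile("alice", [0.0, 0.0])
engine.update_profile("bob", [1.0, 0.0])
engine.update_profile("carol", [5.0, 5.0])
print(engine.neighbours("alice"))  # ['bob']

# A periodic profile update changes the maintained result.
engine.update_profile("carol", [0.1, 0.0])
print(engine.neighbours("alice"))  # ['carol']
```

Every player is both a query and a data object here, so each update can change many result sets; rescanning all profiles on every request is what an index is meant to avoid.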
An Extension on Virtual Channel
- Compared with the clustering engine
  - A flexible definition of “clusters”
  - Update is more natural than insertion/deletion
  - No need for parameter setting and re-clustering
  - Indexing can relieve the pain of frequent updates
- Compared with the problem of moving objects
  - Movements in a high-dimensional feature space
  - In most cases every object is also a query
  - Prediction of object movement is possible
An Extension on Favorite Channel
- When a music piece is played on a channel, the corresponding MusicXML file can be obtained
- A query can be a portion of MusicXML or an XQuery expression
An Extension on Favorite Channel
- Compared with query segments
  - More musical semantics in a query
  - Does not interfere with the music playback
  - Matching on complex tree structures
  - Common subqueries are still useful
Research Issues
- Peer Search Engine
  - An indexing method to support continuous query processing for high-dimensional moving objects
  - A prediction-based bounding mechanism to reduce the frequency of profile updates
- XML Filtering Engine
  - An online method to enable tree pattern mining over a data stream
  - An indexing mechanism to support XML filtering
Discovering Frequent Tree Patterns over Data Streams
(Submitted for publication)
Problem Definition
- As the query trees stream in, find the subtrees that occur more than θ·N times, where N is the number of trees received so far and 0 ≦ θ ≦ 1
[Figure: query trees T1, T2, T3 enter STMer, which reports the frequent tree patterns]
Problem Definition (Cont.)
- Labeled ordered tree
- Induced subtree
[Figure: a query tree and a tree pattern over the labels A, B, C, D, E. Sibling order matters: B with children D, C differs from B with children C, D]
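An Example
Given θ = 0.6. Below, X(Y, Z) denotes a node X with children Y and Z.
[Figure: the query trees A(B, C), B(D, E) and A(B, F) enter STMer one after another]
- After tree 1, frequent tree patterns (occurrence > 0.6 × 1): A(B, C), A, B, C, A(B), A(C)
- After tree 2, frequent tree patterns (occurrence > 0.6 × 2): B
- After tree 3, frequent tree patterns (occurrence > 0.6 × 3): A, B, A(B)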
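The example can be checked with a brute-force sketch. It assumes that the three query trees are A(B, C), B(D, E) and A(B, F), as the figure labels suggest, and that a pattern is counted once per tree that contains it. The helper names are hypothetical, and exhaustive enumeration is used only for illustration; it is not the mining method of the slides.

```python
from collections import Counter
from itertools import product


def rooted_subtrees(tree):
    """All induced subtrees that keep the root of `tree`.

    A tree is a (label, children) pair with children given as a tuple.
    Each child is either dropped or replaced by one of its own rooted
    subtrees; sibling order is preserved.
    """
    label, children = tree
    options = [[None] + rooted_subtrees(child) for child in children]
    return [
        (label, tuple(c for c in choice if c is not None))
        for choice in product(*options)
    ]


def induced_subtrees(tree):
    """Set of all induced subtrees rooted at any node of `tree`."""
    found = set(rooted_subtrees(tree))
    for child in tree[1]:
        found |= induced_subtrees(child)
    return found


def show(tree):
    """Render a tree as X(Y,Z)."""
    label, children = tree
    if not children:
        return label
    return f"{label}({','.join(show(c) for c in children)})"


def frequent_after_each_tree(trees, theta):
    """Frequent patterns (count > theta * N) after each of the N trees."""
    counts = Counter()
    history = []
    for n, tree in enumerate(trees, start=1):
        counts.update(induced_subtrees(tree))  # one count per containing tree
        history.append(sorted(show(p) for p, c in counts.items() if c > theta * n))
    return history


def leaf(label):
    return (label, ())


T1 = ("A", (leaf("B"), leaf("C")))
T2 = ("B", (leaf("D"), leaf("E")))
T3 = ("A", (leaf("B"), leaf("F")))

for step in frequent_after_each_tree([T1, T2, T3], theta=0.6):
    print(step)
# ['A', 'A(B)', 'A(B,C)', 'A(C)', 'B', 'C']
# ['B']
# ['A', 'A(B)', 'B']
```

The number of induced subtrees grows quickly with tree size, which is why a one-pass method cannot afford to recompute this enumeration over the whole history.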
Main Difficulties
The properties of data streams:
- One pass: traditional tree mining methods fail
- Fast input rate: efficiency is critical
- Incremental: an incremental algorithm is required
- Unbounded: approximate counting is needed
An Overview of Our Method
[Figure: STMer processing an incoming tree T1. Labels shown: subtree generation, subtree maintenance, a candidate pool, requests on demand]
String Representation
- Visiting the nodes of a tree T in DFS order yields a node sequence S of (label, level) pairs
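A minimal sketch of this encoding follows. It assumes that the root sits at level 1 (consistent with buffer strings such as A1B2C2 on the later slides) and that a tree is a (label, children) pair; the function names are hypothetical.

```python
def encode(tree, level=1):
    """Pre-order (DFS) traversal producing (label, level) pairs."""
    label, children = tree
    sequence = [(label, level)]
    for child in children:
        sequence.extend(encode(child, level + 1))
    return sequence


def to_string(sequence):
    """Concatenate the pairs, e.g. [('A', 1), ('B', 2)] -> 'A1B2'."""
    return "".join(f"{label}{level}" for label, level in sequence)


def decode(sequence):
    """Rebuild the tree from a well-formed (label, level) sequence."""
    root = None
    path = []  # path[i] is the most recent node seen at level i + 1
    for label, level in sequence:
        node = (label, [])
        del path[level - 1:]  # climb back up to the parent of the new node
        if path:
            path[-1][1].append(node)
        else:
            root = node
        path.append(node)
    return root


tree = ("A", [("B", [("D", [])]), ("C", [])])
sequence = encode(tree)
print(to_string(sequence))       # A1B2D3C2
print(decode(sequence) == tree)  # True
```

Because the level of each node is stored explicitly, the sequence determines the tree uniquely, so nodes can be sent one at a time over a stream and the tree rebuilt on arrival.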
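Subtree Generation
[Figure: nodes arrive on the data stream as (label, level) pairs]
- t1: (A, 1) arrives; buffer = A1; T_D contains the single node A
- t2: (B, 2) arrives; buffer = A1B2; T_D contains the single node B (B1) and A with child B (A1B2)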
Subtree Generation (Cont.)
[Figure: the data stream continues after t1: (A, 1) and t2: (B, 2)]
- t3: (C, 2) arrives; buffer = A1B2C2; T_D contains the single node C (C1), A with child C (A1C2), and A with children B, C (A1B2C2)
Subtree Generation (Cont.)
[Figure: a larger example over the labels A to F, showing the buffer, T_D, and an APT rooted at Φ whose nodes are (label, level) entries such as A1, B1, B2, C1, C2, D1, D2, D3, E1, E2, E3, E4]
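The generation step for the stream (A, 1), (B, 2), (C, 2) can be reproduced with the sketch below. It assumes that each arriving node triggers the generation of every induced subtree that contains it, with levels renumbered from the root of each subtree. The slides do not spell out the algorithm, so this is only one way to obtain the outputs shown, and the names (`SubtreeGenerator`, `push`) are hypothetical.

```python
from itertools import product


def rooted_patterns(node):
    """All induced subtrees rooted at `node`, as nested (label, children) tuples."""
    label, children = node
    options = [[None] + rooted_patterns(child) for child in children]
    return [
        (label, tuple(c for c in pick if c is not None))
        for pick in product(*options)
    ]


def encode(pattern, level=1):
    """String form of a pattern, with levels counted from the pattern's own root."""
    label, children = pattern
    return f"{label}{level}" + "".join(encode(c, level + 1) for c in children)


class SubtreeGenerator:
    def __init__(self):
        self.path = []  # rightmost path of the tree received so far, root first

    def _with_new_node(self, depth):
        """Induced subtrees rooted at path[depth] that contain the newest node."""
        label, children = self.path[depth]
        if depth == len(self.path) - 1:
            return [(label, ())]  # the newest node has no children yet
        # Older siblings are optional; the child on the rightmost path is mandatory.
        options = [[None] + rooted_patterns(child) for child in children[:-1]]
        options.append(self._with_new_node(depth + 1))
        return [
            (label, tuple(c for c in pick if c is not None))
            for pick in product(*options)
        ]

    def push(self, label, level):
        """Consume one (label, level) pair and return the newly created subtrees."""
        node = (label, [])
        del self.path[level - 1:]  # level 1 starts a new tree
        if self.path:
            self.path[-1][1].append(node)
        self.path.append(node)
        new = []
        for depth in range(len(self.path)):
            new.extend(encode(p) for p in self._with_new_node(depth))
        return sorted(new)


generator = SubtreeGenerator()
print(generator.push("A", 1))  # ['A1']
print(generator.push("B", 2))  # ['A1B2', 'B1']
print(generator.push("C", 2))  # ['A1B2C2', 'A1C2', 'C1']
```

Only subtrees that include the new node are produced at each step, so no subtree is generated twice for the same tree.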
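Subtree Maintenance
[Figure: state after 321 query trees have been received, with buffer A1B2E2. The APT (rooted at Φ, with nodes A1, B1, E1, B2, E2) is shown next to the GPT (rooted at Φ), whose entries are triples such as (A1, 5, 0), (B2, 4, 1), (C3, 2, 1) and (E2, 1, 3); "+1" marks the entries whose counts are incremented]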
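The triples on this slide resemble the (item, count, maximum error) entries of Lossy Counting (Manku and Motwani). The slides do not confirm that this is the scheme used, so the sketch below should be read as an illustration of approximate counting with an error parameter ε, not as the authors' maintenance procedure. The class name and the per-tree batching are assumptions.

```python
import math


class LossyCounter:
    """Lossy Counting over a stream of trees, each given as a set of patterns."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # trees received so far
        self.entries = {}                    # pattern -> [count, max_error]

    def add_tree(self, patterns):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        for pattern in set(patterns):        # count each pattern once per tree
            if pattern in self.entries:
                self.entries[pattern][0] += 1
            else:
                # The pattern may have been missed in all earlier buckets.
                self.entries[pattern] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {
                p: e for p, e in self.entries.items() if e[0] + e[1] > bucket
            }

    def frequent(self, theta):
        """Patterns whose true count may exceed theta * n."""
        threshold = (theta - self.epsilon) * self.n
        return {p for p, (count, _) in self.entries.items() if count >= threshold}


counter = LossyCounter(epsilon=0.5)
for patterns in [{"A", "B"}, {"A"}, {"A", "C"}, {"A"}]:
    counter.add_tree(patterns)
print(counter.entries)        # {'A': [4, 0]}
print(counter.frequent(0.6))  # {'A'}
```

If ε were 0.01, the bucket width would be 100, and a pattern first seen in tree 321 would be stored with a maximum error of 3, which agrees with the entry (E2, 1, 3) on the slide.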
Experiments on Sensitivity
[Plots: sensitivity to the minimum support and to the error parameter]
Experiments on Comparison
[Plots: comparison with StreamT (ICDM ’02)]
Conclusion
- Contribution
  - A novel technique is proposed for efficient subtree generation
  - A compact structure is employed to reduce the memory requirement of the candidate pool
- Current work
  - Mining closed frequent subtrees over data streams
[Figure: example patterns with counts. A with children B and C: 2; A with child B: 5; A with child C: 2; A alone: 5]