ADVANCED TOPICS IN DATA MINING CSE 8331 Spring 2010 Part I

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
12/8/2018

Data Mining Outline
EMM
Stream Mining
Text Mining
Bioinformatics Mining

EMM Overview
Time-varying, discrete, first-order Markov model.
Nodes are clusters of real-world states.
Learning continues during the prediction phase.
Learning:
  Transition probabilities between nodes
  Node labels (centroid of cluster)
Nodes are added and removed as data arrives.

MM
A first-order Markov chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process depends solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
  S = {N1, N2, …, Nm},
  A = {Lij | i ∈ {1, 2, …, m}, j ∈ {1, 2, …, m}}, and
  each arc Lij = <Ni, Nj> is labeled with a transition probability Pij = P(Nj | Ni).

EMM Definition
Extensible Markov Model (EMM): at any time t, an EMM consists of a Markov chain (MC) with a designated current node, Nn, and algorithms to modify it. The algorithms include:
  EMMCluster, which defines a technique for matching input data at time t + 1 to existing states in the MC at time t.
  EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.
  EMMDecrement, which removes nodes from the EMM when needed.

EMM Cluster
Find the closest node to the incoming event.
If none is “close”, create a new node.
The cluster label is the centroid of the members of the cluster.
O(n)

EMMSim
Find the closest node to the incoming event.
If none is “close”, create a new node.
The cluster label is the centroid/medoid of the members of the cluster.
Problem:
  Nearest neighbor: O(n)
  BIRCH: O(lg n)
    Requires second phase to recluster initial
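The matching step on this slide can be sketched as a linear nearest-node search with a distance cutoff. This is only an illustration under stated assumptions: the function name `emm_cluster`, the Euclidean distance, the running-mean centroid update, and the threshold value 2.0 are all hypothetical choices, not taken from the slides. The input vectors are borrowed from the EMM Increment slide.

```python
import math

def emm_cluster(nodes, event, threshold):
    """Match `event` to the closest node, or create a node if none is close.

    `nodes` is a list of dicts {"centroid": [...], "count": int} (assumed layout).
    Returns the index of the matched or newly created node.
    """
    best, best_dist = None, math.inf
    for idx, node in enumerate(nodes):           # linear scan: O(n) in the number of nodes
        d = math.dist(node["centroid"], event)   # Euclidean distance (an assumption)
        if d < best_dist:
            best, best_dist = idx, d
    if best is None or best_dist > threshold:    # nothing "close": add a new node
        nodes.append({"centroid": list(event), "count": 1})
        return len(nodes) - 1
    node = nodes[best]
    node["count"] += 1
    # Running mean keeps the label equal to the centroid of all members so far.
    node["centroid"] = [c + (x - c) / node["count"]
                        for c, x in zip(node["centroid"], event)]
    return best

nodes = []
a = emm_cluster(nodes, [18, 10, 3, 3, 1, 0, 0], threshold=2.0)
b = emm_cluster(nodes, [17, 10, 2, 3, 1, 0, 0], threshold=2.0)
c = emm_cluster(nodes, [14, 8, 2, 3, 0, 0, 0], threshold=2.0)
```

With this cutoff the first two vectors share a node and the third opens a new one. A tree-based index such as BIRCH would replace the linear scan.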

EMM Increment
Input vectors, in arrival order:
  <18,10,3,3,1,0,0>
  <17,10,2,3,1,0,0>
  <16,9,2,3,1,0,0>
  <14,8,2,3,1,0,0>
  <14,8,2,3,0,0,0>
  <18,10,3,3,1,1,0>
[Figure: the Markov chain after each input, growing from a single node N1 to nodes N1, N2, N3, with arcs labeled by transition probabilities such as 1/3 and 2/3.]

EMM Forget
[Figure: the EMM before and after a node is removed (nodes N1, N2, N3, N5, N6 before; N1, N3, N5, N6 after), with transition probabilities on the arcs.]

Data Mining Outline
EMM
Stream Mining
  Data Stream Overview
  Data Stream Modeling
  Data Stream Clustering
  TRAC-DS
  Anomaly Detection
Text Mining
Bioinformatics Mining

Motivation
A growing number of applications generate streams of data.
  Computer network monitoring data
  Call detail records in telecommunications (Cisco VoIP 2003)
  Highway transportation traffic data (MnDot 2005)
  Online web purchase log records (JCPenney 2003, Travelocity 2005)
  Sensor network data (Ouse, Serwent 2002)
  Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions
Data mining techniques play a key role in data models in Data Stream Management Systems.

A growing number of applications generate streams of data. Data of this type include computer network monitoring data, highway traffic data, call detail records in the telecom industry, online purchase logs, and data collected by sensor networks. Data stream management systems have become a research area in recent years and a new application area for data mining. Why is that? A data stream has such a high volume that it is not possible to store all of the data as in a traditional database. Instead, the data stream must be modeled, and data mining is suitable for this modeling task. On the slide, the items in parentheses are the datasets available to us. The datasets highlighted in red are those used in this proposal. The Cisco VoIP data is eight weeks of VoIP call logs collected at Cisco. MnDot is provided by the Minnesota Department of Transportation; it is highway traffic data for the Twin Cities area, available from 2000 to the present. Ouse and Serwent are two sets of sensor network data. Ouse is river level data at three locations near York in the UK, and the Serwent data is water flow rate data at 7 locations near Serwent in the UK.

Background
Characteristics of data streams:
  Data are raw
  Records may arrive at a rapid rate
  High volume (possibly infinite) of continuous data
  Concept drift: data distribution changes on the fly
  Multidimensional
  Temporality
Stream processing restrictions:
  Data modeling (synopsis)
  Single pass: each record is examined at most once
  Bounded storage: limited memory for storing the synopsis
  Real-time: per-record processing time must be low

By examining stream data, we can see the following characteristics. The data are raw because only online preprocessing is applicable. Records may arrive at a rapid rate. The volume of data is high, possibly infinite. The data profile may change on the fly; we call this concept change or concept drift. Data can be multidimensional. Temporal dependency may occur in the data series. To process a data stream, a technique must satisfy the following requirements. Data must be modeled, since it is not possible to store all of the data; in the literature the modeled data is called a synopsis of the data stream. Single pass: each record or data point is read at most once, and random access to the data as in relational databases is not possible. The storage space of the synopsis is bounded. Each record must be processed in a soft real-time manner. The system should respond to queries incrementally.

Haixun Wang, Jian Pei, Philip S. Yu, ICDE 2005; Keogh, ICDM’04

From Sensors to Streams
Data captured and sent by a set of sensors is usually referred to as “stream data”.
It is a real-time sequence of encoded signals that contain the desired information.
It is a continuous, ordered sequence of items (ordered implicitly by arrival time, or explicitly by timestamp or by geographic coordinates).
Stream data is infinite: the data keeps coming.

Suppose There Were MANY Sensors
Traditional line graphs would be very difficult to read.
Requirements for a new visualization technique:
  High-level summary of data
  Handle multiple sensors at once
  Continuous
  Temporal
  Spatial

Spatiotemporal Environment
Events arrive in a stream.
At any time, t, we can view the state of the problem as represented by a vector of n numeric values:
  Vt = <S1t, S2t, ..., Snt>

        V1    V2    …    Vq
  S1    S11   S12   …    S1q
  S2    S21   S22   …    S2q
  …
  Sn    Sn1   Sn2   …    Snq
        Time →

Data Stream Management Systems (DSMS)
Software to facilitate querying and managing stream data.
  Retrieve the most recent information from the stream
  Data aggregation facilitates merging multiple streams
  Modeling stream data to “summarize” the stream
  Visualization is needed to observe in real time the spatial and temporal patterns and trends hidden in the data

DSMS Problems
Stream management development is in a state similar to that of databases prior to the 1970s:
  Each system/researcher looks at a specific application or system
  No standards concerning functionality
  No standard query language
It is unreasonable to expect that end users will access raw data, data in the DSMS, or even data at a summarized view.
Domain experts need to “see” a higher level of data.

Data Stream Modeling
Single pass: each record is examined at most once
Bounded storage: limited memory for storing the synopsis
Real-time: per-record processing time must be low
Summarization (synopsis) of data
Use data NOT SAMPLE
Temporal and spatial
Dynamic
Continuous (infinite stream)
Learn
Forget
Sublinear growth rate: clustering

Problem with Markov Chains
The required structure of the MC may not be certain at model construction time.
As the real world being modeled by the MC changes, so should the structure of the MC.
Not scalable: grows linearly with the number of events.
Markov property
Our solution:
  Extensible Markov Model (EMM)
  Cluster real-world events
  Allow the Markov chain to grow and shrink dynamically

EMM Sublinear Growth Rate
Minnesota Department of Transportation (MnDot)

Traditional Clustering

TRAC-DS

Motivation
Temporal ordering is a major feature of stream data.
Many stream applications depend on this ordering:
  Prediction of future values
  Anomaly (rare event) detection
  Concept drift

Stream Clustering Requirements
Dynamic updating of the clusters
Identify outliers
Barbara:
  Compactness
  Fast incremental processing

Stream Clustering Algorithms
LOCALSEARCH
  Partitions the stream into segments
  Clusters each segment individually by solving the k-medians problem
  Iteratively reclusters the resulting centers
CluStream
  Micro-clusters are represented by summary statistics
  Micro-clusters are handled online
  Micro-clusters are merged offline
MONIC
  Evolution of clusters over time
  Cluster transitions over time

TRAC-DS NOTE
TRAC-DS is not:
  Another stream clustering algorithm
TRAC-DS is:
  A new way of looking at clustering
  Built on top of an existing clustering algorithm
TRAC-DS may be used with any stream clustering algorithm.

TRAC-DS Overview

Data Stream Clustering
At each point in time, a data stream clustering ζ is a partitioning of D', the data seen thus far. Instead of the whole partitions C1, C2, ..., Ck, only synopses Cc1, Cc2, ..., Cck are available, and k is allowed to change over time. The summaries Cci, with i = 1, 2, ..., k, typically contain information about the size, distribution, and location of the data points in Ci.
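One common way to keep such a fixed-size summary is to store a count, a per-dimension sum, and a per-dimension sum of squares, in the spirit of the summary statistics used by BIRCH and CluStream. The sketch below is an assumption about what a synopsis Cci might hold, not the representation used by any particular algorithm in these slides. The class name `ClusterSynopsis` and its fields are hypothetical.

```python
class ClusterSynopsis:
    """Fixed-size summary of one cluster: size, location, and spread (assumed layout)."""

    def __init__(self, dim):
        self.n = 0                       # size: number of points absorbed
        self.linear_sum = [0.0] * dim    # per-dimension sum, gives the location
        self.square_sum = [0.0] * dim    # per-dimension sum of squares, gives the spread

    def add(self, point):
        """Absorb one point in a single pass; the point itself is not stored."""
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.square_sum[d] += x * x

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.linear_sum, self.square_sum)]

syn = ClusterSynopsis(dim=2)
syn.add([1.0, 2.0])
syn.add([3.0, 6.0])
center = syn.centroid()
spread = syn.variance()
```

Memory use depends only on the dimensionality, not on the number of points seen, which is what makes the summary usable under the bounded-storage restriction.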

TRAC-DS Definition
Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays the data stream clustering ζ with an EMM M in such a way that the following are satisfied:
(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
(2) A transition aij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.
(3) The EMM M is created online together with the data stream clustering.

Clustering Operations
A clustering operation is a function q : ζ × x → ζ that the data stream clustering algorithm uses to update the clustering ζ given some additional information x, which is either a new data point or other information (e.g., the number of the cluster to be deleted in order to simplify the clustering).

TRAC-DS Operations
A TRAC-DS operation is a function r : M × sc × y → M × sc that updates the temporal relationship among clusters represented by the EMM M with states S, given a current state sc ∈ S and additional information y, and returns an updated EMM and possibly a new current state.
In order to update the EMM M dynamically, we need to store a transition count matrix C. The count cij in C is the number of times we observed a new point being assigned by the clustering algorithm to cluster i followed by a point being assigned to cluster j.
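The bookkeeping described above can be sketched with a sparse count matrix and a current-state pointer. This is a minimal illustration: the class name `TransitionCounts`, its methods, and the dictionary storage are hypothetical. Dividing cij by the total count leaving state i is an assumption chosen to match the reading of aij as "the next point belongs to cluster j given the current point is in cluster i".

```python
from collections import defaultdict

class TransitionCounts:
    """Transition count matrix C stored sparsely, plus the current state sc."""

    def __init__(self):
        self.counts = defaultdict(float)   # (i, j) -> c_ij
        self.current = None                # current state; None before the first point

    def assign_point(self, state):
        """Record that the clustering assigned the newest point to `state`."""
        if self.current is not None:
            self.counts[(self.current, state)] += 1
        self.current = state

    def probability(self, i, j):
        """Estimate a_ij as c_ij over the total count leaving state i."""
        row_total = sum(c for (src, _), c in self.counts.items() if src == i)
        return self.counts.get((i, j), 0.0) / row_total if row_total else 0.0

tc = TransitionCounts()
for cluster in [1, 2, 1, 3, 1, 2]:   # cluster ids as reported by the clustering algorithm
    tc.assign_point(cluster)
p12 = tc.probability(1, 2)
p13 = tc.probability(1, 3)
```

Storing counts rather than probabilities means each update touches a single entry, and probabilities are derived only when queried.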

Stream Clustering Operations *
q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.
q_new_cluster(ζ, x): Creates a new cluster.
q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one.
q_merge_clusters(ζ, x): Merges two clusters.
q_fade_clusters(ζ, x): Fades the cluster structure.
q_split_clusters(ζ, x): Splits a cluster.
* Inspired by MONIC

TRAC-DS Operations
r_assign_point(M, sc, y): Assigns the new data point to the state representing an existing cluster.
r_new_cluster(M, sc, y): Creates a state for a new cluster.
r_remove_cluster(M, sc, y): Removes a state.
r_merge_clusters(M, sc, y): Merges two states.
r_fade_clusters(M, sc, y): Fades the transition probabilities using an exponential decay f(t) = 2^(−λt).
r_split_clusters(M, sc, y): Splits states.
y: clustering operations.
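Three of these operations can be illustrated on a transition count matrix held as a plain dictionary. The sketch rests on assumptions that the slides do not state: fading is applied by multiplying every count by 2^(−λ·Δt), merged states pool their counts by summation, and removing a state drops every transition that touches it. The function names and the `dt` parameter are hypothetical.

```python
def fade(counts, lam, dt=1.0):
    """Scale every count by the decay factor 2 ** (-lam * dt)."""
    factor = 2.0 ** (-lam * dt)
    return {pair: c * factor for pair, c in counts.items()}

def merge_states(counts, a, b, merged):
    """Replace states a and b by `merged`, summing the counts that collide."""
    out = {}
    for (i, j), c in counts.items():
        i2 = merged if i in (a, b) else i
        j2 = merged if j in (a, b) else j
        out[(i2, j2)] = out.get((i2, j2), 0.0) + c
    return out

def remove_state(counts, s):
    """Drop every transition into or out of state s."""
    return {(i, j): c for (i, j), c in counts.items() if s not in (i, j)}

counts = {(1, 2): 4.0, (2, 1): 2.0, (2, 3): 2.0}
faded = fade(counts, lam=1.0)              # one time unit with lam = 1 halves each count
merged = merge_states(counts, 1, 2, "m")   # transitions between 1 and 2 become a self-loop
pruned = remove_state(counts, 3)
```

Because all three functions return a new dictionary, the original counts stay available for comparison; an online implementation would more likely update in place.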

TRAC-DS Example

TRAC-DS Advantages
Dynamic
Flexible: use any clustering algorithm
Supports any clustering operations
Scalable
Merges clustering and Markov modeling

What is Anomaly?
Event that is unusual
Event that doesn’t occur frequently
Predefined event
What is unusual?
What is deviation?

What is Anomaly in Stream Data?
Rare, anomalous, surprising
Out of the ordinary
Not outlier detection:
  No knowledge of data distribution
  Data is not static
  Must take temporal and spatial values into account
  May be interested in a sequence of events
Ex: Snow in upstate New York is not an anomaly. Snow in upstate New York in June is rare.
Rare events may change over time.

Statistical View of Anomaly
Outlier
  Data item that is outside the normal distribution of the data
  Identify by box plot
Image from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

Statistical View of Anomaly
Identify by looking at distribution
THIS DOES NOT WORK with stream data
Image from Normal distribution.

Data Mining View of Anomaly
Classification problem
  Build classifier from training data
  Problem is that training data shows what is NOT an anomaly
  Thus an anomaly is anything that is not viewed as normal by the classification technique
  MUST build dynamic classifier
Identify anomalous behavior
  Signatures of what anomalous behavior looks like
  Input data is identified as an anomaly if it is similar enough to one of these signatures
Mixed: classification and signature

EMM Advantages
Dynamic
Adaptable
Use of clustering
Learns rare events
Scalable:
  Growth of EMM is not linear in the size of the data
  Hierarchical feature of EMM
Creation/evaluation in quasi-real time
Distributed / hierarchical extensions

Growth of EMM
Serwent Data

TRAC-DS Approach to Detect Anomalies
By learning what is normal, the model can predict what is not.
Normal is based on likelihood of occurrence.
Use TRAC-DS to build clusters and model behavior between clusters.
We view a rare event as:
  An unusual event
  A transition between event states that does not occur frequently
Continue learning.

Determining Rare
Occurrence Frequency (OFi) of an EMM state Si is the normalized count of the state.
Normalized Transition Probability (NTPmn), from one state, Sm, to another, Sn, is a normalized transition count.

EMMRare
The EMMRare algorithm indicates whether the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs:
  The frequency of the node at time t+1 is below this threshold.
  The updated transition probability of the MC transition from the node at time t to the node at t+1 is below the threshold.
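The two-part test above might look like the following sketch. The transcript does not preserve how the counts are normalized, so dividing both the node count and the transition count by the total of all node counts is an assumption made here for illustration. The function name `is_rare`, the dictionary inputs, and the 5% threshold are likewise hypothetical.

```python
def is_rare(node_counts, link_counts, prev, cur, threshold):
    """Flag the move prev -> cur as rare if either normalized count is below threshold.

    node_counts: state -> number of times the state was visited
    link_counts: (from_state, to_state) -> number of times the transition was taken
    Both counts are normalized by the total of node_counts (an assumption).
    """
    total = sum(node_counts.values())
    if total == 0:
        return True                                  # nothing learned yet: treat as rare
    occurrence = node_counts.get(cur, 0) / total     # stands in for OF of the new state
    transition = link_counts.get((prev, cur), 0) / total  # stands in for NTP
    return occurrence < threshold or transition < threshold

node_counts = {"A": 90, "B": 9, "C": 1}
link_counts = {("A", "A"): 80, ("A", "B"): 9, ("B", "A"): 9,
               ("A", "C"): 1, ("C", "A"): 1}
common = is_rare(node_counts, link_counts, "A", "A", threshold=0.05)
unusual_state = is_rare(node_counts, link_counts, "A", "C", threshold=0.05)
unseen_link = is_rare(node_counts, link_counts, "B", "C", threshold=0.05)
```

In a running system the counts would be updated with the current event before the test, so that learning continues while rare events are being flagged.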
