ADVANCED TOPICS IN DATA MINING CSE 8331 Spring 2010 Part I


ADVANCED TOPICS IN DATA MINING CSE 8331 Spring 2010 Part I Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University 12/8/2018

Data Mining Outline
- EMM
- Stream Mining
- Text Mining
- Bioinformatics Mining

EMM Overview
- Time-varying, discrete, first-order Markov model
- Nodes are clusters of real-world states
- Learning continues during the prediction phase
- Learning:
  - Transition probabilities between nodes
  - Node labels (centroid of cluster)
- Nodes are added and removed as data arrives

MM
A first-order Markov chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process depends solely on the current state.
A Markov Model (MM) is a graph with m vertices (states), S, and directed arcs, A, such that:
- S = {N1, N2, …, Nm}
- A = {Lij | i ∈ {1, 2, …, m}, j ∈ {1, 2, …, m}}
- Each arc Lij = <Ni, Nj> is labeled with a transition probability Pij = P(Nj | Ni)
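
As an illustration of the definition above, the transition probabilities Pij can be estimated from an observed state sequence by counting how often state i is followed by state j. This is a minimal sketch; the function name and the sample sequence are made up for the example and do not come from the slides.

```python
from collections import Counter

def transition_probabilities(states):
    """Estimate P(next = j | current = i) from one observed state sequence."""
    pair_counts = Counter(zip(states, states[1:]))  # occurrences of each i -> j step
    out_totals = Counter(states[:-1])               # number of steps leaving each state
    probs = {}
    for (i, j), count in pair_counts.items():
        probs.setdefault(i, {})[j] = count / out_totals[i]
    return probs

# Hypothetical sequence of visited nodes.
sequence = ["N1", "N2", "N1", "N3", "N1", "N2"]
P = transition_probabilities(sequence)
```

Each row of the result sums to 1 because every departure from a state is counted exactly once.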

EMM Definition
Extensible Markov Model (EMM): at any time t, an EMM consists of an MC with a designated current node, Nn, and algorithms to modify it, where the algorithms include:
- EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
- EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.
- EMMDecrement, which removes nodes from the EMM when needed.

EMM Cluster
- Find the closest node to the incoming event
- If none is "close", create a new node
- Labeling of cluster is centroid of members in cluster
- O(n)
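
The matching step can be sketched as a linear scan over the existing nodes. The sketch below assumes Euclidean distance and a fixed distance threshold for deciding what counts as "close"; the slides fix neither choice, and the names `emm_cluster`, `nodes`, and `threshold` are hypothetical.

```python
import math

def emm_cluster(nodes, event, threshold):
    """Return the index of the node matched to `event`, creating a node if none is close.

    `nodes` is a list of dicts holding a running `centroid` and a member `count`.
    Assumption: "close" means Euclidean distance <= threshold.
    """
    best, best_dist = None, math.inf
    for idx, node in enumerate(nodes):            # O(n) nearest-node scan
        d = math.dist(node["centroid"], event)
        if d < best_dist:
            best, best_dist = idx, d
    if best is None or best_dist > threshold:     # nothing close: new node
        nodes.append({"centroid": list(event), "count": 1})
        return len(nodes) - 1
    node = nodes[best]                            # fold the event into the centroid
    node["count"] += 1
    node["centroid"] = [c + (x - c) / node["count"]
                        for c, x in zip(node["centroid"], event)]
    return best

nodes = []
a = emm_cluster(nodes, (18, 10, 3), threshold=2.0)  # first event creates a node
b = emm_cluster(nodes, (17, 10, 2), threshold=2.0)  # close enough to join it
c = emm_cluster(nodes, (14, 8, 2), threshold=2.0)   # too far: second node
```

The running-mean update keeps the node label equal to the centroid of its members without storing the members themselves.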

EMMSim
- Find the closest node to the incoming event
- If none is "close", create a new node
- Labeling of cluster is centroid/medoid of members in cluster
- Problem:
  - Nearest Neighbor: O(n)
  - BIRCH: O(lg n)
    - Requires second phase to recluster initial

EMM Increment
Input events: <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>
[Figure: the Markov chain after each input event, showing nodes N1, N2, N3 and the transition probabilities on the arcs as they are updated.]
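
One way to picture the increment step is as bookkeeping on two sets of counters: how often each node has been left, and how often each arc has been taken. The class below is a sketch under that assumption; the class and attribute names are hypothetical, and it does not reproduce the figure on the slide.

```python
from collections import defaultdict

class EMMCounts:
    """Transition bookkeeping for an EMM-style model (illustrative only)."""

    def __init__(self):
        self.node_count = defaultdict(int)   # times each node was left
        self.link_count = defaultdict(int)   # times each (i, j) arc was taken
        self.current = None                  # designated current node

    def increment(self, node):
        """Record a move from the current node to `node`, then advance."""
        if self.current is not None:
            self.node_count[self.current] += 1
            self.link_count[(self.current, node)] += 1
        self.current = node

    def probability(self, i, j):
        """Estimated P(j | i): arc count divided by departures from i."""
        total = self.node_count.get(i, 0)
        return self.link_count.get((i, j), 0) / total if total else 0.0

emm = EMMCounts()
for matched_node in ["N1", "N1", "N2", "N3", "N1"]:   # hypothetical cluster matches
    emm.increment(matched_node)
```

Because only counters change, each update takes constant time once the matching node is known, and probabilities are derived on demand.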

EMM Forget
[Figure: example Markov chain over nodes N1, N2, N3, N5, N6 before and after a forget step.]

Data Mining Outline
- EMM
- Stream Mining
  - Data Stream Overview
  - Data Stream Modeling
  - Data Stream Clustering
  - TRAC-DS
  - Anomaly Detection
- Text Mining
- Bioinformatics Mining

Motivation
- A growing number of applications generate streams of data:
  - Computer network monitoring data
  - Call detail records in telecommunications (Cisco VoIP 2003)
  - Highway transportation traffic data (MnDot 2005)
  - Online web purchase log records (JCPenney 2003, Travelocity 2005)
  - Sensor network data (Ouse, Serwent 2002)
  - Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions
- Data mining techniques play a key role in data models in Data Stream Management Systems.

Nowadays, a growing number of applications generate streams of data. Data of this type include computer network monitoring data, highway traffic data, call detail records in the telecom industry, online purchase logs, and data collected by other sensor networks. Data stream management systems have become a research area in recent years, and a new application area for data mining. Why is that? A defining feature of a data stream is its high volume: it is not possible to store all the data as in a traditional database. Instead, the data stream must be modeled, and data mining is suitable for this modeling task. On the slide, the items in parentheses are the datasets available to us. The datasets highlighted in red are those used in this proposal. The Cisco VoIP data is an 8-week VoIP call log collected at Cisco. MnDot is provided by the Minnesota Department of Transportation; it is highway traffic data for the Twin Cities area, available from 2000 to the present. Ouse and Serwent are two sets of sensor network data. Ouse is river level data at three locations near York in the UK, and Serwent is water flow rate data at 7 locations near Serwent in the UK.

Background
Characteristics of data streams:
- Data are raw
- Records may arrive at a rapid rate
- High volume (possibly infinite) of continuous data
- Concept drift: data distribution changes on the fly
- Multidimensional
- Temporality
Stream processing restrictions:
- Data modeling (synopsis)
- Single pass: each record is examined at most once
- Bounded storage: limited memory for storing the synopsis
- Real-time: per-record processing time must be low

By examining stream data, we can see the following characteristics. The data are raw because only online preprocessing is applicable. Records may arrive at a rapid rate. The volume of data is high, possibly infinite. The data profile may change on the fly; we call this concept change or concept drift. Data can be multidimensional. Temporal dependency may occur in the data series. To process the data stream, a technique must satisfy the following requirements. Data must be modeled, since it is not possible to store all the data; in the literature the modeled data is called a synopsis of the data stream. Single pass: each record or data point is read at most once, so random access to data as in relational databases is not possible. The storage space of the synopsis is bounded. Each record must be processed in a soft real-time manner. The system should respond to queries incrementally.

Source: Haixun Wang, Jian Pei, Philip S. Yu, ICDE 2005; Keogh, ICDM’04

From Sensors to Streams
- Data captured and sent by a set of sensors is usually referred to as "stream data".
- It is a real-time sequence of encoded signals that contain the desired information.
- It is a continuous, ordered (implicitly by arrival time, or explicitly by timestamp or by geographic coordinates) sequence of items.
- Stream data is infinite: the data keeps coming.

Suppose There Were MANY Sensors
- Traditional line graphs would be very difficult to read
- Requirements for a new visualization technique:
  - High-level summary of data
  - Handle multiple sensors at once
  - Continuous
  - Temporal
  - Spatial

Spatiotemporal Environment
- Events arriving in a stream
- At any time t, we can view the state of the problem as represented by a vector of n numeric values: Vt = <S1t, S2t, ..., Snt>

         V1    V2    …    Vq
    S1   S11   S12   …    S1q
    S2   S21   S22   …    S2q
    …
    Sn   Sn1   Sn2   …    Snq
         Time →

Data Stream Management Systems (DSMS)
- Software to facilitate querying and managing stream data
- Retrieve the most recent information from the stream
- Data aggregation facilitates merging together multiple streams
- Modeling stream data to "summarize" the stream
- Visualization is needed to observe, in real time, the spatial and temporal patterns and trends hidden in the data

DSMS Problems
- Stream management development is in a state similar to that of databases prior to the 1970s:
  - Each system/researcher looks at a specific application or system
  - No standards concerning functionality
  - No standard query language
- It is unreasonable to expect that end users will access raw data, data in the DSMS, or even data at a summarized view
- Domain experts need to "see" a higher level of data

Data Stream Modeling
- Single pass: each record is examined at most once
- Bounded storage: limited memory for storing the synopsis
- Real-time: per-record processing time must be low
- Summarization (synopsis) of data
- Use data, NOT a SAMPLE
- Temporal and spatial
- Dynamic
- Continuous (infinite stream)
- Learn
- Forget
- Sublinear growth rate: clustering

Problem with Markov Chains
- The required structure of the MC may not be certain at model construction time.
- As the real world being modeled by the MC changes, so should the structure of the MC.
- Not scalable: grows linearly with the number of events.
- Markov property
- Our solution: Extensible Markov Model (EMM)
  - Cluster real-world events
  - Allow the Markov chain to grow and shrink dynamically

EMM Sublinear Growth Rate
[Figure: Minnesota Department of Transportation (MnDot) data.]

Traditional Clustering

TRAC-DS

Motivation
- Temporal ordering is a major feature of stream data.
- Many stream applications depend on this ordering:
  - Prediction of future values
  - Anomaly (rare event) detection
  - Concept drift

Stream Clustering Requirements
- Dynamic updating of the clusters
- Identify outliers
- Barbara:
  - Compactness
  - Fast incremental processing

Stream Clustering Algorithms
- LOCALSEARCH
  - Partitions the stream into segments
  - Clusters each segment individually by solving the k-medians problem
  - Iteratively reclusters the resulting centers
- CluStream
  - Micro-clusters are represented by summary statistics
  - Micro-clusters are handled online
  - Micro-clusters are merged offline
- MONIC
  - Evolution of clusters over time
  - Cluster transitions over time

TRAC-DS NOTE
- TRAC-DS is not another stream clustering algorithm
- TRAC-DS is:
  - A new way of looking at clustering
  - Built on top of an existing clustering algorithm
- TRAC-DS may be used with any stream clustering algorithm

TRAC-DS Overview

Data Stream Clustering
- At each point in time, a data stream clustering ζ is a partitioning of D', the data seen thus far.
- Instead of the whole partitions C1, C2, ..., Ck, only synopses Cc1, Cc2, ..., Cck are available, and k is allowed to change over time.
- The summaries Cci, with i = 1, 2, ..., k, typically contain information about the size, distribution, and location of the data points in Ci.
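
A synopsis of this kind can be kept in constant space per cluster. The sketch below assumes one common choice of summary (point count, per-dimension sum, and per-dimension sum of squares), which is enough to recover size, location, and spread; the slides do not prescribe this layout, and the class name `ClusterSynopsis` is hypothetical.

```python
class ClusterSynopsis:
    """Fixed-size summary of one cluster: count, linear sum, and sum of squares."""

    def __init__(self, dim):
        self.n = 0                          # size of the cluster
        self.linear_sum = [0.0] * dim       # per-dimension sum of points
        self.square_sum = [0.0] * dim       # per-dimension sum of squared values

    def add(self, point):
        """Absorb one data point without storing it."""
        self.n += 1
        for k, x in enumerate(point):
            self.linear_sum[k] += x
            self.square_sum[k] += x * x

    def centroid(self):
        """Location of the cluster."""
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        """Per-dimension spread, computed as E[x^2] - E[x]^2."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.linear_sum, self.square_sum)]

syn = ClusterSynopsis(dim=2)
for p in [(1.0, 2.0), (3.0, 6.0)]:
    syn.add(p)
```

Two such summaries can be merged by adding their fields, which is what makes them convenient when clusters are merged.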

TRAC-DS Definition
Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays the data stream clustering ζ with an EMM M, in such a way that the following are satisfied:
(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
(2) A transition aij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.
(3) The EMM M is created online together with the data stream clustering.

Clustering Operations
A clustering operation is a function q : ζ × x → ζ which is used by the data stream clustering algorithm to update the clustering ζ given some additional information x, which is either a new data point or other information (e.g., the number of the cluster to be deleted in order to simplify the clustering).

TRAC-DS Operations
- A TRAC-DS operation is a function r : M × sc × y → M × sc that updates the temporal relationship among clusters represented by the EMM M with states S, given a current state sc ∈ S and additional information y, and returns an updated EMM and possibly a new current state.
- To be able to update the EMM M dynamically, we need to store a transition count matrix C.
- The count cij in C is the number of times we observed a new point being assigned by the clustering algorithm to cluster i, followed by a point being assigned to cluster j.
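
The role of the count matrix C can be sketched with a nested list that grows as clusters appear. This is an illustration only: the function names loosely mirror the operation names used in these slides, and estimating aij by dividing each row of C by its row sum is an assumption rather than a definition taken from the slides.

```python
def r_new_cluster(C):
    """Add a state for a new cluster: one extra zero column and zero row."""
    for row in C:
        row.append(0)
    C.append([0] * (len(C) + 1))
    return len(C) - 1                       # index of the new state

def r_assign_point(C, sc, j):
    """Count a transition from current state sc to state j; j becomes current."""
    if sc is not None:
        C[sc][j] += 1
    return j

def transition_matrix(C):
    """Row-normalize the counts into estimated transition probabilities a_ij."""
    A = []
    for row in C:
        total = sum(row)
        A.append([c / total if total else 0.0 for c in row])
    return A

C, sc = [], None
for cluster in [0, 1, 0, 1, 1]:             # hypothetical cluster assignments
    while cluster >= len(C):
        r_new_cluster(C)
    sc = r_assign_point(C, sc, cluster)
A = transition_matrix(C)
```

Storing counts rather than probabilities is what allows each new assignment to be absorbed with a single increment.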

Stream Clustering Operations *
- q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.
- q_new_cluster(ζ, x): Creates a new cluster.
- q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one.
- q_merge_clusters(ζ, x): Merges two clusters.
- q_fade_clusters(ζ, x): Fades the cluster structure.
- q_split_clusters(ζ, x): Splits a cluster.
* Inspired by MONIC

TRAC-DS Operations
- r_assign_point(M, sc, y): Assigns the new data point to the state representing an existing cluster.
- r_new_cluster(M, sc, y): Creates a state for a new cluster.
- r_remove_cluster(M, sc, y): Removes a state.
- r_merge_clusters(M, sc, y): Merges two states.
- r_fade_clusters(M, sc, y): Fades the transition probabilities using an exponential decay f(t) = 2^(−λt).
- r_split_clusters(M, sc, y): Splits states.
Y clustering operations.
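
The fading operation can be illustrated by scaling stored transition counts with the decay factor f(t) = 2^(−λt). The sketch assumes that fading is applied to the count matrix and that t is the time elapsed since the last fade; both are assumptions made for this example, as are the parameter names.

```python
def r_fade_clusters(C, lam, dt=1.0):
    """Scale every transition count by the decay factor 2 ** (-lam * dt).

    Assumption: fading acts on the count matrix C, and dt is the elapsed time.
    """
    factor = 2.0 ** (-lam * dt)
    return [[c * factor for c in row] for row in C]

counts = [[0, 8], [4, 4]]                              # hypothetical count matrix
faded = r_fade_clusters(counts, lam=0.5, dt=2.0)       # factor is 2 ** -1 = 0.5
```

With this form of decay, a count loses half its weight every 1/λ time units. Uniform scaling alone leaves the row-normalized probabilities unchanged; the effect shows up once newer observations are added at full weight.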

TRAC-DS Example

TRAC-DS Advantages
- Dynamic
- Flexible: use any clustering algorithm
- Supports clustering operations
- Scalable
- Merges clustering and Markov modeling

What is Anomaly?
- Event that is unusual
- Event that doesn't occur frequently
- Predefined event
- What is unusual?
- What is deviation?

What is Anomaly in Stream Data?
- Rare, anomalous, surprising
- Out of the ordinary
- Not outlier detection
  - No knowledge of data distribution
  - Data is not static
  - Must take temporal and spatial values into account
  - May be interested in a sequence of events
- Example:
  - Snow in upstate New York is not an anomaly
  - Snow in upstate New York in June is rare
- Rare events may change over time

Statistical View of Anomaly
- Outlier: a data item that is outside the normal distribution of the data
- Identify by box plot
Image from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
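
For static data, the box-plot view of an outlier can be sketched with the usual whisker rule: a value is flagged when it lies more than 1.5 interquartile ranges outside the quartiles. The multiplier 1.5 is the conventional default rather than something stated on the slide, and the function name and sample readings are made up.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the values lying beyond the box-plot whiskers (k * IQR from the quartiles)."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # first, second, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]         # hypothetical static sample
outliers = boxplot_outliers(readings)
```

The rule needs the whole sample to compute quartiles, which is one reason it fits static data better than an unbounded stream.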

Statistical View of Anomaly
- Identify by looking at the distribution
- THIS DOES NOT WORK with stream data
Image from www.wikipedia.org, Normal distribution.

Data Mining View of Anomaly
- Classification problem
  - Build a classifier from training data
  - The problem is that training data shows what is NOT an anomaly
  - Thus an anomaly is anything that is not viewed as normal by the classification technique
  - MUST build a dynamic classifier
- Identify anomalous behavior
  - Signatures of what anomalous behavior looks like
  - Input data is identified as an anomaly if it is similar enough to one of these signatures
- Mixed: classification and signature

EMM Advantages
- Dynamic
- Adaptable
- Use of clustering
- Learns rare events
- Scalable:
  - Growth of EMM is not linear in the size of the data
  - Hierarchical feature of EMM
- Creation/evaluation in quasi-real time
- Distributed / hierarchical extensions

Growth of EMM
[Figure: Serwent data.]

TRAC-DS Approach to Detect Anomalies
- By learning what is normal, the model can predict what is not
- Normal is based on likelihood of occurrence
- Use TRAC-DS to build clusters and behavior between clusters
- We view a rare event as:
  - An unusual event
  - A transition between event states which does not occur frequently
- Continue learning

Determining Rare
- Occurrence Frequency (OFi) of an EMM state Si is the normalized count of that state.
- Normalized Transition Probability (NTPmn), from one state, Sm, to another, Sn, is a normalized transition count.

EMMRare
- The EMMRare algorithm indicates whether the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs:
  - The frequency of the node at time t+1 is below this threshold
  - The updated transition probability of the MC transition from the node at time t to the node at time t+1 is below the threshold
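
The two tests can be sketched as follows. The transcript does not preserve the normalization formulas, so this example assumes that both the node count and the transition count are divided by the total of all node counts; that choice, like the names `emm_rare`, `node_count`, and `link_count`, is an assumption for illustration only.

```python
def emm_rare(node_count, link_count, prev, curr, threshold):
    """Flag the move prev -> curr as rare if either normalized frequency is below threshold.

    Assumption: both frequencies are normalized by the sum of all node counts.
    """
    total = sum(node_count.values())
    if total == 0:
        return True                                   # nothing learned yet
    occurrence = node_count.get(curr, 0) / total      # how often curr is seen
    transition = link_count.get((prev, curr), 0) / total
    return occurrence < threshold or transition < threshold

# Hypothetical counts learned so far.
node_count = {"N1": 50, "N2": 46, "N3": 4}
link_count = {("N1", "N2"): 40, ("N2", "N1"): 40, ("N1", "N3"): 1}

common = emm_rare(node_count, link_count, "N1", "N2", threshold=0.05)
rare = emm_rare(node_count, link_count, "N1", "N3", threshold=0.05)
```

Because the counts keep being updated as the stream arrives, an event flagged as rare early on may stop being rare later, which matches the point that learning continues during detection.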