© Prentice Hall1 DATA MINING Introductory and Advanced Topics Part III.

Presentation transcript:

© Prentice Hall1 DATA MINING Introductory and Advanced Topics Part III

© Prentice Hall2 Data Mining Outline
PART III
– Web Mining
– Spatial Mining
– Temporal Mining

© Prentice Hall3 Web Mining Outline
Goal: Examine the use of data mining on the World Wide Web
Web Content Mining
Web Structure Mining
Web Usage Mining

© Prentice Hall4 Web Mining Issues
Size
– >350 million pages (1999)
– Grows at about 1 million pages a day
– Google indexes 3 billion documents
Diverse types of data

© Prentice Hall5 Web Mining Taxonomy Modified from [zai01]

© Prentice Hall6 Web Content Mining
Used to discover useful information from the content of a web page
Content -> Text / Video / Audio
Web content mining techniques include:
– Natural Language Processing
– Information Retrieval
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis

© Prentice Hall7 Focused Crawler

© Prentice Hall8 Context Focused Crawler
Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links to documents at next higher level.
– Updated during crawl itself.
Approach:
1. Construct context graph and classifiers using seed documents as training data.
2. Perform crawling using classifiers and context graph created.

© Prentice Hall9 Context Graph
$R(d) = \sum_{\mathrm{Good}(c)} P(c \mid d)$
where c is a node (page) and d is a document

© Prentice Hall10 Virtual Web View
Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.

© Prentice Hall11 Web Structure Mining
Used to improve the efficiency of Web content mining
Mine structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.

© Prentice Hall12 PageRank
Used to improve the effectiveness of search engines
Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.

© Prentice Hall13 PageRank (cont’d)
$PR(p) = c \left( \frac{PR(1)}{N_1} + \dots + \frac{PR(n)}{N_n} \right)$
– PR(i): PageRank for a page i which points to target page p.
– N_i: number of links coming out of page i
– Problem: cyclic references
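As a rough illustration of the formula on this slide, the Python sketch below applies it repeatedly until the values settle, which is one common way to cope with cyclic references. The function name `pagerank`, the dictionary-based graph, the fixed iteration count, and treating c as the constant that makes the ranks sum to 1 are all assumptions made for this example, not details from the slides. It also assumes every page has at least one outgoing link and that every link target appears as a key.

```python
def pagerank(links, iterations=100):
    """links maps each page to the list of pages it points to (hypothetical input format)."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        # Each page i passes PR(i) / N_i to every page it links to.
        for i, targets in links.items():
            for p in targets:
                new_rank[p] += rank[i] / len(targets)
        # c is taken here as the constant that keeps the ranks summing to 1.
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Toy graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph, C collects rank from both A and B, so it ends up at least as important as either of them.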

© Prentice Hall14 CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
– Best sources, i.e. highly important pages.
– Best source for requested information.
Hub Pages:
– Contain links to highly important pages.

© Prentice Hall15 HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find set of relevant pages – R.
Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or from R.
– Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
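The weight calculation for hubs and authorities can be sketched in Python as a mutual-reinforcement loop: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of the pages it links to. This is only a minimal sketch; the function name `hits`, the input format, the iteration count, and the Euclidean normalization are assumptions for illustration, and the base set B is assumed to be given already as the `links` dictionary.

```python
import math


def hits(links, iterations=50):
    """links maps a page to the pages it points to (hypothetical base set B)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of authority weights of the pages pointed to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both weight vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values()))
            for p in scores:
                scores[p] /= norm
    return hub, auth


hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

Here `a1` is linked from both hub pages and so receives the larger authority weight, while `h1` links to more authorities and receives the larger hub weight.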

© Prentice Hall16 HITS Algorithm

© Prentice Hall17 Web Usage Mining
Extends work of basic search engines
Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis

© Prentice Hall18 Web Usage Mining Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)

© Prentice Hall19 Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize: A B A C or A B C
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– Pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
Pattern Analysis
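The preprocessing and pattern discovery steps might look roughly like the Python sketch below. The log format `(user, timestamp, page)`, the 30-minute inactivity timeout used to decide where "one sitting" ends, and the function names are assumptions made for illustration; the slides do not specify them. Patterns are counted as contiguous page sequences, at most once per session, so order matters as the slide requires.

```python
from collections import Counter, defaultdict


def sessionize(log, timeout=1800):
    """Split (user, timestamp_seconds, page) records into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for hits in by_user.values():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            # A long gap between requests starts a new session (assumed rule).
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def pattern_support(sessions, length=2):
    """Count in how many sessions each contiguous page sequence occurs."""
    counts = Counter()
    for s in sessions:
        seen = {tuple(s[i:i + length]) for i in range(len(s) - length + 1)}
        counts.update(seen)
    return counts


log = [("u1", 0, "A"), ("u1", 60, "B"), ("u1", 120, "C"),
       ("u2", 0, "A"), ("u2", 30, "B"), ("u1", 10000, "A")]
sessions = sessionize(log)
support = pattern_support(sessions)
```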

© Prentice Hall20 Spatial Mining Outline
Goal: Provide an introduction to some spatial mining techniques.
Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering

© Prentice Hall21 Spatial Object
Contains both spatial and nonspatial attributes.
Geographic Information System
– Weather, community infrastructure needs, disaster management
Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
May retrieve object using either (or both) spatial or nonspatial attributes.

© Prentice Hall22 Spatial Data Mining Applications
Geology
GIS Systems
Environmental Science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects

© Prentice Hall23 Spatial Queries
Spatial selection may involve specialized selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
Region (Range) Query – find objects that intersect a given region.
Nearest Neighbor Query – find object close to identified object.
Distance Scan – find object within a certain distance of an identified object where distance is made increasingly larger.
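The three query types can be illustrated with a brute-force Python sketch. It assumes, purely for simplicity, that every spatial object is a single named point and that Euclidean distance is the measure of closeness; the function names and the sample objects are hypothetical. A real system would answer these queries through a spatial index rather than a linear scan.

```python
import math


def region_query(objects, xmin, ymin, xmax, ymax):
    """Names of point objects that fall inside the given rectangle."""
    return [name for name, (x, y) in objects.items()
            if xmin <= x <= xmax and ymin <= y <= ymax]


def nearest_neighbor(objects, target):
    """Name of the object closest to the identified object."""
    others = [n for n in objects if n != target]
    return min(others, key=lambda n: math.dist(objects[target], objects[n]))


def distance_scan(objects, target, radii):
    """Yield (radius, names within radius) for increasingly larger radii."""
    for r in sorted(radii):
        yield r, [n for n, p in objects.items()
                  if n != target and math.dist(objects[target], p) <= r]


objects = {"school": (0, 0), "park": (1, 1), "mall": (5, 5)}
in_region = region_query(objects, 0, 0, 2, 2)
closest = nearest_neighbor(objects, "school")
scan = list(distance_scan(objects, "school", [2, 10]))
```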

© Prentice Hall24 Spatial Data Structures
Data structures designed specifically to store or index spatial data.
Often based on B-tree or Binary Search Tree
Cluster data on disk based on geographic location.
May represent complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
Techniques:
– Quad Tree
– R-Tree
– k-D Tree

© Prentice Hall25 MBR
Minimum Bounding Rectangle
Smallest rectangle that completely contains the object
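A minimal Python sketch of the idea, assuming an object is given as a list of (x, y) vertices and that the rectangle is axis-aligned: the MBR is just the minimum and maximum coordinate in each dimension. The function names and the `(xmin, ymin, xmax, ymax)` tuple layout are choices made for this example.

```python
def mbr(points):
    """Axis-aligned minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a, b):
    """True if two MBRs share at least one point (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


triangle = [(1, 2), (4, 1), (3, 5)]
box = mbr(triangle)
```

Comparing MBRs is a cheap first filter: if two MBRs do not intersect, the objects inside them cannot intersect either, so the exact geometry only needs to be checked for the remaining candidates.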

© Prentice Hall26 MBR Examples

© Prentice Hall27 Quad Tree
Hierarchical decomposition of the space into quadrants (MBRs)
Each level in the tree represents the object as the set of quadrants which contain any portion of the object.
Each level is a more exact representation of the object.
The number of levels is determined by the degree of accuracy desired.

© Prentice Hall28 Quad Tree Example

© Prentice Hall29 R-Tree
As with Quad Tree the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
Lowest level cell has only one object.
Tree maintenance algorithms similar to those for B-trees.

© Prentice Hall30 R-Tree Example

© Prentice Hall31 K-D Tree
Designed for multi-attribute data, not necessarily spatial
Variation of binary search tree
Each level is used to index one of the dimensions of the spatial object.
Lowest level cell has only one object
Divisions not based on MBRs but successive divisions of the dimension range.
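The following Python sketch shows one common way to build such a tree: each level splits on one dimension, cycling through the dimensions, with the median point as the dividing value. It is a simplified variant that stores one point at every node rather than only in the lowest-level cells, and the dictionary node layout and function names are assumptions made for this example.

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree; level `depth` splits on dimension depth % k."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # the median point divides this dimension's range
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def contains(node, target):
    """Search like a binary search tree, comparing one dimension per level."""
    if node is None:
        return False
    if node["point"] == target:
        return True
    axis = node["axis"]
    if target[axis] < node["point"][axis]:
        return contains(node["left"], target)
    if target[axis] > node["point"][axis]:
        return contains(node["right"], target)
    # Equal on the split dimension: the point may be on either side.
    return contains(node["left"], target) or contains(node["right"], target)


tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```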

© Prentice Hall32 k-D Tree Example

© Prentice Hall33 Topological Relationships
Disjoint
– A is disjoint from B
– No points in A are contained in B
Overlaps or Intersects
– At least one point in A is also in B
Equals
– All points in the two objects are in common
Covered by or inside or contained in
– All points in A are in B
– There may be points in B that are not in A
Covers or contains
– A contains B iff B is covered by A
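If each spatial object is approximated by the set of grid cells (points) it occupies, these relationships reduce to ordinary set operations. The Python sketch below makes that assumption; the function names and the cell-set representation are illustrative only and do not come from the slides.

```python
def disjoint(a, b):
    """No point of A is contained in B."""
    return not (a & b)


def overlaps(a, b):
    """At least one point of A is also in B."""
    return bool(a & b)


def equals(a, b):
    """All points of the two objects are in common."""
    return a == b


def covered_by(a, b):
    """All points of A are in B (B may have additional points)."""
    return a <= b


def covers(a, b):
    """A contains B exactly when B is covered by A."""
    return covered_by(b, a)


lake = {(1, 1), (1, 2)}
park = {(0, 1), (1, 1), (1, 2), (2, 2)}
road = {(5, 5), (5, 6)}
```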

© Prentice Hall34 STING
STatistical Information Grid-based
Hierarchical technique to divide area into rectangular cells
Grid data structure contains summary information about each cell
Hierarchical clustering
Similar to quad tree

© Prentice Hall35 STING

© Prentice Hall36 STING Build Algorithm

© Prentice Hall37 STING Algorithm

© Prentice Hall38 Spatial Rules
Characteristic Rule
Discriminant Rule
Association Rule

© Prentice Hall39 Spatial Classification Algorithms
To classify the spatial objects
– ID3
– Spatial Decision Tree

© Prentice Hall40 Spatial Clustering
Detect clusters of irregular shapes
Use of centroids and simple distance approaches may not work well.
Clusters should be independent of order of input.

© Prentice Hall41 Spatial Clustering

© Prentice Hall42 CLARANS Extensions
Remove main memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi Diagram

© Prentice Hall43 Voronoi

© Prentice Hall44 SD(CLARANS)
Spatial Dominant
First clusters spatial components using CLARANS
Then iteratively replaces medoids, but limits number of pairs to be searched.
Uses generalization
Uses a learning tool to derive description of cluster.

© Prentice Hall45 SD(CLARANS) Algorithm

© Prentice Hall46 DBCLASD
Distribution Based Clustering of LArge Spatial Databases
– Assumes that the items within a cluster are uniformly distributed
– Identifies distribution satisfied by distances between nearest neighbors
– Points outside the cluster do not satisfy this distribution
Extension of DBSCAN

© Prentice Hall47 APPROXIMATION
Aggregate Proximity – measure of how close a cluster is to a feature.
Aggregate proximity relationship finds the k closest features to a cluster.
CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull

© Prentice Hall48

© Prentice Hall49 Temporal Mining Outline
Goal: Examine some temporal data mining issues and approaches.
Introduction
Modeling Temporal Events
Time Series
Pattern Detection
Sequences
Temporal Association Rules

© Prentice Hall50 Temporal Database / Time Varying Analysis
Snapshot – Traditional database (Single Point of Time)
Temporal – Multiple time points
Ex: Social Security Number

© Prentice Hall51 Temporal Queries
Intersection Query
Inclusion Query
Containment Query
Point Query – Tuple retrieved is valid at a particular point in time.
[Figure: for each query type, the query interval [t_s^q, t_e^q] is drawn against the database tuple interval [t_s^d, t_e^d]]
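The four query types can be expressed as interval comparisons between the query range [t_s^q, t_e^q] and each tuple's valid range [t_s^d, t_e^d]. The slide only names the queries, so the exact conditions in this Python sketch are an assumed reading: intersection means the two ranges overlap, inclusion means the tuple's range lies inside the query range, and containment means the tuple's range contains the query range. The tuple layout `(key, start, end)` and the function names are hypothetical.

```python
def intersection_query(tuples, qs, qe):
    """Tuples whose valid range overlaps the query range."""
    return [k for k, ds, de in tuples if ds <= qe and qs <= de]


def inclusion_query(tuples, qs, qe):
    """Tuples whose valid range lies completely inside the query range."""
    return [k for k, ds, de in tuples if qs <= ds and de <= qe]


def containment_query(tuples, qs, qe):
    """Tuples whose valid range completely contains the query range."""
    return [k for k, ds, de in tuples if ds <= qs and qe <= de]


def point_query(tuples, t):
    """Tuples that are valid at time point t."""
    return [k for k, ds, de in tuples if ds <= t <= de]


rows = [("a", 1, 5), ("b", 3, 4), ("c", 6, 9)]
```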

© Prentice Hall52 Types of Databases
Snapshot – No temporal support
Transaction Time – Supports time when transaction inserted data
– Timestamp
– Range
Valid Time – Supports time range when data values are valid
Bitemporal – Supports both transaction and valid time.

© Prentice Hall53 Modeling Temporal Events
Techniques to model temporal events.
Often based on earlier approaches
Finite State Recognizer (Machine) (FSR)
– Each event recognizes one character
– Temporal ordering indicated by arcs
– May recognize a sequence
– Require precisely defined transitions between states
Approaches
– Markov Model
– Hidden Markov Model
– Recurrent Neural Network

© Prentice Hall54 FSR Directed Graph

© Prentice Hall55 Markov Model (MM)
Directed graph
– Vertices represent states
– Arcs show transitions between states
– Arc has probability of transition
– At any time one state is designated as current state.
Markov Property – Given a current state, the transition probability is independent of any previous states.
Applications: speech recognition, natural language processing
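Because of the Markov property, the probability of a whole state sequence is just the product of the individual arc probabilities along it. The Python sketch below shows this with a made-up two-state model; the state names, the probabilities, the separate start distribution, and the function name are all assumptions for illustration.

```python
def sequence_probability(seq, start, trans):
    """Probability of visiting the states in seq, in order."""
    prob = start[seq[0]]
    # Each step depends only on the current state (Markov property).
    for prev, cur in zip(seq, seq[1:]):
        prob *= trans[prev][cur]
    return prob


start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1},
         "B": {"A": 0.5, "B": 0.5}}
p = sequence_probability(["A", "A", "B"], start, trans)  # 0.5 * 0.9 * 0.1
```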

© Prentice Hall56 Markov Model

© Prentice Hall57 Hidden Markov Model (HMM)
Like MM, but states need not correspond to observable states.
HMM models process that produces as output a sequence of observable symbols.
HMM will actually output these symbols.
Associated with each node is the probability of the observation of an event.
Train HMM to recognize a sequence.
Transition and observation probabilities learned from training set.

© Prentice Hall58 Hidden Markov Model Modified from [RJ86]

© Prentice Hall59 HMM Algorithm

© Prentice Hall60 HMM Applications
Given a sequence of events and an HMM, what is the probability that the HMM produced the sequence?
Given a sequence and an HMM, what is the most likely state sequence which produced this sequence?
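These two questions are usually answered with the forward algorithm and the Viterbi algorithm, respectively. The slides do not name them, so the Python sketch below is an assumed choice of technique, and the weather-style states, symbols, and probabilities are invented for illustration.

```python
def forward(obs, states, start, trans, emit):
    """Probability that the HMM produced the observed sequence."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        # Sum over every state the model could have been in one step earlier.
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def viterbi(obs, states, start, trans, emit):
    """Most likely hidden state sequence and its probability."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Keep only the single best path leading into state s.
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][o], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


states = ("Rainy", "Sunny")
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
obs = ["walk", "shop", "clean"]
likelihood = forward(obs, states, start, trans, emit)
path, path_prob = viterbi(obs, states, start, trans, emit)
```

The only difference between the two functions is that `forward` sums over all predecessor states while `viterbi` keeps the maximum, which is why the best single path is always less probable than the sequence as a whole.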

© Prentice Hall61 Recurrent Neural Network (RNN)
Extension to basic NN
Neuron can obtain input from any other neuron (including output layer).
Can be used for both recognition and prediction applications.
Time to produce output unknown
Temporal aspect added by backlinks.

© Prentice Hall62 RNN

© Prentice Hall63 Time Series
Set of attribute values over a period of time
» Numeric / Specific
» Continuous / Discrete
Time Series Analysis – finding patterns in the values
» With transformation and similarity, and then prediction
– Trends
» Systematic, nonrepetitive changes
» Nonlinear / Linear
– Cycles: cyclic behavior
– Seasonal: patterns may depend on the time of year, month, or day
– Outliers: identifying them is a serious problem

© Prentice Hall64 Analysis Techniques
Smoothing
– Straightforward technique to detect trends
– Removes nonsystematic behavior
– A moving average of attribute values is used instead of the specific value found at each point
– The median may be used instead of the mean
– Correlation can be used
Autocorrelation – relationships between different subseries
– Yearly, seasonal
– Correlation can be found between every 12 values
– Lag – Time difference between related items.
– Correlation coefficient r is used to measure correlation, i.e. the linear relationship between two variables
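Both techniques are short to write down in Python. In this sketch the moving average replaces each value by the mean of a window of neighbouring values, and autocorrelation is computed as the correlation coefficient r between the series and a copy of itself shifted by the lag. The window size, the toy series with a period of 3, and the function names are assumptions for illustration.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson(xs, ys):
    """Correlation coefficient r between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def autocorrelation(values, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    return pearson(values[:-lag], values[lag:])


smoothed = moving_average([1, 2, 3, 4, 5], 3)
series = [1, 2, 3] * 4          # repeats every 3 values
r3 = autocorrelation(series, 3)  # lag matches the period
r1 = autocorrelation(series, 1)  # lag does not match the period
```

A lag that matches the period of the series gives a correlation close to 1, which is how a seasonal pattern such as "every 12 values" shows up in practice.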

© Prentice Hall65 Smoothing

© Prentice Hall66 Correlation with Lag of 3

© Prentice Hall67 Similarity
Determine similarity between a target pattern, X, and sequence, Y
sim(X,Y)
Similar to Web usage mining
Similar to earlier word processing and spelling corrector applications.
Issues:
– Length: X and Y may have different lengths
– Scale: same shape but different scale
– Gaps: missing data in a group
– Outliers: like gaps, except that there is extra data
– Baseline: the distance between successive values of X and Y may differ

© Prentice Hall68 Prediction
It is forecasting
Predict future value for time series
Regression may not be sufficient
Studies of time series prediction often assume that the time series is stationary, i.e. the values come from a model with a constant mean
More complex prediction techniques may assume that the time series is nonstationary.

© Prentice Hall69 Prediction
Statistical Techniques
– Auto Regression and Moving Average (season based)
» A method of predicting a future time series value by looking at previous values
» Time series $X = (x_1, x_2, x_3, \dots, x_n, x_{n+1})$
» $x_{n+1}$ is the future value to be computed, by either AR or MA
» AR: $x_{n+1} = \Phi_n x_n + \Phi_{n-1} x_{n-1} + \dots + \xi_{n+1}$
» $\xi_{n+1}$ is the random error
» $\Phi_i$ are the autoregressive parameters
» MA: $x_{n+1} = \Phi_n a_n + \Phi_{n-1} a_{n-1} + \dots$
» $a_n$ is the shock, drawn from a normal distribution with zero mean
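For the autoregressive case, a point forecast can be sketched in Python by taking the weighted sum of the most recent values and setting the random error term to its expected value of zero. The parameter values here are invented; in practice they would be estimated from the series. The function name and the convention that the first parameter applies to the most recent value are assumptions for this example.

```python
def ar_predict(history, params):
    """Forecast the next value from the most recent len(params) values.

    params[0] weights the latest value, params[1] the one before it, and so on.
    The random error term is replaced by its expected value, zero.
    """
    recent = history[::-1][:len(params)]
    return sum(phi * x for phi, x in zip(params, recent))


history = [1.0, 2.0, 3.0]
forecast = ar_predict(history, [0.5, 0.25])  # 0.5 * 3.0 + 0.25 * 2.0
```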

© Prentice Hall70 Prediction
Statistical Techniques
– Auto Regression and Moving Average have been discussed
– Auto Regressive Moving Average (ARMA)
– Auto Regressive Integrated Moving Average (ARIMA)

© Prentice Hall71 Pattern Detection
Identify patterns of behavior in time series
Speech recognition, signal processing
FSR, MM, HMM

© Prentice Hall72 String Matching
Find given pattern in sequence
Knuth-Morris-Pratt: Construct FSM
Boyer-Moore: Construct FSM
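A compact Python sketch of Knuth-Morris-Pratt is given below. The failure table plays the role of the finite state machine: it records, for each position in the pattern, which state to fall back to after a mismatch, so the text is never re-read. The function name and the convention of returning -1 when the pattern is absent are choices made for this example.

```python
def kmp_search(pattern, text):
    """Index of the first occurrence of pattern in text, or -1 if absent."""
    if not pattern:
        return 0
    # fail[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it (the fallback state after a mismatch).
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once; k is the current state (characters matched so far).
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


position = kmp_search("abab", "aabababc")
```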

© Prentice Hall73 Distance between Strings
Cost to convert one to the other
Transformations
– Match: Current characters in both strings are the same
– Delete: Delete current character in input string
– Insert: Insert current character in target string into input string
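Using only the three transformations listed on this slide, the cheapest conversion can be found with dynamic programming. The Python sketch below assumes a match costs 0 and each delete or insert costs 1; these costs, like the function name, are assumptions, since the slide does not state them. There is no substitute operation, so changing one character costs a delete plus an insert.

```python
def string_distance(source, target):
    """Minimum cost to convert source into target using match/delete/insert."""
    rows, cols = len(source) + 1, len(target) + 1
    # cost[i][j]: cost to convert source[:i] into target[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i  # delete every remaining source character
    for j in range(cols):
        cost[0][j] = j  # insert every target character
    for i in range(1, rows):
        for j in range(1, cols):
            best = min(cost[i - 1][j] + 1,   # delete from source
                       cost[i][j - 1] + 1)   # insert from target
            if source[i - 1] == target[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # match, no cost
            cost[i][j] = best
    return cost[-1][-1]


d = string_distance("abc", "abd")  # delete "c", insert "d"
```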

© Prentice Hall74 Distance between Strings

© Prentice Hall75 Frequent Sequence

© Prentice Hall76 Frequent Sequence Example
Purchases made by customers
s( ) = 1/3
s( ) = 2/3

© Prentice Hall77 Frequent Sequence Lattice

© Prentice Hall78 SPADE
Sequential Pattern Discovery using Equivalence classes
Identifies patterns by traversing lattice in a top down manner.
Divides lattice into equivalence classes and searches each separately.
ID-List: Associates customers and transactions with each item.

© Prentice Hall79 SPADE Example
ID-List for Sequences of length 1:
Count for is 3
Count for is 2

© Prentice Hall80 θ ∪ Equivalence Classes

© Prentice Hall81 SPADE Algorithm

© Prentice Hall82 Temporal Association Rules
Transaction has time:
[t_s, t_e] is range of time the transaction is active.
Types:
– Inter-transaction rules
– Episode rules
– Trend dependencies
– Sequence association rules
– Calendric association rules

© Prentice Hall83 Inter-transaction Rules
Intra-transaction association rules
– Traditional association rules
Inter-transaction association rules
– Rules across transactions
– Sliding window – How far apart (time or number of transactions) to look for related itemsets.

© Prentice Hall84 Episode Rules
Association rules applied to sequences of events.
Episode – set of event predicates and partial ordering on them

© Prentice Hall85 Trend Dependencies
Association rules across two database states based on time.
Ex: (SSN, =) ⇒ (Salary, ≤)
Confidence = 4/5
Support = 4/36

© Prentice Hall86 Sequence Association Rules
Association rules involving sequences
Ex: … ⇒ …
Support = 1/3
Confidence = 1

© Prentice Hall87 Calendric Association Rules
Each transaction has a unique timestamp.
Group transactions based on time interval within which they occur.
Identify large itemsets by looking at transactions only in this predefined interval.