I don’t need a title slide for a lecture


1 I don’t need a title slide for a lecture
Long long ago, in a galaxy far, far away… 11/28/2018

2 Outline
Background
Data mining
  Association Rules
  Classification
  Clustering
  Sequential Patterns
  Sequence Similarity

3 Knowledge Discovery in Databases (KDD)
What is it?
  Finding useful patterns in data
Why do we need it?
  Terabytes of data
  Impractical to manually search for patterns
Where does data mining come in?

4 Steps of a KDD process
  1. Learn the application domain
  2. Create a target dataset
  3. Clean and preprocess data
  4. Choose type of data mining
  5. Pick an algorithm
  6. Perform data mining
  7. Interpret results

5 Databases vs. Data warehousing
Data warehouses
  Storage of all data
  Details or summaries
  Metadata
  Data cleaning, integration
Databases
  Queries over current data
  Persistent storage
  Atomic updates

6 Databases vs. Data warehouses
Databases provide for:
  Queries over current data
  Persistent storage
  Atomic updates
Data warehouses provide for:
  Storage of all data
  Metadata
  Data cleaning, integration
  Fast access to data

7 Who’s interested?
Databases - large amounts of data
Artificial Intelligence - search, planning, machine learning
Information Retrieval - searching for similar documents
Image Processing - finding similar images

8 Types of data mining
Association Rules
Classification
Clustering
Sequential Patterns
Sequence Similarity

9 Association rules
What are they?
  Rules describing items that commonly occur together in basket data (a rule a → b reports co-occurrence, not necessarily causation)
Where are they used?
  Store layout
  Catalog design
  Customer segmentation

10 Association rules example
Find all itemsets that occur at least twice, and the association rules that each itemset yields

11 Association rules metrics
For a rule a → b:
  support = a and b occur together in at least s% of the n baskets
  confidence = of all of the baskets containing a, at least c% also contain b

12 Association rules algorithms
Focus on finding support for “itemsets”
The naïve method:
  1. Combine itemsets of size k-1 that differ only on the last item to find Candidates_k
  2. Measure support of itemsets from step 1 to form large itemset_k
  3. Increase k and repeat until no new large itemsets
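
The three steps above can be sketched as a level-wise loop. This is an illustrative reading of the slide, not the lecture's own code: itemsets are assumed to be sorted tuples, support is an absolute basket count, and the names `join`, `count_support`, `large_itemsets` and the sample baskets are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Itemset = Tuple[str, ...]  # items are kept in sorted order


def count_support(candidates: List[Itemset], baskets: List[Set[str]]) -> Dict[Itemset, int]:
    """Step 2: count how many baskets contain each candidate."""
    return {c: sum(1 for b in baskets if set(c) <= b) for c in candidates}


def join(large_prev: List[Itemset]) -> List[Itemset]:
    """Step 1: combine size k-1 itemsets that differ only on the last item."""
    candidates = []
    for i, x in enumerate(large_prev):
        for y in large_prev[i + 1:]:
            if x[:-1] == y[:-1]:
                candidates.append(x[:-1] + tuple(sorted((x[-1], y[-1]))))
    return candidates


def large_itemsets(baskets: List[Set[str]], min_count: int) -> Dict[Itemset, int]:
    """Step 3: repeat with growing k until no new large itemsets appear."""
    items = sorted({item for b in baskets for item in b})
    candidates: List[Itemset] = [(item,) for item in items]
    result: Dict[Itemset, int] = {}
    while candidates:
        counts = count_support(candidates, baskets)
        large = sorted(c for c, n in counts.items() if n >= min_count)
        result.update({c: counts[c] for c in large})
        candidates = join(large)
    return result


# Hypothetical baskets, invented for this example.
baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}]
found = large_itemsets(baskets, min_count=2)
```

With these baskets the loop still counts the candidate (A, B, C) even though (B, C) is not large, which is the kind of wasted work a subset-based pruning rule is meant to avoid.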

13 Itemsets of size 1
Looking for support of 2

14 Finding candidate set 2

15 Finding candidate set 3

16 Apriori algorithm
An itemset cannot be a large itemset unless all of its subsets are large itemsets
Reduces number of candidate itemsets considered
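
The pruning rule can be shown in a few lines. This is a sketch under assumptions: itemsets are sorted tuples, and `prune` and the sample itemsets are hypothetical names and data chosen for illustration.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Itemset = Tuple[str, ...]  # items are kept in sorted order


def prune(candidates: Iterable[Itemset], large_prev: Set[Itemset]) -> List[Itemset]:
    """Keep a size-k candidate only if every (k-1)-item subset is already large."""
    kept = []
    for candidate in candidates:
        subsets = combinations(candidate, len(candidate) - 1)
        if all(subset in large_prev for subset in subsets):
            kept.append(candidate)
    return kept


# Hypothetical large 2-itemsets: (B, C) is missing, so (A, B, C) cannot be large.
large_2 = {("A", "B"), ("A", "C")}
survivors = prune([("A", "B", "C")], large_2)
```

Because the check only looks up itemsets that were already counted, a pruned candidate never needs a pass over the baskets.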

17 Research directions
Online construction of rules
  CARMA (Berkeley)
Pre filtering the data
  a posteriori (Limburgs Universitair Centrum)

18 Classification
What is it?
  Rules that partition data into separate groups.
Where is it used?
  to classify people as good/bad credit risks
  weather prediction
  fraud detection
Variation: best k of n (who to send flyers to)

19 Classification example

20 Possible solutions
Bayesian classification
Neural networks
Genetic algorithms
Decision Trees

21 Decision trees
[Figure: decision tree with test nodes "Salary < 25,000" and "Graduate education?", branches labeled yes/no, and leaves Accept and Reject]

22 Decision trees
Build the tree in two steps:
  1. Build a perfect tree on sample data
       At each node, pick a “good” attribute
       Split data according to attribute
       Recursively build tree on children
  2. Prune the tree
       Minimum Description Length:
         Cost of encoding tree structure
         Cost of encoding split attribute
         Cost of encoding leaf data records
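
The build step can be sketched as a short recursive function. The slide does not say what makes an attribute “good”, so this sketch assumes information gain (entropy reduction) as the criterion; that choice, the function names, and the sample records are all assumptions made for illustration, and pruning is left out.

```python
import math
from collections import Counter
from typing import Dict, List, Union

Record = Dict[str, str]
Tree = Union[str, dict]  # a leaf label, or {attribute: {value: subtree}}


def entropy(labels: List[str]) -> float:
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_attribute(records: List[Record], attributes: List[str], target: str) -> str:
    """Pick the attribute whose split reduces label entropy the most."""
    def gain(attr: str) -> float:
        remainder = 0.0
        for value in {r[attr] for r in records}:
            subset = [r[target] for r in records if r[attr] == value]
            remainder += len(subset) / len(records) * entropy(subset)
        return entropy([r[target] for r in records]) - remainder
    return max(attributes, key=gain)


def build_tree(records: List[Record], attributes: List[str], target: str) -> Tree:
    labels = [r[target] for r in records]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    attr = best_attribute(records, attributes, target)
    rest = [a for a in attributes if a != attr]
    return {attr: {value: build_tree([r for r in records if r[attr] == value], rest, target)
                   for value in sorted({r[attr] for r in records})}}


# Hypothetical records, loosely echoing the attributes on the decision tree slide.
records = [
    {"low_salary": "no", "grad": "no", "decision": "accept"},
    {"low_salary": "no", "grad": "yes", "decision": "accept"},
    {"low_salary": "no", "grad": "no", "decision": "accept"},
    {"low_salary": "yes", "grad": "yes", "decision": "accept"},
    {"low_salary": "yes", "grad": "no", "decision": "reject"},
]
tree = build_tree(records, ["low_salary", "grad"], "decision")
```

The tree grown this way fits the sample data exactly, which is why a separate pruning pass is needed afterwards.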

23 Research directions
Integrate building and pruning
  PUBLIC (Bell Labs)
Incremental updates
  BOAT (University of Wisconsin)

24 Clustering
What is it?
  Given n points, separate them into k clusters
Where is it used?
  Information retrieval - text classification
  Identify similar web documents
  Mapping the universe

25 Clustering example

26 Traditional clustering algorithms
Partitional
  Determine k partitions that optimize a function
  Common function is the “square error function”
Hierarchical
  Each point starts as a cluster
  Clusters are merged until k clusters remain
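
Both ideas above can be sketched in a few lines. The slide does not say which clusters get merged, so this sketch assumes the pair with the closest centroids is merged at each step; that rule, the 2-D points, and all names are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(cluster: List[Point]) -> Point:
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def sq_dist(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def square_error(clusters: List[List[Point]]) -> float:
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)


def agglomerate(points: List[Point], k: int) -> List[List[Point]]:
    """Hierarchical clustering: start with one cluster per point, merge until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Assumed merge rule: pick the two clusters with the closest centroids.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical points, invented for this example: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters = agglomerate(points, k=2)
error = square_error(clusters)
```

A partitional method would instead search directly for the k groups that minimize `square_error`, rather than arriving at k clusters by repeated merging.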

27 Clustering difficulties

28 Research directions
Higher dimension subspace clustering
  CLIQUE (IBM Almaden)
Incremental clustering
  Incremental DBScan (University of Munich)
Remove problems with outliers
  CURE (Bell Labs)

29 Sequential patterns
What is it?
  Given a set of events, find frequently occurring patterns
Where is it used?
  Analyzing basket data
  Medical diagnosis

30 Sequential patterns example

31 AprioriAll
Create all large events that occur once
Map each subset to numbers
While there still are large itemsets:
  Find candidate itemsets of length k
  Find large itemsets of length k
  Increase k
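
The loop above depends on deciding whether a candidate pattern is supported by enough customers. The sketch below covers only that support test, not the full AprioriAll procedure; it assumes each customer history is an ordered list of events (sets of items), and the names and data are hypothetical.

```python
from typing import FrozenSet, List, Sequence

Event = FrozenSet[str]


def contains(history: Sequence[Event], pattern: Sequence[Event]) -> bool:
    """True if the pattern's events appear, in order, within distinct events of the history."""
    pos = 0
    for event in history:
        if pos < len(pattern) and pattern[pos] <= event:
            pos += 1  # matched this pattern element; look for the next one later on
    return pos == len(pattern)


def sequence_support(customers: List[Sequence[Event]], pattern: Sequence[Event]) -> int:
    """Number of customers whose history contains the pattern."""
    return sum(1 for history in customers if contains(history, pattern))


# Hypothetical customer histories, invented for this example.
customers = [
    [frozenset({"tv"}), frozenset({"dvd", "popcorn"})],
    [frozenset({"tv"}), frozenset({"radio"}), frozenset({"dvd"})],
    [frozenset({"dvd"}), frozenset({"tv"})],
]
pattern = [frozenset({"tv"}), frozenset({"dvd"})]
count = sequence_support(customers, pattern)
```

The third customer bought the same items in the opposite order, so that history does not count toward the pattern "tv, then dvd".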

32 Mapping the itemsets

33 Research directions
Time limitations
  WINEPI (Helsinki/Microsoft)
Itemsets over multiple transactions
  CSP (IBM Almaden)

34 Sequence Similarity
What is it?
  Given a number of data sets, look for similar trends
Where is it used?
  Find stocks with similar price movements
  Find geological irregularities

35 Example
Are the two sequences similar?

36 Basic algorithm
  1. Scale data
  2. Match all gap-free sequences
  3. Form pairs of large similar sequences
  4. Find the longest common subsequence
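
A simplified sketch of the first and last steps is given below. It skips the gap-free window matching and pairing steps, and it assumes min-max scaling and a tolerance `eps` for treating two values as equal; those choices, the names, and the sample series are assumptions for illustration rather than the published algorithm.

```python
from typing import List


def scale(seq: List[float]) -> List[float]:
    """Rescale to [0, 1] so series with different ranges can be compared."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0 for _ in seq]
    return [(x - lo) / (hi - lo) for x in seq]


def lcs_length(a: List[float], b: List[float], eps: float) -> int:
    """Length of the longest common subsequence, treating values within eps as equal."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if abs(x - y) <= eps:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


# Hypothetical price series: the same shape at very different price levels.
cheap = [10.0, 20.0, 30.0, 20.0, 10.0]
pricey = [100.0, 200.0, 300.0, 200.0, 100.0]
rising = [1.0, 2.0, 3.0, 4.0, 5.0]

same_shape = lcs_length(scale(cheap), scale(pricey), eps=0.05)
different_shape = lcs_length(scale(cheap), scale(rising), eps=0.05)
```

Two series could then be called similar when the common subsequence covers a large enough fraction of their length; the threshold is a tuning choice.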

37 Research directions
Finding surprising patterns
  IBM Almaden

38 Data mining directions
Sampling
Fractals
Pre-partitioning data
Making data mining more accessible
User defined aggregation support

39 References
General data mining:
Association rules: “Fast Algorithms for Mining Association Rules”, Agrawal and Srikant; VLDB 94.
Classification: “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning”, Rastogi and Shim; VLDB 98.

40 References (cont.)
Clustering: “CURE: An Efficient Clustering Algorithm for Large Databases”, Guha, Rastogi, and Shim; SIGMOD 98.
Sequential patterns: “Mining Sequential Patterns: Generalizations and Performance Improvements”, Srikant and Agrawal; EDBT 96.
Similarity search: “Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases”, Agrawal, Lin, Sawhney, and Shim; VLDB 95.

