
Chase Repp

• Data mining, also known as knowledge discovery
• Searching, analyzing, and sifting through large data sets to find new patterns, trends, and relationships contained within

• Data mining differs from database querying in the following manner: database querying asks “what company purchased $100,000 worth of widgets last year?” while data mining asks “what company is likely to purchase over $100,000 of widgets next year and why?”

• Term coined in the 1960s
• Data mining was used to find basic information in collections of data, such as total revenue over the last three years
• Classic statistics
• Artificial intelligence
• Machine learning

• Predictive data mining
    Target value
    Future trends
• Descriptive data mining
    No target value
    Focuses on relations

• Predictive data mining focuses on discovering relationships among independent variables and between dependent and independent variables
• Used to forecast specific outcomes

• Descriptive data mining describes a data set in a brief but comprehensive way and highlights interesting characteristics of the data without any predefined target
• Focuses on relations

• Association: patterns are discovered based on the relationship of a specific item with other items in the same transaction
• Descriptive
• Example: groceries

• Classification: assigns each item in a data set to one of a predefined set of classes or groups
• Often used with machine learning
• Predictive
• Example: cat person or dog person?
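
A minimal sketch of the idea of classifying into predefined classes, using a k-nearest-neighbour vote. The features, the training values, and the names `TRAINING` and `classify` are all invented for illustration; the slides do not say which classifier or which features were meant.

```python
import math

# Hypothetical training data: (hours outdoors per week, home size in m^2) -> label.
# The two classes are predefined; the numbers are made up for illustration.
TRAINING = [
    ((2.0, 40.0), "cat"),
    ((3.0, 55.0), "cat"),
    ((1.0, 35.0), "cat"),
    ((10.0, 120.0), "dog"),
    ((12.0, 90.0), "dog"),
    ((8.0, 150.0), "dog"),
]


def classify(sample, training=TRAINING, k=3):
    """Return the majority label among the k training points nearest to sample."""
    by_distance = sorted(training, key=lambda item: math.dist(sample, item[0]))
    votes = {}
    for _, label in by_distance[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


print(classify((2.5, 45.0)))    # a person resembling the "cat" examples
print(classify((11.0, 110.0)))  # a person resembling the "dog" examples
```

The class labels exist before any new item is seen, which is what makes this predictive rather than descriptive.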

• Clustering: unlike classification, the clustering technique also defines the classes itself and then puts objects into them
• Descriptive
• Example: a library
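
To illustrate how clustering defines the groups itself, here is a sketch of plain k-means. The slides do not name a clustering algorithm, so k-means is an assumption, as are the sample points and the choice of the first k points as starting centroids.

```python
import math


def k_means(points, k, iterations=20):
    """Plain k-means; returns (centroids, assignments)."""
    centroids = [points[i] for i in range(k)]  # deterministic start: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical items described by two made-up numeric features.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(POINTS, k=2)
print(assignments)  # group index per point; no labels were supplied in advance
```

No target labels appear anywhere in the input, which matches the descriptive character of clustering.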

• Regression: used to predict numbers from data sets that have known target values
• Predictive
• Examples: sales, distance, temperature, value, etc.
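
Predicting a number from known target values can be sketched with an ordinary least-squares line fit. The monthly sales figures and the names `fit_line`, `MONTHS`, and `SALES` are hypothetical; the slides do not specify a model.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical known target values: sales observed in months 1 to 5.
MONTHS = [1, 2, 3, 4, 5]
SALES = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(MONTHS, SALES)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(slope, intercept, forecast)
```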

• Sequential pattern mining: discovers frequent sequences or subsequences as patterns in a sequence database
• Descriptive
• Derived from association mining

• The main sequential pattern mining techniques fall into three categories:
    Apriori-based
    Pattern-growth
    Early-pruning

• Apriori-based techniques follow the apriori property: all nonempty subsets of a frequent itemset must also be frequent
• If {A, B} is a frequent itemset, both {A} and {B} must also be frequent itemsets
• Examples: AprioriAll, GSP, PSP, and SPAM

• Transaction data:
    t1: Beef, Chicken, Milk
    t2: Beef, Cheese
    t3: Cheese, Boots
    t4: Beef, Chicken, Cheese
    t5: Beef, Chicken, Clothes, Cheese, Milk
    t6: Chicken, Clothes, Milk
    t7: Chicken, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7, about 43%]
• Association rules from the itemset:
    Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
    ……
    Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]

• Two steps:
    1. Find all itemsets that have minimum support (frequent itemsets).
    2. Use frequent itemsets to generate rules.
• E.g., a frequent itemset {Chicken, Clothes, Milk} [sup = 3/7] and one rule from the frequent itemset: Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
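
The support and confidence figures above, and the rule-generation step, can be checked with a short sketch. The transactions t1 to t7 and the thresholds are taken from the slides; the function names `support`, `confidence`, and `rules_from` are illustrative, and exact fractions are used only to keep the 3/7 and 3/3 values readable.

```python
from fractions import Fraction
from itertools import combinations

# Transactions t1..t7 as given in the slides.
TRANSACTIONS = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item of itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def rules_from(itemset, minconf, transactions=TRANSACTIONS):
    """Step 2: every rule X -> (itemset - X) whose confidence reaches minconf."""
    items = frozenset(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(sorted(items), size):
            consequent = items - set(antecedent)
            conf = confidence(antecedent, consequent, transactions)
            if conf >= minconf:
                rules.append((set(antecedent), set(consequent), conf))
    return rules


itemset = {"Chicken", "Clothes", "Milk"}
print(support(itemset))  # 3/7
for lhs, rhs, conf in rules_from(itemset, Fraction(80, 100)):
    print(sorted(lhs), "->", sorted(rhs), "conf =", conf)
```

With minconf = 80%, only the rules whose antecedent contains Clothes pass, because Chicken and Milk each also occur in transactions without the rest of the itemset.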

Dataset T (minsup = 50%):
    TID     Items
    T100    1, 3, 4
    T200    2, 3, 5
    T300    1, 2, 3, 5
    T400    2, 5

Notation: itemset:count
1. Scan T
    → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
    → F1: {1}:2, {2}:3, {3}:3, {5}:3
    → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T
    → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
    → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
    → C3: {2,3,5}
3. Scan T
    → C3: {2,3,5}:2
    → F3: {2,3,5}
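
The trace above can be reproduced with a compact sketch of level-wise Apriori itemset mining. Dataset T and minsup = 50% come from the slides; the function name `apriori` and the way candidates are joined (uniting any two frequent k-itemsets that differ in one item) are simplifying assumptions rather than a tuned implementation.

```python
from itertools import combinations

# Dataset T from the slides.
DATASET_T = {
    "T100": {1, 3, 4},
    "T200": {2, 3, 5},
    "T300": {1, 2, 3, 5},
    "T400": {2, 5},
}


def apriori(transactions, minsup):
    """Return {frozenset: count} for every itemset with support >= minsup."""
    min_count = minsup * len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # C1
    frequent = {}
    k = 1
    while candidates:
        # Scan T: count how many transactions contain each candidate (Ck).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep the candidates that reach minimum support (Fk).
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (apriori property): all (k-1)-subsets must be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


result = apriori(list(DATASET_T.values()), minsup=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is where {1,2,3} and {1,3,5} are discarded before the third scan, leaving {2,3,5} as the only 3-item candidate, as in the trace.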

• Pattern-growth techniques use a divide-and-conquer strategy
• They focus the search on a restricted portion of the initial database and generate as few candidate sequences as possible
• Examples: FreeSpan, PrefixSpan, WAP-mine, and FS-Miner
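
As a rough illustration of the pattern-growth idea, here is a simplified PrefixSpan-style sketch restricted to sequences of single items. Each recursive call works only on the projected suffixes that follow the current prefix, which is the "restricted portion of the database" mentioned above. The example sequences and the name `prefixspan` are hypothetical, and the full algorithm also handles itemsets within sequence elements, which this sketch does not.

```python
def prefixspan(sequences, min_count):
    """Return {pattern_tuple: support_count} for sequences of single items."""
    results = {}

    def grow(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project: keep only what follows the first occurrence of item.
            suffixes = [seq[seq.index(item) + 1:] for seq in projected if item in seq]
            grow(pattern, suffixes)

    grow((), [tuple(s) for s in sequences])
    return results


# Hypothetical sequence database; each letter is one event.
SEQUENCES = ["abcd", "acbd", "abd", "bcd"]
patterns = prefixspan(SEQUENCES, min_count=3)
for pattern, count in sorted(patterns.items()):
    print("".join(pattern), count)
```

No candidate sequences are generated and then tested against the whole database; patterns are only ever extended with items that actually occur in the projected suffixes.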

• Early-pruning techniques utilize a sort of position induction to prune candidate sequences very early in the mining process and to avoid support counting as much as possible
• Examples: LAPIN, HVSM, and DISC-all

• Web mining: searching for patterns in data through
    content mining: search engines
    structure mining: hyperlinks (HITS / PageRank)
    usage mining: user's browser data and submitted forms
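
For the structure-mining item, a small power-iteration sketch shows how a PageRank-style score can be derived from hyperlinks alone. The three-page link graph, the damping factor of 0.85, and the name `pagerank` are assumptions for illustration; the sketch also assumes every page has at least one outgoing link and that every link target is itself a key of the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a link graph given as {page: [linked pages]}."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share, then shares from its in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site structure.
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```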

• One use is finding user navigational patterns on the World Wide Web by extracting knowledge from web logs
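
A minimal sketch of extracting navigational patterns from a web log: group page requests by visitor, then count page-to-page steps that occur for enough visitors. The log entries, the `(visitor_id, page)` format, and the name `frequent_transitions` are hypothetical; real logs would first need parsing and session splitting.

```python
from collections import Counter, defaultdict

# Hypothetical, already time-ordered log entries: (visitor_id, page).
LOG = [
    ("u1", "/home"), ("u1", "/products"), ("u1", "/cart"),
    ("u2", "/home"), ("u2", "/products"), ("u2", "/about"),
    ("u3", "/home"), ("u3", "/about"),
    ("u1", "/checkout"),
    ("u3", "/products"), ("u3", "/cart"),
]


def frequent_transitions(log, min_count):
    """Return {(from_page, to_page): visitors} for steps taken by >= min_count visitors."""
    paths = defaultdict(list)
    for visitor, page in log:
        paths[visitor].append(page)
    counts = Counter()
    for pages in paths.values():
        # Count each page-to-page step at most once per visitor.
        counts.update(set(zip(pages, pages[1:])))
    return {step: n for step, n in counts.items() if n >= min_count}


for (src, dst), n in sorted(frequent_transitions(LOG, min_count=2).items()):
    print(src, "->", dst, n)
```

This counts only adjacent steps; longer navigational patterns would call for a sequential pattern mining technique.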

• An example of applying sequential pattern mining
• S = {a, b, c, d, e, f}
• [P1, ] [P2, ] [P3, ] [P4, ]
• Frequent pattern of abac

• Visual data mining combines traditional mining methods and information visualization techniques
• The user is directly involved
• VDMS: simplicity, reliability, reusability, availability, and security

• http://www.youtube.com/user/quiterian
• http://www.youtube.com/watch?v=MtJ4Xa4-J8g
• http://www.youtube.com/watch?v=_8HzwQCFFfw
