Published by 희준 봉. Modified over 6 years ago.
1
Association Analysis for Finding Patterns in Large Amounts of Biological Data
Vipin Kumar
William Norris Professor and Head, Department of Computer Science
2
Association Analysis
Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record.
Applications:
  Marketing and Sales Promotion
  Supermarket shelf management
  Traffic pattern analysis (e.g., rules such as "high congestion on Intersection 58 implies high accident rates for left turning traffic")
Rules Discovered:
  {Milk} --> {Coke} (s=0.6, c=0.75)
  {Diaper, Milk} --> {Beer} (s=0.4, c=0.67)
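The s and c values attached to each rule are its support and confidence. As a minimal sketch, the snippet below computes both for the two rules quoted on this slide. The five transactions are hypothetical (the slide's own table is not part of this transcript); they were chosen so that the results agree with the quoted values.

```python
# Hypothetical transactions (an assumption, not taken from the slides),
# chosen so the two quoted rules get the stated support and confidence.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} --> {sorted(rhs)}  s={s:.1f}, c={c:.2f}")
```

Support measures how often the whole rule applies; confidence measures how often the right-hand side appears when the left-hand side does.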
3
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold
Brute-force approach: Two Steps
  1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
  2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is computationally expensive.
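The two steps can be sketched as follows. This is an illustrative brute-force version only: the transactions, the function names, and the thresholds are assumptions made for the example, and the exhaustive enumeration in step 1 is exactly the part that becomes too expensive on real data.

```python
from itertools import combinations

# Hypothetical transactions (an assumption, not taken from the slides).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, minsup):
    """Step 1 (brute force): test every non-empty itemset against minsup."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(combo, transactions)
            if s >= minsup:
                result[frozenset(combo)] = s
    return result


def generate_rules(frequent, minconf):
    """Step 2: each rule is a binary partition (lhs, rhs) of a frequent itemset."""
    rules = []
    for itemset, s in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the table.
                c = s / frequent[lhs]
                if c >= minconf:
                    rules.append((lhs, itemset - lhs, s, c))
    return rules


frequent = frequent_itemsets(transactions, minsup=0.4)
for lhs, rhs, s, c in generate_rules(frequent, minconf=0.6):
    print(f"{sorted(lhs)} --> {sorted(rhs)}  s={s:.2f}, c={c:.2f}")
```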
4
Efficient Pruning Strategy (Ref: Agrawal & Srikant 1994)
If an itemset is infrequent, then all of its supersets must also be infrequent.
[Figure: itemset lattice showing an itemset found to be infrequent and its pruned supersets]
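A minimal sketch of how this pruning rule can drive a level-wise search, in the spirit of the cited Apriori algorithm, is shown below. It is a simplified illustration, not the authors' implementation; the transactions and the `apriori` function name are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical market-basket transactions (an assumption for illustration).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(transactions, min_count):
    """Level-wise search: a k-itemset is counted only if all of its
    (k-1)-subsets were frequent at the previous level."""

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    frequent = {s: count(s) for s in level}
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent (k-1)-subset,
        # because a superset of an infrequent itemset cannot be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if count(c) >= min_count}
        frequent.update({c: count(c) for c in level})
        k += 1
    return frequent


for itemset, n in sorted(apriori(transactions, 3).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The support counts are only computed for candidates that survive the prune step, which is where the savings come from.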
5
Illustrating Apriori Principle
Minimum Support = 3
Items (1-itemsets)
Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs)
Triplets (3-itemsets)
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
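The two candidate counts can be checked with a few lines of arithmetic. The breakdown of the pruned total is an assumed reading of the slide: six single items, pairs built only from the four items that remain once Coke and Eggs are dropped, and a single surviving triplet.

```python
from math import comb

n_items = 6  # items on the slide, including Coke and Eggs

# Brute force: every itemset of size 1, 2 or 3 is a candidate.
brute_force = sum(comb(n_items, k) for k in (1, 2, 3))

# With support-based pruning (assumed breakdown, see the text above):
singles = n_items                # all 6 items are counted once
frequent_items = n_items - 2     # Coke and Eggs fall below the minimum support
pairs = comb(frequent_items, 2)  # pairs built only from the 4 frequent items
triplets = 1                     # assumption: one 3-itemset candidate survives
pruned = singles + pairs + triplets

print(brute_force, pruned)
```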
6
Counting Candidates
Frequent itemsets are found by counting candidates.
Simple way: search for each candidate in each transaction. Expensive!
The naïve approach requires O(NM) comparisons, where N is the number of transactions and M is the number of candidates.
Reduce the number of comparisons (NM) by using hash tables to store the candidate itemsets.
[Figure: N transactions matched against M candidate itemsets and their counts]
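The difference between the two counting strategies can be sketched as below. This is a simplification under stated assumptions: a plain Python dict stands in for the hash structure holding the candidates, all candidates have the same size k, and the transactions and candidates are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions and 2-itemset candidates (assumptions).
transactions = [
    {"A", "B", "C", "D"},
    {"A", "C", "E"},
    {"B", "C", "D"},
    {"A", "B", "D", "E"},
    {"B", "C", "E"},
    {"B", "D"},
]
candidates = [frozenset(p) for p in ("AB", "AC", "AD", "AE", "BC", "BD")]


def count_naive(transactions, candidates):
    """Compare every candidate with every transaction: N * M subset tests."""
    counts = dict.fromkeys(candidates, 0)
    comparisons = 0
    for t in transactions:
        for c in candidates:
            comparisons += 1
            if c <= t:
                counts[c] += 1
    return counts, comparisons


def count_hashed(transactions, candidates, k):
    """Enumerate each transaction's k-subsets and look each one up in a
    hash table (a dict) keyed by candidate itemset."""
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts


naive_counts, comparisons = count_naive(transactions, candidates)
hashed_counts = count_hashed(transactions, candidates, k=2)
print(comparisons, naive_counts == hashed_counts)
```

With hashing, the work per transaction depends on the number of k-subsets it contains rather than on the number of candidates, which helps when M is large and transactions are short.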
10
Where are the parts located?
How many roles can these play?
How flexible and adaptable are they mechanically?
What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)?
What are the common parts -- types of parts (nuts & washers)?
Which parts interact?
© Mark Gerstein, Yale
19
Association Analysis for Finding Connections of Disease and Medical and Genomic Characteristics
Create a data set that records the presence and absence of
  Phenotypic characteristics
  Genetic characteristics (SNPs)
  Disease
Apply association analysis to find groups of phenotypic and genetic characteristics that are highly associated with disease.
  Uses characteristics of the patterns to prune the search space
  Clustering and classification can also be applied
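As a rough sketch of this workflow, the snippet below encodes each patient as the set of characteristics that are present and then looks for small groups of characteristics whose carriers mostly have the disease. Everything in it is hypothetical: the patient records, the attribute names (`SNP_A`, `SNP_B`, `HighBP`), the thresholds, and the `disease_rules` helper. It enumerates small attribute sets exhaustively and only skips groups with too few patients, so it does not reflect the more elaborate pruning the slide refers to.

```python
from itertools import combinations

# Hypothetical patients: each record lists the characteristics present.
patients = [
    {"SNP_A", "SNP_B", "HighBP", "Disease"},
    {"SNP_A", "SNP_B", "Disease"},
    {"SNP_A", "HighBP"},
    {"SNP_B", "HighBP"},
    {"SNP_A", "SNP_B", "HighBP", "Disease"},
    {"HighBP"},
]


def disease_rules(records, target, min_count, minconf, max_size=2):
    """Find attribute sets carried by at least `min_count` patients such that
    at least a `minconf` fraction of those patients also have `target`."""
    features = sorted(set().union(*records) - {target})
    found = []
    for k in range(1, max_size + 1):
        for combo in combinations(features, k):
            group = [r for r in records if set(combo) <= r]
            if len(group) < min_count:
                continue  # too few patients carry this combination
            conf = sum(target in r for r in group) / len(group)
            if conf >= minconf:
                found.append((combo, len(group), conf))
    return found


for combo, n, conf in disease_rules(patients, "Disease", min_count=3, minconf=0.9):
    print(combo, n, conf)
```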
20
The Need for Error-Tolerant Itemsets
An error-tolerant itemset (ETI) can have a fraction ε of the items missing in each transaction.
Example: see the data in the table.
Let ε = 1/4. In other words, each transaction needs to have 3/4 (75%) of the items.
X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4.
Algorithms to find ETIs are still in development.
You can think of these ETIs as blocks in the data matrix.
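The support of an error-tolerant itemset under this definition can be sketched as below. The data are hypothetical (the slide's table is not part of this transcript) and were built so that X and Y each reach a support of 4 with a tolerance of 1/4, while neither has any support as an exact itemset. The function name `eti_support` and its `eps` parameter are assumptions for the example, and the code implements only the simple per-transaction criterion stated on the slide.

```python
# Hypothetical data: each of the first four transactions misses one item of X,
# and each of the last four misses one item of Y.
transactions = [
    {"i2", "i3", "i4"},
    {"i1", "i3", "i4"},
    {"i1", "i2", "i4"},
    {"i1", "i2", "i3"},
    {"i6", "i7", "i8"},
    {"i5", "i7", "i8"},
    {"i5", "i6", "i8"},
    {"i5", "i6", "i7"},
]
X = {"i1", "i2", "i3", "i4"}
Y = {"i5", "i6", "i7", "i8"}


def eti_support(itemset, transactions, eps):
    """Count transactions containing at least a (1 - eps) fraction of `itemset`."""
    needed = (1 - eps) * len(itemset)
    return sum(len(itemset & t) >= needed for t in transactions)


print(eti_support(X, transactions, eps=0.25), eti_support(Y, transactions, eps=0.25))
print(eti_support(X, transactions, eps=0.0), eti_support(Y, transactions, eps=0.0))
```

Setting `eps` to 0 recovers ordinary exact support, which is why a traditional frequent itemset search would miss both blocks in this data.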
21
ETIs for Finding Patterns in Phenotypic and Genomic Data
ETIs consist of
  X: a set of patients and
  Y: a set of attributes, i.e., SNPs, medical characteristics
such that the block is relatively dense.
These blocks identify sets of patients that are highly associated with certain sets of attributes and vice versa.
If most of these patients share a disease, then these attributes (genetic and/or phenotypic) are candidate markers for the disease.