
Databases and Data Mining Lecture 3: Descriptive Data Mining Peter van der Putten (putten_at_liacs.nl)


1 Databases and Data Mining Lecture 3: Descriptive Data Mining Peter van der Putten (putten_at_liacs.nl)

2 Course Outline Objective –Understand the basics of data mining –Gain understanding of the potential for applying it in the bioinformatics domain –Hands-on experience Schedule Evaluation –Practical assignment (2nd) plus take-home exercise Website –http://www.liacs.nl/~putten/edu/dbdm05/

3 Agenda Today: Descriptive Data Mining Before Starting to Mine…. Descriptive Data Mining –Dimension Reduction & Projection –Clustering Hierarchical clustering K-means Self organizing maps –Association rules Frequent item sets Association Rules APRIORI Bio-informatics case: FSG for frequent subgraph discovery

4 Before starting to mine…. Pima Indians Diabetes Data –X = body mass index –Y = age

5 Before starting to mine….

6

7 Attribute Selection –This example: information gain (InfoGain) computed per attribute –Keep the most important ones
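
As a hedged sketch of what "InfoGain per attribute" computes, here is a pure-Python information-gain calculation; the tiny table of rows and labels is invented purely for illustration:

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on one attribute."""
    base = entropy(labels)
    # Partition the labels by the attribute's value
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return base - remainder

# Toy data: rows are attribute vectors, labels are class values
rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "yes")]
labels = ["sick", "sick", "healthy", "healthy"]
gains = {i: info_gain(rows, labels, i) for i in range(2)}
print(gains)  # keep the attributes with the highest gain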

8 Before starting to mine…. Types of Attribute Selection –Univariate versus multivariate (subset selection) The fact that attribute x is a strong univariate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model –Filter versus wrapper Wrapper methods involve the subsequent learner (classifier or other)

9 Dimension Reduction Projecting high-dimensional data into a lower dimension –Principal Component Analysis –Independent Component Analysis –Fisher Mapping, Sammon’s Mapping etc. –Multidimensional Scaling See Pattern Recognition Course (Duin)
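
For illustration, a minimal numpy sketch of Principal Component Analysis via an eigendecomposition of the covariance matrix; the random 10-dimensional data is a stand-in for a real dataset:

import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                   # center the data
    cov = np.cov(Xc, rowvar=False)            # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]  # columns = top components
    return Xc @ top

X = np.random.randn(100, 10)   # 100 instances, 10 dimensions
X2 = pca_project(X, 2)         # 100 x 2, ready for a scatter plot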

10 Data Mining Tasks: Clustering (figure: instances plotted in a 2-D pattern space, axes e.g. age and weight) Clustering is the discovery of groups in a set of instances. Groups are different; instances within a group are similar. In 2- to 3-dimensional pattern space you could simply visualise the data and leave the recognition to a human end user.

11 Data Mining Tasks: Clustering (figure as on the previous slide) Clustering is the discovery of groups in a set of instances. Groups are different; instances within a group are similar. In 2- to 3-dimensional pattern space you could simply visualise the data and leave the recognition to a human end user. In more than 3 dimensions this is not possible.

12 Clustering Techniques Hierarchical algorithms –Agglomerative –Divisive Partition based clustering –K-Means –Self Organizing Maps / Kohonen Networks Probabilistic Model based –Expectation Maximization / Mixture Models

13 Hierarchical clustering Agglomerative / Bottom up –Start with single-instance clusters –At each step, join the two closest clusters –Methods to compute the distance between clusters x and y: single linkage (distance between the closest points in x and y), average linkage (average distance between all points), complete linkage (distance between the furthest points), centroid –Distance measure: Euclidean, correlation, etc. Divisive / Top Down –Start with all data in one cluster –Split into two clusters based on category utility –Proceed recursively on each subset Both methods produce a dendrogram
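
A minimal sketch of agglomerative clustering using scipy, assuming average linkage and Euclidean distance on synthetic two-cluster data:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two well-separated blobs of 20 points each
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])

# Bottom-up merging; 'average' = average linkage ('single'/'complete' also work)
Z = linkage(X, method="average", metric="euclidean")

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
# dendrogram(Z) would draw the merge tree (requires matplotlib)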

14 Levels of Clustering (figure: dendrogram, read top-down for divisive and bottom-up for agglomerative clustering) Dunham, 2003

15 Hierarchical Clustering Example Clustering Microarray Gene Expression Data –Gene expression measured using microarrays under a variety of conditions –On budding yeast Saccharomyces cerevisiae –Efficiently groups together genes of known similar function Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; picture generated with J-Express Pro

16 Hierarchical Clustering Example Method –Genes are the instances, samples the attributes! –Agglomerative –Distance measure = correlation Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro

17 Simple Clustering: K-means Pick a number (k) of cluster centers (at random) Cluster centers are sometimes called codes, and the k codes a codebook Assign every item to its nearest cluster center (e.g. using Euclidean distance) Move each cluster center to the mean of its assigned items Repeat until convergence, e.g. until the change in cluster assignments falls below a threshold KDnuggets
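
A hedged pure-Python sketch of this loop (random initial codes, nearest-code assignment, mean update, stop when assignments no longer change); the four points and k = 2 are illustrative only:

import random

def kmeans(points, k, max_iter=100):
    codes = random.sample(points, k)              # step 1: random initial codes
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # step 2: assign each point to its nearest code (squared Euclidean)
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, codes[c])))
            for pt in points
        ]
        if new_assignment == assignment:          # converged: no reassignments
            break
        assignment = new_assignment
        # step 3: move each code to the mean of its assigned points
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                codes[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return codes, assignment

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
print(kmeans(points, k=2))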

18 K-means example, step 1 (figure: codes k1, k2, k3 in X-Y pattern space) Initially distribute codes randomly in pattern space KDnuggets

19 K-means example, step 2 (figure: points grouped around k1, k2, k3) Assign each point to the closest code KDnuggets

20 K-means example, step 3 (figure: k1, k2, k3 shifting toward their clusters) Move each code to the mean of all its assigned points KDnuggets

21 K-means example, step 2 (figure: k1, k2, k3 after the move) Repeat the process – reassign the data points to the codes Q: Which points are reassigned? KDnuggets

22 K-means example (figure: updated assignment of points to k1, k2, k3) Repeat the process – reassign the data points to the codes Q: Which points are reassigned? KDnuggets

23 K-means example (figure) Re-compute cluster means KDnuggets

24 K-means example (figure) Move cluster centers to cluster means KDnuggets

25 K-means clustering summary Advantages –Simple, understandable –Items automatically assigned to clusters Disadvantages –Must pick the number of clusters beforehand –All items are forced into a cluster –Sensitive to outliers Extensions –Adaptive k-means –K-medoids (based on the median instead of the mean): 1, 2, 3, 4, 100 → average 22, median 3
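
The mean-versus-median example is easy to verify:

from statistics import mean, median
data = [1, 2, 3, 4, 100]
print(mean(data), median(data))  # 22 vs 3: the outlier 100 drags the mean, not the median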

26 Biological Example Clustering of yeast cell images –Two clusters are found –The left cluster contains primarily cells with a thick capsule, the right cluster cells with a thin capsule; capsule thickness is caused by the media and serves as a proxy for sick vs. healthy

27 Self Organizing Maps (Kohonen Maps) Claim to fame –Simplified models of cortical maps in the brain –Things that are near in the outside world link to areas that are near in the cortex –For a variety of modalities: touch, motor, … up to echolocation –Nice visualization From a data mining perspective: –SOMs are simple extensions of k-means clustering –Codes are connected in a lattice –In each iteration, codes neighboring the winning code in the lattice are also allowed to move
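
A sketch of a single SOM training step, to make concrete how lattice neighbors of the winning code also move; the 10x10 lattice, learning rate, and Gaussian neighborhood width are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
codes = rng.random((10, 10, 2))        # 10x10 lattice of 2-D codes

def som_step(x, codes, lr=0.1, sigma=1.5):
    """Move the winning code and its lattice neighbors toward input x."""
    dists = np.linalg.norm(codes - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dists), dists.shape)  # winner
    ii, jj = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
    lattice_d2 = (ii - wi) ** 2 + (jj - wj) ** 2
    h = np.exp(-lattice_d2 / (2 * sigma ** 2))   # neighborhood on the lattice
    codes += lr * h[:, :, None] * (x - codes)    # neighbors move too

for _ in range(1000):                  # train on 2-D Gaussian data
    som_step(rng.normal(size=2), codes)

With sigma close to 0 only the winner moves and the update reduces to (adaptive) k-means; the lattice neighborhood is what makes the map topology-preserving.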

28 SOM (figure: a 10x10 SOM trained on a Gaussian distribution)

29 SOM

30

31

32 SOM example

33 Famous example: Phonetic Typewriter The SOM lattice (below left) is trained on spoken letters; after convergence the codes are labeled This creates a ‘phonotopic’ map A spoken word creates a sequence of labels

34 Famous example: Phonetic Typewriter Criticism –The topology-preserving property is not used, so why use SOMs and not, for instance, adaptive k-means? K-means could also create a sequence This is true for most SOM applications! –Is using clustering for classification optimal?

35 Bioinformatics Example Clustering G Protein Coupled Receptors (GPCRs) [Samsonova et al., 2003, 2004] An important drug target, function often unknown

36 Bioinformatics Example Clustering GPCRs

37 Association Rules Outline What are frequent item sets & association rules? Quality measures –support, confidence, lift How to find item sets efficiently? –APRIORI How to generate association rules from an item set? Biological examples KDnuggets

38 Market Basket Example / Gene Expression Example Frequent item set {MILK, BREAD} = 4 Association rule {MILK, BREAD} → {EGGS} Frequency / importance = 2 (‘Support’) Quality = 50% (‘Confidence’) What genes are expressed (‘active’) together? Interaction / regulation Similar function

39 Association Rule Definitions Set of items: I = {I1, I2, …, Im} Transactions: D = {t1, t2, …, tn}, tj ⊆ I Itemset: {Ii1, Ii2, …, Iik} ⊆ I Support of an itemset: percentage of transactions which contain that itemset Large (frequent) itemset: itemset whose number of occurrences is above a threshold Dunham, 2003

40 Frequent Item Set Example I = {Beer, Bread, Jelly, Milk, PeanutButter} (transaction table not shown in the transcript) Support of {Bread, PeanutButter} is 60% Dunham, 2003

41 Association Rule Definitions Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X, Y are disjoint Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X Dunham, 2003
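
A minimal sketch of these definitions in code; the toy transactions are invented, but chosen so the numbers line up with the 60% support of {Bread, PeanutButter} on slide 40:

transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """support(X ∪ Y) / support(X)."""
    return support(X | Y) / support(X)

print(support({"Bread", "PeanutButter"}))       # 0.6, i.e. 60%
print(confidence({"Bread"}, {"PeanutButter"}))  # 3/4 = 0.75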

42 Association Rules Ex (cont’d) Dunham, 2003

43 Association Rule Problem Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence. NOTE: the support of X ⇒ Y is the same as the support of X ∪ Y. Dunham, 2003

44 Association Rules Example Q: Given the frequent set {A,B,E}, which association rules have minsup = 2 and minconf = 50%? A, B => E : conf = 2/4 = 50% A, E => B : conf = 2/2 = 100% B, E => A : conf = 2/2 = 100% E => A, B : conf = 2/2 = 100% Don’t qualify: A => B, E : conf = 2/6 = 33% < 50% B => A, E : conf = 2/7 = 28% < 50% __ => A, B, E : conf = 2/9 = 22% < 50% KDnuggets

45 Solution Association Rule Problem First, find all frequent itemsets with sup >= minsup –Exhaustive search won’t work: a set of m items has 2^m subsets! –Exploit the subset property (APRIORI algorithm) For every frequent item set, derive rules with confidence >= minconf KDnuggets

46 Finding itemsets: next level Apriori algorithm (Agrawal & Srikant) Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, … –Subset Property: if (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well! –In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent ⇒ Compute k-item sets by merging (k-1)-item sets KDnuggets

47 An example Given: five three-item sets (A B C), (A B D), (A C D), (A C E), (B C D) Candidate four-item sets: (A B C D) Q: OK? A: yes, because all 3-item subsets are frequent (A C D E) Q: OK? A: No, because (C D E) is not frequent KDnuggets
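
A sketch of this join-and-prune step (the core of APRIORI candidate generation), with frequent itemsets represented as sorted tuples:

from itertools import combinations

def apriori_gen(frequent_k):
    """Merge frequent k-itemsets into candidate (k+1)-itemsets, then prune."""
    frequent = set(frequent_k)
    k = len(frequent_k[0])
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        if a[:-1] == b[:-1]:                      # join: share the first k-1 items
            c = tuple(sorted(set(a) | set(b)))
            # prune: every k-item subset of c must itself be frequent
            if all(s in frequent for s in combinations(c, k)):
                candidates.add(c)
    return candidates

three_sets = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]
print(apriori_gen(three_sets))   # {('A','B','C','D')} -- (A C D E) is pruned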

48 From Frequent Itemsets to Association Rules Q: Given frequent set {A,B,E}, what are possible association rules? –A => B, E –A, B => E –A, E => B –B => A, E –B, E => A –E => A, B –__ => A,B,E (empty rule), or true => A,B,E KDnuggets
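
A small sketch that enumerates exactly these candidate rules: every proper subset of the itemset, the empty set included, can serve as the antecedent:

from itertools import chain, combinations

def candidate_rules(itemset):
    items = tuple(itemset)
    n = len(items)
    # all antecedents X that are proper subsets of the itemset (empty set included)
    for X in chain.from_iterable(combinations(items, r) for r in range(n)):
        Y = tuple(i for i in items if i not in X)
        yield set(X), set(Y)

for X, Y in candidate_rules({"A", "B", "E"}):
    print(X or "{}", "=>", Y)   # 7 rules, matching the list above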

49 Example: Generating Rules from an Itemset Frequent itemset from golf data: Humidity = Normal, Windy = False, Play = Yes (support 4) Seven potential rules, with confidence:
If Humidity = Normal and Windy = False then Play = Yes (4/4)
If Humidity = Normal and Play = Yes then Windy = False (4/6)
If Windy = False and Play = Yes then Humidity = Normal (4/6)
If Humidity = Normal then Windy = False and Play = Yes (4/7)
If Windy = False then Humidity = Normal and Play = Yes (4/8)
If Play = Yes then Humidity = Normal and Windy = False (4/9)
If True then Humidity = Normal and Windy = False and Play = Yes (4/12)
KDnuggets

50 Example: Generating Rules Rules with support > 1 and confidence = 100%. In total: 3 rules with support four, 5 with support three, and 50 with support two.
Association rule | Sup. | Conf.
1. Humidity=Normal Windy=False ⇒ Play=Yes | 4 | 100%
2. Temperature=Cool ⇒ Humidity=Normal | 4 | 100%
3. Outlook=Overcast ⇒ Play=Yes | 4 | 100%
4. Temperature=Cool Play=Yes ⇒ Humidity=Normal | 3 | 100%
...
58. Outlook=Sunny Temperature=Hot ⇒ Humidity=High | 2 | 100%
KDnuggets

51 Weka associations: output KDnuggets

52 Extensions and Challenges Extra quality measure: Lift –The lift of an association rule I => J is defined as: lift = P(J|I) / P(J), i.e. the ratio of confidence to expected confidence. Note: P(I) = (support of I) / (no. of transactions) –Interpretation: if lift > 1, then I and J are positively correlated; if lift < 1, then I and J are negatively correlated; if lift = 1, then I and J are independent Other measures of interestingness –e.g. A ⇒ B, B ⇒ C, but not A ⇒ C Efficient algorithms Known problem –What to do with all these rules? How to exploit them / make them useful / actionable? KDnuggets
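
Reusing the support and confidence helpers from the sketch after slide 41, lift is one extra line:

def lift(X, Y):
    """P(Y|X) / P(Y) = confidence(X => Y) / support(Y)."""
    return confidence(X, Y) / support(Y)

# With the toy transactions above: Bread => PeanutButter
# confidence = 0.75, support({PeanutButter}) = 0.6, so lift = 1.25 > 1
print(lift({"Bread"}, {"PeanutButter"}))   # positively correlated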

53 Biomedical Application Head and Neck Cancer Example
1. ace27=0 fiveyr=alive 381 ⇒ tumorbefore=0 372 conf:(0.98)
2. gender=M ace27=0 467 ⇒ tumorbefore=0 455 conf:(0.97)
3. ace27=0 588 ⇒ tumorbefore=0 572 conf:(0.97)
4. tnm=T0N0M0 ace27=0 405 ⇒ tumorbefore=0 391 conf:(0.97)
5. loc=LOC7 tumorbefore=0 409 ⇒ tnm=T0N0M0 391 conf:(0.96)
6. loc=LOC7 442 ⇒ tnm=T0N0M0 422 conf:(0.95)
7. loc=LOC7 gender=M tumorbefore=0 374 ⇒ tnm=T0N0M0 357 conf:(0.95)
8. loc=LOC7 gender=M 406 ⇒ tnm=T0N0M0 387 conf:(0.95)
9. gender=M fiveyr=alive 633 ⇒ tumorbefore=0 595 conf:(0.94)
10. fiveyr=alive 778 ⇒ tumorbefore=0 726 conf:(0.93)

54 Bioinformatics Application The idea of association rules has been customized for bioinformatics applications In biology it is often interesting to find frequent structures rather than items –For instance protein or other chemical structures Solution: mining frequent patterns –FSG (Kuramochi and Karypis, ICDM 2001) –gSpan (Yan and Han, ICDM 2002) –CloseGraph (Yan and Han, KDD 2002)

55 FSG: Mining Frequent Patterns

56

57 FSG Algorithm for finding frequent subgraphs

58 Frequent Subgraph Examples AIDS Data Compounds are active, inactive or moderately active (CA, CI, CM)

59 Predictive Subgraphs The three most discriminating sub-structures for the PTC, AIDS, and Anthrax datasets

60 Experiments and Results: AIDS Data

61 FSG References –Deshpande, M., Kuramochi, M., and Karypis, G. Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. ICDM 2003 –Kuramochi, M. and Karypis, G. An Efficient Algorithm for Discovering Frequent Subgraphs. IEEE TKDE –Deshpande, M., Kuramochi, M., and Karypis, G. Automated Approaches for Classifying Structures. BIOKDD 2002 –Kuramochi, M. and Karypis, G. Discovering Frequent Geometric Subgraphs. ICDM 2002 –Kuramochi, M. and Karypis, G. Frequent Subgraph Discovery. 1st IEEE Conference on Data Mining (ICDM) 2001

62 Recap Before Starting to Mine…. Descriptive Data Mining –Dimension Reduction & Projection –Clustering Hierarchical clustering K-means Self organizing maps –Association rules Frequent item sets Association Rules APRIORI Bio-informatics case: FSG for frequent subgraph discovery Next week –Bioinformatics Data Mining Cases / Lab Session / Take Home Exercise

