The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.

Mining Biological Data KU EECS 800, Luke Huan, Fall’06 8/30/2006 Frequent Patterns

Slide 2: Administrative
- No class next Monday (Labor Day holiday)

Slide 3: Outline for today
- Mining with constraints
- Summary

Slide 4: Maximal & Closed Patterns
- Goal: reduce the number of patterns
- Maximal patterns: the boundary between frequent and infrequent patterns
- Closed patterns: an information-lossless compression of the frequent patterns

Slide 5: Borders of Frequent Itemsets
- Connected: if X and Y are frequent and X is an ancestor of Y, then all patterns between X and Y are frequent
- Itemset lattice over {a, b, c, d} (figure), level by level:
  ∅
  a  b  c  d
  ab  ac  ad  bc  bd  cd
  abc  abd  acd  bcd
  abcd

Slide 6: Closed and Maximal Patterns
- Solution: mine closed patterns and max-patterns
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT’99)
  - Closed patterns are a lossless compression of the frequent patterns
  - Reduces the number of patterns and rules
- An itemset X is maximal if X is frequent and there exists no super-pattern Y ⊃ X such that Y is frequent
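
The two definitions above can be checked mechanically on a toy example. The following is a minimal brute-force sketch, not an efficient miner; the two-transaction database `db` and the helper names are hypothetical and do not come from the slides.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Brute-force enumeration: map each frequent itemset to its support count."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            sup = sum(1 for t in db if s <= t)
            if sup >= min_sup:
                result[s] = sup
    return result

def closed_itemsets(freq):
    """Closed: no proper superset has the same support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}

def maximal_itemsets(freq):
    """Maximal: no proper superset is frequent."""
    return {x: s for x, s in freq.items() if not any(x < y for y in freq)}

# Hypothetical toy database (an assumption for illustration only).
db = [frozenset("abc"), frozenset("ab")]
freq = frequent_itemsets(db, min_sup=1)
closed = closed_itemsets(freq)
maximal = maximal_itemsets(freq)
```

On this toy input every maximal itemset is also closed, and the closed itemsets together with their supports are enough to recover the support of every frequent itemset, which is the sense in which the compression is lossless.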

Slide 7: Closed Patterns and Max-Patterns
Exercise. DB = {, }, min_sup = 1.
- What is the set of all frequent patterns? : 2, : 2, : 1, : 2, : 1, : 1, : 1
- What is the set of max-patterns? : 1
- What is the set of closed itemsets? : 1, : 2

Slide 8: Example of Vertical Data Format

Slide 9: Frequent, Closed and Maximal Itemsets
[Table: example database with columns Transaction and Items]

Slide 10: Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant@SIGMOD96)
3. Clustering: distance-based association (e.g., Yang & Miller@SIGMOD97); one-dimensional clustering, then association
4. Deviation (e.g., Aumann and Lindell@KDD99): Sex = female => Wage: mean = $7/hr (overall mean = $9)

Slide 11: Which Measures Should Be Used?
- lift and χ² are not good measures for correlations in large transactional DBs
- all-conf or coherence could be good measures (Omiecinski@TKDE’03)
- Both all-conf and coherence have the downward closure property
- Efficient algorithms can be derived for mining them (Lee et al. @ICDM’03sub)
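
One commonly cited reason lift misbehaves on large transactional data is that it changes when transactions containing none of the items of interest are added. The sketch below illustrates this on hypothetical data. It assumes all-conf is sup(X) divided by the largest single-item support in X, and takes coherence to be the Jaccard-style ratio of sup(X) to the number of transactions containing any item of X; the function names and both databases are made up for illustration.

```python
def support_count(db, itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def lift(db, a, b):
    """P(A and B) / (P(A) * P(B)), estimated from transaction counts."""
    n = len(db)
    p_ab = support_count(db, a | b) / n
    return p_ab / ((support_count(db, a) / n) * (support_count(db, b) / n))

def all_confidence(db, itemset):
    """sup(X) divided by the largest single-item support in X (assumed definition)."""
    return support_count(db, itemset) / max(
        support_count(db, frozenset([i])) for i in itemset)

def coherence(db, itemset):
    """sup(X) divided by the number of transactions containing any item of X (assumed definition)."""
    touched = sum(1 for t in db if t & itemset)
    return support_count(db, itemset) / touched

a, b = frozenset("a"), frozenset("b")
# Hypothetical data: 20 transactions involving a and/or b ...
core = [a | b] * 10 + [a] * 5 + [b] * 5
# ... and the same data padded with 980 transactions containing neither item.
padded = core + [frozenset("z")] * 980
```

With these inputs lift moves from below 1 on `core` to far above 1 on `padded`, while all-conf and coherence stay the same, since neither depends on the total number of transactions.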

Slide 12: Knowledge Discovery (KDD) Process
- Preprocessing: identify the nature of the data
  - Learning the application domain: relevant prior knowledge and goals of the application
  - Creating a target data set: sampling
  - Data cleaning and preprocessing (may take 60% of the effort!)
  - Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Data mining: identify the structure of the data
  - Computational task: pattern discovery, classification, clustering, etc.
  - Subdivide computational tasks into computational components: choosing the mining algorithm(s)
- Result evaluation and presentation
  - Visualization
  - Hypothesis generation
  - Prediction

Slide 13: KDD Process: Several Key Steps
- Data mining: core of the knowledge discovery process
- [Figure: pipeline with the stages Data Cleaning, Data Integration, Databases, Data Warehouse, Task-relevant Data, Selection, Data Mining, Pattern Evaluation]

Slide 14: Constraint-based (Query-Directed) Mining
- Finding all the patterns in a database autonomously? Unrealistic!
  - The patterns could be too many and not focused
- Data mining should be an interactive process
  - The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
  - User flexibility: the user provides constraints on what is to be mined
  - System optimization: the system exploits such constraints for efficient mining

Slide 15: Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint, using SQL-like queries: find product pairs sold together in stores in Chicago in Dec.’02
- Dimension/level constraint: in relevance to region, price, brand, customer category
- Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint: strong rules, e.g., min_support ≥ 3%, min_confidence ≥ 60%

Slide 16: Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning
  - Both aim at reducing the search space
  - Finding all patterns satisfying the constraints vs. finding some (or one) answer in constraint-based search in AI
  - Constraint-pushing vs. heuristic search
  - How to integrate them is an interesting research problem
- Constrained mining vs. query processing in DBMS
  - Database query processing requires finding all answers
  - Constrained pattern mining shares a similar philosophy with pushing selections deep into query processing

Slide 17: How to Formalize Constraints
- Anti-monotonicity
- Monotonicity
- Succinctness

Slide 18: Anti-Monotonicity in Constraint Pushing
- Anti-monotonicity: when an itemset S violates the constraint, so does every superset of S
  - sum(S.Price) ≤ v is anti-monotone (for non-negative prices)
  - sum(S.Price) ≥ v is not anti-monotone
- Example. C: range(S.profit) ≤ 15 is anti-monotone
  - Itemset ab violates C
  - So does every superset of ab

TDB (min_sup = 2)
TID | Transaction
10 | a, b, c, d, f
20 | b, c, d, f, g, h
30 | a, c, d, e, f
40 | c, e, f, g

Item | Profit
a | 40
b | 0
c | -20
d | 10
e | -30
f | 30
g | 20
h | -10
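
The range example can be verified directly against the profit table. The sketch below is only an illustration: `PROFIT` copies the slide's item profits, while the helper names are made up here.

```python
from itertools import combinations

# Item profits copied from the slide's table.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def profit_range(itemset):
    """max(profit) - min(profit) over the items of the itemset."""
    values = [PROFIT[i] for i in itemset]
    return max(values) - min(values)

def satisfies_c(itemset, v=15):
    """Constraint C: range(S.profit) <= v."""
    return profit_range(itemset) <= v

def supersets(itemset, universe):
    """Yield every proper superset of `itemset` drawn from `universe`."""
    rest = sorted(set(universe) - set(itemset))
    for k in range(1, len(rest) + 1):
        for extra in combinations(rest, k):
            yield frozenset(itemset) | frozenset(extra)

ab = frozenset("ab")
ab_ok = satisfies_c(ab)                                          # range(ab) = 40 - 0
any_superset_ok = any(satisfies_c(s) for s in supersets(ab, PROFIT))
```

Adding items can only widen the range, so once `ab` fails the check none of its 63 supersets can pass, and a miner may discard the whole branch.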

Slide 19: Monotonicity for Constraint Pushing
- Monotonicity: when an itemset S satisfies the constraint, so does every superset of S
  - sum(S.Price) ≥ v is monotone (for non-negative prices)
  - min(S.Price) ≤ v is monotone
- Example. C: range(S.profit) ≥ 15
  - Itemset ab satisfies C
  - So does every superset of ab
- TDB (min_sup = 2) and the item profit table are the same as on slide 18
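
Monotonicity of a given constraint can also be tested by brute force over a small item universe. The following sketch assumes the profit table from the slides; `is_monotone` and the four example constraints are illustrative names. It also shows why the sum constraint needs non-negative values: on the raw profits, which include negatives, it is not monotone.

```python
from itertools import combinations

# Item profits copied from the slide's table.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
ITEMS = sorted(PROFIT)

def nonempty_subsets(items):
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def is_monotone(constraint):
    """Brute force: whenever S satisfies, every one-item extension must too.
    Checking one-item extensions is enough, by induction on superset size."""
    return all(constraint(s | {x})
               for s in nonempty_subsets(ITEMS) if constraint(s)
               for x in ITEMS if x not in s)

def value_range(s):
    vals = [PROFIT[i] for i in s]
    return max(vals) - min(vals)

def range_ge_15(s):
    return value_range(s) >= 15

def min_le_0(s):
    return min(PROFIT[i] for i in s) <= 0

def sum_ge_15(s):
    # Raw profits include negative values.
    return sum(PROFIT[i] for i in s) >= 15

def abs_sum_ge_15(s):
    # Same threshold on non-negative values (absolute profits, for illustration).
    return sum(abs(PROFIT[i]) for i in s) >= 15
```

For instance, {a} has sum 40 but {a, e} has sum 10, so `sum_ge_15` fails the check on the raw profits while `abs_sum_ge_15` passes it.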

Slide 20: Succinctness
- Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database
- min(S.Price) ≤ v is succinct
- sum(S.Price) ≥ v is not succinct
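
For min(S.Price) ≤ v, succinctness means the satisfying itemsets can be generated directly from item prices instead of being tested one by one. A minimal sketch under assumed data: the price table `PRICE` and threshold `V` below are hypothetical.

```python
from itertools import combinations

# Hypothetical item prices and threshold (assumptions for illustration).
PRICE = {"a": 1, "b": 3, "c": 8, "d": 12}
V = 3

def powerset(items, min_size=0):
    """All subsets of `items` with at least `min_size` elements."""
    items = sorted(items)
    for k in range(min_size, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

# A1: the items that satisfy the constraint on their own.
a1 = {i for i, p in PRICE.items() if p <= V}
rest = set(PRICE) - a1

# Generate: a non-empty part from A1 combined with any part of the other items.
generated = {core | extra for core in powerset(a1, 1) for extra in powerset(rest)}

# Cross-check by testing every non-empty itemset against the constraint.
checked = {s for s in powerset(PRICE, 1) if min(PRICE[i] for i in s) <= V}
```

The generated family and the tested family coincide, and no transaction is consulted at any point, which is the property the slide calls succinctness.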

Slide 21: The Apriori Algorithm: Example
[Figure: Apriori run on Database D: scan D to obtain C1 and L1, then C2, L2, C3, L3; Price column: 1, 1, 1, 1, 10]

Slide 22: Naïve Algorithm: Apriori + Constraint
- Constraint: Sum{S.price} < 5
[Figure: Apriori run on Database D: scan D to obtain C1 and L1, then C2, L2, C3, L3; Price column: 1, 1, 1, 1, 10]

Slide 23: The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
- Constraint: Sum{S.price} < 5
[Figure: Apriori run on Database D: scan D to obtain C1 and L1, then C2, L2, C3, L3; Price column: 1, 1, 1, 1, 10]
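
The difference between the naïve approach (mine first, filter afterwards) and pushing the constraint into candidate generation can be sketched as follows. The transactions in `DB`, the price table `PRICE`, and the function `apriori` are hypothetical stand-ins, since the slide's database did not survive in this transcript; the constraint is Sum{S.price} < 5 as on the slide.

```python
from itertools import combinations

def apriori(db, min_sup, constraint=None):
    """Level-wise Apriori. If `constraint` (assumed anti-monotone) is given,
    violating candidates are dropped before their support is counted.
    Returns (frequent itemsets with supports, number of candidates counted)."""
    counted = 0
    items = sorted({i for t in db for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        if constraint is not None:
            candidates = [c for c in candidates if constraint(c)]
        counted += len(candidates)
        level = {}
        for c in candidates:
            sup = sum(1 for t in db if c <= t)
            if sup >= min_sup:
                level[c] = sup
        frequent.update(level)
        # Join step: size-(k+1) unions whose size-k subsets all survived.
        k = len(next(iter(level))) + 1 if level else 0
        next_cands = set()
        for x in level:
            for y in level:
                u = x | y
                if len(u) == k and all(frozenset(s) in level
                                       for s in combinations(u, k - 1)):
                    next_cands.add(u)
        candidates = sorted(next_cands, key=sorted)
    return frequent, counted

# Hypothetical database and prices (assumptions, not the slide's data).
DB = [frozenset({1, 3, 4}), frozenset({2, 3, 5}),
      frozenset({1, 2, 3, 5}), frozenset({2, 5})]
PRICE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 10}

def cheap(itemset):
    """Constraint: Sum{S.price} < 5."""
    return sum(PRICE[i] for i in itemset) < 5

all_frequent, naive_counted = apriori(DB, min_sup=2)
naive = {s: sup for s, sup in all_frequent.items() if cheap(s)}
pushed, pushed_counted = apriori(DB, min_sup=2, constraint=cheap)
```

Both runs return the same constrained result, but the pushed version counts support for fewer candidates because item 5 and everything built on it is discarded before any database scan.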

Slide 24: The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
- Constraint: min{S.price} <= 1
- Figure annotation: not immediately to be used
[Figure: Apriori run on Database D: scan D to obtain C1 and L1, then C2, L2, C3, L3; Price column: 1, 1, 1, 1, 10]

Slide 25: Converting “Tough” Constraints
- Convert tough constraints into anti-monotone or monotone ones by properly ordering the items
- Examine C: avg(S.profit) ≥ 25
  - Order items in value-descending order: a, f, g, d, b, h, c, e
  - If an itemset afb violates C, so do afbh and afb* (every itemset with afb as a prefix)
  - C becomes anti-monotone!
- TDB (min_sup = 2) and the item profit table are the same as on slide 18
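
Once items are visited in value-descending order, a violating prefix can be cut off without losing any satisfying itemset. The sketch below enumerates itemsets by prefix extension under that order and compares the result with exhaustive testing. It ignores support counting entirely, and `grow`, `found`, and `visited` are illustrative names rather than anything defined in the slides.

```python
from itertools import combinations

# Item profits copied from the slide's table.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
# Value-descending item order: a, f, g, d, b, h, c, e
ORDER = sorted(PROFIT, key=PROFIT.get, reverse=True)

def satisfies(itemset, v=25):
    """Constraint C: avg(S.profit) >= v."""
    return sum(PROFIT[i] for i in itemset) / len(itemset) >= v

def grow(prefix, start, found, visited):
    """Extend `prefix` only with items that come later in ORDER."""
    for j in range(start, len(ORDER)):
        candidate = prefix + [ORDER[j]]
        visited.append(candidate)
        if satisfies(candidate):
            found.add(frozenset(candidate))
            grow(candidate, j + 1, found, visited)
        # Otherwise every itemset with `candidate` as a prefix also violates C,
        # so the whole subtree is skipped.

found, visited = set(), []
grow([], 0, found, visited)

# Exhaustive cross-check over all 255 non-empty itemsets.
brute = {frozenset(c) for k in range(1, len(ORDER) + 1)
         for c in combinations(ORDER, k) if satisfies(c)}
```

The pruned search finds exactly the itemsets that exhaustive testing finds while visiting only a fraction of the 255 candidates.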

Slide 26: Strongly Convertible Constraints
- avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: a, f, g, d, b, h, c, e
  - If an itemset af violates a constraint C, so does every itemset with af as a prefix, such as afd
- avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R⁻¹: e, c, h, b, d, g, f, a
  - If an itemset d satisfies a constraint C, so do the itemsets df and dfa, which have d as a prefix
- Thus, avg(X) ≥ 25 is strongly convertible
- TDB (min_sup = 2) and the item profit table are the same as on slide 18
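
Both halves of the claim can be checked exhaustively on the eight-item profit table. In the sketch below, `holds_along_order` is a hypothetical helper that tests whether a constraint behaves monotonically or anti-monotonically when itemsets are extended only with items that come later in a given order.

```python
from itertools import combinations

# Item profits copied from the slide's table.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def avg_at_least(itemset, v=25):
    return sum(PROFIT[i] for i in itemset) / len(itemset) >= v

def holds_along_order(order, constraint, monotone):
    """monotone=True: every satisfying prefix must stay satisfying when extended.
    monotone=False: every violating prefix must stay violating when extended.
    Extensions only use items that come later in `order`."""
    for k in range(1, len(order)):
        for idx in combinations(range(len(order)), k):
            prefix = [order[i] for i in idx]
            sat = constraint(prefix)
            if sat != monotone:
                continue  # this prefix is not the kind the property talks about
            for j in range(idx[-1] + 1, len(order)):
                if constraint(prefix + [order[j]]) != sat:
                    return False
    return True

descending = sorted(PROFIT, key=PROFIT.get, reverse=True)  # order R
ascending = descending[::-1]                               # order R^-1

anti_monotone_desc = holds_along_order(descending, avg_at_least, monotone=False)
monotone_asc = holds_along_order(ascending, avg_at_least, monotone=True)
anti_monotone_asc = holds_along_order(ascending, avg_at_least, monotone=False)
```

The first two checks succeed, which is what strong convertibility asks for. The third fails, for example because g alone violates the constraint while g followed by a satisfies it, so the choice of order is what makes the property hold.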

Slide 27: Can Apriori Handle Convertible Constraints?
- A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  - Within the level-wise framework, no direct pruning based on the constraint can be made
  - Itemset df violates constraint C: avg(X) >= 25
  - Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
- But it can be pushed into the frequent-pattern growth framework!
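
The df/adf example can be replayed with the profit values from the earlier slides. This is a small sketch with illustrative helper names; it also shows why prefix growth in value-descending order does not need df at all.

```python
# Item profits copied from the slides' table.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def avg_profit(itemset):
    return sum(PROFIT[i] for i in itemset) / len(itemset)

def satisfies(itemset, v=25):
    """Constraint C: avg(X) >= v."""
    return avg_profit(itemset) >= v

df_ok = satisfies("df")     # avg = (10 + 30) / 2 = 20
adf_ok = satisfies("adf")   # avg = (40 + 10 + 30) / 3

# In value-descending order, adf is reached through the prefixes a, af, afd.
ordered = sorted("adf", key=PROFIT.get, reverse=True)
prefixes_ok = all(satisfies(ordered[:k]) for k in range(1, len(ordered) + 1))
```

Apriori would need the violating subset df to generate adf, whereas a pattern-growth miner that follows the descending order reaches adf through prefixes that all satisfy C.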

Slide 28: Handling Multiple Constraints
- Different constraints may require different or even conflicting item orderings
- If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
- If the item orders conflict
  - Try to satisfy one constraint first
  - Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

Slide 29: What Constraints Are Convertible?

Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤, ≥ v | Yes | Yes | Yes
median(S) ≤, ≥ v | Yes | Yes | Yes
sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes | No | No
sum(S) ≤ v (items could be of any value, v ≤ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≥ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes | No | No
……

Slide 30: Constraint-Based Mining: A General Picture

Constraint | Antimonotone | Monotone | Succinct
v ∈ S | no | yes | yes
S ⊇ V | no | yes | yes
S ⊆ V | yes | no | yes
min(S) ≤ v | no | yes | yes
min(S) ≥ v | yes | no | yes
max(S) ≤ v | yes | no | yes
max(S) ≥ v | no | yes | yes
count(S) ≤ v | yes | no | weakly
count(S) ≥ v | no | yes | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
support(S) ≥ ξ | yes | no | no
support(S) ≤ ξ | no | yes | no

Slide 31: A Classification of Constraints
[Figure: diagram of the constraint classes: Antimonotone, Monotone, Succinct, Convertible anti-monotone, Convertible monotone, Strongly convertible, Inconvertible]

Slide 32: Frequent-Pattern Mining: Research Problems
- Mining fault-tolerant frequent patterns
  - Patterns allow limited faults (insertion, deletion, mutation)
- Mining truly interesting patterns
  - Surprising, novel, concise, …
- Theoretical foundation of patterns
  - For compressing data?
  - For classification analysis?
- Application exploration
  - Pattern discovery in molecule structures
  - Pattern discovery in bionetworks

Slide 33: Mining Biological Data

Increasing potential to support business decisions (layers listed top to bottom):
- Decision Making
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Summary, Querying, and Reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
- Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Increasing potential to support biological discoveries (layers listed top to bottom):
- Hypothesis testing
- Hypothesis generation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Summary, Querying, and Reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: Paper, Microarray, Bio-molecule structures, Mass Spectrometry data, …
- Roles, top to bottom: End User, Domain expert Analyst, Data Analyst, DBA

Slide 34: Bio-Data Mining: Classification Schemes
- Different views lead to different classifications
  - Knowledge view: kinds of knowledge to be discovered
  - Data view: kinds of data to be mined
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted

Slide 35: Summary
- Constrained itemset mining and association rules
- The data mining process
