Concept Description and Data Generalization (based on the slides for the book Data Mining: Concepts and Techniques)

Two Categories of Data Mining
- Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, and discriminative forms.
- Predictive mining: builds models from the data and its analysis, and uses them to predict the trends and properties of unknown data.

What Is Concept Description?
- Concept description (or class description): generates descriptions for the characterization and comparison of data.
- Characterization: provides a concise and succinct summarization of the given collection of data.
- Class comparison (or discrimination): provides descriptions comparing two or more collections of data.

Data Generalization
- A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
- Approaches:
  - Data cube approach (OLAP approach)
  - Attribute-oriented induction approach

Concept Description vs. OLAP
- Similarities:
  - Data generalization
  - Presentation of data summarization at multiple levels of abstraction
  - Interactive drilling, pivoting, slicing, and dicing
- Differences (what concept description adds):
  - Handling of complex data types of the attributes and their aggregations
  - An automated process to find relevant attributes and the generalization degree
  - Dimension relevance analysis and ranking when there are many relevant dimensions

Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop).
- Not confined to categorical data or to particular measures.
- How is it done?
  - Collect the task-relevant data (initial relation) using a relational database query.
  - Perform data generalization by attribute removal or attribute generalization, based on the number of distinct values of each attribute.
  - Apply aggregation by merging identical generalized tuples and accumulating their respective counts.
  - Present the results interactively to users.

Basic Principles (1)
- Data focusing: select the task-relevant data, including dimensions; the result is the initial (working) relation.
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.

Basic Principles (2)
Two methods to control a generalization process:
- Attribute-threshold control: typically 2 to 8, user-specified or default. If the number of distinct values in an attribute is greater than the attribute threshold, then removal or generalization applies.
- Generalized-relation threshold control: sets a threshold on the size of the generalized (final) relation or rule. If the number of distinct tuples in the generalized relation is greater than the threshold, then further generalization applies.

Basic Principles (3)
- Accumulate count or other aggregate values to provide statistical information about the data at different levels of abstraction.
- Example: the count value of each tuple in the initial relation is 1. When the data are generalized, a group of n initial tuples that become identical is merged into a single generalized tuple whose count is n.

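The count accumulation described on this slide can be sketched in a few lines of Python. This is only an illustration: the tuples and their attribute values are hypothetical, and `Counter` is just one convenient way to merge identical tuples.

```python
from collections import Counter

# Hypothetical tuples after generalization: (major, birth_country, age_range).
# Each entry stands for one tuple of the initial relation (count = 1).
generalized_tuples = [
    ("Science", "Canada", "20-25"),
    ("Science", "Canada", "20-25"),
    ("Engineering", "Foreign", "25-30"),
    ("Science", "Canada", "20-25"),
]

# Merging identical generalized tuples accumulates their counts.
merged = Counter(generalized_tuples)

for generalized_tuple, count in merged.items():
    print(generalized_tuple, "count =", count)
```

The counts always add up to the size of the initial relation, which is what makes them usable as statistics at any abstraction level.
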
Basic Algorithm
1. InitialRel: query processing of task-relevant data, deriving the initial relation.
2. PreGen: based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal, or how high to generalize?
3. PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts.
4. Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, and visualization presentations.

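The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the book's algorithm: the function name `aoi`, the sample rows, and the one-level hierarchy for `major` are all hypothetical, only attribute-threshold control is modeled, and an attribute that cannot be brought under the threshold is simply removed.

```python
from collections import Counter

def aoi(rows, attributes, hierarchies, threshold):
    """Generalize or remove attributes, then merge identical tuples with counts."""
    rows = [dict(r) for r in rows]  # work on a copy of the initial relation
    kept = []
    for attr in attributes:
        parent = hierarchies.get(attr, {})  # value -> higher-level concept
        # Climb the hierarchy while the attribute has too many distinct values.
        while len({r[attr] for r in rows}) > threshold:
            lifted = [parent.get(r[attr], r[attr]) for r in rows]
            if lifted == [r[attr] for r in rows]:
                break  # no generalization operator left for this attribute
            for r, value in zip(rows, lifted):
                r[attr] = value
        # Keep the attribute only if it now satisfies the threshold.
        if len({r[attr] for r in rows}) <= threshold:
            kept.append(attr)
    counts = Counter(tuple(r[a] for a in kept) for r in rows)
    return kept, counts

# Hypothetical initial relation and concept hierarchy.
students = [
    {"name": "Jim", "gender": "M", "major": "CS"},
    {"name": "Scott", "gender": "M", "major": "Physics"},
    {"name": "Laura", "gender": "F", "major": "CS"},
    {"name": "Ann", "gender": "F", "major": "EE"},
]
major_hierarchy = {"CS": "Science", "Physics": "Science", "EE": "Engineering"}

kept, counts = aoi(students, ["name", "gender", "major"],
                   {"major": major_hierarchy}, threshold=2)
print(kept)
print(dict(counts))
```

In this toy run `name` is removed because it has many distinct values and no hierarchy, while `major` is generalized one level and kept.
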
Class Characterization: Example (1)
Describe general characteristics of graduate students in the Big-University database (in DMQL):
    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in "graduate"
Corresponding SQL statement:
    select name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in ('Msc', 'MBA', 'PhD')

Class Characterization: An Example (2) Initial Relation
Class Characterization: An Example (3) Prime Generalized Relation Cross-tab
Presentation of Generalized Results (1)
- Generalized relation: relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
- Cross tabulation: mapping results into cross-tabulation form (similar to contingency tables).
- Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms.

Presentation: Generalized Relation
Presentation: Crosstab
Presentation of Generalized Results (2)
- A generalized relation may also be represented in the form of logic rules.
- $C_j$ = target class.
- $q_a$ = a generalized tuple describing the target class.
- t-weight for $q_a$: the percentage of tuples of the target class from the initial working relation that are covered by $q_a$; range: [0, 1].
  $\text{t\_weight}(q_a) = \mathrm{count}(q_a) \,/\, \sum_{i=1}^{n} \mathrm{count}(q_i)$, where $q_1, \dots, q_n$ are the generalized tuples of the target class.

Presentation of Generalized Results (3)
- Quantitative characteristic rules: map the generalized result into characteristic rules with associated quantitative information:
  $\forall X,\ \text{target\_class}(X) \Rightarrow \text{condition}_1(X)\,[t: w_1] \lor \dots \lor \text{condition}_m(X)\,[t: w_m]$
- The disjunction of the conditions forms a necessary condition of the target class, i.e., all tuples of the target class must satisfy the condition.
- It is not a sufficient condition of the target class, since a tuple satisfying the same condition could belong to another class.

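As a rough illustration of how t-weights attach to a characteristic rule, the sketch below takes the t-weight of a generalized tuple to be its count divided by the total count of the target class's generalized tuples. The function name `t_weights` and the counts are hypothetical.

```python
def t_weights(counts):
    """Return the t-weight of each generalized tuple of the target class."""
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}

# Hypothetical counts of generalized tuples for one target class.
counts = {"major = Science": 60, "major = Engineering": 40, "major = Business": 20}
weights = t_weights(counts)

# Render the disjunction of conditions, each annotated with its t-weight.
rule = " OR ".join(f"{q} [t: {w:.2%}]" for q, w in weights.items())
print("target_class(X) =>", rule)
```

Because every tuple of the target class falls under exactly one generalized tuple, the t-weights of a rule sum to 1.
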
Attribute Relevance Analysis (1)
- Why?
  - Which dimensions should be included?
  - How high a level of generalization?
  - Automatic vs. interactive
  - Reduce the number of attributes; patterns become easier to understand
- What?
  - A statistical method for preprocessing data:
    - filter out irrelevant or weakly relevant attributes
    - retain or rank the relevant attributes
  - Relevance is related to dimensions and levels
  - Analytical characterization, analytical comparison

Attribute Relevance Analysis (2)
How?
1. Data collection
2. Preliminary relevance analysis using conservative AOI
3. Analytical generalization
   - Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
   - Sort and select the most relevant dimensions and levels.
4. Attribute-oriented induction for class description
   - Use a less conservative threshold for AOI.

Relevance Measures
- Quantitative relevance measure: determines the classifying power of an attribute within a set of data.
- Methods:
  - information gain (ID3)
  - gain ratio (C4.5)
  - Gini index
  - $\chi^2$ contingency table statistics
  - uncertainty coefficient

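Entropy and Information Gain
- S contains $s_i$ tuples of class $C_i$ for $i = 1, \dots, m$, with $s = \sum_{i=1}^{m} s_i$.
- Entropy, or expected information, measures the information required to classify an arbitrary tuple:
  $I(s_1, \dots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
- Entropy of attribute A with values $\{a_1, a_2, \dots, a_v\}$, where $s_{ij}$ is the number of tuples of class $C_i$ with $A = a_j$:
  $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s}\, I(s_{1j}, \dots, s_{mj})$
- Information gained by branching on attribute A:
  $\mathrm{Gain}(A) = I(s_1, \dots, s_m) - E(A)$
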
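The three quantities on this slide can be sketched in Python, assuming the usual ID3 definitions: expected information is the entropy of the class counts, the entropy of an attribute is the weighted average over its value partitions, and the gain is their difference. The function names and the toy counts are hypothetical.

```python
from math import log2

def info(class_counts):
    """Expected information I(s_1, ..., s_m) for a list of per-class counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)

def attribute_entropy(partitions):
    """E(A): each partition holds the per-class counts for one value of A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Information gained by branching on the attribute."""
    return info(class_counts) - attribute_entropy(partitions)

# Toy data: 4 tuples of each class; every partition must be non-empty.
perfect_split = gain([4, 4], [[4, 0], [0, 4]])   # attribute separates the classes
useless_split = gain([4, 4], [[2, 2], [2, 2]])   # attribute tells us nothing
print(perfect_split, useless_split)
```

An attribute whose values separate the classes completely gains the full entropy of the set, while one whose partitions mirror the overall class mix gains nothing.
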
Example of Analytical Characterization (1)
- Task: mine general characteristics describing graduate students using analytical characterization.
- Given:
  - attributes name, gender, major, birth_place, birth_date, phone#, and gpa
  - $Gen(a_i)$ = concept hierarchies on $a_i$
  - $U_i$ = attribute analytical thresholds for $a_i$
  - $T_i$ = attribute generalization thresholds for $a_i$
  - $R$ = attribute relevance threshold

Example of Analytical Characterization (2)
1. Data collection
   - target class: graduate student
   - contrasting class: undergraduate student
2. Analytical generalization using $U_i$
   - attribute removal: remove name and phone#
   - attribute generalization: generalize major, birth_place, birth_date, and gpa; accumulate counts
   - candidate relation: gender, major, birth_country, age_range, and gpa

Example: Analytical Characterization (3)
- Candidate relation for target class: graduate students (total count = 120)
- Candidate relation for contrasting class: undergraduate students (total count = 130)

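Example: Analytical Characterization (4)
3. Relevance analysis
   - Calculate the expected information required to classify an arbitrary tuple:
     $I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} \approx 0.9988$
   - Calculate the entropy of each attribute, e.g., major. For major = "Science", this uses the number of grad students in "Science" and the number of undergrad students in "Science".
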
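The expected information for this example can be checked numerically from the two class totals given earlier (120 graduate and 130 undergraduate students). The helper name `info` is hypothetical; the per-major counts are not reproduced here, so only the class-level value is computed.

```python
from math import log2

def info(class_counts):
    """Expected information needed to classify a tuple, given per-class counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)

# 120 graduate students (target class) and 130 undergraduates (contrasting class).
expected_info = info([120, 130])
print(round(expected_info, 4))
```

The value is close to 1 bit because the two classes are almost evenly balanced.
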
Example: Analytical Characterization (5)
- Calculate the expected information required to classify a given sample if S is partitioned according to the attribute.
- Calculate the information gain for each attribute.
- Information gain for all attributes

Example: Analytical Characterization (6)
4. Initial working relation ($W_0$) derivation
   - $R = 0.1$
   - remove irrelevant or weakly relevant attributes from the candidate relation => drop gender, birth_country
   - remove the contrasting class candidate relation
5. Perform attribute-oriented induction on $W_0$ using $T_i$
Initial target class working relation $W_0$: graduate students

Mining Class Comparisons
- Comparison: comparing two or more classes.
- Method:
  - Partition the set of relevant data into the target class and the contrasting class(es).
  - Generalize both classes to the same high-level concepts.
  - Compare tuples with the same high-level descriptions.
  - Present for every tuple its description and two measures:
    - support: distribution within a single class
    - comparison: distribution between classes
  - Highlight the tuples with strong discriminant features.
- Relevance analysis: find attributes (features) that best distinguish different classes.

Quantitative Discriminant Rules
- $C_j$ = target class.
- $q_a$ = a generalized tuple that covers some tuples of the target class, but can also cover some tuples of the contrasting class(es).
- d-weight; range: [0, 1]:
  $\text{d\_weight} = \mathrm{count}(q_a \in C_j) \,/\, \sum_{i=1}^{m} \mathrm{count}(q_a \in C_i)$
- Quantitative discriminant rule form:
  $\forall X,\ \text{target\_class}(X) \Leftarrow \text{condition}(X)\ [d: \text{d\_weight}]$

Example (1)
Compare the general properties of the graduate students and the undergraduate students in the Big-University database, given the attributes name, gender, etc. (in DMQL):
    use Big_University_DB
    mine comparison as "Grad-vs-Undergrad"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    for "graduate_students"
    where status in "graduate"
    versus "undergraduate_students"
    where status in "undergraduate"
    analyze count%
    from student

Example (2)
- Count distribution between graduate and undergraduate students for a generalized tuple
- Quantitative discriminant rule for that tuple, where the d-weight is 90/(90+210) = 30%

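The d-weight arithmetic on this slide can be reproduced with a short sketch. The function name `d_weight` is hypothetical; the counts 90 and 210 are the ones quoted above for the graduate and undergraduate classes.

```python
def d_weight(target_count, contrasting_counts):
    """Share of the tuples covered by a generalized tuple that fall in the target class."""
    return target_count / (target_count + sum(contrasting_counts))

# The generalized tuple covers 90 graduate and 210 undergraduate students.
grad_weight = d_weight(90, [210])
undergrad_weight = d_weight(210, [90])
print(f"graduate d-weight: {grad_weight:.0%}, undergraduate d-weight: {undergrad_weight:.0%}")
```

A low d-weight such as this one means the tuple is a weak discriminator for the target class, since most of the tuples it covers belong to the contrasting class.
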
Class Description
- Quantitative characteristic rule: necessary
- Quantitative discriminant rule: sufficient
- Quantitative description rule: necessary and sufficient

Example: Quantitative Description Rule
- Crosstab showing associated t-weight and d-weight values and the total number (in thousands) of TVs and computers sold at AllElectronics in 1998
- Quantitative description rule for target class Europe

Bibliography
- (Book) J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001 (Chapter 5 in the 2001 book; Section 3.7 in the draft).
- (Book) T. Mitchell, Machine Learning, McGraw-Hill, 1997 (Section 3.4).

Information-Theoretic Approach
- Decision tree:
  - each internal node tests an attribute
  - each branch corresponds to an attribute value
  - each leaf node assigns a classification
- ID3 algorithm:
  - builds a decision tree from training objects with known class labels, in order to classify testing objects
  - ranks attributes with the information gain measure
  - aims for minimal height: the least number of tests to classify an object

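A compact sketch of ID3-style top-down induction is given below. It is an illustration rather than a full implementation (no pruning, no handling of unseen values), and the 14 training rows are the commonly cited PlayTennis examples from Mitchell's book, reproduced here as an assumption rather than taken from these slides.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
ROWS = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis); assumed data set
    ("sunny", "hot", "high", "weak", "no"), ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"), ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rain", "mild", "high", "strong", "no"),
]
DATA = [dict(zip(ATTRS + ["PlayTennis"], row)) for row in ROWS]

def entropy(rows):
    """Entropy of the class label over a non-empty list of rows."""
    counts = Counter(r["PlayTennis"] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    """Information gained by splitting rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = {r["PlayTennis"] for r in rows}
    if len(labels) == 1:
        return labels.pop()
    if not attrs:  # nothing left to test: fall back to the majority class
        return Counter(r["PlayTennis"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest)
                   for v in sorted({r[best] for r in rows})}}

tree = id3(DATA, ATTRS)
print(tree)
```

Choosing the highest-gain attribute at each node is a greedy heuristic: it tends to produce shallow trees but does not guarantee the minimal-height tree.
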
2003/04 Decision Support Systems (LEIC Tagus)
Top-Down Induction of Decision Tree
- Attributes = {Outlook, Temperature, Humidity, Wind}
- PlayTennis = {yes, no}
- Tree:
  Outlook
    sunny -> Humidity
      high -> no
      normal -> yes
    overcast -> yes
    rain -> Wind
      strong -> no
      weak -> yes
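
Classifying an object with such a tree amounts to following one branch per test until a leaf is reached. The sketch below encodes the PlayTennis tree from this slide as a nested dictionary (an assumed reading of the figure: sunny leads to a Humidity test, overcast to yes, rain to a Wind test); the names `TREE` and `classify` are hypothetical.

```python
# Internal nodes are {attribute: {value: subtree}}; leaves are class labels.
TREE = {
    "Outlook": {
        "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"Wind": {"strong": "no", "weak": "yes"}},
    }
}

def classify(tree, example):
    """Walk from the root to a leaf, testing one attribute per internal node."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))             # the attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

day = {"Outlook": "sunny", "Temperature": "hot", "Humidity": "high", "Wind": "weak"}
print(classify(TREE, day))
```

Temperature never appears in the tree, so it is ignored during classification, and no object needs more than two tests.
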