Slides are based on Negnevitsky, Pearson Education

Lecture 14: Data mining and knowledge discovery
• Introduction, or what is data mining?
• Data warehouse and query tools
• Decision trees
• Case study: Profiling people with high blood pressure
• Summary

What is data mining?
• Data is what we collect and store, and knowledge is what helps us to make informed decisions.
• The extraction of knowledge from data is called data mining.
• Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
• The ultimate goal of data mining is to discover knowledge.

Why data mining?
• The explosive growth of data: from terabytes to petabytes
  – Data collection and data availability
    » Automated data collection tools, database systems, Web, computerized society
  – Major sources of abundant data
    » Business: Web, e-commerce, transactions, stocks, …
    » Science: remote sensing, bioinformatics, scientific simulation, …
    » Society and everyone: news, digital cameras, YouTube
• knowledge!

Why Not Traditional Data Analysis?
• Tremendous amount of data
  – Algorithms must be highly scalable to handle volumes such as terabytes of data
• High dimensionality of data
  – A micro-array may have tens of thousands of dimensions
• High complexity of data
  – Data streams and sensor data
  – Time-series data, temporal data, sequence data
  – Structured data, graphs, social networks and multi-linked data
  – Heterogeneous databases and legacy databases
  – Spatial, spatiotemporal, multimedia, text and Web data
  – Software programs, scientific simulations
• New and sophisticated applications

Knowledge Discovery (KDD) Process
• Data mining: the core of the knowledge discovery process
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

KDD Process: Several Key Steps
• Learning the application domain
  – Relevant prior knowledge and goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation
  – Find useful features, dimensionality/variable reduction, invariant representation
• Choosing the functions of data mining
  – Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  – Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the centre, drawing on the following disciplines]
• Database Technology
• Statistics
• Machine Learning
• Pattern Recognition
• Algorithm
• Visualization
• Other Disciplines

Architecture: Typical Data Mining System
[Figure: architecture diagram with the following components]
• Graphical User Interface
• Pattern Evaluation
• Data Mining Engine
• Knowledge Base
• Database or Data Warehouse Server
• Data cleaning, integration, and selection
• Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories

Data Mining Functionalities (1)
• Frequent patterns, association, correlation vs. causality
  – Diaper → Beer [0.5%, 75%] (Correlation or causality?)
• Classification and prediction
  – Construct models (functions) that describe and distinguish classes or concepts for future prediction
    » E.g., classify countries based on climate, or classify cars based on gas mileage
  – Predict some unknown or missing numerical values

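The bracketed pair attached to the diaper and beer rule is conventionally read as [support, confidence]: the share of all transactions that contain both items, and the share of diaper transactions that also contain beer. The following is a minimal sketch of that calculation. The function name `support_confidence` and the eight transactions are hypothetical illustrations, not data from the slides.

```python
from typing import FrozenSet, List, Tuple


def support_confidence(
    transactions: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    # Transactions containing every item of the antecedent.
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions containing both the antecedent and the consequent.
    with_both = sum(
        1 for t in transactions if antecedent <= t and consequent <= t
    )
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical market-basket data, invented for illustration only.
transactions = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"beer", "chips"}),
    frozenset({"diaper", "beer", "chips"}),
    frozenset({"milk", "bread"}),
    frozenset({"diaper", "milk"}),
    frozenset({"bread", "chips"}),
]

support, confidence = support_confidence(
    transactions, frozenset({"diaper"}), frozenset({"beer"})
)
print(f"support={support:.3f}, confidence={confidence:.3f}")
```

A high confidence only says that the two items co-occur; it does not by itself establish that buying one causes the purchase of the other, which is the question the slide raises.
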
Data Mining Functionalities (2)
• Cluster analysis
  – Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  – Maximizing intra-class similarity and minimizing inter-class similarity
• Outlier analysis
  – Outlier: a data object that does not comply with the general behavior of the data
  – Noise or exception? Useful in fraud detection and rare-event analysis

Data Mining Functionalities (3)
• Trend and evolution analysis
  – Trend and deviation: e.g., regression analysis
  – Sequential pattern mining: e.g., digital camera → large SD memory
  – Periodicity analysis
  – Similarity-based analysis
• Other pattern-directed or statistical analyses

Top-10 Most Popular DM Algorithms: 18 Identified Candidates
• Classification
  – #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann.
  – #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth.
  – #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. Discriminant Adaptive Nearest Neighbor Classification. TPAMI 18(6).
  – #4. Naive Bayes: Hand, D. J. and Yu, K. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69.
• Statistical Learning
  – #5. SVM: Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag.
  – #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
• Association Analysis
  – #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  – #8. FP-Tree: Han, J., Pei, J., and Yin, Y. Mining frequent patterns without candidate generation. In SIGMOD '00.
• Link Mining
  – #9. PageRank: Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine. In WWW-7.
  – #10. HITS: Kleinberg, J. M. Authoritative sources in a hyperlinked environment. SODA, 1998.
• Clustering
  – #11. K-Means: MacQueen, J. B. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability.
  – #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
• Bagging and Boosting
  – #13. AdaBoost: Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997).
• Sequential Patterns
  – #14. GSP: Srikant, R. and Agrawal, R. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology.
  – #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
• Integrated Mining
  – #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
• Rough Sets
  – #17. Finding reduct: Zdzislaw Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, 1992.
• Graph Mining
  – #18. gSpan: Yan, X. and Han, J. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

Top-10 Algorithms Finally Selected at ICDM '06
• #1: C4.5 (61 votes)
• #2: K-Means (60 votes)
• #3: SVM (58 votes)
• #4: Apriori (52 votes)
• #5: EM (48 votes)
• #6: PageRank (46 votes)
• #7: AdaBoost (45 votes)
• #7: kNN (45 votes)
• #7: Naive Bayes (45 votes)
• #10: CART (34 votes)

Conferences and Journals on Data Mining
• KDD conferences
  – ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  – SIAM Data Mining Conf. (SDM)
  – (IEEE) Int. Conf. on Data Mining (ICDM)
  – Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  – Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
• Other related conferences
  – ACM SIGMOD
  – VLDB
  – (IEEE) ICDE
  – WWW, SIGIR
  – ICML, CVPR, NIPS
• Journals
  – Data Mining and Knowledge Discovery (DAMI or DMKD)
  – IEEE Trans. on Knowledge and Data Eng. (TKDE)
  – KDD Explorations
  – ACM Trans. on KDD

Data warehouse
• Modern organisations must respond quickly to any change in the market. This requires rapid access to current data normally stored in operational databases.
• However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data that are stored in large databases called data warehouses.
• The main characteristic of a data warehouse is its capacity. A data warehouse is really big: it includes millions, even billions, of data records.
• The data stored in a data warehouse is
  – time dependent: linked together by the times of recording, and
  – integrated: all relevant information from the operational databases is combined and structured in the warehouse.

Query tools
• A data warehouse is designed to support decision-making in the organisation. The information needed can be obtained with query tools.
• Query tools are assumption-based: a user must ask the right questions.

How is data mining applied in practice?
• Many companies use data mining today, but refuse to talk about it.
• In direct marketing, data mining is used for targeting people who are most likely to buy certain products and services.
• In trend analysis, it is used to determine trends in the marketplace, for example, to model the stock market.
• In fraud detection, data mining is used to identify insurance claims, cellular phone calls and credit card purchases that are most likely to be fraudulent.

• Motivation: finding latent relationships in data
  – What products are often purchased together? Beer and diapers?!
  – What are the subsequent purchases after buying a PC?
  – What kinds of DNA are sensitive to this new drug?
  – Can we automatically classify web documents?

• Applications
  – Market basket data analysis (shelf space planning, increasing sales, promotion)
  – Cross-marketing
  – Catalog design
  – Sales campaign analysis
  – Web log (click stream) analysis
  – DNA sequence analysis

Data mining tools
Data mining is based on intelligent technologies already discussed in this book. It often applies such tools as neural networks and neuro-fuzzy systems. However, the most popular tool used for data mining is a decision tree.

Decision trees
A decision tree can be defined as a map of the reasoning process. It describes a data set by a tree-like structure. Decision trees are particularly good at solving classification problems.

ID3
Training records (attribute values, then class label):
• (tall, blond, blue) w
• (short, silver, blue) w
• (short, black, blue) w
• (tall, blond, brown) w
• (tall, silver, blue) w
• (short, blond, blue) w
• (short, black, brown) e
• (tall, silver, black) e
• (short, black, brown) e
• (tall, black, brown) e
• (tall, black, black) e
• (short, blond, black) e

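ID3 chooses, at each node, the attribute whose split yields the largest information gain (reduction in entropy). The sketch below applies that criterion to the twelve records listed on the ID3 slide. The attribute names `height`, `hair` and `eyes` are assumptions, since the slide gives only the values; the class labels `w` and `e` are kept exactly as they appear.

```python
import math
from collections import Counter

# Attribute names are assumed; the slide lists only the values.
ATTRIBUTES = ("height", "hair", "eyes")

# The twelve records from the slide: (attribute values, class label).
RECORDS = [
    (("tall", "blond", "blue"), "w"),
    (("short", "silver", "blue"), "w"),
    (("short", "black", "blue"), "w"),
    (("tall", "blond", "brown"), "w"),
    (("tall", "silver", "blue"), "w"),
    (("short", "blond", "blue"), "w"),
    (("short", "black", "brown"), "e"),
    (("tall", "silver", "black"), "e"),
    (("short", "black", "brown"), "e"),
    (("tall", "black", "brown"), "e"),
    (("tall", "black", "black"), "e"),
    (("short", "blond", "black"), "e"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(records, index):
    """Entropy reduction obtained by splitting on the attribute at `index`."""
    labels = [label for _, label in records]
    groups = {}
    for values, label in records:
        groups.setdefault(values[index], []).append(label)
    # Weighted average entropy of the subsets created by the split.
    remainder = sum(
        len(group) / len(records) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


gains = {
    name: information_gain(RECORDS, i) for i, name in enumerate(ATTRIBUTES)
}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root split:", best)
```

On these records the third attribute gives the largest gain, because every record with the value `blue` belongs to class `w` and every record with the value `black` belongs to class `e`, while the first attribute splits both classes evenly and gains nothing.
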
• A decision tree consists of nodes, branches and leaves.
• The top node is called the root node. The tree always starts from the root node and grows down by splitting the data at each level into new nodes. The root node contains the entire data set (all data records), and child nodes hold respective subsets of that set.
• All nodes are connected by branches.
• Nodes that are at the end of branches are called terminal nodes, or leaves.

How does a decision tree select splits?
• A split in a decision tree corresponds to the predictor with the maximum separating power. The best split does the best job in creating nodes where a single class dominates.
• One of the best known methods of calculating the predictor's power to separate data is based on the Gini coefficient of inequality.

Major Issues in Data Mining
• Mining methodology
  – Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  – Performance: efficiency, effectiveness, and scalability
  – Pattern evaluation: the interestingness problem
  – Incorporation of background knowledge
  – Handling noise and incomplete data
  – Parallel, distributed and incremental mining methods
  – Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
  – Data mining query languages and ad-hoc mining
  – Expression and visualization of data mining results
  – Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
  – Domain-specific data mining and invisible data mining
  – Protection of data security, integrity, and privacy

Summary
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Data mining systems and architectures
• Major issues in data mining

Thank you

[Figure: An example of a decision tree]

The Gini coefficient
The Gini coefficient is a measure of how well the predictor separates the classes contained in the parent node. Gini, an Italian economist, introduced it as a rough measure of the amount of inequality in the income distribution in a country.

Computation of the Gini coefficient
The Gini coefficient is calculated as the area between the curve and the diagonal divided by the area below the diagonal. For a perfectly equal wealth distribution, the Gini coefficient is equal to zero.

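The area ratio described above can be sketched numerically for the wealth-distribution case. The code below assumes "the curve" is the Lorenz curve (cumulative share of wealth against cumulative share of population, poorest first) and approximates the area under it with the trapezoid rule. The function name `gini_coefficient` and the sample values are illustrative assumptions, not part of the slides.

```python
def gini_coefficient(values):
    """Gini coefficient of non-negative values via the Lorenz curve.

    Computed as (area between diagonal and curve) / (area below diagonal).
    """
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0

    area_under_curve = 0.0
    cumulative = 0.0
    previous_share = 0.0
    for value in ordered:
        cumulative += value
        share = cumulative / total
        # Trapezoid of width 1/n between two consecutive curve points.
        area_under_curve += (previous_share + share) / 2 / n
        previous_share = share

    area_below_diagonal = 0.5
    return (area_below_diagonal - area_under_curve) / area_below_diagonal


# Hypothetical distributions, for illustration only.
equal = gini_coefficient([10, 10, 10, 10])      # everyone holds the same
unequal = gini_coefficient([0, 0, 0, 100])      # one person holds everything
print(equal, unequal)
```

When the same ratio is applied to a gain chart for a decision-tree split, the curve lies above the diagonal instead of below it, so the area between the two would be taken with the opposite sign; the sketch covers only the wealth interpretation.
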
[Figure: Selecting an optimal decision tree: (a) Splits selected by Gini]
[Figure: Selecting an optimal decision tree: (b) Splits selected by guesswork]
[Figure: Gain chart of Class A]

Can we extract rules from a decision tree?
The path from the root node to the bottom leaf reveals a decision rule. For example, a rule associated with the right bottom leaf in the figure that represents Gini splits can be represented as follows:
    if (Predictor 1 = no)
    and (Predictor 4 = no)
    and (Predictor 6 = no)
    then class = Class A

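Reading a rule off each root-to-leaf path can be sketched as a recursive traversal. In the code below, the `Node` and `Leaf` classes, the function `extract_rules` and the small tree are hypothetical; only the path Predictor 1 = no, Predictor 4 = no, Predictor 6 = no leading to Class A follows the example rule, and the other leaves are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str  # class assigned to records reaching this leaf


@dataclass
class Node:
    predictor: str  # predictor tested at this node
    branches: Dict[str, Union["Node", Leaf]]  # predictor value -> subtree


def extract_rules(
    tree: Union[Node, Leaf], conditions: Tuple[Tuple[str, str], ...] = ()
) -> List[str]:
    """Return one if-then rule per root-to-leaf path."""
    if isinstance(tree, Leaf):
        body = " and ".join(f"({p} = {v})" for p, v in conditions)
        return [f"if {body} then class = {tree.label}"]
    rules: List[str] = []
    for value, child in tree.branches.items():
        rules.extend(
            extract_rules(child, conditions + ((tree.predictor, value),))
        )
    return rules


# Hypothetical tree; "Class B" leaves are placeholders for illustration.
tree = Node("Predictor 1", {
    "yes": Leaf("Class B"),
    "no": Node("Predictor 4", {
        "yes": Leaf("Class B"),
        "no": Node("Predictor 6", {
            "yes": Leaf("Class B"),
            "no": Leaf("Class A"),
        }),
    }),
})

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each leaf yields exactly one rule, so the number of rules equals the number of leaves in the tree.
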
Case study: Profiling people with high blood pressure
A typical task for decision trees is to determine conditions that may lead to certain outcomes.
Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.

[Table: A data set for a hypertension study]

Data cleaning
Decision trees are as good as the data they represent. Unlike neural networks and fuzzy systems, decision trees do not tolerate noisy and polluted data. Therefore, the data must be cleaned before we can start data mining.
We might find that such fields as Alcohol Consumption or Smoking have been left blank or contain incorrect information.

Data enriching
From such variables as weight and height we can easily derive a new variable, obesity. This variable is calculated with a body-mass index (BMI), that is, the weight in kilograms divided by the square of the height in metres. Men with BMIs of 27.8 or higher and women with BMIs of 27.3 or higher are classified as obese.

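The derived obesity variable can be sketched directly from the definition above. The function names `bmi` and `is_obese`, the `"male"`/`"female"` keys and the sample measurements are assumptions made for illustration; the thresholds 27.8 and 27.3 are the ones stated on the slide.

```python
# Thresholds as stated on the slide: BMI at or above these values is obese.
OBESITY_THRESHOLD = {"male": 27.8, "female": 27.3}


def bmi(weight_kg: float, height_m: float) -> float:
    """Body-mass index: weight in kilograms over height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg: float, height_m: float, sex: str) -> bool:
    """Derive the obesity flag from weight, height and sex."""
    return bmi(weight_kg, height_m) >= OBESITY_THRESHOLD[sex]


# Hypothetical records, for illustration only.
print(round(bmi(81.0, 1.80), 1))        # BMI of an 81 kg, 1.80 m person
print(is_obese(95.0, 1.80, "male"))     # BMI about 29.3
print(is_obese(81.0, 1.80, "female"))   # BMI 25.0
```
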
[Table: A data set for a hypertension study (continued)]

[Figures: Growing a decision tree]

Solution space of the hypertension study
The solution space is first divided into four rectangles by age; then an age group is further divided into those who are overweight and those who are not. Finally, the group of obese people is divided by race.

[Figure: Solution space of the hypertension study]

[Figure: Hypertension study: forcing a split]

Advantages of decision trees
• The main advantage of the decision-tree approach to data mining is that it visualises the solution; it is easy to follow any path through the tree.
• Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system.

Drawbacks of decision trees
• Continuous data, such as age or income, have to be grouped into ranges, which can unwittingly hide important patterns.
• Handling of missing and inconsistent data: decision trees can produce reliable outcomes only when they deal with "clean" data.
• Inability to examine more than one variable at a time. This confines trees to only the problems that can be solved by dividing the solution space into several successive rectangles.

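The first drawback, grouping continuous data into ranges, can be illustrated with a small binning sketch. The cut points in `AGE_EDGES`, the labels and the function name `age_group` are hypothetical; the slides do not state which age ranges the study used.

```python
from bisect import bisect_right

# Hypothetical cut points: an age equal to an edge falls in the upper group.
AGE_EDGES = [35, 50, 65]
AGE_LABELS = ["under 35", "35-49", "50-64", "65 and over"]


def age_group(age: float) -> str:
    """Map a continuous age to one of four discrete ranges."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


# Ages 36 and 49 end up in the same group, so a tree built on the
# grouped variable cannot distinguish them.
for age in (34, 36, 49, 50, 70):
    print(age, "->", age_group(age))
```

Any pattern that changes inside a range, for example between ages 36 and 49 here, is invisible to a tree that only sees the group label, which is how grouping can hide important patterns.
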
In spite of all these limitations, decision trees have become the most successful technology used for data mining. The ability to produce clear sets of rules makes decision trees particularly attractive to business professionals.