Slides are based on Negnevitsky, Pearson Education, 2005. Lecture 14: Data mining and knowledge discovery


Lecture 14: Data mining and knowledge discovery
• Introduction, or what is data mining?
• Data warehouse and query tools
• Decision trees
• Case study: Profiling people with high blood pressure
• Summary

What is data mining?
• Data is what we collect and store, and knowledge is what helps us to make informed decisions.
• The extraction of knowledge from data is called data mining.
• Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
• The ultimate goal of data mining is to discover knowledge.

Why data mining?
• The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    » Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    » Business: Web, e-commerce, transactions, stocks, …
    » Science: Remote sensing, bioinformatics, scientific simulation, …
    » Society and everyone: news, digital cameras, YouTube
• knowledge!

Why Not Traditional Data Analysis?
• Tremendous amount of data
  - Algorithms must be highly scalable to handle data volumes such as terabytes
• High dimensionality of data
  - Microarray data may have tens of thousands of dimensions

• High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
• New and sophisticated applications

Knowledge Discovery (KDD) Process
  - Data mining: core of the knowledge discovery process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

KDD Process: Several Key Steps
• Learning the application domain
  - Relevant prior knowledge and goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation
  - Find useful features, dimensionality/variable reduction, invariant representation

KDD Process: Several Key Steps
• Choosing functions of data mining
  - Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the centre, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization and Other Disciplines]

Architecture: Typical Data Mining System
[Figure: layered architecture. Graphical User Interface; Pattern Evaluation; Data Mining Engine; Knowledge Base; Database or Data Warehouse Server; data cleaning, integration, and selection; sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories]

Data Mining Functionalities (1)
• Frequent patterns, association, correlation vs. causality
  - Diaper → Beer [0.5%, 75%] (Correlation or causality?)
• Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction
    » E.g., classify countries based on climate, or classify cars based on gas mileage
  - Predict some unknown or missing numerical values
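The pair of numbers attached to a rule such as Diaper → Beer [0.5%, 75%] is conventionally read as [support, confidence]. The sketch below shows one way to compute both figures. The function name `support_confidence` and the five toy baskets are illustrative assumptions, not data from the slides.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of all transactions containing both sides of the rule
    confidence = share of transactions containing the antecedent
                 that also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets, invented for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

support, confidence = support_confidence(baskets, {"diaper"}, {"beer"})
print(f"Diaper -> Beer [{support:.0%}, {confidence:.0%}]")
```

A high confidence only says that the two items co-occur; it does not settle the "correlation or causality?" question raised on the slide.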

Data Mining Functionalities (2)
• Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Maximizing intra-class similarity and minimizing inter-class similarity
• Outlier analysis
  - Outlier: data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection, rare events analysis

Data Mining Functionalities (3)
• Trend and evolution analysis
  - Trend and deviation: e.g., regression analysis
  - Sequential pattern mining: e.g., digital camera → large SD memory
  - Periodicity analysis
  - Similarity-based analysis
• Other pattern-directed or statistical analyses

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
• Classification
  - #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann.
  - #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth.
  - #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. Discriminant Adaptive Nearest Neighbor Classification. TPAMI 18(6).
  - #4. Naive Bayes: Hand, D. J. and Yu, K. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69.

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (II)
• Statistical Learning
  - #5. SVM: Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag.
  - #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
• Association Analysis
  - #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  - #8. FP-Tree: Han, J., Pei, J., and Yin, Y. Mining frequent patterns without candidate generation. In SIGMOD '00.

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (III)
• Link Mining
  - #9. PageRank: Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine. In WWW-7.
  - #10. HITS: Kleinberg, J. M. Authoritative sources in a hyperlinked environment. SODA, 1998.

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (IV)
• Clustering
  - #11. K-Means: MacQueen, J. B. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability.
  - #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
• Bagging and Boosting
  - #13. AdaBoost: Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997).

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (V)
• Sequential Patterns
  - #14. GSP: Srikant, R. and Agrawal, R. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology.
  - #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
• Integrated Mining
  - #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (VI)
• Rough Sets
  - #17. Finding reduct: Zdzislaw Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, 1992.
• Graph Mining
  - #18. gSpan: Yan, X. and Han, J. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

Top-10 Algorithms Finally Selected at ICDM '06
• #1: C4.5 (61 votes)
• #2: K-Means (60 votes)
• #3: SVM (58 votes)
• #4: Apriori (52 votes)
• #5: EM (48 votes)
• #6: PageRank (46 votes)
• #7: AdaBoost (45 votes)
• #7: kNN (45 votes)
• #7: Naive Bayes (45 votes)
• #10: CART (34 votes)

Conferences and Journals on Data Mining
• KDD Conferences
  - ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  - SIAM Data Mining Conf. (SDM)
  - (IEEE) Int. Conf. on Data Mining (ICDM)
  - Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  - Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

• Other related conferences
  - ACM SIGMOD
  - VLDB
  - (IEEE) ICDE
  - WWW, SIGIR
  - ICML, CVPR, NIPS
• Journals
  - Data Mining and Knowledge Discovery (DAMI or DMKD)
  - IEEE Trans. on Knowledge and Data Eng. (TKDE)
  - KDD Explorations
  - ACM Trans. on KDD

Data warehouse
• Modern organisations must respond quickly to any change in the market. This requires rapid access to current data normally stored in operational databases.
• However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data that are stored in large databases called data warehouses.

• The main characteristic of a data warehouse is its capacity. A data warehouse is really big: it includes millions, even billions, of data records.
• The data stored in a data warehouse is
  - time dependent: linked together by the times of recording, and
  - integrated: all relevant information from the operational databases is combined and structured in the warehouse.

Query tools
• A data warehouse is designed to support decision-making in the organisation. The information needed can be obtained with query tools.
• Query tools are assumption-based: a user must ask the right questions.

How is data mining applied in practice?
• Many companies use data mining today, but refuse to talk about it.
• In direct marketing, data mining is used for targeting people who are most likely to buy certain products and services.
• In trend analysis, it is used to determine trends in the marketplace, for example, to model the stock market.
• In fraud detection, data mining is used to identify insurance claims, cellular phone calls and credit card purchases that are most likely to be fraudulent.

• Motivation: Finding latent relationships in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?

• Applications
  - Market basket data analysis (shelf space planning, increasing sales, promotion)
  - Cross-marketing
  - Catalog design
  - Sale campaign analysis
  - Web log (click stream) analysis
  - DNA sequence analysis

Data mining tools
Data mining is based on intelligent technologies already discussed in this book. It often applies such tools as neural networks and neuro-fuzzy systems. However, the most popular tool used for data mining is a decision tree.

Decision trees
A decision tree can be defined as a map of the reasoning process. It describes a data set by a tree-like structure. Decision trees are particularly good at solving classification problems.

ID3
• (tall, blond, blue) w
• (short, silver, blue) w
• (short, black, blue) w
• (tall, blond, brown) w
• (tall, silver, blue) w
• (short, blond, blue) w
• (short, black, brown) e
• (tall, silver, black) e
• (short, black, brown) e
• (tall, black, brown) e
• (tall, black, black) e
• (short, blond, black) e
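ID3 chooses each split by information gain, so the twelve records above can be used to see which attribute it would test first. The sketch below is a minimal illustration only. The attribute names `height`, `hair` and `eyes` are assumptions, since the slide lists values without naming the columns, and the class labels w and e are used exactly as given.

```python
from collections import Counter
from math import log2

# The twelve records from the slide; the last field is the class label.
RECORDS = [
    ("tall", "blond", "blue", "w"),
    ("short", "silver", "blue", "w"),
    ("short", "black", "blue", "w"),
    ("tall", "blond", "brown", "w"),
    ("tall", "silver", "blue", "w"),
    ("short", "blond", "blue", "w"),
    ("short", "black", "brown", "e"),
    ("tall", "silver", "black", "e"),
    ("short", "black", "brown", "e"),
    ("tall", "black", "brown", "e"),
    ("tall", "black", "black", "e"),
    ("short", "blond", "black", "e"),
]
ATTRIBUTES = ("height", "hair", "eyes")  # assumed column names


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(records, attr_index):
    """Entropy of the parent node minus the weighted entropy of its children."""
    labels = [r[-1] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(RECORDS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: gain = {gain:.3f}")
print("first split on:", best)
```

On these records the third attribute gives by far the largest gain, because every record with the value blue belongs to class w and every record with the value black belongs to class e, while the first attribute splits both classes evenly and gains nothing.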

• A decision tree consists of nodes, branches and leaves.
• The top node is called the root node. The tree always starts from the root node and grows down by splitting the data at each level into new nodes. The root node contains the entire data set (all data records), and child nodes hold respective subsets of that set.
• All nodes are connected by branches.
• Nodes that are at the end of branches are called terminal nodes, or leaves.

How does a decision tree select splits?
• A split in a decision tree corresponds to the predictor with the maximum separating power. The best split does the best job in creating nodes where a single class dominates.
• One of the best known methods of calculating the predictor's power to separate data is based on the Gini coefficient of inequality.
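One common way to score "nodes where a single class dominates" is the Gini impurity used in CART-style trees. Note that this is a related but different quantity from the Gini coefficient of inequality named on the slide; it is swapped in here as an assumption because it is simple to compute per node. The predictor names and the six toy records are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k^2): zero when one class fills the node, larger when mixed."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_gain(rows, labels, predictor):
    """Drop in impurity obtained by splitting the parent node on one predictor."""
    parent = gini_impurity(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[predictor], []).append(label)
    weighted = sum(
        len(child) / len(labels) * gini_impurity(child) for child in children.values()
    )
    return parent - weighted


# Hypothetical records: predictor_1 separates the classes, predictor_2 barely does.
rows = [
    {"predictor_1": "yes", "predictor_2": "yes"},
    {"predictor_1": "yes", "predictor_2": "no"},
    {"predictor_1": "yes", "predictor_2": "yes"},
    {"predictor_1": "no", "predictor_2": "no"},
    {"predictor_1": "no", "predictor_2": "yes"},
    {"predictor_1": "no", "predictor_2": "no"},
]
labels = ["A", "A", "A", "B", "B", "B"]

gains = {p: gini_gain(rows, labels, p) for p in ("predictor_1", "predictor_2")}
best_predictor = max(gains, key=gains.get)
print(gains, "-> split on", best_predictor)
```

The predictor with the largest impurity drop is the one with the most separating power in this sense, so it would be chosen for the split.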

Major Issues in Data Mining (1)
• Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion

Major Issues in Data Mining (2)
• User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
  - Domain-specific data mining and invisible data mining
  - Protection of data security, integrity, and privacy

Summary (1)
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Summary (2)
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Data mining systems and architectures
• Major issues in data mining

Thank you

[Figure: An example of a decision tree]

The Gini coefficient
The Gini coefficient is a measure of how well the predictor separates the classes contained in the parent node. Gini, an Italian economist, introduced a rough measure of the amount of inequality in the income distribution in a country.

Computation of the Gini coefficient
The Gini coefficient is calculated as the area between the curve and the diagonal divided by the area below the diagonal. For a perfectly equal wealth distribution, the Gini coefficient is equal to zero.
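The area ratio described above can be sketched numerically for the original income-distribution setting. The code below assumes the curve is the cumulative share of income plotted against the cumulative share of the population (poorest first) and approximates the area under it with the trapezoid rule. The function name `gini_coefficient` and the sample incomes are illustrative assumptions.

```python
def gini_coefficient(incomes):
    """Area between the diagonal and the curve, divided by the area below the diagonal."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)

    # Points of the curve: cumulative income share after each person, poorest first.
    curve = [0.0]
    running = 0.0
    for value in values:
        running += value
        curve.append(running / total)

    # Trapezoid rule; every step along the population axis has width 1/n.
    area_under_curve = sum((curve[i] + curve[i + 1]) / 2 for i in range(n)) / n
    area_below_diagonal = 0.5
    return (area_below_diagonal - area_under_curve) / area_below_diagonal


equal = gini_coefficient([10, 10, 10, 10])   # everyone has the same income
unequal = gini_coefficient([0, 0, 0, 100])   # one person holds everything
print(f"equal distribution:   {equal:.2f}")
print(f"unequal distribution: {unequal:.2f}")
```

With equal incomes the curve lies on the diagonal and the coefficient is zero, as the slide states. With a finite sample of n people the largest value this estimate can reach is (n - 1)/n rather than exactly one.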

[Figure: Selecting an optimal decision tree: (a) Splits selected by Gini]

[Figure: Selecting an optimal decision tree: (b) Splits selected by guesswork]

[Figure: Gain chart of Class A]

Can we extract rules from a decision tree?
The path from the root node to the bottom leaf reveals a decision rule. For example, the rule associated with the right bottom leaf in the figure that represents Gini splits can be written as follows:
  if (Predictor 1 = no) and (Predictor 4 = no) and (Predictor 6 = no)
  then class = Class A
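Reading one rule per leaf can be sketched as a walk over the tree that collects the tests met on the way down. The nested-dictionary layout, the function name `extract_rules` and the toy tree are assumptions made for illustration. Only the path Predictor 1 = no, Predictor 4 = no, Predictor 6 = no leading to Class A comes from the slide; the other leaves (labelled Class B here) are invented, because the figure itself is not part of the transcript.

```python
def extract_rules(node, conditions=()):
    """Return one 'if ... then class = ...' rule for every leaf below node."""
    if "label" in node:  # a leaf: emit the conditions gathered on the way down
        tests = " and ".join(f"({name} = {value})" for name, value in conditions)
        return [f"if {tests} then class = {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        step = (node["predictor"], value)
        rules.extend(extract_rules(child, conditions + (step,)))
    return rules


# Hypothetical tree; only the all-"no" path is taken from the slide.
tree = {
    "predictor": "Predictor 1",
    "branches": {
        "yes": {"label": "Class B"},
        "no": {
            "predictor": "Predictor 4",
            "branches": {
                "yes": {"label": "Class B"},
                "no": {
                    "predictor": "Predictor 6",
                    "branches": {
                        "yes": {"label": "Class B"},
                        "no": {"label": "Class A"},
                    },
                },
            },
        },
    },
}

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each rule is simply the conjunction of the branch tests along one root-to-leaf path, which is why a tree with four leaves yields four rules.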

Case study: Profiling people with high blood pressure
A typical task for decision trees is to determine conditions that may lead to certain outcomes. Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.

[Table: A data set for a hypertension study]

[Table: A data set for a hypertension study (continued)]

Data cleaning
Decision trees are as good as the data they represent. Unlike neural networks and fuzzy systems, decision trees do not tolerate noisy and polluted data. Therefore, the data must be cleaned before we can start data mining. We might find that such fields as Alcohol Consumption or Smoking have been left blank or contain incorrect information.

Data enriching
From such variables as weight and height we can easily derive a new variable, obesity. This variable is calculated with a body-mass index (BMI), that is, the weight in kilograms divided by the square of the height in metres. Men with BMIs of 27.8 or higher and women with BMIs of 27.3 or higher are classified as obese.
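The derived obesity variable can be computed directly from the definition above. This is a minimal sketch: the function names, the `sex` argument values and the sample weights and heights are assumptions, while the formula and the 27.8 and 27.3 thresholds are the ones quoted on the slide.

```python
# Thresholds quoted on the slide; the dictionary keys are an assumed encoding.
OBESITY_THRESHOLD = {"male": 27.8, "female": 27.3}


def body_mass_index(weight_kg, height_m):
    """Weight in kilograms divided by the square of the height in metres."""
    return weight_kg / height_m ** 2


def is_obese(weight_kg, height_m, sex):
    """True when the BMI reaches the threshold for the given sex."""
    return body_mass_index(weight_kg, height_m) >= OBESITY_THRESHOLD[sex]


# Hypothetical records used to enrich a data set with an 'obese' field.
bmi_example = body_mass_index(80, 2.0)
man_obese = is_obese(90, 1.75, "male")
woman_obese = is_obese(70, 1.75, "female")
print(bmi_example, man_obese, woman_obese)
```

In a data set, the result of `is_obese` would be stored as a new yes/no column alongside the original weight and height fields.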

[Table: A data set for a hypertension study (continued)]

[Figure: Growing a decision tree]

[Figure: Growing a decision tree (continued)]

[Figure: Growing a decision tree (continued)]

Solution space of the hypertension study
The solution space is first divided into four rectangles by age, then each age group is further divided into those who are overweight and those who are not. Finally, the group of obese people is divided by race.

[Figure: Solution space of the hypertension study]

[Figure: Hypertension study: forcing a split]

Advantages of decision trees
• The main advantage of the decision-tree approach to data mining is that it visualises the solution; it is easy to follow any path through the tree.
• Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system.

Drawbacks of decision trees
• Continuous data, such as age or income, have to be grouped into ranges, which can unwittingly hide important patterns.
• Handling of missing and inconsistent data: decision trees can produce reliable outcomes only when they deal with "clean" data.
• Inability to examine more than one variable at a time. This confines trees to only the problems that can be solved by dividing the solution space into several successive rectangles.
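The first drawback can be made concrete with a small sketch of how a continuous variable such as age is grouped into ranges. The cut points 35, 50 and 65 and the group labels are hypothetical, chosen only for illustration; the slides do not give the boundaries used in the study.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; the real study's ranges are not given.
AGE_EDGES = [35, 50, 65]
AGE_LABELS = ["under 35", "35-49", "50-64", "65 and over"]


def age_group(age):
    """Map an exact age onto the label of the range that contains it."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


for age in (34, 35, 49, 50, 80):
    print(age, "->", age_group(age))
```

Ages 35 and 49 land in the same group while 49 and 50 are separated, so any pattern that changes around, say, age 42 becomes invisible to the tree once the data are grouped this way.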

In spite of all these limitations, decision trees have become the most successful technology used for data mining. The ability to produce clear sets of rules makes decision trees particularly attractive to business professionals.