

Data Mining: Introduction (Chapter 1)

Index
1. What is Data Mining?
2. Data Mining Functionalities
   2.1 Characterization and Discrimination
   2.2 Mining Frequent Patterns
   2.3 Classification and Prediction
   2.4 Cluster Analysis
   2.5 Outlier Analysis
   2.6 Evolution Analysis
3. Are all Patterns Interesting?
4. Major Issues in Data Mining

1. What is Data Mining?
Data mining is the process of discovering interesting patterns (or knowledge) from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.

1. What is Data Mining? Architecture of a data mining system
- Database, data warehouse, or other information repository: usually the source of the data. The data may require cleaning and integration.
- Database or data warehouse server: responsible for fetching the relevant data based on the user's request.
- Data mining engine: performs functionalities such as characterization, association, classification, and prediction.
- Pattern evaluation module: tests the interestingness of a pattern.
- User interface: communicates between users and the data mining system, visualizes results, and supports exploration of data and schemas.
- Knowledge base: knowledge about the domain being mined, such as concept hierarchies that organize attributes into levels of abstraction. It also contains user beliefs, which can be used to assess the interestingness of a pattern, as well as thresholds.

2. Data Mining Functionalities
Data mining functionalities specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories:
- Descriptive: characterize the general properties of the data in the database
- Predictive: perform inference on the data to make predictions

2.1 Data Mining Functionalities: Characterization and Discrimination
Data can be associated with classes or concepts that can be described in summarized, concise, and yet precise terms. Such descriptions of a class or concept are called class/concept descriptions. These descriptions can be derived via:
- Data characterization
- Data discrimination

2.1 Data Mining Functionalities: Characterization and Discrimination
Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a query.
Example: a description of all customers who spent more than $10,000 a year at AllElectronics. The result is a general profile of these customers, such as age, salary, location, and credit rating. Among all the customers meeting the target condition (spent > $10,000), 10% are “Youth”, 60% are “Adults”, and 30% are “Seniors”.
The output of data characterization can be presented in pie charts, bar charts, multidimensional data cubes, and multidimensional tables. It can also be presented in rule form.
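As a rough illustration of characterization, the sketch below selects a target class with a query-like filter and summarizes it by age group. The customer records, the field layout, and the helper name `characterize` are hypothetical and are not taken from the AllElectronics example.

```python
from collections import Counter

# Hypothetical customer records: (customer_id, age_group, yearly_spend_usd).
customers = [
    (1, "Youth", 12_500),
    (2, "Adult", 15_000),
    (3, "Adult", 8_000),
    (4, "Senior", 11_200),
    (5, "Adult", 23_000),
    (6, "Youth", 4_300),
]

def characterize(records, min_spend):
    """Summarize the target class (spend > min_spend) as a percentage per age group."""
    target = [age for _, age, spend in records if spend > min_spend]
    counts = Counter(target)
    total = len(target)
    if total == 0:
        return {}
    return {age: 100.0 * n / total for age, n in counts.items()}

profile = characterize(customers, 10_000)
for age_group, pct in sorted(profile.items()):
    print(f"{age_group}: {pct:.0f}%")
```

The resulting dictionary is the kind of summary that could then be drawn as a pie chart or bar chart.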

2.1 Data Mining Functionalities: Characterization and Discrimination
Data discrimination is a comparison of the target class data objects against the objects from one or more contrasting classes, with respect to objects that share the specified generalized feature(s).
Example: compare the change in sales of software products for customers with a given generalized feature. 40% of “Youth” have sales that increased by more than 10% from last year; 10% of “Youth” have sales that decreased by at least 30% during the same period; the remaining 50% of “Youth” had a change in sales that fell in between. “Youth” describes the generalized tuple, while an increase in sales of more than 10% is the target class. The other two ranges of change in sales are the contrasting classes.
The forms of output presentation are similar to those for characteristic descriptions, although discrimination descriptions should include comparative measures that help to distinguish between the target and contrasting classes.

2.2 Data Mining Functionalities: Mining Frequent Patterns
Frequent patterns are patterns that occur frequently in the data. Patterns can include itemsets, sequences, and subsequences. A frequent itemset is a set of items that often appear together in a transactional data set.
Example: bread and milk.
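A minimal sketch of finding frequent itemsets by brute-force counting is shown below. It is not an optimized algorithm such as Apriori; the transactions, the `min_support` value, and the function name `frequent_itemsets` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
]

def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items whose
    support (fraction of baskets containing them) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1
    n = len(baskets)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

result = frequent_itemsets(transactions, min_support=0.6)
for items, support in result.items():
    print(sorted(items), support)
```

With these made-up baskets, bread and milk are each frequent on their own and also as a pair, which matches the bread-and-milk example.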

2.2 Data Mining Functionalities: Mining Frequent Patterns
Association Rules
Single-dimensional association rule:
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
- Confidence: if a customer buys a computer, there is a 50% chance that he or she will buy software as well.
- Support: 1% of all the transactions under analysis show that computer and software are purchased together.
Multidimensional association rule:
age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”) [support = 2%, confidence = 60%]
Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
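The threshold test on a rule can be sketched as follows. The baskets, the threshold values, and the helper names `rule_metrics` and `is_interesting` are hypothetical; the sketch only assumes that support is the fraction of transactions containing both sides of the rule and that confidence is that count divided by the count of transactions containing the left-hand side.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
    ante = sum(1 for b in baskets if antecedent <= b)
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence

def is_interesting(support, confidence, min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    return support >= min_support and confidence >= min_confidence

# Hypothetical transactions.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

s, c = rule_metrics(baskets, {"computer"}, {"software"})
print(f"support={s:.2f}, confidence={c:.2f}")
print(is_interesting(s, c, min_support=0.4, min_confidence=0.6))
```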

2.3 Data Mining Functionalities: Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is derived from the analysis of a set of training data and is used to predict the class label of objects for which the class label is unknown.
Representations of the derived model: IF-THEN rules, decision tree, neural network.
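To make the train-then-predict idea concrete, here is a sketch of a one-level decision tree (a "stump") learned from a single numeric attribute. It is a deliberately tiny stand-in for a real decision-tree learner; the training data, attribute, and function names are assumptions for illustration.

```python
def train_stump(samples):
    """Learn a one-level decision tree on a single numeric attribute.

    samples: list of (value, label) pairs with exactly two distinct labels
    and at least two distinct values.
    Returns (threshold, label_if_at_or_below, label_if_above).
    """
    labels = sorted({label for _, label in samples})
    values = sorted({value for value, _ in samples})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in candidates:
        for low, high in (labels, labels[::-1]):
            errors = sum(
                1 for value, label in samples
                if (low if value <= threshold else high) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return threshold, low, high

def predict(model, value):
    """Apply the learned rule: IF value <= threshold THEN low ELSE high."""
    threshold, low, high = model
    return low if value <= threshold else high

# Hypothetical training data: (yearly income in $1000s, buys_computer label).
training = [(20, "no"), (25, "no"), (32, "no"), (41, "yes"), (55, "yes"), (60, "yes")]
model = train_stump(training)
print(model)
print(predict(model, 30), predict(model, 45))
```

The learned model reads directly as an IF-THEN rule on the income attribute, which is one of the representations listed above.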

2.3 Data Mining Functionalities: Classification and Prediction
Prediction models continuous-valued functions; that is, it is used to predict missing or unavailable numeric data values rather than class labels. The term prediction can refer to both numeric prediction and class label prediction. Regression analysis is a statistical method used for numeric prediction.
Classification and regression may need to be preceded by relevance analysis, which attempts to identify attributes that are significantly relevant to the classification and regression process. Such attributes are selected for the process; other attributes, which are irrelevant, can then be excluded from consideration.
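Numeric prediction by regression can be sketched with an ordinary least-squares line fit. The data points and the function name `fit_line` are hypothetical; the sketch assumes a single predictor and a linear relationship.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. salary in $1000s.
years = [1, 2, 3, 4, 5]
salary = [32, 34, 36, 38, 40]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # predict the unavailable value for year 6
print(slope, intercept, predicted)
```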

2.4 Data Mining Functionalities: Cluster Analysis
Clustering analyzes data objects without consulting class labels. It can be used to generate class labels for a group of data that had none at the outset. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
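One common way to group unlabeled objects is k-means, sketched below on one-dimensional data. The slide does not name a specific algorithm, so k-means, the sample ages, and the initial centers are all assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D points, starting from the given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical customer ages with no class labels attached.
ages = [18, 21, 23, 45, 47, 52]
centers, clusters = kmeans_1d(ages, centers=[18, 52])
print(centers)
print(clusters)
```

The index of the cluster each point lands in can then serve as a generated class label.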

2.5 Data Mining Functionalities: Outlier Analysis
Outliers are data objects that do not comply with the general behavior or model of the data. Many data mining techniques discard outliers or exceptions as noise. However, in some applications these rare events are more interesting than the regular ones. The analysis of outlier data is referred to as outlier analysis.
Example: fraud detection.
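A simple statistical approach flags values that lie far from the mean in units of standard deviation. The sketch below assumes that approach; the purchase amounts, the threshold of 2.0, and the function name `zscore_outliers` are hypothetical choices, and other outlier tests would be equally valid.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical purchase amounts on one credit card; one is unusually large.
amounts = [25, 30, 27, 31, 29, 26, 28, 950]
flagged = zscore_outliers(amounts)
print(flagged)
```

In a fraud-detection setting, the flagged amounts would be candidates for closer inspection rather than noise to discard.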

2.6 Data Mining Functionalities: Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. This may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data. Distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
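As one small example of time-series analysis, a moving average can smooth a series to expose its trend. The monthly sales figures, the window size, and the function name `moving_average` are assumptions made for illustration.

```python
def moving_average(series, window):
    """Moving average over consecutive windows; returns len(series) - window + 1 values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales figures.
monthly_sales = [10, 12, 11, 15, 14, 18, 17, 21]
trend = moving_average(monthly_sales, window=3)
rising = all(b >= a for a, b in zip(trend, trend[1:]))
print(trend)
print("upward trend" if rising else "no consistent upward trend")
```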

3. Are all Patterns Interesting?
We need to answer three questions to decide whether patterns are interesting:
1. What makes a pattern interesting?
2. Can a data mining system generate all of the interesting patterns?
3. Can the system generate only the interesting ones?

3. Are all Patterns Interesting?
What makes a pattern interesting? A pattern is interesting if it is:
- Understandable: easily understood by humans
- Valid: holds on a new set of data with some degree of certainty
- Potentially useful or desired: validates a hypothesis that the user sought to confirm
- Novel: not known before

3. Are all Patterns Interesting?
Objective measures of interestingness (measurable):
- Support: the percentage of transactions in the transaction database that the given rule satisfies
- Confidence: the degree of certainty of the detected association
support(X => Y) = P(X ∪ Y)
confidence(X => Y) = P(Y | X)
Here P(X ∪ Y) is the probability that a transaction contains every item in both X and Y.
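The two measures are linked by the definition of conditional probability. As a worked example, taking the figures quoted for the computer/software rule (support 1%, confidence 50%), the implied share of transactions that contain a computer follows directly:

```latex
\begin{align*}
\mathrm{confidence}(X \Rightarrow Y) &= P(Y \mid X)
  = \frac{P(X \cup Y)}{P(X)}
  = \frac{\mathrm{support}(X \Rightarrow Y)}{\mathrm{support}(X)} \\
P(X) &= \frac{\mathrm{support}(X \Rightarrow Y)}{\mathrm{confidence}(X \Rightarrow Y)}
  = \frac{0.01}{0.50} = 0.02
\end{align*}
```

So under those figures about 2% of all transactions include a computer, and half of those also include software.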

3. Are all Patterns Interesting?
Many patterns that are interesting by objective standards may represent common sense and are therefore actually uninteresting. For this reason, objective measures are coupled with subjective measures that reflect the user’s needs and interests.
Subjective interestingness measures are based on user beliefs about the data. These measures find patterns interesting if the patterns are unexpected (contradicting the user’s belief), actionable (offering strategic information on which the user can act), or expected (confirming a hypothesis).

3. Are all Patterns Interesting?
Can a data mining system generate all of the interesting patterns?
A data mining algorithm is complete if it mines all interesting patterns. It is often unrealistic and inefficient for data mining systems to generate all possible patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search. For some mining tasks, such as association, this is often sufficient to ensure the completeness of the algorithm.

3. Are all Patterns Interesting?
Can a data mining system generate only interesting patterns?
A data mining algorithm is consistent if it mines only interesting patterns. This is an optimization problem. It is highly desirable for data mining systems to generate only interesting patterns, because neither the user nor the system would then have to search through the generated patterns to identify the truly interesting ones. Some progress has been made in this direction, but it is still a challenging issue in data mining.

4. Major Issues in Data Mining
1. Mining different kinds of data
2. Handling multiple levels of abstraction
3. Incorporation of background knowledge
4. Visualization of mining results
5. Handling of incomplete or noisy data
6. Scalability of algorithms