Chapter 3 Introduction to Data Mining


1 Chapter 3 Introduction to Data Mining
Prof. Chintan H. Makwana

2 Syllabus Topics
- Introduction
- Classification of Data Mining System
- Data Mining Primitives
- KDD Process
- Data Mining Architecture
- Data Mining Functionalities
- Integration of a Data Mining System with a Database or Data Warehouse System
- Issues in Data Mining
- Importance of Data Mining
- Application of Data Mining
- Social Impacts

3 Introduction Data??? Information??? Database??? DBMS???

4 Introduction: Data
Structured (DBMS): Dhaval Gohel 40 50 60; Rishabh Chauhan 70 80; Mayur Chauhan; Jaldhi Patel 30; Viral Prajapati 90
Semi-structured (XML): <Name>Dhaval Gohel</Name> <CA>40</CA> <IP>50</IP> <CS>60</CS>
Unstructured (text): Dhaval Gohel,40,50,60 Rishabh Chauhan 60,70,80

5 Introduction: Information
- Dhaval Gohel has 50% in the current semester.
- Viral Prajapati has the highest marks in Research Skill.
- Ankit Prajapati has the lowest marks in CA.

6 Introduction: Database
Table 1: 120160107001 Dhaval Gohel; 120160107002 Rishabh Chauhan; Mayur Padiya; Jaldhi Patel; Viral Prajapati
Table 2: Dhaval Gohel, Dakor; Rishabh Chauhan, Modasa; Mayur Padiya, Nadiyad; Jaldhi Patel, Dehgam; Viral Prajapati, Naroda
Table 3: Dhaval Gohel 40 50 60; Rishabh Chauhan 70 80; Mayur Padiya; Jaldhi Patel 30; Viral Prajapati 90

7 Introduction: DBMS

8 Introduction
- Data: raw facts
- Information: processed data
- Database: a collection of organized, related data
- DBMS: a set of software and tools used to manipulate the database

9 What do you mean by Data Mining?
Data Mining: “Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.”
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

10 Different Names of Data Mining
- Knowledge discovery (mining) in databases (KDD)
- Knowledge extraction
- Data/pattern analysis
- Data archeology
- Data dredging
- Information harvesting
- Business intelligence, etc.

11 Database vs Data Mining
Database:
- Find all employees having salary >= 50,000
- Find all students who had 0% attendance last month
- Find all students who have an Apple laptop
Data Mining:
- Find all employees who are contractual (Classification)
- Find all students who are attending lectures (Clustering)
- Find all students who have an Apple laptop and an Apple phone (Association Rule)

12 Classification of Data Mining System
Contributing disciplines: Database Technology, Information Science, Statistics, Machine Learning, Visualization, other disciplines

13 Integration of Multiple Technologies
Data Mining integrates: Information Science, Machine Learning, Database Technology, Statistics, Algorithms, Visualization

14 Classification of Data Mining System
Classification is based on the kind of database mined:
- Data model: relational, transactional, object-relational, or data warehouse.
- Special types of data handled: spatial, time-series, text, stream, or multimedia data mining systems, or a World Wide Web mining system.

15 Classification of Data Mining System
Kind of knowledge mined:
- Data mining functionalities: Characterization and Discrimination, Mining Frequent Patterns, Classification and Prediction, Cluster Analysis, Outlier Analysis, Evolution Analysis
- Data regularities vs data irregularities

16 Classification of Data Mining System
Kinds of techniques utilized:
- Degree of user interaction involved, e.g., autonomous systems, interactive exploratory systems, query-driven systems
- Method of data analysis employed, e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on.

17 Classification of Data Mining System
Applications adapted: finance, telecommunications, DNA, stock markets, and so on.

18 Data Mining Primitives
- The set of task-relevant data to be mined
- The kind of knowledge to be mined
- The background knowledge to be used in the discovery process
- The interestingness measures and thresholds for pattern evaluation
- The expected representation for visualizing the discovered patterns

19

20 Knowledge Discovery from Data
[Figure: KDD process, from source to result: Databases; Data Cleaning and Data Integration; Preprocessed Data; Selection and Transformation; Task-relevant Data; Data Mining; Patterns; Pattern Evaluation]

21 KDD Process steps
- Cleaning: remove noise and inconsistent data.
- Integration: multiple data sources may be combined.
- Selection: data relevant to the analysis task are retrieved from the database.
- Transformation: data are transformed into a form appropriate for mining, for example by summary or aggregation operations.
- Data Mining: techniques such as association rule mining, classification, and clustering are applied to identify and count patterns.
- Pattern Evaluation: identify the truly interesting patterns representing knowledge, based on some interestingness measure, for example support and confidence for association rule mining.
- Knowledge Presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.
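The steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names, the pass mark of 35 and the use of a simple label count as the "mining" step are all assumptions made for the example, not part of the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a noisy/missing value.
source_a = [
    {"student": "S1", "subject": "CA", "marks": 40},
    {"student": "S2", "subject": "CA", "marks": None},
]
source_b = [
    {"student": "S3", "subject": "IP", "marks": 70},
    {"student": "S4", "subject": "CA", "marks": 30},
]

def clean(records):
    """Cleaning: drop records whose marks are missing."""
    return [r for r in records if r["marks"] is not None]

def integrate(*sources):
    """Integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, subject):
    """Selection: keep only the records relevant to the task."""
    return [r for r in records if r["subject"] == subject]

def transform(records, pass_mark=35):
    """Transformation: map raw marks to a Pass/Fail label (pass_mark is assumed)."""
    return ["Pass" if r["marks"] >= pass_mark else "Fail" for r in records]

def mine(labels):
    """Data mining (stand-in): count how often each label occurs."""
    return Counter(labels)

def evaluate(counts, min_count=1):
    """Pattern evaluation: keep only patterns that meet a threshold."""
    return {label: n for label, n in counts.items() if n >= min_count}

def run_kdd():
    data = integrate(clean(source_a), clean(source_b))
    labels = transform(select(data, "CA"))
    return evaluate(mine(labels))

# Knowledge presentation: show the result to the user.
for label, count in sorted(run_kdd().items()):
    print(f"{label}: {count}")
```

Each function corresponds to one KDD step, so a real system would replace a function body (for instance `mine`) without changing the overall flow.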

22 KDD Process on Web Log Data

23

24 KDD Process on Web Log Data
- Cleaning: remove error logs.
- Integration: multiple logs may be combined.
- Selection: data having a valid status and media type are selected.
- Transformation: arrange the data day-wise and week-wise.
- Data Mining: identify patterns and count frequent accesses.
- Pattern Evaluation: display frequently accessed sequences.
- Knowledge Presentation: graph of user count per URL page, graph of number of pages visited per IP address.
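The web log steps above can be sketched in a few lines. The log entries below are hypothetical and heavily simplified to `(date, ip, url, status)` tuples; treating any status of 400 or above as an error and using a minimum count of 2 for "frequent" are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified log entries: (date, ip, url, status).
log_a = [
    ("2024-01-01", "10.0.0.1", "/home", 200),
    ("2024-01-01", "10.0.0.2", "/home", 200),
    ("2024-01-01", "10.0.0.1", "/missing", 404),
]
log_b = [
    ("2024-01-02", "10.0.0.1", "/home", 200),
    ("2024-01-02", "10.0.0.3", "/about", 200),
]

def prepare(*logs):
    """Integration and cleaning: merge the logs, then drop error entries."""
    merged = [entry for log in logs for entry in log]
    return [e for e in merged if e[3] < 400]

def by_day(entries):
    """Transformation: group the requested URLs day-wise."""
    days = defaultdict(list)
    for date, _ip, url, _status in entries:
        days[date].append(url)
    return dict(days)

def frequent_urls(entries, min_count=2):
    """Mining and evaluation: count URL accesses, keep the frequent ones."""
    counts = Counter(url for _date, _ip, url, _status in entries)
    return {url: n for url, n in counts.items() if n >= min_count}

def users_per_url(entries):
    """Presentation data: number of distinct IP addresses per URL."""
    users = defaultdict(set)
    for _date, ip, url, _status in entries:
        users[url].add(ip)
    return {url: len(ips) for url, ips in users.items()}

entries = prepare(log_a, log_b)
print(by_day(entries))
print(frequent_urls(entries))
print(users_per_url(entries))
```

The dictionaries returned by `frequent_urls` and `users_per_url` are the kind of data a per-URL user count graph would be drawn from.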

25 Data Mining Architecture
Components:
- Databases, data warehouse, World Wide Web, or other information repository
- Database or data warehouse server
- Knowledge base
- Data mining engine
- Pattern evaluation module
- User interface

26

27 Data Mining Functionalities
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
Tasks are either descriptive or predictive:
- Descriptive: characterize the general properties of the data in the database
- Predictive: perform inference (draw conclusions) on the current data

28 Data Mining task

29 Data Mining Functionalities
- Characterization and Discrimination
- Mining Frequent Patterns
- Classification and Prediction
- Cluster Analysis
- Outlier Analysis
- Evolution Analysis

30 Characterization and Discrimination
Data characterization is a summarization of the general characteristics or features of a target class of data.
For example: analyzing the improvement of students who study in the 2nd semester ME at GECM and whose marks increased by 5% in the current semester.
Display forms: pie charts, bar charts, multidimensional data cubes, etc.

31 Characterization and Discrimination
Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
For example: faculty may like to compare the results of students who study in the 2nd semester ME at GECM whose marks increased by 5% with those whose marks decreased by 5% in the current semester.
Display forms: pie charts, bar charts, multidimensional data cubes, etc.

32 Mining Frequent Patterns, Association Rule Mining
Frequent patterns are patterns that occur frequently in a data set.
Forms: frequent itemsets, subsequences, and substructures.
- Frequent itemsets: e.g., milk and bread.
- Subsequences: e.g., PC followed by Soft.
- Substructures: subgraphs, trees, or lattices.

33 Mining Frequent Patterns, Association Rule Mining
Association rule mining is a method used to find interesting frequent patterns in a large set of data items.
computer ⇒ antivirus [support = 2%, confidence = 60%]
A support of 2% means that computer and antivirus were purchased together in 2% of all transactions.
A confidence of 60% means that 60% of the customers who purchased a computer also purchased antivirus.
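Support and confidence for a rule such as computer ⇒ antivirus can be computed directly from transaction counts. The sketch below uses ten made-up transactions (an assumption for illustration, chosen so the numbers come out round); the function names are hypothetical.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    return count(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0  # rule never applies; avoid division by zero
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / n_antecedent

# Ten hypothetical transactions: 5 contain a computer, 3 contain both items.
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer", "antivirus"},
    {"computer"},
    {"computer", "printer"},
    {"printer"},
    {"printer", "paper"},
    {"paper"},
    {"antivirus"},
    {"paper", "printer"},
]

s = support(transactions, {"computer", "antivirus"})
c = confidence(transactions, {"computer"}, {"antivirus"})
print(f"support = {s:.0%}, confidence = {c:.0%}")  # support = 30%, confidence = 60%
```

Here support is 3/10 because three of the ten transactions contain both items, and confidence is 3/5 because three of the five computer purchases also include antivirus.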

34 Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts.
The model is derived from the analysis of a set of training data and is used to predict the class label of objects for which the class label is unknown.
Classification is a two-phase process:
1) Learning: training data are analyzed by a classification algorithm.
2) Classification: data are assigned a class label.
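The two phases can be illustrated with a very small classifier. The slides do not name a specific algorithm, so this sketch swaps in a nearest-centroid classifier as one simple possibility; the marks and the Pass/Fail labels are made-up training data.

```python
from collections import defaultdict
import math

def learn(training_data):
    """Learning phase: build one centroid (mean vector) per class label."""
    groups = defaultdict(list)
    for features, label in training_data:
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }

def classify(model, features):
    """Classification phase: return the label of the nearest centroid."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical training data: three marks per student and a known class label.
training_data = [
    ((40, 50, 60), "Pass"),
    ((70, 80, 75), "Pass"),
    ((20, 30, 25), "Fail"),
    ((10, 25, 30), "Fail"),
]

model = learn(training_data)
print(classify(model, (65, 70, 60)))  # Pass
print(classify(model, (15, 20, 35)))  # Fail
```

The model returned by `learn` is just a dictionary of class centroids; `classify` can then label any object whose class is unknown.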

35 Classification and Prediction
Prediction models continuous-valued functions, i.e., it is used to predict missing or unavailable numeric data values rather than class labels.
Regression analysis is a statistical method used for numeric prediction.
Example table: Dhaval Gohel 40 50 60 Pass; Rishabh Chauhan 70 80; Mayur Padiya 30 Fail; Ankit Prajapati. Figure labels: Classification, Prediction.
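Numeric prediction by regression can be sketched with a one-variable least-squares fit. The data below are hypothetical (a mid-term mark paired with a final mark) and deliberately lie on a straight line so the fitted values are easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predict a numeric value for a new input x."""
    return slope * x + intercept

# Hypothetical data: mid-term marks (x) and final marks (y).
midterm = [30, 40, 50, 60]
final = [35, 45, 55, 65]

slope, intercept = fit_line(midterm, final)
print(slope, intercept)               # 1.0 5.0
print(predict(slope, intercept, 70))  # 75.0
```

Unlike classification, the output here is a number (a predicted mark), not a class label.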

36 Classification and Prediction

37 Cluster Analysis
Clustering analyzes data objects without consulting class labels.
Clustering can be used to generate class labels for a group of data that did not have them at the beginning.
The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
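Grouping objects without class labels can be illustrated with k-means, one common clustering technique (the slides do not name a specific algorithm, so this choice is an assumption). The points and the initial centroids below are made up, and the initial centroids are fixed so the run is deterministic.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic k-means; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, centroids=[(1, 1), (8, 9)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The generated cluster indices (0 and 1) play the role of class labels that did not exist in the original data.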

38 Outlier Analysis
Outliers are data objects that do not comply with the general behavior or model of the data.
Many data mining techniques discard outliers or exceptions as noise. However, in some applications these rare events are more interesting than the regular ones.
The analysis of outlier data is referred to as outlier mining or outlier analysis, e.g., fraud detection.
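One simple way to flag values that do not comply with the general behavior of the data is a z-score rule. This is only one possible technique, not one named in the slides; the transaction amounts and the threshold of 2 standard deviations are assumptions chosen for illustration.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [100, 102, 98, 101, 99, 100, 500]
print(find_outliers(amounts))  # [500]
```

In a fraud detection setting, the flagged amount would be passed on for inspection instead of being discarded as noise.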

39 Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
This may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data.
Distinct features of such analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

40 Integration of a Data Mining System with a Database or Data Warehouse System
- No coupling: the DM system does not utilize any function of a DB/DW system.
- Loose coupling: the DM system uses some facilities of a DB/DW system.
- Semitight coupling: the DM system is linked to a DB/DW system, which provides efficient implementations of a few essential data mining primitives.
- Tight coupling: the DM system is smoothly integrated into the DB/DW system.

41 Major Issues in Data Mining
- Mining different kinds of data
- Handling multiple levels of abstraction
- Incorporation of background knowledge
- Visualization of mining results
- Handling of incomplete or noisy data
- Scalability of algorithms

42 Importance of Data Mining
Data collected in large data repositories become “data tombs”.
Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research.
Data mining tools turn data tombs into “golden nuggets” of knowledge.

43 Application of Data Mining
- Market analysis
- Fraud detection
- Customer retention
- Production control
- Science exploration

44 Social Impacts of Data Mining
- Privacy
- Profiling
- Unauthorized use

