Download presentation
Presentation is loading. Please wait.
1
Data Mining: Concepts and Techniques
Data Mining and Data Warehousing By Dr. NK SAKTHIVEL, Professor/ SoC SASTRA September 22, 2018 Data Mining: Concepts and Techniques
2
Evolution of Database Technology
1960s and Earlier : Hierarch Model Data collection and database creation, and Primitive File Processing 1970s – Early 1980s: Data-Base Management Systems Network DBMS Relational data model, relational DBMS implementation Data Modeling Tool Indexing, B++ Tree and Hashing Query Languages Transaction Management : Recovery, Concurrency Control On-Line Transaction Processing OLTP September 22, 2018 Data Mining: Concepts and Techniques
3
Evolution of Database Technology
Mid 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s—2000s: Data Analysis and Understanding Data mining and data warehousing, multimedia databases, and Web databases September 22, 2018 Data Mining: Concepts and Techniques
4
Data Mining: Concepts and Techniques
September 22, 2018 Data Mining: Concepts and Techniques
5
Database Processing vs. Data Mining Processing
Query Well defined SQL Query Poorly defined No precise query language Data Operational data Data Not operational data Output Precise Subset of database Output Fuzzy Not a subset of database September 22, 2018 Data Mining: Concepts and Techniques
6
Data Mining: Concepts and Techniques
Query Examples Database Data Mining Find all credit applicants with last name of Smith. Identify customers who have purchased more than $10,000 in the last month. Find all customers who have purchased milk Find all credit applicants who are poor credit risks. (classification) Identify customers with similar buying habits. (Clustering) Find all items which are frequently purchased with milk. (association rules) September 22, 2018 Data Mining: Concepts and Techniques
7
Data Mining: Concepts and Techniques
What Is Data Mining? Data mining - knowledge discovery in databases Extraction of interesting (implicit, previously unknown and potentially useful) information or patterns from data in large databases Alternative names and their “inside stories”: Data mining: a misnomer? Knowledge Discovery(mining) in Databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. What is not data mining? (Deductive) query processing. Expert systems or small ML/statistical programs September 22, 2018 Data Mining: Concepts and Techniques
8
Data Mining: Concepts and Techniques
What Is Data Mining? Gold Mining / Rock Mining / Sand Mining Data Mining / Knowledge Mining September 22, 2018 Data Mining: Concepts and Techniques
9
Why Data Mining? — Potential Applications
Database analysis and decision support Market analysis and management target marketing, customer relation management, market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis Fraud detection and management Other Applications Text mining (news group, , documents) and Web analysis. Intelligent query answering September 22, 2018 Data Mining: Concepts and Techniques
10
Market Analysis and Management (1)
Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies Target marketing Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. Determine customer purchasing patterns over time Conversion of single to a joint bank account: marriage, etc. Cross-market analysis Associations/co-relations between product sales Prediction based on the association information September 22, 2018 Data Mining: Concepts and Techniques
11
Market Analysis and Management (2)
Customer profiling data mining can tell you what types of customers buy what products (clustering or classification) Identifying customer requirements identifying the best products for different customers use prediction to find what factors will attract new customers Provides summary information various multidimensional summary reports statistical summary information (data central tendency and variation) September 22, 2018 Data Mining: Concepts and Techniques
12
Corporate Analysis and Risk Management
Finance planning and asset evaluation cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) Resource planning: summarize and compare the resources and spending Competition: monitor competitors and market directions group customers into classes and a class-based pricing procedure set pricing strategy in a highly competitive market September 22, 2018 Data Mining: Concepts and Techniques
13
Fraud Detection and Management (1)
Applications widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. Approach use historical data to build models of fraudulent behavior and use data mining to help identify similar instances Examples auto insurance: detect a group of people who stage accidents to collect on insurance money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) medical insurance: detect professional patients and ring of doctors and ring of references September 22, 2018 Data Mining: Concepts and Techniques
14
Fraud Detection and Management (2)
Detecting inappropriate medical treatment Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr). Detecting telephone fraud Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. Retail Analysts estimate that 38% of retail shrink is due to dishonest employees. September 22, 2018 Data Mining: Concepts and Techniques
15
Data Mining: Concepts and Techniques
Other Applications Sports IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat Astronomy JPL and the Palomar Observatory discovered 22 quasars with the help of data mining Internet Web Surf-Aid IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc. September 22, 2018 Data Mining: Concepts and Techniques
16
Data Mining: A KDD Process
Knowledge Pattern Evaluation Data mining: the core of knowledge discovery process. Data Mining Task-relevant Data Selection Data Warehouse Data Cleaning Data Integration Databases September 22, 2018 Data Mining: Concepts and Techniques
17
Data Mining: Concepts and Techniques
Steps of a KDD Process Learning the application domain: relevant prior knowledge and goals of application Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation: Find useful features, dimensionality/variable reduction, invariant representation. Choosing functions of data mining summarization, classification, regression, association, clustering. Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge September 22, 2018 Data Mining: Concepts and Techniques
18
Data Mining and Business Intelligence
Increasing potential to support business decisions End User Making Decisions Data Presentation Business Analyst Visualization Techniques Data Mining Data Analyst Information Discovery Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA DBA Data Sources Paper, Files, Information Providers, Database Systems, OLTP September 22, 2018 Data Mining: Concepts and Techniques
19
Architecture of a Typical Data Mining System
Graphical user interface Pattern evaluation Data mining engine Knowledge-base Database or data warehouse server Filtering Data cleaning & data integration Data Warehouse Databases September 22, 2018 Data Mining: Concepts and Techniques
20
Data Mining: On What Kind of Data?
Relational databases – Interrelated Operations Data warehouses - Collecting information from different Dbases Transactional databases – File / Table Transaction Advanced DB Systems and their applications Object-oriented and object-relational databases Spatial databases - from satellite Time-series data-base and – Sequence of time ( Market) Temporal data-base - Information with Timestamp Text databases and multimedia databases – Objects and Images Heterogeneous and legacy databases – network databases WWW September 22, 2018 Data Mining: Concepts and Techniques
21
Data Mining Functionalities
Perform Inference for Prediction General Properties September 22, 2018 Data Mining: Concepts and Techniques
22
Data Mining Functionalities (1)
Concept description: Characterization and discrimination Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions ( further precise/accurate is possible ) Association (correlation and causality) Link Analysis or Association Multi-dimensional vs. single-dimensional association X is a Customer and T is a Transaction age(X, “20..29”) ^ income(X, “20..29K”) à buys(X, “PC”) [support = 2%, confidence = 60%] Support means % of transaction and Confidence is degree of certainty of the detected association contains(T, “computer”) à contains(T, “software”) [1%, 50%] Bread with Pretzels is 60% and Bread with Jelly is 70% September 22, 2018 Data Mining: Concepts and Techniques
23
Data Mining Functionalities (2)
Classification and Prediction (Supervised ) Finding Groups / classes / models (functions) that describe and distinguish classes or concepts for future prediction E.g., classify countries based on climate, or classify cars based on mileage Presentation: decision-tree, classification rule, neural network Prediction: Predict some unknown or missing numerical values Pattern Matching Credit card company must determine to authorize the credit card purchase i. Authorize ii. Ask for further identification before authorize iii. Do not authorize iv. Do not authorize and contact police September 22, 2018 Data Mining: Concepts and Techniques
24
Data Mining Functionalities (2)
Cluster analysis (Unsupervised ) Pattern Analysis, Data Analysis, Image Processing and Market Research By clustering, one can identify dense and discover overall distribution patterns Early in childhood, one learns how to distinguish between cats and dogs or animals and plants Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity September 22, 2018 Data Mining: Concepts and Techniques
25
Data Mining Functionalities (3)
Outlier analysis (used to identify error) Outlier: a data object that does not comply (fulfill) with the general behavior of the data For example, the program output as Age : -18 Salary of the chief executive officer of a company Data Mining algorithms try to minimize the influence of outliers or eliminate them all together. It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis September 22, 2018 Data Mining: Concepts and Techniques
26
Data Mining Functionalities (3)
Evolution analysis It describes models regularities or rends for objects, whose behavior changes over time It includes Association, Classification or clustering of Time-Related data. However, distinct features are Time-series data analysis (Stock Market Analysis for last several years to invest shares of high-tech industrial company) Sequence or periodicity pattern matching Similarity-based data analysis September 22, 2018 Data Mining: Concepts and Techniques
27
Are All of the Patterns Interesting
Data Mining system can generate thousands of Patterns or rules What make patterns interesting? Can a data mining system generate all of the interesting patterns? Can a data mining system generate only interesting patterns? September 22, 2018 Data Mining: Concepts and Techniques
28
Are All of the Patterns Interesting
What make patterns interesting? Interesting means knowledge Hypothesis become confirmed Support and confidence Can a data mining system generate all of the interesting patterns? Need completeness of the DMAlgorithm Depends on association Rule September 22, 2018 Data Mining: Concepts and Techniques
29
Mining Association Rules—An Example
Min. support 50% Min. confidence 50% For rule A C: support = support({A C}) = 50% confidence = support({A C})/support({A}) = 66.6% September 22, 2018 Data Mining: Concepts and Techniques
30
Data Mining: Confluence of Multiple Disciplines
Database Technology Statistics Data Mining Machine Learning Visualization Information Science Other Disciplines Fuzzy/Neural/GA September 22, 2018 Data Mining: Concepts and Techniques
31
Data Mining Development
Similarity Measures Hierarchical Clustering IR Systems Imprecise Queries Textual Data Web Search Engines Relational Data Model SQL Association Rule Algorithms Data Warehousing Scalability Techniques Bayes Theorem Regression Analysis EM Algorithm K-Means Clustering Time Series Analysis Algorithm Design Techniques Algorithm Analysis Data Structures Neural Networks Decision Tree Algorithms September 22, 2018 Data Mining: Concepts and Techniques
32
Categorized of Data Mining
Classification according to the kinds of Database mined Classification according to the kinds of Knowledge Mined Classification according to the kinds of Technique Utilized Classification according to the kinds of Application Adapted September 22, 2018 Data Mining: Concepts and Techniques
33
Data Mining: Classification Schemes
General functionality Descriptive data mining Predictive data mining Different views, different classifications Kinds of databases to be mined Kinds of knowledge to be discovered Kinds of techniques utilized Kinds of applications adapted September 22, 2018 Data Mining: Concepts and Techniques
34
Ex: Time Series Analysis
Example: Stock Market Predict future values Determine similar patterns over time Classify behavior September 22, 2018 Data Mining: Concepts and Techniques
35
Related Concepts Outline
Goal: Examine some areas which are related to data mining. Database/On Line Transaction Processing Systems Fuzzy Sets and Logic Information Retrieval(Web Search Engines) On Line Analytic Processing /DSS Statistics Machine Learning Pattern Matching September 22, 2018 Data Mining: Concepts and Techniques
36
Major Issues in Data Mining
Regarding Mining Methodology and User Interaction Performance and Diverse Data Types September 22, 2018 Data Mining: Concepts and Techniques
37
Major Issues in Data Mining
Mining Methodology and User Interaction This is the kinds of Knowledge Mined Different users have different knowledge Hence, spectrum of data analysis and knowledge discovery tasks with Dmining functionalities These tasks may use same database with different techniques Which leads different discovery The ability to mine knowledge Interactive Mining with refining knowledge with OLAP Incorporation of backward knowledge with deduction rule The use of domain knowledge and Ad hoc mining Which is used for efficient data mining Knowledge Visualization like graphs, charts, curves September 22, 2018 Data Mining: Concepts and Techniques
38
Major Issues in Data Mining
Performance Issues It includes Efficiency, Scalability and Parallelism of DM Algorithms Efficiency and scalability of Data Mining Algorithm DM Algorithms should be efficient and scalable The running time to be predictable for large databases Parallel, Distributed and Incremental DM Algorithms Computational Complexity should be analyzed Algorithm has to divide the data into partitions and then can proceed in parallel The results can merge later If need, the dbase can update (incremental) before computation September 22, 2018 Data Mining: Concepts and Techniques
39
Major Issues in Data Mining
Diversity of Database Types Relational Database Data Warehouse Complex Data Objects Hypertext and Multimedia Data Spatial Data Temporal Data and etc and Internet having Heterogeneous DataBases Hence, we need an efficient and effective data mining systems for different dbases, for different goals September 22, 2018 Data Mining: Concepts and Techniques
40
Data Mining: Concepts and Techniques
Summary Database Technology for DBMSystems Data Mining for discovering interesting patterns Knowledge Discovery Process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation Data Warehouse organized way for decision making Data Mining Functionalities Description, association, Classification, Prediction, clustering, trend analysis, deviation analysis, similarity analysis Data Mining Systems September 22, 2018 Data Mining: Concepts and Techniques
41
Data Mining: Concepts and Techniques
Exercises What is data mining? In your answer, address the following: Is it another hype? Is it a simple transformation of technology developed from databases, statistics, and machine learning? Explain how the evolution of database technology led to data mining. Describe the steps involved in data mining when viewed as a process of knowledge discovery. September 22, 2018 Data Mining: Concepts and Techniques
42
Data Mining: Concepts and Techniques
Exercises What is data mining? In your answer, address the following: Data mining refers the process or method that extracts or "mines" interesting knowledge or patterns from large amounts of data. Is it another hype? Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data, into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology. September 22, 2018 Data Mining: Concepts and Techniques
43
Data Mining: Concepts and Techniques
Exercises Is it a simple transformation of technology developed from databases, statistics, and machine learning? No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. Instead, data mining involves an integration, rather than a simple transformation, of techniques from multiple, disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. September 22, 2018 Data Mining: Concepts and Techniques
44
Data Mining: Concepts and Techniques
Exercises Explain how the evolution of database technology led to data mining. Database technology began with the development of data collection and database creation mechanisms that, led to the development of effective mechanisms for data management including data storage and retrieval, and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Hence, data mining began its development out of this necessity. September 22, 2018 Data Mining: Concepts and Techniques
45
Data Mining: Concepts and Techniques
Exercises Describe the steps involved in data mining when viewed as a process of knowledge discovery. The steps involved in data mining when viewed as a process of knowledge discovery arc as follows: Data cleaning, a process that removes or transforms noise and inconsistent data Data integration, where multiple data sources may be combined Data selection, where data relevant to the analysis task are retrieved from the database Data transformation, where data are transformed or consolidated into forms appropriate for mining Data mining, an essential process where intelligent and efficient methods are applied in order to extract, patterns Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge leased on some interestingness measures Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user September 22, 2018 Data Mining: Concepts and Techniques
46
Data Mining: Concepts and Techniques
Exercises Distinguish between data warehouse and database Applications of Data Mining Architecture and KDD Process of Data Mining Functionalities Challenges and Issues September 22, 2018 Data Mining: Concepts and Techniques
47
Data Mining: Concepts and Techniques
DATA WAREHOUSE Distinguish between data warehouse and database Applications of Data Mining Architecture and KDD Process of Data Mining Functionalities Challenges and Issues September 22, 2018 Data Mining: Concepts and Techniques
48
Data Mining: Concepts and Techniques
September 22, 2018 Data Mining: Concepts and Techniques
49
Data Warehousing and OLAP Technology for Data Mining
What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data cube technology From data warehousing to data mining September 22, 2018 Data Mining: Concepts and Techniques
50
Data Mining: Concepts and Techniques
What is Data Warehouse? Defined in many different ways A decision support database that is maintained separately from the organization’s operational database Support information processing by providing a solid platform of consolidated, historical data for analysis. “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon September 22, 2018 Data Mining: Concepts and Techniques
51
Data Warehouse—Subject-Oriented
Organized around major subjects, such as customer, product, sales. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. September 22, 2018 Data Mining: Concepts and Techniques
52
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources When data is moved to the warehouse, it is converted. September 22, 2018 Data Mining: Concepts and Techniques
53
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element”. September 22, 2018 Data Mining: Concepts and Techniques
54
Data Warehouse—Non-Volatile
A physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data. September 22, 2018 Data Mining: Concepts and Techniques
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.