Published by Valerie Mitchell. Modified over 8 years ago.
1
Data Mining: Concepts and Techniques www.ePowerPoint.com
2
Course Objective
- Introduce the fundamental concepts of data mining
- Learn basic data mining principles and algorithms
- Develop the ability to solve problems with data mining techniques
3
Textbook
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2012
- Wu Sen, Gao Xuedong, and Bastian. Data Warehouse and Data Mining (数据仓库与数据挖掘). Beijing: Metallurgical Industry Press, 2003
4
Course Assessment Attendance + homework + project: 30% Final examination: 70%
5
Chapter 1. Introduction
- Why Data Mining?
- What Is Data Mining?
- A Multi-Dimensional View of Data Mining
- What Kind of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- What Technologies Are Used?
- What Kind of Applications Are Targeted?
- Major Issues in Data Mining
- A Brief History of Data Mining and Data Mining Society
- Summary
6
Why Data Mining?
- The explosive growth of data: from terabytes to petabytes
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: remote sensing, bioinformatics, scientific simulation, …
  - Social media and everyone: search engines, news, blogs, social networks, YouTube, digital pictures and videos, …
- From the data age to the information age: powerful data analysis tools are needed
  - Database systems offer query and transaction processing as common practice.
  - Data mining tools perform data analysis and extract the valuable knowledge embedded in vast amounts of data.
7
Why Data Mining?
8
Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle, for example, terabytes of data
- High dimensionality of data
  - Data may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications
9
Why Now? Data is being produced Data is being warehoused The computing power is available The computing power is affordable The competitive pressures are strong Commercial products are available
10
Evolution of Database Technology
- 1960s: database creation, IMS and network DBMS
- 1970s: relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s: data mining, data warehousing, multimedia databases, and Web databases
- 2000s: stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems
12
What can we get from data mining?
13
What Is Data Mining?
Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
14
What Is Data Mining? Interesting patterns: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
15
What Is Data Mining?
16
Data mining: a misnomer? Alternative names Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
17
What Is Not Data Mining?
Watch out: is everything "data mining"? The following are not:
- Simple search and query processing
- (Deductive) expert systems
- OLAP based on a data warehouse
- Statistical analysis systems
- Information systems
18
Data mining context: levels of data analysis
Level     Method
Surface   Simple database queries
Shallow   Statistical analysis
Hidden    Data mining
19
Database Processing vs. Data Mining Processing
          Database processing            Data mining processing
Query     Well defined; SQL              Poorly defined; no precise query language
Data      Operational data               Not operational data
Output    Precise; subset of database    Fuzzy; not a subset of database
20
Query Examples (database vs. data mining)
Database:
- Find all customers who have purchased milk.
- Find all credit applicants with the last name Smith.
- Identify customers who have purchased more than $10,000 in the last month.
Data mining:
- Find all items which are frequently purchased with milk. (association rules)
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
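The contrast between the two kinds of question can be sketched in a few lines of Python. The transactions, the function names, and the 50% co-occurrence threshold are all made up for illustration; this is a toy under those assumptions, not a real association-rule miner.

```python
from collections import Counter

# Hypothetical toy transactions: customer id -> set of purchased items.
transactions = {
    "c1": {"milk", "bread", "eggs"},
    "c2": {"milk", "bread"},
    "c3": {"beer", "chips"},
    "c4": {"milk", "bread", "butter"},
}

def customers_who_bought(item):
    """Database-style query: a precise question whose answer is a
    subset of the stored data."""
    return sorted(c for c, items in transactions.items() if item in items)

def items_bought_with(item, min_fraction=0.5):
    """Mining-style question: which items co-occur with `item` often enough?
    The answer is derived from the data, not stored in it."""
    baskets = [items for items in transactions.values() if item in items]
    counts = Counter(other for b in baskets for other in b if other != item)
    return sorted(o for o, n in counts.items() if n / len(baskets) >= min_fraction)

print(customers_who_bought("milk"))
print(items_bought_with("milk"))
```

The first function returns rows that already exist; the second returns a pattern whose content depends on a threshold the analyst has to choose.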
21
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process:
Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
22
Knowledge Discovery (KDD) Process
1. Data cleaning: removes noise and inconsistent data
2. Data integration: multiple data sources may be combined
3. Data selection: data relevant to the analysis task are retrieved from the database
4. Data mining: an essential process in which intelligent methods (such as classification and clustering) are applied in order to extract data patterns
5. Pattern evaluation: identifies the truly interesting patterns representing knowledge, based on some interestingness measures
6. Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user
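The six steps can be lined up as a small pipeline. Everything in the sketch below is hypothetical (the records, the function names, the thresholds), and the "mining" step is only a per-city count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"id": 1, "city": "Paris", "spend": 120.0},
           {"id": 2, "city": None, "spend": 80.0}]
store_b = [{"id": 3, "city": "paris", "spend": 200.0},
           {"id": 4, "city": "Rome", "spend": 40.0}]

def clean(records):
    # Step 1: drop incomplete records and normalize inconsistent spellings.
    return [{**r, "city": r["city"].title()} for r in records
            if all(v is not None for v in r.values())]

def integrate(*sources):
    # Step 2: combine several sources into one collection.
    return [r for source in sources for r in source]

def select(records, min_spend):
    # Step 3: keep only the data relevant to the analysis task.
    return [r for r in records if r["spend"] >= min_spend]

def mine(records):
    # Step 4: a stand-in "mining" step that counts records per city.
    return Counter(r["city"] for r in records)

def evaluate(patterns, min_count):
    # Step 5: keep patterns that pass a simple interestingness threshold.
    return {k: n for k, n in patterns.items() if n >= min_count}

def present(patterns):
    # Step 6: present the result to the user in readable form.
    return [f"{city}: {n} high-spend customers" for city, n in sorted(patterns.items())]

data = select(integrate(clean(store_a), clean(store_b)), min_spend=100.0)
report = present(evaluate(mine(data), min_count=2))
print(report)
```

In practice the steps are iterated rather than run once, since what is found during evaluation often sends the analyst back to cleaning or selection.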
23
Example: A Web Mining Framework
Web mining usually involves:
- Data cleaning
- Data integration from multiple sources
- Warehousing the data
- Data cube construction
- Data selection for data mining
- Data mining
- Presentation of the mining results
- Patterns and knowledge to be used or stored in a knowledge base
24
KDD Process: A Typical View from ML and Statistics
Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
- Data pre-processing: data integration, normalization, feature selection, dimension reduction
- Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
- Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from the typical machine learning and statistics communities.
25
Example: Medical Data Mining
Health care and medical data mining often adopt this view from statistics and machine learning:
- Preprocessing of the data (including feature extraction and dimension reduction)
- Classification and/or clustering processes
- Post-processing for presentation
26
Data Mining and Business Intelligence
Layers, from top to bottom (the potential to support business decisions increases toward the top):
- Decision Making
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Summary, Querying, and Reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
28
Multi-Dimensional View of Data Mining
- Data to be mined
  - Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs & social and information networks
- Knowledge to be mined (or: data mining functions)
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Descriptive vs. predictive data mining
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
30
Data Mining: On What Kinds of Data?
- In general, data mining should be applicable to any kind of data repository, as well as to data streams
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Multimedia databases
  - Text databases
  - The World-Wide Web
31
Data sets
Data set concerning bridges in the USA:
E13,A,33,CRAFTS,HIGHWAY,?,2,N,THROUGH,WOOD,?,S,WOOD
E15,A,28,CRAFTS,RR,?,2,N,THROUGH,WOOD,?,S,WOOD
E16,A,25,CRAFTS,HIGHWAY,MEDIUM,2,N,THROUGH,IRON,MEDIUM,S-F,SUSPEN
E17,M,4,CRAFTS,RR,MEDIUM,2,N,THROUGH,IRON,MEDIUM,?,SIMPLE-T
E18,A,28,CRAFTS,RR,MEDIUM,2,N,THROUGH,IRON,SHORT,S,SIMPLE-T
E19,A,29,CRAFTS,HIGHWAY,MEDIUM,2,N,THROUGH,WOOD,MEDIUM,S,WOOD
E20,A,32,EMERGING,HIGHWAY,MEDIUM,2,N,THROUGH,WOOD,MEDIUM,S,WOOD
E21,M,16,EMERGING,RR,?,2,?,THROUGH,IRON,?,?,SIMPLE-T
E23,M,1,EMERGING,HIGHWAY,MEDIUM,?,?,THROUGH,STEEL,LONG,F,SUSPEN
E22,A,24,EMERGING,HIGHWAY,MEDIUM,4,G,THROUGH,WOOD,SHORT,S,WOOD
E24,O,45,EMERGING,RR,?,2,G,?,STEEL,?,?,SIMPLE-T
E25,M,10,EMERGING,RR,?,2,G,?,STEEL,?,?,SIMPLE-T
E27,A,39,EMERGING,RR,?,2,G,THROUGH,STEEL,?,F,SIMPLE-T
E26,M,12,EMERGING,RR,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,SIMPLE-T
E30,A,31,EMERGING,RR,?,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T
E29,A,26,EMERGING,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,?,SUSPEN
E28,M,3,EMERGING,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,ARCH
E32,A,30,EMERGING,HIGHWAY,?,2,G,THROUGH,IRON,MEDIUM,F,SIMPLE-T
E31,M,8,EMERGING,RR,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,SIMPLE-T
E34,O,41,EMERGING,RR,LONG,2,G,THROUGH,STEEL,LONG,F,SIMPLE-T
E33,M,19,EMERGING,HIGHWAY,MEDIUM,?,G,THROUGH,IRON,MEDIUM,F,SIMPLE-T
E36,O,45,MATURE,HIGHWAY,?,2,G,THROUGH,IRON,SHORT,F,SIMPLE-T
E35,A,27,MATURE,HIGHWAY,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T
E38,M,17,MATURE,HIGHWAY,?,2,G,THROUGH,IRON,MEDIUM,F,SIMPLE-T
E37,M,18,MATURE,RR,MEDIUM,2,G,THROUGH,STEEL,MEDIUM,S,SIMPLE-T
E39,A,25,MATURE,HIGHWAY,?,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T
E4,A,27,MATURE,AQUEDUCT,MEDIUM,1,N,THROUGH,WOOD,SHORT,S,WOOD
E40,M,22,MATURE,HIGHWAY,?,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T
E41,M,11,MATURE,HIGHWAY,?,2,G,THROUGH,IRON,MEDIUM,F,SIMPLE-T
E42,M,9,MATURE,HIGHWAY,LONG,2,G,THROUGH,STEEL,LONG,F,SIMPLE-T
The format is simply comma-separated values.
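As a rough illustration of reading such a file, the sketch below parses three of the bridge records with Python's standard `csv` module. Treating `?` as "missing value" is an assumption (the slide does not say so), and the function name `parse` is made up for the example.

```python
import csv
import io

# Three records copied from the bridge data set shown on this slide.
raw = """E13,A,33,CRAFTS,HIGHWAY,?,2,N,THROUGH,WOOD,?,S,WOOD
E16,A,25,CRAFTS,HIGHWAY,MEDIUM,2,N,THROUGH,IRON,MEDIUM,S-F,SUSPEN
E21,M,16,EMERGING,RR,?,2,?,THROUGH,IRON,?,?,SIMPLE-T
"""

def parse(text):
    """Read comma-separated records, mapping '?' to None.
    Assumption: '?' marks a missing value."""
    reader = csv.reader(io.StringIO(text))
    return [[None if field == "?" else field for field in row] for row in reader]

rows = parse(raw)
# Count the missing fields in each record.
missing_per_row = [sum(field is None for field in row) for row in rows]
print(missing_per_row)
```

Counting missing values per record, as in the last lines, is a typical first check before deciding how to clean the data.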
32
Data sets Data set concerning geotechnical parameters format taken directly from a spreadsheet
33
Example: Cappuccino coffee relation
Figure annotations: missing data; attribute-value; attribute as number or name; note this attribute; process of discretization
34
Data Warehouse
- A data repository: multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making.
- Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
- On-line analytical processing (OLAP) is the major task of a data warehouse system.
35
Data Streams
- Streaming data: data flows in and out of an observation platform dynamically
- Features:
  - Huge or possibly infinite volume
  - Dynamically changing
  - Flowing in and out in a fixed order
  - Allowing only one or a small number of scans
  - Demanding fast (often real-time) response
- Examples: network traffic, stock exchange, Web click streams, weather monitoring
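The "one scan only" constraint means summaries have to be updated incrementally. A minimal sketch, assuming a numeric stream: Welford's online algorithm keeps the mean and variance in constant memory. The class name and the sample readings are made up for illustration.

```python
class RunningStats:
    """Mean and variance of a stream in one pass and constant memory
    (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Each value is seen exactly once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; reported as 0.0 before any data arrives.
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
print(stats.mean, stats.variance)
```

No reading is stored, so the memory used does not grow with the length of the stream.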
36
Data Mining: On What Kinds of Data?
Advanced data sets and advanced applications:
- Time-series data, temporal data, sequence data (incl. bio-sequences)
- Structured data, graphs, social networks and multi-linked data
- Object-relational databases
- Heterogeneous databases and legacy databases
- Spatial data and spatiotemporal data
- Multimedia databases
- Text databases
- The World-Wide Web
38
Data Mining Functionalities
General functionality:
- Descriptive data mining: characterizes the general properties of the data in the database
  - Such as: clustering/similarity matching, association rules, deviation detection
- Predictive data mining: performs inference on the current data in order to make predictions
  - Such as: classification, regression
39
Data Mining Models and Tasks
40
Concept Description: Characterization and Discrimination
- Data characterization is a summarization of the general characteristics or features of a target class of data.
  - E.g., describe the characteristics of customers who spent over $10,000 last year.
  - Methods: an SQL query to get the data; statistical measures; OLAP
  - The output can be pie charts, bar charts, curves, multidimensional data cubes, or multidimensional tables.
- Data discrimination is a comparison of the general features of the target class data objects with those of one or a set of contrasting classes.
41
Frequent Patterns, Association, Correlation vs. Causality
- Frequent patterns are patterns that occur frequently in data.
- There are many kinds of frequent patterns, such as frequent itemsets, subsequences, and substructures.
- Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
- Association analysis: Diaper -> Beer [20%, 75%] (correlation or causality?)
  - 20% support means that diapers and beer were purchased together in 20% of all the transactions under analysis.
  - 75% confidence (or certainty) means that if a customer buys diapers, there is a 75% chance that the customer will also buy beer.
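Support and confidence are simple ratios over the transaction list. The sketch below uses a made-up list of 15 transactions, chosen so that the rule Diaper -> Beer comes out at 20% support and 75% confidence as on the slide; the function names are illustrative.

```python
# Hypothetical transactions: 3 contain diaper and beer, 1 contains diaper
# without beer, and 11 contain neither.
transactions = (
    [{"diaper", "beer"}] * 3
    + [{"diaper", "milk"}]
    + [{"milk", "bread"}] * 11
)

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent): support of both together
    divided by support of the antecedent alone."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Neither number says anything about causality: both only measure how often the items appear together in the recorded transactions.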
42
Data Mining Functionalities
Classification and prediction:
- Construct models (functions) that describe and distinguish classes or concepts for future prediction
- A model may be presented in various forms: rules, decision trees, mathematical formulae, neural networks
- A model is derived based on the analysis of training data
- Predict some unknown or missing numerical values
43
Classification
Given old data about customers and payments, predict a new applicant's loan eligibility.
- Training data: previous customers (age, salary, profession, location, customer type)
- Classifier -> decision rules, e.g., Salary > 5 L, Prof. = Exec
- New applicant's data -> decision rules -> good/bad
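To make "derive decision rules from previous customers" concrete, the sketch below learns a single salary cut-off from labeled records and applies it to new applicants. The records, the function names, and the restriction to one attribute are assumptions made for illustration; real classifiers combine several attributes.

```python
# Hypothetical previous customers: (salary in lakhs, profession, label).
previous = [
    (2.0, "clerk", "bad"),
    (3.5, "clerk", "bad"),
    (4.0, "exec", "good"),
    (6.0, "clerk", "good"),
    (8.0, "exec", "good"),
    (3.0, "engineer", "bad"),
]

def learn_salary_threshold(records):
    """Pick the salary cut-off that misclassifies the fewest training
    records when 'salary > cut-off' predicts 'good'."""
    candidates = sorted({salary for salary, _, _ in records})

    def errors(cut):
        return sum((salary > cut) != (label == "good")
                   for salary, _, label in records)

    return min(candidates, key=errors)

threshold = learn_salary_threshold(previous)

def predict(salary):
    # Apply the learned one-attribute decision rule to a new applicant.
    return "good" if salary > threshold else "bad"

print(threshold, predict(5.0), predict(2.5))
```

A decision tree learner repeats this kind of search over all attributes and then recurses on each branch.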
44
Classification methods
- Regression (linear or any other polynomial)
- Nearest neighbour
- Decision tree classifier
- Bayesian learning
- Neural networks
45
Cluster analysis
- Class label is unknown: group data to form new classes
- Maximize intra-class similarity and minimize inter-class similarity, i.e., objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.
- Key requirement: a good measure of similarity between instances
- Applications:
  - Customer segmentation, e.g., for targeted marketing
  - Collaborative filtering: group based on common items purchased
  - Text clustering
  - Compression
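One common way to group unlabeled data is k-means, which the slide does not name; it is used here only as an example. The sketch assumes 2-D points, squared Euclidean distance as the (dis)similarity measure, and hand-picked starting centers.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

The result depends on the distance measure and the starting centers, which is why the slide singles out the similarity measure as the key requirement.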
46
Cluster Analysis
47
Outlier analysis
- Outlier: a data object that does not comply with the general behavior of the data
- Noise or exception? One person's garbage could be another person's treasure
- Methods: statistical tests; clustering or regression analysis
- Applications: fraud detection, rare events analysis
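A very simple statistical test flags values that lie more than a chosen number of standard deviations from the mean. The data, the function name, and the cut-off of 2.0 below are assumptions for illustration; the right cut-off depends on the application.

```python
import statistics

def zscore_outliers(values, cutoff=2.0):
    """Return the values whose distance from the mean exceeds
    `cutoff` population standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / std > cutoff]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [10, 11, 9, 10, 12, 10, 11, 95]
outliers = zscore_outliers(amounts)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so more robust measures (for example, ones based on the median) are often preferred in practice.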
48
Trend and evolution analysis
- Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
- Time-series data analysis
- Sequence analysis (sequential pattern mining), e.g., first buy a digital camera, then buy large SD memory cards
- Periodicity analysis
- Similarity-based analysis
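The camera-then-memory-card example amounts to checking how often one item is followed, later in the same purchase history, by another. A minimal sketch with made-up histories and a hypothetical function name:

```python
def sequence_support(sequences, first, then):
    """Fraction of sequences in which `first` occurs and `then`
    occurs at some later position."""
    def matches(seq):
        if first not in seq:
            return False
        # Only look at what happens after the first occurrence of `first`.
        return then in seq[seq.index(first) + 1:]

    return sum(matches(s) for s in sequences) / len(sequences)

# Hypothetical purchase histories, each in time order.
histories = [
    ["camera", "sd_card"],
    ["sd_card", "camera"],
    ["camera", "tripod", "sd_card"],
    ["laptop"],
]
result = sequence_support(histories, "camera", "sd_card")
print(result)
```

Unlike the itemset case, order matters: the second history contains both items but does not count, because the card was bought before the camera.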
49
Ex: Time Series Analysis Example: Stock Market Predict future values Determine similar patterns over time Classify behavior
50
Structure and Network Analysis
- Graph mining
  - Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (Web fragments)
- Information network analysis
  - Social networks: actors (objects, nodes) and relationships (edges), e.g., author networks in CS, terrorist networks
  - Multiple heterogeneous networks: a person could be in multiple information networks (friends, family, classmates, …)
  - Links carry a lot of semantic information: link mining
- Web mining
  - The Web is a big information network: from PageRank to Google
  - Analysis of Web information networks: Web community discovery, opinion mining, usage mining, …
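PageRank, mentioned above, scores each page by the rank flowing in over its incoming links. The sketch below is a textbook power iteration over a tiny made-up link graph; the damping factor of 0.85 and the iteration count are conventional choices rather than values from the slides, and every linked page is assumed to appear as a key.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: most links point at page "c".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page "d" has no incoming links and ends up with only the baseline share, while "c" collects rank from three different pages.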
51
Data Mining Models and Tasks
52
Are all the patterns interesting?
- A data mining system may generate thousands of patterns. Are all of them interesting?
- Can a data mining system generate all of the interesting patterns?
- Can a data mining system generate only the interesting patterns?
53
Are all the patterns interesting? Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
54
Are all the patterns interesting?
Objective vs. subjective interestingness measures:
- Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
- Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, etc.
56
Data Mining: Confluence of Multiple Disciplines
Data mining draws on: machine learning, statistics, applications, algorithms, pattern recognition, high-performance computing, visualization, and database technology
57
Why Confluence of Multiple Disciplines?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle, for example, terabytes of data
- High dimensionality of data
  - Micro-array data may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications
58
Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining What Kind of Data Can Be Mined? What Kinds of Patterns Can Be Mined? What Technology Are Used? What Kind of Applications Are Targeted? Major Issues in Data Mining A Brief History of Data Mining and Data Mining Society Summary
59
Applications of Data Mining
- Web page analysis: from Web page classification and clustering to the PageRank and HITS algorithms
- Collaborative analysis and recommender systems
- Basket data analysis to targeted marketing
- Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
- Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
- From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
60
Data Mining Application Areas
Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value-added data
Utilities                Power usage analysis
61
Data Mining Applications in Daily Life
Ubiquitous data mining:
- Shopping (Wal-Mart, 7-11, etc.): market basket analysis
- Online shopping: collaborative recommender systems
- Credit cards: detecting fraudulent usage
- Email: spam filters
- Internet: ads
62
Data Mining Works with Warehouse Data
- Data warehousing provides the enterprise with a memory
- Data mining provides the enterprise with intelligence
64
Major Issues in Data Mining (1)
(Slide dated Thursday, July 7, 2016)
- Mining methodology
  - Mining various and new kinds of knowledge
  - Mining knowledge in multi-dimensional space
  - Data mining: an interdisciplinary effort
  - Boosting the power of discovery in a networked environment
  - Handling noise, uncertainty, and incompleteness of data
  - Pattern evaluation and pattern- or constraint-guided mining
- User interaction
  - Interactive mining
  - Incorporation of background knowledge
  - Presentation and visualization of data mining results
65
Major Issues in Data Mining (2)
- Efficiency and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, stream, and incremental mining methods
- Diversity of data types
  - Handling complex types of data
  - Mining dynamic, networked, and global data repositories
- Data mining and society
  - Social impacts of data mining
  - Privacy-preserving data mining
  - Invisible data mining
66
Chapter 1. Introduction
- Motivation: Why data mining?
- What is data mining?
- Data mining: On what kind of data?
- Data mining functionality
- Classification of data mining systems
- Major issues in data mining
- A brief history of data mining and data mining society
- Summary
67
Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining