1 Introduction to Data Mining Chapter 1
2 Chapter 1 Outline
– Background
– Information is Power
– Knowledge is Power
– Data Mining
3 Introduction
4
5 Information is Power
– Relevant
– Right Information
– Globalised world
– Vast amount of information available
6 What is Information?
– A collection of data
– The act of human analysis and interpretation of activities
– Decomposing it into various components and tackling them
7 What is Knowledge?
– The act of human synthesis and evaluation of information
– Integrating the relevant components to form a relevant whole system
8 Data Mining Definition I
– The nontrivial extraction of hidden, previously unidentified, and potentially valuable knowledge from data.
– The use of a variety of techniques, such as neural networks, decision trees, or standard statistical techniques, to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation.
9 Data Mining Definition II
– Finding hidden information in a database
10 Hidden Information
– Years of experience
– Great secret recipes
– Success factors
11 Database Processing vs. Data Mining
Database Processing
– Query: well defined; SQL
– Data: operational data
– Output: precise; a subset of the database
Data Mining
– Query: poorly defined; no precise query language
– Data: not operational data
– Output: fuzzy; not a subset of the database
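The database-processing side of this comparison can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `purchases` table, its columns, and its rows are hypothetical and exist only for illustration. The point is that a well-defined SQL query returns a precise answer made up of rows already stored in the database.

```python
import sqlite3

# Hypothetical operational table; the names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)")
rows = [
    ("Lee", "bread", 2.50),
    ("Tan", "milk", 1.80),
    ("Lee", "jam", 3.20),
    ("Wong", "bread", 2.50),
]
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)

# A well-defined query: the answer is exact and is a subset of the stored rows.
bread_buyers = conn.execute(
    "SELECT customer, item, amount FROM purchases "
    "WHERE item = ? ORDER BY customer",
    ("bread",),
).fetchall()

print(bread_buyers)
```

A data mining question such as "which customers behave alike?" has no equivalent single `WHERE` clause, and its answer (a grouping, a rule, a score) is not itself a set of stored rows.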
12 Query Examples
Database
– Find all customers who have purchased bread.
– Find all credit applicants with the surname Lee.
– Identify customers who have purchased more than $100,000 in the last year.
Data Mining
– Find all items which are frequently purchased with bread. (association rules)
– Find all credit applicants who are good credit risks. (classification)
– Identify customers with similar eating habits. (clustering)
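To make the bread example concrete, here is a small sketch in plain Python that contrasts the two kinds of question. The baskets, the helper name `bought_with`, and the 0.5 confidence threshold are all assumptions made for illustration. It is a simple co-occurrence count, not a full association rule algorithm.

```python
from collections import Counter

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
]


def bought_with(baskets, target, min_confidence=0.5):
    """Return {item: confidence} for items co-purchased with `target`.

    Confidence is the fraction of baskets containing `target`
    that also contain the item.
    """
    with_target = [b for b in baskets if target in b]
    if not with_target:
        return {}
    counts = Counter(item for b in with_target for item in b if item != target)
    total = len(with_target)
    return {
        item: n / total
        for item, n in counts.items()
        if n / total >= min_confidence
    }


# Database-style question: an exact filter over the stored transactions.
bread_baskets = [b for b in baskets if "bread" in b]

# Mining-style question: which items tend to accompany bread?
frequent = bought_with(baskets, "bread")
print(len(bread_baskets), frequent)
```

The first result is simply a subset of the stored baskets. The second is a derived pattern whose content depends on a chosen threshold, which is one reason the mining query counts as "poorly defined".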
13 Data Mining Models and Tasks
14 Data Mining vs. KDD
– Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
– Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.
15 KDD Process
– Selection (Pre-Mining 1): Obtain data from various sources.
– Preprocessing (Pre-Mining 2): Cleanse data.
– Transformation (Pre-Mining 3): Convert to common format. Transform to new format.
– Data Mining: Obtain desired results.
– Interpretation/Evaluation (Post-Mining): Present results to user in meaningful manner.
Modified from [FPSS96C]
16 KDD Process Ex: Web Log
Selection:
– Select log data (dates and locations) to use
Preprocessing:
– Remove identifying URLs
– Remove error logs
Transformation:
– Sessionize (sort and group)
Data Mining:
– Identify and count patterns
– Construct data structure
Interpretation/Evaluation:
– Identify and display frequently accessed sequences
Potential User Applications:
– Cache prediction
– Personalisation
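The web log steps above can be sketched end to end in a few lines of Python. Everything specific here is an assumption for illustration: the record layout, the sample entries, the 30-minute `SESSION_GAP`, and the choice of consecutive page pairs as the "pattern". The step that removes identifying URLs is omitted because the sample data contains none.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (visitor id, timestamp in seconds, URL, HTTP status).
raw_log = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 45, "/missing", 404),
    ("u1", 60, "/cart", 200),
    ("u2", 10, "/home", 200),
    ("u2", 50, "/products", 200),
    ("u2", 5000, "/home", 200),
    ("u2", 5020, "/products", 200),
]

SESSION_GAP = 1800  # assumed inactivity threshold, in seconds

# Selection: keep only the time window of interest.
selected = [r for r in raw_log if 0 <= r[1] < 10_000]

# Preprocessing: drop error entries (status 400 and above).
clean = [r for r in selected if r[3] < 400]


def sessionize(records, gap=SESSION_GAP):
    """Transformation: sort by visitor and time, then split on long gaps."""
    by_user = defaultdict(list)
    for user, ts, url, _status in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


sessions = sessionize(clean)

# Data mining: count consecutive page pairs across all sessions.
pair_counts = Counter(pair for s in sessions for pair in zip(s, s[1:]))

# Interpretation/evaluation: report the most frequently accessed sequence.
top = pair_counts.most_common(1)
print(top)
```

A result such as "visitors usually go from `/home` to `/products`" is the kind of pattern that cache prediction or personalisation could act on.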
17 Data Mining Development
– Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
– Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
– Neural Networks, Decision Tree Algorithms
– Algorithm Design Techniques, Algorithm Analysis, Data Structures
– Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques