1 Advanced Databases Data Mining Dr Theodoros Manavis


1 Advanced Databases Data Mining Dr Theodoros Manavis tmanavis@ist.edu.gr

2 Data Mining Definition Data Mining is: (1) the efficient discovery of previously unknown, non-trivial, implicit, valid, potentially useful, understandable patterns in large datasets; (2) the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

3 Data Mining Definition Alternative names: –Data mining: a misnomer? –Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. What is not data mining? –Query processing –Expert systems or small ML/statistical programs

4 Data Mining vs. Data Query Data Query, e.g.: –A list of all customers who use a credit card to buy a PC –A list of all MIS students having a GPA of 3.5 or higher who have studied 4 or fewer semesters Data Mining problems, e.g.: –What is the likelihood of a customer purchasing a PC with a credit card? –Given the characteristics of an MIS student, predict her GPA in the coming term –What are the characteristics of MIS undergrad students?

5 Data Mining Definition

6 Knowledge Discovery Process

7 Data Selection Select the information about people who have subscribed to a magazine

8 Cleaning Pollution: typing errors, people moving from one place to another without notifying a change of address, people giving incorrect information about themselves –Pattern recognition algorithms

9 Cleaning Lack of domain consistency

10 Enrichment Need extra information about the clients: date of birth, income, amount of credit, and whether or not an individual owns a car or a house

11 Coding Select only those records that have enough information to be of value (rows) Project the fields in which we are interested (columns)

12 Coding Code the information that is too detailed: –Address to region –Birth date to age –Divide income by 1000 –Divide credit by 1000 –Convert cars yes-no to 1-0 –Convert purchase date to month numbers starting from 1990 The way in which we code the information determines the type of patterns we find Coding has to be performed repeatedly in order to get the best results
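The coding rules above can be sketched in a few lines of Python. This is a minimal illustration; the record layout, field names, and reference date are hypothetical, not from the lecture:

```python
from datetime import date

def code_record(record, today=date(2024, 1, 1)):
    """Recode a raw customer record into mining-friendly values.
    Field names (birth_date, income, ...) are hypothetical."""
    coded = {}
    # Birth date -> age in whole years
    born = record["birth_date"]
    coded["age"] = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    # Divide income and credit by 1000
    coded["income_k"] = record["income"] / 1000
    coded["credit_k"] = record["credit"] / 1000
    # Convert car ownership yes/no to 1/0
    coded["car"] = 1 if record["owns_car"] else 0
    # Purchase date -> month number counting from January 1990
    p = record["purchase_date"]
    coded["purchase_month"] = (p.year - 1990) * 12 + p.month
    return coded

example = {
    "birth_date": date(1985, 6, 15),
    "income": 42000,
    "credit": 8000,
    "owns_car": True,
    "purchase_date": date(1995, 3, 10),
}
print(code_record(example))
```

As the slide notes, the choice of coding (e.g. the 1990 epoch for months) determines which patterns can be found, so in practice this step is redone several times.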

13 Data Mining and Business Intelligence Increasing potential to support business decisions (bottom to top): –Making Decisions (End User) –Data Presentation, Visualization Techniques (Business Analyst) –Data Mining, Information Discovery (Data Analyst) –Data Exploration: OLAP, MDA, Statistical Analysis, Querying and Reporting –Data Warehouses / Data Marts (DBA) –Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

14 Examples of Large Datasets Government: IRS, NGA, … Large corporations: –WALMART: 20M transactions per day –MOBIL: 100 TB geological databases –AT&T: 300M calls per day –Credit card companies Scientific: –NASA EOS (Earth Observing System) project: 50 GB per hour –Environmental datasets

15 Examples of Data Mining Applications 1. Fraud detection: credit cards, phone cards 2. Marketing: customer targeting 3. Data Warehousing: Walmart 4. Astronomy 5. Molecular biology

16 The Data Mining Process 1. Goal identification: –Define the problem –Identify relevant prior knowledge and the goals of the application 2. Creating a target data set: data selection 3. Data preprocessing (may take 60%-80% of the effort!): –Removal of noise or outliers –Strategies for handling missing data fields –Accounting for time-sequence information 4. Data reduction and transformation: –Find useful features; dimensionality/variable reduction

17 The Data Mining Process 5. Data Mining: –Choosing functions of data mining: summarization, classification, regression, association, clustering. –Choosing the mining algorithm(s): which models or parameters –Search for patterns of interest 6. Presentation and Evaluation: –visualization, transformation, removing redundant patterns, etc. 7. Taking action: –incorporating into the performance system –documenting –reporting to interested parties

18 An example: Customer Segmentation 1. The marketing department wants to perform a segmentation study on the customers of AllElectronics Company 2. Decide on relevant variables from a data warehouse on customers, sales, promotions –Customers: name, ID, income, age, education, ... –Sales: history of sales –Promotions: promotion types, durations, ... 3. Handle missing income values and addresses; determine outliers, if any 4. Generate new index variables representing the wealth of customers –Wealth = a*income + b*#houses + c*#cars + ... –Make the necessary transformations so that some data mining algorithms work more efficiently

19 Example: Customer Segmentation cont. 5.a: Choose clustering as the data mining functionality, as it is the natural one for a segmentation study, so as to find groups of customers with similar characteristics 5.b: Choose a clustering algorithm –e.g. K-means or any suitable one for the problem 5.c: Apply the algorithm –Find clusters or segments 6. Make the reverse transformations; visualize the customer segments 7. Present the results in the form of a report to the marketing department –Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive –Develop marketing strategies for each segment
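Step 5 above, clustering coded customer data with K-means, can be sketched in pure Python. The feature vectors (income in thousands, age) are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: alternate assignment and centre-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: each centre moves to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

# Two obvious customer groups: low-income/young vs high-income/older
points = [(20, 25), (22, 27), (21, 24), (80, 55), (82, 58), (79, 54)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]
```

With well-separated groups like these, the algorithm recovers the two segments regardless of the random initialization.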

20 Two Styles of Data Mining Descriptive data mining: –characterizes the general properties of the data in the database –finds patterns in data, and –the user determines which ones are important Predictive data mining: –performs inference on the current data to make predictions –we know what to predict The two are not mutually exclusive: –often used together –descriptive → predictive E.g. customer segmentation (descriptive, by clustering) followed by a risk assignment model (predictive, by ANN)

21 Supervised vs. Unsupervised Learning Supervised learning (classification, prediction) –Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations –New data is classified based on the training set Unsupervised learning (summarization, association, clustering) –The class labels of the training data are unknown –Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
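A minimal illustration of the supervised case: a 1-nearest-neighbour classifier that labels new data from a labelled training set. The points and class labels here are invented:

```python
def nearest_neighbor(train, query):
    """train: list of (features, label) pairs with class labels supplied
    (the 'supervision'); returns the label of the closest training point."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda fl: dist2(fl[0], query))
    return label

# Labelled training data: (spending, visits) -> customer class
train = [((1, 1), "budget"), ((1, 2), "budget"),
         ((8, 9), "big spender"), ((9, 8), "big spender")]
print(nearest_neighbor(train, (2, 1)))   # -> budget
print(nearest_neighbor(train, (8, 8)))   # -> big spender
```

In the unsupervised setting the labels would be absent, and an algorithm such as clustering would have to propose the groups itself.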

22 Descriptive Data Mining Discovering new patterns inside the data Used during the data exploration steps Typical questions answered by descriptive data mining: –What is in the data? –What does it look like? –Are there any unusual patterns? –What does the data suggest for customer segmentation? Users may have no idea which kinds of patterns may be interesting Patterns at various granularities: –Geography (country - region - city - street) –Student (university - faculty - department - minor) Functionalities of descriptive data mining: –Clustering (e.g. customer segmentation) –Summarization –Visualization –Association (e.g. market basket analysis)

23 Prediction: A Model is a Black Box Model: inputs X1, X2, … → output Y X: vector of independent variables or inputs Y = f(X): an unknown function Y: dependent variable or output; a single variable or a vector The user does not care what the model is doing: it is a black box The user is interested in the accuracy of its predictions

24 Predictive Data Mining Using known examples, the model is trained: –the unknown function is learned from data –the more data with known outcomes is available, the better the predictive power of the model Used to predict outcomes whose inputs are known but whose output values are not realized yet Never 100% accurate The performance of a model on past data is not what matters: –it is not important how well it predicts the known outcomes –its performance on unknown data is much more important

25 Typical questions answered by predictive models Who is likely to respond to our next offer? –based on the history of previous marketing campaigns Which customers are likely to leave in the next six months? What transactions are likely to be fraudulent? –based on known examples of fraud What is the total amount a customer will spend in the next month?

26

27 Why Data Preprocessing? Data in the real world is dirty –incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data –noisy: containing errors or outliers –inconsistent: containing discrepancies in codes or names No quality data, no quality mining results! –Quality decisions must be based on quality data –Data warehouse needs consistent integration of quality data –Required for both OLAP and Data Mining!

28 Why can Data be Incomplete? Attributes of interest are not available (e.g., customer information for sales transaction data) Data were not considered important at the time of the transactions, so they were not recorded! Data not recorded because of misunderstanding or malfunctions Data may have been recorded and later deleted! Missing/unknown values for some data

29 Data Cleaning Data cleaning tasks –Fill in missing values –Identify outliers and smooth out noisy data –Correct inconsistent data
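Two of the cleaning tasks above, filling in missing values and identifying outliers, can be sketched as follows. Filling with the attribute mean and flagging values more than 1.5 standard deviations from it are illustrative choices, as are the income figures:

```python
import statistics

def clean(values, z=1.5):
    """Fill missing values (None) with the attribute mean and flag
    values more than z standard deviations from it as outliers."""
    known = [v for v in values if v is not None]
    mean = statistics.mean(known)
    sd = statistics.stdev(known)
    filled = [mean if v is None else v for v in values]
    outliers = [v for v in filled if abs(v - mean) > z * sd]
    return filled, outliers

incomes = [30, 32, None, 29, 31, 300]   # 300 looks like a typing error
filled, outliers = clean(incomes)
print(outliers)  # -> [300]
```

A real cleaning step would also correct inconsistent codes and names, which needs domain knowledge rather than a statistical rule.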

30 Data Mining Functionalities (1/5) 1. Concept description: characterization and discrimination –Generalize, summarize, and contrast data characteristics, e.g., big spenders vs. budget spenders 2. Association (correlation and causality) –Multi-dimensional vs. single-dimensional association –age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%] –contains(T, “computer”) ⇒ contains(T, “software”) [1%, 75%]
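The [support, confidence] figures attached to association rules like those above are computed roughly as follows. The transactions here are invented, so the numbers only echo the slide's notation:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]
print(support(transactions, {"computer", "software"}))       # -> 0.5
print(confidence(transactions, {"computer"}, {"software"}))  # 2 of 3 computer baskets
```
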

31 Data Mining Functionalities (2/5) Characterization (concept description): summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics (a database of customers) –age, employment, income –drill down on any dimension

32 Data Mining Functionalities (3/5) Discrimination example (concept description): Example 1: Compare the general features of software products –whose sales increased by 10% in the last year (target class) –whose sales decreased by at least 30% during the same period (contrasting class) Example 2: Compare two groups of AllElectronics customers –I) those who shop for computer products regularly (target class): more than two times a month –II) those who rarely shop for such products (contrasting class): less than three times a year The resulting description: 80% of group I customers –university education –ages 20-40 60% of group II customers –seniors or young –no university degree

33 Data Mining Functionalities (4/5) 3. Classification and prediction –Finding models (functions) that describe and distinguish classes or concepts for future prediction –E.g., classify people as healthy or sick, or classify transactions as fraudulent or not –Methods: decision tree, classification rules, neural network –Prediction: predict some unknown or missing numerical values 4. Cluster analysis –Class label is unknown: group data to form new classes, e.g., cluster customers of a retail company to learn about the characteristics of different segments –Clustering principle: maximizing the intra-class similarity and minimizing the inter-class similarity

34 Data Mining Functionalities (5/5) 5. Outlier analysis –Outlier: a data object that does not comply with the general behavior of the data –It can be considered noise or an exception, but is quite useful in fraud detection and rare-events analysis 6. Trend and evolution analysis –Trend and deviation: regression analysis –Sequential pattern mining: click-stream analysis –Similarity-based analysis 7. Other pattern-directed or statistical analyses

35 Classification: Definition Given a collection of records (training set) –Each record contains a set of attributes; one of the attributes is the class. Find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. –A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
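The train/test methodology above can be sketched with a deliberately trivial model that predicts the majority class; the records are synthetic:

```python
import random

def train_test_split(records, test_frac=0.25, seed=0):
    """Shuffle and split records into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def majority_class(train):
    """A trivial 'model': always predict the most common training label."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

def accuracy(model_label, test):
    """Fraction of held-out test records the model labels correctly."""
    return sum(label == model_label for _, label in test) / len(test)

# Synthetic records: (features, class)
records = [((i,), "yes" if i % 3 else "no") for i in range(40)]
train, test = train_test_split(records)
model = majority_class(train)
print(f"accuracy on held-out test set: {accuracy(model, test):.2f}")
```

The point of the split is exactly the slide's: the score that matters is measured on records the model never saw during training.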

36 Classification Example [Figure: a training set of records with categorical and continuous attributes plus a class label; a classifier is learned from the training set and then applied to a test set]

37 Example of a Decision Tree [Figure: training data with categorical, continuous, and class attributes, and the learned model, a decision tree. Splitting attributes: HO at the root (Yes → NO; No → MarSt), MarSt (Married → NO; Single, Divorced → TaxInc), TaxInc (< 80K → NO; > 80K → YES)]

38 Another Example of a Decision Tree [Figure: a second tree over the same training data, splitting first on MarSt, then on HO and TaxInc] There could be more than one tree that fits the same data!
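A learned tree like these can be read off directly as nested conditionals. A sketch, with the slide's abbreviations (HO, MarSt, TaxInc) expanded to hypothetical full names and the YES/NO class left abstract:

```python
def classify(home_owner, marital_status, taxable_income):
    """Walk the first slide's tree: HO at the root, then MarSt,
    then TaxInc for single/divorced non-owners."""
    if home_owner == "yes":          # HO split: owners -> NO
        return "NO"
    if marital_status == "married":  # MarSt split: married -> NO
        return "NO"
    if taxable_income < 80_000:      # TaxInc split on the remaining branch
        return "NO"
    return "YES"

print(classify("no", "single", 95_000))   # -> YES
print(classify("yes", "single", 95_000))  # -> NO
```

The second slide's tree, splitting on MarSt first, would be a different nesting of the same tests, which is the point: more than one tree can fit the same data.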

39 Classification: Application 1 Direct Marketing –Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. –Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers. –Type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier model. From [Berry & Linoff] Data Mining Techniques, 1997

40 Classification: Application 2 Fraud Detection –Goal: predict fraudulent cases in credit card transactions. –Approach: Use credit card transactions and the information on the account holder as attributes. –When does a customer buy, what does he buy, how often does he pay on time, etc. Label past transactions as fraud or fair transactions; this forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.

41 Clustering Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that: –Data points in one cluster are more similar to one another. –Data points in separate clusters are less similar to one another. Similarity Measures: –Euclidean Distance if attributes are continuous. –Other Problem-specific Measures.
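The Euclidean distance measure named above, in a minimal sketch over invented 3-D points:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

a, b, c = (0, 0, 0), (1, 2, 2), (10, 10, 10)
print(euclidean(a, b))                    # -> 3.0
print(euclidean(a, b) < euclidean(a, c))  # a and b would share a cluster
```

For categorical attributes this measure does not apply directly; that is where the slide's "other problem-specific measures" come in.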

42 Illustrating Clustering [Figure: Euclidean-distance-based clustering in 3-D space. Intracluster distances are minimized; intercluster distances are maximized]

43 Clustering: Application 1 Market Segmentation: –Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. –Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

44 Clustering: Application 2 Document Clustering: –Goal: to find groups of documents that are similar to each other based on the important terms appearing in them. –Approach: identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster. –Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

45 Illustrating Document Clustering Clustering Points: 3204 Articles of Los Angeles Times. Similarity Measure: How many words are common in these documents (after some word filtering).
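The similarity measure described above, words in common after some filtering, can be sketched as follows; the stopword list and example texts are invented, not the LA Times data:

```python
# Tiny illustrative stopword list for the "word filtering" step
STOPWORDS = {"the", "a", "of", "in", "and", "to"}

def common_words(doc1, doc2):
    """Similarity between two documents: number of shared non-stopwords."""
    words1 = {w.lower() for w in doc1.split()} - STOPWORDS
    words2 = {w.lower() for w in doc2.split()} - STOPWORDS
    return len(words1 & words2)

d1 = "the economy of California grew in the last quarter"
d2 = "California economy shows strong growth this quarter"
d3 = "the Lakers won the game in overtime"
print(common_words(d1, d2))  # -> 3 (california, economy, quarter)
print(common_words(d1, d3))  # -> 0
```

A clustering algorithm using this measure would group d1 with d2 (financial news) and leave d3 (sports) apart.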

46 Association Rule Discovery: Definition Given a set of records, each of which contains some number of items from a given collection: –Produce dependency rules that predict the occurrence of an item based on occurrences of other items. Rules discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
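A brute-force sketch of rule discovery over small baskets, scoring candidate rules by support and confidence; the thresholds and transactions are invented, chosen so the slide's two rules come out:

```python
from itertools import combinations

def find_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Enumerate small itemsets and keep rules (antecedent -> item)
    meeting the support and confidence thresholds."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    rules = []
    for size in (2, 3):
        for itemset in combinations(items, size):
            s = set(itemset)
            sup = sum(s <= b for b in baskets) / n
            if sup < min_support:
                continue
            for item in itemset:
                antecedent = s - {item}
                ant_sup = sum(antecedent <= b for b in baskets) / n
                if sup / ant_sup >= min_confidence:
                    rules.append((tuple(sorted(antecedent)), item))
    return rules

baskets = [
    {"milk", "coke"},
    {"milk", "coke", "bread"},
    {"diaper", "milk", "beer"},
    {"diaper", "milk", "beer", "coke"},
    {"bread"},
]
for ant, cons in find_rules(baskets):
    print(ant, "->", cons)
```

Real association miners (e.g. Apriori) prune the itemset search instead of enumerating everything, but the scoring is the same.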

47 Association Rule Discovery: Application Marketing and Sales Promotion: –Let the rule discovered be {Bagels, … } --> {Potato Chips} –Potato Chips as consequent => can be used to determine what should be done to boost its sales –Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels –Bagels in the antecedent and Potato Chips in the consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!

48 Thank You for Your Attention

