Data Mining Concepts Emre Eftelioglu.


What is Knowledge Discovery in Databases?
Data mining is one step of a larger process known as knowledge discovery in databases (KDD). The KDD process model consists of six phases:
1. Data selection
2. Data cleansing
3. Enrichment
4. Data transformation
5. Data mining
6. Reporting and displaying discovered knowledge
[Figure: databases and data warehouses feed data selection, followed by data cleansing and enrichment, data transformation, data mining, and reporting.]

Data Warehouse
A subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions.
- Used to understand and improve the performance of an organization.
- Designed for query and data retrieval, not for transaction processing.
- Contains historical data derived from transaction data, but can include data from other sources.
- Data is consolidated, aggregated and summarized.
- Serves as the underlying engine of business intelligence environments, which generate their reports from it.

Why is Data Mining Needed?
Data is large-scale, high-dimensional, heterogeneous, complex and distributed, and useful information has to be found in it.
Commercial perspective
- Sales can be increased.
- Lots of data is on hand but not interpreted.
- Computers are cheap and powerful.
- Companies compete with each other: better service, more sales, easy access for customers, customized user experience, etc.
Scientific perspective
- Data is collected and stored at enormous speeds.
- Traditional techniques are infeasible.
- Data mining improves and eases the work of scientists.
Overall
- There is often information “hidden” in the data that is not readily evident.
- Even for a small dataset, human analysts may take weeks to discover useful information.

What is the Goal of Data Mining?
- Prediction: Determine how certain attributes will behave in the future. Example: when a bank customer requests credit, the customer’s buying capacity can be predicted by analyzing credit card usage.
- Identification: Identify the existence of an item, event, or activity. Example: scientists are trying to identify life on Mars by analyzing different soil samples.
- Classification: Partition data into classes or categories. Example: a company can grade its employees by classifying them according to their skills.
- Optimization: Optimize the use of limited resources. Example: UPS avoids left turns in order to save the gas burned while idling.

What is Data Mining?
Definition: Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
Another definition: Discovering new information in terms of patterns or rules from vast amounts of data.
Data mining is an interdisciplinary field. Which disciplines does it use?
- Databases
- Statistics
- High Performance Computing
- Machine Learning
- Visualization
- Mathematics
Data Mining: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything" (Jeff Jonas, IBM)

What is Not Data Mining?
- Using “Google” to check the price of an item.
- Getting statistical details about the historical price changes of an item (e.g. max price, min price, average price).
- Checking the sales of an item by color. Example: green item sales: 500, blue item sales: 1000, etc.
So what would be a data mining question?
- How many items can we sell if we produce the same item in red?
- How will the sales change if we discount the price?
- Is there any association between the sales of item X and item Y?
Why is Data Mining not Statistical Analysis? Statistical analysis has these limitations:
- Interpretation of results is difficult and daunting.
- It requires expert user guidance.
- It is ill-suited for nominal and structured data types.
- It is completely data driven; incorporation of domain knowledge is not possible.

What is the Difference between DBMS, OLAP and Data Mining?
Task
- DBMS: Extraction of detailed data
- OLAP (Data Warehouse): Summaries, trends, reports
- Data Mining: Knowledge discovery of hidden patterns, insights
Result
- DBMS: Information
- OLAP (Data Warehouse): Analysis
- Data Mining: Insight and future prediction
Method
- DBMS: Deduction (ask the question, verify the data)
- OLAP (Data Warehouse): Model the data, aggregate, use statistics
- Data Mining: Induction (build the model, apply to new data, get the result)
Example
- DBMS: Who purchased the Apple iPhone 6 so far?
- OLAP (Data Warehouse): What is the average income of iPhone 6 buyers by region and month?
- Data Mining: Who will buy the new Apple Watch when it is in the market?

What types of Knowledge can be revealed?
- Association Rule Discovery (descriptive)
- Clustering (descriptive)
- Classification (predictive)
- Sequential Pattern Discovery (descriptive)
- Patterns Within Time Series (predictive)

Association Rule Mining
- Association rules are frequently generated from market-basket data. A market basket corresponds to the set of items a consumer purchases during one visit to a supermarket.
- The goal is to create dependency rules that predict the occurrence of an item based on occurrences of other items.
- The set of items purchased by customers is known as an item set.
- An association rule is of the form X=>Y, where X = {x1, x2, …, xn} and Y = {y1, y2, …, ym} are sets of items, with xi and yj being distinct items for all i and all j.
- For an association rule to be of interest, it must satisfy a minimum support and confidence.
Beer & Diapers as an urban legend: A father goes to the grocery store to get a big pack of diapers after he gets out of work. When he buys diapers, he decides to get a six pack

Association Rules - Confidence and Support
Example transactions (total transactions = 5):
TID  Items
1    A, B
2    A, B, D
3    A, C, D
4    A, B, C
5    B, C
Support: The percentage of transactions in the database that contain all of the items in the item set, LHS U RHS. The rule X ⇒ Y holds with support s if s% of the transactions in the database contain X ∪ Y.
Support = Occurrence / Total Transactions
- Support {AB} = 3/5 = 60%
- Support {BC} = 2/5 = 40%
- Support {CD} = 1/5 = 20%
- Support {ABC} = 1/5 = 20%
Confidence: The rule X ⇒ Y holds with confidence c if c% of the transactions that contain X also contain Y. Confidence can be computed as support(LHS U RHS) / support(LHS).
Given X ⇒ Y, Confidence = Occurrence {X Y} / Occurrence {X}
- Confidence {A ⇒ B} = 3/4 = 75%
- Confidence {B ⇒ C} = 2/4 = 50%
- Confidence {C ⇒ D} = 1/3 = 33%
- Confidence {AB ⇒ C} = 1/3 = 33%
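
The support and confidence definitions can be checked by counting directly over the five example transactions. The following is a minimal Python sketch; the function names (count, support, confidence) are illustrative choices and not part of the original slides. Counting shows that only transactions 4 and 5 contain both B and C, so the support of {B, C} comes out as 2/5.

```python
# The five example transactions (TID -> set of items).
transactions = {
    1: {"A", "B"},
    2: {"A", "B", "D"},
    3: {"A", "C", "D"},
    4: {"A", "B", "C"},
    5: {"B", "C"},
}

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Occurrence / total transactions."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Occurrence{LHS and RHS together} / occurrence{LHS}."""
    return count(set(lhs) | set(rhs)) / count(lhs)

print(support("AB"))          # 0.6
print(support("BC"))          # 0.4
print(confidence("A", "B"))   # 0.75
print(confidence("B", "C"))   # 0.5
```

Confidence is computed from raw counts rather than from two support values so that the ratio is not affected by floating-point rounding.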

Association Rules - Apriori algorithm
A general algorithm for generating association rules is a two-step process:
1. Generate all item sets that have a support exceeding the given threshold. Item sets with this property are called large or frequent item sets.
2. Generate rules for each item set as follows: for item set X and Y (a subset of X), let Z = X – Y (set difference). If Support(X)/Support(Z) > minimum confidence, the rule Z=>Y is a valid rule.
The Apriori algorithm was one of the earliest algorithms used to generate association rules. It uses the general algorithm above together with downward closure and anti-monotonicity.
- For a k-item set to be frequent, each and every one of its subsets must also be frequent.
- To generate a k-item set: use a frequent (k-1)-item set and extend it with a frequent 1-item set.
- Downward closure: a subset of a large item set must also be large.
- Anti-monotonicity: a superset of a small item set is also small, so it does not have sufficient support to be considered for rule generation.
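
The two-step process can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function names (apriori, generate_rules) are hypothetical, thresholds are treated as inclusive (>=), which is an assumption, and candidates are built by joining pairs of frequent (k-1)-item sets, a common variant of the extension step.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent item set: occurrence count} (thresholds inclusive, an assumption)."""
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if count(s) / n >= min_support}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = count(s)
        k += 1
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if count(c) / n >= min_support}
    return frequent

def generate_rules(frequent, min_confidence):
    """For each frequent X and non-empty proper subset Y, test the rule Z => Y with Z = X - Y."""
    rules = []
    for x, count_x in frequent.items():
        for r in range(1, len(x)):
            for y in combinations(x, r):
                y = frozenset(y)
                z = x - y
                conf = count_x / frequent[z]  # Support(X) / Support(Z)
                if conf >= min_confidence:
                    rules.append((z, y, conf))
    return rules

# Example transactions (illustrative data).
data = [{"A", "B"}, {"A", "B", "D"}, {"A", "C", "D"}, {"A", "B", "C"}, {"B", "C"}]
frequent = apriori(data, min_support=0.4)
for z, y, conf in generate_rules(frequent, min_confidence=0.7):
    print(sorted(z), "=>", sorted(y), round(conf, 2))
```

Because every subset of a frequent item set is itself frequent, the lookup frequent[z] in the rule step always succeeds.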

Complications seen with Association Rules
- The cardinality of item sets in most situations is extremely large.
- Association rule mining is more difficult when transactions show variability in factors such as geographic location and seasons.
- Item classifications exist along multiple dimensions.
- Data quality is variable; data may be missing, erroneous, conflicting, as well as redundant.

Clustering (1/2)
Motivation
- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: Identification of areas of similar land use in an earth observation database.
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
- City planning: Identifying groups of houses according to their house type, value, and geographical location.
- Many more…

Clustering (2/2)
Given a set of data points, find clusters such that records in one cluster are highly similar to each other and dissimilar from the records in other clusters. Clustering is an unsupervised learning technique: it does not require any prior knowledge of the clusters.

k-Means Algorithm (1/2)
The k-Means algorithm is a simple yet effective clustering technique. The algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to a cluster based upon the observation’s proximity to the mean center of that cluster. Each cluster’s mean center is then recomputed and the process begins again. The algorithm stops when the mean centers do not change. The objective is to minimize the error (the sum of squared distances), where C_i is the i-th cluster, m_i is its mean center and r_j is a record in C_i:
Error = \sum_{i=1}^{k} \sum_{\forall r_j \in C_i} Distance(r_j, m_i)^2
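
As a small worked example of the error term, the following Python sketch sums squared Euclidean distances between each record and the mean center of its cluster. The function names (mean_center, error) and the sample points are illustrative assumptions, and Euclidean distance is assumed for Distance.

```python
def mean_center(points):
    """Arithmetic mean of the coordinates of the points in one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def error(clusters):
    """Sum over clusters C_i and records r_j in C_i of Distance(r_j, m_i)^2."""
    total = 0.0
    for points in clusters:
        m = mean_center(points)
        for r in points:
            total += sum((a - b) ** 2 for a, b in zip(r, m))
    return total

# Two hypothetical clusters in the plane.
clusters = [
    [(0, 0), (0, 2)],   # mean center (0, 1): each point is at squared distance 1
    [(5, 5), (7, 5)],   # mean center (6, 5): each point is at squared distance 1
]
print(error(clusters))  # 4.0
```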

k-Means Algorithm (2/2)
1. Select initial k cluster centers.
2. Assignment step: Assign each point in the dataset to the closest cluster, based upon the Euclidean distance between the point and each cluster center.
3. Update step: Once all points are clustered, recompute the cluster centers using the arithmetic mean of the coordinates of the points which belong to them.
4. Repeat steps 2-3 until the centers do not change (convergence).
[Figure: example of k-Means execution on a smiley-face dataset. Panels: input dataset; initial 3 centers (user defined); assignment; update; assignment; final output.]
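
The four steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the function name kmeans, the max_iter safety cap, and the rule that an empty cluster keeps its previous center are choices made here, not part of the slides. It relies on math.dist, available in Python 3.8 and later.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Plain k-Means: returns (centers, clusters) once the centers stop changing."""
    centers = [tuple(c) for c in initial_centers]      # step 1: initial k centers
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 2 (assignment): each point goes to the closest center (Euclidean).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3 (update): recompute each center as the mean of its points.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when the centers do not change (convergence).
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, initial_centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```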

k-Means Algorithm – Issues (1/2)
- The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results.
- Initial centroid selection is important for the final result, since the algorithm converges to a local minimum.
[Figure: the same input data set clustered from two different choices of initial centroids (green/red); the output changes with the initial selection of centers.]
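
The sensitivity to initial centroids can be reproduced on a tiny hypothetical dataset: the four corners of a wide rectangle, clustered with k = 2. The sketch below is self-contained; the function names (kmeans, sse) and the data are illustrative assumptions. One initialization converges to the left/right split, while the other converges immediately to a top/bottom split with a much larger error, which is a local minimum.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-Means; returns the clusters once the centers stop changing."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters

def sse(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for cl in clusters:
        if not cl:
            continue  # an empty cluster contributes nothing
        m = tuple(sum(c) / len(cl) for c in zip(*cl))
        total += sum(math.dist(p, m) ** 2 for p in cl)
    return total

# Corners of a 4-by-1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good = kmeans(points, [(0, 0.5), (4, 0.5)])  # centers start left and right
bad = kmeans(points, [(2, 0), (2, 1)])       # centers start bottom and top

print(good, sse(good))  # left/right split, small error
print(bad, sse(bad))    # top/bottom split, larger error
```

A common mitigation, not covered in the slides, is to run the algorithm from several initializations and keep the result with the lowest error.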

k-Means Algorithm – Issues (2/2)
The density of the clusters may bias the results, since k-Means depends on the Euclidean distance between points and centroids.
[Figure: input dataset.]

Classification vs. Clustering
- Classification is a supervised learning technique. Classification techniques learn a method for predicting the class of a data record from pre-labeled (classified) records.
- Clustering is an unsupervised learning technique which finds clusters of records without any prior knowledge.