Data Mining Tarek Soukieh 11/18/2010. Agenda 1.The Evolution of Database Technology 2.Introduction 3.Data Preprocessing 4.OLAP vs. Data Mining 5.Data.

Data Mining Tarek Soukieh 11/18/2010

Agenda
1. The Evolution of Database Technology
2. Introduction
3. Data Preprocessing
4. OLAP vs. Data Mining
5. Data Mining Algorithms
   1. Association
   2. Classification & Prediction
   3. Cluster Analysis
6. Data Mining Example
7. Major Issues in Data Mining
8. Data Mining Applications
9. Trends in Data Mining

The Evolution of Database Technology

Introduction
Data mining refers to extracting, or “mining”, knowledge from large amounts of data.
It is also widely known by the acronym KDD, “Knowledge Discovery from Data”.
Data is abundant today, but repositories that are collected and rarely analyzed become “data tombs”: we are “data rich but information poor”.

Introduction (Cont.)
Data Mining Cycle:
– Identify the business problem
– Validate, explore, and clean the data
– Prepare the model
– Check the performance of the model
– Act on the results
(Training - Testing - Scoring)
Data Mining Assumptions:
– The past is a good predictor of the future
Data Mining Categorization:
– Directed vs. Undirected
– Descriptive vs. Predictive

Introduction (Cont.)

Data Preprocessing
Data Cleaning
– Measuring dispersion of data
– Principal Component Analysis
– Correlation Analysis
– Regression
– Clustering
– Sampling
Data Transformation
– Smoothing
– Aggregation
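The slide lists smoothing without naming a method. As a minimal sketch, assuming smoothing by bin means (one common choice, not necessarily the one the presenter had in mind), sorted values are split into equal-frequency bins and each value is replaced by the mean of its bin. The function name and the sample prices are hypothetical.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace
    every value with the mean of the bin it falls into."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

# Hypothetical, unsorted price data.
prices = [21, 4, 28, 15, 8, 34, 21, 25, 24]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Larger bins smooth more aggressively, at the cost of hiding more of the original variation.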

OLAP vs. Data Mining
OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
Data mining applies sophisticated pattern recognition algorithms to the data, while OLAP reports aggregated data from data warehouses.
OLAP lets the user drill, pivot, slice, and dice, while data mining covers a much broader spectrum of techniques, such as association, classification, prediction, and clustering.

OLAP vs. Data Mining (Cont.)
OLAP targets business problems, while data mining can also have socioeconomic applications.
Data mining is not confined to the analysis of data stored in data warehouses.
Data mining is more versatile.
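To make the OLAP side of the comparison concrete, here is a small sketch of two of the operations named above, aggregation (roll-up) and slicing, over a hypothetical fact table. The table, column order, and function names are illustrative assumptions, not part of the presentation.

```python
from collections import defaultdict

# Hypothetical fact table: (region, quarter, product, sales).
sales = [
    ("East", "Q1", "PC", 100),
    ("East", "Q2", "PC", 150),
    ("West", "Q1", "PC", 80),
    ("West", "Q1", "Camera", 40),
    ("East", "Q1", "Camera", 60),
]

def roll_up(rows, key_index):
    """Aggregate the sales measure along one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

def slice_rows(rows, key_index, value):
    """Keep only the rows where one dimension has a fixed value."""
    return [row for row in rows if row[key_index] == value]

print(roll_up(sales, 0))                         # total sales per region
print(roll_up(slice_rows(sales, 1, "Q1"), 2))    # Q1 sales per product
```

Both operations only summarize what the analyst asks for; nothing here discovers a pattern on its own, which is the distinction the slides draw.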

Association
A frequent itemset is a set of items that frequently appear together in a transactional data set, such as milk and bread.
A frequent sequential pattern is a frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card.
Market basket analysis is a typical example of frequent itemset mining.

Association (Cont.)
Let I = {I₁, I₂, I₃, …} be a set of items.
Let A be a set of items.
Let B be a set of items.
The association rule A ⇒ B holds where:
– A ⊂ I
– B ⊂ I
– A ∩ B = ∅

Association (Cont.)
Support is the percentage of transactions that contain A ∪ B, that is, every item in A and every item in B. It is taken to be the probability P(A ∪ B).
Confidence is the percentage of transactions containing A that also contain B, that is, the conditional probability P(B | A).
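Support and confidence as defined above can be computed directly from transaction counts. The sketch below uses a hypothetical five-transaction data set; the function names are illustrative.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(a, b, transactions):
    """Fraction of transactions containing all items of A and of B."""
    return count_containing(set(a) | set(b), transactions) / len(transactions)

def confidence(a, b, transactions):
    """Fraction of the transactions containing A that also contain B."""
    both = count_containing(set(a) | set(b), transactions)
    return both / count_containing(a, transactions)

print(support({"milk"}, {"bread"}, transactions))     # 0.6
print(confidence({"milk"}, {"bread"}, transactions))  # 0.75
```

Here the rule milk ⇒ bread holds in 3 of 5 transactions (support 60%), and 3 of the 4 transactions with milk also contain bread (confidence 75%).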

Association (Cont.)
Frequent pattern mining classification:
– Different levels of abstraction
– Number of dimensions

Association (Cont.)
Strong association rules are not necessarily interesting.
Correlation analysis:
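The formula on this slide did not survive in the transcript. As an illustration only, the sketch below assumes lift, a commonly used correlation measure, defined as P(A ∪ B) / (P(A) · P(B)). A value below 1 indicates negative correlation, a value above 1 positive correlation, and 1 independence. The data set and function names are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def lift(a, b, transactions):
    """P(A and B together) divided by P(A) * P(B), computed from counts."""
    n = len(transactions)
    both = count_containing(set(a) | set(b), transactions)
    return n * both / (count_containing(a, transactions) * count_containing(b, transactions))

print(lift({"milk"}, {"bread"}, transactions))  # 0.9375
```

In this toy data, milk ⇒ bread has 75% confidence, yet its lift is below 1: bread appears in 80% of all transactions, so buying milk slightly lowers the chance of buying bread. This is the sense in which a strong rule can still be uninteresting.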

Classification & Prediction
Classification predicts categorical variables: a classifier is constructed to predict labels such as “safe” or “risky” for loan application data.
Prediction models continuous-valued functions; regression analysis is the most commonly used methodology.

Classification
Learning step: a classification algorithm builds the classifier by learning from a training set.
Classification step: test data are used to estimate the accuracy of the classification rules.
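The two steps can be sketched with a deliberately simple classifier: a single income cut-off learned from a training set and then evaluated on held-out test data. The loan data, the single-threshold model, and all names are assumptions made for illustration.

```python
def predict(cut, income):
    """Label an applicant using a single income cut-off."""
    return "safe" if income >= cut else "risky"

def learn_threshold(training_set):
    """Learning step: choose the cut-off with the fewest training errors."""
    best_cut, best_errors = None, None
    for cut in sorted(income for income, _ in training_set):
        errors = sum(1 for income, label in training_set
                     if predict(cut, income) != label)
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

def accuracy(cut, test_set):
    """Classification step: fraction of test tuples labelled correctly."""
    correct = sum(1 for income, label in test_set
                  if predict(cut, income) == label)
    return correct / len(test_set)

# Hypothetical (income in thousands, label) tuples.
training_set = [(20, "risky"), (25, "risky"), (40, "safe"),
                (55, "safe"), (30, "risky"), (45, "safe")]
test_set = [(22, "risky"), (50, "safe"), (35, "safe"), (28, "risky")]

cut = learn_threshold(training_set)
print(cut, accuracy(cut, test_set))  # 40 0.75
```

The classifier fits the training set perfectly but misses one test tuple, which is why accuracy is estimated on data that was not used for learning.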

Classification by Decision Tree
A decision tree is a flowchart-like tree structure in which each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Classification by Decision Tree (Cont.)
An attribute selection measure is a heuristic for selecting the splitting criterion that best separates a given data partition.
Ideally each partition should be pure, meaning that all of the tuples that fall into it belong to the same class.
Well-known attribute selection measures are “information gain”, “gain ratio”, and “gini index”.
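Two of the measures named above can be sketched in a few lines. Entropy and the Gini index both reach 0 for a pure partition, and information gain is the drop in entropy achieved by a split. The labels and function names below are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 0 for a pure partition, larger for mixed ones."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    remainder = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(labels) - remainder

labels = ["safe", "safe", "risky", "risky"]
pure_split = [["safe", "safe"], ["risky", "risky"]]
mixed_split = [["safe", "risky"], ["safe", "risky"]]

print(information_gain(labels, pure_split))   # 1.0
print(information_gain(labels, mixed_split))  # 0.0
print(gini(labels))                           # 0.5
```

A tree builder would evaluate each candidate attribute this way and split on the one with the highest gain (or the lowest weighted Gini index).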

Classification by Decision Tree (Cont.)
Decision trees are appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
Their representation of acquired knowledge in tree form is intuitive and easy for humans to assimilate quickly.

Clustering
Clustering is the process of grouping data into classes or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.
In classification, the class label of each object is known; in clustering it is not.
Clustering is an example of “unsupervised learning” or “learning by observation”: it does not rely on predefined classes.
Clustering is also called data segmentation, and can be used for outlier detection.
Categorization: partitioning methods, hierarchical methods, density-based methods.

Clustering (Cont.)
Partitioning methods
– Each group must contain at least one object
– Each object must belong to exactly one group
– It creates an initial partitioning, then uses an “iterative relocation technique”
– K-means algorithm
– K-Medoids algorithm
– Density-based method

Clustering (Cont.) Partitioning K-Means
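Only the slide title survives here, so the following is a minimal sketch of k-means as an iterative relocation technique rather than a reconstruction of the slide. It assumes squared Euclidean distance and seeds the clusters with the first k points; both choices, and the sample points, are illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def k_means(points, k, max_iterations=100):
    centers = list(points[:k])  # assumption: the first k points seed the clusters
    clusters = []
    for _ in range(max_iterations):
        # Assignment: each point joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Relocation: each center moves to the mean of its cluster.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # no center moved, so the partition is stable
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, 2)
print(clusters)
# [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
```

Even with both seeds taken from the same group, the relocation steps pull the second center toward the far group within a few iterations. In general, results depend on the seeds.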

Clustering (Cont.) Partitioning K-Medoids
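Again only the title survives. The key difference from k-means is that k-medoids represents each cluster by one of its actual objects, the medoid, rather than by a mean. The sketch below shows just that step for one-dimensional data, assuming absolute difference as the distance; the function names and values are hypothetical.

```python
def medoid(cluster):
    """The member with the smallest total distance to all other members."""
    return min(cluster, key=lambda c: sum(abs(c - other) for other in cluster))

def mean(cluster):
    return sum(cluster) / len(cluster)

# Hypothetical cluster with one extreme value.
cluster = [1, 2, 3, 4, 100]
print(medoid(cluster))  # 3
print(mean(cluster))    # 22.0
```

The outlier drags the mean far from the bulk of the data, while the medoid stays on a typical object. This is why medoid-based methods are generally less sensitive to outliers.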

Clustering (Cont.)
Hierarchical methods
– Agglomerative (bottom-up) or divisive (top-down)
– Once a step is done, it can never be undone
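A bottom-up (agglomerative) pass can be sketched as follows: every object starts as its own cluster, and the two closest clusters are merged repeatedly. The sketch assumes one-dimensional data and single-link distance (the closest pair of members); the slide does not specify a linkage, and the names are illustrative.

```python
def agglomerative(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # the merge is never revisited
        del clusters[j]
    return clusters

print(agglomerative([1, 2, 6, 7, 20], 2))  # [[1, 2, 6, 7], [20]]
```

Because earlier merges are never reconsidered, a poor early merge propagates to every later level of the hierarchy.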

Clustering (Cont.)
Density-based methods
– A cluster keeps growing as long as the number of data points in the neighborhood exceeds some threshold
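The density test can be sketched as a check for "core" points, in the style of DBSCAN. The sketch assumes one-dimensional data, counts a point as its own neighbor, and uses "at least min_pts" as the threshold. The parameter names eps and min_pts and the sample values are assumptions.

```python
def core_points(points, eps, min_pts):
    """Points whose eps-neighborhood holds at least min_pts points."""
    cores = []
    for p in points:
        # The neighborhood includes p itself.
        neighbors = [q for q in points if abs(p - q) <= eps]
        if len(neighbors) >= min_pts:
            cores.append(p)
    return cores

points = [10, 12, 14, 50, 51, 90]
print(core_points(points, eps=5, min_pts=3))  # [10, 12, 14]
```

A full density-based algorithm would then grow clusters outward from these core points and treat isolated points such as 90 as noise.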

Data Mining Example
Vermont Country Store
– Created a score for each customer based on RFM (Recency, Frequency, Monetary)
– Created a model for mailing catalogs, then ran the model against older mailings and found a significant impact
– Created a catalog for each of the customer segments produced by data mining
– Found association rules showing that owners of certain cars are frequent buyers of certain products. The company purchased a list of all new owners of that specific type of car and increased its sales substantially
– Data mining ROI was calculated as the ratio of the extra revenue brought in by the models to the money invested in data mining. It was 1,182 percent!
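The slide does not say how the RFM score was built, so the sketch below is only one plausible scheme: each dimension is scored from 1 to 3 against hypothetical thresholds and the three scores are summed. The ROI function follows the ratio described above. All thresholds and amounts are invented for illustration and are not figures from the case.

```python
def rfm_score(days_since_last_order, orders_per_year, total_spent):
    """Score each RFM dimension from 1 (worst) to 3 (best) and add them up.
    The thresholds are hypothetical."""
    recency = 3 if days_since_last_order <= 30 else 2 if days_since_last_order <= 180 else 1
    frequency = 3 if orders_per_year >= 6 else 2 if orders_per_year >= 2 else 1
    monetary = 3 if total_spent >= 500 else 2 if total_spent >= 100 else 1
    return recency + frequency + monetary

def roi_percent(extra_revenue, investment):
    """Extra revenue attributed to the models, relative to the money invested."""
    return 100 * extra_revenue / investment

print(rfm_score(10, 8, 750))      # 9: recent, frequent, high-spending customer
print(rfm_score(400, 1, 40))      # 3: lapsed, occasional, low-spending customer
print(roi_percent(60000, 5000))   # 1200.0 (hypothetical amounts)
```

Customers with the highest scores would be the first candidates for a catalog mailing.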

Major Issues in Data Mining
– Massive datasets and high dimensionality
– User interaction and prior knowledge
– Overfitting and assessing statistical significance
– Missing data
– Understandability of patterns
– Managing changing data and knowledge
– Integration
– Multimedia and object-oriented data

Data Mining Applications
Financial Data Analysis:
– Loan Payment Prediction
– Clustering customers for targeted marketing
– Detection of financial crimes
Retail Industry:
– Effectiveness of sales campaigns
– Customer retention
– Product recommendation
Telecommunication Industry:
– Identification of unusual patterns
– Multidimensional association analysis
– Mobile telecommunication services

Data Mining Trends
– Data Preprocessing and Integration
– Increasing Usability
– Spatial Data Mining, Social Media Mining, Multimedia Mining, Visual Data Mining, Graph Mining, Mobile Data Mining
– Privacy Protection

Resources
– “Data Mining: Concepts and Techniques”, Jiawei Han and Micheline Kamber
– “Mastering Data Mining: The Art and Science of Customer Relationship Management”, Michael Berry and Gordon Linoff
– “Statistical Analysis and Data Mining Applications”, Robert Nisbet, John Elder, and Gary Miner