Lecture 10: Data Mining (BIS4435). Dr. Nawaz Khan, School of Computing Science.

Presentation transcript:


Slide 2: Reading Assignment (BIS4229 – Industrial Data Management Technologies)
Core text:
- GC DL materials on the WebCT: Unit 11
- Connolly, T. and Begg, C., 2002, Database Systems: A Practical Approach to Design, Implementation, and Management, Addison Wesley, Harlow, England
Additional reading:
- Fundamentals of Database Systems, R. Elmasri and S. B. Navathe, 4th Edition, 2004, Addison-Wesley, ISBN : Chapter 27
- Data Warehousing, Data Mining, and OLAP, Alex Berson and Stephen J. Smith, McGraw-Hill, 1997, ISBN (Chapters 17, 18)
- Other resources on the Internet

Slide 3: Outline
- DW & DM: differences
- The definition
- Application areas
- Comparison with query and Web site analysis tools
- DM process
- Applications, models and algorithms
- Summary
- Q&A

Slide 4: DW & DM: differences
[Figure: data warehouse architecture, with components labelled Operational Data, Data Transformation, Data Warehouse, Metadata, Data Mart, Access Tools, Information Delivery System]

Slide 5: DW & DM: differences
- They have the same purpose: decision support
- DW assembles, formats, and organises historical data to answer user queries as it is; the answers depend on the content of the DW
- DW will not attempt to extract further information or predict trends and patterns from data
- DM will extract previously unknown and useful information as well as predict trends and patterns
- DM can be performed on a DW and/or traditional DBs and files

Slide 6: The Definition
- DM is the process of extracting previously unknown, valid and actionable information from large sets of data
  - Unknown: look for things that are not intuitive
  - Valid: useful
  - Actionable: translate into business advantage
Example:
- Rule 1: people don't buy shares when the political situation is not stable
- Rule 2: the share market is less active when people don't want to spend
- Outcome statement 1, based on rules 1 and 2: the share market is less active when the political situation is not stable
- Outcome statement 2, based on rules 1 and 2: people don't want to spend when the political situation is not stable

Slide 7: Application areas
- Direct marketing
  - The ability to predict who is most likely to be interested in which products can save companies immense amounts in marketing expenditure
- Trend analysis
  - Understanding trends in the marketplace is a strategic advantage, because it is useful in reducing costs and time to market
- Security
  - Fraud detection: data mining techniques can help discover which insurance claims, cellular phone calls, or credit card purchases are likely to be fraudulent
  - IDS (intrusion detection systems)
- Forecasting in financial markets
- Mining online: WebKDD
  - Web sites today find themselves competing for customer loyalty; it costs a customer little to switch to a competitor
- Text mining: intelligent document analysis

Slide 8: Comparison with query and Web site analysis tools
Query tools vs. DM tools
- Both allow the user to ask questions of a DBMS/DW to find out facts
- Query tool: the user makes an assumption, and the query is based on that hypothesis
- Data mining tool: no assumption is needed when making a query (only a goal)
- Example queries:
  1. What is the number of white shirts sold in the north vs. the south? (query tool)
  2. What are the most significant factors involved in high, medium, and low sales volumes of white shirts? (data mining tool)
- Data mining tool: discovers relationships and hidden patterns that are not obvious
- Trend: data mining is being integrated into query tools

Slide 9: Comparison with query and Web site analysis tools
OLAP tools vs. DM tools
- OLAP: designed to answer top-down queries
- OLAP: provides multidimensional data analysis; data can be broken down and summarised
- OLAP: query-driven, user-driven, verification-driven
- Data mining: bottom-up, requires no assumption
- Data mining: focuses on finding patterns
- Data mining: data-driven, discovery-driven; identifies facts/conclusions based on the patterns discovered
For example, OLAP may tell a bookseller the total number of books it sold in a region during a quarter. Statistics can provide another dimension about these sales. Data mining, on the other hand, can reveal the patterns in these sales, i.e. the factors influencing them.

Slide 10: DM Technologies (see Unit 20 - WebCT)
[Figure: Data Mining at the centre, drawing on Database Management and Warehousing, Machine Learning, Statistics, Decision Support, Parallel Processing, and Visualisation]

Slide 11: DM Process: Overview
- Business objectives → data preparation → DM → results analysis & knowledge assimilation
- Mining data is only one step in the overall process
- Business objectives drive the entire process
- Data preparation requires the most effort
- Iterative process with many loop-backs over one or more steps
- Labour-intensive exercise, far from autonomous
[Figure: Data sources → Selected data → Pre-processed data → Transformed data → Extracted data → Assimilated knowledge]

Slide 12: DM Process: Data Preparation
Steps: Data Selection, Data Pre-processing, Data Transformation
- Data selection: identify data sources and extract data for preliminary analysis in preparation for further mining
- Process of choosing data to analyse:
  - decide the dependent variable: the data (field) to be analysed
  - decide the active variables: the data actively used in mining
  - decide the useful data dimensions
  - choose useful (descriptive) fields in each dimension
  - consider adding other useful dimensions

Slide 13: DM Process: Data Preparation
Steps: Data Selection, Data Pre-processing, Data Transformation
- Data pre-processing: ensure the quality of the selected data
- Data mining is at best as good as the data it represents
- Data quality problems:
  - redundant data
  - incorrect or inconsistent data
  - noisy data: outliers, i.e. values that are significantly out of line (there are bad outliers and good outliers)
  - missing values: value not present or deleted
    - eliminate observations that have missing values (loses information)
    - replace missing values
    - predict the value using a predictive model
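
The missing-value and outlier options on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `rows` data, the field names, mean replacement, and the median-absolute-deviation rule with threshold `k` are all hypothetical choices, not methods prescribed by the lecture.

```python
from statistics import mean, median

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 21},
    {"age": None, "income": 23},
    {"age": 40, "income": 500},
    {"age": 46, "income": 22},
]

def drop_missing(rows):
    """Eliminate observations that have any missing value (loses information)."""
    return [r for r in rows if None not in r.values()]

def fill_missing(rows, field):
    """Replace missing values of `field` with the mean of the values present."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def flag_outliers(values, k=5.0):
    """Flag values further than k median absolute deviations from the median."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if abs(v - centre) > k * mad]
```

Whether a flagged value is a bad outlier (an error) or a good outlier (a genuine but rare case) still has to be decided by an analyst.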

Slide 14: DM Process: Data Preparation
Steps: Data Selection, Data Pre-processing, Data Transformation
- Data transformation: pre-processed data is converted to the analytical data model
- Data is refined to suit the input format required by the DM algorithms
- Techniques for data conversion:
  - simple calculation (SQL) to derive new data fields
  - data reduction: combine several existing variables into one new variable to reduce the total number of variables
  - scaling/normalisation: continuous values are brought to the same order of magnitude
  - discretisation: convert quantitative variables into categorical variables
  - one-of-N: convert a categorical variable to a numeric representation
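
Three of the conversion techniques above (scaling, discretisation, one-of-N) can be sketched as follows. The function names, the min-max scaling formula, and the cut points are assumptions made for illustration; the lecture does not fix a particular formula.

```python
from bisect import bisect_right

def min_max_scale(values):
    """Scale continuous values into [0, 1] so fields share an order of magnitude."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant field: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def discretise(value, cut_points, labels):
    """Map a quantitative value to a category.

    cut_points must be sorted, and labels needs one more entry than cut_points.
    """
    return labels[bisect_right(cut_points, value)]

def one_of_n(value, categories):
    """Encode a categorical value as a 0/1 vector with a single 1."""
    return [1 if value == c else 0 for c in categories]
```

For example, `one_of_n("tawny", ["striped", "tawny", "brown", "grey"])` yields a four-element vector with a 1 in the second position.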

Slide 15: DM Process: Data Mining & Results Analysis
- DM: apply the selected DM algorithm(s) to the pre-processed data
- Inseparable from results analysis, which is done by data and business analysts
- The two are linked in an interactive process (see the DM definition)
- Results analysis depends on the application developed:
  - Segmentation: changing the base variable may improve the result
  - Prediction: accuracy and input sensitivity analysis, overtraining
  - Association: iteration is required to discover actionable rules

Slide 16: DM Process: Knowledge Assimilation
- Close the loop
- Objective: take action according to the new, valid and actionable information discovered
- Challenges:
  - present the discovery in a convincing, business-oriented way
  - formulate ways to best exploit the discovery

Slide 17: Applications, Models and Algorithms
Predictive modelling: classification
- Human learning experience: observations form a model of the essential, underlying characteristics of some phenomenon (generalisation ability)
- In DM, a predictive model can analyse a DB to determine some essential characteristics of the data and make predictions

Slide 18: Applications, Models and Algorithms
Predictive modelling: classification
- Supervised learning: the correct answers to some already solved cases must be given to the model before it can make predictions about new observations
- The model is developed in two phases:
  - Training: build a model based on a large proportion (90%) of the available data
  - Testing: try out the model on previously unseen data (10%) to determine its accuracy and performance characteristics
- Two types of predictive modelling:
  - Classification: classify data into some pre-defined classes
  - Value prediction: predict a continuous numeric value for a database record
- Algorithms: decision trees, neural networks, rule induction
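
The two-phase scheme above (train on roughly 90%, test on the unseen 10%) can be sketched like this. The helper names `train_test_split` and `accuracy`, the fixed random seed, and the representation of a model as a plain Python function are assumptions for illustration only.

```python
import random

def train_test_split(records, train_fraction=0.9, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    """Fraction of (features, label) pairs that the model labels correctly."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)
```

The key point is that the test records play no part in building the model, so the measured accuracy reflects behaviour on previously unseen data.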

Slide 19: Applications, Models and Algorithms
Segmentation: clustering
- Segmentation can discover homogeneous sub-populations (customer profiling, target marketing)
- Segmentation (clustering) partitions the DB into segments (clusters) of similar records; the segments (clusters) are the resulting groups of data records
- Similarity is defined by a measure that depends on the distance of records from the centre of the cluster, e.g. Euclidean distance. For $A = (a_1, a_2, \dots, a_n)$ and $B = (b_1, b_2, \dots, b_n)$:
  $\mathrm{Dist}(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}$
- Clustering is unsupervised learning: the types of clusters and the number of clusters are not given (the true discovery nature of DM)
- Algorithm: neural networks
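
The Euclidean distance on this slide, and its use to assign a record to the nearest cluster centre, can be sketched as below. The `nearest_centre` helper is a hypothetical illustration of one assignment step, not a complete clustering algorithm.

```python
import math

def euclidean(a, b):
    """Dist(A, B) = sqrt((a1 - b1)^2 + ... + (an - bn)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centre(record, centres):
    """Index of the cluster centre closest to the record."""
    return min(range(len(centres)), key=lambda i: euclidean(record, centres[i]))
```

Because the distance sums squared differences across fields, fields with large numeric ranges dominate unless the values have first been scaled to the same order of magnitude.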

Slide 20: Applications, Models and Algorithms
Link analysis / deviation detection
- Link analysis seeks to establish links between individual records or sets of records in the DB
  - Association discovery: market basket analysis (within one transaction)
  - Sequential pattern discovery: sequence information over time
- Deviation detection: further investigate outliers
- Applications: fraud detection
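
Association discovery over market baskets is commonly scored with support and confidence. The slide does not name these measures, so treat the sketch below, including the sample `baskets`, as an assumption added for illustration.

```python
# Hypothetical transactions; each basket is the set of items in one purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent.

    Assumes the antecedent occurs in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)
```

A rule such as "bread implies milk" is only actionable if both numbers are high enough for the business, which is why the results analysis for association is iterative.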


Slide 22: Applications, Models and Algorithms
Decision trees
- Decision trees (IF-THEN rules) are a commonly used machine learning technique and a powerful, popular tool for classification and prediction
- They attempt to split the DB among the desired categories and identify important cluster features
- Tree construction:
  - choose an attribute (field) for testing: this is the root node of the tree
  - the number of values of the attribute gives the branches from the root node
    - binary: yes/no type of questions
    - multiple: complex questions with more than two answers
- Algorithms: ID3 (Iterative Dichotomiser 3), C4.5, C5.0, CART (Classification and Regression Trees), CHAID (Chi-squared Automatic Interaction Detection)
  - rank all features by how effectively they partition the set of classifications (information gain)
  - make the most effective feature the root node
  - recur on each branch

Slide 23: Applications, Models and Algorithms
Decision trees
- Optimal tree produced by ID3:
  - root node: "Colour", which has the most information gain
  - 4 branches: "striped", "tawny", "brown" and "grey"
  - recur on branches "striped" and "tawny"
Training data (columns: Diet, Size, Colour, Habitat, Species):
  Diet: meat, grass
  Size: large, small, large, small, large
  Colour: striped, tawny, striped, brown, striped, grey, tawny
  Habitat: jungle, house, jungle, plains
  Species: tiger, lion, tabby, weasel, zebra, rabbit, antelope
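
Information gain, the measure used to pick "Colour" as the root, can be sketched with the standard entropy-based definition. The slide does not spell the formula out, so the definition used here is an assumption; the data uses only the Colour and Species values from the animal example.

```python
import math
from collections import Counter

# Colour and Species for the seven animals in the lecture's example.
animals = [
    {"Colour": "striped", "Species": "tiger"},
    {"Colour": "tawny", "Species": "lion"},
    {"Colour": "striped", "Species": "tabby"},
    {"Colour": "brown", "Species": "weasel"},
    {"Colour": "striped", "Species": "zebra"},
    {"Colour": "grey", "Species": "rabbit"},
    {"Colour": "tawny", "Species": "antelope"},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

colour_gain = information_gain(animals, "Colour", "Species")
```

Splitting on Colour leaves the "brown" and "grey" branches with a single species each, so only "striped" and "tawny" still carry uncertainty and need a further test.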

Slide 24: Applications, Models and Algorithms
[Figure: decision tree]
Colour
  striped: test Habitat
    jungle: tiger
    house: tabby
    plains: zebra
  tawny: test Diet
    meat: lion
    grass: antelope
  brown: weasel
  grey: rabbit
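
The tree on this slide can be written down as nested IF-THEN rules and used to classify a record. The tuple-and-dictionary representation and the `classify` helper are hypothetical choices for this sketch.

```python
# Internal nodes are (attribute, branches) pairs; leaves are species names.
TREE = ("Colour", {
    "striped": ("Habitat", {"jungle": "tiger", "house": "tabby", "plains": "zebra"}),
    "tawny": ("Diet", {"meat": "lion", "grass": "antelope"}),
    "brown": "weasel",
    "grey": "rabbit",
})

def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node
```

A record only needs the attributes that lie on its path: a grey animal is classified from Colour alone, while a striped one also needs Habitat.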

Slide 25: Applications, Models and Algorithms
Neural networks
- An NN is used to simulate the operation of the brain
- An NN consists of a large number of processors (neurons/nodes) and links (connections), which represent knowledge
- An NN is trained with a large amount of data and rules about data relationships (memorise)
- A well-trained NN can learn association and similarity (generalise)
- Supervised learning: the NN is trained with sets of inputs and desired outputs
  - If the actual output differs from the desired output, the network adjusts its internal connection strengths (weights) to reduce the difference
  - This process continues until the network gets the I/O patterns correct or until an acceptable error rate is attained
- Unsupervised learning: Self-Organising Map (SOM)
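
The supervised training loop described above (compare actual and desired output, adjust weights, repeat until the patterns are correct) can be sketched with a single-neuron perceptron. This is a deliberately minimal stand-in, not the multi-layer networks used in practice; the learning rate, epoch limit, and the logical-AND training set are assumptions.

```python
def predict(weights, bias, inputs):
    """Output 1 if the weighted sum of inputs plus bias is positive, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def train_perceptron(samples, learning_rate=0.1, max_epochs=100):
    """Adjust weights whenever the actual output differs from the desired output."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, desired in samples:
            delta = desired - predict(weights, bias, inputs)
            if delta != 0:
                errors += 1
                weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
                bias += learning_rate * delta
        if errors == 0:  # every I/O pattern is now correct
            break
    return weights, bias

# Hypothetical training set: inputs and desired outputs for logical AND.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
and_weights, and_bias = train_perceptron(and_samples)
```

A single neuron like this can only learn linearly separable patterns; for anything else the loop stops at `max_epochs`, which corresponds to accepting a residual error rate.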

Slide 26: Summary
- DW & DM: differences
- The definition
- Application areas
- Comparison with query and Web site analysis tools
- DM process
  - Data preparation (60% of the whole time)
  - DM (~10% of the time)
- Applications, models and algorithms (decision trees, neural networks, etc.)
- Next week: revision