Zhangxi Lin, ISQS 3358, Texas Tech University

Presentation transcript:

- Define data mining and list its objectives and benefits
- Understand different purposes and applications of data mining
- Understand different methods of data mining, especially clustering and decision tree models
- Build expertise in the use of some data mining software
- Learn the process of data mining projects
- Understand data mining pitfalls and myths
- Define text mining and its objectives and benefits
- Appreciate the use of text mining in business applications
- Define Web mining and its objectives and benefits

ISQS 6347, Data & Text Mining

Case 1: Credit Card Promotion

- Credit card companies periodically send promotional offers (e.g., a life insurance promotion) to potential customers.
- Assume:
  - Each promotion letter costs $0.20.
  - The profit from each accepted promotion is $10.
  - The overall response rate is 1%.
- Questions:
  - Sending the offer to an unselected population yields an expected average profit per letter of $10 * 1% - $0.20 * 99% = -$0.098, a loss. How can the offers be sent to the right customers in order to make a profit?
  - How can the profit be maximized by applying a proper set of selection rules?
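The arithmetic in Case 1 can be checked with a short Python sketch. It follows the slide's formula as written (the letter cost is charged only against non-responders); the function names and the break-even helper are illustrative assumptions, not part of the original slides.

```python
def expected_profit_per_letter(response_rate, profit_per_acceptance=10.0,
                               cost_per_letter=0.20):
    """Expected profit per letter, using the slide's formula:
    profit * r - cost * (1 - r)."""
    return (profit_per_acceptance * response_rate
            - cost_per_letter * (1.0 - response_rate))


def break_even_response_rate(profit_per_acceptance=10.0, cost_per_letter=0.20):
    """Response rate at which the formula above equals zero.
    Solving profit * r - cost * (1 - r) = 0 gives r = cost / (profit + cost)."""
    return cost_per_letter / (profit_per_acceptance + cost_per_letter)


if __name__ == "__main__":
    # Unselected population with a 1% response rate: a loss per letter.
    print(round(expected_profit_per_letter(0.01), 3))
    # Any selection rule must lift the response rate above this threshold.
    print(round(break_even_response_rate(), 4))
```

Under these assumptions, a selection rule only pays off if the customers it picks respond at roughly twice the overall 1% rate or better.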

Case 2: Customer Segmentation

ID    Name  Gender  Age  Occupation
C001  X     M       15   Student
C002  Y     F       30   Staff
C003  Z     M       18   Student
C004  A     F       45   Staff
C005  B     M       30   Staff
C006  C     F       25   Student

- The data are used to segment the customers for a sales promotion.
- Three products: a DVD, a game, and a drink for adults.
- Problems:
  - How can the customers be segmented into two clusters?
  - Are two clusters good enough? Why not three clusters?

Case 3: Association Rule Mining

- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
- Examples of association rules from market-basket transactions:
  - {Diaper} → {Beer}
  - {Milk, Bread} → {Eggs, Coke}
  - {Beer, Bread} → {Milk}
- Implication means co-occurrence, not causality!
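A minimal Python sketch of how such rules are usually scored. The slide's market-basket table did not survive extraction, so the five transactions below are hypothetical, and support and confidence are the standard textbook measures rather than something the slide defines.

```python
# Hypothetical market-basket transactions (the slide's own table was lost).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(lhs, rhs, transactions=TRANSACTIONS):
    """Fraction of all transactions containing both sides of the rule."""
    return count(lhs | rhs, transactions) / len(transactions)


def confidence(lhs, rhs, transactions=TRANSACTIONS):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs, transactions) / count(lhs, transactions)


if __name__ == "__main__":
    for lhs, rhs in [({"Diaper"}, {"Beer"}),
                     ({"Milk", "Bread"}, {"Eggs", "Coke"}),
                     ({"Beer", "Bread"}, {"Milk"})]:
        print(sorted(lhs), "->", sorted(rhs),
              "support:", support(lhs, rhs),
              "confidence:", confidence(lhs, rhs))
```

Both measures count co-occurrence only, which is why a high-confidence rule still says nothing about causality.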

Data mining (DM): a process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases.

Knowledge discovery in databases (KDD): a comprehensive process of using data mining methods to find useful information and patterns in data.

Major characteristics and objectives of data mining

- Data are often buried deep within very large databases, which sometimes contain data from several years; sometimes the data are cleansed and consolidated in a data warehouse.
- The data mining environment is usually a client/server architecture or a Web-based architecture.
- Sophisticated new tools help to remove the information "ore" buried in corporate files or archival public records; finding it involves massaging and synchronizing the data to get the right results.
- The miner is often an end user, empowered by data drills and other power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill.
- Striking it rich often involves finding an unexpected result and requires end users to think creatively.
- Data mining tools are readily combined with spreadsheets and other software development tools; the mined data can be analyzed and processed quickly and easily.
- Parallel processing is sometimes used because of the large amounts of data and massive search efforts.

How data mining works

- Data mining tools find patterns in data and may even infer rules from them.
- Three methods are used to identify patterns in data:
  1. Simple models
  2. Intermediate models
  3. Complex models

Classification: supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior. Common tools used for classification are:

- Neural networks
- Decision trees
- If-then-else rules

Clustering: partitioning a database into segments in which the members of a segment share similar qualities.

Association: a category of data mining algorithm that establishes relationships among items that occur together in a given record.

Sequence discovery: the identification of associations over time.

Visualization can be used in conjunction with data mining to gain a clearer understanding of many underlying relationships.

Regression is a well-known statistical technique that is used to map data to a prediction value.

Forecasting estimates future values based on patterns within large sets of data.

Data mining applications

- Marketing
- Banking
- Retailing and sales
- Manufacturing and production
- Brokerage and securities trading
- Insurance
- Computer hardware and software
- Government and defense
- Airlines
- Health care
- Broadcasting
- Police
- Homeland security


Data mining tools and techniques can be classified based on the structure of the data and the algorithms used:

- Statistical methods
- Decision trees: defined as a root followed by internal nodes. Each node (including the root) is labeled with a question, and the arcs associated with each node cover all possible responses.
- Case-based reasoning
- Neural computing
- Intelligent agents
- Genetic algorithms
- Other tools
  - Rule induction
  - Data visualization

A general algorithm for building a decision tree:

1. Create a root node and select a splitting attribute.
2. Add a branch to the root node for each split candidate value and label it.
3. Take the following iterative steps:
   a. Classify data by applying the split value.
   b. If a stopping point is reached, then create a leaf node and label it. Otherwise, build another subtree.
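The general algorithm can be sketched as a short recursive function in Python. This is only an illustration under stated assumptions: attributes are categorical, the splitting attribute is the one with the lowest weighted Gini index (the slide does not name a criterion at this point), the stopping rule is "node is pure or no attributes remain", and the eight training rows are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, attributes, target):
    """Return a leaf label, or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    # Step 3b: stopping point reached -> create a leaf with the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def weighted_gini(attr):
        # Size-weighted impurity of the partitions induced by attr.
        total = 0.0
        for value in set(r[attr] for r in rows):
            subset = [r[target] for r in rows if r[attr] == value]
            total += len(subset) / len(rows) * gini(subset)
        return total

    # Step 1: select the splitting attribute.
    best = min(attributes, key=weighted_gini)
    remaining = [a for a in attributes if a != best]
    # Steps 2 and 3a: one labeled branch per value, then recurse on each part.
    return {best: {value: build_tree([r for r in rows if r[best] == value],
                                     remaining, target)
                   for value in sorted(set(r[best] for r in rows))}}


# Hypothetical training rows, loosely styled after a tax-cheat example.
ROWS = [
    {"refund": "yes", "marital": "married", "cheat": "no"},
    {"refund": "no", "marital": "married", "cheat": "no"},
    {"refund": "no", "marital": "single", "cheat": "yes"},
    {"refund": "yes", "marital": "married", "cheat": "no"},
    {"refund": "no", "marital": "single", "cheat": "yes"},
    {"refund": "yes", "marital": "single", "cheat": "no"},
    {"refund": "no", "marital": "single", "cheat": "yes"},
    {"refund": "no", "marital": "married", "cheat": "no"},
]

if __name__ == "__main__":
    print(build_tree(ROWS, ["refund", "marital"], "cheat"))
```

Real tools add handling for continuous attributes, pruning, and minimum node sizes, none of which is attempted here.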

Gini index: used in economics to measure the diversity of the population. The same concept can be used to determine the "purity" of a specific class as a result of a decision to branch along a particular attribute/variable.

Gini index for a given node t:

$$\mathrm{Gini}(t) = 1 - \sum_{j} \left[ p(j \mid t) \right]^2$$

(NOTE: p(j | t) is the relative frequency of class j at node t.)

- Maximum (1 - 1/n_c) when records are equally distributed among all n_c classes, implying the least interesting information
- Minimum (0.0) when all records belong to one class, implying the most interesting information

Examples for a node with six records and two classes:

- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
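The three worked examples can be reproduced with a few lines of Python. This is a minimal sketch; the function name `gini` and the choice to pass per-class record counts are assumptions made for illustration.

```python
def gini(class_counts):
    """Gini index of a node, given the number of records in each class.

    Computes 1 - sum_j p(j|t)^2, where p(j|t) is the relative
    frequency of class j at the node.
    """
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


if __name__ == "__main__":
    # The six-record, two-class nodes from the slide, plus an even split.
    for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
        print(counts, round(gini(counts), 3))
```

The even split [3, 3] gives the two-class maximum of 1 - 1/2 = 0.5, matching the bound stated above.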

The ID3 (Iterative Dichotomizer 3) algorithm: a decision tree approach

Entropy: measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, then there is no uncertainty or randomness in that data set, and therefore the entropy is zero.
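The slide describes entropy only in words. A minimal Python sketch, assuming the standard Shannon definition (the negative sum of p log2 p over the classes present in the subset) and a hypothetical function name, is:

```python
import math


def entropy(class_counts):
    """Shannon entropy (in bits) of a subset, given records per class.

    Classes with zero records are skipped, following the usual
    convention that 0 * log2(0) is treated as 0.
    """
    total = sum(class_counts)
    probabilities = [c / total for c in class_counts if c > 0]
    return sum(-p * math.log2(p) for p in probabilities)


if __name__ == "__main__":
    print(entropy([6, 0]))  # all records in one class: no uncertainty
    print(entropy([3, 3]))  # evenly split two-class subset: most uncertain
```

ID3 uses this measure to prefer splits whose resulting subsets have lower entropy than the parent node.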

- Data: a collection of data objects and their attributes (variables)
- An attribute is a property or characteristic of an object
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature
- A collection of attributes describes an object
  - An object is also known as a record, point, case, sample, entity, or instance

[Figure: data table whose columns are attributes (variables) and whose rows are objects.]

[Figure: training data with two categorical attributes (Refund, Marital Status), one continuous attribute (Taxable Income), and a class label, shown next to the decision tree model built from it.]

Model: decision tree, with splitting attributes Refund, MarSt, and TaxInc:

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

A second tree for the same training data:

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: TaxInc?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!

Applying the decision tree model to test data: start from the root of the tree and, at each node, follow the branch that matches the test record's value for that node's attribute until a leaf is reached. In the slides' example, the test record ends at a leaf that assigns Cheat = "No".
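Walking a record down the tree can be written as nested tests. The Python sketch below assumes the tree in the slides reads: Refund = Yes gives NO; otherwise Married gives NO; otherwise Taxable Income above 80K gives YES and below 80K gives NO. The field names are hypothetical, and treating an income of exactly 80K as "No" is an assumption, since the slide labels only "< 80K" and "> 80K".

```python
def classify(record):
    """Return the predicted Cheat label ("Yes" or "No") for one record."""
    # Root node: Refund.
    if record["refund"] == "Yes":
        return "No"
    # Second level: marital status.
    if record["marital_status"] == "Married":
        return "No"
    # Third level (Single or Divorced): taxable income threshold at 80K.
    # Assumption: exactly 80K falls on the "No" side.
    return "Yes" if record["taxable_income"] > 80_000 else "No"


if __name__ == "__main__":
    # Hypothetical test record.
    test_record = {"refund": "No", "marital_status": "Married",
                   "taxable_income": 80_000}
    print(classify(test_record))
```

Each `if` corresponds to one internal node, so the number of tests executed equals the depth of the leaf that the record reaches.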

                  Actual Accept             Actual Reject             Total
Computed Accept   True Positive (TP): a     False Positive (FP): c    a + c
Computed Reject   False Negative (FN): b    True Negative (TN): d     b + d
Total             a + b                     c + d

- Accuracy rate = a / (a + c)
- Coverage rate = a / (a + b)
- Lift = Accuracy rate / [(a + b) / (a + b + c + d)]
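The three measures follow directly from the four cell counts. In the Python sketch below, the function name and the example counts are hypothetical; the counts are chosen only so that the overall acceptance rate is 1%.

```python
def targeting_metrics(a, b, c, d):
    """Metrics from the cell counts a (TP), b (FN), c (FP), d (TN)."""
    accuracy_rate = a / (a + c)            # share of computed accepts that are real
    coverage_rate = a / (a + b)            # share of actual accepts that were found
    base_rate = (a + b) / (a + b + c + d)  # acceptance rate with no model at all
    return {
        "accuracy_rate": accuracy_rate,
        "coverage_rate": coverage_rate,
        "lift": accuracy_rate / base_rate,
    }


if __name__ == "__main__":
    # Hypothetical counts: 10,000 customers, 100 of whom would accept.
    print(targeting_metrics(a=30, b=70, c=970, d=8930))
```

A lift above 1 means the customers selected by the model accept more often than customers picked at random.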

Cluster analysis for data mining

- Cluster analysis is an exploratory data analysis tool for solving classification problems.
- The object is to sort cases into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters.

Cluster analysis results may be used to:

- Help identify a classification scheme
- Suggest statistical models to describe populations
- Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes
- Provide measures of definition, size, and change in what were previously broad concepts
- Find typical cases to represent classes

Cluster analysis methods

- Statistical methods
- Optimal methods
- Neural networks
- Fuzzy logic
- Genetic algorithms

Each of these methods generally works with one of two general method classes:

- Divisive
- Agglomerative

Hierarchical clustering method and example

1. Decide which data to record from the items.
2. Calculate the distances between all initial clusters. Store the results in a distance matrix.
3. Search through the distance matrix and find the two most similar clusters.
4. Fuse those two clusters together to produce a cluster that has at least two items.
5. Calculate the distances between this new cluster and all the other clusters.
6. Repeat steps 3 to 5 until you have reached the prespecified maximum number of clusters.
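The steps above can be sketched in Python using the customer ages from Case 2. Several choices here are assumptions made for illustration: age is the only recorded attribute, distance is the absolute age difference, cluster-to-cluster distance is single linkage (closest pair of members), and distances are recomputed on each pass instead of being stored in a matrix.

```python
def cluster_distance(c1, c2):
    """Single-linkage distance: the closest pair of members across clusters."""
    return min(abs(x - y) for x in c1 for y in c2)


def agglomerative(values, target_clusters):
    """Fuse the two most similar clusters until target_clusters remain."""
    clusters = [[v] for v in values]  # every item starts as its own cluster
    while len(clusters) > target_clusters:
        best = None
        # Step 3: find the two most similar clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Step 4: fuse them; step 5 happens on the next pass of the loop.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    ages = [15, 30, 18, 45, 30, 25]  # customers C001..C006 from Case 2
    print(agglomerative(ages, 3))
    print(agglomerative(ages, 2))
```

With these assumptions, stopping at two clusters isolates the 45-year-old from everyone else, while three clusters separate the teenagers from the customers aged 25 to 30, which is one way to approach the "why not three clusters" question from Case 2.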

Classes of data mining tools and techniques as they relate to information and business intelligence (BI) technologies

- Mathematical and statistical analysis packages
- Personalization tools for Web-based marketing
- Analytics built into marketing platforms
- Advanced CRM tools
- Analytics added to other vertical industry-specific platforms
- Analytics added to database tools (e.g., OLAP)
- Standalone data mining tools

What Is Text Mining?

Text mining is a process that employs a set of algorithms for converting unstructured text into structured data objects and the quantitative methods used to analyze these data objects.

Text Mining Case: The Federalist Papers

- Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to try to convince the citizens of the state of New York to ratify the new Constitution of the United States. These essays are collectively called The Federalist Papers. Copies of the papers are available in a variety of formats.
- Of the 85 essays, 51 are attributed to Hamilton, 15 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 11 remaining essays can be attributed only to Hamilton or Madison. Mosteller and Wallace (1964) used Bayesian statistical techniques to provide evidence that Madison wrote all 11 of the essays of unknown authorship. (The essays in question are numbers 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, and 63.)
- Problem: uniquely identify an author based on the distribution of words in a document.

A simple text mining example

A tiny case: 9 documents, each followed by its category label (Fin, Riv, or Par)

1. deposit the cash and check in the bank (Fin)
2. the river boat is on the bank (Riv)
3. borrow based on credit (Fin)
4. river boat floats up the river (Riv)
5. boat is by the dock near the bank (Riv)
6. with credit, I can borrow cash from the bank (Fin)
7. boat floats by dock near the river bank (Riv)
8. check the parade route to see the floats (Par)
9. along the parade route (Par)

Text mining helps organizations:

- Find the "hidden" content of documents, including additional useful relationships
- Relate documents across previously unnoticed divisions
- Group documents by common themes

Applications of text mining

- Automatic detection of spam or phishing through analysis of the document content
- Automatic processing of messages or e-mails to route each message to the party best suited to process it
- Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses
- Analysis of related scientific publications in journals to create an automated summary view of a particular discipline
- Creation of a "relationship view" of a document collection
- Qualitative analysis of documents to detect deception

How to mine text

1. Eliminate commonly used words (stop-words)
2. Replace words with their stems or roots (stemming algorithms)
3. Consider synonyms and phrases
4. Calculate the weights of the remaining terms
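Steps 1, 2, and 4 can be sketched in Python on one of the nine example documents. Everything specific here is an assumption for illustration: the stop-word list is hypothetical, the "stemmer" only strips a trailing "s" (real systems use algorithms such as Porter's), synonyms and phrases (step 3) are skipped, and the weight of a term is simply its count within the document.

```python
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "in", "is", "on", "up", "by", "near",
              "with", "i", "can", "from", "to", "see", "along"}


def stem(word):
    """Naive stemmer: strip one trailing "s" from words longer than 3 letters."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def terms(document):
    """Lowercase, drop punctuation and stop-words, then stem what remains."""
    words = [w.strip(",.").lower() for w in document.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def term_weights(document):
    """Weight each remaining term by its raw frequency in the document."""
    return Counter(terms(document))


if __name__ == "__main__":
    print(term_weights("river boat floats up the river"))
    print(term_weights("with credit, I can borrow cash from the bank"))
```

Applying `term_weights` to all nine documents yields a term-by-document table, which is the structured form that classification or clustering methods can then work on.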

Example

Coca-Cola announced earnings on Saturday, Dec. 12, 2000. Profits were up by 3.1% as of 12/12/1999.

Resulting terms (a + marks a word replaced by its root form): coca-cola, +announce, earnings, on, Saturday, dec., 12, 2000, +profit, +be, up, 3.1%, as of

Web mining: the discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools.

- Web content mining: the extraction of useful information from Web pages
- Web structure mining: the development of useful information from the links included in Web documents
- Web usage mining: the extraction of useful information from the data generated through Web page visits, transactions, etc.

Uses for Web mining:

- Determine the lifetime value of clients
- Design cross-marketing strategies across products
- Evaluate promotional campaigns
- Target electronic ads and coupons at user groups
- Predict user behavior
- Present dynamic information to users

[Figure: banner advertising conversion flow. Elements: banners, click, landing page, sign up, target page, buy, exit. Depth of conversion: sign up, first-time purchase, repeated purchase. Data labels: BANNERAD, ABANDON, PROPBUY.]

How to improve the effectiveness of banner advertising?

- Understand the context:
  - Availability of information: click-through flow, user profile, etc.
  - Multiple ads: which one should be used?
- Data collection
- Data mining
- Model evaluation

The model can be built using:

- Web log data
- Registration data
- Vendor data (may not be required)

Modeling choices:

- One model with an indicator for the banner ad/vendor selected
- Multiple models, one for each vendor
  - The data overlap if page sequences are included, because "did not click" entries will have common elements in all models
- The model scores the propensity to click on a vendor's banner ad

When there is only one slot for one of two ads, which is the best decision?

- Selectively place an ad from the two choices
- Randomly place one of the ads
- Place both, either in two slots or by alternating them (time-sharing)
- Place nothing when the likelihood of a click-through is low, because of the possible negative effect

SAS Enterprise Miner 4.3

- Basics
  - How to use the application main menu
  - Using the pop-up menus
  - Enterprise Miner documentation
- Project and diagram
- The SEMMA methodology
  - Sample
  - Explore
  - Modify
  - Model
  - Assess

Decision Tree Example (pp )

Income  Pattern #  Loan risk
17      1          High
20      5          High
23      0          High
32      4          Low
43      2          High
68      3          Low

MBA Admission Decision Problem

Columns: GMAT, GPA, Quantitative GMAT Score (percentile), Decision

Decision column, in row order: No, No, Yes, No, Yes, Yes, No, Yes, ?, ?, ?