DATA, TEXT, AND WEB MINING

Presentation transcript:

Chapter 7: DATA, TEXT, AND WEB MINING

Learning Objectives
- Define data mining and list its objectives and benefits
- Understand different purposes and applications of data mining
- Understand different methods of data mining, especially clustering and decision tree models
- Build expertise in use of some data mining software

Learning Objectives (continued)
- Learn the process of data mining projects
- Understand data mining pitfalls and myths
- Define text mining and its objectives and benefits
- Appreciate use of text mining in business applications
- Define Web mining and its objectives and benefits

Data Mining Concepts and Applications
Six factors behind the sudden rise in popularity of data mining:
- General recognition of the untapped value in large databases
- Consolidation of database records tending toward a single customer view
- Consolidation of databases, including the concept of an information warehouse
- Reduction in the cost of data storage and processing, providing for the ability to collect and accumulate data
- Intense competition for a customer’s attention in an increasingly saturated marketplace
- The movement toward the de-massification of business practices

Data Mining Concepts and Applications
Data mining (DM): A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases.

Data Mining Concepts and Applications
Major characteristics and objectives of data mining:
- Data are often buried deep within very large databases, which sometimes contain data from several years; sometimes the data are cleansed and consolidated in a data warehouse
- The data mining environment is usually a client/server architecture or a Web-based architecture
- Sophisticated new tools help to remove the information ore buried in corporate files or archival public records; finding it involves massaging and synchronizing the data to get the right results
- The miner is often an end user, empowered by data drills and other power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill
- Striking it rich often involves finding an unexpected result and requires end users to think creatively
- Data mining tools are readily combined with spreadsheets and other software development tools; the mined data can be analyzed and processed quickly and easily
- Parallel processing is sometimes used because of the large amounts of data and massive search efforts

Data Mining Concepts and Applications
How data mining works: Data mining tools find patterns in data and may even infer rules from them.
Three methods are used to identify patterns in data:
- Simple models
- Intermediate models
- Complex models

Data Mining Concepts and Applications
Classification: Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.
Common tools used for classification are:
- Neural networks
- Decision trees
- If-then-else rules

Data Mining Concepts and Applications
Clustering: Cluster analysis is an exploratory data analysis tool that sorts objects into groups so that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Cluster analysis simply discovers structures in data without explaining why they exist.
The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of a similar kind into respective categories. Example: classification of people and animals.
Methods: Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering.

Data Mining Concepts and Applications
k-Means Clustering: the k-means method produces exactly k different clusters of greatest possible distinction.
Algorithm: Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S1, S2, …, Sk}, so as to minimize the within-cluster sum of squares (WCSS):
$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$
where μi is the mean of the points in Si. See paper.

Data Mining Concepts and Applications
1) k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color).
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.
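The four steps above can be sketched in plain Python. This is a minimal illustration, not the slide author's implementation: the function name `kmeans`, the `iterations` cap, and the choice to seed the initial means with k randomly chosen observations (rather than random points in the data domain) are all assumptions made to keep the example short.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1 (assumption): use k distinct observations as the initial means.
    means = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Step 2: associate every observation with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[distances.index(min(distances))].append(p)
        # Step 3: the centroid of each cluster becomes the new mean.
        new_means = []
        for cluster, old_mean in zip(clusters, means):
            if cluster:
                new_means.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_means.append(old_mean)  # an empty cluster keeps its old mean
        # Step 4: repeat until the means stop moving (convergence).
        if new_means == means:
            break
        means = new_means
    return means, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

On this small data set the two well-separated groups end up in separate clusters. In general the result of k-means depends on the initial means, so several restarts are commonly used.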

Data Mining Concepts and Applications
EM clustering on an artificial dataset ("mouse"): the tendency of k-means to produce equal-sized clusters leads to bad results, while EM benefits from the Gaussian distribution present in the data set.

Data Mining Concepts and Applications
Expectation Maximization (EM) Clustering: used to detect clusters in observations (or variables) and to assign those observations to the clusters. A typical example application: a number of variables related to consumer behavior are measured for a large sample of respondents.

Data Mining Concepts and Applications
Association: A category of data mining algorithm that establishes relationships about items that occur together in a given record.
These powerful exploratory techniques have a wide range of applications in many areas of business practice and research, from the analysis of consumer preferences or human resource management to the history of language. They enable analysts and researchers to uncover hidden patterns in large data sets, such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z."
Examples: if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High); bookstore recommendations.
The implementation of the so-called a-priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) allows us to rapidly process huge data sets for such associations, based on predefined "threshold" values for detection.
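The "threshold" values mentioned above are usually thresholds on the support and confidence of a rule. The sketch below computes both for a rule such as "customers who order A also order B". It is not the a-priori algorithm itself (there is no candidate generation or pruning), and the basket data and function names are made up for illustration.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of products per order.
baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]

print(support(baskets, {"A", "B"}))       # 3/5
print(confidence(baskets, {"A"}, {"B"}))  # 3/4
```

A rule would be reported only if both numbers exceed the analyst's predefined minimum support and minimum confidence.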

Data Mining Concepts and Applications
Association (continued)
Sequence analysis. Sequence analysis is concerned with a subsequent purchase of a product or products given a previous buy. For instance, buying an extended warranty is more likely to follow (in that specific sequential order) the purchase of a TV or other electric appliances. Sequence rules, however, are not always that obvious, and sequence analysis helps you to extract such rules no matter how hidden they may be in your market basket data.
Link analysis. In retailing or marketing, knowledge of purchase "patterns" can help with the direct marketing of special offers to the "right" or "ready" customers (i.e., those who, according to the rules, are most likely to purchase specific items given their observed past consumption patterns). The term "link analysis" is often used when these techniques for extracting sequential or non-sequential association rules are applied to organize complex "evidence." It is easy to see how the "transactions" or "shopping basket" metaphor can be applied to situations where individuals engage in certain actions, open accounts, contact other specific individuals, and so on.
Unique data analysis requirements. Crosstabulation tables, and in particular Multiple Response tables.

Data Mining Concepts and Applications
Visualization can be used in conjunction with data mining to gain a clearer understanding of many underlying relationships.

Data Mining Concepts and Applications
A-priori algorithm: see paper.

Data Mining Concepts and Applications
Regression is a well-known statistical technique that is used to map data to a prediction value.
Forecasting estimates future values based on patterns within large sets of data.

Data Mining Concepts and Applications
Hypothesis-driven data mining: Begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition.
Discovery-driven data mining: Finds patterns, associations, and relationships among the data in order to uncover facts that were previously unknown or not even contemplated by an organization.

Data Mining Concepts and Applications
Data mining applications: Marketing; Banking; Retailing and sales; Manufacturing and production; Brokerage and securities trading; Insurance; Computer hardware and software; Government and defense; Airlines; Health care; Broadcasting; Police; Homeland security

Data Mining Techniques and Tools
Data mining tools and techniques can be classified based on the structure of the data and the algorithms used:
- Statistical methods
- Decision trees: defined as a root followed by internal nodes. Each node (including the root) is labeled with a question, and the arcs associated with each node cover all possible responses
- Case-based reasoning
- Neural computing
- Intelligent agents
- Genetic algorithms
- Other tools: rule induction, data visualization

Data Mining Techniques and Tools
A general algorithm for building a decision tree:
1. Create a root node and select a splitting attribute.
2. Add a branch to the root node for each split candidate value and label.
3. Take the following iterative steps:
   a. Classify data by applying the split value.
   b. If a stopping point is reached, then create a leaf node and label it.
   c. Otherwise, build another subtree.

Data Mining Techniques and Tools
Gini index: Used in economics to measure the diversity of a population. The same concept can be used to determine the ‘purity’ of a specific class as a result of a decision to branch along a particular attribute/variable.
Formula: $\text{Gini}(S) = 1 - \sum_{j=1}^{n} p_j^2$
where S is a data set that contains examples from n classes and $p_j$ is the relative frequency of class j in S.
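As a small sketch of the formula above, the function below computes the Gini index of a list of class labels using exact fractions. The function name and the sample labels are illustrative only.

```python
from collections import Counter
from fractions import Fraction

def gini(labels):
    """Gini(S) = 1 - sum of squared relative class frequencies in S."""
    if not labels:
        return Fraction(0)  # convention chosen here for an empty data set
    total = len(labels)
    return 1 - sum(Fraction(count, total) ** 2 for count in Counter(labels).values())

print(gini(["High", "High", "High"]))        # 0: a pure set
print(gini(["High", "Low"]))                 # 1/2: two equally frequent classes
print(gini(["High", "High", "High", "Low"])) # 3/8
```

A value of 0 means every example belongs to one class; larger values mean the classes are more mixed.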

Data Mining Techniques and Tools
Example: Sample patterns for training a decision tree to predict loan risk

Pattern #   Income
0           23
1           17
2           43
3           68
4           32
5           20

Credit Rating takes the values High, Low, and Moderate; Loan Risk takes the values High and Low.
Since there are only two classes, High and Low, for a data set S with p High and n Low elements, the Gini computation is as follows:

Data Mining Techniques and Tools
$p_{High} = p/(p+n)$, $p_{Low} = n/(p+n)$
$\text{Gini}(S) = 1 - p_{High}^2 - p_{Low}^2$
If data set S is split into S1 and S2, the splitting index is defined as follows:
$\text{Gini}_{SPLIT}(S) = \frac{p_1 + n_1}{p + n} \times \text{Gini}(S_1) + \frac{p_2 + n_2}{p + n} \times \text{Gini}(S_2)$
where $p_1$, $n_1$ ($p_2$, $n_2$) denote the numbers of High and Low elements in the data set S1 (S2). In this definition, the best split point is the one with the lowest value of the $\text{Gini}_{SPLIT}$ index.
For our example, reorder the data according to income:

Income      17  20  23  32  43  68
Pattern #    1   5   0   4   2   3

Each pattern carries a Loan Risk of High or Low.

Data Mining Techniques and Tools
Possible values of a split point for the Income attribute are Income<=17, Income<=20, Income<=23, Income<=32, Income<=43, and Income<=68. Now we can compute the Gini index for each of these levels of splits.
Consider the choice of dividing the data at Income<=17. We have the following classification counts:

Pattern Count   High  Low
Income<=17      1     0
Income>17       3     2

So the Gini index for Income<=17 and Income>17 will be:
$G(\text{Income} \le 17) = 1 - (\text{proportion of records with High risk})^2 - (\text{proportion of records with Low risk})^2 = 1 - 1^2 - 0^2 = 0$
Similarly, $G(\text{Income} > 17) = 1 - ((3/5)^2 + (2/5)^2) = 12/25$

Data Mining Techniques and Tools
The Gini index for the split choice is computed as follows:
$G_{SPLIT} = (\text{proportion of records at Income} \le 17) \times G(\text{Income} \le 17) + (\text{proportion of records at Income} > 17) \times G(\text{Income} > 17)$
That is, $G_{SPLIT} = (1/6) \times 0 + (5/6) \times (12/25) = 2/5$.
Now consider the choice Income<=20:

Pattern Count   High  Low
Income<=20      2     0
Income>20       2     2

So the Gini index for Income<=20 and Income>20 will be:
$G(\text{Income} \le 20) = 1 - (1^2 + 0^2) = 0$
$G(\text{Income} > 20) = 1 - ((2/4)^2 + (2/4)^2) = 1/2$
$G_{SPLIT} = (2/6) \times 0 + (4/6) \times (1/2) = 1/3$

Data Mining Techniques and Tools
For the split choice Income<=23:

Pattern Count   High  Low
Income<=23      3     0
Income>23       1     2

$G(\text{Income} \le 23) = 1 - (1^2 + 0^2) = 0$
$G(\text{Income} > 23) = 1 - ((1/3)^2 + (2/3)^2) = 4/9$
$G_{SPLIT} = (3/6) \times 0 + (3/6) \times (4/9) = 2/9$

For the split choice Income<=32:

Pattern Count   High  Low
Income<=32      3     1
Income>32       1     1

$G(\text{Income} \le 32) = 1 - ((3/4)^2 + (1/4)^2) = 3/8$
$G(\text{Income} > 32) = 1 - ((1/2)^2 + (1/2)^2) = 1/2$
$G_{SPLIT} = (4/6) \times (3/8) + (2/6) \times (1/2) = 5/12$

Data Mining Techniques and Tools
The lowest value of $G_{SPLIT}$ is for Income<=23. To place the split point, we take this value and the next higher income value and average them. Thus, we have a split point at Income = (23+32)/2 = 27.5. Attribute lists are divided at the split point. That is, we expect to have a rule that says:
If Income<=27.5 Then ... Else (Income>27.5) ...
The following is the attribute list for Income<=27.5:

Income   Pattern #   Loan Risk
17       1           High
20       5           High
23       0           High

The Credit Rating values in this list are High and Low.
So the conclusion is: if Income<=27.5, the loan risk is High.
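The search for the best Income split can be sketched as a short scan over the candidate split points. The loan-risk labels for incomes 17 through 32 follow from the count tables in the worked example; which of the incomes 43 and 68 carries the remaining High label is an assumption made here, and it does not change the best split. All names in the code are illustrative.

```python
from collections import Counter
from fractions import Fraction

# (income, loan risk) pairs, sorted by income.
# Assumption: income 43 is High and income 68 is Low; the source only fixes
# that one of the two is High and the other Low.
records = [(17, "High"), (20, "High"), (23, "High"),
           (32, "Low"), (43, "High"), (68, "Low")]

def gini(labels):
    """1 minus the sum of squared class proportions; 0 for an empty set."""
    if not labels:
        return Fraction(0)
    total = len(labels)
    return 1 - sum(Fraction(c, total) ** 2 for c in Counter(labels).values())

def gini_split(records, threshold):
    """Weighted Gini index of the split Income <= threshold / Income > threshold."""
    left = [risk for income, risk in records if income <= threshold]
    right = [risk for income, risk in records if income > threshold]
    n = len(records)
    return Fraction(len(left), n) * gini(left) + Fraction(len(right), n) * gini(right)

incomes = sorted(income for income, _ in records)
scores = {t: gini_split(records, t) for t in incomes}
best = min(scores, key=scores.get)

# Place the split point halfway between the best value and the next income.
next_income = incomes[incomes.index(best) + 1]
split_point = (best + next_income) / 2

for threshold, score in scores.items():
    print(f"Income<={threshold}: {score}")
print("best:", best, "split point:", split_point)
```

Using exact fractions keeps the output directly comparable with the hand computation.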

Data Mining Techniques and Tools
But what about Income>27.5? The following table suggests that Income>27.5 is not a definitive indicator of Loan Risk:

Income   Pattern #
32       4
43       2
68       3

Loan Risk takes both values, High and Low, in this list, and Credit Rating takes the values High, Low, and Moderate.
So we can examine Credit Rating to develop the subtree for the Income>27.5 case. However, Credit Rating is a categorical variable, and the rules for a categorical variable are slightly different from those for a continuous variable. The Gini index formula will be:
$\text{Gini}(\text{two proportions}) = 1 - p_{\text{one proportion}}^2 - p_{\text{the other proportion}}^2$

Data Mining Techniques and Tools
In the case of a categorical variable, one proportion is the set of records with Credit Rating = {Low}, and the other proportion is the set of records with Credit Rating = not {Low}, or {Moderate, High}. Thus we have to compute the proportion of each category and its complement.

Pattern Count               Loan Risk High   Loan Risk Low
Credit Rating={Low}         1                0
Credit Rating={Moderate}    0                1
Credit Rating={High}        0                1

First, compute the Gini index for each category:
$G(\text{Credit Rating} = \{\text{Low}\}) = 1 - 1^2 - 0^2 = 0$
$G(\text{Credit Rating} = \{\text{Moderate}\}) = 1 - 0^2 - 1^2 = 0$
$G(\text{Credit Rating} = \{\text{High}\}) = 1 - 0^2 - 1^2 = 0$

Data Mining Techniques and Tools
Next, compute the Gini index for complement categories:
$G(\text{Credit Rating} \in \{\text{Low}, \text{Moderate}\}) = 1 - (1/2)^2 - (1/2)^2 = 1/2$
$G(\text{Credit Rating} \in \{\text{Low}, \text{High}\}) = 1/2$
$G(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\}) = 1 - 0^2 - 1^2 = 0$
Third, compute the Gini index for possible branches. For the branch choice of Credit Rating = {Low} and Credit Rating = {Moderate, High}, we would have:
$G_{SPLIT} = (\text{proportion of records with Credit Rating} = \text{Low}) \times G(\text{Credit Rating} = \{\text{Low}\}) + (\text{proportion of records with Credit Rating} \in \{\text{Moderate}, \text{High}\}) \times G(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\})$
$G_{SPLIT}(\text{Credit Rating} = \{\text{Low}\}) = (1/3) \times 0 + (2/3) \times 0 = 0$

Data Mining Techniques and Tools
Last, compute the Gini index for the other categories:
$G_{SPLIT}(\text{Credit Rating} = \{\text{Moderate}\}) = (1/3) \times 0 + (2/3) \times (1/2) = 1/3$
$G_{SPLIT}(\text{Credit Rating} = \{\text{High}\}) = (1/3) \times 0 + (2/3) \times (1/2) = 1/3$
$G_{SPLIT}(\text{Credit Rating} \in \{\text{Low}, \text{Moderate}\}) = (2/3) \times (1/2) + (1/3) \times 0 = 1/3$
$G_{SPLIT}(\text{Credit Rating} \in \{\text{Low}, \text{High}\}) = (2/3) \times (1/2) + (1/3) \times 0 = 1/3$
$G_{SPLIT}(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\}) = (2/3) \times 0 + (1/3) \times 0 = 0$
The lowest value of the Gini index for the split is zero, at Credit Rating = {Low} versus Credit Rating in {Moderate, High}; thus this is the split point, and these are the next branches of the subtree. See figure.
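For a categorical attribute, the candidate splits are subsets of categories versus their complements. The sketch below enumerates them for the three Income>27.5 records. The pairing of credit ratings with loan risks is an assumption chosen to agree with the Gini values in the worked example (the Low-rated record is the only High-risk one); variable and function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# (credit rating, loan risk) for the three records with Income > 27.5.
# Assumption: this pairing is inferred from the Gini values in the example.
records = [("Moderate", "Low"), ("Low", "High"), ("High", "Low")]

def gini(labels):
    """1 minus the sum of squared class proportions; 0 for an empty set."""
    if not labels:
        return Fraction(0)
    total = len(labels)
    return 1 - sum(Fraction(labels.count(v), total) ** 2 for v in set(labels))

def gini_split(records, subset):
    """Weighted Gini index of 'rating in subset' versus 'rating not in subset'."""
    inside = [risk for rating, risk in records if rating in subset]
    outside = [risk for rating, risk in records if rating not in subset]
    n = len(records)
    return Fraction(len(inside), n) * gini(inside) + Fraction(len(outside), n) * gini(outside)

ratings = sorted({rating for rating, _ in records})
scores = {}
for size in range(1, len(ratings)):
    for subset in combinations(ratings, size):
        scores[subset] = gini_split(records, subset)

for subset, score in scores.items():
    print(subset, score)
```

Each two-way partition appears twice (once per side), which mirrors the way the example lists both a category and its complement.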

Data Mining Techniques and Tools

Data Mining Techniques and Tools
The ID3 algorithm decision tree approach
Entropy: Measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, then there is no uncertainty or randomness in that data set; therefore, the entropy is zero.
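The slide does not give a formula; the sketch below uses the standard Shannon entropy, H(S) = -Σ p_j log2 p_j over the class proportions p_j, which is the measure ID3 is based on. The function name and sample labels are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in proportions)

print(entropy(["High", "High", "High"]))  # a single class: no uncertainty
print(entropy(["High", "Low"]))           # two equally likely classes: 1 bit
```

ID3 chooses the attribute whose split reduces this entropy the most (the information gain).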

Data Mining Techniques and Tools
Cluster analysis for data mining:
- Cluster analysis is an exploratory data analysis tool for solving classification problems
- The object is to sort cases into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters

Data Mining Techniques and Tools
Cluster analysis results may be used to:
- Help identify a classification scheme
- Suggest statistical models to describe populations
- Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes
- Provide measures of definition, size, and change in what were previously broad concepts
- Find typical cases to represent classes

Data Mining Techniques and Tools
Cluster analysis methods:
- Statistical methods
- Optimal methods
- Neural networks
- Fuzzy logic
- Genetic algorithms
Each of these methods generally works with one of two general method classes:
- Divisive
- Agglomerative

Data Mining Techniques and Tools
Hierarchical clustering method and example:
1. Decide which data to record from the items.
2. Calculate the distances between all initial clusters. Store the results in a distance matrix.
3. Search through the distance matrix and find the two most similar clusters.
4. Fuse those two clusters together to produce a cluster that has at least two items.
5. Calculate the distances between this new cluster and all the other clusters.
6. Repeat steps 3 to 5 until you have reached the prespecified maximum number of clusters.
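A minimal agglomerative sketch of the steps above follows. Two simplifications are assumptions of this example, not part of the slide: the distance between clusters is single linkage (distance between their closest members), and distances are recomputed in every round instead of being kept in a stored distance matrix.

```python
def agglomerative(points, max_clusters):
    """Fuse the two closest clusters until max_clusters remain."""
    clusters = [[p] for p in points]  # every item starts as its own cluster

    def distance(a, b):
        # Single linkage (assumption): Euclidean distance of the closest pair.
        return min(sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
                   for p in a for q in b)

    while len(clusters) > max(1, max_clusters):
        # Steps 2 and 3: compute all pairwise distances, find the closest pair.
        pairs = [(distance(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        # Step 4: fuse the two most similar clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        # Step 5 happens implicitly: distances are recomputed next round.
    return clusters

points = [(0,), (1,), (10,), (11,), (50,)]
print(agglomerative(points, max_clusters=3))
# [[(0,), (1,)], [(10,), (11,)], [(50,)]]
```

Recomputing all distances each round is simple but slow for large data sets; maintaining the distance matrix, as the steps describe, avoids the repeated work.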

Data Mining Techniques and Tools
Classes of data mining tools and techniques as they relate to information and business intelligence (BI) technologies:
- Mathematical and statistical analysis packages
- Personalization tools for Web-based marketing
- Analytics built into marketing platforms
- Advanced CRM tools
- Analytics added to other vertical industry-specific platforms
- Analytics added to database tools (e.g., OLAP)
- Standalone data mining tools

Data Mining Project Processes
Knowledge discovery in databases (KDD): A comprehensive process of using data mining methods to find useful information and patterns in data.

Data Mining Project Processes
KDD process:
1. Selection
2. Preprocessing
3. Transformation
4. Data mining
5. Interpretation/evaluation

Text Mining
Text mining: Application of data mining to nonstructured or less structured text files. It entails the generation of meaningful numerical indices from the unstructured text and then processing these indices using various data mining algorithms.

Text Mining
Text mining helps organizations:
- Find the “hidden” content of documents, including additional useful relationships
- Relate documents across previously unnoticed divisions
- Group documents by common themes

Text Mining
Applications of text mining:
- Automatic detection of e-mail spam or phishing through analysis of the document content
- Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
- Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses
- Analysis of related scientific publications in journals to create an automated summary view of a particular discipline
- Creation of a “relationship view” of a document collection
- Qualitative analysis of documents to detect deception

Text Mining
How to mine text:
1. Eliminate commonly used words (stop-words)
2. Replace words with their stems or roots (stemming algorithms)
3. Consider synonyms and phrases
4. Calculate the weights of the remaining terms
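The steps above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the tiny stop-word list, the crude suffix-stripping `stem` function (a stand-in for a real stemming algorithm), and the use of TF-IDF as the weighting scheme. The synonym and phrase step is left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption, far from complete).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "for"}

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def tokenize(text):
    """Lowercase, drop stop-words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def tf_idf(documents):
    """Weight each term by term frequency times log inverse document frequency."""
    token_lists = [tokenize(d) for d in documents]
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        weights.append({term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

print(tokenize("The claims of the customers"))  # ['claim', 'customer']
print(tf_idf(["warranty claims", "warranty repairs"]))
```

Terms that occur in every document, such as "warranty" in the second call, receive a weight of zero, while terms specific to one document receive a positive weight. These weights are the numerical indices that the data mining algorithms then process.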

Web Mining
Web mining: The discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools.

Web Mining
- Web content mining: The extraction of useful information from Web pages
- Web structure mining: The development of useful information from the links included in the Web documents
- Web usage mining: The extraction of useful information from the data being generated through webpage visits, transactions, etc.

Web Mining
Uses for Web mining:
- Determine the lifetime value of clients
- Design cross-marketing strategies across products
- Evaluate promotional campaigns
- Target electronic ads and coupons at user groups
- Predict user behavior
- Present dynamic information to users
