Data Mining
By: Tung, Sze Ming (Leo)
CS 157B
Definition
- A class of database applications that analyze data in a database using tools that look for trends or anomalies.
- IBM was among the early developers of data mining tools.
Purpose
- To look for hidden patterns or previously unknown relationships in a group of data that can be used to predict future behavior.
- Example: Data mining software can help retail companies find customers with common interests.
Background Information
- Many of the techniques used by today's data mining tools have been around for many years, having originated in the artificial intelligence research of the 1980s and early 1990s.
- Data mining tools are only now being applied to large-scale database systems.
The Need for Data Mining
- The amount of raw data stored in corporate data warehouses is growing rapidly.
- The data that might be relevant to a specific problem is often too large and too complex to analyze by hand.
- Data mining promises to bridge this analytical gap by giving knowledge workers the tools to navigate this complex analytical space.
The Need for Data Mining (cont.)
- The need for information has resulted in the proliferation of data warehouses that integrate information from multiple sources to support decision making.
- These warehouses often include data from external sources, such as customer demographics and household information.
Approaches to Data Mining
- Association
- Sequence-based analysis
- Clustering
- Classification
Association
- Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction.
- This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products.
- Example: 80 percent of all transactions in which beer was purchased also included potato chips.
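The "80 percent" figure in the example is what is usually called the confidence of a rule such as beer -> chips. A minimal sketch of how it could be computed follows; the function name `rule_stats` and the sample baskets are hypothetical and made up for illustration, not taken from any real data set or tool.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing both item sets
    confidence = share of antecedent transactions that also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "chips", "bread"},
    {"beer", "chips", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
]

support, confidence = rule_stats(baskets, {"beer"}, {"chips"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

With these made-up baskets, four of the five beer transactions also contain chips, so the confidence is 0.80. Real association tools search for such rules automatically rather than testing one rule at a time.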
Sequence-based analysis
- Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction.
- Sequence-based analysis instead follows purchases over time, to identify a typical set of purchases that might predict the subsequent purchase of a specific item.
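One simple way to picture this is to measure how often customers who bought a given set of items went on to buy a specific item later. The sketch below assumes each customer's history is a list of items in purchase order; the function name `followed_by_rate` and the sample histories are hypothetical.

```python
def followed_by_rate(histories, earlier_items, later_item):
    """Share of customers who bought every item in earlier_items and
    then bought later_item at some point afterwards."""
    earlier_items = set(earlier_items)
    n_with_earlier = 0
    n_followed = 0
    for history in histories:
        seen = set()
        completed_at = None
        # Find the position at which the customer has bought all earlier items.
        for position, item in enumerate(history):
            if item in earlier_items:
                seen.add(item)
            if seen == earlier_items:
                completed_at = position
                break
        if completed_at is None:
            continue  # this customer never bought the full earlier set
        n_with_earlier += 1
        if later_item in history[completed_at + 1:]:
            n_followed += 1
    return n_followed / n_with_earlier if n_with_earlier else 0.0


# Hypothetical purchase histories, each in time order.
histories = [
    ["tent", "sleeping bag", "camp stove"],
    ["tent", "sleeping bag"],
    ["sleeping bag", "tent", "camp stove"],
    ["camp stove", "tent", "sleeping bag"],
    ["boots"],
]

rate = followed_by_rate(histories, {"tent", "sleeping bag"}, "camp stove")
print(f"rate={rate:.2f}")
```

Unlike the point-in-time basket case, order matters here: the fourth customer bought the stove before the tent and sleeping bag, so that history does not count as a follow-up purchase.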
Clustering
- Clustering approaches address segmentation problems.
- These approaches assign records with a large number of attributes into a relatively small set of groups or "segments."
- Example: Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
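As an illustration of segmentation, here is a minimal sketch of k-means, one common clustering algorithm (the slides do not name a specific one). The customer attributes (store visits per month, average spend) and the data are hypothetical, and the first k points are used as starting centroids only to keep the example deterministic.

```python
def k_means(points, k, iterations=20):
    """Group points into k segments; returns (assignments, centroids)."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical customers: (store visits per month, average spend).
customers = [(1, 20), (10, 200), (2, 25), (1, 30), (9, 220), (11, 210)]

segments, centers = k_means(customers, k=2)
print(segments)
print(centers)
```

Here the occasional low-spend shoppers and the frequent high-spend shoppers end up in separate segments. In practice the attributes would usually be rescaled first, since a large-valued attribute such as spend otherwise dominates the distance.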
Classification
- The most commonly applied data mining technique.
- The algorithm uses preclassified examples to determine the set of parameters required for proper discrimination.
- Example: A classifier capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.
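To make "determining parameters from preclassified examples" concrete, the sketch below learns a single parameter, a cutoff on a debt-to-income ratio, from labeled loans. This is a deliberately tiny one-threshold classifier, not the method of any particular tool; the attribute, the function names, and the training data are all hypothetical.

```python
def fit_threshold(examples):
    """Pick the cutoff that misclassifies the fewest training examples.

    examples: list of (debt_to_income, is_risky) pairs.
    A loan is predicted risky when its ratio is >= the cutoff.
    """
    best_threshold, best_errors = None, None
    for threshold in sorted({value for value, _ in examples}):
        errors = sum((value >= threshold) != is_risky for value, is_risky in examples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def is_risky_loan(debt_to_income, threshold):
    """Apply the learned cutoff to a new loan application."""
    return debt_to_income >= threshold


# Hypothetical preclassified loans: (debt-to-income ratio, was it risky?).
training = [
    (0.10, False), (0.20, False), (0.25, False),
    (0.45, True), (0.50, True), (0.60, True),
]

threshold = fit_threshold(training)
print(threshold)
print(is_risky_loan(0.55, threshold), is_risky_loan(0.15, threshold))
```

Real classifiers (decision trees, neural networks, and so on) fit many parameters over many attributes, but the principle is the same: the labeled history fixes the parameters, which are then applied to new cases.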
Issues of Data Mining
- Present-day tools are powerful but require significant expertise to implement effectively.
- Issues of data mining:
  - Susceptibility to "dirty" or irrelevant data.
  - Inability to "explain" results in human terms.
Issues: susceptibility to "dirty" or irrelevant data
- Data mining tools of today simply take everything they are given as factual and draw the resulting conclusions.
- Users must take the necessary precautions to ensure that the data being analyzed is "clean."
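What "clean" means depends on the data, but a typical precaution is to filter out records with missing values, impossible values, or exact duplicates before analysis. A minimal sketch, assuming records are plain dictionaries; the field names, the valid range, and the function name `clean_records` are hypothetical.

```python
def clean_records(records, required_fields, valid_ranges):
    """Drop records with missing fields, out-of-range values, or exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Missing or empty required field.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Value that is not a number within its allowed range.
        if any(
            not (isinstance(record.get(field), (int, float)) and low <= record[field] <= high)
            for field, (low, high) in valid_ranges.items()
        ):
            continue
        # Exact duplicate of a record already kept.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical customer records containing typical "dirty" entries.
records = [
    {"customer_id": 1, "age": 34, "spend": 120.0},
    {"customer_id": 2, "age": -5, "spend": 80.0},   # impossible age
    {"customer_id": 3, "age": 41, "spend": None},   # missing value
    {"customer_id": 1, "age": 34, "spend": 120.0},  # duplicate
    {"customer_id": 4, "age": 29, "spend": 60.0},
]

cleaned = clean_records(records, ["customer_id", "age", "spend"], {"age": (0, 120)})
print([r["customer_id"] for r in cleaned])
```

Silently dropping records is itself a judgment call; in practice it is worth counting and reviewing what was removed, since a high rejection rate can point to a problem in the data source.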
Issues (cont.): inability to "explain" results in human terms
- Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms.
- What good does the information do if you don't understand it?
The End