Lecture 15 Data Mining Concepts

Lecture 15 Data Mining Concepts Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe

Outline: Data Mining; Data Warehousing; Knowledge Discovery in Databases (KDD); Goals of Data Mining and Knowledge Discovery; Association Rules; Additional Data Mining Algorithms (sequential pattern analysis, time series analysis, regression, neural networks, genetic algorithms)

Definitions of Data Mining The discovery of new information in terms of patterns or rules from vast amounts of data. The process of finding interesting structure in data. The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

Data Warehousing The data warehouse is a historical database designed for decision support. Data mining can be applied to the data in a warehouse to help with certain types of decisions. Proper construction of a data warehouse is fundamental to the successful use of data mining.

Knowledge Discovery in Databases (KDD) Data mining is actually one step of a larger process known as knowledge discovery in databases (KDD). The KDD process model comprises six phases: (1) data selection, (2) data cleansing, (3) enrichment, (4) data transformation or encoding, (5) data mining, (6) reporting and displaying discovered knowledge.

Goals of Data Mining and Knowledge Discovery (PICO) Prediction: Determine how certain attributes will behave in the future. Identification: Identify the existence of an item, event, or activity. Classification: Partition data into classes or categories. Optimization: Optimize the use of limited resources.

Types of Discovered Knowledge: association rules; classification hierarchies; sequential patterns; patterns within time series; clustering.

Association Rules Association rules are frequently generated from market-basket data. A market basket corresponds to the set of items a consumer purchases during one visit to a supermarket. A set of items is known as an itemset. An association rule is of the form X => Y, where X = {x1, x2, …, xn} and Y = {y1, y2, …, ym} are sets of items, with xi and yj being distinct items for all i and all j. For an association rule to be of interest, it must satisfy a minimum support and a minimum confidence.

Association Rules Confidence and Support Support: the percentage of transactions that contain all of the items in the itemset LHS U RHS. The minimum support is the minimum percentage of instances in the database that must contain all items listed in a given association rule. Confidence: given a rule of the form A => B, rule confidence is the conditional probability that B is true when A is known to be true. Confidence can be computed as support(LHS U RHS) / support(LHS).
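
The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the `TRANSACTIONS` data and the function names `support` and `confidence` are assumptions made for the example.

```python
# Hypothetical market-basket data: each transaction is a set of item names.
TRANSACTIONS = [
    frozenset({"milk", "bread", "cookies", "juice"}),
    frozenset({"milk", "juice"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "cookies", "coffee"}),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(LHS U RHS) / support(LHS); 0.0 if the LHS never occurs."""
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return 0.0
    both = frozenset(lhs) | frozenset(rhs)
    return support(both, transactions) / lhs_support


milk_juice_support = support({"milk", "juice"}, TRANSACTIONS)
milk_implies_juice = confidence({"milk"}, {"juice"}, TRANSACTIONS)
```

With this sample data, {milk, juice} appears in 2 of 4 transactions (support 0.5), and milk appears in 3, so the rule milk => juice has confidence 0.5 / 0.75, about 0.67.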

Generating Association Rules The general algorithm for generating association rules is a two-step process. (1) Generate all itemsets that have a support exceeding the given threshold. Itemsets with this property are called large or frequent itemsets. (2) Generate rules for each itemset as follows: for itemset X and Y a subset of X, let Z = X - Y; if support(X)/support(Z) > minimum confidence, the rule Z => Y is a valid rule.
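
As a rough sketch of the second step, the function below enumerates every nonempty proper subset Y of a frequent itemset X, forms Z = X - Y, and keeps the rule Z => Y when support(X)/support(Z) exceeds the minimum confidence. The sample data and the names `support` and `rules_from_itemset` are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket data used only for this illustration.
TRANSACTIONS = [
    frozenset({"milk", "bread", "cookies", "juice"}),
    frozenset({"milk", "juice"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "cookies", "coffee"}),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def rules_from_itemset(itemset, transactions, min_confidence):
    """Return (Z, Y, confidence) for each valid rule Z => Y with Z = X - Y."""
    x = frozenset(itemset)
    support_x = support(x, transactions)
    rules = []
    for size in range(1, len(x)):              # nonempty proper subsets Y of X
        for y_items in combinations(sorted(x), size):
            y = frozenset(y_items)
            z = x - y
            support_z = support(z, transactions)
            if support_z == 0:
                continue
            conf = support_x / support_z
            if conf > min_confidence:          # strict test, as stated above
                rules.append((z, y, conf))
    return rules


RULES = rules_from_itemset({"milk", "juice"}, TRANSACTIONS, min_confidence=0.7)
```

On this data only juice => milk survives the 0.7 threshold: every transaction containing juice also contains milk, whereas milk => juice reaches only about 0.67.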

Reducing Association Rule Complexity Two properties are used to reduce the search space for association rule generation. Downward closure: a subset of a large itemset must also be large. Anti-monotonicity: a superset of a small itemset is also small. This implies that the itemset does not have sufficient support to be considered for rule generation.

Generating Association Rules: The Apriori Algorithm The Apriori algorithm was the first algorithm used to generate association rules. The Apriori algorithm uses the general algorithm for creating association rules together with downward closure and anti-monotonicity.
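
A compact level-wise sketch in the spirit of Apriori is shown below. It is an illustration under stated assumptions, not the textbook's listing: the function name `apriori`, the sample data, and the choice to treat an itemset as frequent when its support is at least `min_support` are all assumptions of this example.

```python
from itertools import combinations

# Hypothetical market-basket data used only for this illustration.
TRANSACTIONS = [
    frozenset({"milk", "bread", "cookies", "juice"}),
    frozenset({"milk", "juice"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "cookies", "coffee"}),
]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def sup(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        candidate = frozenset([item])
        s = sup(candidate)
        if s >= min_support:
            current[candidate] = s

    frequent = dict(current)
    k = 2
    while current:
        previous = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: a superset of a small itemset is also small, so every
        # (k-1)-subset of a surviving candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = sup(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


FREQUENT = apriori(TRANSACTIONS, min_support=0.5)
```

The prune step is where the two properties pay off: candidates with an infrequent subset are discarded before any pass over the transactions is spent counting them.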

Generating Association Rules: The Sampling Algorithm The sampling algorithm selects samples from the database of transactions that individually fit into memory. Frequent itemsets are then formed for each sample. If the frequent itemsets form a superset of the frequent itemsets for the entire database, then the real frequent itemsets can be obtained by scanning the remainder of the database. In some rare cases, a second scan of the database is required to find all frequent itemsets.

Generating Association Rules: Frequent-Pattern Tree Algorithm The Frequent-Pattern Tree Algorithm reduces the total number of candidate itemsets by producing a compressed version of the database in terms of an FP-tree. The FP-tree stores relevant information and allows for the efficient discovery of frequent itemsets. The algorithm consists of two steps: Step 1 builds the FP-tree. Step 2 uses the tree to find frequent itemsets.

Step 1: Building the FP-Tree
1. Compute the frequent 1-itemsets along with the count of transactions containing each item.
2. Sort the 1-itemsets in non-increasing order of count.
3. Create the root of the FP-tree with a “null” label.
4. For each transaction T in the database, place the frequent 1-itemsets in T in sorted order. Designate T as consisting of a head and the remaining items, the tail.
5. Insert itemset information recursively into the FP-tree as follows:
   a. If the current node, N, of the FP-tree has a child with an item name = head, increment the count associated with that child by 1; otherwise create a new child of N with a count of 1, link it to its parent N, and link it with the item header table.
   b. If the tail is nonempty, repeat step (a) from that child using only the tail, i.e., the old head is removed, the new head is the first item from the tail, and the remaining items become the new tail.
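
The tree-building step can be sketched as follows. This is a minimal illustration rather than the textbook's code: `FPNode`, `build_fp_tree`, the sample data, and the alphabetical tie-break between equally frequent items are assumptions, and the head/tail recursion is written as an equivalent loop.

```python
from collections import Counter

# Hypothetical market-basket data used only for this illustration.
TRANSACTIONS = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


class FPNode:
    """One node of the FP-tree: an item, its count, and links to parent/children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Return (root, header) where header maps each frequent item to its nodes."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    # Non-increasing count; ties broken by name so the tree is deterministic.
    order = sorted(frequent, key=lambda item: (-frequent[item], item))
    rank = {item: position for position, item in enumerate(order)}

    root = FPNode(None, None)                  # the "null" root
    header = {item: [] for item in order}      # item header table
    for t in transactions:
        items = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        node = root
        for head in items:                     # loop form of the head/tail recursion
            child = node.children.get(head)
            if child is None:                  # no matching child: create and link it
                child = FPNode(head, node)
                node.children[head] = child
                header[head].append(child)
            child.count += 1                   # a new node ends up with count 1
            node = child
    return root, header


ROOT, HEADER = build_fp_tree(TRANSACTIONS, min_count=2)
```

Transactions that share a prefix of frequent items share a path in the tree, which is what makes the structure a compressed form of the database.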

Step 2: The FP-growth Algorithm For Finding Frequent Itemsets
Input: FP-tree and minimum support, mins
Output: frequent patterns (itemsets)
procedure FP-growth (tree, alpha);
Begin
  if tree contains a single path P then
    for each combination, beta, of the nodes in the path
      generate pattern (beta U alpha) with support = minimum support of nodes in beta
  else
    for each item, i, in the header of the tree do
    begin
      generate pattern beta = (i U alpha) with support = i.support;
      construct beta’s conditional pattern base;
      construct beta’s conditional FP-tree, beta_tree;
      if beta_tree is not empty then FP-growth(beta_tree, beta);
    end;
End;

Generating Association Rules: The Partition Algorithm Divide the database into non-overlapping subsets. Treat each subset as a separate database where each subset fits entirely into main memory. Apply the Apriori algorithm to each partition. Take the union of all frequent itemsets from each partition. These itemsets form the global candidate frequent itemsets for the entire database. Verify the global set of itemsets by having their actual support measured for the entire database.
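
The partition idea can be sketched as below. The helper `frequent_itemsets` is a simplified Apriori-style miner (it omits the subset-pruning step for brevity), and all names, the sample data, and the "at least `min_support`" test are assumptions made for this example.

```python
# Hypothetical market-basket data used only for this illustration.
TRANSACTIONS = [
    frozenset({"milk", "bread", "cookies", "juice"}),
    frozenset({"milk", "juice"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "cookies", "coffee"}),
]


def frequent_itemsets(transactions, min_support):
    """Level-wise miner for one partition; returns a set of frozensets."""
    n = len(transactions)

    def sup(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    level = {frozenset([item]) for t in transactions for item in t}
    level = {c for c in level if sup(c) >= min_support}
    found = set()
    k = 1
    while level:
        found |= level
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = {c for c in candidates if sup(c) >= min_support}
    return found


def partition_mine(transactions, num_partitions, min_support):
    """Mine each partition locally, then verify the union against the full data."""
    size = -(-len(transactions) // num_partitions)      # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    candidates = set()
    for part in parts:                                   # local mining per partition
        candidates |= frequent_itemsets(part, min_support)

    n = len(transactions)
    verified = {}
    for c in candidates:                                 # global support check
        s = sum(1 for t in transactions if c <= t) / n
        if s >= min_support:
            verified[c] = s
    return verified


GLOBAL_FREQUENT = partition_mine(TRANSACTIONS, num_partitions=2, min_support=0.5)
```

The approach is sound because an itemset that is frequent in the whole database must be frequent in at least one partition, so the union of local results cannot miss a globally frequent itemset; the final pass only removes false candidates.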

Complications seen with Association Rules The cardinality of itemsets in most situations is extremely large. Association rule mining is more difficult when transactions show variability in factors such as geographic location and seasons. Item classifications exist along multiple dimensions. Data quality is variable; data may be missing, erroneous, conflicting, as well as redundant.

Classification Classification is the process of learning a model that is able to describe different classes of data. Learning is supervised as the classes to be learned are predetermined. Learning is accomplished by using a training set of pre-classified data. The model produced is usually in the form of a decision tree or a set of rules.

An Example Rule Here is one of the rules extracted from the decision tree of Figure 28.7. IF 50K > salary >= 20K AND age >=25 THEN class is “yes”

Clustering Unsupervised learning or clustering builds models from data without predefined classes. The goal is to place records into groups where the records in a group are highly similar to each other and dissimilar to records in other groups. The k-Means algorithm is a simple yet effective clustering technique.
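
A bare-bones k-Means loop might look like the following. It is a sketch under simplifying assumptions: the first k points are used as the initial centroids (real implementations usually choose them randomly), and `k_means` and `POINTS` are names invented for this example.

```python
import math

# Hypothetical 2-D records: two tight groups far apart from each other.
POINTS = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]


def k_means(points, k, iterations=100):
    """Return (centroids, clusters) after alternating assignment and update."""
    centroids = [tuple(p) for p in points[:k]]       # assumption: first k points seed
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:               # converged
            break
        centroids = new_centroids
    return centroids, clusters


CENTROIDS, CLUSTERS = k_means(POINTS, k=2)
```

Even though both seeds start inside the same group here, the alternating steps pull one centroid toward the distant points, and the loop stops once the centroids no longer move.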

Additional Data Mining Methods Sequential pattern analysis Time Series Analysis Regression Neural Networks Genetic Algorithms

Sequential Pattern Analysis Transactions ordered by time of purchase form a sequence of itemsets. The problem is to find all subsequences from a given set of sequences that have a minimum support. The sequence S1, S2, S3, .. is a predictor of the fact that a customer purchasing itemset S1 is likely to buy S2 , and then S3, and so on.
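
To make "support of a subsequence" concrete, the sketch below counts how many customer sequences contain a given pattern, where each itemset of the pattern must be contained in a later itemset of the sequence than the previous match. The data and the names `contains` and `sequence_support` are hypothetical.

```python
def fs(*items):
    """Shorthand for building a frozenset of item names."""
    return frozenset(items)


# Hypothetical purchase histories: one time-ordered list of itemsets per customer.
SEQUENCES = [
    [fs("milk", "bread"), fs("juice"), fs("eggs")],
    [fs("milk"), fs("bread", "juice")],
    [fs("juice"), fs("milk")],
]


def contains(sequence, pattern):
    """True if the itemsets of `pattern` occur, in order, within `sequence`."""
    position = 0
    for wanted in pattern:
        # Advance to the first remaining itemset that includes `wanted`.
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1          # the next pattern element must match strictly later
    return True


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as a subsequence."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


MILK_THEN_JUICE = sequence_support(SEQUENCES, [fs("milk"), fs("juice")])
```

In this sample the third customer bought juice before milk, so only two of the three sequences support the pattern milk followed by juice.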

Time Series Analysis Time series are sequences of events. For example, the closing price of a stock is an event that occurs each day of the week. Time series analysis can be used to identify the price trends of a stock or mutual fund. Time series analysis is an extended functionality of temporal data management.

Regression Analysis A regression equation estimates a dependent variable using a set of independent variables and a set of constants. The independent variables as well as the dependent variable are numeric. A regression equation can be written in the form Y = f(x1, x2, …, xn), where Y is the dependent variable. If f is linear in the domain variables xi, the equation is called a linear regression equation.
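
For the simplest case, one independent variable and Y = a + b*x, the constants can be estimated by ordinary least squares. The sketch below uses the standard closed-form solution; the function name `fit_line` and the sample numbers are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for Y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Estimate the dependent variable for a new value of x."""
    return intercept + slope * x


# Hypothetical observations that lie exactly on the line Y = 1 + 2x.
XS = [1.0, 2.0, 3.0, 4.0]
YS = [3.0, 5.0, 7.0, 9.0]
INTERCEPT, SLOPE = fit_line(XS, YS)
```

Because the sample points lie exactly on a line, the fit recovers an intercept of 1 and a slope of 2; with noisy data the same formulas give the line that minimizes the sum of squared errors.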

Neural Networks A neural network is a set of interconnected nodes designed to imitate the functioning of the brain. Node connections have weights which are modified during the learning process. Neural networks can be used for supervised learning and unsupervised clustering. The output of a neural network is quantitative and not easily understood.
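
The idea that connection weights are modified during learning can be shown with the smallest possible network: a single node (a perceptron) trained on the logical AND function. This is an illustrative sketch only; the names, the learning rate, and the number of epochs are assumptions, and practical networks have many nodes and layers.

```python
# Hypothetical training set: inputs (x1, x2) and the target output for logical AND.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def activate(weights, bias, inputs):
    """Fire (output 1) when the weighted sum of the inputs is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=50, rate=1.0):
    """Adjust the weights after every sample in proportion to the error."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - activate(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


WEIGHTS, BIAS = train_perceptron(AND_SAMPLES)
PREDICTIONS = [activate(WEIGHTS, BIAS, inputs) for inputs, _ in AND_SAMPLES]
```

The learned weights are just numbers with no direct meaning to a user, which is a small-scale view of why network output is hard to interpret.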

Genetic Learning Genetic learning is based on the theory of evolution. An initial population of several candidate solutions is provided to the learning model. A fitness function defines which solutions survive from one generation to the next. Crossover, mutation and selection are used to create new population elements.
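
The loop below is a toy sketch of genetic learning on bit strings. Everything specific in it is an assumption for illustration: the function name `evolve`, the population size, the mutation rate, keeping the fitter half as survivors, and the example fitness function (the number of 1 bits).

```python
import random


def evolve(fitness, length, population_size=20, generations=50,
           mutation_rate=0.05, seed=0):
    """Return the fittest bit string found after the given number of generations."""
    rng = random.Random(seed)                      # seeded so runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: the fitter half survives unchanged into the next generation.
        population.sort(key=fitness, reverse=True)
        survivors = population[:population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)         # crossover point
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


# Fitness of a bit string = number of 1 bits, so the ideal solution is all ones.
BEST = evolve(fitness=sum, length=12)
```

Because survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.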

Data Mining Applications Marketing: marketing strategies and consumer behavior. Finance: fraud detection, creditworthiness, and investment analysis. Manufacturing: resource optimization. Health: image analysis, side effects of drugs, and treatment effectiveness.

Overview of Data Warehousing and OLAP Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe

Outline: Purpose of Data Warehousing; Introduction, Definitions, and Terminology; Comparison with Traditional Databases; Characteristics of Data Warehouses; Classification of Data Warehouses; Multi-dimensional Schemas; Building a Data Warehouse; Functionality of a Data Warehouse; Warehouse vs. Data Views; Implementation difficulties and open issues

Purpose of Data Warehousing Traditional databases are not optimized for data access alone; they have to balance the requirement of data access with the need to ensure integrity of data. Most of the time, data warehouse users need only read access, but they need that access to be fast over a large volume of data. Most of the data required for data warehouse analysis comes from multiple databases, and these analyses are recurrent and predictable enough that specific software can be designed to meet the requirements. There is a great need for tools that provide decision makers with information to make decisions quickly and reliably based on historical data. This functionality is achieved by data warehousing and online analytical processing (OLAP).

Introduction, Definitions, and Terminology W. H. Inmon characterized a data warehouse as: “A subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions.”

Introduction, Definitions, and Terminology Data warehouses have the distinguishing characteristic that they are mainly intended for decision support applications, whereas traditional databases are transactional. Applications that a data warehouse supports include: OLAP (Online Analytical Processing), a term used to describe the analysis of complex data from the data warehouse. DSS (Decision Support Systems), also known as EIS (Executive Information Systems), which support an organization’s leading decision makers in making complex and important decisions. Data mining, which is used for knowledge discovery, the process of searching data for unanticipated new knowledge.

Conceptual Structure of Data Warehouse Data warehouse processing involves: cleaning and reformatting of data; OLAP; data mining.

Comparison with Traditional Databases Data warehouses are mainly optimized for appropriate data access, whereas traditional databases are transactional and are optimized for both access mechanisms and integrity assurance measures. Data warehouses place more emphasis on historical data, as their main purpose is to support time-series and trend analysis. Compared with transactional databases, data warehouses are nonvolatile. In transactional databases, the transaction is the mechanism of change to the database. By contrast, information in a data warehouse is relatively coarse grained, and the refresh policy is carefully chosen, usually incremental.

Characteristics of Data Warehouses: multidimensional conceptual view; generic dimensionality; unlimited dimensions and aggregation levels; unrestricted cross-dimensional operations; dynamic sparse matrix handling; client-server architecture; multi-user support; accessibility; transparency; intuitive data manipulation; consistent reporting performance; flexible reporting.

Classification of Data Warehouses Generally, data warehouses are an order of magnitude larger than the source databases. The sheer volume of data is an issue, based on which data warehouses can be classified as follows. Enterprise-wide data warehouses: huge projects requiring massive investment of time and resources. Virtual data warehouses: they provide views of operational databases that are materialized for efficient access. Data marts: generally targeted to a subset of the organization, such as a department, and more tightly focused.

Data Modeling for Data Warehouses Traditional databases generally deal with two-dimensional data (similar to a spreadsheet). However, querying performance in a multi-dimensional data storage model is much more efficient. Data warehouses can take advantage of this feature because they are generally nonvolatile and because the degree of predictability of the analysis that will be performed on them is high.

Data Modeling for Data Warehouses Example of Two-Dimensional vs. Multi-Dimensional

Data Modeling for Data Warehouses Advantages of a multi-dimensional model Multi-dimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display. The data can be directly queried in any combination of dimensions, bypassing complex database queries.

Multi-dimensional Schemas Multi-dimensional schemas are specified using: Dimension table: it consists of tuples of attributes of the dimension. Fact table: each tuple is a recorded fact. This fact contains some measured or observed variable(s) and identifies it with pointers to dimension tables. The fact table contains the data, and the dimensions identify each tuple in the data.

Multi-dimensional Schemas Two common multi-dimensional schemas are Star schema: Consists of a fact table with a single table for each dimension Snowflake Schema: It is a variation of star schema, in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.
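
A star schema can be pictured with plain Python structures: one fact table whose rows point at one table per dimension. The tables, keys, and the helper `total_by` below are hypothetical and only illustrate how a fact is resolved through its dimension pointers.

```python
from collections import defaultdict

# Hypothetical dimension tables, keyed by the identifier the fact table points to.
PRODUCT = {
    1: {"name": "milk", "category": "dairy"},
    2: {"name": "cheese", "category": "dairy"},
    3: {"name": "bread", "category": "bakery"},
}
STORE = {
    10: {"city": "Austin", "region": "South"},
    11: {"city": "Boston", "region": "East"},
}

# Hypothetical fact table: each tuple holds a measure plus pointers to dimensions.
SALES = [
    {"product_id": 1, "store_id": 10, "amount": 4.0},
    {"product_id": 2, "store_id": 10, "amount": 6.0},
    {"product_id": 3, "store_id": 11, "amount": 3.0},
    {"product_id": 1, "store_id": 11, "amount": 5.0},
]


def total_by(dimension, key_column, attribute):
    """Sum the sales amount per value of one dimension attribute."""
    totals = defaultdict(float)
    for fact in SALES:
        member = dimension[fact[key_column]]      # follow the pointer
        totals[member[attribute]] += fact["amount"]
    return dict(totals)


BY_CATEGORY = total_by(PRODUCT, "product_id", "category")
BY_REGION = total_by(STORE, "store_id", "region")
```

In a snowflake variant, an attribute such as `category` would move out of `PRODUCT` into its own table, with `PRODUCT` holding only a pointer to it.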

Multi-dimensional Schemas Fact Constellation A fact constellation is a set of fact tables that share some dimension tables. However, fact constellations limit the possible queries for the warehouse.

Multi-dimensional Schemas Indexing Data warehouses also utilize indexing to support high-performance access. A technique called bitmap indexing constructs a bit vector for each value in the domain being indexed. Bitmap indexing works very well for domains of low cardinality.
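
A bitmap index can be sketched with Python integers used as bit vectors, where bit i is set when row i holds the indexed value. The column data and the names `build_bitmap_index` and `rows_of` are assumptions made for this example.

```python
# Hypothetical low-cardinality columns of a five-row table.
REGION = ["east", "west", "east", "north", "west"]
SEGMENT = ["retail", "retail", "wholesale", "retail", "wholesale"]


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i has it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """Decode a bit vector back into the list of matching row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


REGION_INDEX = build_bitmap_index(REGION)
SEGMENT_INDEX = build_bitmap_index(SEGMENT)

# region = 'east' AND segment = 'retail' becomes a single bitwise AND.
EAST_RETAIL = rows_of(REGION_INDEX["east"] & SEGMENT_INDEX["retail"])
# (region = 'east' OR region = 'west') AND segment = 'retail'
COASTAL_RETAIL = rows_of(
    (REGION_INDEX["east"] | REGION_INDEX["west"]) & SEGMENT_INDEX["retail"]
)
```

Each distinct value costs one bit vector as long as the table, which is why the technique suits columns with few distinct values: the index stays small and predicates combine with cheap bitwise operations.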

Building A Data Warehouse The builders of Data warehouse should take a broad view of the anticipated use of the warehouse. The design should support ad-hoc querying An appropriate schema should be chosen that reflects the anticipated usage.

Building A Data Warehouse The design of a data warehouse involves the following steps: Acquisition of data for the warehouse. Ensuring that data storage meets the query requirements efficiently. Giving full consideration to the environment in which the data warehouse resides.

Building A Data Warehouse Acquisition of data for the warehouse: The data must be extracted from multiple, heterogeneous sources. Data must be formatted for consistency within the warehouse. The data must be cleaned to ensure validity; the cleaning process is difficult to automate. Back flushing: upgrading the data with cleaned data.

Building A Data Warehouse Acquisition of data for the warehouse (contd.) The data must be fitted into the data model of the warehouse. The data must be loaded into the warehouse. Proper design for refresh policy should be considered.

Building A Data Warehouse Storing the data according to the data model of the warehouse: creating and maintaining required data structures; creating and maintaining appropriate access paths; providing for time-variant data as new data are added; supporting the updating of warehouse data; refreshing the data; purging data.

Building A Data Warehouse: usage projections; the fit of the data model; characteristics of available resources; design of the metadata component; modular component design; design for manageability and change; considerations of distributed and parallel architecture; distributed vs. federated warehouses.

Functionality of a Data Warehouse Functionality that can be expected: Roll-up: data is summarized with increasing generalization. Drill-down: increasing levels of detail are revealed. Pivot: cross tabulation is performed. Slice and dice: projection operations are performed on the dimensions. Sorting: data is sorted by ordinal value. Selection: data is available by value or range. Derived attributes: attributes are computed by operations on stored and derived values.
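
Several of these operations reduce to grouping the same facts at different levels. The sketch below shows roll-up, drill-down, selection, and pivot over a tiny in-memory table; the rows, field names, and function names are hypothetical.

```python
from collections import defaultdict

FIELDS = ("year", "quarter", "region", "amount")
# Hypothetical sales facts: (year, quarter, region, amount).
ROWS = [
    (2006, "Q1", "East", 10.0),
    (2006, "Q1", "West", 20.0),
    (2006, "Q2", "East", 30.0),
    (2007, "Q1", "East", 40.0),
    (2007, "Q1", "West", 50.0),
]


def aggregate(rows, group_by):
    """Sum `amount` for every combination of the `group_by` fields."""
    positions = [FIELDS.index(f) for f in group_by]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[3]
    return dict(totals)


def select(rows, field, value):
    """Selection: keep only the rows where `field` equals `value`."""
    i = FIELDS.index(field)
    return [row for row in rows if row[i] == value]


def pivot(rows, row_field, column_field):
    """Cross tabulation: nested dict of totals, row_field by column_field."""
    table = defaultdict(dict)
    for (r, c), total in aggregate(rows, (row_field, column_field)).items():
        table[r][c] = total
    return dict(table)


DRILL_DOWN = aggregate(ROWS, ("year", "quarter"))   # more detail
ROLL_UP = aggregate(ROWS, ("year",))                # more general
EAST_BY_YEAR = aggregate(select(ROWS, "region", "East"), ("year",))
YEAR_BY_REGION = pivot(ROWS, "year", "region")
```

Rolling up from (year, quarter) to year simply drops a grouping field, and drilling down adds it back; a warehouse typically precomputes or indexes such aggregates rather than scanning the facts each time.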

Warehouse vs. Data Views Views and data warehouses are alike in that both hold read-only extracts from databases. However, data warehouses differ from views in the following ways: Data warehouses exist as persistent storage instead of being materialized on demand. Data warehouses are usually not relational but multi-dimensional. Data warehouses can be indexed for optimization. Data warehouses provide specific support of functionality. Data warehouses deal with huge volumes of data that are generally contained in more than one database.

Difficulties of implementing Data Warehouses Lead time is huge in building a data warehouse; potentially it takes years to build and efficiently maintain one. Both quality and consistency of data are major concerns. The usage projections must be revised regularly to meet current requirements. The data warehouse should be designed to accommodate addition and attrition of data sources without major redesign. Administration of a data warehouse requires far broader skills than are needed for a traditional database.

Open Issues in Data Warehousing Data cleaning, indexing, partitioning, and views could be given new attention from the perspective of data warehousing. Automation of: data acquisition; data quality management; selection and construction of access paths and structures; self-maintainability; functionality and performance optimization. Incorporating domain and business rules appropriately and more intelligently into the warehouse creation and maintenance process.

Recap: Data Mining; Data Warehousing; Knowledge Discovery in Databases (KDD); Goals of Data Mining and Knowledge Discovery; Association Rules; Additional Data Mining Algorithms (sequential pattern analysis, time series analysis, regression, neural networks, genetic algorithms)

Recap: Purpose of Data Warehousing; Introduction, Definitions, and Terminology; Comparison with Traditional Databases; Characteristics of Data Warehouses; Classification of Data Warehouses; Multi-dimensional Schemas; Building a Data Warehouse; Functionality of a Data Warehouse; Warehouse vs. Data Views; Implementation difficulties and open issues