Chapter 3 Database Support in Data Mining

Presentation transcript:

Chapter 3 Database Support in Data Mining: Types of database systems and how they relate to data mining.

Contents: Describes data warehousing and related database systems. Discusses features of the data found in data warehouses. Describes how data warehouses are typically implemented and operated. Defines metadata in the context of data warehouses. Shows how different data systems are typically used in data mining. Provides real examples of database systems used in data mining. Discusses the concept of data quality. Reviews the database software market.

Data management: Retail organizations generate masses of data that require very advanced data storage systems. Wal-Mart relied on modern data management to support its supply chain management (SCM). The manipulation of data is a key element in the data mining process. Data mining and other analysis can draw upon data collected in internal systems and from external sources.

Data access: Data warehouses are not required for data mining, but they store massive amounts of data that can be used for it. Data mining analyses also use smaller sets of data, which can be organized in online analytic processing (OLAP) systems or in data marts. OLAP provides access to report generators and graphical support.

Contemporary Database: Gain competitive advantage (customer information systems, data mining). Develop and market new products (micromarketing).

Systems: Database: personal, small business level. On-Line Analytic Processing (OLAP): ability to use many dimensions, reports & graphics. Data Mart: usually temporary analysis. Data Warehouse: usually permanent repository.

Data Warehousing (Price Waterhouse definition): A data warehouse is an orderly and accessible repository of known facts and related data that is used as a basis for making better management decisions. The data warehouse provides a unified repository of consistent data for decision making that is subject oriented, integrated, time variant, and nonvolatile.

Data Warehousing: Data warehouses are used to store massive quantities of data that can be updated and that allow quick retrieval of specific types of data. A data warehouse is not just a technology; it is an architecture and process designed to support decision making, built on special-purpose database systems that improve query performance significantly. Three general data warehouse processes: Warehouse generation is the process of designing the warehouse and loading the data. Data management is the process of storing the data. Information analysis is the process of using the data to support organizational decision making.

Benefits from Data Warehousing: provide business users with views of data appropriate to their mission; consolidate & reconcile data so that it is consistent; give macro views of critical aspects; give timely & detailed access to information; provide specific information to particular groups; make it possible to identify trends.

Data warehousing: Within data warehouses, data is classified and organized around subjects meaningful to the company. The data is gathered from operational systems (barcode readers at cash registers, information from e-commerce, daily reports…), along with industry volumes and economic data. Data from different sources (shipping, marketing, billing) are integrated into a common format.

Data Transformation: consolidate data from multiple sources; filter to eliminate unnecessary details; clean data (eliminate incorrect entries, eliminate duplications); convert & translate data into the proper format; aggregate data as designed.

Data warehousing: A data warehouse is a central aggregation of data, intended as a permanent storage facility holding normalized, formatted data. Normalized implies the use of small, stable data structures within the database. Normalized data groups data elements by category, making it possible to apply relational principles in data updating.

Key Concepts: Scalability: the ability to accurately cope with changing conditions (especially magnitude of computing). Granularity: the level of detail. A data warehouse tends to have fine granularity; OLAP tends to aggregate to coarse granularity.

Data Warehousing: OLAP vs. On-Line Transactional Processing. OLAP: summary data; few users; data driven; effectiveness; accessed through spreadsheets. On-Line Transactional Processing: detailed operational data; many concurrent users; transaction driven; efficiency.

Data Marts: Intermediate-level database systems. Originally, many data marts were marketed as preliminary data warehouses. Currently, many data marts are used in conjunction with data warehouses rather than as competitive products. Data marts are usually used as repositories of data gathered to serve a particular set of users, providing data extracted from data warehouses and/or other sources. They are often used as temporary storage: gather data for study from the data warehouse and other sources (including external ones), then clean & transform it for data mining.

OLAP: A multidimensional spreadsheet approach to shared data storage, designed to allow users to extract data and generate reports on the dimensions important to them. Data is segregated into different dimensions and organized in a hierarchical manner. Hypercube: a term reflecting the ability to sort on many dimensions. OLAP comes in many forms: MOLAP (multidimensional), ROLAP (relational, uses SQL), DOLAP (desktop), WOLAP (web enabled), HOLAP (hybrid).
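The idea of dimensions, hierarchies, and a hypercube can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dimension names (region, product, month), the revenue figures, and the helper functions `slice_cube`, `roll_up`, and `roll_up_time` are all hypothetical and do not come from any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube cells: (region, product, month) -> revenue.
DIMENSIONS = ("region", "product", "month")
cube = {
    ("East", "D428", "2024-01"): 100.0,
    ("East", "D428", "2024-02"): 150.0,
    ("East", "X100", "2024-01"): 80.0,
    ("West", "D428", "2024-01"): 60.0,
}

def month_to_quarter(month):
    """Map 'YYYY-MM' to 'YYYY-Qn', one level up the time hierarchy."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def slice_cube(cells, dimension, value):
    """Keep only the cells whose coordinate on one dimension equals value."""
    idx = DIMENSIONS.index(dimension)
    return {key: v for key, v in cells.items() if key[idx] == value}

def roll_up(cells, dimensions):
    """Sum the measure over every dimension not listed in dimensions."""
    idxs = [DIMENSIONS.index(d) for d in dimensions]
    totals = defaultdict(float)
    for key, value in cells.items():
        totals[tuple(key[i] for i in idxs)] += value
    return dict(totals)

def roll_up_time(cells):
    """Re-key the month dimension by quarter, summing months together."""
    totals = defaultdict(float)
    for (region, product, month), value in cells.items():
        totals[(region, product, month_to_quarter(month))] += value
    return dict(totals)

report_by_region = roll_up(cube, ["region"])
east_by_product = roll_up(slice_cube(cube, "region", "East"), ["product"])
quarterly = roll_up_time(cube)
```

Each report is just a different choice of which dimensions to keep, which is why the same stored cells can serve users who care about different dimensions.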

OLAP: One function of OLAP is standard report generation, including financial performance analysis on selected dimensions (such as by department, geographical region, product, salesperson, time…). Another is supporting planning and forecasting projects using spreadsheet analytic tools. An OLAP product includes a data warehouse, an OLAP server, and a client server on a local area network (LAN). OLAP functions: see page 37.

Relationships of database and DM: Data warehouses are not required for data mining, nor are OLAP systems. However, the existence of either presents many opportunities for data mining.

Data Warehouse Implementation: Data warehouses create the opportunity to provide much better information than what was available in the past. A DW can produce consistent views of events and reports. A DW provides a reliable, comprehensive source of clean data that is accurate, complete, and in the correct format. Processes: system development; data acquisition; data extraction for use.

Data Warehouse Implementation: The implementation processes involve a degree of continuity, since data warehousing is a dynamic environment. A suite of software tools is needed to extract data from sources, move it to the data warehouse itself, and provide user access to this information. Data acquisition is supported by data warehouse generation.

Data Warehouse Generation: extract data from sources; transform; clean; load into the data warehouse. This accounts for 60-80% of the effort in operating a data warehouse.

Data Extraction Routines: Extraction programs are executed periodically to obtain records and copy the information to an intermediate file. Data extraction routines: interpret data formats; identify changed records; copy information to an intermediate file.
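The three extraction steps can be sketched as follows. This is a minimal illustration, not a description of any particular extraction tool: it assumes the source delivers CSV text with an `id` column, detects changes by comparing a hash of each record against the hash stored on the previous run, and uses an in-memory string in place of a real intermediate file. All function names are hypothetical.

```python
import csv
import hashlib
import io

def fingerprint(record):
    """Stable hash of a record's field values, used to detect changes."""
    joined = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def parse_source(text):
    """Interpret the source format (assumed here: CSV with a header row)."""
    return list(csv.DictReader(io.StringIO(text)))

def extract_changes(records, previous, key="id"):
    """Return records that are new or changed since the last run.

    previous maps key -> fingerprint from the last run; it is updated in place.
    """
    changed = []
    for record in records:
        digest = fingerprint(record)
        if previous.get(record[key]) != digest:
            changed.append(record)
            previous[record[key]] = digest
    return changed

def write_intermediate(records, fieldnames):
    """Copy the changed records to an intermediate CSV (returned as text)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

seen = {}
run1 = extract_changes(parse_source("id,qty\n1,10\n2,20\n"), seen)
run2 = extract_changes(parse_source("id,qty\n1,10\n2,25\n3,5\n"), seen)
intermediate = write_intermediate(run2, ["id", "qty"])
```

On the second run only record 2 (changed) and record 3 (new) are copied, so the periodic job does not re-send unchanged data.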

Data Transformation: Transformation programs accomplish final data preparation, including: the consolidation of data from multiple sources; filtering data to eliminate unnecessary details; cleaning data to eliminate incorrect entries or duplications; converting and translating data into the format established for the data warehouse; the aggregation of data.
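A small sketch of these transformation steps, assuming two hypothetical source extracts (`billing` and `web`) that use different date formats. The cleaning rules here (drop non-positive amounts, treat identical customer/date/amount rows as duplicates) are illustrative assumptions; real rules depend on the warehouse design.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical extracts from two source systems with different formats.
billing = [
    {"cust": "C1", "date": "01/15/2024", "amount": "120.50", "clerk": "jb"},
    {"cust": "C1", "date": "01/15/2024", "amount": "120.50", "clerk": "jb"},  # duplicate
    {"cust": "C2", "date": "01/16/2024", "amount": "-5", "clerk": "kt"},      # incorrect entry
]
web = [
    {"cust": "C2", "date": "2024-01-17", "amount": "40.00", "clerk": ""},
]

def convert(row, date_format):
    """Convert one row to the warehouse format, dropping unneeded fields."""
    return {
        "cust": row["cust"],
        "date": datetime.strptime(row["date"], date_format).date().isoformat(),
        "amount": float(row["amount"]),
    }

def transform(sources):
    """Consolidate, filter, clean, convert, and aggregate the source rows."""
    seen, cleaned = set(), []
    for rows, date_format in sources:          # consolidate multiple sources
        for row in rows:
            record = convert(row, date_format)  # filter fields and convert
            key = (record["cust"], record["date"], record["amount"])
            if record["amount"] <= 0 or key in seen:
                continue                        # clean: bad entry or duplicate
            seen.add(key)
            cleaned.append(record)
    totals = defaultdict(float)                 # aggregate by customer
    for record in cleaned:
        totals[record["cust"]] += record["amount"]
    return cleaned, dict(totals)

cleaned, totals_by_customer = transform([(billing, "%m/%d/%Y"), (web, "%Y-%m-%d")])
```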

Data management involves: retrieving information from the data warehouse; running extraction programs to generate repetitive reports and serve specific needs. Implementation problems: required data not available; initial data warehouse scope too broad; not enough time for prototyping or needs analysis; insufficient senior direction.

Meta Data: Data warehouse management vs. data management. Data management concerns the management of all of the enterprise’s data. Data warehouse management refers to the design and operation of the data warehouse through all phases of its life cycle: manage metadata; design the data warehouse; ensure data quality; manage the system during operations.

Meta Data: Metadata is the set of reference data used to keep track of data, and it describes the organization of the warehouse. A data catalog provides users with the ability to see specifically what the data warehouse contains. The content of the data warehouse is defined by metadata, which provides business views of data (information access tools) and technical views (warehouse generation tools).

Business Metadata: what data are available; source of each data element; frequency of data updates; location of specific data; predefined reports & queries; methods of data access.

Technical Meta Data: data source (internal or external); data preparation features (transformation & aggregation rules); logical structure of data; physical structure & content; data ownership; security aspects (access rights, restrictions); system information (date of last update, retention policy, data usage).
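One way to picture business and technical metadata side by side is as a catalog entry per data element. The sketch below is an assumption made for illustration: the class names `ElementMetadata` and `DataCatalog`, the field names, and the sample entry are hypothetical, and only a few of the metadata items listed above are modeled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ElementMetadata:
    """Catalog entry for one data element (all field names are illustrative)."""
    name: str
    # Business view
    description: str
    source: str
    update_frequency: str
    # Technical view
    source_type: str               # "internal" or "external"
    transformation_rule: str
    owner: str
    allowed_roles: frozenset = frozenset()
    last_update: Optional[date] = None

class DataCatalog:
    """Lets users see what the warehouse contains, filtered by access rights."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def visible_to(self, role):
        """Names of the data elements a given role may access."""
        return sorted(n for n, e in self._entries.items() if role in e.allowed_roles)

    def describe(self, name):
        """Business-level description of one data element."""
        e = self._entries[name]
        return f"{e.name}: {e.description} (source: {e.source}, updated {e.update_frequency})"

catalog = DataCatalog()
catalog.register(ElementMetadata(
    name="weekly_sales",
    description="Units sold by item, store, and week",
    source="point-of-sale system",
    update_frequency="weekly",
    source_type="internal",
    transformation_rule="sum of daily units",
    owner="merchandising",
    allowed_roles=frozenset({"buyer", "forecaster"}),
    last_update=date(2024, 1, 7),
))
```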

Wal-Mart’s Data Warehouse: Heavy user of IT. Core competency: supply chain distribution. 2900 outlets. Data warehouse of 101 terabytes ($4 billion). 65 million transactions per week. Subject-oriented, integrated, time-variant, nonvolatile data. 65 weeks of data by item, store, day.

Wal-Mart uses its data warehouse to support decision making by buyers, merchandisers, logistics staff, and forecasters. 3,500 vendor partners can query it. It can handle 35 thousand queries per week. Benefit: $12,000 per query. Some users run about 1 thousand queries per day.

Summers Rubber Company: Distribution firm; 7 operating locations; 10,000 items; 3,000 customers. Old system: OLAP; databases transactional & summarized, distributed.

Summers Data Storage System: Built in-house on PCs with an Access database, Visual Basic & Excel. Distributed system. The data warehouse server controlled queries and managed resources. Security: passwords gave some protection; to protect against departing employees, data marts holding small versions of the central database were used.

Summers – Negative features: too much disk space used on users' local drives; often difficult to understand & use; updating multiple data sites was slow, with limited access; summary data often wrong; couldn’t use data mining tools. The problem was that the data was stored in aggregated form.

Comparison
Product     Use                 Duration     Granularity
Warehouse   Repository          Permanent    Finest
Mart        Specific study      Temporary    Aggregate
OLAP        Report & analysis   Repetitive   Summary

Examples of Data Uses Customer information systems Fingerhut

Customer Information Systems: Massive databases. Detailed information about individuals and households. Use automated analysis to identify focused market targets.

Micromarketing: Target small groups of highly responsive customers. Own niches like smaller competitors. Examples: Great Atlantic & Pacific Tea Company (A&P): target customers, centralize buying. Fingerhut: sell on credit to households with <$25,000 income.

System demonstrations: A dealer wholesaler. A small portion of the data, covering the first 10 shipments, is shown in Table 3.1. Data warehouses are normalized into relational form. The data is organized into a series of tables connected by keys.
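As a rough illustration of tables connected by keys, the sketch below builds three small normalized tables in an in-memory SQLite database and joins them to compute revenue per customer. The schema, customer names, and prices are hypothetical and are not the contents of Table 3.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each table holds one category of data; shipment refers to the others by key.
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product  (product_id TEXT PRIMARY KEY, unit_price REAL NOT NULL);
CREATE TABLE shipment (
    shipment_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    product_id  TEXT NOT NULL REFERENCES product(product_id),
    cases       INTEGER NOT NULL
);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Bolt")])
conn.executemany("INSERT INTO product VALUES (?, ?)", [("D428", 10.0), ("X100", 4.0)])
conn.executemany(
    "INSERT INTO shipment VALUES (?, ?, ?, ?)",
    [(1, 1, "D428", 5), (2, 2, "D428", 2), (3, 1, "X100", 10)],
)

# Follow the keys to combine the tables, then aggregate revenue per customer.
revenue_by_customer = conn.execute("""
    SELECT c.name, SUM(s.cases * p.unit_price) AS revenue
    FROM shipment AS s
    JOIN customer AS c ON c.customer_id = s.customer_id
    JOIN product  AS p ON p.product_id  = s.product_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

Because the price lives only in the product table, a price correction is made in one place, which is the updating benefit that normalization is meant to provide.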

Data mart: Examining the characteristics of customers who buy the products (advertising by mail, internet, …). Data marts could extract the data and aggregate it in a form useful for data mining. Table 3.2 shows entries that might be found in a data mart (for product D428 over a two-year interval).

OLAP: An OLAP application focuses more on analyzing trends or other aspects of organizational operations. It may obtain much of its information from the data warehouse, but extracts granular information. This information could be accessed to make a report by product category (Table 3.3).

OLAP: Evaluating the value of each client to the firm. Data can be aggregated within a data mart or on an OLAP system.

OLAP: Organizing volume according to the shipper. Table 3.5 displays the results of cases for each shipper.

Data Quality: Data warehouse projects can fail. One of the most common reasons is the refusal of users to accept the validity of data obtained from a data warehouse. Causes include: corrupted or missing data in the original sources; failure of the software transferring data into or out of the data warehouse; failure of the data-cleansing process to resolve data inconsistencies. The responsible staff must verify the integrity of data throughout the data loading and storing process. Data integrity: do not allow any meaningless, corrupt, or redundant data into the data warehouse. Controls can be implemented prior to loading data, in the data migration, cleansing, transforming, and loading processes.

Data Quality: An example of multiple variations is illustrated in Table 3.6. What are the variations? Variations of the same customer; misspellings; correct spellings but with a more complete definition.

Data Quality: Matching involves associating variables. Software used to introduce new data into the data warehouse needs to check that the appropriate spelling and entry values are used. Also, matching companies with addresses… and some maintenance. Software tools to ensure data quality include: the analysis of data for type; the construction of standardization schemes; the identification of redundant data; the adjustment of matching criteria to achieve selected levels of discrimination; the transformation of data into the designed format.
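A standardization scheme combined with an adjustable matching criterion might look like the sketch below. It is an illustration only, built on Python's standard `difflib` module: the customer names, the suffix list, and the 0.85 threshold are assumptions, and commercial data-quality tools use more elaborate rules.

```python
import difflib
import re

# Hypothetical standardization scheme: suffixes ignored when comparing names.
SUFFIXES = {"inc", "co", "corp", "company", "ltd"}

def normalize(name):
    """Lowercase, strip punctuation, and drop common company suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)

def match_customer(name, known, threshold=0.85):
    """Return the best-matching canonical name, or None below the threshold.

    Raising the threshold makes matching stricter (fewer false merges);
    lowering it catches more misspellings at the risk of wrong matches.
    """
    target = normalize(name)
    best, best_score = None, 0.0
    for candidate in known:
        score = difflib.SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None

known_customers = ["Acme Supply Company", "Bolt Hardware"]
misspelled = match_customer("Acme Suply Co.", known_customers)
variant = match_customer("Bolt Hardwear", known_customers)
unknown = match_customer("Zenith Tools", known_customers)
```

An incoming record that matches is loaded under the canonical name; one that returns `None` would be held for review rather than creating a possible duplicate customer.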

Software products