1
Chapter 3 Database Support in Data Mining
Types of database systems and how they relate to data mining
2
Contents
Describes data warehousing and related database systems
Discusses features of data found in data warehouses
Describes how data warehouses are typically implemented and operated
Defines metadata in the context of data warehouses
Shows how different data systems are typically used in data mining
Provides real examples of database systems used in data mining
Discusses the concept of data quality
Reviews the database software market
3
Data management
Retail organizations generate masses of data that require very advanced data storage systems. Wal-Mart relied on modern data management to engage in supply chain management (SCM). The manipulation of data is a key element in the data mining process. Data mining and other analyses can draw upon data collected in internal systems and from external sources.
4
Data access
Data warehouses are not required for data mining, but they store massive amounts of data that can be used for data mining. Data mining analyses also use smaller sets of data that can be organized in online analytic processing (OLAP) systems or in data marts. OLAP provides access to report generators and graphical support.
5
Contemporary Database
Gain competitive advantage: customer information systems, data mining
Develop and market new products: micromarketing
6
Systems
Database: personal, small business level
On-Line Analytic Processing (OLAP): ability to use many dimensions, reports & graphics
Data Mart: usually temporary analysis
Data Warehouse: usually permanent repository
7
Data Warehousing
Price Waterhouse definition: A data warehouse is an orderly and accessible repository of known facts and related data that is used as a basis for making better management decisions. The data warehouse provides a unified repository of consistent data for decision making that is subject oriented, integrated, time variant, and nonvolatile.
8
Data Warehousing
Data warehouses are used to store massive quantities of data that can be updated, and they allow quick retrieval of specific types of data. A data warehouse is not just a technology; it is an architecture and process designed to support decision making, built on special-purpose database systems that improve query performance significantly.
Three general data warehouse processes:
Warehouse generation is the process of designing the warehouse and loading the data.
Data management is the process of storing the data.
Information analysis is the process of using the data to support organizational decision making.
9
Benefits from Data Warehousing
Provide business users views of data appropriate to mission
Consolidate & reconcile (consistent) data
Give macro views of critical aspects
Timely & detailed access to information
Provide specific information to particular groups
Ability to identify trends
10
Data warehousing
Within data warehouses, data is classified and organized around subjects meaningful to the company. The data is gathered from operational systems (barcode readers at cash registers, information from e-commerce, daily reports…), along with industry volumes and economic data. Data from different sources (shipping, marketing, billing) are integrated into a common format.
11
Data Transformation
Consolidate data from multiple sources
Filter to eliminate unnecessary details
Clean data: eliminate incorrect entries, eliminate duplications
Convert & translate data into proper format
Aggregate data as designed
12
Data warehousing
A data warehouse is a central aggregation of data, intended as a permanent storage facility with normalized, formatted data. Normalized implies the use of small, stable data structures within the database. Normalized data groups data elements by category, making it possible to apply relational principles in data updating.
13
Key Concepts
Scalability: ability to accurately cope with changing conditions (especially magnitude of computing)
Granularity: level of detail
Data warehouse – tends to be fine granularity
OLAP – tends to aggregate to coarse granularity
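The granularity distinction can be illustrated with a minimal sketch. The rows, field names, and values below are invented for illustration; the point is only that fine-grained warehouse records can be rolled up to the coarser level an OLAP report would show.

```python
from collections import defaultdict

# Hypothetical fine-granularity rows, as a data warehouse might hold them:
# one record per (store, product, day) sale. All values are invented.
transactions = [
    {"store": "S1", "product": "A", "day": "2024-01-01", "amount": 10.0},
    {"store": "S1", "product": "A", "day": "2024-01-02", "amount": 15.0},
    {"store": "S1", "product": "B", "day": "2024-01-01", "amount": 7.5},
    {"store": "S2", "product": "A", "day": "2024-01-01", "amount": 20.0},
]

def roll_up(rows, keys):
    """Aggregate amounts to a coarser granularity defined by `keys`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["amount"]
    return dict(totals)

# Coarse granularity: one total per store.
by_store = roll_up(transactions, ["store"])
# Intermediate granularity: one total per store and product.
by_store_product = roll_up(transactions, ["store", "product"])
```

Rolling up loses detail: once only `by_store` is kept, the daily and per-product figures cannot be recovered, which is one reason the finest level is usually kept in the warehouse itself.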
14
Data Warehousing
OLAP | On-Line Transactional Processing
summary data | detailed operational data
few users | many concurrent users
data driven | transaction driven
effectiveness | efficiency
use spreadsheets to access |
15
Data Marts
Intermediate-level database system
Originally, many data marts were marketed as preliminary data warehouses. Currently, many data marts are used in conjunction with data warehouses rather than as competitive products. Data marts are usually used as repositories of data gathered to serve a particular set of users, providing data extracted from data warehouses and/or other sources.
Often used as temporary storage
Gather data for study from data warehouse, other sources (including external)
Clean & transform for data mining
16
OLAP
A multidimensional spreadsheet approach to shared data storage, designed to allow users to extract data and generate reports on the dimensions important to them. Data is segregated into different dimensions and organized in a hierarchical manner.
Hypercube: term reflecting the ability to sort on many dimensions
Many forms:
MOLAP – multidimensional
ROLAP – relational (uses SQL)
DOLAP – desktop
WOLAP – web enabled
HOLAP – hybrid
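As a rough illustration of the hypercube idea, the sketch below stores one measure per combination of three dimensions and cross-tabulates it on any two of them, optionally slicing on the third. The dimension names, cell values, and the `pivot` helper are assumptions made for this example, not part of any OLAP product.

```python
from collections import defaultdict

DIMENSIONS = ("region", "product", "quarter")  # hypothetical dimensions

# Hypothetical cube cells: one measure (sales) per combination of dimensions.
cube = {
    ("East", "A", "Q1"): 100,
    ("East", "A", "Q2"): 120,
    ("East", "B", "Q1"): 80,
    ("West", "A", "Q1"): 90,
    ("West", "B", "Q2"): 60,
}

def pivot(cells, row_dim, col_dim, **fixed):
    """Cross-tabulate the measure on two dimensions, optionally slicing on others."""
    r, c = DIMENSIONS.index(row_dim), DIMENSIONS.index(col_dim)
    table = defaultdict(lambda: defaultdict(int))
    for key, value in cells.items():
        # Skip cells that do not match the requested slice, e.g. quarter="Q1".
        if any(key[DIMENSIONS.index(d)] != v for d, v in fixed.items()):
            continue
        table[key[r]][key[c]] += value
    return {row: dict(cols) for row, cols in table.items()}

# Report on two dimensions, summing over all quarters.
region_by_product = pivot(cube, "region", "product")
# The same report restricted to a single quarter (a "slice" of the cube).
q1_only = pivot(cube, "region", "product", quarter="Q1")
```

The same `cube` can be reported by region and quarter, or product and quarter, simply by changing the arguments, which is the sense in which users pick "the dimensions important to them".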
17
OLAP
One function of OLAP is standard report generation, including financial performance analysis on selected dimensions (such as by department, geographical region, product, salesperson, time…). OLAP also supports planning and forecasting projects using spreadsheet analytic tools. An OLAP product includes a data warehouse, an OLAP server, and a client server on a local area network (LAN). OLAP functions: see page 37.
18
Relationships of database and DM
Data warehouses are not required for data mining, nor are OLAP systems. However, the existence of either presents many opportunities for data mining.
19
Data Warehouse Implementation
Data warehouses create the opportunity to provide much better information than what was available in the past. A data warehouse (DW) can produce consistent views of events and reports.
DW provides:
Reliable, comprehensive source of clean data
Accurate, complete, in correct format
Processes:
System development
Data acquisition
Data extraction for use
20
Data Warehouse Implementation
The implementation processes involve a degree of continuity, since data warehousing is a dynamic environment. A suite of software tools is needed to extract data from sources, move it to the data warehouse itself, and provide user access to this information. Data acquisition is supported by data warehouse generation.
21
Data Warehouse Generation
Extract data from sources
Transform
Clean
Load into data warehouse
60-80% of effort in operating data warehouse
22
Data Extraction Routines
Extraction programs are executed periodically to obtain records and copy the information to an intermediate file.
Data extraction routines:
Interpret data formats
Identify changed records
Copy information to intermediate file
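The extraction steps listed above can be sketched as follows. The slides do not say how changed records are identified; comparing a content hash per record between runs is one common approach and is an assumption here, as are the record layout and the function names. The intermediate file is kept in memory so the sketch stays self-contained.

```python
import csv
import hashlib
import io

def row_fingerprint(row):
    """Hash a record's contents so changes can be detected between runs."""
    joined = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def extract_changed(source_rows, previous_fingerprints, key="id"):
    """Return rows that are new or changed since the previous extraction run."""
    changed, current = [], {}
    for row in source_rows:
        fp = row_fingerprint(row)
        current[row[key]] = fp
        if previous_fingerprints.get(row[key]) != fp:
            changed.append(row)
    return changed, current

def write_intermediate(rows, fieldnames):
    """Copy extracted rows to an intermediate CSV 'file' (in memory here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical operational records for two consecutive extraction runs.
run1 = [{"id": "1", "qty": "5"}, {"id": "2", "qty": "3"}]
run2 = [{"id": "1", "qty": "5"}, {"id": "2", "qty": "4"}, {"id": "3", "qty": "1"}]

first_batch, seen = extract_changed(run1, {})
second_batch, seen = extract_changed(run2, seen)
intermediate = write_intermediate(second_batch, ["id", "qty"])
```

On the second run only record 2 (changed quantity) and record 3 (new) are copied to the intermediate file; record 1 is skipped because its fingerprint is unchanged.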
23
Data Transformation
Transformation programs accomplish final data preparation, including:
The consolidation of data from multiple sources
Filtering data to eliminate unnecessary details
Cleaning data to eliminate incorrect entries or duplications
Converting and translating data into the format established for the data warehouse
The aggregation of data
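A minimal sketch of these transformation steps, under stated assumptions: the two source extracts, their field names, the two date formats, the rule that a non-positive amount is an incorrect entry, and the use of the order number to detect duplicates are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical extracts from two source systems with different formats.
billing = [
    {"order": "1001", "customer": "ACME", "date": "03/01/2024", "amount": "250.00", "clerk": "jb"},
    {"order": "1002", "customer": "ACME", "date": "03/02/2024", "amount": "-1", "clerk": "jb"},
]
shipping = [
    {"order": "1001", "customer": "ACME", "date": "2024-03-01", "amount": "250.00"},
    {"order": "1003", "customer": "Bolt Co", "date": "2024-03-05", "amount": "99.50"},
]

def to_iso(date_text):
    """Convert either assumed source date format to ISO (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(date_text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {date_text!r}")

def transform(*sources):
    seen, clean = set(), []
    for row in (r for source in sources for r in source):  # consolidate sources
        record = {                                          # filter: 'clerk' is dropped
            "order": row["order"],
            "customer": row["customer"],
            "date": to_iso(row["date"]),                    # convert to common format
            "amount": float(row["amount"]),
        }
        if record["amount"] <= 0:                           # clean: incorrect entry
            continue
        if record["order"] in seen:                         # clean: duplicate order
            continue
        seen.add(record["order"])
        clean.append(record)
    totals = defaultdict(float)                             # aggregate by customer
    for record in clean:
        totals[record["customer"]] += record["amount"]
    return clean, dict(totals)

clean_rows, revenue_by_customer = transform(billing, shipping)
```

In a real warehouse the cleaning rules would come from the technical metadata rather than being hard-coded as they are here.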
24
Data management involves:
Retrieving information from the data warehouse
Running extraction programs to generate repetitive reports and serve specific needs
Implementation problems:
Required data not available
Initial data warehouse scope too broad
Not enough time to do prototyping or needs analysis
Insufficient senior direction
25
Meta Data
Data warehouse management vs. data management:
Data management concerns the management of all of the enterprise’s data.
Data warehouse management refers to the design and operation of the data warehouse through all phases of its life cycle:
Manage meta data
Design data warehouse
Ensure data quality
Manage system during operations
26
Meta Data
Metadata is the set of reference data used to keep track of data; it describes the organization of the warehouse. A data catalog provides users with the ability to see specifically what the data warehouse contains. The content of the data warehouse is defined by metadata, which provides business views of data (information access tools) and technical views (warehouse generation tools).
27
Business Metadata
What data are available
Source of each data element
Frequency of data updates
Location of specific data
Predefined reports & queries
Methods of data access
28
Technical Meta Data
Data source (internal or external)
Data preparation features (transformation & aggregation rules)
Logical structure of data
Physical structure & content
Data ownership
Security aspects (access rights, restrictions)
System information (date of last update, retention policy, data usage)
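One way to picture business and technical metadata together is as a catalog entry per data element. The sketch below is illustrative only: the class names, field names, and the example entry are assumptions, chosen to mirror a few of the items listed on the two metadata slides.

```python
from dataclasses import dataclass, field

@dataclass
class ElementMetadata:
    """Catalog entry for one data element; field names are illustrative."""
    name: str
    # Business metadata
    source: str
    update_frequency: str
    location: str
    # Technical metadata
    transformation_rule: str
    owner: str
    allowed_roles: set = field(default_factory=set)
    last_update: str = ""

class DataCatalog:
    """Minimal in-memory data catalog keyed by element name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def business_view(self, name):
        """Return only the fields a business user would typically look up."""
        e = self._entries[name]
        return {
            "source": e.source,
            "update_frequency": e.update_frequency,
            "location": e.location,
        }

    def can_access(self, name, role):
        """Check the security aspect of the technical metadata."""
        return role in self._entries[name].allowed_roles

catalog = DataCatalog()
catalog.register(ElementMetadata(
    name="weekly_sales",
    source="point-of-sale system (internal)",
    update_frequency="weekly",
    location="sales_fact table",
    transformation_rule="sum of item sales by store and week",
    owner="merchandising",
    allowed_roles={"buyer", "forecaster"},
    last_update="2024-03-04",
))
```

A user browsing the catalog would call `catalog.business_view("weekly_sales")`, while warehouse generation tools would read the transformation rule and ownership fields.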
29
Wal-Mart’s Data Warehouse
Heavy user of IT
Core competency – supply chain distribution
2900 outlets
Data warehouse of 101 terabytes ($4 billion)
65 million transactions per week
Subject-oriented, integrated, time-variant, nonvolatile data
65 weeks of data by item, store, day
30
Wal-Mart
Uses data warehouse to:
Support decision making: buyers, merchandisers, logistics, forecasters
3,500 vendor partners can query
Can handle 35 thousand queries per week
Benefit $12,000 per query
Some users about 1 thousand queries per day
31
Summers Rubber Company
Distribution firm
7 operating locations
10,000 items
3,000 customers
Old system: OLAP
Databases transactional & summarized, distributed
32
Summers Data Storage System
Built in-house, PCs, Access database
Visual Basic & Excel
Distributed system
Data warehouse server controlled queries, managed resources
Security:
Passwords gave some protection
To protect against departing employees, used data marts with small versions of central database
33
Summers – Negative features
Too much disk space on user local drives
Often difficult to understand & use
Updating multiple data sites slow, limited access
Summary data often wrong
Couldn’t use data mining tools
Problem was aggregated data stored
34
Comparison
Product | Use | Duration | Granularity
Warehouse | Repository | Permanent | Finest
Mart | Specific study | Temporary | Aggregate
OLAP | Report & analysis | Repetitive | Summary
35
Examples of Data Uses
Customer information systems
Fingerhut
36
Customer Information Systems
Massive databases
Detailed information about individuals and households
Use automated analysis to identify focused market targets
37
Micromarketing
Target small groups of highly responsive customers
Own niches like smaller competitors
EXAMPLES:
Great Atlantic & Pacific Tea Company (A&P): target customers, centralize buying
Fingerhut: sell on credit to households with <$25,000 income
38
System demonstrations
Example: a dealer wholesaler. Table 3.1 shows a small portion of the data, covering the first 10 shipments. Data warehouses are normalized into relational form. The data is organized into a series of tables connected by keys.
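To show what "a series of tables connected by keys" can look like, here is a small sketch using Python's built-in `sqlite3` module. Table 3.1 is not reproduced in this transcript, so the table layout, column names, and all rows below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical normalized schema: each table holds one kind of fact,
# and shipment rows refer to customers and products by key.
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id TEXT PRIMARY KEY, unit_price REAL);
CREATE TABLE shipment (
    shipment_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id  TEXT REFERENCES product(product_id),
    cases       INTEGER
);
""")

conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Customer One"), (2, "Customer Two")])
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [("P1", 10.0), ("P2", 4.0)])
conn.executemany("INSERT INTO shipment VALUES (?, ?, ?, ?)",
                 [(1, 1, "P1", 3), (2, 1, "P2", 5), (3, 2, "P1", 2)])

# Joining on the keys reassembles the facts, here as revenue per customer.
revenue = conn.execute("""
    SELECT c.name, SUM(s.cases * p.unit_price)
    FROM shipment s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.product_id  = s.product_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

Because the unit price lives only in the `product` table, a price correction is made in one place, which is the updating benefit that normalization is meant to provide.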
39
Data mart
Examining the characteristics of customers who buy the products (advertising by mail, internet, …). Data marts could extract the data and aggregate it in a form useful for data mining. Table 3.2 shows entries that might be found in a data mart (for product D428 over a two-year interval).
40
OLAP
An OLAP application focuses more on analyzing trends or other aspects of organizational operations. It may obtain much of its information from the data warehouse, but extracts granular information. This information could be accessed to make a report by product category (Table 3.3).
41
OLAP Evaluating the value of each client to the firm.
Data can be aggregated within a data mart or on an OLAP system.
42
OLAP Organizing volume according to the shipper.
Table 3.5 displays the cases handled by each shipper.
43
Data Quality
Data warehouse projects can fail. One of the most common reasons is the refusal of users to accept the validity of data obtained from a data warehouse. Causes include:
The corruption of data, or missing data, from the original sources
Failure of the software transferring data into or out of the data warehouse
Failure of the data-cleansing process to resolve data inconsistencies
The responsible staff must verify the integrity of data by checking the data loading and storing processes.
Data integrity: do not allow any meaningless, corrupt, or redundant data into the data warehouse. Controls can be implemented prior to loading data, in the data migration, cleansing, transforming, and loading processes.
44
Data Quality
An example of multiple variations is illustrated in Table 3.6. What are the variations?
Variations of the same customer
Misspellings
Correct spelling but with a more complete definition
45
Data Quality
Matching involves associating variables.
Software used to introduce new data into the data warehouse needs to check that the appropriate spelling and entry values are used. It also needs to match companies with addresses… and perform some maintenance.
Software tools to ensure data quality include:
The analysis of data for type
The construction of standardization schemes
The identification of redundant data
The adjustment of matching criteria to achieve selected levels of discrimination
The transformation of data into designed format
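A small sketch of name matching with an adjustable criterion, using Python's standard `difflib` module. The standard names, the abbreviation table, and the default threshold of 0.85 are assumptions for illustration; Table 3.6 is not reproduced here, and commercial data quality tools use more elaborate schemes.

```python
import difflib
import re

# Hypothetical standard customer names already held in the warehouse.
STANDARD_NAMES = ["Acme Supply Company", "Bolt Industries"]

# A tiny, assumed standardization scheme for common abbreviations.
ABBREVIATIONS = {"co": "company", "inc": "incorporated", "ind": "industries"}

def normalize(name):
    """Lowercase, strip punctuation, and expand a few assumed abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def match_customer(raw_name, threshold=0.85):
    """Return the standard name closest to raw_name, or None if below threshold."""
    target = normalize(raw_name)
    best_name, best_score = None, 0.0
    for candidate in STANDARD_NAMES:
        score = difflib.SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None
```

For example, `match_customer("ACME Supply Co.")` and the misspelled `match_customer("Acme Suply Company")` both resolve to the same standard entry, while an unrelated name returns `None`. Raising or lowering `threshold` is the kind of adjustment of matching criteria the slide refers to: a higher value rejects more near-misses, a lower one merges more variants.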
46
Software products