Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process.
- [Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Steps of a KDD Process
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: finding useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
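
The steps above can be sketched as a tiny pipeline. This is only an illustrative sketch, not a method from the slides: the transactions, the function names, and the 0.5 support threshold are all assumptions, and pair counting stands in for a real mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw transactions; None, mixed case, and stray whitespace
# stand in for dirty source data.
raw = [
    ["bread", "milk ", None],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
    ["Milk", "bread"],
    [],
]

def clean(transactions):
    """Data cleaning: drop missing items, normalize text, drop empty rows."""
    cleaned = []
    for t in transactions:
        items = {item.strip().lower() for item in t if item}
        if items:
            cleaned.append(items)
    return cleaned

def select(transactions, relevant):
    """Data selection: keep only the task-relevant items."""
    return [t & relevant for t in transactions if t & relevant]

def mine_pairs(transactions):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts

def evaluate(counts, n, min_support):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

data = select(clean(raw), {"bread", "milk", "butter"})
patterns = evaluate(mine_pairs(data), len(data), min_support=0.5)
print(patterns)
```

Each function maps to one step of the process, so a step can be swapped out (for example, a different mining function) without touching the others.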

Architecture of a Typical Data Mining System
[Figure: layered architecture with the following components, from top to bottom]
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Database or data warehouse server
- Data cleaning & data integration, filtering
- Databases and data warehouse
- Knowledge base

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on the following fields]
- Database technology
- Statistics
- Machine learning
- Visualization
- Information science
- Other disciplines

Major Issues in Data Mining (1)
- Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem
- Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed and incremental mining methods

Major Issues in Data Mining (2)
- Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
- Issues related to applications and social impacts
  - Application of discovered knowledge
    - Domain-specific data mining tools
    - Intelligent query answering
    - Process control and decision making
  - Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  - Protection of data security, integrity, and privacy

An Overview of Data Warehousing and OLAP Technology

Outline
- Abstract
- Related Work
- OLTP vs OLAP
- Architecture and End-to-End Process of a Data Warehouse
- Back End Tools and Utilities
- Conceptual Model and Front End Tools
- Database Design Methodology
- Warehouse Servers
- Metadata and Warehouse Management
- Conclusion

Abstract
- Decision support is increasingly becoming a focus of the database industry.
- Our focus: this paper presents a roadmap of data warehousing technologies, focusing on the special requirements that data warehouses place on database management systems (DBMSs).
- We are going to discuss:
  - Back end tools for extracting, cleaning and loading data into a data warehouse
  - Multidimensional data models typical of OLAP
  - Front end client tools for querying and data analysis
  - Server extensions for efficient query processing
  - Tools for metadata management and for managing the warehouse

Related Work
- Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.
- The past years have seen explosive growth both in the number of products and services offered, and in the adoption of these technologies by industry.
- Data warehousing technologies have been successfully deployed in many industries, e.g. manufacturing, financial services, transportation, telecommunications, healthcare, and utilities.

OLAP (Online Analytical Processing)
- A data warehouse enables OLAP to help decision support.
- Organizes and presents data in various formats.
OLTP (Online Transaction Processing)
- Uses operational databases; the data warehouse is kept separate from the operational databases.
- Covers the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, and accounting.

OLTP vs OLAP
- Purpose
  - OLTP: applications typically automate clerical data processing tasks.
  - OLAP: data warehouses, in contrast, are targeted for decision support.
- Data accessed
  - OLTP: the transactions require detailed, up-to-date data, and read or update a few (tens of) records, accessed typically on their primary keys.
  - OLAP: historical, summarized and consolidated data is more important than detailed, individual records.
- Size
  - OLTP: operational databases tend to be hundreds of megabytes to gigabytes in size.
  - OLAP: enterprise data warehouses are projected to be hundreds of gigabytes to terabytes in size.
- Key metrics
  - OLTP: consistency and recoverability of the database are critical, and maximizing transaction throughput is the key performance metric.
  - OLAP: query throughput and response times are more important than transaction throughput.
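
The contrast in access patterns can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical; in practice the two workloads would run on separate systems, as the slides note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "east", 1996, 100.0),
        (2, "east", 1997, 150.0),
        (3, "west", 1996, 200.0),
        (4, "west", 1997, 50.0),
    ],
)

# OLTP-style access: touch a single record, located by its primary key.
conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (175.0, 2))
conn.commit()
oltp_row = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style access: scan many records and return a consolidated summary.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(oltp_row)
print(olap_rows)
```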

OLAP Characteristics
- Use multidimensional data analysis techniques.
- Provide advanced database support.
- Provide easy-to-use end user interfaces.
- Support client/server architecture.

Architecture and End-to-End Process

Processes in Implementing a Data Warehouse

Back End Tools and Utilities
- Data cleaning tools: tools that help to detect data anomalies and correct them
  - E.g. to correct inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries and violations of integrity constraints
- Types:
  - Data migration tools, e.g. Warehouse Manager from Prism
  - Data scrubbing tools, e.g. Integrity
  - Data auditing tools: such tools may be considered variants of data mining tools
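
A minimal sketch of the kind of anomaly detection listed above. The records, the five-character zip rule, and the `VALID_STATES` set are assumptions made for illustration; they do not describe how any of the named products work.

```python
# Hypothetical customer records with typical anomalies.
records = [
    {"id": 1, "zip": "94301", "state": "CA"},
    {"id": 2, "zip": "9430", "state": "Calif."},
    {"id": 3, "zip": None, "state": "CA"},
    {"id": 3, "zip": "10001", "state": "NY"},
]

# Assumed canonical value set for the state column.
VALID_STATES = {"CA", "NY"}

def audit(rows):
    """Return (id, problem) pairs for every anomaly found in rows."""
    problems = []
    seen_ids = set()
    for r in rows:
        if r["zip"] is None:
            problems.append((r["id"], "missing zip"))
        elif len(r["zip"]) != 5:
            problems.append((r["id"], "inconsistent zip length"))
        if r["state"] not in VALID_STATES:
            problems.append((r["id"], "inconsistent state description"))
        if r["id"] in seen_ids:
            # A repeated key violates the primary-key integrity constraint.
            problems.append((r["id"], "duplicate primary key"))
        seen_ids.add(r["id"])
    return problems

problems = audit(records)
for record_id, problem in problems:
    print(record_id, problem)
```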

Back End Tools and Utilities (Contd.)
- Load: after extracting, cleaning and transforming, data must be loaded into the warehouse, e.g. with the RedBrick Table Management Utility.
- Additional preprocessing may still be required for:
  - checking integrity constraints
  - sorting
  - summarization, aggregation and other computation to build the derived tables stored in the warehouse
  - building indices and other access paths
  - partitioning to multiple target storage areas
- Methods:
  - Batch load utilities
  - Pipelined and partitioned parallelism
  - Incremental loading, which inserts only the updated data

Back End Tools and Utilities (Contd.)
- Refresh: every update needs to be propagated only if some OLAP queries need current data.
- Most contemporary database systems provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas.
- Techniques: data shipping and transaction shipping
- Transaction shipping has the advantage that it does not require triggers, which can increase the workload on the operational source databases.
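
Incremental propagation can be sketched as replaying a change log against a replica and remembering how far the replay got. This is a simplified illustration only: the log format `(operation, key, value)` and the function name are hypothetical, and real replication servers work from snapshot logs or the DBMS transaction log.

```python
# Hypothetical change log captured at the source: (operation, key, value).
log = [
    ("insert", 1, {"amount": 100}),
    ("insert", 2, {"amount": 50}),
    ("update", 1, {"amount": 120}),
    ("delete", 2, None),
]

def apply_log(replica, entries, start):
    """Apply entries[start:] to the replica and return the new log position."""
    for op, key, value in entries[start:]:
        if op == "delete":
            replica.pop(key, None)
        else:
            # Inserts and updates both overwrite the row stored under the key.
            replica[key] = value
    return len(entries)

replica = {}
position = apply_log(replica, log, 0)        # initial refresh

log.append(("insert", 3, {"amount": 75}))    # new change at the source
position = apply_log(replica, log, position) # only the new entry is applied

print(replica, position)
```

Keeping the log position is what makes the refresh incremental: the second call touches one entry rather than replaying all five.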

Conceptual Model and Front End Tools
- A popular conceptual model that influences the front-end tools, database design, and the query engines for OLAP is the multidimensional view of data in the warehouse.

Front End Tools
- The spreadsheet is still the most compelling front-end application for OLAP.
- Popular operations supported by the multidimensional spreadsheet:
  - rollup (increasing the level of aggregation)
  - drill-down (decreasing the level of aggregation or increasing detail) along one or more dimension hierarchies
  - slice_and_dice (selection and projection)
  - pivot (re-orienting the multidimensional view of data)
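
The four operations can be sketched over a small in-memory fact list. The data, the dimension names, and the helper functions are all hypothetical; a real OLAP front end would run these against the warehouse server.

```python
from collections import defaultdict

# Hypothetical facts: (city, quarter, product, sales).
facts = [
    ("Pune", "Q1", "pen", 10),
    ("Pune", "Q2", "pen", 20),
    ("Pune", "Q1", "ink", 5),
    ("Delhi", "Q1", "pen", 7),
    ("Delhi", "Q2", "ink", 3),
]
DIMS = ("city", "quarter", "product")

def aggregate(rows, by):
    """Sum sales grouped by the dimensions named in `by`."""
    idx = [DIMS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

def pivot(cells):
    """Re-orient a two-dimensional view by swapping its axes."""
    return {(b, a): v for (a, b), v in cells.items()}

# Drill-down: more detail (city and quarter).
detail = aggregate(facts, ("city", "quarter"))
# Rollup: a higher level of aggregation (city only).
rolled = aggregate(facts, ("city",))
# Slice and dice: select quarter Q1, then project onto product.
sliced = [r for r in facts if r[1] == "Q1"]
diced = aggregate(sliced, ("product",))
# Pivot: quarter becomes the first axis.
pivoted = pivot(detail)

print(rolled)
print(diced)
```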

Front End Tools (Contd.)
- Other applications: traditional analysis by means of a managed query environment
- These applications often use raw data access tools and optimize the access patterns depending on the back end database server.
- E.g. there are query environments (such as Microsoft Access) that help build ad hoc SQL queries by "pointing and clicking".

Database Design Methodology
- The database designs recommended by ER diagrams are inappropriate for decision support systems, where efficiency in querying and in loading data (including incremental loads) is important.
- Schemas used to represent the multidimensional data model:
  - Star schema
  - Snowflake schema
  - Fact constellations

Star Schema
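
A star schema places one fact table at the center, with each dimension held in a single table that the fact table references. The sketch below uses Python's built-in `sqlite3`; the table and column names (`fact_sales`, `dim_product`, `dim_store`) and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);

-- The fact table holds the measure plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    amount     REAL
);

INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'ink', 'stationery'), (3, 'mug', 'kitchen');
INSERT INTO dim_store   VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 8.0), (1, 2, 7.0);
""")

# A typical star join: the fact table joined to each dimension, then aggregated.
star_rows = conn.execute("""
    SELECT p.category, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()

print(star_rows)
```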

Snowflake Schema
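
A snowflake schema refines the star schema by normalizing dimension hierarchies into separate tables. In this sketch the product category is moved out of the product dimension into its own table; all table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category level of the product hierarchy gets its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);

INSERT INTO dim_category VALUES (1, 'stationery'), (2, 'kitchen');
INSERT INTO dim_product  VALUES (1, 'pen', 1), (2, 'ink', 1), (3, 'mug', 2);
INSERT INTO fact_sales   VALUES (1, 10.0), (2, 5.0), (3, 8.0);
""")

# Reaching the category now takes one more join than in a star schema.
snowflake_rows = conn.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()

print(snowflake_rows)
```

The trade-off is visible in the query: the normalized dimension avoids repeating the category text per product, at the cost of an extra join.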

Warehouse Servers
- Data warehouses may contain large volumes of data, so improving the efficiency of scans is important.
- Index structures and their usage: warehouse servers can use bitmap indices, which support efficient index operations (e.g., union, intersection).
- Materialized views and their usage: one strategy for using a materialized view is to apply a selection on it, or a rollup on it by grouping and aggregating on additional columns.
- Transformation of complex SQL queries: "unnesting" complex SQL queries that contain nested subqueries
- Parallel processing
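
The bitmap index idea can be sketched with Python integers used as bit vectors: bit `i` of a value's bitmap is set when row `i` holds that value, so intersection and union become single bitwise operations. The rows and helper names are hypothetical, and production servers use compressed bitmap formats rather than plain integers.

```python
# Hypothetical warehouse rows; the list position serves as the row id.
rows = [
    {"region": "east", "product": "pen"},
    {"region": "west", "product": "pen"},
    {"region": "east", "product": "ink"},
    {"region": "east", "product": "pen"},
]

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose set bits are row ids."""
    index = {}
    for pos, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << pos)
    return index

def positions(bitmap):
    """Decode a bitmap back into the list of matching row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

region_idx = build_bitmap_index(rows, "region")
product_idx = build_bitmap_index(rows, "product")

# Intersection: region = 'east' AND product = 'pen'.
east_and_pen = region_idx["east"] & product_idx["pen"]
# Union: region = 'west' OR product = 'ink'.
west_or_ink = region_idx["west"] | product_idx["ink"]

print(positions(east_and_pen))
print(positions(west_or_ink))
```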

Warehouse Servers (Contd.)
- Server architectures for query processing:
  - Specialized SQL servers: the objective here is to provide advanced query language and query processing support for SQL queries over star and snowflake schemas in read-only environments, e.g. Redbrick.
  - ROLAP servers: these are intermediate servers that sit between a relational back end server (where the data in the warehouse is stored) and client front end tools, e.g. Microstrategy.
  - MOLAP servers: these servers directly support the multidimensional view of data through a multidimensional storage engine, e.g. Essbase (Arbor).

Warehouse Servers (Contd.)
- SQL extensions:
  - Extended family of aggregate functions: rank, percentile, mean, mode, median
  - Reporting features: moving average
  - Multiple group-by: Cube and Rollup
  - Comparisons
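
The difference between Cube and Rollup can be sketched in plain Python: Cube aggregates over every subset of the grouping dimensions, while Rollup aggregates only over prefixes of the dimension list. The data and function names are hypothetical, and `None` stands in for the "ALL" value of a dimension that is not grouped on.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales rows: (region, product, amount).
sales = [("east", "pen", 10), ("east", "ink", 5), ("west", "pen", 7)]
DIMS = ("region", "product")

def group_by(rows, dims):
    """Sum amounts grouped on `dims`; ungrouped dimensions become None ('ALL')."""
    totals = defaultdict(int)
    for *keys, amount in rows:
        cell = tuple(k if d in dims else None for d, k in zip(DIMS, keys))
        totals[cell] += amount
    return dict(totals)

def cube(rows):
    """Group by every subset of the dimensions."""
    result = {}
    for r in range(len(DIMS) + 1):
        for dims in combinations(DIMS, r):
            result.update(group_by(rows, dims))
    return result

def rollup(rows):
    """Group by each prefix of the dimension list: (), (region), (region, product)."""
    result = {}
    for r in range(len(DIMS) + 1):
        result.update(group_by(rows, DIMS[:r]))
    return result

cube_cells = cube(sales)
rollup_cells = rollup(sales)
print(len(cube_cells), len(rollup_cells))
```

With two dimensions, Cube yields the per-product totals across all regions, which Rollup omits because `(product)` alone is not a prefix of `(region, product)`.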

Metadata and Warehouse Management
- Administrative metadata includes:
  - descriptions of the source databases, back-end and front-end tools
  - definitions of the warehouse schema, derived data, dimensions and hierarchies, predefined queries and reports
  - data mart locations and contents
  - physical organization such as data partitions
  - data extraction, cleaning, and transformation rules
  - data refresh and purging policies
  - user profiles, user authorization and access control policies
- Business metadata includes: business terms and definitions, ownership of the data, and charging policies
- Operational metadata includes information that is collected during the operation of the warehouse:
  - the lineage of migrated and transformed data
  - the currency of data in the warehouse (active, archived or purged)
  - monitoring information such as usage statistics, error reports, and audit trails

Metadata and Warehouse Management (Contd.)
- A metadata repository is used to store and manage all the metadata associated with the warehouse, e.g. Platinum Repository and Prism Directory Manager.
- Warehouse management tools (e.g., HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager) are used for monitoring a warehouse.
- System and network management tools (e.g., HP OpenView, IBM NetView, Tivoli) are used to measure traffic between clients and servers, and between warehouse servers and operational databases.
- Workflow management tools have been considered for managing the extract-scrub-transform-load-refresh process.

Conclusion
- There are substantial technical challenges in developing and deploying decision support systems.
- While many commercial products and services exist, there are still several interesting avenues for research related to the different aspects of designing and maintaining a data warehouse.

References
- Surajit Chaudhuri (Microsoft Research, Redmond) and Umeshwar Dayal (Hewlett-Packard Labs, Palo Alto), An Overview of Data Warehousing and OLAP Technology.
- W. H. Inmon, Building the Data Warehouse. John Wiley, 1992.
- Athanasios Vavouras, Stella Gatziu, Klaus R. Dittrich, Modeling and Executing the Data Warehouse Refreshment Process, Technical Report 2000.01, January 2000.

Thank You
Sathe Chaitanya R, Vipin Saraogi, Paras Saini, Varun Sharma, Renu Goyal
