Data Mining: Concepts and Techniques

Slides:



Advertisements
Similar presentations
An overview of Data Warehousing and OLAP Technology Presented By Manish Desai.
Advertisements

Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
April 30, Data Warehousing and OLAP Technology: An Overview  What is a data warehouse?  Data warehouse architecture  From data warehousing to.
Data Warehousing.
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Introduction to Data Warehousing CPS Notes 6.
ICS 421 Spring 2010 Data Warehousing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/18/20101Lipyeow.
The Role of Data Warehousing and OLAP Technologies CS 536 – Data Mining These slides are adapted from J. Han and M. Kamber’s book slides (
Data Warehousing Xintao Wu. Evolution of Database Technology (See Fig. 1.1) 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational.
Data Warehousing.
Dr. M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2010 COMP207: Data Mining Data Warehousing COMP207: Data Mining.
Ch3 Data Warehouse part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
1 Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously.  A decision support database that is maintained.
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
Dr. Bernard Chen Ph.D. University of Central Arkansas
8/20/ Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously. Defined in many different ways, but.
1 Cube Computation and Indexes for Data Warehouses CPS Notes 7.
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
Data Warehousing Xintao Wu. Can You Easily Answer These Questions? What are Personnel Services costs across all departments for all funding sources? What.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
October 28, Data Warehouse Architecture Data Sources Operational DBs other sources Analysis Query Reports Data mining Front-End Tools OLAP Engine.
Dr. N. MamoulisAdvanced Database Technologies1 Topic 6: Data Warehousing & OLAP Defined in many different ways, but not rigorously. A decision support.
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Data Mining Data Warehouses.
January 21, 2016Data Mining: Concepts and Techniques 1 Chapter 3: Data Warehousing and OLAP Technology: An Overview What is a data warehouse? A multi-dimensional.
Datawarehousing and OLAP C.Eng 714 Spring
1 Chapter 4: Data Warehousing and On-line Analytical Processing Data Warehouse: Basic Concepts Data Warehouse Modeling: Data Cube and OLAP Data Warehouse.
Data Mining and Data Warehousing: Concepts and Techniques Conceptual Modeling of Data Warehouses Defining a Snowflake Schema in Data Mining Query Language.
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Chapter 4. Data Warehousing and On-line Analytical Processing (OLAP)
Data Mining: Data Warehousing
Introduction to Data Warehousing
Data Mining: Concepts and Techniques — Chapter 3 —
Data Warehousing and OLAP
Data Mining: Concepts and Techniques — Chapter 3 —
Information Management course
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data warehouse and OLAP
A multi-dimensional data model
Chapter 13 The Data Warehouse
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 4 —
Information Management course
Data Warehouse—Subject‐Oriented
OLAP Concepts and Techniques
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Data Warehouse.
OLAP Systems versus Statistical Databases
Data Mining Data Warehousing
Data Mining: Concepts and Techniques — Chapter 3 —
Data Warehousing and OLAP Technology for Data Mining
Data Warehousing modified by Donghui Zhang
Chapter 2: Data Warehousing and OLAP Technology for Data Mining
Data Mining: Concepts and Techniques — Chapter 3 —
Data Warehouse and OLAP
Lecture 4: From Data Cubes to ML
Overview of Data Warehousing and OLAP
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie
Data Warehousing: Data Models and OLAP operations
Data Warehousing and Decision Support Chapter 25
Resource: J. Han and other books
Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009
Data Mining: Concepts and Techniques — Chapter 3 —
Data Mining: Concepts and Techniques
Data Warehouse.
UNIT-1 Introduction Lecture-1 Motivation: Why data mining?
Data Warehouse and OLAP
Presented by: Tek Narayan Adhikari
Data Warehouse and OLAP Technology
CIS671-Knowledge Discovery and Data Mining
Presentation transcript:

Data Mining: Concepts and Techniques Data Warehousing What is Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data warehouse Implementation, From Data Warehouse to Data mining. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques What is Data Warehouse? A decision support database that is maintained separately from the organization’s operational database Support information processing by providing a solid platform of consolidated, historical data for analysis. Data warehousing provides architectures and tools for business executives to systematically organize, understand and use their data to make strategic decisions Data warehousing:The process of constructing and using data warehouses May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—William. H. Inmon The four keywords subject-oriented, integrated, time-variant and nonvolatile, distinguish data warehouses from other data repository systems such as relational database systems, transaction processing systems, and file systems. Time- variant: data are stored to provide information from a historical perspective. Non- volatile: a data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. May 18, 2018 Data Mining: Concepts and Techniques

Data Warehouse—Subject-Oriented Subject-oriented: a data warehouse is organized around major subjects. Organized around major subjects, such as customer, product, sales Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process May 18, 2018 Data Mining: Concepts and Techniques

Data Warehouse—Integrated Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted. May 18, 2018 Data Mining: Concepts and Techniques

Data Warehouse—Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element” May 18, 2018 Data Mining: Concepts and Techniques

Data Warehouse—Nonvolatile A physically separate store of data transformed from the operational environment Operational update of data does not occur in the data warehouse environment Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data May 18, 2018 Data Mining: Concepts and Techniques

Differences between Operational Database Systems and Data Warehouses The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing(OLTP)systems. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing(OLAP) systems. May 18, 2018 Data Mining: Concepts and Techniques

Differences between Operational Database Systems and Data Warehouses Users and System orientation : An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients and information technology professionals. Am OLAP system is market-oriented and is used for data analysis by knowledge workers , including managers , executives and analysts. Data Contents : An OLTP system current data that , typically, are too detailed to be easily used for decision making An OLAP system manages large amount of historical data. May 18, 2018 Data Mining: Concepts and Techniques

Differences between Operational Database Systems and Data Warehouses Database Design: An OLTP system usually adopts an entity-relation ship (ER) data model and an application oriented database design. An OLAP system typically adopts either a star or snowflake model and subject oriented data base design. May 18, 2018 Data Mining: Concepts and Techniques

Differences between Operational Database Systems and Data Warehouses View: An OLTP system focuses mainly on the current data within an enterprise or department , without referring to historical data in different. An OLAP system often spans multiple versions of a database schema , due to the evolutionary process of an organization. OLAP – deal with information that originate from different organizations , integrating information from many data stores. Because of their huge volume , OLAP data are stored on multiple storage media. May 18, 2018 Data Mining: Concepts and Techniques

Differences between Operational Database Systems and Data Warehouses 5 Access Patterns – OLTP – Short atomic transactions. ( System requires concurrency control and recovery mechanisms) OLAP – Read only operations ( Historical rather than up to date information) . Complex queries. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques OLTP vs. OLAP May 18, 2018 Data Mining: Concepts and Techniques

Multidimensional Data Model A data warehouse is based on a multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques May 18, 2018 Data Mining: Concepts and Techniques

Cube: A Lattice of Cuboids all time item location supplier time,location time,supplier item,location item,supplier location,supplier time,item,supplier time,location,supplier item,location,supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid time,item time,item,location time, item, location, supplier May 18, 2018 Data Mining: Concepts and Techniques

Conceptual Modeling of Data Warehouses Modeling data warehouses: dimensions & measures Star schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Example of Star Schema time_key day day_of_the_week month quarter year time item_key item_name brand type supplier_type item Sales Fact Table time_key item_key branch_key branch_key branch_name branch_type branch location_key street city state_or_province country location location_key units_sold dollars_sold avg_sales Measures May 18, 2018 Data Mining: Concepts and Techniques

Example of Snowflake Schema time_key day day_of_the_week month quarter year time item_key item_name brand type supplier_key item supplier_key supplier_type supplier Sales Fact Table time_key item_key branch_key location_key street city_key location branch_key branch_name branch_type branch location_key units_sold city_key city state_or_province country dollars_sold avg_sales Measures May 18, 2018 Data Mining: Concepts and Techniques

Example of Fact Constellation time_key day day_of_the_week month quarter year time item_key item_name brand type supplier_type item Shipping Fact Table Sales Fact Table time_key item_key time_key shipper_key item_key from_location branch_key branch_key branch_name branch_type branch location_key to_location location_key street city province_or_state country location dollars_cost units_sold units_shipped dollars_sold avg_sales shipper_key shipper_name location_key shipper_type shipper Measures May 18, 2018 Data Mining: Concepts and Techniques

Cube Definition Syntax in DMQL define cube <cube_name> [<dimension_list>]: <measure_list> Dimension Definition (Dimension Table) define dimension <dimension_name> as (<attribute_or_subdimension_list>) May 18, 2018 Data Mining: Concepts and Techniques

Defining Star Schema in DMQL define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) May 18, 2018 Data Mining: Concepts and Techniques

Defining Snowflake Schema in DMQL define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) May 18, 2018 Data Mining: Concepts and Techniques

Defining Fact Constellation in DMQL define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales May 18, 2018 Data Mining: Concepts and Techniques

Measures of Data Cube: Three Categories Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning E.g., count(), sum(), min(), max() Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function E.g., avg(), min_N(), standard_deviation() Holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode() May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques DMQL to SQL Select s.time_key , s.item_key , s.branch_key . S.location_key , sum(s.number_of_units_sold * s.price) , sum(s.number_of_units_sold) Rom time t , item I , branch b , location l , Sales s, Where s.time_key = t.time_key and s.item_key = i.item_key and S.brach_key = b.branch_key and s,location_key = l.location_key Group by s.time_key , s.item_key , s.branch_key , s.location_key May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Concept Hierarchies Concept Hierarchies defines a sequence of mappings from a set of low-level concepts to higher level, more general concepts May 18, 2018 Data Mining: Concepts and Techniques

A Concept Hierarchy: Dimension (location) all all Europe ... North_America region Germany ... Spain Canada ... Mexico country Vancouver ... city Frankfurt ... Toronto L. Chan ... M. Wind office May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Concept hierarchies may also be defined by grouping values for a given dimension or attribute, resulting in a set group hierarchy. {1..10} < inexpensive May 18, 2018 Data Mining: Concepts and Techniques

View of Warehouses and Hierarchies May 18, 2018 Data Mining: Concepts and Techniques

OLAP operations in the Multidimensional data model. Roll-up: The roll-up operation performs aggregation on a data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction. Drill-down: It is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimension May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Fig. 3.10 Typical OLAP Operations May 18, 2018 Data Mining: Concepts and Techniques

OLAP operations in the Multidimensional data model. Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a sub cube. The dice operation defines a sub cube by performing a selection on two or more dimensions. Pivot(rotate): Pivot is a visualization operation that rotate the data axes in view in order to provide an alternative presentation of the data. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-end relational tables (using SQL) May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques A Star-Net Query Model Each circle is called a footprint May 18, 2018 Data Mining: Concepts and Techniques

Chapter 3: Data Warehousing and OLAP Technology: An Overview What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation From data warehousing to data mining May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Warehouse Architecture Four different views regarding the design of a data warehouse must be considered: The top-down view The data source view The data warehouse view The business query view To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Warehouse Architecture The top-down view allows the selection of the relevant information necessary for data warehouse. The data source view exposes the information being captured, stored, and managed by operational systems To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework. The data warehouse view includes fact tables and dimension tables. It represents the information that is stored inside the data warehouse. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Warehouse Architecture The business query view is the perspective of data warehouse from the view point of the end user. . To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework. May 18, 2018 Data Mining: Concepts and Techniques

The process of data warehouse design: A data warehouse can be built using a top-down approach, or a combination of both. The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well-known, and where the business problems that must be solved are clear and well understood. The bottom up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around May 18, 2018 Data Mining: Concepts and Techniques

The warehouse design consists of the following steps: Choose a business process to model: If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. Choose the grain of the business process: The grain is the fundamental, atomic level data to be represented in the fact table for this process. Choose the dimensions that will apply to each fact table record. Choose the measures that will populate each fact table record. May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Three-Tier data warehouse architecture Operational DBs Other sources Monitor & Integrator OLAP Server Metadata Extract Transform Load Refresh Analysis Query Reports Data mining Serve Data Warehouse Data Marts Data Sources Data Storage OLAP Engine Front-End Tools May 18, 2018 Data Mining: Concepts and Techniques

Three-Tier data warehouse architecture Data warehouses often adopt a three-tier architecture. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources. These tools and utilities perform data extraction, cleaning, and transformation. The middle tier is an OLAP server that is typically implemented using either a relational OLAP(ROLAP) model, that is an extended relational DBMS. Or a multidimensional OLAP(MOLAP) model, that is , a special purpose server that directly implements multidimensional data and operations. 3. The top tier is a front end client layer, which contains query and reporting tools, analysis tools, data mining tools.

Three Data Warehouse Models Enterprise Warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It typically contains detailed data as well as summarized data. An enterprise data warehouse may be implemented on traditional mainframes, computer supervisors or parallel architecture platforms. Data Mart: A data mart contains a subset of corporate wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years. Dependant on the source of data ,data marts can be categorized as independent or dependant. Virtual Warehouse: A virtual warehouse is a set of views over operational database. It is easy to build but requires excess capacity on operational database servers.

Data Mining: Concepts and Techniques May 18, 2018 Data Mining: Concepts and Techniques

Data Warehouse Back-end Tools and Utilities Data extraction, which typically gathers data from multiple, heterogeneous, and external sources Data cleaning, which detects errors in data and rectifies when possible Data transformation, which converts data from legacy or host format to warehouse format Load which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions Refresh, which propagates the updates from the data sources to warehouse

Metadata Repository Metadata are data about data. When used in data warehouse, metadata are the data that defines the warehouse objects A metadata contains the following: A description of the structure of the data warehouse which includes the warehouse schema, view, dimensions, hierarchies and derived data definitions as well as data mart locations and contents Operational metadata which includes data lineage( history of migrated data an the sequence of transformation applied to it), currency of data and monitoring information The algorithms used for summarizations which include measure and dimension definition algorithm, data on granularity, partition, subject areas, aggregation, summarization, and predefined queries and objects

The mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules and security Data related to system performance which includes indices and profile that improve data access and retrieval performance Business metadata, which include business terms and definitions, data ownership information and charging policies

Types of OLAP Servers: ROLAP v/s MOLAP v/s HOLAP OLAP servers present business users with multidimensional data from data warehouses or data marts without concerning regarding how or where the data are stored. Implementation of warehouse server for OLAP processing include the following: Relational OLAP(ROLAP)servers: These are the intermediate servers that stand in between a relational back-end and client front end tools. They use a relational or extended-RDBMS to store and manage warehouse data Multidimensional OLAP(MOLAP)servers: These servers support multidimensional view of data through array based multidimensional storage engines. They map multidimensional views directly to data cube array structure.

Hybrid OLAP(HOLAP)servers: The hybrid OLAP combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. Specialized SQL Servers: To meet the growing demand of OLAP processing in relational database system vendors implement specialized SQL servers that provide advanced query language and query process support for SQL queries over star and snowflake schemas in read-only environment.

Data Mining: Concepts and Techniques May 18, 2018 Data Mining: Concepts and Techniques

Chapter 3: Data Warehousing and OLAP Technology: An Overview What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation From data warehousing to data mining May 18, 2018 Data Mining: Concepts and Techniques

Data warehouse implementation Data warehouses contain huge volumes of data. OLAP servers demand that decision support queries be answered in the order of second. In SQL terms, these aggregations are referred to as group-by’s . Each group-by can be represented by a cuboid,where the set of group-by’s forms a lattice of cuboids defining a data cube.

Data Mining: Concepts and Techniques May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Cube Operation Cube definition and computation in DMQL define cube sales[item, city, year]: sum(sales_in_dollars) compute cube sales Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year Need compute the following Group-Bys (date, product, customer), (date,product),(date, customer), (product, customer), (date), (product), (customer) () May 18, 2018 Data Mining: Concepts and Techniques

Lattice of cuboids making a 3-D data cube (item) (city) () (year) (city, item) (city, year) (item, year) (city, item, year) May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Cube = 2n How many cuboids in an n-dimensional cube with L levels? May 18, 2018 Data Mining: Concepts and Techniques

Partial materialization:Selected Computation of Cuboids There are three choice for data cube materialization given a base cuboid No materialization Do not precomputed any of the “nonbase” cuboids. -computing expensive multidimensional aggregates on the fly . Extremely slow full materialization precompute all of the cuboid. The resulting lattice of computed cuboids is referred to as the full cube. - huge amount of memory Partial materialization selectively compute a proper subset of the whole set of possible cuboid.

Data Mining: Concepts and Techniques Iceberg Cube Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= minsup Motivation Only a small portion of cube cells may be “above the water’’ in a sparse cube Only calculate “interesting” cells—data above certain threshold Avoid explosive growth of the cube May 18, 2018 Data Mining: Concepts and Techniques

Indexing OLAP Data: Bitmap Index Index on a particular column Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column not suitable for high cardinality domains Base table Index on Region Index on Type May 18, 2018 Data Mining: Concepts and Techniques

Indexing OLAP Data: Bitmap Index May 18, 2018 Data Mining: Concepts and Techniques

Indexing OLAP Data: Join Indices Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …) Traditional indices map the values to a list of record ids It materializes relational join in JI file and speeds up relational join In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table. E.g. fact table: Sales and two dimensions city and product A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city Join indices can span multiple dimensions May 18, 2018 Data Mining: Concepts and Techniques

Indexing OLAP Data: Join Indices May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques May 18, 2018 Data Mining: Concepts and Techniques

Efficient Processing OLAP Queries Determine which operations should be performed on the available cuboids Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection Determine which materialized cuboid(s) should be selected for OLAP op. Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004 Which should be selected to process the query? Explore indexing structures and compressed vs. dense array structs in MOLAP May 18, 2018 Data Mining: Concepts and Techniques

Chapter 3: Data Warehousing and OLAP Technology: An Overview What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation From data warehousing to data mining May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Data Warehouse Usage Three kinds of data warehouse applications Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling, pivoting Data mining knowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools May 18, 2018 Data Mining: Concepts and Techniques

From On-Line Analytical Processing to On-Line Analytical Mining On-line analytical mining(OLAM) integrates on-line analytical processing(OLAP) with data mining and mining knowledge in multidimensional databases. OLAM is particularly important for the following reasons. High quality of data in data warehouses: Most data mining tools need to work on integrated, consistent, and cleaned data, which requires costly data cleaning, data integration, and data transformation as preprocessing steps.

On-line selection of data mining functions: Available information processing infrastructure surrounding data warehouses: Comprehensive information processing and data analysis infrastructures have been or will be systematically constructed surrounding data warehouses. ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools OLAP-based exploratory data analysis: Effective data mining needs exploratory data analysis. Mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions: Integration and swapping of multiple mining functions, algorithms, and tasks

An OLAM System Architecture Mining query Mining result Layer4 User Interface User GUI API OLAM Engine OLAP Engine Layer3 OLAP/OLAM Data Cube API Layer2 MDDB MDDB Meta Data Database API Filtering&Integration Filtering Layer1 Data Repository Data cleaning Data Warehouse Databases Data integration May 18, 2018 Data Mining: Concepts and Techniques

Chapter 3: Data Warehousing and OLAP Technology: An Overview What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation From data warehousing to data mining Summary May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Summary May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Summary May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Summary May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Summary May 18, 2018 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Summary May 18, 2018 Data Mining: Concepts and Techniques