Reducing the Response Time for Data Warehouse Queries Using Rough Set Theory By Mahmoud Mohamed Al-Bouraie Yasser Fouad Mahmoud Hassan Wesam Fathy Jasser.



Outline
– Aim
– Pre-grouping Transformation
– Hierarchical pre-grouping
– Attribute Selection
– A Heuristic Algorithm for Attribute Selection
– Worked Example
– Applications
– Conclusion

Aim
To reach the optimal case, in which:
– the response time for processing a query is as small as possible (using the pre-grouping transformation), and
– the size of the database is as small as possible (using rough set theory).

Pre-grouping Transformation (1)
The transformation depends on concepts such as the star schema. As shown in the figure, a star schema consists of a fact table surrounded by dimension tables. The relationship between each dimension table and the fact table is 1:N.
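
The star schema described above can be illustrated with a small in-memory database. This is only a sketch: the table names (fact_sales, dim_product, dim_store), the columns, and the sample rows are hypothetical and do not come from the slides. It shows one fact table referencing two dimension tables (each dimension row relates to many fact rows) and a typical star query that joins and groups them.

```python
import sqlite3

# Hypothetical star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region   TEXT);
-- Each fact row points to exactly one row per dimension (1:N from the dimension side).
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'food'), (2, 'toys');
INSERT INTO dim_store   VALUES (10, 'north'), (20, 'south');
INSERT INTO fact_sales  VALUES (1, 10, 5.0), (1, 20, 7.0), (2, 10, 3.0), (1, 10, 2.0);
""")


def sales_by_category_and_region(connection):
    """Run a star query: join the fact table to its dimensions, then group."""
    return connection.execute("""
        SELECT p.category, s.region, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_store   s ON s.store_id   = f.store_id
        GROUP BY p.category, s.region
        ORDER BY p.category, s.region
    """).fetchall()


for row in sales_by_category_and_region(conn):
    print(row)
```

In this conventional plan every fact row is joined before grouping, which is the cost the pre-grouping transformation aims to reduce.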

Pre-grouping Transformation (2)
Another concept that this transformation depends on is hierarchical clustering of data. It is based on the idea that the hierarchies of a dimension are encoded into hierarchical surrogates used in the fact table. A surrogate is a compact representation of the hierarchy path of a dimension member, which makes it possible to use the hierarchy on the fact table without requiring residual joins. The figure shows an example.
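
One possible way to picture a hierarchical surrogate is as a bit-packed integer whose leading bits identify the higher hierarchy levels. The sketch below is an assumption for illustration only: the three-level hierarchy (region > country > city), the bit widths in LEVEL_BITS, and the function names are hypothetical and are not taken from the slides. It shows why a restriction on a hierarchy prefix becomes a simple key range on the fact table.

```python
# Hypothetical bit widths for a three-level hierarchy: region > country > city.
LEVEL_BITS = (4, 6, 10)


def encode_path(path):
    """Pack the per-level ordinals of a full hierarchy path into one integer."""
    key = 0
    for ordinal, bits in zip(path, LEVEL_BITS):
        if not 0 <= ordinal < (1 << bits):
            raise ValueError("ordinal does not fit in its level")
        key = (key << bits) | ordinal
    return key


def prefix_range(prefix):
    """Return the inclusive surrogate range covering every member below a prefix."""
    head = 0
    for ordinal, bits in zip(prefix, LEVEL_BITS):
        head = (head << bits) | ordinal
    # Bits reserved for the levels that the prefix leaves unspecified.
    rest = sum(LEVEL_BITS[len(prefix):])
    low = head << rest
    high = low | ((1 << rest) - 1)
    return low, high


city_key = encode_path((1, 2, 3))   # region 1, country 2, city 3
low, high = prefix_range((1, 2))    # every city in region 1, country 2
print(city_key, low, high, low <= city_key <= high)
```

Because all members under the same parent share a key prefix, a filter such as "country 2 of region 1" can be evaluated on the fact table as a range test, without joining back to the dimension table.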

Hierarchical pre-grouping
We assume that the DBMS has information about the hierarchical relationships of the dimension attributes. We group on the highest hierarchy level to reduce the number of resulting groups. The groups produced by the pre-grouping operation are then joined with the dimension tables to obtain the values of the grouping attributes.
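
The idea of grouping first and joining afterwards can be sketched in a few lines. Everything named here is hypothetical: the fact rows, the two-level surrogate (region ordinal, city ordinal), and the REGION_DIM lookup stand in for a real fact table and dimension table. The sketch aggregates on the highest hierarchy level directly from the surrogate, then joins only the resulting groups with the dimension data.

```python
from collections import defaultdict

# Hypothetical fact rows: (hierarchical surrogate, measure).
# The surrogate is (region_ordinal, city_ordinal); the region is the higher level.
FACT = [((0, 0), 10.0), ((0, 1), 5.0), ((1, 0), 7.0), ((0, 0), 2.5), ((1, 1), 1.0)]

# Hypothetical dimension data: region ordinal -> value of the grouping attribute.
REGION_DIM = {0: "Europe", 1: "Asia"}


def pre_group_then_join(fact, region_dim):
    """Aggregate on the surrogate prefix first, then join the few groups."""
    # Step 1: pre-group on the highest hierarchy level, with no dimension join.
    partial = defaultdict(float)
    for (region_ordinal, _city_ordinal), amount in fact:
        partial[region_ordinal] += amount
    # Step 2: join one row per group (not one row per fact) with the dimension.
    return {region_dim[ordinal]: total for ordinal, total in partial.items()}


print(pre_group_then_join(FACT, REGION_DIM))
```

The join in step 2 touches one row per group rather than one row per fact, which is where the saving comes from when the fact table is large.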

Attribute Selection
According to rough set theory, a database typically contains many redundant attributes. To eliminate them we use attribute selection, which finds an optimal subset of the attributes according to some criterion, so that a learning algorithm can induce a classifier with the highest possible accuracy using only the information available from that subset.

A Heuristic Algorithm for Attribute Selection
Let R be the set of selected attributes, P the set of unselected condition attributes, U the set of all objects, X the set of contradictory objects, Va the set of values of attribute a, and EXPECT the accuracy threshold. In the initial state, R = CORE(C), the core of the condition attribute set C, and k = 0.

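Attribute Selection using RSH (1)
Step 1. If k >= EXPECT, finish; otherwise calculate the dependency degree k (formula not captured in the transcript).
Step 2. For each p in P, calculate the selection measures (formulas not captured in the transcript), where max_size denotes the cardinality of the maximal subset.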

Attribute Selection using RSH (2)
Step 3. Choose the best attribute p, the one with the largest value of the selection measure (expression not captured in the transcript), and move it from P to R.
Step 4. Remove all consistent instances u from X.
Step 5. Go back to Step 1.
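
The five steps can be sketched as follows. Because the formulas are missing from the transcript, the sketch fills them with a common rough-set formulation, and this is an assumption: the dependency degree k is taken as the fraction of objects in the positive region, and each candidate p is scored by v_p * m_p, where v_p is the size of the positive region after adding p and m_p is the size of its largest consistent class (the max_size of Step 2). The function names and the seven-row TABLE are illustrative; the table only respects the value domains listed in the worked example and is not copied from the slides.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group object ids into indiscernibility classes over attrs."""
    classes = defaultdict(set)
    for uid, row in rows.items():
        classes[tuple(row[a] for a in attrs)].add(uid)
    return list(classes.values())


def consistent_classes(rows, attrs, decision):
    """Classes whose members all share one decision value (the positive region)."""
    return [cls for cls in partition(rows, attrs)
            if len({rows[u][decision] for u in cls}) == 1]


def select_attributes(table, conditions, decision, core, expect=1.0):
    """Greedy attribute selection; the scoring rule is an assumed formulation."""
    selected = [a for a in conditions if a in core]        # R
    unselected = [a for a in conditions if a not in core]  # P
    remaining = dict(table)                                # objects still contradictory
    covered = 0
    while True:
        # Steps 1 and 4: remove consistent objects and update the dependency degree.
        for cls in consistent_classes(remaining, selected, decision):
            covered += len(cls)
            for uid in cls:
                del remaining[uid]
        k = covered / len(table)
        if k >= expect or not unselected:
            return selected

        # Step 2: score each candidate p by v_p * m_p (assumed measure).
        def score(p):
            classes = consistent_classes(remaining, selected + [p], decision)
            v_p = sum(len(c) for c in classes)
            m_p = max((len(c) for c in classes), default=0)
            return v_p * m_p

        # Step 3: move the best attribute from P to R.
        best = max(unselected, key=score)
        unselected.remove(best)
        selected.append(best)


# Illustrative table (hypothetical values within the domains of the worked example).
TABLE = {
    "u1": {"a": 1, "b": 0, "c": 2, "d": 1, "e": 1},
    "u2": {"a": 1, "b": 0, "c": 2, "d": 0, "e": 1},
    "u3": {"a": 1, "b": 2, "c": 0, "d": 0, "e": 2},
    "u4": {"a": 1, "b": 2, "c": 2, "d": 1, "e": 0},
    "u5": {"a": 2, "b": 1, "c": 0, "d": 0, "e": 2},
    "u6": {"a": 2, "b": 1, "c": 1, "d": 0, "e": 2},
    "u7": {"a": 2, "b": 1, "c": 2, "d": 1, "e": 1},
}

print(select_attributes(TABLE, ["a", "b", "c", "d"], "e", core={"b"}))
```

The core is passed in as a parameter rather than computed, since the slides start from a given CORE(C). On this illustrative table, starting from R = {b}, attribute a scores 0, c scores 5 and d scores 10, so d is added and the loop stops once every object is consistent.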

Worked Example of Attribute Selection
Condition attributes:
– a: Va = {1, 2}
– b: Vb = {0, 1, 2}
– c: Vc = {0, 1, 2}
– d: Vd = {0, 1}
Decision attribute:
– e: Ve = {0, 1, 2}

R = {b}. The instances containing b0 will not be considered. After deleting all consistent objects, table T is reduced to table T' (tables not reproduced in the transcript).

1. Selecting {a}: R = {a, b}
(Figure: the partitions U/{e} = {{u3, u5, u6}, {u4}, {u7}} and U/{a,b} of the remaining objects u3 to u7.)

We also evaluate selecting {c} and {d} in the same way. Finally, we find:
Result: subset of attributes = {b, d}

Application and Result
We built a simple database from the Heart disease dataset, stored as an Excel file. The attributes and their types are as follows.
Attribute information:
1) age
2) sex
3) chest pain type (4 values)
4) resting blood pressure
5) serum cholesterol in mg/dl
6) fasting blood sugar > 120 mg/dl
7) resting electrocardiographic results (values 0, 1, and 2)
8) maximum heart rate achieved
9) exercise-induced angina
10) oldpeak = ST depression induced by exercise relative to rest
11) the slope of the peak exercise ST segment
12) number of major vessels (0-3) colored by fluoroscopy
13) thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
Attribute types:
– Real: 1, 4, 5, 8, 10, 12
– Ordered: 11
– Binary: 2, 6, 9
– Nominal: 3, 7, 13
We used the Rosetta software to analyze the data and reduce the database. We then generated rules from the best reduct, and finally filtered the rules using a quality filtering loop. The results of the experiment are shown in the table.

Conclusion
The star query is the most common type of query in a data warehouse. One of the most promising techniques for evaluating such queries efficiently is the use of fact table organizations that store data clustered according to the dimension hierarchies.
– A special hierarchical encoding is imposed, and star joins are transformed into multidimensional range queries on the underlying multidimensional structures.
The conventional star query evaluation plan changes radically, and new processing steps are required.