SF-Tree and Its Application to OLAP Speaker: Ho Wai Shing
Overview
Introduction; Basic OLAP technologies (ROLAP and MOLAP); Structure of an SF-Tree; Using SF-Tree for OLAP; Conclusion
Introduction
On-line Analytical Processing (OLAP) is an important tool for decision making. Users may ask for aggregated measure attributes over different combinations of dimension attributes [GrayBLP96].
Introduction -- OLAP
For example, consider the CarSales table of a data warehouse: CarSales(TransID, Buyer, Date, Shop, Color, Price). Users may want to know the total sales of yellow cars sold in September 2002.
Introduction -- OLAP
For this query, the answer is $60k. To answer such queries efficiently, the answers are usually precomputed and stored. A popular data model for this is the data cube [GrayBLP96].
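The precomputation idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of [GrayBLP96]: the rows, the shop name "MK", and the helper `build_cube` are all hypothetical, and the made-up prices are merely chosen so that the yellow-cars-in-September total matches the $60k quoted above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical CarSales rows: (TransID, Buyer, Date, Shop, Color, Price).
ROWS = [
    (1, "Ann", "2002-09", "TST", "Yellow", 10_000),
    (2, "Bob", "2002-09", "MK", "Yellow", 50_000),
    (3, "Cat", "2002-09", "TST", "Red", 30_000),
    (4, "Dan", "2002-10", "MK", "Yellow", 20_000),
]
DIMS = {"Date": 2, "Shop": 3, "Color": 4}  # dimension name -> column index
PRICE = 5                                  # column index of the measure


def build_cube(rows):
    """Precompute SUM(Price) for every subset of dimensions (every cuboid)."""
    cube = {}
    names = sorted(DIMS)
    for k in range(len(names) + 1):
        for group in combinations(names, k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[DIMS[d]] for d in group)
                cuboid[key] += row[PRICE]
            cube[group] = dict(cuboid)
    return cube


cube = build_cube(ROWS)
# The query becomes a single lookup in the (Color, Date) cuboid.
total = cube[("Color", "Date")][("Yellow", "2002-09")]
```

With three dimensions the sketch materializes all 2^3 = 8 cuboids, which is exactly the storage problem the rest of the talk addresses.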
Introduction -- OLAP
Note that this refers to the model only: not every entry is materialized. Every combination of dimensions is included, so what is shown here is just one example cuboid; there are other cuboids as well.
Introduction -- OLAP
Research issue: how to store this information with high space efficiency and/or high query speed.
ROLAP and MOLAP
In a data cube, we have to store the combinations of dimension values and the associated aggregate values. ROLAP (Relational OLAP) stores the entries of the mapping in relational tables. MOLAP (Multi-dimensional OLAP) stores the entries in a multi-dimensional array.
ROLAP and MOLAP
[Figure: a materialized MOLAP multidimensional array (for colour and shop)]
ROLAP and MOLAP
[Figure: a ROLAP table that stores the entries (for colour and shop)]
ROLAP and MOLAP
MOLAP advantages: Fast: one retrieval per point query (the dimension values are used to calculate the address of the stored aggregate value). May be space efficient if the cube is very dense (since the dimension values are not stored explicitly in every tuple).
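The address calculation behind a MOLAP point query can be sketched as follows. The value lists, the shop names other than TST, and the function names are hypothetical; a real MOLAP engine would also handle chunking and more than two dimensions.

```python
# Hypothetical dimension domains; the position of a value is its coordinate.
SHOPS = ["TST", "MK", "CWB"]
COLORS = ["Red", "Yellow", "Blue"]


def molap_address(shop, color):
    """Row-major offset of the cell for (shop, color) in a flat array."""
    return SHOPS.index(shop) * len(COLORS) + COLORS.index(color)


# Dense array: one slot per (shop, color) pair, zero where nothing was sold.
cells = [0] * (len(SHOPS) * len(COLORS))
cells[molap_address("TST", "Yellow")] = 10_000


def point_query(shop, color):
    """One address computation, one retrieval."""
    return cells[molap_address(shop, color)]
```

Note that the array holds 9 slots even though only one is non-zero, which is the sparsity cost of this layout.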
ROLAP and MOLAP
MOLAP disadvantages: Space inefficient if the cube is sparse (has many zero entries), especially in high-dimensional cases; chunking eases this. May need a lot of scans for a large range query (i.e., one that involves many dimension values).
ROLAP and MOLAP
ROLAP: indexes are built on the table to improve query performance, e.g., a B+-Tree on each dimension, or an R-Tree over all points.
ROLAP advantage: space efficient (zero entries are not stored).
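A rough sketch of the ROLAP side, under loud assumptions: the table contents are made up, and a plain dictionary stands in for the B+-Tree on each dimension. Only non-empty combinations get a row, and a point query intersects the row-id sets returned by the single-dimension indexes.

```python
from collections import defaultdict

# Hypothetical ROLAP table for the (shop, color) cuboid.
TABLE = [
    ("TST", "Yellow", 10_000),
    ("MK", "Yellow", 50_000),
    ("TST", "Red", 30_000),
]


def build_index(column):
    """Map each value in a column to the set of row ids that hold it.
    A dict is used here in place of a real B+-Tree."""
    index = defaultdict(set)
    for row_id, row in enumerate(TABLE):
        index[row[column]].add(row_id)
    return index


shop_index = build_index(0)
color_index = build_index(1)


def point_query(shop, color):
    """Intersect the per-dimension row-id sets, then sum the matching rows."""
    row_ids = shop_index.get(shop, set()) & color_index.get(color, set())
    return sum(TABLE[i][2] for i in row_ids)
```

The intersection step is where the cost grows: each per-dimension lookup can return many row ids before the intersection narrows them down.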
ROLAP and MOLAP
ROLAP disadvantages: Indexes such as the R-Tree may not be effective on high-dimensional data. Many joins are required to produce the result (if single-dimension indexes are used). Intermediate results may be large.
Can we do better?
SF-Tree
Stands for Signature File Tree. Stores a mapping from objects to integers. Flexible and efficient, with a statistical accuracy guarantee.
SF-Tree
Basic idea: Divide the objects into groups with the same (or similar) associated number. Checking the associated number of an object is then the same as checking which group the object belongs to. Signature files are used to make existence checking efficient; trees are used to improve accuracy and speed.
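The grouping idea can be illustrated with a deliberately simplified sketch. It is not the SF-Tree itself: the tree levels are left out, each group gets one Bloom-filter-style signature, and the class names, bit-array size and hash construction are all assumptions made for this example.

```python
import hashlib


class Signature:
    """A Bloom-filter-style signature: a bit array plus several hash functions."""

    def __init__(self, n_bits=256, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, obj):
        # Derive n_hashes bit positions from salted SHA-256 digests.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{obj!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, obj):
        for pos in self._positions(obj):
            self.bits |= 1 << pos

    def might_contain(self, obj):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(obj))


class GroupedSignatures:
    """One signature per associated number (a flat stand-in for the tree)."""

    def __init__(self):
        self.groups = {}  # associated number -> Signature

    def insert(self, obj, number):
        self.groups.setdefault(number, Signature()).add(obj)

    def lookup(self, obj):
        """Return every number whose group signature matches obj."""
        return [n for n, sig in self.groups.items() if sig.might_contain(obj)]
```

A lookup never misses the true group, but it can occasionally report an extra one; that is the sense in which the accuracy is only statistical. The object itself is never stored, so the space used does not depend on its size.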
SF-Tree
Properties (advantages): Space efficient, independent of object size. Flexible: space, speed and accuracy can be traded off against one another. Speed is independent of the number of objects.
Using SF-Tree for OLAP
The information in OLAP can be modeled as a mapping from objects (dimension values) to numbers (aggregate values). Thus we can use SF-Tree to store this mapping.
Using SF-Tree for OLAP
For example, (TST, Yellow) is an object and its associated number is 10k; we can insert it into an SF-Tree. Space requirement: m/ln 2 bits per object per level. This is independent of object size (i.e., dimensionality) and smaller than ROLAP, especially for high-dimensional data.
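To get a feel for the m/ln 2 figure, the arithmetic can be sketched as below. The slides do not define m, so it is treated purely as a given parameter; in a standard Bloom filter, k/ln 2 bits per element is the size at which k hash functions are optimal, so m may play that role here, but that reading is an assumption. The ROLAP estimate uses assumed 32-bit widths for each dimension value and for the measure, and both function names are hypothetical.

```python
import math


def signature_bits(n_objects, m, levels=1):
    """Total bits at m / ln 2 bits per object per level."""
    return math.ceil(n_objects * levels * m / math.log(2))


def rolap_bits(n_objects, n_dims, bits_per_value=32, bits_per_measure=32):
    """Bits for a table that stores every dimension value plus one measure.
    The 32-bit widths are assumptions, not figures from the talk."""
    return n_objects * (n_dims * bits_per_value + bits_per_measure)


# One million objects, m = 8, a single level: about 11.5 bits per object.
sf = signature_bits(1_000_000, m=8)
# The same objects in a 10-dimensional table: 352 bits per row.
table = rolap_bits(1_000_000, n_dims=10)
```

The first figure has no dimensionality term at all, while the second grows linearly with the number of dimensions, which is why the gap widens for high-dimensional data.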
Using SF-Tree for OLAP
Advantages: More space efficient than ROLAP (and definitely much better than MOLAP). Quicker than ROLAP for point queries (no joins needed).
Disadvantage: Range queries require scanning all possible points in the query range (as in MOLAP).
Using SF-Tree for OLAP
To avoid this disadvantage, we borrow the idea of the MRA-Tree [LazM01]. MRA-Tree (Multi-Resolution Aggregate Tree): in a data/space partitioning tree, add aggregates to all internal nodes. One example is a quad-tree with aggregates.
MRA-Tree
[Figure: MRA-Tree example; surviving labels: 580k, 60k, 160k]
MRA-Tree
For range queries, the number of accesses is reduced. Extra space is required. Leaf nodes need not contain only one record: the tree size drops significantly if we increase the number of points in a leaf node page.
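The way stored aggregates cut down accesses can be shown with a small sketch. For brevity it uses a one-dimensional binary space partition rather than a quad-tree, answers exact range sums only, and is not the structure of [LazM01]; the class, the function, the sample points and the leaf capacity are all hypothetical.

```python
class Node:
    """A space-partitioning node over [lo, hi) that also stores its aggregate."""

    def __init__(self, lo, hi, points, leaf_capacity=2):
        self.lo, self.hi = lo, hi
        self.total = sum(v for _, v in points)  # aggregate kept in every node
        self.points = None
        self.children = []
        if len(points) <= leaf_capacity or hi - lo <= 1:
            self.points = points                # leaf: keep the raw points
        else:
            mid = (lo + hi) // 2
            self.children = [
                Node(lo, mid, [p for p in points if p[0] < mid], leaf_capacity),
                Node(mid, hi, [p for p in points if p[0] >= mid], leaf_capacity),
            ]


def range_sum(node, a, b, visited):
    """Sum of values with a <= x < b; visited collects the nodes touched."""
    if b <= node.lo or node.hi <= a:
        return 0                                # disjoint: prune this subtree
    visited.append(node)
    if a <= node.lo and node.hi <= b:
        return node.total                       # fully covered: use aggregate
    if node.points is not None:
        return sum(v for x, v in node.points if a <= x < b)  # scan the leaf
    return sum(range_sum(c, a, b, visited) for c in node.children)


# Hypothetical (coordinate, value) points on the domain [0, 8).
POINTS = [(0, 10), (1, 50), (2, 30), (5, 20), (6, 40), (7, 5)]
root = Node(0, 8, POINTS)
```

A query covering the whole domain touches only the root. Raising `leaf_capacity` makes the tree shallower at the price of more scanning inside each leaf, which is the trade-off noted above.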
SF-Tree with MRA-Tree
SF-Tree is more space efficient than ROLAP. Use SF-Tree to store the leaf nodes => each page can thus store more points => tree size/depth is reduced => fewer page accesses per query.
SF-Tree with MRA-Tree
Advantage: More space efficient, so the structure may be small enough to fit in memory, which reduces page accesses, especially for high-dimensional data.
Disadvantage: We still need to scan the query area within leaf nodes (vs. scanning data points when using ROLAP).
Conclusion
SF-Tree is space efficient and can be used to store a data cube (or may be used as a ROLAP index). Although analysis shows that SF-Tree is slow for range queries, we try to incorporate the idea of the MRA-Tree into SF-Tree to increase the speed.
References
[GrayBLP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In ICDE'96.
[LazM01] I. Lazaridis and S. Mehrotra. Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure. In SIGMOD'01.