Horizontal data sets: the number of attributes ranges from the same order as the number of records to several orders of magnitude higher. Example: a genetic data set can have 10,000 attributes and only 100 records. With 10,000 attributes there are up to 100 million combinations of two attributes and up to 1 trillion three-attribute sets!
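The combination counts above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable names are made up here, and the "up to" figures are read as loose upper bounds (n^2 and n^3), while the number of distinct unordered attribute sets is somewhat smaller.

```python
import math

n_attributes = 10_000  # attribute count from the genetic data set example

# Loose upper bounds: every ordered choice of 2 or 3 attributes.
pairs_upper = n_attributes ** 2      # 100 million
triples_upper = n_attributes ** 3    # 1 trillion

# Distinct unordered attribute sets (binomial coefficients).
pairs_exact = math.comb(n_attributes, 2)
triples_exact = math.comb(n_attributes, 3)

print(f"pairs:   up to {pairs_upper:,}, distinct {pairs_exact:,}")
print(f"triples: up to {triples_upper:,}, distinct {triples_exact:,}")
```

Either way, the candidate space grows far faster than the 100 records available, which is why attribute-driven enumeration becomes impractical on horizontal data sets.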
Data-Driven Algorithm: constructing the max-conf kernel for small data sets. Input: i) a database DB, ii) a fixed consequent C. Output: a set R of rules such that for any rule of the form X->C there exists a rule X'->C in R, where X' is a superset of X and X'->C has a higher confidence than X->C.
Algorithm:
// DB(C) is the set of records that satisfy the consequent C
// RS is a working set that maintains the current subset of records satisfying the consequent
// COMMON is the set of descriptors common to all records in RS
MaxConfKernelSet(DB, C, DB(C), RS, COMMON) {
  i = size(RS) + 1;
  if (i == 1) { COMMON = descriptors in the ith record in DB(C); }
  RS = RS \union {ith record in DB(C)};
  while (i <= size(DB(C))) do {
    Delete from COMMON the descriptors not shared by the ith record;
    Compute the support of the records satisfying {COMMON - C};
    Compute the confidence of COMMON - C -> C;
    if ((COMMON - C) != null) {
      if sufficient support and not a duplicate, output "COMMON - C -> C [support, conf]";
      MaxConfKernelSet(DB, C, DB(C), RS, COMMON);
    }
    RS = RS - {ith record in DB(C)};
    i++;
    RS = RS \union {ith record in DB(C)};
  }
}
Invoke: MaxConfKernelSet(DB, C, DB(C), null, null); // RS and COMMON are empty initially
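The idea behind the algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the slide's exact control flow: records are modelled as sets of descriptors, the function name max_conf_kernel and the min_support parameter are hypothetical, and support is taken to be the number of records that satisfy the antecedent. The sketch enumerates subsets of the records that satisfy C, intersects their descriptors to obtain the common antecedent, and records each distinct antecedent once.

```python
def max_conf_kernel(db, consequent, min_support=1):
    """Return {antecedent: (support, confidence)} for rules antecedent -> consequent.

    db is a list of sets of descriptors; consequent is a single descriptor.
    """
    # Records satisfying the consequent, with the consequent itself removed.
    positives = [frozenset(r) - {consequent} for r in db if consequent in r]
    rules = {}

    def extend(start, common):
        # Grow the working record set with each later record in turn.
        for i in range(start, len(positives)):
            new_common = positives[i] if common is None else common & positives[i]
            if not new_common:
                continue  # nothing shared: no rule, and adding records cannot help
            if new_common not in rules:  # skip duplicates
                covered = [r for r in db if new_common <= set(r)]
                support = len(covered)
                if support >= min_support:
                    hits = sum(1 for r in covered if consequent in r)
                    rules[new_common] = (support, hits / support)
            extend(i + 1, new_common)

    extend(0, None)
    return rules


# Hypothetical toy database; "C" is the fixed consequent.
demo_db = [{"a", "b", "C"}, {"a", "c", "C"}, {"a", "b"}, {"b", "c"}]
demo_rules = max_conf_kernel(demo_db, "C")
for antecedent, (support, conf) in sorted(demo_rules.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(antecedent), "-> C", f"[support={support}, conf={conf:.2f}]")
```

Because the enumeration is driven by the records that satisfy C rather than by the attributes, the work depends on the number of such records, which is the small quantity in a horizontal data set. The subset enumeration is still exponential in that number, so the approach is only suitable for small data sets.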
OLAP and Statistical Databases. Statistical databases (SDB): from the early 1980s; multidimensional data sets, concerned with summarization over the dimensions of the data sets; 2-D representations of census data, socioeconomic data, etc. OLAP (on-line analytical processing): from the mid 1990s.
Multi-dimensional Statistical Table
2-D representation of statistical data
A graph model for statistical data
A schema for statistical data
More schemas
Relational representation of statistical object
Automatic aggregation concept
Terms in SDB and OLAP
SDB and OLAP operators
Completeness of statistical algebra
Overlapping and time-varying categories
Physical organization
Encoding column category values
Array linearization
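As an illustration of array linearization, the following sketch maps a multidimensional cell index to a single offset in row-major order and back. It assumes a dense array with fixed dimension sizes; the function names linearize and delinearize are made up for this example and do not come from the slides.

```python
def linearize(index, shape):
    """Map a multidimensional index to a row-major offset."""
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dimension of size {n}")
        offset = offset * n + i
    return offset


def delinearize(offset, shape):
    """Invert linearize: recover the multidimensional index from the offset."""
    index = []
    for n in reversed(shape):
        offset, i = divmod(offset, n)
        index.append(i)
    return tuple(reversed(index))


cube_shape = (3, 4, 5)  # hypothetical sizes of three dimensions
cell = (1, 2, 3)
position = linearize(cell, cube_shape)
print(position, delinearize(position, cube_shape))
```

With this mapping the category values of a cell do not need to be stored alongside it, since they can be recomputed from the cell's position.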
Header compression
Lattice of materialization
Partitioning of a data cube into subcubes
Cube operator
Data Cube – shortcomings of SQL
Sales Roll Up by Model by Year and by color
Using ALL value
3 dimensional rollup in SQL
Cross-tabulation in SQL
Cross Tabulation
CUBE operator
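A rough sketch of what a CUBE-style aggregation computes: one group-by for every subset of the dimensions, with the placeholder value ALL standing in for each dimension that has been aggregated away. The sales rows, the column names, and the function name cube are hypothetical and chosen only to match the model/year/color example; this is not the SQL operator itself.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for "aggregated over this dimension"


def cube(rows, dims, measure):
    """Sum `measure` for every combination of kept and aggregated dimensions."""
    totals = defaultdict(int)
    for row in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(dims, k):
                key = tuple(row[d] if d in kept else ALL for d in dims)
                totals[key] += row[measure]
    return dict(totals)


# Hypothetical sales data.
sales = [
    {"model": "Chevy", "year": 1994, "color": "black", "units": 50},
    {"model": "Chevy", "year": 1994, "color": "white", "units": 40},
    {"model": "Chevy", "year": 1995, "color": "black", "units": 85},
    {"model": "Ford", "year": 1994, "color": "black", "units": 60},
]
sales_cube = cube(sales, ["model", "year", "color"], "units")
print(sales_cube[(ALL, ALL, ALL)])      # grand total
print(sales_cube[("Chevy", ALL, ALL)])  # total for one model
```

With three dimensions each row contributes to 2^3 = 8 groupings, which is why expressing the same result as separate GROUP BY queries combined with UNION becomes unwieldy as dimensions are added.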
Support of histograms
A 3D data cube
ALL value and decoration field
Decorations
ROLLUP operator
Percentage of total as an aggregate function
Indices
Star schema