March DGRC FedStats Visit
Aggregation in Main Memory
  Kenneth A. Ross
  Columbia University
Research Experience
  - Complex query processing
  - Data warehousing
  - Main memory databases
  Students: Kazi Zaman, Junyan Ding
Scenario A
  [Diagram labels: User, Query, Mediator, Unified Results, Main-Memory DBMS, Traditional DBMS, ...]
Scenario B
  [Diagram labels: User, Sequence of Interactive Queries, Main Memory DB, Queries, Data Request, Mediator, Unified Results, Web, Traditional DBMS, ...]
Scenario C
  [Diagram labels: User, Graphical User Interface, Dynamic Query, Main Memory DB, Data Request, Mediator, Unified Results, Web, Traditional DBMS, ...]
Outline
  - Introduction to datacubes
  - Frameworks for querying cubes
  - The main-memory-based framework
  - Experimental results
  - Conclusions and plan
The CUBE BY Operator
  Input relation:
    State  Year  Grade    Sales
    CA     1997  Regular     90
    NY     1997  Premium     70
    CA     1998  Premium     65
    NY     1998  Premium     95
  Result of CUBE BY (sum Sales), showing some of the additional records:
    State  Year  Grade    Sales
    CA     1997  Regular     90
    CA     1997  ALL         90
    ALL    1997  Regular     90
    CA     ALL   Regular     90
    ALL    1997  ALL        160
    ALL    ALL   Regular     90
    CA     ALL   ALL        155
    ALL    ALL   ALL        320
    ...
  Large increase in total size, especially with many dimensions.
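The effect of CUBE BY on the four-row relation above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the talk: the function name cube_by_sum and the tuple layout (State, Year, Grade, Sales) are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"

# Base relation from the slide: (State, Year, Grade, Sales).
rows = [
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
]

def cube_by_sum(rows, num_dims=3):
    """Return {(d1, ..., dn): sum} for every keep-or-ALL choice per dimension."""
    cube = defaultdict(int)
    for row in rows:
        dims, measure = row[:num_dims], row[num_dims]
        # Each mask says, per dimension, whether the value is rolled up to ALL.
        for mask in product((False, True), repeat=num_dims):
            key = tuple(ALL if rolled else v for v, rolled in zip(dims, mask))
            cube[key] += measure
    return dict(cube)

cube = cube_by_sum(rows)
print(cube[("CA", ALL, ALL)], cube[(ALL, 1997, ALL)], cube[(ALL, ALL, ALL)])
print(len(rows), "base records ->", len(cube), "cube records")
```

The three printed sums correspond to the slide's records (CA, ALL, ALL), (ALL, 1997, ALL) and (ALL, ALL, ALL). Each base record contributes to 2^d cube records for d dimensions, which is why the cube grows quickly as dimensions are added.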
Lattice Representation
  State, Year, Grade
  State, Year | State, Grade | Year, Grade
  State | Year | Grade
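The lattice can be enumerated mechanically: every subset of the dimensions is one cuboid, and a cuboid can be computed from any cuboid that keeps one more dimension. The sketch below is illustrative only; the helper names lattice_levels and parents are hypothetical, and it also lists the empty grouping (every dimension rolled up to ALL), which does not appear among the slide's labels.

```python
from itertools import combinations

dims = ("State", "Year", "Grade")

def lattice_levels(dims):
    """Group every subset of dims (every cuboid) by how many dimensions it keeps."""
    return {
        k: [frozenset(c) for c in combinations(dims, k)]
        for k in range(len(dims), -1, -1)
    }

def parents(cuboid, dims):
    """Cuboids one level finer, from which `cuboid` follows by rolling up one dimension."""
    return [cuboid | {d} for d in dims if d not in cuboid]

levels = lattice_levels(dims)
for k, cuboids in levels.items():
    print(k, [sorted(c) for c in cuboids])
```

For three dimensions this yields 1 + 3 + 3 + 1 = 8 cuboids, with the finest cuboid (State, Year, Grade) at the top.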
Modeling Queries
  Slice queries ask for a single aggregate record:
    SELECT State, Year, SUM(Sales)
    FROM BLS
    GROUP BY State, Year
    HAVING State = 'NY' AND Year = '1998'
Existing Frameworks
  [Lattice: State, Year, Grade | State, Year | State, Grade | Year, Grade | State | Year | Grade]
  - Choose a subset of the cube to materialize, based on the workload.
  - Materialize on disk.
  - The appropriate record is recovered or computed for each incoming slice query.
  Drawbacks:
  - Ignores the clustering of the relation on disk.
  - The smallest unit of materialization is too big.
Our Approach
  [Lattice: State, Year, Grade | State, Year | State, Grade | Year, Grade | State | Year | Grade]
  - The full cube is often larger than available memory, but the finest-granularity aggregate may fit.
  - Any record can be computed without having to go to disk.
  - How should the finest granularity be organized?
Framework
  [Diagram labels: Query q; Level-1 Store (selected coarse records in hash table); Level-2 Store (finest-granularity cuboid: slot directory, records in linked lists)]
The Level-1 Store
  - Records are (key, value) pairs stored in a hash table.
  - Records can contain ALLs.
  - Given a query Q, form the composite key and check the level-1 store (constant time).
  - If the key is not found, use the level-2 store.
    Key  Value
    a1      55
    b2      34
    c2      12
    ...
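The level-1 lookup described above can be sketched with a Python dict standing in for the hash table. The stored records, the function name answer, and the level2_lookup callback are all hypothetical; the sample values are taken from the CUBE BY example earlier in the deck.

```python
ALL = "ALL"

# Hypothetical level-1 contents: a few coarse records chosen for materialization.
# The composite key is the tuple of dimension values, with ALL where rolled up.
level1 = {
    ("CA", ALL, ALL): 155,
    (ALL, 1997, ALL): 160,
    (ALL, ALL, ALL): 320,
}

def answer(query, level1, level2_lookup):
    """Check the level-1 hash table first; fall back to level 2 on a miss."""
    key = tuple(query)
    if key in level1:          # expected constant-time hash probe
        return level1[key]
    return level2_lookup(key)  # assumed callback that aggregates from level 2

# Usage with a placeholder fallback that just reports the miss.
print(answer(("CA", ALL, ALL), level1, lambda key: None))
print(answer(("NY", ALL, ALL), level1, lambda key: "miss: " + str(key)))
```

Using the full tuple as the key means a query with ALLs and a query without them are probed in exactly the same way.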
The Level-2 Store
  [Diagram labels: Level-2 Store, finest-granularity cuboid, slot directory, records in linked lists]
  - The slot directory is organized as a multidimensional array: level2[sz1][sz2][sz3][sz4]
  - Each slot points to a linked list of elements.
  - Records are placed according to a set of mapping functions H.
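A minimal sketch of building such a slot directory follows. Everything specific here is an assumption for illustration: three dimensions instead of four, the slot counts, Python lists in place of linked lists, and mapping functions that take a value's position in its domain modulo the slot count. The talk does not say how H is actually chosen.

```python
def make_directory(sizes):
    """Build a nested list shaped like level2[sz1][sz2]...; each leaf is an empty record list."""
    if len(sizes) == 1:
        return [[] for _ in range(sizes[0])]
    return [make_directory(sizes[1:]) for _ in range(sizes[0])]

def make_mapping(values, size):
    """Hypothetical mapping function: position of the value in its domain, modulo the slot count."""
    codes = {v: i for i, v in enumerate(values)}
    return lambda v: codes[v] % size

def insert(directory, H, record):
    """Place a finest-granularity record (dims..., measure) in the list its slot points to."""
    *dims, _measure = record
    slot = directory
    for h, v in zip(H, dims):
        slot = slot[h(v)]
    slot.append(record)

sizes = (2, 1, 2)  # assumed slot counts for State, Year, Grade
H = [
    make_mapping(["CA", "NY"], sizes[0]),
    make_mapping([1997, 1998], sizes[1]),
    make_mapping(["Regular", "Premium"], sizes[2]),
]
level2 = make_directory(sizes)
for rec in [("CA", 1997, "Regular", 90), ("NY", 1997, "Premium", 70),
            ("CA", 1998, "Premium", 65), ("NY", 1998, "Premium", 95)]:
    insert(level2, H, rec)

print(level2[1][0][1])
```

Because the Year dimension has a single slot in this configuration, both NY Premium records land in the same list, which is why a lookup still has to compare each record against the query values.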
Using the Level-2 Store: Query Q without ALLs
  Q = (a3, b4, c2, d5)
    a3 -> Slot 4
    b4 -> Slot 3
    c2 -> Slot 7
    d5 -> Slot 1
  Access the list denoted by level2[4][3][7][1]; aggregate the records matching (a3,b4,c2,d5).
Using the Level-2 Store: Query Q with ALLs
  Q = (a3, ALL, c2, ALL)
    a3  -> Slot 4
    ALL -> list of slots
    c2  -> Slot 7
    ALL -> list of slots
  Access the lists matching level2[4][*][7][*]; aggregate the records matching (a3,*,c2,*).
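Both lookup cases can be handled by one routine: a concrete value selects a single slot in its dimension, while ALL selects every slot. The sketch below is an assumption-laden illustration rather than the talk's implementation: it uses three dimensions, the sales data from the CUBE BY slide, invented slot counts, Python lists instead of linked lists, and a hypothetical mapping (domain position modulo slot count).

```python
from itertools import product

ALL = "ALL"
SIZES = (2, 1, 2)  # assumed slot counts for State, Year, Grade
DOMAINS = (["CA", "NY"], [1997, 1998], ["Regular", "Premium"])
# Hypothetical mapping functions H: domain position modulo slot count.
H = [lambda v, d=d, s=s: d.index(v) % s for d, s in zip(DOMAINS, SIZES)]

rows = [("CA", 1997, "Regular", 90), ("NY", 1997, "Premium", 70),
        ("CA", 1998, "Premium", 65), ("NY", 1998, "Premium", 95)]

# Slot directory as a nested list; each leaf holds the records mapped to that slot.
level2 = [[[[] for _ in range(SIZES[2])] for _ in range(SIZES[1])]
          for _ in range(SIZES[0])]
for rec in rows:
    i, j, k = (h(v) for h, v in zip(H, rec[:3]))
    level2[i][j][k].append(rec)

def query(q):
    """Sum Sales over records matching q, where q may contain ALL in any position."""
    # A concrete value selects one slot; ALL selects every slot in that dimension.
    slot_choices = [range(s) if v == ALL else [h(v)]
                    for v, h, s in zip(q, H, SIZES)]
    total = 0
    for i, j, k in product(*slot_choices):
        for rec in level2[i][j][k]:
            # Several values can share a slot, so each record is still compared to q.
            if all(qv == ALL or qv == rv for qv, rv in zip(q, rec[:3])):
                total += rec[3]
    return total

print(query(("NY", 1998, "Premium")))  # no ALLs: one list is visited
print(query(("CA", ALL, ALL)))         # ALLs: every list under State slot 0
```

The more ALLs a query contains, the more lists it visits, which is the cost that motivates keeping selected coarse records in the level-1 store.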
Demo
  - Shows a multidimensional dataset (a subset of the columns of the 5% Census sample for NY in 1990).
  - The user asks queries and gets fast answers.
  - Future: the user interface asks many queries, with the display changing interactively.
Experimental Results
  Scanning all records takes 194 ms.
Importance of Work
  - Aggregation is fundamental to analysis.
  - Make analysis interactive, even for many dimensions.
  - Make a variety of aggregate granularities available, where possible.
Contributions
  - A main-memory-based framework for answering datacube queries efficiently.
  - Query performance in the 2-4 ms range, which is faster than going to disk.
Plan
  - Integrate with the user interface to generate dynamic queries.
  - Self-tuning capability.
  - Multiple data sets.
  - Work with agencies to generate value:
    - for intra-agency analysis
    - for enhanced data dissemination