Data Warehousing
I am Xinyuan Niu. I am here because I love to give presentations.
OUTLINE
◉ Decision Support Systems
◉ Data Warehousing
   Concepts of Data Warehouse
   Components of Data Warehouse
   Warehouse Schema
   Storage Features: Column Oriented
1. Decision Support Systems
Information Explosion Age
Database Application Classification
Transaction Processing Systems
◉ Record information about transactions.
◉ For example: product sales information for companies, or course registration and grade information for universities.
◉ Organizations have accumulated a vast amount of information generated by these systems.
Decision Support Systems
◉ Get high-level information out of the detailed information stored in transaction-processing systems.
◉ Use the high-level information to make a variety of decisions.
Transaction Information Example
Retailer
◉ Purchase: customer, item purchased, price paid, date on which the purchase was made
◉ Item: name, manufacturer, color, size, etc.
◉ Customer: credit history, annual income, age
The information can range up to hundreds of gigabytes or even terabytes.
Decision Making Based on Transaction Information
Transaction information feeds decision making, e.g. what items to stock and what discounts to offer.
Decision Making Example: Precision Marketing
◉ Customer input: age, gender, job, purchase pattern, etc.
◉ Decision-making system: analyzes the input transaction information.
◉ Decision made according to the decision-making system: show specific ads to specific customer groups.
Decision Making Issues
◉ Some decision-making analyses cannot be expressed, or cannot be expressed easily, as general SQL queries; several SQL extensions have been proposed.
◉ Database query languages are not suited to detailed statistical analysis of data; specialized software such as SAS is used instead.
◉ Data used for decision making come from different sources.
◉ Knowledge-discovery techniques discover rules and patterns from data automatically (data mining).
2. Data Warehousing
Concepts
◉ A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
◉ Data warehouses provide the user with a single consolidated interface to data, making decision-support queries easier to write.
Architecture
When and How to Gather Data
Example: panel screen industry
◉ ARRAY Factory: A_DB
◉ OLED Factory: C_DB
◉ Module Factory: B_DB
One company owns 3 factories and therefore 3 databases. Analysis is cumbersome when the data must be extracted from 3 different sources.
When and How to Gather Data
◉ Source data (transaction data) are updated in real time, e.g. when a customer buys an item, the source database is updated at the same time.
◉ The warehouse will never be quite up to date with the sources.
◉ The warehouse periodically sends a request for new data to the sources, e.g. an update every night.
What Schema to Use
◉ Data sources have different schemas, or even use different data models.
◉ Before storing data, the warehouse performs schema integration and converts the data into the integrated schema.
◉ In effect, the data stored in the warehouse are a materialized view of the data at the sources.
What Schema to Use
[Figure: position grids at the ARRAY, OLED, and MODULE stages. The ARRAY grid runs from (0, 0) to (3, 1); CUT 1 and CUT 2 each have their own grid from (0, 0) to (1, 1).]
What Schema to Use
TABLE: ASHEET
sheet_id  X_axis  Y_axis
A

TABLE: CSHEET
Cut_id  sheet_id  X_axis  Y_axis
C001    A001      0       0
C002    A001      0       1

TABLE: BSHEET
panel_id  Cut_id
B001      C001
B002      C001
What Schema to Use
Schema Integration
SHEET_ID  CUT_ID  PANEL_ID  X_AXIS  Y_AXIS  FAB_ID
A001      NULL    NULL      0       0       A        (ARRAY data)
A001      C001    NULL      0       0       C        (OLED data)
A001      C001    B001      NULL    NULL    B        (Module data)
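As a rough sketch of the schema integration step, the three source tables can be merged into the unified schema as follows. The function name `integrate` and the lowercase field names are assumptions made for illustration; the ASHEET row values are taken from the first row of the integrated table, and the FAB_ID letters follow that table.

```python
# Source tables, one per factory database.
asheet = [{"sheet_id": "A001", "x_axis": 0, "y_axis": 0}]
csheet = [
    {"cut_id": "C001", "sheet_id": "A001", "x_axis": 0, "y_axis": 0},
    {"cut_id": "C002", "sheet_id": "A001", "x_axis": 0, "y_axis": 1},
]
bsheet = [
    {"panel_id": "B001", "cut_id": "C001"},
    {"panel_id": "B002", "cut_id": "C001"},
]

def integrate(asheet, csheet, bsheet):
    """Convert all three sources into one schema; missing attributes become None (NULL)."""
    rows = []
    for a in asheet:
        rows.append({"sheet_id": a["sheet_id"], "cut_id": None, "panel_id": None,
                     "x_axis": a["x_axis"], "y_axis": a["y_axis"], "fab_id": "A"})
    for c in csheet:
        rows.append({"sheet_id": c["sheet_id"], "cut_id": c["cut_id"], "panel_id": None,
                     "x_axis": c["x_axis"], "y_axis": c["y_axis"], "fab_id": "C"})
    # BSHEET has no sheet_id, so look it up through the cut it belongs to.
    cut_to_sheet = {c["cut_id"]: c["sheet_id"] for c in csheet}
    for b in bsheet:
        rows.append({"sheet_id": cut_to_sheet.get(b["cut_id"]), "cut_id": b["cut_id"],
                     "panel_id": b["panel_id"], "x_axis": None, "y_axis": None,
                     "fab_id": "B"})
    return rows

integrated = integrate(asheet, csheet, bsheet)
```

After integration, every row can be traced back to its sheet with a single lookup on `sheet_id`, regardless of which factory produced it.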
Data Transformation
◉ Sometimes data must be transformed before being stored in the warehouse.
◉ For example, changing the units of measure, or converting the data to a different schema by joining data from multiple source relations (see the previous example).
◉ Data warehouses typically have graphical tools to support data transformation.
Data Cleansing
◉ Correcting and preprocessing data.
◉ Fuzzy lookup: e.g. correct misspelled names and addresses or incorrect postal codes, to a reasonable extent, by consulting a database of street names and postal codes in each city.
◉ Merge-purge operation / deduplication: e.g. address lists collected from multiple sources may contain duplicates that need to be eliminated.
◉ Householding: e.g. records for multiple individuals in a house may be grouped together so that only one mailing is sent to each house.
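The three cleansing operations can be illustrated with a small Python sketch. Everything here is hypothetical sample data: `KNOWN_STREETS` stands in for the reference database of street names, and the helper names are made up for this example. The fuzzy lookup uses `difflib.get_close_matches` from the standard library, which is just one simple way to approximate it.

```python
from collections import defaultdict
from difflib import get_close_matches

# Stand-in for the reference database of valid street names (assumption).
KNOWN_STREETS = ["Main Street", "Oak Avenue", "Pine Road"]

def fix_street(street):
    """Fuzzy lookup: replace a street with its closest known spelling, if any."""
    matches = get_close_matches(street, KNOWN_STREETS, n=1, cutoff=0.6)
    return matches[0] if matches else street

def normalize(text):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(text.lower().split())

def deduplicate(records):
    """Merge-purge: keep the first record for each normalized (name, street) pair."""
    seen, unique = set(), []
    for name, street in records:
        key = (normalize(name), normalize(street))
        if key not in seen:
            seen.add(key)
            unique.append((name, street))
    return unique

def household(records):
    """Householding: group individuals by address so each house gets one mailing."""
    groups = defaultdict(list)
    for name, street in records:
        groups[normalize(street)].append(name)
    return dict(groups)

records = [
    ("Ann Lee", "Main Street"),
    ("ann  lee", "main street"),   # duplicate from a second source
    ("Bob Lee", "Mian Stret"),     # misspelled street
    ("Cy Wu", "Oak Avenue"),
]
cleaned = [(name, fix_street(street)) for name, street in records]
unique = deduplicate(cleaned)
households = household(unique)
```

Here the duplicate Ann Lee record is purged, the misspelled street is corrected, and Ann and Bob end up in the same household.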
How to Propagate Updates
◉ Updates on relations at the data source must be propagated to the data warehouse.
◉ Case 1: the relations at the data warehouse are exactly the same as those at the data source, so updates are copied directly.
◉ Case 2: the relations differ between the data warehouse and the data source, so propagation becomes a view-maintenance problem.
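For Case 2, one common approach is to maintain the warehouse view incrementally from the inserted and deleted source tuples instead of recomputing it. The sketch below assumes a hypothetical warehouse view holding the total sales per item; the name `apply_delta` and the sample data are illustrative only.

```python
def apply_delta(view, inserted=(), deleted=()):
    """Update a per-item [total, count] view from source inserts and deletes."""
    for item, amount in inserted:
        entry = view.setdefault(item, [0, 0])
        entry[0] += amount
        entry[1] += 1
    for item, amount in deleted:
        entry = view[item]
        entry[0] -= amount
        entry[1] -= 1
        if entry[1] == 0:  # no source tuples left for this item
            del view[item]

# Warehouse view: item -> [total sales, number of source tuples]
view = {}
apply_delta(view, inserted=[("shirt", 20), ("shirt", 30), ("pants", 40)])
apply_delta(view, deleted=[("pants", 40)])
```

Keeping a tuple count next to each total lets the view drop an item exactly when its last source tuple is deleted, without rescanning the source.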
What Data to Summarize
◉ Raw data generated by a transaction-processing system may be too large to store online.
◉ Maintaining summary data, obtained by aggregation on a relation, is therefore important. E.g. instead of storing data about every sale of clothing, we can store only the total sales of clothing by item_name and category.
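A minimal sketch of such a summary, assuming hypothetical sales rows of the form (item_name, category, amount):

```python
from collections import defaultdict

# Hypothetical raw sales: one row per individual sale.
sales = [
    ("skirt", "womenswear", 30),
    ("skirt", "womenswear", 45),
    ("shirt", "menswear", 20),
    ("shirt", "menswear", 25),
    ("shirt", "menswear", 20),
]

# Aggregate: keep one total per (item_name, category) instead of every sale.
summary = defaultdict(int)
for item_name, category, amount in sales:
    summary[(item_name, category)] += amount
summary = dict(summary)
```

Five raw rows shrink to two summary rows here; on real transaction data the reduction is what makes the summary practical to keep online.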
Warehouse Schemas
◉ Data warehouse schemas are designed for data analysis.
◉ Data are usually multidimensional, with dimension attributes and measure attributes.
◉ Tables containing multidimensional data are called fact tables and are usually very large.
◉ To minimize storage requirements, dimension attributes are usually short identifiers that are foreign keys into other tables, called dimension tables.
Warehouse Schemas
◉ A schema with one fact table and multiple dimension tables is called a star schema.
◉ More complex designs, which have multiple levels of dimension tables, are called snowflake schemas.
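A small star schema can be tried out with Python's built-in `sqlite3` module. The table and column names below (`sales`, `item_info`, `store`) and the sample rows are assumptions chosen for illustration, not a schema taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, keyed by short identifiers.
    CREATE TABLE item_info (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: foreign keys into the dimensions plus measure attributes.
    CREATE TABLE sales (
        item_id  INTEGER REFERENCES item_info(item_id),
        store_id INTEGER REFERENCES store(store_id),
        quantity INTEGER,
        price    REAL
    );

    INSERT INTO item_info VALUES (1, 'shirt', 'clothing'), (2, 'skirt', 'clothing');
    INSERT INTO store VALUES (10, 'Boston'), (20, 'Austin');
    INSERT INTO sales VALUES (1, 10, 2, 20.0), (1, 20, 1, 20.0), (2, 10, 3, 30.0);
""")

# Typical analysis query: join the fact table to its dimensions, then aggregate.
rows = conn.execute("""
    SELECT s.city, i.category, SUM(f.quantity)
    FROM sales AS f
    JOIN item_info AS i ON f.item_id = i.item_id
    JOIN store AS s ON f.store_id = s.store_id
    GROUP BY s.city, i.category
    ORDER BY s.city
""").fetchall()
```

A snowflake schema would go one level further, for example by moving `category` out of `item_info` into its own table referenced by a category identifier.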
Column-Oriented Storage
◉ Row-oriented storage: all attributes of a tuple are stored together, and tuples are stored sequentially in a file. This is the layout used by traditional databases.
◉ Column-oriented storage: each attribute of a relation is stored in a separate file, with values from successive tuples stored at successive positions in the file.
Column-Oriented Storage
Row-oriented storage (one file):
SHEET_ID  CUT_ID  PANEL_ID  X_AXIS  Y_AXIS  FAB_ID
A001      NULL    NULL      0       0       A
A001      C001    NULL      0       0       C
A001      C001    B001      NULL    NULL    B

Column-oriented storage (one file per attribute):
File 1, SHEET_ID: A001, A001, A001
File 2, CUT_ID:   NULL, C001, C001
File 3, FAB_ID:   A, C, B
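The two layouts can be mimicked in Python with in-memory lists standing in for files. This is only a sketch of the idea; `column_store` and `fetch_tuple` are names invented for the example.

```python
columns = ["SHEET_ID", "CUT_ID", "PANEL_ID", "X_AXIS", "Y_AXIS", "FAB_ID"]

# Row-oriented: each tuple is stored whole, one after another.
rows = [
    ("A001", None, None, 0, 0, "A"),
    ("A001", "C001", None, 0, 0, "C"),
    ("A001", "C001", "B001", None, None, "B"),
]

# Column-oriented: one list ("file") per attribute; position i belongs to tuple i.
column_store = {name: [row[i] for row in rows] for i, name in enumerate(columns)}

# A query on one attribute reads a single column and skips the other five.
fab_ids = column_store["FAB_ID"]

def fetch_tuple(column_store, position):
    """Rebuilding one tuple touches every column, one read per attribute."""
    return tuple(column_store[name][position] for name in columns)
```

Reading `fab_ids` needs one list, while `fetch_tuple` needs all six, which mirrors the trade-off between analytical scans and single-tuple access.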
Column-Oriented Storage: Benefits
◉ When a query accesses only a few attributes of a relation that has a large number of attributes, the remaining attributes need not be fetched from disk into memory.
◉ Storing values of the same type together increases the effectiveness of compression, which can greatly reduce both the disk storage cost and the time needed to retrieve data from disk.
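To see why a single column tends to compress well, consider run-length encoding, used here purely as one simple illustration (the slides do not name a specific compression scheme). The column values are hypothetical.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# A column holds values of one type, and neighbouring tuples often repeat them.
sheet_id_column = ["A001"] * 4 + ["A002"] * 3
encoded = run_length_encode(sheet_id_column)
```

Seven stored values shrink to two pairs. The same rows stored tuple by tuple would interleave sheet IDs with other attributes, leaving no long runs to collapse.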
Column-Oriented Storage: Drawbacks
◉ Storing or fetching a single tuple requires multiple I/O operations.
◉ Transaction-processing systems typically manipulate individual tuples, so they use row-oriented storage.
◉ Warehouses rarely access individual tuples; instead they scan and aggregate many tuples, so they use column-oriented storage.
Any questions? Thanks!