The Data Warehouse of the Future
Where to Now?
The landscape has changed with the proliferation of data across the organization: click data, NoSQL, and so on.
Thanks for Attending SQL Saturday Baton Rouge 2016!
About Me
Sr. Product Manager with Idera
  Performance Monitoring of Microsoft BI stack
  Backup and Recovery of Microsoft SQL Server
Geek Sync Presenter
Blog Contributor
HSSUG presenter
Over 25 years experience
  BI, Data Architect
  DBA
  Developer
  Data Analyst
Data Lake or Data Tsunami?
Where in the world are we?
[Diagram: Data sources (OLTP, ERP, CRM, LOB) -> ETL -> Data warehouse -> BI and analytics (Dashboards, Reporting)]
“… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.” – Gartner, “The State of Data Warehousing in 2012”
Is a Data Warehouse “Old School”?
Traditional BI is built on traditional architecture.

So what's behind this statement? Here is the scoop: most of the traditional BI players in the market use a traditional architecture design in which a data warehouse becomes a “central repository” for all data. Companies frequently spend tons of time and cash with ETL tools and database vendors getting this data warehouse in place. I worked on a project that spent over 2 years establishing an enterprise data warehouse (with another group of vendors) that brought data together from 13 different legacy systems. The project cost millions and, to be honest, I am not sure if they ever got it done.

WHAT’S MISSING FROM THIS DIAGRAM?
  Non-relational data
  IoT devices
  Sensor data
  Social media
Is a Data Warehouse “Old School”?
Predefined reports and dashboards are designed to answer questions tailored to individual roles within the organization.
Interactive reports and dashboards rely on the IT department or “super users”.
In order to collect data from disparate systems, you need to land them in a common data store. Then you connect your analytics platform.
ELT, Not ETL

Business reporting is the reason the practice of data warehousing exists in the first place. Without a warehouse, someone in your organization, maybe even you, is going to ten different systems, grabbing different exports and metrics, and consolidating that data in some giant Excel spreadsheet.

The historical approach of the ETL (extract, transform, load) process is that you transform (normalize, align, cleanse, and aggregate) the data from the different source systems before loading it into your target repository. We are moving to an ELT (extract, load, transform) approach, in which the raw data is loaded first and transformed inside the target platform. This allows more powerful and innovative data transformations and enrichment than are available through traditional ETL tools and processes.

The benefits of this ELT approach are:
  If there is a change in the source, you can isolate the change and adjust a simple part of the job (the extract) without touching the complex transformation.
  If you have to change or create new metrics out of the raw data, you don’t affect the other metrics, because the raw data is available locally. You don’t need to touch the sources.

QUERY IS THE NEW ETL
There are many new technologies being developed that will alleviate the need to bring all of your data into your data warehouse, while providing data access performance comparable to what you get from your data warehouse today.
As more and more of an organization’s key data is found outside the organization’s four walls (e.g., social media, mobile data, SaaS solutions like salesforce.com), the ability to quickly and seamlessly get access to this data will be a game-changer.
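The load-then-transform pattern described on this slide can be sketched in T-SQL roughly as follows. This is a minimal illustration rather than anything from the original deck: the table and column names (Stage_Orders_Raw, Fact_DailySales and so on) are hypothetical, and a real job would fill the staging table with BULK INSERT, SSIS, or PolyBase instead of literal rows.

```sql
-- 1. Load: land the source rows unchanged, as text, in a staging table.
--    (All object names here are hypothetical.)
CREATE TABLE dbo.Stage_Orders_Raw
(
    OrderId    varchar(50) NULL,
    CustomerId varchar(50) NULL,
    OrderDate  varchar(50) NULL,
    Amount     varchar(50) NULL,
    LoadedAt   datetime2   NOT NULL DEFAULT SYSUTCDATETIME()
);

INSERT INTO dbo.Stage_Orders_Raw (OrderId, CustomerId, OrderDate, Amount)
VALUES ('1001', '27',  '2016-06-01', '19.99'),
       ('1002', '27',  '2016-06-01', '5.00'),
       ('1003', 'n/a', '2016-06-02', '12.50');  -- dirty row is loaded as-is

-- 2. Transform: cleanse and aggregate inside the database.
--    TRY_CONVERT returns NULL for values that cannot be converted,
--    so the dirty row is filtered out here, not during the load.
SELECT
    TRY_CONVERT(int,  CustomerId)              AS CustomerId,
    TRY_CONVERT(date, OrderDate)               AS OrderDate,
    SUM(TRY_CONVERT(decimal(18, 2), Amount))   AS TotalAmount
INTO dbo.Fact_DailySales
FROM dbo.Stage_Orders_Raw
WHERE TRY_CONVERT(int,  CustomerId) IS NOT NULL
  AND TRY_CONVERT(date, OrderDate)  IS NOT NULL
GROUP BY TRY_CONVERT(int, CustomerId), TRY_CONVERT(date, OrderDate);
```

Because the raw rows stay in the staging table, a new metric can be defined with another query over dbo.Stage_Orders_Raw without going back to the source system.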
The Cool Kid’s Data Warehouse
James Serra Diagram
The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways:
  data growth
  fast query expectations from users
  non-relational/unstructured data
  cloud-born data
The modern data warehouse handles relational data as well as data in Hadoop, provides a way to easily interface with all these types of data through one query model, and can handle “big data” while providing very fast queries.
The Data Warehouse of the Future?
Diverse Big Data
Workload-Centric Approach
Data stored on multiple platforms
Physically distributed data warehouse
  data warehouse appliances
  columnar RDBMSs
  NoSQL databases
  MapReduce tools and HDFS

Big Data is “all data”. Multiple big data structures create multiple platform options for storing data. The DW becomes spread across multiple platforms as a result, because no one platform will run query and analysis workloads efficiently across all data. Diverse data will be loaded onto the platform based on storage, processing and budget requirements.

While a multi-platform approach adds more complexity to the data warehouse environment, BI/DW professionals have always managed complex technology stacks successfully, and end users love the high performance and solid information outcomes they get from workload-tuned platforms.
The Data Warehouse of the Future… It’s Here!
An integrated RDBMS/HDFS combo is an emerging architecture for the modern data warehouse. For example, an emerging best practice among data warehouse professionals with Hadoop experience is to manage non-relational data in HDFS (i.e. creating a “data lake”) but process it and move the results (via queries, ELT, or PolyBase) to RDBMSs elsewhere in the data warehouse architecture that are more conducive to SQL-based analytics. HDFS thus serves as a massive data staging area for the data warehouse.

The secret sauce that unifies the RDBMS/HDFS architecture is a single query model, which enables distributed queries based on standard SQL to simultaneously access data in the warehouse and in HDFS.

Clouds are emerging as platforms and architectural components for modern data warehouses. One way of simplifying the modern data warehouse environment is to outsource some or all of it, typically to a cloud-based DBMS, data warehouse, or analytics platform.
SQL Server Technology Drivers
  PolyBase
  JSON Data
  Temporal Tables
  In-Memory Tables
  Columnstore Index
PolyBase
Uses T-SQL statements to access data stored in HDFS or Azure Blob Storage.
PolyBase was initially available in PDW (Parallel Data Warehouse), the Microsoft data warehouse appliance.

PolyBase addresses one of the main customer pain points in data warehousing: accessing distributed datasets. As discussed earlier, with the increasing volumes of unstructured or semi-structured data sets, users are storing data in more cost-effective distributed and scalable systems, such as Hadoop and cloud environments (for example, Azure storage).

Originally Sqoop was used, but with it the data was actually moved from the Hadoop cluster into SQL Server for querying. With PolyBase it is possible to integrate data from two completely different file systems, providing freedom to store the data in either place. People will no longer automatically equate retrieving data from Hadoop with MapReduce. With PolyBase, all of the SQL knowledge accumulated by millions of people becomes a useful tool for retrieving valuable information from Hadoop.
PolyBase
Use T-SQL to store data from Hadoop or Azure in SQL Server as tables.
Knowledge of Hadoop or Azure is not required to use it.
Pushes computation to where the data resides.
Export relational data into Hadoop or Azure.

Based on statistics and corresponding costs, SQL Server decides when to generate map jobs on the fly, to be executed within Hadoop. This is transparent to the end user or application.

You can create a columnstore table on the fly via T-SQL to leverage SQL Server’s columnstore technology.
PolyBase - External Tables, Data Sources & File Formats
[Diagram labels: Your Apps, PowerPivot, PowerView; Data Scientists, BI Users, DB Admins; SQL Server w/ PolyBase; Social Apps, Sensor & RFID, Mobile Apps, Web Apps; External Table, External Data Source, External File Format; Hadoop; Relational DW; PolyBase Split-Based Query Processing]
PolyBase Scenarios
Querying
  Run T-SQL over HDFS
  Combine data from different Hadoop clusters
  Join relational with non-relational data
ETL
  Subset of Hadoop in columnar format
  Enable data aging scenarios to more economic storage
  Allows building of multi-temperature DW platforms
  SQL Server acts as hot query engine processing most recent data sets
  Aged data immediately accessible via external tables
  No need to groom data
Hybrid (Azure Integration)
  Mash-up of on-premises and cloud apps
  Bridge between on-premises and Azure

Customer value
Querying: ease of use and improved time-to-insights
  Build the data lake without heavily investing in new resources, i.e. Java and MapReduce experts
  Leverage familiar and mature T-SQL scripts and constructs
  Seamless tool integration with PolyBase
ETL:
  Avoids the need to maintain a separate import or export utility
  Allows building multi-temperature DW platforms
  PDW/APS acts as hot query engine processing most recent/relevant data sets
  Aged data immediately accessible via external tables
  No need to delete any data anymore
Hybrid: indefinite storage and compute
  Azure as extension for your on-premises data assets
  Cloud transition on your own terms
  Move only subsets of on-premises data, e.g. non-sensitive data
  Leverage new Azure data services
  Reduced capex and availability of new emerging data services in Azure for on-premises focused users
PolyBase
Create external data source (Hadoop).
Create external file format (delimited text file).
Create external table pointing to file stored in Hadoop.

-- 1. External data source: where the Hadoop cluster lives.
CREATE EXTERNAL DATA SOURCE hdp2 WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.xxx.xx.xxx:xxxx',
    RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx'
);

-- 2. External file format: pipe-delimited text. It must exist before the table that uses it.
CREATE EXTERNAL FILE FORMAT ff2 WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE)
);

-- 3. External table: a relational schema over the file stored in Hadoop.
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [GeographyKey] int NULL,
    [Speed] float NOT NULL,
    [YearMeasured] int NOT NULL
)
WITH (
    LOCATION = '/Demo/car_sensordata.tbl',
    DATA_SOURCE = hdp2,
    FILE_FORMAT = ff2,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

Users and applications can leverage SQL Server’s mature features, such as columnstore technology, for frequent BI reports. There is no need for a separate ETL or import tool.
PolyBase - Ad-Hoc Query joining relational with Hadoop data
Who drives faster than 35 miles per hour? Join structured customer data stored in SQL Server with sensor data stored in Hadoop.

SELECT DISTINCT
    Insured_Customers.FirstName,
    Insured_Customers.LastName,
    Insured_Customers.YearlyIncome,
    Insured_Customers.MaritalStatus
INTO Fast_Customers
FROM Insured_Customers
INNER JOIN (
    SELECT * FROM CarSensor_Data WHERE Speed > 35
) AS SensorD
    ON Insured_Customers.CustomerKey = SensorD.CustomerKey
ORDER BY YearlyIncome;

CREATE CLUSTERED COLUMNSTORE INDEX CCI_FastCustomers ON Fast_Customers;

Insured_Customers is in SQL Server.
CarSensor_Data is located in Hadoop.
Fast_Customers will be a table stored in SQL Server.
CREATE CLUSTERED COLUMNSTORE INDEX makes Fast_Customers a columnstore table.
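The PolyBase slides also mention exporting relational data to Hadoop and pushing computation to where the data resides. A rough sketch of both, assuming the hdp2 data source, ff2 file format, CarSensor_Data and Fast_Customers objects from the preceding slides already exist; the export table name, its HDFS directory and the column types are hypothetical:

```sql
-- Exporting through PolyBase is off by default and must be enabled per instance.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO

-- Hypothetical export target: a directory in HDFS, reusing the data source
-- and file format created earlier. Column types are assumed.
CREATE EXTERNAL TABLE dbo.Fast_Customers_Export
(
    FirstName nvarchar(50) NULL,
    LastName  nvarchar(50) NULL
)
WITH (
    LOCATION    = '/Demo/fast_customers_export/',
    DATA_SOURCE = hdp2,
    FILE_FORMAT = ff2
);
GO

-- Export: rows selected from SQL Server are written as delimited files in HDFS.
INSERT INTO dbo.Fast_Customers_Export
SELECT FirstName, LastName
FROM dbo.Fast_Customers;

-- Pushdown: normally the optimizer decides on cost whether to run the filter
-- as a MapReduce job in Hadoop; this hint forces that choice for one query.
SELECT SensorKey, CustomerKey, Speed
FROM dbo.CarSensor_Data
WHERE Speed > 35
OPTION (FORCE EXTERNALPUSHDOWN);
```

The opposite hint, OPTION (DISABLE EXTERNALPUSHDOWN), keeps all computation in SQL Server, which can be useful for comparing the two plans.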
JSON Data
What is JSON?
[Diagram: Product (1) to Product Reviews (n)]

{
  "ProductID": 709,
  "Name": "Mountain Bike Socks, M",
  "Color": "White",
  "Reviews": [
    {
      "Reviewer": {
        "Name": "John Smith",
        "Email": "john@fourthcoffee.com"
      },
      "ReviewDate": "2007-10-20T00:00:00",
      "Rating": 5,
      "ModifiedDate": "2007-10-20T00:00:00"
    }
  ]
}

JavaScript Object Notation (JSON) is a fancy name for a simple idea: data stored as JavaScript variables.

One of the reasons NoSQL systems became popular is that you can use composite objects, storing the attributes of a primary entity (the product in our example) together with its related records (product reviews) inside the primary entity as an array or collection of sub-objects. For example, in MongoDB or DocumentDB you would create one JSON document for the product and add the related reviews as an array of JSON objects.

JSON has caused a buzz in the tech world because it is much easier to load, read and manipulate than XML. Parsing large XML files can also be a big performance hit, whereas JSON gives you an object, packaged up and ready to go.

While relational databases have lots of use cases, there are areas where different technologies are a much better fit. One of them is flexible and complex real-time searching. Elasticsearch is one such search mechanism, often used with JSON documents. Elasticsearch can be used to search all kinds of documents. It provides scalable, near real-time search and supports multitenancy (apartment model). Elasticsearch is distributed, which means that indices can be divided into shards (a shard is a single Lucene instance), and each shard can have zero or more replicas.
JSON Data – Export data as JSON
Ability to format query results as JSON text

DECLARE @json nvarchar(max);
SET @json = (
    SELECT 1 AS firstKey, GETDATE() AS dateKey, 'Value of key' AS thirdKey
    FOR JSON PATH
);
-- Result is: { "firstKey": 1, "dateKey": "2016-06-15 11:35:21", "thirdKey": "Value of key" }

The result above shows the shape of the output. FOR JSON PATH actually wraps the object in an array unless WITHOUT_ARRAY_WRAPPER is specified, and it renders datetime values in ISO 8601 form.

FOR JSON PATH enables you to define the structure of the output JSON using the column names/aliases. If you put dot-separated names in the column aliases, the JSON properties will follow that naming convention. This feature is similar to FOR XML PATH, where you can use slash-separated paths.

FOR JSON AUTO automatically creates nested JSON sub-arrays based on the table hierarchy used in the query, again similar to FOR XML AUTO.
JSON Data
Transform JSON text to relational table

SELECT Number, Customer, Date, Quantity
FROM OPENJSON (@JSalesOrderDetails, '$.OrdersArray')
WITH (
    Number varchar(200),
    Date datetime,
    Customer varchar(200),
    Quantity int
) AS OrdersArray

@JSalesOrderDetails is a text variable that contains an array of JSON objects in the property OrdersArray, as shown in the following example:

'{"OrdersArray": [
   {"Number":1, "Date": "8/10/2012", "Customer": "Adventure works", "Quantity": 1200},
   {"Number":4, "Date": "5/11/2012", "Customer": "Adventure works", "Quantity": 100},
   {"Number":6, "Date": "1/3/2012", "Customer": "Adventure works", "Quantity": 250},
   {"Number":8, "Date": "12/7/2012", "Customer": "Adventure works", "Quantity": 2200}
]}'

The OPENJSON table-valued function takes JSON text and returns it as a table. OPENJSON finds the array in the OrdersArray property and returns one row for each JSON object (i.e. each element of the array). The four columns in the result set are defined in the WITH clause. OPENJSON tries to find the properties Number, Date, Customer, and Quantity in each JSON object and converts their values into columns in the result set. By default, NULL is returned if a property is not found.

When would you use OPENJSON? Imagine that you are importing JSON documents into a database and you want to load them into a table. Instead of parsing the JSON on the client side and streaming a set of columns to the table, you can send the JSON as-is and parse it in the database layer.
JSON Data
Number  Date       Customer         Quantity
1       8/10/2012  Adventure works  1200
4       5/11/2012  Adventure works  100
6       1/3/2012   Adventure works  250
8       12/7/2012  Adventure works  2200
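The statement that OPENJSON returns NULL for a property it cannot find can be seen with a smaller, self-contained variation. The variable name and the trimmed-down JSON are illustrative assumptions; the explicit '$.…' paths are optional and only spell out which property feeds each column. OPENJSON requires database compatibility level 130 or higher.

```sql
-- Two orders; the second one has no Quantity property.
DECLARE @orders nvarchar(max) = N'{"OrdersArray": [
    {"Number": 1, "Customer": "Adventure works", "Quantity": 1200},
    {"Number": 4, "Customer": "Adventure works"}
]}';

SELECT Number, Customer, Quantity
FROM OPENJSON(@orders, '$.OrdersArray')
WITH (
    Number   int          '$.Number',
    Customer varchar(200) '$.Customer',
    Quantity int          '$.Quantity'   -- missing in the second object, so NULL
);
```

The query returns one row per array element, and the Quantity column of the second row is NULL rather than raising an error.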
JSON Data In PATH mode, you can use the dot syntax to format nested output.
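As a small sketch of the dot syntax, the query below builds a nested Reviewer object from flat columns. The inline VALUES row reuses the product from the earlier JSON slide; the derived-table and column names are made up for the example.

```sql
-- Dot-separated aliases become nested JSON objects in PATH mode.
SELECT
    v.ProductID     AS [ProductID],
    v.ProductName   AS [Name],
    v.ReviewerName  AS [Reviewer.Name],
    v.ReviewerEmail AS [Reviewer.Email],
    v.Rating        AS [Rating]
FROM (VALUES
        (709, 'Mountain Bike Socks, M', 'John Smith', 'john@fourthcoffee.com', 5)
     ) AS v (ProductID, ProductName, ReviewerName, ReviewerEmail, Rating)
FOR JSON PATH;
```

The result is a JSON array with one object in which Name and Email sit under a Reviewer property, while ProductID, Name and Rating stay at the top level.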
Temporal Tables
A temporal table is really two tables:
  Data table
  Historical table
PERIOD: a temporal table can be defined as a table for which a PERIOD definition exists, made up of system columns.
Slowly Changing Dimension
  Data table is Type 1
  Historical table is Type 2
Recover accidental data changes

The period columns have the data type datetime2, and the system records the period of validity in them. One table contains the current values, while an associated history table holds the previous versions of each row. The most significant function of a temporal table is that it lets you see the data in the table as it was at any point in time.

UPDATES: On an UPDATE, the system stores the previous value of the record in the history table and sets the value of its SysEndTime column to the UTC time of the current transaction, based on the system clock. This marks the record as closed, with a recorded period for which it was valid.

DELETES: On a DELETE, the system stores the previous value of the record in the history table and sets the value of its SysEndTime column to the UTC time of the current transaction, based on the system clock. This marks the record as closed, with a recorded period for which the previous record was valid.
Temporal Tables
Requirements/Limitations
  Primary key
  Two columns (start and end date as datetime2)
  In-Memory OLTP cannot be used for the history table
  INSERT and UPDATE not allowed on SYSTEM_TIME period columns
  History table data cannot be changed
  Regular queries only affect data in the current table

A primary key has to be defined.
Two columns for recording the start and end date must be defined with the data type datetime2. These columns are referred to as the SYSTEM_TIME period columns.
AFTER triggers are allowed, but INSTEAD OF triggers are not.
The history table cannot be memory-optimized (In-Memory OLTP).
Neither the temporal table nor the history table can be a FILETABLE.
Statements such as INSERT and UPDATE cannot reference the SYSTEM_TIME period columns.
Data in the history table cannot be changed.
To include history-table data when querying the current table, you must use temporal (time-based) queries.

SQL Server allows you to control the indexing on the current and history tables. If you prefer to use columnstore technology (requires Enterprise edition) to optimize both storage and performance, you can use clustered columnstore indexes on both tables. Over time the history table will likely become substantially bigger than the current table, so columnstore technology matters more there.
Temporal Tables
Example:

CREATE TABLE dbo.TestTemporal (
    ID int PRIMARY KEY,
    A int,
    B int,
    C AS A*B,
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON);
Temporal Tables
If you let SQL Server create the history table for you (that is, the specified history table doesn’t already exist), it creates a new one based on the current table definition, but without a primary key. It automatically creates a rowstore clustered index with page compression enabled, with the key list based on the columns (<primary key column from the current table>, <system period start column>, <system period end column>).

If, when defining a table as a system-versioned table, you don’t provide your own name for the history table, SQL Server names it based on the pattern MSSQL_TemporalHistoryFor_<object_id>. To choose the name yourself, specify it in the WITH clause:

WITH ( SYSTEM_VERSIONING = ON ( HISTORY_TABLE = dbo.TestTemporalHistory ) );

Suppose that you already had a table called Employees with existing data and you wanted to turn it into a system-versioned table. To achieve this you would need to add the period start and end columns (with defaults, since they must be non-nullable), add the PERIOD designation, set system versioning to on, and connect the table to either a new or an existing history table.
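The steps for converting an existing table can be sketched as below. This assumes a dbo.Employees table that already has a primary key; the period column names, constraint names and the history table name are hypothetical choices for the example.

```sql
-- Step 1: add the period columns (with defaults, so that existing rows get
-- valid values) and the PERIOD designation in a single ALTER TABLE.
ALTER TABLE dbo.Employees ADD
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START HIDDEN
        CONSTRAINT DF_Employees_SysStartTime DEFAULT SYSUTCDATETIME(),
    SysEndTime datetime2 GENERATED ALWAYS AS ROW END HIDDEN
        CONSTRAINT DF_Employees_SysEndTime
        DEFAULT CONVERT(datetime2, '9999-12-31 23:59:59.9999999'),
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime);
GO

-- Step 2: turn system versioning on and connect a history table.
-- If dbo.EmployeesHistory does not exist, SQL Server creates it.
ALTER TABLE dbo.Employees
    SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeesHistory));
```

Marking the period columns HIDDEN is optional; it keeps them out of SELECT * results so that existing applications see the same column list as before.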
Temporal Tables
The FROM <table> clause of the SELECT statement has a new clause, FOR SYSTEM_TIME, with temporal-specific sub-clauses to query data across the current and history tables:
  Point in time: AS OF <date_time>
  Rows active at any time in the range, excluding rows that became active exactly at the upper boundary: FROM <start_date_time> TO <end_date_time>
  Same as FROM ... TO, but including rows that became active exactly at the upper boundary: BETWEEN <start_date_time> AND <end_date_time>
  Only rows that opened and closed within the range: CONTAINED IN (<start_date_time>, <end_date_time>)
  Every row version from both tables: ALL

This SELECT syntax is supported directly on a single table, propagated through multiple joins, and through views on top of multiple temporal tables.

You can query a history table directly, but since it doesn’t contain the current values you wouldn’t normally touch it. Instead, you should query the base table using one of the sub-clauses above.
Temporal Tables
Customer is a temporal table. For example, if you want to look at the values active for customer 27 on the first of the year:

… FROM Customer FOR SYSTEM_TIME AS OF '2015-1-1' WHERE CustomerID = 27

If instead you want to see every version of the customer’s records for that day, you could write:

… FROM Customer FOR SYSTEM_TIME BETWEEN '2015-1-1' AND '2015-1-2' WHERE CustomerID = 27

When you use a temporal query, the system reads the history table as well as the current table.
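To try these queries end to end, a small self-contained sketch may help. The dbo.CustomerDemo table, its CreditLimit column and the history table name are invented for the illustration and are not the Customer table from the slide.

```sql
-- A minimal system-versioned table with a named history table.
CREATE TABLE dbo.CustomerDemo
(
    CustomerID   int NOT NULL PRIMARY KEY,
    CreditLimit  int NOT NULL,
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerDemoHistory));
GO

-- The UPDATE moves the original row into the history table.
INSERT INTO dbo.CustomerDemo (CustomerID, CreditLimit) VALUES (27, 1000);
UPDATE dbo.CustomerDemo SET CreditLimit = 2000 WHERE CustomerID = 27;
GO

-- Every version of the row, current and historical.
SELECT CustomerID, CreditLimit, SysStartTime, SysEndTime
FROM dbo.CustomerDemo FOR SYSTEM_TIME ALL
WHERE CustomerID = 27
ORDER BY SysStartTime;

-- The row as it was at a given moment. Period columns are stored in UTC,
-- so the point in time is expressed in UTC as well.
DECLARE @asOf datetime2 = SYSUTCDATETIME();
SELECT CustomerID, CreditLimit
FROM dbo.CustomerDemo FOR SYSTEM_TIME AS OF @asOf
WHERE CustomerID = 27;
```

Passing an earlier timestamp to AS OF, taken between the INSERT and the UPDATE, would return the original CreditLimit instead of the current one.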
In-Memory Tables
Held in memory at all times
Lock-free writes
A single columnstore index allowed
  Defined at table creation
  Includes all columns in the base table
  Cannot be a filtered index
Types
  SCHEMA_AND_DATA
  SCHEMA_ONLY

Types: A SCHEMA_AND_DATA memory-optimized table resides in memory, and its data is still available after a server crash, a shutdown or a restart of SQL Server. A SCHEMA_ONLY memory-optimized table does not persist its data if SQL Server crashes or the instance is stopped or restarted, but it does retain its table structure.

A SCHEMA_ONLY table is useful as a staging table in a data warehouse application. It is typically fairly easy to reload a data warehouse staging table from its data source, which is why making this type of table SCHEMA_ONLY is relatively safe.
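A non-durable staging table of the kind described here could be declared roughly as follows. The table and column names are hypothetical, and the sketch assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup, which memory-optimized tables require.

```sql
-- Hypothetical staging table for time series rows.
-- DURABILITY = SCHEMA_ONLY keeps the table definition across restarts
-- but not the rows, and avoids writing the data to disk.
CREATE TABLE dbo.Stage_SeriesValue
(
    SeriesId    int   NOT NULL,
    ValueDate   date  NOT NULL,
    SeriesValue float NOT NULL,
    -- Memory-optimized tables need at least one index.
    CONSTRAINT PK_Stage_SeriesValue
        PRIMARY KEY NONCLUSTERED (SeriesId, ValueDate)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
```

Changing the option to DURABILITY = SCHEMA_AND_DATA would make the same table survive a restart with its rows intact, at the cost of logging and checkpoint writes.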
In-Memory Tables – ETL example
Data warehouse data loading
  Time series data (date and value)
  Multiple files (nightly reload)
  Calculate correlation
SSIS for ETL
  Load time 14 hrs
  Tried parallel processing of packages
SSIS and Bulk Insert
T-SQL Bulk Insert from file
  Achieved 20% improvement

This is a good use case for ETL, even though In-Memory OLTP is known for OLTP and on-demand reporting.

Data loading ETL example: a set of flat files with time series data. Some of these series are updated daily, and older values can be changed. Storing this data in a relational database presents a bit of a maintenance challenge, in that each series must be reloaded/merged in its entirety.

Storing a bunch of time series data (which consists of just two columns, date and value) in a database may seem unnecessary. We did consider keeping the series in their original format (i.e. as individual files) or using a NoSQL (e.g., Hadoop) database. One of the key requirements was the ability to calculate the correlation of a time series against the entire dataset. With the power of a SQL database (and set-based queries), I can run this kind of query in about a minute on a moderately powered server.

I created a T-SQL script with a cursor to grab the file path for each series (in the SSIS package, I iterated through the file paths via a ForEach loop task). Within the cursor, I called the BULK INSERT command to load each series into staging tables; after the cursor completed, I ran a stored procedure ([dbo].[spI_SeriesValue]) to merge the staged results with the destination table.
In-Memory Tables
In-Memory staging tables
  Solution scaled linearly
  Minimized writing to data and log files
  No disk writes, other than the final merge command
  Execute T-SQL commands asynchronously
“With my final solution, I was able to re-process all data series in under 15 minutes.”

Memory-optimized tables are targeted at OLTP applications, where a heavily accessed table can benefit from the inherent performance benefits of memory over disk. Here they were used as memory-optimized, non-durable staging tables.

I modified the script to accept a data series start/end range, along with a batch value to perform periodic commits from the staging table to the destination table, and then saved it as a stored procedure. In the end, I used SQL Agent, setting up four jobs, three of which are called asynchronously from the fourth.

Interestingly, my “server” (actually, a VM running on a laptop) still had plenty of available CPU and memory, so it’s likely I could run additional jobs in parallel to further reduce processing time. But now the bottleneck is no longer SQL Server; it is the (lack of) network bandwidth available to re-download the data series!
Columnstore Index
A columnstore is data that is logically organized as a table with rows and columns, and physically stored in a column-wise data format.
A rowstore is data that is logically organized as a table with rows and columns, and physically stored in a row-wise data format.
A clustered columnstore index is the physical storage for the entire table.

For example, to select column1 and column2 from a table stored as a traditional rowstore, the SQL engine has to read all the rows in full. With a columnstore index, only the pages for column1 and column2 are pulled into memory, which reduces the overall effort of data retrieval.

Reasons why columnstore indexes are so fast:
  Columns store values from the same domain and commonly have similar values, which results in high compression rates. This minimizes or eliminates the I/O bottleneck in your system while reducing the memory footprint significantly.
  High compression rates improve query performance by using a smaller in-memory footprint. In turn, query performance can improve because SQL Server can perform more query and data operations in memory.
  Batch execution improves query performance, typically 2-4x, by processing multiple rows together.
  Queries often select only a few columns from a table, which reduces total I/O from the physical media.
Columnstore Index
Standard for storing and querying large data warehousing fact tables
Uses column-based data storage and query processing
Up to 10x query performance
Data compression

In SQL Server 2016 you can define nonclustered B-tree indexes on a table stored as a clustered columnstore index. To improve the efficiency of table seeks in a data warehouse, you can create a nonclustered index designed for the queries that perform best with seeks. Queries that look for matching values or return a small range of values perform better against a B-tree index than against a columnstore index. They don’t require a full scan of the columnstore index and return the result faster by doing a binary search through the B-tree index.
Columnstore Index
Example:

CREATE TABLE t_account (
    accountkey int NOT NULL,
    Accountdescription nvarchar(50),
    accounttype nvarchar(50),
    unitsold int
);
GO

--Store the table as a columnstore.
CREATE CLUSTERED COLUMNSTORE INDEX taccount_cci ON t_account;

--Add a nonclustered index.
CREATE UNIQUE INDEX taccount_nc1 ON t_account (accountKey);
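The two kinds of query that this index combination is meant to serve might look like the following against the t_account table above. These queries are illustrative additions; which index the optimizer actually chooses depends on data volume and statistics, and the key value 42 is arbitrary.

```sql
-- Analytic query: touches only two columns but every row, which is the
-- access pattern the clustered columnstore index is designed for.
SELECT accounttype, SUM(unitsold) AS total_units
FROM t_account
GROUP BY accounttype;

-- Point lookup: a single matching key, which is the access pattern the
-- nonclustered B-tree index taccount_nc1 is designed for.
SELECT accountkey, Accountdescription, unitsold
FROM t_account
WHERE accountkey = 42;
```

Comparing the actual execution plans of the two statements is a simple way to check which index each one uses on a given data set.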
www.idera.com Try any of our tools for free! Twitter: @MSBI_Stan Email: stan.geiger@idera.com