SQL S ERVER D ATA Q UALITY S ERVICES Marc Jellinek Principal Consultant – Neudesic
A BOUT M E Experience Principal Consultant - Neudesic Assistant Director (SQL Team) – Application Engineering at Ernst & Young IT Manager at MLB Network Sr. Technology Specialist at Microsoft Technologies Microsoft SQL Server 6.0, 6.5, 7.0, 2000, 2005, 2008, 2008 R2 and 2012 Relational Engine, Analysis Services, Integration Services and Reporting Services Marc Jellinek –
S ESSION O BJECTIVES Introduction to SQL Server Data Quality Services (DQS) Understanding the problem Demo Where do we go from here?
S ETTING THE STAGE Building from the SQL Server Series: Master Data Services in SQL Server 2012, presented by Patrick Gallucci (start at 6:26) PASS MDM/DQS Virtual Chapter Based on demos from “SQL Server 2012 Developers Update”
Data QualityIssueSample Data Problem Standard Are data elements consistently defined and understood? Gender code = M, F, U in one system and Gender code = 0, 1, 2 in another system Complete Is all necessary data present?20% of customers’ last name is blank, 50% of zip-codes are Accurate Does the data accurately represent reality or a verifiable source? A Supplier is listed as ‘Active’ but went out of business six years ago Valid Do data values fall within acceptable ranges? Salary values should be between 60, ,000 Unique Data appears several timesBoth John Ryan and Jack Ryan appear in the system – are they the same person? T HE D ATA Q UALITY P ROBLEM S PACE
W HAT ’ S T HE P ROBLEM My name is Marc Jellinek Marc <> “Mark”, “Marck” or “March” Jellinek <> “Jelinek”, “Jellineck”, “Jelineck”, “Jelliner”, “Jeliner” or “Jellyneck” RrRr
W HAT ’ S THE P ROBLEM JellinekJelinekJellineckJelineckJellinerJelinerJellyneck Marc Mark Marck March
T HE NIGHTMARE S CENARIO The Customer Dimension – Jellinek, Marc – Jellinek, Mark – Jellinek, Marck – Jellinek, March – Jelinek, Marc – Jelinek, Mark – Jelinek, Marck – Jelinek, March – Jellineck, Marc – Jellineck, Mark – Jellineck, Marck – Jellineck, March – Jelineck, Marc – Jelineck, Mark – Jelineck, Marck – Jelineck, March – Jelliner, Marc – Jelliner, Mark – Jelliner, Marck – Jelliner, March – Jelliner, Marc – Jelliner, Mark – Jelliner, Marck – Jelliner, March – Jellyneck, Marc – Jellyneck, Mark – Jellyneck, Marck – Jellyneck, March
A NALYTIC I MPACT Average Revenue per customer Average Profit per customer Number of customers Customers per Geography Customers by Income Customers by Gender Customers by Educational Level Customers by Product Bought
O BLIGATORY T RUISMS The accuracy of your reporting is determined by the accuracy of your data (Garbage In, Garbage Out) Decisions made based on data will only be as good as the data on which you are basing your decisions. You can’t manage what you can’t measure. Inaccurate measurements lead to interesting management challenges.
T HE D ATA Q UALITY S OLUTION S PACE Amend, remove or enrich data that is incorrect or incomplete. This includes correction, enrichment and standardization. Identifying, linking or merging related entries within or across sets of data. CleansingMatching ProfilingMonitoring Analysis of the data source to provide insight into the quality of the data and help to identify data quality issues. Tracking and monitoring the state of Quality activities and Quality of Data.
SQL S ERVER 2012 D ATA Q UALITY S ERVICES High quality data is critical to effective business intelligence and to business activities DQS is an on-premise Data Quality product in SQL Server 2012, extendible with knowledge from multiple parties thru Azure DataMarket Richer DQ knowledge and capabilities in the cloud will make it even easier to provide high quality data Data Quality Services (DQS) is a Knowledge-Driven data quality solution enabling IT Pros and data stewards to easily improve the quality of their data Included with SQL Server 2012 Enterprise and BI Editions
K EY D ATA Q UALITY S ERVICES C ONCEPTS Knowledge-Driven Semantics Knowledge Discovery Based on a Data Quality Knowledge Base (DQKB) Data Domains capture the semantics of your data Acquires additional knowledge the more you use it Open and Extendible Easy to use Add user-generated knowledge & 3 rd party reference data providers User experience designed for increased productivity
DQS A RCHITECTURE Matching Reference Data DQ Clients DQS UI DQ Server DQ Projects StoreCommon Knowledge Store Knowledge Base Store DQ Engine 3 rd Party / Internal MS DQ Domains Store MS DQ Domains Store Reference Data Services Reference Data Sets SSIS DQ Component DQ Active Projects MS Data Domains Local Data Domains Published KBs Knowledge Discovery Data Profiling & Exploration Cleansing Knowledge Discovery and Management Interactive DQ Projects Data Exploration Azure Market Place Categorized Reference Data Categorized Reference Data Services Reference Data API (Browse, Get, Update…) Reference Data API (Browse, Get, Update…) RD Services API (Browse, Set, Validate…) RD Services API (Browse, Set, Validate…) MDS Excel Add in
D ATA Q UALITY S ERVICES P ROCESSES Build Use DQ Projects Knowledge Management Match & De-dupe Correct & standardize Manage Knowledge Enterprise Data Reference Data Reference Data Cloud Services Integrated Profiling Notifications Progress Status Knowledge Base Discover / Explore Data
B ASIC D EFINITIONS Knowledge Base –Stores all the knowledge related to a specific type of data source –Container for domains Domain –Semantic representation of a type of data in a data field or column –Trusted values, invalid values and erroneous data –Synonym associations, term relationships, validation and business rules, matching policies Matching Rule –Set of rules and conditions that determine a match or duplicate
D ATA Q UALITY S ERVICES C OMPONENTS Data Quality Server Data Quality Client DQS Cleansing Component for SQL Server Integration Data Quality Processes in Master Data Management
D ATA Q UALITY S ERVICES C OMPONENTS Data Quality Server –SQL Server Databases DQS_MAIN –DQS Stored Procedures, the DQS Engine and published Knowledge Bases DQS_PROJECTS –Data required for knowledge base management and DQS project activities DQS_STAGING_AREA –Intermediate staging area where source data is copied and processed
D ATA Q UALITY S ERVICES C OMPONENTS Data Quality Client –Standalone application Designed for both data stewards and DQS Administrators Perform knowledge management, data quality projects and administration in one user interface Allows for domain management, matching policy creation, data cleansing, matching, profiling, monitoring and server administration. Can be installed on a remote computer
D ATA Q UALITY S ERVICES C OMPONENTS DQS Cleansing Component in SQL Server Integration Services –Performs data cleansing as a part of an SSIS package –Alternative to running a cleansing project within the Data Quality Services Client application Data Quality Processes within Master Data Services –Perform de-duplication on source data and master data within the Microsoft SQL Server Data Services Add-in for Microsoft Excel.
DEMO
R ESOURCES DBI207: Using Knowledge to Cleanse Data with Data Quality Services Data Quality Services Blog Books Online for SQL Server - Data Quality Services Install Data Quality Services Used Master Data Services Configuration Manager, set up IIS Troubleshoot Installation and Configuration Issues (Master Data Services in SQL Server SQL Server 2012 Developer Training Kit Web Installer SQL Server 2012 Update for Developers Training Workshop SQL Server 2012 Update for Developers Training Kit SQL Server 2012 Update for Developers Training Kit Content MSDN – Data Quality Services MSDN Discussions – Data Quality Services Technet – Data Quality Services PASS MDM/DQS Virtual Chapter
Driving Innovation with SAP On-Premise and Beyond Monish Nagisetty Daniel Sepp 7 June :00-11:00 PDT m/?id= Delivering a Semantic Model for Ad-Hoc Reporting Steve Muise21 June :00-11:00 PDT m/?id= Exploring the New Hadoop Implementation in Azure Jacob Saunders10 July :00-11:00 PDT m/?id= Amerishore – an Innovative, Socially Conscious Approach to Offshoring Tracy Derr17 July :00-11:00 PDT m/?id= Data Integration Improvements within SQL Server 2012 Kola Bolarin24 July :00-11:00 PDT m/?id= EDI: The Reports of My Death Have Been Greatly Exaggerated Abhilash Shanmug9 August :00-11:00 PDT m/?id=
THANK YOU