Migrating Master Data to a Data Lake
DAMA Chicago – December 2017 Chapter Meeting
My Background
- Employed by Protective Insurance (just started in October of this year): Senior Enterprise Data Architect
- Previous employer was CNO Financial Group (Director – Data Strategy & Architecture)
- Experience (IT, over 25 yrs; Data focus, nearly 20 yrs):
  - Disciplines: Enterprise Data Strategy, Data Architecture, Data Design, Data Integration, Reference & Master Data, Data Warehousing, Business Intelligence, Metadata, Data Quality, Data Governance
  - Industries:
    - Insurance & Financial Services
    - Pharmaceutical
    - State Government
    - Manufacturing
- Other Items:
  - Founding member (since 2009) and current President (2016) of DAMA Indiana chapter
  - Hold CDMP certification (Master level since 2010)
  - Contributing author to DMBOK2 (Reference & Master Data), released June of this year
Discussion Topics
- Current State Review
  - Data Model
  - Data Architecture
- Future State Proposed
  - Overall Architecture
  - Data Lake Specific
- Big Data POC (Proof of Concept)
  - Environment Setup
  - Use Case Review
  - POC Results
- Items on Deck
  - Data Access / Presentation Layer
  - Information Governance Implications
- Wrap up and Questions
Current State – Enterprise Data Model (High-level Conceptual)
- Main Business Entities (9 in total):
  - Product (Coverage Master)
  - Client (Consolidated Level View)
  - Party (Source Level View)
  - Point of Contact (Communication Method)
  - Agent (Producer Contracts & Licenses)
  - Application (for Policy Coverage)
  - Policy (Pending, Active, or Terminated)
  - Claim (Submitted against Policy)
  - Event (Type and Timestamp)
- Subject Area Relationships:
  - Identify Relationship Type / Role
- Enterprise Data Glossary:
  - Business Terms & Attributes
  - Vetted by Data Governance Council
Current State – Data Sharing Model (High-level Logical)
- Current Data Design:
  - Relational Model
  - Abstract Design
  - Source Linkage and Lineage
  - Lends Itself to Columnar
- Reference Entities:
  - Static Reference Data
  - Environment Metadata
- Subject Area Entities:
  - Domain Specific (by Business Entity)
  - Key-value Pairs (Simulate Columnar)
- Model instantiated for each Subject Area identified (9 in total)
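The key-value design noted on this slide can be illustrated with a minimal sketch. It assumes a hypothetical row layout of (entity id, source system, attribute name, attribute value); the column names, source system names, and sample values are invented for illustration and are not taken from the actual model. The sketch shows how key-value rows collapse into one wide record per entity while the source of each attribute is kept for lineage.

```python
from collections import defaultdict

# Hypothetical key-value rows for one subject area (here: Client).
# Layout assumed for illustration:
# (entity_id, source_system, attribute_name, attribute_value)
rows = [
    ("C001", "ADMIN_A", "first_name", "Ann"),
    ("C001", "ADMIN_A", "last_name", "Lee"),
    ("C001", "ADMIN_B", "birth_date", "1980-02-14"),
    ("C002", "ADMIN_A", "first_name", "Raj"),
]


def pivot(kv_rows):
    """Collapse key-value rows into one wide record per entity.

    Returns two dicts keyed by entity id: the attribute values, and the
    source system that supplied each attribute (the lineage).
    """
    records = defaultdict(dict)
    lineage = defaultdict(dict)
    for entity_id, source, name, value in kv_rows:
        records[entity_id][name] = value
        lineage[entity_id][name] = source
    return dict(records), dict(lineage)


records, lineage = pivot(rows)
print(records["C001"])
print(lineage["C001"])
```

Because each attribute is stored as its own row, adding a new attribute needs no schema change, which is one reason such a design tends to map naturally onto a columnar store.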
Current State – Data Sharing Architecture
- Current Data Stores (all Oracle):
  - Landing Zone
  - Master Data Hub
  - Enterprise Data Warehouse
- Current Data Flows:
  - Traditional ETL (Informatica)
  - Custom Extracts (COBOL, PL/SQL)
- Current Reporting & Analytics:
  - Static (Business Objects)
  - Visualization (Tableau)
  - Predictive / Statistical (SAS)
- Current Data Profiling:
  - Informatica IDQ and Traditional SQL
Future State – Proposed Architecture
- Data Layer Components:
  - Operational Zone
  - Presentation Zone + DV
  - Data Lake (BDE)
  - Ad-Hoc Zone
- Data Flows:
  - Batch (solid black lines)
  - Service (solid red lines) proxied via ESB
  - RT Query (dashed black lines)
- All Data Layer components expected to be on-prem with exception of Ad-Hoc Zone (to enable variable use and cost models)
Future State – Proposed Architecture
- Architecture Approach:
  - Assure Data Centric Design as Hub-n-Spoke
  - Reduce Point-to-Point
  - Enable Data Accessibility
  - Implement Data Services
- Data Layer as Hub:
  - Manage Client Identities
  - Proxy Transactions
  - Implement EDW
  - Provide Data Domain Perspective Views
  - Curate Master Data
  - Link Transactional Data
  - Enable Data Archiving
  - Establish Enterprise LZ
Future State – Proposed Data Lake
- Data Lake Environment:
  - Cloudera distribution of Hadoop
  - 14 Node cluster (10 data, 4 name/edge)
- Technical Considerations:
  - Enterprise Landing Zone (HDFS + Hive)
  - Archive Zone (HDFS)
  - Curation Zone (Hive + Impala + Kudu)
  - Insights Zone (Hive + Impala + HBase)
  - Sandbox Zone (Hive + HBase + SAS)
  - Ingestion (Sqoop + Syncsort)
  - Transformation (M/R + Hive + Python + SAS)
  - Existing MDS Hub to be migrated from relational Oracle data store to columnar Kudu data store
  - Existing ETL to be migrated from Informatica to Hive + Impala
  - Utilize Security Toolset from Cloudera to ensure Data encrypted at rest
  - Note that Informatica BDM (Big Data Management) suite was reviewed / considered
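The zone-to-technology pairings on this slide can be captured as a small lookup table, which may help when checking which zones a given engine would need to serve. This is only an illustrative sketch: the zone names and technologies are copied from the slide, while the dictionary layout and the `zones_using` helper are hypothetical.

```python
# Zone-to-technology mapping as listed on the slide.
# The structure and helper below are illustrative, not an actual config format.
DATA_LAKE_ZONES = {
    "Enterprise Landing Zone": ["HDFS", "Hive"],
    "Archive Zone": ["HDFS"],
    "Curation Zone": ["Hive", "Impala", "Kudu"],
    "Insights Zone": ["Hive", "Impala", "HBase"],
    "Sandbox Zone": ["Hive", "HBase", "SAS"],
}


def zones_using(technology):
    """Return the zones (in slide order) that list the given technology."""
    return [
        zone
        for zone, technologies in DATA_LAKE_ZONES.items()
        if technology in technologies
    ]


print(zones_using("Impala"))
print(zones_using("Kudu"))
```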
Data Lake POC (Proof of Concept)
- POC Environment:
  - MS Azure (IaaS set up)
  - Cloudera distribution of Hadoop
  - 4 Node cluster (3 data, 1 name/edge)
- Focused on Three (3) Use Cases:
  - Actuarial Valuation Analysis (Single Product Type)
  - Ingestion of Relational and Mainframe Data
  - Data Service Query (Performance Goal <= 300 ms)
- Results:
  - Condensed Valuation Process (From Two Weeks to Twenty Hours)
  - Successful Ingestion of Relational Data (via Sqoop) and Mainframe Data (via Syncsort)
  - Mirrored 1000 Simultaneous Executions of the Data Service Query (Average Response Time Obtained of 150 ms)
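For the data service query use case, a check of average response time against the 300 ms goal might look like the sketch below. It is not the harness used in the POC: `fake_query` is a hypothetical stand-in for the real service call, and the worker count of 100 is an assumption, so the 1000 executions overlap in batches rather than all running at the same instant.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

GOAL_MS = 300  # performance goal stated for the data service query


def fake_query(_request_id):
    """Hypothetical stand-in for the real data service call."""
    time.sleep(0.001)


def timed(call, arg):
    """Run one call and return its elapsed time in milliseconds."""
    start = time.perf_counter()
    call(arg)
    return (time.perf_counter() - start) * 1000.0


def run_load(call, executions=1000, workers=100):
    """Issue `executions` calls across a thread pool and collect latencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: timed(call, i), range(executions)))


latencies = run_load(fake_query)
avg_ms = mean(latencies)
meets_goal = avg_ms <= GOAL_MS
print(f"executions={len(latencies)} avg={avg_ms:.1f} ms meets_goal={meets_goal}")
```

In practice a percentile (for example the 95th) is often tracked alongside the average, since an average can hide a slow tail.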
Next Steps – Items on Deck
- Data Access / Presentation Layer:
  - Perform POC on Data Virtualization Product (Denodo)
  - Determine How to Package Conformed Dimensions from EDW to Present ‘Perspective Views’
  - Establish Integration Patterns within ESB Environment (Semantic / Taxonomic Messaging Approach)
  - Execute Performance Testing of Data Service Queries from Presentation Zone
- Information Governance Implications:
  - Establish Governance Policies
  - Determine Data Classification Approach
  - Define Security Architecture for Data Lake
  - Identify Access Roles and Security Controls
  - Certify Security of Data Lake Environment
Next Steps – Plans for 2018
- Funding Secured for POC Environment until June:
  - But Establish a Larger Cluster (10 data, 4 name/edge)
  - Along with Security Set-up and Data Encryption
- Collaborate with Business Areas on new / expanded prospective Use Cases:
  - Expand Actuarial Valuation to Other Product Types
  - Additional Actuarial Items outside of Valuation
  - Agent Recruiting and Retention
  - Claims Fraud (although this one has a long tail…)
  - Customer Experience (Journey Map and/or Retention)
- Go on the Road…
  - Presentations to Business Partners and IT folks
  - Extol the Value of BD and Future State Architecture
  - Troll for Funding…$$$ (Sad but true…)
Recap
In the end it is all about…
- Current State Review
  - Data Model (Conceptual and Logical)
  - Data Architecture
- Future State Proposed
  - Overall Architecture (Layout and Approach)
  - Data Layer Components
  - Data Lake Environment
- Big Data POC (Proof of Concept)
  - Environment Setup
  - Use Case Review
  - POC Results
- Items on Deck
  - Data Access / Presentation Layer
  - Information Governance Implications
- Next Steps
Thank You For Your Time and Interest…!!!
Contact Information:
- Gene Boomer
- Protective Insurance
- gboomer@protectiveinsurance.com