Data Lake and HAWQ Integration LU WENBIN
Agenda: Data Lake; Analytic Insights Module from Dell EMC; HAWQ Integration
Data Lake
A Typical Data Scientist: My friend, a data scientist who works for a leading bank in the US, told me about his day-to-day work. Data Gathering: from internal RDBMS, Hadoop, and document stores, and from external government or third-party data sources. Data Cleansing: normally to integrate data and reduce dimensionality; the cleaned data is put onto the division's Unix file server, usually in SAS format. Data Analysis: normally done with SAS; trend analysis, prediction, time series analysis, clustering analysis, etc. Analysis results and reports are also stored back onto the division-owned Unix file server, and the BI/IT team has to be involved for data sharing.
Problems. Data is not integrated: this creates a lot of data silos in the enterprise. Data is not easy to share: to share data, one data scientist needs to ask the BI team to put the data into the data warehouse before another team can use it for analysis, which means a lot of manual work. Data is not indexed and cataloged: there is no single catalog for users to search and find data. Data lineage is not preserved: there is no well-designed data governance and auditing.
Data Lake Pentaho CTO James Dixon has generally been credited with coining the term “data lake”. A data lake is like a body of water in its natural state. Data flows from the streams (the source systems) to the lake. Users have access to the lake to examine, take samples or dive in.
Difference from a Data Warehouse. A Data Lake retains all data, compared to the selected, specially designed, and transformed data in a Data Warehouse. All data formats: "schema on read" in the Data Lake vs. "schema on write" in the Data Warehouse. User friendly to all: compared to the report-oriented Data Warehouse audience, a Data Lake serves a much wider audience, especially data scientists. Adapts fast to changes: a Data Warehouse still has a rigid data format and is costly to design and build, whereas a Data Lake stores the raw data.
Business-Ready Data Lake. Data ingestion from a wide range of internal and external data sources, loading the data as-is. Data indexing and catalog. Data lineage and governance. Easy to analyze. Data sharing. Insights surfaced as actions for business and management users.
Analytic Insights Module Gather Analyze Act
Analytic Insights Module. The Analytic Insights Module is engineered by Dell EMC to combine self-service data analytics with cloud-native application development into a single cloud platform. It enables organizations to rapidly transform data into actionable insights with high business value. Gather: gather the right data with deep awareness through powerful indexing and ingestion capabilities, with data-centric security and lineage. Analyze: analyze data through a self-service experience with a choice of expansive ecosystem tools and cataloged data sets for reuse. Act: act to operationalize insights into data-driven applications, new business processes, and enhanced customer engagements.
Solution Architecture
Data Flow
Functions. Data Curator: Attivio DSD for source data discovery; Zaloni Bedrock for data ingestion pipelines; consolidated data lineage tracking. Data Scientist: self-service analysis environment; open architecture for adopting any data analysis and visualization tools; easy and efficient data sharing; global data view and index for quick search.
Functions (cont.). Data Governor: BlueTalon for global security policy configuration, enforcement, and auditing; all data access is enforced by a BlueTalon Policy Enforcement Point for the different data containers, e.g. HDFS, Hive, HAWQ, etc. IT/Admin: built on hyper-converged infrastructure (EMC VxRail); PCF as PaaS; open architecture to deploy big data suites such as CDH, HDP, Mongo, etc.; fine-grained quota settings; system monitoring; integrated log analysis.
Functions (cont.). App Developer: PCF for cloud-native app development; extremely easy to develop data-driven applications; every data container is exposed as a PCF User-Provided Service.
Demo Video https://www.emc.com/en-us/solutions/big-data/analytic-insights.htm
HAWQ Integration. Provided as one of the SQL-on-Hadoop tools in AIM; some of our customers use HAWQ widely. Covered here: automatic provisioning, data ingestion into HAWQ tables, HAWQ table sharing between data scientists, and data access security for HAWQ.
Automatic Provisioning. HDP 2.5 with HDB 2.1.0 and PXF, automatically deployed alongside HDP; integrated with Dell EMC Isilon storage.
Data Ingestion into HAWQ. Data can be ingested into HAWQ in three ways. Dirt Way: the data scientist manually creates a HAWQ table and uses whatever means to ingest data into it. Paved Way: the data scientist uses a pre-created ingestion pipeline, starting from data discovery, to bring data into a HAWQ table in their own workspace. Direct into ODC: the data admin uses a pre-created ingestion workflow to ingest external files directly into the Operational Data Container (ODC) as a published HAWQ table.
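As an illustration of the Dirt Way only, the minimal sketch below hand-creates a HAWQ table and loads a few rows over a standard PostgreSQL connection (HAWQ speaks the PostgreSQL wire protocol, so psycopg2 works). The host, credentials, schema, table, and data are hypothetical placeholders, not part of AIM.

```python
# Minimal sketch of the "Dirt Way": a data scientist hand-creates a HAWQ table
# and loads data through a regular PostgreSQL connection. Host, database,
# credentials, and the table definition are placeholder assumptions.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                        dbname="analytics", user="alice", password="secret")
conn.autocommit = True
cur = conn.cursor()

# Create a plain HAWQ table in the user's workspace schema.
cur.execute("""
    CREATE TABLE workspace.loan_sample (
        loan_id     bigint,
        amount      numeric(12, 2),
        issued_date date
    )
""")

# Ingest data "in whatever way" -- here a trivial multi-row INSERT.
rows = [(1, 12000.00, "2017-01-15"), (2, 8500.50, "2017-02-03")]
cur.executemany(
    "INSERT INTO workspace.loan_sample (loan_id, amount, issued_date) VALUES (%s, %s, %s)",
    rows,
)

cur.close()
conn.close()
```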
ODC HAWQ Ingestion Architecture
ODC HAWQ Ingestion Detail. We use Zaloni Bedrock as the ingestion engine; its BDCA component monitors incoming files and transfers them into HDFS. Zaloni Bedrock acts as the ingestion workflow manager and guarantees data quality during ingestion. Based on the incoming file, an external table and an internal HAWQ table are created; the external table uses PXF to connect to the HDFS location. After file ingestion is completed, the external table's rows are selected into the internal table and the external table is dropped. Finally, the Data And Analytics Catalog (DAC) component is notified.
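The following is a hedged sketch of the external-table step just described, assuming a PXF service reachable on the namenode's default PXF port, a landing path written by Bedrock, and hypothetical table and column names; the PXF profile, port, and format options may differ in your deployment.

```python
# Sketch of the ODC ingestion step: create a PXF external table over the file
# landed in HDFS, select its rows into an internal HAWQ table, drop the
# external table, then (not shown) notify the Data And Analytics Catalog.
# PXF host/port, HDFS path, profile, and table names are assumptions.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", dbname="odc", user="ingest_svc")
conn.autocommit = True
cur = conn.cursor()

# External table pointing at the ingested HDFS file via PXF.
cur.execute("""
    CREATE EXTERNAL TABLE odc.ext_customer (
        customer_id bigint,
        name        text,
        segment     text
    )
    LOCATION ('pxf://namenode.example.com:51200/landing/customer/2017-06-01?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',')
""")

# Internal (published) HAWQ table with the same layout.
cur.execute("""
    CREATE TABLE odc.customer (
        customer_id bigint,
        name        text,
        segment     text
    )
""")

# "Select into" the internal table, then drop the external table.
cur.execute("INSERT INTO odc.customer SELECT * FROM odc.ext_customer")
cur.execute("DROP EXTERNAL TABLE odc.ext_customer")

cur.close()
conn.close()
```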
Data Scientist Share HAWQ Table
Data Scientists Share HAWQ Tables. The problem is how to migrate a table between HAWQ clusters. General steps: 1) get the table's underlying HDFS location from the catalog tables; 2) transfer the source HAWQ HDFS files into the target HAWQ HDFS using the EMC Isilon API; 3) use hawq extract to extract the source table metadata; 4) modify the HDFS location and DFS location in the YAML file; 5) use hawq register to register the table into the target HAWQ; 6) issue an "analyze" to refresh the target table statistics.
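A minimal sketch of steps 3 through 6, assuming the hawq CLI is on the PATH, the HDFS files have already been copied via the Isilon API, and rewriting the source namenode URI to the target one is the only YAML change needed. Database, table, and URI values are placeholders; the hawq extract/register flags follow the Apache HAWQ documentation but should be verified against your version.

```python
# Sketch of migrating a table's metadata between HAWQ clusters using
# hawq extract / hawq register. Database names, table name, and namenode
# URIs are placeholders; the simple URI text replacement stands in for
# "modify the HDFS location and DFS location in the YAML file".
import subprocess
import psycopg2

TABLE = "odc.customer"
YAML_FILE = "/tmp/customer.yml"
SOURCE_DFS = "hdfs://source-isilon:8020"
TARGET_DFS = "hdfs://target-isilon:8020"

# 3) Extract the source table metadata into a YAML file.
subprocess.run(["hawq", "extract", "-d", "odc", "-o", YAML_FILE, TABLE], check=True)

# 4) Point the metadata at the target cluster's HDFS location.
with open(YAML_FILE) as f:
    meta = f.read()
with open(YAML_FILE, "w") as f:
    f.write(meta.replace(SOURCE_DFS, TARGET_DFS))

# 5) Register the table into the target HAWQ cluster
#    (run against the target cluster).
subprocess.run(["hawq", "register", "-d", "odc", "-c", YAML_FILE, TABLE], check=True)

# 6) Refresh the target table statistics.
conn = psycopg2.connect(host="target-hawq-master.example.com", dbname="odc", user="admin")
conn.autocommit = True
conn.cursor().execute("ANALYZE odc.customer")
conn.close()
```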
For a Non-Partitioned Table. From pg_database: 1) dat2tablespace (tablespace id), 2) oid (database id). From pg_class and pg_namespace: relfilenode (table id). From pg_filespace_entry and pg_filespace: fselocation (the HDFS HAWQ directory). Based on the above information, the HDFS location of the table is: fselocation/dat2tablespace/<database oid>/relfilenode/
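A hedged sketch of looking up these catalog entries and assembling the path; it assumes psycopg2 connectivity to the HAWQ master, a default HDFS filespace named 'dfs_system', and hypothetical schema/table names. The exact catalog joins and filespace name can vary between HAWQ versions.

```python
# Sketch: derive a non-partitioned table's HDFS directory as
#   fselocation/dat2tablespace/<database oid>/relfilenode
# Schema/table names, connection details, and the 'dfs_system' filespace
# name are assumptions; verify the catalog joins on your HAWQ version.
import psycopg2

SCHEMA, TABLE = "odc", "customer"

conn = psycopg2.connect(host="hawq-master.example.com", dbname="odc", user="admin")
cur = conn.cursor()

# 1) Tablespace id and database id from pg_database.
cur.execute("SELECT dat2tablespace, oid FROM pg_database WHERE datname = current_database()")
dat2tablespace, db_oid = cur.fetchone()

# 2) Table id (relfilenode) from pg_class joined with pg_namespace.
cur.execute("""
    SELECT c.relfilenode
    FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid
    WHERE n.nspname = %s AND c.relname = %s
""", (SCHEMA, TABLE))
relfilenode, = cur.fetchone()

# 3) HDFS root (fselocation) from pg_filespace_entry joined with pg_filespace.
cur.execute("""
    SELECT fe.fselocation
    FROM pg_filespace_entry fe JOIN pg_filespace f ON fe.fsefsoid = f.oid
    WHERE f.fsname = 'dfs_system'
""")
fselocation, = cur.fetchone()

print("%s/%s/%s/%s/" % (fselocation, dat2tablespace, db_oid, relfilenode))
conn.close()
```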
For a One-Level Partitioned Table. Only one-level partitioned tables are supported for now. In addition to the non-partitioned case, pg_partitions is used to get partitionschemaname and partitiontablename. Iterating over each of those partitions, get the partition's relfilenode (table id) from pg_class and pg_namespace. Then the partitions' HDFS locations are: fselocation/dat2tablespace/<database oid>/<each partition's relfilenode>/
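Extending the previous sketch to one-level partitioned tables, again with hypothetical names; pg_partitions and the column names below follow the slide but should be checked against the catalog of your HAWQ release.

```python
# Sketch: list the HDFS directories of all leaf partitions of a
# one-level partitioned table. Reuses the dat2tablespace / db_oid /
# fselocation lookups from the non-partitioned sketch.
import psycopg2

SCHEMA, TABLE = "odc", "customer_by_month"

conn = psycopg2.connect(host="hawq-master.example.com", dbname="odc", user="admin")
cur = conn.cursor()

# Same database-level lookups as before (abbreviated here).
cur.execute("SELECT dat2tablespace, oid FROM pg_database WHERE datname = current_database()")
dat2tablespace, db_oid = cur.fetchone()
cur.execute("""SELECT fe.fselocation FROM pg_filespace_entry fe
               JOIN pg_filespace f ON fe.fsefsoid = f.oid WHERE f.fsname = 'dfs_system'""")
fselocation, = cur.fetchone()

# Leaf partitions of the one-level partitioned table.
cur.execute("""
    SELECT partitionschemaname, partitiontablename
    FROM pg_partitions
    WHERE schemaname = %s AND tablename = %s
""", (SCHEMA, TABLE))
for part_schema, part_table in cur.fetchall():
    cur.execute("""
        SELECT c.relfilenode
        FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid
        WHERE n.nspname = %s AND c.relname = %s
    """, (part_schema, part_table))
    relfilenode, = cur.fetchone()
    print("%s/%s/%s/%s/" % (fselocation, dat2tablespace, db_oid, relfilenode))

conn.close()
```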
Data Access Protection for HAWQ. We use the partner product BlueTalon as the global policy and enforcement engine. It protects all our data containers: Isilon HDFS, Hive, HAWQ, etc.
Data Access Protection for HAWQ (cont.). The Greenplum Policy Enforcement Point is automatically deployed and configured. A Greenplum data domain is added for the specified HAWQ instance, and the user domain is integrated with LDAP groups. A BlueTalon service account is created on HAWQ and added into pg_hba.conf. BlueTalon provides integrated policy setting, enforcement, and auditing control and display.
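For illustration only, the sketch below shows what creating a BlueTalon service account on HAWQ and allowing it in pg_hba.conf might look like. The role name, password, address range, and pg_hba.conf path are assumptions; in AIM these steps are performed automatically rather than by hand.

```python
# Illustrative sketch (not the AIM automation itself): create a service
# account role for the policy enforcement point and allow it in pg_hba.conf.
# Role name, password, CIDR range, and pg_hba.conf path are assumptions.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", dbname="postgres", user="gpadmin")
conn.autocommit = True
conn.cursor().execute("CREATE ROLE bluetalon_svc LOGIN PASSWORD 'change-me'")
conn.close()

# Allow the service account to connect from the enforcement point's subnet.
HBA_PATH = "/data/hawq/master/pg_hba.conf"
with open(HBA_PATH, "a") as hba:
    hba.write("host all bluetalon_svc 10.0.0.0/24 md5\n")

# The HAWQ master must re-read pg_hba.conf afterwards (e.g. a configuration
# reload or restart via the hawq CLI) for the new entry to take effect.
```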
Thank you!
Q&A