Drive Data Quality at Your Company: Create a Data Lake George Corugedo Chief Technology Officer & Co-Founder.

Presentation transcript:

Drive Data Quality at Your Company: Create a Data Lake George Corugedo Chief Technology Officer & Co-Founder

Purpose of MDM: Create correct and consistent data across the enterprise that fosters trust in information and accelerates growth.

Why It Matters: “Without data you’re just another person with an opinion.” W. Edwards Deming

Vicious Cycle of Unmanaged Data: (1) Master data issues remain unaddressed or unresolved. (2) Garbage in/garbage out creates process confusion. (3) Lack of process trust slows business momentum. (4) Data conflicts reinforce siloed operations.

New Data Straining Current Architectures: Unstructured documents, s; transactional data; server logs; sentiment, web data; geolocation; sensor, machine data; clickstream; hierarchical data; OLTP, ERP, CRM; master data. 2.8 ZB in % from new data types 15x Machine Data by ZB by 2020. Source: IDC

Gartner’s Nexus of Forces

Business Benefits of MDM

How About MDM on a Data Lake?
Challenges to Data Lake Approach: Severe shortage of MapReduce-skilled resources. Inconsistent skills lead to inconsistent results from code-based solutions. Nascent technologies require multiple point solutions. Technologies are not enterprise grade. Some functionality may not be possible within these frameworks.
Benefits of a Hadoop Data Lake: Data is ingested in its raw state regardless of format, structure or lack of structure. Raw data can be used and reused for differing purposes across the enterprise. Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform. Master data can be fed across the enterprise, and deep analytics on clean data is immediately enabled.

A Data Lake as the Center of Your MDM Strategy: Ingestion of all data available from any source, format, cadence, structure or non-structure. ELT and data transformation, refinement, cleansing, completion, validation and standardization. Geospatial processing and geocoding. Data profiling, lineage and metadata management. Identity resolution, persistent keying and entity profile management. Attribute source and consumer mapping.
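To make the standardization, identity resolution and persistent keying functions on this slide concrete, here is a minimal Python sketch. It is not the presenter's or any vendor's method: the match rule (same email, falling back to name plus postal code), the field names, and the helper names `standardize`, `persistent_key` and `resolve` are all assumptions chosen for illustration. Real identity resolution typically uses fuzzy matching and survivorship rules.

```python
import hashlib
import re


def standardize(record):
    """Trim, collapse whitespace, and lowercase every string field."""
    return {
        field: re.sub(r"\s+", " ", value).strip().lower()
        for field, value in record.items()
    }


def persistent_key(record):
    """Derive a stable key from the standardized match fields.

    Hypothetical match rule: use the email if present, otherwise
    name + postal code. The same input always yields the same key.
    """
    clean = standardize(record)
    basis = clean.get("email") or (
        clean.get("name", "") + "|" + clean.get("postal_code", "")
    )
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()[:16]


def resolve(records):
    """Group raw source records into entity profiles by persistent key."""
    profiles = {}
    for record in records:
        key = persistent_key(record)
        clean = standardize(record)
        profile = profiles.setdefault(key, {"sources": [], "attributes": {}})
        # Track which source system contributed to this entity.
        profile["sources"].append(clean.pop("source", "unknown"))
        # Simple survivorship: the first non-empty value per attribute wins.
        for field, value in clean.items():
            if value:
                profile["attributes"].setdefault(field, value)
    return profiles


records = [
    {"source": "CRM", "name": "Ann  Lee", "email": " Ann.Lee@Example.com "},
    {"source": "ERP", "name": "ann lee", "email": "ann.lee@example.com"},
    {"source": "Web", "name": "Bob Ray", "email": "bob@example.com"},
]
profiles = resolve(records)
```

In this example the CRM and ERP records differ only in casing and spacing, so after standardization they share one key and collapse into a single entity profile, while the third record gets its own.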

Modern Data Lake Architecture for MDM

Key Functions for Master Data Management

How Can That Possibly Work?

Overview - What is Hadoop/Hadoop 2.0

RedPoint Data Management on Hadoop (diagram labels): Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; RedPoint Parallel Section; YARN; MapReduce

Zero Footprint Installation

Reference Hadoop Architecture (diagram labels): Monitoring and Management Tools; Ambari; MapReduce; REST; Data Refinement; Hive; Pig; HTTP; Stream; Structure; HCatalog (metadata services); Query/Visualization/Reporting/Analytical Tools and Apps; Data Sources; RDBMS; EDW; Interactive; HiveServer2; Load; Sqoop; WebHDFS; Flume; NFS; Sqoop/Hive; YARN; HDFS

Benchmarks – Project Gutenberg

The Modern Data Architecture for MDM

Recommendations: Make your MDM an iterative project rather than boiling the ocean. Don’t forget that Hadoop is a powerful computational platform, not just cheap storage. Use a tool that gives you agility and ease of use so your people are productive as quickly as possible. Parallelization is where Hadoop compute power comes from; use tools that leverage it properly. Come see the future of MDM in Booth 109.
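The point about parallelization can be illustrated with a small partition, execute and merge sketch in Python. This is only an analogy for how work is divided on a Hadoop cluster, not Hadoop or RedPoint code: the function names are hypothetical, and a thread pool stands in for tasks that a real cluster would run on separate nodes (threads are used here for portability and do not give true CPU parallelism in CPython).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(lines, n_parts):
    """Split the input into n_parts independent slices."""
    return [lines[i::n_parts] for i in range(n_parts)]


def count_words(lines):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, n_parts=4):
    """Run the map step on every partition concurrently, then merge."""
    parts = partition(lines, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = list(pool.map(count_words, parts))
    # Reduce step: merge the per-partition results into one total.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


lines = ["data lake data", "master data", "lake"]
result = parallel_word_count(lines, n_parts=2)
```

Because each partition is processed without reference to the others, adding workers (or, on a cluster, nodes) shortens the map step, and the merged result matches what a single sequential pass would produce.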