MySQL @ Uber How to Stream Data with StorageTapper Ovais Tariq, Shriniket Kale & Yevgeniy Firsov October 23, 2018
Agenda MySQL @ Uber Background & Motivation High Level Features System Architecture System Components Use Cases Current Status & Future Work
MySQL @ Uber
Overview MySQL @ Uber
- MySQL stores the majority of operational data
- Currently running MySQL 5.7
- We maintain our own fork: uber/percona-server
- MySQL-based storage offerings at Uber: Schemaless and MySQL-as-a-Service
Schemaless MySQL @ Uber
- Distributed sharded key-value store
- Ledger-style (append-only) semantics
- Uses the Raft consensus protocol to guarantee consistency
- Schemaless at Uber handles tens of petabytes of data, millions of queries per second, and tens of thousands of nodes
Schemaless MySQL @ Uber [Diagram: application workers writing through a Raft process to replicated MySQL clusters]
MySQL-as-a-Service MySQL @ Uber
- Full-featured MySQL available to services in raw form
- For use cases needing an ACID-compliant RDBMS and a dataset that fits on a single host
- More than 500 microservices use it for their storage needs
Background & Motivation
Change Data Capture Motivation and use cases
How does Wikipedia define it? "In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data."
Capturing Data Changes The traditional way
- Export the entire dataset and import it into your analytics platform every 24 hours
- Doesn't scale: can't export and import large datasets quickly enough for business needs
- Data is no longer fresh: 24 hours is way too long! Competitive pressure demands up-to-the-minute information
- Inefficient to read and write the same data over and over again
- Performance penalty on the source
Change Data Capture The StorageTapper way
- Deal with data changes incrementally, touching only the data that has changed
- Makes it possible to publish changes in close to real time
- Use cases at Uber: data ingestion into the analytics platform, and logical incremental backups
High Level Features
Key Features What does StorageTapper provide?
- Real-time change data streaming
- Multiple output formats (Avro, JSON, MessagePack)
- Multiple publishing destinations (Kafka, File, S3, HDFS, SQL)
- REST API for automated operations (see the sketch below)
- Source performance-aware design
- Horizontally scalable
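Tables are brought under ingestion through the REST API. A minimal sketch in Go of adding a table, assuming a local StorageTapper instance; the port, endpoint path, and parameter names here are illustrative assumptions, so check the project README for the exact contract:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Register a table for ingestion via the REST API.
	// The port, path, and parameter names below are illustrative;
	// see github.com/uber/storagetapper for the real API contract.
	resp, err := http.PostForm("http://localhost:7836/table", url.Values{
		"cmd":     {"add"},
		"cluster": {"cluster1"},
		"service": {"svc1"},
		"db":      {"orders_db"},
		"table":   {"orders"},
	})
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```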
Guarantees & Limitations What are the assurances & requirements?
Guarantees:
- Data changes on the database are consistent with what is published
- Events are guaranteed to be published at least once (see the consumer-side sketch after this list)
- Events are guaranteed to be published with a pre-defined format
- Events corresponding to the same row are guaranteed to be published in commit order
Restrictions & limitations:
- Every table being ingested must have a primary key defined
- Incompatible schema changes will break ingestion
- The MySQL binary log must be enabled in row-based (RBR) format
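Because delivery is at-least-once, a consumer may see the same event more than once. The per-row commit-order guarantee plus the per-row sequence number (described in the transformation section) make consumer-side deduplication straightforward. A minimal sketch, with hypothetical field names:

```go
package main

import "fmt"

// Event is a simplified view of a published change event; the field
// names mirror the transformation described later (row key, sequence
// number) and are illustrative.
type Event struct {
	RowKey string
	SeqNo  int64
	Body   []byte
}

// Dedup drops events already applied for a row. It relies on the
// per-row commit-order guarantee: a redelivery always carries a
// sequence number <= the last one seen for that row key.
type Dedup struct {
	lastSeen map[string]int64
}

func NewDedup() *Dedup { return &Dedup{lastSeen: map[string]int64{}} }

func (d *Dedup) ShouldApply(e Event) bool {
	if last, ok := d.lastSeen[e.RowKey]; ok && e.SeqNo <= last {
		return false // duplicate delivery, safe to skip
	}
	d.lastSeen[e.RowKey] = e.SeqNo
	return true
}

func main() {
	d := NewDedup()
	for _, e := range []Event{{"k1", 1, nil}, {"k1", 1, nil}, {"k1", 2, nil}} {
		fmt.Println(e.RowKey, e.SeqNo, "apply:", d.ShouldApply(e))
	}
}
```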
System Architecture
High Level Design Bird's-eye View [Diagram: StorageTapper reads binlog rows from MySQL / Schemaless, encodes them as Avro, JSON, MsgPack, or SQL, and publishes to Kafka, File, S3, or HDFS destinations]
High Level Design System Flow [Diagram: on the MySQL source, the Log Reader tails the binary log and buffers events into a Kafka input buffer; the Event Streamer consumes the buffer and publishes to the destination (e.g. Kafka); the State Manager tracks log and pipe offsets; the Schema Manager and StorageTapper API sit alongside]
System Components
State Manager System Flow [Same system-flow diagram, State Manager highlighted]
State Manager Managing and persisting state
- Tracks the state of ingestion for every table being published
- State tracking involves: tracking the latest GTID up to which binlog events have been consumed, and tracking the Kafka input buffer offset
- Ensures StorageTapper can stop and restart without losing the current state
- Persistence: state is updated after publishing a batch of binary log events, so that state writes don't impede streaming performance (a sketch follows)
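A sketch of what batched state persistence could look like, with one row per ingested table holding the consumed GTID set and the input-buffer offset. The table and column names are illustrative assumptions, not StorageTapper's actual state schema:

```go
package main

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql"
)

// saveState persists ingestion progress for one table. It is called
// once per published batch, not per event, so state writes do not
// impede streaming throughput.
func saveState(db *sql.DB, cluster, table, gtidSet string, bufferOffset int64) error {
	// Illustrative schema: state(cluster, table_name, gtid_set, buffer_offset).
	_, err := db.Exec(`
	    INSERT INTO state (cluster, table_name, gtid_set, buffer_offset)
	    VALUES (?, ?, ?, ?)
	    ON DUPLICATE KEY UPDATE gtid_set = VALUES(gtid_set),
	                            buffer_offset = VALUES(buffer_offset)`,
		cluster, table, gtidSet, bufferOffset)
	return err
}

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/storagetapper")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if err := saveState(db, "cluster1", "orders", "uuid:1-42", 1337); err != nil {
		panic(err)
	}
}
```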
Schema Manager System Flow [Same system-flow diagram, Schema Manager with its Table Registry, Schema Registry, and Schema Converter highlighted]
Schema Manager Managing the schema
- StorageTapper doesn't control MySQL schema creation or changes
- Converts the MySQL schema to the output schema and publishes it to an external or integrated schema service (see the type-mapping sketch below)
- Schema changes are validated for backward compatibility
- Schemas must be validated and registered before events are published
- Maintains a versioned history of the MySQL schema and the corresponding output schema
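Schema conversion boils down to mapping MySQL column types onto output-schema types. A minimal sketch of a MySQL-to-Avro type mapping; this is an illustrative subset, not StorageTapper's actual converter:

```go
package main

import "fmt"

// mysqlToAvro maps a MySQL column type to an Avro type. A real
// converter also handles decimals, temporal types, unsigned
// variants, and nullability (Avro unions with "null").
func mysqlToAvro(mysqlType string) string {
	switch mysqlType {
	case "tinyint", "smallint", "mediumint", "int":
		return "int"
	case "bigint":
		return "long"
	case "float":
		return "float"
	case "double":
		return "double"
	case "char", "varchar", "text":
		return "string"
	case "binary", "varbinary", "blob":
		return "bytes"
	default:
		return "string" // fall back to string for unmapped types
	}
}

func main() {
	for _, t := range []string{"bigint", "varchar", "blob"} {
		fmt.Printf("%s -> %s\n", t, mysqlToAvro(t))
	}
}
```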
Log Reader System Flow [Same system-flow diagram, Log Reader (binlog reader) highlighted]
Log Reader MySQL binlog events reader
- Fetches binary log events from the master
- Appears as a regular replica; no additional server-side configuration needed
- Master connection info is updated via the API on master failover
- Publishes binary log events to the Kafka input buffer topic
- The input topic acts as a persistent buffer: binary log events are buffered until consumed and transformed by the Event Streamer
[Diagram: application transactions COMMIT and are appended to the binary log as a sequence of events in commit order]
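StorageTapper is written in Go; a minimal sketch of tailing the binary log as a fake replica, using the open-source go-mysql replication package (an assumption; the slide doesn't name the library). The connection details are placeholders, and where this sketch prints events, the real Log Reader serializes them into the Kafka input buffer topic:

```go
package main

import (
	"context"
	"fmt"

	"github.com/siddontang/go-mysql/mysql"
	"github.com/siddontang/go-mysql/replication"
)

func main() {
	// Connect to the master exactly like a replica would; nothing
	// beyond ROW-format binary logging is required on the server.
	syncer := replication.NewBinlogSyncer(replication.BinlogSyncerConfig{
		ServerID: 100, // must be unique among replicas
		Flavor:   "mysql",
		Host:     "127.0.0.1",
		Port:     3306,
		User:     "repl",
		Password: "secret",
	})

	streamer, err := syncer.StartSync(mysql.Position{Name: "mysql-bin.000001", Pos: 4})
	if err != nil {
		panic(err)
	}

	for {
		ev, err := streamer.GetEvent(context.Background())
		if err != nil {
			panic(err)
		}
		// A real reader would serialize these events and produce them
		// to the Kafka input buffer topic; here we just print them.
		if rows, ok := ev.Event.(*replication.RowsEvent); ok {
			fmt.Printf("%s %s: %d row(s)\n",
				ev.Header.EventType, rows.Table.Table, len(rows.Rows))
		}
	}
}
```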
Event Streamer System Flow [Same system-flow diagram, Snapshot Reader highlighted]
Event Streamer Snapshot Reader
- Snapshot Reader: a special type of event reader that reads rows directly from the table
- Takes a transactionally consistent snapshot of the table
- Typically used when bootstrapping a table with existing data
- Allows all existing rows to be transformed into the required format and published (a sketch of the snapshot read follows)
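The consistency comes from reading the whole table inside one transaction opened WITH CONSISTENT SNAPSHOT (under MySQL's default REPEATABLE READ isolation). A minimal sketch with database/sql; the table and column names are illustrative:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/orders_db")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Pin one connection so all statements share the same session.
	conn, err := db.Conn(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Every row read inside this transaction reflects a single point
	// in time; concurrent writes do not show up mid-snapshot.
	if _, err := conn.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
		panic(err)
	}
	rows, err := conn.QueryContext(ctx, "SELECT id, body FROM orders") // illustrative table
	if err != nil {
		panic(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		var body []byte
		if err := rows.Scan(&id, &body); err != nil {
			panic(err)
		}
		fmt.Println(id) // a real snapshot reader transforms and publishes each row
	}
	conn.ExecContext(ctx, "COMMIT")
}
```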
Event Transformer System Flow [Same system-flow diagram, Event Transformer highlighted]
Event Streamer Events Transformation
- Events are transformed into messages according to the latest schema
- Row Key: generated from the primary key column values in an event
- Sequence Number: used by the consumer to read events for the same Row Key in order
- Is Deleted flag: signals whether the row was deleted
Event transformation:
- DDL → Avro schema
- Row INSERT → message encoded with the Avro schema; additional fields row_key, ref_key, and is_deleted (set to FALSE)
- Row UPDATE → same additional fields as INSERT; the update is converted to a DELETE + INSERT
- Row DELETE → message encoded with the Avro schema; all fields set to NULL except row_key and ref_key; is_deleted set to TRUE
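A sketch of the resulting message shape, following the field names above (row_key, ref_key, is_deleted). In production the payload is encoded with the registered Avro schema; JSON is used here only for readability, and the exact field layout is an assumption for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message mirrors the extra fields added during transformation.
// The layout is illustrative; real messages are Avro-encoded.
type Message struct {
	RowKey    string                 `json:"row_key"`    // derived from primary key columns
	RefKey    int64                  `json:"ref_key"`    // sequence number for in-order consumption
	IsDeleted bool                   `json:"is_deleted"` // TRUE for deletes (including the DELETE half of an UPDATE)
	Fields    map[string]interface{} `json:"fields"`     // NULLed out for deletes
}

func main() {
	// An UPDATE becomes a DELETE of the old row image plus an INSERT
	// of the new one, sharing the same row key.
	del := Message{RowKey: "order:42", RefKey: 7, IsDeleted: true, Fields: nil}
	ins := Message{RowKey: "order:42", RefKey: 8, IsDeleted: false,
		Fields: map[string]interface{}{"id": 42, "status": "completed"}}
	for _, m := range []Message{del, ins} {
		b, _ := json.Marshal(m)
		fmt.Println(string(b))
	}
}
```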
Event Streamer System Flow [Same system-flow diagram, Event Streamer output to the destination highlighted]
Event Streamer Publishing events to Kafka
- Events are published to Kafka as keyed messages, with the Row Key as the message key
- This ensures all events corresponding to the same row go to the same partition
- Kafka provides ordering guarantees within a partition
- All messages written to Kafka correspond to a complete row in MySQL (a keyed-publish sketch follows)
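A minimal sketch of keyed publishing using the Sarama Kafka client (an assumption; the slide doesn't name the client library). With the row key as the message key, the default hash partitioner routes all events for a row to one partition, where Kafka's ordering guarantee applies:

```go
package main

import (
	"fmt"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // required by SyncProducer

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		panic(err)
	}
	defer producer.Close()

	// Keying by row key hashes all events for the same row to the
	// same partition, so per-partition ordering covers the row.
	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "orders_changelog",               // illustrative topic name
		Key:   sarama.StringEncoder("order:42"), // row key
		Value: sarama.ByteEncoder(`{"id":42,"status":"completed"}`),
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("published to partition %d at offset %d\n", partition, offset)
}
```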
Use Cases
Use Cases How is Uber using StorageTapper?
- Real-time change data ingestion into the data warehouse
- Logical backups
Data Ingestion for Warehouse MySQL Change Data Streaming
- Used to connect the online datastore to the warehouse
- Produces a real-time changelog to Kafka
- Data is further ingested into the warehouse by reading from Kafka
- Cold start of the warehouse is supported through the snapshot functionality
- Currently producing thousands of data streams
Logical Backups Overview
- Version independent, with better data integrity
- Storage engine independent
- Significantly more compact
- Support for partial restore
Logical Backups Hot & Cold Backups
- Stream to on-premise HDFS for hot backup, in compressed JSON format (see the sketch below)
- Stream to S3 for cold backup; cold backup data is PGP encrypted and sealed
- Configurable number of workers to avoid availability impact
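For the hot-backup path, rows are streamed out as compressed JSON. A minimal sketch of that encoding, writing newline-delimited JSON through gzip to a local file; the HDFS writer and the PGP sealing of the cold path are omitted:

```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"os"
)

func main() {
	f, err := os.Create("orders.json.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	zw := gzip.NewWriter(f)
	defer zw.Close()

	enc := json.NewEncoder(zw) // one JSON document per line
	rows := []map[string]interface{}{
		{"id": 1, "status": "completed"},
		{"id": 2, "status": "pending"},
	}
	for _, r := range rows {
		if err := enc.Encode(r); err != nil {
			panic(err)
		}
	}
}
```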
Current Status & Future Work
Current Status Where are we right now?
- Currently in use at Uber for data ingestion into the data warehouse
- Used by Schemaless and MySQL-as-a-Service internal customers
- Rollout completed on all production databases
- Open sourced on GitHub: github.com/uber/storagetapper
In The Works What to expect next?
- Logical backup recovery
- SQL encoder
- Performance optimization: multiple workers per table, multiple changelog readers per cluster
Thank You Ovais Tariq, Shriniket Kale, Yevgeniy Firsov
Contact us:
- Explore further and contribute: github.com/uber/storagetapper
- Reach out to us directly: ot@uber.com, skale@uber.com, firsov@uber.com
Further reading on MySQL @ Uber:
- The Architecture of Schemaless, Uber Engineering's Trip Datastore Using MySQL
- Dockerizing MySQL at Uber Engineering
We are hiring! We're bringing Uber to every major city in the world. We need your skills and passion to help make it happen! Reach out to ot@uber.com