Pilot Kafka Service Manuel Martín Márquez. Pilot Kafka Service Manuel Martín Márquez.

Slides:

Advertisements

Similar presentations

Building LinkedIn’s Real-time Data Pipeline

Advertisements

Distributed Data Processing

©2014 LinkedIn Corporation. All Rights Reserved. Gobblin’ Big Data with Ease Lin Qiao Data Analytics LinkedIn.

Maintaining and Updating Windows Server 2008

CERN IT Department CH-1211 Genève 23 Switzerland t Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure.

Apache Spark and the future of big data applications Eric Baldeschwieler.

SOA – Development Organization Yogish Pai. 2 IT organization are structured to meet the business needs LOB-IT Aligned to a particular business unit for.

Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.

Cloud Computing 1. Outline  Introduction  Evolution  Cloud architecture  Map reduce operation  Platform 2.

M i SMob i S Mob i Store - Mobile i nternet File Storage Platform Chetna Kaur.

IMDGs An essential part of your architecture. About me

FI-CORE Data Context Media Management Chapter Release 4.1 & Sprint Review.

9 Systems Analysis and Design in a Changing World, Fourth Edition.

CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.

Hadoop IT Services Hadoop Users Forum CERN October 7 th,2015 CERN IT-D*

Stairway to the cloud or can we take the highway? Taivo Liik.

AMQP, Message Broker Babu Ram Dawadi. overview Why MOM architecture? Messaging broker like RabbitMQ in brief RabbitMQ AMQP – What is it ?

System/SDWG Update Management Council Face-to-Face Flagstaff, AZ August 22-23, 2011 Sean Hardman.

SYSTEMSDESIGNANALYSIS 1 Chapter 21 Implementation Jerry Post Copyright © 1997.

Smart Grid Big Data: Automating Analysis of Distribution Systems Steve Pascoe Manager Business Development E&O - NISC.

Breaking the frontiers of the Grid R. Graciani EGI TF 2012.

Maintaining and Updating Windows Server 2008 Lesson 8.

Apache Kafka A distributed publish-subscribe messaging system

9 Systems Analysis and Design in a Changing World, Fifth Edition.

Microsoft Partner since 2011

Towards High Performance Processing of Streaming Data May Supun Kamburugamuve, Saliya Ekanayake, Milinda Pathirage and Geoffrey C. Fox Indiana.

The Big Data Network (phase 2) Cloud Hadoop system

Integration of Oracle and Hadoop: hybrid databases affordable at scale

Protecting a Tsunami of Data in Hadoop

Univa Grid Engine Makes Work Management Automatic and Efficient, Accelerates Deployment of Cloud Services with Power of Microsoft Azure MICROSOFT AZURE.

WLCG Workshop 2017 [Manchester] Operations Session Summary

Big Data Enterprise Patterns

PROTECT | OPTIMIZE | TRANSFORM

Integration of Oracle and Hadoop: hybrid databases affordable at scale

Database Services Katarzyna Dziedziniewicz-Wojcik On behalf of IT-DB.

Clouds , Grids and Clusters

Data Analytics and CERN IT Hadoop Service

Introduction to Distributed Platforms

Hadoop and Analytics at CERN IT

Real-time Streaming and Data Pipelines with Apache Kafka

Open Source distributed document DB for an enterprise

Next Generation of Post Mortem Event Storage and Analysis

Collecting heterogeneous data into a central repository

Spark Presentation.

Database Workshop Report

DI4R, 30th September 2016, Krakow

New Big Data Solutions and Opportunities for DB Workloads

Cloud Management Mechanisms

Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.

Hadoop Clusters Tess Fulkerson.

MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner

AWS DevOps Engineer - Professional dumps.html Exam Code Exam Name.

2018 Amazon AWS DevOps Engineer Professional Dumps - DumpsProfessor

Microsoft Build /8/2018 5:15 AM © 2016 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY,

Introduction to Spark.

Dr. John P. Abraham Professor, Computer Engineering UTPA

Dev Test on Windows Azure Solution in a Box

Data Analytics – Use Cases, Platforms, Services

Big Data - in Performance Engineering

Ewen Cheslack-Postava

Data science and machine learning at scale, powered by Jupyter

Hadoop Technopoints.

Evolution of messaging systems and event driven architecture

Overview of big data tools

Big-Data Analytics with Azure HDInsight

Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.

Presentation transcript:

Pilot Kafka Service Manuel Martín Márquez

Kafka Kafka is a distributed streaming platform High Scalable (partition) Fault Tolerant (replication) Allow high level of parallelism and decoupling between data producers and data consumers De facto standard for near real-time store, access and process data streams Critical component of most of the Big Data Platform and therefore of Hadoop ecosystem

Kafka Basic Concepts Broker: Kafka node on the cluster Topics: Stream of records category - Multiple writers and readers Partitioned Replicated Consumer: pulls messages off of a Kafka topic Producer: push messages into a Kafka topic Data Retention: - Based on time or size Zookeeper: Stores Kafka Metadata Source: Hortonworks

Kafka entry points Custom implementation of producer and consumer using Kafka client API Java, Scala, C++, Python Kafka Connectors LogFile, HDFS, JDBC, ElasticSearch… Logstash Source and sink Apache Flume out-of-the-box can use Kafka as Source, Channel, Sink Other ingestion or processing tools support Kafka Apache Spark, LinkedIn Gobblin, Apache Storm…

Kafka for Data Integration and Processing

Kafka at CERN – it monitoring

Kafka at CERN – it monitoring (Requirements) Throughput and retention policy Currently 200 GB/day (forecast 500 GB/day) Retention Policy 12h in qa and 24 hours in prod (largest retention policy to cover potential problems over weekends) ~ 4000 messages, up to 10k peaks ~50 topics Security (Kerberos) Flume can be potentially upgrade to 1.7 early in 2017 (work in progress already). Administration Capabilities Administrative operations Topic configuration, rebalancing, user management, start/stop cluster Possibility to increase retention policy, replication factor

Kafka at CERN – CALS Critical Service no data no beam

Kafka at CERN – CALS (Requirements) Throughput and retention policy Currently 30 GB/hour only including the logging processes Plan to incrementally include all the systems with potentially mean several TBs Compression with Snappy will be evaluated to determined performance Retention policy 24 hours, which is the time they need to buffer data and compact it to send it to Hadoop Security (Kerberos) Infrastructure Openstack under several conditions: TN need to be supported for several reasons High availability of the service CALS on top of private cloud (No CALS no BEAM in the LHC) Administration Capabilities Administrative operations Topic configuration, rebalancing, user management, start/stop cluster Possibility to increase retention policy, replication factor

Kafka at CERN Security Team LHC Postmortem Industrial Control Systems Already using Kafka for pattern matching Data integration LHC Postmortem Potentially ingested by CALS Industrial Control Systems WinCCOA Data

Pilot Kafka Service Scope Study the current Kafka use case together with the different teams involved Collect requirements Understand feasibility and added value of Kafka as a central service

Pilot Kafka Service Collect requirements (5 Major Use Cases): CALS, IT-Monitoring, Security Team, Industrial Control, Post-mortem Throughput, Retention Policy, Security, Infrastructure, Administration Capabilities Agreement to test the service from the first phase Ensure the service cope with their requirements More details: https://twiki.cern.ch/twiki/bin/viewauth/DB/CERNonly/KafkaService

Pilot Kafka Service – Current Development Pilot Implementation - rapid iteration which will help to understand service and use case. On-demand Kafka service approach Self-Service Cluster creation, management and expansion Allow users to perform administrative tasks that are traditionally carried out by administrators Facilitating operating system and engine updates (Kafka, Zookeeper) Transparently integrate all the needed services (Security, Storage, Procurement, etc) Support for service continuity in case of hardware failure

Pilot Kafka Service – Current Development Configuration and Management REST API Security enabled - Kerberos on Kafka and Zookeeper (SSL optional) Monitoring Capabilities OpenStack on GPN Network storage Dedicated Kafka and Zookeeper per user

Towards Kafka Production Service Service evaluation phase and time line

Towards Kafka Production Service Consolidation to Production Web Interface to manage clusters (Self-service) Evolution of the configuration management API Functionalities toward the self-service platform Integration with Openstack Full monitoring beyond JMX metrics Kafka-Mirroring (High Availability) Deploy service in TN (due to service design that is transparent for us) Kafka as close as possible to consumers and producers