Data Engineering
Gen-Tao Chiang, Data and Analytic Engineer
Data Science Africa 2018
Outline: Data Engineering, Architecture, Exploring Data (Kibana, Logstash, Python)
Data Engineer
Data Engineer Role: To build systems and pipelines that transform raw data into formats that data scientists can use.
Data Engineer: Typical Pipeline
Data sources: APIs, IoT devices, mobile, logs, CSV, streaming data or batch data. If it is streaming data, which tool are we going to use? Kafka, RabbitMQ, Fluentd?
Raw data storage: SQL databases (MySQL, PostgreSQL, Redshift), S3, HDFS, NoSQL (MongoDB, Elasticsearch)
Processing: Hadoop, Spark
Analytic environment: BI, Tableau, serving ML models?
Data Engineer Responsibilities
Developing and choosing the correct tools for collecting data from different sources.
If needed, combining data from different sources.
Architecting scalable distributed systems.
Architecting the data store.
Creating reliable data pipelines.
Working closely with data scientists to provide the right tools and solutions.
Data Engineering: Useful Tools
SQL: MySQL, PostgreSQL, Redshift, Hive
NoSQL: MongoDB, Elasticsearch, Cassandra, InfluxDB
Big data solutions: Hadoop, Spark
Programming: Python, Scala
Notebooks: Jupyter, Zeppelin
DevOps / cluster management: Docker, Linux system administration, Jenkins, Kubernetes, YARN, Mesos, Terraform
Cloud computing: AWS, Azure
Workflow management: Logstash, Luigi, Glue
Message queues: Kafka, RabbitMQ, Kinesis
Business intelligence and visualization: Splunk, Tableau, Kibana, D3, Superset, etc.
Architecture Greenhouse and Air quality
Architecture Data Storage: Elasticsearch
Elasticsearch is a search engine that provides distributed storage, an HTTP web interface, and schema-free JSON documents. It can be scaled up by simply adding more nodes. You can run Elasticsearch locally or use the AWS Elasticsearch service.
https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com
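As a quick check that the cluster is reachable, a minimal sketch using the elasticsearch Python client (the same client used in the indexing examples later) could look like this; point it at the AWS endpoint above or at localhost:9200 for a local install.

from elasticsearch import Elasticsearch

# Connect to the AWS Elasticsearch endpoint shown above (or a local node).
es = Elasticsearch(["https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443"])

# Basic connectivity checks: cluster info and cluster health.
print(es.info())
print(es.cluster.health())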
Architecture Data Storage: Elasticsearch: Create Index and Mapping
curl -X PUT "localhost:9200/greenhouse" -H 'Content-Type: application/json' -d'
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "type1": {
      "properties": {
        "analog_in_4": { "type": "float" },
        "relative_humidity_3": { "type": "float" },
        "temperature_2": { "type": "float" },
        "app_id": { "type": "text" },
        "dev_id": { "type": "text" },
        "time": { "type": "date", "format": "strict_date_optional_time" },
…………………………….
Shards: how many pieces the index data is split into across nodes. You can also specify the number of replicas.
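The same index can also be created from Python; a hedged sketch using the client's indices API, with the mapping body abbreviated to match the truncated curl example above:

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Same settings and (abbreviated) mapping as in the curl request above.
body = {
    "settings": {"number_of_shards": 1},
    "mappings": {
        "type1": {
            "properties": {
                "analog_in_4": {"type": "float"},
                "relative_humidity_3": {"type": "float"},
                "temperature_2": {"type": "float"},
                "app_id": {"type": "text"},
                "dev_id": {"type": "text"},
                "time": {"type": "date", "format": "strict_date_optional_time"},
            }
        }
    },
}
es.indices.create(index="greenhouse", body=body)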
Architecture Data Storage: Elasticsearch: List Indices
curl -X GET "https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443/_cat/indices?v"
health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   airquality  46OHSD-ASzG3cLWGW2A4Rw   1   1          0            0       230b           230b
yellow open   greenhouse2 SLJ7oNS2SqKT-6qkgkKRmA   1   1       2552            0    704.7kb        704.7kb
green  open   .kibana     pn-pPxQVQRO5gxe2C0iyug   1   0          7            1     44.5kb         44.5kb
Architecture Data Storage: Indexing Data from TTN to Elasticsearch

import ttn
from elasticsearch import Elasticsearch, helpers

# TTN application id and access key (placeholder), plus the target index.
app_id = "dsa2018-greenhouse"
access_key = "ttn-key"
index = "greenhouse"
doctype = "type1"
eshost = "localhost:9200"

es = Elasticsearch(
    [eshost],
    scheme="http",
    port=9200,
)
Architecture Data Storage: Indexing Data from TTN to Elasticsearch

# docs collects documents until they are flushed to Elasticsearch in bulk.
docs = []

def uplink_callback(msg, client):
    print("Received uplink from", msg.dev_id)
    print("type of msg", type(msg))
    # The sensor readings are assumed to come from the decoded TTN payload fields.
    fields = msg.payload_fields
    doc = {
        "_op_type": "index",
        "_index": index,
        "_type": doctype,
        "app_id": msg.app_id,
        "dev_id": msg.dev_id,
        "hardware_serial": msg.hardware_serial,
        "counter": msg.counter,
        "analog_in_4": fields.analog_in_4,
        "relative_humidity_3": fields.relative_humidity_3,
        "temperature_2": fields.temperature_2,
    }
    docs.append(doc)
Architecture Data Storage: Indexing Data from TTN to Elasticsearch

def write2es(documents):
    # Send the collected documents to Elasticsearch in one bulk request.
    helpers.bulk(es, documents, index=index, doc_type=doctype)
    print("reset docs to 0")
    global docs
    docs = []

# In the uplink callback, flush once two documents have been collected
# (a small batch size for the demo).
if len(docs) == 2:
    print("call write2es")
    write2es(docs)
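The slides do not show how the callback is attached to The Things Network; a minimal sketch of that wiring, assuming the ttn Python SDK's MQTT client, might look like this:

# Assumes app_id, access_key and uplink_callback from the previous slides.
handler = ttn.HandlerClient(app_id, access_key)
mqtt_client = handler.data()

# Register the callback and open the connection so uplink messages keep arriving.
mqtt_client.set_uplink_callback(uplink_callback)
mqtt_client.connect()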
Exploring Data Visualization: Kibana
Kibana is an open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster.
https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com/_plugin/kibana/
Choose the index: greenhouse2
Add fields (counter, app_id, relative_humidity_3, temperature_2, analog_in_4, dev_id)
Define the time range (June 1st)
Check the data coverage across time
Choose your sensor (dev_id:"dsa-greenhouse10")
Limit the moisture data range (dev_id:"dsa-greenhouse10" AND analog_in_4:[120 TO 121]); the same filter can also be expressed as an Elasticsearch query from Python, as sketched below.
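A hedged sketch of the same filter as an Elasticsearch query DSL request from Python (one sensor, moisture readings between 120 and 121), assuming the greenhouse2 index above:

from elasticsearch import Elasticsearch

es = Elasticsearch(["https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443"])

# Equivalent of the Kibana query: dev_id:"dsa-greenhouse10" AND analog_in_4:[120 TO 121]
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"dev_id": "dsa-greenhouse10"}},
                {"range": {"analog_in_4": {"gte": 120, "lte": 121}}},
            ]
        }
    }
}
result = es.search(index="greenhouse2", body=query)
print(result["hits"]["total"])
for hit in result["hits"]["hits"]:
    print(hit["_source"])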
Exploring Data ETL: Logstash
Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to storage. It has three components: an input for reading the data, a filter for manipulating the data, and an output for sending/storing the data.
Exploring Data
input {
  # Read all documents from Elasticsearch matching the given query
  elasticsearch {
    hosts => "https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443"
    index => "greenhouse2"
    query => '{"query": {"range" : {"time" : {"gte": "31/05/2018", "lte": "2019", "format": "dd/MM/yyyy||yyyy"}}}}'
  }
}
output {
  # Print a dot per event to show progress, and write selected fields to a CSV file.
  stdout { codec => dots }
  csv {
    fields => ['time','app_id','dev_id','hardware_serial','relative_humidity_3','temperature_2','analog_4']
    path => 'dsa-greenhouse.csv'
  }
}
Exploring Data Python Example: Greenhouse-practice.ipynb
pip install elasticsearch
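A starting point for the notebook, sketched under the assumption that the greenhouse2 index is read with the elasticsearch scan helper and loaded into pandas for exploration (the actual notebook may differ):

from elasticsearch import Elasticsearch, helpers
import pandas as pd

es = Elasticsearch(["https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443"])

# Stream every document in the greenhouse2 index and collect the source fields.
hits = helpers.scan(es, index="greenhouse2", query={"query": {"match_all": {}}})
df = pd.DataFrame(hit["_source"] for hit in hits)

# Quick look at the sensor readings over time.
print(df[["time", "dev_id", "temperature_2", "relative_humidity_3", "analog_in_4"]].head())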
Asante!