Data Engineering Gen-Tao Chiang, Data and Analytic Engineer Data Science Africa 2018

Outline Data Engineering Architecture Exploring Data: Kibana, Logstash, Python

Data Engineer

Data Engineer Role: To build systems and pipelines that transform raw data into formats that data scientists can use.

Data Engineer Typical pipeline. Data sources: APIs, IoT devices, mobile, logs, CSV, streaming or batch data. If the data is streaming, which tool do we use? Kafka, RabbitMQ, Fluentd? Raw data storage: SQL databases (MySQL, PostgreSQL, Redshift), S3, HDFS, NoSQL (MongoDB, Elasticsearch). Processing: Hadoop, Spark. Analytic environment: BI tools, Tableau, serving ML models?

Data Engineer Responsibilities: Developing and choosing the right tools for collecting data from different sources. If needed, combining data from different sources. Architecting scalable distributed systems. Architecting the data store. Creating reliable data pipelines. Working closely with data scientists to provide the right tools and solutions.

Data Engineering Useful Tools. SQL: MySQL, PostgreSQL, Redshift, Hive. NoSQL: MongoDB, Elasticsearch, Cassandra, InfluxDB. Big data solutions: Hadoop, Spark. Programming: Python, Scala. Notebooks: Jupyter, Zeppelin. DevOps / cluster management: Docker, Linux system administration, Jenkins, Kubernetes, YARN, Mesos, Terraform. Cloud computing: AWS, Azure. Workflow management: Logstash, Luigi, Glue. Message queues: Kafka, RabbitMQ, Kinesis. Business intelligence and visualization: Splunk, Tableau, Kibana, D3, Superset, etc.

Architecture Greenhouse and Air quality

Architecture Data Storage: Elasticsearch. Elasticsearch is a search engine that provides distributed storage, an HTTP web interface, and schema-free JSON documents. It can be scaled up simply by adding more nodes. You can run Elasticsearch locally or use the AWS Elasticsearch service. https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com
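Before creating any indices it is worth confirming the cluster is reachable over HTTP. This quick check is not from the slides; it just uses the standard cluster health and node listing endpoints against a local instance:

curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:9200/_cat/nodes?v"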

Architecture Data Storage: Elasticsearch: Create Index and Mapping

curl -X PUT "localhost:9200/greenhouse" -H 'Content-Type: application/json' -d'
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "type1": {
      "properties": {
        "analog_in_4": { "type": "float" },
        "relative_humidity_3": { "type": "float" },
        "temperature_2": { "type": "float" },
        "app_id": { "type": "text" },
        "dev_id": { "type": "text" },
        "time": { "type": "date", "format": "strict_date_optional_time" },
        ………………………….

Shards: how many pieces the index is split into across nodes. You can also specify replicas.
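Once the index exists, a single reading can be indexed with the same HTTP API. This is an illustrative sketch, not from the slides: the field values are made up, and it assumes an Elasticsearch version that still supports the "type1" mapping type used above:

curl -X POST "localhost:9200/greenhouse/type1" -H 'Content-Type: application/json' -d'
{
  "app_id": "dsa2018-greenhouse",
  "dev_id": "dsa-greenhouse10",
  "temperature_2": 23.1,
  "relative_humidity_3": 55.0,
  "analog_in_4": 120.5,
  "time": "2018-06-01T10:00:00Z"
}'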

Architecture Data Storage: Elasticsearch: List Indices

curl -X GET "https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443/_cat/indices?v"

health status index       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   airquality  46OHSD-ASzG3cLWGW2A4Rw   1   1          0            0       230b           230b
yellow open   greenhouse2 SLJ7oNS2SqKT-6qkgkKRmA   1   1       2552            0    704.7kb        704.7kb
green  open   .kibana     pn-pPxQVQRO5gxe2C0iyug   1   0          7            1     44.5kb         44.5kb
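To look at the documents themselves rather than the index metadata, the standard _search endpoint can be used against the same cluster (a minimal sketch; the size parameter is just illustrative):

curl -X GET "https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443/greenhouse2/_search?size=2&pretty"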

Architecture Data Storage: Indexing Data from TTN to Elasticsearch

import ttn
from elasticsearch import Elasticsearch, helpers

app_id = "dsa2018-greenhouse"
access_key = "ttn-key"
index = "greenhouse"
doctype = "type1"
eshost = "localhost:9200"

es = Elasticsearch(
    [eshost],
    scheme="http",
    port=9200,
)

Architecture Data Storage: Indexing Data from TTN to Elasticsearch

def uplink_callback(msg, client):
    print("Received uplink from ", msg.dev_id)
    print("type of msg", type(msg))
    # Sensor readings come from the decoded TTN payload
    # (assumes they are exposed via msg.payload_fields)
    analog_in_4 = msg.payload_fields.analog_in_4
    relative_humidity_3 = msg.payload_fields.relative_humidity_3
    temperature_2 = msg.payload_fields.temperature_2
    doc = {
        "_op_type": "index",
        "_index": index,
        "_type": doctype,
        "app_id": msg.app_id,
        "dev_id": msg.dev_id,
        "hardware_serial": msg.hardware_serial,
        "counter": msg.counter,
        "analog_in_4": analog_in_4,
        "relative_humidity_3": relative_humidity_3,
        "temperature_2": temperature_2,
    }

Architecture Data Storage: Indexing Data from TTN to Elasticsearch

def write2es(documents):
    helpers.bulk(es, documents, index=index, doc_type=doctype)
    print("reset docs to 0")
    global docs
    docs = []

# Inside the uplink callback, readings are buffered and flushed in small batches
if len(docs) == 2:
    print("call write2es")
    write2es(docs)
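The slides do not show how the callback is registered with TTN. A minimal sketch using the TTN Python SDK (ttn.HandlerClient and its MQTT data client) is shown below; the exact wiring and the keep-alive loop are assumptions, not taken from the slides:

import time

# Hedged sketch: connect to TTN and register the uplink callback defined above.
# app_id and access_key are the placeholders from the earlier slide.
handler = ttn.HandlerClient(app_id, access_key)
mqtt_client = handler.data()
mqtt_client.set_uplink_callback(uplink_callback)
mqtt_client.connect()

# Keep the process alive so uplink messages keep arriving and being indexed.
while True:
    time.sleep(10)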

Exploring Data Visualization: Kibana. Kibana is an open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com/_plugin/kibana/

Steps:
Choose the index: greenhouse2
Add fields (counter, app_id, relative_humidity_3, temperature_2, analog_in_4, dev_id)
Define the time range (June 1st)
Check data coverage across time
Choose your sensor (dev_id:"dsa-greenhouse10")
Limit the moisture data range (dev_id:"dsa-greenhouse10" AND analog_in_4:[120 TO 121])
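The last search-bar query above uses Lucene syntax. An equivalent Elasticsearch query DSL request, shown here as an illustrative sketch against a local cluster rather than something from the slides, would be:

curl -X GET "localhost:9200/greenhouse2/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "dev_id": "dsa-greenhouse10" } },
        { "range": { "analog_in_4": { "gte": 120, "lte": 121 } } }
      ]
    }
  }
}'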

Exploring Data ETL: Logstash. Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to storage. It has three components: an input for reading the data, a filter for manipulating the data, and an output for sending/storing the data.

Exploring Data

input {
  # Read all documents from Elasticsearch matching the given query
  elasticsearch {
    hosts => "https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443"
    index => "greenhouse2"
    query => '{"query": {"range" : {"time" : {"gte": "31/05/2018","lte": "2019","format": "dd/MM/yyyy||yyyy"}}}}'
  }
}

output {
  stdout { codec => dots }
  csv {
    fields => ['time','app_id','dev_id','hardware_serial','relative_humidity_3','temperature_2','analog_4']
    path => 'dsa-greenhouse.csv'
  }
}
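Assuming the configuration above is saved as, say, greenhouse-export.conf (the filename is illustrative), the pipeline is run from the Logstash installation directory with:

bin/logstash -f greenhouse-export.conf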

Exploring Data Python Example: Greenhouse-practice.ipynb

pip install elasticsearch
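As a starting point for the notebook, here is a minimal sketch using the elasticsearch Python client against the greenhouse2 index and the AWS endpoint used above; it is an assumption about the notebook's contents, not taken from it:

from elasticsearch import Elasticsearch

# Connect to the AWS Elasticsearch endpoint from the earlier slides
es = Elasticsearch(
    ["https://search-dsa-fx7s43inji63qg3oxxuziwjxa4.us-west-2.es.amazonaws.com:443"]
)

# Fetch a few greenhouse readings and print the sensor values
res = es.search(index="greenhouse2", body={"query": {"match_all": {}}, "size": 5})
for hit in res["hits"]["hits"]:
    doc = hit["_source"]
    print(doc.get("dev_id"), doc.get("temperature_2"), doc.get("relative_humidity_3"))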

Asante! (Thank you!)