Building a real time Tweet map with Flink in six weeks


Building a real-time Tweet map with Flink in six weeks
OSTMap: fast PoC development with Flink

Proof of concept - an important tool in the industry
- A PoC is often necessary to show feasibility to customers
- It touches several topics:
  - Scalability
  - Stream processing
  - Batch processing
  - Storage and querying of data
- OSTMap as an example PoC

Goals for OSTMap
- Increase trust in big data technologies on the customer side
  - It is easy to build an application with current technologies
  - Even with almost no experience
- Teach students big data technologies
  - Recruiting
  - Bring big data to the university
- Build a real-time application to view recent geotagged tweets on a map
  - Search for terms and users, show these tweets on a map
  - Analytics: first data science jobs
  - …

Industry in practice: IT-Ringvorlesung 2016
- A course at the University of Leipzig
- Students work on projects of local companies
- Six students over a period of 6 weeks, not a full-time commitment
- Weekly meetings
- GitHub project: github.com/IIDP/OSTMap
- OSTMap team: Martin Grimmer, Matthias Kricke, Michael Schmeißer, Nico Graebling, Vincent Märkl, Christopher Rost, Christopher Schott, Kevin Shrestha, Hans Dieter Pogrzeba

mgm technology partners
We bring applications into production!
- Innovative software solution provider with application responsibility
- Specialist for highly scalable, transactional online applications
- Central lines of business: Insurance, E-Commerce, E-Government
- Founded in 1994
- 347 employees, 9 offices (2014)
- Revenue: 43.7 million € (2014)
- Part of Allgeier SE

ScaDS
Competence center for scalable data services and solutions Dresden/Leipzig
- Bundled Big Data research expertise of TU Dresden and Leipzig University
- Drive Big Data innovations
- Bring industry and science together
- Knowledge exchange and transfer

Walking skeleton
“A Walking Skeleton is a tiny implementation of the system that performs a small end-to-end function. It need not use the final architecture, but it should link together the main architectural components. The architecture and the functionality can then evolve in parallel.” - Alistair Cockburn
Image source: http://blog.codeclimate.com/blog/2014/03/20/kickstart-your-next-project-with-a-walking-skeleton

Milestone 1
- Read the stream, store data as JSON files, read data from JSON files, show tweets
- Components: stream processing, ingest / storage, querying, visualisation

Milestone 2
- Write to and read from Accumulo
- Show tweets on a map
- Full table scans, slow visualization

Milestone 3
- Term index
- Geotemporal index
- UI improvements
- Clustering
- …

OSTMap – stream, batch, storage and querying
a) Stream processing of geotagged tweets
b) Batch processing
c) Querying data (web service)

Stream processing of incoming data – first version
- GeoTweetSource, DateExtraction, KeyGeneration, RawTweetSink: data for the time index
This enabled us to build a slow term search and a slow map search via full table scans.

Stream processing of incoming data – final version
- GeoTweetSource, DateExtraction, KeyGeneration, RawTweetSink: data for the time index
- TermExtraction (tokenizing), UserExtraction, TermIndexSink: data for the term index
- GeoTemporalIndexCreation, GeoTemporalIndexSink: data for the geotemporal index
- Language Extraction, 1 minute window, sum by language, LanguageFrequencySink: data for the language statistics
Now we were able to build a faster term and map search and a language frequency visualization.
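The language statistics branch (1 minute window, sum by language) can be sketched without any streaming framework. This is plain Python, not the Flink job from the talk: the function names are hypothetical, and formatting the bucket in UTC is an assumption, since the slides only give the YYYYMMDDhhmm pattern.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 60  # the slide names a 1 minute window


def time_bucket(epoch_seconds: int) -> str:
    """Format the start of the tumbling window as YYYYMMDDhhmm (UTC is an assumption)."""
    window_start = epoch_seconds - epoch_seconds % WINDOW_SECONDS
    return datetime.fromtimestamp(window_start, tz=timezone.utc).strftime("%Y%m%d%H%M")


def language_frequencies(tweets):
    """Count tweets per (time bucket, language tag).

    `tweets` is an iterable of (epoch_seconds, language_tag) pairs.
    """
    counts = Counter()
    for epoch_seconds, language in tweets:
        counts[(time_bucket(epoch_seconds), language)] += 1
    return counts
```

Each (bucket, language) pair with its count corresponds to one entry that a sink such as LanguageFrequencySink would write.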

Batch processing
- Initial creation of the term index and geotemporal index for already processed tweets
- Data export
- Other statistics, such as the area or tweet distance a user covers with their tweets

Storage: Accumulo table design

Table                    | Row                        | Column Family      | Column Qualifier       | Value
RawTweetData (TimeIndex) | timestamp, hash (8b + 4b)  | -                  |                        | raw tweet json
TermIndex                | term                       | field (user, text) | RawTweetData key (12b) |
LanguageFrequency        | time bucket (YYYYMMDDhhmm) | language-tag       |                        | tweet count (4b)
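As a rough illustration of the key layout above, the following sketch builds a 12-byte RawTweetData key (8-byte timestamp plus 4-byte hash) and derives term index entries that point back to it. The slides do not say which hash or tokenizer was used: CRC32 and the simple word split are stand-ins, and the function names are hypothetical.

```python
import re
import struct
import zlib


def raw_tweet_key(timestamp_ms: int, tweet_json: str) -> bytes:
    """Build a 12-byte row key: 8-byte timestamp followed by a 4-byte hash."""
    # Big-endian encoding keeps byte-wise sort order equal to time order
    # for non-negative timestamps. CRC32 is an assumed stand-in hash.
    tweet_hash = zlib.crc32(tweet_json.encode("utf-8"))
    return struct.pack(">QI", timestamp_ms, tweet_hash)


def term_index_entries(raw_key: bytes, user: str, text: str):
    """Yield (row, column family, column qualifier) tuples for the term index."""
    yield (user.lower(), "user", raw_key)
    # Assumed tokenizer: lowercase, split on non-word characters, deduplicate.
    for term in sorted(set(re.findall(r"\w+", text.lower()))):
        yield (term, "text", raw_key)
```

A term search then only needs to scan the rows for that term and fetch the referenced raw tweets, instead of scanning the whole table.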

Geotemporal Index for OSTMap
Geo index
- Geohashes are used as row keys in Accumulo
- A geohash is a function from the 2D coordinate space to a 1D key space (Z curve)
- Key layout: Row = geohash, CF = RawTweetKey, CQ = - (geo data)
- Row keys in sort order: … 3z 6b 6c 6f 6g 9p 9r 9x 9z d0 d1 d2 d3 d4 d5 d6 … dg
- Map grid, partitioned by geohash:

9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
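To make the mapping from 2D coordinates to a 1D key concrete, here is a minimal sketch of standard geohash encoding. It is not taken from the OSTMap code base; the function name and the default precision of 2 characters are assumptions chosen to match the two-character cells on the slide.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 2) -> str:
    """Encode a coordinate by repeatedly halving the lon/lat intervals."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Because a longer geohash always starts with the shorter one, all points inside a cell share a row key prefix, which is what makes range scans over cells possible.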

Geotemporal Index for OSTMap
Geo index – querying?
- Key layout: Row = geohash, CF = RawTweetKey, CQ = lat/lon
- Row keys in sort order: … 3z 6b 6c 6f 6g 9p 9r 9x 9z d0 d1 d2 d3 d4 d5 d6 … dg
- Calculate the coverage of the bounding box
- Calculate scan ranges from the coverage: [9p], [9r], [d0,d1,d2,d3]
- Accumulo iterators run on each range and produce the result
- Map grid, partitioned by geohash:

9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
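The two query steps, coverage and scan ranges, can be sketched as follows for 2-character geohashes. This is an illustration under assumptions, not the OSTMap implementation: it uses standard geohash bit interleaving (so cell adjacency differs from the schematic grid on the slide), ignores bounding boxes that cross the antimeridian, and all function names are hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"
LON_CELLS = 32  # 5 longitude bits in a 2-character geohash
LAT_CELLS = 32  # 5 latitude bits in a 2-character geohash


def cell_id(ix: int, iy: int) -> int:
    """Interleave 5 longitude and 5 latitude bits (longitude first) into a 10-bit id."""
    value = 0
    for bit in range(4, -1, -1):
        value = (value << 1) | ((ix >> bit) & 1)
        value = (value << 1) | ((iy >> bit) & 1)
    return value


def to_geohash(value: int) -> str:
    """Render a 10-bit cell id as a 2-character geohash."""
    return BASE32[value >> 5] + BASE32[value & 31]


def coverage(min_lat, min_lon, max_lat, max_lon):
    """Return the sorted ids of all 2-character cells that intersect the box."""
    x0 = int((min_lon + 180.0) / 360.0 * LON_CELLS)
    x1 = min(int((max_lon + 180.0) / 360.0 * LON_CELLS), LON_CELLS - 1)
    y0 = int((min_lat + 90.0) / 180.0 * LAT_CELLS)
    y1 = min(int((max_lat + 90.0) / 180.0 * LAT_CELLS), LAT_CELLS - 1)
    return sorted(cell_id(ix, iy)
                  for ix in range(x0, x1 + 1)
                  for iy in range(y0, y1 + 1))


def scan_ranges(cell_ids):
    """Merge consecutive cell ids into inclusive (start, end) geohash ranges."""
    ranges = []
    for value in cell_ids:
        if ranges and value == ranges[-1][1] + 1:
            ranges[-1][1] = value  # extend the current run
        else:
            ranges.append([value, value])  # start a new run
    return [(to_geohash(start), to_geohash(end)) for start, end in ranges]
```

For example, `scan_ranges(coverage(1, -89, 11, -68))` yields the single range `("d0", "d3")`. Merging neighbouring cells this way keeps the number of scans small, and a server-side filter can then discard points that fall inside a covered cell but outside the exact bounding box.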

Geotemporal Index for OSTMap
Add some time!
- Key layout: Row = day, geohash; CF = RawTweetKey; CQ = lat/lon
- Row keys in sort order: … 13z 16b 16c 16f 16g 19p 19r 19x 19z 1d0 1d1 1d2 1d3 1d4 1d5 1d6 … 1dg … 23z 26b 26c 26f 26g 29p 29r 29x 29z 2d0 2d1 2d2 2d3 2d4 2d5 2d6 … 2dg
- Partitioned by geohash, with time buckets (axes: day, lon, lat)
- The map grid is repeated once per day (day 1, day 2, …, day i):

9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
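With a day prefix in the row key, a query over a time span and a set of cells turns into one row prefix per (day, geohash) pair. The sketch below assumes the day is written as YYYYMMDD; the slides only show a schematic one-digit day, so that format and the function names are assumptions.

```python
from datetime import date, timedelta


def row_key(day: date, geohash: str) -> str:
    """Row key = day bucket followed by the geohash (day format is assumed)."""
    return day.strftime("%Y%m%d") + geohash


def scan_rows(first_day: date, last_day: date, geohashes):
    """List one row prefix per (day, geohash) pair in the query window."""
    rows = []
    day = first_day
    while day <= last_day:
        rows.extend(row_key(day, geohash) for geohash in geohashes)
        day += timedelta(days=1)
    return rows
```

The number of scans grows with the number of days times the number of covered cells, which is the price for keeping each day's data contiguous.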

Geotemporal Index for OSTMap
What about hotspots?
- Key layout: Row = day, geohash; CF = RawTweetKey; CQ = lat/lon
- Row keys in sort order: … 13z 16b 16c 16f 16g 19p 19r 19x 19z 1d0 1d1 1d2 1d3 1d4 1d5 1d6 … 1dg … 23z 26b 26c 26f 26g 29p 29r 29x 29z 2d0 2d1 2d2 2d3 2d4 2d5 2d6 … 2dg
- Partitioned by geohash, with time buckets (axes: day, lon, lat)

Geotemporal Index for OSTMap
What about hotspots? Spreading byte
- Key layout: Row = sb, day, geohash; CF = RawTweetKey; CQ = lat/lon
- spreading byte = hash(tweet) % 255
- Reproducible
- Pre-split tables in Accumulo
- Partitioned by geohash, with time buckets (axes: day, lon, lat)
- Row keys per node:
  - node 0: … 01d2 01d3 01d4 … 02d2 02d3 02d4 …
  - node 1: … 11d2 11d3 11d4 … 12d2 12d3 12d4 …
  - node 2: … 21d2 21d3 21d4 … 22d2 22d3 22d4 …
  - …
  - node n
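A minimal sketch of the spreading byte idea follows. It keeps the modulus 255 from the slide, but the hash function (CRC32), the day format, and the function names are assumptions. Python's built-in `hash()` is avoided because it is salted per process for strings and would therefore not be reproducible.

```python
import zlib

SPREAD = 255  # modulus from the slide: spreading byte = hash(tweet) % 255


def spreading_byte(tweet_json: str) -> int:
    """Derive a reproducible prefix byte from the tweet content."""
    # CRC32 is an assumed stand-in; it gives the same value on every run.
    return zlib.crc32(tweet_json.encode("utf-8")) % SPREAD


def row_key(tweet_json: str, day: str, geohash: str) -> bytes:
    """Row key = spreading byte, then day, then geohash."""
    return bytes([spreading_byte(tweet_json)]) + day.encode("ascii") + geohash.encode("ascii")


def query_prefixes(day: str, geohash: str):
    """A query does not know the spreading byte, so it scans every possible prefix."""
    suffix = (day + geohash).encode("ascii")
    return [bytes([sb]) + suffix for sb in range(SPREAD)]
```

Writes for one busy cell and day are spread over up to 255 key prefixes, which pre-split tables can place on different nodes. The trade-off is that each query fans out into one scan per spreading byte.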

demo

Thank you
www.scads.de
www.mgm-tp.com
Martin Grimmer: grimmer[at]informatik.uni-leipzig.de
Matthias Kricke: kricke[at]informatik.uni-leipzig.de
Michael Schmeißer: michael.schmeisser[at]mgm-tp.com