Programming Models for IoT and Streaming Data IC2E Internet of Things Panel Judy Qiu Indiana University.



Event Processing Programming Models
- Query based
  - Complex event processing
  - SQL-like languages
- Programming APIs
- Queries or programs run on a continuous stream, unlike Hadoop, where the data is static for the batch processor
- Need to address diverse streams
  - Unbounded sequence of events
- Examples
  - Video camera frames
  - Tweets
  - Laser scans from a robot
  - Log data
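To make the "continuous stream" idea concrete, here is a minimal Python sketch of a query that runs over an unbounded sequence of events. The source `sensor_events` and the query `sliding_average` are hypothetical names invented for illustration; they do not correspond to any framework on the slides.

```python
from collections import deque
from itertools import count, islice


def sensor_events():
    """Hypothetical unbounded source: yields (timestamp, value) forever."""
    for t in count():
        yield t, (t * 7) % 10  # deterministic stand-in for sensor readings


def sliding_average(events, window):
    """Continuous query: for each event, emit the mean of the last `window` values."""
    recent = deque(maxlen=window)
    for t, value in events:
        recent.append(value)
        yield t, sum(recent) / len(recent)


# The stream never ends, so the consumer decides how many results to take.
first_results = list(islice(sliding_average(sensor_events(), window=3), 5))
```

Unlike a batch job, the query never sees the "whole" input; it keeps only a small amount of state (the window) and produces a result per incoming event.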

Distributed Stream Processing Frameworks (DSPF)
- Aurora (early research system)
- Borealis (early research system)
- Apache Storm
- Apache S4
- Apache Samza
- Google MillWheel
- Amazon Kinesis
- LinkedIn Databus
- Facebook Puma/Ptail/Scribe/ODS
- Azure Stream Analytics
Will discuss 2 Apache Storm projects at Indiana University

I: IoTCloud
- Framework to connect devices to cloud services
- IoTCloud consists of
  - a set of distributed nodes running close to the devices to gather data
  - a set of publish-subscribe brokers to relay the information to the cloud services
  - a distributed stream processing framework (DSPF) coupled with batch processing frameworks in the cloud
- Uses an OpenStack environment
- Improving fault tolerance and quality of service, especially guarantees on maximum response time
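The data path described above (device-side node, publish-subscribe broker, cloud-side consumer) can be sketched in a few lines of Python. This is an in-memory stand-in only: the `Broker` and `Gateway` classes, the topic name, and the message layout are all hypothetical and are not the IoTCloud API.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a publish-subscribe broker (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Relay the message to every subscriber of the topic.
        for callback in self._subscribers[topic]:
            callback(message)


class Gateway:
    """Node running close to the devices; forwards readings to the broker."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic

    def on_device_data(self, device_id, reading):
        self.broker.publish(self.topic, {"device": device_id, "reading": reading})


received = []  # stands in for the cloud-side stream processing job
broker = Broker()
broker.subscribe("sensors/laser", received.append)

gateway = Gateway(broker, "sensors/laser")
gateway.on_device_data("robot-1", [0.8, 1.2, 1.1])
```

In a real deployment the broker would be a separate networked service, which is what decouples the devices from the processing layer.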

IoTCloud Architecture
Built on Apache Storm, RabbitMQ, HBase, …

IoTCloud Applications
- Particle-filtering-based SLAM: map built from robot data
- N-body collision avoidance: robots need to avoid collisions when they move
- Using parallel algorithms inside Storm for performance
- Response time better with RabbitMQ

II: Batch and Streaming Analysis for Social Media Data
- Storage substrate
- Batch analysis module
- Streaming analysis module

Streaming Analysis
- Non-trivial parallel stream processing algorithm with novel global synchronization and cluster-delta data transfer to achieve scalability
- Clustering of social media streams: real-time processing of the 10% Twitter sample (“Gardenhose”)
- Recent progress in learning data representations and similarity metrics
- High-dimensional vectors: textual and network information
- Expensive similarity computation: 43.4 hours to cluster 1 hour’s data with the sequential algorithm
- Online K-Means with sliding time window and outlier detection
- Group tweets as protomemes: hashtags, mentions, URLs, and phrases
Xiaoming Gao, Emilio Ferrara, Judy Qiu. Parallel Clustering of High-Dimensional Social Media Data Streams. To appear at 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2015).
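The following Python sketch illustrates the general shape of online clustering with a sliding time window and outlier detection. It is a simplified illustration, not the algorithm from the cited paper: the use of cosine similarity, the threshold rule (mean minus n standard deviations of past similarities), and the class name `WindowedClusterer` are all assumptions.

```python
import math
from collections import deque
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(points):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for p in points:
        for k, w in p.items():
            total[k] = total.get(k, 0.0) + w
    return {k: w / len(points) for k, w in total.items()}


class WindowedClusterer:
    def __init__(self, seeds, window_steps, n_sigma=2.0):
        # One deque of (step, point) per cluster; seeds act as bootstrap data at step 0.
        self.members = [deque([(0, s)]) for s in seeds]
        self.window_steps = window_steps
        self.n_sigma = n_sigma
        self.similarities = []  # history used for the outlier threshold (assumed rule)

    def add(self, step, point):
        """Return the index of the chosen cluster, or None if the point is an outlier."""
        self._expire(step)
        sims = [cosine(point, centroid([p for _, p in m])) if m else 0.0
                for m in self.members]
        best = max(range(len(sims)), key=sims.__getitem__)
        if len(self.similarities) >= 2:
            threshold = mean(self.similarities) - self.n_sigma * pstdev(self.similarities)
            if sims[best] < threshold:
                return None
        self.similarities.append(sims[best])
        self.members[best].append((step, point))
        return best

    def _expire(self, step):
        # Drop members that have fallen out of the sliding time window.
        for m in self.members:
            while m and m[0][0] <= step - self.window_steps:
                m.popleft()


clusterer = WindowedClusterer(
    seeds=[{"ram": 1.0, "vcu": 1.0}, {"storm": 1.0, "spark": 1.0}],
    window_steps=6,
)
first = clusterer.add(1, {"ram": 1.0, "vcu": 1.0, "support": 1.0})
second = clusterer.add(1, {"storm": 1.0, "bolt": 1.0})
third = clusterer.add(2, {"unrelated": 1.0})
```

Recomputing each centroid on every call keeps the sketch short but is deliberately naive; the similarity computation is exactly the cost the slide identifies as the bottleneck.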

Social media data – an example data record

Sequential clustering algorithm
Final step statistics for a sequential run over 6 minutes of data (table; values not recoverable from the transcript)
Columns: Time Step Length (s) | Total Length of Centroids’ Content Vector | Similarity Compute Time (s) | Centroids Update Time (s)
Settings: … clusters, time window length: 6 steps, outlier threshold: 2 standard deviations

Parallelization with Storm - challenges
Example cluster:
  Data point 1:
    Content_Vector: [“step”:1, “time”:1, “nation”:1, “ram”:1]
    Diffusion_Vector: … …
  Data point 2:
    Content_Vector: [“lovin”:1, “support”:1, “vcu”:1, “ram”:1]
    Diffusion_Vector: … …
  Centroid:
    Content_Vector: [“step”:0.5, “time”:0.5, “nation”:0.5, “ram”:1.0, “lovin”:0.5, “support”:0.5, “vcu”:0.5]
    Diffusion_Vector: … …
- DAG organization of parallel workers: hard to synchronize cluster information
- Sparsity of high-dimensional vectors makes any synchronization expensive
  - Cluster-delta synchronization strategy reduces message traffic and synchronization overhead
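A small Python sketch shows why sending a delta can be cheaper than sending a whole centroid. It assumes the centroid is the component-wise mean of its members' vectors, which is consistent with the numbers on the slide. The `(sums, count)` delta representation and the `Cluster` class are assumptions made for illustration and are not necessarily the message format used in the actual system.

```python
def add_into(sums, vector):
    """Accumulate a sparse vector (dict) into a running sum, in place."""
    for term, weight in vector.items():
        sums[term] = sums.get(term, 0.0) + weight


class Cluster:
    """Replica of a cluster kept as per-term sums plus a member count."""

    def __init__(self):
        self.sums = {}
        self.count = 0

    def apply_delta(self, delta_sums, delta_count):
        add_into(self.sums, delta_sums)
        self.count += delta_count

    def centroid(self):
        return {term: total / self.count for term, total in self.sums.items()}


p1 = {"step": 1, "time": 1, "nation": 1, "ram": 1}
p2 = {"lovin": 1, "support": 1, "vcu": 1, "ram": 1}

replica = Cluster()
replica.apply_delta(p1, 1)   # state after the first data point

delta = (p2, 1)              # only the terms touched since the last sync
replica.apply_delta(*delta)

full_centroid = replica.centroid()
```

Here the delta carries 4 entries while the full centroid carries 7. With real clusters whose centroids accumulate many more terms than any single update touches, the gap is presumably much larger.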

Solution – enhanced Apache Storm topology
[Topology diagram]
Components: tweet stream; Protomeme Generator Spout; Clustering Bolts running in worker processes; Synchronization Coordinator Bolt; ActiveMQ Broker; sequential or parallel batch clustering algorithm supplying bootstrap information
Message labels: PMADD, OUTLIER, SYNCREQ, SYNCINIT, CDELTAS

Scalability comparison
- 1 hour’s data for testing, first 10 minutes for bootstrap
- With the cluster-delta method, 33 minutes to process 50 minutes of data (faster than real time), due to decreased message sizes compared to the full-centroid approach
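The "better than real time" claim can be checked with simple arithmetic. The short Python sketch below uses only figures stated on the slides (33 minutes for 50 minutes of data here, and 43.4 hours for 1 hour of data with the sequential algorithm on the Streaming Analysis slide). The helper name `realtime_factor` is hypothetical.

```python
def realtime_factor(data_minutes, processing_minutes):
    """Minutes of stream processed per minute of wall-clock time (>1 keeps up)."""
    return data_minutes / processing_minutes


parallel = realtime_factor(50, 33)             # cluster-delta method
sequential = realtime_factor(60, 43.4 * 60)    # sequential algorithm

keeps_up = parallel > 1.0
```

A factor of about 1.5 means the parallel version processes roughly a minute and a half of tweets per minute of wall-clock time, whereas the sequential figure of about 0.023 falls far behind the stream. The two runs may differ in setup, so the ratio between them is only indicative.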