Towards Real-Time Road Traffic Analytics using Telco Big Data


Towards Real-Time Road Traffic Analytics using Telco Big Data Constantinos Costa∗, Georgios Chatzimilioudis∗, Demetrios Zeinalipour-Yazti‡∗ and Mohamed F. Mokbel§ ∗ University of Cyprus, Cyprus ‡ Max Planck Institute for Informatics, Germany § University of Minnesota, MN, USA {costa.c, gchatzim, dzeina}@cs.ucy.ac.cy dzeinali@mpi-inf.mpg.de, mokbel@cs.umn.edu 11th International Workshop on Real-Time Business Intelligence and Analytics, collocated with VLDB 2017, Munich, Germany, August 28, 2017

Motivation The expansion of mobile networks and the IoT (IP-enabled hardware) has contributed to an explosion of data inside Telecommunication Companies (Telcos)

Motivation INRIX reports that congestion cost U.S. drivers nearly $300 billion in 2016, an average of $1,400 per driver per year (both direct and indirect costs). INRIX Inc. 2017. Los Angeles Tops INRIX Global Congestion Ranking. (2017). https://goo.gl/c2K3E1 Accessed 7-Aug-2017.

Rank  City / Large Urban Area  2016 Peak Hours Spent  Total Cost Per Driver in 2016  Total Cost to the City in 2016 (based on city population size)
1     Los Angeles, CA          104                    $2,408                         $9.7bn
2     New York, NY              89                    $2,533                         $16.9bn
3     San Francisco, CA         83                    $1,996                         $2.5bn
4     Atlanta, GA               71                    $1,861                         $3.1bn
5     Miami, FL                 65                    $1,762                         $3.6bn

Telco Big Data (TBD) Telco Data: Traditional source for OLAP Data Warehouses and Analytics, e.g., accounting, billing, session data. Problem: Inadequate data resolution to address the biggest challenges, e.g., churn prediction, 5G network optimization, user-experience assessment, traffic mapping. Telco Big Data (TBD): Velocity data generated at the cell towers, e.g., signal strength, call drops, bandwidth measurements. Size: 5 TB/day for 10M clients (i.e., ~2 PB/year). RDBMS Data Store

Challenge Road traffic analytics and prediction systems are not sufficient: traffic modeling and prediction is limited to navigation enterprises utilizing crowdsourcing; the location privacy boundaries of users could be compromised by third-party mobile applications; the cost of continuously monitoring the traffic is very high. Our Approach: Introduce a Traffic-TBD (Traffic Telco Big Data) architecture that goes beyond crowdsourcing, retaining user privacy and minimizing cost by using existing resources.

Traffic-TBD Architecture Presentation Outline Introduction Background Traffic-TBD Architecture Data Layer Processing Layer Application Layer Experiments Conclusions

Background: Traffic analytics First Era (1E) traffic mapping: dedicated hardware, i.e., magnetic loop detectors or traffic cameras mounted on traffic lamps or road-sides. Second Era (2E) traffic mapping: handheld-based and probe vehicles; taxis or buses were equipped with a smartphone and a dedicated app for location updates, e.g., MIT CarTel (http://cartel.csail.mit.edu/doku.php)

Background: Traffic analytics Second+ Era (2.5E) traffic mapping: opportunistic crowdsourcing, e.g., TomTom MyDrive. Third Era (3E) traffic mapping: Traffic-TBD uses TBD data to generate the traffic mapping accurately, without additional hardware or additional costs.

TBD Network The Radio & Core Network generates streams of IP packets from the GSM/GPRS (2G), UMTS (3G), and LTE (4G) antennas.

TBD Streams Telco Big Data (TBD):
Call quality data (web speed, connection success rate)
Measurement reports (can be used to localize users with triangulation, ~50 meters)
Business Supporting Systems (BSS) data: tens of GB/day for a xM+ city
Operating Supporting Systems (OSS) data: tens of TB/day for a xM+ city
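As a toy illustration of how measurement reports can localize a user, here is a minimal 2-D trilateration sketch; the tower coordinates and distances are invented for the example, and real pipelines derive distances from signal-strength models with roughly 50 m error:

```python
# Minimal 2-D trilateration: localize a handset from distance estimates
# to three cell towers (illustrative data, not a real measurement report).
def trilaterate(towers, dists):
    (x1, y1), (x2, y2), (x3, y3) = towers
    d1, d2, d3 = dists
    # Subtracting the circle equations pairwise yields a 2x2 linear system.
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 - x1**2 + x2**2 - y1**2 + y2**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 - x1**2 + x3**2 - y1**2 + y3**2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

towers = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0)]
# A handset at (300, 400) would report these distances to the towers:
dists = [500.0, (700**2 + 400**2) ** 0.5, (300**2 + 600**2) ** 0.5]
print(trilaterate(towers, dists))  # ≈ (300.0, 400.0)
```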

Traffic-TBD Architecture Presentation Outline Introduction Background Traffic-TBD Architecture Data Layer Processing Layer Application Layer Experiments Conclusions

Traffic-TBD Architecture The Traffic-TBD architecture is based on SPATE* and has 3 layers: the Data layer aggregates and stores the data and provides the access methods; the Processing layer minimizes the query response time for data analytic queries; the Application layer enhances the user experience. * Constantinos Costa, Georgios Chatzimilioudis, Demetrios Zeinalipour-Yazti, and Mohamed F. Mokbel. 2017. Efficient Exploration of Telco Big Data with Compression and Decaying. In Proceedings of the 33rd International Conference on Data Engineering (ICDE’17). IEEE.

Data Layer: Components Storage component: minimizes storage Indexing component: maintains and uses a multi-resolution spatiotemporal index

Storage Component: Compression Compression refers to the encoding of data using fewer bits than the original representation. Benefits: shifts the resource bottleneck from storage and network I/O to the CPU (whose cycles are increasing faster); saves enormous amounts of storage and I/O on subsequent analytic tasks! Desiderata: given a setting where TBD arrives periodically in batches, we want to: minimize the space needed to store/archive the data; minimize the response time for spatiotemporal data exploration queries.

Storage Component: Compression We analyzed an anonymized real TBD dataset to understand what compression ratios can be achieved. Dataset: 1 week, 1.7M CDR, 21M NMS, 300K users. Tool: Shannon’s entropy (a typical measure of the maximum compressibility of a dataset). Observation: TBD contains significant redundancy (entropy close to 0)!
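Shannon's entropy can be estimated per field in a few lines; the sketch below uses invented cell-ID values, not the actual dataset, to show why a near-zero entropy signals high compressibility:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits per symbol; 0 means maximally redundant."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A highly repetitive "CDR field" (illustrative data, not the real TBD):
cell_ids = ["cell-17"] * 980 + ["cell-42"] * 20
print(shannon_entropy(cell_ids))  # close to 0 -> highly compressible
```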

Storage Component: Compression

Libraries \ Objectives            GZIP    7z      SNAPPY  ZSTD
Compression Ratio (rc)            9.06    11.75   4.94    9.72
Compression Time (Tc1) in sec     21.37   20.99   21.39   21.07
Decompression Time (Tc2) in sec   0.11    0.12    0.13

* GZIP is also readily available in most Stream I/O libraries & application-layer tools (e.g., HIVE, HUE)
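The compression-ratio measurement can be reproduced in spirit with Python's standard gzip module; the snapshot below is synthetic redundant data, not real TBD, so the numbers will differ from the table:

```python
import gzip
import time

# Measure compression ratio (rc) and time (Tc1) for one synthetic
# "snapshot": redundant CSV-like records stand in for a 30-min TBD batch.
snapshot = b"imsi=123456789,cell=17,rsrp=-95,drop=0\n" * 100_000

t0 = time.perf_counter()
compressed = gzip.compress(snapshot, compresslevel=6)
t_c = time.perf_counter() - t0

rc = len(snapshot) / len(compressed)
print(f"rc={rc:.1f}, Tc1={t_c:.3f}s")  # redundant data compresses well
```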

Indexing Component: Modules Our spatio-temporal index has 4 levels of temporal resolution: epoch (30 minutes), day, month, year. A) Incremence Module: adds compressed snapshots on the right-most path.
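One way to picture the four temporal resolutions is as nested keys derived from a snapshot timestamp; this is a sketch of the idea, not SPATE's actual on-disk layout:

```python
from datetime import datetime

def index_path(ts: datetime, epoch_minutes: int = 30):
    """Year -> month -> day -> epoch key for one 30-min snapshot."""
    epoch = (ts.hour * 60 + ts.minute) // epoch_minutes
    return (ts.year, ts.month, ts.day, epoch)

# Because time only moves forward, a new snapshot always extends the
# right-most path of the index tree: ingestion touches one node per level.
print(index_path(datetime(2017, 8, 28, 14, 45)))  # (2017, 8, 28, 29)
```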

Indexing Component: Modules B) Highlights Module: materialized views for the long-standing queries of users (e.g., drop-call counters, bandwidth statistics, etc.); essential for multi-granular visual analytics!

Processing Layer: Components Map Matching: generates traffic points and matches them to the road network. Privacy: location data has high privacy requirements; it is easy to infer user activity from trajectories. Visualization: efficiently packs and ships data to the UI. Mining: detects interesting incidents through the TBD.
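A minimal map-matching step can be sketched as snapping a localized traffic point onto the nearest road segment by perpendicular projection; the road geometry here is invented, and real map matching would also exploit trajectory continuity:

```python
def _dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def snap_to_road(p, segments):
    """Snap a localized traffic point p to its nearest road segment."""
    def project(a, b):
        (ax, ay), (bx, by) = a, b
        dx, dy = bx - ax, by - ay
        t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / (dx * dx + dy * dy)
        t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
        return (ax + t * dx, ay + t * dy)
    return min((project(a, b) for a, b in segments),
               key=lambda q: _dist2(p, q))

roads = [((0, 0), (10, 0)), ((0, 5), (10, 5))]  # two horizontal roads
print(snap_to_road((3, 1), roads))  # (3.0, 0.0): closer to the first road
```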

Application Layer: Queries A1. Hot Spot Analysis: Detect and visualize the most congested areas around a location. A2. Navigation Monitoring: Calculate and visualize the best route based on the travel time and traffic queue length. A3. Travel time: Calculate aggregated time loss due to congestion. A4. Travel speed: Determine average speed based on the traffic flow.

Application Layer: Queries A5. Accessibility analysis: Compute the time needed to reach a specific location at a specific time slot, considering the road traffic. A6. Incident detection: Detect incidents due to traffic. A7. Before-after analysis: Measure the traffic flow after a change in the road network so that the change can be evaluated by the traffic administration. A8. Travel behaviors: Identify travel behaviors throughout a region.
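Queries such as A1 (hot spot analysis) reduce to simple spatial aggregation over the map-matched points; a grid-based sketch with invented coordinates and an illustrative cell size:

```python
from collections import Counter

def hot_spots(points, cell=0.01, k=3):
    """A1 sketch: bucket (lat, lon) points into a grid and return the
    k most congested cells (cell size in degrees is illustrative)."""
    grid = Counter((int(lat / cell), int(lon / cell))
                   for lat, lon in points)
    return grid.most_common(k)

# Five points fall into one grid cell, two into another:
pts = [(35.157, 33.313)] * 5 + [(35.140, 33.300)] * 2
print(hot_spots(pts, k=1))  # the 5-point cell ranks first
```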

Application Layer: Components Interactive Traffic Analytics: receives a data analytic query and uses the index to combine the needed highlights to answer it. Alert Service: a background service that provides real-time and predicted traffic alerts by combining public APIs with the TBD. API: makes the generated traffic insights available for public use.

Presentation Outline Introduction Background Traffic-TBD Architecture Data Layer Processing Layer Application Layer Experiments Conclusions

Experimental Methodology To evaluate the Data layer, we have implemented a trace-driven testbed. Compared Frameworks: RAW: stores the snapshot on disk without compression or indexing. SHAHED: stores the snapshot using a quad-tree index (part of the offline-analytics SpatialHadoop 2.4 framework @ ICDE’15). SPATE: the underlying framework used in this work. Metrics: Ingestion Time (sec): time cost to store a 30-min TBD snapshot. Space (MB): space cost to store a 30-min TBD snapshot. Response Time (sec): time to answer a query.

Experimental Testbed TBD Operating System Stack: Datacenter with VMware ESXi 5.0.0 hosting 4 Ubuntu 14.04 server VMs, each featuring 8GB of RAM and 2 virtual CPUs (2.40GHz). Storage Element: slow 7.2K RPM RAID-5 SAS, 6 Gbps disks; each disk formatted in VMFS 5.54 (1MB block). TBD Framework Stack: Hadoop Distributed File System (HDFS) v2.5.2; Apache Hive 2.0 (online querying); Apache Spark 1.6.0 (offline data processing), with individual Scala programs submitted directly to the Spark computation master.

Ingestion Time and Space Ingestion time (left) and space (right) of the methods over different time windows within a day. Observation: SPATE is 1.25x slower in ingestion but needs 1 order of magnitude less storage space! Ingestions are separated by 30-minute windows, so this is acceptable.

Data Exploration Tasks Tasks: multivariate statistics, K-means clustering, and linear regression (all via SparkML). Observation: for these CPU-intensive tasks, SPATE = RAW = SHAHED in response time, even though SPATE uses 1 order of magnitude less disk space!

Conclusions Conclusions: the Traffic-TBD architecture aims to provide: micro-level traffic modeling and prediction; location privacy (users' data stays inside their mobile network); availability with minimal costs, using existing infrastructure. Future Work: map-matching of trajectories; privacy-preserving trajectory processing.

Towards Real-Time Road Traffic Analytics using Telco Big Data Thanks! Questions? Constantinos Costa∗, Georgios Chatzimilioudis∗, Demetrios Zeinalipour-Yazti‡∗ and Mohamed F. Mokbel§ ∗ University of Cyprus, Cyprus ‡ Max Planck Institute for Informatics, Germany § University of Minnesota, MN, USA {costa.c, gchatzim, dzeina}@cs.ucy.ac.cy dzeinali@mpi-inf.mpg.de, mokbel@cs.umn.edu Eleventh International Workshop on Real-Time Business Intelligence and Analytics Munich, Germany, August 28, 2017