Prototyping A Web-based High-Performance Visual Analytics Platform for Origin-Destination Data: A Case study of NYC Taxi Trip Records Jianting Zhang1,2.




Presentation transcript:

Prototyping A Web-based High-Performance Visual Analytics Platform for Origin-Destination Data: A Case Study of NYC Taxi Trip Records
Jianting Zhang1,2, Simin You2,3, Yinglong Xia4
1 Department of Computer Science, CUNY City College (CCNY)
2 Department of Computer Science, CUNY Graduate Center
3 Pitney Bowes, Inc.
4 IBM T. J. Watson Research Center

Outline
Introduction, Background and Motivation
System Architecture and Implementations: Geospatial backend; Graph Database for Social Network Analysis; Web Frontend
Experiments and Demonstrations: http://134.74.112.65/ibmjsa/web/ and http://134.74.112.65/~you/geosocial/
Summary and Future Work

Taxi Trip OD Data in NYC
Taxicabs: 13,000 medallion taxi cabs. Car services and taxi services are separate.
Taxi trip records: ~170 million trips (300 million passengers) in 2009, about 1/5 of subway ridership and 1/3 of bus ridership in NYC.
Data from 2013 onward are open (http://chriswhong.com/open-data/foil_nyc_taxi/).

Other types of OD Data
Social network activities
Call Detail Records (CDR)

Web-based High-Performance Visual Analytics Platform for Origin-Destination Data
[Diagram: the platform positioned among Vis., GIS, Web (Web-GIS), and Big Data and HPC.]
See Section 2 for a more detailed review.

Commodity Parallel Hardware
[Diagram: CPU host (CMP) with cores, local caches, shared cache, DRAM, disk and SSD; GPU (SIMD, thread blocks, GDRAM) attached over PCI-E; MIC (ring bus, in-order cores with 4 threads each, T0 to T3).]
Example configuration: 16 Intel Sandy Bridge CPU cores + 128 GB RAM + 8 TB disk + GTX TITAN + Xeon Phi 3120A, ~$9,994 (Jan. 2014)

Prototype System Architecture and Components
(1) CCNY Geospatial Backend: large-scale geospatial data management; columnar data layout and storage; spatial query processing; spatial and spatiotemporal aggregation to derive graph structures and weights.
(2) IBM SystemG Backend: graph data management and analytics (shortest path, centrality, PageRank…).
(3) Web frontend for geospatial and geosocial visual exploration using NYC Taxi Trip Data:
    Network/Web communication: JavaScript asynchronous function calls; JSON string encoding and parsing; user data and Web-GIS API binding.
    Frontend geometry library: MBR indexing; point-in-polygon test; line-polygon intersection.
    GUI: spatial selection (polygon drawing); temporal selection (dropdown list); OD indication (arrow/polyline drawing).
Also labeled in the diagram: Web Proxy (PHP), Web-GIS API.

Geospatial backend 1
Dual role:
    Online: processing spatial queries issued through client-side visual exploration interfaces.
    Offline: aggregating OD records to generate dynamic graphs for online social network analysis and visualization.
Design choices:
    Traditional GIS and spatial databases (for point aggregations over polygons): disk-resident and serial, hence slow for large-scale data; standard programming interfaces/protocols (e.g., SQL, OGC specifications) make them easy to use.
    Hardware-accelerated parallel systems: high performance, but with robustness/usability concerns.
Observations:
    Interactively drawn ROI polygons are typically simple (low complexity).
    A linear scan of points is cache friendly and embarrassingly parallelizable.
Our solution: a lightweight parallel backend for high-performance point aggregations.
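The offline role above (aggregating OD records into graph weights) can be illustrated with a minimal sketch. It assumes each trip has already been assigned a pickup zone and a drop-off zone, stored as two separate columns; the function name aggregate_od, the zone-id encoding, and the dense weight matrix are hypothetical choices for illustration, not the authors' implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count trips per (origin zone, destination zone) pair.
 * origin_zone[i] and dest_zone[i] are zone ids in [0, num_zones) for trip i,
 * kept as separate columns. weights is a num_zones x num_zones row-major
 * matrix; weights[o * num_zones + d] receives the number of trips o -> d. */
void aggregate_od(const int *origin_zone, const int *dest_zone,
                  size_t num_trips, int num_zones, unsigned *weights)
{
    assert(num_zones > 0);
    memset(weights, 0,
           (size_t)num_zones * (size_t)num_zones * sizeof *weights);

    for (size_t i = 0; i < num_trips; ++i) {
        int o = origin_zone[i];
        int d = dest_zone[i];
        /* Assumption: an endpoint outside every zone has an out-of-range id
         * (e.g. -1); such trips are skipped. */
        if (o < 0 || o >= num_zones || d < 0 || d >= num_zones)
            continue;
        ++weights[(size_t)o * (size_t)num_zones + (size_t)d];
    }
}
```

Each nonzero entry of weights then corresponds to one directed graph edge whose weight is the number of OD records between that pair. A dense matrix is only practical for a modest number of zones; a sparse edge list would be the natural alternative for larger zone sets.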

Geospatial backend 1
int pip_count(float vertices[][2], int num_vertices) {
    …
    int count = 0;
    /* Each thread counts its share of the points; OpenMP sums the partial counts. */
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < num_points; ++i) {
        double x = point_x[i];
        double y = point_y[i];
        /* MBR filter: skip points outside the polygon's bounding rectangle. */
        if (x < xmin || x > xmax || y < ymin || y > ymax) continue;
        bool in_polygon = false;
        /* Crossing test: toggle for each edge crossed by a horizontal ray from (x, y) toward +x. */
        for (int j = 0; j < num_vertices-1; ++j) {
            double x0 = vertices[j][0];
            double x1 = vertices[j+1][0];
            double y0 = vertices[j][1];
            double y1 = vertices[j+1][1];
            if ((((y0 <= y) && (y < y1)) || ((y1 <= y) && (y < y0))) &&
                (x < (x1 - x0) * (y - y0) / (y1 - y0) + x0))
                in_polygon = !in_polygon;
        }
        if (in_polygon) ++count;
    }
    return count;
}
Simple OpenMP directive for parallelization on multi-core CPUs.
MBR filtering: many ROIs have small spatial extents.
In-memory processing: scanning ~170 million points takes ~1/4 s for a typical ROI polygon on a legacy machine (dual quad-core 2.0 GHz, released in 2007).
PIP test code due to W. Randolph Franklin of RPI in the 1990s.
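The listing above elides its setup, so it does not compile on its own. A self-contained variant might look like the following sketch. The assumptions here are not from the slides: the point columns are passed in as arrays px and py, the MBR is computed from the polygon vertices, the ring is closed (the last vertex repeats the first, as the loop bound num_vertices-1 suggests), and the function name pip_count_sketch is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch only: px/py are assumed point columns; the polygon ring is assumed
 * closed, i.e. vertices[num_vertices-1] repeats vertices[0]. */
long pip_count_sketch(const float *px, const float *py, long num_points,
                      float vertices[][2], int num_vertices)
{
    assert(num_vertices >= 4); /* closed ring of at least a triangle */

    /* Minimum bounding rectangle (MBR) of the polygon. */
    float xmin = vertices[0][0], xmax = vertices[0][0];
    float ymin = vertices[0][1], ymax = vertices[0][1];
    for (int j = 1; j < num_vertices; ++j) {
        if (vertices[j][0] < xmin) xmin = vertices[j][0];
        if (vertices[j][0] > xmax) xmax = vertices[j][0];
        if (vertices[j][1] < ymin) ymin = vertices[j][1];
        if (vertices[j][1] > ymax) ymax = vertices[j][1];
    }

    long count = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:count)
#endif
    for (long i = 0; i < num_points; ++i) {
        double x = px[i];
        double y = py[i];
        /* MBR filter: most points are rejected here for a small ROI. */
        if (x < xmin || x > xmax || y < ymin || y > ymax)
            continue;
        bool in_polygon = false;
        for (int j = 0; j < num_vertices - 1; ++j) {
            double x0 = vertices[j][0], y0 = vertices[j][1];
            double x1 = vertices[j + 1][0], y1 = vertices[j + 1][1];
            /* The first condition guarantees y0 != y1, so the division is safe. */
            if ((((y0 <= y) && (y < y1)) || ((y1 <= y) && (y < y0))) &&
                (x < (x1 - x0) * (y - y0) / (y1 - y0) + x0))
                in_polygon = !in_polygon;
        }
        if (in_polygon)
            ++count;
    }
    return count;
}
```

Compiled without OpenMP support, the guarded pragma is skipped and the loop runs serially with the same result.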

Graph Database 2
IBM SystemG Backend (http://systemg.research.ibm.com/, https://github.com/ibmppl/ibmppl)
SystemG is used primarily as a graph database backend to manage dynamic graphs and provide social network analysis functionality.
To respond to dynamic parameters (spatial, temporal and thematic) during a visual exploration process, the backend retrieves and transforms the corresponding graphs, performs the required graph analytics and sends back the results.
A whole-spectrum solution for large-scale graph processing, including graph storage, runtime, analytics and visualization.
PageRank is used for demonstration purposes, with graph weights defined as the numbers of OD records between an OD pair.
Built-in support for web-based applications (socket mode and JSON support): easy to use and fast to prototype with.
PageRank extension: considers not only graph structure (node degrees) but also edge weights.
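One common way to take edge weights into account in PageRank is to let each node distribute its rank to its out-neighbors in proportion to the edge weights. The sketch below is a generic power-iteration version of that idea, not SystemG code or its API; the function name weighted_pagerank, the dense matrix representation, and the handling of nodes without outgoing edges (their rank is spread uniformly) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Generic weighted PageRank by power iteration (illustrative, not SystemG).
 * w is an n x n row-major matrix; w[i*n + j] is the number of OD records
 * from node i to node j. rank must hold n doubles.
 * Returns 0 on success, -1 if memory allocation fails. */
int weighted_pagerank(const double *w, int n, double damping,
                      int iterations, double *rank)
{
    assert(n > 0 && damping >= 0.0 && damping < 1.0);

    double *out_sum = malloc((size_t)n * sizeof *out_sum);
    double *next = malloc((size_t)n * sizeof *next);
    if (!out_sum || !next) {
        free(out_sum);
        free(next);
        return -1;
    }

    /* Total outgoing weight per node; start from a uniform rank vector. */
    for (int i = 0; i < n; ++i) {
        out_sum[i] = 0.0;
        for (int j = 0; j < n; ++j)
            out_sum[i] += w[(size_t)i * n + j];
        rank[i] = 1.0 / n;
    }

    for (int it = 0; it < iterations; ++it) {
        /* Rank held by nodes with no outgoing edges is spread uniformly. */
        double dangling = 0.0;
        for (int i = 0; i < n; ++i)
            if (out_sum[i] == 0.0)
                dangling += rank[i];

        for (int j = 0; j < n; ++j)
            next[j] = (1.0 - damping) / n + damping * dangling / n;

        /* Each node passes rank along its edges in proportion to weight. */
        for (int i = 0; i < n; ++i) {
            if (out_sum[i] == 0.0)
                continue;
            for (int j = 0; j < n; ++j)
                next[j] += damping * rank[i] * w[(size_t)i * n + j] / out_sum[i];
        }

        for (int j = 0; j < n; ++j)
            rank[j] = next[j];
    }

    free(out_sum);
    free(next);
    return 0;
}
```

With all weights equal this reduces to ordinary PageRank; with OD counts as weights, a destination that receives most of a busy origin's trips ranks higher than one that receives only a few.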

Web Frontend 3
All implemented in JavaScript (Google Maps API).
Web frontend for geospatial and geosocial visual exploration using NYC Taxi Trip Data.
Network/Web communication: JavaScript asynchronous function calls; JSON string encoding and parsing; user data and Web-GIS API binding.
Frontend geometry library: MBR indexing; point-in-polygon test; line-polygon intersection. Used to check the validity of interactively selected OD pairs and to query graph weights of any OD pair using a map interface.
GUI: spatial selection (polygon drawing); temporal selection (dropdown list); OD indication (arrow/polyline drawing).
Supports the information-seeking mantra: "Overview first, zoom and filter, then details-on-demand".
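The slides do not show how the frontend geometry library implements its tests. As an illustration of the line-polygon intersection item, the sketch below uses the standard orientation (cross-product sign) test. It is written in C for consistency with the backend listing, although the frontend itself is JavaScript; all function names are hypothetical, and the ring is assumed closed (last vertex repeats the first).

```c
#include <assert.h>
#include <stdbool.h>

/* Cross product (b - a) x (c - a): > 0 if c lies to the left of a->b,
 * < 0 if to the right, 0 if the three points are collinear. */
static double orient(double ax, double ay, double bx, double by,
                     double cx, double cy)
{
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
}

/* True if segments p-q and r-s cross at a single interior point.
 * Touching endpoints and collinear overlaps are deliberately not reported. */
static bool segments_cross(double px, double py, double qx, double qy,
                           double rx, double ry, double sx, double sy)
{
    double d1 = orient(rx, ry, sx, sy, px, py);
    double d2 = orient(rx, ry, sx, sy, qx, qy);
    double d3 = orient(px, py, qx, qy, rx, ry);
    double d4 = orient(px, py, qx, qy, sx, sy);
    return ((d1 > 0 && d2 < 0) || (d1 < 0 && d2 > 0)) &&
           ((d3 > 0 && d4 < 0) || (d3 < 0 && d4 > 0));
}

/* True if the segment (x0, y0)-(x1, y1) crosses any edge of the closed ring. */
bool line_crosses_polygon(double x0, double y0, double x1, double y1,
                          double vertices[][2], int num_vertices)
{
    assert(num_vertices >= 4); /* closed ring of at least a triangle */
    for (int j = 0; j < num_vertices - 1; ++j) {
        if (segments_cross(x0, y0, x1, y1,
                           vertices[j][0], vertices[j][1],
                           vertices[j + 1][0], vertices[j + 1][1]))
            return true;
    }
    return false;
}
```

A segment lying entirely inside or entirely outside the polygon returns false, so a full "does this polyline touch the region" check would combine this with a point-in-polygon test on one endpoint.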

Demonstration #1 (http://134.74.112.65/ibmjsa/web/)
Interactive spatial query processing after users draw a pair of OD polygons.

Demonstration #2 (http://134.74.112.65/~you/geosocial/)
Thematic selection (datasets); temporal selection (hours); OD pair selection; mapping of the PageRank result.

Summary and Future Work
Summary:
    We report our work on developing a high-performance research platform to visually explore large-scale urban OD data in a web computing environment.
    The platform integrates an in-memory parallel geospatial query processing backend and a graph database backend, and provides several novel web frontend modules for both functionality and efficiency.
    We demonstrate preliminary implementations using NYC taxi trip data.
Future work:
    Extend the geospatial backend to efficiently support more types of spatial queries, in addition to the point-in-polygon test.
    Work with the IBM SystemG development team to integrate spatial data processing functionality to support in-graph spatial queries.
    Develop more intuitive visual gadgets for temporal selection in the web frontend.

Acknowledgement
CISE/IIS Medium Collaborative Research Grants 1302423/1302439: "Spatial Data and Trajectory Data Management on GPUs"
Joint Study Agreement (JSA #W1463481) between IBM T. J. Watson Research Center and CCNY
Q&A