ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters

Outline
Introduction: Big Spatial Data, GPU Computing and Distributed Platforms
Spatial query processing on GPUs
ISP: System Architecture, Implementations
Experiments: Setup, Single-Node performance, Scalability on Amazon EC2 Clusters
Summary and Future Work

Taxi trip data in NYC
Taxicabs: 13,000 medallion taxi cabs; license priced at > $1M; car services and taxi services are separate.
Taxi trip records: ~170 million trips (300 million passengers) in 2009, about 1/5 of subway ridership and 1/3 of bus ridership in NYC.

Taxi trip data in NYC
[Figure: overall distributions of trip distance, time, speed and fare (2009)]

Taxi trip data in NYC
How to manage taxi trip data? Geographical Information Systems (GIS), Spatial Databases (SDB), Moving Object Databases (MOD).
How good are they? Pretty good for small amounts of data, but rather poor for large-scale data.

Taxi trip data in NYC
Example 1: Loading 170 million taxi pickup locations into PostgreSQL: 105.8 hours!
UPDATE t SET PUGeo = ST_SetSRID(ST_Point("PULong","PuLat"),4326);
Example 2: Finding the nearest tax blocks for 170 million taxi pickup locations using open source libspatialindex+GDAL: 30.5 hours!
Hardware: Intel Xeon 2.26 GHz processors with 48 GB memory.
I do not have time to wait... Can we do better?

Cloud computing + MapReduce + Hadoop
[Figure: hardware diagram. CPU host (CMP): cores with local caches, shared cache, DRAM, HDD, SSD. GPU: SIMD thread blocks, GDRAM, attached via PCI-E. MIC: in-order cores with 4 threads each (T0-T3) on a ring bus.]
16 Intel Sandy Bridge CPU cores + 128 GB RAM + 8 TB disk + GTX TITAN + Xeon Phi 3120A: ~$9,994

Attractive features of Impala (http://www.slideshare.net/hadooparchbook/impala-architecture-presentation)
SQL frontend: translates SQL queries into execution plans
C/C++ backend with SSE4 support (for string operations)
Efficient implementations of hash joins (partitioned and non-partitioned)
LLVM-based JIT
...
Extension is challenging!

Feb. 2013:
7.1 billion transistors (551 mm²)
2,688 processors
4.5 TFLOPS SP and 1.3 TFLOPS DP
Max bandwidth 288.4 GB/s
PCI-E peripheral device
250 W (17.98 GFLOPS/W, SP)
Suggested retail price: $999
ASCI Red (1997): first 1 TFLOPS (sustained) system, with 9,298 Intel Pentium II Xeon processors (in 72 cabinets)
What can we do today using a device that is more powerful than ASCI Red 16 years ago?

Spatial query processing on GPUs
Single-level grid-file based spatial filtering (points; polygon/polyline vertices)
Perfect coalesced memory accesses
Utilizing GPU floating point computing power
Nested-loop based refinement
Reference: J. Zhang, S. You and L. Gruenwald, "Parallel Online Spatial and Temporal Aggregations on Multi-core CPUs and Many-Core GPUs," Information Systems, vol. 44, pp. 134–154, 2014.
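The filter-then-refine idea on this slide can be illustrated with a small CPU-only Python sketch. It is not the authors' GPU implementation: the function names (`cell_of`, `point_in_polygon`, `grid_filter_and_refine`), the uniform cell size, and the choice of a ray-casting point-in-polygon test for the refinement step are all assumptions made for illustration. Points are binned into grid cells (filtering), and only points in cells overlapped by a polygon's bounding box are tested exactly in a nested loop (refinement).

```python
from collections import defaultdict

def cell_of(x, y, cell_size):
    """Map a coordinate to its (column, row) cell in a uniform grid."""
    return (int(x // cell_size), int(y // cell_size))

def point_in_polygon(px, py, ring):
    """Ray-casting test; ring is a list of (x, y) vertices (not closed)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def grid_filter_and_refine(points, polygons, cell_size):
    """Return sorted (point_id, polygon_id) pairs where the point is inside."""
    # Filtering step: bin every point into its grid cell.
    cells = defaultdict(list)
    for pid, (x, y) in enumerate(points):
        cells[cell_of(x, y, cell_size)].append(pid)

    matches = []
    for gid, ring in enumerate(polygons):
        xs = [v[0] for v in ring]
        ys = [v[1] for v in ring]
        cx0, cy0 = cell_of(min(xs), min(ys), cell_size)
        cx1, cy1 = cell_of(max(xs), max(ys), cell_size)
        # Refinement step: nested loop over candidate cells and their points.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for pid in cells.get((cx, cy), ()):
                    x, y = points[pid]
                    if point_in_polygon(x, y, ring):
                        matches.append((pid, gid))
    return sorted(matches)

points = [(0.5, 0.5), (2.5, 2.5), (5.0, 5.0)]
polygons = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],
    [(2, 2), (3, 2), (3, 3), (2, 3)],
]
print(grid_filter_and_refine(points, polygons, cell_size=1.0))
```

On a GPU, the per-cell candidate lists would typically be laid out contiguously so that neighbouring threads read neighbouring elements; the dictionary used here only stands in for that layout.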

Spatial query processing on GPUs
Datasets: 38,794 census blocks (470,941 points); 735,488 tax blocks (4,698,986 points); 147,011 street segments

            P2N-D    P2P-T    P2P-D
CPU time    -        15.2 h   30.5 h
GPU time    10.9 s   11.2 s   33.1 s
Speedup     -        4,900X   3,200X

Speedup breakdown: algorithmic improvement 3.7X; using main-memory data structures 37.4X; GPU acceleration 24.3X
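As a rough arithmetic check of the figures on this slide, the sketch below recomputes the speedups from the reported CPU and GPU times and multiplies the three breakdown factors. Two things are assumptions rather than statements from the slide: the pairing of times with the P2P-T and P2P-D queries (read off the flattened slide text) and the idea that the three factors compose multiplicatively.

```python
# Reported times from the slide, converted to seconds.
cpu_seconds = {"P2P-T": 15.2 * 3600, "P2P-D": 30.5 * 3600}
gpu_seconds = {"P2P-T": 11.2, "P2P-D": 33.1}

# End-to-end speedup = CPU time / GPU time for each query type.
speedup = {q: cpu_seconds[q] / gpu_seconds[q] for q in cpu_seconds}

# Assumption: the three improvement factors multiply into one overall factor.
factors = {"algorithmic": 3.7, "main_memory": 37.4, "gpu": 24.3}
combined = 1.0
for value in factors.values():
    combined *= value

for query, value in speedup.items():
    print(f"{query}: {value:.0f}x")
print(f"product of breakdown factors: {combined:.0f}x")
```

Under these assumptions the recomputed ratios land close to the rounded 4,900X and 3,200X on the slide, and the product of the three factors is of the same order as the P2P-D speedup.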

ISP-MC+ and ISP-GPU: System Architecture, Implementations

pip_join(…)  nearest_join(…)  create_rtree(…)

class SpatialJoinNode : public BlockingJoinNode {
 public:
  SpatialJoinNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
  virtual Status Prepare(RuntimeState* state);
  virtual Status GetNext(RuntimeState* state, RowBatch* row_batch, bool* eos);
  virtual void Close(RuntimeState* state);
 protected:
  virtual Status InitGetNext(TupleRow* first_left_row);
  virtual Status ConstructBuildSide(RuntimeState* state);
 private:
  boost::scoped_ptr<TPlanNode> thrift_plan_node_;
  RuntimeState* runtime_state_;
  …
};

ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters

Global Biodiversity Data at GBIF (http://gbif.org)
SELECT aoi_id, sp_id, sum(ST_area(inter_geom))
FROM (
  SELECT aoi_id, sp_id, ST_Intersection(sp_geom, qw_geom) AS inter_geom
  FROM SP_TB, QW_TB
  WHERE ST_Intersects(sp_geom, qw_geom)
)
GROUP BY aoi_id, sp_id
HAVING sum(ST_area(inter_geom)) > T;

Single-node results: 16-core CPU/128 GB, GTX Titan (runtimes in seconds)

             ISP-GPU   ISP-MC+   GPU-Standalone   MC-Standalone
taxi-nycb    96        130       50               89
GBIF-WWF     1822      2816      1498             2664

taxi-nycb: ~170 million points, ~40 thousand polygons (9 vertices/polygon)
GBIF-WWF: ~375 million points, ~15 thousand polygons (279 vertices/polygon)

Cluster results: 2-10 nodes, each with 8 vCPU cores/15 GB and 1536 CUDA cores/4 GB (50 million species locations used due to memory constraint)
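To make the single-node numbers easier to compare, the sketch below derives a few ratios from them. The assignment of each runtime to a system follows the order in which the values appear on the slide, which is an assumption about the original table layout, and the ratio names (`gpu_vs_mc_in_isp`, `isp_overhead_gpu`, and so on) are made up here for illustration.

```python
# Runtimes in seconds, copied from the single-node results on this slide.
runtimes = {
    "taxi-nycb": {"ISP-GPU": 96, "ISP-MC+": 130,
                  "GPU-Standalone": 50, "MC-Standalone": 89},
    "GBIF-WWF": {"ISP-GPU": 1822, "ISP-MC+": 2816,
                 "GPU-Standalone": 1498, "MC-Standalone": 2664},
}

def ratios(row):
    """Derive GPU-vs-multicore gains and ISP-vs-standalone overheads."""
    return {
        "gpu_vs_mc_in_isp": row["ISP-MC+"] / row["ISP-GPU"],
        "gpu_vs_mc_standalone": row["MC-Standalone"] / row["GPU-Standalone"],
        "isp_overhead_gpu": row["ISP-GPU"] / row["GPU-Standalone"],
        "isp_overhead_mc": row["ISP-MC+"] / row["MC-Standalone"],
    }

summary = {name: ratios(row) for name, row in runtimes.items()}
for name, values in summary.items():
    print(name, {key: round(value, 2) for key, value in values.items()})
```

Read this way, the GPU versions are faster than the multi-core versions in every pairing, while running inside ISP costs more than running standalone, most visibly for ISP-GPU on the taxi-nycb workload.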

Summary and Future Work
Designs and implementations of an in-memory spatial data management system on multi-core CPU and many-core GPU clusters by extending Cloudera Impala for distributed spatial join query processing.
Experiments on the initial implementations have revealed both advantages and disadvantages of extending a tightly-coupled big data system to support spatial data types and their operations.
Alternative techniques are being developed to further improve efficiency, scalability, extensibility and portability.

Alternative Techniques
SpatialSpark: Just Open-Sourced (http://simin.me/projects/spatialspark/)

val sc = new SparkContext(conf)
// read left-side data from HDFS and pre-process it
val leftData = sc.textFile(leftFile, numPartitions).map(x => x.split(SEPARATOR)).zipWithIndex()
val leftGeometryById = leftData
  .map(x => (x._2, Try(new WKTReader().read(x._1.apply(leftGeometryIndex)))))
  .filter(_._2.isSuccess)
  .map(x => (x._1, x._2.get))
// similarly for right-side data ...
// ready for spatial query (broadcast-based)
val joinPredicate = SpatialOperator.Within // NearestD can be applied similarly
var matchedPairs: RDD[(Long, Long)] = BroadcastSpatialJoin(sc, leftGeometryById, rightGeometryById, joinPredicate)
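The broadcast-based join mentioned on this slide can be sketched without Spark in plain Python. This is not SpatialSpark's API or implementation: `within`, `partition` and `broadcast_spatial_join` are hypothetical names, geometries are simplified to points and axis-aligned boxes, and a list copy stands in for a broadcast variable. The idea shown is only that the smaller (right) side is replicated to every partition of the larger (left) side, so each partition can be joined locally without shuffling.

```python
def within(point, box):
    """True if point (x, y) lies inside box (xmin, ymin, xmax, ymax)."""
    x, y = point
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax

def partition(items, num_partitions):
    """Split items round-robin, mimicking a partitioned dataset."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def broadcast_spatial_join(left, right, predicate, num_partitions=2):
    """Join (id, geometry) pairs; returns sorted (left_id, right_id) matches."""
    broadcast_right = list(right)  # stands in for a broadcast copy per worker
    pairs = []
    for part in partition(list(left), num_partitions):
        # Each partition joins against the full right side independently.
        for left_id, geom in part:
            for right_id, other in broadcast_right:
                if predicate(geom, other):
                    pairs.append((left_id, right_id))
    return sorted(pairs)

left = [(0, (1, 1)), (1, (5, 5)), (2, (9, 9))]
right = [(10, (0, 0, 2, 2)), (11, (4, 4, 10, 10))]
print(broadcast_spatial_join(left, right, within))
```

A broadcast join of this kind suits the case where one side is small enough to fit in each worker's memory; a real system would also index the broadcast side rather than scanning it for every left-side geometry.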

Alternative Techniques
Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processing (http://www-cs.engr.ccny.cuny.edu/~jzhang/papers/lde_spatial_tr.pdf)