Jianting Zhang1,2, Simin You2, Le Gruenwald3

Presentation transcript:

Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation
Jianting Zhang1,2, Simin You2, Le Gruenwald3
1 Department of Computer Science, CUNY City College (CCNY)
2 Department of Computer Science, CUNY Graduate Center
3 School of Computer Science, The University of Oklahoma

Outline
Introduction, Background and Motivation
System Design, Implementation and Application
Experiments and Results
Summary and Future Work

Parallel Computing – Hardware
[Architecture diagram: a CPU host (CMP) with cores, local caches and a shared cache, backed by DRAM, disk and SSD, connected over a PCI-E ring bus to a GPU (SIMD, with GDRAM) and a MIC coprocessor with 4-thread in-order cores (T0–T3)]
Example machine: 16 Intel Sandy Bridge CPU cores + 128 GB RAM + 8 TB disk + GTX TITAN + Xeon Phi 3120A ≈ $9,994 (Jan. 2014)

Parallel Computing – GPU
March 2015: 8 billion transistors; 3,072 processors / 12 GB memory; 7 TFLOPS SP (GTX TITAN: 1.3 TFLOPS DP); max bandwidth 336.5 GB/s; PCI-E peripheral device; 250 W; suggested retail price $999.
ASCI Red (1997): the first system to sustain 1 Teraflops, with 9,298 Intel Pentium II Xeon processors in 72 cabinets.
This slide contrasts ASCI Red, the world's first supercomputer with Teraflops double-precision computing power in 1997, with the Nvidia GTX Titan, available on the market in 2013 for about $1,000 with even more computing power. Many years ago, some database researchers tried to develop specialized hardware for database applications, but it did not generate the expected impact in industry due to high cost. Personally, I do not believe GPUs are perfect for data management; I do believe, however, that they are the most cost-effective hardware for high-performance data management, especially for geospatial data, which is both data intensive and computing intensive. This belief has motivated us to work continuously over the past four years, and we are happy to have received an NSF grant to continue the research over the next four years. To our knowledge, Prof. Kenneth Ross's group at Columbia is also working on GPU-based data management, but mostly on relational data. What can we do today with a device that is more powerful than ASCI Red was 19 years ago?

GeoTECI@CCNY: ...building a highly configurable experimental computing environment for innovative Big Data technologies...
CCNY Computer Science LAN; CUNY HPCC; Web Server / Linux App Server; Windows App Server; KVM; HP 8740w; Lenovo T400s
"Brawny" GPU cluster:
- SGI Octane III: dual 8-core, 128 GB memory, Nvidia GTX Titan, Intel Xeon Phi 3120A, 8 TB storage
- Microway: dual quad-core, 48 GB memory, Nvidia C2050 ×2, 8 TB storage ×2
- DIY: dual-core, 8 GB memory, Nvidia GTX Titan, 3 TB storage
- Dell T5400: dual quad-core, 16 GB memory, Nvidia Quadro 6000, 1.5 TB storage
- HP 8740w: quad-core, 8 GB memory, Nvidia Quadro 5000m
"Wimpy" GPU cluster:
- Dell T7500: dual 6-core, 24 GB memory, Nvidia Quadro 6000
- Dell T7500: dual 6-core, 24 GB memory, Nvidia GTX 480
- Dell T5400: dual quad-core, 16 GB memory, Nvidia FX3700 ×2
- DIY: quad-core (Haswell), 16 GB memory, AMD/ATI 7970

Spatial Data Management
Computer Architecture vs. Spatial Data Management: how do we fill the big gap between them effectively?
(David Wentzlaff, "Computer Architecture", Princeton University course on Coursera)

Distributed Spatial Join Techniques
Large-Scale Spatial Data Processing on GPUs and GPU-Accelerated Clusters, ACM SIGSPATIAL Special (doi:10.1145/2766196.2766201)
- SpatialSpark (CloudDM'15)
- ISP-MC (CloudDM'15), ISP-MC+ and ISP-GPU (HardBD'15)
- LDE-MC+ and LDE-GPU (BigData Congress'15)

Background and Motivation
Issue #1: limited access to reconfigurable HPC resources for Big Data research

Background and Motivation
Issue #2: architectural limitations of Hadoop-based systems for large-scale spatial data processing
https://sites.google.com/site/hadoopgis/
http://spatialhadoop.cs.umn.edu/
http://simin.me/projects/spatialspark/
Spatial Join Query Processing in Cloud: Analyzing Design Choices and Performance Comparisons (HPC4BD'15 at ICPP)

Background and Motivation
Issue #3: SIMD computing power comes for free for Big Data – use as much of it as you can
ISP: Big Spatial Data Processing On Impala Using Multicore CPUs and GPUs (HardBD'15); recently open sourced at http://geoteci.engr.ccny.cuny.edu/isp/
taxi-nycb runtime (s): ISP-GPU (EC2-10): 96; GPU-Standalone (WS): 50

Background and Motivation
Issue #4: a lightweight distributed runtime library for spatial Big Data processing research
LDE engine codebase: < 1K LOC
Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processing (IEEE Big Data Congress'15)

System Design and Implementation
Basic idea: use GPU-accelerated SoCs as a downscaled high-performance cluster
- The network-bandwidth-to-compute ratio is much higher than on regular clusters
- Advantages: low cost and easy reconfiguration
Nvidia TK1 SoC: 4 ARM CPU cores + 192 Kepler GPU cores ($193)
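The bandwidth-to-compute claim can be sanity-checked with a back-of-envelope calculation. The peak figures below are illustrative assumptions taken from public spec sheets (≈326 GFLOPS SP for the TK1's 192 Kepler cores, ≈4.5 TFLOPS SP for a GTX Titan, Gigabit Ethernet on both), not measurements from the talk:

```python
# Back-of-envelope: bytes of network traffic deliverable per floating-point
# operation, for a TK1 node vs. a Titan-equipped workstation node.
GBPS = 1e9 / 8  # 1 Gigabit Ethernet, in bytes/s

def bw_to_compute_ratio(net_bytes_per_s, peak_sp_flops):
    """Network bytes available per single-precision FLOP (higher = less network-bound)."""
    return net_bytes_per_s / peak_sp_flops

tk1 = bw_to_compute_ratio(GBPS, 326e9)         # assumed ~326 GFLOPS SP (TK1)
titan_ws = bw_to_compute_ratio(GBPS, 4_500e9)  # assumed ~4.5 TFLOPS SP (GTX Titan)

print(f"TK1's ratio is about {tk1 / titan_ws:.1f}x the workstation's")
```

Under these assumptions the TK1 node gets roughly an order of magnitude more network bandwidth per FLOP, which is what makes it a plausible downscaled model of a larger cluster.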

System Design and Implementation
Lightweight Distributed Execution Engine:
- Asynchronous network communication, disk I/O and computing
- Uses native parallel programming tools for local processing
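The overlap idea behind the engine can be sketched in a few lines. This is a minimal illustration (not the LDE codebase, which is C++-based): a reader thread streams partitions in while the main thread computes on the previous one, so I/O latency hides behind computation:

```python
# Minimal producer/consumer sketch of overlapping I/O with computation.
import queue
import threading

def reader(partitions, q):
    """Stands in for asynchronous disk/network reads."""
    for p in partitions:
        q.put(p)
    q.put(None)  # end-of-stream sentinel

def run_pipeline(partitions, compute):
    q = queue.Queue(maxsize=2)  # bounded buffer: reader stays ~1 partition ahead
    threading.Thread(target=reader, args=(partitions, q), daemon=True).start()
    results = []
    while (p := q.get()) is not None:
        results.append(compute(p))  # local processing (native parallel code in LDE)
    return results

print(run_pipeline([[1, 2], [3, 4]], sum))  # -> [3, 7]
```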

Experiments and Results
taxi-nycb experiment: 170 million taxi trips in NYC in 2013 (pickup locations as points); 38,794 census blocks (as polygons); average ~9 vertices per polygon
g10m-wwf experiment: ~10 million global species occurrence records (locations as points); 14,458 ecoregions (as polygons); average 279 vertices per polygon
g50m-wwf experiment
"Brawny" configuration for comparison (http://aws.amazon.com/ec2/instance-types/): dual 8-core Sandy Bridge CPU (2.60 GHz), 128 GB memory, Nvidia GTX Titan (6 GB, 2,688 cores)
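All of these workloads join point records (taxi pickups, species occurrences) against polygons (census blocks, ecoregions). The kernel at the heart of such a join is a point-in-polygon test; a minimal sequential ray-casting version (an illustrative sketch, not the talk's GPU kernel) looks like this:

```python
# Even-odd rule: cast a ray in the +x direction from the point and count
# how many polygon edges it crosses; an odd count means the point is inside.
def point_in_polygon(pt, poly):
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:       # crossing lies to the right of the point
                inside = not inside
    return inside

unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon((0.5, 0.5), unit_square))  # -> True
print(point_in_polygon((1.5, 0.5), unit_square))  # -> False
```

In the actual systems this test runs data-parallel over millions of points per polygon batch, after a spatial index prunes point/polygon pairs whose bounding boxes do not overlap.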

Experiments and Results
Runtimes (s):

Setting          standalone   1-node   2-node   4-node
taxi-nycb
  LDE-MC         18.6         27.1     15.0     11.7
  LDE-GPU        18.5         26.3     17.7     10.2
  SpatialSpark   -            179.3    95.0     70.5
g10m-wwf
  LDE-MC         1029.5       1290.2   653.6    412.9
  LDE-GPU        941.9        765.9    568.6    309.7

Experiments and Results
g50m-wwf (more computing bound):

                      TK1-Standalone                         TK1-4 Node   Workstation-Standalone                   EC2-4 Node
CPU spec (per node)   ARM A15 2.34 GHz, 4 cores, 2 GB DDR3   (same)       Intel SB 2.6 GHz, 16 cores, 128 GB DDR3  8 cores (virtual), 15 GB DDR3
GPU spec (per node)   192 cores, 2 GB DDR3                   (same)       2,688 cores, 6 GB GDDR5                  1,536 cores, 4 GB GDDR5
Runtime (s) – MC      4478                                   1908         350                                      334
Runtime (s) – GPU     4199                                   1523         174                                      105

Experiments and Results
CPU computing (TK1 SoC ~10 W; 8-core CPU ~95 W):
- Workstation-standalone vs. TK1-standalone: 12.8x faster, but consumes 19x more power
- EC2-4 nodes vs. TK1-4 nodes: 5.7x faster, but consumes 9.5x more power
GPU computing:
- Workstation-standalone vs. TK1-standalone: 24x faster, with 14x more CUDA cores
- EC2-4 nodes vs. TK1-4 nodes: 14x faster, with 8x more CUDA cores
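The CPU-side energy conclusion follows directly from the slide's speedup and power ratios: energy per job scales as power × time, so the bigger machine wins on energy only if its speedup exceeds its power premium. A quick check with the numbers above:

```python
# Energy(big) / Energy(TK1) = (power ratio) x (time ratio) = power_ratio / speedup.
# A result > 1 means the TK1 finishes the same job using less energy.
def energy_ratio(speedup, power_ratio):
    return power_ratio / speedup

ws_vs_tk1 = energy_ratio(12.8, 19.0)  # workstation vs. TK1-standalone (CPU)
ec2_vs_tk1 = energy_ratio(5.7, 9.5)   # EC2-4 nodes vs. TK1-4 nodes (CPU)

print(f"workstation uses {ws_vs_tk1:.2f}x, EC2 uses {ec2_vs_tk1:.2f}x the TK1's energy")
```

Both ratios come out above 1, i.e. the TK1's ARM CPUs use roughly 1.5x less energy per job despite the much longer runtimes, which is the basis of the energy-efficiency claim in the summary.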

Summary and Future Work
We proposed a low-cost prototype research cluster made of Nvidia TK1 SoC boards and evaluated the tiny GPU cluster's performance on spatial join query processing over large-scale geospatial data.
Using a simplified model, the results suggest that the ARM CPU of the TK1 board is likely to achieve better energy efficiency, while the TK1's GPU is less competitive with desktop/server-grade GPUs, in both the standalone and the 4-node cluster settings for the two applications studied.
Future work:
- Develop a formal method to model the scaling effect between SoC-based clusters and regular clusters, covering not only processors but also memory, disk and network components.
- Evaluate the performance of SpatialSpark and the LDE engine on more real-world geospatial datasets and applications. A spatial data benchmark?

Q&A
http://www-cs.ccny.cuny.edu/~jzhang/
jzhang@cs.ccny.cuny.edu
CISE/IIS Medium Collaborative Research Grants 1302423/1302439: "Spatial Data and Trajectory Data Management on GPUs"