Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
Kaibo Wang¹, Yin Huai¹, Rubao Lee¹, Fusheng Wang²,³, Xiaodong Zhang¹, Joel H. Saltz²,³
¹ Department of Computer Science and Engineering, The Ohio State University
² Center for Comprehensive Informatics, Emory University
³ Department of Biomedical Informatics, Emory University
Background: Digital Pathology
Digital pathology imaging has become an increasingly important field over the past decade
Examination of high-resolution tissue images enables more effective prediction, diagnosis, and therapy of diseases
Workflow: glass slides → scanning → whole-slide images → image analysis
Background: Image Algorithm Evaluation
High-quality image analysis algorithms are essential to support biomedical research and diagnosis
– Validate algorithms with human annotations
– Compare and consolidate multiple algorithm results
– Study the sensitivity of algorithm parameters
[Figure: segmentation results overlaid on a tissue image; green – algorithm one, red – algorithm two]
Problem: Spatial Cross-Comparison
Spatial cross-comparison: identify and compare derived spatial objects belonging to different observations or analyses
Jaccard similarity: the overlap ratio of intersecting polygons from two result sets
[Figure: two overlapping polygons p and q]
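For two intersecting polygons p and q, the overlap ratio referred to here is the standard Jaccard similarity (the formula is not spelled out on the slide; this is its common definition):

```latex
J(p, q) = \frac{\operatorname{area}(p \cap q)}{\operatorname{area}(p \cup q)}, \qquad 0 \le J(p, q) \le 1
```

Computing area(p ∩ q) and area(p ∪ q) for every intersecting pair is where nearly all of the work lies, which motivates the PixelBox algorithm later in the talk.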
Both Data- and Compute-Intensive
Increasingly large data sets
– 10^5 × 10^5 pixels per image
– 1 million objects per image
– Hundreds to thousands of images per study
– Big data demanding high throughput
High computational intensity
– Computing Jaccard similarity requires heavy-duty geometric operations
– Demanding high performance
Parallel computing techniques must be utilized to handle such intensive workloads
Existing Approach: Spatial DBMS
Extensions of RDBMSs with spatial data types and operators (PostGIS, DB2, etc.)
A typical cross-comparison takes many hours to finish on a single machine
– 90+% of the computing time is spent computing the areas of polygon intersections and unions
– The algorithms used by SDBMSs are highly branch-intensive and difficult to parallelize
Task-based (MIMD) parallel computing can be applied on large-scale clusters
– Expensive: requires a facility with many high-end nodes
A high-performance, high-throughput, and cost-effective solution is highly desirable
7
7900 GTX 8800 GTX 9800 GTX GTX 280 GTX 285 GTX 480 GTX 580 E4300 E6850 Q9650 X7460 980 XE Low-cost and powerful data-parallel devices SIMD data parallel architecture 6 Graphics Processing Units (GPU) All cores on a streaming multiprocessor (SM) execute the same instruction on different data Applications must exploit SIMD data parallelism in order to best utilize the power of GPUs E.g., NVIDIA GTX 580 has 512 cores (16 SMs, 32 cores each) USENIX ATC’11
Our Solution: SCCG
Spatial Cross-comparison on CPUs and GPUs
– Utilize GPUs together with CPUs for both high throughput and high performance in a cost-effective way
Critical challenges
– SIMD data-parallel algorithms on the GPU
– A CPU-GPU hybrid computing framework
– Load balancing between CPUs and GPUs
Outline
Introduction
SCCG
– PixelBox GPU algorithm
– Cross-comparing framework
– Load balancing
Experiments
Conclusions
The PixelBox GPU Algorithm
Given an array of polygon pairs, compute the area of intersection and the area of union for each pair
Algorithm principles
– Exploit SIMD data parallelism: compute the areas of polygon intersections and unions in an SIMD data-parallel fashion
– Maximize data parallelism and minimize unnecessary compute intensity: reduce compute intensity while maintaining high data parallelism
Exploit SIMD Data Parallelism
Monte-Carlo approach (a basic method)
– Consider the polygons as lying on a pixel map
– Compute the areas of intersection/union by counting the number of pixels lying within each region
– Perfect data parallelism, but high compute intensity when the polygons are large
[Figure: polygons p and q on a pixel map, with the intersection and union regions highlighted]
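To make the pixel-counting idea concrete, here is a minimal CUDA sketch (not the authors' code) in which each thread tests one pixel center of the polygon pair's bounding box against both polygons using a standard crossing-number test and atomically tallies intersection and union counts; the Jaccard similarity is then the ratio of the two counters. The names (pixel_count_kernel, point_in_polygon) and the polygon layout (arrays of float2 vertices) are assumptions for illustration, and the host-side launch code is omitted.

```cuda
#include <cuda_runtime.h>

// Crossing-number (even-odd) point-in-polygon test for a simple polygon.
__device__ bool point_in_polygon(float px, float py, const float2 *verts, int n)
{
    bool inside = false;
    for (int i = 0, j = n - 1; i < n; j = i++) {
        // Toggle on every polygon edge that crosses the horizontal ray
        // extending to the right of (px, py).
        if (((verts[i].y > py) != (verts[j].y > py)) &&
            (px < (verts[j].x - verts[i].x) * (py - verts[i].y) /
                      (verts[j].y - verts[i].y) + verts[i].x))
            inside = !inside;
    }
    return inside;
}

// One thread per pixel of the bounding box shared by polygons p and q.
// Counts pixels falling in p∩q and in p∪q.
__global__ void pixel_count_kernel(const float2 *p, int np,
                                   const float2 *q, int nq,
                                   int x0, int y0, int width, int height,
                                   unsigned long long *inter,
                                   unsigned long long *uni)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= width * height) return;

    // Test the center of this thread's pixel against both polygons.
    float px = x0 + (idx % width) + 0.5f;
    float py = y0 + (idx / width) + 0.5f;
    bool in_p = point_in_polygon(px, py, p, np);
    bool in_q = point_in_polygon(px, py, q, nq);

    if (in_p && in_q) atomicAdd(inter, 1ULL);
    if (in_p || in_q) atomicAdd(uni, 1ULL);
}
```

Every pixel test is independent, which is why the approach maps perfectly onto SIMD hardware; the drawback, as the slide notes, is that the number of per-pixel tests grows with polygon size.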
Reduce Unnecessary Compute Intensity
Use sampling boxes
– Partition the bounding rectangle of a polygon pair into boxes (like grid cells)
– Compute areas box by box, thus avoiding lots of costly per-pixel testing
– Each box is either completely within the intersection of p and q, completely within the union of p and q, belonging to neither the intersection nor the union, or unsettled and in need of further exploration
Recursively explore unsettled boxes by partitioning them into smaller sub-boxes
According to our testing, 50+% of the areas can be determined with only one level of box partitioning
[Figure: the bounding rectangle of p and q partitioned into sampling boxes, with unsettled boxes further partitioned]
The PixelBox Algorithm
PixelBox works on both pixels and boxes
– First, box by box, to quickly finish the testing of large regions
– Then, pixel by pixel within the small sub-boxes that need further testing
In this way, PixelBox preserves the benefits of both high data parallelism and low compute intensity
Implementation for the GPU
Use a shared stack to store the boxes to be tested
– The bounding rectangle of the two polygons is pushed onto the stack as the first box
– Each box carries a mark: 1 – needs further testing; 0 – no further testing
– All threads keep popping boxes from the stack for processing: the stack top is fetched by all threads, and a box marked 0 is ignored by all threads
– A box marked 1 is partitioned into sub-boxes (or Monte Carlo is applied if it is already small enough); the sub-boxes are tested by different threads in parallel
– The contribution of each sub-box is computed, and each sub-box is checked to see whether it needs further testing
– New sub-boxes are pushed onto the top of the stack
– Computation finishes when the stack becomes empty
Using a stack improves data parallelism: both popping and pushing can be done in SIMD fashion
[Figure: threads 0–3 cooperatively processing a shared stack of sampling boxes]
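The following CUDA sketch illustrates only the cooperative shared-stack pattern described above; it is not the authors' implementation. The geometric box classification is replaced by a size-based placeholder (classify_box), the settled-box contribution is simply the box's pixel area, and all names (pixelbox_stack_sketch, Box, STACK_CAP) are hypothetical.

```cuda
#include <cuda_runtime.h>

#define STACK_CAP 256   // assumed large enough: depth-first order keeps the stack shallow
#define SUBBOXES  4     // each unsettled box is split into a 2x2 grid of sub-boxes

struct Box { int x, y, w, h; };

// Placeholder for the real geometric tests: real PixelBox decides whether a box
// lies completely inside the intersection, completely inside the union, outside
// both, or needs further testing.  Here any box larger than 4x4 is "unsettled".
__device__ int classify_box(const Box &b)
{
    return (b.w > 4 || b.h > 4) ? 1 /* mark 1: needs further testing */ : 0;
}

// One thread block cooperates on one polygon pair via a shared-memory stack.
__global__ void pixelbox_stack_sketch(Box root, unsigned long long *settled_area)
{
    __shared__ Box stack[STACK_CAP];
    __shared__ int top;                       // index of the next free slot

    if (threadIdx.x == 0) {                   // push the bounding rectangle as the first box
        stack[0] = root;
        top = 1;
    }
    __syncthreads();

    while (true) {
        __syncthreads();
        if (top == 0) break;                  // stack empty: all boxes settled

        Box cur = stack[top - 1];             // every thread reads the same top box
        __syncthreads();
        if (threadIdx.x == 0) top -= 1;       // pop it
        __syncthreads();

        if (classify_box(cur) == 0) {
            // Settled box: in real PixelBox its area would go to the intersection
            // and/or union tallies; here we just accumulate its pixel area.
            if (threadIdx.x == 0)
                atomicAdd(settled_area, (unsigned long long)cur.w * cur.h);
            continue;
        }

        // Unsettled box: partition into sub-boxes, one per thread, and push them
        // back onto the shared stack (an SIMD-friendly cooperative push).
        if (threadIdx.x < SUBBOXES) {
            int t  = (int)threadIdx.x;        // 0..3 -> position in the 2x2 split
            int hw = cur.w / 2, hh = cur.h / 2;
            Box sub = { cur.x + (t % 2) * hw,
                        cur.y + (t / 2) * hh,
                        (t % 2) ? cur.w - hw : hw,
                        (t / 2) ? cur.h - hh : hh };
            int slot = atomicAdd(&top, 1);
            stack[slot] = sub;
        }
        __syncthreads();
    }
}
```

In the real algorithm, once a box becomes small enough the threads switch to per-pixel (Monte Carlo) testing inside it rather than pushing ever-smaller boxes.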
The Cross-Comparing Framework
A pipelined framework that executes the whole cross-comparing workflow
– Pipeline stages: load data from disk → find intersecting polygon pairs from the data (CPU) → compute Jaccard similarity (GPU, PixelBox)
Pipelined execution reduces resource contention over GPUs
– The GPU is an exclusive, non-preemptive device
– Multiple threads trying to access a GPU simultaneously are serialized
– A single initiator (the Aggregator) reduces blocking over GPUs
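As a rough structural illustration of how a single GPU initiator avoids serialized GPU access (a sketch only, not the SCCG framework code; PairBatch, BatchQueue, and the kernel-launch comment are all hypothetical), CPU worker threads can hand batches of intersecting polygon pairs to one aggregator thread that alone talks to the GPU:

```cuda
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A batch of intersecting polygon pairs produced on the CPU (placeholder type).
struct PairBatch { int n_pairs; };

// Thread-safe queue connecting CPU producers to the single GPU initiator.
class BatchQueue {
    std::queue<PairBatch> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(PairBatch b) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(b);
        cv_.notify_one();
    }
    bool pop(PairBatch &out) {  // returns false once producers are done and the queue is drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
    void finish() {
        std::lock_guard<std::mutex> lk(m_);
        done_ = true;
        cv_.notify_all();
    }
};

int main() {
    BatchQueue batches;

    // CPU stage: worker threads load data and find intersecting polygon pairs
    // (represented here by dummy batches).
    std::vector<std::thread> producers;
    for (int w = 0; w < 4; ++w)
        producers.emplace_back([&batches] {
            for (int i = 0; i < 8; ++i)
                batches.push(PairBatch{1000});
        });

    // GPU stage: a single initiator ("Aggregator") owns the GPU, so many CPU
    // threads never contend for the exclusive, non-preemptive device.
    std::thread aggregator([&batches] {
        PairBatch b;
        while (batches.pop(b)) {
            // the PixelBox kernel would be launched for this batch here
        }
    });

    for (auto &t : producers) t.join();
    batches.finish();
    aggregator.join();
    return 0;
}
```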
CPU-GPU Load Balancing
Both CPUs and GPUs have to be fully utilized in order to maximize throughput
– Polygon pairs are produced on CPUs and consumed on GPUs
– CPUs sit idle if the production speed exceeds the consumption speed
– GPUs sit idle if the production speed falls below the consumption speed
Tasks have to be dynamically migrated between CPUs and GPUs to achieve load balancing (please see the paper for details)
Experiments
Methodology
– Spatial DBMS: PostgreSQL 9.1.3 + PostGIS 1.5.3
– Disk loading time not considered: the "first-load-then-query" scheme of databases is known to be inefficient for processing one-time data, and storage devices are improving (e.g., SSDs)
– Data format conversion and data partitioning time in the SDBMS are ignored
– Both simplifications favor the performance of PostGIS
Platforms
– Dell T1500 workstation: one Intel Core i7 860 CPU (4 cores), one NVIDIA GTX 580 GPU
– Amazon EC2 instance: two Intel Xeon X5570 CPUs (8 cores, 16 threads)
Data sets
– 18 pairs of polygon sets extracted from 18 real-world brain tumor pathology images
– 12 GiB in raw text format
Effectiveness of PixelBox on GPU
Algorithm performance: on the Dell T1500, compute the Jaccard similarity of 619,609 pairs of polygons in a representative data set
– GEOS on a single CPU core: over 430 s
– CPU-version PixelBox on a single core: over 290 s (about 1.5x faster than GEOS)
– PixelBox on the GPU: only 3.6 s (about 120x faster than GEOS)
This experiment shows the effectiveness of PixelBox and how well it utilizes the SIMD data parallelism of GPUs
Overall Performance
Cross-comparing performance
– Parallelized PostGIS on EC2 vs. our solution on the Dell T1500: 18x speedup on average
– Hardware cost: two Intel Xeon X5570: $2000; Core i7 860 + GTX 580: $800
– Power: two Intel Xeon X5570: 190 W; Core i7 860 + GTX 580: 339 W
Our solution is 2.4x lower in hardware cost and 10x higher in performance per watt
Conclusions
Spatial cross-comparison is a data- and compute-intensive operation
The existing SDBMS-based approach is neither high-performance nor low-cost
We provide a software solution based on GPUs and CPUs that significantly accelerates the work at low cost
Thank You
Q & A