Tomographic mammography parallelization
Juemin Zhang (NU), Tao Wu (MGH), Waleed Meleis (NU), David Kaeli (NU)

Parallelization of SSI Applications
We have developed profile-guided parallelization techniques that rapidly characterize program control flow and data flow, and we use this information to guide parallelization.
We have already sped up a number of CenSSIS applications, including:
– finite-difference time domain
– steepest descent fast multipole method
– photo simulation
– ellipsoid algorithm
We target Beowulf clusters running Linux and use MPICH as our middleware.

Tomographic mammography
3D image reconstruction from x-ray projections:
– Used to detect and diagnose breast cancer
– Based on well-developed mammography techniques
– Exposes tissue structure using multiple projections from different angles
Advantages:
– Accuracy: provides at least as much useful information as x-ray film
– Flexibility: digital image manipulation, digital storage
– Structural information: layered images
– Safety: low-dose x-ray
– Lower cost: compared to MRI

Image acquisition and reconstruction process
– Acquisition: 11 uniform angular samples along the Y-axis
– X-ray projection: breast tissue density absorption radiograph
– Algorithm: constrained, non-linear, iterative convergence process (sketched in code below)
[Figure: acquisition geometry with x-ray source, detector, and X/Y/Z axes over the x-ray projections, plus a reconstruction flowchart: Initialization (set 3D volume) → Forward (compute projections) → Satisfied? → if No, Backward (correct 3D volume) and repeat; if Yes, Exit]
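
To make the loop concrete, here is a minimal C++ sketch of the forward/backward iteration in the flowchart. The projectors are trivial stand-ins (a real forward projector sums voxel densities along each x-ray beam, and a real backward step spreads the projection error back into the volume); all names are illustrative, not taken from the MGH code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Volume = std::vector<float>;       // flattened 3D voxel grid
using Projections = std::vector<float>;  // stacked 2D projection images

// Stand-in projectors; the real operators trace beams through the grid.
Projections forward_project(const Volume& vol) {
    return vol;  // toy: identity "projection", so sizes match below
}
void back_project(Volume& vol, const Projections& err) {
    for (std::size_t i = 0; i < vol.size(); ++i)
        vol[i] += 0.5f * err[i];  // toy: damped additive correction
}

void reconstruct(Volume& vol, const Projections& measured) {
    for (int iter = 0; iter < 100; ++iter) {          // caller did Initialization
        Projections computed = forward_project(vol);  // Forward: compute projections
        Projections err(measured.size());
        float norm2 = 0.0f;
        for (std::size_t i = 0; i < err.size(); ++i) {
            err[i] = measured[i] - computed[i];
            norm2 += err[i] * err[i];
        }
        if (std::sqrt(norm2) < 1e-3f) break;          // Satisfied? Yes: exit
        back_project(vol, err);                       // No: correct 3D volume
    }
}
```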

Reconstruction and Parallelization
Reconstruction algorithm: maximum likelihood expectation maximization (ML-EM)
– High-resolution images
– Computationally intensive: 3 hours of serial execution on a 2.2GHz Pentium 4 workstation, using 2GB of memory
The need for speed:
– Large number of medical cases
– Execution time increases as a function of breast size
– Real-time application: computer-guided needle biopsy breast surgery
Research motivation:
– Computation vs. communication
– Platforms vs. parallelization methods
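
For reference, the classical ML-EM update has the form below (written in its standard emission-tomography form; the constrained transmission variant used for tomosynthesis differs in detail):

$$
x_j^{(k+1)} \;=\; \frac{x_j^{(k)}}{\sum_i a_{ij}} \sum_i a_{ij}\,\frac{y_i}{\sum_{j'} a_{ij'}\, x_{j'}^{(k)}}
$$

Here $x_j^{(k)}$ is the density estimate for voxel $j$ at iteration $k$, $y_i$ is the measured value at projection pixel $i$, and $a_{ij}$ is the contribution of voxel $j$ to ray $i$. Each iteration requires one full forward projection (the inner sum) and one back-projection (the outer sum), which is why the algorithm is so computationally intensive.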

Parallelization approaches
Reducing communication data:
– Segmentation along the Y-axis
– Using redundant computation to replace communication
– Segmenting along the x-ray beam
Three approaches (see the MPI sketch below):
– First approach: no inter-node communication (more computation, no communication)
– Second approach: overlapped with inter-node communication
– Third approach: non-overlapped with inter-node communication (no redundant computation, more communication)
[Figure: neighboring Y-axis segments, showing the overlap area and the exchanged data]
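
As an illustration of the exchange step, here is a minimal MPI sketch of swapping overlap slabs between neighboring Y-axis segments; the buffer layout and function name are hypothetical, not the actual implementation.

```cpp
#include <mpi.h>
#include <vector>

// seg holds this rank's Y-segment plus one overlap slab on each side:
// [left halo | owned slabs | right halo], each slab `slab` floats wide.
void exchange_overlap(std::vector<float>& seg, int slab, int rank, int nprocs) {
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
    int n = static_cast<int>(seg.size());

    // Send our first owned slab left; receive the right neighbor's
    // boundary slab into our right halo.
    MPI_Sendrecv(seg.data() + slab,         slab, MPI_FLOAT, left,  0,
                 seg.data() + n - slab,     slab, MPI_FLOAT, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // Send our last owned slab right; receive the left neighbor's
    // boundary slab into our left halo.
    MPI_Sendrecv(seg.data() + n - 2 * slab, slab, MPI_FLOAT, right, 1,
                 seg.data(),                slab, MPI_FLOAT, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

In the overlapped variant, this exchange would be posted with non-blocking MPI_Isend/MPI_Irecv so it can proceed concurrently with computation; the no-communication variant instead widens each segment with redundantly computed slabs so no exchange is needed at all.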

Implementation and tests
Serial code provided by T. Wu at MGH.
Programming model:
– C++ and the Message Passing Interface (MPI)
– Globus toolkit: MPICH-G2 over the NPACI Grid, in progress
Test input data sets:
– Phantom data set: 1600x2034x45
– A large patient data set: 1040x2034x70
Test platforms:

Platform                    Processor                          Interconnection
MGH cluster                 2.5GHz Pentium 4                   100Mb interconnect switch
UIUC NCSA Titan cluster     800MHz Itanium 1 dual-processor    1Gb Myrinet, shared L3 cache
UIUC NCSA IBM p690 server   1.3GHz Power4                      1Gb Ethernet, shared-memory system
SGI Altix 3300 system       1.3GHz Itanium 2 dual-processor    NUMAlink interconnect, shared-memory system

Partitioning methods comparison
Input data set: phantom 1600x2034x45
Platform: UIUC NCSA Titan cluster
Results:
– The non-overlap method outperforms the other two methods
– The best parallel runtime is under 3 minutes using 64 processors
– All three methods show very similar speedup trends
– Given additional processors, the non-overlap method yields a larger performance gain than the other methods

Platform performance comparison using non-overlap method
Input data set: phantom 1600x2034x45
Platforms:
– SGI Altix system
– UIUC NCSA Titan cluster
– UIUC NCSA IBM p690
– Pentium 4 cluster at MGH
Number of processors: 32
Algorithm: non-overlap partitioning with inter-node communication
Results:
– Computation: the SGI Altix with Itanium 2 processors outperforms the other CPUs
– Communication: shared-memory platforms have very low communication overhead
– More than a 2x performance difference between the SGI Altix and the Pentium 4 cluster

Platform performance comparison using no inter-node communication
Input data set: phantom 1600x2034x45
Platforms:
– SGI Altix system
– UIUC NCSA Titan cluster
– UIUC NCSA IBM p690
– Pentium 4 cluster at MGH
Number of processors: 32
Algorithm: overlap computed redundantly, without inter-node communication
Results:
– Computation: significant differences between the Titan, IBM p690, and Pentium 4 clusters
– Synchronization: more waiting time accumulates at the end of iterations
– SGI Altix performance remains similar to the non-overlap method

Platform and parallel partitioning method performance comparison
Input data set: phantom 1600x2034x45
Platforms:
– Pentium 4 cluster at MGH
– UIUC NCSA IBM p690
– UIUC NCSA Titan cluster
– SGI Altix
Number of processors: 32
Results:
– Computational power dominates performance
– The inter-node communication and non-overlap methods lead to higher performance on some platforms

Summary and future work
Over 180x speedup vs. the serial implementation:
1. Phantom data set (1600x2034x45): 1 minute using 64 processors on SGI Altix
2. A large patient data set (1040x2034x70): 1.5 minutes using 64 processors on SGI Altix
Joint SPIE paper with T. Wu at MGH: "A parallel reconstruction method for digital tomosynthesis mammography," 2004 SPIE Workshop on Medical Imaging.
Future work:
– Real-time application: computer-guided needle biopsy
  Goal: a delay of 5~10 seconds or less
  Evaluate the effect of computation reduction on image quality
– Move the code to a Grid environment (underway)