Stream Processing of X-ray Microdiffraction Data on Multicores Yuzhen Xie, University of Western Ontario (UWO) joint work with Alain Biem, IBM Research.


Stream Processing of X-ray Microdiffraction Data on Multicores
Yuzhen Xie, University of Western Ontario (UWO)
Joint work with Alain Biem, IBM Research; Michael A. Bauer, UWO; Stewart McIntyre, UWO; Nobumichi Tamura, Lawrence Berkeley National Lab
AMMCS, July 2011

Motivation
Make efficient use of multicore processors to process large volumes of synchrotron XRD data generated at high rates (1 to 10 images per second, about 4 MB per image).
Develop high-performance kernels to achieve near-real-time data analysis for synchrotron experiments, the goal of the Active Network Interchange for Scientific Experimentation (ANISE) project.

Synchrotron X-ray White-beam Microdiffraction
[Figure: an incident X-ray beam (5 to 30 keV) hits the sample, and the diffracted beams are recorded by a CCD camera]
Dectris Pilatus 1M detector at ALS (2010): sub-second readout
An image showing the Laue microdiffraction pattern of a unit cell in a crystal sample

Processing of Laue Patterns for Micro-texture Analysis
Background fit and removal (optional)

Example of Crystallographic Orientation and Strain Maps (courtesy: Jing Chao and Marina Fuller, UWO)
[Figure: strain map (average strain: 9.92 x …) and orientation map, computed with XMAS (X-ray Microdiffraction Analysis Software), Advanced Light Source]

Reference Software Packages
XMAS (X-ray Microdiffraction Analysis Software), Advanced Light Source
3D X-ray Microdiffraction Analysis Software Package in IDL, Advanced Photon Source
A prototype of C code for a selection of features in Laue pattern analysis, Science Studio and ANISE projects, UWO
Best sequential processing time: 25 to 50 seconds per image

Stream Processing Illustration
[Figure: continuous ingestion feeding continuous analysis]

IBM Streams Programming Model
Streams Processing Language (SPADE): Input → Process → Output
Platform-optimized compilation
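As a rough analogy for the Input → Process → Output model, the sketch below chains Python generators so that each image is analysed as soon as it is ingested. It is not SPADE and uses no IBM Streams API; the stage names (`source`, `subtract_min`, `total_intensity`, `sink`) and the toy work they do are hypothetical placeholders.

```python
from typing import Iterable, Iterator, List

# Hypothetical stand-ins for Source -> Process -> Output stages; not IBM Streams APIs.
Image = List[List[int]]


def source(images: Iterable[Image]) -> Iterator[Image]:
    """Ingest images one at a time, as they become available."""
    for img in images:
        yield img


def subtract_min(stream: Iterator[Image]) -> Iterator[Image]:
    """Toy processing stage: subtract each image's minimum pixel value."""
    for img in stream:
        lo = min(min(row) for row in img)
        yield [[v - lo for v in row] for row in img]


def total_intensity(stream: Iterator[Image]) -> Iterator[int]:
    """Toy analysis stage: emit one summary value per image."""
    for img in stream:
        yield sum(sum(row) for row in img)


def sink(results: Iterator[int]) -> List[int]:
    """Output stage: collect the per-image results."""
    return list(results)


if __name__ == "__main__":
    frames = [[[3, 4], [5, 6]], [[1, 1], [1, 9]]]
    print(sink(total_intensity(subtract_min(source(frames)))))  # [6, 8]
```

Here the stages interleave lazily on a single thread. In IBM Streams, operators are instead placed in processing elements that can run concurrently on different cores.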

Laue XRD Processing System on Streams
Processing elements: mainly user-defined operators (UDOPs)
Main stages: XRD image stream → Preprocessing (formatting, parsing XRD image data) → Background removal → Blob searching (blob search, scheduling for parallel peak fitting) → Peak fitting → Indexing → Strain
Supporting operators: Split, Bundle, Sorting
Background filters available: parabolic, 2D Bruckner, 2D mean filter
Peak-fitting functions available: Lorentz, Gaussian, Pearson VII
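For reference, the Lorentz, Gaussian and Pearson VII peak shapes listed above are sketched below in their usual form, parameterised by amplitude and full width at half maximum (FWHM). The parameterisation and the isotropic 2D helper `peak_2d` are assumptions. The slides do not show the actual fitting model, which may, for example, include a background term or elliptical widths.

```python
import math

# Peak profiles parameterised by amplitude `amp`, full width at half maximum
# `fwhm` and distance `r` from the peak centre. Using them in 2D through
# r = hypot(dx, dy) (an isotropic peak) is an assumption of this sketch.


def gaussian(r: float, amp: float, fwhm: float) -> float:
    return amp * math.exp(-4.0 * math.log(2.0) * (r / fwhm) ** 2)


def lorentzian(r: float, amp: float, fwhm: float) -> float:
    return amp / (1.0 + 4.0 * (r / fwhm) ** 2)


def pearson_vii(r: float, amp: float, fwhm: float, m: float) -> float:
    """m = 1 gives the Lorentzian; as m grows the shape approaches the Gaussian."""
    return amp / (1.0 + 4.0 * (2.0 ** (1.0 / m) - 1.0) * (r / fwhm) ** 2) ** m


def peak_2d(x: float, y: float, x0: float, y0: float, amp: float, fwhm: float,
            profile=gaussian, **kw) -> float:
    """Evaluate a profile at pixel (x, y) for a peak centred at (x0, y0)."""
    return profile(math.hypot(x - x0, y - y0), amp, fwhm, **kw)


if __name__ == "__main__":
    # Every profile drops to half its amplitude at r = fwhm / 2.
    for f in (gaussian, lorentzian):
        print(f.__name__, f(0.5, 1.0, 1.0))
    print("pearson_vii", pearson_vii(0.5, 1.0, 1.0, m=2.0))
```

Pearson VII's exponent `m` lets a single function interpolate between Lorentzian-like (heavy-tailed) and Gaussian-like peaks. This is one reason it is a common choice for diffraction peaks.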

Key Implementation Techniques
Efficient Source operator for parsing image files: block reading and type casting
Fine-grained pipelining and cache-efficient background filters
Memory-efficient parallel peak fitting
Organize common parameter values as a stream for shared use in indexing and strain analysis
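To illustrate "block reading and type casting", the sketch below reads pixel data in large chunks and reinterprets the raw bytes as integers in a single call, instead of decoding one pixel at a time. The file layout (8-byte header followed by little-endian uint16 pixels) is hypothetical and is not the real SPE or Pilatus TIFF format. The default `blocksize` simply mirrors the `blocksize=65536*15` value in the SPADE snippet.

```python
import io
import struct
import sys
from array import array
from typing import BinaryIO, Tuple

# HYPOTHETICAL file layout, for illustration only (not the real SPE/TIFF format):
#   8-byte header: height, width as little-endian uint32
#   payload:       height*width pixels as little-endian uint16, row-major
HEADER = struct.Struct("<II")


def read_image(f: BinaryIO, blocksize: int = 65536 * 15) -> Tuple[int, int, array]:
    height, width = HEADER.unpack(f.read(HEADER.size))
    nbytes = height * width * 2
    buf = bytearray()
    # Block reading: fetch large chunks rather than one pixel per read call.
    while len(buf) < nbytes:
        chunk = f.read(min(blocksize, nbytes - len(buf)))
        if not chunk:
            raise ValueError("truncated image data")
        buf += chunk
    # Type casting: reinterpret the raw bytes as uint16 values in one step.
    pixels = array("H")
    if pixels.itemsize != 2:
        raise RuntimeError("this sketch assumes a 2-byte unsigned short")
    pixels.frombytes(buf)
    if sys.byteorder == "big":
        pixels.byteswap()  # the (hypothetical) file is little-endian
    return height, width, pixels


if __name__ == "__main__":
    data = HEADER.pack(2, 3) + struct.pack("<6H", 1, 2, 3, 4, 5, 65535)
    print(read_image(io.BytesIO(data), blocksize=4))
```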

A Fine-grained Pipelined Background Filter Based on the Parabolic Method
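The slides do not describe the parabolic filter in detail. The sketch below therefore swaps in the simpler 2D mean filter, which the system also lists, to illustrate row-level ("fine-grained") pipelining. Rows are consumed one at a time, and each background-subtracted row is emitted as soon as the k rows below it have arrived. The function names, the window half-width `k` and the clipping of negative values to zero are assumptions.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Row = List[float]


def _remove_background(buf: Deque[Tuple[int, Row]], r: int, k: int) -> Row:
    """Subtract the mean over a (2k+1) x (2k+1) window (clipped at image edges)."""
    window = [row for (i, row) in buf if abs(i - r) <= k]
    target = next(row for (i, row) in buf if i == r)
    width = len(target)
    col_sum = [sum(row[c] for row in window) for c in range(width)]
    out = []
    for c in range(width):
        c0, c1 = max(0, c - k), min(width, c + k + 1)
        background = sum(col_sum[c0:c1]) / ((c1 - c0) * len(window))
        out.append(max(0.0, target[c] - background))  # clip negatives to zero
    return out


def streaming_mean_background(rows: Iterable[Row], k: int = 1) -> Iterator[Row]:
    """Consume image rows one at a time; emit row r as soon as row r + k arrives.

    At most 2k + 1 rows are buffered, so each output row is computed from
    recently read data (which is likely still in cache)."""
    buf: Deque[Tuple[int, Row]] = deque()
    n = 0         # rows received so far
    next_out = 0  # index of the next row to emit
    for row in rows:
        buf.append((n, row))
        n += 1
        if n - 1 >= next_out + k:  # row next_out now has its k rows below it
            yield _remove_background(buf, next_out, k)
            next_out += 1
            while buf[0][0] < next_out - k:  # drop rows no future output needs
                buf.popleft()
    while next_out < n:  # flush the last rows (window clipped at the bottom edge)
        yield _remove_background(buf, next_out, k)
        next_out += 1
```

Recomputing the column sums for every row keeps the sketch short. A tuned filter would instead update running sums as rows enter and leave the window.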

A Pilatus TIFF Image before and after Background Removal

Memory-efficient Parallel Peak Fitting

Data Management: the Key Issue
Blob center b: the data set R_b (d_b × d_b) is needed for fitting a peak with center p.
Peak center p: the data set R_p is needed for computing the integrated intensity.
Assume p is not far from b. Define R as the square region (2d_b × 2d_b) centered at b.
Attach the data set R to each blob tuple rather than passing the whole image to each peak-fitting element.
Determine R_b and R_p by coordinate mapping within R.
Result: small data size, good locality, no memory contention, …, and hence efficiency.
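A minimal sketch of this coordinate mapping follows. It assumes images are stored as lists of rows and that R_b and R_p are squares around their centers. The slide gives R_b as d_b × d_b but does not give the size of R_p. The names `extract_region` and `sub_window` and the `half` parameter are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]
Point = Tuple[int, int]  # (row, col) in full-image coordinates


def extract_region(img: Image, b: Point, d_b: int) -> Tuple[Image, Point]:
    """Cut the 2*d_b x 2*d_b square R centred on blob centre b (clipped at the
    image border). Returns R and the image coordinates of its top-left pixel."""
    r0, c0 = max(0, b[0] - d_b), max(0, b[1] - d_b)
    r1, c1 = min(len(img), b[0] + d_b), min(len(img[0]), b[1] + d_b)
    return [row[c0:c1] for row in img[r0:r1]], (r0, c0)


def sub_window(R: Image, offset: Point, center: Point, half: int) -> Image:
    """Map an image-coordinate centre into R's local frame and cut the square
    of half-width `half` around it, clipped to R (e.g. for R_b or R_p)."""
    lr, lc = center[0] - offset[0], center[1] - offset[1]
    r0, c0 = max(0, lr - half), max(0, lc - half)
    r1, c1 = min(len(R), lr + half), min(len(R[0]), lc + half)
    return [row[c0:c1] for row in R[r0:r1]]


if __name__ == "__main__":
    img = [[10 * r + c for c in range(10)] for r in range(10)]
    R, off = extract_region(img, (5, 5), 2)  # 4 x 4 block, top-left at (3, 3)
    print(off, sub_window(R, off, (5, 5), 1))
```

Only R (about 4d_b² pixels) travels with each blob tuple. Each fitting element therefore touches a small, contiguous piece of data instead of the whole image.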

A SPADE Code Snippet for Blob Searching and Parallel Peak Fitting

    ## Parse an image
    stream engStream(height: Integer, width: Integer, emax: …, evalues: DoubleList)
      := Source()["file://c4-3_001.spe", udfbinformat="speParser", blocksize=65536*15]{}

    ## Search blobs and generate blob stream
    stream blobStream(groupid: Integer, blobid: Integer, …, lroi: DoubleList)
      := Udop(engStream)["blobSearch"]{np="NUM_PF"}

    ## Split blobs to subgroups 0 to NUM_PF-1
    stream Integer, blobid: Integer, …, lroi: DoubleList)
    for_end
      := Split(blobStream)[groupid]{}

    ## Parallel peak fitting for subgroups of blobs and bundle all peaks together
    bundle peakBundle := ()
    0 to NUM_PF-1
      stream Integer, x: Integer, …, inten: Double) :=
      peakBundle +=
    for_end
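The split, parallel-fit and bundle structure of the snippet can be mimicked in Python as sketched below, under stated assumptions. `Blob`, `Peak`, `fit_peak`, `fit_group` and `split_fit_bundle` are hypothetical names. The "fit" is only an intensity-weighted centroid, not a Gaussian, Lorentz or Pearson VII fit. A thread pool stands in for the Streams processing elements.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Tuple

NUM_PF = 4  # number of peak-fitting workers, mirroring NUM_PF in the SPADE snippet


@dataclass
class Blob:
    groupid: int                # assumed already in 0..NUM_PF-1 (set by blob search)
    blobid: int
    region: List[List[float]]   # the data set R attached to the blob tuple
    offset: Tuple[int, int]     # image (row, col) of R's top-left pixel


@dataclass
class Peak:
    blobid: int
    x: float
    y: float
    inten: float


def fit_peak(blob: Blob) -> Peak:
    """HYPOTHETICAL stand-in for peak fitting: intensity-weighted centroid of R."""
    total = sum(sum(row) for row in blob.region)
    if total == 0:
        return Peak(blob.blobid, float(blob.offset[1]), float(blob.offset[0]), 0.0)
    y = sum(i * sum(row) for i, row in enumerate(blob.region)) / total
    x = sum(j * v for row in blob.region for j, v in enumerate(row)) / total
    return Peak(blob.blobid, x + blob.offset[1], y + blob.offset[0], float(total))


def fit_group(blobs: List[Blob]) -> List[Peak]:
    return [fit_peak(b) for b in blobs]


def split_fit_bundle(blobs: List[Blob], num_pf: int = NUM_PF) -> List[Peak]:
    groups: Dict[int, List[Blob]] = {g: [] for g in range(num_pf)}
    for b in blobs:                                      # Split: route by groupid
        groups[b.groupid].append(b)
    with ThreadPoolExecutor(max_workers=num_pf) as pool:  # parallel peak fitting
        per_group = list(pool.map(fit_group, groups.values()))
    peaks = [p for group in per_group for p in group]     # Bundle
    return sorted(peaks, key=lambda p: p.blobid)          # Sort by blob id
```

CPython threads do not run pure-Python arithmetic in parallel. A real speedup would therefore need `ProcessPoolExecutor` (the dataclasses and top-level functions here are picklable) or native fitting code. The thread pool only keeps the sketch simple to run.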

Organize Common Parameter Values as One Stream for Shared Use in Indexing and Strain Refinement of All XRD Images
[Figure: scattering geometry with incident beam k_in, diffracted beam k_out, scattering vector q, and vectors q1, q2, q3 with the angles between them]
Known crystal structure and energy range (5 to 30 keV): calculated q_hkl list of reflections
List of peak positions on the CCD: experimental q_i list of reflections
Find angle triplets (and thus q1, q2, q3) matching calculated and measured values within a given angular tolerance.
Choose the triplets that index the largest number of reflections within a given angular tolerance. Look for "missing" reflections.
Shared parameters: beam direction k_in, detector position and dimensions
Strain refinement
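The triplet-matching step can be sketched as a brute-force search over angles between scattering-vector directions. Only directions are used, since a white beam does not fix |q|. Assumptions: a cubic lattice, so the q of reflection hkl is parallel to (h, k, l). The conversion from detector pixel to k_out, which needs the detector position and dimensions, is omitted. The later scoring step (counting indexed reflections, looking for missing ones) is not shown. All function names are hypothetical.

```python
import math
from itertools import combinations, permutations
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def unit(v: Sequence[float]) -> Vec:
    n = math.sqrt(sum(c * c for c in v))
    return (v[0] / n, v[1] / n, v[2] / n)


def q_direction(k_in: Vec, k_out: Vec) -> Vec:
    """Direction of the scattering vector, q ∝ k_out - k_in (elastic scattering).
    k_out must differ from k_in."""
    a, b = unit(k_in), unit(k_out)
    return unit((b[0] - a[0], b[1] - a[1], b[2] - a[2]))


def angle_deg(a: Sequence[float], b: Sequence[float]) -> float:
    ua, ub = unit(a), unit(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(ua, ub))))
    return math.degrees(math.acos(dot))


def match_triplets(exp_q: List[Vec], hkls: List[Vec], tol_deg: float):
    """Pair each experimental triplet with every ordered hkl triplet whose three
    inter-vector angles all agree within tol_deg (cubic lattice assumed)."""
    matches = []
    for i, j, l in combinations(range(len(exp_q)), 3):
        measured = (angle_deg(exp_q[i], exp_q[j]),
                    angle_deg(exp_q[i], exp_q[l]),
                    angle_deg(exp_q[j], exp_q[l]))
        for a, b, c in permutations(hkls, 3):
            calc = (angle_deg(a, b), angle_deg(a, c), angle_deg(b, c))
            if all(abs(m - t) <= tol_deg for m, t in zip(measured, calc)):
                matches.append(((i, j, l), (a, b, c)))
    return matches
```

With the candidate reflections (100), (010), (001), (110), (111) and measured directions along x, y and the xy diagonal, only the two orderings of (100), (010), (110) survive a 0.5° tolerance. That illustrates why several triplets may need to be scored against the full reflection list.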

Streams Live Graph: One Pipeline with 4 Processing Elements for Parallel Peak Fitting
2.5 seconds per image (2084 × 2084 pixels) on an Intel Core2 Quad CPU Q9550 (2.83 GHz, 8 GB RAM and 6 MB L2 cache)
[Graph: Image Sourcing, Blob Search & Scheduling, Parallel Peak Fitting, Indexing and Strain operators, plus Parameter Sourcing]

Streams Live Graph: 4 Pipelines to Process 4 Images Concurrently in Streaming Mode
Super-linear speedup obtained on an Intel Core2 Quad CPU Q9550
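The figures quoted across the slides allow a quick back-of-envelope check of where the system stands. The sketch below only restates slide numbers, and its variable names are illustrative. Note that the 10 to 20× ratio compares different software (the reference packages against the Streams system), so it is not a pure parallel speedup. The super-linear speedup of the 4-pipeline configuration cannot be recomputed, because its per-image time is not given on the slides.

```python
# Figures taken from the slides.
IMAGE_MB = 4.0              # approximate size of one image
ACQ_RATES = (1.0, 10.0)     # images per second produced during an experiment
SEQ_SECONDS = (25.0, 50.0)  # best sequential time per image (reference software)
PIPELINE_SECONDS = 2.5      # one Streams pipeline, Intel Core2 Quad Q9550


def data_rate_mb_per_s(images_per_s: float) -> float:
    return images_per_s * IMAGE_MB


def speedup(t_ref: float, t_new: float) -> float:
    return t_ref / t_new


def backlog_factor(seconds_per_image: float, images_per_s: float) -> float:
    """Values above 1 mean analysis is slower than acquisition by this factor."""
    return seconds_per_image * images_per_s


if __name__ == "__main__":
    print([data_rate_mb_per_s(r) for r in ACQ_RATES])         # [4.0, 40.0] MB/s
    print([speedup(t, PIPELINE_SECONDS) for t in SEQ_SECONDS])  # [10.0, 20.0]
    print(3600.0 / PIPELINE_SECONDS)                          # 1440.0 images/hour
    print([backlog_factor(PIPELINE_SECONDS, r) for r in ACQ_RATES])  # [2.5, 25.0]
```

At 2.5 seconds per image, one pipeline is still 2.5 to 25 times slower than acquisition at 1 to 10 images per second. This is consistent with running several pipelines concurrently and with the stated plan to deploy on systems with many cores.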

Conclusion
We present the first stream processing application in the field of synchrotron XRD data analysis.
We show that stream processing is an effective model for using multicore processors efficiently in XRD image data analysis.
Our system provides a high-performance processing kernel for near-real-time analysis of image data from synchrotron experiments.
Our work in progress includes evaluating, optimizing, configuring and deploying this kernel on large systems with many cores, to process large sets of XRD images in parallel and in streaming mode.
Thank You!