Simple, Efficient, Portable Decomposition of Large Data Sets William Lundgren (Gedae), David Erb (IBM), Max Aguilar (IBM), Kerry Barnes (Gedae), James Steed (Gedae) HPEC 2008

Introduction
The study of High Performance Computing is the study of
– How to move data into fast memory
– How to process data when it is there
Multicores like Cell/B.E. and Intel Core2 have hierarchical memories
– Small, fast memories close to the SIMD ALUs
– Large, slower memories off chip
Processing large data sets requires decomposition
– Break data into pieces small enough for the local storage
– Stream pieces through using multibuffering

Cell/B.E. Memory Hierarchy
Each SPE core has a 256 kB local storage
Each Cell/B.E. chip has a large system memory
[Diagram: two Cell/B.E. chips, each with a PPE, eight SPEs with local stores (LS), an EIB, a bridge, and system memory (SYSMEM); the subsystems may be duplicates or heterogeneous.]

Intel Quad Core Memory Hierarchy
Caching on Intel and other SMP multicores also creates a memory hierarchy
[Diagram: cores on a shared system bus, each with an L2 cache, L1 cache, instruction units, schedulers, and load/store and ALU pipelines.]

Optimization of Data Movement
Optimize data movement using software
Upside
– Higher performance possibilities
Downside
– Complexity beyond the reach of many programmers
By analogy, consider the introduction of Fortran and C
– The CPU was beyond the reach of many potential software developers
– Fortran and C provided automatic compilation to assembly
– This spurred the industry
Multicores require the introduction of fundamentally new automation.

Gedae Background
We can understand the problem by considering the guiding principles of automation that effectively address it.

Structure of Gedae
[Diagram: the developer creates a functional model and an implementation specification; the compiler, guided by a hardware model and analysis tools, produces a threaded application that runs on the SW/HW system under a thread manager.]

Guiding Principle for Evolution of Multicore SW Development Tools
[Diagram: implementation complexity is moved out of the functional model and into the compiler, which combines it with architecture-specific details, libraries, and the implementation specification to produce the implementation.]

Language – Invariant Functionality
Functionality must be free of implementation policy
– C and Fortran freed the programmer from specifying details of moving data between memory, registers, and the ALU
– Extend this to multicore parallelism and memory structure
The invariant functionality does not include multicore concerns like
– Data decomposition/tiling
– Thread and task parallelism
Functionality must be easy to express
– Scientists and engineers want a thinking tool
Functional expressiveness must be complete
– Some algorithms are hard to express if the language is limited

Language Features for Expressiveness and Invariance
Stream data (time-based data) *
Stream segments with software reset on segment boundaries *
Persistent data – extends from state * to databases ‡
Algebraic equations (HLL most similar to Mathcad) ‡
Conditionals †
Iteration ‡
State behavior †
Procedural *
* These are mature language features
† These are currently directly supported in the language but will continue to evolve
‡ Support for directly expressing algebraic equations and iteration, while possible to implement in the current tool, will be added to the language and compiler in the next major release. Databases will be added soon after.

Library Functions
Black box functions hide essential functionality from the compiler
A library is a vocabulary with an implementation:
conv(float *in, float *out, int R, int C, float *kernel, int KR, int KC);
An algebraic language is a specification:
range i=0..R-1, j=0..C-1, i1=0..KR-1, j1=0..KC-1;
out[i][j] += in[i+i1][j+j1] * kernel[i1][j1];
Other examples:
As[i][j] += B[i+i1][j+j1]; /* kernel of ones */
Ae[i][j] |= B[i+i1][j+j1]; /* erosion */
Am[i][j] = As[i][j] > (Kz/2); /* majority operation */
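As a point of reference, here is a minimal C sketch (our expansion, not Gedae-generated code) of the nested loops the range specification above implies; the compiler is free to tile, vectorize, or distribute them:

/* Scalar expansion of the algebraic specification above.
   Assumes out is R x C and in is at least (R+KR-1) x (C+KC-1),
   both row-major. */
void conv(float *in, float *out, int R, int C,
          float *kernel, int KR, int KC)
{
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++) {
            float acc = 0.0f;
            for (int i1 = 0; i1 < KR; i1++)
                for (int j1 = 0; j1 < KC; j1++)
                    acc += in[(i+i1)*(C+KC-1) + (j+j1)] * kernel[i1*KC + j1];
            out[i*C + j] = acc;   /* out[i][j] += ... with out zeroed */
        }
}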

Library Functions
A simple example of hiding essential functionality is tile extraction from a matrix
– Software structure changes based on data size and target architecture
– The library hides the implementation from the developer and the compiler
[Diagram, Option A: transfer a tile from the image in system memory so it is contiguous in SPE local store, process it, and move it back to system memory. Option B: a CPU data reorganization makes the tile contiguous in the PPE cache before processing and transfer back.]
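For illustration only, a plain C sketch of the gather that Options A and B hide; get_tile and its layout assumptions are ours, not the library's:

#include <string.h>

/* Copy one RT x CT tile starting at (r0, c0) out of a row-major image
   with C columns into a contiguous buffer, e.g. one destined for an
   SPE local store. On Cell/B.E. the per-row copies would be gathered
   into a single list DMA. */
void get_tile(const float *image, int C, int r0, int c0,
              int RT, int CT, float *tile)
{
    for (int r = 0; r < RT; r++)
        memcpy(&tile[r*CT], &image[(r0 + r)*C + c0], CT * sizeof(float));
}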

Features Added to Increase Automation of Example Presented at HPEC

New Features
New language features and compiler functionality provide increased automation of hierarchical memory management
Language features
– Tiled dimensions
– Iteration
– Pointer port types
Compiler functions
– Application of stripmining to iteration
– Inclusion of close-to-the-hardware list DMA to get/put tiles
– Multibuffering
– Accommodation of the memory alignment requirements of the SPU and DMA

Matrix Multiplication Algorithm

Distributed Algorithm
Symbolic expression: A[i][j] += B[i][k]*C[k][j]
Tile the operation for distribution and small memory: i->p,i2; j->j1,j2; k->k1,k2
[p][j1]A[i2][j2] += [p][k1]B[i2][k2] * [k1][j1]C[k2][j2]
The p index is distributed spatially across processors; the k1 and j1 sums are computed temporally
Accumulate in local store, then transfer result tiles back to system memory
[Diagram: tiles of B and C stream from system memory into contiguous SPE local store, are multiplied and accumulated over k1 = 0,1,2, and the result tile is returned to system memory.]
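A scalar C sketch of this tile decomposition, assuming square row-major matrices with N a multiple of the tile size T; this illustrates the index mapping above and is not the generated SPE code:

/* A[i][j] += B[i][k]*C[k][j] with i = p*T+i2, j = j1*T+j2, k = k1*T+k2.
   Tile row p of the output maps to processor p; the k1 loop is the
   temporal accumulation performed in local store. */
void matmul_tiled(float *a, const float *b, const float *c, int N, int T)
{
    int NT = N / T;
    for (int p = 0; p < NT; p++)              /* output tile row (spatial)   */
        for (int j1 = 0; j1 < NT; j1++)       /* output tile column (temporal) */
            for (int k1 = 0; k1 < NT; k1++)   /* temporal sum of tile products */
                for (int i2 = 0; i2 < T; i2++)
                    for (int j2 = 0; j2 < T; j2++) {
                        float acc = 0.0f;
                        for (int k2 = 0; k2 < T; k2++)
                            acc += b[(p*T + i2)*N + (k1*T + k2)]
                                 * c[(k1*T + k2)*N + (j1*T + j2)];
                        a[(p*T + i2)*N + (j1*T + j2)] += acc;
                    }
}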

Data Partitioning by Processor
Each processor computes a different set of rows of “a”
[Diagram: system memory partitioned into row blocks, Data for Processor 0 through Data for Processor 7. Blue translucent boxes indicate structure that will migrate into the implementation and compiler.]

Temporal Data Partitioning
Fetch tiles from system memory
– Automatically incorporate DMA list transfer
Compute the sum of the tile matrix multiplies
Reconstitute the result in system memory
[Diagram: each processor's row block is further split into tiles that stream through local store over time.]

Stripmining and Multibuffering
Stripmining this algorithm processes the matrix tile by tile instead of all at once
– Enabling automated stripmining adds this to the compilation
Multibuffering overlaps the DMA of the next tile with processing of the current tile
– A multibuffering table allows this to be turned off and on
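A runnable C sketch of the double-buffering pattern, with fetch_tile and process_tile as hypothetical stand-ins for the list-DMA get and a kernel's Apply method:

#include <stdio.h>
#include <string.h>

#define NTILES 8
#define TILE_ELEMS 16

static float input[NTILES][TILE_ELEMS];   /* stands in for system memory */

/* On Cell/B.E. this would post an asynchronous mfc_get list DMA;
   a memcpy stands in here so the sketch runs anywhere. */
static void fetch_tile(int t, float *dst) { memcpy(dst, input[t], sizeof input[t]); }

static float process_tile(const float *src)
{
    float s = 0.0f;
    for (int i = 0; i < TILE_ELEMS; i++) s += src[i];
    return s;
}

int main(void)
{
    float buf[2][TILE_ELEMS], total = 0.0f;
    for (int t = 0; t < NTILES; t++) input[t][0] = (float)t;

    fetch_tile(0, buf[0]);                                   /* prime the pipeline */
    for (int t = 0; t < NTILES; t++) {
        if (t + 1 < NTILES) fetch_tile(t + 1, buf[(t + 1) & 1]); /* next transfer... */
        total += process_tile(buf[t & 1]);                       /* ...overlaps compute */
    }
    printf("sum = %f\n", total);
    return 0;
}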

Analysis of Complexity and Performance
13 kernels
– Each kernel has 10 lines of code or less in its Apply method
– A future version will be one kernel with 1 line of code defined using an algebraic expression
Automation ratio (internal kernels added / original kernels / processors)
– Internal kernels added: 276
– Current ratio: 2.65 (276 / 13 / 8)
– Future ratio: 36
Runs at GFLOPs on 8 SPEs for large matrices
– Higher rates are possible using block data layout, up to 95% of the max throughput of the processor

Polar Format Synthetic Aperture Radar Algorithm

Algorithm
complex out[j][i] pfa_sar(complex in[i][j], float Taylor[j], complex Azker[i2]) {
  t1[i][j] = Taylor[j] * in[i][j];
  rng[i] = fft(t1[i]); /* FFT of rows */
  cturn[j][i] = rng[i][j];
  adjoin[j][i2](t) = i2 < R ? cturn[j][i2](t) : cturn[j][i2-R](t-1);
  t2[j] = ifft(adjoin[j]);
  t3[j][i2] = Azker[i2] * t2[j][i2];
  azimuth[j] = fft(t3[j]);
  out[j][i] = azimuth[j][i];
}

Analysis of Code Complexity for Benchmark From HPEC
Kernels
– 7 tiling kernels specially crafted for this application
– 5 data allocation kernels specially crafted for this application
DMA transfers between system memory and SPE local storage were coded by hand using the E library
Multibuffering is incorporated into the kernels by hand
The tiling kernels are very complex
– 80 to 150 lines of code each
– 20 to 100 lines of code in the Apply method
A productivity tool should do better!

Analysis of Code Complexity and Performance
23 kernels
– Each kernel has 10 lines of code or less in its Apply method
Automation ratio
– Internal kernels added: 1308
– Current ratio: 7.11 (1308 / 23 / 8)
– Future ratio:
Performance:
[Table: GFLOPS and GB/s for each algorithm stage and for TOTAL SAR; the values were not preserved in the transcript.]

Backprojection Synthetic Aperture Radar Algorithm

Backprojection – the Technique
[Diagram: a flight path of observation points with indices (xo[p], yo[p], zo[p]) above a 2D image with element indices (x, y); in[p][f] are the pulse returns, and r[p][x][y] is the range from each observation point to each image element.]

Algorithm
complex out[x][y] backprojection(complex in[p][f0], float xo[p], float yo[p], float zo[p], float sr[p], int X, int Y, float DF, int Nbins) {
  range f = f0 * 4, x = X-1, y = Y-1;
  { in1[p][f] = 0.0; in1[p][f0] = in[p][f0]*W[f0]; }
  in2[p] = ifft(in1[p]);
  rng[p][y][x] = sqrt((xo[p]-x[x])^2 + (yo[p]-y[y])^2 + zo[p]^2);
  dphase[p][y][x] = exp(i*4*PI/C*f0[p]*rng[p][y][x]);
  rbin[p][y][x] = Nbins * (rng[p][y][x] - rstart[p]) * 2 * DF / C;
  irbin[p][y][x] = floor(rbin[p][y][x]);
  w1[p][y][x] = rbin[p][y][x] - irbin[p][y][x];
  w0[p][y][x] = 1 - w1[p][y][x];
  out[y][x] += (in2[p][irbin[p][y][x]]*w0[p][y][x] +
                in2[p][irbin[p][y][x]+1]*w1[p][y][x]) * dphase[p][y][x];
}
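For readers unfamiliar with the symbolic notation, a plain C reading of the interpolate-and-accumulate step for one pulse; this is our sketch, and the array layout and precomputed rbin/dphase arrays are assumptions:

#include <math.h>
#include <complex.h>

/* Accumulate the contribution of one pulse into the X x Y image.
   in2 is the interpolated pulse return; rbin and dphase hold the
   precomputed fractional range bin and phase factor per (y, x).
   The caller must guarantee floor(rbin)+1 stays inside in2. */
void accumulate_pulse(float complex *out, const float complex *in2,
                      const float *rbin, const float complex *dphase,
                      int X, int Y)
{
    for (int y = 0; y < Y; y++)
        for (int x = 0; x < X; x++) {
            float rb = rbin[y*X + x];
            int   ir = (int)floorf(rb);     /* range bin below the true range */
            float w1 = rb - (float)ir;      /* linear interpolation weights   */
            float w0 = 1.0f - w1;
            out[y*X + x] += (in2[ir]*w0 + in2[ir + 1]*w1) * dphase[y*X + x];
        }
}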

Comparison of Manual vs. Automated Implementation
Automation ratio
– Internal kernels added: 1192
– Current ratio: 2.26
– Future ratio: 4.13
Processing times for the manual implementation were reported at the IEEE RADAR 2008 conference; processing times for the automated memory transfers with tiling were measured for comparison.
[Tables: Npulses vs. time (msec) for the manual and automated implementations; most values were not preserved in the transcript.]

Summary and Roadmap
Gedae’s tiling language allows the compiler to manage the movement of data through hierarchical memory
– Great reduction in code size and programmer effort
– Equivalent performance
Gedae Symbolic Expressions will take the next step forward in ease of implementation
– Specify the algorithm as algebraic code
– Express the data decomposition (both spatial and temporal)
– The compiler handles the rest

End of Presentation

Appendix: Gedae’s Future Direction Towards Full Automation

Implementation Tools – Automatic Implementation
[Diagram: the developer supplies only the functional model; a rule-based engine, using a hardware model with characterization, software characterization on the HW model, and enhanced analysis tools, generates the implementation specification; the compiler then produces the threaded application for the SW/HW system and its thread manager.]

Appendix: Cell/B.E. Architecture Details and Tiled DMA Characterization

Cell/B.E. Compute Capacity and System Memory Bandwidth
Maximum flop capacity: 204.8 Gflop/sec for 32-bit (4-byte) data
– 3.2 GHz * 8 flops/SPU * 8 SPUs
Maximum memory bandwidth: 3.2 GWords/sec
– 25.6 GB/sec / 4 bytes/word / 2 words/function (into and out of memory)
Ideal compute-to-memory ratio: 64 flops per floating point data value
– 204.8 / 3.2
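The peak-rate arithmetic above, restated as a tiny self-checking C program:

#include <stdio.h>

int main(void)
{
    double gflops = 3.2 * 8 * 8;       /* 3.2 GHz x 8 flops/SPU x 8 SPUs = 204.8 */
    double gwords = 25.6 / 4.0 / 2.0;  /* 25.6 GB/s / 4 B/word / 2 words in+out = 3.2 */
    printf("peak %.1f Gflop/s, %.1f GWords/s, %.0f flops per word\n",
           gflops, gwords, gflops / gwords);   /* ratio = 64 */
    return 0;
}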

Practical Issues
Degradation of memory bandwidth
– Large-transfer alignment and size requirements
  Need 16-byte alignment on source and destination addresses
  Transfer size must be a multiple of 16 bytes
– DMA transfers have startup overhead
  Less overhead to use list DMA than to do individual DMA transfers
Degradation of compute capacity
– Compute capacity is based on an add:multiply ratio of 1:1 and a 4-wide SIMD ALU
– Filling and emptying the ALU pipe
– Pipeline latency
– Data shuffling using the SIMD unit
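A small C sketch of honoring these alignment and size rules before issuing a transfer; posix_memalign (POSIX) stands in for whatever aligned allocator the target SDK provides:

#include <stdlib.h>
#include <stdint.h>

/* Allocate a DMA-friendly buffer: 16-byte aligned start and a size
   rounded up to a multiple of 16 bytes, per the rules listed above. */
float *alloc_dma_buffer(size_t nfloats)
{
    size_t bytes = (nfloats * sizeof(float) + 15) & ~(size_t)15;
    void *p = NULL;
    if (posix_memalign(&p, 16, bytes) != 0)
        return NULL;
    return (float *)p;
}

/* Both the source and destination of a transfer must pass this check. */
int dma_ok(const void *addr, size_t bytes)
{
    return ((uintptr_t)addr % 16 == 0) && (bytes % 16 == 0);
}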

Effect of Tile Size on Throughput
[Chart: throughput vs. tile size for varying numbers of processors; times are measured within Gedae.]

Appendix: Polar Format SAR Description and Flow Graph Specification

Stages of SAR Algorithm
Partition
– Distribute the matrix to multiple PEs
Range
– Compute-intensive operation on the rows of the matrix
Corner Turn
– Distributed matrix transpose
Azimuth
– Compute-intensive operation on the rows of [ M(i-1) M(i) ]
Concatenation
– Combine results from the PEs for display

Simplifications
Tile dimensions specify the data decomposition
– Input: stream float in[Rt:R][Ct:C]
– Output: stream float out[Rt/N:R][Ct/N:C]
This is all the information the compiler needs
– The user specifies the tile size to best fit in fast local storage
– The compiler stripmines the computation to stream the data through the coprocessors

Simplified Specification: Range

Simplified Specification: Corner Turn

Simplified Specification: Azimuth

Kernels
The kernels are no more complex than the partitioning kernel shown in the matrix multiply example
The only difference is that they partition split complex data!

Appendix: Backprojection SAR Symbolic Expression Code Analysis

Algorithm
The full listing appears in the Algorithm slide above; each slide in this appendix highlights one part of it.
Function prototype:
complex out[x][y] backprojection(complex in[p][f0], float xo[p], float yo[p], float zo[p], float sr[p], int X, int Y, float DF, int Nbins)

Algorithm
Declare the range variables (iterators) needed:
range f = f0 * 4, x = X-1, y = Y-1;

Algorithm
Zero-fill the interpolation array, then fill it with the input data; notice the interpolation array is 4X the input array:
{ in1[p][f] = 0.0; in1[p][f0] = in[p][f0]*W[f0]; }

Algorithm
Use an FFT for pulse compression, resulting in a 4X interpolation of the data in the spatial domain:
in2[p] = ifft(in1[p]);

Algorithm
Calculate the range from every point in the output image to every pulse observation point:
rng[p][y][x] = sqrt((xo[p]-x[x])^2 + (yo[p]-y[y])^2 + zo[p]^2);

Algorithm
Calculate the phase shift from every point in the output image to every pulse observation point:
dphase[p][y][x] = exp(i*4*PI/C*f0[p]*rng[p][y][x]);

Algorithm
Calculate the range bin corresponding to the range of the image point from the observation point:
rbin[p][y][x] = Nbins * (rng[p][y][x] - rstart[p]) * 2 * DF / C;
irbin[p][y][x] = floor(rbin[p][y][x]);

Algorithm
Calculate the linear interpolation weights, since the range will not fall in the center of a bin:
w1[p][y][x] = rbin[p][y][x] - irbin[p][y][x];
w0[p][y][x] = 1 - w1[p][y][x];

Algorithm
Linearly interpolate the return and adjust for the phase change due to propagation time:
out[y][x] += (in2[p][irbin[p][y][x]]*w0[p][y][x] +
              in2[p][irbin[p][y][x]+1]*w1[p][y][x]) * dphase[p][y][x];

Analysis of Code Complexity and Performance
The graphs for the backprojection algorithm, while much simpler than the corresponding C code, are relatively complex compared with the data movement they express, and that complexity is compounded by the two sources of complexity. There is great benefit to using symbolic expressions to replace block diagrams as the input; the comparison is shown in the example on the next chart.

Comparison of Symbolic Expression and Block Diagram
w1[p][y][x] = rbin[p][y][x] - irbin[p][y][x];
w0[p][y][x] = 1 - w1[p][y][x];
out[y][x] += (in2[p][irbin[p][y][x]]*w0[p][y][x] +
              in2[p][irbin[p][y][x]+1]*w1[p][y][x]) * dphase[p][y][x];
[The equivalent block diagram was not preserved in the transcript.]