Published by Florence Melton. Modified over 9 years ago.
1
Copyright 2004 SRC Computers, Inc. ALL RIGHTS RESERVED www.srccomputers.com
Application Development on the SRC Computers, Inc. Systems
Jeff Hammes hammes@srccomp.com
Dan Poznanovic poz@srccomp.com
SRC Computers, Inc.
July 12, 2005
2
Outline
SRC Hardware Architecture
SRC Programming Environment
Example Applications
General FPGA Structure
Compiling to Hardware
– Demos
Streams
– Demos
3
IMPLICIT+EXPLICIT™ Architecture
SRC's explicitly controlled processor is called MAP®
Implicitly Controlled Device
– Dense logic device
– Higher clock rates
– Typically fixed logic
– µP, DSP, ASIC, etc.
Explicitly Controlled Device
– Direct execution logic
– Lower clock rates
– Typically reconfigurable
– FPGA, CPLD, OPLD, etc.
[Diagram labels: Fortran; C; Carte™ Programming Environment; Unified Executable; Implicit Device; Explicit Device; Memory; Memory Control; I/O Bridge]
4
MAP® Implementation
Direct Execution Logic (DEL) made up of one or more User Logic devices
Control circuits allow explicit control of memory prefetch and data access
Multiple banks of On-Board Memory maximize local memory bandwidth
GPIO ports allow direct MAP to MAP chain connections or direct data input
Multiple DMA engines support
– Distributed SRAM in User Logic: 264 KB @ 844 GB/s
– Block SRAM in User Logic: 648 KB @ 260 GB/s
– On-Board SRAM: 28 MB @ 9.6 GB/s
– Microprocessor Memory: 8 GB @ 1400 MB/s
[Diagram labels: Six Banks Dual-ported On-Board Memory (24 MB); 4800 MB/s (6 x 64b); 192b; GPIO, 2400 MB/s each; Controller XC2V6000; User Logic 1 XC2V6000; User Logic 2 XC2V6000; 108b; 1400 MB/s sustained payload, in and out of the MAP; Dual-ported Memory (4 MB)]
5
SRC MAPstation™
SRC-6 uses standard external network connections
[Diagram labels: MAPstation Configurations; Tower; 2U; Single MAP Workstation; Portable; MAPstation; µP; Memory; SNAP™; MAP; GPIO Ports; PCI-X; Disk; Storage Area Network; Local Area Network; Wide Area Network]
6
SRC Cluster Based Systems
Utilizes standard clustering technology – system size limited only by clustering technology
[Diagram labels: SRC-6; four MAPstation™ nodes, each with µP, Memory, SNAP™, MAP®, GPIO Port and PCI-X; Gig Ethernet etc.; Disk; Storage Area Network; Local Area Network; Wide Area Network]
7
SRC Hi-Bar™ Based Systems
Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
Up to 256 input and 256 output ports with two tiers of switch
Common Memory (CM) has controller with DMA capability
Controller can perform other functions such as scatter/gather
Up to 8 GB DDR SDRAM supported per CM node
[Diagram labels: SRC-6; SRC Hi-Bar Switch; MAP®; Common Memory; Chaining GPIO; µP; Memory; SNAP™; PCI-X; Gig Ethernet etc.; Customers' Existing Networks; Disk; Storage Area Network; Local Area Network; Wide Area Network]
8
SRC MAPstation™ with Hi-Bar™
MAPstation towers hold up to 3 MAP or memory nodes
[Diagram labels: MAPstation Tower; MAPstation with 2 MAPs and Common Memory; PCI-X/EXP; µP; Memory; SNAP™; SRC Hi-Bar™ Switch; MAP; MEMORY; GPIO Ports; Disk; Storage Area Network; Local Area Network; Wide Area Network]
9
System Solutions
Technology Building Blocks: MAP® Processor, SNAP™ Interface, Common Memory, Hi-Bar™ Switch, Carte™ Environment
Embedded Solutions, Server Solutions, Workstation Solutions
10
Unified Applications Environment
[Diagram labels: Universe of all Software; C or Fortran Source Code Written Specifically for MAP®; Carte™ Programming Environment; µP Compiler Tools; MAP Compiler Tools; Unified Executable; Unified Program Execution; µP; MAP; Explicit DEL; Implicit DLD; Linux Operating System; Standard Network; Standard Peripherals]
11
SRC MAP® Compiler
The Design & Implementation Principles
Support standard programming languages and tools
Infrastructure supports all codes
Functional units are building blocks
Support explicit access to memory hierarchy
Isolate generated logic from H/W dependencies
Compiled code performance competitive with HDL
Today's explicit is tomorrow's compiler optimization
Integration is critical
12
Carte™ Programming Environment
[Diagram labels: HLL Source (Fortran & C); MAP® Compiling System; Parsing; Optimizations; Place & Route; HDL Generation; CFG/DFG Generation; MAP Macros; Customer Macros; Run-time Library; Unified Executable]
13
Compilation and Linking
[Diagram labels: Application Source; µP Compiler; MAP® Compiler; .o files; logic; Place and Route; .bin; Libraries; Linker; Unified Executable]
14
Support for Development Process
Application Debug Environment
– Fast compile times
– Familiar debugger environment
– Debugging includes data movement to/from MAP®
– Does not require MAP hardware to run
– Fast enough to test a full application
Execution on MAP Hardware
– Compilation for MAP on the order of 10s of minutes
– Results match debug and simulation
Easy integration of Existing Direct Execution Logic
– Custom functional unit development supported
– Functional unit simulation environment supported
15
Some Applications
16
R/T Spectrum Analyzer
[Diagram labels: Sensor data, 2 G-samples/s; Find 32 peak frequencies; R/T Display 32 Peak Frequencies]
17
R/T Spectrum Analyzer Performance
Compute          FFT Time   Speedup
Microprocessor   487 S*     1x
MAP              1.6 S      304x
Sources of Performance
C with optimized functional units
Pipelined loops
Parallel code blocks
Streams extend pipelines
18
R/T Video – Edge Detect
[Diagram labels: 4 VGA cameras; Buffers; Median Filter; Prewitt Edge Detect; R/T; Four images on monitor; 54 MPixels/S (120 FPS)]
19
180° Edge Detection Performance
Compute          Speedup
Microprocessor   1x*
MAP              120x
Sources of Performance
C with optimized functional units
– Data access
Pipelined loops
Parallel code blocks
Streams to fuse loops
– extend pipelines
20
R/T Video – Target Recognition
[Diagram labels: VGA camera, 30 FPS; Probe Code; Probe Description; Target Recognition; Image on monitor, 30 FPS]
21
R/T Video – Target Recognition Performance
Compute          Time/Image   Speedup
Microprocessor   2.5 S        1x
MAP              .007 S       357x
Sources of Performance
C with optimized functional units
– Data access
Pipelined loops
Parallel functional units
22
Backhoe Challenge Problem
[Figure labels: CAD Model; VisualD.x; Sparse Aperture 3-D challenge; Polarization Challenge (Cross Pol, Vertical Pol, Horizontal Pol); Reduced BW 2-D challenge]
Public Release Numbers: ASC 04-0273 and ASC 04-0990
23
SAR Backprojection Performance
2D Data – 5040 scans of 512 rows
Compute        File Read   Image_2D   Speedup   Total
MATLAB         4.7         4920       0.24x     4924
C (µP)         4.5         1157       1x        1162
MATLAB – MAP   5           27.5       36x       32.5
C – MAP        4.6         27.5       36x       32.1
Two MAPs       5           3.2        361x      8.2
Three MAPs     5           2.2        525x      7.2
All times are in seconds
24
Now Let's Look Deeper
25
Reconfigurable Computing
Compile C or Fortran codes to reconfigurable hardware instead of microprocessor instructions
Performance advantage from extensive parallelism
26
Generic FPGA Structure
[Diagram labels: switch block; I/O block; function block (CLB); interconnect]
27
Generic FPGA Structure
Simple view
– Little "islands" of logic
– Grid of interconnecting wires
Logic functions and connections can be reconfigured
"Bitstream" is a complete configuration for the chip
Configuration takes ~50 msec (Xilinx v6000)
28
Logic in FPGA
Example – 32b Integer Add
Described by VHDL, Verilog, Schematic Capture
"Synthesized" to the chip's resources
[Diagram labels: + unit with two 32b inputs, one 32b output and a clock]
29
Dataflow
Dataflow ideas have been around for decades
– Ex: Manchester Dataflow
– Ex: Monsoon Dataflow
Basic idea: express computation with interconnected function units
A = B + C * D;
E = C + D;
B = A / E;
F = B + E;
[Diagram labels: inputs B, C, D; units *, +, +, /, +; values 'A', 'E', 'B', 'F']
Note concurrent operations
30
Dataflow
[Diagram labels: inputs B, C, D; units *, +, +, /, +; output F]
Dataflow can be implemented in FPGA Logic
31
Pipelining of Functional Units
Fully pipelined functional unit can take new inputs on every clock
Non-pipelined functional unit can take new inputs only after it has finished processing the previous inputs
Pipelining versus non-pipelining is typically a space/performance tradeoff
Latency of a functional unit is the time (in clocks) between input values being applied and result values appearing
32
Time Behavior
Latencies are shown next to function units
If we apply and hold B, C and D values on the inputs of the DFG, we see a result at the bottom 38 clocks later
We want to exploit pipelining to improve performance
To pipeline:
– Use pipelined function units
– Paths need to be balanced in time
[Diagram labels: inputs B, C, D; * 4 c; + 1 c; + 1 c; / 32 c; + 1 c; total 38 c]
33
Time Behavior
Using fully pipelined function units we can apply new inputs on every clock
38 clocks after the first inputs are applied, the first result appears
Thereafter, a new result appears on every clock
[Diagram labels: inputs B, C, D; * 4 c; + 1 c; + 1 c; / 32 c; + 1 c; inserted delays of 4 c, 4 c and 32 c; total 38 c]
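The path balancing described on these two slides can be checked with a small calculation. The following is only a sketch in plain C, not Carte code: the latencies (+ 1 clock, * 4 clocks, / 32 clocks) are the ones labeled on the slide, while the `balance` record, the function names, and the assumption that a non-pipelined graph accepts one input set per full traversal are hypothetical and introduced here for illustration.

```c
#include <assert.h>

/* Latencies in clocks, taken from the labels on the slide's DFG. */
enum { LAT_ADD = 1, LAT_MUL = 4, LAT_DIV = 32 };

/* Hypothetical record of the balanced DFG: inserted delays and total depth. */
struct balance {
    int delay_b;      /* delay on B before the first add */
    int delay_e_div;  /* delay on E before the divide */
    int delay_e_add;  /* further delay on E before the final add */
    int depth;        /* clocks from inputs applied to first result */
};

/* Balance the graph for A = B + C*D; E = C + D; B = A / E; F = B + E. */
static struct balance balance_dfg(void)
{
    struct balance r;
    int t_mul = LAT_MUL;          /* C*D is ready at clock 4 */
    int t_e   = LAT_ADD;          /* E is ready at clock 1 */
    int t_a   = t_mul + LAT_ADD;  /* A is ready at clock 5 */
    int t_div = t_a + LAT_DIV;    /* A / E is ready at clock 37 */

    r.delay_b     = t_mul;        /* B must wait for the product */
    r.delay_e_div = t_a - t_e;    /* E must wait for A */
    r.delay_e_add = t_div - t_a;  /* E, already aligned with A, waits for the quotient */
    r.depth       = t_div + LAT_ADD;
    return r;
}

/* Clocks to produce n results when a new input set enters on every clock. */
static long pipelined_clocks(long n, int depth)
{
    return depth + (n - 1);
}

/* Assumed model: without pipelining, each input set is held for the full depth. */
static long unpipelined_clocks(long n, int depth)
{
    return n * depth;
}
```

Under these assumptions `balance_dfg()` reproduces the slide's delays of 4, 4 and 32 clocks and the 38-clock depth, and 1000 input sets take 1037 clocks pipelined versus 38000 when each set is held until its result appears.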
34
Pipelined Loop
DFG comes naturally from the loop body
Compiler then inserts delays to balance the paths
Compiler adds function units to:
– Generate a stream of indices
– Detect loop termination
for (i=0; i<n; i++)
    A[i] = B[i] + C[i] * 17;
[Diagram labels: i; A base addr; B base addr; C base addr; +; +; +; Load; Load; * 17; +; Store]
35
Conditionals
for (i=0; i<n; i++) {
    a = B[i] + C[i];
    if (a > 42)
        v = a;
    else
        v = -a;
    A[i] = v * i;
}
[Diagram labels: i; A base addr; B base addr; C base addr; +; +; +; Load; Load; +; 42; <; neg; selector; *; Store]
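The diagram for this slide shows a comparison against 42, a negate unit and a selector, which suggests that both sides of the `if` are computed on every iteration and the condition only picks one of them. The plain C sketch below mirrors that reading; the `selector` helper and the function name `conditional_loop` are hypothetical names for illustration, not part of the Carte library.

```c
#include <assert.h>

/* Models the selector unit: both candidate values already exist,
 * and the condition only chooses which one is passed on. */
static int selector(int cond, int if_true, int if_false)
{
    return cond ? if_true : if_false;
}

/* Same result as the slide's loop, written without a branch in the body. */
static void conditional_loop(int *A, const int *B, const int *C, int n)
{
    for (int i = 0; i < n; i++) {
        int a    = B[i] + C[i];
        int neg  = -a;      /* computed on every iteration */
        int cond = 42 < a;  /* same test as a > 42 */
        int v    = selector(cond, a, neg);
        A[i] = v * i;
    }
}
```

Written this way the loop body is a straight-line dataflow graph, so it can accept a new iteration on every clock just like the unconditional loop on the previous slide.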
36
Demo
37
MAP Programmers View
[Diagram labels: OBM A, OBM B, OBM C, OBM D, OBM E, OBM F; 4800 MB/s (6 x 64b); 192b; GPIO, 2400 MB/s each; Control FPGA; User FPGA 0; User FPGA 1; 108b; 1400 MB/s sustained payload, in and out of the MAP; Dual-ported Memory (4 MB)]
38
Demo
39
Performance Optimizations
Pipelined Loops
– All function units within loop are computing at every clock
Parallel Code Sections
– Multiple parallel code blocks are active in parallel
– All function units within loop are computing at every clock
Streams
– Multiple serial code blocks are active in parallel
– All function units within loop are computing at every clock
Multiple FPGAs
– Logic in both FPGAs can be computing in parallel
40
What is Streams?
Streams and Conventional Data Flow
[Diagram labels, Conventional Data Flow: On-Board Memory or BRAM; Compute Loop 1; On-Board Memory or BRAM; Compute Loop 2; On-Board Memory or BRAM]
[Diagram labels, Streams: On-Board Memory or BRAM; Compute Loop 1; Streams; Compute Loop 2; On-Board Memory or BRAM; Time]
Saves access to On-Board Memory
Data is flowing in the logic
41
Serial Computational Loops
[Animation, slides 41 to 52, Step 0 to Step 11: values v0 to v4 move from OBM / BRAM through the first Compute Loop into OBM / BRAM, then through the second Compute Loop into OBM / BRAM.]
Computational Time: 2N + latency CL1 + latency CL2
53
Streams Computational Loops
[Animation, slides 53 to 59, Step 0 to Step 6: values v0 to v4 move from OBM / BRAM through the first Compute Loop, across a FIFO, and through the second Compute Loop.]
Computational Time: N + latency CL1 + latency CL2
60
Execution Comparison
[Animation, slides 60 to 66, Step 0 to Step 6: the serial and the streamed arrangement of two Compute Loops shown side by side, each processing values v0 to v4.]
Computational Improvement: N * (NLoops – 1)
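The two timing formulas from the preceding slides and the improvement figure quoted here can be tied together with a short calculation. This is a sketch only: the function names are hypothetical, and extending the two-loop formulas to an arbitrary number of loops is an assumption made to match the N * (NLoops – 1) expression.

```c
#include <assert.h>

/* Serial loops: each loop handles all n items before the next loop starts,
 * so every loop contributes n clocks plus its own latency. */
static long serial_clocks(long n, const int *latency, int nloops)
{
    long total = 0;
    for (int k = 0; k < nloops; k++)
        total += n + latency[k];
    return total;
}

/* Streamed loops: the loops are chained, so the n items flow through once
 * and only the latencies of the individual loops add up. */
static long streamed_clocks(long n, const int *latency, int nloops)
{
    long total = n;
    for (int k = 0; k < nloops; k++)
        total += latency[k];
    return total;
}
```

For two loops this gives 2N + latency CL1 + latency CL2 against N + latency CL1 + latency CL2, and the difference is N * (NLoops – 1) regardless of the latencies.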
67
Stream Example: Loop Slowdown
Loop2 has a loop slowdown to every other clock
[Animation, slides 67 to 78, Step 1 to Step 11: Compute Loop1 streams values v0 to v4 to Compute Loop2, which alternates between "Ready for input" and "Not ready for input".]
Computational Time: based upon the loop with the slowest firing interval
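A first-order estimate of this effect can be written down as follows. This is a rough model under stated assumptions, not a cycle-accurate account of the animation: `interval[k]` is a hypothetical parameter giving the number of clocks between successive inputs that loop k can accept (1 for a fully pipelined loop, 2 for the slowed-down Loop2), and FIFO depth and startup effects are ignored.

```c
#include <assert.h>

/* Estimated clocks for n items flowing through a chain of streamed loops.
 * The slowest firing interval sets the rate for the whole chain; the
 * latencies of the individual loops are added once. */
static long streamed_clocks_with_slowdown(long n, const int *latency,
                                          const int *interval, int nloops)
{
    int slowest = 1;
    long total_latency = 0;
    for (int k = 0; k < nloops; k++) {
        if (interval[k] > slowest)
            slowest = interval[k];
        total_latency += latency[k];
    }
    return n * slowest + total_latency;
}
```

With every interval equal to 1 the estimate reduces to N plus the loop latencies, as on the earlier streams slide; a single loop that fires every other clock roughly doubles the N term for the whole chain.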
79
Streams Example: Multiple Producers
Loop2 has a loop slowdown to every other clock
[Animation, slides 79 to 90, Step 1 to Step 11: Compute Loop1, Compute Loop2 and Compute Loop3 pass values v0 to v4 over streams.]
91
Demo
92
Concluding Remarks
It is possible to program RC in C (FORTRAN)
Performance comes from:
– Attention to data movement
– Pipelining of loops
– Pipeline extension via streams
– Concurrent data movement & compute
– Optimized functional units
Programming must include the whole system
Integration is REALLY important
93
SRC Computers, Inc.
94
Stencil Code
Median Filter & Prewitt Edge Detect (The Booth Demo)
95
Median Filter & Edge Detector (The Visual Results)
96
Pipelining Stencil Code
9 points in stencil. 8 have been seen before. The leading point should be the only data access.
Compute Process: move a window through the image
[Animation, slides 96 to 100: Data access input(i); window registers x1 to x9; Compute f(x1, x2, .. x9); Data Storage f(x)]
101
Pipelining Stencil Code
9 points in stencil. 8 have been seen before. The leading point should be the only data access.
Compute Process: move a window through the image
[Diagram labels: window registers x1 to x9; Compute f(x1, x2, .. x9); Data Storage f(x)]
F1(x1,….,x9): median(x1,…,x9)
F2(x1,….,x9):
    hz = (x0 + x1 + x2) – (x7 + x8 + x9);
    vt = (x0 + x4 + x7) – (x3 + x6 + x9);
    *w0=fsqrt(hz*hz+vt*vt);
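The two stencil functions named on this slide can be sketched in plain C as below. Several things here are assumptions rather than the authors' code: the window is stored as `w[0..8]` in row-major order (the slide uses its own x-numbering), the median uses a simple insertion sort instead of an optimized functional unit, and an integer square root stands in for the slide's `fsqrt` so the sketch needs no math library.

```c
#include <assert.h>

/* Median of a 3x3 window, found by insertion-sorting a copy of the 9 values. */
static int median9(const int w[9])
{
    int s[9];
    for (int i = 0; i < 9; i++) {
        int j = i;
        while (j > 0 && s[j - 1] > w[i]) {  /* shift larger values right */
            s[j] = s[j - 1];
            j--;
        }
        s[j] = w[i];
    }
    return s[4];
}

/* Largest r with r*r <= v; a stand-in for the slide's fsqrt. */
static unsigned isqrt(unsigned v)
{
    unsigned r = 0;
    while ((r + 1) * (r + 1) <= v)
        r++;
    return r;
}

/* Prewitt gradient magnitude on a row-major 3x3 window (assumed layout):
 * hz is left column minus right column, vt is top row minus bottom row. */
static unsigned prewitt(const int w[9])
{
    int hz = (w[0] + w[3] + w[6]) - (w[2] + w[5] + w[8]);
    int vt = (w[0] + w[1] + w[2]) - (w[6] + w[7] + w[8]);
    return isqrt((unsigned)(hz * hz + vt * vt));
}
```

Both functions read only the nine window values, which is what lets the hardware version feed them from registers instead of issuing nine memory accesses per pixel.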
102
Pipelined Stencil Code
Performance through:
– Streaming DMA
– Specialized Data Access functional units
– Pipelined Logic
– Parallel Code Blocks
– Optimized application specific functional units
103
Stencil Data Flow
[Diagram labels: Data access input(x); 9 Registers x1 to x9; 16-unit Shift Register, remembers previous row; Compute f(x1, x2, .. x9); Data access f(x); Data Storage f(x)]
104
Copyright 2004 SRC Computers, Inc. ALL RIGHTS RESERVED www.srccomputers.com Stencil Data Flow Data access input(i) i0 Output(i) i0 Compute f(x 1,x 2,..x 9 )
105
Copyright 2004 SRC Computers, Inc. ALL RIGHTS RESERVED www.srccomputers.com Stencil Data Flow Data access input(i) i1i0 i0i1 Output(i) Compute f(x 1,x 2,..x 9 )
106
Copyright 2004 SRC Computers, Inc. ALL RIGHTS RESERVED www.srccomputers.com Stencil Data Flow Data access input(i) i2i0i1 i0i1i2 Output(i) Compute f(x 1,x 2,..x 9 )
107
Stencil Data Flow
[Animation frames, continuing through slide 120. Stages: Data access, Compute f(x1, x2, .., x9), Output(i). input(i) advances one element per frame through two 15-element row buffers into a 3x3 window. The illustration uses 16 elements per row: when i34 arrives the window holds i0 i1 i2 / i16 i17 i18 / i32 i33 i34, and in the final frame it holds i221 i222 i223 / i237 i238 i239 / i253 i254 i255.]
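The data access pattern in the frames above can be modelled in plain C. This is only a sketch under stated assumptions: the row width of 16 is taken from the illustration, and `window3x3_t`, `window_init` and `window_push` are hypothetical names for this example, not Carte routines.

```c
#include <stdint.h>
#include <string.h>

#define ROW_WIDTH 16  /* elements per row, as in the 16-wide illustration */

/* Hypothetical software model of the stencil data path: two row-long
 * delay lines feed a 3x3 window that shifts by one element per step. */
typedef struct {
    uint8_t row1[ROW_WIDTH];  /* holds the input delayed by one row  */
    uint8_t row2[ROW_WIDTH];  /* holds the input delayed by two rows */
    uint8_t win[3][3];        /* win[0] = oldest row, win[2] = newest row */
    int pos;                  /* current slot in both delay lines */
} window3x3_t;

static void window_init(window3x3_t *w)
{
    memset(w, 0, sizeof *w);
}

/* Push one element; afterwards column 2 of the window is the newest. */
static void window_push(window3x3_t *w, uint8_t px)
{
    uint8_t one_row_ago  = w->row1[w->pos];
    uint8_t two_rows_ago = w->row2[w->pos];

    /* Advance the delay lines. */
    w->row2[w->pos] = one_row_ago;
    w->row1[w->pos] = px;
    w->pos = (w->pos + 1) % ROW_WIDTH;

    /* Shift the window left by one column and load the new column. */
    for (int r = 0; r < 3; r++) {
        w->win[r][0] = w->win[r][1];
        w->win[r][1] = w->win[r][2];
    }
    w->win[0][2] = two_rows_ago;
    w->win[1][2] = one_row_ago;
    w->win[2][2] = px;
}
```

Pushing the values 0 through 34 in order leaves the window holding 0 1 2 / 16 17 18 / 32 33 34, the same contents as the frame in which i34 arrives. Each input element is read once, and the delay lines supply the two earlier rows.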
121
Code Structure
2 x 8-byte streams in
2 x 8-byte streams out
2 parallel code blocks
16 parallel medians
16 parallel Prewitts
2 connecting streams
122
Med_edge MAP Routine
void med_edge (uint64_t In[], int HEIGHT, int WIDTH, uint64_t Out[], int mapno)
{
#pragma src parallel sections
  {
    /* Define the input & output streams */
    Streams_64 S0, S1, S2, S3, S4, S5, S6;
#pragma src section
    {
      /* Stream the input image in from CPU memory */
      stream_dma_cpu_dual (&S0, &S1, PORT_TO_STREAM, AL, DMA_A_B, In, 1, HEIGHT*WIDTH);
    }
#pragma src section
    {
      /* Stream the result image out to CPU memory */
      stream_dma_cpu_dual (&S5, &S6, STREAM_TO_PORT, CL, DMA_C_D, Out, 1, HEIGHT*WIDTH);
    }
123
Stream Data In
#pragma src section
    {
      for (j=0; j<HEIGHT*WIDTH/8/2; j++) {
        get_stream (&S0, &valuea);
        get_stream (&S1, &valueb);
        a0a = valuea;
        delay_queue_64_var (a0a, WORDS_PER_ROW/2, &a1a);
        delay_queue_64_var (a1a, WORDS_PER_ROW/2, &a2a);
        a0b = valueb;
        delay_queue_64_var (a0b, WORDS_PER_ROW/2, &a1b);
        delay_queue_64_var (a1b, WORDS_PER_ROW/2, &a2b);
125
Median Filter
        /* Move the window */
        x017 = x01; x016 = x00; x117 = x11;
        x116 = x10; x217 = x21; x216 = x20;
        split_64to8 (a0a, &x08, ..., &x015);
        split_64to8 (a1a, &x18, ..., &x115);
        split_64to8 (a2a, &x28, ..., &x215);
        split_64to8 (a0b, &x00, ..., &x07);
        split_64to8 (a1b, &x10, ..., &x17);
        split_64to8 (a2b, &x20, ..., &x27);
        /* 16 parallel median filter calculations on 16 3x3 windows */
        median_8_9 (x01, x02, x03, x11, x12, x13, x21, x22, x23, &v1);
        .......
        median_8_9 (x015, x016, x017, x115, x116, x117, x215, x216, x217, &v15);
        /* Combine 16 results and send to Prewitt calc */
        comb_8to64 (v8, v9, v10, v11, v12, v13, v14, v15, &vcomba);
        comb_8to64 (v0, v1, v2, v3, v4, v5, v6, v7, &vcombb);
        put_stream (&S3, vcomba, 1);
        put_stream (&S4, vcombb, 1);
      }
    }
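For readers without the MAP macro library, the unpack, median and repack steps on this slide can be approximated in plain C. The sketch below is an assumption-laden stand-in, not the Carte implementation: `unpack8`, `pack8` and `median9` are hypothetical names, the byte order (element 0 in the most significant byte) is a guess, and the median uses a simple insertion sort, whereas the hardware macro presumably uses a parallel comparison network.

```c
#include <stdint.h>

/* Hypothetical stand-in for split_64to8: unpack a 64-bit word into 8 pixels.
 * Assumption: out[0] comes from the most significant byte. */
static void unpack8(uint64_t word, uint8_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(word >> (56 - 8 * i));
}

/* Hypothetical stand-in for comb_8to64: pack 8 pixels into a 64-bit word,
 * using the same byte order as unpack8. */
static uint64_t pack8(const uint8_t in[8])
{
    uint64_t word = 0;
    for (int i = 0; i < 8; i++)
        word = (word << 8) | in[i];
    return word;
}

/* Median of nine 8-bit values: insertion-sort into a local copy and
 * return the middle element. */
static uint8_t median9(const uint8_t v[9])
{
    uint8_t s[9];
    for (int i = 0; i < 9; i++) {
        int j = i;
        while (j > 0 && s[j - 1] > v[i]) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = v[i];
    }
    return s[4];
}
```

A single outlier in a flat 3x3 neighbourhood is replaced by the surrounding value, which is why the median stage runs before edge detection. On the MAP, the 16 median computations have no data dependencies on each other and so can all execute in the same clock cycle, whereas this software model computes them one at a time.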
127
Prewitt Calculation
#pragma src section
    {
      for (j=0; j<HEIGHT*WIDTH/8/2; j++) {
        get_stream (&S3, &zcomba);
        get_stream (&S4, &zcombb);
        delay_queue_64_var (zcomba, WORDS_PER_ROW/2, &e1a);
        delay_queue_64_var (e1a, WORDS_PER_ROW/2, &e2a);
        delay_queue_64_var (zcombb, WORDS_PER_ROW/2, &e1b);
        delay_queue_64_var (e1b, WORDS_PER_ROW/2, &e2b);
        y017 = y01; y016 = y00;
        y117 = y11; y116 = y10;
        y217 = y21; y216 = y20;
        split_64to8 (zcomba, &y08, ..., &y015);
        split_64to8 (e1a, &y18, ..., &y115);
        split_64to8 (e2a, &y28, ..., &y215);
        split_64to8 (zcombb, &y00, ..., &y07);
        split_64to8 (e1b, &y10, ..., &y17);
        split_64to8 (e2b, &y20, ..., &y27);
128
Prewitt Edge Detect
        /* The prewitt routine is inlined and thus is actually
         * 16 parallel computations, again since there is no
         * data dependency among the 16. */
        prewitt (y00, y01, y02, y20, y21, y22, y10, y12, thr, &w0);
        prewitt (y01, y02, y03, y21, y22, y23, y11, y13, thr, &w1);
        ........
        prewitt (y014, y015, y016, y214, ..., thr, &w14);
        prewitt (y015, y016, y017, y215, ..., thr, &w15);
        /* Collect all of the 16 results into 2 packed 64-bit words */
        comb_8to64 (w8, w9, w10, w11, w12, w13, w14, w15, &wcomba);
        comb_8to64 (w0, w1, w2, w3, w4, w5, w6, w7, &wcombb);
        /* Send the results out to the CPU memory */
        put_stream (&S5, wcomba, 1);
        put_stream (&S6, wcombb, 1);
      }
    }
129
void prewitt (unsigned char b00, .., b02,
              unsigned char b20, .., b22,
              unsigned char b10, .., b12,
              int thr,
              unsigned char *w0)
{
  hz = (b00 + b01 + b02) - (b20 + b21 + b22);
  vt = (b00 + b10 + b20) - (b02 + b12 + b22);
  *w0 = fsqrt (hz*hz + vt*vt);
  if (*w0 > thr) *w0 = 0;
  else *w0 = 255;
}
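The slide elides the parameter list and the local declarations, so the following is a hedged plain-C sketch of the same thresholded Prewitt test rather than the original routine. `prewitt_sw` is a hypothetical name, the eight-pixel parameter list is inferred from the call sites on the previous slide, and two details differ from the slide deliberately: the comparison is done on squared magnitudes (equivalent for a non-negative threshold, and it avoids the square root), and it happens before any narrowing to 8 bits.

```c
#include <stdint.h>

/* Hypothetical plain-C version of the slide's prewitt().
 * b<row><col> names the 3x3 neighbourhood; the centre pixel b11 is unused.
 * Returns 0 where the gradient magnitude exceeds thr, 255 elsewhere,
 * matching the polarity on the slide. */
static uint8_t prewitt_sw(uint8_t b00, uint8_t b01, uint8_t b02,
                          uint8_t b10,              uint8_t b12,
                          uint8_t b20, uint8_t b21, uint8_t b22,
                          int thr)
{
    /* Horizontal and vertical gradient estimates. */
    int hz = (b00 + b01 + b02) - (b20 + b21 + b22);
    int vt = (b00 + b10 + b20) - (b02 + b12 + b22);

    /* A magnitude is never negative, so it always exceeds a negative thr. */
    if (thr < 0)
        return 0;

    /* Compare squared values: sqrt(m2) > thr  <=>  m2 > thr*thr for thr >= 0. */
    long long m2 = (long long)hz * hz + (long long)vt * vt;
    long long t2 = (long long)thr * thr;
    return (m2 > t2) ? 0 : 255;
}
```

For example, a flat neighbourhood gives hz = vt = 0 and returns 255, while a neighbourhood whose top row is 200 and bottom row is 0 gives hz = 600 and returns 0 for a threshold of 50.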
130
180° Edge Detection
[Diagram labels: PCI-X, MAP, P (x4), MEMORY, SNAP, GPIO, To Network, D/A. Captions: Buffers 4 VGA cameras; Four images on monitor, 54 MPixels/s (120 FPS).]
MAP performs 4 simultaneous median filter, Prewitt edge detector and thresholding operations
All computation, storage and image generation are performed in MAP
Utilizes just 12% of MAP's gates and 1.5% of one GPIO port BW
131
Code Structure – 2 Chips
2 x 8-byte streams in
2 x 8-byte streams out
2 parallel code blocks
16 parallel medians
16 parallel Prewitts
2 connecting streams
Convert to 2 chips:
1. Define two bridge streams
2. Split parallel code blocks into parallel routines
132
Median Filter and Edge Detection Algorithm Data Flow Graph
Full loop runs over 400 operations per clock cycle.