1
Cell BE Basic Programming Concepts
Cell Programming Workshop Cell/Quasar Ecosystem & Solutions Enablement Cell Programming Workshop 9/17/2018
2
Course Objectives
- Cell BE runtime environment
- Understand the basic differences between PPE and SPEs
- PPE and SPE address space
- PPE programming basic concepts
- SPE programming basic concepts
Trademarks: Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc.
3
Course Agenda
- Cell BE runtime environment
  - Kernel support and Linux vs. SPE threads
- PPE and SPE comparison
  - Architectural differences
  - Language extension differences
  - MFC command differences
  - PPE and SPE communication
  - PPE and SPE storage domains
- PPE programming
  - Registers layout, instruction sets, instruction types, PowerPC compatibility, etc.
  - PPE VMX instructions and language extensions
- SPE programming
  - Registers layout, instruction sets, instruction types
  - Floating point operations
  - Local store access arbitration priority
  - Pipeline and dual-issue rules
  - SPU intrinsics
4
Cell BE Runtime Environment
5
Linux Kernel Support
- PPE runs PowerPC applications and operating systems
- PPE handles thread allocation and resource management among SPEs
- PPE’s Linux kernel controls the SPUs’ execution of programs
  - Schedules SPE execution independently of regular Linux threads
  - Responsible for runtime loading, passing parameters to SPE programs, notification of SPE events and errors, and debugger support
- PPE’s Linux kernel manages virtual memory, including mapping each SPE’s local store (LS) and problem state (PS) into the effective-address space
- The kernel also controls virtual-memory mapping of MFC resources, as well as MFC segment-fault and page-fault handling
- Large pages (16-MB pages, using the hugetlbfs Linux extension) are supported
6
Threads and Tasks
- The main thread of a program is a Linux thread running on the PPE
  - It can spawn one or more Cell BE Linux tasks
- A Cell BE Linux task has one or more Linux threads associated with it, along with some number of SPE threads
- An SPE thread is a thread that is spawned to run on an available SPE
- A Linux thread can interact directly with an SPE thread through the SPE’s local store or indirectly through effective-address (EA) memory
- A thread can poll or sleep, waiting for SPE threads, using the spe_get_event() or spe_wait() SPE Runtime Management subroutines
- SPE threads follow the M:N thread model
  - M threads distributed over N processor elements
- SPE threads run to completion
  - The SDK Linux kernel supports a run-to-completion model, except for certain preemptive debugging services
7
PPE vs SPE
- Both PPE and SPE execute SIMD instructions
  - PPE processes SIMD operations in the VXU within its PPU
  - SPEs process SIMD operations in their SPU
- The two processor types execute different instruction sets
  - Programs written for the PPE and SPEs must be compiled by different compilers
8
PPE and SPE Architectural Differences
9
Communication Between the PPE and SPEs
- PPE communicates with SPEs through MMIO registers supported by the MFC of each SPE
- Three primary communication mechanisms between the PPE and SPEs
  - Mailboxes
    - Queues for exchanging 32-bit messages
    - Two mailboxes (the SPU Write Outbound Mailbox and the SPU Write Outbound Interrupt Mailbox) are provided for sending messages from the SPE to the PPE
    - One mailbox (the SPU Read Inbound Mailbox) is provided for sending messages to the SPE
  - Signal notification registers
    - Each SPE has two 32-bit signal-notification registers; each has a corresponding memory-mapped I/O (MMIO) register into which the signal-notification data is written by the sending processor
    - Signal-notification channels, or signals, are inbound (to an SPE) registers
    - They can be used by other SPEs, the PPE, or other devices to send information, such as a buffer-completion synchronization flag, to an SPE
  - DMAs
    - To transfer data between main storage and the LS
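The mailbox behaviour described above (a queue of 32-bit messages that the sender fills and the receiver drains) can be sketched in portable C. This is only a software model for illustration, not the MFC hardware interface or any SDK API: the type `mailbox_t`, the functions `mbox_write` and `mbox_read`, and the depth `MBOX_DEPTH` are all hypothetical names and values chosen here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a mailbox: a small FIFO of 32-bit messages.
   MBOX_DEPTH is an assumption chosen for illustration only. */
#define MBOX_DEPTH 4

typedef struct {
    uint32_t slots[MBOX_DEPTH];
    size_t head;   /* index of the oldest queued message */
    size_t count;  /* number of messages currently queued */
} mailbox_t;

/* Queue one message. Returns false, changing nothing, if the mailbox is full. */
static bool mbox_write(mailbox_t *mb, uint32_t msg)
{
    if (mb->count == MBOX_DEPTH)
        return false;
    mb->slots[(mb->head + mb->count) % MBOX_DEPTH] = msg;
    mb->count++;
    return true;
}

/* Dequeue the oldest message into *out. Returns false if the mailbox is empty. */
static bool mbox_read(mailbox_t *mb, uint32_t *out)
{
    if (mb->count == 0)
        return false;
    *out = mb->slots[mb->head];
    mb->head = (mb->head + 1) % MBOX_DEPTH;
    mb->count--;
    return true;
}
```

In this model a full mailbox simply rejects the write; a real sender would instead poll or block until an entry becomes free.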
10
PPE and SPEs Storage Domains
- Three types of storage domains
  - Main-storage domain
  - 8 SPE local-store domains
  - 8 SPE channel domains
- The main-storage domain, which is the entire effective-address space, can be configured by the PPE operating system to be shared by all processors and memory-mapped devices in the system (all I/O is memory-mapped)
- Local-storage and channel problem-state (user-state) domains are private to the SPU, LS, and MFC of each SPE
11
SPE Local Store Domain
- 256-KB, ECC-protected, single-ported, non-caching memory
- Stores all instructions and data used by the SPU
- Supports one access per cycle from either SPE software or DMA transfers
  - SPU instruction prefetches are 128 bytes per cycle
  - SPU data-access bandwidth is 16 bytes per cycle, quadword aligned
  - DMA-access bandwidth is 128 bytes per cycle
  - DMA transfers perform a read-modify-write of LS for writes less than a quadword
- An SPU fetches instructions only from its own LS and accesses LS data with load and store instructions; it performs no address translation for such accesses
- With respect to accesses by its SPU, the LS is unprotected and untranslated storage
- An SPE program references its own LS using a Local Store Address (LSA)
12
SPE Addressing and Address Aliasing
- The LS of each SPE is also assigned a Real Address (RA) range within the system's memory map. This allows privileged software to map LS areas into the effective-address (EA) space, where the PPE, other SPEs, and other devices that generate EAs can access the LS
- An SPU’s LS can also be accessed by the PPE and other devices through the main-storage space
  - PPE accesses main storage with load and store instructions, without the need for DMA transfers
- SPEs must use DMA transfers to access an LS that is aliased in main storage
  - An SPU program has no direct (SPU-program addressable) access to main storage
- When aliasing is set up by privileged software on the PPE, the MFC of the SPE whose LS is being accessed performs the address translation
- An SPU has no direct access to system control, such as page-table entries. PPE privileged software provides the SPU with the address-translation information that its MFC needs
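As a rough illustration of LS aliasing, the sketch below computes the effective address at which a given Local Store Address would appear once privileged software has mapped the LS into the EA space. The function `lsa_to_ea` and the idea of a single contiguous base address `ls_base_ea` are assumptions made for this example; the actual mapping is established by the operating system and is not shown in the slides.

```c
#include <stdbool.h>
#include <stdint.h>

#define LS_SIZE (256u * 1024u)  /* 256-KB local store, as stated in the deck */

/* Translate an LSA into its aliased EA, given the (hypothetical) base EA at
   which privileged software mapped this SPE's LS.
   Returns false if the LSA lies outside the 256-KB local store. */
static bool lsa_to_ea(uint64_t ls_base_ea, uint32_t lsa, uint64_t *ea)
{
    if (lsa >= LS_SIZE)
        return false;
    *ea = ls_base_ea + lsa;
    return true;
}
```

A PPE thread could then treat the resulting EA like any other main-storage address, whereas another SPE would still have to reach it through a DMA transfer.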
13
LS Access Methods
[Figure: LS access methods; diagram label "Main storage (effective address space)"]
- DMA requests can be sent to an MFC either by software on its associated SPU or on the PPE, or by any other processing device that has access to the MFC's MMIO problem-state registers
14
Data Transfer Between Main Storage and LS Domain
- An SPE or PPE performs data transfers between the SPE’s LS and main storage primarily using DMA transfers controlled by the MFC DMA controller for that SPE
- Channels
  - Software on the SPE’s SPU interacts with the MFC through channels, which enqueue DMA commands and provide other facilities, such as mailboxes, signal notification, and access to auxiliary resources
- DMA transfer requests contain both an LSA and an EA
  - Thus, they can address both an SPE’s LS and main storage and thereby initiate DMA transfers between the domains
- Each MFC can maintain and process multiple in-progress DMA command requests and DMA transfers
- The MFC can autonomously manage a sequence of DMA transfers in response to a DMA-list command
- Each DMA command is tagged with a 5-bit Tag Group ID. This identifier is used to check or wait on the completion of all queued commands in one or more tag groups
- The MFC supports naturally aligned transfer sizes of 1, 2, 4, or 8 bytes, and multiples of 16 bytes, with a maximum transfer size of 16 KB
- Peak performance can be achieved when both the EA and LSA are 128-byte aligned and the size of the transfer is an even multiple of 128 bytes
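The size, alignment, and tag rules above can be checked before a transfer is queued. The following is a minimal sketch in portable C under stated assumptions: the helper names (`dma_size_ok`, `dma_request_ok`, `dma_peak_friendly`, `tag_mask`) are hypothetical, "naturally aligned" is taken to mean that both addresses are aligned to the transfer size (or to 16 bytes for larger transfers), "even multiple of 128 bytes" is read as an exact multiple, and the one-bit-per-tag-group mask is an assumed convention for selecting several tag groups at once.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u)  /* 16 KB maximum per transfer */
#define DMA_NUM_TAGS 32u            /* 5-bit tag group ID: 0..31 */

/* Valid sizes: 1, 2, 4, or 8 bytes, or a multiple of 16 bytes up to 16 KB. */
static bool dma_size_ok(uint32_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return true;
    return size != 0 && size % 16 == 0 && size <= DMA_MAX_SIZE;
}

/* Simplified request check: valid size, valid tag, and both addresses
   aligned to the transfer size (capped at 16 bytes). */
static bool dma_request_ok(uint32_t lsa, uint64_t ea, uint32_t size, uint32_t tag)
{
    uint32_t align;

    if (!dma_size_ok(size) || tag >= DMA_NUM_TAGS)
        return false;
    align = size < 16 ? size : 16;
    return lsa % align == 0 && ea % align == 0;
}

/* True when the transfer meets the peak-performance conditions:
   128-byte aligned EA and LSA, size a multiple of 128 bytes. */
static bool dma_peak_friendly(uint32_t lsa, uint64_t ea, uint32_t size)
{
    return dma_size_ok(size) && lsa % 128 == 0 && ea % 128 == 0 && size % 128 == 0;
}

/* Assumed convention: one bit per tag group, so several groups can be OR-ed. */
static uint32_t tag_mask(uint32_t tag)
{
    return 1u << (tag & 31u);
}
```

For example, `tag_mask(2) | tag_mask(5)` would select tag groups 2 and 5 together when waiting for completion.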
15
PPE Programming
16
PPE Instruction Sets
- PowerPC instruction set
  - Instructions are 4 bytes long and word-aligned
  - Supports byte, halfword, word, and doubleword operand accesses between storage and its 32 GPRs
  - Supports word and doubleword operand accesses between storage and a set of 32 FPRs
  - Signed integers are represented in twos-complement form
- Vector/SIMD Multimedia Extension instruction set (VMX)
  - Instructions are 4 bytes long and word-aligned, and all of its operands are 128 bits wide
  - Most of the VMX operands are vectors, including single-precision floating-point, integer, scalar, and fixed-point, with vector-element sizes of 8, 16, and 32 bits
17
PPE VMX Instructions
- VMX instructions
  - Are 4 bytes long and word-aligned
  - Support simultaneous execution on multiple elements that make up the 128-bit vector operands
  - Vector elements may be byte, halfword, or word
- The 128-bit Vector/SIMD Multimedia Extension unit (VXU) operates concurrently with the PPU’s fixed-point integer unit (FXU) and floating-point execution unit (FPU)
18
PPE C/C++ Language Extensions (Intrinsics)
- C-language extensions: vector data types and vector commands (intrinsics)
  - Intrinsics: inline assembly-language instructions
- Vector data types: 128-bit vector types
  - Sixteen 8-bit values, signed or unsigned
  - Eight 16-bit values, signed or unsigned
  - Four 32-bit values, signed or unsigned
  - Four single-precision IEEE-754 floating-point values
  - Example: vector signed int is a 128-bit operand containing four 32-bit signed ints
- Vector intrinsics
  - Specific intrinsics: intrinsics that have a one-to-one mapping with a single assembly-language instruction
  - Generic intrinsics: intrinsics that map to one or more assembly-language instructions as a function of the type of input parameters
  - Predicate intrinsics: intrinsics that compare values and return an integer that may be used directly as a value or as a condition for branching
- Note: the VMX intrinsics and predicates use the prefix "vec_" in front of an assembly-language or operation mnemonic
19
Example of a VMX Program
    #include <stdio.h>
    #include <altivec.h>  // VMX intrinsics such as vec_add (GCC also needs -maltivec)

    // Define a type we can look at either as an array of ints or as a vector.
    typedef union {
        int iVals[4];
        vector signed int myVec;
    } vecVar;

    int main()
    {
        vecVar v1, v2, vConst; // define variables

        // load the literal value 2 into the 4 positions in vConst
        vConst.myVec = (vector signed int){2, 2, 2, 2};

        // load 4 values into the 4 elements of vector v1
        v1.myVec = (vector signed int){10, 20, 30, 40};

        // call vector add function (a VMX intrinsic)
        v2.myVec = vec_add( v1.myVec, vConst.myVec );

        // see what we got!
        printf("\nResults:\nv2[0] = %d, v2[1] = %d, v2[2] = %d, v2[3] = %d\n\n",
               v2.iVals[0], v2.iVals[1], v2.iVals[2], v2.iVals[3]);

        return 0;
    }
20
SPE Programming
21
SPE Floating-Point Operations
- SPU executes both single-precision and double-precision floating-point operations
  - Single-precision instructions are performed in 4-way SIMD fashion, fully pipelined
  - Double-precision instructions are partially pipelined
- Data formats for single-precision and double-precision instructions are those defined by IEEE Standard 754, but the results calculated by single-precision instructions are not fully compliant with IEEE Standard 754
22
SPE Local Store
- Holds instructions and data
- Filled with instructions and data using DMA transfers initiated from SPU or PPE software
- DMA operations are buffered and can access the LS at most once every eight cycles
- Instruction prefetches deliver at least 17 instructions sequentially from the branch target
- The impact of DMA operations on SPU loads and stores and on program-execution times is, by design, limited
23
LS Access Arbitration Priority
- Because the LS is single-ported, the SPU arbitrates access to the LS according to the following priorities (highest priority first):
  1. DMA reads and writes by the PPE or an I/O device
  2. SPU loads and stores
  3. Instruction prefetch
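The priority order above amounts to a fixed-priority arbiter. A minimal sketch in portable C follows; the enum `ls_request_t` and the function `ls_arbitrate` are hypothetical names used only to model the ordering, not a description of the hardware implementation.

```c
#include <stdbool.h>

/* Kinds of LS access that may compete for the single port in one cycle. */
typedef enum {
    LS_REQ_NONE,        /* nothing is requesting the port */
    LS_REQ_DMA,         /* DMA read or write (highest priority) */
    LS_REQ_LOAD_STORE,  /* SPU load or store */
    LS_REQ_PREFETCH     /* instruction prefetch (lowest priority) */
} ls_request_t;

/* Pick the winner for this cycle using the fixed priority order
   DMA > SPU load/store > instruction prefetch. */
static ls_request_t ls_arbitrate(bool dma_pending,
                                 bool load_store_pending,
                                 bool prefetch_pending)
{
    if (dma_pending)
        return LS_REQ_DMA;
    if (load_store_pending)
        return LS_REQ_LOAD_STORE;
    if (prefetch_pending)
        return LS_REQ_PREFETCH;
    return LS_REQ_NONE;
}
```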
24
SPE – Pipelines and Dual-Issues Rules
- SPU has two pipelines
  - Even (pipeline 0)
  - Odd (pipeline 1)
- Dual-issue occurs when a fetch group has two issuable instructions in which the first instruction can be executed on the even pipeline and the second instruction can be executed on the odd pipeline
- SPU can issue and complete up to two instructions per cycle, one in each of the pipelines
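The dual-issue condition stated above can be written as a small predicate. This sketch models only the rule given on the slide (both instructions issuable, first on the even pipeline, second on the odd pipeline); the types `pipe_t` and `insn_t` and the functions `can_dual_issue` and `issue_count` are hypothetical, and any further hardware constraints are deliberately left out.

```c
#include <stdbool.h>

typedef enum { PIPE_EVEN = 0, PIPE_ODD = 1 } pipe_t;

typedef struct {
    pipe_t pipe;     /* pipeline on which this instruction executes */
    bool issuable;   /* true if nothing currently blocks its issue */
} insn_t;

/* Dual issue: both instructions issuable, first even, second odd. */
static bool can_dual_issue(insn_t first, insn_t second)
{
    return first.issuable && second.issuable &&
           first.pipe == PIPE_EVEN && second.pipe == PIPE_ODD;
}

/* Instructions issued from the fetch group this cycle under the model:
   2 on dual issue, 1 if only the first can go, 0 otherwise. */
static int issue_count(insn_t first, insn_t second)
{
    if (can_dual_issue(first, second))
        return 2;
    return first.issuable ? 1 : 0;
}
```

Under this model, ordering an even-pipeline instruction ahead of an independent odd-pipeline instruction is what lets the pair issue together.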
25
SPE C/C++ Language Extensions (Intrinsics)
- Three classes of intrinsics
  - Specific intrinsics: one-to-one mapping with a single assembly-language instruction
    - Prefixed by the string si_
    - e.g., si_to_char // Cast byte element 3 of qword to char
  - Generic intrinsics and built-ins: map to one or more assembly-language instructions as a function of the type of input parameters
    - Prefixed by the string spu_
    - e.g., d = spu_add(a, b) // Vector add
  - Composite intrinsics: constructed from a sequence of specific or generic intrinsics
    - e.g., spu_mfcdma32(ls, ea, size, tagid, cmd) // Initiate DMA to or from 32-bit effective address
26
Register Layout of Data Types and Preferred Slot
- When instructions use or produce scalar operands or addresses, the values are in the preferred scalar slot
- The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot
27
Promoting Scalar Data Types to Vector Data Types
- SPU only loads and stores a quadword at a time
- Value of scalar operands (including addresses) is kept in the preferred slot of a SIMD register
- Scalar (sub-quadword) loads and stores require several instructions to format the data for use on the SIMD architecture of the SPE
  - e.g., scalar stores require a read, scalar insert, and write operation
- Strategies to make operations on scalar data more efficient:
  - Change the scalars to quadword vectors to eliminate three extra instructions associated with loading and storing scalars
  - Cluster scalars into groups, and load multiple scalars at a time using a quadword memory access. Manually extract or insert the scalars as needed. This will eliminate redundant loads and stores.
[Table: Intrinsics for Changing Scalar and Vector Data Types]
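The read, scalar insert, and write sequence for a scalar store can be modelled in portable C on a memory that is only accessible a quadword at a time. This is an illustrative model, not SPU code: `quadword_t`, `store_u32`, and `load_u32` are hypothetical names, and the sketch assumes `addr` is 4-byte aligned so the word never straddles two quadwords.

```c
#include <stdint.h>
#include <string.h>

/* A 16-byte quadword, the only unit this model may load or store. */
typedef struct { uint8_t bytes[16]; } quadword_t;

/* Store a 32-bit scalar at byte address addr (assumed 4-byte aligned). */
static void store_u32(quadword_t *mem, uint32_t addr, uint32_t value)
{
    quadword_t q = mem[addr / 16];                      /* 1. read the enclosing quadword */
    memcpy(&q.bytes[addr % 16], &value, sizeof value);  /* 2. insert the scalar */
    mem[addr / 16] = q;                                 /* 3. write the quadword back */
}

/* Load a 32-bit scalar: read the quadword, then extract the word. */
static uint32_t load_u32(const quadword_t *mem, uint32_t addr)
{
    quadword_t q = mem[addr / 16];
    uint32_t value;

    memcpy(&value, &q.bytes[addr % 16], sizeof value);
    return value;
}
```

Keeping data in whole quadwords avoids these extra steps, which is the motivation for the strategies listed above.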
28
Compiler Directives
- __builtin_expect to direct branch prediction
  - e.g., int __builtin_expect (int exp, int value) returns the result of evaluating exp, and means that the programmer expects exp to equal value. The value can be a constant for compile-time prediction, or a variable used for run-time prediction
- aligned attribute to ensure proper DMA alignment for efficient data transfer
  - e.g., float factor __attribute__((aligned (16))); // aligns "factor" to a quadword
- __align_hint to help compilers auto-vectorize
  - e.g., __align_hint (ptr, base, offset) informs the compiler that the pointer, ptr, points to data with a base alignment of base, with a byte offset from the base alignment of offset
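Two of the directives above, `__builtin_expect` and the `aligned` attribute, are also available in general-purpose GCC and Clang, so they can be tried outside the Cell toolchain. The sketch below assumes such a compiler; the names `factors`, `is_quadword_aligned`, and `sum_nonnegative` are made up for the example, and `__align_hint` is omitted because it is specific to the SPU compilers.

```c
#include <stdint.h>

/* Quadword-aligned (16-byte) buffer, as an efficient DMA transfer would want. */
static float factors[4] __attribute__((aligned(16))) = {1.0f, 2.0f, 3.0f, 4.0f};

/* Returns nonzero when p sits on a 16-byte boundary. */
static int is_quadword_aligned(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Sum the non-negative entries, hinting that negative values are rare
   so the compiler can lay out the common path as the fall-through. */
static long sum_nonnegative(const int *values, int n)
{
    long total = 0;

    for (int i = 0; i < n; i++) {
        if (__builtin_expect(values[i] < 0, 0))
            continue;               /* unlikely path: skip negatives */
        total += values[i];
    }
    return total;
}
```

The hint changes only code layout and prediction, never the result: the function returns the same sum whether or not the expectation turns out to be right.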
29
Example of an SPU Program
    #include <stdio.h>
    #include <spu_intrinsics.h>  // SPU intrinsics such as spu_add

    // Define a type we can look at either as an array of ints or as a vector.
    typedef union {
        int iVals[4];
        vector signed int myVec;
    } vecVar;

    int main()
    {
        vecVar v1, v2, vConst; // define variables

        // load the literal value 2 into the 4 positions in vConst
        vConst.myVec = (vector signed int){2, 2, 2, 2};

        // load 4 values into the 4 elements of vector v1
        v1.myVec = (vector signed int){10, 20, 30, 40};

        // call vector add function (an SPU intrinsic)
        v2.myVec = spu_add( v1.myVec, vConst.myVec );

        // see what we got!
        printf("\nResults:\nv2[0] = %d, v2[1] = %d, v2[2] = %d, v2[3] = %d\n\n",
               v2.iVals[0], v2.iVals[1], v2.iVals[2], v2.iVals[3]);

        return 0;
    }
30
(c) Copyright International Business Machines Corporation 2005.
All Rights Reserved. Printed in the United States September 2005. The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both. IBM IBM Logo Power Architecture Other company, product and service names may be trademarks or service marks of others. All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document. IBM Microelectronics Division The IBM home page is 1580 Route 52, Bldg The IBM Microelectronics Division home page is Hopewell Junction, NY
31
Special Notices -- Trademarks
This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document. Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply.
Rates and offerings are subject to change, extension or withdrawal without notice. IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Many of the features described in this document are operating system dependent and may not be available on Linux. For more information, please check: Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment. Revised January 19, 2006
32
Special Notices (Cont.) -- Trademarks
The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter, Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced Micro-Partitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger, Virtualization Engine. A full list of U.S. trademarks owned by IBM may be found at: Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries, or both. Rambus is a registered trademark of Rambus, Inc. XDR and FlexIO are trademarks of Rambus, Inc. UNIX is a registered trademark in the United States, other countries or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Fedora is a trademark of Redhat, Inc. Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries. TPC-C and TPC-H are trademarks of the Transaction Performance Processing Council (TPPC). 
SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). AltiVec is a trademark of Freescale Semiconductor, Inc. PCI-X and PCI Express are registered trademarks of PCI SIG. InfiniBand™ is a trademark of the InfiniBand® Trade Association. Other company, product and service names may be trademarks or service marks of others. Revised July 23, 2006