APIC Tutorial --- Architecture and Hardware
John DeHart, Washington University
jdd@arl.wustl.edu
http://www.arl.wustl.edu/~jdd
Coverage
- The APIC is a complicated device; there is no way we can cover everything today.
- In the original workshop we spent one whole day on the APIC architecture and hardware and a second day on the software.
- Many more details are in Zubin’s slides from the original workshop: http://www.arl.wustl.edu/gigabitkits/kits.html (go to “Course Slides & Papers” in the left margin).
- Also see the papers and documentation on the web site.
Our Original Goals for the APIC
- Build a high speed ATM host interface
- Single chip
- Low cost
- High bandwidth
- Gigabit all the way to the application
- Low latency
- Zero copy
- Support for Quality of Service
APIC Features Overview
- 32 bit and 64 bit PCI at 33MHz (all of our cards are 32 bit)
- Point-to-Point, Multipoint and Loopback VCs
- AAL5 Segmentation and Reassembly
- AAL0: Raw ATM (RATM)
- Support for multiple traffic types
- Batching of cells in PCI transactions
- Control via PCI bus and remotely via control cells
- Multiple DMA modes
- Interrupts and Notification List for efficient interrupt handling
- Flow Control: UTOPIA and ATM GFC field
APIC Internal Design
[Block diagram. Utopia Ports 0 and 1 each have an Input Port, Input Sync, Output Sync and Output Port; Port 2 is paired with Tx Sync and Rx Sync. Shared blocks: VC Translation Table (VCXT) and Cell Store. Data path and control blocks: Requestor, Register Manager, Pacer, DataPath, BusInterface and Interrupt/Notification Manager, attached to the PCI-32/64 Bus.]
APIC Internal Design: 6 Clock Regions
[Same block diagram, divided into clock regions A to F.]
- A, B, C, D: Link Clocks (typically 62.5 MHz)
- E: Bus Clock (PCI: 33 MHz)
- F: Internal Clock (85 MHz)
APIC Transit Path: ATM Port to ATM Port
[Same block diagram, showing this path.]
APIC Receive Path: ATM Port to Memory
[Same block diagram, showing this path.]
APIC Transmit Path: Memory to ATM Port
[Same block diagram, showing this path.]
APIC Multipoint Receive Path: ATM Port to *
[Same block diagram, showing this path.]
APIC Multipoint Transmit Path: Memory to *
[Same block diagram, showing this path.]
APIC Loopback Path: Memory to Memory
[Same block diagram, showing this path.]
APIC Multipoint Loopback Path: Memory to *
[Same block diagram, showing this path.]
APIC Control and Response Cell Path
[Same block diagram, showing this path.]
APIC and AALs
- AAL5
  - Frames up to 65535 bytes
  - Used for IP packets
  - Format on next slide
- AAL0
  - Host can send and receive individual ATM cells
  - Used for communication with raw ATM devices and for sending specially formatted control cells
  - APIC uses a 56 byte cell format, shown on a later slide
AAL5 Frames
AAL5 frame layout (total length is a multiple of 48 bytes):
  Field          Length (bytes)
  Packet data    1 to 65535
  Padding        0 to 47
  User-to-User   1
  Reserved       1
  Length         2
  CRC            4
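The padding rule can be checked with a short sketch. It assumes the standard 8-byte AAL5 trailer (User-to-User 1, Reserved 1, Length 2, CRC 4); the function name aal5_layout is made up for illustration and is not part of any APIC driver.

```python
AAL5_TRAILER_BYTES = 8    # User-to-User (1) + Reserved (1) + Length (2) + CRC (4)
CELL_PAYLOAD_BYTES = 48   # payload carried by one ATM cell


def aal5_layout(payload_len: int) -> tuple[int, int]:
    """Return (padding_bytes, cell_count) for an AAL5 frame.

    Hypothetical helper: pads so that payload + padding + trailer
    is a multiple of 48 bytes.
    """
    if not 1 <= payload_len <= 65535:
        raise ValueError("AAL5 payload must be 1 to 65535 bytes")
    padding = -(payload_len + AAL5_TRAILER_BYTES) % CELL_PAYLOAD_BYTES
    total = payload_len + padding + AAL5_TRAILER_BYTES
    return padding, total // CELL_PAYLOAD_BYTES
```

For example, a 40 byte packet fits exactly in one cell with no padding, while a 41 byte packet needs 47 bytes of padding and two cells.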
AAL0 Frames
- One cell is 56 bytes internally. When it goes out onto the ATM link, of course, it is 53 bytes.
- AAL0 frame layout: APIC header (4 bytes), ATM header (4 bytes), payload (48 bytes)
- APIC AAL0 header (32 bits, numbered 0 to 31) fields:
  - pIn: Port In
  - pOut: Port Out
  - C: Control Cell
  - L: Low Delay
  - ChanId: Channel Id
APIC Traffic Types
Transmit:
- Low Delay: highest priority; transmitted at link rate (APIC Global Pacing Rate)
- Paced: transmitted at the rate configured for the channel; rates are independently configurable for each channel
- Best Effort: lowest priority; can use whatever bandwidth is left after Low Delay and Paced channels
Receive:
- Low Delay: strictly higher priority than Normal Delay
- Normal Delay: only serviced when all Low Delay queues are empty
APIC Descriptors and Buffers
[Figure: chain of buffer descriptors with a Current Descriptor marker, showing full buffers, a partially filled buffer and an empty buffer.]
- A buffer descriptor points to a buffer queued for sending data from or receiving data into.
- A buffer descriptor contains:
  - Address of the buffer (a physical address: the PCI bus operates on physical, not virtual, memory)
  - Buffer length
  - Link to next descriptor
  - Flags
Buffer Details
Receive Buffers:
- Must be 8-byte aligned and a multiple of 8 bytes in length
- CAVEAT (RX Sync bug): AAL0 buffers should be a multiple of 56 bytes in length; AAL5 buffers should be a multiple of 48 bytes in length
- A single AAL5 frame can span multiple buffers
- No buffer can contain data from more than one AAL5 frame
- The EndOfFrame bit (E) is set in the buffer containing the last 8 bytes of the AAL5 frame; with the caveat above, this expands to the last cell of the AAL5 frame
- Multiple AAL0 frames can occupy the same buffer
- A single AAL0 frame can span multiple buffers, BUT because of the caveat above this won’t happen: buffers for AAL0 will be completely filled
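A driver could enforce the receive buffer rules with a pair of helpers like the ones below. This is only a sketch: the names RX_UNIT, rx_buffer_len and rx_buffer_ok are hypothetical, and the checks simply restate the alignment rule and the RX Sync caveat from this slide.

```python
# Receive buffer length unit per AAL, from the RX Sync caveat.
RX_UNIT = {"AAL0": 56, "AAL5": 48}


def rx_buffer_len(aal: str, min_bytes: int) -> int:
    """Round min_bytes up to a safe receive buffer length for this AAL."""
    unit = RX_UNIT[aal]
    cells = max(1, -(-min_bytes // unit))   # ceiling division, at least one cell
    return cells * unit


def rx_buffer_ok(aal: str, phys_addr: int, length: int) -> bool:
    """Check 8-byte alignment, 8-byte length multiple, and the per-AAL multiple."""
    return (phys_addr % 8 == 0
            and length > 0
            and length % 8 == 0
            and length % RX_UNIT[aal] == 0)
```

Since 56 and 48 are both multiples of 8, a length that satisfies the caveat also satisfies the basic 8-byte length rule.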
Buffer Details
Transmit Buffers:
- Need not be aligned on word boundaries (but our drivers always do…)
- Can be of any length
- A single AAL5 frame can span multiple buffers
- No buffer can contain data from more than one AAL5 frame
- The EndOfFrame bit (E) is set in the buffer containing the first byte of the last cell of the AAL5 frame
- Multiple AAL0 frames can occupy the same buffer
- A single AAL0 frame can span multiple buffers
- All buffers will be completely transmitted unless there is an error
Descriptor Details
- All descriptors MUST reside in a block of contiguous physical memory, 1MB or less
- All descriptors MUST be 16-byte aligned
- An APIC global register, the descriptor area pointer register, must contain the address of this block of memory
- Think of the descriptor area as an array of descriptors
- The nextDescOfs field in a descriptor is an index into the descriptor array
  - 16 bit index: 65536 descriptors possible
  - 65536 descriptors * 16 bytes per descriptor = 1MB
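Treating the descriptor area as an array, the physical address of a descriptor follows directly from its 16 bit index. The sketch below assumes only what this slide states (16 byte descriptors, 16 bit index); the function name desc_phys_addr is hypothetical.

```python
DESC_SIZE = 16          # bytes per descriptor
MAX_DESCS = 1 << 16     # a 16 bit nextDescOfs index addresses 65536 descriptors


def desc_phys_addr(area_base: int, index: int) -> int:
    """Physical address of descriptor `index` in the descriptor area."""
    if area_base % DESC_SIZE != 0:
        raise ValueError("descriptor area must be 16-byte aligned")
    if not 0 <= index < MAX_DESCS:
        raise ValueError("descriptor index must fit in 16 bits")
    return area_base + index * DESC_SIZE
```

The largest index lands 16 bytes short of the 1MB limit, which is why the area never needs to exceed 1MB.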
APIC Receive Descriptor
[Descriptor layout figure. Fields shown: BufAddrLo (physical address), BufAddrHi (physical address), control fields E C Y T X L V I O S, Match/TCP_Checksum, BufLen, NextDescOfs.]
We’ll look at the Y field … For more details, see Zubin’s original workshop slides.
APIC Transmit Descriptor
[Descriptor layout figure. Fields shown: BufAddrLo (physical address), BufAddrHi (physical address), control fields E Y T V I O S, TCRC, Match, BufLen, NextDescOfs.]
We’ll look at the Y field next … For more details, see Zubin’s original workshop slides.
Sync Bits (Y Field) of APIC Descriptor
Sync (Y) bits implement Ready/Done:
- 0 DONE_VALIDLINK: APIC is finished with this descriptor and its link to the next descriptor is valid
- 1 DONE_INVALIDLINK: APIC is done with this descriptor BUT its link to the next descriptor is not valid! Be careful of this one
- 2 NOT_READY: not ready for the APIC to use. The last descriptor in a chain is always marked NOT_READY by the driver
- 3 READY: ready for the APIC to use. Set in Receive Descriptors in a chain for the APIC to use; set in Transmit Descriptors that are ready for the APIC to send
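The four Y values and the "last descriptor is NOT_READY" rule can be modelled in a few lines. This is a sketch of the convention only: the Desc class and link_chain function are hypothetical, and the real descriptor bit layout is not modelled.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Sync(IntEnum):
    DONE_VALIDLINK = 0
    DONE_INVALIDLINK = 1
    NOT_READY = 2
    READY = 3


@dataclass
class Desc:
    index: int                 # position in the descriptor area
    next_ofs: Optional[int]    # index of the next descriptor, None at the tail
    sync: Sync


def link_chain(indices: List[int]) -> List[Desc]:
    """Link descriptors in order: all READY, except the last, which is NOT_READY."""
    chain = []
    for pos, idx in enumerate(indices):
        last = pos == len(indices) - 1
        chain.append(Desc(index=idx,
                          next_ofs=None if last else indices[pos + 1],
                          sync=Sync.NOT_READY if last else Sync.READY))
    return chain
```

The NOT_READY tail acts as a stop marker: the APIC consumes READY descriptors and halts when it reaches one it may not use.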
APIC DMA Modes
- Simple DMA
  - Separate queue of buffer descriptors for each connection
  - Works well for transmit
  - Inefficient for receive: no sharing of receive buffers and descriptors
- Pool DMA
  - Multiple connections share a pool of buffer descriptors
  - Works well for receive
  - Caveat: one connection can use up all the buffer descriptors
  - Obviously, does not work for transmit
- Protected DMA
  - Queueing operations executed by a user-space driver
  - Pair of descriptors associated with each buffer: a kernel descriptor and a user descriptor
  - See details in Zubin’s original workshop slides
Simple DMA
Pool DMA
APIC Interrupts and Notifications
- Interrupts are used to report an asynchronous event:
  - completion of transmission/reception of a frame
  - error condition
- Interrupts can be enabled/disabled per channel
- The Notification List contains the list of channels that have had events
  - The APIC issues an interrupt and disables further interrupts until the processor re-enables them
  - Subsequent events just set an entry in the notification list
  - This reduces the frequency of interrupts
  - This can also help reduce the overhead of interrupt processing
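A toy model shows why the notification list cuts down the interrupt rate. The class below is a hypothetical illustration, not the APIC register interface, and it assumes each channel appears at most once in the list.

```python
class NotificationModel:
    """Toy model: one interrupt, then events only accumulate in the list."""

    def __init__(self):
        self.interrupts_enabled = True
        self.notification_list = []   # channels with pending events
        self.interrupts_raised = 0

    def event(self, channel: int) -> None:
        """An event occurs on a channel (frame done or error)."""
        if channel not in self.notification_list:
            self.notification_list.append(channel)
        if self.interrupts_enabled:
            self.interrupts_raised += 1
            self.interrupts_enabled = False   # held off until the processor re-enables

    def service(self) -> list:
        """Interrupt handler: drain the list, then re-enable interrupts."""
        pending, self.notification_list = self.notification_list, []
        self.interrupts_enabled = True
        return pending
```

Four events arriving before the handler runs cost a single interrupt; the handler then learns about every affected channel from the list.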
APIC Memory Mapped Register Space
APIC Register Addresses
- 27 bit address space
- On the PCI bus, the high order 5 bits are device select
  - These are programmed into the APIC PCI Configuration space at boot time by the BIOS
- Global Registers (i.e. not per channel), fields from high to low order bits:
    00000000000000 (14 bits) | RegID (9 bits) | 00 (2 bits)
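APIC Register Addresses (continued)
- Kernel Access Per-channel Registers, fields from high to low order bits:
    10 (2 bits) | t | CID (8 bits) | 00000000 (8 bits) | RegID (9 bits) | 00 (2 bits)
- User Access Per-channel Registers, fields from high to low order bits:
    11 (2 bits) | t | CID (8 bits) | 00000000 (8 bits) | RegID (9 bits) | 00 (2 bits)
- t=0: Rx Channel, t=1: Tx Channel
- CID: Channel Index or VCI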
APIC Pacing: General Stuff
- Pacing is for Transmit Channels only
- Cells are NOT paced out onto the wire (not exactly): pacing is done on the PCI bus
- Pacing is not a guarantee, it is just a restriction
- Pacing calculations include the ATM headers, but not the APIC header
APIC Pacing: General Stuff
- Two pacer controls:
  - Global Pacing: APIC Pacing Parameter register (Global, 0x208)
  - Per VC Pacing: TX Channel Pacing Parameter Register (TX, 0x500XX68), where XX is the Channel ID
- Three types of channels:
  - Low Delay (highest priority)
  - Paced
  - Best Effort (lowest priority)
- All channels are paced by the Global Pacing
- Paced channels also use Per VC Pacing
APIC Data Transfers
- The APIC pulls data from memory across the PCI bus in batches of cells
  - The number of cells in a batch is controlled by a register
- The Pacer identifies when it is time to transmit data and which connection should transmit
- The Pacer “wakes up” every 14 PCI bus clock ticks and checks whether it is time to transmit
  - Controlled by the Global APIC Pacing Parameter (APP)
  - If it is time to transmit, it takes the first connection off the previously sorted list of keys and transmits its data
- The gory details about keys and heap storage of connections are not included here. Read Rex’s documentation and/or the VHDL if you want that level of detail.
Global Pacing Parameter
- Pacing parameters are 24 bits: 16 bits of integer part and 8 bits of fractional part
- Global APIC Pacing Parameter (APP):
    APP = (256 * BatchSz * 53 * 8 * 8192 * InternalClockMhz) / (14 * ClockEstimate * LinkRateMbps)
[Items in formula explained on next slide]
Explanation of Expression
    APP = (256 * BatchSz * 53 * 8 * 8192 * InternalClockMhz) / (14 * ClockEstimate * LinkRateMbps)
- 256: shifts left by 8 bits to set the “decimal point”
- BatchSz: how many cells per transfer
- 53 * 8: translates cells/second into bits/second
- 8192, InternalClockMhz (85MHz), ClockEstimate: the APIC counts how many of its internal 85MHz clock ticks take place during the time it takes for 8192 PCI bus clock ticks. This value is the ClockEstimate.
  - PCI Bus Clock Rate in MHz = (8192 * 85) / ClockEstimate
- 14: number of PCI bus ticks in a pacer period
- LinkRateMbps: our target rate
[Example on next 2 slides]
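The ClockEstimate relation can be turned around to recover the PCI bus clock rate and tick length. A small sketch, with hypothetical function names, using the typical ClockEstimate of 20954 quoted in this tutorial's 1Gb/s example:

```python
INTERNAL_CLOCK_MHZ = 85    # APIC internal clock
SAMPLE_PCI_TICKS = 8192    # PCI ticks over which internal ticks are counted


def pci_clock_mhz(clock_estimate: int) -> float:
    """PCI bus clock rate in MHz = (8192 * 85) / ClockEstimate."""
    return SAMPLE_PCI_TICKS * INTERNAL_CLOCK_MHZ / clock_estimate


def pci_tick_ns(clock_estimate: int) -> float:
    """Length of one PCI bus clock tick in nanoseconds."""
    return 1000.0 / pci_clock_mhz(clock_estimate)


print(f"{pci_clock_mhz(20954):.2f} MHz, {pci_tick_ns(20954):.2f} ns per tick")
```

With ClockEstimate = 20954 this gives roughly 33.23 MHz, or about 30 ns per PCI tick.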
Example: Units in the APP Formula
    APP = (256 * BatchSz * 53 * 8 * 8192 * InternalClockMhz) / (14 * ClockEstimate * LinkRateMbps)
    Units: (256 * Cells * Bytes/Cell * Bits/Byte * 8192 * M/sec) / (14 * 1 * MBits/sec)
Example: APP for 1Gb/s Link Rate
    APP = (256 * BatchSz * 53 * 8 * 8192 * InternalClockMhz) / (14 * ClockEstimate * LinkRateMbps)
- BatchSz = 8
- 53 * 8: translates cells/second into bits/second
- InternalClockMhz = 85
- ClockEstimate = 20954 (typical value)
- LinkRateMbps = 1000 (1000 Mb/s == 1Gb/s)
    APP = (256 * 8 * 53 * 8 * 8192 * 85) / (14 * 20954 * 1000) = 2061.15
    APP = 2061 = 0x80D
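The worked example can be reproduced with a direct transcription of the formula. The function name is hypothetical, and truncating to an integer is an assumption based on the slide reducing 2061.15 to 2061.

```python
def apic_pacing_parameter(batch_cells: int, clock_estimate: int,
                          link_rate_mbps: int,
                          internal_clock_mhz: int = 85) -> int:
    """Global APP as a 24 bit value: 16 integer bits, 8 fraction bits."""
    numerator = 256 * batch_cells * 53 * 8 * 8192 * internal_clock_mhz
    denominator = 14 * clock_estimate * link_rate_mbps
    app = numerator // denominator          # assumption: truncate, as 2061.15 -> 2061
    if app >= 1 << 24:
        raise ValueError("APP does not fit in 24 bits")
    return app


app = apic_pacing_parameter(batch_cells=8, clock_estimate=20954,
                            link_rate_mbps=1000)
print(hex(app), "integer part:", app >> 8, "fraction part:", app & 0xFF)
```

The integer part, app >> 8, is the number of 14-tick pacer periods between batches; here it is 8.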
Example: APP for 1Gb/s Link Rate (continued)
- APP = 2061 = 0x80D (integer part: 2061 / 256 = 8 pacer periods)
- This means that every 14 * 8 = 112 PCI bus clock ticks the APIC will be able to pull 8 cells worth of data across the PCI bus
- (8 cells) / (112 * 30ns) = (3392 bits) / (3360ns) ~= 1Gb/s
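The rate check on this slide is simple arithmetic and can be sketched as follows. The function name is hypothetical, and the 30 ns PCI tick is the rounded figure used on the slide.

```python
def batch_rate_gbps(batch_cells: int, pacer_periods: int,
                    pci_tick_ns: float = 30.0) -> float:
    """Rate when batch_cells cells are pulled once every pacer_periods periods."""
    bits = batch_cells * 53 * 8               # 53 byte cells, 8 bits per byte
    interval_ns = pacer_periods * 14 * pci_tick_ns   # 14 PCI ticks per pacer period
    return bits / interval_ns                 # bits per nanosecond equals Gb/s


print(batch_rate_gbps(batch_cells=8, pacer_periods=8))
```

For 8 cells every 8 pacer periods this is 3392 bits in 3360 ns, just over 1 Gb/s.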
Per VC Pacing
- Per VC Pacing Parameter: what portion of the full link rate can be used
  - e.g. an integer value of 2 means that this channel can use half the link rate
- Conceptually like this:
  [Figure: a chain of counters. 33 MHz PCI Bus Clock, then Count to 14, then Count to APP, then Count to TX Pacing Parameter, then "This Tx Channel is Ready to Transmit BATCH Cells".]
Per VC Pacing
[Timeline figure: connections marked X along the time axis, in units of one APIC Pacing Period; those before the current pacedTime are expired connections. Example shown with vcPacingParameter ~ 10.]
newExpirationTime = oldExpirationTime + vcPacingParameter
pacedTime
- pacedTime is incremented every global pacing cycle in which a non-LowDelay connection wins contention
- Example with two connections:
  - (L) Low Delay at 1/24th of the global rate
  - (P) Paced at 1/6th of the global rate (.1666667)
[Timeline figure: L and P transmissions against a time axis marked 6, 12, 18, ... 84.]
pacedTime (continued)
[Timeline figure: L and P rows against a time axis marked 6, 12, 18, ... 84.]
We might expect the Paced channel to miss its exact turn and fire on the next global pacing interval, but keep its next expiration on the (0,6,12,18,…) boundaries. But…
pacedTime (continued)
[Timeline figure: L and P transmissions against two time axes.]
  pacedTime:    t+  5 11 17 22 28 34 40 45 51 57 63 68 74 80
  “Real” time:  t+  6 12 18 24 30 36 42 48 54 60 66 72 78 84
- Actual rate for the Paced connection: (GlobalRate) * (3*(1/6) + 1*(1/7))/4 = (GlobalRate) * (.1607)
- For a Global Rate of 24Mb/s (DQ test example): 24 * .1607 = 3.8568 Mb/s
Example of a Pacing Oddity
- Suppose we have a channel on which we are sending single cell packets at a rate of 2 cells every pacing period for that channel
- The BATCH size is 1 cell, so the channel should only send 1 cell during each pacing period
[Timeline figure with D event markers.]
- You would expect the connection to build up a backlog, but it doesn’t……
Example of a Pacing Oddity (con’t)
- It turns out the driver does a RESUME each time it puts data in an empty transmit queue, to restart it
- A RESUME causes the ExpireTime to be set to the current PacedTime
- This causes the channel to be expired at the very next Pacer Period
- Thus the channel transmits at twice its expected rate
[Timeline figure with D, T and R event markers.]
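The effect of RESUME can be seen in a toy simulation. Everything here is a simplifying assumption rather than the APIC's actual pacer logic: one tick stands for one pacer period, PacedTime is taken to be the tick count, the batch size is one cell, and the function name simulate is made up.

```python
def simulate(ticks: int, vc_period: int, arrival_interval: int,
             resume_on_empty: bool) -> tuple:
    """Return (cells_sent, backlog) after `ticks` pacer periods."""
    queue = 0     # cells waiting in the transmit queue
    expire = 0    # channel expiration time, in pacer periods
    sent = 0
    for now in range(ticks):
        # Pacer: an expired channel with queued data sends one batch (1 cell).
        if queue > 0 and now >= expire:
            queue -= 1
            sent += 1
            expire += vc_period   # new expiration = old expiration + pacing parameter
        # Driver: queue one single-cell packet every arrival_interval ticks.
        if now % arrival_interval == 0:
            if queue == 0 and resume_on_empty:
                expire = now      # RESUME: ExpireTime set to current PacedTime
            queue += 1
    return sent, queue


print(simulate(100, vc_period=10, arrival_interval=5, resume_on_empty=False))
print(simulate(100, vc_period=10, arrival_interval=5, resume_on_empty=True))
```

With two arrivals per pacing period, the run without RESUME sends 10 cells and leaves a backlog of 10, while the run with RESUME sends all 20 cells, twice the configured rate.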
APIC Bugs and Caveats: RxSync
RxSync lockup when buffers are too short
- The APIC is receiving data for a connection and runs out of buffers when there is still data left
- If this happens repeatedly, under certain conditions the APIC’s Rx-Sync module can lock up
- Example: if we have three 16 byte buffers set up to receive one 56 byte AAL0 cell (remember that the APIC AAL0 cell size is 56 bytes), then each time we receive a cell with these buffers we will have 8 bytes left over that the APIC SHOULD throw away. After the eighth time we use this chain of buffers to receive a cell, the APIC locks up.
- A similar problem exists for AAL5
- The bug has not been identified in the VHDL
- Workarounds:
  - For AAL0, always allocate buffers in multiples of 56 bytes
  - For AAL5, always allocate buffers in multiples of 48 bytes
APIC Bugs and Caveats: Word Swap
- The APIC swaps contiguous 32bit words when receiving data into host memory
  - Exists in the APIC when used in Intel architectures
  - Exists only in 32bit PCI mode
- The bug has been identified in the VHDL, but we aren’t going to respin the chip…
- Workaround: the driver performs a word swap on all data received
  - A painful and costly data touch
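A driver-side swap might look like the sketch below. It assumes the bug exchanges the two 32 bit words inside each aligned 8 byte unit, which is one reading of "swaps contiguous 32bit words"; the function name is hypothetical. Receive buffers are multiples of 8 bytes, so the length check should always pass for them.

```python
def swap_word_pairs(data: bytes) -> bytes:
    """Exchange the two 32 bit words in every 8 byte unit of `data`."""
    if len(data) % 8 != 0:
        raise ValueError("length must be a multiple of 8 bytes")
    out = bytearray(len(data))
    for i in range(0, len(data), 8):
        out[i:i + 4] = data[i + 4:i + 8]     # second word moves to the front
        out[i + 4:i + 8] = data[i:i + 4]     # first word moves to the back
    return bytes(out)
```

Applying the swap twice restores the original data, so the same routine undoes the hardware's swap. It also touches every received byte, which is the cost the slide complains about.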
APIC Bugs and Caveats: ILR
- Bug in the APIC decode of the Interrupt Line Register (ILR) address on writes
  - The ILR is at 0x3C
  - The BIOS writes the IRQ value to the ILR and then reads it back to see if this is a functioning PCI device. If it doesn’t read back properly, it “removes” this device from the PCI bus
  - A BIOS write to 0x3C enters the APIC as a write to 0x7C
  - Reads of 0x3C are ok
- The bug has been identified in the VHDL
- Workaround implemented on NICs and SPCs: you should never have to worry about this one…