PCI and PCIe Architecture (ESP – Fall 2014) Computer Science & Engineering Department Arizona State University Tempe, AZ 85287 Dr. Yann-Hang Lee yhlee@asu.edu (480) 727-7507
PCI Bus Release 2.1 -- 66MHz, 32-bit and 64-bit connectors. 3.3V or 5V signaling, based on the PCI chipset's buffers/drivers. Agents: bus master (initiator) and slave (target). Bus transaction: bus masters issue requests; arbitration; bus grant; the granted master issues the address and command and begins a cycle frame (transaction); memory, I/O, and configuration read/write commands; a target is selected (device select); the target signals that it is ready to complete the data transfer phase. [Figure: PCI connector pin layout showing pins 1, 12/13, 50/51, 62, and 94, the 5V key, the 3.3V key, and the 64-bit portion]
Buses in PC-XT and PC-AT ISA (Industry Standard Architecture). IBM-PC and PC-XT: 8 bits at 4.77MHz, directly connected to the 8088, 2-stage bus cycle (2.38 Mbyte/sec bus bandwidth). AT bus: extension slot + 8-bit ISA; 16 bits at 8.33MHz for the 80286. [Figure: block diagram with CPU, DRAM controller, DRAM, DMA controller, BIOS, timer, interrupt controller, bus buffer, and ISA bus expansion slots]
Buses in PC (486) 16-bit ISA cannot support Windows applications (video data). VESA LB (local bus): linked to the 486 local bus, 33MHz, 32 bits. [Figure: block diagram with 486 CPU, local bus, L2 cache, DRAM, ISA bridge, bus buffer, video card, LAN adapter, HDD controller, and ISA bus expansion slots]
Buses in PC (Pentium) [Figure: backside bus, frontside bus, PCI, and ISA] PCI: direct access to system memory for connected devices; uses a bridge to connect to the frontside bus and therefore to the CPU.
Increasing the Bus Bandwidth Separate versus multiplexed address and data lines: Address and data can be transmitted in one bus cycle if separate address and data lines are available. Cost: (a) more bus lines, (b) increased complexity. Data bus width: By increasing the width of the data bus, transfers of multiple words require fewer bus cycles. Example: SPARCstation 20's memory bus is 128 bits wide. Cost: more bus lines. Block transfers: Allow the bus to transfer multiple words in back-to-back bus cycles. Only one address needs to be sent at the beginning. The bus is not released until the last word is transferred. Cost: (a) increased complexity, (b) longer response time for other requests. Our handshaking example in the previous slide used the same wires to transmit the address as well as the data. The advantage is a saving in signal wires. The disadvantage is that it takes multiple cycles to transmit address and data. By having separate lines for addresses and data, we can increase the bus bandwidth by transmitting address and data in the same cycle, at the cost of more bus lines and increased complexity. This (1st bullet) is one way to increase bus bandwidth. Another way is to increase the width of the data bus so multiple words can be transferred in a single cycle. For example, the SPARCstation memory bus is 128 bits, or 16 bytes, wide. The cost of this approach is more bus lines. Finally, we can also increase the bus bandwidth by allowing the bus to transfer multiple words in back-to-back bus cycles without sending an address or releasing the bus. The cost of this last approach is an increase in the complexity of the bus controller as well as a longer response time for other parties who want to get onto the bus.
Increasing Bus Transaction Rate Overlapped operations (pipelined): perform arbitration for the next transaction during the current transaction; initiate the next address phase during the current data phase. Bus parking: a master holds onto the bus and performs multiple transactions as long as no other master makes a request. Split-phase (or packet-switched) bus: completely separate address and data phases; arbitrate separately for each address phase; the address phase yields a tag which is matched with the data phase. "All of the above" in most modern processor-memory buses.
PCI Bus Signals [Figure: PCI master device signals: AD[31:0], C/BE[3:0], FRAME, IRDY, TRDY, DEVSEL, STOP, REQ, GNT, RST, CLK, error reporting, C/BE[7:4], REQ64, ACK64, INT REQ, BIST signals, misc control] [Figure: A typical PCI read transaction]
PCI Bus Operation Address phase: The initiator asserts the FRAME# signal. At the same time, the initiator identifies the target device and the type of transaction. Every PCI target device latches the address and decodes it. Data phase: The number of data bytes to be transferred is determined by the number of Command/Byte Enable signals asserted by the initiator. Both the initiator and the target must be ready to complete a data phase; IRDY# and TRDY# are used for this. Transaction completion and return of the bus to the idle state: The initiator marks the final data phase by deasserting FRAME# while asserting IRDY#. When the last data transfer has completed, the initiator returns the PCI bus to the idle state by deasserting IRDY#.
PCI Commands
PCI allows the use of up to 16 different 4-bit commands: configuration commands, memory commands, I/O commands, and special-purpose commands.
A command is presented on the C/BE# bus by the initiator during an address phase (a transaction's first assertion of FRAME#).
C/BE[3::0]#   Command Type
0000          Interrupt Acknowledge
0001          Special Cycle
0010          I/O Read
0011          I/O Write
0100          Reserved
0101          Reserved
0110          Memory Read
0111          Memory Write
1000          Reserved
1001          Reserved
1010          Configuration Read
1011          Configuration Write
1100          Memory Read Multiple
1101          Dual Address Cycle
1110          Memory Read Line
1111          Memory Write and Invalidate
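The command list above maps naturally onto a lookup table indexed by the 4-bit value on C/BE[3:0]#. The following is a minimal sketch; the function names `pci_command_name` and `pci_command_is_config` are hypothetical helpers for illustration, not part of any PCI library.

```c
#include <assert.h>

/* PCI bus command names, indexed by the 4-bit value driven on C/BE[3:0]#
 * during the address phase (same encoding as the command list above). */
static const char *const pci_command_names[16] = {
    "Interrupt Acknowledge",      /* 0000 */
    "Special Cycle",              /* 0001 */
    "I/O Read",                   /* 0010 */
    "I/O Write",                  /* 0011 */
    "Reserved",                   /* 0100 */
    "Reserved",                   /* 0101 */
    "Memory Read",                /* 0110 */
    "Memory Write",               /* 0111 */
    "Reserved",                   /* 1000 */
    "Reserved",                   /* 1001 */
    "Configuration Read",         /* 1010 */
    "Configuration Write",        /* 1011 */
    "Memory Read Multiple",       /* 1100 */
    "Dual Address Cycle",         /* 1101 */
    "Memory Read Line",           /* 1110 */
    "Memory Write and Invalidate" /* 1111 */
};

/* Hypothetical helper: return the name of a command code (0..15). */
const char *pci_command_name(unsigned cbe)
{
    assert(cbe < 16);
    return pci_command_names[cbe];
}

/* Hypothetical helper: true for Configuration Read (1010) and
 * Configuration Write (1011), which differ only in bit 0. */
int pci_command_is_config(unsigned cbe)
{
    assert(cbe < 16);
    return (cbe & 0xEu) == 0xAu;
}
```

A bus monitor or simulator could call `pci_command_name()` on each latched address phase to label the transaction type.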
Basic Write Transaction
PCI Optimizations and Additional Features Push bus efficiency toward 100% under common simple usage. Bus parking: retain the bus grant for the previous master until another master makes a request; the granted master can start its next transfer without arbitration. Arbitrary burst length: initiator and target can exert flow control with xRDY; disconnect with STOP (abort or retry, by target), FRAME (by master), and GNT (by arbiter). Delayed (pended, split-phase) transactions: free the bus after a request to a slow device. Additional Features Interrupts: support for controlling I/O devices. Cache coherency: support for I/O and multiprocessors. Locks: support timesharing, I/O, and MPs. Configuration Address Space (plug and play).
PCI Address Space A PCI target can implement up to three different types of address spaces. Configuration space: stores basic information about the device; allows the central resource or O/S to program a device with operational settings. I/O space: used mainly with PC peripherals and not much else. Memory space: used for just about everything else. Message bus space: access to the message bus space is through the SoC's PCI configuration registers.
Accessing the Address Spaces Memory space (4GB): accessed using a large variety of processor instructions (mov, add, or, shr, push, etc.) and virtual-to-physical address translation. I/O space (64KB): accessed only by using the processor's special 'in' and 'out' instructions (without any translation of port addresses). PCI configuration space (16MB): I/O ports 0x0CF8-0x0CFF are dedicated to accessing PCI Configuration Space.
PCI Configuration Address Space Contains 256 bytes of basic device information, addressable by 8-bit PCI bus, 5-bit device, and 3-bit function numbers for the device the first 64 bytes (00h – 3Fh) make up the standard configuration header, including PCI ID, i.e. vendor ID and device ID registers, to identify the device the remaining 192 bytes (40h – FFh) represent user-definable configuration space, such as the information specific to a PC card for use by its accompanying software driver Also permits Plug-N-Play base address registers allow an agent to be mapped dynamically into memory or I/O space a programmable interrupt-line setting allows a software driver to program a PC card with an IRQ upon power-up
Memory and IO Spaces Memory space is used by most everything else – it’s the general-purpose address space The PCI spec recommends that a device use memory space, even if it is a peripheral An agent can request between 16 bytes and 2GB of memory space. The PCI spec recommends that an agent use at least 4kB of memory space, to reduce the width of the agent’s address decoder IO space is where basic PC peripherals (keyboard, serial port, etc.) are mapped The PCI spec allows an agent to request 4 bytes to 2GB of I/O space For x86 systems, the maximum is 256 bytes because of legacy ISA issues
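The size limits above follow from how a base address register (BAR) is probed. The following is a sketch under the assumption of the conventional sizing procedure, which the slides do not spell out: software writes all ones to the BAR, reads it back, masks off the low type bits, and takes the two's complement. The function name `bar_region_size` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: given the value read back from a 32-bit BAR after
 * writing 0xFFFFFFFF to it, return the size in bytes of the region the
 * device requests. Returns 0 if the BAR is not implemented.
 * Assumption: bit 0 = 1 marks an I/O BAR (low 2 bits are type bits),
 * bit 0 = 0 marks a memory BAR (low 4 bits are type bits). */
uint32_t bar_region_size(uint32_t readback)
{
    uint32_t type_mask = (readback & 1u) ? 0x3u : 0xFu;
    uint32_t addr_bits = readback & ~type_mask;

    /* The writable address bits form a mask of the form 1...10...0;
     * its two's complement is the region size. */
    return ~addr_bits + 1u;
}
```

With this encoding, the smallest memory request is 16 bytes (readback `0xFFFFFFF0`), the smallest I/O request is 4 bytes, and the largest request expressible in a 32-bit BAR is 2GB (readback `0x80000000`), which matches the ranges quoted on the slide.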
The Plug-and-Play Concept Allows add-in cards to be plugged into any slot without changing jumpers or switches Address mapping, IRQs, COM ports, etc., are assigned dynamically at system start-up For PNP to work, add-in cards must contain basic information for the BIOS and/or O/S, e.g.: Type of card and device Memory-space requirements Interrupt requirements
Configuration Transactions Are generated by a host or PCI-to-PCI bridge Use a set of IDSEL signals as chip selects Dedicated address decoding Each agent is given a unique IDSEL signal Are typically single data phase Bursting is allowed, but is very rarely used Two types (specified via AD[1:0] in addr. phase) Type 0: Configures agents on same bus segment Type 1: Configures across PCI-to-PCI bridges
Type 00h Configuration Space Header
Configuration Commands Two DWORD I/O locations are used to generate configuration transactions: 0CF8h references a read/write register, CONFIG_ADDRESS; 0CFCh references a read/write register, CONFIG_DATA. Bus enumeration: software attempts to read the Vendor ID and Device ID registers for each combination of bus number and device number, at the device's function #0. If the read succeeds, software knows a device exists and can then program the memory-mapped and I/O port addresses for the device.
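The value written to CONFIG_ADDRESS can be sketched as below. The bit layout (enable in bit 31, bus in 23:16, device in 15:11, function in 10:8, DWORD-aligned register offset in 7:2) is the conventional x86 layout and is stated here as an assumption, since the slide does not show it. The function name `pci_config_address` and the two macros are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CONFIG_ADDRESS_PORT 0x0CF8u /* CONFIG_ADDRESS I/O port */
#define PCI_CONFIG_DATA_PORT    0x0CFCu /* CONFIG_DATA I/O port */

/* Build the DWORD to write to CONFIG_ADDRESS in order to select one
 * register of one function. Assumed layout:
 *   bit 31      enable
 *   bits 23:16  bus number (8 bits)
 *   bits 15:11  device number (5 bits)
 *   bits 10:8   function number (3 bits)
 *   bits 7:2    register offset, DWORD aligned */
uint32_t pci_config_address(unsigned bus, unsigned dev,
                            unsigned func, unsigned reg)
{
    assert(bus < 256 && dev < 32 && func < 8 && reg < 256);
    return 0x80000000u
         | ((uint32_t)bus  << 16)
         | ((uint32_t)dev  << 11)
         | ((uint32_t)func << 8)
         | ((uint32_t)reg & 0xFCu);
}
```

For enumeration, software would write `pci_config_address(bus, dev, 0, 0x00)` to port 0CF8h with a 32-bit `out`, then read port 0CFCh with a 32-bit `in`; a Vendor ID of FFFFh in the low half of the result conventionally means no device responded.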
Example Quark GIP Configuration
lspci -s 00:15.2 -vvvxxx
00: 86 80 34 09 06 04 10 00 10 00 80 0c 00 00 80 00
10: 00 70 00 90 00 60 00 90 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 34 09
30: 00 00 00 00 80 00 00 00 00 00 00 00 ff 03 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 01 a0 03 48 08 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 05 00 01 01 0c 10 e0 fe d1 41 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 01 00 00 00 00 00 00 c0 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 b1 0f 00 00 00 00 00 00
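Configuration space is little-endian, so the multi-byte fields in the dump above have to be assembled low byte first. The sketch below copies the first 32 bytes of the dump and decodes a few standard header fields; the helper names `cfg_read16` and `cfg_read32` and the array name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* First 32 bytes (offsets 00h-1Fh) of the lspci dump above. */
const uint8_t quark_gip_header[32] = {
    0x86, 0x80, 0x34, 0x09, 0x06, 0x04, 0x10, 0x00,
    0x10, 0x00, 0x80, 0x0c, 0x00, 0x00, 0x80, 0x00,
    0x00, 0x70, 0x00, 0x90, 0x00, 0x60, 0x00, 0x90,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Read a little-endian 16-bit field at byte offset off. */
uint16_t cfg_read16(const uint8_t *cfg, size_t off)
{
    return (uint16_t)(cfg[off] | ((unsigned)cfg[off + 1] << 8));
}

/* Read a little-endian 32-bit field at byte offset off. */
uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
    return (uint32_t)cfg[off]
         | ((uint32_t)cfg[off + 1] << 8)
         | ((uint32_t)cfg[off + 2] << 16)
         | ((uint32_t)cfg[off + 3] << 24);
}
```

Applied to this dump, offset 00h decodes to vendor ID 8086h, offset 02h to device ID 0934h, and the two base address registers at 10h and 14h to 90007000h and 90006000h.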
Example: I2C and GPIO in Quark
A PCI device: B:0, D:21, F:2
MMIO: uses two base address registers in the configuration registers
I2C memory registers: BAR0 + offset
I2C master mode operation:
  Disable the I2C controller by writing 0 to IC_ENABLE.ENABLE.
  Write to the IC_CON register.
  Write to the IC_TAR register.
  Enable the I2C controller by writing a 1 to IC_ENABLE.
  Write the transfer direction and data to be sent to the IC_DATA_CMD register.
Offset Start   Offset End   Register ID   Default Value
10h            13h          BAR0          00000000h
14h            17h          BAR1
Example: Quark GPIO IRQ Enable
Allows each bit of Port A to be configured for interrupts. In drivers/mfd/intel_cln_gip_gpio.c:

#define PORTA_INT_EN          0x30  /* Interrupt enable */
#define PORTA_INT_MASK        0x34  /* Interrupt mask */
#define PORTA_INT_TYPE_LEVEL  0x38  /* Interrupt level */
. . . . . . .
static void intel_cln_gpio_irq_enable(struct irq_data *d)
{
    . . . .
    void __iomem *reg_inte = reg_base + PORTA_INT_EN;

    gpio = d->irq - irq_base;

    spin_lock_irqsave(&lock, flags);
    val_inte = ioread32(reg_inte);
    iowrite32(val_inte | BIT(gpio % 32), reg_inte);
    spin_unlock_irqrestore(&lock, flags);
}
National Instruments Leadership Seminar April, 2002 PCI Challenges Limited bandwidth: PCI-X and Accelerated Graphics Port (AGP) for higher frequency; reduction of distance; bandwidth shared between all devices. Limited host pin-count. Lack of support for real-time data transfer. Stringent routing rules. Lack of scaling with frequency and voltage. Absence of power management. PCI-X: an enhancement of the 32-bit PCI Local Bus for higher bandwidth demand; a double-wide version of PCI, running at up to four times the clock speed. As PCI clock frequencies have become inadequate in certain applications, PCI derivatives such as PCI-X and Accelerated Graphics Port (AGP) have sought to provide bandwidth relief by increasing bus frequencies. A side effect of increasing frequencies is a commensurate reduction in the distance the bus can be routed and the number of connectors the bus transceivers can drive, which leads to the concept of dividing the PCI bus into multiple segments. Each of these segments requires a full PCI-X bus to be routed from the host driving silicon to each active slot. For example, 64-bit PCI-X requires 150 pins for each segment. Clearly this is costly to implement and places strain on routing, board layer count, and chip package pin-outs. This extra cost is justified only where the bandwidth is crucial, such as in servers. Applications such as data acquisition, waveform generation, and multimedia applications including streaming audio and video require guaranteed bandwidth and deterministic latency, without which the user experiences glitches. The original PCI specification did not address these issues because the applications were not prevalent at the time the specification was developed. Today's isochronous data transfers, such as high-definition uncompressed video and audio, demonstrate the need for the I/O system to support isochronous transfers.
A side effect of isochronous transfers is that local PCI Express devices need a lot less memory for buffering purposes than typical PCI devices use for minimizing variable bandwidth issues. Finally, next-generation I/O requirements such as quality of service measurements and power management will improve data integrity and permit selective powering-down of system devices, an important consideration as the amount of power required by modern PCs continues to grow. Virtual Channels permit data to be routed via virtual routes, so data transfers can take place even if other channels are blocked by outstanding transactions. Although the PCI bus is showing signs of age in some areas, the transition to PCI Express will be a long one, and the PCI bus will remain a strong contender for I/O expansion for many years to come. Modern PCs introduced in 2004 and later will have a combination of PCI and PCI Express slots, with the ratio changing more towards PCI Express as the technology is adopted.
Inter-Networking Driving Demand Multimedia applications drive the need for fast, efficient processing of data over wired or wireless media. CPU performance doubles about every 18 months while PC bus performance doubles about every 3 years. [Figure: relative bandwidth (log scale, 10 to 10000) versus year (1980 to 2000) for 8b ISA, 16b ISA, EISA, MCA, PCI 32/33, PCI 64/66, and PCI-X, with Fast Ethernet, Gbit Ethernet, and 10 Gbit Ethernet marked. Source: Intel] It all starts with the Internet. People are getting more and more used to the interactivity offered by the Internet and are demanding complex multimedia applications for data, streaming audio, and video. The industry has tried to support that by increasing CPU speeds, but has come to the realization that just increasing CPU speeds is not the solution. The real bottleneck in any system is at the system interconnect level. To use an analogy, the number of cars on the street has increased, but the streets have not become any wider. New interconnect solutions capable of transferring terabits of data over wired or wireless media quickly and efficiently are needed.
PCI Express Basics Serial, point-to-point, Low Voltage Differential Signaling. 2.5GHz full-duplex lanes (2.5Gb/s); PCIe Gen 2 = 5Gb/s. Scalable links: x1, x4, x8, x16. Packet-based transaction protocol. Software compatible with PCI but with higher speeds. Built-in Quality of Service provisions: Virtual Channels, Traffic Classes. Reliability, Availability and Serviceability: end-to-end CRC (cyclic redundancy check), poison packet. Native hot-plug support. Flow control and advanced error reporting. [Figure: x4 link example: PCI Express Device 1 and PCI Express Device 2 connected by four lanes, with a reference clock]
PCI Express Performance
Link width                           x1     x2    x4   x8   x12   x16   x32
Bandwidth in Gbits/s (Tx and Rx)     5      10    20   40   60    80    160
Throughput in GB/s                   0.5    1     2    4    6     8     16
Throughput in GB/s (per direction)   0.25   0.5   1    2    3     4     8
Per-direction equivalents: 0.25 GB/s = PCI 32/66; 0.5 GB/s = PCI or PCI-X 64/66; 1 GB/s = PCI-X 64/133.
Raw: assuming 100% efficiency with no payload overhead.
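The figures in this table can be reproduced with a short calculation. The sketch below assumes the 2.5 Gb/s per-lane line rate quoted earlier and 8b/10b line coding (10 bits on the wire per data byte), which is an assumption added here to explain how 5 Gb/s of raw x1 bandwidth becomes 0.5 GB/s; the function names are hypothetical.

```c
/* Line rate of one lane in one direction, in Mb/s (2.5 Gb/s). */
#define PCIE_LINE_RATE_MBPS 2500u

/* Raw link bandwidth in Mb/s, transmit plus receive. */
unsigned pcie_raw_mbps_both_directions(unsigned lanes)
{
    return lanes * PCIE_LINE_RATE_MBPS * 2u;
}

/* Throughput per direction in MB/s.
 * Assumption: 8b/10b coding, so only 8 of every 10 line bits carry data;
 * dividing by 8 again converts bits to bytes. */
unsigned pcie_mbytes_per_sec_per_direction(unsigned lanes)
{
    return lanes * PCIE_LINE_RATE_MBPS * 8u / 10u / 8u;
}
```

For an x1 link this gives 5000 Mb/s raw and 250 MB/s per direction; for x12 it gives 3000 MB/s per direction, in line with the 0.25 and 3 GB/s entries of the table.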
PCIe Layers Layered architecture. Application data is transferred via packets: Transaction Layer Packet (TLP). A PCIe core usually implements the lower three layers. Protocol handling: connection establishment, link control, flow control, power management, error detection and reporting.
Transaction Layer Packet Types
PCIe TLP Structure
Transaction Types, Address Spaces Requests are translated to one of four transaction types by the Transaction Layer. Memory Read or Memory Write: used to transfer data from or to a memory-mapped location; also supports a locked memory read transaction variant. I/O Read or I/O Write: used to transfer data from or to an I/O location; restricted to supporting legacy endpoint devices. Configuration Read or Configuration Write: used to discover device capabilities, program features, and check status in the 4KB PCI Express configuration space. Messages: handled like posted writes; used for event signaling and general-purpose messaging.
Programmed I/O Transaction
Back-up Slides
Message Bus Register Access
Indirect access via PCI configuration space:
  Message Bus Control Reg. (MCR) - PCI[B:0,D:0,F:0]+D0h
  Message Data Reg. (MDR) - PCI[B:0,D:0,F:0]+D4h
  Message Control Reg. eXtension (MCRX) - PCI[B:0,D:0,F:0]+D8h
Uses the MCR/MCRX as an index register and MDR as the data register. Writes to the MCR trigger message bus transactions.
MCR description
Field MBPR                                           Bits
OpCode (typically 10h for read, 11h for write)       31:24
Port                                                 23:16
Offset/Register                                      15:08
Byte Enable                                          07:04
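The MCR field layout above can be packed with a few shifts. This is a minimal sketch based only on the bit positions listed on the slide; bits 3:0 are left at zero because the slide does not describe them, and the function name `mcr_encode` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MCR_OPCODE_READ  0x10u /* typical read opcode per the slide */
#define MCR_OPCODE_WRITE 0x11u /* typical write opcode per the slide */

/* Pack the MCR fields: OpCode 31:24, Port 23:16, Offset/Register 15:8,
 * Byte Enable 7:4. Bits 3:0 are written as zero (assumption: the slide
 * does not define them). */
uint32_t mcr_encode(uint8_t opcode, uint8_t port,
                    uint8_t offset, uint8_t byte_enable)
{
    assert(byte_enable <= 0xFu);
    return ((uint32_t)opcode << 24)
         | ((uint32_t)port << 16)
         | ((uint32_t)offset << 8)
         | ((uint32_t)byte_enable << 4);
}
```

A read would then consist of writing `mcr_encode(MCR_OPCODE_READ, port, offset, 0xF)` to configuration offset D0h of B:0,D:0,F:0 and reading the result from MDR at D4h; the port and offset values depend on the target unit and are not given here.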
Advantages and Disadvantages of Buses Versatility: New devices can be added easily. Peripherals can be moved between computer systems that use the same bus standard. Low cost: A single set of wires is shared in multiple ways. Manage complexity by partitioning the design. Disadvantage: It creates a communication bottleneck. The bandwidth of the bus can limit the maximum I/O throughput. The maximum bus speed is largely limited by: the length of the bus and the number of devices on the bus; the need to support a range of devices with varying latencies and data transfer rates. The two major advantages of the bus organization are versatility and low cost. By versatility, we mean new devices can easily be added. Furthermore, if a device is designed according to an industry bus standard, it can be moved between computer systems that use the same bus standard. The bus organization is a low-cost solution because a single set of wires is shared in multiple ways.
Master versus Slave in a Bus [Figure: the bus master issues a command to the bus slave; data can go either way] Control lines: signal requests and acknowledgments. Data/address lines: carry information between the source and the destination. A bus transaction includes three parts: Arbitration (which master can use the bus); Issuing the command and address (request); Transferring the data (action). The master is the one who starts the bus transaction by issuing the command (and address). The slave is the one who responds to the command by: sending data to the master if the master asks for data; receiving data from the master if the master wants to send data. The bus master is the one who starts the bus transaction by sending out the address. The slave is the one who responds to the master, either by sending data to the master if the master asks for data or by receiving data from the master if the master wants to send data. In most simple I/O operations the processor will be the bus master, but as I will show you later in today's lecture, this is not always the case.
Types of Buses Processor-Memory Bus (design specific): Short and high speed. Only needs to match the memory system. Maximizes memory-to-processor bandwidth. Connects directly to the processor. Optimized for cache block transfers. I/O Bus (industry standard): Usually lengthy and slower. Needs to match a wide range of I/O devices. Connects to the processor-memory bus or backplane bus. Backplane Bus (standard or proprietary): Backplane: an interconnection structure within the chassis. Allows processors, memory, and I/O devices to coexist. Cost advantage: one bus for all components. Buses are traditionally classified as one of 3 types: processor-memory buses, I/O buses, or backplane buses. The processor-memory bus is usually design specific, while the I/O and backplane buses are often standard buses. In general, processor-memory buses are short and high speed. Such a bus tries to match the memory system in order to maximize the memory-to-processor bandwidth and is connected directly to the processor. An I/O bus is usually lengthy and slow because it has to match a wide range of I/O devices, and it usually connects to the processor-memory bus or backplane bus. The backplane bus receives its name because it was often built into the backplane of the computer; it is an interconnection structure within the chassis. It is designed to allow processors, memory, and I/O devices to coexist on a single bus, so it has the cost advantage of having only one bus for all components.
Synchronous and Asynchronous Bus Synchronous Bus: Includes a clock in the control lines. A fixed protocol for communication that is relative to the clock. Advantage: involves very little logic and can run very fast. Disadvantages: Every device on the bus must run at the same clock rate. To avoid clock skew, they cannot be long if they are fast. Asynchronous Bus: It is not clocked. It can accommodate a wide range of devices. It can be lengthened without worrying about clock skew. It requires a handshaking protocol. There are substantial differences between the design requirements for I/O buses, processor-memory buses, and backplane buses. Consequently, there are two different schemes for communication on the bus: synchronous and asynchronous. A synchronous bus includes a clock in the control lines and a fixed protocol for communication that is relative to the clock. Since the protocol is fixed and everything happens with respect to the clock, it involves very little logic and can run very fast. Most processor-memory buses fall into this category. Synchronous buses have two major disadvantages: (1) every device on the bus must run at the same clock rate, and (2) if they are fast, they must be short to avoid clock skew problems. By definition, an asynchronous bus is not clocked, so it can accommodate a wide range of devices at different clock rates and can be lengthened without worrying about clock skew. The drawback is that it can be slow and more complex, because a handshaking protocol is needed to coordinate the transmission of data between the sender and receiver.
Arbitration for Multiple Bus Masters To obtain access to the bus, a bus arbitration scheme is used: A bus master wanting to use the bus asserts the bus request. A bus master cannot use the bus until its request is granted. A bus master must signal to the arbiter after it finishes using the bus. Bus arbitration schemes usually try to balance two factors: bus priority; fairness and starvation. Bus arbitration schemes can be divided into four broad classes: Daisy chain arbitration: single device with all request lines. Centralized, parallel arbitration. Distributed arbitration by self-selection: each device wanting the bus places a code indicating its identity on the bus. Distributed arbitration by collision detection: Ethernet uses this. A more aggressive approach is to allow multiple potential bus masters in the system. With multiple potential bus masters, a mechanism is needed to decide which master gets to use the bus next. This decision process is called bus arbitration, and this is how it works. A potential bus master (which can be a device or the processor) wanting to use the bus first asserts the bus request line, and it cannot start using the bus until the request is granted. Once it finishes using the bus, it must tell the arbiter that it is done so the arbiter can allow other potential bus masters to get onto the bus. All bus arbitration schemes try to balance two factors: bus priority and fairness. Priority is self-explanatory. Fairness means even the device with the lowest priority should never be completely locked out from the bus. Bus arbitration schemes can be divided into four broad classes. In distributed arbitration by self-selection: (a) Each device wanting the bus places a code indicating its identity on the bus. (b) By examining the bus, the device can determine the highest-priority device that has made a request and decide whether it can get on.
In distributed arbitration by collision detection, each device independently requests the bus, and a collision will result in garbage on the bus if multiple requests occur simultaneously. Each device detects whether its request resulted in a collision and, if it did, backs off for a random period of time before trying again. The Ethernet you use for your workstation uses this scheme. Daisy chain arbitration and centralized, parallel arbitration are covered in the next two slides.
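The trade-off between priority and fairness can be illustrated with a small centralized arbiter. The following is only a sketch of one possible policy, round-robin, chosen here because it guarantees that no requesting master is starved; it is not taken from the slides, and the function name `round_robin_grant` is hypothetical.

```c
#include <assert.h>

/* Round-robin arbiter sketch.
 *   requests   bit i is set while master i asserts its bus request
 *   last       index of the master that was granted the bus most recently
 *   n_masters  number of masters (1..32)
 * Returns the index of the next master to grant, or -1 if nobody is
 * requesting. The search starts just after the last winner, so every
 * requester is reached within n_masters grants. */
int round_robin_grant(unsigned requests, int last, int n_masters)
{
    assert(n_masters > 0 && n_masters <= 32);
    assert(last >= 0 && last < n_masters);

    for (int i = 1; i <= n_masters; i++) {
        int candidate = (last + i) % n_masters;
        if (requests & (1u << candidate))
            return candidate;
    }
    return -1;
}
```

A fixed-priority arbiter would instead always start the search at index 0, which favors low-numbered masters and can lock out the others under sustained load.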
Example – Basic Write A four-DWORD burst from an initiator to a target Addressing, handshaking, and data transfer phases
Write Example – Things to Note The initiator has a phase profile of 3-1-1-1: the first data can be transferred in three clock cycles (idle + address + data = "3"); the 2nd, 3rd, and last data are transferred in one cycle each ("1-1-1"). If the profile is 5-1-1-1: medium decode, with DEVSEL# asserted on the 2nd clock after FRAME#; one clock period of latency (or wait state) at the beginning of the transfer; DEVSEL# asserted on clock 3, but TRDY# not asserted until clock 4. Total of 4 data phases, but 8 clocks required: only 50% efficiency.
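The efficiency figure above is simply the number of data phases divided by the total number of clocks in the phase profile. A minimal sketch of that arithmetic follows; the function name `burst_efficiency_percent` is hypothetical, and the result is truncated to a whole percent.

```c
#include <assert.h>
#include <stddef.h>

/* Bus efficiency of a burst, in whole percent (truncated).
 *   profile   clocks consumed by each data phase, e.g. {5, 1, 1, 1}
 *   n_phases  number of data phases in the burst
 * One data item is transferred per phase, so efficiency is
 * n_phases / total clocks. */
unsigned burst_efficiency_percent(const unsigned *profile, size_t n_phases)
{
    unsigned clocks = 0;

    for (size_t i = 0; i < n_phases; i++)
        clocks += profile[i];
    assert(clocks > 0);

    return (unsigned)(100u * n_phases / clocks);
}
```

For the 5-1-1-1 profile this gives 4 transfers in 8 clocks, or 50%; the 3-1-1-1 profile gives 4 transfers in 6 clocks, which truncates to 66%.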
Target Address Decoding PCI uses distributed address decoding A transaction begins over the PCI bus Each potential target on the bus decodes the transaction’s PCI address to determine whether it belongs to that target’s assigned address space One target may be assigned a larger address space than another, and would thus respond to more addresses The target that owns the PCI address then claims the transaction by asserting DEVSEL#
More Terms Turnaround cycle: a "dead" bus cycle to prevent bus contention. Wait state: a bus cycle where it is possible to transfer data, but no data transfer occurs. Wait states may be inserted dynamically by the initiator or target: the target deasserts TRDY# to signal it is not ready; the initiator deasserts IRDY# to signal it is not ready. Target termination: either agent may signal the end of a transaction. The target signals termination by asserting STOP#; the initiator signals completion by deasserting FRAME#.
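The wait-state definition above can be expressed as a per-clock check: data moves only on a clock where both IRDY# and TRDY# are asserted, and any other clock inside the data phases counts as a wait state. The sketch below models the signals as booleans where `true` means asserted (the active-low electrical level is abstracted away); the function name `count_data_transfers` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count data transfers over n_clocks clocks of the data phases.
 *   irdy[i], trdy[i]  true when IRDY# / TRDY# is asserted on clock i
 *   wait_states       optional output: clocks with no transfer
 * A transfer happens only when initiator and target are both ready. */
size_t count_data_transfers(const bool *irdy, const bool *trdy,
                            size_t n_clocks, size_t *wait_states)
{
    size_t transfers = 0;
    size_t waits = 0;

    for (size_t i = 0; i < n_clocks; i++) {
        if (irdy[i] && trdy[i])
            transfers++;
        else
            waits++;
    }
    if (wait_states != NULL)
        *wait_states = waits;
    return transfers;
}
```

A target that deasserts TRDY# on every other clock halves the number of transfers for the same number of clocks, which is the one-wait-state case.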
Zero and One Wait State A one-wait-state agent inserts a wait state at the beginning of each data phase This is done if an agent – built in older, slower silicon – needs to pipeline critical paths internally Reduces bandwidth by 50% The need to insert a wait state is typically an issue only when the agent is sourcing data (initiator write or target read) This is because such an agent would have to sample its counterpart’s xRDY# signal to see if that agent accepted data, then fan out to 36 or more clock enables (for AD[31:0] and possibly C/BE#[3:0]) to drive the next piece of data onto the PCI bus . . . all within 11 ns!