ARCHITECTURE OF APPLE’S G4 PROCESSOR
BY RON WEINWURZEL
MICROPROCESSORS, PROFESSOR DEWAR
SPRING 2002
What makes a supercomputer “super” is its ability to execute at least one billion floating-point operations per second. This staggering measure of speed is known as a “gigaflop.”
Apple G4: can deliver performance of over one gigaflop, and has a theoretical peak performance of 3.6 gigaflops. Apple bills it as the first personal computer processor to deliver over one gigaflop.
The G4 processor is also known as the MPC7400. One of its advantages is its shorter processor pipeline:
Pentium 4: 20 stages to accomplish a task.
G4: 7 stages.
The G4 has an L3 cache that uses 2 MB of DDR SDRAM running at a data rate of up to 500 MHz. It boosts processor performance by providing fast access to data and application code at speeds of up to 4 gigabytes per second.
G4: L3 cache
Pentium 4: L2 cache
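The 4 gigabytes per second figure can be checked with simple arithmetic. The sketch below assumes a 64-bit (8-byte) L3 data bus; the slide does not state the bus width, so treat that value as an assumption chosen because it is consistent with the quoted numbers.

```python
def peak_bandwidth_bytes_per_s(transfers_per_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth = transfers per second x bytes moved per transfer."""
    return transfers_per_s * (bus_width_bits // 8)


L3_DATA_RATE = 500e6     # 500 MHz data rate, taken from the slide
L3_BUS_WIDTH_BITS = 64   # ASSUMPTION: bus width is not given on the slide

bw = peak_bandwidth_bytes_per_s(L3_DATA_RATE, L3_BUS_WIDTH_BITS)
print(f"{bw / 1e9:.1f} GB/s")  # prints "4.0 GB/s"
```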
More on L3 cache
The high-speed L3 cache, with its dedicated bus, enables the G4 processor to receive data up to five times faster than it could from main memory. This low latency keeps the processor constantly fed with data, preventing idling while waiting for the next task to arrive. The L3 cache is large enough to store active application code and data. When an application is run, most of the active code for the program, along with most of the data being used, is in L3 cache. Therefore, the information most required by the processor is close at hand. It’s analogous to the caching of web pages on a hard disk drive: when the ‘Back’ button is clicked in a web browser, the computer uses the data loaded two pages ago instead of reloading the same data, so the page appears more quickly.
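How much the cache helps depends on how often the processor finds what it needs there. A minimal sketch of the standard average-access-time formula follows. It treats "five times faster" as a 5:1 ratio of access times, which is a simplification, and the hit rates are hypothetical values, not measurements from the G4.

```python
def average_access_time(hit_rate: float, cache_time: float, memory_time: float) -> float:
    """Weighted average of cache and main-memory access times."""
    return hit_rate * cache_time + (1.0 - hit_rate) * memory_time


MEMORY_TIME = 5.0               # arbitrary time units
CACHE_TIME = MEMORY_TIME / 5.0  # "up to five times faster", read as access time

# Hypothetical hit rates, for illustration only.
for hit_rate in (0.0, 0.5, 0.9, 1.0):
    t = average_access_time(hit_rate, CACHE_TIME, MEMORY_TIME)
    print(f"hit rate {hit_rate:.0%}: average access time {t:.2f} units")
```

The closer the hit rate gets to 100%, the closer the average gets to the cache's own access time, which is the point of keeping active code and data in L3.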
The G4 processor was designed for both portable and desktop computing systems. This had a dramatic effect on its design: a 32-bit architecture combined with a 128-bit vector engine named the Velocity Engine. The architecture provides 32-bit effective addresses, integer data types of 8, 16, and 32 bits, and floating-point data types of 32 and 64 bits. See the diagram on the next slide:
G4 Hardware Design Diagram:
Standard features in the G4 architecture:
Branch processing unit: processes one branch per clock cycle, fetches four instructions, and resolves two speculations. It incorporates a 512-entry branch history table (BHT) and a 64-entry, 4-way set-associative branch target instruction cache (BTIC).
Dispatch unit
Completion unit: tracks instructions and completes a peak of two instructions per cycle, using an 8-entry completion buffer.
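To illustrate how a branch history table makes predictions, here is a minimal sketch of a 512-entry table of 2-bit saturating counters. The entry count comes from the slide; the 2-bit counter scheme, the initial counter state, and the way the branch address is mapped to a table index are assumptions made for illustration, not details confirmed by the slide.

```python
BHT_ENTRIES = 512  # entry count from the slide


class BranchHistoryTable:
    """Sketch of a branch predictor built from 2-bit saturating counters."""

    def __init__(self, entries: int = BHT_ENTRIES) -> None:
        self.entries = entries
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        # ASSUMPTION: every counter starts at 1 (weakly not taken).
        self.counters = [1] * entries

    def _index(self, address: int) -> int:
        # ASSUMPTION: drop the two low address bits (instructions are
        # 4 bytes wide) and use the remaining low bits as the index.
        return (address >> 2) % self.entries

    def predict(self, address: int) -> bool:
        return self.counters[self._index(address)] >= 2

    def update(self, address: int, taken: bool) -> None:
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


# A loop branch at a hypothetical address: taken nine times, then falls through.
bht = BranchHistoryTable()
mispredictions = 0
for taken in [True] * 9 + [False]:
    if bht.predict(0x1000) != taken:
        mispredictions += 1
    bht.update(0x1000, taken)
print(mispredictions)  # prints 2: the first iteration and the loop exit
```

By the same arithmetic, a 64-entry, 4-way set-associative BTIC is organized as 64 / 4 = 16 sets.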
Features continued:
Fixed-point units (FXUs) that share 32 GPRs for integer operands
Three-stage floating-point unit and a 32-entry FPR file
System unit
Load/store unit: incorporates the usual features, such as one-cycle load and store cache access, effective address generation, zero padding, and sign extension. It also provides internal floating-point conversion, sequencing for load/store multiples, and support for big- and little-endian addressing and all of their variants.
Memory management unit
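Zero padding, sign extension, and byte ordering are easy to confuse, so a small sketch may help. It models what happens when a 16-bit value is loaded into a wider register; the helper names and the sample bytes are hypothetical and say nothing about how the load/store unit is actually built.

```python
def zero_extend(value: int, bits: int) -> int:
    """Keep the low `bits` bits and pad the upper bits with zeros."""
    return value & ((1 << bits) - 1)


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits as a two's-complement signed number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


raw = bytes([0xFF, 0x80])  # two bytes as they sit in memory

big = int.from_bytes(raw, "big")        # 0xFF80: first byte is most significant
little = int.from_bytes(raw, "little")  # 0x80FF: first byte is least significant

print(zero_extend(big, 16))  # prints 65408
print(sign_extend(big, 16))  # prints -128
print(hex(little))           # prints 0x80ff
```

The same two bytes yield different values depending on the byte order, and the same 16-bit pattern yields different values depending on whether the load pads with zeros or extends the sign.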
The Velocity Engine
Behind the G4’s phenomenal performance is its Velocity Engine. The Velocity Engine processes data in huge 128-bit chunks, instead of the smaller 32-bit or 64-bit chunks used in traditional processors. It is the 128-bit vector processing technology used in scientific supercomputers, plus 162 new instructions to speed up computations. In addition, the G4 can perform four (in some cases eight) 32-bit floating-point calculations in a single cycle, two to four times faster than processors found in PCs. See the diagram on the next slide:
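The "four (in some cases eight)" figure and the 3.6 gigaflop peak quoted earlier can be tied together with a short calculation. The sketch assumes that "eight" refers to a vector multiply-add, which counts as two floating-point operations per lane, and it assumes a 450 MHz clock. The clock speed is not stated in these slides; it is chosen here because it reproduces the quoted peak.

```python
VECTOR_BITS = 128  # width of a Velocity Engine register, from the slide


def lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one vector."""
    return VECTOR_BITS // element_bits


def peak_flops(clock_hz: float, float_lanes: int, flops_per_lane: int) -> float:
    """Theoretical peak: one vector instruction issued every cycle."""
    return clock_hz * float_lanes * flops_per_lane


CLOCK_HZ = 450e6           # ASSUMPTION: clock speed is not given in the slides
FLOPS_PER_MULTIPLY_ADD = 2  # a multiply and an add per lane

print(lanes(8), lanes(16), lanes(32))  # prints 16 8 4
peak = peak_flops(CLOCK_HZ, lanes(32), FLOPS_PER_MULTIPLY_ADD)
print(f"{peak / 1e9:.1f} gigaflops")   # prints "3.6 gigaflops"
```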
Structural Overview For G4 Velocity Engine Technology:
Applications on the G4
Resource-consuming software has been tested and compared.
Adobe Photoshop 6 (20 Actions), time in seconds (SHORTER bar means faster):
Athlon 1.4GHz      ================ 48
Pentium 4 1.8GHz   ====================== 59
Dual G4/1000       =============== 47
Digital media production and streaming also benefit from the G4 architecture: by taking advantage of its processing power, the highest-quality streaming audio available can be produced in significantly less time.
Memory
The G4 microprocessor contains separate memory management units (MMUs) for instructions and data, supporting 4 petabytes (2^52 bytes) of virtual memory and 4 gigabytes (2^32 bytes) of physical memory. The MMUs also provide four instruction block address translation (iBAT) and four data block address translation (dBAT) registers.
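The address-space sizes follow directly from the address widths. A quick check, using binary prefixes (1 petabyte taken as 2^50 bytes and 1 gigabyte as 2^30 bytes):

```python
VIRTUAL_ADDRESS_BITS = 52   # from the slide
PHYSICAL_ADDRESS_BITS = 32  # from the slide

virtual_bytes = 2 ** VIRTUAL_ADDRESS_BITS
physical_bytes = 2 ** PHYSICAL_ADDRESS_BITS

print(virtual_bytes // 2 ** 50, "petabytes of virtual memory")    # prints 4
print(physical_bytes // 2 ** 30, "gigabytes of physical memory")  # prints 4

# The virtual space is 2^20 (about a million) times larger than physical memory.
ratio = virtual_bytes // physical_bytes
print(ratio)  # prints 1048576
```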
Buses Provided By The G4
The G4 has separate 32-bit address and 64-bit data buses, each with its own set of arbitration and control signals. This allows the data tenure of a transaction to be decoupled from its address tenure, and provides for a wide range of system bus implementations. Two interface protocols are supported: the 60x bus interface and the MPX bus interface. The 60x protocol implements the PowerPC 32-bit bus interface. The MPX protocol adds several features that provide higher memory bandwidth and more efficient use of the system bus in a multiprocessing environment. These interface protocols are in place to make the most of the Velocity Engine’s features and to reduce data and instruction transfer times.
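Why decoupling the address and data tenures helps can be seen with an idealized cycle count. The sketch below compares a bus where each transaction must finish both tenures before the next begins with one where the next address tenure overlaps the current data tenure. The cycle counts are hypothetical, and the model ignores arbitration overhead and wait states; it is not a model of the 60x or MPX protocols themselves.

```python
def coupled_cycles(transactions: int, addr_cycles: int, data_cycles: int) -> int:
    """Each transaction finishes its address and data tenures before the next starts."""
    return transactions * (addr_cycles + data_cycles)


def decoupled_cycles(transactions: int, addr_cycles: int, data_cycles: int) -> int:
    """The next address tenure overlaps the current data tenure (idealized)."""
    if transactions == 0:
        return 0
    # First address tenure, then one slot per remaining transaction limited
    # by the slower of the two tenures, then the final data tenure.
    slot = max(addr_cycles, data_cycles)
    return addr_cycles + (transactions - 1) * slot + data_cycles


# Hypothetical numbers: 2-cycle address tenure, 4-cycle data tenure.
print(coupled_cycles(8, 2, 4))    # prints 48
print(decoupled_cycles(8, 2, 4))  # prints 34
```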
Balance of Power
Optimum performance requires efficient operation from all levels of the system architecture. Accordingly, to enhance performance at the system level, the G4 architecture has been designed to accommodate the high volumes of system traffic required for complex processing. The major features of this balanced design include reduced memory traffic, integrated high-speed I/O, and a fast, direct PCI bus.
To Sum It Up:
The G4 processor is an innovative design driven by the increased demand for speed in processor-intensive tasks. It combines impressive memory management with a short pipeline. Because the G4 pipeline is short, the processor recovers from bubbles more quickly, resulting in higher processor utilization. With fewer processing steps, faster recovery, and higher processor utilization, processor output is maximized.
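The link between pipeline depth and bubble recovery can be sketched with a simple cycles-per-instruction estimate, using the 7-stage and 20-stage figures quoted earlier in the deck. The model assumes that a mispredicted branch costs roughly one pipeline refill (stages minus one cycles), and the branch frequency and misprediction rate are hypothetical values chosen for illustration.

```python
def cycles_per_instruction(pipeline_stages: int,
                           branch_fraction: float,
                           mispredict_rate: float) -> float:
    """Ideal CPI of 1 plus the average cost of branch-misprediction bubbles."""
    # ASSUMPTION: a misprediction flushes the pipeline, costing about
    # (stages - 1) cycles before useful results flow again.
    penalty = pipeline_stages - 1
    return 1.0 + branch_fraction * mispredict_rate * penalty


BRANCH_FRACTION = 0.2  # hypothetical: one instruction in five is a branch
MISPREDICT_RATE = 0.1  # hypothetical: one branch in ten is mispredicted

for name, stages in (("G4", 7), ("Pentium 4", 20)):
    cpi = cycles_per_instruction(stages, BRANCH_FRACTION, MISPREDICT_RATE)
    print(f"{name}: {cpi:.2f} cycles per instruction")
```

Under these assumptions the shorter pipeline loses fewer cycles to each bubble. The estimate deliberately leaves out clock speed, which deeper pipelines are usually designed to raise, so it compares work per cycle rather than overall speed.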
HAVE A NICE SUMMER! RON WEINWURZEL