Register Windows

Home Up

Home Teaching Glossary ARM Processors Supplements Prof issues About

A fundamental conflict in instruction set design is the need for a compact instruction format and the desire for a large number of on-chip general-purpose registers. A compact way of encoding instructions is necessary, because the designer has to fit an instruction into a single word (or an integer multiple words). Designing a 17-bit instruction op-code for a processor with a 16-bit data bus is as bad as being just slightly dead.

Today, the maximum number of on-chip registers is frequently limited more by the ability to address the registers than the ability to provide silicon real estate to house them. If we consider the RISC format in figure 11.4, it’s apparent that each time the number of on-chip registers is doubled, three bits are added to the length of the instruction format. Although it would be possible for a 32-bit word to provide longer operand address fields than the Berkeley RISC, a move in this direction would make it impossible to include a constant as part of the operand. Remember that about 90% of constants can be expressed by a 10-bit field.

An important design feature of the Berkeley RISC architecture is the way in which it allocates register space to each new procedure. We will soon see that whenever a procedure is called, that procedure gets a new set of registers. Consequently, although the programmer can access only 32 registers at any instant, there is a different set of 32 registers associated with each procedure call.

Although only twelve or so registers are required by the invocation of a procedure for parameter passing and local workspace, the successive nesting of procedures rapidly increases the number of on-chip registers assigned to procedures (if each procedure is to have its own local variables). Suppose a processor allocates a set of N registers each time a procedure is called and the procedure is called recursively D times. The number of registers allocated would be N·D. Since D can be very large, it might be thought that any attempt to dedicate a set of registers to each new procedure is impractical.

Fortunately, although procedures can be nested to an arbitrary depth, the average behavior of programs does not demonstrate pathological behavior with respect to procedure nesting (except for programs involving recursion). Much RISC literature refers to research by Halbert and Kessler, [HalKes80], and Tamir and Sequin, [TamSeq83], demonstrating that most procedures are not nested to any great depth. These results indicate that it might be feasible to adopt a modest number of local register sets for a short sequence of nested procedures.

Figure 1 provides a graphical representation of the execution of a program in terms of the depth of nesting of procedures. The vertical axis represents the depth of the current procedure (up represents a procedure call and down a procedure return). The horizontal axis represents time (i.e., the execution of the program). This diagram is hypothetical, although it illustrates the procedure call-return features described by Katevenis et al [Kat86]. What figure 1 demonstrates is that even though procedures may be nested to depths very much greater than 12, there are long runs of procedure invocations and returns that do not require, for example, a nesting level of greater than five from some arbitrary point. The shaded box illustrates a period in which the depth of procedure nesting is fairly low.

The designers of the Berkeley RISC adopted a mechanism for implementing local workspace for up to eight nested procedures. They dedicated a set of registers to each of the eight procedures. Any further nesting forces the CPU to resort to external memory space for local variable storage, as we shall soon see.

Figure 1 The relationship between depth of procedure nesting and time

For the purpose of our succeeding discussion, we are going to describe a procedure that calls another procedure the parent and the called procedure the child (this terminology is less ugly than using the terms caller and the callee). The parent may have its own parent, which is, of course, the grandparent of the child. Memory space used by procedures can be divided into four types:

Global space Global space is a region of logical address space that is directly accessible by all procedures. It is used to hold constants and data that may be required from any point within the program. Most conventional microprocessors have only internal global registers.

Local space Local space is a region of logical address space that is private to the current procedure. That is, no other procedure can access the current procedure’s local address space from outside the procedure. Local space is employed as working space by the current procedure.

Imported parameter space Imported parameter space holds the parameters imported by the current procedure from its parent. In Berkeley RISC terminology these are called the high registers.

Exported parameter space Exported parameter space holds the parameters exported by the current procedure to its child. In Berkeley RISC terminology these are called the low registers.

We are now going to look at how the Berkeley RISC takes a set of 32 registers and divides this set into the four groups we have just described.

Windows and Parameter Passing

One of the reasons for the relatively high frequency of data movement operations in programs is the need to pass parameters to procedures and to receive them from procedures. Any system that removes (or at least reduces) this two-way transfer of data has the potential to speed up program execution. The Berkeley RISC architecture solves the problem of parameter passing by means of multiple overlapped windows. A window is the set of registers visible to the current procedure. Figure 2 illustrates the structure of the RISC I’s overlapping windows.

Figure 2 Structure of the RISC’s eight overlapping windows

Suppose that the RISC processor is currently using the ith window set. A special-purpose 3-bit register, called the current window pointer (CWP), indicates the currently active window. In this case the CWP contains the value i. As figure 2 shows, each window is divided into four parts. Registers R0 to R9 are the global register set, registers R10 to R15 are used by the procedure to receive parameters from its parent (and to pass parameters back to its parent), registers R16 to R25 are ten local registers that cannot be accessed directly by any other procedure, and registers R26 to R31 are used by the procedure to pass parameters to and from its own child (i.e., a procedure called by itself).

All windows consist of 32 registers, R0 to R31, and each of these registers can be addressed by a RISC I instruction that contains five address bits per operand field. In this section we will use the mnemonics of the Berkeley RISC, wherever possible when describing the operation of RISC architectures. A register-to-register instruction of the form ADD R3,R12,R25 implements the operation [R25] ← [R3] + [R12], where R3 lies within the window’s global address space, R12 lies within its import from parent (or export to child) procedure space, and R25 lies within its local address space. Berkeley RISC arithmetic and logical instructions always involve 32-bit values—there are no 8- or 16-bit operations.

The machine’s physical hardware places no restriction on the use of a procedure’s 32 registers. For example, programmers can employ parameter import/export registers as local variable space if they wish. However, it is considered good programming practice to employ the four register spaces in the manor intended by the chip’s architects. In other words, it would be regarded as poor practice to employ registers R26 to R31 as local variable space (even though there are no hardware limitations to stop you doing this).

Whenever a procedure is invoked by an instruction of the form CALLR Rd,<address>, the contents of the window pointer are incremented by 1 and the current value of the program counter saved in register Rd of the new window. The Berkeley RISC does not employ a conventional stack in external main memory to save subroutine return addresses.

Once a new window has been invoked (in figure 6 this might be window i), the new procedure sees a different set of registers to the previous window. Global registers R0 to R9 are an exception, since they are common to all windows. Register R10 of the child procedure corresponds to register R26 of the calling procedure. That is, the parent procedure’s register R26 is the same register as the child procedure’s register R10.

Suppose a procedure at level i calculates the sum of A and B, and then calls a procedure to subtract C from this sum (a trivial and unlikely example). The following code performs this action.

ADD R16,R17,R26 Add A (reg R16) and B (reg R17) in the procedure’s local address space

and deposit the sum in the procedure’s export-to-child address space (reg R26).

CALLR R18,SBTR Call a child procedure to execute the subtraction.

SBTR SUB R10,R16,R10 Subtract C (register R16) from the value of (A + B) imported in R10 and deposit the result in

R10 ready to export it to the parent procedure.

RET R18 Return to the parent procedure with the result in its register R26.

The return address to be loaded into the PC is in R18.

This example demonstrates that we don’t have to make a physical transfer of data between a parent and a child procedure, as long as no more than six 32-bit words of data are to be transferred.

It is important to emphasize that the same register has different names in the parent and child procedures. In this example the sum of A and B is located in R26 of the parent procedure, which corresponds to register R10 of the child procedure. Since registers have to be renamed each time they are used to pass data between procedures, you will appreciate that RISC architectures are not well suited to the assembly language programmer. In fact, there are several other reasons why RISC architectures are not assembler programmer friendly. RISC architectures are, however, better suited to the output of high-level language compilers that automatically take care of the naming of registers.

The complete structure of Berkeley RISC I’s window system is given in figure 3. On the left hand side of the diagram is the actual physical register array that holds all the on-chip general-purpose registers. The eight columns associated with windows zero to seven show how each window is mapped onto the physical memory array and how the overlapping regions are organized. For example, register R10 of window 2 is really location 42 in the on-chip memory array. Note that the windows are organized in a circular fashion so that window zero follows window seven and window seven precedes window zero. As you can see, register R25 in window 7 corresponds to physical memory location 137, whereas register R26 in the same window corresponds to location 10. The total number of registers required is:

10 global registers + 8 x 10 local registers + 8 x 6 parameter transfer registers = 138 registers.

Figure 3 Structure of the RISC’s eight overlapping windows

Window Overflow

Unfortunately, the on-chip resources of any processor are finite and, in the case of the Berkeley RISCs, the registers sets are limited to eight windows. If procedures are nested to a depth greater than or equal to seven, window overflow is said to occur, as there is no longer a new window for the next procedure invocation. When an overflow takes place, the only thing left to do is to employ external memory to hold the overflow data. In practice, the oldest window is saved rather than the new window created by the procedure just called.

If the number of procedure returns minus the number of procedure calls exceeds eight, window underflow takes place. Window underflow is the converse of window overflow and the youngest window saved in main store must be returned to a window. A program may run for relatively long periods when all procedure calls and returns can be catered for by the overlapped window mechanism. Only when window overflow occurs is it necessary to transfer data between the windows and the external memory. Of course, one corollary of this is that heavily nested procedures can severely limit the performance of Berkeley RISC type architectures to levels below that of more conventional microprocessors.

You can regard the window structure of the Berkeley RISC as a stack, figure 4. Each time a procedure is called, the stack pointer (i.e., the window register) is incremented and each time a return is made the window register is decremented. Since the first and last windows overlap, the window structure is arranged in a circular fashion as illustrated in figure 5.

Figure 4 The RISC’s window structure as a stack

Figure 5 Logical arrangement of the RISC’s windows

Consider now the effect of window overflow. Figure 6 demonstrates the situation in which windows zero to five are occupied. At first sight you might think that both windows six and seven are free. However, due the overlapping structure of the windows, window five overlaps window six, and window zero overlaps window seven. In figure 7 another procedure has been called and window six occupied. At this stage it is no longer possible to call a child procedure without taking some further action. If a procedure were called, window seven would be occupied and a clash would take place between the registers that window seven uses to pass parameters to its child and the registers that window zero uses to receive parameters from its parent (clearly, window seven is not the parent of window zero).

Figure 6 Windows 0 to 5 in use

Figure 7 Windows 0 to 6 in use

The Berkeley RISC has two three-bit pointers, CWP and SWP. CWP is the current window pointer that points to the active window. SWP is the saved window pointer that points to the youngest window that has been saved in the external memory. Whenever the current window pointer is incremented and becomes equal to the saved window pointer, overflow takes place. The Berkeley RISC is designed to call a trap (i.e., call the operating system) whenever a procedure is called that will make the CWP = SWP.

When a window overflow trap occurs, it is necessary to make room in the window registers by saving the oldest window(s) in external store. As you can imagine, this operation takes some time to perform and can much reduce the processor’s throughput. Tamir and Sequin suggest that the best strategy is to save only one window at a time. Each time a window is saved, the SWP is incremented by one. A run of procedure calls (without intervening returns) causes a sequence of window overflow traps, each one resulting in the saving of a window and the incrementing of both the CWP and the SWP.

Each time a return from procedure is made, the CWP is decremented. If a decrement causes CWP to become equal to SWP then underflow occurs, because all on-chip windows have now been freed and any further windows must be retrieved from memory. Therefore, if a return from procedure is executed that would cause the decremented CWP to be equal to the SWP, an underflow trap is executed and a window (or windows) must be retrieved from memory and loaded into the window pointed at by SWP.

We discuss windowing here because it was a significant innovation at the time RISC I and RISC II were being developed. However, the notion of windowing is orthogonal to RISC technology in the sense that you can apply windowing to a processor without affecting any other aspect of its architectures such as the instruction set or addressing modes. Moreover, windowing can be applied to any architecture.

When windowing was first introduced, it seemed to promise a massive increase in performance by offering lots of registers without the added penalty of increasing the instruction space by requiring more bits to specify a given register. It was believed that the penalty incurred in dealing with a window overflow or underflow exception was about 40 clock cycles. In practice things didn’t turn out that way.

One of the first commercial systems to implement windowed registers was the Sparcstation. Running under SunOS 4.1.1 this system incurred a massive 280 cycles whenever a window overflow/ underflow exception occurred. This penalty was incurred because of the overall system architecture. The sequence of eight 32-bit word store instructions had to go through a write-through cache; the memory system had to be tested to ensure that the current page was available; the data transfers had to be checked to ensure that a page boundary had not be crossed; and the register window handler had to ensure that crossing a page boundary did not compromise system security.

Alternative Window Strategies

The window strategy adopted by the designers of the Berkeley RISC is not the only way in which windows can be organized. Katevenis devotes some time to the discussion of alternative policies to the Berkeley RISC style windows in his thesis. Three topics covered by Katevenis are:

Variable-size windows
Dribble-back windows
Cache memory versus windows.

The fixed-size windows of the Berkeley RISC are rather like fixed-price subway tickets. You feel pleased when you make a trip from one end of the city to the other, but miserable when you make a short hop to the next station. If a procedure requires few parameters to be passed and needs little or no working space, allocating both parameter space and local working space to it is wasteful. These unused registers could be better devoted to providing more windows, or they could be omitted to save silicon real estate. Katevenis quotes various figures for window usage implying that less than six registers per window is required on average. A corollary of these findings is that the Berkeley RISC wastes two-thirds of its windows.

A possible solution to the inefficient use of registers by Berkeley RISC processors is to allocate only those registers actually needed by each procedure. By taking this approach the designer can save silicon real estate and increase the processor’s speed. Just as importantly, reducing the average size of the windows means that less time need be spent on dumping/restoring windows following each window overflow/underflow.

Variable window sizes are very attractive in principle but are relatively complex to realize. If fully variable window sizes are implemented, the current window pointer, CWP, cannot be treated as a simple 3-bit counter, because the number of active windows can be increased in increments of one register (rather than the 22 allocated to a procedure by a Berkeley RISC). That is, the CWP becomes a stack pointer that points to the beginning of the current window. Moreover, the calculation of a register address within the processor’s register array cannot be implemented by the simple logic employed by the Berkeley RISC. Instead it is necessary to employ a 6- to 8-bit adder to add the address of the register specified by the current instruction to the contents of the stack pointer. The extra delay in calculating the address of a register may impose a severe penalty on the performance of the processor.

Not only is it more difficult to generate the address of a register within the register file, but the problem of dealing with register overflow/underflow is magnified. A Berkeley RISC architecture detects overflow/underflow by simply comparing the three-bit CWP with the SWP. In a system with variable-size windows it is necessary to compare the 6- to 8-bit CWP with the SWP and then determine whether the result is greater than or less than the register space required by the current procedure.

A compromise approach to the problem of optimum window sizes has been suggested by Furht, [Furht88]. Instead of opting for the two extremes of fixed size or infinitely variable size windows, Furht constrains the size of a window to one of two values. Since 70% of procedure calls require eight or less arguments plus local scalars and 95% of procedures require 16 or less arguments plus local scalars, Furht proposed two window sizes: large L-windows and small S-windows.

Figure 8 demonstrates Furht’s two-size windows. In order to provide for a constant overlap of parameters for parent-to-child or child-to-parent transfers, the overlapped parameter space is fixed at six words exactly like the Berkeley RISC. The local workspace is made equal to two registers for a small window and ten registers (i.e., like the Berkeley RISC) for a large window. Each time a procedure is called, the CWP is incremented by 1 for an S-window and by 2 for an L-window. Similarly, the CWP is decremented by 1 for return from an S-window and by 2 for a return from an L-window. The mechanism for detecting overflow/underflow is exactly the same as that of the Berkeley RISC.

Figure 8 Overlapping two-sized windows

Dribble-back Windows

Instead of waiting for the occurrence of a window overflow or underflow, a so-called dribble-back scheme anticipates these events and transfers data between the external main store and the on-chip registers as a background activity. The dribble-back scheme discussed by Katevenis was adapted from an original concept proposal by Sites, and later refined by Sherburn.

As in the case of variable sized windows, a dribble-back scheme permits the use of a smaller register file (because the chance of overflow or underflow is greatly reduced). Since references to external memory are performed by a RISC processor relatively infrequently, the unused bandwidth between the processor and the external memory can be employed to save the oldest window in the register file concurrently with normal operations. Equally, when the window set begins to become empty, registers can be filled from memory in anticipation of an underflow. A dribble-back mechanism requires at least two windows. In theory, overflow or underflow should never take place, although in practice overflow or underflow can take place if the window is filled or depleted faster than data can be transferred to or from memory.