ARMs for the Poor: Selecting a Processor for Teaching Computer Architecture

Abstract – Teachers of computer architecture and organization courses have to choose a target processor to illustrate the basic principles of instruction set design. In this paper we suggest that it is time to choose the ARM processor architecture, which is markedly different from those used in most current courses.

A specific computer architecture is required as a vehicle to teach about registers, addressing modes, instruction types, and so on. Resorting to a hypothetical teaching machine reduces students’ learning burden and flattens their learning curve, but failing to introduce students to the complexities they will encounter in the real world can destroy their motivation.

Teachers are concerned not only with covering a body of knowledge; they must also motivate students and create a sense of excitement. In a discipline as rapidly changing as computer science, only students who can adapt to change are likely to thrive over a career of four or more decades. This paper explains why the ARM architecture is an excellent vehicle for teaching computer architecture, focusing on its predicated execution, its ability to shift an operand in every data-processing instruction, and its compressed-code (Thumb) mode. Moreover, the ARM has a RISC architecture with some traditional CISC features.

Computer Architecture Curriculum

Computer architecture is a key component of degree courses in computer science; in particular, the joint ACM/IEEE Computer Society Computing Curriculum spells out what should be included in the core curriculum for computer architecture [1], [2]. There is a widespread consensus on the content of computer architecture courses although, in the UK, there is a growing tendency to combine architecture with computer networks or operating systems because these curricula overlap.

Table 1 lists the key components of the computer architecture curriculum proposed by the recently revised CC2001 report. Most of the topics refer to elements of the computer system other than the CPU itself. Table 2 expands the curriculum and includes the learning objectives for the CPU [1]. Note that the curriculum does not prescribe a specific computer; the individual teacher is free to choose a suitable example.

Table 2 demonstrates that the intent of the curriculum is to cover the underlying principles of the operation of the computer and not the details of either its low-level programming or the characteristics of a particular machine.

TABLE 1 PROPOSED ARCHITECTURE IN THE REVISED CURRICULUM 2001

Digital logic and computer arithmetic
Computer architecture
Interfacing and communication
Memory system organization and architecture
Functional organization
Multiprocessing and alternative architectures
Performance acceleration
Architecture for networks and distributed systems
Devices
New directions in computing

TABLE 2 COMPUTER ARCHITECTURE CORE UNIT AND LEARNING OBJECTIVES IN THE REVISED CURRICULUM 2001

Computer architecture [core]

Overview of the history of the digital computer
Introduction to instruction set architecture, microarchitecture and system architecture
Processor architecture – instruction types, register sets, addressing modes
Processor structures – memory-to-register and load/store architectures
Instruction sequencing, flow-of-control, subroutine call/return
Structure of machine-level programs
Limitations of low-level architectures
Low-level architectural support for high-level languages

Learning objectives:

Describe the progression of computers from vacuum tubes to VLSI.
Appreciate the concept of an instruction set architecture (ISA) and the nature of a machine-level instruction in terms of its functionality and use of resources (registers and memory).
Understand the relationship between instruction set architecture, microarchitecture and system architecture, and their roles in the development of the computer.
Be aware of the classes of instruction: data movement, arithmetic, logical, and flow control.
Appreciate the difference between register-to-memory ISAs and load/store ISAs.
Appreciate how conditional operations are implemented at the machine level.
Understand the way in which subroutines are called and returns made.
Appreciate how a lack of resources in ISAs has an impact on high-level languages and the design of compilers.
Understand how parameters are passed to subroutines and how local workspace is accessed.

The Professor’s Dilemma

In practice, computer architecture is taught by real professors to real students, and that rather complicates matters. A glance at typical computer architecture texts [4], [5] shows that authors usually select a real, commercially available device as a vehicle to illustrate the course; for example, the Motorola 68K, the Intel Pentium 4, or the MIPS.

Why do professors make life difficult for themselves by using CPUs that were made by engineers wanting to maximize market penetration and company profits? Why don’t they specify a simple, hypothetical teaching machine and make it easier to teach the subject? Some professors do invent their own machines; I do myself for the first two weeks of the course. Most do not. Professors I have spoken to say that students do not want to use hypothetical hardware because they feel it is unrealistic and does not give a true picture of the real world they will soon be entering. Moreover, I find that students prefer to use hardware like the PC because they feel familiar with it. When I based my courses on the 68K microprocessor, they were well received by students in the days when the Apple Mac used it. When Apple dropped the 68K, students became less enthusiastic.

The Professor ARMed

The most visible role of a professor is to teach a student a given body of knowledge and to examine the quantity and quality of their knowledge. The real job of the professor is to instill in the student a love of the subject [6]. Without that, it’s difficult to transform the student into an autonomous learner who will work independently and continue to build on the course after its end.

I decided to change the processor I use to teach computer architecture from the Motorola 68K to the ARM family. The principal reasons for making the change are that the ARM covers the requirements of existing curricula, is easy to learn, and has an elegant and sophisticated architecture. Moreover, it is widely found in real systems.

ARM – the Background

Microprocessors used as vehicles to teach computer architecture are often mainstream industry-standard devices like the Motorola 68K or the Intel IA32. The processor that is the subject of this paper, the ARM, is beginning to appear in mainstream texts [4], [5], [7]. It is an unusual processor from a Cinderella of the microprocessor world rather than from a giant like Intel or Texas Instruments. Although it is a RISC processor like the high-performance MIPS or PowerPC, it is found in low-cost consumer applications such as PDAs and cell phones. Its characteristics make it stand out from other processors: it has a delightfully simple core architecture, development tools are freely available, and it fully supports the computer architecture curriculum.

Advanced RISC Machines Ltd. was founded in the UK in 1990 and later changed its name to ARM Ltd. Unusually, ARM does not manufacture microprocessors. It is an IP (intellectual property) company that designs processors and licenses other companies to make them; for example, ARM microprocessors are manufactured by Intel, Texas Instruments, and Samsung. ARM 32-bit microprocessors now account for 75% of the world’s embedded 32-bit applications [8]. As noted earlier, students are more likely to be motivated if they study a processor that is in their cell phones.

ARM in a Nutshell

As there is insufficient space here to discuss all the ARM’s attributes, we list its key features and highlight how they differ from other processors, pointing out the pedagogical advantages.

Register set: The ARM has 16 general-purpose registers, r0 to r15, the same number as the 68K and fewer than the 32 registers of typical RISC processors. Register r15 holds the program counter, which is very unusual (the program counter is normally hidden from the user and cannot be accessed directly). Because the program counter is visible, the student can read its contents and even modify them to perform jumps. This feature allows a class discussion of the advantages and disadvantages of general-purpose register sets in contrast with special-purpose register sets. One of my prime teaching objectives is to demonstrate how machine manufacturers have to make choices and how those choices affect future performance and applications.

Instruction Set: The ARM has a RISC load/store (register-to-register) architecture. CISC processors such as the 68K are often called 1½-address machines because they permit operations of the form ADD T1,D1, which adds the contents of memory location T1 to register D1 and puts the sum in D1, overwriting the old value. The term 1½ is used (somewhat sarcastically) to indicate a full memory address and a short register address. RISC processors permit data-processing operations only on registers and provide instructions of the form ADD r1,r2,r3, where register r2 is added to r3 and the sum is put in the destination register, r1. The only memory operations are loading a register from memory and storing a register’s contents in memory. A load/store computer reduces the student’s burden because he or she does not have to remember which addressing modes each instruction can use.

Instruction types: The ARM has a conventional integer data-processing instruction set with traditional arithmetic, logical, and shift operations (although the shift is implemented in an unusual way). One special instruction is MLA (multiply and accumulate), which takes four operands and has the form MLA r1,r2,r3,r4. Its effect is r1 = r2*r3 + r4; that is, it calculates a product and adds it to a previous value. This seemingly innocuous instruction is at the heart of many signal-processing operations (used in audio and video applications) because it implements the inner product of two vectors efficiently (i.e., s = a0*b0 + a1*b1 + a2*b2 + …). The pedagogical advantage of this instruction is that it allows you to introduce modern applications such as multimedia and graphics.
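The inner-product loop can be sketched in Python as a minimal model, assuming 32-bit registers in which MLA keeps only the low 32 bits of r2*r3 + r4. The function names mla and inner_product are hypothetical, chosen for illustration; each loop iteration corresponds to one MLA with the running sum held in the accumulator register.

```python
MASK32 = 0xFFFFFFFF  # registers hold 32 bits; higher bits of a result are discarded


def mla(rm: int, rs: int, rn: int) -> int:
    """Model of MLA rd,rm,rs,rn: rd = rm * rs + rn, truncated to 32 bits."""
    return (rm * rs + rn) & MASK32


def inner_product(a: list[int], b: list[int]) -> int:
    """s = a0*b0 + a1*b1 + ..., one multiply-accumulate per pair of elements."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0  # like MOV r0,#0 before the loop
    for x, y in zip(a, b):
        acc = mla(x, y, acc)  # product added to the running total in one step
    return acc


print(inner_product([1, 2, 3], [4, 5, 6]))  # prints 32
```

Because the product and the addition happen in one instruction, the loop body needs no separate ADD, which is the saving the paragraph above refers to.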

Subroutine call: Two subroutine call mechanisms are in widespread use. CISC processors are stack-based: a call instruction (JSR) pushes the return address onto the stack and a return instruction (RTS) pulls it off again, all handled automatically in hardware. RISC processors increase speed by saving the return address in a register during the call and copying it back to the program counter to return. This mechanism is very fast because it does not access external memory, but it does not permit nested subroutines unless the return address is first saved in memory. The ARM implements a RISC-like call/return mechanism (BL saves the return address in the link register, r14), but it also provides a conventional stack mechanism, which gives the programmer the best of both worlds. The pedagogical advantage of these features is that you can compare and contrast the two call mechanisms, and the students can investigate them for themselves.
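Why a nested call needs the link register saved can be shown with a toy Python model that tracks return addresses only. ToyCpu, nested_call and the addresses 0x104 and 0x204 are hypothetical; the comments name the ARM instructions each step imitates, assuming the usual MOV pc,lr return and an STMFD/LDMFD pair to save and restore lr.

```python
class ToyCpu:
    """Hypothetical model of return-address handling only (no instruction decoding)."""

    def __init__(self) -> None:
        self.lr = 0                 # link register (r14 on the ARM)
        self.stack: list[int] = []  # memory stack used to preserve lr

    def bl(self, return_address: int) -> None:
        # RISC-style call: the return address goes into a register, not memory
        self.lr = return_address

    def push_lr(self) -> None:
        # like STMFD sp!,{lr}: save lr before making a nested call
        self.stack.append(self.lr)

    def return_via_lr(self) -> int:
        # like MOV pc,lr: the value loaded into the program counter
        return self.lr

    def return_via_stack(self) -> int:
        # like LDMFD sp!,{pc}: the saved return address goes straight to the pc
        return self.stack.pop()


def nested_call(save_lr: bool) -> tuple[int, int]:
    """main calls f1, which calls leaf f2; returns (f2's return target, f1's return target)."""
    cpu = ToyCpu()
    cpu.bl(0x104)          # main: BL f1, so f1 should return to 0x104
    if save_lr:
        cpu.push_lr()      # f1 preserves its own return address
    cpu.bl(0x204)          # f1: BL f2 overwrites lr with 0x204
    f2_target = cpu.return_via_lr()
    f1_target = cpu.return_via_stack() if save_lr else cpu.return_via_lr()
    return f2_target, f1_target


print(nested_call(save_lr=True))   # f1 gets back to main at 0x104
print(nested_call(save_lr=False))  # f1 "returns" to 0x204, inside itself
```

A leaf routine such as f2 never touches memory, which is where the register-based mechanism wins; only routines that call others pay for a stack access.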

Shadow registers: Shadowing, where two physical registers share the same logical name, is an important concept. For example, the 68K has two physical stack pointers with the same name (A7): one is visible to the user programmer and the other to the operating system. Because they are different physical registers, an application program can’t corrupt the operating system stack. Shadowed registers allow the professor to mine a rich vein of system security and reliability. The ARM has several shadowed (banked) registers, and the physical instance in use is determined by the interrupt and exception handling mechanism. When the ARM is interrupted, a bank of shadowed registers is switched in: each exception mode has its own r13 (stack pointer) and r14 (link register), and the fast interrupt (FIQ) mode also has its own r8 to r12. An interrupt handler can therefore work with a clean set of registers without first saving the interrupted program’s values. Shadowing enables the teacher to demonstrate how special-purpose hardware increases performance. It also provides an opportunity to discuss hardware-software tradeoffs.
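Register banking can be sketched as a register file in which the current mode selects which physical copy a register name refers to. This is a simplified, hypothetical model: only the user, IRQ and FIQ modes are included (the other exception modes, which bank r13 and r14 like IRQ, are omitted), and the class name BankedRegisterFile is illustrative.

```python
# Registers that have a private (banked) copy in each mode; simplified set of modes
BANKED = {
    "usr": set(),                       # user mode uses the shared registers only
    "irq": {13, 14},                    # own stack pointer and link register
    "fiq": {8, 9, 10, 11, 12, 13, 14},  # fast interrupts get r8-r14 to themselves
}


class BankedRegisterFile:
    def __init__(self) -> None:
        self.mode = "usr"
        self.shared = [0] * 16  # physical registers used when no banked copy applies
        self.banks = {mode: {r: 0 for r in regs} for mode, regs in BANKED.items()}

    def read(self, r: int) -> int:
        bank = self.banks[self.mode]
        return bank[r] if r in bank else self.shared[r]

    def write(self, r: int, value: int) -> None:
        bank = self.banks[self.mode]
        if r in bank:
            bank[r] = value      # the mode's private copy
        else:
            self.shared[r] = value


regs = BankedRegisterFile()
regs.write(8, 1)     # user program's r8
regs.mode = "fiq"    # fast interrupt: a different physical r8 is switched in
regs.write(8, 99)    # the handler uses r8 without saving the user's value
regs.mode = "usr"
print(regs.read(8))  # prints 1: the user's r8 is untouched
```

In this model an IRQ handler that writes r8 would change the user's value, because IRQ mode banks only r13 and r14; that difference is what makes FIQ the faster interrupt.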

Literals: All computers provide a means of loading a literal (immediate value). The ARM deals with literals in an unusual way: a 12-bit field holds an 8-bit value that specifies the significant bits and a 4-bit value that acts as a scale factor (the 8-bit value is rotated right by twice the 4-bit value). For example, the literals 84₁₆ and 8400₁₆ can both be specified in 12 bits. This mechanism reinforces the notions of exponents and mantissas that appear in floating-point arithmetic, as well as the concepts of range and precision.
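The 12-bit literal scheme can be explored with a short Python sketch: decode_immediate rotates the 8-bit value right by twice the 4-bit field, and encode_immediate searches the 16 possible rotations for one that works. Both function names are hypothetical; the sketch covers only the rotated-immediate form of a data-processing instruction.

```python
from typing import Optional

MASK32 = 0xFFFFFFFF


def ror32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by amount bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def decode_immediate(rot: int, imm8: int) -> int:
    """Literal described by a 4-bit rotate field and an 8-bit value: imm8 ROR (2 * rot)."""
    return ror32(imm8, 2 * rot)


def encode_immediate(value: int) -> Optional[tuple[int, int]]:
    """Find (rot, imm8) that decodes to value, trying small rotations first; None if impossible."""
    value &= MASK32
    for rot in range(16):
        imm8 = ror32(value, 32 - 2 * rot)  # rotate left by 2*rot to undo the rotate right
        if imm8 <= 0xFF:
            return rot, imm8
    return None


for literal in (0x84, 0x8400, 0x101):
    print(hex(literal), encode_immediate(literal))
```

A literal can have more than one encoding: this search returns 8400₁₆ as 21₁₆ rotated right by 22 rather than 84₁₆ rotated right by 24, and both are valid. A value such as 101₁₆, whose significant bits do not fit in an 8-bit window, cannot be encoded at all, which is the range-versus-precision trade-off mentioned above.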

Shift instructions: The ARM implements a zero-cycle shift by incorporating a shift into other data-processing instructions. Because of its unusual characteristics, we deal with it separately.

Highlights of the ARM Instruction Set

Although we can’t cover all the ARM’s architectural features, three are particularly important from a teaching point of view because they illustrate interesting and innovative ideas, some of which are excellent vehicles for class discussion. These are: the shift, predicated execution, and addressing modes.

A shift operation moves a string of bits one or more positions left or right. Shift operations differ in:

the direction of the shift (left or right),
the number of shifts – one or more places,
dynamic/static shifts (a dynamic shift permits the number of places shifted to be changed at run-time by using a variable in a register),
the type of shift – arithmetic (preserves the sign), logical, circular (the bit shifted out at one end is shifted in at the other end), extended (the shift takes place through the carry bit to allow multiple-precision arithmetic).
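The four kinds of shift listed above can be sketched in Python on 32-bit values. The function names are hypothetical and follow the ARM mnemonics LSL, LSR, ASR, ROR and RRX; shift counts are assumed to be 0 to 31, and the carry is modelled only where the extended shift needs it.

```python
MASK32 = 0xFFFFFFFF
SIGN = 0x80000000


def lsl(x: int, n: int) -> int:
    """Logical (= arithmetic) shift left: zeros enter at the right."""
    return (x << n) & MASK32


def lsr(x: int, n: int) -> int:
    """Logical shift right: zeros enter at the left."""
    return (x & MASK32) >> n


def asr(x: int, n: int) -> int:
    """Arithmetic shift right: copies of the sign bit enter at the left."""
    x &= MASK32
    signed = x - (1 << 32) if x & SIGN else x
    return (signed >> n) & MASK32  # Python's >> on a negative int propagates the sign


def ror(x: int, n: int) -> int:
    """Circular shift right: bits leaving the right-hand end re-enter at the left."""
    n %= 32
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32


def rrx(x: int, carry_in: int) -> tuple[int, int]:
    """Extended shift right one place through the carry bit; returns (result, carry_out)."""
    x &= MASK32
    return ((carry_in & 1) << 31) | (x >> 1), x & 1


def shift64_right(hi: int, lo: int) -> tuple[int, int]:
    """Multiple precision: hi:lo right one place (like MOVS hi,hi,LSR #1 then MOV lo,lo,RRX)."""
    carry = hi & 1          # the bit shifted out of the high word lands in the carry
    hi = lsr(hi, 1)
    lo, _ = rrx(lo, carry)  # the carry re-enters at the top of the low word
    return hi, lo


amount = 4  # a dynamic shift count, as if read from a register at run time
print(hex(lsl(0x1, amount)), hex(asr(0xFFFFFF00, amount)), hex(ror(0x1, amount)))
```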

The ARM implements shifts, but in an unusual way: it has no separate shift instructions. The computer architect is engaged in an eternal struggle to minimize the time taken to perform operations. A designer’s ultimate goal is the zero-cycle instruction that takes no time to execute. Such an operation is impossible, but the effect of a zero-cycle instruction can be created by hiding the operation. Consider the following non-ARM code.

ASL r0,#4 ;shift contents of register r0 left 4 places

ADD r1,r1,r0 ;Add the contents of r0 to r1

The time taken to execute this code is two cycles. The ARM implements shifts ingeniously by shifting the second operand during a data-processing instruction. High-speed on-chip logic implements the shift by directly routing bits from the source to their destination in a network called a barrel shifter. A typical ARM shift is written:

ADD r1,r1,r0,ASL #3 ;shift r0 left before adding

and implements [r1] = [r1] + 8 * [r0]. The shift and addition are performed in a single cycle. To perform a shift without data processing, a shift can be placed in the data path of a move operation; that is

MOV r1,r1,ASL #3 ;shift r1 left before moving to r1.

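The effect of the shifted second operand can be modelled in a few lines of Python, assuming 32-bit registers and an immediate shift of 0 to 31 places. The names barrel_asl, add_asl and mov_asl are hypothetical; the point is that the shift is part of computing the operand, so the ADD or MOV remains a single instruction.

```python
MASK32 = 0xFFFFFFFF


def barrel_asl(rm: int, amount: int) -> int:
    """Operand 2 after the barrel shifter applies ASL (same as LSL) by an immediate 0..31."""
    return (rm << amount) & MASK32


def add_asl(rn: int, rm: int, amount: int) -> int:
    """ADD rd,rn,rm,ASL #amount: the shift happens on the way into the ALU."""
    return (rn + barrel_asl(rm, amount)) & MASK32


def mov_asl(rm: int, amount: int) -> int:
    """MOV rd,rm,ASL #amount: a plain shift written as a move with a shifted operand."""
    return barrel_asl(rm, amount)


r0, r1 = 3, 5
print(add_asl(r1, r0, 3))  # [r1] + 8 * [r0]
print(add_asl(r0, r0, 3))  # x + 8x: multiply by 9 without a multiply instruction
```

The second call shows a common use of the same idea: adding a register to a shifted copy of itself multiplies it by a small constant in one instruction.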

This operation has real pedagogical significance. A little thought and ingenuity in the design of the ARM’s architecture has significantly increased performance without incurring a lot of additional logic. This demonstrates that tried-and-tested systems can sometimes be improved by looking at them in a new way. In class I point out that ARM’s invention is not entirely new: its designers borrowed a technique from microprogramming, which was popular in the 1970s. I stress that old tricks can be reused in new circumstances and that students should always appreciate the value of discussions about computer history.

The Delights of Predication

When teaching computer architecture it is important to let students know exactly why you are using a particular architecture out of the many available. The aspect of the ARM that I find most appealing is its predicated execution, where an instruction is executed if, and only if, certain conditions are met. Typical architectures used in teaching lack predicated execution; each op-code in the instruction stream is executed in turn unless a change in the flow of control (e.g., a branch or jump) bypasses it.

A suffix can be applied to an ARM op-code to define the condition under which it is executed; for example, ADDEQ performs an addition only if the Z (zero) flag is set, that is, if the result of a previous flag-setting operation was zero. Otherwise the instruction is not executed (it is said to be nullified or squashed). Consider the following fragment of pseudo-code.

if (x == 0 && y < 5) p = p + 1;

A conventional assembly language version uses two conditional branches and might look like the following (illustrative) code:

CMP r1,#0 ;test x in r1

BNE exit ;if not zero then exit

CMP r2,#5 ;compare y in r2 with 5

BGE exit ;if greater than 4 then exit

ADD r3,r3,#1 ;increment p in r3

exit ;exit point

Now consider the use of predicated code.

CMP r2,#4 ;compare y in r2 with 4

CMPLE r1,#0 ;if y <= 4 (i.e., y < 5) then test x in r1

ADDEQ r3,r3,#1 ;if y < 5 and x = 0 then p = p + 1

In this case only three instructions are necessary and the execution time is three clock cycles, whereas the branching version executes up to five instructions, and each branch that is taken costs extra cycles because the pipeline must be refilled.
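A predicated sequence is easy to get subtly wrong, so it is worth checking against the branching version. The Python sketch below models the N, Z, C and V flags set by CMP (following the ARM's rules for subtraction) and the EQ and LE conditions, then compares the sequence CMP r2,#4 / CMPLE r1,#0 / ADDEQ r3,r3,#1 with the branch-based code over a range of signed x and y. Testing y first matters: if x were tested first, a negative x would leave N set, so an LT condition could hold even though x is not zero. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
SIGN = 0x80000000


def cmp_flags(a: int, b: int) -> dict[str, bool]:
    """Flags set by CMP a,b, which computes a - b on 32-bit values and discards the result."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "N": bool(result & SIGN),                  # result is negative
        "Z": result == 0,                          # result is zero
        "C": a >= b,                               # ARM carry = NOT borrow
        "V": bool((a ^ b) & (a ^ result) & SIGN),  # signed overflow
    }


CONDITIONS = {
    "EQ": lambda f: f["Z"],
    "LE": lambda f: f["Z"] or f["N"] != f["V"],
}


def predicated(x: int, y: int, p: int) -> int:
    """CMP r2,#4 / CMPLE r1,#0 / ADDEQ r3,r3,#1 with x in r1, y in r2, p in r3."""
    flags = cmp_flags(y, 4)          # CMP r2,#4
    if CONDITIONS["LE"](flags):      # CMPLE executes only if y <= 4
        flags = cmp_flags(x, 0)      # CMPLE r1,#0
    if CONDITIONS["EQ"](flags):      # ADDEQ executes only if Z is set
        p = (p + 1) & MASK32
    return p


def branching(x: int, y: int, p: int) -> int:
    """The conventional version: CMP/BNE, CMP/BGE, ADD."""
    if x != 0:     # CMP r1,#0 / BNE exit
        return p
    if y >= 5:     # CMP r2,#5 / BGE exit
        return p
    return p + 1   # ADD r3,r3,#1


mismatches = [(x, y) for x in range(-3, 4) for y in range(-3, 9)
              if predicated(x, y, 0) != branching(x, y, 0)]
print(mismatches)  # an empty list means the two versions agree on every case tried
```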

From a teaching point of view, the ARM’s predicated execution has the following advantages:

It demonstrates an alternative way of implementing a conditional operation.

It provides an introduction to branchless computing, which reduces the severe cost of branches in heavily pipelined computers.

Creative Programming

The position of assembly language within the curriculum is sometimes a contentious issue among academics. Some argue that prosaic programming techniques are the order of the day to ensure readable, maintainable, and reliable code. Others like to exploit a machine’s architecture to the full to extract the highest performance.

My own approach is to discuss the pros and cons of the situation, including ethical considerations; for example, I ask students whether they would use a shortcut to speed up a PC (they all say ‘Yes’). Then I ask them, ‘Would you use the same techniques when designing a nuclear reactor control system?’ I came across a rather unusual application of an ARM instruction that demonstrates both the power of the ARM instruction set and the ability to write code whose meaning is far from obvious. Consider the single operation BIC r0,r0,r0,ASR #31.

The BIC (bit clear) instruction forms the logical AND between the first operand and the logical complement of the second operand. Both operands are specified as register r0. The ASR #31 shifts the second operand right 31 places using an arithmetic shift that propagates the sign bit. After 31 shifts the second operand consists only of 32 copies of the sign bit. If the number was positive (or zero), the second operand will be 000…00, and if the number was negative, the second operand will be 111…11.

Since BIC complements the second operand, if r0 initially contains a positive number (or zero), the AND is carried out between r0 and the complement of 000…00, which is 111…11, so r0 is unchanged. If r0 contains a negative number, the AND is carried out between r0 and the complement of 111…11, which is 000…00, so the result is zero. That is, this single instruction implements: if (x < 0) x = 0;

This example provides insight into the nature of binary strings and binary arithmetic and demonstrates the power of assembly language.
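The reasoning can be checked with a small Python model of BIC r0,r0,r0,ASR #31, assuming 32-bit registers. The helper names asr32, bic, clamp_negative and to_signed are hypothetical; bic computes Rn AND NOT Operand2, and the mask produced by the arithmetic shift is either all zeros or all ones.

```python
MASK32 = 0xFFFFFFFF
SIGN = 0x80000000


def asr32(x: int, n: int) -> int:
    """Arithmetic shift right of a 32-bit value, copying the sign bit in from the left."""
    x &= MASK32
    signed = x - (1 << 32) if x & SIGN else x
    return (signed >> n) & MASK32


def bic(rn: int, operand2: int) -> int:
    """BIC rd,rn,operand2: rn AND (NOT operand2)."""
    return rn & ~operand2 & MASK32


def clamp_negative(r0: int) -> int:
    """BIC r0,r0,r0,ASR #31: negative values become 0, others are unchanged."""
    r0 &= MASK32
    mask = asr32(r0, 31)  # 0x00000000 if r0 >= 0, 0xFFFFFFFF if r0 < 0
    return bic(r0, mask)


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement integer."""
    return x - (1 << 32) if x & SIGN else x


for value in (25, 0, -7):
    print(value, to_signed(clamp_negative(value)))
```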

Addressing Modes

I have observed that addressing modes and pointers give my students more difficulty than any other concept. Some students have great difficulty in distinguishing between a pointer and the value pointed at. A good machine architecture should help students overcome their conceptual difficulties with pointers and yet allow the introduction of more advanced topics.

The ARM has a simple pointer-based addressing mode that uses a register to point to an operand in memory. As well as a pointer-plus-offset addressing mode, it supports addressing with two pointers (base plus index, allowing two-dimensional addressing) and pre- and post-indexed addressing modes, as Figure 1 demonstrates. That is, the ARM has an exceptionally rich range of addressing modes.

Figure 1 The ARM’s Indexed addressing modes
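The difference between the offset, pre-indexed and post-indexed forms lies in which address is used and what happens to the base register afterwards. The Python sketch below models that for LDR; the function names, the dictionary-based memory and registers, and the sample addresses are all hypothetical, and the destination and base registers are assumed to be different.

```python
def effective_address(regs: dict[str, int], base: str, offset: int, mode: str) -> tuple[int, int]:
    """Return (address used for the access, new value of the base register)."""
    b = regs[base]
    if mode == "offset":   # LDR rd,[base,#offset]   base unchanged
        return b + offset, b
    if mode == "pre":      # LDR rd,[base,#offset]!  pre-indexed, base updated
        return b + offset, b + offset
    if mode == "post":     # LDR rd,[base],#offset   post-indexed: use base, then add offset
        return b, b + offset
    raise ValueError(f"unknown mode {mode!r}")


def ldr(memory: dict[int, int], regs: dict[str, int], rd: str, base: str,
        offset: int = 0, mode: str = "offset") -> None:
    """Load a word into rd using one of the indexed addressing modes."""
    address, new_base = effective_address(regs, base, offset, mode)
    regs[rd] = memory[address]
    regs[base] = new_base


memory = {0x1000: 11, 0x1004: 22, 0x1008: 33}

regs = {"r0": 0, "r1": 0x1000, "r2": 2}
ldr(memory, regs, "r0", "r1", regs["r2"] << 2)  # LDR r0,[r1,r2,LSL #2]: base plus scaled index
print(regs["r0"], hex(regs["r1"]))              # base register is unchanged

regs = {"r0": 0, "r1": 0x1000}
for _ in range(3):
    ldr(memory, regs, "r0", "r1", 4, "post")    # LDR r0,[r1],#4 steps through an array
    print(regs["r0"])
```

Post-indexing is what lets a loop walk through an array with no separate pointer-increment instruction.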

Traditional RISC processors, like the MIPS or PowerPC, do not provide an automatic stack mechanism. The ARM is closer to a CISC processor in this respect and provides a stack mechanism. Indeed, you can structure the ARM’s stack to grow upwards or downwards, with the stack pointer pointing either at the item on the top of the stack or at the next free location beyond it. Figure 2 illustrates typical ARM stack accesses.

Figure 2 The ARM and the Stack frame
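The four stack organizations (full or empty, ascending or descending) can be sketched with a hypothetical ModelStack class in Python. It pushes and pops one 32-bit word at a time, whereas the ARM’s block-transfer instructions can move several registers in one operation; the 4-byte word size and the starting addresses are assumptions for illustration.

```python
class ModelStack:
    """Hypothetical model of a stack addressed by a stack pointer (r13 by convention).

    full=True:        sp points at the item on top of the stack
    full=False:       sp points at the next free word (an "empty" stack)
    descending=True:  the stack grows towards lower addresses
    """

    WORD = 4  # bytes per 32-bit word

    def __init__(self, sp: int, full: bool = True, descending: bool = True) -> None:
        self.sp = sp
        self.full = full
        self.descending = descending
        self.memory: dict[int, int] = {}

    def push(self, value: int) -> None:
        step = -self.WORD if self.descending else self.WORD
        if self.full:
            self.sp += step               # move to a free word first (pre-indexed)...
            self.memory[self.sp] = value  # ...then store
        else:
            self.memory[self.sp] = value  # store in the free word (post-indexed)...
            self.sp += step               # ...then move on

    def pop(self) -> int:
        step = self.WORD if self.descending else -self.WORD
        if self.full:
            value = self.memory[self.sp]  # read the top item, then release it
            self.sp += step
        else:
            self.sp += step               # step back to the last item, then read it
            value = self.memory[self.sp]
        return value


stack = ModelStack(0x2000)  # full descending, the usual ARM convention
stack.push(10)
stack.push(20)
print(hex(stack.sp), stack.pop(), stack.pop(), hex(stack.sp))
```

A full stack pre-decrements on a push and post-increments on a pop; an empty stack does the reverse, which connects the stack directly to the pre- and post-indexed addressing modes.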

ARM Extensions

One of the most unusual aspects of the ARM family is its set of extensions, or personalizations, that add considerable value to the ARM as a teaching machine. The Pentium processor is a Pentium processor. Always. The ARM can behave exactly like an ARM, but it can also mimic a very different processor; that is, the ARM’s instruction set can be changed at run time to provide a very different model of computation.

In order to understand the ARM’s versatility, you have to appreciate its market. The ARM is intended primarily for end-user embedded applications in systems like MP3 players that are characterized by very low manufacturing costs. Memories must be small and buses narrow (it costs more to implement a 32-bit wide bus than an 8-bit wide bus). Traditionally, embedded applications have been almost exclusively the province of 8- or 16-bit microprocessors because these have higher code densities than RISC processors.

The ARM is designed so that it can dynamically adopt a 16-bit architecture, in the sense that its instructions are 16 bits wide and can be fetched efficiently from a 16-bit memory. Internally, the ARM remains a 32-bit processor with 32-bit registers. The ARM executes its 16-bit instructions while the processor is in the Thumb state.

The ingenuity of the Thumb mode is that its instructions map onto the ARM’s normal instruction set architecture. That is, 16-bit Thumb code is read by the processor and internally converted (i.e., decompressed) into a 32-bit instruction stream that uses the ARM’s existing hardware: the registers and the ALU.

Figure 3 illustrates the Thumb mode architecture. Of the ARM’s 16 general-purpose registers, only eight, r0 to r7, are accessible to most Thumb instructions. Because the Thumb architecture uses a two-address format, with instructions of the form ADD r0,r2, which implements [r0] = [r0] + [r2], only 6 bits per instruction are required to specify register addresses, as opposed to the ARM’s 12 (i.e., three 4-bit fields, each selecting one of 16 registers).

Figure 3 The Thumb-mode Instruction Set Registers
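The register-field arithmetic and the two-address to three-address mapping can be illustrated with a short Python sketch. Both functions are hypothetical: register_field_bits simply multiplies the number of register operands by the bits needed to select one register, and expand_two_address is a simplified model of the decompression idea, not a description of the real Thumb decoder.

```python
def register_field_bits(num_registers: int, operands: int) -> int:
    """Bits needed to name `operands` registers, each chosen from `num_registers`."""
    bits_per_field = (num_registers - 1).bit_length()  # 16 registers -> 4 bits, 8 -> 3 bits
    return operands * bits_per_field


def expand_two_address(op: str, rd: int, rm: int) -> tuple[str, int, int, int]:
    """Simplified decompression step: 'op rd,rm' becomes the three-address 'op rd,rd,rm'."""
    if not (0 <= rd <= 7 and 0 <= rm <= 7):
        raise ValueError("a 3-bit register field can name only r0-r7")
    return op, rd, rd, rm


arm_bits = register_field_bits(16, 3)    # three-address ARM instruction
thumb_bits = register_field_bits(8, 2)   # two-address Thumb instruction
print(arm_bits, thumb_bits)
print(expand_two_address("ADD", 0, 2))   # ADD r0,r2 -> ADD r0,r0,r2
```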

The Thumb mode instruction set is a simplified version of the underlying ARM instruction set. Some of the ARM’s features, such as predicated execution, have been dropped (in the original Thumb instruction set only branches can be conditional). Code compression yields a 30% improvement in code density over native ARM code [4]. Exceptions are always handled in ARM state: the processor leaves the Thumb state when it takes an exception. ARM and Thumb instructions cannot be freely intermixed; special instructions (such as BX) switch the processor between the two states. In practice, a real ARM system will locate some code (including exception handlers) in 32-bit memory and the application Thumb code in 16-bit memory.

ARM has since introduced a second version of Thumb (Thumb-2) that improves Thumb’s performance by allowing 32-bit instructions to be mixed with the 16-bit ones and by extending the instruction set.

Classwork

It’s virtually impossible to teach the architecture of a machine effectively without a working example that students can play with, preferably at home on their PCs. Fortunately, several ARM simulators are available to students, either with a textbook or via the web [3], [7]. Figure 4 shows the output of an ARM simulator with which a student can execute a program line by line, observe how the registers change, view both memory and the stack, and use a simple input/output port.

As well as helping students understand the processor, class-based simulators have an additional spin-off: they encourage students to work in groups and learn successfully from one another. More importantly, simulators free the teacher to help those who are having particular difficulties.

Figure 4 Using the ARM Simulator

The following code, a simple loop that adds up the cubes of the integers 1 to 10, demonstrates that ARM assembly language is easy to follow (MLA is the multiply-and-accumulate instruction).

MOV r0,#0 ;clear total in r0

MOV r1,#1 ;FOR i = 1 to 10

Next MUL r2,r1,r1 ; square number

MLA r0,r2,r1,r0 ; cube and add to total

ADD r1,r1,#1 ; increment number

CMP r1,#11 ; test for end

BNE Next ;END FOR
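Stepping through the loop register by register makes its behaviour easy to verify. The Python sketch below mirrors each instruction in turn, assuming 32-bit registers and a counter that starts at 1, as the FOR i = 1 to 10 comment requires; the function name sum_of_cubes is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sum_of_cubes() -> int:
    r0 = 0                            # MOV r0,#0          clear total
    r1 = 1                            # MOV r1,#1          FOR i = 1 to 10
    while True:
        r2 = (r1 * r1) & MASK32       # Next MUL r2,r1,r1  square number
        r0 = (r2 * r1 + r0) & MASK32  # MLA r0,r2,r1,r0    cube and add to total
        r1 = (r1 + 1) & MASK32        # ADD r1,r1,#1       increment number
        if r1 == 11:                  # CMP r1,#11 / BNE Next: leave when r1 reaches 11
            break
    return r0


print(sum_of_cubes())  # prints 3025
```

A counter that started at 10 instead would leave the loop after a single pass, so the initial value and the CMP limit have to agree.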

Summary

It is impossible to teach a course on computer architecture without introducing a computer, which means either inventing one or using one off the shelf. This paper has suggested that the ARM is an ideal choice for both the professor and the student because it is easy to understand initially, teaching material is widely available, and high-quality ARM simulators can be freely downloaded from the Internet. The advantage of an architecture that allows students to get going from day one cannot be overestimated. If you lose students in the first two weeks, they are likely to stay lost for the rest of the semester.

The purpose of this paper has not been to suggest that the ARM is yet another processor indistinguishable from all the rest. I have presented the case that the ARM has a wealth of interesting, indeed exciting, features such as predicated execution and compressed instruction encoding that can inspire the student. Moreover, some of the concepts that the ARM introduces cut across the computer science curriculum, increasing its value as an educational tool.

References

[1] ACM, IEEE Computer Society, “Computing Curriculum 2001”.

[2] Nelson, V. P., Theys, M. D., Clements, A., “Computer Architecture and Organization in the Model Computer Engineering Curriculum”, Frontiers in Education, Boulder, 2003.

[3] Clements, A., “The undergraduate curriculum in computer architecture”, IEEE Micro, Vol. 20, No. 3, pp. 13-22.

[4] Patterson, D. A., Hennessy, J. L., “Computer Organization and Design”, Morgan Kaufmann, 4th edition, 2009.

[5] Clements, A., “Principles of Computer Hardware”, Oxford University Press, 3rd edition, 2004.

[6] Brewer, E. W., “Professor’s Role in Motivating Students to Attend Class”, Journal of Industrial Teacher Education, Vol. 42, No. 3, 2005.

[7] Hohl, W., “ARM Assembly Language – Fundamentals and Techniques”, CRC Press, 2009.

[8] http://en.wikipedia.org/wiki/ARM_Holdings

[9] Goudge, L., Segars, S., “Thumb: Reducing the cost of 32-bit RISC performance in portable and consumer applications”, Proc. COMPCON ’96.


This is the text of a paper I wrote for a Frontiers in Engineering Education Conference.