Performance – Some Examples

Example 1

A computer executes 2.5 x 1010 instructions in 12 seconds. All instructions except loads take 1 cycle, and loads take 4 cycles. The clock rate is 3,000 MHz. What fraction of the instruction workload consists of register loads?

Solution 1

We need to find the average number of cycles per instruction (which must be greater than 1) and then work out what fraction of loads are needed to achieve this figure.

In 12 seconds 2.5 x 1010 instructions are executed. This period corresponds to 12 x 3,000 x 106 = 3.6 x 1010 cycles.

Consequently, an instruction is executed in 3.6 x 1010/2.5 x 1010 = 1.44 cycles/instruction.

If the fraction of load instructions is f, the fraction of non-

If 3f + 1 = 1.44, then 3f = 0.44 and f = 0.44/3 = 0.137. That is, 13.7% of instructions are load operations.

Example 2

Consider the following loop:

int sum = 0, int x[64]

for (j = 0; j < max, j++){

sum+= x[j]/max;

}

This can be encoded in an ARM-

ADR r0,x ; r0 points to array X

MOV r1,#max ; r1 contains max

MOV r2,#0 ; r2 is loop variable j initialized to 0

MOV r3,#0 ; r3 is sum initialized to 0

Next LDR r4,[r0] ; get x[j]

DIV r4,r4,#max ; calculate r3/max (not ARM code)

ADD r3,r4,r4 ; add new element to running total

ADD r2,r2,#1 ; increment loop variable j

ADD r0,r0,#4 ; point to next element in X

CMP r2,r1 ; test for end of loop

BNE Next ; repeat until all done

Note that the integers of array X are expressed as 32-

ADR 2 CPI

LDR 4 CPI

MOV, ADD,CMP, BNE 1 CPI

DIV 15 CPI

- What is the average CPI for this code if it is considered as static code?
- How many instructions does this code execute in total if the value of max is 50?
- What is the average dynamic CPI for this code
- If the clock frequency is 1 GHz, what is the execution time?
- How can this code be improved to reduce its execution time?
- How many cycles does the optimized code take to execute?

Solution 2

- There are 11 instructions and these take: 2 + 1 + 1 + 1 + 4 + 15 + 1 + 1 + 1 + 1 + 1 = 29 cycles.

The average CPI is 29/11 = 2.64.

- The total number of instructions executed is 4 + 50 x 7 = 354.
- The average CPI is the total number of cycles divided by the total number of instructions. That is:

[(2 + 1 + 1 + 1 ) + 50(4 + 15 + 1 + 1 + 1 + 1 + 1)]/(4 + 50 x 7) = 1205/354 = 3.40

- The execution takes 1,205 cycles. A cycle is 1 ns at a clock rate of 1 GHz. The total time is 1,205 ns or 1.205 ms.
- The code can be improved in several ways. The main improvement is to take the division out of the loop and divide at the end, since p/a + q/a + r/a is (p+q+r)/a. We also use two indices, the element address and the loop count, whereas we need only one. Since the element address goes up in increments of 4, the last address is first + max x 4. We can now write:

ADR r0,x ; r0 points to array X

ADD r1,r0,#4*max ; r1 contains base + 4 x max (multiplication by 4 at compile time)

MOV r3,#0 ; r3 is sum initialized to 0

Next LDR r4,[r0] ; get x[j]

ADD r3,r4,r4 ; add new element to running total

ADD r0,r0,#4 ; point to next element in X

CMP r0,r1 ; test for end of loop

BNE Next ; repeat until all done

DIV r3,r3,#max ; calculate sum/max (take the division out of the loop)

- The code now takes 2 + 1 + 1 + 50 x (2 + 1 + 1 + 1) + 15 cycles = 4 + 250 + 15 = 269 cycles.