﻿ Chap 8

VLIW example

Consider a VLIW computer that can execute pairs of instructions. Each instruction is executed in one clock cycle. However, a load instruction and a multiply each take two clock cycles. When a 2-cycle instruction is encountered, the slot vertically below it is a  stall slot. The slot below it and adjacent can still be used. Note that the code is ARM-like and the multiply instruction has no restriction on the use of its registers and can also take a literal operand.

Write a suitable fragment of code to implement the following construct. A, B, and C are memory locations that are pointed at by [r8], [r8,#4], and [r8,#8], respectively.

18(A2 + B2) + 12C2 + 17C + 25

Solution

We can write down the simple serial code as follows.

LDR  r0,[r8]       ;Get A

MUL  r0,r0,r0      ;A*A

LDR  r1,[r8,#4]    ;Get B

MUL  r1,r1,r1      ;B*B

MUL  r0,r0,#18     ;18(A*A + B*B)

LDR  r1,[r8,#8]    ;Get C reuse r1

MUL  r2,r1,r1      ;C*C

MUL  r3,r1,#12     ;12C*C

ADD  r2,r2,r3      ;18(A*A + B*B) + 12C*C

MUL  r1,r1,#17     ;17C

ADD  r2,r2,r10     ;18(A*A + B*B) + 12C*C + 17C

ADD  r2,r2,#25     ;18(A*A + B*B) + 12C*C + 17C + 25

There are no data hazards in this code. If it were executed, the total time would be 22 cycles (allowing for load and multiply stalls).

The next step is to convert it into VLIW code.

Execution unit 1                         Execution unit 2

LDR  r0,[r8]                     LDR  r1,[r8,#4]

MUL  r0,r0,r0                   MUL  r1,r1,r1

Stall mult                                    Stall mult

Stall – no code available         Stall load

MUL  r3,r1,#12                 MUL  r2,r1,r1

Stall mult                                     Stall mult

In this case the cycle time is 11 and one slot is lost due to a data hazard.

VLIW Example 2

Consider a VLIW with three operation slots per instruction. One slot is dedicated to load, store and branch operations. The other two slots perform all other operations. Load and multiply operations both have a 1 cycle penalty.

Consider the vector operation  zi = 6xi + yi + 12, where index i is from 0 to m 0 - 1. Write the code to perform this operation using loop unrolling with three iterations per loop. The processor is a pseudo ARM with a multiply without register restrictions.

We can write the following simple inline code. Blue is the setup portion. Black is the loop and red is the actual data processing part of the code. This non-VLIW codes would require 14 cycles per iteration (11 instructions and 3 delay slots).

MOV   r3,#N     ;N is the loop count divided by 3

Loop LDR   r4,[r1]   ;get x

LDR   r5,[r2]   ;get y

MUL   r4,r4,#6  ;6x

ADD   r4,r4,#12 ;6x + y + 12

STR   r4,[r0]   ;store z

SUBS  r3,r3,#1  ;decrement loop counter

BNE   Loop

The following demonstrates the code running on a hypothetical VLIW computer. The blue stalls indicate a delay cycle and the red stall are due to an empty slot because no instruction is available.

Load Store Branch                                             Data processing                                             Data processing

Stall no operation          ADR   r2,Y                 MOV   r3,#N

LDR   r4,[r1]               ADD   r1,r1,#4             Stall no operation

Stall load                  Stall no operation         Stall no operation

LDR   r5,[r2]               MUL   r4,r4,#6             ADD   r2,r2,#4

Stall load                  Stall multiply             Stall no operation

Stall no operation          ADD   r4,r4,r5             Stall no operation

Stall no operation          ADD   r4,r4,#12            Stall no operation

STR   r4,[r0]               ADD   r0,r0,#4             SUBS  r3,r3,#1

BNE   Loop                  Stall no operation         Stall no operation

This code takes 8 cycles to execute one pass round the loop. Out of the 8 x 3 execution slots, only 11 are filled. That is, the efficiency is 11/24 = 46%. One iteration is performed in 8 cycles.

Now consider the same system but with loop unrolling. We will perform two cycles of iteration each time we pass round the loop. Below is the linear code. To simplify register naming, we’ve used r4 and rr4, etc., to indicate the same operation at cycle i and at cycle i + 1. Similarly, we’ll use x and xx and y and yy for the pairs of variables.

ADR   r0,Z        ;r0 points to array z

ADR   r1,X        ;r1 points to array x

ADR   r2,Y        ;r2 points to array y

MOV   r3,#N       ;N is the loop count divided by 2

Loop LDR   r4,[r1]     ;get x

LDR   rr4,[r1,#4] ;get xx

LDR   r5,[r2]     ;get y

LDR   rr5,[r2,#4] ;get yy

ADD   r1,r1,#8    ;increment x pointer by 2 places

ADD   r2,r2,#8    ;increment y pointer by 2 places

MUL   r4,r4,#6    ;6x

MUL   rr4,rr4,#6  ;6xx

ADD   r4,r4,#12   ;6x + y + 12

ADD   rr4,rr4,#12 ;6xx + yy + 12

STR   r4,[r0]     ;store z

STR   rr4,[r0,#4] ;store zz

ADD   r0,r0,#8    ;increment z pointer by 2 places

SUBS  r3,r3,#1    ;decrement loop counter

BNE   Loop

We can now write the VLIW code.

Load Store Branch                                                       Data processing                                                    Data processing

Stall no operation               ADR   r2,Y                     MOV   r3,#N

LDR   r4,[r1]                    Stall no operation             Stall no operation

Stall load                       Stall no operation             Stall no operation

LDR   r5,[r2]                    MUL   r4,r4,#6                 Stall no operation

Stall load                       Stall multiply                 Stall no operation

LDR   rr5,[r2,#4]                ADD   r2,r2,#8                 MUL   rr4,rr4,#6

STR   r4,[r0]                    ADD   rr4,rr4,rr5              SUBS  r3,r3,#1

STR   rr4,[r0,#4]                ADD   r0,r0,#8                 BNE   Loop

Note that the code order has been slightly rearranged. The instruction in red, ADD rr4,rr4,#12, is out of sequence but can be executed while we are waiting for the next value of yy.

The total number of cycles per trip is 10 and the number of slots used is 30. Of these 17 are filled giving a slot efficiency of 17/30 = 57%. Since two iterations are performed per trip, the average number of cycles per iteration is 10/2 = 5.

Let’s modify the problem by allowing two slots to perform load and store operations.

Stall no operation               ADR   r2,Y                     MOV   r3,#N

LDR   r4,[r1]                    LDR   r5,[r2]                  Stall no operation

LDR   rr4,[r1,#4]                MUL   r4,r4,#6                 MUL   rr4,rr4,#6

Stall load                       Stall multiply                 Stall multiply

STR   r4,[r0,#4]                 ADD   rr4,rr4,rr5              Stall no operation

Stall no operation               ADD   rr4,rr4,#12              SUBS  r3,r3,#1

STR   rr4,[r0,#4]                ADD   r0,r0,#8                 BNE   Loop

In this case we’ve shaved a cycle off a loop. Now a trip takes 9 cycles (27 slots). The occupancy is 17/27 = 63%.

Now consider the case when all three slots can handle any type of instruction.

Operation1                                                   Operation 2                                              Operation 3

MOV   r3,#N                      Stall no operation             Stall no operation

LDR   r4,[r1]                    LDR   r5,[r2]                  LDR   rr4,[r1,#4]

MUL   r4,r4,#6                   MUL   rr4,rr4,#6               LDR   rr5,[r2,#4]

Stall multiply                   Stall multiply                 Stall load