Chap 8

Home Up

Home Teaching Glossary ARM Processors Supplements Prof issues About

VLIW example

Consider a VLIW computer that can execute pairs of instructions. Each instruction is executed in one clock cycle. However, a load instruction and a multiply each take two clock cycles. When a 2-cycle instruction is encountered, the slot vertically below it is a stall slot. The slot below it and adjacent can still be used. Note that the code is ARM-like and the multiply instruction has no restriction on the use of its registers and can also take a literal operand.

Write a suitable fragment of code to implement the following construct. A, B, and C are memory locations that are pointed at by [r8], [r8,#4], and [r8,#8], respectively.

18(A2 + B2) + 12C2 + 17C + 25

Solution

We can write down the simple serial code as follows.

LDR r0,[r8] ;Get A

MUL r0,r0,r0 ;A*A

LDR r1,[r8,#4] ;Get B

MUL r1,r1,r1 ;B*B

ADD r0,r0,r1 ;A*A + B*B

MUL r0,r0,#18 ;18(A*A + B*B)

LDR r1,[r8,#8] ;Get C reuse r1

MUL r2,r1,r1 ;C*C

MUL r3,r1,#12 ;12C*C

ADD r2,r2,r3 ;18(A*A + B*B) + 12C*C

MUL r1,r1,#17 ;17C

ADD r2,r2,r10 ;18(A*A + B*B) + 12C*C + 17C

ADD r2,r2,#25 ;18(A*A + B*B) + 12C*C + 17C + 25

There are no data hazards in this code. If it were executed, the total time would be 22 cycles (allowing for load and multiply stalls).

The next step is to convert it into VLIW code.

Execution unit 1 Execution unit 2

LDR r0,[r8] LDR r1,[r8,#4]

Stall load Stall load

MUL r0,r0,r0 MUL r1,r1,r1

Stall mult Stall mult

ADD r0,r0,r1 LDR r1,[r8,#8]

Stall – no code available Stall load

MUL r3,r1,#12 MUL r2,r1,r1

Stall mult Stall mult

ADD r2,r2,r3 MUL r1,r1,#17

ADD r2,r2,r10 Stall mult

ADD r2,r2,#25

In this case the cycle time is 11 and one slot is lost due to a data hazard.

VLIW Example 2

Consider a VLIW with three operation slots per instruction. One slot is dedicated to load, store and branch operations. The other two slots perform all other operations. Load and multiply operations both have a 1 cycle penalty.

Consider the vector operation zi = 6xi + yi + 12, where index i is from 0 to m 0 - 1. Write the code to perform this operation using loop unrolling with three iterations per loop. The processor is a pseudo ARM with a multiply without register restrictions.

We can write the following simple inline code. Blue is the setup portion. Black is the loop and red is the actual data processing part of the code. This non-VLIW codes would require 14 cycles per iteration (11 instructions and 3 delay slots).

ADR r0,Z

ADR r1,X

ADR r2,Y

MOV r3,#N ;N is the loop count divided by 3

Loop LDR r4,[r1] ;get x

LDR r5,[r2] ;get y

ADD r1,r1,#4 ;increment x pointer

ADD r2,r2,#4 ;increment y pointer

MUL r4,r4,#6 ;6x

ADD r4,r4,r5 ;6x + y

ADD r4,r4,#12 ;6x + y + 12

STR r4,[r0] ;store z

ADD r0,r0,#4 ;increment z pointer

SUBS r3,r3,#1 ;decrement loop counter

BNE Loop

The following demonstrates the code running on a hypothetical VLIW computer. The blue stalls indicate a delay cycle and the red stall are due to an empty slot because no instruction is available.

Load Store Branch Data processing Data processing

Stall no operation ADR r0,Z ADR r1,X

Stall no operation ADR r2,Y MOV r3,#N

LDR r4,[r1] ADD r1,r1,#4 Stall no operation

Stall load Stall no operation Stall no operation

LDR r5,[r2] MUL r4,r4,#6 ADD r2,r2,#4

Stall load Stall multiply Stall no operation

Stall no operation ADD r4,r4,r5 Stall no operation

Stall no operation ADD r4,r4,#12 Stall no operation

STR r4,[r0] ADD r0,r0,#4 SUBS r3,r3,#1

BNE Loop Stall no operation Stall no operation

This code takes 8 cycles to execute one pass round the loop. Out of the 8 x 3 execution slots, only 11 are filled. That is, the efficiency is 11/24 = 46%. One iteration is performed in 8 cycles.

Now consider the same system but with loop unrolling. We will perform two cycles of iteration each time we pass round the loop. Below is the linear code. To simplify register naming, we’ve used r4 and rr4, etc., to indicate the same operation at cycle i and at cycle i + 1. Similarly, we’ll use x and xx and y and yy for the pairs of variables.

ADR r0,Z ;r0 points to array z

ADR r1,X ;r1 points to array x

ADR r2,Y ;r2 points to array y

MOV r3,#N ;N is the loop count divided by 2

Loop LDR r4,[r1] ;get x

LDR rr4,[r1,#4] ;get xx

LDR r5,[r2] ;get y

LDR rr5,[r2,#4] ;get yy

ADD r1,r1,#8 ;increment x pointer by 2 places

ADD r2,r2,#8 ;increment y pointer by 2 places

MUL r4,r4,#6 ;6x

MUL rr4,rr4,#6 ;6xx

ADD r4,r4,r5 ;6x + y

ADD rr4,rr4,rr5 ;6xx + yy

ADD r4,r4,#12 ;6x + y + 12

ADD rr4,rr4,#12 ;6xx + yy + 12

STR r4,[r0] ;store z

STR rr4,[r0,#4] ;store zz

ADD r0,r0,#8 ;increment z pointer by 2 places

SUBS r3,r3,#1 ;decrement loop counter

BNE Loop

We can now write the VLIW code.

Load Store Branch Data processing Data processing

Stall no operation ADR r0,Z ADR r1,X

Stall no operation ADR r2,Y MOV r3,#N

LDR r4,[r1] Stall no operation Stall no operation

Stall load Stall no operation Stall no operation

LDR r5,[r2] MUL r4,r4,#6 Stall no operation

Stall load Stall multiply Stall no operation

LDR rr4,[r1,#4] ADD r4,r4,r5 ADD r1,r1,#8

Stall load ADD r4,r4,#12 Stall no operation

LDR rr5,[r2,#4] ADD r2,r2,#8 MUL rr4,rr4,#6

Stall load ADD rr4,rr4,#12 Stall multiply

STR r4,[r0] ADD rr4,rr4,rr5 SUBS r3,r3,#1

STR rr4,[r0,#4] ADD r0,r0,#8 BNE Loop

Note that the code order has been slightly rearranged. The instruction in red, ADD rr4,rr4,#12, is out of sequence but can be executed while we are waiting for the next value of yy.

The total number of cycles per trip is 10 and the number of slots used is 30. Of these 17 are filled giving a slot efficiency of 17/30 = 57%. Since two iterations are performed per trip, the average number of cycles per iteration is 10/2 = 5.

Let’s modify the problem by allowing two slots to perform load and store operations.

Load Store Branch Load Store Data processing Data processing

Stall no operation ADR r0,Z ADR r1,X

Stall no operation ADR r2,Y MOV r3,#N

LDR r4,[r1] LDR r5,[r2] Stall no operation

Stall load Stall load Stall no operation

LDR rr4,[r1,#4] MUL r4,r4,#6 MUL rr4,rr4,#6

Stall load Stall multiply Stall multiply

LDR rr5,[r2,#4] ADD r4,r4,r5 ADD r1,r1,#8

Stall load ADD r4,r4,#12 ADD r2,r2,#8

STR r4,[r0,#4] ADD rr4,rr4,rr5 Stall no operation

Stall no operation ADD rr4,rr4,#12 SUBS r3,r3,#1

STR rr4,[r0,#4] ADD r0,r0,#8 BNE Loop

In this case we’ve shaved a cycle off a loop. Now a trip takes 9 cycles (27 slots). The occupancy is 17/27 = 63%.

Now consider the case when all three slots can handle any type of instruction.

Operation1 Operation 2 Operation 3

ADR r0,Z ADR r1,X ADR r2,Y

MOV r3,#N Stall no operation Stall no operation

LDR r4,[r1] LDR r5,[r2] LDR rr4,[r1,#4]

Stall load Stall load Stall load

MUL r4,r4,#6 MUL rr4,rr4,#6 LDR rr5,[r2,#4]

Stall multiply Stall multiply Stall load

ADD rr4,rr4,rr5 ADD r4,r4,r5 ADD r1,r1,#8

ADD rr4,rr4,#12 ADD r4,r4,#12 ADD r2,r2,#8

STR rr4,[r0,#4] STR r4,[r0,#4] SUBS r3,r3,#1

ADD r0,r0,#8 BNE Loop Stall no operation

Now we have 8 cycles per trip using 24 slots of which 17 are filled. The occupancy is 17/24 = 71%. Only one slot per trip is unused because of no available data. Two iterations are performed per trip which means that one iteration takes 8/2 = 4 cycles.

Click here for more detail of the instruction sequencing