VLIW example
Consider a VLIW computer that can execute pairs of instructions. Each instruction
is executed in one clock cycle. However, a load instruction and a multiply each take
two clock cycles. When a 2-
Write a suitable fragment of code to implement the following construct. A, B, and C are memory locations that are pointed at by [r8], [r8,#4], and [r8,#8], respectively.
18(A2 + B2) + 12C2 + 17C + 25
Solution
We can write down the simple serial code as follows.
LDR r0,[r8] ;Get A
MUL r0,r0,r0 ;A*A
LDR r1,[r8,#4] ;Get B
MUL r1,r1,r1 ;B*B
ADD r0,r0,r1 ;A*A + B*B
MUL r0,r0,#18 ;18(A*A + B*B)
LDR r1,[r8,#8] ;Get C reuse r1
MUL r2,r1,r1 ;C*C
MUL r3,r1,#12 ;12C*C
ADD r2,r2,r3 ;18(A*A + B*B) + 12C*C
MUL r1,r1,#17 ;17C
ADD r2,r2,r10 ;18(A*A + B*B) + 12C*C + 17C
ADD r2,r2,#25 ;18(A*A + B*B) + 12C*C + 17C + 25
There are no data hazards in this code. If it were executed, the total time would be 22 cycles (allowing for load and multiply stalls).
The next step is to convert it into VLIW code.
Execution unit 1 Execution unit 2
LDR r0,[r8] LDR r1,[r8,#4]
Stall load Stall load
MUL r0,r0,r0 MUL r1,r1,r1
Stall mult Stall mult
ADD r0,r0,r1 LDR r1,[r8,#8]
Stall – no code available Stall load
MUL r3,r1,#12 MUL r2,r1,r1
Stall mult Stall mult
ADD r2,r2,r3 MUL r1,r1,#17
ADD r2,r2,r10 Stall mult
ADD r2,r2,#25
In this case the cycle time is 11 and one slot is lost due to a data hazard.
VLIW Example 2
Consider a VLIW with three operation slots per instruction. One slot is dedicated to load, store and branch operations. The other two slots perform all other operations. Load and multiply operations both have a 1 cycle penalty.
Consider the vector operation zi = 6xi + yi + 12, where index i is from 0 to m 0
-
We can write the following simple inline code. Blue is the setup portion. Black is
the loop and red is the actual data processing part of the code. This non-
ADR r0,Z
ADR r1,X
ADR r2,Y
MOV r3,#N ;N is the loop count divided by 3
Loop LDR r4,[r1] ;get x
LDR r5,[r2] ;get y
ADD r1,r1,#4 ;increment x pointer
ADD r2,r2,#4 ;increment y pointer
MUL r4,r4,#6 ;6x
ADD r4,r4,r5 ;6x + y
ADD r4,r4,#12 ;6x + y + 12
STR r4,[r0] ;store z
ADD r0,r0,#4 ;increment z pointer
SUBS r3,r3,#1 ;decrement loop counter
BNE Loop
The following demonstrates the code running on a hypothetical VLIW computer. The blue stalls indicate a delay cycle and the red stall are due to an empty slot because no instruction is available.
Load Store Branch Data processing Data processing
Stall no operation ADR r0,Z ADR r1,X
Stall no operation ADR r2,Y MOV r3,#N
LDR r4,[r1] ADD r1,r1,#4 Stall no operation
Stall load Stall no operation Stall no operation
LDR r5,[r2] MUL r4,r4,#6 ADD r2,r2,#4
Stall load Stall multiply Stall no operation
Stall no operation ADD r4,r4,r5 Stall no operation
Stall no operation ADD r4,r4,#12 Stall no operation
STR r4,[r0] ADD r0,r0,#4 SUBS r3,r3,#1
BNE Loop Stall no operation Stall no operation
This code takes 8 cycles to execute one pass round the loop. Out of the 8 x 3 execution slots, only 11 are filled. That is, the efficiency is 11/24 = 46%. One iteration is performed in 8 cycles.
Now consider the same system but with loop unrolling. We will perform two cycles of iteration each time we pass round the loop. Below is the linear code. To simplify register naming, we’ve used r4 and rr4, etc., to indicate the same operation at cycle i and at cycle i + 1. Similarly, we’ll use x and xx and y and yy for the pairs of variables.
ADR r0,Z ;r0 points to array z
ADR r1,X ;r1 points to array x
ADR r2,Y ;r2 points to array y
MOV r3,#N ;N is the loop count divided by 2
Loop LDR r4,[r1] ;get x
LDR rr4,[r1,#4] ;get xx
LDR r5,[r2] ;get y
LDR rr5,[r2,#4] ;get yy
ADD r1,r1,#8 ;increment x pointer by 2 places
ADD r2,r2,#8 ;increment y pointer by 2 places
MUL r4,r4,#6 ;6x
MUL rr4,rr4,#6 ;6xx
ADD r4,r4,r5 ;6x + y
ADD rr4,rr4,rr5 ;6xx + yy
ADD r4,r4,#12 ;6x + y + 12
ADD rr4,rr4,#12 ;6xx + yy + 12
STR r4,[r0] ;store z
STR rr4,[r0,#4] ;store zz
ADD r0,r0,#8 ;increment z pointer by 2 places
SUBS r3,r3,#1 ;decrement loop counter
BNE Loop
We can now write the VLIW code.
Load Store Branch Data processing Data processing
Stall no operation ADR r0,Z ADR r1,X
Stall no operation ADR r2,Y MOV r3,#N
LDR r4,[r1] Stall no operation Stall no operation
Stall load Stall no operation Stall no operation
LDR r5,[r2] MUL r4,r4,#6 Stall no operation
Stall load Stall multiply Stall no operation
LDR rr4,[r1,#4] ADD r4,r4,r5 ADD r1,r1,#8
Stall load ADD r4,r4,#12 Stall no operation
LDR rr5,[r2,#4] ADD r2,r2,#8 MUL rr4,rr4,#6
Stall load ADD rr4,rr4,#12 Stall multiply
STR r4,[r0] ADD rr4,rr4,rr5 SUBS r3,r3,#1
STR rr4,[r0,#4] ADD r0,r0,#8 BNE Loop
Note that the code order has been slightly rearranged. The instruction in red, ADD rr4,rr4,#12, is out of sequence but can be executed while we are waiting for the next value of yy.
The total number of cycles per trip is 10 and the number of slots used is 30. Of these 17 are filled giving a slot efficiency of 17/30 = 57%. Since two iterations are performed per trip, the average number of cycles per iteration is 10/2 = 5.
Let’s modify the problem by allowing two slots to perform load and store operations.
Load Store Branch Load Store Data processing Data processing
Stall no operation ADR r0,Z ADR r1,X
Stall no operation ADR r2,Y MOV r3,#N
LDR r4,[r1] LDR r5,[r2] Stall no operation
Stall load Stall load Stall no operation
LDR rr4,[r1,#4] MUL r4,r4,#6 MUL rr4,rr4,#6
Stall load Stall multiply Stall multiply
LDR rr5,[r2,#4] ADD r4,r4,r5 ADD r1,r1,#8
Stall load ADD r4,r4,#12 ADD r2,r2,#8
STR r4,[r0,#4] ADD rr4,rr4,rr5 Stall no operation
Stall no operation ADD rr4,rr4,#12 SUBS r3,r3,#1
STR rr4,[r0,#4] ADD r0,r0,#8 BNE Loop
In this case we’ve shaved a cycle off a loop. Now a trip takes 9 cycles (27 slots). The occupancy is 17/27 = 63%.
Now consider the case when all three slots can handle any type of instruction.
Operation1 Operation 2 Operation 3
ADR r0,Z ADR r1,X ADR r2,Y
MOV r3,#N Stall no operation Stall no operation
LDR r4,[r1] LDR r5,[r2] LDR rr4,[r1,#4]
Stall load Stall load Stall load
MUL r4,r4,#6 MUL rr4,rr4,#6 LDR rr5,[r2,#4]
Stall multiply Stall multiply Stall load
ADD rr4,rr4,rr5 ADD r4,r4,r5 ADD r1,r1,#8
ADD rr4,rr4,#12 ADD r4,r4,#12 ADD r2,r2,#8
STR rr4,[r0,#4] STR r4,[r0,#4] SUBS r3,r3,#1
ADD r0,r0,#8 BNE Loop Stall no operation
Now we have 8 cycles per trip using 24 slots of which 17 are filled. The occupancy is 17/24 = 71%. Only one slot per trip is unused because of no available data. Two iterations are performed per trip which means that one iteration takes 8/2 = 4 cycles.