# Optimizing ARM Assembly

Computer Organization and Assembly Languages Yung-Yu Chuang

with slides by Peng-Sheng Chen

## **ARM** optimization

- Utilize ARM ISA's features
  - Conditional execution
  - Multiple register load/store
  - Scaled register operand
  - Addressing modes

### Optimization

- Compilers do perform optimization, but they have blind spots. Some optimization techniques cannot be applied explicitly when writing C, for example:
  - Instruction scheduling
  - Register allocation
  - Conditional execution

You have to use hand-written assembly to optimize critical routines.

- The ARM9TDMI is used as the example, but the rules apply to all ARM cores.
- Note that the code is sometimes in *armasm* format, not *gas*.

## Instruction scheduling

- ARM9 pipeline: five stages (Fetch, Decode, ALU, LS1, LS2)
- Hazard/Interlock: if an instruction needs the result of a previous instruction and that result is not yet available, the processor stalls.

### Instruction scheduling

- No hazard, 2 cycles
- One-cycle interlock



### Instruction scheduling

- One-cycle interlock, 4 cycles

```
LDRB r1, [r2, #1]
ADD  r0, r0, r2     ; no effect on performance
EOR  r0, r0, r1
```

| Pipeline | Fetch | Decode | ALU  | LS1  | LS2  |
|----------|-------|--------|------|------|------|
| Cycle 1  | EOR   | ADD    | LDRB |      |      |
| Cycle 2  |       | EOR    | ADD  | LDRB |      |
| Cycle 3  |       |        | EOR  | ADD  | LDRB |
| Cycle 4  |       |        | EOR  | _    | ADD  |

# Instruction scheduling

- A branch takes 3 cycles due to stalls

```
MOV r1, #1
B case1
AND r0, r0, r1
EOR r2, r2, r3
...
case1
SUB r0, r0, r1
```

| Pipeline | Fetch | Decode | ALU | LS1 | LS2 |
|----------|-------|--------|-----|-----|-----|
| Cycle 1  | AND   | B      | MOV |     |     |
| Cycle 2  | EOR   | AND    | B   | MOV |     |
| Cycle 3  | SUB   | _      | _   | B   | MOV |
| Cycle 4  |       | SUB    | _   | _   | B   |
| Cycle 5  |       | ...    | SUB | _   | _   |

# Scheduling of load instructions

- Loads occur frequently in compiled code, accounting for approximately one third of all instructions.
- Careful scheduling of loads can avoid stalls.

```
void str_tolower(char *out, char *in)
{
  unsigned int c;

  do
  {
    c = *(in++);
    if (c>='A' && c<='Z')
    {
       c = c + ('a' -'A');
    }
    *(out++) = (char)c;
  } while (c);
}
```

# Scheduling of load instructions

```
str_tolower
        LDRB    r2, [r1], #1        ; c = *(in++)
        SUB     r3, r2, #0x41       ; r3 = c - 'A'
        CMP     r3, #0x19           ; if (r3 <= 'Z'-'A')
        ADDLS   r2, r2, #0x20       ;   c += 'a' - 'A'
        STRB    r2, [r0], #1        ; *(out++) = (char)c
        CMP     r2, #0              ; if (c != 0)
        BNE     str_tolower         ;   goto str_tolower
        MOV     pc, r14             ; return
```

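The SUB uses the result of the LDRB immediately, which causes a 2-cycle stall. The loop takes 11 cycles per character in total. The stall can be avoided by preloading or unrolling; the key is to do useful work while waiting for the loaded data.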

### Load scheduling by preloading

- Preloading: load the data required by a loop iteration at the end of the previous iteration, rather than at the beginning of the current one.
- Since iteration i loads the data for iteration i+1, the first and last iterations need special handling.
- For the first iteration, insert an extra load before the loop. For the last iteration, be careful not to read any data. This can be done efficiently with conditional execution.

### Load scheduling by preloading

```
out     RN 0                        ; pointer to output string
in      RN 1                        ; pointer to input string
c       RN 2                        ; character loaded
t       RN 3                        ; scratch register

        ; void str_tolower_preload(char *out, char *in)
str_tolower_preload
        LDRB    c, [in], #1         ; c = *(in++)
loop
        SUB     t, c, #'A'          ; t = c-'A'
        CMP     t, #'Z'-'A'         ; if (t <= 'Z'-'A')
        ADDLS   c, c, #'a'-'A'      ;   c += 'a'-'A';
        STRB    c, [out], #1        ; *(out++) = (char)c;
        TEQ     c, #0               ; test if c==0
        LDRNEB  c, [in], #1         ; if (c!=0) { c=*in++;
        BNE     loop                ;             goto loop; }
        MOV     pc, lr              ; return
```

9 cycles per character, a speedup of 11/9 ~ 1.22.
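The same control flow can be sketched in C. This is only an illustration under assumptions: the name `str_tolower_preload_c` is hypothetical, and a compiler is free to schedule the loads differently, so the sketch shows the structure (one load before the loop, the next load issued at the end of each iteration and skipped after the terminator) rather than guaranteed timing.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of load preloading (hypothetical name, not from the slides).
 * The character for iteration i+1 is loaded at the end of iteration i,
 * and no load is issued once the terminating 0 has been stored. */
void str_tolower_preload_c(char *out, const char *in)
{
    assert(out != NULL && in != NULL);

    unsigned int c = (unsigned char)*in++;      /* extra load before the loop */

    for (;;) {
        /* one unsigned compare covers the range 'A'..'Z' */
        if (c - 'A' <= (unsigned int)('Z' - 'A'))
            c += 'a' - 'A';
        *out++ = (char)c;
        if (c == 0)
            break;                              /* last iteration: read nothing more */
        c = (unsigned char)*in++;               /* preload for the next iteration */
    }
}
```

The unsigned subtraction plays the role of the `SUB`/`CMP`/`ADDLS` sequence: values below `'A'` wrap around to large numbers and fail the comparison.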

### Load scheduling by unrolling

- Unroll and interleave the body of the loop. For example, we can process three iterations together.
- When the result of an operation from iteration i is not ready, we can perform an operation from iteration i+1 that does not depend on that result.

### Load scheduling by unrolling

```
out     RN 0                        ; pointer to output string
in      RN 1                        ; pointer to input string
ca0     RN 2                        ; character 0
t       RN 3                        ; scratch register
ca1     RN 12                       ; character 1
ca2     RN 14                       ; character 2

        ; void str_tolower_unrolled(char *out, char *in)
str_tolower_unrolled
        STMFD   sp!, {lr}           ; function entry
loop_next3
        LDRB    ca0, [in], #1       ; ca0 = *in++;
        LDRB    ca1, [in], #1       ; ca1 = *in++;
        LDRB    ca2, [in], #1       ; ca2 = *in++;
        SUB     t, ca0, #'A'        ; convert ca0 to lower case
        CMP     t, #'Z'-'A'
        ADDLS   ca0, ca0, #'a'-'A'
        SUB     t, ca1, #'A'        ; convert ca1 to lower case
        CMP     t, #'Z'-'A'
        ADDLS   ca1, ca1, #'a'-'A'
        SUB     t, ca2, #'A'        ; convert ca2 to lower case
        CMP     t, #'Z'-'A'
        ADDLS   ca2, ca2, #'a'-'A'
        STRB    ca0, [out], #1      ; *out++ = ca0;
        TEQ     ca0, #0             ; if (ca0!=0)
        STRNEB  ca1, [out], #1      ;   *out++ = ca1;
        TEQNE   ca1, #0             ; if (ca0!=0 && ca1!=0)
        STRNEB  ca2, [out], #1      ;   *out++ = ca2;
        TEQNE   ca2, #0             ; if (ca0!=0 && ca1!=0 && ca2!=0)
        BNE     loop_next3          ;   goto loop_next3;
        LDMFD   sp!, {pc}           ; return;
```

The loop takes 21 cycles for three characters, i.e. 7 cycles per character, a speedup of 11/7 ~ 1.57.

The cost is more than double the code size, so unrolling is only efficient for large data sizes.
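A rough C rendering of the unrolled loop is sketched below. The function name `str_tolower_unrolled_c` and the helper `lower1` are hypothetical. The sketch also makes one assumption explicit: because three characters are loaded before any of them is tested, the input buffer must remain readable for up to two bytes past the terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: lower-case one character with a single unsigned compare. */
static unsigned int lower1(unsigned int c)
{
    if (c - 'A' <= (unsigned int)('Z' - 'A'))
        c += 'a' - 'A';
    return c;
}

/* Sketch of the three-way unrolled loop (hypothetical name).
 * ASSUMPTION: `in` is readable up to two bytes beyond its terminating 0,
 * because all three loads are issued before the first test. */
void str_tolower_unrolled_c(char *out, const char *in)
{
    assert(out != NULL && in != NULL);

    for (;;) {
        /* three independent loads, issued back to back */
        unsigned int c0 = (unsigned char)in[0];
        unsigned int c1 = (unsigned char)in[1];
        unsigned int c2 = (unsigned char)in[2];
        in += 3;

        /* three independent conversions that can be interleaved */
        c0 = lower1(c0);
        c1 = lower1(c1);
        c2 = lower1(c2);

        /* stores stop at the first terminator, like STRNEB/TEQNE */
        *out++ = (char)c0;
        if (c0 == 0) return;
        *out++ = (char)c1;
        if (c1 == 0) return;
        *out++ = (char)c2;
        if (c2 == 0) return;
    }
}
```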

## Register allocation

- APCS requires the callee to save R4~R11 and to keep the stack 8-byte aligned.

```
routine_name
        STMFD   sp!, {r4-r12, lr}   ; stack saved registers
        ; body of routine
        ; the fourteen registers r0-r12 and lr are available
        LDMFD   sp!, {r4-r12, pc}   ; restore registers and return
```

- Do not use sp (R13) and pc (R15). That leaves a total of 14 general-purpose registers.
- We stack R12 only to keep the stack 8-byte aligned.

### Register allocation

Unroll the loop to handle 8 words at a time and to use multiple load/store.

```
shift_bits
        STMFD   sp!, {r4-r11, lr}       ; save registers
        RSB     kr, k, #32              ; kr = 32-k;
        MOV     carry, #0
loop
        LDMIA   in!, {x_0-x_7}          ; load 8 words
        ORR     y_0, carry, x_0, LSL k  ; shift the 8 words
        MOV     carry, x_0, LSR kr
        ORR     y_1, carry, x_1, LSL k
        MOV     carry, x_1, LSR kr
        ORR     y_2, carry, x_2, LSL k
        MOV     carry, x_2, LSR kr
        ORR     y_3, carry, x_3, LSL k
        MOV     carry, x_3, LSR kr
        ORR     y_4, carry, x_4, LSL k
        MOV     carry, x_4, LSR kr
        ORR     y_5, carry, x_5, LSL k
        MOV     carry, x_5, LSR kr
        ORR     y_6, carry, x_6, LSL k
        MOV     carry, x_6, LSR kr
        ORR     y_7, carry, x_7, LSL k
        MOV     carry, x_7, LSR kr
        STMIA   out!, {y_0-y_7}         ; store 8 words
        SUBS    N, N, #256              ; N -= (8 words * 32 bits)
        BNE     loop                    ; if (N!=0) goto loop;
        MOV     r0, carry               ; return carry;
        LDMFD   sp!, {r4-r11, pc}
```
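For reference, the computation performed by the `shift_bits` listing can be sketched in C. The name `shift_bits_c` and the exact prototype are assumptions made for illustration. The sketch processes one word per iteration instead of eight, and it restricts `k` to 1..31 because a C shift by 32 is undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C reference: shift an N-bit integer left by k bits.
 * Words are stored least significant first.
 * ASSUMPTIONS: N is a multiple of 32 and 0 < k < 32. */
uint32_t shift_bits_c(uint32_t *out, const uint32_t *in,
                      unsigned int N, unsigned int k)
{
    assert(N % 32 == 0 && k > 0 && k < 32);

    unsigned int kr = 32 - k;           /* kr = 32-k */
    uint32_t carry = 0;

    for (unsigned int i = 0; i < N / 32; i++) {
        uint32_t x = in[i];
        out[i] = carry | (x << k);      /* ORR y_i, carry, x_i, LSL k */
        carry = x >> kr;                /* MOV carry, x_i, LSR kr     */
    }
    return carry;                       /* bits shifted out of the top word */
}
```

Each iteration needs only `x`, `carry`, `k` and `kr` live at once; the register pressure in the assembly version comes from unrolling to eight words so that `LDMIA`/`STMIA` can be used.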

## Register allocation

- What variables do we have?

| Arguments | Read-in   | Overlap    |
|-----------|-----------|------------|
| out RN 0  | x_0 RN 5  | y_0 RN 4   |
| in RN 1   | x_1 RN 6  | y_1 RN x_0 |
| N RN 2    | x_2 RN 7  | y_2 RN x_1 |
| k RN 3    | x_3 RN 8  | y_3 RN x_2 |
|           | x_4 RN 9  | y_4 RN x_3 |
|           | x_5 RN 10 | y_5 RN x_4 |
|           | x_6 RN 11 | y_6 RN x_5 |
|           | x_7 RN 12 | y_7 RN x_6 |

- We still need to assign carry and kr, but we have used 13 registers and only one remains. Possible ways out:
  - Work on 4 words instead
  - Use the stack to save the least-used variable, here N
  - Alter the code

### Register allocation

- We notice that **carry** does not need to stay in the same register. Thus, we can use y_i for it.

```
kr      RN lr

shift_bits
        STMFD   sp!, {r4-r11, lr}       ; save registers
        RSB     kr, k, #32              ; kr = 32-k;
        MOV     y_0, #0                 ; initial carry
loop
        LDMIA   in!, {x_0-x_7}          ; load 8 words
        ORR     y_0, y_0, x_0, LSL k    ; shift the 8 words
        MOV     y_1, x_0, LSR kr        ; recall x_0 = y_1
        ORR     y_1, y_1, x_1, LSL k
        MOV     y_2, x_1, LSR kr
        ORR     y_2, y_2, x_2, LSL k
        MOV     y_3, x_2, LSR kr
        ORR     y_3, y_3, x_3, LSL k
        MOV     y_4, x_3, LSR kr
        ORR     y_4, y_4, x_4, LSL k
        MOV     y_5, x_4, LSR kr
        ORR     y_5, y_5, x_5, LSL k
        MOV     y_6, x_5, LSR kr
        ORR     y_6, y_6, x_6, LSL k
        MOV     y_7, x_6, LSR kr
        ORR     y_7, y_7, x_7, LSL k
        STMIA   out!, {y_0-y_7}         ; store 8 words
        MOV     y_0, x_7, LSR kr
        SUBS    N, N, #256              ; N -= (8 words * 32 bits)
        BNE     loop                    ; if (N!=0) goto loop;
        MOV     r0, y_0                 ; return carry;
        LDMFD   sp!, {r4-r11, pc}
```

This is often an iterative process until all variables are assigned to registers.

### More than 14 local variables

- If you need more than 14 local variables, then you store some on the stack.
- Work outwards from the inner loops since they have more performance impact.

```
; body of loop 3
B{cond} loop3
LDMFD sp!, {loop2 registers}
; body of loop 2
B{cond} loop2
LDMFD sp!, {loop1 registers}
; body of loop 1
B{cond} loop1
LDMFD sp!, {r4-r11, pc}
```

# Packing

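- When shifting by a register amount, the ARM uses only bits 0~7 of the register and ignores the other bits.
- Example: shift an array of 40 entries by `shift` bits, keeping the loop count and the shift amount in one register.

```
cntshf = (count << 8) + shift

Bit  31 ............ 8  7 ...... 0
     |      count      |  shift  |
```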

## **Packing**

- Pack multiple (sub-32-bit) variables into a single register.

```
; indinc = (index<<16) + increment
LDRB sample, [table, indinc, LSR#16]   ; table[index]
ADD  indinc, indinc, indinc, LSL#16    ; index+=increment
```
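The same packing can be written in C as a small sketch. The function name `next_sample` is hypothetical; the layout follows the two instructions above, with the index in the top 16 bits and the increment in the bottom 16 bits of one word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: index and increment share one 32-bit word,
 * indinc = (index << 16) + increment. */
uint8_t next_sample(const uint8_t *table, uint32_t *indinc)
{
    assert(table != NULL && indinc != NULL);

    uint8_t sample = table[*indinc >> 16];  /* table[index]            */
    *indinc += *indinc << 16;               /* index += increment      */
    return sample;                          /* low 16 bits are intact  */
}
```

Adding `indinc << 16` moves the increment into the index field; the addition wraps modulo 2^32, so the index behaves as a 16-bit counter.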

## **Packing**

```
out     RN 0    ; address of the output array
in      RN 1    ; address of the input array
cntshf  RN 2    ; count and shift right amount
x       RN 3    ; scratch variable

        ; void shift_right(int *out, int *in, unsigned shift);
shift_right
        ADD     cntshf, cntshf, #39<<8  ; count = 39
shift_loop
        LDR     x, [in], #4
        SUBS    cntshf, cntshf, #1<<8   ; decrement count
        MOV     x, x, ASR cntshf        ; shift by shift
        STR     x, [out], #4
        BGE     shift_loop              ; continue if count>=0
        MOV     pc, lr
```
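A C sketch of the same trick is given below, with the hypothetical name `shift_right_c`. Two things differ from the assembly and are assumptions of the sketch: C has no implicit "use only bits 0~7" rule, so the shift amount is masked explicitly, and `>>` on a negative `int32_t` is implementation-defined (arithmetic on common compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: count and shift amount packed into one word,
 * cntshf = (count << 8) + shift. Processes 40 entries.
 * ASSUMPTION: shift < 32. */
void shift_right_c(int32_t *out, const int32_t *in, unsigned int shift)
{
    assert(shift < 32);

    int32_t cntshf = (int32_t)shift + (39 << 8);    /* count = 39 */

    do {
        int32_t x = *in++;
        cntshf -= 1 << 8;                   /* decrement count; low byte untouched */
        *out++ = x >> (cntshf & 0xFF);      /* explicit mask replaces the ARM rule */
    } while (cntshf >= 0);                  /* continue if count >= 0 */
}
```

Starting from 39 and testing after the decrement gives exactly 40 iterations; on the last one the count field is -1, but the low byte still holds the shift amount.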

# Packing

- Simulate SIMD (single instruction multiple data).
- Assume that we want to merge two images X and Y to produce Z by

$$z_n = (ax_n + (256 - a)y_n)/256$$
$$0 \le a \le 256$$
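Rearranging the formula shows why one multiplication per pixel is enough (a worked step added for illustration; $w_n$ is a name introduced here for the numerator):

$$z_n = \frac{a x_n + (256 - a) y_n}{256} = \frac{a (x_n - y_n) + 256\, y_n}{256} = \frac{w_n}{256}$$

With $0 \le a \le 256$ and 8-bit pixels $0 \le x_n, y_n \le 255$, the numerator satisfies $0 \le w_n \le 255 \cdot 256 = 65280 < 2^{16}$, so each $w_n$ fits in 16 bits and two of them can share one 32-bit register.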

# Example

[Figure: blended images for $\alpha = 0.75$, $\alpha = 0.5$, and $\alpha = 0.25$]

# **Packing**

- Load 4 bytes at a time (byte lanes start at bits 24, 16, 8, and 0):

$$[x_3, x_2, x_1, x_0] = x_3 2^{24} + x_2 2^{16} + x_1 2^8 + x_0$$

- Unpack it and promote to 16-bit data (lanes occupy bits 31-16 and 15-0):

$$[x_2, x_0] = x_2 2^{16} + x_0$$

- Work on 176x144 images

## **Packing**

| Name         | Definition | Comment                                        |
|--------------|------------|------------------------------------------------|
| IMAGE_WIDTH  | EQU 176    | QCIF width                                     |
| IMAGE_HEIGHT | EQU 144    | QCIF height                                    |
| pz           | RN 0       | pointer to destination image (word aligned)    |
| px           | RN 1       | pointer to first source image (word aligned)   |
| py           | RN 2       | pointer to second source image (word aligned)  |
| a            | RN 3       | 8-bit scaling factor (0-256)                   |
| xx           | RN 4       | holds four x pixels [x3, x2, x1, x0]           |
| yy           | RN 5       | holds four y pixels [y3, y2, y1, y0]           |
| x            | RN 6       | holds two expanded x pixels [x2, x0]           |
| y            | RN 7       | holds two expanded y pixels [y2, y0]           |
| z            | RN 8       | holds four z pixels [z3, z2, z1, z0]           |
| count        | RN 12      | number of pixels remaining                     |
| mask         | RN 14      | constant mask with value 0x00ff00ff            |

# **Packing**

```
        ; void merge_images(char *pz, char *px, char *py, int a)
merge_images
        STMFD   sp!, {r4-r8, lr}
        MOV     count, #IMAGE_WIDTH*IMAGE_HEIGHT
        LDR     mask, =0x00FF00FF       ; [    0, 0xFF,    0, 0xFF ]
merge_loop
        LDR     xx, [px], #4            ; [   x3,   x2,   x1,   x0 ]
        LDR     yy, [py], #4            ; [   y3,   y2,   y1,   y0 ]
        AND     x, mask, xx             ; [    0,   x2,    0,   x0 ]
        AND     y, mask, yy             ; [    0,   y2,    0,   y0 ]
        SUB     x, x, y                 ; [ (x2-y2),     (x0-y0) ]
        MUL     x, a, x                 ; [ a*(x2-y2), a*(x0-y0) ]
        ADD     x, x, y, LSL#8          ; [        w2,        w0 ]
        AND     z, mask, x, LSR#8       ; [    0,   z2,    0,   z0 ]
        AND     x, mask, xx, LSR#8      ; [    0,   x3,    0,   x1 ]
        AND     y, mask, yy, LSR#8      ; [    0,   y3,    0,   y1 ]
        SUB     x, x, y                 ; [ (x3-y3),     (x1-y1) ]
        MUL     x, a, x                 ; [ a*(x3-y3), a*(x1-y1) ]
        ADD     x, x, y, LSL#8          ; [        w3,        w1 ]
        AND     x, mask, x, LSR#8       ; [    0,   z3,    0,   z1 ]
        ORR     z, z, x, LSL#8          ; [   z3,   z2,   z1,   z0 ]
        STR     z, [pz], #4             ; store four z pixels
        SUBS    count, count, #4
        BGT     merge_loop
        LDMFD   sp!, {r4-r8, pc}
```
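The body of the merge loop can be mirrored in C to check the packed arithmetic on a host machine. The function name `merge4` is hypothetical; it blends one 32-bit word (four pixels) and relies on unsigned arithmetic wrapping modulo 2^32, which is what makes the negative intermediate differences harmless.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: blend four packed 8-bit pixels,
 * z = (a*x + (256-a)*y) / 256 per pixel, two pixels per multiply.
 * ASSUMPTION: 0 <= a <= 256. */
uint32_t merge4(uint32_t xx, uint32_t yy, uint32_t a)
{
    const uint32_t mask = 0x00FF00FFu;
    assert(a <= 256);

    uint32_t x = xx & mask;                 /* [ 0, x2, 0, x0 ] */
    uint32_t y = yy & mask;                 /* [ 0, y2, 0, y0 ] */
    x = a * (x - y) + (y << 8);             /* [ w2, w0 ], wraps modulo 2^32 */
    uint32_t z = (x >> 8) & mask;           /* [ 0, z2, 0, z0 ] */

    x = (xx >> 8) & mask;                   /* [ 0, x3, 0, x1 ] */
    y = (yy >> 8) & mask;                   /* [ 0, y3, 0, y1 ] */
    x = a * (x - y) + (y << 8);             /* [ w3, w1 ] */
    x = (x >> 8) & mask;                    /* [ 0, z3, 0, z1 ] */

    return z | (x << 8);                    /* [ z3, z2, z1, z0 ] */
}
```

With `a = 256` the result equals the first image and with `a = 0` the second, which makes a quick sanity check for the lane handling.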

#### Conditional execution

- By combining conditional execution and conditional setting of the flags, you can implement simple if statements without any need of branches.
- This improves efficiency since branches can take many cycles and also reduces code size.

#### **Conditional execution**

```
; if ((c>='A' && c<='Z') || (c>='a' && c<='z'))
; {
;    letter++;
; }

        SUB     temp, c, #'A'
        CMP     temp, #'Z'-'A'
        SUBHI   temp, c, #'a'
        CMPHI   temp, #'z'-'a'
        ADDLS   letter, letter, #1
```
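The range-check trick behind this sequence can be expressed in C as well. The names `is_letter` and `count_letters` are hypothetical. Each range test is a single unsigned comparison after a subtraction, and the second test is evaluated only when the first one fails, just as `SUBHI`/`CMPHI` execute only on the HI condition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the letter test: one unsigned compare per range. */
int is_letter(unsigned int c)
{
    unsigned int temp = c - 'A';                    /* SUB   temp, c, #'A'    */
    if (temp > (unsigned int)('Z' - 'A'))           /* CMP ... HI             */
        temp = c - 'a';                             /* SUBHI temp, c, #'a'    */
    return temp <= (unsigned int)('z' - 'a');       /* CMPHI ... then LS test */
}

/* Counts the letters in a 0-terminated string using the test above. */
unsigned int count_letters(const char *s)
{
    assert(s != NULL);

    unsigned int letter = 0;
    for (; *s != '\0'; s++)
        letter += (unsigned int)is_letter((unsigned char)*s);
    return letter;
}
```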

# Block copy example

```
void bcopy(char *to, char *from, int n)
{
  while (n--)
    *to++ = *from++;
}
```

## Block copy example

```
@ arguments: R0: to, R1: from, R2: n
@ rewrite "n--" as "--n>=0"
bcopy: SUBS R2, R2, #1
    LDRPLB R3, [R1], #1
    STRPLB R3, [R0], #1
    BPL bcopy
    MOV PC, LR
```

### Block copy example

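```
@ arguments: R0: to, R1: from, R2: n
@ assume n is a multiple of 4; loop unrolling
bcopy:  SUBS    R2, R2, #4
        LDRPLB  R3, [R1], #1    @ four byte copies per iteration
        STRPLB  R3, [R0], #1
        LDRPLB  R3, [R1], #1
        STRPLB  R3, [R0], #1
        LDRPLB  R3, [R1], #1
        STRPLB  R3, [R0], #1
        LDRPLB  R3, [R1], #1
        STRPLB  R3, [R0], #1
        BPL     bcopy
        MOV     PC, LR
```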

# Block copy example

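```
@ arguments: R0: to, R1: from, R2: n
@ n is a multiple of 16;
bcopy:  SUBS    R2, R2, #16
        LDRPL   R3, [R1], #4    @ four word copies per iteration
        STRPL   R3, [R0], #4
        LDRPL   R3, [R1], #4
        STRPL   R3, [R0], #4
        LDRPL   R3, [R1], #4
        STRPL   R3, [R0], #4
        LDRPL   R3, [R1], #4
        STRPL   R3, [R0], #4
        BPL     bcopy
        MOV     PC, LR
```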

### Block copy example

```
@ arguments: R0: to, R1: from, R2: n
@ n is a multiple of 16;
@ note: R4-R6 are callee-saved under APCS; stack them if the caller relies on them
bcopy:  SUBS    R2, R2, #16
        LDMPL   R1!, {R3-R6}
        STMPL   R0!, {R3-R6}
        BPL     bcopy
        MOV     PC, LR

@ could be extended to copy 40 bytes at a time
@ if n is not a multiple of 40, add a copy_rest loop
```
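The "block loop plus copy_rest loop" structure mentioned in the comments can be sketched in C. The name `bcopy16_c` is hypothetical, and `memcpy` of a fixed 16 bytes merely stands in for the `LDM`/`STM` pair; whether a compiler turns it into multiple-register transfers is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copy 16 bytes per iteration, then the remainder. */
void bcopy16_c(char *to, const char *from, size_t n)
{
    assert(to != NULL && from != NULL);

    while (n >= 16) {               /* main loop: one block per iteration */
        memcpy(to, from, 16);       /* stands in for LDM/STM of four words */
        to += 16;
        from += 16;
        n -= 16;
    }
    while (n--)                     /* copy_rest: remaining 0..15 bytes */
        *to++ = *from++;
}
```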

### Search example

```
int main(void)
{
  int a[10]={7,6,4,5,5,1,3,2,9,8};
  int i;
  int s=4;

  for (i=0; i<10; i++)
    if (s==a[i]) break;
  if (i>=10) return -1;
  else return i;
}
```

#### Search

```
.section .rodata
.LC0:

.word 7
.word 6
.word 4
.word 5
.word 5
.word 1
.word 3
.word 2
.word 9
.word 8
```

#### Search

```
@ stack layout, low to high address:
@   s at [sp, #0], i at [sp, #4], a[0]..a[9] from [sp, #8]
        .text
        .global main
        .type   main, %function
main:   sub     sp, sp, #48
        adr     r4, L9                  @ =.LC0
        add     r5, sp, #8
        ldmia   r4!, {r0, r1, r2, r3}
        stmia   r5!, {r0, r1, r2, r3}
        ldmia   r4!, {r0, r1, r2, r3}
        stmia   r5!, {r0, r1, r2, r3}
        ldmia   r4!, {r0, r1}
        stmia   r5!, {r0, r1}

        mov     r3, #4
        str     r3, [sp, #0]            @ s=4
        mov     r3, #0
        str     r3, [sp, #4]            @ i=0
loop:   ldr     r0, [sp, #4]            @ r0=i
        cmp     r0, #10                 @ i<10?
        bge     end
        ldr     r1, [sp, #0]            @ r1=s
        mov     r2, #4
        mul     r3, r0, r2
        add     r3, r3, #8
        ldr     r4, [sp, r3]            @ r4=a[i]
        teq     r1, r4                  @ test if s==a[i]
        beq     end
        add     r0, r0, #1              @ i++
        str     r0, [sp, #4]            @ update i
        b       loop
end:    str     r0, [sp, #4]
        cmp     r0, #10
        movge   r0, #-1
        add     sp, sp, #48
        mov     pc, lr
```

### **Optimization**

- Remove unnecessary load/store
- Remove loop invariant
- Use addressing mode
- Use conditional execution

## Search (remove load/store)

```
        mov     r1, #4                  @ s=4, kept in r1
@ removed: str r3, [sp, #0]
        mov     r0, #0                  @ i=0, kept in r0
@ removed: str r3, [sp, #4]
loop:                                   @ removed: ldr r0, [sp, #4]
        cmp     r0, #10                 @ i<10?
        bge     end
@ removed: ldr r1, [sp, #0]
        mov     r2, #4
        mul     r3, r0, r2
        add     r3, r3, #8
        ldr     r4, [sp, r3]            @ r4=a[i]
        teq     r1, r4                  @ test if s==a[i]
        beq     end
        add     r0, r0, #1              @ i++
@ removed: str r0, [sp, #4]
        b       loop
end:                                    @ removed: str r0, [sp, #4]
        cmp     r0, #10
        movge   r0, #-1
        add     sp, sp, #48
        mov     pc, lr
```

# Search (loop invariant/addressing mode)

```
        mov     r1, #4                  @ s=4
        mov     r0, #0                  @ i=0
        add     r2, sp, #8              @ r2=&a[0], hoisted out of the loop
loop:   cmp     r0, #10                 @ i<10?
        bge     end
@ removed: mov r2, #4 / mul r3, r0, r2 / add r3, r3, #8 / ldr r4, [sp, r3]
        ldr     r4, [r2, r0, LSL #2]    @ r4=a[i]
```

# Search (conditional execution)

```
        teq     r1, r4                  @ test if s==a[i]
@ removed: beq end
        addne   r0, r0, #1              @ if (s!=a[i]) i++
        bne     loop                    @ if (s!=a[i]) goto loop
end:    cmp     r0, #10
        movge   r0, #-1
        add     sp, sp, #48
        mov     pc, lr
```

# Optimization

- Remove unnecessary load/store
- Remove loop invariant
- Use addressing mode
- Use conditional execution
- Result: the code shrinks from 22 words to 13 words, and execution time is greatly reduced.
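The optimized loop corresponds to the following C shape, given here as a hedged sketch with the hypothetical name `search_c`: `i` and `s` live in locals for the whole loop (no stack traffic), the array base is passed in once, and the element access is a plain indexed load.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: linear search returning the index of s, or -1. */
int search_c(const int *a, int n, int s)
{
    assert(a != NULL && n >= 0);

    int i = 0;                      /* stays in a register, like r0 */
    while (i < n && a[i] != s)      /* cmp/bge, indexed ldr, teq    */
        i++;                        /* addne/bne                    */
    return (i >= n) ? -1 : i;       /* cmp/movge                    */
}
```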