Suppose you have a large object that you need to process:

pub struct LargeObj([u8; 512]);

What’s the difference in MCU time and memory when

  1. passing by value, creating a new object?
  2. passing by value, mutating the argument?
  3. passing by mutable reference?

Here is the Compiler Explorer link. All code targets a Cortex-M MCU, and all examples assume separate compilation units, which prevents caller-callee optimizations. I enabled size optimizations.
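If you want to reproduce the setup in a single Compiler Explorer pane, marking the callee #[inline(never)] is one way to keep the caller and callee from being optimized together (an assumption on my part about an equivalent setup, not necessarily how these examples were produced):

```rust
pub struct LargeObj([u8; 512]);

// #[inline(never)] prevents the callee from being inlined into the
// caller, approximating separate compilation units inside one crate.
#[inline(never)]
pub fn process_in_place(obj: &mut LargeObj) {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
}

pub fn use_process_in_place() {
    let mut obj = LargeObj([0; 512]);
    process_in_place(&mut obj);
}
```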

Pass by value, new object

pub fn process_move_new(obj: LargeObj) -> LargeObj {
    let mut n = LargeObj([0; 512]);
    for (slot, source) in n.0.iter_mut().zip(obj.0.iter()) {
        *slot = source.wrapping_add(42);
    }
    n
}

The callee’s stack holds a new object, 512 bytes large. That object is zeroed with memclr4. Then, there’s a loop to increment each element by 42. The callee accesses the argument through a pointer in r1.

The call concludes with a memcpy from the callee’s stack to the caller’s stack, forming the return. The callee finds the location of the result object in r0.

Time on the CPU:

  • zero the stack-allocated object.
  • increment all elements by 42.
  • copy the values from the callee’s stack to the caller’s stack.
        push    {r4, r5, r6, r7, lr}
        add     r7, sp, #12
        str     r11, [sp, #-4]!
        sub.w   sp, sp, #512
        mov     r6, sp
        mov     r5, r1
        mov     r4, r0
        mov     r0, r6
        mov.w   r1, #512
        bl      __aeabi_memclr4
        movs    r0, #0
.LBB2_1:
        ldrb    r1, [r5, r0]
        adds    r1, #42
        strb    r1, [r6, r0]
        adds    r0, #1
        cmp.w   r0, #512
        bne     .LBB2_1
        mov     r1, sp
        mov     r0, r4
        mov.w   r2, #512
        bl      __aeabi_memcpy
        add.w   sp, sp, #512
        ldr     r11, [sp], #4
        pop     {r4, r5, r6, r7, pc}

To figure out the total memory usage, let’s also look at the caller:

pub fn use_process_move() {
    let mut obj = LargeObj([0; 512]);
    obj = process_move_new(obj);
}
        push    {r4, r6, r7, lr}
        add     r7, sp, #8
        sub.w   sp, sp, #1024
        add     r4, sp, #512
        mov.w   r1, #512
        mov     r0, r4
        bl      __aeabi_memclr4
        mov     r0, sp
        mov     r1, r4
        bl      process_move_new
        add.w   sp, sp, #1024
        pop     {r4, r6, r7, pc}

Although we’re overwriting the original object, the caller’s stack usage is 1024 bytes. There are two copies of the object on the caller’s stack:

  • one 512 byte slot, zeroed, for the argument.
  • another 512 byte slot for the result.

Since there’s another 512 byte object in the callee’s stack, the total stack memory usage is 3 x 512 bytes.

Pass by value, mutate the argument

pub fn process_move_mut(mut obj: LargeObj) -> LargeObj {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
    obj
}

Mutating the argument requires less memory in the callee. There’s no LargeObj stack allocation in the callee. Instead, the callee accesses the argument through a pointer in r1. There’s a loop to increment every element by 42.

As before, the call concludes with a memcpy at the end of the function, forming the return. r0 holds a pointer to place the result.

Time on the CPU:

  • increment all elements by 42.
  • copy the values from the argument’s slot to the result’s slot, both on the caller’s stack.
        movs    r2, #0
.LBB3_1:
        ldrb    r3, [r1, r2]
        adds    r3, #42
        strb    r3, [r1, r2]
        adds    r2, #1
        cmp.w   r2, #512
        bne     .LBB3_1
        push    {r7, lr}
        mov     r7, sp
        mov.w   r2, #512
        bl      __aeabi_memcpy
        pop     {r7, pc}

The perspective from the caller is unchanged from the other pass by value example. The caller doesn’t know if the callee is mutating the argument or forming a new object on its own stack. Therefore, the total stack allocation is 2 x 512 bytes.

Pass by mutable reference

pub fn process_in_place(obj: &mut LargeObj) {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
}

There’s no stack allocation for a LargeObj. The caller supplies the mutable LargeObj pointer in r0. There’s only a loop to increment each element by 42. Since we modify the object in place, there’s no memcpy to conclude the function.

Time on the CPU:

  • increment all elements by 42.
        push    {r7, lr}
        mov     r7, sp
        movs    r1, #0
.LBB4_1:
        ldrb    r2, [r0, r1]
        adds    r2, #42
        strb    r2, [r0, r1]
        adds    r1, #1
        cmp.w   r1, #512
        bne     .LBB4_1
        pop     {r7, pc}

How about the caller of this function?

pub fn use_process_in_place() {
    let mut obj = LargeObj([0; 512]);
    process_in_place(&mut obj);
}

Since we modify the object in place, we only need one stack allocation in the caller. It’s understood that the argument can change, so there’s no need for a defensive copy.

        push    {r4, r6, r7, lr}
        add     r7, sp, #8
        sub.w   sp, sp, #512
        mov     r4, sp
        mov.w   r1, #512
        mov     r0, r4
        bl      __aeabi_memclr4
        mov     r0, r4
        bl      process_in_place
        add.w   sp, sp, #512
        pop     {r4, r6, r7, pc}
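Functionally, the three approaches are interchangeable; a quick host-side check (repeating the definitions from above so the snippet stands alone) confirms they compute the same result:

```rust
pub struct LargeObj([u8; 512]);

// Pass by value, new object.
pub fn process_move_new(obj: LargeObj) -> LargeObj {
    let mut n = LargeObj([0; 512]);
    for (slot, source) in n.0.iter_mut().zip(obj.0.iter()) {
        *slot = source.wrapping_add(42);
    }
    n
}

// Pass by value, mutate the argument.
pub fn process_move_mut(mut obj: LargeObj) -> LargeObj {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
    obj
}

// Pass by mutable reference.
pub fn process_in_place(obj: &mut LargeObj) {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
}
```

Only the memory and runtime costs differ, which is what the next section tallies.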

Conclusion

Approach                         Caller stack for object(s)   Callee stack for object(s)   Runtime work
Pass by value, new object        1024 bytes                   512 bytes                    zero, increment, copy
Pass by value, mutate argument   1024 bytes                   0 bytes                      increment, copy
Pass by mutable reference        512 bytes                    0 bytes                      increment

Function interfaces can have a hidden cost in a real-time embedded system. In these systems, we want to minimize memory usage and only do the necessary work.

In the first example, there’s an explicit operation to zero the stack-allocated object. I’m not sure why; the compiler should see that the slots are only written, so there’s no need to initialize those values to zero. Nevertheless, that zeroing has an impact on how long it takes to process the data.
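One way to sidestep the up-front zeroing is to build the result in uninitialized memory with MaybeUninit. This is a sketch (the function name is mine, and whether the compiler actually drops the memclr is worth re-checking in Compiler Explorer); it also trades safe code for unsafe blocks, which may not fit every project’s policy:

```rust
use core::mem::MaybeUninit;

pub struct LargeObj([u8; 512]);

// Sketch: build the result in uninitialized memory so no up-front
// memclr is required. Every byte is written exactly once before
// `assume_init`, which is what makes the unsafe blocks sound.
pub fn process_move_new_uninit(obj: LargeObj) -> LargeObj {
    let mut n = MaybeUninit::<[u8; 512]>::uninit();
    let base = n.as_mut_ptr() as *mut u8;
    for (i, source) in obj.0.iter().enumerate() {
        // SAFETY: i < 512, so the write stays inside the allocation.
        unsafe { base.add(i).write(source.wrapping_add(42)) };
    }
    // SAFETY: the loop above initialized all 512 bytes.
    LargeObj(unsafe { n.assume_init() })
}
```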

Additionally, the first two examples reveal how “return by value” is implemented: it’s a memcpy! That’s a cost paid just to deliver the result to the caller. And to realize pass by value, the caller requires additional memory for both the argument and the result.

Passing by mutable reference optimizes both the caller and the callee. The caller has less memory to allocate. The callee can compute the result without performing any other work.
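If the caller genuinely needs the input preserved alongside the result, an out-parameter keeps the benefits of reference passing without the hidden return memcpy. A sketch (the name process_into is mine, not from the examples above):

```rust
pub struct LargeObj([u8; 512]);

// Sketch of an out-parameter interface: the caller owns both slots,
// the input is left untouched, and no return-by-value copy is needed.
pub fn process_into(src: &LargeObj, dst: &mut LargeObj) {
    for (d, s) in dst.0.iter_mut().zip(src.0.iter()) {
        *d = s.wrapping_add(42);
    }
}
```

The caller pays 2 x 512 bytes of stack, like the pass-by-value examples, but the callee allocates nothing and skips the copy.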