# To move or mutate large objects
Suppose you have a large object that you need to process:
```rust
pub struct LargeObj([u8; 512]);
```
What’s the difference in MCU time and memory when
- passing by value, creating a new object?
- passing by value, mutating the argument?
- passing by mutable reference?
Here is the Compiler Explorer link. All code targets a Cortex-M MCU. Additionally, all examples assume separate compilation units, preventing caller-callee optimizations. I enabled size optimizations.
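If you want to approximate that setup in a single crate, one option (my assumption, not necessarily how these examples were built) is to block inlining so the caller can't see into the callee:

```rust
pub struct LargeObj(pub [u8; 512]);

// Hypothetical helper: #[inline(never)] keeps the callee opaque to the
// caller, so cross-function optimizations (such as eliding copies)
// can't kick in -- similar in effect to separate compilation units.
#[inline(never)]
pub fn bump_first(mut obj: LargeObj) -> LargeObj {
    obj.0[0] = obj.0[0].wrapping_add(1);
    obj
}
```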
## Pass by value, new object
```rust
pub fn process_move_new(obj: LargeObj) -> LargeObj {
    let mut n = LargeObj([0; 512]);
    for (slot, source) in n.0.iter_mut().zip(obj.0.iter()) {
        *slot = source.wrapping_add(42);
    }
    n
}
```
The callee’s stack holds a new object, 512 bytes large. That object is zeroed with `memclr4`. Then, there’s a loop to increment each element by 42. The callee accesses the argument through a pointer in `r1`. The call concludes with a `memcpy` from the callee’s stack to the caller’s stack, forming the return. The callee finds the location of the result object in `r0`.
Time on the CPU:
- zero the stack-allocated object.
- increment all elements by 42.
- copy the values from the callee’s stack to the caller’s stack.
```armasm
        push {r4, r5, r6, r7, lr}
        add r7, sp, #12
        str r11, [sp, #-4]!
        sub.w sp, sp, #512
        mov r6, sp
        mov r5, r1
        mov r4, r0
        mov r0, r6
        mov.w r1, #512
        bl __aeabi_memclr4
        movs r0, #0
.LBB2_1:
        ldrb r1, [r5, r0]
        adds r1, #42
        strb r1, [r6, r0]
        adds r0, #1
        cmp.w r0, #512
        bne .LBB2_1
        mov r1, sp
        mov r0, r4
        mov.w r2, #512
        bl __aeabi_memcpy
        add.w sp, sp, #512
        ldr r11, [sp], #4
        pop {r4, r5, r6, r7, pc}
```
To figure out the total memory usage, let’s also look at the caller:
```rust
pub fn use_process_move() {
    let mut obj = LargeObj([0; 512]);
    obj = process_move_new(obj);
}
```
```armasm
        push {r4, r6, r7, lr}
        add r7, sp, #8
        sub.w sp, sp, #1024
        add r4, sp, #512
        mov.w r1, #512
        mov r0, r4
        bl __aeabi_memclr4
        mov r0, sp
        mov r1, r4
        bl process_move_new
        add.w sp, sp, #1024
        pop {r4, r6, r7, pc}
```
Although we’re overwriting the original object, notice how the stack usage is 1024 bytes. There are two copies of the object on the caller’s stack:
- one 512 byte slot, zeroed, for the argument.
- another 512 byte slot for the result.
Since there’s another 512 byte object in the callee’s stack, the total stack memory usage is 3 x 512 bytes.
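The tally can be sanity-checked with a few lines of host-side Rust (the sizes here simply restate the text):

```rust
use core::mem::size_of;

pub struct LargeObj(pub [u8; 512]);

fn main() {
    let obj = size_of::<LargeObj>();
    assert_eq!(obj, 512);
    // Caller: one slot for the argument, one for the result.
    let caller = 2 * obj;
    // Callee: one slot for the locally constructed object.
    let callee = obj;
    assert_eq!(caller + callee, 3 * 512);
    println!("total stack for objects: {} bytes", caller + callee);
}
```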
## Pass by value, mutate the argument
```rust
pub fn process_move_mut(mut obj: LargeObj) -> LargeObj {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
    obj
}
```
Mutating the argument requires less memory in the callee. There’s no `LargeObj` stack allocation in the callee; instead, the callee accesses the argument through a pointer in `r1`. There’s a loop to increment every element by 42. As before, the call concludes with a `memcpy` at the end of the function, forming the return. `r0` holds a pointer to the location for the result.
Time on the CPU:
- increment all elements by 42.
- copy the values from the argument slot to the result slot, both on the caller’s stack.
```armasm
        movs r2, #0
.LBB3_1:
        ldrb r3, [r1, r2]
        adds r3, #42
        strb r3, [r1, r2]
        adds r2, #1
        cmp.w r2, #512
        bne .LBB3_1
        push {r7, lr}
        mov r7, sp
        mov.w r2, #512
        bl __aeabi_memcpy
        pop {r7, pc}
```
The perspective from the caller is unchanged from the other pass by value example. The caller doesn’t know if the callee is mutating the argument or forming a new object on its own stack. Therefore, the total stack allocation is 2 x 512 bytes.
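The original doesn’t show that caller, but a hypothetical one (the name `use_process_move_mut` is mine) would look just like `use_process_move`:

```rust
pub struct LargeObj(pub [u8; 512]);

pub fn process_move_mut(mut obj: LargeObj) -> LargeObj {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
    obj
}

// Hypothetical caller: identical shape to use_process_move. The caller
// still reserves one 512-byte slot for the argument and another for the
// returned value, even though the callee mutates the argument rather
// than building a new object.
pub fn use_process_move_mut() {
    let mut obj = LargeObj([0; 512]);
    obj = process_move_mut(obj);
    // Keep the result observable so it isn't optimized away.
    core::hint::black_box(&obj);
}
```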
## Pass by mutable reference
```rust
pub fn process_in_place(obj: &mut LargeObj) {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
}
```
There’s no stack allocation for a `LargeObj`. The caller supplies the mutable `LargeObj` pointer in `r0`. There’s only a loop to increment each element by 42. Since we modify the object in place, there’s no `memcpy` to conclude the function.
Time on the CPU:
- increment all elements by 42.
```armasm
        push {r7, lr}
        mov r7, sp
        movs r1, #0
.LBB4_1:
        ldrb r2, [r0, r1]
        adds r2, #42
        strb r2, [r0, r1]
        adds r1, #1
        cmp.w r1, #512
        bne .LBB4_1
        pop {r7, pc}
```
How about the caller of this function?
```rust
pub fn use_process_in_place() {
    let mut obj = LargeObj([0; 512]);
    process_in_place(&mut obj);
}
```
Since we modify the object in place, we only need one stack allocation in the caller. It’s understood that the argument can change, so there’s no need for a defensive copy.
```armasm
        push {r4, r6, r7, lr}
        add r7, sp, #8
        sub.w sp, sp, #512
        mov r4, sp
        mov.w r1, #512
        mov r0, r4
        bl __aeabi_memclr4
        mov r0, r4
        bl process_in_place
        add.w sp, sp, #512
        pop {r4, r6, r7, pc}
```
## Conclusion
| Approach | Caller stack for object(s) | Callee stack for object(s) | Runtime work |
|---|---|---|---|
| Pass by value, new object | 1024 bytes | 512 bytes | zero, increment, copy |
| Pass by value, mutate argument | 1024 bytes | 0 bytes | increment, copy |
| Pass by mutable reference | 512 bytes | 0 bytes | increment |
Function interfaces can have a hidden cost in a real-time embedded system. In these systems, we want to minimize memory usage and only do the necessary work.
In the first example, there’s an explicit operation to zero the stack-allocated object. I’m not sure why; the compiler should see that every slot is written before it’s read, so there’s no need to initialize those values to zero. Nevertheless, that zeroing has an impact on how long it takes to process the data.
Additionally, the first two examples reveal how “return by value” is implemented: it’s a `memcpy`! That’s a cost paid just to provide the result to the caller. And to realize pass by value, the caller requires additional memory for both the argument and the result.
Passing by mutable reference optimizes both the caller and the callee. The caller has less memory to allocate. The callee can compute the result without performing any other work.
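As a sanity check I added (runnable on a host, no MCU required), all three approaches compute the same result; only their memory and copy costs differ:

```rust
pub struct LargeObj(pub [u8; 512]);

pub fn process_move_new(obj: LargeObj) -> LargeObj {
    let mut n = LargeObj([0; 512]);
    for (slot, source) in n.0.iter_mut().zip(obj.0.iter()) {
        *slot = source.wrapping_add(42);
    }
    n
}

pub fn process_move_mut(mut obj: LargeObj) -> LargeObj {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
    obj
}

pub fn process_in_place(obj: &mut LargeObj) {
    for x in &mut obj.0 {
        *x = x.wrapping_add(42);
    }
}

fn main() {
    // Start each approach from the same initial state.
    let a = process_move_new(LargeObj([1; 512]));
    let b = process_move_mut(LargeObj([1; 512]));
    let mut c = LargeObj([1; 512]);
    process_in_place(&mut c);

    // All three yield identical results: every element is 1 + 42 = 43.
    assert!(a.0.iter().all(|&x| x == 43));
    assert_eq!(a.0, b.0);
    assert_eq!(b.0, c.0);
    println!("all three approaches agree");
}
```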