Hi,
(sorry in advance for the long post, this is a rather technical one
)
I hit a fairly painful boot delay on seL4 when the user image is big (Sculpt-class, hundreds of megabytes). I see this on arm platforms (armStone and Verdin iMX8M Plus). The elfloader unpack stage runs with MMU off, so the inner memcpy in elf_loadFile is doing non-cached aarch64 stores all the way through DRAM. For small images this is invisible; for my Sculpt-on-seL4 boots it eats minutes.
Concrete numbers from the same armStone (i.MX8MP, 1.2 GHz, 2 GiB DDR4), same elfloader build, same kernel, only the user image changes:
| image | size | ELF-loading genode.elf → “Enabling hypervisor MMU and paging” |
|---|---|---|
| bare-Genode + framebuffer | ~6 MB | 0:00:16.676 → 0:00:19.043 (~2.4 s) |
| Sculpt + sel4 (Sculpt 26.04) | ~596 MB | 0:04:58.445 → 0:08:15.328 (~3 min 17 s) |
So roughly 3 MB/s effective throughput for the elfloader’s memcpy on this SoC. The board itself can do hundreds of MB/s of DRAM bandwidth once D-cache is on, so I’m paying a ~100× tax for running the copy strongly-ordered.
The code path, as I see it in tools/seL4_tools (slightly trimmed):
/* elfloader-tool/src/binaries/elf/elf.c */
int elf_loadFile(void const *elfFile, int phys)
{
if (elf_checkFile(elfFile) != 0) {
return 0;
}
for (unsigned int i = 0; i < elf_getNumProgramHeaders(elfFile); i++) {
uint64_t dest, src;
size_t len;
if (phys) {
dest = elf_getProgramHeaderPaddr(elfFile, i);
} else {
dest = elf_getProgramHeaderVaddr(elfFile, i);
}
len = elf_getProgramHeaderFileSize(elfFile, i);
src = (uint64_t)(uintptr_t)elfFile
+ elf_getProgramHeaderOffset(elfFile, i);
memcpy((void *)(uintptr_t)dest, (void *)(uintptr_t)src, len);
...
}
return 1;
}
And the boot-time page table built in
elfloader-tool/src/arch-arm/64/mmu.c:
void init_boot_vspace(struct image_info *kernel_info)
{
...
for (i = 0; i < BIT(PUD_BITS); i++) {
_boot_pud_down[i] = (i << ARM_1GB_BLOCK_BITS)
| BIT(10) /* access flag */
| (0 << 2) /* strongly ordered memory */
| BIT(0); /* 1G block */
}
...
}
…which is fine for the kernel/runtime transition but it runs after the unpack copy. The copy itself happens with MMU off.
What I’d like to ask the community:
- Has anyone wired up an early identity-mapped page table inside the elfloader, with the DRAM region marked as Normal Inner/Outer Write-Back Cacheable, just for the duration of the unpack loop? Then flush + invalidate D-cache and tear it down before handing off to
the kernel. That should bringmemcpyclose to native DRAM bandwidth on aarch64. - Is there a less invasive route I’m missing — e.g. an aarch64 cacheable-
memcpyhelper that uses non-temporal stores so the lack of a cacheable mapping matters less? On Cortex-A53 the gains from NEON wide stores alone (without caching) are typically modest, so I don’t think this gets us back the 100× — but happy to be wrong. - Or is there a “load the ELF directly to its final paddr via U-Boot’s
bootm/fitImagemechanism and skip the elfloader copy entirely” story that I haven’t found? My current build doesbootm→ elfloader → kernel; the elfloader is responsible for the second copy because the ELF lives at the U-Boot load address, not atIMAGE_START_ADDR.
If “set up an early cacheable identity map for unpack” is the right direction, I’d be happy to contribute a patch to sel4_tools.
Shape I have in mind:
- Add an arch-arm/64 helper
elfloader_early_mmu_enable()/elfloader_early_mmu_disable()that installs a minimal page table covering all of usable DRAM as Normal-Cacheable + identity-mapped. - Call it once before the unpack loop and tear it down (with a full D-cache clean + I-cache invalidate) immediately after the last
elf_loadFile. - Keep the existing
init_boot_vspacepath untouched.
Before I start, two sanity-checks I’d appreciate guidance on:
- Are there platforms (in seL4’s current portfolio) where enabling the MMU this early would conflict with the kernel’s later setup — e.g. boards where some MMIO region we’d accidentally cover with a Normal-mapping is touched between unpack and
Enabling hypervisor MMU and paging? - Is there an existing config knob (something like
CONFIG_ELFLOADER_EARLY_MMU) I missed in the kbuild? I grep’d and didn’t find one but I might be looking in the wrong place.
Thanks!
PS:
happy to share full serial logs of both the 6 MB and the 596 MB boots if useful — they’re just the standard printf output of elfloader-tool/src/common.c framed with timestamps. The 3:17 gap between “ELF-loading image ‘genode.elf’” and “Enabling hypervisor MMU and paging” is the entire window I’d like to collapse.