Sculpt on sel4, ARM initial boot unpack copy is painfuly slow :/

Hi,

(sorry in advance for the long post, this is a rather technical one :wink:)

I hit a fairly painful boot delay on seL4 when the user image is big (Sculpt-class, hundreds of megabytes). I see this on arm platforms (armStone and Verdin iMX8M Plus). The elfloader unpack stage runs with MMU off, so the inner memcpy in elf_loadFile is doing non-cached aarch64 stores all the way through DRAM. For small images this is invisible; for my Sculpt-on-seL4 boots it eats minutes.

Concrete numbers from the same armStone (i.MX8MP, 1.2 GHz, 2 GiB DDR4), same elfloader build, same kernel, only the user image changes:

image size ELF-loading genode.elf → “Enabling hypervisor MMU and paging”
bare-Genode + framebuffer ~6 MB 0:00:16.676 → 0:00:19.043 (~2.4 s)
Sculpt + sel4 (Sculpt 26.04) ~596 MB 0:04:58.445 → 0:08:15.328 (~3 min 17 s)

So roughly 3 MB/s effective throughput for the elfloader’s memcpy on this SoC. The board itself can do hundreds of MB/s of DRAM bandwidth once D-cache is on, so I’m paying a ~100× tax for running the copy strongly-ordered.

The code path, as I see it in tools/seL4_tools (slightly trimmed):

/* elfloader-tool/src/binaries/elf/elf.c */
int elf_loadFile(void const *elfFile, int phys)
{
    if (elf_checkFile(elfFile) != 0) {
        return 0;
    }
    for (unsigned int i = 0; i < elf_getNumProgramHeaders(elfFile); i++) {
        uint64_t dest, src;
        size_t len;
        if (phys) {
            dest = elf_getProgramHeaderPaddr(elfFile, i);
        } else {
            dest = elf_getProgramHeaderVaddr(elfFile, i);
        }
        len = elf_getProgramHeaderFileSize(elfFile, i);
        src = (uint64_t)(uintptr_t)elfFile
            + elf_getProgramHeaderOffset(elfFile, i);
        memcpy((void *)(uintptr_t)dest, (void *)(uintptr_t)src, len);
        ...
    }
    return 1;
}

And the boot-time page table built in
elfloader-tool/src/arch-arm/64/mmu.c:

void init_boot_vspace(struct image_info *kernel_info)
{
    ...
    for (i = 0; i < BIT(PUD_BITS); i++) {
        _boot_pud_down[i] = (i << ARM_1GB_BLOCK_BITS)
                            | BIT(10) /* access flag */
                            | (0 << 2) /* strongly ordered memory */
                            | BIT(0);  /* 1G block */
    }
    ...
}

…which is fine for the kernel/runtime transition but it runs after the unpack copy. The copy itself happens with MMU off.

What I’d like to ask the community:

  • Has anyone wired up an early identity-mapped page table inside the elfloader, with the DRAM region marked as Normal Inner/Outer Write-Back Cacheable, just for the duration of the unpack loop? Then flush + invalidate D-cache and tear it down before handing off to
    the kernel. That should bring memcpy close to native DRAM bandwidth on aarch64.
  • Is there a less invasive route I’m missing — e.g. an aarch64 cacheable-memcpy helper that uses non-temporal stores so the lack of a cacheable mapping matters less? On Cortex-A53 the gains from NEON wide stores alone (without caching) are typically modest, so I don’t think this gets us back the 100× — but happy to be wrong.
  • Or is there a “load the ELF directly to its final paddr via U-Boot’s bootm/fitImage mechanism and skip the elfloader copy entirely” story that I haven’t found? My current build does bootm → elfloader → kernel; the elfloader is responsible for the second copy because the ELF lives at the U-Boot load address, not at IMAGE_START_ADDR.

If “set up an early cacheable identity map for unpack” is the right direction, I’d be happy to contribute a patch to sel4_tools.

Shape I have in mind:

  • Add an arch-arm/64 helper elfloader_early_mmu_enable() / elfloader_early_mmu_disable() that installs a minimal page table covering all of usable DRAM as Normal-Cacheable + identity-mapped.
  • Call it once before the unpack loop and tear it down (with a full D-cache clean + I-cache invalidate) immediately after the last elf_loadFile.
  • Keep the existing init_boot_vspace path untouched.

Before I start, two sanity-checks I’d appreciate guidance on:

  • Are there platforms (in seL4’s current portfolio) where enabling the MMU this early would conflict with the kernel’s later setup — e.g. boards where some MMIO region we’d accidentally cover with a Normal-mapping is touched between unpack and Enabling hypervisor MMU and paging?
  • Is there an existing config knob (something likeCONFIG_ELFLOADER_EARLY_MMU) I missed in the kbuild? I grep’d and didn’t find one but I might be looking in the wrong place.

Thanks!

PS:

happy to share full serial logs of both the 6 MB and the 596 MB boots if useful — they’re just the standard printf output of elfloader-tool/src/common.c framed with timestamps. The 3:17 gap between “ELF-loading image ‘genode.elf’” and “Enabling hypervisor MMU and paging” is the entire window I’d like to collapse.

I’m wondering, why is your Sculpt image so large in the first place? It should be in the order of tens of MiB, not hundreds.

I presume you missed specifying the DEPOT=omit argument when building it. In this case, all depot content referenced by the presets and options will the incorporated in the depot.tar archive in the boot image. This is useful for building a special-purpose appliance (using a custom .sculpt file) but not for the default sculpt image.

Thanks @nfeske — and good catch!

In my case the large image is the appliance shape, not an accident: I’m building a custom special-purpose Sculpt (specific .sculpt preset, pinned depot, no on-target download) so the depot.tar in the boot image is exactly what I want. I’m using this on lab boards where post-boot Internet access isn’t always available and I need a fully self-contained, reproducible image for benchmarking and porting work.

So the ~596 MB number isn’t a default-Sculpt regression — DEPOT=omit would let me drop it for a stock-Sculpt baseline, but for the appliance build the depot.tar has to come along. That makes the elfloader unpack cost a structural part of every boot for my use case, not a one-time penalty.

Two extra data points that may help frame the trade-off:

  • The same image with KERNEL=hw (base-hw, identical depot.tar) boots through to userspace in roughly 20 seconds on the same board — there is no equivalent elfloader unpack step on hw. So the ~3 min 17 s isn’t “the appliance is heavy”; it’s specifically “the elfloader’s memcpy is uncached on sel4”.
  • The cost scales linearly with image size: a ~300 MB intermediate appliance build sits between the two numbers in the table, as you’d expect from a ~3 MB/s throughput.

Given that the appliance use case is explicitly endorsed (the DEPOT=omit flag exists because the other shape is also legitimate), would speeding up the elfloader unpack still be worth pursuing from your side? It looks like a one-time cost in sel4_tools that would benefit every appliance build, not just mine — and the change is localised to elfloader-tool/src/arch-arm/64/.

Let me know what you think

Thanks!

I presume, no one of the seL4 developers are in this discourse forum and you won’t get an appropriate answer to your elfloader question. You may consider to request the (speedup) feature to upstream, e.g. GitHub - seL4/seL4_tools: Basic tools for building seL4 projects · GitHub or their discourse forum.

Thanks @alex-ab — makes sense, seL4-upstream is the right home for this. I’ll write up the proposal as an issue on seL4/seL4_tools and link it back here once it’s open.

One adjacent question while I have you: my Genode build currently pulls seL4_tools from your alex-ab/seL4_tools fork (specifically for the elfloader-tool/include/arch-arm/imx8mp_evk/ headers that I needed for armStone bring-up). If an elfloader-early-MMU change ever lands on seL4/seL4_tools master, would your fork rebase forward to pick it up, or is it pinned at a snapshot? Asking so I know whether a single upstream PR would also reach the Genode-flavour build downstream, or whether the change needs to be carried separately.

Either way, getting the upstream PR up first is the right next step:

Thanks again!

Ideally there would no fork needed, but unfortunately the integration of elfloader into the Genode build system is not friction less (If someone find a better solution, go ahead!). So, yes, either we/I will have to rebase to upstream, or we have to cherry-pick on the fork or you keep a fork and maintain it yourself. Will see.

@fdelizy You may check also the flags used to build the elfloader in repos/base-sel4/lib/mk/spec/arm_v8a/kernel-sel4.inc. I vaguely remember I had to add some to get reliable boots, but potentially they effect larger image boot time.

hum, interesting, I’ll look into this too.

Also, my issue got bumped on sel4, my assumption was that u-boot disabled MMU, but it doesn’t so, something else is doing it, I am still trying to figure this one out.