Sculpt on seL4, ARM initial boot unpack copy is painfully slow :/

fdelizy · May 19, 2026, 9:28am

Hi,

(sorry in advance for the long post, this is a rather technical one)

I hit a fairly painful boot delay on seL4 when the user image is big (Sculpt-class, hundreds of megabytes). I see this on ARM platforms (armStone and Verdin iMX8M Plus). The elfloader unpack stage runs with the MMU off, so the inner memcpy in elf_loadFile is doing non-cached aarch64 stores all the way through DRAM. For small images this is invisible; for my Sculpt-on-seL4 boots it eats minutes.

Concrete numbers from the same armStone (i.MX8MP, 1.2 GHz, 2 GiB DDR4), same elfloader build, same kernel, only the user image changes:

image	size	ELF-loading `genode.elf` → “Enabling hypervisor MMU and paging”
bare-Genode + framebuffer	~6 MB	0:00:16.676 → 0:00:19.043 (~2.4 s)
Sculpt + sel4 (Sculpt 26.04)	~596 MB	0:04:58.445 → 0:08:15.328 (~3 min 17 s)

So roughly 3 MB/s effective throughput for the elfloader’s memcpy on this SoC. The board itself can do hundreds of MB/s of DRAM bandwidth once D-cache is on, so I’m paying a ~100× tax for running the copy strongly-ordered.
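The throughput figure follows directly from the two timestamps in the table. A minimal sketch of that arithmetic, with hypothetical helper names (`ts_to_ms`, `throughput_mb_per_s`) and the approximate image sizes from the table as inputs:

```c
#include <assert.h>
#include <stdint.h>

/* Convert an h:mm:ss.mmm serial-log timestamp to milliseconds. */
uint64_t ts_to_ms(unsigned h, unsigned m, unsigned s, unsigned ms)
{
    return ((uint64_t)h * 3600u + (uint64_t)m * 60u + s) * 1000u + ms;
}

/* Effective copy throughput in MB/s for size_mb megabytes copied
 * between two log timestamps (both in milliseconds). */
double throughput_mb_per_s(double size_mb, uint64_t start_ms, uint64_t end_ms)
{
    assert(end_ms > start_ms);
    return size_mb * 1000.0 / (double)(end_ms - start_ms);
}
```

Feeding in 0:04:58.445 and 0:08:15.328 gives a window of 196.883 s, so the ~596 MB image works out to roughly 3.0 MB/s; the ~6 MB image over its 2.367 s window lands at roughly 2.5 MB/s.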

The code path, as I see it in tools/seL4_tools (slightly trimmed):

/* elfloader-tool/src/binaries/elf/elf.c */
int elf_loadFile(void const *elfFile, int phys)
{
    if (elf_checkFile(elfFile) != 0) {
        return 0;
    }
    for (unsigned int i = 0; i < elf_getNumProgramHeaders(elfFile); i++) {
        uint64_t dest, src;
        size_t len;
        if (phys) {
            dest = elf_getProgramHeaderPaddr(elfFile, i);
        } else {
            dest = elf_getProgramHeaderVaddr(elfFile, i);
        }
        len = elf_getProgramHeaderFileSize(elfFile, i);
        src = (uint64_t)(uintptr_t)elfFile
            + elf_getProgramHeaderOffset(elfFile, i);
        memcpy((void *)(uintptr_t)dest, (void *)(uintptr_t)src, len);
        ...
    }
    return 1;
}

And the boot-time page table built in
elfloader-tool/src/arch-arm/64/mmu.c:

void init_boot_vspace(struct image_info *kernel_info)
{
    ...
    for (i = 0; i < BIT(PUD_BITS); i++) {
        _boot_pud_down[i] = (i << ARM_1GB_BLOCK_BITS)
                            | BIT(10) /* access flag */
                            | (0 << 2) /* strongly ordered memory */
                            | BIT(0);  /* 1G block */
    }
    ...
}

…which is fine for the kernel/runtime transition but it runs after the unpack copy. The copy itself happens with MMU off.

What I’d like to ask the community:

Has anyone wired up an early identity-mapped page table inside the elfloader, with the DRAM region marked as Normal Inner/Outer Write-Back Cacheable, just for the duration of the unpack loop? Then flush + invalidate D-cache and tear it down before handing off to the kernel. That should bring memcpy close to native DRAM bandwidth on aarch64.
Is there a less invasive route I’m missing — e.g. an aarch64 cacheable-memcpy helper that uses non-temporal stores so the lack of a cacheable mapping matters less? On Cortex-A53 the gains from NEON wide stores alone (without caching) are typically modest, so I don’t think this gets us back the 100× — but happy to be wrong.
Or is there a “load the ELF directly to its final paddr via U-Boot’s bootm/fitImage mechanism and skip the elfloader copy entirely” story that I haven’t found? My current build does bootm → elfloader → kernel; the elfloader is responsible for the second copy because the ELF lives at the U-Boot load address, not at IMAGE_START_ADDR.

If “set up an early cacheable identity map for unpack” is the right direction, I’d be happy to contribute a patch to seL4_tools.

Shape I have in mind:

Add an arch-arm/64 helper elfloader_early_mmu_enable() / elfloader_early_mmu_disable() that installs a minimal page table covering all of usable DRAM as Normal-Cacheable + identity-mapped.
Call it once before the unpack loop and tear it down (with a full D-cache clean + I-cache invalidate) immediately after the last elf_loadFile.
Keep the existing init_boot_vspace path untouched.
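To make the shape above more concrete, here is a rough sketch of only the table-filling part, modelled on the 1 GiB block loop in init_boot_vspace. All names (`early_fill_l1`, `early_mair_value`, the `MAIR_IDX_*` layout) are hypothetical and not existing elfloader code; it assumes a single 512-entry table of 1 GiB blocks, MAIR index 0 for Device-nGnRnE and index 1 for Normal Write-Back. Programming TCR/TTBR/SCTLR and all TLB and cache maintenance are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ONE_GIB           (UINT64_C(1) << 30)
#define DESC_BLOCK        UINT64_C(0x1)            /* bits[1:0] = 0b01: valid block */
#define DESC_ATTR_IDX(i)  ((uint64_t)(i) << 2)     /* AttrIndx, bits[4:2] */
#define DESC_SH_INNER     (UINT64_C(3) << 8)       /* SH = inner shareable */
#define DESC_AF           (UINT64_C(1) << 10)      /* access flag */

/* Assumed MAIR layout for this sketch (not taken from the elfloader). */
#define MAIR_IDX_DEVICE   0   /* 0x00: Device-nGnRnE */
#define MAIR_IDX_NORMAL   1   /* 0xff: Normal, inner+outer write-back */

/* Value to program into MAIR_ELx for the layout above. */
uint64_t early_mair_value(void)
{
    return (UINT64_C(0x00) << (8 * MAIR_IDX_DEVICE))
         | (UINT64_C(0xff) << (8 * MAIR_IDX_NORMAL));
}

/* Fill a 512-entry identity table of 1 GiB blocks. Blocks that lie
 * entirely inside [dram_base, dram_base + dram_size) become Normal
 * cacheable; everything else stays Device, so MMIO is never mapped
 * as Normal memory. */
void early_fill_l1(uint64_t table[512], uint64_t dram_base, uint64_t dram_size)
{
    assert(table != NULL);
    uint64_t dram_end = dram_base + dram_size;

    for (size_t i = 0; i < 512; i++) {
        uint64_t pa = (uint64_t)i << 30;
        int in_dram = pa >= dram_base && pa + ONE_GIB <= dram_end;

        table[i] = pa | DESC_AF | DESC_BLOCK
                 | (in_dram ? DESC_ATTR_IDX(MAIR_IDX_NORMAL) | DESC_SH_INNER
                            : DESC_ATTR_IDX(MAIR_IDX_DEVICE));
    }
}
```

Keeping partially covered blocks as Device is the conservative choice here: a DRAM range that is not 1 GiB aligned would need finer-grained tables to be cached in full, but no peripheral can end up behind a cacheable mapping by accident.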

Before I start, two sanity-checks I’d appreciate guidance on:

Are there platforms (in seL4’s current portfolio) where enabling the MMU this early would conflict with the kernel’s later setup, e.g. boards where some MMIO region we’d accidentally cover with a Normal mapping is touched between unpack and “Enabling hypervisor MMU and paging”?
Is there an existing config knob (something like CONFIG_ELFLOADER_EARLY_MMU) I missed in the kbuild? I grep’d and didn’t find one but I might be looking in the wrong place.

Thanks!

PS:

happy to share full serial logs of both the 6 MB and the 596 MB boots if useful — they’re just the standard printf output of elfloader-tool/src/common.c framed with timestamps. The 3:17 gap between “ELF-loading image ‘genode.elf’” and “Enabling hypervisor MMU and paging” is the entire window I’d like to collapse.

nfeske · May 19, 2026, 10:17am

I’m wondering, why is your Sculpt image so large in the first place? It should be on the order of tens of MiB, not hundreds.

I presume you missed specifying the DEPOT=omit argument when building it. In this case, all depot content referenced by the presets and options will be incorporated in the depot.tar archive in the boot image. This is useful for building a special-purpose appliance (using a custom .sculpt file) but not for the default sculpt image.

fdelizy · May 19, 2026, 10:29am

Thanks @nfeske — and good catch!

In my case the large image is the appliance shape, not an accident: I’m building a custom special-purpose Sculpt (specific .sculpt preset, pinned depot, no on-target download) so the depot.tar in the boot image is exactly what I want. I’m using this on lab boards where post-boot Internet access isn’t always available and I need a fully self-contained, reproducible image for benchmarking and porting work.

So the ~596 MB number isn’t a default-Sculpt regression — DEPOT=omit would let me drop it for a stock-Sculpt baseline, but for the appliance build the depot.tar has to come along. That makes the elfloader unpack cost a structural part of every boot for my use case, not a one-time penalty.

Two extra data points that may help frame the trade-off:

The same image with KERNEL=hw (base-hw, identical depot.tar) boots through to userspace in roughly 20 seconds on the same board — there is no equivalent elfloader unpack step on hw. So the ~3 min 17 s isn’t “the appliance is heavy”; it’s specifically “the elfloader’s memcpy is uncached on sel4”.
The cost scales linearly with image size: a ~300 MB intermediate appliance build sits between the two numbers in the table, as you’d expect from a ~3 MB/s throughput.

Given that the appliance use case is explicitly endorsed (the DEPOT=omit flag exists because the other shape is also legitimate), would speeding up the elfloader unpack still be worth pursuing from your side? It looks like a one-time cost in seL4_tools that would benefit every appliance build, not just mine, and the change is localised to elfloader-tool/src/arch-arm/64/.

Let me know what you think

Thanks!

alex-ab · May 19, 2026, 10:53am

I presume none of the seL4 developers is on this Discourse forum, so you won’t get an appropriate answer to your elfloader question here. You may want to request the (speedup) feature upstream, e.g. at seL4/seL4_tools on GitHub (“Basic tools for building seL4 projects”) or on their Discourse forum.

fdelizy · May 19, 2026, 12:49pm

Thanks @alex-ab — makes sense, seL4-upstream is the right home for this. I’ll write up the proposal as an issue on seL4/seL4_tools and link it back here once it’s open.

One adjacent question while I have you: my Genode build currently pulls seL4_tools from your alex-ab/seL4_tools fork (specifically for the elfloader-tool/include/arch-arm/imx8mp_evk/ headers that I needed for armStone bring-up). If an elfloader-early-MMU change ever lands on seL4/seL4_tools master, would your fork rebase forward to pick it up, or is it pinned at a snapshot? Asking so I know whether a single upstream PR would also reach the Genode-flavour build downstream, or whether the change needs to be carried separately.

Either way, getting the upstream PR up first is the right next step:

github.com/seL4/seL4_tools

elfloader/arm64: enable D-cache during ELF unpack to recover ~100× throughput

opened 01:03PM - 19 May 26 UTC

goutnet

## Quick Summary / TL;DR

`elfloader-tool/src/binaries/elf/elf.c::elf_loadFile()…` copies each program-header payload into its physical destination via plain `memcpy`. On aarch64 the elfloader runs **with the MMU disabled**, so the inner load/store sequence executes as strongly-ordered, non-cached DRAM traffic. On a representative Cortex-A53 SoC I measure **~3 MB/s** effective copy throughput. That is fine for small kernels but becomes a hard wall for any user image in the hundreds-of-megabytes range: boot adds *minutes* of pure copy time.

## Reproduction

Hardware: F&S armStone MX8M Plus (NXP i.MX8M Plus, 4× Cortex-A53 @ 1.2 GHz, 2 GiB DDR4).

Boot flow: U-Boot → `bootm` → `elfloader` → seL4 kernel → user payload.

Same elfloader build (`PLAT=imx8mp-evk`, `IMAGE_START_ADDR=0x50000000`), same kernel, only the **secondary ELF payload** (the user/OS image that the elfloader copies *after* the kernel) changes. The numbers below are wall-clock deltas between the elfloader's own `ELF-loading image '<name>' to <paddr>` line and its `Enabling hypervisor MMU and paging` line, i.e. the time spent inside `elf_loadFile()` for the second ELF (the user image), with no other instrumentation:

| Secondary ELF payload | Size | t(start unpack) → t(MMU on) | Net unpack time |
|-----------------------|------|-----------------------------|-----------------|
| small (minimal framework + framebuffer) | ~6 MB | 0:00:16.676 → 0:00:19.043 | **~2.4 s** |
| large (appliance with bundled software) | ~596 MB | 0:04:58.445 → 0:08:15.328 | **~3 min 17 s** |

For comparison, the same SoC running a non-seL4 stack that loads its kernel directly to its run-time address with no elfloader-style intermediate copy reaches userspace in ~20 s. So the bottleneck is specifically the elfloader's `memcpy` step, not the SoC, DRAM, or image size.

Throughput scales linearly with bytes: ~3 MB/s. That's roughly two orders of magnitude below this SoC's measured DRAM bandwidth once D-cache is on.

## Where the time goes

```c
/* elfloader-tool/src/binaries/elf/elf.c */
int elf_loadFile(void const *elfFile, int phys)
{
    if (elf_checkFile(elfFile) != 0)
        return 0;
    for (unsigned int i = 0; i < elf_getNumProgramHeaders(elfFile); i++) {
        uint64_t dest, src;
        size_t len;
        if (phys)
            dest = elf_getProgramHeaderPaddr(elfFile, i);
        else
            dest = elf_getProgramHeaderVaddr(elfFile, i);
        len = elf_getProgramHeaderFileSize(elfFile, i);
        src = (uint64_t)(uintptr_t)elfFile
            + elf_getProgramHeaderOffset(elfFile, i);
        memcpy((void *)(uintptr_t)dest, (void *)(uintptr_t)src, len);
        ...
    }
    return 1;
}
```

The page table installed by `init_boot_vspace()` is correct for the kernel/runtime transition but only takes effect *after* unpack:

```c
/* elfloader-tool/src/arch-arm/64/mmu.c */
void init_boot_vspace(struct image_info *kernel_info)
{
    ...
    for (i = 0; i < BIT(PUD_BITS); i++) {
        _boot_pud_down[i] = (i << ARM_1GB_BLOCK_BITS)
                            | BIT(10) /* access flag */
                            | (0 << 2) /* strongly-ordered memory */
                            | BIT(0);  /* 1 GiB block */
    }
    ...
}
```

When `elf_loadFile` runs, no MMU is on, so every store goes uncached to DRAM (worse: strongly-ordered, drains the write buffer per store on aarch64). NEON wide stores alone don't recover the full gap; the ~100× factor is the cacheable-vs-non-cacheable difference, not a microcode issue.

## Proposed fix (sketch)

Install a minimal **identity-mapped, Normal Inner+Outer Write-Back Cacheable** page table covering usable DRAM for the duration of the unpack loop, then tear it down (full D-cache clean + I-cache invalidate) before handing off to the kernel.

New helpers under `elfloader-tool/src/arch-arm/64/`:

* `elfloader_early_mmu_enable(paddr_t dram_base, size_t dram_size)`
  Build a 1:1 page table with the DRAM range as Normal-Cacheable, set TCR/MAIR appropriately, point TTBR0, enable SCTLR.M+C+I.
* `elfloader_early_mmu_disable(void)`
  `dsb sy; ic ialluis; dc cisw` over the populated range (or `dc cvac` per cache-line on the touched paddr extents), clear SCTLR.M+C, invalidate TLBs.

Call sites:

```c
elfloader_early_mmu_enable(dram_base, dram_size);
/* existing loop in common.c that calls elf_loadFile for each
 * blob (kernel, user image, optional dtb) */
elfloader_early_mmu_disable();
/* hand off to init_boot_vspace + kernel as today */
```

Existing `init_boot_vspace()` path is **untouched**; it still owns the kernel handoff state.

Optional refinement: gate on a Kconfig knob, e.g. `CONFIG_ELFLOADER_EARLY_MMU`, default-on for arm64. Off-switch lets boards that touch sensitive MMIO between unpack and handoff disable it.

## Discussion points / open questions for maintainers

1. **MMIO regions.** Are there boards in the current portfolio where some MMIO peripheral is touched between `elf_loadFile()` return and `Enabling hypervisor MMU and paging` that would be miscovered by a generic identity Normal-Cacheable mapping? I haven't found one in `elfloader-tool/src/plat/`, but I'd rather ask than break a board.
2. **Pre-existing config knob.** Did I miss an existing flag for early cacheable copy in the kbuild? I grep'd for `EARLY_MMU`, `CACHED_COPY`, `MMU_PRE`, found nothing, but might be looking under the wrong name.
3. **NEON-store fallback.** If touching the MMU pre-handoff is unwelcome for some platforms, an aarch64 `memcpy` variant using `LDNP/STNP` (non-temporal pairs) + DC ZVA where applicable might recover ~2–4×. Worth a fallback knob, or not worth the complexity?
4. **Scope.** Same `memcpy`/MMU-off pattern exists on arch-arm/32 (cortex-a7/a8/a9) and arch-riscv. Would maintainers like a unified design across arches or arm64-only first?

## Context

The issue surfaced while bringing up a self-contained appliance image (~600 MB, all software bundled into the boot blob) on this board for research/development work. The slowdown is purely a function of payload size: kernel images themselves are small, so this only matters once the secondary ELF payload reaches a few tens of MiB. Nothing about the proposal is framework-specific: any seL4 user that ships a large secondary ELF (test harnesses, embedded data sections, statically-linked monolithic apps) will see the same throughput ceiling.

If the direction sounds right, I'm happy to send a draft PR against `seL4/seL4_tools`.

For background, the discussion that triggered this issue (Genode-framework specific but the seL4 reproduction is general): https://genode.discourse.group/t/sculpt-on-sel4-arm-initial-boot-unpack-copy-is-painfuly-slow/317

Happy to attach full timestamped serial logs of both the 6 MB and the 596 MB unpacks, plus the boot-config diff, on request.

Thanks!
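The teardown step in the proposal has to touch every cache line covered by the copied extents. The following is only a sketch of that address arithmetic, written so it can be exercised on a host: `for_each_cache_line` and `record_line` are hypothetical names, and `record_line` is a stand-in for the real per-line operation (on the target that would be something like `dc civac` on each address, followed by a `dsb sy` after the loop). The line size is derived from the `DminLine` field of `CTR_EL0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CTR_EL0.DminLine (bits [19:16]) is log2 of the smallest D-cache
 * line size, counted in 4-byte words. */
size_t dcache_line_size_from_ctr(uint64_t ctr_el0)
{
    return (size_t)4 << ((ctr_el0 >> 16) & 0xf);
}

typedef void (*line_op_fn)(uintptr_t line_addr, void *ctx);

/* Host-side stand-in for the per-line cache operation: it only
 * records the last line address it was called with. */
void record_line(uintptr_t line_addr, void *ctx)
{
    *(uintptr_t *)ctx = line_addr;
}

/* Call op once for every cache line overlapping [start, start + len).
 * Returns the number of lines visited. Assumes the range does not
 * wrap around the top of the address space. */
size_t for_each_cache_line(uintptr_t start, size_t len, size_t line_size,
                           line_op_fn op, void *ctx)
{
    /* line_size must be a non-zero power of two */
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (len == 0) {
        return 0;
    }

    uintptr_t addr = start & ~(uintptr_t)(line_size - 1); /* align down */
    uintptr_t end = start + len;
    size_t count = 0;

    for (; addr < end; addr += line_size) {
        op(addr, ctx);
        count++;
    }
    return count;
}
```

Walking only the extents that elf_loadFile actually wrote keeps the maintenance cost proportional to the image size rather than to all of DRAM, which is the by-address alternative the proposal mentions next to the set/way variant.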

Thanks again!

alex-ab · May 20, 2026, 7:25am

Ideally no fork would be needed, but unfortunately the integration of the elfloader into the Genode build system is not frictionless (if someone finds a better solution, go ahead!). So, yes, either we/I will have to rebase to upstream, or we have to cherry-pick on the fork, or you keep a fork and maintain it yourself. We’ll see.

alex-ab · May 20, 2026, 9:20am

@fdelizy You may also check the flags used to build the elfloader in repos/base-sel4/lib/mk/spec/arm_v8a/kernel-sel4.inc. I vaguely remember I had to add some to get reliable boots, but they potentially affect the boot time of larger images.

fdelizy · May 20, 2026, 12:12pm

hum, interesting, I’ll look into this too.

Also, my issue got bumped on the seL4 side. My assumption was that U-Boot disables the MMU, but it doesn’t, so something else is doing it. I am still trying to figure this one out.