Maaxboard (imx8mq_evk) networking performance

Hi all,

I am a total noob to Genode (finding it pretty cool so far, though) and have recently been trying to run some simple network benchmarks on the Avnet Maaxboard (same SoC as the IMX8MQ_EVK) with KERNEL=hw. The numbers I’m seeing are quite low, and I’m wondering whether I’m doing something wrong or whether there are changes I could make to improve performance.

In my setup, the Genode system acts as an echo server that echoes packets back to a load generator, which measures the throughput and latency achieved. In my runs of the test, throughput maxes out at around 100 Mbps even though the NIC supports 1 Gbps, and if I try to push it higher (e.g. to 300 Mbps), I see a whole bunch of

[init -> nic] handle_rx: uplink saturated, dropping packet

log messages from the NIC driver, and performance collapses (probably in part due to all the logging).

My benchmarking setup on Genode consists of the following components:

Echo server <--> Uplink <--> <Nic>

where the client is the echo server found in libports/src/test/lwip/udp/server.

I’ve tried turning up the optimisation level following the “Genode Compilation Flags - Optimization” thread on lists.genode.org, but this didn’t make much difference. Does anyone have any other suggestions for things I could try to improve the performance?

I was thinking of increasing the sizes of the queues between the NIC/uplink and client/uplink, but I’m not sure what the right way of doing this is. Another thing that might be useful is measuring the CPU utilisation, but I’m not sure how to do that in the minimal example I have set up. If anyone could point me to an example that shows how this can be done, that would also be very much appreciated.
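
For the queue sizes, what I had in mind was roughly the following, i.e., passing larger tx/rx buffer sizes when the Nic session is opened. This is just a sketch of my understanding (the exact constructor arguments may be off), and my echo server actually goes through the libc/lwIP stack, which opens the session internally, so I don’t know where the equivalent knob would be there:

	/* sketch: opening a Nic session with enlarged packet-stream buffers */
	#include <nic_session/connection.h>
	#include <nic/packet_allocator.h>

	static void open_nic(Genode::Env &env, Genode::Allocator &alloc)
	{
		enum { BUF_SIZE = 512*1024 };   /* guess: larger than the default */

		static Nic::Packet_allocator pkt_alloc { &alloc };

		/* tx_buf_size and rx_buf_size set the sizes of the shared bulk buffers */
		static Nic::Connection nic { env, &pkt_alloc, BUF_SIZE, BUF_SIZE };
	}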

Thanks!

Edit: I also ran the netperf-lwip benchmark and am getting around 250 Mbps in each direction with the TCP stream benchmarks. I’m not really sure how to interpret this: should I be expecting roughly double what I see with the echo server, given that the netperf benchmarks are unidirectional?

I encountered a similar issue during network tests on Genode with x86. In the QEMU environment, the bandwidth would saturate, and I would see similar log messages from the NIC driver. However, when I ran the same test on real hardware, the problem was resolved. For this reason, I believe it’s likely a driver-related issue or a low-level limitation.

I think your assumption is correct, and you should work on the structure of the input and output queues. However, it seems that the issue is related to the rx and tx queues at the driver level, not the interface between the driver and the NIC router.

This may be a stupid question, but have you tried it with the Nova kernel?

From what I can see, the Genode support for Nova currently only extends to x86_64 and x86_32. I tried building my example with KERNEL=nova and the build failed.

I had the feeling it was a dumb question. :wink:

The reason I think it is the queues between the NIC/uplink or uplink/echo server (or something else related to these components) is that the message says the uplink is saturated, which results in the packet being dropped. To me, this suggests that the NIC can keep up fine and is receiving packets, but that the uplink/echo server is unable to consume them at a fast enough rate. The queues then fill up, and the driver drops received packets because it has nowhere else to put them.

Edit: Additionally, my understanding is that the NIC driver is ported from the Linux kernel, and running a similar echo-server benchmark on Linux (albeit 6.1.0) achieves a maximum throughput of close to 600 Mbps.

Based on the test you conducted and the observations you mentioned, I believe your conclusion makes sense. However, I have a different perspective: I think the issue might be related to scheduling under limited CPU performance. In similar conditions on x86, an increase in CPU performance leads to better network throughput, and the same error doesn’t occur. So, I suggest paying attention to the prioritization of the components as well.

That being said, I’m not a developer of the Genode framework and I’m merely sharing an idea. Hopefully, the Genode developers can provide more precise guidance.

I took a quick glance at the code of this component and think that it is not suitable for benchmarking.

	while (true) {

		/* receive and send back one message without any modifications */
		char buf[4096];
		::memset(buf, 0, sizeof(buf));
		ssize_t rcv_sz = recvfrom(s, buf, sizeof(buf), 0, (struct sockaddr*)&addr, &addr_sz);
		if (rcv_sz < 0) {
			throw Receive_failed();
		}
		if (sendto(s, buf, rcv_sz, 0, (struct sockaddr*)&addr, addr_sz) != rcv_sz) {
			throw Send_failed();
		}
	}

The code operates strictly synchronously on a mere 4K buffer. So the costs of context switching between the NIC driver, NIC router, and the server must be paid for every 4 KiB of data. To put this value into perspective, consider that the I/O buffer for the NIC session is around 3 MiB. As a quick test of this hypothesis, you may increase the size of buf to, let’s say, 400 KiB (better turn the local variable into a static one to avoid a large stack allocation).
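
Applied to the loop quoted above, the quick test could look roughly like this (just a sketch of the idea; the per-iteration memset is dropped because recvfrom overwrites the received portion of the buffer anyway):

	while (true) {

		/* receive and send back one message without any modifications */
		enum { BUF_SIZE = 400*1024 };

		/* static to avoid a 400 KiB stack allocation */
		static char buf[BUF_SIZE];

		ssize_t rcv_sz = recvfrom(s, buf, sizeof(buf), 0, (struct sockaddr*)&addr, &addr_sz);
		if (rcv_sz < 0) {
			throw Receive_failed();
		}
		if (sendto(s, buf, rcv_sz, 0, (struct sockaddr*)&addr, addr_sz) != rcv_sz) {
			throw Send_failed();
		}
	}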

Regarding the tweaking of I/O buffers, you may find the run script at repos/ports/run/lighttpd_fetchurl.run useful as inspiration. It hosts a completely virtual network setup consisting of a web server (lighttpd) and a web client (fetchurl using curl) connected via the NIC router. You’ll need to increase the download_size value to your liking. Since there is no real network device involved, you can observe how the context switches and protocol stacks influence the end-to-end throughput. It is a nice playground for experimenting with I/O buffer sizes and cross-correlating different platforms.

Thanks for the reply!

Just tried this and it didn’t make a difference. The actual packets being sent are 1472 bytes each.

I’m a little confused by this, because wouldn’t this only be the case if the protocol used for networking were synchronous? My understanding was that the protocol is asynchronous, meaning that the client’s lwIP stack would just check the queues for content and only block if they were empty. Similarly, the NIC, upon receiving packets, would just enqueue them to the buffers as long as they had space and not necessarily wait for the client to call recv(). Is my understanding here incorrect?
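
To make that concrete, the receive pattern I had in mind for the packet-stream interface is roughly the following. This is only a sketch based on my reading of the Nic session interface, not the actual lwIP-plugin code, so the details may well be off:

	/* sketch: draining the rx queue of a Nic session without blocking */
	void handle_rx_signal(Nic::Connection &nic)
	{
		/* consume every packet the driver has already enqueued */
		while (nic.rx()->packet_avail() && nic.rx()->ready_to_ack()) {

			Genode::Packet_descriptor const pkt = nic.rx()->get_packet();

			/* ... look at nic.rx()->packet_content(pkt) here ... */

			/* hand the buffer space back to the driver */
			nic.rx()->acknowledge_packet(pkt);
		}

		/* only once the queue is empty do we wait for the next signal */
	}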

I see how this could fall apart and inadvertently act as a synchronous interface if you weren’t careful with component priorities and CPU-time allocation (for example, the NIC might monopolise CPU time at higher throughputs if you allow it to do so), but my run example was based on mnt_pocket_stmmac_nic, which doesn’t do any special priority/CPU-time allocation for the components. Maybe I’ve misconfigured this in some way?

Thanks, I’ll check this out and see what kind of throughput I can achieve in the absence of the NIC.

Edit: Another thing to consider might be DMA and copying. Is it the case that on Tx, the client is copying packet data from a user buffer into a bulk buffer, which is then copied by the uplink into another bulk buffer between it and the NIC, and finally from that bulk buffer into a DMA buffer that can be accessed by the NIC device?

I’m a little confused by this, because wouldn’t this only be the case if the protocol used for networking were synchronous?

I’m sorry that I jumped to conclusions too hastily. When I looked at the code, a ping-pong scenario (like TFTP) sprang to mind. But in your post you mentioned the use of a traffic generator. Thank you for giving the idea a try anyway and dispelling my assumption.

[…] the NIC might monopolise CPU time […]

For analyzing the CPU utilization, let me recommend the trace-logger utility that you can find at repos/os/src/app/trace_logger/. It uses the TRACE service of Genode’s core component to obtain the consumed CPU cycles for each active thread over time. To give it a try, you may modify the lighttpd_fetchurl.run script as follows:

diff --git a/repos/ports/run/lighttpd_fetchurl.run b/repos/ports/run/lighttpd_fetchurl.run
index bd403cbc39f..108b1e6962c 100644
--- a/repos/ports/run/lighttpd_fetchurl.run
+++ b/repos/ports/run/lighttpd_fetchurl.run
@@ -10,8 +10,8 @@ if {[have_board virt_qemu_riscv]} {
 }
 
 proc ip_stack      { } { return lwip }
-proc download_size { } { return "1M" }
-proc audit         { } { return 1 }
+proc download_size { } { return "1000M" }
+proc audit         { } { return 0 }
 
 build "lib/vfs_[ip_stack] lib/libc app/fetchurl lib/curl"
 
@@ -32,6 +32,7 @@ import_from_depot [depot_user]/src/[base_src] \
                   [depot_user]/src/vfs_[ip_stack] \
                   [depot_user]/src/vfs_audit \
                   [depot_user]/src/vfs_pipe \
+                  [depot_user]/pkg/trace_logger \
                   [depot_user]/src/zlib
 
 proc lighttpd_config { } {
@@ -59,6 +60,7 @@ install_config {
 		<service name="IRQ"/>
 		<service name="IO_PORT"/>
 		<service name="IO_MEM"/>
+		<service name="TRACE"/>
 	</parent-provides>
 	<default-route>
 		<any-service> <parent/> <any-child/> </any-service>
@@ -147,6 +149,23 @@ install_config {
 		</config>
 	</start>
 
+	<start name="timer">
+		<resource name="RAM" quantum="1M"/>
+		<provides><service name="Timer"/></provides>
+	</start>
+
+	<start name="trace_logger" caps="200">
+		<resource name="RAM" quantum="16M"/>
+		<config verbose="yes" session_ram="9M" session_arg_buffer="1M"
+		        period_sec="5" default_policy="null" default_buffer="4K">
+			<default-policy/>
+		</config>
+		<route>
+			<service name="TRACE"> <parent label=""/> </service>
+			<any-service> <parent/> <any-child/> </any-service>
+		</route>
+	</start>
+
 </config>}
 
 build_boot_image [build_artifacts]

The patch

  • removes the logging of VFS interactions (audit),
  • bumps the download size to 1000M,
  • adds the trace_logger to the scenario by adding its pkg to the import_from_depot function,
  • starts the timer service (needed by the trace_logger for the periodic reporting), and
  • starts the trace_logger with the policy null to omit the tracing of RPCs while still reporting the CPU cycles consumed by each thread every 5 seconds.

When I start the scenario on the PinePhone, I get a table as follows:

[init -> trace_logger] 
[init -> trace_logger] Report 12
[init -> trace_logger] 
[init -> trace_logger] PD "kernel" -------------------------------------------------------------
[init -> trace_logger]  Thread "idle"         at (0,0)  ATTACHED total:158253919 recent:0       
[init -> trace_logger]  Thread "idle"         at (1,0)  ATTACHED total:0         recent:0       
[init -> trace_logger]  Thread "idle"         at (2,0)  ATTACHED total:0         recent:0       
[init -> trace_logger]  Thread "idle"         at (3,0)  ATTACHED total:0         recent:0       
[init -> trace_logger] 
[init -> trace_logger] PD "init" ---------------------------------------------------------------
[init -> trace_logger]  Thread "init"         at (0,0)  ATTACHED total:148584    recent:0       
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:2297321   recent:0       
[init -> trace_logger] 
[init -> trace_logger] PD "init -> nic_router" -------------------------------------------------
[init -> trace_logger]  Thread "root"         at (0,0)  ATTACHED total:44299     recent:0       
[init -> trace_logger]  Thread "signal_proxy" at (0,0)  ATTACHED total:3308      recent:0       
[init -> trace_logger]  Thread "nic_router"   at (0,0)  TRACED   total:7837729   recent:650534  
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:185400974 recent:15709180
[init -> trace_logger] 
[init -> trace_logger] PD "init -> lighttpd" ---------------------------------------------------
[init -> trace_logger]  Thread "lighttpd"     at (0,0)  ATTACHED total:60677     recent:0       
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:673275788 recent:57378535
[init -> trace_logger] 
[init -> trace_logger] PD "init -> fetchurl" ---------------------------------------------------
[init -> trace_logger]  Thread "fetchurl"     at (0,0)  ATTACHED total:60570     recent:0       
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:392392278 recent:33509079
[init -> trace_logger] 
[init -> trace_logger] PD "init -> timer" ------------------------------------------------------
[init -> trace_logger]  Thread "timer"        at (0,0)  TRACED   total:406402    recent:28967   
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:25855050  recent:2173874 
[init -> trace_logger]  Thread "root"         at (0,0)  TRACED   total:145896    recent:0       
[init -> trace_logger]  Thread "signal_proxy" at (0,0)  TRACED   total:9023      recent:0       
[init -> trace_logger] 
[init -> trace_logger] PD "init -> trace_logger" -----------------------------------------------
[init -> trace_logger]  Thread "trace_logger" at (0,0)  TRACED   total:125072    recent:4238    
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:19370723  recent:1962548 
[init -> trace_logger] 
[init -> trace_logger] PD "init -> report_rom" -------------------------------------------------
[init -> trace_logger]  Thread "report_rom"   at (0,0)  ATTACHED total:62137     recent:0       
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:7606472   recent:598562  
[init -> trace_logger]  Thread "root"         at (0,0)  TRACED   total:23612     recent:0       
[init -> trace_logger]  Thread "signal_proxy" at (0,0)  TRACED   total:3465      recent:0       
[init -> trace_logger] 

It tells me that the CPU has not been idle during the most recent 5 seconds. The scenario is CPU-bound, as expected. The dominating CPU consumers are lighttpd (57M ticks), fetchurl (34M ticks), the NIC router (16M ticks), and the trace logger (2M ticks).

Another thing to consider might be DMA and copying.

That’s right. For investigating this hypothesis, another tool comes in handy, namely the GENODE_LOG_TSC and GENODE_TRACE_TSC macros of base/log.h. Let’s say we want to know how much time the TCP/IP stack spends copying data. A grep for memcpy within libports/src/lib/vfs/lwip brings nic_netif.h to my attention (sys_arch.cc might also be of interest). The following modification introduces a scope for the code portion I want to measure and puts the GENODE_TRACE_TSC macro at the beginning of this scope:

diff --git a/repos/libports/src/lib/vfs/lwip/nic_netif.h b/repos/libports/src/lib/vfs/lwip/nic_netif.h
index 9a5433627c1..cf706a93dbe 100644
--- a/repos/libports/src/lib/vfs/lwip/nic_netif.h
+++ b/repos/libports/src/lib/vfs/lwip/nic_netif.h
@@ -400,12 +400,16 @@ class Lwip::Nic_netif
 			 * We iterate over the pbuf chain until we have read the entire
 			 * pbuf into the packet.
 			 */
+			{
+				GENODE_TRACE_TSC(5000);
+
 			char *dst = tx.packet_content(packet);
 			for (struct pbuf *q = p; q != nullptr; q = q->next) {
 				char const *src = (char const *)q->payload;
 				Genode::memcpy(dst, src, q->len);
 				dst += q->len;
 			}
+			}
 
 			tx.try_submit_packet(packet);
 			_wakeup_scheduler.schedule_nic_server_wakeup();

According to the argument given to GENODE_TRACE_TSC, the time-stamp counter is logged to the trace buffer after every 5000 executions of the code. There is also a GENODE_LOG_TSC variant that prints messages to the log, but this is much more heavy-weight than tracing. With this change, the output of the trace logger is supplemented with messages like this:

[init -> trace_logger] PD "init -> lighttpd" ---------------------------------------------------
[init -> trace_logger]  Thread "lighttpd"     at (0,0)  ATTACHED total:60713     recent:0       
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:440991386 recent:57509127
[init -> trace_logger]   28867784731: TSC linkoutput: 1172M (295000 calls, last 4076)
[init -> trace_logger]   29411984170: TSC linkoutput: 1191M (300000 calls, last 3684)
[init -> trace_logger]   29879209261: TSC linkoutput: 1211M (305000 calls, last 3748)
[init -> trace_logger]   30325553506: TSC linkoutput: 1229M (310000 calls, last 3604)
[init -> trace_logger]   30761963089: TSC linkoutput: 1249M (315000 calls, last 3587)
[init -> trace_logger]   31214243710: TSC linkoutput: 1269M (320000 calls, last 6169)
[init -> trace_logger]   31641200854: TSC linkoutput: 1288M (325000 calls, last 3606)
[init -> trace_logger]   32175499318: TSC linkoutput: 1307M (330000 calls, last 3816)
[init -> trace_logger] 
[init -> trace_logger] PD "init -> fetchurl" ---------------------------------------------------
[init -> trace_logger]  Thread "fetchurl"     at (0,0)  ATTACHED total:60829     recent:0       
[init -> trace_logger]  Thread "ep"           at (0,0)  TRACED   total:250211658 recent:32007396
[init -> trace_logger]   29764990276: TSC linkoutput: 90220K (160000 calls, last 501)
[init -> trace_logger]   30602302084: TSC linkoutput: 92853K (165000 calls, last 475)
[init -> trace_logger]   31447420543: TSC linkoutput: 95493K (170000 calls, last 482)
[init -> trace_logger]   32408192512: TSC linkoutput: 98128K (175000 calls, last 482)
[init -> trace_logger] 

Note that TSC values must be handled with caution. Should a thread be preempted by the kernel, the TSC value would jump. However, in practice, in particular when inter-thread dependencies dominate the context switching, the TSC values can be correlated with the CPU cycles spent.

We can see that within one period (5 seconds) lighttpd spends about 200M cycles in this particular memory-copy operation. In relation to the roughly 5G cycles recently spent (assuming that TSC values grow by 1G per second, which depends on the CPU), that is only about 4 %, so the copying is not dominating. The major costs must come from somewhere else. For the client side (fetchurl), the costs are negligible. This is expected because the bulk data transfer happens from server to client.

The further performance analysis comes down to finding good spots for placing GENODE_TRACE_TSC measurement points.

By tweaking the macro argument, you can also get a feel for how often a given code path is executed. E.g., the following GENODE_LOG_TSC instrumentation within the libc tells me how often the libc-using application wakes up the remote side (the NIC router).

diff --git a/repos/libports/src/lib/libc/internal/kernel.h b/repos/libports/src/lib/libc/internal/kernel.h
index 12617364958..255f7500289 100644
--- a/repos/libports/src/lib/libc/internal/kernel.h
+++ b/repos/libports/src/lib/libc/internal/kernel.h
@@ -726,6 +726,7 @@ struct Libc::Kernel final : Vfs::Read_ready_response_handler,
 		void wakeup_remote_peers()
 		{
 			_libc_env.vfs_env().io().commit();
+			GENODE_LOG_TSC(1000);
 		}
 };
 

I tweaked the value until I got messages at a rhythm of approx. one per second, so I can infer that the code is executed about 1000 times per second.

Hope that you will find these instruments helpful for your further analysis.

Thanks for the super detailed response!

I’ll try using the methods you’ve suggested to narrow down what’s going on with my benchmark. If I find any useful insights, I’ll report back here (probably going to take a while though).

I just edited my post above, trying to distinguish the different time units. The values for the consumed CPU time as printed by the trace_logger are given in kernel time ticks, whereas the GENODE_TRACE_TSC macro captures values of the x86 TSC or ARM performance counter. So the values from the two sources cannot be put directly in relation to each other.

If I find any useful insights, I’ll report back here

That would be very much appreciated. :slight_smile:

Regarding your concrete setup, there are a number of further directions you may check:

  • If I remember right, on the i.MX8 EVK, the CPU was clocked at only 1 GHz or lower by default (by the firmware or boot loader). Maybe your board is not operating at full speed when running Genode, whereas the Linux kernel initializes the clock to a higher frequency?
  • The Ethernet PHY can differ from board to board. It may operate at only 100 MBit/s if it is not set up for 1 GBit/s by a driver supporting the specific PHY chip, usually first by U-Boot and later by the NIC driver.
  • To follow your intuition that additional memory-copy operations on account of the inter-component communication may impede the throughput, you may comment out the copying and zeroing of the packet content and observe the effect on the throughput (see the sketch below). Your test setup is not looking at the packet content after all.
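
For the last point, one of the spots to stub out would be the copy loop in the lwIP glue code quoted earlier. A throw-away experiment (not a proper change) could look like this; the peer will then see garbage payloads, which should be acceptable for a pure throughput measurement:

	/* copying disabled for the throughput experiment */
	#if 0
	char *dst = tx.packet_content(packet);
	for (struct pbuf *q = p; q != nullptr; q = q->next) {
		char const *src = (char const *)q->payload;
		Genode::memcpy(dst, src, q->len);
		dst += q->len;
	}
	#endif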