As I mentioned earlier, I’m learning about microkernel concepts, and Genode seemed like a great framework to dive into. I’ve grasped most of the high-level concepts in Genode and have done some development exercises. Now, I’m studying and analyzing the base structure, which acts as a wrapper for different microkernels.
One thing that confused me from the start was the concept of capabilities. Essentially, capabilities are like tokens that processes can use to control access and permissions for various resources.
If we stick to this definition, the way caps are used in Genode seems a bit at odds with the capability paradigm. For example, as a developer, I might estimate that I’ll need about 1000 tokens for the services in my app and allocate them all upfront. This feels like I’m not using the capability paradigm with enough precision or awareness. Is there a reason for this kind of setup?
I have another question:
I noticed that in Sculpt OS, RAM allocation is mostly managed by Leitzentrale, where, for example, 32 GB of RAM is allocated to distribute among subprograms. Since ACPI provides access to memory info, I was wondering whether there’s a way to make this allocation more dynamic. This would help the OS adjust better in situations where memory is low. (I checked but didn’t find a solution, though I suspect there might be something I missed or haven’t fully understood.)
The way I understand capabilities is as follows (I haven’t seen it mentioned this way elsewhere, so I’m not sure how accurate the comparison is):
You know how POSIX file descriptors from one app cannot be used in another app? Just because it’s universally known that stdout and stderr have the values 1 and 2, and the next opened file can be predicted to get file descriptor 3, does not mean that any other app can access that file: it cannot take a shortcut, it has to go through the open() API (because FD number “3” does not mean anything in that other app until it gets properly allotted).
Well, I think caps work in much the same way: even if one app could somehow break the memory-protection barrier, get a pointer to a neighbor app’s caps and read their values (or forge them), it could not gain access to the resources they control, for lack of being registered as having been granted access to them.
Capabilities (it seems to me) are just a generalization of the FD concept. It seems someone woke up one day and thought “hey, this FD thingy works great for accessing files, why not generalize it for accessing memory, video and audio cards, and the like?”. It’s obvious in hindsight that it ‘has to be’ that way; no-one would have thought of going back in time after file FDs were invented and generalized, and reverting to global pointers or such.
So how does that relate to your question?
Well, you might have run into FD (file descriptor) limits in the past: in some OSes you can open just 256 files, and the 257th open() fails. On Linux the default is higher (typically 1024). What do you do if you want to break free of that limit? You call a special syscall that lets you raise the ceiling, i.e. the maximum number of simultaneously open FDs. I know there is one in Haiku; on Linux it’s setrlimit() with RLIMIT_NOFILE.
Genode is the same, except that 1) you can configure the cap ceiling yourself in the XML config - it does not have to be 256, you can make it lower or higher - and 2) the “upgrade the ceiling” procedure is better codified: it can be programmed (by customizing the parent init) in various ways and does not have to be a syscall.
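For illustration, here is a minimal sketch of such a configuration (the component names and numbers are made up; the attributes follow Genode’s init syntax):

```xml
<!-- excerpt of an init configuration: each <start> node carries its own
     cap ceiling, and <default> sets the ceiling for children that do not
     specify one explicitly -->
<config>
	<default caps="100"/>

	<start name="report_rom" caps="80">
		<resource name="RAM" quantum="1M"/>
	</start>

	<!-- a cap-hungry component simply gets a higher ceiling -->
	<start name="my_app" caps="500">
		<resource name="RAM" quantum="16M"/>
	</start>
</config>
```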
Does that make sense ?
It’s probably more like the ‘sculpt_manager’ component, with the Leitzentrale component being a UI ‘piloting’ it (and itself ‘piloting’ an init component that contains the runtime). But yes, your point stands: there is a component which configures a sort of 32 GB ceiling on the runtime even when running on a computer with 64 GB RAM.
Dunno if it would be just one line of code (a trivial change) to make it configure “env().global-ram()” instead; maybe someone else could chime in?
I think we’re in agreement regarding the definition of capabilities, and also that while it’s fine, it’s not particularly compelling.
Here’s my take: as a high-level developer, I’m not really involved in figuring out how many caps are needed. I’m not supposed to know that. So, I just set a number, and if I run into problems during execution, the only thing I can do is say, “Alright, let’s add 10000 more caps so the program doesn’t run into issues if it’s open for a long time.” To me, that doesn’t seem very logical. Though, I assume I might have misunderstood part of the concept.
As for RAM allocation, I’d love to know what the Genode team’s policy is. Is there a challenge when it comes to adding it dynamically, or not?
@hamed, you actually raise two topics at once. The first topic is about the purpose of capabilities as tokens for interacting with objects across protection-domain boundaries. The second topic is about the budget allocation and distribution of physical (limited) resources.
Regarding the first topic, @ttcoder’s analogy to file descriptors is spot-on. In fact, the Linux version of Genode implements capabilities as file descriptors. Another fitting analogy is the concept of references in Java. One part of a Java program cannot access an object living in some other part of the program without a reference to this object. Such a reference cannot be created out of thin air. Either you have to create the object yourself - then you are the owner of the object and can naturally reference it - or you have received the reference as argument by an explicit API call from someone else.
With respect to the second topic - resource allocation and trading - one has to consider that an object identity is not free. There is a tiny bit of metadata needed in the kernel (or, following the analogy, in the Java runtime) to keep track of which objects exist and which references point to which object. Or, to follow @ttcoder’s analogy with file descriptors, I’m talking about the in-kernel structure representing an open file/socket/pipe/device. Now the question arises: who shall pay for this metadata? In traditional operating systems, the answer is: the kernel! In most situations, this is convenient because nobody needs to spend a second thought on such costs. But in other situations - should the kernel run out of memory - there is no compelling answer. Linux, for example, has a super complex out-of-memory (OOM) subsystem that makes hard decisions to free up memory, like killing presumed resource hogs.
Genode wants to be resilient against these kinds of resource-exhaustion problems. So for us, the answer is: kernel resources must be paid for by the component that allocates them. If a component wants to allocate RAM, it needs to have the budget (RAM quota) for it. If a component wants to create a new object reference (an RPC object or a signal handler), it needs to have the budget (cap quota) for it. So Genode models kernel-maintained object references as a system-globally limited (aka physical) resource.
The enabler for dynamic systems like Sculpt OS is Genode’s approach to budget allocation and the trading of budgets between components. For the description of these concepts, let me refer you to Section 3.3 of the “Foundations” book PDF.
In short, the resource management can be dynamic. The init component responds to dynamic configuration changes. So one can extend the budget of a subsystem by increasing its quota assignment in the configuration. The Sculpt manager does that automatically for a limited set of “managed” components like the display driver. But for components interactively added, Sculpt merely fulfills the subsystem’s contractual terms as declared by the software provider in the pkg/runtime file.
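To give a rough idea of such contractual terms, here is a sketch in the style of a pkg runtime file (the component name and all numbers are invented for illustration):

```xml
<!-- pkg/<name>/runtime: the software provider declares the resource needs
     up front; Sculpt assigns exactly this budget when deploying the pkg -->
<runtime ram="64M" caps="300" binary="my_app">
	<requires>
		<gui/>
		<nic/>
	</requires>
</runtime>
```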
the only thing I can do is say, “Alright, let’s add 10000 more caps so the program doesn’t run into issues if it’s open for a long time.”
In some situations that is a practical approach. But if one really, really wants assurance, one should double-check the plausibility of that number, asking the question: is it reasonable that this particular subsystem requires 100000 object references where most components can live with 100? If this cannot be explained, I’m afraid the component must be assumed to be brittle, and the system should be designed around this assumption (e.g., by monitoring its health and restarting it whenever it becomes unhealthy).
Deviating from the original question a little (but not too much), I’d like to share my “wish list” (or “plan for the future”, if I ever stop doing wordpress/php jobs and get back to H/Genode).
My ideal system would be one with two resource “silos”, so to speak, inside the runtime:
Silo 1 would be “non critical apps”
Silo 2 would be “mission critical apps”.
Recommended policy for it to work:
put all the untrusted and (necessarily) non-critical apps in silo 1
put all the critical and (necessarily) better-trusted apps in silo 2
Ground rules:
In both silos, an app that runs out of RAM or caps and issues a request to the parent init gets it fulfilled immediately.
if silo 2 runs out of RAM or caps, it can “raid” (borrow?) some from silo 1, which complies without question
the reverse is not true: silo 1 cannot borrow anything from silo 2, which must always keep a high safety margin (lots of extra caps and RAM).
silo 2 gets more “verbose” as it gets close to the limit; it issues messages like “one of your so-called critical/trusted apps seems to actually be leaky and is using more and more memory or caps, please choose an appropriate moment to restart it!”
That kind of policy/system would be ideal for systems like the ones I used to sell (radio stations that run unattended at night, and even sometimes at daytime for a week or more), where you don’t want to be woken up in the middle of the night with a message like “the radio’s gone silent, go drive there to reboot the computer!”
I suspect it would even be interesting, if not 100% ideal, for plain-end-user systems too.
That would avoid the usual comment/problem from people who use a Genode-based system (like Sculpt OS) for the first time: “why do I have to update RAM/caps quotas explicitly, whereas on Linux it’s done automatically?”
Overall, it would feel more “linux-ish” to users, except we would get to have our cake and eat it too: we would get automatic/implicit increases in RAM/cap quotas as usage grows (which can happen a lot in resource-hungry apps, like web browsers) instead of relying on pkgs defining big initial values that might be “too little” or “too much”… And we would not have the Linux-style vulnerability where a misbehaving app can exhaust the computer’s resources. Best of both worlds.
It is (very) partly implemented in my “toy” system already: I have a couple of “silos” with separate RAM/caps allotments (they’re simply two inits standing side by side)… But the resource requests are left unfulfilled. It probably wouldn’t be hard to finish.
P.S. Conceptually, what Silo 2 amounts to is just “mutualization of safety margins”. Currently you can already configure a critical stack (ahci → vfs → apps that use that vfs) so that each component in the stack has an extra MB and an extra 100 caps… But that’s more wasteful and less foolproof than if those margins were mutualized: the waste would be reduced and the safety margins increased if you instead collected those extra MBs and caps and put them in the parent silo/init, for it to hand out to components if/when they really need it.
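Here is a rough sketch of what that mutualization could look like as a nested init (names, numbers, and the amount of slack are invented; whether plain init would hand out the slack by itself or would need a small custom manager is exactly the part that is still missing):

```xml
<!-- the critical stack runs in its own init; the parent assigns that init
     more RAM/caps than the children's sum, so the difference forms one
     shared safety margin instead of per-component padding -->
<start name="critical" caps="1000">
	<binary name="init"/>
	<resource name="RAM" quantum="96M"/> <!-- children below add up to 80M -->
	<config>
		<default caps="100"/>
		<start name="ahci" caps="150">
			<resource name="RAM" quantum="10M"/>
		</start>
		<start name="vfs" caps="150">
			<resource name="RAM" quantum="20M"/>
		</start>
		<start name="app" caps="300">
			<resource name="RAM" quantum="50M"/>
		</start>
	</config>
</start>
```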
That’s a very interesting idea—I’ve actually thought about something similar myself, though not quite as complete as your concept.
Based on my own experience and the research I’ve done, I believe a hybrid approach might be more practical. If we continue to emphasize strong security principles in microkernel-based systems, then we naturally have to accept certain limitations. For example, the very idea of moving RAM from silo 1 to silo 2 violates the principle of strict isolation, which is fundamental in microkernel design.
Also, in specific cases like the network stack—which is a stable process, not directly tied to the NIC driver, and plays a critical role in system functionality—it might be more efficient to move such components into the kernel space.
Overall, I’m a fan of hybrid systems—just like hybrid cars, which often turn out to be the most practical and ideal solution in many scenarios.
Thanks for the detailed explanation. Based on what you described, this kind of setup actually sounds pretty logical and practical. Just to make sure I’ve got it right:
As long as services are properly routed and we avoid using broad configurations like <any-service> <parent/> <any-child/> </any-service>, we shouldn’t really have to worry too much about capability leaks or running out of caps. So setting a caps_quota is mostly a safety measure—to make sure a program doesn’t go overboard.
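For reference, an explicitly routed component looks roughly like this (the particular services are just examples):

```xml
<!-- explicit routes: the component can only ever reach the services named
     here, so capabilities cannot be delegated to it by accident -->
<route>
	<service name="ROM"> <parent/> </service>
	<service name="LOG"> <parent/> </service>
	<service name="Timer"> <child name="timer"/> </service>
</route>
```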
If that’s correct, then I think it would be great if the system itself could estimate a lower bound for the cap quota, based on how often a module’s services are going to be used. That way, if a developer isn’t exactly sure how many capabilities they’ll need, there’s at least a safe starting point. And of course, if the app is expected to be heavier, the developer can always bump the quota up manually.
The two silos seem to me reminiscent of Sculpt’s initial approach of separating the drivers/leitzentrale/runtime subsystems. In the very first version, this separation was very pronounced. Since then, however, we gradually moved more and more aspects of the drivers and leitzentrale subsystems into the runtime to gain more flexibility. E.g., whereas we used to host most drivers in the drivers subsystem, today most of them are hosted in the runtime - to the benefit of being able to allocate their resources and manage their lifetime dynamically. Think of supporting high-DPI displays, which require a lot of RAM for the display driver, or think about suspend/resume where drivers enter/leave the system dynamically. I sense that we already went the silo route but ultimately had to deal with its limitations.
Since the runtime is already capable of dynamically re-balancing budgets, it seems to me that the missing part is merely a convenient user interface to the balancing. If Sculpt’s administrative user interface provided an interactive UI to define grace limits for RAM and caps, or to interactively respond to resource requests, might that settle the issue?
For example, I can imagine that once a component triggers a resource request (like Falkon when opening an extremely complex website), the component could turn red in the graph, drawing attention to it. When selected, it could tell the user the component’s additional wants and allow a click on a button to grant the requested resources. This would retain the ultimate sense of control while avoiding the inconvenient manual editing of the deploy config.
Right! I only went down the two-separate-init path because that was the first implementation that came to mind (and there’s also an implementation detail that is not relevant to this discussion: one init is “piloted” by the Deskbar/registrar, whereas the other init is fed bona fide Genode package run files). So my first naive implementation for protecting critical apps while keeping the “lid” on leaky non-essential apps was done by separating them into different inits. But that goal can clearly be (better) achieved with a different implementation.
A proper Sculpt OS implementation would not have to regress and should be done ‘the right way’, and it seems to me that what you suggest would fit the bill. Agreed on removing artificial boundaries, since boundaries remove the mutualization benefit without really buying us anything in exchange.
Sounds like this would do everything needed; I don’t see how to improve on that scheme. Hopefully other communities will see it too and go “hey, on Genode/Sculpt I can control memory usage, why don’t we have that feature in our OS instead of letting it have runaway consumption??” (especially with modern laptops, which, I’m shocked to discover, don’t have a disk-activity LED, so you don’t even know when they’re swapping virtual memory to death… what kind of craziness is that)…
As to Genodians, I suspect they will be outright enthusiastic.
Some apps could be set to “unattended-mode/always grant quota requests”, others to “ask before granting” (open a UI dialog), and others could be set to the current behavior of “suspend the app/thread and send a message to LOG suggesting to edit the quota”.
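Purely as a strawman for how that could be expressed, something like the following might work (the on_resource_request attribute is invented for this sketch and does not exist in today’s init or Sculpt):

```xml
<!-- hypothetical per-component grant policies; the attribute name is made up -->
<start name="radio_playout" caps="400" on_resource_request="grant">   <!-- unattended -->
	<resource name="RAM" quantum="64M"/>
</start>
<start name="falkon" caps="800" on_resource_request="ask">            <!-- UI dialog -->
	<resource name="RAM" quantum="256M"/>
</start>
<start name="legacy_app" caps="200" on_resource_request="suspend">    <!-- current behavior -->
	<resource name="RAM" quantum="32M"/>
</start>
```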
So yes, “LGTM”! Others should vote on Sculpt, but speaking for myself, I can say I’ll clearly follow that lead and implement something like that (maybe less refined) in h/g.
EDIT: let’s see if I can figure out how to create a “poll”…
I’ve always found it unintuitive to guess how many CAPs a program needs, compared to RAM.
Given that the cost paid by the OS for using a CAP is in RAM (as I understand it), I wonder if it would be possible for programs to pay for CAPs out of their RAM budget directly, and leave the split between general-purpose RAM and CAP-allocated RAM to the program to decide. This would mean only one number would need to be set in the runtime file, so half as many ways to accidentally under- or over-provision.
I suppose this would not be possible everywhere, as my understanding is that some kernels impose a hard limit on the maximum number of capabilities throughout the system. So on those kernels, RAM and CAPs would not be perfectly fungible.
What would feel lovely to me is some sort of interface that works similarly to how disk space is allocated:
As in, there’s a limited resource and you can’t go above the total amount on the system. When you try to launch a new program it has an idea of how much RAM it needs and refuses to start if that amount isn’t there. But once it’s started you can be confident it will keep running properly without something stealing RAM away from it.
Within that, a program can always ask for more RAM and be granted it automatically, so the programmer can, if they want, write Linux-style apps that assume infinite RAM is possible.
And for the user when they are trying to start a program and find that RAM is full they could be presented with some sort of user interface / pie chart / waffle chart showing which programs are using what amount. From there, they can decide which program to kill so that the new program they want to launch will be allowed to start.
It would also be kinda neat to have a protocol-stack component that works as a swap file, converting a block session into additional (slow) RAM as needed. This could be hooked up to specific programs that expect infinite memory, as an alternative way to allow the developer to not need to know the exact RAM budget up front.