[withdrawn]Proposed eXtensible Tokenized Format (tokenized XML)

Continuing the discussion from Proposed HRD File Format:

As an alternative to XML and HRF, I’d like to propose a tokenized alternative to either of them, tentatively called XTF.

Most XML usage is stored in a Zip file because there is so much repetition of the same keywords and syntax that it compresses quite well but obviates the actual need for compression. Tokenization would combine the steps of decompressing and parsing into a single, common entity. Another example of tokenization in the real world was that Adobe created a text-based language for laser printers to use called PostScript. It was humanly readable but was so verbose and produced such huge files that a tokenized version called PDF has almost completely replaced the earlier PostScript printer driver format (as well as some competing driver encodings like HPPCL).

In the HRD thread, I pointed out that a graphical editor based on a tree gadget would be exceedingly more readable than any humanly-readable text format. The tree gadget typically has members that fall into one of 3 categories:

  1. The Root element (or Header)
  2. Nodes
  3. Leaves

The Root Element

The root element contains a representation of the string format and the schema of XML the rest of the file is based upon. To make tokenization work, I’d make the schema a serialization of all the tokens and their respective types in a hash table (known as an “unordered map” in C++11 for any beginners out there).

Node types

There are 2 ways to delimit the contents of a node type:

  1. Header records (with a count) like C++ string format
  2. Trailer records like the null-terminated C string format

I prefer the former because allocations can be accomplished all at once when loading, rather than having to have a lot of little allocations for each separate entity. Also, arrays take less memory and are generally the quickest and most randomly accessible data structure.

Leaf types

All of the primitive formats and enumerations that don’t expand into a bigger multi-element form are leaf types. This includes integers, floating-point numbers, strings and so on. Since enumerations are generally known at the time of the schema’s definition, the enumerated types will also be tokens.

Other notes

  • Most of the string definitions of tokens are invariant and can benefit from a perfect or minimal perfect hash function. CMPH generates the latter and GPerf generates the former. Since minimal perfect hash functions are much more dense in terms of storage, they would be preferred. (The goal of making compression unnecessary is one goal of the project, after all.)
  • Nodes containing other node types can easily be represented by fixed-length arrays of pointers so that order can be preserved and memory consumption minimized while still allowing nodes within nodes.

Questions and Comments?

Please post below and let me know what you think.

For what its worth, while I think a visual editor would be lovely, I’d be scared to depend entirely on a complex visual development tool that doesn’t yet exist.

Both the current XML and proposed HRD can be edited in any editor or IDE.

If space savings were a concern or if it made the parser simpler I would be fine with a bytecode format that was compiled at build time. There’s many versions of this already Binary XML - Wikipedia though given that only a subset of XML is used I’m sure a much simpler version would be possible.

With that said though, of the gripes listed in Moving on from XML? A teaser for a possible alternative I don’t think a binary format would help.

2 Likes

On the binary-vs-textual divide I clearly fall on the ‘textual’ side in this case, for a simple reason: version tracking.

Don’t get me wrong – I’m as vocal as anyone against textual data and try to not use that too much, for the usual reasons (there are so many ways to get outside of the “correct” space and into the “error/mistake” space, plus textual data takes up more storage and similar considerations, and so on). Heck, the first variant of “BRS tunetracker” (as it was called in 2000) used FORM/IFF binary config files… which later became flattened BMessages.

But in this case, consider this: the Genode .run script files contain configuration data that is embedded inside the script, and either in static form or generated on the fly, both for “unit test” run files and the more general scenarios run files, thus

  1. you’re not going to find a remotely practical equivalent way of embedding and generating binary data, that stuff can only be reasonably done with textual data, and
  2. that config data is critical to the functioning of Genode, so it must be “diff’able”, version tracking must clearly show differences, which can only be done with textual data (you cannot “diff” binary data!)

The above is clear as day to anyone who spent some significant time with Genode, doing system integration or playing with run files a little. Especially point 1 ; but point 2 casts a much wider net, and was (for instance) instrumental in my moving to Jamfiles, in the early 2000s: when I started coding for BeOS with Metrowerks Codewarrior I tried to include its binary .proj files in version tracking, but quickly realized it made little sense.

The Genode team might explain all this better… Though hopefully I saved them some time by editing the above explanation to spare them the trouble :wink:

I’m more inclined to discuss the visual (textual?) editor tool. Now that we’re migrating to HRD I could imagine myself coming up with an ad-hoc quick and dirty parser for HRD, and write a little tool that assists on an initially very reduced set of features. For example, my dang brain can’t assimilate or grok the subtleties of prefix/suffix label matching. So if the knowledge of how those work could be embodied by a piece of code, which would immediately show on-screen how the matching is going to be processed, without the need for me to run the scenario in Qemu, and possibly fail silently without any tracing information on screen, that would save me gigatons of valuable development time.
(disclaimer: that’s an hypothetical, there’s about zero odds that I’m actually going to pull the trigger on that project, there are many things that are higher on my priority list)

1 Like

Thanks for the Wikipedia link. I did a single search before I posted but I hadn’t checked Wikipedia. EXI looks the most promising.

@ttcoder
The binary parsing shortcomings of Git never cease to amaze me. Having a macro-expansion that would return an XML or HRD document with a sane style might solve that if Git and Diff could be supplemented.

Since EXI already exists and is recommended by the W3C, I’m withdrawing the proposal. Also in the original thread, it has been made clear that HRD is an exploration of the idea of a new format but is not decided conclusively. Compiled XML formats can be adopted at build time anyway.