Reviving The Guide: dusting off an old binary file format

Quite a while ago, in the 2010s, while I was still a Windows user, The Guide was my absolute favorite note-taking application.

The concept of The Guide is as simple as it is brilliant. Notes are stored in a single file (a guide). The notebook has a tree structure, and every node can have (rich) text associated with it. Nodes can be assigned individual icons and colours. The tree can be of arbitrary depth. That’s it.

The official site describes the concept as follows:

the Guide is a two-pane extrinsic outliner. This concept is similar to mindmapping in some ways.

The Guide screenshot

When I moved from Windows to macOS, however, my glorious time with The Guide was over. It is Windows-only, so I was forced to switch to different note-taking approaches.

I went for notes as separate Markdown files, organized in file system folders. Although different from my beloved Guide tree, this approach also works well for me. What I like most about it is that it’s portable, available everywhere when used in combination with, for example, Dropbox or Google Drive, and completely free of vendor lock-in. I can use any Markdown editor to edit my notes, on desktop or on mobile.

But still. At times I miss the built-in, tree-based, one-file approach of The Guide in all its simplicity. And I also sometimes would like to browse through my old notes. These are stored in .gde files, a binary format specific to The Guide. The only way to open them on a Mac is by running The Guide in a Windows virtual machine, or in Wine. That’s not ideal, and not much fun either. Reading those files programmatically, so that I can do whatever I want with the data, is what I’m really after.

Using libguide in C

In order to programmatically read my old .gde files on Mac without the Windows application, I first checked whether The Guide source code was available (for some reason I never checked that before, or I forgot). To my great joy, the source is (still) available at SourceForge. Even better, the part that handles reading and writing .gde files is nicely packaged in a stand-alone C library: libguide (compiled as a DLL). libguide can be used completely independently of the main GUI application (Guide), which was written in C++.

The next step would be to call libguide functions from my own code and experiment with reading my old .gde files. It soon became apparent, however, that libguide would not run as-is on my Mac, because of several Windows-specific API calls in the code:

  • CreateFileMapping, MapViewOfFile are used to create memory-mapped files when reading gde files.
  • MultiByteToWideChar, WideCharToMultiByte are used for Unicode conversion.

Luckily, these functions can be replaced relatively easily by POSIX variants, like mmap and mbsrtowcs. So I did.
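
As a rough idea of what such a replacement can look like, here is a minimal sketch of mapping a file read-only with POSIX mmap, the counterpart of the CreateFileMapping/MapViewOfFile pair. The helper names map_file_readonly and unmap_file are made up for this illustration and are not taken from libguide.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not part of libguide): map a whole file read-only.
 * Returns the mapped address and stores the file size in *size_out,
 * or returns NULL on failure (including an empty file). */
static const char *map_file_readonly(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    *size_out = (size_t)st.st_size;
    return addr;
}

/* Hypothetical helper: release a mapping created by map_file_readonly. */
static void unmap_file(const char *addr, size_t size)
{
    if (addr != NULL)
        munmap((void *)addr, size);
}
```

The returned pointer can then be walked like any other byte buffer, which is how a cursor such as the p in the fragments further down would be used.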

After replacing the Windows-specific functions with POSIX ones and the Visual Studio project with a Makefile, the library compiled. Actually reading a guide file, however, caused a segfault.

Back to the drawing board.

Pointer size problems

It took me a while to figure out that the segfault was caused by the difference in pointer size between 32-bit and 64-bit architectures.

See, on a 32-bit architecture (my old Windows machine) the pointer size is 4 bytes, while on my current 64-bit machine (macOS) it is 8 bytes. That is not necessarily an issue, but it becomes a problem when code explicitly relies on a specific pointer size, as turned out to be the case in libguide.

While reading a file from memory, libguide uses “fake pointers” to store unique IDs for nodes. A fake pointer here is a pointer value that does not point to a real memory address; instead, its value is interpreted as a uint32 ID.

The small code fragment below shows how these IDs are read from a memory-mapped region that contains a gde file:

// This fragment reads the ID and parent ID of a node.
// char *p points to the memory-mapped area with gde data.
*fake_node_ptr = ((struct tree_node_t **)p)[0];
*fake_parent_ptr = ((struct tree_node_t **)p)[1];
// The cursor advances by two pointer sizes: 8 bytes on 32-bit, 16 bytes on 64-bit.
p += 2 * sizeof(struct tree_node_t *);

Because the number of bytes read from the memory-mapped area p depends on the pointer size of the reading machine, things go wrong when a 64-bit machine reads a file that was created on a 32-bit architecture.
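
One way to make the read independent of the host’s pointer size is to decode a fixed number of bytes explicitly. The sketch below assumes, based on the description above, that each ID occupies 4 bytes in the file and is stored little-endian (as a 32-bit x86 Windows build would write it). read_u32_le and read_node_ids are hypothetical helpers, not libguide functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode 4 bytes as an unsigned 32-bit little-endian value. */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Read a node ID and its parent ID from *cursor, then advance the cursor
 * by exactly 8 bytes, regardless of sizeof(void *) on this machine. */
static void read_node_ids(const unsigned char **cursor,
                          uint32_t *node_id, uint32_t *parent_id)
{
    assert(cursor != NULL && *cursor != NULL);
    assert(node_id != NULL && parent_id != NULL);

    *node_id   = read_u32_le(*cursor);
    *parent_id = read_u32_le(*cursor + 4);
    *cursor   += 2 * sizeof(uint32_t);
}
```

Decoding byte by byte also avoids unaligned pointer casts into the mapped region, at the cost of a few more lines than the original cast.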

Writing has the same problem:

static int _guide_storer_fn(struct tree_node_t *node, void *memory_mapped_data)
{
	struct tree_node_t *parent = tree_get_parent(node);
	struct guide_nodedata_t *data =
		(struct guide_nodedata_t *)tree_get_data(node);
	FILE *fp = (FILE *)memory_mapped_data;

	// write node_id
	fwrite(&node, 1, sizeof(node), fp); // <- sizeof(node) depends on architecture
	// write parent_node_id
	fwrite(&parent, 1, sizeof(parent), fp); // <- same problem here

	// ... write rest of the data

I fixed this by always reading and writing uint32 values for node IDs, not relying on machine pointer sizes anymore:

fwrite(&node, 1, sizeof(uint32), fp);	
fwrite(&parent, 1, sizeof(uint32), fp);

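This worked, although it is still an open question for me whether there are situations where a node ID won’t fit in a uint32. On a 64-bit machine the ID written is only part of the node’s 8-byte address (the lower 4 bytes on little-endian hardware), so in principle two nodes could end up with the same ID.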
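
A slightly more explicit variant of the same idea is sketched below. It converts the pointer to an integer first and then writes its low 32 bits byte by byte, so the output does not depend on the host’s pointer size or byte order. This is an illustration only: write_u32_le and write_node_id are hypothetical names, and the little-endian layout is an assumption based on the files having been created on 32-bit Windows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value as 4 little-endian bytes.
 * Returns 0 on success, -1 on a short write. */
static int write_u32_le(FILE *fp, uint32_t value)
{
    unsigned char bytes[4];
    bytes[0] = (unsigned char)(value & 0xFF);
    bytes[1] = (unsigned char)((value >> 8) & 0xFF);
    bytes[2] = (unsigned char)((value >> 16) & 0xFF);
    bytes[3] = (unsigned char)((value >> 24) & 0xFF);
    return fwrite(bytes, 1, sizeof bytes, fp) == sizeof bytes ? 0 : -1;
}

/* Store a node's address as a 32-bit ID.
 * The truncation is deliberate: only the low 32 bits are kept, so two
 * nodes whose addresses differ only in the upper bits would collide. */
static int write_node_id(FILE *fp, const void *node)
{
    assert(fp != NULL);
    uint32_t id = (uint32_t)(uintptr_t)node;
    return write_u32_le(fp, id);
}
```

If collisions ever turned out to be a real problem, handing out sequential IDs while walking the tree would be one alternative, but that goes beyond what I changed in libguide.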

Anyway, after these changes and some testing, I had a working cross-platform C library that can read my old guide files on my Mac (no guarantees about yours, maybe it will eat them). GitHub repo.

Multi-language parsing with Kaitai struct

This was all really nice. But what if I wanted to read or write .gde files in Go or Swift or whatever other language? Of course one can bridge the C code to other languages. Functions in a C library can be called from almost any language if there is some kind of “bridge” in between. Often this is a foreign function interface (FFI). For example, Go has the "C" package, for Node.js there is node-ffi, etc.

But developing bridge code for multiple languages is repetitive work. Also, calling unmanaged C from managed languages like Go and Swift has all kinds of drawbacks. Ideally, instead of bridging C, these languages would have native gde parsers. Of course, creating those sounds like a lot of work. So, what if generating native parsers for multiple languages could be an automated process?

A while ago I discovered Kaitai Struct, which immediately appealed to me, but I could not think of a use case for it at the time.

Kaitai Struct is a declarative language (using YAML syntax), which can be used to describe binary formats. Once you have a description of a format, you can use Kaitai Struct to generate parsers for it in multiple languages.

The main idea is that a particular format is described in Kaitai Struct language (.ksy file) and then can be compiled with ksc into source files in one of the supported programming languages. These modules will include a generated code for a parser that can read the described data structure from a file or stream and give access to it in a nice, easy-to-comprehend API.

Example of what a Kaitai Struct format description looks like:

meta:
  id: tcp_segment
  endian: be
seq:
  - id: src_port
    type: u2
  - id: dst_port
    type: u2
  - id: seq_num
    type: u4
  - id: ack_num
    type: u4
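
To make the description above more concrete: it declares four fields that are read in order, two unsigned 16-bit integers followed by two unsigned 32-bit integers, all big-endian. A hand-written C equivalent of that layout could look roughly like the sketch below. This is not what the Kaitai compiler generates; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the four fields of the tcp_segment description. */
struct tcp_segment {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq_num;
    uint32_t ack_num;
};

/* u2 with endian: be -> 2 bytes, most significant byte first. */
static uint16_t read_u2be(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

/* u4 with endian: be -> 4 bytes, most significant byte first. */
static uint32_t read_u4be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         |  (uint32_t)p[3];
}

/* Parse the fields in the order listed under seq.
 * Returns 0 on success, -1 if the buffer is shorter than 12 bytes. */
static int tcp_segment_parse(const unsigned char *buf, size_t len,
                             struct tcp_segment *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 12)
        return -1;

    out->src_port = read_u2be(buf);
    out->dst_port = read_u2be(buf + 2);
    out->seq_num  = read_u4be(buf + 4);
    out->ack_num  = read_u4be(buf + 8);
    return 0;
}
```

Writing this by hand for every format and every target language is exactly the repetitive work that a declarative description is meant to remove.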

The Guide uses a binary format which has a relatively straightforward structure. Seems like an ideal use case to experiment with Kaitai Struct!

Creating a Kaitai language file for The Guide was indeed not hard. After a few iterations, I was able to parse all my old Guide files without any issues in Kaitai’s web IDE. After that, it was trivial to generate parsers for major programming languages via Kaitai’s command line tool.

Besides generating parsers, Kaitai can also export parsed data as JSON or XML:

$ gem install kaitai-struct-visualizer
# JSON output
$ ksdump -f json guide.gde gde32.ksy
# XML output
$ ksdump -f xml guide.gde gde32.ksy

The Kaitai files I created for parsing .gde files are available in the GitHub repo.

Conclusion

Dusting off old binary files can be fun. If you know how the format is structured, modern tools can relieve some of the effort it takes to parse them with your favourite programming language.

Although mainly an experiment, my first experience with Kaitai was quite positive. Some nice-to-haves that will hopefully arrive in Kaitai in the future:

  • Ability to write (generate) files
  • The Kaitai project is still very much in development. Go support, for example, is not finished or well documented yet (but it works)

You don’t always have to write your own Kaitai format descriptions, by the way. For a lot of binary formats, one is already available.

updated_at 28-10-2022