Retro file formats: dBase files with custom indexes
In my newly found interest in retro file formats, I turned my attention to the good old dBase format.
Quoting from Wikipedia:
dBase (also stylized dBASE) was one of the first database management systems for microcomputers and the most successful in its day.
I’m old enough to remember working with dBase files on an MS-DOS computer. A typical use case for dBase on a personal computer at the time was storing a personal address book in a database. There were no iCloud backups or cheap, reliable storage media back then. You’d put your database, worth hours of manual labour, on a diskette and pray you would not encounter disk read errors one day.
Anyway. What was nice about dBase is that it was relatively easy to create forms to enter data. Also, it included a programming language, ideal for quick data manipulation.
The dbf format
The actual data was stored in .dbf files. A .dbf file contains field definitions (names and types of the fields in the database), followed by the actual data. The data is stored in fixed-length records. Records are what we would refer to today as “rows” in SQL terminology, and fields are equivalent to columns.
While essentially an obsolete file format for most use cases, dBase files are still widely used as part of the ESRI Shapefile format, a geospatial data format. Despite the name “shapefile”, the data is actually stored in a collection of files. One of those is a dBase file, which is used to store columnar attributes for each shape stored in the .shp file.
Libraries for dBase files today
Although the dBase file structure is not very complicated, it is convenient that libraries for reading and writing dBase files are still available in many languages today. For example, the dBase read/write code included in PHP 3.0 and in many other projects, originally created by Brad Eacker in 1993, is still a good starting point for handling dBase files in your own C code. Another example is the C implementation in Shapelib.
One of the main reasons that dBase libraries are not hard to find in these post-dBase times, is that dBase files are abundant in the world of GIS software, thanks to ESRI shapefiles. Many open source libraries that handle geospatial data contain code for handling dBase files, as part of their Shapefile implementation.
(By the way, the fact that dBase is still a widely used format, while not well suited for modern applications, is a situation that not everyone is happy with.)
In general, dBase libraries can handle only a subset of the .dbf files that you’ll encounter in the wild. That is because the most complicated part of the dBase format is undoubtedly its many versions and variants. Other database software products like FoxPro and Clipper used the dBase file format as well, but with their own additions, for example support for extra field types. The dBase product itself also evolved, ending with the dBASE 7 format. Currently though, III+–V is the most common dBASE file format found in the wild, so this is the format most libraries will handle without issues.
The large amount of dBase variants becomes more of a problem when working with indexes.
The jungle of dBase indexes
.dbf files are the backbone of a dBase database. But dBase actually defines many types of files. Just like Shapefiles, a dBase database can be a collection of files.
Widespread amongst these companion files are .dbt files, which are used for memo fields: fields that can hold more than the 254-character maximum of character fields.
Another important dBase file type is the index. Indexes are generally stored as a B+ tree in a separate file. Originally, dBase used .ndx files for single indexes and .mdx files for multiple indexes. But of course, for indexes too, many new variants appeared, and they still exist in the wild.
For example, Visual FoxPro supports structural compound index (.cdx) files, nonstructural compound index (.cdx) files, and standalone index (.idx) files. Clipper uses .ntx files, which are similar to .ndx files but allow longer search key expressions and store data in ASCII format. In Shapefiles, .ain and .aih attribute indexes are used. In ArcGIS 8 and later however, .atx attribute indexes are used. GDAL/OGR 1.7 uses .ind and .idm attribute indexes for Shapefiles, which are not compatible with other GIS software.
As you can see, it’s a jungle out there. Figuring out which legacy index format to use when creating a new dBase file is daunting, just like finding libraries or format descriptions to read and write all these kinds of formats. And once you know which index format to use, and how it should be structured, it turns out that writing code to create those indexes is really complicated (reading is easier, but still not straightforward). Probably for that reason, many software packages that can read .dbf files ignore indexes altogether.
While experimenting with handling dBase files in C, I was a bit stuck on how to handle indexing, for the above reasons. But then a pragmatic approach came to mind: forget legacy dBase index formats and just go for the most convenient, but still efficient, way of storing an index available today.
Using file-based key/value stores for dBase indexes
An index is basically nothing more than a way to quickly map a key to a value; in this case, the key being the field to index in the database, and the value the record number. Because records in a .dbf file have a fixed length, once you know the record number, you can access the record directly.
A while ago, I discovered the existence of file-based key/value databases, like Tokyo Cabinet, its successor Tkrzw, and constant databases. I’m a big fan of the concept, but finding everyday use cases for personal projects is challenging. I might have found at least one now: using them as an index for a dBase file. Because these key/value stores are implemented as a file hash or file B+ tree database, they can act perfectly as an index. The same goes for constant databases.
All of the above key/value engines have C libraries available, often with bindings for other languages as well, so they can be used in any modern project relatively easily.
Constant databases for indexes
CDBs, short for “constant database”, are awesome. They are basically a very reliable and fast on-disk associative array, mapping keys to values. Why “constant”? Because you create them once; after that, they are read-only. To add or change entries, you must recreate the database. That is not necessarily a problem when used for indexes: an index can be rebuilt when needed. Rebuilding is not as fast as reading, but still fast enough for most use cases.
The use case for that would be what in the dBase world is called a “non-production” index. Production indexes in dBase are automatically opened and kept up to date whenever a table is opened. Non-production .mdx files and .ndx files must be explicitly opened to be updated.
Below is an example in PHP, using PHP’s built-in DBA extension and the PECL dBase extension. It shows a simple way to build a custom index for a dBase file in a CDB database:
// Create `.dbf` file
$def = array(
    array("name", "C", 50),
    array("age", "N", 3, 0),
    array("email", "C", 128)
);
$dbf = dbase_create('/tmp/test.dbf', $def);
if (!$dbf) {
    echo "Error, can't create the database\n";
    exit(1);
}

// Add example records to .dbf
dbase_add_record($dbf, ['Maxim Topolov', '23', 'max@example.com']);
dbase_add_record($dbf, ['Leo Bakker', '45', 'leo@example.com']);
dbase_add_record($dbf, ['Bear Voxny', '23', 'bear@example.com']);
dbase_add_record($dbf, ['Qudo Malek', '21', 'qudo@example.com']);
dbase_add_record($dbf, ['Pavlo Nyrola', '34', 'pavlo@example.com']);

// Create index for the 'email' field in a constant database
$cdb = dba_open("/tmp/test.cdb", "n", "cdb_make");
$num = dbase_numrecords($dbf);
for ($i = 1; $i <= $num; $i++) {
    $rec = dbase_get_record($dbf, $i);
    $key = trim($rec[2]);
    dba_insert($key, (string)$i, $cdb);
}

// Cleanup
dba_close($cdb);
dbase_close($dbf);
DBMs (or “Berkeley DB style databases”) for indexes
A constant database is by far the fastest solution for lookups, but updating the index requires a rebuild. So if you need to update the index often, you can also resort to a DBM, or “Berkeley DB style database” as they are called in the PHP docs. Berkeley DB was one of the first in its genre of embedded databases for key/value data. Berkeley DB itself is not around anymore (not maintained), but its legacy is lasting. There are many modern variants, such as GNU dbm (C), Tkrzw (C/C++), and for Go, BadgerDB and BoltDB, amongst many others.
Tkrzw is my favorite of these, one reason being that it has a nice C interface. For example, creating a Tkrzw database and adding records in C requires a minimal amount of code:
#include "tkrzw_langc.h"

TkrzwDBM *dbm = tkrzw_dbm_open("index.tkh", true, "truncate=false,num_buckets=100");
if (dbm == NULL) {
    printf("Failure while opening database\n");
    printf("Last status message: %s\n", tkrzw_get_last_status_message());
    exit(EXIT_FAILURE);
}

// Add records.
tkrzw_dbm_set(dbm, "foo", -1, "hop", -1, true);
tkrzw_dbm_set(dbm, "bar", -1, "step", -1, true);
tkrzw_dbm_set(dbm, "baz", -1, "jump", -1, true);

// Close the database file.
tkrzw_dbm_close(dbm);
By adding a loop over the dBase records, as in the PHP example above, an index on one or more columns can be created.
Conclusion
Stepping away from the index formats supported by other software that handles dBase files comes at the price of interoperability: other software won’t be able to read those custom indexes. But for personal use, that’s not a problem at all, since I’m the only user.
Although you lose interoperability, it can easily beat struggling with old dBase index formats, especially since there is no guarantee whatsoever that, even if you do pick an existing format, the index you create is interchangeable between different software products. There is no single standard with broad support for dBase indexes, after all.
An unanswered question you might still have about all this: why would someone in 2023 work with the ancient dBase format in the first place? The answer is the same as for many hobby projects: because why not! Why would you use something as mundane as SQLite, when you can keep a 40-year-old database format with no Unicode support alive?