A project manager would periodically call me to complain about the lack of documentation, specifically: “I do wish they would document their shit.” Meanwhile, the developers’ simultaneous gripe was “Why document our code with one, two, five or fifty pages of shit when no one will actually read it?” And they might have mentioned: writing documentation is boring. Most programmers are creatives, and most programming is writing, but most programmers are not writers in the usual sense. A lot of documentation is generated in a state of partial duress, often resulting in information-free one-liners: “Function load_all is the function that loads everything.” Right…
The manager and developers both had a point. I’m working with a data set right now with 200+ tables and 5000+ columns, and without a data dictionary I’d really be sunk. But standalone documentation is not easy to use, and it quickly goes out of date. One column name change, and poof! – the documentation and the code are out of sync… If I have to search just to find a definition, I won’t do it as much, and I’ll also be a lot less motivated to write documentation knowing my work has the lifespan of a mayfly.
Part of the problem is how we write documentation – as large blobs that are repetitive and painful to craft, when a short snippet in context would be fine. Much of the remaining issue might be where we put our docs – off in a disconnected corner someplace, far away from the objects they describe. As the Godfather might have put it: Keep your friends close, and your attributes closer. Which is just object-oriented programming: documentation is an attribute of an object (e.g. a data table, row, or column) and it belongs with the object, not in some remote, disconnected data junkpile. In context I can write much less, meaning I’m a lot more likely to do it. And with an object’s documentation attached to an object, I can always find it. (When the time comes, another programmer can assemble the bits and pieces to meet the inevitable requirement for a virtual-dust-gathering 500 page documentary PDF. And truly, no one will read it…) As a bonus, crafty developers will rise to the challenge of finding ways to automatically turn object attributes into relevant documentation, making it even more likely we’ll get something.
OK, if you like OO you might say “sounds good in theory, but…” I did too, so to test things I decided to take this approach (using R) with the data set I mentioned above (involving energy production – more on that separately). It wasn’t a slam-dunk, as the sources had separate data dictionaries that I had to back-integrate into the data. But is the result better? It is… I was actually excited to see it work. Woo hoo! Now as long as I can get to my data object, I’m only a call away from getting definitions, or data lineage (which I automated so I can understand how my data got from there to here). Things like name changes or re-ordered columns no longer impact the docs. Having the documentation attached also means that functions like ls() that display tables and columns can display documentation at the same time (a good start on that PDF, which in this case only runs to about 10 pages).
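To make the idea concrete, here is a minimal sketch in R of what attaching definitions to columns might look like. It is not the code from my project: the toy table, its column names, the "definition" attribute name and the col_definitions() helper are all made up for illustration. It relies only on base R's attr().

```r
# Hypothetical toy table; the column names and definitions are illustrative only.
enprod <- data.frame(
  plant_id = c(101L, 102L),
  net_gen_mwh = c(5230.5, 812.0)
)

# Attach a short definition to each column as an attribute of the column itself.
attr(enprod$plant_id, "definition") <- "Unique identifier of the generating plant"
attr(enprod$net_gen_mwh, "definition") <- "Net generation in megawatt-hours"

# Hypothetical helper: collect the definitions of all columns in one call.
# Columns without a definition come back as NA.
col_definitions <- function(df) {
  vapply(df, function(col) {
    d <- attr(col, "definition", exact = TRUE)
    if (is.null(d)) NA_character_ else d
  }, character(1))
}

# Rename one column and reorder both; the definitions travel with the columns.
names(enprod)[names(enprod) == "net_gen_mwh"] <- "generation_mwh"
enprod <- enprod[, c("generation_mwh", "plant_id")]

defs <- col_definitions(enprod)
print(defs)
```

One caveat with this sketch: plain attributes on a column are generally dropped when rows are subset, so a real implementation would need a small class or a re-attach step to keep definitions through that kind of operation.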
This does hinge on one thing: having a convenient way to assign an attribute to a data object, and most relational databases I see don’t offer simple and natural support for object attributes. Some systems, like SQL Server, expose very extensive information on pre-defined attributes, but not on arbitrary ones like a data lineage stack, or even a column definition. That support can always be crafted, but it’s a significant chore.
I don’t follow the add-on products for systems like SQL Server closely, but an add-on that provided table and column attribute support (via auxiliary tables) might just make data documentation a lot simpler to create and easier to use. Being able to reliably and simply attach a set of varchar name/value pairs to any schema object would be a great start. A really nice system would allow arbitrary R-style lists as attributes, like this list-of-list data-lineage attribute, which describes the call stack for loading a table from source:
 "load_ds(enprod, mode = \"a\", progress = TRUE)"
 "load_ds_tbl(ds, mode, progress)"
 "load_ds_tbl_src(ds, srccode = sc, mode, progress)"
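As a rough sketch of how a list-of-lists lineage attribute like that could be built up in R, assuming each load step records its own call text: the add_lineage() helper, the "lineage" attribute name and the note field are hypothetical, and the call strings are simply the ones shown above, not output from real functions.

```r
# Hypothetical helper: append one lineage step to an object's "lineage" attribute.
# Each step is itself a list, so the attribute ends up as a list of lists.
add_lineage <- function(obj, call_text, note = NULL) {
  steps <- attr(obj, "lineage", exact = TRUE)
  if (is.null(steps)) steps <- list()
  steps[[length(steps) + 1]] <- list(call = call_text, note = note)
  attr(obj, "lineage") <- steps
  obj
}

# Stand-in for a loaded table.
tbl <- data.frame(x = 1:3)

# Record the call stack, outermost call first.
tbl <- add_lineage(tbl, "load_ds(enprod, mode = \"a\", progress = TRUE)")
tbl <- add_lineage(tbl, "load_ds_tbl(ds, mode, progress)")
tbl <- add_lineage(tbl, "load_ds_tbl_src(ds, srccode = sc, mode, progress)")

# Retrieve the lineage in one call from the object itself.
lineage_calls <- vapply(attr(tbl, "lineage"), function(s) s$call, character(1))
print(lineage_calls)
```

Because each step is a list, other fields (a timestamp or a source file name, say) could be added later without changing how the lineage is read back.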
I’m convinced that having documentation snippets right with our data objects really makes documentation simpler to create, and more reliable to consume. (There is still content to create of course.) However, I don’t see the necessary infrastructure in most data systems, which may account for some of the difficulties we see in crafting and using data documentation. Objects and their attributes: we should keep ’em close.