On several occasions, including today while talking with a longtime colleague, I’ve threatened to make my professional epitaph “‘I might need that data someday’ is not a valid use case!”
Seriously: what we store now is unlikely to be useful ten, five, or even two years from now, and perhaps we should rethink our current approach to data storage, which so often defaults to “let’s hold on to that.”
Most data records have a shelf life on the order of months and a value not even worth speaking of – especially for big data collections. Look: if the 1000 billion records that we’ve so assiduously collected were worth even a penny apiece, we would be multi-billionaires. (The last time I checked, we are not.) In fact, much of our data probably has negative value: it is never actively queried, but still costs time and money to store and maintain.
I know. Organizations gather information like obsessed antique collectors, and the urge to keep a currently handy file, table, or database is almost irresistible. But we should resist, because the context needed to make sense of these bits of info-junk will very likely be lost as soon as we pull them into our personal information attic. Any data context that we haven’t fully documented will disappear like a lead weight sinking to the bottom of a muddy lake. The data will remain but, untethered from its original context, may become worse than useless – without context, it could very well be used wrongly later.
“Data lakes” are now all the rage, but they don’t address the issue of declining data value and the natural loss of the context that makes most data meaningful. To load something into a “data lake” is the information equivalent of going to one of those old-fashioned hardware stores, asking for a random selection of nuts, bolts, washers, and screws, putting them in a box, and storing the box in our attic. Then, five years from now, if we even remember that we went out and bought that random box of unassigned parts, we won’t know where we put the box or remember how to find what we want. And quite possibly, after a frustrating search through all N pieces of junk, we’ll realize that what we need isn’t there: our new and modern equipment doesn’t use the kinds of nuts, bolts, washers, and screws that we purchased in anticipation of our future unforeseen need.
I’m no better. I have files on my laptop from the 1990s. But when, as an experiment, I dug into one of those old directories recently, I found that I could not name the purpose of a single file by looking at its name, and in most cases could not give the purpose even after opening the file.
Why do we do it? Why store things we know we may never use, or even remember we stored? In part because we believe the cost of storage is zero, which it is not; the value of data is constant, which it isn’t; and the context we have in our heads will remain there, which it won’t – our heads will be filled with other new and interesting things in the future.
Perhaps we should design our systems to auto-archive any data older than six months, unless the data can be shown to have a value greater than the cost of storage and maintenance, to have a known use, and to be meaningful to someone who would not normally use it – the last being a test of whether good context is available. Data that are representative, or that support exploratory and predictive analysis, can be allowed to stay; as for the rest – they can’t pay the rent. It’s hasta la vista, baby.
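To make that rule concrete, here is a minimal sketch in Python of what such a retention check might look like. Everything in it is an assumption for illustration: the `Dataset` record, its field names, the 183-day cutoff standing in for “six months,” and the idea that someone has already estimated value and cost in the same units. It is not a description of any existing system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical cutoff: "six months" approximated as 183 days.
MAX_AGE = timedelta(days=183)


@dataclass
class Dataset:
    """Hypothetical catalog entry; all field names are illustrative."""
    name: str
    created: date
    estimated_annual_value: float   # someone's estimate, same units as cost
    annual_storage_cost: float      # storage plus maintenance
    known_use: Optional[str]        # None if nobody can name a use
    outsider_can_interpret: bool    # the "good context" test


def should_archive(ds: Dataset, today: date) -> bool:
    """Return True if the dataset is old and fails any of the three tests."""
    if today - ds.created <= MAX_AGE:
        return False  # young data is left alone
    pays_the_rent = (
        ds.estimated_annual_value > ds.annual_storage_cost
        and bool(ds.known_use)
        and ds.outsider_can_interpret
    )
    return not pays_the_rent


today = date(2024, 7, 1)
old_logs = Dataset("clickstream_2021", date(2021, 3, 1), 0.0, 1200.0, None, False)
sales = Dataset("sales_history", date(2020, 1, 1), 50000.0, 800.0,
                "quarterly forecasting", True)
fresh = Dataset("new_import", date(2024, 6, 1), 0.0, 100.0, None, False)

for ds in (old_logs, sales, fresh):
    print(ds.name, "-> archive" if should_archive(ds, today) else "-> keep")
```

The point of the sketch is that the burden of proof sits with the data: an old dataset is archived by default and stays only if all three tests pass. The hard part in practice is filling in those fields honestly, not writing the check.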
Holding on to valueless data entails not only a direct cost but an indirect one: data clutter slows or cripples the applications that explore, visualize, and predict from the information that really does have value. Big data for a proven business or research purpose is laudable, but big data for an unproven purpose – big infojunk – means paying the cost of storage and maintenance now, and the cost of obscuring valuable information and insight later.
Data value, like any other kind of value, is a thing to be proven, not something to be assumed. There is no “maybe someday” for data value, only what can be shown to be useful now. For if we can’t prove that value now, we’re very unlikely to have the context to prove it later, after the data has been sitting ignored in our info-attic or data warehouse for a couple of years. And what of those data without value, which we’ll unceremoniously jettison? We can say thanks if we like, but we should throw that data away.