I’ve occasionally asked my data teams about our intent – with all of the work we’re doing to implement our current application, what are the core intents or functions we’ll have when we’re done?
I’ve received blank stares about as often as I’ve asked this, and I’ll grant that being taken out of what you’re doing in the middle of a busy morning to be asked “What’s it for?” isn’t entirely fair.
But it’s a relevant question. I’ve seen that when we keep a pretty-much constant view of the core functions and purposes of what we’re doing, we build better, more focused applications, and we build them more efficiently, with fewer tangents and distractions.
To make the question better and more productive, I asked myself about the core intents of any information application. Can we reduce the basic intents of information apps to a small set, so the question is easier to answer, and define the ideal properties of each intent, for comparison? We won’t need, or even want, the ideal intent in most cases, but having that yardstick can help us assess how well we are addressing our users’ needs.
There are many different, very good information applications, but I believe there are really only five fundamental intents: to transmit, to find, to transform, to summarize, and to predict. In short, when it comes to information, fundamentally we’re moving it, reshaping it, or extrapolating from it. I divide these activities in a particular way, especially “transform,” “summarize,” and “predict,” for reasons I’ll describe below. You might come up with a slightly different list, but here I’m going to try to define these five core intents and their optimal versions.
Transmit. An ideal transmit intent moves information to and from authorized users, rapidly and without distortion.
It’s pretty rare these days for an information application to have no transmit intent, although this intent may reside in the background. Many valuable applications are primarily about transmit, including informational web sites. An area where applications often fall short of the ideal is security, i.e. ensuring that only the correct information is viewed, and only by properly authenticated users.
Find. An ideal find intent rapidly and accurately locates information in response to a well-formulated question, and provides that information in a fashion that suggests additional questions when desired.
Here by a “well-formulated” question I mean one that a knowledgeable user can be expected to ask. That’s a very tall order in general, both because it is difficult to know all that might be asked (and how), and because the data will be limited. In practice, we’ll cast the range of possible questions into standardized formats (e.g. keywords, drilldowns).
Whether the answer to one question sensibly leads to another question is a function of taxonomy – sometimes of the data itself, as with data warehouses, or sometimes of metadata in the system – e.g. entity names can be very revealing, when properly organized. Search sites can offer suggestions (and ads) based on the community history of prior searches.
Transform. An ideal transformation intent changes the data to make it more accurate or easier to consume, without distortion.
Data engineers spend more time here than anywhere else – everything from data cleansing to building a data warehouse is a matter of transformation. This intent rarely stands alone, but acts in support of a find, summarize, or predict intent. I do separate this from any intent involving aggregation of data – that’s an important but separate case.
It’s surprisingly difficult to ensure that data transformations do not distort information. One of the most distorting actions is one that is very easy to take for granted: that of modeling and loading the data in the first place. The mapping from the data associated with a real object to the set of attributes we capture in a database determines a great deal about what the application will be able to do. (In my view, that is also why up-front analysis and assessment of data models in terms of their question-answering capability is a fine idea.)
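One rough way to look for distortion in a transformation is to check whether it merges source values that were distinct before it ran. The sketch below is only an illustration: the helper name `find_collisions`, the sample city names, and the two cleanup rules are all hypothetical, and a reported merge is not automatically a problem. It is simply something a person should look at and decide on.

```python
from collections import defaultdict


def find_collisions(values, transform):
    """Return each output that more than one distinct input maps to.

    Hypothetical helper: a collision means the transformation has
    discarded whatever distinguished those inputs from each other.
    """
    sources_by_output = defaultdict(set)
    for value in values:
        sources_by_output[transform(value)].add(value)
    return {
        output: sorted(sources)
        for output, sources in sources_by_output.items()
        if len(sources) > 1
    }


# Made-up sample data for illustration only.
cities = ["Paris", "paris ", "Portland", "Porto", "Rome"]

# Trimming and lower-casing merges spelling variants (probably intended).
cleaned = find_collisions(cities, lambda s: s.strip().lower())

# Truncating to three characters merges different cities (probably not).
truncated = find_collisions(cities, lambda s: s.strip().lower()[:3])

print(cleaned)
print(truncated)
```

The first rule merges only variants of the same city, while the second also merges Portland and Porto, which is the kind of quiet loss that is easy to take for granted at load time.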
Summarize. An ideal summarize intent provides a simplifying view of the data, eliminating only irrelevancies, along with a mapping back to the original data.
Summarize intents can range from aggregations, to regressions, to categorizations, to reports, to visualizations – anything that gives us a helpful overview of the data.
Aggregations are common, but not always essential, to a summarization intent. But if we do aggregate we necessarily remove detail, and a way to return to that detail is optimal.
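As a minimal sketch of an aggregation that keeps a way back to the detail, each summary row below carries the positions of the source rows that produced it. The function name `summarize_with_lineage`, the field names, and the sample sales rows are assumptions made up for illustration, not part of any particular tool.

```python
from collections import defaultdict


def summarize_with_lineage(rows, key, value):
    """Total `value` per `key`, remembering which rows fed each total."""
    groups = defaultdict(lambda: {"total": 0.0, "row_ids": []})
    for row_id, row in enumerate(rows):
        group = groups[row[key]]
        group["total"] += row[value]
        group["row_ids"].append(row_id)  # the mapping back to the detail
    return dict(groups)


# Made-up sample data for illustration only.
sales = [
    {"region": "east", "amount": 10.0},
    {"region": "west", "amount": 5.0},
    {"region": "east", "amount": 20.0},
]

summary = summarize_with_lineage(sales, key="region", value="amount")

# Drill back down from the "east" total to the rows behind it.
east_detail = [sales[i] for i in summary["east"]["row_ids"]]

print(summary)
print(east_detail)
```

Storing row positions is the simplest possible mapping. In a real system a stable primary key would likely serve better, since positions change whenever the source is reordered or reloaded.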
Predict. An ideal predict intent derives an inference not available from the data itself.
I differentiate between the predict intent and summarize intents like regressions, even when the mechanics (e.g. a cross-validated regression or hierarchical clustering) might be identical. Summarizations simplify and illuminate data content, whereas a prediction offers answers beyond the data set alone.
There is something else concerning predictions, and I’ll grant my statement might make people a little uneasy. A reliable prediction involves some basis outside the data itself. To state it another way, there is no such thing as a prediction that is both reliable and purely data-based. There are always additional assumptions, and these must be understood to validate the prediction itself. Those assumptions might be very simple and reliable: for example, a prediction might interpolate between geocoded data points, which is a very sure bet. On the other hand, we might extrapolate financial information under the presumption of a linear trend. Depending on the time frames involved, the reliability of that assumption could vary from quite certain to almost entirely uncertain.
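To make the point about assumptions concrete, here is a small sketch in which a prediction is returned together with the assumption behind it and a note on whether it falls inside or outside the observed range. The function names, the returned fields, and the sample numbers are hypothetical. The linear trend is exactly the kind of outside assumption described above, and the code does nothing to test whether it holds.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(xs, ys, x_new):
    """Predict y at x_new and state what the answer rests on."""
    slope, intercept = fit_line(xs, ys)
    inside = min(xs) <= x_new <= max(xs)
    return {
        "value": slope * x_new + intercept,
        # The assumption comes from us, not from the data.
        "assumption": "linear trend",
        "kind": "interpolation" if inside else "extrapolation",
    }


# Made-up sample data for illustration only.
quarters = [1, 2, 3, 4]
revenue = [2.0, 4.0, 6.0, 8.0]

near = predict(quarters, revenue, 2.5)   # within the observed range
far = predict(quarters, revenue, 10)     # well beyond the observed range

print(near)
print(far)
```

Both calls use identical arithmetic. What differs is how much weight the linear-trend assumption has to carry, which is why the result labels it rather than reporting a bare number.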
It’s OK to summarize data as a standalone entity, but with predictions we have to look outside the data, and sometimes examine and challenge our assumptions and uncertainties to assess the quality of our answers.