What is a good data question, anyway?
For that matter, what is a data question in the first place?
Data questions are data operators – they perform an action on data. The operator prescribes a way to extract source records from a database, transform and combine them, and then produce an output data set, which we read as an answer. Data questions can be simple or complex, but ultimately they are all recipes for converting source data into a palatable size and format.
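As a minimal sketch of a question-as-operator, here is the extract/transform/output pattern in a few lines of Python. The records and names here are hypothetical, chosen only to illustrate the shape of the operator:

```python
# A data question as a data operator: extract -> transform -> output.
# The table and field names below are hypothetical, for illustration only.

orders = [  # source records, standing in for a database table
    {"region": "east", "amount": 120.0},
    {"region": "west", "amount": 80.0},
    {"region": "east", "amount": 45.5},
]

def total_sales_by_region(records, region):
    """Answer to: 'What are total sales for a given region?'"""
    extracted = [r for r in records if r["region"] == region]  # extract
    return sum(r["amount"] for r in extracted)                 # transform/aggregate

print(total_sales_by_region(orders, "east"))  # 165.5
```

The question's parameters (here, `region`) are explicit inputs, and the answer is the operator's output – nothing more.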
That might seem pretty routine, and data questions are often just that – we ask them every day almost without thinking. But for many questions, we obtain answers that are ambiguous: they might be too numerous, or responses to a subtly different inquiry, or an arbitrarily selected subset of all possible answers. Often the question itself is ambiguous – there are assumptions, constraints, and definitions that are unstated or unknown. We often need look no further than our last Google session for examples of these problems.
After synthesizing my experience with different data teams, I concluded that a good data question is first and foremost an unambiguous question – in definition, assumptions, and generated answers.
That might sound simple, but other contexts quickly come into play, such as data error and bias. Crafting a good question is challenging, in part because we often don’t know if we have developed a good question until we have tried it on our data. We might then find our answer has excessive error, or several other problems that I’ll break down below. When that happens (which is often), we need to iterate our question definition. Capable users also learn as they ask questions, so they will often iterate their line of inquiry in a span of weeks, or even days. Woe betide the data architecture and processes that can’t keep up.
It might help to separate this discussion into two parts. Here I’ll outline the criteria I’ve found to be associated with good data questions. In a second post, I’ll talk about some processes and checklists to help us evaluate whether our current question should be considered “good” (or “good enough”).
- A good data question produces an answer that is useful now. I’ve personally invested many hours delivering answers that no one particularly cared about, or that might be useful someday. User communities often ask for more than they can consume, or wish to hedge their bets against long development times. That’s understandable, but it also aggravates the development-time issue it is hedging against. I’ve found it’s best to focus on what is most useful now. If we need to adjust our development process and environment to keep up with the dynamics of question-asking, that is a separate item to address (which I put in the “T” context – data type/context).
- A good data question produces at most one answer. You must be joking. I do know some jokes, but this isn’t one of them. A good question has zero or one answer for each entity (or aggregate thereof) and input parameter set. This forces us to call out, as an assumption or constraint, any hidden or arbitrary selection of records from an answer data set. It’s easy for us to quietly choose an answer that we “like” when we see many results due to an imprecise question or data error. If we’re going to have a “good” question, we have to say what we’ve done: “I am choosing the first non-sponsored apple pie recipe out of the 5.3 million responses to my Google ‘apple pie recipe’ query because for my purposes the answers are all the same – I’ll adjust what I see to my taste – and because I don’t trust online ads.” Our stated assumption might be wrong, but that’s not the concern for now. We do want our assumptions to be out in the open – that’s the best way to know what we have, and to improve our question when we must.
- A good question enumerates assumptions and constraints. We should list assumptions in a “given” clause, and list them all. A “given” clause may look like a chapter from the Federal Tax Code, but that’s OK. Later a creative person (or tax lawyer) may be able to boil it down, and sometimes analysis will prove that certain assumptions do not impact outcomes. Similarly, we should list constraints in a “subject to” clause. You would be surprised at how much time and effort can be saved by not answering questions that are constrained from the start. If my piggy bank only contains $14.25 at this time, I will not need to evaluate the best location for the factory I would like to, but cannot afford to, build.
- A good data question corresponds to a real-world question we can actually answer. Teams can find themselves spending time working on a question they have no chance of answering. In some cases, there are real-world constraints: in many parts of the world we cannot predict the weather ten days from now due to physical limitations having nothing to do with data. In some cases the metrics we need are unavailable or must be developed – true cost and true value are two metrics that are frequently unavailable, and for which their nominal counterparts may be poor proxies. It’s good for us to consider what we expect to learn in situations where our available data are significantly separated from the real-world question of interest.
- A good data question accommodates data uncertainty and error. We should regard all of our data (including categorical data) as having error until proven otherwise. Questions in the face of uncertainty can look quite different than their certain-data analogues. We might need to introduce a probability into our question, or downshift to general categories. The one thing we shouldn’t do is assume our data have no error, or arbitrarily pick one data value, like a mean, out of many possible values (unless we’re willing to state that as an assumption). Teams often push back on this, for understandable reasons. We don’t like to acknowledge uncertainty. Also, the querying mechanics needed to accommodate data error can become complex, often requiring scripting.
- A good data question is deterministic. Some perfectly good data questions have a random component (e.g. questions that involve data sampling). However, for a given set of question parameters and a particular data set, the outcome must always be the same. If a random-number generator is involved, the seed (or seeds) should be one of the parameters. If sampling is involved, that should be called out as an assumption.
- A good data question is comparative. Edward Tufte points out in detail that practical questions almost always refer to a basis, or compare entities. (A favorite example is the distortion caused by quoting nominal rather than inflation-adjusted currency.) Even when asking basic questions we can find ourselves providing an absolute metric, like an employee count. In our heads we might know that 40,000 employees is fewer than last year at this time, but others without the same context won’t know that, and may misinterpret the answer. It’s good to have a stated basis, or to look directly at the differences of interest. Certain metrics – such as cost – are often naturally paired with an opposing metric, such as value, and we should consider why a question might look at only one of a natural pair.
- A good data question is limited to the data. That sounds obvious, but it’s also easy for us to forget. A data question is a data operator, so when there are no data queries to back up our question, we’ve left planet Data Question for the wider universe of Interpretation and Mystery. It’s just fine to get some support for a business decision with a data question, and to finish the job with people adding in their extra-data experience and expertise. But the latter isn’t a data question – it’s something else. Teams will sometimes state that their data gave them the answer to a “why” or “how” question. Usually that’s not the case – most data reference who, what, where, when, how much, or how many. We’ve mixed data and interpretation without realizing it.
- A good data question manages the limits and uncertainties associated with our data extensions. Unfortunately, this often presents a challenge. Data transforms are often devoid of error analysis, and predictive models will frequently pump out a bogus answer with the same certainty as a valid one. Sometimes we can test the conditions where a problem might occur; sometimes we have to downshift to state, as an assumption, a presumed error rate for the transform or model. Ideally though, we don’t just assume those rates are zero – they almost never are.
- A good data question accommodates any actionable context. If we are going to take action or optimize a process based on our question, the only parameters in our question can be those items for which we have knowledge and control. Our question must also contain an agreed-upon objective when an optimization is at hand – and these are not always available. If non-actionable parameters are present, we ideally consider all reasonable values of these inputs, and then determine if an action or optimization is still worthwhile.
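The “at most one answer” criterion above can be sketched in a few lines of Python. Here the apple-pie selection rule is stated explicitly as code, with hypothetical result data, so the question yields exactly zero or one answer:

```python
# Making a hidden selection explicit: given many candidate answers,
# state the tie-breaking rule instead of quietly picking one we "like".
# Result data and ranking are hypothetical.

recipes = [
    {"title": "Ad: Best Pie Ever",   "sponsored": True,  "rank": 1},
    {"title": "Classic Apple Pie",   "sponsored": False, "rank": 2},
    {"title": "Grandma's Apple Pie", "sponsored": False, "rank": 3},
]

def first_unsponsored(results):
    """Stated rule: take the first non-sponsored result by rank,
    or no answer at all - never an arbitrary one."""
    candidates = sorted((r for r in results if not r["sponsored"]),
                        key=lambda r: r["rank"])
    return candidates[0] if candidates else None  # zero or one answer

print(first_unsponsored(recipes)["title"])  # Classic Apple Pie
```

The selection rule may still be debatable, but it is now visible, testable, and improvable.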
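The “given” and “subject to” clauses can also be carried as first-class structure rather than prose. A minimal sketch, with hypothetical field names and the piggy-bank constraint from above:

```python
# A question spec with explicit "given" (assumptions) and
# "subject to" (constraints) clauses. Names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DataQuestion:
    text: str
    given: list = field(default_factory=list)       # "given" clause: assumptions
    subject_to: list = field(default_factory=list)  # "subject to" clause: constraints

q = DataQuestion(
    text="What is the best location for a new factory?",
    given=["construction costs come from vendor quotes, not true cost"],
    subject_to=["available budget is $14.25"],
)

# Checking constraints first can spare us from answering at all.
MIN_FACTORY_BUDGET = 1_000_000  # hypothetical feasibility threshold
feasible = 14.25 >= MIN_FACTORY_BUDGET
print(feasible)  # False
```

Keeping assumptions and constraints next to the question text makes them hard to lose during iteration.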
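To see how an uncertainty-aware question differs from its certain-data analogue, consider two measurements with an assumed error bound (the bound itself would be a stated assumption):

```python
# Accommodating measurement error: carry an interval rather than
# asserting a single value. The +/- 0.5 error bound is assumed,
# and would be called out in the question's "given" clause.

def overlaps(lo_a, hi_a, lo_b, hi_b):
    """True if two uncertain measurements could be the same value."""
    return lo_a <= hi_b and lo_b <= hi_a

a, b, err = 10.2, 10.6, 0.5  # two sensor readings with assumed error

# Certain-data question: "is a < b?" -> True, but possibly spurious.
print(a < b)  # True
# Uncertainty-aware question: "could a and b be equal?"
print(overlaps(a - err, a + err, b - err, b + err))  # True
```

The two questions can disagree about what the data support, which is exactly the point.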
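The determinism criterion is easy to demonstrate with sampling: make the seed an explicit question parameter, and the same parameters plus the same data always yield the same answer.

```python
# Determinism with sampling: the seed is an explicit question parameter,
# not hidden global state.
import random

def sample_records(records, k, seed):
    rng = random.Random(seed)   # a local generator scoped to this question
    return rng.sample(records, k)

data = list(range(100))         # hypothetical record keys
run1 = sample_records(data, 5, seed=42)
run2 = sample_records(data, 5, seed=42)
print(run1 == run2)  # True: same parameters and data, same answer
```

Re-running with a different seed is then a different question, and we can say so.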
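The comparative criterion amounts to reporting against a stated basis rather than an absolute number. A tiny sketch, with hypothetical headcount figures:

```python
# Reporting against a stated basis rather than an absolute metric.
# Headcount figures are hypothetical.

headcount = {"2023": 43_000, "2024": 40_000}  # employees by year

basis, current = headcount["2023"], headcount["2024"]
change = current - basis
pct = 100.0 * change / basis
print(f"{current} employees ({change:+d}, {pct:+.1f}% vs. 2023)")
# prints: 40000 employees (-3000, -7.0% vs. 2023)
```

The bare “40,000 employees” invites misreading; the delta against a named basis does not.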
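Finally, the actionable-context criterion can be sketched as optimizing only over what we control, while sweeping all reasonable values of what we don’t. The demand model and numbers below are entirely hypothetical:

```python
# Optimize over a controllable parameter (price) while considering all
# reasonable values of a non-actionable one (demand). Toy model only.

prices = [8, 10, 12]                 # actionable: we choose the price
demand_scenarios = [0.8, 1.0, 1.2]   # non-actionable: sweep plausible values

def revenue(price, demand_factor):
    units = demand_factor * (100 - 5 * price)  # hypothetical linear demand
    return price * units

# Agreed-upon objective: best worst-case revenue across demand scenarios.
best = max(prices, key=lambda p: min(revenue(p, d) for d in demand_scenarios))
print(best)  # 10
```

If no price does well across all scenarios, that itself answers whether the optimization is worthwhile.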
It’s challenging to explicitly consider all of these criteria, but we often implicitly or partially consider many of them. Part of our challenge is that different stakeholders – sponsors, developers, users – often have quite different experiences, technical languages, and interests. But crafting a good question impacts all of us.
I believe a good practical step is for stakeholders to assess these criteria together. The assessments usually don’t take long. From there, we determine if our key questions are “good enough,” or if we should augment our current approach. I’ve seen that good questions provide a mechanism for common understanding, and thus for enhanced confidence – because we have a better and common understanding of what our answers mean. And because we’re talking together, we’re more likely to invest in areas of true interest, and far less likely to invest in, or scrutinize, dead-end outcomes.