I don’t submit many arguments from authority, but the list of advocates for good questions is impressive: Voltaire, Einstein, Pynchon, Lévi-Strauss, Bacon, and many others.
These quotes have a philosophical, mathematical, or scientific bent, but it’s the same situation when answering questions with data. We can’t truly understand any answer without understanding the question first. If our question is ambiguous, the understanding of our answer will be too.
Most of the time, we satisfy ourselves with a question that we presume is OK. But in fact, we’re not sure how good our question is (and therefore how good our answer is). Consider this: in a business database, what data are difficult to know exactly? How about: true value, true cost, the current state of most entities (like an employee), and how any of these might change over time. And what do we usually want to know? The very same things. So if we’re going to frame a good question, we can’t just fire away; we must deal with uncertainty and error, and be able to decide whether our data are good enough to support a decision.
So what is a good question? In my experience, a good data question meets six criteria: it 1) is useful, 2) generates a unique answer, 3) manages errors and uncertainties, 4) is comparative, 5) is limited to the data, and 6) calls out assumptions and constraints.
I’ll expand on these criteria separately. Faced with this perhaps intimidating list, though, we normally don’t go to the trouble of framing good questions, and I asked myself why. I believe there are several reasons. First, many teams lack a formal process for evaluating their questions. Question evaluation can look daunting and be perceived as impenetrable, so we find ourselves avoiding it. Second, our systems usually return a lot of definite answers very quickly. Automation bias leads us to believe that the machine’s speed and certainty of response are related to the quality of its response. But of course, quality has nothing to do with speed and certainty. Third, we don’t want to frame a good question only to learn, for our efforts, that our current data capabilities are limited. We naturally avoid anything that might be perceived as a setback. Finally, question evaluation can look just plain expensive, and things are costly enough already.
I will say this: when we frame questions that deal openly with data realities, including uncertainty and error, confidence in our outcomes skyrockets. We’ve then crafted an inquiry that openly considers our data limitations, constraints, and defects, and still drawn a conclusion. Who could argue with that? Brushing ambiguities under the virtual rug, on the other hand, is what makes our users queasy, particularly about unexpected results.
Let’s say that you’re sold on the idea that framing good questions is desirable. I would certainly understand if your next thought is This is going to be a PITA, and possibly depressing besides. I hear you, but hang on a minute. First, think of a good question as a good requirement – but more compact and dynamic. (Very often, we’ll have to iterate a few times to get it right.) Second, let’s say our question helps us identify limitations in our data. OK! It’s better to know earlier, when perhaps we can do something about it, than later, after N months and X dollars have been spent. And if we can’t fix our data, we’ll accurately answer the parts we can with data, and hand off to human judgment for the rest. Third, good questions can save a lot of time and money. If my question includes a budgetary or time constraint, that may obviate a lot of modeling and analysis (which I’ve seen people do when they didn’t have to).
Finally, while formal question evaluation can take a little time, it is not inherently difficult. In fact, we perform a question evaluation process almost every day, without thinking much about it. Did you look at the weather forecast today? Me too. And probably like you, I first looked at what my phone told me (today’s high 90 F, with a 30% chance of rain mainly after 2 pm), and replaced a detailed question like “will it be raining at 3 pm today?” with something like this question: if this forecast is my sole basis, and I want to ensure I won’t get cold or wet, should I bring a coat and umbrella to work today? That wasn’t so bad, and it does meet our “good question” criteria: Is the answer useful? Sure – who wants to get wet or be cold? Is there a unique answer? Yes – guaranteed by “if I want to ensure.” Did we manage errors? Yes, by replacing the data details with binary yes/no categories: it’s hot, it might rain. Is the question comparative? Implicitly, yes. (I know when it’s hot enough for me to wear a coat.) Is the question limited to the data? It is. Did the question call out assumptions? Yes – I am taking one forecast as my information basis. If it is severely wrong, I’m going to be out in the cold or wet.
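To make the "binary categories" move concrete, here is a minimal sketch of that decision rule in Python. The function name and the temperature threshold are my own illustrative assumptions (the text never states a cutoff); the point is only that collapsing the forecast details into yes/no categories yields a unique, deterministic answer.

```python
def bring_coat_and_umbrella(high_f: float, rain_chance: float) -> tuple[bool, bool]:
    """Decide (coat?, umbrella?) from a forecast high (deg F) and
    a chance of rain (0 to 1). Thresholds are hypothetical."""
    too_cold = high_f < 60.0       # assumed comfort cutoff: below 60 F, bring a coat
    might_rain = rain_chance > 0.0 # "ensure I won't get wet": any chance counts
    return (too_cold, might_rain)

# The forecast from the text: high of 90 F, 30% chance of rain.
coat, umbrella = bring_coat_and_umbrella(high_f=90.0, rain_chance=0.30)
print(coat, umbrella)  # prints "False True" -- no coat, yes umbrella
```

Note that the uniqueness of the answer comes from the thresholds, not from the forecast itself: once the cutoffs are fixed, the same forecast can never produce two different answers.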
On the other hand, “Will it rain at 3 pm today?” has no unique answer – it might rain, and it might not. So it’s not “good” by our criteria, and we have to reformulate our question. We might go for a probability: “If I accept the forecast, is the chance I’ll get rained on at 3 pm more than 50%?” or use the “binary” approach above to get to a unique answer. In our databases the process might take a little longer, but the strategies and ideas are the same.
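The probability reformulation can be sketched the same way. This is an assumed helper, not anything from the text; it just shows that stating the cutoff inside the question ("more than 50%?") is what turns an ambiguous forecast into a question with exactly one answer.

```python
def rain_likely(rain_chance: float, cutoff: float = 0.5) -> bool:
    """Unique answer via a probability cutoff: 'is the chance of rain
    more than 50%?' The 0.5 cutoff is part of the question itself."""
    return rain_chance > cutoff

print(rain_likely(0.30))  # prints "False" -- a 30% chance does not clear 50%
```

Either strategy – binary categories or an explicit probability cutoff – works because it moves the ambiguity out of the answer and into an assumption we have named up front.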