Hi. Imagine that we’ve been asked to find a best-fit line for these data, which are the outcome of a stochastic simulation:
Sequence (X) | Value (Y) |
---|---|
1 | 4.586792 |
2 | 31.230494 |
3 | 32.805177 |
4 | 40.057982 |
5 | 40.389495 |
6 | 51.093562 |
7 | 51.651214 |
8 | 54.419457 |
9 | 63.673709 |
10 | 63.891718 |
11 | 74.354809 |
12 | 87.494790 |
13 | 94.858425 |
Whether we whip out a sheet of graph paper (!), use a spreadsheet, or spin up a statistics package, our analysis will give us a plot looking something like this:
(I used R for simulation and plotting – scripts appear at the end of this post.)
Things look pretty good. The first point is well below the line, but we can hypothesize that this is an initialization artifact. It would be nice to learn more about the error in our simulation, but these data show a trend that is approximately linear.
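For anyone who wants to reproduce the fit without the plot, here is a minimal sketch in base R. It fits the 13 values two ways: with lm(), the ordinary least-squares line, and with line(), which fits Tukey's resistant line and is the function the plotting script at the end of this post calls. The variable names (y, x, ols_fit, tukey_fit, first_resid) are my own for this illustration.

```r
# The 13 simulated values from the table above
y <- c(4.586792, 31.230494, 32.805177, 40.057982, 40.389495,
       51.093562, 51.651214, 54.419457, 63.673709, 63.891718,
       74.354809, 87.494790, 94.858425)
x <- seq_along(y)  # sequence numbers 1..13

# Ordinary least-squares line
ols_fit <- lm(y ~ x)

# Tukey's resistant line (stats::line), as used in the plotting script
tukey_fit <- line(x, y)

# Intercept and slope from each fit
print(coef(ols_fit))
print(tukey_fit$coefficients)

# Residual of the first point under the least-squares fit;
# a negative value means the point sits below the line
first_resid <- residuals(ols_fit)[[1]]
print(first_resid)
```

The two methods will not give identical coefficients, since line() is built from medians of groups of points rather than from squared errors, but both describe the same upward trend.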
Great. So, what is the question this simulation addresses? It is: using the R pseudo-random function runif() and a seed of 1234, what is the first sequence of 13 continually-increasing numbers? The answer is the data set above, and it takes almost 20 billion tries to arrive at this sequence. It’s nothing more than a model answer to a model-based question, without a corresponding real-world question. Bluntly put: who cares?
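To get a feel for the search without waiting for billions of draws, here is a scaled-down sketch. It looks for the first run of 5 strictly increasing values instead of 13. The helper first_increasing_run() is a hypothetical name of mine, and it uses rle() rather than the approach in the script at the end of this post, so treat it as an illustration of the idea and not as the original method. For independent continuous draws, any given window of k values is in increasing order with probability 1/k!, and 13! is roughly 6.2 billion, which is why a search running into the billions of draws is about what we should expect.

```r
# Return the first k strictly increasing consecutive values in x,
# or NULL if no such run exists. (Hypothetical helper for illustration.)
first_increasing_run <- function(x, k) {
  up <- diff(x) > 0                    # TRUE where x[i + 1] > x[i]
  runs <- rle(up)                      # lengths of consecutive TRUE/FALSE stretches
  ends <- cumsum(runs$lengths)         # last index in `up` of each stretch
  starts <- ends - runs$lengths + 1    # first index in `up` of each stretch
  # k increasing values need k - 1 consecutive "up" steps
  hit <- which(runs$values & runs$lengths >= k - 1)
  if (length(hit) == 0) return(NULL)
  i <- starts[hit[1]]
  x[i:(i + k - 1)]
}

set.seed(1234)
x <- runif(1e5, 0, 100)                # same range as the post's simulation
run5 <- first_increasing_run(x, 5)
print(run5)
```

Plotting run5 and drawing a line through it gives a small version of the same "trend" as the 13-point example.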
However, I would be willing to bet that if you thought about this little line-fitting problem, one of the last things that went through your head was “It’s a set-up and the entire problem is bogus!”
Sorry about that…. The point is this: we do not normally start our analytics process by asking whether our model-based question has any connection to a real-world question. We like to cooperate and be helpful, rather than instantly plague our coworkers and stakeholders with questions about how our model question relates to a real-world analogue (if any).
But perhaps we should. For every problem involving data also involves a model – we cannot apply an analytical process otherwise. The core reality of all models is that they are approximations of reality – models are abstractions. As the answers we craft come from our model, so do the questions we are answering. In analytical processes, we always address model questions, not real-world questions.
There are situations in which a model question is close to a real-world question, but they are never quite the same. Starting in this post and continuing in the next, I’ll look at examples – in everyday life and in analytics problems – where that difference is ignored, but definitely matters.
Model and real-world questions can be very different, even when they look and feel almost identical – questions are often slippery and ambiguous. Is the real-world question “Is Sue smarter than Abby?” the same as the model question “Does Sue have a higher IQ test score than Abby?” You might laugh and say “of course not,” but many would use the latter question as a substitute for the former, without having any idea what an IQ score really measures. Once we have a number, we can formally manipulate it, and that fact alone makes modeling questions and answers very attractive. But with no clear connection to real-world questions, those manipulations can be without true purpose.
I think most of us have been there. Because questions are ambiguous and readily morphed in our minds, we often transpose model questions into real-world questions without realizing it, with consequences that range from minor to very serious. If we don’t fully understand what question we’re answering, we really don’t know what problem we’re solving. Getting to the “right” problem may not be easy, but the first step in arriving at the right problem is knowing where we are now.
Some argue, and I concur, that analytics is overly concerned with detailed mechanics, algorithms, and computations, thereby too often chasing “wrong” problems down conceptual rabbit holes. But the problem with “wrong problems” really starts when we don’t understand how our model questions relate to the original questions we purported and hoped to address.
This can present a real challenge. We’re trying to deal with an imperfectly-defined and complex concept – questions – without an image to guide us. That thought – we really need a picture – struck me with force not long ago, and since then I’ve been drawing small diagrams showing how model questions map to actual questions.
I’ve not seen this done quite as I suggest below, but if you have a similar system that works for you, go for it. My pitch is not necessarily to create question maps my way, but to create a map, and make sure every stakeholder, everywhere, knows about it. For every stakeholder, everywhere, should know what it is we’re actually answering (or not answering), and a good question map can address that.
A question map[1] is just a table. It has two columns, and at least three rows, and traces the lineage of our questions from model question back to real-world question, along with the important features of each corresponding answer along the way. In outline, it looks like this:
Question | Answer features |
---|---|
Model question as conventionally stated | Data outcome as it is usually understood |
Translated model question – a precise statement of how the data is being queried or summarized | Data outcome including assumptions, uncertainties, errors, missing elements |
Translated model question – it’s sometimes handy to “translate” in more than one step | Data outcome including assumptions, uncertainties, errors, missing elements |
“Actual” real-world question as conventionally stated | Actual outcome and its relationship to the data outcomes |
By “conventionally stated,” I mean the question as people actually ask it. We want to start and end with what people really say, rather than a technical description of what they might mean. The translated questions in the middle are usually detailed and technical, so we can understand the connection between a general question and its technical analogue.
I’ve been surprised when applying this to examples from ordinary life as well as analytics, to see that just writing down the sequence of questions and answers tells a story of how we move from a question we believe we’re addressing, to a separate and sometimes very different question that we are truly addressing.
I’m going to offer more examples in the next post, but let’s check out one example now. You’ll remember, hopefully without irritation, that bogus best-fit relationship. Here is a question map for that situation:
Row | Question | Answer features |
---|---|---|
Outcome Q | What is the relationship between the points | Create a best-fit line |
Translated Q | What is the best-fit line for the points | Define the best-fit line with line() in R |
Translated Q | What is the best-fit line for the points | The points are randomly generated; define the best-fit line with line() in R |
Actual Q | <None> | In context, there is no meaningful relationship |
The progression of questions from top-to-bottom starts with a model question asking about the relationship between the points. But our model only considers 13 points, and not the context that generated them. This may seem a little silly now, but it emulates what we do with data models every day – data models can be very limited abstractions of complex concepts (like an IQ test score as a data model for intelligence).
The first translated question (line 2) just converts “relationship” to “best fit line,” with a prescription in R for solving that problem.
The second translated question (line 3) is the same question, but now we’ve added in a new element to our answer features – our recognition that the way the points were generated impacts the problem.
The “Actual,” real-world question (line 4) is <None>, for unless we are interested in peculiar artifacts of pseudo-random-number generators, there is nothing practical to be asked. The trend, the relationship, indicate exactly nothing (and the next value in the random sequence would have confirmed that).
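One way to see that the trend carries no information is to extrapolate it. The sketch below refits the least-squares line with lm() (my choice here; the post's plot uses line()) and predicts sequence positions 14 through 20. Since the simulation draws from runif() with a minimum of 0 and a maximum of 100, any prediction above 100 is a value the generator can never produce. The names fit, new_x, and pred are mine.

```r
# The 13 simulated values from the post
y <- c(4.586792, 31.230494, 32.805177, 40.057982, 40.389495,
       51.093562, 51.651214, 54.419457, 63.673709, 63.891718,
       74.354809, 87.494790, 94.858425)
x <- seq_along(y)

fit <- lm(y ~ x)

# Extrapolate the fitted line to the next seven sequence positions
new_x <- 14:20
pred <- predict(fit, newdata = data.frame(x = new_x))
print(data.frame(x = new_x, predicted = pred))

# runif(n, 0, 100) only returns values between 0 and 100,
# so predictions beyond that range cannot be realized
print(any(pred > 100))
```

The fitted line keeps climbing, while each new draw from the generator is independent of the ones before it.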
It isn’t unusual to write down a question map and find that we can’t state a definite real-world question aligned to the analytics or computational number we’ve produced. I’ll show another example next time, but when this happens it means we don’t know the problem we’re addressing. And speaking of problems, that situation is certainly a problem.
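If you keep question maps alongside your analysis code, one option is to store each map as a small data frame and check whether the final row actually states a real-world question. This is only a sketch of how I might do it: the column names, the use of NA to stand for a missing question, and the has_real_question flag are all assumptions of mine, not part of any established tool.

```r
# The question map for the best-fit example, one row per question.
# NA in `question` stands for "no real-world question could be stated".
question_map <- data.frame(
  row = c("Outcome Q", "Translated Q", "Translated Q", "Actual Q"),
  question = c(
    "What is the relationship between the points",
    "What is the best-fit line for the points",
    "What is the best-fit line for the points",
    NA
  ),
  answer_features = c(
    "Create a best-fit line",
    "Define the best-fit line with line() in R",
    "The points are randomly generated; define the best-fit line with line() in R",
    "In context, there is no meaningful relationship"
  ),
  stringsAsFactors = FALSE
)

# Does the map end in a stated real-world question?
actual <- question_map[question_map$row == "Actual Q", ]
has_real_question <- nrow(actual) == 1 && !is.na(actual$question)

if (!has_real_question) {
  message("No real-world question stated: we do not know what problem we are addressing.")
}
```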
Next time, I’ll focus on some practical examples and their question maps, and how maps can show when question differences matter, as well as what we can do about that. There are times that we should update our model, but sometimes the solution is to replace an overly-ambitious real-world question with our model question instead.
[1] There are other diagrams, including concept maps, that are also called “question maps.” Here, a question map is a table that shows the lineage between a model question and an actual real-world question, each row holding one question and its corresponding answer features.
R scripts for the simulation and plots.
# Note: the data.table package is required
library(data.table);
set.seed(1234);
rnd_blk_ct <- 1e08L;
rnd_min <- 0L;
rnd_max <- 100L;

# Points generated by the simulation.
# 13: seed = 1234, 1.98e10 attempts.
#             x       id
#  1:  4.586792 60283905
#  2: 31.230494 60283906
#  3: 32.805177 60283907
#  4: 40.057982 60283908
#  5: 40.389495 60283909
#  6: 51.093562 60283910
#  7: 51.651214 60283911
#  8: 54.419457 60283912
#  9: 63.673709 60283913
# 10: 63.891718 60283914
# 11: 74.354809 60283915
# 12: 87.494790 60283916
# 13: 94.858425 60283917

# To plot these points:
# q <- c(4.586792, 31.230494, 32.805177, 40.057982, 40.389495,
#        51.093562, 51.651214, 54.419457, 63.673709, 63.891718,
#        74.354809, 87.494790, 94.858425);
# plot(q, pch=19, col="blue", main="A regression looking for a question....", xlab="", ylab="Value");
# grid(13,10, col="darkgrey");
# abline( line( 1:13, q)$coefficients, lwd=2, col="lightblue");
# points( 1:13, q, pch=19, col="blue" );

################################################################################
# Simulation:
# Identify a continuous sequence of increasing numbers from the pseudo-random
# distribution runif()
seq_ct <- 13;
foundit <- FALSE;
blk_id <- 0;
while(!foundit & (blk_id <- blk_id + 1)) {
  rnd_x <- runif(rnd_blk_ct, rnd_min, rnd_max);

  # Parts of the next few statements were lost in the source text. They are
  # reconstructed from the surviving fragments, so treat them as a best guess.

  # 1 where the next value is larger than the current one, 0 otherwise
  # (a trailing 0 closes the last run in the block).
  rnd_dpos <- c(as.integer(diff(rnd_x) > 0L), 0L);
  id_dpos_zero <- which(rnd_dpos==0L, useNames = FALSE); #which() output is sorted.

  #ID when there is a sequential set at least seq_ct long having dpos_zero != 0.
  #This is what we want.
  #which() avoids recycling a logical index that is one element too short.
  id_rnd_start <- id_dpos_zero[which(diff(id_dpos_zero) >= seq_ct)] + 1;

  if( foundit <- (length(id_rnd_start) > 0L)) {
    # Indices of the seq_ct values in each qualifying run.
    id_rnd <- unlist(lapply(id_rnd_start, function(i) {seq.int(i,i+seq_ct-1)}));
    is_ord <- 1:rnd_blk_ct %in% id_rnd;
    r <- data.table(seq_id = rep(1:length(id_rnd_start), each=seq_ct),
                    id = rep(1:seq_ct, times=length(id_rnd_start)),
                    x = rnd_x[is_ord]);
    print(r);
  }
  print(paste0("block, total attempts = (", blk_id, ", ", blk_id*rnd_blk_ct, ")"));
}