For decades, there has been a vigorous argument in the United States as to whether the national pastime is football or baseball. I’ve never followed this very closely – for most of us, these activities are mere spectator sports.
On the other hand, a game that most citizens do play, and actively, is the sport of information cherry picking, which consists of gathering up numbers and facts in support of a preconceived idea, and ignoring any other numbers and facts that might stand in opposition to what it is we’re trying to prove.
Data cherry picking might be the most common form of argument – and it’s very impressive when someone ticks off facts or numbers in support of a position. But really, it’s not valid – cherry picking uses only part of the information at our disposal. I’ll be the first to grant that we all cherry-pick information at times, but the more important the discussion and the outcome, the more critical it is that we avoid this approach, which often inflames more than it informs.
So last week, when I saw a NY Times opinion piece announcing that Hurricane Harvey was “the storm that humans helped cause,” my response was: that’s irresponsible. The thesis of the article is that the surface temperatures in the Gulf of Mexico are warming, which contributes to hurricanes (true), and we humans have contributed to global warming (probably).
And when we’re done berating ourselves for our personal responsibility for Harvey, then what? Well, perhaps we could look at the slightly larger picture, and gather facts beyond those supporting one argument. For Harvey was by no means a historically intense storm, making landfall at category 4. In addition, the United States had experienced a long period – over a decade – in which no category 4 or 5 hurricane had reached its shores, a fact that could be cherry-picked to argue against a global warming impact.
More crucially, the reason Harvey created such damage was that it moved slowly, essentially stalling after it made landfall. The trajectories, speed, and strength of hurricanes depend not only on water temperature, but on atmospheric wind and moisture both near and far from the hurricane itself. Atmospheric dynamics cannot be predicted even a week in advance, but are all-important in determining a storm’s wind damage, and in Harvey’s case, water damage. It’s beyond the competence of climate science to know local weather conditions in detail, and without that it has little to say about the flood damage inflicted by a particular tropical system.
We can rationalize nearly anything. However, the purpose of analytics should not be simply to rationalize an expected hypothesis, but to help us understand whether our hypothesis is really correct.
So, we might expect that the methods of formal data analysis would provide a more even-handed analysis, but that’s far from a given. Instead, my experience has been that experienced practitioners are actually more prone to cherry-picking than novices. Those believing they know the answer to a problem are more likely to find data supporting their expected answer. That’s OK – as long as the selected data are fully representative. It’s surprisingly easy to use data that are supportive of an expected conclusion, or convenient, or both, when building an analytics platform – I’ll call out some (very common) examples below.
Data cherry-picking means that we’re assuming the information we’re using is complete and accurate for the problem at hand. But as in statistics, our first duty is really to set aside that assumption and understand the limits of what our information can actually tell us.
Sounds simple enough, right? And really, how often do we have incomplete data systems?
Well, pretty often. Not only that, some of the most crucial problems analytics now faces are problems involving incomplete and uncertain information. When we deliver exact answers with that kind of information, we’re probably cherry-picking our results.
Let me give you some examples of data and operations that can create problems:
- Time. Most of the systems we examine are dynamic. However, the time stamps we have in our systems are reporting times, which are different from event times. In economic and business systems that difference can be months or even years. If the system dynamics are slow enough the impact will be small, but we need to prove that.
- Money. More than one analyst has told me their monetary metrics are “rock solid,” but none of these people were ever accountants. Or salespeople – we really haven’t lived until we’ve seen two sales groups partition the spoils of a shared sale. It’s also easy for us to forget that while we assign costs and prices to products, people and things, costs and prices are really properties of a buying or selling transaction. And as such, they are often negotiable and variable. The uncertainty in monetary numbers may be too small to matter, but we’ll need to prove that.
- Counts. Go ahead, say it: that’s easy. Now come over to my place, where every merger and acquisition presents a new data challenge. For example, employees have different histories, using different systems, and occasionally the information in the system is insufficient to distinguish two different people.
- Money and Time together. The value of money changes over time. It can only be approximated at any given time, and it becomes more uncertain the further we look into the future. The uncertainty may be small, but if we’re looking more than a few years into the future, we’ll need to prove it.
- Categories and taxonomies. A category can be mislabeled, but the real danger with categories is that they can be distorted. Consider the tags and metadata associated with online material. These are often designed to generate hits more than characterize content. For complex entities that are categorized by hand, inconsistent assignments are common.
- Cost and value. There is a much better chance that we have cost metrics in our system than value metrics, because costs are concrete and value is often difficult to measure. Unfortunately, this doesn’t guarantee that value is irrelevant to a decision process. Models based purely on costs can be one-sided and yield poor or irrelevant decisions.
- Scores. Clever analysts often concoct score metrics as part of their design, but scores very often put me on the alert. Scores tend to distort reality, and be unassociated with a measurable real-world metric – in fact, that’s kind of the point. The fun really starts when advanced analytics techniques are applied indiscriminately to scores. To a clustering technique, a score of 90 and 96 may be close, but if these scores measure, for example, quality of a wine, a 96 may sell for two or three times the price of a 90. When I see a score without a corresponding real-world metric, I’ve learned to flag it. And if it’s the real-world metric that counts, why have a score in the first place?
- “Facts.” Quickly! Global warming is/is not primarily caused by humans. Each statement is purported to be a fact by its adherents, but neither assertion could survive the level of scrutiny experimental scientists apply to their data, which is the gold standard for establishing factual information. If we consider the statements we think of as factual, how many of these are really more than received and unexamined information? The problem with using many “facts” in analytics is that we often lack the context to compare two discordant statements, and we wind up selecting the “fact” most consistent with other statements we already accept. That’s a pretty OK way to get through the day without going crazy, but for analytic purposes it’s really just another low-hanging bit of information.
- Stochastic variables. It’s common in simulations to estimate the impact of external forces on the system by randomly-varying data. That can be valid, but not all external forces subscribe to the stochastic model – in particular, when the time scale of external dynamics can match that of system dynamics, random external forces can be very misleading. Exhibit A: economic forces.
- Arithmetic. If our data are complete and precise it’s perfectly fine to perform an operation like A – B. But uncertainty makes even this simple operation a risk-taking venture. I don’t only mean statistical uncertainty. A minus B can be a dubious operation if A and B are metrics relating to complex entities, e.g. yearly sales in a sales group. Oh sure, you can perform the operation, and get a number – but if the products, region, personnel, or leadership in the sales group has changed significantly, what does this figure really mean?
Can we apply these data and operations in analysis work? Sure. But with data that are uncertain and incomplete, certainty and completeness are elements to be proven, rather than assumed. The first duty of analytics really is to establish the limits of analytical conclusions.
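To make the arithmetic point concrete, here is a minimal sketch of what happens to A minus B when each input carries some uncertainty. It covers only the statistical part of the problem, and it assumes the two errors are independent so that they combine in quadrature. The function name, the sales figures, and the 10 percent error levels are all hypothetical, chosen purely for illustration.

```python
import math

def difference_with_uncertainty(a, b, rel_err_a, rel_err_b):
    """Return (a - b, absolute uncertainty of the difference).

    Assumes the errors in a and b are independent, so their absolute
    errors combine as a root-sum-of-squares. rel_err_* are fractional
    uncertainties (0.10 means 10 percent).
    """
    diff = a - b
    abs_err = math.hypot(a * rel_err_a, b * rel_err_b)
    return diff, abs_err

# Hypothetical yearly sales for one sales group, each assumed good to 10 percent.
this_year, last_year = 1_050_000.0, 1_000_000.0
diff, err = difference_with_uncertainty(this_year, last_year, 0.10, 0.10)
print(f"change = {diff:,.0f} +/- {err:,.0f}")
```

With these made-up inputs the uncertainty of the difference is almost three times the difference itself, so the figures alone cannot say whether sales rose or fell. Two numbers that each look reasonably solid can still produce a difference that tells us very little.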
When I try out this list on my acquaintances, the median response is one of resignation more than surprise. For we really do know at some level that these metrics are flawed, biased, or irrelevant. On the other hand, we also tend to proceed with our analysis regardless, assuming our data are complete and accurate, and rationalizing our decision by telling ourselves that the data are the best we have, and our responsibility is to understand the data we have, rather than to start a war about the value and completeness of the data.
I’ve been there – many times. But as analysts our responsibility is not merely to manipulate data and indicate what they appear to mean. Our responsibility extends to helping people assess what can realistically be concluded from the data, and what questions the data actually answer. The trap, and there is a trap here, is to start from the premise that the data are, well, really pretty good, and then find ourselves having to backtrack when we wish to argue there are limits to the conclusions that can be drawn, and the questions that can be answered. That can be very difficult.
It’s better to initially presume the data are not too great – 10 to 15 percent in error if we’re told the quality is “good” or better, and more otherwise (yes, I’m serious).
Hey! I might not be able to conclude anything, except for the most coarse-grained conclusions!
I know. And that’s the point. To conclude more, we need to first prove the data are good enough, and that means more work. This approach – and I’ll grant it’s not very conventional – has one real advantage: we put ourselves in a position to always improve our results relative to our starting point. Experienced analysts will recognize that they can suggest which new conclusions are likely to emerge from improvements in particular data fields.
Some – but not all – of the problems I’ve called out are data quality and curation problems. Some of these are intrinsic uncertainties, however, and some conclusions are intrinsically limited. There is no obvious cure for the uncertainty of inflation. There is no obvious cure for a cost metric without its corresponding value. To offer specific answers in the light of these uncertainties is a pretense.
The assumption of complete and accurate data – “data cherry picking” – is definitely convenient. But beyond convenience, the motivation for cherry picking is strongest when we start from the premise that our data can and should support an answer – especially, an answer we expect. But as in statistical reasoning, that’s really a false premise.
The first duty of data analysis is to ascertain the limits of what our data can conclude, and what questions our data are addressing. It is not to presume an answer exists and it’s merely our job to uncover that answer – something we often do without realizing we’re actually doing it. Every time we project future value with exactitude, every time we estimate net value from cost because cost is what we have, every time we introduce a score, we’ve really jumped ahead and forgotten our first duty, which is one of assessment.
That “first duty” is far easier if we stop expecting our data to give us any answer at all, and instead expect to prove our answer is valid in the light of our original real-world problem. Two questions encapsulate the duty of assessment:
- What is the actual question we are answering with the aid of our data?
- What is the competence of our data to answer that question?
There is understandable resistance to the idea of starting from a “null conclusion” basis – for one thing, it means accepting that the cost and effort of collecting and analyzing data may not tell us what we want to know. That’s true, but that’s also real. In addition, basic and seemingly trivial operations can become complex and, well, very irritating, when uncertainty analysis becomes part of the picture. It isn’t without reason that many good analysts think of uncertainty analysis as living at a dismal intersection of tedium, reduced impact, and differential calculus.
I grant that. However, the alternative is to cherry-pick data and then to over-conclude. It’s a common ailment that could turn people off to the genuine merits of analytical and data-based reasoning. If you’re skeptical, ask President Hillary Clinton, to see what she thinks. Or perhaps we should start taking those five-day weather forecasts seriously? I don’t think so. The five-day forecast is a kind of stock joke – we read it and more-or-less ignore it. But not being president when you expected to be, or hiring the wrong person, or expecting income to increase when it might not – those are more serious matters.
If we’re to really leverage analytics in problems where uncertainty and partial information prevail – and that’s many if not most interesting problems – the days of “cherry picking” must come to an end. This party is over.
Ironically, while tracking uncertainty may appear to be a time sink, it can actually be a major time saver, particularly if additional data are clearly required, or the desired answers are beyond the competence of any available information. One strategy for managing incompleteness and uncertainty is as follows:
- Let’s start by assuming our data are not perfect or complete. Conversations with stakeholders about what they might conclude are easier – if not easy – starting from the premise that conclusions may be limited, rather than confirmation of hoped-for outcomes. The idea that a data system has limits is something to instill right from the outset.
- Next, write out the questions the data system is actually answering, as exactly as possible, and map these questions to what stakeholders are asking. (There is often no mapping – that’s OK.) I find this very helpful, and am still surprised at the differences between stakeholder perception of a question, and the actual question being addressed. Both data questions and stakeholder questions should be something that can actually be measured and validated.
- Next, identify situations in which data incompleteness or uncertainty does not impact conclusions. And cross them off, with pleasure. Aggregated answers are often (not always) less sensitive to uncertainty.
- Of the remaining data problems that can impact conclusions, identify the subset of issues that can be repaired, either by improved data quality or other methods. A decision must be made as to whether the repair is worth the cost.
- What’s left at this point? Uncertain/incomplete information with a range of outcomes. We’ve either decided that it’s not worth the trouble to improve the information, or determined that improvement is not feasible (e.g. inflation estimates). It’s still OK to proceed, but only by reporting the full range of outcomes consistent with the input information uncertainties.
It’s common to replace this step by a classic “cherry pick” – to simply plug in a reasonable set of input values and then calculate the outcomes. That’s OK if we can certify that the input uncertainty doesn’t matter, but otherwise not. However, there are still options to simplify the job. For example, many data inputs will typically impact outcomes in about the same way, so dimensional reduction can (albeit approximately) limit the amount of calculation involved.
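As a rough illustration of reporting a range rather than a single plugged-in value, the sketch below projects a cost forward under uncertain inflation. Everything in it is an assumption made for the example: the function names, the cost range (100 give or take 10 percent), the inflation range of 1 to 4 percent a year, and the ten-year horizon. It evaluates only the low and high ends of each input, a shortcut that works here because the outcome rises steadily with both inputs; a model without that property would need a finer sweep.

```python
from itertools import product

def future_cost(cost_today, inflation_rate, years):
    """Compound today's cost forward at a constant annual inflation rate."""
    return cost_today * (1.0 + inflation_rate) ** years

def outcome_range(cost_range, inflation_range, years):
    """Evaluate every low/high combination of inputs and return (min, max).

    Checking only the end points is valid here because future_cost
    increases with both cost_today and inflation_rate.
    """
    outcomes = [
        future_cost(cost, rate, years)
        for cost, rate in product(cost_range, inflation_range)
    ]
    return min(outcomes), max(outcomes)

# Hypothetical inputs: cost known to about 10 percent, inflation between 1 and 4 percent.
low, high = outcome_range(cost_range=(90.0, 110.0), inflation_range=(0.01, 0.04), years=10)
print(f"cost in 10 years: {low:.0f} to {high:.0f}")
```

Under these assumed inputs the high end of the range is more than half again as large as the low end. A single "reasonable" inflation figure would have hidden that spread entirely.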
I’m planning to follow up with a few posts illustrating these points in more detail with sample problems. If you’re like me you might be thinking it seems that few people take this kind of trouble – is it really necessary? I understand the sentiment, and agree that a full assessment of analytical limits is not particularly common. It can feel quite negative. But it isn’t. It’s realistic, and sets the stage for improving the information we use when that is feasible. And I have seen it done, by some of the best analysts I know. Perhaps ironically, less-certain outcomes wind up being perceived as more valuable, as they are also more reliable.