Bertrand Russell would periodically bellyache in A History of Western Philosophy that life would be simpler if people confined themselves to statements that were not demonstrably false.
Paraphrasing that remark, people might save themselves considerable trouble if they were to stop assuming their data errors are zero, which is almost always demonstrably false.
I know – we don’t exactly assume that. There are data quality checks, and data validations, and so on and so forth. Ultimately, however, we still tend to give users a single number, like $51,134, to indicate (for example) how much of a particular product was sold by the Montana sales group – and that number is very unlikely to be perfect. Everything from data entry errors to apportioned sales to invoicing problems to data transform issues – including imperfect assignments to groups – comes into play. And if we’re comparing any monetary quantity across time or countries, fluctuating currency values and inflation rates become factors. The ways data errors originate and propagate are sometimes expected and occasionally surprising – I’ll break that off into a separate post.
Here’s the problem: it’s convenient to take a single, probably imperfect number, and ask our computers to operate on it without worrying about the errors – we add and subtract it from other imperfect numbers, multiply and divide it by other imperfect numbers, or compare it with other imperfect numbers. Nonetheless, the errors will accumulate, and ultimately limit the accuracy of our answer. Any project or effort to go beyond that error limit is ultimately a waste of time and money. Perhaps more importantly, if errors are not evaluated, teams don’t know when to stop pursuing a particular level of accuracy. I’ve witnessed more than one team pursuing accuracy that simply wasn’t achievable.
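To make the accumulation concrete, here is a minimal Python sketch. It uses the common first-order, worst-case rules: absolute errors add under addition and subtraction, and relative errors (approximately) add under multiplication and division. The `Approx` class, the prior-year figure, and the 1% error are all assumptions for illustration, not anything from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approx:
    """A number carrying a worst-case absolute error (hypothetical helper).

    Assumes value != 0 wherever the relative error is needed.
    """
    value: float
    err: float  # absolute error, always >= 0

    @property
    def rel(self) -> float:
        # Relative error as a fraction of the value.
        return self.err / abs(self.value)

    def __add__(self, other: "Approx") -> "Approx":
        # Worst case: absolute errors add.
        return Approx(self.value + other.value, self.err + other.err)

    def __sub__(self, other: "Approx") -> "Approx":
        # Errors still add, even though the values subtract.
        return Approx(self.value - other.value, self.err + other.err)

    def __mul__(self, other: "Approx") -> "Approx":
        # First-order rule: relative errors add.
        value = self.value * other.value
        return Approx(value, abs(value) * (self.rel + other.rel))

    def __truediv__(self, other: "Approx") -> "Approx":
        value = self.value / other.value
        return Approx(value, abs(value) * (self.rel + other.rel))


# Two made-up sales figures, each assumed good to 1%.
this_year = Approx(51_134.0, 511.34)
last_year = Approx(49_800.0, 498.0)

growth = this_year - last_year
print(f"growth = {growth.value:,.0f} +/- {growth.err:,.0f} ({growth.rel:.0%})")
```

Two inputs that are each good to 1% produce a year-over-year difference with a worst-case error of roughly 76%. No amount of downstream effort fixes that; only better inputs do.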
Some teams do consider errors explicitly – and that’s great. Many don’t, and when I’ve inquired over the years, the reasons they don’t run along these lines:
- We already validated our data
- Our database doesn’t accommodate errors
- Errors tend to cancel each other out
- We don’t know how to handle an explicit error analysis
Which equals: we checked what we could, doing more is a pain in the ass, and it’s probably a minor problem anyway. I sympathize, but let’s take a look.
We already validated our data. Great, but unless that was against an external and independently-audited source, we can’t be truly sure of what we have. A complete validation, even between a business-intelligence and transactional system, is actually difficult and time-consuming. And transformational business rules – which are new data – can be a significant source of difference between BI and transactional systems. In practice the “validation” is often pushed to users, who are very smart about what they do but not necessarily schooled in data validation – that’s not supposed to be their job. The only way I know of to truly validate error rates is an outside audit – and ideally we’re not asking our users to be our auditors too.
Our database doesn’t accommodate errors. I know it doesn’t. But it should, so we can query errors directly along with our raw data.
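One low-tech way to get there is simply to store an error estimate next to each value. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database. The `sales` table, its `amount_err` column, and the rows are hypothetical, and summing the absolute errors is a deliberately pessimistic worst-case roll-up rather than the only reasonable choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: every amount carries its own absolute error estimate.
conn.execute(
    "CREATE TABLE sales (sales_group TEXT, amount REAL, amount_err REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Montana", 30_000.0, 300.0),
        ("Montana", 21_134.0, 211.34),
        ("Idaho", 18_000.0, 180.0),
    ],
)

# Worst-case roll-up: absolute errors are summed alongside the amounts.
rows = conn.execute(
    "SELECT sales_group, SUM(amount), SUM(amount_err) "
    "FROM sales GROUP BY sales_group ORDER BY sales_group"
).fetchall()

for group, total, total_err in rows:
    print(f"{group}: {total:,.2f} +/- {total_err:,.2f}")
```

With the error in its own column, the same query that produces the total also produces its error bar, so nobody has to reconstruct it after the fact.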
Errors tend to cancel each other out. This is a misconception. Percentage error can decrease, but only in very special circumstances, such as random and uncorrelated errors in a sum of numbers with the same sign. In general “cancellation” of errors has to be demonstrated – it can’t be assumed.
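A small Python sketch shows how narrow that special case is. It treats each error as a standard deviation and uses the standard result that independent errors add in quadrature (square root of the sum of squares), while fully correlated errors, such as a shared bias from one business rule, add linearly. The `sum_error` function and the numbers are assumptions for illustration.

```python
import math


def sum_error(errors, correlated=False):
    """Absolute error of a sum, given the absolute error of each term.

    Independent errors add in quadrature; fully correlated errors
    (for example, one shared bias) add linearly.
    """
    if correlated:
        return sum(errors)
    return math.sqrt(sum(e * e for e in errors))


# 100 hypothetical invoices of 1,000 each, every one assumed good to 1%.
values = [1000.0] * 100
errors = [10.0] * 100
total = sum(values)

rel_independent = sum_error(errors) / total
rel_correlated = sum_error(errors, correlated=True) / total

print(f"independent errors: {rel_independent:.2%}")
print(f"correlated errors:  {rel_correlated:.2%}")
```

Under the independence assumption the relative error of the total drops from 1% to about 0.1%. If the errors share a common cause, it stays at 1%. Nothing in the data itself tells you which case you are in, which is why cancellation has to be demonstrated.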
We don’t know how to handle an explicit error analysis. Agreed – the techniques and queries can be tedious, particularly for problems like misassigned groups. On the other hand, a full analysis of where errors propagate may not be necessary. It’s worth checking whether any substantive conclusion or action drawn from the data would change if the answers varied by 1%, 5%, or 10%. That depends on the user community and what they are doing. Ideally, this is one of the first steps. An excellent analytics director I know did precisely that: she confirmed her users didn’t need more than a readily achievable 5% accuracy, and then worked with her team to ensure any transformational error stayed within that 5% margin, using the transactional system as a reference. It’s much easier to do this up front than later. In fact, in this instance the goal was changed later on, requiring a retrofit of business rules that were acceptable at 5% error but needed much more work to reach 1% accuracy.
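That 1%, 5%, 10% check can be sketched in a few lines of Python. Everything here is hypothetical: the `decision` rule, the 50,000 target, and the helper `conclusion_is_stable`. The helper only tests the two ends of the error range, which is enough for a simple threshold rule like this one but not for a rule that could flip back and forth inside the range.

```python
def decision(sales: float, target: float) -> str:
    """Hypothetical business rule: did the group hit its target?"""
    return "on target" if sales >= target else "below target"


def conclusion_is_stable(value: float, margin: float, rule) -> bool:
    """True if the rule gives the same answer at value * (1 +/- margin).

    Only the endpoints are checked, so this assumes a monotonic rule.
    """
    low = value * (1 - margin)
    high = value * (1 + margin)
    return rule(low) == rule(value) == rule(high)


reported = 51_134.0  # the single number handed to users
target = 50_000.0    # assumed sales target

results = {
    margin: conclusion_is_stable(reported, margin, lambda v: decision(v, target))
    for margin in (0.01, 0.05, 0.10)
}

for margin, stable in results.items():
    print(f"{margin:.0%} error: conclusion {'holds' if stable else 'could flip'}")
```

With these made-up numbers the conclusion survives a 1% error but not a 5% or 10% one, so this user community would need better than 5% accuracy before acting on the report.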
In my experience with errors, the first and most critical thing is this: we should assume they are something other than zero, and make an explicit allowance for them in outcomes – in short, bring on the error bars. Even a reasonable guess is OK to start – say 1% error in raw data. If we find that might create issues for users, then we can look into the details and see if the actual error is less, or can be made less. Without error tracking, someone – developers or users – will assume the answers are better than they actually are.