Not long ago a CEO asked me if his data system qualified for the label of big data. When I told him it probably didn’t, he was obviously disappointed. Big data and big data analytics have developed real momentum. Vendors push them, articles tout their advantages, and many tech teams have advocates for them.
But in analytics it’s still good to be small. Any push to make our analysis systems bigger deserves heightened scrutiny – we should prove a need for size rather than assume it. Data size runs against the natural grain of analytics, whose job is to illuminate, clarify, simplify, and reduce.
Beyond that, data size may cause more analysis and analytics problems than any other single feature. Size is unwieldy. Size is expensive. Size brings complexity and can obscure accuracy. Size slows development and makes systems hard to change. We intuitively know that a large system will be hard to change later, so we ask for too much up front, before we really know what we want. The result is that size often locks in sub-optimal outcomes. And many analytics tools are best suited to smaller data sets.
How many records do we really need to answer our questions? We might eliminate some headaches by getting the answer, and a friendly local data scientist can help. Very often, we’ll learn that only a small fraction of our reference records are needed for analytics development. I’ve seen business intelligence systems that replicated (or increased) the size of the reference data system, just because people were afraid of losing something. Nearly all of that data sat untouched, but still cost money to maintain.
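As one illustration of how that question can be answered, here is a minimal sketch. It assumes the question boils down to estimating a single proportion (say, the share of customers who churn) from a simple random sample, and it uses the standard textbook sample-size formula with a finite-population correction. The function name and parameters are hypothetical, chosen for this example only; real analytics questions often need a different calculation, which is where the data scientist comes in.

```python
import math


def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5, population=None):
    """Estimate how many records are needed to measure a proportion.

    Assumptions (illustrative only): simple random sampling, a normal
    approximation, z=1.96 for roughly 95% confidence, and p=0.5 as the
    most conservative guess at the true proportion.
    """
    # Sample size for an effectively unlimited population.
    n = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)

    # Finite-population correction: a smaller reference set needs
    # somewhat fewer sampled records for the same precision.
    if population is not None:
        n = n / (1 + (n - 1) / population)

    return math.ceil(n)


# A margin of one percentage point, with no population limit.
unlimited = sample_size_for_proportion(0.01)

# The same margin, drawn from a reference set of 10 million records.
from_ten_million = sample_size_for_proportion(0.01, population=10_000_000)

print(unlimited, from_ten_million)
```

Under these assumptions, both figures come out a little under 10,000 records, a tiny fraction of a 10-million-record reference set. The point is not this particular formula but that the required size can be calculated rather than guessed.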
Sometimes size isn’t even an issue. There are plenty of high-value analytics problems with modestly sized data sets. Pharmaceutics: about 10 million organic compounds. Sabermetrics: under 16,000 major-league baseball players. Medical studies: usually 100,000 or fewer. Human resources: 100,000 or fewer.
For system size in analytics, I think we’re best served by a philosophy of “big if we have to, small if we can.” The red racer is faster and more fun than the 18-wheeler, so we want to be sure whatever we’re hauling in that semi is worth the trouble.