Arthur C. Clarke’s classic 1953 novel Childhood’s End describes the alien-aided amalgamation of humanity into a larger cosmic intelligence, with a concomitant loss of human creativity and individuality. Talk about disruption…
Less disruptive, but disruptions nonetheless, are the economic transitions that occur when latent demand and technology converge to enable the mass production of a product, often one previously crafted by skilled artisans.
Books, clothes, tools, watches, and vehicles, among others, have followed the same pattern, with the same result: the product in question becomes less expensive and more widely consumed, and skilled people are left looking for work.
I’d wager that in each of these cases there were artisans who predicted that their craft would be immune to mass production – the argument being that quality rests with the individual, and quality must prevail. But when mass production did eventually come, history has told a consistent tale: those artisans were sometimes right about quality, but always wrong about economics. Artisanship, however laudable, doesn’t scale, and consumers will flock to a mass-produced product when that’s their only affordable choice.
I suspect we are on the cusp of a similar – but not identical – transition in the implementation of data systems and their related analytics outcomes. And disruptive it will be, but in a fashion almost entirely to the good, even for the skilled artisans we know as data engineers and scientists. As I’ve propounded parts of this thesis before colleagues, what surprised me was not that I received some well-reasoned skepticism (which I did), but rather that, in the broad outlines, most agreed.
If data systems and outcomes do gravitate to semi-automated production, with our human role increasingly being one of design and managing exceptions – the weird, the wacky, and the entirely new – there remains the question: why now?
Now is the time, for the same reason mass-production transitions have been triggered in the past: there is a product shortage (in this case, of useful and actionable information), and technology can deliver a mass-produced capability.
When it comes to information, most agree about the shortage, but some dispute whether the technology will be ready soon.
That the demand for actionable information exceeds the supply is hardly in doubt. We talk about too much information, but the real information crisis is a shortage of the good stuff, and as data artisans can all tell us, bringing that good stuff to fruition is a lot of work.
As for the imminent capacity of technology to deliver largely-automated data outcomes, reasonable people can differ. My skeptical friends look at current tools for data transformation and analytics, and argue that they fall short of an automation capacity. I’d agree that they do. But what I see is not lack of success, but embryonic versions of future success. I see that a substantial degree of automated data outcomes is within range of present capabilities, and I believe the remaining gap will close over the next decade, to the point where largely-automated data outcomes will be far superior to what people can produce.
I’m not saying that data engineers and scientists will be irrelevant. On the contrary, our value should increase because we’ll be able to focus on those aspects of answering questions with data that are most challenging: high-level design and specification, the management of exceptional situations, the formulation of questions and the interpretation of answers, and the development of better information. As for the implementations themselves? Increasingly, I expect we’ll become less relevant.
Can machines really do what highly skilled data artisans now do? Probably not, but they don’t have to. Handling what people find repetitive – which would be the vast majority of data manipulations – will be just fine.
Show me a person bored with their work, and I’ll show you an opportunity to automate the task they would rather not be doing. My first boss – a very distinguished data scientist before “data scientist” was an established profession – called me into his office one day, had me sit down, and informed me that “Most individual modeling projects are, well, kind of boring.” I had started to wonder myself…. Since that day, over 20 years ago, I’ve heard experienced analysts express similar thoughts – for in most problems the strategy of data transformation, modeling, and problem resolution is invariant, even as the details differ.
Boredom comes from repetition, but also from knowing that our manual actions cannot keep pace with our thinking. Then, we inevitably make choices about what we can do, and what to set aside, and a really tedious task – like uncertainty analysis – is unlikely to be at the top of our list.
So, if we’re bored anyway, why not have machines do that work? Machines are not very smart, and we’ll need to point them in the right direction, but they simply do not get bored, and that offers serious advantages.
As I look at the key tasks in crafting a data outcome, the technology for automation appears to be at hand, or nearly so – well within the “boredom zone.” I thought we might start with the most repetitive and least conceptual, and work our way up.
Validating the impact of assumptions, rules, and uncertainties to avoid conclusions beyond the precision of our data is important, but often not performed because it’s, well, time-consuming and tedious. But not difficult…. Wondering which outcomes would be impacted by a somewhat arbitrary rule? Or, which outcomes would be impacted if our data metrics or categories are only 99% accurate, or even 95% accurate? With cloud computing capabilities, the answer is readily at hand – grab a healthy sample, spin up some nodes, and check it out. In an uncertain world with uncertain data and arbitrary rules, there is this certainty: when it comes to grind-it-out tasks like this, the unbored will carry the day. I say: let the machines have this one, and the sooner the better.
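To make the "grab a healthy sample and check it out" idea concrete, here is a minimal sketch of that kind of check, run on one machine rather than a cluster. Everything in it is an assumption for illustration: the sample data is synthetic, the "outcome" is a simple total per category, and the helper names (`make_sample`, `perturb_labels`, `sensitivity`) are hypothetical. It asks how far each category total could drift if the category labels were only 99% or 95% accurate.

```python
import random


def make_sample(n=1000, seed=1):
    """Build a synthetic sample of (category, amount) records (illustrative data only)."""
    rng = random.Random(seed)
    return [(rng.choice(["A", "B", "C"]), rng.uniform(10, 100)) for _ in range(n)]


def category_totals(records):
    """Sum the amount per category: the outcome we want to stress-test."""
    totals = {}
    for category, amount in records:
        totals[category] = totals.get(category, 0.0) + amount
    return totals


def perturb_labels(records, accuracy, categories, rng):
    """Relabel each record with probability (1 - accuracy) to mimic miscategorized data."""
    noisy = []
    for category, amount in records:
        if rng.random() > accuracy:
            category = rng.choice([c for c in categories if c != category])
        noisy.append((category, amount))
    return noisy


def sensitivity(records, accuracy, trials=200, seed=0):
    """Return, per category, the largest relative change in its total across all trials.

    Assumes at least two categories and a nonzero baseline total for each.
    """
    rng = random.Random(seed)
    categories = sorted({c for c, _ in records})
    baseline = category_totals(records)
    worst = {c: 0.0 for c in categories}
    for _ in range(trials):
        noisy = category_totals(perturb_labels(records, accuracy, categories, rng))
        for c in categories:
            change = abs(noisy.get(c, 0.0) - baseline[c]) / baseline[c]
            worst[c] = max(worst[c], change)
    return worst


if __name__ == "__main__":
    sample = make_sample()
    for accuracy in (0.99, 0.95):
        print(accuracy, sensitivity(sample, accuracy))
```

Each trial is independent, which is why this sort of job spreads so easily across nodes: more trials or a larger sample simply means more machines, not more thought.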
Most data analysis problems start with a routine exploratory analysis, and there are already tools that do a pretty fair job of flagging univariate outliers, bivariate correlations, and in some cases offering a comprehensible dashboard with little or no effort. The biggest problem so far is the limited range of what is offered, and the number of trivial results – but that’s still way better than crafting this by hand. This kind of facility, a part of what is now called self-service analytics (SSA), only promises to get better as users like me complain, as I have just done here. Vendors are now releasing updates quarterly or even monthly.
Self-service analytics vendors are well on their way to having semi-automated descriptive analytics in the bag. It will still be a few years before we have an ideal tool, in which we can specify what we want, or don’t want, in advance. I say: if a machine does this, so much the better, because I would have to do it anyway.
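For a sense of how routine the first pass of an exploratory analysis is, here is a rough sketch of the two checks mentioned above: flagging univariate outliers and strong bivariate correlations. It is not how any particular SSA product works. The function names, the Tukey-fence rule for outliers, and the 0.8 correlation threshold are all assumptions chosen for illustration.

```python
import math
import statistics
from itertools import combinations


def flag_outliers(values, k=1.5):
    """Return indices of values outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def explore(columns, corr_threshold=0.8):
    """Scan a dict of named numeric columns; report outliers and strongly correlated pairs."""
    report = {"outliers": {}, "correlations": []}
    for name, values in columns.items():
        flagged = flag_outliers(values)
        if flagged:
            report["outliers"][name] = flagged
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= corr_threshold:
            report["correlations"].append((a, b, round(r, 3)))
    return report


if __name__ == "__main__":
    # Illustrative data: y is exactly 2 * x, and both end in an extreme value.
    data = {
        "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
        "y": [2, 4, 6, 8, 10, 12, 14, 16, 18, 200],
        "z": [5, 3, 8, 1, 9, 2, 7, 4, 6, 5],
    }
    print(explore(data))
```

Note that the x/y correlation it reports is driven partly by the same extreme point it flags as an outlier, which is exactly the kind of trivial or misleading result that still needs a person to sort out.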
What about data transformations? Good extract, transform and load (ETL) tools have been encouraging us for at least a decade to think in terms of data operations rather than low-level coding operations.
The next step – and I’ve not personally seen a tool that quite solves this problem – is to specify inputs and target data shapes (for example, components of a star schema) and let the tool handle the rest, stopping only when we need to intervene, perhaps to supply a rule, or when there is no sensible option forward. (That can and does happen in most implementations, but why should we otherwise be involved?) In analysis work, data transformation strategy is sufficiently straightforward, with each step having a clear impact on the final data topology, that strategic ETL may not even require machine-learning techniques. Regardless, I say: take it away, machines – I’ve been there, and done that.
And as for predictive analytics? Those bored quantitative analysts are experiencing tedium for a reason. Each model is different, but the tools of the trade – including fitting algorithms, exceptional data detection, and problem resolution – are often invariant. Granted, that’s different from a concrete procedure, and in some cases – predictor development and optimization constraint definition being good examples – automation may offer guidance rather than a full solution.
That said, predictive analytics processes offer features that commend themselves to automation: the number of strategies is usually limited, there is a clear target, and it’s possible to make incremental progress with only part of a solution. Those are features that align well with machine-learning protocols such as genetic algorithms (which in principle are also highly scalable). Machines might not be creative or smart, but there is a great deal to be said in such cases for trying many options, combining the pieces that seem to work well, and gradually working towards better answers – precisely what (for example) a genetic algorithm would do. Our human alternative – being creative and clever in our choice of approach and then trying a handful of things – is noble, but slow and potentially biased.
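As an illustration of "trying many options, combining the pieces that seem to work well, and gradually working towards better answers," here is a bare-bones genetic algorithm sketch. It searches over which of ten candidate predictors to include in a model. The fitness function is a toy stand-in (in real work it would be something like a cross-validated model score), and the names `evolve`, `toy_fitness`, and `USEFUL` are hypothetical. The selection, crossover, and mutation settings are assumptions, not recommendations.

```python
import random

# Toy assumption: of 10 candidate predictors, only these indices actually help.
USEFUL = {0, 3, 4, 7}


def toy_fitness(mask):
    """Reward each useful predictor included; penalize each useless one."""
    hits = sum(1 for i, g in enumerate(mask) if g and i in USEFUL)
    extras = sum(1 for i, g in enumerate(mask) if g and i not in USEFUL)
    return hits - 0.5 * extras


def evolve(fitness, n_genes, pop_size=40, generations=60, mutation_rate=0.02, seed=0):
    """Evolve a 0/1 mask of length n_genes (n_genes >= 2) toward higher fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(population, key=fitness)

    def pick(pop):
        # Tournament selection: the fitter of two random candidates becomes a parent.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # Elitism: the best mask so far always survives.
        while len(children) < pop_size:
            mom, dad = pick(population), pick(population)
            cut = rng.randrange(1, n_genes)  # One-point crossover.
            child = mom[:cut] + dad[cut:]
            # Mutation: occasionally flip a gene to keep exploring.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


if __name__ == "__main__":
    winner = evolve(toy_fitness, n_genes=10)
    print(winner, toy_fitness(winner))
```

Every fitness evaluation is independent of the others, which is where the scalability comes from: the population can be scored in parallel, and the human contribution is mostly deciding what "fitness" should mean.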
Many quantitative analysts have taken their own steps down this path, perhaps scripting an onerous task, or parameterizing and then optimizing an algorithm. When I’ve used such approaches in the past, I found it straightforward to set the problem, but harder to find enough computing power to reach convergence in a reasonable time. That’s a consideration, but far less of a problem now.
Of course, none of these tasks stand alone. Data uncertainty, rules, and representations go a long way to determine if analytics is simple or difficult, or even feasible. That’s true whether we craft solutions by hand, or with the aid of automation. Either way, iterative improvement will continue to be the order of the day, and automation is a help there too, by lowering the burden of trying nearly the same thing multiple times.
The real issue may no longer be whether a high degree of automation is within reach for data systems and their associated analytics outcomes – that is a process which is well under way. My view is that aided by self-service analytics, cloud computing, machine learning, and social media, we’ll soon work in a world where much, if not most, of what we currently do as artisans will be handled under the hood – even the higher-concept tasks of quantitative analytics.
The critical issue may be: should we actively support this transition, assuming it’s on an economically-driven track that will not easily be derailed?
I say yes, even in the face of the argument that we’ll experience some spectacular train wrecks when an entire class of newly-enabled users works with powerful but dangerous tools. That’s fine. Train wrecks are part of the IT drill, and these new train wrecks will have the merit of being immediately relevant, and therefore eminently detectable.
I say yes, because the best use of an experienced analyst is not to write essentially the same code that many others have written before, in nearly the same situation. It is rather to assist with what is truly unique to each problem, whether that be formulating questions, interpreting answers, extracting better predictors, improving data, or managing exceptional conditions as they arise.
I say yes, because where humans can falter most easily – in necessary but highly tedious tasks – is precisely where machines excel.
With a shortage of good information, analysts have become too valuable to be engaged in detailed artisanship, and our best route to enhanced relevance and productivity is to reduce the role of that artisanship.
We’ll see how far down the road of automation we go, but should we support and encourage this transition? I say yes. As automation becomes available, our role in the process of supplying actionable information will not be reduced to irrelevance, but rather transformed to greater relevance. Our discussions about human relevance in the face of automation echo Childhood’s End as well as the automation transitions of history. However, the age-old problem of answering questions with data really is sophisticated enough that automation will only assist us, rather than supplant us.