Those Were The Days

A friend called recently to complain that I am no longer complaining about Trump.   It’s not that my views have changed, or that I’m becoming shy about my dislike for power without intellect, and the arbitrary denigration of people for attributes including nationality, ethnicity, and professional competence.

I hadn’t written much on Trump recently, as matters have unfolded pretty much as I expected.  Without ideas or competence, Trump’s administration is without positive accomplishments, and has only succeeded when destruction has been its objective.   The pace of legislative accomplishment has slowed to nearly zero – during the last congressional health care debate I had the strange sensation of opening the New York Times web page, and wondering if their new, slower site was actually showing me yesterday’s news.  Nope – I was seeing today’s news – it simply looked like yesterday’s news.

But last week, with no congressional action to distract the president, the administration’s chaotic trajectory through political hyperspace was re-energized.  And, while simultaneously reading the paper and wincing, I had a weird but genuine sensation of déjà vu.

Of the feelings that the Trump administration brings to mind, I would not expect déjà vu to be high on anyone’s list, whether you are in the majority of US citizens who disapprove of the President, or the minority who approve.  Depression?  Sure.  The feeling “this has happened before?” Hopefully not.

But there it was last week – a modest but non-zero feeling of recurrence –  as Vice President Mike Pence executed a near-perfect imitation of Nixon press secretary Ron Ziegler.

Those recalling the Nixon administration, or All The President’s Men, will remember the term “non-denial denial,” in which Ziegler responded to anti-Nixonian news developments with the then-novel technique of misdirection.  Typically, Ziegler would respond to a negative news story by complaining about the source’s politics and biases, but not about the accuracy of the story itself.

That might seem self-defeating, but as a propaganda strategy it was true genius:  it actually got some people feeling sorry for Nixon – a very non-trivial accomplishment. Imagine feeling sorry for Donald Trump after he threatens North Korea with nuclear oblivion, rather than feeling stupefied wonderment at yet another brazen act of incompetence, insolence, and insanity.  Now that would be propaganda skill. However, the White House, lacking skill at nearly every skill position, has no one to cover for Trump as Ziegler covered for Nixon.  (In fairness, the fact that compared with Trump, Nixon qualifies as a left-leaning moderate does not make the current press secretary’s job easier.)

Last week, Pence drew upon his inner Ziegler to complain about a New York Times story asserting that he, among others, was beginning his 2020 presidential campaign now, on the reasonable grounds that eight years of Trump would be geometrically more dangerous than four years of Trump.  Hell, it’s reasonable to wonder if we can actually get to 2020 to watch these characters scramble reluctantly for the top of the political heap.

Caught in the probable act, the Vice President’s response was highly Ziegler-like, at once unctuous and pitiful. “Today’s article in the New York Times is disgraceful and offensive to me, my family, and our entire team,” he told his listening audience.

Well, maybe Pence, his family, and his entire team were offended by the story, but that, as students of presidential propaganda (and analysts) know, is not the same as claiming the story is false.   I don’t suppose that Pence would be a good President, but it’s interesting to see someone in the Trump administration show some kind of skill, at something, at least some small portion of the time.  Whether that is a good thing is unclear.

Workplace Analytics Blues

Is the modern workplace anti-analytic?

At first glance this might seem to be an otherworldly question – after all, we have more data, more databases, more analytics, more visualizations, more predictive models than ever.

Where’s the lack of analytics in that? And the answer is: context.

A friend of mine consults in the realm of personal productivity, and he tells me that his clients report more problems than ever getting things done.  Large amounts of data and analytics can be more hindrance than help, as people respond to more things in a shorter span of time, and end up doing almost none of them really well.  We can become near-constant context shifters – always changing gears as we move from one task to another. Context shifting consumes our finite resources of time and energy, and only the time and energy that remain can be devoted to productive work.

Although my friend works mostly with IT security personnel, analysts frequently describe the same pressures.  For analysts the consequences are, if anything, worse, for even the best analytics models are fractional representations of reality, and for us to accurately interpret the results we see requires the additional context and experience we bring to the job.

We can get away with “at a glance” analysis only when context stands still: the result means just what it did the last time we glanced.  Otherwise, we can easily find ourselves interpreting a number in the light of what it once was, or in the light of the last thing we were thinking about.  Does this cause all interpretive problems? Certainly not.  Does it cause many problems? In my experience, it does.

Workplace environment becomes a factor when we lack the time to shift our mental context away from one analysis problem, clear our heads, and seriously consider an unrelated one.   When our workplace combines demands for “multi-tasking” information processing with insufficient time for “task switching,” then yes, it is anti-analytic.

The cost of “task switching” depends on the task, of course, but my friend argues a minimum of 30 minutes for each task-switch, and many of my coworkers report similar times. Interpreting data is a context-intensive task.   Thirty minutes may seem like a long time to “just look at a chart,” and it is a long time to “just look” at information – but it is not a long time to understand that information.

Is it easier to reply “it will take more than just a minute” to the next urgent request that “will only take a minute,” or to sing the Workplace Analytics Blues?   For sure, it can be uncomfortable to offer that kind of “push back” – a breakdown of the time costs of task-switching often helps. On the other hand, I can’t sing, so for me there isn’t really a choice.
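That breakdown is simple arithmetic – an illustrative sketch, using my friend’s 30-minute figure and a made-up interruption count:

```python
# Illustrative cost of context switching across an 8-hour day,
# using the 30-minutes-per-switch figure and a made-up count of
# "it will only take a minute" interruptions.
minutes_per_switch = 30
switches_per_day = 6
workday_minutes = 8 * 60

lost = minutes_per_switch * switches_per_day
print(f"{lost} of {workday_minutes} minutes lost to switching "
      f"({100 * lost / workday_minutes:.0f}% of the day)")
```

Six “quick” interruptions, and more than a third of the day is gone before any real work happens.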


Short takes


It’s more fun and practical, if less exact, to have short-take versions of things I’ve mentioned elsewhere.  Here’s round one – less than 15 seconds a shot.


Analytics is a genuine “people” business – it assists the quintessential human activity of asking good questions, and interpreting the answers.


In engineering and science, we don’t know what we have until we know when it breaks.  One of the first duties of analytics is to understand the limits of analytics.

And if the answer is “we really don’t have the information to answer your question,” that should be perfectly OK.  However, it’s a lot easier to say that early in the game.


Show me a typical database, and I’ll show you tables filled with What, Where, When, Who, How Much, and How Many.   But show me the questions to which people most want answers, and what I’ll see is usually How and Why.

It is we people who fill the gap between the data we’re likely to have, and the answers we most desire.


A requirement explains to stakeholders what we propose to do, so they can explain to us what they might be reluctant to pay for.


We must often ask for analytics requirements without detailed data, but this is rather like asking an arborist to trim a tree based only on a general notion of what trees should look like.

We have to start somewhere. But we should plan to have another go, after we learn how desires and data really align.

Luckily, rebuilding a database is usually easier than rebuilding a tree…


Programmers and analysts tend to be creatives, which may explain their intense dislike for documentation – it’s a little like asking an artist to explain the brushstrokes.


If there is a task more disliked by analysts than documentation, it might be uncertainty analysis, which operates at the tedious intersection of repetition, reduction of apparent impact, and differential calculus.


Insight depends as much on those receiving information as on the information itself.

A perfectly good insight might be well known but not widely known – the application assisting the information transfer is performing a real service.

That you can sing “Amazing Grace” to the theme from “The Beverly Hillbillies” is well-known to choristers, but for most of us, insight is hardly the word.   (Oh, do try it…)

Some insights really are new to everyone, but that’s less common. The only analytics application not worth having may be the one that confirms what everyone pretty much knows already.


Things often assumed, which ideally would always be proven: data are certain; data are complete; metrics are meaningful; numbers can be meaningfully compared; rules and transforms have little impact on outcomes.


Things that only seem to cost very little:  storing data, and shareware.


Things that only seem expensive:  design, testing, and analytics applied to either one.


If I have 10^15 records in a table, and each query pulls the 1,000 records needed for a particular graph or table, it would take 31,710 years for me to work through the whole table at a rate of one query per second.
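The arithmetic is easy to check – a back-of-envelope sketch:

```python
# Back-of-envelope check: working through a 10^15-row table in
# batches of 1,000 rows, at one query per second.
rows = 10**15
rows_per_query = 1_000
queries = rows // rows_per_query           # 10^12 queries
seconds_per_year = 60 * 60 * 24 * 365      # ~31.5 million seconds
years = queries / seconds_per_year
print(f"{years:,.0f} years")               # about 31,710 years
```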

A great deal of big data is never meaningfully used. In many cases a good use of big data analytics would be to reduce the data set, rather than find algorithms to process a great deal of valueless data.   It doesn’t take 10 million records to know there are no gold mines in Kansas.

Storage and processing of data costs money, whether the data are meaningful or not.  Perhaps more importantly, the mechanics of big data storage can directly impact our ability to manipulate and model data, and therefore our analysis options.


Analytics isn’t supposed to be easy. Asking good questions, assembling and using data, and understanding the answers has always been a challenge.

Those asking questions, interpreting answers, and building data systems – from infrastructure, to databases, to data engineering and science – do a remarkable job and nearly always deliver value.


Locate a technical expert bored with their job, and you’ve also found an opportunity to automate the task they would rather not be doing.  With plenty of good and challenging problems to solve, why should we fear automation of the bad and simple ones?


I like to understand the code I see, except for regular expressions.  Even after I write them I don’t understand them.
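For what it’s worth, Python offers one small mercy here: the re.VERBOSE flag, which lets a pattern carry its own commentary.  A toy date-matching pattern, written both ways:

```python
import re

# The usual "write-only" form of a (toy) date matcher:
terse = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# The same pattern with re.VERBOSE, which permits whitespace and comments:
readable = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

assert terse.match("2017-08-15").groups() == readable.match("2017-08-15").groups()
```

Whether the verbose version is still “write-only” a month later, I leave to the reader.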


When interpreting data there are conclusions, which are supported by the data, and hypotheses, which deliberately reach beyond the data.  Both are great in their separate realms, but it’s surprisingly easy to mix them together.

You time a car racing wildly down the street and conclude the driver is over the speed limit.  You can only hypothesize the driver is not in control of his vehicle.

Each of five stores with new managers shows August sales below those of last year, while other stores have done well.  It’s only a hypothesis that managerial inexperience is the cause.   (In reality, the managers were new, but so were the stores, and the previous good sales were an artifact of “grand opening” promotions.)


Data warehouses are often excellent, because they simplify and systematize asking and answering of common questions related to aggregation, particularly over hierarchies.

If you don’t want to do that, you probably don’t need a data warehouse.
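A toy sketch of that bread-and-butter aggregation – rolling sales up a store-to-region hierarchy – using made-up data:

```python
from collections import defaultdict

# Hypothetical sales facts with a region -> store hierarchy.
sales = [
    ("East", "E1", 100), ("East", "E2", 150),
    ("West", "W1", 200), ("West", "W2", 50),
]

# Roll up the hierarchy: store totals, then region totals.
by_store, by_region = defaultdict(int), defaultdict(int)
for region, store, amount in sales:
    by_store[(region, store)] += amount
    by_region[region] += amount

print(dict(by_region))  # {'East': 250, 'West': 250}
```

A warehouse systematizes exactly this – the same roll-up, over many hierarchies, asked the same way every time.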


There are scores of biases, but the three I see most in analysis work are confirmation bias – we want to be right; automation bias – what the computer says is valid; and sunk cost bias – the more we invest in something, the more we want to believe in its authenticity.

A rather brutal but honest kickoff meeting might be:  ask each stakeholder what they expect, and then explain that 1) their information may not be able to deliver what they would like; 2) computers would as soon distort their information as present it; and 3) things can only look up from here, at a cost no greater than disappointment would cost later.


My mother, a perfectly nice computerphobe, called one day to ask my help with a data problem she was experiencing in Excel.  I listened to her problem, I helped her, and then called my consultant friends with this message: prepare yourself – analytical mechanics are now in the mainstream.   And that was 15 years ago.  Increasingly, the best contribution for IT professionals will be beyond the detailed manipulation of code and data.


A friend of mine insists that all problems in analytics start with invalid reasoning from aggregates.  Which he also offers as proof that you can pretty much be right, and pretty much be inconsistent, all at the same time.


The real crisis of information is less about having too much information, than about having too little that is reliable, relevant, and believable.


Is that a fact?  In the absence of clear assumptions, a well-defined system, and a transparent characterization – the elements of a good scientific observation, but broadly applicable – we shouldn’t be required to answer “yes.”

That excludes many statements, but includes a great deal too.


Bad day? I hear you…   But if it weren’t for problems, we would quickly be out of business.


Analytics is as old as people answering questions with information.

Until about 25 years ago analytical reasoning was limited by the data at hand. Each item was examined, reviewed, scrutinized, and conclusions were drawn.

Since then, we have gained in available data, but lost in data context, as we automatically ingest data and place them in tables and fields we can handily consume.

In big data systems, we’ve gone from comprehending a lot about a few items, to comprehending a little about a lot of items.   And that’s fine, if that little is enough to characterize those items.

But often the little we have about each item is not enough. When it comes to people, or organizations, or economies, or healthcare, or many other complex things, it is difficult to know what information to collect in advance of inquiry, and harder still to comprehensively collect all of that information.

Meaning: for simpler systems we’ve moved ahead; for complex systems we are limited, as we always have been, by the data we have available.


In engineering and science, simplicity may be the most complex thing to achieve, while complexity – which is frequently mistaken for sophistication – may be the simplest.

Simplicity may be the most underrated property of good technical work, and the key to transparency, which gives us understanding, which leads to acceptance, which offers the chance to make an impact.

My Favorite Users

Sports Illustrated recently carried a perceptive article on the use of analytics in the NFL, which varies widely across the league.  Some teams employ predictive models only as a rough definition of how a player’s attributes would normally relate to their expected performance, so they can knowingly do something outside the model, and the normal.  It’s a discipline where success comes from establishing new trends, more than following existing ones.

Something similar happens in baseball. When I interviewed for my first job, my to-be-boss presented me with one of Bill James’ tomes –  and I was very interested to learn that James didn’t give a damn about formal analytics.  What he cared about was developing new metrics and ways to measure performance.  (I still think the “Hall of Fame Watch” is a masterpiece of scorecarding.)   Real differentiation comes from understanding when a model will be wrong, for one player or a group of players.

We aren’t playing in the NFL or MLB, brother….  Undeniably, but many engaged users act like they are, and use analytics accordingly – less to predict their situation, than to understand how that situation differs from normality. And then to act, not by the dictates of analytics, but rather by using analytics to gauge how far they’ve traveled into the realm of the unnormal when they overrule analytics.

We might be tempted to see this as a problem. Actually, it might be the best user scenario there is: these are users who are engaged, who value analytics outcomes, who understand the limits of those outcomes, and who augment the outcomes with local expertise to get just what they want.

Using outcomes without question, or not using them at all, are the real problems.   We can get either or both, by attempting to shrink-wrap outcomes in a “here is all that matters; do this; don’t do that” format.

Now, when enthusiastic users say that Your stuff is great! Here’s how we use it – we start by pulling a few numbers from this table…. And when we’re finished, we email everyone on our team! – I listen, say “thank you,” and admire their ingenuity.

Childhood’s End

Arthur C. Clarke’s classic 1953 novel Childhood’s End describes the alien-aided amalgamation of humanity into a larger cosmic intelligence, with a concomitant loss of human creativity and individuality.   Talk about disruption…

Less disruptive, but disruptions nonetheless, are the economic transitions that occur when latent demand and technology converge to the mass production of a product, often a product previously crafted by skilled artisans.

Books, clothes, tools, watches, and vehicles, among others, have followed the same pattern, with the same result:  the product in question becomes less expensive and more widely consumed, and skilled people end up looking for work.

I’d wager that in each of these cases there were artisans who predicted that their craft would be immune to mass production – the argument being that quality rests with the individual, and quality must prevail.  But when mass production did eventually come, history has told a consistent tale: those artisans were sometimes right about quality, but always wrong about economics.  Artisanship, however laudable, doesn’t scale, and consumers will flock to a mass-produced product when that’s their only affordable choice.

I suspect we are on the cusp of a similar – but not identical – transition in the implementation of data systems and their related analytics outcomes.  And disruptive it will be, but in a fashion almost entirely to the good, even for the skilled artisans we know as data engineers and scientists.   As I’ve propounded parts of this thesis before colleagues, what surprised me was not that I received some well-reasoned skepticism (which I did), but rather that, in the broad outlines, most agreed.

If data systems and outcomes do gravitate to semi-automated production, with our human role increasingly being one of design and managing exceptions – the weird, the wacky, and the entirely new – there remains the question: why now?

Now is the time, for the same reason mass-production transitions have been triggered in the past:  there is a product shortage – in this case, of useful and actionable information, and technology can deliver a mass-produced capability.

When it comes to information, most agree about the shortage, but some dispute whether the technology will be ready soon.

That the demand for actionable information exceeds the supply is hardly in doubt.  We talk about too much information, but the real information crisis is a shortage of the good stuff, and as data artisans can all tell us, bringing that good stuff to fruition is a lot of work.

As for the imminent capacity of technology to deliver largely-automated data outcomes, reasonable people can differ.  My skeptical friends look at current tools for data transformation and analytics, and argue that they fall short of an automation capacity.   I’d agree that they do. But what I see is not lack of success, but embryonic versions of future success. I see that a substantial degree of automated data outcomes is within range of present capabilities, and I believe this gap will only close over the next decade, to the point where largely-automated data outcomes will be far superior to what we people can produce.

I’m not saying that data engineers and scientists will be irrelevant. On the contrary, our value should increase, because we’ll be able to focus on those aspects of answering questions with data that are most challenging – high-level design and specification; management of exceptional situations; the formulation of questions and the interpretation of answers; the development of better information.  As for the implementations themselves?  Increasingly, I expect we’ll become less relevant.

Can machines really do what highly skilled data artisans now do?  Probably not, but they don’t have to.  Handling what people find repetitive – which would be the vast majority of data manipulations – will be just fine.

Show me a person bored with their work, and I’ll show you an opportunity to automate the task they would rather not be doing.   My first boss – a very distinguished data scientist before “data scientist” was an established profession – called me into his office one day, had me sit down, and informed me that “Most individual modeling projects are, well, kind of boring.”   I had started to wonder myself….  Since that day, over 20 years ago, I’ve heard experienced analysts express similar thoughts – for in most problems the strategy of data transformation, modeling, and problem resolution is invariant, even as the details differ.

Boredom comes from repetition, but also from knowing that our manual actions cannot keep pace with our thinking.  Then, we inevitably make choices about what we can do, and what to set aside, and a really tedious task – like uncertainty analysis – is unlikely to be at the top of our list.

So, if we’re bored anyway, why not have machines do that work?  Machines are not very smart, and we’ll need to point them in the right direction, but they simply do not get bored, and that offers serious advantages.

As I look at key tasks in crafting a data outcome, the technology for automation appears to be at hand, or nearly so – well within the “boredom zone.” We might start with the most repetitive and least conceptual tasks, and work our way up.

Validating the impact of assumptions, rules, and uncertainties to avoid conclusions beyond the precision of our data is important, but often not performed because it’s, well, time-consuming and tedious.  But not difficult….   Wondering which outcomes would be impacted by a somewhat arbitrary rule? Or,  which outcomes would be impacted if our data metrics or categories are only 99% accurate, or even 95% accurate?   With cloud computing capabilities, the answer is readily at hand – grab a healthy sample, spin up some nodes, and check it out.   In an uncertain world with uncertain data and arbitrary rules, there is this certainty:  when it comes to grind-it-out tasks like this, the unbored will carry the day.  I say: let the machines have this one, and the sooner the better.
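A minimal sketch of the idea, with made-up revenue figures: perturb a metric at an assumed 95% accuracy, re-run the analysis many times, and watch how much the answer moves.

```python
import random

random.seed(42)

# Hypothetical daily revenue figures.
revenue = [random.uniform(900, 1100) for _ in range(10_000)]
baseline = sum(revenue) / len(revenue)

# Simulate a metric that is only 95% accurate: 5% of values are
# mis-recorded by up to +/- 20%.
def perturb(values, accuracy=0.95, max_error=0.20):
    out = []
    for v in values:
        if random.random() > accuracy:
            v *= 1 + random.uniform(-max_error, max_error)
        out.append(v)
    return out

# Re-run the "analysis" (here, just a mean) many times under
# perturbation, and watch the spread of the answers.
estimates = [sum(p) / len(p) for p in (perturb(revenue) for _ in range(200))]
spread = max(estimates) - min(estimates)
print(f"baseline {baseline:.1f}, spread under 95% accuracy: {spread:.2f}")
```

If a conclusion survives the spread, fine; if not, we were reporting more precision than the data could support.  Either way, the machine did the grinding.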

Most data analysis problems start with a routine exploratory analysis, and there are already tools that do a pretty fair job of flagging univariate outliers and bivariate correlations, in some cases offering a comprehensible dashboard with little or no effort.  The biggest problems so far are the limited range of what is offered, and the number of trivial results – but that’s still way better than crafting this by hand.    This kind of facility, part of what is now called self-service analytics (SSA), only promises to get better as people like me keep complaining, just as I’ve been complaining here.    Vendors are now releasing updates quarterly or even monthly.
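The univariate and bivariate flagging is simple enough to sketch by hand – here with invented columns, a crude z-score outlier rule, and a hand-rolled Pearson correlation:

```python
import statistics

# A toy version of what self-service EDA tools automate: flag
# univariate outliers and strong pairwise correlations.
# The column data here are invented.
columns = {
    "units":   [10, 12, 11, 13, 12, 95, 11, 12],
    "revenue": [100, 122, 108, 131, 119, 940, 112, 121],
}

def outliers(values, z=3.0):
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    return [v for v in values if sd and abs(v - mean) / sd > z]

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

for name, vals in columns.items():
    print(name, "outliers:", outliers(vals, z=2.0))

r = pearson(columns["units"], columns["revenue"])
print(f"units vs revenue: r = {r:.3f}")
```

The commercial tools do this across every column pair, with better rules – which is exactly why the output needs pruning.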

Self-service analytics vendors are well on their way to having semi-automated descriptive analytics in the bag.  It will still be a few years before we have an ideal tool, in which we can specify what we want, or don’t want, in advance.  I say: if a machine does this so much the better, because I would have to do it anyway.

What about data transformations? Good extract, transform and load (ETL) tools have been encouraging us for at least a decade to think in terms of data operations rather than low-level coding operations.

The next step – and I’ve not personally seen a tool that quite solves this problem – is to specify inputs and target data shapes (for example, components of a star schema) and let the tool handle the rest, stopping only when we need to intervene, perhaps to supply a rule, or when there is no sensible option forward.  (That can and does happen in most implementations, but why should we otherwise be involved?)    In analysis work, data transformation strategy is sufficiently straightforward,  with each step having a clear impact on the final data topology, that strategic ETL may not even require machine-learning techniques.  Regardless, I say:  take it away, machines – I’ve been there, and done that.
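A sketch of the declare-the-target idea, with invented table and column names: given a flat extract and a declared dimension key, the dimension and fact tables fall out mechanically.

```python
# Flat input rows, as they might arrive from an operational extract.
flat = [
    {"order_id": 1, "customer": "Acme", "city": "Omaha", "amount": 250.0},
    {"order_id": 2, "customer": "Bell", "city": "Tulsa", "amount": 120.0},
    {"order_id": 3, "customer": "Acme", "city": "Omaha", "amount": 75.0},
]

# Declared target shape: a customer dimension on (customer, city),
# and a fact table carrying order_id and amount.
dim_key = ("customer", "city")

# Mechanical derivation: dedupe the dimension, assign surrogate keys,
# then rewrite each fact row against those keys.
dimension, fact = {}, []
for row in flat:
    natural = tuple(row[c] for c in dim_key)
    key = dimension.setdefault(natural, len(dimension) + 1)
    fact.append({"customer_key": key,
                 "order_id": row["order_id"],
                 "amount": row["amount"]})

print(dimension)  # {('Acme', 'Omaha'): 1, ('Bell', 'Tulsa'): 2}
print(fact)
```

Everything above follows from the declared shape; a person should only be consulted when a rule is genuinely ambiguous.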

And as for predictive analytics?  Those bored quantitative analysts are experiencing tedium for a reason.  Each model is different, but the tools of the trade, including fitting algorithms, exceptional data detection, and problem resolution – are often invariant.   Granted, that’s different than a concrete procedure, and in some cases – predictor development and optimization constraint definition being good examples – automation may offer guidance rather than a full solution.

That said, predictive analytics processes offer features that commend themselves to automation:   the number of strategies is usually limited, there is a clear target, and it’s possible to make incremental progress with only part of a solution.  Those are features that align well with machine-learning protocols such as genetic algorithms (which in principle are also highly scalable).  Machines might not be creative or smart, but there is a great deal to be said in such cases for trying many options, combining the pieces that seem to work well, and gradually working towards better answers – precisely what (for example) a genetic algorithm would do.  Our human alternative – being creative and clever in our choice of approach and then trying a handful of things – is noble, but slow and potentially biased.
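As a sketch of why the fit is good, here is a toy genetic algorithm selecting a subset of hypothetical predictors; the fitness function is invented (a real one would score candidates by model quality on held-out data):

```python
import random

random.seed(7)

# Toy problem: choose a subset of 12 hypothetical predictors.
# Each predictor's standalone value is invented; a penalty on large
# subsets stands in for redundancy, so "use everything" is not optimal.
N = 12
value = [random.uniform(0, 1) for _ in range(N)]

def fitness(mask):
    gain = sum(v for v, used in zip(value, mask) if used)
    return gain - 0.35 * max(0, sum(mask) - 4)   # penalize big subsets

def mutate(mask, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in mask]

def crossover(a, b):
    cut = random.randrange(1, N)
    return a[:cut] + b[cut:]

# Evolve: keep the better half, recombine and mutate to refill.
pop = [[random.randint(0, 1) for _ in range(N)] for _ in range(30)]
for _ in range(40):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:15]
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(15)]
    pop = parents + children

best = max(pop, key=fitness)
print("best subset:", [i for i, used in enumerate(best) if used])
```

No creativity, no cleverness – just many options tried, good pieces combined, and gradual progress toward better answers.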

Many quantitative analysts have taken their own steps down this path, perhaps scripting an onerous task, or parameterizing and then optimizing an algorithm.   When I’ve used such approaches in the past, I found it straightforward to set the problem, but harder to find enough computing power to reach convergence in a reasonable time.  That’s a consideration, but far less of a problem now.

Of course, none of these tasks stand alone.  Data uncertainty, rules, and representations go a long way to determine if analytics is simple or difficult, or even feasible.  That’s true whether we craft solutions by hand, or with the aid of automation. Either way, iterative improvement will continue to be the order of the day, and automation is a help there too, by lowering the burden of trying nearly the same thing multiple times.

The real issue may no longer be whether a high degree of automation is within reach for data systems and their associated analytics outcomes – that is a process which is well under way.  My view is that aided by self-service analytics, cloud computing, machine learning, and social media, we’ll soon work in a world where much, if not most, of what we currently do as artisans will be handled under the hood – even the higher-concept tasks of quantitative analytics.

The critical issue may be: should we actively support this transition, assuming it’s on an economically-driven track that will not easily be derailed?

I say yes, even in the face of the argument that we’ll experience some spectacular train wrecks when an entire class of newly-enabled users works with powerful but dangerous tools.   That’s fine. Train wrecks are part of the IT drill, and these new train wrecks will have the merit of being immediately relevant, and therefore eminently detectable.

I say yes, because the best use of an experienced analyst is not to write essentially the same code that many others have written before, in nearly the same situation. It is rather to assist with what is truly unique to each problem, whether that be formulating questions, interpreting answers, extracting better predictors, improving data, or managing exceptional conditions as they arise.

I say yes, because where humans can falter most easily – in necessary but highly tedious tasks – is precisely where machines excel.

With a shortage of good information, analysts have become too valuable to be engaged in detailed artisanship, and our best route to enhanced relevance and productivity is to reduce the role of that artisanship.

We’ll see how far down the road of automation we go, but should we support and encourage this transition? I say yes.  As automation becomes available, our role in the process of supplying actionable information will not be reduced to irrelevance, but rather transformed to greater relevance.   Our discussions about human relevance in the face of automation echo Childhood’s End as well as the automation transitions of history.   However,  the age-old problem of answering questions with data really is sophisticated enough that automation will only assist us, rather than supplant us.


Tool Time

As the saying goes, only poor artisans criticize their tools – or, what amounts to the same thing, better artisans don’t criticize their tools.

And what does that make an artisan who is focused primarily on a particular tool or technique?  Perhaps a person more likely to imprecisely answer whatever question we’re not exactly asking.

Morphing problems to meet the requirements of tools is a time-honored practice, and an equally time-honored headache for project stakeholders, who can sometimes feel like the family who wanted their vacation home repainted.

A local contractor was engaged before the season and asked to spruce things up. He immediately set to work, and reported that all would be in readiness when the family appeared for their annual holiday.

And ready it was. As parents and impressionable children arrived for their holiday, they were greeted with the vision of a dichromatic acid trip, their home now resplendent in a basecoat of Federal Safety Orange, overlaid from top to bottom with informational and inspirational sayings in chartreuse – the front door being adorned with “Be Ye Separate” on the left,  “There is No Queue Like FIFO” on the right, and of course, “Welcome” on the top.

When the contractor was called and informed that “repainting” entailed limited creative license, he was unmoved.  “You’re missing the point.   The result is consistent with your request; the craftsmanship is top-notch; and only the most sophisticated tools, techniques, and creativity could have converted your living space into a space of living art.”

To this day, part of me stands with that contractor and says You tell it, brother.  But when our tools start looking for problems to solve, or morphing the problems they are supposed to solve, watch out.

Especially when we build up expertise in a particular tool, it’s tempting to use that tool for any problem that comes to hand, whether it’s optimal or not. It’s comfortable, and fast, and seemingly enables our creativity – we can think about the problem at hand, rather than think about the mechanics of tool-based manipulations.

It’s also a deception, with the hidden cost of forcing problems to conform to the confines of what our favorite tool does well.  I’ve seen data-transform scripts running to thousands of lines when a better tool would do the same thing in fifty lines; I’ve seen complex visualizations deployed when a simple exploratory analysis would be more transparent; I’ve seen any number of non-linear quantitative models applied to the mystification of all, when a simple linear model would have done as well, if not better.

Learning multiple tools well is the best antidote for this trap. Even by itself it’s a great problem-solving technique, not because we suddenly become all-around experts, but because we gain perspective beyond one mode of problem-solving expression.

Learning multiple tools is a great deal like learning multiple languages, and with many of the same benefits – there is no better way to know our home language than to learn another language, and there is no better way to learn new modes of expression than to learn from other cultures.   After the first new tool, the next one will be a lot easier – also like languages.

And you really never do know when that seemingly obscure tool or language will be just the right thing.  A couple of times a year I still write and run an awk script (like perl, awk is famously “write-only,” as no one can actually read that crap).
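For the curious, here is the kind of small data transform where awk still earns its keep – a hypothetical per-key sum over whitespace-delimited records, done in one line rather than a much longer general-purpose script (the data is made up for illustration):

```shell
# Sum the second column by the first-column key.
# awk's associative arrays and END block make this a one-liner;
# sort just fixes the output order, which awk doesn't guarantee.
printf 'apples 3\npears 2\napples 4\n' |
  awk '{ totals[$1] += $2 } END { for (k in totals) print k, totals[k] }' |
  sort
# prints:
#   apples 7
#   pears 2
```

Write-only or not, a sketch like this is a fair trade for five minutes of remembering the syntax.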

Personally, I’ve found that the more tools I have in hand – even at a modest level of skill – the more likely I am to identify simple and transparent solutions, as I’m more likely to bring the right tool to the problem, rather than engage in the distorting act of moving my problem to the tool I know best.

Simplicity: Taken For Granted?

Engineers and scientists will testify with near-unanimity: it is surprisingly simple to create a complex solution, and surprisingly complex to render a simple solution.

Particularly as our systems grow larger, complexity tends to happen all on its own, without any help from us.  Still, we do sometimes give complexity help it surely doesn’t need.  We expand our systems with data of dubious value, to the point they are difficult to maintain; we twist problems to meet the tools we know; we ask algorithms to do jobs that could be better handled by an improved problem formulation or data representation.

Complex constructions all too quickly yield systems in which users, and even developers, lose contact with the questions and outcomes.

The cost of complexity is simple:  opaque and complex systems may pass muster when outcomes are expected, or unexamined, or simply ignored. However, when people do not understand an unexpected or interesting result, they will not accept the outcome.   When we need outcomes the most, complexity defeats their useful application.

It’s simple enough to complain about complexity.  But more to the issue, we often take simplicity for granted. Simple solutions are often very unassuming, naturally connecting inquiry to answers.  And because they can be unassuming, we find ourselves assuming that what is simple now was also simple throughout its development, rather than the product of iteratively refining questions, models, and transforms, and continually removing what is unessential and complicating.

Perhaps ironically, tools now falling under the umbrella of “data science” are some of the most potent tools available for maintaining simplicity, offering the opportunity to simplify data models and representations, and to set aside data with little probative value.  These tools are best applied at design time and used throughout development, for simplicity is much easier to sustain than to retrofit onto a large, complex, and essentially immobile system.

How often do we add in low-value information to our systems, making them large and inflexible, motivated by our worry that when our system becomes large and inflexible we’ll not be able to make a change later?   I understand the reasoning, but a smaller and more manageable system should never have this issue – new information with genuine value can always be added later.  That’s as it should be:  asking good questions is almost always an iterative process, rarely gotten entirely right the first time through.

I often see over-complex systems, too big to change, long after change is precisely what’s needed for better adoption or improved inquiry.  However, I don’t believe I’ve ever seen a business intelligence or analysis system that truly had to be that way.   Simplicity isn’t impossible to achieve, but it is hard work.

Those crafting simple and transparent solutions deserve our appreciation for what truly is a complex – but very worthwhile – task.

Particularly as we contemplate larger and larger data systems, simplicity is a worthy design and development goal.  It is simplicity – not complexity – that brings solutions to our most challenging problems, where questions, answers, data, and metrics will evolve to their final form.   Simplicity offers us comprehension; comprehension brings challenges to our early outcomes; those challenges bring improvement; from improvement we reach adoption; and with adoption we can assure impact.    And all of that change and iteration is only possible with solutions that begin – and stay – as simple as possible.


Two Out of Three

“A writer needs three things, experience, observation, and imagination, any two of which, at times any one of which, can supply the lack of the others.” ― William Faulkner

Right on, and not only writers.   Observation and imagination are underrated in information work. A writer’s imagination and observational skills can be perfect raw material for an analyst, while focused technical experience may not help with understanding another person’s questions, or interpreting their analytics outcomes.

Observation, imagination, and experience all matter in analysis, but the origin of those skills is increasingly irrelevant, as tools become mainstream and simpler to use.

The perspective of creative disciplines may be closer to real-world questions than the perspective of technical professions in many areas, including human resources, reporting, social listening, and politics.

If we consider analysis the province of a purely technical community, and focus largely on technical aspects, I believe we’re missing out where it really matters: developing the right questions and really understanding the resulting answers.  The combination of technical and creative modes of thought, working together to understand questions and interpret answers, is something that we can and should welcome.

When More Really Is Better

Analytics might be defined as people asking questions and deriving answers using data. Even as our computational and data capability has transformed in the last 25 years, that definition needs little alteration.

Analytics truly is an ancient and fundamentally human activity, now amplified and augmented by modern computing capabilities.   The essential algorithms and processes used in analytics have changed far less than our capability to execute them.   If you look at textbooks from the early 1990s, most elements of current analytics thinking are there.

And now, almost anyone with an interest and internet access can apply analytics tools, processing, and thinking to their daily activities.   What was once the province of a relative few with access to arcane and difficult-to-use tools is now widely, and almost freely, available.

Is that good? Absolutely.  The more, the merrier.   If I could, I’d invite anyone with an interest who works with data to Columbus for a one-week short course in exploratory analysis.   That short course wouldn’t make people expert at deriving any analytics answer, but it would make people aware of the questions analytics addresses, and some of the thought process in addressing those questions.  And that’s a start.

A great contribution from analytic thinking is that it makes for better discussion and problem-solving all around.  Good analytics works in the realm of verifiable facts – the invariable basis for informed discussion. The alternative is to ignore analytics tools and processes, resulting in a continuation of the trivial arguments which pass for much of discussion today.  Can people hurt themselves using complex tools when they are just starting?  Sure – trust me, I’ve been there.  But that’s OK – mistakes in analytics are part of analytics, and working within the process is ever so much better than working outside of it.

Analytics also improves dialogue through its fundamental recognition of limits, and its sometimes irritating dismissal of absolute truths. The first duty of analytics is frequently to identify the limits of analytics outcomes themselves – while we may not discuss that often,  it’s fundamental to analytics nonetheless.  Analytics tells us that we don’t really know how good our knowledge is until we break it – every model, every theory, every data set, every process has its limits. Finding those limits is frequently the topic of good and creative analytics work.

Beyond better dialogue, why is it desirable for more people to apply more analytics more-or-less all of the time?  Because analytics results depend on context –  biases, uncertainties, nuances of question, interpretation of answers, implicit metadata, and the entire universe of subject-matter knowledge – and those supplying that context, rather than data experts, are the ideal people to apply analytics in the furtherance of knowledge and ideas.  Applying analytics without the nuances of problem context is like using a chain saw to trim a tree, based on only a rough idea of what a tree should look like.  Context and problem knowledge can, should, and do rule the problem-solving process – if you like, it’s data we can’t do without.

Then do analytics experts matter?  Of course.  They matter in the same way that experts in storage, in databases, in visualization, in application development, or in a score of other data-related disciplines matter – as experts helping people to understand and solve data-related problems.  But integration, context, and collaboration are the order of the day if we’re to move forward, and I’ve minimal patience for the idea that data science, or data scientists (or any other technical discipline), somehow stands apart from or even above the general flow of problem-solving progress.  80% of the analytics problems are solved by 20% of the techniques, and everyone everywhere should be encouraged to use those techniques whenever and wherever they can – in database design, in data assessment, in performance tuning – you name it.

There really is so much to accomplish, and analytics can help with accomplishing it – this time, more really is better.