Tinker, Tailor, Soldier, Spyware

Like one of John le Carré’s “moles,” anti-virus software is the perfect, obvious software platform to act as a double agent, protecting computers from malware while simultaneously transmitting interesting tidbits to some intelligence controller in parts unknown.

Security software has permission to do just about anything on our machines – to search, investigate, and neutralize computer software in order to provide adequate security protection.  But what is necessary is also obviously dangerous, especially if we don’t know the provenance of our security software – if, for example, it is made in Russia!

Of course, a United States security agency would never deploy Russian security software – like Kaspersky Lab’s products – to monitor US security files.  Even in an age of monumental bureaucracy and decision-making unfettered by the constraints of technical knowledge, this would be so ridiculous that we should relax in the sure knowledge this would never happen.

But it has.  We don’t know whether Kaspersky Lab is a witting, unwitting, or coerced partner in Russian intelligence gathering, but its anti-virus tools have been acting as an adjunct Russian agent against US intelligence.  For a risk so entirely obvious, however, these details are irrelevant.

The choice of Kaspersky Lab as a security vendor for US security agencies probably won’t bring about changes in technical decision making.  Some even more ridiculous event will be needed to wake us up to the fact that in technology decisions, it is technology that counts, and technical understanding that drives correct decisions.  Not sales pitches, nor vendor claims, nor vision, nor relationships, nor cost.  These things matter but are secondary.  Every day, experienced technologists see good organizations spend on technology they cannot possibly use, because the decision makers simply didn’t bother to understand the technology.  But when someone bothers to ask, the expertise needed to make a good decision is usually available.

When it comes to technology decisions, understanding the technology comes first, and trumps whatever is in a distant second place.  If that sounds obvious, I agree that it is, but it’s also very far from standard practice.  Just ask the NSA.

A Healthier Way To Buy Health Insurance

For small business owners and independent professionals, the fight to obtain economical and effective health insurance is a slow-motion train wreck whose collateral damage seems to increase with each passing year.

It’s not surprising health insurance is expensive for small groups, while its value is often limited.  We’re told that the health insurance market is mysteriously different from other insurance markets, but for small business owners two realities prevail.

First, spending money on health insurance does not guarantee employee health, or even healthcare.  For health insurance does not guarantee coverage; coverage does not guarantee care; care does not guarantee good care; and none of this guarantees personal health.  How monumentally frustrating it is, that for a per-employee monthly bill often exceeding a house payment, there is only one true certainty: that of economic warfare with our insurance companies, should we or one of our employees become injured or ill.

Second, small groups – particularly groups of one or two – have minuscule leverage in the health insurance market.  As a small group, our risk can be evaluated and avoided piecemeal.  Small groups pay premiums for what our insurance companies – not our doctors or medical science – perceive as potential future risk.

This is a scenario small business owners regularly face, but one that is also antithetical to the workings of a well-functioning insurance market.  Any healthy insurance market pools risk, by assembling a large and diverse group to share the costs impacting a small fraction of that group.  We pool risk for our homes and vehicles – even when we buy as individuals of varying age and experience, we are still considered part of a larger risk pool.  But for health insurance, we’re treated as individuals whenever possible.

It doesn’t have to be this way.   While I do not often find Trump administration proposals appealing or sensible, the proposal to allow associations of all kinds, nationwide, to purchase health insurance is just that – appealing, and sensible.  Particularly to individuals and small business owners for whom health insurance is a major cost.

Vested interests are already banding together in opposition, arguing that new groups might consist only of healthy people, thereby raising costs for everyone else.  While I favor sensible regulations and rules for associations buying insurance, it is difficult to see why an association of small businesses, or CPAs, or union members, or plumbers, or chemists, or physicists is inherently different from a group of people who happen to be working together in the same large company.  On the other hand, it is easy to see why additional competition and risk pooling can benefit those of us without much leverage in the current health insurance market.

The proposal seems a simple and sensible way to increase competition and choice, and nudge a distorted health insurance market towards the risk pooling that makes other insurance work.


As a longtime small business owner and consultant, I’ve developed an exquisite sensitivity to the value of people’s time – my time, as well as the time of clients and coworkers.

However, owning and respecting time is not the same as being busy.   Anyone can be busy – owning our time is harder.

Over the years, I’ve noticed that my coworkers who truly own and control their own time are precisely the same people who respect the time of others.   I appreciate the planning and consideration that implies, and I know these are the people I want on my team.   They are “timely” in the best sense – they respect their own time and help their team members own their time in turn.  And inevitably more gets done, more easily.

For the merely busy, there is always an excuse, always a plausible reason why another person or the system is to blame for wasted time: for being late for a meeting, for a scheduling conflict, for missing meetings without apology, for being unprepared at meetings, for not getting things done on time, and – most of all – for failing to communicate a problem, make amends, and take ownership.   A person who claims to be busy is actually signaling that instead of making things happen, they are allowing – or even encouraging – things to happen to them.

It’s amazing how I can behave if my ultimate excuse is that I’m busy.  You schedule a meeting, but I’m 15 minutes late.  You ask for a meeting, and I agree but don’t follow up.  I am on time for a meeting but don’t know what the meeting is about.  I miss a meeting and after the meeting is over send you a text telling you “I can’t make the meeting.”  I text you four minutes before a meeting telling you I won’t be there.

Your response to these acts of inconsideration is likely to be this person is wasting my time, and you’re right, with my sincere apologies.  And in your place, my response would be “this is not a person I can rely on, and the most crucial, most important, and most valuable attribute for my team members is that I can rely on them.  With consideration a close second.”

By suddenly cancelling a meeting, or being unprepared for a meeting, I’ve saved 30 minutes of my time, but I’ve wasted at least an hour, and perhaps much more than an hour, of my colleagues’ time.  And I’ve signaled that, when push comes to shove, my problems are more important than yours.   Why would you want me on your team?  And with the tables turned, why would I want someone who can’t control their time on my team?

Owning our time is an acquired skill.  I for one am still in acquisition – after 20 years in business, I still have days that are fully out of control before the business day has officially started.  But it requires no real skill, only consideration, to make amends when our best efforts at time management fail.

Being busy has become a form of subtle bragging, but it really sends a message of unreliability, with lack of consideration waiting in the wings.     On the other hand, being timely – owning our own time and assisting our team members in owning theirs – might be one of the best compliments a working person can receive.

Answers Without Questions, Part 1

Hi.  Imagine that we’ve been asked to find a best-fit line for these data, which are the outcome of a stochastic simulation:

Sequence (X) Value (Y)
1 4.586792
2 31.230494
3 32.805177
4 40.057982
5 40.389495
6 51.093562
7 51.651214
8 54.419457
9 63.673709
10 63.891718
11 74.354809
12 87.494790
13 94.858425

Whether we whip out a sheet of graph paper (!), use a spreadsheet, or spin up a statistics package, our analysis will give us a plot looking something like this:


(I used R for simulation and plotting – scripts appear at the end of this post.)

Things look pretty good. The first point is well below the line, but we can hypothesize that this is an initialization artifact.   It would be nice to learn more about the error in our simulation, but these data show a trend that is approximately linear.
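The fit itself is easy to check by hand.  The scripts in this post use R, but for anyone who wants to verify the line independently, here is an equivalent closed-form least-squares sketch in plain Python, using the same 13 values from the table:

```python
# The 13 simulated values from the table above.
y = [4.586792, 31.230494, 32.805177, 40.057982, 40.389495, 51.093562,
     51.651214, 54.419457, 63.673709, 63.891718, 74.354809, 87.494790,
     94.858425]
x = list(range(1, len(y) + 1))

n = len(y)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares in closed form: slope = Sxy / Sxx.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

print(f"y = {intercept:.2f} + {slope:.2f} * x")   # roughly y = 10.40 + 6.10 * x
```

The result agrees with R’s fit, which is the point: any tool will happily draw a convincing line through these points.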

Great. So, what is the question this simulation addresses?  It is: using the R pseudo-random function runif() and a seed of 1234, what is the first sequence of 13 continually-increasing numbers?  The answer is the data set above, and it takes almost 20 billion tries to arrive at this sequence.  It’s nothing more than a model answer to a model-based question, without a corresponding real-world question. Bluntly put: who cares?

However, I would be willing to bet that if you thought about this little line-fitting problem, one of the last things that went through your head was It’s a set-up and the entire problem is bogus!   

Sorry about that…. The point is this: We do not normally start our analytics process by asking whether our model-based question has any connection, or even some connection, to a real-world question.  We like to cooperate and be helpful, rather than instantly plague our coworkers and stakeholders with questions about how our model question relates to a real-world analogue (if any).

But perhaps we should.  For every problem involving data also involves a model – we cannot apply an analytical process otherwise.  The core reality of all models is that they are approximations of reality – models are abstractions.    As the answers we craft come from our model, so do the questions we are answering.  In analytical processes, we always address model questions, not real-world questions.

There are situations in which a model question is close to a real-world question, but they are never quite the same. Starting in this post and continuing in the next, I’ll look at examples – in everyday life and in analytics problems – where that difference is ignored, but definitely matters.

Model and real-world questions can be very different, even when they look and feel almost identical – questions are often slippery and ambiguous.   Is the real-world question “Is Sue smarter than Abby?” the same as the model question “Does Sue have a higher IQ test score than Abby?”  You might laugh and say “of course not,” but many would use the latter question as a substitute for the former, without having any idea what an IQ score really measures. Once we have a number, we can formally manipulate it, and that fact alone makes modeling questions and answers very attractive. But with no clear connection to real-world questions, those manipulations can be without true purpose.

I think most of us have been there. Because questions are ambiguous and readily morphed in our minds, we often transpose model questions into real-world questions, often without realizing this has happened, but with consequences that range from minor to very serious.   However, if we don’t fully understand what question we’re answering, we really don’t know what problem we’re solving. Getting to the “right” problem may not be easy, but the first step in arriving at the right problem is knowing where we are now.

Some argue, and I concur, that analytics is overly concerned with detailed mechanics, algorithms, and computations, thereby too often chasing “wrong” problems down conceptual rabbit holes.  But the problem with “wrong problems” really starts when we don’t understand how our model questions relate to the original questions we purported and hoped to address.

This can present a real challenge.  We’re trying to deal with an imperfectly-defined and complex concept – questions – without an image to guide us. That thought – we really need a picture – struck me with force not long ago, and since then I’ve been drawing small diagrams showing how model questions map to actual questions.  

I’ve not seen this done quite as I suggest below, but if you have a similar system that works for you, go for it.  My pitch is not necessarily to create question maps my way, but to create a map, and make sure every stakeholder, everywhere, knows about it.  For every stakeholder, everywhere, should know what it is we’re actually answering (or not answering), and a good question map can address that.

A question map[1] is just a table.  It has two columns, and at least three rows, and traces the lineage of our questions from model question back to real-world question, along with the important features of each corresponding answer along the way. In outline, it looks like this:

Question | Answer features
Model question as conventionally stated | Data outcome as it is usually understood
Translated model question – a precise statement of how the data is being queried or summarized | Data outcome including assumptions, uncertainties, errors, missing elements
Translated model question – it’s sometimes handy to “translate” in more than one step | Data outcome including assumptions, uncertainties, errors, missing elements
“Actual” real-world question as conventionally stated | Actual outcome and its relationship to the data outcomes

By “conventionally stated,” I mean the question as people actually ask it.  We want to start and end with what people really say, rather than a technical description of what they might mean. The translated questions in the middle are usually detailed and technical, so we can understand the connection between a general question and its technical analogue.

I’ve been surprised when applying this to examples from ordinary life as well as analytics, to see that just writing down the sequence of questions and answers tells a story of how we move from a question we believe we’re addressing, to a separate and sometimes very different question that we are truly addressing.

I’m going to offer more examples in the next post, but let’s check out one example now. You’ll remember, hopefully without irritation, that bogus best-fit relationship.  Here is a question map for that situation:

Outcome Q | What is the relationship between the points? | Create a best-fit line
Translated Q | What is the best-fit line for the points? | Define the best-fit line with line() in R
Translated Q | What is the best-fit line for the points? | The points are randomly generated; define the best-fit line with line() in R
Actual Q | <None> | In context, there is no meaningful relationship

The progression of questions from top-to-bottom starts with a model question asking about the relationship between the points. But our model only considers 13 points, and not the context that generated them.  This may seem a little silly now, but it emulates what we do with data models every day – data models can be very limited abstractions of complex concepts (like an IQ test score as a data model for intelligence).

The first translated question (line 2) just converts “relationship” to “best-fit line,” with a prescription in R for solving that problem.

The second translated question (line 3) is the same question, but now we’ve added in a new element to our answer features – our recognition that the way the points were generated impacts the problem.

The “Actual,” real-world question (line 4) is <None>, for unless we are interested in peculiar artifacts of pseudo-random-number generators, there is nothing practical to be asked.   The trend – the relationship – indicates exactly nothing (and the next value in the random sequence would have confirmed that).

It isn’t unusual to write down a question map and find that we can’t state a definite real-world question aligned to the analytics or computational number we’ve produced. I’ll show another example next time, but when this happens it means we don’t know the problem we’re addressing. And speaking of problems, that situation is certainly a problem.

Next time, I’ll focus on some practical examples and their question maps, and how maps can show when question differences matter, as well as what we can do about that. There are times that we should update our model, but sometimes the solution is to replace an overly-ambitious real-world question with our model question instead.

[1]There are other diagrams, including concept maps, that are also called “question maps.”  Here, a question map is a table that shows the lineage between a model question and an actual real-world question, each row holding one question and its corresponding answer features.

R scripts for the simulation and plots.

# Note:  the data.table package is required
library(data.table);
rnd_blk_ct <- 1e08L;
rnd_min <- 0L;
rnd_max <- 100L;

#points generated by the simulation.
#13: seed = 1234, 1.98e10 attempts. 

#     x        id
# 1:  4.586792 60283905
# 2: 31.230494 60283906
# 3: 32.805177 60283907
# 4: 40.057982 60283908
# 5: 40.389495 60283909
# 6: 51.093562 60283910
# 7: 51.651214 60283911
# 8: 54.419457 60283912
# 9: 63.673709 60283913
# 10: 63.891718 60283914
# 11: 74.354809 60283915
# 12: 87.494790 60283916
# 13: 94.858425 60283917

# to plot these points:
# q <- c(4.586792,
# 31.230494,
# 32.805177,
# 40.057982,
# 40.389495,
# 51.093562,
# 51.651214,
# 54.419457,
# 63.673709,
# 63.891718,
# 74.354809,
# 87.494790,
# 94.858425);

# plot(q, pch=19, col="blue", main="A regression looking for a question....", xlab="", ylab="Value");
# grid(13,10, col="darkgrey");
# abline( line( 1:13, q)$coefficients, lwd=2, col="lightblue");
# points( 1:13, q, pch=19, col="blue" );

# Simulation: 
# Identify a continuous sequence of increasing numbers from the  pseudo-random
# distribution runif()

set.seed(1234);   # seed that produced the 13-point sequence above

seq_ct <- 13;
foundit <- FALSE;
blk_id <- 0;

while(!foundit & (blk_id <- blk_id + 1)) {

  rnd_x <- runif(rnd_blk_ct, rnd_min, rnd_max);
  rnd_dpos <- c((diff(rnd_x) > 0L), 0L);
  id_dpos_zero <- which(rnd_dpos==0L, useNames = FALSE); #which() output is sorted.

  #ID when there is a sequential set longer than seq_ct having dpos_zero != 0.
  #This is what we want.

  id_rnd_start <- id_dpos_zero[which(diff(id_dpos_zero) >= seq_ct)] + 1;

  if( foundit <- (length(id_rnd_start) > 0L)) {
    id_rnd <- unlist(lapply(id_rnd_start, function(i) {seq.int(i, i+seq_ct-1)}));
    is_ord <- 1:rnd_blk_ct %in% id_rnd;
    r <- data.table(seq_id = rep(1:length(id_rnd_start), each=seq_ct),
                    id = rep(1:seq_ct, times=length(id_rnd_start)),
                    x = rnd_x[is_ord]);
  }

  print(paste0("block, total attempts = (", blk_id, ", ", blk_id*rnd_blk_ct, ")"));
}

Life of PII

A coworker wrote to me yesterday, wondering about the process for enabling a credit freeze with Equifax. He pointed out that those requesting a credit freeze must enter their date of birth and social security number online, and Equifax has demonstrated that it cannot keep this information secure!

If the notion of sending your personally identifying information (PII) to a place like Equifax makes you uneasy, you’re right to be nervous. Unfortunately, the problem is more serious than a handful of vendors who verifiably cannot protect our identifying information.

There are several misunderstandings about PII that, until they are addressed, assure that each of us is at risk for identity theft, or worse.

The first misunderstanding is that PII can definitely be protected.  Organizations often brag about their data security, and on more than one occasion some of those same organizations have sent me, via unsecured email, data sets simply oozing with personal information. Presumably, they felt I could be trusted to protect what I received.   But as for everyone in between who might have read those emails, who can say?

It’s a start, as security experts recommend, to store PII only in an encrypted format that has no outside use.   And that (almost) assures that no one can just read our social security numbers from a database once they break in.  But as my coworker pointed out, there is still risk:  for an external person to confirm their identity, they must submit PII, and for at least some period of time our PII is unencrypted, and vulnerable.
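To make the “no outside use” idea concrete, here is a minimal Python sketch of one common approach: a keyed one-way hash (HMAC-SHA256).  The key value and the sample number are purely illustrative – a real system would load the key from a secrets store, never from source code:

```python
import hmac
import hashlib

# Hypothetical key; in practice this comes from a vault, never from code.
SECRET_KEY = b"example-key-loaded-from-a-vault"

def protect_ssn(ssn: str) -> str:
    """Keyed one-way hash: usable for matching records inside our system,
    but useless to anyone who steals the database without the key."""
    return hmac.new(SECRET_KEY, ssn.encode("utf-8"), hashlib.sha256).hexdigest()

# "078-05-1120" is the famous decommissioned sample SSN, not a real one.
token = protect_ssn("078-05-1120")
print(token[:16], "...")   # a stable token, nothing human-readable
```

Deterministic hashing lets us join and deduplicate on the token; but note the window my coworker identified remains – the raw value exists in memory, however briefly, before hashing.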

As for other data security: this is only as secure as the least secure access to that data, and we should admit that this means not very secure.   One slip – like the well-intentioned people who sent me PII in an email – and the game is up.

I’ve learned from security pros, and my own experience: the First Law of Security is that nothing is truly secure.  When thinking about security, we should never start a sentence with “An attacker could never…,”  because they almost certainly will, if it’s worth their trouble.


The next misunderstanding is that PII is well-defined – that if we encrypt unique identifiers (like social security numbers) so no outsider can use them, we’ll be in pretty good shape.

Regrettably, this is not the case.  Personal attributes like age, gender, zip code, and income bracket, taken in combination, may serve almost as well as a social security number. We cannot usually encrypt these attributes, as they’re useful for presentation and analysis.  If a bad actor finds information suggesting that we’re worth the trouble of an attack, a combination of human-readable personal attributes may very well be “good enough” – for being in a very small group is little different from being uniquely identified.  Consider what is now available in online public records, and remember the First Law…

If in a data set of personal information, there is any combination of unencrypted attributes that generates a small group of records, that’s effectively PII – it means individuals are at risk, for we cannot know what information outside of our system might be used to resolve our identity completely.
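That test – does some combination of unencrypted attributes isolate a small group? – is mechanical enough to automate.  A minimal Python sketch, using made-up records and a group-size threshold that is purely a policy assumption:

```python
from collections import Counter

# Hypothetical, made-up records - no real PII here.
records = [
    {"age_band": "30-39", "gender": "F", "zip3": "902", "income": "high"},
    {"age_band": "30-39", "gender": "F", "zip3": "902", "income": "high"},
    {"age_band": "30-39", "gender": "M", "zip3": "902", "income": "high"},
    {"age_band": "40-49", "gender": "F", "zip3": "871", "income": "mid"},
]

quasi_ids = ("age_band", "gender", "zip3", "income")

def smallest_group(rows, keys):
    """Size of the smallest group formed by a combination of attributes.
    If this is small, the combination is effectively PII."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return min(counts.values())

k = smallest_group(records, quasi_ids)
if k < 5:   # the threshold is a policy choice, not a law of nature
    print(f"warning: smallest group has {k} record(s) - treat these attributes as PII")
```

In a real data set we would run this over every plausible combination of attributes, not just one; the point is that “effective PII” is something we can measure, not just worry about.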


Perhaps the most crucial misunderstanding about PII is our presumption that PII is useful for anything other than confirming identity – i.e. authentication.  When it comes to analytics and business intelligence, PII should really stand for “Probably Is Irrelevant.”

What analyst needs to know someone’s date of birth?  There are many things that correlate with age, but few that correlate with whether we’re a Capricorn or a Sagittarius.   And social security numbers?  For anything this attribute can tell us, there are better ways to get it.

Authentication using PII is a process that can be made nearly secure, using encrypted information wherever possible.    But once human-readable attributes – either singly or in combination – can come close to identifying us, we should know there is a security problem waiting to happen.   Translation: liability!  No organization has yet, to my knowledge, been forced into bankruptcy by liability from a PII breach, but that time may not be far off.  If we take the First Law seriously, this corollary also applies: for planning purposes, all potential data breaches should be regarded as actual breaches.

In information delivery and modeling, we are usually oblivious to the potential cost and risk associated with our input data, but these are considerations that should become part of our world view.   The inadvertent delivery of “effective” PII can be trapped.   In modeling, we frequently use data like dates of birth that are far more precise than what is necessary to build a suitable model.   Even if the model results do not expose PII, the presence of input data sets in our organization holding  actual or effective PII presents a risk.  If data is present, someone will find it, and use it – potentially in a fashion that we won’t like.
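One simple trap of this kind is to coarsen over-precise attributes before they ever reach a model.  A Python sketch, under the assumption that a ten-year age band is precise enough for the model at hand:

```python
from datetime import date

def age_band(dob: date, today: date, width: int = 10) -> str:
    """Replace an exact date of birth with a coarse age band before modeling."""
    # Standard "has the birthday happened yet this year" age computation.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

# A 1987 date of birth enters the model as just "30-39" (as of late 2017).
print(age_band(date(1987, 6, 15), date(2017, 10, 1)))
```

The exact date of birth never needs to be stored alongside the modeling data at all; if the band turns out to be too coarse for predictive requirements, that tension is exactly the embedded optimization discussed below.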

The best way to protect PII is not to use it at all – to deliver models and visualizations that never use actual or effective PII in the first place.

Does this complicate modeling and information delivery?  Sure.  For quantitative models, it means an optimal model must meet its predictive requirements, while limiting the precision of the data it employs.  At a minimum this means an embedded optimization, and you know what’s coming next…. sometimes, we won’t be able to meet both the requirement and the constraints.   C’est la guerre.

About five years ago I predicted that within five years many of us analysts would find ourselves engaged in security-related work.   That prediction has not yet come to pass, but I still think it will, and before too long.  The best protection against breaches of valuable data is not to encrypt it, or to protect it, or to otherwise make it difficult for a hacker to get to it – history tells us that those methods ultimately fail.   These methods are valuable, and they do slow attackers, but ultimately they require a strategy of perfect defense, against attackers whose weapons are always improving.  At some point,  the defense will be scored upon.

The best way to protect against a data breach is to limit the use of data we don’t need, or even to insist that some sensitive data are off limits. Analytics practice, which has had rather little to say about limiting data usage in the past, should have a great deal to say about limiting data usage in the future.

Low Hanging Data

For decades, there has been a vigorous argument in the United States as to whether the national pastime is football or baseball.  I’ve never followed this very closely – for most of us, these activities are mere spectator sports.

On the other hand, a game that most citizens do play, and actively, is the sport of information cherry picking, which consists of gathering up numbers and facts in support of a preconceived idea, and ignoring any other numbers and facts that might stand in opposition to what it is we’re trying to prove.

Data cherry picking might be the most common form of argument – and it’s very impressive when someone ticks off facts or numbers in support of a position.  But really, it’s not valid – cherry picking uses only part of the information at our disposal.   I’ll be the first to grant that we all cherry-pick information at times, but the more important the discussion and the outcome, the more critical it is that we avoid this approach, which often inflames more than it informs.

So last week, when I saw a NY Times opinion piece announcing that Hurricane Harvey was “the storm that humans helped cause,” my response was that’s irresponsible.  The thesis of the article is that the surface temperatures in the Gulf of Mexico are warming, which contributes to hurricanes (true), and we humans have contributed to global warming (probably).

And when we’re done berating ourselves for our personal responsibility for Harvey, then what?  Well, perhaps we could look at the slightly larger picture, and gather facts beyond those supporting one argument.  For Harvey was by no means a historically intense storm, making landfall at category 4.  In addition, the United States had experienced a long period – over a decade – in which no category 4 or 5 hurricane had reached its shores, a fact that could be cherry-picked to argue against a global warming impact.

More crucially, the reason Harvey created such damage was that it moved slowly, essentially stalling after it made landfall.   The trajectories, speed, and strength of hurricanes depend not only on water temperature, but on atmospheric wind and moisture both near and far from the hurricane itself.   Atmospheric dynamics cannot be predicted even a week in advance, but are all-important in determining a storm’s wind damage, and in Harvey’s case, water damage.   It’s beyond the competence of climate science to know local weather conditions in detail, and without that it has little to say about the flood damage inflicted by a particular tropical system.

We can rationalize nearly anything.  However, the purpose of analytics should not be simply to rationalize an expected hypothesis, but to help us understand whether our hypothesis is really correct.

So, we might expect that the methods of formal data analysis would provide a more even-handed analysis, but that’s far from a given. Instead, my experience has been that experienced practitioners are actually more prone to cherry-picking than novices. Those believing they know the answer to a problem are more likely to find data supporting their expected answer.   That’s OK – as long as the selected data are fully representative.   It’s surprisingly easy to use data that are supportive of an expected conclusion, or convenient, or both, when building an analytics platform – I’ll call out some (very common) examples below.

Data cherry-picking means that we’re assuming the information we’re using is complete and accurate for the problem at hand.   But as in statistics, our first duty is really to set aside that assumption and understand the limits of what our information can actually tell us.

Sounds simple enough, right?  And really, how often do we have incomplete data systems?

Well, pretty often.  Not only that, some of the most crucial problems analytics now faces are problems involving incomplete and uncertain information.  When we deliver exact answers with that kind of information, we’re probably cherry-picking our results.

Let me give you some examples of data and operations that can create problems:

  • Time. Most of the systems we examine are dynamic.  However, the time stamps we have in our systems are reporting times, which are different from event times.  In economic and business systems that difference can be months or even years.   If the system dynamics are slow enough the impact will be small, but we need to prove that.


  • Money. More than one analyst has told me their monetary metrics are “rock solid,” but none of these people were ever accountants.   Or salespeople – we really haven’t lived until we’ve seen two sales groups partition the spoils of a shared sale.  It’s also easy for us to forget that while we assign costs and prices to products, people, and things, costs and prices are really properties of a buying or selling transaction.  And as such, they are often negotiable and variable.  The uncertainty in monetary numbers may be too small to matter, but we’ll need to prove that.


  • Counts. Go ahead, say it: that’s easy.   Now come over to my place, where every merger and acquisition presents a new data challenge.  For example, employees have different histories, using different systems, and occasionally the information in the system is insufficient to distinguish two different people.


  • Money and Time together. The value of money changes over time; its value at any moment can only be approximated, and the approximation becomes more uncertain the further into the future we look.  The uncertainty may be small, but if we’re looking more than a few years into the future, we’ll need to prove it.


  • Categories and taxonomies. A category can be mislabeled, but the real danger with categories is that they can be distorted.   Consider the tags and metadata associated with online material.  These are often designed to generate hits more than characterize content.  For complex entities that are categorized by hand, inconsistent assignments are common.


  • Cost and value. There is a much better chance that we have cost metrics in our system than value metrics, because costs are concrete and value is often difficult to measure.   Unfortunately, this doesn’t guarantee that value is irrelevant to a decision process.   Models based purely on costs can be one-sided and yield poor or irrelevant decisions.


  • Scores. Clever analysts often concoct score metrics as part of their design, but scores very often put me on the alert. Scores tend to distort reality, and be unassociated with a measurable real-world metric – in fact, that’s kind of the point.    The fun really starts when advanced analytics techniques are applied indiscriminately to scores.   To a clustering technique, a score of 90 and 96 may be close, but if these scores measure, for example, quality of a wine, a 96 may sell for two or three times the price of a 90.    When I see a score without a corresponding real-world metric, I’ve learned to flag it.   And if it’s the real-world metric that counts, why have a score in the first place?


  • “Facts.” Quickly! Global warming is/is not primarily caused by humans.  Each statement is purported to be a fact by its adherents, but neither assertion could survive the level of scrutiny experimental scientists apply to their data, which is the gold standard for establishing factual information.  If we consider the statements we think of as factual, how many of these are really more than received and unexamined information?   The problem with using many “facts” in analytics is that we often lack the context to compare two discordant statements, and we wind up selecting the “fact” most consistent with other statements we already accept.  That’s a pretty OK way to get through the day without going crazy, but for analytic purposes it’s really just another low-hanging bit of information.


  • Stochastic variables. It’s common in simulations to estimate the impact of external forces on the system by randomly-varying data.   That can be valid, but not all external forces subscribe to the stochastic model – in particular, when the time scale of external dynamics can match that of system dynamics, random external forces can be very misleading.  Exhibit A: economic forces.


  • Arithmetic. If our data are complete and precise it’s perfectly fine to perform an operation like A – B.     But uncertainty makes even this simple operation a risk-taking venture.  I don’t only mean statistical uncertainty.   A minus B can be a dubious operation if A and B are metrics relating to complex entities, e.g. yearly sales in a sales group. Oh sure, you can perform the operation, and get a number – but if the products, region, personnel, or leadership in the sales group has changed significantly, what does this figure really mean?
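To make the statistical side of that risk concrete, here’s a minimal sketch. The numbers are purely illustrative, and it assumes independent Gaussian uncertainties (so absolute uncertainties add in quadrature) – the point is that the relative error of a difference can dwarf the relative errors of its inputs:

```python
import math

def subtract_with_uncertainty(a, sigma_a, b, sigma_b):
    """Difference of two quantities with independent Gaussian uncertainties.

    Absolute uncertainties add in quadrature, so the *relative*
    uncertainty of the difference can be far larger than that of
    either input when a and b are close in value.
    """
    diff = a - b
    sigma = math.sqrt(sigma_a**2 + sigma_b**2)
    return diff, sigma

# Illustrative: two yearly sales figures, each known to about 5 percent.
a, sigma_a = 1000.0, 50.0
b, sigma_b = 950.0, 47.5
diff, sigma = subtract_with_uncertainty(a, sigma_a, b, sigma_b)
print(f"A - B = {diff:.0f} +/- {sigma:.0f}")
```

Here the uncertainty of the difference (about 69) is larger than the difference itself (50) – a 5 percent input error has become a >100 percent output error.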

Can we apply these data and operations in analysis work? Sure.  But with data that are uncertain and incomplete, certainty and completeness are elements to be proven, rather than assumed.   The first duty of analytics really is to establish the limits of analytical conclusions.

When I try out this list on my acquaintances, the median response is one of resignation more than surprise.  For we really do know at some level that these metrics are flawed, biased, or irrelevant.  On the other hand, we also tend to proceed with our analysis regardless, assuming our data are complete and accurate, and rationalizing our decision by telling ourselves that the data are the best we have, and our responsibility is to understand the data we have, rather than to start a war about the value and completeness of the data.

I’ve been there –  many times.   But as analysts our responsibility is not merely to manipulate data and indicate what it appears to mean. Our responsibility extends to helping people assess what data can realistically conclude, and what questions the data actually answer.   The trap, and there is a trap here, is to start from the premise that the data are, well, really pretty good, and then find ourselves having to backtrack when we wish to argue there are limits to the conclusions that can be drawn, and the questions that can be answered.  That can be very difficult.

It’s better to initially presume the data are not too great – 10 to 15 percent in error if we’re told the quality is “good” or better, and more otherwise (yes, I’m serious).
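A quick illustration of why that presumption bites: with a 10 percent error band, surprisingly large differences become indistinguishable. This is a deliberately crude, hypothetical sketch using interval overlap, not a proper statistical test:

```python
def distinguishable(a, b, rel_err):
    """Crude check: do the error bands of two measurements overlap?

    Each value is taken as value * (1 +/- rel_err); if the intervals
    overlap, the data alone cannot distinguish the two values.
    """
    lo_a, hi_a = a * (1 - rel_err), a * (1 + rel_err)
    lo_b, hi_b = b * (1 - rel_err), b * (1 + rel_err)
    return hi_a < lo_b or hi_b < lo_a

# Illustrative: a reported 8 percent improvement.
print(distinguishable(100.0, 108.0, 0.10))  # False: "good" 10% data can't see it
print(distinguishable(100.0, 108.0, 0.02))  # True: 2% data can
```

An 8 percent improvement sounds like a conclusion, until we notice the error bands swallow it whole.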

Hey! I might not be able to conclude anything, except for the most coarse-grained conclusions!

I know.  And that’s the point.  To conclude more, we need to first prove the data are good enough, and that means more work.  This approach – and I’ll grant it’s not very conventional – has one real advantage: we put ourselves in a position to always improve our results relative to our starting point. Experienced analysts will recognize that they can suggest which new conclusions are likely to emerge from improvements in particular data fields.

Some – but not all – of the problems I’ve called out are data quality and curation problems.  Some of these are intrinsic uncertainties, however, and some conclusions are intrinsically limited.  There is no obvious cure for the uncertainty of inflation.  There is no obvious cure for a cost metric without its corresponding value.  To offer specific answers in the light of these uncertainties is a pretense.

The assumption of complete and accurate data – “data cherry picking” – is definitely convenient.  But beyond convenience, the motivation for cherry picking is strongest when we start from the premise that our data can and should support an answer – especially, an answer we expect.   But as in statistical reasoning, that’s really a false premise.

The first duty of data analysis is to ascertain the limits of what our data can conclude, and what questions our data is addressing.  It is not to presume an answer exists and it’s merely our job to uncover that answer – something we often do without realizing we’re actually doing it.   Every time we project future value with exactitude, every time we estimate net value from cost because cost is what we have, every time we introduce a score, we’ve really jumped ahead and forgotten our first duty, which is one of assessment.

That “first duty” is far easier if we stop expecting our data to give us any answer at all, and instead expect to prove our answer is valid in the light of our original real-world problem.   Two questions encapsulate the duty of assessment:

  • What is the actual question we are answering with the aid of our data?
  • What is the competence of our data to answer that question?

There is understandable resistance to the idea of starting from a “null conclusion” basis – for one thing, it means accepting that the cost and effort of collecting and analyzing data may not tell us what we want to know.   That’s true, but that’s also real.   In addition, basic and seemingly trivial operations can become complex and, well, very irritating, when uncertainty analysis becomes part of the picture. It isn’t without reason that many good analysts think of uncertainty analysis as living at a dismal intersection of tedium, reduced impact, and differential calculus.

I grant that. However, the alternative is to cherry-pick data and then to over-conclude.  It’s a common ailment that could turn people off to the genuine merits of analytical and data-based reasoning. If you’re skeptical, ask President Hillary Clinton what she thinks.  Or perhaps we should start taking those five-day weather forecasts seriously?  I don’t think so. The five-day forecast is a kind of stock joke – we read it and more-or-less ignore it.   But not being president when you expected to be, or hiring the wrong person, or expecting income to increase when it might not – those are more serious matters.

If we’re to really leverage analytics in problems where uncertainty and partial information prevail – and that’s many if not most interesting problems – the days of “cherry picking” must come to an end.    This party is over.

Ironically, while tracking uncertainty may appear to be a time sink, it can actually be a major time saver, particularly if additional data are clearly required, or the desired answers are beyond the competence of any available information.  One strategy for managing incompleteness and uncertainty is as follows:

  • Let’s start by assuming our data are not perfect or complete. Conversations with stakeholders about what they might conclude are easier – if not easy – starting from the premise that conclusions may be limited, rather than confirmation of hoped-for outcomes.    The idea that a data system has limits is something to instill right from the outset.


  • Next, write out the questions the data system is actually answering, as exactly as possible, and map these questions to what stakeholders are asking. (There is often no mapping – that’s OK.)  I find this very helpful, and am still surprised at the differences between stakeholder perception of a question, and the actual question being addressed.    Both data questions and stakeholder questions should be something that can actually be measured and validated.


  • Next, identify situations in which data incompleteness or uncertainty does not impact conclusions. And cross them off, with pleasure. Aggregated answers are often (not always) less sensitive to uncertainty.


  • Of the remaining data problems that can impact conclusions, identify the subset of issues that can be repaired, either by improved data quality or other methods. A decision must be made as to whether the repair is worth the cost.


  • What’s left at this point? Uncertain/incomplete information with a range of outcomes. We’ve either decided that it’s not worth the trouble to improve the information, or determined that improvement is not feasible (e.g. inflation estimates). It’s still OK to proceed, but only by reporting the full range of outcomes consistent with the input information uncertainties.

It’s common to replace this step by a classic “cherry pick” – to simply plug in a reasonable set of input values and then calculate the outcomes.  That’s OK if we can certify that the input uncertainty doesn’t matter, but otherwise not.    However, there are still options to simplify the job.   For example, many data inputs will typically impact outcomes in about the same way, so dimensional reduction can (albeit approximately) limit the amount of calculation involved.
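One way to report the full range of outcomes, rather than plugging in a single cherry-picked set of inputs, is a Monte Carlo sweep over the input ranges. The model and the ranges below are hypothetical placeholders – the structure, not the numbers, is the point:

```python
import random

def net_value(cost, revenue):
    """Toy outcome model; stands in for the real system's calculation."""
    return revenue - cost

def outcome_range(cost_range, revenue_range, n=10_000, seed=42):
    """Monte Carlo sweep: sample inputs uniformly across their
    uncertainty ranges and report the span of resulting outcomes,
    instead of one point estimate from one set of plugged-in values."""
    rng = random.Random(seed)
    outcomes = [
        net_value(rng.uniform(*cost_range), rng.uniform(*revenue_range))
        for _ in range(n)
    ]
    return min(outcomes), max(outcomes)

# Illustrative ranges: cost known to within 10%, revenue to within 15%.
lo, hi = outcome_range((90, 110), (850, 1150))
print(f"net value ranges from {lo:.0f} to {hi:.0f}")
```

If the resulting span is too wide to support a decision, that is itself the honest conclusion – and it points directly at which inputs are worth improving.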

I’m planning to follow up with a few posts illustrating these points in more detail with sample problems.  If you’re like me you might be thinking it seems that few people take this kind of trouble – is it really necessary?   I understand the sentiment, and agree that a full assessment of analytical limits is not particularly common.  It can feel quite negative.  But it isn’t. It’s realistic, and sets the stage for improving the information we use when that is feasible.  And I have seen it done, by some of the best analysts I know.  Perhaps ironically, less-certain outcomes wind up being perceived as more valuable, as they are also more reliable.

The Teardown Of History

Abraham Lincoln wrote that “We cannot escape history,” but those tearing down statues of Confederate Civil War soldiers are trying to do just that.   It won’t work.

Thankfully, most of us now agree that human slavery and racism, upon which the Confederate economy was based, are intrinsic evils.

However, if we now take the step of removing all Confederate memorials from view, we erase history itself – for wars require two sides, not just one.  A war with only one side is a joke and the Civil War, with its causes and carnage, was very far from a joking matter.

When removing monuments, we forget that in the 1860s what most of us now take for granted – equality of race and religion – was not a widely held view anywhere in the United States, even by men such as Lincoln.  The North and South were more alike in social mores than many people realize.  That subtlety is eradicated if we associate only one side of the Civil War with racism, and we also trivialize the unfinished journey, started in that war, to recognizing that all people really are created equal.

When removing monuments, we forget that the motives of men fighting for the Confederacy were more nuanced than a defense of racism and slavery. Most Southern soldiers were poor and slaveless, but also refused to countenance the interference of unctuous outsiders in their lives.  Men like Lee were often ambivalent about slavery, but felt compelled to defend their homes, right or wrong, much as we would come to the defense of our own country, though it has often acted in ways far removed from the path of virtue.

Some on the right and left view Confederate statues as simple memorials to racism, but I see a more involved reality.

I see people who fought – as Ulysses Grant put it – “honorably, but in one of the worst causes for which men ever fought.”  I see people acting honorably, given the social mores of their times, even as we now see those mores as clearly wrong.

I see the reminders of history, the painful start of a long and unfinished journey, and people who, while of a different time, were little different from us now.  That could have been us.

I see that as we now act imperfectly, for instance treating the Earth like a gigantic toilet, we should hope our descendants hold a more lenient view of us, than many of us hold of those who fought on both sides of our Civil War 150 years ago.

Confusion and Cowardice

Hi. Just when you thought it was going to be a quiet summer, and you could drop by for a little tech-talk, we get Charlottesville.

Sorry – I can’t let this pass.  I sometimes rant privately and then don’t post, but not this time.  If you’re not in a rant-receptive mood, no worries.  See you next time.


While the events in Charlottesville were a problem,  the reaction to those events is even more of a problem.

It’s doubtful whether the southern aristocrat Robert E. Lee could have related to those protesting the removal of his Charlottesville statue  – white nationalists and racists for whom Lee’s core concept of honor is entirely absent.

In an ironic twist, at the end of his life Lee felt his military training had been a mistake, so Lee himself might have supported the statue’s removal.

As Lee himself recognized, he made serious mistakes. But he was certainly no coward.

Unfortunately, the same cannot be said for our President and many of our nation’s CEOs, who failed to speak out against racist violence, electing instead to comment only on violence itself.   Presumably these individuals, after responding to the Charlottesville violence with powder-puff tweets of mild indignation, hid beneath their aircraft-carrier-sized office desks, worrying that someone might be angry with what little they had said. Well, everyone has their worries: CEOs can worry that Trump might get mad at them and issue one of his many crap-tweets; Trump can worry that white racists might get mad and vote for someone even more misanthropic than he is.

No wonder people hold our leadership in contempt.  For little is more contemptible than racism, and there would have been few better opportunities to pick up votes, pick up customers, pick up stock prices, and pick up morale than to have stood up, been counted, and shown a minuscule amount of backbone in this situation.

Equivocation about Charlottesville isn’t just cowardice – it’s unintelligent.  It is a basic misunderstanding of history, and a basic misreading of what citizens really want. People do not want to avoid being irritated with their leaders, they want to admire them.

Lee, Grant, and Lincoln – leaders in a presumably less-enlightened age – understood that, and must now be turning over in their respective graves.

Thankfully, there are a few exceptions in the New Age of Equivocation: Merck CEO Kenneth Frazier and two other CEOs removed themselves from Trump’s advisory circle after our Coward-in-Chief failed to condemn white racism over the weekend.  Regrettably, in the vacuum these individuals have created, less upright individuals will very likely step in.

If it seems ironic that equivocation is now the order of the day, in an age when our President and his cronies are well-known for rudeness and crudeness, realize that rudeness and equivocation are two sides of the same currency of cowardice. As Eric Hoffer put it:  “Rudeness is the weak man’s imitation of strength.”

Let’s stop equivocating, and call equivocation about injustice – whenever and wherever it occurs – what it really is: cowardice.

The Draft Climate Science Report

The multi-agency US government report on climate change is now available in draft form, after it was apparently leaked to mainstream news organizations including the New York Times.

The Draft Report is a virtual clinic on how to explain a complex topic. It’s not perfect, and occasionally downshifts into government-report formatting and prose, but it is very good.

It’s also rational and prudent, and therefore convincing in its conclusions, which are labeled as to their certainty.  The Report is a credit to the participating authors and agencies.

Which is not to say anyone should take the report conclusions at face value. I offer a few comments on that below.

One of our greatest problems with climate change is a dismal level of public dialogue.  Sure, it’s a complicated topic, but we have allowed public debate to be dominated by alarmists and denialists, both of whom tend to be uninformed.

To arrive at sensible dialogue people actually need to understand what the issues are!  The executive summary on pages 12-37 of the Report is a good place to start, and requires only about 30 minutes to read.

Ironically, I felt that even the Report’s executive summary jumps into the middle of things, without framing the key questions in summary form.  Those questions (with the Report’s conclusions and indicators of likelihood) are:

  • Q: Are temperatures rising in the Earth’s biosphere (land, sea, atmosphere)?
    • A: To a very high certainty, they’re going up, particularly in the last century and last forty years.
  • Q: Then, what are the likely consequences of increased temperatures?
    • A: Increased sea levels (very likely); other climate changes (ranging from possible to likely); social consequences (also possible to likely).
  • Q: Then, are temperature increases the result of external energy (i.e. the Sun and volcanoes), or changes within the biosphere?
    • A: Internal energy transfer within the biosphere.  Increased temperatures cannot be blamed on solar fluctuations (likely).
  • Q: Then, what are the primary causes of temperature increases within the biosphere?
    • A: The primary culprit is increasing levels of atmospheric carbon dioxide (likely) – energy that would be returned to space is absorbed by carbon dioxide and partially re-emitted into the biosphere.
  • Q: Are these biosphere causes man-made or natural?
    • A: Primarily man-made (extremely likely).


My pitch is simply: check it out – you’ll be glad you did.  And don’t feel intimidated! You’ll be surprised at what you pick up.   Reading the Report and accepting its value is not the same as accepting its dictates wholesale.  Here are a few of my takeaways:


It’s a challenge to evaluate climate change information independently, and news outlets like the New York Times shriek that humanity has, with extreme likelihood, caused climate change, and that sea levels will rise.

Ever unctuous, is the Times. But they are just aping the Report, on the basis that many smart people contributed to its conclusions.

It’s true: many smart people have contributed to the Report’s conclusions, but this is also a very bad reason to accept a scientific conclusion.  As for the argument that climate change is too complicated to understand, and we therefore should accept its findings on faith, I say a) nonsense, and b) that’s the fault of the scientific community.  As the saying goes, those who cannot explain do not truly understand.

As for the word “extreme,” I think that’s a mistake on two levels.  First, it’s inflammatory. Second, the basis for assigning a very high likelihood to man’s role in global warming is essentially that there are multiple “lines of evidence,” which point to this assertion.  But that idea only holds if the lines of evidence are really independent, which is questionable, and if no single argument, should it be proven false, might call everything else into question – that’s also suspect.


Is action prudent? I think the short answer is hell yes, even if the climate change community is sometimes over-optimistic with their uncertainty estimates, as I believe they might be.

We should stop asking whether these conclusions are entirely right, and instead consider the chances they are entirely wrong – for those are really the only conditions in which we can justify inaction.   And there is simply too much solid science for climate science to be largely invalid.

If there were even a 25 percent chance that a meteor would hit my house today, you can bet I’d be taking evasive action. That would be, well, the conservative thing to do.  Unfortunately, conservatives don’t act like conservatives anymore.   The perception of climate change consequences is somehow different from other potential catastrophes, because it’s not happening instantly, and is not a well-defined event like blowing up my house.  But that does not mean there is no potential calamity.


There are serious thinkers – not uninformed denialists – who suggest that some aspects of climate change thinking are flawed. These people have sometimes been shouted down, which is both an embarrassment to the scientific community, and terrible science. Challenging consensus views is how science progresses – and it’s a win regardless of the outcome.

If the challengers fall short, we’re even more sure of being on the right track. If the challengers find a serious flaw, that could be good or bad news – it might suggest we have less leverage over climate change than we think.  While it’s probable the consensus view is largely valid, it’s always worth the trouble to see if somehow, we’ve missed something.


The origins and consequences of climate change, as the Report’s authors remark, are not entirely certain, but there is  more than enough certainty to warrant action, and definitely enough to warrant informed dialogue.


If you haven’t checked it out already, I think the Report is a great place to start, with summaries and individual chapter introductions that are surprisingly accessible.