A coworker wrote to me yesterday, wondering about the process for enabling a credit freeze with Equifax. He pointed out that those requesting a credit freeze must enter their date of birth and social security number online, and Equifax has demonstrated that it cannot keep this information secure!
If the notion of sending your personally identifying information (PII) to a place like Equifax makes you uneasy, you’re right to be nervous. Unfortunately, the problem is more serious than a handful of vendors who verifiably cannot protect our identifying information.
There are several misunderstandings about PII that, until they are addressed, ensure that each of us is at risk for identity theft, or worse.
The first misunderstanding is that PII can definitely be protected. Organizations often brag about their data security, and on more than one occasion some of those same organizations have sent me, via unsecured email, data sets simply oozing with personal information. Presumably, they felt I could be trusted to protect what I received. But as for everyone in between who might have read those emails, who can say?
It’s a start, as security experts recommend, to store PII only in an encrypted format that has no use outside the system. That (almost) ensures that no one can simply read our social security numbers from a database after breaking in. But as my coworker pointed out, risk remains: for an external person to confirm their identity, they must submit PII, and for at least some period of time our PII is unencrypted, and vulnerable.
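The idea of storing PII "in an encrypted format that has no outside use" can be sketched with a one-way, salted hash: the raw number is never stored, and the submitted value exists in memory only during the comparison. This is a minimal illustration, not a production authentication scheme; the example SSN is obviously fictitious.

```python
import hashlib
import hmac
import os

def hash_ssn(ssn: str, salt: bytes) -> bytes:
    """Derive a one-way digest of an SSN; the raw value is never stored."""
    return hashlib.pbkdf2_hmac("sha256", ssn.encode(), salt, 100_000)

# At enrollment: keep only (salt, digest), never the SSN itself.
salt = os.urandom(16)
stored_digest = hash_ssn("123-45-6789", salt)

def verify(submitted: str) -> bool:
    """At authentication: hash the submitted SSN and compare in constant time.
    The plaintext is unencrypted only for the duration of this call."""
    return hmac.compare_digest(hash_ssn(submitted, salt), stored_digest)

print(verify("123-45-6789"))  # True
print(verify("999-99-9999"))  # False
```

Even here, the window my coworker worried about is visible: the submitted value must exist in readable form, however briefly, before it is hashed.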
As for other data security: this is only as secure as the least secure access to that data, and we should admit that this means not very secure. One slip – like the well-intentioned people who sent me PII in an email – and the game is up.
I’ve learned from security pros, and my own experience: the First Law of Security is that nothing is truly secure. When thinking about security, we should never start a sentence with “An attacker could never…,” because they almost certainly will, if it’s worth their trouble.
The next misunderstanding is that PII is well-defined – that if we encrypt unique identifiers (like social security numbers) so no outsider can use them, we’ll be in pretty good shape.
Regrettably, this is not the case. Personal attributes like age, gender, zip code, and income bracket, taken in combination, may serve almost as well as a social security number. We cannot usually encrypt these attributes, as they’re useful for presentation and analysis. If a bad actor finds information suggesting that we’re worth the trouble of an attack, a combination of human-readable personal attributes may very well be “good enough” – for being in a very small group is little different from being uniquely identified. Consider what is now available in online public records, and remember the First Law…
If in a data set of personal information, there is any combination of unencrypted attributes that generates a small group of records, that’s effectively PII – it means individuals are at risk, for we cannot know what information outside of our system might be used to resolve our identity completely.
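The test above – does any combination of unencrypted attributes isolate a small group? – can be checked mechanically. Here is a minimal sketch over toy records, with illustrative attribute names and a threshold of k = 2; real tools for this (k-anonymity checkers) are considerably more sophisticated.

```python
from collections import Counter
from itertools import combinations

# Toy records: no social security numbers, only human-readable attributes.
records = [
    {"age": 34, "gender": "F", "zip": "02139", "income": "50-75k"},
    {"age": 34, "gender": "F", "zip": "02139", "income": "75-100k"},
    {"age": 51, "gender": "M", "zip": "02139", "income": "50-75k"},
    {"age": 34, "gender": "M", "zip": "02144", "income": "50-75k"},
]

QUASI_IDENTIFIERS = ["age", "gender", "zip", "income"]

def risky_combinations(rows, attrs, k=2):
    """Return attribute combinations under which some group of records
    has fewer than k members -- 'effective PII' in the sense above."""
    risky = []
    for r in range(1, len(attrs) + 1):
        for combo in combinations(attrs, r):
            counts = Counter(tuple(row[a] for a in combo) for row in rows)
            if any(n < k for n in counts.values()):
                risky.append(combo)
    return risky

for combo in risky_combinations(records, QUASI_IDENTIFIERS):
    print(combo)
```

In this toy data even age alone isolates one record – which is exactly the point: nothing here looks like PII, yet the combinations behave like it.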
Perhaps the most crucial misunderstanding about PII is our presumption that PII is useful for anything other than confirming identity – i.e. authentication. When it comes to analytics and business intelligence, PII should really stand for “Probably Is Irrelevant.”
What analyst needs to know someone’s date of birth? There are many things that correlate with age, but few that correlate with whether we’re a Capricorn or a Sagittarius. And social security numbers? Anything this attribute can tell us, there are better ways to learn it.
Authentication using PII is a process that can be made nearly secure, using encrypted information wherever possible. But once human-readable attributes – either singly or in combination – can come close to identifying us, we should know there is a security problem waiting to happen. Translation: liability! No organization has yet, to my knowledge, been forced into bankruptcy by liability from a PII breach, but that time may not be far off. If we take the First Law seriously, this corollary also applies: for planning purposes, all potential data breaches should be regarded as actual breaches.
In information delivery and modeling, we are usually oblivious to the potential cost and risk associated with our input data, but these are considerations that should become part of our world view. The inadvertent delivery of “effective” PII can be trapped. In modeling, we frequently use data like dates of birth that are far more precise than what is necessary to build a suitable model. Even if the model results do not expose PII, the presence of input data sets in our organization holding actual or effective PII presents a risk. If data is present, someone will find it, and use it – potentially in a fashion that we won’t like.
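The "more precise than necessary" point can be made concrete: a model that needs age rarely needs a birth date. A minimal sketch of coarsening a date of birth into a decade-wide band before it ever reaches a modeling data set (function name and bucket width are my own choices):

```python
from datetime import date

def age_bucket(dob: date, today: date, width: int = 10) -> str:
    """Replace an exact date of birth with a coarse age band --
    often all the precision a model actually needs."""
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

print(age_bucket(date(1985, 6, 15), today=date(2024, 3, 1)))  # 30-39
```

If only the band is ever stored, the precise birth date never sits in our systems waiting to be found and misused.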
The best way to protect PII is not to use it at all – to deliver models and visualizations that never use actual or effective PII in the first place.
Does this complicate modeling and information delivery? Sure. For quantitative models, it means an optimal model must meet its predictive requirements while limiting the precision of the data it employs. At a minimum this means an embedded optimization, and you know what’s coming next: sometimes, we won’t be able to meet both the requirement and the constraints. C’est la guerre.
About five years ago I predicted that within five years many of us analysts would find ourselves engaged in security-related work. That prediction has not yet come to pass, but I still think it will, and before too long. The best protection against breaches of valuable data is not to encrypt it, or to protect it, or to otherwise make it difficult for a hacker to get to it – history tells us that those methods ultimately fail. These methods are valuable, and they do slow attackers, but ultimately they require a strategy of perfect defense, against attackers whose weapons are always improving. At some point, the defense will be scored upon.
The best way to protect against a data breach is to limit the use of data we don’t need, or even to insist that some sensitive data are off limits. Analytics practice, which has had rather little to say about limiting data usage in the past, should have a great deal to say about limiting data usage in the future.